Data Mining Survivor: Associate

DATA MINING
Desktop Survival Guide
by Graham Williams

Basket Analysis

The simplest association analysis is often referred to as market basket analysis. Within Rattle this is enabled when the Baskets button is checked. In this case, the data is thought of as representing shopping baskets (or any other type of collection of items, such as a basket of medical tests, a basket of medicines prescribed to a patient, a basket of stocks held by an investor, and so on). Each basket has a unique identifier, and the variable specified as an Ident variable in the Variables tab is taken as the identifier of a shopping basket. The contents of the basket are then the items contained in the column of data identified as the target variable. For market basket analysis, these are the only two variables used.

To illustrate market basket analysis with Rattle, we will use a very simple dataset consisting of the DVD movies purchased by customers. Suppose the data is stored in the file dvdtrans.csv and consists of the following:

ID,Item
1,Sixth Sense
1,LOTR1
1,Harry Potter1
1,Green Mile
1,LOTR2
2,Gladiator
2,Patriot
2,Braveheart
3,LOTR1
3,LOTR2
4,Gladiator
4,Patriot
4,Sixth Sense
5,Gladiator
5,Patriot
5,Sixth Sense
6,Gladiator
6,Patriot
6,Sixth Sense
7,Harry Potter1
7,Harry Potter2
8,Gladiator
8,Patriot
9,Gladiator
9,Patriot
9,Sixth Sense
10,Sixth Sense
10,LOTR
10,Galdiator
10,Green Mile
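As a rough illustration of this two-column layout, the following Python sketch groups the rows by ID so that each basket becomes a set of items. It is not part of Rattle: the function name read_baskets and the inlined DVD_CSV string are assumptions made for the example, and the rows are copied exactly as listed above.

```python
import csv
import io
from collections import defaultdict

# The transactions from the listing, inlined so the sketch runs without
# needing dvdtrans.csv on disk. Item names are kept exactly as listed.
DVD_CSV = """\
ID,Item
1,Sixth Sense
1,LOTR1
1,Harry Potter1
1,Green Mile
1,LOTR2
2,Gladiator
2,Patriot
2,Braveheart
3,LOTR1
3,LOTR2
4,Gladiator
4,Patriot
4,Sixth Sense
5,Gladiator
5,Patriot
5,Sixth Sense
6,Gladiator
6,Patriot
6,Sixth Sense
7,Harry Potter1
7,Harry Potter2
8,Gladiator
8,Patriot
9,Gladiator
9,Patriot
9,Sixth Sense
10,Sixth Sense
10,LOTR
10,Galdiator
10,Green Mile
"""


def read_baskets(handle):
    """Group (ID, Item) rows into one set of items per basket ID."""
    baskets = defaultdict(set)
    for row in csv.DictReader(handle):
        baskets[row["ID"]].add(row["Item"])
    return dict(baskets)


baskets = read_baskets(io.StringIO(DVD_CSV))
print(len(baskets))          # number of distinct basket identifiers
print(sorted(baskets["2"]))  # items purchased in basket 2
```

To read the file itself, the same function could be called as read_baskets(open("dvdtrans.csv", newline="")), assuming the file sits in the working directory.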

We load this data into Rattle and choose the appropriate variable roles. In this case it is quite simple:

[Figure: rattle-dvd-variables]

On the Associate tab (of the Unsupervised paradigm) ensure the Baskets check button is checked. Click the Execute button to identify the associations:

[Figure: rattle-dvd-associate-top]

Here we see a summary of the associations found. There were 38 association rules that met the criteria of having a minimum support of 0.1 and a minimum confidence of 0.1. Of these, 9 were of length 1 (i.e., a single item that has occurred frequently enough in the data), 20 were of length 2 and another 9 of length 3. Across the rules the support ranges from 0.11 up to 0.56. Confidence ranges from 0.11 up to 1.0, and lift from 0.9 up to 9.0.
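The three measures quoted in the summary have standard definitions, which can be sketched in a few lines of Python. Support is the fraction of baskets containing all the items of a rule, confidence is the support of the whole rule divided by the support of its left-hand side, and lift is the confidence divided by the support of the right-hand side. The function names and the BASKETS dictionary below are assumptions for illustration (the baskets are copied from the listing as printed); this is not the code Rattle runs.

```python
# Baskets keyed by ID, with item names copied exactly as listed in the text.
BASKETS = {
    1: {"Sixth Sense", "LOTR1", "Harry Potter1", "Green Mile", "LOTR2"},
    2: {"Gladiator", "Patriot", "Braveheart"},
    3: {"LOTR1", "LOTR2"},
    4: {"Gladiator", "Patriot", "Sixth Sense"},
    5: {"Gladiator", "Patriot", "Sixth Sense"},
    6: {"Gladiator", "Patriot", "Sixth Sense"},
    7: {"Harry Potter1", "Harry Potter2"},
    8: {"Gladiator", "Patriot"},
    9: {"Gladiator", "Patriot", "Sixth Sense"},
    10: {"Sixth Sense", "LOTR", "Galdiator", "Green Mile"},
}


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in baskets.values() if itemset <= items)
    return hits / len(baskets)


def confidence(lhs, rhs, baskets):
    """Of the baskets containing lhs, the fraction that also contain rhs."""
    return support(set(lhs) | set(rhs), baskets) / support(lhs, baskets)


def lift(lhs, rhs, baskets):
    """Confidence of lhs => rhs relative to how common rhs is overall."""
    return confidence(lhs, rhs, baskets) / support(rhs, baskets)


lhs, rhs = {"Gladiator"}, {"Patriot"}
print(support(lhs | rhs, BASKETS))
print(confidence(lhs, rhs, BASKETS))
print(lift(lhs, rhs, BASKETS))
```

In the listing as printed, every basket that contains Gladiator also contains Patriot, so the rule Gladiator => Patriot has confidence 1.0. Its lift is above 1, meaning Patriot turns up more often alongside Gladiator than it does across all baskets.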

The lower part of the same textview contains information about the running of the algorithm:

[Figure: rattle-dvd-associate-bot]

We can see the parameter settings used, noting that Rattle exposes only a subset of them (support and confidence). The output also includes timing information for the various phases of the algorithm. For such a small dataset, the times are of course essentially 0!
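To see how the two exposed settings act as filters, here is a deliberately naive Python sketch that enumerates candidate itemsets by brute force and keeps only the rules meeting a minimum support and a minimum confidence. It is not the algorithm Rattle runs and makes no attempt to reproduce the rule counts reported above. The function name mine_rules, the max_len cap, and the restriction to single-item right-hand sides (with an empty left-hand side allowed for length-1 rules) are all assumptions of the sketch.

```python
from itertools import combinations

# One set of items per basket, copied exactly as listed in the text.
BASKETS = [
    {"Sixth Sense", "LOTR1", "Harry Potter1", "Green Mile", "LOTR2"},
    {"Gladiator", "Patriot", "Braveheart"},
    {"LOTR1", "LOTR2"},
    {"Gladiator", "Patriot", "Sixth Sense"},
    {"Gladiator", "Patriot", "Sixth Sense"},
    {"Gladiator", "Patriot", "Sixth Sense"},
    {"Harry Potter1", "Harry Potter2"},
    {"Gladiator", "Patriot"},
    {"Gladiator", "Patriot", "Sixth Sense"},
    {"Sixth Sense", "LOTR", "Galdiator", "Green Mile"},
]


def support(itemset, baskets):
    """Fraction of baskets containing every item (1.0 for the empty set)."""
    return sum(1 for items in baskets if itemset <= items) / len(baskets)


def mine_rules(baskets, min_support=0.1, min_confidence=0.1, max_len=3):
    """Return (lhs, rhs, support, confidence, lift) for rules passing both thresholds."""
    all_items = sorted(set().union(*baskets))
    rules = []
    for size in range(1, max_len + 1):
        for combo in combinations(all_items, size):
            itemset = frozenset(combo)
            supp = support(itemset, baskets)
            # First filter: drop itemsets that are too rare (or never occur).
            if supp == 0 or supp < min_support:
                continue
            # Try each item in turn as the single-item right-hand side.
            for item in combo:
                rhs = frozenset([item])
                lhs = itemset - rhs
                conf = supp / support(lhs, baskets)
                # Second filter: drop rules that hold too rarely.
                if conf >= min_confidence:
                    rules.append((lhs, rhs, supp, conf, conf / support(rhs, baskets)))
    return rules


for lhs, rhs, supp, conf, lft in mine_rules(BASKETS, min_support=0.5, min_confidence=0.9):
    print(sorted(lhs), "=>", sorted(rhs), round(supp, 2), round(conf, 2), round(lft, 2))
```

With the baskets as listed, raising the thresholds to a support of 0.5 and a confidence of 0.9 leaves only the two rules linking Gladiator and Patriot. Lowering either threshold admits more rules, which is the trade-off the two settings control.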

Brought to you by Togaware.