Togaware DATA MINING
Desktop Survival Guide
by Graham Williams

Video Marketing: Transactions From File

A simple example from e-commerce is that of an on-line retailer of DVDs, maintaining a database of all purchases made by each customer. (They will also, of course, have web log data about what the customers browsed.) The retailer might be interested to know what DVDs appear regularly together and to then use this information to make recommendations to other customers.

The input data consists of "transactions" like the following. Each line records the purchase history of one customer, with purchases separated by commas (i.e., CSV format, as discussed in Section [*]):



Sixth Sense,LOTR1,Harry Potter1,Green Mile,LOTR2
Gladiator,Patriot,Braveheart
LOTR1,LOTR2
Gladiator,Patriot,Sixth Sense
Gladiator,Patriot,Sixth Sense
Gladiator,Patriot,Sixth Sense
Harry Potter1,Harry Potter2
Gladiator,Patriot
Gladiator,Patriot,Sixth Sense
Sixth Sense,LOTR,Galdiator,Green Mile
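
To reproduce the example, the lines above can be written to a file from within R. The following is a minimal sketch using only base R. The file name DVD.csv matches the one used in the text, the variable names are illustrative, and the item count at the end is a plain cross-check rather than anything arules does.

```r
# The ten example baskets, copied verbatim from the listing above.
baskets <- c(
  "Sixth Sense,LOTR1,Harry Potter1,Green Mile,LOTR2",
  "Gladiator,Patriot,Braveheart",
  "LOTR1,LOTR2",
  "Gladiator,Patriot,Sixth Sense",
  "Gladiator,Patriot,Sixth Sense",
  "Gladiator,Patriot,Sixth Sense",
  "Harry Potter1,Harry Potter2",
  "Gladiator,Patriot",
  "Gladiator,Patriot,Sixth Sense",
  "Sixth Sense,LOTR,Galdiator,Green Mile"
)

# Write one basket per line to the working directory.
writeLines(baskets, "DVD.csv")

# Read the file back and split each line into its items.
items <- strsplit(readLines("DVD.csv"), ",", fixed = TRUE)
n.transactions <- length(items)            # number of baskets
n.items <- length(unique(unlist(items)))   # number of distinct titles
```

The spellings are kept exactly as listed, including LOTR and Galdiator in the last basket, because every distinct spelling is counted as a separate item.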

This data might be stored in the file DVD.csv which can be directly loaded into R using the read.transactions function of the arules package:



> library(arules)
> dvd.transactions <- read.transactions("DVD.csv", sep=",")
> dvd.transactions

transactions in sparse format with
 10 transactions (rows) and
 11 items (columns)

This tells us that, in total, 11 distinct items appear across the 10 baskets. The read.transactions function can also read data from a file that has a transaction ID and a single item on each line (using the format="single" option).

For example, if the data consists of:

1,Sixth Sense
1,LOTR1
1,Harry Potter1
1,Green Mile
1,LOTR2
2,Gladiator
2,Patriot
2,Braveheart
3,LOTR1
3,LOTR2
4,Gladiator
4,Patriot
4,Sixth Sense
5,Gladiator
5,Patriot
5,Sixth Sense
6,Gladiator
6,Patriot
6,Sixth Sense
7,Harry Potter1
7,Harry Potter2
8,Gladiator
8,Patriot
9,Gladiator
9,Patriot
9,Sixth Sense
10,Sixth Sense
10,LOTR
10,Galdiator
10,Green Mile

we read the data with:

> dvd.transactions <- read.transactions("DVD.csv", format="single", 
                                        sep=",", cols=c(1,2))
> dvd.transactions

transactions in sparse format with
 10 transactions (rows) and
 11 items (columns)
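
The two file layouts carry the same information. As a rough illustration, the basket lines can be converted to ID,item pairs with base R. The output file name DVD-single.csv and the column names id and item are hypothetical, chosen only for this sketch.

```r
# The ten example baskets, split into character vectors of items.
baskets <- strsplit(c(
  "Sixth Sense,LOTR1,Harry Potter1,Green Mile,LOTR2",
  "Gladiator,Patriot,Braveheart",
  "LOTR1,LOTR2",
  "Gladiator,Patriot,Sixth Sense",
  "Gladiator,Patriot,Sixth Sense",
  "Gladiator,Patriot,Sixth Sense",
  "Harry Potter1,Harry Potter2",
  "Gladiator,Patriot",
  "Gladiator,Patriot,Sixth Sense",
  "Sixth Sense,LOTR,Galdiator,Green Mile"
), ",", fixed = TRUE)

# Repeat each transaction ID once per item in its basket.
ids <- rep(seq_along(baskets), times = lengths(baskets))
single <- data.frame(id = ids, item = unlist(baskets),
                     stringsAsFactors = FALSE)

# Write "ID,item" lines without header, row names or quotes.
write.table(single, "DVD-single.csv", sep = ",",
            row.names = FALSE, col.names = FALSE, quote = FALSE)
```

A file written this way has one line per purchase and can be read with format="single" and cols=c(1,2) as shown above.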

A summary of the dataset is obtained in the usual way:



> summary(dvd.transactions)

transactions as itemMatrix in sparse format with
 10 rows (elements/itemsets/transactions) and
 11 columns (items)

most frequent items:
    Gladiator       Patriot   Sixth Sense    Green Mile
            6             6             6             2
Harry Potter1       (Other) 
            2             8 

element (itemset/transaction) length distribution:
2 3 4 5 
3 5 1 1 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.00    2.25    3.00    3.00    3.00    5.00 

includes extended transaction information - examples:
  transactionIDs
1              1
2              2
3              3

The dataset is identified as a sparse matrix consisting of 10 rows (transactions in this case) and 11 columns (items). The number of columns is the total number of distinct items in the dataset, which is represented internally as a binary matrix with one column for each item. A distribution across the most frequent items (Gladiator appears in 6 "baskets") is followed by a distribution over the length of each transaction (one transaction has 5 items in the "basket"). The final extended transaction information can be ignored in this simple example, but is explained for the more complex example that follows.
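
The figures in the summary can be cross-checked by building the binary matrix by hand. This sketch uses a dense logical matrix in base R purely for illustration; arules itself stores the data in sparse form, and the names incidence, item.freq and basket.len are assumptions of the sketch.

```r
# The ten example baskets, split into character vectors of items.
baskets <- strsplit(c(
  "Sixth Sense,LOTR1,Harry Potter1,Green Mile,LOTR2",
  "Gladiator,Patriot,Braveheart",
  "LOTR1,LOTR2",
  "Gladiator,Patriot,Sixth Sense",
  "Gladiator,Patriot,Sixth Sense",
  "Gladiator,Patriot,Sixth Sense",
  "Harry Potter1,Harry Potter2",
  "Gladiator,Patriot",
  "Gladiator,Patriot,Sixth Sense",
  "Sixth Sense,LOTR,Galdiator,Green Mile"
), ",", fixed = TRUE)

all.items <- sort(unique(unlist(baskets)))

# One row per transaction, one column per item;
# TRUE where the item occurs in the basket.
incidence <- t(sapply(baskets, function(b) all.items %in% b))
colnames(incidence) <- all.items

# Item frequencies (column sums), most frequent first.
item.freq <- sort(colSums(incidence), decreasing = TRUE)

# Distribution of basket lengths (row sums).
basket.len <- table(rowSums(incidence))
```

Here dim(incidence) gives the 10 rows and 11 columns reported by summary, item.freq reproduces the counts for the most frequent items, and basket.len reproduces the length distribution.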

Association rules can now be built from the dataset:



> dvd.apriori <- apriori(dvd.transactions)

parameter specification:
 confidence minval smax arem  aval originalSupport support minlen 
        0.8    0.1    1 none FALSE            TRUE     0.1      1 
maxlen target   ext
     5  rules FALSE

algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

apriori - find association rules with the apriori algorithm
version 4.21 (2004.05.09)        (c) 1996-2004   Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[11 item(s), 10 transaction(s)] done [0.00s].
sorting and recoding items ... [7 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [7 rule(s)] done [0.00s].
creating S4 object  ... done [0.01s].

The output here begins with a summary of the parameters chosen for the algorithm. The default values of confidence (0.8) and support (0.1) are noted, in addition to the minimum and maximum number of items in an itemset (minlen=1 and maxlen=5). The default target is rules, but you could instead target itemsets or hyperedges. These can all be set in the call to apriori through the parameter argument, which takes a list of named values.
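
As a sketch of how these settings might be passed explicitly, the call below spells out the defaults noted above and then asks for itemsets instead of rules. It builds the transactions from an in-memory list so that it does not depend on a file. The target string "frequent itemsets" is the spelling assumed here for the arules package and may differ in the version used for the output above; the choices minlen=2 and support=0.2 are arbitrary illustrations.

```r
library(arules)

# The ten example baskets as a list of character vectors.
baskets <- strsplit(c(
  "Sixth Sense,LOTR1,Harry Potter1,Green Mile,LOTR2",
  "Gladiator,Patriot,Braveheart",
  "LOTR1,LOTR2",
  "Gladiator,Patriot,Sixth Sense",
  "Gladiator,Patriot,Sixth Sense",
  "Gladiator,Patriot,Sixth Sense",
  "Harry Potter1,Harry Potter2",
  "Gladiator,Patriot",
  "Gladiator,Patriot,Sixth Sense",
  "Sixth Sense,LOTR,Galdiator,Green Mile"
), ",", fixed = TRUE)
dvd.transactions <- as(baskets, "transactions")

# Rules with explicit thresholds; minlen = 2 asks for rules
# involving at least two items in total.
rules <- apriori(dvd.transactions,
                 parameter = list(support = 0.1, confidence = 0.8,
                                  minlen = 2, maxlen = 5,
                                  target = "rules"))

# Frequent itemsets instead of rules (confidence is not used here).
itemsets <- apriori(dvd.transactions,
                    parameter = list(support = 0.2,
                                     target = "frequent itemsets"))
```

The number of rules or itemsets reported may differ between versions of arules, so it is not stated here.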

We view the actual results of the modelling with the inspect function:



> inspect(dvd.apriori)

  lhs              rhs           support confidence     lift
1 {LOTR1}       => {LOTR2}           0.2          1 5.000000
2 {LOTR2}       => {LOTR1}           0.2          1 5.000000
3 {Green Mile}  => {Sixth Sense}     0.2          1 1.666667
4 {Gladiator}   => {Patriot}         0.6          1 1.666667
5 {Patriot}     => {Gladiator}       0.6          1 1.666667
6 {Sixth Sense,                                             
    Gladiator}  => {Patriot}         0.4          1 1.666667
7 {Sixth Sense,                                             
    Patriot}    => {Gladiator}       0.4          1 1.666667

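In this output the rules appear in order of decreasing lift. To request that ordering explicitly, sort the rules before inspecting them, e.g. inspect(sort(dvd.apriori, by="lift")).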
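
The three measures in the table can be verified by hand using the standard definitions: support is the fraction of transactions containing all items of the rule, confidence is that support divided by the support of the left-hand side, and lift is the confidence divided by the support of the right-hand side. The sketch below uses base R only; the helper names support and rule.measures are hypothetical and are not part of arules.

```r
# The ten example baskets, split into character vectors of items.
baskets <- strsplit(c(
  "Sixth Sense,LOTR1,Harry Potter1,Green Mile,LOTR2",
  "Gladiator,Patriot,Braveheart",
  "LOTR1,LOTR2",
  "Gladiator,Patriot,Sixth Sense",
  "Gladiator,Patriot,Sixth Sense",
  "Gladiator,Patriot,Sixth Sense",
  "Harry Potter1,Harry Potter2",
  "Gladiator,Patriot",
  "Gladiator,Patriot,Sixth Sense",
  "Sixth Sense,LOTR,Galdiator,Green Mile"
), ",", fixed = TRUE)

# Fraction of baskets that contain every item in `items`.
support <- function(items) {
  mean(sapply(baskets, function(b) all(items %in% b)))
}

# Support, confidence and lift of the rule lhs => rhs.
rule.measures <- function(lhs, rhs) {
  s <- support(c(lhs, rhs))
  conf <- s / support(lhs)
  c(support = s, confidence = conf, lift = conf / support(rhs))
}

r1 <- rule.measures("LOTR1", "LOTR2")                          # rule 1
r6 <- rule.measures(c("Sixth Sense", "Gladiator"), "Patriot")  # rule 6
```

Printing r1 and r6 gives the same support, confidence and lift as rules 1 and 6 in the table. The lift of 5 for rule 1 arises because LOTR2 appears in only 2 of the 10 baskets, yet in every basket that contains LOTR1.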

We can change the parameters to obtain other association rules. For example, we might reduce the minimum support, which delivers many more rules (81 rules):

> dvd.apriori <- apriori(dvd.transactions, par=list(supp=0.01))

Or else we might maintain support but reduce confidence (20 rules):

> dvd.apriori <- apriori(dvd.transactions, par=list(conf=0.1))

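To see how the thresholds affect the size of the result, the number of rules can be counted for several confidence levels. This is a sketch under the assumption that the transactions are built from an in-memory list rather than read from DVD.csv; the confidence levels are arbitrary, and the counts obtained may not match the figures quoted above, since they depend on the version of arules.

```r
library(arules)

# The ten example baskets as a list of character vectors.
baskets <- strsplit(c(
  "Sixth Sense,LOTR1,Harry Potter1,Green Mile,LOTR2",
  "Gladiator,Patriot,Braveheart",
  "LOTR1,LOTR2",
  "Gladiator,Patriot,Sixth Sense",
  "Gladiator,Patriot,Sixth Sense",
  "Gladiator,Patriot,Sixth Sense",
  "Harry Potter1,Harry Potter2",
  "Gladiator,Patriot",
  "Gladiator,Patriot,Sixth Sense",
  "Sixth Sense,LOTR,Galdiator,Green Mile"
), ",", fixed = TRUE)
dvd.transactions <- as(baskets, "transactions")

# Count the rules found at each minimum confidence,
# keeping the minimum support fixed at 0.1.
conf.levels <- c(0.8, 0.5, 0.1)
n.rules <- sapply(conf.levels, function(conf) {
  rules <- apriori(dvd.transactions,
                   parameter = list(support = 0.1, confidence = conf),
                   control = list(verbose = FALSE))
  length(rules)
})
names(n.rules) <- conf.levels
```

Lowering the minimum confidence can only add rules, never remove them, so the counts in n.rules do not decrease from left to right.
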
Copyright © 2004-2006 Graham.Williams@togaware.com