DATA MINING
Desktop Survival Guide by Graham Williams
We will often find ourselves loading data from a CSV file, which R readily supports (See Section ). On first loading the data we generally want a quick summary, using R's summary function. It is here that we might notice that some numeric columns have become factors!
Consider the example of the cardiac dataset (See Section ).
> cardiac <- read.csv("cardiac.data", header=F)
> summary(cardiac)
[...]
      V10               V11           V12           V13           V14
 Min.   :-172.00   52     : 13   60     : 23   49     :  9   ?      :376
 1st Qu.:   3.75   36     : 10   ?      : 22   55     :  9   84     :  3
 Median :  40.00   42     :  9   61     : 16   59     :  9   -157   :  2
 Mean   :  33.68   10     :  8   56     : 14   62     :  9   -164   :  2
 3rd Qu.:  66.00   33     :  8   58     : 13   26     :  8   -93    :  2
 Max.   : 169.00   41     :  8   68     : 12   33     :  8   103    :  2
                   (Other):396   (Other):352   (Other):400   (Other): 65
[...]
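The same symptom can be spotted without reading through the whole summary by asking R for the class of each column. The following is only a minimal sketch: the data is a small made-up stand-in for cardiac.data (the values and the names csv_text, toy, col_classes and suspect are illustrative assumptions, not part of the original example). Note that from R 4.0.0 onward read.csv leaves such columns as character rather than factor by default, so stringsAsFactors=TRUE is passed explicitly to reproduce the behaviour described here.

```r
# Toy stand-in for cardiac.data (hypothetical values): "?" marks a missing entry.
csv_text <- "75,0,190,?\n56,1,165,84\n54,0,172,?\n55,0,175,-157"

# stringsAsFactors=TRUE reproduces the factor behaviour on any R version.
toy <- read.csv(text=csv_text, header=FALSE, stringsAsFactors=TRUE)

# One class per column; V4 is not numeric because of the "?" entries.
col_classes <- sapply(toy, class)
print(col_classes)

# Names of the columns that were not read as numbers.
suspect <- names(col_classes)[!col_classes %in% c("integer", "numeric")]
print(suspect)
```

Any column listed in suspect that we expected to be numeric deserves a closer look at its values.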
From our understanding of the data we expect these variables to be numeric. The telltale sign is the ? that appears among the values of V12 and V14. A little more exploration, showing the frequency of each value, will indicate that the apparently nominal variables have only a single non-numeric value, the ?. When we read the data from the CSV file we need to tell R that the ? is used to indicate missing values:
> cardiac <- read.csv("cardiac.data", header=F, na.strings="?")
> summary(cardiac)
[...]
      V11               V12               V13               V14
 Min.   :-177.00   Min.   :-170.00   Min.   :-135.00   Min.   :-179.00
 1st Qu.:  14.00   1st Qu.:  41.00   1st Qu.:  12.00   1st Qu.:-124.50
 Median :  41.00   Median :  56.00   Median :  40.00   Median : -50.50
 Mean   :  36.15   Mean   :  48.91   Mean   :  36.72   Mean   : -13.59
 3rd Qu.:  63.25   3rd Qu.:  65.00   3rd Qu.:  62.00   3rd Qu.: 117.25
 Max.   : 179.00   Max.   : 176.00   Max.   : 166.00   Max.   : 178.00
 NA's   :   8.00   NA's   :  22.00   NA's   :   1.00   NA's   : 376.00
[...]
That's looking better. Note that the NAs are reported and that V14 has 376 of them, in accord with the previous observation of 376 ?'s.
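The exploration mentioned above, and the check that the NA count agrees with the earlier count of ?'s, might be sketched as follows. This is an illustration under stated assumptions rather than the procedure used for the cardiac data: the values are made up, and the names csv_text, raw, clean, non_numeric, n_question and n_na are hypothetical. As before, stringsAsFactors=TRUE is passed explicitly because recent versions of R no longer create factors by default.

```r
# Toy stand-in for two affected columns (hypothetical values, not the cardiac data).
csv_text <- "84,60\n?,?\n-157,61\n?,56\n103,?\n?,58"

raw <- read.csv(text=csv_text, header=FALSE, stringsAsFactors=TRUE)

# Frequency of each value in V1, most common first.
freq <- sort(table(raw$V1), decreasing=TRUE)
print(freq)

# Values that cannot be converted to a number; as.numeric() gives NA for them.
vals <- levels(raw$V1)
non_numeric <- vals[is.na(suppressWarnings(as.numeric(vals)))]
print(non_numeric)

# Re-read, declaring "?" as the missing value marker.
clean <- read.csv(text=csv_text, header=FALSE, na.strings="?")

# The NA count should equal the earlier count of "?" entries.
n_question <- sum(raw$V1 == "?")
n_na <- sum(is.na(clean$V1))
stopifnot(n_question == n_na)
```

If non_numeric held anything other than the single ? value, declaring ? as the missing value marker alone would not be enough to make the column numeric.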
Copyright © 2004-2006 Graham.Williams@togaware.com Support further development through the purchase of the PDF version of the book.