Data Mining Survivor: Data_Cleaning

DATA MINING
Desktop Survival Guide
by Graham Williams

Review Data

Often we will find ourselves loading data from a CSV file, which is readily supported by R (See Section ). On first loading the data we generally want a quick summary, using R's summary function. It is here that we might notice that some columns we expected to be numeric have been read as factors!
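A quick way to spot such columns is to list the class of each one after loading. The following is a minimal sketch rather than the book's own code: the inline data and the names csv_text, col_classes and suspect are made up for illustration. Note that since R 4.0.0 read.csv no longer converts strings to factors by default, so stringsAsFactors=TRUE is set explicitly here to reproduce the behaviour described above; without it the affected columns load as character instead.

```r
# Hypothetical inline data standing in for a CSV file; "?" marks a missing value.
csv_text <- c("1,52,?",
              "2,36,84",
              "3,?,-157")

# stringsAsFactors = TRUE reproduces the factor behaviour described in the text.
df <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = TRUE)

# Report the class of every column.
col_classes <- sapply(df, class)
print(col_classes)

# Names of the columns that were not read as numeric.
suspect <- names(df)[!sapply(df, is.numeric)]
print(suspect)
```

Any column named in suspect that we believe should be numeric is worth a closer look.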

Consider the example of the cardiac dataset (See Section ).

> cardiac <- read.csv("cardiac.data", header=F)
> summary(cardiac)
[...]
      V10               V11           V12           V13           V14
 Min.   :-172.00   52     : 13   60     : 23   49     :  9   ?      :376
 1st Qu.:   3.75   36     : 10   ?      : 22   55     :  9   84     :  3
 Median :  40.00   42     :  9   61     : 16   59     :  9   -157   :  2
 Mean   :  33.68   10     :  8   56     : 14   62     :  9   -164   :  2
 3rd Qu.:  66.00   33     :  8   58     : 13   26     :  8   -93    :  2
 Max.   : 169.00   41     :  8   68     : 12   33     :  8   103    :  2
                   (Other):396   (Other):352   (Other):400   (Other): 65
[...]

Our understanding of the data is that these variables should be numeric. The telltale sign is that V14 has ? as its most frequent value, and V12 also lists ? among its values. Tabulating the frequency of each value shows that the apparently nominal variables have only a single non-numeric value, the ?. When we read the data from the CSV file we therefore need to tell R that ? is used to indicate missing values:

> cardiac <- read.csv("cardiac.data", header=F, na.strings="?")
> summary(cardiac)
[...]
      V11               V12               V13               V14
 Min.   :-177.00   Min.   :-170.00   Min.   :-135.00   Min.   :-179.00
 1st Qu.:  14.00   1st Qu.:  41.00   1st Qu.:  12.00   1st Qu.:-124.50
 Median :  41.00   Median :  56.00   Median :  40.00   Median : -50.50
 Mean   :  36.15   Mean   :  48.91   Mean   :  36.72   Mean   : -13.59
 3rd Qu.:  63.25   3rd Qu.:  65.00   3rd Qu.:  62.00   3rd Qu.: 117.25
 Max.   : 179.00   Max.   : 176.00   Max.   : 166.00   Max.   : 178.00
 NA's   :   8.00   NA's   :  22.00   NA's   :   1.00   NA's   : 376.00
[...]

That's looking better. Note that the NAs are reported and that V14 has 376 of them, in accord with the previous observation of 376 ?'s.
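The frequency check and the NA count discussed above can also be done programmatically. The sketch below uses a small made-up vector in place of a real column such as V14, so the names raw, freq, non_numeric and na_counts are illustrative only. It tabulates the values, picks out those that cannot be parsed as numbers, and then confirms that re-reading with na.strings="?" turns exactly those entries into NAs.

```r
# Hypothetical values standing in for a column such as V14.
raw <- c("?", "84", "-157", "?", "103", "?")

# Frequency of each distinct value, most common first.
freq <- sort(table(raw), decreasing = TRUE)
print(freq)

# Values that cannot be parsed as numbers (as.numeric yields NA for them).
parsed <- suppressWarnings(as.numeric(raw))
non_numeric <- unique(raw[is.na(parsed)])
print(non_numeric)

# Re-read with the marker declared as missing and count the NAs per column.
df <- read.csv(text = raw, header = FALSE, na.strings = "?")
na_counts <- colSums(is.na(df))
print(na_counts)
```

If non_numeric holds only the missing-value marker, and the NA count matches the number of times that marker occurred, the column can safely be treated as numeric.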
