Togaware DATA MINING
Desktop Survival Guide
by Graham Williams


The Cardiac Arrhythmia Dataset

The Arrhythmia dataset will be used to illustrate issues with data cleaning.

The dataset is of moderate size (392KB), with 452 entities and 280 variables, one of which is an output variable with 16 values. Some 40 of the input variables are categorical. Although a meta-data file on the repository lists the variables, there are too many to name by hand just now, so we give meaningful names to only a few and leave the rest with R's default names. As with other data from the UCI repository, ? is used for missing values, and we deal with that when we read the downloaded data into R.



> UCI <- "ftp://ftp.ics.uci.edu/pub"
> REPOS <- "ml-repos/machine-learning-databases"
> cardiac.url <- sprintf("%s/%s/arrhythmia/arrhythmia.data", UCI, REPOS)
> download.file(cardiac.url, "cardiac.data")
> cardiac <- read.csv("cardiac.data", header=FALSE, na.strings="?")
> summary(cardiac)

       V1              V2               V3              V4        
 Min.   : 0.00   Min.   :0.0000   Min.   :105.0   Min.   :  6.00  
 1st Qu.:36.00   1st Qu.:0.0000   1st Qu.:160.0   1st Qu.: 59.00  
 Median :47.00   Median :1.0000   Median :164.0   Median : 68.00  
 Mean   :46.47   Mean   :0.5509   Mean   :166.2   Mean   : 68.17  
 3rd Qu.:58.00   3rd Qu.:1.0000   3rd Qu.:170.0   3rd Qu.: 79.00  
 Max.   :83.00   Max.   :1.0000   Max.   :780.0   Max.   :176.00  
[...]
> str(cardiac)
`data.frame':   452 obs. of  280 variables:
 $ V1  : int  75 56 54 55 75 13 40 49 44 50 ...
 $ V2  : int  0 1 0 0 0 0 1 1 0 1 ...
 $ V3  : int  190 165 172 175 190 169 160 162 168 167 ...
 $ V4  : int  80 64 95 94 80 51 52 54 56 67 ...
 $ V5  : int  91 81 138 100 88 100 77 78 84 89 ...
 $ V6  : int  193 174 163 202 181 167 129 0 118 130 ...
 $ V7  : int  371 401 386 380 360 321 377 376 354 383 ...
 $ V8  : int  174 149 185 179 177 174 133 157 160 156 ...
 $ V9  : int  121 39 102 143 103 91 77 70 63 73 ...
 $ V10 : int  -16 25 96 28 -16 107 77 67 61 85 ...
 [...]
 $ V278: num  23.3 20.4 12.3 34.6 25.4 13.5 14.3 15.8 12.5 20.1 ...
 $ V279: num  49.4 38.8 49 61.6 62.8 31.1 20.5 19.8 30.9 25.1 ...
 $ V280: int  8 6 10 1 7 14 1 1 1 10 ...
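
To see what na.strings="?" does without downloading anything, the following minimal sketch reads a few made-up rows from an in-memory string. The values and the four-column layout are invented for illustration and are not taken from the Arrhythmia data; the names toy.text, toy, toy.raw and missing are hypothetical.

```r
# Toy data in the UCI style: no header row, "?" marks a missing value.
# These rows are invented for illustration only.
toy.text <- "75,0,190,80
56,1,?,64
54,0,172,?
55,?,175,94"

# With na.strings="?" the question marks become NA and the columns
# stay numeric.
toy <- read.csv(text=toy.text, header=FALSE, na.strings="?")

# Without it, any column containing "?" is no longer read as numeric.
toy.raw <- read.csv(text=toy.text, header=FALSE)

# Count the missing values in each column.
missing <- colSums(is.na(toy))
print(missing)
```

The same colSums(is.na(...)) call should work on the full cardiac data frame, where it would show which of the 280 columns contain missing values.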

We now give names to a few columns, then save the data to a cleaner CSV file and a binary RData file. In both, ? is replaced by NA and every column has a name: those we have chosen for the first four columns, and R's defaults for the rest.



> colnames(cardiac)[1:4] <- c("Age", "Gender", "Height", "Weight")
> write.table(cardiac, "cardiac.csv", sep=",", row.names=FALSE)
> save(cardiac, file="cardiac.RData", compress=TRUE)
> dim(cardiac)
[1] 452 280
> str(cardiac)
`data.frame':   452 obs. of  280 variables:
 $ Age   : int  75 56 54 55 75 13 40 49 44 50 ...
 $ Gender: int  0 1 0 0 0 0 1 1 0 1 ...
 $ Height: int  190 165 172 175 190 169 160 162 168 167 ...
 $ Weight: int  80 64 95 94 80 51 52 54 56 67 ...
 $ V5    : int  91 81 138 100 88 100 77 78 84 89 ...
 [...]
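
It can be worth checking that the saved files read back as expected. The following is a minimal sketch of such a round trip, assuming a small invented data frame in place of cardiac and temporary files in place of cardiac.csv and cardiac.RData; the names toy, csv.file, rdata.file, from.csv and from.rdata are hypothetical.

```r
# A small stand-in for the cardiac data frame (invented values).
toy <- data.frame(Age=c(75L, 56L, 54L), Gender=c(0L, 1L, 0L),
                  Height=c(190L, NA, 172L), V4=c(80L, 64L, NA))

# Temporary files stand in for cardiac.csv and cardiac.RData.
csv.file <- tempfile(fileext=".csv")
rdata.file <- tempfile(fileext=".RData")

# write.table writes missing values as the string NA by default.
write.table(toy, csv.file, sep=",", row.names=FALSE)
save(toy, file=rdata.file, compress=TRUE)

# Reading the CSV back: the file now has a header row, and NA is
# recognised as missing by default, so na.strings is not needed.
from.csv <- read.csv(csv.file)

# load() restores the object under its saved name; loading into a
# fresh environment avoids overwriting the toy object in memory.
env <- new.env()
load(rdata.file, envir=env)
from.rdata <- env$toy

# Compare both copies with the original.
print(all.equal(toy, from.csv))
print(all.equal(toy, from.rdata))
```

Loading into a separate environment is a design choice for the sketch only; in an interactive session a plain load("cardiac.RData") restores cardiac directly into the workspace.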

Copyright © 2004-2006 Graham.Williams@togaware.com