Data Mining Survivor: Impute

DATA MINING
Desktop Survival Guide
by Graham Williams

Mean/Median

A simple, if not always satisfactory, choice for a missing value is some "central" value of the variable, typically the mean or the median. We might choose the mean, for example, if the variable is otherwise approximately normally distributed (and in particular shows no skewness). If the data are skewed, though (e.g., there are a small number of very large values), then the median is likely to be a better choice, because it is not pulled toward the extreme values.

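This is achieved in R with one of the following. The two assignments are alternatives: once the first has run, no missing values remain for the second to fill.

# Replace missing Age values with the mean of the observed values.
crs$dataset[is.na(crs$dataset$Age), "Age"] <- mean(crs$dataset$Age, na.rm=TRUE)

# Alternatively, replace missing Age values with the median.
crs$dataset[is.na(crs$dataset$Age), "Age"] <- median(crs$dataset$Age, na.rm=TRUE)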
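To illustrate how the choice between mean and median plays out on skewed data, here is a minimal sketch on a toy data frame. The data frame ds, its Income column and the values in it are hypothetical and invented for this illustration; they are not part of the dataset used above.

```r
# Hypothetical toy data: a right-skewed variable with two missing values.
ds <- data.frame(Income = c(20, 22, 25, 27, 30, 400, NA, NA))

# Compare the two candidate central values on the observed data only.
income_mean   <- mean(ds$Income, na.rm = TRUE)    # pulled upward by the value 400
income_median <- median(ds$Income, na.rm = TRUE)  # stays near the bulk of the data

# The variable is skewed, so impute with the median in a copy of the data.
imputed <- ds
imputed[is.na(imputed$Income), "Income"] <- income_median

print(c(mean = income_mean, median = income_median))
print(imputed)
```

Here the single large value drags the mean well above every other observation, whereas the median remains typical of most of the data, which is why the median is the safer fill value in this case.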

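Whilst mean or median imputation is simple and computationally quick, it is a very blunt approach: every missing entry receives the same value, regardless of what the other variables say about that observation. This can lead to poor performance from the resulting models.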
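One way to see the bluntness is to check what mean imputation does to the spread of a variable. The following is a small sketch on simulated data, not taken from the book; the vector names, the sample size and the proportion of missing values are all assumptions made for the illustration.

```r
set.seed(42)

# Simulated ages (hypothetical), with 60 of the 200 values set to missing.
age <- rnorm(200, mean = 40, sd = 12)
age_missing <- age
age_missing[sample(200, 60)] <- NA

# Mean imputation: every missing entry receives the same value.
age_imputed <- age_missing
age_imputed[is.na(age_imputed)] <- mean(age_missing, na.rm = TRUE)

# The mean is preserved, but the standard deviation shrinks, because the
# filled-in values sit exactly at the centre of the distribution.
sd_observed <- sd(age_missing, na.rm = TRUE)
sd_imputed  <- sd(age_imputed)
print(c(observed = sd_observed, imputed = sd_imputed))
```

The imputed variable appears less variable than the observed data suggest, which is one of the ways a single fill value can distort what a model learns from the variable.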

Refer to Data Mining With R (http://www.liacc.up.pt/~ltorgo/DataMiningWithR/PDF/DataMiningWithR.png), from page 42, for more details.
