Desktop Survival Guide
by Graham Williams
A simple, if not always satisfactory, way to handle missing values is to replace them with some "central" value of the variable, typically the mean or the median. The mean is a reasonable choice if the variable is otherwise approximately normally distributed (and in particular is not skewed). If the data are skewed (e.g., there are a small number of very large values), the median is usually the better choice because it is less affected by extreme values.
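This is achieved in R with one of the following (use either the mean or the median, not both, since the first assignment leaves no missing values for the second to fill):
# Replace missing Age values with the mean of the observed values
crs$dataset[is.na(crs$dataset$Age), "Age"] <- mean(crs$dataset$Age, na.rm=TRUE)
# Alternatively, replace them with the median of the observed values
crs$dataset[is.na(crs$dataset$Age), "Age"] <- median(crs$dataset$Age, na.rm=TRUE)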
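As a minimal sketch of how the two choices differ on skewed data, the following uses a small made-up data frame. The names ds, ds.mean, and ds.median and the Age values are illustrative assumptions, not part of the crs dataset. Each imputation is applied to its own copy so the results can be compared side by side.

```r
# Toy data (hypothetical): Age with two missing values and one extreme value.
ds <- data.frame(Age = c(23, 25, 27, 29, 31, NA, 95, NA))

# Central values computed from the observed (non-missing) entries only.
age.mean   <- mean(ds$Age, na.rm = TRUE)
age.median <- median(ds$Age, na.rm = TRUE)

# Impute into separate copies so the two results can be compared.
ds.mean <- ds
ds.mean[is.na(ds.mean$Age), "Age"] <- age.mean

ds.median <- ds
ds.median[is.na(ds.median$Age), "Age"] <- age.median

# The single large value (95) pulls the mean above the median.
print(c(mean = age.mean, median = age.median))
```

With a single extreme value in the sample, the mean is pulled towards it while the median stays with the bulk of the observations, which is why the median is often preferred for skewed variables.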
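Whilst this is simple and computationally quick, it is a very blunt approach to imputation: every missing entry receives the same value regardless of the other variables, which understates the variable's spread and can lead to poor performance from the resulting models.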
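One way to see why the approach is blunt is to compare the spread of a variable before and after mean imputation. The sketch below uses simulated data. The sample size, the proportion of missing values, and all variable names are assumptions chosen purely for illustration.

```r
# Simulated ages (hypothetical); the seed only makes the run repeatable.
set.seed(42)
age <- rnorm(200, mean = 40, sd = 12)

# Remove 60 of the 200 values at random to create missing entries.
age.missing <- age
age.missing[sample(200, 60)] <- NA

# Mean imputation: every missing entry receives the same value.
imputed <- age.missing
imputed[is.na(imputed)] <- mean(age.missing, na.rm = TRUE)

# The mean is unchanged, but the standard deviation shrinks because
# 60 values now sit exactly at the centre of the distribution.
sd.observed <- sd(age.missing, na.rm = TRUE)
sd.imputed  <- sd(imputed)
print(c(observed = sd.observed, imputed = sd.imputed))
```

The imputed variable keeps the observed mean but has a smaller standard deviation than the observed values alone, so a model built on it sees less variation than the data really contain.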
Refer to Data Mining With R (http://www.liacc.up.pt/~ltorgo/DataMiningWithR/PDF/DataMiningWithR.png), from page 42, for more details.