Data Mining Survivor: Data_Cleaning

DATA MINING
Desktop Survival Guide
by Graham Williams

Missing Values

Missing data can affect modelling, particularly if the data is not missing at random but is missing for some underlying systematic reason (e.g., censoring). If data is missing at random then the missing values are more likely to have little effect on the modelling.

An excellent reference on dealing with missing data is schafer97:incomplete_data.

Missing values are recorded in R using the special value NA. Various functions can be used to check for a missing value (is.na), to remove any entities with missing values (na.omit), and to identify those entities that are complete (complete.cases). The apply function also comes in handy here.
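As a minimal sketch of how these functions can be used to inspect missing values before removing anything, consider the following. The data frame ds and its columns (age, income) are made up purely for illustration, as are the names of the result variables.

```r
# Hypothetical toy data frame; the column names are illustrative only.
ds <- data.frame(age    = c(23, NA, 31, NA),
                 income = c(50000, 42000, NA, NA))

# is.na() flags each missing value in a vector.
age.missing <- is.na(ds$age)            # FALSE TRUE FALSE TRUE

# Applied to a data frame, is.na() returns a logical matrix, so
# colSums() counts the missing values in each column.
n.missing <- colSums(is.na(ds))         # age: 2, income: 2

# complete.cases() flags the rows that have no missing values.
complete <- complete.cases(ds)          # TRUE FALSE FALSE FALSE

# apply() over rows (MARGIN = 1) counts the missing values per entity.
missing.per.row <- apply(is.na(ds), 1, sum)   # 0 1 1 2
```

Counts like these give a quick indication of whether the missing values are concentrated in particular columns or entities.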

> ds <- ds[!apply(is.na(ds),1,all),]  # Remove all rows of all NA's.
> ds <- na.omit(ds)                   # Remove all rows that have any NA's.
> ds <- ds[complete.cases(ds),]       # Remove all rows that have any NA's.
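The difference between these commands is easiest to see on a small example. The sketch below assumes a made-up data frame (the columns age and income and the result names are hypothetical) in which one row is entirely missing and two rows are partly missing.

```r
# Hypothetical toy data frame: row 4 is all NA, rows 2 and 3 are partly NA.
ds <- data.frame(age    = c(23, NA, 31, NA),
                 income = c(50000, 42000, NA, NA))

# Drop only the rows in which every value is missing (row 4 here).
some.data <- ds[!apply(is.na(ds), 1, all), ]

# Drop every row that has at least one missing value (rows 2, 3 and 4 here).
full.data <- na.omit(ds)

# complete.cases() selects the same rows as na.omit().
also.full <- ds[complete.cases(ds), ]

nrow(some.data)   # 3
nrow(full.data)   # 1
nrow(also.full)   # 1
```

Note that na.omit also records which rows were dropped in an "na.action" attribute on its result, so full.data and also.full contain the same rows without being strictly identical objects.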
