Data Mining Survivor: Data_Cleaning

DATA MINING
Desktop Survival Guide
by Graham Williams

Removing Outliers

Tests for outliers have largely been superseded by robust methods. Outlier tests perform poorly because outliers tend to damage results long before they are detected. Robust methods attempt to compensate for outliers rather than reject them. Random forest modelling also helps avoid the issue of outliers.
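As a rough illustration of compensating for an outlier rather than rejecting it, the sketch below compares classical and robust summaries on a small made-up vector. The data are hypothetical and are not taken from the wine dataset; only base R functions are used.

```r
# Hypothetical data: ten well-behaved values plus one gross outlier.
x <- c(2.1, 2.3, 2.2, 2.4, 2.0, 2.5, 2.3, 2.2, 2.4, 2.1, 25.0)

# Classical summaries are pulled strongly towards the outlier.
classical <- c(centre = mean(x), spread = sd(x))

# Robust counterparts: the median and the median absolute deviation (MAD)
# change very little when a single extreme value is present.
robust <- c(centre = median(x), spread = mad(x))

print(rbind(classical, robust))
```

The outlier stays in the data throughout. The robust estimates simply give it far less influence, which is the sense in which such methods compensate rather than reject.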

You can get a list of what the boxplot function thinks are outliers:

> load("wine.RData")
> bp <- boxplot(wine$Ash, plot=FALSE)
> bp$out
[1] 3.22 1.36 3.23
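By default the boxplot function flags a value when it lies more than 1.5 times the spread between the hinges (roughly the interquartile range) beyond the box. If you do decide to drop such values, one possible approach is sketched below. It uses boxplot.stats, which performs the same calculation without plotting. The vector ash is hypothetical stand-in data, not the actual wine$Ash column, and ash.clean is a name chosen for illustration.

```r
# Hypothetical stand-in for wine$Ash, with one extreme value (5.00).
ash <- c(2.43, 2.14, 2.67, 2.50, 2.87, 2.45, 2.61, 2.17, 2.27, 2.30, 2.36, 5.00)

# Values beyond 1.5 times the hinge spread (coef = 1.5 is the default).
out <- boxplot.stats(ash, coef = 1.5)$out

# Keep only the observations that were not flagged.
ash.clean <- ash[!ash %in% out]

print(out)
print(length(ash.clean))
```

Increasing coef makes the rule more conservative, so fewer points are flagged. Given the caution above about outlier tests, it is worth inspecting the flagged values before discarding them.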

Brought to you by Togaware.