DATA MINING
Desktop Survival Guide by Graham Williams
Building a random forest and evaluating it on the training dataset gives "perfect" results (which might make you suspect overfitting), but evaluating on the test dataset gives a realistic (and generally still good) measure of performance. We illustrate this with the audit data.
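A minimal sketch, assuming the audit dataset from the rattle package with binary target Adjusted (the column names here are assumptions that may differ in your copy of the data; drop identifier and outcome-derived columns before modelling):

> library(randomForest)
> library(rattle)                  # assumed source of the audit data
> data(audit)
> audit <- na.omit(audit)          # randomForest() does not accept NAs
> set.seed(42)
> train <- sample(nrow(audit), 0.7 * nrow(audit))
> audit.rf <- randomForest(as.factor(Adjusted) ~ . - ID - Adjustment,
+                          audit[train, ])
> # "Perfect" on the training data the forest has already seen:
> table(predict(audit.rf, audit[train, ]), audit$Adjusted[train])
> # Realistic (and generally still good) on the unseen test data:
> table(predict(audit.rf, audit[-train, ]), audit$Adjusted[-train])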
If there are many noise variables, increase the number of variables considered at each node (the mtry argument) so that informative variables have a better chance of being selected at each split.
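For instance, tuneRF() from the same package searches over values of mtry, stepping up and down from the default by a factor and reporting the out-of-bag error for each value (illustrated here on the iris data):

> library(randomForest)
> set.seed(42)
> # Try mtry values, doubling at each step, and keep the one with the
> # lowest out-of-bag error:
> tuneRF(iris[, -5], iris[, 5], stepFactor=2)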
Random forests are implemented in R by the randomForest package:
> library(randomForest)
The classwt option in the current randomForest package does not fully work and should be avoided. The sampsize and strata options can be used together. Note that if strata is not specified, the class labels are used as the strata.
Here's an example using the iris data:
> iris.rf <- randomForest(Species ~ ., iris, sampsize=c(10, 20, 10))
You can also name the classes in the sampsize specification:
> samples <- c(setosa=10, versicolor=20, virginica=10)
> iris.rf <- randomForest(Species ~ ., iris, sampsize=samples)
You can also stratify the sampling on a variable other than the class labels, for example to even up the class distribution across some grouping. Andy Liaw gives the example of multi-centre clinical trial data, where to grow each tree you want to draw the same number of patients from each centre, which can be done with something like:
> randomForest(..., strata=center, sampsize=rep(min(table(center)), nlevels(center)))
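As a self-contained sketch of the same idea (the data here are simulated, and center, x, and y are hypothetical names, not variables from any trial data):

> set.seed(42)
> center <- factor(sample(c("A", "B", "C"), 300, replace=TRUE,
+                         prob=c(0.5, 0.3, 0.2)))   # unequal centre sizes
> x <- matrix(rnorm(300 * 5), 300)
> y <- factor(ifelse(rnorm(300) + (center == "A") > 0.5, "yes", "no"))
> df <- data.frame(x, center, y)
> # Each tree draws the same number of cases from every centre:
> rf <- randomForest(y ~ ., df, strata=df$center,
+                    sampsize=rep(min(table(df$center)), nlevels(df$center)))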
The importance option allows us to review the importance of each variable in determining the outcome. The first measure is the mean decrease in prediction accuracy (scaled) when the variable's values are permuted in the out-of-bag data; the second is the total decrease in node impurity from splitting on the variable, averaged over all trees and measured by the Gini index.
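For example, on the iris data:

> iris.rf <- randomForest(Species ~ ., iris, importance=TRUE)
> importance(iris.rf)   # one column per class, plus the two overall measures
> varImpPlot(iris.rf)   # dotchart of mean decrease in accuracy and in Gini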