Data Mining Survivor: Data_Transformation

DATA MINING
Desktop Survival Guide
by Graham Williams

Binning

Many algorithms for data mining, and before that for machine learning, have been developed to deal only with categorical variables. Thus, the ability to turn numeric variables into categorical variables is important.

The cut function divides the range of a numeric variable into intervals and codes each value according to the interval it falls into, thus transforming the variable into a categorical variable. In the example below the values are cut into three ranges and given appropriate labels. As a bonus, the percentage distribution across the three ranges is also given!

> v <- c(1, 1.4, 3, 1.1, 0.3, 0.6, 4, 5)
> v.cuts <- cut(v, breaks=c(-Inf, 1, 2, Inf), labels=c("Low", "Med", "High"))
> v.cuts
[1] Low  Med  High Med  Low  Low  High High
Levels: Low Med High
> table(v.cuts)/length(v.cuts)*100
v.cuts
 Low  Med High
37.5 25.0 37.5
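The labels assigned above depend on how cut treats values that sit exactly on a break point. The following is a minimal sketch using only base R, reusing the same data; the names breaks, labs, v.right, v.left, pct.right and pct.left are introduced here purely for illustration. By default the intervals are closed on the right, so the value 1 is coded as Low, whereas right=FALSE closes them on the left and moves it into Med.

```r
v <- c(1, 1.4, 3, 1.1, 0.3, 0.6, 4, 5)
breaks <- c(-Inf, 1, 2, Inf)
labs <- c("Low", "Med", "High")

# Default (right = TRUE): intervals are (-Inf,1], (1,2], (2,Inf],
# so the value 1 is coded as "Low".
v.right <- cut(v, breaks = breaks, labels = labs)

# right = FALSE: intervals are [-Inf,1), [1,2), [2,Inf),
# so the value 1 is now coded as "Med".
v.left <- cut(v, breaks = breaks, labels = labs, right = FALSE)

# Percentage distribution; prop.table(table(x)) * 100 is equivalent
# to table(x) / length(x) * 100 when x has no missing values.
pct.right <- prop.table(table(v.right)) * 100
pct.left <- prop.table(table(v.left)) * 100
```

Which convention is appropriate depends on how the boundaries are meant to be read (for example "up to and including 1" versus "less than 1"), so it is worth checking a boundary value by hand whenever the breaks coincide with values in the data.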

An example of this kind of transformation in practice is given in another chapter, where the apriori function requires categorical variables.

Binning is a common operation, and tools exist to bin data automatically using different strategies. The binning function of the sm package provides basic binning functionality. In the example below x is drawn at random, so the exact output will differ from run to run.

> library(sm)
> x <- rnorm(100)
> y <- cut(x, breaks=binning(x, nbins=3)$breaks, labels=c("Lo", "Med", "Hi"))
> y
  [1] Lo Lo Med Med Med Lo Med Med Med Med Med Lo Lo Med Hi Hi Med Lo
 [19] Lo Med Lo Hi Lo Med Hi Lo Med Lo Med Lo Med Med Lo Med Med Med
 [37] Lo Med Med Lo Lo Lo Lo Med Med Med Lo Med Med Med Lo Med Lo Med
 [55] Med Lo Med Med Med Med Lo Med Med Lo Med Lo Med Med Med Med Lo Med
 [73] Med Med Med Med Lo Med Lo Med Med Med Lo Med Med Lo Lo Med Lo Lo
 [91] Lo Med Med Lo Lo Med Med Med Lo Med
Levels: Lo Med Hi
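To make the idea of different binning strategies concrete, here is a minimal sketch of two common ones, equal-width and equal-frequency binning, written with base R only. It is not meant to reproduce what the sm package does internally. The fixed seed and the names x.width, x.freq and q are assumptions added for illustration.

```r
# Fixed seed so the sketch is reproducible (the example above uses no seed).
set.seed(42)
x <- rnorm(100)

# Equal-width binning: giving cut() a single number of breaks splits
# the range of x into that many intervals of equal length.
x.width <- cut(x, breaks = 3, labels = c("Lo", "Med", "Hi"))

# Equal-frequency binning: use the terciles of x as the break points.
# include.lowest = TRUE keeps the minimum of x in the first interval.
q <- quantile(x, probs = c(0, 1/3, 2/3, 1))
x.freq <- cut(x, breaks = q, labels = c("Lo", "Med", "Hi"),
              include.lowest = TRUE)

# Compare how many observations each strategy puts in each bin.
table(x.width)
table(x.freq)
```

With normally distributed data, equal-width bins tend to place most observations in the middle bin, while equal-frequency bins hold roughly a third of the observations each. Which behaviour is preferable depends on the algorithm that will consume the categorical variable.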

Brought to you by Togaware.