Togaware DATA MINING
Desktop Survival Guide
by Graham Williams

Unbalanced Classification

Unbalanced classification is a common problem in areas such as fraud detection, rare disease diagnosis, and network intrusion detection. The problem is that one class is heavily underrepresented in the data. For example, cases of fraud in a very large medical insurance dataset may account for less than 1% of the records. In compliance work, where claims are being reviewed, often only about 10% of claims require adjustment. In such circumstances, if we build a model in the usual way, where the aim is to minimise the error rate, the most accurate model may simply say that there is never any fraud. Such a model can be up to 99% accurate, yet it is of very little use because it identifies none of the cases we care about.
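The arithmetic behind this can be checked with a minimal sketch. The dataset below is hypothetical (10,000 claims, of which 1% are labelled as fraud), and the "model" is simply a rule that always predicts the majority class. Alongside accuracy it reports recall on the rare class, the fraction of actual fraud cases that are flagged.

```python
# Hypothetical data: 10,000 claims, 1% fraudulent (label 1), the rest legitimate (label 0).
n_claims = 10_000
n_fraud = 100
labels = [1] * n_fraud + [0] * (n_claims - n_fraud)

# A "model" that always predicts the majority class (no fraud).
predictions = [0] * n_claims

# Accuracy: fraction of all claims labelled correctly.
correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / n_claims

# Recall on the rare class: fraction of actual fraud cases the model flags.
true_positives = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
recall = true_positives / n_fraud

print(f"accuracy = {accuracy:.2%}")      # accuracy = 99.00%
print(f"fraud recall = {recall:.2%}")    # fraud recall = 0.00%
```

The error rate alone hides the failure, which is why a measure focused on the rare class is worth reporting for unbalanced data.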

Data mining of unbalanced datasets involves adjusting the modelling in some way. One approach is to down sample the majority class to even up the classes. Alternatively, we might over sample entities from the rare class and, by so doing, increase the weight of the minorities!
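Both resampling ideas can be sketched in a few lines of Python using only the standard library. The data, the function names `down_sample` and `over_sample`, and the choice to balance the classes exactly 1:1 are assumptions made for illustration; in practice the target ratio is something to tune, and both functions assume the majority class really is the larger one.

```python
import random

def down_sample(majority, minority, rng):
    """Keep a random subset of the majority class, the same size as the minority class."""
    kept = rng.sample(majority, k=len(minority))   # sampling without replacement
    return kept + minority

def over_sample(majority, minority, rng):
    """Add minority cases drawn with replacement until the two classes are the same size."""
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra

# Hypothetical dataset: 990 legitimate cases and 10 fraud cases.
rng = random.Random(42)                             # fixed seed for repeatability
majority = [("legit", i) for i in range(990)]
minority = [("fraud", i) for i in range(10)]

down = down_sample(majority, minority, rng)         # 10 legit + 10 fraud
over = over_sample(majority, minority, rng)         # 990 legit + 990 fraud (with repeats)

print(len(down), len(over))
```

Down sampling discards most of the majority data, while over sampling repeats the same few rare cases many times, so each has a cost. Resampling would normally be applied to the training data only, leaving any test data at its natural class proportions.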

We illustrate two approaches to dealing with unbalanced datasets in Chapter [*]. There, one approach is to modify the weights, and the second is to down sample to balance up the classes. Both have been found to be very effective when coupled with random forests.
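As a rough illustration of what modifying the weights can mean, the sketch below computes one common heuristic, weights inversely proportional to class frequency, so that each class contributes the same total weight. This is an assumption for illustration and not necessarily the scheme used in the chapter referred to; the function name `inverse_frequency_weights` and the data are hypothetical. The resulting weights would be handed to a learner that accepts per-class weights.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count), where n is the number of cases
    and k the number of classes, so every class has the same total weight."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}

# Hypothetical dataset: 990 legitimate cases and 10 fraud cases.
labels = ["legit"] * 990 + ["fraud"] * 10
weights = inverse_frequency_weights(labels)

# Total weight carried by each class after weighting (equal by construction).
totals = {cls: weights[cls] * labels.count(cls) for cls in weights}

print(weights)
print(totals)
```

With these numbers a single fraud case carries a weight of 50, roughly a hundred times that of a legitimate case, so misclassifying it is penalised accordingly.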

Copyright © 2004-2006 Graham.Williams@togaware.com