
Outlier Analysis

Rare, unusual, or just plain infrequent events are of interest in data mining in many contexts, including fraud detection and marketing. We classify analyses that focus on the discovery of such events as outlier analysis. Hawkins (1980) captures the concept of an outlier as:

an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.

A general approach to identifying outliers is to assume a known distribution for the data and to examine how far individual observations deviate from that distribution. Such approaches are common in statistics (Barnett and Lewis, 1994), but they do not scale well to large datasets.
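
As a concrete illustration of this distributional view, the following sketch assumes the data are roughly normally distributed and simply flags observations lying more than three standard deviations from the mean. The function name z.outliers and the cutoff of three are illustrative choices, not part of any method cited here.

# Flag values lying more than three standard deviations from the mean,
# assuming the data are approximately normally distributed.
z.outliers <- function(x, cutoff = 3)
{
  z <- abs(x - mean(x)) / sd(x)
  which(z > cutoff)
}

# Example: inject one extreme value into otherwise normal data.
x <- c(rnorm(100), 10)
z.outliers(x)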

Distance-based methods are common in data mining, where the measure of an entity's outlierness is based on its distance to nearby entities. The number of nearby entities and the minimum distance are the two parameters of such approaches (see Knorr and Ng, 1998).
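
The following is a brute-force sketch of this distance-based idea, flagging any point with fewer than n other points within radius q. It computes all pairwise distances and so does not scale; it illustrates the definition rather than the efficient algorithms of Knorr and Ng (1998). The function name distance.outliers and the example data are illustrative only.

# Flag points having fewer than n neighbours within radius q.
# An O(N^2) illustration of the definition, not a scalable algorithm.
distance.outliers <- function(x, q, n)
{
  d <- as.matrix(dist(x))              # all pairwise Euclidean distances
  diag(d) <- Inf                       # ignore each point's distance to itself
  neighbours <- rowSums(d <= q)        # number of neighbours within radius q
  which(neighbours < n)
}

# Example: two dense clusters plus one isolated point.
x <- rbind(matrix(rnorm(200), ncol = 2),
           matrix(rnorm(200, mean = 5), ncol = 2),
           c(10, -10))
distance.outliers(x, q = 1, n = 5)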

Density-based approaches were introduced by Breunig, Kriegel, Ng and Sander (2000), who score each data point with a local outlier factor (LOF). See also Jin, Tung and Han (2001).

The early work on outliers was carried out from a statistical viewpoint, where outliers are data points that deviate significantly from the identified underlying distribution of the data. Hawkins (1980) is a good textbook on outliers, and Barnett and Lewis (1994) is another overview of statistical approaches to outliers.

Distance-based approaches have been developed by Knorr and Ng (1998, 1999) and Ramaswamy et al. (2000). Such approaches usually explore the neighbourhood of each data point and do not rely on an assumed underlying distribution.

Knorr and Ng (1998) identify outliers by counting the number of neighbours within a specified radius of a data point. The radius $q$ and the threshold number of points $n$ are the only two parameters of the approach. The approach is simple but is inadequate for data of uneven density, where $q$ and $n$ would need to vary across the data to cope with the changes in density. Ramaswamy et al. (2000) take a similar approach, ranking data points by the distance to their $k$-th nearest neighbour.
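
A similarly brute-force sketch of the ranking idea computes, for each point, the distance to its $k$-th nearest neighbour and orders the points by that score. This illustrates the scoring only, not the efficient partition-based algorithms of the paper; the function name knn.dist.rank is illustrative.

# Rank points by the distance to their k-th nearest neighbour;
# larger distances indicate stronger outliers. O(N^2) sketch only.
knn.dist.rank <- function(x, k = 5)
{
  d <- as.matrix(dist(x))
  diag(d) <- Inf
  score <- apply(d, 1, function(row) sort(row)[k])
  order(score, decreasing = TRUE)      # indices, strongest outliers first
}

# Example: the isolated point should be ranked first.
x <- rbind(matrix(rnorm(100), ncol = 2), c(10, -10))
head(knn.dist.rank(x))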

Breunig, Kriegel, Ng and Sander (1999), and then Breunig et al. (2000), introduce a density-based approach that scores data points with a local outlier factor (LOF). Jin, Tung and Han (2001) introduce a heuristic to more efficiently identify the top outliers using the LOF.
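
The following compact sketch computes LOF scores along the lines of Breunig et al. (2000), using Euclidean distance, a fixed neighbourhood of exactly $k$ points (ignoring ties in the $k$-distance), and making no attempt at the efficiency of the published algorithms. The function name lof.scores is illustrative.

# Local outlier factor (LOF) scores for the rows of a numeric matrix x,
# using a neighbourhood of size k. Scores well above 1 suggest outliers.
lof.scores <- function(x, k = 5)
{
  d <- as.matrix(dist(x))                              # pairwise distances
  n <- nrow(d)
  diag(d) <- Inf                                       # a point is not its own neighbour
  knn <- t(apply(d, 1, function(row) order(row)[1:k])) # k nearest neighbours
  kdist <- apply(d, 1, function(row) sort(row)[k])     # k-distance of each point
  lrd <- numeric(n)                                    # local reachability density
  for (i in 1:n)
  {
    reach <- pmax(kdist[knn[i, ]], d[i, knn[i, ]])     # reachability distances
    lrd[i] <- 1 / mean(reach)
  }
  # LOF: average ratio of the neighbours' density to the point's own density.
  sapply(1:n, function(i) mean(lrd[knn[i, ]]) / lrd[i])
}

# Example: the isolated point receives the largest LOF.
x <- rbind(matrix(rnorm(100), ncol = 2), c(10, -10))
which.max(lof.scores(x))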

Yamanishi, Takeuchi et al. (2000) build mixture models incrementally as data becomes available and identify outliers as those data items that cause the greatest perturbation to the model, an approach implemented in their SmartSifter system.
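
As a greatly simplified illustration of the perturbation idea only, the sketch below maintains a single discounted Gaussian model of a one dimensional data stream and scores each point by how much it shifts the model's mean and variance. This is not SmartSifter itself, which uses finite mixture models and principled scoring functions; the function name perturbation.scores and the discount value are illustrative.

# Score each point in a stream by how much it perturbs a running
# Gaussian model (discounted mean and variance). Illustrative only.
perturbation.scores <- function(stream, discount = 0.05)
{
  mu <- mean(stream[1:10])             # initialise from a small prefix
  v  <- var(stream[1:10])
  scores <- numeric(length(stream))
  for (t in 11:length(stream))
  {
    x <- stream[t]
    mu.new <- (1 - discount) * mu + discount * x
    v.new  <- (1 - discount) * v + discount * (x - mu.new)^2
    # Perturbation: how far the model parameters move for this point.
    scores[t] <- abs(mu.new - mu) / sqrt(v) + abs(v.new - v) / v
    mu <- mu.new
    v  <- v.new
  }
  scores
}

# Example: an extreme value part-way through the stream scores highly.
stream <- c(rnorm(50), 10, rnorm(50))
which.max(perturbation.scores(stream))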

Aggarwal and Yu (2001) explore the issue of outliers in high dimensional space, where data tends to be sparse and consequently all data points tend to be almost equidistant from one another (Beyer et al., 1999). They suggest an algorithm in which the high dimensional data are projected onto lower dimensional subspaces of unusually low density, and propose an evolutionary algorithm to generate the candidate subspaces in which outliers are then searched for.

