DATA MINING
Desktop Survival Guide
by Graham Williams

Benford's Law

Benford's Law (http://en.wikipedia.org/wiki/Benfords_law) has proven effective in identifying oddities in data. For example, it has been used for sample selection in fraud detection. Benford's Law relates to the frequency of occurrence of the first digit in a collection of numbers. In many cases, the digit '1' appears as the first digit of the numbers in the collection some 30% of the time, whilst the digit '9' appears as the first digit less than 5% of the time. This rather startling observation is found, empirically, to hold in many collections of numbers, such as bank account balances, taxation refunds, and so on. By plotting a collection of numbers against the expectation based on Benford's Law, we are able to quickly ascertain any odd behaviour in the data.
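The expected frequencies quoted above follow from the standard statement of Benford's Law: the proportion of numbers with first digit d is log10(1 + 1/d). The short Python sketch below tabulates these values. The function name benford_first_digit is chosen here for illustration and is not part of Rattle.

```python
import math

def benford_first_digit(d: int) -> float:
    """Expected proportion of numbers whose first digit is d (1 to 9)."""
    if not 1 <= d <= 9:
        raise ValueError("first digit must be between 1 and 9")
    return math.log10(1 + 1 / d)

# Expected proportion for each possible first digit.
expected = {d: benford_first_digit(d) for d in range(1, 10)}

for d, p in expected.items():
    print(f"{d}: {p:.3f}")
```

The nine proportions sum to one, with '1' a little over 30% and '9' a little under 5%, matching the figures given in the text.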

Benford's Law is not valid for all collections of numbers. For example, people's ages would not be expected to follow Benford's Law, nor would telephone numbers. Use the observations with care.

You can select any number of continuous variables to be compared with Benford's Law. By default, a line chart is used, with the red line corresponding to the expected frequency for each of the initial digits. In this plot we have requested that Income be compared to Benford's Law. A Target variable has been identified (in the Variables tab) and so not only is the whole population's distribution of initial digits compared to Benford's Law, but so are the distributions of the subsets corresponding to the different values of the target variable. It is interesting to observe here that those cases in this dataset that required an adjustment after investigation conformed much less to Benford's Law than those that were found to require no adjustment. In fact, this latter group conformed very closely to Benford's Law.
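The comparison described above can be sketched outside Rattle in a few lines of Python. This is only an illustration of the idea, not Rattle's implementation: the helper names first_digit and digit_proportions are hypothetical, and the records are invented purely to make the example run.

```python
import math
from collections import Counter
from typing import Optional

def first_digit(x: float) -> Optional[int]:
    """First significant digit of x, or None for zero and non-finite values."""
    x = abs(float(x))
    if x == 0 or not math.isfinite(x):
        return None
    # Scientific notation always starts with the first significant digit.
    return int(f"{x:.15e}"[0])

def digit_proportions(values):
    """Observed proportion of each first digit (1 to 9) among usable values."""
    digits = [d for d in map(first_digit, values) if d is not None]
    counts = Counter(digits)
    n = len(digits)
    return {d: (counts[d] / n if n else 0.0) for d in range(1, 10)}

# Hypothetical (income, target) records, invented for illustration only.
records = [(1250.0, 0), (18300.0, 0), (2400.0, 0), (970.0, 1),
           (9100.0, 1), (31000.0, 0), (1100.0, 0), (8800.0, 1)]

# The whole population first, then one subset per value of the target.
groups = {"All": [income for income, _ in records]}
for income, target in records:
    groups.setdefault(f"Target={target}", []).append(income)

for name, values in groups.items():
    observed = digit_proportions(values)
    row = " ".join(f"{observed[d]:.2f}" for d in range(1, 10))
    print(f"{name:10s} {row}")
```

Each printed row corresponds to one line on the chart: the observed proportion of first digits 1 to 9 for the whole population and for each value of the target variable, to be read against the Benford expectation.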

By selecting the Benford Bars option a bar chart will be used to display the same information. The expected distribution of the initial digit of the numbers under consideration, according to Benford's Law, is once again shown as the initial red bar in each group. This is followed by the population distribution, and then the distribution for each of the sub-populations corresponding to the value of the Target variable. The bar chart again shows a very clear differentiation between the adjusted and non-adjusted cases.

Some users find that the bar chart conveys the information more readily, whilst many prefer the reduced clutter and greater clarity of the line chart. Regardless of which you prefer, Rattle will generate a single plot for each of the variables that have been selected for comparison with Benford's Law.

Where no target variable has been identified (either because the dataset being explored has no target variable or because the user has purposely not identified the target variable to Rattle) and where a line chart, rather than a bar chart, is requested, the distributions of all selected variables will be displayed on a single plot. This is the case here, where we have chosen to explore Age, Income, Deductions, and Adjustment.

This particular exploration of Benford's Law leads to a number of interesting observations. In the first instance, the variable Age clearly does not conform. As mentioned, age is not expected to conform since it is a number series that is constrained in various ways. In particular, people under the age of 20 are very much under-represented in this dataset, and the proportion of people over 50 diminishes with age.
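A small worked example shows why a constrained range such as age cannot conform. The sketch below assumes, purely for illustration, one person at each age from 20 to 69 (this is not the dataset discussed here) and compares the resulting first-digit proportions with the Benford expectation.

```python
import math
from collections import Counter

# Hypothetical ages: one person at each age from 20 to 69 (illustration only).
ages = range(20, 70)

# Ages are positive integers, so the first character is the first digit.
counts = Counter(int(str(age)[0]) for age in ages)
n = len(ages)

for d in range(1, 10):
    observed = counts[d] / n
    expected = math.log10(1 + 1 / d)
    print(f"{d}: observed {observed:.2f}, Benford {expected:.2f}")
```

With no one under 20, no age begins with '1', the digit that Benford's Law expects to be the most common, while the digits '2' to '6' are all equally frequent.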

The variable Deductions also looks particularly odd, with numbers beginning with '1' being far beyond expectations. In fact, numbers beginning with '3' and beyond are very much under-represented, although, interestingly, there is a small surge at '9'. There are good reasons for this. In this dataset we know that people are claiming deductions of less than $300, since this is a threshold in the tax law below which less documentation is required to substantiate the claims. The surge at '9' could be something to explore further, thinking perhaps that clients committing fraud may be trying to push their claims as high as possible (although there is really no need, in such circumstances, to limit oneself, it would seem, to less than $1000).

By exploring this single plot (i.e., without partitioning the data according to whether the case was adjusted or not) we see that the interesting behaviours we observed in relation to Income have disappeared. This highlights the point that exploring Benford's Law may be of most use in exploring the behaviours of particular sub-populations.

Note that even when no target is identified (in the Variables tab) and the user chooses to produce Benford Bars, a new plot will be generated for each variable, as the bar charts can otherwise become quite full.

Benford's Law primarily applies to the first digit of the numbers. A similar, but much weaker, law also applies to the second, third and fourth digits. In particular, the second digit distributions are approximately XXXX. However, as we proceed to the third, fourth, and subsequent digits, each digit has an expected frequency very close to 0.1 (or 10%), indicating that they are all close to equally likely.
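The expected frequencies for later digits can be derived from the first-digit law by summing over all possible leading digits. The Python sketch below uses the standard generalisation: for position n >= 2, the probability of digit d is the sum of log10(1 + 1/(10k + d)) over every k with n - 1 digits. The function name benford_nth_digit is an illustrative choice and not part of Rattle.

```python
import math

def benford_nth_digit(n: int, d: int) -> float:
    """Expected proportion of numbers whose n-th significant digit is d."""
    if n < 1:
        raise ValueError("digit position must be at least 1")
    if n == 1:
        if not 1 <= d <= 9:
            raise ValueError("first digit must be between 1 and 9")
        return math.log10(1 + 1 / d)
    if not 0 <= d <= 9:
        raise ValueError("digit must be between 0 and 9")
    # k ranges over every possible prefix of n - 1 leading digits.
    low, high = 10 ** (n - 2), 10 ** (n - 1)
    return sum(math.log10(1 + 1 / (10 * k + d)) for k in range(low, high))

for position in (2, 3, 4):
    row = " ".join(f"{benford_nth_digit(position, d):.4f}" for d in range(10))
    print(f"digit {position}: {row}")
```

Running this shows the second-digit proportions declining gently from '0' to '9', while the third and fourth digits are already very nearly uniform at 0.1 each.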
