DATA MINING
Desktop Survival Guide
by Graham Williams

Distributions Option

It is usually a good idea to review the distributions of the values of each of the variables in your dataset. The Distributions option allows you to visually explore the distributions for specific variables.

Using graphical tools to visually investigate the data's characteristics can help us to understand the data, to correct errors, and to select and transform variables.

Graphical presentations are more effective for most people, and Rattle provides a graphical summary of the distribution of the data with the Distributions option of the Explore tab.

Visualising data has been an area of study within statistics for many years. A vast array of tools is available within R for presenting data visually, and the topic is covered in detail in dedicated books, including those by Cleveland (1993) and Tufte.

[Figure: rattle-audit-explore-distr-income]

By choosing the Distributions radio button you can select specific variables of interest, and display various distribution plots. Selecting many variables will lead to many plots being displayed, and so it may be useful to display multiple plots per page (i.e., per window) by setting the appropriate value in the interface. By default, four plots will be displayed per page or window, but you can change this to anywhere from 1 plot per page to 9 plots per page.

Here we illustrate a window with the default four plots. Four plots per page are useful, for example, to display each of the four different types of plots for a single continuous variable. Clockwise, they are the Box Plot, the Histogram, a Cumulative Function Plot, and a Benford's Law Plot. Because we have identified a target variable, the plots include the distributions for each subset of entities associated with each value of the target variable, wherever it makes sense to do so (e.g., not for the histogram).

The box plot identifies the median and mean of the variable, the spread from the first quartile to the third, and indicates the outliers. The histogram splits the range of values of the variable into segments and shows the number of entities in each segment. The cumulative plot shows the percentage of entities below any particular value of the variable. The Benford's Law plot compares the distribution of the first digit of the numbers against the distribution expected under Benford's Law. Each of the plots shown here is explained in more detail in the following sections.
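To make the quantities behind these four plots concrete, here is a minimal Python sketch using only the standard library. It is not Rattle's or R's implementation: the sample values, the 1.5 x IQR outlier rule, the equal-width binning, and the helper names (histogram, cumulative_percent, first_digit, benford_comparison) are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from statistics import mean, median, quantiles

# Hypothetical values of a continuous variable (made up, not from the audit dataset).
income = [1200, 1850, 2300, 2750, 3100, 3400, 3900, 4600, 5200, 18000]

# Box plot: median, mean, quartiles, and points far outside the box.
# The 1.5 * IQR fence is a common convention, assumed here.
q1, q2, q3 = quantiles(income, n=4, method="inclusive")
iqr = q3 - q1
outliers = [x for x in income if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
summary = {"median": median(income), "mean": mean(income), "q1": q1, "q3": q3}

def histogram(values, bins):
    """Split the range into equal-width segments and count entities in each."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    if width == 0:  # all values identical: everything falls in the first segment
        counts[0] = len(values)
        return counts
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # the maximum goes in the last segment
        counts[idx] += 1
    return counts

def cumulative_percent(values, x):
    """Percentage of entities at or below x (the usual empirical CDF convention)."""
    return 100.0 * sum(1 for v in values if v <= x) / len(values)

def first_digit(v):
    """Leading digit of the integer part of v."""
    return int(str(abs(int(v)))[0])

def benford_comparison(values):
    """Map each digit 1..9 to (observed proportion, Benford's expected proportion).

    Values whose integer part is zero are skipped in this sketch.
    """
    digits = Counter(first_digit(v) for v in values if int(v) != 0)
    total = sum(digits.values())
    return {d: (digits[d] / total, math.log10(1 + 1 / d)) for d in range(1, 10)}

print(summary, outliers)
print(histogram(income, 4))
print(cumulative_percent(income, 3100))
print(benford_comparison(income)[1])
```

With this made-up sample, the single very large value is flagged as an outlier by the box plot rule and also dominates the histogram's range, which is the kind of feature the plots are meant to expose.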

For categorical variables two types of plots are supported, offered as alternatives to each other rather than as sources of additional information: the Bar Plot and the Dot Plot. Each plot shows the number of entities that have a particular value for the chosen variable. Both are sorted from the most frequent to the least frequent value. For example, we can see that the value Private of the variable Employment is the most frequent, occurring over 1,400 times in this dataset.
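The counting and ordering that both plots rely on can be sketched in a few lines of Python. This is an illustration only, not how Rattle produces its plots: the sample values (apart from the label Private) and the helper name sorted_counts are hypothetical.

```python
from collections import Counter

# Hypothetical values of a categorical variable; only "Private" appears in the text.
employment = ["Private", "Public", "Private", "SelfEmployed", "Private", "Public"]

def sorted_counts(values):
    """Return (value, count) pairs ordered from most to least frequent."""
    return Counter(values).most_common()

# Crude text rendering in the spirit of a dot plot:
# one labelled row per value, with a marker placed at the count.
for value, count in sorted_counts(employment):
    print(f"{value:<14}|{' ' * (count - 1)}o  ({count})")
```

Because each value gets its own labelled row, this layout has room for every label, however many distinct values the variable has.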

A bar plot uses vertical bars, while the dot plot places its dots along horizontal lines, one line per value. The dot plot is therefore more likely to have room to label every value of the variable, whilst a bar plot can have trouble labelling all of the values (as illustrated here). Each of the plots shown here is explained in more detail in the following sections.
