DATA MINING
Desktop Survival Guide by Graham Williams
It is usually a good idea to review the distributions of the values of each of the variables in your dataset. The Distributions option allows you to visually explore the distributions for specific variables.
Using graphical tools to visually investigate the data's characteristics can help us understand the data, correct errors, and select and transform variables.
Graphical presentations are more effective for most people, and Rattle provides a graphical summary of the distribution of the data with the Distributions option of the Explore tab.
Visualising data has been an area of study within statistics for many years. A vast array of tools is available within R for presenting data visually, and the topic is covered in detail in dedicated books, including Cleveland (1993) and Tufte.
By choosing the Distributions radio button you can select specific variables of interest, and display various distribution plots. Selecting many variables will lead to many plots being displayed, and so it may be useful to display multiple plots per page (i.e., per window) by setting the appropriate value in the interface. By default, four plots will be displayed per page or window, but you can change this to anywhere from 1 plot per page to 9 plots per page.
Here we illustrate a window with the default four plots. Four plots
per page are useful, for example, to display each of the four
different types of plots for a single continuous variable. Clockwise,
they are the Box Plot, the Histogram, a Cumulative Function Plot, and
a Benford's Law Plot. Because we have identified a target variable, the
plots include the distributions for each subset of entities associated
with each value of the target variable, wherever this makes sense to
do so (e.g., not for the histogram).
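As a rough illustration of what splitting a plot by the target variable involves, the following sketch groups the values of one variable by target value and summarises each group separately. It is written in Python rather than taken from Rattle, and the records, the target values Yes and No, and the choice of the median as the summary are all assumptions made for the example.

```python
import statistics
from collections import defaultdict

# Hypothetical (value, target) pairs; names and numbers are illustrative only.
records = [(23, "No"), (35, "No"), (41, "Yes"), (29, "No"), (52, "Yes"), (47, "Yes")]

# Collect the values of the variable under each value of the target.
groups = defaultdict(list)
for value, target in records:
    groups[target].append(value)

# One summary for the whole dataset, then one per target value.
summary = {"All": statistics.median([value for value, _ in records])}
for target, group_values in sorted(groups.items()):
    summary[target] = statistics.median(group_values)

for name, med in summary.items():
    print(f"{name}: median = {med}")
```

Each entry in the summary corresponds to one box in a grouped box plot: one for all entities, and one for each value of the target.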
The box plot identifies the median and mean of the variable, shows the spread from the first quartile to the third, and indicates the outliers. The histogram splits the range of values of the variable into segments and shows the number of entities in each segment. The cumulative plot shows the percentage of entities below any particular value of the variable. Finally, the Benford's Law plot compares the distribution of the first digits of the values against the distribution expected according to Benford's Law. Each of the plots shown here is explained in more detail in the following sections.
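To make the quantities behind these four plots concrete, here is a minimal sketch that computes them directly. It is written in Python and is not how Rattle or R produces the plots. The sample values, the helper names, the 1.5 x IQR rule for flagging outliers, the equal-width segments, and the reading of "below" as "at or below" are all assumptions; R may use different conventions, for example when computing quartiles.

```python
import math
import statistics
from collections import Counter

# Hypothetical sample values, for illustration only.
values = [12, 15, 18, 21, 22, 25, 27, 31, 34, 95]

# Box plot: quartiles, median, mean, and outliers.
# Flagging points more than 1.5 * IQR beyond the quartiles is an assumed convention.
q1, median, q3 = statistics.quantiles(values, n=4)
mean = statistics.mean(values)
iqr = q3 - q1
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

# Histogram: split the range into equal-width segments and count the entities in each.
def histogram_counts(data, bins):
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    if width == 0:
        width = 1  # constant data: everything falls in the first segment
    counts = [0] * bins
    for x in data:
        # The maximum value is placed in the last segment.
        counts[min(int((x - lo) / width), bins - 1)] += 1
    return counts

# Cumulative plot: percentage of entities at or below x.
def cumulative_percent(data, x):
    return 100 * sum(1 for v in data if v <= x) / len(data)

# Benford's Law: the expected proportion of leading digit d is log10(1 + 1/d).
benford_expected = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(x):
    # Leading non-zero digit of a non-zero number.
    return int(str(abs(x)).lstrip("0.")[0])

first_digits = Counter(first_digit(v) for v in values if v != 0)
total = sum(first_digits.values())
observed = {d: first_digits[d] / total for d in range(1, 10)}

print("median:", median, "mean:", mean, "quartiles:", (q1, q3), "outliers:", outliers)
print("histogram counts:", histogram_counts(values, 4))
print("percent at or below 25:", cumulative_percent(values, 25))
for d in range(1, 10):
    print(d, f"expected {benford_expected[d]:.3f}", f"observed {observed[d]:.3f}")
```

With so few values the observed first-digit proportions will not resemble the Benford proportions; the comparison only becomes meaningful on larger datasets.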
For categorical variables two types of plots are supported, more as
alternatives to each other than as sources of extra information: the Bar Plot and the Dot
Plot. Each plot shows the number of entities that have a particular
value for the chosen variable. Both are sorted from the most frequent
to the least frequent value. For example, we can see that the value
Private of the variable Employment is the most
frequent, occurring over 1,400 times in this dataset.
A bar plot uses vertical bars while the dot plot uses dots placed horizontally. The dot plot is more likely to have room to list the actual values of the variable, whilst a bar plot will have trouble listing all of the values (as illustrated here). Each of the plots shown here is explained in more detail in the following sections.
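Both categorical plots rest on the same frequency table, sorted from the most frequent value to the least frequent. The sketch below builds that table in Python and prints it as a crude text chart. It is not Rattle's implementation, and the data are hypothetical: apart from Private, the category names are made up for the example.

```python
from collections import Counter

# Hypothetical values of a categorical variable such as Employment.
employment = ["Private", "Consultant", "Private", "PSLocal", "Private", "Consultant"]

# Count the entities for each value; most_common() sorts from most to least frequent.
counts = Counter(employment).most_common()

# Print one row per value, with its label to the left of its count.
for value, n in counts:
    print(f"{value:<12}{'#' * n} {n}")
```

Printing one row per value, as a dot plot does, leaves room for every label, which is why the dot plot copes better with variables that have many distinct values.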