DATA MINING
Desktop Survival Guide by Graham Williams
A key task in any data mining project is exploratory data analysis (often abbreviated as EDA; see http://en.wikipedia.org/wiki/Exploratory_data_analysis), which generally involves getting a basic understanding of a dataset. Statistics, the fundamental tool here, is essentially about understanding uncertainty and thereby making allowance for it. It also provides a framework for understanding the discoveries made in data mining. Discoveries need to be statistically sound and statistically significant, which means the uncertainty associated with the modelling needs to be understood.
We explore the shape or distribution of our data before we begin mining. Through this exploration we begin to understand the "lay of the land," just as a miner works to understand the terrain rather than blindly digging for gold. We may also identify problems with the data, including missing values, noise and erroneous data, and skewed distributions. These findings then drive our choice of tools for preparing and transforming our data and for mining it.
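To make the idea concrete, here is a minimal sketch of the kind of check such exploration involves: counting missing values and measuring skew in a single numeric variable. It is written in plain Python purely for illustration and is not how Rattle does it; the `summarise` function and the `income` values are hypothetical, and missing values are assumed to be represented as `None`.

```python
import statistics


def summarise(values):
    """Summarise one numeric column; None marks a missing value (an assumption)."""
    present = [v for v in values if v is not None]
    n = len(present)
    mean = statistics.fmean(present)
    sd = statistics.pstdev(present)
    # Moment-based skewness: a positive value indicates a long right tail.
    if sd > 0:
        skew = sum((v - mean) ** 3 for v in present) / (n * sd ** 3)
    else:
        skew = 0.0
    return {
        "count": n,
        "missing": len(values) - n,
        "mean": mean,
        "median": statistics.median(present),
        "skewness": skew,
    }


# Hypothetical income column with two missing entries and one very large value.
income = [32, 35, None, 38, 41, 36, None, 250]
summary = summarise(income)
print(summary)
```

A mean well above the median, together with a positive skewness, suggests a right-skewed variable that might benefit from a transformation before mining, and a non-zero missing count signals that some form of imputation or filtering may be needed.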
Rattle provides tools ranging from textual summaries to visually appealing graphical summaries, tools for identifying correlations between variables, and a link to the very sophisticated GGobi tool for visualising data. The Explore tab provides an opportunity to understand our data in various ways.