DATA MINING
Desktop Survival Guide by Graham Williams
Data cleaning deals with removing errant transactions, updating transactions to account for reversals, eliminating missing data, and so on.
The aim of data cleaning is to raise the data quality to a level suitable for the selected analyses.
The data cleaning to be performed depends on the purpose to which the data is to be put. Some activities will require a selection of data cleaning and data transformation modules to be applied to the data.
Data cleaning occurs early in the process and then continues throughout the process as we learn more about the data.
Typical activities:
Field selection
Sampling
Data correction
Missing values treatment
Data transformation, e.g., birth date to age.
Derive new fields
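The activities listed above can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library. The records, the field names, the `clean` and `age_on` helpers, the choice of median imputation, and the 50000 threshold are all assumptions made for the example, not something prescribed by the guide.

```python
import random
import statistics
from datetime import date

# Hypothetical raw records; every field name and value here is made up.
raw = [
    {"id": 1, "birth_date": date(1980, 5, 17), "income": 52000.0, "state": "nsw", "notes": "x"},
    {"id": 2, "birth_date": date(1992, 11, 3), "income": None, "state": "VIC", "notes": "y"},
    {"id": 3, "birth_date": date(1975, 1, 30), "income": 61000.0, "state": "Vic", "notes": "z"},
    {"id": 4, "birth_date": date(2001, 7, 9), "income": 38000.0, "state": "NSW", "notes": ""},
]


def age_on(birth, as_of):
    """Return the age in whole years on the date as_of."""
    before_birthday = (as_of.month, as_of.day) < (birth.month, birth.day)
    return as_of.year - birth.year - int(before_birthday)


def clean(records, as_of, keep=("id", "birth_date", "income", "state")):
    # Field selection: keep only the fields needed for the analysis.
    rows = [{k: r[k] for k in keep} for r in records]

    # Missing values treatment (assumed strategy): impute the median
    # of the observed incomes and flag which rows were imputed.
    observed = [r["income"] for r in rows if r["income"] is not None]
    fill = statistics.median(observed)

    for r in rows:
        # Data correction: normalise inconsistently cased state codes.
        r["state"] = r["state"].upper()
        r["income_was_missing"] = r["income"] is None
        if r["income"] is None:
            r["income"] = fill
        # Data transformation: birth date to age.
        r["age"] = age_on(r["birth_date"], as_of)
        # Derive a new field (the threshold is arbitrary).
        r["high_income"] = r["income"] >= 50000
    return rows


cleaned = clean(raw, as_of=date(2024, 1, 1))

# Sampling: draw a reproducible random sample of two records.
sampled = random.Random(42).sample(cleaned, 2)
```

Whether to impute, drop, or flag missing values depends on the analysis the data is being prepared for, so the median fill here is just one option.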
Useful steps:
Understand the business problem.
Collect the materials about the data sources and study them to understand what data is available.
Identify the data items relevant to the business problem, e.g., tables and attributes.
Make a data extraction plan and arrange the data extraction (with DBAs).
Calculate the summary statistics of the extracted data.
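As a rough illustration of the last step, summary statistics for one extracted numeric field could be computed as follows. The `summarise` helper and the sample values are hypothetical, and missing values are assumed to be represented as `None`.

```python
import statistics


def summarise(values):
    """Summarise one numeric field, counting None as missing."""
    present = [v for v in values if v is not None]
    summary = {"count": len(values), "missing": len(values) - len(present)}
    if present:
        summary.update(
            min=min(present),
            max=max(present),
            mean=statistics.mean(present),
            median=statistics.median(present),
        )
    if len(present) > 1:
        # Sample standard deviation needs at least two observations.
        summary["stdev"] = statistics.stdev(present)
    return summary


# Made-up values standing in for an extracted income field.
incomes = [52000.0, None, 61000.0, 38000.0]
stats = summarise(incomes)
```

Reporting the missing count alongside the usual statistics gives an early indication of how much cleaning the field is likely to need.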