Data Mining Survivor: Loading_Data

DATA MINING
Desktop Survival Guide
by Graham Williams

CSV File Option

The CSV File option of the Data tab is an easy way to load data into Rattle. CSV stands for "comma separated value" and is a standard file format often used to exchange data between applications. CSV files can be exported from spreadsheets and databases, including MS/Excel, Gnumeric, SAS/Enterprise Miner, Teradata's Warehouse, and many other applications. This is a good option for importing your data into Rattle, although it does lose the metadata (that is, information about the data types of the dataset). Without this metadata R sometimes guesses the wrong data type for a particular column, but this isn't usually fatal!

A CSV file is actually a normal text file that you could load into any text editor to review its contents. A CSV file usually begins with a header row, listing the names of the variables, each separated by a comma. If any name (or indeed, any value in the file) contains an embedded comma, then that name (or value) will be surrounded by quote marks. The remainder of the file after the header is expected to consist of rows of data, one row per entity, with the fields in each row (generally separated by commas) recording the values of the variables for that entity.
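As a small illustration of the layout just described, the following sketch reads a CSV document held in a string rather than in a file. The column names and values are invented for the example, and it calls R's read.csv() directly rather than going through Rattle.

```r
# A tiny CSV document held in a string; the names and values are made up.
csv_text <- 'Name,Occupation,Income
"Smith, John",Clerical,52000
"Lee, Ann","Sales, Retail",48000'

# read.csv() treats the first row as the header and honours the quote marks,
# so the embedded commas do not split the fields.
people <- read.csv(text = csv_text, stringsAsFactors = FALSE)

print(names(people))    # the variable names taken from the header row
print(nrow(people))     # the number of data rows after the header
print(people$Name[1])   # a quoted value, with its embedded comma intact
```

Without the quote marks, the first data row would appear to have four fields rather than three, and the file would no longer line up with its header.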

To make a CSV file known to Rattle we click the Filename button. A file chooser dialog will pop up. We can use this to browse our file system to find the file we wish to load into Rattle.

By default, only files that have a .csv extension will be listed (together with folders). The pop up includes a pull down menu near the bottom right, above the Open button, to allow you to select which files are listed. You can list only files that end with .csv, only files that end with .txt, or all files. The .txt files are similar to CSV files but tend to use a tab, rather than a comma, to separate columns in the data.

The window on the left of the pop up allows us to browse to the different file systems available to us, while the series of boxes at the top lets us navigate through a series of folders on a single file system. Once we have navigated to the folder in which we have saved the audit.csv file, we can select this file in the main panel of the file chooser dialog. Then click the Open button to tell Rattle that this is the file we are interested in.

Notice that the textview of the Data tab has changed to give a reminder as to what we need to do next.

[Figure: rattle-data-csv-file-selected]

We note that we have not yet told Rattle to actually load the data: we have just identified where the data is. So we now click the Execute button (or press the F5 key) to load the dataset from the audit.csv file. Since Rattle is a simple graphical interface sitting on top of R itself, the message in the textview also reminds us that some errors encountered by R on loading the data (and in fact during any operation performed by Rattle) may be displayed in the R Console.

The contents of the textview of the Data tab have now changed again.

[Figure: rattle-audit-data]

The panel contains a brief summary of the dataset. We have loaded 2,000 entities (called observations in R), each described by 13 variables. The data type, and the first few values, of each variable are also displayed. We can start getting an idea of the shape of the data, noting that Adjusted, for example, looks like it might be a categorical variable, with values 0 and 1, but R identifies it as an integer! That's fine.
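A summary of a similar kind can be produced at the R Console with str(). The sketch below is only an illustration: it uses a made-up four-row stand-in rather than the real audit.csv, and apart from Adjusted the column names are assumptions. It also shows one way to turn the integer column into a categorical one, should that be wanted.

```r
# A made-up stand-in for a few rows of a dataset like audit.csv.
csv_text <- "Age,Employment,Adjusted
38,Private,0
35,Private,0
32,Consultant,1
45,Private,1"

audit_sample <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# str() reports the number of observations and variables, then the type
# and the first few values of each variable.
str(audit_sample)

# Adjusted holds only 0 and 1, so R reads it as an integer.
print(class(audit_sample$Adjusted))

# If a categorical treatment is wanted, convert the column explicitly.
audit_sample$Adjusted <- as.factor(audit_sample$Adjusted)
print(levels(audit_sample$Adjusted))
```

The conversion is optional. Leaving the column as an integer does no harm at the loading stage.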

You can choose the field delimiter through the Separator entry. A comma is the default. To load a .txt file which uses a tab as the field separator, enter \t as the separator. You can also leave the separator empty and any white space will be used as the separator.
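The effect of the separator can be seen directly in R. This sketch assumes that the Separator entry behaves like the sep argument of read.table(); that mapping is an assumption made for illustration, and the data values are invented.

```r
# Tab-separated text, as a .txt export might contain (values are made up).
tab_text <- "ID\tAge\tIncome\n1\t38\t81838\n2\t35\t72099"
tabbed <- read.table(text = tab_text, header = TRUE, sep = "\t")

# The same values separated by irregular white space. An empty sep tells
# read.table() to split on any run of spaces or tabs.
space_text <- "ID Age   Income\n1  38 81838\n2 35\t72099"
spaced <- read.table(text = space_text, header = TRUE, sep = "")

print(tabbed)
print(spaced)
```

Both calls should yield a data frame with two observations of three variables. The white space form is the more forgiving of the two, but it cannot cope with values that themselves contain unquoted spaces.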

Any field that is empty, or that holds the value "NA" or ".", is treated as a missing value, which R represents as NA (a special missing-value marker rather than a string). Support for the "." convention allows the importation of CSV data generated by SAS.
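In plain R, this treatment of missing values can be approximated with the na.strings argument of read.csv(). The vector of strings used below is an assumption chosen to match the three cases just described, not necessarily the exact setting Rattle uses, and the data are invented.

```r
# Made-up data with three kinds of missing entry: an empty field,
# the text NA, and the "." that SAS writes.
csv_text <- "ID,Age,Income
1,38,81838
2,,72099
3,NA,.
4,.,154676"

# na.strings lists the field contents that should be read as missing.
dat <- read.csv(text = csv_text, na.strings = c("", "NA", "."))

print(is.na(dat$Age))        # which Age values are missing
print(is.numeric(dat$Income)) # the "." did not force the column to text
print(sum(is.na(dat)))       # total number of missing values
```

Without "." in na.strings, R would read a column containing it as text rather than as numbers, which is one of the wrong type guesses that can follow from losing the metadata.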

Brought to you by Togaware.