Togaware DATA MINING
Desktop Survival Guide
by Graham Williams


Variable Roles

The Variables tab is used to identify the role played by each of the variables in the dataset.

[Figure: rattle-audit-variables]

Variables can be inputs to modelling, the target variable for modelling, the risk variable, an identifier, or an ignored variable. The default role for most variables is that of an Input variable. Generally, these are the variables that will be used to predict the value of a Target variable.

Rattle uses simple heuristics to guess which variable should take the Target role. Here we see that Adjusted has been selected as the target variable, which in this instance is correct. The heuristic examines the number of distinct values that each variable has: a variable with fewer than 5 distinct values is considered a candidate. Candidates are checked starting with the last variable (often the last variable is the target) and then proceeding from the first variable onwards, and the first variable that meets the condition is chosen as the target.
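The heuristic described above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not Rattle's own code: the function name guess_target, the max_distinct parameter, and the sample data (apart from the variable name Adjusted) are assumptions made for the example.

```python
def guess_target(columns, max_distinct=5):
    """Return the first variable that looks like a target, or None.

    columns: dict mapping variable name -> list of values,
    in dataset column order (hypothetical representation).
    """
    names = list(columns)
    if not names:
        return None
    # Check the last variable first, then the rest from the first onwards.
    order = [names[-1]] + names[:-1]
    for name in order:
        # A candidate has fewer than max_distinct distinct values.
        if len(set(columns[name])) < max_distinct:
            return name
    return None


# Made-up sample data; only the name "Adjusted" comes from the text.
data = {
    "ID": [1, 2, 3, 4, 5, 6],
    "Age": [38, 35, 32, 45, 60, 74],
    "Gender": ["F", "M", "M", "M", "F", "M"],
    "Adjusted": [0, 0, 1, 0, 1, 0],
}
print(guess_target(data))
```

With this sample the last variable, Adjusted, has only two distinct values, so it is chosen before any other variable is examined.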

Any numeric variable that has a unique value for each record is automatically identified as an Ident. Any number of variables can be tagged as an Ident. All Ident variables are ignored when modelling, but they are used after scoring a dataset: they are written to the resulting score file so that the scored cases can be identified.
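As a rough sketch of the Ident rule, assuming the dataset is held as a mapping from variable names to lists of values, the check might look like the following. The function name find_idents and the sample data are hypothetical, and Rattle's actual implementation may differ.

```python
from numbers import Number


def find_idents(columns):
    """Return the names of numeric variables with a unique value per record."""
    idents = []
    for name, values in columns.items():
        # Treat a variable as numeric only if every value is a number
        # (booleans are excluded, since Python counts them as numbers).
        numeric = all(
            isinstance(v, Number) and not isinstance(v, bool) for v in values
        )
        # Unique per record: as many distinct values as there are records.
        if values and numeric and len(set(values)) == len(values):
            idents.append(name)
    return idents


# Made-up sample data for illustration.
data = {
    "ID": [1001, 1002, 1003, 1004, 1005, 1006],
    "Age": [38, 35, 38, 45, 60, 35],
    "Gender": ["F", "M", "M", "M", "F", "M"],
    "Adjusted": [0, 0, 1, 0, 1, 0],
}
print(find_idents(data))
```

Note that under this rule a numeric measurement that happens to have no repeated values in a small sample would also be flagged, which is one reason the role can be changed by hand.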

Not every variable in your dataset should necessarily be used, and some may not be appropriate for a particular modelling task. For example, the random forest model builder does not handle categorical variables with more than 32 levels, so you may choose to Ignore Accounts. You can change the role of any variable to suit your needs, although you can have only one Target and one Risk.
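One way to spot variables such as Accounts before building a random forest is to count the levels of each categorical variable. The sketch below assumes categorical variables are stored as lists of strings; the function name too_many_levels and the sample values are made up for illustration.

```python
def too_many_levels(columns, max_levels=32):
    """Return the names of categorical variables with more than max_levels levels."""
    flagged = []
    for name, values in columns.items():
        # Assumption for this sketch: a variable is categorical
        # when all of its values are strings.
        is_categorical = all(isinstance(v, str) for v in values)
        if values and is_categorical and len(set(values)) > max_levels:
            flagged.append(name)
    return flagged


# Made-up sample data: 40 records, with 40 distinct Accounts values.
data = {
    "Accounts": ["acct%d" % i for i in range(40)],
    "Gender": ["F", "M"] * 20,
    "Age": list(range(40)),
}
print(too_many_levels(data))
```

Variables returned by such a check are candidates for the Ignore role when the chosen model builder cannot handle them.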

Special variable names can be used with data imported into Rattle (and in fact with any data used by Rattle) to identify a variable's role. Any variable with a name beginning with IGNORE_ will have the default role of Ignore. Similarly, names beginning with RISK_ and TARGET_ default to the Risk and Target roles. Any variable with a name beginning with IMP_ is assumed to be an imputed variable, and if there exists a variable with the same name, but without the IMP_ prefix, that variable will be marked as Ignore.
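The naming convention can be illustrated with a short Python sketch. The function name default_roles and the sample variable names are hypothetical, and treating every other variable (including the IMP_ variable itself) as an Input is an assumption based on Input being the usual default role.

```python
def default_roles(names):
    """Assign a default role to each variable name based on its prefix."""
    roles = {}
    for name in names:
        if name.startswith("IGNORE_"):
            roles[name] = "Ignore"
        elif name.startswith("RISK_"):
            roles[name] = "Risk"
        elif name.startswith("TARGET_"):
            roles[name] = "Target"
        else:
            # Assumed default for everything else, IMP_ variables included.
            roles[name] = "Input"
    # Second pass: an imputed variable replaces its original,
    # so the variable without the IMP_ prefix is marked Ignore.
    for name in names:
        if name.startswith("IMP_"):
            original = name[len("IMP_"):]
            if original in roles:
                roles[original] = "Ignore"
    return roles


# Made-up variable names for illustration.
names = ["IGNORE_Notes", "RISK_Adjustment", "TARGET_Adjusted",
         "Age", "IMP_Age", "Income"]
for name, role in default_roles(names).items():
    print(name, role)
```

Here Age is marked Ignore because IMP_Age is present, while Income, which has no imputed counterpart, keeps the Input role.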

For any changes you make on the Variables tab to take effect, click the Execute button.

For an example of the use of the Risk variable, see Section [*].

Copyright © 2004-2006 Graham.Williams@togaware.com