DATA MINING
Desktop Survival Guide
by Graham Williams

Nomenclature

Data miners use a plethora of terms for many of the same things, primarily because data mining has its roots in many disciplines. Throughout this book we will use a single, consistent nomenclature that is generally accepted.

We refer to collections of data as datasets. This might be a matrix or a database table. A dataset consists of rows which we might refer to as entities, and those entities are described in terms of variables which form the columns. Synonyms for entity include record and object, while synonyms for variable include attribute and feature.
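As a minimal sketch of this layout, a dataset can be represented in plain Python as a list of rows, one per entity, with each row keyed by variable name. The variable names and values here (age, income, risk) are made up purely for illustration.

```python
# A tiny, made-up dataset: each dict is one entity (a row),
# and each key is one variable (a column).
dataset = [
    {"age": 34, "income": 52000.0, "risk": "low"},
    {"age": 51, "income": 87000.0, "risk": "medium"},
    {"age": 23, "income": 31000.0, "risk": "high"},
]

# The variables are the column names shared by every entity.
variables = list(dataset[0].keys())

# The number of entities is the number of rows.
n_entities = len(dataset)

# One column: the values of a single variable across all entities.
age_column = [entity["age"] for entity in dataset]
```

A matrix or a database table carries the same information; only the storage differs.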

Variables can serve one of two roles: as input variables or output variables (Hastie, Tibshirani, et al., 2001). Input variables are measured or preset data items, while output variables are those that are "influenced" by the input variables. Often in data mining we build models to predict the output variables in terms of the input variables. Input variables are also known as predictors, independent variables, observed variables and descriptive variables. Output variables are also known as response and dependent variables.
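The two roles can be sketched by splitting a small dataset into its input part and its output part before any modelling. The dataset, the variable names, and the choice of risk as the output variable are all assumptions made for illustration.

```python
# Illustrative dataset: one dict per entity, one key per variable.
dataset = [
    {"age": 34, "income": 52000.0, "risk": "low"},
    {"age": 51, "income": 87000.0, "risk": "medium"},
    {"age": 23, "income": 31000.0, "risk": "high"},
]

# Assumed roles: age and income are inputs, risk is the output.
input_variables = ["age", "income"]
output_variable = "risk"

# Inputs: one list of input values per entity.
inputs = [[entity[name] for name in input_variables] for entity in dataset]

# Outputs: the single value a model would try to predict for each entity.
outputs = [entity[output_variable] for entity in dataset]
```

A predictive model would then be built to estimate each element of outputs from the corresponding row of inputs.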

A categorical variable takes on a value from a fixed set of values (e.g., low, medium, and high), while a numeric variable has values that are integers or real numbers. Synonyms for categorical variable include nominal variable, qualitative variable and factor, while synonyms for numeric variable include quantitative variable and continuous variable.
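The distinction can be illustrated with a small helper that labels a column by inspecting its values. This is a simplified sketch: the function name variable_type is hypothetical, and real tools often need more care (for example, integer codes that actually represent categories).

```python
from numbers import Real


def variable_type(values):
    """Label a column 'numeric' if every value is an integer or real
    number, and 'categorical' otherwise. Booleans are not treated as numbers."""
    if all(isinstance(v, Real) and not isinstance(v, bool) for v in values):
        return "numeric"
    return "categorical"


# Illustrative columns.
income = [52000.0, 87000.0, 31000.0]
risk = ["low", "medium", "high", "low"]

income_type = variable_type(income)
risk_type = variable_type(risk)

# The fixed set of values a categorical variable draws from.
risk_levels = sorted(set(risk))
```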

Thus, we will talk of datasets consisting of entities described using variables, which might consist of a mixture of input variables and output variables, either of which may be categorical or numeric.

Brought to you by Togaware.