DATA MINING
Desktop Survival Guide
by Graham Williams

Example

Suppose we have a database of patients for whom we have, in the past, identified suitable types of contact lenses. Four categorical variables might describe each patient: Age (having possible values young, pre-presbyopic, and presbyopic), type of Prescription (myope and hypermetrope), whether the patient is Astigmatic (boolean), and TearRate (reduced or normal). The patients are already classified into one of three classes indicating the type of contact lens to fit: hard, soft, none.

Sample data is presented in the following table.

**Table:** Contact lens training data.

We can use this historic data to estimate the probabilities of requiring soft, hard, or no lenses, according to a patient's Age, Prescription, Astigmatic condition and TearRate. With a probabilistic model we could then predict the suitability of particular lens types for new patients. Thus, if we know historically that the probability of a patient requiring a soft contact lens, given that they are young and have a normal rate of tear production, is 0.9 (i.e., 90%), then we would supply soft lenses to most such patients in the future. Symbolically we write this probability as:

$$P(soft \mid young, normal) = 0.9$$
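As a minimal sketch of how such a probability might be estimated from historic data, the conditional probability can be computed as a relative frequency: the number of young patients with a normal tear rate who were fitted with soft lenses, divided by the number of all young patients with a normal tear rate. The records below are hypothetical and made up for illustration (they are not the training data from the table), and the function name `conditional_probability` is an assumption of this example.

```python
from fractions import Fraction

# Hypothetical records: (age, prescription, astigmatic, tear_rate, lens).
# These rows are invented for illustration; they are NOT the book's table.
RECORDS = [
    ("young", "myope", "no", "normal", "soft"),
    ("young", "myope", "yes", "normal", "hard"),
    ("young", "hypermetrope", "no", "normal", "soft"),
    ("young", "hypermetrope", "yes", "normal", "hard"),
    ("young", "myope", "no", "reduced", "none"),
    ("presbyopic", "myope", "no", "normal", "none"),
]


def conditional_probability(records, lens, age, tear_rate):
    """Estimate P(lens | age, tear_rate) as a relative frequency.

    Returns None when no record matches the condition, since the
    estimate is then undefined.
    """
    matching = [r for r in records if r[0] == age and r[3] == tear_rate]
    if not matching:
        return None
    hits = sum(1 for r in matching if r[4] == lens)
    return Fraction(hits, len(matching))


p_soft = conditional_probability(RECORDS, "soft", "young", "normal")
print(p_soft)  # 1/2 for the hypothetical records above
```

With these invented records, four patients are young with a normal tear rate and two of them were fitted with soft lenses, so the estimate is 1/2 rather than the 0.9 used in the text.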

Generally the real problem is to determine something like:

$$P(soft \mid young, myope, no, normal)$$

To do this we would need a sufficient number of examples of $young, myope, no, normal$ patients in the database. For databases with many variables, even very large ones, we are unlikely to have sufficient (or even any) examples of every possible combination of variable values. This is where naïve Bayes helps.
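The sparsity problem can be illustrated with a short sketch. Using the variable values listed earlier, there are 3 × 2 × 2 × 2 = 24 possible combinations of Age, Prescription, Astigmatic and TearRate. The observed patients below are hypothetical, invented only to show how a small sample leaves most combinations with no examples at all.

```python
from itertools import product

# Possible values of each variable, as described in the text.
AGE = ["young", "pre-presbyopic", "presbyopic"]
PRESCRIPTION = ["myope", "hypermetrope"]
ASTIGMATIC = ["no", "yes"]
TEAR_RATE = ["reduced", "normal"]

# Every possible combination of the four variables.
all_combinations = set(product(AGE, PRESCRIPTION, ASTIGMATIC, TEAR_RATE))

# Hypothetical observed patients (invented for illustration only).
observed = [
    ("young", "myope", "no", "normal"),
    ("young", "myope", "no", "normal"),
    ("young", "hypermetrope", "yes", "reduced"),
    ("pre-presbyopic", "myope", "yes", "normal"),
    ("presbyopic", "hypermetrope", "no", "reduced"),
    ("presbyopic", "myope", "no", "normal"),
]

# Combinations for which the database holds no example at all.
unseen = all_combinations - set(observed)

print(len(all_combinations))  # 24
print(len(unseen))            # 19
```

Each additional variable multiplies the number of combinations, so the share of combinations with no examples tends to grow quickly as variables are added.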
