Togaware DATA MINING
Desktop Survival Guide
by Graham Williams
Google

Algorithm

The basis of naïve Bayes is Bayes theorem. Essentially we want to classify a new record based on probabilities estimated from the training data. That is, we want to determine the probability of a hypothesis $h$ (or specifically class membership) given the training data $D$ (i.e., $P(h\vert D)$) Bayes theorem gives us the clue to calculating this value:


\begin{displaymath}P(h\vert D) = {{P(D\vert h)P(h)}\over{P(D)}} \end{displaymath}

Ignoring the detail we can work with $P(D\vert h)P(h)$ to identify a hypothesis that best matches our data (since $P(D)$ is constant for a particular $D$). Both of these quantities we can estimate from the training data $D$. $P(h)$ is simply the proportion of the database consistent with hypothesis $h$ (i.e., the proportion of entities with class $c_i$--$c_i$ is regarded as the hypothesis).

Calculation of $P(D\vert h)$ still poses some challenges and this is where the naïve part of naïve Bayes comes in. The naïve assumption is that the variables (each record $D$ is describe by variables $a_1,a_2,\ldots,a_n$) are conditionally independent given the class (i.e., the hypothesis that the classification is $c_j$, which could be the class soft in our example data) of the record. That is, given that a patient has a soft contact lens, to use our example, the probability of the patient being all of young, myope, non-astigmatic and with a normal-tear-rate is the same as the product of the individual probabilities. The naïve assumption, more concretely, is that being astigmatic or not, for example, does not affect the relationship between being young given the use of soft lenses. Mathematically we write this as:


\begin{displaymath}P(a_1,a_2,\ldots,a_n\vert c_j) = \prod_i{P(a_i\vert c_j)}\end{displaymath}

Empirically determining the values of the joint probability on the left of this equation is a problem. To estimate it from the data we need to have available in the data every possible combination of values and examples of their classification so we can then use these frequencies to estimate the probabilities. This is usually not feasible. However, the collection of probabilities on the right poses little difficulty. We can easily obtain the estimates of these probabilities by counting their occurrence in the database.

For example, we can count the number of patients with Age being young and belonging to class soft (perhaps there are only two) and divide this by the number of patients overall belonging to class soft (perhaps there are five). The resulting estimate of the probability (i.e., $P(young\vert soft)$) is $0.4$.

The Naïve Bayes algorithm is then quite simple. From the training data we estimate the probability of each class, $P(c_j)$, by the proportions exhibited in the database. Similarly the probabilities of each variable's value, given a particular class ($P(a_i\vert c_j)$), is simply the proportion of those training entities with that class having the particular variable value.

Now to place a new record ($d$) into a class we simply assign it to the class with the highest probability. That is, we choose the $c_j$ which maximises:

\begin{displaymath}P(c_j)\prod_{a_i\in d}P(a_i\vert c_j) \end{displaymath}

Copyright © 2004-2006 Graham.Williams@togaware.com
Support further development through the purchase of the PDF version of the book.
Brought to you by Togaware.