DATA MINING
Desktop Survival Guide
by Graham Williams

Resources and Further Reading

[#!freund.schapire:1995:decis_theor!#] introduced AdaBoost, popularising the idea of ensemble learning where a committee of models cooperate to deliver a better outcome. For a demonstration of AdaBoost visit http://www1.cs.columbia.edu/~freund/adaboost/.

The original formulation of the algorithm, as in [#!freund.schapire:1995:decis_theor!#], adjusts all weights each iteration. Weights are increased if the corresponding record is misclassified by $\mathcal{M}_i$ or decreased if it is correctly classified by $\mathcal{M}_i$ . The weights are then further normalised each iteration to ensure they continue to represent a distribution (so that $\sum_{j=1}^{n}{w_j}=1$ ). This can be simplified, as by hastie.tibshirani.etal:2001:stats_learn, to only increase the weights of the misclassified entities. We use this simpler formulation in the above description of the algorithm. Consequently, the calculation of $\epsilon_i$ (line 9) includes the sum of the weights as a denominator (which is 1 in the original formulation). Only the weights associated with the misclassified entities are modified in line 11. The original algorithm modified all weights by $e^{-\alpha_i y_i \mathcal{M}_i(x_j)}$ which equates to $e^{\alpha_i}$ for misclassified entities (since either or $\mathcal{M}_i(x_j)$ is -1, but not both) and to $e^{-\alpha_i}$ for correctly classified entities (since both or $\mathcal{M}_i(x_j)$ are either 1 or -1). For each iteration the new weights in the original algorithm are normalised by dividing each weight by a calculated factor .

Cost functions other than the exponential loss criterion $e^{-m}$ have been proposed. These include the logistic log-likelihood criterion $\log(1+\exp(-m))$ used in LogitBoost), $1-\tanh(m)$ (Doom II) and (Support Vector Machines).

BrownBoost addresses the issue of the sensitivity of AdaBoost to noise.

We note that if each weak classifier is always better than chance, then AdaBoost can be proven to converge to a perfectly accurate model (no training error). Also note that even after achieving an ensemble model with no error, as we add in new models to the ensemble the generalisation error continues to improve (the margin continues to grow). Although it was thought, at first, that AdaBoost does not overfit the data, it has since been shown that it can. However, it generally does not, even for large numbers of iterations.

Extensions to AdaBoost include multi-class classification, application to regression (by transforming the problem into a binary classification task), and localised boosting which is similar to mixtures of experts.

Some early practical work on boosting was undertaken with the Australian Taxation Office using boosted stumps. Multiple, simple models of tax compliance were produced. The models were easily and independently interpretable. Effectively, the models identified a collection of factors that in combination were useful in predicting compliance.

Support further development through the purchase of the PDF version of the book.
Brought to you by Togaware.