This is a short review of Ensemble Methods in Data Mining, by G. Seni and J. Elder (Morgan & Claypool, 2010). I won’t go over the whole textbook, but rather summarize the introductory chapter, which provides a nice overview of how ensemble methods work and why they are interesting from a predictive perspective.
As is well known, there is a large variety of data mining algorithms whose predictive accuracy depends on the problem at hand. Choosing the right algorithm (we might even call this model selection, if we were to test several of them against one another) is then a matter of knowledge. However, “there is a way to improve model accuracy that is easier and more powerful than judicious algorithm selection: one can gather models into ensembles.”
Useful references include:
and some papers by Leo Breiman that I listed in another post.
The essential steps in building ensemble models are (1) constructing varied models, and (2) combining their estimates. The first step can be seen as a way to introduce some kind of variety, or fluctuation, into the model outputs, while the second step avoids relying on the decision of a single weak classifier by averaging over a collection of such decisions. Note that we do not need to use the same model: different kinds of classifiers may be combined. Here is another, more complete definition, by Nisbet et al.,(†) pp. 306-307:
To build ensembles, you need to construct varied (and accurate enough) models and combine their estimates. There are five ways to generate variability, by modifying
- Case weights, as with boosting (real-valued weights) and bagging (integer weights);
- Case values—adding noise to the input or output variable values;
- Variable subsets (as with random forests, where each tree being built only considers for splitting a random subset—say, 10%—of the potential candidate inputs at any one node);
- Guiding parameters (such as running an algorithm with different settings);
- Modeling technique (e.g., tree, regression, MARS, neural network).
There are three ways to combine estimates, by using
- Estimator weights (perhaps based on estimated or training accuracy);
- Voting (when the problem is predicting the best class);
- Partitions of the design space (as with model gating, where different models take control depending on which input space region we are in).
In our estimate, the most common combination used is cross-validation and averaging, especially with decision trees. That is, building a model of one type on V different overlapping folds of the data and then averaging the resulting estimates. In our experience this (and other variations) are much more likely to help, rather than hurt, performance.
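To make the two steps concrete, here is a minimal Python sketch of my own (not from the book): the variability comes from bootstrap resampling, which is roughly the “integer case weights” route of bagging, and the estimates are then combined by a plain average. The dataset, tree depth, and number of models are arbitrary choices for illustration.

```python
# Step 1: build varied models by fitting the same learner on different
# bootstrap samples of the data. Step 2: combine their estimates by averaging.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
rng = np.random.default_rng(0)

models = []
for _ in range(25):                        # 25 varied models
    idx = rng.integers(0, len(X), len(X))  # bootstrap resample (integer case weights)
    tree = DecisionTreeRegressor(max_depth=4).fit(X[idx], y[idx])
    models.append(tree)

# Combine the estimates by averaging the individual predictions
ensemble_pred = np.mean([m.predict(X) for m in models], axis=0)
```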
The authors cite illustrative examples of ensemble methods: Bayesian model averaging, which sums the estimates of possible models weighted by their posterior evidence; Bagging [1], where we bootstrap the training data set and then take a majority vote or the average of the model estimates; Random Forest [2,3], which adds a stochastic component to create more “diversity” among the trees being combined; AdaBoost [4] and ARCing [1], where models are built iteratively by varying case weights (up-weighting cases with large current errors and down-weighting those accurately estimated) and a weighted sum of the estimates of the sequence of models is used; and Gradient Boosting [5,6], which extends the AdaBoost algorithm to a variety of error functions for regression and classification. Most of these algorithms are well explained in The Elements of Statistical Learning, by Hastie et al. Less well-known algorithms are the Group Method of Data Handling (GMDH) [7] and its successor, Polynomial Networks [8,9], where multiple layers of moderate-order polynomials, fitted by linear regression, are built on different variable sets to introduce variety. An early related method, Stacking, was proposed by Wolpert [10].
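Several of these methods are available off the shelf in scikit-learn; the short snippet below (my own toy comparison, not an experiment from the book) simply runs Bagging, Random Forests, AdaBoost, and Gradient Boosting through 5-fold cross-validation on a synthetic classification problem.

```python
# Compare a few of the cited ensemble methods on a toy classification task.
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

for clf in (BaggingClassifier(random_state=0),
            RandomForestClassifier(random_state=0),
            AdaBoostClassifier(random_state=0),
            GradientBoostingClassifier(random_state=0)):
    scores = cross_val_score(clf, X, y, cv=5)   # 5-fold cross-validated accuracy
    print(f"{clf.__class__.__name__}: {scores.mean():.3f}")
```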
Another important aspect of model inference is regularization. In general it is advisable to favor both accuracy (of predictions) and simplicity (of the model), although there is a tradeoff between the two: a flexible model will yield higher accuracy, but at the expense of possibly overfitting the data and hence generalizing poorly. Regularization offers a solution by penalizing model complexity, typically by adding to the error function a term that grows with the size or number of the model’s coefficients. Common examples of such regularization techniques are the Lasso [11], which inherits from Breiman’s work on the nonnegative Garrote estimator [12], and the LARS algorithm [13]. Another is the Path Seeker algorithm [14], which allows a Lasso-type penalty to be used with other loss functions. Likewise, the Generalized Path Seeker algorithm [15] extends the Elastic Net criterion [16] for producing sparse solutions to non-convex problems.
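To see the kind of simplicity such penalties buy, here is a small illustration of my own (not from the book): as the Lasso’s regularization weight grows, more and more coefficients are set exactly to zero, trading fit for sparsity.

```python
# Toy illustration of L1 regularization: the Lasso minimizes
# ||y - Xb||^2 + alpha * ||b||_1, and larger alpha zeroes out more coefficients.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 50 candidate features, only 5 of which actually carry signal
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=5.0, random_state=0)

for alpha in (0.01, 0.1, 1.0, 10.0):
    lasso = Lasso(alpha=alpha).fit(X, y)
    n_nonzero = np.sum(lasso.coef_ != 0)
    print(f"alpha={alpha:>5}: {n_nonzero} nonzero coefficients")
```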
The conclusion across studies in this field of research is that, generally, “bundling competing models into ensembles improves generalization.” Generalized Degrees of Freedom [17], an empirical measure of the flexibility of a model, makes it possible to derive a measure of complexity showing that ensembles are simpler than the sum of their components. (More to read in Chapter 6.)
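For the curious, here is a rough Monte-Carlo sketch of the idea behind Generalized Degrees of Freedom as I understand it from Ye’s work: perturb the targets with small noise, refit, and measure how closely the fitted values follow the perturbation. Flexible procedures track the noise more and so get a larger GDF. This is only an illustration of the general principle, not the procedure used in the book.

```python
# Empirical flexibility of a modeling procedure via target perturbation.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
tau, n_reps = 1.0, 50
rng = np.random.default_rng(0)

base_fit = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y).predict(X)
sensitivity = 0.0
for _ in range(n_reps):
    delta = rng.normal(0.0, tau, size=len(y))      # small perturbation of the targets
    perturbed_fit = (DecisionTreeRegressor(max_depth=4, random_state=0)
                     .fit(X, y + delta).predict(X))
    # covariance between perturbation and change in fitted values, scaled by tau^2
    sensitivity += np.sum(delta * (perturbed_fit - base_fit)) / tau**2

gdf_estimate = sensitivity / n_reps
print(f"estimated flexibility (GDF-style): {gdf_estimate:.1f}")
```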