More on bagging

March 20, 2011

The other day, I was reading some entries in the new Encyclopedia of Machine Learning (Springer, 2011), and I stumbled upon the one on bagging (itself filed under the “ensemble learning” tag), which provides a very clean overview of bagging and ensemble learning methods.

If you don’t know the principle of this handbook, it basically aims to offer a thorough overview of ML and AI-related techniques. Curiously, I didn’t find a single definition of Machine Learning per se. Ensemble learning techniques (PDF) were nicely reviewed by Gavin Brown. Another instructive paper by the same author is: Managing Diversity in Regression Ensembles, JMLR (2005) 6: 1621-1650.

Applied to a regression or classification context, the general idea of bagging, or more generally of ensemble methods, is rather simple:

Ensemble learning refers to the procedures employed to train multiple learning machines and combine their outputs, treating them as a “committee” of decision makers. The principle is that the decision of the committee, with individual predictions combined appropriately, should have better overall accuracy, on average, than any individual committee member.

At a more abstract level,

The study of ensemble methods, with model outputs considered for their abstract properties rather than the specifics of the algorithm which produced them, allows for a wide impact across many fields of study. If we can understand precisely why, when, and how particular ensemble methods can be applied successfully, we would have made progress toward a powerful new tool for Machine Learning: the ability to automatically exploit the strengths and weaknesses of different learning systems.
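To make the committee idea concrete, here is a minimal sketch of bagging for regression in plain Python. It is not taken from the encyclopedia entry: the base learner (a one-split regression stump), the toy sine data, and the names `fit_stump`, `bag`, and `predict` are all assumptions chosen for illustration. Each committee member is trained on a bootstrap resample of the data, and the committee prediction is the plain average of the members' predictions.

```python
import math
import random


def fit_stump(xs, ys):
    """Fit a one-split regression stump that minimises squared error."""
    pairs = sorted(zip(xs, ys))
    best = None  # (sse, threshold, left_mean, right_mean)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        ml = sum(left) / len(left)
        mr = sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2, ml, mr)
    if best is None:  # degenerate resample: every x is the same
        m = sum(ys) / len(ys)
        return lambda x: m
    _, thr, ml, mr = best
    return lambda x: ml if x <= thr else mr


def bag(xs, ys, n_models=25, seed=0):
    """Train n_models stumps, each on a bootstrap resample (with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def predict(models, x):
    """Committee prediction: unweighted average of the members' outputs."""
    return sum(m(x) for m in models) / len(models)


# Toy data (hypothetical): a noisy sine curve on [0, 3).
rng = random.Random(42)
xs = [i / 10 for i in range(30)]
ys = [math.sin(x) + rng.gauss(0, 0.2) for x in xs]

models = bag(xs, ys, n_models=25, seed=1)
print(predict(models, 1.5))
```

With an unweighted average and squared-error loss, the committee's error at any given point can never exceed the average error of its members at that point, which is one way to read the "better on average" claim quoted above.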

Several algorithms are then described, including AdaBoost, bagging, boosting, and mixture of experts, before some more theoretical perspectives are discussed, namely “ensemble diversity”, whose central idea is quoted below:

The optimal “diversity” is fundamentally a credit assignment problem. If the committee as a whole makes an erroneous prediction, how much of this error should be attributed to each member? More precisely, how much of the committee prediction is due to the accuracies of the individual models, and how much is due to their interactions when they were combined? We would ideally like to reexpress the ensemble error as two distinct components: a term for the accuracies of the individual models, plus a term for their interactions, i.e., their diversity.
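For squared-error regression ensembles, one well-known decomposition of this kind is the ambiguity decomposition of Krogh and Vedelsby (1995): for a convex combination of predictions, the squared error of the ensemble equals the weighted average squared error of the members minus the weighted spread of the members around the ensemble prediction (the "ambiguity", i.e. the diversity term). The sketch below checks this identity numerically at a single point; the function name `ambiguity_decomposition` and the example numbers are assumptions made up for illustration.

```python
def ambiguity_decomposition(preds, target, weights=None):
    """Return (ensemble_error, average_member_error, ambiguity) at one point.

    preds   : individual model predictions for a single input
    target  : true value for that input
    weights : convex combination weights (default: uniform)
    """
    m = len(preds)
    if weights is None:
        weights = [1.0 / m] * m
    ensemble = sum(w * f for w, f in zip(weights, preds))
    ensemble_error = (ensemble - target) ** 2
    # Accuracy term: weighted average of the members' squared errors.
    average_error = sum(w * (f - target) ** 2 for w, f in zip(weights, preds))
    # Diversity term: weighted spread of the members around the ensemble.
    ambiguity = sum(w * (f - ensemble) ** 2 for w, f in zip(weights, preds))
    return ensemble_error, average_error, ambiguity


# Three hypothetical members predicting 2, 3 and 7 for a true value of 5.
ens_err, avg_err, amb = ambiguity_decomposition([2.0, 3.0, 7.0], target=5.0)
print(ens_err, avg_err - amb)  # the two values agree up to rounding
```

Because the ambiguity term is never negative, the ensemble error is at most the average member error, and the gap grows as the members disagree more. The catch is that pushing members apart usually also changes their individual accuracies, which is why the trade-off has to be managed rather than simply maximised.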

Finally, the following article offers a good review of ten supervised ML techniques (SVM, ANN, Logistic Regression, Naive Bayes, KNN, RF, Decision Trees, Bagged Trees, Boosted Trees, and Boosted Stumps): Caruana, R. and Niculescu-Mizil, A. (2006). An empirical comparison of supervised learning algorithms (PDF). In Proceedings of the 23rd International Conference on Machine Learning (pp. 161-168). New York: ACM.

I was also lucky enough to find Combining Pattern Classifiers: Methods and Algorithms, by Kuncheva (2004).

I then decided to fetch some papers from CiteSeerX; these include:
