< a quantity that can be divided into another a whole number of time />

Statistical learning and regression

March 12, 2011

Here is a short review of Statistical Learning from a Regression Perspective, by Richard A. Berk (Springer, 2008). There already exist some good reviews of this book, including the one by John Maindonald in the Journal of Statistical Software. Also, some answers to the exercises proposed in this book were made available by Brian Kriegler. Richard A. Berk also co-authored a nice paper on Statistical Inference After Model Selection (PDF).

I did not take a lot of notes while reading the book, and I just skimmed over some of the chapters due to a clear lack of time in recent days (tons of papers to read, long overdue MSs to finish, code to profile, and my son to beat at our iPhone games :-)). In my opinion, this book cannot be used as the textbook for a course; I would prefer ESLII by Hastie et al. In fact, Berk refers to it almost all the time in his book. However, I think it is a good complement for someone who already knows ML techniques and wants to read a short overview with emphasis added.

Take-away message

First, let me start with some of the conclusions raised by Richard A. Berk by the end of the book. (I am emphasizing part of these quotes.)


Chapter 1 is an essential read for everyone interested in getting the core idea of regression analysis. Interestingly, a distinction is made between estimation of the parameters of the functional relationship postulated between Y and X, and statistical inference or statistical learning (p. 7 and § 1.3).

Chapter 2 deals with splines and smoothers. After a brief overview of polynomial and natural cubic splines, the author describes B-splines, which allow for a piecewise cubic fit while keeping some nice numerical properties, before presenting penalization methods (shrinkage, ridge regression, lasso, elastic net, Dantzig selector) deriving from the work of Hastie and Zou. I also discovered Rodeo, which stands for Regularization of Derivative Expectation Operator and is related to adaptive smoothing and L1 penalization: the idea is to apply shrinkage to nonparametric regression in order to discard irrelevant predictors. More information can be found in Lafferty, J. and Wasserman, L. (2008). Rodeo: Sparse, Greedy Nonparametric Regression. Annals of Statistics, 36(1), 28-63.
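To see why an L1 penalty can discard predictors while ridge only shrinks them, here is a minimal sketch in Python. It assumes the textbook special case of an orthonormal design, where both estimators reduce to simple transformations of the least-squares coefficients (up to the scaling convention chosen for the penalty). The coefficient values and function names are made up for illustration; this is not code from the book.

```python
def ridge_shrink(beta, lam):
    """Ridge under an orthonormal design: proportional shrinkage."""
    return beta / (1.0 + lam)


def lasso_shrink(beta, lam):
    """Lasso under an orthonormal design: soft-thresholding at lam."""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0  # small coefficients are set exactly to zero


# Hypothetical least-squares coefficients, two large and two small.
ols = [3.0, -1.5, 0.4, -0.1]
lam = 0.5

ridge = [ridge_shrink(b, lam) for b in ols]
lasso = [lasso_shrink(b, lam) for b in ols]

print("ridge:", ridge)  # every coefficient shrunk, none exactly zero
print("lasso:", lasso)  # the two small coefficients are dropped
```

Ridge keeps all four predictors in the model, whereas the lasso removes the two weakest ones, which is the kind of behavior Rodeo tries to carry over to nonparametric regression.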

Then comes an extended overview of smoothing splines (§ 2.4), which do not require that the number and location of knots are known or optimized through some measure of fit. They take the general form

$$ \text{RSS}(f,\lambda)=\sum_{i=1}^N\left[y_i-f(x_i)\right]^2+\lambda\int\left[f''(t)\right]^2dt, $$

where $\lambda$ is a tuning parameter that can be chosen with common model selection procedures, e.g. by minimizing the N-fold (leave-one-out) cross-validation criterion

$$ \text{CV}(\hat f_{\lambda})=\sum_{i=1}^N\left[y_i-\hat f_{\lambda}^{(-i)}(x_i)\right]^2, $$

which amounts to optimizing the tradeoff between bias and variance (in the fitted values). Examples of use can be found in the mgcv package (see the gam() function), and this topic is covered at length in

  1. Hastie, T.J. and Tibshirani, R.J. (1990). Generalized Additive Models. Chapman & Hall
  2. Wood, S. (2006). Generalized Additive Models. An Introduction with R. Chapman & Hall
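The penalized criterion and the cross-validation choice of $\lambda$ can be sketched in a few lines of plain Python. This is not a true smoothing spline: I assume equally spaced $x_i$ and replace the integrated squared second derivative with a sum of squared second differences, a discrete analogue. The fit is then linear in $y$, with smoother matrix $S=(I+\lambda D^\top D)^{-1}$, and the leave-one-out residuals follow from the usual shortcut $(y_i-\hat f_i)/(1-S_{ii})$. All function names, the simulated data and the grid of $\lambda$ values are my own choices for illustration.

```python
import math
import random


def penalty_matrix(n):
    """Return D^T D, where D takes second differences of a length-n vector."""
    P = [[0.0] * n for _ in range(n)]
    for r in range(n - 2):
        idx, coef = (r, r + 1, r + 2), (1.0, -2.0, 1.0)
        for a, ca in zip(idx, coef):
            for b, cb in zip(idx, coef):
                P[a][b] += ca * cb
    return P


def invert(A):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(A)
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [row[n:] for row in M]


def smooth(y, lam):
    """Fitted values and smoother matrix S = (I + lam * D^T D)^{-1}."""
    n = len(y)
    P = penalty_matrix(n)
    A = [[(1.0 if i == j else 0.0) + lam * P[i][j] for j in range(n)]
         for i in range(n)]
    S = invert(A)
    fitted = [sum(S[i][j] * y[j] for j in range(n)) for i in range(n)]
    return fitted, S


def loo_cv(y, lam):
    """Leave-one-out CV score computed from a single fit (requires lam > 0)."""
    fitted, S = smooth(y, lam)
    return sum(((yi - fi) / (1.0 - S[i][i])) ** 2
               for i, (yi, fi) in enumerate(zip(y, fitted)))


# Simulated data: a noisy sine wave on an equally spaced grid.
rng = random.Random(1)
n = 40
x = [i / (n - 1) for i in range(n)]
y = [math.sin(2 * math.pi * xi) + rng.gauss(0, 0.3) for xi in x]

grid = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
scores = {lam: loo_cv(y, lam) for lam in grid}
best_lam = min(scores, key=scores.get)
fitted, S = smooth(y, best_lam)

print("CV scores:", scores)
print("selected lambda:", best_lam)
```

Small values of $\lambda$ nearly interpolate the data (low bias, high variance), while large values pull the fit toward a straight line; the CV score picks a compromise between the two. In practice I would rather rely on gam() or a dedicated smoothing spline routine than on this toy version.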

Locally weighted regression is another approach, built upon nearest neighbor methods instead of the LM framework. A well-known example is the lowess smoother, which also depends on a tuning parameter, namely the bandwidth (i.e., the number or proportion of observations used to fit each local regression line). Again, a generalized CV statistic can be used to select the optimal one.
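As an illustration of how the bandwidth drives the fit, here is a stripped-down local linear smoother in Python. It uses tricube weights on the nearest neighbors, in the spirit of lowess, but I left out the robustness iterations, so it should be read as a sketch of the idea and not as a reimplementation of lowess. The `span` argument (the proportion of observations used for each local fit), the simulated data and the assumption of distinct `x` values are mine.

```python
import math
import random


def local_linear(x, y, x0, span=0.5):
    """Tricube-weighted linear fit at x0 using the nearest span * n points."""
    n = len(x)
    k = max(3, int(round(span * n)))      # number of neighbours used
    dist = [abs(xi - x0) for xi in x]
    h = sorted(dist)[min(k, n) - 1]       # distance to the k-th nearest point
    w = [(1 - (d / h) ** 3) ** 3 if d < h else 0.0 for d in dist]
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return ybar + slope * (x0 - xbar)


# Simulated data: a noisy sine wave.
rng = random.Random(0)
x = [i / 49 for i in range(50)]
y = [math.sin(2 * math.pi * xi) + rng.gauss(0, 0.3) for xi in x]

narrow = [local_linear(x, y, x0, span=0.2) for x0 in x]  # follows the data closely
wide = [local_linear(x, y, x0, span=0.8) for x0 in x]    # much smoother, more biased
```

Plotting `narrow` and `wide` against `x` shows the usual picture: a small span tracks local features at the price of a wigglier curve, and a large span flattens the sine wave.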

In multivariate settings, there also exist several kinds of smoothers, but now the main problems are (a) to find a good measure of neighborhood, and (b) to find a good way of visually assessing the results. Here, the GAM approach highlighted above proves to be the most interesting one. The idea is that instead of smoothing over all predictors jointly (when they are numerous, we might face the curse of dimensionality), we replace the linear terms of the classical GLM framework with a sum of smooth functions, one for each predictor. So the conditional mean can be related to the $X_j$ as in the formula

$$ \bar Y\mid X = \alpha + \sum_{j=1}^p f_j(X_j) $$

where $\alpha$ is the intercept.
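One classical way to fit such an additive model is backfitting, as described by Hastie and Tibshirani: each $f_j$ is re-estimated in turn by smoothing the partial residuals against $X_j$. The sketch below assumes a basic Gaussian kernel smoother with a fixed bandwidth, which is a choice I made for brevity; mgcv relies on penalized regression splines instead. The data are simulated and the function names are hypothetical.

```python
import math
import random


def kernel_smooth(x, r, bandwidth):
    """Gaussian-kernel weighted mean of r, evaluated at each observed x."""
    out = []
    for x0 in x:
        w = [math.exp(-0.5 * ((xi - x0) / bandwidth) ** 2) for xi in x]
        out.append(sum(wi * ri for wi, ri in zip(w, r)) / sum(w))
    return out


def backfit(columns, y, bandwidth=0.3, n_iter=20):
    """Fit y = alpha + sum_j f_j(x_j) by backfitting; columns is a list of predictors."""
    n, p = len(y), len(columns)
    alpha = sum(y) / n
    f = [[0.0] * n for _ in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Partial residuals: remove the intercept and all other components.
            partial = [y[i] - alpha - sum(f[k][i] for k in range(p) if k != j)
                       for i in range(n)]
            fj = kernel_smooth(columns[j], partial, bandwidth)
            mean_fj = sum(fj) / n
            f[j] = [v - mean_fj for v in fj]  # center each f_j for identifiability
    return alpha, f


# Simulated data with two additive, nonlinear effects.
rng = random.Random(42)
n = 80
x1 = [rng.uniform(-2, 2) for _ in range(n)]
x2 = [rng.uniform(-2, 2) for _ in range(n)]
y = [1.0 + math.sin(a) + 0.5 * b * b + rng.gauss(0, 0.2) for a, b in zip(x1, x2)]

alpha, f = backfit([x1, x2], y)
fitted = [alpha + f[0][i] + f[1][i] for i in range(n)]
```

Each `f[j]` can be plotted against its own predictor, which addresses the visualization problem mentioned above: we only ever look at one-dimensional curves.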

Chapters 4 and 5 are devoted to Bagging and Random Forests. The basic ideas are that (1) averaging over a collection of fitted values can help compensate for overfitting, and (2) a large number of fitting attempts can produce very flexible fitting functions able to respond to systematic, but highly localized, features of the data; altogether, (3) this might break the bias-variance tradeoff (pp. 169-170).
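Points (1) and (2) are easy to see on a toy example. The sketch below bags regression stumps (trees with a single split) in plain Python: each stump is a crude step function, but the average over bootstrap samples is a much more flexible curve. It only illustrates bagging, not random forests, since there is a single predictor and hence no random selection of predictors at each split. The data, the number of bootstrap samples and all names are my own assumptions.

```python
import math
import random


def fit_stump(x, y):
    """Best single split on x by squared error: (threshold, left_mean, right_mean)."""
    pairs = sorted(zip(x, y))
    overall = sum(v for _, v in pairs) / len(pairs)
    best = (float("inf"), pairs[0][0], overall, overall)
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no split between tied x values (bootstrap duplicates)
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2, ml, mr)
    return best[1:]


def predict_stump(stump, x0):
    threshold, left_mean, right_mean = stump
    return left_mean if x0 <= threshold else right_mean


def bagged_predict(x, y, grid, n_boot=200, seed=0):
    """Average the predictions of stumps fitted to bootstrap samples."""
    rng = random.Random(seed)
    n = len(x)
    totals = [0.0] * len(grid)
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stump = fit_stump([x[i] for i in idx], [y[i] for i in idx])
        for g, x0 in enumerate(grid):
            totals[g] += predict_stump(stump, x0)
    return [t / n_boot for t in totals]


# Simulated data: a noisy sine wave.
rng = random.Random(3)
x = [rng.random() for _ in range(60)]
y = [math.sin(2 * math.pi * xi) + rng.gauss(0, 0.3) for xi in x]

grid = [i / 20 for i in range(21)]
single_stump = fit_stump(x, y)
single = [predict_stump(single_stump, g) for g in grid]  # two-level step function
bagged = bagged_predict(x, y, grid)                      # many levels after averaging
```

A single stump can only take two values, while the bagged predictions vary gradually over the grid because the split point moves from one bootstrap sample to the next.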

I already wrote some stuff on RFs, see Visualizing what Random Forests really do.
