November 17, 2015

The greatest value of a picture is when it forces us to notice what we never expected to see. John Tukey (1915-2000)

**Readings:** OpenIntro Statistics, 7.1-7.4.

Linear regression allows us to model the relationship between a continuous outcome and one or more variables of interest, also called predictors. Unlike in correlation analysis, the outcome and the predictors play asymmetric roles, and we are usually interested in quantifying the strength of this relationship as well as the amount of variance in the response variable accounted for by the predictors.

In linear regression, we assume a causal effect of the predictor(s) on the outcome. When quantifying the degree of association between two variables, however, both variables play a symmetrical role (there's no outcome or response variable). Moreover, correlation usually does not imply causation.

The (Bravais-)Pearson correlation coefficient provides a unit-less measure of linear co-variation between two numeric variables, unlike the covariance, which depends linearly on the measurement scale of each variable.

A perfect positive (negative) linear correlation would yield a value of \(r=+1\) (\(-1\)).

\[ r=\frac{\overbrace{\sum_{i=1}^n(x_i-\overline{x})(y_i-\overline{y})}^{\text{covariance (x,y)}}}{\underbrace{\sqrt{\sum_{i=1}^n(x_i-\overline{x})^2}}_{\text{std deviation of x}}\underbrace{\sqrt{\sum_{i=1}^n(y_i-\overline{y})^2}}_{\text{std deviation of y}}} \]
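As a quick check of the formula above, the following sketch computes \(r\) from the sums of cross-products and squares and compares it with R's built-in `cor()`. The simulated data and the names `sxy`, `sxx` and `syy` are illustrative assumptions, not part of the original material.

```r
# Illustrative data (assumed): any two numeric vectors of equal length would do.
set.seed(101)
x <- runif(10, 0, 10)
y <- 5.1 + 1.8 * x + rnorm(10)

# Numerator and denominator terms of the formula for r.
sxy <- sum((x - mean(x)) * (y - mean(y)))
sxx <- sum((x - mean(x))^2)
syy <- sum((y - mean(y))^2)
r_manual <- sxy / (sqrt(sxx) * sqrt(syy))

# Should agree with the built-in Pearson correlation.
all.equal(r_manual, cor(x, y))

# Rescaling x (e.g., a change of units) multiplies the covariance by the
# same factor but leaves the correlation unchanged.
c(cov = cov(x, y), cov_rescaled = cov(100 * x, y))
c(cor = cor(x, y), cor_rescaled = cor(100 * x, y))
```

The last two lines illustrate why the correlation, rather than the covariance, is used as a scale-free summary of linear association.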

We want to account for individual variations of a response variable, \(Y\), considering one or more predictors or explanatory variables, \(X_j\) (numeric or categorical). **Simple linear regression** considers only one continuous predictor.

Usually, we consider that these individual variations have a **systematic and a random (residual) part**. The linear model allows us to formalize the asymmetric relationship between the response variable (\(Y\)) and the predictors (\(X_j\)) while separating these two sources of variation:

\[ \text{response} = \text{predictor(s) effect} + \hskip-11ex \underbrace{\;\;\text{noise,}}_{\text{measurement error, observation period, etc.}} \]

Importantly, this theoretical relationship is **linear and additive**: (Draper and Smith, 1998; Fox and Weisberg, 2010)

\[ \mathbb{E}(y \mid x) = f(x; \beta) \]
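For simple linear regression, one common way to write \(f\) is sketched here. It assumes a single numeric predictor \(x\); the symbols \(\beta_0\) (intercept), \(\beta_1\) (slope) and \(\varepsilon_i\) (error term) are notation introduced for illustration:

\[ \mathbb{E}(y \mid x) = \beta_0 + \beta_1 x, \qquad y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \dots, n. \]

In this form, "linear" refers to linearity in the parameters \(\beta_0\) and \(\beta_1\), and "additive" to the fact that the error term is added to the systematic part.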

… or rather, "the practical question is how wrong do they have to be to not be useful" (George Box).

"Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise" (Tukey, 1962).

But all statistical models rely on more or less stringent assumptions. In the case of linear regression, the **linearity of the relationship** and the **influence of individual data points** must be carefully checked. Alternatives do exist, though: resistant or robust (Huber, LAD, quantile) methods, MARS, GAM, or restricted cubic splines (Marrie, Dawson, and Garland, 2009; Harrell, 2001).
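To illustrate why influential points matter, here is a minimal sketch comparing ordinary least squares with a Huber M-estimator on data containing one gross outlier. It assumes the `MASS` package is available (it is normally shipped with R); the simulated data and the artificial outlier are hypothetical.

```r
# Assumes the MASS package (a recommended package in standard R installs).
library(MASS)

set.seed(101)
x <- 1:20
y <- 5.1 + 1.8 * x + rnorm(20)
y[20] <- 0  # hypothetical gross recording error

fit_ols   <- lm(y ~ x)   # ordinary least squares
fit_huber <- rlm(y ~ x)  # M-estimation; Huber's psi function is the default

# Compare the estimated coefficients of the two fits.
rbind(OLS = coef(fit_ols), Huber = coef(fit_huber))

# Final robustness weights: values well below 1 flag down-weighted observations.
round(fit_huber$w, 2)
```

The robust fit down-weights the aberrant observation instead of letting it pull the whole line, which is one reason to look at both fits when influential points are suspected.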

- Not so much different, in a certain sense.
- Under the GLM umbrella, we can use regression to fit an ANOVA model.
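The second point can be checked directly in R. The sketch below uses the built-in `PlantGrowth` data set, chosen here purely for illustration, and fits the same one-way model with `aov()` and with `lm()`.

```r
# PlantGrowth ships with base R (datasets package): plant weight under
# a control and two treatment conditions.
data(PlantGrowth)

# One-way ANOVA fitted with the dedicated function...
fit_aov <- aov(weight ~ group, data = PlantGrowth)

# ...and as a linear regression on dummy-coded group indicators.
fit_lm <- lm(weight ~ group, data = PlantGrowth)

summary(fit_aov)
anova(fit_lm)  # same F statistic and p-value as the ANOVA table above

# With the default treatment contrasts, the regression coefficients are the
# reference-group mean and the differences of the other group means from it.
coef(fit_lm)
tapply(PlantGrowth$weight, PlantGrowth$group, mean)
```

Both calls rely on the same least-squares machinery; only the presentation of the results differs.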

    # Simulate 10 obs. assuming the true model is Y=5.1+1.8X.
    set.seed(101)
    n <- 10
    x <- runif(n, 0, 10)
    y <- 5.1 + 1.8 * x + rnorm(n)  # add white noise
    stem(y, scale = 2)

    ##
    ##   The decimal point is at the |
    ##
    ##    6 | 5
    ##    8 | 4
    ##   10 | 0
    ##   12 | 50
    ##   14 | 788
    ##   16 | 89

summary(m <- lm(y ~ x))

    ##
    ## Call:
    ## lm(formula = y ~ x)
    ##
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max
    ## -1.32379 -0.61054 -0.07249  0.69580  1.12465
    ##
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)   6.0831     0.6618   9.192 1.59e-05 ***
    ## x             1.6190     0.1360  11.903 2.28e-06 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 0.8752 on 8 degrees of freedom
    ## Multiple R-squared:  0.9466, Adjusted R-squared:  0.9399
    ## F-statistic: 141.7 on 1 and 8 DF,  p-value: 2.281e-06

- OLS minimizes vertical distances between observed and fitted \(y\) values (residual sum of squares).
- The regression line passes through the mean point, \((\bar x, \bar y)\); \(b_0=\bar y-b_1\bar x\).
- The slope of the regression line is found to be \(\sum_i(y_i-\bar y)(x_i-\bar x)\big/\sum_i(x_i-\bar x)^2\). (Compare to the formula for the correlation coefficient.)
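The closed-form expressions in the list above can be verified numerically. The sketch below regenerates the simulated data with the same seed as in the lecture and compares the hand-computed `b0` and `b1` (names chosen here for illustration) with the output of `lm()`.

```r
# Same simulation settings as in the lecture example.
set.seed(101)
n <- 10
x <- runif(n, 0, 10)
y <- 5.1 + 1.8 * x + rnorm(n)

# Slope: sum of cross-products divided by the sum of squares of x.
b1 <- sum((y - mean(y)) * (x - mean(x))) / sum((x - mean(x))^2)
# Intercept: forces the line through the mean point (x-bar, y-bar).
b0 <- mean(y) - b1 * mean(x)

c(b0 = b0, b1 = b1)
coef(lm(y ~ x))  # should match the hand computation

# Link with the correlation coefficient: b1 = r * sd(y) / sd(x).
all.equal(b1, cor(x, y) * sd(y) / sd(x))
```

The last line makes the comparison with the correlation formula explicit: the slope is the correlation rescaled by the ratio of the two standard deviations.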

summary(m)$coefficients

    ##             Estimate Std. Error   t value     Pr(>|t|)
    ## (Intercept) 6.083082  0.6617682  9.192164 1.586118e-05
    ## x           1.618957  0.1360150 11.902782 2.280917e-06

2*pt(1.619/0.136, 10-2, lower.tail=FALSE) ## t-test for slope

## [1] 2.278541e-06
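The value above differs slightly from the p-value in the coefficient table because the estimate and its standard error were typed in rounded form. A minimal sketch of the same test using the unrounded quantities extracted from the fitted model follows; the names `est`, `se`, `t_stat` and `p_val` are illustrative.

```r
# Same simulation settings as in the lecture example.
set.seed(101)
n <- 10
x <- runif(n, 0, 10)
y <- 5.1 + 1.8 * x + rnorm(n)
m <- lm(y ~ x)

# Extract the slope estimate and its standard error from the coefficient table.
est <- coef(summary(m))["x", "Estimate"]
se  <- coef(summary(m))["x", "Std. Error"]

# Two-sided t-test of H0: slope = 0, with n - 2 residual degrees of freedom.
t_stat <- est / se
p_val  <- 2 * pt(abs(t_stat), df = df.residual(m), lower.tail = FALSE)

c(t = t_stat, p = p_val)
coef(summary(m))["x", c("t value", "Pr(>|t|)")]  # same values from summary()
```

Using `df.residual(m)` rather than a hard-coded `10 - 2` keeps the computation valid if the sample size or the number of predictors changes.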

We might also be interested in testing the model as a whole, especially when there is more than one predictor.

anova(m)

    ## Analysis of Variance Table
    ##
    ## Response: y
    ##           Df  Sum Sq Mean Sq F value    Pr(>F)
    ## x          1 108.520 108.520  141.68 2.281e-06 ***
    ## Residuals  8   6.128   0.766
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
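The entries of this table can be rebuilt from the sums of squares. The sketch below refits the model on the same simulated data; the names `ss_total`, `ss_model` and `ss_resid` are illustrative.

```r
# Same simulation settings as in the lecture example.
set.seed(101)
n <- 10
x <- runif(n, 0, 10)
y <- 5.1 + 1.8 * x + rnorm(n)
m <- lm(y ~ x)

ss_total <- sum((y - mean(y))^2)          # total variation in y
ss_model <- sum((fitted(m) - mean(y))^2)  # explained by the regression
ss_resid <- sum(resid(m)^2)               # left unexplained

# The total sum of squares splits into model and residual parts.
all.equal(ss_total, ss_model + ss_resid)

# F = mean square (model) / mean square (residual), on 1 and n - 2 df.
f_stat <- (ss_model / 1) / (ss_resid / df.residual(m))
f_stat
pf(f_stat, 1, df.residual(m), lower.tail = FALSE)

# R-squared is the share of total variation accounted for by the model.
ss_model / ss_total

# With a single predictor, F equals the squared t statistic of the slope.
all.equal(f_stat, coef(summary(m))["x", "t value"]^2)
```

This is why the p-value of the overall F-test coincides with that of the slope t-test in simple linear regression; with several predictors the two tests answer different questions.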

Recall that **data = model fit + residuals**, as described in the ANOVA case.

coef(m)[1] + coef(m)[2] * x[1:5] ## fitted values

## [1] 12.108813 6.792587 17.572560 16.730805 10.128138

fitted(m)[1:5]

    ##         1         2         3         4         5
    ## 12.108813  6.792587 17.572560 16.730805 10.128138

**residuals = data - model fit**, that is, what is not explained by the linear regression.

y[1:5] - predict(m)[1:5] ## residuals

    ##          1          2          3          4          5
    ##  0.8647242 -0.2849502  0.1890184  1.1246501 -0.7539945

resid(m)[1:5]

    ##          1          2          3          4          5
    ##  0.8647242 -0.2849502  0.1890184  1.1246501 -0.7539945
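A few properties of least-squares residuals follow from this decomposition and can be checked on the same simulated data. This is a sketch under the lecture's simulation settings, and `rse` is an illustrative name.

```r
# Same simulation settings as in the lecture example.
set.seed(101)
n <- 10
x <- runif(n, 0, 10)
y <- 5.1 + 1.8 * x + rnorm(n)
m <- lm(y ~ x)

# data = model fit + residuals, observation by observation.
all.equal(unname(fitted(m) + resid(m)), y)

# With an intercept in the model, residuals sum to zero and are
# uncorrelated with the predictor (up to floating-point error).
sum(resid(m))
cor(resid(m), x)

# Residual standard error: square root of RSS over residual degrees of freedom.
rse <- sqrt(sum(resid(m)^2) / df.residual(m))
all.equal(rse, summary(m)$sigma)
```

The residual standard error computed this way is the quantity reported by `summary(m)`.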

Predicted (fitted) values are estimated conditional on the regressor values, which are assumed to be fixed and measured without error.

We need a further distributional assumption to draw inferences or estimate 95% CIs for the parameters, namely that the residuals follow an \(\mathcal{N}(0;\sigma^2)\) distribution.
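Under this normality assumption, 95% confidence intervals are built from the t distribution with \(n-2\) degrees of freedom. The sketch below computes them by hand on the same simulated data and compares them with `confint()`. The Shapiro-Wilk test at the end is only one possible informal check, added here as an assumption, and it has little power with 10 observations.

```r
# Same simulation settings as in the lecture example.
set.seed(101)
n <- 10
x <- runif(n, 0, 10)
y <- 5.1 + 1.8 * x + rnorm(n)
m <- lm(y ~ x)

est <- coef(summary(m))[, "Estimate"]
se  <- coef(summary(m))[, "Std. Error"]

# Critical value of the t distribution for a two-sided 95% interval.
t_crit <- qt(0.975, df = df.residual(m))

ci_manual <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
ci_manual
confint(m, level = 0.95)  # should give the same bounds

# Informal check of the normality assumption on the residuals.
shapiro.test(resid(m))
```

A normal quantile-quantile plot of the residuals is a common graphical alternative to the formal test.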

**Residual analysis shows what has not yet been accounted for in a model**.

head(influence.measures(m)$infmat) # Chambers & Hastie, 1992

    ##        dfb.1_       dfb.x       dffit     cov.r      cook.d       hat
    ## 1  0.25846802 -0.12143822  0.37450291 1.0941814 0.069134146 0.1117503
    ## 2 -0.41418354  0.36911615 -0.41453198 2.3977301 0.095676637 0.4828228
    ## 3 -0.06856568  0.11611354  0.14584143 1.7682042 0.012056633 0.2731311
    ## 4 -0.30613652  0.59570565  0.81886116 0.9030332 0.282742139 0.2124171
    ## 5 -0.42604533  0.31527329 -0.45926799 1.2632509 0.106739464 0.1891217
    ## 6  0.03617861 -0.02398987  0.04194061 1.5297314 0.001003707 0.1486280
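Two columns of this table, `hat` and `cook.d`, have simple closed forms in simple linear regression. The sketch below recomputes them on the same simulated data; the cutoff used in the last line is a common rule of thumb, added here as an assumption and not taken from the lecture.

```r
# Same simulation settings as in the lecture example.
set.seed(101)
n <- 10
x <- runif(n, 0, 10)
y <- 5.1 + 1.8 * x + rnorm(n)
m <- lm(y ~ x)

# Leverage: h_i = 1/n + (x_i - x-bar)^2 / sum_j (x_j - x-bar)^2.
h <- 1 / n + (x - mean(x))^2 / sum((x - mean(x))^2)
all.equal(unname(hatvalues(m)), h)

# Cook's distance combines residual size and leverage:
# D_i = e_i^2 * h_i / (p * s^2 * (1 - h_i)^2), with p estimated coefficients.
p  <- length(coef(m))
s2 <- summary(m)$sigma^2
d  <- resid(m)^2 * h / (p * s2 * (1 - h)^2)
all.equal(unname(d), unname(cooks.distance(m)))

# Rule of thumb (assumption): flag observations with leverage above 2p/n.
which(h > 2 * p / n)
```

Leverage depends only on the predictor values, whereas Cook's distance also involves the residuals, so a point can have high leverage without being influential.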

[1] N. Draper and H. Smith. *Applied Regression Analysis*. 3rd ed. Wiley-Interscience, 1998.

[2] J. Fox and H. Weisberg. *An R Companion to Applied Regression*. 2nd ed. Sage Publications, 2010.

[3] F. Harrell. *Regression modeling strategies with applications to linear models, logistic regression and survival analysis*. Springer, 2001.

[4] R. Marrie, N. Dawson and A. Garland. "Quantile regression and restricted cubic splines are useful for exploring relationships between continuous variables". In: *Journal of Clinical Epidemiology* 62.5 (2009), pp. 511-517.

[5] J. Tukey. "The future of data analysis". In: *Annals of Mathematical Statistics* 33 (1962), pp. 1-67.