Christophe Lalanne
November 17, 2015
The greatest value of a picture is when it forces us to notice what we never expected to see. John Tukey (1915-2000)
Lectures: OpenIntro Statistics, 7.1-7.4.
Linear regression allows us to model the relationship between a continuous outcome and one or more variables of interest, also called predictors. Unlike correlation analysis, these variables play an asymmetrical role, and usually we are interested in quantifying the strength of this relationship, as well as the amount of variance in the response variable accounted for by the predictors.
In linear regression, we assume a causal effect of the predictor(s) on the outcome. When quantifying the degree of association between two variables, however, both variables play a symmetrical role (there's no outcome or response variable). Moreover, correlation usually does not imply causation.
The (Bravais-)Pearson correlation coefficient provides a unit-less measure of linear co-variation between two numeric variables, unlike the covariance, which depends linearly on the measurement scale of each variable.
A perfect positive (negative) linear correlation would yield a value of r=+1 (−1).
$$
r = \frac{\overbrace{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}^{\text{covariance of } x \text{ and } y}}{\underbrace{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}}_{\text{std deviation of } x}\;\underbrace{\sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}_{\text{std deviation of } y}}
$$
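As a quick illustration in R (with simulated data and variable names chosen here only for illustration), the correlation is unchanged when a variable is rescaled, whereas the covariance is not:

## Correlation is scale-invariant, covariance is not
set.seed(42)
u <- rnorm(50)
v <- 2 * u + rnorm(50)
cov(u, v); cov(u, 10 * v)  ## covariance changes with the units of v
cor(u, v); cor(u, 10 * v)  ## correlation does not
## r is simply the covariance of the standardized variables
all.equal(cor(u, v), cov(scale(u), scale(v))[1, 1])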
We want to account for individual variations of a response variable, Y, considering one or more predictors or explanatory variables, Xj (numeric or categorical). Simple linear regression considers only one continuous predictor.
Usually, we will consider that there is a systematic part and a random (residual) part at the level of these individual variations. The linear model allows us to formalize the asymmetric relationship between the response variable (Y) and the predictors (Xj) while separating these two sources of variation:
$$
\text{response} = \text{predictor(s) effect} + \underbrace{\text{noise}}_{\text{measurement error, observation period, etc.}}
$$
Importantly, this theoretical relationship is linear and additive (Draper and Smith, 1998; Fox and Weisberg, 2010):
E(y∣x)=f(x;β)
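In the simple linear regression case considered here, f is just an affine function of x, so that the model for individual observations reads

$$
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \dots, n,
$$

where $\beta_0$ is the intercept, $\beta_1$ the slope, and $\varepsilon_i$ the random (residual) part.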
… or rather, "the practical question is how wrong do they have to be to not be useful" (George Box).
"Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise" (Tukey, 1962).
But all statistical models rely on more or less stringent assumptions. In the case of linear regression, the linearity of the relationship and the influence of individual data points must be carefully checked. Alternatives do exist, though: resistant or robust (Huber, LAD, quantile) methods, MARS, GAM, or restricted cubic splines (Marrie, Dawson, and Garland, 2009; Harrell, 2001).
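As a brief sketch of such alternatives (assuming the MASS and quantreg packages are available; the built-in cars data and object names are used here only for illustration):

library(MASS)      ## rlm(): Huber-type M-estimation
library(quantreg)  ## rq(): quantile (median/LAD) regression
library(splines)   ## ns(): natural (restricted) cubic splines
fit.ols <- lm(dist ~ speed, data = cars)                ## ordinary least squares
fit.rob <- rlm(dist ~ speed, data = cars)               ## robust regression
fit.lad <- rq(dist ~ speed, tau = 0.5, data = cars)     ## median (LAD) regression
fit.ns  <- lm(dist ~ ns(speed, df = 3), data = cars)    ## flexible fit via splines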
# Simulate 10 obs. assuming the true model is Y = 5.1 + 1.8X.
set.seed(101)
n <- 10
x <- runif(n, 0, 10)
y <- 5.1 + 1.8 * x + rnorm(n)  # add white noise
stem(y, scale = 2)
## 
## The decimal point is at the |
## 
##  6 | 5
##  8 | 4
## 10 | 0
## 12 | 50
## 14 | 788
## 16 | 89
summary(m <- lm(y ~ x))
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.32379 -0.61054 -0.07249  0.69580  1.12465 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.0831     0.6618   9.192 1.59e-05 ***
## x             1.6190     0.1360  11.903 2.28e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8752 on 8 degrees of freedom
## Multiple R-squared:  0.9466, Adjusted R-squared:  0.9399 
## F-statistic: 141.7 on 1 and 8 DF,  p-value: 2.281e-06
summary(m)$coefficients
##             Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 6.083082  0.6617682  9.192164 1.586118e-05
## x           1.618957  0.1360150 11.902782 2.280917e-06
2*pt(1.619/0.136, 10-2, lower.tail=FALSE) ## t-test for slope
## [1] 2.278541e-06
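The corresponding 95% confidence interval for the slope can be reconstructed by hand from the estimate and its standard error (reusing the rounded figures printed above):

1.619 + c(-1, 1) * qt(0.975, 10-2) * 0.136  ## 95% CI for the slope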
We might also be interested in testing the model as a whole, especially when there is more than one predictor.
anova(m)
## Analysis of Variance Table
## 
## Response: y
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## x          1 108.520 108.520  141.68 2.281e-06 ***
## Residuals  8   6.128   0.766                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
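As a quick sanity check, the F statistic can be recomputed from the table entries, and with a single predictor it equals the square of the t statistic for the slope:

108.520 / 0.766                       ## MS(x) / MS(residuals) = F
11.903^2                              ## = t^2 for the slope, up to rounding
pf(141.68, 1, 8, lower.tail = FALSE)  ## p-value of the overall F test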
Recall that data = model fit + residuals, as described in the ANOVA case.
coef(m)[1] + coef(m)[2] * x[1:5] ## fitted (adjusted) values
## [1] 12.108813 6.792587 17.572560 16.730805 10.128138
fitted(m)[1:5]
##         1         2         3         4         5 
## 12.108813  6.792587 17.572560 16.730805 10.128138
Residuals = data − model fit, i.e. what is not explained by the linear regression.
y[1:5] - predict(m)[1:5] ## residuals
##          1          2          3          4          5 
##  0.8647242 -0.2849502  0.1890184  1.1246501 -0.7539945
resid(m)[1:5]
##          1          2          3          4          5 
##  0.8647242 -0.2849502  0.1890184  1.1246501 -0.7539945
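As a small verification of the data = model fit + residuals decomposition (not in the original output):

all.equal(y, as.numeric(fitted(m) + resid(m)))  ## should return TRUE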
Predicted (fitted) values are estimated conditional on the regressor values, which are assumed to be fixed and measured without error.
We need a further distributional assumption in order to draw inference or compute 95% CIs for the parameters, namely that the residuals follow a normal distribution, N(0; σ²).
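Under this assumption, 95% confidence intervals for both coefficients are readily obtained with confint():

confint(m)  ## 95% CIs for the intercept and the slope, based on the t distribution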
Residual analysis shows what has not yet been accounted for in a model.
head(influence.measures(m)$infmat) # Chambers & Hastie, 1992
##        dfb.1_       dfb.x       dffit     cov.r      cook.d       hat
## 1  0.25846802 -0.12143822  0.37450291 1.0941814 0.069134146 0.1117503
## 2 -0.41418354  0.36911615 -0.41453198 2.3977301 0.095676637 0.4828228
## 3 -0.06856568  0.11611354  0.14584143 1.7682042 0.012056633 0.2731311
## 4 -0.30613652  0.59570565  0.81886116 0.9030332 0.282742139 0.2124171
## 5 -0.42604533  0.31527329 -0.45926799 1.2632509 0.106739464 0.1891217
## 6  0.03617861 -0.02398987  0.04194061 1.5297314 0.001003707 0.1486280
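Beyond these influence measures, the standard diagnostic plots (residuals vs. fitted values, QQ-plot of the residuals, scale-location, residuals vs. leverage) can be drawn directly from the fitted model; a minimal sketch:

op <- par(mfrow = c(2, 2))  ## arrange the four default diagnostic plots
plot(m)
par(op)
cooks.distance(m)           ## Cook's distances, matching the cook.d column above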
[1] N. Draper and H. Smith. Applied Regression Analysis. 3rd. Wiley-Interscience, 1998.
[2] J. Fox and H. Weisberg. An R Companion to Applied Regression. 2nd. Sage Publications, 2010.
[3] F. Harrell. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. Springer, 2001.
[4] R. Marrie, N. Dawson and A. Garland. "Quantile regression and restricted cubic splines are useful for exploring relationships between continuous variables". In: Journal of Clinical Epidemiology 62.5 (2009), pp. 511-517.
[5] J. Tukey. "The future of data analysis". In: Annals of Mathematical Statistics 33 (1962), pp. 1-67.