
Regression modeling ➋

Christophe Lalanne

November 17, 2015

Synopsis

Statistical models are sometimes misunderstood in epidemiology. Statistical models for data are never true. The question whether a model is true is irrelevant. A more appropriate question is whether we obtain the correct scientific conclusion if we pretend that the process under study behaves according to a particular statistical model. Scott Zeger (1991)




model comparisons • coding of categorical predictors • contrasts • analysis of covariance

Categorical predictors in ANOVA and regression

ANOVA vs. linear regression

ANOVA: Explain the variation observed in a numerical response variable by taking into account manipulated or fixed values (levels) of one or more factors. We may also assume random effects for the factors under study.

Regression: Explain the variation observed in a numerical response variable, or predict future values, from a set of k predictors (explanatory variables), which may be either numerical or categorical.

$$y_i=\beta_0+\sum_{j=1}^{k}\beta_j x_{ij}$$
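Both specifications correspond to the same underlying linear model in R. As a quick sketch (using the ToothGrowth data analysed in the illustration below), a one-way ANOVA can be fitted equivalently with aov() or lm():

data(ToothGrowth)
fit.aov <- aov(len ~ supp, data = ToothGrowth)   ## ANOVA formulation
fit.lm  <- lm(len ~ supp, data = ToothGrowth)    ## regression formulation
summary(fit.aov)   ## classical ANOVA table
anova(fit.lm)      ## same sums of squares and F statistic
coef(fit.lm)       ## intercept = mean of the reference level (OJ),
                   ## slope = difference between the two group means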

A model comparison approach

Base (null) model, M0: no factors/predictors involved, only the grand mean (intercept) and residual variation around this constant value.

Augmented model, M1: the null model plus the factor (or predictor) of interest.

Comparing M1 vs. M0 quantifies and tests the variance accounted for by the factor included in M1, or, equivalently, the reduction in RSS (unexplained variance).

Illustration

data(ToothGrowth)
fm <- len ~ supp
m1 <- aov(fm, data = ToothGrowth)
summary(m1)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## supp         1    205  205.35   3.668 0.0604 .
## Residuals   58   3247   55.98                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model with the supp factor vs. grand mean only:

m0 <- update(m1, . ~ . - supp)
anova(m0, m1)
## Analysis of Variance Table
## 
## Model 1: len ~ 1
## Model 2: len ~ supp
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     59 3452.2                              
## 2     58 3246.9  1    205.35 3.6683 0.06039 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Coding of categorical variables

n <- 6
x <- gl(2, 1, n, labels = letters[1:2])          ## x={a,b}
y <- 1.1 + 0.5 * (as.numeric(x)-1) + rnorm(n)    ## x={0,1}
m <- lm(y ~ x)
formula(m)
## y ~ x
model.matrix(m)
##   (Intercept) xb
## 1           1  0
## 2           1  1
## 3           1  0
## 4           1  1
## 5           1  0
## 6           1  1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$x
## [1] "contr.treatment"

Regression or t-test?

Writing down regression equations

When the explanatory variable is numerical, the linear regression model reads $y_i=\beta_0+\beta_1 x_i+\varepsilon_i$, where $\beta_0$ is the intercept, $\beta_1$ the slope of the regression line, and the $\varepsilon_i$ are random errors (assumed to follow $\mathcal{N}(0,\sigma^2)$).

Now let x be a categorical variable taking values $x=a$ or $x=b$. We can then write $y=\beta_0+\beta_1 I(x=b)$, with $I(x=b)=1$ if $x=b$ and 0 otherwise. Whence,

$$y = \begin{cases} \beta_0 & (x=a)\\ \beta_0+\beta_1 & (x=b).\end{cases}$$

The interpretation of $\beta_1$ remains the same: it reflects the increase in y when x increases by one unit ($a \rightarrow b$). As for $\beta_0$, it is the average value of y when $x=0$ (i.e., $x=a$).
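In other words, regressing y on a two-level factor is equivalent to a two-sample t-test with pooled variance. A brief sketch with the ToothGrowth data (same t statistic up to sign, identical p-value):

data(ToothGrowth)
t.test(len ~ supp, data = ToothGrowth, var.equal = TRUE)   ## pooled-variance t-test
summary(lm(len ~ supp, data = ToothGrowth))$coefficients   ## t test on the supp coefficient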

Illustration

Let us consider data on birth weight (Hosmer and Lemeshow, 1989).

data(birthwt, package = "MASS")
ethn <- c("White", "Black", "Other")
birthwt$race <- factor(birthwt$race, labels = ethn)
birthwt$race <- relevel(birthwt$race, ref = "White")
library(lattice)
xyplot(bwt ~ race, data = birthwt, jitter.x = TRUE, 
       cex = .8, col = "cornflowerblue", alpha = .5)

m <- lm(bwt ~ race, data = birthwt)
print(summary(m), signif.stars = FALSE)
## 
## Call:
## lm(formula = bwt ~ race, data = birthwt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2096.28  -502.72   -12.72   526.28  1887.28 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3102.72      72.92  42.548  < 2e-16
## raceBlack    -383.03     157.96  -2.425  0.01627
## raceOther    -297.44     113.74  -2.615  0.00965
## 
## Residual standard error: 714.5 on 186 degrees of freedom
## Multiple R-squared:  0.05017,    Adjusted R-squared:  0.03996 
## F-statistic: 4.913 on 2 and 186 DF,  p-value: 0.008336

ANOVA Table for weight ~ ethnicity

anova(m)
## Analysis of Variance Table
## 
## Response: bwt
##            Df   Sum Sq Mean Sq F value   Pr(>F)   
## race        2  5015725 2507863  4.9125 0.008336 **
## Residuals 186 94953931  510505                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Using summary(aov(bwt ~ race, data=birthwt)) would yield exactly the same results.

Contrast coding in R

options(contrasts=c("contr.sum", "contr.poly"))
m2 <- lm(bwt ~ race, data = birthwt)
coef(m2)
## (Intercept)       race1       race2 
##   2875.8982    226.8205   -156.2059

What does -156.2 (race2) reflect?

Treatment vs. sum contrasts:

grp.means <- with(birthwt, tapply(bwt, race, mean))
grp.means
##    White    Black    Other 
## 3102.719 2719.692 2805.284
grp.means[2:3] - grp.means[1]     ## m  "contr.treatment"
##     Black     Other 
## -383.0264 -297.4352
grp.means[1:2] - mean(grp.means)  ## m2 "contr.sum"
##     White     Black 
##  226.8205 -156.2059
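Note that options(contrasts = ...) changes the coding globally for the remainder of the session. To return to R's default treatment coding (assumed in the analyses below), the option can be reset:

options(contrasts = c("contr.treatment", "contr.poly"))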

Analysis of covariance

Rationale

Analysis of covariance (ANCOVA) consists of testing the effect of different levels of a factor on a numerical response while also taking one or more numerical covariates into account. The response variable is 'associated' with the numerical covariate. The idea is to obtain an estimate of the average response corrected for possible between-group differences (at the level of the covariates).

Such analyses are frequently carried out on pre/post measurements, and they can generally be seen as a post-hoc adjustment method (Miller and Chapman, 2001; Senn, 2006).

The ANCOVA model

Let $y_{ij}$ be the $j$th observation in group $i$. The ANCOVA model with one covariate can be written as

$$y_{ij}=\mu+\alpha_i+\beta(x_{ij}-\bar{x})+\varepsilon_{ij},$$

where $\beta$ is the regression coefficient relating the response $y$ to the numerical covariate $x$, and $\bar{x}$ is the mean of the $x_{ij}$. As usual, $\varepsilon_{ij}$ is a random error term distributed as $\mathcal{N}(0,\sigma^2)$.

Note that $\beta$ is assumed to be the same in each group. This hypothesis of 'parallelism' of the regression slopes can be verified by testing the group × covariate interaction term.

Illustration

The anorexia data set includes weight change data for young female anorexia patients following different treatment regimens (Cognitive Behavioural treatment, Family treatment, or Control) (Hand, Daly, McConway, and Ostrowski, 1993).

data(anorexia, package = "MASS")
anorexia$Treat <- relevel(anorexia$Treat, ref="Cont")
f <- function(x) c(mean=mean(x), sd=sd(x))
aggregate(cbind(Prewt, Postwt) ~ Treat, data=anorexia, f)
##   Treat Prewt.mean  Prewt.sd Postwt.mean Postwt.sd
## 1  Cont  81.557692  5.707060   81.107692  4.744253
## 2   CBT  82.689655  4.845495   85.696552  8.351924
## 3    FT  83.229412  5.016693   90.494118  8.475072

## Model considering identical slopes per group
m0 <- aov(Postwt ~ Prewt + Treat, data=anorexia)  
## Model considering different slopes per group
m1 <- aov(Postwt ~ Prewt * Treat, data=anorexia)  
anova(m0, m1)
## Analysis of Variance Table
## 
## Model 1: Postwt ~ Prewt + Treat
## Model 2: Postwt ~ Prewt * Treat
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1     68 3311.3                                
## 2     66 2844.8  2    466.48 5.4112 0.006666 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The comparison between the two nested models corresponds to a test of the interaction term. The interaction term should be kept, since including it in the model results in a significant reduction in RSS (cf. summary(m1)).

The model without interaction reads:

$$\hat{y}_i = 45.67 + 0.43\times\text{Prewt}_i + 4.10\times I(\text{Treat}_i=\text{CBT}) + 8.66\times I(\text{Treat}_i=\text{FT}).$$

For the control group (Cont), $\hat{y}_i = 45.67 + 0.43\times\text{Prewt}_i$, while for the FT group $\hat{y}_i = 45.67 + 0.43\times\text{Prewt}_i + 8.66$. The effect of Prewt is the same for all patients, and the grouping factor only introduces a mean shift (+4.10 or +8.66) with respect to the control group.

In contrast, the model with interaction implies

$$\begin{aligned}
\hat{y}_i = 80.99 &- 0.13\times\text{Prewt}_i \\
&+ 4.46\times I(\text{Treat}_i=\text{CBT}) + 8.75\times I(\text{Treat}_i=\text{FT}) \\
&+ 0.98\times\text{Prewt}_i\times I(\text{Treat}_i=\text{CBT}) + 1.04\times\text{Prewt}_i\times I(\text{Treat}_i=\text{FT}),
\end{aligned}$$

so that each group has its own intercept and its own slope for Prewt.
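The coefficients quoted in both equations can be read directly from the fitted models, and covariate-adjusted group means can be obtained by evaluating the additive model at the grand mean of Prewt. A minimal sketch (assuming the default treatment contrasts, with Cont as the reference level):

coef(m0)   ## additive model: common slope, group-specific intercept shifts
coef(m1)   ## interaction model: group-specific intercepts and slopes
## adjusted group means from the additive model, at the overall mean of Prewt
nd <- data.frame(Treat = levels(anorexia$Treat), Prewt = mean(anorexia$Prewt))
cbind(nd, adj.mean = predict(m0, newdata = nd))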

References

[1] D. Hand, F. Daly, K. McConway, et al. A Handbook of Small Data Sets. Data set 285, p. 229. Chapman & Hall, 1993.

[2] D. Hosmer and S. Lemeshow. Applied Logistic Regression. New York: Wiley, 1989.

[3] G. Miller and J. Chapman. "Misunderstanding Analysis of Covariance". In: Journal of Abnormal Psychology 110.1 (2001), pp. 40-48.

[4] S. Senn. "Change from baseline and analysis of covariance revisited". In: Statistics in Medicine 25.24 (2006), pp. 4334-4344.