Christophe Lalanne
November 17, 2015
Statistical models are sometimes misunderstood in epidemiology. Statistical models for data are never true. The question whether a model is true is irrelevant. A more appropriate question is whether we obtain the correct scientific conclusion if we pretend that the process under study behaves according to a particular statistical model. Scott Zeger (1991)
ANOVA: Explain variations observed on a numerical response variable by taking into account manipulated or fixed values (levels) for some factors. We may also assume random effects for the factors under study.
Regression: Explain variations observed on a numerical response variable, or predict future values, based on a set of k predictors (explanatory variables), which might be either numerical or categorical.
yi=β0+k∑j=1βjxi
Base (null) model, M0: no factors/predictors involved, only the grand mean or intercept, and residual variations around this constant value.
Comparing M1 vs. M0 allows to quantify and test the variance accounted for by the factor included in M1, or, equivalently, reduction in RSS (unexplained variance).
data(ToothGrowth) fm <- len ~ supp m1 <- aov(fm, data = ToothGrowth) summary(m1)
## Df Sum Sq Mean Sq F value Pr(>F) ## supp 1 205 205.35 3.668 0.0604 . ## Residuals 58 3247 55.98 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Model with the supp
factor vs. grand mean only:
m0 <- update(m1, . ~ - supp) anova(m0, m1)
## Analysis of Variance Table ## ## Model 1: len ~ 1 ## Model 2: len ~ supp ## Res.Df RSS Df Sum of Sq F Pr(>F) ## 1 59 3452.2 ## 2 58 3246.9 1 205.35 3.6683 0.06039 . ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
n <- 6 x <- gl(2, 1, n, labels = letters[1:2]) ## x={a,b} y <- 1.1 + 0.5 * (as.numeric(x)-1) + rnorm(n) ## x={0,1} m <- lm(y ~ x) formula(m)
## y ~ x
model.matrix(m)
## (Intercept) xb ## 1 1 0 ## 2 1 1 ## 3 1 0 ## 4 1 1 ## 5 1 0 ## 6 1 1 ## attr(,"assign") ## [1] 0 1 ## attr(,"contrasts") ## attr(,"contrasts")$x ## [1] "contr.treatment"
When the explanatory variable is numerical, the linear regression model reads yi=β0+β1xi+εi, where β0 is the intercept, β1 the slope of the regression line, and εi are random errors (assumed to follow N(0;σ2)).
Let x be a categorical variable (x=a or x=b), we can then write y=β0+β1I(x=b) with I(x=b)=1 if x=b, 0 otherwise. Whence, y=β0(x=a)=β0+β1(x=b) The interpretation of β1 remains the same: it reflects the increase in y when x increases by 1 unit (a→b). Regarding β0, it is the average value of y when x=0 (i.e., x=a).
Let us consider data on birth weight (Hosmer and Lemeshow, 1989).
data(birthwt, package = "MASS") ethn <- c("White", "Black", "Other") birthwt$race <- factor(birthwt$race, labels = ethn) birthwt$race <- relevel(birthwt$race, ref = "White") xyplot(bwt ~ race, data = birthwt, jitter.x = TRUE, cex = .8, col = "cornflowerblue", alpha=.5)
m <- lm(bwt ~ race, data = birthwt) print(summary(m), signif.stars = FALSE)
## ## Call: ## lm(formula = bwt ~ race, data = birthwt) ## ## Residuals: ## Min 1Q Median 3Q Max ## -2096.28 -502.72 -12.72 526.28 1887.28 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3102.72 72.92 42.548 < 2e-16 ## raceBlack -383.03 157.96 -2.425 0.01627 ## raceOther -297.44 113.74 -2.615 0.00965 ## ## Residual standard error: 714.5 on 186 degrees of freedom ## Multiple R-squared: 0.05017, Adjusted R-squared: 0.03996 ## F-statistic: 4.913 on 2 and 186 DF, p-value: 0.008336
ANOVA Table for weight ~ ethnicity
anova(m)
## Analysis of Variance Table ## ## Response: bwt ## Df Sum Sq Mean Sq F value Pr(>F) ## race 2 5015725 2507863 4.9125 0.008336 ** ## Residuals 186 94953931 510505 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Using summary(aov(bwt ~ race, data=birthwt))
would yield exactly the same results.
http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm
options("contrasts")
## $contrasts ## unordered ordered ## "contr.treatment" "contr.poly"
coef(m)
## (Intercept) raceBlack raceOther ## 3102.7188 -383.0264 -297.4352
What does -383.0 (raceBlack
) reflect?
options(contrasts=c("contr.sum", "contr.poly")) m2 <- lm(bwt ~ race, data = birthwt) coef(m2)
## (Intercept) race1 race2 ## 2875.8982 226.8205 -156.2059
What does -156.2 (race2
) reflect?
Treatment vs. sum contrasts:
grp.means <- with(birthwt, tapply(bwt, race, mean)) grp.means
## White Black Other ## 3102.719 2719.692 2805.284
grp.means[2:3] - grp.means[1] ## m "contr.treatment"
## Black Other ## -383.0264 -297.4352
grp.means[1:2] - mean(grp.means) ## m2 "contr.sum"
## White Black ## 226.8205 -156.2059
Analysis of covariance (ANCOVA) consists in testing the effect of different levels of a factor on a numerical response when other numerical covariates are also considered. The response variable is 'associated' to the numerical covariate. The idea is to get an estimate of the average response corrected for the possible between-group differences (at the level of the covariates).
Such analyses are frequently carried out on pre/post measurements, and they can generally be seen as a post-hoc adjustment method (Miller and Chapman, 2001; Senn, 2006).
Let yij be the j th observation in group i, the ANCOVA model with one covariate can be written as
yij=μ+αi+β(xij−¯x)+εij,
where β is the regression coefficient connecting the response y to the cofactor x (numerical), and ¯x is the mean of the xij. As usual, εij is a random term distributed as N(0,σ2).
Note that it is assumed that β is the same in each group. This hypothesis of 'parallelism' of regression slopes can be verifed by testing the interaction term αβ.
The anorexia
data set includes weight change data for young female anorexia patients following different treatment regimen (Cognitive Behavioural treatment, Family treatment, or Control) (Hand, Daly, McConway, and Ostrowski, 1993).
data(anorexia, package = "MASS") anorexia$Treat <- relevel(anorexia$Treat, ref="Cont") f <- function(x) c(mean=mean(x), sd=sd(x)) aggregate(cbind(Prewt, Postwt) ~ Treat, data=anorexia, f)
## Treat Prewt.mean Prewt.sd Postwt.mean Postwt.sd ## 1 Cont 81.557692 5.707060 81.107692 4.744253 ## 2 CBT 82.689655 4.845495 85.696552 8.351924 ## 3 FT 83.229412 5.016693 90.494118 8.475072
## Model considering identical slopes per group m0 <- aov(Postwt ~ Prewt + Treat, data=anorexia) ## Model considering different slopes per group m1 <- aov(Postwt ~ Prewt * Treat, data=anorexia) anova(m0, m1)
## Analysis of Variance Table ## ## Model 1: Postwt ~ Prewt + Treat ## Model 2: Postwt ~ Prewt * Treat ## Res.Df RSS Df Sum of Sq F Pr(>F) ## 1 68 3311.3 ## 2 66 2844.8 2 466.48 5.4112 0.006666 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The comparison between the two nested models corresponds to a test of the interaction term. It should be kept as including it in the model results in a significant decrease in RSS (cf. summary(m1)
).
The model without interaction writes down:
˜yi=45.67+0.43×Prewti+4.10×I(Treati=CBT)=45.67+8.66×I(Treati=FT).
For the control group (CTRL
), ˜yi=45.67+0.43×Prewti, while for the FT
group ˜yi=45.67+0.43×Prewti+8.66. The effect of Prewt
is the same for all patients, and the grouping factor only introduces a mean change (+4.10 ou +8.66) with respect to the control group.
On the contrary, the model with interaction implies
˜yi=80.99−0.13×Prewti80.99−+4.46×I(Treati=CBT)80.99−+8.75×I(Treati=FT)80.99−+0.98×Prewti×I(Treati=CBT)80.99−+1.04×Prewti×I(Treati=FT).
[1] D. Hand, F. Daly, K. McConway, et al. A Handbook of Small Data Sets. Data set 285, p.~ 229. Chapman & Hall, 1993.
[2] D. Hosmer and S. Lemeshow. Applied Logistic Regression. New York: Wiley, 1989.
[3] G. Miller and J. Chapman. "Misunderstanding Analysis of Covariance". In: Journal of Abnormal Psychology 110.1 (2001), pp. 40-48.
[4] S. Senn. "Change from baseline and analysis of covariance revisited". In: Statistics in Medicine 25.24 (2006), pp. 4334-4344.