Christophe Lalanne
October 22, 2013
The analysis of variance is not a mathematical theorem, but rather a convenient method of arranging the arithmetic. Ronald Fisher (1890-1962)
design of experiments • split-apply-combine • one-way and two-way ANOVA • interaction
Lectures: OpenIntro Statistics, 4.2, 5.2, 5.5.
Maximize precision while minimizing the number of trials.
Implementation of an organized set of experimental units to characterize the effect of certain treatments, or combinations of treatments, on one or more response variables.
Taking one or more nuisance factors into account when setting up the experimental design: organize sources of unwanted variation so that they affect all treatments equally, making comparisons between treatments possible.
Randomization (random allocation of units to treatments, i.e., experimental vs. quasi-experimental design), factorial arrangement of treatments, and blocking (grouping of similar units based on known but irrelevant characteristics) are key components of experimental design (Montgomery, 2012).
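As a minimal sketch (hypothetical units, treatments, and block labels), randomization within blocks can be set up in R along these lines:
set.seed(101)                                   # for reproducibility
block <- gl(n=4, k=3, labels=paste0("b", 1:4))  # 4 blocks of 3 similar units each
treatment <- unlist(replicate(4, sample(c("T1", "T2", "T3")), simplify=FALSE))
design <- data.frame(unit=1:12, block, treatment)  # treatments randomized within each block
head(design)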
R relies on a 'formula' to describe the relationship between one or more response variables and one or more explanatory variables, following Wilkinson & Rogers's notation (Wilkinson & Rogers, 1973; Chambers & Hastie, 1992).
In the case of ANOVA and regression, a single response variable is put on the left of the ~ operator, followed by the explanatory variables on the right-hand side (RHS), as summarized below:
| RHS | Variable type | Meaning | Equiv. to |
|---|---|---|---|
| x | numeric | simple linear regression | 1 + x |
| x + 0 | numeric | idem without intercept | x - 1 |
| a + b | factor (numeric) | two main crossed effects | |
| a * b | factor | idem including interaction | 1 + a + b + a:b |
| a / b | factor | nested relationship | 1 + a + b + a %in% b |
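For instance, here is a quick check (with simulated data) that a * b and its expanded form describe the same model:
set.seed(42)
a <- gl(n=2, k=10)                # first factor, 2 levels
b <- gl(n=2, k=5, length=20)      # second factor, 2 levels, crossed with a
y <- rnorm(n=20)                  # simulated response
m1 <- lm(y ~ a * b)               # compact notation
m2 <- lm(y ~ 1 + a + b + a:b)     # expanded equivalent
all.equal(fitted(m1), fitted(m2)) # TRUE: both formulae fit the same model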
Most of the time, the same formula can be used to perform several 'data steps' (tidying and summarizing data, plotting, or reporting), but it is also the core element of many statistical models in R (linear and generalized linear models, decision trees, partitioning methods, etc.).
The use of formulae means we also need to work with a well-arranged data frame, for which reshaping (melting/casting) is an essential tool.
d <- data.frame(x1 = rnorm(n=5, mean=10, sd=1.5),
x2 = rnorm(n=5, mean=12, sd=1.5))
d[1:3,] # same as head(d, n=3)
x1 x2
1 9.511 13.76
2 10.829 12.93
3 8.988 11.83
t.test(d$x1, d$x2, var.equal=TRUE)$p.value
[1] 0.001021
library(reshape2)
dm <- melt(d) # switch from wide to long format
head(dm)
variable value
1 x1 9.511
2 x1 10.829
3 x1 8.988
4 x1 10.322
5 x1 10.466
6 x2 13.761
t.test(value ~ variable, data=dm, var.equal=TRUE)$p.value
[1] 0.001021
Note: R's formulae (and data frames, de facto) have been ported to Python (e.g., patsy and pandas) and Julia.
“(…) break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together.” (Wickham, 2011)
See the plyr package (we won't use it, though).
Here is a working example:
x <- rnorm(n=15) # 15 random gaussian variates
grp <- gl(n=3, k=5, labels=letters[1:3])
spl <- split(x, grp) # split x by levels of grp
apl <- lapply(spl, mean) # apply mean() to each split
cbn <- rbind(x=apl) # combine the means
cbn
a b c
x -0.1088 -0.7705 -0.08854
Shorter version (other options include by(), tapply(), or ave()):
aggregate(x ~ grp, FUN=mean)
grp x
1 a -0.10883
2 b -0.77052
3 c -0.08854
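For the record, tapply(), one of the alternatives just mentioned, returns the same group means as a named vector:
tapply(x, grp, mean)  # same means as aggregate(), as a named vector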
Let \( y_{ij} \) be the \( j\text{th} \) observation in group \( i \) (factor \( A \), with \( a \) levels). An effect model can be written as
\[ y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \]
where \( \mu \) stands for the overall (grand) mean, \( \alpha_i \) is the effect of group or treatment \( i \) (\( i=1,\dots,a \)), and \( \varepsilon_{ij}\sim \mathcal{N}(0,\sigma^2) \) reflects random error. The following restriction is usually considered: \( \sum_{i=1}^a\alpha_i=0 \).
The null hypothesis reads: \( H_0:\alpha_1=\alpha_2=\dots=\alpha_a=0 \). It can be tested with an F-test with \( a-1 \) and \( N-a \) degrees of freedom.
Each observation can be seen as a deviation from its group mean, \( y_{ij}=\bar y_i+\varepsilon_{ij} \). Then, the whole variability can be expressed as follows:
\[ \underbrace{(y_{ij}-\bar y)}_{\text{total}} = \underbrace{(\bar y_{i}-\bar y)}_{\text{group}} + \underbrace{(y_{ij}-\bar y_{i})}_{\text{residuals}}. \]
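As a quick numerical check, here is a minimal sketch (reusing x and grp from the split-apply-combine example above) verifying that the corresponding sums of squares add up:
grand.mean <- mean(x)                          # overall (grand) mean
group.means <- ave(x, grp)                     # group mean repeated for each observation
ss.total <- sum((x - grand.mean)^2)            # total variability
ss.group <- sum((group.means - grand.mean)^2)  # between-group variability
ss.resid <- sum((x - group.means)^2)           # residual variability
all.equal(ss.total, ss.group + ss.resid)       # TRUE: the decomposition is exact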
Effect of different sugars on length of pea sections grown in tissue culture with auxin present. (Sokal & Rohlf, 1995)
| | ctrl | X2G | X2F | X1G1F | X2S |
|---|---|---|---|---|---|
| 1 | 75 | 57 | 58 | 58 | 62 |
| 2 | 67 | 58 | 61 | 59 | 66 |
| 3 | 70 | 60 | 56 | 58 | 65 |
| 4 | 75 | 59 | 58 | 61 | 63 |
| 5 | 65 | 62 | 57 | 57 | 64 |
| 6 | 71 | 60 | 56 | 56 | 62 |
| 7 | 67 | 60 | 61 | 58 | 65 |
| 8 | 67 | 57 | 60 | 57 | 65 |
| 9 | 76 | 59 | 57 | 57 | 62 |
| 10 | 68 | 61 | 58 | 59 | 67 |
head(peas, n=3)
ctrl X2G X2F X1G1F X2S
1 75 57 58 58 62
2 67 58 61 59 66
3 70 60 56 58 65
peas.melted <- melt(peas, value.name="length",
variable.name="treatment")
head(peas.melted, n=3)
treatment length
1 ctrl 75
2 ctrl 67
3 ctrl 70
f <- function(x) c(mean=mean(x), sd=sd(x))
aggregate(length ~ treatment, data=peas.melted, f)
treatment length.mean length.sd
1 ctrl 70.100 3.985
2 X2G 59.300 1.636
3 X2F 58.200 1.874
4 X1G1F 58.000 1.414
5 X2S 64.100 1.792
mod <- aov(length ~ treatment, data=peas.melted)
summary(mod)
Df Sum Sq Mean Sq F value Pr(>F)
treatment 4 1077 269.3 49.4 6.7e-16
Residuals 45 246 5.5
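For the record, the p-value can be recovered from the F distribution with the quoted degrees of freedom (using the rounded Mean Squares displayed above):
fstat <- 269.3 / 5.5                        # MS treatment / MS residuals (about 49, rounding aside)
pf(fstat, df1=4, df2=45, lower.tail=FALSE)  # upper tail of the F(4, 45) distribution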
model.tables(mod)
Tables of effects
treatment
treatment
ctrl X2G X2F X1G1F X2S
8.16 -2.64 -3.74 -3.94 2.16
aggregate(length ~ treatment, peas.melted, mean)$length - mean(peas.melted$length)
[1] 8.16 -2.64 -3.74 -3.94 2.16
Data = model fit + residual
d <- cbind.data.frame(peas.melted,
fit = fitted(mod),
residual = resid(mod))
res <- aggregate(. ~ treatment, data=d, mean)
res$effect <- res$fit - mean(peas.melted$length)
res
treatment length fit residual effect
1 ctrl 70.1 70.1 -1.232e-15 8.16
2 X2G 59.3 59.3 5.995e-16 -2.64
3 X2F 58.2 58.2 -6.665e-17 -3.74
4 X1G1F 58.0 58.0 1.999e-16 -3.94
5 X2S 64.1 64.1 3.108e-16 2.16
Let \( y_{ijk} \) be the \( k\text{th} \) observation for level \( i \) of factor \( A \) (\( i=1,\dots,a \)) and level \( j \) of factor \( B \) (\( j=1,\dots,b \)). The full model can be written as follows:
\[ y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}, \]
where \( \mu \) is the overall mean, \( \alpha_i \) (\( \beta_j \)) is the deviation of the \( a \) (\( b \)) group means from the overall mean, \( \gamma_{ij} \) is the deviation of the \( A\times B \) cell means from the values predicted by the main effects alone, and \( \varepsilon_{ijk}\sim {\cal N}(0,\sigma^2) \) is the residual term.
The \( \alpha_i \) and \( \beta_j \) are called main effects, and \( \gamma_{ij} \) is the interaction effect.
Null hypotheses associated with the full factorial design are given below: \( H_0^A:\ \alpha_1=\alpha_2=\dots=\alpha_a=0 \) (no main effect of \( A \)); \( H_0^B:\ \beta_1=\beta_2=\dots=\beta_b=0 \) (no main effect of \( B \)); \( H_0^{AB}:\ \gamma_{ij}=0 \) for all \( i,j \) (no interaction).
The ratio of the Mean Square for each factor to the residual Mean Square can be tested with Fisher-Snedecor F-tests.
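As an illustrative sketch (using R's built-in ToothGrowth data, not a dataset from this handout), such a full model is fitted with the same formula notation:
data(ToothGrowth)
ToothGrowth$dose <- factor(ToothGrowth$dose)     # dose as a factor (0.5, 1, 2 mg)
mod2 <- aov(len ~ supp * dose, data=ToothGrowth) # two main effects + their interaction
summary(mod2)                                    # one F-test per null hypothesis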
Chambers J and Hastie T (1992). Statistical Models in S. Wadsworth & Brooks. ISBN: 0534167649.
Montgomery D (2012). Design and Analysis of Experiments, 8th edition. John Wiley & Sons.
Sokal R and Rohlf F (1995). Biometry, 3rd edition. WH Freeman and Company.
Wickham H (2011). “The Split-Apply-Combine Strategy for Data Analysis.” Journal of Statistical Software, 40(1).
Wilkinson G and Rogers C (1973). “Symbolic description of factorial models for analysis of variance.” Applied Statistics, 22, pp. 392-399.