November 3, 2015
The analysis of variance is not a mathematical theorem, but rather a convenient method of arranging the arithmetic. Ronald Fisher (1890-1962)
We are interested in assessing whether several groups differ with respect to their mean. We use an omnibus Fisher-Snedecor F-test to test the null hypothesis that all group means are equal, assuming that group variances are identical in the population (homoskedasticity) and that the normality assumption holds. The alternative hypothesis is that at least one pair of means is different at a fixed \(\alpha=0.05\) level.
Effect of different sugars on length of pea sections grown in tissue culture with auxin present. (Sokal and Rohlf, 1995)
ctrl | X2G | X2F | X1G1F | X2S | |
---|---|---|---|---|---|
1 | 75 | 57 | 58 | 58 | 62 |
2 | 67 | 58 | 61 | 59 | 66 |
3 | 70 | 60 | 56 | 58 | 65 |
4 | 75 | 59 | 58 | 61 | 63 |
5 | 65 | 62 | 57 | 57 | 64 |
6 | 71 | 60 | 56 | 56 | 62 |
head(peas, n = 3)
## ctrl X2G X2F X1G1F X2S ## 1 75 57 58 58 62 ## 2 67 58 61 59 66 ## 3 70 60 56 58 65
library(reshape2) peas.m <- melt(peas, value.name = "length", variable.name = "treatment") head(peas.m, n = 3)
## treatment length ## 1 ctrl 75 ## 2 ctrl 67 ## 3 ctrl 70
This way, we can express relation like length ~ treatment
.
To compute group means and SDs, we can use aggregate()
two times, or write a little helper function.
f <- function(x, digits = 1) round(c(mean = mean(x), sd = sd(x)), digits = digits) aggregate(length ~ treatment, data = peas.m, f)
## treatment length.mean length.sd ## 1 ctrl 70.1 4.0 ## 2 X2G 59.3 1.6 ## 3 X2F 58.2 1.9 ## 4 X1G1F 58.0 1.4 ## 5 X2S 64.1 1.8
Note: aggregate()
does not handle multiple output correctly (check the dimension of the returned object), but it is ok for printing on the console.
Here, the idea is to show the distribution of the response variable in each group using box-and-whiskers charts.
There are 5 groups of independant statistical units, and it looks like variance are approximately comparable in all but the Control group.
mod <- aov(length ~ treatment, data = peas.m) summary(mod)
## Df Sum Sq Mean Sq F value Pr(>F) ## treatment 4 1077.3 269.33 49.37 6.74e-16 *** ## Residuals 45 245.5 5.46 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pf(49.37, 4, 45, lower.tail = FALSE)
The model reads: \(y_{ij} = \mu + \alpha_i + \varepsilon_{ij}\), where the \(\alpha_i\)'s represent average group variations from the 'grand mean', \(\mu\).
A table of effects can be obtained using:
model.tables(mod)
## Tables of effects ## ## treatment ## treatment ## ctrl X2G X2F X1G1F X2S ## 8.16 -2.64 -3.74 -3.94 2.16
This is the same as:
aggregate(length ~ treatment, peas.m, mean)$length - mean(peas.m$length)
d <- cbind.data.frame(peas.m, fit = fitted(mod), residual = resid(mod)) res <- aggregate(. ~ treatment, data=d, mean) res$effect <- res$fit - mean(peas.m$length) res
## treatment length fit residual effect ## 1 ctrl 70.1 70.1 -4.330303e-16 8.16 ## 2 X2G 59.3 59.3 5.328637e-16 -2.64 ## 3 X2F 58.2 58.2 4.440892e-16 -3.74 ## 4 X1G1F 58.0 58.0 6.772740e-16 -3.94 ## 5 X2S 64.1 64.1 3.552931e-16 2.16
If the F-test is significant, then we reject the null, \(H_0:\alpha_1=\dots=\alpha_5\). We can conclude that the average length of pea sections differ between the groups (at least two of them), and that this difference is statistically significant at the 5% level (\(p<0.001\)).
What are the problems?
Solutions:
We need to carry out some kind of post-hoc multiple comparisons.
with(peas.m, pairwise.t.test(length, treatment, p.adj = "bonf"))
The ratio SS(effect) / SS(effect+residual) reflects the proportion of variance accounted for by the factor of interest.
summary(mod)[[1]][["Sum Sq"]]
It is also possible to get 95% CI for such effect size (Steiger, 2004; Nakagawa and Cuthill, 2007), see the MBESS package.
Formal tests of hypothesis (e.g., Bartlett and Levene tests) can be used, but it is better to verify that those assumptions are met based on some graphics (boxplot and QQ-plot). Or we can make use of a non-parametric test (Kruskal-Wallis ANOVA), or the same parametric test but some correction (Welch, 1951) (oneway.test()
).
Instead of working with only one factor and a single working hypothesis to test, we might want to study:
This raises several questions, notably the effect of one factor can be different depending on the levels of the other factor (Sokal and Rohlf, 1995; Montgomery, 2012).
Let \(y_{ijk}\) be the \(k\text{th}\) observation for level \(i\) of factor \(A\) (\(i=1,\dots,a\)) and level \(j\) of factor \(B\) (\(j=1,\dots,b\)). The full model can be written as follows:
\[ y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}, \]
where \(\mu\) is the overall mean, \(\alpha_i\) (\(\beta_j\)) is the deviation of the \(a\) (\(b\)) means from the overall mean, \(\gamma_{ij}\) is the deviation of the \(A\times B\) treatments from \(\mu\), and \(\varepsilon_{ijk}\sim {\cal N}(0,\sigma^2)\) is the residual term.
The \(\alpha_i\) et \(\beta_j\) are called main effects, and \(\gamma_{ij}\) is the interaction effect.
Null hypotheses associated to the full factorial design are given below:
The ratio of Mean Squares corresponding to each factor and that of the residuals can be tested with Fisher-Snedecor F-tests.
If the interaction is significant, this means that the effect of the first factor depends on the levels of the second factor. In other words, there is no simple effect for the first factor, and we need to study its effect for each level of the second factor. If the interaction is not statistically significant, then we can remove this term from the model if it was not of primary interest, and refit the model.
Often times, the interaction term is the effect we are interested in. But we need to describe the direction and magnitude of this effect carefully. Graphical displays are useful, but be careful with error bars (Cumming and Finch, 2005; Masson, 2003)!
Finally, why a two-way ANOVA rather than two one-way ANOVAs?
Because of Simpson's paradox and biased estimates of the residual.
The main commands are: aggregate()
, aov()
, summary.aov()
, model.tables()
, interaction.plot()
, plot.design()
.
See Lab 3 : Analysis of variance with R for an application of two-way ANOVA and extended discussion.
[1] G. Cumming and S. Finch. "Inference by Eye, Confidence Intervals and How to Read Pictures of Data". In: American Psychologist 60.2 (2005), pp. 170-180.
[2] M. Masson. "Using confidence intervals for graphically based data interpretation". In: Canadian Journal of Experimental Psychology 57.3 (2003), pp. 203-220.
[3] D. Montgomery. Design and Analysis of Experiments. 8th. John Wiley & Sons, 2012.
[4] S. Nakagawa and I. Cuthill. "Effect size, confidence interval and statistical significance: a practical guide for biologists". In: Biological Reviews of the Cambridge Philosophical Society 82.4 (2007), pp. 591-605.
[5] R. Sokal and F. Rohlf. Biometry. 3rd. WH Freeman and Company, 1995.
[6] J. Steiger. "Beyond the F Test: Effect size confidence intervals and tests of close fit in the Analysis of Variance and Contrast Analysis". In: Psychological Methods 9 (2004), pp. 164-182.
[7] B. Welch. "On the comparison of several mean values: An alternative approach". In: Biometrika 38 (1951), pp. 330-336.