Christophe Lalanne
October 20, 2015
The analysis of variance is not a mathematical theorem, but rather a convenient method of arranging the arithmetic. Ronald Fisher (1890-1962)
Readings: OpenIntro Statistics, 4.2, 5.2, 5.5.
Implementation of an organized set of experimental units to characterize the effect of certain treatments, or combinations of treatments, on one or more response variables.
Taking one or more nuisance factors into account when setting up the experimental design: organize the sources of unwanted variation so that we can say they affect all treatments equivalently, making the comparison between treatments possible.
Randomization (random allocation of units to treatments–experimental vs. quasi-experimental design), factorial arrangement of treatments, and blocking (grouping of similar units based on known but irrelevant characteristics) are key components of experimental design (Montgomery, 2012).
Let us consider two series of measurements collected on $n = n_1 + n_2$ subjects. A classification factor (e.g., gender) is used to allocate the $n_1$ and $n_2$ independent statistical units into two groups. This is a simple illustration of a two parallel-group design.
In clinical trials, we usually consider a control group and an 'active' arm. The control group serves as a comparator. Randomization ensures that all factors that could potentially affect the outcome are balanced across the two groups, so that any observed effect can be attributed to what makes the groups different, namely the treatment.
To compare two means, a test statistic can be constructed as follows: $$t_{\text{obs}} = \frac{\bar{x}_1 - \bar{x}_2}{s_c\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}},$$ where the $\bar{x}_i$'s and $n_i$'s are the sample means and sizes, and $s_c = \left[\frac{(n_1-1)s^2_{x_1}+(n_2-1)s^2_{x_2}}{n_1+n_2-2}\right]^{1/2}$ is the pooled (common) standard deviation. Under $H_0$, this test statistic follows a Student distribution with $n_1+n_2-2$ degrees of freedom.
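As a quick check, the statistic computed by hand should match the one returned by `t.test()` with `var.equal = TRUE`; the two small vectors below are made up for illustration only.

```r
## hand computation of the pooled-variance t statistic (made-up data)
x1 <- c(118, 127, 112, 131, 124)   # group 1
x2 <- c(105, 121, 116, 109, 119)   # group 2
n1 <- length(x1); n2 <- length(x2)
sc <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))  # pooled SD
tobs <- (mean(x1) - mean(x2)) / (sc * sqrt(1/n1 + 1/n2))
tobs
t.test(x1, x2, var.equal = TRUE)$statistic  # same value
```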
Two possible ways of organizing the data: the wide form and the long form.
n <- 20
x <- rnorm(n, mean = 120, sd = 10)
g <- gl(n = 2, k = 10, len = n)
d <- data.frame(x1 = x[g == 1], x2 = x[g == 2])
head(d, 4)
##         x1       x2
## 1 116.7396 125.2645
## 2 125.5246 112.0516
## 3 113.2506 134.2776
## 4 122.1436 105.3318
The two columns define two response variables, which is not always easy to manipulate, except in the two-group case.
da <- data.frame(outcome = x, group = g)
levels(da$group) <- c("Control", "Treated")
head(da)
##    outcome   group
## 1 116.7396 Control
## 2 125.5246 Control
## 3 113.2506 Control
## 4 122.1436 Control
## 5 123.1077 Control
## 6 131.7397 Control
There are now two variables: a response variable and a factor with two levels. Now it is easy to do things like this: describe `outcome` as a function of `group`.
aggregate(outcome ~ group, data = da, mean)
Yet another way to do this:
library(reshape2)
db <- melt(d)
head(db)
##   variable    value
## 1       x1 116.7396
## 2       x1 125.5246
## 3       x1 113.2506
## 4       x1 122.1436
## 5       x1 123.1077
## 6       x1 131.7397
This produces the same result, but it is easily extended to two factors or more.
aggregate(value ~ variable, data = db, mean)
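To illustrate that claim, a hedged sketch with a second, made-up factor (`centre` does not exist in the original data and is only added here for illustration):

```r
## hypothetical second factor: a centre indicator added to the long-format data
db$centre <- gl(2, 5, length = nrow(db), labels = c("A", "B"))
## the same formula-based call now summarizes over both factors
aggregate(value ~ variable + centre, data = db, mean)
```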
R relies on a formula to describe the relation between one or multiple response variables and one or more explanatory variables, according to Wilkinson & Rogers's notation (Wilkinson and Rogers, 1973; Chambers and Hastie, 1992).
This is a convenient way of representing relations between variables when designing an experiment, but it assumes that data are in the so-called long format. We must always describe a response variable as a function of other variables, unless we want to build a design matrix by hand. Data reshaping (melting/casting) utilities are essential tools.
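As a complement to `melt()` shown above, casting goes the other way, from long to wide format; the `id` column below is made up to identify subjects and is only an illustrative assumption:

```r
## back from long to wide format (hypothetical subject identifier)
db$id <- rep(1:10, 2)
dcast(db, id ~ variable, value.var = "value")
```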
Note: R's formulae (and data frames, de facto) have been ported to Python and Julia.
In the case of ANOVA and regression, a single response variable is put on the left of the `~` operator, followed by the explanatory variables on the right-hand side (RHS):
| RHS | Variable type | Meaning | Equiv. to |
|---|---|---|---|
| x | numeric | simple linear regression | 1 + x |
| x + 0 | numeric | idem, without intercept | x - 1 |
| a + b | factor (numeric) | two crossed main effects | |
| a * b | factor | idem, including interaction | 1 + a + b + a:b |
| a / b | factor | nested relationship | 1 + a + b + a %in% b |
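As a rough illustration of these equivalences (the toy vectors `a`, `b`, and `z` below are made up, with `z` playing the role of the numeric `x` in the table), `model.matrix()` shows the columns generated by each formula:

```r
## toy factors and a numeric covariate
a <- gl(2, 6, labels = c("a1", "a2"))
b <- gl(3, 2, length = 12, labels = c("b1", "b2", "b3"))
z <- seq_len(12)
colnames(model.matrix(~ a * b))            # same columns as 1 + a + b + a:b
colnames(model.matrix(~ 1 + a + b + a:b))
colnames(model.matrix(~ z + 0))            # same as z - 1 (no intercept)
colnames(model.matrix(~ z - 1))
```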
Most of the time, the same formula can be used to perform several 'data steps' (tidying and summarizing data, plotting, or reporting), but it is also the core element of many statistical models in R (linear and generalized linear models, decision trees, partitioning methods, etc.).
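A sketch of that idea, reusing the `da` data frame built above (the name `f` is simply chosen here for the stored formula):

```r
## one formula reused across several data steps
f <- outcome ~ group
aggregate(f, data = da, mean)            # summary statistics by group
boxplot(f, data = da)                    # graphical display
summary(lm(f, data = da))$coefficients   # linear model (equivalent to a t-test here)
```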
"(…) break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together." (Wickham, 2011)
grp <- gl(n = 4, k = 5, labels = letters[1:4])
spl <- split(x, grp)       # split x by levels of grp
apl <- lapply(spl, mean)   # apply mean() to each split
cbn <- rbind(x = apl)      # combine the means
cbn
##   a        b        c        d
## x 120.1532 124.7476 118.9117 112.2948
## shorter version (other than `by()`, `tapply()`, or `ave()`):
aggregate(x ~ grp, FUN = mean)
##   grp        x
## 1   a 120.1532
## 2   b 124.7476
## 3   c 118.9117
## 4   d 112.2948
Let $y_{ij}$ be the $j$th observation in group $i$ (factor A, with $a$ levels). An effect model can be written as
$$y_{ij} = \mu + \alpha_i + \varepsilon_{ij},$$
where $\mu$ stands for the overall (grand) mean, $\alpha_i$ is the effect of group or treatment $i$ ($i = 1, \dots, a$), and $\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)$ reflects random error. The following restriction is usually considered: $\sum_{i=1}^a \alpha_i = 0$.
The null hypothesis reads $H_0: \alpha_1 = \alpha_2 = \dots = \alpha_a$. It can be tested with an F-test with $a-1$ and $N-a$ degrees of freedom.
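A minimal sketch with the `x` and `grp` vectors simulated above (so that $a = 4$ and $N = 20$):

```r
## one-way ANOVA on the toy data generated above
summary(aov(x ~ grp))   # F-test with a-1 = 3 and N-a = 16 degrees of freedom
```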
Each observation can be seen as a deviation from its group mean, $y_{ij} = \bar{y}_i + \varepsilon_{ij}$. Then, the whole variability can be expressed as follows:
$$\underbrace{(y_{ij} - \bar{y})}_{\text{total}} = \underbrace{(\bar{y}_{i} - \bar{y})}_{\text{group}} + \underbrace{(y_{ij} - \bar{y}_{i})}_{\text{residuals}}.$$
Hence the term: decomposition of variance.
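This decomposition can be checked numerically, e.g., on the same `x` and `grp` used above:

```r
## numerical check of the decomposition of the sum of squares
grand.mean <- mean(x)
group.mean <- ave(x, grp)                  # each observation's group mean
ss.total <- sum((x - grand.mean)^2)
ss.group <- sum((group.mean - grand.mean)^2)
ss.resid <- sum((x - group.mean)^2)
all.equal(ss.total, ss.group + ss.resid)   # should return TRUE
```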
[1] J. Chambers and T. Hastie. Statistical Models in S. ISBN: 0534167649. Wadsworth & Brooks, 1992.
[2] D. Montgomery. Design and Analysis of Experiments. 8th ed. John Wiley & Sons, 2012.
[3] H. Wickham. "The Split-Apply-Combine Strategy for Data Analysis". In: Journal of Statistical Software 40.1 (2011).
[4] G. Wilkinson and C. Rogers. "Symbolic description of factorial models for analysis of variance". In: Applied Statistics 22 (1973), pp. 392-399.