October 20, 2015


The analysis of variance is not a mathematical theorem, but rather a convenient method of arranging the arithmetic. Ronald Fisher (1890-1962)

design of experiments • R data structure • split-apply-combine • one-way ANOVA

Lectures: OpenIntro Statistics, 4.2, 5.2, 5.5.

Design of experiments

Maximize precision while minimizing number of trials

Implementation of an organized set of experimental units to characterize the effect of certain treatments or combination of treatments, on one or more response variables.

Taking into account one or more nuisance factors for the establishment of experimental design: organize sources of unwanted variation so that we can say that they affect treatment equivalently, making the comparison between treatments possible.

Some examples

  • Parallel (independent) groups
  • Completely randomized design
  • Incomplete block design (e.g., Latin square)
  • Split-plot design
  • Repeated measures, including cross-over trials

Randomization (random allocation of units to treatments–experimental vs. quasi-experimental design), factorial arrangement of treatments, and blocking (grouping of similar units based on known but irrelevant characteristics) are keys components of experimental design (Montgomery, 2012).

Comparing two groups

Let us consider two series of measurement collected on \(n = n_1 + n_2\) subjects. A classification factor (e.g., gender) is used to allocate \(n_1\) and \(n_2\) independant statistical units in two groups. This is a simple illustration of two parallel groups.

In clinical trials, usually we consider a control group and an 'active arm'. The control group serves as a comparator. Randomization ensures that all factors that could potentially affect the outcome are balanced across the two groups. Therefore, any observed effect can be explained by what makes the groups different.

To compare two means, a test statistic can be constructed as follows: \[ t_{\text{obs}}=\frac{\bar x_1 - \bar x_2}{s_c\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}, \] where the \(\bar x_i\)'s and \(n_i\)'s are sample means and sizes, and \(s_c=\left[\left((n_1-1)s^2_{x_1}+(n_2-1)s^2_{x_2}\right)/(n_1+n_2-2)\right]^{1/2}\) is the common variance. Under \(H_0\), this test statistic follow a Student distribution with \(n_1+n_2-2\) dof.

Data structures in R

R representation of two-group data

Two possible ways: wide form and long form.

n <- 20
x <- rnorm(n, mean = 120, sd = 10)
g <- gl(n = 2, k = 10, len = n)
d <- data.frame(x1 = x[g == 1], x2 = x[g == 2])
head(d, 4)
##         x1       x2
## 1 116.7396 125.2645
## 2 125.5246 112.0516
## 3 113.2506 134.2776
## 4 122.1436 105.3318

The two columns define two response variables: not always easy to manipulate, except in the two-group case.

da <- data.frame(outcome = x, group = g)
levels(da$group) <- c("Control", "Treated")
##    outcome   group
## 1 116.7396 Control
## 2 125.5246 Control
## 3 113.2506 Control
## 4 122.1436 Control
## 5 123.1077 Control
## 6 131.7397 Control

There are now two variables: a response variable and a factor with two levels. Now it is easy to do things like this: describe outcome as a function of group.

aggregate(outcome ~ group, data = da, mean)

Yet another way to do this:

db <- melt(d)
##   variable    value
## 1       x1 116.7396
## 2       x1 125.5246
## 3       x1 113.2506
## 4       x1 122.1436
## 5       x1 123.1077
## 6       x1 131.7397

This produces the same result, but it is easily extended to two factors or more.

aggregate(value ~ variable, data = db, mean)

Describing variables relationships

R relies on a formula to describe the relation between one or multiple response variables and one or more explanatory variables, according to Wilkinson & Rogers's notation (Wilkinson and Rogers, 1973; Chambers and Hastie, 1992).

This is a convenient way of representing relations between variables when designing an experiment, but it assumes that data are in the so-called long format. We must always describe a response variable as a function of other variables, unless we want to build a design matrix by hand. Data reshaping (melting/casting) utilities are essential tools.

Note: R's formulae (and data frames, de facto) have been ported to Python and Julia.

In the case of ANOVA and regression, a single response variable is put on the left of the ~ operator, followed by

RHS Variable type Meaning Equiv. to
x numeric simple linear regression 1 + x
x + 0 numeric idem without intercept x - 1
a + b factor (numeric) two main crossed effects
a * b factor idem including interaction 1 + a + b + a:b
a / b factor nested relationship 1 + a + b + a %in% b

R's formula and data analysis

Most of the time, the same formula can be used to perform several 'data steps' (tidying and summarizing data, plotting, or reporting), but it is also the core element of many statistical models in R (linear and generalized linear models, decision trees, partitioning methods, etc.).

Split – Apply – Combine

(…) break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together." (Wickham, 2011)

See the plyr or dplyr packages.

Aggregation of data

grp <- gl(n = 4, k = 5, labels = letters[1:4])
spl <- split(x, grp)        # split x by levels of grp
apl <- lapply(spl, mean)    # apply mean() to each split 
cbn <- rbind(x = apl)       # combine the means
##   a        b        c        d       
## x 120.1532 124.7476 118.9117 112.2948
## shorter version (other than `by()`, `tapply()`, or `ave()`):
aggregate(x ~ grp, FUN=mean)
##   grp        x
## 1   a 120.1532
## 2   b 124.7476
## 3   c 118.9117
## 4   d 112.2948

The analysis of variance

One-way ANOVA

Let \(y_{ij}\) be the \(j\text{th}\) observation in group \(i\) (factor \(A\), with \(a\) levels). An effect model can be written as

\[ y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \]

where \(\mu\) stands for the overall (grand) mean, \(\alpha_i\) is the effect of group or treatment \(i\) (\(i=1,\dots,a\)), and \(\varepsilon_{ij}\sim \mathcal{N}(0,\sigma^2)\) reflects random error. The following restriction is usually considered: \(\sum_{i=1}^a\alpha_i=0\).

The null hypothesis reads: \(H_0:\alpha_1=\alpha_2=\dots=\alpha_a\). It can be tested with an F-test with \(a-1\) et \(N-a\) degrees of freedom.

The big picture

Each observation can be seen as a deviation from its group mean, \(y_{ij}=\bar y_i+\varepsilon_{ij}\). Then, the whole variability can be expressed as follows:

\[\underbrace{(y_{ij}-\bar y)}_{\text{total}}=\underbrace{(\bar y_{i\phantom{j}}\hskip-.5ex-\bar y)}_{\text{group}} + \underbrace{(y_{ij}-\bar y_i)}_{\text{residuals}}.\]

Hence the term: decomposition of variance.

Assumptions, caveats, etc.

  • This is an omnibus test for which the alternative hypothesis reads \(\exists i,j\mid \alpha_i\neq\alpha_j,\: i, j=1,\dots,a\, (i\neq j)\). If the result is significant, it doesn't tell us which pairs of means really differ.
  • Beside independence of observations, this model assumes that variances are equal in each population (which is hard to assess with formal tests) and that residuals are approximately normally distributed.
  • As always, a statistically significant result does not necessarily mean an interesting result from a practical point of view: We need to provide a summary of raw or standardized effects.

Different scenarios



[1] J. Chambers and T. Hastie. Statistical Models in S. ISBN: 0534167649. Wadsworth & Brooks, 1992.

[2] D. Montgomery. Design and Analysis of Experiments. 8th. John Wiley & Sons, 2012.

[3] H. Wickham. "The Split-Apply-Combine Strategy for Data Analysis". In: Journal of Statistical Software 40.1 (2011).

[4] G. Wilkinson and C. Rogers. "Symbolic description of factorial models for analysis of variance". In: Applied Statistics 22 (1973), pp. 392-399.