October 20, 2015

The analysis of variance is not a mathematical theorem, but rather a convenient method of arranging the arithmetic. Ronald Fisher (1890-1962)

**Lectures:** OpenIntro Statistics, 4.2, 5.2, 5.5.

Implementation of an organized set of experimental units to **characterize the effect of certain treatments**, or combinations of treatments, on one or more response variables.

Taking into account one or more nuisance factors when setting up the experimental design: **organize sources of unwanted variation** so that they affect all treatments equivalently, making the comparison between treatments possible.

- Parallel (independent) groups
- Completely randomized design
- Incomplete block design (e.g., Latin square)
- Split-plot design
- Repeated measures, including cross-over trials

**Randomization** (random allocation of units to treatments, which distinguishes experimental from quasi-experimental designs), **factorial arrangement** of treatments, and **blocking** (grouping of similar units based on known but irrelevant characteristics) are key components of experimental design (Montgomery, 2012).

Let us consider two series of measurements collected on \(n = n_1 + n_2\) subjects. A classification factor (e.g., gender) is used to allocate \(n_1\) and \(n_2\) independent statistical units to two groups. This is a simple illustration of **two parallel groups**.

In clinical trials, we usually consider a control group and an 'active arm'. The control group serves as a comparator. Randomization ensures that all factors that could potentially affect the outcome are balanced across the two groups. Therefore, any observed effect can be attributed to the one thing that differs between the groups: the treatment.

To compare two means, a test statistic can be constructed as follows: \[ t_{\text{obs}}=\frac{\bar x_1 - \bar x_2}{s_c\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}, \] where the \(\bar x_i\)'s and \(n_i\)'s are sample means and sizes, and \(s_c=\left[\left((n_1-1)s^2_{x_1}+(n_2-1)s^2_{x_2}\right)/(n_1+n_2-2)\right]^{1/2}\) is the pooled (common) standard deviation. Under \(H_0\), this test statistic follows a Student distribution with \(n_1+n_2-2\) degrees of freedom.
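
As a minimal sketch (the sample sizes and group means below are arbitrary, not part of the original example), the pooled t statistic can be computed by hand and checked against `t.test()` with `var.equal = TRUE`:

set.seed(42)
x1 <- rnorm(15, mean = 120, sd = 10)
x2 <- rnorm(12, mean = 125, sd = 10)
n1 <- length(x1); n2 <- length(x2)
sc <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))   # pooled SD
tobs <- (mean(x1) - mean(x2)) / (sc * sqrt(1/n1 + 1/n2))                # t statistic
tobs
t.test(x1, x2, var.equal = TRUE)$statistic   # same value from the built-in test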

There are two possible ways to organize such data: **wide form** and **long form**.

n <- 20
x <- rnorm(n, mean = 120, sd = 10)
g <- gl(n = 2, k = 10, len = n)
d <- data.frame(x1 = x[g == 1], x2 = x[g == 2])
head(d, 4)

##         x1       x2
## 1 116.7396 125.2645
## 2 125.5246 112.0516
## 3 113.2506 134.2776
## 4 122.1436 105.3318

The two columns define two response variables, which is not always easy to manipulate, except in the two-group case.

da <- data.frame(outcome = x, group = g)
levels(da$group) <- c("Control", "Treated")
head(da)

##    outcome   group
## 1 116.7396 Control
## 2 125.5246 Control
## 3 113.2506 Control
## 4 122.1436 Control
## 5 123.1077 Control
## 6 131.7397 Control

There are now **two variables**: a response variable and a factor with two levels. Now it is easy to do things like describe `outcome` as a function of `group`.

aggregate(outcome ~ group, data = da, mean)

Yet another way to do this:

library(reshape2)
db <- melt(d)
head(db)

##   variable    value
## 1       x1 116.7396
## 2       x1 125.5246
## 3       x1 113.2506
## 4       x1 122.1436
## 5       x1 123.1077
## 6       x1 131.7397

This produces the same result, but it is easily extended to two factors or more.

aggregate(value ~ variable, data = db, mean)

R relies on a **formula** to describe the relation between one or multiple response variables and one or more explanatory variables, according to Wilkinson & Rogers's notation (Wilkinson and Rogers, 1973; Chambers and Hastie, 1992).

This is a convenient way of representing relations between variables when designing an experiment, but it assumes that data are in the so-called long format. We must always describe a response variable as a function of other variables, unless we want to build a design matrix by hand. Data **reshaping** (melting/casting) utilities are essential tools.
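
As a rough illustration of casting (the `id` column is introduced here purely for the example, assuming the ten subjects per column above), `dcast()` from reshape2 turns the long table back into the wide layout:

db$id <- rep(1:10, 2)                           # subject identifier (illustration only)
head(dcast(db, id ~ variable, value.var = "value"))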

**Note:** R's formulae (and data frames, *de facto*) have been ported to Python and Julia.

In the case of ANOVA and regression, a **single response variable** is put on the left of the `~` operator, followed by the explanatory terms on the right-hand side (RHS):

| RHS   | Variable type    | Meaning                     | Equiv. to        |
|-------|------------------|-----------------------------|------------------|
| x     | numeric          | simple linear regression    | 1 + x            |
| x + 0 | numeric          | idem, without intercept     | x - 1            |
| a + b | factor (numeric) | two main crossed effects    | 1 + a + b        |
| a * b | factor           | idem, including interaction | 1 + a + b + a:b  |
| a / b | factor           | nested relationship         | 1 + a + b %in% a |
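
A quick way to see how these right-hand sides expand is to inspect the design matrix R builds from them; a small sketch with two toy factors (the names `a` and `b` are arbitrary and not part of the data used above):

a <- gl(2, 2, length = 8, labels = c("a1", "a2"))
b <- gl(2, 1, length = 8, labels = c("b1", "b2"))
colnames(model.matrix(~ a + b))   # two main effects
colnames(model.matrix(~ a * b))   # main effects + interaction
colnames(model.matrix(~ a / b))   # b nested within a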

Most of the time, the same formula can be used to perform several 'data steps' (tidying and summarizing data, plotting, or reporting), but it is also the core element of many statistical models in R (linear and generalized linear models, decision trees, partitioning methods, etc.).
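
For instance, a single formula object (here called `f`, a name chosen for the example) can be reused across these steps with the `da` data frame created above:

f <- outcome ~ group
aggregate(f, data = da, mean)   # summarizing
boxplot(f, data = da)           # plotting
summary(lm(f, data = da))       # modelling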

"(…) break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together." (Wickham, 2011)

grp <- gl(n = 4, k = 5, labels = letters[1:4])
spl <- split(x, grp)       # split x by levels of grp
apl <- lapply(spl, mean)   # apply mean() to each split
cbn <- rbind(x = apl)      # combine the means
cbn

##   a        b        c        d
## x 120.1532 124.7476 118.9117 112.2948

## shorter version (other than `by()`, `tapply()`, or `ave()`):
aggregate(x ~ grp, FUN = mean)

##   grp        x
## 1   a 120.1532
## 2   b 124.7476
## 3   c 118.9117
## 4   d 112.2948

Let \(y_{ij}\) be the \(j\text{th}\) observation in group \(i\) (factor \(A\), with \(a\) levels). An **effect model** can be written as

\[ y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \]

where \(\mu\) stands for the overall (grand) mean, \(\alpha_i\) is the effect of group or treatment \(i\) (\(i=1,\dots,a\)), and \(\varepsilon_{ij}\sim \mathcal{N}(0,\sigma^2)\) reflects random error. The following restriction is usually considered: \(\sum_{i=1}^a\alpha_i=0\).

The **null hypothesis** reads: \(H_0:\alpha_1=\alpha_2=\dots=\alpha_a\). It can be tested with an F-test with \(a-1\) and \(N-a\) degrees of freedom, where \(N\) denotes the total number of observations.
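
As a quick sketch, reusing the `x` and `grp` variables simulated above (\(a=4\) groups, \(N=20\) observations), `aov()` gives the corresponding F test with \(3\) and \(16\) degrees of freedom:

m <- aov(x ~ grp)
summary(m)   # ANOVA table: F statistic with 3 and 16 df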

Each observation can be seen as a deviation from its group mean, \(y_{ij}=\bar y_i+\varepsilon_{ij}\). Then, the whole variability can be expressed as follows:

\[\underbrace{(y_{ij}-\bar y)}_{\text{total}} = \underbrace{(\bar y_{i}-\bar y)}_{\text{group}} + \underbrace{(y_{ij}-\bar y_i)}_{\text{residuals}}.\]

Hence the term: **decomposition of variance**.
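
This identity can be checked numerically (a small sketch, again reusing `x` and `grp` from above): the total sum of squares equals the between-group plus the within-group (residual) sums of squares.

grand_mean  <- mean(x)
group_means <- ave(x, grp)                 # each observation's group mean
ss_total <- sum((x - grand_mean)^2)
ss_group <- sum((group_means - grand_mean)^2)
ss_resid <- sum((x - group_means)^2)
all.equal(ss_total, ss_group + ss_resid)   # TRUE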

- This is an **omnibus test** for which the alternative hypothesis reads \(\exists\, i,j\mid \alpha_i\neq\alpha_j,\: i,j=1,\dots,a\, (i\neq j)\). If the result is significant, it does not tell us which pairs of means really differ.
- Besides independence of observations, this model assumes that **variances are equal in each population** (which is hard to assess with formal tests) and that residuals are approximately normally distributed.
- As always, a statistically significant result does not necessarily mean an interesting result from a practical point of view: we need to provide a summary of **raw or standardized effects** (see the sketch after this list).
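
As a minimal sketch (still with `x` and `grp` from above), pairwise comparisons and a standardized effect size such as \(\eta^2\) can complement the omnibus F test; the Bonferroni adjustment used here is just one possible choice:

pairwise.t.test(x, grp, p.adjust.method = "bonferroni")   # which pairs differ?
## eta-squared: share of total variability explained by the grouping factor
ss <- summary(aov(x ~ grp))[[1]][["Sum Sq"]]
ss[1] / sum(ss)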

[1] J. Chambers and T. Hastie. *Statistical Models in S*. ISBN: 0534167649. Wadsworth & Brooks, 1992.

[2] D. Montgomery. *Design and Analysis of Experiments*. 8th ed. John Wiley & Sons, 2012.

[3] H. Wickham. "The Split-Apply-Combine Strategy for Data Analysis". In: *Journal of Statistical Software* 40.1 (2011).

[4] G. Wilkinson and C. Rogers. "Symbolic description of factorial models for analysis of variance". In: *Applied Statistics* 22 (1973), pp. 392-399.