Against the systematic use of Fisher's exact test

September 16, 2010

Here is an article published in Statistics in Medicine last year that argues against the systematic use of Fisher’s exact test: Lydersen, S., Fagerland, M.W., and Laake, P., Recommended tests for association in 2×2 tables, Statistics in Medicine (2009) 28: 1159-1175.

Since this is the test most often reported in clinical biostatistics and epidemiology articles, let’s look at the rationale behind this claim. Below is the abstract of the article by Lydersen et al.

The asymptotic Pearson’s chi-squared test and Fisher’s exact test have long been the most used for testing association in 2×2 tables. Unconditional tests preserve the significance level and generally are more powerful than Fisher’s exact test for moderate to small samples, but previously were disadvantaged by being computationally demanding. This disadvantage is now moot, as software to facilitate unconditional tests has been available for years. Moreover, Fisher’s exact test with mid-p adjustment gives about the same results as an unconditional test. Consequently, several better tests are available, and the choice of a test should depend only on its merits for the application involved. Unconditional tests and the mid-p approach ought to be used more than they now are. The traditional Fisher’s exact test should practically never be used.

The chi-square and Fisher’s test

First of all, let’s recall the difference between Fisher’s test and the standard $\chi^2$ test. Basically, Pearson’s $\chi^2$ is the sum of the squared differences between observed and expected counts, each divided by the expected count (to get a convenient normalization). Under the hypothesis of no association (i.e., statistical independence between the row and column variables), the expected count for cell n_ij is simply the product of its corresponding marginal totals, n_i+ and n_+j, divided by the total sample size n. In large samples, the observed $\chi^2$ statistic approximately follows a $\chi^2$(1) distribution. It generalizes to I×J tables, where the number of degrees of freedom is (I-1)(J-1). Fisher’s test is an exact test relying on the hypergeometric distribution (where all possible tables having the same marginal totals are enumerated). It is a computer-intensive job, and in R (fisher.test()) it is relegated to a C function.
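To make both definitions concrete, here is a minimal sketch in R that computes Pearson’s statistic and the two-sided Fisher p-value by hand. The counts are the ones from Dallal’s example discussed later in this post; the variable names are mine, and the tolerance in the last step is an assumption meant to mimic what fisher.test() does when comparing table probabilities.

```r
# 2x2 table of counts (same numbers as Dallal's example), filled by column
tab <- matrix(c(8, 3, 5, 10), nrow = 2)
n <- sum(tab)

# Expected counts under independence: (row total x column total) / n
expected <- outer(rowSums(tab), colSums(tab)) / n

# Pearson's chi-squared: squared (observed - expected), scaled by expected
x2 <- sum((tab - expected)^2 / expected)
p_x2 <- pchisq(x2, df = 1, lower.tail = FALSE)

# Fisher's test: with both margins fixed, the top-left cell is hypergeometric
m <- sum(tab[, 1])                          # first column total
k <- sum(tab[1, ])                          # first row total
support <- max(0, k - (n - m)):min(k, m)    # possible values of the cell
probs <- dhyper(support, m, n - m, k)
p_obs <- dhyper(tab[1, 1], m, n - m, k)

# Two-sided p-value: total probability of the tables that are
# no more likely than the observed one (small relative tolerance assumed)
p_fisher <- sum(probs[probs <= p_obs * (1 + 1e-7)])

c(x2 = x2, p_x2 = p_x2, p_fisher = p_fisher)
```

The hand-computed values can be checked against chisq.test(tab, correct = FALSE) and fisher.test(tab).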

It is important to recall that Fisher’s test assumes that both margins are fixed, whereas Pearson’s $\chi^2$ might be used to test either independence, or homogeneity of proportions when one marginal total is fixed.

A more comprehensive framework is provided in Sokal and Rohlf,⁽¹⁾ pp. 721 ff. They consider three different kinds of sampling schemes or designs, named Model I, II and III, depending on whether “the totals in the margins of the 2x2 Table are fixed by the investigator or are free to vary and reflect population parameters.”

Model I considers all tables with the same total sample size n (an alternative Model I is one where the same individuals are assessed twice, but the hypothesis to be tested is then clearly different).
Model II considers that one of the margin totals is fixed.
Model III considers that both margin totals are fixed.
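One way to get a feel for the three designs is to simulate a table under each of them. The sketch below is only illustrative: the totals, the cell probabilities and the seed are arbitrary choices of mine, and all three tables are drawn under the hypothesis of no association.

```r
set.seed(1)  # arbitrary seed, for reproducibility only
n <- 26

# Model I: only the grand total n is fixed (multinomial over the four cells)
p_cells <- rep(0.25, 4)  # hypothetical cell probabilities
model1 <- matrix(rmultinom(1, size = n, prob = p_cells), nrow = 2)

# Model II: one margin (here the row totals) is fixed,
# each row being an independent binomial sample
row_tot <- c(13, 13)
x <- rbinom(2, size = row_tot, prob = 0.5)  # hypothetical common proportion
model2 <- cbind(x, row_tot - x)

# Model III: both row and column totals are fixed
col_tot <- c(11, 15)
model3 <- r2dtable(1, row_tot, col_tot)[[1]]

list(model1 = model1, model2 = model2, model3 = model3)
```

Only the third table is guaranteed to reproduce both sets of marginal totals, which is the situation Fisher’s test conditions on.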

Fisher’s test is appropriate for a Model III design. The G-test of independence is designed for the other two models. The G-test is a likelihood ratio test whose expression might be derived from multinomial expectations (p. 692) as

$$ G=2\sum_{i=1}^af_i\ln\left(\frac{f_i}{\hat f_i}\right), $$

i.e., it summarizes “the sum of the independent contributions of departures from expectations ln(•) weighted by the frequency of the particular class (f_i).” For a 2×2 table, G is approximately $\chi^2$(1) distributed.
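The formula translates almost literally into R. This is a minimal sketch on an illustrative 2×2 table (the counts are those of Dallal’s example used further down); treating empty cells as contributing zero is a common convention that I assume here.

```r
tab <- matrix(c(8, 3, 5, 10), nrow = 2)

# Expected frequencies under independence
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# G = 2 * sum of f * ln(f / f_hat); cells with f = 0 contribute 0 by convention
terms <- ifelse(tab > 0, tab * log(tab / expected), 0)
G <- 2 * sum(terms)

# Refer G to a chi-squared distribution with (2-1)(2-1) = 1 degree of freedom
p <- pchisq(G, df = 1, lower.tail = FALSE)
round(c(G = G, p = p), 4)
```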

About Yates’ correction

Yates’ continuity correction consists in subtracting 0.5 from the absolute difference between each observed value and its expected value; it applies to 2×2 tables only.⁽¹⁾

Yates’ correction is known to yield conservative tests, as does Fisher’s “exact” test. However, we can read in Agresti (CDA, 2002, p. 103) that

Yates (1934) mentioned that Fisher suggested the hypergeometric to him for an exact test. He proposed a continuity-corrected version of $\chi^2$, $$ \chi_c^2=\sum\sum\frac{\left(\mid n_{ij}-\hat\mu_{ij}\mid -0.5\right)^2}{\hat\mu_{ij}} $$ to approximate the exact test. (…) Since software now makes Fisher’s exact test feasible even with large samples, this correction is no longer needed.
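As a quick check of the formula, the corrected statistic can be computed by hand and compared with what chisq.test() returns by default for a 2×2 table. A minimal sketch, again on an illustrative table of counts (Dallal’s example, used below):

```r
tab <- matrix(c(8, 3, 5, 10), nrow = 2)
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Shrink each |observed - expected| by 0.5 before squaring
x2_c <- sum((abs(tab - expected) - 0.5)^2 / expected)
p_c <- pchisq(x2_c, df = 1, lower.tail = FALSE)

# chisq.test() applies the continuity correction by default for 2x2 tables
same <- all.equal(x2_c, unname(chisq.test(tab)$statistic))
c(x2_c = x2_c, p_c = p_c)
```

Note that chisq.test() never corrects by more than the absolute difference itself, which makes no difference here since every difference equals 2.5.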

There is an excellent tutorial by Jerry Dallal on Contingency Tables, and the same remark is made: the continuity-corrected chi-square is useless as an approximation to Fisher’s exact test now that the latter is readily available in any statistical software. Dallal considers the following example:

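# Dallal's 2x2 table, filled column by column
tab <- matrix(c(8,3,5,10), nrow=2)
chisq.test(tab, correct=FALSE)  # uncorrected Pearson chi-squared
chisq.test(tab)                 # with Yates' continuity correction (default for 2x2)
fisher.test(tab)                # Fisher's exact test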

Test	$\chi^2$	p-value
Uncorrected $\chi^2$	3.9394	0.0472
Corrected $\chi^2$	2.5212	0.1123
Fisher’s exact test	--	0.1107

As can be seen, the p-values from the continuity-corrected $\chi^2$ and Fisher’s test are in close agreement. The LRT (G-test) would yield $\chi^2$(1)=4.0573, p=0.0440. It is available through assocstats() from the vcd package.

The mid-p approach

Oops, it looks like this has yet to be written.
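In the meantime, here is a minimal sketch of the general idea: the mid-p value counts only half of the probability of the observed table, instead of all of it. Several two-sided conventions exist; the one assumed below doubles the smaller one-sided mid-p value, and the variable names are mine. The table is Dallal’s example from the previous section.

```r
tab <- matrix(c(8, 3, 5, 10), nrow = 2)
n <- sum(tab)
m <- sum(tab[, 1])   # first column total
k <- sum(tab[1, ])   # first row total
x <- tab[1, 1]       # observed top-left cell

# Conditional (hypergeometric) distribution of the top-left cell
support <- max(0, k - (n - m)):min(k, m)
probs <- dhyper(support, m, n - m, k)
p_obs <- dhyper(x, m, n - m, k)

# One-sided mid-p values: more extreme tables plus half the observed one
upper <- sum(probs[support > x]) + 0.5 * p_obs
lower <- sum(probs[support < x]) + 0.5 * p_obs

# Two-sided mid-p (assumed convention: twice the smaller one-sided value)
p_mid <- min(1, 2 * min(lower, upper))
c(fisher = fisher.test(tab)$p.value, mid_p = p_mid)
```

On this table the mid-p value is noticeably smaller than the ordinary Fisher p-value, which illustrates why the traditional test is said to be conservative.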

In summary

All of the above would lead to the following practical considerations: For large samples, or tables satisfying Cochran’s rule¹, use the uncorrected chi-square test;⁽⁴⁾⁽⁵⁾ otherwise, use Fisher’s test. See also Assumptions/Restrictions for Chi-square Tests on Contingency Tables, by Bruce Weaver.
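This decision rule is easy to script. The helper below is a hypothetical function of mine (cochran_ok() is not part of any package), written as a rough sketch of the rule: no expected count below 1, and at most 20% of expected counts below 5.

```r
# Hypothetical helper: TRUE when Cochran's rule is satisfied
cochran_ok <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  all(expected >= 1) && mean(expected < 5) <= 0.20
}

tab <- matrix(c(8, 3, 5, 10), nrow = 2)

# Uncorrected chi-squared when the rule holds, Fisher's test otherwise
res <- if (cochran_ok(tab)) {
  chisq.test(tab, correct = FALSE)
} else {
  fisher.test(tab)
}
res$p.value
```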

Campbell⁽⁶⁾ discussed the use of Fisher’s test in lieu of the usual chi-square test, but see his associated website on Two by two Tables. As quoted on this website, the following rules are recommended:

Where all expected numbers are at least 1, analyse by the “N - 1 chi-squared test” (the K. Pearson chi-squared test but with N replaced by N - 1).
Otherwise, analyse by the Fisher-Irwin test, with two-sided tests carried out by Irwin’s rule (taking tables from either tail as likely, or less, as that observed).
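These two rules can be sketched in a few lines of R. The function name n_minus_1_test() is hypothetical, and I assume that the two-sided p-value returned by fisher.test() (which sums the probabilities of all tables no more likely than the observed one) matches Irwin’s rule as described above.

```r
# Hypothetical helper implementing the two rules for a 2x2 table
n_minus_1_test <- function(tab) {
  n <- sum(tab)
  expected <- outer(rowSums(tab), colSums(tab)) / n
  if (all(expected >= 1)) {
    # Pearson's chi-squared with N replaced by N - 1
    x2 <- sum((tab - expected)^2 / expected) * (n - 1) / n
    list(method = "N - 1 chi-squared", statistic = x2,
         p.value = pchisq(x2, df = 1, lower.tail = FALSE))
  } else {
    # Fisher-Irwin test, two-sided (assumed to follow Irwin's rule)
    list(method = "Fisher-Irwin", statistic = NA,
         p.value = fisher.test(tab)$p.value)
  }
}

res <- n_minus_1_test(matrix(c(8, 3, 5, 10), nrow = 2))
res$method
```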

Finally, Lydersen et al. argue that, as a conditional test, Fisher’s test may be replaced by more appropriate unconditional tests depending on the hypothesis of interest.
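To give an idea of what an unconditional test looks like, here is a rough sketch for two independent binomial samples (row totals fixed). It orders tables by Pearson’s chi-squared statistic and takes the largest p-value over a grid of values for the unknown common proportion. The function name, the grid and the tolerance are my own assumptions; this is not necessarily the exact procedure recommended in the article, and dedicated packages should be preferred in practice.

```r
# Hypothetical sketch of an unconditional exact test for a 2x2 table:
# x1 successes out of n1 in row 1, x2 successes out of n2 in row 2
uncond_test <- function(x1, n1, x2, n2,
                        grid = seq(0.001, 0.999, by = 0.001)) {
  # Pearson's chi-squared written in terms of the two proportions
  stat <- function(a, b) {
    pooled <- (a + b) / (n1 + n2)
    denom <- pooled * (1 - pooled) * (1 / n1 + 1 / n2)
    ifelse(denom > 0, (a / n1 - b / n2)^2 / denom, 0)
  }
  # All possible outcomes, only the row totals being fixed
  all_x <- expand.grid(a = 0:n1, b = 0:n2)
  t_obs <- stat(x1, x2)
  extreme <- all_x[stat(all_x$a, all_x$b) >= t_obs - 1e-9, ]
  # p-value at a given common proportion, then maximized over the grid
  p_at <- function(p) {
    sum(dbinom(extreme$a, n1, p) * dbinom(extreme$b, n2, p))
  }
  max(vapply(grid, p_at, numeric(1)))
}

# Dallal's table read row-wise: 8 out of 13 versus 3 out of 13
p_unc <- uncond_test(8, 13, 3, 13)
p_unc
```

Because the maximum is taken over a finite grid, the result is only an approximation of the supremum over all values of the nuisance parameter.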

References

1. Sokal, RR and Rohlf, FJ (1995). Biometry (3rd ed). Freeman.
2. Yates, F (1934). Contingency tables involving small numbers and the $\chi^2$ test. Supplement to the Journal of the Royal Statistical Society 1(2): 217–235.
3. Mantel, N (1974). Some Reasons for Not Using the Yates Continuity Correction on 2 × 2 Contingency Tables: Comment and a Suggestion. Journal of the American Statistical Association 69(346): 378–380.
4. Cochran, WG (1952). The $\chi^2$ test of goodness of fit. Annals of Mathematical Statistics 23: 315–345.
5. Cochran, WG (1954). Some methods for strengthening the common $\chi^2$ tests. Biometrics 10: 417–451.
6. Campbell, I (2007). Chi-squared and Fisher–Irwin tests of two-by-two tables with small sample recommendations. Statistics in Medicine 26(19): 3661–3675.
7. Haviland, MG (1990). Yates’s correction for continuity and the analysis of 2 × 2 contingency tables. Statistics in Medicine 9(4): 363–367.

¹ Cochran’s rule just says that no expected cell count may be less than 1 and that no more than 20% of the expected counts may be less than 5; in the 2×2 case, the second condition requires all four expected counts to be at least 5.

aliquote.org