We are often tasked with computing the number of subjects needed to achieve a certain statistical power, especially in biomedical sciences. Various formulae are available online or in dedicated textbooks.1 Here is a simple illustration with a two-sample proportion: Let’s say we want to compare the efficacy of a new treatment (N) with that of the reference treatment (R) in a clinical trial. The outcome is the percentage of complete remission after one week. We know that this percentage equals 50% with R, and we expect 80% of recovery with N.2 To achieve 95% statistical power, how many patients do we need to enroll?
First, let’s say we can only enroll 50 patients per group. Then, using standard formulae and considering a 5% type I error, Φ=1.107−0.785√0.25×2/50=3.22, which means we will end up with 90% power.3 Now, to achieve 95% statistical power, Φ=3.61, hence n=2×0.25×3.612(1.107−0.785)2=63 subjects per group. Dealing with proportions usually involves the arcsin transformation, which can be applied to raw data or directly to estimated proportions. This is discussed in Ryan1, pp. 116-117. Here is what I get with Stata 13 for a two-sample proportions likelihood-ratio test: (Note that we get N=63 when using a chi-squared test)
. power twoproportions 0.5 0.8, test(lrchi2) power(0.95)
Performing iteration ...
Estimated sample sizes for a two-sample proportions test
Likelihood-ratio test
Ho: p2 = p1 versus Ha: p2 != p1
Study parameters:
alpha = 0.0500
power = 0.9500
delta = 0.3000 (difference)
p1 = 0.5000
p2 = 0.8000
Estimated sample sizes:
N = 130
N per group = 65
The same principle applies to the case of the arithmetic mean and a two-sample Student t-test, of course. Note that in the case of small proportion (here, the difference between PN and PR, the number of subjects required to achieve a certain power 1−β will grow up quickly as β decreases: to detect a 5% difference in proportion, with PR=0.5, using 1000 subjects per arm will give you only 61% power; for a 2.5% difference with the same number of subjects, don’t expect to get more than 20% power.
Bayesian approaches4 usually yield smaller sample sizes compared to frequentist approaches, and most of the time no closed-form solutions do exist which means numerical computation must be undergone. I found an interesting review of the pros and cons of those approaches in a recent article of The American Statistician: A Review of Bayesian Perspectives on Sample Size Derivation for Confirmatory Trials. And here is a blog post that discusses a simulation-based approach for power analysis using posterior distributions: Sample size determination in the context of Bayesian analysis.
Likewise, computing confidence intervals most of the time relies on the use of a normal approximation in the case of two-sample proportions, and it is covered in most textbooks in the case of two-sample experiments. Basically, we would compute a 95% CI for the difference between two proportions as follows: (P_N–P_R) \pm \Phi^{-1}(\alpha/2)\sqrt{P_N(1-P_N)/n_N + P_R(1-P_R)/n_R}, where \Phi is the cumulative distribution function for the standard normal distribution (with mean 0 and variance 1), and \alpha is the predefined type I error rate. Typically, \alpha=0.05, and \Phi^{-1}(0.025)=-1.96 and \Phi^{-1}(0.975)=1.96.
Now, we can take a different approach and rely on credible intervals, as illustrated in Murphy’s updated edition of his famous ML textbook,5 which I am currently reading. The problem is stated as follows: Let X\sim\mathcal{N}(\mu, \sigma^2=4) where \mu is unknown but has prior \mu\sim\mathcal{N}(\mu_0,\sigma_0^2=9). The posterior after seeing n samples is \mu_n\sim\mathcal{N}(\mu_n,\sigma_n^2). How big does n have to be to ensure P(l\leq\mu_n\leq u\mid D) \geq 0.95, where (l,u) is an interval centered on \mu_n of width 1, and D is the data?
Here is the full derivation in the case of a two-sample problem with a continuous outcome. As stated, we want an interval that verifies P(l\leq \mu_n \leq u\mid D) \geq 0.95, where as usual:
\begin{equation} \begin{aligned} l & = \mu_n + \Phi^{-1}(\alpha/2)\sigma_n = \mu - 1.96\sigma_n \cr u & = \mu_n + \Phi^{-1}(1-\alpha/2)\sigma_n = \mu + 1.96\sigma_n \end{aligned} \end{equation}
We need to find n such that u - l = 1, that is 2(1.96)\sigma_n = 1, or \sigma_n^2 = \frac{1}{4(1.96)^2} with \sigma_n^2 = \frac{\sigma^2\sigma_0^2}{n\sigma_0^2+\sigma^2}. Hence we have
\begin{equation} \begin{aligned} n\sigma_0^2 + \sigma^2 & = (\sigma^2\sigma_0^2)4(1.96)^2 \cr n & = \frac{\sigma^2(\sigma_0^24(1.96)^2-1}{\sigma_0^2} \cr & = \frac{4(9\times (1.96)^2-1}{9} = 61.0212 \end{aligned} \end{equation}
We would need at least n\geq 62 subjects per group.
♪ Jean Carn, Adrian Younge & Ali Shaheed Muhammad • Black Rainbows
Thomas P. Ryan. Sample Size and Determination and Power. Wiley, 2013. ↩︎ ↩︎
Bouvenot & Vray. Essais Cliniques. Théorie, Pratique et Critique (4ème éd.). Lavoisier, 2006 (p. 248). ↩︎
An other interesting question would be to ask what is the minimal expected recovery percentage we would get with 50 patients per group at 80% power. The derivation is as follows: \Pi_N - \Pi_R = 2.80\sqrt{(2 \times 0.25)/50} = 0.28, so that \Pi_N = \Pi_R + 0.28 = 0.785 + 0.28 = 1.065. In other words, P_N = 76%. ↩︎
I’m not talking about Bayesian updating whereby we sample size is increased during the course of a study by using the Bayes factor to quantify the degree of support for a hypothesis agaisnt the alternative given the observed data. ↩︎
Probabilistic Machine Learning: An introduction. See all available textbooks here. ↩︎