A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Statistical Power and Sample Size Determination

Graduate level, depth 227 in the knowledge graph

Core Idea

Statistical power is the probability of correctly rejecting a false null hypothesis, or equivalently, the probability of detecting a real effect of a specified size when one exists. Power depends on four interconnected quantities: the significance level (alpha), the sample size (n), the effect size (the magnitude of the true difference), and the variability of the outcome. Sample size calculations performed before a study begins determine how many subjects are needed to achieve adequate power (conventionally 80% or higher) for a clinically meaningful effect size. An underpowered study wastes resources because it cannot reliably detect effects that matter; an overpowered study wastes resources by enrolling more subjects than necessary, and it may flag effects too small to be clinically relevant as statistically significant.

Explainer

From your study of hypothesis testing, you know that every test carries two types of error: Type I (rejecting a true null, controlled by alpha) and Type II (failing to reject a false null, with probability denoted beta). Power is simply 1 minus beta: the probability that the test correctly rejects the null when the alternative is true. A study designed with 80% power and alpha = 0.05 will detect a true effect of the assumed size 80% of the time, and will falsely reject the null 5% of the time when no effect exists. The 20% miss rate is the cost of doing science with finite samples.
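The 80% and 5% figures can be checked by simulating many repeated studies. The following is a minimal sketch, not a definitive implementation: it assumes a two-group comparison of means with a known, common standard deviation (so a z-test applies), and the specific numbers (a true difference of 5, a standard deviation of 10, 63 subjects per group) are hypothetical values chosen so that power comes out near 80%.

```python
import random
from statistics import NormalDist, fmean


def simulated_rejection_rate(true_diff, sigma=10.0, n_per_group=63,
                             alpha=0.05, n_studies=2000, seed=1):
    """Fraction of simulated studies in which a two-sided z-test rejects the null.

    Assumes normally distributed outcomes with a known, common SD (an
    illustrative simplification; real trials usually estimate the SD).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # rejection threshold
    se = sigma * (2.0 / n_per_group) ** 0.5        # SE of the difference in means
    rejections = 0
    for _ in range(n_studies):
        control = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
        treated = [rng.gauss(true_diff, sigma) for _ in range(n_per_group)]
        z_stat = (fmean(treated) - fmean(control)) / se
        if abs(z_stat) > z_crit:
            rejections += 1
    return rejections / n_studies


# Alternative true: the rejection rate estimates power (1 minus beta).
power_estimate = simulated_rejection_rate(true_diff=5.0)
# Null true: the rejection rate estimates the Type I error rate (alpha).
false_positive_rate = simulated_rejection_rate(true_diff=0.0)

print(f"estimated power:        {power_estimate:.3f}")
print(f"estimated Type I error: {false_positive_rate:.3f}")
```

Under these assumptions the first estimate should land close to 0.80 and the second close to 0.05, with some simulation noise in both.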

The four determinants of power are tightly linked. Alpha sets the rejection threshold: a more lenient (larger) alpha increases power but also increases false positives. Sample size reduces the standard error of the estimate, which shrinks in proportion to 1/sqrt(n), making it easier to distinguish signal from noise. Effect size is the magnitude of the true difference; larger effects are easier to detect. Variability (the standard deviation of the outcome) is noise: more variability obscures the signal and requires more subjects to detect it. A sample size calculation fixes the target power and solves for n given values of the other three determinants: "How many subjects do I need to detect this effect size at this alpha with this power, given the expected variability?"
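As an illustration of solving for n, the sketch below uses the standard normal-approximation formula for comparing two means with equal group sizes, n per group = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / delta)^2. The function name and the example inputs are assumptions made for this illustration; an exact t-based calculation typically adds a subject or two per group.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Subjects per group for a two-sided, two-sample comparison of means.

    Normal approximation with a known, common SD `sigma`;
    `delta` is the difference in means the study should detect.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # quantile for the significance level
    z_power = z.inv_cdf(power)           # quantile for the target power
    return ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)


baseline = n_per_group(delta=5.0, sigma=10.0)                 # 63
smaller_effect = n_per_group(delta=2.5, sigma=10.0)           # 252
noisier_outcome = n_per_group(delta=5.0, sigma=20.0)          # 252
higher_power = n_per_group(delta=5.0, sigma=10.0, power=0.90)
stricter_alpha = n_per_group(delta=5.0, sigma=10.0, alpha=0.01)

print(baseline, smaller_effect, noisier_outcome, higher_power, stricter_alpha)
```

Because delta and sigma enter only through the squared ratio sigma / delta, halving the detectable effect or doubling the standard deviation each roughly quadruples the required sample size, while demanding more power or a stricter alpha raises it more modestly.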

The most consequential decision in a sample size calculation is the choice of effect size. This should reflect the minimum clinically important difference (MCID): the smallest effect that would change clinical practice or patient outcomes. A blood pressure drug that lowers systolic pressure by 0.5 mmHg might be statistically significant with 100,000 subjects, but no clinician would change prescribing behavior for such a trivial effect. Conversely, a study powered to detect only a 20 mmHg difference will usually miss a real 10 mmHg effect that genuinely matters. The effect size should come from clinical judgment, prior literature, or pilot data, never from statistical convenience.
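The blood pressure example can be made concrete with a rough calculation. The sketch below assumes a standard deviation of 15 mmHg for systolic pressure, which is a hypothetical value not given in the text, and uses the normal approximation for a two-group comparison (crude at very small n, but adequate to show the pattern).

```python
from math import ceil
from statistics import NormalDist

Z = NormalDist()
SIGMA = 15.0  # assumed SD of systolic pressure in mmHg (hypothetical)


def n_per_group(delta, sigma=SIGMA, alpha=0.05, power=0.80):
    """Per-group sample size to detect a mean difference `delta` (normal approx.)."""
    z_sum = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    return ceil(2 * (z_sum * sigma / delta) ** 2)


def power_at(delta, n, sigma=SIGMA, alpha=0.05):
    """Power of a two-sided z-test with `n` per group if the true difference is `delta`."""
    shift = delta / (sigma * (2.0 / n) ** 0.5)   # true effect in SE units
    z_crit = Z.inv_cdf(1 - alpha / 2)
    return Z.cdf(shift - z_crit) + Z.cdf(-shift - z_crit)


n_for_20 = n_per_group(delta=20.0)                # sized for a 20 mmHg effect
power_true_10 = power_at(delta=10.0, n=n_for_20)  # but the real effect is 10 mmHg
n_for_10 = n_per_group(delta=10.0)                # sized for the 10 mmHg effect
power_tiny = power_at(delta=0.5, n=50_000)        # 100,000 subjects in total

print(f"n per group for 20 mmHg: {n_for_20}")
print(f"power for a true 10 mmHg effect at that n: {power_true_10:.2f}")
print(f"n per group for 10 mmHg: {n_for_10}")
print(f"power for 0.5 mmHg with 50,000 per group: {power_tiny:.4f}")
```

Under this assumed variability, the study sized for 20 mmHg has well under a 50% chance of detecting a true 10 mmHg effect, while the 100,000-subject study is almost certain to declare the clinically trivial 0.5 mmHg effect significant.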

Sample size calculations must be performed and reported before data collection begins. Post-hoc power calculations, which compute the power of a completed study using the observed effect size, are widely recognized as uninformative and circular. Post-hoc power is a direct function of the p-value, so a non-significant result always yields low post-hoc power (roughly 50% when the result just misses significance, and lower still for larger p-values), telling you nothing you did not already know. The proper way to interpret a non-significant result is through the confidence interval: a wide interval that includes both clinically important and null effects indicates the study was uninformative, while a narrow interval around zero that excludes clinically important effects provides genuine evidence of no meaningful effect.
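Both points can be sketched numerically. The code below assumes a two-sided z-test, for which post-hoc ("observed") power can be written purely in terms of the p-value, and then contrasts it with a confidence-interval reading of two non-significant results. The function names, the example estimates and standard errors, and the MCID of 10 are all hypothetical choices for illustration.

```python
from statistics import NormalDist

Z = NormalDist()


def observed_power(p_value, alpha=0.05):
    """Post-hoc power of a two-sided z-test, treating the observed effect as the truth."""
    z_obs = Z.inv_cdf(1 - p_value / 2)   # |z| implied by the two-sided p-value
    z_crit = Z.inv_cdf(1 - alpha / 2)
    return Z.cdf(z_obs - z_crit) + Z.cdf(-z_obs - z_crit)


def confidence_interval(estimate, se, level=0.95):
    """Normal-approximation confidence interval for an estimated difference."""
    half_width = Z.inv_cdf(0.5 + level / 2) * se
    return estimate - half_width, estimate + half_width


def interpret(ci, mcid):
    """Classify a result by where its interval sits relative to zero and the MCID."""
    low, high = ci
    if low > 0 or high < 0:
        return "effect detected"
    if -mcid < low and high < mcid:
        return "evidence of no meaningful effect"
    return "uninformative"


# Post-hoc power depends on nothing but the p-value.
for p in (0.05, 0.20, 0.50):
    print(f"p = {p:.2f} -> observed power = {observed_power(p):.2f}")

# Two non-significant results with very different implications (MCID = 10).
wide = confidence_interval(estimate=2.0, se=6.0)
narrow = confidence_interval(estimate=0.5, se=1.0)
print(interpret(wide, mcid=10.0))     # uninformative
print(interpret(narrow, mcid=10.0))   # evidence of no meaningful effect
```

The first loop shows observed power falling from about one half as the p-value rises, so it only restates the p-value. The interval-based reading, by contrast, separates a study that could not rule out an important effect from one that did.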
