Questions: Statistical Power, Effect Size, and Sample Size Planning
5 questions to test your understanding
Question 1 Multiple Choice
Study A (n = 10,000) finds a drug reduces headache severity by 0.1 points on a 100-point scale (p < 0.001). Study B (n = 50) finds a 15-point reduction (p = 0.04). Which conclusion is most accurate?
A. Study A's finding is more important because p < 0.001 is far more significant than p = 0.04
B. Study B likely demonstrates a more practically meaningful effect, even though Study A is more statistically significant
C. Neither study is meaningful without pre-registration
D. Study A is definitive because large samples eliminate statistical uncertainty
Answer: B
A p-value measures how unlikely a result at least as extreme as the observed one would be under the null hypothesis, and it is heavily influenced by sample size. With n = 10,000, even a trivially small effect (0.1 points on a 100-point scale) can become highly significant. Study B's 15-point reduction is far larger in magnitude and likely clinically meaningful, even though its p-value is less extreme and, with n = 50, its estimate is less precise. Effect size (the magnitude of the difference) is the relevant metric for practical significance; p-values should not be used as a proxy for importance.
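As a rough illustration of how sample size alone drives the p-value, the sketch below holds a trivially small standardized effect fixed and varies only the group size. The effect value (d = 0.05), the group sizes, and the helper name `two_sided_p` are hypothetical choices for this example, and the calculation uses a normal approximation to a two-sample test rather than the data from either study.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(d, n_per_group):
    """Two-sided p-value for a two-sample z-test when the observed
    standardized mean difference is exactly d (normal approximation)."""
    z = d * sqrt(n_per_group / 2)  # test statistic grows with sqrt(n)
    return 2 * (1 - NormalDist().cdf(abs(z)))

d_small = 0.05  # hypothetical, trivially small standardized effect
for n in (50, 500, 5000, 50000):
    print(f"n per group = {n:>6}: p = {two_sided_p(d_small, n):.4f}")
```

The observed effect is identical in every row; only the amount of data changes, yet the p-value moves from clearly non-significant to far below .05.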
Question 2 Multiple Choice
A researcher wants 80% power to detect a small effect (Cohen's d = 0.2) at α = .05. Compared to detecting a large effect (d = 0.8) with the same power and alpha, how does the required sample size compare?
A. About the same: sample size requirements don't vary much with effect size
B. Much larger: smaller effects are harder to distinguish from noise and require more data
C. Smaller: small effects are more common in nature, making them easier to detect
D. The answer depends entirely on the specific alpha level chosen
Answer: B
For fixed power and alpha, the required sample size rises steeply as the effect size shrinks: it scales roughly with 1/d². For a two-sided, two-group comparison, d = 0.2 requires approximately 394 participants per group, while d = 0.8 requires only about 26 per group (using standard power formulas). Cutting the effect size to a quarter therefore multiplies the required sample by roughly 16. Smaller effects produce smaller differences between the sample distributions, making them harder to distinguish from random variation, so more observations are needed to accumulate sufficient evidence. How common small effects are in nature is irrelevant to how difficult they are to detect statistically.
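The per-group sample size can be approximated with the textbook normal-approximation formula n = 2 × ((z for 1 − α/2) + (z for power))² / d². The sketch below assumes a two-sided, two-group comparison with equal group sizes; the function name `n_per_group` is made up for this example. Exact t-based calculations (as in dedicated power software) add roughly one participant per group to these figures.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample comparison
    (normal approximation; equal group sizes assumed)."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d)} participants per group")
```

Because d appears squared in the denominator, halving the effect size quadruples the required sample.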
Question 3 True / False
A statistically significant result (p < .05) from a study with only 20% power is strong evidence that a real effect exists.
T. True
F. False
Answer: False
A significant result from an underpowered study is suspect, not reassuring. With only 20% power, the study would miss a true effect 80% of the time. The studies that 'succeed' despite low power are disproportionately those that happened to observe an inflated effect through sampling error, a phenomenon called the 'winner's curse.' These inflated estimates tend not to replicate. For a given prior probability that the effect is real, low power also means that a larger share of significant findings are false positives. High power matters not just for detecting effects but for producing stable, accurate effect size estimates.
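The winner's curse can be seen in a small simulation. This is a minimal sketch under stated assumptions: the true effect (d = 0.2), the group size (63, chosen so that power is roughly 20%), the known standard deviation of 1, and the seed are all hypothetical, and each study's observed difference is drawn directly from its sampling distribution rather than from simulated participants.

```python
import random
from math import sqrt
from statistics import mean

random.seed(1)  # fixed seed so the sketch is reproducible

TRUE_D = 0.2                # hypothetical true standardized effect
N_PER_GROUP = 63            # chosen so power is roughly 20% at alpha = .05
SE = sqrt(2 / N_PER_GROUP)  # standard error of the mean difference (SD = 1)
Z_CRIT = 1.96               # two-sided critical value for alpha = .05

# Shortcut: draw each study's observed difference from its sampling
# distribution instead of simulating individual participants.
estimates = [random.gauss(TRUE_D, SE) for _ in range(20000)]
significant = [e for e in estimates if abs(e) / SE > Z_CRIT]

power = len(significant) / len(estimates)
print(f"empirical power: {power:.2f}")
print(f"mean estimate, all studies: {mean(estimates):.2f}")
print(f"mean estimate, significant studies only: {mean(significant):.2f}")
```

Averaged over all studies the estimate is close to the true effect, but the subset that reaches significance overstates it, because at this sample size only unusually large observed differences can clear the threshold.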
Question 4 True / False
Effect size is a standardized measure of the magnitude of an effect that does not depend on sample size.
TTrue
FFalse
Answer: True
This is the defining feature that makes effect sizes valuable. For any fixed, nonzero true effect, p-values tend to decrease (become more significant) as sample size increases. Cohen's d, r, η², and related measures instead quantify the size of an effect in standardized units, and the quantity they estimate does not change with the number of participants tested; a larger sample only makes the estimate more precise. This is why meta-analyses rely on effect sizes: they put results from studies with different sample sizes and measurement scales on a common footing.
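For the common two-group case, the contrast between an effect size and a test statistic can be written out explicitly. The notation below is an assumption of this sketch (group means, group standard deviations s_1 and s_2, pooled standard deviation s_p, and equal group sizes n in the last expression); it is the standard definition of Cohen's d alongside the two-sample t statistic.

```latex
d = \frac{\bar{x}_1 - \bar{x}_2}{s_p},
\qquad
s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}},
\qquad
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{2/n}} = d\,\sqrt{\frac{n}{2}}
\quad (n_1 = n_2 = n)
```

The sample size enters t through the factor \sqrt{n/2} but does not appear in d itself, which is why the same d can produce very different p-values in small and large studies.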
Question 5 Short Answer
Why do researchers conduct a-priori power analyses before collecting data? What goes wrong scientifically when this step is skipped?
Model answer: A-priori power analysis determines the sample size needed to detect your expected effect with adequate probability (typically 80%), given your significance threshold. Skipping it leads to underpowered studies: if the true effect exists but is modest, an underpowered study will usually miss it (Type II error), wasting resources. Worse, researchers who lack a pre-specified sample size often collect data until p < .05 appears — a practice called 'optional stopping' — which inflates the false positive rate well above the nominal alpha level. Pre-specifying sample size (and ideally pre-registering hypotheses) ensures that a significant result reflects a planned, adequately powered test rather than sampling until luck produces significance.
The replication crisis in psychology was partly caused by widespread underpowered studies combined with flexible stopping rules. Understanding power analysis reveals exactly why this is problematic: the tools for rigorous inference require commitment before observing data. Power analysis is not bureaucratic overhead; it's the mechanism that connects sampling precision to the strength of scientific claims.
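The optional-stopping problem described in the model answer can be illustrated with a simulation in which the null hypothesis is true by construction. This is a sketch under hypothetical settings: a one-sample z-test with known standard deviation 1, a maximum of 100 observations, a check after every 10 observations, and a fixed seed. None of these values come from a real study.

```python
import random
from math import sqrt

random.seed(2)  # fixed seed so the sketch is reproducible

Z_CRIT = 1.96    # two-sided critical value for alpha = .05
MAX_N = 100      # hypothetical maximum sample size
LOOK_EVERY = 10  # with peeking, test after every 10 observations
N_SIMS = 2000

def run_study(peek):
    """Simulate one study under the null (true mean 0, known SD 1).
    Returns True if the study declares significance at any allowed look."""
    total = 0.0
    for n in range(1, MAX_N + 1):
        total += random.gauss(0, 1)
        at_look = peek and n % LOOK_EVERY == 0
        if at_look or n == MAX_N:
            z = total / sqrt(n)  # sample mean divided by its standard error
            if abs(z) > Z_CRIT:
                return True      # stop and report "significant"
    return False

fixed_rate = sum(run_study(peek=False) for _ in range(N_SIMS)) / N_SIMS
peeking_rate = sum(run_study(peek=True) for _ in range(N_SIMS)) / N_SIMS
print(f"false positive rate, fixed n:  {fixed_rate:.3f}")
print(f"false positive rate, peeking:  {peeking_rate:.3f}")
```

With a single planned test the false positive rate stays near the nominal 5%, while stopping at the first significant look pushes it well above that level even though no effect exists.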