Study design is the architecture of a research question: the plan that determines what data are collected, from whom, and in what structure. The fundamental distinction is between experimental designs, where the investigator assigns exposure (as in randomized trials), and observational designs, where exposure occurs without investigator assignment (as in cohort, case-control, and cross-sectional studies). Each design trades off internal validity, external validity, feasibility, and ethical constraints. If the design does not fit the research question, no amount of sophisticated analysis can rescue the conclusions.
Most research questions are served far better by some designs than by others, and a mismatch creates problems that statistical methods cannot fix. The choice of design determines which comparisons are valid, which biases are likely, and which measures of association can be estimated. This is why biostatistics begins with design rather than analysis: a flawed design analyzed brilliantly still produces flawed conclusions.
The hierarchy of evidence places randomized controlled trials (RCTs) at the top among individual study designs for questions about treatment effects. Randomization breaks, in expectation, the link between treatment assignment and all other baseline variables, including those the investigator has not measured. This controls confounding in a way that no observational analysis can fully replicate. But RCTs are not always feasible (you cannot randomize people to smoke for 30 years) or ethical (you cannot withhold a proven treatment), and they may lack generalizability if the trial population differs from the target population. Design is always a set of tradeoffs.
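The balancing effect of randomization can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a model of any real trial: the function name `confounder_imbalance`, the sample size, the seed, and the self-selection rule (subjects with a higher value of a hidden variable `u` are more likely to take the treatment) are all hypothetical choices made for illustration.

```python
import random
from statistics import mean


def confounder_imbalance(n, randomized, seed):
    """Return mean(u | treated) - mean(u | untreated) for a hidden variable u.

    Hypothetical setup: u is a confounder the investigator never measures.
    """
    rng = random.Random(seed)
    treated, untreated = [], []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)  # unmeasured confounder
        if randomized:
            # Investigator assigns treatment by a fair coin flip, ignoring u.
            gets_treatment = rng.random() < 0.5
        else:
            # Assumed self-selection: higher u makes treatment more likely.
            gets_treatment = u + rng.gauss(0.0, 1.0) > 0
        (treated if gets_treatment else untreated).append(u)
    return mean(treated) - mean(untreated)


diff_self_selected = confounder_imbalance(10_000, randomized=False, seed=1)
diff_randomized = confounder_imbalance(10_000, randomized=True, seed=1)

print(f"Imbalance in u, self-selected treatment: {diff_self_selected:.3f}")
print(f"Imbalance in u, randomized treatment:    {diff_randomized:.3f}")
```

Under self-selection the two groups differ substantially on `u`, so any outcome difference would mix the treatment effect with the effect of `u`. Under the coin flip the difference is close to zero, even though `u` was never measured. The balance holds on average rather than exactly, so small trials can still show chance imbalance.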
Among observational designs, prospective cohort studies establish temporal sequence: they classify subjects by exposure status and follow them forward to observe who develops the outcome. This makes them strong for estimating incidence and risk ratios. Case-control studies reverse the logic: they start with cases (who have the outcome) and controls (who do not) and look backward at exposure. This is far more efficient for rare diseases. Instead of following 100,000 people for 20 years hoping for 200 cases, you identify those 200 cases directly and compare them with suitably selected controls. The tradeoff is that the investigator fixes the ratio of cases to controls, so risk cannot be estimated directly; the design yields odds ratios, which approximate risk ratios only when the outcome is rare. Recall bias (cases remembering exposures differently from controls) can also distort results.
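A worked 2x2 table shows why a case-control sample supports the odds ratio but not the risk ratio. The sketch below reuses the 100,000 people and 200 cases from the paragraph above; every other number (20,000 exposed, cases split 100 and 100, 400 controls) is a hypothetical value chosen for illustration, and the control counts are expected values under random sampling of non-cases rather than a simulated draw.

```python
def risk_ratio(a, b, c, d):
    """a, b: exposed with / without outcome; c, d: unexposed with / without."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Cross-product ratio of the same 2x2 table."""
    return (a * d) / (b * c)


# Hypothetical full cohort: 100,000 people, 200 cases in total.
cases_exp, noncases_exp = 100, 19_900      # 20,000 exposed
cases_unexp, noncases_unexp = 100, 79_900  # 80,000 unexposed

rr_cohort = risk_ratio(cases_exp, noncases_exp, cases_unexp, noncases_unexp)
or_cohort = odds_ratio(cases_exp, noncases_exp, cases_unexp, noncases_unexp)

# Case-control sample: keep all 200 cases, sample 400 controls from non-cases.
# Expected control counts preserve the exposure split among non-cases.
n_controls = 400
total_noncases = noncases_exp + noncases_unexp
ctrl_exp = n_controls * noncases_exp / total_noncases
ctrl_unexp = n_controls * noncases_unexp / total_noncases

or_case_control = odds_ratio(cases_exp, ctrl_exp, cases_unexp, ctrl_unexp)
# Treating the sample as if it were a cohort gives a distorted "risk ratio",
# because the investigator, not nature, fixed the case-to-control ratio.
naive_rr = risk_ratio(cases_exp, ctrl_exp, cases_unexp, ctrl_unexp)

print(f"Cohort risk ratio:             {rr_cohort:.3f}")
print(f"Cohort odds ratio:             {or_cohort:.3f}")
print(f"Case-control odds ratio:       {or_case_control:.3f}")
print(f"Naive case-control risk ratio: {naive_rr:.3f}")
```

With these numbers the cohort risk ratio is 4.0 and the odds ratio is about 4.02 in both the cohort and the case-control sample, while the naive risk ratio from the case-control sample falls to roughly 2.3. The odds ratio survives the sampling because multiplying both control cells by the same sampling fraction cancels in the cross-product.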
Cross-sectional studies measure exposure and outcome at the same time, providing a snapshot. They are efficient for estimating prevalence and generating hypotheses, but they cannot establish whether exposure preceded outcome. Finding that depression and sedentary behavior co-occur tells you nothing about which came first. The temporal ambiguity is not a statistical limitation — it is a structural feature of the design that no adjustment can resolve. Understanding these design-level constraints is the foundation for every analytical technique that follows in biostatistics.