A researcher uses tweets collected via Twitter's API to study how political opinions spread during an election. They find strong evidence of echo chambers. A methodologist raises concerns. Which concern is most fundamental?
A. Text analysis algorithms are not reliable enough to classify political content accurately
B. Twitter's API returns a random sample of all tweets, so the dataset should be representative
C. Twitter users systematically differ from the general voting public in age, education, and political engagement, and the platform's algorithm shapes which content is visible, so the data is not representative of actual opinion formation
D. The study should have used survey data instead, since computational methods cannot study opinion formation
Answer: C
This is the selection bias problem that computational methods can conceal but not cure. Twitter users skew younger, more educated, and more politically engaged than the general electorate. The platform's algorithm amplifies content that drives engagement, including outrage, and so shapes which content is visible and gets recorded. A finding about Twitter echo chambers is a finding about Twitter; generalizing it to 'political opinion formation' requires bridging arguments that the researcher must make explicit. Big data makes this bias easier to overlook, not easier to correct.
Question 2 Multiple Choice
A researcher builds an agent-based model of protest mobilization where agents join a protest if more than 30% of their network has already joined. The model generates output that visually resembles historical protest waves. The researcher concludes the model is validated. What is the fundamental flaw?
A. ABMs cannot model social phenomena like protests because human behavior is too unpredictable to simulate
B. The model needs more agents (at least 100,000) before the output is statistically meaningful
C. Visual resemblance to historical patterns does not validate the model; it must be calibrated against real data quantitatively and tested on held-out cases not used during model design
D. The 30% threshold is the wrong value and should be determined by machine learning on historical data
Answer: C
Many different models with different underlying assumptions can generate output that looks like real patterns; this is the 'equifinality' problem in simulation. Visual plausibility is not validation. A validated ABM must have its parameters estimated from real behavioral data, make quantitative predictions that match empirical distributions, and be tested on cases it was not designed to reproduce. Without this, the model may be generating the right patterns for the wrong reasons.
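The calibrate-then-test discipline described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a validation recipe: `simulate` is a toy one-parameter growth curve standing in for a real ABM, the "observed" curves are synthetic placeholders generated from that same toy plus noise (not real protest data), and root-mean-square error is just one possible quantitative fit measure.

```python
import math
import random

def simulate(rate, steps=20, start=0.05):
    """Toy stand-in for an ABM: cumulative participation share over time."""
    share, curve = start, []
    for _ in range(steps):
        curve.append(share)
        share += rate * share * (1 - share)
    return curve

def rmse(simulated, observed):
    """Root-mean-square error between two equal-length curves."""
    squared = sum((s - o) ** 2 for s, o in zip(simulated, observed))
    return math.sqrt(squared / len(observed))

def calibrate(candidate_rates, training_curves):
    """Pick the rate with the lowest mean RMSE across the calibration cases."""
    def mean_error(rate):
        sim = simulate(rate)
        return sum(rmse(sim, obs) for obs in training_curves) / len(training_curves)
    return min(candidate_rates, key=mean_error)

# SYNTHETIC placeholder "observations": toy curves plus Gaussian noise.
rng = random.Random(0)

def fake_observation(rate):
    return [x + rng.gauss(0, 0.01) for x in simulate(rate)]

training_curves = [fake_observation(0.5) for _ in range(3)]   # used for calibration
held_out_curves = [fake_observation(0.5) for _ in range(2)]   # never seen in calibration

best_rate = calibrate([0.3, 0.4, 0.5, 0.6, 0.7], training_curves)
held_out_error = sum(
    rmse(simulate(best_rate), obs) for obs in held_out_curves
) / len(held_out_curves)
print(f"calibrated rate={best_rate}, held-out RMSE={held_out_error:.3f}")
```

The point of the split is that `held_out_curves` plays no role in choosing `best_rate`, so a low held-out error is evidence about the model rather than about the fitting procedure. With real data, a poor held-out score would be the signal that visual resemblance was misleading.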
Question 3 True / False
In computational social science, collecting a very large dataset (millions of records) from a web platform effectively eliminates selection bias, because the large N makes the sample's representativeness less important.
T. True
F. False
Answer: False
Large N increases the precision of your estimates but does not change which population those estimates describe. If your data comes from Reddit, you have very precise estimates about Reddit users, not about the general public. A million non-representative observations can give you a very precise answer to the wrong question. Post-mortems of 2016 U.S. election polling point to the same lesson: many state-level polls overrepresented college graduates without correcting for it, so the error came from who was in the sample, not from how many.
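A small simulation makes the precision-versus-bias distinction concrete. The sketch below assumes two hypothetical numbers chosen purely for illustration: support for some position is 50% in the full population but 60% among users of the platform being sampled. Neither figure comes from real data.

```python
import math
import random

TRUE_SUPPORT = 0.50      # hypothetical share in the full population
PLATFORM_SUPPORT = 0.60  # hypothetical share among platform users

def platform_estimate(n, seed=0):
    """Sample n platform users; return the estimated share and its standard error."""
    rng = random.Random(seed)
    hits = sum(rng.random() < PLATFORM_SUPPORT for _ in range(n))
    p_hat = hits / n
    std_err = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, std_err

for n in (100, 10_000, 1_000_000):
    p_hat, std_err = platform_estimate(n)
    bias = p_hat - TRUE_SUPPORT
    print(f"n={n:>9,}  estimate={p_hat:.3f}  std_err={std_err:.4f}  bias={bias:+.3f}")
```

As n grows, the standard error shrinks toward zero while the estimate settles near the platform value rather than the population value. The gap between the two is untouched by sample size.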
Question 4 True / False
Agent-based models in computational social science are valuable partly because they allow researchers to explore 'what if' scenarios by systematically varying parameters, generating hypotheses about social mechanisms that can then be tested against empirical data.
T. True
F. False
Answer: True
This is the appropriate and powerful use of ABMs: theory generation and exploration, not definitive causal proof. When you cannot run a real experiment (you cannot randomly assign cities to different housing policies, for example), an ABM lets you reason systematically about how changing a parameter — network density, threshold for adoption, geographic clustering — would change macro-level outcomes. The key discipline is that model output must eventually be anchored to empirical data for the findings to be credible.
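A parameter sweep of this kind can be sketched with a simple threshold model, in which an agent joins once the joined share of its neighbors exceeds a threshold. Everything specific here is an assumption made for illustration: the random network, the population of 200 agents, the 20 initial participants, and the particular threshold and density values swept.

```python
import random

def make_network(n_agents, avg_degree, rng):
    """Random undirected network: every possible tie exists with equal probability."""
    p = avg_degree / (n_agents - 1)
    neighbors = [set() for _ in range(n_agents)]
    for i in range(n_agents):
        for j in range(i + 1, n_agents):
            if rng.random() < p:
                neighbors[i].add(j)
                neighbors[j].add(i)
    return neighbors

def run_threshold_model(neighbors, threshold, n_seeds, rng, max_steps=50):
    """Return the final joined share; agents join when their joined-neighbor share exceeds threshold."""
    n = len(neighbors)
    joined = set(rng.sample(range(n), n_seeds))  # initial participants
    for _ in range(max_steps):
        new = {
            i for i in range(n)
            if i not in joined and neighbors[i]
            and len(neighbors[i] & joined) / len(neighbors[i]) > threshold
        }
        if not new:  # no one else is willing to join
            break
        joined |= new
    return len(joined) / n

def sweep(thresholds, degrees, n_agents=200, n_seeds=20, runs=5, seed=0):
    """Mean final participation for every (threshold, average degree) pair."""
    results = {}
    for t in thresholds:
        for d in degrees:
            rng = random.Random(seed)  # same networks and seeds for each threshold
            shares = [
                run_threshold_model(make_network(n_agents, d, rng), t, n_seeds, rng)
                for _ in range(runs)
            ]
            results[(t, d)] = sum(shares) / runs
    return results

results = sweep(thresholds=[0.1, 0.3, 0.5], degrees=[4, 8, 16])
for (t, d), share in sorted(results.items()):
    print(f"threshold={t:.1f}  avg_degree={d:>2}  final_share={share:.2f}")
```

The output is a table of hypotheses, for example about how adoption thresholds and network density interact. None of it is evidence until it is compared against empirical data.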
Question 5 Short Answer
Why does the validation imperative — comparing computational results against real empirical data — matter especially in computational social science compared to traditional small-sample social science research?
Model answer: Computational methods scale enormously — a model can process millions of records or run millions of simulations. This means errors in methodology, flawed assumptions in a text classifier, or non-representative training data get amplified at the same scale as the signal. In traditional small-N research, a flawed assumption affects a few dozen observations and is often visible during close reading of the data. In computational research, the same flaw silently affects every one of millions of records, and the sheer volume makes it easy to mistake precision for accuracy. Validation against real data is the check that distinguishes technically impressive analysis from valid social science.
The field's name, 'social science', carries a methodological obligation. Scale is not a substitute for rigor; it amplifies both good and bad methodology. A text classifier that misclassifies political speech 5% of the time, applied to 10 million tweets, produces 500,000 classification errors. If those errors are not random (and they rarely are), the resulting analysis can be systematically wrong in ways that summary statistics do not reveal.
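The arithmetic, and the difference between random and non-random errors, can be worked through directly. The sketch below assumes a hypothetical two-class labeling task (classes A and B, each half of the corpus) and compares two classifiers with the same 5% overall error rate: one that errs equally on both classes, and one that errs on 9% of class A but only 1% of class B. The class split and per-class rates are illustrative assumptions, not figures from any real classifier.

```python
N_TWEETS = 10_000_000
ERROR_RATE = 0.05

# Total misclassified tweets at a 5% error rate.
expected_errors = round(N_TWEETS * ERROR_RATE)

def overall_error(true_share_a, miss_a, miss_b):
    """Overall error rate given class A's true share and per-class miss rates."""
    return true_share_a * miss_a + (1 - true_share_a) * miss_b

def observed_share(true_share_a, miss_a, miss_b):
    """Share of tweets labeled A: correctly kept A plus B mislabeled as A."""
    return true_share_a * (1 - miss_a) + (1 - true_share_a) * miss_b

print(f"misclassified tweets: {expected_errors:,}")
for name, miss_a, miss_b in [("symmetric", 0.05, 0.05), ("asymmetric", 0.09, 0.01)]:
    err = overall_error(0.5, miss_a, miss_b)
    share = observed_share(0.5, miss_a, miss_b)
    print(f"{name:>10}: overall error={err:.2%}, measured share of A={share:.2%}")
```

Both classifiers report the same headline accuracy, but in this example only the asymmetric one shifts the measured share of class A away from the true 50%. That shift is exactly the kind of distortion a single accuracy figure does not expose.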