Questions: Computational Text Analysis for Social Data
5 questions to test your understanding
Question 1 Multiple Choice
A researcher completes a study using LDA topic modeling on 10 years of congressional speeches and reports: 'The algorithm identified 8 distinct political themes organizing the corpus.' What is the most critical missing element in this claim?
A. The software package and computational resources used to run the model
B. The number of documents and average document length in the corpus
C. The researcher's substantive interpretation of what the statistical word clusters actually mean: the algorithm produces patterns, not meaning
D. Validation metrics showing the statistical fit of the model to the data
Answer: C
LDA produces statistical clusters of co-occurring words — it identifies patterns in which words appear together across documents. What those patterns mean substantively requires the researcher to interpret the word lists using domain knowledge. The algorithm cannot identify 'political themes'; it identifies word co-occurrence patterns. Presenting the output as directly meaningful without documenting the interpretive step misrepresents how the method works, makes the analysis unreproducible, and conflates statistical pattern-finding with substantive understanding.
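To make the gap between pattern and meaning concrete, here is a minimal sketch. It is not LDA itself; it only counts which words share documents, which is the kind of raw signal the explanation describes. The toy speeches, the function name `cooccurrence_counts`, and the labels in `researcher_labels` are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy "speeches"; a real study would use a tokenized corpus.
speeches = [
    "tax cut budget deficit",
    "budget deficit spending tax",
    "border visa immigration asylum",
    "immigration border security visa",
]

def cooccurrence_counts(docs):
    """Count how many documents each unordered word pair shares."""
    pairs = Counter()
    for doc in docs:
        vocab = sorted(set(doc.split()))
        pairs.update(combinations(vocab, 2))
    return pairs

pairs = cooccurrence_counts(speeches)

# Word pairs that appear together in two documents. This is all the counts
# provide: pairs of strings and a number, with no theme attached.
strong_pairs = sorted(pair for pair, n in pairs.items() if n == 2)

# The interpretive step happens outside the algorithm: a researcher reads the
# word lists and assigns labels. These labels are hypothetical.
researcher_labels = {
    ("budget", "deficit", "tax"): "fiscal policy",
    ("border", "immigration", "visa"): "immigration",
}
```

Nothing in `pairs` says "fiscal policy" or "immigration". Those names exist only in `researcher_labels`, which a person wrote by hand, and that is the step a report should document.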
Question 2 Multiple Choice
A researcher uses a validated dictionary of economic anxiety terms to measure that concept across 50,000 news articles. What is the most fundamental assumption this method requires?
A. That the articles represent a representative sample of media coverage during the study period
B. That economic anxiety appears in text in ways that prior theory can specify, that is, that the dictionary words reliably indicate the concept across diverse linguistic contexts in the corpus
C. That the dictionary was developed on a corpus similar to the one being analyzed
D. That the researcher has manually read at least a sample of the articles to validate the results
Answer: B
Dictionary methods count how often words associated with a concept appear. This assumes that the concept shows up in language in predictable, theory-specified ways that the dictionary captures. If economic anxiety is sometimes expressed through understatement, irony, or the absence of certain words, the dictionary will miss it. If dictionary words appear in contexts where no one is expressing the concept (e.g., academic discussions of economic anxiety), the method will overcount. This assumption, that prior theory can confidently specify how the concept appears linguistically, is substantial and must be validated, not assumed.
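The two failure modes can be seen in a small sketch of dictionary scoring. The term list `ANXIETY_TERMS`, the function `dictionary_score`, and the three example sentences are hypothetical. A real validated dictionary would be much larger and would likely handle stemming and multi-word phrases.

```python
import re

# Hypothetical mini-dictionary, for illustration only.
ANXIETY_TERMS = {"layoffs", "recession", "debt", "unemployment", "foreclosure"}

def dictionary_score(text, terms=ANXIETY_TERMS):
    """Return the share of tokens in `text` that match the dictionary."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(token in terms for token in tokens)
    return hits / len(tokens)

articles = {
    # Expresses the concept using dictionary words.
    "direct": "Families fear layoffs and rising debt as recession looms",
    # Uses dictionary words while discussing the concept, not expressing it.
    "academic": "Scholars debate whether recession and unemployment explain voting",
    # Expresses the concept without any dictionary word.
    "understated": "Many households are quietly skipping meals this winter",
}

scores = {name: dictionary_score(text) for name, text in articles.items()}
```

Under these assumptions the academic sentence gets a positive score (an overcount) and the understated sentence scores zero (a miss), even though a human reader would likely code them the other way around.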
Question 3 True / False
In supervised text classification, biases that researchers introduce during the hand-labeling stage can propagate systematically into the trained model's classifications across the full corpus.
T. True
F. False
Answer: True
Supervised classification works by learning patterns from hand-labeled examples and applying those learned patterns to new documents. If human coders systematically label certain types of documents in ways that reflect their biases — coding ambiguous cases in one direction, applying different standards across demographic groups, or operationalizing concepts inconsistently — the trained model learns and amplifies those patterns. The model scales human judgment, including human error, which is why strong inter-coder reliability, transparent documentation of coding rules, and validation on held-out data are essential safeguards.
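Inter-coder reliability is usually reported as a chance-corrected agreement statistic. The sketch below computes Cohen's kappa for two coders. The choice of kappa is an assumption here, since the explanation does not name a particular statistic, and the label sequences are hypothetical.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: agreement between two coders, corrected for chance."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(coder_a)
    # Share of items on which the two coders gave the same label.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each coder's label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical codes for ten documents ("anx" = economic anxiety present).
coder_a = ["anx", "anx", "anx", "anx", "anx", "none", "none", "none", "none", "none"]
coder_b = ["anx", "anx", "anx", "anx", "none", "anx", "none", "none", "none", "none"]

kappa = cohens_kappa(coder_a, coder_b)  # 8 of 10 agree, chance agreement is 0.5
```

Here the coders agree on 80% of items, but kappa is about 0.6 once chance agreement is removed. Note that a high kappa only shows the coders are consistent with each other; it cannot detect a bias that both coders share, which is why validation on held-out data is still needed.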
Question 4 True / False
Bag-of-words models are called 'bag-of-words' because they capture words along with their grammatical and sequential context within sentences.
T. True
F. False
Answer: False
Bag-of-words models treat documents as unordered collections of word tokens — sequence and grammar are discarded. 'Bag' is the key metaphor: just as items in a bag have no inherent order, words in a bag-of-words model are simply counted, not sequenced. This means 'the bank repossessed the house' and 'the house repossessed the bank' have identical representations. This is a significant limitation for capturing meaning that depends on word order, negation, or syntax — though for many research purposes (broad thematic analysis, topic modeling), the loss of sequence is an acceptable tradeoff for scalability.
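The repossession example can be checked directly. This is a minimal sketch that assumes lowercasing and whitespace tokenization; the function name `bag_of_words` is hypothetical, and real pipelines usually add steps such as punctuation handling and stopword removal.

```python
from collections import Counter

def bag_of_words(sentence):
    """Lowercase, split on whitespace, and count tokens; order is discarded."""
    return Counter(sentence.lower().split())

first = "the bank repossessed the house"
second = "the house repossessed the bank"

bag_first = bag_of_words(first)
bag_second = bag_of_words(second)

# The sentences differ as strings, but their word counts are identical.
same_sentence = first == second
same_bag = bag_first == bag_second
```

`same_sentence` is `False` while `same_bag` is `True`: once the text is reduced to counts, nothing remains to tell the model who repossessed what.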
Question 5 Short Answer
Why does having a larger corpus not automatically solve validity problems in computational text analysis?
Model answer: Scale amplifies whatever patterns the method is measuring — if the method is measuring the wrong thing, more data produces more precise measurements of the wrong thing. A dictionary method that miscategorizes a concept will misclassify millions of articles at scale. A supervised classifier trained on flawed labels will propagate those flaws across millions of documents. Validity — whether the method captures the construct of interest — is a conceptual and design problem that must be solved through careful operationalization and validation, not through additional data.
This is the fundamental distinction between reliability (consistent results) and validity (measuring what you intend). Computational methods are often highly reliable — they produce the same output from the same input — but reliability does not guarantee validity. Big data can produce reliably wrong answers at impressive scale. The solution is validation: reading samples of documents, checking whether model outputs correspond to human judgment, testing on cases where the correct answer is known, and documenting assumptions transparently so others can evaluate whether the method actually captures the intended concept.
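One way to check whether model outputs correspond to human judgment is to score them against a hand-coded sample. The sketch below computes precision and recall for binary codes. The function name `precision_recall` and the two label lists are hypothetical, and precision and recall are only one of several reasonable comparison metrics.

```python
def precision_recall(predicted, human):
    """Compare binary model output with human codes for the same documents."""
    if len(predicted) != len(human):
        raise ValueError("predicted and human must cover the same documents")
    pairs = list(zip(predicted, human))
    true_pos = sum(1 for p, h in pairs if p and h)
    false_pos = sum(1 for p, h in pairs if p and not h)
    false_neg = sum(1 for p, h in pairs if not p and h)
    # Guard against division by zero when a class never occurs.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

# Hypothetical validation sample of eight documents.
model_flags = [True, True, True, True, False, False, False, False]
human_codes = [True, True, False, False, True, False, False, False]

precision, recall = precision_recall(model_flags, human_codes)

# Reliability check: rerunning on the same input gives the same output.
is_reliable = precision_recall(model_flags, human_codes) == (precision, recall)
```

In this sample the method is perfectly reliable (`is_reliable` is `True`), yet only half of its positive flags match human judgment. Running it on a larger corpus would reproduce that error rate at a larger scale, not reduce it.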