Corpus linguistics is the empirical study of language through large, organized collections of natural texts (corpora). Corpus methodology involves designing corpus architecture, sampling strategies, annotation frameworks (tagging, parsing, semantic markup), and statistical analysis. Major corpora such as the British National Corpus, the Corpus of Contemporary American English (COCA), and treebanks (syntactically parsed corpora) enable systematic investigation of frequency distributions, collocation patterns, and linguistic variation. Corpus evidence constrains theoretical claims about language structure and use.
Study major corpus projects and their design decisions. Learn annotation schemes and tag sets. Practice concordance analysis (searching and analyzing word contexts). Understand statistical methods applied to corpus data (frequency analysis, collocation metrics, significance testing). Examine how corpus evidence has challenged or refined linguistic theories. Participate in corpus construction and annotation.
Traditional linguistic research has often relied on introspection: linguists judge what is grammatical and how language works on the basis of their own intuition. Corpus linguistics takes a different approach and studies language systematically through large collections of naturally occurring text. This empirical methodology has transformed linguistics by revealing patterns that introspection misses and by challenging assumptions that seemed solid on intuition alone.
A corpus is not simply a big collection of texts. It is a carefully designed, curated, and annotated collection of language data whose properties (what it samples, how much, and how it is marked up) are chosen deliberately.
Major corpus projects include the British National Corpus (100 million words of British English), the Corpus of Contemporary American English (COCA, which has grown from about 560 million words to more than one billion), and treebanks (syntactically parsed corpora such as the Penn Treebank). These massive resources enable investigations that are impossible with traditional methods.
Corpus methodology proceeds through several stages:
1. Corpus design: Define population, sampling frame, and size
2. Data collection: Gather texts according to sampling strategy
3. Annotation: Apply linguistic markup (POS, parsing, etc.)
4. Analysis: Search, extract, and analyze patterns
5. Statistical inference: Draw conclusions about language properties
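The design and collection stages above can be sketched as a proportional stratified sample. This is a minimal illustration only: the sampling frame, the genre labels, the target shares, and the function name `stratified_sample` are all hypothetical and are not taken from any real corpus project.

```python
import random

def stratified_sample(frame, shares, n_texts, seed=0):
    """Draw texts from each genre in proportion to its target share."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sample = []
    for genre, share in shares.items():
        candidates = [doc for doc in frame if doc["genre"] == genre]
        # Never ask for more texts than the frame holds for this genre.
        k = min(round(share * n_texts), len(candidates))
        sample.extend(rng.sample(candidates, k))
    return sample

# Hypothetical sampling frame: 50 candidate texts per genre.
genres = ("spoken", "fiction", "news", "academic")
frame = [{"id": f"{g}-{i}", "genre": g} for g in genres for i in range(50)]

# Hypothetical design decision: target share of each genre in the corpus.
shares = {"spoken": 0.1, "fiction": 0.2, "news": 0.3, "academic": 0.4}

sample = stratified_sample(frame, shares, n_texts=40)
print(len(sample), "texts sampled")
```

Real projects usually allocate by word count rather than by number of texts, and they document the frame itself, but the principle of fixing the proportions before collecting is the same.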
Concordance analysis is a core corpus technique: searching for a keyword and examining all contexts where it appears. For example, searching for "make" in COCA shows all contexts: "make a decision," "make progress," "make sense." Patterns emerge from examining hundreds of concordance lines that wouldn't be visible in anecdotal data.
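A concordance search of this kind can be sketched in a few lines. The example below is a simplified keyword-in-context (KWIC) routine run over a made-up three-sentence sample, not over COCA; the function name `kwic`, the crude regular-expression tokenizer, and the context width are assumptions chosen for illustration.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left, keyword, right) tuples with `width` tokens of context."""
    # Very rough tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Made-up sample text; a real study would read from a corpus.
sample = ("We need to make a decision soon. They make progress every week. "
          "It does not make sense to wait.")

# Align the keyword in a central column, as concordancers typically do.
for left, kw, right in kwic(sample, "make"):
    print(f"{left:>25}  [{kw}]  {right}")
```

Sorting the resulting lines by the first word to the right of the keyword is a common next step, because it groups recurring patterns such as "make a decision" together.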
Collocation analysis reveals which words co-occur more often than chance would predict: "collocate" with "with" (> 80% of the time), or "accrue" with "benefits." These patterns shape how speakers produce and understand language. General dictionaries often do not record them; they emerge from corpus frequency.
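One common way to quantify "more frequent than chance would predict" is pointwise mutual information (PMI), which compares the observed probability of a word pair with the probability expected if the two words were independent. The sketch below scores adjacent word pairs in a tiny made-up text. The function name `pmi_scores`, the tokenizer, and the `min_count` threshold are assumptions, and PMI is only one of several association measures used in practice.

```python
import math
import re
from collections import Counter

def pmi_scores(text, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (log base 2)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # PMI is unreliable for very rare pairs
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Made-up text for illustration; real collocation work needs far more data.
text = ("interest and benefits accrue over time and benefits accrue "
        "to members and the members wait over time")

scores = pmi_scores(text)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

In this toy text the pair ("benefits", "accrue") scores higher than ("and", "benefits"), because "and" is frequent overall and so co-occurs with many words by chance.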
Corpus evidence has substantially reshaped linguistic theory. Proposed grammatical constraints have been tested against usage data, and many "rules" derived from introspection turn out to be strong tendencies with exceptions. Variation across registers and contexts is enormous: a construction that is common in conversation may be rare in academic writing. Corpora have shown how much linguistic variation was invisible to theories focused on "core grammar."
Statistical rigor is essential. Corpus analysis should report significance tests together with confidence intervals and effect sizes, and should correct for multiple comparisons. Raw frequency is misleading on its own: counts must be normalized (for example, per million words) before corpora of different sizes can be compared. A word's rise in frequency over time is interesting, but before concluding that the language is changing, confounds must be ruled out: genre shifts in the corpus, demographic changes, or sampling artifacts.
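As a small worked example of moving beyond raw counts, the sketch below normalizes frequencies to a per-million rate and computes a log-likelihood (G2) statistic for the difference between two corpora, using the two-term form often applied in keyword analysis. The counts and corpus sizes are invented for illustration, and the function names are assumptions.

```python
import math

def per_million(count, corpus_size):
    """Normalized frequency: occurrences per million tokens."""
    return count / corpus_size * 1_000_000

def log_likelihood(count_a, size_a, count_b, size_b):
    """Two-term log-likelihood (G2) for one word's frequency in two corpora."""
    total = count_a + count_b
    # Expected counts if the word were equally frequent in both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2

# Invented figures: the raw count is higher in corpus B, but B is three times larger.
count_a, size_a = 150, 1_000_000
count_b, size_b = 300, 3_000_000

print(per_million(count_a, size_a), per_million(count_b, size_b))
g2 = log_likelihood(count_a, size_a, count_b, size_b)
# 3.84 is the chi-square critical value for p < 0.05 with one degree of freedom.
print(round(g2, 2), g2 > 3.84)
```

Here the word is relatively more frequent in corpus A even though its raw count is lower. A significant G2 value says only that the difference is unlikely to be due to chance; it does not rule out the confounds named above, and an effect size should be reported alongside it.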
Corpus linguistics has not replaced theory; it complements it. Empirical evidence from corpora constrains theoretical claims, uncovers new phenomena that require explanation, and shows the extent of variation that pure theory might overlook. Modern linguistics combines careful theorizing with rigorous empirical investigation through corpora.