Corpus Linguistics - Methodology — Open Knowledge Graph


Explainer

Traditional linguistic research has often relied on introspection: linguists judge what is grammatical and how language works from their own intuition. Corpus linguistics takes a different approach, studying language systematically through large collections of naturally occurring text. This empirical methodology has changed the field by revealing patterns that introspection misses and by challenging assumptions that seemed solid on intuition alone.

A corpus is not simply a big collection of texts. It's a carefully designed, curated, and annotated collection of language data with specific properties:

Large scale: Sufficient size for statistical analysis (modern corpora contain millions to billions of words)
Systematic sampling: Representative sampling from defined populations (time period, genre, region, demographic)
Documented provenance: Source metadata (publication date, genre, author demographics, register)
Annotation: Systematic linguistic markup (part-of-speech tags, syntactic parsing, semantic annotations)
Quality control: Standardized annotation frameworks and inter-annotator agreement measures
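
Inter-annotator agreement is often reported as Cohen's kappa, which corrects raw agreement between two annotators for the agreement expected by chance. The following is a minimal sketch, assuming two annotators have tagged the same six tokens; the tag sequences are invented for illustration and the function name `cohen_kappa` is hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions, summed over labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Invented part-of-speech tags for six tokens (illustrative data only).
annotator_1 = ["NOUN", "VERB", "NOUN", "ADJ", "VERB", "NOUN"]
annotator_2 = ["NOUN", "VERB", "ADJ", "ADJ", "VERB", "NOUN"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

Here the annotators agree on five of six tokens, but a third of that agreement would be expected by chance given their label distributions, so kappa is lower than the raw agreement rate.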

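Major corpus projects include the British National Corpus (100 million words of British English), the Corpus of Contemporary American English (COCA, 560+ million words in earlier releases and more than one billion since its 2020 update), and treebanks (syntactically parsed corpora) such as the Penn Treebank. Resources at this scale support frequency-based and distributional investigations that are impractical with introspection or small hand-collected samples.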

Corpus methodology proceeds through several stages:

1. Corpus design: Define population, sampling frame, and size

2. Data collection: Gather texts according to sampling strategy

3. Annotation: Apply linguistic markup (POS, parsing, etc.)

4. Analysis: Search, extract, and analyze patterns

5. Statistical inference: Draw conclusions about language properties
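
The design and collection stages can be illustrated with proportional stratified sampling, where each genre contributes texts in proportion to its share of the sampling frame. This is a sketch under stated assumptions: the genres, frame sizes, and the function name `stratified_sample` are hypothetical, and real corpus projects usually balance on several dimensions at once.

```python
import random

def stratified_sample(texts_by_genre, total, seed=0):
    """Draw roughly `total` texts, allocating quota to each genre proportionally."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    population = sum(len(texts) for texts in texts_by_genre.values())
    sample = {}
    for genre, texts in texts_by_genre.items():
        # Proportional quota; rounding means quotas may not sum exactly to `total`.
        quota = min(round(total * len(texts) / population), len(texts))
        sample[genre] = rng.sample(texts, quota)
    return sample

# Hypothetical sampling frame: text identifiers grouped by genre.
frame = {
    "fiction": [f"fiction_{i}" for i in range(50)],
    "news": [f"news_{i}" for i in range(30)],
    "academic": [f"academic_{i}" for i in range(20)],
}
sample = stratified_sample(frame, total=10)
for genre, chosen in sample.items():
    print(genre, len(chosen))
```

Fixing the random seed keeps the selection reproducible, which matters when the sampling procedure has to be documented alongside the corpus.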

Concordance analysis is a core corpus technique: searching for a keyword and examining every context in which it appears. For example, searching for "make" in COCA returns contexts such as "make a decision," "make progress," and "make sense." Examining hundreds of concordance lines brings out patterns that would not be visible in anecdotal data.
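
A keyword-in-context (KWIC) display is the usual form of a concordance. The sketch below assumes a very simple tokenizer (lowercased alphabetic strings) and a three-sentence toy text rather than a real corpus; the function name `concordance` is hypothetical.

```python
import re

def concordance(text, keyword, width=3):
    """Return (left context, node word, right context) for each hit of `keyword`."""
    # Naive tokenization: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            # Context windows ignore sentence boundaries in this simplified version.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

# Toy text standing in for corpus data.
sample_text = (
    "We need to make a decision soon. They make progress every week. "
    "That does not make sense to me."
)
kwic = concordance(sample_text, "make")
for left, node, right in kwic:
    print(f"{left:>25} [{node}] {right}")
```

Aligning the node word in a central column, as the print loop does, is what lets recurring right-hand and left-hand patterns stand out when many lines are read together.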

Collocation analysis reveals which words co-occur more often than chance would predict: "collocate" with "with" (> 80% of the time), "accrue" with "benefits." These patterns shape how speakers produce and understand language. Many are not recorded in general dictionaries and become visible only through corpus frequency.
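
One common way to quantify "more frequent than chance would predict" is pointwise mutual information (PMI), which compares the observed probability of a word pair with the probability expected if the two words were independent. The sketch below is one possible measure among several (others include log-likelihood and t-score), applied to an invented eleven-token sequence; the function name `pmi` is hypothetical.

```python
import math
from collections import Counter

def pmi(tokens, w1, w2):
    """Pointwise mutual information (base 2) for the adjacent pair (w1, w2)."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    pair_count = bigrams[(w1, w2)]
    if pair_count == 0:
        return float("-inf")  # the pair never occurs adjacently
    p_pair = pair_count / (n - 1)   # n - 1 adjacent positions in the sequence
    p_w1 = unigrams[w1] / n
    p_w2 = unigrams[w2] / n
    return math.log2(p_pair / (p_w1 * p_w2))

# Invented token sequence, far too small for real conclusions.
tokens = "make a decision make progress make a decision take a break".split()
score = pmi(tokens, "make", "a")
print(round(score, 2))
```

A positive score means the pair occurs more often than independence would predict. PMI is known to overrate rare words, so in practice it is usually combined with a minimum frequency threshold.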

Corpus evidence has substantially reshaped linguistic theory. Assumed grammatical constraints have been challenged: "rules" arrived at by introspection often turn out to be strong tendencies with exceptions. Variation across registers and contexts is enormous, so a construction that is common in conversation may be rare in academic writing. Corpora have shown that much of this variation was invisible to theory focused on "core grammar."

Statistical rigor is essential. Corpus analysis must account for multiple comparisons, confidence intervals, significance testing, and effect size. Raw frequency is misleading without normalization: a count means little until it is expressed relative to corpus size, for example per million words. A word's rise in frequency over time is interesting, but before concluding that the language is changing, confounds must be ruled out: genre shifts in the corpus, demographic changes, or sampling artifacts.
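
To make the point about raw frequency concrete, the sketch below normalizes counts to a per-million rate and then applies a log-likelihood (G2) comparison in the simplified two-term form often used for comparing word frequencies across corpora. The counts and corpus sizes are hypothetical, as are the function names `per_million` and `log_likelihood`.

```python
import math

def per_million(count, corpus_size):
    """Normalized frequency: occurrences per million words."""
    return count / corpus_size * 1_000_000

def log_likelihood(count_a, size_a, count_b, size_b):
    """G2 statistic comparing one word's frequency in two corpora (two-term form)."""
    total = count_a + count_b
    # Expected counts if the word were equally frequent in both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2

# Hypothetical counts: the larger corpus has more raw hits but a lower rate.
rate_a = per_million(150, 1_000_000)
rate_b = per_million(300, 4_000_000)
g2 = log_likelihood(150, 1_000_000, 300, 4_000_000)
print(rate_a, rate_b, round(g2, 2))
```

The second corpus has twice as many raw occurrences but half the normalized rate, which is exactly the kind of reversal that raw counts hide. A G2 value above 3.84 corresponds to p < 0.05 for a single comparison with one degree of freedom; testing many words at once calls for a stricter threshold, and a significant result still says nothing about confounds such as genre shifts.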

Corpus linguistics has not replaced theory; it complements it. Empirical evidence from corpora constrains theoretical claims, uncovers new phenomena requiring explanation, and shows the extent of variation that pure theory might overlook. Modern linguistics combines careful theorizing with rigorous empirical investigation through corpora.
