Speech production requires planning at multiple levels: conceptual planning (what to say), grammatical encoding (how to structure it), and phonetic encoding (how to pronounce it). Speech errors like spoonerisms and malapropisms reveal distinct processing stages and the normal mechanisms by which sounds are selected and sequenced.
When you produce a sentence, you are not simply converting a thought into sound — you are executing a tightly coordinated cascade of planning operations, each operating at a different level of abstraction, on a timescale measured in hundreds of milliseconds. Levelt's influential model of speech production identifies three broad stages. Conceptualization produces a preverbal message: the intention and its propositional content, before any linguistic form is selected. Formulation translates the preverbal message into a linguistic plan, subdividing into grammatical encoding (selecting words and building syntactic structure) and phonological encoding (assembling the sound sequence). Articulation executes the motor plan. Your prior work on language production gives you the macro-level picture; what this topic adds is the fine-grained mechanisms within formulation and articulation planning, and the evidence from errors.
Lexical selection, choosing the right word, occurs in two steps. First, lemma retrieval: the appropriate word is identified at an abstract lexical level that captures its meaning and syntactic properties (its grammatical category, whether it is transitive, its gender in languages that have it) without yet specifying its phonological form. Then lexeme retrieval supplies the word's phonological form. Evidence for this two-step architecture comes from the tip-of-the-tongue phenomenon: you know the word's meaning and syntactic properties, so the lemma has been retrieved, but the lexeme (the full phonological form) is temporarily unavailable. Speakers in this state can often still report fragments of the form, such as the first letter or the number of syllables, which suggests that lexeme retrieval has started but not completed. The dissociation is the key point: lemma-level information is fully accessible while form-level information is at best partial.
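The two-step idea can be made concrete with a small sketch. This is a toy illustration, not an implementation of Levelt's model: the lexicon entries, the `Lemma` class, and the `lexeme_available` flag are all hypothetical names introduced here, and a real tip-of-the-tongue state involves graded activation rather than an on/off switch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Lemma:
    """Abstract lexical entry: meaning and syntax, but no sound form."""
    concept: str
    category: str                      # grammatical category, e.g. "noun"
    transitive: Optional[bool] = None  # only meaningful for verbs


# Hypothetical toy lexicon (assumption): two separate stores, one per step.
LEMMAS = {
    "SEXTANT": Lemma("SEXTANT", "noun"),
    "DEVOUR": Lemma("DEVOUR", "verb", transitive=True),
}
LEXEMES = {"SEXTANT": "sextant", "DEVOUR": "devour"}


def retrieve_lemma(concept: str) -> Lemma:
    """Step 1: select the entry carrying meaning and syntactic properties."""
    return LEMMAS[concept]


def retrieve_lexeme(lemma: Lemma, available: bool = True) -> Optional[str]:
    """Step 2: fetch the phonological form; available=False simulates a failure."""
    return LEXEMES[lemma.concept] if available else None


def produce(concept: str, lexeme_available: bool = True) -> dict:
    """Run both steps and report what the speaker has access to."""
    lemma = retrieve_lemma(concept)
    form = retrieve_lexeme(lemma, lexeme_available)
    state = "spoken" if form is not None else "tip-of-the-tongue"
    return {"state": state, "category": lemma.category, "form": form}
```

Calling `produce("SEXTANT", lexeme_available=False)` returns a result whose grammatical category is known while `form` is `None`. Keeping the two stores separate is what lets step 1 succeed while step 2 fails.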
Speech errors are a primary source of evidence about the architecture of this planning process, because errors show which units can interact with which other units. Spoonerisms (exchanges of phonological segments: "tips of the slung" for "slips of the tongue") show that phonological segments are planned as units and that segments from different words can exchange with each other within a planning window, demonstrating that speech is planned ahead, across multiple words at once. Semantic substitutions (saying "table" for "chair" or "cat" for "dog") show that lexical retrieval involves competition among semantically related words, any of which can be incorrectly selected if the target is not sufficiently activated. Malapropisms (substituting a similar-sounding word or phrase: "for all intensive purposes" for "for all intents and purposes") arise at the lexeme level, where phonological neighbors can be retrieved instead of the target.
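A spoonerism can be sketched as an exchange of word onsets inside a multi-word plan. The sketch below is only an illustration under stated assumptions: words are written as lists of ARPAbet-style phoneme symbols, the `VOWELS` set covers just the vowels this example needs, and `split_onset` and `exchange_onsets` are hypothetical helper names. It uses the classic exchange in which "slips of the tongue" comes out as "tips of the slung".

```python
# Assumption: only the vowels needed for this example are listed.
VOWELS = {"IH", "AH"}


def split_onset(phonemes):
    """Split a word into its initial consonants (onset) and the remainder."""
    for i, symbol in enumerate(phonemes):
        if symbol in VOWELS:
            return phonemes[:i], phonemes[i:]
    return phonemes, []  # no vowel found: treat the whole word as onset


def exchange_onsets(plan, i, j):
    """Return a copy of the plan with the onsets of words i and j swapped."""
    new_plan = [list(word) for word in plan]  # leave the intended plan intact
    onset_i, rest_i = split_onset(new_plan[i])
    onset_j, rest_j = split_onset(new_plan[j])
    new_plan[i] = onset_j + rest_i
    new_plan[j] = onset_i + rest_j
    return new_plan


# Intended plan: "slips of the tongue", all four words held in the plan at once.
plan = [
    ["S", "L", "IH", "P", "S"],  # slips
    ["AH", "V"],                 # of
    ["DH", "AH"],                # the
    ["T", "AH", "NG"],           # tongue
]
error = exchange_onsets(plan, 0, 3)
print(" ".join(error[0]))  # T IH P S   ("tips")
print(" ".join(error[3]))  # S L AH NG  ("slung")
```

The exchange is only possible because both words are present in `plan` at the same time, which is the sense in which such errors point to planning across several words. The whole onset cluster "S L" moves together, leaving each vowel and coda in place.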
To connect this to your knowledge of primary motor cortex: articulation planning involves not only the sequential ordering of phonological units but also the preparation of the motor programs that drive the vocal tract. The forward model of motor control, in which the brain predicts the sensory consequences of a planned movement and uses prediction error to update the motor command, applies here as much as in limb movements. Speakers continuously monitor their own speech output (via both auditory feedback and efference copy) and can detect and correct errors in real time. Altering auditory feedback changes production: delaying it characteristically disrupts fluency, and shifting the pitch of heard speech typically leads speakers to adjust their own pitch in the opposite direction. Both effects show that auditory prediction is integrated into articulatory control, not merely a post-hoc check on what was said.
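The forward-model loop can be sketched as a few lines of simulation. This is a deliberately simplified toy, not a model from the literature: the function name, the parameters `feedback_weight` and `correction_rate`, and their default values are assumptions chosen for illustration. It also assumes the vocal tract executes the motor command exactly, so pitch stands in for the command itself.

```python
def simulate_pitch_shift(target_hz=200.0, shift_hz=20.0,
                         feedback_weight=0.4, correction_rate=0.5, steps=50):
    """Return the produced pitch at each step under shifted auditory feedback."""
    command = target_hz  # motor command, expressed as the intended pitch
    produced = []
    for _ in range(steps):
        produced.append(command)              # what the speaker actually produces
        predicted = command                   # efference copy gives the expected sound
        heard = command + shift_hz            # feedback altered by the experimenter
        prediction_error = heard - predicted  # mismatch between heard and expected
        # Blend prediction and feedback into an estimate of the current pitch.
        estimate = predicted + feedback_weight * prediction_error
        # Nudge the command so that the estimate moves toward the target.
        command += correction_rate * (target_hz - estimate)
    return produced


unshifted = simulate_pitch_shift(shift_hz=0.0)
shifted_up = simulate_pitch_shift(shift_hz=20.0)
print(round(unshifted[-1], 2), round(shifted_up[-1], 2))
```

With these assumed settings the simulated speaker settles below the target when feedback is shifted upward, and the size of the adjustment scales with `feedback_weight`. In this sketch the correction is partial because the estimate relies only partly on what is heard; setting `feedback_weight` to 0 removes the response entirely.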