Sequence Labeling and CRFs

Research Depth 82 in the knowledge graph I know this Set as goal
sequence-labeling crf structured

Core Idea

Sequence labeling assigns labels to each element in a sequence (part-of-speech tagging, named entity recognition). Conditional Random Fields (CRFs) model label dependencies, capturing that consecutive labels influence each other. As discriminative models, CRFs typically outperform generative HMMs for sequence labeling tasks.

Explainer

From your study of Hidden Markov Models, you understand the basic structure of sequence labeling: given a sequence of observations (words in a sentence, characters in a string, signals in a time series), assign a label to each position. In part-of-speech tagging, the input "The cat sat" gets labels [DET, NOUN, VERB]. In named entity recognition, "Barack Obama visited Paris" might get [B-PER, I-PER, O, B-LOC]. The challenge is that labels are not independent — knowing the current word is tagged as a determiner makes it much more likely that the next word is a noun.

HMMs handle these dependencies by modeling the joint probability P(observations, labels) as a product of emission probabilities (how likely is this word given this tag?) and transition probabilities (how likely is this tag given the previous tag?). But HMMs make a strong independence assumption: the probability of observing a word depends *only* on its tag, not on the surrounding words or any other features. This means an HMM cannot easily use rich features like "the word ends in -ing" or "the previous word is capitalized" without dramatically expanding the state space.

Conditional Random Fields (CRFs) solve this by modeling the conditional probability P(labels | observations) directly, without modeling how observations are generated. This discriminative approach means CRFs can incorporate arbitrary feature functions that examine any part of the input sequence — the current word, neighboring words, capitalization patterns, suffixes, prefixes — without worrying about their joint distribution. A linear-chain CRF defines a score for each label sequence as a weighted sum of feature functions evaluated at each position and each pair of adjacent labels, then normalizes over all possible label sequences to produce a valid probability distribution.

Training a CRF means finding feature weights that maximize the conditional likelihood of the correct label sequences in the training data. Inference — finding the most probable label sequence for a new input — uses the Viterbi algorithm, the same dynamic programming approach you learned for HMMs. The partition function (normalizing constant) is computed with the forward algorithm, again borrowed from HMMs. The algorithmic machinery is familiar; the difference is entirely in what is being modeled. Because CRFs are discriminative and feature-rich, they consistently outperform HMMs on tasks like POS tagging and NER. Modern systems often combine a neural network (BiLSTM or Transformer) to produce contextualized features with a CRF layer on top to enforce label consistency — getting the best of learned representations and structured prediction.

Practice Questions 5 questions

Prerequisite Chain

Counting to 10Counting to 20Understanding ZeroThe Number ZeroCounting to FiveOne-to-One CorrespondenceCombining Small Groups Within 5Addition Within 10Addition Within 20Two-Digit Addition Without RegroupingTwo-Digit Addition with RegroupingAddition Within 100Repeated Addition as MultiplicationMultiplication Facts Within 100Division as Equal SharingDivision as Grouping (Measurement Division)Division: Grouping (Repeated Subtraction) ModelDivision: Fair Sharing ModelDivision as Equal SharingDivision as GroupingBasic Division FactsDivision Facts Within 100Two-Digit by One-Digit DivisionDivision with RemaindersRemainders and Quotients in DivisionDivision Word ProblemsIntroduction to Long DivisionFactors and MultiplesPrime and Composite NumbersEquivalent FractionsRelating Fractions and DecimalsDecimal Place ValueReading and Writing DecimalsComparing and Ordering DecimalsAdding and Subtracting DecimalsMultiplying DecimalsDividing DecimalsDividing FractionsMixed Number ArithmeticOrder of OperationsInteger Order of OperationsVariable ExpressionsCombining Like TermsOne-Step EquationsTwo-Step EquationsSolving Multi-Step EquationsEquations with Variables on Both SidesLiteral EquationsSlope-Intercept FormPoint-Slope FormWriting Linear EquationsParallel and Perpendicular Line SlopesGraphing Linear EquationsPiecewise FunctionsOne-Sided LimitsContinuity DefinitionLimit Definition of the DerivativePower RuleConstant Multiple and Sum/Difference RulesProduct RuleChain RuleHigher-Order DerivativesConcavity and Inflection PointsSecond Derivative TestCurve SketchingOptimization ProblemsCritical Points of Multivariable FunctionsCritical Points and Classification of ExtremaSecond Partial Test for Local Extrema (Hessian)The Hessian Matrix and Second Derivative TestUnconstrained Optimization: Finding ExtremaOptimization in Multiple VariablesIntroduction to Reinforcement LearningPolicy Gradient MethodsActor-Critic MethodsTemporal Difference LearningQ-Learning AlgorithmDeep Q-Networks (DQN)Recurrent Neural NetworksLSTM and Gated Recurrent UnitsSequence-to-Sequence ModelsNamed Entity Recognition (NER)Sequence Labeling and CRFs

Longest path: 83 steps · 562 total prerequisite topics

Prerequisites (2)

Leads To (0)

No topics depend on this one yet.