Named Entity Recognition (NER)

Research Depth 81 in the knowledge graph
Unlocks 1 downstream topic
nlp sequence-labeling entity-extraction information-extraction

Core Idea

Named entity recognition identifies and classifies named entities (people, organizations, locations, dates) in text as a sequence labeling task. BiLSTM-CRF models combine bidirectional context with Markov constraints on valid label transitions; transformer models achieve state-of-the-art performance through contextual embeddings that capture long-range dependencies.

How It's Best Learned

Implement NER with a BiLSTM-CRF, then compare it with a transformer-based model (for example, a fine-tuned BERT), observing how the architectural differences affect recognition accuracy and speed.

Explainer

Named entity recognition is the task of scanning a sentence and identifying which words refer to real-world entities and what kind of entity each one is. Given the sentence "Apple was founded by Steve Jobs in Cupertino in 1976," a NER system should tag "Apple" as an organization, "Steve Jobs" as a person, "Cupertino" as a location, and "1976" as a date. This is fundamentally a sequence labeling problem: each token in the input receives a label, and the model must decide the correct label for every position in the sequence.

The labeling scheme itself requires care. The standard approach is BIO tagging (Beginning, Inside, Outside): the first token of an entity gets a B-tag (e.g., B-PER for the start of a person name), continuation tokens get I-tags (I-PER), and non-entity tokens get O. This lets the model handle multi-word entities like "Steve Jobs" (B-PER I-PER) and distinguish adjacent entities of the same type. Without the B/I distinction, two same-type entities that sit next to each other would collapse into one span, because nothing would mark where the first ends and the second begins.
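The BIO scheme can be sketched in a few lines of Python. The helper names `spans_to_bio` and `bio_to_spans` and the `(start, end, type)` span format are illustrative assumptions, not part of any particular library. The sketch assumes entity spans do not overlap and that a stray I-tag with no open entity is simply ignored.

```python
from typing import List, Tuple

# (start token index, end token index exclusive, entity type) -- hypothetical format
Span = Tuple[int, int, str]

def spans_to_bio(num_tokens: int, spans: List[Span]) -> List[str]:
    """Convert non-overlapping entity spans into one BIO tag per token."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags

def bio_to_spans(tags: List[str]) -> List[Span]:
    """Recover entity spans from a BIO tag sequence."""
    spans: List[Span] = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing entity
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # A stray I-tag with no open entity is ignored (an assumption of this sketch).
    return spans

tokens = "Apple was founded by Steve Jobs in Cupertino in 1976".split()
gold = [(0, 1, "ORG"), (4, 6, "PER"), (7, 8, "LOC"), (9, 10, "DATE")]
tags = spans_to_bio(len(tokens), gold)
for token, tag in zip(tokens, tags):
    print(f"{token:10s} {tag}")

# Two adjacent person names stay separate only because the second starts with B-PER.
adjacent = bio_to_spans(["B-PER", "I-PER", "B-PER"])
```

Round-tripping the tags through `bio_to_spans` gives back the original spans, which is a convenient sanity check when preparing training data.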

The classic neural architecture for NER is the BiLSTM-CRF. You already know that neural networks can learn contextual representations: the BiLSTM reads the sentence in both directions, giving each token a representation informed by its full context. But sequence labeling has a structural constraint that a standard per-token classifier ignores: adjacent labels are not independent. An I-PER tag should never follow a B-LOC tag, and an I-tag should never appear at the start of a sequence. The CRF (Conditional Random Field) layer on top of the BiLSTM learns a matrix of transition scores between label pairs, so the model scores entire label sequences rather than each tag in isolation, and invalid transitions typically end up with very low scores. At inference time, the Viterbi algorithm efficiently finds the highest-scoring label sequence for the whole sentence rather than greedily picking the best tag at each position.
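To make the difference between greedy and Viterbi decoding concrete, here is a minimal pure-Python sketch. The emission scores stand in for BiLSTM outputs and are made up for illustration. The transition scores are set by hand from the BIO rule (a large negative value for invalid pairs), whereas a real CRF layer would learn them from data. The names `viterbi`, `allowed`, and `NEG` are assumptions of this sketch.

```python
from typing import List

LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
NEG = -1e4  # stand-in for a "forbidden" score; a trained CRF learns these values

def allowed(prev: str, cur: str) -> bool:
    """BIO rule: I-X may only follow B-X or I-X."""
    if not cur.startswith("I-"):
        return True
    return prev in ("B-" + cur[2:], "I-" + cur[2:])

transitions = [[0.0 if allowed(p, c) else NEG for c in LABELS] for p in LABELS]
start = [NEG if label.startswith("I-") else 0.0 for label in LABELS]

def viterbi(emissions: List[List[float]],
            transitions: List[List[float]],
            start: List[float]) -> List[int]:
    """Return the label indices of the highest-scoring sequence.

    emissions[t][j]   : score of label j at position t
    transitions[i][j] : score of moving from label i to label j
    start[j]          : score of beginning the sequence with label j
    """
    n = len(start)
    score = [start[j] + emissions[0][j] for j in range(n)]
    backptrs = []
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for j in range(n):
            # Best previous label for ending at label j in position t.
            best_i = max(range(n), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            ptr.append(best_i)
        score = new_score
        backptrs.append(ptr)
    # Follow back-pointers from the best final label.
    path = [max(range(n), key=lambda j: score[j])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path

# Made-up scores for a two-token name: token 0 slightly prefers B-LOC,
# token 1 strongly prefers I-PER.
emissions = [
    [0.0, 1.5, 0.0, 2.0, 0.0],
    [0.5, 0.0, 2.0, 0.0, 1.0],
]
greedy = [LABELS[max(range(len(LABELS)), key=row.__getitem__)] for row in emissions]
decoded = [LABELS[i] for i in viterbi(emissions, transitions, start)]
print("greedy :", greedy)
print("viterbi:", decoded)
```

With these scores, greedy decoding produces the invalid pair B-LOC I-PER, while Viterbi trades a slightly worse first tag for a valid and higher-scoring sequence, B-PER I-PER.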

Transformer-based models like BERT have largely surpassed BiLSTM-CRFs by providing richer contextual embeddings. A fine-tuned BERT model for NER feeds its contextualized token representations into a classification head (with or without a CRF layer). The advantage is that BERT's pretraining on massive text corpora gives it deep knowledge of language structure and word usage patterns before it ever sees NER-labeled data. The word "Washington" in "Washington crossed the Delaware" and "Washington issued a statement" gets different contextual embeddings, helping the model distinguish person from organization or location uses. This contextual sensitivity, combined with the attention mechanism's ability to capture long-range dependencies, explains why transformer models achieve state-of-the-art NER performance across most benchmarks.
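The classification head itself is small: a linear layer and a softmax applied to each token's contextual vector. The toy sketch below shows only that step. The two-dimensional vectors are hand-made stand-ins for BERT outputs, and the weights are chosen by hand rather than learned, so every number and name here (`classify_tokens`, `weights`, `bias`) is an assumption for illustration.

```python
import math
from typing import List

LABELS = ["O", "B-PER", "B-ORG"]

def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def classify_tokens(hidden: List[List[float]],
                    weights: List[List[float]],
                    bias: List[float]) -> List[str]:
    """Apply a linear layer plus softmax to each token vector; return one label per token."""
    predictions = []
    for h in hidden:
        logits = [sum(h_k * w_k for h_k, w_k in zip(h, row)) + b
                  for row, b in zip(weights, bias)]
        probs = softmax(logits)
        predictions.append(LABELS[max(range(len(LABELS)), key=probs.__getitem__)])
    return predictions

# One weight row per label (hand-set here; learned during fine-tuning in practice).
weights = [[0.0, 0.0],   # O
           [2.0, 0.0],   # B-PER
           [0.0, 2.0]]   # B-ORG
bias = [0.5, 0.0, 0.0]

# Hypothetical contextual vectors: the same word "Washington" gets a different
# vector in each sentence because its context differs.
crossed_the_delaware = [[1.0, 0.0], [0.0, 0.0]]   # "Washington", "crossed"
issued_a_statement = [[0.0, 1.0], [0.0, 0.0]]     # "Washington", "issued"

person_reading = classify_tokens(crossed_the_delaware, weights, bias)
org_reading = classify_tokens(issued_a_statement, weights, bias)
print(person_reading, org_reading)
```

The head is identical in both calls; only the input vectors differ, which is the sense in which the contextual embeddings do most of the work.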

