
A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Named Entity Recognition (NER)

Research · Depth 108 in the knowledge graph

1 topic builds on this

781 prerequisites beneath it



Core Idea

Named entity recognition identifies and classifies named entities (people, organizations, locations, dates) in text as a sequence labeling task. BiLSTM-CRF models combine bidirectional context with Markov constraints on valid label transitions; transformer models achieve state-of-the-art performance through contextual embeddings that capture long-range dependencies.

How It's Best Learned

Implement NER using BiLSTM-CRF and compare with transformer-based models (BERT fine-tuned), observing how architectural differences affect recognition accuracy and speed.
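When comparing the two model families, recognition accuracy for NER is commonly reported as entity-level precision, recall, and F1 rather than per-token accuracy. The sketch below is one minimal way to compute it, assuming each entity is represented as a (start, end, type) tuple with an exclusive end index; the function name and the sample predictions are hypothetical.

```python
def entity_f1(gold, pred):
    """Entity-level precision, recall, and F1.

    An entity counts as correct only if its start, end, and type all match.
    """
    gold, pred = set(gold), set(pred)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Tokens: Apple was founded by Steve Jobs in Cupertino in 1976
gold = [(0, 1, "ORG"), (4, 6, "PER"), (7, 8, "LOC"), (9, 10, "DATE")]
# Hypothetical model output: truncates "Steve Jobs" and misses the date.
pred = [(0, 1, "ORG"), (4, 5, "PER"), (7, 8, "LOC")]

precision, recall, f1 = entity_f1(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Exact-match scoring is strict: the truncated "Steve" span earns no partial credit, so the same function can be applied to the output of either model for a like-for-like comparison.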

Explainer

Named entity recognition is the task of scanning a sentence and identifying which words refer to real-world entities — and what kind of entity each one is. Given the sentence "Apple was founded by Steve Jobs in Cupertino in 1976," a NER system should tag "Apple" as an organization, "Steve Jobs" as a person, "Cupertino" as a location, and "1976" as a date. This is fundamentally a sequence labeling problem: each token in the input receives a label, and the model must decide the correct label for every position in the sequence.

The labeling scheme itself requires care. The standard approach is BIO tagging (Beginning, Inside, Outside): the first token of an entity gets a B-tag (e.g., B-PER for the start of a person name), continuation tokens get I-tags (I-PER), and non-entity tokens get O. This lets the model handle multi-word entities like "Steve Jobs" (B-PER I-PER) and distinguish adjacent entities of the same type. Without the B/I distinction, two adjacent entities of the same type would be indistinguishable from one longer entity, because nothing would mark where the first ends and the second begins.
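The BIO scheme can be sketched as a pair of conversions between entity spans and per-token tags. This is an illustrative sketch, not a reference implementation: the function names are hypothetical, spans use an exclusive end index, and a stray I-tag with no open entity is simply ignored.

```python
def spans_to_bio(n_tokens, spans):
    """Convert (start, end, type) spans into one BIO tag per token."""
    tags = ["O"] * n_tokens
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags


def bio_to_spans(tags):
    """Recover (start, end, type) spans from a BIO tag sequence."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        # Close the open entity unless this tag continues it.
        if start is not None and tag != f"I-{etype}":
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


tokens = "Apple was founded by Steve Jobs in Cupertino in 1976".split()
spans = [(0, 1, "ORG"), (4, 6, "PER"), (7, 8, "LOC"), (9, 10, "DATE")]
tags = spans_to_bio(len(tokens), spans)
for token, tag in zip(tokens, tags):
    print(f"{token:10s} {tag}")

# Two adjacent one-token person names stay separate because each starts with B-.
adjacent = bio_to_spans(["B-PER", "B-PER"])
merged = bio_to_spans(["B-PER", "I-PER"])
```

In the last two lines, `["B-PER", "B-PER"]` decodes to two entities while `["B-PER", "I-PER"]` decodes to one, which is exactly the information the B/I distinction carries.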

The classic neural architecture for NER is the BiLSTM-CRF. You already know that neural networks can learn contextual representations — the BiLSTM reads the sentence in both directions, giving each token a representation informed by its full context. But sequence labeling has a structural constraint that a standard classifier ignores: adjacent labels are not independent. An I-PER tag should never follow a B-LOC tag, and an I-tag should never appear at the start of a sequence. The CRF (Conditional Random Field) layer on top of the BiLSTM learns a transition matrix between label pairs, scoring not just individual tag probabilities but entire label sequences. At inference time, the Viterbi algorithm efficiently finds the highest-scoring global label sequence rather than greedily picking the best tag at each position.
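The difference between greedy tagging and Viterbi decoding can be illustrated with a small pure-Python sketch. Everything numeric here is an assumption for illustration: in a real BiLSTM-CRF the emission scores come from the BiLSTM and the transition scores are learned, whereas below the emissions are made up and the transitions are hand-set to 0 for valid BIO moves and a large penalty for invalid ones.

```python
LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
NEG = -1e9  # large penalty standing in for a forbidden transition


def allowed(prev, curr):
    """Under BIO, I-X may only follow B-X or I-X."""
    if curr.startswith("I-"):
        return prev in (f"B-{curr[2:]}", f"I-{curr[2:]}")
    return True


# Hand-set scores (a trained CRF would learn these).
transitions = [[0.0 if allowed(p, c) else NEG for c in LABELS] for p in LABELS]
start_scores = [NEG if label.startswith("I-") else 0.0 for label in LABELS]


def viterbi(emissions, transitions, start_scores):
    """Return the highest-scoring label index path and its total score."""
    n = len(start_scores)
    # score[j]: best score of any path ending in label j at the current position
    score = [start_scores[j] + emissions[0][j] for j in range(n)]
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(range(n), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, max(score)


# Made-up emission scores for three tokens (columns follow LABELS).
emissions = [
    [0.0, 1.5, 0.0, 2.0, 0.0],  # token 0: slightly prefers B-LOC
    [0.0, 0.0, 2.0, 0.0, 0.5],  # token 1: strongly prefers I-PER
    [1.0, 0.0, 0.0, 0.0, 0.0],  # token 2: prefers O
]

greedy = [LABELS[max(range(len(LABELS)), key=lambda j: e[j])] for e in emissions]
path, best_score = viterbi(emissions, transitions, start_scores)
decoded = [LABELS[j] for j in path]
print("greedy :", greedy)
print("viterbi:", decoded, best_score)
```

With these numbers, greedy decoding produces the invalid sequence B-LOC I-PER O, while Viterbi gives up a little emission score on the first token to return the consistent sequence B-PER I-PER O.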

Transformer-based models like BERT have largely surpassed BiLSTM-CRFs by providing richer contextual embeddings. A fine-tuned BERT model for NER feeds its contextualized token representations into a classification head (with or without a CRF layer). The advantage is that BERT's pretraining on massive text corpora gives it deep knowledge of language structure and word usage patterns before it ever sees NER-labeled data. The word "Washington" in "Washington crossed the Delaware" and "Washington issued a statement" gets different contextual embeddings, helping the model distinguish person from organization or location uses. This contextual sensitivity, combined with the attention mechanism's ability to capture long-range dependencies, explains why transformer models achieve state-of-the-art NER performance across most benchmarks.
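The classification head itself is small: a linear layer plus softmax applied independently to each token's contextual vector. The sketch below shows that step in pure Python under strong simplifying assumptions: the two-dimensional "embeddings" for "Washington" and the head weights are invented stand-ins, not outputs of BERT, and the label set is reduced to three tags.

```python
import math

LABELS = ["O", "B-PER", "B-ORG"]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_token(hidden, weights, bias):
    """Linear layer + softmax: one probability per label for one token."""
    logits = [
        sum(w_k * h_k for w_k, h_k in zip(row, hidden)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Invented head parameters: one weight row per label.
weights = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0, 0.0]

# Invented contextual vectors for the same word in two sentences.
washington_crossed = [2.0, 0.0]  # "Washington crossed the Delaware"
washington_issued = [0.0, 2.0]   # "Washington issued a statement"

for name, vector in [("crossed", washington_crossed), ("issued", washington_issued)]:
    probs = classify_token(vector, weights, bias)
    best = LABELS[probs.index(max(probs))]
    print(name, best, [round(p, 3) for p in probs])
```

Because the head sees a different vector for each occurrence, the same surface word can receive different labels; a static word embedding would give both occurrences the same input and therefore the same prediction.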

Practice Questions (5)

Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Dividing Integers → Unit Rates → Proportions → Percent Concept → Converting Between Fractions, Decimals, and Percents → Operations with Rational Numbers → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → 
Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Sine, Cosine, and Tangent Ratios → Trigonometric Ratios Review → Radian Measure → Converting Between Degrees and Radians → The Unit Circle → Graphing Sine and Cosine → Graphing Tangent and Reciprocal Trigonometric Functions → Derivatives of Trigonometric Functions → Antiderivatives → Indefinite Integrals → Basic Integration Rules → Riemann Sums → Definite Integral Definition → Probability Density Functions and Continuous Distributions → Cumulative Distribution Functions → Continuous Random Variables → Probability Density Functions → Expected Value → Weak Law of Large Numbers → Probability Axioms and Rules → Conditional Probability → Conditional Distributions → Conditional Expectation → Markov Chains → Markov Decision Processes → Introduction to Reinforcement Learning → Policy Gradient Methods → Policy Networks and Policy Gradients → Actor-Critic Methods → Temporal Difference Learning → Q-Learning Algorithm → Deep Q-Networks (DQN) → Recurrent Neural Networks → LSTM and Gated Recurrent Units → Gated Recurrent Units (GRU) → Sequence-to-Sequence Models → Named Entity Recognition (NER)

Longest path: 109 steps · 781 total prerequisite topics

Prerequisites (3)

Language Models and Neural Language Modeling (hard) · Neural Network Fundamentals (hard) · Sequence-to-Sequence Models (soft)

Leads To (1)

Sequence Labeling and CRFs (soft)