A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Sketching Data Structures

Research Depth 100 in the knowledge graph

566 prerequisites beneath it


Core Idea

Sketching compresses high-dimensional data into a small-space summary (a "sketch") that supports approximate queries. Many sketches are random linear projections: map the data vector x to a much lower-dimensional sketch s = A * x (computed modulo a prime or over the reals), where A is a random matrix chosen from a structured family. The count-min sketch uses a few independent hash functions for item frequency estimation; the Johnson-Lindenstrauss lemma shows that random projections preserve pairwise distances with high probability, so approximate nearest-neighbor search can run on much lower-dimensional data. Min-hashing estimates Jaccard similarity between sets by tracking the minimum hash value. Sketch linearity (the sketch of a sum equals the sum of sketches) enables distributed computation and streaming. These structures trade off sketch size, update time, and query error in precise ways, offering guarantees on approximation quality and probability of failure.

Explainer

Sketches are a fundamental tool for handling massive data: when the data is too large to store or process in real time, summarize it into a small sketch that supports approximate queries. The sketch is a lossy compression of the data, carefully designed so that despite the information loss, the approximation guarantees are strong and well-understood.

The count-min sketch is the workhorse. It maintains a d-by-w matrix of counters, where w = O(1 / epsilon) columns bound the additive error by epsilon times the stream's total count n, and d = O(log(1 / delta)) rows bring the failure probability down to delta. Each row has its own hash function. For each arriving item, increment d counters (one per row) by the item's count. To query an item's frequency, return the minimum counter across rows. Why minimum? Because collisions only cause overcounting (assuming non-negative updates): each counter tracks not just the item's frequency but also the frequencies of other items hashing to the same bucket. The minimum over independent hash functions is the tightest overestimate. By Markov's inequality, a single row overshoots by more than epsilon * n with probability at most a constant below 1, and because the rows are independent, all d rows overshoot together with probability at most delta. Total space: O((1 / epsilon) * log(1 / delta)) counters, a count that is independent of the stream size n (each counter still needs O(log n) bits).
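
The update and query rules above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the class name `CountMinSketch`, the sizing constants w = ceil(e / epsilon) and d = ceil(ln(1 / delta)), and the use of salted BLAKE2b as a stand-in for a pairwise-independent hash family are all assumptions made for the example.

```python
import hashlib
import math


class CountMinSketch:
    """Count-min sketch with d rows of w counters (illustrative only)."""

    def __init__(self, epsilon: float, delta: float):
        # Assumed sizing: w = ceil(e / epsilon), d = ceil(ln(1 / delta)).
        self.w = math.ceil(math.e / epsilon)
        self.d = math.ceil(math.log(1.0 / delta))
        self.table = [[0] * self.w for _ in range(self.d)]

    def _bucket(self, row: int, item: str) -> int:
        # Assumption: a per-row salt makes BLAKE2b behave like d independent hashes.
        digest = hashlib.blake2b(
            item.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.w

    def update(self, item: str, count: int = 1) -> None:
        # Touch exactly one counter in each of the d rows.
        for row in range(self.d):
            self.table[row][self._bucket(row, item)] += count

    def query(self, item: str) -> int:
        # Collisions only add, so the smallest counter is the tightest overestimate.
        return min(self.table[row][self._bucket(row, item)] for row in range(self.d))


cms = CountMinSketch(epsilon=0.01, delta=0.01)
stream = ["a"] * 50 + ["b"] * 7 + ["c"]
for word in stream:
    cms.update(word)

print(cms.d, cms.w)      # number of rows and counters per row
print(cms.query("a"))    # never below the true count of 50
```

With non-negative updates the estimate can never fall below the true frequency, and it can never exceed the total count of the stream; the epsilon and delta parameters control how far above the true value it is likely to land.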

The Johnson-Lindenstrauss lemma lifts sketching from frequency estimation to geometry. In high-dimensional space (like neural network embeddings, which can live in thousands of dimensions), a random projection of n points to just O(log(n) / epsilon²) dimensions preserves all pairwise distances up to (1 ± epsilon) factors with high probability. This is counterintuitive: the target dimension depends on the number of points, not on the original dimension, so you can discard the vast majority of the dimensions and still preserve geometry. The proof uses concentration of measure: the squared norm of a projected vector is a sum of independent squared coordinates, which concentrates around its expected value with exponentially small tails (a Chernoff-style bound; Chebyshev's inequality alone is too weak here). A union bound over all O(n²) pairs of points then gives the result.
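
A small experiment makes the distance-preservation claim concrete. The following is a sketch under stated assumptions: it uses a dense Gaussian matrix with entries of variance 1/k (one of several valid constructions), pure standard-library Python, and arbitrary illustrative sizes (20 points, 500 dimensions reduced to 200). The helper names `random_projection` and `distance` are hypothetical.

```python
import math
import random


def random_projection(points, k, seed=0):
    """Project each point to k dimensions with a random Gaussian matrix."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Entries are N(0, 1/k), so squared norms are preserved in expectation.
    scale = 1.0 / math.sqrt(k)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(dim)] for _ in range(k)]
    return [[sum(a * x for a, x in zip(row, p)) for row in matrix] for p in points]


def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


rng = random.Random(1)
points = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(20)]
sketches = random_projection(points, k=200)

# Ratio of projected distance to original distance for every pair of points.
ratios = [
    distance(sketches[i], sketches[j]) / distance(points[i], points[j])
    for i in range(20)
    for j in range(i + 1, 20)
]
print(min(ratios), max(ratios))  # both should be close to 1
```

Because the same matrix is applied to every point, the map is linear, and the distance between two sketches equals the norm of the sketch of their difference.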

Min-hashing is elegant for set similarity. Hash each element of a set and track the minimum hash value. Under a random hash function, two sets' minimum hashes agree with probability equal to the Jaccard similarity (size of the intersection over size of the union). Repeat with k independent hash functions and take the fraction of functions on which the minima agree: the estimate converges to the true Jaccard similarity. With k = O(1 / epsilon²) functions, the estimate is within an additive epsilon of the true value with constant probability; a further factor of log(1 / delta) in k brings the failure probability down to delta. Each set requires only k integers (for example, 64 bits each), making similarity search on massive collections feasible. This is the core of many large-scale clustering and deduplication systems.
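
The estimator described above can be illustrated as follows. This is a minimal sketch, assuming salted BLAKE2b behaves like k independent random hash functions; the function names `minhash_signature` and `estimate_jaccard` and the choice k = 400 are made up for the example.

```python
import hashlib


def _hash(seed: int, element: str) -> int:
    # Assumption: a per-seed salt approximates an independent random hash function.
    digest = hashlib.blake2b(
        element.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, k: int):
    """Return k minimum hash values, one per hash function."""
    return [min(_hash(seed, x) for x in items) for seed in range(k)]


def estimate_jaccard(sig_a, sig_b) -> float:
    """Fraction of hash functions whose minima agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


a = {f"item{i}" for i in range(0, 150)}
b = {f"item{i}" for i in range(50, 200)}

true_jaccard = len(a & b) / len(a | b)  # 100 shared elements out of 200 total
estimate = estimate_jaccard(minhash_signature(a, 400), minhash_signature(b, 400))
print(true_jaccard, estimate)
```

Each signature here is 400 integers regardless of how large the underlying sets are, which is what makes pairwise comparison over large collections affordable.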

The unifying property is mergeability, most often through linearity: count-min and Johnson-Lindenstrauss sketches are linear functions of the data (matrix-vector products, hash-based counts), so the sketch of combined data equals the sum of sketches. Min-hash signatures merge in the same spirit, with an element-wise minimum in place of a sum. Distributed processing becomes seamless: compute sketches at each node, transmit O(space) bits to a coordinator, combine the sketches, answer global queries. Exact non-linear statistics (median, percentile) cannot be recovered this way from local values: the median of per-node medians is not the global median, and approximate versions need purpose-built mergeable summaries. Sketch design is an active field: how to trade space, time per update, and approximation error? Newer sketches (t-digest, HyperLogLog variants) optimize for specific error metrics or distributions, and although not all of them are linear projections, they keep the core structure: a compact summary, a merge operation, and controlled approximation error.
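
The merge property for a linear sketch can be checked directly. The example below is a minimal sketch, assuming a count-min style table with hypothetical helper names (`bucket`, `sketch`, `merge`) and salted BLAKE2b as the hash family; merging is only valid when every node uses the same hash functions and table shape.

```python
import hashlib


def bucket(row: int, item: str, width: int) -> int:
    # Assumption: all nodes share these salted hash functions.
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") % width


def sketch(stream, depth: int = 4, width: int = 64):
    """Build a depth-by-width table of counters from a stream of items."""
    table = [[0] * width for _ in range(depth)]
    for item in stream:
        for row in range(depth):
            table[row][bucket(row, item, width)] += 1
    return table


def merge(t1, t2):
    """Counter-wise addition of two sketches with identical shape and hashes."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(t1, t2)]


node_a = ["x", "y", "x", "z"]
node_b = ["x", "w", "w"]

merged = merge(sketch(node_a), sketch(node_b))
combined = sketch(node_a + node_b)
print(merged == combined)  # the sum of sketches equals the sketch of the union
```

A coordinator therefore only needs the two small tables, never the raw streams, to answer frequency queries over the combined data.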


Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Introduction to Exponents → Order of Operations → Integer Order of Operations → Variable Expressions → The Distributive Property → Variables and Expressions Review → Introduction to Polynomials → Adding and Subtracting Polynomials → Multiplying Polynomials → Factorial → Permutations → Combinations → Counting Principles: Addition and Multiplication Rules → Introduction to Graph Theory → Propositional Logic Foundations → Logical Equivalences → 
Boolean Algebra → Boolean Type and Truth Values → Comparison Operators and Boolean Tests → Logical Operators and Boolean Algebra → Boolean Algebra and Fundamental Laws → Logic Gates Fundamentals → Implementing Boolean Functions with Gates → Karnaugh Map Simplification → Combinational Circuit Design → Flip-Flops and Latches → Finite State Machines (FSMs) → Deterministic Finite Automata (DFA) → Nondeterministic Finite Automata (NFA) → Two-Way Finite Automata → NFA to DFA Conversion (Subset Construction) → DFA Properties and Minimization Algorithms → Regular Languages: Definition and Characterization → Context-Free Grammars (CFGs) → Pushdown Automata (PDA) → Equivalence of CFGs and Pushdown Automata → Closure Properties of Context-Free Languages → Limitations of Context-Free Languages → Pumping Lemma for Context-Free Languages → Turing Machines → Variants of Turing Machines and Equivalence → Nondeterministic Time Complexity and NP → The P vs. NP Problem → Complexity Class P: Polynomial Time → Randomized Algorithms → Universal and Perfect Hashing → Bloom Filters → Streaming Algorithms → Sketching Data Structures

Longest path: 101 steps · 566 total prerequisite topics

Prerequisites (3)

Streaming Algorithms (hard), Randomized Algorithms (hard), Universal and Perfect Hashing (soft)

Leads To (0)

No topics depend on this one yet.