Sketching Data Structures

Research Depth 70 in the knowledge graph I know this Set as goal
sketching data-structures approximation linear-algebra-over-finite-fields

Core Idea

Sketching compresses high-dimensional data into a small-space summary (a "sketch") that supports approximate queries. The sketch operates via random linear projections: map the data vector x to a much lower-dimensional sketch s = A * x (mod prime or in a vector space), where A is a random matrix chosen from a structured family. The count-min sketch uses independent hash functions for item frequency estimation; the Johnson-Lindenstrauss lemma shows that random projections preserve pairwise distances with high probability, enabling approximate nearest-neighbor search in subquadratic space. Min-hashing estimates Jaccard similarity between sets by tracking the minimum hash value. Sketch linearity (the sketch of a sum equals the sum of sketches) enables distributed computation and streaming. These structures trade off sketch size, space, update time, and query error in precise ways, offering guarantees on approximation quality and probability of failure.

Explainer

Sketches are a fundamental tool for handling massive data: when the data is too large to store or process in real time, summarize it into a small sketch that supports approximate queries. The sketch is a lossy compression of the data, carefully designed so that despite the information loss, the approximation guarantees are strong and well-understood.

The count-min sketch is the workhorse. It maintains a d-by-w matrix where w = O(1 / epsilon) (space to achieve relative error epsilon) and d = O(log(1 / delta)) (rows to achieve failure probability delta). For each arriving item with frequency, increment w positions (one per row). To query item frequency, return the minimum counter across rows. Why minimum? Because collisions only cause overcounting: each counter tracks not just the item's frequency but also frequencies of other items hashing to the same bucket. The minimum over independent hash functions is the tightest overestimate, and the union bound over d rows bounds the overestimation. Total space: O((1 / epsilon) * log(1 / delta)) counters, completely independent of the stream size n.

The Johnson-Lindenstrauss lemma lifts sketching from frequency estimation to geometry. In high-dimensional space (like neural network embeddings, which live in thousands of dimensions), random projections to just O(log(n) / epsilon^2) dimensions preserve all pairwise distances up to (1 ± epsilon) factors with high probability. This is counterintuitive: you can lose 99.9% of the dimensions and still preserve geometry. The proof uses concentration of measure: the projection of any vector has norm concentrated tightly around its expected value, and distances are sums of squared projections, which concentrate by Chebyshev's inequality.

Min-hashing is elegant for set similarity. Hash each element of a set, track the minimum hash value. Two sets' minimum hashes agree with probability equal to the Jaccard similarity (intersection over union). Repeat with k independent hash functions and average: estimate converges to true Jaccard. With k = O(1 / epsilon^2) functions, estimate is (1 ± epsilon)-approximation with high probability. Each set requires only k integers (64 bits each), making similarity search on massive collections feasible. This is the core of large-scale clustering and deduplication systems.

The unifying property is linearity: sketches are linear functions of the data (matrix-vector products, hash-based counts). This enables merging — the sketch of combined data equals the sum of sketches. Distributed processing becomes seamless: compute sketches at each node, transmit O(space) bits to a coordinator, sum sketches, answer global queries. This is impossible for non-linear statistics (median, percentile), which cannot be recovered from local summaries. Sketch design is an active field: how to trade space, time per update, and approximation error? New sketches (t-digest, HyperLogLog variants) optimize for specific error metrics or distributions, but all preserve the core structure: random projections + linearity + provable approximation bounds.

Practice Questions 4 questions

Prerequisite Chain

Counting to 10Counting to 20Understanding ZeroThe Number ZeroCounting to FiveOne-to-One CorrespondenceCombining Small Groups Within 5Addition Within 10Addition Within 20Two-Digit Addition Without RegroupingTwo-Digit Addition with RegroupingAddition Within 100Repeated Addition as MultiplicationMultiplication Facts Within 100Division as Equal SharingDivision as Grouping (Measurement Division)Division: Grouping (Repeated Subtraction) ModelDivision: Fair Sharing ModelDivision as Equal SharingDivision as GroupingBasic Division FactsDivision Facts Within 100Two-Digit by One-Digit DivisionDivision with RemaindersRemainders and Quotients in DivisionDivision Word ProblemsIntroduction to Long DivisionFactors and MultiplesPrime and Composite NumbersEquivalent FractionsRelating Fractions and DecimalsDecimal Place ValueReading and Writing DecimalsComparing and Ordering DecimalsAdding and Subtracting DecimalsMultiplying DecimalsDividing DecimalsDividing FractionsMixed Number ArithmeticOrder of OperationsInteger Order of OperationsVariable ExpressionsCombining Like TermsOne-Step EquationsTwo-Step EquationsSolving Multi-Step EquationsEquations with Variables on Both SidesLiteral EquationsSlope-Intercept FormPoint-Slope FormWriting Linear EquationsParallel and Perpendicular Line SlopesGraphing Linear EquationsPiecewise FunctionsStep FunctionsComposition of FunctionsInverse FunctionsRadical Functions and GraphsRational ExponentsExponential Functions and GraphsLogarithms IntroductionTime and Space ComplexityTime Complexity Classes: P and EXPTIMENondeterministic Time Complexity and NPThe P vs. NP ProblemComplexity Class P: Polynomial TimeRandomized AlgorithmsUniversal and Perfect HashingBloom FiltersStreaming AlgorithmsSketching Data Structures

Longest path: 71 steps · 430 total prerequisite topics

Prerequisites (3)

Leads To (0)

No topics depend on this one yet.