Streaming Algorithms

Research Depth 69 in the knowledge graph I know this Set as goal
Unlocks 1 downstream topic
streaming count-min-sketch hyperloglog space-complexity

Core Idea

Streaming algorithms process massive data sequences in a single pass (or few passes) using memory sublinear in the input size — typically O(polylog n) or O(1/epsilon^2) space. The count-min sketch estimates item frequencies using a 2D array of counters with d hash functions, providing frequency estimates with additive error epsilon * ||f||_1 using O((1/epsilon) * log(1/delta)) counters. HyperLogLog estimates the number of distinct elements (cardinality) using O(log log n) bits per register across O(1/epsilon^2) registers, achieving epsilon-relative error. The AMS (Alon-Matias-Szegedy) sketch estimates frequency moments F_k = sum(f_i^k). These algorithms share a common structure: hash-based projections compress the stream into a compact summary, and probabilistic analysis guarantees approximation quality.

Explainer

The streaming model captures a fundamental constraint of modern data processing: the data is too large to store, arrives too fast to revisit, and you have severely limited memory. A streaming algorithm sees each element once (or a small constant number of times) and must maintain a compact summary — a sketch — that supports approximate queries about the entire stream. The theoretical question is: which statistics can be approximated in sublinear space, and how much space is necessary and sufficient?

The count-min sketch is perhaps the most practical streaming data structure. It maintains a d-by-w array of counters, where each of d rows uses a different hash function mapping items to w = O(1/epsilon) positions. When item x arrives, increment the counter at position h_i(x) in each row. To estimate the frequency of x, return the minimum counter value across all d rows. Each counter overestimates (collisions only add), so the minimum is the tightest estimate. With d = O(log(1/delta)) rows, the estimate exceeds the true frequency by at most epsilon * N (total stream length) with probability at least 1 - delta. Total space: O((1/epsilon) * log(1/delta)) counters.

HyperLogLog solves the distinct-count problem: how many unique elements have appeared in the stream? It exploits a probabilistic observation: if you hash elements to uniform random binary strings, the maximum number of leading zeros among n distinct hashes is approximately log_2(n). A single register tracking this maximum gives a rough cardinality estimate, but with high variance. HyperLogLog partitions elements into m = 2^p buckets (by the first p bits of the hash) and maintains a separate max-leading-zeros register per bucket. The stochastic averaging across buckets reduces variance, and the harmonic mean provides a better estimator than the arithmetic mean. With m = 1024 registers of 5 bits each (about 640 bytes total), HyperLogLog achieves ~3% standard error — estimating cardinalities up to billions with sub-kilobyte memory.

The theoretical foundations of streaming connect to communication complexity. The space lower bound for exact F_2 computation follows from a reduction to the communication complexity of set disjointness. More broadly, streaming lower bounds typically reduce to two-party communication problems: if Alice holds the first half of the stream and Bob the second, the sketch that Alice passes to Bob is a message in a communication protocol, and known communication lower bounds translate to streaming space lower bounds. This connection provides tight lower bounds showing that the sketching algorithms above are essentially optimal.

Practice Questions 4 questions

Prerequisite Chain

Counting to 10Counting to 20Understanding ZeroThe Number ZeroCounting to FiveOne-to-One CorrespondenceCombining Small Groups Within 5Addition Within 10Addition Within 20Two-Digit Addition Without RegroupingTwo-Digit Addition with RegroupingAddition Within 100Repeated Addition as MultiplicationMultiplication Facts Within 100Division as Equal SharingDivision as Grouping (Measurement Division)Division: Grouping (Repeated Subtraction) ModelDivision: Fair Sharing ModelDivision as Equal SharingDivision as GroupingBasic Division FactsDivision Facts Within 100Two-Digit by One-Digit DivisionDivision with RemaindersRemainders and Quotients in DivisionDivision Word ProblemsIntroduction to Long DivisionFactors and MultiplesPrime and Composite NumbersEquivalent FractionsRelating Fractions and DecimalsDecimal Place ValueReading and Writing DecimalsComparing and Ordering DecimalsAdding and Subtracting DecimalsMultiplying DecimalsDividing DecimalsDividing FractionsMixed Number ArithmeticOrder of OperationsInteger Order of OperationsVariable ExpressionsCombining Like TermsOne-Step EquationsTwo-Step EquationsSolving Multi-Step EquationsEquations with Variables on Both SidesLiteral EquationsSlope-Intercept FormPoint-Slope FormWriting Linear EquationsParallel and Perpendicular Line SlopesGraphing Linear EquationsPiecewise FunctionsStep FunctionsComposition of FunctionsInverse FunctionsRadical Functions and GraphsRational ExponentsExponential Functions and GraphsLogarithms IntroductionTime and Space ComplexityTime Complexity Classes: P and EXPTIMENondeterministic Time Complexity and NPThe P vs. NP ProblemComplexity Class P: Polynomial TimeRandomized AlgorithmsUniversal and Perfect HashingBloom FiltersStreaming Algorithms

Longest path: 70 steps · 429 total prerequisite topics

Prerequisites (4)

Leads To (1)