A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Streaming Algorithms

Research Depth 99 in the knowledge graph

1 topic builds on this

565 prerequisites beneath it

Core Idea

Streaming algorithms process massive data sequences in a single pass (or few passes) using memory sublinear in the input size — typically O(polylog n) or O(1/epsilon²) space. The count-min sketch estimates item frequencies using a 2D array of counters with d hash functions, providing frequency estimates with additive error epsilon * ||f||_1 using O((1/epsilon) * log(1/delta)) counters. HyperLogLog estimates the number of distinct elements (cardinality) using O(log log n) bits per register across O(1/epsilon²) registers, achieving epsilon-relative error. The AMS (Alon-Matias-Szegedy) sketch estimates frequency moments F_k = sum(f_i^k). These algorithms share a common structure: hash-based projections compress the stream into a compact summary, and probabilistic analysis guarantees approximation quality.
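As an illustration of the AMS idea for the second moment F_2, here is a minimal sketch of the "tug-of-war" estimator: each counter adds a pseudo-random sign per item, the squared counter has expectation F_2, and a median of group means stabilizes the result. The class name, the group sizes, and the polynomial sign hash (which only approximates a 4-wise independent family) are assumptions made for this example, not details taken from the text above.

```python
import random
import statistics

_P = (1 << 61) - 1  # prime modulus for the illustrative sign hash


class AMSF2Sketch:
    """Tug-of-war estimator for F_2 = sum_i f_i^2 (illustrative parameters)."""

    def __init__(self, groups=5, per_group=64, seed=0):
        rng = random.Random(seed)
        self.groups = groups
        self.per_group = per_group
        k = groups * per_group
        # One random cubic polynomial per counter; its low bit gives the sign.
        # Assumption: this stands in for a 4-wise independent +/-1 family.
        self.coeffs = [[rng.randrange(_P) for _ in range(4)] for _ in range(k)]
        self.counters = [0] * k

    def _sign(self, j, x):
        a, b, c, d = self.coeffs[j]
        h = (((a * x + b) * x + c) * x + d) % _P
        return 1 if h & 1 else -1

    def update(self, item, count=1):
        x = hash(item) % _P
        for j in range(len(self.counters)):
            self.counters[j] += self._sign(j, x) * count

    def estimate(self):
        squares = [z * z for z in self.counters]
        means = [
            statistics.fmean(squares[g * self.per_group:(g + 1) * self.per_group])
            for g in range(self.groups)
        ]
        # Averaging reduces variance; the median guards against outlier groups.
        return statistics.median(means)
```

Each update touches every counter, so this version trades update speed for simplicity; the memory used is the fixed number of counters, independent of the stream length.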

Explainer

The streaming model captures a fundamental constraint of modern data processing: the data is too large to store, arrives too fast to revisit, and you have severely limited memory. A streaming algorithm sees each element once (or a small constant number of times) and must maintain a compact summary — a sketch — that supports approximate queries about the entire stream. The theoretical question is: which statistics can be approximated in sublinear space, and how much space is necessary and sufficient?

The count-min sketch is perhaps the most practical streaming data structure. It maintains a d-by-w array of counters, where each of d rows uses a different hash function mapping items to w = O(1/epsilon) positions. When item x arrives, increment the counter at position h_i(x) in each row. To estimate the frequency of x, return the minimum counter value across all d rows. Each counter overestimates (collisions only add), so the minimum is the tightest estimate. With d = O(log(1/delta)) rows, the estimate exceeds the true frequency by at most epsilon * N (total stream length) with probability at least 1 - delta. Total space: O((1/epsilon) * log(1/delta)) counters.
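The update and query rules above can be sketched in a few lines of Python. This is a minimal illustration, assuming insert-only updates, the common parameter choices w = ceil(e / epsilon) and d = ceil(ln(1 / delta)), and a simple linear hash family modulo a prime; the class and method names are hypothetical.

```python
import math
import random


class CountMinSketch:
    """Count-min sketch for insert-only streams (illustrative, not tuned)."""

    PRIME = (1 << 61) - 1  # modulus for the illustrative hash family

    def __init__(self, epsilon, delta, seed=0):
        self.width = math.ceil(math.e / epsilon)             # w = O(1/epsilon)
        self.depth = max(1, math.ceil(math.log(1 / delta)))  # d = O(log(1/delta))
        rng = random.Random(seed)
        # Row i uses h_i(x) = ((a * x + b) mod PRIME) mod width.
        self.params = [
            (rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
            for _ in range(self.depth)
        ]
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0  # N, the stream length seen so far

    def _columns(self, item):
        x = hash(item) % self.PRIME
        return [((a * x + b) % self.PRIME) % self.width for a, b in self.params]

    def add(self, item, count=1):
        self.total += count
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += count

    def estimate(self, item):
        # Every counter is an overestimate, so the minimum is the tightest one.
        return min(
            self.table[row][col] for row, col in enumerate(self._columns(item))
        )


cms = CountMinSketch(epsilon=0.01, delta=0.01)
for word in ["a", "b", "a", "c", "a"]:
    cms.add(word)
print(cms.estimate("a"))  # at least 3: the sketch never underestimates
```

With epsilon = 0.01 and delta = 0.01 this allocates 5 rows of 272 counters, regardless of how many distinct items the stream contains.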

HyperLogLog solves the distinct-count problem: how many unique elements have appeared in the stream? It exploits a probabilistic observation: if you hash elements to uniform random binary strings, the maximum number of leading zeros among n distinct hashes is approximately log_2(n). A single register tracking this maximum gives a rough cardinality estimate, but with high variance. HyperLogLog partitions elements into m = 2^p buckets (by the first p bits of the hash) and maintains a separate max-leading-zeros register per bucket. The stochastic averaging across buckets reduces variance, and the harmonic mean provides a better estimator than the arithmetic mean. With m = 1024 registers of 5 bits each (about 640 bytes total), HyperLogLog achieves ~3% standard error — estimating cardinalities up to billions with sub-kilobyte memory.
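The bucketing and harmonic-mean steps can be sketched as follows. This is a simplified illustration, assuming a 32-bit hash taken from SHA-256 (so that register values fit in 5 bits), the standard bias constant for m >= 128, and the usual small-range fallback to linear counting; the large-range correction is omitted, and the class name is hypothetical. The ~3% figure corresponds to the standard error 1.04 / sqrt(m), which is about 3.25% for m = 1024.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog with m = 2**p registers (illustrative)."""

    HASH_BITS = 32  # assumption: 32-bit hashes, so ranks fit in 5-bit registers

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha256(str(item).encode()).digest()
        h = int.from_bytes(digest[:4], "big")
        tail_bits = self.HASH_BITS - self.p
        bucket = h >> tail_bits                 # first p bits pick the register
        rest = h & ((1 << tail_bits) - 1)       # remaining bits
        # Rank = number of leading zeros in the remaining bits, plus one.
        rank = tail_bits - rest.bit_length() + 1
        self.registers[bucket] = max(self.registers[bucket], rank)

    def estimate(self):
        # Harmonic-mean style combination of the per-register estimates 2**r.
        z = sum(2.0 ** -r for r in self.registers)
        raw = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range fallback: linear counting on empty registers.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog(p=10)
for i in range(10_000):
    hll.add(f"user-{i}")
print(round(hll.estimate()))  # an estimate of the 10,000 distinct items
```

Re-adding an item never changes a register, which is why duplicates in the stream do not affect the estimate.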

The theoretical foundations of streaming connect to communication complexity. The linear space lower bound for exact F_2 computation follows from a reduction from set disjointness: if Alice's set A is streamed first and Bob's set B second, then F_2 = |A| + |B| + 2|A ∩ B|, so an exact F_2 algorithm would decide whether A and B intersect. More broadly, streaming lower bounds are typically proved by reduction from two-party communication problems: if Alice holds the first half of the stream and Bob the second, the sketch that Alice passes to Bob is a message in a one-way communication protocol, and known communication lower bounds translate into streaming space lower bounds. For several problems, this connection gives lower bounds that match the sketching algorithms above up to small factors, showing that they are essentially optimal.
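The set-disjointness reduction mentioned above can be made concrete with a small example. The code below is only a demonstration of the identity F_2 = |A| + |B| + 2|A ∩ B| for a stream consisting of Alice's set followed by Bob's set; it computes F_2 exactly with a full counter table, which is precisely what a sublinear-space algorithm cannot afford. The function names are hypothetical.

```python
from collections import Counter


def f2(stream):
    """Exact second frequency moment: the sum of squared item frequencies."""
    return sum(c * c for c in Counter(stream).values())


def disjoint_via_f2(alice_set, bob_set):
    """Decide disjointness using only F_2 of the stream and the set sizes."""
    stream = list(alice_set) + list(bob_set)
    # Items in both sets have frequency 2 and contribute 4 instead of 1 + 1.
    overlap = (f2(stream) - len(alice_set) - len(bob_set)) // 2
    return overlap == 0


print(disjoint_via_f2({1, 2, 3}, {4, 5}))  # True
print(disjoint_via_f2({1, 2, 3}, {3, 4}))  # False
```

In a streaming protocol, Alice would run the algorithm on her half and send its memory state to Bob, so the state size is bounded below by the communication cost of disjointness.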

Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Introduction to Exponents → Order of Operations → Integer Order of Operations → Variable Expressions → The Distributive Property → Variables and Expressions Review → Introduction to Polynomials → Adding and Subtracting Polynomials → Multiplying Polynomials → Factorial → Permutations → Combinations → Counting Principles: Addition and Multiplication Rules → Introduction to Graph Theory → Propositional Logic Foundations → Logical Equivalences → 
Boolean Algebra → Boolean Type and Truth Values → Comparison Operators and Boolean Tests → Logical Operators and Boolean Algebra → Boolean Algebra and Fundamental Laws → Logic Gates Fundamentals → Implementing Boolean Functions with Gates → Karnaugh Map Simplification → Combinational Circuit Design → Flip-Flops and Latches → Finite State Machines (FSMs) → Deterministic Finite Automata (DFA) → Nondeterministic Finite Automata (NFA) → Two-Way Finite Automata → NFA to DFA Conversion (Subset Construction) → DFA Properties and Minimization Algorithms → Regular Languages: Definition and Characterization → Context-Free Grammars (CFGs) → Pushdown Automata (PDA) → Equivalence of CFGs and Pushdown Automata → Closure Properties of Context-Free Languages → Limitations of Context-Free Languages → Pumping Lemma for Context-Free Languages → Turing Machines → Variants of Turing Machines and Equivalence → Nondeterministic Time Complexity and NP → The P vs. NP Problem → Complexity Class P: Polynomial Time → Randomized Algorithms → Universal and Perfect Hashing → Bloom Filters → Streaming Algorithms

Longest path: 100 steps · 565 total prerequisite topics

Prerequisites (4)

Universal and Perfect Hashing (hard)
Randomized Algorithms (hard)
Bloom Filters (soft)
Expected Value and Variance (soft)

Leads To (1)

Sketching Data Structures (hard)