
A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Bloom Filters

Research · Depth 98 in the knowledge graph

2 topics build on this

564 prerequisites beneath it



Core Idea

A Bloom filter is a space-efficient probabilistic data structure for approximate set membership queries. It uses a bit array of m bits and k independent hash functions: inserting an element sets the k bit positions its hashes select; querying checks whether all k positions are set. False negatives are impossible (inserted elements always test positive), but false positives occur when the k positions of a non-member have all been set by other insertions. The false positive rate for n elements is approximately (1 - e^{-kn/m})^k, minimized when k = (m/n) ln 2. With optimal parameters, a Bloom filter uses about 1.44 log_2(1/epsilon) bits per element for false positive rate epsilon, typically far less than storing the elements themselves. Variants include counting Bloom filters (supporting deletion), cuckoo filters (better space for low epsilon), and Bloom filter cascades.
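The insert and query operations described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the class name BloomFilter and the method names are hypothetical, and the k positions are derived from a single SHA-256 digest by double hashing, which is an assumption made for brevity rather than a use of k truly independent hash functions.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k hash positions per element."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # m bits, packed 8 per byte

    def _positions(self, item: str):
        # Assumption: simulate k hash functions with double hashing,
        # g_i = h1 + i * h2 (mod m), using two halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # forced odd, never zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        # Insertion sets the k selected bits.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # A query reports membership only if all k selected bits are set.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter(m=1024, k=7)
for word in ["alpha", "beta", "gamma"]:
    bf.add(word)
print("alpha" in bf)  # True: inserted elements always test positive
print("delta" in bf)  # almost always False; True would be a false positive
```

Because bits are only ever set and never cleared, an inserted element can never test negative, which is exactly why this basic structure cannot support deletion.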

Explainer

You have seen the basic Bloom filter idea in data structures: a bit array with hash functions that supports fast approximate membership queries. At the expert level, the focus shifts to understanding the precise mathematical tradeoffs, the information-theoretic limits, and the design space of variants that extend the basic structure.

The false positive analysis is clean. After inserting n elements with k hash functions into m bits, and assuming the hash functions behave like independent uniform random functions, each specific bit remains 0 with probability (1 - 1/m)^{kn} ≈ e^{-kn/m}. A false positive occurs when all k bits for a non-member happen to be 1, which has probability approximately (1 - e^{-kn/m})^k. Minimizing over k (taking the derivative and setting it to zero) gives the optimal k = (m/n) ln 2, at which point e^{-kn/m} = 1/2, so about half the bits are set in expectation. This yields a false positive rate of (1/2)^k = (1/2)^{(m/n) ln 2} = 2^{-(m/n) ln 2}. Equivalently, achieving rate epsilon requires m = n * log_2(1/epsilon) / ln 2 ≈ 1.44 * n * log_2(1/epsilon) bits.
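To make the formulas concrete, the following sketch plugs numbers into them. The function names false_positive_rate and optimal_parameters are hypothetical, and the rounding of m and k to integers is a practical assumption that the closed-form analysis ignores.

```python
import math
from typing import Tuple


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Approximate false positive rate (1 - e^{-kn/m})^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


def optimal_parameters(n: int, epsilon: float) -> Tuple[int, int]:
    """Return (m, k) for n elements and target false positive rate epsilon."""
    # m = n * log_2(1/epsilon) / ln 2, rounded up to a whole number of bits.
    m = math.ceil(n * math.log2(1.0 / epsilon) / math.log(2))
    # k = (m/n) ln 2, rounded to the nearest integer and kept at least 1.
    k = max(1, round(m / n * math.log(2)))
    return m, k


n = 1_000_000
m, k = optimal_parameters(n, epsilon=0.01)
print(k)                             # 7 hash functions
print(m / n)                         # about 9.6 bits per element
print(false_positive_rate(m, n, k))  # close to the 0.01 target
```

For a 1% target, the filter needs roughly 9.6 bits per element regardless of how large the elements themselves are, and the achieved rate lands slightly off the target only because k must be an integer.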

The 1.44 factor (1/ln 2 ≈ 1.44) above the information-theoretic minimum of log_2(1/epsilon) bits per element is the "price of simplicity." Standard Bloom filters are not space-optimal data structures for approximate membership, but they are close, and their simplicity (bit-parallel operations, no pointer overhead, a compact flat array) makes them practical favorites. When the 44% overhead matters, alternatives exist: compressed Bloom filters reduce the transmitted or stored size by choosing parameters that let the bit array be entropy-coded; Golomb-coded sets achieve near-optimal space; cuckoo filters can match or beat Bloom filter space at low false positive rates while supporting deletions.

The variant landscape is rich. Counting Bloom filters replace bits with small counters to support deletion (at roughly 4x the space with typical 4-bit counters). Scalable Bloom filters grow dynamically as elements arrive. Bloomier filters store associated values, not just membership. Spectral Bloom filters estimate multiplicities. In distributed systems, Bloom filters are used for approximate set reconciliation: two parties can efficiently estimate which elements they share by exchanging Bloom filters. The common thread is the fundamental tradeoff: a small amount of space buys approximate answers to set queries, with a tunable, well-understood error rate.
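As one example from this design space, a counting Bloom filter can be sketched as follows. The class name CountingBloomFilter is hypothetical, the counters are modeled as a plain Python list saturating at 15 to mimic 4-bit counters, and the positions come from double hashing a SHA-256 digest; all three are illustrative assumptions rather than a fixed specification.

```python
import hashlib


class CountingBloomFilter:
    """Sketch of a counting Bloom filter: m small counters instead of m bits."""

    MAX_COUNT = 15  # assumption: emulate 4-bit counters that saturate at 15

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counts = [0] * m  # one list slot per counter, for clarity

    def _positions(self, item: str):
        # Assumption: k positions via double hashing of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            if self.counts[pos] < self.MAX_COUNT:
                self.counts[pos] += 1

    def remove(self, item: str) -> None:
        # Only safe for items that were actually added: removing a
        # non-member can introduce false negatives. Saturated counters
        # are left untouched because their true value is unknown.
        for pos in self._positions(item):
            if 0 < self.counts[pos] < self.MAX_COUNT:
                self.counts[pos] -= 1

    def __contains__(self, item: str) -> bool:
        return all(self.counts[pos] > 0 for pos in self._positions(item))


cbf = CountingBloomFilter(m=256, k=4)
cbf.add("alpha")
cbf.add("beta")
cbf.remove("alpha")
print("beta" in cbf)   # True: beta's counters are still positive
print("alpha" in cbf)  # False unless every alpha counter is shared with beta
```

Deletion works because each counter records how many insertions touched it, so removing one element does not erase evidence of the others; the price is several bits per slot instead of one.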


Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Introduction to Exponents → Order of Operations → Integer Order of Operations → Variable Expressions → The Distributive Property → Variables and Expressions Review → Introduction to Polynomials → Adding and Subtracting Polynomials → Multiplying Polynomials → Factorial → Permutations → Combinations → Counting Principles: Addition and Multiplication Rules → Introduction to Graph Theory → Propositional Logic Foundations → Logical Equivalences → 
Boolean Algebra → Boolean Type and Truth Values → Comparison Operators and Boolean Tests → Logical Operators and Boolean Algebra → Boolean Algebra and Fundamental Laws → Logic Gates Fundamentals → Implementing Boolean Functions with Gates → Karnaugh Map Simplification → Combinational Circuit Design → Flip-Flops and Latches → Finite State Machines (FSMs) → Deterministic Finite Automata (DFA) → Nondeterministic Finite Automata (NFA) → Two-Way Finite Automata → NFA to DFA Conversion (Subset Construction) → DFA Properties and Minimization Algorithms → Regular Languages: Definition and Characterization → Context-Free Grammars (CFGs) → Pushdown Automata (PDA) → Equivalence of CFGs and Pushdown Automata → Closure Properties of Context-Free Languages → Limitations of Context-Free Languages → Pumping Lemma for Context-Free Languages → Turing Machines → Variants of Turing Machines and Equivalence → Nondeterministic Time Complexity and NP → The P vs. NP Problem → Complexity Class P: Polynomial Time → Randomized Algorithms → Universal and Perfect Hashing → Bloom Filters

Longest path: 99 steps · 564 total prerequisite topics

Prerequisites (3)

Universal and Perfect Hashing (hard)
Bloom Filters: Space-Efficient Probabilistic Set Membership (hard)
Probability Rules: Addition, Multiplication, and Complement (soft)

Leads To (1)

Streaming Algorithms (soft)