Bloom filters are space-efficient probabilistic data structures that answer 'is element X in the set?' with no false negatives and controllable false positives. In distributed systems, they efficiently share set membership information (e.g., which keys a replica has), allowing quick rejection without full data transfer.
Implement a simple Bloom filter (bit array + hash functions). Observe false positives as you add elements, then increase the bit array size and observe the rate drop. Use it in an anti-entropy protocol: exchange Bloom filters first to identify likely mismatches.
From your work with distributed hash tables, you know that nodes in a distributed system each hold a subset of the data, and coordination between nodes often requires answering a deceptively simple question: "does node B have key X?" The naive approach — send the key to node B and wait for a lookup response — works but is expensive at scale. If you need to check thousands of keys across dozens of nodes, the network traffic and latency add up fast. Bloom filters solve this by letting each node summarize its entire key set in a compact data structure that can be transmitted cheaply and queried locally.
A Bloom filter is a bit array of *m* bits, initially all set to zero, paired with *k* independent hash functions. To add an element, you feed it through all *k* hash functions, each producing an index into the bit array, and set those *k* bits to 1. To query membership, you hash the element with the same *k* functions and check whether all *k* bits are set. If any bit is 0, the element is definitely not in the set — this is the no false negatives guarantee. If all bits are 1, the element is *probably* in the set, but it could be a false positive: those bits might have been set by other elements. The false positive rate depends on the ratio of set bits to total bits, which grows as you add more elements. You control it by choosing *m* and *k* appropriately for your expected set size.
In distributed systems, Bloom filters shine in anti-entropy protocols — the mechanisms nodes use to synchronize their data. Instead of exchanging full key lists (which could be millions of entries), two nodes exchange compact Bloom filters. Each node queries the other's filter to identify keys the other probably lacks, then sends only those keys. The false positives mean you will occasionally send data the other node already has, but that is a minor cost compared to the bandwidth saved by not sending everything. This pattern appears in systems like Cassandra for replica synchronization and in content-delivery networks for cache coordination.
The key engineering tradeoff is between space and accuracy. A Bloom filter with 10 bits per element and 7 hash functions achieves roughly a 1% false positive rate — meaning for a million keys, the filter is only about 1.2 megabytes, vastly smaller than the data itself. Shrink the bit array and false positives rise; expand it and you get more accurate membership tests at the cost of more memory and bandwidth. Crucially, standard Bloom filters do not support deletion — setting a bit to 0 might clear a bit shared by another element. Variants like counting Bloom filters (which replace each bit with a counter) support deletion at the cost of additional space. Choosing the right parameters — and the right variant — depends on your system's tolerance for false positives, the expected set size, and whether elements need to be removed.
No topics depend on this one yet.