A Bloom filter uses a bit array and k independent hash functions. To insert, set k bits; to test membership, check if all k bits are set. False positives are possible (k bits set by other elements) but false negatives are not. Space is O(n) bits regardless of element size.
Implement a Bloom filter, measure false positive rates with different k and table sizes, and use it for a practical problem (e.g., checking if a URL has been visited). Compare space to a hash set.
From your work with hash tables, you know that hashing maps elements to array positions for fast lookup. A hash set can tell you whether an element is a member of a collection in O(1) expected time, but it stores the actual elements — which can be expensive when elements are large (URLs, file paths, cryptographic keys) or when the set contains millions of entries. A Bloom filter trades perfect accuracy for dramatic space savings: instead of storing the elements themselves, it stores only a compact bit array that encodes a probabilistic summary of the set's contents.
Here is how it works. Start with a bit array of m bits, all initialized to 0, and choose k independent hash functions, each of which maps an element to a position in the array. To insert an element, compute all k hash values and set the corresponding k bits to 1. To query whether an element is in the set, compute the same k hashes and check whether all k bits are 1. If any bit is 0, the element was definitely never inserted — you get a guaranteed no. But if all k bits are 1, it might be a true member, or it might be that other insertions coincidentally set those same k bits. This is a false positive. Crucially, false negatives are impossible: if you inserted an element, its k bits are set forever (bits are never cleared back to 0 in a standard Bloom filter).
The false positive rate depends on three parameters: the bit array size m, the number of elements inserted n, and the number of hash functions k. As the array fills up with 1-bits, the probability that a random query finds all k bits set by coincidence increases. The optimal number of hash functions — the one that minimizes false positives — is k = (m/n) · ln 2, which comes from balancing two competing effects: more hash functions means more bits are checked (reducing coincidence), but also more bits are set per insertion (filling the array faster). In practice, with 10 bits per element and 7 hash functions, the false positive rate is about 1%.
The practical applications are everywhere. Web browsers use Bloom filters to check URLs against a list of known malicious sites — the filter fits in memory while the full list would not. Databases use them to avoid expensive disk reads: before searching a file on disk for a key, check the Bloom filter in memory; if it says "no," skip the disk I/O entirely. Distributed systems use them to synchronize data between nodes without transferring entire sets. The pattern is always the same: you have a large set, you need fast membership checks, and you can tolerate a small fraction of false positives because the cost of a false positive (an extra disk read, a redundant network request) is low compared to the cost of storing the full set in memory.