A Bloom filter with m = 10n bits and optimal k hash functions achieves a false positive rate of approximately 0.82%. If you double the number of hash functions beyond the optimal k, what happens?
AThe false positive rate halves because more hash functions provide more verification
BThe false positive rate INCREASES because the extra hash functions fill the bit array faster, overwhelming the additional verification benefit
CThe false positive rate stays the same because m and n are unchanged
There is a sweet spot for k: too few hash functions means each query checks too few bits (insufficient evidence), but too many means the bit array fills up too quickly (every position becomes 1). The optimal k = (m/n) ln 2 balances these forces. Beyond the optimum, each additional hash function sets more bits (increasing the fraction of 1s) faster than it adds discriminative power, so the false positive rate rises. For m = 10n, optimal k ≈ 7; doubling to k = 14 fills ~75% of bits instead of ~50%, roughly tripling the false positive rate.
Question 2 True / False
Standard Bloom filters do not support element deletion because clearing a bit might remove evidence of other elements that hash to the same position.
TTrue
FFalse
Answer: True
When multiple elements share a bit position (which happens by design — the filter works precisely because positions are shared), clearing that bit for one element would create false negatives for others. Counting Bloom filters solve this by replacing each bit with a counter: insertion increments, deletion decrements. This supports deletion at the cost of ~4x more space (counters need multiple bits). Cuckoo filters offer an alternative that supports deletion more space-efficiently. The inability to delete is not a bug in standard Bloom filters — it is a necessary consequence of the compression that makes them space-efficient.
Question 3 Short Answer
A system architect proposes using a Bloom filter with 1 GB of memory to track 1 billion URLs for a web crawler's 'already visited' set. Calculate the approximate false positive rate and assess whether this is practical.
Think about your answer, then reveal below.
Model answer: 1 GB = 8 * 10^9 bits, n = 10^9 URLs, so m/n = 8 bits per element. Optimal k = 8 * ln(2) ≈ 5.5, so use k = 5 or 6. The false positive rate is approximately (1 - e^(-kn/m))^k ≈ (1 - e^(-5.5))^5.5 ≈ (0.5)^5.5 ≈ 0.0218, about 2.2%. This means ~2.2% of unvisited URLs would be incorrectly marked as visited and skipped. For a web crawler, this is often acceptable — missing 2% of pages is a reasonable tradeoff for using 1 byte per URL instead of storing the full URL (typically 50-100 bytes). With 2 GB (16 bits/element), the rate drops to ~0.015% (1 in 7000).
This is a realistic production scenario — Google's Bigtable and Chrome's malicious URL detection both use Bloom filters. The key insight is that 8 bits per element (1 byte!) gives a useful 2% false positive rate, while storing actual URLs would require 50-100x more memory. The tradeoff between false positive rate and space is smooth and predictable.
Question 4 True / False
Bloom filters use approximately 1.44 * log_2(1/epsilon) bits per element to achieve false positive rate epsilon. This is close to the information-theoretic minimum of log_2(1/epsilon) bits per element.
TTrue
FFalse
Answer: True
The information-theoretic lower bound for any data structure supporting approximate membership queries with false positive rate epsilon is log_2(1/epsilon) bits per element (you need at least this much information to distinguish n sets of the universe at precision epsilon). The standard Bloom filter uses 1.44 * log_2(1/epsilon) bits — a 44% overhead over the theoretical minimum. This overhead comes from using independent hash functions and a simple bit array. Compressed Bloom filters and other variants can approach the theoretical bound more closely, but standard Bloom filters are remarkably close to optimal given their simplicity.