Questions: Hash Function Design: Properties and Requirements
5 questions to test your understanding
Score: 0 / 5
Question 1 Multiple Choice
You build a hash table for student records using h(k) = birth_year mod 100. Most students were born in 2000–2005, so keys cluster into six buckets while 94 sit empty. What core hash function property does this violate?
ADeterminism — the function produces the same output for the same input
BUniform distribution — a good function spreads keys evenly, but this one creates severe clustering
CThe avalanche effect — changing one bit of a birth year doesn't change the hash much
DComputational efficiency — birth year extraction is too slow to compute
The function is deterministic (correct), computationally trivial (fast), but disastrously non-uniform. By hashing on just the birth year, the function ignores most of the key and maps a huge fraction of keys into a tiny number of buckets. Hash table performance degrades from O(1) toward O(n) when buckets are unbalanced. This is the clustering problem: a hash function must incorporate enough of the key to spread outputs across the full output range. The avalanche effect (option C) is also weak here — a 1-year change in birth year changes the hash predictably by 1 — but the primary failure is non-uniform distribution.
Question 2 Multiple Choice
A security engineer suggests using SHA-256 as the hash function for a high-throughput in-memory hash table storing billions of URL lookups per second. What is the main problem with this choice?
ASHA-256 is not deterministic — it may return different hashes for the same key
BSHA-256 lacks the avalanche effect needed for good distribution in hash tables
CSHA-256 is designed to be computationally expensive to prevent attacks, making it far too slow for hash table use
DSHA-256 cannot handle variable-length keys like URLs
SHA-256 is an excellent hash function — deterministic, excellent avalanche effect, near-perfect distribution. But it is deliberately designed to be expensive to compute: cryptographic security requires that brute-force preimage attacks cost enormous computation. In a hash table doing billions of lookups per second, that computational cost is pure overhead. Non-cryptographic functions like MurmurHash3 or xxHash achieve excellent distribution and avalanche properties with a fraction of the cost, because they're optimized for speed rather than cryptographic resistance. The right choice depends on your threat model: if you're worried about hash-flooding attacks (adversarial inputs chosen to cause collisions), a cryptographic function or universal hashing may be warranted. For performance-critical internal use, fast non-cryptographic functions dominate.
Question 3 True / False
A hash function exhibits the avalanche effect if changing a single bit of the input changes approximately half of the output bits unpredictably.
TTrue
FFalse
Answer: True
The avalanche effect is the formal description of input sensitivity in hash functions: any small change (even one bit) should produce a large, unpredictable change in the output. This is measured by looking at how many output bits flip on average for a single input bit change — good functions achieve close to 50% (i.e., random-looking). The avalanche effect directly prevents clustering of similar keys: if '[email protected]' and '[email protected]' hash to nearby values, records with similar keys will pile up in adjacent buckets. By scrambling the output thoroughly, the avalanche effect ensures similar inputs land in very different buckets.
Question 4 True / False
Any deterministic function that maps keys to integers in the range [0, m) is a suitable hash function for a hash table, as long as the same key generally produces the same value.
TTrue
FFalse
Answer: False
Determinism is necessary but far from sufficient. A function that maps every key to 0 is perfectly deterministic but catastrophically bad — every key collides, reducing the hash table to a linked list with O(n) lookups. The crucial additional requirement is uniform distribution: the function must spread keys evenly across all m buckets with no systematic clustering. Bad functions (using only a subset of input bits, using a modulus with a common factor shared by many keys, etc.) can be deterministic yet produce extreme clustering. Determinism ensures correctness (lookup finds what was inserted); distribution determines performance.
Question 5 Short Answer
Why might a fast, non-cryptographic hash function like MurmurHash be preferable to SHA-256 for a hash table, even though SHA-256 has stronger collision resistance? Under what circumstances would the choice reverse?
Think about your answer, then reveal below.
Model answer: MurmurHash achieves good distribution and avalanche effect with far less computation than SHA-256, which is deliberately expensive for cryptographic security. For internal hash tables without adversarial inputs, speed dominates — MurmurHash can hash billions of keys per second while SHA-256 cannot. The choice reverses when adversarial inputs are a concern: an attacker who knows your hash function can craft keys that all collide, degrading a hash table to O(n). SHA-256 (or universal hashing) prevents this by making it computationally infeasible to find collisions intentionally.
The design principle is that hash function properties are application-dependent tradeoffs, not absolute goods. 'Collision resistance' means different things for cryptography (computationally hard to find any two inputs with the same hash) versus hash tables (low expected collisions for typical inputs). A hash table in a web server processing user-supplied input may need collision-resistant hashing to prevent hash-flooding DoS attacks. An internal analytics pipeline crunching 64-bit integer keys at maximum throughput should use the fastest hash that distributes well. Understanding what property you need and why is what separates thoughtful function selection from cargo-culting SHA-256 everywhere.