A good hash function distributes keys uniformly across the hash table, minimizing collisions. Desirable properties include determinism, uniform distribution (no clustering), efficiency to compute, and avalanche effect (small changes in input cause large changes in output).
Analyze different hash functions (modulo, polynomial rolling hash, cryptographic) on real datasets. Measure collision rates and observe how poor functions (e.g., using just the first byte) create clustering.
From universal hashing, you know that no single fixed hash function can guarantee good performance against all possible inputs — an adversary who knows your hash function can always construct keys that collide. Universal hash families solve this by randomly selecting a function at runtime, making adversarial input construction infeasible. But this raises a practical question: what makes any individual hash function good or bad? Understanding the desirable properties of hash functions helps you evaluate specific designs and choose appropriately for your application.
The most fundamental property is uniform distribution: a good hash function spreads keys as evenly as possible across the output range, so that each bucket in a hash table receives roughly the same number of keys. Poor distribution creates clustering, where many keys hash to the same or nearby buckets while others sit empty. Imagine hashing student records by their birth year — with only a few dozen distinct years, most of the hash table is wasted. A function that incorporates all parts of the input (not just the first byte, or just one field) avoids such systematic clustering. The simplest example is `h(k) = k mod m`, which works reasonably when m is prime and keys are roughly uniformly distributed, but fails badly when keys share a common factor with m.
The avalanche effect captures a subtler requirement: small changes in input should produce large, unpredictable changes in output. If changing one bit of the key only changes one bit of the hash, then similar keys will hash to similar values — exactly the clustering you want to avoid. Good hash functions like MurmurHash and FNV achieve this by mixing bits through multiplication, XOR, and bit rotation at each step. The multiplication spreads information across bit positions, the XOR combines it nonlinearly, and the rotation ensures that high-order and low-order bits both influence the result. Determinism is also essential: the same key must always produce the same hash value within a single program execution, or lookups would fail to find previously stored keys.
In practice, hash function design involves tradeoffs between distribution quality and computational cost. Cryptographic hash functions like SHA-256 provide excellent distribution and collision resistance but are slow — they are designed to be computationally expensive to prevent attacks, which is unnecessary overhead for a hash table. Non-cryptographic functions like MurmurHash3, xxHash, and FNV-1a are optimized for speed while maintaining good distribution. The polynomial rolling hash, `h = (h · base + char) mod prime`, is popular for string hashing because it processes input incrementally and distributes well with appropriate base and prime choices. The right choice depends on your constraints: if you need to hash billions of keys per second in a database index, speed dominates; if you need collision resistance against malicious input and cannot use universal hashing, you might pay the cost of a cryptographic function.