A hash table stores key-value pairs and supports O(1) average-case insertion, deletion, and lookup. A hash function maps keys to array indices; since the key domain is typically larger than the table size, collisions (two keys mapping to the same index) are inevitable. Common collision resolution strategies are chaining (each slot holds a linked list) and open addressing (probe for the next open slot). A good hash function distributes keys uniformly; poor hash functions lead to many collisions and degrade performance to O(n).
Implement a simple hash table with chaining from scratch. Experiment with different hash functions and load factors to observe their effect on collision rates. Then examine how Python's dict handles resizing.
You already know that arrays give O(1) access by index — if you know the position, retrieval is instant. But what if you want to look things up by name rather than position? Hash tables solve exactly this problem: they let you store and retrieve values using arbitrary keys (strings, integers, objects) in O(1) average time.
The mechanism is elegant: a hash function converts any key into an integer, and that integer (modulo the table size) becomes the array index. For example, the key `"alice"` might hash to 42, and with a table of 100 slots, she is stored at index 42. Lookup is then O(1): hash the key, go to that index, done. The challenge is that the key space is typically enormous (all possible strings, for instance) while the table size is small, so many keys inevitably map to the same index — a collision. This is not a failure; it is expected and must be handled.
The two dominant collision resolution strategies are chaining and open addressing. Chaining stores a linked list at each slot; colliding entries are appended to the list. Open addressing keeps the table flat: on collision, the algorithm probes a sequence of other slots (linear probing, quadratic probing, or double hashing) until it finds an empty one. Chaining handles high load factors gracefully; open addressing has better cache performance because everything stays in the array. The choice between them is a real engineering tradeoff.
Performance degrades when the load factor (items / slots) grows too high. With many items and few slots, collisions become frequent and collision chains grow long, pushing lookup toward O(n). This is why hash tables resize — typically doubling the array size and rehashing all entries — when the load factor exceeds a threshold. The resizing itself is O(n), but it happens infrequently enough that the average cost per insertion is still O(1), which is an amortized analysis result you may have seen in that prerequisite topic.
A subtle misconception worth addressing: the O(1) claim assumes a good hash function and a low load factor. A bad hash function that maps all keys to the same slot gives O(n) always. And because adversaries can sometimes craft inputs that break a fixed hash function, production systems often use randomized hash seeds — so that even if an attacker knows your table size, they cannot predict which keys will collide. This is why Python randomizes its hash seed at startup.