Timeout and retry strategies determine how systems respond to transient failures. Immediate retries can amplify load during congestion; exponential backoff with jitter reduces cascading failures. Adaptive timeouts adjust based on measured latencies. Choosing timeouts is critical: too short causes false timeouts, too long degrades latency. Timeouts must be paired with idempotent operations for safe retries.
From your study of failure detection and heartbeats, you know that distributed systems cannot distinguish a slow node from a dead one — the network provides no certainty. Timeouts are the mechanism that forces a decision: after waiting long enough, the caller gives up and treats the request as failed. But "long enough" is the critical design choice. Set the timeout too short and you will declare healthy-but-slow nodes dead, triggering unnecessary retries and failovers. Set it too long and your system stalls waiting for responses that will never arrive, dragging latency up for every downstream caller.
The naive retry strategy — "if it fails, try again immediately" — is dangerous under load. Imagine a service that is slow because it is overloaded. Every client times out and retries, doubling the request volume hitting the already-struggling server. This is a retry storm, and it can turn a minor slowdown into a complete outage. The standard defense is exponential backoff: wait 1 second before the first retry, 2 seconds before the second, 4 before the third, and so on. This gives the overloaded system breathing room to recover. Adding jitter — randomizing the backoff interval within a range — prevents the thundering herd problem where many clients back off in lockstep and then all retry at exactly the same moment.
Adaptive timeouts take this further by learning from observed behavior. Instead of using a fixed timeout value, the system tracks recent response latencies (typically using a percentile like p99) and sets the timeout just above that threshold. If a service normally responds in 50ms but occasionally takes 200ms, an adaptive timeout might settle around 250ms — tight enough to detect genuine failures quickly but loose enough to avoid false alarms during normal variance. TCP itself uses this approach with its retransmission timeout calculation.
The final piece is safety: retries are only safe if the operation can be executed multiple times without changing the result. If a payment service charges a customer and the acknowledgment is lost, retrying the request must not charge them again. This is why timeout-retry strategies must be paired with idempotent operations — operations where applying them once and applying them multiple times produce the same outcome. Without idempotency guarantees, every retry risks corrupting state, making the retry cure worse than the timeout disease.
No topics depend on this one yet.