A payment service processes a charge to a customer's credit card. The charge succeeds on the server, but the network drops the acknowledgment before it reaches the client. The client times out and retries the request. What problem arises if the payment operation is not idempotent?
A. The retry increases network latency, slowing down subsequent requests from other clients
B. The customer may be charged twice, since the server processes the retry as a new, independent request
C. The timeout was set too short; increasing the timeout would have prevented the retry
D. The server's connection pool becomes exhausted from handling the duplicate connection
Without idempotency, the server cannot distinguish a retry from a new request. It processes the second charge as a separate transaction, billing the customer twice. This is the core danger of retrying non-idempotent operations: the retry 'fixes' the client's uncertainty but corrupts server state. The solution is idempotency keys: the client generates a unique ID once per logical operation and sends the same ID on every retry, so the server can detect duplicates and return the original result without reprocessing. Retries are only safe when the operation produces the same final state regardless of how many times it is applied.
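As a rough illustration of the idempotency-key idea, here is a minimal in-memory sketch. The `PaymentServer` class, its `charge` method, and the result dictionary are hypothetical names invented for this example; a real payment service would persist keys durably and handle concurrent requests.

```python
import uuid


class PaymentServer:
    """Toy in-memory server (hypothetical); real systems persist keys durably."""

    def __init__(self):
        self.results = {}       # idempotency key -> result of the first attempt
        self.total_charged = 0

    def charge(self, idempotency_key, amount):
        # A key we have seen before means this is a retry:
        # return the stored result instead of charging again.
        if idempotency_key in self.results:
            return self.results[idempotency_key]
        self.total_charged += amount
        result = {"status": "charged", "amount": amount}
        self.results[idempotency_key] = result
        return result


server = PaymentServer()
key = str(uuid.uuid4())          # generated once per logical payment
first = server.charge(key, 50)   # succeeds, but suppose the ack is lost
retry = server.charge(key, 50)   # client retries with the SAME key
print(server.total_charged)      # 50, not 100
```

The key point is that the client reuses the key across retries; generating a fresh key per attempt would make every retry look like a new charge.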
Question 2 Multiple Choice
A service becomes overloaded and starts responding slowly. All 500 clients time out and immediately retry simultaneously. What happens, and what is the standard mitigation?
A. The retries succeed; the brief timeout period allowed the server to recover
B. A retry storm occurs: the simultaneous retries double the load on the already-struggling server, potentially causing complete failure. Exponential backoff with jitter is the standard mitigation
C. Clients should use shorter timeouts to detect failures faster and retry more aggressively
D. Load balancers automatically absorb the retry burst by routing requests to healthy replicas
Immediate simultaneous retries create a retry storm: 500 clients each retrying once doubles the request volume hitting the overloaded server, worsening the problem. Exponential backoff (waiting 1s, then 2s, then 4s between retries) gives the server time to drain its queue and recover. Adding jitter (randomizing each client's backoff within a range) prevents the 'thundering herd' — all clients backing off to exactly the same interval and retrying simultaneously. These two techniques together convert a potential death spiral into a recoverable slowdown.
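A minimal sketch of exponential backoff with jitter follows. The function names, the 1-second base, the 30-second cap, and the ±25% jitter range are assumptions chosen for illustration, and `ConnectionError` stands in for whatever failure the real client would see.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=30.0, spread=0.25, rng=random):
    """Delay before retry number `attempt` (0-based): base * 2**attempt,
    capped at `cap`, then randomized by +/- `spread` (the jitter)."""
    delay = min(cap, base * (2 ** attempt))
    return rng.uniform(delay * (1 - spread), delay * (1 + spread))


def call_with_retries(operation, max_attempts=5, sleep=time.sleep):
    """Call `operation`, backing off between failed attempts."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(backoff_delay(attempt))
```

With these defaults the nominal waits are 1s, 2s, 4s, and so on, each perturbed by up to a quarter of its length. Other jitter schemes exist (for example, picking uniformly between zero and the nominal delay); this sketch uses a symmetric range only because it is easy to read.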
Question 3 True / False
Setting a shorter timeout typically improves distributed system reliability because it detects failures faster and allows clients to retry sooner.
T. True
F. False
Answer: False
Timeouts that are too short cause false positives: slow-but-functional nodes are declared failed. This triggers unnecessary retries, failovers, and leader elections that add load to a system already under stress. There is a fundamental tension: too short means false failures and unnecessary retries; too long means the system stalls waiting for responses that won't come. Adaptive timeouts ease this tension by measuring recent p99 latency and setting the threshold just above it, tight enough to detect genuine failures quickly but loose enough to absorb normal latency variance without false alarms.
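One way to picture an adaptive timeout is the sketch below, which tracks recent latencies and sets the timeout a fixed margin above the observed p99. The class name, the window size, the 1.5x margin, the floor, and the default value are all assumptions for illustration, not recommended settings.

```python
from collections import deque


class AdaptiveTimeout:
    """Timeout derived from the p99 of recently observed latencies (sketch)."""

    def __init__(self, window=1000, margin=1.5, floor=0.05, default=1.0):
        self.samples = deque(maxlen=window)  # keeps only the newest samples
        self.margin = margin                 # how far above p99 to sit
        self.floor = floor                   # never go below this (seconds)
        self.default = default               # used before any data arrives

    def record(self, latency_seconds):
        self.samples.append(latency_seconds)

    def current(self):
        if not self.samples:
            return self.default
        ordered = sorted(self.samples)
        # Nearest-rank p99: smallest sample with at least 99% at or below it.
        rank = (99 * len(ordered) + 99) // 100   # integer ceil(0.99 * n)
        p99 = ordered[rank - 1]
        return max(self.floor, p99 * self.margin)
```

Because the window is bounded, the timeout follows the service as it speeds up or slows down instead of being fixed at deployment time.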
Question 4 True / False
Adding jitter (randomized variation) to exponential backoff helps prevent multiple clients from retrying at exactly the same moment after backing off.
T. True
F. False
Answer: True
Without jitter, if 500 clients all apply the same exponential backoff formula (wait exactly 2 seconds, retry), they all retry at precisely t+2 seconds — creating a synchronized retry burst, the thundering herd problem. Jitter randomizes each client's backoff within a range (e.g., 1.5s–2.5s instead of exactly 2s), spreading retries across time. This desynchronization converts a correlated burst into a smooth arrival rate, giving the recovering server a chance to process retries without being overwhelmed by a coordinated wave.
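The desynchronizing effect can be seen in a small simulation. It is only a sketch: the 500 clients and the 1.5s to 2.5s range come from the explanation above, while the 100 ms bucket size, the seed, and the helper name `peak_per_bucket` are assumptions made for the example.

```python
import random
from collections import Counter


def peak_per_bucket(retry_times, bucket=0.1):
    """Largest number of retries landing in any one `bucket`-second slot."""
    counts = Counter(int(t / bucket) for t in retry_times)
    return max(counts.values())


rng = random.Random(42)   # seeded so the run is repeatable
clients = 500

no_jitter = [2.0 for _ in range(clients)]                   # everyone waits exactly 2s
with_jitter = [rng.uniform(1.5, 2.5) for _ in range(clients)]

print(peak_per_bucket(no_jitter))    # 500: every retry lands in the same slot
print(peak_per_bucket(with_jitter))  # a small fraction of 500 in the busiest slot
```

Without jitter the server sees all 500 retries at once; with jitter the same retries are spread over roughly a second.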
Question 5 Short Answer
Why must retry strategies be paired with idempotent operations, and what would happen to a payment system that retries a non-idempotent charge operation?
Model answer: Retrying a non-idempotent operation applies its effect multiple times. In a payment system, if 'charge $50' is not idempotent and the client retries due to a lost acknowledgment, the server processes two separate $50 charges, billing the customer twice. Idempotency ensures that executing an operation once or many times produces the same final state. Payment systems implement this with idempotency keys: the client generates a unique ID once per logical charge and resends it unchanged on every retry, so the server can detect duplicates and return the original result without reprocessing.
Network communication is fundamentally unreliable — messages can be lost, delayed, or duplicated. Any retry strategy must assume the request may have already been processed. Idempotency is the property that makes retrying safe. Operations like 'set balance to $100' are naturally idempotent; operations like 'deduct $50 from balance' are not and require explicit deduplication. Without idempotency, retries are as dangerous as the failures they're meant to recover from — exchanging one kind of data corruption for another.
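The difference between the two kinds of operation shows up as soon as each is applied twice, as a duplicated or retried message would do. The account dictionary and function names below are hypothetical, chosen only to mirror the 'set balance' and 'deduct' examples above.

```python
def set_balance(account, amount):
    """Naturally idempotent: the result does not depend on prior state."""
    account["balance"] = amount


def deduct(account, amount):
    """Not idempotent: every application changes the state again."""
    account["balance"] -= amount


a = {"balance": 150}
set_balance(a, 100)
set_balance(a, 100)      # duplicate delivery is harmless
print(a["balance"])      # 100

b = {"balance": 150}
deduct(b, 50)
deduct(b, 50)            # duplicate delivery deducts a second time
print(b["balance"])      # 50, although only one deduction was intended
```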