5 questions to test your understanding
A team sets their heartbeat timeout to 100ms to detect node failures as quickly as possible. Their operations team starts receiving many 'node down' alerts that resolve within seconds. What is the most likely root cause?
A node in a distributed system hasn't received a heartbeat from a peer for three times the normal interval. The network is experiencing unusual congestion. What can the observing node definitively conclude?
True or false: In a fully synchronous network where message delivery time is bounded by a known maximum delay δ, it is theoretically possible to build a perfect failure detector.
True or false: Heartbeat-based failure detection can definitively identify whether a node has crashed or is merely slow, as long as the timeout is calibrated correctly.
Why do systems like Cassandra use phi accrual failure detectors rather than simple binary timeout-based heartbeats?
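As a reference point for the last question, the sketch below shows roughly what an accrual-style detector computes. It is a minimal illustration, not Cassandra's implementation: the class name `PhiAccrualDetector`, the window size, and the timestamps are all hypothetical, and it assumes heartbeat inter-arrival times follow a simple exponential model. Under that assumption, the suspicion level is phi = -log10(P(next heartbeat arrives later than the time already elapsed)).

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Toy accrual detector (hypothetical; assumes exponential inter-arrival times)."""

    def __init__(self, window_size=100):
        # Sliding window of the most recent heartbeat inter-arrival times.
        self.intervals = deque(maxlen=window_size)
        self.last_heartbeat = None

    def heartbeat(self, now):
        """Record a heartbeat received at time `now` (seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        """Return a continuous suspicion level instead of an up/down verdict."""
        if not self.intervals:
            return 0.0  # not enough history to suspect anything
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_heartbeat
        # Exponential model: P(later) = exp(-elapsed / mean),
        # so phi = -log10(P(later)) = elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))


detector = PhiAccrualDetector()
for t in range(10):
    detector.heartbeat(float(t))  # heartbeats at t = 0..9, mean interval 1.0

phi_short = detector.phi(9.5)   # half an interval of silence: low suspicion
phi_long = detector.phi(29.0)   # twenty intervals of silence: high suspicion
```

Because the output is a number rather than a verdict, each consumer can pick its own threshold, and the mean adapts as observed intervals lengthen, for example under network congestion.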