Questions — Failure Detection with Heartbeats

Question 1 Multiple Choice

A team sets their heartbeat timeout to 100ms to detect node failures as quickly as possible. Their operations team starts receiving many 'node down' alerts that resolve within seconds. What is the most likely root cause?

A. The nodes are actually crashing and recovering; 100ms is appropriate for detecting this

B. The timeout is too short: brief network congestion or processing spikes cause heartbeats to arrive late, triggering false positives

C. The heartbeat interval itself should be shortened to match the timeout

D. Gossip-based heartbeats would eliminate this problem entirely without any tradeoff
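
To experiment with this scenario, the following is a minimal simulation sketch, not a model of any real system. The function name `count_late_heartbeats`, the 50 ms send interval, and the delay distribution (1 to 5 ms base latency, with 1% of heartbeats delayed by an extra 50 to 300 ms) are all assumptions chosen for illustration. It counts how often the gap between consecutive heartbeat arrivals exceeds a given timeout.

```python
import random


def count_late_heartbeats(timeout_ms, interval_ms=50.0, n=10_000, seed=0):
    """Count gaps between consecutive heartbeat arrivals that exceed timeout_ms.

    Assumed (hypothetical) delay model: the sender never crashes, emits one
    heartbeat every interval_ms, and each heartbeat is delayed by a small base
    latency plus an occasional spike.
    """
    rng = random.Random(seed)
    arrivals = []
    for i in range(n):
        delay = rng.uniform(1.0, 5.0)          # ordinary network latency
        if rng.random() < 0.01:                # about 1% of heartbeats hit a spike
            delay += rng.uniform(50.0, 300.0)  # congestion or a processing pause
        arrivals.append(i * interval_ms + delay)
    arrivals.sort()  # a delayed heartbeat can arrive after a later one
    gaps = (b - a for a, b in zip(arrivals, arrivals[1:]))
    return sum(1 for gap in gaps if gap > timeout_ms)


if __name__ == "__main__":
    for timeout in (100.0, 200.0, 400.0):
        print(f"timeout={timeout:.0f}ms alerts={count_late_heartbeats(timeout)}")
```

Because the simulated sender never crashes, every counted gap would be an alert about a healthy node. Comparing the counts across timeouts shows how the choice of timeout interacts with the delay distribution.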

Question 2 Multiple Choice

A node in a distributed system hasn't received a heartbeat from a peer for three times the normal interval. The network is experiencing unusual congestion. What can the observing node definitively conclude?

A. The peer has crashed and should be marked as failed immediately

B. The peer is experiencing a Byzantine failure and may be sending corrupt messages

C. Nothing definitive: in an asynchronous network, missing heartbeats cannot distinguish a crashed node from a very slow one

D. The peer has definitely not crashed but needs its heartbeat interval increased

Question 3 True / False

In a fully synchronous network where message delivery time is bounded by a known maximum delay δ, it is theoretically possible to build a perfect failure detector.

T. True

F. False

Question 4 True / False

Heartbeat-based failure detection can definitively identify whether a node has crashed or is merely slow, as long as the timeout is calibrated correctly.

T. True

F. False

Question 5 Short Answer

Why do systems like Cassandra use phi accrual failure detectors rather than simple binary timeout-based heartbeats?
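
As background for this question, a phi accrual detector reports a continuous suspicion level instead of a yes/no verdict. The sketch below is a simplified illustration, not Cassandra's implementation: the class name `PhiAccrualSketch` is hypothetical, and it assumes heartbeat inter-arrival times follow an exponential distribution, so that phi = -log10(P(gap > elapsed)) reduces to elapsed / (mean * ln 10).

```python
import math
from collections import deque


class PhiAccrualSketch:
    """Simplified phi accrual detector (illustrative; exponential model assumed)."""

    def __init__(self, window_size=1000):
        # Sliding window of recent inter-arrival times, in seconds.
        self.intervals = deque(maxlen=window_size)
        self.last_arrival = None

    def heartbeat(self, now):
        """Record a heartbeat that arrived at time `now`."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level: grows the longer the peer stays silent."""
        if not self.intervals:
            return 0.0  # not enough history to estimate anything
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0  # degenerate history; avoid dividing by zero
        elapsed = now - self.last_arrival
        # P(gap > elapsed) = exp(-elapsed / mean) under the exponential model,
        # so -log10 of that probability is elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))


if __name__ == "__main__":
    detector = PhiAccrualSketch()
    for t in (0.0, 1.0, 2.0, 3.0):
        detector.heartbeat(t)
    for now in (3.5, 5.0, 10.0):
        print(f"t={now}: phi={detector.phi(now):.2f}")
```

Under this model, a phi of 1 means the observed silence would occur about 10% of the time for a healthy peer, a phi of 2 about 1%, and so on. The caller picks a threshold on phi, and because the mean is estimated from recent arrivals, the same threshold corresponds to a longer wait on a slow network and a shorter one on a fast network.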

