The Consensus Problem

Graduate Depth 7 in the knowledge graph I know this Set as goal
Unlocks 22 downstream topics
consensus agreement agreement-protocols

Core Idea

Consensus requires all non-faulty processes to decide on a single value, even when some processes fail or propose conflicting values. Consensus must satisfy: agreement (all non-faulty processes decide identically), validity (a decided value was proposed), and termination (all non-faulty processes eventually decide). This foundational problem subsumes many practical coordination challenges.

Explainer

Imagine a group of generals who must unanimously agree on whether to attack or retreat, communicating only by messenger — and some generals may be traitors sending false orders. This is the essence of the consensus problem in distributed systems. The challenge is not just agreement but guaranteed agreement in the presence of failures, and doing so without any central coordinator.

The three properties of consensus — agreement, validity, and termination — each rules out a different trivial cheat. Without agreement, processes could decide different values. Without validity, an algorithm could satisfy agreement by having everyone decide some hardcoded constant regardless of input. Without termination, an algorithm could satisfy the other two by simply never deciding. All three are required simultaneously, and that turns out to be surprisingly hard.

The landmark FLP impossibility theorem (Fischer, Lynch, Paterson, 1985) shows that in a purely asynchronous system — one where message delays have no upper bound — consensus is impossible even if only one process can crash. The intuitive reason: a slow process looks exactly like a crashed one, so there is always some execution where the algorithm cannot determine whether a process has failed or is just delayed, and committing to a decision risks violating agreement in some scenario. Real-world consensus protocols escape this by adding timing assumptions (synchronous or partially synchronous models) or randomization.

The severity of failures also matters. Crash faults (stop-fail) are the gentlest model: failed processes simply go silent. Byzantine faults are the harshest: a faulty process can send contradictory messages to different peers, actively undermining agreement. Algorithms like Paxos and Raft handle crash faults with f < n/2 failures; Byzantine fault-tolerant algorithms require f < n/3, which is why they are reserved for adversarial settings like blockchains.

Understanding the consensus problem is the key to understanding why distributed coordination is hard in practice. Leader election, distributed transactions, and replicated state machines all reduce to consensus. Every time you interact with a system that promises consistency across multiple servers — a distributed database, a payment processor, a coordination service like ZooKeeper — consensus is the invisible foundation making that promise possible.

Practice Questions 3 questions

Prerequisite Chain

Longest path: 8 steps · 12 total prerequisite topics

Prerequisites (3)

Leads To (10)