State Machine Replication

Research Depth 70 in the knowledge graph I know this Set as goal
Unlocks 4 downstream topics
replication state-machine fault-tolerance

Core Idea

State machine replication replicates a deterministic service by using consensus to agree on a command sequence. All replicas execute identical commands in identical order, producing identical outputs. If f replicas fail, the system survives using consensus for f < n/2. SMR achieves linearizability by having consensus order all operations.

Explainer

You already understand the consensus problem: getting a group of nodes to agree on a single value despite failures. State machine replication (SMR) takes that idea and applies it repeatedly, turning consensus from a one-shot agreement into a continuous mechanism for keeping multiple copies of a service perfectly synchronized.

The core principle is deceptively simple. A deterministic state machine is any system where, given the same starting state and the same sequence of inputs, you always get the same ending state and outputs. A key-value store is a good example: if you start with an empty store and apply "SET x=1" then "SET y=2" then "DELETE x," every copy that processes those commands in that exact order will end up in the same state — just {y: 2}. SMR exploits this by running the same state machine on multiple nodes and using consensus to ensure every node processes the same commands in the same order. If the state machine is deterministic and the input sequence is identical, the replicas are guaranteed to stay in lockstep.

In practice, SMR works by assigning each client request a log position (slot number) through consensus. A client sends a request — say, "SET x=5" — to the system. The replicas run a consensus protocol (like Paxos or Raft) to agree that this request occupies slot 47 in the shared log. Once consensus is reached, every replica executes the command at slot 47 and moves on to slot 48. The log is the single source of truth for ordering. Even if messages arrive at different replicas in different orders, consensus ensures they all agree on what goes in each slot. This is why the consensus prerequisite is essential — without it, you cannot build the ordered log that SMR depends on.

The fault tolerance guarantee follows directly. If you have 2f + 1 replicas, up to f can crash and the system continues operating. The surviving f + 1 nodes still form a majority, so consensus can still make progress. When a crashed replica recovers, it simply replays the log from where it left off, reapplying each command in order until it catches up with the others. Because the state machine is deterministic, replay produces the exact same state as if the replica had never crashed. This combination — consensus for ordering, determinism for consistency, and log replay for recovery — is what makes SMR the foundational technique behind virtually every strongly consistent replicated system, from replicated databases to distributed lock services like Chubby and ZooKeeper.

Practice Questions 5 questions

Prerequisite Chain

Counting to 10Counting to 20Understanding ZeroThe Number ZeroCounting to FiveOne-to-One CorrespondenceCombining Small Groups Within 5Addition Within 10Addition Within 20Two-Digit Addition Without RegroupingTwo-Digit Addition with RegroupingAddition Within 100Repeated Addition as MultiplicationMultiplication Facts Within 100Division as Equal SharingDivision as Grouping (Measurement Division)Division: Grouping (Repeated Subtraction) ModelDivision: Fair Sharing ModelDivision as Equal SharingDivision as GroupingBasic Division FactsDivision Facts Within 100Two-Digit by One-Digit DivisionDivision with RemaindersRemainders and Quotients in DivisionDivision Word ProblemsIntroduction to Long DivisionFactors and MultiplesPrime and Composite NumbersEquivalent FractionsRelating Fractions and DecimalsDecimal Place ValueReading and Writing DecimalsComparing and Ordering DecimalsAdding and Subtracting DecimalsMultiplying DecimalsDividing DecimalsDividing FractionsMixed Number ArithmeticOrder of OperationsOperators and ExpressionsArithmetic Operators and Operator PrecedenceComparison Operators and Boolean TestsLogical Operators and Boolean AlgebraBoolean Algebra and Fundamental LawsCombinational Circuit DesignFlip-Flops and LatchesBinary Counters: Design and AnalysisBinary ArithmeticFixed-Point Number RepresentationTwo's Complement RepresentationOverflow and Underflow DetectionBinary Adders: Half-Adders and Full-AddersFull Adder and Carry PropagationCarry Lookahead Adder DesignHalf Adder Circuit DesignMultiplication Circuit DesignSequential Circuit DesignRegisters and Register FilesInstruction Set Architecture (ISA)Kernel Architecture and OS StructureSystem Calls and User/Kernel ModeProcesses and the Process Control BlockLogical Clocks and Event OrderingVector Clocks and Capturing CausalityHappened-Before Relation and Causal OrderingConsistency Models in Distributed SystemsRead-After-Write ConsistencySequential ConsistencyLinearizabilityState Machine Replication

Longest path: 71 steps · 258 total prerequisite topics

Prerequisites (3)

Leads To (3)