A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Byzantine Fault Tolerance and Practical BFT

Research Depth 64 in the knowledge graph

291 prerequisites beneath it


Core Idea

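Byzantine fault tolerance (BFT) handles nodes that fail arbitrarily, including sending conflicting information to different nodes. Consensus among n nodes tolerating f Byzantine failures requires n > 3f, that is, at least 3f + 1 nodes. Practical BFT (PBFT) uses a primary and backups and orders each request in three phases (pre-prepare, prepare, commit): the primary proposes an ordering in the pre-prepare phase, and the replicas then exchange prepare and commit messages to confirm agreement before executing the request.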

Explainer

From your study of failure models, you know that crash failures are relatively benign: a node either works correctly or stops responding. Byzantine failures are far worse: a faulty node can behave arbitrarily, sending different messages to different peers, lying about its state, or even actively trying to sabotage the system. The name comes from the Byzantine Generals Problem, a thought experiment: imagine several generals surrounding a city, communicating by messenger, who must agree on whether to attack or retreat. Some generals are traitors who may send contradictory messages. The question is: can the loyal generals still reach agreement? The answer is yes, but in the classic setting, where messages cannot be authenticated, only if fewer than one-third of the generals are traitors.

This one-third bound is a proven mathematical result, not a design choice. With n total nodes and f Byzantine-faulty nodes, consensus requires n > 3f. The intuition: a Byzantine node can send "attack" to some peers and "retreat" to others, or send nothing at all. Because f nodes may never respond, an honest node can wait for at most n - f messages before deciding. Up to f of those n - f messages may themselves come from faulty nodes, so only n - 2f are guaranteed to be honest. For the honest messages to outvote the faulty ones, n - 2f > f, which gives n > 3f. With n = 3f, no protocol can guarantee agreement: an honest node cannot distinguish a faulty sender from an honest node relaying what a different faulty node told it. At n = 3f + 1 (e.g., 4 nodes tolerating 1 Byzantine failure), there is just enough redundancy for the honest nodes to outvote the liar.
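The arithmetic behind the bound can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the second function encodes the counting intuition (replies a node can wait for, minus replies that may be faulty), not a proof.

```python
def max_byzantine_faults(n: int) -> int:
    """Largest f satisfying n > 3f, i.e. f = floor((n - 1) / 3)."""
    if n < 1:
        raise ValueError("n must be a positive node count")
    return (n - 1) // 3


def honest_replies_outvote_faulty(n: int, f: int) -> bool:
    """Counting intuition: a node can wait for only n - f replies,
    and up to f of those may come from faulty nodes, so n - 2f are
    guaranteed honest. They must outnumber the f faulty ones."""
    return (n - 2 * f) > f


if __name__ == "__main__":
    for n in (3, 4, 6, 7, 10):
        f = max_byzantine_faults(n)
        print(f"n={n}: tolerates f={f} Byzantine node(s)")
```

With n = 3 and f = 1 the check fails (one guaranteed-honest reply against one possibly faulty reply), while n = 4 and f = 1 passes, matching the 3f + 1 threshold.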

Practical Byzantine Fault Tolerance (PBFT) made BFT usable in real systems. In the normal case the protocol works in three phases. A designated primary node receives the client request, assigns it a sequence number, and broadcasts a pre-prepare message proposing that ordering. Each backup node validates this proposal and broadcasts a prepare message to all other nodes. Once a node holds the pre-prepare plus 2f matching prepare messages from distinct backups (2f + 1 matching messages in total), it knows that no conflicting request can gather a quorum for the same sequence number in the current view, so it broadcasts a commit message. After collecting 2f + 1 matching commit messages (including its own), the node executes the request and replies to the client. The client waits for f + 1 matching replies, which guarantees that at least one came from an honest node. If the primary is faulty (refusing to send pre-prepares or sending conflicting ones), a view change protocol replaces it with the next replica in a fixed rotation.
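The quorum checks in the three phases can be sketched as simple bookkeeping on one replica. The class and function names below are hypothetical, and the sketch assumes a single view with already-authenticated messages; it omits signatures, view changes, checkpoints, and networking, so it illustrates the thresholds only and is not a PBFT implementation.

```python
from collections import Counter, defaultdict


class ReplicaLog:
    """Hypothetical message log for one replica in one view."""

    def __init__(self, f: int):
        self.f = f                        # Byzantine faults to tolerate
        self.pre_prepares = {}            # seq -> digest proposed by the primary
        self.prepares = defaultdict(set)  # (seq, digest) -> backup ids
        self.commits = defaultdict(set)   # (seq, digest) -> replica ids

    def on_pre_prepare(self, seq, digest) -> bool:
        # Reject a second, conflicting proposal for the same sequence number.
        if self.pre_prepares.get(seq, digest) != digest:
            return False
        self.pre_prepares[seq] = digest
        return True

    def on_prepare(self, seq, digest, sender) -> None:
        # Using a set means duplicate messages from one sender count once.
        self.prepares[(seq, digest)].add(sender)

    def on_commit(self, seq, digest, sender) -> None:
        self.commits[(seq, digest)].add(sender)

    def prepared(self, seq, digest) -> bool:
        # Pre-prepare plus 2f matching prepares from distinct backups.
        return (self.pre_prepares.get(seq) == digest
                and len(self.prepares[(seq, digest)]) >= 2 * self.f)

    def committed(self, seq, digest) -> bool:
        # Prepared, plus 2f + 1 matching commits (own commit included).
        return (self.prepared(seq, digest)
                and len(self.commits[(seq, digest)]) >= 2 * self.f + 1)


def client_result(replies: dict, f: int):
    """Return the result reported by at least f + 1 replicas, else None.
    `replies` maps replica id -> result."""
    for result, count in Counter(replies.values()).items():
        if count >= f + 1:
            return result
    return None


# Example with f = 1 (4 replicas: primary 0, backups 1-3), seen from backup 1.
log = ReplicaLog(f=1)
log.on_pre_prepare(seq=1, digest="d1")
log.on_prepare(1, "d1", sender=1)   # its own prepare
log.on_prepare(1, "d1", sender=2)   # second matching prepare: now prepared
for sender in (0, 1, 2):
    log.on_commit(1, "d1", sender)  # third matching commit: now committed
```

Keying the prepare and commit sets by (sequence number, digest) is one way to make sure only matching messages are counted toward a quorum; a conflicting digest for the same sequence number accumulates in a separate set and never reaches the threshold if at most f replicas are faulty.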

The cost of Byzantine tolerance is significant: PBFT requires O(n²) messages per consensus round because every node communicates with every other node in the prepare and commit phases. This limits practical deployments to relatively small clusters — typically tens of nodes, not thousands. For most internal distributed systems where you trust your own hardware and software, crash fault tolerance (like Raft or Paxos, requiring only n > 2f) is sufficient and far cheaper. BFT becomes essential in environments where nodes are controlled by different, potentially adversarial parties — the most prominent example being blockchain networks, where any participant might try to cheat. Understanding when Byzantine tolerance is actually needed versus when crash tolerance suffices is a key architectural judgment in distributed system design.
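To make the quadratic growth concrete, the following sketch counts replica-to-replica messages for one request in the normal case. It assumes point-to-point delivery with no batching or message aggregation and ignores client traffic; the function names are hypothetical.

```python
def pbft_normal_case_messages(n: int) -> dict:
    """Count replica-to-replica messages for one request (assumed model:
    point-to-point sends, no batching, client messages excluded)."""
    if n < 4:
        raise ValueError("tolerating one Byzantine fault needs at least 4 replicas")
    pre_prepare = n - 1            # primary -> each backup
    prepare = (n - 1) * (n - 1)    # each backup -> every other replica
    commit = n * (n - 1)           # each replica -> every other replica
    return {
        "pre_prepare": pre_prepare,
        "prepare": prepare,
        "commit": commit,
        "total": pre_prepare + prepare + commit,
    }


def min_replicas(f: int, byzantine: bool) -> int:
    """Smallest cluster tolerating f faults: 3f + 1 for Byzantine faults
    (n > 3f), 2f + 1 for crash faults (n > 2f)."""
    return 3 * f + 1 if byzantine else 2 * f + 1


if __name__ == "__main__":
    for n in (4, 10, 100):
        print(n, pbft_normal_case_messages(n)["total"])
```

Under these assumptions the total simplifies to 2n(n - 1), so going from 4 to 100 replicas raises the per-request count from 24 to 19,800 messages.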


Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Making 10 as an Addition Strategy → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts Through 10 → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Opposites and Additive Inverses → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Introduction to Exponents → Order of Operations → Integer Order of Operations → Variable Expressions → Variables in Logic → Conditional Statements (If-Then Formal) → Converse, Inverse, and Contrapositive → Biconditional Statements → Biconditional Statements and Equivalence → Conditional and Biconditional Statements → Formal Logic and Propositional Calculus → The Consensus Problem → Byzantine Fault Tolerance and Practical BFT

Longest path: 65 steps · 291 total prerequisite topics

Prerequisites (2)

Failure Models in Distributed Systems (hard)
The Consensus Problem (hard)

Leads To (0)

No topics depend on this one yet.