Machine A sends a request to Machine B in a distributed system and receives no reply after a timeout. Which situations can A NOT distinguish between based solely on the absence of a reply?
A. B is running the correct software version vs. an outdated version
B. A sent the request once vs. A sent it twice
C. B never received the message, B received it and crashed while processing, and B processed it and replied but the reply was lost
D. The network is slow vs. the network is partitioned
A timeout is ambiguous by design — it only tells you 'no reply arrived in time,' which is consistent with at least three distinct failure modes: (1) the message was never delivered, (2) B received it and crashed mid-processing, (3) B processed it successfully and sent a reply that was lost. These have very different implications: case 3 means retrying would process the request twice, which can corrupt state (double-charging, duplicate records). This ambiguity is not an edge case — it is the fundamental communication challenge of distributed systems, motivating idempotency and exactly-once semantics in protocol design.
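The retry hazard in case 3 can be made concrete with a minimal sketch of an idempotency key. Everything here is hypothetical and in-process: the `Server` class, the `charge` method, and the `send_with_retry` helper are illustrative names, and the lost reply is simulated by discarding the first response rather than by a real network.

```python
import uuid


class Server:
    """Hypothetical Machine B that deduplicates requests by idempotency key."""

    def __init__(self):
        self.balance = 100
        self.seen = {}  # idempotency key -> cached result

    def charge(self, key, amount):
        # A repeated key means this request was already processed:
        # return the cached result instead of charging again.
        if key in self.seen:
            return self.seen[key]
        self.balance -= amount
        result = {"status": "ok", "balance": self.balance}
        self.seen[key] = result
        return result


def send_with_retry(server, amount, lose_first_reply=True, max_attempts=3):
    """Hypothetical Machine A: retries on timeout, reusing one key for all attempts."""
    key = str(uuid.uuid4())
    for attempt in range(max_attempts):
        reply = server.charge(key, amount)  # B processes (or deduplicates) the request
        if lose_first_reply and attempt == 0:
            continue  # simulate case 3: the reply is lost, A only sees a timeout
        return reply
    raise TimeoutError("no reply after retries")


server = Server()
reply = send_with_retry(server, 30)
print(reply["balance"], server.balance)  # 70 70: charged once despite the retry
```

Because A cannot tell which of the three cases occurred, it retries in all of them; the key is what makes the retry safe when B had in fact already processed the request.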
Question 2 Multiple Choice
A software architect proposes: 'We can avoid all distributed systems complexity by using a sufficiently powerful single machine.' When does this reasoning hold, and when does it break down?
A. It never holds: all modern applications require distribution regardless of scale
B. It always holds: distributed systems are only used for academic research and theoretical study
C. It holds when the workload fits on one machine, but breaks down when scale, fault tolerance, or geographic distribution genuinely require multiple nodes
D. It breaks down only when the application handles more than one million users simultaneously
Distribution adds real complexity — unreliable networks, no global clock, partial failures. A monolith on a powerful single machine is often simpler, faster to develop, and easier to reason about. The architect's reasoning holds when a single machine is sufficient. It breaks down when requirements genuinely exceed one machine: workloads too large for one node's memory or compute, fault tolerance requirements that demand eliminating single points of failure, or geographic distribution for latency. The decision to distribute should be driven by genuine need, not fashion or premature optimization.
Question 3 True / False
In a distributed system, even if all nodes are operating correctly, events on different machines cannot be reliably ordered by wall-clock timestamps alone.
T. True
F. False
Answer: True
Each machine has its own hardware clock, and clocks drift at different rates and can be set differently. Even with NTP synchronization, clocks can disagree by milliseconds to seconds. Two machines might assign timestamps showing that event X on Machine A happened before event Y on Machine B, when the actual causal order was the reverse. This is not a technical failure — it is the fundamental absence of a global clock in distributed systems. Solving ordering requires logical clocks (Lamport timestamps) or vector clocks that track causal relationships rather than relying on physical time.
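As a rough illustration of the logical-clock idea, here is a minimal sketch of Lamport timestamps. The `LamportClock` class and its method names are assumptions made for this example, not a particular library's API. The property it shows is one-directional: if event X causally precedes event Y, then X's timestamp is smaller than Y's. The converse does not hold, which is the gap vector clocks close.

```python
class LamportClock:
    """Hypothetical per-machine logical clock (a counter, not wall-clock time)."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Any local event advances the counter by one.
        self.time += 1
        return self.time

    def send(self):
        # Sending is an event; the resulting timestamp travels with the message.
        return self.tick()

    def receive(self, msg_time):
        # Jump past the sender's timestamp so the receive event
        # is always ordered after the send event.
        self.time = max(self.time, msg_time) + 1
        return self.time


a = LamportClock()
b = LamportClock()

# Machine A has done more local work, so its counter is far ahead of B's.
for _ in range(5):
    a.tick()

sent = a.send()            # event X on Machine A
received = b.receive(sent) # event Y on Machine B, caused by X

print(sent, received)  # 6 7: Y is ordered after X even though B's counter was behind
```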
Question 4 True / False
Partial failure in distributed systems — where some nodes fail while others continue — is an uncommon edge case that can be handled with standard exception handling in application code.
T. True
F. False
Answer: False
Partial failure is the *defining* challenge of distributed systems, not an edge case. On a single machine, failure is total — the machine either works or it doesn't. In a distributed system, some nodes fail while others continue, and the surviving nodes must decide how to respond. A three-node database with one crashed node, one working, and one returning stale data cannot be handled with a try-catch block — it requires explicit protocols for consistency, replication, and fault tolerance. This is why distributed systems engineering is a distinct discipline: it is primarily the art of building reliable systems from unreliable parts.
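The three-node scenario above can be sketched with a simple majority (quorum) read. This is a toy model under stated assumptions: `Replica` and `quorum_read` are hypothetical names, each replica stores a version number alongside its value, and writes are assumed to have reached a majority of replicas so that any majority read overlaps with the latest write.

```python
class Replica:
    """Hypothetical replica holding a (version, value) pair."""

    def __init__(self, version, value, up=True):
        self.version = version
        self.value = value
        self.up = up

    def read(self):
        if not self.up:
            raise ConnectionError("replica unreachable")
        return self.version, self.value


def quorum_read(replicas):
    """Return the newest (version, value) seen by a majority of replicas."""
    quorum = len(replicas) // 2 + 1
    responses = []
    for replica in replicas:
        try:
            responses.append(replica.read())
        except ConnectionError:
            continue  # a crashed node just reduces the number of responses
    if len(responses) < quorum:
        raise RuntimeError("quorum not reached")
    # The highest version wins, so a stale replica cannot override a fresh one.
    return max(responses, key=lambda response: response[0])


replicas = [
    Replica(2, "new", up=False),  # crashed
    Replica(2, "new"),            # working
    Replica(1, "old"),            # reachable but stale
]
version, value = quorum_read(replicas)
print(version, value)  # 2 new
```

The `except` clause only notices that a replica is unreachable. The majority rule and the version comparison are what decide which answer to trust, and those are protocol-level choices rather than exception handling.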
Question 5 Short Answer
Why is partial failure considered the most distinctive challenge of distributed systems compared to single-machine programming?
Model answer: On a single machine, failure is binary — the machine works or it doesn't. In a distributed system, components fail independently: some nodes crash while others run correctly; some networks partition while others stay up; some nodes return stale data rather than failing cleanly. Surviving nodes must continue operating usefully despite not knowing the state of failed components, and they cannot even reliably determine whether a remote node has failed or is just slow. This uncertainty — compounded by message loss ambiguity and the absence of a global clock — is qualitatively different from any challenge in single-machine programming.
The key insight is that distributed failure is *partial* and *ambiguous* in ways that single-machine failure is not. A crashed thread typically surfaces as an exception or a process exit; a crashed remote node produces only silence, a timeout, or a stale response. Single-machine code can trust that if a function returns a value, it ran to completion. Distributed code has no equivalent guarantee: a missing reply does not mean the remote call did not run. Every component must be designed to tolerate the failure of components it depends on, which requires explicit choices about consistency vs. availability tradeoffs, the territory mapped by the CAP theorem and related results.
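The difficulty of telling a failed node from a slow one can be sketched with a timeout-based failure detector. The `FailureDetector` class is a hypothetical illustration, and time is passed in explicitly as `now` so the example is deterministic. Note that it reports "suspected" rather than "dead", because silence alone cannot prove a crash.

```python
class FailureDetector:
    """Hypothetical heartbeat tracker: silence beyond `timeout` raises suspicion."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node name -> time of last heartbeat

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def status(self, node, now):
        last = self.last_seen.get(node)
        if last is None or now - last > self.timeout:
            # The node may have crashed, or its heartbeat may just be delayed.
            return "suspected"
        return "alive"


fd = FailureDetector(timeout=5.0)
fd.heartbeat("c", now=0.0)

status_at_6 = fd.status("c", now=6.0)  # no heartbeat for 6s: suspected
fd.heartbeat("c", now=7.0)             # the node was only slow
status_at_8 = fd.status("c", now=8.0)  # suspicion is withdrawn

print(status_at_6, status_at_8)  # suspected alive
```

Any action taken at time 6, such as failing over to another node, would have been based on a wrong guess. Choosing the timeout trades slower detection against more false suspicions.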