Questions: Observability, Tracing, and Debugging in Distributed Systems
5 questions to test your understanding
Question 1 Multiple Choice
A distributed order system spans 5 services. A customer reports their order was slow. Each service has detailed structured logs. What critical problem do the logs present without distributed tracing?
A. Each service logs in a different format that cannot be parsed by a single tool
B. The logs are scattered across 5 services with no way to identify which log entries from each service belong to this specific customer's request
C. Logs do not record timing information, so slowdowns cannot be detected
D. The sheer volume of logs from 5 services is too large to query efficiently
Answer: B
Correlating log entries across services is the core problem distributed tracing solves. Each service writes its logs independently, with no shared identifier linking them to a particular request. Without a trace ID propagated across service boundaries, you cannot tell which log lines in Services A, B, C, D, and E belong to the same user request: you have five separate, disconnected log streams rather than one coherent picture of what happened.
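To make the correlation idea concrete, here is a minimal sketch, assuming every service includes a shared `trace_id` field in its structured log records. The record layout, service names, and IDs are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical structured log records collected from several services.
logs = [
    {"service": "gateway", "ts": 1.00, "trace_id": "t-42", "msg": "order received"},
    {"service": "payment", "ts": 1.20, "trace_id": "t-99", "msg": "charge started"},
    {"service": "inventory", "ts": 1.05, "trace_id": "t-42", "msg": "stock reserved"},
    {"service": "payment", "ts": 1.30, "trace_id": "t-42", "msg": "charge started"},
]


def group_by_trace(records):
    """Group log records by trace_id and sort each group by logged timestamp."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["trace_id"]].append(rec)
    return {tid: sorted(recs, key=lambda r: r["ts"]) for tid, recs in grouped.items()}


by_trace = group_by_trace(logs)
# Services that handled the slow request, in logged-timestamp order.
slow_request = [r["service"] for r in by_trace["t-42"]]
```

Without the `trace_id` field there is nothing to group on, which is the situation option B describes. The timestamp ordering inside each group is only approximate, because it depends on clocks that may not agree across machines.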
Question 2 Multiple Choice
Why do most production distributed tracing systems sample only a small fraction of requests (e.g., 1%) rather than tracing every request?
A. The tracing protocol is inherently too slow to process every request at production throughput
B. Trace IDs must be globally unique, and generating unique IDs at 100% rate causes collisions
C. Instrumentation overhead (injecting headers, creating spans, transmitting data) and storage costs at full volume would significantly degrade system performance and economics
D. The happened-before relation only applies meaningfully to a sampled subset of requests
Answer: C
Full tracing at production scale means creating, transmitting, and storing spans for every single request, which can add meaningful latency overhead per request and requires enormous storage infrastructure. Sampling (typically 1–10%) captures enough traces to diagnose most issues while keeping the overhead manageable. The two main strategies are head-based sampling (decided at the entry point, before the request runs) and tail-based sampling (decided after seeing the full trace).
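The two strategies can be sketched roughly as follows. This is an illustrative sketch, not how any particular tracing system implements sampling: the hashing scheme, the span fields (`start_ms`, `end_ms`, `error`), and the latency threshold are all assumptions.

```python
import hashlib


def head_sample(trace_id, rate):
    """Head-based: decide at the entry point, using only the trace ID.

    Hashing the ID makes the decision deterministic, so every service that
    sees the same trace ID would reach the same keep/drop decision.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(rate * 2**64)


def tail_sample(spans, latency_threshold_ms):
    """Tail-based: decide after the trace is complete.

    Keeps the trace if any span reported an error or the whole trace was slow.
    """
    has_error = any(s.get("error", False) for s in spans)
    total_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    return has_error or total_ms >= latency_threshold_ms


# Roughly 1% of 10,000 hypothetical trace IDs are kept by head sampling.
kept = sum(head_sample(f"trace-{i}", 0.01) for i in range(10_000))
```

The trade-off the sketch hints at: head sampling is cheap because dropped requests never produce spans, while tail sampling can keep the interesting traces (errors, slow requests) but has to buffer every span until the trace finishes.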
Question 3 True / False
A fully reconstructed distributed trace encodes a partial ordering of events across services that corresponds to the happened-before relation: span A called service B, which completed before A continued.
T. True
F. False
Answer: True
A trace is a practical implementation of the happened-before relation formalized by Lamport. Each parent-child span relationship encodes causality: the parent span called the child, so the child's start happened-after the parent initiated the call, and the parent's continuation happened-after the child completed. This partial ordering lets engineers reason about which events could have influenced which others — the same conceptual foundation as Lamport clocks and the Chandy-Lamport algorithm.
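As a small illustration of the partial ordering, the sketch below derives a happened-before check from parent-child links alone. The span table and the function names are hypothetical. It deliberately treats sibling spans as unordered, since the tree by itself does not say whether two children of the same parent ran sequentially or concurrently.

```python
# Hypothetical span table: span ID -> parent span ID and operation name.
spans = {
    "a": {"parent": None, "name": "checkout"},
    "b": {"parent": "a", "name": "payment"},
    "c": {"parent": "a", "name": "inventory"},
    "d": {"parent": "b", "name": "fraud-check"},
}


def is_ancestor(spans, ancestor_id, span_id):
    """Return True if ancestor_id appears on span_id's chain of parents."""
    current = spans[span_id]["parent"]
    while current is not None:
        if current == ancestor_id:
            return True
        current = spans[current]["parent"]
    return False


def start_happened_before(spans, x, y):
    """Return True if x's start happened-before y's start, judged by the tree alone."""
    return is_ancestor(spans, x, y)


ordered = start_happened_before(spans, "a", "d")      # ancestor before descendant
unordered = (start_happened_before(spans, "b", "c"),  # siblings: neither direction
             start_happened_before(spans, "c", "b"))
```

Here `checkout` is ordered before `fraud-check`, but `payment` and `inventory` are incomparable, which is what makes the ordering partial rather than total.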
Question 4 True / False
If nearly every service in a distributed system writes detailed, timestamped structured logs, those logs alone are sufficient to reconstruct the causal sequence of events for any specific user request.
TTrue
FFalse
Answer: False
Timestamps alone cannot reconstruct causality in distributed systems because clocks are not perfectly synchronized across machines. More fundamentally, even with perfect timestamps, you cannot determine which log entries from Service A belong to the same request as specific entries from Service B without a shared correlation identifier. Logs tell you what each service did and when; tracing tells you which actions across services were causally connected to the same request.
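A worked example of the clock problem, using made-up numbers: Service A sends a request that Service B receives 5 ms later, but B's clock is assumed to run 20 ms behind A's. Sorting by logged timestamp then puts the receive before the send.

```python
# Hypothetical events with their true (unobservable) times in seconds.
events = [
    {"service": "A", "event": "send request", "true_time": 10.000},
    {"service": "B", "event": "receive request", "true_time": 10.005},
]

# Assumed clock offsets: B's clock runs 20 ms behind A's.
clock_offset = {"A": 0.000, "B": -0.020}

for e in events:
    # Each service stamps the log entry with its own local clock.
    e["logged_ts"] = e["true_time"] + clock_offset[e["service"]]

true_order = [e["event"] for e in sorted(events, key=lambda e: e["true_time"])]
logged_order = [e["event"] for e in sorted(events, key=lambda e: e["logged_ts"])]
```

`true_order` has the send first, while `logged_order` has the receive first, so the logs alone suggest an effect preceding its cause. An explicit parent-child link between the two events does not depend on either clock.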
Question 5 Short Answer
Explain why a trace ID must be actively propagated through every downstream service call, and what breaks in the trace if even one service in the chain fails to pass it along.
Model answer: Each service must extract the trace ID from its incoming request (e.g., from an HTTP header) and inject it into every outgoing call it makes. If a service is not instrumented and fails to propagate the trace ID, the downstream services either generate a new, unrelated trace ID or produce no trace context at all. The trace breaks at that point: the upstream and downstream portions appear as two unrelated traces with no causal connection. The engineer sees the request 'disappear' mid-journey and cannot diagnose what happened in the uninstrumented service or attribute downstream latency to upstream causes.
This is why full instrumentation coverage matters. Partial instrumentation creates invisible gaps in the causality chain — precisely the blind spots that make distributed debugging hard in the first place. Tools like OpenTelemetry provide standardized libraries for automatic context propagation to reduce the instrumentation burden per service.
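The extract-and-inject pattern from the model answer, and the way one gap splits a trace, can be sketched with plain dictionaries standing in for HTTP headers. The header name `x-trace-id` and all function names are hypothetical. Real systems typically rely on a standard header such as W3C Trace Context's `traceparent` and on library support rather than hand-written code like this.

```python
import uuid

TRACE_HEADER = "x-trace-id"  # hypothetical header name used only in this sketch


def extract_trace_id(incoming_headers):
    """Reuse the caller's trace ID if present; otherwise start a new trace."""
    return incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex


def inject_trace_id(trace_id, outgoing_headers=None):
    """Return a copy of the outgoing headers with the trace ID attached."""
    headers = dict(outgoing_headers or {})
    headers[TRACE_HEADER] = trace_id
    return headers


def instrumented_service(incoming_headers):
    """Extract the trace ID and pass it along on the downstream call."""
    trace_id = extract_trace_id(incoming_headers)
    return inject_trace_id(trace_id)


def uninstrumented_service(incoming_headers):
    """Make the downstream call but drop the trace header."""
    return {}


entry = inject_trace_id("t-42")
after_ok = instrumented_service(entry)
after_gap = instrumented_service(uninstrumented_service(after_ok))
```

After the instrumented hop the ID is still `t-42`. After the uninstrumented hop the next service finds no header and starts a fresh trace, so the upstream and downstream halves can no longer be joined.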