Understanding what broke in a system of 50 microservices.
In a distributed system, a user clicks a button and the request may pass through five different services. If it's slow, how do you know which one caused the delay? Distributed tracing assigns a unique Trace ID to the request and passes it along with every downstream call, and each service emits a "Span" recording how long its part of the work took.
By plotting these spans on a timeline (a Trace Waterfall), the bottleneck usually stands out as the longest bar. If latency breaches your Service Level Objective (SLO) (e.g., "99% of requests must take < 500ms"), an alert is fired.
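A minimal sketch of how a waterfall and an SLO check might be computed from raw spans. The `Span` fields, the helper names, and the sample timings are illustrative assumptions, not part of any particular tracing library. The self-time calculation also assumes child spans run one after another without overlapping.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Hypothetical span record; real tracers carry more fields.
    name: str
    span_id: str
    parent_id: Optional[str]
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def self_time_ms(span, spans):
    """Time spent in the span itself, excluding its direct children."""
    children = sum(s.duration_ms for s in spans if s.parent_id == span.span_id)
    return span.duration_ms - children


def find_bottleneck(spans):
    """Return the span with the largest self time."""
    return max(spans, key=lambda s: self_time_ms(s, spans))


def render_waterfall(spans, ms_per_char=20):
    """Draw one text bar per span, offset by its start time."""
    lines = []
    for s in sorted(spans, key=lambda s: s.start_ms):
        offset = int(s.start_ms // ms_per_char)
        width = max(1, int(s.duration_ms // ms_per_char))
        lines.append(f"{s.name:<20}{' ' * offset}{'#' * width}")
    return "\n".join(lines)


def meets_slo(latencies_ms, threshold_ms=500.0, target=0.99):
    """True if at least `target` of requests finished under `threshold_ms`."""
    if not latencies_ms:
        return True
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return fast / len(latencies_ms) >= target


# Made-up trace: gateway -> auth -> database
trace = [
    Span("Gateway.Handle", "a", None, 0, 620),
    Span("AuthService.Verify", "b", "a", 10, 600),
    Span("UserDB.Query", "c", "b", 20, 560),
]

print(render_waterfall(trace))
print("bottleneck:", find_bottleneck(trace).name)
```

Ranking by self time rather than total duration matters here: the gateway span is the longest bar in the waterfall, but almost all of its time is spent waiting on its children.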
import uuid

# `tracer`, `verify_token`, and `http` are assumed to be provided by the service's own setup code.
def handle_request(req):
    # 1. Extract the Trace ID passed from the upstream service,
    #    or start a new trace if this service is the entry point
    trace_id = req.headers.get("X-B3-TraceId") or uuid.uuid4().hex
    # 2. Start a Span for this local operation
    with tracer.start_span("AuthService.Verify", trace_id=trace_id) as span:
        # 3. Do the work...
        result = verify_token()
        # 4. Inject the Trace ID into the next downstream call
        headers = {"X-B3-TraceId": trace_id}
        http.get("/downstream", headers=headers)
    # Span auto-closes here, recording start time, end time, and trace_id
    return result
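To make the "span auto-closes" behavior concrete, here is a toy tracer that can be run as is. `InMemoryTracer` and its dictionary-based span are hypothetical stand-ins for a real tracing client, and the handler below takes a plain headers dict rather than a request object. Treat it as a sketch of the mechanism, not a production design.

```python
import time
import uuid
from contextlib import contextmanager


class InMemoryTracer:
    """Toy tracer: keeps finished spans in a list instead of exporting them."""

    def __init__(self):
        self.finished = []

    @contextmanager
    def start_span(self, name, trace_id=None):
        span = {
            "name": name,
            # Reuse the caller's trace ID, or start a new trace if none was given
            "trace_id": trace_id or uuid.uuid4().hex,
            "span_id": uuid.uuid4().hex[:16],
            "start": time.perf_counter(),
        }
        try:
            yield span
        finally:
            # Runs even if the wrapped work raises, so failed calls are still timed
            span["end"] = time.perf_counter()
            self.finished.append(span)


tracer = InMemoryTracer()


def handle_request(headers):
    trace_id = headers.get("X-B3-TraceId")
    with tracer.start_span("AuthService.Verify", trace_id=trace_id) as span:
        time.sleep(0.01)  # stand-in for the real work
        # Headers to send downstream so the next service joins the same trace
        return {"X-B3-TraceId": span["trace_id"]}


outgoing = handle_request({"X-B3-TraceId": "463ac35c9f6413ad48485a3953bb6124"})
recorded = tracer.finished[-1]
print(outgoing, recorded["end"] - recorded["start"])
```

Closing the span in a `finally` block is the key design choice: a request that fails partway through still produces a span, which is often the one you most want to see.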