Observability systems

Understanding what broke in a system of 50 microservices.

The idea

In a distributed system, a user clicks a button and the request bounces through five different services. If it is slow, how do you know which one caused the delay? Distributed tracing assigns a unique Trace ID to the request, and every service that handles it emits a "Span" that carries that ID and records when its work started and how long it took.
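To make the span idea concrete, here is a minimal sketch of what one span record might hold. The `Span` class, the `record_span` helper, and the in-memory `EMITTED` list are illustrative names invented for this example, not the API of any tracing library; a real tracer would export spans to a collector instead of a list.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass

@dataclass
class Span:
    trace_id: str            # shared by every span in the same request
    span_id: str             # unique to this one operation
    name: str                # e.g. "AuthService.Verify"
    start_s: float           # start time from a monotonic clock, in seconds
    duration_s: float = 0.0  # filled in when the operation finishes

EMITTED = []  # stand-in for a span exporter (assumption for this sketch)

@contextmanager
def record_span(name, trace_id):
    """Time the enclosed block and emit one Span when it finishes."""
    span = Span(trace_id, uuid.uuid4().hex[:16], name, time.perf_counter())
    try:
        yield span
    finally:
        span.duration_s = time.perf_counter() - span.start_s
        EMITTED.append(span)

# One request, two operations, same Trace ID
trace_id = uuid.uuid4().hex
with record_span("Gateway.Route", trace_id):
    with record_span("AuthService.Verify", trace_id):
        time.sleep(0.01)  # simulated work

for s in EMITTED:
    print(f"{s.trace_id[:8]}  {s.name:<20} {s.duration_s * 1000:.1f} ms")
```

The inner span is emitted first because it finishes first; grouping by `trace_id` is what lets a backend reassemble the full request afterwards.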

By plotting these spans on a timeline (a Trace Waterfall), you can see at a glance which span accounts for most of the latency. If latency breaches your Service Level Objective (SLO), e.g., "99% of requests must take < 500ms", an alert fires.
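The waterfall and the SLO check can both be sketched in a few lines. The span timings, service names, and latency samples below are made up for illustration, and the "bottleneck" is simply taken to be the longest non-root span, which is a simplification. Note that an SLO like the one quoted is evaluated over many requests, not over a single trace.

```python
# Hypothetical spans for one trace: (service, start_ms, duration_ms).
# The first entry is the root span covering the whole request.
SPANS = [
    ("gateway",     0, 450),
    ("auth",       10,  40),
    ("cart",       55,  60),
    ("inventory", 120, 300),
    ("payment",   425,  20),
]
SLO_MS = 500       # "99% of requests must take < 500ms"
SLO_TARGET = 0.99

def render_waterfall(spans, ms_per_char=10):
    """Return one text row per span, indented by its start time."""
    rows = []
    for name, start, dur in spans:
        bar = " " * (start // ms_per_char) + "#" * max(1, dur // ms_per_char)
        rows.append(f"{name:<10}|{bar} {dur}ms")
    return rows

def bottleneck(spans):
    """Name of the longest span, skipping the root span at index 0."""
    return max(spans[1:], key=lambda s: s[2])[0]

def slo_compliance(latencies_ms, threshold_ms=SLO_MS):
    """Fraction of requests that finished under the threshold."""
    ok = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return ok / len(latencies_ms)

print("\n".join(render_waterfall(SPANS)))
print("bottleneck:", bottleneck(SPANS))

# 100 hypothetical request latencies: 97 fast, 3 slow
latencies = [200] * 97 + [650] * 3
ratio = slo_compliance(latencies)
if ratio < SLO_TARGET:
    print(f"ALERT: only {ratio:.0%} of requests finished under {SLO_MS}ms")
```

With these numbers the `inventory` span stands out in the waterfall, and the alert fires because only 97 of 100 requests met the threshold.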


How it works (Trace Context Propagation)

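import uuid

# `tracer`, `http`, and `verify_token` are placeholders for your tracing
# library, HTTP client, and business logic.

def handle_request(req):
    # 1. Extract the Trace ID passed from the upstream service.
    #    On the first hop there is no header yet, so start a new trace.
    trace_id = req.headers.get("X-B3-TraceId") or uuid.uuid4().hex

    # 2. Start a Span for this local operation
    with tracer.start_span("AuthService.Verify", trace_id=trace_id) as span:

        # 3. Do the work...
        result = verify_token()

        # 4. Inject the Trace ID into the next downstream call
        headers = {"X-B3-TraceId": trace_id}
        http.get("/downstream", headers=headers)

    # The span closes when the `with` block exits, recording start time,
    # end time, and trace_id
    return result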
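The handler above forwards only the Trace ID. To rebuild the parent/child shape of a waterfall, services usually forward a span ID as well; in the B3 header convention these travel as `X-B3-SpanId` and `X-B3-ParentSpanId`. The following is a self-contained sketch with no network and no tracing library. The helpers `extract_context`, `inject_context`, and `handle`, and the two simulated services, are hypothetical names, and the sketch simplifies real B3 behaviour by giving each service a single span rather than separate client and server spans.

```python
import secrets

def new_id(nbytes=8):
    """Random lower-hex ID: 8 bytes gives 16 hex chars, 16 bytes gives 32."""
    return secrets.token_hex(nbytes)

def extract_context(headers):
    """Read B3 headers from an incoming request; start a new trace if absent."""
    trace_id = headers.get("X-B3-TraceId") or new_id(16)
    parent_span_id = headers.get("X-B3-SpanId")  # the caller's span, may be None
    return trace_id, parent_span_id

def inject_context(trace_id, span_id, parent_span_id):
    """Build the B3 headers for an outgoing call."""
    headers = {"X-B3-TraceId": trace_id, "X-B3-SpanId": span_id}
    if parent_span_id:
        headers["X-B3-ParentSpanId"] = parent_span_id
    return headers

SPANS = []  # in-memory stand-in for a span exporter

def handle(service_name, incoming_headers, downstream=None):
    """Record a span for this service, then call the next service if any."""
    trace_id, parent_span_id = extract_context(incoming_headers)
    span_id = new_id()  # this service's own span
    SPANS.append({"service": service_name, "trace_id": trace_id,
                  "span_id": span_id, "parent_id": parent_span_id})
    if downstream is not None:
        downstream(inject_context(trace_id, span_id, parent_span_id))

def inventory(headers):
    handle("inventory", headers)

def auth(headers):
    handle("auth", headers, downstream=inventory)

auth({})  # the edge request arrives with no trace headers

for s in SPANS:
    print(s["service"], s["trace_id"], s["span_id"], "parent:", s["parent_id"])
```

Both spans print the same trace ID, and the `inventory` span's parent is the `auth` span's ID, which is the link a tracing backend needs to nest one bar under the other.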