On-call · Hard · oc-g217

Subject: Metastable failure · Level: Senior–Staff · ~45 min · Common in: Reliability & on-call interviews · Industries: Technology, Software development

Question

At 12:00 your gRPC service's goodput (successful RPS) collapses to near zero and stays there even after you scale the service 3x. The dashboards show: incoming connection count is pinned at the limit; clients have a 1s timeout and immediately reconnect and retry on failure; the server spends most of its time on TLS handshakes and request setup for connections whose deadline expires before the work finishes; CPU is at 100% but useful completions are roughly zero. A brief upstream slowdown at 11:58 kicked it off and has since cleared. How do you triage and mitigate?
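To see why tripling capacity may not help, here is a minimal toy model, not a description of the actual service. It assumes a saturated FIFO server where every attempt costs the same CPU time, and it treats an attempt as wasted if its queueing delay exceeds the client timeout. The function name `goodput_rps` and all numbers (10,000 in-flight attempts, 100 workers, 50 ms per attempt) are hypothetical; only the 1s timeout and the 3x scale-up come from the question.

```python
def goodput_rps(in_flight: int, workers: int, cost_ms: int, timeout_ms: int) -> float:
    """Toy model: successful requests/second for a saturated FIFO server.

    Assumptions (illustrative only): every attempt costs `cost_ms` of worker
    time (handshake + setup + work), and an attempt only counts as goodput
    if it finishes before the client's timeout.
    """
    # Maximum attempts/second the fleet can process.
    capacity_rps = workers * 1000 / cost_ms
    # Little's law: time in system = items in system / throughput.
    wait_ms = in_flight / capacity_rps * 1000
    # If attempts wait longer than the timeout, the client has already
    # given up and retried, so all the work done is wasted.
    return capacity_rps if wait_ms <= timeout_ms else 0.0


# Hypothetical numbers: 10,000 pinned connections, 50 ms per attempt, 1 s timeout.
print(goodput_rps(10_000, 100, 50, 1000))  # 0.0    (wait is 5 s)
print(goodput_rps(10_000, 300, 50, 1000))  # 0.0    (3x scale: wait is still ~1.67 s)
print(goodput_rps(1_500, 100, 50, 1000))   # 2000.0 (in-flight capped: wait is 0.75 s)
```

Under these assumptions, the original fleet recovers once in-flight work is capped below capacity times timeout, while a 3x scale-up alone does not clear the backlog. That is the usual signature of a metastable state: the trigger is gone, but the retry load sustains the failure.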

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
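One way "stop the bleeding" could look in code is server-side admission control: reject requests that cannot finish before their deadline and cap concurrent work, so the remaining CPU goes to requests that can still succeed. The sketch below is framework-agnostic and is not gRPC's API; `AdmissionController`, `try_admit`, `release`, and the thresholds are hypothetical names chosen for illustration, and deadlines are assumed to be absolute values on the same monotonic clock as the server.

```python
import threading
import time


class AdmissionController:
    """Sketch: shed doomed or excess requests before doing expensive work."""

    def __init__(self, max_in_flight, min_budget_s, clock=time.monotonic):
        self._max_in_flight = max_in_flight  # concurrency cap (assumed tunable)
        self._min_budget_s = min_budget_s    # least time a request needs to finish
        self._clock = clock                  # injectable for tests
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_admit(self, deadline):
        """Return True if the request should be processed, False to reject fast."""
        # Drop requests whose remaining deadline is too short to be useful.
        if deadline - self._clock() < self._min_budget_s:
            return False
        with self._lock:
            # Shed load once the concurrency cap is reached.
            if self._in_flight >= self._max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self):
        """Call once when an admitted request finishes, success or failure."""
        with self._lock:
            self._in_flight -= 1
```

A handler would call `try_admit` before any expensive setup, return a fast rejection on `False`, and call `release` in a `finally` block. This only covers the server side; a complete answer would also address the immediate client retries (for example backoff with jitter or a retry budget).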

Learn the concepts

Diagram & narrate the incident

