On-call | Hard | oc-g277

Subject: Config change | Level: Senior–Staff | ~35 min | Common in Reliability & on-call interviews | Industries: Technology, Software development

Question

A config push lowers the client-side request timeout on service A (calling downstream B) from 5s to 800ms, intending to "fail faster." B's p99 is normally 600ms. After the push, A's error rate jumps, B's CPU and request rate both climb sharply, B's p99 balloons past 2s, and the whole A→B path degrades into near-total failure even though B was healthy before. The dashboards show that B's *inbound* request rate is ~3x the rate A's users actually generate, and that A's retry counter is spiking. A retries failed or timed-out calls up to 3 times with no backoff. How do you triage and mitigate?
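The gap between B's inbound rate and the user-generated rate is consistent with retry amplification. The following is a minimal sketch of that arithmetic, not a model of the real system. It assumes each attempt fails independently with the same probability, and that "up to 3x" means three retries after the initial attempt. Both are assumptions the question does not confirm, and the function name `expected_attempts` is illustrative.

```python
def expected_attempts(p_fail: float, max_retries: int = 3) -> float:
    """Expected calls to B per user request when A retries immediately.

    Assumption: every attempt fails independently with probability p_fail.
    Attempt k (0-indexed) is made only if the previous k attempts all
    failed, so it contributes p_fail**k to the expectation.
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be in [0, 1]")
    return sum(p_fail ** k for k in range(max_retries + 1))


if __name__ == "__main__":
    # Sweep a few per-attempt failure rates to see the load multiplier on B.
    for p in (0.01, 0.5, 0.8, 1.0):
        print(f"p_fail={p:.2f} -> {expected_attempts(p):.2f}x load on B")
```

Under these assumptions, a per-attempt failure rate near 0.8 yields roughly 2.95 attempts per user request, close to the observed ~3x. The model is static, though. In the incident, the failure rate itself rises as the extra load pushes B's latency further past the 800ms timeout, which is what turns a config change into a self-sustaining overload.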

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
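For the "what prevents a repeat" part, one option a candidate might describe is bounding retries on A's side. The sketch below is hypothetical and is not A's actual client code. It combines capped exponential backoff with full jitter and a simple retry budget. The names `backoff_delay` and `RetryBudget`, and all default values, are assumptions chosen for illustration.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base_s: float = 0.1, cap_s: float = 2.0,
                  rng: Optional[random.Random] = None) -> float:
    """Delay before retry number `attempt` (1-based), using full jitter.

    The ceiling doubles per retry and is capped at cap_s; the actual delay
    is drawn uniformly from [0, ceiling] so clients do not retry in lockstep.
    """
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    source = rng if rng is not None else random
    ceiling = min(cap_s, base_s * (2 ** (attempt - 1)))
    return source.uniform(0.0, ceiling)


class RetryBudget:
    """Permit retries only up to `ratio` of requests seen (hypothetical helper).

    Simplification: counters never decay. A production version would
    likely track a sliding time window instead.
    """

    def __init__(self, ratio: float = 0.1) -> None:
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def try_spend(self) -> bool:
        """Return True and count the retry if the budget allows it."""
        if self.retries + 1 > self.ratio * self.requests:
            return False
        self.retries += 1
        return True
```

With a budget ratio of 0.1, retries can add at most about 10% to the load A sends to B, regardless of how many calls time out. This bounds the amplification even if a future timeout change is again set too aggressively. During the incident itself, the first mitigation is still to roll back the timeout change rather than to ship new client logic.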

