On-callHardoc-g244

Subject Noisy neighborLevel Senior–Staff~35 minCommon in Reliability & on-call interviewsIndustries Technology, Software development

Question

Your read-heavy Postgres replica that powers a customer dashboard starts showing periodic 3-5 second query stalls, but only in 10-minute bursts a few times a day, and never reproduces in staging. Your own QPS, connection count, buffer cache hit ratio, and query plans are all unchanged. The instance is a shared-tenant managed DB. During the stalls, you see disk read latency spike to 40ms (normally 1ms) and 'IO wait' climb, while your logical read volume is flat. There was no deploy on your side. How do you triage and what's your mitigation path?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.