On-call · Hard · oc-g206

Subject: Metastable failure
Level: Senior–Staff
Time: ~45 min
Common in: Reliability & on-call interviews
Industries: Technology, Software development

Question

At 20:00 your social feed service falls over and STAYS down, even though the traffic spike that triggered it has already passed. The dashboards show DB CPU pinned at 100%, a cache hit rate that collapsed from 95% to 20% at 19:58, and a request rate to the DB of 10x normal. Restarting the app pods brings them up healthy for ~30s, then they fall over again. The trigger was a brief 3x traffic spike at 19:57 (a celebrity post); traffic has since returned to normal. How do you triage and mitigate?
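
A quick back-of-the-envelope check on the dashboard numbers helps separate the trigger from what is sustaining the outage. The sketch below assumes a simplified model in which every cache miss becomes exactly one DB request (no retries, fan-out, or timeouts); the function name `db_load_multiplier` is hypothetical and used only for illustration.

```python
def db_load_multiplier(traffic_multiplier, hit_rate, baseline_hit_rate=0.95):
    """DB request rate relative to baseline.

    Simplifying assumption: each cache miss is exactly one DB request,
    so DB load is proportional to traffic * (1 - hit_rate).
    """
    baseline_miss = 1.0 - baseline_hit_rate
    current_miss = 1.0 - hit_rate
    return traffic_multiplier * current_miss / baseline_miss


# 19:57: 3x traffic while the cache is still warm (95% hits).
spike_only = db_load_multiplier(3.0, 0.95)   # roughly 3x DB load

# After the spike: normal traffic, but the hit rate is down to 20%.
cold_cache = db_load_multiplier(1.0, 0.20)   # roughly 16x DB load

print(f"spike with warm cache: {spike_only:.1f}x")
print(f"normal traffic with cold cache: {cold_cache:.1f}x")
```

Under this model the spike alone accounts for only about 3x DB load, while the collapsed hit rate at normal traffic allows up to about 16x. That points at the cold cache, not the original spike, as the thing keeping the system down. The observed 10x sits below that ceiling, which would be consistent with a saturated DB that cannot complete every request, though that reading is an inference from the model and not something the dashboards state directly.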

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
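
As one illustration of "stop the bleeding first", the sketch below caps how many cache misses may reach the database at once and sheds the rest, so the DB has headroom to answer some requests and the cache can refill. This is only one possible mitigation, not the expected answer; the class `MissLimiter`, its parameters, and the dictionary standing in for the cache are all hypothetical names chosen for this example.

```python
import threading


class MissLimiter:
    """Caps concurrent cache misses that are allowed to hit the database.

    Hypothetical helper: requests beyond the cap are shed (they get a
    fallback such as a stale or empty feed) instead of piling onto the DB.
    """

    def __init__(self, max_concurrent_misses):
        self._slots = threading.BoundedSemaphore(max_concurrent_misses)

    def fetch(self, key, cache, load_from_db, fallback=None):
        # Cache hits never touch the limiter or the DB.
        if key in cache:
            return cache[key]
        # Non-blocking acquire: if no slot is free, shed immediately
        # so callers fail fast instead of queueing behind the DB.
        if not self._slots.acquire(blocking=False):
            return fallback
        try:
            value = load_from_db(key)
            cache[key] = value  # refill the cache so later requests hit
            return value
        finally:
            self._slots.release()


if __name__ == "__main__":
    cache = {}
    limiter = MissLimiter(max_concurrent_misses=1)
    print(limiter.fetch("feed:42", cache, lambda k: f"rows for {k}"))
    print(limiter.fetch("feed:42", cache, lambda k: "not called on a hit"))
```

The design choice here is to fail fast and not queue: in a metastable state, waiting requests tend to time out and retry, which adds to the load that is keeping the system down. In practice the cap would start low and be raised as the hit rate recovers.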
