Code Room
On-call · Medium
Question
Right after a 16:00 deploy, your product-detail service falls over: p99 latency goes from 90ms to 6s, the database CPU pins at 100%, and the error rate spikes to 20% for about three minutes before everything settles back to normal on its own. The deploy was a routine code change, but it required a rolling restart of the Memcached fleet to pick up new client config. There were no schema or query changes. How do you triage, mitigate the immediate pain, and prevent the next deploy from doing this?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
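One reading of the scenario (an assumption, since the question does not state the cause) is that the rolling restart emptied Memcached, so every request missed at once and fell through to the database until the cache refilled. If that is what happened, one candidate for the "prevents a repeat" step is to coalesce concurrent misses so that only one caller per key queries the database. Below is a minimal Python sketch of that idea. An in-process dict stands in for Memcached, and the names SingleFlightCache, slow_db_load, and demo are hypothetical, not taken from any real service.

```python
import threading
import time


class SingleFlightCache:
    """Coalesces concurrent misses: one loader call per key at a time."""

    def __init__(self, loader):
        self._loader = loader    # hypothetical stand-in for the database query
        self._values = {}        # stand-in for Memcached (assumption)
        self._inflight = {}      # key -> Event for a load already in progress
        self._lock = threading.Lock()

    def get(self, key):
        while True:
            with self._lock:
                if key in self._values:
                    return self._values[key]
                event = self._inflight.get(key)
                leader = event is None
                if leader:
                    # First caller to miss becomes the leader for this key.
                    event = threading.Event()
                    self._inflight[key] = event
            if leader:
                try:
                    value = self._loader(key)
                    with self._lock:
                        self._values[key] = value
                    return value
                finally:
                    # Wake followers whether the load succeeded or failed.
                    with self._lock:
                        del self._inflight[key]
                    event.set()
            # Followers wait, then loop to re-check the cache. If the
            # leader failed, one follower becomes the next leader.
            event.wait()


def demo(num_requests=50):
    """Simulate a burst of requests for one key against a cold cache."""
    calls = []
    calls_lock = threading.Lock()

    def slow_db_load(key):
        with calls_lock:
            calls.append(key)
        time.sleep(0.05)         # pretend the query is slow
        return f"product:{key}"

    cache = SingleFlightCache(slow_db_load)
    results = []
    results_lock = threading.Lock()

    def worker():
        value = cache.get("sku-1")
        with results_lock:
            results.append(value)

    threads = [threading.Thread(target=worker) for _ in range(num_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(calls), results


if __name__ == "__main__":
    db_calls, results = demo()
    print(f"{len(results)} requests served with {db_calls} database call(s)")
```

This sketch only coalesces within a single process. With many application instances, each would still send one query per key, so a real answer might pair it with staggered cache restarts or pre-warming. Those are separate choices and not something the question prescribes.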
Run or narrate your approach, then ask the coach.