On-call | Hard | oc-g547

Subject: On call | Level: Mid–Senior | ~35 min | Common in: Reliability & on-call interviews | Industries: Technology

Question

It's 11:02 UTC. PagerDuty fires: product-catalog API p99 latency jumped from 40ms to 4.8s and the origin Postgres read replica is at 100% CPU with a queue of 900 active queries, almost all of them the same `SELECT ... FROM catalog WHERE id IN (...)` for the homepage's top-50 SKUs. The Redis dashboard shows a hit rate that cratered from 98% to 11% about 90 seconds ago, and the keyspace shows the `catalog:home:*` keys are missing. There was no deploy. Traffic is normal for a Tuesday. A teammate mentions that an hour ago someone shortened the homepage cache TTL from 1h to 5m "to make merchandising changes show up faster." The replica is now so saturated that even uncached requests are timing out, and the error rate is climbing. Walk me through how you triage and stabilize this, then make it durable.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
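For the "prevents a repeat" part, one option a candidate might sketch is request coalescing (single-flight) combined with jittered TTLs, so that an expired hot key triggers one origin query instead of hundreds. The following is a minimal in-process illustration of that idea, not the service's actual cache layer: the `CoalescingCache` class, its parameters, and the `catalog:home:top50` key name are hypothetical, and a plain dictionary stands in for Redis.

```python
import random
import threading
import time
from typing import Any, Callable, Dict, Tuple


class CoalescingCache:
    """Hypothetical in-process stand-in for a Redis-backed cache."""

    def __init__(self, ttl_seconds: float, jitter_fraction: float = 0.1) -> None:
        self._ttl = ttl_seconds
        self._jitter = jitter_fraction
        self._store: Dict[str, Tuple[float, Any]] = {}   # key -> (expires_at, value)
        self._inflight: Dict[str, threading.Event] = {}  # key -> "load finished" signal
        self._lock = threading.Lock()

    def _expiry(self) -> float:
        # Spread expirations over +/- jitter_fraction of the TTL so that
        # related keys do not all expire at the same instant.
        spread = self._ttl * self._jitter
        return time.monotonic() + self._ttl + random.uniform(-spread, spread)

    def get(self, key: str, loader: Callable[[], Any]) -> Any:
        while True:
            with self._lock:
                entry = self._store.get(key)
                if entry is not None and entry[0] > time.monotonic():
                    return entry[1]  # fresh cache hit
                waiter = self._inflight.get(key)
                if waiter is None:
                    # This caller becomes the single loader for the key.
                    done = threading.Event()
                    self._inflight[key] = done
                    break
            # Another caller is already loading: wait, then re-check the cache.
            waiter.wait()
        try:
            value = loader()  # the only origin query for this miss
            with self._lock:
                self._store[key] = (self._expiry(), value)
            return value
        finally:
            # Always release waiters, even if the loader raised; one of them
            # will then retry as the new loader.
            with self._lock:
                self._inflight.pop(key, None)
            done.set()
```

With this shape, 50 concurrent requests for `catalog:home:top50` that arrive during a miss result in a single call to the loader, and the rest wait for its result. In a multi-instance deployment the same idea would need a shared lock or a stale-while-revalidate scheme, which this sketch does not cover.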

