On-callHardoc-g075

Subject Dependency outagesLevel Senior–Staff~40 minCommon in Reliability & on-call · Code quality & review interviewsIndustries Technology, Software development

Question

Your product-listing service relies on a Redis cache in front of Postgres. At 20:10 the managed Redis cluster has a failover event and is unreachable for ~90 seconds, then comes back empty. Dashboards: during and after the blip, Postgres CPU spikes to 100% and connection count hits the pool ceiling; your service p99 jumps to 12s and error rate to 30%; even after Redis is healthy again, the database stays pegged for several minutes. Recent context: cache TTLs were recently unified to a single 10-minute value, and the cache-miss path reads directly from the DB with no concurrency control. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.