Question
Your product-listing service relies on a Redis cache in front of Postgres. At 20:10 the managed Redis cluster has a failover event and is unreachable for ~90 seconds, then comes back empty. Dashboards: during and after the blip, Postgres CPU spikes to 100% and connection count hits the pool ceiling; your service p99 jumps to 12s and error rate to 30%; even after Redis is healthy again, the database stays pegged for several minutes. Recent context: cache TTLs were recently unified to a single 10-minute value, and the cache-miss path reads directly from the DB with no concurrency control. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.