Question
Your recommendations service calls an upstream feature-store service. At 18:30 the feature-store's p99 creeps from 50ms to 400ms (a mild, real slowdown). Within two minutes the feature-store is fully down — 100% errors — and your recommendations service is also down. Dashboards: feature-store inbound request rate is 6x normal even though end-user traffic is flat; its CPU is pegged; your service shows massive retry counts. Recent context: a config change last sprint set client retries to '3 retries, no backoff' on the recommendations→feature-store call. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.