Question
Your mesh has Envoy outlier detection enabled: a backend host that returns enough 5xx is ejected from the load-balancing pool for a cooldown. At 18:00 one backend pod (of 12) for the catalog service gets briefly slow due to a GC pause and emits some 5xx. Envoy ejects it. But then a SECOND pod, now carrying extra load, also crosses the 5xx threshold and gets ejected; then a third. Within 4 minutes the catalog service's pool has collapsed from 12 healthy hosts to 3, the surviving 3 are overwhelmed, and catalog is effectively down. Dashboards: each individual pod looked momentarily unhealthy, but the underlying cause was transient. No deploy. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.