Question
After a routine deploy that rescheduled many pods, your checkout service starts intermittently connecting to dead backends. Dashboards: ~15% of requests to the payments pods fail with connection-refused / connection-timeout; the failing connections target pod IPs that no longer exist (old pods that were replaced during the deploy); CoreDNS query latency is elevated and its cache hit-rate is unusually high. Recent context: someone recently increased the in-cluster DNS TTL and enabled aggressive client-side DNS caching in the service's HTTP client to 'reduce DNS load.' How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.