Question
At 03:40 multiple services start failing intermittently with `UnknownHostException` and `name or service not known` when calling internal services like `payments.svc.cluster.local` and `db-proxy.internal`. The failure is partial: maybe 20% of resolution attempts fail, and a retry usually succeeds. External (public-internet) calls are fine. The DNS dashboard shows the cluster's CoreDNS pods at very high CPU, with p99 query latency at 5s+ (normally 2ms); query volume to CoreDNS has roughly tripled. A deploy two hours earlier rolled out a new client library across many services. Nothing in the deploy touched DNS config. Walk me through triage and the fix.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.