On-call · Hard · oc-g552

Subject: On call
Level: Mid–Senior
~40 min
Common in: Networking & APIs · Reliability & on-call · Code quality & review interviews
Industries: Technology

Question

At 03:40, multiple services start failing intermittently with `UnknownHostException` and `name or service not known` when calling internal services such as `payments.svc.cluster.local` and `db-proxy.internal`. The failure is partial: roughly 20% of resolution attempts fail, and a retry usually succeeds. External (public-internet) calls are fine. The DNS dashboard shows the cluster's CoreDNS pods at very high CPU, with query latency p99 at 5s+ (normally 2ms); query volume to CoreDNS has roughly tripled. A deploy two hours ago rolled out a new client library across many services. Nothing in that deploy touched DNS config. Walk me through triage and the fix.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
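
One way to ground "real signals" during triage is to measure the resolution failure rate directly from an affected host, rather than relying only on application error logs. The following is a minimal sketch, not a prescribed tool: the function name `sample_resolution` and its `resolver` and `clock` parameters are hypothetical, introduced here for illustration. It assumes the Python standard library's `socket.gethostbyname` as the default resolver, which goes through the host's normal resolver path and may be affected by any local caching in place.

```python
import socket
import time
from typing import Callable, Dict, Iterable, List


def sample_resolution(
    hostnames: Iterable[str],
    attempts: int = 50,
    resolver: Callable[[str], object] = socket.gethostbyname,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Dict[str, float]]:
    """Resolve each hostname `attempts` times and summarize the results.

    Hypothetical triage helper: reports the fraction of failed lookups and
    the slowest single lookup per hostname.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")

    report: Dict[str, Dict[str, float]] = {}
    for host in hostnames:
        failures = 0
        latencies: List[float] = []
        for _ in range(attempts):
            start = clock()
            try:
                resolver(host)
            except OSError:
                # socket.gaierror (e.g. "Name or service not known")
                # is a subclass of OSError.
                failures += 1
            latencies.append(clock() - start)
        report[host] = {
            "attempts": float(attempts),
            "failure_rate": failures / attempts,
            "max_latency_s": max(latencies),
        }
    return report
```

Calling `sample_resolution(["payments.svc.cluster.local", "db-proxy.internal"])` from an affected pod, and again with a public hostname, would give a per-name failure rate to compare against the roughly 20% described in the question. Re-running it after a mitigation is one way to confirm that the change actually helped before announcing recovery.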
