On-callHardoc-g069

Subject Dns failureLevel Senior–Staff~35 minCommon in Networking & APIs interviewsIndustries Technology, Software development

Question

Half your fleet suddenly can't reach an internal service db-primary.internal. Affected pods log 'no such host' / NXDOMAIN; the other half resolve it fine. Dashboards: error rate on the orders API is ~50% and bouncing; the unaffected half of pods are healthy. Recent context: 20 minutes ago an infra change migrated internal DNS from a self-hosted CoreDNS to a managed resolver, and someone also bumped record TTLs from 30s to 3600s last week. No application deploy. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.