On-callMediumoc-g687

Subject Networking dnsLevel Mid–Senior~35 minCommon in Networking & APIs interviewsIndustries Technology

Question

A service's p95 latency jumps from 120ms to 900ms across all regions at once, with no deploy and no traffic change. The app's own DB and downstream call timings are unchanged, but a new 'time to first byte to downstream' metric shows a consistent ~700ms added before each outbound call. Tracing shows the delay is in establishing the connection to downstreams, before any request bytes flow. The platform team rolled out a new node-level DNS configuration (a switched-in caching resolver / different upstream nameservers) a bit before the incident, fleet-wide. Some lookups are slow; a few fail and retry. Triage and mitigate.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.