Question
A service's p95 latency jumps from 120ms to 900ms across all regions at once, with no deploy and no traffic change. The app's own DB and downstream call timings are unchanged, but a new 'time to first byte to downstream' metric shows a consistent ~700ms added before each outbound call. Tracing shows the delay is in establishing the connection to downstreams, before any request bytes flow. The platform team rolled out a new node-level DNS configuration (a switched-in caching resolver / different upstream nameservers) a bit before the incident, fleet-wide. Some lookups are slow; a few fail and retry. Triage and mitigate.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.