On-call · Hard · oc-g486

Subject: DNS failure
Level: Senior–Staff (~35 min)
Common in: Networking & APIs · Reliability & on-call interviews
Industries: Technology, Software development

Question

Your payment vendor posts a status update at 14:00: their primary endpoint had an outage and they failed over by repointing api.vendor.com to a new IP set ~15 minutes ago; they say traffic should be recovering. But your checkout error rate stays pinned at ~40% timeouts to the vendor for another 25 minutes even though they're 'back.' Dashboards: a subset of your app pods recovered immediately, while others keep dialing the OLD IPs and timing out. `dig api.vendor.com` from a fresh pod returns the new IPs; the stuck pods' in-process resolver cache still holds the old ones. The vendor's DNS record carries a 3600s TTL. How do you triage and mitigate?
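The timeline in the question is worth working out before you answer, because it shows that the stuck pods may be behaving exactly as the 3600s TTL allows. The sketch below is only arithmetic on the scenario's numbers; the calendar date and the function name are placeholders, and it assumes the cache honors the TTL (an in-process cache that ignores TTL could stay stale for longer).

```python
from datetime import datetime, timedelta


def worst_case_stale_until(failover_at: datetime, ttl_seconds: int) -> datetime:
    """Latest time a TTL-honoring cache could still serve the pre-failover record.

    A resolver that cached the old answer just before the repoint keeps it
    for one full TTL, so the bound is failover time plus TTL.
    """
    return failover_at + timedelta(seconds=ttl_seconds)


# The date is arbitrary; only the clock times come from the scenario.
status_update = datetime(2024, 1, 1, 14, 0)
failover_at = status_update - timedelta(minutes=15)     # "~15 minutes ago"
stale_until = worst_case_stale_until(failover_at, 3600)  # vendor TTL is 3600s
observed_until = status_update + timedelta(minutes=25)   # errors persist 25 more minutes

print(stale_until.strftime("%H:%M"))   # 14:45
print(observed_until < stale_until)    # True
```

Under these assumptions, waiting for the TTL to expire could leave checkout degraded until roughly 14:45, which is the argument for active mitigation rather than waiting.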

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
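For the mitigation and prevention parts of the answer, one option a candidate might describe is an in-process DNS cache that caps entry age regardless of the vendor's TTL and exposes a manual flush for incidents. The following is a minimal sketch of that idea, not the app's actual resolver: `CappedDnsCache`, `system_resolve`, the 30-second cap, and the example IPs (documentation ranges) are all hypothetical. It also assumes the upstream resolver already returns the new record, as the `dig` from a fresh pod suggests.

```python
import socket
import time
from typing import Callable, Dict, List, Optional, Tuple


def system_resolve(host: str) -> List[str]:
    """Resolve through the OS resolver; returns sorted unique IP strings."""
    infos = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


class CappedDnsCache:
    """Caches lookups for at most max_age_s seconds, ignoring longer record TTLs."""

    def __init__(
        self,
        resolve: Callable[[str], List[str]] = system_resolve,
        max_age_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolve = resolve
        self._max_age_s = max_age_s
        self._clock = clock
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def lookup(self, host: str) -> List[str]:
        """Return cached IPs if younger than the cap, else re-resolve and store."""
        now = self._clock()
        entry = self._entries.get(host)
        if entry is not None and now - entry[0] < self._max_age_s:
            return entry[1]
        ips = self._resolve(host)
        self._entries[host] = (now, ips)
        return ips

    def flush(self, host: Optional[str] = None) -> None:
        """Incident lever: drop one host's entry, or everything when host is None."""
        if host is None:
            self._entries.clear()
        else:
            self._entries.pop(host, None)


# Usage with a fake resolver and fake clock so the behavior is visible offline.
answers = {"api.vendor.com": ["192.0.2.10"]}       # pre-failover record
fake_now = [0.0]
cache = CappedDnsCache(
    resolve=lambda h: list(answers[h]),
    max_age_s=30.0,
    clock=lambda: fake_now[0],
)

print(cache.lookup("api.vendor.com"))   # ['192.0.2.10']
answers["api.vendor.com"] = ["198.51.100.20"]      # vendor repoints the name
fake_now[0] = 10.0
print(cache.lookup("api.vendor.com"))   # ['192.0.2.10'] (still cached)
cache.flush("api.vendor.com")                      # mitigation: force re-resolution
print(cache.lookup("api.vendor.com"))   # ['198.51.100.20']
```

In the incident itself the equivalent of `flush` is usually a rolling restart of the stuck pods or whatever cache-clearing hook the runtime offers; the age cap is the follow-up that bounds how long a future repoint can hurt.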


Diagram & narrate the incident

