On-call · Hard · oc-g690

Subject: Networking anycast
Level: Senior–Staff
~40 min
Common in: Networking & APIs · Reliability & on-call interviews
Industries: Technology, Telecom

Question

Your edge runs anycast: the same VIP is announced from multiple POPs, and BGP/ECMP steers users to the nearest one. After you drained one POP for maintenance by withdrawing its route announcements, users who had been on that POP reported that their connections froze mid-session and that they had to reconnect. You also saw a brief global spike of TCP resets and aborted TLS sessions. New connections are fine; only established, in-flight connections broke. The drain was a hard route withdrawal. Triage the incident, explain the root cause, and describe how to drain a POP without breaking sessions.
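One plausible mechanism behind the global spike is that withdrawing a route changes the set of equal-cost next hops, so a router that picks a next hop by hashing the flow and taking it modulo the number of next hops can remap flows that were never on the drained POP. A flow that lands on a POP holding no TCP state for it is typically answered with a reset. The sketch below is a simplified illustration, not a model of any particular router: `flow_hash`, the POP names, and the addresses are hypothetical, and real ECMP hashing is vendor-specific. It counts flows that sat on a surviving next hop but moved anyway, comparing naive modulo hashing with rendezvous (highest-random-weight) hashing.

```python
import hashlib


def flow_hash(key) -> int:
    """Stable hash of a flow key (hypothetical stand-in for a router's ECMP hash)."""
    digest = hashlib.sha256(repr(key).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def pick_modulo(flow, next_hops):
    """Naive ECMP: hash the 5-tuple, take it modulo the number of next hops."""
    return next_hops[flow_hash(flow) % len(next_hops)]


def pick_rendezvous(flow, next_hops):
    """Rendezvous hashing: score every (flow, hop) pair, keep the highest score."""
    return max(next_hops, key=lambda hop: flow_hash((flow, hop)))


def moved_survivors(pick, flows, before, after):
    """Count flows that were on a surviving next hop but were remapped anyway."""
    removed = set(before) - set(after)
    moved = 0
    for flow in flows:
        old = pick(flow, before)
        if old not in removed and pick(flow, after) != old:
            moved += 1
    return moved


# Synthetic 5-tuples using documentation address ranges (assumed data).
flows = [
    ("198.51.100.%d" % (i % 250), 10000 + i, "203.0.113.10", 443, "tcp")
    for i in range(10000)
]
before = ["pop-a", "pop-b", "pop-c", "pop-d"]
after = ["pop-a", "pop-b", "pop-d"]  # pop-c withdrawn

modulo_moved = moved_survivors(pick_modulo, flows, before, after)
rendezvous_moved = moved_survivors(pick_rendezvous, flows, before, after)
print("modulo: survivors remapped =", modulo_moved)
print("rendezvous: survivors remapped =", rendezvous_moved)
```

With rendezvous hashing, a flow whose best-scoring hop survives keeps that hop, so only flows from the removed hop move. With modulo hashing, changing the divisor reshuffles a large share of the other flows as well, which is consistent with resets appearing well beyond the drained POP's own users.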

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
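For the "what prevents a repeat" part, one possible ordering is to stop taking new sessions while the POP still announces the VIP, wait for established connections to finish, and only then withdraw the route. The sketch below shows that sequencing under stated assumptions: `stop_new_sessions`, `active_connections`, and `withdraw_route` are hypothetical hooks supplied by the caller, and how a POP actually stops accepting new sessions while still announcing (for example, handing new connections to a neighbouring POP) is out of scope here.

```python
import time
from typing import Callable, List


def graceful_drain(
    stop_new_sessions: Callable[[], None],   # hypothetical hook
    withdraw_route: Callable[[], None],      # hypothetical hook
    active_connections: Callable[[], int],   # hypothetical hook
    max_wait_s: float = 600.0,
    poll_s: float = 5.0,
    threshold: int = 0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Drain a POP in three phases and return a log of the phases reached."""
    log = []

    # Phase 1: keep announcing the VIP, but stop taking on new sessions.
    stop_new_sessions()
    log.append("stopped_new")

    # Phase 2: wait for established connections to finish, up to a deadline.
    waited = 0.0
    remaining = active_connections()
    while remaining > threshold and waited < max_wait_s:
        sleep(poll_s)
        waited += poll_s
        remaining = active_connections()
    log.append("drained" if remaining <= threshold else "timed_out")

    # Phase 3: withdraw the route. On timeout this still breaks the
    # stragglers; whether to proceed or abort is a policy choice.
    withdraw_route()
    log.append("withdrawn")
    return log


# Usage with fake hooks so the sketch runs without any network access.
counts = iter([120, 40, 3, 0])
events = []
result = graceful_drain(
    stop_new_sessions=lambda: events.append("stop_new"),
    withdraw_route=lambda: events.append("withdraw"),
    active_connections=lambda: next(counts),
    sleep=lambda s: None,
)
print(result)
```

Injecting the hooks and the `sleep` function keeps the ordering testable on its own; the deadline matters because long-lived sessions may never close on their own.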

