On-call · Hard · oc-g288

Subject: DNS failure
Level: Senior–Staff
Time: ~40 min
Common in: Networking & APIs · Reliability & on-call interviews
Industries: Technology

Question

Your primary database endpoint failed over to its standby at 09:00: the orchestrator promoted the standby and updated the DNS record db-primary.internal to the new IP. Failover is supposed to take 30 seconds. Instead, for ~14 minutes you keep getting partial write failures: roughly 40% of app pods write successfully to the new primary, while 60% keep hammering the old (now read-only) primary and fail with 'cannot execute INSERT in a read-only transaction'. The dashboards show that the DNS record updated correctly and immediately, and that promotion completed cleanly. The failing pods are not random: they cluster on certain nodes and certain language runtimes. There was no deploy. How do you triage, and what is the durable fix?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
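
One way to turn "the failing pods cluster on certain nodes and certain language runtimes" into a concrete signal is to group observed write attempts by node and by runtime, and to list the pods still connecting to something other than the new primary. The sketch below is illustrative only: the `WriteSample` schema, the pod, node, and runtime names, and the IP addresses are all hypothetical placeholders, and in practice the samples would come from whatever logs or metrics the team already collects.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class WriteSample(NamedTuple):
    """One observed write attempt (hypothetical schema, not from the source)."""
    pod: str
    node: str
    runtime: str
    target_ip: str  # IP the pod actually connected to
    ok: bool        # True if the write succeeded


def failure_rate_by(samples: Iterable[WriteSample], key: str) -> Dict[str, float]:
    """Return the fraction of failed writes for each value of the field `key`."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for sample in samples:
        group = getattr(sample, key)
        totals[group] += 1
        if not sample.ok:
            failures[group] += 1
    return {group: failures[group] / totals[group] for group in totals}


def pods_on_stale_target(samples: Iterable[WriteSample], new_primary_ip: str) -> List[str]:
    """Return the sorted names of pods connecting to anything but the new primary."""
    return sorted({s.pod for s in samples if s.target_ip != new_primary_ip})


if __name__ == "__main__":
    OLD_IP = "10.0.0.10"  # placeholder for the demoted, read-only primary
    NEW_IP = "10.0.0.20"  # placeholder for the promoted standby
    observed = [
        WriteSample("api-1", "node-a", "jvm", OLD_IP, False),
        WriteSample("api-2", "node-a", "jvm", OLD_IP, False),
        WriteSample("api-3", "node-b", "go", NEW_IP, True),
        WriteSample("api-4", "node-b", "jvm", NEW_IP, True),
    ]
    print("failure rate by node:   ", failure_rate_by(observed, "node"))
    print("failure rate by runtime:", failure_rate_by(observed, "runtime"))
    print("pods on a stale target: ", pods_on_stale_target(observed, NEW_IP))
```

If the failure rate is near 100% for some groups and near 0% for others, the split is structural rather than random. That narrows the hypotheses to whatever differs between those nodes or runtimes and helps separate the root cause from the read-only error, which is only the symptom.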

