Question
Your primary database endpoint failed over to the standby at 09:00: the orchestrator promoted the standby and updated the DNS record db-primary.internal to point to the new IP. Failover is supposed to take 30 seconds. Instead, for ~14 minutes you keep seeing partial write failures: roughly 40% of app pods write successfully to the new primary, while 60% keep hammering the old (now read-only) primary and fail with 'cannot execute INSERT in a read-only transaction'. The dashboards show that the DNS record updated correctly and immediately and that promotion completed cleanly. The failing pods are not random; they cluster on certain nodes and certain language runtimes. There was no deploy. How do you triage, and what is the durable fix?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
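
One way to turn "the failing pods cluster on certain nodes and runtimes" into a real signal is to compute the write-failure rate per node and per runtime. The sketch below is only a triage aid under stated assumptions: the `WriteResult` record, the `failure_rate_by` helper, and the sample pods are all hypothetical and invented for illustration. In practice the rows would come from whatever logs or metrics you already collect.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable, NamedTuple


class WriteResult(NamedTuple):
    """One observed write attempt (hypothetical record shape)."""
    pod: str
    node: str
    runtime: str
    ok: bool  # True if the write succeeded


def failure_rate_by(results: Iterable[WriteResult], key: str) -> dict[str, float]:
    """Return the fraction of failed writes for each value of `key`.

    `key` is a field name on WriteResult, e.g. "node" or "runtime".
    """
    totals: Counter[str] = Counter()
    failures: Counter[str] = Counter()
    for r in results:
        value = getattr(r, key)
        totals[value] += 1
        if not r.ok:
            failures[value] += 1
    return {value: failures[value] / totals[value] for value in totals}


# Hypothetical sample data, not taken from the incident.
sample = [
    WriteResult("app-1", "node-a", "jvm", False),
    WriteResult("app-2", "node-a", "jvm", False),
    WriteResult("app-3", "node-b", "jvm", False),
    WriteResult("app-4", "node-b", "go", True),
    WriteResult("app-5", "node-c", "go", True),
]

by_runtime = failure_rate_by(sample, "runtime")
by_node = failure_rate_by(sample, "node")
print(by_runtime)
print(by_node)
```

If a dimension splits cleanly (rates near 0 or 1 per value), that dimension is a better place to look for the cause than one where the rates are mixed. A clean split is a lead to investigate, not proof of root cause.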