Question
You connect to a managed database through the vendor's cluster DNS endpoint, which fronts a fleet of proxy nodes. The vendor ran planned maintenance at 02:00 that cycled out half the proxy fleet. From 02:00 to 02:25, about 30% of your app's new DB connections fail with 'connection refused' errors or timeouts; then the failures clear. Dashboards show that the failing connections target proxy IPs that were drained during maintenance, and that the vendor's cluster endpoint was still resolving to those IPs for your clients during the window. Your connection pool holds long-lived connections, and your resolver caches the endpoint with a 60s TTL. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
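One possible client-side mitigation, sketched under assumptions: since the endpoint was still handing out drained proxy IPs, a new connection attempt can resolve every address behind the endpoint and fall through to the next one when an address refuses or times out, using a short connect timeout. The names `resolve_all` and `connect_first_healthy`, the 2-second timeout, and the hostname and port in the usage comment are all hypothetical; a real database driver may already offer multi-address failover, in which case that is preferable to hand-rolled logic.

```python
import socket
from typing import Callable, List, Optional, Tuple

Address = Tuple[str, int]


def resolve_all(host: str, port: int) -> List[Address]:
    """Return every distinct address the endpoint currently resolves to."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addresses: List[Address] = []
    for _family, _type, _proto, _canon, sockaddr in infos:
        addr = (sockaddr[0], sockaddr[1])
        if addr not in addresses:
            addresses.append(addr)
    return addresses


def tcp_connect(addr: Address, timeout: float) -> socket.socket:
    """Default connector: a plain TCP connection with a bounded timeout."""
    return socket.create_connection(addr, timeout=timeout)


def connect_first_healthy(
    addresses: List[Address],
    connect: Callable[[Address, float], object] = tcp_connect,
    timeout: float = 2.0,  # assumed value; tune to your latency budget
):
    """Try each candidate address in order and return the first success.

    A refused connection or a timeout on one address (for example a
    drained proxy) moves on to the next address instead of failing
    the whole attempt.
    """
    last_error: Optional[OSError] = None
    for addr in addresses:
        try:
            return connect(addr, timeout)
        except OSError as exc:  # covers ConnectionRefusedError and timeouts
            last_error = exc
    raise ConnectionError(
        f"all {len(addresses)} candidate addresses failed"
    ) from last_error


# Hypothetical usage (hostname and port are placeholders):
# conn = connect_first_healthy(resolve_all("db.example.internal", 5432))
```

This only addresses the symptom on the client side. The stale records served by the endpoint and the 60s resolver cache are separate questions for the root-cause and prevention parts of the answer.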