Question
A scheduled secret rotation runs at 02:00: the DB password is rotated in the secret manager and the old password is revoked. The service is configured to read the secret from a mounted file injected at pod start. At 02:00 there is no incident: traffic is low and most pods were recently restarted by a deploy at 23:40. The pager fires at 09:15 during morning ramp: a slowly growing fraction of requests fail with `password authentication failed for user "app"`. Dashboards show that the failing fraction matches exactly the set of pods with the *oldest* start times; newly autoscaled pods are fine; the failing fraction grows as HPA scales up and old pods stay alive. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.