On-callHardoc-g271

Subject Deploy incidentsLevel Senior–Staff~35 minCommon in Security · Reliability & on-call interviewsIndustries Technology, Software development

Question

A scheduled secret rotation runs at 02:00: the DB password is rotated in the secret manager and the old password is revoked. The service is configured to read the secret from a mounted file injected at pod start. At 02:00, no incident — traffic is low and most pods were recently restarted by a deploy at 23:40. The pager fires at 09:15 during morning ramp: a slowly-growing fraction of requests fail with `password authentication failed for user "app"`. Dashboards: the failing fraction matches exactly the set of pods with the *oldest* start times; newly autoscaled pods are fine; the failing fraction grows as HPA scales up and old pods stay alive. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.