On-call · Medium · oc-g284

Subject: Deploy incidents
Level: Mid–Senior
~30 min
Common in: Storage & CDN · Reliability & on-call interviews
Industries: Technology, Software development

Question

A deploy ships at 11:00. Everything is fine for ~90 minutes. Then a slow-burn outage begins: pods across the fleet start failing health checks and getting evicted, and the log-ingestion pipeline backs up. The dashboards show node disk usage on the app nodes climbing steadily and crossing 95%; eviction events correlate exactly with disk pressure; CPU, memory, and request rate are normal. The new release accidentally left a `logger.debug` call on the hot request path that is effectively enabled at INFO level (the level was mis-set in the deploy's config), emitting ~5 KB per request to local stdout/files. Log rotation is size-based, but the rotation threshold was not reached quickly enough to reclaim space before the disks filled. How do you triage and mitigate?
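A quick back-of-envelope check helps explain why the outage is slow-burn rather than immediate. The sketch below estimates how long a node takes to cross the disk threshold at a given log write rate. Only the ~5 KB per request figure comes from the scenario; the per-node request rate and the free headroom are assumed values, picked to show how a ~90 minute delay could arise.

```python
def minutes_until_threshold(headroom_bytes: int,
                            bytes_per_request: int,
                            requests_per_second: float) -> float:
    """Minutes until `headroom_bytes` of free disk is consumed by log writes."""
    if bytes_per_request <= 0 or requests_per_second <= 0:
        raise ValueError("write rate must be positive")
    write_rate = bytes_per_request * requests_per_second  # bytes per second
    return headroom_bytes / write_rate / 60


# Only BYTES_PER_REQUEST reflects the scenario; the other two are assumptions.
BYTES_PER_REQUEST = 5 * 1024       # ~5 KB of extra log output per request
REQUESTS_PER_SECOND = 200          # assumed per-node request rate
HEADROOM_BYTES = 5_529_600_000     # assumed free space below the 95% line

if __name__ == "__main__":
    eta = minutes_until_threshold(
        HEADROOM_BYTES, BYTES_PER_REQUEST, REQUESTS_PER_SECOND
    )
    print(f"disk crosses the threshold in about {eta:.0f} minutes")
```

Under these assumed numbers the node writes roughly 1 MB/s of extra logs, so a few gigabytes of headroom last on the order of an hour or two. The same arithmetic, run with real dashboard values, gives a rough deadline for mitigation on nodes that have not yet been evicted.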

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
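Forming hypotheses from real signals can start with two direct questions on an affected node: how full is the disk, and which files are consuming it? The following is a minimal sketch using only the Python standard library. The helper names `used_fraction` and `largest_files` and the `/var/log` path are illustrative assumptions, not part of the scenario; real log locations depend on the container runtime and node setup.

```python
import heapq
import os
import shutil


def used_fraction(path: str) -> float:
    """Fraction (0.0 to 1.0) of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def largest_files(root: str, n: int = 10) -> list[tuple[int, str]]:
    """Return up to `n` (size_in_bytes, path) pairs under `root`, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(full), full))
            except OSError:
                continue  # file was rotated or deleted during the scan
    return heapq.nlargest(n, sizes)


if __name__ == "__main__":
    LOG_ROOT = "/var/log"  # assumed location; adjust for the actual node layout
    print(f"disk used: {used_fraction(LOG_ROOT):.0%}")
    for size, path in largest_files(LOG_ROOT, 5):
        print(f"{size / 1e9:8.2f} GB  {path}")
```

If the top entries are application log files that have grown sharply since the deploy, that ties the symptom (disk pressure and evictions) to a likely cause (the new release's log volume) and supports rolling back or correcting the log level as the first mitigation.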
