Question
A deploy ships at 11:00. Everything is fine for ~90 minutes. Then a slow-burn outage begins: pods across the fleet start failing health checks and getting evicted, and the log-ingestion pipeline backs up. Dashboards: node disk usage on the app nodes climbs steadily and crosses 95%; the eviction events correlate exactly with disk pressure; CPU/mem/request-rate are normal. The new release accidentally left a `logger.debug` on the hot request path enabled at INFO level (a level mis-set in the deploy's config), emitting ~5 KB per request to local stdout/files; log rotation is size-based but the rotation threshold was never hit fast enough. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.