Code Room
On-callEasy
Question
A service fails to start and the orchestrator is reporting it as unhealthy across every instance. The startup logs show the process exiting immediately with an error about an invalid value in its configuration. The last change was a config update pushed five minutes ago to tune a setting — no application code changed. Every instance is failing identically. How do you triage and recover?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.