Code Room
On-callMedium
Question
At 00:00 UTC on the 1st of the month, every call from your services to one internal downstream API starts failing simultaneously and completely — 100% error rate, instant, across all regions at once. The errors are TLS handshake failures: `certificate has expired`. No deploy went out; nothing changed in your code; it broke at a clean clock boundary. The downstream team is paged too. Triage, mitigate, and describe the prevention angle.
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.