On-call · Medium · oc-g070

Subject: Cert expiry · Level: Mid–Senior · ~30 min · Common in: Reliability & on-call interviews · Industries: Technology, Software development

Question

At 00:01 UTC your background job that syncs inventory to a partner starts failing 100% of the time, and a few minutes later your internal service mesh begins throwing errors between two services. Logs show 'x509: certificate has expired or is not yet valid' and TLS handshake failures. Dashboards: the partner-sync error rate is at 100%; the mesh sidecar error rate is climbing; user-facing traffic over your public LB (managed cert) is fine. There was no deploy. The on-call before you mentioned 'cert renewals have been flaky since we turned off the old cron.' How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
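One way to turn "form hypotheses from real signals" into something concrete is to compare each certificate's validity window against the incident time. The sketch below is illustrative only: the certificate names, dates, and the 7-day warning window are hypothetical, not taken from the scenario. It assumes you have already collected notBefore/notAfter strings (for example from `openssl x509 -noout -dates`) and only classifies them. Because the logged error covers both "expired" and "not yet valid", it checks both bounds.

```python
import ssl
from datetime import datetime, timedelta, timezone


def parse_cert_time(value: str) -> datetime:
    """Parse an OpenSSL-style time such as 'Jan  1 00:00:00 2025 GMT' as UTC."""
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(value), tz=timezone.utc)


def classify_cert(not_before: str, not_after: str, now: datetime,
                  warn_within: timedelta = timedelta(days=7)) -> str:
    """Classify one certificate's validity window relative to `now`."""
    start = parse_cert_time(not_before)
    end = parse_cert_time(not_after)
    if now < start:
        return "not_yet_valid"   # can also indicate clock skew on the checker
    if now >= end:
        return "expired"
    if end - now <= warn_within:
        return "expiring_soon"   # next in line to fail; renew before it does
    return "ok"


# Hypothetical inventory: name -> (notBefore, notAfter). Invented for illustration.
CERTS = {
    "partner-sync-client": ("Jan  1 00:00:00 2024 GMT", "Jan  1 00:00:00 2025 GMT"),
    "mesh-service-a": ("Dec  1 00:00:00 2024 GMT", "Jan  1 00:05:00 2025 GMT"),
    "public-lb": ("Nov 15 00:00:00 2024 GMT", "Feb 13 00:00:00 2025 GMT"),
}


def triage(certs: dict, now: datetime) -> dict:
    """Return a status per certificate so the blast radius is visible at a glance."""
    return {name: classify_cert(nb, na, now) for name, (nb, na) in certs.items()}


if __name__ == "__main__":
    incident_time = datetime(2025, 1, 1, 0, 1, tzinfo=timezone.utc)
    for name, status in triage(CERTS, incident_time).items():
        print(f"{name}: {status}")
```

With this made-up data, one certificate is already expired at the incident time and another is minutes from expiring, which is the kind of pattern that would tie both symptoms to the disabled renewal cron rather than to two separate faults. A real check would read the dates from the deployed certificates instead of a hard-coded table.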
