Question
At 00:01 UTC your background job that syncs inventory to a partner starts failing 100%, and a few minutes later your internal service mesh begins throwing errors between two services. Logs show 'x509: certificate has expired or is not yet valid' and TLS handshake failures. Dashboards: the partner-sync error rate is at 100%; mesh sidecar error rate is climbing; user-facing traffic over your public LB (managed cert) is fine. There was no deploy. The on-call before you mentioned 'cert renewals have been flaky since we turned off the old cron.' How do you triage and mitigate?
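Stop the bleeding first (mitigate), then form hypotheses from the signals you actually have: the x509 validity error, the 00:01 UTC onset, the absence of a deploy, and the old cron that was turned off. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.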
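One quick check when forming hypotheses: the logged error covers two different cases, a certificate that has expired and one that is not yet valid, and they point to different causes (a missed renewal versus clock skew or a just-issued cert). The sketch below is a minimal Python illustration, assuming you have already pulled the validity dates from the failing certificate, for example with `openssl x509 -noout -dates` and the `notBefore=` / `notAfter=` prefixes stripped. The function name `classify_cert_window` is hypothetical, not part of any tool in the scenario.

```python
import ssl
from datetime import datetime, timezone


def classify_cert_window(not_before: str, not_after: str, now: datetime) -> str:
    """Classify a certificate validity window relative to `now`.

    `not_before` / `not_after` use the "Jan  5 09:34:43 2018 GMT" format
    that OpenSSL prints and that ssl.cert_time_to_seconds() parses.
    `now` must be timezone-aware.
    """
    start = ssl.cert_time_to_seconds(not_before)
    end = ssl.cert_time_to_seconds(not_after)
    ts = now.timestamp()
    if ts < start:
        # Points toward clock skew or a cert issued moments ago.
        return "not_yet_valid"
    if ts > end:
        # Points toward a missed or failed renewal.
        return "expired"
    return "valid"


if __name__ == "__main__":
    # Illustrative dates only; substitute the values from the real cert.
    print(
        classify_cert_window(
            "Jan  1 00:00:00 2024 GMT",
            "Jan  1 00:00:00 2025 GMT",
            datetime.now(timezone.utc),
        )
    )
```

Run against both the partner-sync client certificate and a mesh sidecar certificate, a result of `expired` on each would support the missed-renewal hypothesis, while `not_yet_valid` would shift attention to host clocks.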