Question
A deploy upgrades a shared resilience LIBRARY (circuit breaker / retry) used by the `orders` service from 3.x to 4.x. The team's own feature flags are untouched. Forty minutes after deploy, during a brief downstream blip, `orders` melts down far worse than usual: a tiny 1% downstream error rate amplifies into a 40% orders error rate and a retry storm, where before the same blip was a non-event. Dashboards: outbound retry count to the downstream is ~5x historical; the circuit breaker never opens. Context: library 4.x changed the DEFAULT circuit-breaker behavior — the breaker is now OPT-IN (disabled unless explicitly enabled) where 3.x had it ON by default, and the upgrade didn't add the now-required enable flag. Triage and mitigate.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.