On-call · Hard · oc-g083

Subject: Dependency outages
Level: Senior–Staff
~40 min
Common in: Storage & CDN · Reliability & on-call · Code quality & review interviews
Industries: Technology

Question

At 12:00 your app's images, uploads, and even some page loads break in one region. Dashboards show calls to object storage (S3-style) in us-east-1 returning 503 'SlowDown' at an elevated error rate, and the cloud provider's status page confirms a regional object-storage degradation. Your compute is healthy, but several services that synchronously read config/feature-flag JSON from that bucket on each request are now timing out and cascading. Recent context: one service was changed to read its feature-flag config from S3 on every request instead of caching it. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
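
One way to make the "prevents a repeat" part concrete is to take the object-storage read off the request path. The sketch below is a minimal illustration, not the service's actual code: `CachedFlags`, `fetch`, `ttl_seconds`, and `defaults` are hypothetical names, and `fetch` stands in for whatever call reads the flag JSON from the bucket (assumed to have its own short timeout). It refreshes at most once per TTL and keeps serving the last known good value when a refresh fails.

```python
import time
from typing import Callable


class CachedFlags:
    """Hypothetical wrapper: refresh at most once per TTL, serve stale on failure."""

    def __init__(
        self,
        fetch: Callable[[], dict],      # assumed: reads flag JSON, raises on error/timeout
        ttl_seconds: float,
        defaults: dict,                 # safe values used until the first successful fetch
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._value = dict(defaults)
        self._clock = clock
        self._next_refresh = float("-inf")  # forces a fetch attempt on first use
        self.stale = False                  # True when the last refresh attempt failed

    def get(self) -> dict:
        now = self._clock()
        if now >= self._next_refresh:
            try:
                self._value = self._fetch()
                self.stale = False
            except Exception:
                # Keep the last known good value instead of failing the request.
                self.stale = True
            # Wait a full TTL after failures too, so a degraded store
            # is not hit again on every request.
            self._next_refresh = now + self._ttl
        return self._value
```

With this shape, a storage outage degrades flag freshness rather than request latency. The `stale` attribute is there so the condition can be exported as a metric or alert; a production version would likely also want locking around the refresh and jitter on the TTL, which are omitted here for brevity.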
