On-call · Medium · oc-g647

Subject: Cold tier storage retrieval SLA
Level: Mid–Senior · ~30 min
Common in: Storage & CDN · Reliability & on-call interviews
Industries: Technology

Question

Your media platform tiers older video segments from hot object storage to a cold archive tier (think S3 Glacier-class) to save cost. A new 'rewatch classics' feature drove a surge of requests for old content. On-call pages: the user-facing 'play' endpoint p95 went from 300ms to 47s, and a chunk of playback attempts time out entirely. Dashboards: cold-tier retrieval/restore queue depth is in the tens of thousands and climbing; restore jobs are completing but each takes minutes-to-hours (standard retrieval tier); hot-tier and CDN are healthy; error budget for the playback SLA is being burned fast. No deploy; this is purely a workload shift. How do you triage and mitigate the SLA miss on cold-tier retrieval?
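
A quick back-of-envelope check helps frame the triage: the restore queue only shrinks if restores complete faster than new requests arrive. The Python sketch below uses a hypothetical helper, `minutes_to_drain`, and made-up rates (the scenario only says the queue is "in the tens of thousands and climbing"), so treat the numbers as placeholders rather than measurements.

```python
from typing import Optional


def minutes_to_drain(
    queue_depth: int,
    arrivals_per_min: float,
    completions_per_min: float,
) -> Optional[float]:
    """Estimate minutes until the restore queue empties.

    Returns None when completions do not outpace arrivals, meaning the
    queue is flat or growing and will not drain on its own.
    """
    net_drain_rate = completions_per_min - arrivals_per_min
    if net_drain_rate <= 0:
        return None
    return queue_depth / net_drain_rate


# Hypothetical numbers: arrivals exceed completions, so the queue keeps climbing.
print(minutes_to_drain(30_000, arrivals_per_min=500, completions_per_min=400))

# Hypothetical numbers after shedding or coalescing load: the queue now drains.
print(minutes_to_drain(30_000, arrivals_per_min=400, completions_per_min=500))
```

A `None` result matches the "climbing" dashboard signal: waiting will not fix it, so mitigation has to either cut arrivals or raise restore throughput.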

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
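
As one illustration of "stop the bleeding first", a candidate mitigation is to coalesce duplicate restore requests and spend a limited faster-retrieval budget on the segments most viewers are waiting for. The sketch below is a minimal Python example under stated assumptions: `plan_restores`, the tier labels `"expedited"` and `"standard"`, and the budget value are hypothetical names for illustration, not any provider's API.

```python
from collections import Counter
from typing import Dict, Iterable


def plan_restores(
    requested_segments: Iterable[str],
    expedited_budget: int,
) -> Dict[str, str]:
    """Map each distinct segment to a retrieval tier.

    Duplicate requests for the same segment collapse into one restore job.
    The most-requested segments get the faster tier until the (hypothetical)
    budget is used up; everything else stays on the standard tier.
    """
    demand = Counter(requested_segments)
    # Highest demand first; ties broken by segment id for a stable order.
    ranked = sorted(demand, key=lambda seg: (-demand[seg], seg))
    return {
        seg: ("expedited" if rank < expedited_budget else "standard")
        for rank, seg in enumerate(ranked)
    }


# Six playback attempts collapse into three restore jobs.
plan = plan_restores(["a", "b", "a", "c", "a", "b"], expedited_budget=1)
print(plan)  # {'a': 'expedited', 'b': 'standard', 'c': 'standard'}
```

Coalescing cuts queue depth without dropping any viewer's request, and the budget cap keeps the cost of faster retrieval bounded while the incident is still being diagnosed.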
