Code Room
On-call · Hard · oc-g482
Subject: Bad rollout
Level: Senior–Staff
Time: ~40 min
Common in: Reliability & on-call interviews
Industries: Computer games, Technology

Question

A rolling deploy updates the authoritative game-server binary that hosts live match sessions. Sessions are sticky: a match lives on one server for its full duration. The deploy drains and replaces servers as their matches end, but to roll quickly the orchestrator also evicts servers whose matches are merely 'long-running.' Mid-rollout, players in long matches (ranked games over 30 minutes) are disconnected mid-game in a wave, ranked results are voided, and the matchmaking queue spikes as those players re-queue. Dashboards show that per-server match count drains to zero normally on most servers, but a cohort of servers hosting long matches was force-evicted at the drain deadline. Disconnect events cluster on those evictions; CPU and error rate are otherwise normal. Triage, explain, and prevent.
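The drain behaviour described in the question can be modelled in a few lines. This is a minimal sketch, not the orchestrator's real logic: the names `Server`, `Action`, `drain_action`, and `force_evict`, and the 20-minute drain deadline, are all assumptions made for illustration. It shows why an empty server is replaced cleanly while a server still hosting ranked matches at the deadline is killed, and how the same check behaves if force-eviction is switched off.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    WAIT = "wait"        # keep draining; no new matches are placed here
    REPLACE = "replace"  # server is empty, safe to swap the binary
    EVICT = "evict"      # kill the server while matches are still running


@dataclass
class Server:
    name: str
    active_matches: int
    drain_started_s: float  # when the server stopped accepting new matches


def drain_action(server: Server, now_s: float, drain_deadline_s: float,
                 force_evict: bool = True) -> Action:
    """Decide what a (hypothetical) orchestrator does with one draining server."""
    if server.active_matches == 0:
        return Action.REPLACE
    past_deadline = now_s - server.drain_started_s >= drain_deadline_s
    if past_deadline and force_evict:
        return Action.EVICT  # the branch that disconnects players mid-game
    return Action.WAIT


# Hypothetical numbers: a 20-minute drain deadline, evaluated 20 minutes in.
deadline = 20 * 60
short = Server("gs-1", active_matches=0, drain_started_s=0)
ranked = Server("gs-2", active_matches=3, drain_started_s=0)

print(drain_action(short, now_s=deadline, drain_deadline_s=deadline))
print(drain_action(ranked, now_s=deadline, drain_deadline_s=deadline))
print(drain_action(ranked, now_s=deadline, drain_deadline_s=deadline,
                   force_evict=False))
```

Under these assumptions, any match that outlasts the drain deadline is guaranteed to hit the eviction branch, which matches the cohort of long ranked games seen on the dashboards.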

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
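One way to turn "hypotheses from real signals" into something checkable is to measure how many disconnects line up with a force-eviction on the same server. The sketch below is illustrative only: the `Event` type, the `explained_by_eviction` helper, the 30-second window, and the sample events are all made up for the example and do not come from any real telemetry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    server: str
    ts: float  # seconds since rollout start


def explained_by_eviction(disconnects, evictions, window_s=30.0):
    """Fraction of disconnects within window_s of an eviction on the same server."""
    evicted_at = {}
    for e in evictions:
        evicted_at.setdefault(e.server, []).append(e.ts)
    if not disconnects:
        return 0.0
    hits = sum(
        1 for d in disconnects
        if any(abs(d.ts - t) <= window_s for t in evicted_at.get(d.server, []))
    )
    return hits / len(disconnects)


# Made-up sample data: two evictions at the drain deadline, four disconnects.
evictions = [Event("gs-2", 1200.0), Event("gs-5", 1200.0)]
disconnects = [
    Event("gs-2", 1201.0), Event("gs-2", 1202.5),
    Event("gs-5", 1203.0),
    Event("gs-9", 400.0),  # unrelated background disconnect
]
ratio = explained_by_eviction(disconnects, evictions)
print(f"{ratio:.0%} of disconnects fall within 30 s of a force-eviction")
```

A high ratio supports the eviction policy as the root cause and the queue spike as a downstream symptom; a low ratio would mean the hypothesis needs revisiting before committing to a fix.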

Diagram & narrate the incident