Code Room
Practice the whole technical round. Coding, system design, code review, and on-call — drilled by level, role, and subject, run against tests in the browser, and graded by a coach, not a judge.
Built for: Computer games · IT services · Software development · Technology · Telecom
Coding: Write, debug, and run code against tests, narrated.
System design: Design an architecture, clarify, scale, trade off.
Code review: Read a diff and find the bug, the smell, the risk.
On-call: Diagnose an incident, mitigate, write the postmortem.
Vibe coding: Direct an AI to build it, then catch what it got wrong.
Filters
Interview domain
Reliability & on-call · Distributed systems · Networking & APIs · Databases & SQL · Concurrency · Code quality & review · Algorithms & data structures · Security · Storage & CDN · ML systems
Subject
Incident response · Database incidents · Capacity · Observability · Autoscaling failure · Schema migration · Upstream timeout · Capacity incidents · Quota exhaustion · Traffic surge
On-call Medium
You're paged at 2am: the main API's p99 latency jumped from 80ms to 4s and the error rate is climbing past 5%. Walk through your first 15 minutes.
oc-p001 Incident response · Mid–Senior · ~20 min · Reliability & on-call
On-call Medium
The primary database is pinned at 100% CPU and queries are timing out, taking the app down with it. You could hunt for the offending query, or act now. What do you do?
oc-p002 Mitigation vs. root cause · Mid–Senior · ~20 min · Reliability & on-call
On-call Medium
A full outage is in progress and customers are complaining publicly. You're the incident commander. How do you handle communication?
oc-p003 Outage comms · Mid–Senior · ~20 min · Reliability & on-call
On-call Medium
A routine deploy took down checkout for 40 minutes. Outline a blameless postmortem and explain what makes it useful.
oc-p004 Postmortem · Senior–Staff · ~20 min · Reliability & on-call
On-call Medium
An alert fires for elevated errors, but every dashboard you open looks normal. How do you tell whether it's a real incident or a false alarm?
oc-p005 Observability · Mid–Senior · ~20 min · Reliability & on-call
On-call Medium
You're paged for an alert you've never seen, and there's no runbook. How do you proceed, and what do you do afterward?
oc-p006 Runbooks · Mid–Senior · ~20 min · Reliability & on-call
On-call Medium
The primary database disk is 98% full and climbing; writes are starting to fail. You have minutes. What now?
oc-p007 Incident response · Mid–Senior · ~20 min · Reliability & on-call
On-call Medium
A service is OOM-killing and restarting roughly every 30 minutes, dropping requests on each restart. Walk through diagnosis and mitigation.
oc-p008 Incident response · Mid–Senior · ~20 min · Reliability & on-call
On-call Hard
A downstream dependency slowed down, and now your whole service is failing as requests pile up and retries multiply — a cascading failure. How do you stop it?
oc-p009 Incident response · Senior–Staff · ~20 min · Reliability & on-call
On-call Medium
Errors spiked right after a deploy 10 minutes ago. Do you roll back or roll forward, and how do you decide?
oc-p010 Mitigation vs. root cause · Mid–Senior · ~20 min · Reliability & on-call
On-call Easy
The whole site is down with TLS errors and users see certificate warnings. What's your first hypothesis and response?
oc-p011 Incident response · Entry–Mid · ~20 min · Reliability & on-call
On-call Hard
Your Redis cache cluster just went down, and the database — now taking traffic it used to absorb — is about to fall over too. What do you do?
oc-p012 Incident response · Senior–Staff · ~20 min · Reliability & on-call
On-call Medium
Consumer lag on your main message queue is growing steadily — processing is hours behind production. How do you respond?
oc-p013 Incident response · Mid–Senior · ~20 min · Reliability & on-call
On-call Hard
A migration that ran an hour ago wrote incorrect values into a column for a subset of rows, and the app has served wrong data since. How do you handle it?
oc-p014 Postmortem judgment · Senior–Staff · ~20 min · Reliability & on-call
On-call Hard
During your on-call, monitoring flags unusual access patterns and you suspect a credential may be leaked. What's your response?
oc-p015 Incident response · Senior–Staff · ~20 min · Reliability & on-call
On-call Medium
Users in one region suddenly can't reach the service while other regions are fine. Recent changes include a DNS update. Where do you look?
oc-p016 Incident response · Mid–Senior · ~20 min · Reliability & on-call
On-call Medium
Your payment provider (a critical third party) is clearly down on their side, returning errors. You can't fix their system. What do you do?
oc-p017 Incident response · Mid–Senior · ~20 min · Reliability & on-call
On-call Medium
You're 30 minutes into an incident, unsure of impact or cause, and feeling stuck. What's the right move?
oc-p018 Incident response · Mid–Senior · ~20 min · Reliability & on-call
On-call Medium
An alert keeps flapping (firing and resolving) every few minutes all night and is never actionable. How do you handle it tonight and longer-term?
oc-p019 Observability · Mid–Senior · ~20 min · Reliability & on-call
On-call Medium
Error rates jumped but there was no code deploy — though someone mentions a config / feature-flag change went out around that time. How do you proceed?
oc-p020 Incident response · Mid–Senior · ~20 min · Reliability & on-call
On-call Hard
A marketing push (or viral moment) is driving 10x normal traffic and the system is saturating. You can't instantly add unlimited capacity. How do you keep it up?
oc-p021 Capacity · Senior–Staff · ~20 min · Reliability & on-call
On-call Medium
Your app is throwing 'too many connections' / pool-exhausted errors under normal traffic — connection counts climbed all day until they maxed out. Diagnose and mitigate.
oc-p022 Incident response · Senior · ~20 min · Reliability & on-call
On-call Medium
The same incident has now recurred three times despite a 'fix' after each one. As on-call and reviewer, how do you break the cycle?
oc-p023 Postmortem judgment · Senior–Staff · ~20 min · Reliability & on-call
On-call Hard
After a database failover, some users report stale or inconsistent data and writes seem to be conflicting. You suspect replication lag or split-brain. What's your response?
oc-p024 Incident response · Senior–Staff · ~20 min · Reliability & on-call
On-call Medium
A junior engineer ran a manual command in production that caused an outage, and they're shaken. As the senior on-call and reviewer, how do you handle the incident and the…
oc-p025 Blameless culture · Mid–Senior · ~20 min · Reliability & on-call
On-call Hard
A Kafka (or similar) broker node went down, and you're seeing under-replicated partitions, consumer rebalancing, and some produce failures. How do you respond?
oc-p026 Incident response · Senior · ~20 min · Reliability & on-call
On-call Hard
Latency is spiking in a sawtooth pattern — periodic multi-second stalls every minute or so — on an otherwise healthy service. You suspect garbage-collection pauses. How do you…
oc-p027 Incident response · Senior · ~20 min · Reliability & on-call
On-call Hard
Authentication is failing intermittently across services — tokens 'expire' immediately or signatures fail to validate — and nothing was deployed. What's a likely cause and…
oc-p028 Incident response · Senior–Staff · ~20 min · Reliability & on-call
On-call Medium
One service on shared infrastructure is suddenly slow, and you find another tenant/workload on the same hosts is consuming most of the CPU/IO — a noisy neighbor. What do you do?
oc-p029 Capacity · Mid–Senior · ~20 min · Reliability & on-call
On-call Hard
An internal dependency your service calls is degraded (slow, partial errors) but not fully down. You must decide whether to fail open (proceed without it) or fail closed (block).…
oc-p030 Mitigation vs. root cause · Senior · ~20 min · Reliability & on-call
On-call Hard
Traffic suddenly spiked 50x and it looks like malicious bot traffic, not real users — the system is buckling. How do you respond?
oc-p031 Capacity · Senior · ~20 min · Reliability & on-call
On-call Medium
A deploy is half rolled out — some servers run the new version, some the old — and users get inconsistent behavior or errors depending on which they hit. What do you do?
oc-p032 Incident response · Mid–Senior · ~20 min · Reliability & on-call
On-call Hard
No errors are firing, but support reports users seeing wrong values (prices or balances look off). You suspect a dependency started returning subtly incorrect data. How do you…
oc-p033 Incident response · Senior–Staff · ~20 min · Reliability & on-call
On-call Medium
Load is rising and the system is overloaded, but autoscaling isn't adding capacity (or it added instances that aren't taking traffic). How do you respond?
oc-p034 Capacity · Mid–Senior · ~20 min · Reliability & on-call
On-call Medium
Your CDN (or DNS) provider is having a major outage, and large parts of your site are unreachable for many users even though your origin is healthy. What can you do?
oc-p035 Incident response · Mid–Senior · ~20 min · Reliability & on-call
On-call Hard
After a scheduled key/certificate rotation, services started failing to authenticate to each other — mutual TLS or token validation is rejecting valid callers. How do you respond?
oc-p036 Incident response · Senior · ~20 min · Reliability & on-call
On-call Easy
Application servers are crashing, and you find their local disks are full — logs and temp files filled them up. How do you respond?
oc-p037 Incident response · Entry–Mid · ~20 min · Reliability & on-call
On-call Medium
A feature works correctly in one region but is broken in another, with no code difference. What's a likely cause and how do you proceed?
oc-p038 Incident response · Mid–Senior · ~20 min · Reliability & on-call
On-call Hard
The primary database is unhealthy and automatic failover hasn't triggered. You must decide whether and how to fail over manually. How do you reason about it?
oc-p039 Incident response · Senior–Staff · ~20 min · Reliability & on-call
On-call Hard
Users are being served incorrect or stale content, and you trace it to bad cache entries — some users even get another user's data. How do you respond?
oc-p040 Incident response · Senior · ~20 min · Reliability & on-call