Code Room
On-call | Hard | oc-g096
Subject: Scaling limits
Level: Senior–Staff
Time: ~45 min
Common in: Distributed systems interviews
Industries: Technology

Question

Your order service scaled the stateless app tier from 20 to 60 instances to handle holiday load, but write latency got worse, not better. The single primary Postgres is at 95% CPU with high WAL write rate and replication lag climbing on the replicas; read replicas are nearly idle. Connection count on the primary is near max. Adding more app instances now makes it strictly worse. There's no slow query — individual writes are fast, there are just too many. How do you triage, and what's the real ceiling here?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
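
One signal worth quantifying early is connection fan-out: each stateless instance typically holds its own pool, so the connection count on the primary grows linearly with instance count. The sketch below is a back-of-the-envelope illustration, not a diagnosis of this incident. The server limit, reserved headroom, and pool size are hypothetical numbers chosen only so that 60 instances land near the limit, as in the scenario; the function names are likewise made up for the example.

```python
"""Back-of-the-envelope connection fan-out for a stateless tier on one primary."""


def total_connections(instances: int, pool_size: int) -> int:
    """Worst-case connections the app tier can open against the primary."""
    return instances * pool_size


def pool_size_for_budget(budget: int, reserved: int, instances: int) -> int:
    """Largest per-instance pool that keeps the whole tier within `budget`.

    `reserved` is held back for replication, monitoring, and admin sessions.
    Returns 0 when the budget cannot give every instance one connection,
    which is the point where a shared pooler becomes necessary.
    """
    usable = budget - reserved
    if usable < instances:
        return 0
    return usable // instances


# Hypothetical values, assumed for illustration only (not from the incident).
MAX_CONNECTIONS = 500  # assumed server-side connection limit
RESERVED = 20          # assumed headroom for replication/admin sessions
POOL_SIZE = 8          # assumed per-instance pool size

if __name__ == "__main__":
    for instances in (20, 60):
        used = total_connections(instances, POOL_SIZE)
        share = used / MAX_CONNECTIONS
        print(f"{instances} instances -> {used} connections ({share:.0%} of max)")

    # What each instance could hold if the tier were capped at 100 connections.
    print(pool_size_for_budget(100, RESERVED, 60))
```

Under these assumed numbers, tripling the app tier triples the connections without adding any write capacity, which is consistent with the symptom that more instances make things worse. Capping total connections (per-instance pool limits or a shared pooler) is a mitigation; the single primary's write throughput remains the ceiling.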
