On-call · Medium · oc-g391

Subject: Backlog buildup · Level: Mid–Senior · ~35 min · Common in: Distributed systems interviews · Industries: Technology

Question

A `search-indexer` consumes a Pub/Sub pull subscription and bulk-writes documents into an Elasticsearch cluster. At 16:30, `num_undelivered_messages` climbs from ~2k to 500k and the oldest-unacked-message age rises. Dashboards show low worker CPU: the workers spend most of their time blocked on Elasticsearch bulk calls, which now return HTTP 429 (`es_rejected_execution_exception`) with rising frequency because the ES write thread pools and queues are saturated. On a 429, the workers immediately retry the whole bulk request. There was no deploy on the consumer side; ES is under high indexing load from a separate reindex job. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
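As one illustration of a consumer-side mitigation, the sketch below replaces the "retry the whole bulk request immediately" behavior from the question with two changes: resend only the documents that were rejected, and wait with exponential backoff plus full jitter between attempts. It is a minimal sketch under assumptions, not the indexer's actual code: `send_bulk` is a hypothetical callable standing in for the real Elasticsearch bulk call and is assumed to return one HTTP-style status per document, and `bulk_with_backoff`, `backoff_delay`, and all default values are illustrative names and numbers.

```python
import random
import time
from typing import Callable, List, Sequence


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def bulk_with_backoff(
    docs: Sequence,
    send_bulk: Callable[[List], List[int]],  # hypothetical: one status code per doc
    max_attempts: int = 6,
    sleep: Callable[[float], None] = time.sleep,
) -> List:
    """Send docs in bulk, retrying only the 429-rejected ones with backoff.

    Returns the docs that are still rejected after max_attempts, so the caller
    can leave the corresponding messages unacked for later redelivery.
    """
    pending = list(docs)
    for attempt in range(max_attempts):
        if not pending:
            break
        statuses = send_bulk(pending)
        # Keep only rejected items; handling of other error codes is out of
        # scope for this sketch.
        pending = [doc for doc, status in zip(pending, statuses) if status == 429]
        if pending and attempt < max_attempts - 1:
            # Back off before the next attempt instead of retrying immediately.
            sleep(backoff_delay(attempt))
    return pending
```

The jitter spreads retries from many workers over time so they do not hit the saturated write queues in lockstep, and shrinking the retried batch avoids re-indexing documents that already succeeded. This only reduces pressure from the consumer; the competing reindex job still has to be throttled or paused to address the load itself.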
