Question
A `search-indexer` consumes a Pub/Sub pull subscription and bulk-writes documents into an Elasticsearch cluster. At 16:30 `num_undelivered_messages` climbs from ~2k to 500k and oldest-unacked-age rises. Dashboards: worker CPU is LOW; the workers are spending most time blocked on Elasticsearch bulk calls that now return HTTP 429 (`es_rejected_execution_exception`) with rising frequency — ES write thread pools/queues are saturated. The workers retry the whole bulk request on 429, immediately. No deploy on the consumer side; ES is at high indexing load from a separate reindex job. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.