Question
A Pub/Sub *pull* subscription feeds a fleet of worker pods (on GKE) that do CPU-heavy ML inference. At 16:00 `num_undelivered_messages` climbs from ~5k to 400k over 30 minutes and oldest-unacked-age grows. The worker pods are pinned at ~100% CPU, but the pod count is *not* increasing. The HPA (horizontal pod autoscaler) is configured to scale on CPU with a max replica count that's already reached. `ack_message_count` is steady (workers are acking what they process, just not enough). Recent context: an upstream feature launch roughly doubled inference request volume today. Triage and mitigate.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.