Question
A Google Cloud Pub/Sub push subscription `notification-fanout` (message ordering enabled, ordering key `user_id`) fires an alert at 21:00: `subscription/num_undelivered_messages` has climbed from ~1k to 300k and is still rising, and `subscription/oldest_unacked_message_age` is at 12 minutes and rising. The push endpoint's logs show many handlers running ~25s each, while the subscription's `ackDeadlineSeconds` is 10. Pub/Sub metrics also show a high `subscription/expired_ack_deadlines_count`. Recent context: a downstream profile service that the handler calls got slower today (its p95 latency went from 200ms to 4s) after a noisy-neighbor incident on its database. Triage and mitigate.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
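
As a starting point for the "real signals" step, the mismatch between handler time (~25s) and `ackDeadlineSeconds` (10) can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names, the 2x headroom factor, and the sample durations are assumptions, not part of the scenario. The 10s and 600s bounds are Pub/Sub's allowed range for the ack deadline.

```python
import math

# Pub/Sub accepts ackDeadlineSeconds between 10 and 600 seconds.
MIN_ACK_DEADLINE_S = 10
MAX_ACK_DEADLINE_S = 600


def fraction_over_deadline(handler_seconds, ack_deadline_s):
    """Share of handler runs that outlast the ack deadline (hypothetical helper).

    Each such run counts as an expired deadline, so the message is redelivered.
    """
    if not handler_seconds:
        return 0.0
    over = sum(1 for s in handler_seconds if s > ack_deadline_s)
    return over / len(handler_seconds)


def suggested_ack_deadline(handler_seconds, headroom=2.0):
    """Slowest observed run times a safety factor, clamped to the allowed range.

    The 2x headroom is an assumed rule of thumb, not a Pub/Sub recommendation.
    """
    if not handler_seconds:
        return MIN_ACK_DEADLINE_S
    target = math.ceil(max(handler_seconds) * headroom)
    return max(MIN_ACK_DEADLINE_S, min(MAX_ACK_DEADLINE_S, target))


if __name__ == "__main__":
    # Illustrative durations in seconds, not real log data.
    samples = [25.0, 24.0, 26.0, 3.0]
    print(fraction_over_deadline(samples, ack_deadline_s=10))
    print(suggested_ack_deadline(samples))
```

A result like this would only support raising the ack deadline as a stopgap. It says nothing about why the handlers are slow, which is where the profile service latency comes in.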