Question
At 14:10 a rolling deploy of the orders service begins (8 of 24 pods on the new version). Within two minutes, error rate climbs to ~12% with a tight cluster of `NULL value in column "fulfillment_channel" violates not-null constraint` on INSERT. The migration that ran in the deploy's pre-step added the column `fulfillment_channel TEXT NOT NULL` to `orders`. The new pods write the column; the old pods (still 16 of them) do not. The DB CPU is normal, replication lag is 0, latency p99 unchanged. The deploy is configured `maxUnavailable=0, maxSurge=25%`, so old and new pods serve concurrently for the full rollout window. How do you triage and mitigate, and what's the durable fix?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.