Question
Your image-processing pipeline writes a derived thumbnail to an object store and immediately enqueues a job whose worker reads that object back for a second transform. On-call is paged: ~0.5% of jobs either fail with 'object not found' or read a stale earlier version, even though the write 'succeeded'. Dashboards show: write success rate is 100%, with no errors on the write side; the read-back errors correlate with jobs where the read happens <200 ms after the write; the failures cluster on objects served from one specific replica region; and a recent migration moved this bucket from a strongly consistent store to a different object store (or to a cross-region replicated bucket). How do you triage and fix these read-after-write consistency misses?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
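As a starting point for the mitigation step, here is a minimal sketch of one possible stopgap, not a definitive fix: the producer puts the version identifier returned by the write (for example an ETag or version ID, if the store exposes one) into the job payload, and the worker retries the read with bounded backoff until it sees that exact version. Everything named here is hypothetical and for illustration only: `read_expected_version`, `VersionNotVisible`, the `fetch` callable, and the toy `LaggingReplica` that stands in for the lagging region. No real object-store client is assumed.

```python
import time
from typing import Callable, Optional, Tuple


class VersionNotVisible(Exception):
    """Raised when the expected object version never becomes readable."""


def read_expected_version(
    fetch: Callable[[str], Optional[Tuple[str, bytes]]],
    key: str,
    expected_version: str,
    attempts: int = 5,
    base_delay_s: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Read `key`, retrying until the version the producer wrote is visible.

    `fetch` is a hypothetical adapter around the real client: it returns
    (version, body), or None when the object is not found.
    """
    for attempt in range(attempts):
        result = fetch(key)
        # Accept only the exact version named in the job payload, so a stale
        # earlier version is treated the same way as 'object not found'.
        if result is not None and result[0] == expected_version:
            return result[1]
        if attempt < attempts - 1:
            sleep(base_delay_s * (2 ** attempt))  # exponential backoff
    raise VersionNotVisible(
        f"{key}: version {expected_version} not visible after {attempts} reads"
    )


class LaggingReplica:
    """Toy stand-in for a replica that answers with old state at first."""

    def __init__(self, stale, fresh, stale_reads: int):
        self.stale = stale            # what early reads return (None = not found)
        self.fresh = fresh            # what reads return once replication catches up
        self.stale_reads = stale_reads
        self.calls = 0

    def fetch(self, key: str):
        self.calls += 1
        return self.stale if self.calls <= self.stale_reads else self.fresh


if __name__ == "__main__":
    replica = LaggingReplica(stale=None, fresh=("v2", b"thumb"), stale_reads=2)
    body = read_expected_version(replica.fetch, "thumbs/a.png", "v2")
    print(len(body), "bytes read after", replica.calls, "attempts")
```

The retry is bounded on purpose: if the version never appears, the worker raises `VersionNotVisible` rather than silently processing stale data, which keeps the symptom visible on dashboards while the root cause (the consistency model of the new store or the replication path) is investigated separately.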