Question
A workflow service runs multi-step business processes (e.g. 'process refund': validate → call payment provider → update ledger → notify user). Steps run on a pool of stateless workers. Two hazards have occurred: (1) two workers picked up the same workflow instance and both called the payment provider, issuing a double refund; (2) a worker that held a workflow crashed mid-step, and the workflow stalled forever because no one else picked it up. Design the coordination so each workflow instance is processed by at most one worker at a time AND a crashed worker's workflow is reliably resumed — while making the externally-visible side effect (the refund) happen at-most-once even if the lease changes hands mid-step.
Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
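As a starting point for the hard parts, the two requirements can be illustrated with a minimal in-memory sketch. It assumes three things that the question does not specify: a lease row per workflow with an expiry and a monotonically increasing fencing token, a payment provider that deduplicates on a caller-supplied idempotency key, and a ledger that rejects writes carrying a stale token. All names here (`LeaseStore`, `PaymentProvider`, `Ledger`, `demo`) are hypothetical. A real design would back the lease with a database compare-and-set or a coordination service.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class LeaseStore:
    """In-memory stand-in for a table of (owner, expires_at, fencing_token) rows."""
    ttl: float = 30.0
    rows: Dict[str, Tuple[str, float, int]] = field(default_factory=dict)

    def acquire(self, workflow_id: str, worker: str, now: float) -> Optional[int]:
        owner, expires_at, token = self.rows.get(workflow_id, ("", 0.0, 0))
        if owner and now < expires_at and owner != worker:
            return None  # another worker holds a live lease
        token += 1  # every acquisition bumps the fencing token
        self.rows[workflow_id] = (worker, now + self.ttl, token)
        return token

    def is_current(self, workflow_id: str, token: int) -> bool:
        return self.rows.get(workflow_id, ("", 0.0, 0))[2] == token


@dataclass
class PaymentProvider:
    """Assumed provider behaviour: a repeated idempotency key is a no-op."""
    refunds: Dict[str, int] = field(default_factory=dict)
    calls: int = 0

    def refund(self, idempotency_key: str, amount_cents: int) -> None:
        self.calls += 1
        if idempotency_key not in self.refunds:
            self.refunds[idempotency_key] = amount_cents


@dataclass
class Ledger:
    """Internal write path that fences out workers holding a stale token."""
    leases: LeaseStore
    entries: Dict[str, int] = field(default_factory=dict)

    def record_refund(self, workflow_id: str, token: int, amount_cents: int) -> bool:
        if not self.leases.is_current(workflow_id, token):
            return False  # lease changed hands; reject the stale writer
        self.entries.setdefault(workflow_id, amount_cents)
        return True


def demo():
    leases = LeaseStore(ttl=30.0)
    provider = PaymentProvider()
    ledger = Ledger(leases)
    key = "wf-1:refund"  # derived from workflow id + step, never from the worker

    token_a = leases.acquire("wf-1", "worker-a", now=0.0)
    blocked = leases.acquire("wf-1", "worker-b", now=10.0)  # lease still live
    provider.refund(key, 500)  # worker-a calls the provider, then stalls

    token_b = leases.acquire("wf-1", "worker-b", now=31.0)  # lease expired
    provider.refund(key, 500)  # worker-b retries the same step

    stale_ok = ledger.record_refund("wf-1", token_a, 500)
    fresh_ok = ledger.record_refund("wf-1", token_b, 500)
    return token_a, blocked, token_b, stale_ok, fresh_ok, provider.calls, len(provider.refunds)
```

In this sketch the lease expiry lets a second worker resume a stalled workflow, the fencing token keeps the stalled worker from writing to the ledger afterwards, and the idempotency key keeps the refund to a single effect even though the provider was called twice. The trade-off is that at-most-once for the external side effect rests on the provider honouring idempotency keys. If it does not, the lease alone cannot prevent a double refund when ownership changes mid-call.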