Question
Design the scheduler for a transcoding compute farm that runs millions of encode jobs/day across a mix of cheap interruptible spot instances and a small reserved on-demand fleet. Jobs have wildly different priorities and SLAs: a creator's just-uploaded short must transcode in <60s (interactive), a backfill re-encode of the back catalog can take hours, and a few live-to-VOD jobs are deadline-critical. You want to minimize cost (favor spot) while still hitting the interactive SLA and surviving spot reclamations that can yank 30% of capacity with a 2-minute warning.
Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.