Question
Design an online-learning system that uses a contextual bandit to choose which of ~30 promotional offers to show each user at app-open, optimizing 7-day retained conversion. You serve ~5M decisions/day with a 40 ms p99 latency budget. Reward is delayed (you only know the 7-day outcome a week later) and partial (you only observe reward for the arm you actually showed). The offer catalog changes weekly, and you must avoid getting stuck always showing the currently-best offer (you need principled exploration), while not torching revenue with too much random exploration.
Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
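As a starting point for the scale clarification, the stated numbers can be turned into rough capacity figures. This is a back-of-envelope sketch, not a sizing recommendation. The peak-to-average factor of 3 is an assumption that does not appear in the prompt, and the function and key names are hypothetical.

```python
SECONDS_PER_DAY = 86_400

# Figures taken from the prompt.
DECISIONS_PER_DAY = 5_000_000
REWARD_DELAY_DAYS = 7
NUM_OFFERS = 30

# Assumption (NOT in the prompt): peak traffic is about 3x the daily average.
ASSUMED_PEAK_FACTOR = 3.0


def scale_estimates(
    decisions_per_day: int = DECISIONS_PER_DAY,
    reward_delay_days: int = REWARD_DELAY_DAYS,
    num_offers: int = NUM_OFFERS,
    peak_factor: float = ASSUMED_PEAK_FACTOR,
) -> dict:
    """Return rough capacity numbers derived from the stated constraints."""
    avg_qps = decisions_per_day / SECONDS_PER_DAY
    return {
        # Average and (assumed) peak decision rate at the serving tier.
        "avg_qps": avg_qps,
        "peak_qps": avg_qps * peak_factor,
        # Decisions logged but still waiting for their 7-day outcome.
        "pending_reward_decisions": decisions_per_day * reward_delay_days,
        # Daily samples per offer if every decision were uniformly random;
        # this is the ceiling on exploration volume, not a target.
        "uniform_samples_per_offer_per_day": decisions_per_day / num_offers,
    }


if __name__ == "__main__":
    for name, value in scale_estimates().items():
        print(f"{name}: {value:,.1f}")
```

The average rate works out to roughly 58 decisions per second, so raw throughput is modest. The harder constraints are the 40 ms tail latency and the roughly 35M logged decisions that are awaiting a reward label at any given time.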