Question
Your leaderboard service starts returning elevated errors and the SDK logs fill with `ProvisionedThroughputExceededException` from DynamoDB, even though the table is on on-demand capacity and CloudWatch shows consumed capacity well under the account peak. The table's `ThrottledRequests` metric is spiking, and the throttling is concentrated on a single partition key. A new feature shipped this morning that writes every score update under a single `gameId` for the current tournament. Client p99 is climbing, and your retry storm is inflating request and throttle counts beyond real demand. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
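One way to make the "mitigate" and "prevent a repeat" steps concrete is sketched below, assuming the hot key is the tournament's `gameId`: spread writes across several suffixed partition keys (write sharding), and retry throttled writes with capped exponential backoff plus jitter so retries do not arrive in lockstep. The function names, the `gameId#shard` key format, the shard count of 16, and the delay values are all hypothetical choices for illustration, not part of the original service or of any DynamoDB API.

```python
import hashlib
import random

# Assumption: 16 shards is an illustrative value; size it from the observed write rate.
SHARD_COUNT = 16


def sharded_game_key(game_id: str, player_id: str, shard_count: int = SHARD_COUNT) -> str:
    """Build a partition key that spreads one game's writes over several keys.

    The shard is derived from a stable hash of the player ID, so the same
    player always lands on the same shard for a given game.
    """
    digest = hashlib.sha256(player_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % shard_count
    return f"{game_id}#{shard}"


def all_shard_keys(game_id: str, shard_count: int = SHARD_COUNT) -> list[str]:
    """List every partition key a reader must query and merge for one game."""
    return [f"{game_id}#{shard}" for shard in range(shard_count)]


def backoff_delay(attempt: int, base: float = 0.05, cap: float = 2.0,
                  rng=random.random) -> float:
    """Return a 'full jitter' delay in seconds for a zero-based retry attempt.

    The ceiling doubles with each attempt up to `cap`; the actual delay is a
    random fraction of that ceiling, which decorrelates retrying clients.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


if __name__ == "__main__":
    print(sharded_game_key("tournament-2024", "player-123"))
    print([round(backoff_delay(attempt), 3) for attempt in range(5)])
```

The trade-off is on the read side: building the leaderboard now means querying every key from `all_shard_keys` and merging the results, so the shard count should stay as small as the write rate allows.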