Question
Design a data-loss-prevention (DLP) scanner for a large enterprise SaaS that inspects every outbound document, message, and API payload (200K events/s, files up to 1 GB) for sensitive data (PII, PCI card numbers, source code, credentials) and blocks or quarantines policy violations inline, with under 300 ms of added latency for interactive flows. Threat model: insiders exfiltrating data slowly, and accidental leaks via misconfigured shares. It must support content-aware policies (exact-data-match against a customer's own datasets) and minimize false positives that would break legitimate work. Cover the classification pipeline, the exact-match index, and how you handle large files and encryption.
Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.