Code Room
System design | Medium | sd-g369
Subject: Embeddings | Level: Mid–Senior | ~40 min | Common in ML systems interviews | Industries: Technology, Software development

Question

Design a near-duplicate detection system for a media platform that ingests 3M images and videos a day. The system must flag content that is a re-upload or trivial edit (crop, re-encode, watermark, slight color shift) of existing content. The goal is copyright enforcement and deduplication, so exact-hash matching is not enough: a single re-encode defeats it. Walk through the embedding/fingerprinting approach, how you search 2B existing fingerprints fast enough to check each upload at ingest, and how you set the match threshold without drowning in false positives.

What a strong answer looks like

Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
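
As one way to start the "clarify scale" step, the figures in the question can be turned into rough numbers. This is only a back-of-envelope sketch: the upload rate and index size come from the question, while the fingerprint formats (a 64-bit hash, a 128-dimensional float32 embedding), the peak factor, and the one-fingerprint-per-item simplification are assumptions made here for illustration. Videos in particular would likely produce several fingerprints each.

```python
# Back-of-envelope sizing for the near-duplicate detection question.
UPLOADS_PER_DAY = 3_000_000             # from the question
EXISTING_FINGERPRINTS = 2_000_000_000   # from the question

# Assumptions (NOT from the question), chosen only to make the arithmetic concrete.
HASH_BYTES = 8                 # assumed 64-bit perceptual hash per item
EMBED_DIM = 128                # assumed embedding dimension
EMBED_BYTES = EMBED_DIM * 4    # float32 components
PEAK_FACTOR = 3                # assumed ratio of peak to average traffic


def sizing():
    """Return rough query-rate and storage estimates as a dict."""
    avg_qps = UPLOADS_PER_DAY / 86_400  # seconds per day
    return {
        "avg_qps": avg_qps,
        "peak_qps": avg_qps * PEAK_FACTOR,
        "hash_index_gb": EXISTING_FINGERPRINTS * HASH_BYTES / 1e9,
        "embedding_index_gb": EXISTING_FINGERPRINTS * EMBED_BYTES / 1e9,
        "daily_embedding_growth_gb": UPLOADS_PER_DAY * EMBED_BYTES / 1e9,
    }


if __name__ == "__main__":
    for name, value in sizing().items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions the average lookup rate is modest (roughly 35 per second), while the raw embedding index is on the order of a terabyte. That suggests the hard part is index memory and search cost over 2B entries rather than query throughput, which is a useful framing before choosing a compression or sharding strategy.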
