Database Deduplication

Finding the most recent event for every user in a log table.

The idea

Many systems use "append-only" tables. When a user updates their profile, you don't overwrite the old row; you insert a new row with a newer timestamp. This creates an audit log. But when you need to query "the current state of all users," you have a problem. You have duplicates. You need to group by user, but then extract ONLY the row with the maximum timestamp from each group. A standard GROUP BY struggles with this because it crushes the data down.

Step 1: The append-only table has 4 rows. Alice has two entries. We want only the latest.

How it works (Window Functions)

The modern, highly-performant way to solve this is using Window Functions (specifically ROW_NUMBER() OVER (...)). This allows the database to partition the data into logical groups, sort each group internally, and assign a rank (1, 2, 3...) to each row, without crushing the rows together like GROUP BY does.

-- Find the latest status for every user
WITH RankedLogs AS (
    SELECT 
        user_id, 
        status, 
        timestamp,
        -- Group by user, sort newest to oldest, assign a row number
        ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY timestamp DESC) as rn
    FROM user_logs
)
-- Filter to only grab the #1 newest row per user!
SELECT user_id, status, timestamp
FROM RankedLogs
WHERE rn = 1;

Cost

Window functions require sorting. If you have 10 million rows, ORDER BY timestamp DESC is expensive. To make this query lightning fast, you must have a composite index on exactly the columns you are partitioning and sorting by: CREATE INDEX idx_user_time ON user_logs(user_id, timestamp DESC);

Watch out for