Replicating OLTP into OLAP

Let the transactional database do what it's good at, and quietly ship its changes to a warehouse built for heavy reporting.

The idea

Your production database (OLTP — online transaction processing) stores data row by row, which is perfect for the small point reads and writes an app makes all day: fetch one order, update one user. But an analyst's question — "total revenue per country across ten million orders" — has to scan whole columns, and that scan would crawl on a row store while also stealing CPU from live customers.

A column store (OLAP — online analytical processing) lays the same data out by column, so a SUM or GROUP BY touches only the bytes it needs and flies. The trick is keeping the two in sync: change data capture (CDC) tails the OLTP write-ahead log, turns every insert, update, and delete into a change event, and streams that feed into the warehouse — so analysts query a fresh-enough copy without ever touching production.

Press play, or step through one change at a time.

How it works

The OLTP database already writes every change to a durable write-ahead log before it touches the table. CDC simply reads that log from a replication slot, so it captures committed changes in order without adding query load to the source. Each change becomes a small event — operation, table, and the affected row — that is streamed and then applied into the warehouse, which physically lays the data out by column.

# 1. Tail the OLTP write-ahead log via a logical replication slot.
#    No table scans on the primary — just read committed changes in order.
for change in wal_stream(slot="warehouse_sync"):

    # 2. Emit a structured change event for each row touched.
    event = {
        "op":    change.op,      # "insert" | "update" | "delete"
        "table": change.table,   # e.g. "orders"
        "pk":    change.pk,      # primary key, so updates/deletes find their row
        "row":   change.new_row, # full row image (None for a delete)
        "lsn":   change.lsn,     # log position = ordering + dedup key
    }
    feed.publish(event)          # at-least-once stream; consumer dedups by lsn

# 3. In the warehouse loader: batch events, then apply.
#    Row store packs a record's fields together; the column store re-lays
#    them out one column at a time, so scans/aggregates read only what they need.
for batch in feed.consume(batch_size=5000):
    warehouse.merge(batch)       # insert / update-in-place / tombstone deletes
    # Now: SELECT country, SUM(total) FROM orders GROUP BY country  -> fast.

Trade-offs

Workload	Row store (OLTP)	Column store (OLAP)
Point read of one row	Fast — fields sit together	Slow — must stitch columns
Bulk scan / aggregate	Slow — reads whole rows	Fast — reads one column
Single-row write / update	Cheap and immediate	Costly — append + compact
Compression	Modest	Strong — similar values adjacent

And on the pipeline itself: a nightly batch ETL is simple and puts no continuous load on the source, but dashboards can be a day stale. Streaming CDC keeps freshness to seconds and reads the log instead of querying tables, at the cost of more moving parts to operate.

Watch out for

Replication lag means stale dashboards. The warehouse is eventually consistent — a number on screen reflects the OLTP state from a few seconds (or minutes) ago, not "now." Surface the freshness so nobody mistakes it for live.
Schema drift breaks the pipeline. Add or rename a column on the source and the loader can silently drop fields or stall. Evolve the warehouse schema in lockstep.
Updates and deletes must propagate, not just inserts. If you only stream inserts, the warehouse fills with stale duplicates and resurrected rows. Carry the primary key so updates land in place and deletes leave tombstones.
The first load needs a backfill. CDC only sees changes from the moment the slot starts. Seed the warehouse with a consistent snapshot first, then switch to the live feed at the matching log position.
At-least-once delivery can duplicate events. Streams replay on failure, so dedup by log position (LSN) or use idempotent merges — otherwise a retried event double-counts.

Worked example

An orders table on Postgres takes three new orders and one update over a couple of seconds: insert order 101 (US), insert 102 (CA), insert 103 (US), then an update marking 102 as refunded. Each commit lands in the write-ahead log; CDC reads them in order and publishes four events onto the change feed.

The warehouse loader is batching, so it applies them about 5 seconds behind — that's the replication lag. Once they land, the analyst runs SELECT country, SUM(total) FROM orders GROUP BY country. On the column store it reads only the country and total columns and returns in milliseconds. Run that same scan on the OLTP primary and it would read every full row and contend with live checkout traffic — exactly the contention the replication was built to avoid. The dashboard ends up ~5 seconds stale, and for revenue reporting that's a fine trade.

Check yourself

Why stream changes through CDC into a separate warehouse instead of just running the analyst's heavy query against the production database?

A dashboard shows a total that's slightly behind what an admin just changed in the app. Is something broken?