Let the transactional database do what it's good at, and quietly ship its changes to a warehouse built for heavy reporting.
Your production database (OLTP — online transaction processing) stores data row by row, which is perfect for the small point reads and writes an app makes all day: fetch one order, update one user. But an analyst's question — "total revenue per country across ten million orders" — has to scan whole columns, and that scan would crawl on a row store while also stealing CPU from live customers.
A column store (OLAP — online analytical processing) lays the same data out by column, so a SUM or GROUP BY touches only the bytes it needs and flies. The trick is keeping the two in sync: change data capture (CDC) tails the OLTP write-ahead log, turns every insert, update, and delete into a change event, and streams that feed into the warehouse — so analysts query a fresh-enough copy without ever touching production.
The OLTP database already writes every change to a durable write-ahead log before it touches the table. CDC simply reads that log from a replication slot, so it captures committed changes in order without adding query load to the source. Each change becomes a small event — operation, table, and the affected row — that is streamed and then applied into the warehouse, which physically lays the data out by column.
# 1. Tail the OLTP write-ahead log via a logical replication slot.
# No table scans on the primary — just read committed changes in order.
for change in wal_stream(slot="warehouse_sync"):
# 2. Emit a structured change event for each row touched.
event = {
"op": change.op, # "insert" | "update" | "delete"
"table": change.table, # e.g. "orders"
"pk": change.pk, # primary key, so updates/deletes find their row
"row": change.new_row, # full row image (None for a delete)
"lsn": change.lsn, # log position = ordering + dedup key
}
feed.publish(event) # at-least-once stream; consumer dedups by lsn
# 3. In the warehouse loader: batch events, then apply.
# Row store packs a record's fields together; the column store re-lays
# them out one column at a time, so scans/aggregates read only what they need.
for batch in feed.consume(batch_size=5000):
warehouse.merge(batch) # insert / update-in-place / tombstone deletes
# Now: SELECT country, SUM(total) FROM orders GROUP BY country -> fast.
| Workload | Row store (OLTP) | Column store (OLAP) |
|---|---|---|
| Point read of one row | Fast — fields sit together | Slow — must stitch columns |
| Bulk scan / aggregate | Slow — reads whole rows | Fast — reads one column |
| Single-row write / update | Cheap and immediate | Costly — append + compact |
| Compression | Modest | Strong — similar values adjacent |
And on the pipeline itself: a nightly batch ETL is simple and puts no continuous load on the source, but dashboards can be a day stale. Streaming CDC keeps freshness to seconds and reads the log instead of querying tables, at the cost of more moving parts to operate.
An orders table on Postgres takes three new orders and one update over a couple of seconds: insert order 101 (US), insert 102 (CA), insert 103 (US), then an update marking 102 as refunded. Each commit lands in the write-ahead log; CDC reads them in order and publishes four events onto the change feed.
The warehouse loader is batching, so it applies them about 5 seconds behind — that's the replication lag. Once they land, the analyst runs SELECT country, SUM(total) FROM orders GROUP BY country. On the column store it reads only the country and total columns and returns in milliseconds. Run that same scan on the OLTP primary and it would read every full row and contend with live checkout traffic — exactly the contention the replication was built to avoid. The dashboard ends up ~5 seconds stale, and for revenue reporting that's a fine trade.
Why stream changes through CDC into a separate warehouse instead of just running the analyst's heavy query against the production database?
A dashboard shows a total that's slightly behind what an admin just changed in the app. Is something broken?