Turning a static database into a real-time stream of events.
How do you keep a search index (like Elasticsearch) in sync with your primary database (like PostgreSQL)? You could update both from your application code, but the two writes are not atomic: if one succeeds and the other fails, the systems drift out of sync. Change Data Capture (CDC) solves this by reading the database's own transaction log. Every time a row is inserted, updated, or deleted, the CDC tool emits an event to a message queue, usually with only a short delay, for other systems to consume.
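The dual-write failure mode can be sketched in a few lines. The names here (`FlakyIndex`, `dual_write`) are hypothetical stand-ins rather than a real client library, and the "database" is just a dict. The point is only that a failure between the two writes leaves the stores disagreeing.

```python
class FlakyIndex:
    """Hypothetical in-memory search index that can be made to fail."""

    def __init__(self):
        self.docs = {}
        self.available = True

    def update_document(self, doc_id, fields):
        if not self.available:
            raise ConnectionError("search index unreachable")
        self.docs.setdefault(doc_id, {}).update(fields)


def dual_write(database, index, doc_id, fields):
    """Write to the database first, then to the index (no shared transaction)."""
    database.setdefault(doc_id, {}).update(fields)  # step 1 takes effect
    index.update_document(doc_id, fields)           # step 2 may raise


database = {5: {"status": "pending"}}
index = FlakyIndex()
index.docs[5] = {"status": "pending"}

index.available = False  # simulate an outage between the two writes
try:
    dual_write(database, index, 5, {"status": "active"})
except ConnectionError:
    pass  # the database write already happened and is not rolled back

print(database[5]["status"], index.docs[5]["status"])  # active pending
```

Retrying the second write in application code helps only until the process itself crashes between the two steps, which is the gap a log-based approach is meant to close.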
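Databases like Postgres use a Write-Ahead Log (WAL) to guarantee durability: every change is recorded in the log before it is applied to the data files. Tools like Debezium connect to the database as a logical replication client, much as a replica would, read the decoded WAL stream through a replication slot, translate each entry into a structured change event (commonly JSON or Avro), and publish it to Kafka.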
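As a rough illustration of what "connecting Debezium" involves, the sketch below builds the JSON body that would be sent to Kafka Connect's `POST /connectors` endpoint. The property keys are Debezium PostgreSQL connector settings as of the 2.x series, but the helper name and every value (host, database, user, slot name, prefix) are assumptions for the example, so check them against the documentation for the version you run.

```python
import json


def build_connector_payload(name, host, dbname, user, password, tables):
    """Build the JSON body for Kafka Connect's POST /connectors endpoint."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",       # Postgres built-in logical decoding plugin
            "database.hostname": host,
            "database.port": "5432",
            "database.user": user,
            "database.password": password,
            "database.dbname": dbname,
            "topic.prefix": "db",            # Debezium 2.x key; 1.x used database.server.name
            "table.include.list": ",".join(tables),
            "slot.name": "debezium_users",   # hypothetical replication slot name
        },
    }


# All argument values below are placeholders.
payload = build_connector_payload(
    "users-connector", "localhost", "appdb", "cdc_user", "change-me", ["public.users"]
)
print(json.dumps(payload, indent=2))
```

With this configuration, changes to `public.users` would be published to a topic named `db.public.users`, following Debezium's prefix.schema.table convention.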
# The conceptual flow of CDC (e.g., Debezium -> Kafka)
# 1. Application executes SQL
db.execute("UPDATE users SET status = 'active' WHERE id = 5")
# 2. Database writes to its internal WAL (Write-Ahead Log)
# WAL Entry: [LSN 1004] UPDATE table=users id=5 old_status=pending new_status=active
# 3. CDC Tool (Debezium) reads the WAL and pushes to Kafka
kafka.publish("db.users.changes", {
    "op": "u",  # Update
    "before": {"id": 5, "status": "pending"},
    "after": {"id": 5, "status": "active"}
})
# 4. Elasticsearch consumer reads Kafka and updates the index
search_index.update_document(5, {"status": "active"})
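Step 4 is shown above as a single call, but a real consumer has to handle every operation type and tolerate redelivered messages. The following is a minimal sketch that assumes a simplified event shape (just `op`, `before`, `after`, with `id` as the primary key) and uses a plain dict in place of a search index. `apply_change` is a hypothetical helper, not part of any client library.

```python
def apply_change(index, event):
    """Apply one Debezium-style change event to a dict-based index."""
    op = event["op"]
    if op in ("c", "r", "u"):          # create, snapshot read, update
        row = event["after"]
        index[row["id"]] = dict(row)   # full overwrite makes replays idempotent
    elif op == "d":                    # delete: only "before" is populated
        index.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")


index = {5: {"id": 5, "status": "pending"}}
events = [
    {"op": "u", "before": {"id": 5, "status": "pending"},
     "after": {"id": 5, "status": "active"}},
    {"op": "c", "before": None, "after": {"id": 6, "status": "pending"}},
    {"op": "d", "before": {"id": 5, "status": "active"}, "after": None},
]
for event in events:
    apply_change(index, event)
    apply_change(index, event)  # duplicate delivery, as with at-least-once consumers

print(index)  # {6: {'id': 6, 'status': 'pending'}}
```

Writing the whole `after` row instead of patching individual fields is a deliberate choice here: applying the same event twice leaves the index in the same state.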
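CDC decouples systems cleanly, but adds infrastructure complexity. You now have to operate Kafka and Debezium and monitor replication lag. If Kafka or the connector goes down, Postgres does not pause writes. Instead, the replication slot retains every WAL segment the connector has not yet confirmed, so the WAL keeps growing and can eventually fill the disk and take the database down. On PostgreSQL 13 and later, the max_slot_wal_keep_size setting can cap this retention, at the cost of invalidating a slot that falls too far behind.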
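Monitoring replication lag mostly comes down to comparing two WAL positions. A Postgres LSN is printed as two hexadecimal numbers separated by a slash, which together form a 64-bit byte offset. The sketch below does that arithmetic in Python; the helper names and the slot name in the comment are hypothetical, and in practice `pg_wal_lsn_diff` can compute the same difference inside the database.

```python
def parse_lsn(lsn):
    """Convert a Postgres LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(current_lsn, confirmed_flush_lsn):
    """Bytes of WAL the slot's consumer has not yet confirmed."""
    return parse_lsn(current_lsn) - parse_lsn(confirmed_flush_lsn)


def should_alert(current_lsn, confirmed_flush_lsn, threshold_bytes):
    """True when the unconfirmed WAL exceeds the chosen threshold."""
    return lag_bytes(current_lsn, confirmed_flush_lsn) > threshold_bytes


# The two LSNs would come from a query such as:
#   SELECT pg_current_wal_lsn(), confirmed_flush_lsn
#   FROM pg_replication_slots WHERE slot_name = 'debezium_users';
print(lag_bytes("16/B374D848", "16/B374D800"))  # 72
```

A threshold in bytes is easier to reason about than one in seconds, because it maps directly onto the disk space the retained WAL will consume.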