Database incidents

Why your replicas show stale data, and how write-heavy jobs break them.

The idea

To handle high traffic, databases use a Primary-Replica architecture. You send all WRITES (INSERT/UPDATE) to the Primary. The Primary then streams those changes to Replicas. You send READS to the Replicas to spread the load.

However, this streaming is asynchronous. If someone runs a massive background job (e.g., deleting 1,000,000 old rows), the Primary processes it fast, but the Replica takes time to catch up. This delay is called Replication Lag. During this lag, if a user updates their profile (Write to Primary) and refreshes the page (Read from Replica), they will see their old profile data!

Primary DB
Handles WRITES
v1
Replication Log
Replica DB
Handles READS
v1
Replica Lag:
10ms
System healthy. Replica is fully synced with the Primary.

How it works (Handling Lag)

# BAD: Naive Read
def get_user_profile(user_id):
    # If the user JUST updated their profile 1 second ago,
    # the replica might not have the change yet!
    return db_replica.query("SELECT * FROM users WHERE id=?", user_id)

# GOOD: Read-your-writes consistency
def update_profile(user_id, data):
    db_primary.execute("UPDATE users...", data)
    # Set a flag in Redis/Cookie: "User wrote recently"
    cache.set(f"recent_write:{user_id}", True, expire=5)

def get_user_profile(user_id):
    if cache.get(f"recent_write:{user_id}"):
        # Force read from Primary to ensure freshness!
        return db_primary.query("SELECT * FROM users WHERE id=?", user_id)
    return db_replica.query("SELECT * FROM users WHERE id=?", user_id)