Database Unreachable

Handling network blips without taking down the whole API.

The idea

Networks are inherently unreliable. Sometimes a router drops a packet, or a database fails over to a replica, which can take around 10 seconds. If your API returns a 500 Internal Server Error every time a single packet is lost, your users will have a terrible experience. To fix this, applications should wrap database calls in a Retry Loop with Exponential Backoff and Jitter.

Step 1: The App attempts to query the Database, but a network blip drops the connection.

How it works (Exponential Backoff + Jitter)

If the database goes down and 100 App Servers all retry at the exact same millisecond, they will overwhelm the database the moment it comes back up, much like a self-inflicted DDoS (the "Thundering Herd" problem). Exponential Backoff means doubling the wait before each successive retry (1s, 2s, 4s). Jitter adds a random amount of time to each wait, spreading out the retries so they don't all hit the database simultaneously.

import time
import random

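def query_with_retry(sql, max_retries=3):
    # `db` is assumed to be an already-configured database client whose
    # execute() raises ConnectionError when the connection drops.
    for attempt in range(max_retries):
        try:
            return db.execute(sql)
        except ConnectionError:
            if attempt == max_retries - 1:
                raise  # Out of attempts: let the caller see the error.

            # Exponential backoff: 2^attempt seconds (1s, then 2s with max_retries=3)
            base_wait = 2 ** attempt

            # Jitter: add a random 0 to 1000 ms
            jitter = random.uniform(0, 1)

            time.sleep(base_wait + jitter)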
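
To see why the jitter term matters, here is a small sketch that computes the retry delay for 100 simulated clients, with and without jitter. The helper names (backoff_delay, retry_times) and the fixed seed are illustrative assumptions, not part of the function above; the delay formula is the same 2^attempt plus a random 0 to 1 second.

```python
import random

def backoff_delay(attempt, jitter_max=1.0, rng=random):
    """Seconds to wait after failed attempt number `attempt` (0-based)."""
    # Same formula as query_with_retry: exponential base plus uniform jitter.
    return 2 ** attempt + rng.uniform(0, jitter_max)

def retry_times(num_clients, attempt, jitter_max, seed=0):
    """Delay chosen by each of `num_clients` clients (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed only to make the demo repeatable
    return [backoff_delay(attempt, jitter_max, rng) for _ in range(num_clients)]

# Second retry (attempt=1), so the base wait is 2 seconds.
no_jitter = retry_times(100, attempt=1, jitter_max=0.0)
with_jitter = retry_times(100, attempt=1, jitter_max=1.0)

# Without jitter every client picks the identical delay.
print("distinct delays without jitter:", len(set(no_jitter)))
# With jitter the delays are spread between 2 and 3 seconds.
print("delay range with jitter:", min(with_jitter), max(with_jitter))
```

Without jitter all 100 clients wake up at the same instant; with jitter they are spread across a one-second window. A wider jitter range spreads the load further, at the cost of a longer average wait.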

Cost

Retries solve intermittent failures, but they tie up App Server threads. If the database is actually dead for 5 minutes, every incoming web request will sleep for 3+ seconds across its retries before failing, exhausting the web server's pool of worker threads. To prevent this, you should wrap retries in a Circuit Breaker that stops calling the database entirely once 10 requests fail in a row, so new requests fail fast instead of waiting.
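
A minimal sketch of such a Circuit Breaker, under stated assumptions: the class name, the CircuitOpenError exception, and the 30-second cool-down (reset_timeout) are illustrative choices, not something the text above prescribes. Only the threshold of 10 consecutive failures comes from the description.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected untried."""

class CircuitBreaker:
    def __init__(self, failure_threshold=10, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # assumed cool-down, in seconds
        self.clock = clock                  # injectable so it can be tested
        self.consecutive_failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast: do not touch the database or sleep at all.
                raise CircuitOpenError("database marked unavailable")
            # Cool-down elapsed: let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except ConnectionError:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Any success resets the count and closes the circuit.
        self.consecutive_failures = 0
        self.opened_at = None
        return result
```

With a shared instance such as breaker = CircuitBreaker(), a request handler could call breaker.call(query_with_retry, sql). While the circuit is open, requests fail immediately instead of sleeping through their retries; after the cool-down, a single trial call decides whether the circuit closes again.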

Watch out for