Handling network blips without taking down the whole API.
Networks are inherently unreliable. Sometimes a router drops a packet, or a database fails over to a replica (which can take around 10 seconds). If your API gives up and returns a 500 Internal Server Error every time a connection hiccups, your users will have a terrible experience. To fix this, applications should wrap database calls in a retry loop with exponential backoff and jitter.
If the database goes down and 100 app servers all retry at the exact same millisecond, they will effectively DDoS the database the moment it comes back up (the "Thundering Herd" problem). Exponential backoff means doubling the wait before each successive retry (1s, 2s, 4s). Jitter adds a random amount of time to each wait, spreading out the retries so they don't all hit the database simultaneously.
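import random
import time

# `db` is assumed to be an already-configured connection object that
# exposes execute() and raises ConnectionError on network failures.

def query_with_retry(sql, max_retries=3):
    """Run sql, retrying on ConnectionError with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return db.execute(sql)
        except ConnectionError:
            if attempt == max_retries - 1:
                raise  # Out of attempts: we give up!
            # Exponential backoff: 2 ** attempt seconds (1s, then 2s)
            base_wait = 2 ** attempt
            # Jitter: add between 0 and 1000ms randomly
            jitter = random.uniform(0, 1)
            time.sleep(base_wait + jitter)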
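To see why the jitter matters, here is a small simulation sketch. It computes when each of 100 hypothetical app servers would fire its first retry, using the same 2 ** attempt plus random formula as above. The helper name retry_times and the seeded random generator are illustrative assumptions, not part of the retry function itself.

```python
import random

def retry_times(attempts, jitter_max=1.0, rng=random):
    """Return cumulative times (seconds after the first failure) of each retry."""
    times = []
    elapsed = 0.0
    for attempt in range(attempts):
        # Same formula as the retry loop: exponential base plus random jitter.
        elapsed += 2 ** attempt + rng.uniform(0, jitter_max)
        times.append(elapsed)
    return times

# 100 hypothetical app servers that all saw the failure at t=0.
rng = random.Random(42)  # seeded only so the sketch is repeatable
no_jitter = [retry_times(2, jitter_max=0.0, rng=rng)[0] for _ in range(100)]
with_jitter = [retry_times(2, jitter_max=1.0, rng=rng)[0] for _ in range(100)]

# Without jitter every server retries at exactly the same instant.
print(len(set(no_jitter)))  # → 1
# With jitter the first retries are spread across the 1s to 2s window.
print(min(with_jitter), max(with_jitter))
```

Without jitter all 100 first retries collapse onto a single timestamp; with jitter they are spread over a full second, which is what gives the recovering database room to breathe.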
Retries solve intermittent failures, but they tie up app server threads. If the database is actually dead for 5 minutes, every incoming web request will sleep for 3+ seconds before giving up, and the web server will soon run out of free worker threads. To prevent this, wrap retries in a Circuit Breaker: once 10 requests fail in a row, it stops sending queries to the database and fails fast, then lets a trial request through after a cooldown to check whether the database is back.
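A minimal sketch of such a breaker might look like the following. The class name CircuitBreaker, the CircuitOpenError exception, the 30-second cooldown, and the injectable clock are all assumptions made for illustration; production libraries usually add thread safety and richer state handling.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""

class CircuitBreaker:
    def __init__(self, failure_threshold=10, cooldown=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.cooldown = cooldown                    # seconds to fail fast once open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                # Still cooling down: fail fast without touching the database.
                raise CircuitOpenError("database marked unavailable")
            # Cooldown elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

With this in place, a request handler could call breaker.call(query_with_retry, sql) and turn a CircuitOpenError into an immediate error response instead of sleeping through the retry loop.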
One warning about writes: if you send UPDATE users SET balance = balance - 100, and the network drops the response (but the DB actually processed it), retrying the query will charge the user $200! Only retry read queries, or writes that are idempotent (e.g. UPDATE users SET balance = 500).
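The difference is easy to reproduce with an in-memory SQLite database. This is only a sketch: the users table layout, the starting balance of 600, and the WHERE id = 1 clause are assumptions added so the example runs on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO users (id, balance) VALUES (1, 600)")

def balance():
    return conn.execute("SELECT balance FROM users WHERE id = 1").fetchone()[0]

# Non-idempotent write: every execution subtracts another 100.
relative = "UPDATE users SET balance = balance - 100 WHERE id = 1"
conn.execute(relative)
conn.execute(relative)  # simulated retry after a lost response
after_relative = balance()
print(after_relative)  # → 400 (charged twice)

# Idempotent write: running it again leaves the same final state.
conn.execute("UPDATE users SET balance = 600 WHERE id = 1")  # reset
absolute = "UPDATE users SET balance = 500 WHERE id = 1"
conn.execute(absolute)
conn.execute(absolute)  # simulated retry
after_absolute = balance()
print(after_absolute)  # → 500 (charged once)
```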