Designing team chat

A message is a letter that must be filed before it's delivered — write it down first, then fan it out to everyone in the room.

The idea

Team chat (Slack-style channels) looks simple until you trace a single message. There are two hard parts: delivering it to everyone in the channel who's online right now, and letting anyone who was offline catch up later with no gaps and no duplicates.

The shape that handles both: a gateway holds each client's live connection, a message service persists every message to a per-channel ordered log first, then a fan-out step pushes it to the gateways of all connected members. Persisting before fan-out is the linchpin — the durable log is the source of truth, and live delivery is just a fast path on top of it.

[Diagram: Sender client -> Gateway (live conns) -> Message svc (order + persist) -> Channel log -> Fan-out to members]

How it works

The send path persists first, then delivers. The catch-up path replays the log from the reader's last-seen sequence number. Both paths go through the same ordered log (send appends to it, catch-up reads from it), so an online user and a returning offline user converge on the identical history.

# Send: persist to the durable, ordered log BEFORE fan-out
def send(channel, sender, body):
    seq = log.append(channel, {              # 1) durable, gives an order
        "from": sender, "body": body, "ts": now()
    })
    for member in online_members(channel):    # 2) fast path to live clients
        gateway.push(member, channel, seq)
    return seq

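# Catch up: an offline user replays from where they left off.
# Assumes log.read returns (seq, message) pairs in ascending seq order.
def catch_up(channel, reader):
    cursors = last_seen.setdefault(channel, {})   # avoid KeyError on a new channel
    since = cursors.get(reader, 0)
    missed = log.read(channel, after=since)       # no gaps, in order
    if missed:
        # Advance to the last seq actually returned. Using log.head() here
        # could skip a message appended between the read and the head lookup.
        cursors[reader] = missed[-1][0]
    return missed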

Sequence numbers do double duty: they order messages and they're the cursor for dedup and catch-up.
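
To make the "double duty" concrete, here is a minimal client-side sketch. It assumes sequence numbers are dense within a channel and start at 1, which the text above does not state; the `ChannelClient` class and its return values are hypothetical names chosen for illustration. The client drops anything at or below its cursor and treats a jump ahead as a signal to run catch-up.

```python
class ChannelClient:
    """Client-side view of one channel, keyed by sequence number (illustrative)."""

    def __init__(self):
        self.last_seq = 0      # highest seq applied so far; 0 means nothing yet
        self.messages = []     # message bodies in log order

    def on_message(self, seq, body):
        """Apply a pushed or replayed message.

        Returns 'applied', 'duplicate', or 'gap'.
        """
        if seq <= self.last_seq:
            return "duplicate"          # already applied: drop it
        if seq > self.last_seq + 1:
            return "gap"                # something was missed: caller should catch up
        self.messages.append(body)
        self.last_seq = seq
        return "applied"


client = ChannelClient()
results = [
    client.on_message(1, "hi"),
    client.on_message(1, "hi"),      # redelivered push
    client.on_message(3, "three"),   # seq 2 never arrived
    client.on_message(2, "two"),     # replayed by catch-up
    client.on_message(3, "three"),   # replayed by catch-up
]
print(results)  # ['applied', 'duplicate', 'gap', 'applied', 'applied']
```

The same integer answers "what comes next?" and "have I seen this already?", so the client needs no separate message-id set for dedup.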

Cost

Path                  Cost                  Note
Persist one message   O(1) append           Ordered log write
Fan-out               O(online members)     One push per live connection
Catch up              O(missed)             Replay from last-seen seq
State on gateway      O(open connections)   Sharded across gateway nodes
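
One way to picture the fan-out and gateway-state rows is a registry that records which gateway node holds each live connection. This is a sketch under assumptions: `ConnectionRegistry`, the node names, and the one-connection-per-user simplification are all hypothetical and not taken from the design above.

```python
from collections import defaultdict


class ConnectionRegistry:
    """Tracks which gateway node holds each user's live connection (hypothetical)."""

    def __init__(self):
        self._node_of = {}             # user -> gateway node, one entry per open connection

    def connect(self, user, node):
        self._node_of[user] = node

    def disconnect(self, user):
        self._node_of.pop(user, None)

    def plan_fanout(self, members):
        """Group the online subset of members by gateway node."""
        batches = defaultdict(list)
        for user in members:
            node = self._node_of.get(user)
            if node is not None:       # offline members are skipped; they catch up later
                batches[node].append(user)
        return dict(batches)


registry = ConnectionRegistry()
registry.connect("ada", "gw-0")
registry.connect("bo", "gw-1")
registry.connect("di", "gw-1")
plan = registry.plan_fanout(["ada", "bo", "cy", "di"])
print(plan)  # {'gw-0': ['ada'], 'gw-1': ['bo', 'di']}
```

The registry grows with open connections, and the fan-out plan touches only online members, which matches the two costs in the table.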

Watch out for

Worked example

Ada posts in a channel with Bo (online) and Cy (offline). The message is appended to the log as seq 42, then pushed to Bo's gateway — Bo sees it instantly. Cy's laptop is closed, so nothing is pushed. An hour later Cy reconnects with last-seen seq 41; catch-up replays everything after 41, delivering seq 42 exactly once. No special case, no lost message: both users read the same durable log.
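
The scenario above can be replayed with a small in-memory simulation. Everything here is a stand-in: `ChannelLog` is a hypothetical substitute for the durable log, `inbox` records the sequence numbers each client has displayed, and the log is seeded so the next append lands on seq 42.

```python
class ChannelLog:
    """In-memory stand-in for the durable per-channel log (illustrative only)."""

    def __init__(self, start_seq=0):
        self._entries = []              # (seq, message) pairs, ascending by seq
        self._next = start_seq + 1

    def append(self, message):
        seq = self._next
        self._entries.append((seq, message))
        self._next += 1
        return seq

    def read(self, after):
        return [(s, m) for s, m in self._entries if s > after]


log = ChannelLog(start_seq=41)          # pretend seq 1..41 already exist
online = {"bo"}                         # Cy's laptop is closed
inbox = {"bo": [], "cy": []}            # seqs each client has displayed
last_seen = {"bo": 41, "cy": 41}


def send(sender, body):
    seq = log.append({"from": sender, "body": body})   # 1) persist first
    for member in online:                               # 2) then fan out
        inbox[member].append(seq)
        last_seen[member] = seq
    return seq


def catch_up(reader):
    missed = log.read(after=last_seen[reader])
    for seq, _message in missed:
        inbox[reader].append(seq)
        last_seen[reader] = seq         # advance to the last seq delivered
    return missed


seq = send("ada", "hello")              # appended as seq 42, pushed to Bo only
online.add("cy")                        # an hour later, Cy reconnects
catch_up("cy")
catch_up("cy")                          # a repeated catch-up delivers nothing new
print(seq, inbox)  # 42 {'bo': [42], 'cy': [42]}
```

Bo and Cy end with the same history even though one was pushed the message and the other replayed it.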

Check yourself

Why persist the message to the log before fanning it out to online clients?