WAF bot detection

Score every request on how bot-like it looks, then let it in, slow it down with a challenge, or turn it away.

The idea

A web application firewall sits in front of your app and inspects each incoming request before it reaches your code. It reads signals: how fast this client is hitting you, whether the headers look like a real browser, the TLS/JA3 fingerprint, the behaviour over time, and the IP's reputation.

Those signals combine into a single risk score. A low score means let it through. A medium score means add friction — a JavaScript or CAPTCHA challenge. A high score means block it outright. It's a graduated risk decision, not a binary on/off switch, because no single signal is ever certain.

How it works

Each signal contributes a weighted amount to the score. A weight reflects both how much that signal matters and how much you trust it. Sum the contributions, clamp the result into [0, 1] (the weights below add up to 1.25, so several strong signals together can overshoot), then compare against two thresholds: block at or above 0.8, challenge at or above 0.5, otherwise allow.

WEIGHTS = {
    "ip_reputation":             0.35,  # known abusive / datacenter range
    "req_rate":                  0.25,  # requests per second from this client
    "missing_js_cookie":         0.20,  # never solved a JS challenge
    "tls_fingerprint_known_bot": 0.30,  # JA3 matches a scraping toolkit
    "ua_anomaly":                0.15,  # user-agent inconsistent / spoofed
}

def score(request):
    s = 0.0
    for signal, weight in WEIGHTS.items():
        s += weight * request.signals.get(signal, 0.0)  # each in [0, 1]
    return max(0.0, min(1.0, s))                          # clamp to [0, 1]

def decide(request):
    s = score(request)
    if s >= 0.8:
        return "block"      # high risk: turn it away
    if s >= 0.5:
        return "challenge"  # medium risk: add friction (JS / CAPTCHA)
    return "allow"          # low risk: let it through

The thresholds are policy, not physics. Raise them to be more permissive (fewer false blocks, more bots get in); lower them to be stricter (fewer bots, but more real users see a challenge).
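
To make that trade-off concrete, here is a minimal sketch that pulls the two cut-offs out into a policy object. `Policy` and `decide_with` are illustrative names, not part of any particular WAF, and the strict and permissive values are arbitrary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical container for the two cut-offs."""
    block_at: float = 0.8
    challenge_at: float = 0.5

    def __post_init__(self):
        # The thresholds must be ordered and stay inside the score range.
        if not 0.0 <= self.challenge_at <= self.block_at <= 1.0:
            raise ValueError("need 0 <= challenge_at <= block_at <= 1")


def decide_with(risk: float, policy: Policy) -> str:
    """Map an already-clamped score in [0, 1] to an action."""
    if risk >= policy.block_at:
        return "block"
    if risk >= policy.challenge_at:
        return "challenge"
    return "allow"


default = Policy()
strict = Policy(block_at=0.6, challenge_at=0.3)      # lower: more friction
permissive = Policy(block_at=0.9, challenge_at=0.7)  # higher: more bots pass

for name, policy in [("default", default), ("strict", strict), ("permissive", permissive)]:
    print(name, decide_with(0.62, policy))
```

The same score of 0.62 is challenged under the default policy, blocked under the strict one, and allowed under the permissive one.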

Signals

Signal	What it catches
Request rate	Crawlers and scrapers hammering endpoints far faster than a human could click
TLS / JA3 fingerprint	Automation toolkits whose TLS handshake differs from a real browser's, even with a faked user-agent
JS challenge cookie	Clients with no JavaScript engine (plain HTTP libraries, simple scripts), which never run the challenge and so never earn the cookie
IP reputation	Datacenter ranges and addresses seen abusing other sites recently
Behavioural	Inhuman patterns: no mouse movement, perfectly even timing, hitting only the API and never the page
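
The scoring code expects every signal as a number in [0, 1], so raw measurements have to be normalised first. The sketch below shows one possible way to do that. The function names, the linear ramp, and the 2 and 20 req/s cut-offs are all assumptions for illustration; it covers only the signals that appear in the weights table.

```python
def ramp(value: float, low: float, high: float) -> float:
    """Linear ramp: 0.0 at or below `low`, 1.0 at or above `high`."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return max(0.0, min(1.0, (value - low) / (high - low)))


def build_signals(req_per_sec: float,
                  has_valid_js_cookie: bool,
                  ja3_is_known_bot: bool,
                  ip_reputation: float,
                  ua_anomaly: float) -> dict:
    """Turn raw observations into the [0, 1] signals the scorer consumes.

    `ip_reputation` and `ua_anomaly` are assumed to arrive as rough
    0-to-1 estimates from some upstream source; they are only clamped here.
    """
    return {
        "ip_reputation": ramp(ip_reputation, 0.0, 1.0),
        "req_rate": ramp(req_per_sec, 2.0, 20.0),  # assumed cut-offs
        "missing_js_cookie": 0.0 if has_valid_js_cookie else 1.0,
        "tls_fingerprint_known_bot": 1.0 if ja3_is_known_bot else 0.0,
        "ua_anomaly": ramp(ua_anomaly, 0.0, 1.0),
    }


signals = build_signals(req_per_sec=11.0, has_valid_js_cookie=False,
                        ja3_is_known_bot=False, ip_reputation=0.0,
                        ua_anomaly=0.3)
print(signals)
```

A ramp rather than a hard cut-off keeps a client at 11 req/s from being treated the same as one at 80 req/s.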

Watch out for

False positives block real people. Many genuine users share one egress IP behind a corporate NAT or VPN; screen readers and accessibility tools can look "unusual" to behavioural heuristics. Tune thresholds against real traffic, not just against attacks.
Challenge fatigue is real. If legitimate users are challenged on every page, they leave. Remember a solved challenge with a cookie or token so a human proves themselves once, not constantly.
Sophisticated bots adapt. Some solve CAPTCHAs through farms, run full headless browsers, and rotate residential IPs so reputation looks clean. Any single defence ages; layer signals and watch for drift.
Easily-spoofed signals mislead. The user-agent string is one line a bot edits freely. Weighting it heavily on its own punishes honest oddball clients while missing bots that simply lie. Trust harder-to-fake signals (TLS fingerprint, behaviour) more.
Challenges can amplify a DDoS. If your challenge is expensive to serve, an attacker who floods you with requests that score in the challenge band makes you do the costly work. Under load, prefer cheap rate-limiting and connection drops at the edge over heavy per-request challenges.
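
One way to remember a solved challenge, as the challenge-fatigue point suggests, is a signed, expiring token stored in a cookie. The sketch below is a minimal illustration using Python's standard `hmac` module. The token format, the 30-minute lifetime, the `client_id` binding, and the function names are assumptions, not any specific WAF's scheme.

```python
import hashlib
import hmac

SECRET = b"replace-with-a-real-secret"  # hypothetical; load from secure config
TTL_SECONDS = 1800                      # assumed lifetime: 30 minutes


def _sign(client_id: str, issued_at: int, secret: bytes) -> str:
    """HMAC-SHA256 over the client identifier and the issue time."""
    msg = f"{client_id}|{issued_at}".encode()
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()


def issue_token(client_id: str, now: float, secret: bytes = SECRET) -> str:
    """Call once, after the client has solved the challenge."""
    issued_at = int(now)
    return f"{issued_at}.{_sign(client_id, issued_at, secret)}"


def verify_token(token: str, client_id: str, now: float,
                 secret: bytes = SECRET, ttl: int = TTL_SECONDS) -> bool:
    """True only for an unexpired token issued to this client."""
    try:
        issued_str, sig = token.split(".", 1)
        issued_at = int(issued_str)
    except ValueError:
        return False  # malformed token
    if not 0 <= now - issued_at <= ttl:
        return False  # expired, or issued in the future
    expected = _sign(client_id, issued_at, secret)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(sig.encode(), expected.encode())


token = issue_token("client-a", now=1000)
print(verify_token(token, "client-a", now=1060))  # same client, one minute later
print(verify_token(token, "client-b", now=1060))  # token replayed by another client
```

A request carrying a token that verifies would set the `missing_js_cookie` signal to 0.0, so a human proves themselves once per lifetime rather than on every page. What `client_id` should bind to (a session, an address, a fingerprint) is a design choice this sketch leaves open.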

Worked example

A request arrives from a residential IP, 2 req/s, with a valid JS challenge cookie and a browser-matching TLS fingerprint. Its signals are near zero, the score lands around 0.10, and the WAF allows it — a human browsing normally.

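A second request comes from a clean-looking residential IP but at 18 req/s with no JS cookie and a slightly odd user-agent. Nothing screams "bot," but the signals stack up to roughly 0.55, inside the challenge band. (With the weights above, request rate, the missing cookie, and the user-agent can contribute at most 0.25 + 0.20 + 0.15 = 0.60 between them.) The WAF challenges it: a real user's browser solves the JS check in a blink and proceeds; an automated client without a JS engine stalls there.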

A third request comes from a datacenter IP on a reputation list, 80 req/s, no JS cookie, and a TLS fingerprint matching a known scraping toolkit. The signals pile up past the block threshold to about 0.86, and the WAF blocks it before it ever touches the app.
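
The three cases can be replayed through the scoring code from "How it works". The signal values below are assumptions chosen to reproduce the three decisions. The prose does not give them, so the printed scores only approximate the figures quoted above. `Request` is a hypothetical stand-in for whatever request object the WAF exposes.

```python
from dataclasses import dataclass, field

WEIGHTS = {
    "ip_reputation":             0.35,
    "req_rate":                  0.25,
    "missing_js_cookie":         0.20,
    "tls_fingerprint_known_bot": 0.30,
    "ua_anomaly":                0.15,
}


@dataclass
class Request:
    """Hypothetical request wrapper: only the signals matter here."""
    signals: dict = field(default_factory=dict)


def score(request):
    s = 0.0
    for signal, weight in WEIGHTS.items():
        s += weight * request.signals.get(signal, 0.0)  # each in [0, 1]
    return max(0.0, min(1.0, s))                          # clamp to [0, 1]


def decide(request):
    s = score(request)
    if s >= 0.8:
        return "block"
    if s >= 0.5:
        return "challenge"
    return "allow"


# Assumed signal values for the three cases described above.
human = Request({"ip_reputation": 0.1, "req_rate": 0.2, "ua_anomaly": 0.1})
suspect = Request({"req_rate": 1.0, "missing_js_cookie": 1.0, "ua_anomaly": 0.6})
scraper = Request({"ip_reputation": 0.8, "req_rate": 0.8,
                   "missing_js_cookie": 1.0, "tls_fingerprint_known_bot": 0.6})

for name, req in [("human", human), ("suspect", suspect), ("scraper", scraper)]:
    print(f"{name}: {score(req):.2f} -> {decide(req)}")
```

With these values the scores come out at 0.10, 0.54, and 0.86, giving allow, challenge, and block in that order.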

Check yourself

Why challenge a medium-score request instead of just blocking it?