Percent-decoding

A URL can only safely carry a small alphabet, so any byte outside it travels disguised as %XX; decoding just reads those disguises back into bytes.

The idea

A URL is allowed to contain only a limited set of characters. A reserved character (like &, /, ?) that is meant as data rather than as a delimiter, or anything outside plain ASCII (like é or ✓), gets percent-encoded: each raw byte of its UTF-8 encoding becomes a % followed by that byte’s two hex digits.

Decoding walks the string left to right. A normal character copies straight through. When it meets a %, it reads the next two hex characters, turns them into one byte, and collects bytes into a buffer. Once a complete UTF-8 sequence has been gathered, that buffer decodes into a single Unicode character — which is how %E2%9C%93 (three bytes) becomes one ✓.
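
The buffering described above can be sketched with Python's incremental UTF-8 decoder, which holds bytes back until a sequence is complete. This is a minimal illustration, not a hardened decoder: decode_steps is a hypothetical helper name, and the sketch assumes every % is followed by two valid hex digits.

```python
import codecs

def decode_steps(s):
    """Yield (token, emitted_text) for each literal char or %XX escape."""
    # The incremental decoder keeps incomplete UTF-8 sequences pending.
    dec = codecs.getincrementaldecoder("utf-8")()
    i = 0
    while i < len(s):
        if s[i] == "%":
            token = s[i:i + 3]
            chunk = bytes([int(token[1:], 16)])  # assumption: well-formed escape
            i += 3
        else:
            token = s[i]
            chunk = token.encode("utf-8")
            i += 1
        # Emits "" while a multi-byte character is still waiting for bytes.
        yield token, dec.decode(chunk)
    dec.decode(b"", final=True)  # raises if bytes are still pending at the end

for token, text in decode_steps("%E2%9C%93"):
    print(token, repr(text))
```

The first two escapes emit nothing; the character only appears once the third byte arrives.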

How it works

Scan the string. On a literal character, keep its UTF-8 bytes. On %, read the next two hex digits and turn them into one byte. Collect all the raw bytes, then decode the whole byte stream as UTF-8 at the end, because a multi-byte character only resolves once all of its bytes have arrived. Note that + means a space only in application/x-www-form-urlencoded data (query strings and form-encoded request bodies), never in the path; the form flag below switches that rule on.

def percent_decode(s, form=False):
    raw = bytearray()
    i = 0
    while i < len(s):
        c = s[i]
        if c == '%':
            # need exactly two hex digits after the '%'
            hex2 = s[i+1:i+3]
            if len(hex2) != 2 or not all(h in '0123456789abcdefABCDEF' for h in hex2):
                raise ValueError(f"malformed escape at {i}: {s[i:i+3]!r}")
            raw.append(int(hex2, 16))   # one byte, e.g. 'C3' -> 0xC3
            i += 3
        elif c == '+' and form:
            raw.append(0x20)            # '+' is space ONLY in form data
            i += 1
        else:
            raw.extend(c.encode('utf-8'))
            i += 1
    return raw.decode('utf-8')          # interpret the byte stream as UTF-8
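
For comparison, Python's standard library ships a decoder with different defaults. The short sketch below uses urllib.parse.unquote and unquote_plus; the variable names are illustrative only. Unlike the strict function above, unquote passes malformed escapes through unchanged and, by default, replaces invalid UTF-8 with U+FFFD instead of raising.

```python
from urllib.parse import unquote, unquote_plus

path_text = unquote("a+b%20c")        # '+' stays literal: 'a+b c'
form_text = unquote_plus("a+b%20c")   # '+' becomes a space: 'a b c'
lenient = unquote("%G1")              # malformed escape is left as '%G1'
replaced = unquote("%C3")             # lone lead byte becomes U+FFFD

# Passing errors="strict" makes invalid UTF-8 raise instead of being replaced.
try:
    unquote("%C3", errors="strict")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True
```

Which behavior is right depends on the caller: lenient decoding suits display, while strict decoding suits anything that feeds a security decision.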

Cost / trade-offs

Aspect	Cost	Why
Decode time	O(n)	One left-to-right pass; each char or escape is read once.
Extra space	O(n)	A byte buffer plus the output string; both bounded by input length.
Size when encoded	~3× per byte	One encoded byte (`%XX`) is 3 characters; non-ASCII chars cost 3 chars per UTF-8 byte.
Double-encoding	Ambiguity hazard	Encoding twice turns `%` into `%25`; decoding the wrong number of times changes the result.
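
The size and double-encoding rows can be checked with a few lines, sketched here with the standard library's urllib.parse.quote and unquote. Passing safe="" is a choice made for this illustration so that / is encoded too.

```python
from urllib.parse import quote, unquote

once = quote("a/b", safe="")      # '/' becomes %2F
twice = quote(once, safe="")      # the '%' itself becomes %25

print(once, twice)                # a%2Fb a%252Fb
print(unquote(twice))             # one decode only undoes one layer: a%2Fb
print(unquote(unquote(twice)))    # two decodes recover the original: a/b

# A 3-byte UTF-8 character costs 3 characters per byte when encoded.
print(len(quote("\u2713")))       # 9
```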

Watch out for

Decoding twice. %2541 decodes once to %41, then again to A. An attacker can hide a / or ../ behind a second layer (%252F) to slip past a filter that only decodes once. Decode exactly once, then validate.
Decode order matters. Split on the real delimiters (&, =, /) first, then decode each piece. Decoding first would turn an encoded %26 inside a value into a bare & that is indistinguishable from a real delimiter.
Treating + as space in a path. + means space only in form-urlencoded query data. In a path, a+b is literally a+b; a real space there is %20.
Malformed escapes. A % that is not followed by exactly two hex digits (%2, %G1, a trailing %) is invalid; decide explicitly whether to reject it or pass it through, rather than crashing.
Invalid UTF-8 / blind trust. Bytes that aren’t valid UTF-8 (a lead byte with no continuation, a stray continuation byte in 0x80–0xBF) must be handled explicitly, not assumed away. Separately, a decoded path is still untrusted input: canonicalize it and check for traversal (../) after decoding.
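
The split-then-decode rule from the list above can be sketched for a query string as follows. parse_query is a hypothetical helper written for this illustration; it assumes form-style data, so it uses urllib.parse.unquote_plus to treat + as a space.

```python
from urllib.parse import unquote_plus

def parse_query(query):
    """Split on the structural delimiters first, then decode each piece."""
    pairs = []
    for field in query.split("&"):
        if not field:
            continue                      # skip empty fields such as 'a=1&&b=2'
        name, _, value = field.partition("=")
        pairs.append((unquote_plus(name), unquote_plus(value)))
    return pairs

good = parse_query("q=a%26b&tag=x%3Dy+z")

# Decoding first corrupts the structure: the encoded '&' becomes a delimiter.
bad = unquote_plus("q=a%26b&tag=x%3Dy+z").split("&")
```

Here good keeps a&b and x=y z as intact values, while bad splits the first value in two.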

Worked example

Take caf%C3%A9. The letters c, a, f copy straight through as the bytes 63 61 66. Then %C3 yields the byte 0xC3 — in binary 11000011, a UTF-8 lead byte whose top bits announce a 2-byte sequence, so we wait for one more. %A9 yields 0xA9 (10101001), a valid continuation byte in the 0x80–0xBF range. Together C3 A9 decode to é, giving the word café. Add %20%26%20%E2%9C%93 and you get café & ✓, where the three bytes E2 9C 93 form one check mark.
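
The bit-level reading in this example can be reproduced in a few lines. The two helper names below are hypothetical, and the lead-byte ranges are the standard UTF-8 ones (0xC2–0xDF for two bytes, 0xE0–0xEF for three, 0xF0–0xF4 for four).

```python
def utf8_sequence_length(lead):
    """Return how many bytes a UTF-8 sequence has, judged by its lead byte."""
    if lead < 0x80:
        return 1                      # plain ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError(f"0x{lead:02X} cannot start a UTF-8 sequence")

def is_continuation(byte):
    """Continuation bytes have the bit pattern 10xxxxxx (0x80-0xBF)."""
    return byte & 0xC0 == 0x80

print(format(0xC3, "08b"), utf8_sequence_length(0xC3))   # 11000011 2
print(format(0xA9, "08b"), is_continuation(0xA9))        # 10101001 True
print(bytes.fromhex("C3A9").decode("utf-8"))             # é
print(utf8_sequence_length(0xE2))                        # 3
```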

Check yourself

A web app finds %252e%252e%252f in a path and decodes it twice to “clean it up.” What does it end up with, and why is that a problem?

In a URL path (not the query string), what does a+b%20c decode to?