Percent-decoding

A URL can only safely carry a small alphabet, so every other byte travels disguised as %XX — decoding is just reading those disguises back into bytes.

The idea

A URL is allowed to contain only a limited set of characters. A reserved character used as data rather than as a delimiter (like &, /, ?) or anything outside plain ASCII (like é or ✓) gets percent-encoded: each raw byte becomes a % followed by that byte’s two hex digits.
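As a rough illustration of the encoding direction, the sketch below escapes every byte that falls outside a small unreserved set. The name percent_encode and the constant UNRESERVED are assumptions made for this example; real encoders differ in which characters they leave alone.

```python
# Characters this sketch leaves unescaped: letters, digits, and - . _ ~
UNRESERVED = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

def percent_encode(text):
    """Hypothetical helper: escape every byte outside UNRESERVED as %XX."""
    out = []
    for byte in text.encode("utf-8"):      # work on bytes, not characters
        ch = chr(byte)
        if ch in UNRESERVED:
            out.append(ch)                 # safe ASCII passes through
        else:
            out.append(f"%{byte:02X}")     # '%' plus two uppercase hex digits
    return "".join(out)

print(percent_encode("a&b"))   # a%26b
print(percent_encode("é"))     # %C3%A9
print(percent_encode("✓"))     # %E2%9C%93
```

Note that é is one character but two UTF-8 bytes, so it expands to six characters once encoded.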

Decoding walks the string left to right. A normal character copies straight through. When it meets a %, it reads the next two hex characters, turns them into one byte, and collects bytes into a buffer. Once a complete UTF-8 sequence has been gathered, that buffer decodes into a single Unicode character, which is how %E2%9C%93 (three bytes) becomes one ✓.
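One way to watch the buffer fill up is to feed the bytes one at a time to Python's incremental UTF-8 decoder. This is only a sketch of the "pending bytes" idea, using the standard codecs module; the variable names are illustrative.

```python
import codecs

# The incremental decoder keeps incomplete sequences in its internal buffer.
decoder = codecs.getincrementaldecoder("utf-8")()

steps = []
for byte in (0xE2, 0x9C, 0x93):            # the three bytes from %E2%9C%93
    emitted = decoder.decode(bytes([byte]))
    steps.append(emitted)
    print(f"fed {byte:02X} -> emitted {emitted!r}")

# The first two bytes emit '' (still pending); the third emits '✓'.
```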

How it works

Scan the string. On a literal character, keep its byte. On %, read the next two hex digits and turn them into one byte. Collect all the raw bytes, then decode the whole byte stream as UTF-8 at the end — a single multi-byte character only resolves once all of its bytes have arrived. Note that + means a space only in application/x-www-form-urlencoded data (typically the query string), never in the path.

def percent_decode(s, form=False):
    raw = bytearray()
    i = 0
    while i < len(s):
        c = s[i]
        if c == '%':
            # need exactly two hex digits after the '%'
            hex2 = s[i+1:i+3]
            if len(hex2) != 2 or not all(h in '0123456789abcdefABCDEF' for h in hex2):
                raise ValueError(f"malformed escape at {i}: {s[i:i+3]!r}")
            raw.append(int(hex2, 16))   # one byte, e.g. 'C3' -> 0xC3
            i += 3
        elif c == '+' and form:
            raw.append(0x20)            # '+' is space ONLY in form data
            i += 1
        else:
            raw.extend(c.encode('utf-8'))
            i += 1
    return raw.decode('utf-8')          # interpret the byte stream as UTF-8
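For comparison, Python's standard library offers urllib.parse.unquote and urllib.parse.unquote_plus, which roughly correspond to the form=False and form=True modes above. The sketch below shows them side by side. One difference worth knowing: by default they do not raise on bad input. A malformed escape is left in place, and an invalid UTF-8 byte becomes the replacement character.

```python
from urllib.parse import unquote, unquote_plus

print(unquote("caf%C3%A9"))           # café
print(unquote("red+green%21"))        # red+green!  ('+' stays a plus)
print(unquote_plus("red+green%21"))   # red green!  ('+' becomes a space)
print(unquote("100%zz"))              # 100%zz      (malformed escape kept as-is)
print(repr(unquote("%FF")))           # '\ufffd'    (invalid UTF-8 is replaced)
```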

Cost / trade-offs

Aspect | Cost | Why
------ | ---- | ---
Decode time | O(n) | One left-to-right pass; each char or escape is read once.
Extra space | O(n) | A byte buffer plus the output string; both bounded by input length.
Size when encoded | ~3× per byte | One encoded byte (%XX) is 3 characters; non-ASCII chars cost 3 chars per UTF-8 byte.
Double-encoding | Ambiguity hazard | Encoding twice turns % into %25; decoding the wrong number of times changes the result.
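The double-encoding hazard is easy to reproduce with the standard library. In this sketch the input is a made-up string in which the user really means the three literal characters %41; the variable names are illustrative.

```python
from urllib.parse import quote, unquote

literal = "%41"                          # the user means these three characters
encoded = quote(literal)                 # '%2541' (the % itself becomes %25)

decoded_once = unquote(encoded)          # '%41' -> matches what was sent
decoded_twice = unquote(decoded_once)    # 'A'   -> no longer what was sent

print(encoded, decoded_once, decoded_twice)
```

Decoding exactly as many times as the value was encoded recovers the original; one extra pass silently turns it into different data.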

Worked example

Take caf%C3%A9. The letters c, a, f copy straight through as the bytes 63 61 66. Then %C3 yields the byte 0xC3 — in binary 11000011, a UTF-8 lead byte whose top bits announce a 2-byte sequence, so we wait for one more. %A9 yields 0xA9 (10101001), a valid continuation byte in the 0x80–0xBF range. Together C3 A9 decode to é, giving the word café. Add %20%26%20%E2%9C%93 and you get café & ✓, where the three bytes E2 9C 93 form one check mark.
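The bit patterns in this walk-through can be checked with a small classifier. This is a simplified sketch: utf8_role is a hypothetical helper that looks only at the top bits of a byte, and it does not reject lead bytes that strict UTF-8 forbids (such as 0xC0 or 0xC1).

```python
def utf8_role(b):
    """Hypothetical helper: classify one byte by its leading bits."""
    if b < 0x80:
        return "ascii"                     # 0xxxxxxx
    if b < 0xC0:
        return "continuation"              # 10xxxxxx (0x80-0xBF)
    if b < 0xE0:
        return "lead of 2-byte sequence"   # 110xxxxx
    if b < 0xF0:
        return "lead of 3-byte sequence"   # 1110xxxx
    if b < 0xF8:
        return "lead of 4-byte sequence"   # 11110xxx
    return "invalid"

for b in (0x63, 0xC3, 0xA9, 0xE2):
    print(f"{b:02X} {b:08b} {utf8_role(b)}")

print(bytes([0xC3, 0xA9]).decode("utf-8"))   # é
```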

Check yourself

A web app finds %252e%252e%252f in a path and decodes it twice to “clean it up.” What does it end up with, and why is that a problem?

In a URL path (not the query string), what does a+b%20c decode to?