A URL can only safely carry a small alphabet, so every other byte travels disguised as %XX — decoding is just reading those disguises back into bytes.
A URL is allowed to contain only a limited set of characters. Anything reserved (like &, /, ?) or outside plain ASCII (like é or ✓) gets percent-encoded: each raw byte becomes a % followed by that byte’s two hex digits.
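The encoding direction can be sketched in a few lines. This is a minimal illustration, not a full encoder: `encode_bytes` and `UNRESERVED` are hypothetical names introduced here, and the sketch assumes every byte outside the RFC 3986 unreserved set is escaped. The standard library's `urllib.parse.quote` is shown alongside for comparison.

```python
from urllib.parse import quote

# Characters that may appear in a URL without escaping (RFC 3986 "unreserved").
UNRESERVED = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

def encode_bytes(text):
    """Hypothetical helper: escape every UTF-8 byte outside the unreserved set."""
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch in UNRESERVED:
            out.append(ch)               # safe character, copied as-is
        else:
            out.append(f"%{byte:02X}")   # '%' plus the byte's two hex digits
    return "".join(out)

print(encode_bytes("é"))      # %C3%A9  (two UTF-8 bytes, two escapes)
print(encode_bytes("a&b"))    # a%26b
print(quote("✓", safe=""))    # %E2%9C%93
```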
Decoding walks the string left to right. A normal character copies straight through. When it meets a %, it reads the next two hex characters, turns them into one byte, and collects bytes into a buffer. Once a complete UTF-8 sequence has been gathered, that buffer decodes into a single Unicode character — which is how %E2%9C%93 (three bytes) becomes one ✓.
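To watch a multi-byte character resolve only when its last byte arrives, one option is Python's incremental UTF-8 decoder from the `codecs` module. This is a sketch of the buffering idea, not a required part of a URL decoder.

```python
import codecs

# An incremental decoder holds incomplete sequences until they can be resolved.
decoder = codecs.getincrementaldecoder("utf-8")()

pieces = []
for byte in (0xE2, 0x9C, 0x93):   # the three bytes carried by %E2%9C%93
    pieces.append(decoder.decode(bytes([byte])))

print(pieces)  # ['', '', '✓']
```

The first two calls return an empty string because the sequence is still incomplete; the third byte completes it.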
Scan the string. On a literal character, keep its byte. On %, read the next two hex digits and turn them into one byte. Collect all the raw bytes, then decode the whole byte stream as UTF-8 at the end — a single multi-byte character only resolves once all of its bytes have arrived. Note that + means a space only in application/x-www-form-urlencoded data (typically the query string), never in the path.

    def percent_decode(s, form=False):
        raw = bytearray()
        i = 0
        while i < len(s):
            c = s[i]
            if c == '%':
                # need exactly two hex digits after the '%'
                hex2 = s[i+1:i+3]
                if len(hex2) != 2 or not all(h in '0123456789abcdefABCDEF' for h in hex2):
                    raise ValueError(f"malformed escape at {i}: {s[i:i+3]!r}")
                raw.append(int(hex2, 16))  # one byte, e.g. 'C3' -> 0xC3
                i += 3
            elif c == '+' and form:
                raw.append(0x20)  # '+' is space ONLY in form data
                i += 1
            else:
                raw.extend(c.encode('utf-8'))
                i += 1
        return raw.decode('utf-8')  # interpret the byte stream as UTF-8

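For comparison, Python's standard library ships decoders with the same path versus form distinction. The sketch below uses `urllib.parse.unquote` and `unquote_plus`. Unlike `percent_decode`, which rejects bad input, these are lenient by default: malformed escapes pass through unchanged and invalid UTF-8 becomes U+FFFD.

```python
from urllib.parse import unquote, unquote_plus

path_style = unquote("caf%C3%A9%20a+b")       # '+' stays a literal plus
form_style = unquote_plus("caf%C3%A9%20a+b")  # '+' becomes a space
lenient = unquote("100%")                     # trailing '%' is left untouched
replaced = unquote("%FF")                     # invalid UTF-8 -> U+FFFD

print(path_style)      # café a+b
print(form_style)      # café a b
print(lenient)         # 100%
print(repr(replaced))  # '\ufffd'
```

Passing `errors="strict"` to `unquote` makes invalid UTF-8 raise instead of being replaced, which is closer to the strict behavior of the hand-written decoder.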
| Aspect | Cost | Why |
|---|---|---|
| Decode time | O(n) | One left-to-right pass; each char or escape is read once. |
| Extra space | O(n) | A byte buffer plus the output string; both bounded by input length. |
| Size when encoded | ~3× per escaped byte | Each escaped byte (%XX) takes 3 characters, so é (2 UTF-8 bytes) becomes 6 characters and ✓ (3 bytes) becomes 9; unescaped ASCII stays at 1. |
| Double-encoding | Ambiguity hazard | Encoding twice turns % into %25; decoding the wrong number of times changes the result. |
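
The double-encoding row can be checked directly with the standard library. This is a small sketch using `urllib.parse.quote` and `unquote`; the input `a/b` is just an illustrative value.

```python
from urllib.parse import quote, unquote

once = quote("a/b", safe="")    # '/' is escaped
twice = quote(once, safe="")    # the '%' itself is escaped to %25

print(once)                     # a%2Fb
print(twice)                    # a%252Fb
print(unquote(twice))           # a%2Fb  (one layer removed)
print(unquote(unquote(twice)))  # a/b    (both layers removed)
```

Decoding once or twice gives different strings, so the number of decode passes has to match the number of encode passes.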
- %2541 decodes once to %41, then again to A. An attacker can hide a / or ../ behind a second layer (%252F) to slip past a filter that only decodes once. Decode exactly once, then validate.
- Splitting on &, =, or / before decoding is correct; an encoded %26 inside a value must stay literal, not be mistaken for a real delimiter. Decode after you split the structure.
- Do not treat + as a space in a path. + means space only in form-urlencoded query data. In a path, a+b is literally a+b; a real space there is %20.
- A % with fewer than two following hex digits (%2, %G1, a trailing %) is invalid. Decide explicitly whether to reject it or pass it through, rather than crashing.
- Decoded bytes are not guaranteed to be valid UTF-8: truncated sequences and stray continuation bytes (0x80–0xBF) must be handled, not assumed. And decoding then trusting the path enables traversal: canonicalize and check after decoding.

Take caf%C3%A9. The letters c, a, f copy straight through as the bytes 63 61 66. Then %C3 yields the byte 0xC3, in binary 11000011, a UTF-8 lead byte whose top bits announce a 2-byte sequence, so we wait for one more. %A9 yields 0xA9 (10101001), a valid continuation byte in the 0x80–0xBF range. Together C3 A9 decode to é, giving the word café. Add %20%26%20%E2%9C%93 and you get café & ✓, where the three bytes E2 9C 93 form one check mark.
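
Returning to the delimiter pitfall listed above, the difference between decoding first and splitting first can be sketched as follows. The query string is an invented example; `urllib.parse.parse_qsl` is included because it already splits before decoding.

```python
from urllib.parse import parse_qsl, unquote_plus

query = "q=a%26b%3Dc&lang=fr"   # illustrative input: the value of q is 'a&b=c'

# Wrong order: decoding first turns %26 and %3D into real-looking delimiters.
wrong = unquote_plus(query).split("&")

# Right order: split on the literal delimiters, then decode each piece.
right = []
for pair in query.split("&"):
    key, _, value = pair.partition("=")
    right.append((unquote_plus(key), unquote_plus(value)))

print(wrong)             # ['q=a', 'b=c', 'lang=fr']
print(right)             # [('q', 'a&b=c'), ('lang', 'fr')]
print(parse_qsl(query))  # [('q', 'a&b=c'), ('lang', 'fr')]
```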
A web app finds %252e%252e%252f in a path and decodes it twice to “clean it up.” What does it end up with, and why is that a problem?
In a URL path (not the query string), what does a+b%20c decode to?