3.8. Evaluating URL Encodings
Problem
You need to decode a Uniform Resource Locator (URL).
Solution
Iterate over the characters in the URL looking for a percent symbol followed by two hexadecimal digits. When such a sequence is encountered, combine the hexadecimal digits to obtain the character with which to replace the entire sequence. For example, in the ASCII character set, the letter “A” has the value 0x41, which could be encoded as “%41”.
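The steps above can be sketched in C roughly as follows. This is a minimal illustration, not the book's own listing: the names `url_decode_once` and `hex_value` are hypothetical, and the choice to reject a decoded NUL byte (`%00`) is an assumption made here so that the result remains a well-formed C string.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Value (0-15) of one hexadecimal digit; the caller guarantees isxdigit(c). */
static int hex_value(unsigned char c) {
    if (c >= '0' && c <= '9') return c - '0';
    return tolower(c) - 'a' + 10;
}

/* Hypothetical helper: decodes src in exactly one pass.
 * Returns a newly allocated string that the caller must free(), or NULL
 * if allocation fails or the input contains "%00" (an assumption of this
 * sketch: an embedded NUL would silently truncate the result). */
char *url_decode_once(const char *src) {
    char *out = malloc(strlen(src) + 1);  /* output is never longer than input */
    char *dst = out;

    if (!out) return NULL;
    while (*src) {
        /* A '%' only starts an escape if two hex digits follow it. */
        if (src[0] == '%' &&
            isxdigit((unsigned char)src[1]) &&
            isxdigit((unsigned char)src[2])) {
            int ch = hex_value((unsigned char)src[1]) * 16 +
                     hex_value((unsigned char)src[2]);
            if (ch == 0) {
                free(out);
                return NULL;
            }
            *dst++ = (char)ch;
            src += 3;
        } else {
            *dst++ = *src++;  /* copy everything else unchanged */
        }
    }
    *dst = '\0';
    return out;
}
```

In this sketch a percent symbol that is not followed by two hexadecimal digits is copied through unchanged; a stricter decoder might treat it as an error instead.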
Discussion
RFC 1738 defines the syntax for URLs, and Section 2.2 of that document defines the rules for encoding characters in a URL. While some characters must always be encoded, any character may be encoded. This means that before you do anything with a URL, whether you need to parse it into pieces (for example, username, password, and host), match portions of it against a whitelist or blacklist, or something else entirely, you need to decode it.
The problem is that you must make certain that you never decode a URL that has already been decoded; otherwise, you will be vulnerable to double-encoding attacks. Suppose that the URL contains the sequence “%25%34%31”. Decoded once, the result is “%41” because “%25” is the encoding for the percent symbol, “%34” is the encoding for the digit 4, and “%31” is the encoding for the digit 1. Decoded twice, the result is “A”.
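The “%25%34%31” example can be reproduced with a small in-place decoder. The sketch below is illustrative only: `decode_in_place` and `hexval` are hypothetical names, and leaving a `%00` sequence undecoded is an assumption made here to keep the buffer a valid C string.

```c
#include <ctype.h>
#include <stddef.h>

/* Value (0-15) of one hexadecimal digit; the caller guarantees isxdigit(c). */
static int hexval(unsigned char c) {
    return isdigit(c) ? c - '0' : tolower(c) - 'a' + 10;
}

/* Hypothetical helper: decodes s in place, performing a single pass.
 * Returns the number of escape sequences that were replaced. */
size_t decode_in_place(char *s) {
    char *dst = s;
    size_t replaced = 0;

    while (*s) {
        if (s[0] == '%' &&
            isxdigit((unsigned char)s[1]) &&
            isxdigit((unsigned char)s[2])) {
            int ch = hexval((unsigned char)s[1]) * 16 +
                     hexval((unsigned char)s[2]);
            if (ch != 0) {          /* assumption: "%00" is left as literal text */
                *dst++ = (char)ch;
                s += 3;
                replaced++;
                continue;
            }
        }
        *dst++ = *s++;
    }
    *dst = '\0';
    return replaced;
}
```

Called on a buffer holding “%25%34%31”, the first pass replaces three sequences and leaves “%41”. A second call on the same buffer replaces one more and leaves “A”. A nonzero return value from a second pass is therefore a sign that the input was encoded more than once.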
At first glance, this may seem harmless, but what if you were to decode repeatedly until there were no more escaped characters? You would end up with certain ...