Chapter 4. Web-Oriented Data Encoding

In the field of observation, chance favors only the prepared mind.

Louis Pasteur

Even though web applications have all sorts of different purposes, requirements, and expected behaviors, there are some basic technologies and building blocks that show up time and again. If we learn about those building blocks and master them, then we will have versatile tools that can apply to a variety of web applications, regardless of the application’s specific purpose or the technologies that implement it.

One of these fundamental building blocks is data encoding. Web applications ship data back and forth from the browser to the server in myriad ways. Depending on the type of data, the requirements of the system, and the programmer’s particular preferences, that data might be encoded or packaged in any number of different formats. To make useful test cases, we often have to decode the data, manipulate it, and reencode it. In particularly complicated situations, you may have to recompute a valid integrity check value, like a checksum or hash. The vast majority of our tests in the web world involve manipulating the parameters that pass back and forth between a server and a browser, but we have to understand how they are packed and shipped before we can manipulate them.

In this chapter, we’ll talk about recognizing, decoding, and encoding several different formats: Base 64, Base 36, Unix time, URL encoding, HTML encoding, and others. This is not so much meant to be a reference for these formats (there are plenty of good references). Instead, we will help you know it when you see it and manipulate the basic formats. Then you will be able to design test data carefully, knowing that the application will interpret your input in the way you expect.

The kinds of parameters we’re looking at appear in lots of independent places in our interaction with a web application. They might be hidden form field values, GET parameters in the URL, or values in the cookie. They might be small, like a 6-character discount code, or they might be large, like hundreds of characters with an internal composite structure. As a tester, you want to do boundary case testing and negative testing that addresses interesting cases, but you cannot figure out what is interesting if you don’t understand the format and use of the data. It is difficult to methodically generate boundary values and test data if you do not understand how the input is structured. For example, if you see dGVzdHVzZXI6dGVzdHB3MTIz in an HTTP header, you might be tempted to just change characters at random. Decoding this with a Base-64 decoder, however, reveals the string testuser:testpw123. Now you have a much better idea of the data, and you know how to modify it in ways that are relevant to its usage. You can make test cases that are valid and carefully targeted at the application’s behavior.

4.1. Recognizing Binary Data Representations


You have decoded some data in a parameter, input field, or data file and you want to create appropriate test cases for it. You have to determine what kind of data it is so that you can design good test cases that manipulate it in interesting ways.

We will consider these kinds of data:

  • Hexadecimal (Base 16)

  • Octal (Base 8)

  • Base 36


Hexadecimal data

Hexadecimal characters, or Base-16 digits, are the numerical digits 0–9 and the letters A–F. You might see them in all uppercase or all lowercase, but you will rarely see the letters in mixed case. If you have any letters beyond F in the alphabet, you’re not dealing with Base 16.

Although this is fundamental computer science material, it bears repeating in the context of testing. Each individual byte of data is represented by two characters in the output. A few values are worth remembering: 00 is 0 is NULL. That’s one of our favorite boundary values for testing. Likewise, FF is 255, or −1, depending on whether it’s an unsigned or signed value. It’s our other favorite boundary value. Other interesting values include 20, which is the ASCII space character, and 41, which is ASCII for uppercase A. There are no common, printable ASCII characters above 7F.

In most programming languages, hexadecimal values can be distinguished by the prefix 0x in front of them. If you see 0x24, your first instinct should be to treat it as a hexadecimal number.

Another common way of representing hexadecimal values is with colons between individual bytes. Network MAC addresses, SNMP MIB values, X.509 certificates, and other protocols and data structures that use ASN.1 encoding frequently do this. For example, a MAC address might be represented: 00:16:00:89:0a:cf. Note that some programmers will omit unnecessary leading zeros. So the above MAC address could be represented: 0:16:0:89:a:cf. Don’t let the fact that some of the data are single digits persuade you that it isn’t a series of hexadecimal bytes.

Octal data

Octal encoding—Base 8—is somewhat rare, but it comes up from time to time. Unlike some of the other Bases (16, 64, 36), this one uses fewer than all 10 digits and uses no letters at all. The digits 0 to 7 are all that are used. In programming, octal numbers are frequently represented by a leading zero, e.g., 017 is the same as 15 decimal or 0F hexadecimal. Don’t assume octal, however, if you see leading zeroes. Octal is too rare to assume just on that evidence alone. Leading zeroes typically indicate a fixed field size and little else. The key distinguishing feature of octal data is that the digits are all numeric with none greater than 7. Of course, 00000001 fits that description but is probably not octal. In fact, this decoding could be anything, and it doesn’t matter. 1 is 1 is 1 in any of these encodings!

Base 36

Base 36 is rather an unusual hybrid between Base 16 and Base 64. Like Base 16, it begins at 0 and carries on into the alphabet after reaching 9. It does not stop at F, however. It includes all 26 letters up to Z. Unlike Base 64, however, it does not distinguish between uppercase and lowercase letters and it does not include any punctuation. So, if you see a mixture of letters and numbers, and all the letters are the same case (either all upper or all lower), and there are letters in the alphabet beyond F, you’re probably looking at a Base-36 number.


Finding encoders and decoders for Base 16 and Base 8 is easy. Even the basic calculator on Windows can do them. Encoders and decoders for Base 36, however, are somewhat rarer.
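As a rough illustration of the conversions described above, here is a minimal sketch in Python (our choice for these added examples, not a tool the chapter relies on). It uses the built-in int() function, which accepts any base from 2 through 36. The helper name parse_colon_hex is hypothetical.

```python
def parse_colon_hex(text):
    """Convert colon-separated hex bytes (e.g., a MAC address) to a list of ints.

    Single-digit groups such as "a" are accepted, because leading zeros
    are sometimes omitted.
    """
    return [int(group, 16) for group in text.split(":")]


# int() takes the base as its second argument (2 through 36).
print(int("FF", 16))       # hexadecimal
print(int("017", 8))       # octal
print(int("UXFYBDM", 36))  # Base 36; letter case does not matter
print(parse_colon_hex("0:16:0:89:a:cf"))
```

Because int() raises ValueError on a digit that is out of range for the base, it can also serve as a quick test of whether a value could be in a given base at all.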

4.2. Working with Base 64


Base 64 fills a very specific niche: it encodes binary data that is not printable or safe for the channel in which it is transmitted. It encodes that data into something relatively opaque and safe for transmission using just alphanumeric characters and some punctuation. You will encounter Base 64 wrapping most complex parameters that you might need to manipulate, so you will have to decode, modify, and then reencode them.


Install OpenSSL in Cygwin (if you’re using Windows) or make sure you have the openssl command if you’re using another operating system. All known distributions of Linux and Mac OS X will have OpenSSL.

Decode a string

% echo 'Q29uZ3JhdHVsYXRpb25zIQ==' | openssl base64 -d

Encode the entire contents of a file

% openssl base64 -e -in input.txt -out input.b64

This puts the Base 64-encoded output in a file called input.b64.

Encode a simple string

% echo -n '&a=1&b=2&c=3' | openssl base64 -e


You will see Base 64 a lot. It shows up in many HTTP headers (e.g., the Authorization: header) and many cookie values are Base 64-encoded. Many applications encode complex parameters with Base 64 as well. If you see encoded data, especially with equals characters at the end, think Base 64.

Notice the -n after the echo command. This prevents echo from appending a newline character on the end of the string that it is provided. If that newline character is not suppressed, then it will become part of the output. Example 4-1 shows the two different commands and their respective output.

Example 4-1. Embedded newlines in Base 64-encoded strings
% echo -n '&a=1&b=2&c=3' | openssl base64 -e   # Right.
JmE9MSZiPTImYz0z

% echo '&a=1&b=2&c=3' | openssl base64 -e      # Wrong.
JmE9MSZiPTImYz0zCg==

This is also a danger if you insert your binary data or raw data in a file and then use the -in option to encode the entire file. Virtually all editors will put a newline on the end of the last line of a file. If that is not what you want (because your file contains binary data), then you will have to take extra care to create your input.
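The effect of a stray newline can also be seen outside the shell. The following is a minimal Python sketch using the standard base64 module; the variable names are ours and purely illustrative.

```python
import base64

payload = b"&a=1&b=2&c=3"

# Without a trailing newline (what `echo -n` sends).
clean = base64.b64encode(payload).decode("ascii")

# With a trailing newline (what plain `echo` sends).
polluted = base64.b64encode(payload + b"\n").decode("ascii")

print(clean)
print(polluted)
```

The second string is longer and ends in padding, because the extra newline byte starts a new group of input bytes that must be padded out.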

You may be surprised to see us using OpenSSL for this, when clearly there is no SSL or other encryption going on. The openssl command is a bit of a Swiss Army knife. It can perform many operations, not just cryptography.

Recognizing Base 64

Base-64 characters include the entire alphabet, upper- and lowercase, as well as the ten digits 0–9. That gives us 62 characters. Add in plus (+) and solidus (/) and we have 64 characters. The equals sign is also part of the set, but it will only appear at the end. Base-64 encoded output always contains a number of characters that is a multiple of 4. If the input data is not a multiple of 3 bytes long, one or two equals signs (=) are added to the end to pad the output to a multiple of 4 characters. Thus, you will see at most 2 equals signs, but possibly none or 1. The hallmark of Base 64 is the trailing equals. Failing that, it is also the only encoding discussed in this chapter that uses a mixture of both upper- and lowercase letters.
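The recognition rules above can be turned into a rough heuristic. This is a sketch in Python under our own assumptions: the function name looks_like_base64 is hypothetical, and it only checks the standard alphabet (not URL-safe variants). Short strings made only of letters and digits will pass too, so treat a positive result as a hint rather than proof.

```python
import base64
import binascii
import re

# Standard Base-64 alphabet, with up to two "=" padding characters at the end.
_B64_PATTERN = re.compile(r"[A-Za-z0-9+/]+={0,2}")


def looks_like_base64(value):
    """Heuristic: right alphabet, length a multiple of 4, and decodable."""
    if len(value) % 4 != 0 or not _B64_PATTERN.fullmatch(value):
        return False
    try:
        base64.b64decode(value, validate=True)
    except binascii.Error:
        return False
    return True


for candidate in ("Q29uZ3JhdHVsYXRpb25zIQ==", "not base64!", "abc"):
    print(candidate, looks_like_base64(candidate))
```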


It is important to realize that Base 64 is an encoding. It is not encryption (since it can be trivially reversed with no special secret necessary). If you see important data (e.g., confidential data, security data, program control data) Base-64-encoded, just treat it as if it were totally exposed and in the clear—because it is. Given that, put on your hacker’s black hat and ask yourself what you gain by knowing the data that is encoded.

Note also that there is no compression in Base 64. In fact, the encoded data is guaranteed to be larger than the unencoded input. This can be an issue in your database design, for example. If your program changes from storing raw user IDs (that, say, have a maximum size of 8 characters) to storing Base-64-encoded user IDs, you will need 12 characters to store the result. This might have ripple effects throughout the design of the system—a good place to test for security issues!
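The growth in size follows directly from the encoding: every 3 input bytes become 4 output characters, with the last group padded. A small Python sketch (the function name base64_length is ours) makes the arithmetic explicit.

```python
import base64
import math


def base64_length(raw_length):
    """Characters needed to Base-64 encode raw_length bytes, padding included."""
    return 4 * math.ceil(raw_length / 3)


# An 8-character user ID needs 12 characters once encoded.
print(base64_length(8), len(base64.b64encode(b"A" * 8)))
```

The same formula explains the hash sizes later in this chapter: 16 bytes encode to 24 characters and 20 bytes encode to 28.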

Other tools

We showed OpenSSL in this example because it is quick, lightweight, and easily accessible. If you have CAL9000 installed, it will also do Base-64 encoding and decoding easily. Follow the instructions in Recipe 4.5, but select “Base 64” as your encoding or decoding type. You still have to watch out for accidentally pasting newlines into the input boxes.

There is a MIME::Base64 module for Perl. It has shipped with the core Perl distribution since version 5.8, and you’ll almost certainly have it if you use the LibWWWPerl module we discuss in Chapter 8.

4.3. Converting Base-36 Numbers in a Web Page


You need to encode and decode Base-36 numbers and you don’t want to write a script or program to do that. This is probably the easiest way if you just need to convert occasionally.


Brian Risk has created a demonstration website that performs arbitrary conversions from one base to another. You can go back and forth from Base 10 to Base 36 by specifying the two bases in the page. Figure 4-1 shows an example of converting a large Base-10 number to Base 36. To convert from Base 36 to Base 10, simply swap the 10 and the 36 in the web page.

Figure 4-1. Converting between Base 36 and Base 10


Just because this is being done in your web browser does not mean you have to be online and connected to the Internet to do this. In fact, like CAL9000 (see Recipe 4.5), you can save a copy of this page to your local hard drive and then load it in your web browser whenever you need to do these conversions.

4.4. Working with Base 36 in Perl


You need to encode or decode Base-36 numbers a lot. Perhaps you have many numbers to convert or you have to make this a programmatic part of your testing.


Of the tools we use in this book, Perl is the tool of choice. Its Math::Base36 library can be installed using the standard CPAN or ActiveState method for installing modules (see Chapter 2). Example 4-2 shows both encoding and decoding of Base-36 numbers.

Example 4-2. Perl script to convert Base-36 numbers
use Math::Base36 qw(:all);

my $base10num = 67325649178; # should convert to UXFYBDM
my $base36num = "9FFGK4H";   # should convert to 20524000481

my $newb36    = encode_base36( $base10num );
my $newb10    = decode_base36( $base36num );

print "b10 $base10num\t= b36 $newb36\n";
print "b36 $base36num\t= b10 $newb10\n";


For more information on the Math::Base36 module, you can run the command perldoc Math::Base36. In particular, you can get your Base-36 results padded on the left with leading zeros if you want.
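If Perl is not available, the same conversion is short enough to write by hand. The following Python sketch is our own illustration, not part of Math::Base36; the function names simply mirror the Perl ones for readability.

```python
import string

DIGITS = string.digits + string.ascii_uppercase  # "0".."9" then "A".."Z"


def encode_base36(number):
    """Return the Base-36 representation of a non-negative integer."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return "0"
    chars = []
    while number:
        number, remainder = divmod(number, 36)
        chars.append(DIGITS[remainder])
    return "".join(reversed(chars))


def decode_base36(text):
    """Return the integer value of a Base-36 string (either letter case)."""
    return int(text, 36)


print(encode_base36(67325649178))
print(decode_base36("9FFGK4H"))
```

Encoding repeatedly divides by 36 and collects remainders, so the digits come out least significant first and are reversed at the end.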

4.5. Working with URL-Encoded Data


URL-encoded data uses the % character and hexadecimal digits to transmit characters that are not allowed in URLs directly. The space, angle brackets (< and >), and slash (solidus, /) are a few common examples. If you see URL-encoded data in a web application (perhaps in a parameter, input, or some source code) and you need to either understand it or manipulate it, you will have to decode it or encode it.


The easiest way is to use CAL9000 from OWASP. It is a series of HTML web pages that use JavaScript to perform the basic calculations. It gives you an interactive way to copy and paste data in and out and encode or decode it at will.


Enter your decoded data into the “Plain Text” box, then click on the “Url (%XX)” button to the left under “Select Encoding Type.” Figure 4-2 shows the screen and the results.

Figure 4-2. URL encoding with CAL9000


Enter your encoded data into the box labeled “Encoded Text,” then click on the “Url (%XX)” option to the left, under “Select Decoding Type.” Figure 4-3 shows the screen and the results.

Figure 4-3. URL decoding with CAL9000


URL-encoded data is familiar to anyone who has looked at HTML source code or any behind-the-scenes data being sent from a web browser to a web server. RFC 1738 defines URL encoding, but it does not require encoding of certain ASCII characters. Notice that, although it isn’t required, there is nothing wrong with unnecessarily encoding these characters. The encoded data in Figure 4-3 shows an example of this. In fact, redundant encoding is one of the ways that attackers mask their malicious input. Naïve blacklists that check for <script> or even %3cscript%3e might not check for %3c%73%63%72%69%70%74%3e, even though all of them are essentially the same.
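Redundant encoding is easy to generate for test data. This Python sketch uses quote and unquote from the standard urllib.parse module; the helper encode_every_character is a hypothetical name of ours that percent-encodes every byte, needed or not.

```python
from urllib.parse import quote, unquote


def encode_every_character(text):
    """Percent-encode every byte, even those that do not need it."""
    return "".join(f"%{byte:02x}" for byte in text.encode("utf-8"))


print(quote("<script>", safe=""))          # encodes only what must be encoded
print(encode_every_character("<script>"))  # redundant encoding
print(unquote("%3c%73%63%72%69%70%74%3e"))
```

All three forms decode to the same string, which is why a filter that matches only one of them is insufficient.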

One of the great things about CAL9000 is that it is not really software. It is a collection of web pages that have JavaScript embedded in them. Even if your IT policies are super-draconian and you cannot install anything at all on your workstation, you can open these web pages in your browser from a local hard disk and they will work for you. You can easily load them onto a USB drive and load them straight from that drive, so that you never install anything at all.

4.6. Working with HTML Entity Data


The HTML specification provides a way to encode special characters so that they are not interpreted as HTML, JavaScript, or another kind of command. In order to generate test cases and potential attacks, you will need to be able to perform this kind of encoding and decoding.


The easiest choice for this kind of encoding and decoding is CAL9000. We won’t repeat the detailed instructions on how to use CAL9000 because it is pretty straightforward to use. See Recipe 4.5 for detailed instructions.

To encode special characters, you enter the special characters in the box labeled “Plain Text” and choose your encoding. You will want to enter a semicolon (;) in the “Trailing Characters” box in CAL9000.

Decoding HTML Entity-encoded characters is the same process in reverse. Type or paste the entity-encoded characters into the “Encoded Text” box and then click on the “HTML Entity” entry under “Select Decoding Type.”


HTML entity encoding is an area rich with potential mistakes. The authors have seen many web applications perform multiple rounds of entity encoding (e.g., the ampersand is encoded as &amp;amp;) in one part of the display and perform no entity encoding in other parts of the display. It is important to encode correctly, but because there are so many variations on HTML entity encoding, it is very challenging to write a web application that handles encoding correctly everywhere.

Variations on a theme

There are at least five or six legitimate, relatively well-known methods to encode the same character using HTML entity encoding. Table 4-1 shows a few possible encodings for a single character: the less-than sign (<).

Table 4-1. Variations on entity encoding

Encoding variation                         Encoded character
Named entity                               &lt;
Decimal value (ASCII or ISO-8859-1)        &#60;
Hexadecimal value (ASCII or ISO-8859-1)    &#x3c;
Hexadecimal value (long integer)
Hexadecimal value (64-bit integer)

There are even a few more encoding methods that are specific to Internet Explorer. Clearly, from a testing point of view, if you have boundary values or special values you want to test, you have at least six to eight permutations of them: two or three URL-encoded versions and four or five entity-encoded versions.
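To see that several spellings really do decode to the same character, here is a small Python sketch using the standard html module. The zero-padded hexadecimal forms in the list are our own illustration of longer numeric references, not necessarily the exact strings the authors had in mind.

```python
import html

# Several ways to write the less-than sign as an HTML entity.
variants = ["&lt;", "&#60;", "&#x3c;", "&#x003c;", "&#x0000003c;"]

for variant in variants:
    print(variant, "->", html.unescape(variant))

# Encoding, and the double-encoding mistake described earlier.
print(html.escape("<b>Tom & Jerry</b>"))
print(html.escape(html.escape("&")))
```

A test plan for an input filter can loop over such a list so that each special value is tried in every spelling.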

4.7. Calculating Hashes


When your application uses hashes, checksums, or other integrity checks over its data, you need to recognize them and possibly calculate them on test data. If you are unfamiliar with hashes, see the upcoming sidebar “What Are Hashes?”


As with other encoding tasks, you have at least three good choices: OpenSSL, CAL9000, and Perl.


% echo -n "my data" | openssl md5

c:\> type myfile.txt | openssl md5


use Digest::SHA1 qw(sha1_hex);

# sha1_hex() returns 40 printable hex characters; sha1() returns 20 raw bytes.
my $data   = "my data";
my $digest = sha1_hex($data);
print "$digest\n";


The MD5 case is shown using OpenSSL on Unix or on Windows. OpenSSL has an equivalent sha1 command. Note that the -n is required on the Unix echo command to prevent the newline character from being added to the end of your data. Although Windows has an echo command, you can’t use it the same way because there is no way to suppress the carriage-return/linefeed pair that it appends to the message you give it.

The SHA-1 case is shown as a Perl script, using the Digest::SHA1 module. There is an equivalent Digest::MD5 module that works the same way for MD5 hashes.

Note that there is no way to decode a hash. Hashes are mathematical digests that are one-way. No matter how much data goes in, the hash produces exactly the same size output.

MD5 hashes

MD5 hashes produce exactly 128 bits (16 bytes) of data. You might see this represented in a few different ways:

32 hexadecimal characters


24 Base 64 characters

PlnPFeQx5Jj+uwRfh//RSw==. You will see it this way if they take the binary output of MD5 (the raw 128 binary bits) and then Base-64 encode it.

SHA-1 hashes

SHA-1 is a hash that always produces exactly 160 bits (20 bytes) of data. Like MD5, you might see this represented in a few ways:

40 hexadecimal characters


28 Base-64 characters
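The fixed sizes described in this recipe can be checked with a short Python sketch using the standard hashlib and base64 modules. The function name digest_forms is hypothetical.

```python
import base64
import hashlib


def digest_forms(algorithm, data):
    """Return the (hexadecimal, Base-64) text forms of the digest of data."""
    raw = hashlib.new(algorithm, data).digest()
    return raw.hex(), base64.b64encode(raw).decode("ascii")


for name in ("md5", "sha1"):
    hex_form, b64_form = digest_forms(name, b"my data")
    print(name, len(hex_form), hex_form)
    print(name, len(b64_form), b64_form)
```

Whatever the input, the lengths stay the same, which is what makes a hash recognizable by size alone.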


4.8. Recognizing Time Formats


You are likely to see time represented in a lot of different ways. Recognizing a representation of time for what it is will help you build better test cases. Not only knowing that it is time, but knowing what the programmer’s fundamental assumptions might have been when the code was written makes it easier to write targeted test cases.


Obvious time formats encode the year, month, and day in familiar arrangements, providing either two or four digits for the year. Some include hours, minutes, and seconds, possibly with a decimal and milliseconds. Table 4-2 shows several representations of June 1, 2008, 5:32:11 p.m. and 844 milliseconds. Some of the formats do not represent certain parts of the date or time. The unrepresentable parts are omitted as appropriate.

Table 4-2. Various representations of time

Format                                   Example output
YYYYMMDDhhmmss.sss                       20080601173211.844
YYMMDDhhmm                               0806011732
Unix time (seconds since Jan 1, 1970)    1212355931
POSIX in “C” locale                      Sun Jun  1 17:32:11 2008


You may think that recognizing time is pretty obvious and not important to someone testing web applications, especially for security. We would argue that it is actually very important. The authors have seen many applications where time was considered to be unpredictable by the developers. Time has been used in session IDs, temporary filenames, temporary passwords, and account numbers. As a simulated attacker, you know that time is not unpredictable. As we plan “interesting” test cases on a given input field, we can narrow down the set of possible test values dramatically if we know it corresponds to a time value from the recent past or recent future.
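When a parameter is a 10-digit number, a quick check is to interpret it as Unix time and see whether the result is a plausible date. This Python sketch (the function name decode_unix_time is ours) decodes the Unix time value 1212355931 that appears in Example 4-3, reporting it in UTC.

```python
from datetime import datetime, timezone


def decode_unix_time(value):
    """Interpret an integer (or numeric string) as seconds since Jan 1, 1970 UTC."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


print(decode_unix_time("1212355931").isoformat())
```

The result is 21:32:11 UTC on June 1, 2008, which corresponds to 5:32:11 p.m. in a time zone four hours behind UTC. If a decoded value lands near the time you captured the request, you are very likely looking at a timestamp.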

4.9. Encoding Time Values Programmatically


You have determined that your application uses time in some interesting way, and now you want to generate specific values in specific formats.


Perl is a great tool for this job. You will need the Time::Local module for some manipulations of Unix time and the POSIX module for strftime. Both are standard modules. The code in Example 4-3 shows you four different formats and how to calculate them.

Example 4-3. Encoding various time values in Perl
use Time::Local; 
use POSIX qw(strftime); 
# June 1, 2008, 5:32:11pm and 844 milliseconds 
$year  = 2008; 
$month = 5;      # months are numbered starting at 0! 
$day   = 1; 
$hour  = 17;     # use 24-hour clock for clarity 
$min   = 32; 
$sec   = 11; 
$msec  = 844; 

# UNIX Time (Seconds since Jan 1, 1970)     1212355931 
$unixtime = timelocal( $sec, $min, $hour, $day, $month, $year ); 
print "UNIX\t\t\t$unixtime\n"; 

# populate a few values (wday, yday, isdst) that we'll need for strftime 
($sec, $min, $hour, $mday, $mon, $year,
    $wday, $yday, $isdst) = localtime($unixtime);

# YYYYMMDDhhmmss.sss    20080601173211.844 
# We use strftime() because it accounts for Perl's zero-based month numbering 
$timestring = strftime( "%Y%m%d%H%M%S",
	$sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst ); 
$timestring .= ".$msec"; 
print "YYYYMMDDhhmmss.sss\t$timestring\n"; 

# YYMMDDhhmm  0806011732 
$timestring = strftime( "%y%m%d%H%M", $sec,$min,$hour,$mday, 
    $mon,$year,$wday,$yday,$isdst ); 
print "YYMMDDhhmm\t\t$timestring\n"; 

# POSIX in "C" Locale   Sun Jun  1 17:32:11 2008 
$gmtime = localtime($unixtime); 
print "POSIX\t\t\t$gmtime\n";


You can use perldoc Time::Local or man strftime to find out more about possible ways to format time.

Perl’s Time Idiosyncrasies

Although Perl is very flexible and is definitely a good tool for this job, it has its idiosyncrasies. Be careful of the month values when writing code like this. For some inexplicable reason, Perl begins counting months with 0. That is, January is 0, and February is 1, instead of January being 1. Days are not done this way. The first day of the month is 1. Furthermore, you need to be aware of how the year is encoded. It is the number of years since 1900. Thus, 1999 is 99 and 2008 is 108. To get a correct value for the year, you must add 1900. Despite all the year 2000 histrionics, there are websites to this day that show the date as 6/28/108.
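For comparison, here is a rough Python sketch that produces the same four formats. It is an illustration under one stated assumption: a fixed UTC-4 offset (the name ASSUMED_ZONE is ours), chosen because it reproduces the Unix time shown in Example 4-3. Python’s datetime counts months from 1 and uses four-digit years, so the Perl idiosyncrasies above do not apply here.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-4 offset, which reproduces the Unix time in Example 4-3.
ASSUMED_ZONE = timezone(timedelta(hours=-4))

# June 1, 2008, 5:32:11 p.m. and 844 milliseconds. Months start at 1 here.
moment = datetime(2008, 6, 1, 17, 32, 11, 844000, tzinfo=ASSUMED_ZONE)
millis = moment.microsecond // 1000

unix_time = int(moment.timestamp())
long_form = moment.strftime("%Y%m%d%H%M%S") + f".{millis:03d}"
short_form = moment.strftime("%y%m%d%H%M")
posix_form = moment.ctime()

print(unix_time, long_form, short_form, posix_form, sep="\n")
```

To generate boundary cases, change the fields passed to datetime() (for example, a leap day or one second before midnight) and rerun.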

4.10. Decoding ASP.NET’s ViewState


ASP.NET provides a mechanism by which the client can store state, rather than the server. Even relatively large state objects (several kilobytes) can be sent as form fields and posted back by the web browser with every request. This is called the ViewState and is stored in an input called __VIEWSTATE on the form. If your application uses this ViewState, you will want to investigate how the business logic relies on it and develop tests around corrupt ViewStates. Before you can build tests with corrupt ViewStates, you have to understand the use of ViewState in the application.


Get the ViewState Decoder from Fritz Onion. The simplest use case is to copy and paste the URL of your application (or a specific page) into the decoder’s URL field. Figure 4-4 shows version 2.1 of the ViewState decoder and a small snapshot of its output.

Figure 4-4. Decoding ASP.NET ViewState


Sometimes the program fails to fetch the ViewState from the web page. That’s really no problem. Just view the source of the web page (see Recipe 3.2) and search for <input type="hidden" name="__VIEWSTATE"...>. Take the value of that input and paste it into the decoder.

If the example in Figure 4-4 were your application, it would suggest several potential avenues for testing. There are URLs in the ViewState. Can they contain JavaScript or direct a user to another, malicious website? What about the various integer values?

There are several questions you should ask yourself about your application, if it is using ASP.NET and the ViewState:

  • Is any of the data in the ViewState inserted into the URL or HTML of the subsequent page when the server processes it?

    Consider that Figure 4-4 shows several URLs. What if page navigation links were derived from the ViewState in this application? Could a hacker trick someone into visiting a malicious site by sending them a poisoned ViewState?

  • Is the ViewState protected against tampering?

    ASP.NET provides several ways to protect the ViewState. One of them is a simple hash code that will allow the server to trap an exception if the ViewState is modified unexpectedly. Another is an encryption mechanism that makes the ViewState opaque to the client and a potential attacker.

  • Does any of the program logic depend blindly on values from the ViewState?

    Imagine an application where the user type (normal versus administrator) was stored in the ViewState. An attacker merely needs to modify it to change his effective permissions.

When it comes time to create tests for corrupted ViewStates, you will probably use tools like TamperData (see Recipe 3.6) or WebScarab (see Recipe 3.4) to inject new values.
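If you only need a first look at what a ViewState contains, a few lines of script can pull the hidden field out of a saved page and show its readable strings. The following Python sketch is a rough aid, not a replacement for the decoder: it does not parse the serialized structure, the function names are hypothetical, and the regular expression assumes the name attribute appears before the value attribute. An encrypted ViewState will yield no meaningful strings.

```python
import base64
import re

# Matches the hidden __VIEWSTATE input and captures its value attribute.
# Assumption: the name attribute comes before the value attribute.
_VIEWSTATE_RE = re.compile(
    r'<input[^>]*name="__VIEWSTATE"[^>]*value="([^"]*)"', re.IGNORECASE
)


def extract_viewstate(page_source):
    """Return the Base-64-decoded ViewState bytes from a page, or None if absent."""
    match = _VIEWSTATE_RE.search(page_source)
    if match is None:
        return None
    return base64.b64decode(match.group(1))


def printable_strings(data, minimum=4):
    """List runs of printable ASCII, similar in spirit to the Unix strings tool."""
    pattern = rb"[\x20-\x7e]{%d,}" % minimum
    return [run.decode("ascii") for run in re.findall(pattern, data)]
```

Calling printable_strings(extract_viewstate(page_source)) on a saved page gives a quick list of URLs and other text worth targeting in tests.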

4.11. Decoding Multiple Encodings


Sometimes data is encoded multiple times, either intentionally or as a side effect of passing through some middleware. For example, it is common to see the nonalphanumeric characters (=, /, +) in a Base 64-encoded string (see Recipe 4.2) encoded with URL encoding (see Recipe 4.5). For example, V+P//z== might be displayed as V%2bP%2f%2fz%3d%3d. You’ll need to be aware of this so that when you’ve completed one round of successful decoding, you treat the result as potentially more encoded data.


Sometimes a single parameter is actually a specially structured payload that carries many parameters. For example, if we see AUTH=dGVzdHVzZXI6dGVzdHB3MTIz, then we might be tempted to consider AUTH to be one parameter. When we realize that the value decodes to testuser:testpw123, then we realize that it is actually a composite parameter containing a user ID and a password, with a colon as a delimiter. Thus, our tests will have to manipulate the two pieces of this composite differently. The rules and processing in the web application are almost certainly different for user IDs and passwords.
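Peeling off layers and then splitting a composite value can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes exactly two layers (URL encoding around Base 64) and text content underneath.

```python
import base64
from urllib.parse import unquote


def decode_layers(value):
    """Undo URL encoding, then Base 64, and return the resulting text."""
    # unquote() leaves "+" alone, which matters because "+" is a Base-64 digit.
    url_decoded = unquote(value)
    return base64.b64decode(url_decoded).decode("utf-8")


def split_credentials(decoded):
    """Split a user:password composite at the first colon."""
    user, _, password = decoded.partition(":")
    return user, password


print(decode_layers("Q29uZ3JhdHVsYXRpb25zIQ%3d%3d"))
print(split_credentials(decode_layers("dGVzdHVzZXI6dGVzdHB3MTIz")))
```

Once the pieces are separated, each can be mutated on its own and the whole value reassembled and reencoded in the reverse order.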


We don’t usually include quizzes as a follow-up to a recipe, but in this case it might be worthwhile. Recognizing data encodings is a pretty important skill, and an exercise here may help reinforce what we’ve just explained. Remember that some of them may be encoded more than once. See if you can determine the kind of data for each of the following (answers in the footnotes):

  1. xIThJBeIucYRX4fqS+wxtR8KeKk=[1]

  2. TW9uIEFwciAgMiAyMjoyNzoyMSBFRFQgMjAwNwo=[2]

  3. 4BJB39XF[3]

  4. F8A80EE2F6484CF68B7B72795DD31575[4]

  5. 0723034505560231[5]

  6. 713ef19e569ded13f2c7dd379657fe5fbd44527f[6]

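[1] SHA1 encoded with Base 64 (28 characters, which decode to 20 bytes)

[2] A date string (Mon Apr  2 22:27:21 EDT 2007, followed by a newline) encoded with Base 64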

[3] Base 36

[4] Hexadecimal MD5

[5] Octal

[6] Hexadecimal SHA1
