CGI Programming on the World Wide Web
By Shishir Gundavaram
1st Edition March 1996
This book is out of print, but it has been made available online through the O'Reilly Open Books Project.
Earlier in this chapter we mentioned the application/x-www-form-urlencoded MIME type. The browser uses this MIME type to encode the form data.
First, each form element's name, specified by the NAME attribute, is paired with the value entered by the user to create a key-value pair. For example, if the user entered "30" when asked for the age, the key-value pair would be age=30. Key-value pairs are separated from one another by the "&" character.
Second, since the variable names for the form elements and the actual form data are ordinary text, this text could contain characters that have a special meaning in a URL (such as "&" and "=") or that would otherwise confuse browsers and servers. To prevent possible errors, the encoding scheme translates all "special" characters to their corresponding hexadecimal codes, each preceded by a percent sign. These "special" characters include control characters and certain non-alphanumeric symbols. For example, the string "Thanks for the help!" would be converted to "Thanks%20for%20the%20help%21". This process is repeated for each key-value pair to create a query string.
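The two encoding steps can be sketched in a few lines of Perl. This is only an illustration of the idea, not what any particular browser does internally; the function names url_encode and build_query_string are hypothetical, and the sketch assumes plain ASCII input.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace every character that is not a letter,
# digit, or underscore with "%" followed by its two-digit hex code.
# Assumes ASCII input.
sub url_encode {
    my ($text) = @_;
    $text =~ s/(\W)/sprintf("%%%02X", ord($1))/eg;
    return $text;
}

# Hypothetical helper: encode each name and value, then join the
# resulting pairs with "&". The fields are passed as a flat list
# (name, value, name, value, ...) so their order is preserved.
sub build_query_string {
    my @fields = @_;
    my @pairs;
    while (my ($name, $value) = splice(@fields, 0, 2)) {
        push @pairs, url_encode($name) . "=" . url_encode($value);
    }
    return join("&", @pairs);
}

print build_query_string(user => "Larry Bird", age => 35, pass => "testing"), "\n";
```

Run from the shell, this prints user=Larry%20Bird&age=35&pass=testing.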
Before the forms interface, the only way you could retrieve user information was through a search field (i.e., <ISINDEX>), which passed the data to the server with spaces converted to plus signs ("+").
For text and password fields, the user input will represent the value. If no information was entered, the key-value pair will be sent anyway, with the value left blank (i.e., "name=").
For radio buttons and checkboxes, the VALUE attribute represents the value when the button element is checked. If no VALUE is specified, the value defaults to "on." An unchecked checkbox will not be sent as a key-value pair; it will be ignored.
The CGI program then has to "decode" this information in order to access the form data. The encoding scheme is the same for both GET and POST.
There are two methods for sending form data: GET and POST. The main difference between these methods is the way in which the form data is passed to the CGI program. If the GET method is used, the query string is simply appended to the URL of the program when the client issues the request to the server. This query string can then be accessed by using the environment variable QUERY_STRING. Here is a sample GET request by the client, which corresponds to the first form example:
GET /cgi-bin/program.pl?user=Larry%20Bird&age=35&pass=testing HTTP/1.0
Accept: www/source
Accept: text/html
Accept: text/plain
User-Agent: Lynx/2.4 libwww/2.14
As we discussed in Chapter 2, the query string is appended to the URL after the "?" character. The server then takes this string and assigns it to the environment variable QUERY_STRING.
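On the program's side, picking up the query string is a single environment lookup. The following is a minimal sketch; the function name raw_query is hypothetical, and the block that sets QUERY_STRING by hand only stands in for the server so the example can be run from the shell.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the raw (still encoded) query string,
# or an empty string if the request carried none.
sub raw_query {
    my $query = $ENV{QUERY_STRING};
    return defined $query ? $query : "";
}

{
    # Stand-in for the server: a real CGI program would not set this.
    local $ENV{QUERY_STRING} = "user=Larry%20Bird&age=35&pass=testing";

    print "Content-type: text/plain\n\n";
    print "The server passed: ", raw_query(), "\n";
}
```

Note that the string is still encoded at this point; it has to be split and decoded before the individual fields can be used.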
The information in the password field is not encrypted in any way; it is plain text. You have to be very careful when asking for sensitive data using the password field. If you want security, please use server authentication.
The GET method has both advantages and disadvantages. The main advantage is that you can access the CGI program with a query without using a form. In other words, you can create "canned queries." Basically, you are passing parameters to the program. For example, if you want to send the previous query to the program directly, you can do this:
<A HREF="/cgi-bin/program.pl?user=Larry%20Bird&age=35&pass=testing">CGI Program</A>
Here is a simple program that will aid you in encoding data:
#!/usr/local/bin/perl

print "Please enter a string to encode: ";
$string = <STDIN>;
chomp ($string);

# Replace each non-word character with "%" plus its two-digit hex code.
$string =~ s/(\W)/sprintf("%%%02X", ord($1))/eg;

print "The encoded string is: ", "\n";
print $string, "\n";
exit(0);
This is not a CGI program; it is meant to be run from the shell. When you run it, the program prompts you for a string to encode. The <STDIN> operator reads one line from standard input. It is similar to the <FILEHANDLE> construct we have been using. The chomp command removes the trailing newline character ("\n") from the input string. Finally, each character in the string that is not a letter, digit, or underscore is replaced by a percent sign followed by its two-digit hexadecimal code, produced with the sprintf command, and the result is printed to standard output.
A query is one method of passing information to a CGI program via the URL. The other method involves sending extra path information to the program. Here is an example:
<A HREF="/cgi-bin/program.pl/user=Larry%20Bird/age=35/pass=testing">CGI Program</A>
The string "/user=Larry%20Bird/age=35/pass=testing" will be placed in the environment variable PATH_INFO when the request gets to the CGI program. This method of passing information to the CGI program is generally used to provide file information, rather than form data. The NCSA imagemap program works in this manner by passing the filename of the selected image as extra path information.
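A program that receives extra path information can break it into its components by splitting on the slash. The sketch below is an assumption about how one might do this, not the approach used by the imagemap program; the function name path_segments is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split extra path information into its
# slash-separated components. Returns an empty list if there is none.
sub path_segments {
    my ($path_info) = @_;
    return () unless defined $path_info;
    $path_info =~ s{^/}{};            # drop the leading slash
    return split m{/}, $path_info;
}

# In a CGI program the argument would be $ENV{PATH_INFO}; a literal
# string is used here so the example can be run from the shell.
my @segments = path_segments("/user=Larry%20Bird/age=35/pass=testing");
print "$_\n" for @segments;
```

Each component is printed on its own line, still in its encoded form.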
If you use the "question-mark" method or the pathname method to pass data to the program, you have to be careful, as the browser or the server may truncate data that exceeds an arbitrary number of characters.
Now, here is a sample POST request:
POST /cgi-bin/program.pl HTTP/1.0
Accept: www/source
Accept: text/html
Accept: text/plain
User-Agent: Lynx/2.4 libwww/2.14
Content-type: application/x-www-form-urlencoded
Content-length: 37

user=Larry%20Bird&age=35&pass=testing
The main advantage of the POST method is that the query length can be unlimited; you don't have to worry about the client or server truncating data. The drawback is that you cannot create "canned queries." To get data sent by the POST method, the CGI program reads from standard input.
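Reading POST data might look like the following sketch. The function name read_post_body is hypothetical. It takes the filehandle as a parameter so the example can be tried from the shell with an in-memory filehandle; a CGI program would pass \*STDIN and $ENV{CONTENT_LENGTH} instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read exactly $length bytes of request body
# from $fh. Returns whatever was read if the input ends early.
sub read_post_body {
    my ($fh, $length) = @_;
    my $body = "";
    return $body unless defined $length && $length =~ /^\d+$/;

    # read() may return fewer bytes than requested, so keep reading
    # until the full length has arrived or the input is exhausted.
    while (length($body) < $length) {
        my $count = read($fh, $body, $length - length($body), length($body));
        last unless $count;           # 0 at end of input, undef on error
    }
    return $body;
}

# Stand-in for standard input, so the sketch runs without a server.
my $sample = "user=Larry%20Bird&age=35&pass=testing";
open(my $input, '<', \$sample) or die "Cannot open in-memory input: $!";
print read_post_body($input, length($sample)), "\n";
close($input);
```

The program reads only as many bytes as Content-length announces rather than reading until end of file, since the server is not required to close standard input after the data.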
In order to access the information contained within the form, a decoding procedure must be applied to the data. First, the program must determine how the data was passed by the client. This can be done by examining the value in the environment variable REQUEST_METHOD. If the value indicates a GET request, either the query string or the extra path information must be obtained from the environment variables. On the other hand, if it is a POST request, the number of bytes specified by the CONTENT_LENGTH environment variable must be read from standard input. The algorithm for decoding form data follows:
- Determine the request method (either GET or POST) by checking the REQUEST_METHOD environment variable.
- If the method is GET, read the query string from QUERY_STRING and/or the extra path information from PATH_INFO.
- If the method is POST, determine the size of the request using CONTENT_LENGTH and read that amount of data from the standard input.
- Split the query string on the "&" character, which separates key-value pairs (the format is key=value&key=value...).
- Decode the hexadecimal and "+" characters in each key-value pair.
- Create a key-value table with the key as the index. (If this sounds complicated, don't worry, just use a high-level language like Perl. The language makes it pretty easy.)
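The steps above can be sketched in Perl as follows. This is one possible arrangement under stated assumptions, not a definitive implementation: the function names url_decode, parse_pairs, and read_form_data are hypothetical, extra path information is ignored, and when a key appears more than once only its last value is kept.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn "+" back into a space and each %XX
# sequence back into the character it stands for.
sub url_decode {
    my ($text) = @_;
    $text =~ tr/+/ /;
    $text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
    return $text;
}

# Hypothetical helper: split on "&", then on the first "=", decode
# both halves, and store them in a hash indexed by key.
sub parse_pairs {
    my ($data) = @_;
    my %form;
    foreach my $pair (split /&/, $data) {
        my ($key, $value) = split /=/, $pair, 2;
        next unless defined $key && length $key;
        $value = "" unless defined $value;
        $form{ url_decode($key) } = url_decode($value);
    }
    return %form;
}

# Hypothetical helper: choose the data source by request method.
sub read_form_data {
    my ($fh) = @_;
    my $method = $ENV{REQUEST_METHOD} || "GET";
    my $data   = "";
    if ($method eq "POST") {
        my $length = $ENV{CONTENT_LENGTH};
        read($fh, $data, $length)
            if defined $length && $length =~ /^\d+$/ && $length > 0;
    } else {
        $data = defined $ENV{QUERY_STRING} ? $ENV{QUERY_STRING} : "";
    }
    return parse_pairs($data);
}

{
    # Stand-in for the server so the sketch runs from the shell.
    local $ENV{REQUEST_METHOD} = "GET";
    local $ENV{QUERY_STRING}   = "user=Larry%20Bird&age=35&pass=testing";
    my %form = read_form_data(\*STDIN);
    print "$_ = $form{$_}\n" for sort keys %form;
}
```

Splitting on the first "=" only (the limit of 2) keeps any later "=" characters in the value, and an empty value such as "name=" still produces an entry in the table.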
You might wonder why a program needs to check the request method, when you know exactly what type of request the form is sending. The reason is that by designing the program in this manner, you can use one module that takes care of both types of requests. It can also be beneficial in another way.
Say you have a form that sends a POST request, and a program that decodes both GET and POST requests. Suppose you know that there are three fields: user, age, and pass. You can fill out the form, and the client will send the information as a POST request. However, you can also send the information as a query string because the program can handle both types of requests; this means that you can save the step of filling out the form. You can even save the complete request as a hotlist item, or as a link on another page.
© 2001, O'Reilly & Associates, Inc.