Writing Apache Modules with Perl and C The Apache API and mod_perl

By Lincoln Stein, Doug MacEachern


Table of Contents

Chapter 1: Server-Side Programming with Apache
Before the World Wide Web appeared, client/server network programming was a drag. Application developers had to develop the communications protocol, write the low-level network code to reliably transmit and receive messages, create a user interface at the client side of the connection, and write a server to listen for incoming requests, service them properly, and transmit the results back to the client. Even simple client/server applications were many thousand lines of code, the development pace was slow, and programmers worked in C.
When the web appeared in the early '90s, all that changed. The web provided a simple but versatile communications protocol standard, a universal network client, and a set of reliable and well-written network servers. In addition, the early servers provided developers with a server extension protocol called the Common Gateway Interface (CGI). Using CGI, a programmer could get a simple client/server application up and running in 10 lines of code instead of thousands. Instead of being limited to C or another "systems language," CGI allowed programmers to use whatever development environment they felt comfortable with, whether that be the command shell, Perl, Python, REXX, Visual Basic, or a traditional compiled language. Suddenly client/server programming was transformed from a chore into a breeze. The number of client/server applications increased 100-fold over a period of months, and a new breed of software developer, the "web programmer," appeared.
The face of network application development continues its rapid pace of change. Open the pages of a web developer's magazine today and you'll be greeted by a bewildering array of competing technologies. You can develop applications using server-side include technologies such as PHP or Microsoft's Active Server Pages (ASP). You can create client-side applications with Java, JavaScript, or Dynamic HTML (DHTML). You can serve pages directly out of databases with products like the Oracle web server or Lotus Domino. You can write high-performance server-side applications using a proprietary server application programming interface (API). Or you can combine server- and client-side programming with integrated development environments like Netscape's LiveWire or NeXT's WebObjects. CGI scripting is still around too, but enhancements like FastCGI and ActiveState's Perl ISAPI are there to improve script performance.
Additional content appearing in this section has been removed.
Purchase this book now or read it online at Safari to get the whole thing!
Web Programming Then and Now
In the beginning was the web server. Specifically, in the very very beginning was CERN httpd, a C-language server developed at CERN, the European high-energy physics lab, by Tim Berners-Lee, Ari Luotonen, and Henrik Frystyk Nielsen around 1991. CERN httpd was designed to serve static web pages. The server listened to the network for Uniform Resource Locator (URL) requests using what would eventually be called the HTTP/0.9 protocol, translated the URLs into file paths, and returned the contents of the files to the waiting client. If you wanted to extend the functionality of the web server (for example, to hook it up to a bibliographic database of scientific papers), you had to modify the server's source code and recompile.
This was neither very flexible nor very easy to do. So early on, CERN httpd was enhanced to launch external programs to handle certain URL requests. Special URLs, recognized with a complex system of pattern matching and string transformation rules, would invoke a command shell to run an external script or program. The output of the script would then be redirected to the browser, generating a web page on the fly. A simple scheme allowed users to pass argument lists to the script, allowing developers to create keyword search systems and other basic applications.
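The mechanism described above, in which the server hands a request to a command shell and relays whatever the script prints, can be sketched in a few lines of C. This is only an illustration built on the POSIX popen() call, not the CERN httpd code; the function name run_script and the fixed-size buffer are assumptions made for the example.

```c
#define _POSIX_C_SOURCE 200809L  /* for popen() and pclose() */
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: run `command` through the shell and copy what it
 * writes to standard output into `buf` (at most size-1 bytes, always
 * NUL-terminated). Returns 0 on success, or -1 if the command could not
 * be started or finished with a nonzero status.
 */
int run_script(const char *command, char *buf, size_t size)
{
    FILE *out;
    char scratch[256];
    size_t used = 0;
    size_t n;

    assert(command != NULL && buf != NULL && size > 0);

    out = popen(command, "r");    /* the shell interprets the command line */
    if (out == NULL)
        return -1;

    /* A real server would stream this to the client instead of buffering. */
    while (used < size - 1 &&
           (n = fread(buf + used, 1, size - 1 - used, out)) > 0)
        used += n;
    buf[used] = '\0';

    /* Drain and discard anything that did not fit in the buffer. */
    while (fread(scratch, 1, sizeof scratch, out) > 0)
        ;

    return pclose(out) == 0 ? 0 : -1;
}
```

Calling run_script("echo hello", buf, sizeof buf) leaves the line printed by the command in buf. Starting a new shell for every request is exactly the per-request cost that later server APIs set out to avoid.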
Meanwhile, Rob McCool, of the National Center for Supercomputing Applications at the University of Illinois, was developing another web server to accompany NCSA's browser product, Mosaic. NCSA httpd was smaller than CERN httpd, faster (or so the common wisdom had it), had a host of nifty features, and was easier than the CERN software to configure and install. It quickly gained ground on CERN httpd, particularly in the United States. Like CERN httpd, the NCSA product had a facility for generating pages on the fly with external programs but one that differed in detail from CERN httpd's. Scripts written to work with NCSA httpd wouldn't work with CERN httpd and vice versa.
Fortunately for the world, the CERN and the NCSA groups did not cling tenaciously to "their" standards as certain latter-day software vendors do. Instead, the two groups got together along with other interested parties and worked out a common standard called the Common Gateway Interface.
The Apache Project
This book is devoted to developing applications with the Apache web server API, so we turn our attention now to the short history of the Apache project.
The Apache project began in 1995 when a group of eight volunteers, seeing that web software was becoming increasingly commercialized, got together to create a supported open source web server. Apache began as an enhanced version of the public-domain NCSA server but steadily diverged from the original. Many new features have been added to Apache over the years: significant features include the ability for a single server to host multiple virtual web sites, a smorgasbord of authentication schemes, and the ability for the server to act as a caching proxy. In some cases, Apache is way ahead of the commercial vendors in the features wars. For example, at the time this book was written only the Apache web server had implemented the HTTP/1.1 Digest Authentication scheme.
Internally the server has been completely redesigned to use a modular and extensible architecture, turning it into what the authors describe as a "web server toolkit." In fact, there's very little of the original NCSA httpd source code left within Apache. The main NCSA legacy is the configuration files, which remain backward-compatible with NCSA httpd.
Apache's success has been phenomenal. In less than three years, Apache has risen from relative obscurity to the position of market leader. Netcraft, a British market research company that monitors the growth and usage of the web, estimates that Apache servers now run on over 50 percent of the Internet's web sites, making it by far the most popular web server in the world. Microsoft, its nearest rival, holds a mere 22 percent of the market. This is despite the fact that Apache has lacked some of the conveniences that common wisdom holds to be essential, such as a graphical user interface for configuration and administration.
Apache has been used as the code base for several commercial server products. The most successful of these, C2Net's Stronghold, adds support for secure communications with Secure Socket Layer (SSL) and a form-based configuration manager. There is also WebTen by Tenon Intersystems, a Macintosh PowerPC port, and the Red Hat Secure Server, an inexpensive SSL-supporting server from the makers of Red Hat Linux.
The Apache C and Perl APIs
The Apache module API gives you access to nearly all of the server's internal processing. You can inspect what it's doing at each step of the HTTP transaction cycle and intervene at any of the steps to customize the server's behavior. You can arrange for the server to take custom actions at startup and exit time, add your own directives to its configuration files, customize the process of translating URLs into file names, create custom authentication and authorization systems, and even tap into the server's logging system. This is all done via modules—self-contained pieces of code that can either be linked directly into the server executable, or loaded on demand as a dynamic shared object (DSO).
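To make the idea of intervening at a particular step concrete, here is a small, self-contained model of a server that offers per-phase hooks to its modules. It is a sketch under stated assumptions rather than Apache's implementation: the types mini_module and mini_request, the two phases, the path /srv/docs/, and the function run_phase are all hypothetical, and only the general convention of a module either handling a phase or declining it is taken from the API described here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MINI_OK        0   /* the module handled this phase */
#define MINI_DECLINED -1   /* the module passes; try the next one */

typedef struct {
    const char *uri;        /* what the client asked for */
    char filename[128];     /* filled in by the translation phase */
    int logged;             /* set by the logging phase */
} mini_request;

/* One slot per phase; NULL means "this module is not interested". */
typedef struct {
    const char *name;
    int (*translate)(mini_request *r);
    int (*logger)(mini_request *r);
} mini_module;

enum mini_phase { PHASE_TRANSLATE, PHASE_LOG };

/* Offer the phase to each module in turn until one does not decline. */
int run_phase(const mini_module *mods, size_t count,
              enum mini_phase phase, mini_request *r)
{
    size_t i;

    assert(mods != NULL && r != NULL);
    for (i = 0; i < count; i++) {
        int (*hook)(mini_request *) =
            phase == PHASE_TRANSLATE ? mods[i].translate : mods[i].logger;
        if (hook != NULL) {
            int rc = hook(r);
            if (rc != MINI_DECLINED)
                return rc;
        }
    }
    return MINI_DECLINED;
}

/* Example hook: map URIs under /docs/ onto a made-up directory. */
int docs_translate(mini_request *r)
{
    if (strncmp(r->uri, "/docs/", 6) != 0)
        return MINI_DECLINED;
    strcpy(r->filename, "/srv/docs/");
    strncat(r->filename, r->uri + 6,
            sizeof r->filename - strlen(r->filename) - 1);
    return MINI_OK;
}

/* Example hook: record that the request was logged. */
int note_logger(mini_request *r)
{
    r->logged = 1;
    return MINI_OK;
}
```

A module that leaves a slot NULL is simply skipped for that phase, which is why a module can hook one step of the transaction without caring about the others. The real server's rules for how many modules get a turn in each phase are more nuanced than the single rule used in this sketch.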
The Apache module API was intended for C programmers. To write a traditional compiled module, you prepare one or more C source files with a text editor, compile them into object files, and either link them into the server binary or move them into a special directory for DSOs. If the module is implemented as a DSO, you'll also need to edit the server configuration file so that the module gets loaded at the appropriate time. You'll then launch the server and begin the testing and debugging process.
This sounds like a drag, and it is. It's even more of a drag because you have to worry about details of memory management and configuration file processing that are tangential to the task at hand. A mistake in any one of these areas can crash the server.
For this reason, the Apache server C API has generally been used only for substantial modules which need high performance, tiny modules that execute very frequently, or anything that needs access to server internals. For small to medium applications, one-offs, and other quick hacks, developers have used CGI scripts, FastCGI, or some other development system.
Things changed in 1996 when Doug MacEachern introduced mod_perl, a complete Perl interpreter wrapped within an Apache module. This module makes almost the entire Apache API available to Perl programmers as objects and method calls. The parts that it doesn't export are C-specific routines that Perl programmers don't need to worry about. Anything that you can do with the C API you can do with
Ideas and Success Stories
To give you an impression of the power and versatility of the Apache API, here are some examples of what people have done with it. Some of the modules described here have been incorporated into Apache and are now part of the standard distribution. Others are third-party modules that have been developed to solve particular mission-critical tasks.
A movie database
The Internet Movie Database (http://www.imdb.com/) uses mod_perl to make queries against a vast database of film and television movies. The system rewrites URLs on the fly in order to present pages in the language of the user's choice and to quickly retrieve the results of previously cached searches. In 1998, the site won the coveted Webby award for design and service.
No more URL spelling errors
URLs are hard things to type, and many HTML links are broken because of a single typo in a long URL. The most frequent errors are problems with capitalization, since many HTML authors grew up in a case-insensitive MS-DOS/Windows world before entering the case-sensitive web.
mod_speling [sic], part of the standard Apache distribution, is a C-language module that catches and fixes typographical errors on the fly. If no immediate match to a requested URL is found, it checks for capitalization variations and a variety of character insertions, omissions, substitutions, and transpositions, trying to find a matching valid document on the site. If one is found, it generates a redirect request, transparently forwarding the browser to the correct resource. Otherwise, it presents the user with a menu of closest guesses to choose from.
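The kinds of near miss listed above (capitalization, transposition, omission, insertion, substitution) can each be detected with a simple string comparison. The following sketch classifies a requested name against a name that actually exists. It is not the mod_speling source: the enum near_miss, its constants, and the function classify are hypothetical names, and the sketch only recognizes a single typo in an otherwise identical name.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical result codes, ordered roughly from best match to worst. */
enum near_miss {
    NM_IDENTICAL,      /* same string */
    NM_CASE,           /* differs only in capitalization */
    NM_TRANSPOSITION,  /* two adjacent characters swapped */
    NM_OMISSION,       /* the request is missing one character */
    NM_INSERTION,      /* the request has one extra character */
    NM_SUBSTITUTION,   /* one character is wrong */
    NM_DIFFERENT       /* too far apart to guess */
};

/* Return 1 if the two strings are equal when case is ignored. */
static int same_ignoring_case(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/* Compare what the client requested with a name that exists on disk. */
enum near_miss classify(const char *requested, const char *actual)
{
    const char *r, *a;
    size_t i = 0;

    assert(requested != NULL && actual != NULL);

    if (strcmp(requested, actual) == 0)
        return NM_IDENTICAL;
    if (same_ignoring_case(requested, actual))
        return NM_CASE;

    /* Skip the common prefix; r and a point at the first mismatch. */
    while (requested[i] != '\0' && requested[i] == actual[i])
        i++;
    r = requested + i;
    a = actual + i;

    if (r[0] != '\0' && a[0] != '\0' &&
        r[0] == a[1] && r[1] == a[0] && strcmp(r + 2, a + 2) == 0)
        return NM_TRANSPOSITION;
    if (a[0] != '\0' && strcmp(r, a + 1) == 0)
        return NM_OMISSION;
    if (r[0] != '\0' && strcmp(r + 1, a) == 0)
        return NM_INSERTION;
    if (r[0] != '\0' && a[0] != '\0' && strcmp(r + 1, a + 1) == 0)
        return NM_SUBSTITUTION;
    return NM_DIFFERENT;
}
```

A server using such a function would run it over the entries of the requested directory, redirect when exactly one good candidate is found, and fall back to a menu of candidates otherwise.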
An on-campus housing renewal system
At Texas A&M University, students have to indicate each academic year whether they plan to continue living in campus-provided housing. For the 1997-1998 academic year, the university decided to move the process from its current error-prone manual system to a web-based solution. The system was initially implemented using ActiveWare's PerlScript to drive a set of Microsoft Internet Information Server Active Server Pages, but with less than two weeks to go before deployment it was clear that the system would be too slow to handle the load. The system was hurriedly rewritten to use
Chapter 2: A First Module
This chapter covers the mechanics of developing Apache extension modules in the Perl and C APIs. First we'll show you how to install mod_perl, which you'll need for all Perl API modules, and how to write a simple "Hello World" script. Then we'll show you an equivalent C module implemented both as statically linked code and as a dynamic shared object.
We won't go into the gory details of Apache internals in this chapter (that's deferred until Chapter 3), but by the end you'll understand the mechanics of getting a new Apache module up and running.
Preliminaries
Before you can start hacking away at your own Apache modules, there are a number of preliminaries to take care of. This section discusses what you need and how you can get it if you don't have it already.
You'll need a working version of Apache, preferably a recent release (the version we used to prepare this book was Version 1.3.4). If you do not already have Apache, you can download it, free of charge, from http://www.apache.org/.
Users of Windows 95 and NT systems (henceforth called "Win32") who want to write modules using the Perl API can download precompiled binaries. You will need two components: the server itself, available at http://www.apache.org/dist/, and ApacheModulePerl.dll, which is mod_perl implemented as a dynamically loadable module. ApacheModulePerl.dll has been made available by Jeffrey W. Baker. You can find it on the Comprehensive Perl Archive Network (CPAN) in the directory authors/Jeffrey_Baker/. Win32 users with access to the Microsoft Visual C++ development environment can also compile ApacheModulePerl.dll from mod_perl source code.
This book will not try to teach you how to install and maintain an Apache-based web site. For the full details, see the Apache server's excellent online documentation or the reference books listed in the preface.
To use the C API, you'll need a working C compiler and its associated utilities. Most Unix systems come with the necessary software development tools preinstalled, but sometimes the bundled tools are obsolete or nonstandard (SunOS and HP-UX systems are particularly infamous in this regard). To save yourself some headaches, you may want to install the GNU gcc compiler and make programs. They are available via anonymous FTP from ftp://prep.ai.mit.edu, in the directory /pub/gnu, or via the web at http://www.gnu.org/.
Win32 users are not so lucky. To develop C API modules, you will need the Microsoft Visual C++ 5.0 development package. No other development environment is guaranteed to work, although you are certainly welcome to try; Borland C++ is reported to work in some people's hands. If you are primarily interested in the Perl API, you can use the precompiled binaries mentioned in the previous section.
Directory Layout Structure
We refer to a variety of special files and directories throughout this book. Although there is a standard Apache server layout, this standard has changed over time and many sites have extensively customized their layout. Furthermore, some operating systems which come with Apache preinstalled choose a nonstandard directory structure that is more consistent with the OS's way of doing things. To avoid potential confusion, we explain the directory structure we use in this book. If you are installing Apache and mod_perl for the first time, you might want to follow the suggestions given here for convenience.
Server root directory
This is the top of the Apache server tree. In a typical setup, this directory contains a bin directory for the httpd Apache executable and the apachectl control utility; the configuration and log directories (conf and logs); a directory for executable CGI scripts, cgi-bin; a directory for dynamically loaded modules, libexec; header files for building C-language modules, include; and the document root directory, htdocs.
The default server root directory on Unix machines is /usr/local/apache, which we'll use throughout the book. However, in order to avoid typing this long name, we suggest creating a pseudo-user named www with /usr/local/apache as its home directory. This allows you to refer to the server root quickly as ~www.
On Win32 systems, the default server root is C:\Program Files\Apache Group\Apache. However, many people change that to simply C:\Apache, as we do here. Readers who use this platform should mentally substitute ~www with the path to their true server root.
Document root directory
This is the top of the web document tree, the default directory from which the server fetches documents when the remote user requests
Installing mod_perl
In order to use the Perl API, you'll need to download and install mod_perl if you haven't done so already. This section will describe the simplest way to do this. If you've already installed mod_perl you'll want to skip this section or jump directly to Appendix B, where we give you the lowdown on mod_perl's advanced installation options.
If you are a Win32 user, you can skip to Section 2.3.2 and download the precompiled ApacheModulePerl.dll loadable module. We'll show you how to activate ApacheModulePerl.dll at the end of the section.
mod_perl is part of the CPAN archive. FTP to a CPAN site close to you and enter the directory modules/by-module/Apache/. Download the file mod_perl-X.XX.tar.gz, where X.XX is the highest version number you find.
It is easiest to build mod_perl when it is located at the same level as the Apache source tree. Change your working directory to the source directory of the server root, and unpack the mod_perl distribution using the gunzip and tar tools:
  % cd ~www/build
  % gunzip -c mod_perl-X.XX.tar.gz | tar xvf -
  mod_perl-X.XX/t/
  mod_perl-X.XX/t/docs/
  mod_perl-X.XX/t/docs/env.iphtml
  mod_perl-X.XX/t/docs/content.shtml
  mod_perl-X.XX/t/docs/error.txt
   ....
  % cd mod_perl-X.XX
Now, peruse the README and INSTALL files located in the mod_perl directory. These files contain late-breaking news, installation notes, and other information.
The next step is to configure, build, and install mod_perl. Several things happen during this process. First, an installation script named Makefile.PL generates a top-level Makefile and runs Apache's configure script to add mod_perl to the list of C modules compiled into the server. After this, you run make to build the mod_perl object file and link it into a new version of the Apache server executable. The final steps of the install process are to test this new executable and, if it checks out, to move
"Hello World" with the Perl API
Now that you have mod_perl installed, it's time to put the Perl API through its paces.
First you'll need to create a location for your Apache Perl modules to live. If you haven't done so already, create a directory in some convenient place. We suggest creating a lib subdirectory within the server root, and a perl directory within that, making the full location ~www/lib/perl (Unix), or C:\Apache\lib\perl (Win32). Within this directory, create yet another directory for modules that live in the Apache:: namespace (which will be the vast majority of the modules we write), namely ~www/lib/perl/Apache.
You'll now have to tell Apache where to look for these modules. mod_perl uses the same include path mechanism to find required modules that Perl does, and you can modify the default path either by setting the environment variable PERL5LIB to a colon-delimited list of directories to search before Apache starts up or by calling use lib '/path/to/look/in' when the interpreter is first launched. The first technique is most convenient to use in conjunction with the PerlSetEnv directive, which sets an environment variable. Place this directive somewhere early in your server configuration file:
PerlSetEnv PERL5LIB /my/lib/perl:/other/lib/perl
Unfortunately this adds a little overhead to each request. Instead, we recommend creating a Perl startup file that runs the use lib statement. You can configure mod_perl to invoke a startup file of common Perl commands each time the server is launched or restarted. This is the logical place to put the use lib statement. Here's a small startup file to get you started:
#!/usr/local/bin/perl

# modify the include path before we do anything else
BEGIN {
   use Apache ();
   use lib Apache->server_root_relative('lib/perl');
}

# commonly used modules
use Apache::Registry ();
use Apache::Constants();
use CGI qw(-compile :all);
use CGI::Carp ();

# put any other common modules here
# use Apache::DBI ();
# use LWP ();
# use DB_File ();
1;
"Hello World" with the C API
In this section we will create the same "Hello World" module using the C API. This will show you how closely related the two APIs really are. Many of the details in this section are specific to Unix versions of Apache. For differences relating to working in Win32 environments, be sure to read Section 2.5.3.
The preparation for writing C API modules is somewhat simpler than that for the Perl modules. You just need to create a subdirectory in the Apache source tree to hold your site-specific source code. We recommend creating a directory named site in the modules subdirectory. The complete path to the directory will be something like ~www/src/modules/site (C:\Apache\src\modules\site on Win32 systems).
To have this new subdirectory participate in the server build process, create a file within it named Makefile.tmpl. For simple modules that are contained within a single source file, Makefile.tmpl can be completely empty. The Apache configure script does a pretty good job of creating a reasonable default makefile. Makefile.tmpl is there to provide additional file and library dependencies that Apache doesn't know about.
The next step is to create the module itself. Example 2.2 shows the source for mod_hello. Create a file in the site subdirectory named mod_hello.c and type in the source code (or better yet, steal it from the source code listings in http://www.modperl.com/book/source/).
Example 2.2. A First C-Language Module
#include "httpd.h"
#include "http_config.h"
#include "http_core.h"
#include "http_log.h"
#include "http_protocol.h"
/* file: mod_hello.c */

/* here's the content handler */
static int hello_handler(request_rec *r) {
  const char* hostname;

  r->content_type = "text/html";
  ap_send_http_header(r);
  hostname = ap_get_remote_host(r->connection,r->per_dir_config,REMOTE_NAME);

  ap_rputs("<HTML>\n"                           ,r);
  ap_rputs("<HEAD>\n"                           ,r);
  ap_rputs("<TITLE>Hello There</TITLE>\n"       ,r);
  ap_rputs("</HEAD>\n"                          ,r);
  ap_rputs("<BODY>\n"                           ,r);
  ap_rprintf(r,"<H1>Hello %s</H1>\n"            ,hostname);
  ap_rputs("Who would take this book seriously if the first example didn't\n",r);
  ap_rputs("say \"hello world\"?\n"             ,r);
  ap_rputs("</BODY>\n"                          ,r);
  ap_rputs("</HTML>\n"                          ,r);

  return OK;
}

/* Make the name of the content handler known to Apache */
static handler_rec hello_handlers[] =
{
    {"hello-handler", hello_handler},
    {NULL}
};

/* Tell Apache what phases of the transaction we handle */
module MODULE_VAR_EXPORT hello_module =
{
   STANDARD_MODULE_STUFF,
   NULL,               /* module initializer                 */
   NULL,               /* per-directory config creator       */
   NULL,               /* dir config merger                  */
   NULL,               /* server config creator              */
   NULL,               /* server config merger               */
   NULL,               /* command table                      */
   hello_handlers,     /* [9]  content handlers              */
   NULL,               /* [2]  URI-to-filename translation   */
   NULL,               /* [5]  check/validate user_id        */
   NULL,               /* [6]  check user_id is valid *here* */
   NULL,               /* [4]  check access by host address  */
   NULL,               /* [7]  MIME type checker/setter      */
   NULL,               /* [8]  fixups                        */
   NULL,               /* [10] logger                        */
   NULL,               /* [3]  header parser                 */
   NULL,               /* process initialization             */
   NULL,               /* process exit/cleanup               */
   NULL                /* [1]  post read_request handling    */
};
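The hello_handlers table above pairs a handler name with a function and ends with a NULL entry. The standalone sketch below models how a name such as "hello-handler" can be resolved against a NULL-terminated table of that shape. It deliberately does not include the Apache headers, so it can be compiled on its own: toy_request, toy_handler_rec, toy_hello, and dispatch are hypothetical stand-ins, and the real server's lookup is more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_OK        0   /* the handler produced a response */
#define TOY_DECLINED -1   /* no handler of that name in this table */

typedef struct {
    const char *handler;   /* handler name chosen by the configuration */
    char body[64];         /* stand-in for the response sent to the client */
} toy_request;

typedef struct {
    const char *name;
    int (*fn)(toy_request *r);
} toy_handler_rec;

/* Stand-in for hello_handler: write a tiny response body. */
int toy_hello(toy_request *r)
{
    strcpy(r->body, "Hello");
    return TOY_OK;
}

/* Same shape as hello_handlers: entries first, then a NULL sentinel. */
const toy_handler_rec toy_handlers[] = {
    { "hello-handler", toy_hello },
    { NULL, NULL }
};

/* Walk the table until the sentinel, calling the first name that matches. */
int dispatch(const toy_handler_rec *table, toy_request *r)
{
    assert(table != NULL && r != NULL && r->handler != NULL);
    for (; table->name != NULL; table++) {
        if (strcmp(table->name, r->handler) == 0)
            return table->fn(r);
    }
    return TOY_DECLINED;
}
```

The sentinel entry is what lets the server walk the table without being told its length, which is why the {NULL} line in hello_handlers must not be omitted.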
Instant Modules with Apache::Registry
By now, although it may not be obvious, you've seen two of the problems with using the Apache APIs. The first problem is that you can't make changes to modules casually. When using the Perl API, you have to restart the server in order to have your changes take effect. With the C API, you have to rebuild the module library or completely relink the server executable. Depending on the context, this can range from a minor annoyance (when you're developing a module on a test server that gets light usage) to a bit of a headache (when you're trying to apply bug fixes to an installed module on a heavily used production server).
The second problem is that Apache API modules don't look anything like CGI scripts. If you've got a lot of CGI scripts that you want to run faster, porting them to the Apache API can be a major undertaking.
Apache::Registry, an Apache Perl module that is part of the mod_perl distribution, solves both problems with one stroke. When it runs, it creates a pseudo-CGI environment that so exactly mimics the real thing that Perl CGI scripts can run under it unmodified. It also maintains a cache of the scripts under its control. When you make a change to a script, Apache::Registry notices that the script's modification date has been updated and recompiles the script, making the changes take effect immediately without a server restart. Apache::Registry provides a clean upgrade path for existing CGI scripts. Running CGI scripts under Apache::Registry gives them an immediate satisfying performance boost without having to make any source code changes. Later you can modify the script at your own pace to take advantage of the nifty features offered only by the Apache API.
Be aware that Apache::Registry is intended only for Perl CGI scripts. CGI scripts written in other languages cannot benefit from the speedup of having a Perl interpreter embedded in the server.
To install Apache::Registry you'll need to create a directory to hold the scripts that it manages. We recommend a
Troubleshooting Modules
Not every module will work the way you think it will the first time you try it. Because the modules written with the Apache API are by definition embedded in the server, debugging them is not as straightforward as debugging a standalone CGI script. In this section, we cover some general module debugging techniques. You'll find more tips later when we discuss specific issues.
If you are using the C API, you can use standard debuggers to step through your module, examine and change data structures, set watch points, and so forth. Be sure to use a version of httpd that has been compiled with debugging symbols and to turn compiler optimizations off. On Unix systems, you can do this by setting the CFLAGS environment variable before running the configure script:
% CFLAGS=-g ./configure ...
            
Launch your favorite debugger, such as gdb, and run httpd within it. Be sure to launch httpd with the -X flag. Ordinarily, Unix versions of httpd will prefork many independent processes. This forking will confuse the debugger and will probably confuse you too. -X prevents Apache from preforking and keeps the server in the foreground as well. You will also probably want to specify an alternate configuration file with the -f switch so that you can use a high numbered port instead of the default port 80. Be sure to specify different ErrorLog, TransferLog, PidFile, and ScoreBoardFile directives in the alternate configuration file to avoid conflicts with the live server.
% gdb httpd 
(gdb) run -X -f ~www/conf/httpd.conf
            
Fetch a few pages from the server to make sure that it is running correctly under the debugger. If there is a problem that triggers a core dump, the (gdb) prompt will return and tell you which function caused the crash. Now that you have an idea of where the problem is coming from, a breakpoint can be set to step through and see exactly what is wrong. If we were debugging
Chapter 3: The Apache Module Architecture and API
In this chapter we lay out the design of the Apache module architecture and its application programming interface. We describe the phases in which Apache processes each request, list the data types that are available for your use, and go over the directives that control how extension modules can intercede in transaction processing.
This is the broad overview of the API. For a full blow-by-blow description of each function and data structure available to you, see Chapter 9, Chapter 10 and Chapter 11.
How Apache Works
Much of the Apache API is driven by the simple fact that Apache is a hypertext transfer protocol (HTTP) server that runs in the background as a daemon. Because it is a daemon, it must do all the things that background applications do, namely, read its configuration files, go into the background, shut down when told to, and restart in the case of a configuration change. Because it is an HTTP server, it must be able to listen for incoming TCP/IP connections from web browsers, recognize requests for URIs, parse the URIs and translate them into the names of files or scripts, and return some response to the waiting browser (Figure 3.1). Extension modules play an active role in all these aspects of the Apache server's life.
Figure 3.1: The HTTP transaction consists of a URI request from the browser to the server, followed by a document response from the server to the browser.
Like most other servers, Apache multiplexes its operations so that it can start processing a new request before it has finished working on the previous one. On Unix systems, Apache uses a multiprocess model in which it launches a flock of servers: a single parent server is responsible for supervision and one or more children are actually responsible for serving incoming requests. The Apache server takes care of the basic process management, but some extension modules need to maintain process-specific data for the lifetime of a process as well. They can do so cleanly and simply via hooks that are called whenever a child is launched or terminated. (The Win32 version of Apache uses multithreading rather than a multiprocess model, but as of this writing modules are not given a chance to take action when a new thread is created or destroyed.)
However, what extension modules primarily do is to intercede in the HTTP protocol in order to customize how Apache processes and responds to incoming browser requests. For this reason, we turn now to a quick look at HTTP itself.
The Apache Life Cycle
Apache's life cycle is straightforward (Figure 3.2). It starts up, initializes, forks off several copies of itself (on Unix systems only), and then enters a loop in which it processes incoming requests. When it is done, Apache exits the loop and shuts itself down. Most of the interesting stuff happens within the request loop, but both Perl and C-language modules can intervene at other stages as well. They do so by registering short code routines called "handlers" that Apache calls at the appropriate moment. A phase may have several handlers registered for it, a single handler, or none at all. If multiple modules have registered their interest in handling the same phase, Apache will call them in the reverse order in which they registered. This in turn will depend on the order in which the modules were loaded, either at compile time or at runtime when Apache processes its LoadModule directives. If no module handlers are registered for a phase, it will be handled by a default routine in the Apache core.
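The ordering rule can be illustrated with a toy registry in plain Perl. This is only a model of the behavior described above, not Apache's implementation: the functions `register` and `call_order`, the module names, and the phase names are all invented for the illustration.

```perl
use strict;
use warnings;

# Toy model: phase name => list of module names, in registration order.
my %registry;

sub register {
    my ($phase, $module) = @_;
    push @{ $registry{$phase} }, $module;
}

# Return the modules in the order they would be consulted for a phase:
# reverse registration order, or the core default if none is registered.
sub call_order {
    my ($phase) = @_;
    my @order = reverse @{ $registry{$phase} || [] };
    return @order ? @order : ('core default');
}

register(translate => 'mod_first');
register(translate => 'mod_second');

print join(', ', call_order('translate')), "\n";
print join(', ', call_order('logging')), "\n";
```

The module registered last is consulted first, and a phase with no registered handlers falls through to the core.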
Figure 3.2: The Apache server life cycle
When the server is started, Apache initializes globals and other internal resources and parses out its command-line arguments. It then locates and parses its various configuration files.
The configuration files may contain directives that are implemented by external modules. Apache parses each directive according to a prototype found in the command table that is part of each module and passes the parsed information to the module's configuration-handling routines. Apache processes the configuration directives on a first-come, first-served basis, so in certain cases, the order in which directives appear is important. For example, before Apache can process a directive that is implemented by a module configured as a dynamically shared object, that module must be pulled in with the
The Handler API
When Apache calls a handler, it passes information about the current transaction and the server configuration. It's the handler's responsibility to take whatever action is appropriate for this phase and to then return an integer status code to Apache indicating the success or failure of its operation.
In the Perl API, the definition of a handler is short and sweet:
sub handler { 
   my $r = shift; 
   # do something 
   return SOME_STATUS_CODE; 
}
No matter which phase of the Apache life cycle the handler is responsible for, the subroutine structure is always the same. The handler is passed a single argument consisting of a reference to an Apache request object. The request object is an object-oriented version of a central C record structure called the request record, and it contains all the information that Apache has collected about the transaction. By convention, a typical handler will store this object in a lexically scoped variable named $r. The handler retrieves whatever information it needs from the request object, does some processing, and possibly modifies the object to suit its needs. The handler then returns a numeric status code as its function result, informing Apache of the outcome of its work. We discuss the list of status codes and their significance in the next section.
There is one special case, however. If the handler has a function prototype of ($$) indicating that the subroutine takes two scalar arguments, the Perl API treats the handler as an object-oriented method call. In this case, the handler will receive two arguments. The handler's class (package) name or an object reference will be the first argument, and the Apache request object reference will be the second. This allows handlers to take advantage of class inheritance, polymorphism, and other useful object-oriented features. Handlers that use this feature are called "method handlers" and have the following structure:
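A minimal sketch of a method handler along these lines is given here. It assumes a stand-in `Mock::Request` class so that it runs outside Apache, and the package names `My::Greeter` and `My::Greeter::Formal` are invented for the illustration. Under mod_perl the request object and the OK constant would come from the server rather than from this file.

```perl
use strict;
use warnings;

# Stand-in for the Apache request object so the sketch runs outside Apache.
package Mock::Request;
sub new   { my ($class) = @_; return bless { out => '' }, $class }
sub print { my ($self, @text) = @_; $self->{out} .= join '', @text; return 1 }

package My::Greeter;
use constant OK => 0;    # same numeric value as Apache's OK status

# The ($$) prototype marks this as a method handler: the class (or object)
# arrives as the first argument, the request object as the second.
sub handler ($$) {
    my ($class, $r) = @_;
    $r->print($class->greeting, "\n");
    return OK;
}

sub greeting { return 'Hello' }

package My::Greeter::Formal;
our @ISA = ('My::Greeter');    # inherits handler(), overrides greeting()

sub greeting { return 'Good day' }

package main;

my $r = Mock::Request->new;
My::Greeter->handler($r);
My::Greeter::Formal->handler($r);
print $r->{out};
```

Because the class name is passed in, the subclass reuses the inherited handler() and changes only the piece it overrides.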
Perl API Classes and Data Structures
We'll look now at what a handler subroutine sees when it is called. All interaction between the handler and the Apache server is done through the request record. In the Perl API, the request record is encapsulated within a request object, which for historical reasons is blessed into the Apache:: namespace. The Apache request object contains most of the information about the current transaction. It also contains references to other objects that provide further information about the server and the current transaction. The request object's server() method returns an Apache::Server object, which contains server configuration information. The connection() method returns an Apache::Connection object, which contains low-level information about the TCP/IP connection between the browser and the server.
In the C API, information about the request is passed to the handler as a pointer to a request_rec. Included among its various fields are pointers to a server_rec and a conn_rec structure, which correspond to the Perl API's Apache::Server and Apache::Connection objects. We have much more to say about using the request_rec in Chapter 10 and Chapter 11 when we discuss the C-language API in more detail.
The Apache request object (the request_rec in C) is the primary conduit for the transfer of information between modules and the server. Handlers can use the request object to perform several types of operations:
Get and set information about the requested document
The URI of the requested document, its translated file name, its MIME type, and other useful information are available through a set of request object methods. For example, a method named uri() returns the requested document's URI, and content_type() retrieves the document's MIME type. These methods can also be used to change the values, for example, to set the MIME type of a document generated on the fly.
Chapter 4: Content Handlers
This chapter is about writing content handlers for the Apache response phase, when the contents of the page are actually produced. In this chapter you'll learn how to produce dynamic pages from thin air, how to modify real documents on the fly to produce effects like server-side includes, and how Apache interacts with the MIME-typing system to select which handler to invoke.
Starting with this chapter we shift to using the Apache Perl API exclusively for code examples and function prototypes. The Perl API covers the majority of what C programmers need to use the C-language API. What's missing are various memory management functions that are essential to C programmers but irrelevant in Perl. If you are a C programmer, just have patience and the missing pieces will be filled in eventually. In the meantime, follow along with the Perl examples and enjoy yourself. Maybe you'll even become a convert.
Content Handlers as File Processors
Early web servers were designed as engines for transmitting physical files from the host machine to the browser. Even though Apache does much more, the file-oriented legacy still remains. Files can be sent to the browser unmodified or passed through content handlers to transform them in various ways before sending them on to the browser. Even though many of the documents that you produce with modules have no corresponding physical files, some parts of Apache still behave as if they did.
When Apache receives a request, the URI is passed through any URI translation handlers that may be installed (see Chapter 7 for information on how to roll your own), transforming it into a file path. The mod_alias translation handler (compiled in by default) will first process any Alias, ScriptAlias, Redirect, or other mod_alias directives. If none applies, the http_core default translator will simply prepend the DocumentRoot directory to the beginning of the URI.
Next, Apache attempts to divide the file path into two parts: a "filename" part which usually (but not always) corresponds to a physical file on the host's filesystem, and an "additional path information" part corresponding to additional stuff that follows the filename. Apache divides the path using a very simple-minded algorithm. It steps through the path components from left to right until it finds something that doesn't correspond to a directory on the host machine. The part of the path up to and including this component becomes the filename, and everything that's left over becomes the additional path information.
Consider a site with a document root of /home/www that has just received a request for URI /abc/def/ghi. The way Apache splits the file path into filename and path information parts depends on what directories it finds in the document root:
Physical Directory
Translated Filename
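The splitting rule described above can be sketched in plain Perl. This is an illustration of the rule as stated in the text, not Apache's own code: the function name `split_path` is invented, and the directory test is passed in as a callback so that the sketch runs without a real /home/www on disk (a real check might be `sub { -d $_[0] }`). Trailing slashes are ignored for simplicity.

```perl
use strict;
use warnings;

# Split a translated path into (filename, path_info).
# Walk the URI components from left to right; the first component that is
# not a directory ends the filename, and the remainder is the path info.
sub split_path {
    my ($doc_root, $uri, $is_dir) = @_;
    my @components = grep { length } split m{/}, $uri;
    my $filename = $doc_root;
    while (@components) {
        my $next = shift @components;
        $filename .= "/$next";
        last unless $is_dir->($filename);
    }
    my $path_info = @components ? '/' . join('/', @components) : '';
    return ($filename, $path_info);
}

# Pretend that only these directories exist on the host.
my %dirs = map { $_ => 1 } ('/home/www', '/home/www/abc');

my ($file, $info) = split_path('/home/www', '/abc/def/ghi', sub { $dirs{ $_[0] } });
print "filename=$file path_info=$info\n";
```

With /home/www/abc present but /home/www/abc/def absent, the filename becomes /home/www/abc/def and the path information becomes /ghi.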
Virtual Documents
The previous sections of this chapter have been concerned with transforming existing files. Now we turn our attention to spinning documents out of thin air. Despite the fact that these two operations seem very different, Apache content handlers are responsible for them both. A content handler is free to ignore the translation of the URI that is passed to it. Apache neither knows nor cares that the document produced by the content handler has no correspondence to a physical file.
We've already seen an Apache content handler that produces a virtual document. Chapter 2 gave the code for Apache::Hello, an Apache Perl module that produces a short HTML document. For convenience, we show it again in Example 4.7. This content handler is essentially identical to the previous content handlers we've seen. The main difference is that the content handler sets the MIME content type itself, calling the request object's content_type() method to set the MIME type to text/html. This is in contrast to the idiom we used earlier, where the handler allowed Apache to choose the content type for it. After this, the process of emitting the HTTP header and the document itself is the same as we've seen before.
After setting the content type, the handler calls send_http_header() to send the HTTP header to the browser, and immediately exits with an OK status code if header_only() returns true (this is a slight improvement over the original Chapter 2 version of the program). We call get_remote_host() to get the DNS name of the remote host machine, and incorporate the name into a short HTML document that we transmit using the request object's print() method. At the end of the handler, we return OK.
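The sequence of calls just described can be traced with a small sketch. It is not the Apache::Hello listing itself: `Mock::Request` is a stand-in class written for this illustration so the code runs without Apache, its header output is deliberately simplified, and the host name is made up. Under mod_perl the real request object provides methods with these names and OK comes from Apache::Constants.

```perl
use strict;
use warnings;

# Stand-in request class; only the methods the handler uses are modeled.
package Mock::Request;
sub new {
    my ($class, %args) = @_;
    return bless { header_only => 0, host => 'localhost', out => '', %args }, $class;
}
sub content_type {
    my ($self, $type) = @_;
    $self->{type} = $type if defined $type;
    return $self->{type};
}
sub send_http_header {
    my ($self) = @_;
    $self->{out} .= "Content-Type: $self->{type}\n\n";   # simplified header
    return;
}
sub header_only     { return $_[0]{header_only} }
sub get_remote_host { return $_[0]{host} }
sub print           { my ($self, @text) = @_; $self->{out} .= join '', @text; return 1 }

package main;
use constant OK => 0;    # same numeric value as Apache's OK status

sub handler {
    my $r = shift;
    $r->content_type('text/html');    # the handler picks the MIME type itself
    $r->send_http_header;
    return OK if $r->header_only;     # HEAD request: header only, no body
    my $host = $r->get_remote_host;
    $r->print("<html><body><h1>Hello $host</h1></body></html>\n");
    return OK;
}

my $r = Mock::Request->new(host => 'client.example.com');
handler($r);
print $r->{out};
```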
There's no reason to be limited to producing virtual HTML documents. You can just as easily produce images, sounds, and other types of multimedia, provided of course that you know how to produce the file format that goes along with the MIME type.
Example 4.7. "Hello World" Redux
Redirection
Instead of synthesizing a document, a content handler has the option of redirecting the browser to fetch a different URI using the HTTP redirect mechanism. You can use this facility to randomly select a page or picture to display in response to a URI request (many banner ad generators work this way) or to implement a custom navigation system.
Redirection is extremely simple with the Apache API. You need only add a Location field to the HTTP header containing the full or partial URI of the desired destination, and return a REDIRECT result code. A complete functional example using mod_perl is only a few lines (Example 4.8). This module, named Apache::GoHome, redirects users to the hardcoded URI http://www.ora.com/. When the user selects a document or a portion of the document tree that this content handler has been attached to, the browser will immediately jump to that URI.
The module begins by importing the REDIRECT result code from Apache::Constants (REDIRECT isn't among the standard set of result codes imported with :common). The handler() method then adds the desired location to the outgoing headers by calling Apache::header_out(). header_out() can take one or two arguments. Called with one argument, it returns the current value of the indicated HTTP header field. Called with two arguments, it sets the field indicated by the first argument to the value indicated by the second argument. In this case, we use the two-argument form to set the HTTP Location field to the desired URI.
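The one-argument and two-argument forms can be seen side by side in a short sketch. It assumes a stand-in `Mock::Request` class, written for this illustration, whose header_out() only mimics the calling convention described above. The REDIRECT constant is defined locally with the value 302 so that the code runs outside Apache, whereas a real module would import it from Apache::Constants.

```perl
use strict;
use warnings;

# Stand-in request class that models only header_out().
package Mock::Request;
sub new {
    my ($class) = @_;
    return bless { headers_out => {} }, $class;
}
# One argument: return the field's value. Two arguments: set it.
sub header_out {
    my ($self, $name, $value) = @_;
    $self->{headers_out}{$name} = $value if @_ > 2;
    return $self->{headers_out}{$name};
}

package main;
use constant REDIRECT => 302;    # defined locally for the sketch

sub handler {
    my $r = shift;
    $r->header_out(Location => 'http://www.ora.com/');   # two-argument form
    return REDIRECT;
}

my $r      = Mock::Request->new;
my $status = handler($r);
print "status=$status Location=", $r->header_out('Location'), "\n";   # one-argument form
```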
The final step is to return the REDIRECT result code. There's no need to generate an HTML body, since most HTTP-compliant browsers will take you directly to the Location URI. However, Apache adds an appropriate body automatically in order to be HTTP-compliant. You can see the header and body message using telnet:
% telnet localhost 80
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Processing Input
You can make the virtual documents generated by the Apache API interactive in exactly the way that you would documents generated by CGI scripts. Your module will generate an HTML form for the user to fill out. When the user completes and submits the form, your module will process the parameters and generate a new document, which may contain another fill-out form that prompts the user for additional information. In addition, you can store information inside the URI itself by placing it in the additional path information part.
When a fill-out form is submitted, the contents of its fields are turned into a series of name=value parameter pairs that are available for your module's use. Unfortunately, correctly processing these parameter pairs is annoying because, for a number of historical reasons, there are a variety of formats that you must know about and deal with. The first complication is that the form may be submitted using either the HTTP GET or POST method. If the GET method is used, the URI-encoded parameter pairs can be found separated by ampersands in the "query string," the part of the URI that follows the ? character:
http://your.site/uri/path?name1=val1&name2=val2&name3=val3...
To recover the parameters from a GET request, mod_perl users should use the request object's args() method. In a scalar context this method returns the entire query string, ampersands and all. In an array context, this method returns the parsed name=value pairs; however, you will still have to do further processing in order to correctly handle multivalued parameters. This feature is only found in the Perl API. Programmers who use the C API must recover the query string from the request object's args field and do all the parsing manually.
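One way to handle multivalued parameters is to parse the raw query string into a hash of array references. The sketch below is plain Perl and runs without Apache. The function name `parse_query` and the sample query string are invented for the illustration; under mod_perl the input would be the string that args() returns in a scalar context.

```perl
use strict;
use warnings;

# Parse a query string into a hash of array references so that repeated
# names (multivalued parameters) keep every value.
sub parse_query {
    my ($query) = @_;
    my %params;
    for my $pair (split /&/, $query) {
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $name && length $name;
        $value = '' unless defined $value;
        for ($name, $value) {
            tr/+/ /;                                # '+' encodes a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;    # undo %XX escapes
        }
        push @{ $params{$name} }, $value;
    }
    return \%params;
}

my $params = parse_query('name1=val1&color=red&color=dark+blue&note=a%26b');
print join(',', @{ $params->{color} }), "\n";
print $params->{note}[0], "\n";
```

Keeping every parameter as a list, even when it has a single value, spares the caller from checking whether a name occurred once or several times.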
If the client uses the POST method to submit the fill-out form, the parameter pairs can be found in something called the "client block." C API users must call three functions named
Apache::Registry