Search the Catalog
Perl for Web Site Management

Perl for Web Site Management

By John Callender
October 2001
1-56592-647-1, Order Number: 6471
528 pages, $34.95

Chapter 8
Parsing Web Access Logs

Web server access logs are an excellent source of information about what your site's visitors are up to. The information on separate visitors is all mixed together, though, and for all but the smallest sites the raw access logs are too large to read directly. What you need is log analysis software to make the information in the log more easily accessible. You can buy commercial log analysis software to do this, but Perl makes it easy to write your own. The next three chapters describe how to build such a home-grown log-analysis tool.

This chapter focuses on the first part of the process: extracting and storing the information we're interested in. We talk about log file structure, converting IP addresses, and creating regular expressions capable of parsing web access logs. We also talk about creating a suitable data structure for storing the extracted data, so we can answer interesting questions about what our site's visitors have been doing. Along the way we discuss the difficulty of identifying those visitors in the web server's log entries and devise an approach for extracting at least an approximate version of that information.

The example continues in Chapter 9, which focuses on how to do computations involving dates and times, and finishes in Chapter 10, which covers the specifics of how we manipulate the "visit" information from our logs, as well as the actual output of the finished report.

Log File Structure

Most web servers store their access log in what is called the "common log format." Each time a user requests a file from the server, a line containing the following fields is added to the end of the log file:

host
This is either the IP address (like 207.71.222.231) or the corresponding hostname (like pm9-31.sba1.avtel.net) of the remote user requesting the page. For performance reasons, many web servers are configured not to do hostname lookups on the remote host. This means that all you end up with in the log file is a bunch of IP addresses. A bit later in this chapter, you'll develop a Perl script that you can use to convert those IP addresses into hostnames.

identd result
This is a field for logging the response returned by the remote user's identd server. Almost no one actually uses this; in every web log I've ever seen, this field is always just a dash (-).

authuser
If you are using basic 'ecHTTP authentication (which we'll be talking about in Chapter 19) to restrict access to some of your web documents, this is where the username of the authenticated user for this transaction will be recorded. Otherwise, it will be just a dash (-).

date and time
Next comes a date and time string inside square brackets, like: [06/Jul/1999:00:09:12 -0700]. That's the day of the month, the abbreviated month name, and the four-digit year, all separated by slashes. Next come the time (expressed in 24-hour format, so 11:30 P.M. would be 23:30:00) and a time-zone offset (in this example, -0700, because the web server this log was from was using Pacific Daylight Time, which is seven hours behind Universal Time/Greenwich Mean Time).

request
This is the actual request sent by the remote user, enclosed in double quotes. Normally it will look something like: "GET / HTTP/1.0". The GET part means it is a GET request (as opposed to a POST or a HEAD request). The next part is the path of the URL requested; in this case, the default page in the server's top-level directory, as indicated by a single slash (/). The last part of the request is the protocol being used, at the time of this writing typically HTTP/1.0 or HTTP/1.1.

status code
This is the status code returned by the server; by definition this will be a three-digit number. A status code of 200 means everything was handled okay, 304 means the document has not changed since the client last requested it, 404 means the document could not be found, and 500 indicates that there was some sort of server-side error. (More detail on the various status codes can be found in RFC 1945, which describes the HTTP/1.0 protocol. See http://www.w3.org/Protocols/rfc1945/rfc1945.)

bytes sent
The amount of data returned by the server, not counting the header line.

An extended version of this log format, often referred to as the "combined" format, includes two additional fields at the end:

referer
The referring page, if any, as reported by the remote user's browser. Note that referer is consistently misspelled (with a single "r" in the middle) in the HTTP specification, and in the name of the corresponding environment variable.

user agent
The user agent reported by the remote user's browser. Typically, this is a string describing the type and version of browser software being used.

Assuming you have control over your web server's configuration, or can get your ISP to modify it for you, the combined format's extra fields can provide some very interesting information about the users visiting your site. The log analysis script described in this chapter will work with either format, however.

Converting IP Addresses

Before we jump into log-file analysis, let's return briefly to the problem of doing hostname lookups on the IP addresses that most likely comprise the "host" entries in our web access logs. Example 8-1 gives a script, clf_lookup.plx, that does just that. (Like all the examples in this book, it is available for download from the book's web site, at http://www.elanus.net/book/.)

Example 8-1: A script to do hostname lookups on IP addresses in web access logs

#!/usr/bin/perl -w
 
# clf_lookup.plx
 
# given common or extended-format web logs on STDIN, outputs 
# them with numeric IP addresses in the first (host) field converted 
# to hostnames (where possible).
 
use strict;
use Socket;
 
my %hostname;
 
while (<>) {
    my $line = $_;
    my($host, $rest) = split / /, $line, 2;
    if ($host =~ /^\d+\.\d+\.\d+\.\d+$/) {
        # looks vaguely like an IP address
        unless (exists $hostname{$host}) {
            # no key, so haven't processed this IP before
            $hostname{$host} = gethostbyaddr(inet_aton($host), AF_INET);
        }
        if ($hostname{$host}) {
            # only processes IPs with successful lookups
            $line = "$hostname{$host} $rest";
        }
    }
    print $line;
}

The script itself is pretty simple, but it introduces some new concepts that are definitely worth learning about. The first new thing is this line:

use Socket;

Here we are importing a module called Socket.pm. Just as we did earlier, when we pulled in the CGI.pm module, we're doing this in order to let some more experienced programmers do our dirty work for us. Specifically, the use Socket declaration in this script means we'll be able to do DNS lookups (converting numeric IP addresses to hostnames) using just a few lines of code.

Thousands of Perl modules are available. Some are distributed as part of the Perl language itself; these are usually referred to as being in the standard distribution, or as the standard module Walnuts.pm. (CGI.pm and Socket.pm are in the standard distribution.) Others can be found at CPAN, the Comprehensive Perl Archive Network, which we'll be learning more about in Chapter 11. If you can't wait until then, though (which I can totally understand, CPAN being something like the world's biggest toy store for a Perl programmer), see the accompanying sidebar, "Using CPAN," for details on how you can jump the gun and start exploring CPAN on your own.

Using CPAN

CPAN, the Comprehensive Perl Archive Network, is the official place to (among other things) get Perl modules that are not included in the standard distribution (that is, that are not distributed automatically along with all recent versions of the language). The hardest part about dealing with CPAN, at least for a beginning programmer, is that it is so extensive. With user contributions from all over the world, it has grown like kudzu, spreading organically in all directions, defying efforts to organize its contents usefully for anyone unwilling to spend a significant amount of time studying it.

Of course, if you are spending much time at all programming with Perl, the time spent learning what's in CPAN will be repaid many times over by the time you save using other people's code to perform common tasks rather than reinventing the wheel.

In any event, the following resources will help you get started with CPAN:

http://www.cpan.org/README.html
The top-level overview of what's in CPAN, with links to more-specific starting points

http://www.cpan.org/modules/
The top-level page within the modules portion of CPAN, with pointers to various views of the modules

http://www.cpan.org/modules/00modlist.long.html
A long, annotated list of all the modules in CPAN

http://search.cpan.org/
The CPAN search engine

The next thing in clf_lookup.plx is a my variable declaration for the %hostname hash. This is going to be used to cache hostname lookups while the script is running. That way, each IP address will have to be looked up only once rather than every time it appears in the log. It is important to initialize the %hostname hash out here, before the while loop that actually processes each line from the log file, because putting my %hostname inside the loop block would make it so that a new copy of the hash was created each time through the loop.

Let's get to the loop now. The beginning of the loop takes the form:

while (<>) {

Here we're beginning a while loop, which you'll remember means we're going to run a block of code repeatedly as long as whatever is inside those parentheses evaluates to a true value. But what a weird thing we've got inside that logical test. It looks somewhat like the angle-input operator we use to read lines from a filehandle, but there's no filehandle inside it.

What the <> (which is sometimes called the diamond operator) is doing is this: it looks at the @ARGV array (which you'll remember from Chapter 4 is the special variable holding your script's command-line arguments) and assumes that those arguments represent the names of one or more text files. The <> operator then returns the text from those files, one line at a time, so you can work with those lines in the body of your while block. It keeps feeding you lines of text until it has exhausted all of the first file mentioned in @ARGV, then goes on to the second file, and so on, until it has exhausted all the files mentioned in @ARGV. After the last line from the last file has been delivered, it returns undef (the undefined value), which is false, ending the loop.

You get an interesting extra feature with the <> operator. If you don't give your script any command-line arguments, such that there are no files mentioned in @ARGV, <> instead will read from standard input (that is, from the STDIN filehandle your script gets by default when it is started up). This in turn lets you do cool things like use your script in a shell pipeline to process the input or output for another program. In fact, we'll be using that feature with this script a little later.

Where does the <> operator put each line of text as it is working its way through the files mentioned in @ARGV? In the special scalar variable $_. As I mentioned previously, many of Perl's operators and functions are designed to work with $_ by default, and this ends up being really handy because it lets you write certain common operations very quickly.

In this case, though, we're going to go ahead and stick the contents of $_ into something a little more memorable. That happens in the next line:

my $line = $_;

Next comes the following:

my($host, $rest) = split / /, $line, 2;
if ($host =~ /^\d+\.\d+\.\d+\.\d+$/) {
    # looks vaguely like an IP address

Here you are using the split function to take the current line from the log file and separate it into everything before the first space character (which goes into the scalar variable $host) and everything after the first space character (which goes into $rest). This takes advantage of an optional third argument to the split function, with that argument being a number telling split how many fields to split the string into (in this case, two, because we don't need to keep splitting once we've split off the first field).

Next comes an if statement with a regular expression in the logical test. With your new understanding of regular expressions it should be pretty easy to decipher the meaning of /^\d+\.\d+\.\d+\.\d+$/: it matches a string consisting of four sets of one or more numbers each, separated by periods. This is not the exact same thing as an IP address (in which the component numbers can fall only within a certain range); this pattern is naïve, in that it would accept as IP addresses things like 98765.1234.1.1, but it's close enough for our current purpose.

Next come these two lines:

unless (exists $hostname{$host}) {
    # no key, so haven't processed this IP before

As discussed earlier, we're going to use $hostname{$host} to keep track of the IP addresses we've already looked up. Sometimes, though, we will attempt to look up an IP address and find that it can't be resolved to a hostname. In such cases, we'll stick undef into the value corresponding to $hostname{$host}. The hash will still have a key corresponding to that IP address, but there won't be an associated value. By testing for the existence of a particular hash key (which is what the exists function lets us do), we can avoid entering this if block if we come across those hosts again.

Next comes the actual looking up of the hostname, which is quite simple, thanks to the Socket.pm module:

    $hostname{$host} = gethostbyaddr(inet_aton($host), AF_INET);
}

gethostbyaddr is a Perl function that provides an interface to the computer's underlying hostname lookup function. The two arguments of inet_aton($host) and AF_INET are a little bit of magic provided courtesy of Socket.pm.

Next comes this:

    if ($hostname{$host}) {
        # only processes IPs with successful lookups
        $line = "$hostname{$host} $rest";
    }
}

If the gethostbyaddr returned a false value for this particular $hostname{$host} (meaning this IP address couldn't be resolved to a hostname), the script will skip over this block. Otherwise, it will re-create the $line variable (which corresponds to the current line from the log file) by interpolating the looked-up hostname in $hostname{$host} into a string, along with the $rest variable (which you will recall holds the rest of the line).

The two closing curly braces end the two if blocks we were in, after which we print out just the current value of $line:

    print $line;
}

This script jumps through a number of hoops in the interest of cutting down the actual work it does -- caching the lookups and avoiding rebuilding $line unless it has to make the script a little more involved -- but it is worth taking that sort of care with a program like this because it may end up having to process some very big log files. Even with these tricks, because the gethostbyaddr function normally takes a certain amount of time to give up on an IP address that can't be resolved, this script will tend to take a long time to process large log files.

One use of clf_lookup.plx that is kind of fun is to put it in a pipeline to convert your log file's IP addresses into hostnames on the fly. For example, if your log file is called access.log, you could use the tail command with the -f switch to watch that log growing in real time, piping the output through clf_lookup.plx to convert the hostnames, like this:

[jbc@andros .logs]$ tail -f access.log | clf_lookup.plx

The Log-Analysis Script

Now that the hostname lookups are taken care of, it's time to write the log-analysis script. Example 8-2 shows the first version of that script.

Example 8-2: log_report.plx, a web log-analysis script (first version)

#!/usr/bin/perl -w
 
# log_report.plx
 
# report on web visitors
 
use strict;
 
while (<>) {
    my ($host, $ident_user, $auth_user, $date, $time,
            $time_zone, $method, $url, $protocol, $status, $bytes) = 
/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)
 (\S+)" (\S+) (\S+)$/;
 
    print join "\n", $host, $ident_user, $auth_user, $date, $time,
        $time_zone, $method, $url, $protocol, $status,
        $bytes, "\n";
}

This first version of the script is simple. All it does is read in lines via the <> operator, parse those lines into their component pieces, and then print out the parsed elements for debugging purposes. The line that does the printing out is interesting, in that it uses Perl's join function, which you haven't seen before. The join function is the polar opposite, so to speak, of the split function: it lets you specify a string (in its first argument) that will be used to join the list comprising the rest of its arguments into a scalar. In other words, the Perl expression join '-', 'a', 'b', 'c' would return the string a-b-c. And in this case, using \n to join the various elements parsed by our script lets us print out a newline-separated list of those parsed items.

The Mammoth Regular Expression

The real juicy part of this script, though, is that giant regular expression used to parse each log file line into its component parts. Here's that part of the script again:

    my ($host, $ident_user, $auth_user, $date, $time,
            $time_zone, $method, $url, $protocol, $status, $bytes) = 
/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)
 (\S+)" (\S+) (\S+)$/;

There are a couple of important things to note here. The first is that it is actually fairly tricky to represent this regular expression, which is meant to be on a single line, within the limited width of this book's pages. It's particularly tricky in this case because the spaces between the various elements are important, but it's hard to keep track of those spaces when the expression is broken to fit onto multiple lines. If you are going to test this script yourself, be sure that your version of the expression is all on one line, with a single space character between the right parenthesis that ends the first line and the begin parenthesis that begins the second line. (Or you can just download the example from the book's web site, at http://www.elanus.net/book/, since the downloadable example doesn't feature those problematic line breaks.) You also can refer to the version of this expression created using the /x modifier, which is described in the accompanying sidebar, "Regular Expression Extensions," and use that version instead of the one-line version given here.

Regular Expression Extensions

Putting the /x modifier at the end of a regular expression lets you use regular expression "extensions." This means that you can put whitespace characters (like spaces, tabs, and newlines) into the expression, and they will be ignored by Perl when trying to make a match. (The one exception to this is inside a square-bracketed character class, where literal whitespace characters will still count.) To get a literal whitespace character outside a character class you need to precede it by a backslash. Also, you can embed comments in the expression by preceding them with the hash symbol (#), just like you can with regular Perl statements. The idea is that you can break your expression across multiple lines and use indenting and comments in an effort to make it more easily understood.

With a substitution expression, by the way, the /x modifier applies only to the search pattern (the first half of the expression). The replacement part (the second half) still treats whitespace and the # sign as literal characters.

Here's how you might use the /x modifier to represent the regular expression in Example 8-2:

my ($host, $ident_user, $auth_user, $date, $time,
     $time_zone, $method, $url, $protocol, $status,
     $bytes) =
 
     /             # regexp begins
     ^             # beginning-of-string anchor
     (\S+)         # assigned to $host
     \             # literal space
     (\S+)         # assigned to $ident_user
     \             # literal space
     (\S+)         # assigned to $auth_user
     \             # literal space
     \[([^:]+)     # assigned to $date
     :             # literal :
     (\d+:\d+:\d+) # assigned to $time
     \             # literal space
     ([^\]]+)      # assigned to $time_zone
     \]\ "         # literal string '] "'
     (\S+)         # assigned to $method
     \             # literal space
     (.+?)         # assigned to $url
     \             # literal space
     (\S+)         # assigned to $protocol
     "\            # literal string '" '
     (\S+)         # assigned to $status
     \             # literal space
     (\S+)         # assigned to $bytes
     $             # end-of-string anchor
     /x;           # regexp ends, with x modifier

The second interesting thing about this chunk of code is how we're taking advantage of the fact that a regular expression that contains capturing parentheses, if you place it in a list context, returns a list of all the elements captured by those parentheses. This means you can stick a list of scalar variables inside parentheses on the left side of the assignment operator, and a regular expression containing capturing parentheses on the right, and assign all the captured substrings in one fell swoop.

Let's go through the regular expression search pattern one chunk at a time:

/^(\S+) (\S+) (\S+)

The first thing in the expression is the beginning-of-string anchor (^). Next comes one or more nonwhitespace characters (\S+), which will match only if they are the first thing on the line (thanks to that beginning-of-string anchor). Because they are enclosed by parentheses, whatever matches will be captured into the $host variable in the corresponding list on the left side of the assignment (and into $1, too, though we're not doing anything with that).

Next comes a literal space, then another sequence of one or more nonwhitespace characters, which is captured into $ident_user. Next comes another literal space, and (again) one or more nonwhitespace characters, which are captured into $auth_user.

After that comes another literal space, then the following interesting chunk:

\[([^:]+):(\d+:\d+:\d+) ([^\]]+)\]

This part of the pattern starts off by matching a literal left bracket ([). Next it captures one or more characters that are anything but colons (:). Then comes a colon, and then it captures a string consisting of three sets of one or more digits separated by colons. Next there is a space, after which the pattern captures one or more characters that are anything but a right bracket (]). Finally, the pattern matches a right bracket. In other words, it matches a date string that looks like:

[06/Jul/1999:00:09:12 -0700]

and in doing so captures the date, time, and time zone offset into the $date, $time, and $time_zone variables, respectively.

After that part of the pattern comes another literal space, and then this:

"(\S+) (.+?) (\S+)"

This part matches the request as sent by the web browser to the server. As mentioned earlier, that request typically looks something like this:

"GET / HTTP/1.0"

The tricky thing about this part of the pattern is the stuff inside the middle set of capturing parentheses. That's where you match the path of the actual page requested. At first glance you would probably be tempted to use (\S+) to match that part, on the theory that the requested path is unlikely to contain spaces, but occasionally a space will creep into the path, if only because a user accidentally typed one in when manually specifying the URL.

You could use something like ([^"]+) to match the URL part of the request, which would match all the way out to the double quote, and count on the fact that Perl would then backtrack to match the time zone, which needs to have a double quote after it. The problem with this, though, is that it would be relatively inefficient because you'd be making Perl do a lot of backtracking on almost every line.

The solution given here is better. By using .+? to match the URL, you say "match one or more of anything, but don't be greedy." This means the expression will match only as much as it has to in order to make the rest of the expression match. Once it has matched all of the requested URL, the rest of the expression should match, meaning you'll get what you were looking for without a lot of backtracking.

The last part of the regular expression is fairly simple, capturing the next two one-or-more-nonwhitespace chunks into $status and $bytes, with an end-of-string anchor at the end.

And that's it. The only remaining part of the script is the debugging print statement that outputs each captured item on its own line, with an extra newline at the end to put a blank line after each line's data has been printed out:

print join "\n", $host, $ident_user, $auth_user, $date, $time,
    $time_zone, $method, $url, $protocol, $status,
    $bytes, "\n";

Notice, by the way, how we've just stuck a print function in front of the join function. This chaining together of two or more functions, where the function on the right returns something that serves as the argument for the function on the left, is a handy shortcut you'll see experienced Perl users using all the time.

Before we test the script, we should think for a minute about what will happen in the case where a line from the log file doesn't match that monster regular expression. There's a fairly good chance that all the lines will fail to match the first time we try it because that expression is big and complicated, and one typo will mess the whole thing up. Even after we have it working properly, though, there's always the chance that a screwy line will show up in the log and fail to be parsed properly. What will happen in that case?

What will happen is that the match will fail, and nothing will be assigned to all those variables. If they were global variables, and we were counting on the successful match to replace whatever was already in them from processing the previous line, our script would now go on to print out the previous line's data all over again, which would be a problem. Since we are using my to give us a fresh batch of variables each time through the loop, though, we don't have to worry about that. Even so, we still have a problem. Since none of the variables were successfully assigned for this trip through the loop, the -w switch will cause our script to emit a bunch of "Use of uninitialized variable" warnings as soon as it comes to that print statement.

Perhaps the best thing to do in cases where a line doesn't match is to just bail out and go on to the next line. This behavior turns out to be very easy to add to the script: we just put or next at the end of the line containing the regular expression assignment:

($host, $ident_user, $auth_user, $date, $time,
            $time_zone, $method, $url, $protocol, $status, $bytes) = 
/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)
 (\S+)" (\S+) (\S+)$/
            or next;

How does this work? The next function tells Perl to go on to the next iteration of the loop we're in. Putting or something at the end of a line of a Perl expression causes everything to the left of the or to be evaluated in a Boolean context (that is, evaluated to see if it yields a true or false value). If our regular expression fails to match, it will return an empty list. That not only means that all those variables will get undef assigned to them; it also means the whole expression will be false, which means the stuff to the right of the or will be executed. In general, you can put or (something) on the right side of the regular expression, and whatever you put there will fire off only in cases where the expression fails to match.

Different Log File Formats

It's fairly easy to modify this script to accept either the common or the extended log format. We do that by adding a configuration variable near the top of the script that looks like this:

my $log_format = 'common'; # 'common' or 'extended'

Then we modify the part of the script where the regular expression parsing occurs to include some logic to check that $log_format variable, along with a second version of the regular expression to be used on logs that are in the extended format:

    if ($log_format eq 'common') {
 
        ($host, $ident_user, $auth_user, $date, $time,
            $time_zone, $method, $url, $protocol, $status, $bytes) = 
/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)
 (\S+)" (\S+) (\S+)$/
            or next;
 
    } elsif ($log_format eq 'extended') {
 
        ($host, $ident_user, $auth_user, $date, $time,
            $time_zone, $method, $url, $protocol, $status, $bytes,
            $referer, $agent) = 
/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)
 (\S+)" (\S+) (\S+) "([^"]+)" "([^"]+)"$/
            or next;
    } else {
        die "unrecognized log format '$log_format'";
    }

I think this probably qualifies as the ugliest block of code in this entire book. This is not the sort of code that anybody wants to have to make sense of more than once, but fortunately, once we get it right, we aren't likely to need to modify it.

Anyway, you'll notice that the new regular expression for extended-format logs has a couple of new chunks at the end, both of which look like "([^"]+)". By now that should be an easy one for you: it means "match a literal double quote, then capture one or more characters that are anything but a double quote, then match another literal double quote." These two new chunks capture into the new $referer and $agent variables that we've added at the end of the parenthetical list being assigned to.

We've also added an else block, which just does a quick sanity check, dying with an error message if the $log_format variable was inadvertently set to an unexpected value.

You may have noticed that there is no my declaration before the list of variables in either branch of the if-elsif construct. That's because declaring those variables as my variables here, inside the curly braces of the if-elsif block, would limit their visibility later on in the while block, where they need to be visible. As a result, we've moved the my declaration for the variables above the if-elsif construct, just after the while (<>) line:

my ($host, $ident_user, $auth_user, $date, $time,
    $time_zone, $method, $url, $protocol, $status, $bytes,
    $referer, $agent);

We should also add the $referer and $agent variables to the list of variables that the debugging print statement should print out. This will give us some extra blank lines in the output if our log file is actually in the common format, but that print statement is just a quick debugging tool anyway; the real output that the script produces later will be implemented more intelligently:

print join "\n", $host, $ident_user, $auth_user, $date, $time,
    $time_zone, $method, $url, $protocol, $status,
    $bytes, $referer, $agent, "\n";

The entire script as it should look at this point is given in Example 8-3.

Example 8-3: Second version of the log_report.plx script

#!/usr/bin/perl -w
 
# log_report.plx
 
# report on web visitors
 
use strict;
 
my $log_format = 'common'; # 'common' or 'extended'
 
while (<>) {
 
    my ($host, $ident_user, $auth_user, $date, $time,
        $time_zone, $method, $url, $protocol, $status, $bytes,
        $referer, $agent);
 
    if ($log_format eq 'common') {
 
        ($host, $ident_user, $auth_user, $date, $time,
            $time_zone, $method, $url, $protocol, $status, $bytes) = 
/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)
 (\S+)" (\S+) (\S+)$/
            or next;
 
    } elsif ($log_format eq 'extended') {
 
        ($host, $ident_user, $auth_user, $date, $time,
            $time_zone, $method, $url, $protocol, $status, $bytes,
            $referer, $agent) = 
/^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] "(\S+) (.+?)
 (\S+)" (\S+) (\S+) "([^"]+)" "([^"]+)"$/
            or next;
    } else {
        die "unrecognized log format '$log_format'";
    }
 
    print join "\n", $host, $ident_user, $auth_user, $date, $time,
        $time_zone, $method, $url, $protocol, $status,
        $bytes, $referer, $agent, "\n";
}

Now we're ready to test the log_report.plx script on a real log file, to make sure the regular expression is actually parsing the way we think it should. We set the $log_format variable to the appropriate value for our log files, then try using something like this in the shell (substituting appropriate pathnames as needed) to redirect the log file into the script's standard input and then pipe the script's output to more:

[jbc@andros .logs]$ log_report.plx < access.log | more

If our log file has only IP addresses and no hostnames, we can put the clf_lookup.plx script we created earlier at the beginning of the pipeline with something like this:

[jbc@andros .logs]$ clf_lookup.plx < access.log | log_report.plx | more

This makes use of a new shell redirection symbol that we haven't used before, the left angle bracket (<). The < character tells the shell to redirect the contents of a file into a command's standard input.

If we enter the command line given here and nothing prints out, we need to make sure we've got the log file path and filename correct. If that's not the problem, we need to double-check the configuration variable to make sure it has the appropriate value for our log file format (and take a careful look at the log file, too, to make sure it really is in the common or extended format).

If the script still doesn't output anything, the problem is probably in our regular expression. We need to make sure it is all on one line and has the appropriate spacing between the various elements. As a last resort, we can try shortening it a little bit at a time (or building it up from nothing a little bit at a time), getting it so that it successfully matches and captures something, at least, then adding additional elements until the whole thing is working.

Storing the Data

Now that we're successfully parsing out the individual elements from each line in the log file, what are we going to do with them? It's time to think about what sorts of things we want to keep track of, and how to represent them in our data structure.

One good thing to keep track of is the time of the first and last access processed. When printed out in our report, this will let us see what range of time is covered by the analyzed log file lines.

Another obvious thing to keep track of is how many raw hits are in the log file. Similarly, we can track the total amount of data (in megabytes) sent out by the server, and the number of HTML page views.

We'll begin implementing these features by adding the following to the top of the log_report.plx script, just before the start of the while loop that parses the log file lines:

my($begin_time, $end_time, $total_hits, $total_mb, $total_views);

This establishes a number of scalar variables that will be visible throughout the script, and will be used to store the various categories of information we're interested in tracking.

Now, at the end of the while loop, we'll comment out that debugging print statement and add the new lines shown here in order to store those various pieces of data:

#    print join "\n", $host, $ident_user, $auth_user, $date, $time,
#        $time_zone, $method, $url, $protocol, $status,
#        $bytes, $referer, $agent, "\n";
 
    unless ($begin_time) {
        $begin_time = "$date:$time";
    }
    $end_time = "$date:$time";
 
    ++$total_hits;
    $total_mb += ($bytes / (1024 * 1024));
 
    next if $url =~ /\.(gif|jpg|jpeg|png|xbm)$/i; 
    # don't care about these for visit-tracking purposes
 
    ++$total_views;
    &store_line($host, $date, $time, $url, $referer, $agent);
}

We stick the assignment to $begin_time inside an unless block that checks to see if the variable has been assigned already, so it only gets assigned when the first line of the log file is processed. The $end_time variable is just overwritten with the current values of $date and $time for every line, such that we end up with the date and time of the last access when we're done parsing the log file.

Adding one to $total_hits each time through the loop using the auto-increment operator (++) is easy enough to understand. $total_mb is assigned using the interesting += operator, which does what you would probably guess it does: it takes whatever number is on the right and adds it to the contents of the variable on the left, storing the new sum in the variable. It is thus the equivalent of:

$total_mb = $total_mb + ($bytes / (1024 * 1024));

except it's a bit easier to write. Dividing $bytes by the product of 1024 * 1024 simply converts that number to megabytes.

The next line uses that handy condensed form of an if statement: do something if something else. In this case, it says to bail out and go to the next cycle through the while loop (which in this case means going to the next line in the log file) if the contents of $url end in .gif, .jpg, .jpeg, .png, or .xbm. This reflects the fact that we're only interested in actual "page views" at this point, and don't care about the image files whose requests also end up in the log file. We could instead have used something like:

next unless $url =~ /\.html?$/;

which would skip to the next line from the log file unless the current line's $url ended in .htm or .html, but this would skip requests for CGI scripts and for directories that return a default page such as index.html. It probably makes sense to count those requests in $total_views.

Next, now that we've gotten rid of those extraneous log file entries, it's time to add one to the contents of $total_views. And finally, we invoke a subroutine called &store_line with the arguments $host, $time, $url, $referer, and $agent. We'll be using that subroutine in an effort to generate statistics on something more interesting: the activities of the individual visitors to our site.

The "Visit" Data Structure

Trying to track individual visitors via the entries in a web server's access log is something of an exercise in futility. With things like proxy servers and client-side caching getting in the way, the series of accesses that show up in the log from a particular hostname or IP address can give only an approximate picture of what individual visitors are doing. Multiple users sharing the same IP address can have their activity merged into what looks like a single, very active visitor. Conversely, a single visitor can show up in the logs via a different IP address on each request, defying efforts to abstract those requests into a meaningful "visit." A proxy server at a major ISP can cache the site's pages, then satisfy hundreds of requests that never get recorded in the server's logs.

Even so, it's hard not to wonder what a log file would reveal if we could pluck out the requests corresponding to specific hosts and string them together to see what patterns emerge. Many users still browse from individual host addresses without intervening proxy servers; for these users, at least, the resulting "visit" tracking provides a fascinating look at the paths being followed through the site. It's also interesting to see how many incoming requests are actually being generated by robot "spider" programs, and to study the behavior of those programs as they interact with the server. Finally, it's an interesting programming exercise to see how we can assemble and present information on these "visits."

As with the data structure we used to create the SprocketExpo exhibitor directory in Chapters and , we could really benefit in this case by taking advantage of Perl's support for multilevel data structures. A hash of hashes (that is, a hash whose values are themselves hash variables) would make the task of storing and accessing information on these visits significantly easier. As it is, though, we won't be learning how to use multilevel data structures for several more chapters. That's okay; we can fake it by using the conventional variables we've been using already, just as we did for the SprocketExpo example.

For the purposes of this script, we're going to define a "visit" as a series of one or more requests received from the same host, with no more than 15 minutes elapsing between one request and the next. If we get another request from the same host but more than 15 minutes has elapsed since the last one, we will treat the new request as the start of a new "visit," counting it separately in our statistics.

We may as well make that 15-minute visit timeout a configuration variable up at the top of the script and store it in seconds to make our computations easier:

my $expire_time      = 900; # seconds of inactivity to consider a
                            # "visit" ended (0 = forever)

Notice how the comment tells us we can set the $expire_time variable to 0 to make the expiration time "forever." We'll see how this works in a minute.

A number of other variables, visible throughout the script and declared with my near the beginning, will be used to store the information on individual visits:

$total_visits
This scalar will be incremented by one for each new visit processed. Besides being used in the script's report to tell us how many visits there were in all, this count will be used to generate a unique visit number for each visit.

%visit_num
This hash will have keys consisting of hostnames or IP addresses, and values consisting of the currently "working" visit number corresponding to that host.

All of the following hash variables will have keys consisting of the visit number described previously:

%host
Key is visit number, value is the hostname or IP address corresponding to that visit number.

%first_time
Key is visit number, value is the date and time of that visit's first access.

%last_time
Key is visit number, value is the date and time of that visit's last (that is to say, most recent) access.

%last_seconds
Key is visit number, value is the number of seconds returned by the &get_seconds subroutine for the date and time of that visit's last access.

%referer
Key is visit number, value is the HTTP_REFERER environment variable supplied for that visit's first access.

%agent
Key is visit number, value is the user-agent string supplied for that visit's first access.

We'll add all these new variables to the big my declaration up at the top of the script:

my($begin_time, $end_time, $total_hits, $total_mb, $total_views,
    $total_visits, %visit_num, %host, %first_time, %last_time,
    %last_seconds, %page_sequence, %referer, %agent);

The &store_line Subroutine

We previously saw how the &store_line subroutine was being invoked to process each line from the log file that didn't represent an image request. Now let's skip down to the bottom of the script and see what that &store_line subroutine actually does:

# script proper ends. subroutines follow.
 
sub store_line {
 
    # store one line's worth of visit data
 
    my($host, $date, $time, $url, $referer, $agent) = @_;
    my $seconds = &get_seconds($date, $time);
 
    if ($visit_num{$host}) {
        # there is a visit currently "working" for this host
        my $visit_num = $visit_num{$host};
        my $elapsed = $seconds - $last_seconds{$visit_num};
        if (($expire_time) and ($elapsed > $expire_time)) {
            # this visit has expired, so start a new one
            &new_visit($host, $date, $time, $url, $seconds, 
                $referer, $agent);
        } else {
            # this visit has not expired, so add to existing record
            &add_to_visit($host, $date, $time, 
                $url, $seconds, $elapsed);
        }
    } else {
        # there is no visit currently "working" for this host
        &new_visit($host, $date, $time, $url, $seconds,
            $referer, $agent);
    }
}

Most of the Perl in this subroutine should look pretty familiar by now. In essence, this routine is functioning as a traffic cop, using the host and the time of this access to figure out if this request represents a new visit, or another request in a previously started visit.

First, it checks to see if a key exists for the current line's $host in the %visit_num hash. If there is, it means we've previously processed a request from this host, so the script checks to see if the currently working visit for this host has "expired." That is, it looks to see if the time difference between this host's last access and the current access is greater than the value stored in the $expire_time configuration variable. If it is, it means enough time has gone by that this access needs to be considered the beginning of a new visit, and the script invokes the &new_visit subroutine. If it isn't, the script invokes the &add_to_visit subroutine instead. Finally, if there wasn't any key for the current $host in the %visit_num hash, if means this host hasn't been seen before at all. Accordingly, the &new_visit subroutine is invoked to create a new entry for it.

Here's where we've implemented the feature of turning off visit expiration for cases where the $expire_time configuration variable has been set to 0. We've done that by making the logical test that determines whether a visit has ended actually contain two logical tests, both of which must be true for the "true" branch to be invoked:

if (($expire_time) and ($elapsed > $expire_time)) {

This works because joining two logical tests with and requires both of them to be true for the expression as a whole to be true. If $expire_time is set to 0, which is a false value, the test can never return true.

So, again, three subroutines are invoked from within this &store_line subroutine: the &get_seconds subroutine, which accepts as arguments the date and time strings from the current log file line and converts them to something called Unix seconds. That routine, and the date arithmetic it performs, is the subject of the next chapter. The &new_visit and &add_to_visit routines, which handle the updating of the script's data structure, will be covered in Chapter 10.

Back to: Perl for Web Site Management


oreilly.com Home | O'Reilly Bookstores | How to Order | O'Reilly Contacts
International | About O'Reilly | Affiliated Companies | Privacy Policy

© 2001, O'Reilly & Associates, Inc.
webmaster@oreilly.com