Combined Log Format

Problem

You need a regular expression that matches each line in the log files produced by a web server that uses the Combined Log Format.[14] For example:

127.0.0.1 - jg [27/Apr/2012:11:27:36 +0700] "GET /regexcookbook.html HTTP/1.1" 200 2326 "http://www.regexcookbook.com/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"

Solution

^(?<client>\S+) \S+ (?<userid>\S+) \[(?<datetime>[^\]]+)\] "(?<method>[A-Z]+) (?<request>[^ "]+)? HTTP/[0-9.]+" (?<status>[0-9]{3}) (?<size>[0-9]+|-) "(?<referrer>[^"]*)" "(?<useragent>[^"]*)"
Regex options: ^ and $ match at line breaks
Regex flavors: .NET, Java 7, XRegExp, PCRE 7, Perl 5.10, Ruby 1.9
^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] "(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" (?P<status>[0-9]{3}) (?P<size>[0-9]+|-) "(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)"
Regex options: ^ and $ match at line breaks
Regex flavors: PCRE 4, Perl 5.10, Python
^(\S+) \S+ (\S+) \[([^\]]+)\] "([A-Z]+) ([^ "]+)? HTTP/[0-9.]+" ([0-9]{3}) ([0-9]+|-) "([^"]*)" "([^"]*)"
Regex options: ^ and $ match at line breaks
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
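
As a rough illustration, the named-capture variant can be tried in Python against the sample entry from the problem statement. This is a minimal sketch, assuming the fields are separated by single literal spaces as in that entry. The names COMBINED_LOG, SAMPLE, and parse_line are made up for this example.

```python
import re

# Python-flavor pattern with (?P<name>...) named groups.
# Assumption: fields are separated by single literal spaces.
COMBINED_LOG = re.compile(
    r'^(?P<client>\S+) \S+ (?P<userid>\S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<request>[^ "]+)? HTTP/[0-9.]+" '
    r'(?P<status>[0-9]{3}) (?P<size>[0-9]+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)"',
    re.MULTILINE,  # "^ and $ match at line breaks"
)

# The sample log entry from the problem statement.
SAMPLE = (
    '127.0.0.1 - jg [27/Apr/2012:11:27:36 +0700] '
    '"GET /regexcookbook.html HTTP/1.1" 200 2326 '
    '"http://www.regexcookbook.com/" '
    '"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"'
)


def parse_line(line):
    """Return a dict of the named fields, or None if the line does not match."""
    match = COMBINED_LOG.search(line)
    return match.groupdict() if match else None


fields = parse_line(SAMPLE)
print(fields["client"], fields["status"], fields["referrer"])
```

Because the pattern is compiled with re.MULTILINE, COMBINED_LOG.finditer(text) could also be used to walk every entry in a log file read into a single string.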

Discussion

The Combined Log Format is the same as the Common Log Format, but with two extra fields added at the end of each entry. The first extra field is the referring URL, and the second is the user agent. Both appear as double-quoted strings, so we can easily match each of them with "[^"]*".
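
The quoted-string fragment can be tried on its own. The sketch below is only an illustration: it adds a capturing group around the negated character class so the quotes are left out of the result, and the names QUOTED and tail are invented for the example.

```python
import re

# "[^"]*" with a capturing group around the negated character class,
# so each match yields the text between the quotes.
QUOTED = re.compile(r'"([^"]*)"')

# The last two fields of the sample entry from the problem statement.
tail = (
    '"http://www.regexcookbook.com/" '
    '"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"'
)

# findall returns the contents of the single group for each match.
referrer, useragent = QUOTED.findall(tail)
print(referrer)
print(useragent)
```

Since [^"]* cannot cross a double quote, each match stops at the first closing quote, which keeps the two fields separate.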
