Chapter 4. Data in the Service Domain

This chapter is concerned with the practical problem of collecting data moderated by a service, which is to say service domain data. At the conclusion of this chapter, you will be able to identify the sources for service domain data and understand how vantage impacts this data and the challenges to its validity.

What and Why

Practically speaking, service domain data consists of log data generated by various services operating on a host. Service domain data is characterized by the serviceâs moderation of thingsâwhere network data deals in packets which may or may not do anything, service data deals in events that are defined by the service. While data in the host domain consists of the hostâs current state, the service domain contains the events that caused a state change.

Service data is therefore distinguished from the other categories by the impact of a service moderating the information you receive. To understand how this moderation impacts use, consider the scenario shown in FigureÂ 4-1. This figure shows how data in two domains is generated for two different phenomena.

On the left is a simple GET request via HTTPS. In this example, data is collected from two points: a span port on a switch collecting raw pcap, and the logs from the web server. Since the HTTPS packets are encrypted, the only information the span port can provide is header-basedâports, protocol, and nothing else. Meanwhile, the web server can decrypt and process the data, recording that information.

While encryption is the most obvious challenge to the validity of inferences using packet capture, all network-based instrumentation faces challenges to its internal validity because TCP/IP design is built on the end-to-end principle. Systems designed around this principle push more functionality to the endpoints, while keeping the network itself relatively simple.

The practical impact of the end-to-end principle on network monitoring is that nothing seen on the network can be taken at face value, especially once you allow for perverse and creative packet crafting. For example, an attacker can send perfectly valid DNS request packets or a realistic-looking half of a TCP session, and from the network traffic alone, the observer cannot guarantee that this wasnât real traffic. In the network domain, the best the defender can do is guess something happened, while in the service domain, the defender can find out what happened.

Instrumenting at the service domain is not a panacea, however, as shown in the righthand side of the figure. In this case, a scanner is hitting the network with half-open scans, and the pcap sensor can see this information. But the web server has never received an HTTPS session from this, and doesnât record the information.

The net result of this is that no single collection point will address all of the potential phenomena you can face. Understanding the strength of each domain, and the data you can effectively collect in each case, is critical to determining what data to collect.

Logfiles as the Basis for Service Data

The foundational data you deal with when working with service information is service-specific logfiles. A service-specific logfile is a record of the transactions that the service engages in.

The concept of a transaction is critical here. A transaction is an interaction with the service as defined by the service. Example transactions may include fetching a web page (for a web server), sending out a DNS query (for a DNS server), or listing all the mail in a folder (for an IMAP server). Transactions are service-specific, and determining what transactions the service provides will require diving into its documentation and relevant standards.

Accessing and Manipulating Logfiles

Operating systems have dozens of processes generating log data at any time. In Unix systems, these logfiles are usually stored as text files in the /var/log directory. ExampleÂ 4-1 shows this directory for macOS (the ellipses indicate where lines were removed for clarity).

Example 4-1. A /var/log directory from a macOS system

drwxr-xr-x   2 _uucp            wheel      68 Jun 20  2012 uucp
...
drwxr-xr-x   2 root             wheel      68 Dec  9  2012 apache2
drwxr-xr-x   2 root             wheel      68 Jan  7 01:47 ppp
drwxr-xr-x   3 root             wheel     102 Mar 12 12:43 performance
...
-rw-r--r--   1 root             wheel     332 Jun  1 05:30 monthly.out
-rw-r-----   1 root             admin    6957 Jun  5 00:30 system.log.7.bz2
-rw-r-----   1 root             admin    5959 Jun  6 00:30 system.log.6.bz2
-rw-r-----   1 root             admin    5757 Jun  7 00:30 system.log.5.bz2
-rw-r-----   1 root             admin    5059 Jun  8 00:30 system.log.4.bz2
-rw-r--r--   1 root             wheel     870 Jun  8 03:15 weekly.out
-rw-r-----   1 root             admin   10539 Jun  9 00:30 system.log.3.bz2
-rw-r-----   1 root             admin    8476 Jun 10 00:30 system.log.2.bz2
-rw-r-----   1 root             admin    5345 Jun 11 00:31 system.log.1.bz2
-rw-r--r--   1 root             wheel  131984 Jun 11 18:57 vnetlib
drwxrwx---  33 root             admin    1122 Jun 12 00:23 DiagnosticMessages
-rw-r-----   1 root             admin    8546 Jun 12 00:30 system.log.0.bz2
-rw-r--r--   1 root             wheel  108840 Jun 12 03:15 daily.out
-rw-r--r--   1 root             wheel   22289 Jun 12 04:51 fsck_hfs.log
-rw-r-----   1 root             admin  899464 Jun 12 20:11 install.log

Note several features of this directory. The system.log files are started daily at 00:30 and are differentiated numerically. There are a number of subdirectories for handling various services. Check the configuration of each individual service you want to acquire logfiles for, but itâs not uncommon for Unix systems to dump them to a subdirectory of /var/log by default.

Unix logfiles are almost always plain text. For example, a brief snippet of a system log reads as follows:

$ cat system.log
Jun 19 07:24:49 local-imac.home loginwindow[58]: in pam_sm_setcred(): Done
    getpwnam()
Jun 19 07:24:49 local-imac.home loginwindow[58]: in pam_sm_setcred(): Done
    setegid() & seteuid()
Jun 19 07:24:49 local-imac.home loginwindow[58]: in pam_sm_setcred():
    pam_sm_setcred: krb5 user admin doesn't have a principal
Jun 19 07:24:49 local-imac.home loginwindow[58]: in pam_sm_setcred(): Done
    cleanup3

The majority of Unix system logs are text messages created by filling in templates with specific event information. This kind of templated text is easy to read, but doesnât scale very well.

As of Vista, Windows has extensively revamped its logging structure. Windows logs are further subdivided into several classes, notably:

Application log: Contains messages from individual applications. Note that services such as IIS may use auxiliary logs to contain additional information.
Security log: Contains security events, such as logon attempts and audit policy changes.
System log: Contains messages about system status, such as driver failures.
Forwarded events log: Stores events from remote hosts.

These logs are recorded in %SystemRoot%\System32\Config by default on most Windows installs; however, the more effective mechanism for accessing and reading the files is to use the Windows Event Viewer, as seen in FigureÂ 4-2.

Note the use of the event ID in FigureÂ 4-2; as with Unix systems, the Windows event messages are templated text, though Windows explicitly identifies the type of event using a unique numeric code. These messages are accessible from Microsoftâs website.

On Linux, application logfiles are much less consistently located. As seen in the /var/log directory, administrative structure may be set up to record a logfile in a fixed location, but almost every application has the ability to move around logfiles as necessary. When working with a particular application, consult its documentation to find out where it drops logs.

The Contents of Logfiles

Logs are usually designed to provide debugging and troubleshooting information for an administrator on the host. Because of this, you will often find that host-based logs require both some degree of parsing and some degree of reorganization to make them satisfactory security logs. In this section, we discuss mechanisms for interpreting, troubleshooting, and converting host log data.

The Characteristics of a Good Log Message

Before discussing how to convert a log message, and before complaining about how bad most log messages are, it behooves us to describe what a good security message should look like. A good security log should be descriptive, it should be relatable to other data, and it should be complete.

A descriptive message is one that contains enough information for an analyst to identify all necessary accessible resources for the event described by the message. For example, if a host log records that a user attempted to illegally access a file, it should contain the userâs ID and the file accessed. A host log recording a change in group permissions for a user needs to record the user and the group. A log recording a failed remote login attempt should include the ID that attempted the login and the address that attempted the login.

For example, consider a log message about a failed login attempt on host 192.168.2.2, local name myhost. A nondescriptive message would look like this:

Mar 29 11:22:45.221 myhost sshd[213]: Failed login attempt

This message doesnât tell me anything about why the failure occurred and doesnât provide any information to differentiate between this and any other failed login attempts. I have no information on the target of the attack; is it against the admin account or some user? Analysts with only this information will have to reconstruct the attempt solely from timing data, and they canât even be sure what host was contacted because the name of the host is nondescriptive and there is no addressing information.

A more descriptive message would look like this:

Mar 29 11:22:45.221 myhost (192.168.2.2) sshd[213]: Failed
    login attempt from host 192.168.3.1 as 'admin',
    incorrect password

A good mental exercise for building a descriptive message is to fall back to the âfive Ws and one Hâ approach from investigation and journalism: who, what, when, where, why, and how. The nondescriptive log message provides the what (failed login) and when, and a partial answer to where (myhost). The descriptive log message provides the who (192.168.3.1 as admin), why, and how (incorrect password), and a better where.

A relatable message is one where the event is easily related to information from other sources. For host-based events this requires IP address and timing information, including whether an event was remote or physically local, the IP address and port of the event if it was remote, and the IP address and port of the host. Relatability is a particular headache when dealing with service logs, as these types of logs often introduce additional addressing schemes on top of IP. For example, hereâs an unrelatable mail log message:

Mar 29 11:22:45.221 myhost (192.168.2.2) myspamapp[213]:
    Message <21394.283845@spam.com> title 'Herbal Remedies and Tiny Cars'
    from 'spammer@spam.com' rejected due to unsolicited commercial content

The message has a lot of information, but thereâs no way to relate the message sent back to the particular IP address that sent the message. When looking at log messages, consider how you will relate this information to other sources, particularly network traffic. A more relatable message would be as follows:

Mar 29 11:22:45.221 myhost (192.168.2.2) myspamapp[213]:
    Message <21394.283845@spam.com> title 'Herbal Remedies and Tiny Cars'
    from 'spammer@spam.com' at SMTP host 192.168.3.1:2034 rejected due
    to unsolicited commercial content

This example includes client port and addressing information, so we can now relate it to network traffic.

A complete log message is one that contains all the information about a particular event within that single log message. Completeness reduces the number of records an analyst has to search through and provides the analyst with a clear indicator that there is no further information to acquire from this process. Incomplete messages are usually a function of complicated process. For example, an antispam tool might run several different filters on a message, with each filter and the final decision being a separate log line:

Mar 29 11:22:45.221 myhost (192.168.2.2) myspamapp[213]:
    Received Message <21394.283845@spam.com> title
    'Herbal Remedies and Tiny Cars' from 'spammer@spam.com' at
    SMTP host 192.168.3.1:2034
Mar 29 11:22:45.321 myhost (192.168.2.2) myspamapp[213]:
    Message <21394.283845@spam.com> passed reputation filter
Mar 29 11:22:45.421 myhost (192.168.2.2) myspamapp[213]:
    Message <21394.283845@spam.com> FAILED Bayesian filter
Mar 29 11:22:45.521 myhost (192.168.2.2) myspamapp[213]:
    Message <21394.283845@spam.com> Dropped

With incomplete messages, you have to track state across multiple messages, each of which gives a snippet of information and which youâre going to have to group together to do any useful analysis. Consequently, I prefer the message to be aggregated at the start, like this:

Mar 29 11:22:45.521 myhost (192.168.2.2) myspamapp[213]:
    Received Message <21394.283845@spam.com> title
    'Herbal Remedies and Tiny Cars' from 'spammer@spam.com' at
    SMTP host 192.168.3.1:2034 reputation=pass Bayesian=FAIL decision=DROP

Log messages are often only minimally modifiable directly. Instead, to build an effective message you might have to write some kind of logging shim. For example, if the log system outputs syslog messages, you can receive and parse those messages, convert them to a friendlier format, and then forward them on. When considering converting logfiles, in addition to the rules just mentioned, consider the following:

Convert time to epoch time: Almost all record correlation involves identifying the same phenomenon from different sensors, meaning that you need to look for records that are close in time. Converting all time values to epoch time reduces parsing complexity, throws out the nightmare of time zones and daylight saving time, and ensures a consistent treatment for a consistent value.
Make sure sensors are synchronized: A corollary to the first note, make sure that when sensors report the same event, they are reporting the same time. Trying to correct for this after the fact is terribly difficult, so make sure that all the sensors are coordinated, that they all report the same time, and that the clocks are corrected and resynchronized regularly.
Include addressing information: Wherever possible, include the flow five-tuple (source IP, destination IP, source port, destination port, protocol). If some of the values can be inferred from the record (e.g., HTTP servers are running TCP), they can be dropped.
Ensure that delimiters are understood by the logger: On several occasions, I have encountered helpful administrators reconfiguring HTTP logs to use pipes rather than spaces as delimiters. A worthy sentiment, except when the logging module doesnât know to escape the pipe when it occurs in text. If the logger can change its delimiter and understands that the change requires escaping the character, let the logger do it.
Use error codes rather than text if possible: Text doesnât scale wellâitâs bulky, difficult to parse, and often repetitive. Logging systems that generate template messages can also include an error code of some kind as a compact representation of the message. Use this rather than text to save space.

Existing Logfiles and How to Manipulate Them

We can break logfiles into three major categories: columnar, templated, or stateful. Columnar logs record records in discrete columns that are distinguishable by delimiters or fixed text width. Templated logfiles look like English text, but the text comes from a set of document templates and is enumerable. Stateful logfiles use multiple text records to describe a single event.

Columnar data, such as NCSA HTTPdâs CLF format, records one message per event. This message is a summary of the entire event, and consists of a fixed set of fields in columnar format. Columnar logs are relatively easy to deal with as the fields are cleanly delineated and the format is rigid; every message has the same columns and the same information.

When dealing with columnar data, keep in mind the following:

Is the data delimited or fixed-width? If itâs fixed-width, are there fields that could conceivably exceed that width, and if so, are the results truncated or is the column expanded?
If the data is delimited, is the delimiter escaped when used in the fields? Customizable formats (such as HTTP logs) may use a default delimiter and automatically escape it; if you decide to use your own delimiter, it probably wonât be automatically escaped.
Is there a maximum record length? If there is a maximum record length, you may encounter truncated messages with missing fields.

ELF and CLF logfiles, discussed in the next chapter, are good examples of columnar formats.

Templated text messages record one message per event, but the events are recorded as unformatted English text. The messages are templated in the sense that they come from a fixed and enumerable set of templates. Where possible, itâs best to convert templated text messages into some kind of indexed numeric format. In the best case, this is at least partly done. For example, the Windows event log shown in FigureÂ 4-2 has an event ID that describes the type of event and can be used to determine the other arguments that will be provided.

When dealing with templated text, keep in mind the following:

Can you get a complete list of the log messages? As an example, consider the Windows logfile in FigureÂ 4-2. Each of these messages is text, but it has a unique integer ID for the message. Check the documentation for a list of all potential log messages.

Converting Text to Columns

Templated text can be parsed; the messages belong to an enumerable set and can conceivably be converted into a columnar format. Creating such a system, however, requires developing an intermediary application that can read the text, parse each individual message, and deposit the result in a schema. Doing so is a nontrivial development task (and will have to be updated when new messages are developed), but it also can reduce the amount of space required and increase the readability of the data. Here are specific steps to take when creating a system to convert templated text into columns:

From whatever documentation you can find on the text format, identify and select the messages most relevant to security. Any conversion script is going to consist of a bunch of regular expressions, and the fewer expressions you have to maintain, the happier youâll be.
For each message, identify the parameters it contains. As an example, consider the following made-up templated messages: âAntispam tool SPAMKILLER identifies email <12938@yahoo.com> as Spam,â âAntispam tool SPAMKILLER identifies email <12938@yahoo.com> as Commercial,â âAntispam tool SPAMKILLER identifies email <12938@yahoo.com> as Legitimate.â There are three potential parameters here: the name of the antispam tool (enumerable), the message ID (a string), and the output (enumerable).
Once youâve identified parameters for each potential message, merge the parameters to create a superset. The goal is to create a schema representation of all the parameters that a message may potentially have; a particular message may not have all of them.
Try to generate at least one event record for every templated message the application can generate, and then compare that to the output from the documentation. Documentation is often inaccurate, and you may be assuming a templated output that doesnât actually appear. This type of work is best done by setting up a small lab to actually generate each event.
Check absurdly long string lengths, delimiters, and control characters. Templates will often truncate output strings without clearly specifying them, and may also escape special characters. Since you will be laying multiple layers of parsers on top of each other, you need to be sure of what happens with those corner cases.
Fail conservatively when generating columnar data. You may not be able to realistically convert more than 9 out of every 10 messages; if thatâs the case, drop the unparsed messages into a separate error feed for the analysts to check as needed.

Stateful Logfiles

In a stateful logfile, a single transaction is split across multiple messages.¹ Stateful logfiles are extremely frustrating for operational analysisâyou spend time and cycles reconstructing the event that happened.

Again, some kind of shim is the usual solution here. In general, a shim for a stateful logfile will be some kind of cache holding partially formed records. As it receives signals that a new transaction is happening, it pushes a new record into the cache. When it receives a signal that the transaction completed, or it times out (a critical step), it removes the relevant records from the cache. In between, it will fill each record with the appropriate messages, and by doing so converts the logfiles into a stateless format. In practice, building these shims requires managing a number of different failure modes based around when the message terminates.

This is the kind of work where, to generate a quality logfile, youâre going to have to run the system through its paces and see whether there are hidden errors. To help prep for these errors, consider stateful logs in three categories: stateful with delimiters, stateful with identifiers, and interleaved.

If a logfile is stateful with identifiers, then the messages have individual ID numbers that link together the transaction. The following truncated IronPort example is a good example of a stateful message with IDs:

Thu Mar 24 16:58:31 2016 Info: Start MID 452 ICID 98
Thu Mar 24 16:58:31 2016 Info: MID 452 ICID 98 From: <alerts@fakecompany.org>
Thu Mar 24 16:58:31 2016 Info: MID 452 ICID 98 RID 0 To: \
    <elvis.presley@graceland.int>
Thu Mar 24 16:58:31 2016 Info: MID 452 Subject 'Sent to elvis presley to test \
    regexps in filtering'
Thu Mar 24 16:58:31 2016 Info: MID 452 ready 665 bytes from \
    <alerts@fakecompany.org>
Thu Mar 24 16:58:31 2016 Info: MID 452 matched all recipients for per-recipient \
    policy DEFAULT in the inbound table
Thu Mar 24 16:58:31 2016 Info: Delivery start DCID 151 MID 452 to RID [0]
Thu Mar 24 16:58:31 2016 Info: Message done DCID 151 MID 452 to RID [0] \
    [('x-customheader', '192.168.1.1')]
Thu Mar 24 16:58:31 2016 Info: MID 452 RID [0] Response 'Ok'
Thu Mar 24 16:58:31 2016 Info: Message finished MID 452 done
Thu Mar 24 16:58:33 2016 Info: Start MID 453 ICID 98

When working with these types of logs, check for the following pitfalls:

Determine the format of the identifiers, and how they are assigned. If identifiers are assigned linearly, (e.g., ID 1, ID 2, ID 3), be prepared for gaps in ID numbers and have contingencies ready, such as a secondary alert.
Determine messages for creating and destroying the identifiers. Expect to have a timeout or other mechanism for dealing with messages that donât close.
When creating a shim, fill in all the fields you will report on with a default unreported value, then update those fields as you encounter their values in the logfile. This way you can ship the record even if you donât have all the fields, and it will be clear which fields were omitted.

A logfile is stateful with delimiters if it uses some kind of discrete begin/end messages to indicate the beginning and ending of a transaction, and the entire transaction is contained within those begin and end messages. ExampleÂ 4-2 shows a representative format; as this example shows, each individual event is prefaced with a timestamp, and then there are a variable number of fields. Also of note is that there is no explicit delimiter for ending the messageâthe Request-Authenticator field may be a delimiter, but given the differences in the fields shown in both entries, it canât be verified.

Example 4-2. An example of a delimited log: GNU RADIUS packets²

Fri Dec 15 18:00:24 2000
        Acct-Session-Id = "2193976896017"
        User-Name = "e2"
        Acct-Status-Type = Start
        Acct-Authentic = RADIUS
        Service-Type = Framed-User
        Framed-Protocol = PPP
        Framed-IP-Address = 11.10.10.125
        Calling-Station-Id = "+15678023561"
        NAS-IP-Address = 11.10.10.11
        NAS-Port-Id = 8
        Acct-Delay-Time = 0
        Timestamp = 976896024
        Request-Authenticator = Unverified

Fri Dec 15 18:32:09 2000
        Acct-Session-Id = "2193976896017"
        User-Name = "e2"
        Acct-Status-Type = Stop
        Acct-Authentic = RADIUS
        Acct-Output-Octets = 5382
        Acct-Input-Octets = 7761
        Service-Type = Framed-User
        Framed-Protocol = PPP
        Framed-IP-Address = 11.10.10.125
        Acct-Session-Time = 1905
        NAS-IP-Address = 11.10.10.11
        NAS-Port-Id = 8
        Acct-Delay-Time = 0
        Timestamp = 976897929
        Request-Authenticator = Unverified

When working with log data, expect to run into common problems, in particular with how log messages are delimited. From experience, I recommend checking the following:

Verify that logs have both start and end delimiters. Itâs not uncommon for these log formats to explicitly start a log message, and then never end it.
Verify that a single transaction happens between delimiters. If at all possible, test this empirically by running a long transaction and, while that is running, running a simultaneous short transaction. If messages overlap, you have an interleaved log.
Determine the format of starting and ending messages, and be prepared to deal with incomplete messages.

The most problematic cases are stateful interleaved logfiles; these are logfiles where events take multiple lines, and multiple events happen simultaneously. In an interleaved logfile, you will see messages from event A, messages from event B, and then more messages from event A.

Stateful interleaved logfiles have serious internal validity problems and should, when encountered, be considered a secondary source of dataâacceptable for debugging, terrible for forensics. Your best solution when encountering this type of message is to keep it in the original format, since you canât trust that youâre seeing events in sequence

Network Security Through Data Analysis, 2nd Edition by Michael Collins

Chapter 4. Data in the Service Domain

What and Why

Figure 4-1. Comparing service and network domain data

Logfiles as the Basis for Service Data

Accessing and Manipulating Logfiles

Example 4-1. A /var/log directory from a macOS system

Figure 4-2. The Windows event log

The Contents of Logfiles

The Characteristics of a Good Log Message

Existing Logfiles and How to Manipulate Them

Stateful Logfiles

Example 4-2. An example of a delimited log: GNU RADIUS packets²

Further Reading

Don’t leave empty-handed

It’s yours, free.

Check it out now on O’Reilly

Chapter 4. Data in the Service Domain

What and Why

Figure 4-1. Comparing service and network domain data

Logfiles as the Basis for Service Data

Accessing and Manipulating Logfiles

Example 4-1. A /var/log directory from a macOS system

Figure 4-2. The Windows event log

The Contents of Logfiles

The Characteristics of a Good Log Message

Existing Logfiles and How to Manipulate Them

Stateful Logfiles

Example 4-2. An example of a delimited log: GNU RADIUS packets2

Further Reading

Don’t leave empty-handed

It’s yours, free.

Check it out now on O’Reilly

Example 4-2. An example of a delimited log: GNU RADIUS packets²