Chapter 1. Log Analytics
The humble machine log has been with us for many technology generations. The data that makes up these logs is a collection of records generated by hardware and software—including mobile devices, laptop and desktop PCs, servers, operating systems, applications, and more—that document nearly everything that happens in a computing environment. With the constantly accelerating pace of business, these logs are gaining importance as contributors to practices that help keep applications running 24/7/365 and analyze issues faster to bring them back online when outages do occur.
If logging is enabled on a piece of hardware or software, almost every system process, event, or message can be captured as a time-series element of log data. Log analytics is the process of gathering, correlating, and analyzing that information in a central location to develop a sophisticated understanding of what is occurring in a datacenter and, by extension, providing insights about the business as a whole.
The comprehensive view of operations provided by log analytics can help administrators investigate the root cause of problems and identify opportunities for improvement. With the greater volume of that data and novel technology to derive value from it, logs have taken on new value in the enterprise. Beyond long-standing uses for log data, such as troubleshooting systems functions, sophisticated log analytics have become an engine for business insight and compliance with regulatory requirements and internal policies, such as the following:
- A retail operations manager looks at customer interactions with the ecommerce platform to discover potential optimizations that can influence buying behavior. Complex relationships among visit duration, time of day, product recommendations, and promotions reveal insights that help reduce cart abandonment rates, improving revenue.
- A ride-sharing company collects position data on both drivers and riders, directing them together efficiently in real time, as well as performing long-term analysis to optimize where to position drivers at particular times. Analytics insights enable pricing changes and marketing promotions that increase ridership and market share.
- A smart factory monitors production lines with sensors and instrumentation that provide a wealth of information to help maximize the value generated by expensive capital equipment. Applying analytics to log data generated by the machinery increases production by tuning operations, identifying potential issues, and preventing outages.
Using log analytics to generate insight and value is challenging. The volume of log data generated all over an enterprise is staggeringly large, and the relationships among individual pieces of log data are complex. Organizations are challenged with managing log data at scale and making it available where and when it is needed for log analytics, which requires high compute and storage performance.
Note
Log analytics is maturing in tandem with the global explosion of data more generally. The International Data Corporation (IDC) predicts that the global datasphere will grow more than fivefold in seven years, from 33 zettabytes in 2018 to 175 zettabytes in 2025.1 (A zettabyte is 10^21 bytes or a million petabytes.)
What’s more, the overwhelming majority of log data offers little value and simply records mundane details of routine day-to-day operations such as machine processes, data movement, and user transactions. There is no simple way of determining what is important or unimportant when the logs are first collected, and conventional data analytics are ill-suited to handle the variety, velocity, and volume of log data.
This report examines emerging opportunities for deriving value from log data, as well as the associated challenges and some approaches for meeting those challenges. It investigates the mechanics of log analytics and places them in the context of specific use cases, before turning to the tools that enable organizations to fulfill those use cases. The report next outlines key architectural considerations for data storage to support the demands of log analytics. It concludes with guidance for architects to consider when planning and designing their own solutions to derive full value from log data, culminating in best practices associated with nine key questions:
- What are the trends for ingest rates?
- How long does log data need to be retained?
- How will regulatory issues affect log analytics?
- What data sources and formats are involved?
- What role will changing business realities have?
- What are the ongoing query requirements?
- How are data-management challenges addressed?
- How are data transformations handled?
- What about data protection and high availability?
Capturing the Potential of Log Data
At its core, log analytics is the process of taking the logs generated from all over the enterprise—servers, operating systems, applications, and many others—and deducing insights from them that power business decision making. That requires a broad and coherent system of telemetry: PCs, servers, and other endpoints capturing relevant data points and transmitting them either to a central location or to a robust system of edge analytics.
Log analytics begins with collecting, unifying, and preparing log data from throughout the enterprise. Indexing, scrubbing, and normalizing datasets all play a role, and all of those tasks must be completed at high speed and with efficiency, often to support real-time analysis. This entire life cycle and the systems that perform it must be designed to be scalable, flexible, and secure in the face of requirements that will continue to evolve in the future.
Generating insights consists of searching for specific pieces of data and analyzing them together against historical data as well as expected values. The log analytics apparatus must be capable of detecting various types of high-level insights such as anomalies, relationships, and trends among the log data generated by information technology (IT) systems and technology infrastructure, as shown in Figure 1-1.
The following are some examples of these types of high-level insights:
- Anomaly: Historically, 90% of the traffic to a given server has come from HR. There is now an influx of traffic from a member of the sales department. The security team might need to investigate the possibility of an insider threat.
- Relationship: The type of spike currently observed in traffic to a self-serve support portal from a specific customer often precedes losing that customer to the competition. The post-sales support team might need to ensure that the customer isn’t at risk.
- Trend: Shopping cart abandonment rates are increasing on the ecommerce site for a specific product type. The sales operations team might need to investigate technical or marketing shortcomings that could be suppressing that product’s sales.
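As a rough illustration of the anomaly case above, a monitoring job might compare each department's current share of traffic to a server against a historical baseline and flag large deviations for the security team. The following minimal sketch is in plain Python; the baseline figures, department names, and threshold are hypothetical rather than drawn from any particular product.

# Hypothetical historical baseline: share of requests to a given server by department.
baseline_share = {"hr": 0.90, "finance": 0.07, "sales": 0.03}

# Share observed in the current reporting window (also hypothetical).
current_share = {"hr": 0.55, "finance": 0.05, "sales": 0.40}

def find_anomalies(baseline, current, threshold=0.20):
    """Return departments whose traffic share deviates sharply from the baseline."""
    anomalies = []
    for dept, expected in baseline.items():
        observed = current.get(dept, 0.0)
        if abs(observed - expected) > threshold:
            anomalies.append((dept, expected, observed))
    return anomalies

for dept, expected, observed in find_anomalies(baseline_share, current_share):
    print(f"Investigate {dept}: expected ~{expected:.0%} of traffic, seeing {observed:.0%}")

In practice, the baseline itself would be computed from stored historical log data rather than hard-coded, and the threshold would be tuned to the environment.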
In addition to detecting these high-level insights, the log analytics apparatus must be capable of effective reporting on and visualization of those findings to make them actionable by human administrators.
Your Environment Has Too Many Log Sources to Count
Log data is generated from many sources all over the enterprise, and deciding which ones to use for analytics is an ongoing process that can never be completed. The following list is representative, as opposed to exhaustive:
- Servers: Operating systems, authentication platforms, applications, databases
- Network infrastructure: Routers, switches, wireless access points
- Security components: Firewalls, identity systems such as Active Directory (AD), intrusion prevention systems, management tools
- Virtualization environments: Hypervisors, orchestration engines, management utilities
- Data storage: Local, virtualized, storage area network (SAN), and/or network-attached storage (NAS) resources; object storage
- Client machines: Usage patterns, data movement, resource accesses
Although they derive from a shared central concept, implementations of log analytics are highly variable in scope, intent, and requirements. They can run the gamut from modest to massive in scale, with individual log entries that might be sparse or verbose, carrying all manner of information in an open-ended variety of formats that might not be readily compatible, as shown in Figure 1-2. All share the challenge of tightening the feedback loop from sifting through and interpreting enormous numbers of events, often in real time, to generating insights that can optimize processes.
Storing log data enables analysts to go back through the repository of time-series data and re-create a series of events, correlating causes and effects after the fact. In addition to casting light on the past, identifying historical patterns also helps illuminate present and future dangers and opportunities. The sheer volume of that data and the need to be able to effectively query against it places significant demands on storage systems.
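For instance, once log records have been parsed and stored as time-series data, an analyst can pull back the events surrounding an incident and replay them in order. The brief sketch below assumes the records are already available as a CSV file with timestamp, source, and message columns; the file name, column names, and incident time are illustrative only.

import pandas as pd

# Assumes previously parsed log records with "timestamp", "source", and "message" columns.
events = pd.read_csv("parsed_logs.csv", parse_dates=["timestamp"])

# Re-create the sequence of events in the 15 minutes on either side of a suspected outage.
incident = pd.Timestamp("2024-03-01 14:05:00")
window = events[events["timestamp"].between(incident - pd.Timedelta("15min"),
                                            incident + pd.Timedelta("15min"))]

# Replay the events in order and see which sources were most active around the incident.
print(window.sort_values("timestamp")[["timestamp", "source", "message"]])
print(window["source"].value_counts())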
Treating Logs as Data Sources
The contents of logs are less a series of metrics than they are text strings akin to natural language, in the sense that they are formatted imprecisely, with tremendous variation depending on who created the log-writing mechanism. In addition, because log entries are only semistructured, they must be interpreted and then parsed into discrete data points before being written to a database.
Telemetry from thousands of different sources might be involved, from simple sensors to enterprise databases. In keeping with that enormous diversity, the structure, contents, and syntax of entries vary dramatically. Beyond differences in format and syntax, various logs contain discrete datasets, with mismatched types of data. Transforming and normalizing this data is key to making it valuable.
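As a small illustration of that transformation and normalization step, the sketch below maps records from two hypothetical sources, each with its own field names and timestamp conventions, onto one shared schema. The field names, formats, and values are assumptions made for the example, not a standard.

from datetime import datetime, timezone

# Two hypothetical sources describing similar events with different fields and formats.
firewall_record = {"ts": "1709301917", "src": "fw01", "msg": "connection dropped"}
app_record = {"time": "2024-03-01T14:05:17Z", "node": "web01", "text": "connection dropped"}

def normalize_firewall(rec):
    """Epoch-seconds timestamp, terse field names."""
    return {
        "timestamp": datetime.fromtimestamp(int(rec["ts"]), tz=timezone.utc),
        "host": rec["src"],
        "message": rec["msg"],
        "source": "firewall",
    }

def normalize_app(rec):
    """ISO 8601 timestamp, verbose field names."""
    return {
        "timestamp": datetime.fromisoformat(rec["time"].replace("Z", "+00:00")),
        "host": rec["node"],
        "message": rec["text"],
        "source": "application",
    }

unified = [normalize_firewall(firewall_record), normalize_app(app_record)]
print(unified)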
Analytics can be performed on log data that is either streaming or at rest. Real-time or near-real-time analysis of logs as they are generated can monitor operations and reveal emerging or existing problems. Analysis of historical data can identify trends in quantities such as hardware utilization and network throughput, providing technology insights that complement business insights more broadly and help guide infrastructure development. Using older data to create baselines for the future also helps to identify cases for which those ranges have been exceeded.
Log data is messy, both in the organizational sense of mismatched formats of logs from various sources and in the hygienic sense of misspellings, missing data, and so on. Beyond the need to interpret the structures of entries and then parse them, the transformations applied to log data must also account for quality issues within the data itself. For example, log analytics systems typically provide the ability to interpret data so that queries succeed even against data points that include synonyms, misspellings, and other irregularities.
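One lightweight way to achieve that kind of tolerant matching, shown here purely as an illustration using the Python standard library, is fuzzy string comparison: difflib can match a query term against observed field values even when some entries are misspelled. The sample values are invented.

import difflib

# Values observed in a hypothetical "department" field, including misspelled variants.
observed_values = ["human resources", "humna resources", "sales", "slaes", "finance"]

def tolerant_matches(query, values, cutoff=0.8):
    """Return observed values similar enough to the query term to count as a match."""
    return difflib.get_close_matches(query, values, n=len(values), cutoff=cutoff)

print(tolerant_matches("human resources", observed_values))  # includes the misspelled variant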
Aside from quality issues, data logs can contain mismatches simply because of the way they characterize data, such as one security system tagging an event as “warning” while another tags the same event as “critical.” Such discrepancies among log entries must be resolved as part of the process of preparing data for analysis.
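One simple way to resolve such discrepancies during data preparation is a translation table that maps each source's severity vocabulary onto a single shared scale. The source names, vocabularies, and scale below are hypothetical.

# Hypothetical per-source vocabularies mapped onto one shared severity scale.
SEVERITY_MAP = {
    "security_system_a": {"informational": "low", "warning": "high", "critical": "high"},
    "security_system_b": {"notice": "low", "critical": "high", "emergency": "high"},
}

def unify_severity(source, label):
    """Translate a source-specific severity label into the shared scale."""
    return SEVERITY_MAP.get(source, {}).get(label.lower(), "unknown")

print(unify_severity("security_system_a", "warning"))   # -> "high"
print(unify_severity("security_system_b", "critical"))  # -> "high"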
Collecting log data from legacy systems can also be challenging. Although legacy applications, operating systems, and hardware are frequent culprits in operational issues, they often provide less robust (or simply different) logging than their more modern counterparts. Such cases might require additional layers of data transformation to normalize their log data against that of the rest of the environment and provide a holistic basis for log analytics.
Standardizing Log Formatting
One of the major issues in log analysis is that there is no standard way of formatting log messages. The closest thing to a standard is the Syslog protocol (RFC 5424); unfortunately, many other formats for log messages remain in use. To analyze log data, the messages must ultimately be mapped to a schema of columns and rows. Using custom log formats means that any downstream system that seeks to consume these logs will require a custom parser for each format.
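To make that mapping concrete, the sketch below uses a regular expression to turn a syslog-style plaintext line into named columns; each additional format would need its own pattern. The sample line, pattern, and field names are illustrative rather than a complete syslog parser.

import re

# An illustrative syslog-style plaintext line.
line = "Mar  1 14:05:17 web01 sshd[4121]: Failed password for admin from 203.0.113.9"

# One custom pattern per format: this one maps the line onto named columns.
SYSLOG_PATTERN = re.compile(
    r"(?P<timestamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<process>\w+)\[(?P<pid>\d+)\]: "
    r"(?P<message>.*)"
)

match = SYSLOG_PATTERN.match(line)
if match:
    row = match.groupdict()  # {"timestamp": "Mar  1 14:05:17", "host": "web01", ...}
    print(row)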
In the security environment, some devices can be configured to emit logs as XML or JSON.2 These formats offer two advantages over plaintext logs. First, the schema of the data is contained in the file structure, which simplifies downstream processing. Second, most analytic scripting languages, such as Python and R, have libraries that easily ingest these formats, which eliminates the need for custom parsers. Of these two formats, JSON has two significant advantages over XML: the schema in JSON data is not ambiguous, and, for the same data, JSON files are smaller than XML files.
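As a brief example of why structured formats simplify downstream processing, the newline-delimited JSON entries below can be loaded with nothing more than the Python standard library, because the field names travel with each record. The sample events are made up.

import json

# Illustrative newline-delimited JSON log entries.
raw_lines = [
    '{"time": "2024-03-01T14:05:17Z", "host": "fw01", "severity": "warning", "event": "port scan detected"}',
    '{"time": "2024-03-01T14:05:19Z", "host": "fw01", "severity": "critical", "event": "rule update failed"}',
]

# No custom parser needed: the structure and field names are part of the data itself.
events = [json.loads(line) for line in raw_lines]

for event in events:
    print(event["time"], event["host"], event["severity"], event["event"])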
Columnar versus row-based formats
An additional option for log data is a columnar format such as Apache Parquet. Parquet is an open source columnar file format designed to support fast processing of complex data. Being columnar,3 it groups data together by column instead of storing rows together, and it stores metadata about the columns alongside the actual data.
The columnar format has several advantages over row-based formats, including columnar compression, metadata storage, and complex data support. In practice, this means that storing log data in Parquet format can reduce its size by as much as six times compared with conventional logs. It can also accelerate queries, because the stored metadata helps minimize the amount of data scanned during a query.
Many modern analytics platforms support the Parquet format, including Python/pandas, R, Spark, Drill, Presto, Impala, and others.
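The short sketch below shows that workflow with pandas, which relies on a Parquet engine such as pyarrow being installed: parsed log records are written once in columnar form, and a later query reads back only the columns it needs. The file name, columns, and sample rows are illustrative.

import pandas as pd

# Parsed log records (illustrative columns and values).
logs = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-03-01 14:05:17", "2024-03-01 14:05:19"]),
    "host": ["web01", "fw01"],
    "severity": ["warning", "critical"],
    "message": ["failed login", "rule update failed"],
})

# Write once in columnar form; compression is applied column by column.
logs.to_parquet("logs.parquet", compression="snappy")

# A later query reads back only the columns it needs, not the whole file.
severities = pd.read_parquet("logs.parquet", columns=["timestamp", "severity"])
print(severities)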
The Log Analytics Pipeline
As log data continues to grow in volume, variety, and velocity, the associated challenges require structured approaches to infrastructure design and operations. Log analytics and storage mechanisms for machine data based on tools such as Splunk and the Elastic Stack must be optimized across a life cycle of requirements, as illustrated in Figure 1-4. To perform these functions effectively, it must be possible to draw data from anywhere in the environment, without being impeded by data silos.
This pipeline represents the process for transforming log data into actionable insights, although in practice the order of the steps might be rearranged, or only a subset of them might be used. Common operations performed on log data include the following (a simplified end-to-end sketch follows the list):
- Collect: From its dispersed sources, log data must be aggregated, parsed, and scrubbed, inserting defaults for missing values, discarding irrelevant entries, and so on.
- ETL (Extract, Transform, Load): Data preparation can include cleaning out bad entries, reformatting, normalizing, and enriching the data with elements of other datasets.
- Index: To accelerate queries, the value of indexing all or a portion of the log data must be balanced against the compute overhead required to do so (as discussed below).
- Store: Potentially massive sets of log data must be stored efficiently, using infrastructure built to deliver performance that scales out smoothly and cost effectively.
- Search: The large scale of the log data in a typical implementation places extreme demands on the ability to perform flexible, fast, sophisticated queries against it.
- Correlate: The relationships among various data sources must be identified and correlated before the significance of the underlying data points can be uncovered.
- Visualize: Treating log entries as data means that they can be represented visually using graphs, dashboards, and other means to assist humans in understanding them.
- Analyze: Slicing and dicing log data and applying algorithms to it in a structured, automated way enables you to identify trends, patterns, and actionable insights.
- Report: Both predefined and ad hoc reporting must be powerful and flexible so that users can rapidly produce outputs that illuminate their business needs.
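The following is a deliberately simplified sketch, in plain Python, of how a few of these stages (collect, ETL, store, search, and report) might chain together; a real deployment would delegate each stage to purpose-built, scale-out platforms, and every name and sample value here is illustrative.

import json
import sqlite3
from datetime import datetime

# Collect: in practice, entries arrive from forwarders or agents; here, a static sample.
raw_entries = [
    '{"time": "2024-03-01T14:05:17Z", "host": "web01", "severity": "warning", "message": "failed login"}',
    '{"time": "2024-03-01T14:06:02Z", "host": "web01", "severity": "critical", "message": "service restart"}',
    'not valid json',
]

# ETL: parse each entry, normalize the timestamp, and discard entries that cannot be parsed.
def transform(line):
    try:
        rec = json.loads(line)
        rec["time"] = datetime.fromisoformat(rec["time"].replace("Z", "+00:00")).isoformat()
        return rec
    except (json.JSONDecodeError, KeyError, ValueError):
        return None

records = [r for r in map(transform, raw_entries) if r is not None]

# Store: an in-memory database stands in for a purpose-built, scale-out log store.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE logs (time TEXT, host TEXT, severity TEXT, message TEXT)")
db.executemany("INSERT INTO logs VALUES (:time, :host, :severity, :message)", records)

# Search and report: count high-severity events per host.
for host, count in db.execute(
        "SELECT host, COUNT(*) FROM logs WHERE severity = 'critical' GROUP BY host"):
    print(f"{host}: {count} critical event(s)")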
Note
The framework of steps given here is a guideline; it could easily be expanded to call out more specific actions such as data parsing, transformation, and interpretation, among many others. The point of this life cycle description is to provide a workable overall view rather than the most exhaustive one possible.
Getting your arms fully around the challenges associated with implementing log analytics is daunting. The potential sources and types of log data available are of an open-ended variety, as are the possible uses of that data. Although the specific implementations at every organization will be different, they share general technical requirements as well as the potential to be applied to common business needs and use cases.
1 David Reinsel, John Gantz, and John Rydning. IDC, November 2018. “The Digitization of the World from Edge to Core”.
2 Unfortunately, the option to emit logs in XML or JSON is not available on all devices.
3 For more information about the difference between row-based and columnar stores see: “The Data Processing Holy Grail? Row vs. Columnar Databases”.