Chapter 1. The Rise of the Security Data Lake

Today’s cybersecurity experts are overwhelmed. They are constantly on guard against malicious activity on their networks, from advanced malware infections to persistent threats, and from phishing schemes to SQL injection attacks. These external assaults are further complicated by the growing number of internal risks arising from simple errors, disgruntled employees, and outdated software configurations. Security experts must act on the assumption that all applications, services, identities, and networks are under threat.

For years, cybersecurity teams have relied on standalone security information and event management (SIEM) systems that aggregate log data from firewalls, servers, network devices, and other sources. By pulling together these data points, analysts in the security operations center (SOC) can detect and respond to attacks as they happen, with the goal of mitigating threats quickly. Unfortunately, the process of identifying and investigating these incidents has failed to keep up with the complexities stemming from cloud computing, DevOps, work-from-home practices, and other emerging computing and lifestyle environments. Growing visibility gaps and insufficient automation prevent security teams from achieving their goals in threat detection and response, as well as tackling other important use cases such as regulatory compliance and vulnerability management.

In addition, as data from these activity logs grows in complexity and scope, most organizations find they can analyze only a small fraction of these vast repositories. Legacy security tools have limited data management capabilities and restrictive data storage allocations, which hinder the effectiveness of forensic investigations. Meanwhile, because of increasingly complex regulatory requirements, security professionals must help their organizations comply with strict data privacy regulations governing the creation, storage, and use of consumer data. This adds to the already onerous task of monitoring corporate information systems to avoid unauthorized incursions—before data is lost, trust is breached, or customers become aware of performance issues.

Security teams need a faster, easier way to get the data they need so they can stop bad actors before attacks escalate into breaches. Maintaining strong edge controls, such as endpoint detection and response (EDR) software and secure access service edge (SASE) systems, is an important part of network security, but savvy attackers know how to penetrate these virtual perimeters. To minimize risk, security operations teams must modernize their cybersecurity systems so they can detect, analyze, and even predict potential threats quickly and effectively to thwart breaches when intrusions do occur.

A security data lake is a specialized data lake designed for collecting and manipulating security data. This report describes how the security data lake model can complement or replace the traditional SIEM model. It also describes how to create a modern security data lake with an organization’s existing cloud data platform to deliver comprehensive visibility and powerful automation across multiple security use cases.

Understanding the Limitations of the Traditional SIEM Model

SIEMs monitor and analyze data from users, software applications, hardware assets, cloud environments, and network devices. These information systems allow security professionals to recognize potential security threats and vulnerabilities before they have a chance to disrupt business operations. The data is collected, stored, and analyzed in real time, allowing security teams to automatically monitor logins, data downloads, and other activities, as well as track and log data for compliance or auditing purposes. Security experts write and maintain rules to manage alerts from servers, firewalls, antivirus applications, and many types of equipment sensors. This event data from machine logs and other network sources is stored in a database and presented via monitoring dashboards.

The traditional SIEM model was adequate when corporate data and applications resided in on-premises data centers. Today, as organizations migrate information systems to the cloud, adopt software as a service (SaaS) applications, and deploy an immense variety of web and mobile applications, they generate much more data. Every transaction—often every click, swipe, or tap—generates a record, leading to a deluge of log data. SIEM systems bog down under the burgeoning load of data arising from these cloud-driven log sources as well as from containerized systems such as Kubernetes.

Popular SIEM solutions are designed to search activity logs, overlooking complementary data sets such as asset inventory and configuration records. To supplement security logs, security analysts may also gather contextual information about the employees of the organization they work to protect, such as user location, device type, and job role. However, SIEM solutions can’t ingest these other enterprise data sources, so they can’t use complementary and contextual data to automatically weed out false positives (noisy alerts) or uncover false negatives (undetected threats). The SIEM can ingest only log data, and only in limited quantities. Furthermore, traditional SIEM providers offer only rudimentary analytics via their proprietary search languages.
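To make the role of contextual data concrete, here is a minimal sketch of how a login alert might be triaged once it can be compared against employee records. Everything in it is an assumption made for illustration: the `Alert` fields, the `HR_CONTEXT` table, and the `triage` rule are hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    user: str
    event: str
    source_country: str


# Hypothetical contextual data of the kind a log-only SIEM would not hold.
HR_CONTEXT = {
    "alice": {"role": "sales", "home_country": "US", "travel_countries": {"DE"}},
    "bob": {"role": "engineer", "home_country": "US", "travel_countries": set()},
}


def triage(alert: Alert, context: dict) -> str:
    """Return 'suppress' for alerts explained by context, else 'investigate'."""
    profile = context.get(alert.user)
    if profile is None:
        # No record of this identity at all: never suppress.
        return "investigate"
    expected = {profile["home_country"]} | profile["travel_countries"]
    if alert.source_country in expected:
        return "suppress"
    return "investigate"


alerts = [
    Alert("alice", "login", "DE"),    # approved travel: likely a noisy alert
    Alert("bob", "login", "DE"),      # no travel on record
    Alert("mallory", "login", "US"),  # unknown user
]
results = [(a.user, triage(a, HR_CONTEXT)) for a in alerts]
```

The same login event produces different outcomes depending on the context available, which is the kind of decision a log-only system cannot make on its own.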

SIEM solutions are also prohibitively expensive to use with high-volume data sets, such as cloud activity logs and endpoint forensic data. As a result, potentially important security data is kept siloed in low-cost, archival storage media, commonly known as cold archives because the data is accessed infrequently. By contrast, hot data resides in a readily accessible state for instant query and analysis, but generally only in limited quantities.

These restrictions stem from a basic problem: most SIEM solutions emerged when corporate applications and data resided in on-premises data centers. Most of these SIEM offerings now operate in the cloud, but they lack a true cloud multitenant architecture. These “cloud-washed” solutions can’t take advantage of the near-unlimited storage and computing power the cloud offers. SIEM searches can take minutes or even hours to complete, and it is difficult to scale these deployments to match the steady growth in log data that arises from today’s mobile and cloud-based information systems. SIEM solutions are also expensive because of high software license costs and excessive data storage costs. In turn, these solutions are constrained by limited retention windows and can’t combine structured, semistructured, and unstructured data into a single repository. Figure 1-1 sums up these limitations.

Figure 1-1. In the era of mobile computing and cloud-based applications, yesterday’s SIEM tools are showing their limitations

Expanding Your Analytic Horizons

Every time an alert fires or a breach investigation is launched, security analysts must quickly identify the attacker’s entry point, establish the blast radius, and validate whether or not an incident has been properly mitigated. However, it’s hard with traditional SIEM solutions to synthesize insights from activity logs with contextual data sources, restricting the security organization’s ability to effectively detect and respond to genuine threats. With traditional SIEM, monitoring activities remain largely manual, and the detection/response cycle often isn’t quick enough to adequately thwart determined attackers.

According to the 2022 Verizon Data Breach Investigations Report, the time from an attacker’s first action in an event chain to the initial compromise of an asset is typically measured in minutes, whereas the time to discovery is often measured in days, weeks, months, or even years.

Many reasons for this poor response time exist, beginning with the obvious: there are simply too many systems, devices, and applications that generate contextual data in disparate places. When these data sets are siloed, SIEM solutions can’t correlate activity across them, so a threat that touches one or more information systems can go undetected.

Forward-looking security teams see the value in storing contextual data from dozens of sources in a single repository so they don’t have to turn to many different places to find the right information. Consolidating security data streamlines investigations by allowing security teams to use robust analytics and data science methods to spot suspicious activity and respond to threats. A centralized repository makes it easier to apply modern business intelligence and predictive analytics capabilities to the data, dramatically improving on the search-only interfaces that characterize traditional SIEM systems. This approach helps security teams overcome the limitations of traditional SIEM solutions.

Reviewing Security Data Lake Prototypes

Most security teams rely on their SIEM solutions as a centralized data platform, pulling in data from various “point” products—specialized software solutions that address specific security use cases—to support threat detection and investigations. However, as more organizations move their information systems to the cloud and adopt a wide variety of SaaS applications, traditional SIEM solutions can’t keep up with the complexity and volume of data storage. Because it is cost prohibitive to store large amounts of data in traditional SIEM products, organizations must decide which limited data they can collect from security sources and how long they can keep it available in an active, searchable, hot database.

As early security data lake prototypes gained prominence, cybersecurity teams saw the potential to leverage robust data platforms to power their immense security workloads. Early security data lakes were built using Hadoop. However, these open source systems required complex development and maintenance to get them up and running. Hadoop infrastructure also required specialized experts to implement, manage, and scale—skills most security teams don’t have.

Other organizations cobbled together security data by storing some of it in a cold data archive, such as an Amazon Web Services S3 cloud storage bucket. This approach solved the data storage problem, but because no direct integrations among these disparate environments existed, these security data lakes required lots of tinkering. For example, organizations had to constantly restore data from a cold archive into an active or hot database to conduct analytics.

Introducing the Modern Cloud Security Data Lake

Security professionals who attempted to supplement their SIEM solutions by building a security data lake from scratch often found themselves mired in expensive projects with limited results. They learned that a complete security data lake involves much more than loading archival security logs and applying general-purpose analytics. Organizations must also overcome unique challenges related to ingesting, enriching, and formatting many types of data from many sources for specific security use cases. These challenges include assessing the most pressing threat detection requirements and addressing those requirements with custom queries, machine learning models, and data visualizations.

To understand the power and potential of a modern security data lake, let’s review the attributes of a general-purpose data lake—a versatile repository designed to store large amounts of data in native formats. This data can be structured, semistructured, or unstructured, and it can include tables, text files, system logs, and other sources. To maximize flexibility, data lakes do not impose a schema on the data when it is captured. Instead, the schema is applied when the data is extracted from the data lake, allowing for multiple use cases on the same data.
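The schema-on-read idea described above can be sketched in a few lines. In this illustration, raw events are kept exactly as captured, and each consumer applies its own schema when reading. The event fields, the two schemas, and the `read_with_schema` helper are all hypothetical names chosen for the example.

```python
import json

# Raw events stored in native form; no schema was enforced at capture time.
RAW_EVENTS = [
    '{"ts": "2024-01-01T10:00:00Z", "user": "alice", "action": "login", "src_ip": "203.0.113.7"}',
    '{"ts": "2024-01-01T10:05:00Z", "user": "bob", "action": "download", "bytes": 1048576}',
    '{"ts": "2024-01-01T10:06:00Z", "host": "web-1", "action": "restart"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time; fields missing from a record become None."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: (convert(record[field]) if field in record else None)
            for field, convert in schema.items()
        }


# Two use cases read the same raw data through different schemas.
LOGIN_SCHEMA = {"ts": str, "user": str, "src_ip": str}
TRANSFER_SCHEMA = {"user": str, "bytes": int}

logins = [r for r in read_with_schema(RAW_EVENTS, LOGIN_SCHEMA) if r["src_ip"] is not None]
transfers = [r for r in read_with_schema(RAW_EVENTS, TRANSFER_SCHEMA) if r["bytes"] is not None]
```

Because nothing was discarded at capture time, a third use case could later read the `host` field without any change to how the events are stored.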

Just as a general-purpose data lake allows analysts and data scientists to consolidate many types of data for analysis and visualization, a modern cloud security data lake enables security teams to use one system to analyze many kinds of data. A security data lake should be able to store and manage data from single-, multi-, and cross-cloud environments that encompass an immense variety of SaaS apps along with data entrusted to public cloud providers such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).

Every security product, network device, and computer on a network creates its own logs. Instead of forcing security analysts to manually gather and separately analyze data from all these siloed systems and devices, a modern security data lake includes data pipelines that pull it all together to enable consolidated analytics. Centralizing these logs in a security data lake simplifies threat investigations and other cyber use cases such as control validation, identity and access, and vulnerability management.
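As a rough sketch of what such a pipeline does, the example below normalizes events from two differently formatted sources into one common record shape and merges them into a single timeline. Both log formats, the field names, and the `run_pipeline` function are assumptions invented for illustration; real sources and production pipelines are considerably more involved.

```python
import json


def parse_firewall(line: str) -> dict:
    # Hypothetical format: "<timestamp> <action> <src_ip> <dst_ip>"
    ts, action, src, dst = line.split()
    return {"ts": ts, "source": "firewall", "actor": src,
            "action": action.lower(), "target": dst}


def parse_cloud_audit(line: str) -> dict:
    # Hypothetical JSON audit record.
    event = json.loads(line)
    return {"ts": event["eventTime"], "source": "cloud_audit",
            "actor": event["principal"], "action": event["operation"].lower(),
            "target": event["resource"]}


PARSERS = {"firewall": parse_firewall, "cloud_audit": parse_cloud_audit}


def run_pipeline(batches: dict) -> list:
    """Normalize every source into one record shape and return a single timeline."""
    events = [PARSERS[source](line)
              for source, lines in batches.items()
              for line in lines]
    # ISO 8601 UTC timestamps in the same format sort correctly as strings.
    return sorted(events, key=lambda e: e["ts"])


batches = {
    "firewall": ["2024-01-01T10:02:00Z DENY 198.51.100.4 10.0.0.5"],
    "cloud_audit": ['{"eventTime": "2024-01-01T10:01:00Z", "principal": "alice", '
                    '"operation": "DeleteBucket", "resource": "logs-archive"}'],
}
timeline = run_pipeline(batches)
```

Once every source shares the same fields, an analyst can query one timeline instead of searching each system separately.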

Harnessing the Power of a Cloud Data Platform and Connected Ecosystem

Early security data lake implementations resulted in a swamp of data that security analysts could not readily leverage for investigations. Modern security data lakes are enabled by a cloud data platform that can scale up and down automatically based on fluctuating workloads. As shown in Figure 1-2, these platforms provide inexpensive storage for structured, semistructured, and unstructured data, which is important for security use cases, and they provide strong control and management capabilities to govern how users access the data. They also offer near-unlimited compute power and a growing ecosystem of connected applications, providing off-the-shelf capabilities complete with API integrations, purpose-built interfaces, and near-immediate access to up-to-date security content via data marketplaces.

Figure 1-2. A modern cloud data platform provides unique capabilities for creating security data lakes

With virtually unlimited cloud data storage capacity, security teams are no longer constrained by the data-ingestion and data-retention limits imposed by traditional SIEMs. They can store all their data in a single platform and maintain it all in a hot, readily accessible state. Threat intelligence data collected from SIEM logs and other security sources can be joined with other data sets to reveal the full scope of an incident from historical records.
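The following sketch illustrates that kind of join using standard SQL. It uses Python's built-in `sqlite3` module as a stand-in for a cloud data platform, and the table names, columns, and sample rows are hypothetical. The point is only that when historical logs remain queryable, a newly learned indicator can be matched against activity from long ago.

```python
import sqlite3

# In-memory database standing in for a cloud data platform (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE network_logs (ts TEXT, host TEXT, dst_ip TEXT);
CREATE TABLE threat_intel (indicator TEXT, label TEXT);

INSERT INTO network_logs VALUES
  ('2023-03-14T02:11:00Z', 'laptop-17', '203.0.113.50'),
  ('2023-09-02T16:40:00Z', 'db-02',     '198.51.100.9'),
  ('2024-01-05T08:00:00Z', 'laptop-17', '192.0.2.10');

-- An indicator learned recently, long after the first log entry was written.
INSERT INTO threat_intel VALUES ('203.0.113.50', 'known C2 server');
""")

# Join the indicator list against the full log history.
matches = conn.execute("""
    SELECT l.ts, l.host, t.label
    FROM network_logs AS l
    JOIN threat_intel AS t ON t.indicator = l.dst_ip
    ORDER BY l.ts
""").fetchall()
```

With a short retention window, the oldest row in this example would already have been aged out to cold storage or deleted, and the match would be missed.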

As you will see in the chapters that follow, if you build your security data lake with a modern cloud data platform, you will obtain a data management solution that extends far beyond typical SIEM use cases to support many other parts of an advanced cybersecurity strategy. Your security team will be working within the same data platform as your other data teams and gain instant access to contextual data sets without incurring more overhead. Your security team can partner with data professionals throughout the enterprise, with everybody working within a standards-based environment (in contrast to other security solutions that use proprietary languages and formats). This freedom allows your security operations center to apply all types of business intelligence and data science tools to the security discipline.

Note

Security data lakes apply advanced analytics, near-limitless cloud storage, and near-infinite elastic computing to the task of security analytics. They allow security teams to access a universal data repository that combines security data from across the enterprise into one system, making it easier to evaluate alerts and understand attack details. Powerful analytics help these security professionals detect and respond to threats, while security content and visualizations in connected applications help them to be more effective.

Summary

Attack surfaces are expanding as enterprises increasingly rely on complex, multi-cloud environments. Unfortunately, legacy SIEM solutions fail to enable effective threat detection and response in these diverse IT settings. These outdated solutions are plagued with data storage and retention limitations, along with poor scalability and slow query performance. As a result, many security teams can’t easily determine what is happening across their organization’s infrastructure. These constraints also restrict historical reporting and data science initiatives because each set of security logs ingested into the SIEM is available for a limited period, typically 90 days or less. Effectively securing your environment is difficult when you have access to only some of your security data, some of the time.

To overcome these limitations, a growing number of organizations are moving their security data and SIEM workloads into security data lakes. A well-architected security data lake, based on a scalable cloud data platform, eliminates data ingestion and retention limits. A modern security data lake powers robust analytic capabilities on top of the cost-efficient data storage capability, which greatly reduces data management overhead and the manual investigation processes that traditional SIEM platforms require. Finally, aligning the security organization with the rest of the enterprise provides an opportunity to protect the business while enabling growth and innovation.
