Chapter 4. Achieving Your Security Program Objectives

Bad actors are increasingly well funded, highly capable, and determined to leverage new technologies and paradigms to launch their attacks. In this constantly evolving threat landscape, building detections for every possible adversary or technique is nearly impossible. To maximize your defense posture, your security operations center (SOC) must employ proven, repeatable processes for creating and maintaining threat detections, in conjunction with continuous monitoring and testing to adapt them to real-world conditions. A security data lake enables you to achieve your security program objectives through a continuous process of improvement guided by the Threat Detection Maturity Framework. When paired with detection-as-code procedures, this process yields more robust threat detections and greater alert fidelity.

Introducing the Threat Detection Maturity Framework

Incident response (IR) and threat hunting are separate but closely related activities. IR teams continually educate threat hunters about current attack patterns. Conversely, threat hunting teams share insights from their analytics so the incident response team learns more about what constitutes normal behavior in the environment.

To stay up to date on prevalent and emerging attack patterns, many security organizations adopt the MITRE ATT&CK matrix, an industry-standard reference for gauging threat detection coverage and maturity. This publicly available knowledge base tracks the tactics and techniques used by threat actors across the entire attack lifecycle. Created by MITRE, a nonprofit organization that operates research centers for the United States government, the ATT&CK matrix helps security teams understand the motivations of adversaries and determine how their actions relate to specific classes of defenses.

Although the MITRE ATT&CK matrix is a good starting point, conscientious threat detection teams consider many additional factors outside of this matrix. A complete threat detection maturity framework should encompass five categories:

Processes

Development methodologies and workflows

Data, tools, and technology

Logs, data sources, integration logic, and documentation

Capabilities

Searches, analytic dashboards, and risk-scoring models

Coverage

Mapping of detections to threats, and prioritization of responses

People

A diverse and well-rounded SOC team

For each of these five categories, the framework defines the following three maturity levels:

Ad hoc

Initial rollout of security data, logic, and tools

Organized

Gradual adoption of best practices

Optimized

Well-defined procedures based on proven principles

For example, within the processes category, a team with an ad hoc level of maturity likely has no formalized development methodology, no defined inputs, only rudimentary detections, and no defined metrics.

As the team progresses to the organized stage, it establishes key development methodologies and workflows, defines important data inputs and detections, forges partnerships with connected application vendors, and collects essential metrics to gauge its progress.

Once the team reaches the optimized stage, the threat-detection effort includes a proven methodology and workflow, carefully delineated inputs for multiple detections, a well-defined threat lifecycle, and mature partnerships with connected application vendors. Metrics are continually collected and regularly presented to the CISO and other stakeholders accountable for the organization’s risk posture.

The article “Threat Detection Maturity Framework” provides additional details on how to structure your threat detection efforts and progress along the threat detection maturity curve.

Embracing Detection-as-Code Principles

As security teams tasked with detecting and mitigating threats progress along the maturity curve, they should carefully consider how they develop, deploy, and maintain detection logic. Just as software engineers adhere to the DevOps lifecycle to build and maintain robust software applications, detection engineers should follow the detection development lifecycle, which is governed by detection-as-code principles.

DevOps thrives on peer review processes for developing, testing, and deploying new code. Detection-as-code applies the same concepts to the creation and maintenance of detection logic for identifying risks (proactively) and threats (reactively). The approach also extends the “as code” discipline to the collection of security data and to the database schemas that define it. This gives security teams a structured way to make sense of security data at scale and to transform manual processes into automated ones. Without these rigorous, repeatable processes, your detections will generate false positives that may cause “alert fatigue.” Your IR team won’t be able to handle all the alerts, enabling some threats to progress from initial access events into full-blown breaches.

By taking lessons from the DevOps and DataOps disciplines, the detection development lifecycle allows security teams to leverage proven, repeatable processes for building, maintaining, and testing threat detections. It emphasizes reusable code, version control methods, peer review, and check-in/check-out procedures as analysts collaborate to create and maintain high-fidelity threat detections. Adhering to this lifecycle empowers SOC teams to develop robust security rules and monitor their performance in the environment.

The detection development lifecycle consists of the following six phases:

Requirements gathering

Collect relevant technical details from key stakeholders, such as the primary goal of each detection, the systems it targets, the risks and vulnerabilities it addresses, and the desired alerting methods (for example, Slack or Jira).

Design

Once work commences on a detection, the goal is converted into a detection strategy. Some security teams use standard detection frameworks such as the Palantir Alerting and Detection Strategy (ADS) framework, which also assists with creating documentation that defines the purpose and use of each detection.

Development

After a new detection’s design has been completed, it is converted into code. Make sure every detection has a set of common fields and a link to your chosen detection framework so the goal of the detection is clearly defined in the code; a brief sketch follows this list. Where possible, leverage out-of-the-box detections from connected applications.

Testing and deployment

Test each detection for accuracy, precision, and alert volume. Historical testing involves running the detection against past data. After testing is completed, detections are peer reviewed and managed in a version control system.

Monitoring

Continuously monitor the performance of deployed detections, review assumptions and gaps, and decommission detections that are no longer needed.

Continuous testing

This is how mature threat detection teams ensure each detection is accomplishing its intended goal. The output of continuous testing may be no action at all, if the detection is delivering appropriate alerts, or it may be a request to improve the detection or to build an entirely new one.
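
To make these phases concrete, here is a minimal sketch of what a detection defined as code might look like. The Detection class, its common fields, and the SQL it runs are illustrative assumptions rather than a prescribed schema; adapt them to your own framework and data lake.

    # A minimal detection-as-code sketch (hypothetical fields and schema).
    # Each detection carries common metadata plus the query it runs.
    from dataclasses import dataclass, field

    @dataclass
    class Detection:
        detection_id: str        # unique, version-controlled identifier
        title: str
        goal: str                # ties back to the design framework (e.g., ADS)
        mitre_technique: str     # ATT&CK mapping for coverage tracking
        severity: str
        query: str               # SQL run against the security data lake
        alert_channels: list = field(default_factory=lambda: ["slack"])

    failed_admin_logins = Detection(
        detection_id="DET-0042",
        title="Burst of failed admin logins",
        goal="Flag possible brute-force attempts against privileged accounts",
        mitre_technique="T1110",  # Brute Force
        severity="high",
        query="""
            SELECT user_name, COUNT(*) AS failures
            FROM auth_events                 -- hypothetical table name
            WHERE result = 'FAILURE'
              AND is_admin = TRUE
              AND event_time > DATEADD('minute', -15, CURRENT_TIMESTAMP)
            GROUP BY user_name
            HAVING COUNT(*) >= 10
        """,
    )

    # Historical testing runs the same query over past data to gauge alert
    # volume before the detection is peer reviewed, merged, and deployed.

Because the detection lives in a repository as a single artifact, the historical testing, peer review, version control, and monitoring steps described above can all operate on that same object.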

Adhering to this lifecycle enhances the quality of your detections, encourages robust documentation of those detections, makes scaling your team easier, and serves as a solid foundation for developing program metrics, as described in the next section.

Improving Threat Detection Fidelity

The complexity of today’s hybrid IT environments has led to an exponential growth in the number of alerts generated on a daily basis. Alerts arise not only from suspicious network behavior but also from routine internal events. For example, the finance department might roll out a new version of an enterprise software application that queries a large section of a key corporate database. As the software goes live, a sudden spike in CPU load on the database cluster might trigger multiple warnings, generating a flood of alerts spanning multiple systems. Other applications that rely on that cluster might experience memory deficiencies or latency issues.

These machine-generated alerts can be particularly challenging to isolate and identify because of the sheer volume of alert activity, or “noise,” that can arise from a single incident. IR teams seek to correlate these alerts across multiple layers of the technology stack to determine if an incident represents a true threat or is simply the result of routine system maintenance. If the team uses a cloud data platform that can hold security data along with business data from these other IT activities, they can more easily bring in the necessary contextual information to eliminate or avoid false positives.

Note

Combining the holistic visibility of the security data lake with a detection-as-code approach reverses the traditional volume/noise trade-off. More data starts to mean less noise, not more.

For example, if a US-based employee suddenly appears to be logging in from another country known for malicious attacks, analysts may need to quickly determine whether the event constitutes an actual intrusion involving stolen credentials. In the traditional SIEM model, the security team likely would reach out to HR or the user’s manager to find out if there’s a good reason for the login location. This approach takes time and delays the incident response. A security data lake, as discussed in previous chapters, enables the security team to apply up-to-date contextual information (such as HR updates) to activity logs across multiple systems. With detections defined as code, the rule that resulted in the original alert would be modified to include the data sets the analysts used to investigate and dismiss the alert. Thus the next time this type of scenario occurs, the alert will be eliminated within the detection logic, avoiding the false positive. The fastest alerts to triage are the ones that were never triggered in the first place.
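
To illustrate how such a rule might be refined, the following sketch joins login activity to HR travel records already in the security data lake so that approved travel never raises the alert in the first place. The table and column names are assumptions about how the data could be modeled, and the SQL dialect is only representative.

    # Hypothetical refinement of an "unusual login location" detection:
    # join the login events to HR context in the data lake so that approved
    # travel or relocations are filtered out before an alert is created.
    SUSPICIOUS_LOGIN_QUERY = """
        SELECT l.user_name, l.source_country, l.event_time
        FROM login_events l                  -- assumed table names
        LEFT JOIN hr_travel_records t
               ON t.user_name = l.user_name
              AND l.event_time BETWEEN t.travel_start AND t.travel_end
              AND t.destination_country = l.source_country
        WHERE l.source_country <> 'US'       -- US-based workforce in this example
          AND t.user_name IS NULL            -- no approved travel explains the login
    """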

When HR data is stored in the security data lake and updates about employee status are monitored, that data can be correlated with the security data and analyzed in time to prevent a potential breach or dismiss a false positive. Some security teams also store issue tracking and project management data from Jira and other ticketing systems. Agile development teams use this data to track bugs, stories, epics, and other tasks. A high-fidelity detection can monitor when these tickets are created and approved, and automatically incorporating the ticket data removes manual steps from the security team’s workflow.
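
In the same spirit, a detection can consult ticket data before an alert is routed. The sketch below is hypothetical: the ticket fields simply illustrate the idea of letting an approved change ticket explain away an otherwise suspicious configuration change.

    # Hypothetical suppression step: before routing a configuration-change
    # alert, check whether an approved change ticket (for example, Jira data
    # stored in the data lake) covers the same host and time window.
    def should_alert(change, approved_tickets):
        """Return True only if no approved ticket explains the change."""
        for ticket in approved_tickets:
            if (ticket["host"] == change["host"]
                    and ticket["window_start"] <= change["time"] <= ticket["window_end"]):
                return False    # covered by an approved change; log it and move on
        return True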

By supporting structured, semistructured, and unstructured data types, a security data lake can not only detect many attacks but also support all the types of data that help you contextualize each incident to determine the root cause.

As detection rule sets evolve through the continuous improvement process, their accuracy reaches the point where IR teams can rely on the runbooks for each alert. This maturity makes security orchestration, automation, and response (SOAR) activities more meaningful. If you’ve ever been disappointed by lackluster gains from SOAR, it was probably because low-fidelity detections required a human in the loop for most alerts. The combination of a security data lake and detection-as-code principles allows a SOAR program to achieve its full potential. Accurate and actionable detections trigger automated playbooks to stop a security breach or minimize its impact through actions such as issuing a security challenge or temporarily isolating compromised systems. It takes a lot of confidence to isolate a server in production, but a mature threat detection program can make this possible. Security data lakes also support advanced threat detections using data science techniques already employed in domains such as fraud detection, setting a near-limitless path toward improving detection maturity.
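
The sketch below shows how such a hand-off might look, assuming your SOAR or EDR tooling exposes containment and ticketing actions. The isolate_host, open_ticket, and log_for_tuning functions are placeholders for those product APIs, not real calls, and the thresholds are arbitrary.

    # Placeholder actions standing in for real SOAR/EDR API calls.
    def isolate_host(host):
        print(f"quarantining {host}")

    def open_ticket(alert, queue):
        print(f"ticket opened in {queue}: {alert['title']}")

    def log_for_tuning(alert):
        print(f"logged for detection tuning: {alert['title']}")

    def route_alert(alert):
        # Only high-fidelity, high-severity alerts trigger automated containment;
        # everything else goes to a human queue or is logged for tuning.
        if alert["severity"] == "critical" and alert["confidence"] >= 0.9:
            isolate_host(alert["host"])
            open_ticket(alert, queue="incident-response")
        elif alert["confidence"] >= 0.7:
            open_ticket(alert, queue="triage")
        else:
            log_for_tuning(alert)

    route_alert({"title": "Possible credential theft", "severity": "critical",
                 "confidence": 0.95, "host": "web-prod-07"})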

Preparing for Breach Response

In the event of a breach, security analysts must be able to study months’ or even years’ worth of data. These investigations may be performed directly within the security data lake using built-in SQL worksheets or via a purpose-built investigation interface within a connected application. Either way, results must arrive fast, which requires a single queryable repository for event data. The events that constitute a single incident may appear in one data set or be spread across many, and they can be close in time or months apart. A key advantage of the security data lake architecture is that it eliminates the need for “rehydrating” or “replaying” events from cold storage. Data pipeline solutions that rely on data restoration often underplay the complexity and delay introduced by that approach.

Responding to a diverse set of threats requires a diverse set of data. A security data lake allows for long retention windows and can apply a consistent schema across sources from the entire IT infrastructure, both cloud-based and on-premises. This approach is much easier than working in a traditional siloed landscape, where the security team must investigate alerts and events console by console, API by API. Instead, the team can prepare for breach response by normalizing events and modeling unified views for assets, users, vulnerabilities, and other variables. This dramatically simplifies IR procedures when the team must respond quickly. The security data lake model offers the opportunity to prepare for fast IR in collaboration with the data analytics team. Connected applications handle most of the prep with prebuilt code and data models.
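
For example, the team might model a unified login view ahead of time so that a single query spans the full retention window during an investigation. The sketch below is illustrative only; the source tables, columns, and Snowflake-style date function are assumptions about your environment.

    # Hypothetical unified view prepared before an incident, so investigators
    # never wait on cold-storage rehydration during a breach response.
    UNIFIED_LOGIN_VIEW = """
        CREATE OR REPLACE VIEW unified_logins AS
        SELECT event_time, user_name, src_ip, 'okta'    AS source FROM okta_logins
        UNION ALL
        SELECT event_time, user_name, src_ip, 'vpn'     AS source FROM vpn_sessions
        UNION ALL
        SELECT event_time, user_name, src_ip, 'windows' AS source FROM windows_logons
    """

    # During an investigation, one query covers the entire retention window:
    INVESTIGATION_QUERY = """
        SELECT * FROM unified_logins
        WHERE user_name = :suspect_user
          AND event_time > DATEADD('month', -12, CURRENT_TIMESTAMP)
        ORDER BY event_time
    """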

Measuring Alert Quality with KPIs

Once you identify your most critical log sources, such as data from firewalls, servers, and network devices, you can focus on improving the quality of the detections that operate against those sources. It’s important to constantly scrutinize your threat detection workflows and incident response systems with an eye for continuous improvement. Which data sources yield the most alerts? Which alerts are the noisiest? Which detections yield the most false positives? Which data is the most critical? How strong are your rules?

With a security data lake, all your data is in one place, and all authorized users can access it via BI tools and data science models. BI dashboards can display current metrics such as the volume of phishing emails, the number of incidents, and the severity of incidents. If information about security and support tickets is stored in the security data lake, the dashboards can show the mean time to detect (MTTD) as tickets are escalated and mean time to respond (MTTR) as tickets are closed.
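
As a rough sketch, MTTD and MTTR can be computed directly from ticket timestamps once they land in the lake. The field names and sample row below are assumptions about how your ticketing data might be exported.

    from datetime import datetime, timedelta

    closed_tickets = [   # sample rows; in practice, query the security data lake
        {"event_time": datetime(2024, 5, 1, 9, 0),
         "escalated_at": datetime(2024, 5, 1, 9, 20),
         "closed_at": datetime(2024, 5, 1, 11, 0)},
    ]

    def mean_delta(tickets, start_field, end_field):
        deltas = [t[end_field] - t[start_field]
                  for t in tickets if t.get(start_field) and t.get(end_field)]
        return sum(deltas, timedelta()) / len(deltas) if deltas else None

    # MTTD: event occurred -> ticket escalated; MTTR: escalated -> closed.
    mttd = mean_delta(closed_tickets, "event_time", "escalated_at")
    mttr = mean_delta(closed_tickets, "escalated_at", "closed_at")
    print(f"MTTD: {mttd}, MTTR: {mttr}")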

All these inquiries are made possible within a centralized platform where the team can query all data and insights simultaneously. Insights can be stored back into the data lake for future analysis. Mature security teams go beyond merely gathering lots of data into an analytics platform. Tremendous efficiency can be gained from metrics and key performance indicators (KPIs) that track the effectiveness of your cybersecurity efforts and enable data-driven decision making for future projects and initiatives.

Applying Data Science to Threat Hunting

Data science involves studying, processing, and extracting insights from a given set of information, such as the security data, log data, and contextual data sources you have identified as critical to your cybersecurity efforts. Data scientists create machine learning (ML) models that reveal trends and patterns in these data sets. For example, they might develop algorithms that identify the likelihood that certain types of devices, user profiles, or portions of a data set will be targeted during an attempted attack, or which types of network switches serve as the most common point of entry for denial-of-service attacks. Algorithms rank and score activity data to flag anomalies that may indicate suspicious behavior, maximizing the efficiency of security teams. Understanding these probabilities allows the security team to predict how potential attacks might unfold in the future.
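
One common approach, though by no means the only one, is an isolation forest that ranks events by how unusual their features look. The sketch below assumes scikit-learn is available and uses made-up feature values purely for illustration.

    import numpy as np
    from sklearn.ensemble import IsolationForest

    # Example features per login event: hour of day, failed attempts, bytes moved.
    events = np.array([
        [9,  0, 1_200],
        [10, 1, 3_400],
        [3,  7, 98_000],   # odd hour, many failures, large transfer
        [11, 0, 2_100],
    ])

    model = IsolationForest(contamination=0.1, random_state=0).fit(events)
    scores = model.decision_function(events)    # lower score = more anomalous
    ranked = sorted(zip(scores, range(len(events))))
    print(ranked[0])    # the most anomalous event surfaces first for analysts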

Data scientists develop code using ML notebooks such as Jupyter and Zeppelin, as well as high-level languages such as Python, Java, and Scala. If your security data lake supports these languages and notebooks, you can more easily enlist data scientists to conduct these investigations.

The detection development lifecycle, described in the preceding section, helps guide data science projects by enforcing regular procedures to develop ML models, review the code, run detections in test mode, study the true-positive and false-positive rates, and store the finished models in an alert library. The output of data science models can then be fed back into traditional business intelligence decision-making processes.

Business analysts from the cybersecurity and data teams can utilize SQL-based tools to view the results of investigations through self-service dashboards. This is an opportunity to gain more insight, and more value, from your security data.

Collaboration with data scientists is much easier when your organization standardizes on the same data platform. Skilled statisticians and ML engineers can get involved in cybersecurity investigations, just as they collaborate with other corporate domains such as marketing, sales, and finance.

Of course, data scientists need powerful compute resources to process and prepare the data before they can feed it into ML libraries and tools. The more data points they can collect, the more accurate their analyses will be. A cloud data platform allows them to easily access, collect, and organize data from a variety of sources and formats. The best cloud data platforms can scale compute and storage capacity separately and near infinitely, and they offer usage-based pricing, so you only pay for compute by the second. This cost model allows data scientists to ingest and process massive amounts of data at a reasonable cost.

Summary

To break down the data silos and enable analytics on a scale that can accommodate today’s nonstop network activity, invest in a cloud data platform that can handle a broad set of use cases, including a security data lake, and work with a very high volume of data. Security teams can use this platform as a foundation to progress on the threat detection maturity framework and follow detection-as-code principles, including the following:

  • Agile development of detections throughout the continuous loop of testing, debugging, deployment, and production

  • Continuous integration/continuous delivery (CI/CD) of data pipelines and models for fast and reliable detection and response

  • Automated testing and quality assurance (QA) for rules, especially important as upstream data sources change over time

  • Versioning and change management for detection code

  • Promotion, reuse, and automation of data models, detections, and other artifacts

By moving your data sets to a security data lake, you can reduce traditional SIEM license fees and operational overhead. You can use one system to analyze data from a huge variety of sources. You can store many types of data—including logs, user credentials, asset details, findings, and metrics—in one central place and use the same sets of data for multiple security initiatives. Collected data can be stored in the security data lake for however long you want, eliminating complex storage tiers and rehydration overhead. Anytime you want to search that data, you can do so easily via your connected security applications of choice.

A modern cybersecurity strategy begins with a security data lake and its rich ecosystem of security solutions and data providers equipped to handle the vastly expanding threat landscape.
