Making the most of your IOC data

Five questions for Alex Pinto: Data-science techniques for incorporating indicators of compromise into your threat intelligence strategy.

By Courtney Allen and Alexandre Pinto
November 7, 2016
Gears (source: MustangJoe)

I recently sat down with Alex Pinto, Chief Data Scientist at Niddel and lead for the MLSec Project, to discuss the limitations and potential benefits of indicators of compromise (IOCs), plus great open source tools for analyzing datasets. Here are some highlights from our talk.

1. What are some examples of indicators of compromise, and how do they relate to the security posture of an organization?

IOCs are presented as a part of technical or tactical threat intelligence data, which is the bulk of threat intelligence offerings today. They are usually presented as known bad entities (e.g., IP addresses, domains, URLs, file hashes) that could be signs of attacker or breach activity in an organization’s network.

Their most common purpose is to let an organization search its logs for the presence of those entities. Detected entities could indicate a malware infection or an active known threat on the network. That is the promise of IOCs, at least.

2. What are the limitations of IOC matching as a standalone tool?

There are a couple of things that make consuming IOC data as advertised very hard:

  • Volume and velocity of the data. Some IOC offerings present themselves as pure indicator datasets or lookup APIs, which makes integrating them into the organization’s environment difficult. Security information and event management (SIEM) systems claim to be able to ingest those indicators, compare them with stored log data, and alert on a match, and firewalls claim to ingest lists in order to block or alert on access to those entities. However, the volume of data SIEMs and firewalls can actually handle is smaller than what is usually available on the IOC feeds, and the integration needed to make this data quickly available for incident detection is often unattainable for understaffed security organizations. In fact, a whole new security product category, called threat intelligence platforms, was created to deal with that, and entire funded vendors exist to help organizations manage and apply IOCs in their environment.
  • Quality and aging of the data. IOC data is usually fickle, given that the actual malware binaries and the attackers’ infrastructure are prone to constant change. A communication with a specific IP address or domain can become a false indicator that a machine is compromised because the hosting provider may have taken down the location that IOC refers to after an abuse complaint, or the attackers may have moved the location to parking / quarantine. That makes it difficult to match those pieces of information to log or traffic data and confirm a valid compromise. The results become unreliable for alerting and subsequent actions, and teams get “false positive fatigue” from using faulty threat intelligence data in their environment.
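As a rough illustration of the aging problem, the sketch below discards indicators that have not been seen recently before matching the rest against log entries. The record layout (`value`, `last_seen`, `dst`), the 30-day cutoff, and the sample values are assumptions made for this example, not the schema of any particular feed or product.

```python
from datetime import datetime, timedelta

# Hypothetical indicator records; real feeds use their own schemas.
indicators = [
    {"value": "198.51.100.7", "last_seen": datetime(2016, 11, 1)},
    {"value": "203.0.113.9", "last_seen": datetime(2016, 6, 15)},
    {"value": "bad.example.com", "last_seen": datetime(2016, 10, 20)},
]

# Hypothetical log entries, reduced to source and destination.
log_entries = [
    {"src": "10.0.0.5", "dst": "198.51.100.7"},
    {"src": "10.0.0.8", "dst": "203.0.113.9"},
    {"src": "10.0.0.9", "dst": "192.0.2.44"},
]


def fresh_indicators(records, now, max_age_days=30):
    """Return the set of indicator values seen within max_age_days of now."""
    cutoff = now - timedelta(days=max_age_days)
    return {r["value"] for r in records if r["last_seen"] >= cutoff}


def match_logs(entries, indicator_values):
    """Return the log entries whose destination is in the indicator set."""
    return [e for e in entries if e["dst"] in indicator_values]


now = datetime(2016, 11, 7)
active = fresh_indicators(indicators, now)
hits = match_logs(log_entries, active)
print(hits)
```

With this sample data, the indicator last seen in June is dropped, so the connection to it no longer raises a match. The right cutoff is a judgment call that depends on the feed and the kind of indicator.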

Despite these difficulties, there is a lot of good, actionable data that can be extracted from those feeds, and companies that have well-staffed monitoring and threat intelligence teams can make good use of this raw data.

3. What are some steps organizations can take towards using IOC data more effectively?

Making good use of indicators can be boiled down to three steps:

  • Define your objectives in using IOCs in your defense strategy. This will influence your intelligence collection criteria.
  • Select what IOC feeds are important and relevant to your environment based on these collection requirements.
  • Be prepared to operationalize the IOC data in your environment beyond matching the indicators against the logs, and learn the bigger patterns of how the threats are set up.

In my Security Conference talk I cover two data-driven techniques that reduce the burden of accomplishing this:

  • Enrich IOC data with “pivoting points” (e.g., the internet provider or datacenter hosting the IP address, the relationships between IP addresses and domain names, and information on who registered ownership of a domain name), and then use those connections to build relationships between the threat intelligence data and the company’s telemetry (data that is found daily in the company’s logs).
  • Build graphs of those relationships to understand how both normal traffic and IOCs aggregate themselves around those pivoting points. That creates a data-driven knowledge base similar to the intuition and experience that threat intelligence analyst teams use in their day-to-day job and helps analysts draw conclusions from the raw threat intelligence data about how threat actors organize themselves.
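A minimal sketch of these two techniques, assuming the hosting provider is the only pivoting point: each IP address is enriched with its provider, then IOC and telemetry addresses are grouped under that provider to show where they cluster together. The `ip_to_provider` table, the sample addresses, and the function names are hypothetical; a real pipeline would draw enrichment from lookup services and would likely use a graph library.

```python
from collections import defaultdict

# Hypothetical enrichment table: IP address -> hosting provider (pivoting point).
ip_to_provider = {
    "198.51.100.7": "AS64500",
    "198.51.100.8": "AS64500",
    "203.0.113.9": "AS64501",
    "192.0.2.44": "AS64502",
}

# Hypothetical inputs: addresses from IOC feeds and from the company's logs.
ioc_ips = {"198.51.100.7", "203.0.113.9"}
telemetry_ips = {"198.51.100.7", "198.51.100.8", "192.0.2.44"}


def build_pivot_graph(ips_by_source, enrichment):
    """Group IPs under their pivoting point, labelled by where they were seen."""
    graph = defaultdict(lambda: defaultdict(set))
    for source, ips in ips_by_source.items():
        for ip in ips:
            pivot = enrichment.get(ip)
            if pivot is not None:  # skip addresses we could not enrich
                graph[pivot][source].add(ip)
    return graph


def shared_pivots(graph):
    """Return pivots that connect IOC data to the company's telemetry."""
    return sorted(
        pivot for pivot, sources in graph.items()
        if sources.get("ioc") and sources.get("telemetry")
    )


graph = build_pivot_graph({"ioc": ioc_ips, "telemetry": telemetry_ips}, ip_to_provider)
for pivot in shared_pivots(graph):
    # Telemetry addresses that share a pivot with IOCs but are not IOCs themselves.
    neighbours = sorted(graph[pivot]["telemetry"] - graph[pivot]["ioc"])
    print(pivot, neighbours)
```

Sharing a provider with a known-bad address is not evidence of compromise on its own. The grouping only shows an analyst where IOCs and normal traffic concentrate around the same pivoting point.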

It is important to measure whether the organization is observing the collected IOCs in its network traffic. Even more important is the ability to know when those observations are false positives or “bad matches” from the IOC data available. It takes a robust analyst staff to achieve this.
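One possible way to track both measurements is sketched below, assuming analysts record a verdict for every match they review. For each feed it computes the share of indicators actually observed and the share of reviewed matches judged to be false positives. The feed names, feed sizes, verdict labels, and metric names are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical analyst-reviewed matches: (feed, indicator, verdict).
reviewed_matches = [
    ("feed_a", "198.51.100.7", "true_positive"),
    ("feed_a", "198.51.100.7", "true_positive"),
    ("feed_a", "203.0.113.9", "false_positive"),
    ("feed_b", "bad.example.com", "false_positive"),
]

# Hypothetical number of indicators collected from each feed.
feed_sizes = {"feed_a": 4, "feed_b": 2}


def feed_metrics(matches, sizes):
    """Compute the observed ratio and false positive rate for each feed."""
    seen = defaultdict(set)      # distinct indicators observed per feed
    totals = Counter()           # reviewed matches per feed
    false_positives = Counter()  # matches judged false positive per feed
    for feed, indicator, verdict in matches:
        seen[feed].add(indicator)
        totals[feed] += 1
        if verdict == "false_positive":
            false_positives[feed] += 1

    metrics = {}
    for feed, size in sizes.items():
        metrics[feed] = {
            "observed_ratio": len(seen[feed]) / size,
            # None when no match from this feed has been reviewed yet.
            "false_positive_rate": (
                false_positives[feed] / totals[feed] if totals[feed] else None
            ),
        }
    return metrics


metrics = feed_metrics(reviewed_matches, feed_sizes)
for feed, values in metrics.items():
    print(feed, values)
```

A feed that is rarely observed, or whose matches are mostly false positives, is a candidate for review against the collection requirements described above.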

4. What open source tools do you recommend for working with small datasets, log data, and IOCs?

Two tools Kyle Maxwell and I wrote as a part of MLSec Project a few years ago are Combine and TIQ-Test. The premise of those tools is to help individuals and organizations gather common open feed sources and perform data-driven analyses on them. They were developed with Python and R, which I can’t recommend enough for data science-like work on any kind of dataset.

There are several others that help with IOC consumption and management, such as:

5. You’re speaking about applying data science techniques to IOC-based detection at the Security Conference in Amsterdam this November. What presentations are you looking forward to attending while there?

I am really looking forward to watching these presentations on subjects related to my talk:

  • “Building and designing MISP” by Alexandre Dulaunoy and Andras Iklody. MISP is the most robust and advanced open-source threat intelligence platform, and is worth adding to your toolkit.
  • “The bad things happen when you’re not looking” by Ryan Huber and Nate Brown. The Slack team has done some amazing things in improving the visibility on their network and vast production environment. This talk is a great pick for anyone facing scaling challenges in monitoring.
  • “Security analytics using big data and Apache Hadoop” by Eddie Garcia. Cloudera has been pushing for some standardization on using open platforms like Hadoop for security data, so it should be interesting to learn how they see those practices evolving.
