4 trends in security data science

From intelligent investigation to cloud “security-as-a-service,” what you need to know for 2016.

By Ram Shankar Siva Kumar
March 9, 2016

In 2015, we saw graphs dominate security data science, permeating everything from visualizations to graphical inference. It is easy enough to write about security trends for 2016; the hard part is interpreting what those trends mean for organizations on a day-to-day basis.

This article is not the wishlist of a deluded security data scientist. Rather, these are strategic trends that you can expect to see in the field, mixed with tactical steps to capitalize on them. These four trends can even serve as guideposts as you navigate the labyrinth of security data science in 2016.

Shift from detection to intelligent investigation

“Protect-detect-investigate” is the virtuous cycle in security, with each stage maturing at different times. In the 1990s, there was a big push toward “protecting” systems through firewalls and access control lists. Currently, there is a massive push for “detection” that can most likely be attributed to advances in data analysis.

Current intrusion detection systems still have many drawbacks, however, despite security analytics vendors' promises that machine learning can detect everything from insider attacks to Web vulnerabilities. While detection systems are poised to get better, in 2016 we will finally see organizations get better at investigations.

We will see the onset of tools that inject intelligence into investigation. There are two key reasons for this development:

  1. Data explosion in the investigation phase. Investigation is the step wherein we find out what the adversary has done, and how to roll back from attacks. It involves analyzing dizzying amounts of data, and different data than that gathered in the detection step. For instance, incident response analysts may need to parse through an entire master file table, or collect hashes of running processes from all hosts. Just like detection, investigation begins with data collection (a minimal sketch of one such task follows this list).
  2. Automated investigation tools. Tooling that automates parts of the investigation workflow is beginning to emerge, as documented in a whitepaper from SANS.
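
To make that concrete, here is a minimal sketch of one such collection task: hashing the binaries behind every running process on a host. It uses Python's psutil package; the output format is illustrative only, and fleet-scale tools like Kansa do this far more robustly.

```python
# Sketch of investigation-phase data collection: SHA-256 hashes of the
# binaries behind running processes. Assumes the third-party psutil
# package is installed; output format is illustrative, not Kansa's.
import hashlib
import psutil

def hash_running_processes():
    results = {}
    for proc in psutil.process_iter(attrs=["pid", "name", "exe"]):
        exe = proc.info.get("exe")
        if not exe:
            continue  # kernel threads, or processes we cannot inspect
        try:
            with open(exe, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            results[proc.info["pid"]] = (proc.info["name"], digest)
        except OSError:
            continue  # skip binaries we lack permission to read
    return results

if __name__ == "__main__":
    for pid, (name, digest) in sorted(hash_running_processes().items()):
        print(f"{pid:>6}  {digest}  {name}")
```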

Intelligent investigation: The next logical steps

With all of the data that is being collected, and the steps being taken toward automation, the next logical step is intelligent investigation. But what does this mean for you? As a first step, organize your security investments into one of the three stages of security: protect, detect, investigate.

Next, reassess your portfolio. If you are investing the majority of your effort and resources into detecting malicious insiders, but have nothing in place to respond to a security incident or perform remediation, you are neglecting an important aspect of your security. There are out-of-the-box options like Mandiant/FireEye, and open source projects like Kansa and Netflix's FIDO.

Kansa has been getting a lot of traction in the security community; it is a set of modular PowerShell scripts that collect and analyze forensic data from hosts. FIDO is a more end-to-end system that provides “hooks” to integrate with third-party detection systems like Bit9 and SourceFire.
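
Once you have Kansa-style output from a fleet, a common analysis is “stacking”: counting how many hosts each artifact appears on, since rare artifacts are often the most interesting. Below is a minimal sketch assuming a simple CSV layout with Host and Hash columns (not Kansa's actual schema):

```python
# Sketch of "stacking" forensic data collected from many hosts: items
# (e.g., process hashes or autorun entries) seen on only a few hosts
# float to the top for review. CSV layout is assumed for illustration.
import pandas as pd

def stack(csv_path, item_col="Hash", host_col="Host"):
    df = pd.read_csv(csv_path)
    counts = (df.groupby(item_col)[host_col]
                .nunique()            # number of distinct hosts per item
                .sort_values()        # rarest items first
                .rename("host_count"))
    return counts

rare = stack("fleet_process_hashes.csv").head(20)
print(rare)
```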

Cloud “security-as-a-service” is here

In 2016, cloud security-as-a-service will no longer be a differentiator; it will become a necessity. For those already on cloud systems, rejoice: if the trends are any indication, cloud providers will soon offer security monitoring as a service, relieving customers of the burden of implementing their own cloud security monitoring solutions.

There are three factors driving security-as-a-service:

  1. The biggest security stumbling block customers face when moving from on-premises systems to the cloud is the purported lack of security visibility. For instance, how do you view the logs from your VM, and is there a console showing the security health of all your VMs? Cloud providers have recognized this gap and see security-as-a-service as a way to entice hesitant customers into reconsidering the cloud as a viable option.
  2. The cost of developing customer-specific detection is low. In most cases, the detections that end customers see are the same ones the cloud provider needs in order to secure its own infrastructure. So, in essence, you get the best security minds of the cloud providers working for you.
  3. Cloud providers also have the tools necessary to make security-as-a-service a reality. They provide storage solutions to export VM logs, compute solutions to crunch the data, and machine-learning-as-a-service offerings to apply on top of the machine-generated data (a sketch of this last ingredient follows this list).
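
As a rough illustration of that third ingredient, here is a minimal sketch that applies an off-the-shelf anomaly detector to per-VM log features. The file name and feature columns are assumptions for illustration; a real pipeline would pull exported VM logs from the provider's storage service.

```python
# Sketch: off-the-shelf anomaly detection over machine-generated log
# features. One row per (VM, hour); columns are assumed for illustration.
import pandas as pd
from sklearn.ensemble import IsolationForest

features = pd.read_csv(
    "vm_log_features.csv",
    usecols=["failed_logins", "bytes_out", "distinct_ports"])

model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(features)   # -1 marks anomalous rows

print(f"{(labels == -1).sum()} anomalous VM-hours flagged for review")
```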

Security-as-a-service: Next steps

All three major cloud providers (Amazon, Microsoft, and Google) offer security-as-a-service. Chances are your data is stored with one of them, so check out what they offer. Capabilities range from detecting malware on your VMs to more sophisticated detections, such as network-based attacks. The good news is that cloud security-as-a-service will not break the bank: given the enormous competition among cloud providers, rates are likely to drop. The trade-off is lock-in; depending on your cloud provider, you are pretty much tied to their services.

Adversarial machine learning: Attackers will go after your data science solutions

While the industry (and the government) races to nail down algorithms for security detections, there is a pervasive lack of attention to the fragility of machine learning systems. Machine learning algorithms, as commonly implemented, are extremely vulnerable to manipulation of the input data, the features, and the resulting model parameters and hyperparameters.

Essentially, an attacker can exploit any of these vulnerabilities to subvert machine learning algorithms. As a result, attacks can go undetected either by increasing the false negative rate (attacks are misclassified as normal activity) or by increasing the false positive rate (attacks get drowned in a sea of “noise,” which causes the system to shift its baseline of normal activity).
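
A toy example makes the fragility concrete. In the sketch below, an attacker with perfect knowledge of a trained linear classifier's weights nudges a malicious sample along the weight vector until it is scored as benign; the data is synthetic and the attack deliberately simplistic.

```python
# Toy evasion attack against a linear classifier: with perfect knowledge
# of the trained weights, move a malicious sample against the decision
# gradient until it is classified as benign. All data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
benign = rng.normal(0.0, 1.0, (200, 2))
malicious = rng.normal(3.0, 1.0, (200, 2))
X = np.vstack([benign, malicious])
y = np.array([0] * 200 + [1] * 200)

clf = LogisticRegression().fit(X, y)
w = clf.coef_[0]                          # the "perfect knowledge"

x = malicious[0].copy()
step = -0.1 * w / np.linalg.norm(w)       # small step against the gradient
while clf.predict([x])[0] == 1:           # until scored as benign
    x += step

print("original sample:", malicious[0], "-> evasive sample:", x)
```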

Next steps to address adversarial machine learning

The first, and most important, step to address this is to start the practice of threat modeling your data science solutions. Here is a commonly used framework (a sketch of how you might record it appears after the list):

  • Define the adversary’s goal. For example, is the attacker attempting to evade detection? Can the attacker poison the data?
  • Assess the adversary’s knowledge. Does the attacker have limited knowledge of the data science solution, or perfect knowledge of any of the following:
    • the training set, or part of it
    • the feature representation of each sample (i.e., how real objects such as emails and network packets are mapped into the classifier’s feature space)
    • the type of learning algorithm and the form of its decision function
    • the (trained) classifier model (e.g., the weights of a linear classifier)
  • Determine the adversary’s capability. Can the attacker poison the data set, tamper with the machine learning model, or subvert the alerting system?
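
One lightweight way to make this framework actionable is to record the answers as a structured artifact that lives alongside the model's code and gets reviewed like any other design document. Here is a minimal sketch; the fields mirror the three questions above, and the values shown are hypothetical.

```python
# Sketch: a threat model for a data science solution, recorded as code.
# Field names mirror the goal/knowledge/capability framework; the values
# below are hypothetical examples, not a complete model.
from dataclasses import dataclass, field

@dataclass
class AdversaryModel:
    goal: str                                        # what the attacker wants
    knowledge: list = field(default_factory=list)    # what the attacker can see
    capabilities: list = field(default_factory=list) # what the attacker can do

spam_filter_threats = AdversaryModel(
    goal="evade detection",
    knowledge=["feature representation", "type of learning algorithm"],
    capabilities=["craft arbitrary test-time samples"],
)
print(spam_filter_threats)
```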

The second step is to chart the steps necessary to protect your data pipeline end to end; this includes securing the data uploader as well as the repository that ultimately hosts the input data and the machine learning models.
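
As one concrete end-to-end protection, you can publish a digest of every model and training-set artifact and verify it before loading. A minimal sketch follows, with illustrative file names:

```python
# Sketch: record a SHA-256 digest when a model (or training set) is
# published, and verify it before the artifact is loaded. File names
# here are illustrative.
import hashlib
import json

def digest(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def publish(path, manifest="manifest.json"):
    with open(manifest, "w") as f:
        json.dump({path: digest(path)}, f)

def verify(path, manifest="manifest.json"):
    with open(manifest) as f:
        expected = json.load(f)[path]
    if digest(path) != expected:
        raise RuntimeError(f"{path} does not match its published digest")

publish("detector_model.pkl")
verify("detector_model.pkl")   # raises if the artifact was tampered with
```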

Finally, check out AdversarialLib, an open source tool for testing your machine learning algorithms against common attacks.

“Productized” security solutions will become mainstream

By “productized” security data science, I am referring to software solutions that conform to standard software development practices, resulting in an engineered solution. You might be surprised by how many data science solutions are “spaghetti code” systems churned out by researchers, as opposed to software developers. It is one thing to prototype and develop solutions that run on someone’s box under the desk, but it is a completely different ballgame to engineer the solution, then deploy and maintain it. Google wrote a fantastic paper on this topic, aptly titled Machine Learning: The High-Interest Credit Card of Technical Debt, which is most definitely worth a read.

In security, having an engineered solution is critical. In a previous post, I explored using ranking algorithms to bolster intrusion detection systems. As Alex Pinto pointed out on Twitter, employing ranking to surface tailored feedback can be detrimental if not managed correctly. For instance, if we have only one security analyst, the system will show only those results that the analyst likes; it will stop being an intrusion detection system and devolve into one that constantly tries to placate that analyst. Proper testing on large populations can avoid this specific problem, but issues like these highlight the need to think of machine learning solutions in the bigger context.
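
A toy simulation shows how quickly this degenerates. With a single analyst who only cares about one alert category, a naive feedback-weighted ranker stops surfacing everything else within a few weeks; all numbers below are made up.

```python
# Toy simulation of the single-analyst feedback loop: alerts are ranked
# by a per-category weight that is boosted on positive feedback. The
# lone analyst only cares about "phishing", so the ranker collapses.
import random

random.seed(0)
categories = ["phishing", "malware", "lateral-movement", "exfiltration"]
weights = {c: 1.0 for c in categories}

for day in range(30):
    # rank today's alerts by learned category weight
    alerts = sorted(random.choices(categories, k=20),
                    key=lambda c: weights[c], reverse=True)
    for alert in alerts[:5]:          # the analyst reviews only the top 5
        if alert == "phishing":       # the analyst's sole interest
            weights[alert] *= 1.2     # positive feedback boosts the weight
        else:
            weights[alert] *= 0.9     # everything else gets demoted

print({c: round(w, 2) for c, w in weights.items()})
# phishing dominates; other attack classes have effectively vanished
```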

Next steps for productized security solutions

Ideally, data scientists should be paired with engineers when designing these software solutions. When the data science team mixes applied machine learning engineers with software developers, alongside a security-focused team, it can bridge the gap between data science and security. While security vendors may boast about the number of Ph.D.s working on their data science problems, what you really want are strong applied machine learning engineers who can engineer the solution as well as apply the machine learning modules. Granted, this may be a small population, but it is worth keeping an eye out for them.

Machine-learning-as-a-service platforms, such as AzureML or Watson Analytics, should be able to help you out to some degree. If you are building your own models, you might also look into Google’s TensorFlow Serving system, which helps to deploy and manage ML systems.
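
To illustrate the serving pattern these systems embody (load a versioned artifact once, then expose predictions behind a stable interface), here is a generic sketch using Flask and joblib. This is not TensorFlow Serving's API; the model path and endpoint are assumptions for illustration.

```python
# Generic sketch of "productizing" a model: load a trained, versioned
# artifact once at startup and serve predictions over HTTP. The model
# file and endpoint shape are illustrative assumptions.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("detector_model_v3.pkl")   # pin an explicit version

@app.route("/score", methods=["POST"])
def score():
    features = request.get_json()["features"]  # e.g., [[0.1, 3, 42.0]]
    return jsonify(scores=model.predict_proba(features)[:, 1].tolist())

if __name__ == "__main__":
    app.run(port=8501)
```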

Whatever you choose, just remember: Math + Code + Software Engineering = Bliss.
