4 trends in security data science for 2017

How bots, threat intelligence, adversarial machine learning, and deep learning are impacting the security landscape.

By Ram Shankar Siva Kumar and Cody Rioux

December 27, 2016

Four square. (source: Pixabay)

Security data science is booming: reports indicate that the security analytics market is set to reach $8 billion by 2023, with a growth rate of 26%, thanks to relentless cyber attacks. If you want to stay ahead of emerging security threats in 2017, it is important to invest in the right areas. In March 2016, I wrote a piece on the 4 trends to be aware of for 2016; for my 2017 trends post, Cody Rioux from Netflix joins me, bringing his platform perspective. Our goal is to help you formulate a plan for every quarter of 2017 (i.e., 4 trends for 4 quarters). For each trend, we provide a short rationale, explain why we think the time is right to invest, and show how to capitalize on the investment, with pointers to specific tools and resources.

1. Bots for automated security response and assistance

We believe the security industry is going to see an uptick in automated and autonomous responses in the form of chatbots that provide information when a model deems it relevant, as well as on-demand responses. The responses will likely be integrated into the platform you’re currently using to communicate with teammates during incident response. This isn’t a new idea (chatbots have existed at least as long as Internet Relay Chat, or IRC), but they’ve seen a big uptick in popularity thanks to “ChatOps.” Shivon Zilis and James Cham refer to this as “the great chatbot explosion of 2016,” and their infographic lists 15 companies developing autonomous agents as of this writing.

Why now?

Chris Messina (@chrismessina), inventor of the hashtag, recently wrote that “Chat Bots Aren’t a Fad. They’re a Revolution.” Tech organizations are generally at a point where autonomous systems are trusted within the production environment, which opens the door to automating all types of menial tasks, including those in the security domain. Bot frameworks are ripe for deployment on a wide array of communication platforms, including Slack, IRC, and Skype. You’re likely already using such a platform for communication both during security incidents and in your day-to-day work, which makes a bot the ideal companion both for executing tasks quickly mid-incident and for performing and reporting on routine checks, such as rolling certificates and ensuring security standards compliance. Jason Chan (@chanjbs) also recently spoke about how Netflix uses bots in the context of security, from security consultation, to approving deployment changes, to noticing security keywords.
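To make the idea of a routine-check bot concrete, here is a minimal, platform-agnostic sketch of the command routing such a bot might do. Everything in it is an assumption for illustration: the `!cert-status` command, the `handle_message` entry point, and the hard-coded certificate data are hypothetical, and a real bot would receive messages through whichever framework you choose (Slack, IRC, and so on) rather than a plain function call.

```python
from typing import Callable, Dict

# Hypothetical registry mapping chat commands to handler functions.
HANDLERS: Dict[str, Callable[[str], str]] = {}


def command(name: str):
    """Register a function as the handler for a chat command."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        HANDLERS[name] = func
        return func
    return register


@command("cert-status")
def cert_status(args: str) -> str:
    # Stub data: a real bot would query your certificate inventory here.
    days_left = {"api.example.com": 12, "www.example.com": 200}
    host = args.strip()
    if host not in days_left:
        return f"No certificate record for {host}"
    remaining = days_left[host]
    flag = "ROTATE SOON" if remaining < 30 else "ok"
    return f"{host}: {remaining} days until expiry ({flag})"


def handle_message(text: str) -> str:
    """Route a chat message such as '!cert-status host' to its handler."""
    if not text.startswith("!"):
        return ""  # not addressed to the bot
    name, _, args = text[1:].partition(" ")
    handler = HANDLERS.get(name)
    if handler is None:
        return f"Unknown command: {name}"
    return handler(args)
```

Keeping the routing separate from the chat platform means the same handlers can be reused if your team later moves from one platform to another.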

Next Steps

Talk to your operations team / network operating center to see if they already have a solution you can repurpose.
Check out Microsoft Bot Framework, Slack Bots, or one of many IRC Bot Frameworks.
Investigate the automations in Security Monkey and replicate them.

2. Incorporate threat intelligence into machine learning detections

Threat intelligence (TI) feeds can be thought of as discrete instances of known bad actors, or rather, collections of indicators of compromise. They range from hashes of known malicious files used by adversaries, to IP addresses of botnet command and control servers, to user agent strings used by persistent threats. Threat intelligence feeds have long been used by the security community as point-in-time checks for security monitoring, but we argue that in 2017 the data science community should use them alongside behavioral detection systems.

Why now?

The Bayes error rate is the lowest error rate any classifier can achieve on a given data set with a given set of features. The standard way to lower that floor is to include new sources of information. We posit that TI feeds are an easy gateway, and a first step toward including new data sources.
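A small worked example may help show why a new information source can lower the floor. The sketch below computes the Bayes error rate for a toy problem, first with a behavioral flag alone and then with a TI hit added as a second feature. The counts are invented purely for illustration and are not drawn from any real data set.

```python
from collections import defaultdict

# Hypothetical counts per 100 logins, keyed by
# (behavior_flag, ti_hit, is_malicious). Invented for illustration only.
COUNTS = {
    (0, 0, 0): 50, (0, 0, 1): 2,
    (0, 1, 0): 1,  (0, 1, 1): 7,
    (1, 0, 0): 15, (1, 0, 1): 5,
    (1, 1, 0): 2,  (1, 1, 1): 18,
}


def bayes_error(counts, feature_idx):
    """Lowest achievable error rate using only the features at feature_idx."""
    by_cell = defaultdict(lambda: [0, 0])  # feature values -> [benign, malicious]
    for (*features, label), n in counts.items():
        cell = tuple(features[i] for i in feature_idx)
        by_cell[cell][label] += n
    total = sum(counts.values())
    # The best possible classifier predicts the majority label in each cell,
    # so its unavoidable errors are the minority counts.
    return sum(min(pair) for pair in by_cell.values()) / total


behavior_only = bayes_error(COUNTS, [0])     # behavioral flag alone
with_ti = bayes_error(COUNTS, [0, 1])        # behavioral flag plus TI hit
```

With these made-up counts the floor drops from 0.26 to 0.10 once the TI feature is available. How much a real feed helps depends entirely on how informative it is for your traffic.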

Additionally, there’s surrogate interpretability: TI feeds also help explain your alerts. For instance, if your ML system determines that a login is anomalous and the IP address of the login is present in a botnet feed, then we can surmise that the login is anomalous because it stems from a machine infected by a botnet. Although hacky, and no guarantee, this can provide a quick win for explaining alerts.
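As a rough sketch of this kind of surrogate explanation, the function below annotates an ML alert with the names of any TI feeds that list its source IP. The feed names, the alert fields (`src_ip`, `score`), and the IP addresses (taken from documentation ranges) are all hypothetical.

```python
# Hypothetical TI feeds: feed name -> set of flagged IP addresses.
TI_FEEDS = {
    "botnet_c2": {"203.0.113.7", "203.0.113.99"},
    "open_proxy": {"198.51.100.9", "203.0.113.7"},
}


def explain_alert(alert, ti_feeds):
    """Return a copy of the alert annotated with matching TI feed names."""
    matches = sorted(
        name for name, ips in ti_feeds.items() if alert["src_ip"] in ips
    )
    if matches:
        reason = "source IP listed in: " + ", ".join(matches)
    else:
        reason = "no TI match; needs manual triage"
    return {**alert, "ti_matches": matches, "explanation": reason}


# An alert as a hypothetical ML system might emit it.
anomalous_login = {"user": "alice", "src_ip": "203.0.113.7", "score": 0.97}
annotated = explain_alert(anomalous_login, TI_FEEDS)
```

The explanation is only a plausible story, not proof of cause, which is why an unmatched alert is routed to manual triage rather than dismissed.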

Next Steps

The easiest way to incorporate threat intelligence is to simply join the results of the ML system with your TI feeds. A straightforward way of achieving this is to use TI as a filter after the ML system.
Another option is to include the feeds as binary features within the training set. This gives the added advantage of managing only one code base. The downside is that every time you add a new TI feed, you have to change code, then retrain and redeploy your ML system, which can be cumbersome.
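The two options above can be sketched side by side. This is one possible reading, under stated assumptions: "filter" is taken to mean keeping only the ML alerts that a TI feed corroborates, and the row layout (`src_ip` plus a `features` list) and function names are hypothetical.

```python
def filter_alerts(alerts, ti_ips):
    """Option 1: keep only ML alerts whose source IP also appears in TI."""
    return [alert for alert in alerts if alert["src_ip"] in ti_ips]


def add_ti_features(rows, ti_feeds):
    """Option 2: append one 0/1 column per TI feed to each training row.

    Returns the feed names (in column order) and the widened feature rows.
    """
    feed_names = sorted(ti_feeds)  # fixed order keeps columns stable
    widened = []
    for row in rows:
        flags = [int(row["src_ip"] in ti_feeds[name]) for name in feed_names]
        widened.append(row["features"] + flags)
    return feed_names, widened
```

Note how the second option bakes the list of feeds into the shape of the training data: adding a feed adds a column, which is exactly why it forces a retrain and redeploy, while the first option leaves the model untouched.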

Before you start experimenting with TI feeds, keep in mind that the feeds have varying levels of confidence in their indicators and, thus, require some trial and error. Commercial TI vendors include Team Cymru, iSight, iDefense, and Webroot. Open source TI feeds such as Project Honeypot and Malware Domain List, along with trackers such as Feodo Tracker, Zeus Tracker, and OpenPhish, are inexpensive prototyping options.

3. Continuing to invest in adversarial machine learning

Adversarial machine learning is when an adversary subverts machine learning (ML) systems to their advantage. The adversary can drive the false positive rate of the system so high that it frustrates and overwhelms the security analyst; or increase the false negative rate of the system, and hence fly under the radar and go completely unnoticed; or even take complete control of the system. Adversarial machine learning is real and happening; Ian Goodfellow, who wrote a number of papers on this subject with Nicolas Papernot, wrote a fantastic blog post dispelling many of the myths. The crux of it is that adversarial machine learning is very much possible.
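To illustrate the false negative case, here is a minimal sketch of an evasion attack on a toy logistic regression detector. The weights, bias, and sample are made up, and the attack is a simplified gradient-sign perturbation: for a linear model, the gradient of the score with respect to the input has the same sign as the weight vector, so nudging each feature against the sign of its weight lowers the malicious score. Real attacks and real models are considerably more involved.

```python
import math

# Hypothetical weights of an already-trained logistic regression detector.
WEIGHTS = [1.5, 2.0, -1.0]
BIAS = -1.0


def malicious_probability(x):
    """Detector score: probability that feature vector x is malicious."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def _sign(value):
    return (value > 0) - (value < 0)


def evade(x, epsilon):
    """Shift every feature by epsilon against the sign of its weight."""
    return [xi - epsilon * _sign(w) for xi, w in zip(x, WEIGHTS)]


sample = [1.0, 1.0, 0.5]          # flagged as malicious by the detector
perturbed = evade(sample, 0.5)    # attacker's modified version
```

With these numbers the original sample scores above 0.5 and the perturbed one scores below it, so a detector thresholding at 0.5 would miss it. The sketch assumes the attacker knows the weights; the model-stealing work mentioned in this section suggests that assumption is less far-fetched than it sounds.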

Why now?

This trend was listed in my 2016 post, but given the uptick in interest and the possible damage, we agreed there is merit in reminding our readers to really begin protecting their ML detection systems in 2017, too. While security experts have seen this trend in the past in the realm of spam filtering, 2016 offered many other examples, hitting big-name companies. First, Microsoft’s Tay, the Twitter bot, had to be taken down because it began spewing racist language. Second, researchers from Cornell showed how they were able to steal machine learning models from Amazon and BigML. Finally, adversarial ML even made an appearance in the 2016 election when Google briefly showed a picture of a presidential candidate for the words “pathological liar.”

Next Steps

Begin threat modeling your public-facing ML systems. There is some solid guidance in the new paper by Nicolas Papernot (@NicolasPapernot) et al., “Towards the Science of Security and Privacy in Machine Learning.”
Check out cleverhans, a new library that simulates various kinds of attacks on machine learning solutions.
Scrutinize user input before allowing it to become training data for a model, particularly in online learners.
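As one way to act on the last item, the sketch below gates samples before they reach an online learner: it tracks a running mean and variance per feature (Welford's method) and quarantines any sample that sits too many standard deviations from what has been seen so far. The class name, the z-score limit, and the warm-up length are assumptions chosen for illustration, not recommended values.

```python
import math


class TrainingGate:
    """Admit a sample into the training stream only if it resembles past data."""

    def __init__(self, n_features, z_limit=4.0, warmup=30):
        self.n = 0
        self.mean = [0.0] * n_features
        self.m2 = [0.0] * n_features      # running sum of squared deviations
        self.z_limit = z_limit
        self.warmup = max(2, warmup)      # need 2+ samples for a variance

    def _max_z(self, x):
        worst = 0.0
        for xi, mu, m2 in zip(x, self.mean, self.m2):
            std = math.sqrt(m2 / (self.n - 1))
            if std > 0:
                z = abs(xi - mu) / std
            else:
                z = 0.0 if xi == mu else math.inf
            worst = max(worst, z)
        return worst

    def admit(self, x):
        """Return True and update statistics, or False to quarantine x."""
        if self.n >= self.warmup and self._max_z(x) > self.z_limit:
            return False  # do not let the sample influence the model
        self.n += 1
        for i, xi in enumerate(x):
            delta = xi - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (xi - self.mean[i])
        return True
```

A caller would invoke `admit` on each incoming sample and pass only admitted ones to the learner's update step. A gate like this catches crude poisoning attempts but not a patient attacker who shifts the data slowly, so treat it as one layer rather than a complete defense.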

4. Deep learning for security

Deep learning has opened the door to human-level performance on tasks ranging from driving a car to creating paintings in the style of your favorite artist, and even to superhuman performance on tasks such as the game of Go. Security tasks such as traffic identification, malware identification, and command and control server detection (to name a few) have already drawn some attention toward this trend. Neural networks also support unsupervised learning with autoencoders, as well as reinforcement learning, which offer solutions for tasks such as anomaly detection and creating autonomous systems, even without labeled data. In short, if you need human-level performance and have a lot of data, along with the compute resources to process it, then you may want to take advantage of this trend for automating tasks that were once viewed as human only.
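The autoencoder idea can be shown at toy scale. The sketch below is deliberately not deep: it is a single-unit linear autoencoder with tied weights, trained by plain gradient descent on unlabeled "normal" points, with reconstruction error used as the anomaly score. The two features are hypothetical (imagine two traffic measurements that normally rise and fall together). In practice you would build a multi-layer version in one of the toolkits named later in this section.

```python
def train_linear_autoencoder(data, epochs=500, lr=0.02):
    """Fit a 2-D -> 1-D -> 2-D linear autoencoder with tied weights."""
    w = [0.5, 0.1]  # arbitrary starting weights
    for _ in range(epochs):
        for x in data:
            code = w[0] * x[0] + w[1] * x[1]                 # encode to 1-D
            resid = [x[0] - code * w[0], x[1] - code * w[1]]  # x minus decoded x
            rw = resid[0] * w[0] + resid[1] * w[1]
            # Gradient descent step on the squared reconstruction error.
            w = [w[i] + 2 * lr * (code * resid[i] + rw * x[i]) for i in range(2)]
    return w


def anomaly_score(w, x):
    """Squared reconstruction error; large values suggest an anomaly."""
    code = w[0] * x[0] + w[1] * x[1]
    return sum((x[i] - code * w[i]) ** 2 for i in range(2))


# Unlabeled "normal" behavior: the two features always move together.
normal = [(-1.0, -1.0), (-0.5, -0.5), (0.5, 0.5), (1.0, 1.0)]
weights = train_linear_autoencoder(normal)
```

After training, a point that follows the learned pattern, such as (0.8, 0.8), reconstructs almost perfectly, while a point that breaks it, such as (1.0, -1.0), has a large reconstruction error. No labels were needed at any step, which is the appeal for anomaly detection.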

Why now?

Deep learning implementations were once relegated to the machines of data scientists, with cobbled-together Python scripts containing hundreds of lines of Theano code. This is no longer the case: production-grade deep learning toolkits are now plentiful, regardless of your software stack, and the distributed computing resources necessary to train large models are also commonplace. You likely have a Spark or Hadoop cluster available to you already. You’re probably also generating enough data to train a “data hungry” algorithm, such as a deep neural network. The convergence of data and computing resources on your existing distributed computing cluster, coupled with production-grade software packages that allow users to easily train, predict, monitor, and maintain deep learning models, means that it is easier than ever to integrate deep learning into your production threat monitoring system.

Next Steps

If you are new to deep learning, check out the new course by Jeremy Howard (@jeremyphoward). It is pragmatic, coding focused, and very hands-on: http://course.fast.ai/.
Investigate the neural network packages for your stack: Python (Keras, Lasagne, Theano, TensorFlow), Java (deeplearning4j), or .NET (Accord.NET). Alternatively, you may prefer to offload to a managed service such as Azure ML.
Once you have your chosen software package, try your hand at the Cyber Defense Exercise data set.
Investigate research on malware identification by beginning with these papers: Deep Neural Network-Based Malware Detection Using Two-Dimensional Binary Program Features and Droid-Sec: Deep learning in android malware detection.

Overall, adversarial machine learning will continue to be a prime focus, and deep neural networks will begin to have an impact on security data science, just as they have in the rest of the industry. In the meantime, analysts will make their lives easier by integrating threat intelligence feeds and automating everything they can via the security flavour of “ChatOps,” further automating tasks once performed manually and automatically disseminating information to the involved parties via chatbots.

We’d love to hear your trend predictions for security data science. Reach out to us on Twitter at @ram_ssk and @codyrioux and join the conversation!