There is a lot of focus on building security detections, and the attention is always on the algorithms. However, as security data scientists are quickly realizing, the tough task is not model building but rather model evaluation: how do you convince yourself that the detections work as intended, and how do you measure efficacy given the lack of tainted data? This article provides strategies—both from the perspective of security and machine learning—for generating attack data to validate security data science detections.
Drought of good quality test data in security analytics
If you attend any security data science presentation, you will almost certainly see “lack of labeled data” in the “challenges” section. The difficulty in building credible security analytics is not only the know-how, but also the drought of good quality test data. Textbook applications of machine learning boast a huge corpus of labeled data; for instance, for image recognition problems, the ImageNet data set offers 14 million labeled images, and for speech recognition, the SwitchBoard corpus offers close to 5,800 minutes of labeled data.
Security applications, on the other hand, don’t have standard benchmark data sets—the last relevant data set goes back to the 1999 KDD data set on network intrusions, built from an artificially created network by Lincoln Labs. The problem has less to do with how old the data set is, and more to do with how outdated the attack vectors are. For instance, some of the threats it covers, like denial of service, are now less a security problem than an availability problem. The bigger issue is that the security landscape is changing drastically every day—so much so that it is hard to identify one canonical data set.
Why constant attacks do not equal test data
By now you may be asking: why is this even a problem? After all, we are being constantly attacked, so we should have a ton of data to test on, right? Sadly, the answer is no. Despite the constant barrage, there are three reasons why this conclusion is flawed. First, the attacks that get surfaced may not involve the component you are testing for. For instance, a watering hole attack whose initial point of compromise is a web exploit won’t necessarily help if your goal is to detect data exfiltration through netflow logs. Second, attacks in the wild leave very few traces, sometimes as few as two entries in the logs of interest, which is too small a sample to measure against. Finally, unless you are a “threat intelligence” company that sells scoped tainted data as indicators of compromise, there is no incentive for a compromised company to share its raw logs of tainted data with the public. The catch is that, as a security data scientist, one builds and tests analytical models on logs, not on scoped indicators. Hence, though attacks are frequent, there is a paucity of quality test data for detections.
In the next section, I’ll outline two steps to generate attack data, so you can validate your security data science solutions:
- Bootstrap the detection system to provide a starting point of attack data
- Use machine learning techniques to grow this seed of labeled data
Bootstrapping using security strategies
The first goal is to bootstrap the detection system using security techniques, so as to produce a small subset of labeled data. The three strategies below map to different maturity levels of the detection lifecycle; I list the pros and cons of each. For instance, at the beginning stages of detection, you can run a quick test by injecting “malicious” sessions to check whether the detection works. Once the system has become reasonably mature, you can get a red team involved.
1. Inject fake malicious data into data storage
The idea here is to create fake attacker activity and combine it with the rest of your data to test whether the detection system is able to catch it. We can assume the detection works if it is able to surface the injected data among the rest of the non-anomalous data. For instance, label previous pen test data as “eviluser” and check the system by seeing whether “eviluser” pops to the top of detection alerts/reports every day.
Pro: low overhead—you don’t have to depend on a red team to test your detection.
Con: the injected data may not be representative of true attacker activity.
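A minimal sketch of this strategy (the log schema, scoring function, and all names here are hypothetical): tag replayed pen test sessions as “eviluser”, mix them into real records, and check that they surface at the top of the daily report.

```python
import random

def inject_fake_attacks(records, fake_sessions, user="eviluser"):
    """Label replayed pen-test sessions and shuffle them into real data."""
    tagged = [dict(s, user=user) for s in fake_sessions]
    mixed = records + tagged
    random.shuffle(mixed)
    return mixed

def top_alerts(records, score_fn, n=10):
    """Stand-in for the detection system: rank records by anomaly score."""
    return sorted(records, key=score_fn, reverse=True)[:n]

# 50 ordinary sessions plus 2 replayed pen-test sessions (made-up values)
normal = [{"user": f"user{i}", "bytes_out": 100 + i} for i in range(50)]
fakes = [{"bytes_out": 90_000}, {"bytes_out": 120_000}]
mixed = inject_fake_attacks(normal, fakes)

# hypothetical anomaly score: volume of outbound data
alerts = top_alerts(mixed, score_fn=lambda r: r["bytes_out"], n=2)
print([a["user"] for a in alerts])  # → ['eviluser', 'eviluser']
```

If “eviluser” ever fails to appear at the top, the detection (or the pipeline feeding it) has regressed.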
2. Employ commonly used attack tools
Imagine you are building a detection for suspicious process creation. One way to spin up a malicious process is to run tools like Metasploit, Powersploit, or Veil in your environment, and look for traces in the logs. This technique is similar to using fuzzers for web detection.
Pro: easy to implement—unlike the next strategy, you don’t need to depend on a red team; your development team, with a little instruction, can run the tools and generate attack data in the logs.
Con: the machine learning system will only learn to detect known attacker toolkits and will not generalize over the attack methodology. In other words, there is a risk that our detections overfit to the tool instead of to a real-world attack.
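To make that overfitting risk concrete, here is a toy detector (the log format and process names are invented for illustration) that keys on known tool artifacts: it flags the tool-generated test data, but misses the identical technique run from a renamed binary.

```python
# Hypothetical process-creation log format: "timestamp,host,parent,child".
KNOWN_TOOL_ARTIFACTS = {"msfconsole", "powersploit", "veil"}

def flag_suspicious(log_lines):
    """Flag child processes whose name matches a known attack tool."""
    alerts = []
    for line in log_lines:
        ts, host, parent, child = line.split(",")
        name = child.lower().rsplit(".", 1)[0]  # strip the extension
        if name in KNOWN_TOOL_ARTIFACTS:
            alerts.append((ts, host, child))
    return alerts

logs = [
    "2016-01-04T10:00:00,host1,explorer.exe,outlook.exe",
    "2016-01-04T10:05:00,host2,cmd.exe,msfconsole.exe",     # tool-generated test data
    "2016-01-04T10:06:00,host2,cmd.exe,totally_legit.exe",  # same technique, renamed: missed
]
print(flag_suspicious(logs))  # → [('2016-01-04T10:05:00', 'host2', 'msfconsole.exe')]
```

A model trained only on tool-generated logs can end up learning exactly this kind of signature rather than the underlying behavior.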
3. Red team manually pen-tests
This is the regular pen test engagement, in which a red team attacks the system and we collect the logs from the attacks as tainted data (in this article, we use “pen testing” and “red teaming” interchangeably to keep things simple—@davehull has a great presentation explaining the differences).
Pro: this is the closest technique to real-world attacks. Red team members emulate adversaries with specific motivations, like exfiltrating data, and begin from scratch with reconnaissance. Results from red teaming are the best approximations to attacks in the wild and are a great way to validate detection systems.
Cons:
- Red teams are expensive resources, and in most cases only large organizations have red teams in-house. One way to overcome this is to hire red team services like Casaba Security, TrustedSec, or Immunity.
- Pen tests are point-in-time exercises, but we are constantly building detections. Typically, pen test exercises are scheduled once a quarter, while security analyst teams continuously ship new detections based on current threats. It would be impractical to have the pen test team instantly validate every detection that is authored.
- Getting attack data from red teams isn’t always an easy task. Even when pen testers take meticulous notes during the process, identifying the malicious sessions in the logs is extremely time-consuming. For example, their notes may say “ran tool X at this time”; blue team members would then need to spend an inordinate amount of time mapping that back to the attack data in the logs.
Machine learning for increasing attack data
Once we have a small set of labeled data, we have a couple of options to increase our attack data.
Synthetic Minority Oversampling Technique (SMOTE)
Techniques like undersampling the majority class (i.e., the “normal,” non-malicious data) do not work well here, as they are tantamount to throwing away data—the model never sees some of the normal data, which may encode important information. Oversampling the minority class (i.e., the generated attack data) has the disadvantage of repeating the same observed variations—we are not increasing the data set; we are merely replicating it.
One of my favorite techniques is SMOTE (Chawla et al.). Essentially, it generates a random point along the line segment connecting an anomalous point (in our case, an attack point) to one of its nearest minority-class neighbors. SMOTE packages are well represented across the machine learning ecosystem, in AzureML, R, Python, and even Weka.
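To make the interpolation concrete, here is a from-scratch sketch of the core SMOTE idea in NumPy (in practice you would reach for a maintained implementation, such as the one in the imbalanced-learn package; the feature values below are made up):

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Minimal SMOTE sketch (Chawla et al.): synthesize n_new points by
    interpolating a sampled minority point toward one of its k nearest
    minority-class neighbors."""
    rng = np.random.default_rng(seed)
    n = len(X_minority)
    k = min(k, n - 1)
    # pairwise distances within the minority (attack) class
    d = np.linalg.norm(X_minority[:, None, :] - X_minority[None, :, :], axis=2)
    # for each point, indices of its k nearest neighbors (excluding itself)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                # pick a minority point
        j = nn[i, rng.integers(k)]         # one of its k nearest neighbors
        gap = rng.random()                 # random position along the segment
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

# e.g., 6 labeled attack sessions described by 3 (made-up) features
attacks = np.array([[1.0, 0.2, 5.0],
                    [1.1, 0.1, 5.2],
                    [0.9, 0.3, 4.8],
                    [1.2, 0.2, 5.1],
                    [1.0, 0.4, 5.3],
                    [0.8, 0.2, 4.9]])
new_attacks = smote_sketch(attacks, n_new=20)
print(new_attacks.shape)  # → (20, 3)
```

Because each synthetic point lies on a segment between two real attack points, it stays within the envelope of observed attack behavior rather than merely duplicating it.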
Generative adversarial networks
A new technique called Generative Adversarial Networks (GANs) (Goodfellow et al.) is taking the machine learning world by storm. It is having such an effect that Yann LeCun, in his Quora session, named it one of the most important recent and potentially upcoming breakthroughs in deep learning. While I am wary of jumping on the bandwagon, I posit that GANs in particular will be quite useful for increasing attack data.
Here is the idea behind GANs:
- There are two learners, G and D.
- G is a generative learner that can “generate” data from the distribution it learns.
- D is a discriminative learner that maps input data into one of the classes.
- The process is “adversarial” because:
  - The goal of D is to identify whether the data was produced by G or came from the real data set.
  - The goal of G is to produce data such that D is fooled.
- The goal is that, after sufficient training, D cannot tell whether the data comes from the true distribution or from the distribution learned by G.
By using our corpus of attack data as input to a GAN, we can potentially build a system that generates extensions of that attack data. Caution: GANs are extremely difficult to get right. You can play around with GANs in Torch, Theano, or TensorFlow here.
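To make the G/D dance concrete, here is a deliberately tiny sketch: a two-parameter generator and a logistic-regression discriminator trained adversarially on a 1-D “attack feature” (everything here, including the Gaussian stand-in for real attack data, is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the real attack-feature distribution (hypothetical).
def real_batch(n):
    return rng.normal(3.0, 0.5, size=n)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Generator G: x = mu + sigma * z, z ~ N(0, 1); starts far from the real data.
mu, sigma = 0.0, 1.0
# Discriminator D: p(real | x) = sigmoid(w * x + b).
w, b = 0.1, 0.0

lr, batch = 0.05, 64
for _ in range(2000):
    # Train D: raise p on real samples, lower it on generated samples.
    xr, z = real_batch(batch), rng.normal(size=batch)
    xf = mu + sigma * z
    pr, pf = sigmoid(w * xr + b), sigmoid(w * xf + b)
    gw = (-(1 - pr) * xr + pf * xf).mean()   # grad of -log pr - log(1 - pf)
    gb = (-(1 - pr) + pf).mean()
    w, b = w - lr * gw, b - lr * gb

    # Train G: fool D (non-saturating loss -log D(G(z))).
    z = rng.normal(size=batch)
    xf = mu + sigma * z
    pf = sigmoid(w * xf + b)
    dx = -(1 - pf) * w                       # d(-log pf) / d xf
    mu, sigma = mu - lr * dx.mean(), sigma - lr * (dx * z).mean()

# Generated "attack" samples; with luck their mean/std drift toward 3.0/0.5,
# but simple GANs are unstable and may oscillate rather than converge.
samples = mu + sigma * rng.normal(size=1000)
print(samples.shape)
```

The two gradient steps are exactly the adversarial objectives from the bullet list above; real GAN implementations replace the linear G and D with deep networks and add considerable training machinery to stabilize this loop.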