Partial map of the Internet based on the January 15, 2005 data found on (source: The Opte Project on Wikimedia Commons).

In this episode of the Data Show, I spoke with Parvez Ahammad, who leads the data science and machine learning efforts at Instart Logic. He has applied machine learning in a variety of domains, most recently to computational neuroscience and security. Along the way, he has assembled and managed teams of data scientists and has had to grapple with issues like explainability and interpretability, ethics, insufficient amounts of labeled data, and adversaries who target machine learning models. As more companies deploy machine learning models into products, it’s important to remember that many factors come into play aside from raw performance metrics.

Here are some highlights from our conversation:

Machine learning systems that require minimal supervision

There is this division where people think about machine learning algorithms as supervised, unsupervised, or semi-supervised, based on the availability of labels. I was particularly interested in problems where there is a lot of data and only a little bit of information available. There are some examples where somebody can supply ground truth, but it's really hard to get ground truth at a larger scale. There are problems of this type even in industry. I was interested in creating systems that can bootstrap from the small amount of ground truth that you provide and slowly build up a much more robust underlying machine learning system that can achieve the goal you are trying to accomplish. For example, if you are trying to do a classification problem and you start off with a little bit of data that's ground truth, you can combine that ground truth with unlabeled data, along with a person in the loop, to create a system that can scale over time and over data.

This is my personal opinion. I think there is a way to build a semi-supervised machine learning system without having a person in the loop. I am particularly interested in the class of problems where it is hard to take the person out of the loop because there is probably some complexity that the algorithm can't approximate. It's very useful to have a person in the loop. Maybe there is some relevance feedback the human can provide that's extremely helpful for the system. When I said minimally supervised, I meant that there is a person in the loop, and having this person in the loop allows you to divide up the decisions into what the algorithm has to decide and what the person has to decide, in such a way that you don't need a lot of ground truth. You can also bootstrap between these two elements.
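To make the idea of dividing decisions between the algorithm and the person concrete, here is a minimal sketch in Python. It is not the system described in the interview. It assumes a toy one-dimensional feature, a nearest-centroid classifier, and a hypothetical `oracle` function standing in for the person; the names `bootstrap` and `margin_threshold` are invented for illustration. Points the classifier is confident about get pseudo-labels, and only ambiguous points are sent to the person.

```python
def centroids(labeled):
    """Mean feature value per class, from (x, label) pairs."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_with_margin(cents, x):
    """Return (nearest class, gap between the two closest centroids)."""
    dists = sorted((abs(x - c), y) for y, c in cents.items())
    return dists[0][1], dists[1][0] - dists[0][0]

def bootstrap(seed_labels, unlabeled, oracle, margin_threshold=1.0):
    """Grow a small labeled set: the algorithm labels confident points,
    the person (oracle) labels the ambiguous ones."""
    labeled = list(seed_labels)
    human_queries = 0
    for x in unlabeled:
        guess, margin = predict_with_margin(centroids(labeled), x)
        if margin >= margin_threshold:
            labeled.append((x, guess))      # algorithm decides
        else:
            labeled.append((x, oracle(x)))  # person decides
            human_queries += 1
    return centroids(labeled), human_queries

# Assumption: two labeled examples are all the ground truth we start with.
seed_labels = [(0.0, 0), (4.0, 1)]
unlabeled = [-0.5, 0.3, 4.2, 1.8, 3.6, 0.9, 2.2, 4.8, 1.2, 3.1]

def oracle(x):
    """Hypothetical stand-in for the person in the loop."""
    return 1 if x >= 2.0 else 0

model, human_queries = bootstrap(seed_labels, unlabeled, oracle)
print(f"person was asked about {human_queries} of {len(unlabeled)} points")
```

On this toy data, only the two points nearest the class boundary (1.8 and 2.2) are routed to the person; the other eight are labeled by the algorithm. Raising `margin_threshold` shifts more of the decisions to the person.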

Adversaries and game-theoretic machine learning techniques

Most machine learning applications are on the supervised side. What you're basically assuming is that you collected some data, and you think the underlying distribution behind your data collection is going to stay true. In statistics, this is called the stationarity assumption. You assume that this batch is representative of what you're going to see later. You split it up into two parts: you train on one part and you test on the other. The issue is, in security especially, there is an adversary. Any time you settle down and build some classifier, there is somebody actively working to break it. No assumption of stationarity is going to hold. Also, there are people or botnets that are actively trying to get around whatever model you constructed. There is an adversarial nature to the problem. These two-sided problems are typically dealt with in a game-theoretic framework.
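The following Python sketch illustrates what goes wrong when the stationarity assumption fails. Everything in it is an assumption made for illustration: the single feature (a request rate), the Gaussian parameters, and the midpoint-threshold classifier. The classifier does well on a held-out split from the same batch, then degrades once the simulated bots lower their rate to slip under the learned threshold.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def sample(n, benign_mean, bot_mean):
    """Draw n benign (label 0) and n bot (label 1) request rates."""
    data = [(rng.gauss(benign_mean, 2.0), 0) for _ in range(n)]
    data += [(rng.gauss(bot_mean, 2.0), 1) for _ in range(n)]
    rng.shuffle(data)
    return data

def fit_threshold(train):
    """Toy classifier: the midpoint between the two class means."""
    means = []
    for label in (0, 1):
        xs = [x for x, y in train if y == label]
        means.append(sum(xs) / len(xs))
    return (means[0] + means[1]) / 2

def accuracy(threshold, data):
    """Fraction of points where 'rate > threshold' matches the bot label."""
    hits = sum((x > threshold) == bool(y) for x, y in data)
    return hits / len(data)

# One batch, split into train and test: the stationarity assumption in action.
batch = sample(200, benign_mean=10, bot_mean=30)
train, test = batch[:200], batch[200:]
threshold = fit_threshold(train)
holdout_acc = accuracy(threshold, test)

# Later traffic: the adversary has adapted and bots now sit near benign rates.
drifted = sample(100, benign_mean=10, bot_mean=14)
drifted_acc = accuracy(threshold, drifted)

print(f"held-out accuracy: {holdout_acc:.2f}")
print(f"accuracy after the adversary adapts: {drifted_acc:.2f}")
```

The held-out score says nothing about the second number, because the test split shares the training distribution while the adapted traffic does not.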

There are papers from Doug Tygar’s group at Berkeley that you can look up, where, essentially, you can poison a machine learning classifier to do bad things by messing with how the samples are being constructed or messing with the distribution that the classifier is looking at. Alternatively, you can also try to construct safe machine learning approaches that go in with the assumption that there is going to be an adversary, and try to design systems that are robust to such an adversary.
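As a toy illustration of both halves of that point, the Python sketch below poisons a simple threshold classifier and then compares it with a more robust variant. It is not taken from the papers mentioned above; the data, the request-rate feature, and the choice of the median as the robust estimator are all assumptions for illustration. The attacker injects a few high-rate samples that end up labeled benign, which drags a mean-based threshold upward so that a moderately fast bot slips through.

```python
from statistics import mean, median

def fit_threshold(train, center):
    """Midpoint between the class centers; `center` is mean or median."""
    benign = [x for x, y in train if y == 0]
    bots = [x for x, y in train if y == 1]
    return (center(benign) + center(bots)) / 2

# Clean training data: (request rate, label), 0 = benign, 1 = bot.
clean = [(x, 0) for x in (8, 9, 10, 11, 12)]
clean += [(x, 1) for x in (28, 29, 30, 31, 32)]

# Poisoning: three attacker-controlled samples that get labeled benign.
poisoned = clean + [(40, 0)] * 3

clean_thr = fit_threshold(clean, mean)
mean_thr = fit_threshold(poisoned, mean)      # pulled up by the poison
median_thr = fit_threshold(poisoned, median)  # barely moves

evasive_bot = 24  # a bot rate chosen to sit between the two thresholds
print("clean threshold:", clean_thr)
print("poisoned, mean-based:", mean_thr, "flags bot:", evasive_bot > mean_thr)
print("poisoned, median-based:", median_thr, "flags bot:", evasive_bot > median_thr)
```

The median only resists this attack while the poisoned samples remain a minority of the benign class, so the sketch shows the design mindset (assume an adversary, pick estimators that degrade gracefully) rather than a complete defense.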

Explainability and machine learning

I think companies like Google or Facebook probably have access to large-scale resources where they can curate and generate really good quality ground truth. In such a scenario, it's probably wise to try deep learning. On a philosophical level, I also feel that deep learning is like proving that there is a Nash equilibrium. You know that it can be done. How exactly it is getting done is a separate problem. I also think, as a scientist, I am interested in understanding what exactly is making this work. ...For example, if you throw deep learning at this problem and the thing comes back, and the classification rates are very low, then we probably need to look at a different problem because you just threw the kitchen sink at it. On the other hand, if we found that it is doing a good job, then what we need to do is to start from there and figure out an explainable model that we can train. Because we are an enterprise and in the enterprise industry, it's not sufficient to have an answer. We need to be able to explain why. For that, there are issues in simply applying deep learning as it is.

...What I'm really interested in these days is the idea of explainable machine learning. It’s not enough that we build machine learning systems that can do a certain classification or segmentation job very well. I'm starting to be really interested in the idea of how to build systems that are interpretable, that are explainable, where you can have faith in the outcome of the system by inspecting something about the system that allows you to say, 'Hey, this was actually a trustworthy result.'
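One possible reading of "start from a model that works and figure out an explainable one" is to fit a simple, inspectable surrogate to the predictions of the opaque model. The Python sketch below does that under loud assumptions: `black_box` is a hypothetical stand-in for a trained deep network, the surrogate is a one-rule decision stump, and the grid of inputs is invented. It is an illustration of the general idea, not the approach used at any particular company.

```python
def black_box(x):
    """Hypothetical stand-in for a trained deep model: opaque to the caller."""
    return 1 if 0.8 * x[0] + 0.1 * x[1] ** 2 > 2.2 else 0

def fit_stump(points, targets):
    """Find the single rule 'feature f > t' that best matches the targets.

    Returns (agreement fraction, feature index, threshold)."""
    best = (0.0, None, None)
    for f in range(len(points[0])):
        for t in sorted({p[f] for p in points}):
            agree = sum((p[f] > t) == bool(y) for p, y in zip(points, targets))
            if agree / len(points) > best[0]:
                best = (agree / len(points), f, t)
    return best

# Probe the black box on a grid of inputs and record what it predicts.
points = [(i * 0.5, j * 0.5) for i in range(11) for j in range(5)]
targets = [black_box(p) for p in points]

# Fit the explainable surrogate to the black box's outputs, not to ground truth.
fidelity, feature, threshold = fit_stump(points, targets)
print(f"surrogate rule: x[{feature}] > {threshold} "
      f"(agrees with the black box on {fidelity:.0%} of probed points)")
```

The surrogate's rule can be read and challenged by a person, and its fidelity score says how much of the black box's behavior that one rule actually accounts for on the probed inputs.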

Parvez Ahammad will speak on recent Applications of machine learning to security at Strata + Hadoop World San Jose, March 13-16, 2017.

Related resources: