Semi-supervised, unsupervised, and adaptive algorithms for large-scale time series
The O’Reilly Data Show Podcast: Ira Cohen on developing machine learning tools for a broad range of real-time applications.
In this episode of the O’Reilly Data Show, I spoke with Ira Cohen, co-founder and chief data scientist at Anodot (full disclosure: I’m an advisor to Anodot). Since my days in quantitative finance, I’ve had a longstanding interest in time-series analysis. Back then, I used statistical (and data mining) techniques on relatively small volumes of financial time series. Today’s applications and use cases involve data volumes and speeds that require a new set of tools for data management, collection, and simple analysis.
On the analytics side, applications are also beginning to require online machine learning algorithms that are able to scale, are adaptive, and free of a rigid dependence on labeled data. I talked with Cohen about the challenges in building an advanced analytics system for intelligent applications at extremely large scale.
Here are some highlights from our conversation:
A lot of systems have a concept called dashboarding, where you put your regular things that you look at—the total revenue, the total amount of traffic to my website. … We have a parallel concept that we called Anoboard, which is an anomaly board. An anomaly board is basically showing you only the things that right now have some strange patterns to them. … So, out of the millions, here are the top 20 things you should be looking at because they have a strange behavior to them.
… The Anoboard is something that gets populated by machine learning algorithms. … We only highlight the things that you need to look at rather than the subset of things that you’re used to looking at, but that might not be relevant for discovering anything that’s happening right now.
Adaptive, online, unsupervised algorithms at scale
We are a generic platform that can take any time series into it, and we’ll output anomalies. Like any machine learning system, we have success criteria. In our case, it’s that the number of false positives should be minimal, and the number of true detections should be the highest possible. Given those constraints and given that we are agnostic to the data so we’re generic enough, we have to have a set of algorithms that will fit almost any type of metrics, any type of time series signals that get sent to us.
To do that, we had to observe and collect a lot of different types of time series data from various types of customers. … We have millions of metrics in our system today. … We have over a dozen different algorithms that fit different types of signals. We had to design them and implement them, and obviously because our system is completely unsupervised, we also had to design algorithms that know how to choose the right one for every signal that comes in.
… When you have millions of time series and you’re measuring a large ecosystem, there are relationships between the time series, and the relationships and anomalies between different signals do tell a story. … There are a set of learning algorithms behind the scene that do this correlation automatically.
… All of our algorithms are adaptive, so they take in samples and basically adapt themselves over time to fit the samples. Let’s say there is a regime change. It might trigger an anomaly, but if it stays in a different regime, it will learn that as the new normal. … All our algorithms are completely online, which means they adapt themselves as new samples come in. This actually addresses the second part of the first question, which was scale. We know we have to be adaptive. We want to track 100% of the metrics, so it’s not a case where you can collect a month of data, learn some model, put it in production and then everything is great and you don’t have to do anything. You don’t have to relearn anything. … We assume that we have to relearn everything all the time because things change all the time.
Discovering relationships among KPIs and semi-supervised learning
We find relationships between different KPIs and show it to a user; it’s often something they are not aware of and are surprised to see. … Then, when they think about it and go back, they realize, ‘Oh, yeah. That’s true.’ That completely changes their way of thinking. … If you’re measuring all sorts of business KPIs, nobody knows the relationships between things. They can only conjecture about them, but they don’t really know it.
… I came from a world of semi-supervised learning where you have some labels, but most of the data is unlabeled. I think this is the reality for us as well. We get some feedback from users, but it’s a fraction of the feedback you need if you want to apply supervised learning methods. Getting that feedback is actually very, very helpful. … Because I’m from the semi-supervised learning world, I always try to see where I can get some inputs from users, or from some oracle, but I never want to rely on it being there.
Editor’s note: Ira Cohen will present a talk entitled Analytics for large-scale time-series and event data at Strata + Hadoop World London 2016.