How to build an anomaly detection engine with Spark, Akka and Cassandra

Coral: a real-time streaming anomaly detection engine

Date: This event took place live on December 16, 2015

Presented by: Natalino Busa

Duration: Approximately 60 minutes.

Description:

Hosted By: Ben Lorica

This webcast presents "Coral," a solution for streaming anomaly detection. The Coral system is composed of three elements: a machine learning module, an event-processing scoring module, and a data store, implemented with Spark, Akka, and Cassandra, respectively.

Spark is employed to train the model that identifies anomalous events in the incoming stream. This module uses Spark SQL sample statistics and Spark MLlib k-means clustering to identify the outliers. The model is re-trained at regular intervals as new (micro)batches of events arrive: the training algorithm is re-run with Spark Streaming so that the anomaly detection model stays up to date, even under changing trends and conditions.
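To make the training step concrete, here is a minimal Scala sketch that fits a k-means model with Spark MLlib and derives a simple distance-based outlier threshold. It is not the webcast's actual code: the input file, the number of clusters, the iteration count, and the mean-plus-three-standard-deviations threshold are all illustrative assumptions, and in Coral the events would be read from Cassandra rather than from a text file.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    object AnomalyModelTrainer {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("coral-anomaly-trainer"))

        // Hypothetical input: one comma-separated feature vector per event.
        // In Coral the events would come from Cassandra instead.
        val events = sc.textFile("events.csv")
          .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
          .cache()

        // Fit k-means; k and the iteration count are illustrative choices.
        val model = KMeans.train(events, k = 8, maxIterations = 20)

        // Distance of each event to its nearest cluster centre.
        val distances = events.map { v =>
          math.sqrt(Vectors.sqdist(v, model.clusterCenters(model.predict(v))))
        }

        // Simple outlier threshold: mean distance plus three standard deviations.
        val stats = distances.stats()
        val threshold = stats.mean + 3 * stats.stdev
        println(s"anomaly threshold: $threshold")

        sc.stop()
      }
    }

At scoring time, an event whose distance to its nearest cluster centre exceeds this threshold would be flagged as an anomaly.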

Both the stream of events and the trained anomaly detection model are persisted in Cassandra. Events are collected in Cassandra and read back by Spark to perform the machine learning analytics. Once the model is trained in Spark, its parameters are written back to Cassandra, where they are picked up by the event processing module implemented in Akka. The Akka runtime module then scores thousands of events per second per node.
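The scoring side can be sketched as a plain Akka actor that keeps the latest model parameters in memory and flags events that fall too far from every cluster centre. This is a hypothetical sketch: the Event and ModelUpdate messages, the field names, and the hard-coded values are assumptions, and in practice the ModelUpdate would be produced by periodically reading the model table in Cassandra (not shown here).

    import akka.actor.{Actor, ActorSystem, Props}

    // Hypothetical message types used in this sketch.
    case class Event(features: Array[Double])
    case class ModelUpdate(centres: Seq[Array[Double]], threshold: Double)

    // Scores each incoming event against the current k-means centres; the
    // ModelUpdate would be built from the model parameters stored in Cassandra.
    class ScoringActor extends Actor {
      private var centres: Seq[Array[Double]] = Nil
      private var threshold: Double = Double.MaxValue

      private def distance(a: Array[Double], b: Array[Double]): Double =
        math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

      def receive: Receive = {
        case ModelUpdate(c, t) =>
          centres = c
          threshold = t
        case Event(features) if centres.nonEmpty =>
          val nearest = centres.map(distance(features, _)).min
          if (nearest > threshold) println(s"anomaly detected, distance = $nearest")
      }
    }

    object ScoringApp extends App {
      val system = ActorSystem("coral-scoring")
      val scorer = system.actorOf(Props[ScoringActor], "scorer")

      // Illustrative wiring only: model parameters and events are hard-coded here.
      scorer ! ModelUpdate(Seq(Array(0.0, 0.0)), threshold = 2.5)
      scorer ! Event(Array(5.0, 5.0))
    }

Because each actor processes its mailbox independently, several such scorers can be run per node (for example behind an Akka router) to scale out the scoring throughput.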

Each element of this system (Spark, Akka, Cassandra) can be distributed across multiple nodes, which gives the solution strong resilience and availability characteristics.

In this webcast you will learn how to:

  • determine when to use batch, micro-batch, and event data processing
  • build an anomaly detection module using Spark, Spark SQL, and Spark MLlib
  • build an event processing engine with Akka
  • set up Cassandra to persist events as well as machine learning models
  • keep a machine learning model up to date by using Spark Streaming (see the sketch after this list)
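As a hint of what the last point can look like in practice, the sketch below combines Spark Streaming with MLlib's StreamingKMeans, which updates the cluster centres on every micro-batch. The socket source, batch interval, number of clusters, decay factor, and feature dimension are illustrative assumptions; in Coral the micro-batches would be read from the event stream persisted in Cassandra.

    import org.apache.spark.SparkConf
    import org.apache.spark.mllib.clustering.StreamingKMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingModelUpdater {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("coral-streaming-trainer")
        val ssc = new StreamingContext(conf, Seconds(10)) // 10-second micro-batches

        // Hypothetical source: comma-separated feature vectors arriving on a socket.
        val events = ssc.socketTextStream("localhost", 9999)
          .map(line => Vectors.dense(line.split(',').map(_.toDouble)))

        // StreamingKMeans refines the cluster centres on every micro-batch,
        // so the model tracks changing trends in the event stream.
        val model = new StreamingKMeans()
          .setK(8)             // illustrative number of clusters
          .setDecayFactor(0.9) // gradually forget older batches
          .setRandomCenters(dim = 3, weight = 0.0)

        model.trainOn(events)

        ssc.start()
        ssc.awaitTermination()
      }
    }

An alternative, as described above, is to re-run the batch k-means training on each micro-batch and overwrite the model parameters stored in Cassandra.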

About Natalino Busa

Natalino is currently Senior Data Architect at ING in the Netherlands, where he leads the strategy, definition, design, and implementation of big/fast data solutions for data-driven applications in personalized marketing, predictive analytics, and fraud/security management.

He is an all-round software architect, data technologist, and innovator with 15+ years of experience in the research, development, and management of distributed architectures and scalable services and applications.

He previously served as a senior researcher at Philips Research Laboratories in the Netherlands, working on system-on-a-chip architectures, distributed computing, and parallelizing compilers.

He blogs regularly about big data, analytics, data science, and Scala reactive programming at natalinobusa.com.

About Ben Lorica

Ben Lorica is the Chief Data Scientist and Director of Content Strategy for Data at O'Reilly Media, Inc. He has applied Business Intelligence, Data Mining, Machine Learning, and Statistical Analysis in a variety of settings, including Direct Marketing, Consumer and Market Research, Targeted Advertising, Text Mining, and Financial Engineering. His background includes stints with an investment management company, internet startups, and financial services. He is an advisor to Databricks.

Twitter: @bigdata