Barzan MozafariJags Ramnarayan

Efficient state management with Spark and in-memory databases

Date: This event took place live on May 25 2016

Presented by: Barzan Mozafari, Jags Ramnarayan

Duration: Approximately 60 minutes.

Questions? Please send email to

Description:

Spark 2.0 ushers in huge advances to support real-time use cases. With structured streaming, a single, unified API enables querying and combines streams and static data frames. Most streaming applications are stateful computations that need to build and maintain state incrementally. For instance, one application may maintain counters or incrementally build a predictive model, while another may derive insight by correlating coarse-grained stream windows with historical data in order to detect anomalies. The model state may also undergo changes and must be transactionally consistent. Spark's new API for state management in streams is promising, but it isn't designed to maintain, say, a complex model or reference data from enterprise relational databases. Many applications will continue to use external stores (SQL, NoSQL, or in-memory databases). However, for many scenarios, such as in IoT applications, this approach can be challenging due to excessive serialization/deserialization, slow scan or aggregation performance in row-oriented databases, and the difficulty in enforcing exactly-once semantics. Without sufficient care, an application may easily fail to keep up with the incoming stream.

Barzan Mozafari and Jags Ramnarayan present a design that combines Spark with an open source in-memory database—SnappyData—that equally scales and collocates its partitions with those of Spark, effectively offering state for stream processing while also allowing external clients to execute analytic queries, all in a single Spark cluster. Barzan and Jags also introduce the concept of approximate query processing and explain how it can dramatically reduce the resource requirements to manage the streaming data while offering a 100x or more speed up for analytic queries.

You'll learn:

  • A few common use case patterns for ingesting streams via the new structured streaming APIs;
  • Different options for managing state, including Spark 2.0's new streaming state API, external in-memory or NoSQL stores, and an in-memory database that runs collocated with the Spark executors (i.e., sharing the same memory space);
  • The pros and cons of each approach for managing state, with some examples of experimental benchmark results from SnappyData;
  • The benefits of using probabilistic data structures to manage infinite streams and approximation strategies for real-time analytics with high-velocity streams using limited resources.

About Barzan Mozafari

Barzan Mozafari is an assistant professor of computer science and engineering at the University of Michigan, Ann Arbor, where he leads a research group designing the next generation of scalable databases using advanced statistical models. Previously, Barzan was a postdoctoral associate at MIT. His research career has led to many successful open source projects, including CliffGuard (the first robust framework for database tuning), DBSeer (the first automated database diagnosis tool), and BlinkDB (the first massively parallel approximate query engine). Barzan holds a PhD in computer science from UCLA.

About Jags Ramnarayan

Jags Ramnarayan is CTO of SnappyData, a Spark-based startup. Previously, Jags was the chief architect for fast data products at Pivotal and served on the extended leadership team of the company. At both Pivotal and VMware, he led the technology direction for GemFire, helped lead the company strategy for data services, and worked closely with customers to help them be successful. Jags is recognized for his expertise in distributed systems and databases and is a frequent speaker on distributed data. He has a bachelor's degree in computer science and a master's degree in management.