Efficient state management with Spark and in-memory databases
Date: This event took place live on May 25 2016
Duration: Approximately 60 minutes.
Questions? Please send email to
Spark 2.0 ushers in huge advances to support real-time use cases. With structured streaming, a single, unified API enables querying and combines streams and static data frames. Most streaming applications are stateful computations that need to build and maintain state incrementally. For instance, one application may maintain counters or incrementally build a predictive model, while another may derive insight by correlating coarse-grained stream windows with historical data in order to detect anomalies. The model state may also undergo changes and must be transactionally consistent. Spark's new API for state management in streams is promising, but it isn't designed to maintain, say, a complex model or reference data from enterprise relational databases. Many applications will continue to use external stores (SQL, NoSQL, or in-memory databases). However, for many scenarios, such as in IoT applications, this approach can be challenging due to excessive serialization/deserialization, slow scan or aggregation performance in row-oriented databases, and the difficulty in enforcing exactly-once semantics. Without sufficient care, an application may easily fail to keep up with the incoming stream.
Barzan Mozafari and Jags Ramnarayan present a design that combines Spark with an open source in-memory database—SnappyData—that equally scales and collocates its partitions with those of Spark, effectively offering state for stream processing while also allowing external clients to execute analytic queries, all in a single Spark cluster. Barzan and Jags also introduce the concept of approximate query processing and explain how it can dramatically reduce the resource requirements to manage the streaming data while offering a 100x or more speed up for analytic queries.
About Barzan Mozafari
Barzan Mozafari is an assistant professor of computer science and engineering at the University of Michigan, Ann Arbor, where he leads a research group designing the next generation of scalable databases using advanced statistical models. Previously, Barzan was a postdoctoral associate at MIT. His research career has led to many successful open source projects, including CliffGuard (the first robust framework for database tuning), DBSeer (the first automated database diagnosis tool), and BlinkDB (the first massively parallel approximate query engine). Barzan holds a PhD in computer science from UCLA.
About Jags Ramnarayan
Jags Ramnarayan is CTO of SnappyData, a Spark-based startup. Previously, Jags was the chief architect for fast data products at Pivotal and served on the extended leadership team of the company. At both Pivotal and VMware, he led the technology direction for GemFire, helped lead the company strategy for data services, and worked closely with customers to help them be successful. Jags is recognized for his expertise in distributed systems and databases and is a frequent speaker on distributed data. He has a bachelor's degree in computer science and a master's degree in management.