Evan Chan, Helena Edelson

Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spark Streaming

Date: This event took place live on March 16, 2016

Presented by: Evan Chan, Helena Edelson

Duration: Approximately 60 minutes.

Questions? Please send email to

Hosted By: Ben Lorica

Description:

Companies today need a combination of streaming, batch, and ad-hoc analytics to extract the most meaning from their data sets. Immediate insight into your data requires low-latency responses to new events, which often means querying both recent and historical data to compute trends and predictions. The rise of Apache Spark, along with distributed messaging systems that support streaming, such as Kafka, has enabled and simplified this process. Yet organizations still build dual streaming and batch pipelines (such as the Lambda architecture) with a multitude of technologies (for example, Hadoop), each of which must be supported and paid for in staff and storage.

FiloDB is a new open-source, high-performance analytical database designed for modern streaming workloads as well as fast batch and ad-hoc analysis. In this webcast, we will demonstrate some use cases of FiloDB that have enabled flexible ad-hoc analytics for a new generation of real-time Spark and Spark Streaming applications, while also simplifying the stack.

In this webcast, we will also review:

  • Modern streaming and batch/ad-hoc architectures
  • Precise and scalable streaming ingestion using Apache Kafka, Akka, Spark Streaming, Cassandra, and FiloDB
  • How a unified streaming + batch stack can lower your TCO
  • Using Cassandra to serve rollups and web-speed queries
  • What FiloDB is and how it enables fast analytics with competitive storage cost
  • Data Warehousing with Spark, Cassandra, and FiloDB
  • Time series / event data / geospatial examples, including smart cities
  • Machine learning using Spark MLlib, without the need to export to HDFS
  • Combining streaming and historical data analysis, including efficient longer-time window analysis

About Evan Chan

Evan loves to design, build, and improve bleeding-edge distributed data and backend systems using the latest in open-source technologies. He is the creator of the FiloDB open-source distributed analytical OLAP database, as well as the Spark Job Server. He has led the design and implementation of multiple big data platforms based on Storm, Spark, Kafka, Cassandra, and Scala/Akka, including a columnar real-time distributed query engine. He is an active contributor to the Apache Spark project and a DataStax Cassandra MVP. He has built Spark applications since Spark 0.8 and Cassandra applications since Cassandra 0.6. He is a big believer in GitHub, open source, and meetups, and has given talks at various conferences including Spark Summit, Cassandra Summit, FOSS4G, and Scala Days.

Evan has a Bachelor's degree and a Master's degree in Electrical Engineering, with distinction, from Stanford University. He is currently a distinguished engineer at Tuplejump.

Twitter: @evanfchan

About Helena Edelson

Helena has been a Software Engineer for over 15 years. After a decade in distributed messaging engineering, she moved exclusively to working with Scala, first for cloud infrastructure automation and then for big data, always on large-scale distributed systems. As a Senior Cloud Engineer, she was on the first Scala team at VMware, building multi-tenant cloud automation systems. She then moved into big data, architecting, building, and deploying streaming and batch analytics pipelines for real-time threat analysis in cybersecurity. Most recently she has worked on streaming analytics and machine learning at scale with Kafka, Apache Spark, FiloDB, Cassandra, and Akka.

Helena is a committer to several open-source projects, including the Spark Cassandra Connector, and a contributor to Akka (new features in Akka Cluster). While working at SpringSource, she was a contributor to Spring Integration and Spring AMQP. She speaks at international Big Data and Scala conferences such as Spark Summit (Europe and the US), Strata NY, Strata San Jose, QCon SF, Scala Days (Europe and the US), Scala World, Data Days, and Philly Emerging Technology. She is currently VP of Product Engineering at Tuplejump.

Twitter: @helenaedelson

About Ben Lorica, Chief Data Scientist — O'Reilly Media

Ben Lorica is the Chief Data Scientist and Director of Content Strategy for Data at O'Reilly Media, Inc. He has applied Business Intelligence, Data Mining, Machine Learning and Statistical Analysis in a variety of settings including Direct Marketing, Consumer and Market Research, Targeted Advertising, Text Mining, and Financial Engineering. His background includes stints with an investment management company, internet startups, and financial services.