Building a scalable platform for streaming updates and analytics

The O’Reilly Data Show podcast: Evan Chan on the early days of Spark+Cassandra, FiloDB, and cloud computing.

By Ben Lorica and Shannon Cutt December 17, 2015 • 00:38:52 listen

In this episode of the O’Reilly Data Show, I sit down with Evan Chan, distinguished engineer at Tuplejump. We talk about the early days of Spark (particularly his contributions to Spark/Cassandra integration), his interesting new open source project (FiloDB), and recent trends in cloud computing.

Bringing Apache Spark & Apache Cassandra together

DataStax credits me with inspiring them to bring Spark into Cassandra … I think they’re very generous about that. I think I was one of the first folks to talk about the possibility of bringing Cassandra and Spark together. The vision that I saw was that Cassandra was really good for real-time updates, but what if we’re able to do more analytical queries on it? Then you could combine, basically, a platform that is really good for real-time updates with analytics.

What is FiloDB?

FiloDB is an analytical database … It is a distributed columnar analytical database. … It’s distributed, meaning that, just like Cassandra, it runs on multiple nodes. You can spread out your data very easily and query it as a single entity. It is columnar, meaning that it stores your data in a format that makes it very fast for those [analytical] queries. What do I mean by that? That means you might want to find out, for example, the top products that are selling in a department for month X. These are queries that typically show up in business reporting … these kinds of queries would benefit greatly from FiloDB.
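To make the columnar idea concrete, here is a minimal sketch in plain Python of the "top products in a department for month X" query run over column-oriented data. It is an illustration only: the `sales` table, its column names, and the `top_products` helper are hypothetical, and nothing here reflects FiloDB's actual API or storage format.

```python
from collections import defaultdict

# Hypothetical sales data stored column by column: one list per field,
# with row i spread across index i of every list.
sales = {
    "department": ["toys", "toys", "garden", "toys", "garden", "toys"],
    "month": ["2015-11", "2015-12", "2015-12", "2015-12", "2015-12", "2015-12"],
    "product": ["robot", "robot", "hose", "puzzle", "rake", "robot"],
    "units": [3, 5, 2, 4, 7, 1],
}


def top_products(columns, department, month, n=2):
    """Return the n best-selling (product, units) pairs for one department and month."""
    totals = defaultdict(int)
    # Only the four columns the query needs are read. Any other columns
    # in the table would never be touched, which is the appeal of a
    # columnar layout for reporting queries.
    for dept, mon, product, units in zip(
        columns["department"],
        columns["month"],
        columns["product"],
        columns["units"],
    ):
        if dept == department and mon == month:
            totals[product] += units
    # Sort by units sold (descending), breaking ties by product name.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


result = top_products(sales, "toys", "2015-12")
print(result)
```

In a row-oriented layout the same query would have to read every field of every record; here the scan touches only the columns named in the query.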
…We’re seeing there’s really a need for something that allows you to do queries very quickly and interactively but still work with more recent data. Some more recent use cases that really motivated this were around processing, such as geospatial processing … For example, I have a location column. Let’s say that it’s IoT or something else, and I have positions or coordinates. Oftentimes, you need to annotate this data and take the position and [assign] ZIP codes and other kinds of things like that. I saw an opportunity to use columnar storage for it, but nothing that allowed me to take advantage of it very easily.
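The annotation step described here, deriving a new column from a position column, might look like the following sketch. Everything in it is an assumption made for illustration: the `positions` table, the made-up ZIP codes and bounding boxes in `ZIP_BOXES`, and the `zip_for` helper. Real ZIP code boundaries are irregular polygons, so a production lookup would use proper geospatial tooling.

```python
# Hypothetical position data in columnar form: one list per field.
positions = {
    "lat": [10.5, 20.5, 99.0],
    "lon": [10.5, 20.5, 99.0],
}

# Made-up ZIP code regions as (min_lat, min_lon, max_lat, max_lon) boxes.
ZIP_BOXES = {
    "00001": (10.0, 10.0, 11.0, 11.0),
    "00002": (20.0, 20.0, 21.0, 21.0),
}


def zip_for(lat, lon):
    """Return the code of the first box containing the point, or None."""
    for code, (min_lat, min_lon, max_lat, max_lon) in ZIP_BOXES.items():
        if min_lat <= lat < max_lat and min_lon <= lon < max_lon:
            return code
    return None


# Annotating the data only reads the two coordinate columns and
# appends one new column; the existing columns are left untouched.
positions["zip"] = [
    zip_for(lat, lon) for lat, lon in zip(positions["lat"], positions["lon"])
]
print(positions["zip"])
```

Because the result is just another column, adding it does not require rewriting the existing records.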
…One of our core messages is, look, you already have Spark and Cassandra. You ingest real-time data. Now you’re thinking how to add analytics to it. You don’t have to set up a whole complex stack involving Hadoop and a lot of extra stuff. You can simplify your stack a lot and just use what we call a “SMACK stack”: Spark, [Mesos, Akka,] Cassandra, Kafka.

Cloud computing

It’s such a [different] landscape now. … You can run everything on Amazon or Google Cloud with Dataflow. Basically, I think the industry is transitioning from: you have to build a lot of things yourself, to more of a pick and choose (like I’m going to go and see what services I can assemble and integrate all of them). … When you think about testing, and especially when you have, say, a dev cluster, a staging cluster, a production cluster, a lot of times you want to spin up things for tests, like performance. With cloud, it becomes much easier. With data centers, you often can’t find a space to do performance tests. With the cloud provider, I can just spin up a cluster.
