Chapter 3. Architecting a Real-Time Data Pipeline with Spark Streaming

Eric Frenkiel

Editor’s Note: At Strata + Hadoop World in Singapore, in December 2015, Eric Frenkiel (CEO and cofounder at MemSQL) presented a talk that explores modeling the smart and connected city of the future with Kafka and Spark.

Hadoop has solved the “volume” aspect of big data, but “velocity” and “variety” still need to be tackled. In-memory technology is important for addressing both. Here we’ll discuss the challenges, design choices, and architecture required to enable smarter energy systems and efficient energy consumption through a real-time data pipeline that combines Apache Kafka, Apache Spark, and an in-memory database.

What does a smart city look like? Here’s a familiar-looking vision: it’s futuristic, ultra-clean, and for some reason there are always highways that loop around buildings. But here’s the reality: almost four billion people live in cities, and unfortunately, very few cities can actually enact the advances necessary to support them.

Today, 3.9 billion people live in cities, and by 2050 that number is expected to grow by another 2.5 billion. With that much growth coming to our urban centers in the next few decades, it’s critical that we get our vision of a smart city right. We need to think about how we can design cities and use technology to help people, and ...

Analyzing Data in the Internet of Things by Ashish Thusoo
