Building systems for massive scale data applications

The O’Reilly Data Show podcast: Tyler Akidau on the evolution of systems for bounded and unbounded data processing.

By Ben Lorica November 5, 2015 • 00:39:08 listen

In this episode of the O’Reilly Data Show, I sat down with Tyler Akidau, one of the lead engineers working on Google’s streaming and Dataflow technologies. He recently wrote an extremely popular article that provided a framework for how to think about bounded and unbounded data processing (a follow-up article is due out soon). We talked about the evolution of stream processing, the challenges of building systems that scale to massive data sets, and the recent surge in interest in all things real time:

On the need for MillWheel: A new stream processing engine

At the time [that MillWheel was built], there was, as far as I know, literally nothing externally that could handle the scale that we needed to handle. A lot of the existing streaming systems didn’t focus on out-of-order processing, which was a big deal for us internally. Also we really wanted to hit a strong focus on consistency — being able to get absolutely correct answers. … All three of these things were lacking in at least some area in [the systems we examined].
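To make "out-of-order processing" concrete, here is a minimal sketch in Python. It is not MillWheel code; the function name `window_counts`, the 60-second window, and the sample events are assumptions made for illustration. Each record is grouped by the time the event occurred rather than the time it arrived, so late or shuffled records still land in the correct window.

```python
from collections import defaultdict

def window_counts(events, window_size):
    """Count events per fixed event-time window, ignoring arrival order.

    `events` is an iterable of (event_time_seconds, payload) pairs.
    The window an event belongs to depends only on its own timestamp.
    """
    counts = defaultdict(int)
    for event_time, _payload in events:
        # Round the timestamp down to the start of its window.
        window_start = (event_time // window_size) * window_size
        counts[window_start] += 1
    return dict(counts)

# Hypothetical records, listed in the order they arrived (not the order
# in which they happened).
arrivals = [(12, "a"), (3, "b"), (61, "c"), (59, "d"), (7, "e")]

out_of_order = window_counts(arrivals, window_size=60)
in_order = window_counts(sorted(arrivals), window_size=60)
```

Both calls produce the same per-window counts, which is the property a system needs if it is to give correct answers when data arrives late. A real engine also has to decide when a window is complete, which this sketch leaves out.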

The Dataflow model

There are two projects that we say Dataflow came out of. The FlumeJava project, which, for anybody who is not familiar, is a higher level language for describing large-scale, massive-scale data processing systems and then running it through an optimizer and coming up with an execution plan. … We had all sorts of use cases at Google where people were stringing together these series of MapReduce [jobs]. It was complex and difficult to deal with, and you had to try to manually optimize them for performance. If you do what the database folks have done, [you] run it through an optimizer.
…
Flume is the primary data processing system, so as part of that for the last few years, we’ve been moving MillWheel to be essentially a secondary execution engine for FlumeJava. You can either do it on batch mode and run on MapReduce or you can execute it on MillWheel. … FlumeJava plus MillWheel — it’s this evolution that’s happened internally, and now we’ve externalized it.
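As a rough illustration of why an optimizer helps with chains of hand-written jobs, the Python sketch below compares running each element-wise step as its own pass over the data with running one fused pass. It does not use FlumeJava's API; `run_unfused`, `fuse`, `run_fused`, and the sample steps are hypothetical names, and fusing adjacent steps is only one of the transformations a real planner might apply.

```python
def run_unfused(data, stages):
    """Run each stage as a separate pass, materializing every intermediate."""
    passes = 0
    for stage in stages:
        data = [stage(x) for x in data]  # one full pass per stage
        passes += 1
    return data, passes

def fuse(stages):
    """Combine a chain of element-wise stages into a single function."""
    def fused(x):
        for stage in stages:
            x = stage(x)
        return x
    return fused

def run_fused(data, stages):
    """Run the whole chain in one pass with no intermediate collections."""
    fused = fuse(stages)
    return [fused(x) for x in data], 1

# Hypothetical three-step pipeline.
steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
records = [1, 2, 3, 4]

unfused_out, unfused_passes = run_unfused(records, steps)
fused_out, fused_passes = run_fused(records, steps)
```

The outputs match, but the fused version touches the data once instead of three times. The point of an optimizer is that the pipeline author writes the readable three-step form and the system derives the cheaper plan.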

Balancing correctness, latency, and cost

There’s a wide variety of use cases out there. Sometimes you need high correctness; sometimes you don’t; sometimes you need low latency; sometimes higher latency is okay. Sometimes you’re willing to pay a lot for those other two features; sometimes you don’t want to pay as much. The real key, at least as far as having a system that is broadly applicable, is being able to be flexible and give people the choices to make the trade-offs they have to make.
…
There is a single knob which is, which runner am I going to use: batch or streaming? Aside from that, the other level at which you get to make these choices is when you’re deciding exactly when you materialize your results within the pipeline.
…
Once you have a streaming system or streaming execution engine that gives you this automatic-scaling, like Dataflow does, and it gives you consistency and strong tools for working with your data, then people start to build these really complicated services on them. It may not just be data processing. It actually becomes a nice platform for orchestrating events or orchestrating distributed state machines and things like that. We have a lot of users internally doing this stuff.
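The choice of when to materialize results can be sketched in a few lines of Python. This is an illustration only, not Dataflow's API: `running_sums` and its `emit_every` parameter are assumed names. Emitting after every record gives the lowest latency at the cost of the most output, while emitting once at the end is cheapest but slowest to show anything.

```python
def running_sums(values, emit_every):
    """Return the sums that would be emitted when firing every N records.

    A final result is always emitted once the input is exhausted, so the
    last element of the returned list is the complete sum (if any input).
    """
    total = 0
    emitted = []
    for count, value in enumerate(values, start=1):
        total += value
        if count % emit_every == 0:
            emitted.append(total)  # early or on-time result
    if len(values) % emit_every != 0:
        emitted.append(total)  # flush the remainder as the final result
    return emitted

data = [1, 2, 3, 4, 5]
low_latency = running_sums(data, emit_every=1)  # a result per record
low_cost = running_sums(data, emit_every=5)     # a single final result
middle = running_sums(data, emit_every=2)       # something in between
```

All three settings end on the same final answer; they differ only in how many intermediate results are produced along the way, which is the latency versus cost trade-off described above.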

Related resources:

Tyler Akidau will be speaking at Strata+Hadoop World Singapore: The evolution of massive scale data processing
The world beyond batch: Streaming 101
Specialized and hybrid data processing and management engines
How intelligent data platforms are powering smart cities

