Skip to Content
Hadoop Application Architectures
book

Hadoop Application Architectures

by Mark Grover, Ted Malaska, Jonathan Seidman, Gwen Shapira
July 2015
Intermediate to advanced
250 pages
10h 47m
English
O'Reilly Media, Inc.
Content preview from Hadoop Application Architectures

Chapter 7. Near-Real-Time Processing with Hadoop

Through much of its development, Hadoop has been thought of as a batch processing system. The most common processing pattern has been loading data into Hadoop, followed by processing of that data through a batch job, typically implemented in MapReduce. A very common example of this type of processing is the extract-transform-load (ETL) processing pipeline—loading data into Hadoop, transforming it in some way to meet business requirements, and then moving it into a system for further analysis (for example, a relational database).

Hadoop’s ability to efficiently process large volumes of data in parallel provides great benefits, but there are also a number of use cases that require more “real time” (or more precisely, near-real-time, as we’ll discuss shortly) processing of data—processing the data as it arrives, rather than through batch processing. This type of processing is commonly referred to as stream processing.

Fortunately, this need for more real-time processing is being addressed with the integration of new tools into the Hadoop ecosystem. These stream processing tools include systems like Apache Storm, Apache Spark Streaming, Apache Samza, or even Apache Flume via Flume interceptors. In contrast to batch processing systems such as MapReduce, these tools allow you to build processing flows that continuously process incoming data; these flows will process data as long as they remain running, as opposed to a batch process that ...

Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.

Read now

Unlock full access

More than 5,000 organizations count on O’Reilly

AirBnbBlueOriginElectronic ArtsHomeDepotNasdaqRakutenTata Consultancy Services

QuotationMarkO’Reilly covers everything we've got, with content to help us build a world-class technology community, upgrade the capabilities and competencies of our teams, and improve overall team performance as well as their engagement.
Julian F.
Head of Cybersecurity
QuotationMarkI wanted to learn C and C++, but it didn't click for me until I picked up an O'Reilly book. When I went on the O’Reilly platform, I was astonished to find all the books there, plus live events and sandboxes so you could play around with the technology.
Addison B.
Field Engineer
QuotationMarkI’ve been on the O’Reilly platform for more than eight years. I use a couple of learning platforms, but I'm on O'Reilly more than anybody else. When you're there, you start learning. I'm never disappointed.
Amir M.
Data Platform Tech Lead
QuotationMarkI'm always learning. So when I got on to O'Reilly, I was like a kid in a candy store. There are playlists. There are answers. There's on-demand training. It's worth its weight in gold, in terms of what it allows me to do.
Mark W.
Embedded Software Engineer

You might also like

Apache Hadoop 3 Quick Start Guide

Apache Hadoop 3 Quick Start Guide

Hrishikesh Vijay Karambelkar
Architecting HBase Applications

Architecting HBase Applications

Jean-Marc Spaggiari, Kevin O'Dell

Publisher Resources

ISBN: 9781491910313Errata Page