Learning Real-time Processing with Spark Streaming by Sumit Gupta

Chapter 3. Processing Distributed Log Files in Real Time

Today, large amounts of valuable data are stored in repositories distributed across large-scale networks and accessed over the Web. The key challenge is to provide a distributed, efficient, scalable, extensible, and fault-tolerant system that can manipulate this data easily, safely, and with high performance.

There is no doubt that systems like Hadoop brought about a revolution and provided a framework for processing data at Internet scale: large, distributed datasets measured in terabytes or petabytes. But Hadoop was primarily meant for data-intensive operations, where it processed data over clusters of nodes and devoted most of its processing time to I/O and data manipulation.

Apache Spark took distributed ...
