17.5 Hadoop
The next several sections show how Apache Hadoop and Apache Spark deal with big-data storage and processing challenges via huge clusters of computers, massively parallel processing, Hadoop MapReduce programming and Spark in-memory processing techniques. Here, we discuss Apache Hadoop, a key big-data infrastructure technology that also serves as the foundation for many recent advancements in big-data processing and an entire ecosystem of software tools that are continually evolving to support today’s big-data needs.
17.5.1 Hadoop Overview
When Google was launched in 1998, the amount of online data was already enormous, with approximately 2.4 million websites [20]: truly big data. Today there are nearly two billion websites [21] (almost ...