Many of the ideas that underpin the Apache Hadoop project are decades old. Academia and industry have been exploring distributed storage and computation since the 1960s. The entire tech industry grew out of government and business demand for data processing, and at every step along that path, the data seemed big to the people of the time. Even some of the most advanced and interesting applications go way back: machine learning, a capability that's new to many enterprises, traces its origins to academic research in the 1950s and to practical systems work in the 1960s and 1970s.
But real, practical, useful, massively scalable, and reliable systems simply could not be found, at least not cheaply, until Google confronted the problem of the internet in the late 1990s and early 2000s. Collecting, indexing, and analyzing the entire web was impossible using the commercially available technology of the time.
Google dusted off decades of research in large-scale systems. Its architects realized that, for the first time ever, the computers and networking they required could be had at reasonable cost.
Its work—on the Google File System (GFS) for storage and on the MapReduce framework for computation—created the big data industry.
This work led to the creation of the open source Hadoop project in 2005 by Mike Cafarella and Doug Cutting. The fact that the software was easy to get, and could be improved and extended by a global developer community, made it attractive to a wide audience. At first, other consumer internet companies used the software to follow Google’s lead. Quickly, though, traditional enterprises noticed that something was happening and looked for ways to get involved.
In the decade-plus since the Hadoop project began, the ecosystem has exploded. Once, the only storage system was the Hadoop Distributed File System (HDFS), based on GFS. Today, HDFS is thriving, but there are plenty of other choices: Amazon S3 or Microsoft Azure Data Lake Store (ADLS) for cloud storage, for example, or Apache Kudu for IoT and analytic data. Similarly, MapReduce was originally the only option for analyzing data. Now, users can choose among MapReduce, Apache Spark for stream processing and machine learning workloads, SQL engines like Apache Impala and Apache Hive, and more.
All of these new projects have adopted the fundamental architecture of Hadoop: large-scale, distributed, shared-nothing systems, connected by a good network, working together to solve the same problem. Hadoop is the open source progenitor, but the big data ecosystem built on it is vastly more powerful—and more useful—than the original Hadoop project.
That explosion of innovation means big data is more valuable than ever before. Enterprises are eager to adopt the technology. They want to predict customer behavior, foresee failure of machines on their factory floors or trucks in their fleets, spot fraud in their transaction flows, and deliver targeted care—and better outcomes—to patients in hospitals.
But that innovation, so valuable, also confounds them. How can they keep up with the pace of improvement, and the flurry of new projects, in the open source ecosystem? How can they deploy and operate these systems in their own datacenters, meeting the reliability and stability expectations of users and the requirements of the business? How can they secure their data and enforce the policies that protect private information from cyberattacks?
Mastering the platform in an enterprise context raises new challenges that run deep in the data. We have been able to store and search a month's worth of data, or a quarter's, for a very long time. Now, we can store and search a decade's worth, or a century's. That large quantitative difference turns into a qualitative one: what new applications can we build when we can think about a century of data?
The book before you is your guide to answering those questions as you build your enterprise big data platform.
This book's authors, Jan, Ian, Lars, and Paul, are hands-on practitioners in the field, with many years of experience helping enterprises get real value from big data. They are not only users of Hadoop, Impala, Hive, and Spark but also active participants in the open source community, helping to shape those projects and their capabilities for enterprise adoption. They are experts in the analytic, data processing, and machine learning capabilities that the ecosystem offers.
When technology moves quickly, it’s important to focus on techniques and ideas that stand the test of time. The advice here works for the software—Hadoop and its many associated services—that exists today. The thinking and design, though, are tied not to specific projects but to the fundamental architecture that made Hadoop successful: large-scale, distributed, shared-nothing software requires a new approach to operations, to security, and to governance.
You will learn those techniques and those ideas here.