If you’re reading this book, it will come as no surprise that we are in the middle of a revolution in the way data is stored and processed in the enterprise. As anyone who has been in IT for any length of time knows, the technologies and approaches behind data processing and storage are always evolving. However, in the past 10 to 15 years, the pace of change has been remarkable. We have moved from a world where almost all enterprise data was processed and analyzed using variants of SQL and was contained in some form of relational database to one in which an enterprise’s data may be found in a variety of so-called NoSQL storage engines. Each of these engines sacrifices some constraint of the relational model to achieve superior performance and scalability for a certain use case. The modern data landscape includes nonrelational key-value stores, distributed filesystems, distributed columnar databases, log stores, and document stores, in addition to traditional relational databases. The data in these systems is exploited in a multitude of ways and is processed using distributed batch-processing algorithms, stream processing, massively parallel processing query engines, free-text searches, and machine learning pipelines.
There are many drivers for this transformation, but the predominant ones are:
The phrase big data has been so overused that it retains little value, but the sheer volume of data generated by today’s enterprises, especially those with a heavy web presence—which is to say all enterprises—is staggering. The explosion of data from edge computing and Internet of Things (IoT) devices will only add to the volume. Although storing data in as granular a form as possible may not seem immediately useful, this will become increasingly important in order to derive new insights. Storage is cheap, and bad decisions that have lasting consequences are costly. Better to store in full fidelity with a modern data platform and have the option to make a new decision later. Traditional architectures based on relational databases and shared file storage are simply unable to store and process data at these scales. This has led directly to the development of new tools and techniques in which computations are linearly scalable and distributed by default.
Gone are the days in which data for analytics would arrive in nice, neat daily batches. Although this still happens for some datasets, increasingly data arrives in a streaming fashion at high rates. The velocity of its generation demands a new way of storing, processing, and serving it up.
New insights and new models feed off data—the more the better. Hitherto untapped sources of data, perhaps in semi-structured or completely unstructured forms, are increasingly in demand. All aspects of an enterprise’s operation are relevant and potentially valuable sources of information to drive new insights and, ultimately, revenue. It’s essential to have a single, unified platform with technologies capable of storing and processing all these many and varied forms of data.
The enterprises that will succeed in the data age are the ones building new business strategies and products and, crucially, making decisions based on the insights gleaned from new data sources. To make the right data-driven decisions, you need a solid data and computation platform. Such a platform needs to be capable of embracing both on-premises and cloud deployments. It also needs to scale to support traditional data analytics and to enable advances in your business from data science, machine learning, and artificial intelligence (AI).
We have only just begun our exploration of Hadoop in the enterprise, but it is worth dispelling some common misconceptions about data platforms and Hadoop early on:
Although it is true that many technologies in the Hadoop ecosystem have more flexible notions of schemas and do not impose schemas as strictly as, say, a relational database, it is a mistake to think that data stored in Hadoop clusters does not need a defined schema. Applications using data stored in Hadoop still need to understand the data they are querying, and there is always some sort of underlying data model or structure, either implicit or explicit. What the Hadoop ecosystem does offer is much more flexibility in the way data is structured and queried. Instead of imposing a globally fixed schema on the data as it is ingested and potentially dropping any fields that don’t match the schema, the data gets its structure from the frameworks and applications using it. This concept is often referred to as schema on read. You can store any type of data in its raw form and then process, transform, and combine it with other sources into the best format and structure for your use case. And if you get it wrong, you can always build a new representation from the raw data.
This is a very common mistake when thinking about modern data platforms. Different use cases require different access patterns, and this often means storing the same datasets in different ways using different storage engines. This is a logical consequence of the various optimizations each storage engine provides. This data duplication should be considered par for the course and embraced as a fundamental aspect of the freedom of operating in the Hadoop ecosystem. Hadoop platforms are designed to be horizontally scalable and to be orders of magnitude cheaper (if your enterprise IT department has a sensible approach to procurement, that is) than the proprietary alternatives. But the money you save on storage is just one aspect—maybe not even the most important aspect—of moving to a modern data platform. What it also brings you is a multitude of options for processing and querying the data and for extracting new value through scalable analytics and machine learning.
In the initial excitement of moving to Hadoop, the notion of a single, all-encompassing data lake arose, in which all data was stored in and all processing and querying were performed on a single cluster, which consisted of potentially many thousands of machines. Although Hadoop is certainly capable of scaling to that number of servers, the variety of access patterns and modes of processing data don’t necessarily mesh well in a single cluster. Colocating use cases that require strict query completion time guarantees with other ad hoc, variable workloads is likely to lead to an unsatisfactory experience. Multitenancy controls do exist, but they can’t change the fact that a finite set of resources can’t satisfy all requirements all the time. As a result, you should plan for multiple clusters serving different use cases with similar processing patterns or service levels. Don’t go too far the other way, though. Lots of small clusters can be just as bad as a “single cluster to rule them all.” Clusters can and should be shared, but be prepared to divide and conquer when necessary.
The trends in industry are clear to see. Many, if not most, enterprises have already embarked on their data-driven journeys and are making serious investments in hardware, software, and services. The big data market is projected to continue growing apace, reaching somewhere in the region of $90 billion of annual revenue by 2025. Related markets, such as deep learning and artificial intelligence, that are enabled by data platforms are also set to see exponential growth over the next decade.
The move to Hadoop, and to modern data platforms in general, has coincided with a number of secular trends in enterprise IT, a selection of which are discussed here. Some of these trends are directly caused by the focus on big data, but others are a result of a multitude of other factors, such as the desire to reduce software costs, consolidate and simplify IT operations, and dramatically reduce the time to procure new hardware and resources for new use cases.
This trend is already well established. It is now generally accepted that, for storage and data processing, the right way to scale a platform is to do so horizontally using distributed clusters of commodity (which does not necessarily mean the cheapest) servers rather than vertically with ever more powerful machines. Although some workloads, such as deep learning, are more difficult to distribute and parallelize, they can still benefit from plenty of machines with lots of cores, RAM, and GPUs, and the data to drive such workloads will be ingested, cleaned, and prepared in horizontally scalable environments.
Although proprietary software will always have its place, enterprises have come to appreciate the benefits of placing open source software at the center of their data strategies, with its attendant advantages of transparency and data freedom. Increasingly, companies—especially public sector agencies—are mandating that new projects are built with open source technologies at their core.
We have reached a tipping point in the use of public cloud services. These services have achieved a level of maturity in capability and security where even regulated industries, such as healthcare and financial services, feel comfortable running a good deal of their workloads in the cloud. Cloud solutions can have considerable advantages over on-premises solutions, in terms of agility, scalability, and performance. The ability to count cloud usage against operational—rather than capital—expenditure, even if the costs can be considerable over the long run, is also a significant factor in its adoption. But while the use of public cloud services is growing and will continue to do so, it is unlikely to become all-encompassing. Some workloads will need to stay in traditional on-premises clusters or private clouds. In the current landscape, data platforms will need to be able to run transparently on-premises, in the public cloud, and in private cloud deployments.
There are many exciting developments being made in cloud-based deployments, particularly around new ways of deploying and running frameworks using containerization and orchestration technologies such as Docker and Kubernetes. Since they are not yet widely adopted within enterprises, and since best practices and deployment patterns are still emerging, we do not cover these technologies in great detail in this book, but we recommend closely following developments in this space.
The desire to decouple compute from storage is strongly related to the move to cloud computing. In Hadoop's first few years, when high-throughput networking was relatively rare and many data use cases were limited by disk bandwidth, Hadoop clusters almost exclusively employed direct-attached storage (for good reason, as we’ll see in future chapters). However, the migration of many workloads to the public cloud has opened up new ways of interacting with persistent data that take advantage of highly efficient networked storage systems, to the extent that compute and storage can be scaled independently for many workloads. This means that the data platform of the future will need to be flexible in how and from where it allows data to be accessed, since data in storage clusters will be accessed by both local and remote compute clusters.
As we discussed writing this book, we gave serious thought to the title. If you saw the early drafts, you’ll know it originally had a different title: Hadoop in the Enterprise. But the truth is, these clusters are about much more than the Hadoop Distributed File System (HDFS), Yet Another Resource Negotiator (YARN), and MapReduce. Even though it is still common to refer to these platforms as Hadoop clusters, what we really mean is Hadoop, Hive, Spark, HBase, Solr, and all the rest. The modern data platform consists of a multitude of technologies, and splicing them together can be a daunting task.
You may also be wondering why we need yet another book about Hadoop and the technologies that go around it. Aren’t these things already well—even exhaustively—covered in the literature, blogosphere, and conference circuit? The answer is yes, to a point. There is no shortage of material out there covering the inner workings of the technologies themselves and the art of engineering data applications and applying them to new use cases. There is also some material for system administrators about how to operate clusters. There is, however, much less content about successfully integrating Hadoop clusters into an enterprise context.
Our goal in writing this book is to equip you to successfully architect, build, integrate, and run modern enterprise data platforms. Our experience providing professional services for Hadoop and its associated services over the past five or more years has shown that there is a major lack of guidance for both the architect and the practitioner. Undertaking these tasks without a guiding hand can lead to expensive architectural mistakes, disappointing application performance, or a false impression that such platforms are not enterprise-ready. We want to make your journey into big data in general, and Hadoop in particular, as smooth as possible.
We cover a lot of ground in this book. Some sections are primarily technical, while others discuss practice and architecture at a higher level. The book can be read by anyone who deals with Hadoop as part of their daily job, but we had the following principal audiences in mind when we wrote the book:
Those whose job is making sure all aspects of the Hadoop cluster integrate and gel with the other enterprise systems and who must ensure that the cluster is operated and governed according to enterprise standards (Chapters 1–4, 6–7, and 9–18)
Developers and architects designing the next generation of data-driven applications who want to know how best to fit their code into Hadoop and to take advantage of its capabilities (Chapters 1–2, 9–13, and 17–18)
Those who are tasked with operating and monitoring clusters and who need to have an in-depth understanding of how the cluster components work together and how they interact with the underlying hardware and external systems (Chapters 1, 3, 4, and 6–18)
We’ve noted particularly relevant chapters, but readers should not feel limited by that selection. Each chapter contains information of interest to each audience.
This book is about all things architecture. We’ve split it up into three parts. In Part I, we establish a solid foundation for clusters by looking at the underlying infrastructure. In Part II, we look at the platform as a whole and at how to build a rock-solid cluster that integrates smoothly with external systems. Finally, in Part III, we cover the important architectural aspects of running Hadoop in the cloud. We begin with a technical primer for Hadoop and the ecosystem.
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.
This element signifies a tip or suggestion.
This element signifies a general note.
This element indicates a warning or caution.
Safari (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals.
Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including O’Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others.
For more information, please visit http://oreilly.com/safari.
Please address comments and questions concerning this book to the publisher.
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/architectingModernDataPlatforms.
To comment or ask technical questions about this book, send email to email@example.com.
For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
The main goal of this book is to help our readers succeed in enterprise Hadoop integration. This required us to go beyond technical facts and specifications to give actionable advice, which is, essentially, an account of how big data is done in enterprise IT. This would have been completely impossible without the help of many experienced individuals who have long practiced big data, among them many of our current and former colleagues, clients, and other industry experts. We feel privileged that we could rely on their knowledge and experience when we reached the limit of our own.
Thank you Jörg Grauvogl, Werner Schauer, Dwai Lahiri, and Travis Campbell for providing so much feedback and best practices on networks, private clouds, and datacenter design. We would also like to thank Stefan Letz and Roelf Zomerman for patiently discussing and answering our many questions regarding public clouds. A special shout-out goes to Andrew Wang for helping us extensively on the ins and outs of HDFS erasure coding and its capabilities for zero-copy reads! Further thanks go to Dominik Meyer, Alexis Moundalexis, Tristan Stevens, and Mubashir Kazia.
We also must thank the amazing team at O’Reilly: Marie Beaugureau, Nicole Tache, and Michele Cronin—thank you so much for your relentless push and supervision. Without you, we’d be lost in space. Additional thanks to Kristen Brown, Colleen Cole, Shannon Wright, and Nick Adams.
Our deepest obligation is to our reviewers: David Yahalom, Frank Kane, Ryan Blue, Jesse Anderson, Amandeep Khurana, and Lars Francke. You invested much of your valuable time to read through this work and to provide us with invaluable feedback, regardless of the breadth of subjects.
Now for our individual acknowledgments:
For supporting me throughout the entire process with both time and encouragement, I am extremely grateful to my employer, Cloudera, and, in particular, to Hemal Kanani and my colleagues in New York, Jeremy Beard, Ben Spivey, and Jeff Shmain, for conversations and banter. Thanks also to Michael Ernest for providing much advice on “verbal stylings.”
As with many things in life, writing a book is always more work than expected, but it has been a rare privilege to be able to work with my fellow authors, Jan, Paul, and Lars. Thanks for the reviews, discussions, and all the hard work you have put in—and for the camaraderie. It’s been fun.
Finally—and most importantly—I want to thank my wonderful family, Jenna, Amelia, and Sebastian. Thank you for letting me embark on this project, for your unfailing love, support, and encouragement throughout the long process, and for never uttering a word of complaint about the lost evenings, weekends, and holidays—not even when you found out that, despite its cover, the book wasn’t about birds. This book is for you.
For Dala, Ilai, Katy, and Andre. Thank you for believing in me.
I would also like to express my gratitude to my fellow authors: Ian, Paul, and Lars—we went through thick and thin, we learned a lot about who we are, and we managed to keep our cool. It is an honor for me to work with you.
To my family, Sarah, Tom, and Evie: thank you. Writing this book has been a rare privilege, but, undoubtedly, the greatest sacrifice to enable it has been yours. For that, for your patience, and for your support, I am truly grateful.
I’m also incredibly grateful to my coauthors, Jan, Ian, and Lars. I have no doubt that this book would be greatly diminished without your contributions—and not just in word count. Your friendship and camaraderie mean a lot to me.
Finally, this is also a rare opportunity to thank the wider supporting cast of thousands. To all of my friends, teachers, lecturers, customers, and colleagues: each of you has played a significant role in my life, shaping my thinking and understanding—even if you’re not aware of it. Thank you all!
This is for my loving family, Katja, Laura, and Leon. Thank you for sticking with me, even if I missed promises or neglected you in the process—you are the world to me.
Thank you also to my coauthors, Doc Ian, Tenacious Jan, and “Brummie” Paul, who are not just ex-colleagues of mine but also friends for life. You made this happen, and I am grateful to be part of it.
And to everyone at O’Reilly for their patience with us, the reviewers for their unwavering help, and all the people behind Hadoop and big data who built this ecosystem: thank you. We stand tall on your shoulders.
Despite its vast complexity, the Hadoop ecosystem has facilitated a rapid adoption of distributed systems into enterprise IT. We are thrilled to be a part of this journey among the fine people who helped us to convey what we know about this field. Enterprise big data has reached cruising altitude, but, without a doubt, innovation in data processing software frameworks, the amount of data, and its value will continue to soar beyond our imagination today.
This is just the beginning!