It’s probably not an exaggeration to say that Apache Hadoop has revolutionized data management and processing. Hadoop’s technical capabilities have made it possible for organizations across a range of industries to solve problems that were impractical with earlier technologies. These capabilities include:
- Scalable processing of massive amounts of data
- Flexibility for data processing, regardless of the format and structure (or lack of structure) in the data
Another notable feature of Hadoop is that it’s an open source project designed to run on relatively inexpensive commodity hardware. Hadoop provides these capabilities at considerable cost savings over traditional data management solutions.
This combination of technical capabilities and economics has driven rapid growth in Hadoop and the tools in the surrounding ecosystem: the vibrant Hadoop community has introduced a broad range of tools to support the management and processing of data with Hadoop.
Despite this rapid growth, Hadoop is still a relatively young technology. Many organizations are still trying to understand how Hadoop can be leveraged to solve problems, and how to apply Hadoop and associated tools to implement solutions to these problems. A rich ecosystem of tools, application programming interfaces (APIs), and development options provides choice and flexibility, but can make it challenging to determine the best choices for implementing a data processing application.
The inspiration for this book comes from our experience working with numerous customers and conversations with Hadoop users who are trying to understand how to build reliable and scalable applications with Hadoop. Our goal is not to provide detailed documentation on using available tools, but rather to provide guidance on how to combine these tools to architect scalable and maintainable applications on Hadoop.
We assume readers of this book have some experience with Hadoop and related tools. You should have a familiarity with the core components of Hadoop, such as the Hadoop Distributed File System (HDFS) and MapReduce. If you need to come up to speed on Hadoop, or need refreshers on core Hadoop concepts, Hadoop: The Definitive Guide by Tom White remains, well, the definitive guide.
The following is a list of other tools and technologies that are important to understand when using this book, including references for further reading:
- YARN
Until recently, the core of Hadoop was commonly considered to be HDFS and MapReduce. This has been changing rapidly with the introduction of additional processing frameworks for Hadoop, and the introduction of YARN accelerates the move toward Hadoop as a big-data platform supporting multiple parallel processing models. YARN provides a general-purpose resource manager and scheduler for Hadoop processing, which includes MapReduce, but also extends these services to other processing models. This facilitates the support of multiple processing frameworks and diverse workloads on a single Hadoop cluster, and allows these different models and workloads to effectively share resources. For more on YARN, see Hadoop: The Definitive Guide, or the Apache YARN documentation.
- Java
Hadoop and many of its associated tools are built with Java, and much application development with Hadoop is done with Java. Although the introduction of new tools and abstractions increasingly opens up Hadoop development to non-Java developers, having an understanding of Java is still important when you are working with Hadoop.
- SQL
Although Hadoop opens up data to a number of processing frameworks, SQL remains very much alive and well as an interface to query data in Hadoop. This is no surprise, since many developers and analysts already know SQL, so knowing how to write SQL queries remains relevant when you’re working with Hadoop. A good introduction to SQL is Head First SQL by Lynn Beighley (O’Reilly).
- Scala
Scala is a programming language that runs on the Java virtual machine (JVM) and supports a mixed object-oriented and functional programming model. Although designed for general-purpose programming, Scala is becoming increasingly prevalent in the big-data world, both for implementing projects that interact with Hadoop and for implementing applications to process data. Examples of projects that use Scala as the basis for their implementation are Apache Spark and Apache Kafka. Scala, not surprisingly, is also one of the languages supported for implementing applications with Spark. Scala is used for many of the examples in this book (a short sketch of the language follows this list), so if you need an introduction to Scala, see Scala for the Impatient by Cay S. Horstmann (Addison-Wesley Professional), or for a more in-depth overview, see Programming Scala, 2nd Edition, by Dean Wampler and Alex Payne (O’Reilly).
- Apache Hive
Speaking of SQL, Hive, a popular abstraction for modeling and processing data on Hadoop, provides a way to define structure on data stored in HDFS, as well as write SQL-like queries against this data. The Hive project also provides a metadata store, which in addition to storing metadata (i.e., data about data) on Hive structures is also accessible to other interfaces such as Apache Pig (a high-level parallel programming abstraction) and MapReduce via the HCatalog component. Further, other open source projects—such as Cloudera Impala, a low-latency query engine for Hadoop—also leverage the Hive metastore, which provides access to objects defined through Hive. To learn more about Hive, see the Hive website, Hadoop: The Definitive Guide, or Programming Hive by Edward Capriolo, et al. (O’Reilly).
- Apache HBase
HBase is another frequently used component in the Hadoop ecosystem. HBase is a distributed NoSQL data store that provides random access to extremely large volumes of data stored in HDFS. Although referred to as the Hadoop database, HBase is very different from a relational database, and requires those familiar with traditional database systems to embrace new concepts. HBase is a core component in many Hadoop architectures, and is referred to throughout this book (a brief sketch of the HBase client API also follows this list). To learn more about HBase, see the HBase website, HBase: The Definitive Guide by Lars George (O’Reilly), or HBase in Action by Nick Dimiduk and Amandeep Khurana (Manning).
- Apache Flume
Flume is an often-used component for ingesting event-based data, such as logs, into Hadoop. We provide an overview and details on best practices and architectures for leveraging Flume with Hadoop, but for more details on Flume, refer to the Flume documentation or Using Flume (O’Reilly).
- Apache Sqoop
Sqoop is another popular tool in the Hadoop ecosystem that facilitates moving data between external data stores such as a relational database and Hadoop. We discuss best practices for Sqoop and where it fits in a Hadoop architecture, but for more details on Sqoop see the Sqoop documentation or the Apache Sqoop Cookbook (O’Reilly).
- Apache ZooKeeper
The aptly named ZooKeeper project is designed to provide a centralized service to facilitate coordination for the zoo of projects in the Hadoop ecosystem. A number of the components that we discuss in this book, such as HBase, rely on the services provided by ZooKeeper, so it’s good to have a basic understanding of it. Refer to the ZooKeeper site or ZooKeeper by Flavio Junqueira and Benjamin Reed (O’Reilly).
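To give readers who are new to Scala a taste of the language before diving in, here is a minimal sketch of our own (not drawn from the book’s example code) showing the mixed object-oriented and functional style; the PageView record and the sample data are purely hypothetical:

```scala
// A simple immutable record type: the object-oriented side of Scala.
case class PageView(userId: String, url: String, durationMs: Long)

object PageViewStats {
  def main(args: Array[String]): Unit = {
    // Hypothetical sample data, standing in for records read from HDFS.
    val views = Seq(
      PageView("u1", "/home", 1200),
      PageView("u2", "/products", 3400),
      PageView("u1", "/checkout", 800)
    )

    // The functional side: transform and aggregate with higher-order functions.
    val totalByUser: Map[String, Long] = views
      .groupBy(_.userId)
      .map { case (user, userViews) => user -> userViews.map(_.durationMs).sum }

    totalByUser.foreach { case (user, total) =>
      println(s"$user spent $total ms on the site")
    }
  }
}
```

The same functional style carries over directly to Spark, where analogous transformations run in parallel across a cluster.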
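Similarly, to give a feel for HBase’s random-access model, the following is a small sketch of a single write and read using the classic HTable client API from Scala. The table name, column family, and values are hypothetical, and details of the API vary across HBase versions (Put.add, for instance, was later deprecated in favor of addColumn), so treat this as an illustration rather than a reference:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Get, HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseSketch {
  def main(args: Array[String]): Unit = {
    // Picks up cluster settings (e.g., the ZooKeeper quorum) from
    // hbase-site.xml on the classpath.
    val conf = HBaseConfiguration.create()

    // Hypothetical table "users" with a column family named "profile".
    val table = new HTable(conf, "users")

    // Write one cell: row key "user1", column profile:email.
    val put = new Put(Bytes.toBytes("user1"))
    put.add(Bytes.toBytes("profile"), Bytes.toBytes("email"),
      Bytes.toBytes("user1@example.com"))
    table.put(put)

    // Random-access read of the same row.
    val result = table.get(new Get(Bytes.toBytes("user1")))
    val email = Bytes.toString(
      result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("email")))
    println(s"email = $email")

    table.close()
  }
}
```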
As you may have noticed, the emphasis in this book is on tools in the open source Hadoop ecosystem. It’s important to note, though, that many of the traditional enterprise software vendors have added support for Hadoop, or are in the process of adding this support. If your organization is already using one or more of these enterprise tools, it makes a great deal of sense to investigate integrating these tools as part of your application development efforts on Hadoop. The best tool for a task is often the tool you already know. Although it’s valuable to understand the tools we discuss in this book and how they’re integrated to implement applications on Hadoop, choosing to leverage third-party tools in your environment is a completely valid choice.
Again, our aim for this book is not to go into details on how to use these tools, but rather, to explain when and why to use them, and to balance known best practices with recommendations on when these practices apply and how to adapt in cases when they don’t. We hope you’ll find this book useful in implementing successful big data solutions with Hadoop.
A Note About the Code Examples
Before we move on, a brief note about the code examples in this book. Every effort has been made to ensure the examples in the book are up-to-date and correct. For the most current versions of the code examples, please refer to the book’s GitHub repository at https://github.com/hadooparchitecturebook/hadoop-arch-book.
Who Should Read This Book
Hadoop Application Architectures was written for software developers, architects, and project leads who need to understand how to use Apache Hadoop and tools in the Hadoop ecosystem to build end-to-end data management solutions or integrate Hadoop into existing data management architectures. Our intent is not to provide deep dives into specific technologies—for example, MapReduce—as other references do. Instead, our intent is to provide you with an understanding of how components in the Hadoop ecosystem are effectively integrated to implement a complete data pipeline, starting from source data all the way to data consumption, as well as how Hadoop can be integrated into existing data management systems.
We assume you have some knowledge of Hadoop and related tools such as Flume, Sqoop, HBase, Pig, and Hive, but we’ll refer to appropriate references for those who need a refresher. We also assume you have experience programming with Java, as well as experience with SQL and traditional data-management systems, such as relational database-management systems.
So if you’re a technologist who’s spent some time with Hadoop, and are now looking for best practices and examples for architecting and implementing complete solutions with it, then this book is meant for you. Even if you’re a Hadoop expert, we think the guidance and best practices in this book, based on our years of experience working with Hadoop, will provide value.
This book can also be used by managers who want to understand which technologies will be relevant to their organization based on their goals and projects, in order to help select appropriate training for developers.
Why We Wrote This Book
We have all spent years implementing solutions with Hadoop, both as users and in supporting customers. In that time, the Hadoop market has matured rapidly, along with the number of resources available for understanding Hadoop. There are now many useful books, websites, classes, and more on Hadoop and the tools in the Hadoop ecosystem. However, despite all of this material, there’s still a shortage of resources for understanding how to effectively integrate these tools into complete solutions.
When we talk with users, whether they’re customers, partners, or conference attendees, we’ve found a common theme: there’s still a gap between understanding Hadoop and being able to actually leverage it to solve problems. For example, there are a number of good references that will help you understand Apache Flume, but how do you actually determine if it’s a good fit for your use case? And once you’ve selected Flume as a solution, how do you effectively integrate it into your architecture? What best practices and considerations should you be aware of to optimally use Flume?
This book is intended to bridge this gap between understanding Hadoop and being able to actually use it to build solutions. We’ll cover core considerations for implementing solutions with Hadoop, and then provide complete, end-to-end examples of implementing some common use cases with Hadoop.
Navigating This Book
The organization of chapters in this book is intended to follow the same flow that you would follow when architecting a solution on Hadoop, starting with modeling data on Hadoop, moving data into and out of Hadoop, processing the data once it’s in Hadoop, and so on. Of course, you can always skip around as needed. Part I covers the considerations around architecting applications with Hadoop, and includes the following chapters:
Chapter 1 covers considerations around storing and modeling data in Hadoop—for example, file formats, data organization, and metadata management.
Chapter 2 covers moving data into and out of Hadoop. We’ll discuss considerations and patterns for data ingest and extraction, including using common tools such as Flume, Sqoop, and file transfers.
Chapter 3 covers tools and patterns for accessing and processing data in Hadoop. We’ll talk about available processing frameworks such as MapReduce, Spark, Hive, and Impala, and considerations for determining which to use for particular use cases.
Chapter 4 expands on the discussion of processing frameworks by describing the implementation of some common use cases on Hadoop. We’ll use examples in Spark and SQL to illustrate how to solve common problems such as de-duplication and working with time series data.
Chapter 5 discusses tools to do large graph processing on Hadoop, such as Giraph and GraphX.
Chapter 6 discusses tying everything together with application orchestration and scheduling tools such as Apache Oozie.
Chapter 7 discusses near-real-time processing on Hadoop. We discuss the relatively new class of tools, such as Apache Storm and Apache Spark Streaming, that are intended to process streams of data.
In Part II, we cover the end-to-end implementations of some common applications with Hadoop. The purpose of these chapters is to provide concrete examples of how to use the components discussed in Part I to implement complete solutions with Hadoop:
Chapter 8 provides an example of clickstream analysis with Hadoop. Storage and processing of clickstream data is a very common use case for companies running large websites, but it is also applicable to applications processing any type of machine data. We’ll discuss ingesting data through tools like Flume and Kafka, cover storing and organizing the data efficiently, and show examples of processing the data.
Chapter 9 provides a case study of a fraud detection application on Hadoop, an increasingly common use of Hadoop. This example covers how HBase can be leveraged in a fraud detection solution, as well as the use of near-real-time processing.
Chapter 10 provides a case study exploring another very common use case: using Hadoop to extend an existing enterprise data warehouse (EDW) environment. This includes using Hadoop as a complement to the EDW, as well as providing functionality traditionally performed by data warehouses.
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.
This icon signifies a tip, suggestion, or general note.
This icon indicates a warning or caution.
Using Code Examples
Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/hadooparchitecturebook/hadoop-arch-book.
This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Hadoop Application Architectures by Mark Grover, Ted Malaska, Jonathan Seidman, and Gwen Shapira (O’Reilly). Copyright 2015 Jonathan Seidman, Gwen Shapira, Ted Malaska, and Mark Grover, 978-1-491-90008-6.”
If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at email@example.com.
Safari® Books Online
Technology professionals, software developers, web designers, and business and creative professionals use Safari Books Online as their primary resource for research, problem solving, learning, and certification training.
Members have access to thousands of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Technology, and hundreds more. For more information about Safari Books Online, please visit us online.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
- O’Reilly Media, Inc.
- 1005 Gravenstein Highway North
- Sebastopol, CA 95472
- 800-998-9938 (in the United States or Canada)
- 707-829-0515 (international or local)
- 707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/hadoop_app_arch_1E.
To comment or ask technical questions about this book, send email to firstname.lastname@example.org.
For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Acknowledgements
We would like to thank the larger Apache community for its work on Hadoop and the surrounding ecosystem, without which this book wouldn’t exist. We would also like to thank Doug Cutting for providing this book’s foreword, not to mention co-creating Hadoop.
There are a large number of folks whose support and hard work made this book possible, starting with Eric Sammer. Eric’s early support and encouragement were invaluable in making this book a reality. Amandeep Khurana, Kathleen Ting, Patrick Angeles, and Joey Echeverria also provided valuable proposal feedback early on in the project.
Many people provided invaluable feedback and support while we were writing this book, especially the following, who gave their time and expertise to review content: Azhar Abubacker, Sean Allen, Ryan Blue, Ed Capriolo, Eric Driscoll, Lars George, Jeff Holoman, Robert Kanter, James Kinley, Alex Moundalexis, Mac Noland, Sean Owen, Mike Percy, Joe Prosser, Jairam Ranganathan, Jun Rao, Hari Shreedharan, Jeff Shmain, Ronan Stokes, Daniel Templeton, and Tom Wheeler.
Andre Araujo, Alex Ding, and Michael Ernest generously gave their time to test the code examples. Akshat Das provided help with diagrams and our website.
Many reviewers helped us out and greatly improved the quality of this book, so any mistakes left are our own.
We would also like to thank Cloudera management for enabling us to write this book. In particular, we’d like to thank Mike Olson for his constant encouragement and support from day one.
We’d like to thank our O’Reilly editor Brian Anderson and our production editor Nicole Shelby for their help and contributions throughout the project. In addition, we really appreciate the help from many other folks at O’Reilly and beyond—Ann Spencer, Courtney Nash, Rebecca Demarest, Rachel Monaghan, and Ben Lorica—at various times in the development of this book.
Our apologies to those who we may have mistakenly omitted from this list.
Mark Grover’s Acknowledgements
First and foremost, I would like to thank my parents, Neelam and Parnesh Grover. I dedicate it all to the love and support they continue to shower on me every single day. I’d also like to thank my sister, Tracy Grover, who I continue to tease, love, and admire for always being there for me. Also, I am very thankful to my past and current managers at Cloudera, Arun Singla and Ashok Seetharaman, for their continued support of this project. Special thanks to Paco Nathan and Ed Capriolo for encouraging me to write a book.
Ted Malaska’s Acknowledgements
I would like to thank my wife, Karen, and TJ and Andrew—my favorite two boogers.
Jonathan Seidman’s Acknowledgements
I’d like to thank the three most important people in my life, Tanya, Ariel, and Madeleine, for their patience, love, and support during the (very) long process of writing this book. I’d also like to thank Mark, Gwen, and Ted for being great partners on this journey. Finally, I’d like to dedicate this book to the memory of my parents, Aaron and Frances Seidman.
Gwen Shapira’s Acknowledgements
I would like to thank my husband, Omer Shapira, for his emotional support and patience during the many months I spent writing this book, and my dad, Lior Shapira, for being my best marketing person and telling all his friends about the “big data book.” Special thanks to my manager Jarek Jarcec Cecho for his support for the project, and thanks to my team over the last year for handling what was perhaps more than their fair share of the work.