It’s late 2015, and I’m staring at a page of mine on my employer’s wiki, trying to think of an OKR. An OKR is something like a performance objective, a goal to accomplish paired with a way to measure whether it’s been accomplished. While my management chain defines OKRs for the company as a whole and major organizations in it, individuals define their own. We grade ourselves on them, but they do not determine how well we performed because they are meant to be aspirational, not necessary. If you meet all your OKRs, they weren’t ambitious enough.
My coworkers had already been impressed with writing that I’d done as part of my job, both in product documentation and in internal presentations, so focusing on a writing task made sense. How aspirational could I get? So I set this down.
“Begin writing a technical book! On something! That is, begin working on one myself, or assist someone else in writing one.”
Outright ridiculous, I thought, but why not? How’s that for aspirational.
Well, I have an excellent manager who is willing to entertain the ridiculous, and so she encouraged me to float the idea to someone else in our company who dealt with things like employees writing books, and he responded.
“Here’s an idea: there is no book out there about Running Hadoop in the Cloud. Would you have enough material at this point?”
I work on a product that aims to make the use of Hadoop clusters in the cloud easier, so it was admittedly an extremely good fit. It didn’t take long at all for this ember of an idea to catch, and the end result is the book you are reading right now.
Between the twin subjects of Hadoop and the cloud, there is more than enough to write about. Since there are already plenty of good Hadoop books out there, this book doesn’t try to duplicate them; it assumes that you are already familiar with running Hadoop. The details of configuring Hadoop clusters are covered only as needed to get clusters up and running. You can apply your prior Hadoop knowledge with great effectiveness to clusters in the cloud, and much of what other Hadoop books cover still applies.
It is not assumed, however, that you are familiar with the cloud. Perhaps you’ve dabbled in it, spun up an instance or two, read some documentation from a provider. Perhaps you haven’t even tried it at all, or don’t know where to begin. Readers with next to no knowledge of the cloud will find what they need to get rolling with their Hadoop clusters. Often, someone is tasked by their organization with “moving stuff to the cloud,” and neither the tasker nor the tasked truly understands what that means. If this describes you, this book is for you.
DevOps engineers, system administrators, and system architects will get the most out of this book, since it focuses on constructing clusters in a cloud provider and interfacing with the provider’s services. Software developers should also benefit from it; even if they do not build clusters themselves, they should understand how clusters work in the cloud so they know what to ask for and how to design their jobs.
Besides having a good grasp of Hadoop concepts, you should have a working knowledge of the Java programming language and the Bash shell, or similar languages. Being able to read them should suffice, although the Bash scripts do not shy away from advanced shell features. Code examples are limited to those two languages.
Before working on your clusters, you will need credentials for a cloud provider. The first two parts of the book do not require a cloud account to follow along, but the later hands-on parts do. Your organization may already have an account with a provider, and if so, you can seek your own account within that to work with. If you are on your own, you can sign up for a free trial with any of the cloud providers this book covers in detail.
As stated previously, this book does not delve into Hadoop details more than necessary. A seasoned Hadoop administrator may notice that configurations are not necessarily optimal, and that clusters are not tuned for maximum efficiency. This information is left out for brevity, so as not to duplicate content in books that focus only on Hadoop. Many of the principles for Hadoop maintenance apply to cloud clusters just as well as ordinary ones.
The core Hadoop components of HDFS and YARN are covered here, along with other important components such as ZooKeeper, Hive, and Spark. This doesn’t imply at all that other components won’t work in the cloud; there are simply so many components that, due to space considerations, not all could be included.
A limited set of popular cloud providers is covered in this book: Amazon Web Services, Google Cloud Platform, and Microsoft Azure. There are other cloud providers, both publicly available and deployed privately, but they are not included. The ones that were chosen are the most popular, and you should find that their concepts transfer over rather directly to those in other providers. Even so, each provider does things a little, or a lot, differently from its peers. All three providers are covered equally when it comes to getting you up and running, but beyond that, only Amazon Web Services is fully considered, since it is the dominant choice at this time. Brief summaries of equivalent procedures in the other providers are given to get you started with them.
Overall, between Hadoop and the cloud, there is just so much to write about. What’s more, cloud providers introduce new services and revamp older services all the time, and it can be challenging to keep up even when you work in the cloud every day. This book attempts to stick with the most vital, core Hadoop components and cloud services to be as relevant as possible in this fast-changing world. Understanding them will serve you well when integrating new features into your clusters in the future.
Part I starts off this book by asking why you would host Hadoop clusters in a cloud provider, and briefly introduces the providers this book looks at. Part II describes the common concepts of cloud providers, like instances and virtual networks. If you are already familiar with a cloud provider or two, you might skim or skip these parts.
Part III begins the hands-on portion of this book, where you build out a Hadoop cluster in one of the cloud providers. There is a chapter for the unique steps needed by each provider, and a common chapter for bringing up a cluster and seeing it in action. Later parts of the book use this first cluster as a launching point for more.
If you are interested in making an even more capable cluster, Part IV can help you. It covers adding high availability and installing Hive and Spark. You can try any combination of the enhancements, and learn even more about the ramifications of running in a cloud provider.
Finally, Part V looks at patterns and practices for running cloud clusters well, from designing for price and security to dealing with maintenance. Those first starting out in the cloud may not need the guidance in this part, but as usage ramps up, it becomes much more important.
Here are the versions of Hadoop components used in this book. All are distributed through Apache:
Apache Hadoop 2.7.2
Apache ZooKeeper 3.4.8
Apache Hive 2.1.0
Apache Spark 1.6.3 and 2.0.2
Code examples require:
Cloud providers update their services continually, and so determining the exact “versions” used for them is not possible. Most of the work in the book was performed during 2016 with the services as they existed at that time. Since then, service web interfaces may have changed and workflows may have been altered.
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.
This element signifies a tip or suggestion.
This element signifies a general note.
This element indicates a warning or caution.
Many of the examples throughout this book include IP addresses, usually for cluster nodes. The example IP addresses are drawn from the address ranges reserved for documentation in RFC 5737. They should never correspond to a real host anywhere on the internet or within private networks. Change them as needed when using the examples in your work.
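As a rough illustration of what "reserved" means here, the following Java sketch checks whether an IPv4 address falls inside one of the three documentation blocks that RFC 5737 defines (192.0.2.0/24, 198.51.100.0/24, and 203.0.113.0/24). The class and method names are hypothetical and are not part of this book's supplemental code; a check like this might help you spot example addresses that were copied into a real configuration without being changed.

```java
/**
 * Hypothetical helper, not part of the book's example code: reports whether
 * an IPv4 address lies in one of the RFC 5737 documentation ranges.
 */
final class Rfc5737Check {

    // The three /24 blocks reserved for documentation by RFC 5737,
    // written as dotted prefixes that cover the first three octets.
    private static final String[] DOC_PREFIXES = {
        "192.0.2.", "198.51.100.", "203.0.113."
    };

    /** Returns true if the address is a well-formed member of a documentation block. */
    static boolean isDocumentationAddress(String address) {
        for (String prefix : DOC_PREFIXES) {
            if (address.startsWith(prefix)) {
                // Whatever follows the prefix must be a valid final octet.
                return isOctet(address.substring(prefix.length()));
            }
        }
        return false;
    }

    /** Returns true if the string is one to three ASCII digits with a value of 0 to 255. */
    private static boolean isOctet(String s) {
        if (s.isEmpty() || s.length() > 3) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return Integer.parseInt(s) <= 255;
    }

    public static void main(String[] args) {
        // Print a verdict for each address given on the command line.
        for (String address : args) {
            System.out.println(address + " -> " + isDocumentationAddress(address));
        }
    }
}
```

Matching on string prefixes is enough here only because all three blocks are /24 networks; ranges with other prefix lengths would need numeric comparison of the address bits instead.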
Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/bhavanki/moving-hadoop-to-the-cloud.
This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Moving Hadoop to the Cloud by Bill Havanki (O’Reilly). Copyright 2017 Bill Havanki Jr., 978-1-491-95963-3.”
If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at email@example.com.
Safari (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals.
Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including O’Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others.
For more information, please visit http://oreilly.com/safari.
Please address comments and questions concerning this book to the publisher:
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://www.oreilly.com/catalog/0636920051459.
To comment or ask technical questions about this book, send email to firstname.lastname@example.org.
For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
I’m well aware that barely anyone reads the acknowledgments in a book, especially a technical one like this. So, for those few of you who are reading this right now, well, first, I’d like to thank you for your diligence, not to mention your attention and support in the first place. Truly, thanks for spending time and/or money on what I’ve written here, and I hope it helps you.
Thank you to everyone who’s helped to build up the amazing Apache Hadoop ecosystem, from its founders to its committers to its contributors to its users, for showing us a new way of computing. Thank you also to everyone who’s built and maintained the amazing cloud provider services, for showing us another new way of computing and empowering the rest of us to use it.
This book would be worse off without its reviewers: Jesse Anderson, Jenny Kim, Don Miner, Alex Moundalexis, and those who went unnamed or whom I’ve forgotten. They each applied their expertise, experience, and attention to detail to their feedback, filling in where I left important information out and correcting what I got wrong. I also owe thanks to Misha Brukman and the Google Cloud Platform team for looking over Chapter 7. My editors, Marie Beaugureau and Colleen Toporek, did a wonderful job of shepherding the writing process and giving feedback on organization, formatting, writing flow, and lots of other details. Finally, extra thanks are due to Alex Moundalexis for writing the foreword.
One of my favorite aphorisms is by Laozi: “A good traveler has no fixed plans and is not intent on arriving.” I’ve arrived at the destination of authoring a book, but no one observing my travel, including me, could have guessed that I’d get here. The road has wound through a career with a few different employers and a few more projects, and I was privileged to walk alongside a truly wonderful collection of coworkers and friends along the way. I owe them all my gratitude for their company and for their roles in my journey.
I owe special thanks, of course, to my current employer, Cloudera, for the opportunity to create this book and the endorsement of the effort. I specifically want to thank Vinithra Varadharajan, my manager for the past few years, for her unwavering faith in and promotion of my writing effort; and also Justin Kestelyn, who got the ball rolling between me, my employer, and O’Reilly. My teammates past and present on my current project have all played a part in helping me learn about the cloud and have contributed their thoughts and opinions, for which I’m grateful: John Adair, Asif Arman, Cagdas Bayram, Jayita Bhojwani, Michael Cudahy, Xiaohua Guo, David Han, Joe Heyming, Ying Li, Andrei Savu, Fahd Siddiqui, and Michael Wilson.
Finally, I must thank my family, including my parents and in-laws for their encouragement, my daughters Samantha and Lydia, and especially my wife Kathy.[1] They have been constantly supportive of me during the long effort it’s taken to write this book, and excited for it to be one of my accomplishments. I love them all very much.
[1] Te amo et semper amabo.