Introduction

Overview: Big Data’s Big Journey to the Cloud

It all started with the data. There was too much of it. Too much to process in a timely manner. Too much to analyze. Too much to store cost effectively. Too much to protect. And yet the data kept coming. Something had to give.

We generate 2.5 quintillion bytes of data each day (one quintillion is one thousand quadrillion, and a quadrillion is one thousand trillion). A NASA mathematician puts it like this: “1 million seconds is about 11.5 days, 1 billion seconds is about 32 years, while a trillion seconds is equal to 32,000 years.” By that yardstick, one quadrillion seconds is about 32 million years, and 2.5 quintillion seconds would be 2,500 times that.
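If you want to sanity-check those conversions, a few lines of Python reproduce them (a rough back-of-the-envelope sketch; the year length is approximate):

```python
SECONDS_PER_DAY = 60 * 60 * 24
SECONDS_PER_YEAR = SECONDS_PER_DAY * 365.25  # approximate

for label, seconds in [
    ("1 million", 1e6),
    ("1 billion", 1e9),
    ("1 trillion", 1e12),
    ("1 quadrillion", 1e15),
    ("2.5 quintillion", 2.5e18),
]:
    print(f"{label} seconds ~ {seconds / SECONDS_PER_DAY:,.0f} days "
          f"~ {seconds / SECONDS_PER_YEAR:,.0f} years")
```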

After you’ve tried to visualize that—you can’t, it’s not humanly possible—keep in mind that 90% of all the data in the world was created in just the past two years.

Despite these staggering numbers, organizations are beginning to harness the value of what is now called big data.

Almost half of respondents to a recent McKinsey Analytics study, Analytics Comes of Age, say big data has “fundamentally changed” their business practices. According to NewVantage Partners, big data is delivering the most value to enterprises by cutting expenses (49.2%) and creating new avenues for innovation and disruption (44.3%). Almost 7 in 10 companies (69.4%) have begun using big data to create data-driven cultures, with 27.9% reporting positive results, as illustrated in Figure I-1.

Figure I-1. The benefits of deploying big data

Overall, 27% of those surveyed indicate their big data projects are already profitable, and 45% indicate they’re at a break-even stage.

What’s more, the majority of big data projects these days are being deployed in the cloud. Big data stored in the cloud will reach 403 exabytes by 2021, up from the 25 exabytes stored in 2016. Big data alone will represent 30% of data stored in datacenters by 2021, up from 18% in 2016.

My Journey to a Data Lake

The journey to a data lake is different for everyone. For me, Jon King, it was the realization that I was already on the road to implementing a data lake architecture. My company at the time was running a data warehouse architecture that housed a subset of data coming from our hundreds of MySQL servers. We began by extracting our MySQL tables to comma-separated values (CSV) format on our NetApp Filers and then loading those into the data warehouse. This data was used for business reports and ad hoc questions.

As the company grew, so did the platform. The amount, complexity, and—most important—the types of data also increased. In addition to our usual CSV-to-warehouse extract, transform, and load (ETL) conversions, we were soon ingesting billions of complex JSON-formatted events daily. Converting these JSON events to a relational database management system (RDBMS) format required significantly more ETL resources, and the schemas were always evolving based on new product releases. It was soon apparent that our data warehouse wasn’t going to keep up with our product roadmap. Storage and compute limitations meant that we were having to constantly decide what data we could and could not keep in the warehouse, and schema evolutions meant that we were frequently taking long maintenance outages.
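To give a sense of the work this involved, here is a minimal, hypothetical Python sketch of the kind of flattening such an ETL performs: turning a nested JSON event into a flat row that can be loaded into a relational table. The event structure and field names are purely illustrative:

```python
import csv
import json

def flatten(event, parent_key="", sep="_"):
    """Recursively flatten a nested JSON event into a single flat dict."""
    items = {}
    for key, value in event.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep=sep))
        else:
            items[new_key] = value
    return items

# Hypothetical event; real product events were larger, and their schemas
# changed with every release, which is what made this approach so brittle.
raw = '{"user": {"id": 42, "plan": "pro"}, "event": "login", "ts": "2018-01-01T08:00:00Z"}'
row = flatten(json.loads(raw))

with open("events.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=sorted(row))
    writer.writeheader()
    writer.writerow(row)  # columns: event, ts, user_id, user_plan
```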

At this point, we began to look at new distributed architectures that could meet the demands of our product roadmap. After looking at several open source and commercial options, we found Apache Hadoop and Hive. The nature of the Hadoop Distributed File System (HDFS) and Hive’s schema-on-read enabled us to address our need for tabular data as well as our need to parse and analyze complex JSON objects and store more data than we could in the data warehouse. The ability to use Hive to dynamically parse a JSON object allowed us to meet the demands of the analytics organization.
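As an illustration of what schema-on-read looks like in practice, here is a minimal sketch that queries raw JSON events with Hive’s get_json_object() function through the PyHive client. The host, table, and field names are hypothetical, and it assumes a reachable HiveServer2 endpoint:

```python
from pyhive import hive  # assumes the PyHive package and a running HiveServer2

conn = hive.Connection(host="hive.example.internal", port=10000, username="analyst")
cursor = conn.cursor()

# Schema-on-read: the raw events sit in a single string column, and Hive
# extracts fields at query time, so no upfront relational remodeling is needed.
cursor.execute("""
    SELECT
        get_json_object(raw_json, '$.user.id') AS user_id,
        get_json_object(raw_json, '$.event')   AS event_name,
        COUNT(*)                               AS events
    FROM raw_events
    WHERE dt = '2018-01-01'
    GROUP BY
        get_json_object(raw_json, '$.user.id'),
        get_json_object(raw_json, '$.event')
""")

for user_id, event_name, events in cursor.fetchall():
    print(user_id, event_name, events)
```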

Thus, we had a cloud data lake, which was based in Amazon Web Services (AWS). But soon thereafter, we found ourselves growing at a much faster rate, and realized that we needed a platform to help us manage the new open source tools and technologies that could handle these vast data volumes with the elasticity of the cloud while also controlling cost overruns. That led us to Qubole’s cloud data platform—and my journey became much more interesting.

A Quick History Lesson on Big Data

To understand how we got here, let’s look at Figure I-2, which provides a retrospective on how the big data universe developed.

Figure I-2. The evolution of big data

Even now, the big data ecosystem is still under construction. Advancement typically begins with an innovation by a pioneering organization (a Facebook, Google, eBay, Uber, or the like), an innovation created to address a specific challenge that a business encounters in storing, processing, analyzing, or managing its data. Typically, the intellectual property (IP) is eventually open sourced by its creator. Commercialization of the innovation almost inevitably follows.

A significant early milestone in the development of a big data ecosystem was a 2004 whitepaper from Google. Titled “MapReduce: Simplified Data Processing on Large Clusters,” it detailed how Google performed distributed information processing with a new programming model and execution engine called MapReduce.

Struggling with the huge volumes of data it was generating, Google had distributed computations across thousands of machines so that it could finish calculations in time for the results to be useful. The paper addressed issues such as how to parallelize the computation, distribute the data, and handle failures.

Google called it MapReduce because you first use a map() function to process a key/value pair and generate a set of intermediate key/value pairs. Then, you use a reduce() function that merges all intermediate values associated with the same intermediate key, as demonstrated in Figure I-3.

Figure I-3. How MapReduce works
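To make the model concrete, here is a small, self-contained Python sketch of the classic word-count example written as a map phase and a reduce phase. It runs on a single machine; a real MapReduce framework distributes these same two functions, plus the intermediate shuffle step, across a cluster:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """map(): emit an intermediate (key, value) pair for every word."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    """reduce(): merge all intermediate values that share the same key."""
    return (key, sum(values))

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Shuffle step: group intermediate pairs by key (the framework does this for you).
grouped = defaultdict(list)
for key, value in chain.from_iterable(map_phase(doc) for doc in documents):
    grouped[key].append(value)

counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts)  # {'the': 3, 'quick': 2, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 2}
```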

A year after Google published its whitepaper, Doug Cutting of Yahoo combined MapReduce with an open source web search engine called Nutch that had emerged from the Lucene Project (also open source). Cutting realized that MapReduce could solve the processing challenge posed by the very large files generated as part of Apache Nutch’s web-crawling and indexing processes.

By early 2005, developers had a working MapReduce implementation in Nutch, and by the middle of that year, most of the Nutch algorithms had been ported to MapReduce. In February 2006, the team moved the work out of Nutch to found an independent subproject of Lucene. They called this project Hadoop, named after a stuffed toy elephant that belonged to Cutting’s then-five-year-old son.

Hadoop became the go-to framework for large-scale, data-intensive deployments. Today, Hadoop has evolved far beyond its beginnings in web indexing and is now used to tackle a huge variety of tasks across multiple industries.

“The years between 2004 and 2007 were the truly formative years,” says Pradeep Reddy, a solutions architect at Qubole, who has been working with big data systems for more than a decade. “There was really no notion of big data before then.”

The Second Phase of Big Data Development

Between 2007 and 2011, a significant number of big data companies—including Cloudera and MapR—were founded in what would be the second major phase of big data development. “And what they essentially did was take the open source Hadoop code and commercialize it,” says Reddy. “By creating nice management frameworks around basic Hadoop, they were the first to offer commercial flavors that would accelerate deployment of Hadoop in the enterprise.”

So, what was driving all this big data activity? Companies attempting to deal with the masses of data pouring in realized that they needed faster time to insight. Businesses themselves needed to be more agile and support complex and increasingly digital business environments that were highly dynamic. The concept of lean manufacturing and just-in-time resources in the enterprise had arrived.

But there was a major problem, says Reddy: “Even as more commercial distributions of Hadoop and open source big data engines began to emerge, businesses were not benefiting from them, because they were so difficult to use. All of them required specialized skills, and few people other than data scientists had those skills.” In the O’Reilly book Creating a Data-Driven Enterprise with DataOps, Ashish Thusoo, cofounder and CEO of Qubole, describes how he and Qubole cofounder Joydeep Sen Sarma together addressed this problem while working at Facebook:

I joined Facebook in August 2007 as part of the data team. It was a new group, set up in the traditional way for that time. The data infrastructure team supported a small group of data professionals who were called upon whenever anyone needed to access or analyze data located in a traditional data warehouse. As was typical in those days, anyone in the company who wanted to get data beyond some small and curated summaries stored in the data warehouse had to come to the data team and make a request. Our data team was excellent, but it could only work so fast: it was a clear bottleneck.

I was delighted to find a former classmate from my undergraduate days at the Indian Institute of Technology already at Facebook. Joydeep Sen Sarma had been hired just a month previously. Our team’s charter was simple: to make Facebook’s rich trove of data more available.

Our initial challenge was that we had a nonscalable infrastructure that had hit its limits. So, our first step was to experiment with Hadoop. Joydeep created the first Hadoop cluster at Facebook and the first set of jobs, populating the first datasets to be consumed by other engineers—application logs collected using Scribe and application data stored in MySQL.

But Hadoop wasn’t (and still isn’t) particularly user friendly, even for engineers. It was, and is, a challenging environment. We found that the productivity of our engineers suffered. The bottleneck of data requests persisted. [See Figure I-4.]

SQL, on the other hand, was widely used by both engineers and analysts, and was powerful enough for most analytics requirements. So Joydeep and I decided to make the programmability of Hadoop available to everyone. Our idea: to create a SQL-based declarative language that would allow engineers to plug in their own scripts and programs when SQL wasn’t adequate. In addition, it was built to store all of the metadata about Hadoop-based datasets in one place. This latter feature was important because it turned out to be indispensable for creating the data-driven company that Facebook subsequently became. That language, of course, was Hive, and the rest is history.

Figure I-4. Human bottlenecks for democratizing data

Says Thusoo today: “Data was clearly too important to be left behind lock and key, accessible only by data engineers. We needed to democratize data across the company—beyond engineering and IT.”

Then another innovation appeared: Spark. Spark was originally developed because, even though memory was becoming cheaper, no single engine could handle both real-time and batch advanced analytics. Engines such as MapReduce were built specifically for batch processing and Java programming, and they weren’t always user-friendly tools for anyone other than data specialists such as analysts and data scientists. Researchers at the University of California at Berkeley’s AMPLab asked: is there a way to leverage memory to make big data processing faster?

Spark is a general-purpose, distributed data-processing engine suitable for use in a wide range of applications. On top of the Spark core data-processing engine sit libraries for SQL, machine learning, graph computation, and stream processing, all of which can be used together in an application. Programming languages supported by Spark include Java, Python, Scala, and R.

Big data practitioners began integrating Spark into their applications to rapidly query, analyze, and transform large amounts of data. Tasks most frequently associated with Spark include ETL and SQL batch jobs across large datasets; processing of streaming data from sensors, Internet of Things (IoT), or financial systems; and machine learning.
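For a flavor of what such an application looks like, here is a minimal PySpark sketch of a batch ETL step that reads raw JSON events from cloud storage and writes out an aggregated columnar table. The bucket path and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Schema-on-read over raw JSON events in cloud storage (path is hypothetical).
events = spark.read.json("s3://example-bucket/events/2018/01/")

# A typical ETL/SQL batch step: derive a date column (assuming ts is an ISO
# date string), aggregate, and write the result out as a columnar table.
daily_counts = (
    events
    .withColumn("day", F.to_date(F.col("ts")))
    .groupBy("day", "event_name")
    .count()
)
daily_counts.write.mode("overwrite").parquet("s3://example-bucket/curated/daily_counts/")
```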

In 2010, AMPLab released the Spark codebase as open source; the project was later donated to the Apache Software Foundation. Businesses rapidly began adopting it.

Then, in 2013, Facebook launched another open source engine, Presto. Presto started as a project at Facebook to run interactive analytic queries against its 300 PB data warehouse, which was built on large Hadoop- and HDFS-based clusters.

Prior to building Presto, Facebook had been using Hive. Says Reddy, “However, Hive wasn’t optimized for the fast performance needed in interactive queries, and Facebook needed something that could operate at the petabyte scale.”

In November 2013, Facebook open sourced Presto on its own under the Apache license (rather than contributing it to a foundation such as the Apache Software Foundation) and made it available for anyone to download. Today, Presto is a popular engine for running large-scale interactive SQL queries on semi-structured and structured data. Presto shines on the compute side, where many data warehouses can’t scale out, thanks to its in-memory engine’s ability to handle massive data volumes and high query concurrency.
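For a sense of how Presto is typically used, here is a minimal sketch of an interactive query issued through the presto-python-client package. The coordinator host, catalog, and table names are hypothetical:

```python
import prestodb  # assumes the presto-python-client package and a running coordinator

conn = prestodb.dbapi.connect(
    host="presto.example.internal",
    port=8080,
    user="analyst",
    catalog="hive",    # Presto queries the same tables that Hive already manages
    schema="default",
)
cursor = conn.cursor()

# An ad hoc, interactive query that comes back in seconds rather than minutes.
cursor.execute("""
    SELECT event_name, count(*) AS events
    FROM raw_events
    WHERE dt = '2018-01-01'
    GROUP BY event_name
    ORDER BY events DESC
    LIMIT 10
""")

for row in cursor.fetchall():
    print(row)
```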

Facebook’s Presto implementation is used today by more than a thousand of its employees, who together run more than 30,000 queries and process more than one petabyte of data daily. The company has moved a number of its large-scale Hive batch workloads to Presto as a result of performance improvements. “[Most] ad hoc queries, before Presto was released, took too much time,” says Reddy. “Someone would hit query and have time to eat their breakfast before getting results. With Presto you get subsecond results.”

“Another interesting trend we’re seeing is machine learning and deep learning being applied to big data in the cloud,” says Reddy. “The field of artificial intelligence had of course existed for a long time, but beginning in 2015, there was a lot of open source investment happening around it, enabling machine learning in Spark for distributed computing.” The open source community also made significant investments in innovative frameworks like TensorFlow, CNTK, PyTorch, Theano, MXNet, and Keras.

During the Deep Learning Summit at AWS re:Invent 2017, AI and deep learning pioneer Terrence Sejnowski notably said, “Whoever has more data wins.” He was summing up what many people now regard as a universal truth: machine learning requires big data to work. Without large, well-maintained training sets, machine learning algorithms—especially deep learning algorithms—fall short of their potential.

But despite the recent increase in applying deep learning algorithms to real-world challenges, there hasn’t been a corresponding upsurge of innovation in the algorithms themselves. Although new “bleeding edge” algorithms have been released (most recently Geoffrey Hinton’s milestone capsule networks), most deep learning algorithms are actually decades old. What’s truly driving these new applications of AI and machine learning isn’t new algorithms, but bigger data. As Moore’s law would predict, data scientists now have compute and storage capabilities that allow them to make use of the massive amounts of data being collected.

Weather Update: Clouds Ahead

Within a year of Hadoop’s introduction, another important (and at the time seemingly unrelated) event occurred: Amazon launched AWS in 2006. Of course, the cloud had been around for a while. Project MAC, an MIT effort funded by the Defense Advanced Research Projects Agency (DARPA) beginning in 1963, was arguably the first primitive instance of a cloud, “but Amazon’s move turned out to be critical for advancement of a big data ecosystem for enterprises,” says Reddy.

Google, naturally, wasn’t far behind. According to “An Annotated History of Google’s Cloud Platform,” in April 2008, App Engine launched for 20,000 developers as a tool to run web applications on Google’s infrastructure. Applications had to be written in Python and were limited to 500 MB of storage, 200 million megacycles of CPU, and 10 GB bandwidth per day. In May 2008, Google opened signups to all developers. The service was an immediate hit.

Microsoft tried to catch up with Google and Amazon by announcing Azure Cloud, codenamed Red Dog, also in 2008. But it would take years for Microsoft to get it out the door. Today, however, Microsoft Azure is growing quickly. It currently has 29.4% of application workloads in the public cloud, according to a recent Cloud Security Alliance (CSA) report. That being said, AWS continues to be the most popular, with 41.5% of application workloads. Google trails far behind, with just 3% of the installed base. However, the market is still considered immature and continues to develop as new cloud providers enter. Stay tuned; there is still room for others such as IBM, Alibaba, and Oracle to seize market share, but the window is beginning to close.

Bringing Big Data and Cloud Together

Another major event that happened around the time of the second phase of big data development is that Amazon launched the first cloud distribution of Hadoop by offering the framework in its AWS cloud ecosystem. Amazon Elastic MapReduce (EMR) is a web service that uses Hadoop to process vast amounts of data in the cloud. “And from the very beginning, Amazon offered Hadoop and Hive,” says Reddy. He adds that though Amazon also began offering Spark and other big data engines, “2010 is the birth of a cloud-native Hadoop distribution—a very important timeline event.”

Commercial Cloud Distributions: The Formative Years

Reddy calls 2011–2015 the “formative” years of commercial cloud Hadoop platforms. He adds that, “within this period, we saw the revolutionary idea of separating storage and compute emerge.”

Qubole’s founders came from Facebook, where they were the creators of Apache Hive and the key architects of Facebook’s internal data platforms. In 2011, when they founded Qubole, they set out on a mission to create a cloud-agnostic, cloud-native big data distribution platform to replicate their success at Facebook in the cloud. In doing so, they pioneered a new market.

Through the choice of engines, tools, and technologies, Qubole caters to users with diverse skillsets and enables a wide spectrum of big data use cases like ETL, data prep and ingestion, business intelligence (BI), and advanced analytics with machine learning and AI.

Qubole incorporated in 2011, founded on the belief that big data analytics workloads belong in the cloud. Its platform brings all the benefits of the cloud to a broader range of users. Indeed, Thusoo and Sarma started Qubole to “bring the template for hypergrowth companies like Facebook and Google to the enterprise.”

“We asked companies what was holding them back from using machine learning to do advanced analytics. They said, ‘We have no expertise and no platform,’” Thusoo said in a 2018 interview with Forbes. “We delivered a cloud-based unified platform that runs on AWS, Microsoft Azure, and Oracle Cloud.” During this same period, Facebook’s newly open sourced Presto enabled fast business intelligence on top of Hadoop. Presto is meant to deliver accelerated access to data for interactive analytics queries.

The year 2011 also saw the founding of another commercial on-premises distribution platform: Hortonworks. Microsoft later teamed up with Hortonworks to repackage the Hortonworks Data Platform (HDP), and in 2012 it released its cloud big data distribution for Azure under the name HDInsight.

As previously mentioned, in 2012 Microsoft released HDInsight, its first commercial cloud distribution. Then in 2013, another big data platform provider, Databricks, was launched. Founded by the creators of Apache Spark, Databricks aims to help clients with cloud-based big data processing. This marked the beginning of a new era of “born-in-the-cloud” SaaS companies that were aligned perfectly with the operational agility and pricing structure of the cloud.

Big Data and AI Move Decisively to the Cloud, but Operationalizing Initiatives Lag

Since 2015, big data has steadily moved to the cloud. The most popular open source projects (Apache Kafka, ElasticSearch, Presto, Apache Hadoop, Spark, and many others) all have operators built for various cloud commodities (such as storage and compute) and managed services (such as databases, monitoring apps, and more). These open source communities (largely comprising other enterprise practitioners) are also using the cloud in their workplaces, and we’re seeing some extraordinary contributions going into these projects from developers worldwide.

“We’ve seen a lot of enterprise companies moving away from on-premises deployments because of the pain of hitting the wall in terms of capacity,” says Reddy, adding that, with the cloud, the notion of multitenancy (or sharing a cluster across many users) came full circle. In the cloud, “it’s all about creating clusters for specific use cases, and right-sizing them to get the most out of them,” says Reddy.

Back when Hadoop was in its infancy, Yahoo began building Hadoop on-demand clusters—dedicated clusters of Hadoop that lacked multitenancy. Yahoo would bring these clusters up for dedicated tasks, perform necessary big data operations, and then tear them down.

But since then, most of the advancements around Hadoop have focused on multitenant capabilities. The YARN (Yet Another Resource Negotiator) project was chartered with this as one of its main objectives. YARN delivered, and it helped Hadoop platforms expand in the enterprises that adopted it early. But there was a problem. The velocity and volume of data were increasing at such an exponential rate that enterprises implementing big data on-premises would soon hit the ceiling in terms of capacity, requiring multiple hardware refreshes to meet the demand for data processing. Multitenancy on-premises also requires a lot of administration time to manage fair share across different users and workloads.

Today, in the cloud, we see Hadoop on-demand clusters similar to those we saw when Hadoop was in its infancy. As Reddy said, the focus is more about right-sizing the clusters for specific uses rather than enabling multitenancy. Multitenancy is still very relevant in the cloud for Presto, although not as much for Hive and Spark clusters.

At present, cloud deployments of big data represent as much as 57% of all big data workloads, according to Gartner. Forrester projects that global spending on big data solutions via cloud subscriptions will grow almost 7.5 times faster than spending on on-premises solutions; in its 2017 survey of data analytics professionals, the firm found that moving to the public cloud was the number-one technology priority for big data practitioners (see Figure I-5).

Figure I-5. Growth of big data solutions

This represents just the foundational years of machine learning and deep learning, stresses Reddy. “There is a lot more to come.”

By 2030, applying AI technologies such as machine learning and deep learning to big data deployments will be a $15.7 trillion “game changer,” according to PwC. Also, 59% of executives say their companies’ ability to leverage big data will be significantly enhanced by applying AI.

Indeed, big data and AI are becoming inextricably intertwined. Recent industry surveys consistently show that, from the top down, companies are ramping up their investment in advanced analytics as a key priority of this decade. In NewVantage Partners’ annual executive survey, an overwhelming 97.2% of executives report that their companies are investing in building or launching combined big data and AI initiatives. And 76.5% said the proliferation of data is empowering AI and cognitive computing initiatives.

One reason for the quick marriage of big data and AI on the cloud is that most companies surveyed were worried that they would be “disrupted” by new market entrants.

We Believe in the Cloud for Big Data and AI

The premise of this book is that by taking advantage of the compute power and scalability of the cloud and the right open source big data and AI engines and tools, businesses can finally operationalize their big data. This will allow them to be more innovative and collaborative, achieving analytical value in less time while lowering operational costs. The ultimate result: businesses will achieve their goals faster and more effectively while accelerating time to market through intelligent use of data.

This book is designed not to be a tow rope, but a guiding line for all members of the data team—from data engineers to data scientists to machine learning engineers to analysts—to help them understand how to operationalize big data and machine learning in the cloud.

Following is a snapshot of what you will learn in this book:

Chapter 1

You learn why you need a “central repository” to be able to use your data effectively. In short, you’ll learn why you need a data lake.

Chapter 2

You need a data-driven culture, but it can be challenging to get there. This chapter explains how.

Chapter 3

We show you how to begin to build a data lake.

Chapter 4

We discuss building the infrastructure for the data lake. What kind of structure do you need to house your big data?

Chapter 5

Now that you’ve built the “house” for your data lake, you need to consider governance. In this chapter, we cover three necessary governance plans: data, financial, and security.

Chapter 6

You’ll need some tools to manage your growing data lake. Here, we provide a roundup of those tools.

Chapter 7

We examine three key considerations for securing a data lake in the cloud.

Chapter 8

We discuss the role of data engineers, and how they interface with a cloud-native data platform.

Chapter 9

We discuss the role of data scientists, and how they interface with a cloud-native data platform.

Chapter 10

We discuss the role of data analysts, and how they interface with a cloud-native data platform.

Chapter 11

We present a case study from Ibotta, which transitioned from a static and rigid data warehouse to a cost-efficient, self-service data lake using Qubole’s cloud-native data platform.

Chapter 12

We conclude by examining why a cloud data platform is a future-proof approach to operationalizing your data lake.
