Chapter 1. Analyzing Big Data
[Data applications] are like sausages. It is better not to see them being made.
Otto von Bismarck
Intelligently recommend millions of products to millions of users
Estimate financial risk through simulations of portfolios that include millions of instruments
Easily manipulate data from thousands of human genomes to detect genetic associations with disease
These are tasks that simply could not have been accomplished 5 or 10 years ago. When people say that we live in an age of big data, they mean that we have tools for collecting, storing, and processing information at a scale previously unheard of. Sitting behind these capabilities is an ecosystem of open source software that can leverage clusters of commodity computers to chug through massive amounts of data. Distributed systems like Apache Hadoop have found their way into the mainstream and have seen widespread deployment at organizations in nearly every field.
But just as a chisel and a block of stone do not make a statue, there is a gap between having access to these tools and all this data and doing something useful with it. This is where data science comes in. Just as sculpture is the practice of turning tools and raw material into something relevant to nonsculptors, data science is the practice of turning tools and raw data into something that non–data scientists might care about.
Often, “doing something useful” means placing a schema over it and using SQL to answer questions like “Of the gazillion users who made it to the third page in our registration process, how many are over 25?” The field of how to structure a data warehouse and organize information to make answering these kinds of questions easy is a rich one, but we will mostly avoid its intricacies in this book.
Sometimes, “doing something useful” takes a little extra. SQL still may be core to the approach, but to work around idiosyncrasies in the data or perform complex analysis, we need a programming paradigm that is more flexible, closer to the ground, and richer in functionality in areas like machine learning and statistics. These are the kinds of analyses we are going to talk about in this book.
For a long time, open source frameworks like R, the PyData stack, and Octave have made rapid analysis and model building viable over small data sets. With fewer than 10 lines of code, we can throw together a machine learning model on half a data set and use it to predict labels on the other half. With a little more effort, we can impute missing data, experiment with a few models to find the best one, or use the results of a model as inputs to fit another. What should an equivalent process look like that can leverage clusters of computers to achieve the same outcomes on huge data sets?
The right approach might seem to be simply extending these frameworks to run on multiple machines: retain their programming models and rewrite their guts to play well in distributed settings. However, the challenges of distributed computing require us to rethink many of the basic assumptions we rely on in single-node systems. For example, because data must be partitioned across many nodes on a cluster, algorithms with wide data dependencies will suffer from the fact that network transfer rates are orders of magnitude slower than memory accesses. As the number of machines working on a problem grows, so does the probability of a failure. These facts require a programming paradigm that is sensitive to the characteristics of the underlying system: one that discourages poor choices and makes it easy to write code that will execute in a highly parallel manner.
Of course, single-machine tools like PyData and R that have come to recent prominence in the software community are not the only tools used for data analysis. Scientific fields like genomics that deal with large data sets have been leveraging parallel computing frameworks for decades. Most people processing data in these fields today are familiar with a cluster-computing environment called HPC (high-performance computing). Where the difficulties with PyData and R lie in their inability to scale, the difficulties with HPC lie in its relatively low level of abstraction and difficulty of use. For example, to process a large file full of DNA-sequencing reads in parallel, we must manually split it up into smaller files and submit a job for each of those files to the cluster scheduler. If some of these fail, the user must detect the failure and take care of manually resubmitting them. If the analysis requires all-to-all operations like sorting the entire data set, the large data set must be streamed through a single node, or the scientist must resort to lower-level distributed frameworks like MPI, which are difficult to program without extensive knowledge of C and distributed/networked systems.
Tools written for HPC environments often fail to decouple the in-memory data models from the lower-level storage models. For example, many tools only know how to read data from a POSIX filesystem in a single stream, making it difficult for them to parallelize naturally or to use other storage backends, like databases. Recent systems in the Hadoop ecosystem provide abstractions that allow users to treat a cluster of computers more like a single computer—to automatically split up files and distribute storage over many machines, divide work into smaller tasks and execute them in a distributed manner, and recover from failures. The Hadoop ecosystem can automate a lot of the hassle of working with large data sets, and is far cheaper than HPC.
The Challenges of Data Science
A few hard truths come up so often in the practice of data science that evangelizing these truths has become a large role of the data science team at Cloudera. For a system that seeks to enable complex analytics on huge data to be successful, it needs to be informed by—or at least not conflict with—these truths.
First, the vast majority of work that goes into conducting successful analyses lies in preprocessing data. Data is messy, and cleansing, munging, fusing, mushing, and many other verbs are prerequisites to doing anything useful with it. Large data sets in particular, because they are not amenable to direct examination by humans, can require computational methods to even discover what preprocessing steps are required. Even when it comes time to optimize model performance, a typical data pipeline requires spending far more time in feature engineering and selection than in choosing and writing algorithms.
For example, when building a model that attempts to detect fraudulent purchases on a website, the data scientist must choose from a wide variety of potential features: fields that users are required to fill out, IP location info, login times, and click logs as users navigate the site. Each of these comes with its own challenges when converting to vectors fit for machine learning algorithms. A system needs to support more flexible transformations than turning a 2D array of doubles into a mathematical model.
Second, iteration is a fundamental part of data science. Modeling and analysis typically require multiple passes over the same data. One aspect of this lies within machine learning algorithms and statistical procedures. Popular optimization procedures like stochastic gradient descent and expectation maximization involve repeated scans over their inputs to reach convergence. Iteration also matters within the data scientist’s own workflow. When data scientists are initially investigating and trying to get a feel for a data set, usually the results of a query inform the next query that should run. When building models, data scientists do not try to get it right in one try. Choosing the right features, picking the right algorithms, running the right significance tests, and finding the right hyperparameters all require experimentation. A framework that requires reading the same data set from disk each time it is accessed adds delay that can slow down the process of exploration and limit the number of things we get to try.
Third, the task isn’t over when a well-performing model has been built. If the point of data science is to make data useful to non–data scientists, then a model stored as a list of regression weights in a text file on the data scientist’s computer has not really accomplished this goal. Uses of data like recommendation engines and real-time fraud detection systems culminate in data applications. In these, models become part of a production service and may need to be rebuilt periodically or even in real time.
For these situations, it is helpful to make a distinction between analytics in the lab and analytics in the factory. In the lab, data scientists engage in exploratory analytics. They try to understand the nature of the data they are working with. They visualize it and test wild theories. They experiment with different classes of features and auxiliary sources they can use to augment it. They cast a wide net of algorithms in the hopes that one or two will work. In the factory, in building a data application, data scientists engage in operational analytics. They package their models into services that can inform real-world decisions. They track their models’ performance over time and obsess about how they can make small tweaks to squeeze out another percentage point of accuracy. They care about SLAs and uptime. Historically, exploratory analytics has typically taken place in languages like R, and when it comes time to build production applications, the data pipelines are rewritten entirely in Java or C++.
Of course, everybody could save time if the original modeling code could actually be used in the app for which it is written, but languages like R are slow and lack integration with most planes of the production infrastructure stack, and languages like Java and C++ are just poor tools for exploratory analytics. They lack read-eval-print loop (REPL) environments for playing with data interactively and require large amounts of code to express simple transformations. A framework that makes modeling easy but is also a good fit for production systems is a huge win.
Introducing Apache Spark
Enter Apache Spark, an open source framework that combines an engine for distributing programs across clusters of machines with an elegant model for writing programs atop it. Spark, which originated at the UC Berkeley AMPLab and has since been contributed to the Apache Software Foundation, is arguably the first open source software that makes distributed programming truly accessible to data scientists.
One illuminating way to understand Spark is in terms of its advances over its predecessor, Apache Hadoop’s MapReduce. MapReduce revolutionized computation over huge data sets by offering a simple model for writing programs that could execute in parallel across hundreds to thousands of machines. The MapReduce engine achieves near linear scalability—as the data size increases, we can throw more computers at it and see jobs complete in the same amount of time—and is resilient to the fact that failures that occur rarely on a single machine occur all the time on clusters of thousands of machines. It breaks up work into small tasks and can gracefully accommodate task failures without compromising the job to which they belong.
Spark maintains MapReduce’s linear scalability and fault tolerance, but extends it in three important ways. First, rather than relying on a rigid map-then-reduce format, its engine can execute a more general directed acyclic graph (DAG) of operators. This means that in situations where MapReduce must write out intermediate results to the distributed filesystem, Spark can pass them directly to the next step in the pipeline. In this way, it is similar to Dryad, a descendant of MapReduce that originated at Microsoft Research. Second, it complements this capability with a rich set of transformations that enable users to express computation more naturally. It has a strong developer focus and streamlined API that can represent complex pipelines in a few lines of code.
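As a rough illustration of both points, here is a minimal sketch in Scala (assuming Spark 2.x; the input path and column names are hypothetical) of a multistage pipeline expressed as a single chain of transformations, with no intermediate writes to the distributed filesystem between steps:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical application; the input path and columns are invented
// for illustration.
val spark = SparkSession.builder().appName("purchases").getOrCreate()
import spark.implicits._

val purchases = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///data/purchases.csv")

// Each step below is one node in the DAG of operators; intermediate
// results flow directly to the next step rather than being written out.
val topSpenders = purchases
  .filter($"amount" > 100)
  .groupBy($"userId")
  .count()
  .orderBy($"count".desc)

topSpenders.show(10)
```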
Third, Spark extends its predecessors with in-memory processing. Its Dataset and DataFrame abstractions enable developers to materialize any point in a processing pipeline into memory across the cluster, meaning that future steps that want to deal with the same data set need not recompute it or reload it from disk. This capability opens up use cases that distributed processing engines could not previously approach. Spark is well suited for highly iterative algorithms that require multiple passes over a data set, as well as reactive applications that quickly respond to user queries by scanning large in-memory data sets.
Perhaps most importantly, Spark fits well with the aforementioned hard truths of data science, acknowledging that the biggest bottleneck in building data applications is not CPU, disk, or network, but analyst productivity. It perhaps cannot be overstated how much collapsing the full pipeline, from preprocessing to model evaluation, into a single programming environment can speed up development. By packaging an expressive programming model with a set of analytic libraries under a REPL, Spark avoids the round trips to IDEs required by frameworks like MapReduce and the challenges of subsampling and moving data back and forth from the Hadoop distributed file system (HDFS) required by frameworks like R. The more quickly analysts can experiment with their data, the higher the likelihood they have of doing something useful with it.
With respect to the pertinence of munging and ETL, Spark strives to be something closer to the Python of big data than the MATLAB of big data. As a general-purpose computation engine, its core APIs provide a strong foundation for data transformation independent of any functionality in statistics, machine learning, or matrix algebra. Its Scala and Python APIs allow programming in expressive general-purpose languages, as well as access to existing libraries.
Spark’s in-memory caching makes it ideal for iteration both at the micro- and macrolevel. Machine learning algorithms that make multiple passes over their training set can cache it in memory. When exploring and getting a feel for a data set, data scientists can keep it in memory while they run queries, and easily cache transformed versions of it as well without suffering a trip to disk.
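To make this concrete, here is a minimal sketch of caching during exploration, assuming the SparkSession and implicits from the earlier sketch are in scope and using a hypothetical Parquet data set:

```scala
// Hypothetical input; assumes `spark` and spark.implicits._ are in
// scope as in the earlier sketch.
val events = spark.read.parquet("hdfs:///data/events.parquet")

// Mark the data set for caching in cluster memory. The first action
// populates the cache; later queries read from memory, not disk.
events.cache()
events.filter($"status" === "error").count()
events.groupBy($"userId").count().show()

// Transformed versions can be cached as well.
val cleaned = events.na.drop()
cleaned.cache()
```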
Last, Spark spans the gap between systems designed for exploratory analytics and systems designed for operational analytics. It is often quoted that a data scientist is someone who is better at engineering than most statisticians, and better at statistics than most engineers. At the very least, Spark is better at being an operational system than most exploratory systems and better for data exploration than the technologies commonly used in operational systems. It is built for performance and reliability from the ground up. Sitting atop the JVM, it can take advantage of many of the operational and debugging tools built for the Java stack.
Spark boasts strong integration with the variety of tools in the Hadoop ecosystem. It can read and write data in all of the data formats supported by MapReduce, allowing it to interact with formats commonly used to store data on Hadoop, like Apache Avro and Apache Parquet (and good old CSV). It can read from and write to NoSQL databases like Apache HBase and Apache Cassandra. Its stream-processing library, Spark Streaming, can ingest data continuously from systems like Apache Flume and Apache Kafka. Its SQL library, Spark SQL, can interact with the Apache Hive Metastore, and the Hive on Spark initiative enabled Spark to be used as an underlying execution engine for Hive, as an alternative to MapReduce. It can run inside YARN, Hadoop’s scheduler and resource manager, allowing it to share cluster resources dynamically and to be managed with the same policies as other processing engines, like MapReduce and Apache Impala.
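A minimal sketch of this interoperability, with hypothetical paths and table names (and assuming a Hive-enabled SparkSession for the SQL query):

```scala
// Read a CSV file and a Parquet file from HDFS; both formats are
// supported out of the box. The paths here are hypothetical.
val clicksCsv = spark.read.option("header", "true").csv("hdfs:///raw/clicks.csv")
val clicksPq = spark.read.parquet("hdfs:///warehouse/clicks.parquet")

// Write results back out as Parquet for downstream jobs.
clicksCsv.write.mode("overwrite").parquet("hdfs:///warehouse/clicks_clean.parquet")

// With Hive support enabled, Spark SQL can query tables registered
// in the Hive metastore directly ("clicks" is a hypothetical table).
val daily = spark.sql("SELECT dt, count(*) AS n FROM clicks GROUP BY dt")
```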
About This Book
The rest of this book is not going to be about Spark’s merits and disadvantages. There are a few other things that it will not be about either. It will introduce the Spark programming model and Scala basics, but it will not attempt to be a Spark reference or provide a comprehensive guide to all its nooks and crannies. It will not try to be a machine learning, statistics, or linear algebra reference, although many of the chapters will provide some background on these before using them.
Instead, it will try to help the reader get a feel for what it’s like to use Spark for complex analytics on large data sets. It will cover the entire pipeline: not just building and evaluating models, but also cleansing, preprocessing, and exploring data, with attention paid to turning results into production applications. We believe that the best way to teach this is by example, so after a quick chapter describing Spark and its ecosystem, the rest of the chapters will be self-contained illustrations of what it looks like to use Spark for analyzing data from different domains.
When possible, we will attempt not to just provide a “solution,” but to demonstrate the full data science workflow, with all of its iterations, dead ends, and restarts. This book will be useful for getting more comfortable with Scala, Spark, and machine learning and data analysis. However, these are in service of a larger goal, and we hope that most of all, this book will teach you how to approach tasks like those described at the beginning of this chapter. Each chapter, in about 20 measly pages, will try to get as close as possible to demonstrating how to build one of these pieces of data applications.
The Second Edition
The years 2015 and 2016 saw seismic changes in Spark, culminating in the release of Spark 2.0 in July of 2016. The most salient of these changes are the modifications to Spark’s core API. In versions prior to Spark 2.0, Spark’s API centered around Resilient Distributed Datasets (RDDs), which are lazily instantiated collections of objects, partitioned across a cluster of computers.
Although RDDs enabled a powerful and expressive API, they suffered from two main problems. First, they didn’t lend themselves well to performant, stable execution. By relying on Java and Python objects, RDDs used memory inefficiently and exposed Spark programs to long garbage-collection pauses. They also tied the execution plan to the API, which put a heavy burden on the user to optimize the execution of their program. For example, where a traditional RDBMS might be able to pick the best join strategy based on the size of the tables being joined, Spark required users to make this choice on their own. Second, Spark’s API ignored the fact that data often fits into a structured relational form, and when it does, an API can supply primitives that make the data much easier to manipulate, such as allowing users to refer to column names instead of ordinal positions in a tuple.
Spark 2.0 addressed these problems by replacing RDDs with Datasets and DataFrames. Datasets are similar to RDDs but map the objects they represent to encoders, which enable a much more efficient in-memory representation. This means that Spark programs execute faster, use less memory, and run more predictably. Spark also places an optimizer between Datasets and their execution plans, which means that it can make more intelligent decisions about how to execute them.
DataFrame is a subclass of Dataset that is specialized to model relational data (i.e., data with rows and fixed sets of columns). By understanding the notion of a column, Spark can offer a cleaner, more expressive API, as well as enable a number of performance optimizations. For example, if Spark knows that only a subset of the columns is needed to produce a result, it can avoid materializing the rest into memory. And many transformations that previously needed to be expressed as user-defined functions (UDFs) are now expressible directly in the API. This is especially advantageous when using Python, because Spark’s internal machinery can execute transformations much faster than functions defined in Python. DataFrames also offer interoperability with Spark SQL, meaning that users can write a SQL query that returns a DataFrame and then use that DataFrame programmatically in the Spark-supported language of their choice. Although the new API looks very similar to the old API, enough details have changed that nearly all Spark programs need to be updated.
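A minimal sketch of the DataFrame and Dataset APIs, with a hypothetical schema, showing name-based column references and SQL interoperability (again assuming `spark` and spark.implicits._ are in scope):

```scala
// Hypothetical schema for a data set of transactions.
case class Transaction(userId: String, amount: Double, country: String)

val txns = spark.read.parquet("hdfs:///data/txns.parquet").as[Transaction]

// Columns are referred to by name rather than by ordinal position,
// and the optimizer decides how best to execute the query.
val byCountry = txns.groupBy($"country").sum("amount")

// SQL and the programmatic API interoperate: a query returns a
// DataFrame that can be used in subsequent code.
txns.createOrReplaceTempView("txns")
val large = spark.sql("SELECT userId, amount FROM txns WHERE amount > 1000")
```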
In addition to the core API changes, Spark 2.0 saw big changes to the APIs used for machine learning and statistical analysis. In prior versions, each machine learning algorithm had its own API. Users who wanted to prepare data for input into algorithms or to feed the output of one algorithm into another needed to write their own custom orchestration code. Spark 2.0 contains the Spark ML API, which introduces a framework for composing pipelines of machine learning algorithms and feature transformation steps. The API, inspired by Python’s popular scikit-learn API, revolves around estimators and transformers: objects that learn parameters from the data and then use those parameters to transform data. The Spark ML API is heavily integrated with the DataFrame API, which makes it easy to train machine learning models on relational data. For example, users can refer to features by name instead of by ordinal position in a feature vector.
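As a minimal sketch, a pipeline that chains feature transformers with an estimator might look like the following (the `training` and `testData` DataFrames, with their "text" and "label" columns, are hypothetical):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Transformers turn raw text into feature vectors; the estimator
// learns model parameters from those features.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

// Compose the steps into a single pipeline and fit it; the result is
// itself a transformer that can be applied to new data.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)
val predictions = model.transform(testData)
```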
Taken together, all these changes to Spark have rendered much of the first edition obsolete. This second edition updates all of the chapters to use the new Spark APIs when possible. Additionally, we’ve cut some bits that are no longer relevant. For example, we’ve removed a full appendix that dealt with some of the intricacies of the API, in part because Spark now handles these situations intelligently without user intervention. With Spark in a new era of maturity and stability, we hope that these changes will preserve the book as a useful resource on analytics with Spark for years to come.