Chapter 3. Data Science Technologies

Data science tooling has entered a golden age. At the laptop scale, the most common tools are R, MATLAB, and Python with scikit-learn, but there are many others. Often, an expert data scientist will have a “go-to” language in which she feels most confident developing prototypes. A data engineer may likewise have her own preferences when writing scalable code.
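To give a flavor of what laptop-scale prototyping looks like, here is a minimal scikit-learn sketch. The dataset (the bundled Iris data) and the choice of model are our own illustrative assumptions, not an example drawn from this book:

    # A minimal laptop-scale prototype: train and evaluate a classifier
    # on scikit-learn's bundled Iris dataset (illustrative choices only).
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Load a small in-memory dataset -- no cluster required.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42
    )

    # Fit a model and check how well it generalizes.
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    print(f"Test accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")

The entire experiment fits in memory and runs in seconds, which is exactly why tools like this dominate the prototyping stage.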

There are so many tools to choose from that it can be a challenge to know where to start. Our aim here is to narrow the field by providing technical foundations for a few simple frameworks. These may not be optimal for every workload, but they are among the most popular choices for a wide range of use cases. We will use Apache Spark for most scalable applications and TensorFlow for the deep learning examples; TensorFlow itself is not discussed in detail in this chapter. The examples provided in later sections are straightforward to follow along with.

Apache Spark

As you have learned, Apache Spark is a framework for writing distributed code. Spark applications can be developed at the desktop scale on moderately sized datasets and, once the code is ready, migrated to a cluster or cloud computing resource. The scale-up process is straightforward, often requiring only trivial code changes to run at scale.
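To make that concrete, here is a minimal PySpark sketch of our own; the file name sales.csv and the region column are hypothetical. The point is that the transformation logic is identical on a laptop and on a cluster; only the master setting changes:

    # A minimal PySpark sketch (illustrative; the file and column names
    # are hypothetical). Laptop vs. cluster execution differs only in
    # the master setting, which is often supplied by spark-submit.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")  # local mode: use all available laptop cores
        .appName("scale-up-demo")
        .getOrCreate()
    )

    # Read a CSV and compute a simple aggregate, exactly as we would
    # against a much larger dataset on a cluster.
    df = spark.read.csv("sales.csv", header=True, inferSchema=True)
    df.groupBy("region").count().show()

    spark.stop()

To run the same job on a cluster, you would typically drop the .master(...) call from the code and pass the master externally, for example with spark-submit --master yarn, leaving the application logic untouched.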

Spark is often considered to be the second generation of distributed computing in the enterprise. The first generation was Hadoop, which consists of the Hadoop Distributed File System (HDFS) for storage and the MapReduce engine for computation.
