Chapter 4. Data (Science) Pipelines

This chapter tackles the nitty-gritty of data work—not on the data side but the data scientist’s side. Ben Lorica tackles the combinations of tools available off-the-shelf (and the platforms that enable combining tools), as well as the process of feature discovery and selection. 

Verticalized Big Data Solutions

General-purpose platforms can come across as hammers in search of nails

by Ben Lorica

As much as I love talking about general-purpose big data platforms and data science frameworks, I’m the first to admit that many of the interesting startups I talk to are focused on specific verticals. At their core, big data applications merge large amounts of real-time and static data to improve decision-making:

Data fusion diagram

This simple idea can be hard to execute in practice (think volume, variety, velocity). Unlocking value from disparate data sources entails some familiarity with domain-specific1 data sources, requirements, and business problems.

It’s difficult enough to solve a specific problem, let alone a generic one. Consider the case of Guavus—a successful startup that builds big data solutions for the telecom industry (“communication service providers”). Its founder2 was very familiar with the data sources in telecom, and knew the types of applications that would resonate within that industry. Once they solve one set of problems for a telecom company (network optimization), they quickly leverage the same systems to solve others (marketing analytics).

This ability to address a variety of problems stems from Guavus’ deep familiarity with data and problems in telecom. In contrast, a typical general-purpose platform can come across as a hammer in search of a nail. So while I remain a fan (and user) of general-purpose platforms, the less well-known verticalized solutions are definitely on my radar.

Better Tools Can’t Overcome Poor Analysis

I’m not suggesting that the criticisms raised against big data don’t apply to verticalized solutions. But many problems are due to poor analysis and not the underlying tools. A few of the more common criticisms arise from analyzing correlations: correlation is not causation, correlations are dynamic and can sometimes change drastically,3 and data dredging.4

Scaling Up Data Frames

New frameworks for interactive business analysis and advanced analytics fuel the rise in tabular data objects

by Ben Lorica

Long before the advent of “big data,” analysts were building models using tools like R (and its forerunners S/S-PLUS). Productivity hinged on tools that made data wrangling, data inspection, and data modeling convenient. Among R users, this meant proficiency with data frames—objects used to store data matrices that can hold both numeric and categorical data. A data.frame is the data structure consumed by most R analytic libraries.

But not all data scientists use R, nor is R suitable for all data problems. I’ve been watching with interest the growing number of alternative data structures for business analysis and advanced analytics. These new tools are designed to handle much larger data sets and are frequently optimized for specific problems. And they all use idioms that are familiar to data scientists—either SQL-like expressions, or syntax similar to those used for R data.frame or pandas.DataFrame.

As much as I’d like these different projects and tools to coalesce, there are differences in the platforms they inhabit, the use cases they target, and the (business) objectives of their creators. Regardless of their specific features and goals, these emerging tools5 and projects all need data structures that simplify data munging and data analysis—including data alignment, how to handle missing values, standardizing values, and coding categorical variables.


As the data processing engine for big data, analytic libraries and features are making their way into Spark,6 thus objects and data structures that simplify data wrangling and analysis are also beginning to appear. For advanced analytics and machine learning, MLTable is a table-like interface that mimics structures like R data.frame, database tables, or MATLAB’s dataset array. For business analytics (interactive query analysis), SchemaRDD’s are used in relational queries executed in Spark SQL.

Adatao logo

At the recent Spark Summit, start-up Adatao unveiled and announced plans to open sourceDistributed Data Frames (DDF)—objects that were heavily inspired by R data.frame. Adatao developed DDF as part of their pAnalytics and pInsights products, so DDF comes with many utilities for analysis and data wrangling.


Inspired by idioms used for R data.frame, Adatao’s DDF can be used from within RStudio. With standard R code,7 users can access a collection of highly scalable analytic libraries (the algorithms are executed in Spark).

ddf <- adatao.getDDF("ddf://adatao/flightInfo")
adatao.setMutable(ddf, TRUE)
adatao.transform(ddf, "delayed = if(arrdelay > 15, 1, 0)")
# adatao implementation of lm
model <- adatao.lm(delayed ~ distance + deptime + depdelay, data=ddf)
lmpred <- adatao.predict(model, ddf1)

For interactive queries, new R packages dplyr and/or data.table can be used for fast aggregations and joins. dplyr also comes with an operator (%.%) for chaining together data (wrangling) operations.


Among data scientists who use Python, pandas.DataFrame has been an essential tool ever since its release. Over the past few years pandas has become one of the most active open source projects in the data space (266 distinct contributors and counting). But pandas was designed for small to medium sized data sets, and as pandas creator Wes McKinney recently noted, there are many areas for improvement.

GraphLab Inc. logo

One area is scalability. To scale to terabytes of data, a new alternative is GraphLab’s SFrame, a component of a product called GraphLab Create. GraphLab Create targets Python users: it comes with a Python API and detailed examples contained in IPython notebooks. SFrame itself uses syntax that should be easy for pandas users to pick up. There are plans to open source SFrame (and some other components of GraphLab Create) later this year.

# recommender in five lines of Python
import graphlab
data = graphlab.SFrame("s3://my_bucket/my_data.csv")
model = graphlab.recommender.create(data)



Badger is a new tabular analytics library being built at DataPad—a start-up led and co-founded by Wes McKinney.

A C library coupled with a Python-based interface, Badger targets “business analytics and BI use cases” and has a pandas-like syntax, designed for data processing and analytical queries (“more expressive than SQL”). As an in-memory query processor, it features active memory management and caching, and targets interactive speeds on 100-million row and smaller data sets on single machines.

Figure 4-1. Screenshot of DataPad

Badger is currently only available as part of DataPad’s visual analysis platform. But its lineage (developed by the team that created pandas) combined with promising performance reports have many Pydata users itching to try it out.

Streamlining Feature Engineering

Researchers and startups are building tools that enable feature discovery

by Ben Lorica

Why do data scientists spend so much time on data wrangling and data preparation? In many cases it’s because they want access to the best variables with which to build their models. These variables are known as features in machine-learning parlance. For many8 data applications, feature engineering and feature selection are just as (if not more important) than choice of algorithm:

Good features allow a simple model to beat a complex model
(to paraphrase Alon Halevy, Peter Norvig, and Fernando Pereira).

The terminology can be a bit confusing, but to put things in context one can simplify the data science pipeline to highlight the importance of features:

Feature engineering and discovery pipeline

Feature Engineering or the Creation of New Features

A simple example to keep in mind is text mining. One starts with raw text (documents) and extracted features could be individual words or phrases. In this setting, a feature could indicate the frequency of a specific word or phrase. Features9 are then used to classify and cluster documents, or extract topics associated with the raw text. The process usually involves the creation10 of new features (feature engineering) and identifying the most essential ones (feature selection).

Feature Selection Techniques

Why bother selecting features? Why not use all available features? Part of the answer could be that you need a solution that is simple, interpretable, and fast. This favors features that have good statistical performance and that are easy to explain to non-technical users. But there could be legal11 reasons for excluding certain features as well (e.g., the use of credit scores is discriminatory in certain situations).

In the machine-learning literature there are three commonly used methods for feature selection:

Expect More Tools to Streamline Feature Discovery

In practice, feature selection and feature engineering are iterative processes where humans leverage automation12 to wade through candidate features. Statistical software have long had (stepwise) procedures for feature selection. New startups are providing similar tools: Skytree’s new user interface lets business users automate feature selection.

I’m definitely noticing much more interest from researchers and startups. A group out of Stanford13 just released a paper on a new R language extension and execution framework designed for feature selection. Their R extension enables data analysts to incorporate feature selection using high-level constructs that form a domain specific language. Some startups like ContextRelevant and SparkBeyond,14 are working to provide users with tools that simplify feature engineering and selection. In some instances this includes incorporating features derived from external data sources. Users of SparkBeyond are able to incorporate the company’s knowledge databases (Wikipedia, OpenStreeMap, Github, etc.) to enrich their own data sources.

While many startups who build analytic tools begin by focusing on algorithms, many products will soon begin highlighting how they handle feature selection and discovery. There are many reasons why there will be more emphasis on features: interpretability (this includes finding actionable features that drive model performance), big data (companies have many more data sources to draw upon), and an appreciation of data pipelines (algorithms are just one component).

Big Data Solutions Through the Combination of Tools

Applications get easier to build as packaged combinations of open source tools become available

by Ben Lorica

As a user who tends to mix-and-match many different tools, not having to deal with configuring and assembling a suite of tools is a big win. So I’m really liking the recent trend towards more integrated and packaged solutions. A recent example is the relaunch of Cloudera’s Enterprise Data hub, to include Spark15 and Spark Streaming. Users benefit by gaining automatic access to analytic engines that come with Spark.16 Besides simplifying things for data scientists and data engineers, easy access to analytic engines is critical for streamlining the creation of big data applications.

Another recent example is Dendrite17—an interesting new graph analysis solution from Lab41. It combines Titan (a distributed graph database), GraphLab (for graph analytics), and a front-end that leverages AngularJS, into a Graph exploration and analysis tool for business analysts:


Users of Spark explore Spark Streaming because similar code for batch (Spark) can, with minor modification, be used for realtime (Spark Streaming) computations. Along these lines, Summingbird—an open source library from Twitter—offers something similar for Hadoop MapReduce and Storm. With Summingbird, programs that look like Scala collection transformations can be executed in batch (Scalding) or realtime (Storm).

In some instances the underlying techniques from a set of tools makes its way into others. The DeepDive team at Stanford just recently revamped their information extraction and natural language understanding system. But already techniques used in DeepDive have found their way into many other systems including MADlib, Cloudera Impala, “a product from Oracle,” and Google Brain.

1 General-purpose platforms and components are helpful, but they usually need to be “tweaked” or “optimized” to solve problems in a variety of domains.

2 This post grew out of a recent conversation with Guavus founder, Anukool Lakhina.

3 When I started working as a quant at a hedge fund, traders always warned me that correlations jump to 1 during market panics.

4 The best example comes from finance and involves the S&P 500 and butter production in Bangladesh.

5 For this short piece, I’m skipping the many tabular data structures and columnar storage projects in the Hadoop ecosystem, and I’m focusing on the new tools that target (or were created by) data scientists.

6 Full disclosure: I am an advisor to Databricks—a start-up commercializing Apache Spark.

7 DDF is an ambitious project that aims to simplify big data analytics for users across languages and compute engines. It can be accessed using other languages including Python, Scala, and Java. It is also designed for multiple engines. In a demo, data from an HBase table is read into a DDF, data cleansing and machine learning operations are performed on it using Spark, and results are written back out to S3, all using DDF idioms.

8 The quote from Alon Halevy, Peter Norvig, and Fernando Pereira is associated with big data. But features are just as important in small data problems. Read through the Kaggle blog and you quickly realize that winning entries spend a lot of their time on feature engineering.

9 In the process documents usually get converted into structures that algorithms can handle (vectors).

10 Once can for example create composite (e.g. linear combination) features out of existing ones.

11 From Materialization Optimizations for Feature Selection Workloads: “Using credit score as a feature is considered a discriminatory practice by the insurance commissions in both California and Massachusetts.”

12 Stepwise procedures in statistical regression is a familiar example.

13 The Stanford research team designed their feature selection tool after talking to data analysts at several companies. The goal of their project was to increase analyst productivity.

14 Full disclosure: I’m an advisor to SparkBeyond.

15 Full disclosure: I am an advisor to Databricks—a startup commercializing Spark.

16 Some potential applications of Spark and Spark Streaming include stream processing and mining, interactive and iterative computing, machine-learning, and graph analytics.

17 Hat tip to Danny Bickson.

Get Big Data Now: 2014 Edition now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.