Chapter 3. Data Pipelines
Engineering and optimizing data pipelines remains an area of particular interest as researchers work to improve efficiency and scale pipelines to very large data sets. Workflow tools that enable users to build pipelines have also become more common—these days, such tools exist for data engineers, data scientists, and even business analysts. In this chapter, we present a collection of blog posts and podcasts that cover the latest thinking in the realm of data pipelines.
First, Ben Lorica explains why interactions between parts of a pipeline are an area of active research, and why we need tools to enable users to build certifiable machine learning pipelines. Michael Li then explores three best practices for building successful pipelines—reproducibility, consistency, and productionizability. Next, Kiyoto Tamura explores the ideal frameworks for collecting, parsing, and archiving logs, and also outlines the value of JSON as a unifying format. Finally, Gwen Shapira discusses how to simplify backend A/B testing using Kafka.
Building and Deploying Large-Scale Machine Learning Pipelines
There are many algorithms with implementations that scale to large data sets (this list includes matrix factorization, SVM, logistic regression, LASSO, and many others). In fact, machine learning experts are fond of pointing out: if you can pose your problem as a simple optimization problem, then you’re ...
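To make the "pose it as an optimization problem" point concrete, here is a minimal sketch (not from the original text; all names and parameters are illustrative) of LASSO framed as minimizing a squared-error loss plus an L1 penalty, solved with mini-batch proximal gradient steps so it can be run over data that arrives in chunks:

```python
import numpy as np

def lasso_sgd(X, y, lam=0.1, lr=0.01, epochs=10, batch_size=256, seed=0):
    """Illustrative sketch: minimize (1/2n)||Xw - y||^2 + lam * ||w||_1
    with mini-batch (proximal) gradient steps."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
            Xb, yb = X[idx], y[idx]
            # gradient of the squared-error term on this mini-batch
            grad = Xb.T @ (Xb @ w - yb) / len(idx)
            w -= lr * grad
            # soft-thresholding (proximal step) handles the L1 penalty
            w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w

# Toy usage: recover a sparse weight vector from noisy observations.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + 0.1 * rng.normal(size=5000)
print(np.round(lasso_sgd(X, y), 2))
```

Because each update touches only one mini-batch, the same pattern extends to data sets too large to fit in memory, which is one reason the optimization framing is so attractive at scale.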