Data Science at the Command Line

by Jeroen Janssens
October 2014
Beginner to intermediate
210 pages
4h 32m
English
O'Reilly Media, Inc.

Chapter 8. Parallel Pipelines

In the previous chapters, we’ve been dealing with commands and pipelines that take care of an entire task at once. In practice, however, you may find yourself facing a task that requires the same command or pipeline to run multiple times. For example, you may need to:

  • Scrape hundreds of web pages

  • Make dozens of API calls and transform their output

  • Train a classifier for a range of parameter values

  • Generate scatter plots for every pair of features in your data set

Each of these examples involves some form of repetition. In your favorite scripting or programming language, you would take care of this with a for loop or a while loop. On the command line, the first thing you might be inclined to do is to press <Up> (which brings back the previous command), modify the command if necessary, and press <Enter> (which runs the command again). This is fine for two or three repetitions, but imagine doing it for, say, dozens of files. Such an approach quickly becomes cumbersome and inefficient. The good news is that we can write for and while loops on the command line as well.
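As a minimal sketch of what such loops can look like in Bash (the item values and the variable names for_output and while_output are placeholders chosen for illustration, not taken from the book), a for loop iterates over a fixed list of values, while a while loop can consume input one line at a time:

```shell
#!/usr/bin/env bash

# for loop: run the same command once per value in a fixed list.
# The values 1 2 3 are placeholders; they could be filenames or URLs.
for_output=$(
  for n in 1 2 3; do
    echo "processing item ${n}"
  done
)
echo "${for_output}"

# while loop: run the same command once per line of standard input.
# printf stands in for any command that emits one item per line.
while_output=$(
  printf '%s\n' alpha beta gamma |
    while read -r line; do
      echo "line: ${line}"
    done
)
echo "${while_output}"
```

The output is captured in variables here only so that it can be inspected afterward; at an interactive prompt the loops can be typed directly without the surrounding command substitution.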

Sometimes, repeating fast commands one after another (in serial) is sufficient. When you have multiple cores (and perhaps even multiple machines) it would be nice if you could make use of those, especially when you’re faced with a data-intensive task. When using multiple cores or machines, the total running time may be reduced significantly. In this chapter, we’ll introduce ...
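The difference between serial and parallel execution can be illustrated using nothing but shell built-ins. The sketch below is an assumption-laden illustration rather than the approach the chapter goes on to introduce (this preview does not show it): slow_task is a hypothetical stand-in for a data-intensive command, and the parallel variant simply starts each task as a background job with & and then calls wait.

```shell
#!/usr/bin/env bash

# slow_task is a hypothetical placeholder for a time-consuming command.
slow_task() {
  sleep 1
  echo "done $1"
}

# Serial: each task starts only after the previous one has finished.
serial_output=$(
  for i in 1 2 3; do
    slow_task "$i"
  done
)

# Parallel: start every task as a background job, then wait for all of them.
# The order of the output lines is not guaranteed, so sort before comparing.
parallel_output=$(
  for i in 1 2 3; do
    slow_task "$i" &
  done
  wait
)

echo "${serial_output}"
echo "${parallel_output}" | sort
```

Background jobs like these give no control over how many tasks run at once, which is one reason dedicated tools are usually preferred for larger workloads.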



Publisher Resources

ISBN: 9781491947845