Chapter 8. Parallel Pipelines

In the previous chapters, we’ve been dealing with commands and pipelines that take care of an entire task at once. In practice, however, you may find yourself facing a task that requires the same command or pipeline to run multiple times. For example, you may need to:

  • Scrape hundreds of web pages

  • Make dozens of API calls and transform their output

  • Train a classifier for a range of parameter values

  • Generate scatter plots for every pair of features in your dataset

In any of these examples, there’s a certain form of repetition involved. With your favorite scripting or programming language, you could take care of this with a for loop or a while loop. On the command line, the first thing you might be inclined to do is to press the up arrow key to bring back the previous command, modify it if necessary, and press Enter to run the command again. This is fine to do two or three times, but imagine doing it dozens of times. Such an approach quickly becomes cumbersome, inefficient, and prone to errors. The good news is that you can write such loops on the command line as well. That’s what this chapter is all about.
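As a minimal sketch of such a loop, assuming a POSIX-compatible shell such as Bash, the following runs the same command once for each value in a list. The squaring step is only a placeholder for a real command or pipeline, and the names `n` and `results` are made up for illustration.

```shell
# Run the same command once per value, collecting the output of every run.
# The arithmetic below is a stand-in for a real command or pipeline.
results=$(
  for n in 1 2 3 4 5; do
    echo "$((n * n))"
  done
)

# Print one result per line, in the order the loop produced them.
echo "$results"
```

Because the loop body runs once per value and each run finishes before the next starts, the results come out in the same order as the input values.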

Sometimes, running a fast command repeatedly, one run after another (that is, serially), is sufficient. But when you have multiple cores (and perhaps even multiple machines), it would be nice to make use of those, especially when you’re faced with a data-intensive task. Using multiple cores or machines may reduce the total running time significantly. ...
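To make the difference between serial and parallel execution concrete, here is a hedged sketch that runs the same placeholder command both ways. It assumes an `xargs` that supports the `-P` option (GNU and BSD versions do; it is not part of POSIX) and that `seq` is available. Using `xargs` here is an assumption made for illustration, not necessarily the tool this chapter goes on to use.

```shell
# Serial: xargs starts one process per input value and waits for each
# to finish before starting the next.
serial=$(seq 1 8 | xargs -n 1 sh -c 'echo $(( $1 * $1 ))' _ | sort -n)

# Parallel: -P 4 lets up to four processes run at the same time.
# Completion order is not guaranteed, so the output is sorted afterward.
parallel=$(seq 1 8 | xargs -n 1 -P 4 sh -c 'echo $(( $1 * $1 ))' _ | sort -n)

# Both approaches produce the same set of results.
echo "$serial"
echo "$parallel"
```

For a command this cheap, the parallel version is unlikely to be noticeably faster, since starting processes dominates the cost. The benefit shows up when each individual run takes a meaningful amount of time.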
