Data Science at the Command Line, 2nd Edition

by Jeroen Janssens
August 2021
Content level: Beginner to intermediate
280 pages
6h 12m
English
O'Reilly Media, Inc.
Content preview from Data Science at the Command Line, 2nd Edition

Chapter 8. Parallel Pipelines

In the previous chapters, we’ve been dealing with commands and pipelines that take care of an entire task at once. In practice, however, you may find yourself facing a task that requires the same command or pipeline to run multiple times. For example, you may need to:

  • Scrape hundreds of web pages

  • Make dozens of API calls and transform their output

  • Train a classifier for a range of parameter values

  • Generate scatter plots for every pair of features in your dataset

In any of these examples, there’s a certain form of repetition involved. With your favorite scripting or programming language, you could take care of this with a for loop or a while loop. On the command line, the first thing you might be inclined to do is to press the up arrow key to bring back the previous command, modify it if necessary, and press Enter to run the command again. This is fine to do two or three times, but imagine doing it dozens of times. Such an approach quickly becomes cumbersome, inefficient, and prone to errors. The good news is that you can write such loops on the command line as well. That’s what this chapter is all about.
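As a minimal sketch of such loops in a POSIX-style shell: the function names, parameter values, and echoed messages below are hypothetical placeholders, standing in for whatever command you would otherwise retype by hand.

```shell
# Repeat one command for each value in a fixed list (a for loop).
# The depth values are made up for illustration.
train_all() {
  for depth in 2 4 8; do
    # Placeholder for the real command, e.g. a model-training step.
    echo "training with depth=${depth}"
  done
}

# Repeat one command for each line of standard input (a while loop).
process_names() {
  while read -r name; do
    # Placeholder for the real per-item command.
    echo "processing ${name}"
  done
}

train_all
printf '%s\n' alpha beta gamma | process_names
```

The for loop suits a short, known list of values; the while loop suits input that arrives as lines from a file or another command.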

Sometimes, repeating a fast command again and again in succession (in a serial manner) is sufficient. When you have multiple cores (and perhaps even multiple machines), it would be nice to make use of those, especially when you’re faced with a data-intensive task. Using multiple cores or machines may reduce the total running time significantly. ...
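The difference between serial and parallel execution can be sketched with plain shell job control. This is only an illustration under stated assumptions, not necessarily the approach the chapter goes on to use: `slow_task` is a hypothetical stand-in for a slow command, and `sleep` simulates its work.

```shell
# Hypothetical slow command; sleep simulates one second of work.
slow_task() {
  sleep 1
  echo "done: $1"
}

# Serial: each call starts only after the previous one has finished,
# so three items take roughly three seconds.
run_serial() {
  for item in a b c; do
    slow_task "$item"
  done
}

# Parallel: start every call in the background with &, then wait for
# all of them, so three items take roughly one second.
run_parallel() {
  for item in a b c; do
    slow_task "$item" &
  done
  wait
}

run_serial
run_parallel
```

Note that the parallel version starts every job at once with no limit on concurrency, and the order of its output lines is not guaranteed. Both become real concerns when the list has hundreds of items.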


Publisher Resources

ISBN: 9781492087908