Data Science at the Command Line
by Jeroen Janssens
October 2014
Beginner to intermediate
210 pages
4h 32m
English
O'Reilly Media, Inc.

Chapter 6. Managing Your Data Workflow

We hope that by now you have come to appreciate that the command line is a very convenient environment for doing data science. You may have noticed that, as a consequence of working at the command line, we:

  • Invoke many different commands

  • Create custom and ad-hoc command-line tools

  • Obtain and generate many (intermediate) files

As this process is of an exploratory nature, our workflow tends to be rather chaotic, which makes it difficult to keep track of what we’ve done. It’s very important that our steps can be reproduced, whether by ourselves or by others. When we, for example, continue with a project from a few weeks earlier, chances are that we have forgotten which commands we have run, on which files, in which order, and with which parameters. Imagine the difficulty of passing on your analysis to a collaborator.

You may recover some lost commands by digging into your Bash history, but this is, of course, not a good approach. A better approach would be to save your commands to a Bash script, such as run.sh. This allows you and your collaborators to at least reproduce the analysis. A shell script is, however, a suboptimal approach because:

  • It’s difficult to read and to maintain.

  • Dependencies between steps are unclear.

  • Every step gets executed every time, which is inefficient and sometimes undesirable.
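
To make the last two points concrete, here is a minimal sketch of what a hand-written run.sh might look like once we try to avoid rerunning every step. The file names (raw.txt, sorted.txt, counts.txt), the three steps, and the helper function stale are all hypothetical and chosen only for illustration. Note how the dependencies between steps have to be spelled out by hand in each condition:

```shell
#!/usr/bin/env bash
# Sketch of a hypothetical run.sh; all file names are invented for illustration.
set -eu

workdir=$(mktemp -d)   # scratch directory so the sketch touches nothing else
cd "$workdir"
steps_run=0            # counts how many steps actually execute

# Succeed if the target is missing or older than its source.
stale() {
  local target=$1 source=$2
  [ ! -e "$target" ] || [ "$source" -nt "$target" ]
}

# Run the whole analysis, skipping steps whose output is up to date.
analyze() {
  # Step 1: obtain the raw data (a stand-in for a download).
  if [ ! -e raw.txt ]; then
    printf 'b\na\nb\nc\n' > raw.txt
    steps_run=$((steps_run + 1))
  fi
  # Step 2: sort the data; depends on raw.txt.
  if stale sorted.txt raw.txt; then
    sort raw.txt > sorted.txt
    steps_run=$((steps_run + 1))
  fi
  # Step 3: count duplicate lines; depends on sorted.txt.
  if stale counts.txt sorted.txt; then
    uniq -c sorted.txt > counts.txt
    steps_run=$((steps_run + 1))
  fi
}

analyze   # first run: all three steps execute
analyze   # second run: nothing is stale, so nothing executes
echo "steps executed: $steps_run"
```

Without the conditions, the second call would have repeated all three steps. With them, the script becomes harder to read, and every new step means another hand-maintained dependency check.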

This is where Drake comes in handy (Factual, 2014). Drake is a command-line tool created by Factual that allows you to:

  • Formalize ...

ISBN: 9781491947845