Chapter 10. Polyglot Data Science

A polyglot is someone who speaks multiple languages. A polyglot data scientist, as I see it, is someone who uses multiple programming languages, tools, and techniques to obtain, scrub, explore, and model data.

The command line stimulates a polyglot approach. The command line doesn’t care which programming language a tool is written in, as long as it adheres to the Unix philosophy. We saw that very clearly in Chapter 4, where we created command-line tools in Bash, Python, and R. Moreover, we executed SQL queries directly on CSV files and executed R expressions from the command line. In short, we have already been doing polyglot data science without fully realizing it!

In this chapter I’m going to take this further by flipping it around. I’m going to show you how to leverage the command line from various programming languages and environments. Because let’s be honest: we’re not going to spend our entire data science careers at the command line. As for me, when I’m analyzing some data, I often use the RStudio integrated development environment (IDE); and when I’m implementing something, I often use Python. I use whatever helps me get the job done.
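To make the idea of "flipping it around" concrete, here is a minimal sketch of calling a command-line tool from Python with the standard library's `subprocess` module. It is only an illustration, not necessarily the approach this chapter goes on to use: the helper name `run_command` and the sample CSV text are hypothetical, and the sketch assumes a Unix-like system where `wc` is on the `PATH`.

```python
import subprocess


def run_command(args, input_text=None):
    """Run a command-line tool and return its standard output as text.

    `run_command` is a hypothetical helper written for this sketch.
    check=True raises CalledProcessError if the tool exits with a non-zero status.
    """
    result = subprocess.run(
        args,
        input=input_text,      # text sent to the tool's standard input
        capture_output=True,   # collect stdout and stderr instead of printing them
        text=True,             # work with str rather than bytes
        check=True,
    )
    return result.stdout


# Made-up sample data: a header row plus two records.
csv_text = "name,score\nalice,3\nbob,5\n"

# Let `wc -l` count the lines; strip() handles any padding in its output.
line_count = int(run_command(["wc", "-l"], input_text=csv_text).strip())
print(line_count)
```

Passing the command as a list of arguments, rather than as a single shell string, avoids quoting problems and keeps the shell out of the picture unless you really need it.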

I find it comforting to know that the command line is often within arm’s reach: I can quickly run a command without switching to a separate application and breaking my workflow. Examples are downloading files with curl, inspecting a piece of ...
