Chapter 3. Obtaining Data

This chapter deals with the first step of the OSEMN model: obtaining data. After all, without any data, there is not much data science that we can do. I assume that the data you need to solve your data science problem already exists. Your first task is to get this data onto your computer (and possibly also inside the Docker container) in a form that you can work with.

According to the Unix philosophy, text is a universal interface. Almost every command-line tool takes text as input, produces text as output, or both. This is the main reason why command-line tools can work so well together. However, as we’ll see, even just text can come in multiple forms.
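To make the idea of text as a universal interface concrete, here is a minimal sketch of a pipeline built only from standard tools. The sample words are made up for illustration and are not part of the chapter's data. Each tool reads plain text from standard input and writes plain text to standard output, which is what lets them be chained.

```shell
# Hypothetical sample input: one word per line (not from the book's data).
words=$(printf 'pear\napple\npear\nfig\napple\npear\n')

# Every stage consumes text and produces text:
#   sort      puts identical lines next to each other
#   uniq -c   collapses them and prefixes each with a count
#   sort -rn  orders by that count, largest first
#   head      keeps only the first line
#   awk       swaps the columns to "word count"
top=$(printf '%s\n' "$words" | sort | uniq -c | sort -rn | head -n 1 | awk '{print $2, $1}')

echo "$top"
```

None of these tools knows anything about the others; they cooperate only because each one speaks lines of text.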

Data can be obtained in several ways: for example, by downloading it from a server, querying a database, or connecting to a web API. Sometimes the data comes in a compressed form or in a binary format such as a Microsoft Excel spreadsheet. In this chapter, I discuss several tools that help handle these cases from the command line, including curl, in2csv, sql2csv, and tar.
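As a small taste of the compressed-data case, the following sketch builds a throwaway archive in a temporary directory, lists its contents, and extracts it with tar. The file names (sample.tar.gz, logs/example.log) are assumptions chosen for illustration; they stand in for an archive such as the chapter's logs.tar.gz, whose contents are not shown here.

```shell
# Work in a scratch directory so no existing files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# Build a tiny stand-in archive (hypothetical names, not the book's data).
mkdir logs
printf 'line one\nline two\n' > logs/example.log
tar -czf sample.tar.gz logs
rm -r logs

# List the archive's contents without extracting anything.
listing=$(tar -tzf sample.tar.gz)

# Extract the archive, then count the lines in the recovered file.
tar -xzf sample.tar.gz
line_count=$(wc -l < logs/example.log | tr -d ' ')

echo "$listing"
echo "$line_count"
```

Listing an archive before extracting it is a cheap way to check what it will write into the current directory.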

Overview

In this chapter, you’ll learn how to:

  • Copy local files to the Docker container

  • Download data from the internet

  • Decompress files

  • Extract data from spreadsheets

  • Query relational databases

  • Call web APIs

This chapter starts with the following files:

$ cd /data/ch03
 
$ l
total 924K
-rw-r--r-- 1 dst dst 627K Jun 29 14:26 logs.tar.gz
-rw-r--r-- 1 dst dst 189K Jun 29 14:26 r-datasets.db
-rw-r--r-- 1 dst dst  149 Jun 29 14:26 tmnt-basic.csv
...
