Data Science at the Command Line, 2nd Edition

by Jeroen Janssens
August 2021
Content level: Beginner to intermediate
280 pages
6h 12m
English
O'Reilly Media, Inc.

Chapter 3. Obtaining Data

This chapter deals with the first step of the OSEMN model: obtaining data. After all, without any data, there is not much data science that we can do. I assume that the data you need to solve your data science problem already exists. Your first task is to get this data onto your computer (and possibly also inside the Docker container) in a form that you can work with.

According to the Unix philosophy, text is a universal interface. Almost every command-line tool takes text as input, produces text as output, or both. This is the main reason why command-line tools can work so well together. However, as we’ll see, even just text can come in multiple forms.
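As a small illustration of text acting as the interface between tools, the following sketch chains a few standard utilities. The sample words are made up for this example and do not come from the book's data.

```shell
# Hypothetical sample input: one word per line.
words='pear
apple
pear
fig
pear
apple'

# Each tool reads text on stdin and writes text on stdout:
#   sort      groups identical lines together
#   uniq -c   collapses each group into "count word"
#   sort -rn  orders by count, highest first
#   head      keeps only the top line
#   awk       swaps the columns to "word count"
top=$(printf '%s\n' "$words" | sort | uniq -c | sort -rn | head -n 1 | awk '{print $2, $1}')

echo "$top"
```

None of these tools knows anything about the others; they cooperate only because each one emits plain lines of text that the next one can parse.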

Data can be obtained in several ways, for example by downloading it from a server, querying a database, or connecting to a web API. Sometimes the data comes in a compressed form or in a binary format such as a Microsoft Excel spreadsheet. In this chapter, I discuss several tools that help tackle this from the command line, including curl, in2csv, sql2csv, and tar.
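To give a feel for the compressed-data case, here is a minimal sketch using tar. It builds its own throwaway archive so that it runs anywhere; the directory and file names (logs, example.log, example.tar.gz) are hypothetical and are not the chapter's files.

```shell
# Work in a temporary directory so nothing existing is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Build a tiny gzip-compressed archive (names are hypothetical).
mkdir logs
printf 'line one\nline two\n' > logs/example.log
tar -czf example.tar.gz logs
rm -r logs

# List the archive contents without extracting anything.
listing=$(tar -tzf example.tar.gz)
echo "$listing"

# Extract the archive, then count the lines in the recovered file.
tar -xzf example.tar.gz
line_count=$(wc -l < logs/example.log | tr -d ' ')
echo "$line_count"
```

Listing with -t before extracting with -x is a cautious habit: it shows what an unfamiliar archive would write into the current directory before anything is written.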

Overview

In this chapter, you’ll learn how to:

  • Copy local files to the Docker image

  • Download data from the internet

  • Decompress files

  • Extract data from spreadsheets

  • Query relational databases

  • Call web APIs

This chapter starts with the following files:

$ cd /data/ch03
 
$ l
total 924K
-rw-r--r-- 1 dst dst 627K Jun 29 14:26 logs.tar.gz
-rw-r--r-- 1 dst dst 189K Jun 29 14:26 r-datasets.db
-rw-r--r-- 1 dst dst  149 Jun 29 14:26 tmnt-basic.csv
...
ISBN: 9781492087908