Data Science at the Command Line, 2nd Edition

Name: Data Science at the Command Line, 2nd Edition
Author: Jeroen Janssens
ISBN: 9781492087915

August 2021

Beginner to intermediate

280 pages

6h 12m

English

O'Reilly Media, Inc.

Foreword
Preface
What to Expect from This Book; Changes for the Second Edition; How to Read This Book; Who This Book Is For; Conventions Used in This Book; O’Reilly Online Learning; How to Contact Us; Acknowledgments for the Second Edition (2021); Acknowledgments for the First Edition (2014)
1. Introduction
Data Science Is OSEMN; Obtaining Data; Scrubbing Data; Exploring Data; Modeling Data; Interpreting Data; Intermezzo Chapters; What Is the Command Line?; Why Data Science at the Command Line?; The Command Line Is Agile; The Command Line Is Augmenting; The Command Line Is Scalable; The Command Line Is Extensible; The Command Line Is Ubiquitous; Summary; For Further Exploration
2. Getting Started
Getting the Data; Installing the Docker Image; Essential Unix Concepts; The Environment; Executing a Command-Line Tool; Five Types of Command-Line Tools; Combining Command-Line Tools; Redirecting Input and Output; Working with Files and Directories; Managing Output; Help!; Summary; For Further Exploration
3. Obtaining Data
Overview; Copying Local Files to the Docker Container; Downloading from the Internet; Introducing curl; Saving; Other Protocols; Following Redirects; Decompressing Files; Converting Microsoft Excel Spreadsheets to CSV; Querying Relational Databases; Calling Web APIs; Authentication; Streaming APIs; Summary; For Further Exploration
4. Creating Command-Line Tools
Overview; Converting One-Liners into Shell Scripts; Step 1: Create a File; Step 2: Give Permission to Execute; Step 3: Define a Shebang; Step 4: Remove the Fixed Input; Step 5: Add Arguments; Step 6: Extend Your PATH; Creating Command-Line Tools with Python and R; Porting the Shell Script; Processing Streaming Data from Standard Input; Summary; For Further Exploration
5. Scrubbing Data
Overview; Transformations, Transformations Everywhere; Plain Text; Filtering Lines; Extracting Values; Replacing and Deleting Values; CSV; Bodies and Headers and Columns, Oh My!; Performing SQL Queries on CSV; Extracting and Reordering Columns; Filtering Rows; Merging Columns; Combining Multiple CSV Files; Working with XML/HTML and JSON; Summary; For Further Exploration
6. Project Management with Make
Overview; Introducing Make; Running Tasks; Building, for Real; Adding Dependencies; Summary; For Further Exploration
7. Exploring Data
Overview; Inspecting Data and Its Properties; Header or Not, Here I Come; Inspect All the Data; Feature Names and Data Types; Unique Identifiers, Continuous Variables, and Factors; Computing Descriptive Statistics; Column Statistics; R One-Liners on the Shell; Creating Visualizations; Displaying Images from the Command Line; Plotting in a Rush; Creating Bar Charts; Creating Histograms; Creating Density Plots; Happy Little Accidents; Creating Scatter Plots; Creating Trend Lines; Creating Box Plots; Adding Labels; Going Beyond Basic Plots; Summary; For Further Exploration
8. Parallel Pipelines
Overview; Serial Processing; Looping Over Numbers; Looping Over Lines; Looping Over Files; Parallel Processing; Introducing GNU Parallel; Specifying Input; Controlling the Number of Concurrent Jobs; Logging and Output; Creating Parallel Tools; Distributed Processing; Get List of Running AWS EC2 Instances; Running Commands on Remote Machines; Distributing Local Data Among Remote Machines; Processing Files on Remote Machines; Summary; For Further Exploration
9. Modeling Data
Overview; More Wine, Please!; Dimensionality Reduction with Tapkee; Introducing Tapkee; Linear and Nonlinear Mappings; Regression with Vowpal Wabbit; Preparing the Data; Training the Model; Testing the Model; Classification with SciKit-Learn Laboratory; Preparing the Data; Running the Experiment; Parsing the Results; Summary; For Further Exploration
10. Polyglot Data Science
Overview; Jupyter; Python; R; RStudio; Apache Spark; Summary; For Further Exploration
11. Conclusion
Let’s Recap; Three Pieces of Advice; Be Patient; Be Creative; Be Practical; Where to Go from Here; The Command Line; Shell Programming; Python, R, and SQL; APIs; Machine Learning; Getting in Touch
A. List of Command-Line Tools
alias; awk; aws; bash; bat; bc; body; cat; cd; chmod; cols; column; cowsay; cp; csv2vw; csvcut; csvgrep; csvjoin; csvlook; csvquote; csvsort; csvsql; csvstack; csvstat; curl; cut; display; dseq; echo; env; export; fc; find; fold; for; fx; git; grep; gron; head; header; history; hostname; in2csv; jq; json2csv; l; less; ls; make; man; mkdir; mv; nano; nl; parallel; paste; pbc; pip; pup; pwd; python; R; rev; rm; rush; sample; scp; sed; seq; servewd; shuf; skll; sort; split; sponge; sql2csv; ssh; sudo; tail; tapkee; tar; tee; telnet; tldr; tr; tree; trim; ts; type; uniq; unpack; unrar; unzip; vw; wc; which; xml2json; xmlstarlet; xsv; zcat; zsh
Index

Content preview from Data Science at the Command Line, 2nd Edition

Chapter 3. Obtaining Data

This chapter deals with the first step of the OSEMN model: obtaining data. After all, without any data, there is not much data science that we can do. I assume that the data you need to solve your data science problem already exists. Your first task is to get this data onto your computer (and possibly also inside the Docker container) in a form that you can work with.

According to the Unix philosophy, text is a universal interface. Almost every command-line tool takes text as input, produces text as output, or both. This is the main reason why command-line tools can work so well together. However, as we’ll see, even just text can come in multiple forms.
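To make the idea of text as a shared interface concrete, here is a minimal sketch of a pipeline in which every stage reads plain text and writes plain text. The function name count_values and the sample input are assumptions made for illustration; they do not come from the book.

```shell
# Count how often each input line occurs, most frequent first.
# Every stage reads plain text on standard input and writes plain text
# on standard output, which is what lets the tools be chained together.
count_values() {
  sort |                      # group identical lines next to each other
    uniq -c |                 # collapse each group into "<count> <line>"
    sort -rn |                # order by count, highest first
    awk '{print $2, $1}'      # normalize to "<line> <count>"
}

# Hypothetical input: one tool name per line.
printf 'curl\ntar\ncurl\nin2csv\ncurl\ntar\n' | count_values | head -n 2
```

With this sample input the pipeline prints "curl 3" followed by "tar 2". None of the four tools knows anything about the others; they cooperate only because each one agrees to speak lines of text.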

Data can be obtained in several ways: for example, by downloading it from a server, querying a database, or connecting to a web API. Sometimes the data comes in a compressed form or in a binary format such as a Microsoft Excel spreadsheet. In this chapter, I discuss several tools that help handle these cases from the command line, including curl,¹ in2csv,² sql2csv,³ and tar.⁴
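As a rough sketch of two of these steps, the following script fetches a compressed archive with curl and unpacks it with tar. To stay runnable without a network connection it builds its own small archive and fetches it through a file:// URL, which is an assumption of this example; with a real server only the URL would change. The directory layout and the log line are hypothetical.

```shell
# Work in a throwaway directory so no existing files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a small compressed archive to stand in for remote data.
# The file name and its one line of content are hypothetical.
mkdir logs
printf 'GET /index.html 200\n' > logs/access.log
tar -czf logs.tar.gz logs
rm -r logs

# Fetch the archive. -s hides the progress meter, -o names the output file.
# A file:// URL keeps this sketch offline; an https:// URL works the same way.
curl -s -o copy.tar.gz "file://$workdir/logs.tar.gz"

# List the archive contents without extracting, then extract and read a file.
tar -tzf copy.tar.gz
tar -xzf copy.tar.gz
cat logs/access.log
```

Listing an archive with -t before extracting it with -x is a cheap way to check what a download actually contains.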

Overview

In this chapter, you’ll learn how to:

Copy local files to the Docker container
Download data from the internet
Decompress files
Extract data from spreadsheets
Query relational databases
Call web APIs

This chapter starts with the following files:

$ cd /data/ch03

$ l
total 924K
-rw-r--r-- 1 dst dst 627K Jun 29 14:26 logs.tar.gz
-rw-r--r-- 1 dst dst 189K Jun 29 14:26 r-datasets.db
-rw-r--r-- 1 dst dst  149 Jun 29 14:26 tmnt-basic.csv
...
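The three file types in this listing line up with the tools named in the chapter introduction. The sketch below makes that mapping explicit using nothing but file extensions. The function suggest_tool is hypothetical, and it assumes that r-datasets.db is a relational database that sql2csv can query, which the listing alone does not confirm.

```shell
# Suggest which tool from this chapter fits a file, judged only by its
# extension. This helper is a hypothetical illustration, not from the book.
suggest_tool() {
  case "$1" in
    *.tar.gz)     echo "tar" ;;      # compressed archive: decompress it
    *.db)         echo "sql2csv" ;;  # assumed relational database: query it
    *.xls|*.xlsx) echo "in2csv" ;;   # Excel spreadsheet: convert to CSV
    *.csv)        echo "none" ;;     # already plain text, no conversion
    *)            echo "unknown" ;;
  esac
}

# The three files this chapter starts with.
for f in logs.tar.gz r-datasets.db tmnt-basic.csv; do
  printf '%s\t%s\n' "$f" "$(suggest_tool "$f")"
done
```

An extension is only a hint about a file's format, so treat the suggestion as a starting point rather than a guarantee.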

Publisher Resources

ISBN: 9781492087908