In Chapter 3, we looked at the first step of the OSEMN model for data science: obtaining data from a variety of sources. It’s not uncommon for this data to have missing values, inconsistencies, errors, weird characters, or uninteresting columns. Sometimes we only need a specific portion of the data. And sometimes we need the data to be in a different format. In those cases, we have to clean, or scrub, the data before we can move on to the third step: exploring data.
The data we obtained in Chapter 3 can come in a variety of formats. The most common ones are plain text, CSV, JSON, and HTML/XML. Because most command-line tools operate on one format only, it is worthwhile to be able to convert data from one format to another.
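As a minimal sketch of such a conversion (the sample data here is invented for illustration), a CSV file can be turned into a tab-separated one with nothing more than tr. This naive approach assumes that no field contains an embedded comma:

```shell
# Convert CSV to TSV by replacing every comma with a tab.
# Only safe when fields contain no embedded commas or quotes.
printf 'name,age\nalice,31\nbob,27\n' | tr ',' '\t'
```

For real-world CSV with quoted fields, a dedicated tool is needed; this one-liner just shows how cheap a format change can be on the command line.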
CSV, the main format we’re working with in this chapter, is actually not the easiest format to work with. Unlike XML and JSON, CSV has no single standard syntax, so many CSV data sets are broken or incompatible with one another.
Once our data is in the format we want it to be, we can apply common scrubbing operations. These include filtering, replacing, and merging data. The command line is especially well-suited for these kinds of operations, as many powerful command-line tools are optimized for handling large amounts of data. Tools that we’ll discuss in this chapter include classic ones such as:
cut (Ihnat, MacKenzie, & Meyering, 2012) and
sed (Fenlason, Lord, Pizzini, & Bonzini, 2012), and newer ones such ...
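To give a first taste of the two classic tools above, here is a minimal sketch (the sample data and the substitution are invented for illustration): cut selects one column from delimited input, and sed performs a textual replacement on it.

```shell
# Select the second comma-delimited column with cut,
# then use sed to substitute one value for another.
# Assumes fields contain no embedded commas or quotes.
printf 'name,city\nalice,berlin\nbob,paris\n' |
cut -d , -f 2 |
sed 's/berlin/amsterdam/'
```

Both tools read from standard input and write to standard output, which is what makes them so easy to chain into pipelines like this one.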