Data Science at the Command Line, 2nd Edition

Book description

This thoroughly revised guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You'll learn how to combine small yet powerful command-line tools to quickly obtain, scrub, explore, and model your data. To get you started, author Jeroen Janssens provides a Docker image packed with over 100 Unix power tools, useful whether you work with Windows, macOS, or Linux.

You'll quickly discover why the command line is an agile, scalable, and extensible technology. Even if you're comfortable processing data with Python or R, you'll learn how to greatly improve your data science workflow by leveraging the command line's power. This book is ideal for data scientists, analysts, engineers, system administrators, and researchers.

  • Obtain data from websites, APIs, databases, and spreadsheets
  • Scrub text, CSV, HTML, XML, and JSON files
  • Explore data, compute descriptive statistics, and create visualizations
  • Manage your data science workflow
  • Create your own tools from one-liners and existing Python or R code
  • Parallelize and distribute data-intensive pipelines
  • Model data with dimensionality reduction, regression, and classification algorithms
  • Leverage the command line from Python, Jupyter, R, RStudio, and Apache Spark

Table of contents

  1. Foreword
  2. Preface
    1. What to Expect from This Book
    2. Changes for the Second Edition
    3. How to Read This Book
    4. Who This Book Is For
    5. Conventions Used in This Book
    6. O’Reilly Online Learning
    7. How to Contact Us
    8. Acknowledgments for the Second Edition (2021)
    9. Acknowledgments for the First Edition (2014)
  3. 1. Introduction
    1. Data Science Is OSEMN
      1. Obtaining Data
      2. Scrubbing Data
      3. Exploring Data
      4. Modeling Data
      5. Interpreting Data
    2. Intermezzo Chapters
    3. What Is the Command Line?
    4. Why Data Science at the Command Line?
      1. The Command Line Is Agile
      2. The Command Line Is Augmenting
      3. The Command Line Is Scalable
      4. The Command Line Is Extensible
      5. The Command Line Is Ubiquitous
    5. Summary
    6. For Further Exploration
  4. 2. Getting Started
    1. Getting the Data
    2. Installing the Docker Image
    3. Essential Unix Concepts
      1. The Environment
      2. Executing a Command-Line Tool
      3. Five Types of Command-Line Tools
      4. Combining Command-Line Tools
      5. Redirecting Input and Output
      6. Working with Files and Directories
      7. Managing Output
      8. Help!
    4. Summary
    5. For Further Exploration
  5. 3. Obtaining Data
    1. Overview
    2. Copying Local Files to the Docker Container
    3. Downloading from the Internet
      1. Introducing curl
      2. Saving
      3. Other Protocols
      4. Following Redirects
    4. Decompressing Files
    5. Converting Microsoft Excel Spreadsheets to CSV
    6. Querying Relational Databases
    7. Calling Web APIs
      1. Authentication
      2. Streaming APIs
    8. Summary
    9. For Further Exploration
  6. 4. Creating Command-Line Tools
    1. Overview
    2. Converting One-Liners into Shell Scripts
      1. Step 1: Create a File
      2. Step 2: Give Permission to Execute
      3. Step 3: Define a Shebang
      4. Step 4: Remove the Fixed Input
      5. Step 5: Add Arguments
      6. Step 6: Extend Your PATH
    3. Creating Command-Line Tools with Python and R
      1. Porting the Shell Script
      2. Processing Streaming Data from Standard Input
    4. Summary
    5. For Further Exploration
  7. 5. Scrubbing Data
    1. Overview
    2. Transformations, Transformations Everywhere
    3. Plain Text
      1. Filtering Lines
      2. Extracting Values
      3. Replacing and Deleting Values
    4. CSV
      1. Bodies and Headers and Columns, Oh My!
      2. Performing SQL Queries on CSV
      3. Extracting and Reordering Columns
      4. Filtering Rows
      5. Merging Columns
      6. Combining Multiple CSV Files
    5. Working with XML/HTML and JSON
    6. Summary
    7. For Further Exploration
  8. 6. Project Management with Make
    1. Overview
    2. Introducing Make
    3. Running Tasks
    4. Building, for Real
    5. Adding Dependencies
    6. Summary
    7. For Further Exploration
  9. 7. Exploring Data
    1. Overview
    2. Inspecting Data and Its Properties
      1. Header or Not, Here I Come
      2. Inspect All the Data
      3. Feature Names and Data Types
      4. Unique Identifiers, Continuous Variables, and Factors
    3. Computing Descriptive Statistics
      1. Column Statistics
      2. R One-Liners on the Shell
    4. Creating Visualizations
      1. Displaying Images from the Command Line
      2. Plotting in a Rush
      3. Creating Bar Charts
      4. Creating Histograms
      5. Creating Density Plots
      6. Happy Little Accidents
      7. Creating Scatter Plots
      8. Creating Trend Lines
      9. Creating Box Plots
      10. Adding Labels
      11. Going Beyond Basic Plots
    5. Summary
    6. For Further Exploration
  10. 8. Parallel Pipelines
    1. Overview
    2. Serial Processing
      1. Looping Over Numbers
      2. Looping Over Lines
      3. Looping Over Files
    3. Parallel Processing
      1. Introducing GNU Parallel
      2. Specifying Input
      3. Controlling the Number of Concurrent Jobs
      4. Logging and Output
      5. Creating Parallel Tools
    4. Distributed Processing
      1. Get List of Running AWS EC2 Instances
      2. Running Commands on Remote Machines
      3. Distributing Local Data Among Remote Machines
      4. Processing Files on Remote Machines
    5. Summary
    6. For Further Exploration
  11. 9. Modeling Data
    1. Overview
    2. More Wine, Please!
    3. Dimensionality Reduction with Tapkee
      1. Introducing Tapkee
      2. Linear and Nonlinear Mappings
    4. Regression with Vowpal Wabbit
      1. Preparing the Data
      2. Training the Model
      3. Testing the Model
    5. Classification with SciKit-Learn Laboratory
      1. Preparing the Data
      2. Running the Experiment
      3. Parsing the Results
    6. Summary
    7. For Further Exploration
  12. 10. Polyglot Data Science
    1. Overview
    2. Jupyter
    3. Python
    4. R
    5. RStudio
    6. Apache Spark
    7. Summary
    8. For Further Exploration
  13. 11. Conclusion
    1. Let’s Recap
    2. Three Pieces of Advice
      1. Be Patient
      2. Be Creative
      3. Be Practical
    3. Where to Go from Here
      1. The Command Line
      2. Shell Programming
      3. Python, R, and SQL
      4. APIs
      5. Machine Learning
    4. Getting in Touch
  14. A. List of Command-Line Tools
    1. alias
    2. awk
    3. aws
    4. bash
    5. bat
    6. bc
    7. body
    8. cat
    9. cd
    10. chmod
    11. cols
    12. column
    13. cowsay
    14. cp
    15. csv2vw
    16. csvcut
    17. csvgrep
    18. csvjoin
    19. csvlook
    20. csvquote
    21. csvsort
    22. csvsql
    23. csvstack
    24. csvstat
    25. curl
    26. cut
    27. display
    28. dseq
    29. echo
    30. env
    31. export
    32. fc
    33. find
    34. fold
    35. for
    36. fx
    37. git
    38. grep
    39. gron
    40. head
    41. header
    42. history
    43. hostname
    44. in2csv
    45. jq
    46. json2csv
    47. l
    48. less
    49. ls
    50. make
    51. man
    52. mkdir
    53. mv
    54. nano
    55. nl
    56. parallel
    57. paste
    58. pbc
    59. pip
    60. pup
    61. pwd
    62. python
    63. R
    64. rev
    65. rm
    66. rush
    67. sample
    68. scp
    69. sed
    70. seq
    71. servewd
    72. shuf
    73. skll
    74. sort
    75. split
    76. sponge
    77. sql2csv
    78. ssh
    79. sudo
    80. tail
    81. tapkee
    82. tar
    83. tee
    84. telnet
    85. tldr
    86. tr
    87. tree
    88. trim
    89. ts
    90. type
    91. uniq
    92. unpack
    93. unrar
    94. unzip
    95. vw
    96. wc
    97. which
    98. xml2json
    99. xmlstarlet
    100. xsv
    101. zcat
    102. zsh
  15. Index

Product information

  • Title: Data Science at the Command Line, 2nd Edition
  • Author(s): Jeroen Janssens
  • Release date: August 2021
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492087915