Data Science at the Command Line

Book description

This hands-on guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You’ll learn how to combine small, yet powerful, command-line tools to quickly obtain, scrub, explore, and model your data.

To get you started—whether you’re on Windows, OS X, or Linux—author Jeroen Janssens introduces the Data Science Toolbox, an easy-to-install virtual environment packed with over 80 command-line tools.

Discover why the command line is an agile, scalable, and extensible technology. Even if you’re already comfortable processing data with, say, Python or R, you’ll greatly improve your data science workflow by also leveraging the power of the command line.

  • Obtain data from websites, APIs, databases, and spreadsheets
  • Perform scrub operations on plain text, CSV, HTML/XML, and JSON
  • Explore data, compute descriptive statistics, and create visualizations
  • Manage your data science workflow using Drake
  • Create reusable tools from one-liners and existing Python or R code
  • Parallelize and distribute data-intensive pipelines using GNU Parallel
  • Model data with dimensionality reduction, clustering, regression, and classification algorithms

Table of contents

  1. Preface
    1. What to Expect from This Book
    2. How to Read This Book
    3. Who This Book Is For
    4. Conventions Used in This Book
    5. Using Code Examples
    6. Safari® Books Online
    7. How to Contact Us
    8. Acknowledgments
  2. 1. Introduction
    1. Overview
    2. Data Science Is OSEMN
      1. Obtaining Data
      2. Scrubbing Data
      3. Exploring Data
      4. Modeling Data
      5. Interpreting Data
    3. Intermezzo Chapters
    4. What Is the Command Line?
    5. Why Data Science at the Command Line?
      1. The Command Line Is Agile
      2. The Command Line Is Augmenting
      3. The Command Line Is Scalable
      4. The Command Line Is Extensible
      5. The Command Line Is Ubiquitous
    6. A Real-World Use Case
    7. Further Reading
  3. 2. Getting Started
    1. Overview
    2. Setting Up Your Data Science Toolbox
      1. Step 1: Download and Install VirtualBox
      2. Step 2: Download and Install Vagrant
      3. Step 3: Download and Start the Data Science Toolbox
      4. Step 4: Log In (on Linux and Mac OS X)
      5. Step 4: Log In (on Microsoft Windows)
      6. Step 5: Shut Down or Start Anew
    3. Essential Concepts and Tools
      1. The Environment
      2. Executing a Command-Line Tool
      3. Five Types of Command-Line Tools
      4. Combining Command-Line Tools
      5. Redirecting Input and Output
      6. Working with Files
      7. Help!
    4. Further Reading
  4. 3. Obtaining Data
    1. Overview
    2. Copying Local Files to the Data Science Toolbox
      1. Local Version of Data Science Toolbox
      2. Remote Version of Data Science Toolbox
    3. Decompressing Files
    4. Converting Microsoft Excel Spreadsheets
    5. Querying Relational Databases
    6. Downloading from the Internet
    7. Calling Web APIs
    8. Further Reading
  5. 4. Creating Reusable Command-Line Tools
    1. Overview
    2. Converting One-Liners into Shell Scripts
      1. Step 1: Copy and Paste
      2. Step 2: Add Permission to Execute
      3. Step 3: Define Shebang
      4. Step 4: Remove Fixed Input
      5. Step 5: Parameterize
      6. Step 6: Extend Your PATH
    3. Creating Command-Line Tools with Python and R
      1. Porting the Shell Script
      2. Processing Streaming Data from Standard Input
    4. Further Reading
  6. 5. Scrubbing Data
    1. Overview
    2. Common Scrub Operations for Plain Text
      1. Filtering Lines
      2. Extracting Values
      3. Replacing and Deleting Values
    3. Working with CSV
      1. Bodies and Headers and Columns, Oh My!
      2. Performing SQL Queries on CSV
    4. Working with HTML/XML and JSON
    5. Common Scrub Operations for CSV
      1. Extracting and Reordering Columns
      2. Filtering Lines
      3. Merging Columns
      4. Combining Multiple CSV Files
    6. Further Reading
  7. 6. Managing Your Data Workflow
    1. Overview
    2. Introducing Drake
    3. Installing Drake
    4. Obtain Top Ebooks from Project Gutenberg
    5. Every Workflow Starts with a Single Step
    6. Well, That Depends
    7. Rebuilding Specific Targets
    8. Discussion
    9. Further Reading
  8. 7. Exploring Data
    1. Overview
    2. Inspecting Data and Its Properties
      1. Header or Not, Here I Come
      2. Inspect All the Data
      3. Feature Names and Data Types
      4. Unique Identifiers, Continuous Variables, and Factors
    3. Computing Descriptive Statistics
      1. Using csvstat
      2. Using R from the Command Line with Rio
    4. Creating Visualizations
      1. Introducing Gnuplot and feedgnuplot
      2. Introducing ggplot2
      3. Histograms
      4. Bar Plots
      5. Density Plots
      6. Box Plots
      7. Scatter Plots
      8. Line Graphs
      9. Summary
    5. Further Reading
  9. 8. Parallel Pipelines
    1. Overview
    2. Serial Processing
      1. Looping Over Numbers
      2. Looping Over Lines
      3. Looping Over Files
    3. Parallel Processing
      1. Introducing GNU Parallel
      2. Specifying Input
      3. Controlling the Number of Concurrent Jobs
      4. Logging and Output
      5. Creating Parallel Tools
    4. Distributed Processing
      1. Get a List of Running AWS EC2 Instances
      2. Running Commands on Remote Machines
      3. Distributing Local Data Among Remote Machines
      4. Processing Files on Remote Machines
    5. Discussion
    6. Further Reading
  10. 9. Modeling Data
    1. Overview
    2. More Wine, Please!
    3. Dimensionality Reduction with Tapkee
      1. Introducing Tapkee
      2. Installing Tapkee
      3. Linear and Nonlinear Mappings
    4. Clustering with Weka
      1. Introducing Weka
      2. Taming Weka on the Command Line
      3. Converting Between CSV and ARFF
      4. Comparing Three Clustering Algorithms
    5. Regression with SciKit-Learn Laboratory
      1. Preparing the Data
      2. Running the Experiment
      3. Parsing the Results
    6. Classification with BigML
      1. Creating Balanced Train and Test Data Sets
      2. Calling the API
      3. Inspecting the Results
      4. Conclusion
    7. Further Reading
  11. 10. Conclusion
    1. Let’s Recap
    2. Three Pieces of Advice
      1. Be Patient
      2. Be Creative
      3. Be Practical
    3. Where to Go from Here?
      1. APIs
      2. Shell Programming
      3. Python, R, and SQL
      4. Interpreting Data
    4. Getting in Touch
  12. A. List of Command-Line Tools
    1. alias
    2. awk
    3. aws
    4. bash
    5. bc
    6. bigmler
    7. body
    8. cat
    9. cd
    10. chmod
    11. cols
    12. cowsay
    13. cp
    14. csvcut
    15. csvgrep
    16. csvjoin
    17. csvlook
    18. csvsort
    19. csvsql
    20. csvstack
    21. csvstat
    22. curl
    23. curlicue
    24. cut
    25. display
    26. drake
    27. dseq
    28. echo
    29. env
    30. export
    31. feedgnuplot
    32. fieldsplit
    33. find
    34. for
    35. git
    36. grep
    37. head
    38. header
    39. in2csv
    40. jq
    41. json2csv
    42. less
    43. ls
    44. man
    45. mkdir
    46. mv
    47. parallel
    48. paste
    49. pbc
    50. pip
    51. pwd
    52. python
    53. R
    54. Rio
    55. Rio-scatter
    56. rm
    57. run_experiment
    58. sample
    59. scp
    60. scrape
    61. sed
    62. seq
    63. shuf
    64. sort
    65. split
    66. sql2csv
    67. ssh
    68. sudo
    69. tail
    70. tapkee
    71. tar
    72. tee
    73. tr
    74. tree
    75. type
    76. uniq
    77. unpack
    78. unrar
    79. unzip
    80. wc
    81. weka
    82. which
    83. xml2json
  13. B. Bibliography
  14. Index

Product information

  • Title: Data Science at the Command Line
  • Author(s): Jeroen Janssens
  • Release date: October 2014
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781491947852