Data Wrangling with JavaScript

Book Description

Data Wrangling with JavaScript promotes JavaScript to the center of the data analysis stage! With this hands-on guide, you’ll create a JavaScript-based data processing pipeline, handle common and exotic data, and master practical troubleshooting strategies. You’ll also build interactive visualizations and deploy your apps to production. Each valuable chapter provides a new component for your reusable data wrangling toolkit.

Table of Contents

  1. Titlepage
  2. Copyright
  3. preface
  4. acknowledgments
  5. about this book
    1. Who should read this book
    2. How this book is organized: a roadmap
    3. About the code
    4. Book forum
    5. Other online resources
  6. about the author
  7. about the cover illustration
  8. Chapter 1: Getting started: establishing your data pipeline
    1. 1.1 Why data wrangling?
    2. 1.2 What’s data wrangling?
    3. 1.3 Why a book on JavaScript data wrangling?
    4. 1.4 What will you get out of this book?
    5. 1.5 Why use JavaScript for data wrangling?
    6. 1.6 Is JavaScript appropriate for data analysis?
    7. 1.7 Navigating the JavaScript ecosystem
    8. 1.8 Assembling your toolkit
    9. 1.9 Establishing your data pipeline
      1. 1.9.1 Setting the stage
      2. 1.9.2 The data-wrangling process
      3. 1.9.3 Planning
      4. 1.9.4 Acquisition, storage, and retrieval
      5. 1.9.5 Exploratory coding
      6. 1.9.6 Clean and prepare
      7. 1.9.7 Analysis
      8. 1.9.8 Visualization
      9. 1.9.9 Getting to production
    10. Summary
  9. Chapter 2: Getting started with Node.js
    1. 2.1 Starting your toolkit
    2. 2.2 Building a simple reporting system
    3. 2.3 Getting the code and data
      1. 2.3.1 Viewing the code
      2. 2.3.2 Downloading the code
      3. 2.3.3 Installing Node.js
      4. 2.3.4 Installing dependencies
      5. 2.3.5 Running Node.js code
      6. 2.3.6 Running a web application
      7. 2.3.7 Getting the data
      8. 2.3.8 Getting the code for chapter 2
    4. 2.4 Installing Node.js
      1. 2.4.1 Checking your Node.js version
    5. 2.5 Working with Node.js
      1. 2.5.1 Creating a Node.js project
      2. 2.5.2 Creating a command-line application
      3. 2.5.3 Creating a code library
      4. 2.5.4 Creating a simple web server
    6. 2.6 Asynchronous coding
      1. 2.6.1 Loading a single file
      2. 2.6.2 Loading multiple files
      3. 2.6.3 Error handling
      4. 2.6.4 Asynchronous coding with promises
      5. 2.6.5 Wrapping asynchronous operations in promises
      6. 2.6.6 Async coding with “async” and “await”
    7. Summary
  10. Chapter 3: Acquisition, storage, and retrieval
    1. 3.1 Building out your toolkit
    2. 3.2 Getting the code and data
    3. 3.3 The core data representation
      1. 3.3.1 The earthquakes website
      2. 3.3.2 Data formats covered
      3. 3.3.3 Power and flexibility
    4. 3.4 Importing data
      1. 3.4.1 Loading data from text files
      2. 3.4.2 Loading data from a REST API
      3. 3.4.3 Parsing JSON text data
      4. 3.4.4 Parsing CSV text data
      5. 3.4.5 Importing data from databases
      6. 3.4.6 Importing data from MongoDB
      7. 3.4.7 Importing data from MySQL
    5. 3.5 Exporting data
      1. 3.5.1 You need data to export!
      2. 3.5.2 Exporting data to text files
      3. 3.5.3 Exporting data to JSON text files
      4. 3.5.4 Exporting data to CSV text files
      5. 3.5.5 Exporting data to a database
      6. 3.5.6 Exporting data to MongoDB
      7. 3.5.7 Exporting data to MySQL
    6. 3.6 Building complete data conversions
    7. 3.7 Expanding the process
    8. Summary
  11. Chapter 4: Working with unusual data
    1. 4.1 Getting the code and data
    2. 4.2 Importing custom data from text files
    3. 4.3 Importing data by scraping web pages
      1. 4.3.1 Identifying the data to scrape
      2. 4.3.2 Scraping with Cheerio
    4. 4.4 Working with binary data
      1. 4.4.1 Unpacking a custom binary file
      2. 4.4.2 Packing a custom binary file
      3. 4.4.3 Replacing JSON with BSON
      4. 4.4.4 Converting JSON to BSON
      5. 4.4.5 Deserializing a BSON file
    5. Summary
  12. Chapter 5: Exploratory coding
    1. 5.1 Expanding your toolkit
    2. 5.2 Analyzing car accidents
    3. 5.3 Getting the code and data
    4. 5.4 Iteration and your feedback loop
    5. 5.5 A first pass at understanding your data
    6. 5.6 Working with a reduced data sample
    7. 5.7 Prototyping with Excel
    8. 5.8 Exploratory coding with Node.js
      1. 5.8.1 Using Nodemon
      2. 5.8.2 Exploring your data
      3. 5.8.3 Using Data-Forge
      4. 5.8.4 Computing the trend column
      5. 5.8.5 Outputting a new CSV file
    9. 5.9 Exploratory coding in the browser
    10. 5.10 Putting it all together
    11. Summary
  13. Chapter 6: Clean and prepare
    1. 6.1 Expanding our toolkit
    2. 6.2 Preparing the reef data
    3. 6.3 Getting the code and data
    4. 6.4 The need for data cleanup and preparation
    5. 6.5 Where does broken data come from?
    6. 6.6 How does data cleanup fit into the pipeline?
    7. 6.7 Identifying bad data
    8. 6.8 Kinds of problems
    9. 6.9 Responses to bad data
    10. 6.10 Techniques for fixing bad data
    11. 6.11 Cleaning our data set
      1. 6.11.1 Rewriting bad rows
      2. 6.11.2 Filtering rows of data
      3. 6.11.3 Filtering columns of data
    12. 6.12 Preparing our data for effective use
      1. 6.12.1 Aggregating rows of data
      2. 6.12.2 Combining data from different files using globby
      3. 6.12.3 Splitting data into separate files
    13. 6.13 Building a data processing pipeline with Data-Forge
    14. Summary
  14. Chapter 7: Dealing with huge data files
    1. 7.1 Expanding our toolkit
    2. 7.2 Fixing temperature data
    3. 7.3 Getting the code and data
    4. 7.4 When conventional data processing breaks down
    5. 7.5 The limits of Node.js
      1. 7.5.1 Incremental data processing
      2. 7.5.2 Incremental core data representation
      3. 7.5.3 Node.js file streams basics primer
      4. 7.5.4 Transforming huge CSV files
      5. 7.5.5 Transforming huge JSON files
      6. 7.5.6 Mix and match
    6. Summary
  15. Chapter 8: Working with a mountain of data
    1. 8.1 Expanding our toolkit
    2. 8.2 Dealing with a mountain of data
    3. 8.3 Getting the code and data
    4. 8.4 Techniques for working with big data
      1. 8.4.1 Start small
      2. 8.4.2 Go back to small
      3. 8.4.3 Use a more efficient representation
      4. 8.4.4 Prepare your data offline
    5. 8.5 More Node.js limitations
    6. 8.6 Divide and conquer
    7. 8.7 Working with large databases
      1. 8.7.1 Database setup
      2. 8.7.2 Opening a connection to the database
      3. 8.7.3 Moving large files to your database
      4. 8.7.4 Incremental processing with a database cursor
      5. 8.7.5 Incremental processing with data windows
      6. 8.7.6 Creating an index
      7. 8.7.7 Filtering using queries
      8. 8.7.8 Discarding data with projection
      9. 8.7.9 Sorting large data sets
    8. 8.8 Achieving better data throughput
      1. 8.8.1 Optimize your code
      2. 8.8.2 Optimize your algorithm
      3. 8.8.3 Processing data in parallel
    9. Summary
  16. Chapter 9: Practical data analysis
    1. 9.1 Expanding your toolkit
    2. 9.2 Analyzing the weather data
    3. 9.3 Getting the code and data
    4. 9.4 Basic data summarization
      1. 9.4.1 Sum
      2. 9.4.2 Average
      3. 9.4.3 Standard deviation
    5. 9.5 Group and summarize
    6. 9.6 The frequency distribution of temperatures
    7. 9.7 Time series
      1. 9.7.1 Yearly average temperature
      2. 9.7.2 Rolling average
      3. 9.7.3 Rolling standard deviation
      4. 9.7.4 Linear regression
      5. 9.7.5 Comparing time series
      6. 9.7.6 Stacking time series operations
    8. 9.8 Understanding relationships
      1. 9.8.1 Detecting correlation with a scatter plot
      2. 9.8.2 Types of correlation
      3. 9.8.3 Determining the strength of the correlation
      4. 9.8.4 Computing the correlation coefficient
    9. Summary
  17. Chapter 10: Browser-based visualization
    1. 10.1 Expanding your toolkit
    2. 10.2 Getting the code and data
    3. 10.3 Choosing a chart type
    4. 10.4 Line chart for New York City temperature
      1. 10.4.1 The most basic C3 line chart
      2. 10.4.2 Adding real data
      3. 10.4.3 Parsing the static CSV file
      4. 10.4.4 Adding years as the X axis
      5. 10.4.5 Creating a custom Node.js web server
      6. 10.4.6 Adding another series to the chart
      7. 10.4.7 Adding a second Y axis to the chart
      8. 10.4.8 Rendering a time series chart
    5. 10.5 Other chart types with C3
      1. 10.5.1 Bar chart
      2. 10.5.2 Horizontal bar chart
      3. 10.5.3 Pie chart
      4. 10.5.4 Stacked bar chart
      5. 10.5.5 Scatter plot chart
    6. 10.6 Improving the look of our charts
    7. 10.7 Moving forward with your own projects
    8. Summary
  18. Chapter 11: Server-side visualization
    1. 11.1 Expanding your toolkit
    2. 11.2 Getting the code and data
    3. 11.3 The headless browser
    4. 11.4 Using Nightmare for server-side visualization
      1. 11.4.1 Why Nightmare?
      2. 11.4.2 Nightmare and Electron
      3. 11.4.3 Our process: capturing visualizations with Nightmare
      4. 11.4.4 Prepare a visualization to render
      5. 11.4.5 Starting the web server
      6. 11.4.6 Procedurally start and stop the web server
      7. 11.4.7 Rendering the web page to an image
      8. 11.4.8 Before we move on . . .
      9. 11.4.9 Capturing the full visualization
      10. 11.4.10 Feeding the chart with data
      11. 11.4.11 Multipage reports
      12. 11.4.12 Debugging code in the headless browser
      13. 11.4.13 Making it work on a Linux server
    5. 11.5 You can do much more with a headless browser
      1. 11.5.1 Web scraping
      2. 11.5.2 Other uses
    6. Summary
  19. Chapter 12: Live data
    1. 12.1 We need an early warning system
    2. 12.2 Getting the code and data
    3. 12.3 Dealing with live data
    4. 12.4 Building a system for monitoring air quality
    5. 12.5 Set up for development
    6. 12.6 Live-streaming data
      1. 12.6.1 HTTP POST for infrequent data submission
      2. 12.6.2 Sockets for high-frequency data submission
    7. 12.7 Refactor for configuration
    8. 12.8 Data capture
    9. 12.9 An event-based architecture
    10. 12.10 Code restructure for event handling
      1. 12.10.1 Triggering SMS alerts
      2. 12.10.2 Automatically generating a daily report
    11. 12.11 Live data processing
    12. 12.12 Live visualization
    13. Summary
  20. Chapter 13: Advanced visualization with D3
    1. 13.1 Advanced visualization
    2. 13.2 Getting the code and data
    3. 13.3 Visualizing space junk
    4. 13.4 What is D3?
    5. 13.5 The D3 data pipeline
    6. 13.6 Basic setup
    7. 13.7 SVG crash course
      1. 13.7.1 SVG circle
      2. 13.7.2 Styling
      3. 13.7.3 SVG text
      4. 13.7.4 SVG group
    8. 13.8 Building visualizations with D3
      1. 13.8.1 Element state
      2. 13.8.2 Selecting elements
      3. 13.8.3 Manually adding elements to our visualization
      4. 13.8.4 Scaling to fit
      5. 13.8.5 Procedural generation the D3 way
      6. 13.8.6 Loading a data file
      7. 13.8.7 Color-coding the space junk
      8. 13.8.8 Adding interactivity
      9. 13.8.9 Adding a year-by-year launch animation
    9. Summary
  21. Chapter 14: Getting to production
    1. 14.1 Production concerns
    2. 14.2 Taking our early warning system to production
    3. 14.3 Deployment
    4. 14.4 Monitoring
    5. 14.5 Reliability
      1. 14.5.1 System longevity
      2. 14.5.2 Practice defensive programming
      3. 14.5.3 Data protection
      4. 14.5.4 Testing and automation
      5. 14.5.5 Handling unexpected errors
      6. 14.5.6 Designing for process restart
      7. 14.5.7 Dealing with an ever-growing database
    6. 14.6 Security
      1. 14.6.1 Authentication and authorization
      2. 14.6.2 Privacy and confidentiality
      3. 14.6.3 Secret configuration
    7. 14.7 Scaling
      1. 14.7.1 Measurement before optimization
      2. 14.7.2 Vertical scaling
      3. 14.7.3 Horizontal scaling
    8. Summary
  22. appendix a: JavaScript cheat sheet
    1. Updates
    2. Logging
    3. Objects
    4. Arrays
    5. Regular expressions
    6. Read and write text files (Node.js, synchronous)
    7. Read and write JSON files (Node.js, synchronous)
    8. Read and write CSV files (Node.js, synchronous)
  23. appendix b: Data-Forge cheat sheet
    1. Updates
    2. Loading data into a DataFrame
    3. Loading CSV files
    4. Loading JSON files
    5. Data transformation
    6. Data filtering
    7. Removing a column
    8. Saving CSV files
    9. Saving JSON files
  24. appendix c: Getting started with Vagrant
    1. Updates
    2. Installing VirtualBox
    3. Installing Vagrant
    4. Creating a virtual machine
    5. Installing software on your virtual machine
    6. Running code on your virtual machine
    7. Turning off your virtual machine
  25. Index
  26. List of Figures
  27. List of Tables
  28. List of Listings