Agile Data Science 2.0

Book Description

Data science teams looking to turn research into useful analytics applications require not only the right tools, but also the right approach if they’re to succeed. With the revised second edition of this hands-on guide, up-and-coming data scientists will learn how to use the Agile Data Science development methodology to build data applications with Python, Apache Spark, Kafka, and other tools.

Author Russell Jurney demonstrates how to compose a data platform for building, deploying, and refining analytics applications with Apache Kafka, MongoDB, Elasticsearch, d3.js, scikit-learn, and Apache Airflow. You’ll learn an iterative approach that lets you quickly change the kind of analysis you’re doing, depending on what the data is telling you. Publish data science work as a web application, and effect meaningful change in your organization.

  • Build value from your data in a series of agile sprints, using the data-value pyramid
  • Extract features for statistical models from a single dataset
  • Visualize data with charts, and expose different aspects through interactive reports
  • Use historical data to predict the future via classification and regression
  • Translate predictions into actions
  • Get feedback from users after each sprint to keep your project on track

Table of Contents

  1. Preface
    1. Agile Data Science Mailing List
    2. Data Syndrome, Product Analytics Consultancy
      1. Live Training
    3. Who This Book Is For
    4. How This Book Is Organized
    5. Conventions Used in This Book
    6. Using Code Examples
    7. O’Reilly Safari
    8. How to Contact Us
  2. I. Setup
  3. 1. Theory
    1. Introduction
    2. Definition
      1. Methodology as Tweet
      2. Agile Data Science Manifesto
    3. The Problem with the Waterfall
      1. Research Versus Application Development
    4. The Problem with Agile Software
      1. Eventual Quality: Financing Technical Debt
      2. The Pull of the Waterfall
    5. The Data Science Process
      1. Setting Expectations
      2. Data Science Team Roles
      3. Recognizing the Opportunity and the Problem
      4. Adapting to Change
    6. Notes on Process
      1. Code Review and Pair Programming
      2. Agile Environments: Engineering Productivity
      3. Realizing Ideas with Large-Format Printing
  4. 2. Agile Tools
    1. Scalability = Simplicity
    2. Agile Data Science Data Processing
    3. Local Environment Setup
      1. System Requirements
      2. Setting Up Vagrant
      3. Downloading the Data
    4. EC2 Environment Setup
      1. Downloading the Data
    5. Getting and Running the Code
      1. Getting the Code
      2. Running the Code
      3. Jupyter Notebooks
    6. Touring the Toolset
      1. Agile Stack Requirements
      2. Python 3
      3. Serializing Events with JSON Lines and Parquet
      4. Collecting Data
      5. Data Processing with Spark
      6. Publishing Data with MongoDB
      7. Searching Data with Elasticsearch
      8. Distributed Streams with Apache Kafka
      9. Processing Streams with PySpark Streaming
      10. Machine Learning with scikit-learn and Spark MLlib
      11. Scheduling with Apache Airflow (Incubating)
      12. Reflecting on Our Workflow
      13. Lightweight Web Applications
      14. Presenting Our Data
    7. Conclusion
  5. 3. Data
    1. Air Travel Data
      1. Flight On-Time Performance Data
      2. OpenFlights Database
    2. Weather Data
    3. Data Processing in Agile Data Science
      1. Structured Versus Semistructured Data
    4. SQL Versus NoSQL
      1. SQL
      2. NoSQL and Dataflow Programming
      3. Spark: SQL + NoSQL
      4. Schemas in NoSQL
      5. Data Serialization
      6. Extracting and Exposing Features in Evolving Schemas
    5. Conclusion
  6. II. Climbing the Pyramid
  7. 4. Collecting and Displaying Records
    1. Putting It All Together
    2. Collecting and Serializing Flight Data
    3. Processing and Publishing Flight Records
      1. Publishing Flight Records to MongoDB
    4. Presenting Flight Records in a Browser
      1. Serving Flights with Flask and pymongo
      2. Rendering HTML5 with Jinja2
    5. Agile Checkpoint
    6. Listing Flights
      1. Listing Flights with MongoDB
      2. Paginating Data
    7. Searching for Flights
      1. Creating Our Index
      2. Publishing Flights to Elasticsearch
      3. Searching Flights on the Web
    8. Conclusion
  8. 5. Visualizing Data with Charts and Tables
    1. Chart Quality: Iteration Is Essential
    2. Scaling a Database in the Publish/Decorate Model
      1. First Order Form
      2. Second Order Form
      3. Third Order Form
      4. Choosing a Form
    3. Exploring Seasonality
      1. Querying and Presenting Flight Volume
    4. Extracting Metal (Airplanes [Entities])
      1. Extracting Tail Numbers
      2. Assessing Our Airplanes
    5. Data Enrichment
      1. Reverse Engineering a Web Form
      2. Gathering Tail Numbers
      3. Automating Form Submission
      4. Extracting Data from HTML
      5. Evaluating Enriched Data
    6. Conclusion
  9. 6. Exploring Data with Reports
    1. Extracting Airlines (Entities)
      1. Defining Airlines as Groups of Airplanes Using PySpark
      2. Querying Airline Data in Mongo
      3. Building an Airline Page in Flask
      4. Linking Back to Our Airline Page
      5. Creating an All Airlines Home Page
    2. Curating Ontologies of Semi-structured Data
    3. Improving Airlines
      1. Adding Names to Carrier Codes
      2. Incorporating Wikipedia Content
      3. Publishing Enriched Airlines to Mongo
      4. Enriched Airlines on the Web
    4. Investigating Airplanes (Entities)
      1. SQL Subqueries Versus Dataflow Programming
      2. Dataflow Programming Without Subqueries
      3. Subqueries in Spark SQL
      4. Creating an Airplanes Home Page
      5. Adding Search to the Airplanes Page
      6. Creating a Manufacturers Bar Chart
      7. Iterating on the Manufacturers Bar Chart
      8. Entity Resolution: Another Chart Iteration
    5. Conclusion
  10. 7. Making Predictions
    1. The Role of Predictions
    2. Predict What?
    3. Introduction to Predictive Analytics
      1. Making Predictions
    4. Exploring Flight Delays
    5. Extracting Features with PySpark
    6. Building a Regression with scikit-learn
      1. Loading Our Data
      2. Sampling Our Data
      3. Vectorizing Our Results
      4. Preparing Our Training Data
      5. Vectorizing Our Features
      6. Sparse Versus Dense Matrices
      7. Preparing an Experiment
      8. Training Our Model
      9. Testing Our Model
      10. Conclusion
    7. Building a Classifier with Spark MLlib
      1. Loading Our Training Data with a Specified Schema
      2. Addressing Nulls
      3. Replacing FlightNum with Route
      4. Bucketizing a Continuous Variable for Classification
      5. Feature Vectorization with pyspark.ml.feature
      6. Classification with Spark ML
    8. Conclusion
  11. 8. Deploying Predictive Systems
    1. Deploying a scikit-learn Application as a Web Service
      1. Saving and Loading scikit-learn Models
      2. Groundwork for Serving Predictions
      3. Creating Our Flight Delay Regression API
      4. Testing Our API
      5. Pulling Our API into Our Product
    2. Deploying Spark ML Applications in Batch with Airflow
      1. Gathering Training Data in Production
      2. Training, Storing, and Loading Spark ML Models
      3. Creating Prediction Requests in Mongo
      4. Fetching Prediction Requests from MongoDB
      5. Making Predictions in a Batch with Spark ML
      6. Storing Predictions in MongoDB
      7. Displaying Batch Prediction Results in Our Web Application
      8. Automating Our Workflow with Apache Airflow (Incubating)
      9. Conclusion
    3. Deploying Spark ML via Spark Streaming
      1. Gathering Training Data in Production
      2. Training, Storing, and Loading Spark ML Models
      3. Sending Prediction Requests to Kafka
      4. Making Predictions in Spark Streaming
      5. Testing the Entire System
    4. Conclusion
  12. 9. Improving Predictions
    1. Fixing Our Prediction Problem
    2. When to Improve Predictions
    3. Improving Prediction Performance
      1. Experimental Adhesion Method: See What Sticks
      2. Establishing Rigorous Metrics for Experiments
      3. Time of Day as a Feature
    4. Incorporating Airplane Data
      1. Extracting Airplane Features
      2. Incorporating Airplane Features into Our Classifier Model
    5. Incorporating Flight Time
    6. Conclusion
  13. A. Manual Installation
    1. Installing Hadoop
    2. Installing Spark
    3. Installing MongoDB
    4. Installing the MongoDB Java Driver
    5. Installing mongo-hadoop
      1. Building mongo-hadoop
      2. Installing pymongo_spark
    6. Installing Elasticsearch
    7. Installing Elasticsearch for Hadoop
    8. Setting Up Our Spark Environment
    9. Installing Kafka
    10. Installing scikit-learn
    11. Installing Zeppelin
  14. Index