Data Science on the Google Cloud Platform

Book Description

Learn how easy it is to apply sophisticated statistical and machine learning methods to real-world problems when you build on top of the Google Cloud Platform (GCP). This hands-on guide shows developers entering the data science field how to implement an end-to-end data pipeline using statistical and machine learning methods and tools on GCP. Over the course of the book, you’ll work through a sample business decision by employing a variety of data science approaches.

Follow along by implementing these statistical and machine learning solutions in your own project on GCP, and discover how this platform provides a transformative and more collaborative way of doing data science.

You’ll learn how to:

  • Automate and schedule data ingest, using an App Engine application
  • Create and populate a dashboard in Google Data Studio
  • Build a real-time analysis pipeline to carry out streaming analytics
  • Conduct interactive data exploration with Google BigQuery
  • Create a Bayesian model on a Cloud Dataproc cluster
  • Build a logistic regression machine learning model with Spark
  • Compute time-aggregate features with a Cloud Dataflow pipeline
  • Create a high-performing prediction model with TensorFlow
  • Use your deployed model as a microservice you can access from both batch and real-time pipelines

Table of Contents

  1. Preface
    1. Who This Book Is For
    2. Conventions Used in This Book
    3. Using Code Examples
    4. O’Reilly Safari
    5. How to Contact Us
    6. Acknowledgments
  2. 1. Making Better Decisions Based on Data
    1. Many Similar Decisions
    2. The Role of Data Engineers
    3. The Cloud Makes Data Engineers Possible
    4. The Cloud Turbocharges Data Science
    5. Case Studies Get at the Stubborn Facts
    6. A Probabilistic Decision
    7. Data and Tools
      1. Getting Started with the Code
    8. Summary
  3. 2. Ingesting Data into the Cloud
    1. Airline On-Time Performance Data
      1. Knowability
      2. Training–Serving Skew
      3. Download Procedure
      4. Dataset Attributes
    2. Why Not Store the Data in Situ?
      1. Scaling Up
      2. Scaling Out
      3. Data in Situ with Colossus and Jupiter
    3. Ingesting Data
      1. Reverse Engineering a Web Form
      2. Dataset Download
      3. Exploration and Cleanup
      4. Uploading Data to Google Cloud Storage
    4. Scheduling Monthly Downloads
      1. Ingesting in Python
      2. Flask Web App
      3. Running on App Engine
      4. Securing the URL
      5. Scheduling a Cron Task
    5. Summary
    6. Code Break
  4. 3. Creating Compelling Dashboards
    1. Explain Your Model with Dashboards
    2. Why Build a Dashboard First?
    3. Accuracy, Honesty, and Good Design
    4. Loading Data into Google Cloud SQL
    5. Create a Google Cloud SQL Instance
    6. Interacting with Google Cloud Platform
    7. Controlling Access to MySQL
    8. Create Tables
    9. Populating Tables
    10. Building Our First Model
      1. Contingency Table
      2. Threshold Optimization
      3. Machine Learning
    11. Building a Dashboard
    12. Getting Started with Data Studio
      1. Creating Charts
      2. Adding End-User Controls
      3. Showing Proportions with a Pie Chart
      4. Explaining a Contingency Table
    13. Summary
  5. 4. Streaming Data: Publication and Ingest
    1. Designing the Event Feed
    2. Time Correction
    3. Apache Beam/Cloud Dataflow
      1. Parsing Airports Data
      2. Adding Time Zone Information
      3. Converting Times to UTC
      4. Correcting Dates
      5. Creating Events
      6. Running the Pipeline in the Cloud
    4. Publishing an Event Stream to Cloud Pub/Sub
      1. Get Records to Publish
      2. Paging Through Records
      3. Building a Batch of Events
      4. Publishing a Batch of Events
    5. Real-Time Stream Processing
      1. Streaming in Java Dataflow
      2. Executing the Stream Processing
      3. Analyzing Streaming Data in BigQuery
      4. Real-Time Dashboard
    6. Summary
  6. 5. Interactive Data Exploration
    1. Exploratory Data Analysis
    2. Loading Flights Data into BigQuery
      1. Advantages of a Serverless Columnar Database
      2. Staging on Cloud Storage
      3. Access Control
      4. Federated Queries
      5. Ingesting CSV Files
    3. Exploratory Data Analysis in Cloud Datalab
      1. Jupyter Notebooks
      2. Cloud Datalab
      3. Installing Packages in Cloud Datalab
      4. Jupyter Magic for Google Cloud Platform
    4. Quality Control
      1. Oddball Values
      2. Outlier Removal: Big Data Is Different
      3. Filtering Data on Occurrence Frequency
    5. Arrival Delay Conditioned on Departure Delay
      1. Applying Probabilistic Decision Threshold
      2. Empirical Probability Distribution Function
      3. The Answer Is...
    6. Evaluating the Model
      1. Random Shuffling
      2. Splitting by Date
      3. Training and Testing
    7. Summary
  7. 6. Bayes Classifier on Cloud Dataproc
    1. MapReduce and the Hadoop Ecosystem
      1. How MapReduce Works
      2. Apache Hadoop
      3. Google Cloud Dataproc
      4. Need for Higher-Level Tools
      5. Jobs, Not Clusters
      6. Initialization Actions
    2. Quantization Using Spark SQL
      1. Google Cloud Datalab on Cloud Dataproc
      2. Independence Check Using BigQuery
      3. Spark SQL in Google Cloud Datalab
      4. Histogram Equalization
      5. Dynamically Resizing Clusters
    3. Bayes Classification Using Pig
      1. Running a Pig Job on Cloud Dataproc
      2. Limiting to Training Days
      3. The Decision Criteria
      4. Evaluating the Bayesian Model
    4. Summary
  8. 7. Machine Learning: Logistic Regression on Spark
    1. Logistic Regression
      1. Spark ML Library
      2. Getting Started with Spark Machine Learning
      3. Spark Logistic Regression
      4. Creating a Training Dataset
      5. Dealing with Corner Cases
      6. Creating Training Examples
      7. Training
      8. Predicting by Using a Model
      9. Evaluating a Model
    2. Feature Engineering
      1. Experimental Framework
      2. Creating the Held-Out Dataset
      3. Feature Selection
      4. Scaling and Clipping Features
      5. Feature Transforms
      6. Categorical Variables
      7. Scalable, Repeatable, Real Time
    3. Summary
  9. 8. Time-Windowed Aggregate Features
    1. The Need for Time Averages
    2. Dataflow in Java
      1. Setting Up Development Environment
      2. Filtering with Beam
      3. Pipeline Options and Text I/O
      4. Run on Cloud
      5. Parsing into Objects
    3. Computing Time Averages
      1. Grouping and Combining
      2. Parallel Do with Side Input
      3. Debugging
      4. BigQueryIO
      5. Mutating the Flight Object
      6. Sliding Window Computation in Batch Mode
      7. Running in the Cloud
    4. Monitoring, Troubleshooting, and Performance Tuning
      1. Troubleshooting Pipeline
      2. Side Input Limitations
      3. Redesigning the Pipeline
      4. Removing Duplicates
    5. Summary
  10. 9. Machine Learning Classifier Using TensorFlow
    1. Toward More Complex Models
    2. Reading Data into TensorFlow
    3. Setting Up an Experiment
      1. Linear Classifier
      2. Training and Evaluating Input Functions
      3. Serving Input Function
      4. Creating an Experiment
      5. Performing a Training Run
      6. Distributed Training in the Cloud
    4. Improving the ML Model
      1. Deep Neural Network Model
      2. Embeddings
      3. Wide-and-Deep Model
      4. Hyperparameter Tuning
    5. Deploying the Model
      1. Predicting with the Model
      2. Explaining the Model
    6. Summary
  11. 10. Real-Time Machine Learning
    1. Invoking Prediction Service
      1. Java Classes for Request and Response
      2. Post Request and Parse Response
      3. Client of Prediction Service
    2. Adding Predictions to Flight Information
      1. Batch Input and Output
      2. Data Processing Pipeline
      3. Identifying Inefficiency
      4. Batching Requests
    3. Streaming Pipeline
      1. Flattening PCollections
      2. Executing Streaming Pipeline
      3. Late and Out-of-Order Records
      4. Watermarks and Triggers
    4. Transactions, Throughput, and Latency
      1. Possible Streaming Sinks
      2. Cloud Bigtable
      3. Designing Tables
      4. Designing the Row Key
      5. Streaming into Cloud Bigtable
      6. Querying from Cloud Bigtable
    5. Evaluating Model Performance
      1. The Need for Continuous Training
      2. Evaluation Pipeline
      3. Evaluating Performance
      4. Marginal Distributions
      5. Checking Model Behavior
      6. Identifying Behavioral Change
    6. Summary
    7. Book Summary
  12. A. Considerations for Sensitive Data within Machine Learning Datasets
    1. Handling Sensitive Information
      1. Identifying Sensitive Data
    2. Protecting Sensitive Data
      1. Removing Sensitive Data
      2. Masking Sensitive Data
      3. Coarsening Sensitive Data
    3. Establishing a Governance Policy
  13. Index