Data Science on the Google Cloud Platform, 2nd Edition

Book description

Learn how easy it is to apply sophisticated statistical and machine learning methods to real-world problems when you build on Google Cloud Platform (GCP). This hands-on guide shows data engineers and data scientists how to implement an end-to-end data pipeline with cloud-native tools on GCP.

Throughout this updated second edition, you'll work through a sample business decision by employing a variety of data science approaches. Follow along by building a data pipeline in your own project on GCP, and discover how to solve data science problems in a transformative and more collaborative way.

You'll learn how to:

  • Employ best practices in building highly scalable data and ML pipelines on Google Cloud
  • Automate and schedule data ingest using Cloud Run
  • Create and populate a dashboard in Data Studio
  • Build a real-time analytics pipeline using Pub/Sub, Dataflow, and BigQuery
  • Conduct interactive data exploration with BigQuery
  • Create a Bayesian model with Spark on Cloud Dataproc
  • Forecast time series and do anomaly detection with BigQuery ML
  • Aggregate within time windows with Dataflow
  • Train explainable machine learning models with Vertex AI
  • Operationalize ML with Vertex AI Pipelines
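The probabilistic decision running through the book (cancel a meeting if the flight is too likely to arrive late) can be sketched in a few lines of plain Python. This is a hypothetical, self-contained illustration with made-up flight records and an invented `p_late_arrival` helper, not code from the book; the 15-minute lateness cutoff and 30% risk threshold follow the sample business decision the chapters work through.

```python
# Synthetic (departure_delay_minutes, arrival_delay_minutes) records,
# standing in for the airline on-time performance dataset.
flights = [
    (-5, -12), (-3, -10), (0, -5), (1, 0), (3, 2),
    (5, 4), (6, 5), (8, 12), (10, 9), (12, 18),
    (15, 14), (16, 22), (20, 25), (25, 40),
]

def p_late_arrival(records, dep_delay, late_cutoff=15):
    """Empirical P(arrival delay >= late_cutoff | departure delay >= dep_delay)."""
    arrivals = [arr for dep, arr in records if dep >= dep_delay]
    if not arrivals:
        return 0.0
    return sum(arr >= late_cutoff for arr in arrivals) / len(arrivals)

RISK_THRESHOLD = 0.30  # cancel the meeting if the risk of a late arrival exceeds 30%

for dep_delay in (0, 10, 15):
    risk = p_late_arrival(flights, dep_delay)
    decision = "cancel" if risk > RISK_THRESHOLD else "keep"
    print(f"dep delay >= {dep_delay:2d} min: P(late) = {risk:.2f} -> {decision}")
```

In the book, this empirical conditional probability is computed at scale over years of flight data with BigQuery, Spark, and Dataflow rather than a Python list, but the decision rule is the same shape.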

Table of contents

  1. Preface
    1. Who This Book Is For
    2. Conventions Used in This Book
    3. Using Code Examples
    4. O’Reilly Online Learning
    5. How to Contact Us
    6. Acknowledgments
  2. 1. Making Better Decisions Based on Data
    1. Many Similar Decisions
    2. The Role of Data Scientists
      1. Scrappy Environment
      2. Full Stack Cloud Data Scientists
      3. Collaboration
    3. Best Practices
      1. Simple to Complex Solutions
      2. Cloud Computing
      3. Serverless
    4. A Probabilistic Decision
      1. Probabilistic Approach
      2. Probability Density Function
      3. Cumulative Distribution Function
    5. Choices Made
      1. Choosing Cloud
      2. Not a Reference Book
      3. Getting Started with the Code
    6. Agile Architecture for Data Science on Google Cloud
      1. What Is Agile Architecture?
      2. No-Code, Low-Code
      3. Use Managed Services
    7. Summary
    8. Suggested Resources
  3. 2. Ingesting Data into the Cloud
    1. Airline On-Time Performance Data
      1. Knowability
      2. Causality
      3. Training–Serving Skew
      4. Downloading Data
      5. Hub-and-Spoke Architecture
      6. Dataset Fields
    2. Separation of Compute and Storage
      1. Scaling Up
      2. Scaling Out with Sharded Data
      3. Scaling Out with Data-in-Place
    3. Ingesting Data
      1. Reverse Engineering a Web Form
      2. Dataset Download
      3. Exploration and Cleanup
      4. Uploading Data to Google Cloud Storage
    4. Loading Data into Google BigQuery
      1. Advantages of a Serverless Columnar Database
      2. Staging on Cloud Storage
      3. Access Control
      4. Ingesting CSV Files
      5. Partitioning
    5. Scheduling Monthly Downloads
      1. Ingesting in Python
      2. Cloud Run
      3. Securing Cloud Run
      4. Deploying and Invoking Cloud Run
      5. Scheduling Cloud Run
    6. Summary
    7. Code Break
    8. Suggested Resources
  4. 3. Creating Compelling Dashboards
    1. Explain Your Model with Dashboards
      1. Why Build a Dashboard First?
      2. Accuracy, Honesty, and Good Design
    2. Loading Data into Cloud SQL
      1. Create a Google Cloud SQL Instance
      2. Create Table of Data
      3. Interacting with the Database
    3. Querying Using BigQuery
      1. Schema Exploration
      2. Using Preview
      3. Using Table Explorer
      4. Creating BigQuery View
    4. Building Our First Model
      1. Contingency Table
      2. Threshold Optimization
    5. Building a Dashboard
      1. Getting Started with Data Studio
      2. Creating Charts
      3. Adding End-User Controls
      4. Showing Proportions with a Pie Chart
      5. Explaining a Contingency Table
    6. Modern Business Intelligence
      1. Digitization
      2. Natural Language Queries
      3. Connected Sheets
    7. Summary
    8. Suggested Resources
  5. 4. Streaming Data: Publication and Ingest with Pub/Sub and Dataflow
    1. Designing the Event Feed
      1. Transformations Needed
      2. Architecture
      3. Getting Airport Information
      4. Sharing Data
    2. Time Correction
      1. Apache Beam/Cloud Dataflow
      2. Parsing Airports Data
      3. Adding Time Zone Information
      4. Converting Times to UTC
      5. Correcting Dates
      6. Creating Events
      7. Reading and Writing to the Cloud
      8. Running the Pipeline in the Cloud
    3. Publishing an Event Stream to Cloud Pub/Sub
      1. Speed-Up Factor
      2. Get Records to Publish
      3. How Many Topics?
      4. Iterating Through Records
      5. Building a Batch of Events
      6. Publishing a Batch of Events
    4. Real-Time Stream Processing
      1. Streaming in Dataflow
      2. Windowing a Pipeline
      3. Streaming Aggregation
      4. Using Event Timestamps
      5. Executing the Stream Processing
      6. Analyzing Streaming Data in BigQuery
    5. Real-Time Dashboard
    6. Summary
    7. Suggested Resources
  6. 5. Interactive Data Exploration with Vertex AI Workbench
    1. Exploratory Data Analysis
      1. Exploration with SQL
      2. Reading a Query Explanation
    2. Exploratory Data Analysis in Vertex AI Workbench
      1. Jupyter Notebooks
      2. Creating a Notebook
      3. Jupyter Commands
      4. Installing Packages
      5. Jupyter Magic for Google Cloud
    3. Exploring Arrival Delays
      1. Basic Statistics
      2. Plotting Distributions
      3. Quality Control
      4. Arrival Delay Conditioned on Departure Delay
    4. Evaluating the Model
      1. Random Shuffling
      2. Splitting by Date
      3. Training and Testing
    5. Summary
    6. Suggested Resources
  7. 6. Bayesian Classifier with Apache Spark on Cloud Dataproc
    1. MapReduce and the Hadoop Ecosystem
      1. How MapReduce Works
      2. Apache Hadoop
    2. Google Cloud Dataproc
      1. Need for Higher-Level Tools
      2. Jobs, Not Clusters
      3. Preinstalling Software
    3. Quantization Using Spark SQL
      1. JupyterLab on Cloud Dataproc
      2. Independence Check Using BigQuery
      3. Spark SQL in JupyterLab
      4. Histogram Equalization
    4. Bayesian Classification
      1. Bayes in Each Bin
      2. Evaluating the Model
      3. Dynamically Resizing Clusters
      4. Comparing to Single Threshold Model
    5. Orchestration
      1. Submitting a Spark Job
      2. Workflow Template
      3. Cloud Composer
      4. Autoscaling
      5. Serverless Spark
    6. Summary
    7. Suggested Resources
  8. 7. Logistic Regression Using Spark ML
    1. Logistic Regression
      1. How Logistic Regression Works
      2. Spark ML Library
      3. Getting Started with Spark Machine Learning
    2. Spark Logistic Regression
      1. Creating a Training Dataset
      2. Training the Model
      3. Predicting Using the Model
      4. Evaluating a Model
    3. Feature Engineering
      1. Experimental Framework
      2. Feature Selection
      3. Feature Transformations
      4. Feature Creation
      5. Categorical Variables
      6. Repeatable, Real Time
    4. Summary
    5. Suggested Resources
  9. 8. Machine Learning with BigQuery ML
    1. Logistic Regression
      1. Presplit Data
      2. Interrogating the Model
      3. Evaluating the Model
      4. Scale and Simplicity
    2. Nonlinear Machine Learning
      1. XGBoost
      2. Hyperparameter Tuning
      3. Vertex AI AutoML Tables
    3. Time Window Features
      1. Taxi-Out Time
      2. Compounding Delays
      3. Causality
    4. Time Features
      1. Departure Hour
      2. Transform Clause
      3. Categorical Variable
      4. Feature Cross
    5. Summary
    6. Suggested Resources
  10. 9. Machine Learning with TensorFlow in Vertex AI
    1. Toward More Complex Models
      1. Preparing BigQuery Data for TensorFlow
      2. Reading Data into TensorFlow
    2. Training and Evaluation in Keras
      1. Model Function
      2. Features
      3. Inputs
      4. Training the Keras Model
      5. Saving and Exporting
      6. Deep Neural Network
    3. Wide-and-Deep Model in Keras
      1. Representing Air Traffic Corridors
      2. Bucketing
      3. Feature Crossing
      4. Wide-and-Deep Classifier
    4. Deploying a Trained TensorFlow Model to Vertex AI
      1. Concepts
      2. Uploading Model
      3. Creating Endpoint
      4. Deploying Model to Endpoint
      5. Invoking the Deployed Model
    5. Summary
    6. Suggested Resources
  11. 10. Getting Ready for MLOps with Vertex AI
    1. Developing and Deploying Using Python
      1. Writing model.py
      2. Writing the Training Pipeline
      3. Predefined Split
      4. AutoML
    2. Hyperparameter Tuning
      1. Parameterize Model
      2. Shorten Training Run
      3. Metrics During Training
      4. Hyperparameter Tuning Pipeline
      5. Best Trial to Completion
    3. Explaining the Model
      1. Configuring Explanations Metadata
      2. Creating and Deploying Model
      3. Obtaining Explanations
    4. Summary
    5. Suggested Resources
  12. 11. Time-Windowed Features for Real-Time Machine Learning
    1. Time Averages
      1. Apache Beam and Cloud Dataflow
      2. Reading and Writing
      3. Time Windowing
    2. Machine Learning Training
      1. Machine Learning Dataset
      2. Training the Model
    3. Streaming Predictions
      1. Reuse Transforms
      2. Input and Output
      3. Invoking Model
      4. Reusing Endpoint
      5. Batching Predictions
    4. Streaming Pipeline
      1. Writing to BigQuery
      2. Executing Streaming Pipeline
      3. Late and Out-of-Order Records
      4. Possible Streaming Sinks
    5. Summary
    6. Suggested Resources
  13. 12. The Full Dataset
    1. Four Years of Data
      1. Creating Dataset
      2. Training Model
      3. Evaluation
    2. Summary
    3. Suggested Resources
  14. Conclusion
  15. A. Considerations for Sensitive Data Within Machine Learning Datasets
    1. Handling Sensitive Information
      1. Sensitive Data in Columns
      2. Sensitive Data in Natural Language Datasets
      3. Sensitive Data in Free-Form Unstructured Data
      4. Sensitive Data in a Combination of Fields
      5. Sensitive Data in Unstructured Content
    2. Protecting Sensitive Data
      1. Removing Sensitive Data
      2. Masking Sensitive Data
      3. Coarsening Sensitive Data
    3. Establishing a Governance Policy
  16. Index
  17. About the Author

Product information

  • Title: Data Science on the Google Cloud Platform, 2nd Edition
  • Author(s): Valliappa Lakshmanan
  • Release date: March 2022
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098118952