Data Science on the Google Cloud Platform, 2nd Edition

by Valliappa Lakshmanan
Released January 2023
Publisher(s): O'Reilly Media, Inc.
ISBN: 9781098118938

Book description

Learn how easy it is to apply sophisticated statistical and machine learning methods to real-world problems when you build using Google Cloud Platform (GCP). This hands-on guide shows data engineers and data scientists how to implement an end-to-end data pipeline, using statistical and machine learning methods and tools on GCP.

Through the course of this updated second edition, you'll work through a sample business decision by employing a variety of data science approaches. Follow along by implementing these statistical and machine learning solutions in your own project on GCP, and discover how this platform provides a transformative and more collaborative way of doing data science.

You'll learn how to:

  • Employ best practices in building highly scalable data and ML pipelines on Google Cloud
  • Automate and schedule data ingest using Cloud Run
  • Create and populate a dashboard in Data Studio
  • Build a real-time analytics pipeline using Pub/Sub, Dataflow, and BigQuery
  • Conduct interactive data exploration with BigQuery
  • Create a Bayesian model with Spark on Cloud Dataproc
  • Forecast time series and do anomaly detection with BigQuery ML
  • Aggregate within time windows with Dataflow
  • Train explainable machine learning models with Vertex AI
  • Operationalize ML with Vertex AI Pipelines

Table of contents

  1. Making Better Decisions Based on Data
    1. Many Similar Decisions
    2. The Role of Data Scientists
      1. Scrappy Environment
      2. Full Stack Cloud Data Scientists
      3. Collaboration
      4. Target audience for the book
    3. Best Practices
      1. Simple to Complex Solutions
      2. Cloud Computing
      3. Serverless
    4. A Probabilistic Decision
      1. Probabilistic Approach
      2. Probability Density Function
      3. Cumulative Distribution Function
    5. Data and Tools
      1. Getting Started with the Code
    6. Summary
  2. Ingesting Data into the Cloud
    1. Airline On-Time Performance Data
      1. Knowability
      2. Training–Serving Skew
      3. Downloading Data
      4. Hub and Spoke Architecture
      5. Dataset Fields
    2. Separation of Compute and Storage
      1. Scaling Up
      2. Scaling Out with Sharded Data
      3. Scaling out with Data in Situ
    3. Ingesting Data
      1. Reverse Engineering a Web Form
      2. Dataset Download
      3. Exploration and Cleanup
      4. Uploading Data to Google Cloud Storage
    4. Loading Data into Google BigQuery
      1. Advantages of a Serverless Columnar Database
      2. Staging on Cloud Storage
      3. Access Control
      4. Ingesting CSV Files
      5. Partitioning
    5. Scheduling Monthly Downloads
      1. Ingesting in Python
      2. Cloud Run
      3. Securing Cloud Run
      4. Deploying and Invoking Cloud Run
      5. Scheduling Cloud Run
    6. Summary
    7. Code Break
  3. Creating Compelling Dashboards
    1. Explain Your Model with Dashboards
      1. Why Build a Dashboard First?
      2. Accuracy, Honesty, and Good Design
    2. Loading Data into Cloud SQL
      1. Create a Google Cloud SQL Instance
      2. Create Table of Data
      3. Interacting with the database
    3. Querying Using BigQuery
      1. Schema Exploration
      2. Using Preview
      3. Using Table Explorer
      4. Creating BigQuery View
    4. Building Our First Model
      1. Contingency Table
      2. Threshold Optimization
    5. Building a Dashboard
      1. Getting Started with Data Studio
      2. Creating Charts
      3. Adding End-User Controls
      4. Showing Proportions with a Pie Chart
      5. Explaining a Contingency Table
    6. Summary
  4. Streaming Data: Publication and Ingest with Pub/Sub and Dataflow
    1. Designing the Event Feed
      1. Transformations Needed
      2. Architecture
      3. Getting airport information
      4. Sharing data
    2. Time Correction
      1. Apache Beam/Cloud Dataflow
      2. Parsing Airports Data
      3. Adding Time Zone Information
      4. Converting Times to UTC
      5. Correcting Dates
      6. Creating Events
      7. Reading and Writing to the Cloud
      8. Running the Pipeline in the Cloud
    3. Publishing an Event Stream to Cloud Pub/Sub
      1. Speed-up Factor
      2. Get Records to Publish
      3. Iterating Through Records
      4. Building a Batch of Events
      5. Publishing a Batch of Events
    4. Real-Time Stream Processing
      1. Streaming in Dataflow
      2. Windowing a pipeline
      3. Streaming aggregation
      4. Using Event Timestamps
      5. Executing the Stream Processing
      6. Analyzing Streaming Data in BigQuery
    5. Real-Time Dashboard
    6. Summary

