O'Reilly logo

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Google BigQuery: The Definitive Guide

Book Description

With Early Release ebooks, you get books in their earliest form—the authors' raw and unedited content as they write—so you can take advantage of these technologies long before the official release of these titles.

Work with petabyte-scale datasets while building a collaborative, agile workplace in the process. This practical book is the canonical reference to Google BigQuery, the query engine that lets you conduct interactive analysis of large datasets. BigQuery enables enterprises to efficiently store, query, ingest, and learn from their data in a convenient framework. With this book, you’ll examine how to analyze data at scale to derive insights from large datasets efficiently.

Valliappa Lakshmanan, tech lead on Google Cloud Platform, and Jordan Tigani, engineering director for the BigQuery team, provide best practices in modern data warehousing within an autoscaled, serverless, public cloud. Whether you want to explore parts of BigQuery you’re not familiar with, or prefer to focus on specific tasks, this reference is indispensable.

Table of Contents

  1. Preface
    1. Who Is This Book for?
    2. Acknowledgments
  2. 1. What Is Google BigQuery?
    1. Data Processing Architectures
      1. Relational Database Management System
      2. MapReduce Framework
      3. BigQuery: A Serverless, Distributed SQL Engine
    2. Working with BigQuery
      1. Deriving Insights Across Datasets
      2. ETL, EL, and ELT
      3. Powerful Analytics
      4. Simplicity of Management
    3. How BigQuery Came About
    4. What Makes BigQuery Possible?
      1. Separation of Compute and Storage
      2. Storage and Networking Infrastructure
      3. Managed storage
      4. Integration with Google Cloud Platform
      5. Security and Compliance
    5. Summary
  3. 2. Query Essentials
    1. Simple Queries
      1. Retrieving Rows by Using SELECT
      2. Aliasing column names with AS
      3. Filtering with WHERE
      4. SELECT *, EXCEPT, REPLACE
      5. Subqueries with WITH
      6. Sorting with ORDER BY
    2. Aggregates
      1. Computing Aggregates by Using GROUP BY
      2. Counting Records by Using COUNT
      3. Filtering Grouped Items by Using HAVING
      4. Finding Unique Values by Using DISTINCT
    3. A Brief Primer on Arrays and Structs
      1. Creating ARRAYs by Using ARRAY_AGG
      2. ARRAY of STRUCT
      3. TUPLE
      4. Working with Arrays
      5. UNNEST an Array
    4. Joining Tables
      1. The JOIN Explained
      2. INNER JOIN
      3. CROSS JOIN
      4. OUTER JOIN
    5. Saving and Sharing
      1. Query History and Caching
      2. Saved Queries
      3. Views versus Shared Queries
    6. Summary
  4. 3. Data Types, Functions, and Operators
    1. Numeric Types and Functions
      1. Mathematical Functions
      2. Standard-Compliant Floating-Point Division
      3. SAFE Functions
      4. Comparisons
      5. Precise Decimal Calculations with NUMERIC
    2. Working with BOOL
      1. Logical Operations
      2. Conditional Expressions
      3. Cleaner NULL-Handling with COALESCE
      4. Casting and Coercion
      5. Using COUNTIF to Avoid Casting Booleans
    3. String Functions
      1. Internationalization
      2. Printing and parsing
      3. String Manipulation Functions
      4. Transformation Functions
      5. Regular Expressions
      6. Summary of String Functions
    4. Working with TIMESTAMP
      1. Parsing and Formatting timestamps
      2. Extracting Calendar Parts
      3. Arithmetic with Timestamps
      4. Date, Time, and DateTime
    5. Working with GIS functions
    6. Summary
  5. 4. Loading Data into BigQuery
    1. The Basics
      1. Loading from a Local Source
      2. Specifying a Schema
      3. Copying into a New Table
      4. Data Management (DDL and DML)
      5. Loading Data Efficiently
    2. Federated Queries and External Data Sources
      1. How to Use Federated Queries
      2. When to Use Federated Queries and External Data Sources
      3. Interactive Exploration and Querying of Data in Google Sheets
      4. SQL Queries on Data in Cloud Bigtable
    3. Transfers and Exports
      1. Data Transfer Service
      2. Exporting Stackdriver Logs
      3. Using Cloud Dataflow to Read/Write from BigQuery
    4. Moving On-Premises Data
      1. Data Migration Methods
    5. Summary
  6. 5. Developing with BigQuery
    1. Developing Programmatically
      1. Accessing BigQuery via the REST API
      2. Google Cloud Client library
    2. Accessing BigQuery from Data Science Tools
      1. Notebooks on Google Cloud Platform
      2. Working with BigQuery, pandas, and Jupyter
      3. Working with BigQuery from R
      4. Cloud Dataflow
      5. JDBC/ODBC drivers
      6. Incorporating BigQuery Data into Google Slides (in GSuite)
    3. Bash Scripting with BigQuery
      1. Creating Datasets and Tables
      2. Executing Queries
      3. BigQuery objects
    4. Summary
  7. 6. Architecture of BigQuery
    1. High-Level Architecture
      1. Life of a Query Request
      2. BigQuery Upgrades
    2. Query Engine (Dremel)
      1. Dremel Architecture
      2. Query Execution
    3. Storage
      1. Storage Data
      2. Metadata
    4. Summary
  8. 7. Optimizing Performance and Cost
    1. Principles of Performance
      1. Key Drivers of Performance
      2. Controlling Cost
    2. Measuring and Troubleshooting
      1. Measuring Query Speed Using REST API
      2. Measuring Query Speed Using BigQuery Workload Tester
      3. Troubleshooting Workloads Using Stackdriver
      4. Reading Query Plan Information
    3. Increasing Query Speed
      1. Minimizing I/O
      2. Caching the Results of Previous Queries
      3. Performing Efficient Joins
      4. Avoiding Overwhelming a Worker
      5. Approximate Aggregation Functions
    4. Optimizing How Data Is Stored and Accessed
      1. Minimizing Network Overhead
      2. Choosing an Efficient Storage Format
      3. Partitioning Tables to Reduce Scan Size
      4. Clustering Tables Based on High-Cardinality Keys
    5. Time-Insensitive Use Cases
      1. Batch Queries
      2. File Loads
    6. Summary
      1. Checklist
  9. 8. Advanced Queries
    1. Reusable Queries
      1. Parameterized Queries
      2. SQL User-Defined Functions
      3. Reusing Parts of Queries
    2. Advanced SQL
      1. Working with Arrays
      2. Window functions
      3. Table Metadata
      4. Data Definition Language and Data Manipulation Language
    3. Beyond SQL
      1. JavaScript UDFs
      2. Scripting
    4. Advanced Functions
      1. BigQuery Geographic Information Systems
      2. Useful Statistical Functions
      3. Hash Algorithms
    5. Summary
  10. 9. Machine Learning in BigQuery
    1. What Is Machine Learning?
      1. Formulating a Machine Learning problem
      2. Types of Machine Learning Problems
    2. Building a Regression Model
      1. Choose the label
      2. Exploring the Dataset to Find Features
      3. Creating a Training Dataset
      4. Training and Evaluating the Model
      5. Predicting with Model
      6. Examining Model Weights
      7. More-Complex Regression Models
    3. Building a Classification Model
      1. Training
      2. Evaluation
      3. Prediction
      4. Choosing the Threshold
    4. Customizing BigQuery ML
      1. Controlling Data Split
      2. Balancing Classes
      3. Regularization
    5. k-Means Clustering
      1. What’s Being Clustered?
      2. Clustering Bicycle Stations
      3. Carrying Out Clustering
      4. Understanding the Clusters
      5. Data-Driven Decisions
    6. Recommender Systems
      1. The MovieLens Dataset
      2. Matrix Factorization
      3. Making Recommendations
      4. Incorporating User and Movie Information
    7. Custom Machine Learning Models on GCP
      1. Hyperparameter Tuning
      2. AutoML
      3. Support for TensorFlow
    8. Summary
  11. 10. Administering and Securing BigQuery
    1. Infrastructure Security
    2. Identity and Access Management
      1. Identity
      2. Role
      3. Resource
    3. Administering BigQuery
      1. Job Management
      2. Authorizing Users
      3. Restoring Deleted Records and Tables
      4. Continuous Integration/Continuous Deployment
      5. Cost/Billing Exports
      6. Dashboards, Monitoring, and Audit Logging
    4. Availability, Disaster Recovery, and Encryption
      1. Zones, Regions, and Multiregions
      2. BigQuery and Failure Handling
      3. Durability, Backups, and Disaster Recovery
      4. Privacy and Encryption
    5. Regulatory Compliance
      1. Data Locality
      2. Restricting Access to Subsets of Data
      3. Removing All Transactions Related to a Single Individual
      4. Data Loss Prevention
      5. CMEK
      6. Data Exfiltration Protection
    6. Summary
  12. Index