O'Reilly logo

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Google BigQuery: The Definitive Guide

Book Description

With Early Release ebooks, you get books in their earliest form—the authors' raw and unedited content as they write—so you can take advantage of these technologies long before the official release of these titles.

Work with petabyte-scale datasets while building a collaborative, agile workplace in the process. This practical book is the canonical reference to Google BigQuery, the query engine that lets you conduct interactive analysis of large datasets. BigQuery enables enterprises to efficiently store, query, ingest, and learn from their data in a convenient framework. With this book, you’ll examine how to analyze data at scale to derive insights from large datasets efficiently.

Valliappa Lakshmanan, tech lead on Google Cloud Platform, and Jordan Tigani, engineering director for the BigQuery team, provide best practices in modern data warehousing within an autoscaled, serverless, public cloud. Whether you want to explore parts of BigQuery you’re not familiar with, or prefer to focus on specific tasks, this reference is indispensable.

Table of Contents

  1. 1. What is Google BigQuery?
    1. Data processing architectures
      1. Relational Database Management System (RDBMS)
      2. MapReduce Framework
      3. BigQuery: A serverless, distributed SQL engine
    2. Working with BigQuery
      1. Deriving insights across datasets
      2. ETL, EL, and ELT
      3. Powerful Analytics
      4. Simplicity
    3. How BigQuery came about
    4. What makes BigQuery possible?
      1. Separation of compute and storage
      2. Storage and Networking Infrastructure
      3. Managed storage
      4. Integration with GCP
      5. Security and compliance
  2. 2. Query Essentials
    1. Simple queries
      1. Retrieving rows with SELECT
      2. Aliasing column names with AS
      3. Filtering with WHERE
      4. Subqueries with WITH
      5. Sorting with ORDER BY
    2. Aggregates
      1. Computing aggregates with GROUP BY
      2. Counting records with COUNT
      3. Filtering grouped items with HAVING
      4. Finding unique values with DISTINCT
    3. A brief primer on arrays and structs
      1. Creating ARRAYs using ARRAY_AGG
      2. ARRAY of STRUCT
      3. TUPLE
      4. Working with arrays
      5. UNNEST an array
    4. Joining tables
      1. The JOIN, explained
      2. Inner join
      3. Cross Join
      4. Outer join
    5. Saving and sharing
      1. Query history and caching
      2. Saved queries
      3. Views vs. shared queries
    6. Chapter Summary
  3. 3. Data Types, Functions and Operators
    1. Numeric types and functions
    2. Working with BOOL
    3. String functions
    4. Working with TIMESTAMP
    5. Working with GIS functions
    6. Summary of data types in BigQuery
  4. 4. Loading Data into BigQuery
    1. The basics
      1. Loading from a local source
      2. Specifying a schema
      3. Copying into a new table
      4. Data management (DDL and DML)
      5. Loading data efficiently
    2. Federated queries and external data sources
      1. How to use federated queries
      2. When to use federated queries and external data sources
      3. Interactive exploration and querying of data in Google Sheets
      4. SQL queries on data in Cloud Bigtable
    3. Transfers and exports
      1. Data Transfer Service
      2. Export Stackdriver Logs
      3. Using Cloud Dataflow to read/write from BigQuery
    4. Migrating on-premises data
      1. Data migration methods
    5. Summary
  5. 5. Developing with BigQuery
    1. Developing programmatically
      1. Accessing BigQuery via REST API
      2. Google Cloud Client library
    2. Accessing BigQuery from data science tools
      1. Notebooks on Google Cloud
      2. pandas
      3. Working with BigQuery from R
      4. Dataflow
      5. JDBC/ODBC drivers
      6. Incorporating BigQuery data into Google Slides (in GSuite)
    3. Bash Scripting with BigQuery
      1. Creating datasets and tables
      2. Executing queries
      3. BigQuery objects
    4. Summary
  6. 6. Architecture of BigQuery
    1. High-level architecture
      1. Life of a Query Request
      2. BigQuery Upgrades
    2. Query Engine (Dremel)
      1. Dremel Architecture
      2. Query Execution
    3. Storage
      1. Metadata
      2. Storage Data
    4. Security, Availability and Disaster Recovery
      1. Zones, Regions, and Multi-Regions
      2. BigQuery and Failure handling
      3. Durability, Backups and Disaster Recovery
      4. Privacy, Security and Encryption
    5. Summary
  7. 7. Optimizing Performance and Cost
    1. Principles of performance
      1. Key drivers of performance
      2. Controlling cost
    2. Measuring and troubleshooting
      1. Measuring query speed using REST API
      2. Measuring query speed using BigQuery Workload Tester
      3. Troubleshooting workloads using Stackdriver
      4. Reading query plan information
    3. Increasing query speed
      1. Minimize I/O
      2. Cache the results of previous queries
      3. Efficient joins
      4. Avoid overwhelming a worker
      5. Approximate aggregation functions
    4. Optimizing how data is stored and accessed
      1. Minimizing network overhead
      2. Choosing an efficient storage format
      3. Partition tables to reduce scan size
      4. Cluster tables based on high cardinality keys
    5. Time-insensitive use cases
      1. Batch queries
      2. File loads
    6. Summary
      1. Checklist
  8. 8. Advanced Queries
    1. Reusable queries
      1. Parameterized queries
      2. SQL User Defined Functions
      3. Reusing parts of queries
    2. Advanced SQL
      1. Working with arrays
      2. Window functions
      3. Table metadata
      4. DDL and DML
    3. Beyond SQL
      1. JavaScript UDFs
      2. Scripting
    4. Advanced Functions
      1. BigQuery GIS
      2. Useful statistical functions
      3. Hash algorithms
    5. Summary
  9. 9. Machine Learning in BigQuery
    1. What is ML?
      1. Formulating an ML problem
      2. Types of ML problems
    2. Regression
      1. Choose the label
      2. Explore dataset to find features
      3. Create training dataset
      4. Train and evaluate
      5. Predict with model
      6. Examining model weights
      7. More complex regression models
    3. Classification model
      1. Training
      2. Evaluation
      3. Prediction
      4. Choosing threshold
    4. Customizing BigQuery ML
      1. Controlling data split
      2. Balancing classes
      3. Regularization
    5. K-Means clustering
      1. What’s being clustered?
      2. Clustering bicycle stations
      3. Carry out clustering
      4. Understand the clusters
      5. Data driven decisions
    6. Recommender systems
      1. MovieLens dataset
      2. Matrix factorization
      3. Making recommendations
      4. Incorporating user and movie information
    7. Custom Machine Learning Models on GCP
      1. AutoML
      2. Support for TensorFlow
    8. Summary
  10. 10. Administering and Securing BigQuery
    1. Infrastructure Security
    2. Identity and Access Management
      1. Identity
      2. Role
      3. Resource
    3. Administering BigQuery
      1. Job management
      2. Authorizing users
      3. Restoring deleted records and tables
      4. Continuous integration / continuous deployment (CICD)
      5. Cost/billing exports
      6. Dashboards, monitoring, and audit logging
    4. Regulatory Compliance
      1. Data locality
      2. Restricting access to subsets of data
      3. Removing all transactions related to a single individual
      4. Data loss prevention
      5. Customer Managed Encryption Keys (CMEK)
      6. Data exfiltration protection
    5. Summary