Amazon Redshift: The Definitive Guide

Book description

Amazon Redshift powers analytic cloud data warehouses worldwide, from startups to some of the largest enterprise data warehouses available today. This practical guide thoroughly examines this managed service and demonstrates how you can use it to extract value from your data immediately, rather than go through the heavy lifting required to run a typical data warehouse.

Analytic specialists Rajesh Francis, Rajiv Gupta, and Milind Oke detail Amazon Redshift's underlying mechanisms and options to help you explore out-of-the box automation. Whether you're a data engineer who wants to learn the art of the possible or a DBA looking to take advantage of machine learning-based auto-tuning, this book helps you get the most value from Amazon Redshift.

By understanding Amazon Redshift features, you'll achieve excellent analytic performance at the best price, with the least effort. This book helps you:

  • Build a cloud data strategy around Amazon Redshift as foundational data warehouse
  • Get started with Amazon Redshift with simple-to-use data models and design best practices
  • Understand how and when to use Redshift Serverless and Redshift provisioned clusters
  • Take advantage of auto-tuning options inherent in Amazon Redshift and understand manual tuning options
  • Transform your data platform for predictive analytics using Redshift ML and break silos using data sharing
  • Learn best practices for security, monitoring, resilience, and disaster recovery
  • Leverage Amazon Redshift integration with other AWS services to unlock additional value

Publisher resources

View/Submit Errata

Table of contents

  1. Foreword
  2. Preface
    1. Conventions Used in This Book
    2. Using Code Examples
    3. O’Reilly Online Learning
    4. How to Contact Us
    5. Acknowledgments
  3. 1. AWS for Data
    1. Data-Driven Organizations
      1. Business Use Cases
      2. New Business Use Cases with Generative AI
    2. Modern Data Strategy
      1. Comprehensive Set of Capabilities
      2. Integrated Set of Tools
      3. End-to-End Data Governance
    3. Modern Data Architecture
      1. Role of Amazon Redshift in a Modern Data Architecture
      2. Real-World Benefits of Adopting a Modern Data Architecture
      3. Reference Architecture for Modern Data Architecture
      4. Data Sourcing
      5. Extract, Transform, and Load
      6. Storage
      7. Analysis
    4. Data Mesh and Data Fabric
      1. Data Mesh
      2. Data Fabric
    5. Summary
  4. 2. Getting Started with Amazon Redshift
    1. Amazon Redshift Architecture Overview
    2. Get Started with Amazon Redshift Serverless
      1. Creating an Amazon Redshift Serverless Data Warehouse
    3. Sample Data
      1. Activate Sample Data Models and Query Using the Query Editor
    4. When to Use a Provisioned Cluster?
      1. Creating an Amazon Redshift Provisioned Cluster
    5. Estimate Your Amazon Redshift Cost
      1. Amazon Redshift Managed Storage
      2. Amazon Redshift Serverless Compute Cost
      3. Amazon Redshift Provisioned Compute Cost
    6. AWS Account Management
    7. Connecting to Your Amazon Redshift Data Warehouse
      1. Private/Public VPC and Secure Access
      2. Stored Password
      3. Temporary Credentials
      4. Federated User
      5. SAML-Based Authentication from an Identity Provider
      6. Native IdP Integration
      7. Amazon Redshift Data API
      8. Querying a Database Using the Query Editor V2
      9. Business Intelligence Using Amazon QuickSight
      10. Connecting to Amazon Redshift Using JDBC/ODBC
    8. Summary
  5. 3. Setting Up Your Data Models and Ingesting Data
    1. Data Lake First Versus Data Warehouse First Strategy
      1. Data Lake First Strategy
      2. Data Warehouse First Strategy
      3. Deciding On a Strategy
    2. Defining Your Data Model
      1. Database Schemas, Users, and Groups
      2. Star Schema, Denormalized, Normalized
    3. Student Information Learning Transactional Dataset
      1. Create Data Models for Student Information Learning Transactional Dataset
      2. Ingest Data for the Student Learning Transaction Dataset
    4. Student Information Learning Analytics Dataset
      1. Create Data Models for Student Information Learning Analytics Dataset
    5. Load Batch Data into Amazon Redshift
      1. Using the COPY Command
      2. Ingest Data for the Student Learning Analytics Dataset
      3. Building a Star Schema
      4. Continuous File Ingestion from Amazon S3
      5. Using AWS Glue for Transformations
      6. Manual Loading Using SQL Commands
      7. Using the Query Editor V2
    6. Load Real-Time and Near Real-Time Data
      1. Near Real-Time Replication Using AWS Database Migration Service
      2. Amazon Aurora Zero-ETL Integration with Amazon Redshift
      3. Using Amazon AppFlow
      4. Streaming Ingestion
    7. Optimize Your Data Structures
      1. Automatic Table Optimization and Autonomics
      2. Distribution Style
      3. Sort Key
      4. Compression Encoding
    8. Summary
  6. 4. Data Transformation Strategies
    1. Comparing ELT and ETL Strategies
    2. In-Database Transformation
      1. Semistructured Data
      2. User-Defined Functions
      3. Stored Procedures
    3. Scheduling and Orchestration
    4. Access All Your Data
      1. External Amazon S3 Data
      2. External Operational Data
      3. External Amazon Redshift Data
    5. External Transformation
      1. AWS Glue
    6. Summary
  7. 5. Scaling and Performance Optimizations
    1. Scale Storage
    2. Autoscale Your Serverless Data Warehouse
    3. Scale Your Provisioned Data Warehouse
      1. Evolving Compute Demand
      2. Unpredictable Workload Changes
    4. WLM, Queues, and QMR
      1. Queue Assignment
      2. Short Query Acceleration
      3. Query Monitoring Rules
      4. Automatic WLM
      5. Manual WLM
      6. Parameter Group
      7. WLM Dynamic Memory Allocation
    5. Materialized Views
    6. Autonomics
      1. Auto Table Optimizer and Smart Defaults
      2. Auto Vacuum
      3. Auto Vacuum Sort
      4. Auto Analyze
      5. Auto Materialized Views (AutoMV)
      6. Amazon Redshift Advisor
    7. Workload Isolation
    8. Additional Optimizations for Achieving the Best Price and Performance
      1. Database Versus Data Warehouse
      2. Amazon Redshift Serverless
      3. Multi-Warehouse Environment
      4. AWS Data Exchange
      5. Table Design
      6. Indexes Versus Zone Maps
      7. Drivers
      8. Simplify ETL
      9. Query Editor V2
    9. Query Tuning
      1. Query Processing
      2. Analyzing Queries
      3. Identifying Queries for Performance Tuning
    10. Summary
  8. 6. Amazon Redshift Machine Learning
    1. Machine Learning Cycle
    2. Amazon Redshift ML
      1. Amazon Redshift ML Flexibility
      2. Getting Started with Amazon Redshift ML
    3. Machine Learning Techniques
      1. Supervised Learning Techniques
      2. Unsupervised Learning Techniques
    4. Machine Learning Algorithms
    5. Integration with Amazon SageMaker Autopilot
      1. Create Model
      2. Label Probability
      3. Explain Model
    6. Using Amazon Redshift ML to Predict Student Outcomes
    7. Amazon SageMaker Integration with Amazon Redshift
    8. Integration with Amazon SageMaker—Bring Your Own Model (BYOM)
      1. BYOM Local
      2. BYOM Remote
    9. Amazon Redshift ML Costs
    10. Summary
  9. 7. Collaboration with Data Sharing
    1. Amazon Redshift Data Sharing Overview
    2. Data Sharing Use Cases
    3. Key Concepts of Data Sharing
    4. How to Use Data Sharing
      1. Sharing Data Within the Same Account
      2. Sharing Data Across Accounts Using Cross-Account Data Sharing
    5. Analytics as a Service Use Case with Multi-Tenant Storage Patterns
      1. Scaling Your Multi-tenant Architecture Using Data Sharing
      2. Multi-tenant Storage Patterns Using Data Sharing
    6. External Data Sharing with AWS ADX Integration
      1. Publishing a Data Product
      2. Subscribing to a Published Data Product
      3. Considerations When Using AWS Data Exchange for Amazon Redshift
    7. Query from the Data Lake and Unload to the Data Lake
    8. Amazon DataZone to Discover and Share Data
      1. Use Cases for a Data Mesh Architecture with Amazon DataZone
      2. Key Capabilities and Use Cases for Amazon DataZone
      3. Amazon DataZone Integrations with Amazon Redshift and Other AWS Services
      4. Components and Capabilities of Amazon DataZone
      5. Getting Started with Amazon DataZone
      6. Security in Amazon DataZone
    9. Summary
  10. 8. Securing and Governing Data
    1. Object-Level Access Controls
      1. Object Ownership
      2. Default Privileges
      3. Public Schema and Search Path
      4. Access Controls in Action
    2. Database Roles
      1. Database Roles in Action
    3. Row-Level Security
      1. Row-Level Security in Action
      2. Row-Level Security Considerations
    4. Dynamic Data Masking
      1. Dynamic Data Masking in Action
      2. Dynamic Data Masking Considerations
    5. External Data Access Control
      1. Associate IAM Roles
      2. Authorize Assume Role Privileges
      3. Establish External Schemas
      4. Lake Formation for Fine-Grained Access Control
    6. Summary
  11. 9. Migrating to Amazon Redshift
    1. Migration Considerations
      1. Retire Versus Retain
      2. Migration Data Size
      3. Platform-Specific Transformations Required
      4. Data Volatility and Availability Requirements
      5. Selection of Migration and ETL Tools
      6. Data Movement Considerations
      7. Domain Name System (DNS)
    2. Migration Strategies
      1. One-Step Migration
      2. Two-Step Migration
      3. Iterative Migration
    3. Migration Tools and Services
      1. AWS Schema Conversion Tool
      2. Data Warehouse Migration Service
      3. AWS Snow Family
      4. AWS Snowball Edge Client
    4. Database Migration Process
      1. Step 1: Convert Schema and Subject Area
      2. Step 2: Initial Data Extraction and Load
      3. Step 3: Incremental Load Through Data Capture
    5. Amazon Redshift Migration Tools Considerations
    6. Accelerate Your Migration to Amazon Redshift
      1. Macro Conversion
      2. Case-Insensitive String Comparison
      3. Recursive Common Table Expressions
      4. Proprietary Data Types
    7. Summary
  12. 10. Monitoring and Administration
    1. Amazon Redshift Monitoring Overview
      1. Monitoring
      2. Troubleshooting
      3. Optimization
    2. Monitoring Using Console
      1. Monitoring and Administering Serverless
      2. Monitoring Provisioned Data Warehouse Using Console
      3. Monitoring Queries and Loads Across Clusters
      4. Identifying Systemic Query Performance Problems
    3. Monitoring Using Amazon CloudWatch
      1. Amazon Redshift CloudWatch Metrics
    4. Monitoring Using System Tables and Views
      1. Monitoring Serverless Using System Views
    5. High Availability and Disaster Recovery
      1. Recovery Time Objective and Recovery Point Objective Considerations
      2. Multi-AZ Compared to Single-AZ Deployment
      3. Creating or Converting a Provisioned Data Warehouse with Multi-AZ Configuration
      4. Auto Recovery of Multi-AZ Deployment
    6. Snapshots, Backup, and Restore
      1. Snapshots for Backup
      2. Automated Snapshots
      3. Manual Snapshots
      4. Disaster Recovery Using Cross-Region Snapshots
      5. Using Snapshots for Simple-Replay
    7. Monitoring Amazon Redshift Using CloudTrail
    8. Bring Your Own Visualization Tool to Monitor Amazon Redshift
      1. Monitor Operational Metrics Using System Tables and Amazon QuickSight
      2. Monitor Operational Metrics Using Grafana Plug-in for Amazon Redshift
    9. Summary
  13. Index
  14. About the Authors

Product information

  • Title: Amazon Redshift: The Definitive Guide
  • Author(s): Rajesh Francis, Rajiv Gupta, Milind Oke
  • Release date: October 2023
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098135300