Learning and Operating Presto

Book description

The Presto community has mushroomed since the project's origins at Facebook in 2012. But getting up to speed with this open source distributed SQL query engine can be challenging even for the most experienced engineers. With this practical book, data engineers and architects, platform engineers, cloud engineers, and software engineers will learn how to operate Presto at their organizations to derive insights from datasets wherever they reside.

Authors Angelica Lo Duca, Tim Meehan, Vivek Bharathan, and Ying Su explain what Presto is, where it came from, and how it differs from other data warehousing solutions. You'll discover why Facebook, Uber, Alibaba Cloud, Hewlett Packard Enterprise, IBM, Intel, and many more use Presto and how you can quickly deploy Presto in production.

With this book, you will:

  • Learn how to install and configure Presto
  • Use Presto with business intelligence tools
  • Understand how to connect Presto to a variety of data sources
  • Extend Presto for real-time business insight
  • Learn how to apply best practices and tuning
  • Get troubleshooting tips for logs, error messages, and more
  • Explore Presto's architectural concepts and usage patterns
  • Understand Presto security and administration

Table of contents

  1. Preface
    1. Why We Wrote This Book
    2. Who This Book Is For
    3. Conventions Used in This Book
    4. Using Code Examples
    5. O’Reilly Online Learning
    6. How to Contact Us
    7. Acknowledgments
      1. Angelica Lo Duca
      2. Tim Meehan
      3. Vivek Bharathan
      4. Ying Su
  2. Chapter 1: Introduction to Presto
    1. Data Warehouses and Data Lakes
    2. The Role of Presto in a Data Lake
    3. Presto Origins and Design Considerations
      1. High Performance
      2. High Scalability
      3. Compliance with the ANSI SQL Standard
      4. Federation of Data Sources
      5. Running in the Cloud
    4. Presto Architecture and Core Components
    5. Alternatives to Presto
      1. Apache Impala
      2. Apache Hive
      3. Spark SQL
      4. Trino
    6. Presto Use Cases
      1. Reporting and Dashboarding
      2. Ad Hoc Querying
      3. ETL Using SQL
      4. Data Lakehouse
      5. Real-Time Analytics with Real-Time Databases
    7. Introducing Our Case Study
    8. Conclusion
  3. Chapter 2: Getting Started with Presto
    1. Presto Manual Installation
    2. Running Presto on Docker
      1. Installing Docker
      2. Presto Docker Image
      3. Building and Running Presto on Docker
      4. The Presto Sandbox
    3. Deploying Presto on Kubernetes
      1. Introducing Kubernetes
      2. Configuring Presto on Kubernetes
      3. Adding a New Catalog
      4. Running the Deployment on Kubernetes
    4. Querying Your Presto Instance
      1. Listing Catalogs
      2. Listing Schemas
      3. Listing Tables
      4. Querying a Table
    5. Conclusion
  4. Chapter 3: Connectors
    1. Service Provider Interface
    2. Connector Architecture
    3. Popular Connectors
      1. Thrift
    4. Writing a Custom Connector
      1. Prerequisites
      2. Plugin and Module
      3. Configuration
      4. Metadata
      5. Input/Output
      6. Deploying Your Connector
    5. Apache Pinot
      1. Setting Up and Configuring Presto
      2. Presto-Pinot Querying in Action
    6. Conclusion
  5. Chapter 4: Client Connectivity
    1. Setting Up the Environment
      1. Presto Client
      2. Docker Image
      3. Kubernetes Node
    2. Connectivity to Presto
      1. REST API
      2. Python
      3. R
      4. JDBC
      5. Node.js
      6. ODBC
      7. Other Presto Client Libraries
    3. Building a Client Dashboard in Python
      1. Setting Up the Client
      2. Building the Dashboard
    4. Conclusion
  6. Chapter 5: Open Data Lakehouse Analytics
    1. The Emergence of the Lakehouse
    2. Data Lakehouse Architecture
    3. Data Lake
      1. File Store
      2. File Format
      3. Table Format
    4. Query Engine
    5. Metadata Management
    6. Data Governance
      1. Data Access Control
    7. Building a Data Lakehouse
      1. Configuring MinIO
      2. Configuring HMS
      3. Configuring Spark
      4. Registering Hudi Tables with HMS
      5. Connecting and Querying Presto
    8. Conclusion
  7. Chapter 6: Presto Administration
    1. Introducing Presto Administration
    2. Configuration
      1. Properties
      2. Sessions
      3. JVM
    3. Monitoring
      1. Console
      2. REST API
      3. Metrics
    4. Management
      1. Resource Groups
      2. Verifiers
      3. Session Properties Managers
      4. Namespace Functions
    5. Conclusion
  8. Chapter 7: Understanding Security in Presto
    1. Introducing Presto Security
    2. Building Secure Communication in Presto
      1. Encryption
      2. Keystore Management
      3. Configuring HTTPS/TLS
    3. Authentication
      1. File-Based Authentication
      2. LDAP
      3. Kerberos
      4. Creating a Custom Authenticator
    4. Authorization
      1. Authorizing Access to the Presto REST API
      2. Configuring System Access Control
      3. Authorization Through Apache Ranger
    5. Conclusion
  9. Chapter 8: Performance Tuning
    1. Introducing Performance Tuning
      1. Reasons for Performance Tuning
      2. The Performance Tuning Life Cycle
    2. Query Execution Model
    3. Approaches for Performance Tuning in Presto
      1. Resource Allocation
      2. Storage
      3. Query Optimization
    4. Aria Scan
      1. Table Scanning
      2. Repartitioning
    5. Implementing Performance Tuning
      1. Building and Importing the Sample CSV Table in MinIO
      2. Converting the CSV Table in ORC
      3. Defining the Tuning Parameters
      4. Running Tests
    6. Conclusion
  10. Chapter 9: Operating Presto at Scale
    1. Introducing Scalability
      1. Reasons to Scale Presto
      2. Common Issues
    2. Design Considerations
      1. Availability
      2. Manageability
      3. Performance
      4. Protection
      5. Configuration
    3. How to Scale Presto
      1. Multiple Coordinators
      2. Presto on Spark
      3. Spilling
    4. Using a Cloud Service
    5. Conclusion
  11. Index
  12. About the Authors

Product information

  • Title: Learning and Operating Presto
  • Author(s): Angelica Lo Duca, Tim Meehan, Vivek Bharathan, Ying Su
  • Release date: September 2023
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098141851