Presto: The Definitive Guide

Book description

Perform fast interactive analytics against different data sources using the Presto high-performance, distributed SQL query engine. With this practical guide, you’ll learn how to conduct analytics on data where it lives, whether it’s Hive, Cassandra, a relational database, or a proprietary data store. Analysts, software engineers, and production engineers will learn how to manage, use, and even develop with Presto.

Initially developed by Facebook, open source Presto is now used by Netflix, Airbnb, LinkedIn, Twitter, Uber, and many other companies. Matt Fuller, Manfred Moser, and Martin Traverso show you how a single Presto query can combine data from multiple sources to allow for analytics across your entire organization.

  • Get started: Explore Presto’s use cases and learn about tools that will help you connect to Presto and query data
  • Go deeper: Learn Presto’s internal workings, including how to connect to and query data sources with support for SQL statements, operators, functions, and more
  • Put Presto in production: Secure Presto, monitor workloads, tune queries, and connect more applications; learn how other organizations apply Presto

Table of contents

  1. Foreword
  2. Preface
    1. About the Book
    2. Conventions Used in This Book
    3. Code Examples, Permissions, and Attribution
    4. O’Reilly Online Learning
    5. How to Contact Us
    6. Acknowledgments
  3. I. Getting Started with Presto
  4. 1. Introducing Presto
    1. The Problems with Big Data
    2. Presto to the Rescue
      1. Designed for Performance and Scale
      2. SQL-on-Anything
      3. Separation of Data Storage and Query Compute Resources
    3. Presto Use Cases
      1. One SQL Analytics Access Point
      2. Access Point to Data Warehouse and Source Systems
      3. Provide SQL-Based Access to Anything
      4. Federated Queries
      5. Semantic Layer for a Virtual Data Warehouse
      6. Data Lake Query Engine
      7. SQL Conversions and ETL
      8. Better Insights Due to Faster Response Times
      9. Big Data, Machine Learning, and Artificial Intelligence
      10. Other Use Cases
    4. Presto Resources
      1. Website
      2. Documentation
      3. Community Chat
      4. Source Code, License, and Version
      5. Contributing
      6. Book Repository
      7. Iris Data Set
      8. Flight Data Set
    5. A Brief History of Presto
    6. Conclusion
  5. 2. Installing and Configuring Presto
    1. Trying Presto with the Docker Container
    2. Installing from Archive File
      1. Java Virtual Machine
      2. Python
      3. Installation
      4. Configuration
    3. Adding a Data Source
    4. Running Presto
    5. Conclusion
  6. 3. Using Presto
    1. Presto Command-Line Interface
      1. Getting Started
      2. Pagination
      3. History
      4. Additional Diagnostics
      5. Executing Queries
      6. Output Formats
      7. Ignoring Errors
    2. Presto JDBC Driver
      1. Downloading and Registering the Driver
      2. Establishing a Connection to Presto
    3. Presto and ODBC
    4. Client Libraries
    5. Presto Web UI
    6. SQL with Presto
      1. Concepts
      2. First Examples
    7. Conclusion
  7. II. Diving Deeper into Presto
  8. 4. Presto Architecture
    1. Coordinator and Workers in a Cluster
    2. Coordinator
    3. Discovery Service
    4. Workers
    5. Connector-Based Architecture
    6. Catalogs, Schemas, and Tables
    7. Query Execution Model
    8. Query Planning
      1. Parsing and Analysis
      2. Initial Query Planning
    9. Optimization Rules
      1. Predicate Pushdown
      2. Cross Join Elimination
      3. TopN
      4. Partial Aggregations
    10. Implementation Rules
      1. Lateral Join Decorrelation
      2. Semi-Join (IN) Decorrelation
    11. Cost-Based Optimizer
      1. The Cost Concept
      2. Cost of the Join
      3. Table Statistics
      4. Filter Statistics
      5. Table Statistics for Partitioned Tables
      6. Join Enumeration
      7. Broadcast Versus Distributed Joins
    12. Working with Table Statistics
      1. Presto ANALYZE
      2. Gathering Statistics When Writing to Disk
      3. Hive ANALYZE
      4. Displaying Table Statistics
    13. Conclusion
  9. 5. Production-Ready Deployment
    1. Configuration Details
    2. Server Configuration
    3. Logging
    4. Node Configuration
    5. JVM Configuration
    6. Launcher
    7. Cluster Installation
    8. RPM Installation
      1. Installation Directory Structure
      2. Configuration
      3. Uninstall Presto
    9. Installation in the Cloud
    10. Cluster Sizing Considerations
    11. Conclusion
  10. 6. Connectors
    1. Configuration
    2. RDBMS Connector Example PostgreSQL
      1. Query Pushdown
      2. Parallelism and Concurrency
      3. Other RDBMS Connectors
      4. Security
    3. Presto TPC-H and TPC-DS Connectors
    4. Hive Connector for Distributed Storage Data Sources
      1. Apache Hadoop and Hive
      2. Hive Connector
      3. Hive-Style Table Format
      4. Managed and External Tables
      5. Partitioned Data
      6. Loading Data
      7. File Formats and Compression
      8. MinIO Example
    5. Non-Relational Data Sources
    6. Presto JMX Connector
    7. Black Hole Connector
    8. Memory Connector
    9. Other Connectors
    10. Conclusion
  11. 7. Advanced Connector Examples
    1. Connecting to HBase with Phoenix
    2. Key-Value Store Connector Example: Accumulo
      1. Using the Presto Accumulo Connector
      2. Predicate Pushdown in Accumulo
    3. Apache Cassandra Connector
    4. Streaming System Connector Example: Kafka
    5. Document Store Connector Example: Elasticsearch
      1. Overview
      2. Configuration and Usage
      3. Query Processing
      4. Full-Text Search
      5. Summary
    6. Query Federation in Presto
    7. Extract, Transform, Load and Federated Queries
    8. Conclusion
  12. 8. Using SQL in Presto
    1. Presto Statements
    2. Presto System Tables
    3. Catalogs
    4. Schemas
    5. Information Schema
    6. Tables
      1. Table and Column Properties
      2. Copying an Existing Table
      3. Creating a New Table from Query Results
      4. Modifying a Table
      5. Deleting a Table
      6. Table Limitations from Connectors
    7. Views
    8. Session Information and Configuration
    9. Data Types
      1. Collection Data Types
      2. Temporal Data Types
      3. Type Casting
    10. SELECT Statement Basics
    11. WHERE Clause
    12. GROUP BY and HAVING Clauses
    13. ORDER BY and LIMIT Clauses
    14. JOIN Statements
    15. UNION, INTERSECT, and EXCEPT Clauses
    16. Grouping Operations
    17. WITH Clause
    18. Subqueries
      1. Scalar Subquery
      2. EXISTS Subquery
      3. Quantified Subquery
    19. Deleting Data from a Table
    20. Conclusion
  13. 9. Advanced SQL
    1. Functions and Operators Introduction
    2. Scalar Functions and Operators
    3. Boolean Operators
    4. Logical Operators
    5. Range Selection with the BETWEEN Statement
    6. Value Detection with IS (NOT) NULL
    7. Mathematical Functions and Operators
    8. Trigonometric Functions
    9. Constant and Random Functions
    10. String Functions and Operators
    11. Strings and Maps
    12. Unicode
    13. Regular Expressions
    14. Unnesting Complex Data Types
    15. JSON Functions
    16. Date and Time Functions and Operators
    17. Histograms
    18. Aggregate Functions
      1. Map Aggregate Functions
      2. Approximate Aggregate Functions
    19. Window Functions
    20. Lambda Expressions
    21. Geospatial Functions
    22. Prepared Statements
    23. Conclusion
  14. III. Presto in Real-World Uses
  15. 10. Security
    1. Authentication
      1. Password and LDAP Authentication
    2. Authorization
      1. System Access Control
      2. Connector Access Control
    3. Encryption
      1. Encrypting Presto Client-to-Coordinator Communication
      2. Creating Java Keystores and Java Truststores
      3. Encrypting Communication Within the Presto Cluster
    4. Certificate Authority Versus Self-Signed Certificates
    5. Certificate Authentication
    6. Kerberos
      1. Prerequisites
      2. Kerberos Client Authentication
      3. Cluster Internal Kerberos
    7. Data Source Access and Configuration for Security
    8. Kerberos Authentication with the Hive Connector
      1. Hive Metastore Thrift Service Authentication
      2. HDFS Authentication
    9. Cluster Separation
    10. Conclusion
  16. 11. Integrating Presto with Other Tools
    1. Queries, Visualizations, and More with Apache Superset
    2. Performance Improvements with RubiX
    3. Workflows with Apache Airflow
    4. Embedded Presto Example: Amazon Athena
    5. Starburst Enterprise Presto
    6. Other Integration Examples
    7. Custom Integrations
    8. Conclusion
  17. 12. Presto in Production
    1. Monitoring with the Presto Web UI
      1. Cluster-Level Details
      2. Query List
      3. Query Details View
    2. Tuning Presto SQL Queries
    3. Memory Management
    4. Task Concurrency
    5. Worker Scheduling
      1. Scheduling Splits per Task and per Node
      2. Local Scheduling
    6. Network Data Exchange
      1. Concurrency
      2. Buffer Sizes
    7. Tuning Java Virtual Machine
    8. Resource Groups
      1. Resource Group Definition
      2. Scheduling Policy
      3. Selector Rules Definition
    9. Conclusion
  18. 13. Real-World Examples
    1. Deployment and Runtime Platforms
    2. Cluster Sizing
    3. Hadoop/Hive Migration Use Case
    4. Other Data Sources
    5. Users and Traffic
    6. Conclusion
  19. 14. Conclusion
  20. Index

Product information

  • Title: Presto: The Definitive Guide
  • Author(s): Matt Fuller, Manfred Moser, Martin Traverso
  • Release date: April 2020
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492044277