Chapter 1. Introduction to Presto

Over the last few years, the increasing availability of data produced by users and machines has raised new challenges for organizations that want to make sense of their data in order to make better decisions. Becoming a data-driven organization is crucial for finding insights, driving change, and paving the way to new opportunities. While it requires significant investment, the benefits are worth the effort.

This large amount of data is available in different formats, is provided by different data sources, and is searchable with different query languages. In addition, when searching for valuable insights, users need results quickly, which requires high-performance query engines. These challenges led companies such as Facebook (now Meta), Airbnb, Uber, and Netflix to rethink how they manage data. They have progressively moved from the old paradigm based on data warehouses to data lakehouses. While a data warehouse manages structured and historical data, a data lakehouse can also manage and derive insights from unstructured and real-time data.

Presto addresses these challenges. Presto is a distributed SQL query engine, created and used at scale by Facebook. You can easily integrate Presto into your data lake to run fast SQL queries that interact with data wherever it is physically located, regardless of its original format.

This chapter introduces the concept of the data lake and how it differs from the data warehouse. Then, you'll learn what Presto is, why it was created, and why it is used by so many companies. You'll also learn about the most popular Presto use cases, such as ad hoc querying, reporting, and dashboarding. Finally, you'll become familiar with the case study used throughout the book.

Data Warehouses and Data Lakes

There are three main data types: structured data, semi-structured data, and unstructured data. Table 1-1 shows these data types, with a short description, the typical formats, the pros and cons, and some practical examples.

Table 1-1. Data types

Structured data
  Description: Data is organized in a fixed schema
  Typical formats: SQL, CSV
  Pros: Easy to derive insights
  Cons: Schema dependence limits scalability
  Examples: Databases

Semi-structured data
  Description: Data is partially organized, without a fixed schema
  Typical formats: JSON, XML
  Pros: More flexible than structured data
  Cons: The meta-level structure may contain unstructured data
  Examples: Annotated texts, such as tweets with hashtags

Unstructured data
  Description: Data is not organized
  Typical formats: Audio, video, text
  Pros: Very scalable
  Cons: Difficult to search
  Examples: Plain text, digital photos

Depending on the types of supported data and how they are organized and processed, there are different data storage systems. A data warehouse is a central repository containing only structured data and is used for reporting and analysis. Figure 1-1 shows the general architecture of a data warehouse. There are four main layers:

Structured data

Includes structured data provided by multiple sources (e.g., relational database systems)

Extract, transform, and load (ETL)

The process that converts data into a proper format

Data warehouse

Contains data ready to be consumed by the final layers

Reporting, dashboarding, and data mining

The final layers that consume data contained in the data warehouse

Figure 1-1. The general architecture of a data warehouse

With the advent of the big data era, the underlying architecture behind data warehouses has proven insufficient to manage large amounts of data. Big companies, such as Facebook, had the following issues with using data warehouses:

Unstructured data

Since a data warehouse manages structured data, it cannot be used to store raw unstructured data, such as text or audio. You must process unstructured data before ingesting it into a data warehouse.

Scalability

The technical costs of a data warehouse grow nonlinearly with the amount of ingested data and the analytical processing performed on it.

Real-time data

A data warehouse is not suitable for near-real-time data because data must be structured before it can be used.

A data lake addresses these issues. Figure 1-2 shows the general architecture of a data lake.

Figure 1-2. The general architecture of a data lake

Unlike a data warehouse, a data lake manages and provides ways to consume or process structured, semi-structured, and unstructured data. Because it stores raw data, a data lake can ingest both historical and real-time data into a raw storage system. Over time, the concept of the data lake has evolved into the data lakehouse, an augmented data lake that adds support for transactions on top. In practice, a data lakehouse lets you modify existing data in the data lake while following data warehouse semantics. We will discuss the concept of the data lakehouse and implement it in Chapter 5.

The early data lakes, called on-premises data lakes, were installed on company servers. The main advantage of this type of data lake was that the company had total control of the system. With the advent of cloud computing, data lakes have moved to the cloud, leaving management and maintenance to the cloud providers, with providers and customers sharing responsibility for the security of the data. This is called a cloud data lake, and it is growing in popularity. The major platforms that provide cloud data lakes are Amazon Web Services (AWS), Azure, and Google Cloud, which offer them through object stores.

To make data accessible to the upper layers (dashboarding, reporting, and data mining), a data lake provides an intermediate layer, called metadata and governance, which guarantees data consistency and security controls.

The Role of Presto in a Data Lake

Presto is an open source, distributed SQL query engine that supports structured and semi-structured data sources. You can use Presto to query your data directly where it is located, such as in a data lake, without needing to move it to another system. Presto runs queries concurrently through a memory-based architecture, making it very fast and scalable.

Within the data lake architecture, you can imagine that Presto fits into the governance and metadata layer. Presto executes queries directly in memory. Avoiding the need for writing and reading from disk between stages ultimately speeds up the query execution time.

The Presto coordinator machine analyzes any query written in SQL (supporting the ANSI SQL standard), creates and schedules a query plan across a cluster of Presto worker machines connected to the data lake, and then returns the query results. The query plan may have a number of execution stages, depending on the query. For example, if your query joins many large tables, it may need multiple stages, each producing intermediate results. You can think of those intermediate results as your scratchpad for a long calculus problem.
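To get a feel for how the coordinator plans a query, you can prefix the query with EXPLAIN. The sketch below is only illustrative: it assumes the built-in TPC-H connector is configured as a catalog named tpch, and the exact plan output depends on your Presto version and configuration:

    -- Ask the coordinator for the distributed plan instead of running the query
    EXPLAIN (TYPE DISTRIBUTED)
    SELECT c.name, SUM(o.totalprice) AS total_spent
    FROM tpch.tiny.customer AS c
    JOIN tpch.tiny.orders AS o
      ON c.custkey = o.custkey
    GROUP BY c.name;

The output lists the plan fragments (stages) the workers execute, such as the table scans, the join, and the final aggregation.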

Presto Origins and Design Considerations

Presto was implemented by Facebook in 2012 to overcome the issues of Apache Hive, a distributed SQL engine on top of the Hadoop MapReduce framework connected to its data lake. Apache Hive was one of the data warehouses Facebook used at the time. Its main problem was that it was slow when dealing with huge quantities of data.

Apache Hive

Apache Hive was also originally developed and open sourced by Facebook, in 2010. At that time, the architecture underlying Apache Hive was MapReduce, which persisted intermediate datasets to disk. This required frequent disk I/O for transient, intermediate result sets.

To overcome these issues, Facebook developed Presto, a new distributed SQL query engine designed as an in-memory engine that does not need to persist intermediate result sets for a single query. This approach led to a query engine that processed the same queries orders of magnitude faster, with many queries completing with a latency of less than a second. End users, such as engineers, product managers, and data analysts, found they could interactively query subsets of large datasets to test hypotheses and create visualizations.

Figure 1-3 shows how Presto and Hive execute queries. Hive uses the MapReduce framework to run queries: it stores intermediate results to disk after both the map and the reduce phases. Presto, instead, saves time by executing queries in the memory of the worker machines, performing operations on intermediate datasets there rather than persisting them to disk.

Figure 1-3. How Presto and Hive execute queries

In 2013, Facebook made the Presto GitHub repository open source under the Apache 2.0 license. Later, Facebook donated the project to be hosted by the Linux Foundation, which created a subfoundation called the Presto Foundation.

Presto was developed with the following design considerations: high performance, high scalability, compliance to the American National Standards Institute (ANSI) SQL standard, a federation of data sources, and the ability to run in the cloud.

High Performance

Presto's optimizer applies several rules, including well-known optimizations such as predicate and limit pushdown, column pruning, and decorrelation. In practice, Presto can make intelligent choices about how much of the query processing can be pushed down into the data sources, depending on each source's abilities. For example, some data sources may be able to evaluate predicates, aggregations, and function calls themselves. By pushing these operations closer to the data, Presto minimizes disk I/O and network data transfer and thus achieves significantly better performance. The remainder of the query, such as joining data across different data sources, is processed by Presto.
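As a concrete illustration of predicate pushdown, consider the following query. This is a minimal sketch: it assumes a catalog named mysql backed by a relational database, and the sales schema, orders table, and column names are hypothetical:

    -- The WHERE clause can be pushed down to the underlying database,
    -- so only matching rows travel over the network to Presto
    SELECT customer_id, order_total
    FROM mysql.sales.orders
    WHERE order_date >= DATE '2023-01-01';

If the connector supports it, Presto delegates the date filter to the source instead of scanning the whole table and filtering in memory.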

High Scalability

Thanks to its architecture, which you'll see in the next section, Presto can run at any scale, although a large infrastructure isn't a requirement. You can also use Presto in small settings or for prototyping before tackling a larger dataset. Because of its very low latency, there isn't major overhead for running small queries.

Compliance with the ANSI SQL Standard

Presto runs SQL queries, which adhere to the ANSI SQL standard. As most users already know how to write SQL queries, Presto is easily accessible and doesn’t require learning a new language. Presto’s SQL compliance immediately enables a large number of use cases.

What ANSI SQL Compliance Means

Being compliant with the ANSI SQL standard means that the major, commonly used commands, like SELECT, UPDATE, DELETE, INSERT, and JOIN, all operate as you’d expect.
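For example, a standard aggregate query runs in Presto much as it would in any ANSI SQL database. The catalog, schema, and table names below are hypothetical:

    -- Standard ANSI SQL: filtering, grouping, and ordering work as expected
    SELECT region, COUNT(*) AS num_orders
    FROM sales.public.orders
    WHERE status = 'OPEN'
    GROUP BY region
    ORDER BY num_orders DESC;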

Federation of Data Sources

A federated query engine provides unified access to multiple data sources, whether for queries against a single data source at a time or for federated queries that span multiple data sources.

Presto is a federated query engine that supports pluggable connectors to access data from and write data to external data sources—no matter where they reside. Many data sources are available for integration with Presto.

Federated Query Engine Versus Federated Query

The concept of a federated query engine is slightly different from that of the federated query. A federated query is a single query that stores or retrieves data from multiple different data sources, instead of a single data source. A federated query engine is a query engine specifically designed to execute federated queries. Presto is a federated query engine that supports federated queries.
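The following sketch shows what a federated query looks like in Presto. It assumes two configured catalogs, here called postgresql and hive; the schema, table, and column names are hypothetical:

    -- A federated query: a single statement that joins data from two catalogs
    SELECT u.user_id, u.country, SUM(p.amount) AS lifetime_value
    FROM postgresql.crm.users AS u
    JOIN hive.events.purchases AS p
      ON u.user_id = p.user_id
    GROUP BY u.user_id, u.country;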

Figure 1-4 illustrates the basic steps of how a query engine processes a federated query.1 Upon receiving a query, the query engine parses it (query parsing) and accesses the sources catalog to select the data source or sources involved in the query (data source selection). As a result, source selection decomposes the query into subqueries.

Figure 1-4. Basic steps of federated query processing

The next step involves building a logical plan (query optimization), which defines how the query is executed and which operators (JOIN, UNION, FILTER, and so on) should be used. An example of a logical plan is a tree-based plan, where the leaves of the tree correspond to the subqueries to be executed, and internal nodes represent the operators.

The logical plan is translated into a physical plan (query execution), which executes the query practically over the selected data sources. The outputs of the single subqueries are finally reconciled to build the final result (query reconciliation).

Running in the Cloud

You can run Presto in a cluster deployed by your company, or you can use an existing cloud service. There are many cloud offerings for running Presto, including Amazon Elastic MapReduce (EMR) and Google Dataproc. Other vendors, such as IBM, offer Presto as part of an open data lakehouse offering that makes it easier to set up and operate multiple Presto clusters for different use cases.

Presto Architecture and Core Components

Figure 1-5 shows the Presto architecture, which is deployed as two main services: a single coordinator and many workers. The coordinator service is effectively the brain of the operation, receiving query requests from clients, parsing the query, building an execution plan, and then scheduling work to be done across many worker services. The coordinator contains three main components: the parser, the planner, and the scheduler.

Each worker processes a part of the overall query in parallel, and you can add worker services to your Presto deployment to fit your demand. Each data source is configured as a catalog, and you can query as many catalogs as you want in each query.

Figure 1-5. The Presto architecture (the arrows show the data transfer flow)
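Because every data source is exposed as a catalog, tables are addressed with the fully qualified form catalog.schema.table. A quick sketch, with hypothetical catalog, schema, and table names:

    -- List the catalogs configured on the cluster
    SHOW CATALOGS;

    -- Query a table in one of those catalogs using its fully qualified name
    SELECT *
    FROM hive.web.page_views
    LIMIT 10;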

You can configure Presto in three different ways:

With only one data source

Users can query a single data source with Presto. In this case, Presto acts as a separate query engine that uses the metadata from an external catalog and processes the data stored in the data lake.

With multiple data sources queried independently

Because Presto is a federated engine, many deployments connect it to multiple data sources. This allows end users to query one data source at a time through the same interface, without having to switch between systems or treat them as different data systems.

With multiple data sources, correlated and queried together

Taking federation a step further, a query can combine data from two or more data sources. This allows end users to analyze more data without having to move or copy it into a single data source.

Table 1-2 outlines the benefits provided by the different configurations. Single sources only provide fast analytics. Presto configured with multiple data sources, each queried independently, gives you fast, federated analytics. Finally, if Presto is configured with multiple data sources, correlated and queried together, it gives you fast, federated, unified analytics.

Table 1-2. Benefits provided by the different configurations of Presto

Single source: fast analytics
Multiple sources queried independently: fast analytics, federated analytics
Multiple sources queried together: fast analytics, federated analytics, unified analytics

Alternatives to Presto

Many alternatives to Presto have been proposed by the research community. Research has focused on building fast, scalable, distributed query engines able to deal with big data. Given the constant growth of big data, solutions like Presto and its alternatives have become very attractive to industry.

Apache Impala

Apache Impala, originally developed by Cloudera, is a distributed SQL query engine for Apache Hadoop. You can use Apache Impala for medium-sized datasets, although it does not support some SQL operations, such as UPDATE and DELETE. Apache Impala is supported by Amazon Web Services and MapR.

Apache Hive

Apache Hive is data warehouse software for managing large datasets that are queried using SQL syntax. Built on Apache Hadoop, Hive supports different data formats, such as comma- and tab-separated values (CSV/TSV) text files, Apache Parquet, and more. You can extend Hive with a custom connector to support other data formats. You can also use Hive with Presto.

Spark SQL

Spark SQL is a module built for Apache Spark to work with structured data. You can also use it as a distributed SQL query engine, and you can integrate it with the rest of the Spark modules.

Spark Versus Presto

Spark and Presto manage stages differently. In Spark, data needs to be fully processed before passing it to the next stage. Presto uses a pipeline processing approach and doesn’t need to wait for an entire stage to finish.

Trino

When the founders of the Presto project left Facebook in 2018, the original project, PrestoDB (described in this book), was forked into a separate project, called PrestoSQL. In 2021, PrestoSQL was rebranded as Trino.

Similar to Presto, Trino aims at running fast and federated queries, without copying and moving data from sources to temporary storage.

Presto Use Cases

Presto was originally designed for interactive analytics and ad hoc querying. With the evolution of technology and the availability of near-real-time data, the number of use cases where Presto is applied has increased. In this section, you’ll see the most popular use cases for Presto.

Reporting and Dashboarding

Today's interactive reporting and dashboards are very different from the first-generation static versions. Analysts, data scientists, product managers, marketers, and other users not only want to look at Key Performance Indicators (KPIs), product statistics, telemetry data, and other data, but they also want to drill down into specific areas of interest or areas where opportunity may lie.

Presto gives users the ability to query data across sources on their own so they're not dependent on data platform engineers. It also greatly simplifies data platform engineers' tasks by providing them with a single endpoint for many reporting and dashboarding tools, including Tableau, Grafana, Apache Superset, and many more.

Ad Hoc Querying

Engineers, analysts, data scientists, and product managers can customize their queries either manually or using a range of visualization, dashboarding, and Business Intelligence (BI) tools. Depending on the tools chosen, they can run many complex concurrent queries against a Presto cluster. With Presto, they can iterate quickly on innovative hypotheses with the interactive exploration of any dataset, residing anywhere.

ETL Using SQL

Analysts can aggregate terabytes of data across multiple data sources and run efficient ETL (extract, transform, and load) queries against that data with Presto. Instead of legacy batch processing systems, you can use Presto to run resource-efficient and high-throughput queries.
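A SQL-based ETL step often boils down to a CREATE TABLE AS SELECT statement that reads from one catalog, transforms the data, and writes the result to the data lake. Here is a minimal sketch, with hypothetical catalogs, schemas, and tables; writing requires a connector that supports table creation (the Hive connector, for example):

    -- Read from an operational database, aggregate, and materialize the result
    -- as a new table in the data lake
    CREATE TABLE hive.analytics.daily_revenue AS
    SELECT order_date, SUM(order_total) AS revenue
    FROM mysql.sales.orders
    GROUP BY order_date;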

Running queries in batch ETL jobs is much more expensive in terms of data volume and CPU than running interactive jobs. Because the ETL clusters tend to be much bigger, some companies separate their Presto clusters into two groups: one for ETL and the other for ad hoc queries. This is operationally advantageous because both clusters use the same Presto technology and require the same skills.

ETL Versus ELT

ETL (extract, transform, and load) differs from ELT (extract, load, and transform), although the performed operations are the same in both processes. The difference is not simply the order of the operations, because usually they are performed in parallel. Instead, the difference is where the transformation of data is performed. In an ETL system, transformation is performed in a staging area of the data warehouse, while in an ELT system, transformation is performed directly in the backend data warehouse.

Data Lakehouse

A data lake enables you to store all your structured and unstructured data as is and run different types of analytics on it. A data lakehouse supports SQL workloads as well as other, non-SQL workloads (e.g., machine learning on unstructured data). Presto handles the SQL workloads: you can use it to run SQL queries directly on your data lake without moving or transforming the data.
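For instance, with the Hive connector you can expose files that already live in object storage as a table and query them in place. The sketch below is heavily assumption-laden: the hive catalog, the lake schema, the bucket path, and the column list are all hypothetical, and the exact table properties may vary with your Presto version and connector configuration:

    -- Register existing Parquet files in the data lake as a queryable table
    CREATE TABLE hive.lake.events (
      event_id BIGINT,
      event_type VARCHAR,
      event_time TIMESTAMP
    )
    WITH (
      format = 'PARQUET',
      external_location = 's3://my-data-lake/events/'
    );

    -- Query the data where it lives, without moving or transforming it
    SELECT event_type, COUNT(*) AS occurrences
    FROM hive.lake.events
    GROUP BY event_type;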

Real-Time Analytics with Real-Time Databases

Real-time analytics usually involves combining data that is being captured in real time with historical or archived data. Imagine that an ecommerce site uses two stores: the first is, for example, an Amazon S3 bucket that holds your past activity, and the second is an Apache Pinot real-time store that holds your real-time activity, such as the contents of your cart.

Also, imagine that your current session activity is moved from the real-time store to the historical archive at regular intervals, so at a given instant your current session activity may not have made it into S3 yet. By using Presto to merge data across both systems, the website could offer you real-time incentives so you don't abandon your cart, or it could detect possible fraud earlier and with greater accuracy.
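Here is a sketch of such a cross-system query, assuming a pinot catalog over the real-time store and a hive catalog over the S3 archive; all schema, table, and column names are hypothetical:

    -- Combine real-time cart activity with archived purchase history
    SELECT c.customer_id,
           c.cart_value,
           h.lifetime_orders
    FROM pinot.default.current_carts AS c
    LEFT JOIN hive.archive.customer_history AS h
      ON c.customer_id = h.customer_id
    WHERE c.cart_value > 100;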

Introducing Our Case Study

You can use Presto in different scenarios, including data-mining analytics, high-performance business intelligence analytics, and real-time monitoring. To show Presto’s capabilities, we have built a fictional scenario that we’ll use throughout the book, whenever possible.

Imagine that an ecommerce company offers a service that sells or distributes products worldwide, such as books, clothes, and other items. To represent this service, we will use the Transaction Processing Performance Council Benchmark H (TPC-H) database, which is fully compatible with Presto.

The TPC-H database defines eight tables, as illustrated in Figure 1-6. The arrows in the figure show the relationships among the tables. The TPC-H database defines a generic ecommerce scenario that you can use for testing purposes, with a variable amount of data.

Figure 1-6. The TPC-H database
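Because Presto ships with a TPC-H connector, you can explore this schema without loading any data, provided the connector is configured as a catalog (assumed here to be named tpch, using its small tiny schema):

    -- List the eight TPC-H tables
    SHOW TABLES FROM tpch.tiny;

    -- Count customers per nation using two of the related tables
    SELECT n.name AS nation, COUNT(*) AS num_customers
    FROM tpch.tiny.customer AS c
    JOIN tpch.tiny.nation AS n
      ON c.nationkey = n.nationkey
    GROUP BY n.name
    ORDER BY num_customers DESC;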

Figure 1-7 shows the architecture of our case study. There are two main data sources:

Real-time customer activity

Defines the real-time events generated by the customer, such as the products in the cart

Products database

Contains the products catalog

Figure 1-7. The architecture of our case study

Both data sources are ingested in the data lake. Periodically, the real-time customer activity is moved to a specific segment of the data lake, called archived activity. Presto accesses the data lake to perform different types of queries. At the top layers are the reporting, dashboarding, and data mining services.

Conclusion

In this chapter, you learned what Presto is and its role in a data lake. Presto is a parallel distributed SQL query engine for querying any data format: structured, semi-structured, or unstructured. You can use Presto to run queries in the context of big data, using ANSI SQL, a single standard and well-known language. You can also use Presto in scenarios requiring high performance and high scalability, and you can run it in the cloud.

You now know the basic concepts behind Presto, including its architecture, how it runs federated queries, and its use cases. You should now be ready to take the next step: installing Presto and running your first scenario.

In Chapter 2, you will learn how to start with Presto, focusing on how to install and configure Presto using Docker and Kubernetes.

1 Kemele M. Endris, Maria-Esther Vidal, and Damien Graux, “Chapter 5, Federated Query Processing,” in Knowledge Graphs and Big Data Processing, ed. Valentina Janev, Damien Graux, Hajira Jabeen, and Emanuel Sallinger (Springer, 2020), https://oreil.ly/p7KFC.
