Chapter 4. Client Connectivity

A Presto client is a process that queries Presto and displays the query results for purposes such as data analytics and ad hoc queries. Presto provides a variety of client interfaces and libraries for different programming languages, including a REST API, R, Python, JDBC, Node.js, and ODBC.

This chapter is organized into two parts. In the first part, you’ll learn how to deploy a Presto client. Although you can deploy a Presto client as a separate process running on an existing cluster node, in this chapter you’ll implement the client on a different node to simulate an external application accessing the cluster. We’ll focus on implementing clients that use the REST API, Python, R, JDBC, Node.js, and ODBC. You can easily extend the described procedure to the other programming languages supported by Presto.

In the second part, you’ll implement a practical web service in Python, querying Presto and showing the results in a simple dashboard. The goal of this example is to illustrate the potential of a Presto client.

Setting Up the Environment

To add a Presto client as a node of the Presto cluster implemented in Chapter 3, first you must set up the environment by deploying the following three components:

  • Presto client

  • Docker image that contains your Presto client

  • Kubernetes node that runs the Docker image in the Presto cluster

The code used in this section is available in the 04 directory of the book’s repository, which contains the configuration files for all the described Presto clients.

Presto Client

The Presto client connects to Presto, runs queries, and shows the results. You can write your Presto client in your preferred language. Save the code into a directory that will be mounted in the Kubernetes node as a separate volume. Optionally, write a web application or a service that permits access over HTTP. We’ll see how to deploy a client web service later in this chapter. Alternatively, use the terminal to access your Presto client.

Docker Image

The Docker image contains the environment needed to run your Presto client. Use the following template for the Dockerfile:

FROM <basic_image>
RUN apt-get update && \
<install required libraries>

For example, to build the image for the R client, write the following Dockerfile:

FROM r-base
RUN apt-get update && \
apt-get install -y build-essential libcurl4-gnutls-dev libxml2-dev libssl-dev
RUN R -e "install.packages('RPresto')"

To build your Docker image, run the following command in a terminal:

docker build -t client-presto .

Kubernetes Node

To deploy a Kubernetes node in the Presto cluster, write a .yaml configuration file that simply wraps the Docker image:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-client
  labels:
    app: my-client
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-client
  template:
    metadata:
      labels:
        app: my-client
    spec:
      containers:
      - name: my-client
        image:  client-presto:latest
        imagePullPolicy: Never
        command: [ "sleep" ]
        args: [ "infinity" ]

The command sleep infinity makes sure that Kubernetes keeps your client alive. Alternatively, if your client runs a service, modify the configuration code to deploy the service.

To run custom scripts in your Docker container, mount a local volume:

spec:
  containers:
  - name: client
    # add details based on the client type
    volumeMounts:
    - name: scripts
      mountPath: "/path/in/the/container"

  volumes:
  - name: scripts
    hostPath:
      path: "/path/to/scripts/in/the/host"

Deploy the client by running the following command:

kubectl apply -f client.yaml --namespace presto

Once you have deployed your client, log in to it and run the scripts in the mounted directory.

Mounting Warning

Mount your code with the volumeMounts option only while you’re developing it. In a production environment, wrap the implemented app directly in the Docker image instead of mounting it from a host directory in your Kubernetes .yaml file.

Connectivity to Presto

This section gives an overview of how to use the existing client libraries to set up a connection to the Presto cluster. We’ll use the different programming languages supported by Presto. Download the running example in your preferred language from the book’s GitHub repository, under the directory 04/<language>_client. Feel free to skip the languages that don’t interest you.

REST API

The Presto coordinator includes an HTTP server that exposes a REST API. All communication between the client and the Presto coordinator uses the HTTP protocol. To run a Presto client through the REST API, specify at least the URL of the Presto coordinator, the query, and a minimal set of headers that includes the user, the catalog, and the schema to query:

curl -d "SELECT * FROM customer LIMIT 5" \
 --request POST \
 -s "presto-coordinator:8080/<endpoint>" \
 -H "X-Presto-User: client" \
 -H "X-Presto-Catalog: tpch" \
 -H "X-Presto-Schema: sf1"

Specify the user with -H "X-Presto-User: client". By default, Presto accepts any user, so you can use any name. In Chapter 7, you’ll see how to configure specific users in Presto.

If the call to the REST API is successful, Presto returns a JSON object where the nextUri key is set. Run another call to Presto using the value corresponding to the nextUri key as the URL. You’ll see an example for this later in this section.

The REST API provides you with many endpoints, such as the following:

/v1/node

To get information about a node.

/v1/statement

To run queries.

/v1/query

To get metrics on the most recent queries. Optionally, specify the query ID to get information about a specific query.

/v1/thread

To get information about a thread.

/v1/task

To get information about a task.

Repeat the call to Presto, each time using the URL from the nextUri key, until nextUri is absent from the response JSON object. If the last call is successful, the returned object contains the query results as the value of the data key:

curl -s "URL contained in nextUri"

The following code shows a complete example of a call to the REST API in Python:

  1. Run the first call to the REST API:

    import requests
    import time
    
    url = "http://presto-coordinator:8080/v1/statement"
    
    headers = {
        "X-Presto-User": "client",
        "X-Presto-Catalog": "tpch",
        "X-Presto-Schema": "sf1"
    }
    
    sql = "SELECT * FROM customer LIMIT 5"
    
    resp = requests.post(url, headers=headers, data=sql)
    json_resp = resp.json()
  2. Iterate over the nextUri key until it’s absent. Use time.sleep(0.5) to wait for data. Increase this value if your client does not receive any data.

    while 'nextUri' in json_resp:
     time.sleep(0.5)
     new_url = json_resp['nextUri']
     resp = requests.get(new_url)
     json_resp = resp.json()
  3. Get the final data:

    data = json_resp['data']
    for row in data:
        print(row)
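Depending on the query, Presto may return rows across several responses rather than only in the last one. A slightly more defensive variant of the loop above, shown here as a sketch that reuses url, headers, sql, and json_resp from the previous steps (it is not taken from the book’s repository), accumulates the data key while polling:

rows = list(json_resp.get('data', []))
while 'nextUri' in json_resp:
    time.sleep(0.5)
    # Follow nextUri and collect any rows included in this response
    resp = requests.get(json_resp['nextUri'])
    json_resp = resp.json()
    rows.extend(json_resp.get('data', []))

for row in rows:
    print(row)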

The Presto REST API provides many additional parameters. Read the Presto documentation for more details.

Python

Install the presto-python-client library by adding the following command to your Dockerfile:

RUN pip install presto-python-client

Then, connect to Presto as follows:

import prestodb
conn=prestodb.dbapi.connect(
    host='presto-coordinator',
    port=8080,
    user='client',
    catalog='tpch',
    schema='sf1',
)

To run a query, retrieve the cursor:

cur = conn.cursor()
cur.execute('SELECT * FROM customer LIMIT 5')
rows = cur.fetchall()
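The cursor follows the Python DB-API, so if you also want the column names, you can read them from cur.description, as in this short sketch:

# The first field of each cur.description entry is the column name
columns = [col[0] for col in cur.description]
for row in rows:
    print(dict(zip(columns, row)))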

R

Install the RPresto library. In your Dockerfile, run the following command:

RUN R -e "install.packages('RPresto')"

To query Presto in R, first import the required libraries:

library(RPresto)
library(DBI)

Then connect to the Presto server:

con <- DBI::dbConnect(
  drv = RPresto::Presto(),
  host = "presto-coordinator",
  port = 8080,
  user = "r-client",
  catalog = "tpch",
  schema = "sf1"
)

To run a query, use the DBI::dbGetQuery() function:

DBI::dbGetQuery(con, "SELECT * FROM customer LIMIT 5")

JDBC

Use the Presto JDBC driver com.facebook.presto.jdbc.PrestoDriver. If you use Maven, add the following dependencies to your pom.xml file:

<dependencies>
 <dependency>
  <groupId>com.facebook.presto</groupId>
  <artifactId>presto-jdbc</artifactId>
  <version>0.276</version>
 </dependency>
</dependencies>

If you don’t use Maven, download the presto-jdbc-0.276.jar file from the Presto website and add it to the classpath of your Java application. Replace the Presto version with the most recent version of Presto (0.276 at the time of writing this chapter).

In your main Java application, define the URL to the Presto server and connect to Presto through a Connection:

import java.sql.DriverManager;
import java.sql.Connection;
import java.sql.Statement;
import java.sql.ResultSet;
import java.sql.SQLException;

String url = "jdbc:presto://presto-coordinator:8080/tpch/sf1";
Connection connection = null;

try {
 Class.forName("com.facebook.presto.jdbc.PrestoDriver");

 connection = DriverManager.getConnection(url, "test", null);
}
catch(SQLException e){
 e.printStackTrace();
}
catch(ClassNotFoundException e){
 e.printStackTrace();
}

In the URL string, specify the catalog (tpch) and the schema (sf1). Then run your queries:

String sql = "SELECT * FROM customer LIMIT 5";

try{
 Statement statement = connection.createStatement();
 ResultSet rs = statement.executeQuery(sql);
 while (rs.next()) {
     System.out.println(rs.getString(1));
 }
}
catch(SQLException e){
 e.printStackTrace();
}

The example prints the first column of each row in the query’s result set.

Node.js

Install the presto-client library:

npm install -g presto-client

In your main JavaScript application, import the library and create a new Client() object:

var presto = require('presto-client');
var client = new presto.Client({
 user: 'myuser',
 'host': 'http://presto-coordinator',
 port: '8080'
});

Then execute any query:

client.execute({
  query:   'SELECT * FROM customer LIMIT 5',
  catalog: 'tpch',
  schema:  'sf1',
  source:  'nodejs-client',
  data:    function(error, data, columns, stats){ console.log(data); },
  success: function(error, stats){console.log(stats);},
  error:   function(error){console.log(error);}
});

ODBC

Open Database Connectivity (ODBC) is a standard database access method that enables applications to access data in a variety of database formats. Many database vendors, such as Microsoft, Oracle, and IBM, provide proprietary ODBC drivers, so an application can connect to any ODBC-compliant database through the same interface, regardless of the database’s underlying structure or format.

To implement a Presto ODBC client, first you must install an ODBC driver. In this book, we use CData.

To make the driver work, first install the unixodbc, unixodbc-dev, and tdsodbc libraries. For example, in Ubuntu, run the following command:

apt-get install -y unixodbc unixodbc-dev tdsodbc

Next, download the CData driver for Presto (Unix version), install the package, and activate the license:

dpkg -i PrestoODBCDriverforUnix.deb
cd /opt/cdata/cdata-odbc-driver-for-presto/bin/
./install-license.sh

Then install the client library in your preferred language to access Presto through the ODBC driver. For example, in Python, install pyodbc:

pip install pyodbc

In your ODBC client application, connect to the Presto database:

import pyodbc

conn = pyodbc.connect("DRIVER=CData ODBC Driver for Presto;\
server=presto-coordinator;\
Port=8080;user=client;\
catalog=tpch;schema=sf1")

Then execute any query:

cur = conn.cursor()
cur.execute('SELECT * FROM customer LIMIT 5')
rows = cur.fetchall()
print(rows)

Other Presto Client Libraries

Presto supports many other languages, including C, Go, PHP, and Ruby. The procedure to build your client in any of these languages is essentially the same. Refer to the Presto GitHub profile to download the client library for a specific language; for example, the Go client is available in its own repository under the Presto GitHub profile.

Building a Client Dashboard in Python

This section describes a practical example of how to implement, from scratch, a simple dashboard that queries Presto in Python. The dashboard is a web application that connects to Presto, runs a sample query, and shows the output of the query in two separate graphs.

The code described in this section is available in the book’s GitHub repository, under the directory 04/client-app.

Setting Up the Client

The dashboard runs as a Kubernetes node, which exposes an HTTP service to access the dashboard. We’ll implement the dashboard using the following Python libraries:

Streamlit

Transforms Python scripts into web applications very quickly

Altair

A Python library for data visualization

Pandas

A popular Python library for dataset manipulation

To configure the client, perform the following steps:

  1. Write the Dockerfile that copies the app code to the Docker image, installs the required libraries, exposes the Streamlit listening port, and runs the app:

    FROM python:3.8.6
    WORKDIR /app
    
    COPY app .
    
    RUN pip install --upgrade pip
    RUN pip install presto-python-client
    RUN pip install streamlit altair pandas
    
    EXPOSE 8501
    
    CMD ["./run.sh"]

    The run.sh script simply launches the Streamlit server:

    #!/bin/bash
    streamlit run app.py --browser.serverAddress 0.0.0.0
  2. Build the Docker image by running the following command from the directory where the Dockerfile is located:

    docker build -t client-app .
  3. Wrap the client-app Docker image into a Kubernetes node, using the procedure that you have already followed for the Python client. Do not mount an external volume since the application is already available from the Docker image. In addition, make the web server port available for external access.

    spec:
     containers:
     - name: client-app
       image:  client-app:latest
       ...
       ports:
         - containerPort: 8501
  4. Build a service for the web application:

    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: client-service
      labels:
        app: client-service
    spec:
      ports:
        - port: 8501
          protocol: TCP
          targetPort: 8501
      type: LoadBalancer
      selector:
        app: client-app

    The service exposes the same port as the original web server. In addition, it uses a LoadBalancer to make the service available outside of the Kubernetes cluster.

  5. Deploy the client-app node in the presto cluster:

    kubectl apply -f client-app.yaml --namespace presto
  6. Finally, open your browser and point to http://localhost:8501 to access the web application.

Building the Dashboard

The client dashboard is a web server that connects to Presto, runs a sample query, and shows the outputs as graphs. The objective of this app is to demonstrate the potential of building a client app rather than to provide a complete data analytics example.

The code of the client dashboard is available under 04/client-app/app of the GitHub repository for the book. The code is organized into a single script named app.py, which runs the web server as a Streamlit web server. In detail, the app performs the following operations:

  • Connecting to and querying Presto

  • Preparing the results of the query

  • Building the first graph

  • Building the second graph

Connecting to and querying Presto

First, the app imports all the required libraries:

import prestodb
import altair as alt
import pandas as pd
import streamlit as st

Then the app connects to Presto using the prestodb library. Use the code described in “Python” to perform the connection.
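For reference, a minimal connection and cursor setup, assuming the same parameters used earlier in this chapter (the repository code may differ slightly), looks like this:

# A minimal sketch of the connection used by the dashboard; host, user,
# catalog, and schema mirror the earlier "Python" section.
# Switch to schema='tiny' on small machines (see the Resources Warning below).
conn = prestodb.dbapi.connect(
    host='presto-coordinator',
    port=8080,
    user='client',
    catalog='tpch',
    schema='sf1',
)
cur = conn.cursor()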

Resources Warning

If your local machine does not have enough resources to manage a big dataset, use the tiny schema from the TPC-H catalog when you specify the connection parameters to Presto. The tiny schema contains about 10,000 sample data items.

The app runs a query that counts the number of items shipped on each date in the TPC-H catalog and orders the results from the oldest shipping date to the newest:

query = """SELECT
     count(*) as nitems,
     shipdate
 FROM
     lineitem
 GROUP BY shipdate
 ORDER BY shipdate ASC
        """

Preparing the results of the query

The app stores the result of the query in a Pandas DataFrame:

cur.execute(query)
df = pd.DataFrame(cur.fetchall(), columns=['nitems', 'shipdate'])

To build the final graphs, the app extracts some information from shipdate:

df['shipdate'] = pd.to_datetime(df['shipdate'])
df['dayofweek'] = df['shipdate'].dt.day_name()
df['year'] = df['shipdate'].dt.year
df['month'] = df['shipdate'].dt.month_name()
df['weekofyear'] = df['shipdate'].dt.isocalendar().week

The extracted information includes the day of the week (from Sunday to Saturday), year, month, and week of the year (a progressive number from 1 to 53).

Building the first graph

The first graph focuses on 1992 and shows, for each week of the year, a line with the number of shipped items organized by day of the week:

df_1992 = df[df['year'] == 1992 ]

The graph is composed of two parts. The first part of the graph shows a line for each week of the year:

days_of_weeks = ['Sunday', ... ,'Saturday']
bar = alt.Chart(df_1992).mark_line().encode(
    x = alt.X('dayofweek:O',
      sort=days_of_weeks,title=''
    ),
    y = alt.Y('nitems:Q', title='nitems'),
    color = alt.Color('weekofyear:N',
      scale=alt.Scale(range=['#C8C8C8']),
      legend=None
    )
)

The second part of the graph shows the average value of the number of shipped items over all the weeks:

mean = alt.Chart(df_1992).mark_line(color='black').encode(
    x = alt.X('dayofweek:O',
      sort=days_of_weeks,
      title=''
    ),
    y = alt.Y('weekly_count:Q', title=''),
).transform_aggregate(
    weekly_count = 'mean(nitems)',
    groupby=['dayofweek']
)

Finally, the app combines the two parts to build the final graph:

chart1 = (bar + mean).properties(
    width=300,
    height=300,
    title='Number of shipped items in 1992'
)
st.altair_chart(chart1)

We use the st.altair_chart() function to render the graph in the Streamlit app. Figure 4-1 shows the resulting graph.

Figure 4-1. Number of shipped items in 1992 organized by week

Building the second graph

The second graph shows a calendar heatmap, with years on the x-axis and months on the y-axis:

# Order months chronologically on the y-axis (assumed intent; the repository
# code may sort differently)
months_of_year = ['January', 'February', 'March', 'April', 'May', 'June',
                  'July', 'August', 'September', 'October', 'November', 'December']

chart2 = alt.Chart(df).mark_rect().encode(
    x = alt.X('year:O'),
    y = alt.Y('month:O',
      sort=months_of_year
    ),
    color = alt.Color(
      'nitems:Q',
      scale=alt.Scale(range=['#F5F5F5','#000000'])
    )
).properties(
    width=300,
    height=300
)

st.altair_chart(chart2)

Figure 4-2 shows the resulting graph.

Figure 4-2. A calendar heatmap of number of shipped items

Conclusion

In this chapter, you learned how to deploy a Presto client using different interfaces and languages, including the REST API, Python, R, JDBC, Node.js, and ODBC. The procedure is the same in all cases: build the Docker image, wrap it in a Kubernetes node, and run the node to execute queries against Presto.

You also implemented a client web service to Presto, showing a simple summary report. The web service queried the TPC-H catalog made available by Presto and showed some statistics related to the number of items by ship date.

In Chapter 5, you’ll learn how to perform open data lakehouse analytics.
