Chapter 4. Extending Presto: Building a Presto Connector
This chapter will explore what it takes to build a connector for Presto. A connector is the heart of Presto! It is a type of plugin that enables Presto to interact with external systems for reading and writing data. As we’ll see, a connector is responsible for exposing table metadata to Presto (schemas, tables, and column definitions) and for mapping data from the external system to Presto data types (most of which are Java primitives).
We’ll be using the Example HTTP Connector (https://prestodb.io/docs/current/develop/example-http.html) to discuss the components that make up a connector. This is a basic connector that demonstrates the various classes you must implement to read data into Presto. You can look at the source code on the PrestoDB GitHub (https://github.com/prestodb/presto), under the presto-example-http module. Note that this chapter was written shortly after version 0.244 was released, so the source code may have changed since then. You can view the source code for this specific version at https://github.com/prestodb/presto/tree/0.244.
At a high level, there are four major components that make up a connector:
Plugin and Module
Configuration
Metadata
Input/Output
We’ll now explore how these components break down into individual classes.
Plugin and Module
This specifies all of the top-level classes that make up your connector, allowing Presto to initialize your catalog.
The ExamplePlugin.java class implements the Plugin interface, which tells Presto what features this plugin provides. Since this example is a connector, it only implements the getConnectorFactories function, which returns the ExampleConnectorFactory. If this plugin had additional functionality, such as user-defined functions (UDFs), event listeners, or access control, you could implement other functions here to expose those implementations to Presto.
The connector factory class, ExampleConnectorFactory, tells Presto the name of the connector, example-http, as well as the handle resolver and the factory function that creates the connector implementation using Google Guice. The connector name uniquely identifies this connector in a Presto installation. You’ll recall configuring Presto using catalog property files in Chapter 7. The value you specify there for “connector.name” is this name, and it tells Presto which connector to create for the catalog.
The HandleResolver tells Presto what classes to use for this connector—more on that later.
The meat of this class is the implementation in the create function. Presto uses Google Guice for dependency injection. Explaining this project (and dependency injection in general) is out of scope for this book, but in short, you define the Module the connector requires (which we will see next), then create an instance of it using the injector. Guice analyzes the module and what classes are needed, creating instances as necessary. Guice reports any missing dependencies on startup (and the program crashes). See the Guice project documentation for more details.
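To make the relationship between the plugin, the factory, and the connector name concrete, here is a minimal sketch. It does not use the Presto SPI or Guice: the Plugin, ConnectorFactory, and Connector interfaces below are simplified stand-ins, the Sketch classes are hypothetical, and the single dependency is wired by hand where the real factory would build a Guice injector. The metadata-uri property name and the example.com URL are assumptions for illustration.

```java
import java.util.List;
import java.util.Map;

class PluginSketch {
    // Mimics how an engine could pick a factory using the connector.name value.
    static Connector createCatalog(Plugin plugin, String catalogName, Map<String, String> props) {
        String wanted = props.get("connector.name");
        for (ConnectorFactory factory : plugin.getConnectorFactories()) {
            if (factory.getName().equals(wanted)) {
                return factory.create(catalogName, props);
            }
        }
        throw new IllegalArgumentException("No factory for connector " + wanted);
    }

    public static void main(String[] args) {
        Map<String, String> props = Map.of(
                "connector.name", "example-http",
                "metadata-uri", "http://example.com/metadata.json");
        SketchConnector connector =
                (SketchConnector) createCatalog(new SketchPlugin(), "mycatalog", props);
        System.out.println(connector.catalogName + " -> " + connector.metadataUri);
    }
}

// Simplified stand-ins for the SPI interfaces; the real ones have more methods.
interface Connector {}

interface ConnectorFactory {
    String getName();

    Connector create(String catalogName, Map<String, String> config);
}

interface Plugin {
    Iterable<ConnectorFactory> getConnectorFactories();
}

// Hypothetical connector that only remembers the settings it was built with.
class SketchConnector implements Connector {
    final String catalogName;
    final String metadataUri;

    SketchConnector(String catalogName, String metadataUri) {
        this.catalogName = catalogName;
        this.metadataUri = metadataUri;
    }
}

class SketchConnectorFactory implements ConnectorFactory {
    @Override
    public String getName() {
        // Matched against connector.name in the catalog properties file.
        return "example-http";
    }

    @Override
    public Connector create(String catalogName, Map<String, String> config) {
        // Hand-wired here; fails fast if the required setting is missing,
        // similar in spirit to Guice reporting a missing dependency.
        String uri = config.get("metadata-uri");
        if (uri == null) {
            throw new IllegalArgumentException("metadata-uri is required");
        }
        return new SketchConnector(catalogName, uri);
    }
}

class SketchPlugin implements Plugin {
    @Override
    public Iterable<ConnectorFactory> getConnectorFactories() {
        return List.of(new SketchConnectorFactory());
    }
}
```

The createCatalog helper is only a guess at the shape of the lookup the engine performs; the point is that the string returned by getName is what ties a catalog file to a factory.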
The ExampleModule class is used to configure Guice, defining all the classes the connector needs. You’ll see that the Guice Binder binds the various classes used by this connector. Any class that relies on Guice for injection (which can be spotted by the @Inject annotation) needs to be configured here.
The ExampleConnector class implements the Connector interface and lays out the connector’s supported features. You will need to return a transaction handle, the metadata, the split manager, and typically a RecordSetProvider to enable reading from tables. There is also PageSourceProvider, which is another interface for providing pages of data to Presto (vs. individual records), and the PageSinkProvider for writing data. You can implement other functions for more advanced features, such as supporting partitioned tables or query plan optimizations.
The ExampleHandleResolver class implements the ConnectorHandleResolver interface and is used by Presto to determine class names for various pieces of its execution pipeline. You should implement this class and return the handles necessary to fit your connector’s feature set. At a minimum, you must provide an implementation of a ConnectorTableHandle, ConnectorTableLayoutHandle, ColumnHandle, and ConnectorSplit. These classes are explained in more detail below. If you add support for writing data to your external system (commands like INSERT or CREATE TABLE AS SELECT), then you would implement getOutputTableHandleClass (among other functions) to provide the output table handle. The handle classes themselves are flexible as far as what fields they have: if you look at the interface definitions, they are (mostly) empty. The handle resolver is all about telling the Presto engine which classes you are using for the handles.
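A handle resolver can be sketched as follows. The four marker interfaces and the resolver interface are simplified stand-ins that carry the same names as their SPI counterparts (the real resolver has additional functions), and every Sketch class and its fields are hypothetical.

```java
class HandleResolverSketch {
    public static void main(String[] args) {
        ConnectorHandleResolver resolver = new SketchHandleResolver();
        System.out.println(resolver.getTableHandleClass().getSimpleName());
        System.out.println(resolver.getSplitClass().getSimpleName());
    }
}

// Simplified stand-ins for the (mostly empty) SPI marker interfaces.
interface ConnectorTableHandle {}

interface ConnectorTableLayoutHandle {}

interface ColumnHandle {}

interface ConnectorSplit {}

// Stand-in for the resolver interface, reduced to the four required handles.
interface ConnectorHandleResolver {
    Class<? extends ConnectorTableHandle> getTableHandleClass();

    Class<? extends ConnectorTableLayoutHandle> getTableLayoutHandleClass();

    Class<? extends ColumnHandle> getColumnHandleClass();

    Class<? extends ConnectorSplit> getSplitClass();
}

// Hypothetical handles; the fields are whatever the connector needs later.
class SketchTableHandle implements ConnectorTableHandle {
    final String schemaName;
    final String tableName;

    SketchTableHandle(String schemaName, String tableName) {
        this.schemaName = schemaName;
        this.tableName = tableName;
    }
}

class SketchTableLayoutHandle implements ConnectorTableLayoutHandle {
    final SketchTableHandle table;

    SketchTableLayoutHandle(SketchTableHandle table) {
        this.table = table;
    }
}

class SketchColumnHandle implements ColumnHandle {
    final String columnName;
    final int ordinalPosition;

    SketchColumnHandle(String columnName, int ordinalPosition) {
        this.columnName = columnName;
        this.ordinalPosition = ordinalPosition;
    }
}

class SketchSplit implements ConnectorSplit {
    final String uri;

    SketchSplit(String uri) {
        this.uri = uri;
    }
}

// The resolver does nothing but name the concrete classes.
class SketchHandleResolver implements ConnectorHandleResolver {
    @Override
    public Class<? extends ConnectorTableHandle> getTableHandleClass() {
        return SketchTableHandle.class;
    }

    @Override
    public Class<? extends ConnectorTableLayoutHandle> getTableLayoutHandleClass() {
        return SketchTableLayoutHandle.class;
    }

    @Override
    public Class<? extends ColumnHandle> getColumnHandleClass() {
        return SketchColumnHandle.class;
    }

    @Override
    public Class<? extends ConnectorSplit> getSplitClass() {
        return SketchSplit.class;
    }
}
```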
Configuration includes classes for specifying catalog configurations, such as external database URLs and credentials, as well as session properties. There are three kinds of classes you will typically see related to configuration: connector, session, and table. Connector properties provide static information used by your connector, typically connection details such as URIs and login information, along with flags that enable specific features of your connector. Session properties are specified by a user for each client session, and are generally used to tweak configuration settings based on the types of queries a user wants to run or to enable experimental features. Finally, table properties are attached to a specific table of your external system, such as how the table is partitioned.
The example connector only has one kind of these classes, ExampleConfig.java, for connector configuration values (which isn’t very exciting). Let’s look at RedisConnectorConfig.java instead. You’ll see lots of private member variables set to their default values, followed by annotated setters and getters for each variable. The @Config annotation on a setter specifies the property name, while other annotations on the getters enforce restrictions on the values, such as @NotNull to require that the configuration value is not null or @Size(min = 1) to require that the value contains at least one element. Take a look at other connector config classes (as well as the annotations in the javax.validation.constraints package) for more examples. These configuration properties are set in your catalog configuration file for Presto, e.g., etc/catalog/example-http.properties.
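The shape of such a config class can be sketched without the configuration framework. In the sketch below the annotations appear only as comments, and a hand-written load function and validate method stand in for the work the framework does. SketchConfig, the request-timeout-seconds property, and the example.com URL are hypothetical; metadata-uri is assumed to be the property name used by the example connector.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.Properties;

class ConfigSketch {
    // Reads catalog-style properties and applies them to the config object,
    // which is roughly the job the configuration framework does for you.
    static SketchConfig load(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        SketchConfig config = new SketchConfig();
        String uri = props.getProperty("metadata-uri");
        if (uri != null) {
            config.setMetadata(URI.create(uri));
        }
        String timeout = props.getProperty("request-timeout-seconds");
        if (timeout != null) {
            config.setRequestTimeoutSeconds(Integer.parseInt(timeout));
        }
        config.validate();
        return config;
    }

    public static void main(String[] args) {
        SketchConfig config = load(
                "connector.name=example-http\n"
                + "metadata-uri=http://example.com/metadata.json\n");
        System.out.println(config.getMetadata() + " timeout=" + config.getRequestTimeoutSeconds());
    }
}

class SketchConfig {
    private URI metadata;                    // no default, so it must be set
    private int requestTimeoutSeconds = 30;  // hypothetical property with a default

    // A real config class would put @NotNull on this getter.
    URI getMetadata() {
        return metadata;
    }

    // A real config class would put @Config("metadata-uri") on this setter.
    SketchConfig setMetadata(URI metadata) {
        this.metadata = metadata;
        return this;
    }

    int getRequestTimeoutSeconds() {
        return requestTimeoutSeconds;
    }

    SketchConfig setRequestTimeoutSeconds(int seconds) {
        this.requestTimeoutSeconds = seconds;
        return this;
    }

    // Hand-written stand-in for the checks the validation annotations declare.
    void validate() {
        if (metadata == null) {
            throw new IllegalStateException("metadata-uri must not be null");
        }
        if (requestTimeoutSeconds < 1) {
            throw new IllegalStateException("request-timeout-seconds must be at least 1");
        }
    }
}
```

Returning this from the setters mirrors the fluent style used by many config classes, which makes them convenient to build in tests.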
Session properties can be set via the shell using:
SET SESSION connectorname.propertyname = 'value'
Or, they can be set via JDBC using the PrestoConnection.
The example connector does not have any session properties, so let’s instead look at AccumuloSessionProperties.java (which has a lot of them!).
You’ll find a lot of public constants at the top of the file, one for each available session property. The constructor of this class creates a list of all available properties. Each property has a name, description, SQL data type, Java data type, a default value, whether or not the session property is hidden, and encoder/decoder functions. You’ll see properties created using the main PropertyMetadata constructor, as well as several helper functions for common property types like booleans, integers, strings, and durations.
After the (rather large) constructor, you’ll see a function to return this list of properties and then many static getter functions that take in a ConnectorSession object and extract the value of the property from the session. You’ll find the ConnectorSession is provided in many places throughout the code during query execution, so any time you need a session property, these functions are how you access the value.
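The pattern of constants, a property list, and static getters can be sketched as follows. SketchProperty and SketchSession are simplified stand-ins for PropertyMetadata and ConnectorSession (they drop the SQL type, the hidden flag, and the encoder/decoder functions), and the two property names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SessionPropertySketch {
    public static void main(String[] args) {
        SketchSessionProperties definitions = new SketchSessionProperties();
        Map<String, Object> userValues = new HashMap<>();
        userValues.put(SketchSessionProperties.BATCH_SIZE, 500);
        SketchSession session = new SketchSession(definitions.getSessionProperties(), userValues);
        System.out.println(SketchSessionProperties.getBatchSize(session));
        System.out.println(SketchSessionProperties.isOptimizeEnabled(session));
    }
}

// Stand-in for PropertyMetadata: name, description, Java type, and default.
class SketchProperty<T> {
    final String name;
    final String description;
    final Class<T> javaType;
    final T defaultValue;

    SketchProperty(String name, String description, Class<T> javaType, T defaultValue) {
        this.name = name;
        this.description = description;
        this.javaType = javaType;
        this.defaultValue = defaultValue;
    }

    // Helper functions for common types, in the spirit of the real helpers.
    static SketchProperty<Boolean> booleanProperty(String name, String description, boolean defaultValue) {
        return new SketchProperty<>(name, description, Boolean.class, defaultValue);
    }

    static SketchProperty<Integer> integerProperty(String name, String description, int defaultValue) {
        return new SketchProperty<>(name, description, Integer.class, defaultValue);
    }
}

// Stand-in for ConnectorSession: user-set values plus the known definitions.
class SketchSession {
    private final Map<String, SketchProperty<?>> known = new HashMap<>();
    private final Map<String, Object> values;

    SketchSession(List<SketchProperty<?>> properties, Map<String, Object> values) {
        for (SketchProperty<?> property : properties) {
            known.put(property.name, property);
        }
        this.values = values;
    }

    <T> T getProperty(String name, Class<T> type) {
        SketchProperty<?> property = known.get(name);
        if (property == null) {
            throw new IllegalArgumentException("Unknown session property " + name);
        }
        // Fall back to the declared default when the user has not set a value.
        Object value = values.containsKey(name) ? values.get(name) : property.defaultValue;
        return type.cast(value);
    }
}

class SketchSessionProperties {
    static final String OPTIMIZE_ENABLED = "optimize_enabled"; // hypothetical
    static final String BATCH_SIZE = "batch_size";             // hypothetical

    private final List<SketchProperty<?>> sessionProperties;

    SketchSessionProperties() {
        sessionProperties = List.of(
                SketchProperty.booleanProperty(OPTIMIZE_ENABLED, "Enable scan optimizations", true),
                SketchProperty.integerProperty(BATCH_SIZE, "Rows fetched per request", 100));
    }

    List<SketchProperty<?>> getSessionProperties() {
        return sessionProperties;
    }

    // Static getters keep property names and casts in one place.
    static boolean isOptimizeEnabled(SketchSession session) {
        return session.getProperty(OPTIMIZE_ENABLED, Boolean.class);
    }

    static int getBatchSize(SketchSession session) {
        return session.getProperty(BATCH_SIZE, Integer.class);
    }
}
```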
Classes that define table properties look exactly like classes that define session properties (e.g., HiveTableProperties.java), but they are instead used for table definitions in the WITH clause:
CREATE TABLE foo ( a BIGINT ) WITH ( myTableProperty = 'value', myOtherTableProperty = 'othervalue' );
Metadata includes classes that expose schemas, tables, and column definitions for tables exposed to Presto. Even if your connector does not have schemas or tables, you’ll need to model your connector in this manner because that is how Presto reads data -- it expects a relational data model.
Classes related to connector Metadata typically boil down to a client class that is used to retrieve metadata from the external system and Plain Old Java Objects (POJOs) that represent that metadata.
The ExampleMetadata class implements the ConnectorMetadata interface, which contains all the functions for exposing schemas, tables, and column information to Presto. Implement each of these functions based on how you have modeled your external system as a relational model for Presto. Typically, you will see these Metadata classes leverage a Client class that does all of the heavy lifting for listing schemas, tables, and column information.
The ExampleClient class is one such example. It contains functionality for fetching schemas, tables, and column definitions, the implementation of which is pretty lightweight since it is an example connector. In a real connector, your client would typically make API calls to your external system to map its metadata model to Presto’s relational metadata model.
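The division of labor between a metadata class and its client can be sketched as follows. Nothing here uses the Presto SPI: SketchClient is a hypothetical in-memory stand-in for an external system, and SketchMetadata borrows the names listSchemaNames and listTables from ConnectorMetadata but with much simpler signatures (plain strings instead of sessions and handles).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MetadataSketch {
    public static void main(String[] args) {
        Map<String, List<String>> tables = new TreeMap<>();
        tables.put("orders", List.of("id", "total"));
        Map<String, Map<String, List<String>>> schemas = new TreeMap<>();
        schemas.put("sales", tables);

        SketchMetadata metadata = new SketchMetadata(new SketchClient(schemas));
        System.out.println(metadata.listSchemaNames());
        System.out.println(metadata.listTables(null));
        System.out.println(metadata.getColumnNames("sales", "orders"));
    }
}

// Hypothetical client; a real one would call the external system's API.
class SketchClient {
    private final Map<String, Map<String, List<String>>> schemas;

    SketchClient(Map<String, Map<String, List<String>>> schemas) {
        this.schemas = schemas;
    }

    List<String> getSchemaNames() {
        return new ArrayList<>(schemas.keySet());
    }

    List<String> getTableNames(String schema) {
        Map<String, List<String>> tables = schemas.get(schema);
        if (tables == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(tables.keySet());
    }

    // Returns null when the schema or table does not exist.
    List<String> getColumns(String schema, String table) {
        Map<String, List<String>> tables = schemas.get(schema);
        if (tables == null) {
            return null;
        }
        return tables.get(table);
    }
}

// The metadata class translates client answers into the relational model.
class SketchMetadata {
    private final SketchClient client;

    SketchMetadata(SketchClient client) {
        this.client = client;
    }

    List<String> listSchemaNames() {
        return client.getSchemaNames();
    }

    // A null schema means "list the tables of every schema".
    List<String> listTables(String schemaOrNull) {
        List<String> schemas = schemaOrNull == null
                ? client.getSchemaNames()
                : List.of(schemaOrNull);
        List<String> result = new ArrayList<>();
        for (String schema : schemas) {
            for (String table : client.getTableNames(schema)) {
                result.add(schema + "." + table);
            }
        }
        return result;
    }

    List<String> getColumnNames(String schema, String table) {
        List<String> columns = client.getColumns(schema, table);
        if (columns == null) {
            throw new IllegalArgumentException("Table not found: " + schema + "." + table);
        }
        return columns;
    }
}
```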
The Table, Column, TableHandle, ColumnHandle, TableLayoutHandle, and TransactionHandle classes for the example connector are all POJOs decorated with Jackson annotations such as @JsonCreator. These annotations are provided by the Jackson library and allow Presto to serialize each POJO to JSON and deserialize it again as it is passed between the Coordinator and Worker nodes. These handles contain the metadata needed by your connector to do its job. At a minimum, you’ll see things like table names, column names, and column types, but these classes are very flexible: you can embed any information you need in these handles to perform additional operations or optimizations.
Input/Output refers to classes that read data from your external system into Presto and optionally write data to the external system. Reading data would also be optional, but the typical use case for Presto is reading data from a system into the engine.
When it comes to input data from your external system into Presto, you define a handful of classes. First, you split the table scan into multiple split objects. Then, for each split, you create a record set for the split, and the record set contains a cursor that iterates over rows of data fed to Presto. For this, you will need a split manager.
A split manager has one function, getSplits, which Presto calls to allow your connector to split a scan against your table into multiple pieces that can be read in parallel by Presto workers. Presto provides the ConnectorSession (which can be used to access your session properties), the table layout handle that is being scanned, and the context. Your table layout handle should be packed with all the information you need to generate your list of splits. The HTTP connector gets all sources for the table and creates one split per source. Presto then takes these splits and passes them to a RecordSetProvider.
But first, the split class. This class provides your connector all information needed to process this chunk of your overall table. Like the metadata classes, it is a Jackson POJO object containing the schema, table, and URI to read.
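The "one split per source" approach can be sketched as follows. This is a simplified model, not the example connector's code: SourceSplit, ScanLayout, and SourceSplitManager are hypothetical, getSplits takes only the layout (the real function also receives the session and other arguments), and a plain map stands in for the client that looks up a table's sources.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class SplitManagerSketch {
    public static void main(String[] args) {
        Map<String, List<URI>> sources = Map.of(
                "sales.orders", List.of(
                        URI.create("http://example.com/orders-1.csv"),
                        URI.create("http://example.com/orders-2.csv")));
        SourceSplitManager manager = new SourceSplitManager(sources);
        for (SourceSplit split : manager.getSplits(new ScanLayout("sales", "orders"))) {
            System.out.println(split.schemaName + "." + split.tableName + " <- " + split.uri);
        }
    }
}

// Hypothetical split: everything a worker needs to read one chunk.
class SourceSplit {
    final String schemaName;
    final String tableName;
    final URI uri;

    SourceSplit(String schemaName, String tableName, URI uri) {
        this.schemaName = schemaName;
        this.tableName = tableName;
        this.uri = uri;
    }
}

// Hypothetical stand-in for the table layout handle being scanned.
class ScanLayout {
    final String schemaName;
    final String tableName;

    ScanLayout(String schemaName, String tableName) {
        this.schemaName = schemaName;
        this.tableName = tableName;
    }
}

class SourceSplitManager {
    // Stand-in for the client: maps "schema.table" to its data sources.
    private final Map<String, List<URI>> sourcesByTable;

    SourceSplitManager(Map<String, List<URI>> sourcesByTable) {
        this.sourcesByTable = sourcesByTable;
    }

    // Creates exactly one split per source of the scanned table.
    List<SourceSplit> getSplits(ScanLayout layout) {
        List<URI> sources = sourcesByTable.get(layout.schemaName + "." + layout.tableName);
        if (sources == null) {
            throw new IllegalStateException("Table no longer exists: " + layout.tableName);
        }
        List<SourceSplit> splits = new ArrayList<>();
        for (URI source : sources) {
            splits.add(new SourceSplit(layout.schemaName, layout.tableName, source));
        }
        return splits;
    }
}
```

Because each split carries its own schema, table, and URI, a worker can process it without asking the coordinator for anything else.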
The RecordSetProvider creates a RecordSet from the given split via the getRecordSet function. The function also receives a list of the columns that are expected to be read from the table and returned in the record set. This list contains the columns a user is selecting. It could be all of them in the case of SELECT *, a subset of the columns, or no columns at all. When the given list is empty, Presto is only looking to count the number of records in the set.
Whether all columns are wanted or none, this is where Presto applies column pruning: your connector is asked only for the subset of columns the query needs. Some external systems can optimize for this, reducing the data that needs to be scanned along with network traffic. Take advantage of this if you are able to do so!
The RecordSet is used to create a RecordCursor, which iterates over the split to provide rows of data. Record set classes are typically not very exciting.
But the RecordCursor is where the meat of reading data from your system into Presto resides!
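As a rough illustration of what a cursor does, the sketch below iterates over comma-separated rows held in memory and exposes only the requested columns. CsvCursor and CursorColumn are hypothetical and cover just a small part of what the RecordCursor interface requires; the approach of mapping each requested field to its position in the source row is an assumption, loosely modeled on the example connector.

```java
import java.util.Iterator;
import java.util.List;

class RecordCursorSketch {
    public static void main(String[] args) {
        List<String> rows = List.of("alice,30,NYC", "bob,25,SF");

        // The query asked for (city, age), in that order.
        CsvCursor cursor = new CsvCursor(rows,
                List.of(new CursorColumn("city", 2), new CursorColumn("age", 1)));
        while (cursor.advanceNextPosition()) {
            System.out.println(cursor.getString(0) + " " + cursor.getLong(1));
        }

        // An empty column list still lets the engine count rows.
        CsvCursor counter = new CsvCursor(rows, List.of());
        long count = 0;
        while (counter.advanceNextPosition()) {
            count++;
        }
        System.out.println(count);
    }
}

// Hypothetical column handle: a name plus its position in the source row.
class CursorColumn {
    final String name;
    final int ordinalPosition;

    CursorColumn(String name, int ordinalPosition) {
        this.name = name;
        this.ordinalPosition = ordinalPosition;
    }
}

// Hypothetical cursor, reduced to advancing and reading typed fields.
class CsvCursor {
    private final Iterator<String> lines;
    private final int[] fieldToColumnIndex;
    private String[] fields;

    CsvCursor(List<String> lines, List<CursorColumn> columns) {
        this.lines = lines.iterator();
        // Requested field i is served from source column fieldToColumnIndex[i].
        this.fieldToColumnIndex = new int[columns.size()];
        for (int i = 0; i < columns.size(); i++) {
            fieldToColumnIndex[i] = columns.get(i).ordinalPosition;
        }
    }

    // Moves to the next row; returns false once the data is exhausted.
    boolean advanceNextPosition() {
        if (!lines.hasNext()) {
            return false;
        }
        fields = lines.next().split(",", -1);
        return true;
    }

    String getString(int field) {
        return fields[fieldToColumnIndex[field]];
    }

    long getLong(int field) {
        return Long.parseLong(getString(field).trim());
    }
}
```

Note that this sketch still reads whole rows and prunes columns afterwards. A connector whose external system supports projection could push the column list into its request instead.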
Deploying your Connector
To deploy your connector, you will want to build your project with Maven using the “presto-plugin” packaging. You can view the pom.xml of the presto-example-http project and use it as a starting point, modifying the project name and dependencies as needed. Once the package is built, you will find a folder named after your project in the target directory, for example, presto-example-http-0.244. This folder contains all the artifacts needed for your plugin. Copy this folder and its contents to the plugin directory on all nodes in your Presto installation. Then, add the catalog for your connector in the etc/catalog directory like you would any other connector, and restart Presto. You should now be able to access your connector via the catalog.
Alternatively, you can fork Presto and add your project as a separate module. You can then include your plugin as part of the presto-server project’s assembly.xml. When you build your fork of Presto, this will create a presto-server.tar.gz or RPM artifact that contains your plugin. You can then deploy this artifact just as you would deploy a release downloaded from prestodb.io.