Chapter 4. Loading Data
Before running queries or using analytic tools, Greenplum needs to ingest data. There are multiple ways to move data into Greenplum. In the sections that follow, we’ll explore each of them.
INSERT Statements
The simplest way to insert data is the INSERT SQL statement, which is convenient for adding a few rows at a time. However, because the insert is done via the master node of the Greenplum Database, it cannot be parallelized.
An insert like the one that follows is fine for populating the values in a small lookup table.
INSERT INTO faa.d_cancellation_codes VALUES ('A', 'Carrier'), ('B', 'Weather'), ('C', 'NAS'), ('D', 'Security'), ('', 'none');
There are set-based insert statements that can be parallelized. They are discussed later in this chapter.
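As a quick preview, an INSERT that selects its rows from another table is planned and executed across all of the segments in parallel. The sketch below assumes a hypothetical staging table with matching columns:

INSERT INTO faa.d_cancellation_codes
SELECT code, description
FROM faa.cancellation_codes_staging;  -- hypothetical staging table, shown for illustration only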
The \COPY Command
\COPY is a psql command, and COPY is the corresponding SQL command in Greenplum. The two behave almost identically, with minor differences; one is that COPY can be used only by gpadmin, the Greenplum superuser. You can run \COPY from psql scripts. psql is the command-line tool for accessing Greenplum as well as PostgreSQL; a Greenplum version is installed as part of the normal installation process.
\COPY is efficient for loading smallish datasets—a few thousand rows or so—but because it is single-threaded through the master server, it is less useful for loads of millions of rows or more.
Let’s consider a table called dogs, defined as follows:
CREATE TABLE dogs (did int, dname text, bday date) distributed randomly;
Now, let’s look at a data file on the master server containing a header line and three rows of data:
id,name,bday
123,Fido, /09/09/2010
456, Rover, /01/21/2014
789 ,Bonzo, 04/15/2004
Here’s the SQL statement that copies the three rows of data to the table dogs:
\COPY dogs FROM '/home/gpuser/Exercises/dogs.csv' CSV HEADER LOG ERRORS SEGMENT REJECT LIMIT 50 ROWS;
Raw data is often filled with errors. Without the SEGMENT REJECT LIMIT clause, the \COPY statement would fail at the first error. In this example, the clause allows the load to continue until 50 rows have been rejected, at which point it aborts. The LOG ERRORS clause records the rejected rows so that they can be examined later, as explained in the Greenplum SQL Reference manual.
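In recent Greenplum versions, rows captured by LOG ERRORS can be retrieved with the gp_read_error_log() function. The query below is a sketch, assuming the load above has already rejected some rows; check the SQL Reference for the exact columns available in your version:

-- Show when each row was rejected, why, and the raw input line (column names may vary by version)
SELECT cmdtime, errmsg, rawdata
FROM gp_read_error_log('dogs');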
You also can use \COPY to unload small amounts of data:
\COPY dogs TO '/home/gpuser/Exercises/dogs_out.csv' CSV HEADER;
The gpfdist Tool
For large datasets, neither INSERT nor \COPY will be nearly as performant as Greenplum’s parallel loading technique, which makes use of external tables and the gpfdist tool. gpfdist is a file server that runs on the server on which the raw data resides. It sits there passively until a Greenplum SQL statement implicitly calls it to request data from an external table.
There is an external file with data that looks like the following:
id,name,bday
123,Fido, /09/09/2010
456, Rover, /01/21/2014
789 ,Bonzo, 04/15/2004
The table to be loaded would be defined like this:
CREATE TABLE dogs (did int, dname text, bday date) distributed randomly;
The external table definition could be as shown here:
CREATE READABLE EXTERNAL TABLE dogs_ext (LIKE dogs)
LOCATION ('gpfdist://10.0.0.99:8081/dogs.csv')
FORMAT 'CSV' (HEADER)
LOG ERRORS SEGMENT REJECT LIMIT 50 ROWS;
After the gpfdist process has been started on the host housing the comma-separated values (CSV) file, the data can easily be imported by using the following statement:
INSERT INTO dogs SELECT * FROM dogs_ext;
When the INSERT statement is executed, all of the segments engaged in it issue requests to the gpfdist process running on the server with address 10.0.0.99 for chunks of data. Each segment parses the rows it receives; if a row belongs on the segment that imported it, it is stored there. If not, the row is shipped across the private interconnect to the segment to which it belongs and stored there. This process is known as Scatter-Gather. The mechanics are completely transparent to the user community.
The Greenplum documentation describes a number of methods of deploying gpfdist. In particular, the number of gpfdist processes per external server can have a large impact on load speeds. Figure 4-1 shows one example in which one gpfdist process is running on the external ETL server.
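One way to put more than one gpfdist process to work is to list several gpfdist URIs in the external table definition, so that multiple processes (or ports) on the ETL server feed the segments concurrently. The host, ports, and file names below are illustrative only:

CREATE READABLE EXTERNAL TABLE dogs_ext_multi (LIKE dogs)
LOCATION ('gpfdist://10.0.0.99:8081/dogs1.csv',
          'gpfdist://10.0.0.99:8082/dogs2.csv')
FORMAT 'CSV' (HEADER)
LOG ERRORS SEGMENT REJECT LIMIT 50 ROWS;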
The gpload Tool
Many Greenplum users employ the gpload tool. This is a binary command distributed as part of the Greenplum distribution. It uses a YAML configuration file to orchestrate the loading of data. In many organizations, there are dedicated data loading teams who might not actually have much SQL experience or might not even have database accounts. For them, gpload is an ideal tool.
Here’s what gpload does behind the scenes:
- Creates a user session in a database
- Creates an external table that describes the data to be loaded
- Starts the gpfdist program on the host where the data is located
- Performs a parallel load of the data from the source to the target table
For this to happen, the configuration file, a command-line argument, or the relevant environment variable must specify the database user, the target database and table, the location of the source data, and the Greenplum master host of the cluster.
Following is a very simple YAML configuration file for our dogs table example:
VERSION: 1.0.0.1
DATABASE: dogpound
USER: gpuser
HOST: mdw-1
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - data_host.foobar.com
         PORT: 8081
         FILE:
           - /home/gpuser/data/*
    - FORMAT: csv
    - DELIMITER: ','
   OUTPUT:
    - TABLE: public.dogs
    - MODE: INSERT
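Assuming this file is saved as dogs_load.yaml (a file name chosen here purely for illustration), the load is started from the command line; the -l option naming a log file is optional:

gpload -f dogs_load.yaml -l gpload_dogs.log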
Notice that there are two PORT fields in the YAML file. The first is the Greenplum listener port on which the user session exists. The second is the port that gpfdist and Greenplum use to transfer data.
There are many useful optional features in gpload; a sketch of how several of them appear in the YAML file follows this list:
- Logging load errors for potential correction.
- SQL commands that you can run before and after the load operation. You can use these to add audit commands or check the error files.
- Ability to truncate the target table before loading. This is useful when loading into staging tables on a regular basis.
- Mapping, which makes it possible for you to apply functions or expressions to data as part of the ingest process. You could use this to encrypt data or transform it in other ways.
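The fragment below sketches how the mapping, truncate, and before/after SQL options might look inside the GPLOAD section; the expressions and the load_audit table are placeholders, so consult the gpload reference for the exact keys supported by your version:

   OUTPUT:
    - TABLE: public.dogs
    - MODE: INSERT
    - MAPPING:
         did: id
         dname: upper(name)
         bday: bday
   PRELOAD:
    - TRUNCATE: true
   SQL:
    - BEFORE: "INSERT INTO load_audit VALUES ('dogs load started', now())"
    - AFTER: "INSERT INTO load_audit VALUES ('dogs load finished', now())"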
You can achieve the same effect as gpload manually.
On data_host.foobar.com, run the following command at the Linux prompt:
gpfdist -d /home/gpuser/data/ -p 8081 > gpfdist.log 2>&1 &
Then, in a psql session, type the following:
CREATE TABLE dogs (did int, dname text, bday date) distributed randomly;

CREATE READABLE EXTERNAL TABLE dogs_ext (like dogs)
LOCATION ('gpfdist://data_host.foobar.com:8081//dogs.csv')
FORMAT 'CSV' (HEADER)
LOG ERRORS SEGMENT REJECT LIMIT 50;

INSERT INTO public.dogs SELECT * from dogs_ext;
You can find more information about the YAML configuration file in the next section.
Learning More
There is a thorough discussion of loading and unloading data in the Greenplum documentation.
Pivotal’s Jon Roberts has written Outsourcer, a tool that does Change Data Capture from Oracle and SQLServer into Greenplum. This is not a Pivotal product and thus not supported by Pivotal, but Jon makes every effort to maintain, support, and enhance the product.