Chapter 5. Loading Data

Before running queries or using analytic tools, Greenplum needs to ingest data. There are multiple ways to move data into Greenplum. In the sections that follow, we explore each of them.

INSERT Statements

The simplest way to insert data is to use the INSERT SQL statement, which facilitates inserting a few rows of data. However, because the insert is done via the master node of Greenplum Database, it cannot be parallelized.

An insert like the one that follows is fine for populating the values in a small lookup table:

INSERT INTO faa.d_cancellation_codes
  VALUES ('A', 'Carrier'),
         ('B', 'Weather'),
         ('C', 'NAS'),
         ('D', 'Security'),
         ('', 'none');

There are also set-based INSERT statements that you can parallelize (these are discussed later in this chapter).

\COPY Command

The \COPY command is a psql command. There is also the COPY statement in SQL. There are minor differences, one of which is that only a user with superuser privileges can use COPY.

You can run \COPY in psql scripts. psql is the command-line tool for accessing Greenplum as well as PostgreSQL. A Greenplum version is available as part of the normal installation process.

\COPY is efficient for loading smallish datasets—a few thousand rows or so—but because it is single-threaded through the master server, it is less useful for loads of millions of rows or more.

The data is in a file, the top lines of which look like this:

dog_id, dog_name, dog_dob 123,Fido,09/09/2010 456,Rover,01/21/2014 789,Bonzo,04/15/2016 ...

Get Data Warehousing with Greenplum, 2nd Edition now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.