Chapter 3. Organizing Data in Greenplum

To use Greenplum effectively, architects, designers, developers, and users must understand the various ways in which data can be stored, because these choices affect performance in loading, querying, and analyzing datasets. A simple “lift and shift” from a transactional data model is almost always suboptimal. Data warehouses generally favor a data model that is flatter than a normalized transactional model. Data model aside, Greenplum offers a wide variety of choices in how data is organized. These choices include the following:

Distribution: Determines the segment to which each table row is assigned
Partitioning: Determines how the data is stored on each of the segments
Orientation: Determines whether the data is stored by rows or by columns
Compression: Used to reduce the disk space that table data occupies
Append-optimized tables: Used to enhance performance for data that is rarely changed
External tables: Provide a method for accessing data outside Greenplum
Indexing: Used to speed lookups of individual rows in a table

Distributing Data

One of the most important methods for achieving good query performance from Greenplum is the proper distribution of data. All other things being equal, having roughly the same number of rows in each segment of a database is a major benefit, because a parallel query finishes only when its slowest segment finishes. In Greenplum, the data distribution policy is determined at table creation time. Greenplum adds a distribution clause to the Data Definition Language (DDL) for ...
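
To make the point about even row counts concrete, here is a minimal Python sketch that assigns rows to segments by hashing a key and then measures how evenly the rows land. It is an illustration only: the function names (segment_for, rows_per_segment, skew_ratio) are hypothetical, and the MD5-based hash is a stand-in chosen for determinism, not the hash that Greenplum itself uses.

```python
import hashlib
from collections import Counter


def segment_for(key, num_segments):
    """Map a key value to a segment index using a stable hash.

    Assumption: MD5 is only a stand-in here; Greenplum's real hash differs.
    """
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


def rows_per_segment(keys, num_segments):
    """Count how many rows each segment would receive for the given keys."""
    counts = Counter(segment_for(k, num_segments) for k in keys)
    return [counts.get(s, 0) for s in range(num_segments)]


def skew_ratio(counts):
    """Largest segment's row count divided by the mean; 1.0 is perfectly even."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean if mean else 0.0


if __name__ == "__main__":
    segments = 8
    # A high-cardinality key (such as a unique ID) spreads rows evenly.
    unique_ids = range(100_000)
    # A low-cardinality key (such as a country code) piles rows onto few segments.
    country_codes = ["US"] * 90_000 + ["CA"] * 10_000

    for label, keys in (("unique id", unique_ids), ("country code", country_codes)):
        counts = rows_per_segment(keys, segments)
        print(f"{label}: rows per segment {counts}, skew ratio {skew_ratio(counts):.2f}")
```

In this sketch the unique-ID key produces a skew ratio close to 1.0, whereas the two-value key leaves most segments empty and concentrates the work on at most two of them. The same reasoning suggests choosing a distribution key with many distinct, evenly occurring values.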

Data Warehousing with Greenplum by Marshall Presser
