Chapter 3. Organizing Data in Greenplum

To make effective use of Greenplum, architects, designers, developers, and users must be aware of the various methods by which data can be stored, because these choices affect performance in loading, querying, and analyzing datasets. A simple “lift and shift” from a transactional data model is almost always suboptimal. Data warehouses generally favor a data model that is flatter than a normalized transactional model. Data model aside, Greenplum offers a wide variety of choices in how the data is organized. These choices include the following:

Distribution
Determines the segment to which each table row is assigned
Partitioning
Determines how the data is stored on each of the segments
Orientation
Determines whether the data is stored by rows or by columns
Compression
Used to minimize data table storage in the disk system
Append-optimized tables
Used to enhance performance for data that is rarely changed
External tables
Provide a method for accessing data outside Greenplum
Indexing
Used to speed lookups of individual rows in a table

Distributing Data

One of the most important methods for achieving good query performance from Greenplum is the proper distribution of data. All other things being equal, having roughly the same number of rows in each segment of a database is a huge benefit, because the segments work in parallel and a query is only as fast as its most heavily loaded segment. In Greenplum, the data distribution policy is determined at table creation time. Greenplum adds a distribution clause to the Data Definition Language (DDL) for ...
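To see why an even row count per segment matters, the following is a minimal sketch in Python that simulates hash distribution. It is an illustration, not Greenplum itself. It assumes a hypothetical four-segment cluster and a hypothetical 10,000-row table, and it uses SHA-256 as a stand-in for Greenplum's internal hash function. The names segment_for, rows_per_segment, and skew_ratio are invented for this example.

```python
import hashlib
from collections import Counter

NUM_SEGMENTS = 4  # hypothetical cluster size, chosen only for illustration


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Map a distribution-key value to a segment index.

    SHA-256 is a stand-in here; it is NOT the hash function Greenplum uses.
    """
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


def rows_per_segment(keys, num_segments=NUM_SEGMENTS):
    """Count how many rows land on each segment for the given key values."""
    counts = Counter(segment_for(k, num_segments) for k in keys)
    return [counts.get(s, 0) for s in range(num_segments)]


def skew_ratio(counts):
    """Largest segment's row count divided by the mean; 1.0 is perfectly even."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean


# Hypothetical table of 10,000 rows with two candidate distribution keys.
order_ids = range(10_000)  # unique value per row
statuses = ["open" if i % 10 else "closed" for i in range(10_000)]  # two values

even = rows_per_segment(order_ids)
skewed = rows_per_segment(statuses)

print("unique key  :", even, "skew ratio", round(skew_ratio(even), 2))
print("2-value key :", skewed, "skew ratio", round(skew_ratio(skewed), 2))
```

In this sketch the unique key spreads rows across all four segments, whereas the two-value key can reach at most two of them, so most of the simulated cluster sits idle while one segment does most of the work. The same reasoning suggests favoring a high-cardinality column when choosing a distribution key.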
