Chapter 3. Organizing Data in Greenplum
To make effective use of Greenplum, architects, designers, developers, and users must understand the various ways in which data can be stored, because these choices affect performance when loading, querying, and analyzing datasets. A simple “lift and shift” from a transactional data model is almost always suboptimal. Data warehouses generally prefer a data model that is flatter than a normalized transactional model. Data model aside, Greenplum offers a wide variety of choices in how the data is organized. These choices include the following:
- Distribution: Determines the segment to which table rows are assigned
- Partitioning: Determines how the data is stored on each of the segments
- Orientation: Determines whether the data is stored by rows or by columns
- Compression: Used to minimize table storage on disk
- Append-optimized tables: Used to enhance performance for data that is rarely changed
- External tables: Provide a method for accessing data outside Greenplum
- Indexing: Used to speed lookups of individual rows in a table
Distributing Data
One of the most important methods for achieving good query performance from Greenplum is the proper distribution of data. All other things being equal, having roughly the same number of rows in each segment of a database is a huge benefit. In Greenplum, the data distribution policy is determined at table creation time. Greenplum adds a distribution clause to the Data Definition Language (DDL) for ...
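Why an even row count per segment matters can be illustrated with a small simulation. The following is a minimal sketch, not Greenplum's actual implementation: the four-segment cluster, the helper names (`segment_for`, `rows_per_segment`, `skew_ratio`), and the use of MD5 as the hash are all assumptions made for illustration. It compares a high-cardinality distribution key with a low-cardinality one.

```python
import hashlib
from collections import Counter

NUM_SEGMENTS = 4  # hypothetical cluster size, chosen for illustration


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Map a distribution-key value to a segment number.

    MD5 is only a stand-in here; Greenplum uses its own hash function.
    """
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_segments


def rows_per_segment(keys, num_segments=NUM_SEGMENTS):
    """Count how many rows each segment would receive for the given keys."""
    counts = Counter(segment_for(k, num_segments) for k in keys)
    return [counts.get(s, 0) for s in range(num_segments)]


def skew_ratio(counts):
    """Largest segment row count divided by the mean (1.0 is perfectly even)."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean


# High-cardinality key: every row has a distinct value.
order_ids = range(100_000)
# Low-cardinality key: only two distinct values, one of them dominant.
statuses = ["shipped" if i % 10 else "pending" for i in range(100_000)]

even = rows_per_segment(order_ids)
skewed = rows_per_segment(statuses)

print("order_id key:", even, "skew ratio:", round(skew_ratio(even), 2))
print("status key:  ", skewed, "skew ratio:", round(skew_ratio(skewed), 2))
```

In this sketch the unique key spreads rows almost evenly, whereas the two-valued key sends at least 90 percent of the rows to a single segment, so that segment would do most of the work for any query touching the table.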