Data Warehousing with Greenplum

Book Description

Relational databases haven’t gone away, but they are evolving to integrate messy, disjointed, unstructured data into a cleansed repository for analytics. By executing queries with massively parallel processing (MPP), the latest generation of analytic data warehouses helps organizations move beyond business intelligence to a variety of advanced analytic workloads. These MPP databases expose that power through the familiar interface of SQL.

This report introduces the Greenplum Database, recently released as an open source project by Pivotal Software. Lead author Marshall Presser of Pivotal Data Engineering takes you through the Greenplum approach to data analytics and data-driven decisions, beginning with Greenplum’s shared-nothing architecture. You’ll explore data organization and storage, data loading, query execution, and in-database analytics.

You’ll learn:

  • How each networked node in Greenplum’s architecture features an independent operating system, memory, and storage
  • Four deployment options to help you balance security, cost, and time to usability
  • Ways to organize data, including distribution, storage, partitioning, and loading
  • How to use Apache MADlib for in-database analytics, and GPText to process and analyze free-form text
  • Tools available in the commercial Pivotal Greenplum database for monitoring, managing, and securing the system and for optimizing query response

Table of Contents

  1. Foreword
  2. Preface
    1. Why Are We Writing This Book?
    2. Who Are the “We”?
    3. Who Is the Audience?
    4. What the Book Covers
    5. What It Doesn’t Cover
    6. Where You Can Find More Information
    7. How to Read This Book
    8. Acknowledgments
  3. Chapter 1. Introducing the Greenplum Database
    1. Problems with the Traditional Data Warehouse
    2. Responses to the Challenge
    3. A Brief Greenplum History
    4. What Is Massively Parallel Processing
    5. The Greenplum Database Architecture
      1. Master and Standby Master
      2. Segments and Segment Hosts
      3. Private Interconnect
      4. Mirror Segments
    6. Learning More
  4. Chapter 2. Deploying Greenplum
    1. Custom(er)-Built Clusters
    2. Appliance
    3. Public Cloud
    4. Private Cloud
    5. Choosing a Greenplum Deployment
    6. Greenplum Sandbox
    7. Learning More
  5. Chapter 3. Organizing Data in Greenplum
    1. Distributing Data
    2. Polymorphic Storage
    3. Partitioning Data
    4. Compression
    5. Append-Optimized Tables
    6. External Tables
    7. Indexing
    8. Learning More
  6. Chapter 4. Loading Data
    1. INSERT Statements
    2. \COPY command
    3. The gpfdist Tool
    4. The gpload Tool
    5. Learning More
  7. Chapter 5. Gaining Analytic Insight
    1. Data Science on Greenplum with Apache MADlib
      1. What Is Data Science and Why Is It Important?
      2. Common Data Science Use Cases
      3. Tools for Data Science
      4. Apache MADlib (incubating)
      5. R Interface
    2. Text Analytics
    3. Brief Overview of the Solr/GPText Architecture
      1. Configuring Solr/GPText
      2. Defining Your Analysis and Performing Text Searches
      3. Administering GPText
    4. Learning More
  8. Chapter 6. Monitoring and Managing Greenplum
    1. Greenplum Command Center
    2. Resource Queues
    3. Greenplum Workload Manager
    4. Greenplum Management Utilities
      1. Common Tasks: Performed on a Regular Basis
      2. Specialized Tasks: Performed as Needed
    5. Learning More
  9. Chapter 7. Integrating with Real-Time Response
    1. GemFire-Greenplum Connector
      1. Problem Scenario: Fraud Detection
      2. Supporting the Fraud Detection Process
      3. Problem Scenario: Internet of Things Monitoring and Failure Prevention
    2. What Is GemFire?
      1. The GemFire-Greenplum Connector
    3. Learning More
  10. Chapter 8. Optimizing Query Response
    1. Fast Query Response Explained
    2. Learning More
  11. Chapter 9. Learning More About Greenplum
    1. Greenplum Sandbox
    2. Greenplum Documentation
    3. Pivotal Guru (formerly Greenplum Guru)
    4. Greenplum Best Practices Guide
    5. Greenplum Blogs
    6. Greenplum YouTube Channel
    7. Greenplum Knowledge Base
    8. greenplum.org

Product Information

  • Title: Data Warehousing with Greenplum
  • Author(s): Marshall Presser
  • Release date: July 2017
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781491983515