Data Warehousing with Greenplum, 2nd Edition

Book description

Data professionals are confronting the most disruptive change since relational databases appeared in the 1980s. SQL is still a major tool for data analytics, but conventional relational database management systems can’t handle the increasing size and complexity of today’s datasets. This updated edition teaches you best practices for Greenplum Database, the open source massively parallel processing (MPP) database that accommodates large sets of nonrelational and relational data.

Marshall Presser, field CTO at Pivotal, introduces Greenplum’s approach to data analytics and data-driven decisions, beginning with its shared-nothing architecture. IT managers, developers, data analysts, system architects, and data scientists will all gain from exploring data organization and storage, data loading, query execution, and in-database analytics. Discover how MPP and Greenplum will help you go beyond the traditional data warehouse.

This ebook covers:

  • Greenplum features, use case examples, and techniques for optimizing use
  • Four Greenplum deployment options to help you balance security, cost, and time to usability
  • Why each networked node in Greenplum’s architecture includes an independent operating system, memory, and storage
  • Additional tools for monitoring, managing, securing, and optimizing query responses in the Pivotal Greenplum commercial database

Table of contents

  1. Foreword to the Second Edition
  2. Foreword to the First Edition
  3. Preface
    1. Why Are We Rewriting This Book?
    2. Why Did We Write This Book in the First Place?
    3. Who Are the “We”?
    4. Who Should Read This Book?
    5. What the Book Covers
    6. What It Doesn’t Cover
    7. Where You Can Find More Information
    8. How to Read This Book
    9. Acknowledgments
  4. 1. Introducing the Greenplum Database
    1. Problems with the Traditional Data Warehouse
    2. Responses to the Challenge
    3. A Brief Greenplum History
    4. What Is Massively Parallel Processing?
    5. The Greenplum Database Architecture
      1. Master and Standby Master
      2. Segments and Segment Hosts
      3. Private Interconnect
      4. Mirror Segments
    6. Additional Resources
      1. Greenplum Documentation
      2. Greenplum Best Practices Guide
      3. Greenplum Cluster Concepts Guide
      4. PivotalGuru (Formerly Greenplum Guru)
      5. Pivotal Greenplum Blogs
      6. Greenplum YouTube Channel
      7. Greenplum Knowledge Base
      9. Other Sources
  5. 2. What’s New in Greenplum?
    1. What’s New in Greenplum 5?
    2. What’s New in Greenplum 6?
    3. Additional Resources
  6. 3. Deploying Greenplum
    1. Custom(er)-Built Clusters
    2. Greenplum Building Blocks
    3. Public Cloud
    4. Private Cloud
    5. Greenplum for Kubernetes
    6. Choosing a Greenplum Deployment
    7. Additional Resources
  7. 4. Organizing Data in Greenplum
    1. Distributing Data
    2. Polymorphic Storage
    3. Partitioning Data
    4. Orientation
    5. Compression
    6. Append-Optimized Tables
    7. External Tables
    8. Indexing
    9. Additional Resources
  8. 5. Loading Data
    1. INSERT Statements
    2. \COPY Command
    3. The gpfdist Process
    4. The gpload Tool
    5. Additional Resources
  9. 6. Gaining Analytic Insight
    1. Data Science on Greenplum with Apache MADlib
      1. What Is Data Science and Why Is It Important?
      2. Common Data Science Use Cases
    2. Apache MADlib
      1. Scale and Performance
      2. Familiar SQL Interface
      3. Algorithm Design
      4. R Interface
      5. Deep Learning
    3. Text Analytics
    4. Brief Overview of GPText Architecture
      1. Configuring Solr/GPText
      2. Defining Your Analysis and Performing Text Searches
      3. Administering GPText
    5. Additional Resources
  10. 7. Monitoring and Managing Greenplum
    1. Greenplum Command Center
    2. Workload Management
      1. Resource Queues
      2. Resource Groups
    3. Greenplum Management Tools
      1. Basic Command and Control
      2. System Health
      3. Disaster Recovery and Data Replication
      4. Operations and System Management
      5. Other Tools
    4. Additional Resources
  11. 8. Accessing External Data
    1. dblink
    2. Foreign Data Wrappers
    3. Platform Extension Framework
    4. Greenplum Stream Server
    5. Greenplum-Kafka Integration
    6. Greenplum-Informatica Connector
    7. GemFire-Greenplum Connector
    8. Greenplum-Spark Connector
    9. Amazon S3
    10. External Web Tables
    11. Additional Resources
  12. 9. Optimizing Query Response
    1. Fast Query Response Explained
    2. GPORCA Recent Accomplishments
    3. Additional Resources

Product information

  • Title: Data Warehousing with Greenplum, 2nd Edition
  • Author(s): Marshall Presser
  • Release date: July 2019
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492058120