Getting Started with Greenplum for Big Data Analytics

Name: Getting Started with Greenplum for Big Data Analytics
Author: Sunila Gollapudi
ISBN: 9781782177043

by Sunila Gollapudi

October 2013

Intermediate to advanced

172 pages

3h 51m

English

Packt Publishing

Table of Contents
Getting Started with Greenplum for Big Data Analytics
Credits
Foreword
About the Author
Acknowledgement
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers and more
Why Subscribe?
Free Access for Packt account holders
Instant Updates on New Packt Books
Preface
What this book covers

What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Errata
Piracy
Questions
1. Big Data, Analytics, and Data Science Life Cycle
Enterprise data
Classification
Features
Big Data
So, what is Big Data?
Multi-structured data
Data analytics
Data science
Data science life cycle
Phase 1 – state business problem
Phase 2 – set up data
Phase 3 – explore/transform data
Phase 4 – model
Phase 5 – publish insights
Phase 6 – measure effectiveness
References/Further reading
Summary
2. Greenplum Unified Analytics Platform (UAP)
Big Data analytics – platform requirements
Greenplum Unified Analytics Platform (UAP)
Core components
Greenplum Database
Hadoop (HD)
Chorus
Command Center
Modules
Database modules
HD modules
Data Integration Accelerator (DIA) modules
Core architecture concepts
Data warehousing
Column-oriented databases
Parallel versus distributed computing/processing
Shared nothing, massive parallel processing (MPP) systems, and elastic scalability
Shared disk data architecture
Shared memory data architecture
Shared nothing data architecture
Data loading patterns
Greenplum UAP components
Greenplum Database
The Greenplum Database physical architecture
The Greenplum high-availability architecture
High-speed data loading using external tables
External table types
Polymorphic data storage and historic data management
Data distribution
Hadoop (HD)
Hadoop Distributed File System (HDFS)
Hadoop MapReduce
Chorus
Greenplum Data Computing Appliance (DCA)
Greenplum Data Integration Accelerator (DIA)
References/Further reading
Summary
3. Advanced Analytics – Paradigms, Tools, and Techniques
Analytic paradigms
Descriptive analytics
Predictive analytics
Prescriptive analytics
Analytics classified
Classification
Forecasting or prediction or regression
Clustering
Optimization
Simulations
Modeling methods
Decision trees
Association rules
The Apriori algorithm
Linear regression
Logistic regression
The Naive Bayesian classifier
K-means clustering
Text analysis
R programming
Weka
In-database analytics using MADlib
References/Further reading
Summary
4. Implementing Analytics with Greenplum UAP
Data loading for Greenplum Database and HD
Greenplum data loading options
External tables
gpfdist
gpload
Hadoop (HD) data loading options
Sqoop 2
Greenplum BulkLoader for Hadoop
Using external ETL to load data into Greenplum
Extraction, Load, and Transformation (ELT) and Extraction, Transformation, Load, and Transformation (ETLT)
Greenplum target configuration
Sourcing large volumes of data from Greenplum
Unsupported Greenplum data types
Push Down Optimization (PDO)
Greenplum table distribution and partitioning
Distribution
Data skew and performance
Optimizing the broadcast or redistribution motion for data co-location
Partitioning
Querying Greenplum Database and HD
Querying Greenplum Database
Analyzing and optimizing queries
The ANALYZE function
The EXPLAIN function
Dynamic Pipelining in Greenplum
Querying HDFS
Hive
Pig
Data communication between Greenplum Database and Hadoop (using external tables)
Data Computing Appliance (DCA)
Storage design, disk protection, and fault tolerance
Master server RAID configurations
Segment server RAID configurations
Monitoring DCA
Greenplum Database management
In-database analytics options (Greenplum-specific)
Window functions
The PARTITION BY clause
The ORDER BY clause
The OVER (ORDER BY…) clause
Creating, modifying, and dropping functions
User-defined aggregates
Using R with Greenplum
DBI Connector for R
PL/R
Using Weka with Greenplum
Using MADlib with Greenplum
Using Greenplum Chorus
Pivotal
References/Further reading
Summary
Index

Content preview from Getting Started with Greenplum for Big Data Analytics

Greenplum table distribution and partitioning

In this section, we define table distribution in the Greenplum context and cover related aspects of distribution, such as data skew.

Distribution

Greenplum is a massively parallel processing (MPP) data store, and data is distributed across segments according to the distribution strategy defined for each table.

Every table in Greenplum has a data distribution method; the DISTRIBUTED BY clause defines the distribution strategy. We need to ensure that the chosen distribution key does not introduce data skew on any of the segment hosts.
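To make the effect of the distribution key on skew concrete, here is a minimal Python sketch. It is an illustration only: it assumes a simple "hash the key, take it modulo the number of segments" placement, using SHA-256 as a stand-in, which is not Greenplum's actual hash function. The segment count, the key values, and the helper names (`segment_for`, `rows_per_segment`, `skew_ratio`) are all hypothetical.

```python
import hashlib
from collections import Counter

def segment_for(key, num_segments):
    """Map a distribution-key value to a segment index.

    Assumption: hash-then-modulo placement with SHA-256 as a stand-in;
    Greenplum's real hashing differs.
    """
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments

def rows_per_segment(keys, num_segments):
    """Count how many rows land on each segment for the given key values."""
    counts = Counter(segment_for(k, num_segments) for k in keys)
    return [counts.get(s, 0) for s in range(num_segments)]

def skew_ratio(counts):
    """Largest segment's row count divided by the mean; 1.0 is perfectly even."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean

NUM_SEGMENTS = 4  # hypothetical cluster size

# High-cardinality key: every row has a distinct value (like an order id).
unique_keys = range(10_000)
# Low-cardinality key: only two distinct values (like a status flag).
flag_keys = ["open" if i % 10 else "closed" for i in range(10_000)]

even = rows_per_segment(unique_keys, NUM_SEGMENTS)
skewed = rows_per_segment(flag_keys, NUM_SEGMENTS)

print("unique key:", even, "skew ratio:", round(skew_ratio(even), 2))
print("flag key:  ", skewed, "skew ratio:", round(skew_ratio(skewed), 2))
```

With a key that has only two distinct values, at most two of the four segments can receive rows, so most of the work falls on one segment. This is the kind of skew a poorly chosen DISTRIBUTED BY column can introduce.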

There are two methods of distributing table data across segment hosts:

Column oriented/Hash distribution: This is a distribution mechanism that considers ...
