Getting Started with Greenplum for Big Data Analytics

Name: Getting Started with Greenplum for Big Data Analytics
Author: Sunila Gollapudi
ISBN: 9781782177043

by Sunila Gollapudi

October 2013

Intermediate to advanced

172 pages

3h 51m

English

Packt Publishing

Table of Contents
Getting Started with Greenplum for Big Data Analytics
Credits
Foreword
About the Author
Acknowledgement
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers and more
Why Subscribe?
Free Access for Packt account holders
Instant Updates on New Packt Books
Preface
What this book covers

What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Errata
Piracy
Questions
1. Big Data, Analytics, and Data Science Life Cycle
Enterprise data
Classification
Features
Big Data
So, what is Big Data?
Multi-structured data
Data analytics
Data science
Data science life cycle
Phase 1 – state business problem
Phase 2 – set up data
Phase 3 – explore/transform data
Phase 4 – model
Phase 5 – publish insights
Phase 6 – measure effectiveness
References/Further reading
Summary
2. Greenplum Unified Analytics Platform (UAP)
Big Data analytics – platform requirements
Greenplum Unified Analytics Platform (UAP)
Core components
Greenplum Database
Hadoop (HD)
Chorus
Command Center
Modules
Database modules
HD modules
Data Integration Accelerator (DIA) modules
Core architecture concepts
Data warehousing
Column-oriented databases
Parallel versus distributed computing/processing
Shared nothing, massive parallel processing (MPP) systems, and elastic scalability
Shared disk data architecture
Shared memory data architecture
Shared nothing data architecture
Data loading patterns
Greenplum UAP components
Greenplum Database
The Greenplum Database physical architecture
The Greenplum high-availability architecture
High-speed data loading using external tables
External table types
Polymorphic data storage and historic data management
Data distribution
Hadoop (HD)
Hadoop Distributed File System (HDFS)
Hadoop MapReduce
Chorus
Greenplum Data Computing Appliance (DCA)
Greenplum Data Integration Accelerator (DIA)
References/Further reading
Summary
3. Advanced Analytics – Paradigms, Tools, and Techniques
Analytic paradigms
Descriptive analytics
Predictive analytics
Prescriptive analytics
Analytics classified
Classification
Forecasting or prediction or regression
Clustering
Optimization
Simulations
Modeling methods
Decision trees
Association rules
The Apriori algorithm
Linear regression
Logistic regression
The Naive Bayesian classifier
K-means clustering
Text analysis
R programming
Weka
In-database analytics using MADlib
References/Further reading
Summary
4. Implementing Analytics with Greenplum UAP
Data loading for Greenplum Database and HD
Greenplum data loading options
External tables
gpfdist
gpload
Hadoop (HD) data loading options
Sqoop 2
Greenplum BulkLoader for Hadoop
Using external ETL to load data into Greenplum
Extraction, Load, and Transformation (ELT) and Extraction, Transformation, Load, and Transformation (ETLT)
Greenplum target configuration
Sourcing large volumes of data from Greenplum
Unsupported Greenplum data types
Push Down Optimization (PDO)
Greenplum table distribution and partitioning
Distribution
Data skew and performance
Optimizing the broadcast or redistribution motion for data co-location
Partitioning
Querying Greenplum Database and HD
Querying Greenplum Database
Analyzing and optimizing queries
The ANALYZE function
The EXPLAIN function
Dynamic Pipelining in Greenplum
Querying HDFS
Hive
Pig
Data communication between Greenplum Database and Hadoop (using external tables)
Data Computing Appliance (DCA)
Storage design, disk protection, and fault tolerance
Master server RAID configurations
Segment server RAID configurations
Monitoring DCA
Greenplum Database management
In-database analytics options (Greenplum-specific)
Window functions
The PARTITION BY clause
The ORDER BY clause
The OVER (ORDER BY…) clause
Creating, modifying, and dropping functions
User-defined aggregates
Using R with Greenplum
DBI Connector for R
PL/R
Using Weka with Greenplum
Using MADlib with Greenplum
Using Greenplum Chorus
Pivotal
References/Further reading
Summary
Index

Overview

This book serves as a thorough introduction to using the Greenplum platform for big data analytics. It explores key concepts for processing, analyzing, and deriving insights from big data using Greenplum, covering aspects from data integration to advanced analytics techniques like programming with R and MADlib.

What this book will help me do

Understand the architecture and core components of the Greenplum platform.
Learn how to design and execute data science projects using Greenplum.
Master loading, processing, and querying big data in Greenplum efficiently.
Explore programming with R and integrating it with Greenplum for analytics.
Gain skills in high-availability configurations, backups, and recovery within Greenplum.

Author(s)

Sunila Gollapudi is a seasoned expert in the field of big data analytics and has multiple years of experience working with platforms like Greenplum. Her real-world problem-solving expertise shapes her practical and approachable writing style, making this book not only educational but enjoyable to read.

Who is it for?

This book is ideal for data scientists or analysts aiming to explore the capabilities of big data platforms like Greenplum. It suits readers with basic knowledge of data warehousing, programming, and analytics tools who want to deepen their expertise and effectively harness Greenplum for analytics.
