Apache Hive Essentials

Name: Apache Hive Essentials
Author: Dayong Du
ISBN: 9781783558575

by Dayong Du

February 2015

Beginner to intermediate

208 pages

4h 15m

English

Packt Publishing

Table of Contents
Apache Hive Essentials
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Overview of Big Data and Hive
A short history
Introducing big data
Relational and NoSQL database versus Hadoop
Batch, real-time, and stream processing
Overview of the Hadoop ecosystem
Hive overview
Summary
2. Setting Up the Hive Environment
Installing Hive from Apache
Installing Hive from vendor packages
Starting Hive in the cloud
Using the Hive command line and Beeline
The Hive-integrated development environment
Summary
3. Data Definition and Description
Understanding Hive data types
Data type conversions
Hive Data Definition Language
Hive database
Hive internal and external tables
Hive partitions
Hive buckets
Hive views
Summary
4. Data Selection and Scope
The SELECT statement
The INNER JOIN statement
The OUTER JOIN and CROSS JOIN statements
Special JOIN – MAPJOIN
Set operation – UNION ALL
Summary
5. Data Manipulation
Data exchange – LOAD
Data exchange – INSERT
Data exchange – EXPORT and IMPORT
ORDER and SORT
Operators and functions
Transactions
Summary
6. Data Aggregation and Sampling
Basic aggregation – GROUP BY
Advanced aggregation – GROUPING SETS
Advanced aggregation – ROLLUP and CUBE
Aggregation condition – HAVING
Analytic functions
Sampling
Summary
7. Performance Considerations
Performance utilities
The EXPLAIN statement
The ANALYZE statement
Design optimization
Partition tables
Bucket tables
Index
Data file optimization
File format
Compression
Storage optimization
Job and query optimization
Local mode
JVM reuse
Parallel execution
Join optimization
Common join
Map join
Bucket map join
Sort merge bucket (SMB) join
Sort merge bucket map (SMBM) join
Skew join
Summary
8. Extensibility Considerations
User-defined functions
The UDF code template
The UDAF code template
The UDTF code template
Development and deployment
Streaming
SerDe
Summary
9. Security Considerations
Authentication
Metastore server authentication
HiveServer2 authentication
Authorization
Legacy mode
Storage-based mode
SQL standard-based mode
Encryption
Summary
10. Working with Other Tools
JDBC / ODBC connector
HBase
Hue
HCatalog
ZooKeeper
Oozie
Hive roadmap
Summary
Index

Content preview from Apache Hive Essentials

Sampling

When the data volume is very large, we may need to work with a subset of the data to speed up analysis. Sampling is a technique for selecting and analyzing such a subset in order to identify patterns and trends. In Hive, there are three ways of sampling data: random sampling, bucket table sampling, and block sampling.

Random sampling uses the RAND() function and the LIMIT keyword to draw a sample. The DISTRIBUTE BY and SORT BY clauses ensure that the rows are randomly distributed among the reducers and randomly ordered within each of them. ORDER BY RAND() can achieve the same result, but it performs poorly because ORDER BY sends all rows through a single reducer to produce a global order. The DISTRIBUTE BY and SORT BY form looks like this:

SELECT * FROM <Table_Name> DISTRIBUTE BY RAND() SORT BY RAND()
LIMIT ...
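
The preview breaks off before showing the other two sampling methods, so the following is only a minimal sketch of how all three might look in HiveQL. It is not taken from the book: the table name employee, the sample sizes, and the alias s are hypothetical, and the bucket and block queries assume Hive's standard TABLESAMPLE clause.

```sql
-- Assumption: `employee` is a hypothetical table; sizes are arbitrary.

-- Random sampling: shuffle rows across reducers, sort randomly
-- within each reducer, then keep the first 5 rows.
SELECT * FROM employee
DISTRIBUTE BY RAND() SORT BY RAND()
LIMIT 5;

-- Bucket table sampling: hash rows into 4 buckets and return bucket 1.
-- ON RAND() samples over the whole row; naming a column that the table
-- is already bucketed on lets Hive read only the matching bucket files.
SELECT * FROM employee TABLESAMPLE(BUCKET 1 OUT OF 4 ON RAND()) s;

-- Block sampling: take roughly 10 percent of the input data by size.
-- The granularity is a storage block, so the result is approximate.
SELECT * FROM employee TABLESAMPLE(10 PERCENT) s;

-- Block sampling by row count: 5 rows are taken from each input split,
-- so the total can exceed 5 when the table spans several splits.
SELECT * FROM employee TABLESAMPLE(5 ROWS) s;
```

Random sampling scans and shuffles the full table, so it is the most uniform but also the most expensive of the three. The TABLESAMPLE forms trade some uniformity for speed.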
