Apache Hive Cookbook

by Hanish Bansal, Saurabh Chauhan, Shrey Mehrotra
April 2016
Content level: Beginner
268 pages
5h 32m
English
Packt Publishing

Creating buckets in Hive

Partitioning is a poor fit when queries filter on a column with many distinct values. If we partition on a high-cardinality column such as ID, Hive creates a large number of small datasets in HDFS and a matching number of partition entries in the metastore, which increases the load on both the NameNode and the metastore service.

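To optimize queries on such a dataset, we instead group the data into a fixed number of buckets. Hive assigns each row to a bucket by hashing the bucketing column, so the data is divided across at most the number of buckets we specify, however many distinct values the column holds.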
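The way rows spread across a fixed number of buckets can be sketched in a few lines of Python. This is an illustration only, not Hive's code: it assumes the classic bucketing scheme in which an int column hashes to its own value and the bucket index is that hash, masked to a non-negative 32-bit integer, modulo the bucket count. Newer Hive releases can use a different hash function, and the names `bucket_for` and `bucket_sizes` are hypothetical.

```python
from collections import Counter

INT_MAX = 0x7FFFFFFF  # mask that keeps the hash non-negative (32-bit)


def bucket_for(id_value: int, num_buckets: int) -> int:
    """Return the bucket index for an int key.

    Assumption: the hash of an int is the int itself, as in
    Hive's classic bucketing for int columns.
    """
    return (id_value & INT_MAX) % num_buckets


def bucket_sizes(ids, num_buckets: int) -> Counter:
    """Count how many rows land in each bucket."""
    return Counter(bucket_for(i, num_buckets) for i in ids)


if __name__ == "__main__":
    # 1,000 distinct IDs would mean 1,000 partitions,
    # but they fit into just 10 buckets.
    sizes = bucket_sizes(range(1, 1001), 10)
    for bucket in sorted(sizes):
        print(bucket, sizes[bucket])
```

With sequential IDs the rows spread evenly, so each of the 10 buckets receives 100 rows. A lookup on a single ID then only needs to read the one bucket that ID hashes to.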

How to do it…

Using the same sales dataset, to optimize queries on a column with many distinct values, such as ID, we create buckets on that column as follows:

create table sales_buck (id int, fname string, state string, zip string, ...

Publisher Resources

ISBN: 9781782161080