April 2016
Beginner
268 pages
5h 32m
English
In the scenario where we query on a unique values column of a dataset, partitioning is not a good fit. If we go with a partition on a column with high unique values like ID, it would create a large number of small datasets in HDFS and partition entries in the metastore, thus increasing the load on NameNode and the metastore service.
To optimize queries on such a dataset, we group the data into a particular number of buckets and the data is divided into the maximum number of buckets.
Using the same sales dataset, if we need to optimize queries on a column with high unique column values such as ID, we create buckets on that column as follows:
create table sales_buck (id int, fname string, state string, zip string, ...