Besides partitioning, bucketing is another technique for clustering datasets into more manageable parts to optimize query performance. Unlike a partition, which corresponds to a directory, a bucket corresponds to a segment of files in HDFS. For example, the
employee_partitioned table from the previous section uses year and month as its top-level partitions. If there were a further request to use
employee_id as a third level of partition, it would lead to many deep, small partitions and directories. Instead, we can bucket the
employee_partitioned table using
employee_id as the bucket column. The value of this column is hashed into a user-defined number of buckets. Records with the same
employee_id will always be stored in the same bucket (segment of ...
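
The hashing step described above can be illustrated with a minimal Python sketch. It is not Hive's implementation: it assumes an integer key hashes to its own value and that the bucket index is that hash modulo the bucket count, whereas the hash Hive actually applies depends on the column type and the Hive version. The function name `bucket_for`, the bucket count of 2, and the sample IDs are all hypothetical and chosen only for illustration.

```python
def bucket_for(employee_id: int, num_buckets: int) -> int:
    """Return the bucket index for an integer bucket column value."""
    # Assumption: an integer key hashes to itself. The mask keeps the
    # hash non-negative so the modulo always yields a valid index.
    hash_value = employee_id & 0x7FFFFFFF
    return hash_value % num_buckets


NUM_BUCKETS = 2  # hypothetical user-defined number of buckets
employee_ids = [100, 101, 102, 103, 100]  # made-up sample values

# Group the sample IDs by the bucket they are assigned to.
buckets = {}
for emp_id in employee_ids:
    buckets.setdefault(bucket_for(emp_id, NUM_BUCKETS), []).append(emp_id)

print(buckets)  # {0: [100, 102, 100], 1: [101, 103]}
```

Both records with ID 100 land in bucket 0, which is the property the text relies on: equal values of the bucket column always map to the same bucket.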