Chapter 9. Schema Design

Hive looks and acts like a relational database. Users have a familiar nomenclature such as tables and columns, as well as a query language that is remarkably similar to SQL dialects they have used before. However, Hive is implemented and used in ways that are very different from conventional relational databases. Often, users try to carry over paradigms from the relational world that are actually Hive anti-patterns. This section highlights some Hive patterns you should use and some anti-patterns you should avoid.

Table-by-Day

Table-by-day is a pattern where a table named supply is appended with a timestamp such as supply_2011_01_01, supply_2011_01_02, etc. Table-by-day is an anti-pattern in the database world, but due to common implementation challenges of ever-growing data sets, it is still widely used:

hive> CREATE TABLE supply_2011_01_02 (id int, part string, quantity int);

hive> CREATE TABLE supply_2011_01_03 (id int, part string, quantity int);

hive> CREATE TABLE supply_2011_01_04 (id int, part string, quantity int);

hive> .... load data ...

hive> SELECT part,quantity supply_2011_01_02
    > UNION ALL
    > SELECT part,quantity from supply_2011_01_03
    > WHERE quantity < 4;

With Hive, a partitioned table should be used instead. Hive uses expressions in the WHERE clause to select input only from the partitions needed for the query. This query will run efficiently, and it is clean and easy on the eyes:

hive> CREATE TABLE supply (id int, part string, quantity int)
    

Get Programming Hive now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.