Chapter 10. Tuning

HiveQL is a declarative language where users issue declarative queries and Hive figures out how to translate them into MapReduce jobs. Most of the time, you don’t need to understand how Hive works, freeing you to focus on the problem at hand. While the sophisticated process of query parsing, planning, optimization, and execution is the result of many years of hard engineering work by the Hive team, most of the time you can remain oblivious to it.

However, as you become more experienced with Hive, learning about the theory behind Hive, and the low-level implementation details, will let you use Hive more effectively, especially where performance optimizations are concerned.

This chapter covers several different topics related to tuning Hive performance. Some tuning involves adjusting numeric configuration parameters (“turning the knobs”), while other tuning steps involve enabling or disabling specific features.

Using EXPLAIN

The first step to learning how Hive works (after reading this book…) is to use the EXPLAIN feature to learn how Hive translates queries into MapReduce jobs.

Consider the following example:

hive> DESCRIBE onecol;
number  int

hive> SELECT * FROM onecol;
5
5
4

hive> SELECT SUM(number) FROM onecol;
14

Now, put the EXPLAIN keyword in front of the last query to see the query plan and other information. The query will not be executed.

hive> EXPLAIN SELECT SUM(number) FROM onecol;

The output requires some explaining and practice to understand.

First, the abstract ...

Get Programming Hive now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.