Chapter 10. Tuning
HiveQL is a declarative language: users issue queries and Hive figures out how to translate them into MapReduce jobs. Most of the time, you don’t need to understand how Hive works, which frees you to focus on the problem at hand. The sophisticated process of query parsing, planning, optimization, and execution is the result of many years of hard engineering work by the Hive team, but you can usually remain oblivious to it.
However, as you become more experienced with Hive, learning the theory behind it and its low-level implementation details will let you use Hive more effectively, especially where performance optimizations are concerned.
This chapter covers several different topics related to tuning Hive performance. Some tuning involves adjusting numeric configuration parameters (“turning the knobs”), while other tuning steps involve enabling or disabling specific features.
Using EXPLAIN
The first step to learning how Hive works (after reading this book…) is to use the EXPLAIN feature to learn how Hive translates queries into MapReduce jobs.
Consider the following example:
hive> DESCRIBE onecol;
number  int
hive> SELECT * FROM onecol;
5
5
4
hive> SELECT SUM(number) FROM onecol;
14
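To reproduce the session above, you need a table like onecol. The text does not show how it was created, so the following is only a minimal sketch under stated assumptions: a managed table with default storage settings, populated with INSERT ... VALUES, which is available in Hive 0.14 and later. On older releases you would load the three rows from a file instead.

```sql
-- Assumed setup; the original text does not show how onecol was created.
-- The table and column names match the DESCRIBE output above.
CREATE TABLE onecol (number INT);

-- INSERT ... VALUES requires Hive 0.14 or later.
INSERT INTO TABLE onecol VALUES (5), (5), (4);
```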
Now, put the EXPLAIN keyword in front of the last query to see the query plan and other information. The query will not be executed.
hive> EXPLAIN SELECT SUM(number) FROM onecol;
The output requires some explaining and practice to understand.
First, the abstract ...