Chapter 19. Locking

Although HiveQL is an SQL dialect, Hive lacks the traditional support for locking at the column, row, or query level that databases typically provide for update and insert queries. Files in Hadoop are traditionally write-once (although Hadoop does support limited append semantics). Because of this write-once nature and the streaming style of MapReduce, fine-grained locking is unnecessary.

However, since Hadoop and Hive are multi-user systems, locking and coordination are valuable in some situations. For example, suppose one user runs an INSERT OVERWRITE query that changes a table's contents while a second user issues a query against the same table. Unless the first user can lock the table, the second query could fail or yield invalid results.
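The conflict described above can be illustrated with a minimal sketch of table-level locking. It assumes two lock modes, shared for readers and exclusive for writers. The `TableLocks` class, the owner names, and the `stocks` table are hypothetical names chosen for illustration, and the code is not Hive's implementation.

```python
class TableLocks:
    """Hypothetical in-memory table lock manager (illustration only)."""

    def __init__(self):
        # Maps a table name to the (owner, mode) pairs currently held on it.
        self._held = {}

    def acquire(self, table, owner, mode):
        """Return True if the lock is granted, False if it conflicts."""
        if mode not in ("shared", "exclusive"):
            raise ValueError("mode must be 'shared' or 'exclusive'")
        holders = self._held.setdefault(table, [])
        # Modes held by everyone except the requesting owner.
        other_modes = [m for (o, m) in holders if o != owner]
        if mode == "exclusive" and other_modes:
            return False  # a writer needs the table to itself
        if mode == "shared" and "exclusive" in other_modes:
            return False  # readers must wait for the writer to finish
        holders.append((owner, mode))
        return True

    def release(self, table, owner):
        """Drop every lock that owner holds on table."""
        remaining = [(o, m) for (o, m) in self._held.get(table, []) if o != owner]
        self._held[table] = remaining


locks = TableLocks()
# The user running INSERT OVERWRITE takes an exclusive lock first.
writer_ok = locks.acquire("stocks", "writer", "exclusive")
# A concurrent reader is refused instead of seeing half-written data.
reader_ok = locks.acquire("stocks", "reader", "shared")
locks.release("stocks", "writer")
# Once the writer is done, the reader can proceed.
reader_retry_ok = locks.acquire("stocks", "reader", "shared")
print(writer_ok, reader_ok, reader_retry_ok)
```

In this sketch the second user's request is refused while the overwrite is in progress and succeeds once the writer releases the table. Any number of readers can hold shared locks at the same time.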

Hive can be thought of as a fat client, in the sense that each Hive CLI, Thrift server, or web interface instance is completely independent of the other instances. Because of this independence, locking must be coordinated by a separate system.

Locking Support in Hive with ZooKeeper

Hive includes a locking feature built on Apache ZooKeeper, which implements highly reliable distributed coordination. Other than some additional setup and configuration steps, ZooKeeper is invisible to Hive users.

To set up ZooKeeper, designate one or more servers to run its server processes. Three ZooKeeper nodes are a typical minimum, enough to form a quorum and to provide sufficient redundancy.
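The arithmetic behind the three-node minimum can be sketched as follows, assuming the usual majority-quorum rule in which more than half of the nodes must be available. The function names are hypothetical and chosen for illustration.

```python
def quorum_size(ensemble_size):
    """Smallest number of nodes that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("an ensemble needs at least one node")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many nodes can fail while a majority still remains."""
    return ensemble_size - quorum_size(ensemble_size)


# Compare a few ensemble sizes.
for n in (1, 2, 3, 5):
    print(n, "nodes: quorum", quorum_size(n), "tolerates", tolerated_failures(n))
```

Under this rule, one or two nodes tolerate no failures at all, while three nodes tolerate the loss of one, which is why three is the smallest size that adds redundancy.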

For our next example, we will use three nodes: zk1.site.pvt, zk2.site.pvt ...
