Chapter 17. Storage Handlers and NoSQL

Storage handlers are a combination of InputFormat, OutputFormat, SerDe, and specific code that Hive uses to treat an external entity as a standard Hive table. This allows the user to issue queries seamlessly whether the table represents a text file stored in Hadoop or a column family stored in a NoSQL database such as Apache HBase, Apache Cassandra, or Amazon DynamoDB. Storage handlers are not limited to NoSQL databases; a storage handler could be designed for many different kinds of data stores.

Note

A specific storage handler may implement only some of these capabilities. For example, a given storage handler may allow read-only access or impose some other restriction.

Storage handlers offer a streamlined system for ETL. For example, a Hive query could select from a table that is backed by sequence files and write its output to text files.

Storage Handler Background

Hadoop has an abstraction known as InputFormat that allows data from different sources and formats to be used as input for a job. The TextInputFormat is a concrete implementation of InputFormat. It works by telling Hadoop how to divide the data under a given path into splits, each processed by a separate task, and it supplies a RecordReader with methods for reading data from each split.
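The two responsibilities described above, dividing the input into splits and reading the records in each split, can be illustrated with a small self-contained sketch. It uses only the JDK: SplitSketch, Split, computeSplits, and readRecords are hypothetical names invented for this illustration and are not the Hadoop classes. The boundary rule shown (a reader skips the partial first line of its split and finishes the line that crosses its end) is an assumption modeled on how line-oriented readers commonly behave, not a copy of Hadoop's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for illustration only; these are NOT the Hadoop classes.
final class SplitSketch {

    /** A contiguous range [start, end) of the input, analogous to one input split. */
    static final class Split {
        final int start;
        final int end;

        Split(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Role of the input format: divide the input into fixed-size ranges, one per task. */
    static List<Split> computeSplits(int length, int splitSize) {
        List<Split> splits = new ArrayList<>();
        for (int start = 0; start < length; start += splitSize) {
            splits.add(new Split(start, Math.min(start + splitSize, length)));
        }
        return splits;
    }

    /** Role of the record reader: return the lines that belong to one split. */
    static List<String> readRecords(String data, Split split) {
        List<String> records = new ArrayList<>();
        int pos = split.start;
        if (pos != 0) {
            // Skip the first (possibly partial) line; the previous split reads it.
            int nl = data.indexOf('\n', pos);
            pos = (nl < 0) ? data.length() : nl + 1;
        }
        // Read every line that starts at or before the split's end,
        // even if the line itself runs past the boundary.
        while (pos <= split.end && pos < data.length()) {
            int nl = data.indexOf('\n', pos);
            int lineEnd = (nl < 0) ? data.length() : nl;
            records.add(data.substring(pos, lineEnd));
            pos = lineEnd + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "alpha\nbravo\ncharlie\ndelta\n";
        for (Split s : computeSplits(data.length(), 8)) {
            System.out.println("[" + s.start + "," + s.end + ") -> " + readRecords(data, s));
        }
    }
}
```

With the 26-character sample input and a split size of 8, the sketch produces four splits. The first yields alpha and bravo, the second charlie, the third delta, and the last is empty. Every line is read exactly once even though the split boundaries fall in the middle of lines.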

Hadoop also has an abstraction known as OutputFormat, which takes the output of a job and writes it to an entity. The TextOutputFormat is a concrete implementation of OutputFormat. It works ...
