Chapter 17. Storage Handlers and NoSQL
Storage handlers are a combination of
InputFormat, OutputFormat,
SerDe, and specific code that Hive uses to treat
an external entity as a standard Hive table. This allows the user to issue
queries seamlessly, whether the table represents a text file stored in Hadoop
or a column family stored in a NoSQL database such as Apache
HBase, Apache Cassandra, or
Amazon DynamoDB. Storage handlers are not limited
to NoSQL databases; a storage handler could be designed for many different
kinds of data stores.
A specific storage handler may only implement some of the capabilities. For example, a given storage handler may allow read-only access or impose some other restriction.
Storage handlers offer a streamlined system for ETL. For example, a Hive query could be run that selects from a table backed by sequence files; however, it could output
Storage Handler Background
Hadoop has an abstraction known as
InputFormat that allows data from different
sources and formats to be used as input for a job. The
TextInputFormat is a concrete implementation of
InputFormat. It works by providing
Hadoop with information on how to split a given path into multiple tasks,
and it provides a RecordReader that
provides methods for reading data from each split.
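The two responsibilities described above, computing splits and reading records from each split, can be illustrated with a small toy model. This is a sketch in Python, not Hadoop's Java API: the names Split, get_splits, and read_records are hypothetical, and an in-memory byte string stands in for a file path. It assumes newline-terminated text and the convention that a line belongs to the split in which its first byte falls, which is one common way to keep lines that straddle a split boundary from being read twice or dropped.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class Split:
    """A byte range of the input handled by one task (hypothetical type)."""
    start: int
    length: int


def get_splits(data: bytes, split_size: int) -> List[Split]:
    """Divide the input into fixed-size byte ranges, ignoring line boundaries."""
    return [Split(off, min(split_size, len(data) - off))
            for off in range(0, len(data), split_size)]


def read_records(data: bytes, split: Split) -> Iterator[Tuple[int, str]]:
    """Yield (byte offset, line) for every line that starts inside the split."""
    end = split.start + split.length
    pos = split.start
    if pos != 0:
        # Search from one byte before the split so that a line beginning
        # exactly at split.start is kept; a partial first line is skipped
        # because the previous split's reader finishes it.
        newline = data.find(b"\n", pos - 1)
        pos = len(data) if newline == -1 else newline + 1
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        # The last line of a split may extend past `end`; read it in full.
        yield pos, data[pos:stop].decode("utf-8")
        pos = stop + 1


data = b"alpha\nbeta\ngamma\ndelta\n"
splits = get_splits(data, 8)
records = [rec for s in splits for rec in read_records(data, s)]
```

Although the 8-byte splits cut through the middle of lines, reading every split in turn yields each line exactly once, which is the property that lets separate tasks process separate splits independently.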
Hadoop also has an abstraction known as
OutputFormat, which takes the output from a job
and outputs it to an entity. The
TextOutputFormat is a concrete implementation of
OutputFormat. It works by persisting ...