Hive functionality can be customized in several ways. First, there are the variables and properties that we discussed in Variables and Properties. Second, you may extend Hive using custom UDFs, or user-defined functions, which were discussed in Chapter 13. Finally, you can customize the file and record formats, which we discuss now.
Hive draws a clear distinction between the file format, which is how records are encoded in a file, and the record format, which is how the stream of bytes for a given record is encoded within that record.
In this book we have been using text files, with the default STORED AS TEXTFILE in CREATE TABLE statements (see Text File Encoding of Data Values), where each line in the file is a record. Most of the time those records have used the default separators, with occasional examples of data that use commas or tabs as field separators. However, a text file could contain JSON or XML documents.
For Hive, the file format choice is orthogonal to the record format. We’ll first discuss options for file formats, then we’ll discuss different record formats and how to use them in Hive.
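To make the distinction concrete, here is a minimal sketch in HiveQL. The table and column names (employees_text, employees_seq, name, salary) are hypothetical and chosen only for illustration. Both tables declare the same record format in the ROW FORMAT clause, while the STORED AS clause selects a different file format for each.

```sql
-- Hypothetical tables; names and columns are illustrative assumptions.

-- Record format: comma-delimited fields. File format: plain text.
CREATE TABLE employees_text (
  name   STRING,
  salary FLOAT
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','   -- how the bytes of one record are split into fields
STORED AS TEXTFILE;          -- how records are laid out in the file

-- Same record format, different file format.
CREATE TABLE employees_seq (
  name   STRING,
  salary FLOAT
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS SEQUENCEFILE;
```

Because the two clauses are independent, changing one does not require changing the other.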
Throughout the book we have shown examples of creating tables. You may have noticed that CREATE TABLE supports a variety of syntax. Examples of this syntax include OUTPUTFORMAT. This chapter will cover much of this syntax and give examples, ...