Chapter 15. Customizing Hive File and Record Formats
Hive functionality can be customized in several ways. First, there are the variables and properties that we discussed in Variables and Properties. Second, you may extend Hive using custom UDFs, or user-defined functions, which were discussed in Chapter 13. Finally, you can customize the file and record formats, which we discuss now.
File Versus Record Formats
Hive draws a clear distinction between the file format (how records are encoded in a file) and the record format (how the stream of bytes for a given record is encoded in the record).
In this book we have been using text files, with the default STORED AS TEXTFILE in CREATE TABLE statements (see Text File Encoding of Data Values), where each line in the file is a record. Most of the time those records have used the default separators, with occasional examples of data that use commas or tabs as field separators. However, a text file could contain JSON or XML “documents.”
For Hive, the file format choice is orthogonal to the record format. We’ll first discuss options for file formats, then we’ll discuss different record formats and how to use them in Hive.
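To make the distinction concrete, here is a minimal sketch in HiveQL. The table and column names (employees_text, employees_seq, name, salary) are hypothetical and chosen only for illustration. Both tables declare the same record format through ROW FORMAT DELIMITED, while the STORED AS clause selects a different file format for each:

```sql
-- Hypothetical tables for illustration only.
-- Same record format (comma-delimited fields), two different file formats.

-- File format: plain text, one record per line.
CREATE TABLE employees_text (
  name   STRING,
  salary FLOAT
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- File format: Hadoop SequenceFile; the record format is unchanged.
CREATE TABLE employees_seq (
  name   STRING,
  salary FLOAT
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS SEQUENCEFILE;
```

Changing only the STORED AS clause leaves the column definitions and the field delimiter untouched, which is one way to see that the two choices can be made independently.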
Demystifying CREATE TABLE Statements
Throughout the book we have shown examples of creating tables. You may have noticed that CREATE TABLE accepts a variety of optional clauses. Examples include STORED AS SEQUENCEFILE, ROW FORMAT DELIMITED, SERDE, INPUTFORMAT, and OUTPUTFORMAT. This chapter will cover much of this syntax and give examples, ...