Chapter 15. Customizing Hive File and Record Formats

Hive functionality can be customized in several ways. First, there are the variables and properties that we discussed in Variables and Properties. Second, you may extend Hive using custom user-defined functions (UDFs), which were discussed in Chapter 13. Finally, you can customize the file and record formats, which we discuss now.

File Versus Record Formats

Hive draws a clear distinction between the file format (how records are encoded in a file) and the record format (how the stream of bytes for a given record is encoded in the record).

In this book we have been using text files, with the default STORED AS TEXTFILE in CREATE TABLE statements (see Text File Encoding of Data Values), where each line in the file is a record. Most of the time those records have used the default separators, with occasional examples of data that use commas or tabs as field separators. However, a text file could contain JSON or XML “documents.”
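To make those defaults concrete, here is a minimal sketch of a table declaration that spells them out rather than relying on them. The table and column names are hypothetical and chosen only for illustration. The separator values are the standard Hive defaults for text files (Ctrl-A, Ctrl-B, and Ctrl-C, written in octal).

```sql
-- Hypothetical table; the names are illustrative only.
CREATE TABLE employees_text (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'            -- Ctrl-A between fields
  COLLECTION ITEMS TERMINATED BY '\002'  -- Ctrl-B between array or map entries
  MAP KEYS TERMINATED BY '\003'          -- Ctrl-C between a map key and its value
  LINES TERMINATED BY '\n'               -- one record per line
STORED AS TEXTFILE;
```

Omitting the ROW FORMAT and STORED AS clauses should produce an equivalent table, because these are the values Hive assumes when nothing is specified.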

For Hive, the file format choice is orthogonal to the record format. We’ll first discuss options for file formats, then we’ll discuss different record formats and how to use them in Hive.
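As a sketch of what this orthogonality looks like in practice, the two declarations below keep the same record format (comma-delimited fields) while changing only the file format. The table and column names are hypothetical.

```sql
-- Hypothetical tables: same record format, different file formats.
CREATE TABLE stocks_text (
  symbol      STRING,
  price_close FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;       -- records are lines in a plain text file

CREATE TABLE stocks_seq (
  symbol      STRING,
  price_close FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS SEQUENCEFILE;   -- records are values in a Hadoop sequence file
```

Only the STORED AS clause differs. The ROW FORMAT clause, which governs how each record is split into fields, is identical in both.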

Demystifying CREATE TABLE Statements

Throughout the book we have shown examples of creating tables. You may have noticed that CREATE TABLE accepts a variety of clauses, such as STORED AS SEQUENCEFILE, ROW FORMAT DELIMITED, SERDE, INPUTFORMAT, and OUTPUTFORMAT. This chapter will cover much of this syntax and give examples, ...
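As a rough illustration of how these keywords fit together, the following sketch writes out a plain text table in its long form, naming the SerDe and the input and output format classes directly. The table and column names are hypothetical. The class names are the ones stock Hive distributions use for text files, but treat the exact equivalence to STORED AS TEXTFILE as an assumption to verify against your Hive version.

```sql
-- Hypothetical table; long-form spelling of a default text table.
CREATE TABLE messages (
  sender STRING,
  body   STRING
)
-- Record format: the SerDe that parses each record into columns.
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
-- File format: how records are read from and written to files.
STORED AS
  INPUTFORMAT  'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
```

Seen this way, the SERDE clause corresponds to the record format and the INPUTFORMAT and OUTPUTFORMAT clauses to the file format.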
