Chapter 14. Streaming
Hive works by leveraging and extending the components of Hadoop: common abstractions such as InputFormat, OutputFormat, Mapper, and Reducer, plus its own abstractions, like SerializerDeserializer (SerDe), User-Defined Functions (UDFs), and StorageHandlers.
These are all Java components, but Hive hides the complexity of implementing and using them by letting the user work with SQL abstractions rather than Java code.
Streaming offers an alternative way to transform data. During a streaming job, the Hadoop Streaming API opens an I/O pipe to an external process. Data is passed to the process, which reads it from standard input, operates on it, and writes the results to standard output, where the Streaming API job picks them up. Hive does not use the Hadoop Streaming API directly, but it works in a very similar way.
This pipeline computing model is familiar to users of Unix operating systems and their descendants, like Linux and Mac OS X.
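To make the pipe model concrete, here is a minimal sketch of the kind of external process a streaming job could launch. It assumes each row arrives as one line with tab-separated columns; the function names (transform_row, run) and the sample data are hypothetical and chosen only for illustration. The pipe is simulated with in-memory streams so the sketch runs on its own.

```python
import io
import sys


def transform_row(line):
    """Turn one tab-separated input row into one tab-separated output row."""
    columns = line.rstrip("\n").split("\t")
    name = columns[0]
    # Hypothetical transformation: emit the name in upper case plus its length.
    return "\t".join([name.upper(), str(len(name))])


def run(source, sink):
    """Read rows from source until end of input and write results to sink."""
    for line in source:
        sink.write(transform_row(line) + "\n")


# Simulate the pipe with in-memory streams (an assumption for this sketch).
# As a real external process, the script would call run(sys.stdin, sys.stdout).
simulated_input = io.StringIO("alice\t30\nbob\t25\n")
simulated_output = io.StringIO()
run(simulated_input, simulated_output)
sys.stdout.write(simulated_output.getvalue())
```

Because the script only reads lines and writes lines, the same logic could be tested from a Unix shell by piping a text file into it, which is what makes this model convenient for quick prototyping.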
Streaming is usually less efficient than coding the comparable UDFs or InputFormat objects. Serializing and deserializing data to pass it in and out of the pipe is relatively inefficient. It is also harder to debug the whole program in a unified manner. However, streaming is useful for fast prototyping and for leveraging existing code that is not written in Java. For Hive users who don’t want to write Java code, it can be a very effective approach.
Hive provides several clauses to use streaming: ...