Chapter 2

Storing Data in Hadoop


  • Getting to know the Hadoop Distributed File System (HDFS)
  • Understanding HBase
  • Choosing the most appropriate data storage for your applications


The code downloads for this chapter are found on the Download Code tab. The code is in the Chapter 2 download, and the individual files are named according to the names used throughout the chapter.

The foundation of efficient data processing in Hadoop is its data storage model. This chapter examines the different options for storing data in Hadoop — specifically, in the Hadoop Distributed File System (HDFS) and HBase — explores the benefits and drawbacks of each, and outlines a decision tree for picking the best option for a given problem. You also learn about Apache Avro, a Hadoop framework for data serialization that can be tightly integrated with Hadoop-based storage. Finally, this chapter covers the different data access models that can be implemented on top of Hadoop storage.


HDFS is Hadoop’s implementation of a distributed filesystem. It is designed to hold a very large amount of data and to provide many clients, distributed across a network, with access to that data. To successfully leverage HDFS, you first must understand how it is implemented and how it works.
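As a first taste of working with HDFS before digging into its architecture, the following sketch shows a typical command-line session. It assumes a running Hadoop cluster with the `hdfs` client on the PATH; the file and directory names (`sample.txt`, `/user/hadoop/data`) are illustrative only. (This fragment cannot run without a cluster, so treat it as a usage sketch rather than a tested script.)

```shell
# Create a directory in HDFS (assumed example path)
hdfs dfs -mkdir -p /user/hadoop/data

# Copy a local file into HDFS; HDFS splits it into blocks
# and replicates each block across DataNodes
hdfs dfs -put sample.txt /user/hadoop/data/

# List the directory and read the file back
hdfs dfs -ls /user/hadoop/data
hdfs dfs -cat /user/hadoop/data/sample.txt
```

Note that these commands mirror familiar POSIX filesystem operations, but behind the scenes the NameNode tracks the file-to-block mapping while DataNodes store the actual block replicas.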

HDFS Architecture

The HDFS design is based on the design of the Google File System (GFS). Its implementation addresses a ...
