Chapter 2. HDFS

Goals and Motivation

The first half of Apache Hadoop is a filesystem called the Hadoop Distributed Filesystem or simply HDFS. HDFS was built to support high throughput, streaming reads and writes of extremely large files. Traditional large storage area networks (SANs) and network attached storage (NAS) offer centralized, low-latency access to either a block device or a filesystem on the order of terabytes in size. These systems are fantastic as the backing store for relational databases, content delivery systems, and similar types of data storage needs because they can support full-featured POSIX semantics, scale to meet the size requirements of these systems, and offer low-latency access to data. Imagine for a second, though, hundreds or thousands of machines all waking up at the same time and pulling hundreds of terabytes of data from a centralized storage system at once. This is where traditional storage doesn’t necessarily scale.

By creating a system composed of independent machines, each with its own I/O subsystem, disks, RAM, network interfaces, and CPUs, and relaxing (and sometimes removing) some of the POSIX requirements, it is possible to build a system optimized, in both performance and cost, for the specific type of workload we’re interested in. There are a number of specific goals for HDFS:

  • Store millions of large files, each greater than tens of gigabytes, and filesystem sizes reaching tens of petabytes.

  • Use a scale-out model based on inexpensive commodity ...

Get Hadoop Operations now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.