Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is a fault-tolerant distributed filesystem designed to run on low-cost hardware and to handle very large datasets (on the order of hundreds of petabytes to exabytes). Although HDFS relies on a fast network connection to transfer data across nodes, its latency cannot be as low as that of classic filesystems (it may be on the order of seconds); HDFS has therefore been designed for batch processing and high throughput rather than low-latency access. Each HDFS node stores a portion of the filesystem's data, and the same data is replicated on other nodes; this replication provides both high-throughput access and fault tolerance.
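To make this concrete, the following is a minimal sketch of how a Python program might write and read a file on HDFS through the WebHDFS REST interface. It assumes the third-party hdfs package (installable with pip install hdfs) and a NameNode whose WebHDFS endpoint is reachable at http://localhost:9870 with user hadoop; the address, user, and paths are illustrative assumptions and will differ on your cluster.

from hdfs import InsecureClient

# Connect to the NameNode's WebHDFS endpoint (address and user are assumptions).
client = InsecureClient('http://localhost:9870', user='hadoop')

# Write a small text file; HDFS splits large files into blocks and
# replicates each block on several nodes.
client.write('/tmp/example.txt', data='hello from HDFS\n',
             overwrite=True, encoding='utf-8')

# List the directory and read the file back.
print(client.list('/tmp'))
with client.read('/tmp/example.txt', encoding='utf-8') as reader:
    print(reader.read())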

The HDFS architecture is master-slave: the master node (called the NameNode) manages the filesystem's metadata and keeps track of where each data block is stored, while the slave nodes (called DataNodes) hold the actual data blocks.
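As a small illustration of the master's role, the same hdfs client can ask the NameNode for a file's metadata: the status() call below returns, among other fields, the block size and the replication factor, without touching the DataNodes that hold the actual bytes (the endpoint, user, and path are again assumptions carried over from the previous sketch).

from hdfs import InsecureClient

client = InsecureClient('http://localhost:9870', user='hadoop')

# The NameNode answers metadata queries; DataNodes serve the block contents.
info = client.status('/tmp/example.txt')
print(info['blockSize'], info['replication'], info['length'])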
