Appendix C. HDFS dissected

If you’re using Hadoop you should have a solid understanding of HDFS so that you can make smart decisions about how to manage your data. In this appendix we’ll walk through how HDFS reads and writes files to help you better understand how HDSF works behind the scenes.

C.1. What is HDFS?

HDFS is a distributed filesystem modeled on the Google File System (GFS), details of which were published in a 2003 paper.[1] Google’s paper highlighted a number of key architectural and design properties, the most interesting of which included optimizations to reduce network input/output (I/O), how data replication should occur, and overall system availability and scalability. Not many details about GFS are known beyond those published ...

Get Hadoop in Practice now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.