Chapter 5. Data storage on the batch layer: Illustration

This chapter covers

  • Using the Hadoop Distributed File System (HDFS)
  • Pail, a higher-level abstraction for manipulating datasets

In the last chapter you saw the requirements for storing a master dataset and why a distributed filesystem is a great fit for those requirements. But you also saw that using a filesystem API directly is too low-level for the kinds of operations you need to perform on the master dataset. In this chapter we’ll show you how to use a specific distributed filesystem, HDFS, and then how to automate those operations with a higher-level API.

Like all illustration chapters, we’ll focus on specific tools to show the nitty-gritty of applying the higher-level ...
