Companion tools

Now that we have established a technology stack, let's describe each of the components and explain why they are useful in a Spark environment. This part of the book is designed as a reference rather than a straight read. If you're already familiar with most of the technologies, you can skim it as a refresher and move on to the next chapter, Chapter 2, Data Acquisition.

Apache HDFS

The Hadoop Distributed File System (HDFS) is a distributed filesystem with built-in redundancy. By default it is designed to run on three or more nodes (although a single node works fine, and the replication factor can be raised or lowered), which provides the ability to store data in replicated blocks. So not only is a file split into a number of blocks, but three copies of each block are stored on different nodes, allowing the cluster to tolerate the loss of a node or disk without losing data.
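To make the replication behavior concrete, here is a minimal Scala sketch using the Hadoop FileSystem API to inspect and change a file's replication factor. It assumes an HDFS cluster reachable through the default configuration (core-site.xml and hdfs-site.xml on the classpath); the /data/example.txt path is hypothetical.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsReplicationCheck {
  def main(args: Array[String]): Unit = {
    // Connect to HDFS using whatever configuration is on the classpath.
    val conf = new Configuration()
    val fs = FileSystem.get(conf)

    // Hypothetical file path, used here purely for illustration.
    val path = new Path("/data/example.txt")

    // Report how many replicas HDFS keeps of each block of this file
    // (3 by default on a typical cluster).
    val status = fs.getFileStatus(path)
    println(s"Replication factor for $path: ${status.getReplication}")

    // Raise the replication factor for a file you consider critical.
    fs.setReplication(path, 5.toShort)
  }
}

Changing the replication factor per file, as above, is a trade-off: more replicas mean better fault tolerance and read locality at the cost of extra disk usage across the cluster.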
