Now that we have established a technology stack to use, let's describe each of the components and explain why they are useful in a Spark environment. This part of the book is designed as a reference rather than a straight read. If you're familiar with most of the technologies, then you can refresh your knowledge and continue to the next section, Chapter 2, Data Acquisition.
The Hadoop Distributed File System (HDFS) is a distributed filesystem with built-in redundancy. It is optimized to work on three or more nodes by default (although one will work fine and the limit can be increased), which provides the ability to store data in replicated blocks. So not only is a file split into a number of blocks but three copies of ...