Loading data from HDFS

HDFS is the most widely used big data storage system. One of the reasons for the wide adoption of HDFS is schema-on-read. What this means is that HDFS does not put any restriction on data when data is being written. Any and all kinds of data are welcome and can be stored in a raw format. This feature makes it ideal storage for raw unstructured data and semi-structured data.

When it comes to reading data, even unstructured data needs to be given some structure to make sense. Hadoop uses InputFormat to determine how to read the data. Spark provides complete support for Hadoop's InputFormat so anything that can be read by Hadoop can be read by Spark as well.

The default InputFormat is TextInputFormat. TextInputFormat takes the ...

Get Spark Cookbook now with O’Reilly online learning.

O’Reilly members experience live online training, plus books, videos, and digital content from 200+ publishers.