Loading data from HDFS
HDFS is one of the most widely used big data storage systems. One reason for its wide adoption is schema-on-read: HDFS places no restrictions on data at write time. Any kind of data is welcome and can be stored in its raw format, which makes HDFS well suited to storing raw unstructured and semi-structured data.
When it comes to reading data, even unstructured data needs to be given some structure to make sense. Hadoop uses InputFormat to determine how to read the data. Spark provides complete support for Hadoop's InputFormat, so anything that can be read by Hadoop can be read by Spark as well.
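To make the idea of schema-on-read concrete, here is a minimal sketch in plain Python. It uses no Hadoop or Spark API. The file name, the sample records, and the two reader functions (read_as_lines and read_as_users) are hypothetical names invented for illustration. The write step stores raw bytes without checking them, and each reader then imposes its own structure on those same bytes, which is roughly the role an InputFormat plays at read time.

```python
import os
import tempfile

# Write side: raw bytes are stored as-is; nothing validates them against a schema.
raw = b"1,alice,30\n2,bob,not-a-number\nfree-form note\n"
path = os.path.join(tempfile.mkdtemp(), "events.raw")  # hypothetical file name
with open(path, "wb") as f:
    f.write(raw)


def read_as_lines(path):
    """Hypothetical reader: yield (byte_offset, line_text) pairs."""
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            yield offset, line.rstrip(b"\n").decode("utf-8")
            offset += len(line)


def read_as_users(path):
    """Hypothetical reader: keep only lines shaped like id,name,age."""
    for _, line in read_as_lines(path):
        parts = line.split(",")
        if len(parts) == 3 and parts[0].isdigit() and parts[2].isdigit():
            yield {"id": int(parts[0]), "name": parts[1], "age": int(parts[2])}


# Read side: the same stored bytes, interpreted in two different ways.
lines = list(read_as_lines(path))
users = list(read_as_users(path))
print(lines)
print(users)
```

Both readers work on the same stored bytes. Nothing was rejected at write time, and the second reader simply skips the lines that do not fit the structure it expects.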
TextInputFormat takes the ...