The first step is loading the data. For multiple files, the SparkContext method wholeTextFiles provides the functionality we need: it reads each file as a single record and returns it as a key-value pair, where the key is the file's location and the value is the file's content. We can reference the input files directly via the wildcard pattern data/subject*. This is useful when loading files from a local filesystem, and especially important when loading files from HDFS.
val path = s"${sys.env.get("DATADIR").getOrElse("data")}/subject*"
val dataFiles = sc.wholeTextFiles(path)
println(s"Number of input files: ${dataFiles.count}")
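To make the key-value structure concrete, here is a minimal sketch that recovers bare file names from the keys (the keys returned by wholeTextFiles are full URIs; the split on "/" assumes POSIX-style or HDFS paths, and the snippet itself is illustrative rather than part of the pipeline):

```scala
// Illustrative sketch: each record from wholeTextFiles is (file URI, full file content).
// For example, a key might look like "file:/data/subject1.txt".
val fileNames = dataFiles.map { case (path, _) => path.split("/").last }
fileNames.collect().foreach(println)
```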
Since the names are not part of the input data, we define a variable ...