July 2017
Intermediate to advanced
796 pages
18h 55m
English
HadoopRDD provides the core functionality for reading data stored in HDFS using the older MapReduce API (org.apache.hadoop.mapred) from the Hadoop 1.x libraries. It is the default RDD used when loading data from any file system into an RDD, and it sits at the base of the resulting lineage:
class HadoopRDD[K, V] extends RDD[(K, V)]
When loading the state population records from the CSV file, the underlying base RDD is actually a HadoopRDD, as the following code snippet shows:
scala> val statesPopulationRDD = sc.textFile("statesPopulation.csv")
statesPopulationRDD: org.apache.spark.rdd.RDD[String] = statesPopulation.csv MapPartitionsRDD[93] at textFile at <console>:25

scala> statesPopulationRDD.toDebugString
res110: String =
(2) statesPopulation.csv MapPartitionsRDD[93] at textFile at <console>:25 []
 | statesPopulation.csv ...
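The base RDD can also be found programmatically instead of being read off the debug string. The following is a minimal sketch meant to be pasted into spark-shell, where `sc` is already defined. The helper `baseRDD` is a hypothetical name introduced here for illustration, not part of the Spark API. It follows the first parent dependency of each RDD until it reaches one with no parents. It assumes a linear lineage such as the one `textFile` produces.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper (not part of Spark): follow the first parent
// dependency until reaching an RDD that has no parents.
def baseRDD(rdd: RDD[_]): RDD[_] =
  rdd.dependencies.headOption match {
    case Some(dep) => baseRDD(dep.rdd) // step one level up the lineage
    case None      => rdd              // no parents: this is the base RDD
  }

// `sc` is the SparkContext that spark-shell provides.
val statesPopulationRDD = sc.textFile("statesPopulation.csv")

// Fully qualified class name of the RDD at the root of the lineage.
val baseClassName = baseRDD(statesPopulationRDD).getClass.getName
println(baseClassName)
```

For an RDD created with `textFile`, the printed class name should end in HadoopRDD, consistent with the description above. Walking the dependencies does not read the file, since only the lineage metadata is inspected.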