July 2017
Intermediate to advanced
796 pages
18h 55m
English
A second method for creating an RDD is to read data from an external distributed source such as Amazon S3, Cassandra, or HDFS. For example, if you create an RDD from HDFS, the distributed blocks in HDFS are read by the individual nodes in the Spark cluster.
Each node in the Spark cluster performs its own input/output operations, independently reading one or more of the HDFS blocks. In general, Spark makes a best effort to keep as much of the RDD as possible in memory. Data can also be cached to reduce input/output operations, so that nodes in the Spark cluster avoid repeated reads, say from the HDFS ...
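
To make the effect of caching concrete, here is a minimal toy model in plain Python. It is not Spark code: `BlockStore` and `ToyRDD` are hypothetical names invented for this sketch, and the store is an in-memory list standing in for HDFS blocks. In Spark itself, the corresponding calls would typically be `SparkContext.textFile` to create the RDD and `RDD.cache()` to mark it for caching. The sketch only counts how many block reads reach the external store with and without caching.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BlockStore:
    """Hypothetical stand-in for an external distributed store such as HDFS."""
    blocks: List[List[str]]
    reads: int = 0  # number of block reads served so far

    def read_block(self, index: int) -> List[str]:
        self.reads += 1
        return list(self.blocks[index])


@dataclass
class ToyRDD:
    """Toy dataset with one partition per stored block (not a Spark class)."""
    store: BlockStore
    cached: bool = False
    _memory: Dict[int, List[str]] = field(default_factory=dict)

    def cache(self) -> "ToyRDD":
        # Only marks the data for caching; nothing is read until an action runs.
        self.cached = True
        return self

    def _partition(self, index: int) -> List[str]:
        # Serve the partition from memory if an earlier action stored it there.
        if index in self._memory:
            return self._memory[index]
        data = self.store.read_block(index)
        if self.cached:
            self._memory[index] = data
        return data

    def count(self) -> int:
        # Each partition is fetched on its own, as a separate node would do.
        return sum(len(self._partition(i)) for i in range(len(self.store.blocks)))


blocks = [["a", "b"], ["c"], ["d", "e", "f"]]

# Without caching, every action goes back to the external store.
uncached_store = BlockStore(blocks)
uncached = ToyRDD(uncached_store)
uncached.count()
uncached.count()
print(uncached_store.reads)  # 6: three blocks read on each of two actions

# With caching, the second action is served from memory.
cached_store = BlockStore(blocks)
cached = ToyRDD(cached_store).cache()
cached.count()
cached.count()
print(cached_store.reads)  # 3: each block read once
```

The model ignores everything that makes a real cluster interesting (memory limits, eviction, node failures), so treat it purely as an illustration of why a cached dataset triggers fewer reads from the source.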