January 2019
Beginner to intermediate
154 pages
4h 31m
English
Caching data in memory is one of the main features of Spark. You can cache large datasets in memory or on disk, depending on your cluster hardware. You can choose to cache your data in two scenarios:
If you want to run multiple actions on an RDD, it is a good idea to cache it in memory so that recomputation of this RDD can be avoided. For example, the following code first takes a few elements from the RDD and then returns the count of its elements:
//Scala
val baseRDD = spark.sparkContext.parallelize(1 to 10)
baseRDD.take(2)
baseRDD.count()
The following code makes use of cache() to make ...
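As a rough illustration of the idea, and not the book's own listing, a minimal self-contained sketch of caching an RDD before running two actions on it might look like the following. The object name CacheSketch, the method name run, the application name, and the local[*] master are assumptions made only so the example can run on its own.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object; the name is an assumption for this sketch.
object CacheSketch {
  def run(): (Array[Int], Long) = {
    // Local session so the sketch is self-contained (assumed settings).
    val spark = SparkSession.builder()
      .appName("cache-sketch")
      .master("local[*]")
      .getOrCreate()

    val baseRDD = spark.sparkContext.parallelize(1 to 10)

    // cache() is lazy: it only marks the RDD for in-memory storage.
    // Partitions are stored the first time an action computes them.
    baseRDD.cache()

    val firstTwo = baseRDD.take(2) // first action
    val total = baseRDD.count()    // second action, can reuse stored partitions

    // Release the cached partitions once they are no longer needed.
    baseRDD.unpersist()
    spark.stop()
    (firstTwo, total)
  }

  def main(args: Array[String]): Unit = {
    val (firstTwo, total) = run()
    println(firstTwo.mkString(", ")) // prints "1, 2"
    println(total)                   // prints "10"
  }
}
```

For an RDD this small the benefit is negligible; caching tends to pay off when the RDD is expensive to compute and is reused across several actions. If the data may not fit in memory, persist(StorageLevel.MEMORY_AND_DISK) from org.apache.spark.storage.StorageLevel is one option for letting partitions spill to disk.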