July 2017
Intermediate to advanced
796 pages
18h 55m
English
Caching enables Spark to persist data across computations and operations. In fact, it is one of the most important techniques in Spark for speeding up computations, particularly iterative ones that reuse the same data.
Caching works by keeping as much of the RDD in memory as possible. If there is not enough memory, data already in storage is evicted according to an LRU (least recently used) policy. If the data to be cached is larger than the available memory, performance drops because the partitions that do not fit must be read from disk, or recomputed, depending on the storage level, instead of being served from memory.
You can mark an RDD as cached using either persist() or cache().
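As a minimal sketch of marking an RDD as cached and reusing it, assuming a local Spark installation: the application name, the local[*] master, and the sample data below are illustrative choices, not taken from the text.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheSketch {
  def main(args: Array[String]): Unit = {
    // App name and local master are assumptions made for this example.
    val conf = new SparkConf().setAppName("cache-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // A derived RDD that would otherwise be recomputed by every action.
    val squares = sc.parallelize(1 to 1000000).map(x => x.toLong * x)

    // cache() only marks the RDD; nothing is stored until an action runs.
    squares.cache()

    val total = squares.reduce(_ + _) // first action computes and stores the partitions
    val count = squares.count()       // second action can reuse the stored partitions

    println(s"total=$total count=$count")
    sc.stop()
  }
}
```

Because marking is lazy, the benefit shows up only from the second action onward, which is why caching pays off mainly when the same RDD is used repeatedly.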
persist() can use memory, disk, or both:
persist(newLevel: StorageLevel)
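A call with an explicit storage level might look like the following sketch. It assumes a local Spark installation; the application name, the local[*] master, the sample data, and the choice of StorageLevel.MEMORY_AND_DISK are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  def main(args: Array[String]): Unit = {
    // App name and local master are assumptions made for this example.
    val conf = new SparkConf().setAppName("persist-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val pairs = sc.parallelize(Seq("spark", "rdd", "cache", "spark")).map(w => (w, 1))

    // Keep partitions in memory and spill those that do not fit to disk.
    pairs.persist(StorageLevel.MEMORY_AND_DISK)

    // Two actions over the same persisted RDD.
    val counts = pairs.reduceByKey(_ + _).collect()
    val n = pairs.count()

    counts.foreach { case (word, c) => println(s"$word: $c") }
    println(s"records=$n level=${pairs.getStorageLevel}")

    // Release the stored partitions once the RDD is no longer needed.
    pairs.unpersist()
    sc.stop()
  }
}
```

Calling unpersist() explicitly frees the storage right away instead of waiting for LRU eviction.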
The following are the possible ...