Using Tachyon as an off-heap storage layer

Spark RDDs are a great way to store datasets in memory, but you can end up with multiple copies of the same data in different applications. Tachyon addresses some of the challenges of Spark RDD management. A few of them are:

  • An RDD exists only for the duration of the Spark application
  • The same process performs both the computation and the in-memory RDD storage; so, if the process crashes, the in-memory data is lost as well
  • Different jobs cannot share an RDD, even when they read the same underlying data (for example, the same HDFS block). This leads to:
    • Slow writes to disk
    • Duplication of data in memory, higher memory footprint
  • If the output of one application needs to be shared with the other application, it's slow due to the replication in ...
