16. Cache and checkpoint: Enhancing Spark’s performance

This chapter covers

  • Caching and checkpointing to enhance Spark’s performance
  • Choosing the right method to enhance performance
  • Collecting performance information
  • Picking the right spot to use a cache or checkpoint
  • Using collect() and collectAsList() wisely

Spark is fast. It processes data easily across multiple nodes in a cluster or on your laptop. Spark also loves memory, and its heavy use of memory is a key part of its design for performance. However, as your datasets grow from the samples you use during development to production-size datasets, you may find that performance degrades.

In this chapter, you’ll get some foundational knowledge about how Spark uses memory. This knowledge will help you in optimizing ...

Get Spark in Action, Second Edition now with the O’Reilly learning platform.