16. Cache and checkpoint: Enhancing Spark’s performances

This chapter covers

Caching and checkpointing to enhance Spark’s performance
Choosing the right method to enhance performance
Collecting performance information
Picking the right spot to use a cache or checkpoint
Using collect() and collectAsList() wisely

Spark is fast. It processes data easily across multiple nodes in a cluster or on your laptop. Spark also loves memory, and that reliance on memory is a key design choice behind its performance. However, as your datasets grow from the samples you use during development to production-size datasets, you may notice performance degrading.

In this chapter, you’ll get some foundational knowledge about how Spark uses memory. This knowledge will help you in optimizing ...

Spark in Action, Second Edition by Jean-Georges Perrin
