Practical Predictive Analytics by Ralph Winters

Cleaning up and caching the table in memory

Since Spark excels at processing in-memory data, we will first remove our intermediary data and then cache our out_sd dataframe, so that subsequent queries run much faster. Caching works best when similar types of queries are repeated against the same data, because Spark can then keep most of what those queries need in memory.

However, this is not foolproof. Good Spark query and table design will help further with optimization, but even out-of-the-box caching usually gives some benefit. Often, the first queries will not benefit from memory caching, since that is when the data is actually loaded into memory, but subsequent queries will run much faster.
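The "first query pays, later queries benefit" behavior can be illustrated with a toy model in plain Python. This is only a sketch of the idea and does not use Spark: the `LazyCachedTable` class, its `scans` counter, and the sample rows are all hypothetical and made up for illustration, and only the name `out_sd` is borrowed from the text.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, float]


class LazyCachedTable:
    """Toy stand-in for a cached table (hypothetical; not a Spark API)."""

    def __init__(self, load: Callable[[], List[Row]]) -> None:
        self._load = load                       # the expensive source scan
        self._rows: Optional[List[Row]] = None  # nothing is cached yet
        self.scans = 0                          # times the source was read

    def _materialize(self) -> List[Row]:
        # The first query pays for the scan; later queries reuse the rows.
        if self._rows is None:
            self.scans += 1
            self._rows = self._load()
        return self._rows

    def count(self) -> int:
        return len(self._materialize())

    def mean(self, column: str) -> float:
        rows = self._materialize()
        return sum(r[column] for r in rows) / len(rows)

    def unpersist(self) -> None:
        # Drop the cached rows; the next query has to scan again.
        self._rows = None


def load_sample() -> List[Row]:
    # Made-up rows, for illustration only.
    return [{"x": 90.0}, {"x": 110.0}]


out_sd = LazyCachedTable(load_sample)
n = out_sd.count()       # first query: triggers the one and only scan
avg = out_sd.mean("x")   # second query: served from the cached rows
print(n, avg, out_sd.scans)
```

Both queries together read the source only once, which is why the second one is cheap. Calling `unpersist()` in this sketch throws the cached rows away, so the next query pays for a fresh scan.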

Since we will no longer use the intermediary dataframes we created, we will remove them with ...