8 Leveraging data locality and memory of your machines
This chapter covers
- Data locality in big data processing
- Optimizing join strategies with Apache Spark
- Reducing shuffling
- Memory vs. disk usage in big data processing
In both streaming and batch big data applications, we often need to combine data from multiple sources to extract insights and business value. The data locality pattern lets us move computation to the data instead of the other way around. Our data can live in a database or a filesystem, and as long as it fits on a single machine's disk or in its memory, the situation is simple: processing can stay local and fast. In big data applications, however, it is not feasible to store large amounts of data on one machine. We need to employ techniques such ...
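To make the "move computation to data" idea concrete, here is a minimal sketch in plain Python (no Spark required) of a broadcast join, one of the join strategies the chapter discusses. The partition layout, dataset names, and `broadcast_join` helper are illustrative assumptions, not the chapter's code: the point is that shipping a small table to every partition lets the large dataset be joined where it already lives, so none of its records need to be shuffled across the network.

```python
# Large dataset, already partitioned across "nodes" by an earlier stage.
# (Hypothetical example data, chosen only to illustrate the technique.)
partitions = [
    [("user1", 10), ("user2", 25)],  # partition on node A
    [("user3", 7), ("user1", 3)],    # partition on node B
]

# Small lookup table that comfortably fits in each node's memory.
small_table = {"user1": "PL", "user2": "US", "user3": "DE"}


def broadcast_join(partitions, small):
    """Join each partition locally against a broadcast copy of `small`.

    No record of the large dataset crosses a partition boundary,
    which is exactly what removes the shuffle.
    """
    result = []
    for part in partitions:       # each iteration could run on its own node
        local_copy = dict(small)  # the "broadcast": one full copy per node
        for key, value in part:
            if key in local_copy:
                result.append((key, value, local_copy[key]))
    return result


joined = broadcast_join(partitions, small_table)
```

In Spark itself the same effect is achieved by marking the small side of the join as broadcastable (for example via a broadcast hint), so the optimizer replicates it to executors instead of shuffling both datasets by key.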