8 Leveraging data locality and memory of your machines
This chapter covers
- Data locality in big data processing
- Optimizing join strategies with Apache Spark
- Reducing shuffling
- Memory vs. disk usage in big data processing
In both streaming and batch big data applications, we often need to combine data from multiple sources to extract insights and business value. The data locality pattern lets us move computation to the data instead of the other way around. Our data can live in a database or a filesystem, and as long as it fits on a single machine's disk or in its memory, the situation is simple: processing can stay local and fast. In big data applications, however, it is not feasible to store large amounts of data on one machine. We need to employ techniques such ...
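To make the "move computation to data" idea concrete, here is a minimal sketch in plain Python (no Spark required) of a broadcast join, one of the join strategies the chapter discusses. The partition layout, dataset names, and `broadcast_join` helper are illustrative assumptions, not the chapter's code: the point is that shipping a small table to every partition lets the large dataset be joined where it already lives, so none of its records need to be shuffled across the network.

```python
# Large dataset, already partitioned across "nodes" by an earlier stage.
# (Hypothetical example data, chosen only to illustrate the technique.)
partitions = [
    [("user1", 10), ("user2", 25)],  # partition on node A
    [("user3", 7), ("user1", 3)],    # partition on node B
]

# Small lookup table that comfortably fits in each node's memory.
small_table = {"user1": "PL", "user2": "US", "user3": "DE"}


def broadcast_join(partitions, small):
    """Join each partition locally against a broadcast copy of `small`.

    No record of the large dataset crosses a partition boundary,
    which is exactly what removes the shuffle.
    """
    result = []
    for part in partitions:       # each iteration could run on its own node
        local_copy = dict(small)  # the "broadcast": one full copy per node
        for key, value in part:
            if key in local_copy:
                result.append((key, value, local_copy[key]))
    return result


joined = broadcast_join(partitions, small_table)
```

In Spark itself the same effect is achieved by marking the small side of the join as broadcastable (for example via a broadcast hint), so the optimizer replicates it to executors instead of shuffling both datasets by key.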