Chapter 6. Batch layer

This chapter covers

  • Computing functions on the batch layer
  • Splitting a query into precomputed and on-the-fly components
  • Recomputation versus incremental algorithms
  • The meaning of scalability
  • The MapReduce paradigm
  • A higher-level way of thinking about MapReduce

The goal of a data system is to answer arbitrary questions about your data. Any question you could ask of your dataset can be implemented as a function that takes all of your data as input. Ideally, you could run these functions on the fly whenever you query your dataset. Unfortunately, a function that reads your entire dataset on every call takes far too long to run, and the cost grows as the dataset grows. You need a different strategy if you want your queries answered quickly.
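As a rough illustration of a query expressed as a function over the whole dataset, the sketch below counts pageviews for a URL by scanning every record. It then shows one possible alternative: doing a single pass up front so that each query becomes a lookup. The record layout and the names `DATASET`, `pageviews_for_url`, and `build_pageview_index` are assumptions made for this example, not taken from the chapter.

```python
from collections import Counter

# Hypothetical dataset: one record per pageview (the layout is an assumption).
DATASET = [
    {"url": "/home", "user": "alice"},
    {"url": "/about", "user": "bob"},
    {"url": "/home", "user": "carol"},
]


def pageviews_for_url(dataset, url):
    """Query as a function of all the data: scans every record on each call."""
    return sum(1 for record in dataset if record["url"] == url)


def build_pageview_index(dataset):
    """Precompute counts for every URL in a single pass over the dataset."""
    return Counter(record["url"] for record in dataset)


# On the fly: every query pays for a full scan of the dataset.
on_the_fly = pageviews_for_url(DATASET, "/home")

# Precomputed: one full pass up front, then each query is a dictionary lookup.
index = build_pageview_index(DATASET)
precomputed = index["/home"]
```

Both approaches return the same answer. The difference is where the cost of reading the full dataset is paid: on every query, or once ahead of time.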

In the Lambda ...
