Chapter 5. Data Processing

In the trivial recommender that we defined in the introduction, we used a method get_availability, and in the MPIR, we used a method get_item_popularities. We hoped the names would give enough context about what each method does, but we did not focus on the implementation details. Now we will start unpacking some of this complexity and see what toolsets are available for online and offline collectors.
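To make the two method names concrete, here is a minimal sketch of what they might do. Everything beyond the names get_availability and get_item_popularities is an assumption made for illustration: the signatures, the (user_id, item_id) interaction log, the in_stock set, and the helper most_popular_available are hypothetical, and a real collector would read from a data store rather than from in-memory lists.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical interaction log entry: a (user_id, item_id) pair.
Interaction = Tuple[str, str]


def get_item_popularities(interactions: Iterable[Interaction]) -> Dict[str, int]:
    """Count how many interactions each item received (assumed definition of popularity)."""
    return dict(Counter(item_id for _, item_id in interactions))


def get_availability(item_ids: Iterable[str], in_stock: Set[str]) -> List[str]:
    """Keep only the items that are currently available, preserving input order."""
    return [item_id for item_id in item_ids if item_id in in_stock]


def most_popular_available(
    interactions: Iterable[Interaction], in_stock: Set[str], k: int
) -> List[str]:
    """Hypothetical helper: rank items by popularity, then drop unavailable ones."""
    popularities = get_item_popularities(interactions)
    # Sort by descending count; break ties by item id so the order is deterministic.
    ranked = sorted(popularities, key=lambda item_id: (-popularities[item_id], item_id))
    return get_availability(ranked, in_stock)[:k]


if __name__ == "__main__":
    log = [("u1", "a"), ("u2", "a"), ("u1", "b"), ("u3", "c"), ("u2", "c"), ("u3", "c")]
    print(get_item_popularities(log))
    print(most_popular_available(log, in_stock={"a", "b"}, k=2))
```

Both functions hide a data-access question: where the interaction log and the availability information come from, and how fresh they are. That is the complexity the rest of this chapter unpacks.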

Hydrating your system

Getting data into the pipeline is punnily referred to as hydration. Machine learning and data engineering have a lot of water-themed naming conventions; one article that covers this is by Pardis Noorzad.


Spark is an extremely general computing library, with APIs for Java, Python, SQL, and Scala. PySpark, its Python API, is used in many ML pipelines to process and transform the large-scale datasets in your pipeline.

Let’s return to the data structure we introduced for our recommendation problem; recall that ...
