Chapter 6. Data Processing

In the trivial recommender that we defined in Chapter 1, we used the method get_availability; and in the MPIR, we used the method get_item_popularities. We hoped the choice of naming would provide sufficient context about their function, but we did not focus on the implementation details. Now we will start unpacking some of that complexity and present the toolsets for online and offline collectors.

Hydrating Your System

Getting data into the pipeline is punnily referred to as hydration. The ML and data fields have a lot of water-themed naming conventions; “(Data ∩ Water) Terms” by Pardis Noorzad covers this topic.

PySpark

Spark is an extremely general computing library, with APIs for Java, Python, SQL, and Scala. In many ML pipelines, PySpark's role is data processing: transforming large-scale datasets into the shapes the rest of the pipeline expects.

Let’s return to the data structure we introduced for our recommendation problem; recall that the user-item matrix is the linear-algebraic representation of all the triples of users, items, and the user’s rating of the item. These triples are not naturally occurring in the wild. Most commonly, you begin with log files from your system; for example, Bookshop.org may have something that looks like this:

	'page_view_id': 'd15220a8e9a8e488162af3120b4396a9ca1',
	'anonymous_id': 'e455d516-3c08-4b6f-ab12-77f930e2661f',
	'view_tstamp': '2020-10-29 17:44:41+00:00',
	'page_url': 'https://bookshop.org/lists/best-sellers-of-the-week',
	'page_url_host' ...
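Going from records like this to the triples described above can be sketched in plain Python. The field names and the choice of view counts as a stand-in rating are illustrative assumptions here, not a prescription:

```python
from collections import Counter

# Hypothetical parsed log records, with field names mirroring the sample.
log_records = [
    {"anonymous_id": "e455d516", "page_url": "https://bookshop.org/books/123"},
    {"anonymous_id": "e455d516", "page_url": "https://bookshop.org/books/123"},
    {"anonymous_id": "77f930e2", "page_url": "https://bookshop.org/books/456"},
]

# Count (user, item) page views; a view count standing in for a rating
# is an implicit-feedback assumption made for this sketch.
interaction_counts = Counter(
    (rec["anonymous_id"], rec["page_url"]) for rec in log_records
)

# The (user, item, rating) triples that populate the user-item matrix.
triples = [
    (user, item, count)
    for (user, item), count in interaction_counts.items()
]
```

Each distinct (user, item) pair yields one triple, so the list above is exactly the sparse representation of the user-item matrix.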
