Chapter 8. Ranking Algorithms

This chapter introduces the following two ranking algorithms and presents their associated implementations in PySpark:

Rank product

This algorithm ranks each item (such as a gene) relative to all other items. It was originally developed to detect differentially expressed genes in replicated microarray experiments, but it has since gained widespread acceptance and is now used more broadly, including in machine learning. Spark does not provide an API for the rank product, so I will present a custom solution.

PageRank

PageRank is an iterative algorithm for measuring the importance of nodes in a given graph. This algorithm is used heavily by search engines (such as Google) to find the importance of each web page (document) relative to all web pages (a set of documents). In a nutshell, given a set of web pages, the PageRank algorithm calculates a quality ranking for each page. The Spark API offers multiple solutions for the PageRank algorithm. I’ll present one of those, using the GraphFrames API, as well as two custom solutions.
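To make the iterative idea behind PageRank concrete, here is a minimal sketch in plain Python rather than PySpark. It is not one of the solutions presented in this chapter. The function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the tiny three-page graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=20):
    """Illustrative PageRank over a dict of node -> list of outbound neighbors.

    The damping factor and iteration count are assumed defaults,
    not values taken from the chapter.
    """
    # Collect every node, including those that only appear as link targets.
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)

    # Start with a uniform rank for every node.
    ranks = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node receives a small baseline share on each pass.
        new_ranks = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's damped rank evenly among its outbound links.
                share = damping * ranks[node] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # A node with no outbound links spreads its rank over all nodes.
                share = damping * ranks[node] / n
                for target in nodes:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Hypothetical graph: A links to B and C, B links to C, C links back to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this toy graph, page C receives rank from both A and B, so it ends up ranked above B. A distributed version would express the same update as transformations over a distributed collection of edges instead of nested Python loops.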

Rank Product

The rank product is an algorithm commonly used in the field of bioinformatics, also known as computational biology. It was originally developed as a biologically motivated test for the detection of differentially expressed genes in replicated microarray experiments. As well as expression profiling, it can be ...
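As a rough illustration of the idea, the rank product of an item is usually defined as the geometric mean of its ranks across the replicated experiments. The sketch below uses plain Python rather than PySpark and is not the solution developed in this chapter. The helper names `ranks_within` and `rank_product`, the sample gene scores, and the choice to give rank 1 to the highest score are assumptions made for illustration, and ties are not handled.

```python
import math


def ranks_within(scores):
    """Rank items in one replicate; rank 1 goes to the highest score (assumed)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {item: position for position, item in enumerate(ordered, start=1)}


def rank_product(replicates):
    """Geometric mean of each item's rank across all replicates.

    Assumes every replicate contains the same set of items and has no ties.
    """
    k = len(replicates)
    per_replicate = [ranks_within(scores) for scores in replicates]
    items = per_replicate[0].keys()
    return {
        item: math.prod(ranks[item] for ranks in per_replicate) ** (1.0 / k)
        for item in items
    }


# Hypothetical scores for three genes measured in two replicates.
replicates = [
    {"g1": 5.0, "g2": 3.0, "g3": 1.0},
    {"g1": 4.0, "g2": 6.0, "g3": 2.0},
]
for gene, value in sorted(rank_product(replicates).items(), key=lambda kv: kv[1]):
    print(gene, round(value, 4))
```

A smaller rank product means the item sits near the top of the list consistently across replicates. Here `g3` is ranked last in both replicates, so it has the largest rank product.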
