Chapter 6. Graph Algorithms

So far we’ve mainly been focusing on record data, which is typically stored in flat files or relational databases and can be represented as a matrix (a set of rows with named columns). Now we’ll turn our attention to graph-based data, which depicts the relationships between two or more data points. Social network data is a common example: if “Alex” is a “friend” of “Jane” and “Jane” is a “friend” of “Bob,” these relationships form a graph. Airline/flight data is another common example of graph data; we’ll explore both of these (and others) in this chapter.
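To make the friendship example concrete, here is a minimal plain-Python sketch (not GraphFrames) that stores the two relationships as an edge list and derives an adjacency structure from it. The helper name `build_adjacency` is hypothetical, and treating “friend” as a mutual (undirected) relationship is an assumption made for illustration.

```python
from collections import defaultdict

# Each tuple is one "friend" relationship from the example above.
edges = [("Alex", "Jane"), ("Jane", "Bob")]


def build_adjacency(edge_list):
    """Return a dict mapping each vertex to the set of its neighbors.

    Hypothetical helper for illustration only.
    """
    adjacency = defaultdict(set)
    for src, dst in edge_list:
        adjacency[src].add(dst)
        # Assumption: friendship is mutual, so add the reverse direction too.
        adjacency[dst].add(src)
    return dict(adjacency)


friends = build_adjacency(edges)
print(sorted(friends["Jane"]))  # ['Alex', 'Bob']
```

Jane is connected to both Alex and Bob, while Alex and Bob are not directly connected; this kind of relationship structure is what a flat table of rows does not express directly.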

Data structures are specific ways of organizing and storing data in computers so that it can be used effectively. In addition to linear data structures like the ones we’ve primarily been working with in the previous chapters (arrays, lists, tuples, etc.), these include nonlinear structures such as trees, hash maps, and graphs.

This chapter introduces GraphFrames, a powerful external package for Spark that provides APIs for representing directed and undirected graphs, querying and analyzing graphs, and running algorithms on graphs. We’ll start by exploring graphs and what they are used for, then look at how to use the GraphFrames API in PySpark to build and query graphs. We’ll dig into a few of the algorithms GraphFrames supports, such as triangle finding and motif finding, then walk through some practical, real-world applications.
