Counting up the sum of friends and number of entries per age

Alright, now I'm going to throw you into the deep end of the pool here. Look at this big, scary line:

totalsByAge = rdd.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1])) 

However, if we break it down into its components, what's going on here is pretty straightforward. What we need to do next is aggregate the information in our RDD. So let's break this totalsByAge line down, one component at a time. We have a compound operation going on here: we take our RDD of key/value pairs, where the key is the age and the value is the number of friends, and call mapValues on it, and then we take the resulting RDD and call reduceByKey on it. Let's take ...
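To get a feel for what those two steps do to the data, here is a rough sketch in plain Python that imitates mapValues and reduceByKey on a tiny list of (age, numFriends) pairs. It does not use Spark at all: the sample data and the helper functions map_values and reduce_by_key are made up for illustration, and only the two lambdas are taken from the totalsByAge line.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sample data: (age, numFriends) pairs standing in for the RDD.
pairs = [(33, 385), (33, 2), (55, 221), (40, 465), (55, 100)]

def map_values(kv_pairs, func):
    """Imitates rdd.mapValues: apply func to each value, leave the key alone."""
    return [(key, func(value)) for key, value in kv_pairs]

def reduce_by_key(kv_pairs, func):
    """Imitates rdd.reduceByKey: combine all values that share a key using func."""
    grouped = defaultdict(list)
    for key, value in kv_pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}

# Step 1: turn each numFriends value into a (numFriends, 1) tuple.
with_counts = map_values(pairs, lambda x: (x, 1))

# Step 2: per age, add up the friend counts and the number of entries.
totals_by_age = reduce_by_key(
    with_counts, lambda x, y: (x[0] + y[0], x[1] + y[1])
)

print(totals_by_age)
# {33: (387, 2), 55: (321, 2), 40: (465, 1)}
```

Each age ends up paired with a tuple holding the sum of friends and the number of entries seen for that age, which is what the heading of this section describes.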

Get Frank Kane's Taming Big Data with Apache Spark and Python now with the O’Reilly learning platform.
