Apache Spark Graph Processing

Name: Apache Spark Graph Processing
Author: Rindra Ramamonjison
ISBN: 9781784391805

by Rindra Ramamonjison

September 2015

Intermediate to advanced

148 pages

3h 20m

English

Packt Publishing

Table of Contents
Apache Spark Graph Processing
Credits
Foreword
About the Author
About the Reviewer
www.PacktPub.com
Support files, eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
Distinctive features
What this book covers

What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Getting Started with Spark and GraphX
Downloading and installing Spark 1.4.1
Experimenting with the Spark shell
Getting started with GraphX
Building a tiny social network
Loading the data
The property graph
Transforming RDDs to VertexRDD and EdgeRDD
Introducing graph operations
Building and submitting a standalone application
Writing and configuring a Spark program
Building the program with the Scala Build Tool
Deploying and running with spark-submit
Summary
2. Building and Exploring Graphs
Network datasets
The communication network
Flavor networks
Social ego networks
Graph builders
The Graph factory method
edgeListFile
fromEdges
fromEdgeTuples
Building graphs
Building directed graphs
Building a bipartite graph
Building a weighted social ego network
Computing the degrees of the network nodes
In-degree and out-degree of the Enron email network
Degrees in the bipartite food network
Degree histogram of the social ego networks
Summary
3. Graph Analysis and Visualization
Network datasets
The graph visualization
Installing the GraphStream and BreezeViz libraries
Visualizing the graph data
Plotting the degree distribution
The analysis of network connectedness
Finding the connected components
Counting triangles and computing clustering coefficients
The network centrality and PageRank
How PageRank works
Ranking web pages
Scala Build Tool revisited
Organizing build definitions
Managing library dependencies
A preview of the steps
Step 1 – Enable the sbt-assembly plugin
Step 2 – Create a build.sbt file
Step 3 – Declare library dependencies and resolvers
Step 4 – Set up the sbt-assembly plugin
Step 5 – Create the uber JAR
Running tasks with SBT commands
Summary
4. Transforming and Shaping Up Graphs to Your Needs
Transforming the vertex and edge attributes
mapVertices
mapEdges
mapTriplets
Modifying graph structures
The reverse operator
The subgraph operator
The mask operator
The groupEdges operator
Joining graph datasets
joinVertices
outerJoinVertices
Example – Hollywood movie graph
Data operations on VertexRDD and EdgeRDD
Mapping VertexRDD and EdgeRDD
Filtering VertexRDDs
Joining VertexRDDs
Joining EdgeRDDs
Reversing edge directions
Collecting neighboring information
Example – from food network to flavor pairing
Summary
5. Creating Custom Graph Aggregation Operators
NCAA College Basketball datasets
The aggregateMessages operator
EdgeContext
Abstracting out the aggregation
Keeping things DRY
Coach wants more numbers
Calculating average points per game
Defense stats – D matters as in direction
Joining average stats into a graph
Performance optimization
The MapReduceTriplets operator
Summary
6. Iterative Graph-Parallel Processing with Pregel
The Pregel computational model
Example – iterating towards the social equality
The Pregel API in GraphX
Community detection through label propagation
The Pregel implementation of PageRank
Summary
7. Learning Graph Structures
Community clustering in graphs
Spectral clustering
Power iteration clustering
Applications – music fan community detection
Step 1 – load the data into a Spark graph property
Step 2 – extract the features of nodes
Step 3 – define a similarity measure between two nodes
Step 4 – create an affinity matrix
Step 5 – run k-means clustering on the affinity matrix
Exercise – collaborative clustering through playlists
Summary
A. References
Chapter 2, Building and Exploring Graphs
Chapter 3, Graph Analysis and Visualization
Chapter 7, Learning Graph Structures
Index

Content preview from Apache Spark Graph Processing

Applications – music fan community detection

We are now ready to apply the previous graph clustering method to cluster songs according to the tags attached to each song. Alternatively, a dataset of song playlists can be used to cluster songs that often appear together in playlists. The datasets that we are going to work with can be downloaded from http://www.cs.cornell.edu/~shuochen/lme/data_page.html. They consist of the following files:

train.txt: This file contains the playlist data, with songs represented by integer IDs
tags.txt: This file contains the social tags of each song, with songs represented by integer IDs
song_hash.txt: This file maps a song ID to its title and artist
tag_hash.txt: This file maps a tag ID to its name ...
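
To make the role of these files concrete, here is a minimal sketch in plain Python of reading song metadata and per-song tags and then comparing two songs by their tags. The preview does not describe the file layouts, so everything about them here is an assumption: tab-separated `id, title, artist` lines for song_hash.txt, and for tags.txt one line per song (line i belongs to song i) with space-separated tag IDs and `#` for a song without tags. The sample lines, the function names, and the use of Jaccard similarity are illustrative only and are not taken from the book.

```python
# Hypothetical sample lines standing in for the real files.
# Assumed layout of song_hash.txt: "<song id>\t<title>\t<artist>".
SONG_HASH_LINES = [
    "0\tSong A\tArtist X",
    "1\tSong B\tArtist Y",
    "2\tSong C\tArtist Z",
]
# Assumed layout of tags.txt: line i lists the tag IDs of song i,
# separated by spaces; "#" marks a song with no tags.
TAGS_LINES = ["1 2 3", "2 3", "#"]


def parse_song_hash(lines):
    """Map each song ID to a (title, artist) pair."""
    songs = {}
    for line in lines:
        song_id, title, artist = line.rstrip("\n").split("\t")
        songs[int(song_id)] = (title.strip(), artist.strip())
    return songs


def parse_tags(lines):
    """Map each song ID (the line index) to its set of tag IDs."""
    tags = {}
    for song_id, line in enumerate(lines):
        fields = line.split()
        tags[song_id] = set() if fields == ["#"] else {int(t) for t in fields}
    return tags


def jaccard(a, b):
    """Jaccard similarity of two tag sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


songs = parse_song_hash(SONG_HASH_LINES)
tags = parse_tags(TAGS_LINES)
similarity = jaccard(tags[0], tags[1])  # songs 0 and 1 share tags 2 and 3
```

In a Spark setting the same parsing would typically run over distributed collections rather than Python lists; the sketch only shows the shape of the data that a tag-based clustering of songs would start from.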
