Live online training

Applied Network Analysis for Data Scientists: A Tutorial for Pythonistas

Topic: Data
Eric Ma

Have you ever wondered how data scientists at Facebook and LinkedIn make friend recommendations? Or how epidemiologists track down patient zero in an outbreak? If so, then this tutorial is for you. We will use a variety of datasets to help you understand the fundamentals of network thinking, with a particular focus on constructing, summarizing, and visualizing complex networks.

This tutorial is for Pythonistas who want to understand relationship problems, that is, data problems that involve relationships between entities. Participants should already have a grasp of for loops and basic Python data structures (lists, tuples, and dictionaries). By the end of the tutorial, participants will have learned how to use the NetworkX package in the Jupyter environment and will be comfortable visualizing large networks using Circos plots. Other plots will be introduced as well.

What you'll learn and how you can apply it

  • Use NetworkX to model network data.
  • Compute centrality metrics on a graph.
  • Implement arbitrary path-finding algorithms that operate on graphs.
  • Create rational visualizations of graph-structured data.

This training course is for you because...

  • This workshop is geared towards data scientists who want to learn about network science and how it can be used to solve data science problems.
  • The course material is geared towards intermediate learners. Course participants should be proficient in Python, but need not necessarily know graph theory beforehand.
  • Learners will gain knowledge of foundational concepts, with concrete, anchoring examples to aid in recall.

Prerequisites

  • Participants in this course should already be familiar with Python programming idioms, including loops and list comprehensions, as well as basic Python data structures, including dictionaries and lists.
  • Knowledge of NumPy and Pandas, particularly their respective APIs, will help in Part 2 of this course.

Course Set-up:

All setup instructions are available on the GitHub repository:

Recommended Preparation:

Recommended Follow-up:

About your instructor

  • Eric is an Investigator at the Novartis Institutes for Biomedical Research, where he solves biological problems using machine learning. He obtained his Doctor of Science (ScD) from the Department of Biological Engineering, MIT, and was an Insight Health Data Fellow in the summer of 2017. He has taught Network Analysis at a variety of data science venues, including PyCon USA, SciPy, PyData and ODSC, and has also co-developed the Python Network Analysis curriculum on DataCamp. As an open source contributor, he has made contributions to PyMC3, matplotlib and bokeh. He has also led the development of the graph visualization package nxviz, and a data cleaning package pyjanitor (a Python port of the R package janitor).

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

Introduction (10 min)

  • Lecture-style overview of graphs.
  • Mini class discussion on graph theory.

Section 1: NetworkX Basics (30 min)

  • Hands-on exercises interspersed with lectures.
  • Basics of NetworkX API, syntax, plots.
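
As a rough preview of the kind of API covered in this section, the sketch below builds a small graph with NetworkX and queries it. It assumes the networkx package is installed; the node names and attributes are made up for illustration and are not taken from the course datasets.

```python
import networkx as nx

# A small undirected friendship graph (hypothetical data).
G = nx.Graph()
G.add_node("alice", role="analyst")  # nodes can carry arbitrary attributes
G.add_edges_from(
    [("alice", "bob"), ("bob", "carol"), ("carol", "alice"), ("carol", "dave")]
)
G.edges["carol", "dave"]["weight"] = 2  # edges can carry attributes too

num_nodes = G.number_of_nodes()
num_edges = G.number_of_edges()
carol_neighbors = sorted(G.neighbors("carol"))
print(num_nodes, num_edges, carol_neighbors)
```

In a Jupyter notebook with matplotlib available, `nx.draw(G, with_labels=True)` gives a quick node-link plot of the same graph.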

Break (10 min)

Section 2: Hubs and Paths (50 min)

  • Two metrics for identifying important nodes.
  • Pathfinding algorithms.
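
To make the two ideas concrete, here is a minimal sketch in plain Python, assuming a graph stored as a dictionary of neighbor sets. It uses degree centrality as one example of a node-importance metric (the listing does not say which two metrics the section covers) and breadth-first search as one example of a pathfinding algorithm. The graph and function names are hypothetical; NetworkX offers `nx.degree_centrality` and `nx.shortest_path` for the same tasks.

```python
from collections import deque

# Toy undirected graph as a dict of neighbor sets (hypothetical data).
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a", "e"},
    "e": {"d"},
}


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(graph)
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}


def shortest_path(graph, source, target):
    """Breadth-first search; returns one shortest path as a list, or None."""
    parents = {source: None}  # doubles as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent pointers back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nbr in graph[node]:
            if nbr not in parents:
                parents[nbr] = node
                queue.append(nbr)
    return None


centrality = degree_centrality(graph)
hub = max(centrality, key=centrality.get)
path = shortest_path(graph, "b", "e")
print(hub, path)
```

Breadth-first search is enough here because the edges are unweighted; weighted graphs would call for something like Dijkstra's algorithm instead.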

Break (10 min)

Section 3: Structures (50 min)

  • Algorithms for identifying cliques; connected component subgraphs.
  • Meta-level topic: Composing NetworkX functions to perform graph queries.
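
The idea of composing small graph functions into a query can be sketched without any library. The example below, in plain Python with hypothetical data and function names, finds connected components and then looks for triangles (cliques of size three) inside the largest one. In NetworkX, `nx.connected_components` and `nx.find_cliques` play the corresponding roles.

```python
from itertools import combinations

# Toy undirected graph with two components (hypothetical data).
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "x": {"y"},
    "y": {"x"},
}


def connected_components(graph):
    """Yield each connected component as a set of nodes."""
    seen = set()
    for start in graph:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        yield component


def triangles(graph, nodes):
    """Return the 3-cliques whose members all lie in `nodes`."""
    return [
        trio
        for trio in combinations(sorted(nodes), 3)
        if all(v in graph[u] for u, v in combinations(trio, 2))
    ]


# Compose the two functions: triangles within the largest component.
components = sorted(connected_components(graph), key=len, reverse=True)
largest = components[0]
largest_triangles = triangles(graph, largest)
print(largest, largest_triangles)
```

Checking every triple is only practical for small graphs; it is used here to keep the definition of a clique visible in the code.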

Leftover Q&A (20 min)

Part 2: Additional Topics

Section 1: Graph I/O (30 minutes)

  • Graph data formats on disk.
  • Reading and writing pandas DataFrames.
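
One common on-disk format is an edge list: one row per edge, with optional attribute columns. The sketch below reads and writes such a file using only the standard library; the column names (`source`, `target`, `weight`) and the data are assumptions for illustration. When pandas and NetworkX are available, `nx.from_pandas_edgelist` and `nx.to_pandas_edgelist` convert between a DataFrame of this shape and a graph object.

```python
import csv
import io

# An in-memory stand-in for an edge-list file on disk (hypothetical data).
edge_list_csv = """source,target,weight
alice,bob,3
bob,carol,1
carol,alice,2
"""


def read_edge_list(handle):
    """Parse rows of source,target,weight into a dict-of-dicts adjacency map."""
    adjacency = {}
    for row in csv.DictReader(handle):
        u, v, w = row["source"], row["target"], float(row["weight"])
        adjacency.setdefault(u, {})[v] = w
        adjacency.setdefault(v, {})[u] = w  # undirected: store both directions
    return adjacency


def write_edge_list(adjacency, handle):
    """Write each undirected edge once, in sorted order."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["source", "target", "weight"])
    for u in sorted(adjacency):
        for v in sorted(adjacency[u]):
            if u < v:  # skip the mirrored copy of each edge
                writer.writerow([u, v, adjacency[u][v]])


adjacency = read_edge_list(io.StringIO(edge_list_csv))
buffer = io.StringIO()
write_edge_list(adjacency, buffer)
round_trip = read_edge_list(io.StringIO(buffer.getvalue()))
print(buffer.getvalue())
```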

Break (10 minutes)

Section 2: Bipartite graphs (50 minutes)

  • Representing graphs with more than one node partition: recommender systems.
  • Computing projections of a graph onto one node set.
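
A projection can be illustrated with a toy customer-product graph. In the plain-Python sketch below, two customers are linked in the projection when they share at least one product, and a naive recommender suggests products bought by a customer's neighbors. The data and the `project` and `recommend` helpers are hypothetical; NetworkX provides `nx.bipartite.projected_graph` and `nx.bipartite.weighted_projected_graph` for the projection step.

```python
from itertools import combinations

# Customer -> products purchased (hypothetical data). Customers and products
# are the two node partitions; edges only run between partitions.
purchases = {
    "ann": {"book", "lamp"},
    "ben": {"book", "mug"},
    "cat": {"mug"},
    "dan": {"pen"},
}


def project(bipartite):
    """Project onto the key partition: link two keys if they share a neighbor.

    The edge weight is the number of shared neighbors.
    """
    projection = {node: {} for node in bipartite}
    for u, v in combinations(bipartite, 2):
        shared = len(bipartite[u] & bipartite[v])
        if shared:
            projection[u][v] = shared
            projection[v][u] = shared
    return projection


customer_graph = project(purchases)


def recommend(customer):
    """Suggest products that neighbors in the projection bought but `customer` did not."""
    owned = purchases[customer]
    suggestions = set()
    for neighbor in customer_graph[customer]:
        suggestions |= purchases[neighbor] - owned
    return suggestions


print(customer_graph)
print(recommend("ann"))
```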

Break (10 minutes)

Section 3: Network Statistical Inference (30 minutes)

  • Random graphs: a model for how the world works.
  • Using statistical inference methods to determine whether a graph came from a particular class of random graphs.
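
One simple way to approach this question is by simulation: draw many graphs from the random model, compute a summary statistic on each, and see where the observed graph's statistic falls. The sketch below does this for the Erdős-Rényi G(n, p) model using mean degree as the statistic. The choice of model, statistic, parameters, and the observed value are all assumptions made for illustration; NetworkX offers `nx.erdos_renyi_graph` as a generator for this model.

```python
import random
from itertools import combinations


def erdos_renyi(n, p, rng):
    """G(n, p): include each possible edge independently with probability p."""
    graph = {node: set() for node in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            graph[u].add(v)
            graph[v].add(u)
    return graph


def mean_degree(graph):
    return sum(len(nbrs) for nbrs in graph.values()) / len(graph)


n, p = 200, 0.1
rng = random.Random(42)  # seeded so the simulation is repeatable
samples = [mean_degree(erdos_renyi(n, p, rng)) for _ in range(50)]
expected = p * (n - 1)  # theoretical mean degree under G(n, p)

# Hypothetical observed statistic from some real graph with the same n.
observed_mean_degree = 25.0
# Monte Carlo p-value: share of simulated graphs at least as extreme.
p_value = sum(s >= observed_mean_degree for s in samples) / len(samples)
print(expected, min(samples), max(samples), p_value)
```

Mean degree is a crude statistic, since many different graph models can share it. Comparing the full degree distribution or clustering coefficients would distinguish models more sharply.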

Break (10 min)

Section 4: Matrix Operations (30 minutes)

  • How to represent graphs as matrices.
  • Matrix operations on adjacency matrices: non-bipartite and bipartite graphs.
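
Two standard results illustrate these operations. Squaring an adjacency matrix counts walks of length two, and multiplying a bipartite graph's biadjacency matrix by its transpose counts shared neighbors. The sketch below uses plain nested lists so that it runs without dependencies; the matrices and helper names are made up for illustration. With NumPy, the same products would be written `A @ A` and `B @ B.T`, and `nx.to_numpy_array` can produce an adjacency matrix from a NetworkX graph.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


# Non-bipartite case: adjacency matrix of the path graph 0 - 1 - 2.
A = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
A2 = matmul(A, A)  # A2[i][j] = number of walks of length 2 from i to j

# Bipartite case: biadjacency matrix, rows = 3 customers, columns = 2 products.
B = [
    [1, 0],
    [1, 1],
    [0, 1],
]
# shared[i][j] = number of products customers i and j have in common
shared = matmul(B, transpose(B))

print(A2)      # [[1, 0, 1], [0, 2, 0], [1, 0, 1]]
print(shared)  # [[1, 1, 0], [1, 2, 1], [0, 1, 1]]
```

The diagonal of `A2` holds each node's degree, and the off-diagonal entries of `shared` are the edge weights of the customer-side projection.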