The Practitioner's Guide to Graph Data

Book description

Graph data closes the gap between the way humans and computers view the world. While computers rely on static rows and columns of data, people navigate and reason about life through relationships. This practical guide demonstrates how graph data brings these two approaches together. By working with concepts from graph theory, database schema, distributed systems, and data analysis, you’ll arrive at a unique intersection known as graph thinking.

Authors Denise Koessler Gosnell and Matthias Broecheler show data engineers, data scientists, and data analysts how to solve complex problems with graph databases. You’ll explore templates for building with graph technology, along with examples that demonstrate how teams think about graph data within an application.

  • Build an example application architecture with relational and graph technologies
  • Use graph technology to build a Customer 360 application, the most popular graph data pattern today
  • Dive into hierarchical data and troubleshoot a new paradigm that comes from working with graph data
  • Find paths in graph data and learn why your trust in different paths motivates and informs your preferences
  • Use collaborative filtering to design a Netflix-inspired recommendation system

Table of contents

  1. Preface
    1. Who Should Read This Book
    2. Goals of This Book
    3. Navigating This Book
    4. Conventions Used in This Book
    5. Using Code Examples
    6. O’Reilly Online Learning
    7. How to Contact Us
    8. Acknowledgments
  2. Chapter 1. Graph Thinking
    1. Why Now? Putting Database Technologies in Context
      1. 1960s–1980s: Hierarchical Data
      2. 1980s–2000s: Entity-Relationship
      3. 2000s–2020s: NoSQL
      4. 2020s–?: Graph
    2. What Is Graph Thinking?
      1. Complex Problems and Complex Systems
      2. Complex Problems in Business
    3. Making Technology Decisions to Solve Complex Problems
      1. So You Have Graph Data. What’s Next?
      2. Seeing the Bigger Picture
    4. Getting Started on Your Journey with Graph Thinking
  3. Chapter 2. Evolving from Relational to Graph Thinking
    1. Chapter Preview: Translating Relational Concepts to Graph Terminology
    2. Relational Versus Graph: What’s the Difference?
      1. Data for Our Running Example
    3. Relational Data Modeling
      1. Entities and Attributes
      2. Building Up to an ERD
    4. Concepts in Graph Data
      1. Fundamental Elements of a Graph
      2. Adjacency
      3. Neighborhoods
      4. Distance
      5. Degree
    5. The Graph Schema Language
      1. Vertex Labels and Edge Labels
      2. Properties
      3. Edge Direction
      4. Self-Referencing Edge Labels
      5. Multiplicity of Your Graph
      6. Full Example Graph Model
    6. Relational Versus Graph: Decisions to Consider
      1. Data Modeling
      2. Understanding Graph Data
      3. Mixing Database Design with Application Purpose
    7. Summary
  4. Chapter 3. Getting Started: A Simple Customer 360
    1. Chapter Preview: Relational Versus Graph
    2. The Foundational Use Case for Graph Data: C360
      1. Why Do Businesses Care About C360?
    3. Implementing a C360 Application in a Relational System
      1. Data Models
      2. Relational Implementation
      3. Example C360 Queries
    4. Implementing a C360 Application in a Graph System
      1. Data Models
      2. Graph Implementation
      3. Example C360 Queries
    5. Relational Versus Graph: How to Choose?
      1. Relational Versus Graph: Data Modeling
      2. Relational Versus Graph: Representing Relationships
      3. Relational Versus Graph: Query Languages
      4. Relational Versus Graph: Main Points
    6. Summary
      1. Why Not Relational?
      2. Making a Technology Choice for Your C360 Application
  5. Chapter 4. Exploring Neighborhoods in Development
    1. Chapter Preview: Building a More Realistic Customer 360
    2. Graph Data Modeling 101
      1. Should This Be a Vertex or an Edge?
      2. Lost Yet? Let Us Walk You Through Direction
      3. A Graph Has No Name: Common Mistakes in Naming
      4. Our Full Development Graph Model
      5. Before We Start Building
      6. Our Thoughts on the Importance of Data, Queries, and the End User
    3. Implementation Details for Exploring Neighborhoods in Development
      1. Generating More Data for Our Expanded Example
    4. Basic Gremlin Navigation
    5. Advanced Gremlin: Shaping Your Query Results
      1. Shaping Query Results with the project(), fold(), and unfold() Steps
      2. Removing Data from the Results with the where(neq()) Pattern
      3. Planning for Robust Result Payloads with the coalesce() Step
    6. Moving from Development into Production
  6. Chapter 5. Exploring Neighborhoods in Production
    1. Chapter Preview: Understanding Distributed Graph Data in Apache Cassandra
    2. Working with Graph Data in Apache Cassandra
      1. The Most Important Topic to Understand About Data Modeling: Primary Keys
      2. Partition Keys and Data Locality in a Distributed Environment
      3. Understanding Edges, Part 1: Edges in Adjacency Lists
      4. Understanding Edges, Part 2: Clustering Columns
      5. Understanding Edges, Part 3: Materialized Views for Traversals
    3. Graph Data Modeling 201
      1. Finding Indexes with an Intelligent Index Recommendation System
    4. Production Implementation Details
      1. Materialized Views and Adding Time onto Edges
      2. Our Final C360 Production Schema
      3. Bulk Loading Graph Data
      4. Updating Our Gremlin Queries to Use Time on Edges
    5. Moving On to More Complex, Distributed Graph Problems
      1. Our First 10 Tips to Get from Development to Production
  7. Chapter 6. Using Trees in Development
    1. Chapter Preview: Navigating Trees, Hierarchical Data, and Cycles
    2. Seeing Hierarchies and Nested Data: Three Examples
      1. Hierarchical Data in a Bill of Materials
      2. Hierarchical Data in Version Control Systems
      3. Hierarchical Data in Self-Organizing Networks
      4. Why Graph Technology for Hierarchical Data?
    3. Finding Your Way Through a Forest of Terminology
      1. Trees, Roots, and Leaves
      2. Depth in Walks, Paths, and Cycles
    4. Understanding Hierarchies with Our Sensor Data
      1. Understand the Data
      2. Conceptual Model Using the GSL Notation
      3. Implement Schema
      4. Before We Build Our Queries
    5. Querying from Leaves to Roots in Development
      1. Where Has This Sensor Sent Information To?
      2. From This Sensor, What Was Its Path to Any Tower?
      3. From Bottom Up to Top Down
    6. Querying from Roots to Leaves in Development
      1. Setup Query: Which Tower Has the Most Sensor Connections So That We Could Explore It for Our Example?
      2. Which Sensors Have Connected Directly to Georgetown?
      3. Find All Sensors That Connected to Georgetown
      4. Depth Limiting in Recursion
    7. Going Back in Time
  8. Chapter 7. Using Trees in Production
    1. Chapter Preview: Understanding Branching Factor, Depth, and Time on Edges
    2. Understanding Time in the Sensor Data
      1. Final Thoughts on Time Series Data in Graphs
    3. Understanding Branching Factor in Our Example
      1. What Is Branching Factor?
      2. How Do We Get Around Branching Factor?
    4. Production Schema for Our Sensor Data
    5. Querying from Leaves to Roots in Production
      1. Where Has This Sensor Sent Information to, and at What Time?
      2. From This Sensor, Find All Trees up to a Tower by Time
      3. From This Sensor, Find a Valid Tree
      4. Advanced Gremlin: Understanding the where().by() Pattern
    6. Querying from Roots to Leaves in Production
      1. Which Sensors Have Connected to Georgetown Directly, by Time?
      2. What Valid Paths Can We Find from Georgetown Down to All Sensors?
    7. Applying Your Queries to Tower Failure Scenarios
      1. Applying the Final Results of Our Complex Problem
    8. Seeing the Forest for the Trees
  9. Chapter 8. Finding Paths in Development
    1. Chapter Preview: Quantifying Trust in Networks
    2. Thinking About Trust: Three Examples
      1. How Much Do You Trust That Open Invitation?
      2. How Defensible Is an Investigator’s Story?
      3. How Do Companies Model Package Delivery?
    3. Fundamental Concepts About Paths
      1. Shortest Paths
      2. Depth-First Search and Breadth-First Search
      3. Learning to See Application Features as Different Path Problems
    4. Finding Paths in a Trust Network
      1. Source Data
      2. A Brief Primer on Bitcoin Terminology
      3. Creating Our Development Schema
      4. Loading Data
      5. Exploring Communities of Trust
    5. Understanding Traversals with Our Bitcoin Trust Network
      1. Which Addresses Are in the First Neighborhood?
      2. Which Addresses Are in the Second Neighborhood?
      3. Which Addresses Are in the Second Neighborhood, but Not the First?
      4. Evaluation Strategies with the Gremlin Query Language
      5. Pick a Random Address to Use for Our Example
    6. Shortest Path Queries
      1. Finding Paths of a Fixed Length
      2. Finding Paths of Any Length
      3. Augmenting Our Paths with the Trust Scores
      4. Do You Trust This Person?
  10. Chapter 9. Finding Paths in Production
    1. Chapter Preview: Understanding Weights, Distance, and Pruning
    2. Weighted Paths and Search Algorithms
      1. Shortest Weighted Path Problem Definition
      2. Shortest Weighted Path Search Optimizations
    3. Normalization of Edge Weights for Shortest Path Problems
      1. Normalizing the Edge Weights
      2. Updating Our Graph
      3. Exploring the Normalized Edge Weights
      4. Some Thoughts Before Moving On to Shortest Weighted Path Queries
    4. Shortest Weighted Path Queries
      1. Building a Shortest Weighted Path Query for Production
    5. Weighted Paths and Trust in Production
  11. Chapter 10. Recommendations in Development
    1. Chapter Preview: Collaborative Filtering for Movie Recommendations
    2. Recommendation System Examples
      1. How We Give Recommendations in Healthcare
      2. How We Experience Recommendations in Social Media
      3. How We Use Deeply Connected Data for Recommendations in Ecommerce
    3. An Introduction to Collaborative Filtering
      1. Understanding the Problem and Domain
      2. Collaborative Filtering with Graph Data
      3. Recommendations via Item-Based Collaborative Filtering with Graph Data
      4. Three Different Models for Ranking Recommendations
    4. Movie Data: Schema, Loading, and Query Review
      1. Data Model for Movie Recommendations
      2. Schema Code for Movie Recommendations
      3. Loading the Movie Data
      4. Neighborhood Queries in the Movie Data
      5. Tree Queries in the Movie Data
      6. Path Queries in the Movie Data
    5. Item-Based Collaborative Filtering in Gremlin
      1. Model 1: Counting Paths in the Recommendation Set
      2. Model 2: NPS-Inspired
      3. Model 3: Normalized NPS
      4. Choosing Your Own Adventure: Movies and Graph Problems Edition
  12. Chapter 11. Simple Entity Resolution in Graphs
    1. Chapter Preview: Merging Multiple Datasets into One Graph
    2. Defining a Different Complex Problem: Entity Resolution
      1. Seeing the Complex Problem
    3. Analyzing the Two Movie Datasets
      1. MovieLens Dataset
      2. Kaggle Dataset
      3. Development Schema
    4. Matching and Merging the Movie Data
      1. Our Matching Process
    5. Resolving False Positives
      1. False Positives Found in the MovieLens Dataset
      2. Additional Errors Discovered in the Entity Resolution Process
      3. Final Analysis of the Merging Process
      4. The Role of Graph Structure in Merging Movie Data
  13. Chapter 12. Recommendations in Production
    1. Chapter Preview: Understanding Shortcut Edges, Precomputation, and Advanced Pruning Techniques
    2. Shortcut Edges for Recommendations in Real Time
      1. Where Our Development Process Doesn’t Scale
      2. How We Fix Scaling Issues: Shortcut Edges
      3. Seeing What We Designed to Deliver in Production
      4. Pruning: Different Ways to Precompute Shortcut Edges
      5. Considerations for Updating Your Recommendations
    3. Calculating Shortcut Edges for Our Movie Data
      1. Breaking Down the Complex Problem of Precalculating Shortcut Edges
      2. Addressing the Elephant in the Room: Batch Computation
    4. Production Schema and Data Loading for Movie Recommendations
      1. Production Schema for Movie Recommendations
      2. Production Data Loading for Movie Recommendations
    5. Recommendation Queries with Shortcut Edges
      1. Confirming Our Edges Loaded Correctly
      2. Production Recommendations for Our User
      3. Understanding Response Time in Production by Counting Edge Partitions
      4. Final Thoughts on Reasoning About Distributed Graph Query Performance
  14. Chapter 13. Epilogue
    1. Where to Go from Here?
      1. Graph Algorithms
      2. Distributed Graphs
      3. Graph Theory
      4. Network Theory
    2. Stay in Touch
  15. Index

Product information

  • Title: The Practitioner's Guide to Graph Data
  • Author(s): Denise Gosnell, Matthias Broecheler
  • Release date: March 2020
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492044079