Practical Graph Analytics with Apache Giraph

Book Description

Practical Graph Analytics with Apache Giraph helps you build data mining and machine learning applications using the Apache Foundation’s Giraph framework for graph processing. This is the same framework used by Facebook, Google, and other social media analytics operations to derive business value from vast amounts of interconnected data points.

Graphs arise in a wealth of data scenarios and describe the connections that are naturally formed in both the digital and real worlds. Examples of such connections abound in online social networks such as Facebook and Twitter and among users who rate movies on services like Netflix and Amazon Prime, and they are useful even in the context of biological networks for scientific research. Whether in business or science, viewing data as connected adds value by increasing the amount of information that can be drawn from that data and put to use in generating new revenue or scientific opportunities.

Apache Giraph offers a simple yet flexible programming model targeted to graph algorithms and designed to scale easily to accommodate massive amounts of data. Originally developed at Yahoo!, Giraph is now a top-level project at the Apache Foundation, and it enlists contributors from companies such as Facebook, LinkedIn, and Twitter. Practical Graph Analytics with Apache Giraph brings Apache Giraph to you, showing how to harness the power of graph processing for your own data by building sophisticated graph analytics applications using the same framework relied upon by some of the largest players in the industry today.

Table of Contents

  1. Cover
  2. Title
  3. Copyright
  4. Contents at a Glance
  5. Contents
  6. About the Authors
  7. About the Technical Reviewer
  8. Introduction
  9. Annotation Conventions
  10. Part I: Giraph Building Blocks
    1. Chapter 1: Introducing Giraph
      1. Data, Data, Data
      2. From Big Data to Big Graphs
      3. Why Giraph?
      4. Giraph and the Hadoop Ecosystem
      5. Giraph and Other Graph-Processing Tools
      6. Summary
    2. Chapter 2: Modeling Graph Processing Use Cases
      1. Graphs Are Everywhere
        1. Modeling a Computer Network with a Simple, Undirected Graph
        2. Modeling a Social Network and Relationships
        3. Modeling Semantic Graphs with Multigraphs
        4. Modeling Street Maps with Graphs and Weights
      2. Comparing Online and Offline Computations
      3. Fitting Giraph in an Application
      4. Giraph at a Web-Search Company
      5. Giraph at an E-Commerce Company
      6. Giraph at an Online Social Networking Company
      7. Summary
    3. Chapter 3: The Giraph Programming Model
      1. Simplifying Large-Scale Graph Processing
        1. Hiding the Complexity of Parallel, Distributed Computing
        2. Programming through a Graph-Specific Model Based on Iterations
      2. A Vertex-centric Perspective
        1. The Giraph Data Model
        2. A Computation Based on Messages and Supersteps
        3. Reducing Messages with a Combiner
        4. Computing Global Functions with Aggregators
        5. The Anatomy of a Giraph Computation
      3. Computing In-Out-Degrees
      4. Converting a Directed Graph to Undirected
      5. Understanding the Bulk Synchronous Parallel Model
      6. Summary
    4. Chapter 4: Giraph Algorithmic Building Blocks
      1. Designing Graph Algorithms That Scale
      2. Exploring Connectivity
        1. Computing Shortest Paths
        2. Computing Connected Components
      3. Ranking Important Vertices with PageRank
        1. Ranking Web Pages
        2. PageRank
      4. Predicting Ratings to Compute Recommendations
        1. Modeling Ratings with Graphs and Latent Vectors
        2. Minimizing Prediction Error
      5. Identifying Communities with Label Propagation
      6. Characterizing Types of Graphs and Networks
      7. Summary
  11. Part II: Giraph Overview
    1. Chapter 5: Working with Giraph
      1. “Hello World” in Giraph
        1. Defining the Twitter Followership Graph
        2. Creating Your First Graph Application
        3. Launching Your Application
      2. Counting the Number of Twitter Connections
      3. Turning Twitter into Facebook
      4. Changing the Graph Structure
        1. Sending and Combining Multiple Messages
        2. Unit-Testing Your Giraph Application
      5. Beyond a Single Vertex View: Enabling Global Computations
        1. Using Aggregators
        2. Aggregators and Master Compute
      6. A Real-World Example: Shortest Path Finder
      7. Summary
    2. Chapter 6: Giraph Architecture
      1. Genesis of Giraph
      2. Giraph Building Blocks and Concepts
        1. Masters
        2. Workers
        3. Coordinators
      3. Bootstrapping Giraph Services
      4. Anatomy of Giraph Services
        1. Master Services
        2. Worker Services
        3. Coordination Services
      5. Fault Tolerance
        1. Disk Failure
        2. Node Failure
        3. Network Failure
      6. Summary
    3. Chapter 7: Graph IO Formats
      1. Graph Representations
      2. Input Formats
        1. Vertex-Based Input Formats
        2. Edge-Based Input Formats
        3. Combining Input Formats
        4. Input Filters
      3. Output Formats
        1. Vertex-Based Output Formats
        2. Edge-Based Output Formats
      4. Aggregator Writers
      5. Summary
    4. Chapter 8: Beyond the Basic API
      1. Graph Mutations
      2. The Mutation API
        1. Direct Mutations
        2. Mutation Requests
        3. Mutation Through Messages
      3. Resolving Mutation Conflicts
      4. The Aggregator API
      5. Centralized Algorithm Coordination
      6. Halting the Computation
      7. Using Aggregators for Coordination
      8. Writing Modular Applications
      9. Structuring an Algorithm into Phases
      10. The Composable API
      11. Summary
  12. Part III: Advanced Topics
    1. Chapter 9: Exposing Parallelism in Giraph
      1. Worker Computations
        1. Use Case: Sharing Data Across a Worker
        2. Use Case: Per-Worker Performance Statistics
        3. Thread Safety in Giraph
      2. Controlling Graph Partitioning
        1. The Importance of Partitioning
        2. Implementing Custom Partitioners
        3. Partition Balancing
      3. Summary
    2. Chapter 10: Advanced IO
      1. Accessing Data in Hive
        1. Reading Input Data
        2. Writing Output Data
      2. Accessing Data in Gora
        1. Reading Input Data
        2. Writing Output Data
      3. Summary
    3. Chapter 11: Tuning Giraph
      1. Key Giraph Performance Factors
      2. Giraph’s Requirements for Hadoop
        1. Hardware-related Choices
        2. Job-related Choices
      3. Tuning Your Data Structures
        1. The OutEdges Interface
        2. The MessageStore Interface
      4. Going Out-of-Core
        1. Out-of-Core Graph
        2. Out-of-Core Messages
      5. Giraph Parameters
      6. Summary
    4. Chapter 12: Giraph in the Cloud
      1. A Quick Introduction to Cloud Computing
      2. Giraph on the Amazon Web Services Cloud
        1. Before You Begin
        2. Creating Your First Cluster on the Amazon Cloud
        3. The Building Blocks of an EMR Cluster
        4. The Composition of an EMR Cluster: Instance Groups
        5. Deploying Giraph Applications onto an EMR Cluster
        6. EMR Cluster Data Processing Steps
        7. When Things Go Wrong: Debugging EMR Clusters
        8. Where’s My Stuff? Data Migration to and from EMR Clusters
        9. Putting It All Together: Ephemeral Graph Processing EMR Clusters
        10. Getting the Most Bang for the Buck: Amazon EMR Spot Instances
        11. One Size Doesn’t Fit All: Fine-Tuning Your EMR Clusters
      3. Summary
    5. Appendix A: Install and Configure Giraph and Hadoop
      1. System Requirements
        1. Hadoop Installation
        2. Giraph Installation
        3. Installing the Binary Release of Giraph
        4. Installing Giraph As Part of a Packaged Hadoop Distribution
        5. Installing Giraph by Building from Source Code
        6. Fundamentals of Hadoop and Hadoop Ecosystem Projects Configuration
        7. Configuring Giraph
        8. Configuring Hadoop
        9. Configuring Hadoop in Pseudo-Distributed Mode
      2. Summary
  13. Index