Chapter 1. Introduction to Graphs

The Power of Enterprise Graph Learning and Inference at Scale

Graphs serve as a fundamental tool for representing natural phenomena, as is evident in ecological and hydrological networks. In ecological networks such as the food web illustrated in Figure 1-1, nodes symbolize distinct species, and edges represent predator-prey interactions or trophic relationships. Hydrological networks exemplify the same phenomenon: nodes represent bodies of water (such as lakes, streams, and rivers), and edges depict the water flow between them. In the ecological example, the graph structure encapsulates how energy flows within an ecosystem; in the hydrological one, the graph representation helps researchers comprehend the complexities of water distribution patterns.

Fields like biology and neuroscience also study systems whose internal organization resembles a graph. In biology, graphs model molecular interactions, such as protein-protein interaction networks, metabolic pathways, and gene regulatory networks. Similarly, in neuroscience, graphs illustrate neuronal connections in the brain, shedding light on brain structure and function. Representing these systems as graphs makes it possible to visualize and analyze them, leading to advancements in medicine, neuroscience, and biotechnology.

Figure 1-1. A sample of a desert ecosystem food web

The power of graphs is also evident in designed technologies and systems, such as the social networks illustrated in Figure 1-2 and transportation networks. In social networks, individuals are represented as nodes, and their friendships are indicated by edges, creating a vast graph that spans millions of people around the world. Graph algorithms can be applied to this network, for example, to suggest new friendships (creating edges) between individuals. Similarly, transportation networks use graphs to model roads, railways, air routes, and public transit. Nodes represent locations (e.g., cities, intersections, stations), while edges represent connections (e.g., roads, tracks, flight paths). Graph algorithms are then applied to optimize various operations, such as route planning (establishing connections between two locations).

Figure 1-2. Simplified example of graph representation in social networks

The scenarios described above share a common feature: they all involve complex systems with complex relationships. When we refer to complex systems, we are talking about networks composed of numerous interconnected nodes that exhibit emergent behavior and nonlinear dynamics, making them challenging to predict or understand. These systems also feature complex relationships, characterized by multiple connections between nodes, further adding to their complexity. For example, in ecological food webs, various species interact through predation, competition, mutualism, and other ecological interactions. In this network, each species is connected to others by feeding relationships, forming a complex web of dependencies.

All these examples of graph applications highlight their inherent power. Another aspect of this power lies in the learning algorithms applied to these graphs. These algorithms, integral to graph learning, facilitate tasks such as suggesting new friendship connections or identifying optimal routes. Graph learning provides a comprehensive framework for solving problems represented graphically, which will be the focus of this book.

At the core of graph learning is the ability to distill complex relationships and structures. By representing data through nodes and edges, graphs facilitate the abstraction of complex details, allowing a focus on the essential connections that drive insights and understanding. Consequently, professionals across various fields can analyze large datasets more effectively and efficiently, uncovering patterns and trends that were once hidden by data’s overwhelming volume.

In the fast-paced, interconnected world of today, harnessing the power of graphs can significantly benefit businesses, governments, and individuals. This book delves into the significance of graph learning and inference at scale, employing graph-based algorithms and inference techniques to analyze vast datasets. We aim to show how these methods can transform data analysis, interpretation, and application, revealing their extensive potential.

For businesses, understanding complex data can lead to better decision-making, streamlined processes, and increased revenue. For example, retail companies can utilize graph learning for crafting targeted marketing strategies and offering personalized recommendations by deeply analyzing consumer behavior patterns and preferences. This nuanced understanding allows them to align their marketing efforts closely with customer expectations, enhancing the efficacy of their business operations. Similarly, financial institutions can leverage graphs for detecting fraudulent transactions and assessing credit risk with greater precision. By grasping the power of graphs, professionals can forge novel solutions to contemporary challenges, distinguishing themselves in their fields.

In scientific research, graph learning and inference enable the modeling of complex systems, such as protein interactions or ecological networks, opening the door to novel discoveries. By elucidating the connections between different components of these complex systems, scientists can create more accurate models, predict outcomes, and make informed decisions that propel human knowledge forward.

For governments and public entities, graph learning and inference are crucial in tackling some of today’s most critical challenges. Whether it’s tracking the spread of infectious diseases or understanding the dynamics of social unrest, graphs offer valuable insights that can guide policy-making.

As we edge toward the realization of artificial general intelligence (AGI)[1], the significance of graph learning and inference only grows. This increasing significance is clear in the remarkable increase in representation-learning research showcased at the International Conference on Learning Representations (ICLR) 2022, as highlighted by the keywords in Figure 1-3.

Figure 1-3. Top 20 keywords from work submitted to the International Conference on Learning Representations (ICLR) 2022[2]. Graph-based technologies are among the top active research areas.

In this book, we will explore both the theory and practice of enterprise graph learning and inference at scale, addressing the challenges involved in constructing end-to-end graph learning and inference pipelines capable of delivering both low latency and high throughput. We will lead you through the fundamentals of graph theory and introduce you to cutting-edge techniques and applications, all while providing a comprehensive framework to help you master the complex world of graphs.

One of the primary focuses of this book is tackling the scalability challenges in graph learning and inference. As data becomes larger and more complex (graphs exceeding millions of nodes and edges, datasets expanding beyond terabytes, and complexity escalating through deep, varied relationships), it becomes necessary to develop systems capable of managing this growth efficiently. We will cover various strategies and techniques to help you design and implement scalable graph learning and inference pipelines, ensuring that your solutions stay performant and effective as your data grows.

To accelerate the advancement of graph-based learning methodologies and inference systems, this book will introduce an open-source package that leverages widely used open-source tools, integrating best practices for the design and implementation of graph-based solutions. This package aims to not only save you time and effort but also to equip you with a robust foundation for developing advanced graph learning and inference systems customized to meet your unique needs.

By harnessing the power of graph learning and inference at scale, you will unlock the ability to tackle complex challenges and discover new opportunities for personal and professional development. Equipped with the knowledge and tools presented in this book, you will be prepared to navigate the challenges of an ever-more connected world, transforming how you analyze, interpret, and leverage data in your endeavors.

A Bird’s Eye View: Navigating the Book Chapters

If you’re curious about what to expect from this book and how to navigate it, we’ve got you covered.

The first three chapters are designed to familiarize you with the world of graphs. They cover the fundamentals of graphs, the graph learning pipeline, and traditional approaches to graph learning, with hands-on examples. Chapter 4 will introduce PyGraf[3], an open source library, explaining its architecture, its modules, and how it can be used. Chapters 5-9 cover advanced topics, such as graph neural networks and constructing large-scale networks, leveraging the capabilities of PyGraf.

Chapter 10 will focus on federated learning and differential privacy, offering an in-depth exploration of these concepts for building privacy-preserving graph learning and inference pipelines. Following this, Chapter 11 will cover different inference strategies, providing deeper insights into this area. To conclude, Chapter 12 will guide you on how to effectively monitor and refine these processes via feedback loops.

So, let’s continue with Chapter 1! This chapter is all about introducing you to the exciting world of graph learning in enterprise settings. We’ll start by breaking down the concepts of graphs and graph learning, making sure we’re all on the same page. After that, we’ll explore the incredible benefits of Graph Machine Learning (GML) across different applications. Trust us, you’ll be amazed by the scope of its applications.

We’ll also take a close look at real-world use cases that demonstrate the power and potential of graph learning. We’ll emphasize why it’s crucial for businesses to adopt these methods and techniques, and how they can drive success in various industries.

Towards the end of Chapter 1, we’ll address an important aspect: the limitations of graph learning when applied at an enterprise scale. It’s essential to understand these challenges, as they lay the groundwork for the subsequent chapters where we’ll go deeper into how to proactively mitigate these challenges. So, hang tight as we pave the way for tackling those obstacles head-on.

Hope that gives you a clear idea of what lies ahead! Get ready for an exciting journey through the world of graph learning in enterprise environments.

Graphs and Graph Learning

Now that we’ve provided a bird’s eye view of the book chapters, we’ll embark on our journey by laying the foundations for understanding graph learning. To begin, let’s explore the fundamental concepts of graphs and graph learning, survey their applications, and address the challenges and potential solutions for implementing graph learning at an enterprise scale.

What is a Graph?

In discrete mathematics, particularly within the field of graph theory, a graph is a structure that involves a collection of objects, where certain pairs of objects have a particular kind of association/relationship. These objects are represented as mathematical concepts known as vertices (also referred to as nodes or points), and the connections between pairs of vertices are termed edges (also known as links or lines).[4]
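To make this concrete, here is a minimal sketch of a graph built with NetworkX, a popular open source Python library (used here purely for illustration; the node names are arbitrary):

```python
import networkx as nx

# Vertices (nodes) and the edges (links) that connect pairs of them.
G = nx.Graph()
G.add_nodes_from(["A", "B", "C", "D"])
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "D")])

print(G.number_of_nodes())     # 4
print(list(G.neighbors("B")))  # ['A', 'C']
```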

We can think of graphs as complex topological structures with no fixed order or reference points: nothing inherently governs the organization of a graph’s nodes and edges. Consequently, graphs exhibit great variability in size and structure. Graphs can also be static or dynamic. Static graphs have a fixed structure in which nodes and edges do not change over time[5]. One example of a static graph is a snapshot of a road network, where road intersections are the nodes and roads are the edges.

Dynamic graphs, on the other hand, have time-varying structures: nodes and edges can be added or removed over time. Many applications involve dynamic graphs. For instance, social networks such as Facebook and X[6] are dynamic graphs that evolve over time as new people join, existing people deactivate or delete their accounts, new friendships are formed, and old ones are removed.

Figure 1-4 represents a dynamic graph of a social network in which the temporal aspect is captured as an asynchronous stream of events: the graph changes over time, and each change is recorded as an event. At t = 0, the network has only three people, with two friendship connections: between Person 1 and Person 2, and between Person 2 and Person 3. At t = 1, Person 4 joined the platform; a node was created for them and connected to Person 3. At t = 2, Person 4 connected with Person 2, creating a new edge linking the two.

Figure 1-4. A dynamic graph representing a social network that evolves over time. Three event sequences at different timestamps (t0, t1, t2) depict the evolution of the network in terms of the addition of new nodes and the formation of new edges.
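A minimal sketch of how such an asynchronous event stream might be replayed onto a graph, following the events of Figure 1-4 (the event encoding is illustrative, not a standard format):

```python
import networkx as nx

# Hypothetical event stream mirroring Figure 1-4: (timestamp, event, payload).
events = [
    (0, "add_edge", ("Person 1", "Person 2")),
    (0, "add_edge", ("Person 2", "Person 3")),
    (1, "add_edge", ("Person 4", "Person 3")),  # Person 4 joins, linked to Person 3
    (2, "add_edge", ("Person 4", "Person 2")),
]

G = nx.Graph()
for t, event, payload in events:
    if event == "add_edge":
        G.add_edge(*payload)  # add_edge also creates missing endpoint nodes
    print(f"t={t}: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
```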

We refer to the data contained within a graph as “graph data”: structured information represented as a graph of interconnected nodes and edges. Graph data spans various domains, such as transportation networks, social networks, biological networks, finance, and linguistics. The nature of the data can vary, and we will explore this in the Graph Data Levels section in Chapter 2.

When discussing graph data and its characteristics, it’s important to understand the specifics of the graph. Graphs can take different forms, including directed and undirected types, and may feature multiple edges between nodes. Let’s cover these distinctions:

Directed vs. undirected graph

Directed graphs have edges with a specific direction, meaning the relationship between nodes is one-way. Undirected graphs, on the other hand, lack edge directions, implying a two-way relationship between nodes.

Multiple edges between nodes

Although we often simplify by assuming a single edge connecting a pair of nodes, it’s possible for two nodes to share multiple edges. These graphs are termed multigraphs.

Edge labeling

Edges within a graph can have a label, conveying additional information about the relationship between nodes. These labels might represent timestamps for friendship connections in a social network, distances or costs in a road network, or any other relevant properties associated with the edges.
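The sketch below illustrates all three distinctions using NetworkX; the node names and attributes are invented for illustration:

```python
import networkx as nx

# Directed graph: edges have a direction (one-way relationships).
D = nx.DiGraph()
D.add_edge("Alice", "Bob")  # Alice -> Bob only, not Bob -> Alice

# Multigraph: multiple parallel edges between the same pair of nodes.
M = nx.MultiGraph()
M.add_edge("Alice", "Bob", kind="friend")
M.add_edge("Alice", "Bob", kind="coworker")

# Edge labeling: arbitrary attributes attached to an edge.
R = nx.Graph()
R.add_edge("Intersection 1", "Intersection 2", distance_km=1.2)
print(R.edges(data=True))
# [('Intersection 1', 'Intersection 2', {'distance_km': 1.2})]
```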

Graph learning, on the other hand, refers to the process of extracting useful insights and patterns from graph-structured data using machine learning techniques. By leveraging the properties of graphs, graph learning algorithms can uncover hidden patterns and relationships within the data (such as identifying tightly connected communities within a social network based on interactions, interests, or affiliations), enabling more effective decision-making and predictions.

Having established an introductory understanding of the concepts of graphs and graph data, the next section focuses on exploring the methodologies involved in representing graph data.

Graph Data Representation

A graph is also a form of representation used to simplify a situation or problem. Mapping a scenario onto a graph is known as graph data representation: the problem’s data is expressed as a graph, with nodes representing entities and edges (or links) representing relationships between those entities. The first step in this mapping process is to identify the entities that will be represented as nodes and the relationships (edges) between them.

Once the nodes and edges have been identified, a visual representation of the problem, in the form of a graph (nodes connected by edges), can be created. This visual mapping is referred to as “graph modeling”. It is useful for gaining an intuitive understanding of the problem and visualizing the relationships between different nodes. While this step is merely illustrative and hence optional, various software tools can be used to generate the visual representations; popular choices include Gephi, Cytoscape, and Graphviz. Other text-to-graph tools are designed to handle unstructured data (e.g., raw text) and use Natural Language Processing (NLP) techniques to automatically generate a graph representation from the nodes and edges identified in the text; examples include Neo4j and IBM Watson Knowledge Studio. In most of these tools, however, visualizing the graph is only practical when the number of nodes is manageable[7].

Figure 1-5 shows a graph modeling, or visualization, of a simplified road network. Here the graph nodes are road intersections, connected to one another by roads, which represent the edges.

Figure 1-5. Representation of a road network as a graph. The image represents the map of Merrion Square Park in Dublin, Ireland.

Using the nodes and edges identified in the previous step, the second step of graph representation is finding a mathematical representation that captures the graph structure. This could mean modeling the structure using matrices or vectors, which can subsequently be used for graph-based computations. In general, this mathematical representation can be done using traditional or advanced techniques. Traditional graph representation techniques include the adjacency matrix and the adjacency list. Adjacency matrices are square matrices that represent the relationships between nodes in a graph, where 1 indicates an edge between two nodes and 0 indicates no edge. Adjacency lists represent each node and its neighbors as a list. You will learn more about this in Chapter 2.
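As a preview of Chapter 2, here is a small sketch of both traditional representations, built with NetworkX (the graph itself is arbitrary):

```python
import networkx as nx

G = nx.Graph([("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")])
nodes = sorted(G.nodes())

# Adjacency matrix: a square matrix with 1 where an edge exists, 0 otherwise.
A = nx.to_numpy_array(G, nodelist=nodes, dtype=int)
print(nodes)
print(A)

# Adjacency list: each node mapped to the list of its neighbors.
adj_list = {n: sorted(G.neighbors(n)) for n in nodes}
print(adj_list)
# {'A': ['B', 'C'], 'B': ['A', 'C'], 'C': ['A', 'B', 'D'], 'D': ['C']}
```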

In terms of advanced graph representation techniques, embeddings serve as the foundation. Graph embeddings, for example, are a way of representing nodes and edges in a graph as numerical vectors, which can be fed as inputs into machine learning algorithms. Their goal is to encapsulate the structural and semantic information of the graph in a way that can be easily processed by graph algorithms. There are various techniques for generating graph embeddings, and we’ll cover how these techniques are implemented through interactive examples from Chapters 3 to 8.

Until this point, we’ve seen how graph representation can be used to map a complex phenomenon or real-world network into a visual representation, and how these representations are then encoded mathematically using traditional or advanced graph representation techniques. The latter representation is subsequently fed into machine learning (ML) models as input data. But how does this representation actually work within the machine learning model?

Well, once this representation is generated, we can apply graph algorithms: algorithms designed to solve graph-related problems, such as finding the shortest path in a transportation network (as in Google Maps) or predicting new nodes or edges[8]. The relationships between different data points are important to consider when performing ML tasks. Using graph data representation allows machine learning models to directly incorporate this structural information into their predictions and decisions, which can improve performance and accuracy.

Now that you’ve learned about graphs and graph data representation, let’s explore graph learning.

Graph Learning

Graph learning refers to the process of applying machine learning algorithms to analyze, learn, and make predictions on graphs. This can involve common GML tasks such as node classification (assigning labels to nodes), link prediction (predicting connections between nodes), and clustering (grouping nodes based on similarities). You’ll explore these tasks further in Chapter 2, with an example in Chapter 3.

Once the graph data has been represented as described in the preceding section, we can apply graph learning, i.e., machine learning algorithms, to solve problems on graphs. For example, Graph Convolutional Networks (GCNs) can be used for classification tasks.

The most recent approach in graph learning uses a more advanced representation of graph data, known as graph representation learning. Let’s look at what this is and how it differs from graph data representation.

Graph representation learning involves transforming the graph structure into a simplified, low-dimensional space, essentially a mathematical space with few dimensions. Imagine the nodes and edges of the graph as points and lines within this space. It serves as a method for visually and computationally representing the structure of a graph. This process, known as embedding, aims to capture information about individual nodes or edges by assigning them coordinates within this low-dimensional space. These embeddings encode core details about the graph’s connectivity patterns, spanning both local and global aspects. For instance, they can reflect the proximity of nodes and their roles within different subgraphs, which are smaller, self-contained segments of the larger graph.

A common methodology for graph representation learning is the use of deep neural architectures, including, but not limited to, Graph Convolutional Networks (GCNs), Graph Autoencoders (GAEs), Graph Attention Networks (GATs), and Graph Isomorphism Networks (GINs)[9]. These methods learn embeddings by iteratively transforming the features of a node’s neighbors, via convolution or attention mechanisms, until a fixed-size embedding is produced. These approaches will be covered in greater detail in Chapters 3 and 4.

Other graph representation learning techniques include random walk-based algorithms. These algorithms use the sequences of nodes visited during random walks to train node embeddings, effectively capturing the local and global structure of the graph. This will be covered in more detail in Chapter 2.
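As a rough sketch of the idea (in the spirit of DeepWalk, not an implementation of any specific library), the walk-generation step might look like this; the resulting sequences would then be fed to a skip-gram model to learn the embeddings:

```python
import random
import networkx as nx

def random_walks(G, walk_length=5, walks_per_node=2, seed=42):
    """Generate node sequences via uniform random walks."""
    rng = random.Random(seed)
    walks = []
    for start in G.nodes():
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = list(G.neighbors(walk[-1]))
                if not neighbors:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

walks = random_walks(nx.karate_club_graph())
print(walks[0])  # one sequence of node IDs, starting at node 0
```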

Once learned, node embeddings can be utilized for a multitude of machine learning tasks, such as node/vertex classification, link prediction, and graph clustering. Node embeddings can serve as input features for the models used to approach these tasks, complementing classic graph algorithms such as PageRank[10] and shortest-path algorithms[11].

Overall, graph representation learning has become an important component of graph learning, as it allows for the efficient processing of large and complex graphs. Its quality also influences the ability of machine learning models to make accurate predictions on graph data.

Let’s use social networks as an example to demonstrate how graph learning works at a high level. You might use graph learning to predict future interactions between users (such as a potential friendship connection), recommend content, or identify communities a user might want to join. To apply ML, you’d need to encode the pairwise properties between nodes, such as the type and strength of the relationship between two people (nodes). The ML model might also need global positioning information about each node so that it can include the person’s local neighborhood in the contextual information it captures about them.
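As a minimal illustration of the friendship-suggestion idea, here is a classic non-learning baseline, the Jaccard coefficient over shared neighbors (the toy network and names are made up; real systems would typically use learned embeddings):

```python
import networkx as nx

# Toy friendship network (illustrative).
G = nx.Graph([
    ("Ana", "Ben"), ("Ana", "Cal"), ("Ben", "Cal"),
    ("Ben", "Dee"), ("Cal", "Dee"), ("Dee", "Eli"),
])

# Jaccard coefficient: overlap between two nodes' neighborhoods. A higher
# score suggests a more plausible "people you may know" candidate.
for u, v, score in nx.jaccard_coefficient(G, [("Ana", "Dee"), ("Ana", "Eli")]):
    print(f"{u} - {v}: {score:.2f}")  # Ana - Dee: 0.67, Ana - Eli: 0.00
```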

Note

The difference between graph data representation and graph representation learning is that graph data representation is about structuring your data in the form of a graph, whereas graph representation learning is a machine learning task that learns to represent parts or all of your graph as low-dimensional vectors (embeddings) that can be used for various tasks.

Graph learning has made remarkable progress, as evidenced by its successful application in various use cases. Given the importance of scalability in enterprise contexts, however, scalable graph learning has been adopted to accommodate the growth associated with these settings.

Now that we’ve grasped what graph learning is, let’s look into scalable graph learning and its prerequisites.

Scalable Graph Learning: Addressing the Requirements

Because of the dynamic nature of the enterprise ecosystem, graph learning should ideally be scalable. Scalable graph learning refers to the ability to perform machine learning on large-scale graph datasets efficiently. It involves the development of algorithms and models capable of processing graphs containing millions or billions of nodes and edges, while effectively leveraging computational resources and memory utilization through distribution and parallelism. Here are some of the requirements for scalable graph learning:

Data Quality and Integration

Enterprise-ready graph learning systems need to be able to handle and integrate data from a variety of sources, including structured and unstructured data, and ensure that the data is accurate, consistent, and up-to-date.

Graph Storage

Scalable graph learning requires efficient graph storage mechanisms that can handle large-scale graphs. By “large-scale graphs”, we mean graphs at the enterprise level, characterized by an extensive number of nodes and edges, often representing a network of millions or even billions of connections[12]. Distributed storage systems such as the Hadoop Distributed File System (HDFS) and Apache Cassandra are examples of storage solutions commonly used in this context.

Distributed Computing

Scalable graph learning requires distributed computing frameworks that can distribute both the graph data and the computation across multiple machines. An example of such a framework is Apache Spark, which is widely used for processing large-scale datasets across a cluster of machines.

Integration with Other Systems

A scalable graph learning system should seamlessly integrate with existing enterprise infrastructures including data warehouses, business intelligence tools, and visualization platforms. It should offer compatibility, APIs, and connectors to easily ingest data from various sources.

Scalability

Graph learning systems often need to process and analyze large and complex graphs with millions or billions of nodes and edges, and they must do so efficiently and effectively. Scalability is, in fact, one of the key challenges restricting graph-based industrial applications, which often need to deal with very large, multi-sourced data while maintaining low latency. The association between graph scalability and latency constraints is strong and varies with the application and infrastructure settings; together, they limit the ability to deliver offline-trained graph-based models into live, serving graph products.

Security and Privacy

Graph learning systems designed for enterprise use, particularly when handling sensitive data, need to prioritize data security as a means to prevent unauthorized breaches. This involves giving importance to measures such as implementing secure data pipelines, employing strong authentication methods, enabling auditing and logging functionalities, ensuring a secure infrastructure, and complying with relevant data protection regulations. Moreover, when scalability and privacy are key considerations, the adoption of federated learning techniques can be beneficial. By utilizing federated learning, graph learning models can be trained on decentralized data sources while maintaining data privacy and security. Chapters 8 and 9 will provide a more in-depth exploration of these techniques.

Hardware Acceleration

Graph learning algorithms are usually computationally intensive, so hardware acceleration plays a crucial role in speeding up their computations. Accelerators such as Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs) can significantly speed up execution. By leveraging their specialized architectures, graph learning algorithms can be executed efficiently, leading to faster training on large-scale graph datasets.

Advantages of Scalable Graph Learning in Enterprise

In this section, we will explore in detail the benefits of employing scalable graph learning in enterprise settings and emphasize how this approach can drive significant advancements in various domains. By harnessing the unique capabilities of graph learning, businesses can tap into deeper data insights, enabling strategic advantages that are not readily achievable with traditional analytics techniques. Let’s explore these benefits in two distinct categories: technical and business advantages.

From a technical standpoint, scalable graph machine learning offers several advantages. It enhances data representation, enabling a comprehensive and nuanced understanding of complex data structures. This capability allows graphs to capture higher-order interactions between entities, leading to improved predictive modeling and robustness. Moreover, it integrates seamlessly with existing data management systems. In terms of privacy and security, scalable graph learning can be combined with privacy-preserving techniques like federated learning and differential privacy, ensuring the confidentiality and protection of sensitive data. This is especially important for organizations that deal with confidential information and must comply with regulatory requirements.

These technical aspects translate into significant business benefits. They enhance business insights by revealing hidden patterns and relationships, resulting in more informed decision-making and a competitive edge through the identification of market opportunities. Furthermore, they enable personalized offerings and more effective marketing strategies which could lead to increased customer retention, higher revenue, and so on.

By harnessing the power of scalable graph learning in enterprise environments, organizations can unlock new possibilities in data analysis, predictive modeling, and decision-making, propelling them toward a more data-driven future.

As we transition to the next section, “Large-Scale Graphs in Real-World Enterprises: Use Cases,” we will explore how these benefits of scalable graph learning manifest in practical applications. We will examine a range of breakthroughs and advancements in various industries that have leveraged large-scale graph learning to address complex challenges and drive innovation. By understanding these real-world use cases, you will gain a clearer perspective on the potential impact of scalable graph learning techniques and how they can be harnessed to transform enterprise operations across diverse domains.

Large-Scale Graphs in Real-World Enterprises: Use Cases

Graph technology is rapidly permeating various aspects of our daily lives, profoundly impacting the way we interact with the world around us. From utilizing Google Maps for seamless navigation between locations to engaging with others on social media platforms like X[13] and Facebook, graph applications are transforming our experiences.

In the subsequent sections, we will cover a diverse range of large-scale graph applications across multiple domains, including geospatial analysis, drug discovery, and fraud detection within the financial industry. By exploring these examples, you will gain a deeper understanding of how graph-based solutions are revolutionizing different sectors and driving innovation.

Travel-time predictions on Google Maps

If you have used Google Maps, you’ve experienced firsthand how it utilizes graph theory and graph learning algorithms to model road networks and accurately predict travel times. By constructing a graph representation of the road network, with nodes representing intersections or points of interest and edges representing roads or highways, Google Maps efficiently calculates travel times between two locations by finding the shortest path in this graph.

To achieve this, Google Maps employs a blend of graph search algorithms, such as Dijkstra’s algorithm for finding the shortest path and A* search for efficient pathfinding. Additionally, it integrates machine learning techniques to account for variables such as distance, speed limits, and traffic conditions, ensuring accurate and up-to-date results. These algorithms dynamically adapt to changing factors, providing users with reliable navigation guidance. For deeper insights into these algorithms, you can explore resources like the book Graph Algorithms (O’Reilly).
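A small sketch of the shortest-path idea, running Dijkstra’s algorithm on a toy weighted graph (the network and the edge weights, standing in for travel times, are invented):

```python
import networkx as nx

# Toy road network: nodes are intersections, weights are minutes of travel.
roads = nx.Graph()
roads.add_weighted_edges_from([
    ("A", "B", 4), ("A", "C", 2), ("B", "C", 1),
    ("B", "D", 5), ("C", "D", 8), ("D", "E", 3),
])

route = nx.dijkstra_path(roads, "A", "E", weight="weight")
minutes = nx.dijkstra_path_length(roads, "A", "E", weight="weight")
print(route, minutes)  # ['A', 'C', 'B', 'D', 'E'] 11
```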

Drug development: Halicin

Molecules can be modeled as graphs with atoms as nodes and bonds as edges, with properties attached to nodes and edges for additional detail. Graph learning algorithms are effective at analyzing these structures, accelerating drug development by predicting molecular characteristics such as solubility and toxicity. In particular, Graph Convolutional Networks (GCNs), a type of neural network suited to graph data that we will revisit in later chapters, have been used to predict the antibacterial properties of molecules, advancing antibiotic discovery.

Figure 1-6. End-to-end pipeline used by MIT researchers to advance antibiotic discovery using graph learning algorithms (“A Deep Learning Approach to Antibiotic Discovery”)

Figure 1-6 presents the end-to-end pipeline used while developing this transformative approach to antibiotic discovery utilizing the power of graph neural networks. This method vastly accelerates the search for new antibiotics by computationally navigating through extensive chemical territories, which traditionally required laborious and costly experimental screenings. The pipeline included the following high-level steps:

Data Preparation and Analysis:

A large dataset of chemical data, including molecular structures and properties effective against E. coli, is compiled. A Graph Convolutional Network (GCN) analyzes this data, representing molecules as graphs with atoms and bonds as nodes and edges, respectively.

Model Training and Prediction:

The GCN model is trained on the molecular data and then used to predict potential antibacterial molecules. These predictions are based on the model’s ability to recognize patterns that indicate effectiveness against E. coli.

Evaluation and Refinement:

Predictions are evaluated for accuracy, with the most promising candidates identified as ‘leads’. The model is iteratively refined using this evaluation feedback to improve its predictive accuracy.

Optimization and Validation:

Leads are optimized for increased antibacterial efficacy and safety, and the model’s predictions are scaled up for larger datasets. The results of the machine learning approach are compared with those from traditional experimental methods.

The process demonstrates GML’s role in accelerating drug discovery. A model trained on 2,335 molecules identified new antibiotics, and when applied to over 107 million molecules, it pinpointed compounds active against E. coli. Candidates were ranked by the model’s predicted effectiveness. One standout candidate, Halicin, proved effective against various bacteria and safe for human cells, marking a significant step in combating antibiotic resistance. Though further research is needed to understand Halicin’s mechanisms, its discovery is a notable breakthrough in the field.
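For intuition, here is the core GCN propagation rule from Kipf and Welling (2016) as a NumPy sketch: each layer mixes every atom’s features with those of its bonded neighbors. This is only the arithmetic of one layer, not the MIT pipeline itself, and the toy molecule and random weights are purely illustrative:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))  # degree normalization
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(0, A_norm @ H @ W)           # ReLU activation

# Toy "molecule": 4 atoms, bonds given by the adjacency matrix.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
H = np.random.rand(4, 3)   # 3 input features per atom (e.g., atom type)
W = np.random.rand(3, 8)   # layer weights (learned in practice, random here)
print(gcn_layer(A, H, W).shape)  # (4, 8): an 8-dim embedding per atom
```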

Fraud detection

GML is a powerful tool for detecting fraudulent transactions in real-time. In financial institutions, transactions are often processed through complex networks of accounts, transactions, and relationships. These networks can be represented as graphs, with nodes representing accounts and the edges representing transactions or relationships between accounts. The labels for the nodes could include information about the account such as the account holder’s name, account balance, and transaction history. Similarly, the labels for the edges could also include additional information about the transaction such as the amount, date, and type of transaction. It is also possible to include additional nodes in the graph to represent other entities, such as merchants or payment processors. Figure 1-7 depicts an example representation of this network. It provides a conceptual way to visualize complex transactions between different entities. Nodes represent interacting entities in the systems (such as merchants or cardholders) with additional information shared via labels (location), and the relationships between nodes indicate the occurrence of transactions with labels to highlight additional transaction properties (for instance, amount and transaction type).

Figure 1-7. Illustration of a graph data model which depicts a transaction network, allowing a visual representation of interactions between entities through labeled nodes and relationships.

By analyzing these graphs, ML algorithms can identify patterns and anomalies that may suggest fraudulent activity: a cardholder suddenly making a large number of purchases in a short period of time, a high volume of transactions from a single cardholder to multiple merchants in a short period, a large number of transactions from a single merchant to multiple cardholders, or a merchant experiencing a high rate of chargebacks are all patterns suggestive of fraud. GML systems identify these patterns by analyzing the graph’s edges and labels, and the results are used to flag transactions for further investigation or to trigger fraud alerts.
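A toy sketch of one such rule, flagging a cardholder who reaches many distinct merchants within a short window (the data, threshold, and window are invented; production systems combine many such signals with learned models):

```python
from datetime import datetime, timedelta

# Hypothetical transaction edges: (cardholder, merchant, timestamp).
transactions = [
    ("card_1", "shop_A", datetime(2024, 1, 1, 12, 0)),
    ("card_1", "shop_B", datetime(2024, 1, 1, 12, 2)),
    ("card_1", "shop_C", datetime(2024, 1, 1, 12, 3)),
    ("card_2", "shop_A", datetime(2024, 1, 1, 9, 0)),
]

def flag_burst_activity(txns, window=timedelta(minutes=10), threshold=3):
    """Flag cardholders linked to >= threshold merchants within the window."""
    flagged = set()
    for card, _, t in txns:
        merchants = {m for c, m, u in txns
                     if c == card and abs(u - t) <= window}
        if len(merchants) >= threshold:
            flagged.add(card)
    return flagged

print(flag_burst_activity(transactions))  # {'card_1'}
```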

GML can process enormous amounts of data in real-time, even the hundreds of thousands of transactions that financial institutions typically process every day. This makes it excellent at quickly identifying suspicious transactions and alerting financial institutions, so they can take corrective action before any funds are lost. Graph ML can also learn and adapt over time. As an algorithm processes more data, its ability to detect fraudulent activities and eliminate false positives improves. In the next section, we’ll show you how this works.

The Evolution of Graphs and Graph Learning: From Early Beginnings to Modern Applications

The field of graph learning has evolved significantly over the past several decades. In this section, we will explore the key milestones and breakthroughs that have shaped this evolution from its early beginnings to the present day. Figure 1-8 depicts snapshots of this evolution.

Figure 1-8. Evolution of Graph Learning Systems [Sources: Era 1[14], Era 2[15], Era 3[16], Era 4[17], Era 5[18], Era 6[19], final image[20]]

Era 1: The Foundation of Graph Theory and Algorithms (1736-1970)

This era marks the birth of graph theory. It all began when Leonhard Euler introduced the concept of a graph and solved the Seven Bridges of Königsberg problem[21]. As time went on, many exciting developments took place in the graph space, most notably the creation of Dijkstra’s algorithm, which became a powerful tool for finding the shortest path in graphs.

Era 2: Further Advances in Graph Algorithms and Technologies (1970-1999)

During this period, significant advances in graph technologies and algorithms occurred, including the development of graph-based data structures, graph traversal algorithms, graph coloring algorithms, and the establishment of graph theory as a foundational discipline in computer science.

This era also marks the very early work on graph storage, with relational databases emerging as the dominant technology for data storage and querying. A relational database systematically arranges data in rows and columns, which collectively form a table. Typically, data spans several tables, which can be joined together via primary and foreign keys.[22] This configuration, inherent to a relational database management system (RDBMS), facilitates the efficient execution of relational operations such as JOINs, enabling rapid retrieval and correlation of data. However, relational databases struggled to model complex data relationships.

Furthermore, the Resource Description Framework (RDF) was introduced in 1999, establishing a framework for describing data on the web in a graph-based structure, which later contributed to the emergence of Linked Data and the Semantic Web. Each RDF statement is a three-part structure, called a triple, comprising three resources, each identified by a URI. Each triple is represented as (Subject, Predicate, Object). The adoption of RDF as a data representation facilitates seamless identification, faster disambiguation, and interconnection of information by AI systems.[23]
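A small sketch of triples in Python using the rdflib library (the namespace and facts are invented for illustration):

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()

# Each RDF statement is a (subject, predicate, object) triple.
g.add((EX.Dublin, EX.locatedIn, EX.Ireland))
g.add((EX.Dublin, EX.population, Literal(544107)))  # illustrative value

for subject, predicate, obj in g:
    print(subject, predicate, obj)
```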

Similarly, the development of semantic graphs has witnessed significant progress. Semantic web technologies have evolved, giving rise to the concept of Knowledge Graphs, which serve as a more flexible and expressive form of graph database. Knowledge graphs store structured data and knowledge about entities and their relationships, establishing contextual relationships through linking and semantic metadata and providing a structured approach to integrating, unifying, analyzing, and sharing data.[24] At the core of a knowledge graph lies a comprehensive knowledge model, comprising interconnected descriptions of concepts, entities, relationships, and events. A prime example is Google’s Knowledge Graph[25], an expansive and contextually enriched data structure that semantically connects various informational elements.

Era 3: Emergence of Graph Databases and Graph Query Languages (2000s-2006)

With the increasing importance of graphs in various domains, the need for efficient storage and retrieval mechanisms for graph data became apparent. This era saw the emergence of graph databases such as Neo4j, OrientDB, and ArangoDB, and the development of graph query languages like Cypher, SPARQL, and Gremlin. A graph database (GDB) is a database designed to store and manage data using graph structures with nodes, edges, and properties; it also enables semantic queries.[26] This graph-based structure facilitates the creation and execution of complex queries.

Era 4: Graph Analytics and Traditional Machine Learning (2007-2011)

Graph analytics frameworks such as Apache Giraph and GraphX emerged during this period, enabling large-scale graph processing and analysis. At the same time, researchers began to explore the application of traditional machine learning techniques to graph data, paving the way for graph analytics. GML tasks such as link prediction, node classification, and graph clustering became increasingly popular.

Era 5: Rise of Graph Neural Networks (2012-2018)

The introduction of Graph Convolutional Networks (GCNs) marked a significant turning point in GML, as researchers started to leverage deep learning techniques for graph data. This era saw the development of various graph neural network architectures, including GraphSAGE[27], Graph Attention Networks (GAT)[28], and Graph Isomorphism Networks (GIN)[29].

Era 6: Scalability, Robustness, and Enterprise Applications (2019-Present)

The most recent era in the evolution of GML has focused on addressing the challenges of scalability, robustness, and privacy in graph learning, along with an increasing emphasis on real-world enterprise applications. Novel techniques, such as graph partitioning and Cluster-GCN, have emerged to handle large-scale graph learning problems.

Challenges of Enterprise-Ready Graph Learning Systems

Images and sequential data, such as text and audio, have a linear structure, an order, and a point of reference. Graphs, on the other hand, have arbitrary dimensions and complex topology, and they lack a reference point or concept of spatial location. Furthermore, “graphs cannot be represented as a single entity”[30] due to their inherent composition of nodes and edges. This means that information in a graph can be missing or noisy, and the graph may be difficult to fully capture with a single representation. Enterprise graph networks also tend to be dynamic and multi-modal, meaning they represent different data types (text, images, audio, etc.). All these characteristics present challenges in enterprise settings. Here, we’ll look at a few of those challenges in more detail.

Data Harmonization Challenges

Enterprise-ready graph learning systems need to handle and integrate data from a variety of sources and providers while ensuring that the data is consistent. This data varies in format (unstructured, semi-structured, structured) and granularity. In the context of graph data representation, harmonization can be a challenge, since graph data frequently consists of a large number of interconnected entities and connections that can be hard to reconcile and standardize across multiple sources. Here are two examples of why data harmonization can be difficult, stemming from differences in schemas, ontologies, and standardization across data sources:

Differences in Schemas and Ontologies

Different sources may use different schemas or ontologies to represent the same concepts. For example, one source may represent a person using “name” and “age” fields, while another may use “first_name” and “last_name” fields to represent the same information. To harmonize these data sources, the data must be transformed into a common format understood by all of them (a process also called “reconciliation”) without introducing errors due to discrepancies in representation formats.

Differences in the Applied Standardizations

Different sources may use different data standards or conventions for representing data. For example, one source may use the ISO 8601[31] standard for representing dates, while another may use a different standard. To use these data sources together, all the data must be transformed into a common format that adheres to a single standard, as in the sketch below.
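A minimal sketch covering both examples, mapping two hypothetical schemas onto one common format and normalizing dates to ISO 8601 (the field names and formats are invented):

```python
from datetime import datetime

# Source 1 uses a single "name" field and ISO 8601 dates;
# source 2 splits the name and uses a US-style date format.
source_1 = {"name": "Ada Lovelace", "signup_date": "2024-03-15"}
source_2 = {"first_name": "Ada", "last_name": "Lovelace",
            "signup_date": "03/15/2024"}

def harmonize(record):
    """Reconcile both schemas into one common format."""
    name = record.get("name") or f'{record["first_name"]} {record["last_name"]}'
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):  # accepted input formats
        try:
            date = datetime.strptime(record["signup_date"], fmt)
            break
        except ValueError:
            continue
    else:
        raise ValueError(f"Unrecognized date: {record['signup_date']}")
    return {"name": name, "signup_date": date.date().isoformat()}  # ISO 8601

print(harmonize(source_1) == harmonize(source_2))  # True
```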

Computationally Intensive Workloads

The development of GML systems can be both computationally and resource-intensive. This is mainly because each stage in the GML pipeline requires a different set of resources, such as storage for graph data, computing power for data preparation and training, and memory for storing parameters. Here, we will discuss the computational challenges associated with graph data, graph complexity, and training procedures.

Graph Data Processing Challenges

When working with graph data, two properties add challenges to its operational aspects: the volume and the high dimensionality of graph data. The complexity of storing and processing these massive, high-dimensional datasets is compounded by the constant problem of scaling the training process. To properly train a GML system, both factors demand more computing resources, ranging from memory and storage to processing power.

Graph Complexity Challenges

Graphs can capture complex relationships between nodes and edges, which can be difficult to model and analyze. This can require the use of more sophisticated machine learning algorithms and techniques, which can be computationally intensive.

Training Challenges

GML systems may require distributed training, which involves training the model across multiple machines or devices. This can be challenging to set up and manage, and it can also increase the computational resources required to train the model.

Dynamic Evolving Graphs

When it comes to complexity in graph systems, the dynamically evolving nature of graphs presents a unique challenge: both the structure of the graph and the relationships between nodes can change frequently over time. Traditional algorithms that assume a static graph structure are insufficient here; we need dynamic models that can accurately predict or classify based on changing graph data. Building such models is demanding, as it requires a high level of adaptability, flexibility, and computational power to analyze constantly changing data accurately and efficiently. These properties are difficult to achieve at scale, because models must keep learning and updating their predictions in response to new data. Another challenge is the sheer volume of data generated: as the graph changes over time, new data is constantly being added, and models must be able to handle and analyze it in real time.

Some algorithms and techniques used to analyze dynamic evolving graphs include graph clustering, community detection, and centrality measures. These methods can be used to identify patterns and trends in the data and to understand how the graph is evolving over time.

Active Monitoring and Drift Detection

The fourth complexity challenge is active monitoring and drift detection. For GML systems in production, these pose difficulties because they require the system to continuously monitor data and model performance and to identify when the models are no longer producing accurate predictions.

The primary hurdle lies in the need to continually process and analyze enormous volumes of data in real time, precisely evaluating the effectiveness of sophisticated models and comparing present performance against historical performance to pinpoint plausible deviations or drifts in both the data and the model. This requires a robust, efficient system for storing and accessing historical data and model performance metrics, which can be difficult to design and implement in a production context, and it demands significant computational power along with efficient data processing techniques, both challenging to provide at scale. A further difficulty is the complexity of the models themselves: GML models often involve complex processes and many parameters, making it hard to assess their performance in a timely manner and to detect when they are no longer making accurate predictions.
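One simple building block for drift detection is a two-sample statistical test comparing a feature’s training distribution against its recent production distribution. Here is a sketch using SciPy (the data and threshold are illustrative; real systems track many features and model metrics):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# A feature as seen at training time vs. in recent production traffic.
training = rng.normal(loc=0.0, scale=1.0, size=5_000)
production = rng.normal(loc=0.4, scale=1.0, size=5_000)  # shifted: drift

# Two-sample Kolmogorov-Smirnov test: are the distributions the same?
statistic, p_value = ks_2samp(training, production)
if p_value < 0.01:
    print(f"Drift detected (KS={statistic:.3f}); consider retraining.")
```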

Real-Time Inference

When it comes to real-time predictions, GML models need to be accurate while responding with sub-millisecond latency. That’s not an easy task: these models involve complex operations that demand a lot of computing power, making those sub-millisecond response times a challenge to achieve.

Several factors affect how quickly these models can make inferences in real time. Things like the size and complexity of the graph, the machine learning algorithms chosen, and the hardware and software infrastructure used for running the models all play a role. That’s why it’s crucial to optimize both the model design and the infrastructure to reduce the time needed for each prediction. This can involve techniques such as model pruning, quantization, and parallelization. It also means fine-tuning the hardware and software infrastructure to minimize any extra time or effort involved in running the models.
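As one concrete example of these optimizations, here is a sketch of post-training dynamic quantization in PyTorch, applied to a toy dense head standing in for a full GML model (quantization typically shrinks the model and speeds up CPU inference at a small accuracy cost):

```python
import torch
import torch.nn as nn

# A toy scoring head standing in for the dense layers of a GML model.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

# Dynamic quantization: store Linear weights as int8, dequantize on the fly.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x).shape)  # torch.Size([1, 1])
```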

Summary

In this chapter, we’ve laid the fundamentals for enterprise-ready graph learning, emphasizing the importance of scalable pipelines. These principles form the basis for the entire book’s content, setting the stage for our exploration of graph learning’s potential in the following chapters.

As we conclude this introductory chapter, let’s preview the journey ahead. Our book’s overarching goal is to provide a comprehensive understanding of graph learning, its diverse applications across domains such as healthcare, finance, and biology, and the unique challenges and opportunities it offers. We will start by delving into traditional machine learning for graphs, covering aspects like pipeline development, feature engineering, representation learning, and semi-supervised node classification, along with an introduction to our open-source graph library.

In subsequent chapters, we’ll dive into various graph pipelines, utilize traditional techniques to learn from graphs, cover deep learning and graph neural networks (GNNs), explore scalable node embeddings, discuss large-scale graph neural networks, investigate enterprise applications, and address privacy preservation. Finally, we’ll cover graph inference strategies and the monitoring and feedback loop in GML systems. This roadmap promises an enriching exploration of graph learning and its potential impact on diverse industries and applications.

1 AGI refers to AI systems having human-like intelligence and adaptability, capable of understanding, learning, and performing a wide range of tasks across multiple domains. AGI is a more advanced form of AI, as opposed to specialized AI, which is designed to solve a specific problem (e.g., facial recognition). AGI remains a research goal.

2 The data for this visualization was obtained from https://github.com/EdisonLeeeee/ICLR2022-OpenReviewData and accessed in October 2023.

3 PyGraf is the tentative name of our open source library and is subject to change.

4 Trudeau, Richard J. (1993). Introduction to Graph Theory

5 The static graph will have the same number of nodes and edges, and no node or edge being added or removed over time.

6 Formerly known as Twitter (accessed via www.twitter.com)

7 There is a limitation on how many nodes and edges we can draw manually, as well as in most graph visualization software.

8 These algorithms will be covered in greater detail throughout this book: core graph algorithms and traditional GML (Chapter 2), and graph neural networks (Chapters 3-9).

9 You can read about the architectures of these networks on https://distill.pub/2021/gnn-intro/

10 Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Computer networks and ISDN systems, 30(1-7), 107-117.

11 Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (1989). Introduction to Algorithms. MIT Press.

12 Besta, M., & Hoefler, T. (2024). Parallel and distributed graph neural networks: An in-depth concurrency analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence.

13 Formerly known as Twitter (accessed via www.twitter.com)

14 The Königsberg bridge problem: Euler, L. (1741). Solutio problematis ad geometriam situs pertinentis. Commentarii academiae scientiarum Petropolitanae, 128-140.

15 Hopcroft, J. E., & Karp, R. M. (1973). An n^5/2 algorithm for maximum matchings in bipartite graphs. SIAM Journal on computing, 2(4), 225-231.

16 Castillo, R., Rothe, C., & Leser, U. (2010). RDFMatView: Indexing RDF Data for SPARQL Queries.

17 Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6), 1373-1396.

18 Kipf, T. N., & Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

19 https://theaisummer.com/gnn-architectures

20 Weis, J. W., & Jacobson, J. M. (2021). Learning on knowledge graph dynamics provides an early warning of impactful research. Nature Biotechnology, 39(10), 1300-1307.

21 https://en.wikipedia.org/wiki/Seven_Bridges_of_K%C3%B6nigsberg

22 https://www.ibm.com/topics/relational-databases

23 https://www.ontotext.com/knowledgehub/fundamentals/what-is-rdf/

24 https://www.ontotext.com/knowledgehub/fundamentals/what-is-a-knowledge-graph

25 https://en.wikipedia.org/wiki/Google_Knowledge_Graph

26 Bourbakis, Nikolaos G. (1998). Artificial Intelligence and Automation

27 Hamilton, W., Ying, Z., & Leskovec, J. (2017). Inductive representation learning on large graphs. Advances in neural information processing systems, 30.

28 Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., & Bengio, Y. (2017). Graph attention networks. stat, 1050(20), 10-48550.

29 Xu, K., Hu, W., Leskovec, J., & Jegelka, S. (2018). How powerful are graph neural networks?. arXiv preprint arXiv:1810.00826.

30 https://cronfa.swan.ac.uk/Record/cronfa56088

31 ISO 8601 is an international standard for representing date and time-related data (see ISO 8601 for more details).
