Chapter 1. Connect and Explore Data

In the previous chapter, we showed the potential of graph analytics and machine learning applied to human and business endeavors, and we proposed to present the details in three stages: the power of connected data, the power of graph analytics, and the power of graph machine learning. In this chapter, we will take a deep dive into the first stage: the power of connected data.

Before we delve into the power of connected data, we need to lay some groundwork. We start by introducing the concepts and nomenclature of the graph data model. If you are already familiar with graphs, you may want to skim this section to check that we’re on the same page with regard to terminology. Besides graphs themselves, we’ll cover the important concepts of a graph schema and traversing a graph. Traversal is how we search for data and connections in a graph.

Along the way, we talk about the differences between graph and relational databases and how we can ask questions and solve problems with graph analytics that would not be feasible in a relational database.

From that foundational understanding of what a graph is, we move on to examples of the power of a graph, illustrating six ways that graph data provides you with more insight and more analytical capability than tabular data.

After completing this chapter, you should be able to:

  1. Use the standard terminology for describing graphs

  2. Know the difference between a graph schema and a graph instance

  3. Create a basic graph model or schema from scratch or from a relational database model

  4. Apply the “traversal” metaphor for searching and exploring graph data

  5. Understand six ways that graph data empowers your knowledge and analytics

  6. State the entity resolution problem and show how graphs resolve this problem

Graph Structure

In the previous chapter, we introduced you to the basic idea of a graph. In this section, we are going to go deeper. First we will establish the terminology that we will be using for the rest of this book. Then we will talk more about the idea of a graph schema, which is the key to having a plan for, and awareness of, your data’s structure.

Graph Terminology

Suppose you’re organizing data about movies, actors, and directors. Maybe you work for Netflix or one of the other streaming services, or maybe you’re just a fan.

Let’s start with one movie, Star Wars: A New Hope, its three main actors, and its director. If you were building this in a relational database, you could record this information in a single table, but the table would grow quickly and rapidly become unwieldy. How would we even record details about a movie, the fact that 50 actors appeared in it, and the details of each of those actors’ careers, all in one table?

Best practice for the design of relational databases would suggest putting actors, movies, and directors each into a separate table, but that would mean also adding cross-reference tables to handle the many-to-many relationships between actors and movies and between movies and directors.

So in total you’d need five tables just to represent this example in a relational database, as in Figure 1-1.

Separating different types of things into different tables is the right answer for organizing the data, but to see how one record relates to another, we have to rejoin the data. A query asking which actors worked with which directors would involve building a temporary table in memory, called a join table, containing every combination of rows across the tables you’ve referenced that satisfies the conditions of the query. Join tables are expensive in terms of memory and processor time.

Figure 1-1. Diagram of relational tables for a simple movie database
Figure 1-2. Temporary table created from relational database query showing how three actors are linked to George Lucas via the movie Star Wars

As we can see from Figure 1-2, there is a lot of redundant data in this table join. For very large or complex databases, you would want to think of ways to structure the data and your queries to optimize the join tables.

However, if we compare that to the graph approach, as shown in Figure 1-3, one thing becomes immediately clear: a graph can directly show how one data element is related to another. That is, the relationships between the data points are built into the database and don’t have to be constructed at runtime. So one of the key differences between a graph and a relational database is that in a graph database, the relationships between data points are explicit.

Figure 1-3. Graph showing our basic information about Star Wars

Each actor, movie, and director is called a node or a vertex (plural: vertices). Vertices represent things, physical or abstract. In our example, the graph has five vertices. The connections between vertices are called edges, and they describe the relationships between the vertices. Edges are also considered data elements. This graph has four edges: three showing how actors are related to a movie (acted_in) and one showing a director’s relationship to a movie (directed). In its simplest form, a graph is a collection of vertices and edges. We will use the general term object to refer to either a vertex or an edge.

With this graph, we can answer a basic question: which actors have worked with the director George Lucas? Starting from George Lucas, we look at the movies he directed, which include Star Wars, and then we look at the actors in that movie: Mark Hamill, Carrie Fisher, and Harrison Ford.
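To make that traversal concrete, here is a minimal sketch of this tiny movie graph as an in-memory edge list, answering the same question in plain Python. The function name and data layout are our own illustration, not code from a graph database.

```python
# A minimal sketch: the Star Wars graph as a list of (source, edge_type, target) edges.
edges = [
    ("George Lucas", "directed", "Star Wars"),
    ("Mark Hamill", "acted_in", "Star Wars"),
    ("Carrie Fisher", "acted_in", "Star Wars"),
    ("Harrison Ford", "acted_in", "Star Wars"),
]

def neighbors(vertex, edge_type):
    """Return the vertices connected to `vertex` by edges of `edge_type`."""
    out = [dst for src, etype, dst in edges if src == vertex and etype == edge_type]
    out += [src for src, etype, dst in edges if dst == vertex and etype == edge_type]
    return out

# "Which actors have worked with George Lucas?" is a two-hop traversal:
# director -> movies he directed -> actors in those movies.
actors = {
    actor
    for movie in neighbors("George Lucas", "directed")
    for actor in neighbors(movie, "acted_in")
}
print(actors)  # {'Mark Hamill', 'Carrie Fisher', 'Harrison Ford'}
```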

It can be useful or even necessary to distinguish the direction of an edge. In a graph database, an edge can be directed or undirected. A directed edge has a specific directionality, going from a source vertex to a target vertex. We draw directed edges as arrows.

By adding a directed edge, we can also show hierarchy and sequence; for example, The Empire Strikes Back was the sequel to Star Wars (Figure 1-4).

Figure 1-4. Multi-movie graph with a directed edge. This shows how we begin to build up the database with additional movies and production personnel. Note the directed edge, is_sequel_of, which provides the context to show that Empire was the sequel to Star Wars and not vice versa.

To do more useful work with a graph, however, we will want to add more details about each vertex or edge, such as an actor’s birthdate or a movie’s genre.

This book describes property graphs. A property graph is a graph where each vertex and each edge can have properties which provide the details about individual elements. If we look again at relational databases, properties are like the columns in a table. Properties are what make graphs truly useful. They add richness and context to data, enabling us to develop more nuanced queries to extract just the data that we need. Figure 1-5 shows the Star Wars graph with some added properties.

Figure 1-5. Graph with properties

Graphs offer us another choice for modeling properties. Instead of treating genre as a property of movies, we could make each genre a separate vertex. Why do this? When the property is categorical, then we expect lots of other vertices to have the same property value (e.g., there are lots of sci-fi movies). All the sci-fi movies will link to the Sci-fi vertex, making it incredibly easy to search them or to collect statistics about them, such as, what was the top grossing sci-fi movie? All the non-sci-fi movies have already been filtered out for you. Either way, the additional data allows us to refine our queries to find just the information we need.
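The following sketch contrasts the two modeling choices. The movie properties shown are illustrative placeholder values, not data from the book.

```python
# Option 1: genre as a vertex property -- finding sci-fi movies scans every movie.
movies = {  # illustrative figures only
    "Star Wars": {"genre": "Sci-fi", "gross": 775_000_000},
    "American Graffiti": {"genre": "Comedy", "gross": 140_000_000},
}
scifi_by_property = [m for m, props in movies.items() if props["genre"] == "Sci-fi"]

# Option 2: genre as a vertex -- each Genre vertex keeps edges to its movies,
# so "all sci-fi movies" is a one-hop neighborhood with no full scan.
genre_edges = {"Sci-fi": ["Star Wars"], "Comedy": ["American Graffiti"]}
scifi_by_vertex = genre_edges["Sci-fi"]

# Either way, we can now ask for the top-grossing sci-fi movie.
top = max(scifi_by_vertex, key=lambda m: movies[m]["gross"])
print(top)  # Star Wars
```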

In our movie database example, we might want to create a new type of vertex called Character so we can show who played what role.

Figure 1-6 shows our Star Wars graph with the addition of Character vertices. The interesting thing about Darth Vader, of course, is that he was played by two people: David Prowse (in costume) and James Earl Jones (voice). Fortunately our database can represent this reality with a minimum of modification.

Figure 1-6. Movie graph with Actor and Character types. The flexibility of this schema enables us to easily show two actors portraying one character.

What else can we do with this graph? Well, it’s flexible enough to allow us to add just about every person who was involved in the production of this movie from the director and actors to make-up artists, special effects artists, key grip and even best boy. Everyone who contributed to a movie could be linked using an edge called worked_on and an edge property called role which could include director, actor, voice actor, camera operator, key grip and so on.

If we then built up our database to include thousands of movies and everyone who had worked on them, we could use graph algorithms to answer questions like, which actors do certain directors like to work with most? With a graph database you can answer less obvious questions like who are the specialists in science fiction special effects, or which lighting technicians do certain directors like to work with most? Interesting questions for companies that sell graphics software or lighting equipment.

With a graph database, you can connect to multiple data sources, extract just the data you need as vertices and run queries against the combined dataset. If you had access to a database of lighting equipment used on various movie projects, you could connect that to your movie database and use a graph query to ask which lighting technicians have experience with what equipment.

Table 1-1 summarizes the essential graph terminology we have introduced.

Table 1-1. Glossary of essential graph terminology
graph: A collection of vertices, edges, and properties used to represent connected data and support semantic queries.
vertex (a): A graph object used to represent a thing or entity. Plural: vertices.
edge: A graph object which links two vertices, often used to represent a relationship between two objects or things.
property: A variable associated with a vertex or edge, often used to describe it.
schema: A database plan comprising vertex and edge types and associated properties which will define the structure of the data.
directed edge / undirected edge: A directed edge represents a relationship with a clear semantic direction, from a source vertex to a destination vertex. An undirected edge represents a relationship in which no direction is implied.

(a) Another commonly used alternative name is node; which term you use is a matter of personal preference. It has been proposed that the upcoming ISO standard query language for property graphs accept either VERTEX or NODE.

Graph Schemas

In the previous section, we intentionally started with a very simple graph and then added complexity, not only by adding more vertices, edges, and properties, but also by adding new types of vertices and edges. To model and manage a graph well, especially in a business setting, it’s essential to plan out your data types and properties.

We call this plan a graph schema, or graph data model, analogous to the schema or entity-relationship model for a relational database. It defines the types of vertices and edges that our graph will contain as well as the properties associated with these objects.

You could make a graph without a schema by just adding arbitrary vertices and edges, but you’d quickly find it difficult to work with and difficult to make sense of. Also, if you wanted to search the data for all the movies, for example, it would be extremely helpful to know that they are all in fact referred to as “movie” and not “film” or “motion picture”!

It’s also helpful to settle on a standard set of properties for each object type. If we know all movie vertices have the same core set of properties, such as title, genre, and release date, then we can easily and confidently perform analysis on those properties.

Figure 1-7 shows a possible schema for a movie graph database. It systematically handles several of the data complexities that arose as we talked about adding more and more movies to the database.

Figure 1-7. Graph schema for movie database

Let’s run through the features of the schema:

  • A Person vertex type represents a real-world person, such as George Lucas.

  • The Worked_on edge type connects a Person to a Movie. It has a property to describe the person’s role: director, producer, actor, gaffer, etc. By having the role as a property, we can support as many roles as we want with only one vertex type for persons and one edge type for working on a film. If a person had multiple roles, then the graph can have multiple edges.1 Schemas only show one of each type of object.

  • The Character vertex type is separate from the Person vertex type. One Person could portray multiple Characters (Tyler Perry in the Madea films), or multiple Persons could portray one Character (David Prowse, James Earl Jones, and Sebastian Shaw as Darth Vader in Return of the Jedi).

  • The Movie vertex type is straightforward.

  • Is_sequel_of is a directed edge type, telling us that the source Movie is the sequel of the destination Movie.

  • As noted before, we chose to model the genre of a movie as a vertex type instead of as a property, to make it easier to filter and analyze movies by genre.

The key to understanding schemas is that having a consistent set of object types makes your data easier to interpret.
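To show what such a plan might look like in code, here is a rough sketch of the movie schema as plain Python data. The property lists and the edge names other than Worked_on and Is_sequel_of (Portrays, Appears_in, Has_genre) are placeholders of our own; Figure 1-7 defines the actual schema.

```python
# A sketch of a schema as data: vertex types with properties, edge types with endpoints.
SCHEMA = {
    "vertex_types": {
        "Person":    ["name", "birthdate"],
        "Movie":     ["title", "release_date"],
        "Character": ["name"],
        "Genre":     ["name"],
    },
    "edge_types": {
        # name: (source type, target type, directed?, properties)
        "Worked_on":    ("Person", "Movie", False, ["role"]),
        "Is_sequel_of": ("Movie", "Movie", True, []),
        "Portrays":     ("Person", "Character", False, []),
        "Appears_in":   ("Character", "Movie", False, []),
        "Has_genre":    ("Movie", "Genre", False, []),
    },
}

def validate_edge(edge_type, src_type, dst_type):
    """Check that an edge insertion matches the declared schema."""
    s, d, directed, _ = SCHEMA["edge_types"][edge_type]
    if directed:
        return (src_type, dst_type) == (s, d)
    return {src_type, dst_type} == {s, d}

print(validate_edge("Worked_on", "Person", "Movie"))     # True
print(validate_edge("Is_sequel_of", "Movie", "Person"))  # False
```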

Traversing a Graph

Traversing a graph is the fundamental metaphor for how a graph is searched and how the data is gathered and analyzed. Imagine the graph as a set of interconnecting stepping stone paths, where each stepping stone represents a vertex. There are one or more agents who are accessing the graph. To read or write a vertex, an agent must be standing on its stepping stone. From there, the agent may step or traverse across an edge to a neighboring stone/vertex. From its new location, the agent can then take another step. Remember: if two vertices are directly connected, it means there is a relationship between them, so traversing is following the chain of relationships.

Hops and Distance

Traversing one edge is also called making a hop. An analogy to traversing a graph is moving on a game board, like the one shown in Figure 1-8. A graph is an exotic game board, and you traverse the graph as you would move across the game board.

Figure 1-8. Traversing a graph is like moving on a game board

In many board games, when it is your turn, you roll a die to determine how many steps or hops to take. In other games, you may traverse the board until you reach a space of a certain type. This is exactly like traversing a graph in search of a particular vertex type.

Graph hops and distance come up in other real-world situations. You may have heard of “six degrees of separation.” This refers to the belief that everyone in the U.S. is connected to everyone else through at most six hops of relationship. Or, if you use the LinkedIn business network app, you have probably seen that when you look at a person’s profile, LinkedIn will tell you if they are connected to you directly (one hop), through two hops, or through three hops.

Breadth and Depth

There are two basic approaches to systematically traversing a graph to conduct a search. Breadth-first search (BFS) means visiting each of your direct neighbors before continuing the search to the next level of neighbors, then the level after that, and so on. Graph databases with parallel processing can accelerate BFS by having multiple traversals take place at the same time.

Depth-first search (DFS) means following a single chain of connections as far as you can before backtracking to try other paths. Both BFS and DFS will eventually visit every vertex, unless you stop because you have found what you sought.
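The two strategies are easy to see side by side in a short sketch over a toy adjacency list (the graph here is illustrative, not one from the book):

```python
from collections import deque

graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

def bfs(start):
    """Visit vertices level by level: all 1-hop neighbors, then 2-hop, and so on."""
    visited, queue, order = {start}, deque([start]), []
    while queue:
        v = queue.popleft()
        order.append(v)
        for nbr in graph[v]:
            if nbr not in visited:
                visited.add(nbr)
                queue.append(nbr)
    return order

def dfs(start, visited=None):
    """Follow one chain of connections as far as possible before backtracking."""
    if visited is None:
        visited = []
    visited.append(start)
    for nbr in graph[start]:
        if nbr not in visited:
            dfs(nbr, visited)
    return visited

print(bfs("A"))  # ['A', 'B', 'C', 'D', 'E'] -- level by level
print(dfs("A"))  # ['A', 'B', 'D', 'C', 'E'] -- one chain at a time
```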

Graph Modeling

Now you know what a graph is and what a graph schema is. But how do you come up with a good graph model?

Start by asking yourself these questions:

  • What are the key objects or entities that I care about?

  • What are the key relationships that I care about?

  • What are the key properties of entities that I want to filter on?

Schema Options and Tradeoffs

As we have seen, good graph schema design represents data and relationships in a natural way that allows us to traverse vertices and edges as if they were real-world objects. As with any collection of real-world things, there are many ways we could organize our collection to optimize searching and extracting what we need.

In designing a graph database, two considerations that will influence the design are the format of our input data and our query use cases. And as we will see in this section, a key tradeoff is whether we want to optimize our schema to use less memory or make queries run faster.

Vertex, edge or property?

If you are converting tabular data into a graph, the natural approach seems to be to convert each table to a vertex type and each table column to a vertex property. In fact, a column could map to a vertex, an edge, a property of a vertex, or a property of an edge.

Entities and abstract concepts generally map to vertices, and you can think of them as nouns. Relationships generally map to edges, and you can think of them as verbs. Descriptors are analogous to adjectives and adverbs and can map to vertex or edge properties depending on the context and your query use case.

At first glance, it would appear that storing object attributes as close to the object as possible -- i.e., as properties -- would be the optimal solution. However, consider a use case in which you need to optimize your search for product color. Color is a quality that would usually be modeled as a property of a product vertex, but then searching for blue objects would necessitate looking at every product vertex.

In a graph, you can create a search index by defining a vertex type called color and linking the color vertex and the product vertex via an undirected edge. Then to find all blue objects, you simply start from the color vertex blue and find all linked product vertices. This speeds up query performance with the tradeoff being greater complexity and higher memory usage.
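A small sketch of the tradeoff, with illustrative product data of our own, shows the two access patterns side by side:

```python
# Scanning a property on every product vertex versus following edges
# from a Color vertex that acts as a search index.
products = {
    "p1": {"name": "mug", "color": "blue"},
    "p2": {"name": "lamp", "color": "red"},
    "p3": {"name": "chair", "color": "blue"},
}

# Property scan: touches every product vertex.
blue_by_scan = [pid for pid, props in products.items() if props["color"] == "blue"]

# Color vertices: one extra vertex per color plus one edge per product,
# but "all blue products" becomes a single one-hop lookup.
color_edges = {"blue": ["p1", "p3"], "red": ["p2"]}
blue_by_index = color_edges["blue"]

print(blue_by_scan == blue_by_index)  # True -- same answer, different cost profile
```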

Edge direction

Earlier we introduced the concept of directionality in edges and noted that you can, in your design schema, define an edge type as directed or undirected. In this section we’ll discuss the benefits and tradeoffs of each type. We’ll also discuss a hybrid option available in the TigerGraph database.

Edge direction is so useful that you might think you should use directed edges all the time, but as with all things computational, there are benefits and tradeoffs in your choice of edge type.

Undirected edge

Links any two vertices of the defined types with no directionality implied. The benefit is that undirected edges are easy to work with when creating links and easy to traverse in either direction. For example, if users and email addresses are both vertex types, you can use an undirected edge to find someone’s email address but also to find all the users who share that same email address, something you can’t do with a directed edge.

The tradeoff with an undirected edge is it does not give you contextual information such as hierarchy. If you have an enterprise graph and want to find the parent company, for example, you can’t do this with undirected edges because there is no hierarchy. In this case you would need to use a directed edge.

Directed edge

Represents a relationship with a clear semantic direction, from a source vertex to a destination vertex. The benefit of a directed edge is that it gives you more contextual information, and it is likely to be more efficient for the database to store and handle than an undirected edge. The tradeoff, however, is that you can’t trace it backward should you need to.

Directed edge paired with a reverse directed edge

You can have the benefits of directional semantics and traversing in either direction if you define two directed edge types, one for each direction. For example, to implement a family tree, you could define a child_of edge type to traverse down the tree and a parent_of edge type to traverse up the tree. The tradeoff, though, is you have to maintain two edge types: every time you insert or modify one edge, you need to insert or modify its partner. The TigerGraph database makes this easier by allowing you to define the two types together and to write data ingestion jobs that handle the two together.
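A minimal sketch of the idea, keeping the two edge types in sync by hand (TigerGraph can pair the types for you; this generic code just illustrates the bookkeeping involved):

```python
child_of = []   # (child, parent) directed edges, for traversing up the tree
parent_of = []  # (parent, child) reverse edges, for traversing down the tree

def add_parent_child(parent, child):
    """Insert both halves of the relationship so either direction is traversable."""
    child_of.append((child, parent))
    parent_of.append((parent, child))

add_parent_child("Anakin", "Luke")
print([c for p, c in parent_of if p == "Anakin"])  # ['Luke']
print([p for c, p in child_of if c == "Luke"])     # ['Anakin']
```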

As you can see, your choice of edge type will be influenced by the types of queries you need to run balanced against operational overheads such as memory, speed and coding.

Tip

If the source vertex and destination vertex types are different, such as Person and Product, you can usually settle for an undirected edge and let the vertex types provide the directional context. It’s when the two vertex types are the same and you care about direction that you must use a directed edge.

Granularity of edge type

How many different edge types do you need and how can you optimize your use of edge types? In theory, you could have one edge type -- undirected -- that linked every type of vertex in your schema. The benefit would be simplicity -- only one edge type to remember! -- but the tradeoffs would be the number of edge properties you would need for context and slower query performance.

At the other extreme, you could have a different edge type for each type of relationship. For instance, in a social network, you could have separate edge types for coworker, friend, parent_of, child_of, and so on. This would be very efficient to traverse if you were looking for just one type of relationship, such as professional networks. The tradeoff is the need to define new edge types to represent new types of relationships and a loss of abstraction -- i.e., an increase in complexity -- in your code.

Modeling Interaction Events

In many applications, we want to track interactions between entities, such as a financial transaction where one financial account transfers funds to another account. You might think of representing the transaction (transferring funds) as an edge between two Account vertices. If you have multiple occurrences, will you have multiple edges? While it seems easy to conceive of this (Figure 1-9), in the realms of both mathematical theory and real-world databases, this is not so straightforward.

Figure 1-9. Multiple events represented as multiple edges

In mathematics, having multiple edges between a given pair of vertices goes beyond the definition of ordinary graphs into multi-edges and multigraphs. Due to this complexity, not all graph databases support this, or if they do, they don’t have a convenient way to refer to a specific edge in the group. Another way to handle this is to model each interaction event as a vertex, and use edges to connect the event to the participants (Figure 1-10[a]). Modeling an event as a vertex provides the greatest flexibility for linking it to other vertices and for designing analytics. A third way is to create a single edge between the two entities and aggregate all the transactions into an edge property (Figure 1-10[b]).

Figure 1-10. Two alternate ways to model multiple events: (a) events as vertices and (b) a single event edge with a property that contains a list of occurrences.
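Here is a hedged sketch of options (a) and (b) for repeated transfers between two accounts; the account names and amounts are illustrative only.

```python
# (a) Each transaction is its own vertex, linked to the two accounts.
transactions = {
    "t1": {"amount": 100, "from": "A", "to": "B"},
    "t2": {"amount": 250, "from": "A", "to": "B"},
}
# Easy to filter on transaction properties, e.g. all transfers over 200:
large = [t for t, props in transactions.items() if props["amount"] > 200]

# (b) A single edge between the accounts whose property is a list of occurrences.
edge_a_b = {"from": "A", "to": "B", "amounts": [100, 250]}
total = sum(edge_a_b["amounts"])

print(large, total)  # ['t2'] 350
```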

Table 1-2 summarizes the pros and cons of each approach. The simplest model is not always your best choice, because application requirements and database performance issues may be more important.

Table 1-2. Comparing options for modeling multiple occurrences of an interaction
Model: Multiple edges
Benefit: Simple model
Tradeoff: Database support is not universal

Model: Vertex linked to related vertices
Benefits: Filtering on vertex properties; ease of analytics, including community and similarity of events; advanced search tree integration
Tradeoffs: Uses more memory; takes more steps to traverse

Model: Single edge with a property recording details of occurrences
Benefits: Less memory usage; fewer steps to traverse between users
Tradeoffs: Searching on transactions is less efficient; slower update/insert of the property

Adjusting your design schema based on use case

Suppose you are creating a graph database to track events in an IT network. We’ll assume you would need these vertex types: event, server, IP, event type, user and device. But what relationships would you want to analyze and what edges would you need? The design would depend on what you wanted to focus on, and your schema could be event-centered or user-centered.

For the event-centered schema (Figure 1-11[a]), the key benefit is that all related data is just one hop away from the event vertex. This makes it straightforward to find communities of events, find servers that processed the most events of a given type, and find the servers that were visited by any given IP. The tradeoff is that from a user perspective, the user is two hops away from a device or IP vertex.

Figure 1-11. Two options for arranging the same vertex types: [a] event-centered, and [b] user-centered.

We can fix this by making our schema user-centered, at the expense of separating events from IPs and servers by two hops and event types from devices, servers, and IPs by three hops (Figure 1-11[b]). However, these disadvantages might be worth the tradeoff of being able to do useful user-centered analysis, such as detecting blacklisted users, finding whitelisted users that are similar to blacklisted users, and finding the paths between two users.

Transforming Tables into a Graph

You won’t always create graph databases from scratch. Often, you’ll be taking data that is already stored in tables and then moving or copying the data into a graph. But how should you reorganize the data into a graph?

Migrating data from a relational database into a graph database is a matter of mapping the tables and columns onto a graph schema, so that each table or column corresponds to a graph object: a vertex type, an edge type, or a property. Table 1-3 outlines a simple example of mapping bank transaction data from a relational database to a graph database.

Table 1-3. Example of mapping tables in a relational database to vertices, edges and properties in a graph database
Source (relational database): Table Customers, with columns including customer_id, first_name, last_name, DOB
Destination (graph database): Vertex type Customer, with corresponding properties customer_id, first_name, last_name, DOB

Source: Table Banks, with columns bank_id, bank_name, routing_code, address
Destination: Vertex type Bank, with properties bank_name, routing_code, address

Source: Table Accounts, with columns bank_id, customer_id
Destination: Vertex type Account, with properties bank_id, customer_id

Source: Table Transactions, with columns source_account, destination_account, amount
Destination: Vertex type Transaction, with properties source_account, destination_account, amount; or a directed edge transaction, with properties source_account, destination_account, amount

The graph schema would be as shown in Figure 1-12.

Figure 1-12. Graph schema for a simple banking database with transactions as separate vertices.
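A rough loading sketch for the mapping in Table 1-3 might look like the following. The sample rows, the account_id column, and the edge names (owns, from_account, to_account) are placeholders of our own; the table above lists only the column mapping, not the loading code.

```python
customer_rows = [{"customer_id": "c1", "first_name": "Ada", "last_name": "Li", "DOB": "1990-01-01"}]
account_rows = [{"account_id": "a1", "bank_id": "b1", "customer_id": "c1"}]
txn_rows = [{"source_account": "a1", "destination_account": "a2", "amount": 99.50}]  # a2's row omitted for brevity

vertices, edges = {}, []

for row in customer_rows:              # Customers table -> Customer vertices
    vertices[("Customer", row["customer_id"])] = row

for row in account_rows:               # Accounts table -> Account vertices plus ownership edges
    vertices[("Account", row["account_id"])] = row
    edges.append(("owns", ("Customer", row["customer_id"]), ("Account", row["account_id"])))

for i, row in enumerate(txn_rows):     # Transactions table -> Transaction vertices plus edges
    vertices[("Transaction", f"t{i}")] = {"amount": row["amount"]}
    edges.append(("from_account", ("Transaction", f"t{i}"), ("Account", row["source_account"])))
    edges.append(("to_account", ("Transaction", f"t{i}"), ("Account", row["destination_account"])))

print(len(vertices), len(edges))  # 3 3
```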

One of the key decisions in creating your data schema is deciding which columns need to be mapped to their own vertices. For instance, people are generally key to understanding any real-life situation -- whether they be customers, employees or others -- so they would generally map to their own vertices.

In theory, every column in a relational database could become a vertex in your schema, but this is unnecessary and would quickly become unwieldy. In the same way that you have to think about structuring a relational database, optimizing a graph database is about understanding the real-world structure of your data and how you intend to use it.

In a graph database, the key columns from your relational database become vertices and the contextual or supporting data become properties of those vertices. Edges generally map to foreign keys and cross reference tables.

Some graph databases have tools that facilitate the importing of tables and mapping of foreign keys to vertex and edge IDs.

As with a relational database, a well-structured graph database eliminates redundant or repetitive data. This not only ensures efficient use of computing resources but, perhaps more importantly, ensures the consistency of your data by ensuring that it doesn’t exist in different forms in different locations.

Optimizing mapping choices

Simple mapping of columns to vertices and vertex properties works, but it may not take advantage of the richness of connections available in a graph, and in reality it is often necessary to adjust mapping choices based on differing search use cases.

For instance, in a graph version of a contacts database, a person’s mobile number and email address are generally represented as properties of that person’s vertex.

However, if you were trying to use a banking application to detect fraud, you might want to separate email addresses and telephone numbers out as separate vertices because they are useful in linking people and financial transactions.

It is not uncommon for information from multiple tables to map to one vertex or edge type. This is especially common when the data is coming from multiple sources, each of which provides a different perspective on the same real-world entities. Likewise, one table can map to more than one vertex and edge type.

Model Evolution

Most likely, your data is going to evolve over time, and you will need to adjust the schema to take account of new business structures and external factors. That’s why schemas are designed to be flexible, to allow the system to be adapted over time without having to start from scratch.

If we look at the banking sector, for instance, financial institutions are constantly moving into new markets, either through geographical expansion or introducing new types of products.

As a simple example, let’s assume we have a bank that’s always operated in a single country. The country of origin for all its customers is therefore implicit. However, moving into a second country would require updating the database to include country data. One could either add a country property to every vertex type for which it was relevant or create a new vertex type called country and create vertices for each country in which the bank operated.

With a flexible schema, the schema can be updated by adding the new vertex type and then linking customer vertices to the new country vertex.

Although this is a simple example, it shows how modeling data can be an evolutionary process. You can start with an initial model, perhaps one that closely resembles a prior relational database model. After you use your graph database for a while, you may learn that some model changes would serve your needs better. Two common changes are converting a vertex property into an independent vertex type and adding additional edge types.

Adapting a graph to evolving data can be simple. Adding a property, a vertex type, or an edge type is easy. Connecting two different data sets is easy, as long as we know how they relate. We can add edges to connect related entities, and we can even merge entities from two sources which represent the same real-world entity.

Graph Power

We’ve now seen how to build a graph, but the most important question that needs to be answered is why build a graph. What are the advantages? What can a graph do for you that other data structures don’t do as well? We call graph technology’s collected capabilities and advantages graph power.

What follows are the key facets of graph power. We humbly admit that this is neither a complete nor the best possible list. We suspect that others have presented lists that are more complete and more precise in a mathematical sense. Our goal, however, is not to present theory but to make a very human connection: to take the ideas that resonate with us and to share them with you, so that you will understand and experience graph power on your own.

Connecting the Dots

A graph forms an actionable body of knowledge.

As we’ve seen, connecting the dots is graph power at its most fundamental level. Whether we are linking actors and directors to movies or financial transactions to suspected fraudsters, a graph lets you describe the relationship between one entity and another across multiple hops.

The power of graph comes from being able to describe a network of connections, detect patterns, and extract intelligence from those patterns. While individual vertices may not contain the intelligence we are looking for, taken together, we may discover patterns in the relationships between multiple vertices that reveal new information.

With this knowledge we can begin to infer and predict from the data, like a detective connecting the dots in a murder investigation.

In every detective story, the investigator gathers a set of facts, possibilities, hints, and suspicions. But these isolated bits and pieces are not the answer. The detective’s magic is to stitch these pieces together into a hidden truth. They might use the pattern of known or suspected connections to predict relationships which they had not been given.

When the detective has solved the mystery, they can show a sequence or network of connections that connect the suspect to the crime, along with the means, opportunity, and motive. They can likewise show that a sufficiently robust sequence of connections does not exist for any other suspect.

Did those detectives know they were doing graph analytics? Probably not, but we all do it every day in different aspects of our lives. Whether that’s work, family or our network of friends, we are constantly connecting the dots to understand connections between people and people, people and things, people and ideas, and so on.

The power of graph as a data paradigm is that it closely parallels this process, making the use of graph more intuitive.

The 360 View

A 360 graph view eliminates blind spots.

Organizations of all sizes bemoan their data silos. Each department expects the others to yield up their data on demand, while failing to appreciate that it is no more open with its own. The problem is that business processes, and the systems we build to support them, actively work against this open sharing of data.

For instance, two departments may use two different data management systems. Although both may store their data in a relational database, the data schema for each is so alien to the other that there is little hope of linking the two to enable sharing.

The problem may not be obvious if you look at it at the micro scale. If for instance, you are compiling a record for customer X, an analyst with knowledge of the two systems in which customer data is stored will be able to easily extract the data from both, manually merge or reconcile the two records, and present a customer report. The problem comes when you want to replicate this a hundred thousand or a million times over.

It’s only by sharing the data in a holistic, integrated way that a business can remove the blinders that prevent it from seeing the whole picture.

The term Customer 360 describes a data architecture in which customer data from multiple sources and domains is brought together into a single data set so that you have a comprehensive and holistic view of each customer.

Working with a relational database, the most obvious solution would be to merge these two departmental databases into one. Many businesses have tried grand data integration projects, but they usually end in tears because while merging data yields considerable benefits, there are also considerable tradeoffs to be made that result in the loss of contextual nuance and functionality. Let’s face it, there’s usually a reason why the creators of a certain software package chose to construct their data schema in that particular way, and attempting to force it to conform to the schema of another system, or a new hybrid schema, will break at least one of the systems.

Graph allows you to connect databases in a natural, intuitive way without disturbing the original tables. Start by granting the graph application access to each database and then create a graph schema that links the data points from each database in a logical way. The graph database maps the relationships between the data points and does the analytical heavy lifting, leaving the source databases to carry on with what they were doing before.

If you want to see your full surroundings, you need a view that looks out across every angle – all 360 degrees. If you want to understand your full business or operational circumstances, you need data relationships across all the data you know is out there.

This is something we will look at in more depth in Chapter 3 where we will demonstrate a use case involving Customer Journey.

We have seen in the previous two points how to set up the data, and now in the next four points, we will look at how to extract meaningful intelligence from it.

Looking Deep for More Insight

Searching deep in a graph reveals vast amounts of connected information.

The “six degrees of separation” experiment conducted in the 1960s by Stanley Milgram found that, just by following personal connections (and knowing that the target person was in Boston), randomly selected persons in Omaha, Nebraska could often reach the mystery person through no more than six person-to-person connections.

Since then, more rigorous experiments have shown that many graphs are so-called small world graphs, meaning that the source vertex can reach millions and even billions of other vertices in a very small number of hops.

This vast reach in only a few hops occurs not only in social graphs but also in knowledge graphs. The ability to access this much information, and to understand how those facts relate to one another, is surely a super power.

Suppose you have a graph which has two types of vertices: persons and areas of expertise, like the one in Figure 1-13. The graph shows whom you know well and what you know well. Each person’s direct connections represent what is in their own head.

Figure 1-13. A graph showing who knows whom and what each person is an expert in.

From this we can readily see that A is an expert in two topics, astronomy and anthropology, but by traversing one additional hop to ask B and C what they know, A gains access to four more specialties.

Now, suppose each person has 10 areas of expertise and 100 personal connections. Consider how many people and how many areas of expertise are reached by your friends’ friends. There are 100 x 100 = 10,000 two-hop personal connections, each with 10 areas of expertise. Chances are those are not 10,000 unique persons – you and your friends know some of the same people. Nevertheless, with each hop in a graph, you are exposed to an exponentially larger quantity of information. Looking for the answer to a question? Want to do analytics? Want to understand the big picture? Ask around, and you’ll find someone who knows someone who knows.
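The two-hop “ask around” traversal from Figure 1-13 can be sketched in a few lines. The expertise assigned to B and C below is illustrative; the figure defines the actual topics.

```python
knows = {"A": ["B", "C"]}
expert_in = {
    "A": ["astronomy", "anthropology"],
    "B": ["biology", "botany"],
    "C": ["chemistry", "cartography"],
}

one_hop = set(expert_in["A"])                 # what A knows directly
two_hop = {topic for friend in knows["A"] for topic in expert_in[friend]}

print(len(one_hop), len(two_hop - one_hop))   # 2 direct topics, 4 more within one hop
```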

We talk about “looking deeper” all the time, but in graph it means something particular. It is a structured way of searching for information and understanding how those facts are related. Looking deeper includes breadth-based search to consider what is accessible to you from your current position. It then traverses to some of those neighboring vertices to gain depth and see what is accessible from those new positions. Whether it’s for a fraud investigation or to optimize decision making, looking deeper in a graph uncovers facts and connections that would otherwise be unknown.

As we saw in “Connecting the Dots”, one relationship on its own may be unremarkable, and there may be little if any information in a given vertex to reveal bad intentions, but thousands or even millions of vertices and edges considered in aggregate can begin to reveal new insights, which in turn lead to actionable intelligence.

Seeing and Finding Patterns

Graphs present a new perspective, revealing hidden data patterns which are easy to interpret.

As we have seen, a graph is a set of vertices and edges, but within the set of vertices and relationships, we can begin to detect patterns.

A graph pattern is a small connected set of vertices and edges which can be used as a template for searching for groups of vertices and edges which have a similar configuration.

The most basic graph pattern is the data triplet: vertex → edge → vertex. The data triplet is sometimes thought of as a semantic relationship because it mirrors the grammar of language and can be read as “subject → verb → object”, e.g., Bob → owns → boat.
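A minimal sketch of matching a triplet template against a set of edges, with None acting as a wildcard (the edge data is illustrative):

```python
edges = [
    ("Bob", "owns", "boat"),
    ("Alice", "owns", "car"),
    ("Bob", "lives_in", "Boston"),
]

def match(subject=None, verb=None, obj=None):
    """Return all edges that fit the (subject, verb, object) template."""
    return [
        e for e in edges
        if (subject is None or e[0] == subject)
        and (verb is None or e[1] == verb)
        and (obj is None or e[2] == obj)
    ]

print(match(verb="owns"))    # everything anyone owns
print(match(subject="Bob"))  # everything we know about Bob
```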

We can also use graph patterns to describe higher-level objects or relationships that we have in mind. For instance, depending on your schema, a person could be linked to a number of vertices containing personal data such as address, telephone, and email. Although they are separate vertices, they are all related to that one person. Another example is a wash sale, which is the combination of two securities trades: selling a security at a loss, and then purchasing the same or substantially similar security within 30 days.

Patterns come in different shapes. The simplest pattern, which we have looked at already, is the linear relationship between two vertices across a series of hops. The other common pattern is the star shape: many edges and vertices radiating from a central vertex.

A pattern can be Y-shaped, a pattern you would see when two vertices come together on a third vertex which is then related to a fourth vertex. We can also have circular or recursive patterns and many more.

In contrast to relational databases, graph data is easy to visualize, and graph data patterns are easy to interpret.

A well-designed graph gives names to the vertex and edge types that reflect their meaning. When done right, you can almost look at a connected sequence of vertices and edges and read the names like a sentence. For example, consider Figure 1-14, which shows Items purchased by Persons.

Figure 1-14. People who bought Product A also bought these products.

Starting from the left, we see that Person A (you) bought Item 1. Moving to the right, we then see another group of persons B, C, and D who also bought item 1. Finally we see some more items that were purchased by these persons. So, we can say, “You bought Item 1. Other persons who bought Item 1 also bought Items 2, 3, 4, and 5.” Sound familiar?

A closer analysis reveals that Item 4 was the most popular item, purchased by all three shoppers in your co-purchaser group. Item 3 was next most popular (purchased by two), and Items 2 and 5 were the least popular. With this information, we can refine our recommendations.
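As a sketch, the same two-hop pattern with a popularity count looks like this; the purchase data simply mirrors the counts described above.

```python
from collections import Counter

purchases = {  # person -> items bought, matching the example above
    "A": ["item1"],
    "B": ["item1", "item3", "item4"],
    "C": ["item1", "item2", "item3", "item4"],
    "D": ["item1", "item4", "item5"],
}

you = "A"
your_items = set(purchases[you])

# Hop 1: other people who bought one of your items.
co_buyers = [p for p, items in purchases.items() if p != you and your_items & set(items)]

# Hop 2: what else those co-buyers bought, ranked by popularity.
recommendations = Counter(
    item for p in co_buyers for item in purchases[p] if item not in your_items
)
print(recommendations.most_common())
# [('item4', 3), ('item3', 2), ('item2', 1), ('item5', 1)]
```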

Many retailers use graph analytics for their recommendation analytics, and they often go deeper yet, classifying purchases by other customer properties such as gender, age, location, and time of year. One could even make recommendations based on time of day if we saw that customers were, for instance, more likely to purchase luxury items in the evening and make more pragmatic purchases in the morning.

If we also analyze the sequence of purchases, we can also work out some highly personal information about customers. One large retailer was famously able to tell which customers were pregnant and when they were due simply by focusing on the purchases of 25 products. They were then able to send them targeted promotional offers to coincide with the birth of their child.

Matching and Merging

Graph is the most intuitive and efficient data structure for matching and merging records.

As we discussed earlier, organizations want to have a 360-degree view of their data, but a big obstacle to this is data ambiguity. An example of data ambiguity is having multiple versions of customer data, and the challenges of deduplicating data are well known to many organizations.

Duplication is sometimes caused by the proliferation of enterprise systems which split your customer view across many databases. For instance, if you have customer records in a number of databases -- such as Salesforce, a customer service database, an order processing system and an accounting package -- the view of that customer is divided across those systems.

To create a joined-up view of your customers, you need to query each database and join together the records for each customer.

However, it’s not always that easy because customers can end up being registered in your databases under different reference IDs. Names can be spelled differently. Personal information (surname, phone number, email address, etc.) can change. How do you match together the correct records?

Entity resolution matches records based on properties that are assumed to be unique to the entities that are being represented. In the case of person records, this might be email addresses and telephone numbers, but it could also be aggregates of properties – for instance, we can take name, date of birth and place of birth together as a unique identifier because what are the chances of those three things being the same for any two people in the world?

Entity resolution is challenging across relational databases because in order to compare entities, you need to be comparing like with like. If you are working with a single table, you can say that similar values in similar columns indicate a match, allowing you to resolve two entities into one, but across multiple tables, the columns may not match. You may also have to construct elaborate table joins to include cross-referenced data in the analysis.

By comparison, entity resolution in a graph is easy. Similar entities share similar neighborhoods, which allows us to resolve them using similarity algorithms such as cosine similarity and Jaccard similarity.
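For example, a Jaccard similarity check between two customer records might use the set of neighboring vertices (emails, phone numbers, addresses) as the signature; the records below are illustrative.

```python
def jaccard(a, b):
    """|intersection| / |union| of two neighborhoods; 1.0 means identical."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

record_a = {"ann@example.com", "+1-555-0100", "12 Oak St"}
record_b = {"ann@example.com", "+1-555-0100", "12 Oak Street"}

score = jaccard(record_a, record_b)
print(round(score, 2))  # 0.5 -- two of the four distinct neighbors are shared
```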

In entity resolution, we actually do two things:

  • Find matches – compare attributes and look for indicators of similarity. Give the match a confidence score.

  • Merge matching records – using the confidence score, use one of several strategies to merge the records.

When it comes to merging records, we have a few options including:

  • Copy the data from record B to record A, redirect the edges that pointed to B to point to A, and delete B.

  • Create a special link called “same_as” between records A and B.

  • Create a new record, C, copy the data from A and B, redirect the links from A and B to link to C, and finally create “same_as” edges pointing from vertex C to vertices A and B.

Which is better? The second is quicker to execute because there is only one step involved – adding an edge – but a graph query can execute the first and third options just as well. In terms of outcomes, which option is better depends on your search use case – for instance, do you prioritize richness of data or search efficiency? It might also depend on the degree of matching and merging you expect to do in your database.
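The first and second strategies can be sketched over a toy vertex/edge store as follows; the names and structure are illustrative, not a prescribed implementation.

```python
vertices = {"A": {"name": "Ann Smith"}, "B": {"phone": "+1-555-0100"}}
edges = [("order1", "placed_by", "B")]

def merge_into(a, b):
    """Option 1: copy B's data into A, redirect B's edges to A, delete B."""
    vertices[a].update({k: v for k, v in vertices[b].items() if k not in vertices[a]})
    for i, (src, etype, dst) in enumerate(edges):
        edges[i] = (a if src == b else src, etype, a if dst == b else dst)
    del vertices[b]

def link_same_as(a, b):
    """Option 2: keep both records and just add a same_as edge between them."""
    edges.append((a, "same_as", b))

merge_into("A", "B")
print(vertices, edges)
# {'A': {'name': 'Ann Smith', 'phone': '+1-555-0100'}} [('order1', 'placed_by', 'A')]
```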

We will demonstrate and discuss entity resolution with a walkthrough example in a later chapter.

Weighing and Predicting

Graphs with weighted relationships let us easily model and analyze complex cost structures.

As we’ve shown, graphs are a powerful tool for analyzing relationships, but one thing to consider is that relationships don’t have to be binary, on or off, black or white. Edges, representing the relationships between vertices, can be weighted to indicate the strength of the relationship, such as distance, cost or probability.

If we weight the edges, path analysis then becomes a matter of not just tracing the links between nodes but also doing computational work such as aggregating their values.

However, weighted edges make graph analysis more complex in other ways, too. In addition to the computational work, finding shortest paths in a graph with weighted edges is algorithmically harder than in an unweighted graph. Even after you’ve found a path to a vertex, you cannot be certain that it is the shortest path. If the edge weights are always positive, then you have to keep trying until you have considered every in-edge to the vertex, and if edge weights can be negative, then it gets harder yet because you must consider all possible paths.

Then again, edge weighting does not always make for a significant increase in work. In the PageRank algorithm, which computes the influence of each vertex on all other vertices, edge weighting makes little difference except that the influence that a vertex receives from a referring neighbor is multiplied by the edge weight, which adds a minimal computational overhead to the algorithm.

There are many problems that can be solved with edge weighting. Anything to do with maps, for instance, lends itself to edge weighting. You can have multiple weights per edge. Considering the map example, these could include constant weights such as distance and speed limits and variable weights such as current travel times to take account of traffic conditions.

We could use a graph of airline routes and prices to work out the optimal journey for a passenger based not only on their itinerary but also their budget constraints. Are they looking for the fastest journey regardless of price or are they willing to accept a longer journey, perhaps with more stops, in exchange for a lower price? In both cases you might use the same algorithm, shortest path, but prioritize different edge weights.
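A sketch of that idea: each edge in a small route graph carries two weights (hours and price), and the same shortest path algorithm serves both travelers, depending on which weight it minimizes. The routes and fares are illustrative.

```python
import heapq

routes = {
    "BOS": [("JFK", 1.0, 120), ("ORD", 3.0, 90)],
    "JFK": [("LAX", 6.0, 250)],
    "ORD": [("LAX", 4.5, 150)],
    "LAX": [],
}

def best_path(start, goal, use="hours"):
    """Dijkstra's algorithm, minimizing either flight hours or ticket price."""
    heap, settled = [(0, start, [start])], set()
    while heap:
        cost, v, path = heapq.heappop(heap)
        if v == goal:
            return cost, path
        if v in settled:
            continue
        settled.add(v)
        for nbr, hours, price in routes[v]:
            step = hours if use == "hours" else price
            heapq.heappush(heap, (cost + step, nbr, path + [nbr]))
    return None

print(best_path("BOS", "LAX", use="hours"))  # (7.0, ['BOS', 'JFK', 'LAX'])
print(best_path("BOS", "LAX", use="price"))  # (240, ['BOS', 'ORD', 'LAX'])
```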

With access to the right data, we could even work out the probability of having a successful journey. For instance, what is the probability of our flight departing and arriving on time? For a single hop, we might accept an 80% chance that the flight won’t be more than an hour late, but for a two-hop trip where the chance of the second hop being on time is 85%, the combined chance of an on-time journey drops to 68% (0.80 x 0.85) -- a 32% risk of delay.

Likewise, we could look at a supply chain model and ask, what are the chances of a severe delay in the production of our finished product? If we assume that there are six steps and the reliability of each step is 99%, then the combined reliability is about 94% -- in other words, there is a 6% chance that something will go wrong. We can model that across hundreds of interconnecting processes and use a shortest path algorithm to find the ‘safest’ route that satisfies a range of conditions.

Chapter Summary

In this chapter, we have looked at graph structure and how we can use a graph database to represent data as a series of data nodes and links. In graphs, we call these vertices and edges, and they enable us to not only represent data in an intuitive way – and query it more efficiently – but also use powerful graph functions and algorithms to traverse the data and extract meaningful intelligence.

Property graphs are graphs in which every vertex and edge – which we collectively refer to as objects – can hold properties which describe that object. Edges can be directed or undirected, and we discussed the benefits and tradeoffs of each type for indicating hierarchy and sequence.

We looked at what is meant by traversing a graph as well as ‘hops’ and ‘distance’. There are two approaches to traversing a graph: breadth-first search and depth-first search, each with its own benefits and tradeoffs.

We looked at the importance of using a graph schema to define the structure of the database, how a consistent set of object types makes your data easier to interpret and how it can closely relate to the real world.

Careful consideration was given to different approaches to the design, in particular the search use case and how mapping the columns of a relational database to a graph database can impact query time and the complexity of your coding.

A key step in implementing a graph database is mapping columns in a relational database to a graph because a common use case for graph is building relationships between disparate databases. One of the decisions you have to make is which columns to map to their own objects and which to include as properties of other objects.

We looked at how data models evolve over time and why a flexible schema is essential to keeping your database up to date.

In the design of a database schema, whether that be for a relational or graph database, there are benefits and tradeoffs to be made, and we looked at a few of those including the choice of whether to map a column to an object or make it the property of an object. We also considered the choice of edge directionality and the granularity of edge types.

There are also tradeoffs to be made in recording multiple events between the same two entities and tracking events in an IT network.

Finally, we looked at what we mean by graph power including the essential question, why use graph in the first place? We looked at some general use cases including:

  1. Connecting the Dots – how a graph forms an actionable body of knowledge

  2. The 360 View – how a 360 graph view eliminates blind spots

  3. Looking Deep for More Insight – how deep graph search reveals vast amounts of connected information

  4. Seeing and Finding Patterns – how graphs present a new perspective, revealing hidden data patterns which are easy to interpret

  5. Matching and Merging – why graph is the most intuitive and efficient data structure for matching and merging records

  6. Weighing and Predicting – how graphs with weighted relationships let us easily model and analyze complex cost structures

As stated at the beginning of this chapter, you should now be able to:

  • Use the standard terminology for describing graphs

  • Know the difference between a graph schema and a graph instance

  • Create a basic graph model or schema from scratch or from a relational database model

  • Apply the “traversal” metaphor for searching and exploring graph data

  • Understand six ways that graph data empowers your knowledge and analytics

  • State the entity resolution problem and show how graphs resolve this problem

1 Some graph databases would handle multiple roles by having a single Worked_on edge whose role property accepts a list of roles.
