Chapter 4. Data Processing for Driving Decisions
Graphs provide context to answer questions, improve predictions, and suggest next best actions. Uncovering insight from graph data is a necessary first step toward unlocking that value.
In the actioning knowledge graphs we saw in Chapter 3, an organizing principle was applied to an underlying graph in order to extract knowledge. We said this makes the data smarter. Deciding upon or discovering an organizing principle, or even just exploring the graph to find its general properties, is a useful activity in its own right.
In this chapter, we’re going to explore decisioning knowledge graphs. A decisioning knowledge graph does not drive actions directly but surfaces trends in the data, which can be used in several ways such as to extract a view or subgraph for:
- Specific analyses (e.g., monopartite graphs like customer-bought-product) yielding actionable knowledge that can be written back into an actioning knowledge graph
- Human analysis (assisted by tooling) for data science exploration and experimentation, eventually possibly yielding insight that is written to the actioning knowledge graph or influences organizational structure
- Further processing by downstream systems (e.g., training machine-learning models)
Physically, our decisioning graph might or might not be the same graph as our actioning knowledge graph. Sometimes it’s helpful to keep all the actionable data and decision making together (particularly when we want to enrich the actioning knowledge graph), and sometimes we want to physically separate them (for data science workflows).
However we physically arrange our infrastructure, our toolkit for these jobs consists of discovery, analytics, and data science. Separately, discovery, analytics, and data science are helpful. But together they become extremely powerful tools in our toolbox for turning decisions into useful actions.
This chapter covers the advantages of bringing discovery, graph data analytics, and graph data science into the mix. It explores the capabilities of decisioning knowledge graphs to drive better actions and outcomes, and it highlights a number of enterprise-ready use cases.
Data Discovery and Exploration
The first step in any analysis is to find the data we need. For example, in criminal investigations, a suspect, individual, or organization is identified because connections in the data point to it in unusual ways. In a telecoms network, a device that is the root cause of a failure is identified because a surrounding constellation of working devices that depend directly or indirectly on that device themselves report degradation in service. A group of fraudsters collaborating to create synthetic identities can often be identified because their shared means of identification form rings in ways that would otherwise be highly unlikely.
Knowledge graphs provide the organizing principles to connect disparate datasets and a contextual platform for reasoning over linked information. Prime examples of this are POLE (Persons, Objects, Locations, and Events) databases often applied to governmental/law enforcement use cases or in IT systems management where failures can be predicted or retrospectively analyzed using a knowledge graph.
Leveraging the connections in data is transformative when sifting through large volumes of information. In Chapter 3 for example, we explained how the ICIJ makes sense of terabytes of leaked data. Similarly, NASA enables semantic searches over millions of documents to shave years and many millions of dollars off projects in its space program. Government agencies all over the world process countless phone records, financial transactions, fingerprints, DNA, and court records to fight crime and prevent terrorism. Financial institutions are able to use data discovery to improve fiscal responsibility and fight money laundering at scale.
The common thread among all these examples is that useful properties and patterns in the data first have to be discovered. Some intuition and thought have to go into the design of an organizing principle (such as a taxonomy), and from there the data can be explored to discover its useful properties. When useful patterns are discovered, they can be analyzed, used to train ML models, be written back to an actioning knowledge graph, or sent downstream to other systems.
The Predictive Power of Relationships
It’s worth noting at this point that beyond helping with discovery and exploration, relationships are highly predictive of behavior. In fact, researchers have found that even without demographic information like age, location, and socioeconomic status, they can be highly accurate in predicting who will vote, smoke, or suffer obesity based on one thing: social relationships. It’s not surprising that if we have many friends who vote, we’re more likely to vote, or that if we’re friends with smokers, we’d be more likely to smoke.
However, it is remarkable that a researcher can make this prediction even more accurately based on our friends-of-friends’ behavior, not one but two hops away from us. That is, the behavior of our friends-of-friends, whom we may not know that well or at all, is more predictive of our behavior than information that pertains only to us.
Tip
For more information on the science underlying social graphs, see Connected by James Fowler and Nicholas Christakis (Little, Brown and Company, 2009).
Despite their predictive power, most analytics and data science practices ignore relationships because it has been historically challenging to process them at scale. Consider trying to find similar customers or products in a three-hop radius of an account. With nongraph technology, you might be able to process this data, even if it is slower than querying a knowledge graph. But what if you need to scale such processing over a large graph of your customer base, then distill useful information (e.g., for every pair of accounts in this radius, calculate the number of accounts in common), and finally transform the results into a format required for machine processing? It’s just not practical in a nongraph system. This explosion of complexity quickly overwhelms the ability to perform processing and hinders the use of “graphy” data for predictions to the ultimate detriment of decision makers.
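To make the three-hop example concrete, here is a minimal Python sketch under stated assumptions: the account graph is a small, hypothetical, undirected adjacency map held in memory, and the function names `within_hops` and `common_neighbor_counts` are invented for illustration. It collects every account within three hops of a starting account and then, for every pair in that radius, counts the accounts they have in common. On a toy graph this is trivial; the point of the paragraph above is that the same computation becomes impractical at scale without graph-native storage and processing.

```python
from collections import deque
from itertools import combinations


def within_hops(graph, start, max_hops):
    """Return the nodes reachable from start in at most max_hops steps (start excluded)."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the radius
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    seen.discard(start)
    return seen


def common_neighbor_counts(graph, start, max_hops=3):
    """For every pair of accounts in the radius, count the neighbors they share."""
    nearby = sorted(within_hops(graph, start, max_hops))
    return {
        (a, b): len(set(graph[a]) & set(graph[b]))
        for a, b in combinations(nearby, 2)
    }


# Hypothetical undirected account graph (illustrative data only).
accounts = {
    "a1": {"a2", "a3"},
    "a2": {"a1", "a4"},
    "a3": {"a1", "a4"},
    "a4": {"a2", "a3", "a5"},
    "a5": {"a4", "a6"},
    "a6": {"a5"},
}

counts = common_neighbor_counts(accounts, "a1")
for pair, shared in counts.items():
    print(pair, shared)
```

The resulting dictionary of pair counts is already close to the tabular form a downstream ML process would expect, which is the "transform the results" step mentioned above.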
Instead of ignoring relationships, knowledge graphs incorporate them into analytics and ML workflows. Graph analytics excels at finding the unobvious because it can process patterns even when we don’t exactly know what to look for, while graph-based ML can predict how a graph might evolve. This is precisely what most data scientists are trying to achieve!
A performant knowledge graph makes it practical to incorporate connections and network structures into data analytics and from there to enrich ML models. For the business, this means better predictions and better decisions using the data we already have.
The Decisioning Knowledge Graph
We call a knowledge graph used for analytics, ML, or data science a decisioning knowledge graph because the aim is ultimately to improve decisions made by human or software agents. A decisioning knowledge graph must support analytics and data science workflows from simple queries to ML as well as provide graph visualizations.
Figure 4-1 illustrates the capabilities of a decisioning knowledge graph. These capabilities may be used alone or combined with one another, often in a pipeline.
- Queries: These are written by humans during an investigation and typically produce human-readable results.
- Algorithms: While algorithms also produce human-readable results, they are coded in advance of any particular investigation and are based on well-understood principles from graph theory.
- Embeddings: These are also defined in advance and use machine-learned formulas to produce machine-readable results.
Once we have results from queries, algorithms, and embeddings, we can put them to further use. As we saw in Chapter 2, graph-based visualization tools are helpful for exploring connections in graph data, but we can also use these outputs as training data for ML models.
Graph Queries
Most analysts start down the path of graph analytics with graph queries, which are (usually) human crafted and human readable.
They’re typically used for real-time pattern matching when we know the shape of the data we’re interested in.
For example, in Figure 4-2 we’re looking for potential allies in a graph of enemies on the basis of the concept, “the enemy of my enemy is my friend.”
Once a potential ally has been located, we create a FRIEND relationship.
Unlike data discovery, where we’re asking a specific question for investigation, here we use the query results to feed subsequent analyses.
With a graph database and a graph query language, these kinds of graph-local patterns are computationally cheap and straightforward to express.
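In a graph database this pattern would normally be written in a graph query language. As a language-neutral illustration, the following Python sketch expresses the same graph-local pattern over an in-memory set of relationship triples. The names, the data, the function `add_friends_from_enemies`, and the rule that a direct enemy is never promoted to a friend are all assumptions made for this example, not details taken from Figure 4-2.

```python
def add_friends_from_enemies(relationships):
    """For each pattern a -ENEMY-> b -ENEMY-> c, add a FRIEND relationship a -> c.

    relationships is a set of (source, type, target) triples. The input set
    is left unchanged; a new, enriched set is returned.
    """
    # Index the ENEMY relationships by their source node.
    enemies = {}
    for source, rel_type, target in relationships:
        if rel_type == "ENEMY":
            enemies.setdefault(source, set()).add(target)

    new_friends = set()
    for a, direct in enemies.items():
        for b in direct:
            for c in enemies.get(b, ()):
                # Assumption: skip ourselves and anyone who is already our enemy.
                if c != a and c not in direct:
                    new_friends.add((a, "FRIEND", c))
    return relationships | new_friends


# Hypothetical data for illustration only.
graph = {
    ("Ann", "ENEMY", "Bo"),
    ("Bo", "ENEMY", "Cy"),
    ("Ann", "ENEMY", "Di"),
    ("Di", "ENEMY", "Bo"),
}

enriched = add_friends_from_enemies(graph)
for triple in sorted(enriched):
    print(triple)
```

Because the match starts from known nodes and follows only two hops, the work done is proportional to the local neighborhood rather than to the size of the whole graph.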
Graph Algorithms
But what if we don’t know where to start the query or want to find patterns anywhere in the graph? We call these operations graph-global, and querying is not always the right way to tackle these challenges. A graph-global problem is often an indication that we should instead consider a graph algorithm.
For more comprehensive analysis, where we need to consider the entire graph (or substantial parts of it), graph algorithms provide an efficient and scalable way to produce results. We use them to achieve a particular goal, like looking for popular paths or influential nodes. For example, if we’re looking for the most influential person in a social graph, we’d use the PageRank algorithm,1 which measures the importance of a node in the graph relative to the others.
In Figure 4-3 we see a snapshot of part of a graph.
Visually, we can see that the node representing Rosa is the most connected, but that’s an imprecise definition.
If we run the PageRank algorithm over the data in Figure 4-3, we can see that Rosa has the highest PageRank score, indicating that she’s more influential than other nodes in the data.
We can use this metadata in knowledge graphs by incorporating it as part of the organizing principle, just like any other data item, to drive users toward good decisions.
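For readers who want to see the mechanics, here is a minimal sketch of PageRank computed by power iteration in plain Python. It is an illustration under stated assumptions rather than the implementation any particular graph platform uses: the damping factor of 0.85 and the fixed iteration count are conventional defaults chosen here, and the small graph of who points at whom is hypothetical (it is not the data from Figure 4-3).

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Compute PageRank scores by power iteration.

    out_links maps each node to the list of nodes it points to.
    Nodes with no outgoing links spread their score evenly over all nodes.
    """
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Score held by dangling nodes (no outgoing links) is shared by everyone.
        dangling = sum(rank[node] for node in nodes if not out_links.get(node))
        new_rank = {
            node: (1 - damping) / n + damping * dangling / n for node in nodes
        }
        # Every other node passes an equal share of its score to each target.
        for node, targets in out_links.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical social graph for illustration only.
follows = {
    "Rosa": ["Karl", "Fred"],
    "Karl": ["Rosa"],
    "Fred": ["Rosa"],
    "Ann": ["Rosa"],
    "Lena": ["Rosa", "Karl"],
}

scores = pagerank(follows)
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

Writing each score back onto its node as a property is what turns the algorithm's output into metadata that later queries can filter and sort on.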
Graph algorithms excel at finding global patterns and trends, but we’ll want to choose and tune the algorithms to suit our specific questions. A decisioning knowledge graph should support a variety of algorithms and allow us to customize for future growth.
Graph Embeddings
Beyond just understanding data better, graph queries and algorithm results can be used to train ML models, but what if you don’t know what to query or which algorithm to use? Do you know if PageRank or a different type of algorithm would be more or less predictive? You could try them all and compare the results, but that would be tedious.
Graph embeddings are a special type of algorithm that encodes the topology of a graph (its nodes and relationships) into a structure suitable for consumption by ML processes. We use these when we know important data exists in the graph, but it’s unclear which patterns to look for and we’d like the ML pipeline to do the heavy lifting of discovering patterns. Graph embeddings can be used in conjunction with graph queries and algorithms to enrich ML input data to provide additional features.
Embeddings encode a representation of what’s significant in our graph for our specific problem and then translate that into a vectorized format (as seen at the bottom of Figure 4-1). We humans can’t readily interpret the resulting lists of numbers, but they are precisely the format we need to train ML models.
Tip
Some types of graph embeddings also learn a function to represent our graph. We can apply that function to new incoming data and predict where it fits in the graph topology.
Graph embeddings are very useful because rather than running multiple algorithms to describe specific aspects of our graph topology, we can use graph structure itself as a predictor. Graph embeddings expand our predictive capabilities, but they typically take longer to run and have more parameters to tune than other graph algorithms. If we know what elements are predictive, we use queries and algorithms for feature engineering in ML. If we don’t know what is predictive, we use graph embeddings. Both are good ways to improve decisioning graphs.
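The following toy sketch shows the basic idea of turning topology into vectors. It is deliberately simplistic and is not one of the production embedding algorithms, which are considerably more sophisticated: each node is given a seeded random base vector, and a node's embedding is the average of its neighbors' base vectors. The graph, the function name `neighborhood_embeddings`, and the parameters are all assumptions made for illustration; the sketch also assumes every neighbor appears as a key in the graph.

```python
import random


def neighborhood_embeddings(graph, dimensions=4, seed=42):
    """Embed each node as the average of its neighbors' random base vectors.

    graph maps each node to the set of its neighbors. Nodes with identical
    neighbor sets receive identical embeddings, so the vectors reflect
    structure rather than node identity.
    """
    rng = random.Random(seed)
    # Iterate in sorted order so the base vectors are reproducible.
    base = {
        node: [rng.uniform(-1, 1) for _ in range(dimensions)]
        for node in sorted(graph)
    }
    embeddings = {}
    for node, neighbors in graph.items():
        if not neighbors:
            embeddings[node] = [0.0] * dimensions
            continue
        ordered = sorted(neighbors)  # fixed summation order for determinism
        embeddings[node] = [
            sum(base[nb][i] for nb in ordered) / len(ordered)
            for i in range(dimensions)
        ]
    return embeddings


# Hypothetical undirected graph for illustration only.
social = {
    "A": {"C", "D"},
    "B": {"C", "D"},
    "C": {"A", "B", "E"},
    "D": {"A", "B"},
    "E": {"C"},
}

vectors = neighborhood_embeddings(social)
for node in sorted(vectors):
    print(node, [round(x, 3) for x in vectors[node]])
```

Here A and B end up with the same vector because they have the same neighbors, even though nothing told the procedure that they are alike. That is the sense in which graph structure itself becomes the predictor.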
ML Workflows for Graphs
Analyzing the output of graph queries and algorithms and using them to improve ML is great, but we can also write results back to enrich the graph. In doing so, those results become queryable and processable in a virtuous cycle that feeds the next round of algorithmic or ML analysis. Creating a closed loop within our decisioning knowledge graph means we can start with a graph, learn what’s significant, and then predict things about new data coming in, such as classifying someone as a probable fraudster, and write it back into the graph. The knowledge graph is enriched by the cycle shown in Figure 4-4.
Graph machine learning is often used for knowledge graph completion to predict missing data and relationships. In-graph ML keeps ML training inside the graph, which enables us to incorporate its ML workflows into our knowledge graph for continuous updates as new data is added.
It also avoids the need to create data integration pipelines and move data between various systems when making graph-specific predictions. For example, using a graph-centered approach we can avoid having to export similarity scores and embeddings from our knowledge graph into toolkits like TensorFlow2 to predict node categories and then write back the results to update our knowledge graph—all at high latency and systemic complexity. Instead, doing this entirely within our knowledge graphs significantly streamlines and accelerates the process.
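As a rough illustration of the closed loop, the sketch below predicts missing relationships and writes them back into the graph. It uses Jaccard similarity of neighbor sets as a simple stand-in for a trained model, so it should be read as a heuristic example of knowledge graph completion, not as in-graph ML. The data, the threshold, and the function names `predict_links` and `write_back` are hypothetical.

```python
from itertools import combinations


def predict_links(graph, threshold=0.6):
    """Suggest missing relationships between nodes with similar neighborhoods.

    graph maps each node to the set of its neighbors (undirected).
    Returns (node_a, node_b, score) for unlinked pairs whose Jaccard
    similarity of neighbor sets is at least threshold.
    """
    predictions = []
    for a, b in combinations(sorted(graph), 2):
        if b in graph[a]:
            continue  # already connected, nothing to predict
        union = graph[a] | graph[b]
        if not union:
            continue
        score = len(graph[a] & graph[b]) / len(union)
        if score >= threshold:
            predictions.append((a, b, score))
    return predictions


def write_back(graph, predictions):
    """Return an enriched copy of the graph that includes the predicted links."""
    enriched = {node: set(neighbors) for node, neighbors in graph.items()}
    for a, b, _score in predictions:
        enriched[a].add(b)
        enriched[b].add(a)
    return enriched


# Hypothetical undirected graph for illustration only.
known = {
    "A": {"C", "D"},
    "B": {"C", "D"},
    "C": {"A", "B"},
    "D": {"A", "B", "E"},
    "E": {"D"},
}

predicted = predict_links(known)
enriched = write_back(known, predicted)
print(predicted)
```

Because the enriched graph has the same shape as the input, it can be fed straight back into `predict_links` for the next round, which is the virtuous cycle in miniature.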
Tip
We recommend automating some of the trickier tasks like transforming data into a graph format, test/train splitting of graph data, multiple model building, and accuracy evaluation. If you have graph and ML expertise, you might be able to build a decisioning knowledge graph that includes and automates ML model training for graphs. If not, consider a vendor or open-source solution with in-graph ML capabilities.
Graph Visualization
Visualizing data, especially relationships, allows domain experts to better explore connections and infer meaning. For knowledge graphs, we need tools like those shown in Figure 2-3 to help visually inspect raw data, understand analytics results, and share findings with others.
Walking through relationships, expanding scenes, filtering views, and following specific paths are natural ways to investigate graphs. A visual representation of a graph needs to be dynamic and user customizable to support interactive exploration. In addition to direct query support, graph visualizations need low-code/no-code options and assistive search features, like type-ahead suggestions, to empower a broader range of users.
Data scientists also benefit from visualizing algorithm and ML results. With the level of abstraction raised visually, a data scientist can focus on the necessary complexity of an experiment and not become bogged down in accidental complexity. For example, our tools can visualize PageRank scores as node sizes, node classifications as icons, traversal cost as line thickness, and community groups as colors. With these visual abstractions, the important aspects of the underlying data are immediately apparent, where they would be hidden in raw data. Once a data scientist is satisfied with their results, a graph visualization tool enables them to quickly prototype and communicate findings with others, even where those others are not experts in graph technology.
Decisioning Knowledge Graph Use Cases
There are many use cases for analytics, data science, and ML with decisioning knowledge graphs. Here are a few:
- Finding and preventing fraud based on detecting communities of like behavior, unusual transactions, or suspicious commonalities. Results are typically written into an actioning knowledge graph to support online fraud detection or passed downstream into an ML workflow such that better predictive models can be built.
- Improving customer experience and patient outcomes by surfacing complex sequences of activities for journey analysis. Typically, results are interpreted by domain experts who understand the journey of the user and can spot anomalous data, usually with help from visualization tools.
- Preventing churn by combining individual data with community behavior and influential individuals in that network. Results are often written back into an actioning knowledge graph so that risky customers can be identified at points of contact. Results are also used to improve ML models so that holistic churn prediction across the customer base can be improved.
- Forecasting supply chain needs using a holistic view of dependencies and identified bottlenecks. Results are written into the actioning knowledge graph that underpins the supply chain system, enriching it so that the supply chain is more robust.
- Recommending products based on customer history, seasonality, warehouse stock, and product influence on sales of other items. Results are written into the actioning knowledge graph that supports the product catalog.
- Eliminating duplicates and ambiguous entities in data based on highly correlated attributes and actions. The output is typically sent to downstream ML systems (e.g., graph neural networks) for predictive analysis around whether claims to identity should be linked. Output may also be used to train those downstream systems.
- Finding missing data using an existing data structure to predict undocumented relationships and attributes. Often computed data is written back into the actioning knowledge graph or sent downstream to ML for further processing before being written back into an actioning knowledge graph.
- What-if scenario planning and “next best action” recommendations using alternative pathways and similarities, typically consumed by experts in the first instance and ultimately written back to an actioning knowledge graph so that online systems can have better “next best actions.”
Each of these use cases is valuable, and it’s possible that more than one may apply to your business. What’s nice about these use cases is that tooling already exists that can implement them. We don’t need to invest in building such tools; we just need to get our data in a format so that the tools can conveniently process it. Often this can be as simple as storing it in a graph database. From here, we can incorporate the knowledge graph into our business processes to help to make better decisions.
Boston Scientific’s Decisioning Graph
Boston Scientific is a global medical device company that develops and manufactures a wide range of innovative diagnostic and treatment medical products, including pacemakers and artificial heart valves. Health care practitioners have helped more than 30 million patients around the world using the company’s products.
Boston Scientific has an integrated supply chain from raw materials to complex devices that includes development, design, manufacturing, and sales. Predicting and preventing device failures early in the process is crucial. However, the company had difficulty pinpointing the root cause of defects, limiting its ability to prevent future problems.
Using a decisioning graph, Boston Scientific was able to apply graph analytics to their supply chain and consequently improve the reliability of their products.
It started with a knowledge graph that included parts, finished products, and failures, as seen in Figure 4-5.
The chosen organizing principle used an ontology to define a hierarchy of parts connected by ISSUED relationships to other parts, creating assemblies that lead to finished products.
Relationships were then added from finished products to events that RESULTS_IN failures.
Using graph queries, Boston Scientific is able to quickly reveal subcomponents’ complex relationships and trace any failures to relevant parts.
The company was able to identify previously unknown vulnerabilities by adding graph algorithms to rank parts based on their proximity to failures and match other components based on similarity. Since results can be automatically written back to their decisioning knowledge graph, Boston Scientific continues to enhance its data and improve product reliability across multiple collaborating teams.
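The source does not describe the specific algorithms Boston Scientific used, so the following is only a sketch of one plausible way to rank parts by proximity to failures. It assumes a hypothetical map of which part is issued to which assembly or product, reverses those relationships, and runs a breadth-first search outward from the failed products so that each part gets its minimum hop distance to a failure. All part names and the function `distance_to_failures` are invented for illustration.

```python
from collections import deque


def distance_to_failures(issued_to, failed_products):
    """Return each part's minimum hop distance to any failed product.

    issued_to maps a part to the assemblies or products it is issued to.
    Parts that do not feed into any failed product are omitted.
    """
    # Reverse the ISSUED relationships so we can walk from products back to parts.
    components_of = {}
    for part, targets in issued_to.items():
        for target in targets:
            components_of.setdefault(target, set()).add(part)

    # Multi-source breadth-first search starting from every failed product.
    distance = {product: 0 for product in failed_products}
    queue = deque(failed_products)
    while queue:
        item = queue.popleft()
        for part in components_of.get(item, ()):
            if part not in distance:
                distance[part] = distance[item] + 1
                queue.append(part)

    return {
        node: hops for node, hops in distance.items()
        if node not in failed_products
    }


# Hypothetical supply chain data for illustration only.
issued_to = {
    "screw": ["housing"],
    "housing": ["pacemaker"],
    "battery": ["pacemaker"],
    "lead": ["monitor"],
}
failed = {"pacemaker"}

distances = distance_to_failures(issued_to, failed)
ranked = sorted(distances.items(), key=lambda item: (item[1], item[0]))
print(ranked)
```

A ranking like this could be written back onto the part nodes as a property, which matches the write-back pattern described above.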
Better Predictions and More Breakthroughs
With a decisioning knowledge graph, we can answer otherwise difficult questions based on nuanced relationships from large graphs. Graph algorithms and in-graph ML make it possible to predict outcomes based on the connections between data points rather than raw data alone. Combining both approaches can substantially improve the quality of results obtained.
A decisioning knowledge graph is not used directly by online systems but powers those online systems by enriching their underlying knowledge graphs, either directly or as the result of a longer analytics and ML workflow. The tools and patterns around a decisioning knowledge graph open up new possibilities for gaining insight from data, insight that until recently was accessible only to researchers and a few advanced technology companies. A decisioning knowledge graph commoditizes and democratizes a powerful set of tools for widespread business use.
1 Although PageRank was developed by Google to understand the relative ranking of Web pages, it’s actually named after its inventor, Larry Page.
2 TensorFlow is a free, open-source software library for ML.