Chapter 4. Exploring Neighborhoods in Development

To get to the next phase in graph application development, we are going to build upon the simple Customer 360 (C360) application from Chapter 3. We’ll add a few more layers, or neighborhoods, onto that example to illustrate the next wave of concepts in graph thinking.

Adding data to our example provides a more realistic picture of the complexity of data modeling, querying, and applying graph thinking to our customer-centric financial data.

We consider the transition from the basic example in Chapter 3 to the complexity in this chapter to be analogous to steps in the process of learning how to scuba dive. What we did in Chapter 3 was like starting to learn how to scuba dive in a wading pool; it is not really clear what the point is when you are in water that shallow. But we needed to start from a familiar place. The examples in this chapter are like scuba diving in a deep pool. Afterwards, we will be ready to head into more interesting depths in Chapter 5.

Chapter Preview: Building a More Realistic Customer 360

There are three main sections within this chapter.

In the first section, we will explore and explain graph thinking to present best practices in graph data modeling. We will do this by adding more neighborhoods of data to our C360 example so that we can answer the following questions:

  1. What are the most recent 20 transactions involving Michael’s account?

  2. In December, at which vendors did Michael shop, and with what frequency?

  3. Find and update the transactions that Jamie and Aaliyah most value: their payments from their account to their mortgage loan. (Query 3 is an example of personalization.)

Throughout this initial section, we will follow query-driven design to illustrate common best practices for creating a property graph data model. Topics include mapping your data to vertices or edges, modeling time, and common mistakes.

In the next section, we will build up deeper Gremlin queries. These queries walk through three, four, and five neighborhoods of data. We will also introduce how to use properties to slice, order, and range over graph data, and we will discuss querying in time windows. By the end of this section, we will have illustrated all of the data, technical concepts, and data modeling that we planned for our example.

We will end the chapter by revisiting the basic queries to introduce some more advanced querying techniques. These techniques are most commonly a part of trying to format your query results into a more user-friendly structure.

This content sets us up to present the final, production-quality schema for this example, which we will do in Chapter 5.

Graph Data Modeling 101

During the early days of working with graph databases backed by Apache Cassandra, my team was sitting around the couches in the living room of our venture-backed startup. We were whiteboarding a graph data model for storing healthcare data in a graph database.

We quickly agreed that doctors, patients, and hospitals were our primary entities of importance, and therefore they would be vertices. Everything else after that was a debate. Vertices, edges, properties, and names: everyone had a defensible opinion about everything. Our most memorable disagreements were polarizing. What should we name the edges between doctors and patients? All of these entities live or work somewhere; how do we model addresses? Is country a vertex or a property, or should it be left out of our model?

It was a difficult conversation. It took much longer than we had expected to arrive at a design consensus, and none of us really felt comfortable with it.

Since that design session, every time I advise a graph team, anywhere in the world, I feel the same tensions and watch a similar design consensus emerge. The tensions are always real, always present, and always observable.

This section is all about helping your team have a more constructive discussion about your graph data model. To that end, we will walk through three areas of advice for creating a good graph data model:

  1. Should This Be a Vertex or an Edge?

  2. Lost Yet? Let Us Walk You Through Direction

  3. A Graph Has No Name: Common Mistakes in Naming

We selected these topics for two reasons. First, these topics cover most of the points of contention you will encounter during the modeling process. Second, these topics support where we are in the development of the running example for these chapters. Details for deeper and more advanced modeling advice will be introduced when we get there.

Should This Be a Vertex or an Edge?

This is the most debated topic in property graph modeling. From the middle of the most heated debates, we have distilled a number of tips for creating graph data models.

Let’s start our tips at the beginning. In our world, the beginning is where you want to start your graph traversals.

Rule of Thumb #1

If you want to start your traversal on some piece of data, make that data a vertex.

To unpack our first tip, let’s revisit one of the queries we constructed in Chapter 3:

Which accounts does this customer own?

There are three pieces of data required to answer that question: customers, accounts, and a connection indicating which customer owns which account. Think about how you could use that data to “find all accounts owned by Michael.” There are two ways to translate this statement into a database query: “Michael owns accounts” or “accounts owned by Michael.”

Let’s talk about the first option: starting with Michael to find his accounts. This means that you are starting with data about people—specifically, the piece of data about Michael. When you identify the starting place for a query, translate that data into a vertex label in your graph model. With this, we have our first vertex label for our graph model: customers.

Consider the second way to find this information: you could first find all accounts and then keep only those that are owned by Michael. In this case, you are starting with the data about accounts. Now we have a second vertex label for our graph model: accounts.

This sets us up for the next tip on how to find the edges in your data.

Rule of Thumb #2

If you need the data to connect concepts, make that data an edge.

For the query we are working with, we know that customers and accounts will be vertex labels. That leaves the concept of ownership, and yes, you guessed it—ownership will be the edge. The concept of ownership links a customer to an account in our example data.

To find the edges in your model, examine your data. Your edges come from the information you have access to that links concepts together.

When working with graph data, these edges are the most important piece of your graph model. Edges are why you need graph technology in the first place.

Putting these two together, you can derive the following rule for labeled property graph models.

Rule of Thumb #3

Vertex-Edge-Vertex should read like a sentence or phrase from your queries.

Our advice here is to write out how you want to query your data as short phrases like “customer owns account.” Identifying these queries and phrases remains a simple way to determine how to map your data onto graph objects in a property graph database, as shown in Figure 4-1.

pggd 0401
Figure 4-1. Two vertices, named Michael and acct_14, with an edge (relationship) labeled owns; this illustrates translating a short noun-verb-noun phrase into a property graph model: Michael owns account 14

Generally speaking, written forms of your graph queries will translate verbs to edges and nouns to vertices.
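The mapping works in both directions: the phrase “Michael owns account” reads almost word for word as a Gremlin traversal. The sketch below is illustrative only; the dev traversal source and the customer_id value customer_0 are assumptions borrowed from this chapter’s later examples:

```groovy
// "Michael owns account": nouns become vertices, the verb becomes an edge
dev.V().has("Customer", "customer_id", "customer_0"). // noun: the customer, Michael
        out("owns").                                  // verb: walk the owns edges
        hasLabel("Account")                           // noun: his accounts
```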

Note

This isn’t the first time the graph community has worked with semantic phrases and graph data. Those of you from the semantic community are likely shouting, “We’ve seen this before!” And you are right; we have.1

Putting rules of thumb #2 and #3 together yields a specific way to translate how you think into graph objects.

Rule of Thumb #4

Nouns and concepts should be vertex labels. Verbs should be edge labels.

Depending on how you think, there are times at which tip #3 and tip #4 can create ambiguous scenarios. We want to delve into some semantics here to help you navigate different ways that people see and think about data.

Specifically, if you think “Michael owns an account,” then “owns” should be an edge label. This is a case in which you are thinking actively about the relationship between Michael and his account. And this active line of thought translates owns to a verb that connects two pieces of data together. This is how we arrive at “owns” as an edge label.

However, there are cases in which you may see this same scenario differently. Namely, if you are thinking “We need to represent the concept of ownership between Michael and his account,” then ownership should be a vertex label. In that case, you are thinking of ownership as a noun—that is, an entity. The difference is that in this case, it is likely that the ownership needs to be identifiable. You probably are trying to relate ownership in other ways. In these cases, other questions you may plan on asking are, “Who established that ownership?” or “Who does the ownership transfer to if the primary agent dies?”

We acknowledge that we are getting into the weeds here. But we know that you will eventually find yourself in the weeds as well. We hope that the guidance we are providing will help you find your way back up and out.

Our first four tips introduced the fundamentals for identifying vertices and edges in your graph data. Let’s walk through how to reason about the direction of your edge labels.

Lost Yet? Let Us Walk You Through Direction

The questions and queries for this chapter integrate more data into our model. Specifically, we want to add transactions into our data so that we can answer questions like:

What are the most recent 20 transactions involving Michael's account?

To answer this query, we need to add transactions into our data model. And the model needs to give us a way to reason about how transactions withdraw and deposit money among accounts, loans, and credit cards.

When you first start writing graph queries and iterating on data models, it is very easy to get turned around. The direction of an edge label is a difficult thing to reason about, which is why we make the following recommendation.

Rule of Thumb #5

When in development, let the direction of your edges reflect how you would think about the data in your domain.

Tip #5 derives the direction of an edge label from the combined advice of the previous four tips. At this point, the pattern Vertex-Edge-Vertex should read easily as a subject-verb-object sentence.

Therefore, the edge label’s direction comes from the subject and goes to the object.

Coming up with edge labels between transactions is a discussion we have seen play out many times. Let’s follow through our thought process to detail how we reasoned about modeling something like a transaction in a graph.

An evolution of modeling transactions in a graph

Think about how you would first add transactions into your graph model. You likely are thinking about how an account transacts with other accounts, or something like we are showing in Figure 4-2.

pggd 0402
Figure 4-2. The data model most people start from: thinking about transactions as verbs, with phrases like “this account transacts with that account”

The model for Figure 4-2 doesn’t work for our example because it uses the idea of a transaction as a verb, whereas our questions use transactions as nouns. We want to know things like an account’s most recent transactions and which transactions are loan payments. In this light, we are really thinking about transactions as nouns.

Therefore, transactions need to be vertex labels in our example.

Now we need to reason about the direction of the edges. Most people start with modeling edge direction to follow the flow of money, as shown in Figure 4-3.

pggd 0403
Figure 4-3. Modeling edge direction according to the flow of money

The challenge with a model like Figure 4-3 is to come up with intuitive names for the edges that make it easy to answer our chapter’s questions. The edge direction in Figure 4-3 models the flow of money and is awkward for how we are using transactions in our questions. Would we say, “This account had money withdrawn from it via this transaction”? Let’s hope not.

So Figure 4-3 isn’t going to work for our example, either.

Let’s recall our chapter’s questions and reason about how we use transactions in the queries. We came up with the following subject-verb-object sentences for the context in which we are using transactions in our example:

  1. Transactions withdraw from accounts.

  2. Transactions deposit to accounts.

These two phrases look promising; let’s see how they hold up with data. In our data, we could model a transaction and how it interacts with accounts as shown in Figure 4-4.

pggd 0404
Figure 4-4. Modeling the direction of your edges according to how you would use them in your queries

For the example in this chapter, we think that Figure 4-4 makes it reasonably easy to use our model to answer our questions. This gives us direction for both of our edge labels: they will start at a Transaction vertex and go to an Account vertex. The schema is shown in Figure 4-5.

pggd 0405
Figure 4-5. Modeling the direction of your edges according to how you would think about the data in your domain

By breaking down your queries into short, active phrases of the structure subject-verb-object, you will be able to naturally find what needs to be a vertex or edge label in your graph model. Then the edge label’s direction will come from the subject and go to the object.
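The direction rule also shows up when inserting data. As a hedged sketch, adding a single withdraw_from edge with Gremlin starts at the subject (a Transaction) and ends at the object (an Account); every identifier below is a placeholder for illustration, not data from our generator:

```groovy
// subject (Transaction) --withdraw_from--> object (Account)
dev.V().has("Transaction", "transaction_id", 184).as("tx"). // the subject vertex
        V().has("Account", "account_id", "acct_14").        // the object vertex
        addE("withdraw_from").from("tx")                    // edge runs from subject to object
```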

Let’s zoom out from the nuances of modeling direction for transactions and get back to the final main element of a graph’s schema: properties.

When do we use properties?

Let’s repeat the first query that will use the transaction vertices:

What are the most recent 20 transactions involving Michael's account?

Our query translates to the following short phrases:

  1. Michael owns account

  2. Transactions withdraw from or deposit to his account

  3. Select the most recent 20 transactions

So far, we can walk through customers, accounts, and transactions within our graph. Now our question asks for the 20 most recent transactions from an account. This means that we need to subselect our transactions to include only the most recent ones.

Therefore, we will want the ability to filter transactions by time. This brings us to our last tip related to data modeling decisions.

Rule of Thumb #6

If you need to use data to subselect a group, make it a property.

Ordering transactions by time requires us to have that value stored in our graph model: enter properties. This is a great use case for a property on the transaction vertex, letting us subselect those vertices in our model. Figure 4-6 shows how we would add time into our ongoing example.

pggd 0406
Figure 4-6. Modeling time as a property on the transaction vertex so that we can subselect to query for only the most recent transactions

Together, tips #1–6 give you a great starting point for identifying what will be a vertex, an edge, or a property in your graph data model. We have one last section of data modeling best practices to consider before we start the implementation details for this chapter.

A Graph Has No Name: Common Mistakes in Naming

Arriving at a consensus on what something should be named, and maintaining that name within your codebase, is surprisingly difficult. There are three topics on which teams commonly waste valuable time bikeshedding naming conventions in their graph data model.

The callouts in the upcoming section are those common mistakes. Each mistake is followed by our bad-better-best recommendations.

Pitfalls in Naming Conventions #1

Using the word has as an edge label.

One of the most common mistakes we see comes from naming all of your edges with the label has, as shown on the left side of Figure 4-7. This is a mistake in naming because the word has does not provide meaningful context regarding the edge’s purpose or direction.

pggd 0407
Figure 4-7. From left to right: the bad, better, and recommended ways to name your edges

If your graph model uses has for its edge labels, we have two recommendations for you. A better edge label would have the form has_{vertex_label}, as shown in the center in orange in Figure 4-7. This type of name allows you to have more specificity in your graph queries while also providing a more meaningful name to maintain in your codebase.

The preferred solution to this problem is shown in green at far right in Figure 4-7. This recommendation advises you to use an active verb that communicates meaning, direction, and specificity to your data. We are going to use the edge labels deposit_to and withdraw_from to connect transactions to the accounts in our examples.

After meaningful edge labels have been selected, it is also a common mistake to create property names that do not help uniquely identify your data. This brings us to our next pitfall in property graph modeling.

Pitfalls in Naming Conventions #2

Using the word id as a property.

The concept of which pieces of data uniquely identify an entity is a deep topic. Using a property key called id is a bad decision because it does not describe what it refers to. Additionally, id clashes with internal naming conventions in Apache Cassandra and is not supported in DataStax Graph.

A slightly better convention would be to name the property that uniquely identifies your data with {vertex_label}_id, as shown at center in Figure 4-8. We use this a few times throughout the book because we are working with synthetic examples, and this type of identifier is perfectly fine if you use randomly generated identifiers, like UUIDs (universally unique identifiers). However, you will see us move to using more descriptive identifiers when we work with open source data. These identifiers represent concepts that uniquely identify entities within their domain, such as social security numbers, public keys, and domain-specific universally unique identifiers.

pggd 0408
Figure 4-8. From left to right, the bad, better, and recommended ways to name a property to uniquely identify your data.
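If you do reach for randomly generated identifiers during development, minting one is a single line of Groovy in a Studio Notebook. This sketch assumes the dev traversal source and the Customer schema from our running example:

```groovy
// mint a universally unique identifier and store it under a descriptive property key
customer_id = UUID.randomUUID().toString() // e.g., "3b241101-e2bb-4255-8caf-4136c566a962"
dev.addV("Customer").property("customer_id", customer_id)
```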

This brings us to the last and debatably most important mistake that we see throughout application codebases.

Pitfalls in Naming Conventions #3

Inconsistent use of casing.

When it comes to casing, the best approach is to follow the conventions of the language you are writing in. Some languages have style guides that promote CamelCase, whereas others prefer snake_case. For the examples in this book, we plan to follow these casing styles:

  1. Capital CamelCase for vertex labels

  2. Lowercase snake_case for edge labels, property keys, and example data

This last tip feels a bit pedantic to even bring up in a graph book. We mention it because consistency in naming conventions tends to be forgotten, creating expensive roadblocks for teams during the last stretch of getting their graph technology into production. The more trivial these tips seem to your team, the more likely it is that you are already following them.

Our Full Development Graph Model

The previous discussion of graph data modeling illustrated how we broke down our first query to evolve the example from Chapter 3. In this section, we want to build up the remaining elements in our data model to answer all the questions for this chapter’s example.

The example in this chapter adds schema and data that enable our application to answer the following three questions:

  1. What are the most recent 20 transactions involving Michael’s account?

  2. In December, at which vendors did Michael shop, and with what frequency?

  3. Find and update the transactions that Jamie and Aaliyah most value: their payments from their account to their mortgage loan.

We have already stepped through how to model the first question. Let’s take a closer look at it.

pggd 0409
Figure 4-9. The augmented graph schema from Chapter 3 that applies the data modeling principles to answer the first query of our expanded example

The graph schema in Figure 4-9 applies the principles we built up for answering the first question. The new vertex label is Transaction, with two new edge labels to the Account vertex: withdraw_from and deposit_to. We discussed how and where to model time in our graph, which you see in Figure 4-9 as the timestamp property on the Transaction vertex.

Next, let’s model the remaining questions for this chapter’s example:

  1. In December, at which vendors did Michael shop, and with what frequency?

  2. Find and update the transactions that Jamie and Aaliyah most value: their payments from their account to their mortgage loan.

To arrive at a data model for these questions, let’s apply the thought processes we introduced in “Graph Data Modeling 101”. Following the advice there, we came up with three statements about transactions:

  1. Transactions charge credit cards.

  2. Transactions pay vendors.

  3. Transactions pay loans.

From these statements, we can find the rest of our required schema elements. First, we need a new vertex label to represent where our customers shop: Vendor. Next, we need an edge label, pay, from a Transaction to the Loan or Vendor vertex labels. Last, we need another edge label, charge, to indicate that a transaction charges a credit card.

Bringing all of this together, we have the schema shown in Figure 4-10.

pggd 0410
Figure 4-10. The development schema that answers all of the queries we aim to build for this example

Before We Start Building

We reduced the full perspective on graph data modeling to include only the practices that we need for our current example. Beyond these core principles, you will find edge cases about your data that are not covered here. That is expected. We are teaching a thought process and selected the principles here as a starting guide for modeling your data like a graph.

Note

If we could ensure you understood one concept about graph data modeling, it would be the following: modeling your data as a graph is just as much of an art as it is engineering. The art of the data modeling process involves creating and evolving your perspective on your data. This evolution translates your mindset into the paradigm of relationship-first data modeling.

When you find new modeling cases in this book or in your own work, ask the following questions about what you are modeling to help develop your own reasoning:

  1. What does this concept mean to the end user of the application?

  2. How are you going to read this data in your application?

Defining your data model is the first step in applying graph thinking to your application. Focus on the data you can integrate, the queries you want to ask, and what this will mean to your end user. When combined, those three concepts articulate how we see, model, and use graph data within an application.

Our Thoughts on the Importance of Data, Queries, and the End User

To help you learn and apply our perspective to building your own graph model, let’s walk through the importance of data, queries, and the end user.

Our first piece of advice is to focus on the data you have. It is easy to boil the ocean by modeling your industry’s entire graph problem; avoid this rabbit hole! Your graph model will evolve if you keep centered on getting to production with the data with which your application will be working.

Second, apply the practice of query-driven design. Build your data model to accommodate only a predefined set of graph queries. A common red herring on this topic is applications that aim to allow open traversals across any discoverable data in a graph. For development purposes, the ability to explore and discover makes sense. For production use, however, an application with open traversal access can introduce a myriad of concerns.

For security, performance, and maintenance reasons, we strongly advise teams not to create production platforms with unbounded, unlimited traversals. The warning sign we watch for is a lack of specificity in a graph application. We know this perspective is hard to apply when you are first exploring graph data; we see the line as one of expectations, separating what you do during development from what you push to a distributed production application.

Last and most importantly, you have to consider what the data means to your end user. Everything from your naming conventions to the objects in your graph will be interpreted by someone else: your team members or your application users. Naming conventions and graph objects are interpreted and maintained by your engineering team; choose them wisely.

Ultimately, your graph data will be presented to an end user through your application. Spend time designing your data architecture, models, and queries to present information that is most meaningful to them.

When combined, these three concepts articulate how we see, model, and use graph data within an application. Again, the three concepts are to build with the data you have, follow query-driven design, and design for your end user. Following these design principles will help get you unstuck during those difficult data modeling discussions and prepare your application to be the best use of graph data the industry has ever seen.

Implementation Details for Exploring Neighborhoods in Development

Our schema from Figure 4-10 requires only two new vertex labels: Transaction and Vendor. You have already practiced taking a schema drawing and translating it into code a few times. We showed the schema in Figure 4-10, and in Example 4-1 we show the code.

Example 4-1.
schema.vertexLabel("Transaction").
       ifNotExists().
       partitionBy("transaction_id", Int).
       property("transaction_type", Text).
       property("timestamp", Text).
       create();

schema.vertexLabel("Vendor").
       ifNotExists().
       partitionBy("vendor_id", Int).
       property("vendor_name", Text).
       create();

Tip

In case you are wondering, we are using Text as the data type for timestamp to make it easier to teach concepts in our upcoming examples. We will be using the ISO 8601 standard format stored as text.
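One property of ISO 8601 makes this work: its strings sort lexicographically in the same order as the instants they represent, so plain text comparisons behave like time comparisons. A sketch of a time-window filter (the timestamp bounds are illustrative assumptions, not values from our generator):

```groovy
// ISO 8601 text sorts chronologically, so text predicates act as time predicates
dev.V().hasLabel("Transaction").
        has("timestamp", gte("2020-12-01T00:00:00Z")). // on or after Dec 1, 2020
        has("timestamp", lt("2021-01-01T00:00:00Z"))   // before Jan 1, 2021
```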

In addition to these vertex labels, we added relationships between the Transaction vertex and the other vertex labels in this graph. Let’s start with the new edge labels between the Transaction and Account vertex labels. The schema code for the new edge labels is shown in Example 4-2.

Example 4-2.
schema.edgeLabel("withdraw_from").
       ifNotExists().
       from("Transaction").
       to("Account").
       create();

schema.edgeLabel("deposit_to").
       ifNotExists().
       from("Transaction").
       to("Account").
       create();

These two edges model how money moves to and from an account within your bank. In Example 4-3, we add in the rest of the edge labels in our example:

Example 4-3.
schema.edgeLabel("pay").
       ifNotExists().
       from("Transaction").
       to("Loan").
       create();

schema.edgeLabel("charge").
       ifNotExists().
       from("Transaction").
       to("CreditCard").
       create();

schema.edgeLabel("pay").
       ifNotExists().
       from("Transaction").
       to("Vendor").
       create();

These last three edge labels complete the edges we will need to describe transactions between the assets in our example.

Generating More Data for Our Expanded Example

As examples grow, so too does the data. We wrote a small data generator to expand the data from Chapter 3 to include our data model from Figure 4-10. If you are interested in the data generation process for this chapter, you have two options.

Your first option is to use the bash scripts to reload the exact same data you will see in the upcoming examples. We will teach you about this tool and process in Chapter 5, but you are welcome to preview the loading script in the GitHub repository. We recommend using the scripts throughout this book if you would like the examples you are running locally to match the results we show in the text.

Your second option is to dive into and execute our data generation code. We provided our code in a separate Studio Notebook called Ch4_DataGeneration. We recommend this option if you want to dig into creating fake data with Gremlin and the methods we used.

An Important Warning About the Data Generation Process

If you rerun the data insertion process in your Studio Notebook, the results in your local graph will not precisely match the results printed in this text. If you want the data to match precisely, we recommend importing the exact same graph structure via DataStax Bulk Loader. You will find all of this in the accompanying technical materials.

Up to this point, we have accomplished many tasks. We explored our first set of data modeling tips, created a development model, looked at the schema code, and inserted data.

The last main task is to use the Gremlin query language to walk around our model and answer questions about our data.

Basic Gremlin Navigation

The main objective of this chapter is to illustrate a real-world graph schema that walks through multiple neighborhoods of graph data.

Tip

For your reference, we will use the words walk, navigate, and traverse interchangeably throughout this book to mean that we are writing graph queries.

Everything in this chapter up until now has set us up to answer the following three questions in this section:

  1. What are the most recent 20 transactions involving Michael’s account?

  2. In December, at which vendors did Michael shop, and with what frequency?

  3. Find and update the transactions that Jamie and Aaliyah most value: their payments from their account to their mortgage loan.

Let’s walk through the queries and their results. Then, in the chapter’s final section on Advanced Gremlin, we will delve a bit deeper into how to shape the result payload.

Our recommendation is that you find a way to reference Figure 4-10 as you practice the queries in the upcoming sections. We recommend doing this because your schema functions as your map; you need to know where you are so that you can walk in the right direction to your destination.

Query 1: What are the most recent 20 transactions involving Michael’s account?

Let’s start with some pseudocode in Example 4-4 to think about how we are going to walk through our data to answer this first question.

Example 4-4.
Question: What are the most recent 20 transactions involving Michael's account?
Process:
    Start at Michael's customer vertex
    Walk to his account
    Walk to all transactions
    Sort them by time, descending
    Return the top 20 transaction ids

We used the process outlined in Example 4-4 to create the Gremlin query in Example 4-5.

Example 4-5.
1 dev.V().has("Customer", "customer_id", "customer_0"). // the customer
2         out("owns").                       // walk to his account
3         in("withdraw_from", "deposit_to"). // walk to all transactions
4         order().                           // sort the vertices
5           by("timestamp", desc).   // by their timestamp, descending
6         limit(20).                         // filter to only the 20 most recent
7         values("transaction_id")           // return the transaction_ids

A sample of the results:

"184", "244", "268", ...

Let’s dig into this query one step at a time.

On line 1, dev.V().has("Customer", "customer_id", "customer_0") looks up a vertex according to its unique identifier. Then on line 2, the step out("owns") walks through the outgoing owns edge to the Account vertices for this customer. In this case, Michael has only one account.

At this point, we want to access all transactions. On line 3, the in("withdraw_from", "deposit_to") step does just that: we walk through the incoming withdraw_from and deposit_to edges to reach the transactions. At the end of line 3, the traversal is positioned on the Transaction vertices.

Note

We left a detail out of “An evolution of modeling transactions in a graph” that we want to bring up now. The simplicity of line 3 in Example 4-5 was also part of the motivation that led to how we designed the edges in our data model. This first query was much harder to write and reason about when the edges were going in different directions.

The order() step on line 4 is a barrier: it collects the transaction vertices so they can be sorted. We specify the sort order on line 5 with the by("timestamp", desc) modulator, which sorts all Transaction vertices by their timestamp, descending. Then limit(20) on line 6 keeps only the 20 most recent vertices. Last, on line 7, we extract the transaction_ids via the values("transaction_id") step.

This query will return a list of values that contains the transaction_id for each of the 20 most recent transactions across all of the customer’s accounts.
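The sort-and-limit logic of Example 4-5 can be imitated in plain Python. This is a sketch over hypothetical in-memory transaction records, not the Gremlin runtime; the timestamps and ids here are made up for illustration:

```python
# Hypothetical stand-ins for Transaction vertices.
transactions = [
    {"transaction_id": "184", "timestamp": "2020-12-14T09:30:00Z"},
    {"transaction_id": "244", "timestamp": "2020-12-02T17:05:00Z"},
    {"transaction_id": "268", "timestamp": "2020-11-20T08:00:00Z"},
]

# order().by("timestamp", desc) then limit(20): sort newest-first, keep 20.
most_recent = sorted(transactions, key=lambda t: t["timestamp"], reverse=True)[:20]

# values("transaction_id"): project out just the ids.
recent_ids = [t["transaction_id"] for t in most_recent]
print(recent_ids)  # → ['184', '244', '268']
```

The list comprehension at the end plays the role of the values() step: it discards the vertex structure and keeps only the requested property.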

Imagine how much more powerful this would be to display for the end user. They would be able to see the details that are most relevant to them instead of navigating multiple screens to join this data together in their head. This type of query is vital in understanding how to personalize your application to what a customer most cares about.

Query 2: In December 2020, at which vendors did Michael shop, and with what frequency?

For this second question, let’s start with an outline of the query in Example 4-6 to think about how we are going to walk through our data to answer the question.

Example 4-6.
Question: In December 2020, at which vendors did Michael shop, and with what frequency?
Process:
    Start at Michael's customer vertex
    Walk to his credit card
    Walk to all transactions
    Only consider transactions in December 2020
    Walk to the vendors for those transactions
    Group and count them by their name

We start the process outlined in Example 4-6 in Example 4-7 and complete it in Example 4-8. In preparation for this query, we used the ISO 8601 timestamp standard in our data to make it easier to range on dates. In ISO 8601, timestamps are commonly formatted as YYYY-MM-DD’T’hh:mm:ss’Z’, where 2020-12-01T00:00:00Z represents the very beginning of December 2020.
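One reason ISO 8601 works so well for range filters is that its fields are fixed-width and zero-padded, so plain string comparison agrees with chronological order. A quick Python check of that property, using the same December 2020 bounds:

```python
start = "2020-12-01T00:00:00Z"
end = "2021-01-01T00:00:00Z"

# For ISO 8601 strings, lexicographic order and chronological order coincide.
assert start < end

# between(start, end) semantics: inclusive lower bound, exclusive upper bound.
def in_december_2020(ts):
    return start <= ts < end

print(in_december_2020("2020-12-31T23:59:59Z"))  # True
print(in_december_2020("2021-01-01T00:00:00Z"))  # False
```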

Example 4-7.
1 dev.V().has("Customer", "customer_id", "customer_0"). // the customer
2         out("uses").                         // Walk to his credit card
3         in("charge").                        // Walk to all transactions
4         has("timestamp",                     // Only consider transactions
5             between("2020-12-01T00:00:00Z",  // in December 2020
6                     "2021-01-01T00:00:00Z")).
7         out("pay").                          // Walk to the vendors
8         groupCount().                        // group and count them
9           by("vendor_name")                  // by their name

The results are:

{
  "Nike": "2",
  "Amazon": "1",
  "Target": "3"
}
Warning

Randomization affects the results of query 2. If you use the data generation process instead of loading the data, your graph may have a slightly different structure and therefore different counts for query 2.

The setup for Example 4-7 follows a similar access pattern as before, where we start at a customer and then traverse to a neighboring vertex. We start at customer_0, walk to his credit card, and then to its transactions. On lines 4 through 6, we filter the data mid-traversal, keeping only the vertices whose timestamps fall within a specific range. Specifically, has("timestamp", between("2020-12-01T00:00:00Z", "2021-01-01T00:00:00Z")) passes through only the transactions with a timestamp during December 2020.

At line 7, following our schema, we walk to the vendors with the out("pay") step. Finally, we want to return the vendor’s name along with how many times a transaction was observed with that vendor. We do this on lines 8 and 9 with groupCount().by("vendor_name").
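The groupCount().by("vendor_name") step amounts to tallying a property across the incoming vertices. In plain Python, over a hypothetical list of vendor names reached by the out("pay") step, the same tally looks like this:

```python
from collections import Counter

# Hypothetical vendor names, one per transaction in December 2020.
vendors = ["Nike", "Target", "Amazon", "Target", "Nike", "Target"]

# groupCount().by("vendor_name"): build a map from name to occurrence count.
counts = Counter(vendors)
print(dict(counts))  # {'Nike': 2, 'Target': 3, 'Amazon': 1}
```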

In addition to between, Table 4-1 lists the most popular predicates you can use to range on values. Please refer to the book by Kelvin Lawrence for the full table of predicates.2

Table 4-1. Some of the most popular predicates that you can use to range on values

Predicate   Usage
eq          Equal to
neq         Not equal to
gt          Greater than
gte         Greater than or equal to
lt          Less than
lte         Less than or equal to
between     Between two values, excluding the upper bound
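As a rough Python sketch of these predicate semantics (the dictionary of lambdas is an illustration, not part of any Gremlin API; note that between keeps the lower bound and drops the upper):

```python
# Approximate semantics of the Gremlin comparison predicates.
predicates = {
    "eq":  lambda v, a: v == a,
    "neq": lambda v, a: v != a,
    "gt":  lambda v, a: v > a,
    "gte": lambda v, a: v >= a,
    "lt":  lambda v, a: v < a,
    "lte": lambda v, a: v <= a,
}

def between(value, low, high):
    # Inclusive of low, exclusive of high.
    return low <= value < high

print(predicates["gte"](5, 5))  # True
print(between(10, 10, 20))      # True  (lower bound included)
print(between(20, 10, 20))      # False (upper bound excluded)
```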

You may be wondering: what if we wanted to order the output of Example 4-7?

If you wanted to return the results in a decreasing order, you would do that by adding in the order().by() pattern, shown on lines 10 and 11 in Example 4-8.

Example 4-8.
1 dev.V().has("Customer", "customer_id", "customer_0").
2         out("uses").
3         in("charge").
4         has("timestamp",
5             between("2020-12-01T00:00:00Z",
6                     "2021-01-01T00:00:00Z")).
7         out("pay").
8         groupCount().
9           by("vendor_name").
10        order(local).         // Order the map object
11          by(values, desc)    // according to the groupCount map's values

The results are now:

{
  "Target": "3",
  "Nike": "2",
  "Amazon": "1"
}

We threw in the use of scope in a traversal at line 10 with the step order(local).

Scope

Scope determines whether the particular operation is to be performed to the current object (local) at that step or to the entire stream of objects up to that step (global).

For a visual explanation of scope in a traversal, consider Figure 4-11.

Figure 4-11. A visual example of the difference between global and local scope in a Gremlin traversal

To explain it simply, at the end of line 9, we needed to order the object in the pipeline, which is a map. The use of local on line 10 tells the traversal to sort and order the items within the map object. Another way to think about this is that we want to order the entries within the map. We do that by indicating that the scope is local to the object itself.
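The effect of order(local).by(values, desc) on a single map can be imitated in Python by re-sorting the entries of a dict (Python dicts preserve insertion order, so re-building the dict re-orders its entries):

```python
# The groupCount map as it might arrive in the pipeline.
vendor_counts = {"Nike": 2, "Amazon": 1, "Target": 3}

# order(local).by(values, desc): sort the entries *within* this one object
# by their values, descending, rather than sorting a stream of objects.
ordered = dict(sorted(vendor_counts.items(), key=lambda kv: kv[1], reverse=True))
print(ordered)  # {'Target': 3, 'Nike': 2, 'Amazon': 1}
```

With global scope, the sort would instead apply across all objects flowing through the pipeline; here there is only one map object, so local scope is what reaches inside it.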

The best way to understand traversal scope is to play with different queries in your Studio Notebook and see how the scope affects the shape of your results. More great visual diagrams on understanding the flow of data and object types are available on the DataStax Graph documentation pages.

Tip

If you ever question what object type you have in the middle of developing a Gremlin traversal, add .next().getClass() to where you are in your traversal development. This will inspect the objects at this point in your traversal and give you their class.

Query 3: Find and update the transactions that Jamie and Aaliyah most value: their payments from their account to their mortgage loan.

The advantage of using a graph database really starts to show as we walk through multiple neighborhoods of data, as we will be doing with this third and last query. Here, we are accessing and mutating data across five neighborhoods of data in our graph. We are going to break this query down into three steps: access, mutation, and then validation.

The first simplification we are going to make is to reduce the scope of the query. We know that Jamie and Aaliyah share only one account: acct_0. Therefore, we can walk from only one person; we choose Aaliyah.

This brings us to the first shorter query we want to build:

Query 3a: Find Aaliyah’s transactions that are loan payments

Before we can update important transactions, we need to find the important ones. The transactions we are looking for are those that indicate a loan payment from Aaliyah’s joint account to Jamie and Aaliyah’s mortgage. Let’s outline our approach in pseudocode in Example 4-9 to think about how we are going to walk through our data to answer the question.

Example 4-9.
Question: Find Aaliyah's transactions that are loan payments
Process:
    Start at Aaliyah's customer vertex
    Walk to her account
    Walk to transactions that are withdrawals from the account
    Go to the loan vertices
    Group and count the loan vertices

We used the process outlined in Example 4-9 to create the Gremlin query in Example 4-10.

Example 4-10.
1 dev.V().has("Customer", "customer_id", "customer_4"). // accessing Aaliyah's vertex
2         out("owns").                      // walking to the account
3         in("withdraw_from").              // only consider withdrawals
4         out("pay").                       // walking out to loans or vendors
5         hasLabel("Loan").                 // limiting to only loan vertices
6         groupCount().                     // groupCount the loan vertices
7           by("loan_id")                   // by their loan_id

The results for the sample data will look like:

{
  "loan80": "24",
  "loan18": "24"
}

Let’s step through Example 4-10. On line 1, we access Aaliyah’s customer vertex. On line 2, we traverse to her account. Recalling the schema, on line 3 we walk through the incoming withdraw_from edge to access the account’s withdrawals.

On line 4, we walk through the pay edge label to arrive at either Loan or Vendor vertices. The hasLabel("Loan") step on line 5 is a filter that eliminates all vertices at this point that are not loans. This means we are now considering only the assets into which a payment has been made from the account and that are loans. On line 6, we group and count those loan vertices according to their unique identifier, as indicated on line 7.

The result payload indicates that this account has made 24 payments into each of the two loans in the system.

Next, we want to go a step further and update the data in this traversal to indicate which transactions are mortgage payments.

Query 3b: Find and update the transactions that Jamie and Aaliyah most value: their payments from their checking account to their mortgage, loan_18

The traversal required to accomplish this query is a mutating traversal. All we mean by mutating traversal is that it updates data in the graph as a part of the traversal. Example 4-11 shows how we can use the traversal above to write properties on the transactions that go from the account and into loan_18, because loan_18 is Jamie and Aaliyah’s mortgage loan.

Example 4-11.
1 dev.V().has("Customer", "customer_id", "customer_4"). // accessing Aaliyah's vertex
2         out("owns").                                  // walking to the account
3         in("withdraw_from").                          // only consider withdrawals
4         filter(
5                out("pay").                            // walking to loans or vendors
6                has("Loan", "loan_id", "loan_18")).    // only keep loan_18
7         property("transaction_type",  // mutating step: set the "transaction_type"
8                  "mortgage_payment"). // to "mortgage_payment"
9         values("transaction_id", "transaction_type")  // return transaction & type

The results are:

"144", "mortgage_payment",
"153", "mortgage_payment",
"132", "mortgage_payment",
...

Example 4-11 starts the same as the first part of our query. The new portion of this traversal spans lines 4 through 6: the filter(out("pay").has("Loan", "loan_id", "loan_18")) step. Here, we allow only the transactions connected to the loan_18 vertex, Jamie and Aaliyah’s mortgage loan, to continue down the pipeline. On lines 7 and 8, we mutate those transaction vertices by setting their transaction_type property to "mortgage_payment". Finally, on line 9, we return each transaction_id along with its new transaction_type property.
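The filter-then-mutate pattern can be sketched in plain Python. These transaction records are hypothetical (each one flattens its pay edge to a single "pays" field for illustration), but the shape of the logic matches the traversal: keep a transaction only if its side walk reaches loan_18, then set a property on the survivors:

```python
# Hypothetical transactions, each recording which asset its "pay" edge reaches.
transactions = [
    {"transaction_id": "144", "pays": "loan_18"},
    {"transaction_id": "150", "pays": "vendor_3"},
    {"transaction_id": "153", "pays": "loan_18"},
]

# filter(out("pay").has("Loan", "loan_id", "loan_18")):
# keep only the transactions whose side traversal reaches loan_18.
mortgage_payments = [t for t in transactions if t["pays"] == "loan_18"]

# property("transaction_type", "mortgage_payment"): mutate the survivors.
for t in mortgage_payments:
    t["transaction_type"] = "mortgage_payment"

print([t["transaction_id"] for t in mortgage_payments])  # ['144', '153']
```

Note that vendor transactions pass through the pipeline untouched; the filter removes them from the stream before the mutating step runs.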

Query 3c: Verify that we didn’t update every transaction

At this point, it is very helpful to make sure that we did not update all of Aaliyah’s transactions with mortgage_payment. We can do that with a quick check, shown in Example 4-12.

Example 4-12.
// check that we didn't update every transaction
1 dev.V().has("Customer", "customer_id", "customer_4"). // at the customer vertex
2         out("owns").                 // at the account vertex
3         in("withdraw_from").         // at all withdrawals
4         groupCount().                // group and count the vertices
5           by("transaction_type")     // according to their transaction_type

The results from the Studio Notebook are shown below; unknown is the default value for transaction_type that we set during the data loading process, which is also shown in the Studio Notebook:

{
  "mortgage_payment": "24",
  "unknown": "47"
}

This query does a quick check to validate that we properly mutated our data. Combining lines 1 through 3, we process all of the transactions from Aaliyah’s bank account. At line 4, we do a groupCount() for all of those vertices according to the value stored in the transaction_type property. Here, we see that we correctly updated only the 24 transactions that are mortgage payments to loan_18. This validates that our mutation query properly updated our graph structure.
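The validation grouping can be sketched the same way as the earlier tallies. These counts mirror the sample output above, with 24 mutated transactions and 47 still at the loaded default of "unknown" (hypothetical records for illustration):

```python
from collections import Counter

# Hypothetical transaction_type values after the mutation in Example 4-11.
types = ["mortgage_payment"] * 24 + ["unknown"] * 47

# groupCount().by("transaction_type"): tally each distinct value.
counts = Counter(types)
print(dict(counts))  # {'mortgage_payment': 24, 'unknown': 47}
```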

This section started out with three questions, and the last three examples answered them using the Gremlin query language.

We stepped through the basic queries to show you where to start. Get your basic graph walks ironed out before you start exploring the full flexibility and expressivity of the Gremlin query language. We always recommend iterating through Gremlin steps in development mode to find the basic walks that accomplish your queries. This means we are asking you to execute line 1 of a Gremlin query and look at the results. Then execute lines 1 and 2 and look at the results, and so on.

After you have mapped out your basic walks, you can try out more advanced Gremlin. At this point in development, it is very common to find ways to create specific payload structures to pass back to your endpoint.

We will cover the most popular strategies for building JSON with Gremlin in the next section.

Advanced Gremlin: Shaping Your Query Results

The goal of this section is to build up a more advanced version of our Gremlin query that answers a new question:

Is there anyone else who shares accounts, loans, or credit cards with Michael?

We would like to introduce a new question to demonstrate advanced Gremlin concepts within a small neighborhood of data. Once you understand how these concepts apply to this question, we invite you to use the accompanying notebook for this chapter to implement the concepts for the other queries introduced in “Basic Gremlin Navigation”.

We will work through shaping the results of our new query in a few stages. They are:

  1. Shaping query results with the project(), fold(), and unfold() steps

  2. Removing data from the results with the where(neq()) pattern

  3. Planning for robust result payloads with the coalesce() step

Tip

For anyone diving deeper into the world of Gremlin queries, we highly recommend the detail and explanations in the book Practical Gremlin: An Apache TinkerPop Tutorial by Kelvin Lawrence.3

Shaping Query Results with the project(), fold(), and unfold() Steps

When we start writing a new query, we like to slowly build up its required pieces. One of the most useful Gremlin steps is the project() step, because it helps us build up a specific map of data from our query. Let’s start building our query out by defining the three keys we want to have in our map: CreditCardUsers, AccountOwners, and LoanOwners.

1 dev.V().has("Customer", "customer_id", "customer_0").
2         project("CreditCardUsers", "AccountOwners", "LoanOwners").
3           by(constant("name or no owner for credit cards")).
4           by(constant("name or no owner for accounts")).
5           by(constant("name or no owner for loans"))

This query structure is the base of what we are building toward. We want to start with a specific person in this example—namely Michael. Then we want to create a data structure that will have three keys: CreditCardUsers, AccountOwners, and LoanOwners. We create this map with the project() step on line 2. The arguments to the project() step are the three keys. For each key in the project() step, we want to have a by() step. Each by() modulator creates the values associated to the keys:

  1. The by() modulator on line 3 will create a value for the CreditCardUsers key.

  2. The by() modulator on line 4 will create a value for the AccountOwners key.

  3. The by() modulator on line 5 will create a value for the LoanOwners key.

Let’s take a look at the results at this point:

{
  "CreditCardUsers": "name or no owner for credit cards",
  "AccountOwners": "name or no owner for accounts",
  "LoanOwners": "name or no owner for loans"
}

This is a good baseline to work from. Next, let’s walk through our graph structure to start to populate the values in our map. We will start with the data for the first key: finding people who share a credit card with Michael.

Thinking back to our schema, we will need to walk through the uses edge to get to the credit cards. Then we will walk back through the uses edge to get back to people. After that, we want to access their names. In Gremlin, we would add this walk on lines 3, 4, and 5:

1 dev.V().has("Customer", "customer_id", "customer_0").
2         project("CreditCardUsers", "AccountOwners", "LoanOwners").
3         by(out("uses").
4            in("uses").
5            values("name")).
6         by(constant("name or no owner for accounts")).
7         by(constant("name or no owner for loans"))

The only steps we added were to walk from Michael out to his credit card via the uses edge on line 3. Then, on line 4, we walk back to all people who use that credit card. The resulting payload is:

{
  "CreditCardUsers": "Michael",
  "AccountOwners": "name or no owner for accounts",
  "LoanOwners": "name or no owner for loans"
}

This confirms what we know: Michael didn’t share any credit cards with other people. We expected to see his name in the result set.

Now let’s do the same thing for the next key in our map: AccountOwners. Here, we want to walk out the owns edge to the account vertex and back to the person vertex:

1 dev.V().has("Customer", "customer_id", "customer_0").
2           project("CreditCardUsers", "AccountOwners", "LoanOwners").
3         by(out("uses").
4            in("uses").
5            values("name")).
6         by(out("owns").
7            in("owns").
8            values("name")).
9         by(constant("name or no owner for loans"))

Let’s look at the resulting payload:

{
  "CreditCardUsers": "Michael",
  "AccountOwners": "Michael",
  "LoanOwners": "name or no owner for loans"
}

Looking at this data, we do not see what we would expect: Maria should appear as a value for AccountOwners. Maria does not show up because Gremlin is lazy; without a barrier, the by() modulator takes only the first result its traversal produces, not all of them. We need to add a barrier step to force all results to be gathered before they are returned.

The barrier that we like to use here is fold(). The fold() step will wait for all of the data to be found and then roll up the results into a list. This is a bonus, because now we can build up specific data type rules for our application. The adjusted query reads:

1 dev.V().has("Customer", "customer_id", "customer_0").
2         project("CreditCardUsers", "AccountOwners", "LoanOwners").
3         by(out("uses").
4            in("uses").
5            values("name").
6            fold()).
7         by(out("owns").
8            in("owns").
9            values("name").
10           fold()).
11        by(constant("name or no owner for loans"))

The shape of the data in the resulting payload is what we were expecting to see:

{
  "CreditCardUsers": [
    "Michael"
  ],
  "AccountOwners": [
    "Michael",
    "Maria"
  ],
  "LoanOwners": "name or no owner for loans"
}
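The fold() barrier’s behavior, draining a lazy stream of results and rolling them into a single list, looks like this in plain Python (a generator stands in for the pipeline’s stream of name values):

```python
def names_stream():
    # Stand-in for the lazy stream of "name" values in the pipeline.
    yield "Michael"
    yield "Maria"

# fold(): drain the whole stream and roll the results into one list.
folded = list(names_stream())
print(folded)  # ['Michael', 'Maria']
```

Without the list() call, only the first yielded value would be consumed by a caller that takes one item, which is the analog of the missing-Maria behavior above.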

Let’s complete the construction of our map by adding in the statements in the last by() step. These statements need to walk from Michael out to his loan and then back. The query and result set are:

1 dev.V().has("Customer", "customer_id", "customer_0").
2         project("CreditCardUsers", "AccountOwners", "LoanOwners").
3         by(out("uses").
4            in("uses").
5            values("name").
6            fold()).
7         by(out("owns").
8            in("owns").
9            values("name").
10           fold()).
11        by(out("owes").
12           in("owes").
13           values("name").
14           fold())
{
  "CreditCardUsers": [
    "Michael"
  ],
  "AccountOwners": [
    "Michael",
    "Maria"
  ],
  "LoanOwners": [
    "Michael"
  ]
}

At this point, we have the expected results. We see that Michael shares an account with Maria. And we see that Michael doesn’t share credit cards or loans with anyone else.

For some applications, it isn’t helpful to return that Michael shares a credit card with himself. Let’s dive into how we would remove Michael from this resulting payload.

Removing Data from the Results with the where(neq()) Pattern

It might be useful for you to eliminate Michael from the result set. We can do that by using the as() step to store Michael’s vertex, and then filtering that stored vertex out. You can remove a vertex from your pipeline with the where(neq("some_stored_value")) pattern.

The next version of our query, in which we have directly applied this step to each section, is shown in Example 4-13.

Example 4-13.
1 dev.V().has("Customer", "customer_id", "customer_0").as("michael").
2         project("CreditCardUsers", "AccountOwners", "LoanOwners").
3         by(out("uses").
4            in("uses").
5              where(neq("michael")).
6            values("name").
7            fold()).
8         by(out("owns").
9            in("owns").
10             where(neq("michael")).
11           values("name").
12           fold()).
13        by(out("owes").
14           in("owes").
15             where(neq("michael")).
16           values("name").
17           fold())

The full results of Example 4-13 are shown below:

{
  "CreditCardUsers": [],
  "AccountOwners": [
    "Maria"
  ],
  "LoanOwners": []
}

The main additions to our query occur on lines 1, 5, 10, and 15 in the above query. On line 1, we store the vertex for Michael with the as("michael") step. Let’s take a look at what is happening with where(neq("michael")) on line 5, which is the same thing that is happening on lines 10 and 15.

To understand what is happening on line 5, you need to remember where you are in your graph. At the end of line 4, we are on Customer vertices. Specifically, we are processing customers that share an account with Michael. This is where the where(neq("michael")) step comes in. We want to apply a true/false filter to every vertex in the pipeline. The true/false filter test is whether or not that vertex is equal to Michael: where(neq("michael")). If the vertex is Michael, line 5 eliminates it from the traversal. If the vertex is not Michael, the vertex passes through the filter and remains in the pipeline.
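The as("michael")…where(neq("michael")) pattern amounts to remembering the starting object and dropping it whenever the walk loops back to it. A plain-Python sketch with hypothetical customer objects:

```python
# as("michael"): remember the starting vertex.
michael = {"name": "Michael"}
maria = {"name": "Maria"}

# Vertices reached after walking out("owns").in("owns"): the walk comes
# back through Michael himself as well as through his co-owner.
account_owners = [michael, maria]

# where(neq("michael")): drop any vertex identical to the stored one.
others = [v for v in account_owners if v is not michael]
print([v["name"] for v in others])  # ['Maria']
```

The identity comparison (is not) mirrors the Gremlin behavior: the filter compares against the stored vertex itself, not against the string "Michael".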

Planning for Robust Result Payloads with the coalesce() Step

Depending on your team’s data structure rules, an empty list as a value in your data payload may not be acceptable. We can design around that.

We can implement try/catch logic so that your query doesn’t return an empty list. We will step through this for the first key in the map: CreditCardUsers. After we step through that, we will add in the full query details for the two remaining by() steps.

Let’s rewind and go back to just building up the JSON payload for the value associated to CreditCardUsers. We are starting from here:

1 dev.V().has("Customer", "customer_id", "customer_0").as("michael").
2         project("CreditCardUsers", "AccountOwners", "LoanOwners").
3         by(out("uses").
4            in("uses").
5              where(neq("michael")).
6            values("name").
7            fold()).
8         by(constant("name or no owner for accounts")).
9         by(constant("name or no owner for loans"))
{
  "CreditCardUsers": [],
  "AccountOwners": "name or no owner for accounts",
  "LoanOwners": "name or no owner for loans"
}

You can implement try/catch logic in Gremlin with the coalesce() step. We want to shape the results so that there is always a value in the lists for each key, like "CreditCardUsers": ["NoOtherUsers"]. Let’s start by seeing how to integrate the coalesce step into our query:

1 dev.V().has("Customer", "customer_id", "customer_0").as("michael").
2         project("CreditCardUsers", "AccountOwners", "LoanOwners").
3         by(out("uses").
4            in("uses").
5              where(neq("michael")).
6            values("name").
7            fold().
8            coalesce(constant("tryBlockLogic"),    // try block
9                     constant("catchBlockLogic"))).// catch block
10        by(constant("name or no owner for accounts")).
11        by(constant("name or no owner for loans"))

The resulting payload is:

{
  "CreditCardUsers": "tryBlockLogic",
  "AccountOwners": "name or no owner for accounts",
  "LoanOwners": "name or no owner for loans"
}

When you use the coalesce() step in line 8, it takes two arguments. The first argument is on line 8 and can be thought of as the try block logic. The second argument is on line 9 and can be thought of as the catch block logic.

If the try block logic succeeds, then the resulting data is passed down the pipeline. In this case, for illustrative purposes, we used something that would definitely succeed: the constant() step. This step returned the string "tryBlockLogic" that we see in the resulting payload. The constant() step is useful for many reasons, one of which is that it can serve as a placeholder while you build up more complicated queries. This is how we are using it here.

Should the first argument of the coalesce() step fail on line 8, the second argument will execute on line 9. Let’s look at how we can use this to populate what we want in our data payload:

1 dev.V().has("Customer", "customer_id", "customer_0").as("michael").
2         project("CreditCardUsers", "AccountOwners", "LoanOwners").
3         by(out("uses").
4            in("uses").
5              where(neq("michael")).
6            values("name").
7            fold().
8            coalesce(unfold(),                   // try block
9                     constant("NoOtherUsers"))). // catch block
10        by(constant("name or no owner for accounts")).
11        by(constant("name or no owner for loans"))
{
  "CreditCardUsers": "NoOtherUsers",
  "AccountOwners": "name or no owner for accounts",
  "LoanOwners": "name or no owner for loans"
}

On line 8, the logic that we added as the try block is the unfold() step. It tries to take the results from the previous step and unfold them. At this point in the pipeline, the results are an empty list []. In Gremlin, unfolding an empty list produces nothing, so the first argument of the coalesce() step fails to yield a result. Therefore, we execute line 9, the second argument of the coalesce() step: constant("NoOtherUsers"). This is why we see the entry "CreditCardUsers": "NoOtherUsers" in our result payload.

Regrettably, we lost our guaranteed list structure. We can add that back in with a fold() after the coalesce() step:

1 dev.V().has("Customer", "customer_id", "customer_0").as("michael").
2         project("CreditCardUsers", "AccountOwners", "LoanOwners").
3         by(out("uses").
4            in("uses").
5              where(neq("michael")).
6            values("name").
7            fold().
8            coalesce(unfold(),
9                     constant("NoOtherUsers")).fold()).
10        by(constant("name or no owner for accounts")).
11        by(constant("name or no owner for loans"))
{
  "CreditCardUsers": [
    "NoOtherUsers"
  ],
  "AccountOwners": "name or no owner for accounts",
  "LoanOwners": "name or no owner for loans"
}

The steps we added create a predictable data structure to exchange throughout your application: well-formatted JSON that other applications can reason about.
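The whole fold().coalesce(unfold(), constant(...)).fold() idiom boils down to "if the list is empty, substitute a default, and always return a list." A plain-Python sketch of that guarantee (the function name is ours, for illustration):

```python
def names_or_default(names, default="NoOtherUsers"):
    # fold(): the upstream results arrive collected into a list.
    # coalesce(unfold(), constant(default)): if unfolding the list yields
    # nothing, fall back to the default value.
    # trailing fold(): re-wrap so the payload is always a list.
    return names if names else [default]

print(names_or_default([]))         # ['NoOtherUsers']
print(names_or_default(["Maria"]))  # ['Maria']
```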

Next, we need to add this try/catch logic to each by() step. The full logic pattern to add at the end of each by() step in our full query is:

coalesce(unfold(),                  // try to unfold the names
         constant("NoOtherUsers")). // inject this string if there are no names
fold()                              // structure the results into a list

This Gremlin pattern ensures we have a nonempty list in the resulting payload. The full query and its results are:

1 dev.V().has("Customer", "customer_id", "customer_0").as("michael").
2         project("CreditCardUsers", "AccountOwners", "LoanOwners").
3         by(out("uses").
4            in("uses").
5              where(neq("michael")).
6            values("name").
7            fold().
8            coalesce(unfold(),
9                     constant("NoOtherUsers")).fold()).
10         by(out("owns").
11           in("owns").
12             where(neq("michael")).
13           values("name").
14           fold().
15           coalesce(unfold(),
16                    constant("NoOtherUsers")).fold()).
17        by(out("owes").
18           in("owes").
19             where(neq("michael")).
20           values("name").
21           fold().
22           coalesce(unfold(),
23                    constant("NoOtherUsers")).fold())
{
  "CreditCardUsers": [
    "NoOtherUsers"
  ],
  "AccountOwners": [
    "Maria"
  ],
  "LoanOwners": [
    "NoOtherUsers"
  ]
}

We find that iterative building and stepping through Gremlin steps is the best way to wrap your head around the query language. This book is about teaching you our thought processes, and this is how we think through using Gremlin. There is more than one way to write a graph query; we hope you are curious about using other steps to process the same data. Figuring this out can be as easy as opening up a Studio Notebook and exploring new steps on your own.

Moving from Development into Production

Bringing back our scuba analogy from the beginning of this chapter, our time training in the pool has come to a close. As we see it, the progression through the technical examples in this chapter is just like learning buoyancy control or deepwater troubleshooting within a pool. At some point, you have learned everything you can from practicing in a controlled environment.

With the foundation we have built over the past few chapters, it is time to take the leap out of your development environment and build a production-ready graph database.

Before you get too concerned, this doesn’t mean you are supposed to know everything there is to know about graph data. There are still myriad topics we are continuing to explore ourselves.

What it does mean, however, is that we think you are ready to move into a deeper understanding of using graph data in distributed systems. We set up this example to get you ready for one last step down into the physical data layer of understanding graph data structures in Apache Cassandra. Specifically, the upcoming chapter will show you how to optimize your graph structures for distributed applications.

While illustrating how we think through graph data, we purposefully set up some traps in the example in this chapter. In the next chapter, we will show these traps to you and walk you through their resolution. This upcoming chapter will be the last chapter that uses our C360 example, as it will describe the final iteration in creating a production-quality graph schema for this example.

1 Ora Lassila and Ralph R. Swick, “Resource Description Framework (RDF) Model and Syntax Specification,” 1999. https://oreil.ly/zWcnO

2 Kelvin Lawrence, Practical Gremlin: An Apache TinkerPop Tutorial, January 6, 2020, https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html.

3 Kelvin Lawrence, Practical Gremlin: An Apache TinkerPop Tutorial, January 6, 2020, https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html.
