Chapter 1. How to Get Value from Graphs in Just Five Days

All the world’s a graph. In the early 2010s, enterprises adopted graphs mostly for very specific use cases, such as online real-time recommendations or impact analysis. They chose graph databases over relational and other NoSQL databases because of their performance, scalability, and astonishing ability to traverse relationships in connected data in real time.

Fast forward 10 years, and the graph technology landscape has exploded. Beyond those niche use cases, graphs are now the answer to a critical aspect of today’s data: complexity.

The last two decades have been about data–data collection, analysis, prediction, protection. Everything around us captures data. Some organisations exist solely to analyse data and provide insights. For others, how well they use data determines the success of the business.

Ever since Clive Humby1 proclaimed that “data is the new oil”2 back in 2006, the imagination, creativity, and technical innovation of companies have known no bounds. From the rise of NoSQL databases in the early 2000s to the mind-boggling pace of Generative AI (GenAI) today, it is clear that we are not even close to being done with data. We live in the data age, for sure, but more importantly, we live in the time of connected data–and value lies in the connections.

As digital consumers, we now expect relevant, personalised experiences. The world is connected, data about the world is inherently connected, and our digital footprints across devices, transactions, and social media leave rich stories to be uncovered.

Many large companies, despite their best intentions to use data to improve users’ experiences and drive decisions, struggle to provide universal access to their data. Data continues to be siloed across various enterprise systems, and even though the situation is in much better shape than a decade ago, queries across these systems to unlock hidden value are still non-performant.

Discovering and leveraging data from all corners of the organisation has typically fallen to data engineers or scientists–but this takes analytics and data-driven decisions further and further away from those who need access to trustworthy data on a daily basis. How can we solve this?

Graphs democratise data. Graphs let us bring siloed data together, into a model that is a digital twin of the organisation, flexible enough to adapt to its evolving business needs. Relationships connect this data across disparate systems and serve as value multipliers. Suddenly, everyone can explore and work with business data directly. Analytics tools can connect to a single source of truth to provide the insights needed to validate hypotheses or back decisions, empowering those on the front lines of the business.

With the rise of GenAI, knowledge graphs have begun to play a more prominent role as well: they capture explicit relationships, bringing institutional intelligence closer to the data. When paired with vector searches that reveal implicit relationships (those based on semantics), knowledge graphs ground responses from large language models (LLMs) in validated facts.

Note

Graph technology is experiencing accelerated momentum. “Gartner anticipates that the application of graph technology will grow at 100% annually through 2022. With current adoption estimated at nearly 4%, this will increase to 30% in 2022.”3 That’s because graph technology’s unique ability to connect data from diverse sources is delivering business value to the companies that use it.

In this chapter, we guide you through a practical path to start creating value with graph databases in just one week. Neo4j lends itself well to an incremental style of development and delivery, a process enterprises favor over protracted “big bang” approaches. A swift demonstration of value, followed by incremental iterations, is one of the most successful routes to adopting a new technology.

We assume that you have previous experience with Neo4j and Cypher. If you’ve loaded data into Neo4j in any way, written Cypher queries, or built an application with Neo4j as the backing database, then you’re at the right level to proceed with this book. If you’re a beginner, fear not: Neo4j is a very friendly database and it’s easy to get started. We recommend that you learn the fundamentals of graph databases, modelling, and Cypher and then return. GraphAcademy is an excellent resource–it offers free, hands-on Neo4j training.

Dissonance at ElectricHarmony

ElectricHarmony, an established music-streaming service, is exploring how it can use graph databases. As new music providers capture young people’s attention, ElectricHarmony is struggling to stay competitive. In the last two quarters alone, they have lost a significant number of subscribers, who revealed that their listening experiences felt stale and uninspiring.

After an investigative analysis, ElectricHarmony concluded that it simply is not leveraging all the data it collects in various systems well enough to produce more relevant playlists. New data sources emerge rapidly and the company can’t keep up. Compounding this, decision makers expect the team to implement ad hoc use cases to address immediate needs and react quickly with tactical analysis.

The engineering team at ElectricHarmony decides to experiment with Neo4j. They import a subset of data, learn the basics of Cypher very quickly, and are soon writing queries that traverse effortlessly across artists, playlists, albums, and tracks. They can already feel the benefits of connecting these key business entities in the graph. Instead of spending months trying to bring data together or find clever ways of querying across data in different sources in real time, they can spend their time working on critical business problems.

Stakeholders relate immediately to the team’s line of thinking when they see it sketched on a whiteboard (as shown in Figure 1-1), and the excitement is palpable. New ideas start to pour in as the dots connect in everyone’s minds.

Figure 1-1. The whiteboard sketch from ElectricHarmony’s discussion.

 

The team has a theory that a graph database is the answer to the problem of data silos and slow, expensive queries. Now they need to validate that with a short proof of concept–and you’re joining the team to make it happen! In fact, throughout this book, you’ll relate concepts and lessons learned to the music streaming domain and ElectricHarmony.

First, though, a detour to refresh your knowledge of graph databases and Neo4j.

Why Graph Databases?

In database systems, an impedance mismatch occurs when the data’s representation in the database differs from its representation in the application or the business domain. The term is typically used when comparing object-oriented or graph models with a relational database. Relational databases, in particular, organise data in the form of tables, rows, and columns. This is, however, not how applications view data, nor is it how humans speak and think about business domains. The mismatch occurs when the natural representation of the domain–entities with descriptive properties and relationships–has to be flattened to be stored in a relational database, then joined at query time to compose the real-world entity once again.

The whiteboard in Figure 1-1 demonstrates that graph databases excel at solving the impedance mismatch between real-world data models and the models (or schemas) databases impose upon them. The beneficial side effect is that individuals across all levels and functions in an organisation, especially non-engineers, now speak the same language. Graphs not only bring clarity to how data is used but also highlight gaps in data and misunderstandings about meaning or intent.

Figure 1-2 shows a non-graph representation of ElectricHarmony’s data.

Figure 1-2. ElectricHarmony’s relational database schema.

Figure 1-3 is the graph model, equivalent to the business-domain drawing on the whiteboard in Figure 1-1.

Figure 1-3. ElectricHarmony graph model representing the whiteboard sketch

In our many years of consulting with clients, there has always been an enjoyable ‘aha’ moment when it suddenly all makes sense. Then that humble whiteboard begins to spark new use cases and ideas.

Graphs fit neatly into almost every domain, because the world is naturally connected. But the use cases that derive value from relationships are the ones that truly shine; the advantages of graphs increase with the size and complexity of the data.

Graph Use Cases

To help you see the potential of graphs, let’s look at three more ways they can solve real-world problems:

Ultimate beneficial ownership networks

An ultimate beneficial owner (UBO) is an individual or company that ultimately owns or controls another legal entity. Identifying UBOs is a critical component of “Know Your Customer” (KYC) processes; regulators require most financial institutions to verify the UBOs of the entities they do business with, to prevent crimes such as money laundering or the financing of terrorism.

Imagine that Jane Doe owns 100% of Company A, which itself owns 100% of Company B. Ultimately, Jane owns 100% of Company B via her ownership in Company A. The real world is much more complex, with up to 60 or 70 levels of ownership. Graphs can traverse such deep networks efficiently, empowering anti-money-laundering analysts.
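A chain like Jane’s can be modelled and traversed with a variable-length Cypher pattern. The following is only an illustrative sketch: the Person and Company labels, the OWNS relationship, and its share property are hypothetical names rather than a real KYC schema. (Run the two statements separately.)

```cypher
// Hypothetical ownership chain: Jane -> Company A -> Company B
CREATE (jane:Person {name: 'Jane Doe'})
CREATE (a:Company {name: 'Company A'})
CREATE (b:Company {name: 'Company B'})
CREATE (jane)-[:OWNS {share: 1.0}]->(a)
CREATE (a)-[:OWNS {share: 1.0}]->(b);

// Find every company Jane ultimately owns, up to 70 levels deep,
// multiplying the shares along each path to get her effective stake.
MATCH path = (jane:Person {name: 'Jane Doe'})-[:OWNS*1..70]->(company:Company)
RETURN company.name AS company,
       reduce(stake = 1.0, r IN relationships(path) | stake * r.share)
         AS effectiveStake
```

With partial ownership percentages stored on the OWNS relationships, the same query computes the aggregated stake along each chain–the kind of traversal that requires one expensive self-join per level in a relational database.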

Law enforcement

Network analysis is a valuable tool for link analysis, which helps reveal connections between people, places, and things. Graphs enable a visual picture of these relationships, with nodes representing individuals or entities, and edges representing connections between them. Law enforcement organizations can use this information to uncover patterns in data and discover previously unknown connections between suspects, crime victims, and known criminals or gang members. This can provide valuable insights that help them solve crimes more effectively.

Cybercrime networks

The COVID-19 pandemic has made remote work today’s “new normal,” but this change also represents an immense opportunity for cyber attackers. In 2020 alone, the FBI reported a 300% increase in cybercrimes.4

John Lambert, of Microsoft’s Threat Intelligence Center, writes that the “biggest problem with network defense is that defenders think in lists. Attackers think in graphs. As long as this is true, attackers win.”5 His point is that defenders traditionally rely on lists, such as logs and alerts from software tools, while attackers are more opportunistic, thinking of their target network as a graph. After gaining access to one node, they build an attack graph, a representation of all the possible paths of attack through a network, to gain access to the most valuable systems. Defenders can enhance their security by building a digital twin of their infrastructure, which allows them to identify their most valuable assets, assess the impact on downstream components, and spot suspicious patterns. A digital twin is a digital representation of a physical object, person, or process, contextualized in a digital version of its environment. Digital twins can help an organization simulate real situations and their outcomes, ultimately allowing it to make better decisions.6

These use cases are just the beginning; later in the book, we’ll discuss some of the popular modern solutions that use knowledge graphs in GenAI workflows.

The corollary to all this is that you probably don’t need a graph database if relationships are not important to your business. You don’t need them for tabulating records, aggregating data, and summary reports.

So why would you use Neo4j?

Neo4j

Neo4j is one of the most mature and frequently deployed graph solutions,7 having carved out the graph-database category in the early 2010s. Founded in Sweden in 2000, Neo4j is available today in a variety of offerings: an open source Community edition, a commercial Enterprise edition, and a service on all major cloud platforms. Its flexible graph modelling, however, is not the sole reason to adopt a new database. Other factors that may drive this decision include:

  • Neo4j is highly performant when querying complex data

  • It uses a powerful query language, Cypher, that expresses traversals intuitively

  • It is highly scalable

  • It is operationally sound, with ACID transactions, cluster support and runtime failover

Neo4j is a labelled property graph database. Property graph models are very popular for graph databases. They consist of nodes and relationships, each of which can contain zero or more properties that represent their characteristics. (You can think of properties as key-value pairs.)

Nodes represent entities, such as people, vehicles, locations, songs, suppliers, orders, and so forth. Relationships represent the connections between entities and are the “first-class citizens” of graphs. Every relationship has a type, such as DRIVES, PERFORMS, SUPPLIES, or LIVES_AT.

Labelled property graphs let you optionally assign labels (tags or categories) to nodes. Labels indicate the “type” of a node, but there is no semantic structure or relationship between labels. For example, a node might carry the single label Person, the two labels Person and Customer, or the label Location.
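As a small, hypothetical sketch (the names and properties here are ours, purely for illustration), creating labelled nodes with properties and a typed relationship looks like this:

```cypher
// A node can carry more than one label (here Person and Customer),
// plus properties stored as key-value pairs.
CREATE (alice:Person:Customer {name: 'Alice', since: 2021})
CREATE (home:Location {city: 'Berlin'})
// Relationships always have a type, and can hold properties too.
CREATE (alice)-[:LIVES_AT {verified: true}]->(home)
```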

Native graph databases

Neo4j is also a native graph database. This is a major difference between it and multi-model databases built on other storage structures, such as key-value or relational stores. Such databases may support graph operations, but not as a primary use case.

Native graph databases, however, are architected to be “graph first.” This enables them to perform graph queries faster and more efficiently, because they are designed to store and process data as a graph.

When we say that relationships matter, we mean being able to traverse those relationships or connections in a performant manner–quickly and with efficient resource usage. How these connections are represented–whether they are materialised (their structure is physically represented in the storage) or joined at query time–is crucial.

Neo4j optimises how graph data is stored and represents real relationships with index-free adjacency. Index-free adjacency means that each node holds direct pointers to its adjacent nodes, making traversal essentially a form of pointer chasing, without the overhead of an index lookup. The native storage layer of Neo4j is, in fact, a connected graph, where connected nodes are “adjacent” to each other and directly accessible via a pointer, as shown in Figure 1-4.

Figure 1-4. In Neo4j’s record-storage engine, a node’s relationships are a linked list with pointers to other nodes.

Non-native graph stores, by contrast, suffer the cost of joins, which are typically achieved by performing repeated index lookups to determine the next connection. While this cost is trivial at a couple of hops, querying gets exponentially slower and more expensive as the depth of the connections and the size and complexity of the data increase.

Graph algorithms that rely on pathfinding shine when you use them with native graph databases for network-based cases. Some examples:

  • Are two people in the network related to each other, with a minimum of 3 and a maximum of 6 relationships separating them?

  • What will the impact on the electrical grid be if a particular power plant has an outage? How should the network be redistributed to minimise outage time for a particular area?

  • What is the shortest path from Clapham South station to Kings Park?
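The last question maps directly onto Cypher’s built-in shortest-path matching. As an illustrative sketch–the Station label and CONNECTED_TO relationship type are hypothetical names for a transport network:

```cypher
// Find the shortest chain of connections between two stations,
// ignoring relationship direction.
MATCH (from:Station {name: 'Clapham South'}),
      (to:Station {name: 'Kings Park'})
MATCH path = shortestPath((from)-[:CONNECTED_TO*]-(to))
RETURN [station IN nodes(path) | station.name] AS route,
       length(path) AS hops
```

Because the storage layer is a graph, such a query walks real pointers between adjacent nodes rather than performing one index lookup per hop.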

Now that the benefits of Neo4j are clear, we return to ElectricHarmony, where you and the team will prove that recommendations powered by Neo4j will help solve the company’s current problem.

Cypher

The query language Neo4j uses is called Cypher. Created by Neo4j in 2011, it proved so intuitive and popular that in 2015 it gave rise to an open source project called openCypher, whose specification many graph databases implement. Today, Graph Query Language (GQL), inspired by Cypher, is only the second database language, after Structured Query Language (SQL), to be standardised by the International Organization for Standardization (ISO) and International Electrotechnical Commission (IEC).

Cypher is a declarative query language and focuses on describing what the user wants to create or find in the database, rather than how to do so.

Cypher is also a visual query language, based on ASCII art. It uses parentheses `()` to describe nodes and lines `--` with arrowheads (`->` or `<-`) to describe relationships and their direction.

Figure 1-5. Cypher’s ASCII art. Nodes are represented by parentheses and relationships by lines and arrows.

Figure 1-5 expresses a pattern that reads like this: a playlist has a track. The two entities in rounded parentheses are nodes. The labels of the nodes in this case are Playlist and Track, represented by the identifiers p and t, respectively. The relationship is in square brackets, and the type of the relationship is HAS_TRACK. No identifier is specified for this relationship. Finally, the dashes and closing arrowhead tell you that the playlist is connected to the track via an outgoing relationship.
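Written out as a complete, illustrative query, the pattern from Figure 1-5 looks like this:

```cypher
// Match every playlist together with each track it contains.
MATCH (p:Playlist)-[:HAS_TRACK]->(t:Track)
RETURN p, t
```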

Cypher is very powerful and would probably require an entire book of its own. We’ll explain the queries throughout this book, but for a complete overview of Cypher and its capabilities, visit the Cypher manual section of the Neo4j documentation.

The song recommendation system: A proof of concept

ElectricHarmony’s analytics show that listeners often skip over the songs it recommends at the end of a playlist. Research from the company’s socio-musicology group shows that users prefer to listen to music that others with similar musical tastes like. The surprise effect is also important, as discovering new artists or tracks increases users’ engagement time. Serendipity in recommender systems is a much-written-about subject. Recommendations that produce something unexpected broaden users’ experience and, if crafted well, result in delight and novel discoveries. The group’s hypothesis is that listeners will respond positively to recommendations based on these principles, so they ask the development team to quickly deploy a proof of concept that they can test with real users.

Warning

Recommender systems are an entire discipline in themselves and can be very complex. This book is not meant to teach you how to build recommendation engines; we use simple examples for clarity and to showcase the benefits of using graph databases for analytical workloads. Recommendations produced by traditional systems are often a black box–difficult to reason about and less explainable. This magnifies the gap between highly specialised engineers and stakeholders who are unclear about how and why recommendations are produced. Graphs, on the other hand, help with traceability, as you will see in the following sections.

The development team members agree that they’ll consider the evaluation successful if it can do all of the following:

  • detect similar playlists based on how many tracks they share

  • compute a track recommendation based on similar playlists: find similar playlists that share the same last track (the more similar the playlist, the better the recommendation)

  • discard recommendations for tracks that are too popular

  • compute recommendations in less than 200 ms

Day 1

Today is day 1. In the next five days, you and the team will get value out of the graph database by building the proof of concept. We encourage you to participate in this very realistic exercise by following along and reflecting on your accomplishments at the end of each section (day).

Installing Neo4j

To follow along with the examples and code in this book, you need to install or have access to an instance of Neo4j Enterprise.

The Neo4j Deployment Center lists install and deploy options for Neo4j. We use Docker in this book, but there are various other ways to get started, including:

AuraDB

Graph database as a service, available on all major cloud platforms. A free tier is available.

Neo4j Desktop

Includes a free development licence for Enterprise Edition, for use on your local desktop.

Neo4j Enterprise

Installed on your company server or private cloud, for example. An evaluation licence is available for a 30-day trial.

For other installations, follow the manual to start or access Neo4j.

We’ve created a GitHub repository for you to follow along with all examples. Clone https://github.com/neo4j-the-definitive-guide/book and follow the instructions in the README file in each chapter’s directory to replicate the code in this book.

Ingesting your first datasets

To demonstrate the song-recommendation use case, you can draw from two of ElectricHarmony’s data sources. The first contains track, artist, and album information. The second contains playlists, including their IDs, playlist names, and references to tracks and their positions in the playlist.

The most practical way to import a limited set of this data into Neo4j is by exporting a sample of the source databases into CSV format.

Note

The CSV files are in the chapter01 import directory in the GitHub repository.

Start up Neo4j. If you are using the code accompanying this book, switch to the docker directory and run:

docker-compose up -d

Once Neo4j is running, go to http://localhost:7474/ and log in with the following credentials:

Username: neo4j

Password: password

The Neo4j browser will show (as in Figure 1-6) that you are connected to the default database named neo4j.

Figure 1-6. Successful authentication in the Neo4j browser.

The sample datasets are mounted on the neo4j container and are available as sample_tracks.csv and sample_playlists.csv in the /import directory, which is configured by default in Neo4j.

Previewing the data

The LOAD CSV clause is suitable for ingesting data quickly at the proof-of-concept stage. We’ll cover other methods in later chapters. To preview a couple of lines from the sample files and get accustomed to the shape of the data, run the following in the Neo4j browser:

LOAD CSV WITH HEADERS FROM "file:///sample_tracks.csv" AS row
RETURN row
LIMIT 5
Figure 1-7. LOAD CSV preview of 5 rows of the sample tracks file

If the CSV has a header line, the WITH HEADERS option lets you refer to each field by its column name. Each row in the file is then represented as a map, as shown in Figure 1-7, as opposed to a list of string values for a file without headers. It’s good practice to add a header to your CSV files–it makes the data easier to understand.
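For comparison, here is a sketch of reading the same field both ways, assuming track_id is the first column of the file (run the two statements separately):

```cypher
// With headers: each row is a map, and fields are accessed by name.
LOAD CSV WITH HEADERS FROM "file:///sample_tracks.csv" AS row
RETURN row.track_id LIMIT 5;

// Without headers: each row is a list accessed by position,
// and the header line itself comes back as the first row.
LOAD CSV FROM "file:///sample_tracks.csv" AS row
RETURN row[0] LIMIT 5
```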

You can do the same for the playlist sample file:

LOAD CSV WITH HEADERS FROM "file:///sample_playlists.csv" AS row
RETURN row
LIMIT 5

Figure 1-8 shows the preview you should see.

Figure 1-8. LOAD CSV preview for the playlists dataset

Designing your graph model

Before you can ingest data into the graph, you and the team have to decide on the first version of your graph model.

You can think of a graph model as akin to a schema in other types of database systems. A graph model (sometimes referred to as a data model) represents the structure of the graph and should be a model of your business domain. It contains nodes and their labels, how they are related to each other, the types of those relationships, the properties of both nodes and relationships with their data types, and any constraints.

Graph modelling is driven by use cases. This makes it very different from creating a schema for a relational database, for example, where you would generally follow normalisation rules without having to know anything about the types of queries users will be executing.

The best tool for graph modelling is a whiteboard. You and the team grab some coffee and assemble in front of one, ready to get started.

First, you need to make sure everyone is clear about the use case. Your goals are to relate similar playlists based on how many tracks they share, and to compute a track recommendation based on similar playlists. The more similar the playlists that share the same last track, the better the recommendation.

You get started by drawing the entities that stand out, which will represent nodes in your graph (Figure 1-9). A handy tip is to start by picking out the nouns from the use case–more often than not, they’re entities that have some conceptual identity. In this use case, the nouns are user, playlist, and track, and they serve as appropriate labels as well. Similarly, relationships are usually the verbs or actions in your use case. A playlist has tracks, and playlists are owned or created by users, so you draw those relationships using arrows.

Figure 1-9. The graph model with key entities.

The datasets contain information about the artist and album as well, but you don’t need that to fulfil the use case. So what should you do with that information?

There are two ways to go here. One is to simply not ingest that data now and bring it into the graph later, when you need it. The second is to model it in a straightforward way, keeping in mind that the model can change as the use case gets clearer. You decide as a group to add both labels to the graph model; they’re fairly obvious, and will make validating and reasoning about the recommendation results easier. Now your whiteboard looks like Figure 1-10.

Figure 1-10. The extended graph model.

Walking the graph

Now you take a step back and look at the graph model. You can “walk the graph” on the whiteboard. This part might feel strange to you if you’ve never done it before but trust us: with time, it will come naturally!

Here’s how to walk this graph:

  • Start all the way on the left. Point at the User and state the first relationship: “A User owns a Playlist.”

  • Follow the arrow to point at the Playlist, narrating what you see on the board.

  • Hop over to the tracks, saying, “A playlist has tracks”.

  • Now follow the HAS_TRACK and ARTIST relationships: “the artist of a track is…” and land on the artist node.

  • Do the same with the Album. “An album has tracks”, and arrive back at the track node.

  • “A track is part of other playlists” - back to the playlist.

  • And “a playlist is owned by another user”. You’ve reached the start again.

You’ve discovered the next track to play–and you’ve articulated the business domain and the graph model with the same language.

You’re ready to start ingesting data tomorrow!

Key takeaways

The barrier to getting started with Neo4j is extremely low, and the Neo4j ecosystem is geared towards developer friendliness and ease of use. Designing the first graph model is very intuitive–what the team draws on the whiteboard will turn out to be your initial graph model.

Day 2

The team is eager to see data in the graph!

Creating nodes and relationships

You’ve wisely decided to do a dry run of the ingestion with a single row of data, so that you can inspect the graph and see if the model makes sense before going any further.

This query will create the Track, Album, and Artist nodes as well as the relationships between them. While the file contains many properties that will be useful later, the only important ones now are id, name, and uri. Here it is:

LOAD CSV WITH HEADERS FROM "file:///sample_tracks.csv" AS row
WITH row LIMIT 1
CREATE (track:Track {id: row.track_id})
SET track.uri = row.track_uri,
    track.name = row.track_name
CREATE (album:Album {id: row.album_id})
SET album.uri = row.album_uri,
    album.name = row.album_name
CREATE (artist:Artist {id: row.artist_id})
SET artist.uri = row.artist_uri,
    artist.name = row.artist_name
CREATE (album)-[:HAS_TRACK]->(track)
CREATE (track)-[:ARTIST]->(artist)

Now you retrieve the data you just created:

MATCH path=(artist:Artist)<-[:ARTIST]-(t:Track)<-[:HAS_TRACK]-(album:Album)
RETURN path
Figure 1-11. Result of a Cypher pathfinding query matching artists, tracks, and albums.

Your results look like the ones in Figure 1-11.

Do the same for the playlists sample file. As with the previous file, you’re only interested in properties that represent the IDs and the tracks’ positions in a playlist:

LOAD CSV WITH HEADERS FROM "file:///sample_playlists.csv" AS row
WITH row LIMIT 1
CREATE (playlist:Playlist {id: row.id}) 
SET playlist.name = row.name
CREATE (user:User {id: row.user_id})
CREATE (track:Track {id: row.track_id})
CREATE (user)-[:OWNS]->(playlist)
CREATE (playlist)-[:HAS_TRACK {position: row.playlist_track_index}]->(track)

Match the playlist just created and view the results in the Neo4j browser (Figure 1-12) to see if this subgraph makes sense:

MATCH path=(user:User)-[:OWNS]->(p:Playlist)-[:HAS_TRACK]->(t:Track)
RETURN path
Figure 1-12. Cypher query result showing a user named “bleapkin” who owns a playlist with one track.

Querying across datasets

Now the team checks what their “single source of truth” looks like after having ingested a row from each dataset. You start to write a Cypher query to find patterns that represent playlists that have tracks and their artists or albums. This query traverses from the Playlist node to the Artist and Album nodes:

MATCH path=(p:Playlist)-[:HAS_TRACK]->(track:Track)<--(albumOrArtist)
RETURN path

The type (label) of the albumOrArtist node in the pattern and the type of the relationship between the track and the albumOrArtist node are not specified. This allows you to match any node or relationship that can exist at that place in the pattern in a concise and friendly form.

Alas, no results are returned, which tells you that something went wrong during the data ingestion. Run the following query to get an overview of the whole graph:

MATCH (n)
OPTIONAL MATCH (n)-[r]->(o)
RETURN *

Since there is a very small amount of data in the graph, this query is suitable: it matches everything and you can visualise the results. In real graphs with millions or billions of nodes and relationships, you wouldn’t be able to return the whole graph to your screen.

Figure 1-13. Cypher query result showing all the graph data in the database

Indeed, it doesn’t look quite right. The graph consists of two disconnected subgraphs, one per source file–much as the data would be represented in a relational database, as rows in separate tables. The first pattern represents a row from the playlists file: (:User)-[:OWNS]->(:Playlist)-[:HAS_TRACK]->(:Track).

The second pattern represents a row from the tracks file: (:Album)-[:HAS_TRACK]->(:Track)-[:ARTIST]->(:Artist).

The track in both patterns is the same track and should be a single node. To correct this, you use the MERGE clause.

Merging nodes

In the queries above, you used CREATE, which simply created a node or relationship as instructed. But every track has a unique track ID. If this track exists in the graph already, you do not want to create it again.

MERGE is a combination of MATCH and CREATE. It first tries to find the pattern in the graph; if it finds it, nothing is created. Only if the pattern cannot be matched will it be created. As such, MERGE needs something it can match on reliably first. In this case, the ID of the track, album, or artist is unique and can reliably determine whether the node already exists in the graph.

Tip

It is good practice to give nodes unique identifiers. They can be used in constraints to ensure that duplicate data is not created, and they also serve as foreign key references to other systems.
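For example, a uniqueness constraint on Track IDs might look like this (the constraint name track_id_unique is illustrative; constraints are covered in detail later in the book):

```cypher
// Reject any attempt to create a second Track node with the same id
CREATE CONSTRAINT track_id_unique IF NOT EXISTS
FOR (t:Track) REQUIRE t.id IS UNIQUE
```

A uniqueness constraint is backed by an index, so it also speeds up lookups by ID.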

Before re-ingesting the data with the updated query using MERGE, remove what the previous LOAD CSV created.

Use the following query to delete all nodes and relationships from the graph:

MATCH (n)
DETACH DELETE n

Now, recreate the data using the MERGE clause instead of CREATE. Start with the sample tracks file first.

LOAD CSV WITH HEADERS FROM "file:///sample_tracks.csv" AS row
WITH row LIMIT 1
MERGE (track:Track {id: row.track_id})
SET track.uri = row.track_uri,
track.name = row.track_name
MERGE (album:Album {id: row.album_id})
SET album.uri = row.album_uri,
album.name = row.album_name
MERGE (artist:Artist {id: row.artist_id})
SET artist.uri = row.artist_uri,
artist.name = row.artist_name
MERGE (album)-[:HAS_TRACK]->(track)
MERGE (track)-[:ARTIST]->(artist)

Do the same with the sample playlists.

LOAD CSV WITH HEADERS FROM "file:///sample_playlists.csv" AS row
WITH row LIMIT 1
MERGE (playlist:Playlist {id: row.id})
SET playlist.name = row.name
MERGE (user:User {id: row.user_id})
MERGE (track:Track {id: row.track_id})
MERGE (user)-[:OWNS]->(playlist)
MERGE (playlist)-[:HAS_TRACK {position: row.track_index}]->(track)

This time, since the Track was created in the graph by the previous ingest, the MERGE (track:Track {id: row.track_id}) found it by its ID during the MATCH phase and hence did not create anything. This existing Track node is then linked to the Playlist by the MERGE of the HAS_TRACK relationship.

Now verify that you can link between a playlist and album or artist nodes via a track node.

MATCH path=(user)-[:OWNS]->(p:Playlist)-[:HAS_TRACK]->(track:Track)--(albumOrArtist)
RETURN path
Figure 1-14. Cypher query result showing all the graph data after using the MERGE clause

This looks much better! It clearly represents a connected set of data–a track that is on an album and is part of a playlist, along with the artist.

You now ingest the remainder of the sample data by removing the one-row limit. Start with the sample tracks. The only thing that has changed in this query is that LIMIT 1 has been removed.

LOAD CSV WITH HEADERS FROM "file:///sample_tracks.csv" AS row
MERGE (track:Track {id: row.track_id})
SET track.uri = row.track_uri,
track.name = row.track_name
MERGE (album:Album {id: row.album_id})
SET album.uri = row.album_uri,
album.name = row.album_name
MERGE (artist:Artist {id: row.artist_id})
SET artist.uri = row.artist_uri,
artist.name = row.artist_name
MERGE (album)-[:HAS_TRACK]->(track)
MERGE (track)-[:ARTIST]->(artist)

Repeat for the playlists file, dropping the limit.

LOAD CSV WITH HEADERS FROM "file:///sample_playlists.csv" AS row
MERGE (playlist:Playlist {id: row.id})
SET playlist.name = row.name
MERGE (user:User {id: row.user_id})
MERGE (track:Track {id: row.track_id})
MERGE (user)-[:OWNS]->(playlist)
MERGE (playlist)-[:HAS_TRACK {position: row.track_index}]->(track)

You now have a high-level overview of the connected graph, as shown in Figure 1-15.

Figure 1-15. Graph overview of the database. You can see this picture in color and full resolution at https://raw.githubusercontent.com/neo4j-the-definitive-guide/book/main/figures/svg/full-small-graph-view.svg

Beautiful, isn’t it?

Exploration and refactoring

Just as a business undergoes constant change and optimisations, so does a graph. As you work with graph databases, it’s common to find yourself understanding the shape of the graph as it evolves and refactoring it to accommodate new use cases or improve performance. Exploring the graph frequently also reveals connections that were not apparent earlier, and can also uncover new use cases.

Before jumping into the new recommendation query, you and the team now run some basic queries for insights into the limited data you’ve ingested into the graph:

How many playlists does the graph contain?

The first query is simple and tells you how many playlists are in the graph.

MATCH (n:Playlist)
RETURN count(n) AS playlistCount
playlistCount
40

Forty isn’t much, but perhaps enough to prove the concept. However, you need more information to determine how connected the data is.

Are any tracks present in more than one playlist?

The proof-of-concept recommendation query is based on playlists that share tracks. The following query matches all tracks that are in more than one playlist:

MATCH (t:Track)<-[:HAS_TRACK]-(p:Playlist)
WITH t AS track, count(p) AS playlistCount
WHERE playlistCount > 1
RETURN track.name as trackName, playlistCount
trackName playlistCount
Where Is My Mind? 2

There’s just one track that appears on two playlists. It’s becoming clear that the amount of data you’ve ingested so far isn’t going to be enough.

Who are the five artists most featured in playlists?

Find the five artists most featured in playlists and count how many playlists they’re part of.

MATCH (a:Artist)<-[:ARTIST]-(track)<-[:HAS_TRACK]-(p:Playlist)
RETURN a.name AS artistName, count(*) AS playlistCount
ORDER BY playlistCount DESC
LIMIT 5
artistName playlistCount
Pixies 3
David Bowie 2
James Newton Howard 2
Jennifer Lawrence 2
Classified 2

Which artist has the most tracks in the last position of a playlist?

The recommendation query for the proof of concept is based on the last track of the playlist. To find which artists are commonly in this place, you need to know how many tracks are in the playlist, then use that number to match a track at the last position. A COUNT subquery in the WHERE clause does the trick.

MATCH (a:Artist)<-[:ARTIST]-(t:Track)<-[r:HAS_TRACK]-(p:Playlist)
WHERE r.position = COUNT { (p)-[:HAS_TRACK]->() }
RETURN a.name AS artist, count(*) AS numberOfTracks
ORDER BY numberOfTracks DESC
LIMIT 1
(No changes, no records)

This is puzzling. It’s quite impossible to have playlists without a track in the last position. Perhaps something went wrong with the data ingestion? You check five tracks to see whether their position is set in their playlists, with the following query:

MATCH (a:Artist)<-[:ARTIST]-(t:Track)<-[r:HAS_TRACK]-(p:Playlist)
RETURN a.name AS artist, t.name as track, r.position as position
LIMIT 5
Figure 1-16. Cypher query result for track position in playlists. When your query doesn’t return graph data (nodes, relationships, paths) but scalar values, the Neo4j browser switches to table-result view.

That’s it! Look at the position column in the table results, then check the WHERE clause of the query:

WHERE r.position = 1

The 1 in the WHERE clause is expressed as a number, not a string, but the table results clearly show that the position is a string. You realise that you ingested the data with LOAD CSV, and CSV files carry no type information: every single cell is read as a string.
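You can confirm the type mismatch with a one-line query: in Cypher, comparing the string "1" to the integer 1 with = does not yield true.

```cypher
// A string and an integer are never equal, even if they look alike
RETURN "1" = 1 AS sameValue
```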

You could modify your query to use a string condition for the value 1, but it wouldn’t be very elegant compared to storing the data with the correct type. Fortunately, one of Neo4j’s great benefits is that it’s easy to refactor data you’ve already stored in the graph–whether that means changing the data values, their types, or the graph model itself.

Note

Neo4j is a schema-free database. The word schemaless is also common, but we prefer schema-free. No dataset is schemaless. A schema exists, logically, even if it’s undocumented. Neo4j gives you total control and dictates nothing, which enables you to get any sort of data into your graph very quickly, without regard for types or structures. You can then modify or refactor them, gently moulding them into shape as you proceed.

Schema-free databases also give you the flexibility to impose a schema when required, in the form of constraints on your graph. This is usually done to ensure data quality and to adhere to a documented or communicated graph model.

You change the type of the values stored on the position property of the HAS_TRACK relationships to be an integer:

MATCH (p:Playlist)-[r:HAS_TRACK]->()
SET r.position = toInteger(r.position)

Results are now returned for your initial query:

MATCH (a:Artist)<-[:ARTIST]-(t:Track)<-[r:HAS_TRACK]-(p:Playlist)
WHERE r.position = COUNT { (p)-[:HAS_TRACK]->() }
RETURN a.name AS artist, count(*) AS numberOfTracks
ORDER BY numberOfTracks DESC
LIMIT 1
artist numberOfTracks
Pixies 2

No artist other than the Pixies has tracks in the last position of a playlist. With such sparse data, it’s a bit of a problem to fulfil the use case. This is a good problem to have. Very often, we see teams struggle to realise their use case or answer queries satisfactorily, simply because the data they thought they had either did not exist or belonged to other systems they could not access. However, ElectricHarmony does have more data and you’ve detected this early enough, so you’re going to ingest some more tomorrow and see if that solves the problem.

Key takeaways

There are many ways to get data into the graph. You might use a connector or driver to connect to another data source and transfer data, but one of the easiest ways to get started is a simple CSV dump of limited data, which you can import into Neo4j with LOAD CSV, no code required. The Neo4j Browser is an excellent companion to help you get your hands on the graph, ingest data, query it, and visualise the results.

The schema-free nature of Neo4j removes another barrier by not imposing any rigid structure in the early stages, making it easy to refactor, extend the model, and change data types or properties.

Day 3

Thanks to your discovery the day before, you’re going to ingest a larger dataset using the previous LOAD CSV queries. Simply change the filenames to file:///medium/sample_tracks_medium.csv and file:///medium/sample_playlists_medium.csv. And you wait.

Fifteen minutes later, you’re still waiting? Does ingesting a fairly small amount of data (approximately 10,000 rows) really take so long? Well, yes. Most teams stumble upon this problem early in the process. The solution is easy: it’s called indexing.

Wait a minute. We said earlier that Neo4j offers index-free adjacency. So why is the solution indexes? The difference lies in how and why you access the data.

In the database world, indexes usually make identity lookups faster. They offer an efficient way to find a particular entity by the value of some identifier, typically a key.

When you write a Cypher query, the starting point for the traversal–a pattern specified in the MATCH clause–consists of nodes or relationships which should be accessed as quickly as possible by the query engine to stay performant. The graph storage format allows direct access to a node or relationship by its internal identity, called an element ID. However, the element ID is rarely how you’ll match nodes or relationships. You’ll typically use some business identifier or key, such as a person ID, a vehicle registration number, or a bank account number.

It is expensive to traverse all the nodes or relationships to find those with a given property value. Traditional indexes help you get to the starting point of the traversal very quickly. From that point on, there are no more joins to find related data. Instead, you can traverse the graph at lightning speed without indexes, by following pointers. This is the index-free adjacency that Neo4j provides.

Indexes for boosting data ingestion speed

The bottleneck in your query is the MERGE. You learned earlier that MERGE is a MATCH or CREATE. Without indexes, if you ask Neo4j to MATCH the Track node with a particular ID, say id-1, it will iterate over all nodes with the Track label and, for each of those nodes, filter the ones that have the value id-1 for the id property. If the graph has a hundred tracks, this would result in 200 operations (100 for extracting each node with the Track label and 100 for filtering on the property). If there are 100,000 tracks in the graph, this would result in 200,000 operations. And that’s just for one type of node: the Track.

The query you are running, however, creates more than one type of node. Apart from the Track, it also creates Album and Artist nodes, and the number of operations increases drastically as the dataset grows.

For the sole purpose of ensuring fast ingestion of large datasets, you can add an index on all labels for their respective id property. (Note, however, that Neo4j has several types of indexes and constraints; chapter X will cover them in detail.)

CREATE INDEX playlist_id FOR (n:Playlist) ON (n.id);
CREATE INDEX user_id FOR (n:User) ON (n.id);
CREATE INDEX track_id FOR (n:Track) ON (n.id);
CREATE INDEX album_id FOR (n:Album) ON (n.id);
CREATE INDEX artist_id FOR (n:Artist) ON (n.id);

With these indexes in place, every MATCH becomes a single index lookup, reducing the complexity from O(n) to O(log n).
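Before rerunning the ingestion, you can check that the indexes have been created and are ready to use (state ONLINE):

```cypher
// List the indexes, the labels and properties they cover, and their state
SHOW INDEXES YIELD name, state, labelsOrTypes, properties
```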

Try running the LOAD CSVs again. They should complete now in seconds!

Minimum data quality

Just before lunch, your team runs into exactly the same situation you’d puzzled over just the day before: the query “Which artist has the most tracks in the last position of a playlist?” For a moment or two you all wonder why you don’t see the results you expect, and then you facepalm as you realise that the track position in the playlist is still a string.

This is the double-edged sword of Neo4j’s schema-free design. While nothing prevents quick wins, its lack of enforcement allows data-quality issues to slip in. You will want to prevent situations where you suddenly don’t have any results because of a type mismatch between how the data is stored and how you refer to it in your queries.

Refactoring is always possible, just like you did yesterday. But refactoring operations take longer to execute as the dataset grows, because they typically work over all data of a specific type or types. To maximise your effort on solving business problems and minimise the time spent on these hair-pulling moments caused by the lack of minimal data quality, you want to ensure this never happens again. Here’s how you can do that:

First, add a property-type constraint on the position property on the HAS_TRACK relationship, between the Playlist and Track nodes, so that it accepts only values of type integer:

CREATE CONSTRAINT has_track_position_integer 
FOR ()-[r:HAS_TRACK]-() 
REQUIRE r.position IS TYPED INTEGER

Tip

A property-type constraint will ensure that a property has the required type for all nodes with a specific label or all relationships of a specific type. Any query that violates this constraint will fail.

Since you ingested some of the sample dataset yesterday, you find that you cannot create the constraint. A ConstraintCreationFailed error is produced because some of the existing data in the graph is in violation. You’ll need to drop the data and reingest it.

To drop the data, use the following query:

MATCH (n)
DETACH DELETE n

Then move the type transformation (from string to integer) to the data-ingestion level:

MERGE (playlist)-[:HAS_TRACK {position: toInteger(row.track_index)}]->(track)

Now you’re ready to ingest the larger dataset. Note that the filename has changed. Start with the tracks file, sample_tracks_medium.csv:

LOAD CSV WITH HEADERS FROM "file:///medium/sample_tracks_medium.csv" AS row
MERGE (track:Track {id: row.track_id})
SET track.uri = row.track_uri,
track.name = row.track_name
MERGE (album:Album {id: row.album_id})
SET album.uri = row.album_uri,
album.name = row.album_name
MERGE (artist:Artist {id: row.artist_id})
SET artist.uri = row.artist_uri,
artist.name = row.artist_name
MERGE (album)-[:HAS_TRACK]->(track)
MERGE (track)-[:ARTIST]->(artist)

Repeat for the playlists, sample_playlists_medium.csv:

LOAD CSV WITH HEADERS FROM "file:///medium/sample_playlists_medium.csv" AS row
MERGE (playlist:Playlist {id: row.id})
SET playlist.name = row.name
MERGE (user:User {id: row.user_id})
MERGE (track:Track {id: row.track_id})
MERGE (user)-[:OWNS]->(playlist)
MERGE (playlist)-[:HAS_TRACK {position: toInteger(row.track_index)}]->(track)

Now that you have a higher volume of data and you’ve ensured that the right constraints are in place, guaranteeing data-type quality, you’re ready to start constructing the recommendation query.

Finding similarities

The very nature of a graph is its expressive model: it stores data in a form that represents the real world.

When you look at the model in Figure 1-17, it’s easy to see that two playlists have some similarity if they share a connection to the same track. Their similarity will be greater if the track has the same position in both playlists.

Figure 1-17. Playlists are similar when they share tracks

You explore the similarities between these playlists, and decide to look for playlists that have a track in common at the same position.

MATCH path=(n:Playlist)-[r1:HAS_TRACK]->(track)<-[r2:HAS_TRACK]-(other:Playlist)
WHERE r1.position = r2.position
RETURN path
LIMIT 10

This produces the graph in Figure 1-18.

Figure 1-18. Playlist similarity: you can easily identify which playlists share two tracks

To find similar playlists with more than five tracks in common, you count the number of tracks at the same position and those at different positions:

MATCH path=(p:Playlist)-[r1:HAS_TRACK]->(track)<-[r2:HAS_TRACK]-(other:Playlist)
WITH p AS playlistLeft, other AS playlistRight,
collect({track: track, positionLeft: r1.position, positionRight: r2.position}) AS commonTracks
WHERE size(commonTracks) > 5
RETURN playlistLeft.name, playlistRight.name,
size([track in commonTracks WHERE track.positionLeft = track.positionRight]) AS tracksWithSamePosition,
size([track in commonTracks WHERE NOT track.positionLeft = track.positionRight]) AS tracksAtDifferentPosition
ORDER BY tracksWithSamePosition DESC
LIMIT 100

This returns the query results in Figure 1-19. As you can see, playlists can be similar to themselves. You didn’t add a condition to the query to prevent this.

Figure 1-19. Playlists with more than five tracks in common
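One way to exclude those rows (a sketch of one possible fix) is to compare the playlists’ element IDs, which removes self-matches and also keeps each pair of playlists once instead of twice:

```cypher
MATCH (p:Playlist)-[r1:HAS_TRACK]->(track)<-[r2:HAS_TRACK]-(other:Playlist)
// elementId ordering excludes p = other and deduplicates symmetric pairs
WHERE elementId(p) < elementId(other)
WITH p AS playlistLeft, other AS playlistRight,
collect({track: track, positionLeft: r1.position, positionRight: r2.position}) AS commonTracks
WHERE size(commonTracks) > 5
RETURN playlistLeft.name, playlistRight.name, size(commonTracks) AS tracksInCommon
ORDER BY tracksInCommon DESC
LIMIT 100
```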

Key takeaways

While your schema-free database allows you to get data into the graph quickly, the ability to add a schema incrementally, such as with constraints, lets you fix data-quality issues early.

Day 4

Now that ElectricHarmony has a query to find similar playlists, you need to think about how and when to compute their similarity.

Materialising similarities

You don’t need to compute similar playlists for every single user request at this stage. Instead, you want to materialise the fact that two playlists are similar by explicitly creating a SIMILAR relationship between them, so that the graph is easy and efficient to traverse.

Yesterday, you wrote a query to find similar playlists. Now you use the same query, but add a MERGE, as shown below, to create a SIMILAR relationship between playlists that have more than five tracks in common. The query also records the number of tracks in the same or different positions as properties on the SIMILAR relationship.

MATCH path=(p:Playlist)-[r1:HAS_TRACK]->(track)<-[r2:HAS_TRACK]-(other:Playlist)
WITH p AS playlistLeft, other AS playlistRight,
collect({track: track, positionLeft: r1.position, positionRight: r2.position}) AS commonTracks
WHERE size(commonTracks) > 5
WITH playlistLeft, playlistRight,
size([track in commonTracks WHERE track.positionLeft = track.positionRight]) AS tracksWithSamePosition,
size([track in commonTracks WHERE NOT track.positionLeft = track.positionRight]) AS tracksAtDifferentPosition
MERGE (playlistLeft)-[r:SIMILAR]->(playlistRight)
SET r.samePosition = tracksWithSamePosition, r.notSamePosition = tracksAtDifferentPosition

You can now inspect the similar playlists quite easily.

MATCH path=(playlist1)-[:SIMILAR]-(playlist2)
RETURN path
LIMIT 25

This produces the graph in Figure 1-20.

Figure 1-20. Similar playlists

The bidirectionality of the SIMILAR relationships between two playlists is a common graph-modelling hiccup. We cover best practices for handling this in Chapter 2.

You know that the SIMILAR relationship you just created can exist only between Playlist nodes, so you can drop the labels from either end of the pattern.

Implicit relationships

Look at the updated graph model with the new SIMILAR relationship between playlists (Figure 1-21). It doesn’t take your team long to realise that you can consider two users to be potentially similar if they own similar playlists!

Figure 1-21. The dotted line between two users represents an implicit relationship between them based on other explicit relationships in the graph

But since the potential number of similar playlists between two users is low, there is probably no need to materialise this fact as an explicit relationship: traversing between users via their shared playlists will be very fast. These implicit relationships are revealed by other data connections and provide another dimension of insights.
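A sketch of such an on-demand traversal, finding pairs of users who own similar playlists:

```cypher
MATCH (u1:User)-[:OWNS]->(:Playlist)-[:SIMILAR]-(:Playlist)<-[:OWNS]-(u2:User)
// elementId ordering keeps each pair of users only once
WHERE elementId(u1) < elementId(u2)
RETURN u1.id AS user, u2.id AS similarUser, count(*) AS sharedSimilarPlaylists
ORDER BY sharedSimilarPlaylists DESC
LIMIT 10
```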

You’re now ready to write the final recommendation query.

Recommending a track when the playlist ends

To recommend tracks, you will need to first express how you would like to query the data from the graph, then translate that to Cypher.

While the last song of a user’s playlist is playing, you want your query to perform the following steps in order:

  1. Calculate the 10 most popular tracks by the number of playlists in which they appear

  2. Find the last track

  3. Get the previous tracks from the playlist (since you do not want to recommend tracks that are already in this playlist)

  4. Find other playlists that have the same last track and are similar to the given playlist

  5. Find other tracks on those playlists that are not in the given playlist

  6. Exclude the 10 most popular tracks

  7. Score the remaining tracks by the number of times they appear, so that tracks that appear more frequently in similar playlists rank higher

Here’s the query you write:

//Find popular tracks
MATCH (popularTrack:Track)-[:HAS_TRACK]-(:Playlist)
WITH popularTrack, count(*) as playlistCount
ORDER BY playlistCount DESC
LIMIT 10
WITH collect(elementId(popularTrack)) as popularTracks
// For a given Playlist
MATCH (p:Playlist) WHERE p.name = "all that jazz"
// Find the last track
MATCH (p)-[r:HAS_TRACK]->(t)
WHERE r.position = COUNT { (p)-[:HAS_TRACK]->()}
WITH p AS playlist, t AS lastTrack, popularTracks
// Get the previous tracks
WITH playlist, lastTrack, popularTracks, COLLECT {
MATCH (playlist)-[:HAS_TRACK]->(previous)
WHERE previous <> lastTrack
RETURN elementId(previous)
} AS previousTracks
// Find other playlists that have the same last track
MATCH (lastTrack)<-[:HAS_TRACK]-(otherPlaylist)-[:SIMILAR]-(playlist)
WHERE otherPlaylist <> playlist
// Find other tracks which are not in the given playlist
MATCH (otherPlaylist)-[:HAS_TRACK]->(recommendation)
WHERE NOT elementId(recommendation) IN previousTracks
AND NOT elementId(recommendation) IN popularTracks
// Score them by how frequently they appear
RETURN  recommendation.id as recommendedTrackId, recommendation.name AS recommendedTrack, otherPlaylist.name AS fromPlaylist, count(*) AS score
ORDER BY score DESC
LIMIT 5

This produces the results in Table 1-x:

recommendedTrackId recommendedTrack fromPlaylist score
“7N2UmTJG5Uv6zQvjf4eIjd” “Now See How You Are - Remastered” “smooth jazz” 1
“1Z9XpsIg7YlzDTGbLQyXMK” “Stompin’ at the Savoy” “smooth jazz” 1
“4T0ohWvlVJenmxPUVeuaue” “The Folks Who Live On The Hill” “smooth jazz” 1
“0pmO0O5FCPYri3rVycwl00” “I May Be Wrong” “smooth jazz” 1
“5NliRTDw7ktmLDtUH2Dvqr” “Intermezzo” “smooth jazz” 1

The recommendation is computed in a couple of milliseconds, proving that traversing relationships in the graph really is efficient.

Day 5

Your work is producing some recommendations on a limited dataset– brilliant! You can expect even richer results once all data is ingested. But how do you know if the recommendations are any good? Is this feature ready to roll out to a small set of users?

The nice thing about a graph is that it makes recommendation results explainable–no black box involved. It’s easy to trace why a particular song is recommended. Not only does this help you test the query, it also means you can elicit and account for negative feedback from users, to build a pattern of which songs the system should not recommend to that user.

Say a user routinely skips over recommended tracks. If you query these tracks to see what they have in common, you might find that they’re performed by the same artist or belong to the same genre. That information will feed back into the recommendation query, to avoid repeating the same mistake and make more relevant recommendations.
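As a sketch of that feedback loop, assume a hypothetical SKIPPED relationship is recorded whenever a user skips a track (this relationship does not exist in the current model). You could then surface the artists a user skips most:

```cypher
// Hypothetical model: (:User)-[:SKIPPED]->(:Track); $userId is a query parameter
MATCH (u:User {id: $userId})-[:SKIPPED]->(:Track)-[:ARTIST]->(a:Artist)
RETURN a.name AS artist, count(*) AS skips
ORDER BY skips DESC
LIMIT 5
```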

As a test, you decide to take the results of yesterday’s recommendation query and see if they make sense.

The playlist titled “all that jazz” is over. The next song to be played could be any of the five listed in Table 1-x. Take the first one: “Now See How You Are - Remastered,” with track ID 7N2UmTJG5Uv6zQvjf4eIjd.

The first thing to observe is that this track is on a playlist called “smooth jazz.” That’s a good start–it sounds logical.

What was the last track on “all that jazz”? You can reuse a part of the recommendation query to find it:

// For a given Playlist
MATCH (p:Playlist) WHERE p.name = "all that jazz"
// Find the last track
MATCH (p)-[r:HAS_TRACK]->(t)
WHERE r.position = COUNT { (p)-[:HAS_TRACK]->()}
RETURN t

You double-click the Track node to expand its connections (Figure 1-22) and see that the last track, “Journey into Melody - 2007 Digital Remaster/Rudy Van Gelder Edition” by Stanley Turrentine, is actually on both playlists.

Figure 1-22. The query returns the Track node; double-clicking it expands its relationships

If smooth jazz is your genre of music, you probably don’t need much more explanation to know that the recommended track sounds like a good bet. If it isn’t, though, then seeing how both tracks are connected is a quick way to feel your way around the graph and explore why this particular track is being recommended.

Using the track IDs of each track, you set out to find all connections between them, to a depth of 5. Why 5? There isn’t any magic formula for this. We could have tried the shortestPath first, like this:

MATCH (t1:Track {id: "7ysmJhXFQtiBQlk6EZ6sks"})
MATCH (t2:Track {id:"7N2UmTJG5Uv6zQvjf4eIjd"})
RETURN shortestPath((t1)-[*..5]-(t2))

shortestPath is a function that returns the path between two nodes with the fewest relationships connecting them (Figure 1-23). You can define and constrain the path pattern by, for example, specifying relationship types. In this case, you’re interested in any kind of relationship, but you don’t want to find paths longer than 5 hops, so you set an upper bound of 5.

Figure 1-23. The shortest path between the two tracks

You need something more than the playlist to justify why “Now See How You Are - Remastered” was recommended, so you play with depths to find just enough information. You don’t want deep traversals to bring in so much of the graph that it adds more noise than value. The depth of 5 is just right in this case.

MATCH (t1:Track {id: "7ysmJhXFQtiBQlk6EZ6sks"})
MATCH (t2:Track {id:"7N2UmTJG5Uv6zQvjf4eIjd"})
MATCH p = ((t1)-[*..5]-(t2))
RETURN p

On the far left and right of the graph in Figure 1-24, you see the two tracks, along with their artists and the common paths between them, via the playlists they’re on.

The query returns “Midnight Blue,” an album recorded in 1963 by Kenny Burrell that features Stanley Turrentine on tenor saxophone. Sounds like a good recommendation. You and the team have achieved what you set out to do this week, and with time to spare. Congratulations!

There’s one thing left to do this week: it’s time to sum up the results of your work for ElectricHarmony’s stakeholders.

The proof is in the pudding, so your team starts off by asking one of the stakeholders to pull up one of her playlists. You run the recommendation query for that playlist and tell everyone which song will be played next. The recommendation gets it right. Instead of a Top 40 track with no real relevance to her tastes or a totally random track, it’s a song she likes.

The executives are impressed! They all agree that this simple recommendation is already improving their experience.

Next, you present the current version of the graph model (Figure 1-25).

Figure 1-25. Latest version of the graph model.

This graph has been populated with limited data from two data sources that exist in different systems at ElectricHarmony. You explain that you created the SIMILAR relationship by matching playlists that share a number of tracks at the same position. This makes explicit knowledge that was already in the data, but was obscured.

The clear relationships in the graph, you add, enabled the team to build a simple recommendation query that could traverse from the last track played in a playlist to a similar playlist, exclude any tracks in common, and pick the next track to be played to the user, with a relatively high chance of success and an execution time of milliseconds.

Another stakeholder points out that he sees a way for the company to leverage the implicit relationship your team has discovered: that users can be considered similar if they own similar playlists. Another chimes in to suggest developing other new features, such as “users to follow” or “suggested playlists.”

The executives ask your team to do some A/B testing with a subset of users in the following weeks to gather feedback about whether this feature improves their listening experience. They also ask you to test ways to extend the system to use the artists and albums present in the graph, working from users’ favourite artists or revealing rare albums.

Summary

It’s time to celebrate: you and the team have got yourselves a graph database that opens up many exciting opportunities!

While you made mistakes along the way, you learned from them, and now you all have a better understanding of the value that graphs provide. Using the graph database allowed you to focus on validating business ideas in a short time.

You’ll need more time to further refine your evaluation and build other use cases. During this process, as you scale up your usage of Neo4j, you will rapidly run into scaling issues.

In the rest of this book, we’ll take you beyond the proof of concept and build on use cases in the music domain. While they are fictitious, they closely mirror real-world paths to production and address typical issues you’re likely to encounter on your own journey. The next chapters are designed to help you circumvent those issues by providing you with best practices, built upon our years of experience developing and deploying Neo4j for all sizes and kinds of applications. Chapter 4 explores query profiling and tuning, while controlling disk space usage with repeated ingestions during evaluation phases is covered in Chapter 8. In the next chapter, we’ll look at graph modelling decisions and the impact of using one model over another.

1 Wikipedia Contributors. “Clive Humby.” Wikipedia. Wikimedia Foundation, June 17, 2023. https://en.wikipedia.org/wiki/Clive_Humby#cite_note-10.

2 Arthur, Charles. “Tech Giants May Be Huge, but Nothing Matches Big Data.” The Guardian. The Guardian, August 23, 2013. https://www.theguardian.com/technology/2013/aug/23/tech-giants-data.

3 Graph Database & Analytics. “Graph Database Executive Insights,” February 4, 2022. https://neo4j.com/graph-database-executive-insights/.

4 “Cyber Crimes Skyrocket 300% since COVID-19.” Intelice. https://www.intelice.com/cybercrime-skyrockets/.

5 Lambert, John. “Defender’s Mindset.” Medium. https://medium.com/@johnlatwc/defenders-mindset-319854d10aaa.

6 “What Is Digital-Twin Technology?” 2023. McKinsey & Company. McKinsey & Company. July 12, 2023. https://www.mckinsey.com/featured-insights/mckinsey-explainers/what-is-digital-twin-technology.

7 “Key Benefits of Neo4j Graph Database Lightning Fast Performance Neo4j Graph Database the Most Trusted Database for Intelligent Applications.” Neo4j, Inc. 2022. https://go.neo4j.com/rs/710-RRC-335/images/Neo4j-product-brief-database-US-EN.pdf. For more differences between graph databases and Relational Database Management Systems (RDBMS), see https://go.neo4j.com/rs/710-RRC-335/images/5-Advantages-Graph-Database-Infographic.pdf.
