Chapter 4. Modelling and Refactoring Patterns

Now that you’ve mastered the principles of graph modelling, you’re ready to explore some common modelling patterns. You’ll learn to recognise when to apply these patterns and weigh their pros and cons. Sooner or later you’ll also need to refactor your graph, so the second half of this chapter takes you through refactoring patterns with a hands-on approach.

Hyperedges: N-way Relationships

There’s a new trend: everybody appears to be interested in covers of songs. ElectricHarmony wants to jump on the bandwagon by introducing a new feature that helps listeners find covers of their favourite tracks easily. This feature should also help them discover whether some of their favourite tracks are themselves covers and, if so, who the original artist is. You find yourself at the whiteboard with the team once again to model this use case.

The most obvious way to start is shown in Figure 4-1: qualify the ARTIST relationship to the Track with a property that specifies whether the artist’s recording of the track is a cover or not.

Figure 4-1. Qualifying the ARTIST relationship to indicate whether the track is a cover or not.

Good modelling practice involves revisiting existing use cases to see if your refactoring has impacted them (perhaps adversely). You immediately think back to a key use case: fetching the track artist. So far, ElectricHarmony has leaned towards always showing a track’s original artist, and you don’t want that to change. However, now that this information is contained in a property on the relationship, Neo4j has to do extra work to traverse all the ARTIST relationships from a Track, examine the cover property value, and discard the majority. While that isn’t a major problem, because it is rare for tracks to have an extremely high number of artists, your team doesn’t feel that this is the right solution.

Instead, based on what you’ve learned in the previous chapter about relationship granularity, you decide to introduce a new relationship type (Figure 4-2): COVERED_BY.

Figure 4-2. COVERED_BY is a new relationship type introduced to represent the fact that an artist has covered a track.

This leaves the key use case undisturbed. When Neo4j needs to fetch covers (or, conversely, all the tracks that a particular artist has covered), it traverses only the COVERED_BY relationship.

What about finding any artist who has performed any version of the track? Simply include both relationship types in the pattern and Neo4j will traverse them both:

MATCH (t:Track {id:"4J4gApJKSC0himDViFotdy"})-[:ARTIST|COVERED_BY]-(a:Artist)
RETURN a

Draw some examples on the whiteboard (like those in Figure 4-3) to validate that this approach works.

Figure 4-3. Representing some data using the new model to validate use cases.

It does work: it’s easy to find both the original artist and other artists who’ve covered these tracks. But does it really work? Are you now asking your team whether it was Simon & Garfunkel or Disturbed who recorded the album Immortalized? What happened?

The model has now lost information: it is no longer possible to associate an album with a particular version of a track. What you want is an n-way relationship, one that connects more than two nodes. The model in Figure 4-4 would be ideal, but in graph databases such as Neo4j, a relationship must have exactly two nodes, one at each end.

Figure 4-4. An n-way relationship, which is not supported.

The solution here is to capture the shared context at the junction of the n-way relationship into a node called a hyperedge. This hyperedge is bound to its related nodes, and the data it holds is only applicable in this local context.

Figure 4-5. Introducing a hyperedge, the Recording node.

The hyperedges in Figure 4-5 are represented by the label Recording. Every time a new version of a track is created by the ingestion process, a new Recording node must be created to hold context for that particular recording. What’s nice about this model is that it can be extended to model other concepts that are specific to the hyperedge, such as whether the track has won a Grammy award. In Figure 4-6, the graph makes it clear via the Award node that the Simon & Garfunkel version won a Grammy in 1968. The previous models would have obscured this information.
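
As a minimal sketch of what the ingestion process might write when Disturbed’s cover is added, the following creates a Recording hyperedge. The relationship types (OF_TRACK, BY_ARTIST, ON_ALBUM), the cover property on the Recording node, and the matched values are illustrative choices, not fixed names from the model:

// Illustrative only: connect the Recording hyperedge to its track, artist, and album
MATCH (t:Track {id:"4J4gApJKSC0himDViFotdy"})
MATCH (a:Artist {name:"Disturbed"})
MATCH (al:Album {name:"Immortalized"})
CREATE (r:Recording {cover:true})
CREATE (r)-[:OF_TRACK]->(t)
CREATE (r)-[:BY_ARTIST]->(a)
CREATE (r)-[:ON_ALBUM]->(al)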

Figure 4-6. Adding context to a hyperedge with new relationships such as WON.

The model can also represent performances, since graphs’ schema-free nature supports exceptions and variations with ease. Figure 4-7 depicts how not every track has an Album:

Figure 4-7. Neo4j’s schema-free nature lets you attach information to the graph where it is available, such as on the Performance node.

Usually, hyperedges turn out to be entities in their own right that had previously gone undiscovered. If that’s not the case, consider carefully whether you want to use them: remember, they add more nodes to the graph, and consequently more hops during traversal.

Key takeaways

Use hyperedges when your graph would otherwise lose context that cannot be represented clearly by two nodes and the relationship that connects them. If you find yourself needing to start a traversal at a relationship, that’s a hint that a hyperedge might be a good idea. Weigh the tradeoffs of introducing more nodes into the graph that will also have to be created or updated during data ingestion flows.

Time-based Versioning

Time-based versioning captures how entities in your graph change over time. Typical reasons to apply it include:

  • Determining how the network changed over time

  • Tracking behaviour changes

  • Updating entities without affecting past records

This section applies a variation of the time-based versioning pattern to a simplified use case for ElectricHarmony: versioning a user’s “Favourites” playlist to track the user’s music tastes over time. This feature becomes especially useful when combined with time decay: assigning lower weights to artists and genres that a user loved in the past but listens to less now, which improves the relevance of recommendations.1

Figure 4-8. The default Favourites playlist for every user.

Every user has a default Favourites playlist which contains their favourite songs, as shown in Figure 4-8. Over time, they add or remove tracks from this playlist. To apply the principles of time-based versioning, you’ll separate the object from the state: here, the object is the Favourites playlist, and the state is its contents at various points in time.

After some debate about time-period granularity, ElectricHarmony settles on six months as the window for detecting whether a user’s listening preferences have changed. This means the application will create a new state every six months and maintain it for that period. You decide to capture this timeframe as properties on the relationships between the objects and their states, as pictured in Figure 4-9.

Figure 4-9. Separating the object and its state.
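
A minimal sketch of creating the object and its first state might look like the following; the property values are illustrative, with the six-month window approximated here as 182 days in milliseconds:

// Illustrative: create the Favourites playlist and its first state
MERGE (u:User {id:"100"})
MERGE (u)-[:OWNS]->(p:Playlist {name:"Favourites"})
CREATE (f:FavouriteState)
CREATE (p)-[:HAS_STATE {from: timestamp(), to: timestamp() + 182 * 24 * 60 * 60 * 1000}]->(f)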

Every time a track is added to or removed from the Favourites playlist, a HAS_TRACK relationship is created or deleted between the Track and the active FavouriteState node for the current time period. The active state is matched with a condition in the query, such as:

MATCH (u:User {id:"100"})-[:OWNS]->(p:Playlist {name:"Favourites"})-[r:HAS_STATE]->(f:FavouriteState)
WHERE r.from <= timestamp() <= r.to
RETURN f
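
When a user adds a track to their favourites, the write is a small extension of that query. Here is a sketch, assuming HAS_TRACK points from the FavouriteState to the Track (a modelling choice) and reusing a track id from earlier examples:

// Illustrative: attach a track to the currently active state
MATCH (u:User {id:"100"})-[:OWNS]->(:Playlist {name:"Favourites"})-[r:HAS_STATE]->(f:FavouriteState)
WHERE r.from <= timestamp() <= r.to
MATCH (t:Track {id:"4J4gApJKSC0himDViFotdy"})
MERGE (f)-[:HAS_TRACK]->(t)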
Tip

It’s worth familiarizing yourself with the temporal instant types available in Neo4j. In this case, you have two main options. One is to represent time as a Timestamp.2 The other is to represent it as a DateTime. Timestamps are integers, which are compact and straightforward to compare. However, timestamps for dates before the epoch will be represented as negative integers. A DateTime is a richer object, making it easier to calculate durations, convert formats, or extract components (such as the day or month). For time versioning in this example, a timestamp does the job.
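
As a quick illustration of the two options (both functions are built into Cypher):

// An integer: milliseconds since the epoch, compact and easy to compare
RETURN timestamp() AS ms;

// A DateTime: richer, with components and durations readily available
RETURN datetime() AS now,
       datetime().year AS year,
       duration.between(datetime("2024-01-01"), datetime()) AS elapsed;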

Introducing time-based versioning brings its own tradeoffs–the model is more complex, you need to adjust your queries, and writing data to the graph is more involved. One way of reducing complexity in the main graph is to offload all states except the active state to another database. We’ll discuss that and explore composite graphs when we revisit that particular case in chapter X.

Another avenue to explore is encoding the time period into the relationship type. This is helpful for cases where you want to capture when relationships were created or when they were valid or applicable. Recall the section on Bucketed Relationships from Chapter 3–the relationship from the Favourites playlist to a track could also be recorded as FAVOURITE_2024_H1 for the time period of the first half of the year.
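
For example, a sketch of recording a favourite with a period-encoded relationship type; the user and track values are illustrative:

MATCH (:User {id:"100"})-[:OWNS]->(p:Playlist {name:"Favourites"})
MATCH (t:Track {id:"4J4gApJKSC0himDViFotdy"})
MERGE (p)-[:FAVOURITE_2024_H1]->(t)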

Key takeaways

Time-based versioning is useful when you want to track or audit changes made to the graph or go backwards in time to compare the states of the graph. It adds complexity, however, so consider carefully.

Representing Sequences

Recall that, in chapter 1, you learned that Playlists are related to their tracks via the HAS_TRACK relationship, which includes a position property representing the order of the tracks (Figure 4-10). While this did the job for your proof-of-concept model, in working with it, the team has uncovered some disadvantages. You agree to revisit the question of track order in playlists.

Figure 4-10. The graph model for playlists and their tracks.

Finding the last track in a playlist

The recommendation query in the proof-of-concept model is based on the last track in a playlist, which is queried in an awkward way. Neo4j traverses every HAS_TRACK relationship from the playlist to inspect the position property and compare it to the total number of tracks on the playlist (determined by the COUNT subquery):

MATCH (p)-[r:HAS_TRACK]->(t)
WHERE r.position = COUNT { (p)-[:HAS_TRACK]->()}
WITH p AS playlist, t AS lastTrack

It also assumes that all position properties are correct and sequential, with no holes, since the last track position must match the count of all tracks. This brings us to the next problem.

Maintaining track order

Users frequently reorder tracks in a playlist: new tracks are added, others are removed. Every time one of these operations takes place, your code must recalculate the positions of all tracks in the playlist.

You can make that happen by writing a procedure that extends Neo4j. The disadvantage of going down this route is that you have to write procedures in a JVM language (Java, Kotlin, Scala, etc.) and compile them into JAR files, which you must then deploy as plugins to the Neo4j database server.

Since you’re working with a graph, there is a more natural way to model sequences, whether they are tracks in a playlist, episodes in a TV series, or jobs in your employment history: linked lists. You’ll also find linked lists especially useful when splitting text into chunks and linking them in order to represent the original document; chapter X on GenAI will refer to this concept. Linked lists are fundamental data structures in which each element maintains a reference, or pointer, to the next. The graph fetches the elements in sequence by simply following the pointer to the next element in the list. The complexity of insertion and deletion is, at best, O(1) and, at worst, O(n).

In a graph, the pointers are implicitly represented by relationships, and traversing across relationships is the sweet spot for graph databases. Figure 4-11 shows a playlist modelled as a linked list.

Figure 4-11. Modelling tracks in a playlist as a linked list.

Let’s connect some of the concepts you read about in chapter 3. The coarse-grained HAS_TRACK relationship from the Playlist to a Track has been replaced with a more specific relationship, PLAYLIST_TRACK. (The PLAYLIST_TRACK relationship can also be called NEXT: it’s just a matter of preference.)

Consider Figure 4-12. Can you tell which tracks are part of which playlist?

You can probably tell that “Nobody’s Fool” is the first track of two playlists, “Rock Ballads” and “Drive,” but the tracks that follow it in each playlist are unclear.

To avoid losing context about how tracks are related to playlists, let’s introduce a new node: PlaylistTrack. The PlaylistTrack node makes it much easier to traverse the playlist and to modify its order.
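
Once a playlist is modelled this way, reading its tracks back in order becomes a simple traversal. Here is a sketch, using the PLAYLIST_TRACK and PLAYLIST_ITEM relationship names adopted in the refactoring later in this chapter and an illustrative playlist name:

// Walk the linked list from the playlist; the path length gives the position
MATCH path = (p:Playlist {name:"Drive"})-[:PLAYLIST_TRACK*1..100]->(item:PlaylistTrack)
MATCH (item)<-[:PLAYLIST_ITEM]-(t:Track)
RETURN t.name AS track
ORDER BY length(path)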

Figure 4-13. The LAST_PLAYLIST_TRACK navigates directly to the last track on a playlist.

Figure 4-13 contains one more modification you can make. Since finding the last track of any Playlist is a top use case for ElectricHarmony, instead of traversing all tracks in the sequence, you can give the graph a shortcut and send it straight to the last track by adding a LAST_PLAYLIST_TRACK relationship. Only add this relationship, however, if it’s useful for the majority of your use cases or if it shows a substantial performance advantage when traversing over extremely long playlists. Premature optimisation in the form of adding LAST_PLAYLIST_TRACK leads to issues in other areas: you increase complexity, because the pointer to the tail of the linked list has to be maintained, especially when the list is updated frequently.

The recommendation query you wrote on Day 4 in chapter 1 contained the following segment to find the last track:

// Find the last track
MATCH (p)-[r:HAS_TRACK]->(t)
WHERE r.position = COUNT { (p)-[:HAS_TRACK]->()}
WITH p AS playlist, t AS lastTrack, popularTracks

With the introduction of the LAST_PLAYLIST_TRACK relationship, this part of the query reduces to:

MATCH (p)-[:LAST_PLAYLIST_TRACK]->()<-[:PLAYLIST_ITEM]-(lastTrack)

In the next section, you will refactor the graph you created in chapter 1 to this enhanced model.

Be careful not to model a doubly linked list data structure by introducing a PREVIOUS_PLAYLIST_TRACK relationship to navigate the playlist in reverse, as shown in Figure 4-14.

Figure 4-14. Do not model doubly linked lists in a graph.

As you learned in the section on bidirectional relationships, these are redundant and bad practice.

Key takeaways

When you encounter sequences in your domain, think about whether you can model them as linked lists for easier modification and faster traversal through the sequenced nodes. Weigh the pros and cons of query complexity against the frequency of updates to the list.

Refactoring Patterns

At some point, you’ll need to enhance your graph model for performance, new data, or new use cases, and that means refactoring your graph data. There are two basic ways to refactor:

Reingest all data

This method assumes that the graph can be completely reconstructed from original sources. Here, you modify the ingestion queries or scripts to write data to the graph as per the new model. The benefit of this approach is that ingestion scripts are always current. The downside is that replacing the graph in production will require some minimal downtime.

Refactor live

When you can’t reconstruct your graph, or when it’s quite small, live refactoring is the way to go: refactoring queries are executed directly on the live graph. Typically, you’ll do this on a backup of the graph and validate it before applying the refactor in production.

You can also combine these two approaches. If data is constantly streaming into the graph, refactoring will feel like a never-ending quest. One way of tackling this problem is to first update the ingestion scripts or queries to write new data to the graph in accordance with the refactored model. This stops the old model from propagating any further. Then you can perform a live refactor to bring the existing data in line with the new model.

In most cases, you can maintain backward compatibility. The current version of the graph and the refactored version can coexist, which lets you migrate queries gradually with no downtime. However, you need to plan what to do in case things go wrong. For example, if the graph is too large to refactor in a single transaction, you run the risk of leaving it in a partially refactored state until you can retry the transactions that failed. This is why we advise keeping both versions of the model in place until you’ve verified the success of the operation. Then you can remove the previous version.

A huge benefit of graph refactoring is that it lets dependent applications catch up and migrate over time. You can phase out the “old” graph model once all queries and applications have switched over to the new one.3 (You will see this in action in the following sections.)

You’ve already performed the first type of refactoring: in chapter 1, you converted the data type of the position property of the HAS_TRACK relationship from a string to an integer by updating the LOAD CSV query and reingesting the data. You also removed the redundant SIMILAR relationship earlier in this chapter.

The next three sections describe a few more refactoring patterns that you’re likely to come across, all of which use live refactoring.

Refactoring to change the type of a relationship

After you read the section on relationship granularity in chapter 3, you decide to refactor the graph to replace the HAS_TRACK relationship between Playlists and Tracks with an ON_PLAYLIST relationship from Tracks to Playlists.

Neo4j does not support changing the type of an existing relationship, so you need to create ON_PLAYLIST relationships and then delete the HAS_TRACK ones. Since you want to keep the graph backward-compatible until ElectricHarmony’s engineering team updates their recommendation queries, you’ll refactor in the following stages:

Stage 1

Match all HAS_TRACK relationships from a Playlist to a Track. For each matched Playlist and Track pair, create an ON_PLAYLIST relationship from the Track to the Playlist and set the position on the new relationship. At this point, both models coexist in the same graph; the recommendation query will not break.

First, count how many relationships need to be refactored in the graph that you ended up with in chapter 1:

MATCH (:Playlist)-[r:HAS_TRACK]->(:Track)
RETURN count(r)

There are 73,153 relationships to be refactored. This is not a large number; Neo4j can refactor these relationships in a single transaction.

Execute the following query to refactor the graph:

//Step 1
MATCH (p:Playlist)-[hastrack:HAS_TRACK]->(t:Track)
//Step 2 
MERGE (t)-[onplaylist:ON_PLAYLIST]->(p)
SET onplaylist.position=hastrack.position; 
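
If the count were much larger, you could run the same refactor in batches with CALL { ... } IN TRANSACTIONS, the technique used later in this chapter, along these lines:

MATCH (p:Playlist)-[hastrack:HAS_TRACK]->(t:Track)
CALL {
  WITH p, hastrack, t
  MERGE (t)-[onplaylist:ON_PLAYLIST]->(p)
  SET onplaylist.position = hastrack.position
} IN TRANSACTIONS OF 10000 ROWS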

If you pause here and inspect a playlist in the graph, you’ll see that both versions of the model are maintained, like the example in Figure 4-15.

Figure 4-15. The old and new relationship types coexist in the graph.

Stage 2

Update all queries that depend on the HAS_TRACK relationship to use the new ON_PLAYLIST relationship.

Here’s the updated recommendation query that you can test:

//Find popular tracks
MATCH (popularTrack:Track)-[:ON_PLAYLIST]-(:Playlist)
WITH popularTrack, count(*) as playlistCount
 ORDER BY playlistCount DESC
 LIMIT 10
WITH collect(elementId(popularTrack)) as popularTracks
// For a given Playlist
MATCH (p:Playlist) WHERE p.name = "all that jazz"
// Find the last track
MATCH (p)<-[r:ON_PLAYLIST]-(t)
 WHERE r.position = COUNT { (p)<-[:ON_PLAYLIST]-()}
WITH p AS playlist, t AS lastTrack, popularTracks
// Get the previous tracks
WITH playlist, lastTrack, popularTracks, COLLECT {
MATCH (playlist)<-[:ON_PLAYLIST]-(previous)
 WHERE previous <> lastTrack
RETURN elementId(previous)
} AS previousTracks
// Find other playlists that have the same last track
MATCH (lastTrack)-[:ON_PLAYLIST]->(otherPlaylist)-[:SIMILAR]-(playlist)
// Find other tracks which are not in the given playlist
MATCH (otherPlaylist)<-[:ON_PLAYLIST]-(recommendation)
WHERE NOT elementId(recommendation) IN previousTracks
AND NOT elementId(recommendation) IN popularTracks
// Score them by how frequently they appear
RETURN  recommendation.id as recommendedTrackId, recommendation.name AS recommendedTrack, otherPlaylist.name AS fromPlaylist, count(*) AS score
ORDER BY score DESC
LIMIT 5

Note that the WHERE clause from the original query:

// Find other playlists that have the same the last track
MATCH (lastTrack)<-[:HAS_TRACK]-(otherPlaylist)-[:SIMILAR]-(playlist)
WHERE otherPlaylist <> playlist

has been dropped, since the graph now has a single SIMILAR relationship between any pair of playlists.

Stage 3

Delete the HAS_TRACK relationship.

//Step 3
MATCH (p:Playlist)-[hastrack:HAS_TRACK]->(t:Track)
DELETE hastrack 
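
To confirm the refactor, check that no HAS_TRACK relationships remain and that the ON_PLAYLIST count matches the 73,153 you started with:

MATCH (:Playlist)-[r:HAS_TRACK]->(:Track)
RETURN count(r) AS remainingHasTrack;

MATCH (:Track)-[r:ON_PLAYLIST]->(:Playlist)
RETURN count(r) AS migratedOnPlaylist;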

Refactoring to create a node from a property

To see this refactoring pattern in action, you’ll first add a property representing genre to a few Artist nodes. Then you’ll refactor the graph to extract genres as nodes.

The accompanying GitHub repository contains a CSV file of genres for a limited number of artists, which you can import into the graph using LOAD CSV. Since this data source does not contain artist IDs, you will match artists by their names. Adding an index on artist name will speed up the MATCH and hence the import:

CREATE INDEX artist_name FOR (n:Artist) ON (n.name);

The data source isn’t particularly clean and contains a slash-delimited set of genres for each artist. The following query will split the genres on the / delimiter, trim spaces, convert them to lowercase, and store them as a list of strings on the Artist node:

LOAD CSV WITH HEADERS FROM "file:///genres.csv" AS row
WITH row.Artist AS artist, split(row.Genre, "/") AS genreList
UNWIND genreList AS genre
WITH artist, collect(trim(toLower(genre))) AS genres
MATCH (a:Artist {name:artist})
SET a.genres = genres

Figure 4-16 shows a sampling of artist-genre pairs after this data has been ingested.

Figure 4-16. The genres property on Artist nodes.
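
A query along these lines produces such a sampling:

MATCH (a:Artist)
WHERE a.genres IS NOT NULL
RETURN a.name AS artist, a.genres AS genres
LIMIT 10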

Your next task is to create Genre nodes from the genres property. Since you don’t want duplicate Genre nodes, you will use MERGE. This will result in a single Genre node for “alternative rock,” even though this genre is repeated across the artists listed in Figure 4-16.

The best practice is to create the constraint first:

CREATE CONSTRAINT genre_name
FOR (genre:Genre) REQUIRE genre.name IS UNIQUE

The following steps will complete the refactor:

  1. Match all Artist nodes that have a genres property.

  2. UNWIND the genres, which will convert the list of genres into individual rows.

  3. Merge a Genre node for each unwound genre.

  4. Merge a GENRE relationship from the Artist to each Genre node from the previous step.

  5. Now both models coexist in the graph. Once you have refactored all queries that use the genres property on an Artist to traverse the GENRE relationship, drop the property.

Here’s the Cypher query to execute on your graph for steps 1 to 4:

//Step 1
MATCH (a:Artist)
WHERE a.genres IS NOT NULL
//Step 2
WITH a
UNWIND a.genres AS genreName
//Step 3
MERGE (g:Genre {name:genreName})
//Step 4
MERGE (a)-[:GENRE]->(g)

Once the queries that read the genres property on an Artist have been migrated to traverse the GENRE relationship, complete step 5 by dropping the property:

//Step 5
MATCH (a:Artist)
WHERE a.genres IS NOT NULL
REMOVE a.genres

Figure 4-17 depicts the refactored graph.

Figure 4-17. Genre nodes extracted from the genres property.
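
To spot-check the result, you can count how many artists are attached to each Genre node:

MATCH (a:Artist)-[:GENRE]->(g:Genre)
RETURN g.name AS genre, count(a) AS artists
ORDER BY artists DESC
LIMIT 10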

Refactoring to create a node from a relationship

Now it’s time to refactor the ON_PLAYLIST relationships into new PlaylistTrack nodes and create linked lists, following the model described in the “Representing Sequences” section earlier in this chapter. Refresh your memory using Figure 4-18, which shows the transformation required:

Figure 4-18. Refactoring the ON_PLAYLIST relationship into PlaylistTrack nodes and connecting them sequentially.
Tip

Use realistic examples to expose potential problems with your refactoring early. The graph in Figure 4-18 includes “Back in Black”, a track that is on two playlists.

While creating a node from a relationship looks complicated, breaking it up into stages makes it more manageable.

Stage 1

In the first stage, before getting into the creation of the linked list, you’ll only create the PlaylistTrack nodes. Remember that each PlaylistTrack node represents a track at a particular spot in a playlist. The result should look like Figure 4-19.

Figure 4-19. The state of the graph after creating PlaylistTrack nodes.

Can you spot the problem with this new graph? Look closely at the new PlaylistTrack nodes. How will you sequence them in a linked list later on?

The lists to create are Klasic-Progresif-Hard -> A -> B -> C and jump up -> D.

Introducing the PlaylistTrack nodes has muddied the waters: it is not clear which PlaylistTrack for “Back in Black” belongs to which Playlist, except by correlating position numbers across the ON_PLAYLIST and PLAYLIST_ITEM relationships. If the same track is at the same position in multiple playlists, the problem is magnified.

Don’t be afraid to use temporary relationships during refactoring to help you write simple but predictable queries; you’ll delete them once they’ve done their job of assisting with the refactor. Here, try introducing a temporary TRACK_ITEM_TEMP relationship between the Playlist and the PlaylistTrack (as shown in Figure 4-20). This relationship is essentially a copy of the ON_PLAYLIST relationship, but it terminates at the new PlaylistTrack nodes and will help reconstruct the sequence for the linked list.

Figure 4-20. Adding a temporary helper relationship, TRACK_ITEM_TEMP.

Now it’s time to write the Cypher query for this stage. It consists of three steps. First, find all tracks and the playlists they’re on. Then, create a PlaylistTrack node for every Track in the Playlist. Finally, create the TRACK_ITEM_TEMP relationship from the Playlist to the PlaylistTrack, recording the position of the track in the playlist on the relationship.

Since the number of playlists and tracks is fairly large, you’ll execute this in batched transactions, with the default batch size of 1,000 rows:

//Stage 1
//Step 1
MATCH (p:Playlist)
CALL {
  WITH p
  MATCH (t:Track)-[r:ON_PLAYLIST]->(p:Playlist)
  //Step 2
  CREATE (pt:PlaylistTrack)
  CREATE (t)-[:PLAYLIST_ITEM]->(pt)
  //Step 3
  CREATE (p)-[:TRACK_ITEM_TEMP {position:r.position}]->(pt)
} IN TRANSACTIONS

Stage 2

This stage simply links the Playlist to the head and tail PlaylistTrack nodes.

//Stage 2
MATCH (p:Playlist)
CALL {
  WITH p
  MATCH (p)-[:TRACK_ITEM_TEMP {position:1}]->(firstTrack:PlaylistTrack)
  CREATE (p)-[:PLAYLIST_TRACK]->(firstTrack)
  WITH p
  MATCH (p)-[r:TRACK_ITEM_TEMP]->(lastTrack:PlaylistTrack)
    WHERE r.position = COUNT {()-[:ON_PLAYLIST]->(p)}
  CREATE (p)-[:LAST_PLAYLIST_TRACK]->(lastTrack)
} IN TRANSACTIONS

Stage 3

This is where the linked list is produced by creating relationships between PlaylistTrack nodes. You’ll use the position property on the TRACK_ITEM_TEMP relationship to maintain the playlist’s track order.

One way to do this in Cypher is to collect a playlist’s PlaylistTrack nodes, in order. Then use a combination of RANGE and UNWIND to obtain a pair of adjacent tracks and create the relationship between them:

//Stage 3
//Step 1
MATCH (p:Playlist)
CALL {
  WITH p
  MATCH (p)-[r:TRACK_ITEM_TEMP]->(t:PlaylistTrack)
  WITH r, t
    ORDER BY r.position
  WITH COLLECT(t) AS playlistTracks
  UNWIND RANGE(0, SIZE(playlistTracks) - 2) AS idx
  WITH playlistTracks[idx] AS t1, playlistTracks[idx+1] AS t2
  MERGE (t1)-[:PLAYLIST_TRACK]->(t2)
} IN TRANSACTIONS
Tip

The APOC library contains a handy set of procedures for refactoring, such as apoc.nodes.link, which creates a linked list from a collection of nodes; this would simplify the query you just wrote.
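
For instance, here is a sketch of Stage 3 using apoc.nodes.link (assuming the APOC plugin is installed):

// Collect each playlist's items in order, then let APOC create the PLAYLIST_TRACK links
MATCH (p:Playlist)-[r:TRACK_ITEM_TEMP]->(t:PlaylistTrack)
WITH p, r, t
  ORDER BY r.position
WITH p, collect(t) AS playlistTracks
CALL apoc.nodes.link(playlistTracks, 'PLAYLIST_TRACK')
RETURN count(*) AS playlistsLinked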

Next, drop the temporary relationships as they’ve done their job. You can batch the deletes in transactions to avoid running out of memory:

//Stage 3
//Step 2
MATCH ()-[r:TRACK_ITEM_TEMP]->()
CALL {
    WITH r
    DELETE r
}
IN TRANSACTIONS

Now both models coexist in the same graph, as can be seen in Figure 4-21.

Figure 4-21. The playlist is refactored into a linked list, and the old ON_PLAYLIST relationship is maintained until it is no longer in use.

Stage 4

To use the linked list, rework the recommendation query from chapter 1 as follows:

//Stage 4
//Find popular tracks
MATCH (popularTrack:Track)-[:PLAYLIST_ITEM]->()
WITH popularTrack, count(*) as playlistCount
ORDER BY playlistCount DESC
LIMIT 10
WITH collect(elementId(popularTrack)) as popularTracks
// For a given Playlist
MATCH (playlist:Playlist) WHERE playlist.name = "all that jazz"
// Find the last track
MATCH (playlist)-[:LAST_PLAYLIST_TRACK]->(lastTrackItem)
// Get the previous tracks
WITH playlist, lastTrackItem, popularTracks, COLLECT {
MATCH (playlist) (()-[:PLAYLIST_TRACK]->()){1,100} (previousTrackItem)<-[:PLAYLIST_ITEM]-(previousTrack)
WHERE previousTrackItem <> lastTrackItem
RETURN elementId(previousTrack)
} AS previousTracks
// Find other playlists that have the same last track
MATCH (playlist)-[:SIMILAR]-(otherPlaylist:Playlist)-[:LAST_PLAYLIST_TRACK]->(otherLastTrackItem)<-[:PLAYLIST_ITEM]-(:Track)-[:PLAYLIST_ITEM]->(lastTrackItem)
// Find other tracks which are not in the given playlist
MATCH (otherPlaylist) (()-[:PLAYLIST_TRACK]->()){1,100} (recommendationItem)<-[:PLAYLIST_ITEM]-(recommendation)
WHERE NOT elementId(recommendation) IN previousTracks
AND NOT elementId(recommendation) IN popularTracks
// Score them by how frequently they appear
RETURN  recommendation.id as recommendedTrackId, recommendation.name AS recommendedTrack, otherPlaylist.name AS fromPlaylist, count(*) AS score
ORDER BY score DESC
LIMIT 5

Finally, drop the ON_PLAYLIST relationships:

//Stage 4
MATCH ()-[r:ON_PLAYLIST]->()
CALL {
WITH r
DELETE r
}
IN TRANSACTIONS

Key takeaways

Refactoring is a normal part of graph modelling. You can refactor a live graph or reingest all the data to comply with an updated model. The old and new versions of a model can coexist in the same graph: this provides backward compatibility and eases the query migration process.

Sometimes refactoring can result in more complex queries, as with the recommendation query. Always weigh the costs and benefits, taking the principles of modelling into account.

Summary

You can now recognize when specific modelling patterns such as linked lists and hyperedges can be applied. Most domains are straightforward to model, and with experience, you’ll find yourself using these patterns intuitively. You’ve also practised refactoring your graph. Soon this will be second nature.

The next chapter addresses query performance and tuning: we’ll show you how queries are executed and how your modelling decisions affect that execution.

1 A common modelling pattern for this was first introduced by Ian Robinson; although the original article appears to have vanished, the topic has subsequently been covered by Ljubica Lazarevic.

2 This refers to time elapsed in milliseconds since the epoch: midnight UTC on January 1, 1970.

3 This is similar to Martin Fowler’s description of the Strangler Fig Application.
