Chapter 4. Building a Graph Database Application

In this chapter, we discuss some of the practical issues of working with a graph database. In previous chapters, we’ve looked at graph data; in this chapter, we’ll apply that knowledge in the context of developing a graph database application. We’ll look at some of the data modeling questions that may arise, and at some of the application architecture choices available to us.

In our experience, graph database applications are highly amenable to being developed using the evolutionary, incremental, and iterative software development practices in widespread use today. A key feature of these practices is the prevalence of testing throughout the software development life cycle. Here we’ll show how we develop our data model and our application in a test-driven fashion.

At the end of the chapter, we’ll look at some of the issues we’ll need to consider when planning for production.

Data Modeling

We covered modeling and working with graph data in detail in Chapter 3. Here we summarize some of the more important modeling guidelines, and discuss how implementing a graph data model fits with iterative and incremental software development techniques.

Describe the Model in Terms of the Application’s Needs

The questions we need to ask of the data help identify entities and relationships. Agile user stories provide a concise means for expressing an outside-in, user-centered view of an application’s needs, and the questions that arise in the course of satisfying this need.1 Here’s an example of a user story for a book review web application:

AS A reader who likes a book, I WANT to know which books other readers who like the same book have liked, SO THAT I can find other books to read.

This story expresses a user need, which motivates the shape and content of our data model. From a data modeling point of view, the AS A clause establishes a context comprising two entities—a reader and a book—plus the LIKES relationship that connects them. The I WANT clause then poses a question: which books have the readers who like the book I’m currently reading also liked? This question exposes more LIKES relationships, and more entities: other readers and other books.

The entities and relationships that we’ve surfaced in analyzing the user story quickly translate into a simple data model, as shown in Figure 4-1.

Figure 4-1. Data model for the book reviews user story

Because this data model directly encodes the question presented by the user story, it lends itself to being queried in a way that similarly reflects the structure of the question we want to ask of the data: since Alice likes Dune, find books that other readers who like Dune have also enjoyed:

MATCH (:Reader {name:'Alice'})-[:LIKES]->(:Book {title:'Dune'})
      <-[:LIKES]-(:Reader)-[:LIKES]->(books:Book)
RETURN books.title

Nodes for Things, Relationships for Structure

Though not applicable in every situation, these general guidelines will help us choose when to use nodes, and when to use relationships:

  • Use nodes to represent entities—that is, the things in our domain that are of interest to us, and which can be labeled and grouped.

  • Use relationships both to express the connections between entities and to establish semantic context for each entity, thereby structuring the domain.

  • Use relationship direction to further clarify relationship semantics. Many relationships are asymmetrical, which is why relationships in a property graph are always directed. For bidirectional relationships, we should make our queries ignore direction, rather than using two relationships.

  • Use node properties to represent entity attributes, plus any necessary entity metadata, such as timestamps, version numbers, etc.

  • Use relationship properties to express the strength, weight, or quality of a relationship, plus any necessary relationship metadata, such as timestamps, version numbers, etc.

It pays to be diligent about discovering and capturing domain entities. As we saw in Chapter 3, it’s relatively easy to model things that really ought to be represented as nodes using carelessly named relationships instead. If we’re tempted to use a relationship to model an entity—an email, or a review, for example—we must make certain that this entity cannot be related to more than two other entities. Remember, a relationship must have a start node and an end node—nothing more, nothing less. If we find later that we need to connect something we’ve modeled as a relationship to more than two other entities, we’ll have to refactor the entity inside the relationship out into a separate node. This is a breaking change to the data model, and will likely require us to make changes to any queries and application code that produce or consume the data.
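
By way of illustration, here's a rough sketch of how such a refactoring might look in Cypher, assuming a hypothetical EMAILED relationship that we want to replace with the kind of Email node shown later in this chapter (the SENT and TO relationship names anticipate that model):

MATCH (sender:Person)-[e:EMAILED]->(recipient:Person)
CREATE (sender)-[:SENT]->(email:Email {content:e.content})-[:TO]->(recipient)
DELETE e

Any queries or application code that matched the old EMAILED relationship would, of course, need to be updated at the same time.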

Fine-Grained versus Generic Relationships

When designing relationships we should be mindful of the trade-offs between using fine-grained relationship names versus generic relationships qualified with properties. It’s the difference between using DELIVERY_ADDRESS and HOME_ADDRESS versus ADDRESS {type:'delivery'} and ADDRESS {type:'home'}.

Relationships are the royal road into the graph. Differentiating by relationship name is the best way of eliminating large swathes of the graph from a traversal. Using one or more property values to decide whether or not to follow a relationship incurs extra I/O the first time those properties are accessed because the properties reside in a separate store file from the relationships (after that, however, they’re cached).

We use fine-grained relationships whenever we have a closed set of relationship names. Weightings—as required by a shortest-weighted-path algorithm—rarely comprise a closed set, and are usually best represented as properties on relationships.

Sometimes, however, we have a closed set of relationships, but in some traversals we want to follow specific kinds of relationships within that set, whereas in others we want to follow all of them, irrespective of type. Addresses are a good example. Following the closed-set principle, we might choose to create HOME_ADDRESS, WORK_ADDRESS, and DELIVERY_ADDRESS relationships. This allows us to follow specific kinds of address relationships (DELIVERY_ADDRESS, for example) while ignoring all the rest. But what do we do if we want to find all addresses for a user? There are a couple of options here. First, we can encode knowledge of all the different relationship types in our queries: e.g., MATCH (user)-[:HOME_ADDRESS|WORK_ADDRESS|DELIVERY_ADDRESS]->(address). This, however, quickly becomes unwieldy when there are lots of different kinds of relationships. Alternatively, we can add a more generic ADDRESS relationship to our model, in addition to the fine-grained relationships. Every node representing an address is then connected to a user using two relationships: a fine-grained relationship (e.g., DELIVERY_ADDRESS) and the more generic ADDRESS {type:'delivery'} relationship.
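
To make the dual-relationship approach concrete, here's a small sketch of how a single delivery address might be wired up (the node labels and property values are illustrative only):

CREATE (user:User {name:'Alice'}),
       (addr:Address {city:'Cambridge'}),
       (user)-[:DELIVERY_ADDRESS]->(addr),
       (user)-[:ADDRESS {type:'delivery'}]->(addr)

Traversals that care only about delivery addresses follow DELIVERY_ADDRESS; traversals that want every address for a user follow ADDRESS.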

As we discussed in “Describe the Model in Terms of the Application’s Needs”, the key here is to let the questions we want to ask of our data guide the kinds of relationships we introduce into the model.

Model Facts as Nodes

When two or more domain entities interact for a period of time, a fact emerges. We represent a fact as a separate node with connections to each of the entities engaged in that fact. Modeling an action in terms of its product—that is, in terms of the thing that results from the action—produces a similar structure: an intermediate node that represents the outcome of an interaction between two or more entities. We can use timestamp properties on this intermediate node to represent start and end times.

The following examples show how we might model facts and actions using intermediate nodes.

Employment

Figure 4-2 shows how the fact of Ian being employed by Neo Technology in the role of engineer can be represented in the graph.

In Cypher, this can be expressed as:

CREATE (:Person {name:'Ian'})-[:EMPLOYMENT]->
        (employment:Job {start_date:'2011-01-05'})
        -[:EMPLOYER]->(:Company {name:'Neo'}),
       (employment)-[:ROLE]->(:Role {name:'engineer'})
Figure 4-2. Ian began employment as an engineer at Neo Technology

Performance

Figure 4-3 shows how the fact that William Hartnell played The Doctor in the story The Sensorites can be represented in the graph.

In Cypher:

CREATE (:Actor {name:'William Hartnell'})-[:PERFORMED_IN]->
         (performance:Performance {year:1964})-[:PLAYED]->
         (:Role {name:'The Doctor'}),
       (performance)-[:FOR]->(:Story {title:'The Sensorites'})
Figure 4-3. William Hartnell played The Doctor in the story The Sensorites

Emailing

Figure 4-4 shows the act of Ian emailing Jim and copying in Alistair.

Figure 4-4. Ian emailed Jim, and copied in Alistair

In Cypher, this can be expressed as:

CREATE (:Person {name:'Ian'})-[:SENT]->(e:Email {content:'...'})
         -[:TO]->(:Person {name:'Jim'}),
       (e)-[:CC]->(:Person {name:'Alistair'})

Reviewing

Figure 4-5 shows how the act of Alistair reviewing a film can be represented in the graph.

In Cypher:

CREATE (:Person {name:'Alistair'})-[:WROTE]->
         (review:Review {text:'...'})-[:OF]->(:Film {title:'...'}),
       (review)-[:PUBLISHED_IN]->(:Publication {title:'...'})
Figure 4-5. Alistair wrote a review of a film, which was published in a magazine

Represent Complex Value Types as Nodes

Value types are things that do not have an identity, and whose equivalence is based solely on their values. Examples include money, address, and SKU. Complex value types are value types with more than one field or property. Address, for example, is a complex value type. Such multiproperty value types may be usefully represented as separate nodes:

MATCH (:Order {orderid:13567})-[:DELIVERY_ADDRESS]->(address:Address)
RETURN address.first_line, address.zipcode
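
A sketch of how such an order and its address node might be created, using the same property names as the query above (the values are placeholders):

CREATE (order:Order {orderid:13567}),
       (delivery:Address {first_line:'42 Acacia Avenue', zipcode:'10001'}),
       (order)-[:DELIVERY_ADDRESS]->(delivery)

Modeling the address as a node in its own right keeps its fields together and lets queries return or filter on them individually.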

Time

Time can be modeled in several different ways in the graph. Here we describe two techniques: timeline trees and linked lists. In some solutions, it’s useful to combine these two techniques.

Timeline trees

If we need to find all the events that have occurred over a specific period, we can build a timeline tree, as shown in Figure 4-6.

Figure 4-6. A timeline tree showing the broadcast dates for four episodes of a TV program

Each year has its own set of month nodes; each month has its own set of day nodes. We need only insert nodes into the timeline tree as and when they are needed. Assuming the root timeline node has been indexed, or can be discovered by traversing the graph, the following Cypher statement ensures that all necessary nodes and relationships for a particular event—year, month, day, plus the node representing the event itself—are either already present in the graph, or, if not present, are added to the graph (MERGE will add any missing elements):

MATCH (timeline:Timeline {name:{timelineName}})
MERGE (episode:Episode {name:{newEpisode}})
MERGE (timeline)-[:YEAR]->(year:Year {value:{year}})
MERGE (year)-[:MONTH]->(month:Month {name:{monthName}})
MERGE (month)-[:DAY]->(day:Day {value:{day}, name:{dayName}})
MERGE (day)<-[:BROADCAST_ON]-(episode)

Querying the calendar for all events between a start date (inclusive) and an end date (exclusive) can be done with the following Cypher code:

MATCH (timeline:Timeline {name:{timelineName}})
MATCH (timeline)-[:YEAR]->(year:Year)-[:MONTH]->(month:Month)-[:DAY]->
      (day:Day)<-[:BROADCAST_ON]-(n)
WHERE ((year.value > {startYear} AND year.value < {endYear})
      OR ({startYear} = {endYear} AND {startMonth} = {endMonth}
          AND year.value = {startYear} AND month.value = {startMonth}
          AND day.value >= {startDay} AND day.value < {endDay})
      OR ({startYear} = {endYear} AND {startMonth} < {endMonth}
          AND year.value = {startYear}
          AND ((month.value = {startMonth} AND day.value >= {startDay})
              OR (month.value > {startMonth} AND month.value < {endMonth})
              OR (month.value = {endMonth} AND day.value < {endDay})))
      OR ({startYear} < {endYear}
          AND year.value = {startYear}
          AND ((month.value > {startMonth})
              OR (month.value = {startMonth} AND day.value >= {startDay})))
      OR ({startYear} < {endYear}
          AND year.value = {endYear}
          AND ((month.value < {endMonth})
              OR (month.value = {endMonth} AND day.value < {endDay}))))
RETURN n

The WHERE clause here, though somewhat verbose, simply filters each match based on the start and end dates supplied to the query.

Linked lists

Many events have temporal relationships to the events that precede and follow them. We can use NEXT and/or PREVIOUS relationships (depending on our preference) to create linked lists that capture this natural ordering, as shown in Figure 4-7.2 Linked lists allow for very rapid traversal of time-ordered events.

Figure 4-7. A doubly linked list representing a time-ordered series of events
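
A minimal sketch of how two adjacent events might be linked, using parameters for the episode names (whether to maintain NEXT, PREVIOUS, or both is, as noted above, a matter of preference):

MATCH (first:Episode {name:{firstEpisode}}),
      (second:Episode {name:{secondEpisode}})
CREATE (first)-[:NEXT]->(second),
       (second)-[:PREVIOUS]->(first)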

Versioning

A versioned graph enables us to recover the state of the graph at a particular point in time. Most graph databases don’t support versioning as a first-class concept. It is possible, however, to create a versioning scheme inside the graph model. With this scheme nodes and relationships are timestamped and archived whenever they are modified.3 The downside of such versioning schemes is that they leak into any queries written against the graph, adding a layer of complexity to even the simplest query.
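
There is no single canonical versioning scheme, but as a rough sketch of the idea, the current state of a node might be copied into an archive node and stamped with a validity timestamp before the live node is updated (the Product label, PREVIOUS_VERSION relationship, and properties here are purely illustrative):

MATCH (product:Product {sku:{sku}})
CREATE (product)-[:PREVIOUS_VERSION]->(archived:ProductVersion)
SET archived = product
SET archived.valid_to = timestamp(),
    product.price = {newPrice}

Queries for the current state then match only the live node, while historical queries traverse the PREVIOUS_VERSION relationships, which is precisely the extra complexity referred to above.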

Iterative and Incremental Development

We develop the data model feature by feature, user story by user story. This will ensure that we identify the relationships our application will use to query the graph. A data model that is developed in line with the iterative and incremental delivery of application features will look quite different from one drawn up using a data model-first approach, but it will be the correct model, motivated throughout by the application’s needs, and the questions that arise in conjunction with those needs.

Graph databases provide for the smooth evolution of our data model. Migrations and denormalization are rarely an issue. New facts and new compositions become new nodes and relationships, while optimizing for performance-critical access patterns typically involves introducing a direct relationship between two nodes that would otherwise be connected only by way of intermediaries. Unlike the optimization strategies we employ in the relational world, which typically involve denormalizing and thereby compromising a high-fidelity model, this is not an either/or issue: either the detailed, highly normalized structure, or the high performance compromise. With the graph we retain the original high-fidelity graph structure, while at the same time enriching it with new elements that cater to new needs.

We will quickly see how different relationships can sit side-by-side with one another, catering to different needs without distorting the model in favor of any one particular need. Addresses help illustrate the point here. Imagine, for example, that we are developing a retail application. While developing a fulfillment story, we add the ability to dispatch a parcel to a customer’s delivery address, which we find using the following query:

MATCH (user:User {id:{userId}})
MATCH (user)-[:DELIVERY_ADDRESS]->(address:Address)
RETURN address

Later on, when adding some billing functionality, we introduce a BILLING_ADDRESS relationship. Later still, we add the ability for customers to manage all their addresses. This last feature requires us to find all addresses—whether delivery, billing, or some other address. To facilitate this, we introduce a general ADDRESS relationship:

MATCH (user:User {id:{userId}})
MATCH (user)-[:ADDRESS]->(address:Address)
RETURN address

By this time, our data model looks something like the one shown in Figure 4-8. DELIVERY_ADDRESS specializes the data on behalf of the application’s fulfillment needs; BILLING_ADDRESS specializes the data on behalf of the application’s billing needs; and ADDRESS specializes the data on behalf of the application’s customer management needs.

Figure 4-8. Different relationships for different application needs

Just because we can add new relationships to meet new application goals doesn’t mean we always have to do this. We’ll invariably identify opportunities for refactoring the model as we go. There’ll be plenty of times, for example, when an existing relationship will suffice for a new query, or when renaming an existing relationship will allow it to be used for two different needs. When these opportunities arise, we should take them. If we’re developing our solution in a test-driven manner—described in more detail later in this chapter—we’ll have a sound suite of regression tests in place. These tests give us the confidence to make substantial changes to the model.
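
Cypher has no direct rename operation for relationships, but a refactoring of this kind can be sketched as creating a relationship with the new name alongside each old one (copying across any relationship properties as needed) and then deleting the original; the old DISPATCH_ADDRESS name here is hypothetical:

MATCH (user:User)-[old:DISPATCH_ADDRESS]->(address:Address)
CREATE (user)-[:DELIVERY_ADDRESS]->(address)
DELETE old

Our regression tests then confirm that every query that depended on the old name has been updated.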

Application Architecture

In planning a graph database-based solution, there are several architectural decisions to be made. These decisions will vary slightly depending on the database product we’ve chosen. In this section, we’ll describe some of the architectural choices, and the corresponding application architectures, available to us when using Neo4j.

Embedded versus Server

Most databases today run as a server that is accessed through a client library. Neo4j is somewhat unusual in that it can be run in embedded as well as server mode—in fact, its origins, going back nearly ten years, are as an embedded graph database.

Note

An embedded database is not the same as an in-memory database. An embedded instance of Neo4j still makes all data durable on disk. Later, in “Testing”, we’ll discuss ImpermanentGraphDatabase, which is an in-memory version of Neo4j designed for testing purposes.

Embedded Neo4j

In embedded mode, Neo4j runs in the same process as our application. Embedded Neo4j is ideal for hardware devices, desktop applications, and for incorporating in our own application servers. Some of the advantages of embedded mode include:

Low latency

Because our application speaks directly to the database, there’s no network overhead.

Choice of APIs

We have access to the full range of APIs for creating and querying data: the Core API, Traversal Framework, and the Cypher query language.

Explicit transactions

Using the Core API, we can control the transactional life cycle, executing an arbitrarily complex sequence of commands against the database in the context of a single transaction. The Java APIs also expose the transaction life cycle, enabling us to plug in custom transaction event handlers that execute additional logic with each transaction.

When running in embedded mode, however, we should bear in mind the following:

JVM (Java virtual machine) only

Neo4j is a JVM-based database. Many of its APIs are, therefore, accessible only from a JVM-based language.

GC behaviors

When running in embedded mode, Neo4j is subject to the garbage collection (GC) behaviors of the host application. Long GC pauses can affect query times. Further, when running an embedded instance as part of an HA (high-availability) cluster, long GC pauses can cause the cluster protocol to trigger a master reelection.

Database life cycle

The application is responsible for controlling the database life cycle, which includes starting and closing it safely.

Embedded Neo4j can be clustered for high availability and horizontal read scaling just as the server version can. In fact, we can run a mixed cluster of embedded and server instances (clustering is performed at the database level, rather than the server level). This is common in enterprise integration scenarios, where regular updates from other systems are executed against an embedded instance, and then replicated out to server instances.

Server mode

Running Neo4j in server mode is the most common means of deploying the database today. At the heart of each server is an embedded instance of Neo4j. Some of the benefits of server mode include:

REST API

The server exposes a rich REST API that allows clients to send JSON-formatted requests over HTTP. Responses comprise JSON-formatted documents enriched with hypermedia links that advertise additional features of the dataset. The REST API is extensible by end users and supports the execution of Cypher queries.

Platform independence

Because access is by way of JSON-formatted documents sent over HTTP, a Neo4j server can be accessed by a client running on practically any platform. All that’s needed is an HTTP client library.4

Scaling independence

With Neo4j running in server mode, we can scale our database cluster independently of our application server cluster.

Isolation from application GC behaviors

In server mode, Neo4j is protected from any untoward GC behaviors triggered by the rest of the application. Of course, Neo4j still produces some garbage, but its impact on the garbage collector has been carefully monitored and tuned during development to mitigate any significant side effects. However, because server extensions enable us to run arbitrary Java code inside the server (see “Server extensions”), the use of server extensions may impact the server’s GC behavior.

When using Neo4j in server mode, we should bear in mind the following:

Network overhead

There is some communication overhead to each HTTP request, though it’s fairly minimal. After the first client request, the TCP connection remains open until closed by the client.

Transaction state

Neo4j server has a transactional Cypher endpoint. This allows the client to execute a series of Cypher statements in the context of a single transaction. With each request, the client extends its lease on the transaction. If the client fails to complete or roll back the transaction for any reason, this transactional state will remain on the server until it times out (by default, the server will reclaim orphaned transactions after 60 seconds). For more complex, multistep operations requiring a single transactional context, we should consider using a server extension (see “Server extensions”).

Access to Neo4j server is typically by way of its REST API, as discussed previously. The REST API exchanges JSON-formatted documents over HTTP. Using the REST API we can submit Cypher queries, configure named indexes, and execute several of the built-in graph algorithms. We can also submit JSON-formatted traversal descriptions, and perform batch operations. For the majority of use cases the REST API is sufficient; however, if we need to do something we cannot currently accomplish using the REST API, we should consider developing a server extension.

Server extensions

Server extensions enable us to run Java code inside the server. Using server extensions, we can extend the REST API, or replace it entirely.

Extensions take the form of JAX-RS annotated classes. JAX-RS is a Java API for building RESTful resources. Using JAX-RS annotations, we decorate each extension class to indicate to the server which HTTP requests it handles. Additional annotations control request and response formats, HTTP headers, and the formatting of URI templates.

Here’s an implementation of a simple server extension that allows a client to request the distance between two members of a social network:

@Path("/distance")
public class SocialNetworkExtension
{
    private final GraphDatabaseService db;

    public SocialNetworkExtension( @Context GraphDatabaseService db )
    {
        this.db = db;
    }

    @GET
    @Produces("text/plain")
    @Path("/{name1}/{name2}")
    public String getDistance  ( @PathParam("name1") String name1,
                                 @PathParam("name2") String name2 )
    {
        String query = "MATCH (first:User {name:{name1}}),\n" +
                "(second:User {name:{name2}})\n" +
                "MATCH p=shortestPath((first)-[*..4]-(second))\n" +
                "RETURN length(p) AS depth";

        Map<String, Object> params = new HashMap<String, Object>();
        params.put( "name1", name1 );
        params.put( "name2", name2 );

        Result result = db.execute( query, params );

        return String.valueOf( result.columnAs( "depth" ).next() );
    }
}

Of particular interest here are the various annotations:

  • @Path("/distance") specifies that this extension will respond to requests directed to relative URIs beginning /distance.

  • The @Path("/{name1}/{name2}") annotation on getDistance() further qualifies the URI template associated with this extension. The fragment here is concatenated with /distance to produce /distance/{name1}/{name2}, where {name1} and {name2} are placeholders for any characters occurring between the forward slashes. Later on, in “Testing server extensions”, we’ll register this extension under the /socnet relative URI. At that point, these several different parts of the path ensure that HTTP requests directed to a relative URI beginning /socnet/distance/{name1}/{name2} (for example, http://localhost/socnet/distance/Ben/Mike) will be dispatched to an instance of this extension.

  • @GET specifies that getDistance() should be invoked only if the request is an HTTP GET. @Produces indicates that the response entity body will be formatted as text/plain.

  • The two @PathParam annotations prefacing the parameters to getDistance() serve to map the contents of the {name1} and {name2} path placeholders to the method’s name1 and name2 parameters. Given the URI http://localhost/socnet/distance/Ben/Mike, getDistance() will be invoked with Ben for name1 and Mike for name2.

  • The @Context annotation in the constructor causes this extension to be handed a reference to the embedded graph database inside the server. The server infrastructure takes care of creating an extension and injecting it with a graph database instance, but the very presence of the GraphDatabaseService parameter here makes this extension exceedingly testable. As we’ll see later, in “Testing server extensions”, we can unit test extensions without having to run them inside a server.

Server extensions can be powerful elements in our application architecture. Their chief benefits include:

Complex transactions

Extensions enable us to execute an arbitrarily complex sequence of operations in the context of a single transaction.

Choice of APIs

Each extension is injected with a reference to the embedded graph database at the heart of the server. This gives us access to the full range of APIs—Core API, Traversal Framework, graph algorithm package, and Cypher—for developing our extension’s behavior.

Encapsulation

Because each extension is hidden behind a RESTful interface, we can improve and modify its implementation over time.

Response formats

We control the response—both the representation format and the HTTP headers. This enables us to create response messages whose contents employ terminology from our domain, rather than the graph-based terminology of the standard REST API (users, products, and orders, for example, rather than nodes, relationships, and properties). Further, in controlling the HTTP headers attached to the response, we can leverage the HTTP protocol for things such as caching and conditional requests.

When considering using server extensions, we should bear in mind the following points:

JVM only

As with developing against embedded Neo4j, we’ll have to use a JVM-based language.

GC behaviors

We can do arbitrarily complex (and dangerous) things inside a server extension. We need to monitor garbage collection behaviors to ensure that we don’t introduce any untoward side effects.

Clustering

As we discuss in more detail in “Availability”, Neo4j clusters for high availability and horizontal read scaling using master-slave replication. In this section we discuss some of the strategies to consider when using clustered Neo4j.

Replication

Although all writes to a cluster are coordinated through the master, Neo4j does allow writing through slaves; even then, however, the slave being written to syncs with the master before returning to the client. Because of the additional network traffic and coordination protocol, writing through slaves can be an order of magnitude slower than writing directly to the master. The only reasons for writing through slaves are to increase the durability guarantees of each write (the write is made durable on two instances, rather than one) and to ensure that we can read our own writes when employing cache sharding (see “Cache sharding” and “Read your own writes” later in this chapter). Because newer versions of Neo4j enable us to specify that writes to the master be replicated out to one or more slaves, thereby increasing the durability guarantees of writes to the master, the case for writing through slaves is now less compelling. Today it is recommended that all writes be directed to the master and then replicated to slaves using the ha.tx_push_factor and ha.tx_push_strategy configuration settings.
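
As an illustration, the relevant settings in an instance's neo4j.properties file might look something like this (the values are illustrative, and the available push strategies vary by version, so check the documentation for the release in use):

# Push each committed transaction to at least this many slaves
# before the commit returns to the client
ha.tx_push_factor=2

# Strategy used to choose which slaves receive the pushed transactions
ha.tx_push_strategy=fixed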

Buffer writes using queues

In high write load scenarios, we can use queues to buffer writes and regulate load. With this strategy, writes to the cluster are buffered in a queue. A worker then polls the queue and executes batches of writes against the database. Not only does this regulate write traffic, but it reduces contention and enables us to pause write operations without refusing client requests during maintenance periods.

Global clusters

For applications catering to a global audience, it is possible to install a multiregion cluster in multiple data centers and on cloud platforms such as Amazon Web Services (AWS). A multiregion cluster enables us to service reads from the portion of the cluster geographically closest to the client. In these situations, however, the latency introduced by the physical separation of the regions can sometimes disrupt the coordination protocol. It is, therefore, often desirable to restrict master reelection to a single region. To achieve this, we create slave-only databases for the instances we don’t want to participate in master reelection. We do this by including the ha.slave_coordinator_update_mode=none configuration parameter in an instance’s configuration.

Load Balancing

When using a clustered graph database, we should consider load balancing traffic across the cluster to help maximize throughput and reduce latency. Neo4j doesn’t include a native load balancer, relying instead on the load-balancing capabilities of the network infrastructure.

Separate read traffic from write traffic

Given the recommendation to direct the majority of write traffic to the master, we should consider clearly separating read requests from write requests. We should configure our load balancer to direct write traffic to the master, while balancing the read traffic across the entire cluster.

In a web-based application, the HTTP method is often sufficient to distinguish a request with a significant side effect—a write—from one that has no significant side effect on the server: POST, PUT, and DELETE can modify server-side resources, whereas GET is side-effect free.

When using server extensions, it’s important to distinguish read and write operations using @GET and @POST annotations. If our application depends solely on server extensions, this will suffice to separate the two. If we’re using the REST API to submit Cypher queries to the database, however, the situation is not so straightforward. The REST API uses POST as a general “process this” semantic for both read and write Cypher requests. To separate read and write requests in this scenario, we introduce a pair of load balancers: a write load balancer that always directs requests to the master, and a read load balancer that balances requests across the entire cluster. In our application logic, where we know whether the operation is a read or a write, we will then have to decide which of the two addresses we should use for any particular request, as illustrated in Figure 4-9.

When running in server mode, Neo4j exposes a URI that indicates whether that instance is currently the master, and if it isn’t, which of the instances is the master. Load balancers can poll this URI at intervals to determine where to route traffic.
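
For example, with HAProxy this polling is commonly implemented as an HTTP health check against the HA status endpoints exposed by the Neo4j server (the backend names and addresses below are placeholders, and this is a fragment of a configuration rather than a complete one):

# Write load balancer: only the current master passes this health check
backend neo4j-write
    option httpchk GET /db/manage/server/ha/master
    server db1 10.0.0.1:7474 check
    server db2 10.0.0.2:7474 check
    server db3 10.0.0.3:7474 check

# Read load balancer: any available instance (master or slave) passes
backend neo4j-read
    option httpchk GET /db/manage/server/ha/available
    server db1 10.0.0.1:7474 check
    server db2 10.0.0.2:7474 check
    server db3 10.0.0.3:7474 check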

Cache sharding

Queries run fastest when the portions of the graph needed to satisfy them reside in main memory. When a graph holds many billions of nodes, relationships, and properties, not all of it will fit into main memory. Other data technologies often solve this problem by partitioning their data, but with graphs, partitioning or sharding is unusually difficult (see “The Holy Grail of Graph Scalability”). How, then, can we provide for high-performance queries over a very large graph?

One solution is to use a technique called cache sharding (Figure 4-10), which consists of routing each request to a database instance in an HA cluster where the portion of the graph necessary to satisfy that request is likely already in main memory (remember: every instance in the cluster will contain a full copy of the data). If the majority of an application’s queries are graph-local queries, meaning they start from one or more specific points in the graph and traverse the surrounding subgraphs, then a mechanism that consistently routes queries beginning from the same set of start points to the same database instance will increase the likelihood of each query hitting a warm cache.

Figure 4-9. Using read/write load balancers to direct requests to a cluster

The strategy used to implement consistent routing will vary by domain. Sometimes it’s good enough to have sticky sessions; other times we’ll want to route based on the characteristics of the dataset. The simplest strategy is to have the instance that first serves requests for a particular user thereafter serve subsequent requests for that user. Other domain-specific approaches will also work. For example, in a geographical data system we can route requests about particular locations to specific database instances that have been warmed for that location. Both strategies increase the likelihood of the required nodes and relationships already being cached in main memory, where they can be quickly accessed and processed.

Figure 4-10. Cache sharding
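
One way of sketching consistent routing at the infrastructure level, continuing the hypothetical HAProxy configuration from earlier, is to hash on a request parameter so that all requests for a given user land on the same instance (the userId parameter name is an assumption about our application's URIs):

backend neo4j-read
    balance url_param userId
    hash-type consistent
    server db1 10.0.0.1:7474 check
    server db2 10.0.0.2:7474 check
    server db3 10.0.0.3:7474 check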

Read your own writes

Occasionally we may need to read our own writes—typically when the application applies an end-user change, and needs on the next request to reflect the effect of this change back to the user. Whereas writes to the master are immediately consistent, the cluster as a whole is eventually consistent. How can we ensure that a write directed to the master is reflected in the next load-balanced read request? One solution is to use the same consistent routing technique used in cache sharding to direct the write to the slave that will be used to service the subsequent read. This assumes that the write and the read can be consistently routed based on some domain criteria in each request.

This is one of the few occasions where it makes sense to write through a slave. But remember: writing through a slave can be an order of magnitude slower than writing directly to the master. We should use this technique sparingly. If a high proportion of our writes require us to read our own write, this technique will significantly impact throughput and latency.

Testing

Testing is a fundamental part of the application development process—not only as a means of verifying that a query or application feature behaves correctly, but also as a way of designing and documenting our application and its data model. Throughout this section we emphasize that testing is an everyday activity; by developing our graph database solution in a test-driven manner, we provide for the rapid evolution of our system, and its continued responsiveness to new business needs.

Test-Driven Data Model Development

In discussing data modeling, we’ve stressed that our graph model should reflect the kinds of queries we want to run against it. By developing our data model in a test-driven fashion we document our understanding of our domain, and validate that our queries behave correctly.

With test-driven data modeling, we write unit tests based on small, representative example graphs drawn from our domain. These example graphs contain just enough data to communicate a particular feature of the domain. In many cases, they might only comprise 10 or so nodes, plus the relationships that connect them. We use these examples to describe what is normal for the domain, and also what is exceptional. As we discover anomalies and corner cases in our real data, we write a test that reproduces what we’ve discovered.

The example graphs we create for each test comprise the setup or context for that test. Within this context we exercise a query, and assert that the query behaves as expected. Because we control the contents of the test data, we, as the author of the test, know what results to expect.

Tests can act like documentation. By reading the tests, developers gain an understanding of the problems and needs the application is intended to address, and the ways in which the authors have gone about addressing them. With this in mind, it’s best to use each test to test just one aspect of our domain. It’s far easier to read lots of small tests, each of which communicates a discrete feature of our data in a clear, simple, and concise fashion, than it is to reverse engineer a complex domain from a single large and unwieldy test. In many cases, we’ll find a particular query being exercised by several tests, some of which demonstrate the happy path through our domain, others of which exercise it in the context of some exceptional structure or set of values.5

Over time, we’ll build up a suite of tests that can act as a powerful regression test mechanism. As our application evolves, and we add new sources of data, or change the model to meet new needs, our regression test suite will continue to assert that existing features still behave as they should. Evolutionary architectures, and the incremental and iterative software development techniques that support them, depend upon a bedrock of asserted behavior. The unit-testing approach to data model development described here enables developers to respond to new business needs with very little risk of undermining or breaking what has come before, confident in the continued quality of the solution.

Example: A test-driven social network data model

In this example we’re going to demonstrate developing a very simple Cypher query for a social network. Given the names of a couple of members of the network, our query determines the distance between them.

First, we create a small graph that is representative of our domain. Using Cypher, we create a network comprising 10 nodes and 8 relationships:

public GraphDatabaseService createDatabase()
{
    // Create nodes
    String createGraph = "CREATE\n" +
        "(ben:User {name:'Ben'}),\n" +
        "(arnold:User {name:'Arnold'}),\n" +
        "(charlie:User {name:'Charlie'}),\n" +
        "(gordon:User {name:'Gordon'}),\n" +
        "(lucy:User {name:'Lucy'}),\n" +
        "(emily:User {name:'Emily'}),\n" +
        "(sarah:User {name:'Sarah'}),\n" +
        "(kate:User {name:'Kate'}),\n" +
        "(mike:User {name:'Mike'}),\n" +
        "(paula:User {name:'Paula'}),\n" +
        "(ben)-[:FRIEND]->(charlie),\n" +
        "(charlie)-[:FRIEND]->(lucy),\n" +
        "(lucy)-[:FRIEND]->(sarah),\n" +
        "(sarah)-[:FRIEND]->(mike),\n" +
        "(arnold)-[:FRIEND]->(gordon),\n" +
        "(gordon)-[:FRIEND]->(emily),\n" +
        "(emily)-[:FRIEND]->(kate),\n" +
        "(kate)-[:FRIEND]->(paula)";

    String createIndex = "CREATE INDEX ON :User(name)";

    GraphDatabaseService db =
        new TestGraphDatabaseFactory().newImpermanentDatabase();

    db.execute( createGraph );
    db.execute( createIndex );

    return db;
}

There are two things of interest in createDatabase(). The first is the use of ImpermanentGraphDatabase, which is a lightweight, in-memory version of Neo4j, designed specifically for unit testing. By using ImpermanentGraphDatabase, we avoid having to clear up store files on disk after each test. The class can be found in the Neo4j kernel test jar, which can be obtained with the following dependency reference:

<dependency>
    <groupId>org.neo4j</groupId>
    <artifactId>neo4j-kernel</artifactId>
    <version>${project.version}</version>
    <type>test-jar</type>
    <scope>test</scope>
</dependency>
Warning

ImpermanentGraphDatabase is intended for use in unit tests only. It is an in-memory-only version of Neo4j, not intended for production use.

The second thing of interest in createDatabase() is the Cypher command to index nodes with a given label on a given property. In this case we’re saying that we want to index nodes with a :User label based on the value of their name property.

Having created a sample graph, we can now write our first test. Here’s the test fixture for testing our social network data model and its queries:

public class SocialNetworkTest
{
    private static GraphDatabaseService db;
    private static SocialNetworkQueries queries;

    @BeforeClass
    public static void init()
    {
        db = createDatabase();
        queries = new SocialNetworkQueries( db );
    }

    @AfterClass
    public static void shutdown()
    {
        db.shutdown();
    }

    @Test
    public void shouldReturnShortestPathBetweenTwoFriends() throws Exception
    {
        // when
        Result result = queries.distance( "Ben", "Mike" );

        // then
        assertTrue( result.hasNext() );
        assertEquals( 4L, result.next().get( "distance" ) );
    }

    // more tests
}

This test fixture includes an initialization method, annotated with @BeforeClass, which executes before any tests start. Here we call createDatabase() to create an instance of the sample graph, and an instance of SocialNetworkQueries, which houses the queries under development.

Our first test, shouldReturnShortestPathBetweenTwoFriends(), tests that the query under development can find a path between any two members of the network—in this case, Ben and Mike. Given the contents of the sample graph, we know that Ben and Mike are connected, but only remotely, at a distance of 4. The test, therefore, asserts that the query returns a nonempty result containing a distance value of 4.

Having written the test, we now start developing our first query. Here’s the implementation of SocialNetworkQueries:

public class SocialNetworkQueries
{
    private final GraphDatabaseService db;

    public SocialNetworkQueries( GraphDatabaseService db )
    {
        this.db = db;
    }

    public Result distance( String firstUser, String secondUser )
    {
        String query = "MATCH (first:User {name:{firstUser}}),\n" +
            "(second:User {name:{secondUser}})\n" +
            "MATCH p=shortestPath((first)-[*..4]-(second))\n" +
            "RETURN length(p) AS distance";

        Map<String, Object> params = new HashMap<String, Object>();
        params.put( "firstUser", firstUser );
        params.put( "secondUser",  secondUser );

        return db.execute( query, params );
    }

    // More queries
}

In the constructor for SocialNetworkQueries we store the supplied database instance in a member variable, which allows it to be reused again and again throughout the lifetime of the queries instance. The query itself we implement in the distance() method. Here we create a Cypher statement, initialize a map containing the query parameters, and execute the statement.

If shouldReturnShortestPathBetweenTwoFriends() passes (it does), we then go on to test additional scenarios. What happens, for example, if two members of the network are separated by more than four connections? We write up the scenario and what we expect the query to do in another test:

@Test
public void shouldReturnNoResultsWhenNoPathAtDistance4OrLess()
    throws Exception
{
    // when
    Result result = queries.distance( "Ben", "Arnold" );

    // then
    assertFalse( result.hasNext() );
}

In this instance, this second test passes without us having to modify the underlying Cypher query. In many cases, however, a new test will force us to modify a query’s implementation. When that happens, we modify the query to make the new test pass, and then run all the tests in the fixture. A failing test anywhere in the fixture indicates we’ve broken some existing functionality. We continue to modify the query until all tests are green once again.

Testing server extensions

Server extensions can be developed in a test-driven manner just as easily as embedded Neo4j. Using the simple server extension described earlier, here’s how we test it:

@Test
public void extensionShouldReturnDistance() throws Exception
{
    // given
    SocialNetworkExtension extension = new SocialNetworkExtension( db );

    // when
    String distance = extension.getDistance( "Ben", "Mike" );

    // then
    assertEquals( "4", distance );
}

Because the extension’s constructor accepts a GraphDatabaseService instance, we can inject a test instance (an ImpermanentGraphDatabase instance), and then call its methods as per any other object.

If, however, we wanted to test the extension running inside a server, we have a little more setup to do:

public class SocialNetworkExtensionTest
{
    private static ServerControls server;

    @BeforeClass
    public static void init() throws IOException
    {
        // Create nodes
        String createGraph = "CREATE\n" +
            "(ben:User {name:'Ben'}),\n" +
            "(arnold:User {name:'Arnold'}),\n" +
            "(charlie:User {name:'Charlie'}),\n" +
            "(gordon:User {name:'Gordon'}),\n" +
            "(lucy:User {name:'Lucy'}),\n" +
            "(emily:User {name:'Emily'}),\n" +
            "(sarah:User {name:'Sarah'}),\n" +
            "(kate:User {name:'Kate'}),\n" +
            "(mike:User {name:'Mike'}),\n" +
            "(paula:User {name:'Paula'}),\n" +
            "(ben)-[:FRIEND]->(charlie),\n" +
            "(charlie)-[:FRIEND]->(lucy),\n" +
            "(lucy)-[:FRIEND]->(sarah),\n" +
            "(sarah)-[:FRIEND]->(mike),\n" +
            "(arnold)-[:FRIEND]->(gordon),\n" +
            "(gordon)-[:FRIEND]->(emily),\n" +
            "(emily)-[:FRIEND]->(kate),\n" +
            "(kate)-[:FRIEND]->(paula)";

        server = TestServerBuilders
            .newInProcessBuilder()
            .withExtension(
                "/socnet",
                SocialNetworkExtension.class )
            .withFixture( createGraph )
            .newServer();
    }

    @AfterClass
    public static void teardown()
    {
        server.close();
    }

    @Test
    public void serverShouldReturnDistance() throws Exception
    {
        HTTP.Response response = HTTP.GET( server.httpURI()
            .resolve( "/socnet/distance/Ben/Mike" ).toString() );

        assertEquals( 200, response.status() );
        assertEquals( "text/plain", response.header( "Content-Type" ));
        assertEquals( "4", response.rawContent( ) );
    }
}

Here we’re using an instance of ServerControls to host the extension. We create the server and populate its database in the test fixture’s init() method using the builder supplied by TestServerBuilders. This builder enables us to register the extension, and associate it with a relative URI space (in this case, everything below /socnet). Once init() has completed, we have a database server instance up and running.

In the test itself, serverShouldReturnDistance(), we access this server using an HTTP client from the Neo4j test library. The client issues an HTTP GET request for the resource at /socnet/distance/Ben/Mike. (At the server end, this request is dispatched to an instance of SocialNetworkExtension.) When the client receives a response, the test asserts that the HTTP status code, content-type, and content of the response body are correct.

Performance Testing

The test-driven approach that we’ve described so far communicates context and domain understanding, and tests for correctness. It does not, however, test for performance. What works fast against a small, 20-node sample graph may not work so well when confronted with a much larger graph. Therefore, to accompany our unit tests, we should consider writing a suite of query performance tests. On top of that, we should also invest in some thorough application performance testing early in our application’s development life cycle.

Query performance tests

Query performance tests are not the same as full-blown application performance tests. All we’re interested in at this stage is whether a particular query performs well when run against a graph that is roughly as big as the kind of graph we expect to encounter in production. Ideally, these tests are developed side-by-side with our unit tests. There’s nothing worse than investing a lot of time in perfecting a query, only to discover it is not fit for production-sized data.

When creating query performance tests, bear in mind the following guidelines:

  • Create a suite of performance tests that exercise the queries developed through our unit testing. Record the performance figures so that we can see the relative effects of tweaking a query, modifying the heap size, or upgrading from one version of a graph database to another.

  • Run these tests often, so that we quickly become aware of any deterioration in performance. We might consider incorporating these tests into a continuous delivery build pipeline, failing the build if the test results exceed a certain value.

  • Run these tests in-process on a single thread. There’s no need to simulate multiple clients at this stage: if the performance is poor for a single client, it’s unlikely to improve for multiple clients. Even though they are not, strictly speaking, unit tests, we can drive them using the same unit testing framework we use to develop our unit tests.

  • Run each query many times, picking starting nodes at random each time, so that we can see the effect of starting from a cold cache, which is then gradually warmed as multiple queries execute.
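
Alongside wall-clock timings, it can be useful to record how much work each query does. Prefixing a statement with Cypher's PROFILE keyword, for example, reports the operators used and the number of database hits, which makes it easier to explain a regression when one shows up. Applied to the social network distance query developed earlier in this chapter, it might look like this:

PROFILE MATCH (first:User {name:{firstUser}}),
              (second:User {name:{secondUser}})
MATCH p=shortestPath((first)-[*..4]-(second))
RETURN length(p) AS distance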

Application performance tests

Application performance tests, as distinct from query performance tests, test the performance of the entire application under representative production usage scenarios.

As with query performance tests, we recommend that this kind of performance testing be done as part of everyday development, side-by-side with the development of application features, rather than as a distinct project phase.6 To facilitate application performance testing early in the project life cycle, it is often necessary to develop a “walking skeleton,” an end-to-end slice through the entire system, which can be accessed and exercised by performance test clients. By developing a walking skeleton, we not only provide for performance testing, but we also establish the architectural context for the graph database part of our solution. This enables us to verify our application architecture, and identify layers and abstractions that allow for discrete testing of individual components.

Performance tests serve two purposes: they demonstrate how the system will perform when used in production, and they drive out the operational affordances that make it easier to diagnose performance issues, incorrect behavior, and bugs. What we learn in creating and maintaining a performance test environment will prove invaluable when it comes to deploying and operating the system for real.

When drawing up the criteria for a performance test, we recommend specifying percentiles rather than averages. Never assume a normal distribution of response times: the real world doesn’t work like that. For some applications we may want to ensure that all requests return within a certain time period. In rare circumstances it may be important for the very first request to be as quick as when the caches have been warmed. But in the majority of cases, we will want to ensure that the majority of requests return within a certain time period; that, say, 98% of requests are satisfied in under 200 ms. It is important to keep a record of subsequent test runs so that we can compare performance figures over time, and thereby quickly identify slowdowns and anomalous behavior.

As with unit tests and query performance tests, application performance tests prove most valuable when employed in an automated delivery pipeline, where successive builds of the application are automatically deployed to a testing environment, the tests executed, and the results automatically analyzed. Log files and test results should be stored for later retrieval, analysis, and comparison. Regressions and failures should fail the build, prompting developers to address the issues in a timely manner. One of the big advantages of conducting performance testing over the course of an application’s development life cycle, rather than at the end, is that failures and regressions can very often be tied back to a recent piece of development. This enables us to diagnose, pinpoint, and remedy issues rapidly and succinctly.

For generating load, we’ll need a load-generating agent. For web applications, there are several open source stress and load testing tools available, including Grinder, JMeter, and Gatling.7 When testing load-balanced web applications, we should ensure that our test clients are distributed across different IP addresses so that requests are balanced across the cluster.

Testing with representative data

For both query performance testing and application performance testing we will need a dataset that is representative of the data we will encounter in production. It will be necessary, therefore, to either create or source such a dataset. In some cases we can obtain a dataset from a third party, or adapt an existing dataset that we own; either way, unless the data is already in the form of a graph, we will have to write some custom export-import code.

In many cases, however, we’re starting from scratch. If this is the case, we must dedicate some time to creating a dataset builder. As with the rest of the software development life cycle, this is best done in an iterative and incremental fashion. Whenever we introduce a new element into our domain’s data model, as documented and tested in our unit tests, we add the corresponding element to our performance dataset builder. That way, our performance tests will come as close to real-world usage as our current understanding of the domain allows.

When creating a representative dataset, we try to reproduce any domain invariants we have identified: the minimum, maximum, and average number of relationships per node, the spread of different relationship types, property value ranges, and so on. Of course, it’s not always possible to know these things upfront, and often we’ll find ourselves working with rough estimates until such point as production data is available to verify our assumptions.

Although ideally we would always test with a production-sized dataset, it is often not possible or desirable to reproduce extremely large volumes of data in a test environment. In such cases, we should at least ensure that we build a representative dataset whose size exceeds our capacity to hold the entire graph in main memory. That way, we’ll be able to observe the effect of cache evictions, and query for portions of the graph not currently held in main memory.

Representative datasets also help with capacity planning. Whether we create a full-sized dataset, or a scaled-down sample of what we expect the production graph to be, our representative dataset will give us some useful figures for estimating the size of the production data on disk. These figures then help us plan how much memory to allocate to the page caches and the Java virtual machine (JVM) heap (see “Capacity Planning” for more details).
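
As a rough, hedged illustration of the kind of arithmetic involved (the record sizes below are approximate figures for the fixed-size store records used by recent versions of Neo4j, and should be checked against the version in use), a scaled-up estimate might look like this:

   10 million nodes         x 15 bytes  =  150 MB
   50 million relationships x 34 bytes  =  1.7 GB
  200 million properties    x 41 bytes  =  8.2 GB

  Estimated store size on disk          ~  10 GB

A page cache sized to hold the frequently accessed portion of these stores, plus headroom for the JVM heap, then gives a first approximation of the memory required per instance.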

In the following example, we’re using a dataset builder called Neode to build a sample social network:

private void createSampleDataset( GraphDatabaseService db )
{
    DatasetManager dsm = new DatasetManager( db, SysOutLog.INSTANCE );

    // User node specification
    NodeSpecification userSpec =
        dsm.nodeSpecification( "User",
            indexableProperty( db, "User", "name" ) );

    // FRIEND relationship specification
    RelationshipSpecification friend =
        dsm.relationshipSpecification( "FRIEND" );

    Dataset dataset =
        dsm.newDataset( "Social network example" );

    // Create user nodes
    NodeCollection users =
        userSpec.create( 1_000_000 ).update( dataset );


    // Relate users to each other
    users.createRelationshipsTo(
        getExisting( users )
            .numberOfTargetNodes( minMax( 50, 100 ) )
            .relationship( friend )
            .relationshipConstraints( RelationshipUniqueness.BOTH_DIRECTIONS ) )
        .updateNoReturn( dataset );

    dataset.end();
}

Neode uses node and relationship specifications to describe the nodes and relationships in the graph, together with their properties and permitted property values. Neode then provides a fluent interface for creating and relating nodes.

Capacity Planning

At some point in our application’s development life cycle we’ll want to start planning for production deployment. In many cases, an organization’s project management gating processes mean a project cannot get underway without some understanding of the production needs of the application. Capacity planning is essential both for budgeting purposes and for ensuring there is sufficient lead time for procuring hardware and reserving production resources.

In this section we describe some of the techniques we can use for hardware sizing and capacity planning. Our ability to estimate our production needs depends on a number of factors. The more data we have regarding representative graph sizes, query performance, and the number of expected users and their behaviors, the better our ability to estimate our hardware needs. We can gain much of this information by applying the techniques described in “Testing” early in our application development life cycle. In addition, we should understand the cost/performance trade-offs available to us in the context of our business needs.

Optimization Criteria

As we plan our production environment we will be faced with a number of optimization choices. Which we favor will depend upon our business needs:

Cost

We can optimize for cost by installing the minimum hardware necessary to get the job done.

Performance

We can optimize for performance by procuring the fastest solution (subject to budgetary constraints).

Redundancy

We can optimize for redundancy and availability by ensuring the database cluster is big enough to survive a certain number of machine failures (i.e., to survive two machines failing, we will need a cluster comprising five instances).

Load

With a replicated graph database solution, we can optimize for load by scaling horizontally (for read load) and vertically (for write load).

Performance

Redundancy and load can be costed in terms of the number of machines necessary to ensure availability (five machines to provide continued availability in the face of two machines failing, for example) and scalability (one machine per some number of concurrent requests, as per the calculations in “Load”). But what about performance? How can we cost performance?

Calculating the cost of graph database performance

In order to understand the cost implications of optimizing for performance, we need to understand the performance characteristics of the database stack. As we describe in more detail later in “Native Graph Storage”, a graph database uses disk for durable storage, and main memory for caching portions of the graph.

Spinning disks are cheap, but not very fast for random seeks (around 6ms for a modern disk). Queries that have to reach all the way down to spinning disk will be orders of magnitude slower than queries that touch only an in-memory portion of the graph. Disk access can be improved by using solid-state drives (SSDs) in place of spinning disks, providing an approximate 20-fold increase in performance, or by using enterprise flash hardware, which can reduce latencies even further.

Note

For those deployments where the size of the data in the graph vastly eclipses the amount of RAM (and therefore cache) available, SSDs are an excellent choice, because they don’t have the mechanical penalties associated with spinning disks.

Performance optimization options

There are, then, three areas in which we can optimize for performance:

  • Increase the JVM heap size.

  • Increase the percentage of the store mapped into the page caches.

  • Invest in faster disks: SSDs or enterprise flash hardware.

As Figure 4-11 shows, the sweet spot for any cost versus performance trade-off lies around the point where we can map our store files in their entirety into the page cache, while allowing for a healthy, but modestly sized heap. Heaps of between 4 and 8 GB are not uncommon, though in many cases, a smaller heap can actually improve performance (by mitigating expensive GC behaviors).

Calculating how much RAM to allocate to the heap and the page cache depends on our knowing the projected size of our graph. Building a representative dataset early in our application’s development life cycle will furnish us with some of the data we need to make our calculations. If we cannot fit the entire graph into main memory, we should consider cache sharding (see “Cache sharding”).
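
As a purely illustrative example: if the representative dataset suggests store files of around 50 GB, a machine with 64 GB of RAM comfortably accommodates the whole store in the page cache alongside a 6 GB heap, with headroom left for the operating system. If the projected store is 500 GB, we will instead need to plan for cache misses, or for cache sharding.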

Note

For more general performance and tuning tips, see the Neo4j performance tuning documentation.

grdb 0411
Figure 4-11. Cost versus performance trade-offs

In optimizing a graph database solution for performance, we should bear in mind the following guidelines:

  • We should make the fullest possible use of the page cache; ideally, we should map our store files in their entirety into this cache.

  • We should tune the JVM heap while monitoring garbage collection to ensure smooth behavior.

  • We should consider using fast disks—SSDs or enterprise flash hardware—to boost baseline performance when disk access becomes inevitable.

Redundancy

Planning for redundancy requires us to determine how many instances in a cluster we can afford to lose while keeping the application up and running. For non–business-critical applications, this figure might be as low as one (or even zero). Once a first instance has failed, another failure will render the application unavailable. Business-critical applications will likely require redundancy of at least two; that is, even after two machines have failed, the application continues serving requests.

For a graph database whose cluster management protocol requires a majority of members to be available to work properly, redundancy of one can be achieved with three or four instances, and redundancy of two with five instances. Four is no better than three in this respect, because if two instances from a four-instance cluster become unavailable, the remaining instances will no longer be able to form a majority.
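
As a rule of thumb for majority-based protocols, tolerating f failed instances requires a cluster of 2f + 1 instances: three instances tolerate one failure, five tolerate two, and seven tolerate three.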

Load

Optimizing for load is perhaps the trickiest part of capacity planning. As a rule of thumb:

number of concurrent requests = (1000 / average request time (in milliseconds)) * number of cores per machine * number of machines
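
For example, with purely illustrative figures of an average request time of 20 milliseconds, 8 cores per machine, and 3 machines, the formula gives (1000 / 20) * 8 * 3 = 1,200 concurrent requests.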

Actually determining what some of these figures are, or are projected to be, can sometimes be very difficult:

Average request time

This covers the period from when a server receives a request, to when it sends a response. Performance tests can help determine average request time, assuming the tests are running on representative hardware against a representative dataset (we’ll have to hedge accordingly if not). In many cases, the “representative dataset” itself is based on a rough estimate; we should modify our figures whenever this estimate changes.

Number of concurrent requests

We should distinguish here between average load and peak load. Determining the number of concurrent requests a new application must support is a difficult thing to do. If we’re replacing or upgrading an existing application, we may have access to some recent production statistics we can use to refine our estimates. Some organizations are able to extrapolate from existing application data the likely requirements for a new application. Other than that, it’s up to our stakeholders to estimate the projected load on the system, but we must beware of inflated expectations.

Importing and Bulk Loading Data

Many, if not most, deployments of any kind of database don't start out with an empty store. As part of deploying the new database, we may need to migrate data from a legacy platform, import master data from a third-party system, or simply load test data—such as the data in the examples in this chapter—into an otherwise empty store. As time goes on, we may also have to perform bulk load operations from upstream systems against a live store.

Neo4j provides tooling to achieve these goals, both for the initial bulk load and ongoing bulk import scenarios, allowing us to stream data from a variety of other sources into the graph.

Initial Import

For initial imports Neo4j has an initial load tool called neo4j-import, which achieves sustained ingest speeds of around 1,000,000 records per second.8 It achieves these impressive performance figures because it does not build the store files using the normal transactional capabilities of the database. Instead, it builds the store files in a raster-like fashion, adding individual layers until the store is complete, and it is only at completion that the store becomes consistent.

The input to the neo4j-import tool is a set of CSV files that provide node and relationship data. As an example, consider the following three CSV files, which represent a small movie data set.

The first file is movies.csv:

:ID,title,year:int,:LABEL
1,"The Matrix",1999,Movie
2,"The Matrix Reloaded",2003,Movie;Sequel
3,"The Matrix Revolutions",2003,Movie;Sequel

This first file represents the movies themselves. The first line of the file is a header describing the fields in each record. In this case, we can see that each movie has an ID, a title, and a year (which is an integer). The ID field acts as a key; other parts of the import can refer to a movie using its ID. Each movie also has one or more labels: Movie and Sequel.

The second file, actors.csv, contains movie actors. As we can see, actors have an ID and name property, and an Actor label:

:ID,name,:LABEL
keanu,"Keanu Reeves",Actor
laurence,"Laurence Fishburne",Actor
carrieanne,"Carrie-Anne Moss",Actor

The third file, roles.csv, specifies the roles that actors played in the movies. This file is used to create the relationships in the graph:

:START_ID,role,:END_ID,:TYPE
keanu,"Neo",1,ACTS_IN
keanu,"Neo",2,ACTS_IN
keanu,"Neo",3,ACTS_IN
laurence,"Morpheus",1,ACTS_IN
laurence,"Morpheus",2,ACTS_IN
laurence,"Morpheus",3,ACTS_IN
carrieanne,"Trinity",1,ACTS_IN
carrieanne,"Trinity",2,ACTS_IN
carrieanne,"Trinity",3,ACTS_IN

Each line in this file contains a START_ID, a role value, an END_ID, and a relationship TYPE. The START_ID values are actor IDs from the actors CSV file; the END_ID values are movie IDs from the movies CSV file. Each relationship is created from its start node to its end node, with a role property and a relationship type taken from the TYPE column.

With these files, we can run the import tool from the command line:

neo4j-import --into target_directory \
--nodes movies.csv --nodes actors.csv --relationships roles.csv

neo4j-import builds the database store files, and puts them in the target_directory.

Batch Import

Another common requirement is to push bulk data from external systems into a live graph. In Neo4j this is commonly performed using Cypher’s LOAD CSV command. LOAD CSV takes as input the same kind of CSV data we used with the neo4j-import tool. It is designed to support intermediate loads of around a million or so items, making it ideal for handling regular batch updates from upstream systems.

As an example, let’s enrich our existing movie graph with some data about set locations. locations.csv contains title and locations fields, where locations is a semicolon-separated list of the filming locations for the movie:

title,locations
"The Matrix",Sydney
"The Matrix Reloaded",Sydney;Oakland
"The Matrix Revolutions",Sydney;Oakland;Alameda

Given this data, we can load it into a live Neo4j database using the Cypher LOAD CSV command as follows:

LOAD CSV WITH HEADERS FROM 'file:///data/locations.csv' AS line
WITH split(line.locations,";") as locations, line.title as title
UNWIND locations AS location
MERGE (x:Location {name:location})
MERGE (m:Movie {title:title})
MERGE (m)-[:FILMED_IN]->(x)

The first line of this Cypher script tells the database that we want to load some CSV data from a file URI (LOAD CSV also works with HTTP URIs). WITH HEADERS tells the database that the first line of our CSV file contains named headers. AS line assigns each row of CSV data to a variable called line. The rest of the script is then executed once for each row in the source file.

The second line of the script, beginning with WITH, splits an individual line’s locations value into a collection of strings using Cypher’s split function. It then passes the resulting collection and the line’s title value on to the rest of the script.

UNWIND is where the interesting work begins. UNWIND expands a collection. Here, we use it to expand the locations collection into individual location rows (remember, we’re dealing at this point with a single movie’s locations), each of which will be processed by the MERGE statements that follow.

The first MERGE statement ensures that the location is represented by a node in the database. The second MERGE statement ensures that the movie is also present as a node. The third MERGE statement ensures that a FILMED_IN relationship exists between the location and movie nodes.

Note

MERGE is like a mixture of MATCH and CREATE. If the pattern described in the MERGE statement already exists in the graph, the statement’s identifiers will be bound to this existing data, much as if we’d specified MATCH. If the pattern does not currently exist in the graph, MERGE will create it, much as if we’d used CREATE.

For MERGE to match existing data, all the elements in the pattern must already exist in the graph. If it can’t match all parts of a pattern, MERGE will create a new instance of the entire pattern. This is why we have used three MERGE statements in our LOAD CSV script. Given a particular movie and a particular location, it’s quite possible that one or another of them is already present in the graph. It’s also possible for both of them to exist, but without a relationship connecting them. If we were to use a single, large MERGE statement instead of our three small statements:

MERGE (:Movie {title:title})-[:FILMED_IN]->
      (:Location {name:location})

the match would only succeed if the movie and location nodes and the relationship between them already exist. If any one part of this pattern does not exist, all parts will be created, leading to duplicate data.

Our strategy is to break apart the larger pattern into smaller chunks. We first ensure that the location is present. We next ensure that the movie is present. Finally, we ensure that the two nodes are connected. This incremental approach is quite normal when using MERGE.

At this point we are able to insert bulk CSV data into a live graph. However, we have not yet considered the mechanical implications of our import. If we were to run large queries like this against an existing large dataset, the insert would likely take a very long time. There are two key characteristics of the import we need to consider in order to make it efficient:

  • Indexing of the existing graph

  • Transaction flow through the database

For those of us coming from a relational background, the need for indexing is (perhaps) obvious here. Without indexes, we have to search all movie nodes in the database (and, in the worst case, all of the nodes) in order to determine whether a movie exists or not. This is an O(n) operation. With an index on movies, that cost drops to O(log n), which is a substantial improvement, especially for larger datasets. The same is true of locations.

Declaring an index, as we saw in the previous chapter, is straightforward. To index movies, we simply issue the command CREATE INDEX ON :Movie(title). We can do this via the browser or using the shell. If the index is useful only during the import (i.e., it plays no role in operational queries), we can drop it afterward with DROP INDEX ON :Movie(title).
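
For our movie import, a minimal sketch of this index bookkeeping might look as follows, with each statement issued separately (the :Location index is included on the same grounds, since our MERGE statements also look up locations by name):

// Before the import: indexes to support the MERGE lookups
CREATE INDEX ON :Movie(title)
CREATE INDEX ON :Location(name)

// After the import: drop any index that serves no operational queries
DROP INDEX ON :Movie(title)
DROP INDEX ON :Location(name)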

Note

In some cases it is useful to add temporary IDs as properties to nodes so they can be easily referenced during import, especially when creating networks of relationships. These IDs have no domain significance. They exist simply for the duration of a multistep import process so the process can find specific nodes to be connected.

The use of temporary IDs is perfectly valid. Just remember to remove them using REMOVE once the import is complete.
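
For example, assuming a hypothetical temporary property named importId, the cleanup at the end of the import might look like this:

// Strip the temporary importId property once the import has finished
// (this scans every node, so on a large graph constrain the MATCH by label)
MATCH (n)
WHERE n.importId IS NOT NULL
REMOVE n.importId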

Given that updates to live Neo4j instances are transactional, it follows that batch imports with LOAD CSV are also transactional. In the simplest case, LOAD CSV builds one transaction and feeds it to the database. For larger batch insertions this can be quite inefficient mechanically because the database has to manage a large amount of transaction state (sometimes gigabytes).

For large data imports, we can boost performance by breaking down a single large transactional commit into a series of smaller commits, which are then executed serially against the database. To achieve this, we use the PERIODIC COMMIT functionality. PERIODIC COMMIT breaks the import into a set of smaller transactions, which are committed after a certain number of rows (1000 by default) have been processed. With our movie location data, we could choose to reduce the default number of CSV lines per transaction to 100, for example, by prepending the Cypher script with USING PERIODIC COMMIT 100. The full script is:

USING PERIODIC COMMIT 100
LOAD CSV WITH HEADERS FROM 'file:///data/locations.csv' AS line
WITH split(line.locations,";") as locations, line.title as title
UNWIND locations AS location
MERGE (x:Location {name:location})
MERGE (m:Movie {title:title})
MERGE (m)-[:FILMED_IN]->(x)

These facilities for loading bulk data allow us both to experiment with example datasets when designing a system, and to integrate with other systems and sources of data as part of a production deployment. CSV is a ubiquitous data exchange format—almost every data and integration technology has some support for producing CSV output. This makes it extremely easy to import data into Neo4j, either as a one-time activity or on a periodic basis.

Summary

In this chapter we’ve discussed the most important aspects of developing a graph database application. We’ve seen how to create graph models that address an application’s needs and an end user’s goals, and how to make our models and associated queries expressive and robust using unit and performance tests. We’ve looked at the pros and cons of a couple of different application architectures, and enumerated the factors we need to consider when planning for production.

Finally we looked at options for rapidly loading bulk data into Neo4j, for both initial import and ongoing batch insertion into a live database.

In the next chapter we’ll look at how graph databases are being used today to solve real-world problems in domains as varied as social networking, recommendations, master data management, data center management, access control, and logistics.

1 For Agile user stories, see Mike Cohn, User Stories Applied (Addison-Wesley, 2004).

2 A doubly linked list is a nicety, because in practice relationships can be traversed in constant time in either direction.

3 See, for example, http://iansrobinson.com/2014/05/13/time-based-versioned-graphs/.

4 A list of Neo4j remote client libraries, as developed by the community, is maintained at http://neo4j.com/developer/language-guides/.

5 Tests not only act as documentation, but they can also be used to generate documentation. All of the Cypher documentation in the Neo4j manual is generated automatically from the unit tests used to develop Cypher.

6 A thorough discussion of agile performance testing can be found in Alistair Jones and Patrick Kua’s essay “Extreme Performance Testing,” in The ThoughtWorks Anthology, Volume 2 (Pragmatic Bookshelf, 2012).

7 Max De Marzi describes using Gatling to test Neo4j.

8 Using a new implementation of the tool available from version Neo4j 2.2 onwards.
