Technical Blog

How to structure a Neo4j-based system?

Neo4j is a powerful, free and easy-to-use graph database. It is ideal for storing, for example, relationships between people, items, etc. It provides great features, including a rich API (easy graph traversal, pathfinding), high-level query DSLs (Gremlin, Cypher), a server exposing a REST API, node and relationship indexing, and ACID compliance.
In this article, I give pointers to help you decide which topology to use for your Neo4j projects: embedded database or server?
This choice is important because every option has advantages, disadvantages and limitations that have to be taken into account before implementing the system. One important point, however, is that whichever solution you use, the database is stored on disk in the same way. This means that you can change your architecture without having to rebuild the database.

Neo4j provides excellent online documentation that contains everything you need to set up, configure and use this library.

5 ways to structure a Neo4j-based project

The Neo4j server

Using the provided server lets you very quickly get a working system that allows many clients to access the database. Simply call $NEO4J_HOME/bin/neo4j start and the database is running. You can then use it in several ways:

  • The REST API, which lets you insert, delete and access data easily from any programming language.
  • A server plugin: JVM code in a JAR file that you place in $NEO4J_HOME/plugins, which is loaded by the server at startup and extends the REST API. It’s a nice way to keep the server’s simplicity while retaining the ability to execute custom Java code. However, it has some limitations: the server must be restarted to pick up new plugins, and return types are limited (this can be worked around by returning JSON).
  • Unmanaged extensions, which must be used with caution. They give finer control over the protocol, as they do not rely on Neo4j’s REST API.
  • Queries run through the Gremlin or Cypher plugins.

Another advantage is that the server includes monitoring through a web interface.
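To make the REST access more concrete, here is a minimal Python sketch that prepares a Cypher query as an HTTP request. It rests on several assumptions that are not part of this article: a server running locally on port 7474, the legacy /db/data/cypher endpoint found in older server versions, and the helper names build_cypher_request and run_cypher, which are purely illustrative. Check the documentation of your own Neo4j version before relying on any of it.

```python
import json
import urllib.request

# Assumption: a local server on port 7474 exposing the legacy REST root
# /db/data (other Neo4j versions may use different endpoints).
BASE_URL = "http://localhost:7474/db/data"


def build_cypher_request(query, params=None, base_url=BASE_URL):
    """Build (but do not send) a POST request carrying a Cypher query."""
    payload = json.dumps({"query": query, "params": params or {}})
    return urllib.request.Request(
        url=base_url + "/cypher",
        data=payload.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )


def run_cypher(query, params=None, base_url=BASE_URL):
    """Send the query to a running server and return the decoded JSON reply."""
    request = build_cypher_request(query, params, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, calling run_cypher with a query string valid for your server version would send it and return the parsed reply. Because the request is plain HTTP with a JSON body, the same idea carries over to any other language.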

EmbeddedDB

Using an embedded DB feels more like using a powerful graph API than a complete database management system (DBMS). Indeed, you have to write your own JVM program (in Java, Scala, Clojure, Groovy, etc.) that loads the DB from disk and runs operations on it (creating, deleting and searching nodes and relationships, running queries).

The major advantage is that you have total control over the database and the program that holds it. For example, you’re not limited to plugins that are called through REST.

You can use an EmbeddedDB to create your own Neo4j server (the provided Neo4j server is based on it), so you can include any communication layer you want to handle requests: your favorite HTTP server, Thrift, etc. Multiple clients can then connect to it, while only one program (the server) accesses the database at any time.
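The shape of such a home-made server can be sketched in a few lines of Python. This is only an illustration of the architecture, not Neo4j code: a real implementation would be a JVM program holding the embedded database, whereas here a plain dictionary (GRAPH) stands in for it. The /neighbors/<node> route and all names are hypothetical.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Placeholder for the embedded database (assumption): a plain adjacency dict.
# Only this process touches it; clients go through HTTP.
GRAPH = {"alice": ["bob"], "bob": ["carol"], "carol": []}
GRAPH_LOCK = threading.Lock()


class NeighborsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Hypothetical route: GET /neighbors/<node>
        prefix = "/neighbors/"
        if not self.path.startswith(prefix):
            self.send_error(404, "unknown route")
            return
        node = self.path[len(prefix):]
        with GRAPH_LOCK:
            neighbors = GRAPH.get(node)
        if neighbors is None:
            self.send_error(404, "unknown node")
            return
        body = json.dumps({"node": node, "neighbors": neighbors}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence the default per-request logging.
        pass


def make_server(port=0):
    """Create the server; port 0 lets the OS pick a free port."""
    return ThreadingHTTPServer(("127.0.0.1", port), NeighborsHandler)
```

Calling make_server(8080).serve_forever() would start serving requests. The point of the design is that any number of clients can send requests, but a single process owns the data and is the only one reading or writing it.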

Creating your own server lets you decouple the database from the application (for example, you can have one machine running as a server and many others accessing it), but you can also use the EmbeddedDB as an integrated graph API. For example, you can include it in a standalone video game (where there is only one client for one database), storing your environment in a graph database and running shortest-path algorithms on it. In cases where only one process will ever connect to the database, an embedded DB is a good solution.
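To illustrate the kind of computation meant by the video game example, here is a small Python sketch of a shortest-path search over a game environment. It is a plain breadth-first search on an in-memory dictionary, not Neo4j's own pathfinding API, and the rooms map is invented for the example. With an embedded database, the graph would live in Neo4j and its built-in graph algorithms would play this role.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Return the shortest list of nodes from start to goal, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # node -> node it was reached from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in previous:
                continue  # already visited
            previous[neighbor] = node
            if neighbor == goal:
                # Walk back from the goal to the start, then reverse.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Hypothetical game environment: rooms and the doors between them.
rooms = {
    "entrance": ["hall"],
    "hall": ["entrance", "kitchen", "library"],
    "kitchen": ["hall", "cellar"],
    "library": ["hall", "cellar"],
    "cellar": ["kitchen", "library"],
}

print(shortest_path(rooms, "entrance", "cellar"))
# ['entrance', 'hall', 'kitchen', 'cellar']
```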

HighlyAvailableGraphDatabase

Neo4j HA has been designed to make the transition from single machine to multi machine operation simple, by not having to change the already existing application. (source).

This solution requires the Enterprise edition of Neo4j and allows a more complex system architecture based on a cluster of machines. You can use it in either EmbeddedDB or server mode.

A hybrid solution: WrappingNeoServerBootstrapper

This class lets you keep some of the server’s features (REST API, monitoring) when using an embedded database. It requires only a few more lines of code than the EmbeddedDB; an example can be found here.

A special case: the BatchInserter

When creating a Neo4j-based project, there is a high probability that you will start by importing existing data. Sources can be other databases, CSV files, etc. Neo4j provides an API designed to import data very quickly (far faster than using transactions, the usual way to insert data). The BatchInserter is usually used at the beginning of the project to build the DB, which is then used by an EmbeddedDB or a server.

To give a concrete example, it allowed me to import a graph of 10 million nodes and 50 million relationships in 20 minutes on a MacBook Pro. The same import took hours using classical transactions.
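To put those figures in perspective, the average throughput of that import can be worked out with a quick calculation (sketched here in Python, treating nodes and relationships alike as "elements"):

```python
nodes = 10_000_000
relationships = 50_000_000
minutes = 20

# Average number of elements written per second over the whole import.
elements_per_second = (nodes + relationships) / (minutes * 60)
print(f"{elements_per_second:,.0f} elements per second")
# 50,000 elements per second
```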

Conclusion

There are three main architectures for using a Neo4j DB:

  • Using the Neo4j server: very easy, and customisable with handmade plugins, REST access and query DSLs.
  • EmbeddedDB (or the wrapper variant): a good choice when you have only one client that embeds the database, for example when creating standalone programs that use Neo4j as a graph API. It offers more possibilities, but it requires more work if you want to create your own server (you have to build a communication layer that handles requests).
  • HighlyAvailableGraphDatabase: when you need a resilient architecture. It is distributed and requires configuration. It runs on top of embedded mode or server mode.

BatchInserter can be used prior to any of those three to import initial data into the Neo4j database.

In most cases the first option (the server) is likely to be the best one, but the second one (EmbeddedDB) should be selected when you need a lot of customization or simply a graph API.

Keep in mind that whichever you choose, you can switch from one solution to another without changing your actual database on disk.