NoSQL Graph Databases
Graph databases are a fairly unusual concept that has received a lot of attention with the increasing popularity of NoSQL, although they have existed in various forms for quite some time. Whereas most databases and file systems hold each item in a list and maintain an index of where each item is stored, a graph database defines each entry by how it relates to other entries, storing pointers to the related items alongside the data. C programmers (or people old enough to remember having to program a linked list) may grasp this concept more quickly than others.
As an example of this approach, the game Six Degrees of Kevin Bacon works on the premise that any actor can be linked to Kevin Bacon through other actors that they have worked with. Victoria Silvstedt links to Lee Majors (Out Cold, 2001), Lee Majors to Robbi Morgan (The Fall Guy, 1983), and Robbi Morgan to Kevin Bacon (Friday the 13th, 1980).
If you’re building an application to store and query these relationships using a traditional database, you might have a table of actors and then a second table that stores “links” between actors. At a simplistic level, each entry in the tables has a fixed location and the database maintains a list of what is in the database and where it can be found.
There’s no actual link between Victoria and Lee, only a consistent use of reference IDs that can help you to retrieve the record about the next actor in the sequence by looking up its ID. Calculating the existence of a route between Kevin and any other actor, and ideally the shortest route, might involve multiple database queries to fetch and organize data.
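To make the relational approach concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (actors, links), the column names, the ID values, and the helper functions co_stars and shortest_route are all assumptions made for illustration; they are not taken from any particular application. The search is a plain breadth-first search, and the point to notice is that every actor visited costs another lookup in the links table.

```python
import sqlite3
from collections import deque

# Hypothetical schema: one table of actors, one table of "links" between them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE actors (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE links (actor_a INTEGER, actor_b INTEGER, film TEXT);
""")
conn.executemany("INSERT INTO actors VALUES (?, ?)", [
    (1, "Victoria Silvstedt"), (2, "Lee Majors"),
    (3, "Robbi Morgan"), (4, "Kevin Bacon"),
])
conn.executemany("INSERT INTO links VALUES (?, ?, ?)", [
    (1, 2, "Out Cold"), (2, 3, "The Fall Guy"), (3, 4, "Friday the 13th"),
])


def co_stars(actor_id):
    """Return the IDs of everyone linked to actor_id (one query per call)."""
    rows = conn.execute(
        "SELECT actor_b FROM links WHERE actor_a = ? "
        "UNION SELECT actor_a FROM links WHERE actor_b = ?",
        (actor_id, actor_id),
    )
    return [row[0] for row in rows]


def shortest_route(start_id, goal_id):
    """Breadth-first search over the links table; returns a list of IDs or None."""
    previous = {start_id: None}  # maps each visited ID to the ID it was reached from
    queue = deque([start_id])
    while queue:
        current = queue.popleft()
        if current == goal_id:
            route = []
            while current is not None:
                route.append(current)
                current = previous[current]
            return route[::-1]
        for neighbour in co_stars(current):  # another round trip to the database
            if neighbour not in previous:
                previous[neighbour] = current
                queue.append(neighbour)
    return None
```

With the four actors above, shortest_route(1, 4) returns [1, 2, 3, 4], but only after issuing a separate query for each actor it passes through.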
As the database grows to include more information, the processing of relationships takes much longer. Eventually, the database could be so large that processing the relationships between entries is just too resource intensive.
A graph database stores these relationships directly, and makes it much easier to find the optimal route between two nodes. Each node consists of the data about one thing (for example, an actor) and pointers to the nodes it is related to.
The difference is not simply one of visualization – by maintaining a direct link from one node to another (as opposed to a reference value that can be used to find the next node by searching the indices) NoSQL graph databases can traverse and perform operations on a graph very efficiently.
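The idea of a direct link can be sketched in a few lines of Python. This is an in-memory illustration only, not how any particular graph database is implemented: the Node class, its link method, and the shortest_path function are hypothetical names. Each node holds references to its neighbours, so the traversal follows those references rather than looking up IDs in an index.

```python
from collections import deque


class Node:
    """Hypothetical node: some data plus direct references to related nodes."""

    def __init__(self, name):
        self.name = name
        self.neighbours = []  # references to other Node objects, not IDs

    def link(self, other):
        # "Worked with" is symmetric, so record the link in both directions.
        self.neighbours.append(other)
        other.neighbours.append(self)


def shortest_path(start, goal):
    """Breadth-first search that follows node references directly."""
    previous = {start: None}  # maps each visited node to the node it was reached from
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node is goal:
            path = []
            while node is not None:
                path.append(node.name)
                node = previous[node]
            return path[::-1]
        for neighbour in node.neighbours:
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


victoria = Node("Victoria Silvstedt")
lee = Node("Lee Majors")
robbi = Node("Robbi Morgan")
kevin = Node("Kevin Bacon")
victoria.link(lee)
lee.link(robbi)
robbi.link(kevin)

print(shortest_path(victoria, kevin))
```

The search itself is the same breadth-first search you would run against relational tables; what changes is that stepping from one actor to the next is a pointer dereference rather than a query.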
As with other NoSQL databases, the data is largely schema-less and the core object is a collection of key-value pairs. These pairs are usually called properties in graph databases, and they can be attached not only to data nodes but also to the relationships linking two nodes.
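As a rough picture of properties on both nodes and relationships, consider the sketch below. The Node and Relationship classes, the WORKED_WITH label, and the property keys are assumptions chosen for illustration; real graph databases expose their own APIs for this.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Schema-less: each node carries whatever key-value pairs it needs.
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: Node
    end: Node
    kind: str  # hypothetical label for the type of relationship
    properties: dict = field(default_factory=dict)


victoria = Node({"name": "Victoria Silvstedt"})
lee = Node({"name": "Lee Majors"})
film = Node({"title": "Out Cold", "released": 2001})  # different keys, same Node type

# The relationship itself carries data about how the two actors are linked.
worked_with = Relationship(
    victoria, lee, "WORKED_WITH", {"film": "Out Cold", "year": 2001}
)
```

Note that the film node and the actor nodes share no property keys, and nothing in the structure requires them to.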
Why Use Graph Databases?
Graph databases are useful in circumstances where:
- Data is highly linked to other items in the database – including situations where nodes may be linked to many other nodes.
- The relationships between data are important to the application using the database, for example when it needs to compute the shortest path between two objects or centrality measures.
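As an example of a centrality measure, the sketch below computes degree centrality, one of the simplest: the number of links a node has, divided by the number of other nodes it could be linked to. The build_graph and degree_centrality functions are hypothetical helpers written for this illustration, and the edges are the actor links from the Kevin Bacon example.

```python
def build_graph(edges):
    """Build an undirected adjacency map {node: set of neighbours} from pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def degree_centrality(graph):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbours) / (n - 1) for node, neighbours in graph.items()}


edges = [
    ("Victoria Silvstedt", "Lee Majors"),
    ("Lee Majors", "Robbi Morgan"),
    ("Robbi Morgan", "Kevin Bacon"),
]
centrality = degree_centrality(build_graph(edges))
```

In this small chain the two actors in the middle score higher than the two at the ends, which is the kind of question that depends entirely on the relationships rather than on the actors' own data.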
And they still offer the core benefits of NoSQL database solutions:
- Being suitable for items that are highly variable in their structure.
- Easy integration with application code.
- Performance optimized for specific tasks.
However, unlike most other NoSQL solutions, they do not always provide performance gains over relational databases, and so their use should be balanced against how important the relationships are to the project. Their limited indexing can cause performance issues when information is updated frequently, so they may not be suitable as a standalone solution for the data storage needs of every large project.
Querying the database can also be a little more complicated than with other NoSQL databases. Query languages for graph databases are approaching the level of complexity of Structured Query Language (SQL, as used with traditional solutions), but while SQL is extremely well known, it is harder to find people who are competent in working with graph databases.
Neo4j, Titan, and Others
The most prominent NoSQL graph database is Neo4j, which describes itself as the “world’s leading” solution and lists massive companies such as Adobe, eBay, and Hewlett Packard among its clients. It is widely used for a diverse range of activities (mostly based in graph theory, obviously) such as business intelligence, fraud detection, logistics, and social networking. It is available on most of the major operating systems, supports a variety of APIs and query methods, and is popular enough that the Internet is littered with tutorials and examples on how to use it.
Among the many competing packages, Titan is perhaps easier than Neo4j to get started with, but it does require an additional data storage solution such as Cassandra or HBase. Other options are FlockDB, ArangoDB (which claims to be a multi-model solution that also provides key-value and document storage), and HyperGraphDB. With so many graph databases now on the market, choosing one is often a case of determining which API languages are most desirable to you, and the level of scalability or distribution that you require.