Thursday, December 29, 2011

Graph Database: neo4j

If we look around us in real life, we find that more and more things take the form of a graph or a web of graphs. How are people connected with each other? How does money flow in a system? How are restaurants, hotels, and roads interconnected? How does a message flow through a social network? If we try to draw any of these on a whiteboard, we end up with a graph.

We are dealing with a similar kind of domain model in our project. We got a flexible model by using some out-of-the-box design approaches on top of a relational database, but this flexibility came with tradeoffs imposed by the limits of the relational database. The growing size of the data is a concern: tens of millions of rows get added to a single table every couple of months. We are striving to come up with a better-fitting solution before we hit the wall a few years down the road. This prompted us to dive into the NoSQL movement and do some experimentation. This paradigm shift toward seeing things in their natural form is interesting, or at least food for thought, and it may help solve some of the problems we have been thinking about. I'll try to touch on NoSQL and neo4j in this post.

Faceoff: the acronym NoSQL does not stand for "No SQL" or "Never SQL"; it stands for "Not Only SQL". So most of the time it means working with some non-SQL persistence system alongside a SQL (relational) database, splitting the data between the two based on the use case.

Different Types of NoSQL Data Models:

Key/Value Stores – fit well for very high volumes of data with relatively low complexity, e.g. Amazon Dynamo
Column Stores – fit well for high volumes of data with fairly high complexity, e.g. HBase, Cassandra, BigTable
Document Databases – a middle path between high data volume and high complexity, e.g. MongoDB, CouchDB
Graph Databases – fit well for fairly high volumes of data with high complexity, e.g. neo4j

Why a graph database (neo4j)?
  • Many domains are graph-oriented and map poorly to tables. Why take the pain of squeezing a graph into tables?
  • SQL joins over connected data cause performance problems
  • ACID and JTA compliant – the only NoSQL database I know of so far that supports transactions like a relational database
  • Relationships can be added dynamically when required
  • It can represent the real-life domain model with a one-to-one mapping
You might have heard about the Facebook Graph API and the Open Graph Protocol, which view data as a graph spanning different domains such as people, places, businesses, and events.

Data Modeling in Graph Database (neo4j): 
- Entities are nodes – every node has an id, which is generated automatically and uniquely for each new node and cannot be changed

- Typed relationships connect nodes – each one is identified by its type and direction

- Properties (key/value pairs) – they can be attached to any node or relationship. Only Java primitives, strings, and arrays of these can be used as property values; richer objects are modeled as nodes.

This is a little similar to how we bind data (e.g. a single-level JSON object) to a node in the DOM and access it later. For example, with jQuery:

var user = {'name':'John', 'age':28, 'department':'IT'};

$("#user_div").data("userInfo", user);  // bind data to the element with id="user_div"

alert( $("#user_div").data("userInfo").name );  // alerts John
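To make the node / relationship / property model above a bit more concrete, here is a minimal in-memory sketch in plain JavaScript. It is only an illustration and not the neo4j API: `createGraph`, `addNode`, `addRelationship`, `outgoing`, and `getNode` are hypothetical names made up for this post, and the property check is simplified to strings, numbers, and booleans.

```javascript
// Illustrative in-memory property graph; NOT the neo4j API.
// All function names here are hypothetical.
function createGraph() {
  let nextId = 0;            // ids are handed out automatically
  const nodes = new Map();   // id -> node
  const relationships = [];  // { type, from, to, properties }

  // Simplified rule: only primitive values are allowed as properties.
  function checkProperties(properties) {
    for (const [key, value] of Object.entries(properties)) {
      const kind = typeof value;
      if (kind !== "string" && kind !== "number" && kind !== "boolean") {
        throw new TypeError("Property '" + key + "' must be a primitive value");
      }
    }
    return Object.assign({}, properties);
  }

  function addNode(properties = {}) {
    const checked = checkProperties(properties);
    // Freezing the node keeps its id from being reassigned later.
    const node = Object.freeze({ id: nextId++, properties: checked });
    nodes.set(node.id, node);
    return node;
  }

  // A relationship has a type and a direction (from -> to).
  function addRelationship(from, type, to, properties = {}) {
    const relationship = {
      type: type,
      from: from.id,
      to: to.id,
      properties: checkProperties(properties),
    };
    relationships.push(relationship);
    return relationship;
  }

  // All relationships of the given type that leave the given node.
  function outgoing(node, type) {
    return relationships.filter((r) => r.from === node.id && r.type === type);
  }

  return { addNode, addRelationship, outgoing, getNode: (id) => nodes.get(id) };
}

const graph = createGraph();
const john = graph.addNode({ name: "John", age: 28 });
const itDept = graph.addNode({ name: "IT" });  // the department is its own node
graph.addRelationship(john, "WORKS_IN", itDept, { since: 2009 });

const departments = graph
  .outgoing(john, "WORKS_IN")
  .map((r) => graph.getNode(r.to).properties.name);
console.log(departments);
```

Note how the department from the jQuery example becomes a node of its own, connected by a relationship, instead of a nested object hanging off the user.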

Spring Data Graph:
This is like JPA for graph databases: an annotation-driven, AspectJ-based domain-layer mapping framework from SpringSource.

Graph query language. Neo4j also supports a powerful native traversal mechanism for retrieving data from the graph.
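To give a feel for what a traversal does, here is a rough sketch in plain JavaScript. It is not neo4j's traversal API; it is an ordinary breadth-first walk over a made-up adjacency list (`knows`), and the `traverse` helper is a hypothetical name. It starts at one node and follows outgoing relationships up to a maximum depth, recording the depth at which each node is first reached.

```javascript
// Hypothetical data: person -> people they know (directed, for simplicity).
const knows = {
  alice: ["bob", "carol"],
  bob: ["dave"],
  carol: ["dave"],
  dave: ["erin"],
  erin: [],
};

// Breadth-first traversal from `start`, following edges up to `maxDepth` hops.
// Returns a Map of node name -> depth at which it was first reached.
function traverse(adjacency, start, maxDepth) {
  const depthOf = new Map([[start, 0]]);
  let frontier = [start];
  for (let depth = 1; depth <= maxDepth && frontier.length > 0; depth++) {
    const next = [];
    for (const current of frontier) {
      for (const neighbor of adjacency[current] || []) {
        if (!depthOf.has(neighbor)) {  // visit each node only once
          depthOf.set(neighbor, depth);
          next.push(neighbor);
        }
      }
    }
    frontier = next;
  }
  return depthOf;
}

// Everyone alice can reach within two hops (friends and friends of friends).
const reached = traverse(knows, "alice", 2);
console.log(Array.from(reached.entries()));
```

With a depth of 2, the walk stops before it gets to erin, who is three hops away from alice.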

Real Life Use Cases:
- Social Networks
- Geospatial Data
- Recommendation Engines

Test Case – 1000 persons, each having 50 friends on average, connected over 4 levels. What is the query time to find out whether any two randomly picked persons are friends?

Result – 2000 ms for the relational database and 2 ms for neo4j. Neo4j still reports 2 ms even when the number of persons is increased to 1 million. Remember the search complexity of a tree from the graph theory and algorithms classes in school?
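One way to build intuition for why the total number of persons may matter so little: a graph-style lookup only walks the neighbourhood of the starting person. The sketch below is plain JavaScript with made-up data and hypothetical helper names (`buildFriendIndex`, `areConnected`); it is not a benchmark and not how neo4j is implemented. It checks whether two persons are connected within 4 levels and counts how many persons the search touched.

```javascript
// Build an undirected friendship index from a list of [a, b] pairs.
function buildFriendIndex(pairs) {
  const friends = new Map();
  const link = (a, b) => {
    if (!friends.has(a)) friends.set(a, new Set());
    friends.get(a).add(b);
  };
  for (const [a, b] of pairs) {
    link(a, b);
    link(b, a);
  }
  return friends;
}

// Is `target` reachable from `source` within `maxLevels` friendship hops?
// Also reports how many persons were touched before the search stopped.
function areConnected(friends, source, target, maxLevels) {
  if (source === target) return { connected: true, visited: 1 };
  const seen = new Set([source]);
  let frontier = [source];
  for (let level = 0; level < maxLevels && frontier.length > 0; level++) {
    const next = [];
    for (const person of frontier) {
      for (const friend of friends.get(person) || []) {
        if (friend === target) return { connected: true, visited: seen.size };
        if (!seen.has(friend)) {
          seen.add(friend);
          next.push(friend);
        }
      }
    }
    frontier = next;
  }
  return { connected: false, visited: seen.size };
}

// A small chain of friends: 1 - 2 - 3 - 4 - 5 - 6
const chain = [[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]];
const small = buildFriendIndex(chain);

// The same chain plus 1000 unrelated pairs of persons.
const unrelated = Array.from({ length: 1000 }, (_, i) => [1000 + 2 * i, 1001 + 2 * i]);
const large = buildFriendIndex(chain.concat(unrelated));

const inSmall = areConnected(small, 1, 5, 4);
const inLarge = areConnected(large, 1, 5, 4);
console.log(inSmall, inLarge);
```

In this toy data the search touches the same persons in both graphs, because the unrelated persons are never reachable from the start. A real social graph with 50 friends per person fans out far more, so this only illustrates the locality idea, not the numbers quoted above.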