
2019-06-16

Transferring a subgraph from JanusGraph to Neo4j


This post does not go into much technical detail. It exists because a Google search for its title does not turn up any immediately usable resource, and I think the use case is a relevant one. The JanusGraph storage backends let you store and query huge datasets in a linearly scalable way, but data science teams often prefer to work on smaller subsets of the graph data in the Neo4j clients, which offer better support for visual data exploration and for mixing in additional data.

Exporting a subgraph from JanusGraph

The Gremlin query language has a dedicated subgraph step that collects edges and their attached vertices into a dataset that can itself be operated upon as a graph. The code below, meant for the Gremlin Console, extracts some data from the Graph of the Gods sample graph and then writes it to a file in GraphML format.

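// Load the Graph of the Gods sample into an in-memory JanusGraph instance
graph = JanusGraphFactory.open("inmemory")
GraphOfTheGodsFactory.loadWithoutMixedIndex(graph, true)
g = graph.traversal()

// Collect every edge incident to jupiter, plus the vertices at both ends, into a subgraph
subGraph = g.V().has('name', 'jupiter').bothE().subgraph('jupiter').cap('jupiter').next()

// Write the subgraph as GraphML, storing vertex labels under the "labels" key
stream = new FileOutputStream("data/jupiter.xml")
GraphMLWriter.build().vertexLabelKey("labels").create().writeGraph(stream, subGraph)
stream.close()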

 
This code uses the GraphMLWriter class directly. The TinkerPop reference documentation prescribes the .io() method instead, but the GraphML it produces uses the default "labelV" key to indicate the label of a vertex. The Neo4j APOC plugin does not recognize that key, so you end up with a Neo4j graph without vertex labels. For Neo4j to understand that the key refers to vertex labels, it has to be named "labels", and the vertexLabelKey option of GraphMLWriter lets you set that. The mismatch in label keys probably arose because TinkerPop supports only a single label per vertex, while Neo4j supports multiple.
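
If you already have a GraphML file that was written with the default "labelV" key, re-exporting is not the only option: the key can also be renamed in the file itself. The Python sketch below is one way to do that with the standard library. The function name rename_vertex_label_key and the SAMPLE document are made up for illustration, and the sample only approximates the layout of TinkerPop's GraphML output. I have not run the result through APOC, so treat it as a starting point rather than a verified recipe.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"

# Hypothetical sample approximating GraphML written with the default label keys.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="labelV" for="node" attr.name="labelV" attr.type="string"/>
  <key id="labelE" for="edge" attr.name="labelE" attr.type="string"/>
  <graph id="G" edgedefault="directed">
    <node id="1"><data key="labelV">god</data></node>
    <node id="2"><data key="labelV">god</data></node>
    <edge id="3" source="1" target="2"><data key="labelE">brother</data></edge>
  </graph>
</graphml>"""


def rename_vertex_label_key(graphml_text, old_key="labelV", new_key="labels"):
    """Return GraphML text in which the vertex label key is renamed."""
    # Keep GraphML as the default namespace so the output has no ns0: prefixes.
    ET.register_namespace("", GRAPHML_NS)
    root = ET.fromstring(graphml_text)

    # Rename the <key> declaration that applies to nodes.
    for key in root.iter(f"{{{GRAPHML_NS}}}key"):
        if key.get("id") == old_key and key.get("for") == "node":
            key.set("id", new_key)
            key.set("attr.name", new_key)

    # Rename the matching <data> elements, but only directly under <node>,
    # so edge data (for example "labelE") is left untouched.
    for node in root.iter(f"{{{GRAPHML_NS}}}node"):
        for data in node.findall(f"{{{GRAPHML_NS}}}data"):
            if data.get("key") == old_key:
                data.set("key", new_key)

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(rename_vertex_label_key(SAMPLE))
```

For a real file you would read its contents, pass them through the function, and write the returned text back out before importing.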

Importing the GraphML file into Neo4j

call apoc.import.graphml('../janusgraph-0.3.1-hadoop2/data/jupiter.xml', {batchSize: 10000, readLabels: true, storeNodeIds: false, defaultRelationshipType:"RELATED"})

MATCH (n) RETURN n

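The import call is straight from the Neo4j documentation, and it works once you know that the GraphML data has to carry the vertex labels under the "labels" key. The MATCH query then returns the imported nodes so that you can inspect them.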
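
Because a missing "labels" key only shows up after the import, as nodes without labels, it can help to check the file beforehand. The Python sketch below counts the nodes and edges in a GraphML document and lists the nodes that have no data element keyed "labels". The function name summarize_graphml and the SAMPLE document are hypothetical. It assumes the label is stored as a data element under each node, as in the export shown earlier.

```python
import xml.etree.ElementTree as ET

NS = "{http://graphml.graphdrawing.org/xmlns}"

# Hypothetical sample: node 1 carries a "labels" entry, node 2 does not.
SAMPLE = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="labels" for="node" attr.name="labels" attr.type="string"/>
  <key id="labelE" for="edge" attr.name="labelE" attr.type="string"/>
  <graph id="G" edgedefault="directed">
    <node id="1"><data key="labels">god</data></node>
    <node id="2"></node>
    <edge id="3" source="1" target="2"><data key="labelE">brother</data></edge>
  </graph>
</graphml>"""


def summarize_graphml(graphml_text, label_key="labels"):
    """Count nodes and edges and list the ids of nodes lacking a label entry."""
    root = ET.fromstring(graphml_text)
    nodes = list(root.iter(NS + "node"))
    edges = list(root.iter(NS + "edge"))
    unlabeled = [
        node.get("id")
        for node in nodes
        if not any(d.get("key") == label_key for d in node.findall(NS + "data"))
    ]
    return {"nodes": len(nodes), "edges": len(edges), "unlabeled": unlabeled}


if __name__ == "__main__":
    print(summarize_graphml(SAMPLE))
```

The node and edge counts can afterwards be compared with what Neo4j reports for the imported graph.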

References

https://tinkerpop.apache.org/javadocs/3.3.4/full/org/apache/tinkerpop/gremlin/structure/io/graphml/GraphMLWriter.html
https://github.com/neo4j-contrib/neo4j-apoc-procedures/issues/721
