
2017-07-06

Configuring JanusGraph for Spark-Yarn

JanusGraph is a graph database that implements the Apache Tinkerpop APIs; it can be backed by a number of popular storage and search backends (e.g. Apache HBase, Apache Cassandra, Apache Solr, Elasticsearch). While the provisioning of a storage backend suggests mainly transactional use (OLTP), the storage backend connectors also come with InputFormat classes for analytical use with Hadoop/Spark (OLAP), in particular with the Tinkerpop GraphComputer implementations for the Tinkerpop HadoopGraph.
In this post, I describe how to configure the JanusGraph-0.1.1 binary distribution for use with an existing Hadoop/HBase/Spark-Yarn cluster. This post is a follow-up to an earlier post in which I did the same for the bare Apache Tinkerpop-3.2.3 distribution.

Some background

The configuration for running OLAP queries on JanusGraph turns out to be more complicated than for Tinkerpop. While the JanusGraph team did a great job in releasing the Titan fork so soon after their foundation, the janusgraph-hbase module still has some quirks:
  • proper working depends on the order of items on the classpath (e.g. because janusgraph-hbase contains a literal copy of the guava-12 StopWatch class, while the classpath also needs the guava-18 dependency)
  • the hbase cluster configs need to be present both in the hbase-site.xml file on the classpath and in the HadoopGraph properties file (a quick check follows below)
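A quick way to verify this duplication is to compare the ZooKeeper settings in both places; a minimal sketch, assuming the HDP config locations used in the run script below and the properties file from this recipe:

# ZooKeeper quorum and znode parent as HBase sees them
grep -E -A1 'hbase.zookeeper.quorum|zookeeper.znode.parent' /usr/hdp/current/hbase-client/conf/hbase-site.xml
# the corresponding settings in the HadoopGraph properties file of this recipe
grep -E 'storage.hostname|znode.parent' conf/hadoop-graph/hadoop-gryo-yarn.properties
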
Running Apache Spark as an Apache Hadoop yarn-client application results in a distributed setup of Java containers (JVMs), each with its own classpath:
  • the client application (e.g. the gremlin console) runs the yarn client and contains the spark driver and the SparkContext
  • Yarn runs the spark cluster manager in a separate container on the cluster, the yarn ApplicationMaster
  • Yarn runs a separate container for each requested spark executor
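Each of these JVMs gets its classpath from a different setting in the recipe below (listed here only to show which setting feeds which JVM; the full values follow in the Configuration section):

# gremlin console / spark driver: the CLASSPATH variable exported in bin/jg011.sh
# Yarn ApplicationMaster:         spark.yarn.appMasterEnv.CLASSPATH in the graph properties file
# spark executors:                spark.executor.extraClassPath in the graph properties file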

Prerequisites

The prerequisites for following the recipe for configuring JanusGraph are the same as for Tinkerpop in my earlier post, so I will not repeat those. Where the current recipe makes JanusGraph perform an OLAP job on a graph persisted in HBase, the earlier Tinkerpop recipe allows you to perform an OLAP job on a graph persisted as a kryo file on hdfs.
Note that:
  • now you need the JanusGraph-0.1.1 binary distribution instead of the Tinkerpop-3.2.3 distribution;
  • the JanusGraph distribution already includes the hadoop-gremlin and spark-gremlin plugins. The plugin libs are simply present in the lib folder, and the jars to be added are assumed to be in a lib2 folder relative to the root of the JanusGraph distribution (a sketch of preparing these folders follows after this list);
  • In effect, you can also follow the Tinkerpop for Spark-Yarn recipe using the JanusGraph-0.1.1 distribution, provided that you replace the single occurrence of '/ext/spark-gremlin' in the configs by '/'. 
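The configuration below also references a lib.zip archive (via spark.yarn.dist.archives), next to the lib2 and empty folders used in the run script. The earlier Tinkerpop post describes exactly which jars go where; purely as a sketch, and assuming lib.zip simply bundles jars at the top level so that ./lib.zip/* resolves inside the Yarn containers, its preparation boils down to something like:

cd /home/biko/JG011binary
mkdir -p lib2 empty
cd lib && zip ../lib.zip *.jar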

Configuration

Create a shell script (e.g. bin/jg011.sh) with the following contents:

#!/bin/bash

GREMLIN_HOME=/home/biko/JG011binary
cd $GREMLIN_HOME

# Have janusgraph find the hadoop and hbase cluster configs and spark-yarn dependencies
export CLASSPATH=/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hbase-client/conf:$GREMLIN_HOME/lib/*:$GREMLIN_HOME/lib2/*

# Have hadoop find its native libraries
export JAVA_OPTIONS="-Djava.library.path=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64"

# Does not work for spark-yarn, see spark.yarn.appMasterEnv.CLASSPATH and
# spark.executor.extraClassPath. Set nevertheless to get rid of the warning.
export HADOOP_GREMLIN_LIBS=$GREMLIN_HOME/empty

bin/gremlin.sh

Create the file conf/hadoop-graph/hadoop-gryo-yarn.properties:

#
# Hadoop Graph Configuration
#
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.janusgraph.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat

gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output

#
# JanusGraph HBase InputFormat configuration
#
janusgraphmr.ioformat.conf.storage.backend=hbase
#janusgraphmr.ioformat.conf.storage.hostname=fqdn1,fqdn2,fqdn3
janusgraphmr.ioformat.conf.storage.hostname=127.0.0.1
janusgraphmr.ioformat.conf.storage.hbase.table=janusgraph
zookeeper.znode.parent=/hbase-unsecure
# Security configs are needed in case of a secure cluster
#zookeeper.znode.parent=/hbase-secure
#hbase.rpc.protection=privacy
#hbase.security.authentication=kerberos

#
# SparkGraphComputer with Yarn Configuration
#
spark.master=yarn-client
spark.executor.memory=512m
spark.serializer=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoSerializer
spark.yarn.dist.archives=/home/biko/JG011binary/lib.zip
spark.yarn.dist.files=/home/biko/JG011binary/janusgraph-hbase-0.1.1.jar
spark.yarn.appMasterEnv.CLASSPATH=/usr/hdp/current/hadoop-client/conf:./lib.zip/*:
spark.executor.extraClassPath=/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hbase-client/conf:janusgraph-hbase-0.1.1.jar:./lib.zip/*
spark.driver.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.executor.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64

#
# Relevant configs from spark-defaults.conf
#
spark.eventLog.dir hdfs:///spark-history
spark.eventLog.enabled true
spark.yarn.historyServer.address sandbox.hortonworks.com:18080
spark.history.fs.logDirectory hdfs:///spark-history
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18080
spark.history.kerberos.enabled false
spark.history.kerberos.keytab none
spark.history.kerberos.principal none

spark.yarn.am.waitTime 10
spark.yarn.containerLauncherMaxThreads 25
spark.yarn.executor.memoryOverhead 384
spark.yarn.preserve.staging.files false
spark.yarn.queue default
spark.yarn.scheduler.heartbeat.interval-ms 5000
spark.yarn.submit.file.replication 3


Demonstration

If you followed the recipe this far, you are ready to run your own demo:

[root@sandbox ~]# . /home/yourdir/bin/jg011.sh

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: janusgraph.imports
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/yourdir/JG011binary/lib/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/yourdir/JG011binary/lib/logback-classic-1.1.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19:07:43,534  INFO HadoopGraph:87 - HADOOP_GREMLIN_LIBS is set to: /home/yourdir/JG011binary/empty
plugin activated: tinkerpop.hadoop
plugin activated: tinkerpop.spark
plugin activated: tinkerpop.tinkergraph

// Loading a graph into the default janusgraph table
gremlin> graph = JanusGraphFactory.open('conf/janusgraph-hbase.properties')
==>standardjanusgraph[hbase:[127.0.0.1]]
gremlin> GraphOfTheGodsFactory.loadWithoutMixedIndex(graph,true)
==>null
gremlin> g=graph.traversal()
==>graphtraversalsource[standardjanusgraph[hbase:[127.0.0.1]], standard]
gremlin> g.V().count()
19:10:45,921  WARN StandardJanusGraphTx:1273 - Query requires iterating over all vertices [()]. For better performance, use indexes
==>12

// Loading of a HadoopGraph from janusgraph's hbase table
gremlin> graph = GraphFactory.open('conf/hadoop-graph/read-hbase-spark-yarn.properties')
==>hadoopgraph[hbaseinputformat->gryooutputformat]
gremlin> g = graph.traversal().withComputer(SparkGraphComputer)
==>graphtraversalsource[hadoopgraph[hbaseinputformat->gryooutputformat], sparkgraphcomputer]
gremlin> g.V().count()
==>12
gremlin>
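
For reference, the OLTP graph above is opened with the conf/janusgraph-hbase.properties file that ships with the JanusGraph distribution; on the sandbox its essential settings amount to something like this (check the file in your own copy, the exact contents may differ):

storage.backend=hbase
storage.hostname=127.0.0.1
# the table name defaults to janusgraph, matching janusgraphmr.ioformat.conf.storage.hbase.table above
#storage.hbase.table=janusgraph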

Final remarks

  • As with the Tinkerpop recipe, the current recipe was also tested on a real, secure spark-yarn cluster (HDP-2.5.3.0-37). Some configs are superfluous on the sandbox but are needed on the real cluster (e.g. the spark.yarn.dist.files property);
  • As with the Tinkerpop recipe, the current recipe should also work for client applications other than the gremlin console, as long as you do not use spark-submit or spark-shell. Getting your dependencies right turns out to be tedious, though (maybe a subject for some other post);
  • Critical readers may note that the CLASSPATH in the run script contains a lot of superfluous items, because the gremlin.sh script already puts most of these items on the classpath. Leaving $GREMLIN_HOME/lib/* out, however, interferes with the logging configuration in ways I still do not understand.

8 comments:

  1. HadoopMarc,

    It seems that we have two graph classes that need to be created:

    The first is a standardjanusgraph object that runs a standard computer. This is able to perform OLTP data pushes and, I assume, standard OLTP queries. It, however, does not interface with Spark, so SparkGraphComputer cannot be used as the graph computer for its traversal object.

    The second object is a HadoopGraph object that can have SparkGraphComputer activated for its associated traversal source object. This can perform appropriate map-reduced OLAP calculations, but wouldn't be good for putting information into the HBase database.

    Is this accurate, or can we create a graphtraversalsource that can perform both the OLTP data inserts and utilize SparkGraphComputer? If not, could we create both objects simultaneously? Would there be conflicts between the two if there were two simultaneous traversers?

  2. Hi John,
    Your assumption about different types of graph object for OLTP and OLAP is right. I remember examples from the gremlin user list though where OLTP and OLAP were mixed in the same traversal (sorry, no time to look this up right now, but feel free to ask on the gremlin user list).
    It is no problem to have a graph1 and a graph2 graph object simultaneously. This is also what you do in gremlin-server when you want to serve multiple graphs.

  3. I also assume that the instruction above to:

    Create the file conf/hadoop/hadoop-gryo-yarn.properties:

    really should be parsed as:

    Create the file conf/hadoop-graph/read-hbase-spark-yarn.properties: ?

    Otherwise, there is no conf/hadoop-graph/read-hbase-spark-yarn.properties file for us to refer to.

  4. Do we need to install /usr/hdp/current/hadoop-client/lib/hadoop-lzo-0.6.0.2.5.0.0-1245.jar for working with JanusGraph? It isn't on my computer at the designated location.

    No, I have not installed Hortonworks Sandbox, since I am already on a CentOS system with an Ambari installation. The particular jar above also can't be downloaded easily from the web in any standard way. (No obvious wget expression.) If I could skip it, that would be best.

  5. Okay, I think I got it. Although, it is: hadoop-lzo-0.6.0.2.4.2.0-258.jar, since I am running hdp 2.4.2. You install it in CentOS by running:

    yum install lzo lzo-devel hadooplzo hadooplzo-native

  6. Could you elaborate further on the placement of the .jar files for the guava versions in the classpath? For instance, why are there two lib directories (lib and lib2)? Which system needs which guava file and which goes first? Currently, I seem to have everything running except the serialization between HBase and Spark. (The JanusGraph object works in OLTP mode correctly, and Yarn is launching in OLAP mode, but it just can't read in the HBase data, which is there, since OLTP can read it fine.)

  7. We have finally got it working. There were a couple of concepts regarding classpaths that we did not understand, especially as they relate to YARN containers:

    1) spark.yarn.dist.archives and spark.yarn.dist.files point to the jars that will be loaded into the YARN container. Lib.zip is the large collection of jars that were prepared for export to the YARN containers, which will be stored locally in the YARN container in the directory lib.zip. janusgraph-0.1.1-hadoop2.jar is the additional .jar that we are loading separately, so that we can control CLASSPATH ordering. It is also stored in the local directory of the YARN container.

    2) spark.executor.extraClassPath gives the classpath to the jar files internally in the Yarn container. This is why it contains the parts ./janusgraph-0.1.1-hadoop2.jar and ./lib.zip. Just because it loaded the jars into the container, doesn't mean it knows how to access them automatically.

    Therefore, it is important to understand that there is a two-step process: one to load the jars into the container and one to point the program inside the container to the jars. It seems obvious now, but it certainly was not at the time.

    1. Hi John, Thanks for the feedback. I put it in the Tinkerpop and Janusgraph on Spark/Yarn recipes.
