
2017-07-06

Configuring JanusGraph for Spark-Yarn

JanusGraph is an Apache Tinkerpop implementation of a graph database that can be layered on a number of popular storage and search backends (e.g. Apache HBase, Apache Cassandra, Apache Solr, Elasticsearch). While the need to provision a storage backend suggests mainly transactional use (OLTP), the storage backend connectors also ship InputFormat classes for analytical use with Hadoop/Spark (OLAP), in particular with the Tinkerpop GraphComputer implementations for the Tinkerpop HadoopGraph.
In this post, I describe how to configure the JanusGraph-0.1.1 binary distribution for use with an existing Hadoop/HBase/Spark-Yarn cluster. It is a follow-up to an earlier post in which I did the same for the bare Apache Tinkerpop-3.2.3 distribution.

Some background

The configurations for running OLAP queries on JanusGraph turn out to be more complicated than for Tinkerpop. While the JanusGraph team did a great job in releasing the Titan fork so soon after the project's foundation, the janusgraph-hbase module still has some particular quirks:
  • correct operation depends on the order of items on the classpath (e.g. because janusgraph-hbase contains a literal copy of the guava-12 Stopwatch class, while the classpath also needs the guava-18 dependency);
  • the hbase cluster configs need to be present both in the hbase-site.xml file on the classpath and in the HadoopGraph properties file.
Running Apache Spark as an Apache Hadoop yarn-client application results in a distributed setup of Java containers (JVMs), each with its own classpath:
  • the client application (e.g. the gremlin console) runs the yarn client and contains the spark driver and the SparkContext;
  • Yarn runs the spark cluster manager in a separate container on the cluster, the Yarn ApplicationMaster;
  • Yarn runs a separate container for each requested spark executor.

Prerequisites

The prerequisites for following the recipe for configuring JanusGraph are the same as for Tinkerpop in my earlier post, so I will not repeat them here. Whereas the current recipe has JanusGraph perform an OLAP job on a graph persisted in HBase, the earlier Tinkerpop recipe performed an OLAP job on a graph persisted as a kryo file on hdfs.
Note that:
  • you now need the JanusGraph-0.1.1 binary distribution instead of the Tinkerpop-3.2.3 distribution;
  • the JanusGraph distribution already includes the hadoop-gremlin and spark-gremlin plugins. The plugin libs are simply present in the lib folder, and the jars to be added are assumed to be in the lib2 folder relative to the root of the JanusGraph distribution;
  • in fact, you can also follow the Tinkerpop for Spark-Yarn recipe using the JanusGraph-0.1.1 distribution, provided that you replace the single occurrence of '/ext/spark-gremlin' in the configs by '/'.

Configuration

Create a shell script (e.g. bin/jg011.sh) with the following contents:

#!/bin/bash

GREMLIN_HOME=/home/biko/JG011binary
cd $GREMLIN_HOME

# Have janusgraph find the hadoop and hbase cluster configs and spark-yarn dependencies
export CLASSPATH=/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hbase-client/conf:$GREMLIN_HOME/lib/*:$GREMLIN_HOME/lib2/*

# Have hadoop find its native libraries
export JAVA_OPTIONS="-Djava.library.path=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64"

# Does not work for spark-yarn, see spark.yarn.appMasterEnv.CLASSPATH and
# spark.executor.extraClassPath. Set nevertheless to get rid of the warning.
export HADOOP_GREMLIN_LIBS=$GREMLIN_HOME/empty

bin/gremlin.sh
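
Since a single missing or mistyped entry in this CLASSPATH shows up only as obscure runtime errors, it may help to verify beforehand that every entry resolves to something that actually exists. A minimal sketch of such a check (the helper and the example entries are illustrative, not part of the distribution):

```shell
# check_classpath "A:B:C": print every colon-separated classpath entry
# that does not resolve to an existing file, directory, or wildcard match.
check_classpath() {
  local IFS=':'
  local e
  for e in $1; do
    # ls succeeds if the entry (or its wildcard expansion) exists
    ls $e >/dev/null 2>&1 || echo "missing: $e"
  done
}

# Example with the entries from bin/jg011.sh:
GREMLIN_HOME=/home/biko/JG011binary
check_classpath "/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hbase-client/conf:$GREMLIN_HOME/lib/*:$GREMLIN_HOME/lib2/*"
```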

Create the file conf/hadoop-graph/read-hbase-spark-yarn.properties:

#
# Hadoop Graph Configuration
#
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.janusgraph.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat

gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output

#
# JanusGraph HBase InputFormat configuration
#
janusgraphmr.ioformat.conf.storage.backend=hbase
#janusgraphmr.ioformat.conf.storage.hostname=fqdn1,fqdn2,fqdn3
janusgraphmr.ioformat.conf.storage.hostname=127.0.0.1
janusgraphmr.ioformat.conf.storage.hbase.table=janusgraph
zookeeper.znode.parent=/hbase-unsecure
# Security configs are needed in case of a secure cluster
#zookeeper.znode.parent=/hbase-secure
#hbase.rpc.protection=privacy
#hbase.security.authentication=kerberos

#
# SparkGraphComputer with Yarn Configuration
#
spark.master=yarn-client
spark.executor.memory=512m
spark.serializer=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoSerializer
spark.yarn.dist.archives=/home/biko/JG011binary/lib.zip
spark.yarn.dist.files=/home/biko/JG011binary/janusgraph-hbase-0.1.1.jar
spark.yarn.appMasterEnv.CLASSPATH=/usr/hdp/current/hadoop-client/conf:./lib.zip/*:
spark.executor.extraClassPath=/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hbase-client/conf:janusgraph-hbase-0.1.1.jar:./lib.zip/*
spark.driver.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.executor.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64

#
# Relevant configs from spark-defaults.conf
#
spark.eventLog.dir hdfs:///spark-history
spark.eventLog.enabled true
spark.yarn.historyServer.address sandbox.hortonworks.com:18080
spark.history.fs.logDirectory hdfs:///spark-history
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18080
spark.history.kerberos.enabled false
spark.history.kerberos.keytab none
spark.history.kerberos.principal none

spark.yarn.am.waitTime 10
spark.yarn.containerLauncherMaxThreads 25
spark.yarn.executor.memoryOverhead 384
spark.yarn.preserve.staging.files false
spark.yarn.queue default
spark.yarn.scheduler.heartbeat.interval-ms 5000
spark.yarn.submit.file.replication 3


Demonstration

If you followed the recipe this far, you are ready to run your own demo:

[root@sandbox ~]# . /home/yourdir/bin/jg011.sh

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: janusgraph.imports
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/yourdir/JG011binary/lib/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/yourdir/JG011binary/lib/logback-classic-1.1.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19:07:43,534  INFO HadoopGraph:87 - HADOOP_GREMLIN_LIBS is set to: /home/yourdir/JG011binary/empty
plugin activated: tinkerpop.hadoop
plugin activated: tinkerpop.spark
plugin activated: tinkerpop.tinkergraph

// Loading a graph into the default janusgraph table
gremlin> graph = JanusGraphFactory.open('conf/janusgraph-hbase.properties')
==>standardjanusgraph[hbase:[127.0.0.1]]
gremlin> GraphOfTheGodsFactory.loadWithoutMixedIndex(graph,true)
==>null
gremlin> g=graph.traversal()
==>graphtraversalsource[standardjanusgraph[hbase:[127.0.0.1]], standard]
gremlin> g.V().count()
19:10:45,921  WARN StandardJanusGraphTx:1273 - Query requires iterating over all vertices [()]. For better performance, use indexes
==>12

// Loading of a HadoopGraph from janusgraph's hbase table
gremlin> graph = GraphFactory.open('conf/hadoop-graph/read-hbase-spark-yarn.properties')
==>hadoopgraph[hbaseinputformat-gryooutputformat]
gremlin> g = graph.traversal().withComputer(SparkGraphComputer)
==>graphtraversalsource[hadoopgraph[hbaseinputformat-gryooutputformat], sparkgraphcomputer]
gremlin> g.V().count()
==>12
gremlin>

Final remarks

  • As with the Tinkerpop recipe, the current recipe was also tested on a real, secure spark-yarn cluster (HDP-2.5.3.0-37). Some configs are superfluous on the sandbox but are needed on the real cluster (e.g. the spark.yarn.dist.files property);
  • As with the Tinkerpop recipe, the current recipe should also work for client applications other than the gremlin console, as long as you do not use spark-submit or spark-shell. Getting your dependencies right turns out to be tedious, though (maybe a subject for another post);
  • Critical readers may note that the CLASSPATH in the run script contains a lot of superfluous items, because the gremlin.sh script already puts most of them on the classpath. Leaving $GREMLIN_HOME/lib/* out, however, interferes with the logging configuration in ways I still do not understand.