
2017-07-06

Configuring JanusGraph for Spark-Yarn

JanusGraph is an Apache Tinkerpop implementation of a graph database; it builds on a number of popular storage and search backends (e.g. Apache HBase, Apache Cassandra, Apache Solr, Elasticsearch). While the provisioning of a storage backend points mainly to transactional use (OLTP), the storage backend connectors also come with InputFormat classes for analytical use with Hadoop/Spark (OLAP), in particular with the Tinkerpop GraphComputer implementations for the Tinkerpop HadoopGraph.
In this post, I describe how to configure the JanusGraph-0.1.1 binary distribution for use with an existing Hadoop/HBase/Spark-Yarn cluster. This post is a follow-up on an earlier post in which I did the same for the bare Apache Tinkerpop-3.2.3 distribution.

Some background

The configurations for running OLAP queries on JanusGraph turn out to be more complicated than for Tinkerpop. While the JanusGraph team did a great job in releasing the Titan fork so soon after their foundation, the janusgraph-hbase module still has some particular quirks:
  • correct operation depends on the order of items on the classpath (e.g. because janusgraph-hbase contains a literal copy of the guava-12 StopWatch class, while the classpath also needs the guava-18 dependency)
  • the hbase cluster configs need to be present in both the hbase-site.xml file on the classpath and in the HadoopGraph properties file
Running Apache Spark as an Apache Hadoop yarn-client application results in a distributed setup of Java containers (JVMs), each with its own classpath:
  • the client application (e.g. the gremlin console) runs the yarn client and contains the spark driver and the SparkContext
  • Yarn runs the spark cluster manager in a separate container on the cluster, the Yarn ApplicationMaster
  • Yarn runs a separate container for each requested spark executor

Prerequisites

The prerequisites for following the recipe for configuring JanusGraph are the same as for Tinkerpop in my earlier post, so I will not repeat them. Where the current recipe makes JanusGraph perform an OLAP job on a graph persisted in HBase, the earlier Tinkerpop recipe performs an OLAP job on a graph persisted as a kryo file on hdfs.
Note that:
  • now you need the JanusGraph-0.1.1 binary distribution instead of the Tinkerpop-3.2.3 distribution;
  • the JanusGraph distribution already includes the hadoop-gremlin and spark-gremlin plugins. The plugin libs are simply present in the lib folder, and the jars to be added are assumed to be in a lib2 folder relative to the root of the JanusGraph distribution (see the sketch after this list);
  • in fact, you can also follow the Tinkerpop for Spark-Yarn recipe using the JanusGraph-0.1.1 distribution, provided that you replace the single occurrence of '/ext/spark-gremlin' in the configs by '/'.
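
As an illustration of the layout assumed here: the jars to be added to lib2 are the same spark-yarn jars gathered in the Tinkerpop recipe further down, and the lib.zip archive referenced in the properties file below is assumed to be a flat zip of the lib and lib2 jars (a sketch, not something prescribed by the JanusGraph distribution):

# From the root of the JanusGraph distribution (jar set and flat layout are assumptions)
mkdir -p lib2      # put spark-yarn_2.10-1.6.1.jar and friends here, see the Tinkerpop recipe below
zip -j lib.zip lib/*.jar lib2/*.jar    # flat archive referenced by spark.yarn.dist.archives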

Configuration

Create a shell script (e.g. bin/jg011.sh) with the following contents:

#!/bin/bash

GREMLIN_HOME=/home/yourdir/JG011binary
cd $GREMLIN_HOME

# Have janusgraph find the hadoop and hbase cluster configs and spark-yarn dependencies
export CLASSPATH=/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hbase-client/conf:$GREMLIN_HOME/lib/*:$GREMLIN_HOME/lib2/*

# Have hadoop find its native libraries
export JAVA_OPTIONS="-Djava.library.path=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64"

# Does not work for spark-yarn, see spark.yarn.appMasterEnv.CLASSPATH and
# spark.executor.extraClassPath. Set nevertheless to get rid of the warning.
export HADOOP_GREMLIN_LIBS=$GREMLIN_HOME/empty

bin/gremlin.sh

Create the file conf/hadoop-graph/read-hbase-spark-yarn.properties:

#
# Hadoop Graph Configuration
#
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.janusgraph.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat

gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output

#
# JanusGraph HBase InputFormat configuration
#
janusgraphmr.ioformat.conf.storage.backend=hbase
#janusgraphmr.ioformat.conf.storage.hostname=fqdn1,fqdn2,fqdn3
janusgraphmr.ioformat.conf.storage.hostname=127.0.0.1
janusgraphmr.ioformat.conf.storage.hbase.table=janusgraph
zookeeper.znode.parent=/hbase-unsecure
# Security configs are needed in case of a secure cluster
#zookeeper.znode.parent=/hbase-secure
#hbase.rpc.protection=privacy
#hbase.security.authentication=kerberos

#
# SparkGraphComputer with Yarn Configuration
#
spark.master=yarn-client
spark.executor.memory=512m
spark.serializer=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoSerializer
spark.yarn.dist.archives=/home/yourdir/JG011binary/lib.zip
spark.yarn.dist.files=/home/yourdir/JG011binary/janusgraph-hbase-0.1.1.jar
spark.yarn.appMasterEnv.CLASSPATH=/usr/hdp/current/hadoop-client/conf:./lib.zip/*:
spark.executor.extraClassPath=/usr/hdp/current/hadoop-client/conf:/usr/hdp/current/hbase-client/conf:janusgraph-hbase-0.1.1.jar:./lib.zip/*
spark.driver.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.executor.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64

#
# Relevant configs from spark-defaults.conf
#
spark.eventLog.dir hdfs:///spark-history
spark.eventLog.enabled true
spark.yarn.historyServer.address sandbox.hortonworks.com:18080
spark.history.fs.logDirectory hdfs:///spark-history
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18080
spark.history.kerberos.enabled false
spark.history.kerberos.keytab none
spark.history.kerberos.principal none

spark.yarn.am.waitTime 10
spark.yarn.containerLauncherMaxThreads 25
spark.yarn.executor.memoryOverhead 384
spark.yarn.preserve.staging.files false
spark.yarn.queue default
spark.yarn.scheduler.heartbeat.interval-ms 5000
spark.yarn.submit.file.replication 3
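
Before running the demo, it may help to verify that the HBase-related values above match the cluster's own hbase-site.xml. A quick check might look as follows (paths assume the HDP layout used in the run script; the janusgraph table only appears after the graph is first loaded in the demo below):

# Compare the cluster configs with the values in the properties file (HDP-style paths assumed)
grep -A1 'zookeeper.znode.parent' /usr/hdp/current/hbase-client/conf/hbase-site.xml
grep -A1 'hbase.zookeeper.quorum' /usr/hdp/current/hbase-client/conf/hbase-site.xml
# After loading the Graph of the Gods, the janusgraph table should show up
echo "list 'janusgraph'" | hbase shell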


Demonstration

If you followed the recipe this far, you are ready to run your own demo:

[root@sandbox ~]# . /home/yourdir/bin/jg011.sh

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: janusgraph.imports
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/yourdir/JG011binary/lib/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/yourdir/JG011binary/lib/logback-classic-1.1.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19:07:43,534  INFO HadoopGraph:87 - HADOOP_GREMLIN_LIBS is set to: /home/yourdir/JG011binary/empty
plugin activated: tinkerpop.hadoop
plugin activated: tinkerpop.spark
plugin activated: tinkerpop.tinkergraph

// Loading a graph into the default janusgraph table
gremlin> graph = JanusGraphFactory.open('conf/janusgraph-hbase.properties')
==>standardjanusgraph[hbase:[127.0.0.1]]
gremlin> GraphOfTheGodsFactory.loadWithoutMixedIndex(graph,true)
==>null
gremlin> g=graph.traversal()
==>graphtraversalsource[standardjanusgraph[hbase:[127.0.0.1]], standard]
gremlin> g.V().count()
19:10:45,921  WARN StandardJanusGraphTx:1273 - Query requires iterating over all vertices [()]. For better performance, use indexes
==>12

// Loading of a HadoopGraph from janusgraph's hbase table
gremlin> graph = GraphFactory.open('conf/hadoop-graph/read-hbase-spark-yarn.properties')
==>hadoopgraph[hbaseinputformat->gryooutputformat]
gremlin> g = graph.traversal().withComputer(SparkGraphComputer)
==>graphtraversalsource[hadoopgraph[hbaseinputformat->gryooutputformat], sparkgraphcomputer]
gremlin> g.V().count()
==>12
gremlin>

Final remarks

  • As with the Tinkerpop recipe, the current recipe was also tested on a real, secure spark-yarn cluster (HDP-2.5.3.0-37). Some configs are superfluous on the sandbox but are needed on the real cluster (e.g. the spark.yarn.dist.files property);
  • As with the Tinkerpop recipe, the current recipe should also work for client applications other than the gremlin console, as long as you do not use spark-submit or spark-shell. Getting your dependencies right turns out to be tedious, though (maybe a topic for another post);
  • Critical readers may note that the CLASSPATH in the run script contains a lot of superfluous items, because the gremlin.sh script already puts most of them on the classpath. Leaving $GREMLIN_HOME/lib/* out, however, interferes with the logging configuration in ways I still do not understand.

2017-06-29

Configuring Apache Tinkerpop for Spark-Yarn

Apache TinkerPop is a graph computing framework for both graph databases (OLTP) and graph analytic systems (OLAP). Today, we focus on the OLAP part using Tinkerpop's own HadoopGraph and SparkGraphComputer implementations. While all OLAP tests and examples in the reference documentation run fine on a standalone machine, many people, including myself, have spent days and days trying to get OLAP working on an actual Spark-Yarn cluster:
https://groups.google.com/forum/#!searchin/gremlin-users/yarn%7Csort:relevance
https://groups.google.com/forum/#!searchin/aureliusgraphs/yarn%7Csort:relevance
https://groups.google.com/forum/#!searchin/janusgraph-users/yarn%7Csort:relevance

With more experience in jar hell than a few years back, and with the recent Tinkerpop-based JanusGraph initiative as a stimulus, I decided to give the configuration of Apache TinkerPop for Spark-Yarn another try.

Prerequisites

The recipes below were tested on a Hortonworks Data Platform sandbox (version 2.5.0.0-1245) and on an actual multi-node HDP cluster (version 2.5.3.0-37), but I expect the recipes to work on other hadoop distributions or a vanilla hadoop cluster as well.
I assume that you have downloaded the Hortonworks VM, imported it into Virtualbox, can ssh into the sandbox and can upload files to it (e.g. using an sftp:// location in Nautilus). Beware: change the sandbox's network adapter from NAT to "bridged" and use port 2222 to ssh into the actual Docker container inside the sandbox. Of course, you can also use putty and winscp to the same effect.
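
For example (the bridged IP address below is just the one that recurs later in this post; yours will differ, and port 22 would land you on the VM rather than in the Docker container):

# ssh into the Docker container inside the sandbox
ssh -p 2222 root@192.168.178.75
# upload one of the jars gathered below (note: scp takes a capital -P for the port)
scp -P 2222 spark-yarn_2.10-1.6.1.jar root@192.168.178.75:/root/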

You will also need the following jars (commands from within the sandbox):
wget http://central.maven.org/maven2/org/apache/spark/spark-yarn_2.10/1.6.1/spark-yarn_2.10-1.6.1.jar 
wget http://central.maven.org/maven2/com/google/inject/extensions/guice-servlet/3.0/guice-servlet-3.0.jar
wget http://central.maven.org/maven2/org/apache/hadoop/hadoop-yarn-server-web-proxy/2.2.0/hadoop-yarn-server-web-proxy-2.2.0.jar
wget http://central.maven.org/maven2/org/scala-lang/scala-reflect/2.10.5/scala-reflect-2.10.5.jar

Ideally, the Tinkerpop spark-gremlin plugin would have the spark-yarn dependency, so that the corresponding jars are included in the binary distribution.
If your hadoop cluster is configured with lzo compression, you will also need the hadoop-lzo jar. In case of the Hortonworks sandbox you will find it at:
cp /usr/hdp/current/hadoop-client/lib/hadoop-lzo-0.6.0.2.5.0.0-1245.jar .

Finally, you will need the actual Tinkerpop binary distribution for the gremlin console. I used version 3.2.3 because of its compatibility with JanusGraph-0.1.1. Unpack this archive and install the spark-gremlin plugin (you do not need the hadoop-gremlin plugin). Copy the 4 (5) jars gathered above into an ext/spark-gremlin/lib2 folder relative to the root of the Tinkerpop distribution. Create a lib.zip archive in the root of the Tinkerpop distribution with all jars from the lib folder, the ext/spark-gremlin/lib folder and the ext/spark-gremlin/lib2 folder.
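
A minimal sketch of these steps, run from the root of the unpacked Tinkerpop distribution (assuming the jars gathered above sit in your home directory; the flat -j layout of the zip is my reading of the ./lib.zip/* classpath entries used further down):

# Install the spark-gremlin plugin once from the gremlin console, then quit:
#   gremlin> :install org.apache.tinkerpop spark-gremlin 3.2.3
#   gremlin> :q
mkdir -p ext/spark-gremlin/lib2
cp ~/spark-yarn_2.10-1.6.1.jar ~/guice-servlet-3.0.jar \
   ~/hadoop-yarn-server-web-proxy-2.2.0.jar ~/scala-reflect-2.10.5.jar \
   ext/spark-gremlin/lib2/
# Flat archive: the Yarn containers unpack it as a directory named lib.zip
zip -j lib.zip lib/*.jar ext/spark-gremlin/lib/*.jar ext/spark-gremlin/lib2/*.jar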

Configuration

Most configuration problems of Tinkerpop on Spark-Yarn have three causes:
  1. SparkGraphComputer creates its own SparkContext so it does not get any configs from the usual spark-submit command.

  2. The Tinkerpop gremlin-spark plugin does not include spark-yarn dependencies.

  3. Adding the cluster's spark-assembly to the classpath creates a host of version conflicts, because Spark 1.x dependency versions have remained frozen since 2014.
In the recipe below I follow a minimalist approach in which I only add the absolutely necessary dependencies in addition to the dependencies included in the Tinkerpop binary distribution. The sandbox/cluster spark installation is almost completely ignored, with the exception of the history service. This approach minimizes the chance of jar version conflicts. 
Note that I also ignore the instructions from the reference documentation regarding the use of HADOOP_GREMLIN_LIBS and the bin/hadoop/init-tp-spark.sh script. On a yarn cluster this approach runs into a bootstrap problem where the Spark application master does not have access to the file cache. Rather, I exploit the spark.yarn.dist.archives feature that does not have this problem and does not need write access to the datanodes either.

Create a shell script (e.g. bin/tp323.sh) with the following contents:
#!/bin/bash
GREMLIN_HOME=/home/yourdir/TP323binary
cd $GREMLIN_HOME

# Have tinkerpop find the hadoop cluster configs and spark-yarn dependencies
export CLASSPATH=/etc/hadoop/conf:$GREMLIN_HOME/ext/spark-gremlin/lib2/*

# Have hadoop find its native libraries
export JAVA_OPTIONS="-Djava.library.path=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64"

# Does not work for spark-yarn, see spark.yarn.appMasterEnv.CLASSPATH and 
# spark.executor.extraClassPath. Set nevertheless to get rid of the warning.
export HADOOP_GREMLIN_LIBS=$GREMLIN_HOME/empty

bin/gremlin.sh


Create the file conf/hadoop/hadoop-gryo-yarn.properties:

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.defaultGraphComputer=org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer

gremlin.hadoop.inputLocation=tinkerpop-modern.kryo
gremlin.hadoop.outputLocation=output

##############################################
# SparkGraphComputer with Yarn Configuration #
##############################################
spark.master=yarn-client
spark.executor.memory=512m
spark.serializer=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoSerializer
spark.yarn.dist.archives=/home/yourdir/TP323binary/lib.zip
spark.yarn.appMasterEnv.CLASSPATH=./lib.zip/*:/usr/hdp/current/hadoop-client/conf
spark.executor.extraClassPath=./lib.zip/*:/usr/hdp/current/hadoop-client/conf
spark.driver.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.executor.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
 
#############################################
# Relevant configs from spark-defaults.conf #
#############################################
spark.eventLog.dir hdfs:///spark-history
spark.eventLog.enabled true
spark.yarn.historyServer.address sandbox.hortonworks.com:18080
spark.history.fs.logDirectory hdfs:///spark-history
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18080
spark.history.kerberos.keytab none
spark.history.kerberos.principal none

spark.yarn.am.waitTime 10
spark.yarn.containerLauncherMaxThreads 25
spark.yarn.executor.memoryOverhead 384
spark.yarn.preserve.staging.files false
spark.yarn.queue default
spark.yarn.scheduler.heartbeat.interval-ms 5000
spark.yarn.submit.file.replication 3


Note that:
  1. the spark.yarn.dist.archives property points to the archive with jars on the local file system and is distributed to the various Yarn containers. As a result, the lib.zip archive becomes available as a directory named lib.zip in the working directory of each Yarn container.
  2. spark.executor.extraClassPath specifies the classpath to the jar files as seen from inside the Yarn container; this is why it contains ./lib.zip/*. Just because a spark executor got the jars loaded into its container does not mean it knows how to access them automatically.

Demonstration

Now, you are ready for the demo.
[root@sandbox TP323binary]# hdfs dfs -put data/tinkerpop-modern.kryo .
[root@sandbox TP323binary]# . bin/tp323.sh

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.spark
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph
gremlin> graph = GraphFactory.open('conf/hadoop/hadoop-gryo-yarn.properties')
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/yourdir/TP323binary/ext/spark-gremlin/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/yourdir/TP323binary/lib/slf4j-log4j12-1.7.21.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
==>hadoopgraph[gryoinputformat->gryooutputformat]
gremlin> g = graph.traversal().withComputer(SparkGraphComputer)
==>graphtraversalsource[hadoopgraph[gryoinputformat->gryooutputformat], sparkgraphcomputer]
gremlin> g.V().count()
==>6
gremlin>


If you still run into exceptions, the best way to see what is going on is to look in the yarn resource manager UI (http://192.168.178.75:8088/cluster) for your applicationId and get the logs using "yarn logs -applicationId application_1498627870374_0008" from the command shell.

Final remarks

A few final remarks:

  • I have not tested the recipe yet for a java/scala application. However, things should not be very different compared to the gremlin/groovy console, as long as you do not use spark-submit or spark-shell.
  • You may not like the idea that the hadoop and spark versions from the Tinkerpop binaries differ from the versions in your cluster. If so, just build Tinkerpop from source with the corresponding dependencies changed in the various pom files (e.g. spark-core_2.10-1.6.1.2.5.0.0-1245.jar instead of spark-core_2.10-1.6.1.jar).
  • While this recipe may save you a lot of time if you are new to TinkerPop, actually using TinkerPop OLAP will still require a fair amount of time unless you have deep knowledge of configuring JVM projects and Apache Spark.