
2017-06-29

Configuring Apache Tinkerpop for Spark-Yarn

Apache TinkerPop is a graph computing framework for both graph databases (OLTP) and graph analytic systems (OLAP). Today, we focus on the OLAP part using TinkerPop's own HadoopGraph and SparkGraphComputer implementations. While all OLAP tests and examples in the reference documentation run fine on a standalone machine, many people, including myself, have spent days and days trying to get OLAP working on an actual Spark-Yarn cluster:
https://groups.google.com/forum/#!searchin/gremlin-users/yarn%7Csort:relevance
https://groups.google.com/forum/#!searchin/aureliusgraphs/yarn%7Csort:relevance
https://groups.google.com/forum/#!searchin/janusgraph-users/yarn%7Csort:relevance

With more experience in jar hell than a few years back, and with the recent TinkerPop-based JanusGraph initiative as a stimulus, I decided to give the configuration of Apache TinkerPop for Spark-Yarn another try.

Prerequisites

The recipes below were tested on a Hortonworks Data Platform sandbox (version 2.5.0.0-1245) and on an actual multi-node HDP cluster (version 2.5.3.0-37), but I expect the recipes to work on other Hadoop distributions or on a vanilla Hadoop cluster as well.
I assume that you have downloaded the Hortonworks VM, imported it into VirtualBox, can ssh into the sandbox and can upload files to it (e.g. using an sftp:// location in Nautilus). Beware: switch the sandbox network adapter from NAT to "bridged" and use port 2222 to ssh into the actual Docker container inside the sandbox. Of course, you can also use PuTTY and WinSCP to the same effect.

You will also need the following jars (commands from within the sandbox):
wget http://central.maven.org/maven2/org/apache/spark/spark-yarn_2.10/1.6.1/spark-yarn_2.10-1.6.1.jar 
wget http://central.maven.org/maven2/com/google/inject/extensions/guice-servlet/3.0/guice-servlet-3.0.jar
wget http://central.maven.org/maven2/org/apache/hadoop/hadoop-yarn-server-web-proxy/2.2.0/hadoop-yarn-server-web-proxy-2.2.0.jar
wget http://central.maven.org/maven2/org/scala-lang/scala-reflect/2.10.5/scala-reflect-2.10.5.jar

Ideally, the TinkerPop spark-gremlin plugin would declare the spark-yarn dependency, so that the corresponding jars were included in the binary distribution.
If your Hadoop cluster is configured with lzo compression, you will also need the hadoop-lzo jar. In the case of the Hortonworks sandbox you will find it at:
cp /usr/hdp/current/hadoop-client/lib/hadoop-lzo-0.6.0.2.5.0.0-1245.jar .

Finally, you will need the actual TinkerPop binaries for the gremlin console. I used version 3.2.3 because of its compatibility with JanusGraph-0.1.1. Unpack the archive and install the spark-gremlin plugin (you do not need the hadoop-gremlin plugin). Copy the 4 (or 5, with lzo) jars gathered above into an ext/spark-gremlin/lib2 folder relative to the root of the TinkerPop distribution. Then create a lib.zip archive in the root of the TinkerPop distribution containing all jars from the lib folder, the ext/spark-gremlin/lib folder and the ext/spark-gremlin/lib2 folder.

Configuration

Most configuration problems of TinkerPop on Spark-Yarn stem from three causes:
  1. SparkGraphComputer creates its own SparkContext so it does not get any configs from the usual spark-submit command.

  2. The TinkerPop spark-gremlin plugin does not include the spark-yarn dependencies.

  3. Adding the cluster's spark-assembly to the classpath creates a host of version conflicts, because Spark 1.x dependency versions have remained frozen since 2014.

In the recipe below I follow a minimalist approach, adding only the absolutely necessary dependencies on top of those included in the TinkerPop binary distribution. The sandbox/cluster Spark installation is ignored almost completely, with the exception of the history service. This approach minimizes the chance of jar version conflicts.
Note that I also ignore the instructions from the reference documentation regarding the use of HADOOP_GREMLIN_LIBS and the bin/hadoop/init-tp-spark.sh script. On a Yarn cluster this approach runs into a bootstrap problem in which the Spark application master does not have access to the file cache. Rather, I exploit the spark.yarn.dist.archives feature, which does not have this problem and does not need write access to the datanodes either.

Create a shell script (e.g. bin/tp323.sh) with the following contents:
#!/bin/bash
GREMLIN_HOME=/home/yourdir/TP323binary
cd $GREMLIN_HOME

# Have tinkerpop find the hadoop cluster configs and spark-yarn dependencies
export CLASSPATH=/etc/hadoop/conf:$GREMLIN_HOME/ext/spark-gremlin/lib2/*

# Have hadoop find its native libraries
export JAVA_OPTIONS="-Djava.library.path=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64"

# Does not work for spark-yarn, see spark.yarn.appMasterEnv.CLASSPATH and 
# spark.executor.extraClassPath. Set nevertheless to get rid of the warning.
export HADOOP_GREMLIN_LIBS=$GREMLIN_HOME/empty

bin/gremlin.sh


Create the file conf/hadoop/hadoop-gryo-yarn.properties:

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.defaultGraphComputer=org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer

gremlin.hadoop.inputLocation=tinkerpop-modern.kryo
gremlin.hadoop.outputLocation=output

##############################################
# SparkGraphComputer with Yarn Configuration #
##############################################
spark.master=yarn-client
spark.executor.memory=512m
spark.serializer=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoSerializer
spark.yarn.dist.archives=/home/yourdir/TP323binary/lib.zip
spark.yarn.appMasterEnv.CLASSPATH=./lib.zip/*:/usr/hdp/current/hadoop-client/conf
spark.executor.extraClassPath=./lib.zip/*:/usr/hdp/current/hadoop-client/conf
spark.driver.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.executor.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
 
#############################################
# Relevant configs from spark-defaults.conf #
#############################################
spark.eventLog.dir hdfs:///spark-history
spark.eventLog.enabled true
spark.yarn.historyServer.address sandbox.hortonworks.com:18080
spark.history.fs.logDirectory hdfs:///spark-history
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18080
spark.history.kerberos.keytab none
spark.history.kerberos.principal none

spark.yarn.am.waitTime 10
spark.yarn.containerLauncherMaxThreads 25
spark.yarn.executor.memoryOverhead 384
spark.yarn.preserve.staging.files false
spark.yarn.queue default
spark.yarn.scheduler.heartbeat.interval-ms 5000
spark.yarn.submit.file.replication 3


Note that:
  1. The spark.yarn.dist.archives property points to the archive with jars on the local file system and is loaded into the various Yarn containers. As a result, the lib.zip archive becomes available as a directory named lib.zip inside each Yarn container.
  2. spark.executor.extraClassPath gives the classpath to the jar files inside the Yarn container; this is why it contains ./lib.zip/*. Just because a Spark executor got the jars loaded into its container does not mean it knows how to access them automatically.

Demonstration

Now, you are ready for the demo.
[root@sandbox TP323binary]# hdfs dfs -put data/tinkerpop-modern.kryo .
[root@sandbox TP323binary]# . bin/tp323.sh

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.spark
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph
gremlin> graph = GraphFactory.open('conf/hadoop/hadoop-gryo-yarn.properties')
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/yourdir/TP323binary/ext/spark-gremlin/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/yourdir/TP323binary/lib/slf4j-log4j12-1.7.21.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
==>hadoopgraph[gryoinputformat->gryooutputformat]
gremlin> g = graph.traversal().withComputer(SparkGraphComputer)
==>graphtraversalsource[hadoopgraph[gryoinputformat->gryooutputformat], sparkgraphcomputer]
gremlin> g.V().count()
==>6
gremlin>


If you still run into exceptions, the best way to see what is going on is to look in the Yarn resource manager UI (http://192.168.178.75:8088/cluster) for your applicationId and get the logs with "yarn logs -applicationId application_1498627870374_0008" from the command shell.
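Once you have saved the logs to a file, a quick scan for the root cause could look as follows (the applicationId and file name are just examples; classpath problems typically surface as ClassNotFoundException or NoSuchMethodError):

```shell
# Save the aggregated Yarn logs of the failed application to a file ...
yarn logs -applicationId application_1498627870374_0008 > app.log
# ... and show the first exception or error with some context lines
grep -m 1 -A 10 -E 'Exception|Error' app.log
```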

Final remarks

A few final remarks:

  • I have not yet tested the recipe for a java/scala application. However, things should not be very different compared to the gremlin/groovy console, as long as you do not use spark-submit or spark-shell.
  • You may not like the idea that the Hadoop and Spark versions in the TinkerPop binaries differ from the versions in your cluster. If so, just build TinkerPop from source with the corresponding dependencies changed in the various pom files (e.g. spark-core_2.10-1.6.1.2.5.0.0-1245.jar instead of
    spark-core_2.10-1.6.1.jar).
  • While this recipe may save you a lot of time if you are new to TinkerPop, actually using TinkerPop OLAP will still require a fair amount of time unless you have deep knowledge of configuring JVM projects and Apache Spark.