
2019-07-18

Running the latest Apache Spark version on an existing Hadoop cluster

The other day I heard that two colleagues had managed to run the latest Apache Spark release on an ageing HDP-2.6.x Hadoop cluster. I figured that was cool, because I had tried to run Apache TinkerPop's OLAP queries on the same Hadoop cluster without success, and I knew that a Google search on this issue did not return any usable resources.

While replaying their experiment on my own machine, I hit upon a small configuration issue that led me to an earlier description of the central idea needed to run vanilla Apache Spark on a commercial Hadoop distribution. It seems, however, that the author used a potentially offensive word in his blog title (which I will not repeat here for obvious reasons) that prevented the blog from appearing in the top 10 results of any Google query on the subject. So, the main intention of my current blog is to get this useful information higher in the Google search results. Posting on blogger.com might also help with this. While I am at it, I will provide some additional details.

The central idea is to use the vanilla spark-2.4.3-bin-without-hadoop binary distribution. At first sight this seems counterintuitive: Spark provides binary distributions for the various Hadoop versions, and the distribution without Hadoop seems geared only towards a stand-alone Spark cluster. On second thought, however, it is only logical: the compatibility issues between vanilla Spark and commercial Hadoop distributions arise from the fact that commercial parties like Cloudera, (former) Hortonworks and MapR backport new Hadoop features into older Hadoop versions to satisfy their need for "stable" versions. The issue I ran into with HDP-2.6.x was that Hadoop services could raise HA-related exceptions that are not known to the vanilla Hadoop client, which renders any vanilla Spark version unusable as well. By using spark-*-without-hadoop you can simply add your cluster's Hadoop binaries to your application classpath and everything will be fine.
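
Concretely, the spark-*-without-hadoop build expects you to supply the Hadoop classes yourself via SPARK_DIST_CLASSPATH. You can export it in the shell just before submitting, as in the examples further down, or persist it in the distribution's conf/spark-env.sh. A minimal sketch (the /usr/hdp/current/hadoop-client path is an HDP convention; adjust it to wherever your distribution installs the Hadoop client):

# conf/spark-env.sh in /opt/spark-2.4.3-bin-without-hadoop
# Let the cluster's own Hadoop client supply the Hadoop classes:
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
# Equivalent, if the hadoop executable is not on your PATH:
# export SPARK_DIST_CLASSPATH=$(/usr/hdp/current/hadoop-client/bin/hadoop classpath)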

Of course, there are still some catches. Apparently, in the case I described, Hortonworks only modified the APIs of the Hadoop services and left the API of the Hadoop client untouched. One day, however, a provider could decide to "optimize" the interplay between the Hadoop client and Spark. Also, putting the complete Hadoop client binaries on your classpath bears the risk of additional dependency conflicts compared to the set of transitive dependencies that you would already get from just using Spark.
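
If you want a quick impression of that risk, the rough sketch below lists artifact names that occur both on the cluster's Hadoop classpath and in the Spark distribution's jars directory (names only, so versions may still differ). It assumes a GNU userland and that your Hadoop version supports hadoop classpath --glob:

# List jar base names present on both the Hadoop classpath and in the
# Spark distribution (candidates for version conflicts).
comm -12 \
  <(hadoop classpath --glob | tr ':' '\n' | xargs -rn1 basename | sed 's/-[0-9].*//' | sort -u) \
  <(ls /opt/spark-2.4.3-bin-without-hadoop/jars/ | sed 's/-[0-9].*//' | sort -u)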

After seeing the logic of using spark-*-without-hadoop, the actual job configuration is surprisingly simple. The examples below assume that you have the spark-2.4.3-bin-without-hadoop distribution available in /opt. You only need the distribution on your local machine, not on the cluster. For the particular case of HDP, the various configuration files present in /etc/hadoop/conf require the hdp.version system property to be set on the JVMs of both the Spark driver and the YARN application master, with a value that matches your cluster's HDP stack version (2.6.2.0-205 in the examples below). Other commercial distributions may have similar requirements.
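
To see why this is needed, you can look for the placeholder in the client configuration: on HDP you will typically find properties such as mapreduce.application.classpath containing paths of the form /usr/hdp/${hdp.version}/..., which cannot be resolved unless the property is set (the exact files and properties depend on your HDP release):

# Show which client configuration files reference the ${hdp.version} placeholder:
grep -Rl 'hdp.version' /etc/hadoop/conf/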

Happy sparking!


Spark (Scala/Java):

export SPARK_HOME=/opt/spark-2.4.3-bin-without-hadoop
export SPARK_MAJOR_VERSION=2
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
export HADOOP_CONF_DIR=/etc/hadoop/conf/

$SPARK_HOME/bin/spark-submit --master yarn --deploy-mode client \
            --class org.apache.spark.examples.SparkPi \
            --conf "spark.driver.extraJavaOptions=-Dhdp.version=2.6.2.0-205" \
            --conf "spark.yarn.am.extraJavaOptions=-Dhdp.version=2.6.2.0-205" \
            $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.3.jar


PySpark:

export SPARK_HOME=/opt/spark-2.4.3-bin-without-hadoop
export SPARK_MAJOR_VERSION=2
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
export HADOOP_CONF_DIR=/etc/hadoop/conf/
export PYSPARK_PYTHON=/opt/rh/rh-python36/root/usr/bin/python

$SPARK_HOME/bin/spark-submit --master yarn --deploy-mode client \
            --conf "spark.driver.extraJavaOptions=-Dhdp.version=2.6.2.0-205" \
            --conf "spark.yarn.am.extraJavaOptions=-Dhdp.version=2.6.2.0-205" \
            $SPARK_HOME/examples/src/main/python/pi.py