
2015-11-19

Spark with Python 2.7 on the Hortonworks Data Platform on RHEL 6

The issue

As a data scientist, working on a secure Hadoop cluster is not always a pleasure. You cannot simply change the cluster configuration, but rather have to make your way through the various configuration options provided by the platform's clients. Today, I address a situation that anyone working for a large enterprise or public organization may encounter: you are allowed to submit Spark jobs on a secure cluster based on the Hortonworks Data Platform 2.x and Red Hat Enterprise Linux 6.x. You want to use Python 2.7, but Spark uses the system's default Python 2.6, while Python 2.7 is only available through Red Hat's Software Collections. Here is what I did.


Accessing Python 2.7 from Spark

Spark's spark-submit script honors the PYSPARK_PYTHON environment variable and applies it to both the driver program and the executors. However, PYSPARK_PYTHON only accepts a path to a Python executable, while we want to invoke Red Hat's Python 2.7 through the bash expression
scl enable python27 "python $*"
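
To see the mismatch on a stock RHEL 6 machine, compare the system interpreter with the one from the software collection (the exact 2.7.x version depends on the collection release):

python --version                          # Python 2.6.x, the RHEL 6 default
scl enable python27 "python --version"    # Python 2.7.x from Software Collections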
The solution is to create two bash scripts, provided below, and put them in your $HOME/bin. The first script provides the Python executable that can be assigned to PYSPARK_PYTHON. In passing, it also solves an issue related to the Python egg cache (see the next section about Python eggs). The second script is a wrapper around the HDP spark-submit script that makes the python27 script available in Spark's execution environments.
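
A minimal sketch of what the first script, here called python27, might look like; the PYTHON_EGG_CACHE line anticipates the egg cache fix discussed in the next section, and the /tmp location is an assumption (any node-local, writable directory will do):

#!/bin/bash
# $HOME/bin/python27: run Red Hat's SCL Python 2.7 so that this script
# can be assigned to PYSPARK_PYTHON.
# Assumption: point the egg cache at a node-local, writable directory
# (see the next section on Python eggs).
export PYTHON_EGG_CACHE="/tmp/$(whoami)/.python-eggs"
exec scl enable python27 "python $*"

And a sketch of the second script; the name spark-submit27 and the HDP client path are illustrative, so adjust them to your installation:

#!/bin/bash
# $HOME/bin/spark-submit27: submit Spark jobs that run under Python 2.7.
# spark-submit picks up PYSPARK_PYTHON for both the driver and the
# executors; this assumes $HOME/bin/python27 is reachable on the worker
# nodes as well (e.g. via an NFS-mounted home directory).
export PYSPARK_PYTHON="$HOME/bin/python27"
exec /usr/hdp/current/spark-client/bin/spark-submit "$@"

With both scripts marked executable (chmod +x $HOME/bin/python27 $HOME/bin/spark-submit27), jobs can then be submitted as usual, for example:

spark-submit27 --master yarn-client my_job.py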