
2015-11-19

Spark with Python 2.7 on the Hortonworks Data Platform on RHEL6

The issue

As a data scientist, working on a secure Hadoop cluster is not always a pleasure. You cannot simply change the cluster configuration, but rather have to make your way through the various configuration options provided by the platform's clients. Today, I address a situation that anyone working for a large enterprise or public organization may encounter: you are allowed to submit Spark jobs on a secure cluster based on the Hortonworks Data Platform 2.x and RedHat Enterprise Linux 6.x. You want to use Python 2.7, but Spark uses the system's default Python 2.6, and Python 2.7 is only available through RedHat's Software Collections. Here is what I did.


Accessing Python 2.7 from Spark

Spark's spark-submit script takes the PYSPARK_PYTHON environment variable into account and applies it to both the driver program and the executors. However, PYSPARK_PYTHON only accepts a path to a Python executable, while we want to access RedHat's Python 2.7 through the bash expression
scl enable python27 "python $*"
The solution is to create two bash scripts, provided below, and put them in your $HOME/bin. The first script provides the Python executable that can be assigned to PYSPARK_PYTHON. En passant, it also solves an issue related to the Python egg cache (see the next section about Python eggs). The second script is a wrapper around the HDP spark-submit script and makes the python27 script available in Spark's execution environments.
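
To see what this expression does, a quick check on the edge node compares the system default with the software collection (the version numbers in the comments are merely what you would expect on RHEL6):

# System default versus the python27 software collection
python --version                          # e.g. Python 2.6.6
scl enable python27 "python --version"    # should report Python 2.7.x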

$HOME/bin/python27

#!/bin/bash
# The default egg cache ($HOME/.python-eggs) is not accessible on the workers,
# so point it to the container's working directory instead
export PYTHON_EGG_CACHE=./.python-eggs
scl enable python27 "python $*"
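
Once made executable, the wrapper can be tested locally; it should report the same Python 2.7 interpreter as the scl expression above (a minimal check, assuming the script lives in $HOME/bin as stated):

chmod +x $HOME/bin/python27
$HOME/bin/python27 --version              # should report Python 2.7.x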

$HOME/bin/spark-submit-py27

#!/bin/bash
# Assumes spark-submit-py27 and python27 reside in your $HOME/bin
# PYSPARK_PYTHON is used - and should be valid - in both the 
# driver and executor environment
if ! [ -e python27 ]; then
  ln -s "$HOME/bin/python27" python27
  CREATED_PYLINK=true
fi  # Ugly hack, but easier than parsing all spark-submit arguments
export PYSPARK_PYTHON=./python27
spark-submit --files "$HOME/bin/python27" "$@"
if [ -n "$CREATED_PYLINK" ]; then
  rm python27
fi
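
With both scripts in place and executable, jobs are submitted through the wrapper instead of plain spark-submit; any regular spark-submit options pass through unchanged. The script name my_job.py and the options below are only an illustration:

chmod +x $HOME/bin/spark-submit-py27
$HOME/bin/spark-submit-py27 --master yarn --num-executors 2 my_job.py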

Passing Python eggs

Spark offers the possibility to pass Python dependencies as *.egg files (e.g. --py-files numpy-1.9.3-py2.7-linux-x86_64.egg). However, Python packages with C extensions must be built with the right Python version and for the target platform (the latter implies the need for a homogeneous cluster). So you had better build the eggs from source yourself, using "[python|python27] [setup.py|setupegg.py] bdist_egg" in the root of the Python package's source distribution. For Python eggs to be usable on the Spark workers, the selected Python environment needs to have setuptools installed.
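
As an illustration, the numpy egg used in the example below can be built along these lines (the version number and directory are merely illustrative; numpy's source distribution ships a setupegg.py for exactly this purpose):

# Check that setuptools is available in the python27 collection
scl enable python27 "python -c 'import setuptools'"
# Build the egg in the root of the numpy source distribution
cd numpy-1.9.3
$HOME/bin/python27 setupegg.py bdist_egg   # the egg ends up in dist/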

Example

The example below combines the use of Python 2.7 with passing a Python egg that contains C extensions.

pysparknumpy.py

# Run with: spark-submit-py27 --name PysparkNumpy --master yarn \
#  --py-files numpy-1.9.3-py2.7-linux-x86_64.egg  pysparknumpy.py

import sys
import numpy
from pyspark import SparkContext

def provideVersion(x):
    # Called inside the map, so this reports the executors' Python version
    return str(sys.version_info)

print "***********Driver Python version *****************"
print sys.version_info

sc = SparkContext()
dataset = sc.parallelize(range(1,10)).map(
    lambda x: str(numpy.arange(x)) + provideVersion(x))

print "**********Executor Python version ***************"
print dataset.collect()


As a final remark, I would like to refer to a blog post about using a Python virtualenv with Hadoop streaming, which was the inspiration for the work in this post.
