The issue
As a data scientist, working on a secure Hadoop cluster is not always a pleasure. You cannot simply change the cluster configuration, but rather have to make your way through the various configuration options provided by the platform's clients. Today, I address the following situation that anyone working for a large enterprise or public organization may encounter: you are allowed to submit Spark jobs on a secure cluster based on the Hortonworks Data Platform 2.x and RedHat Enterprise Linux 6.x. You want to use Python 2.7, but Spark uses the system's default Python 2.6, while Python 2.7 is only available through RedHat's Software Collections. Here is what I did.
Accessing Python 2.7 from Spark
Spark's spark-submit script takes the PYSPARK_PYTHON environment variable into account and applies it to both the driver program and the executors. However, PYSPARK_PYTHON merely accepts a path to a Python executable, while we want to access RedHat's Python 2.7 using the bash expression
scl enable python27 "python $*"
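To make the limitation concrete, here is a minimal sketch (the interpreter path is just an example): PYSPARK_PYTHON expects a plain path to an executable, so the scl expression above cannot be assigned to it directly.
# Works: PYSPARK_PYTHON takes a plain path to a Python executable
export PYSPARK_PYTHON=/usr/bin/python
# Does not work: a shell command line is not a path to an executable
# export PYSPARK_PYTHON='scl enable python27 "python $*"'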
The solution is to create two bash scripts, provided below, and put them in your $HOME/bin. The first script provides the Python executable that can be assigned to PYSPARK_PYTHON. En passant, it also solves an issue related to the Python egg cache (see the next section about Python eggs). The second script is a wrapper around the HDP spark-submit script, which makes the python27 script available in Spark's execution environments.
$HOME/bin/python27
#!/bin/bash
# The default egg cache ($HOME/.python-eggs) is not accessible on the workers
export PYTHON_EGG_CACHE=./.python-eggs
scl enable python27 "python $*"
$HOME/bin/spark-submit-py27
#!/bin/bash
# Assumes spark-submit-py27 and python27 reside in your $HOME/bin
# PYSPARK_PYTHON is used - and should be valid - in both the
# driver and executor environment
if ! [ -a python27 ]; then
    ln -s $HOME/bin/python27 python27
    CREATED_PYLINK=true
fi
# Ugly hack, but easier than parsing all spark-submit arguments
export PYSPARK_PYTHON=./python27
spark-submit --files "$HOME/bin/python27" $*
if [ -n "$CREATED_PYLINK" ]; then
    rm python27
fi
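A minimal usage sketch of the wrapper, assuming $HOME/bin is on your PATH (the job name and script name are just placeholders):
# All arguments are passed straight on to the regular spark-submit
spark-submit-py27 --name MyPy27Job --master yarn my_script.py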
Passing Python eggs
Spark offers the possibility to pass Python dependencies as *.egg files (e.g. --py-files numpy-1.9.3-py2.7-linux-x86_64.egg). However, Python packages with C dependencies must be built with the right Python version and for the target platform (the latter implies the need for a homogeneous cluster). So you had best build the eggs from source yourself, using "[python|python27] [setup.py|setupegg.py] bdist_egg" in the root of the Python package's source distribution, as sketched below. For Python eggs to be used on the Spark workers, the selected Python environment needs to have setuptools installed.
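For concreteness, here is a sketch of building the numpy egg used in the example below. It assumes the numpy 1.9.3 source tarball has already been downloaded and that the python27 script from above is on your PATH (numpy ships a setupegg.py; most packages use plain setup.py):
tar xzf numpy-1.9.3.tar.gz
cd numpy-1.9.3
# Build the egg with the SCL Python 2.7 via the python27 wrapper
python27 setupegg.py bdist_egg
# The egg ends up in dist/, e.g. dist/numpy-1.9.3-py2.7-linux-x86_64.egg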
Example
The example below combines the use of Python 2.7 and passing a Python egg with C dependencies.
pysparknumpy.py
# Run with: spark-submit-py27 --name PysparkNumpy --master yarn \
#     --py-files numpy-1.9.3-py2.7-linux-x86_64.egg pysparknumpy.py
import sys

import numpy
from pyspark import SparkContext

def provideVersion(x):
    return str(sys.version_info)

print "***********Driver Python version *****************"
print sys.version_info

sc = SparkContext()
dataset = sc.parallelize(range(1, 10)).map(
    lambda x: str(numpy.arange(x)) + provideVersion(x))

print "**********Executor Python version ***************"
print dataset.collect()
As a final remark, I would like to refer to a blog post about using a Python virtualenv with Hadoop streaming, which was the inspiration for the work in this post.