Chendi Xue

I am a Linux software engineer, currently working on Spark, Arrow, Kubernetes, Ceph, C/C++, etc.

TensorFlowOnSpark: Install Tutorial Step by Step (spark on Yarn)

16 Apr 2019 » Spark

references:

Yahoo TensorFlowOnSpark Repo guide
Tensorflow Spark Connector


Create a directory to hold the scripts and packages used below

mkdir tensorflowonspark
cd tensorflowonspark

Install TensorFlow and TensorFlowOnSpark

I first tried running TensorFlowOnSpark with Python 2.7, and it failed to import the tensorflow package when spark-submit ran the PySpark script. After switching to Python 3, it worked. So unless you have a strong reason to use Python 2.7, just use Python 3 here.

pip3 install tensorflow tensorflowonspark
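Before moving on, it is worth confirming both packages are actually importable by the Python that Spark will use. A small stdlib-only check (the helper name is my own):

```python
# Quick sanity check that pip3 actually installed both packages for this
# interpreter. Pure stdlib, so it runs even where the install failed.
import importlib.util

def check_install(modules):
    """Map each module name to whether Python can locate it."""
    return {m: importlib.util.find_spec(m) is not None for m in modules}

for mod, ok in check_install(["tensorflow", "tensorflowonspark"]).items():
    print(mod, "OK" if ok else "MISSING")
```

Run this with the same binary you will later point PYSPARK_PYTHON at, otherwise the check proves nothing about the executors.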

Prepare Spark and Hadoop

I have already installed Spark and Hadoop (YARN, HDFS); if you need guidance, please check Spark and Hadoop build from source

Install Tensorflow Spark Connector

If you want to use the "TFRecords" format for input and output on HDFS, the following preparation is necessary.

git clone https://github.com/tensorflow/ecosystem.git
cd ecosystem/hadoop
mvn clean install -DskipTests
cd ../spark/spark-tensorflow-connector
mvn clean install -DskipTests

Then you'll have tensorflow-hadoop and spark-tensorflow-connector in your local Maven repo:

ll /root/.m2/repository/org/tensorflow/
total 0
drwxr-xr-x 3 root root 20 Apr 17 09:07 parentpom
drwxr-xr-x 3 root root 20 Apr 17 09:07 proto
drwxr-xr-x 3 root root 52 Apr 17 09:19 spark-tensorflow-connector_2.11
drwxr-xr-x 3 root root 52 Apr 17 09:09 tensorflow-hadoop

Upload the jar to HDFS

hadoop fs -put target/tensorflow-hadoop-1.10.0.jar

Prepare the TensorFlowOnSpark zip, so the training PySpark script can import it.

git clone https://github.com/yahoo/TensorFlowOnSpark.git
cd TensorFlowOnSpark
zip -r tfspark.zip tensorflowonspark
cd ..
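Zipping the package works because Python can import modules straight from a zip archive on sys.path, which is also how Spark ships the archive to executors. A self-contained demo with a throwaway zip (package and attribute names are illustrative, not from TensorFlowOnSpark):

```python
# Demonstrate zip-based imports, the mechanism behind shipping tfspark.zip
# to Spark executors: a zip on sys.path behaves like a package directory.
import os
import sys
import tempfile
import zipfile

tmp = tempfile.mkdtemp()
zip_path = os.path.join(tmp, "demo.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    # A one-module package inside the archive.
    zf.writestr("demopkg/__init__.py", "ANSWER = 42\n")

sys.path.insert(0, zip_path)  # what Spark does for --py-files archives
import demopkg

print(demopkg.ANSWER)  # 42
```

On the cluster side, passing the archive via spark-submit's --py-files places it on each executor's sys.path the same way.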

Prepare HDFS native lib (libhdfs.so)

You probably generated it when building Hadoop from source; if you didn't, see below

cd ${HADOOP_SOURCE_CODE}/hadoop-hdfs-project
mvn install -Pdist,native -DskipTests -Dtar
find ./ -name libhdfs.so
./hadoop-hdfs-native-client/target/target/usr/local/lib/libhdfs.so

Prepare MNIST example package

mkdir mnist
cd mnist
curl -O "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz"
curl -O "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz"
zip -r mnist.zip *
cd ..
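The four downloaded files are in the simple IDX format: a big-endian header (two zero bytes, a dtype code, the number of dimensions, then one 32-bit size per dimension) followed by raw data. A minimal header parser, handy for sanity-checking the gunzipped files (the function name is my own, not part of the example package):

```python
# Parse the IDX header used by the MNIST files: bytes are big-endian,
# layout is [0x00, 0x00, dtype, ndims, dim0, dim1, ...].
import struct

def parse_idx_header(data):
    """Return (dtype_code, dims) from the start of an IDX byte buffer."""
    zero1, zero2, dtype_code, ndims = struct.unpack(">BBBB", data[:4])
    assert zero1 == 0 and zero2 == 0, "not an IDX file"
    dims = struct.unpack(">" + "I" * ndims, data[4 : 4 + 4 * ndims])
    return dtype_code, dims

# Synthetic header matching train-images-idx3-ubyte: 60000 28x28 ubyte images
# (dtype code 0x08 means unsigned byte).
header = struct.pack(">BBBBIII", 0, 0, 0x08, 3, 60000, 28, 28)
print(parse_idx_header(header))  # (8, (60000, 28, 28))
```

Running it against the real (gunzipped) train-images file should report the same shape the mnist_data_setup.py script expects.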

Run MNIST preparation/training/inference on TensorFlowOnSpark

System Variables Configuration

export PYSPARK_PYTHON=/usr/bin/python3
export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/usr/bin/python3"
export QUEUE=default  # For CPU
#export QUEUE=gpu     # For GPU

find / -name libhdfs.so
/mnt/nvme2/chendi/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/target/usr/local/lib/libhdfs.so
find / -name libjvm.so
/usr/java/jdk1.8.0_201-amd64/jre/lib/amd64/server/libjvm.so

export LIB_HDFS=${libhdfs.so location directory}  #/mnt/nvme2/chendi/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/target/usr/local/lib/
export LIB_JVM=${libjvm.so location directory}   #/usr/java/jdk1.8.0_201-amd64/jre/lib/amd64/server/
#export LIB_CUDA=/usr/local/cuda-7.5/lib64   #libcuda.so path

Prepare data as TFRecords (CPU mode)

Note: when using an hdfs:/// file path, the package must be pre-uploaded to HDFS, and hdfs:/// will be translated to hdfs://${ip}:${port}/
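To make that translation concrete, here is a toy sketch of how a scheme-only hdfs:/// path gets qualified against the cluster's default filesystem. This is an illustration, not Hadoop's actual resolution code, and the namenode host/port are placeholders:

```python
# Illustrate how hdfs:///path resolves against fs.defaultFS (the
# default_fs value here is a made-up placeholder, not a real cluster).
from urllib.parse import urlparse

def resolve_hdfs_path(path, default_fs="hdfs://namenode:9000"):
    """Expand a scheme-only hdfs:/// path into a fully qualified URI."""
    parsed = urlparse(path)
    if parsed.scheme == "hdfs" and not parsed.netloc:
        # No host:port given: borrow them from the default filesystem.
        return default_fs.rstrip("/") + parsed.path
    return path  # already fully qualified, leave untouched

print(resolve_hdfs_path("hdfs:///user/root/mnist/mnist.zip"))
# hdfs://namenode:9000/user/root/mnist/mnist.zip
```

So hdfs:///user/${USER}/Python.zip in the command below only works if that file already exists under the default filesystem's /user/${USER}/ directory.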

${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue ${QUEUE} \
--num-executors 4 \
--executor-memory 4G \
--archives hdfs:///user/${USER}/Python.zip#Python,mnist/mnist.zip#mnist \
--jars hdfs:///user/${USER}/tensorflow-hadoop-1.0-SNAPSHOT.jar \
TensorFlowOnSpark/examples/mnist/mnist_data_setup.py \
--output mnist/tfr \
--format tfr

Prepare data as TFRecords (GPU mode)

${SPARK_HOME}/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue ${QUEUE} \
--num-executors 4 \
--executor-memory 4G \
--archives hdfs:///user/${USER}/Python.zip#Python,mnist/mnist.zip#mnist \
--jars hdfs:///user/${USER}/tensorflow-hadoop-1.0-SNAPSHOT.jar \
--conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_CUDA \
--driver-library-path=$LIB_CUDA \
TensorFlowOnSpark/examples/mnist/mnist_data_setup.py \
--output mnist/tfr \
--format tfr

Verification

Check on Spark UI Spark-UI-Screenshot

hdfs dfs -ls mnist
Found 2 items
drwxr-xr-x   - root supergroup          0 2019-04-17 14:01 mnist/tfr

At this point the TensorFlowOnSpark setup is basically done and verified, and we can move on to running some real benchmarks.