Chendi Xue

I am a Linux software engineer, currently working on Spark, Arrow, Kubernetes, Ceph, C/C++, etc.

Running Spark on Kubernetes, Step by Step

18 Oct 2018 » Kubernetes devops, Hadoop/Spark devops


This blog was practiced on CentOS 7; for Ubuntu, follow the corresponding Ubuntu reference.

  • Pre-Setup

Upgrade kernel to latest version.

$ yum upgrade -y
$ reboot

# disable SELinux and let bridged traffic pass through iptables
$ setenforce 0
$ sed -i --follow-symlinks 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/sysconfig/selinux
$ modprobe br_netfilter
$ echo '1' > /proc/sys/net/bridge/bridge-nf-call-iptables
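Note that the setenforce/modprobe/echo lines above only last until the next reboot. A minimal sketch of making them persistent (file paths are my assumption for CentOS 7, not from the original):

```shell
# load br_netfilter on every boot
echo 'br_netfilter' > /etc/modules-load.d/br_netfilter.conf
# make the bridge-nf-call-iptables setting survive reboots
cat <<'EOF' > /etc/sysctl.d/99-kubernetes.conf
net.bridge.bridge-nf-call-iptables = 1
EOF
sysctl --system
```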
  • Docker and Kubeadm installation
vim /etc/yum.repos.d/kubernetes.repo
yum update
yum install kubeadm docker -y
systemctl restart docker && systemctl enable docker
systemctl restart kubelet && systemctl enable kubelet
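The contents of kubernetes.repo are not shown above; a sketch, assuming the upstream Kubernetes yum repository as documented around 2018 (the packages.cloud.google.com URLs have since been superseded, so adjust if following along today):

```shell
cat <<'EOF' > /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
EOF
```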

[Optional] Configure proxy and no-proxy settings for Docker

sudo mkdir -p /etc/systemd/system/docker.service.d
vim /etc/systemd/system/docker.service.d/http-proxy.conf
[Service]
Environment="HTTP_PROXY=${proxy}" "NO_PROXY=${no_proxy}"

vim /etc/systemd/system/docker.service.d/https-proxy.conf
[Service]
Environment="HTTPS_PROXY=${proxy}" "NO_PROXY=${no_proxy}"

systemctl daemon-reload
systemctl restart docker
  • Kubernetes node creation and join
swapoff -a

kubeadm init --pod-network-cidr= --feature-gates=CoreDNS=false --apiserver-advertise-address= --kubernetes-version=v1.11.0
# if kubeadm init fails, run "kubeadm reset && systemctl restart kubelet" and rerun

# once init completes, set up kubeconfig permissions and check status
mkdir -p $HOME/.kube
chmod 755 $HOME/.kube
cp /etc/kubernetes/admin.conf $HOME/.kube/config
chown $(id -u):$(id -g) $HOME/.kube/config

Set up the flannel pod network

kubectl apply -f

$ kubectl get nodes
NAME        STATUS    ROLES     AGE       VERSION
bigdata09   Ready     master    6m        v1.10.3

For Kubernetes worker nodes, use the command below to join the cluster:

kubeadm join --token ${provide_by_} --discovery-token-ca-cert-hash sha256:5a549c8e6c1f0e76e96c86b516e830be454a024360c05c08e196ba9bc971284d
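If the join token printed by `kubeadm init` has been lost or has expired, it can be regenerated on the master node; a sketch, assuming kubeadm v1.9 or later:

```shell
# prints a complete "kubeadm join <apiserver>:<port> --token ... \
#   --discovery-token-ca-cert-hash sha256:..." line ready to paste on the worker
kubeadm token create --print-join-command
```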
  • Spark Docker image preparation

In order to run the Spark benchmark tool HiBench, I also added the HiBench jars and configuration inside the Spark Docker image.

vim /hadoop/spark/kubernetes/dockerfiles/spark/Dockerfile
# add the lines below to the Dockerfile
# to copy the Spark configuration, HiBench configuration, and jars into the image
COPY HiBench/sparkbench /opt/HiBench/sparkbench
COPY sparkbench.conf /opt/HiBench/sparkbench.conf
# expose the configuration path as an environment variable in the image
RUN echo SPARKBENCH_PROPERTIES_FILES=/opt/HiBench/sparkbench.conf >> /root/.bashrc
ENV SPARKBENCH_PROPERTIES_FILES /opt/HiBench/sparkbench.conf

# Notes: sparkbench.conf is the HiBench-generated conf derived from HiBench.conf.
# The sparkbench folder is the one under HiBench that contains all the SparkBench jars.

$ ./bin/ -r -t v2.3.0-with-hibench build
$ ./bin/ -r -t v2.3.0-with-hibench push
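For reference, the image build script that ships under bin/ in the Spark 2.3 distribution is docker-image-tool.sh. A sketch of the full invocation, with the repository name taken from the image tag used later in this post (your own registry/repo would go there):

```shell
# -r sets the repository prefix, -t the image tag; build first, then push
./bin/docker-image-tool.sh -r xuechendi -t v2.3.0-with-hibench build
./bin/docker-image-tool.sh -r xuechendi -t v2.3.0-with-hibench push
```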
  • Run the Spark benchmark TeraSort in Kubernetes pods.
spark-submit --master k8s:// --deploy-mode cluster --properties-file /HiBench/report/terasort/spark/conf/sparkbench/spark.conf --class --conf spark.executor.instances=25 --executor-memory 17g --conf --conf spark.kubernetes.container.image=xuechendi/spark:v2.3.0-with-hibench --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark local:///opt/HiBench/sparkbench/assembly/target/sparkbench-assembly-7.1-SNAPSHOT-dist.jar "hdfs://" "hdfs://"

local:// refers to a location inside the container, not on the submitting host.
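As a sketch, a fully spelled-out submission has the shape below; the master URL, class name, and application arguments are placeholders of my own, not values from the original:

```shell
spark-submit \
  --master k8s://https://<apiserver-host>:6443 \
  --deploy-mode cluster \
  --class <your.main.Class> \
  --conf spark.executor.instances=25 \
  --conf spark.kubernetes.container.image=xuechendi/spark:v2.3.0-with-hibench \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  local:///path/inside/container/app.jar <app args>
```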

  • Run a Spark SQL benchmark in Kubernetes pods.

We will also package our application inside the HiBench jar.

import scala.io.Source

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.HiveContext

object ScalaSparkSQLBench {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: ScalaSparkSQLBench <workload name> <SQL script file>")
      System.exit(1)
    }
    val workload_name = args(0)
    val sql_file = args(1)
    // Configure the Hive context inline
    val spark = SparkSession.builder().appName(workload_name).getOrCreate()
    val hc = new HiveContext(spark.sparkContext)

    // read the SQL script and run each non-empty statement
    val _sql = Source.fromFile(sql_file).mkString
    _sql.split(';').foreach { x =>
      if (x.trim.nonEmpty) hc.sql(x)
    }
    spark.stop()
  }
}

Compile the Scala code with "mvn package" and build a new Docker image with this new jar.

spark-submit --master k8s:// --deploy-mode cluster --properties-file /hadoop/spark/spark.conf --class --num-executors 25 --executor-cores 6 --executor-memory 17g --conf --conf spark.kubernetes.container.image=xuechendi/spark:v2.3.0-with-spark-sql --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark local:///opt/HiBench/sparkbench/assembly/target/sparkbench-assembly-7.1-SNAPSHOT-dist.jar Spark-sql-on-k8s /opt/UC11_K8S/$query_id
  • Optimize Spark-on-Kubernetes performance by mounting a faster media device into the container
echo "spark.local.dir /tmp/" >> spark-defaults.conf  # so Spark will use /tmp as its shuffle & spill directory
echo "VOLUME /tmp" >> Dockerfile  # tells the runtime container to back the /tmp dir inside the container with /var/lib/docker/volumes
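To confirm the VOLUME directive took effect at run time, the container's mounts can be inspected; a sketch, where the container ID is a placeholder found via `docker ps`:

```shell
# print the mount table for the executor container as JSON;
# the entry whose Destination is /tmp should show "Type": "volume"
# with a Source under /var/lib/docker/volumes
docker inspect -f '{{ json .Mounts }}' <container-id>
```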
  • Debug and check
$ docker ps
CONTAINER ID        IMAGE                        COMMAND                  CREATED             STATUS              PORTS               NAMES
844ab90bcfb8        71f3e1f14cb6                 "/opt/entrypoint.s..."   4 minutes ago       Up 4 minutes                            k8s_executor_com-intel-hibench-sparkbench-micro-scalaterasort-3ed74fbe8cf935e9a34dfebb8baafbfa-exec-5_d
$ docker inspect 844ab90bcfb8
"Mounts": [
                {
                "Type": "bind",
                "Source": "/var/lib/kubelet/pods/346e134d-8579-11e8-b100-001e677c4f1a/etc-hosts",
                "Destination": "/etc/hosts",
                "Mode": "",

One way I debugged Spark on Kubernetes was to add a sleep to the Dockerfile entrypoint, so that once a Spark executor pod is up, it won't die too quickly if the executor hits an error; the sleep command line keeps the container around for inspection.
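A minimal sketch of that trick, using a hypothetical wrapper script of my own around the image's real /opt/entrypoint.sh (debug only; drop it for real runs):

```shell
#!/bin/bash
# debug-entrypoint.sh (hypothetical): run the real entrypoint, then keep the
# container alive for an hour so the pod can be inspected with kubectl exec,
# even if the executor exited with an error
/opt/entrypoint.sh "$@" || true
sleep 3600
```

In the Dockerfile, COPY this wrapper into the image and point ENTRYPOINT at it instead of /opt/entrypoint.sh.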

kubectl exec -it ${podname} -- /bin/bash
# this gets a shell inside the pod for further checking