Spark on Kubernetes with Helm and Docker Desktop

A guide to installing a Spark cluster onto Docker Desktop's local Kubernetes cluster with Helm, and submitting PySpark scripts to it with spark-submit

Prerequisites

💡
This Quickstart guide shows you how to install all of the prerequisites for this article.

Installation and Setup

Add the Bitnami Helm repository:

helm repo add bitnami https://charts.bitnami.com/bitnami
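
If the repository was already added a while ago, it is worth refreshing the local chart index before installing; you can also list the available chart versions:

# refresh the local chart index (optional, but useful if the repo was added previously)
helm repo update

# list the Spark chart versions available from the Bitnami repo
helm search repo bitnami/spark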

Install the bitnami/spark Helm Chart into the docker-desktop Kubernetes cluster:

helm upgrade --install spark bitnami/spark --create-namespace --namespace ns-spark
ℹ
The helm upgrade --install command installs the chart and creates the necessary pods, services, and service accounts within the namespace ns-spark for Spark to run. The pods run as a service account named "spark" because that's what we've named the Helm release.
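
You can confirm what the chart created with a quick kubectl query, for example:

# list the pods, services and service accounts created in the ns-spark namespace
kubectl get pods,svc,serviceaccounts --namespace ns-spark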

Run the k9s command to see the local Kubernetes cluster:

k9s

You should see something like this:

⚠
The next command is a hack that gives the "spark" service account god-mode access to the Kubernetes cluster. The Helm chart creates a service account, but it does not have enough permissions to do anything meaningful, such as creating pods, which is required later to execute Spark jobs in cluster mode. Ideally, we would create a Role with only the required permissions and bind it to the service account; for simplicity, however, we will just grant cluster-admin.
# grant the "spark" service account cluster-admin (god mode)
kubectl create rolebinding godmode \
  --clusterrole=cluster-admin \
  --serviceaccount=ns-spark:spark \
  --namespace=ns-spark
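
To confirm the binding took effect, kubectl auth can-i can impersonate the service account and check a permission the Spark driver will need, for example:

# should print "yes" now that the rolebinding grants cluster-admin
kubectl auth can-i create pods \
  --as=system:serviceaccount:ns-spark:spark \
  --namespace=ns-spark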

Now that the Spark Cluster is installed, run the kubectl proxy command in a new terminal window:

kubectl proxy

And then access the Spark Master Web UI using the following URL:

http://localhost:8001/api/v1/namespaces/ns-spark/services/http:spark-master-svc:http/proxy/

You should see something like this:

Hooray 🚀 Spark is now installed on your local Kubernetes cluster.
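
If you prefer a quick check from the terminal instead of the browser, the same proxied URL can be fetched with curl (assuming kubectl proxy is still running on port 8001):

# fetch the Spark Master UI through the kubectl proxy; expect the Spark Master HTML page
curl -s http://localhost:8001/api/v1/namespaces/ns-spark/services/http:spark-master-svc:http/proxy/ | head -n 20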

Submit PySpark Jobs with spark-submit

With the Spark cluster up and running, let's submit a PySpark job to it.

Create a simple PySpark script and save it as test.py on your local machine:

import random
import time

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession and grab its SparkContext
spark = SparkSession.builder.appName('myPySparkApp').getOrCreate()
sc = spark.sparkContext

num_samples = 100

# Monte Carlo test: does a random point (x, y) land inside the unit circle?
def inside(p):
  print('hello world from PySpark')
  time.sleep(10)  # slow each sample down so the executors are easy to observe
  x, y = random.random(), random.random()
  return x*x + y*y < 1

# Distribute the samples across the executors and count the hits
count = sc.parallelize(range(0, num_samples)).filter(inside).count()

# Estimate pi from the ratio of hits to samples
pi = 4 * count / num_samples
print(pi)

sc.stop()
💡
The easiest way to use the spark-submit command without installing Spark directly onto your machine is to run it from inside an ephemeral container with Spark installed, running in the k8s cluster.

Run a Spark container using the official Apache Spark docker image in k8s with this command:

kubectl run --namespace ns-spark \
  spark-ephemeral --rm --tty -i \
  --image docker.io/spark:3.5.0-scala2.12-java17-python3-ubuntu \
  --overrides='{"apiVersion":"v1","spec":{"serviceAccount":"spark","serviceAccountName":"spark"}}' \
  -- bash

You should now be inside the terminal of the running ephemeral container:
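
A quick way to sanity-check that the Spark distribution is present inside the container is to print its version:

# confirm the Spark binaries are available in the ephemeral container
/opt/spark/bin/spark-submit --version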

Within the container terminal, run this command to submit test.py with spark-submit:

/opt/spark/bin/spark-submit \
  --conf spark.executor.cores=1 \
  --conf spark.executor.memory=500M \
  --conf spark.driver.cores=1 \
  --conf spark.driver.memory=500M \
  --conf spark.executor.memoryOverhead=500M \
  --name spark-pi \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=docker.io/spark:3.5.0-scala2.12-java17-python3-ubuntu \
  --master k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS \
  --conf spark.kubernetes.driverEnv.SPARK_MASTER_URL=spark://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS \
  --conf spark.executor.instances=4 \
  --conf spark.kubernetes.namespace=ns-spark \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.driver.volumes.hostPath.<VOLUME NAME>.mount.path=<PATH IN CONTAINER> \
  --conf spark.kubernetes.driver.volumes.hostPath.<VOLUME NAME>.mount.readOnly=true \
  --conf spark.kubernetes.driver.volumes.hostPath.<VOLUME NAME>.options.path=<YOUR LOCAL DIR PATH> \
  --conf spark.kubernetes.executor.volumes.hostPath.<VOLUME NAME>.mount.path=<PATH IN CONTAINER> \
  --conf spark.kubernetes.executor.volumes.hostPath.<VOLUME NAME>.mount.readOnly=true \
  --conf spark.kubernetes.executor.volumes.hostPath.<VOLUME NAME>.options.path=<YOUR LOCAL DIR PATH> \
  local:<PATH IN CONTAINER>/<SCRIPT NAME>
💡
An example --conf spark.kubernetes.driver.volumes.hostPath.<VOLUME NAME>.options.path=<YOUR LOCAL DIR PATH> could be --conf spark.kubernetes.driver.volumes.hostPath.mylocalfolder.options.path=/Users/FirstName.LastName/Documents/MyCode/. More information about Spark and Kubernetes Volumes can be found here.
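
As a concrete illustration, with a hypothetical volume name mylocalfolder, a host folder /Users/FirstName.LastName/Documents/MyCode/ shared with Docker Desktop, and a made-up mount point /mnt/mycode inside the pods, the placeholder lines and the application path would look like this:

  # hypothetical values -- substitute your own volume name, host path and mount path
  --conf spark.kubernetes.driver.volumes.hostPath.mylocalfolder.mount.path=/mnt/mycode \
  --conf spark.kubernetes.driver.volumes.hostPath.mylocalfolder.mount.readOnly=true \
  --conf spark.kubernetes.driver.volumes.hostPath.mylocalfolder.options.path=/Users/FirstName.LastName/Documents/MyCode/ \
  --conf spark.kubernetes.executor.volumes.hostPath.mylocalfolder.mount.path=/mnt/mycode \
  --conf spark.kubernetes.executor.volumes.hostPath.mylocalfolder.mount.readOnly=true \
  --conf spark.kubernetes.executor.volumes.hostPath.mylocalfolder.options.path=/Users/FirstName.LastName/Documents/MyCode/ \
  local:/mnt/mycode/test.py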

Inside K9s, you should see the Spark driver and executor pods running:
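
The same pods can also be listed with kubectl; Spark on Kubernetes labels them with a spark-role label, so you can filter on it:

# list the driver and executor pods created by spark-submit
kubectl get pods --namespace ns-spark -l spark-role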

If you observe the logs of one of the executor pods, you'll see that test.py is running:
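
If you'd rather tail the logs from the terminal, something along these lines works (assuming Spark's default spark-role=executor label):

# stream logs from the executor pods; the "hello world from PySpark" prints show up here
kubectl logs --namespace ns-spark -l spark-role=executor -f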

To uninstall Spark, including associated service accounts, run this command:

helm delete spark --namespace ns-spark
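
helm delete removes the chart's resources, but not the godmode rolebinding we created by hand, nor the namespace itself; those can be cleaned up with kubectl:

# remove the manually created rolebinding and the namespace
kubectl delete rolebinding godmode --namespace ns-spark
kubectl delete namespace ns-spark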

You have successfully submitted a PySpark job to a Spark cluster running in Kubernetes 🚀
