Spark on Kubernetes Tutorial (Beginner-Friendly with Minikube)
This post is a beginner-friendly introduction to Spark on Kubernetes, the guide I wish I had when I first tried it. If you’re trying to run Spark on Kubernetes locally, it will walk you through a working example step by step.
We’ll go through examples from the official documentation and run them on a local Minikube cluster, breaking things down and looking under the hood along the way.
This is beginner-friendly, but we’ll assume:
- Some basic understanding of Spark
- Basic knowledge about containers
- No prior Kubernetes experience
I hope it will help you avoid frustration on your journey through this Spark jungle.
Basic Concept of Spark on Kubernetes
Instead of one big, long-running cluster, you can run many small Spark clusters. Each submitted job runs as its own mini cluster, which gets cleaned up once the job finishes.
That means you get:
- Driver – runs as a pod
  - Communicates with the Kubernetes API to orchestrate your mini cluster
  - Communicates with executors to coordinate job execution
- Executors – run as separate pods
  - Actually perform the work you requested
With that in mind, let’s jump straight into it.
Hands-on Example (Running Spark on Kubernetes with Minikube)
You will need:
- Minikube – if you’re not familiar with it at all, don’t worry: you only need the first three steps from their tutorial
- Spark – download the binaries so you can follow along (I had version 4.0.1 locally when writing this)
- Official Spark Docker image 4.0.1 – the newest available version with Scala (at the time of writing)
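One caveat before you start: the default Minikube configuration is usually too small for Spark. The official Spark documentation recommends at least 3 CPUs and 4g of memory, which you can request when starting the cluster:
$ minikube start --cpus 3 --memory 4096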
Documentation Example
Every command below is executed from the root folder of your downloaded Spark binaries.
This is what you will find in the official documentation:
$ ./bin/spark-submit \
--master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=<spark-image> \
local:///path/to/examples.jar
Spark Submit Script
Let’s begin with the core script we are running: spark-submit.
This script ships with every Spark binary distribution and is your gateway to running anything in Spark.
Based on your configuration, spark-submit tells Spark how to run your workload.
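Before involving Kubernetes at all, you can sanity-check your Spark download by running the same example locally. This is a minimal smoke test (run from the same folder; the trailing 100 is the optional number of partitions SparkPi accepts):
$ ./bin/spark-submit \
  --master "local[2]" \
  --class org.apache.spark.examples.SparkPi \
  examples/jars/spark-examples_2.13-4.0.1.jar 100
Somewhere in the (fairly noisy) output you should see a line starting with "Pi is roughly".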
Kubernetes API Server Host and Port
If you are not familiar with Kubernetes (as I wasn’t at the time), you might ask: what is the k8s-apiserver? You can check the explanation in the official documentation.
Basically, Kubernetes manages containerized workloads. Your submit script tells Kubernetes, through this API, what it requires; Kubernetes then takes over, creating, deleting, and modifying workloads as needed.
With Minikube, if you run:
$ minikube ip
You’ll get the local IP address of your Minikube cluster. The port you need is 8443, Minikube’s default API server port.
Then change the --master option to:
--master k8s://https://<your-minikube-ip>:8443
# or in your shell
--master k8s://https://$(minikube ip):8443
This tells your submit script where the Kubernetes API is available.
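If you want to double-check that address, kubectl can print it for you (this assumes your kubectl context points at Minikube, which minikube start sets up automatically):
$ kubectl cluster-info
The first line of the output shows where the Kubernetes control plane is running; it should match your Minikube IP and port 8443.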
Workload You Want to Run
--class org.apache.spark.examples.SparkPi
This line tells Spark what you actually want to run. You can find this specific class in the examples directory of your downloaded Spark binaries:
<path-to-your-spark-binaries>/examples/src/main/scala/org/apache/spark/examples
The SparkPi class is the one this example runs.
Docker Images for Spark
Since Kubernetes runs containerized workloads, you must specify the Docker image using:
--conf spark.kubernetes.container.image=<spark-image>
In our case, we can use the prebuilt image, which already contains everything we need.
apache/spark:4.0.1-scala2.13-java21-ubuntu
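The very first submit can take a while because the cluster has to download this image. Optionally, you can pre-pull it into Minikube’s cache so the pods start faster (minikube image pull fetches an image into the cluster):
$ minikube image pull apache/spark:4.0.1-scala2.13-java21-ubuntu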
Bundled JAR
The last line refers to the example JAR packaged inside the container. For cluster deploy mode, it points to:
local:///opt/spark/examples/jars/spark-examples.jar
You can also find a corresponding JAR in your downloaded Spark binaries:
<path-to-your-spark-binaries>/examples/jars/spark-examples_2.13-4.0.1.jar
Locally you won’t see a file named spark-examples.jar; that name is a symbolic link that exists only inside the Docker container.
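You can verify this without running a job. Assuming you have Docker available locally, this lists the JAR directory inside the image:
$ docker run --rm apache/spark:4.0.1-scala2.13-java21-ubuntu \
  ls -l /opt/spark/examples/jars/
In the listing, spark-examples.jar should show up as a link pointing at the versioned spark-examples_2.13-4.0.1.jar.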
Putting It All Together
Here’s the full command you should be able to use:
$ ./bin/spark-submit \
--master k8s://https://$(minikube ip):8443 \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=apache/spark:4.0.1-scala2.13-java21-ubuntu \
local:///opt/spark/examples/jars/spark-examples.jar
If you run it on Minikube, the driver pod will start but will fail immediately:
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://kubernetes.default.svc/api/v1/namespaces/default/pods/org-apache-spark-examples-sparkpi-38d3dc9c3d8a8786-dr
This happens because Kubernetes gates everything behind permissions: the default service account the driver pod runs under isn’t allowed to create or manage pods.
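To see the failure for yourself, you can pull the driver logs. Spark labels the pods it creates, so (assuming the default spark-role label) you don’t even need to copy the generated pod name:
$ kubectl logs -l spark-role=driver --tail=20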
RBAC Hell
For testing, we can create a spark service account and a cluster role binding that allow our application to run. This is fine for local experiments; in a real cluster, you’d want stricter permissions (least privilege).
$ kubectl create serviceaccount spark
$ kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
You can also find this in the Spark documentation.
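Before resubmitting, you can check that the binding took effect. kubectl can answer permission questions on behalf of a service account:
$ kubectl auth can-i create pods --as=system:serviceaccount:default:spark
# expected answer: yes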
The Real Working Example
Now tweak your submit command:
$ ./bin/spark-submit \
--master k8s://https://$(minikube ip):8443 \
--deploy-mode cluster \
--name spark-pi \
--executor-memory 2G \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=apache/spark:4.0.1-scala2.13-java21-ubuntu \
--conf spark.kubernetes.authenticate.driver.serviceAccountName="spark" \
local:///opt/spark/examples/jars/spark-examples.jar
This should finally work.
I reduced the number of executors to two and capped the executor memory at 2G so the job doesn’t eat up all your local resources.
Check the Expected Result
$ kubectl get pods
Expected output:
NAME                                                        READY   STATUS    RESTARTS   AGE
org-apache-spark-examples-sparkpi-46c3389c3da281d4-driver   1/1     Running   0          23s
spark-pi-4450d19c3da2a0fc-exec-1                            1/1     Running   0          15s
spark-pi-4450d19c3da2a0fc-exec-2                            1/1     Running   0          15s
This shows one driver and two executors, as defined in the submit script.
If you see:
NAME                                                        READY   STATUS      RESTARTS   AGE
org-apache-spark-examples-sparkpi-46c3389c3da281d4-driver   1/1     Completed   0          23s
The job already finished and the executor pods were cleaned up. In that case, run the command again, or try:
$ kubectl get pods -w
This allows you to watch pods being created and deleted in real time.
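Once the driver shows Completed, the computed result sits in its logs. A quick way to fish it out (again relying on the default spark-role label):
$ kubectl logs -l spark-role=driver | grep "Pi is roughly"
"Pi is roughly" is the line the SparkPi example prints when it finishes.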
Cleaning Up Pods
To remove a single pod manually:
$ kubectl delete pods <pod-name>
# example
$ kubectl delete pods spark-pi-27719e9c572a28d9-driver
# Or delete all pods in your Minikube cluster:
$ kubectl delete pods --all
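Driver pods stay around in the Completed state, so they pile up after a few runs. Assuming the default labels Spark applies, you can also clean up by label instead of by name:
$ kubectl delete pods -l spark-role=driver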
That’s Just the Beginning
If you made it to the end and it works, great job! If not, don’t worry — try again, experiment, and you might find yourself enjoying it.
Quick recap of how to run Spark on Kubernetes locally:
- Set up a local Kubernetes cluster with Minikube
- Configure your spark-submit command
- Point Spark to the Kubernetes API
- Specify the correct Docker image
- Configure RBAC permissions
- Run the job and verify pods
I hope by the end you have a basic understanding of what’s happening on this small scale. There’s much more to learn — the Spark + Kubernetes ecosystem is huge. In the future, I’d like to cover useful tips and traps I ran into along the way.