Spark on Kubernetes Tutorial (Beginner-Friendly with Minikube)
This post is a beginner-friendly introduction to Spark on Kubernetes, the guide I wish I had when I first tried it. If you’re trying to run Spark on Kubernetes locally, it will walk you through a working example step by step.
We’ll go through examples from the official documentation and run them on a local Minikube cluster, breaking things down and looking under the hood along the way.
This is beginner-friendly, but we’ll assume:
- Some basic understanding of Spark
- Basic knowledge about containers
- No prior Kubernetes experience
I hope it will help you avoid frustration on your journey through this Spark jungle.
Basic Concept of Spark on Kubernetes
Instead of one big, long-running cluster, you can run many small Spark clusters. Each submitted job runs as its own mini cluster, which gets cleaned up once the job finishes.
That means you get:
- Driver – runs as a pod
  - Communicates with the Kubernetes API to orchestrate your mini cluster
  - Communicates with executors to coordinate job execution
- Executors – run as separate pods
  - Actually perform the work you requested
With that in mind, let’s jump straight into it.
Hands-on Example (Running Spark on Kubernetes with Minikube)
You will need:
- Minikube – if you’re not familiar with it at all, don’t worry: you only need the first three steps from their tutorial
- Spark – download the binaries so you can follow along (I had version 4.0.1 locally when writing this)
- Official Spark Docker image 4.0.1 – the newest available version with Scala (at the time of writing)
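One caveat before you start: the default Minikube configuration is usually too small for Spark. The official Spark documentation recommends at least 3 CPUs and 4g of memory, which you can request when starting the cluster:
$ minikube start --cpus 3 --memory 4096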
Documentation Example
Every command below is executed from the root folder of your downloaded Spark binaries.
This is what you will find in the official documentation:
$ ./bin/spark-submit \
--master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=<spark-image> \
local:///path/to/examples.jar
Spark Submit Script
Let’s begin with the core script we are running: spark-submit.
This script ships with every Spark binary distribution and is your gateway to running anything in Spark.
Based on your configuration, spark-submit tells Spark how to run your workload.
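Before involving Kubernetes at all, you can sanity-check your Spark download by running the same example locally. This is a minimal smoke test (run from the same folder; the trailing 100 is the optional number of partitions SparkPi accepts):
$ ./bin/spark-submit \
  --master "local[2]" \
  --class org.apache.spark.examples.SparkPi \
  examples/jars/spark-examples_2.13-4.0.1.jar 100
Somewhere in the (fairly noisy) output you should see a line starting with "Pi is roughly".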
Kubernetes API Server Host and Port
If you are not familiar with Kubernetes (as I wasn’t at the time), you might ask: what is the k8s-apiserver? You can check the explanation in the official documentation.
Basically, Kubernetes manages containerized workloads. Your submit script tells Kubernetes, through this API, what it requires; Kubernetes then takes over, creating, deleting, and modifying workloads as needed.
With Minikube, if you run:
$ minikube ip
You’ll get the local IP address of your Minikube cluster. The port you need is 8443, Minikube’s default API server port.
Then change the --master option to:
--master k8s://https://<your-minikube-ip>:8443
# or in your shell
--master k8s://https://$(minikube ip):8443
This tells your submit script where the Kubernetes API is available.
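If you want to double-check that address, kubectl can print it for you (this assumes your kubectl context points at Minikube, which minikube start sets up automatically):
$ kubectl cluster-info
The first line of the output shows where the Kubernetes control plane is running; it should match your Minikube IP and port 8443.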
Workload You Want to Run
--class org.apache.spark.examples.SparkPi
This line tells Spark what you actually want to run. You can find this specific class in the examples directory of your downloaded Spark binaries:
<path-to-your-spark-binaries>/examples/src/main/scala/org/apache/spark/examples
The SparkPi class is the one this example runs.
Docker Images for Spark
Since Kubernetes runs containerized workloads, you must specify the Docker image using:
--conf spark.kubernetes.container.image=<spark-image>
In our case, we can use the prebuilt image, which already contains everything we need.
apache/spark:4.0.1-scala2.13-java21-ubuntu
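The very first submit can take a while because the cluster has to download this image. Optionally, you can pre-pull it into Minikube’s cache so the pods start faster (minikube image pull fetches an image into the cluster):
$ minikube image pull apache/spark:4.0.1-scala2.13-java21-ubuntu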
Bundled JAR
The last line refers to the example JAR packaged inside the container. For cluster deploy mode, it points to:
local:///opt/spark/examples/jars/spark-examples.jar
You can also find a corresponding JAR in your downloaded Spark binaries:
<path-to-your-spark-binaries>/examples/jars/spark-examples_2.13-4.0.1.jar
Locally you won’t see a file named spark-examples.jar; that name is a symbolic link that exists only inside the Docker container.
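You can verify this without running a job. Assuming you have Docker available locally, this lists the JAR directory inside the image:
$ docker run --rm apache/spark:4.0.1-scala2.13-java21-ubuntu \
  ls -l /opt/spark/examples/jars/
In the listing, spark-examples.jar should show up as a link pointing at the versioned spark-examples_2.13-4.0.1.jar.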
Putting It All Together
Here’s the full command you should be able to use:
$ ./bin/spark-submit \
--master k8s://https://$(minikube ip):8443 \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=5 \
--conf spark.kubernetes.container.image=apache/spark:4.0.1-scala2.13-java21-ubuntu \
local:///opt/spark/examples/jars/spark-examples.jar
If you run it on Minikube, the driver pod will start but will fail immediately:
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://kubernetes.default.svc/api/v1/namespaces/default/pods/org-apache-spark-examples-sparkpi-38d3dc9c3d8a8786-dr
This happens because Kubernetes gates everything behind permissions: the default service account the driver pod runs under isn’t allowed to create or manage pods.
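To see the failure for yourself, you can pull the driver logs. Spark labels the pods it creates, so (assuming the default spark-role label) you don’t even need to copy the generated pod name:
$ kubectl logs -l spark-role=driver --tail=20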
RBAC Hell
For testing, we can create a spark service account and a cluster role binding that allow our application to run. This is fine for local experiments; in a real cluster, you’d want stricter permissions (least privilege).
$ kubectl create serviceaccount spark
$ kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
You can also find this in the Spark documentation.
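Before resubmitting, you can check that the binding took effect. kubectl can answer permission questions on behalf of a service account:
$ kubectl auth can-i create pods --as=system:serviceaccount:default:spark
# expected answer: yes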
The Real Working Example
Now tweak your submit command:
$ ./bin/spark-submit \
--master k8s://https://$(minikube ip):8443 \
--deploy-mode cluster \
--name spark-pi \
--executor-memory 2G \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=apache/spark:4.0.1-scala2.13-java21-ubuntu \
--conf spark.kubernetes.authenticate.driver.serviceAccountName="spark" \
local:///opt/spark/examples/jars/spark-examples.jar
This should finally work.
I reduced the number of executors to two and capped the executor memory at 2G so the job doesn’t eat up all your local resources.
Check the Expected Result
$ kubectl get pods
Expected output:
NAME                                                        READY   STATUS    RESTARTS   AGE
org-apache-spark-examples-sparkpi-46c3389c3da281d4-driver   1/1     Running   0          23s
spark-pi-4450d19c3da2a0fc-exec-1                            1/1     Running   0          15s
spark-pi-4450d19c3da2a0fc-exec-2                            1/1     Running   0          15s
This shows one driver and two executors, as defined in the submit script.
If you see:
NAME                                                        READY   STATUS      RESTARTS   AGE
org-apache-spark-examples-sparkpi-46c3389c3da281d4-driver   1/1     Completed   0          23s
The job already finished and the executor pods were cleaned up. In that case, run the command again, or try:
$ kubectl get pods -w
This allows you to watch pods being created and deleted in real time.
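Once the driver shows Completed, the computed result sits in its logs. A quick way to fish it out (again relying on the default spark-role label):
$ kubectl logs -l spark-role=driver | grep "Pi is roughly"
"Pi is roughly" is the line the SparkPi example prints when it finishes.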
Cleaning Up Pods
To remove a single pod manually:
$ kubectl delete pods <pod-name>
# example
$ kubectl delete pods spark-pi-27719e9c572a28d9-driver
# Or delete all pods in your Minikube cluster:
$ kubectl delete pods --all
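Driver pods stay around in the Completed state, so they pile up after a few runs. Assuming the default labels Spark applies, you can also clean up by label instead of by name:
$ kubectl delete pods -l spark-role=driver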
That’s Just the Beginning
If you made it to the end and it works, great job! If not, don’t worry — try again, experiment, and you might find yourself enjoying it.
Quick recap of how to run Spark on Kubernetes locally:
- Set up a local Kubernetes cluster with Minikube
- Configure your spark-submit command
- Point Spark to the Kubernetes API
- Specify the correct Docker image
- Configure RBAC permissions
- Run the job and verify pods
I hope by the end you have a basic understanding of what’s happening on this small scale. There’s much more to learn — the Spark + Kubernetes ecosystem is huge. In the future, I’d like to cover useful tips and traps I ran into along the way.