Spark on Kubernetes: Understanding Driver Lifecycle in Client and Cluster Mode
In the first article of this Spark on Kubernetes series, we submitted our first Spark applications to Kubernetes and focused primarily on cluster mode.
While working on redeployments, I realized I had misunderstood an important concept: the difference between Spark deploy modes and, more importantly, who owns the driver lifecycle.
Once you start running Spark inside Kubernetes, you want to avoid situations where a previous driver keeps running when a new one starts, especially if both access the same data lake.
Let’s break it down and see how it works in practice.
This is for anyone who:
- Has basic knowledge of Spark
- Has basic knowledge of Kubernetes
- Wants a deeper understanding of deployment modes and their practical implications
Spark on Kubernetes Series
- ✅ Part 1: Introduction to Spark on Kubernetes
- 👉 Part 2: Spark Driver Lifecycle in Kubernetes (current article)
- 🔜 Part 3: Spark on Kubernetes as a Managed Workload (coming soon)
To understand this better, let’s first create a mental model of Spark deployment modes.
Spark Deploy Modes - Mental Model
Spark supports two deploy modes, and the difference between them is quite simple:
- In client mode - the driver runs wherever you run spark-submit
- In cluster mode - the driver runs inside Kubernetes
You can use both modes from a local machine without much confusion.
The real complexity starts when spark-submit itself runs inside another Kubernetes
pod, which is exactly when understanding driver ownership becomes critical.
Client Mode Submit inside Kubernetes
In client mode, the driver lives wherever spark-submit runs.
If you kill that process - or delete the pod running it - the Spark job
dies immediately, which can affect your data consistency.
Cluster Mode Submit inside Kubernetes
In cluster mode, once the driver pod is created, it is managed by Kubernetes -
not by the spark-submit process.
Killing the submit process, or killing the submit pod, doesn’t stop the job.
To stop the job, you need to delete the driver pod itself.
The key difference is who owns the driver lifecycle, and how much that matters depends on your data workflow.
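The ownership rule from the two sections above can be written down as a tiny lookup. This is only an illustrative sketch: the function name what_stops_the_job is hypothetical, and it prints a description without touching any cluster.

```shell
# Hypothetical helper that encodes the ownership rule described above.
# It only prints what has to be stopped; it does not contact Kubernetes.
what_stops_the_job() {
  case "$1" in
    client)  echo "the spark-submit process (or the pod that runs it)" ;;
    cluster) echo "the driver pod" ;;
    *)       echo "unknown deploy mode: $1" >&2; return 2 ;;
  esac
}
```

Calling it with client or cluster gives a quick reminder of which process or pod you would need to terminate during a redeployment.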
Hands-on Example
With that in mind, I want to take you through a real example of using these modes.
You will once again need:
- Minikube
- Spark (I had version 4.0.1 locally when writing this)
- Official Spark Docker image 4.0.1
This time we will use a streaming example to simulate a long-running application, which makes the driver lifecycle easier to observe.
This will require some setup before we can run it, so follow these steps:
$ minikube ssh # get into the minikube container
# inside the container run
$ nc -lk 9999 # netcat listening on port 9999
This is just to generate a TCP stream for our example to read from.
Now let’s jump into it.
Cluster Mode from Local Machine
In this mode, after submitting your application, Kubernetes will start the Spark driver,
which will take care of job coordination.
You can submit these jobs from your local workstation, even to a remote Kubernetes cluster, and they
will work. The reason is that the driver and the executors all run inside that Kubernetes cluster.
You can try it:
$ ./bin/spark-submit \
--master k8s://https://$(minikube ip):8443 \
--class org.apache.spark.examples.streaming.NetworkWordCount \
--deploy-mode cluster \
--name spark-pi \
--executor-memory 2G \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=apache/spark:4.0.1-scala2.13-java21-ubuntu \
--conf spark.kubernetes.authenticate.driver.serviceAccountName="spark" \
local:///opt/spark/examples/jars/spark-examples.jar $(minikube ip) 9999
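Since this walkthrough reuses the same command and only changes the deploy mode, it can help to wrap it in a small shell function. The following is a minimal sketch, not part of the original setup: the function name submit_wordcount and the SPARK_SUBMIT and MINIKUBE_IP variables are hypothetical, introduced only so the command can be previewed without a cluster.

```shell
# Hypothetical wrapper around the spark-submit command from this article.
# SPARK_SUBMIT and MINIKUBE_IP are illustrative overrides, not Spark settings.
submit_wordcount() (
  mode="$1"   # deploy mode: "cluster" or "client"
  case "$mode" in
    cluster|client) ;;
    *) echo "usage: submit_wordcount cluster|client" >&2; exit 2 ;;
  esac
  # Ask minikube for the IP only when MINIKUBE_IP is not already set.
  ip="${MINIKUBE_IP:-$(minikube ip)}"
  "${SPARK_SUBMIT:-./bin/spark-submit}" \
    --master "k8s://https://${ip}:8443" \
    --class org.apache.spark.examples.streaming.NetworkWordCount \
    --deploy-mode "$mode" \
    --name spark-pi \
    --executor-memory 2G \
    --conf spark.executor.instances=2 \
    --conf spark.kubernetes.container.image=apache/spark:4.0.1-scala2.13-java21-ubuntu \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    local:///opt/spark/examples/jars/spark-examples.jar "$ip" 9999
)
```

With SPARK_SUBMIT set to echo, the function prints the arguments instead of submitting anything, which is a cheap way to double-check the deploy mode before a real run.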
Let’s observe:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
networkwordcount-e7bc879c678f3a3f-exec-1 1/1 Running 0 13s
networkwordcount-e7bc879c678f3a3f-exec-2 1/1 Running 0 12s
spark-pi-bb4ee39c678f2075-driver 1/1 Running 0 19s
💡 Info
The names of pods in your environment might differ, but the behavior remains the same.
Cluster mode is straightforward. Once submitted, the job continues as long as the driver pod is alive, even if the submit process is killed.
If you read my previous post on the topic, we did something similar, but what we couldn’t really observe
was what happens when we stop the spark-submit process running in the local shell.
Go ahead and interrupt that process. After it stops, check the pods:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
networkwordcount-e7bc879c678f3a3f-exec-1 1/1 Running 0 2m48s
networkwordcount-e7bc879c678f3a3f-exec-2 1/1 Running 0 2m47s
spark-pi-bb4ee39c678f2075-driver 1/1 Running 0 2m54s
Key observation: they are still there. There is no direct influence from stopping the shell process on the running job.
Now stop the running driver to clean up our namespace:
$ kubectl delete pods spark-pi-bb4ee39c678f2075-driver # name of your driver from previous "get pods" example
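If you would rather not copy the generated pod name, a label selector can do the same cleanup. This is a minimal sketch that relies on the spark-role=driver label Spark attaches to driver pods. The function name delete_drivers and the KUBECTL override are hypothetical and exist only to allow a dry run.

```shell
# Sketch: delete Spark driver pods by label instead of by name.
# Note: this targets EVERY driver pod in the current namespace.
# KUBECTL is a hypothetical override (e.g. KUBECTL=echo for a dry run).
delete_drivers() {
  "${KUBECTL:-kubectl}" delete pods -l spark-role=driver
}
```

In cluster mode the executor pods are owned by the driver pod, so they should be cleaned up once the driver is gone.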
Client Mode from Local Machine
If you change the mode to client and try the same command, your job is going to fail. Try this:
$ ./bin/spark-submit \
--master k8s://https://$(minikube ip):8443 \
--class org.apache.spark.examples.streaming.NetworkWordCount \
--deploy-mode client \
--name spark-pi \
--executor-memory 2G \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=apache/spark:4.0.1-scala2.13-java21-ubuntu \
--conf spark.kubernetes.authenticate.driver.serviceAccountName="spark" \
local:///opt/spark/examples/jars/spark-examples.jar $(minikube ip) 9999
In your console you will find:
Error: Failed to load class org.apache.spark.examples.streaming.NetworkWordCount
In client mode, spark-submit runs the driver locally.
That means the JAR must exist on your local machine.
The path:
/opt/spark/examples/jars/spark-examples.jar
exists inside the container image, not on your workstation.
We could change the path to:
local:///<full-path-to-your-local-binaries>/examples/jars/spark-examples_2.13-4.0.1.jar
But that would fail as well. You can check for yourself: your executors would start and fail, and you would find the following error on them:
java.nio.file.NoSuchFileException: /<full-path-to-your-local-binaries>/examples/jars/spark-examples_2.13-4.0.1.jar
That’s because the driver on your local machine would find the file on the specified path, but your containers don’t contain that file at the same path.
To test this, we will have to make a few adjustments. We will copy the examples JAR file to the same location where it exists in the container. If you don’t already have Spark binaries at:
/opt/spark/
copy them there first. The directory /opt is owned by root by default, so you might need to run these commands with sudo.
$ mkdir -p /opt/spark/ # create folder
$ cp -R <full-path-to-your-local-binaries>/. /opt/spark/ # copy the contents of your Spark directory into that folder
$ cp /opt/spark/examples/jars/spark-examples_2.13-4.0.1.jar /opt/spark/examples/jars/spark-examples.jar # because spark-examples doesn't exist by default
Now if you run:
$ ./bin/spark-submit \
--master k8s://https://$(minikube ip):8443 \
--class org.apache.spark.examples.streaming.NetworkWordCount \
--deploy-mode client \
--name spark-pi \
--executor-memory 2G \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.container.image=apache/spark:4.0.1-scala2.13-java21-ubuntu \
--conf spark.kubernetes.authenticate.driver.serviceAccountName="spark" \
local:///opt/spark/examples/jars/spark-examples.jar $(minikube ip) 9999
It will work, because after copying the files into /opt/spark, your local machine directory
structure matches the one inside the Docker container - at least for the JARs used by spark-submit.
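Because a local:// URI has to resolve on the machine running the driver as well as inside the executor image, a small pre-flight check can catch the first half of the problem before you submit. The sketch below is an assumption-laden illustration: local_uri_to_path and jar_exists_locally are hypothetical helper names, and the check covers only your workstation, not the container.

```shell
# Hypothetical pre-flight check for client mode.
# It verifies only the local side; the image must contain the same path too.
local_uri_to_path() {
  # local:///opt/spark/x.jar -> /opt/spark/x.jar
  printf '%s\n' "${1#local://}"
}

jar_exists_locally() {
  # Succeeds when the file behind the local:// URI exists on this machine.
  [ -f "$(local_uri_to_path "$1")" ]
}
```

For example, jar_exists_locally local:///opt/spark/examples/jars/spark-examples.jar returns a non-zero status until the copy step above has been done.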
Check pods:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
networkwordcount-c4dc429c6a7c3472-exec-1 1/1 Running 0 16s
networkwordcount-c4dc429c6a7c3472-exec-4 1/1 Running 0 9s
The key observation: there is no driver running in Kubernetes. Your shell process is the driver and owns the entire lifecycle. If you stop it, the job stops.
Go ahead and do that - and check the pods:
$ kubectl get pods
No resources found in default namespace.
All the executors were deleted because the driver was stopped. The local spark-submit process was the driver, so it owned the entire lifecycle of the job.
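One way to confirm both observations without scanning pod names is to filter by the spark-role label that Spark sets on its pods. This is a minimal sketch: the function name pods_by_role and the KUBECTL override are hypothetical, added only so the command can be previewed.

```shell
# Sketch: list Spark pods of one role in the current namespace.
# KUBECTL is a hypothetical override (e.g. KUBECTL=echo for a dry run).
pods_by_role() {
  case "$1" in
    driver|executor) ;;
    *) echo "usage: pods_by_role driver|executor" >&2; return 2 ;;
  esac
  "${KUBECTL:-kubectl}" get pods -l "spark-role=$1" -o name
}
```

While a client-mode job is running, pods_by_role driver should print nothing and pods_by_role executor should list the executors. After you stop the local process, the executor list should become empty as well.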
📌 Pro tip
Executors sometimes fail and get replaced. You can control how many failures are tolerated with --conf spark.executor.maxNumFailures. Once that limit is exceeded, the application fails. This usually happens in client mode, when the driver coordinates from your local machine and executors don’t respond quickly enough.
Deploy Mode Should Depend on Your Use Case
Which mode to use really depends on your use case.
Cluster mode is simpler, especially when submitting from your local machine. It works well for jobs that run for a finite time and then finish. If something goes wrong during execution, Kubernetes can handle restarts and retries automatically.
Client mode is more complex and generally not practical from a local machine outside of testing. It is useful for long-running applications - ones you don’t expect to stop automatically, but where you want full control over when and how they terminate.
The choice of data source also matters. If you are using something like Delta Lake, it handles many data consistency concerns automatically. With simpler formats like Parquet, you need to be more careful about when and how you stop jobs.
That’s why in the next post, we’ll explore a production-like setup, including submitting Spark applications from inside Kubernetes rather than from a local machine.