Run a batch job
Introduction
This tutorial will take you through running a computation on the cluster as a batch job. It uses a script written in Python, but many steps are not specific to Python.
The goal is to find the solution of an integer equation.
To go through this tutorial, you will need access to the Kubernetes cluster; see Get Started.
See also How to send emails on job completion.
Write the code
This is the Python code we'll use; save it in a file named `solver.py`:
```python
import sys


def f(value):
    value += 1610142364
    for k in range(5):
        k = value // 127773
        value = 16807 * (value - k * 127773) - k * 2836
        if value <= 0:
            value = value + 2147483647
    return value


def main(lower, higher):
    for i in range(lower, higher):
        if i % 5000000 == 0:
            print("%d..." % i)
        if f(i) == 45:
            print("Found solution: %d" % i)
            return
    print("No solution in [%d, %d)" % (lower, higher))


if __name__ == '__main__':
    main(int(sys.argv[1], 10), int(sys.argv[2], 10))
```
As you can see, it takes as arguments the two endpoints of the range to search, and prints the solution if it finds one.
Submit it as a job
Because this job only requires a single short file, we will use a ConfigMap to send it into the cluster. ConfigMaps are commonly used to store configuration values for other workloads.
If we had more code, we would build a new Docker container image from our code. You can see an example of that in the Flask application tutorial.
You can create a ConfigMap named `solver` that contains our script using this command:
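A minimal sketch using `kubectl create configmap`, assuming `solver.py` is in your current directory (the file name becomes the key under which the script is stored, which is what the volume mount relies on):

```shell
# Create a ConfigMap named "solver" whose single key, solver.py,
# holds the contents of the local file
kubectl create configmap solver --from-file=solver.py
```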
You can see that it has been created:
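For example, with `kubectl get` (the AGE column will differ on your cluster):

```shell
$ kubectl get configmap solver
NAME     DATA   AGE
solver   1      8s
```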
Now write this Job definition to `job.yml`:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: solver
  labels:
    app: solver
    source: hsrn-tutorial
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: python
          image: python:3.10
          args: ["python", "-u", "/app/solver.py", "0", "100000000"]
          volumeMounts:
            - name: script
              mountPath: /app
          resources:
            requests:
              cpu: 1        # Request one CPU (code is not multi-threaded)
              memory: 5Mi   # Request that this memory be allocated to us
            limits:
              cpu: 2        # Throttle the container if using more CPU
              memory: 100Mi # Terminate the container if using more memory
      volumes:
        - name: script
          configMap:
            name: solver
```
And submit it with `kubectl`:
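Using `kubectl apply` (the confirmation line is illustrative):

```shell
$ kubectl apply -f job.yml
job.batch/solver created
```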
Monitoring your Job
You can see the Job and the Pod that has been created for it:
```shell
$ kubectl get job
NAME     COMPLETIONS   DURATION   AGE
solver   0/1           50s        50s
$ kubectl get pod
NAME              READY   STATUS    RESTARTS   AGE
solver--1-rshnr   1/1     Running   0          51s
```
If there is a failure, or if you simply delete the Pod, a new Pod will be created and start over:
```shell
$ kubectl delete pod solver--1-rshnr
pod "solver--1-rshnr" deleted
$ kubectl get pod
NAME              READY   STATUS              RESTARTS   AGE
solver--1-rshnr   1/1     Terminating         0          59s
solver--1-vkzs7   0/1     ContainerCreating   0          2s
```
You can see the output from the Pod (once it starts) using:
```shell
$ kubectl logs -f job/solver
0...
1000000...
2000000...
3000000...
4000000...
5000000...
6000000...
7000000...
```
And after a few minutes the Job completes:
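`kubectl get job` will then report it as complete (the durations below are illustrative):

```shell
$ kubectl get job
NAME     COMPLETIONS   DURATION   AGE
solver   1/1           4m13s      4m20s
```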
Parallelism
It is possible to run a Job that consists of multiple tasks. Each task will be run as a separate Pod, either in turn or at the same time.
There are multiple configurations, which vary in when the Kubernetes Job controller should consider the Job completed. You can refer to the official documentation for a complete list of options.
We are going to accelerate our search by running 10 separate Pods, which will each search in a separate range.
We will make the following changes to our Job definition:
- Set `completions` to 10
- Set `completionMode` to `Indexed`, which will give each Pod an index from 0 to 9 as the `JOB_COMPLETION_INDEX` environment variable
- Set `parallelism` to 4, which will allow Kubernetes to run 4 of the 10 Pods at a time
- Change the container's command to search in the correct range based on the index in the Job, computed using bash
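The bash arithmetic is easy to check on its own. This standalone snippet computes the range for index 3, using the same `expr` invocations as the manifest:

```shell
# Compute the slice of the search space for one completion index,
# exactly as the Job's bash command line does (index 3 as an example)
JOB_COMPLETION_INDEX=3
lower=$(expr $JOB_COMPLETION_INDEX \* 10000000)
higher=$(expr \( $JOB_COMPLETION_INDEX + 1 \) \* 10000000)
echo "Pod $JOB_COMPLETION_INDEX searches [$lower, $higher)"
```

Index 3 therefore searches [30000000, 40000000), and the ten indices together cover [0, 100000000) with no overlap.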
Create the new manifest as `job-parallel.yml`:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: solver-parallel
  labels:
    app: solver
    source: hsrn-tutorial
spec:
  completions: 10
  parallelism: 4
  completionMode: Indexed
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: python
          image: python:3.10
          args:
            - bash
            - -c
            - "python -u /app/solver.py $(expr $JOB_COMPLETION_INDEX \\* 10000000) $(expr \\( $JOB_COMPLETION_INDEX + 1 \\) \\* 10000000)"
          volumeMounts:
            - name: script
              mountPath: /app
          resources:
            # Resource requests and limits apply to each container,
            # not to the whole Job
            requests:
              cpu: 1
              memory: 5Mi
            limits:
              cpu: 2
              memory: 100Mi
      volumes:
        - name: script
          configMap:
            name: solver
```
Create the new Job and watch the Pods get created, four by four:
```shell
$ kubectl apply -f job-parallel.yml
job.batch/solver-parallel configured
$ kubectl get pod --watch
NAME                      READY   STATUS              RESTARTS   AGE
solver-parallel-0-gb75c   0/1     ContainerCreating   0          2s
solver-parallel-1-2rr8t   0/1     ContainerCreating   0          2s
solver-parallel-2-f6dt8   0/1     ContainerCreating   0          2s
solver-parallel-3-789vq   0/1     ContainerCreating   0          2s
solver-parallel-0-gb75c   1/1     Running             0          5s
solver-parallel-1-2rr8t   1/1     Running             0          5s
solver-parallel-2-f6dt8   1/1     Running             0          5s
solver-parallel-3-789vq   1/1     Running             0          5s
solver-parallel-0-gb75c   0/1     Completed           0          18s
solver-parallel-3-789vq   0/1     Completed           0          18s
...
```
After a few seconds, all Pods are completed, and the Job reports this too:
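Querying the Job by name (durations will vary):

```shell
$ kubectl get job solver-parallel
NAME              COMPLETIONS   DURATION   AGE
solver-parallel   10/10         64s        70s
```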
You can get the logs of all the Pods using a label selector. The Pods will not be in order, but you will be able to see the solution:
```shell
$ kubectl logs -l job-name=solver-parallel
10000000...
15000000...
No solution in [10000000, 20000000)
70000000...
75000000...
No solution in [70000000, 80000000)
80000000...
85000000...
No solution in [80000000, 90000000)
40000000...
45000000...
No solution in [40000000, 50000000)
50000000...
55000000...
No solution in [50000000, 60000000)
60000000...
Found solution: 64782617
90000000...
95000000...
No solution in [90000000, 100000000)
0...
5000000...
No solution in [0, 10000000)
20000000...
25000000...
No solution in [20000000, 30000000)
30000000...
35000000...
No solution in [30000000, 40000000)
```
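You can also confirm the reported solution without the cluster. This standalone snippet re-declares `f` from `solver.py` and evaluates it at the value the Job found:

```python
def f(value):
    # Same function as in solver.py
    value += 1610142364
    for k in range(5):
        k = value // 127773
        value = 16807 * (value - k * 127773) - k * 2836
        if value <= 0:
            value = value + 2147483647
    return value


# 64782617 is the solution printed by the parallel Job above;
# a solution is any input that f maps to 45
print(f(64782617))  # prints 45
```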
Clean up
You can delete all the Jobs you created in this tutorial by using the labels, like this:
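Both Jobs carry the label `source: hsrn-tutorial` from their manifests, so a single `kubectl delete` with a label selector removes them (assuming both still exist):

```shell
$ kubectl delete job -l source=hsrn-tutorial
job.batch "solver" deleted
job.batch "solver-parallel" deleted
```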
Also delete the ConfigMap:
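For example:

```shell
$ kubectl delete configmap solver
configmap "solver" deleted
```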