Run a batch job
Introduction
This tutorial will take you through running a computation on the cluster as a batch job. It uses a script written in Python, but many steps are not specific to Python.
The goal is to find the solution of an integer equation.
To go through this tutorial, you will need access to the Kubernetes cluster; see Get Started.
See also How to send emails on job completion.
Write the code
This is the Python code we'll use; save it in a file named `solver.py`:
```python
import sys


def f(value):
    value += 1610142364
    for k in range(5):
        k = value // 127773
        value = 16807 * (value - k * 127773) - k * 2836
        if value <= 0:
            value = value + 2147483647
    return value


def main(lower, higher):
    for i in range(lower, higher):
        if i % 5000000 == 0:
            print("%d..." % i)
        if f(i) == 45:
            print("Found solution: %d" % i)
            return
    print("No solution in [%d, %d)" % (lower, higher))


if __name__ == '__main__':
    main(int(sys.argv[1], 10), int(sys.argv[2], 10))
```
As you can see, it takes as arguments the two endpoints of the range to search, and prints the solution if it finds one.
Submit it as a job
Because this job only requires a single short file, we will use a ConfigMap to send it into the cluster. ConfigMaps are commonly used to store configuration values for other workloads.
If we had more code, we would build a new Docker container image from our code. You can see an example of that in the Flask application tutorial.
You can create a ConfigMap named `solver` that contains our script using this command:
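A minimal sketch using `kubectl create configmap`, assuming `solver.py` is in your current directory (the file name becomes the key under which the script is stored, which is what the volume mount relies on):

```shell
# Create a ConfigMap named "solver" whose single key, solver.py,
# holds the contents of the local file
kubectl create configmap solver --from-file=solver.py
```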
You can see that it has been created:
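For example, with `kubectl get` (the AGE column will differ on your cluster):

```shell
$ kubectl get configmap solver
NAME     DATA   AGE
solver   1      8s
```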
Now write this Job definition to `job.yml`:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: solver
  labels:
    app: solver
    source: hsrn-tutorial
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: python
          image: python:3.10
          args: ["python", "-u", "/app/solver.py", "0", "100000000"]
          volumeMounts:
            - name: script
              mountPath: /app
          resources:
            requests:
              cpu: 1        # Request one CPU (code is not multi-threaded)
              memory: 5Mi   # Request that this memory be allocated to us
            limits:
              cpu: 2        # Throttle the container if using more CPU
              memory: 100Mi # Terminate the container if using more memory
      volumes:
        - name: script
          configMap:
            name: solver
```
And submit it with `kubectl`:
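Using `kubectl apply` (the confirmation line is illustrative):

```shell
$ kubectl apply -f job.yml
job.batch/solver created
```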
Monitoring your Job
You can see the Job and the Pod that has been created for it:
```shell
$ kubectl get job
NAME     COMPLETIONS   DURATION   AGE
solver   0/1           50s        50s
$ kubectl get pod
NAME              READY   STATUS    RESTARTS   AGE
solver--1-rshnr   1/1     Running   0          51s
```
If there is a failure, or if you simply delete the Pod, a new Pod will be created and start over:
```shell
$ kubectl delete pod solver--1-rshnr
pod "solver--1-rshnr" deleted
$ kubectl get pod
NAME              READY   STATUS              RESTARTS   AGE
solver--1-rshnr   1/1     Terminating         0          59s
solver--1-vkzs7   0/1     ContainerCreating   0          2s
```
You can see the output from the Pod (once it starts) using:
```shell
$ kubectl logs -f job/solver
0...
1000000...
2000000...
3000000...
4000000...
5000000...
6000000...
7000000...
```
And after a few minutes the Job completes:
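`kubectl get job` will then report it as complete (the durations below are illustrative):

```shell
$ kubectl get job
NAME     COMPLETIONS   DURATION   AGE
solver   1/1           4m13s      4m20s
```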
Parallelism
It is possible to run a Job that consists of multiple tasks. Each task will be run as a separate Pod, either in turn or at the same time.
There are multiple configurations, which vary in when the Kubernetes Job controller should consider the Job completed. You can refer to the official documentation for a complete list of options.
We are going to accelerate our search by running 10 separate Pods, which will each search in a separate range.
We will make the following changes to our Job definition:
- Set `completions` to 10
- Set `completionMode` to `Indexed`, which will give each Pod an index from 0 to 9 as the `JOB_COMPLETION_INDEX` environment variable
- Set `parallelism` to 4, which will allow Kubernetes to run 4 of the 10 Pods at a time
- Change the container's command to search in the correct range based on the index in the Job, computed using bash
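The bash arithmetic is easy to check on its own. This standalone snippet computes the range for index 3, using the same `expr` invocations as the manifest:

```shell
# Compute the slice of the search space for one completion index,
# exactly as the Job's bash command line does (index 3 as an example)
JOB_COMPLETION_INDEX=3
lower=$(expr $JOB_COMPLETION_INDEX \* 10000000)
higher=$(expr \( $JOB_COMPLETION_INDEX + 1 \) \* 10000000)
echo "Pod $JOB_COMPLETION_INDEX searches [$lower, $higher)"
```

Index 3 therefore searches [30000000, 40000000), and the ten indices together cover [0, 100000000) with no overlap.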
Create the new manifest as `job-parallel.yml`:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: solver-parallel
  labels:
    app: solver
    source: hsrn-tutorial
spec:
  completions: 10
  parallelism: 4
  completionMode: Indexed
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: python
          image: python:3.10
          args:
            - bash
            - -c
            - "python -u /app/solver.py $(expr $JOB_COMPLETION_INDEX \\* 10000000) $(expr \\( $JOB_COMPLETION_INDEX + 1 \\) \\* 10000000)"
          volumeMounts:
            - name: script
              mountPath: /app
          resources:
            # Resource requests and limits apply to each container,
            # not to the whole Job
            requests:
              cpu: 1
              memory: 5Mi
            limits:
              cpu: 2
              memory: 100Mi
      volumes:
        - name: script
          configMap:
            name: solver
```
Create the new Job and watch the Pods get created, four by four:
```shell
$ kubectl apply -f job-parallel.yml
job.batch/solver-parallel configured
$ kubectl get pod --watch
NAME                      READY   STATUS              RESTARTS   AGE
solver-parallel-0-gb75c   0/1     ContainerCreating   0          2s
solver-parallel-1-2rr8t   0/1     ContainerCreating   0          2s
solver-parallel-2-f6dt8   0/1     ContainerCreating   0          2s
solver-parallel-3-789vq   0/1     ContainerCreating   0          2s
solver-parallel-0-gb75c   1/1     Running             0          5s
solver-parallel-1-2rr8t   1/1     Running             0          5s
solver-parallel-2-f6dt8   1/1     Running             0          5s
solver-parallel-3-789vq   1/1     Running             0          5s
solver-parallel-0-gb75c   0/1     Completed           0          18s
solver-parallel-3-789vq   0/1     Completed           0          18s
...
```
After a few seconds, all Pods are completed, and the Job reports this too:
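Querying the Job by name (durations will vary):

```shell
$ kubectl get job solver-parallel
NAME              COMPLETIONS   DURATION   AGE
solver-parallel   10/10         64s        70s
```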
You can get the logs of all the Pods using a label selector. The Pods will not be in order, but you will be able to see the solution:
```shell
$ kubectl logs -l job-name=solver-parallel
10000000...
15000000...
No solution in [10000000, 20000000)
70000000...
75000000...
No solution in [70000000, 80000000)
80000000...
85000000...
No solution in [80000000, 90000000)
40000000...
45000000...
No solution in [40000000, 50000000)
50000000...
55000000...
No solution in [50000000, 60000000)
60000000...
Found solution: 64782617
90000000...
95000000...
No solution in [90000000, 100000000)
0...
5000000...
No solution in [0, 10000000)
20000000...
25000000...
No solution in [20000000, 30000000)
30000000...
35000000...
No solution in [30000000, 40000000)
```
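You can also confirm the reported solution without the cluster. This standalone snippet re-declares `f` from `solver.py` and evaluates it at the value the Job found:

```python
def f(value):
    # Same function as in solver.py
    value += 1610142364
    for k in range(5):
        k = value // 127773
        value = 16807 * (value - k * 127773) - k * 2836
        if value <= 0:
            value = value + 2147483647
    return value


# 64782617 is the solution printed by the parallel Job above;
# a solution is any input that f maps to 45
print(f(64782617))  # prints 45
```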
Clean up
You can delete all the Jobs you created in this tutorial by using the labels, like this:
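Both Jobs carry the label `source: hsrn-tutorial` from their manifests, so a single `kubectl delete` with a label selector removes them (assuming both still exist):

```shell
$ kubectl delete job -l source=hsrn-tutorial
job.batch "solver" deleted
job.batch "solver-parallel" deleted
```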
Also delete the ConfigMap:
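For example:

```shell
$ kubectl delete configmap solver
configmap "solver" deleted
```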