Chapter 4. Batch Job

The Batch Job pattern is suited for managing isolated atomic units of work. It is based on the Job abstraction, which runs short-lived Pods reliably until completion in a distributed environment.

Problem

The main primitive in Kubernetes for managing and running containers is the Pod. There are different ways of creating Pods with varying characteristics:

Bare Pod

It is possible to create a Pod manually to run containers. However, when the node such a Pod is running on fails, the Pod is not restarted. Running Pods this way is discouraged except for development or testing purposes. Such Pods are also known as unmanaged or naked Pods.

ReplicaSet

This controller is used for creating and managing the lifecycle of Pods expected to run continuously (e.g., to run a web server container). It maintains a stable set of replica Pods running at any given time and guarantees the availability of a specified number of identical Pods.

DaemonSet

This controller runs a single Pod on every node. It is typically used for managing platform capabilities such as monitoring, log aggregation, and storage containers. See Chapter 9 for a detailed discussion on DaemonSets.

A common aspect of these Pods is the fact that they represent long-running processes that are not meant to stop after some time. However, in some cases there is a need to perform a predefined finite unit of work reliably and then shut down the container. For this task, Kubernetes provides the Job resource.

Solution

A Kubernetes Job is similar to a ReplicaSet as it creates one or more Pods and ensures they run successfully. However, the difference is that, once the expected number of Pods terminate successfully, the Job is considered complete and no additional Pods are started. A Job definition looks like Example 4-1.

Example 4-1. A Job specification
apiVersion: batch/v1
kind: Job
metadata:
  name: random-generator
spec:
  completions: 5                      1
  parallelism: 2                      2
  ttlSecondsAfterFinished: 300        3
  template:
    metadata:
      name: random-generator
    spec:
      restartPolicy: OnFailure        4
      containers:
      - image: k8spatterns/random-generator:1.0
        name: random-generator
        command: [ "java", "-cp", "/", "RandomRunner", "/numbers.txt", "10000" ]
1

Job should run five Pods to completion, which all must succeed.

2

Two Pods can run in parallel.

3

Keep the finished Job and its Pods for five minutes (300 seconds) before garbage-collecting them.

4

Specifying the restartPolicy is mandatory for a Job. Possible values are OnFailure or Never.

One crucial difference between the Job and the ReplicaSet definition is the .spec.template.spec.restartPolicy. The default value for a ReplicaSet is Always, which makes sense for long-running processes that must always be kept running. The value Always is not allowed for a Job and the only possible options are either OnFailure or Never.
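To make the two permitted values concrete, the fragment below is a minimal sketch of the alternative to the OnFailure setting used in Example 4-1. With OnFailure, a failed container is restarted inside the same Pod; with Never, the failed Pod is left in place and the Job controller creates a new Pod for the retry. Only the relevant part of the Job spec is shown.

```yaml
# Fragment of a Job spec; all other fields as in Example 4-1.
spec:
  template:
    spec:
      restartPolicy: Never   # failed Pods are kept; retries run in new Pods
```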

So why bother creating a Job to run a Pod only once instead of using bare Pods? Using Jobs provides many reliability and scalability benefits that make them the preferred option:

  • A Job is not an ephemeral in-memory task, but a persisted one that survives cluster restarts.

  • When a Job is completed, it is not deleted but kept for tracking purposes. The Pods that are created as part of the Job are also not deleted but available for examination (e.g., to check the container logs). This is also true for bare Pods, but only for a restartPolicy: OnFailure. You can still remove the Pods of a Job after a certain time by specifying .spec.ttlSecondsAfterFinished.

  • A Job may need to be performed multiple times. Using the .spec.completions field it is possible to specify how many times a Pod should complete successfully before the Job itself is done.

  • When a Job has to be completed multiple times (set through .spec.completions), it can also be scaled and executed by starting multiple Pods at the same time. That can be done by specifying the .spec.parallelism field.

  • A Job can be suspended by setting the field .spec.suspend to true. In this case, all active Pods are deleted, and new Pods are started when the Job is resumed (i.e., when the user sets .spec.suspend back to false).

  • If the node fails or the Pod is evicted for some reason while still running, the Job controller creates a replacement Pod, which the scheduler places on a healthy node. Bare Pods would remain in a failed state, as existing Pods are never moved to other nodes.

All of this makes the Job primitive attractive for scenarios where some guarantees are required for the completion of a unit of work.
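As a sketch of the suspend behavior listed above, the following Job is created in a suspended state, so no Pods start until it is resumed. The name random-generator-suspended is made up for illustration; the container definition is taken from Example 4-1.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: random-generator-suspended   # hypothetical name
spec:
  suspend: true                      # no Pods are created while this is true
  completions: 5
  parallelism: 2
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - image: k8spatterns/random-generator:1.0
        name: random-generator
        command: [ "java", "-cp", "/", "RandomRunner", "/numbers.txt", "10000" ]
```

Setting .spec.suspend to false afterward (for example, with kubectl patch) lets the Job controller start the Pods.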

The two fields that play major roles in the behavior of a Job are:

.spec.completions

Specifies how many Pods should run to complete a Job.

.spec.parallelism

Specifies how many Pod replicas can run in parallel. Setting a high number does not guarantee a high level of parallelism, and the actual number of Pods may still be less (and in some corner cases, more) than the desired number (e.g., due to throttling, resource quotas, not enough completions left, and other reasons). Setting this field to 0 effectively pauses the Job.

Figure 4-1 shows how the Batch Job defined in Example 4-1 with a completion count of five and a parallelism of two is processed.

Figure 4-1. Parallel Batch Job with a fixed completion count

Based on these two parameters, there are the following types of Jobs:

Single Pod Job

This type is selected when you leave out both .spec.completions and .spec.parallelism or set them to their default values of one. Such a Job starts only one Pod and is completed as soon as the single Pod terminates successfully (with exit code 0).

Fixed completion count Jobs

When you specify .spec.completions with a number greater than one, this many Pods must succeed. Optionally, you can set .spec.parallelism, or leave it at the default value of one. Such a Job is considered completed after the .spec.completions number of Pods has completed successfully. Example 4-1 shows this mode in action and is the best choice when we know the number of work items in advance, and the processing cost of a single work item justifies the use of a dedicated Pod.

Work queue Jobs

You have a work queue for parallel Jobs when you leave out .spec.completions and set .spec.parallelism to an integer greater than one. A work queue Job is considered completed when at least one Pod has terminated successfully, and all other Pods have terminated too. This setup requires the Pods to coordinate among themselves and determine what each one is working on so that they can finish in a coordinated fashion. For example, when a fixed but unknown number of work items is stored in a queue, parallel Pods can pick these up one by one to work on them. The first Pod that detects that the queue is empty and exits with success indicates the completion of the Job. The Job controller waits for all other Pods to terminate too. Since one Pod processes multiple work items, this Job type is an excellent choice for granular work items—when the overhead for one Pod per work item is not justified.

Indexed Jobs

Similar to work queue Jobs, Indexed Jobs let you distribute work items to individual Pods, but without needing an external work queue. When you use a fixed completion count and set the completion mode .spec.completionMode to Indexed, every Pod of the Job gets an associated index ranging from 0 to .spec.completions - 1. The assigned index is available to the containers through the Pod annotation batch.kubernetes.io/job-completion-index (see Chapter 13 for how this annotation can be accessed from your code) or directly via the environment variable JOB_COMPLETION_INDEX, which is set to the index associated with this Pod. With this index at hand, the application can pick the associated work item without any external synchronization. Example 4-2 shows a Job in which separate Pods each process a different range of lines from a single file. A more realistic use would be an indexed Job for video processing, where each parallel Pod processes a frame range calculated from its index in the same way.

Example 4-2. An indexed Job selecting its work items based on a job index
apiVersion: batch/v1
kind: Job
metadata:
  name: file-split
spec:
  completionMode: Indexed                                               1
  completions: 5                                                        2
  parallelism: 5
  template:
    metadata:
      name: file-split
    spec:
      containers:
      - image: library/perl
        name: split
        command:
        - "perl"                                                        3
        - "-ne"
        - |
          BEGIN {
            $idx = $ENV{JOB_COMPLETION_INDEX};                          4
            open($fh,">","/logs/random-${idx}.txt");                    5
          };
          print $fh $_ if $. > $idx * 10000 && $. <= ($idx+1) * 10000;  6
          END {
            close($fh)
          }
        - /logs/random.log
        volumeMounts:
        - mountPath: /logs                                              7
          name: log-volume
      restartPolicy: OnFailure
1

Enable an indexed completion mode.

2

Run five Pods in parallel to completion.

3

Execute a Perl script that prints a range of lines from the given file /logs/random.log. This file is expected to have 50,000 lines of data.

4

Remember the current completion index (0 … 4) in a Perl variable $idx.

5

Open a new output file whose name is based on the index.

6

Write out the line of the input file if it is in the range dedicated to the completion index ($_ is the current line in the input file, $. the input line number).

7

Mount the input data from an external volume. The volume is not shown here; you can find the full working definition in the example repository.
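The video-processing case mentioned above could derive its frame range from the index in much the same way. The following is only a sketch under stated assumptions: the Job name, the busybox placeholder image, and the block size of 250 frames per Pod are all made up for illustration, and the script merely prints the range that a real encoder would receive.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: frame-range                  # hypothetical name
spec:
  completionMode: Indexed
  completions: 4                     # four Pods with indexes 0 to 3
  parallelism: 4
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: frames
        image: busybox               # placeholder; a real Job would use a video-processing image
        command:
        - sh
        - -c
        - |
          # Assumption: each Pod handles a contiguous block of 250 frames.
          FRAMES_PER_POD=250
          START=$((JOB_COMPLETION_INDEX * FRAMES_PER_POD))
          END=$((START + FRAMES_PER_POD - 1))
          echo "Pod ${JOB_COMPLETION_INDEX} would process frames ${START}-${END}"
```

As in Example 4-2, no coordination between the Pods is needed, because each one computes a disjoint range from its own index.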

If you have an unlimited stream of work items to process, other controllers like ReplicaSet are the better choice for managing the Pods processing these work items.
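For a bounded queue, by contrast, the work queue Job type described earlier applies. The following is a minimal sketch, assuming a hypothetical worker image that pulls items from a queue and exits with code 0 once the queue is empty. The Job name, the image, the QUEUE_URL variable, and the queue address are placeholders and not part of the original examples.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: queue-worker                 # hypothetical name
spec:
  parallelism: 3                     # three workers; .spec.completions is deliberately left out
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: worker
        image: example.com/queue-worker:1.0   # hypothetical image
        env:
        - name: QUEUE_URL                     # hypothetical setting read by the worker
          value: "redis://work-queue:6379"
```

Because .spec.completions is omitted, the Job is considered complete once at least one worker has exited successfully and all other workers have terminated.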

Discussion

The Job abstraction is a pretty basic but also fundamental primitive that other primitives such as CronJobs are based on. Jobs help turn isolated work units into a reliable and scalable unit of execution. However, a Job doesn’t dictate how you should map individually processable work items into Jobs or Pods. That is something you have to determine after considering the pros and cons of each option:

One Job per work item

This option incurs the overhead of creating Kubernetes Jobs and requires the platform to manage a large number of Jobs that consume resources. This option is useful when each work item is a complex task that has to be recorded, tracked, or scaled independently.

One Job for all work items

This option is right for a large number of work items that do not have to be independently tracked and managed by the platform. In this scenario, the work items have to be managed from within the application via a batch framework.

The Job primitive provides only the bare minimum for scheduling work items. Any complex implementation has to combine the Job primitive with a batch application framework (e.g., in the Java ecosystem we have Spring Batch and JBeret as standard implementations) to achieve the desired outcome.

Not all services must run all the time. Some services must run on demand, some at a specific time, and some periodically. With Jobs, Pods run only when needed and only for the duration of the task execution. Jobs are scheduled on nodes that have the required capacity and that satisfy Pod placement policies and other container dependency considerations. Using Jobs for short-lived tasks rather than long-running abstractions (such as ReplicaSet) saves resources for other workloads on the platform. All of that makes Jobs a unique primitive, and Kubernetes a platform supporting diverse workloads.
