Kubernetes Jobs and CronJobs
So far in this series we have focused on long-running processes — web APIs, databases, and background services that run until they are explicitly upgraded or decommissioned. These workloads are managed by Deployments, ReplicaSets, and DaemonSets, all of which keep Pods running indefinitely.
But not every workload is long-running. Many real-world tasks are short-lived and one-off: applying a database migration, generating a monthly report, compressing old log files, or sending a batch of notification emails. If you ran these tasks as regular Pods, Kubernetes would restart them endlessly after they finish — your migration would re-apply on every exit, your report would regenerate in a loop.
The Job object solves this problem. A Job creates one or more Pods and ensures they run until successful termination (exit code 0). Once a Pod completes successfully, the Job does not restart it. If a Pod fails, the Job controller creates a replacement. This makes Jobs ideal for batch processing, one-time setup tasks, and any workload that should run to completion and then stop.
Kubernetes also provides the CronJob object, which creates a new Job on a recurring schedule — just like the Unix cron daemon, but with the full power of Kubernetes orchestration behind it.
This article covers every Job pattern you will encounter in production, from simple one-shot tasks to parallel batch processing and work-queue consumers, plus scheduled CronJobs for recurring work.
Core Concepts
Step 1: Jobs vs. Regular Pods — Why the Difference Matters
A regular Pod managed by a Deployment has a restartPolicy of Always. This means the kubelet will restart the container every time it exits, regardless of the exit code. That behavior is perfect for a web server that should stay up 24/7, but disastrous for a task like a database migration:
| Scenario | Regular Pod (restartPolicy: Always) | Job Pod (restartPolicy: OnFailure / Never) |
|---|---|---|
| Container exits with code 0 (success) | Container is restarted immediately | Pod is marked Completed — no restart |
| Container exits with code 1 (failure) | Container is restarted immediately | Container is restarted in-place (OnFailure) or a new Pod is created (Never) |
| Node crashes mid-execution | Pod rescheduled on another node | Job controller creates a new Pod on another node |
The key insight: a Job stops when the work is done. A Deployment never stops on purpose — it always tries to keep Pods running.
Step 2: The Job Object — How It Works
The Job object is a Kubernetes controller responsible for creating and managing Pods defined in its spec.template. It tracks how many Pods have completed successfully and continues creating new Pods until the desired number of completions is reached.
Here is the lifecycle of a Job:
- You submit a Job manifest to the Kubernetes API server.
- The Job controller reads the Pod template and creates one or more Pods.
- Each Pod is scheduled onto a node and runs the specified container(s).
- If a Pod succeeds (exits with code 0), the Job controller records the completion. If a Pod fails, the controller either restarts it in-place or creates a replacement, depending on the restartPolicy.
- Once the number of successful completions reaches spec.completions, the Job is marked as Complete.
- The completed Job and its Pods are retained in the cluster so you can inspect their logs. They are not automatically deleted.
Because many Jobs run from similar Pod templates, the Job controller automatically generates a unique controller-uid label and applies it to every Pod it creates, along with a job-name label you can use to select a Job's Pods. This prevents a Job from accidentally adopting Pods that belong to other Jobs or controllers.
Step 3: Job Patterns — One Shot, Parallel, and Work Queue
Jobs support three primary patterns, controlled by two fields: completions (how many Pods must succeed) and parallelism (how many Pods can run at the same time). Think of completions as how much work and parallelism as how many workers:
| Pattern | Use Case | Behavior | completions | parallelism |
|---|---|---|---|---|
| One Shot | Database migration, one-time setup | A single Pod runs once until it succeeds | 1 | 1 |
| Parallel Fixed Completions | Batch report generation, data processing | Multiple Pods run in parallel until a fixed total succeeds | 1+ | 1+ |
| Work Queue | Processing items from a message queue | Multiple Pods run in parallel; once any Pod exits 0, no new Pods are started and the Job completes when all running Pods finish | unset | 2+ |
An analogy: imagine a restaurant kitchen. A one-shot job is like a single chef cooking one special dish — once it is done, the kitchen closes. A parallel fixed completions job is like four chefs each preparing their own dish — once all dishes are ready, the kitchen closes. A work queue job is like four chefs sharing one stack of orders — they keep cooking until the order stack is empty.
Step 4: Understanding completions and parallelism
Let's break down the two key fields in detail:
- spec.completions — The total number of Pods that must terminate successfully for the Job to be considered complete. Defaults to 1. If you set completions: 10, the Job will keep creating Pods until 10 have exited with code 0.
- spec.parallelism — The maximum number of Pods that can run simultaneously. Defaults to 1. If you set parallelism: 5 and completions: 10, Kubernetes will run up to 5 Pods at a time, launching new ones as existing Pods finish, until 10 total succeed.
A practical example: suppose you need to generate 800 invoices and each Pod generates 100. You need 8 successful completions (completions: 8). You have enough cluster resources for 4 Pods at once (parallelism: 4). Kubernetes will launch 4 Pods, and as each finishes, a new one takes its place, until all 8 complete.
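In manifest form, those two fields sit side by side in the Job spec. This is only a fragment of a hypothetical invoice Job, not a complete manifest:

```yaml
# Fragment of a Job spec for the invoice example:
# 8 total successes, at most 4 Pods running at once.
spec:
  completions: 8
  parallelism: 4
```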
Step 5: Restart Policies and Failure Handling
Jobs only support two restart policies: OnFailure and Never. The Always policy is not allowed because it would conflict with the Job's run-to-completion semantics. The choice between them affects how failures are handled:
| restartPolicy | On Failure | Result | Recommendation |
|---|---|---|---|
| OnFailure | The kubelet restarts the container inside the same Pod | One Pod, multiple restarts. Clean and tidy. | Use this for most Jobs |
| Never | The Job controller creates a brand-new Pod | Multiple failed Pods accumulate. Useful for debugging. | Use when you need to inspect each failed Pod |
When a Pod fails with restartPolicy: OnFailure, the kubelet applies an exponential backoff delay (10s, 20s, 40s, ... capped at 5 minutes) before restarting the container. While it waits, the Pod reports a status of CrashLoopBackOff. This backoff prevents a crashing container from consuming node resources in a tight loop.
The spec.backoffLimit field (default: 6) sets the maximum number of retries before the Job is marked as Failed. Choose this value based on how transient you expect failures to be. For a database migration that either works or doesn't, a backoffLimit of 2–4 is reasonable. For a flaky network call, you might set it higher.
You can also use liveness probes with Jobs. If a worker Pod gets stuck (for example, a deadlock or infinite loop with no progress), the liveness probe will detect it and the kubelet will restart the container — or, with restartPolicy: Never, fail the Pod so the Job controller creates a replacement — just like it does for Deployment Pods.
Step 6: CronJobs — Scheduling Recurring Jobs
A CronJob is a higher-level object that creates a new Job on a schedule. It uses the standard Unix cron syntax:
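```
┌───────────── minute (0–59)
│ ┌─────────── hour (0–23)
│ │ ┌───────── day of month (1–31)
│ │ │ ┌─────── month (1–12)
│ │ │ │ ┌───── day of week (0–6, Sunday = 0)
│ │ │ │ │
* * * * *
```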
Common schedule examples:
| Schedule Expression | Meaning |
|---|---|
| 0 2 * * * | Every day at 2:00 AM |
| */15 * * * * | Every 15 minutes |
| 0 0 1 * * | First day of every month at midnight |
| 0 */6 * * * | Every 6 hours |
| 30 8 * * 1-5 | Weekdays at 8:30 AM |
CronJobs have two important history settings:
- successfulJobsHistoryLimit (default: 3) — How many completed Jobs to keep. Older ones are automatically deleted.
- failedJobsHistoryLimit (default: 1) — How many failed Jobs to keep for debugging.
The CronJob controller also supports a concurrencyPolicy field that controls what happens when a new Job is due but the previous one is still running:
- Allow (default) — Multiple Jobs can run concurrently.
- Forbid — Skip the new Job if the previous one is still running.
- Replace — Cancel the running Job and start a new one.
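For example, this hypothetical CronJob spec fragment skips a run whenever the previous Job is still going:

```yaml
# Fragment of a CronJob spec: never overlap runs.
spec:
  schedule: "*/15 * * * *"
  concurrencyPolicy: Forbid
```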
Step 7: Job History and Automatic Cleanup
Unlike Deployments, completed Jobs and their Pods are not automatically deleted. They remain in the cluster so you can inspect their logs and status. Over time, this can clutter your cluster with thousands of completed Job objects.
There are two ways to handle cleanup:
- TTL controller — Set spec.ttlSecondsAfterFinished on the Job. For example, ttlSecondsAfterFinished: 3600 deletes the Job and its Pods one hour after completion. This is the recommended approach for automated cleanup.
- Manual cleanup — Use kubectl delete job <name> to remove a Job and all its Pods.
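As a fragment, a self-cleaning Job spec might look like:

```yaml
spec:
  ttlSecondsAfterFinished: 3600   # delete the Job and its Pods 1 hour after it finishes
```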
For CronJobs, the successfulJobsHistoryLimit and failedJobsHistoryLimit fields handle this automatically.
Hands-On: Kubernetes Commands
Create a Job from a manifest file:
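```bash
kubectl apply -f job.yaml   # job.yaml is your Job manifest file
```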
List all Jobs in the current namespace (including completed ones):
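```bash
kubectl get jobs
```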
Describe a Job to see its status, completions, and events:
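```bash
kubectl describe job <name>
```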
List Pods created by a specific Job using label selectors:
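```bash
# The Job controller stamps every Pod it creates with a job-name label.
kubectl get pods -l job-name=<name>
```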
View the logs of a completed Job Pod:
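```bash
kubectl logs job/<name>
```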
Watch Pods in real time as a parallel Job progresses:
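```bash
kubectl get pods -l job-name=<name> --watch
```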
Delete a Job and all its Pods:
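```bash
kubectl delete job <name>
```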
Delete a Job but keep its Pods running (orphan them):
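```bash
kubectl delete job <name> --cascade=orphan
```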
Create a CronJob from a manifest file:
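```bash
kubectl apply -f cronjob.yaml   # cronjob.yaml is your CronJob manifest file
```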
List all CronJobs:
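```bash
kubectl get cronjobs
```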
Describe a CronJob to see its schedule, last run, and next trigger time:
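```bash
kubectl describe cronjob <name>
```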
Manually trigger a CronJob immediately (creates a one-off Job):
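```bash
kubectl create job <job-name> --from=cronjob/<cronjob-name>
```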
Suspend a CronJob to stop it from creating new Jobs:
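```bash
kubectl patch cronjob <name> -p '{"spec":{"suspend":true}}'
```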
Resume a suspended CronJob:
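```bash
kubectl patch cronjob <name> -p '{"spec":{"suspend":false}}'
```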
Clean up all resources labelled for this chapter:
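```bash
kubectl delete jobs,cronjobs -l chapter=jobs
```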
Step-by-Step Example
In this walkthrough you will deploy a one-shot database migration Job, a parallel batch processing Job, a work-queue consumer pattern with multiple workers, and a scheduled CronJob. Each example builds on a different Job pattern so you can see how completions, parallelism, and restartPolicy interact in practice.
Step 1: Deploy a One-Shot Database Migration Job
One-shot Jobs are the simplest pattern: one Pod, one task, run until success. This is perfect for tasks like database schema migrations. The following Job simulates running Entity Framework Core migrations for a .NET ShippingApi application. Save the manifest as schema-migrate-job.yaml:
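```yaml
# schema-migrate-job.yaml — a sketch; busybox stands in for a real migration
# image (one that would run `dotnet ef database update` for ShippingApi).
apiVersion: batch/v1
kind: Job
metadata:
  name: schema-migrate
  labels:
    chapter: jobs
spec:
  backoffLimit: 4
  template:
    metadata:
      labels:
        chapter: jobs
    spec:
      restartPolicy: OnFailure
      containers:
      - name: schema-migrate
        image: busybox
        command:
        - sh
        - -c
        - |
          echo "Applying EF Core migrations for ShippingApi..."
          sleep 5
          echo "Migration complete."
```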
Apply the manifest to create the Job:
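```bash
kubectl apply -f schema-migrate-job.yaml
```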
Key points about this manifest:
- restartPolicy: OnFailure tells the kubelet to restart the container inside the same Pod if it exits with a non-zero code. The Pod stays on the same node.
- backoffLimit: 4 means the Job will retry up to 4 times before being marked as Failed.
- Neither completions nor parallelism is set, so both default to 1 — the classic one-shot pattern.
Step 2: Inspect the Job and Its Pod
After applying the Job, check its status:
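```bash
kubectl describe job schema-migrate   # "schema-migrate" is the example Job's name
```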
You will see output similar to:
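```
Name:           schema-migrate
Namespace:      default
Completions:    1
Pods Statuses:  0 Active / 1 Succeeded / 0 Failed
```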
The Pods Statuses line confirms one Pod succeeded. Now view the Pod's logs to see the migration output:
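```bash
kubectl logs job/schema-migrate
```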
Expected output:
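Assuming the simulated migration simply echoes progress messages:

```
Applying EF Core migrations for ShippingApi...
Migration complete.
```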
The Job ran exactly once, the migration completed, and the Pod stopped. Kubernetes will not restart it. The Job object and its Pod remain in the cluster so you can inspect them. You can also see the Pod in a Completed state:
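```bash
kubectl get pods -l job-name=schema-migrate
```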
Step 3: Understand Job Failure Behavior
What happens when a Job Pod fails? The behavior depends on the restartPolicy:
Scenario A: restartPolicy: OnFailure
When the container exits with a non-zero code, the kubelet restarts it inside the same Pod. You will see the RESTARTS counter increment and eventually CrashLoopBackOff status as the kubelet applies exponential backoff delays:
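```
NAME                   READY   STATUS             RESTARTS   AGE
schema-migrate-p7p9r   0/1     CrashLoopBackOff   3          4m
```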
This is the recommended approach for most Jobs because it keeps the cluster clean — only one Pod exists, and it retries in place.
Scenario B: restartPolicy: Never
With restartPolicy: Never, the kubelet does not restart the failed container. Instead, it marks the Pod as failed. The Job controller notices and creates a brand-new Pod. If the failure persists, you end up with multiple failed Pods:
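```
NAME                   READY   STATUS   RESTARTS   AGE
schema-migrate-4h7xq   0/1     Error    0          3m
schema-migrate-9tzvw   0/1     Error    0          2m
schema-migrate-qj5mk   0/1     Error    0          1m
```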
This creates "junk" in your cluster but is useful when you need to inspect each failed Pod individually (for example, to read different error logs from each attempt). For production use, prefer restartPolicy: OnFailure to keep things tidy.
Step 4: Deploy a Parallel Batch Processing Job
Now let's scale up. Suppose your billing system needs to generate invoices in batches. You want 8 total batch runs, each generating a set of invoices, with up to 4 running simultaneously. This is the parallel fixed completions pattern.
Save the following manifest as invoice-batch-job.yaml:
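```yaml
# invoice-batch-job.yaml — a sketch; busybox simulates generating one
# batch of invoices per Pod.
apiVersion: batch/v1
kind: Job
metadata:
  name: invoice-batch
  labels:
    chapter: jobs
spec:
  completions: 8
  parallelism: 4
  template:
    metadata:
      labels:
        chapter: jobs
    spec:
      restartPolicy: OnFailure
      containers:
      - name: invoice-batch
        image: busybox
        command: ["sh", "-c", "echo 'Generating invoice batch...'; sleep 10; echo 'Batch complete.'"]
```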
Apply it:
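```bash
kubectl apply -f invoice-batch-job.yaml
```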
The Job will immediately launch 4 Pods (the parallelism limit). As each Pod completes, a new one is created until 8 total have succeeded.
Step 5: Monitor Parallel Job Progress
Use the --watch flag to see Pods come and go in real time:
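```bash
kubectl get pods -l job-name=invoice-batch --watch
```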
You will see output similar to:
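```
NAME                  READY   STATUS      RESTARTS   AGE
invoice-batch-2lk8p   1/1     Running     0          5s
invoice-batch-7xqm9   1/1     Running     0          5s
invoice-batch-c4wvt   1/1     Running     0          5s
invoice-batch-j9znd   1/1     Running     0          5s
invoice-batch-2lk8p   0/1     Completed   0          12s
invoice-batch-ft6rh   0/1     Pending     0          0s
invoice-batch-ft6rh   1/1     Running     0          2s
...
```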
Notice that exactly 4 Pods run at any time. As one completes, a new one is scheduled immediately. Once all 8 completions are done, no more Pods are created.
Check the final Job status:
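```bash
kubectl get job invoice-batch
```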
All 8 batches completed successfully. Clean up before the next step:
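```bash
kubectl delete job invoice-batch
```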
Step 6: Set Up a Work Queue Infrastructure
The work queue pattern is different from fixed completions. Instead of each Pod doing independent work, all Pods pull items from a shared queue. The Job completes when the queue is empty and a worker exits successfully.
In this example, we will deploy a Redis server as the work queue and then launch consumer workers. First, create a ReplicaSet to manage the Redis queue server. Using a ReplicaSet (rather than a bare Pod) ensures the server is automatically restarted if the node fails. Save this as task-queue-rs.yaml:
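```yaml
# task-queue-rs.yaml — a sketch of a single-replica Redis queue server.
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: task-queue
  labels:
    chapter: jobs
spec:
  replicas: 1
  selector:
    matchLabels:
      app: task-queue
  template:
    metadata:
      labels:
        app: task-queue
        chapter: jobs
    spec:
      containers:
      - name: redis
        image: redis
        ports:
        - containerPort: 6379
```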
Apply it:
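```bash
kubectl apply -f task-queue-rs.yaml
```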
Next, expose the Redis server with a Service so that worker Pods can find it via DNS. Save this as task-queue-service.yaml:
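```yaml
# task-queue-service.yaml — a sketch; workers reach Redis at task-queue:6379.
apiVersion: v1
kind: Service
metadata:
  name: task-queue
  labels:
    chapter: jobs
spec:
  selector:
    app: task-queue
  ports:
  - port: 6379
    targetPort: 6379
```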
Apply it:
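```bash
kubectl apply -f task-queue-service.yaml
```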
Verify the queue server is running:
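```bash
kubectl get pods -l app=task-queue
```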
Step 7: Create Consumer Workers for the Queue
Now deploy a Job that runs 5 worker Pods in parallel. Each worker simulates pulling items from the queue and processing them. In a real application, these workers would connect to Redis and dequeue work items. Save this as queue-processor-job.yaml:
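```yaml
# queue-processor-job.yaml — a sketch; each busybox worker simulates
# dequeuing and processing items (a real worker would talk to Redis).
apiVersion: batch/v1
kind: Job
metadata:
  name: queue-processor
  labels:
    chapter: jobs
spec:
  parallelism: 5   # completions is deliberately unset: work-queue mode
  template:
    metadata:
      labels:
        chapter: jobs
    spec:
      restartPolicy: OnFailure
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo 'Processing items from task-queue...'; sleep 15; echo 'Queue drained, exiting.'"]
```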
Apply it:
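```bash
kubectl apply -f queue-processor-job.yaml
```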
Notice that completions is not set. When you omit completions and set parallelism to more than 1, Kubernetes enters work queue mode: once any single Pod exits with code 0, the Job begins winding down. No new Pods are started, and the Job waits for all running Pods to finish.
Watch the workers process their items:
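```bash
kubectl get pods -l job-name=queue-processor --watch
```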
All 5 workers run in parallel, process their items, and exit. Once all are done, the Job is complete. Check the logs of any worker to see its output:
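```bash
kubectl logs <worker-pod-name>   # substitute one of the worker Pod names
```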
Step 8: Schedule a CronJob for Recurring Cleanup
Finally, let's set up a CronJob that runs a cleanup task every day at 2:00 AM. This is useful for tasks like purging old reports, archiving logs, or compacting database indexes. Save the following as report-cleanup-cronjob.yaml:
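```yaml
# report-cleanup-cronjob.yaml — a sketch; busybox simulates purging old reports.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: report-cleanup
  labels:
    chapter: jobs
spec:
  schedule: "0 2 * * *"          # every day at 2:00 AM
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      labels:
        chapter: jobs
    spec:
      template:
        metadata:
          labels:
            chapter: jobs
        spec:
          restartPolicy: OnFailure
          containers:
          - name: report-cleanup
            image: busybox
            command: ["sh", "-c", "echo 'Purging old reports...'; sleep 5; echo 'Cleanup done.'"]
```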
Apply it:
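```bash
kubectl apply -f report-cleanup-cronjob.yaml
```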
Verify the CronJob was created:
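```bash
kubectl get cronjobs
```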
The CronJob will not run until 2:00 AM. To test it immediately, you can manually create a Job from the CronJob template:
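```bash
kubectl create job report-cleanup-manual --from=cronjob/report-cleanup
```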
Watch the manually triggered Job run:
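```bash
# "report-cleanup-manual" is the name given to the manually created Job.
kubectl get pods -l job-name=report-cleanup-manual --watch
```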
Key CronJob details to note:
- successfulJobsHistoryLimit: 3 keeps only the last 3 successful Job runs. Older completed Jobs are automatically deleted.
- failedJobsHistoryLimit: 1 keeps only the most recent failed Job for debugging.
- You can suspend a CronJob at any time to stop it from creating new Jobs, and resume it later. This is useful during maintenance windows.
Step 9: Clean Up All Resources
All resources in this walkthrough were labelled with chapter: jobs. Use this label to clean everything up in one command:
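```bash
kubectl delete jobs,cronjobs,replicasets,services -l chapter=jobs
```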
Also delete the manually triggered Job:
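```bash
kubectl delete job report-cleanup-manual   # the name used when triggering manually
```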
Summary
- A Job creates one or more Pods that run until successful termination. Unlike Deployments, a Job does not restart Pods that exit successfully.
- The one-shot pattern (completions: 1, parallelism: 1) is perfect for one-time tasks like database migrations.
- The parallel fixed completions pattern processes batch work by running multiple Pods simultaneously until a target number of completions is reached.
- The work queue pattern runs multiple parallel workers that pull from a shared queue. The Job completes when a worker exits successfully, signalling the queue is empty.
- Use restartPolicy: OnFailure for most Jobs to keep the cluster clean. Use restartPolicy: Never only when you need to inspect each failed Pod.
- backoffLimit controls how many retries a Job attempts before being marked as Failed.
- A CronJob creates a new Job on a cron schedule. Use successfulJobsHistoryLimit and failedJobsHistoryLimit to control how many old Jobs are retained.
- Use ttlSecondsAfterFinished on standalone Jobs to automatically clean them up after completion.