reana-job-controller: completed/failed Kubernetes Jobs are never cleaned up (no ttlSecondsAfterFinished) #944

@kratsg

Summary

reana-job-controller creates Kubernetes Jobs for each workflow step but never sets ttlSecondsAfterFinished on them. Completed and failed Jobs accumulate indefinitely, with no automatic cleanup.

Observed behaviour

On our deployment, after normal operation we observed:

  • 246 completed (succeeded) Jobs sitting in the namespace
  • 125 failed Jobs sitting in the namespace
  • 970 total Jobs in the reana namespace

The count grows without bound as more workflows run.

Root cause

In reana_job_controller/kubernetes_job_manager.py, the V1Job spec is constructed without setting ttl_seconds_after_finished. Kubernetes's built-in TTL controller would automatically garbage-collect completed Jobs if this field were set, but since it is not, nothing removes them.
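For reference, the field lives directly under the Job's spec. A minimal sketch of the serialized manifest (job name and container values are illustrative, not taken from the job-controller):

```python
# Illustrative batch/v1 Job manifest. The Python kubernetes client's
# V1JobSpec exposes the same field as ttl_seconds_after_finished.
job_manifest = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "reana-run-job-example"},  # hypothetical name
    "spec": {
        # The TTL-after-finished controller deletes the Job (and its pods)
        # this many seconds after it reaches Complete or Failed.
        "ttlSecondsAfterFinished": 604800,  # 7 days
        "template": {
            "spec": {
                "containers": [{"name": "step", "image": "busybox"}],
                "restartPolicy": "Never",
            },
        },
    },
}
```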

Impact

  • etcd bloat: Every Job object (plus its pod) is stored in etcd. At scale this creates memory and API server pressure.
  • Namespace clutter: kubectl get jobs -n reana becomes unreadable with hundreds of stale entries.
  • Pod accumulation: Completed Job pods also persist, consuming entries in the kubelet's pod cache and in kubectl get pods output.

Suggested fix

Set ttl_seconds_after_finished when constructing the Job spec in kubernetes_job_manager.py, configurable via an environment variable (consistent with the existing REANA_KUBERNETES_JOBS_TIMEOUT_LIMIT pattern):

REANA_KUBERNETES_JOBS_TTL_SECONDS_AFTER_FINISHED = int(
    os.getenv("REANA_KUBERNETES_JOBS_TTL_SECONDS_AFTER_FINISHED", 604800)  # 7 days default
)

# when building V1JobSpec:
ttl_seconds_after_finished=REANA_KUBERNETES_JOBS_TTL_SECONDS_AFTER_FINISHED,

A 7-day default gives operators enough time to inspect failed jobs for debugging while still reclaiming resources automatically.
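A hedged sketch of the configuration read (variable name as proposed above; falling back to the default on a malformed value is an assumption about desired behaviour, not existing reana code):

```python
import os

SEVEN_DAYS = 7 * 24 * 60 * 60  # 604800 seconds

def jobs_ttl_seconds(default: int = SEVEN_DAYS) -> int:
    """Read the proposed TTL env var, falling back to the 7-day default."""
    raw = os.getenv("REANA_KUBERNETES_JOBS_TTL_SECONDS_AFTER_FINISHED", "")
    try:
        return int(raw)
    except ValueError:
        # Unset or non-numeric value: keep the default rather than crash.
        return default
```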

Workaround

Until this is fixed upstream, operators can run a periodic cleanup CronJob:

# Delete completed jobs
kubectl delete jobs -n reana --field-selector=status.successful=1 --ignore-not-found=true

# Delete failed jobs
kubectl get jobs -n reana -o json | jq -r '
  .items[] |
  select((.status.active // 0) == 0) |
  select((.status.succeeded // 0) == 0) |
  select((.status.failed // 0) > 0) |
  .metadata.name
' | xargs -r kubectl delete job -n reana --ignore-not-found=true
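The succeeded-Jobs half of the cleanup above can be scheduled as an in-cluster CronJob; a sketch, assuming a `reana-job-cleanup` service account with RBAC to list and delete Jobs in the `reana` namespace (names, schedule, and image are illustrative; the failed-Jobs half needs an image that also ships jq):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: reana-job-cleanup
  namespace: reana
spec:
  schedule: "0 3 * * *"  # daily at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: reana-job-cleanup  # needs list/delete on Jobs
          restartPolicy: OnFailure
          containers:
            - name: cleanup
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - >
                  kubectl delete jobs -n reana
                  --field-selector=status.successful=1
                  --ignore-not-found=true
```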

Environment

  • REANA version: 0.9.5-alpha.1 (reana-server), 0.9.4-alpha.1 (reana-job-controller)
  • Kubernetes: v1.33.10
