Summary
reana-job-controller creates Kubernetes Jobs for each workflow step but never sets ttlSecondsAfterFinished on them. Completed and failed Jobs accumulate indefinitely, with no automatic cleanup.
Observed behaviour
On our deployment, after normal operation we observed:
- 246 completed (succeeded) Jobs sitting in the namespace
- 125 failed Jobs sitting in the namespace
- 970 total Jobs in the reana namespace
These counts grow without bound as workflows run.
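The counts above can be reproduced with the kubernetes Python client (a minimal sketch; assumes kubeconfig or in-cluster credentials with access to the reana namespace):

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
batch = client.BatchV1Api()

succeeded = failed = 0
for job in batch.list_namespaced_job("reana").items:
    if (job.status.succeeded or 0) > 0:
        succeeded += 1
    elif (job.status.failed or 0) > 0:
        failed += 1

print(f"succeeded={succeeded} failed={failed}")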
Root cause
In reana_job_controller/kubernetes_job_manager.py, the V1Job spec is constructed without setting ttl_seconds_after_finished. The built-in Kubernetes TTL-after-finished controller would garbage-collect finished Jobs automatically if this field were set, but since it is not, nothing ever removes them.
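For reference, this is roughly where the field lives in the Python client's Job objects (a minimal sketch with placeholder names, not the actual reana-job-controller construction code):

from kubernetes import client

# Minimal Job sketch; the real spec carries many more fields (env, volumes, resources, ...)
job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="example-step"),
    spec=client.V1JobSpec(
        ttl_seconds_after_finished=604800,  # TTL controller deletes the Job 7 days after it finishes
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(name="step", image="busybox")],
            )
        ),
    ),
)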
Impact
- etcd bloat: Every Job object (plus its pod) is stored in etcd. At scale this creates memory and API server pressure.
- Namespace clutter: kubectl get jobs -n reana becomes unreadable with hundreds of stale entries.
- Pod accumulation: Completed Job pods also persist, consuming entries in the kubelet's pod cache and in kubectl get pods output.
Suggested fix
Set ttl_seconds_after_finished when constructing the Job spec in kubernetes_job_manager.py, configurable via an environment variable (consistent with the existing REANA_KUBERNETES_JOBS_TIMEOUT_LIMIT pattern):
REANA_KUBERNETES_JOBS_TTL_SECONDS_AFTER_FINISHED = int(
    os.getenv("REANA_KUBERNETES_JOBS_TTL_SECONDS_AFTER_FINISHED", 604800)  # 7 days default
)

# when building V1JobSpec:
ttl_seconds_after_finished=REANA_KUBERNETES_JOBS_TTL_SECONDS_AFTER_FINISHED,
A 7-day default gives operators enough time to inspect failed jobs for debugging while still reclaiming resources automatically.
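Because the Job owns its pods, the TTL controller's cascading deletion of the Job also removes the step pods, which covers the pod accumulation noted above. A quick way to confirm that newly created Jobs carry the field (sketch, same client assumptions as above):

from kubernetes import client, config

config.load_kube_config()
batch = client.BatchV1Api()
for job in batch.list_namespaced_job("reana").items:
    print(job.metadata.name, job.spec.ttl_seconds_after_finished)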
Workaround
Until this is fixed upstream, operators can run the following cleanup commands periodically, e.g. from a CronJob:
# Delete completed jobs
kubectl delete jobs -n reana --field-selector=status.successful=1 --ignore-not-found=true
# Delete failed jobs
kubectl get jobs -n reana -o json | jq -r '
.items[] |
select((.status.active // 0) == 0) |
select((.status.succeeded // 0) == 0) |
select((.status.failed // 0) > 0) |
.metadata.name
' | xargs -r kubectl delete job -n reana --ignore-not-found=true
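A rough Python equivalent of the two commands above, for operators who prefer a single script (sketch: deletes any Job in the reana namespace that is no longer active and has succeeded or failed, using background cascading deletion so the Job's pods are removed as well):

from kubernetes import client, config

config.load_kube_config()
batch = client.BatchV1Api()

for job in batch.list_namespaced_job("reana").items:
    status = job.status
    active = status.active or 0
    finished = (status.succeeded or 0) > 0 or (status.failed or 0) > 0
    if finished and active == 0:
        batch.delete_namespaced_job(
            name=job.metadata.name,
            namespace="reana",
            propagation_policy="Background",  # also delete the Job's pods
        )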
Environment
- REANA version: 0.9.5-alpha.1 (reana-server), 0.9.4-alpha.1 (reana-job-controller)
- Kubernetes: v1.33.10