Skip to content

fix(controller): sanitize k8s_request_total name label to prevent unbounded cardinality. Fixes #4571#4631

Open
pmichna wants to merge 1 commit intoargoproj:masterfrom
pmichna:fix-k8s-request-total-cardinality
Open

fix(controller): sanitize k8s_request_total name label to prevent unbounded cardinality. Fixes #4571#4631
pmichna wants to merge 1 commit intoargoproj:masterfrom
pmichna:fix-k8s-request-total-cardinality

Conversation

@pmichna
Copy link
Copy Markdown

@pmichna pmichna commented Feb 25, 2026

Addresses #4571

Motivation

The controller_clientset_k8s_request_total Prometheus CounterVec uses a name label containing actual Kubernetes resource names. Since AnalysisRun and Experiment names are unique per deployment ({rollout}-{podHash}-{revision}), every deployment permanently creates new metric time series that are never cleaned up. This causes the leader controller pod to grow from ~140Mi to 8.8GB over ~127 days.

PR #2851 deliberately kept the name label for resources with stable names (e.g. rollouts) — the maintainer explicitly wanted per-rollout observability. Only events and replicasets were sanitized because they were the top cardinality offenders at the time. The leak from analysisruns and experiments was not addressed.

Changes

  • Extend name sanitization to analysisruns and experiments in IncKubernetesRequest — these resources have ephemeral hash-based names and are the primary leak vectors. Resources with stable names (e.g. rollouts, services, configmaps) continue to preserve the actual name for per-rollout observability, matching the original intent from PR fix(controller): Remove name label from some k8s client metrics #2851.
  • Add unit tests verifying that ephemeral resource names (analysisruns, experiments) are sanitized to "N/A", while stable resource names (rollouts) are preserved.

Verification

$ go test ./controller/metrics/... -v -count=1
--- PASS: TestIncKubernetesRequest (0.10s)
--- PASS: TestIncKubernetesRequestEphemeralResourcesSanitized (0.11s)
--- PASS: TestRemove (2.11s)
... (all 14 tests pass)
PASS
ok  	github.com/argoproj/argo-rollouts/controller/metrics	4.211s

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional with a list of types and scopes found here, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed my commits with DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

@pmichna pmichna force-pushed the fix-k8s-request-total-cardinality branch 5 times, most recently from 97d8787 to d05a16b Compare February 25, 2026 12:06
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Feb 25, 2026

Published E2E Test Results

  4 files    4 suites   3h 34m 50s ⏱️
120 tests 113 ✅  7 💤 0 ❌
480 runs  452 ✅ 28 💤 0 ❌

Results for commit 5ca6304.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Feb 25, 2026

Published Unit Test Results

2 430 tests   2 430 ✅  3m 17s ⏱️
  129 suites      0 💤
    1 files        0 ❌

Results for commit 5ca6304.

♻️ This comment has been updated with latest results.

@pmichna pmichna marked this pull request as ready for review February 25, 2026 13:33
…ounded cardinality. Fixes argoproj#4571

The `controller_clientset_k8s_request_total` metric uses a `name` label
containing actual Kubernetes resource names. Since AnalysisRun names are
unique per deployment (`{rollout}-{podHash}-{revision}`), every
deployment permanently creates new metric time series that are never
cleaned up, causing the leader controller pod to grow from ~140Mi to
8.8GB over time.

Changes:
- Always set `name = "N/A"` in IncKubernetesRequest, extending the
  existing sanitization (previously only for events, replicasets, and
  List verbs) to all resources
- Add MetricK8sRequestTotal cleanup in Remove() for all resource kinds
- Add unit tests verifying name sanitization and metric cleanup

Signed-off-by: Paweł Michna <pawel@fresha.com>
@pmichna pmichna force-pushed the fix-k8s-request-total-cardinality branch from d05a16b to 5ca6304 Compare February 25, 2026 13:52
@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants