fix(controller): sanitize k8s_request_total name label to prevent unbounded cardinality. Fixes #4571 #4631
Open
pmichna wants to merge 1 commit into argoproj:master from
Contributor
Published E2E Test Results: 4 files, 4 suites, 3h 34m 50s ⏱️. Results for commit 5ca6304. ♻️ This comment has been updated with latest results.
Contributor
Published Unit Test Results: 2 430 tests, 2 430 ✅, 3m 17s ⏱️. Results for commit 5ca6304. ♻️ This comment has been updated with latest results.
…ounded cardinality. Fixes argoproj#4571

The `controller_clientset_k8s_request_total` metric uses a `name` label containing actual Kubernetes resource names. Since AnalysisRun names are unique per deployment (`{rollout}-{podHash}-{revision}`), every deployment permanently creates new metric time series that are never cleaned up, causing the leader controller pod to grow from ~140Mi to 8.8GB over time.

Changes:
- Always set `name = "N/A"` in IncKubernetesRequest, extending the existing sanitization (previously only for events, replicasets, and List verbs) to all resources
- Add MetricK8sRequestTotal cleanup in Remove() for all resource kinds
- Add unit tests verifying name sanitization and metric cleanup

Signed-off-by: Paweł Michna <pawel@fresha.com>
blkperl
approved these changes
Mar 18, 2026



Addresses #4571
Motivation
The `controller_clientset_k8s_request_total` Prometheus CounterVec uses a `name` label containing actual Kubernetes resource names. Since AnalysisRun and Experiment names are unique per deployment (`{rollout}-{podHash}-{revision}`), every deployment permanently creates new metric time series that are never cleaned up. This causes the leader controller pod to grow from ~140Mi to 8.8GB over ~127 days.

PR #2851 deliberately kept the `name` label for resources with stable names (e.g. `rollouts`) because the maintainer explicitly wanted per-rollout observability. Only `events` and `replicasets` were sanitized because they were the top cardinality offenders at the time. The leak from `analysisruns` and `experiments` was not addressed.

Changes
- Extend `name` sanitization to `analysisruns` and `experiments` in `IncKubernetesRequest`: these resources have ephemeral hash-based names and are the primary leak vectors. Resources with stable names (e.g. `rollouts`, `services`, `configmaps`) continue to preserve the actual `name` for per-rollout observability, matching the original intent of PR #2851 (fix(controller): Remove name label from some k8s client metrics).
- Ephemeral resource names (`analysisruns`, `experiments`) are sanitized to `"N/A"`, while stable resource names (`rollouts`) are preserved.

Verification
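The sanitization rule described above can be illustrated with a small standalone Go sketch. It is not the controller's actual code: `sanitizeName`, `ephemeralKinds`, `countSeries`, the `"List"`/`"Get"` verb strings, and the `web` rollout name are all hypothetical names chosen for illustration. It uses a plain map in place of a Prometheus CounterVec to count how many distinct label combinations accumulate across repeated deployments.

```go
package main

import "fmt"

// ephemeralKinds is a hypothetical lookup table for this sketch. The kinds in
// it are the ones the PR text says are sanitized.
var ephemeralKinds = map[string]bool{
	"events":       true,
	"replicasets":  true,
	"analysisruns": true,
	"experiments":  true,
}

// sanitizeName is a hypothetical helper, not the controller's real function.
// It returns "N/A" for List calls and for ephemeral kinds, and otherwise
// keeps the real resource name.
func sanitizeName(kind, verb, name string) string {
	if verb == "List" || ephemeralKinds[kind] {
		return "N/A"
	}
	return name
}

// countSeries simulates a number of deployments of a single Rollout and
// returns how many distinct (kind, name) label pairs would be recorded.
func countSeries(deployments int, sanitize bool) int {
	series := map[string]struct{}{}
	for rev := 1; rev <= deployments; rev++ {
		// AnalysisRun names follow {rollout}-{podHash}-{revision}; the hash
		// part here is a made-up placeholder.
		arName := fmt.Sprintf("web-hash%d-%d", rev, rev)
		requests := [][2]string{
			{"analysisruns", arName},
			{"rollouts", "web"},
		}
		for _, r := range requests {
			name := r[1]
			if sanitize {
				name = sanitizeName(r[0], "Get", name)
			}
			series[r[0]+"/"+name] = struct{}{}
		}
	}
	return len(series)
}

func main() {
	fmt.Println("unsanitized series:", countSeries(100, false))
	fmt.Println("sanitized series:", countSeries(100, true))
}
```

Under these assumptions, 100 simulated deployments produce 101 distinct label pairs without sanitization and 2 with it, while the `rollouts` entry keeps its real name in both cases.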