# Controller Runtime Cache Transform for Secrets and ConfigMaps

This proposal outlines an approach to reduce the memory consumed by the operator’s manager pod by stripping data fields from Secrets and ConfigMaps stored in controller-runtime’s cache when the operator does not track or require them.

## Problem

We recently discovered that the Argo CD Operator consumes significantly more memory on large clusters, particularly those with a high number of Secrets and ConfigMaps. For example, in a test cluster with 2,000 Secrets and 2,000 ConfigMaps spread across 100 namespaces, the operator manager pod consumed over 2 GB of memory at peak, with just one Argo CD CR instance.

Upon further investigation, we found that the primary contributor to this high memory usage is the underlying controller-runtime object cache, which the operator uses to watch resources.

### Why Does This Happen?

By default, controller-runtime caches all objects of a given type when **a watch is registered** for that type.

For example, even if we add a watch only for Secrets owned by the operator:
```go
// Watch for changes to Secret sub-resources owned by ArgoCD instances.
bldr.Owns(&corev1.Secret{})
```
controller-runtime will still cache **all Secrets in the cluster**, not just the operator-owned ones. This results in excessive memory usage on large clusters with many resources.

Ideally, the operator should cache only the resources it needs. To achieve this optimization, we explored the various caching options available in controller-runtime.

## Proposed Solution

At a high level, this proposal suggests:
- Use labels to identify resources the operator owns, tracks, or requires.
- Strip unnecessary fields from non-operator objects before storing them in the controller-runtime cache.
- Introduce a self-healing mechanism that automatically labels resources of interest that initially lack the label, ensuring they are cached in full going forward.

### Implementation

1. Cache Transform
   Apply a transform on Secrets and ConfigMaps:
   - For non-operator objects, strip heavy fields (`data`, `stringData`, `binaryData`).
   - For operator-tracked objects (identified by labels like `operator.argoproj.io/tracked-by`, `argocd.argoproj.io/secret-type`), retain full content.

   This reduces the memory footprint by storing only metadata for irrelevant objects.

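The transform above can be sketched as follows. To keep the example self-contained, it uses a simplified stand-in struct in place of `corev1.Secret` (a real implementation would type-switch on the actual Kubernetes types and handle ConfigMaps analogously); the exact set of tracking label keys beyond those named in this proposal is an assumption.

```go
package main

import "fmt"

// Secret is a simplified stand-in for corev1.Secret, carrying only the
// fields the transform touches.
type Secret struct {
	Labels     map[string]string
	Data       map[string][]byte
	StringData map[string]string
}

// trackingLabels marks a Secret as operator-relevant. The keys below come
// from this proposal; the final list is an implementation decision.
var trackingLabels = []string{
	"operator.argoproj.io/tracked-by",
	"argocd.argoproj.io/secret-type",
}

// stripSecretData is a cache TransformFunc sketch: it clears the heavy
// payload fields on Secrets that carry none of the tracking labels, and
// returns tracked Secrets unchanged.
func stripSecretData(obj interface{}) (interface{}, error) {
	s, ok := obj.(*Secret)
	if !ok {
		return obj, nil // not a Secret; cache it as-is
	}
	for _, key := range trackingLabels {
		if _, tracked := s.Labels[key]; tracked {
			return s, nil // operator-relevant: keep full content
		}
	}
	s.Data = nil
	s.StringData = nil
	return s, nil
}

func main() {
	untracked := &Secret{Data: map[string][]byte{"password": []byte("hunter2")}}
	tracked := &Secret{
		Labels: map[string]string{"argocd.argoproj.io/secret-type": "cluster"},
		Data:   map[string][]byte{"config": []byte("...")},
	}
	stripSecretData(untracked)
	stripSecretData(tracked)
	fmt.Println(untracked.Data == nil, tracked.Data != nil) // prints: true true
}
```

Because the transform runs before the object is stored, the stripped payload never occupies cache memory at all, rather than being evicted later.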
2. Client Wrapper
   Introduce a wrapper around the cached client:
   - On `Get`, if an object looks “stripped” (heuristic: `Data == nil`, etc.) or is missing the required labels, fall back to the live client.
   - After a successful live fetch, patch a tracking label so the cache retains the full object on future updates.
   - Errors while patching are non-fatal: subsequent reconciles will retry or fall back to the live client again.

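The wrapper logic can be sketched with a minimal `Getter` interface standing in for the relevant subset of controller-runtime's `client.Client` (the real wrapper would operate on `client.Object` and issue a `Patch` for the label); all type and function names here are illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// Secret is a simplified stand-in for corev1.Secret.
type Secret struct {
	Labels map[string]string
	Data   map[string][]byte
}

// Getter abstracts "fetch a Secret by name": one implementation backed by
// the stripped cache, one by the live API server.
type Getter interface {
	Get(name string) (*Secret, error)
}

const trackingLabel = "operator.argoproj.io/tracked-by"

// fallbackClient reads from the cache first and falls back to the live
// client when the cached copy looks stripped (Data == nil) or lacks the
// tracking label. After a live fetch it patches the label so the object is
// cached in full from then on (the self-healing step).
type fallbackClient struct {
	cache Getter
	live  Getter
	label func(*Secret) error // patches the tracking label; errors are non-fatal
}

func (c *fallbackClient) Get(name string) (*Secret, error) {
	s, err := c.cache.Get(name)
	if err == nil && s.Data != nil {
		if _, ok := s.Labels[trackingLabel]; ok {
			return s, nil // full, labeled object straight from the cache
		}
	}
	// Cached copy is stripped, unlabeled, or missing: go to the API server.
	s, err = c.live.Get(name)
	if err != nil {
		return nil, err
	}
	if patchErr := c.label(s); patchErr != nil {
		// Non-fatal: the next reconcile simply falls back to live again.
		fmt.Println("label patch failed:", patchErr)
	}
	return s, nil
}

// mapGetter is a trivial in-memory Getter for demonstration.
type mapGetter map[string]*Secret

func (m mapGetter) Get(name string) (*Secret, error) {
	s, ok := m[name]
	if !ok {
		return nil, errors.New("not found: " + name)
	}
	return s, nil
}

func main() {
	cached := mapGetter{"argocd-secret": {Data: nil}} // stripped cache copy
	live := mapGetter{"argocd-secret": {
		Labels: map[string]string{},
		Data:   map[string][]byte{"admin.password": []byte("x")},
	}}
	c := &fallbackClient{cache: cached, live: live, label: func(s *Secret) error {
		s.Labels[trackingLabel] = "argocd"
		return nil
	}}
	s, _ := c.Get("argocd-secret")
	fmt.Println(len(s.Data) > 0, s.Labels[trackingLabel]) // prints: true argocd
}
```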
3. Integration
   - Wire the transforms into `main.go` via `cache.Options.ByObject`.
   - Update reconcilers to use the wrapped client for transparent fallback handling.

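The wiring step could look roughly like the following in `main.go`, using controller-runtime's per-object cache options (available in recent controller-runtime releases); `stripSecretData` and `stripConfigMapData` are hypothetical names for the transform functions from step 1.

```go
import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// In main(): pass per-object cache transforms to the manager so Secrets
// and ConfigMaps are stripped before they ever enter the cache.
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	Cache: cache.Options{
		ByObject: map[client.Object]cache.ByObject{
			&corev1.Secret{}:    {Transform: stripSecretData},
			&corev1.ConfigMap{}: {Transform: stripConfigMapData},
		},
	},
})
```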
### Benefits

- **Reduced memory usage:** Only operator-relevant Secrets/ConfigMaps are cached with full data.
- **Correctness preserved:** Fallback ensures reconciles always see full objects when needed.
- **Self-healing:** Once an object is accessed, it is labeled and cached fully, avoiding repeated live GETs.

## Proof-of-Concept (PoC) Results

These metrics were collected from a test cluster containing **100 ConfigMaps** and **100 Secrets**, each approximately **1 MB** in size. The cluster was running four Argo CD instances and no other workload operators.

With the optimization enabled, operator memory usage dropped from **~350 MB** to **~100 MB**.
Unoptimized operator manager memory:

(screenshot omitted)

Optimized operator manager memory:

(screenshot omitted)

However, we could not reduce the startup memory consumption, which remained at **~750 MB** in both cases.

We previously attempted another approach in [#1795](https://github.com/argoproj-labs/argocd-operator/pull/1795), but it introduced significant complexity and restricted how watches could be set up. Compared to that, this solution provides a better balance between complexity, maintenance overhead, and outcome.

## Trade-offs / Risks and Mitigations

1. Extra API calls
   - When a resource is stripped in the cache, the operator performs a live lookup.
   - For resources that the operator cares about but that are missing the label, this happens only once per resource, since the label is added for future caching.
   - Legitimately empty objects may trigger extra live GETs, but this is rare.

2. External resource labeling
   - The operator supports referencing external ConfigMaps and Secrets in the Argo CD CR.
   - These resources will be labeled. However, adding labels for operator tracking is a widely accepted practice within the Kubernetes community.
   - Additionally, we can provide a feature flag to disable this optimization for users who prefer not to mutate external resources.

## Future scope

- Add metrics for cache hits vs. live fallbacks to measure effectiveness.
- Extend this approach to other resource types once proven stable.