Commit 013277d (parent: 15856bf)

Add proposal doc

Signed-off-by: Siddhesh Ghadi <sghadi1203@gmail.com>

3 files changed: 89 additions, 0 deletions
# Controller Runtime Cache Transform for Secrets and ConfigMaps

This proposal outlines an approach to reduce the memory consumed by the operator’s manager pod by stripping data fields from Secrets and ConfigMaps stored in controller-runtime’s cache when they are not tracked or required by the operator.

## Problem

We recently discovered that the Argo CD Operator consumes significantly more memory on large clusters, particularly those with a high number of Secrets and ConfigMaps. For example, in a test cluster with 2,000 Secrets and 2,000 ConfigMaps spread across 100 namespaces, the operator manager pod consumed over 2 GB of memory at peak, with just one Argo CD CR instance.

Upon further investigation, we found that the primary contributor to this high memory usage is the underlying controller-runtime object cache, which the operator uses to watch resources.

### Why Does This Happen?

By default, controller-runtime caches all objects of a given type when **a watch is registered** for that type.

For example, even if we add a watch only for Secrets owned by the operator:
```go
// Watch for changes to Secret resources owned by ArgoCD instances.
// (corev1 is the conventional alias for k8s.io/api/core/v1.)
bldr.Owns(&corev1.Secret{})
```

controller-runtime will still cache **all Secrets in the cluster**, not just the operator-owned ones. This results in excessive memory usage on large clusters with many resources.

Ideally, the operator should cache only the resources it needs. To achieve this optimization, we explored the various caching options available in controller-runtime.

## Proposed Solution

At a high level, this proposal suggests:

- Use labels to identify operator-owned/tracked/required resources.
- Strip unnecessary fields from non-operator objects before storing them in the controller-runtime cache.
- Introduce a self-healing mechanism that automatically labels resources of interest that initially lack the label, ensuring they are cached in full going forward.

### Implementation

1. Cache Transform

Apply a transform on Secrets and ConfigMaps:

- For non-operator objects, strip the heavy fields (`data`, `stringData`, `binaryData`).
- For operator-tracked objects (identified by labels such as `operator.argoproj.io/tracked-by` and `argocd.argoproj.io/secret-type`), retain full content.

This reduces the memory footprint by storing only metadata for irrelevant objects.
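
A minimal sketch of such a transform, using a simplified stand-in `Secret` struct rather than the real `corev1.Secret` type (the label keys come from this proposal; in real code this function would be registered as a `cache.TransformFunc` and would type-assert against API types):

```go
package main

import "fmt"

// Secret is a stand-in for the fields of corev1.Secret that this sketch
// needs; the real transform operates on k8s.io/api/core/v1 types.
type Secret struct {
	Name       string
	Labels     map[string]string
	Data       map[string][]byte
	StringData map[string]string
}

// trackingLabels mark objects whose full content must stay in the cache.
var trackingLabels = []string{
	"operator.argoproj.io/tracked-by",
	"argocd.argoproj.io/secret-type",
}

// stripSecret runs before an object is stored in the cache: operator-tracked
// Secrets are kept whole, while untracked ones lose their heavy data fields.
func stripSecret(obj interface{}) (interface{}, error) {
	s, ok := obj.(*Secret)
	if !ok {
		return obj, nil // leave unexpected types untouched
	}
	for _, key := range trackingLabels {
		if _, tracked := s.Labels[key]; tracked {
			return s, nil // tracked: cache full content
		}
	}
	// Untracked: retain metadata only, drop the payload.
	s.Data, s.StringData = nil, nil
	return s, nil
}

func main() {
	s := &Secret{Name: "unrelated", Data: map[string][]byte{"big": make([]byte, 1<<20)}}
	stripSecret(s)
	fmt.Println(s.Data == nil) // untracked secret was stripped: true
}
```

The transform mutates only what the cache stores; live API reads are unaffected.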

2. Client Wrapper

Introduce a wrapper around the cached client:

- On `Get`, if an object looks “stripped” (heuristic: `Data == nil`, etc.) or is missing the required labels, fall back to the live client.
- After a successful live fetch, patch a tracking label onto the object so the cache retains the full object in future updates.
- Errors while patching are non-fatal: subsequent reconciles will retry or fall back to the live client again.
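
The fallback flow can be sketched as follows, again with stand-in types rather than the real controller-runtime client interfaces (`fallbackClient` and its fields are hypothetical names):

```go
package main

import "fmt"

// Secret is a stand-in for corev1.Secret; real code wraps controller-runtime's
// cached and live clients around API types.
type Secret struct {
	Name   string
	Labels map[string]string
	Data   map[string][]byte
}

type getter func(name string) (*Secret, error)

// fallbackClient reads from the cache first and falls back to a live GET
// when the cached copy looks stripped.
type fallbackClient struct {
	cached getter
	live   getter
	patch  func(*Secret) error // adds the tracking label; best effort
}

func (c *fallbackClient) Get(name string) (*Secret, error) {
	s, err := c.cached(name)
	if err == nil && s.Data != nil {
		return s, nil // full object already cached
	}
	// Heuristic says "stripped" (or cache miss): do a live lookup.
	full, liveErr := c.live(name)
	if liveErr != nil {
		return nil, liveErr
	}
	// Label the object so future cache updates retain full content.
	// Patch errors are deliberately non-fatal: the next reconcile retries.
	_ = c.patch(full)
	return full, nil
}

func main() {
	stripped := &Secret{Name: "repo-creds"} // cached copy, data removed
	full := &Secret{Name: "repo-creds", Data: map[string][]byte{"token": []byte("t")}}

	c := &fallbackClient{
		cached: func(string) (*Secret, error) { return stripped, nil },
		live:   func(string) (*Secret, error) { return full, nil },
		patch: func(s *Secret) error {
			if s.Labels == nil {
				s.Labels = map[string]string{}
			}
			s.Labels["operator.argoproj.io/tracked-by"] = "argocd"
			return nil
		},
	}
	got, _ := c.Get("repo-creds")
	fmt.Println(got.Data != nil) // fallback returned the full object: true
}
```

Note that the emptiness heuristic is what makes legitimately empty objects trigger extra live GETs, as discussed in the trade-offs below.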

3. Integration

- Wire the transforms into `main.go` via `cache.Options.ByObject`.
- Update reconcilers to use the wrapped client for transparent fallback handling.
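
The wiring might look roughly like this sketch (assuming controller-runtime v0.15+, where manager options expose `cache.Options`; `stripSecret`/`stripConfigMap` are hypothetical names for the transforms and are stubbed here):

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Assumed transforms (see "Cache Transform" above); real implementations
// strip data fields from untracked objects before caching.
func stripSecret(obj interface{}) (interface{}, error)    { return obj, nil }
func stripConfigMap(obj interface{}) (interface{}, error) { return obj, nil }

func managerOptions() ctrl.Options {
	return ctrl.Options{
		Cache: cache.Options{
			ByObject: map[client.Object]cache.ByObject{
				// Per-type transforms applied before objects enter the
				// informer cache; all other types are cached as usual.
				&corev1.Secret{}:    {Transform: stripSecret},
				&corev1.ConfigMap{}: {Transform: stripConfigMap},
			},
		},
	}
}
```

These options would then be passed to `ctrl.NewManager` in `main.go`.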

### Benefits

- **Reduced memory usage:** Only operator-relevant Secrets/ConfigMaps are cached with full data.
- **Correctness preserved:** Fallback ensures reconciles always see full objects when needed.
- **Self-healing:** Once an object is accessed, it is labeled and cached fully, avoiding repeated live GETs.

## Proof-of-Concept (PoC) Results

These metrics were collected from a test cluster containing **100 ConfigMaps** and **100 Secrets**, each approximately **1 MB** in size. The cluster was running four Argo CD instances and no other workload operators.

With the optimization enabled, operator memory usage dropped from **~350 MB** to **~100 MB**.

Unoptimized operator manager memory:

![Unoptimized Operator Manager](assets/unoptimized-manager-memory.png)

Optimized operator manager memory:

![Optimized Operator Manager](assets/optimized-manager-memory.png)

However, we could not reduce startup memory consumption, which remained at **~750 MB** in both cases.

We previously attempted another approach in [#1795](https://github.com/argoproj-labs/argocd-operator/pull/1795), but it introduced significant complexity and restricted how watches could be set up. Compared to that approach, this solution strikes a better balance between complexity, maintenance overhead, and outcome.

## Trade-offs / Risks and Mitigations

1. Extra API calls

- When a resource is stripped in the cache, the operator performs a live lookup.
- For resources the operator cares about that are missing the label, this happens only once per resource, since the label is added for future caching.
- Legitimately empty objects may trigger extra live GETs, but this is rare.

2. External resource labeling

- The operator supports referencing external ConfigMaps and Secrets in the Argo CD CR.
- These external resources will also be labelled; however, adding labels for operator tracking is a widely accepted practice within the Kubernetes community.
- Additionally, we can provide a feature flag to disable this optimization for users who prefer not to mutate external resources.

## Future scope

- Add metrics for cache hits vs. live fallbacks to measure effectiveness.
- Extend this approach to other resource types once proven stable.