BE-1454 - Split CloudWatch rate limits per account region #13
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
No actionable comments were generated in the recent review. 🎉
Walkthrough
Refactors CloudWatch rate limiting to per-account-and-region limiter buckets.

Changes
Account-Scoped Rate Limiting
```mermaid
sequenceDiagram
    participant Client as Client
    participant SRLC as SimpleRateLimitedClient
    participant GL as GlobalRateLimiter
    participant RB as rateLimiterBuckets
    participant RL as rate.Limiter
    participant P as Prometheus
    Client->>SRLC: API call (ListMetrics/GetMetricData/GetMetricStatistics) [region, accountID, role]
    SRLC->>GL: limit(op, accountID, region, role)
    GL->>RB: get(accountID, region)
    RB->>RL: create or return limiter
    SRLC->>RL: Allow()/Wait()
    RL-->>SRLC: permit/grant
    SRLC->>Client: forward call to underlying client
    SRLC->>P: increment allowed/wait counters (labels include account_id, region, role)
```
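The `rateLimiterBuckets` lookup in the diagram is the heart of the change. A minimal sketch of per-(account, region) buckets might look like the following, assuming `golang.org/x/time/rate`; the names `GlobalRateLimiter`, `limiterKey`, and `bucket` are illustrative, not necessarily the PR's exact types.

```go
// A minimal sketch of per-(account, region) limiter buckets, assuming
// golang.org/x/time/rate. Names are illustrative, not the PR's exact types.
package cloudwatch

import (
	"sync"

	"golang.org/x/time/rate"
)

type limiterKey struct {
	accountID string
	region    string
}

type GlobalRateLimiter struct {
	mu      sync.Mutex
	limit   rate.Limit // per-API request rate applied to every bucket
	burst   int
	buckets map[limiterKey]*rate.Limiter
}

// bucket lazily creates one rate.Limiter per (accountID, region) pair, so
// throttling in one account/region cannot starve another.
func (g *GlobalRateLimiter) bucket(accountID, region string) *rate.Limiter {
	g.mu.Lock()
	defer g.mu.Unlock()
	key := limiterKey{accountID: accountID, region: region}
	if l, ok := g.buckets[key]; ok {
		return l
	}
	l := rate.NewLimiter(g.limit, g.burst)
	g.buckets[key] = l
	return l
}
```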
🎯 Review complexity: 4 (Complex) | ⏱️ Estimated effort: ~45 minutes
🚥 Pre-merge checks: ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
pkg/clients/v2/factory.go (1)
170-190: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

**Apply the rate-limit wrapper after cache lookup, not before caching.**

Line 177 returns the cached client before the account-aware wrapper is applied, so a warmed cache bypasses rate limiting, and the first cold miss permanently bakes one wrapper/account into `cachedClients.cloudwatch`. Keep the cached value as the raw CloudWatch client and build the rate-limited wrapper per `GetCloudwatchClient(...)` call.

Suggested fix

```diff
 func (c *CachingFactory) GetCloudwatchClient(region string, accountID string, role model.Role, concurrency cloudwatch_client.ConcurrencyConfig, globalRateLimiter *cloudwatch_client.GlobalRateLimiter) cloudwatch_client.Client {
 	if !c.refreshed.Load() {
 		// if we have not refreshed then we need to lock in case we are accessing concurrently
 		c.mu.Lock()
 		defer c.mu.Unlock()
 	}
-	if cachedClient := c.clients[role][region].cloudwatch; cachedClient != nil {
-		return cloudwatch_client.NewLimitedConcurrencyClient(cachedClient, concurrency.NewLimiter())
-	}
+	baseClient := c.clients[role][region].cloudwatch
+	if baseClient == nil {
+		baseClient = cloudwatch_v2.NewClient(c.logger, c.createCloudwatchClient(c.clients[role][region].awsConfig))
+		c.clients[role][region].cloudwatch = baseClient
+	}
-	// Create new client and cache it
-	client := cloudwatch_v2.NewClient(c.logger, c.createCloudwatchClient(c.clients[role][region].awsConfig))
-	// Apply global rate limiter if provided
+	client := baseClient
 	if globalRateLimiter != nil {
 		client = cloudwatch_client.NewRateLimitedClient(client, globalRateLimiter, region, accountID, role.RoleArn)
 	}
-	c.clients[role][region].cloudwatch = client
 	return cloudwatch_client.NewLimitedConcurrencyClient(client, concurrency.NewLimiter())
 }
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/clients/v2/factory.go` around lines 170 - 190, GetCloudwatchClient currently returns or caches a wrapped client which bakes a specific account-aware rate limiter into c.clients[role][region].cloudwatch; change it so the cache stores only the raw cloudwatch_v2 client (c.clients[role][region].cloudwatch should be the unwrapped client), and always apply the account-aware rate-limit wrapper (cloudwatch_client.NewRateLimitedClient) and the concurrency wrapper (cloudwatch_client.NewLimitedConcurrencyClient) per call after retrieving the cached raw client. Concretely, in GetCloudwatchClient: when reading c.clients[role][region].cloudwatch treat it as the raw client and wrap it with NewRateLimitedClient (if globalRateLimiter != nil) and then NewLimitedConcurrencyClient before returning, and when creating and storing a new client use cloudwatch_v2.NewClient(...) to cache the raw client (do not cache a rate-limited client).

pkg/clients/v1/factory.go (1)
290-309: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

**Don't cache the account-aware wrapper here.**

Line 298 returns the cached client before the new wrapper is applied, so `accountID` and `globalRateLimiter` only matter on a cold miss. That means `Refresh()` hot paths skip rate limiting entirely, and cold misses lock the first wrapper state into the cache. Cache the raw CloudWatch client, then apply `NewRateLimitedClient(...)` per call before the concurrency wrapper.

Suggested fix

```diff
 func (c *CachingFactory) GetCloudwatchClient(region string, accountID string, role model.Role, concurrency cloudwatch_client.ConcurrencyConfig, globalRateLimiter *cloudwatch_client.GlobalRateLimiter) cloudwatch_client.Client {
 	if !c.refreshed.Load() {
 		// if we have not refreshed then we need to lock in case we are accessing concurrently
 		c.mu.Lock()
 		defer c.mu.Unlock()
 	}
-	if cachedClient := c.clients[role][region].cloudwatch; cachedClient != nil {
-		return cloudwatch_client.NewLimitedConcurrencyClient(cachedClient, concurrency.NewLimiter())
-	}
-	client := createCloudWatchClient(c.logger, c.session, &region, role, c.fips)
+	baseClient := c.clients[role][region].cloudwatch
+	if baseClient == nil {
+		baseClient = createCloudWatchClient(c.logger, c.session, &region, role, c.fips)
+		c.clients[role][region].cloudwatch = baseClient
+	}
 	// Apply global rate limiter if provided
+	client := baseClient
 	if globalRateLimiter != nil {
 		client = cloudwatch_client.NewRateLimitedClient(client, globalRateLimiter, region, accountID, role.RoleArn)
 	}
-	c.clients[role][region].cloudwatch = client
 	return cloudwatch_client.NewLimitedConcurrencyClient(client, concurrency.NewLimiter())
 }
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/clients/v1/factory.go` around lines 290 - 309, GetCloudwatchClient is caching the already-wrapped client so globalRateLimiter/accountID are only applied on cold misses; instead store the raw CloudWatch client in c.clients[role][region].cloudwatch (the value set by createCloudWatchClient) and on every call, wrap that cached raw client with cloudwatch_client.NewRateLimitedClient(...) when globalRateLimiter != nil, then always return cloudwatch_client.NewLimitedConcurrencyClient(...) around the per-call (possibly rate-limited) client; adjust the cachedClient check to detect only the raw cached client and apply NewRateLimitedClient and NewLimitedConcurrencyClient per invocation, keeping existing locking (refreshed.Load, c.mu.Lock) and use createCloudWatchClient to populate the cache when nil.
🧹 Nitpick comments (1)
pkg/promutil/prometheus.go (1)
88-95: **Document the metric-schema break.**

Adding `account_id` changes the series identity for both existing metric names. Dashboards, alerts, and recording rules on these counters will need a migration note or compatibility plan.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/promutil/prometheus.go` around lines 88 - 95, Adding the new "account_id" label to CloudwatchRateLimitWaitCounter and CloudwatchRateLimitAllowedCounter changes series identity and will break dashboards/alerts; update the metric schema documentation and provide a compatibility plan: either rename the new counters (e.g., add a "_v2" suffix) and register both old and new metrics, or keep the old metric shape by exposing a legacy label value (e.g., account_id="") while introducing the new metric name; update README/CHANGELOG with the migration note and list of affected metric names (yace_cloudwatch_rate_limit_waits_total, yace_cloudwatch_rate_limit_allowed_total), and adjust any recording rules/dashboards/alerting docs to point to the new metric or explain how to transition.
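For reference, the schema break looks roughly like the sketch below, using github.com/prometheus/client_golang. Only the metric name comes from the comment above; the variable name and label set are assumptions inferred from the sequence diagram.

```go
// A hedged sketch of the label change flagged above. The metric name is
// from the review comment; the label order is an assumption.
package promutil

import "github.com/prometheus/client_golang/prometheus"

var CloudwatchRateLimitWaitCounter = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "yace_cloudwatch_rate_limit_waits_total",
		Help: "Number of times a CloudWatch call waited on the rate limiter.",
	},
	// Adding account_id creates brand-new series: any dashboard or recording
	// rule built on the old (operation, region, role) shape must be updated.
	[]string{"operation", "region", "role", "account_id"},
)
```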
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/clients/cloudwatch/rate_limiter_test.go`:
- Around line 284-291: The test reuses a single shortCtx (context.WithTimeout
50ms) for two independent ListMetrics calls which can make the second assertion
fail due to elapsed time; update the test to create a fresh context.WithTimeout
(and corresponding cancel) for each call to
sameRegionDifferentAccountClient.ListMetrics and
sameAccountDifferentRegionClient.ListMetrics so each assertion has its own
independent deadline and is properly cancelled after the call.
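A sketch of that fix, reusing the clients, `err`, and `require` helpers visible in the diffs elsewhere in this review (everything else is assumed): each call gets its own 50ms deadline, so the second assertion no longer inherits time consumed by the first.

```go
// Fresh deadline per call; variable names mirror the test, others are assumed.
shortCtx1, cancel1 := context.WithTimeout(context.Background(), 50*time.Millisecond)
defer cancel1()
err = sameRegionDifferentAccountClient.ListMetrics(shortCtx1, "test", nil, false, nil)
require.NoError(t, err)

shortCtx2, cancel2 := context.WithTimeout(context.Background(), 50*time.Millisecond)
defer cancel2()
err = sameAccountDifferentRegionClient.ListMetrics(shortCtx2, "test", nil, false, nil)
require.NoError(t, err)
```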
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d3c4ccec-754d-442a-8f7c-83d3e41cbfb4
📒 Files selected for processing (9)
- pkg/clients/cloudwatch/rate_limiter.go
- pkg/clients/cloudwatch/rate_limiter_test.go
- pkg/clients/factory.go
- pkg/clients/v1/factory.go
- pkg/clients/v1/factory_test.go
- pkg/clients/v2/factory.go
- pkg/clients/v2/factory_test.go
- pkg/job/scrape.go
- pkg/promutil/prometheus.go
🧹 Nitpick comments (1)
pkg/clients/cloudwatch/rate_limiter_test.go (1)
301-324: ⚡ Quick win

**Also pin the same-account/same-region/different-role case.**

This test proves bucket isolation by account and region, but the new wrapper now also carries `role`. One more assertion here would lock in that `role` affects metrics only, not limiter bucket selection.

Possible extension

```diff
 firstBucketClient := NewRateLimitedClient(&mockClient{}, limiter, "us-east-1", "111111111111", "test-role")
 sameRegionDifferentAccountClient := NewRateLimitedClient(&mockClient{}, limiter, "us-east-1", "222222222222", "test-role")
 sameAccountDifferentRegionClient := NewRateLimitedClient(&mockClient{}, limiter, "us-west-2", "111111111111", "test-role")
+sameBucketDifferentRoleClient := NewRateLimitedClient(&mockClient{}, limiter, "us-east-1", "111111111111", "other-role")
@@
 err = sameAccountDifferentRegionClient.ListMetrics(shortCtx2, "test", nil, false, nil)
 require.NoError(t, err)
+
+sharedCtx, sharedCancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
+defer sharedCancel()
+err = sameBucketDifferentRoleClient.ListMetrics(sharedCtx, "test", nil, false, nil)
+require.Error(t, err)
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/clients/cloudwatch/rate_limiter_test.go` around lines 301 - 324, Add a same-account/same-region/different-role case to TestRateLimitingUsesIndependentAccountRegionBuckets: create a client via NewRateLimitedClient using the same region ("us-east-1") and account ("111111111111") as firstBucketClient but with a different role string, call ListMetrics with the shortCtx used for the other checks, and assert it is rate-limited (expect a non-nil error) so the test verifies that role does not change bucket selection — only account+region do.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3ebd74e5-d831-48fa-b55a-b03ff78e43a6
📒 Files selected for processing (2)
- pkg/clients/cloudwatch/rate_limiter.go
- pkg/clients/cloudwatch/rate_limiter_test.go
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/clients/cloudwatch/rate_limiter.go`:
- Around line 151-160: The cache currently stores a role/region-scoped wrapped
client while NewRateLimitedClient(now capturing accountID) requires per-account
labeling; update the factories that populate c.clients[role][region] so they
either include accountID in the cache key (e.g.,
c.clients[role][region][accountID]) or stop caching the rate-limited wrapper and
instead cache the raw SDK client and call NewRateLimitedClient(rawClient,
globalLimiter, region, accountID, role) per request to build the
SimpleRateLimitedClient with the correct accountID and limiter bucket; change
the code in the factory methods that create the client entries to implement one
of these approaches and ensure the cached value is now the raw SDK client if you
choose the latter.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3e20c28d-15b4-4cde-8eed-e418f5f12ebf
📒 Files selected for processing (2)
- pkg/clients/cloudwatch/rate_limiter.go
- pkg/clients/cloudwatch/rate_limiter_test.go
♻️ Duplicate comments (1)
pkg/clients/cloudwatch/rate_limiter.go (1)
134-152: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

**Don't cache an account-bound wrapper under a role/region key.**

Lines 134-152 now bind `accountID` into `SimpleRateLimitedClient`, but `pkg/clients/v1/factory.go` lines 290-310 and `pkg/clients/v2/factory.go` lines 170-191 still cache that wrapper as `c.clients[role][region].cloudwatch`. The first account to populate a slot will be reused for later accounts in the same role/region, so those requests hit the wrong limiter bucket and export the wrong `account_id`. Cache by account too, or cache the raw SDK client and rebuild the rate-limited wrapper per request.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/clients/cloudwatch/rate_limiter.go` around lines 134 - 152, The wrapper SimpleRateLimitedClient created by NewRateLimitedClient currently embeds accountID but the factories cache it under c.clients[role][region].cloudwatch, causing the first account to be reused for other accounts; fix by changing the caching strategy so the rate-limited wrapper is keyed by account as well (e.g. include accountID in the cache key) or alternatively stop caching the wrapper and instead cache the raw SDK client and call NewRateLimitedClient per request to rebuild a fresh SimpleRateLimitedClient with the correct accountID; update the factory cache accesses that set c.clients[role][region].cloudwatch (and any equivalent v1/v2 factory code paths) to use the account-aware approach.
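One way the "cache by account too" option could look, sketched with illustrative names (`clientCacheKey`, `clientCache`, the stand-in `Client` interface); the repository's real cache lives in the factories and may differ.

```go
// A minimal sketch of an account-aware client cache; types are illustrative.
package clients

import "sync"

// Client stands in for cloudwatch_client.Client here.
type Client interface{}

type clientCacheKey struct {
	role      string
	region    string
	accountID string
}

type clientCache struct {
	mu      sync.Mutex
	clients map[clientCacheKey]Client
}

// getOrCreate keys the cache by (role, region, accountID), so each account
// keeps its own rate-limited wrapper instead of reusing the first account's.
func (c *clientCache) getOrCreate(key clientCacheKey, build func() Client) Client {
	c.mu.Lock()
	defer c.mu.Unlock()
	if client, ok := c.clients[key]; ok {
		return client
	}
	client := build()
	c.clients[key] = client
	return client
}
```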
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5825532d-2cc0-4d2d-983d-f11fc2d0c9e9
📒 Files selected for processing (2)
- pkg/clients/cloudwatch/rate_limiter.go
- pkg/clients/cloudwatch/rate_limiter_test.go
Metrics schema note: leaving the …
Summary
- Split CloudWatch rate limits into per-`(account_id, region)` buckets while preserving the existing per-API limit values.
- `ListMetrics`, `GetMetricData`, and `GetMetricStatistics` use the correct bucket.
- Added `account_id` to rate-limit telemetry and cover independent account/region buckets with a regression test.

Test plan
- `go test ./pkg/clients/cloudwatch`
- `go test ./pkg/clients/... ./pkg/job/... ./pkg/promutil`
- `go test ./...`

Made with Cursor
Summary by CodeRabbit
Refactor
Observability
Breaking Changes
Tests