resourcemanager: make controller config updates atomic #10504

Open
okJiang wants to merge 3 commits into tikv:master from okJiang:codex/issue-10335-atomic-controller-config

@okJiang (Member) commented Mar 27, 2026

What problem does this PR solve?

Issue Number: Close #10335

POST /resource-manager/api/v1/config/controller validates request keys before
applying updates, but it still persists each field one at a time through
UpdateControllerConfigItem. A mixed valid/invalid payload can therefore write
an earlier field before a later invalid value returns 400.

What is changed and how does it work?

resourcemanager: make controller config updates atomic

Batch controller config updates so mixed valid/invalid payloads no longer
persist earlier fields before later validation errors are returned.
  • collect all resolved controller-config fields before applying any update
  • add UpdateControllerConfigItems to clone the current controller config,
    apply every requested field to the clone, and persist once on success
  • route the existing single-item helper through the batch path so callers keep
    the same entry point
  • add unit and integration regression coverage for mixed valid/invalid payloads
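
The clone, apply, persist-once flow described in the bullets above can be sketched as follows. This is a minimal illustration, not the PR's code: the `controllerConfig` struct, the `manager` type, `applyItem`, `updateItems`, and the `saves` counter are all hypothetical stand-ins, and only the two config keys are taken from the PR's tests.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"time"
)

// controllerConfig is a hypothetical two-field stand-in for the real controller config.
type controllerConfig struct {
	EnableTraceLog     bool
	LTBMaxWaitDuration time.Duration
}

// manager holds the live config and counts persist calls; both are illustrative.
type manager struct {
	cfg   controllerConfig
	saves int
}

// errValidation is an assumed sentinel used to tag validation failures.
var errValidation = errors.New("controller config validation failed")

// applyItem validates one key/value pair and writes it into cfg (the clone).
func applyItem(cfg *controllerConfig, key string, value any) error {
	switch key {
	case "enable-controller-trace-log":
		b, ok := value.(bool)
		if !ok {
			return fmt.Errorf("%w: %s expects a bool", errValidation, key)
		}
		cfg.EnableTraceLog = b
	case "ltb-max-wait-duration":
		s, ok := value.(string)
		if !ok {
			return fmt.Errorf("%w: %s expects a duration string", errValidation, key)
		}
		d, err := time.ParseDuration(s)
		if err != nil {
			return fmt.Errorf("%w: %s: %v", errValidation, key, err)
		}
		cfg.LTBMaxWaitDuration = d
	default:
		return fmt.Errorf("%w: unknown key %s", errValidation, key)
	}
	return nil
}

// updateItems applies every item to a copy and swaps it in only if all succeed.
func (m *manager) updateItems(items map[string]any) error {
	clone := m.cfg // plain struct copy; a config holding pointers or maps needs a deep copy
	keys := make([]string, 0, len(items))
	for k := range items {
		keys = append(keys, k)
	}
	sort.Strings(keys) // fixed order so the first reported error is stable
	for _, k := range keys {
		if err := applyItem(&clone, k, items[k]); err != nil {
			return err // live config and storage are untouched
		}
	}
	m.saves++ // stands in for the single persist call
	m.cfg = clone
	return nil
}

func main() {
	m := &manager{cfg: controllerConfig{LTBMaxWaitDuration: 30 * time.Second}}
	err := m.updateItems(map[string]any{
		"enable-controller-trace-log": true,
		"ltb-max-wait-duration":       "not-a-duration",
	})
	// The valid field was written only to the clone, so nothing leaks.
	fmt.Println(err != nil, m.cfg.EnableTraceLog, m.saves) // prints: true false 0
}
```

Because the valid field only ever lands on the clone, a later validation failure needs no rollback step.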

Check List

Tests

  • Unit test
  • Integration test

Release note

Fix a bug where the resource manager controller config API could partially
persist a multi-field update before returning a validation error.

Summary by CodeRabbit

  • New Features

    • Support bulk controller configuration updates in a single request.
  • Bug Fixes

    • Configuration updates are atomic: either all changes apply or none do.
    • Validation tightened so invalid values no longer modify configuration and return clear validation errors.
  • API Changes

    • Config update responses include explicit 400 (validation), 403 (writes disabled), and 500 (server error).
  • Tests

    • Added unit and integration tests for atomic updates and API error handling.

@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has signed the dco. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 27, 2026

coderabbitai bot commented Mar 27, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 236154d3-798b-46d0-ad9e-18e12c4d85f4

📥 Commits

Reviewing files that changed from the base of the PR and between c6bb61e and 05f99f0.

📒 Files selected for processing (2)
  • pkg/mcs/resourcemanager/server/controller_config_errors.go
  • tests/integrations/mcs/resourcemanager/api_test.go

📝 Walkthrough

Introduces atomic, batched controller-config updates. The new Manager.UpdateControllerConfigItems applies multiple key updates to a cloned config under one lock, validates each item, and saves once only if every item passes validation and at least one value changed; UpdateControllerConfigItem now delegates to it. The API/service layers and tests are updated to use and verify the all-or-nothing behavior.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Manager (batch update): pkg/mcs/resourcemanager/server/manager.go | Add UpdateControllerConfigItems(map[string]any) error, which clones the config, applies multiple items via applyControllerConfigItem, validates per item, records which items changed, and saves once if anything changed; UpdateControllerConfigItem now delegates to it. |
| Controller config errors: pkg/mcs/resourcemanager/server/controller_config_errors.go | Add a sentinel/wrapper type and helpers for validation error classification: errControllerConfigValidation, wrapControllerConfigValidationError, and IsControllerConfigValidationError. |
| Config service & store interface: pkg/mcs/resourcemanager/metadataapi/config_service.go | Extend ConfigStore with UpdateControllerConfigItems; add ManagerStore.UpdateControllerConfigItems; refactor SetControllerConfig to call a single batch update and simplify HTTP error mapping (403/400/500). |
| Manager unit tests: pkg/mcs/resourcemanager/server/manager_test.go | Add TestUpdateControllerConfigItemsAtomic to assert that a mixed valid/invalid batch is atomic, and TestUpdateControllerConfigValidationError to assert validation error classification. |
| Metadata service tests: pkg/mcs/resourcemanager/metadataapi/config_service_test.go | Update the test double to track bulk vs. single-item calls, assert all-or-nothing behavior, and add a test for a store failure returning 500. |
| Integration tests & helpers: tests/integrations/mcs/resourcemanager/api_test.go | Add TestControllerConfigAPIAllOrNothing, adjust the request helper to return body and status, and add a tryToSetControllerConfig helper to exercise mixed payloads and verify no partial persistence. |
| API docs (annotations): pkg/mcs/resourcemanager/server/apis/v1/api.go | Extend OpenAPI annotations for POST /config/controller to include the 403 and 500 response cases. |
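
The error classification and the 403/400/500 mapping summarized above could look roughly like the sketch below. It is an assumption-laden illustration: `errValidation`, `errWriteDisabled`, `errStorage`, `wrapValidation`, and `statusFor` are hypothetical names, and the PR's actual helpers (errControllerConfigValidation, IsControllerConfigValidationError, IsMetadataWriteDisabledError) may be implemented differently.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Hypothetical sentinels; the PR's real error types may differ.
var (
	errValidation    = errors.New("controller config validation error")
	errWriteDisabled = errors.New("metadata write disabled")
	errStorage       = errors.New("storage unavailable") // stands in for any persistence failure
)

// wrapValidation tags err as a validation failure while keeping its message.
func wrapValidation(err error) error {
	if err == nil {
		return nil
	}
	return fmt.Errorf("%w: %v", errValidation, err)
}

// statusFor maps an update error to the HTTP status the handler would return.
func statusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, errWriteDisabled):
		return http.StatusForbidden
	case errors.Is(err, errValidation):
		return http.StatusBadRequest
	default:
		// Anything unclassified is treated as a server-side failure.
		return http.StatusInternalServerError
	}
}

func main() {
	fmt.Println(statusFor(nil))                                        // 200
	fmt.Println(statusFor(wrapValidation(errors.New("bad duration")))) // 400
	fmt.Println(statusFor(errWriteDisabled))                           // 403
	fmt.Println(statusFor(errStorage))                                 // 500
}
```

Defaulting unclassified errors to 500 means a new storage error type cannot be silently reported as a client mistake.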

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant API as ConfigService/API
  participant Manager
  participant Storage

  Client->>API: POST /config/controller (items map)
  API->>Manager: UpdateControllerConfigItems(items)
  Manager->>Manager: clone config & lock
  loop for each item
    Manager->>Manager: applyControllerConfigItem(item) (parse path, jsonutil.AddKeyValue, validate)
  end
  alt any item updated
    Manager->>Storage: Save(updated config)
    Storage-->>Manager: OK
    Manager->>Manager: set m.controllerConfig = updated
    Manager-->>API: nil (success)
    API-->>Client: 200 Success!
  else no updates
    Manager-->>API: nil (no-op)
    API-->>Client: 200 Success!
  else validation error
    Manager-->>API: wrapControllerConfigValidationError(err)
    API-->>Client: 400 Bad Request
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

size/L, lgtm

Suggested reviewers

  • nolouch
  • lhy1024
  • AndreMouche

Poem

🐰 I hop through configs, tidy and bright,
Batch every change in one careful bite,
No half-saved carrots left on the floor,
All-or-nothing—secure to the core,
Hooray for atomic updates—more! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 7.14% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'resourcemanager: make controller config updates atomic' clearly and concisely summarizes the main change: making controller config updates atomic to prevent partial persistence. |
| Description check | ✅ Passed | The PR description follows the template with a clear problem statement (Issue #10335), detailed explanation of changes, commit message, and release note. All required sections are present and well-documented. |
| Linked Issues check | ✅ Passed | The PR successfully addresses all objectives from issue #10335: implements atomic updates via UpdateControllerConfigItems, prevents partial persistence on validation errors, and includes unit and integration tests for mixed valid/invalid payloads. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to implementing atomic controller config updates: manager, API layer, configuration service, error handling, and comprehensive tests. No unrelated modifications detected. |

Signed-off-by: okjiang <819421878@qq.com>
@okJiang okJiang force-pushed the codex/issue-10335-atomic-controller-config branch from ae1bc34 to f7de7cc Compare March 27, 2026 10:20

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integrations/mcs/resourcemanager/api_test.go`:
- Around line 318-323: The test sends "true" as a string for the
"enable-controller-trace-log" field which makes the request's failure
non-deterministic; update the call to tryToSetControllerConfig so the map value
for "enable-controller-trace-log" is a boolean true (not the string "true")
while keeping "ltb-max-wait-duration" as the invalid string "not-a-duration" so
the request deterministically exercises the valid+invalid mix; locate the call
to tryToSetControllerConfig in the test and change that map entry accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f0d9242a-516b-4128-81df-8fb780263953

📥 Commits

Reviewing files that changed from the base of the PR and between 99eb5b5 and ae1bc34.

📒 Files selected for processing (4)
  • pkg/mcs/resourcemanager/server/apis/v1/api.go
  • pkg/mcs/resourcemanager/server/manager.go
  • pkg/mcs/resourcemanager/server/manager_test.go
  • tests/integrations/mcs/resourcemanager/api_test.go

Comment on lines +318 to +323
resp, statusCode := tryToSetControllerConfig(re, suite.cluster.GetLeaderServer().GetAddr(), map[string]any{
	"enable-controller-trace-log": "true",
	"ltb-max-wait-duration":       "not-a-duration",
})
re.Equal(http.StatusBadRequest, statusCode)
re.Contains(resp, "time:")

⚠️ Potential issue | 🟠 Major

Use a real boolean here to keep the regression deterministic.

Line 319 sends "true" as a string, so this request can fail on either field instead of exercising the intended valid+invalid mix. That makes the "time:" assertion order-dependent and weakens the atomicity regression.

Suggested fix
 	resp, statusCode := tryToSetControllerConfig(re, suite.cluster.GetLeaderServer().GetAddr(), map[string]any{
-		"enable-controller-trace-log": "true",
+		"enable-controller-trace-log": true,
 		"ltb-max-wait-duration":       "not-a-duration",
 	})
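
To illustrate why the payload shape matters here, the sketch below uses a hypothetical validator (`validate` and `firstError` are not from the PR, and the assumption that a string is rejected for the boolean field is the reviewer's, not verified against the real code). With two invalid values, which error surfaces first depends on the order in which fields are visited, and Go does not specify map iteration order. With one valid and one invalid field, only the duration error can surface.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// validate is a hypothetical per-field check, not the PR's validator.
func validate(key string, value any) error {
	switch key {
	case "enable-controller-trace-log":
		if _, ok := value.(bool); !ok {
			return fmt.Errorf("%s: expected bool, got %T", key, value)
		}
	case "ltb-max-wait-duration":
		s, ok := value.(string)
		if !ok {
			return fmt.Errorf("%s: expected string, got %T", key, value)
		}
		if _, err := time.ParseDuration(s); err != nil {
			return fmt.Errorf("%s: %w", key, err)
		}
	}
	return nil
}

// firstError visits fields in sorted key order and returns the first failure.
// Ranging over the map directly would make the winner order-dependent.
func firstError(payload map[string]any) error {
	keys := make([]string, 0, len(payload))
	for k := range payload {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		if err := validate(k, payload[k]); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	twoInvalid := map[string]any{
		"enable-controller-trace-log": "true", // string, so this field fails too
		"ltb-max-wait-duration":       "not-a-duration",
	}
	oneInvalid := map[string]any{
		"enable-controller-trace-log": true, // real boolean, so only the duration fails
		"ltb-max-wait-duration":       "not-a-duration",
	}
	fmt.Println(firstError(twoInvalid)) // the bool error wins under sorted order
	fmt.Println(firstError(oneInvalid)) // always the duration error
}
```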

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/integrations/mcs/resourcemanager/api_test.go (1)

343-348: ⚠️ Potential issue | 🟠 Major

Use a real boolean in the mixed-payload regression.

Line 344 sends "true" as a string, so this request contains two invalid values instead of one valid field plus one invalid field. That weakens the all-or-nothing regression and can make the "time:" assertion fail for the wrong reason.

Suggested fix
 	resp, statusCode := tryToSetControllerConfig(re, suite.cluster.GetLeaderServer().GetAddr(), map[string]any{
-		"enable-controller-trace-log": "true",
+		"enable-controller-trace-log": true,
 		"ltb-max-wait-duration":       "not-a-duration",
 	})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integrations/mcs/resourcemanager/api_test.go` around lines 343 - 348,
The test is sending the boolean as a string which creates two invalid fields;
update the payload in the call to tryToSetControllerConfig so
"enable-controller-trace-log" is sent as a real boolean true (not the string
"true") while leaving "ltb-max-wait-duration": "not-a-duration" unchanged, so
the request has one valid field and one invalid duration field and the existing
assertion against tryToSetControllerConfig's response containing "time:" remains
meaningful.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/mcs/resourcemanager/metadataapi/config_service.go`:
- Around line 251-257: The current handler collapses all non-permission errors
from s.configStore.UpdateControllerConfigItems(resolvedConf) into 400 Bad
Request; change the error handling to distinguish validation errors from
persistence/storage failures (e.g., errors returned by SaveControllerConfig in
the store). Specifically, detect validation-related errors (the same error
type/value returned by your validation code) and continue to return 400 for
those, but treat store persistence/etcd/write errors (wrap/inspect errors coming
from UpdateControllerConfigItems/SaveControllerConfig or provide a
store.IsPersistenceError helper) as server-side failures and return an
appropriate 5xx (e.g., 500 or 503) with a clear log message; keep the existing
IsMetadataWriteDisabledError check for forbidden. Ensure the store layer
wraps save failures so the handler can reliably distinguish the error kinds.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 01871f43-b5bb-429e-80b5-dbc8ca11a1a6

📥 Commits

Reviewing files that changed from the base of the PR and between ae1bc34 and f7de7cc.

📒 Files selected for processing (5)
  • pkg/mcs/resourcemanager/metadataapi/config_service.go
  • pkg/mcs/resourcemanager/metadataapi/config_service_test.go
  • pkg/mcs/resourcemanager/server/manager.go
  • pkg/mcs/resourcemanager/server/manager_test.go
  • tests/integrations/mcs/resourcemanager/api_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/mcs/resourcemanager/server/manager_test.go

@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Apr 2, 2026
ti-chi-bot bot (Contributor) commented Apr 2, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lhy1024

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot bot (Contributor) commented Apr 2, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-02 03:30:41.572788789 +0000 UTC m=+408646.778148836: ☑️ agreed by lhy1024.

@ti-chi-bot ti-chi-bot bot added the approved label Apr 2, 2026
@okJiang (Member, Author) commented Apr 8, 2026

/retest


codecov bot commented Apr 8, 2026

Codecov Report

❌ Patch coverage is 90.62500% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.99%. Comparing base (3eb99ae) to head (05f99f0).
⚠️ Report is 18 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10504      +/-   ##
==========================================
+ Coverage   78.88%   78.99%   +0.10%     
==========================================
  Files         530      533       +3     
  Lines       71548    72000     +452     
==========================================
+ Hits        56439    56874     +435     
+ Misses      11092    11089       -3     
- Partials     4017     4037      +20     
Flag Coverage Δ
unittests 78.99% <90.62%> (+0.10%) ⬆️

@okJiang (Member, Author) commented Apr 14, 2026

/retest

Signed-off-by: okjiang <819421878@qq.com>
@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 14, 2026
@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
tests/integrations/mcs/resourcemanager/api_test.go (1)

355-356: Don't bind this regression to Go's duration-parser wording.

The contract you care about is already covered by 400 Bad Request plus before == after. re.Contains(resp, "time:") makes the test brittle against stdlib error-text changes.

Suggested change
 		resp, statusCode := tryToSetControllerConfig(re, suite.cluster.GetLeaderServer().GetAddr(), payload)
 		re.Equal(http.StatusBadRequest, statusCode)
-		re.Contains(resp, "time:")
+		re.NotEmpty(resp)
 
 		after := suite.mustGetControllerConfig(re)
 		re.Equal(before, after)
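
The contract-based assertion suggested above (status code plus unchanged state, without matching stdlib error text) can be sketched in isolation. Everything here is hypothetical: the `store` handler, `postConfig`, and the single-field state are stand-ins built on net/http/httptest, not the PR's test suite or helpers.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// store is a hypothetical in-memory config holder standing in for the real service.
type store struct{ maxWait time.Duration }

// ServeHTTP rejects unparsable durations with 400 and leaves state untouched.
func (s *store) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	var body map[string]string
	if err := json.NewDecoder(r.Body).Decode(&body); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	d, err := time.ParseDuration(body["ltb-max-wait-duration"])
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	s.maxWait = d
	w.WriteHeader(http.StatusOK)
}

// postConfig sends payload to the handler and returns the status code and body.
func postConfig(h http.Handler, payload string) (int, string) {
	req := httptest.NewRequest(http.MethodPost, "/config/controller", strings.NewReader(payload))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	s := &store{maxWait: 30 * time.Second}
	before := *s
	code, body := postConfig(s, `{"ltb-max-wait-duration":"not-a-duration"}`)
	after := *s
	// Assert on the contract: status code, non-empty body, unchanged state.
	fmt.Println(code == http.StatusBadRequest, body != "", before == after) // prints: true true true
}
```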
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integrations/mcs/resourcemanager/api_test.go` around lines 355 - 356,
The test should not assert on Go stdlib error wording; remove the brittle
re.Contains(resp, "time:") check and instead rely on the contract already
asserted by re.Equal(http.StatusBadRequest, statusCode) plus a
state-immutability assertion: ensure the pre-request and post-request state
variables (e.g., before and after) are equal (replace the re.Contains line with
re.Equal(before, after) or equivalent) so the test verifies 400 Bad Request and
that nothing changed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9f3b9e7a-2123-40e6-8caf-43fefe1eee69

📥 Commits

Reviewing files that changed from the base of the PR and between f7de7cc and c6bb61e.

📒 Files selected for processing (7)
  • pkg/mcs/resourcemanager/metadataapi/config_service.go
  • pkg/mcs/resourcemanager/metadataapi/config_service_test.go
  • pkg/mcs/resourcemanager/server/apis/v1/api.go
  • pkg/mcs/resourcemanager/server/controller_config_errors.go
  • pkg/mcs/resourcemanager/server/manager.go
  • pkg/mcs/resourcemanager/server/manager_test.go
  • tests/integrations/mcs/resourcemanager/api_test.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/mcs/resourcemanager/server/apis/v1/api.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/mcs/resourcemanager/server/manager_test.go

Signed-off-by: okjiang <819421878@qq.com>
@okJiang (Member, Author) commented Apr 14, 2026

/retest

ti-chi-bot bot (Contributor) commented Apr 14, 2026

@okJiang: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-error-log-review | 05f99f0 | link | true | /test pull-error-log-review |
| pull-unit-test-next-gen-3 | 05f99f0 | link | true | /test pull-unit-test-next-gen-3 |

Labels

approved dco-signoff: yes Indicates the PR's author has signed the dco. needs-1-more-lgtm Indicates a PR needs 1 more LGTM. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

rm: make controller config metadata update atomic

2 participants