*: fix potential bug by ai in pkg/schedule/schedulers by bufferflies · Pull Request #10312 · tikv/pd

bufferflies · 2026-03-07T11:56:23Z

What problem does this PR solve?

Issue Number: Close #10449

What is changed and how does it work?

- It removes several unsafe type assertions on JSON input (like input["x"].(string)), replacing them with checked parsing and clear 400 Bad Request responses, so malformed payloads no longer panic PD.  
  - Key places: pkg/schedule/schedulers/balance_range.go, pkg/schedule/schedulers/evict_leader.go, pkg/schedule/schedulers/grant_hot_region.go, pkg/schedule/schedulers/grant_leader.go
- It fixes a risky type cast in transfer witness handling by making RecvRegionInfo safe (nil on wrong scheduler type) and guarding before channel send.  
  - pkg/schedule/schedulers/transfer_witness_leader.go, pkg/schedule/schedulers/scheduler_controller.go
- It corrects HTTP error behavior in scheduler API paths:
  - invalid user options now return 400 instead of 500
  - avoids nil dereference when delete redirect fails and response is nil
  - handles “handler unavailable” cases explicitly  
  - server/api/scheduler.go
- It adds regression tests specifically asserting no panic and expected status codes for bad/odd inputs.  
  - pkg/schedule/schedulers/balance_range_test.go, pkg/schedule/schedulers/evict_leader_test.go, pkg/schedule/schedulers/grant_hot_region_test.go, pkg/schedule/schedulers/grant_leader_test.go, pkg/schedule/schedulers/transfer_witness_leader_test.go

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No code

Code changes

Has the configuration change
Has HTTP APIs changed (Don't forget to add the declarative for the new API)
Has persistent data change

Side effects

Possible performance regression
Increased code complexity
Breaking backward compatibility

Related changes

PR to update pingcap/docs/pingcap/docs-cn:
PR to update pingcap/tiup:
Need to cherry-pick to the release branch

Release note

None.

Summary by CodeRabbit

Bug Fixes
- Stricter input validation and clearer client-facing errors for scheduler config updates (engine, rule, alias, timeouts, store IDs, ranges), safer control flow to avoid runtime errors and unintended pauses/resumes, and more accurate HTTP status codes for invalid input.
Tests
- Added tests for invalid field types, malformed ranges, wrong scheduler types, and config update error cases to prevent regressions.

Signed-off-by: tongjian <1045931706@qq.com>

coderabbitai · 2026-03-07T11:57:36Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds defensive input validation and type checks across multiple scheduler handlers, tightens HTTP error responses and control flow to avoid panics/nil derefs, and introduces HTTP/unit tests covering invalid input scenarios.

Changes

Cohort / File(s)	Summary
Balance Range Scheduler `pkg/schedule/schedulers/balance_range.go`, `pkg/schedule/schedulers/balance_range_test.go`	Validate `engine`, `rule`, `alias` as strings and restrict `engine` to allowed values; convert/validate `rule` via `core.NewRule`; parse optional `timeout` string to duration with positivity check; update error messages. Adds test `TestAddBalanceRangeJobWithInvalidFieldType`.
Evict Leader Scheduler `pkg/schedule/schedulers/evict_leader.go`, `pkg/schedule/schedulers/evict_leader_test.go`	`pauseLeaderTransferIfStoreNotExist` early-returns when store tracked and surfaces PauseLeaderTransfer errors. `updateConfig` accepts `ranges` as `[]string` or `[]any` with per-element validation and falls back to existing ranges when `id` present. Adds HTTP tests for valid string-array ranges and invalid element types.
Grant Hot Region Scheduler `pkg/schedule/schedulers/grant_hot_region.go`, `pkg/schedule/schedulers/grant_hot_region_test.go`	Validate `"store-leader-id"` is a string before parsing to uint64; return 400 on wrong type. Adds test asserting 400 Bad Request for invalid `store-leader-id` type.
Transfer Witness Leader Scheduler `pkg/schedule/schedulers/transfer_witness_leader.go`, `pkg/schedule/schedulers/transfer_witness_leader_test.go`	`RecvRegionInfo` now safely checks type assertion and returns nil on mismatch to avoid panics. Adds test `TestRecvRegionInfoWithWrongSchedulerType`.
Scheduler Controller `pkg/schedule/schedulers/scheduler_controller.go`	Cache `RecvRegionInfo` result in `CheckTransferWitnessLeader` and return early on nil to prevent nil channel usage.
Scheduler API `server/api/scheduler.go`	Change error responses: use BadRequest for invalid client inputs, preserve default timeout when option missing, use upstream response status on redirect delete errors, and ensure early returns after sending error responses.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

okJiang
JmPotato

Poem

🐰 I hopped through fields with twitching nose,

engines checked, rules lined in tidy rows,
timeouts parsed, and ranges neatly typed,
no panics, no nils — my work's well-typed,
carrot cake later, for careful code that glows.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 8.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title mentions fixing potential bugs in pkg/schedule/schedulers, which aligns with the primary changes across multiple scheduler files that add input validation and error handling.
Description check	✅ Passed	The pull request description includes all required sections and provides substantive details about changes in the commit-message block.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

server/api/scheduler.go (1)
145-151: ⚠️ Potential issue | 🟠 Major

Malformed timeout values still fall back to 30m.

apiutil.CollectStringOption returns ErrOptionNotExist both when timeout is absent and when it is present but not a string (pkg/utils/apiutil/apiutil.go:337-345). With this branch, a payload like {"timeout":123} still appends the default and creates the scheduler instead of returning 400. Check raw key presence in input before applying the default.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/api/scheduler.go` around lines 145 - 151, The branch treating
ErrOptionNotExist as "use default" is incorrect because
apiutil.CollectStringOption returns ErrOptionNotExist both for a missing key and
for a key with a non-string value; update the logic in the scheduler handler
around apiutil.CollectStringOption("timeout", input, collector) to first check
raw presence of the "timeout" key in input (e.g., via a map lookup on input) and
only call collector(defaultTimeout) when the key is absent; if the key exists
but CollectStringOption returns ErrOptionNotExist (meaning the value is
malformed), return a 400 using h.r.JSON with the error message instead of
falling back to default; keep the existing behavior for other errors.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/schedule/schedulers/balance_range.go`:
- Around line 351-353: The early-nil-start guard in checkJob is letting an
invalid running job be marked finished but then allows checkJob to continue and
reach prepare/Schedule; fix this by detecting job == nil or job.Start == nil,
calling finishLocked(job) to mark it finished/clean up, and then immediately
return (stop further processing) from checkJob so it cannot proceed to prepare
or Schedule; update the branch around job == nil || job.Start == nil in checkJob
to perform finishLocked and return immediately.

In `@pkg/schedule/schedulers/evict_leader.go`:
- Around line 176-182: The helper currently conflates "store already configured"
with "we paused it", causing rollbacks in updateConfig to resume stores that
this request never paused; change the helper used in evict_leader logic (the
function that checks conf.StoreIDWithRanges and calls
conf.cluster.PauseLeaderTransfer) to return separate flags (e.g., exists and
pausedByRequest or a small result struct) instead of a single boolean, set
pausedByRequest only when conf.cluster.PauseLeaderTransfer(id, constant.In)
succeeds, and update all callers (including resumeLeaderTransferIfExist and the
rollback paths in updateConfig) to only call resumeLeaderTransfer when
pausedByRequest is true (leave stores that merely existed untouched).

In `@pkg/schedule/schedulers/grant_leader.go`:
- Around line 268-293: The error branches in the ranges parsing block always
call handler.config.cluster.ResumeLeaderTransfer(id, constant.Out) which can
unpause existing stores; change the logic in
pkg/schedule/schedulers/grant_leader.go so ResumeLeaderTransfer is only called
if this request previously paused the store. Introduce a local boolean (e.g.,
paused) set to true where PauseLeaderTransfer is invoked for new stores (the
existing PauseLeaderTransfer call around lines ~257-263) and then check that
paused flag before calling handler.config.cluster.ResumeLeaderTransfer(id,
constant.Out) in the ranges validation error/default branches inside the ranges
handling switch.

---

Outside diff comments:
In `@server/api/scheduler.go`:
- Around line 145-151: The branch treating ErrOptionNotExist as "use default" is
incorrect because apiutil.CollectStringOption returns ErrOptionNotExist both for
a missing key and for a key with a non-string value; update the logic in the
scheduler handler around apiutil.CollectStringOption("timeout", input,
collector) to first check raw presence of the "timeout" key in input (e.g., via
a map lookup on input) and only call collector(defaultTimeout) when the key is
absent; if the key exists but CollectStringOption returns ErrOptionNotExist
(meaning the value is malformed), return a 400 using h.r.JSON with the error
message instead of falling back to default; keep the existing behavior for other
errors.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d0e9d009-8d30-4d17-8018-37955c835928

📥 Commits

Reviewing files that changed from the base of the PR and between 95cde21 and db0ecac.

📒 Files selected for processing (12)

pkg/schedule/schedulers/balance_range.go
pkg/schedule/schedulers/balance_range_test.go
pkg/schedule/schedulers/evict_leader.go
pkg/schedule/schedulers/evict_leader_test.go
pkg/schedule/schedulers/grant_hot_region.go
pkg/schedule/schedulers/grant_hot_region_test.go
pkg/schedule/schedulers/grant_leader.go
pkg/schedule/schedulers/grant_leader_test.go
pkg/schedule/schedulers/scheduler_controller.go
pkg/schedule/schedulers/transfer_witness_leader.go
pkg/schedule/schedulers/transfer_witness_leader_test.go
server/api/scheduler.go

coderabbitai · 2026-03-07T12:11:22Z

pkg/schedule/schedulers/balance_range.go

+	if job == nil || job.Start == nil {
+		return true
+	}


⚠️ Potential issue | 🟡 Minor

Nil-start jobs are still eligible for one scheduling round.

This avoids the panic, but checkJob keeps going after finishLocked (Lines 470-496). A running job restored without Start will be marked finished here and can still reach prepare/Schedule once in the same tick. Return immediately after finishing that invalid job state.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pkg/schedule/schedulers/balance_range.go` around lines 351 - 353, The early-nil-start guard in checkJob is letting an invalid running job be marked finished but then allows checkJob to continue and reach prepare/Schedule; fix this by detecting job == nil or job.Start == nil, calling finishLocked(job) to mark it finished/clean up, and then immediately return (stop further processing) from checkJob so it cannot proceed to prepare or Schedule; update the branch around job == nil || job.Start == nil in checkJob to perform finishLocked and return immediately.

coderabbitai · 2026-03-07T12:11:22Z

pkg/schedule/schedulers/evict_leader.go

+	if _, exist := conf.StoreIDWithRanges[id]; exist {
+		return true, nil
+	}
+	if err := conf.cluster.PauseLeaderTransfer(id, constant.In); err != nil {
+		return false, err
 	}
-	return true, nil
+	return false, nil


⚠️ Potential issue | 🟠 Major

Bad updates can now unpause an existing evict-leader store.

Because this helper now returns true for stores that are already configured, the rollback paths in updateConfig still call resumeLeaderTransferIfExist(id) on 400s (Lines 424, 442, 449, 454, and 464) even though this request never paused that store. A request like {"store_id":1,"batch":0} will now fail validation and also clear the scheduler’s pause state for store 1. Track pausedByRequest separately from exists, and only resume when this request actually called PauseLeaderTransfer.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pkg/schedule/schedulers/evict_leader.go` around lines 176 - 182, The helper currently conflates "store already configured" with "we paused it", causing rollbacks in updateConfig to resume stores that this request never paused; change the helper used in evict_leader logic (the function that checks conf.StoreIDWithRanges and calls conf.cluster.PauseLeaderTransfer) to return separate flags (e.g., exists and pausedByRequest or a small result struct) instead of a single boolean, set pausedByRequest only when conf.cluster.PauseLeaderTransfer(id, constant.In) succeeds, and update all callers (including resumeLeaderTransferIfExist and the rollback paths in updateConfig) to only call resumeLeaderTransfer when pausedByRequest is true (leave stores that merely existed untouched).

pkg/schedule/schedulers/grant_leader.go

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

pkg/schedule/schedulers/balance_range.go (1)

350-351: ⚠️ Potential issue | 🔴 Critical

shouldFinished still panics on nil job or missing Start.

This still dereferences job.Start unconditionally, so the new regression case with job == nil / job.Start == nil will panic instead of returning safely.

Suggested fix

 func (job *balanceRangeSchedulerJob) shouldFinished() bool {
+	if job == nil || job.Start == nil {
+		return true
+	}
 	return time.Since(*job.Start) > job.Timeout
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@pkg/schedule/schedulers/balance_range.go` around lines 350 - 351, The
shouldFinished method currently dereferences job.Start unconditionally causing a
panic when job or job.Start is nil; update
balanceRangeSchedulerJob.shouldFinished to first check for job != nil and
job.Start != nil and return false (not finished) if either is nil, then perform
the time.Since(*job.Start) > job.Timeout comparison; ensure you reference the
balanceRangeSchedulerJob receiver, its Start field and Timeout field when adding
the nil-guards so the behavior is safe for the regression case.

♻️ Duplicate comments (1)

pkg/schedule/schedulers/balance_range.go (1)

467-475: ⚠️ Potential issue | 🟠 Major

Return immediately after timing out the job.

Once this branch marks the job as finished, checkJob still falls through into prepare and can schedule one more operator in the same tick. Stop here after cleanup.

Suggested fix

 	if job.shouldFinished() {
 		if err := s.conf.finishLocked(index); err != nil {
 			balancePersistFailedCounter.Inc()
 			return false
 		}
 		s.cleanJobStatus(cluster, job)
 		balanceRangeExpiredCounter.Inc()
 		log.Info("balance key range job finished due to timeout", zap.String("alias", job.Alias), zap.Uint64("job-id", job.JobID))
+		return false
 	}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@pkg/schedule/schedulers/balance_range.go` around lines 467 - 475, The
checkJob branch that handles timeout needs to return immediately after marking
the job finished to avoid falling through into prepare; update the block inside
checkJob where it calls job.shouldFinished(), s.conf.finishLocked(index),
s.cleanJobStatus(cluster, job), balanceRangeExpiredCounter.Inc(), and the
log.Info call so that it returns false (or otherwise exits) right after cleanup
and logging, preventing prepare/prepareRange from scheduling another operator in
the same tick.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@pkg/schedule/schedulers/balance_range.go`:
- Around line 350-351: The shouldFinished method currently dereferences
job.Start unconditionally causing a panic when job or job.Start is nil; update
balanceRangeSchedulerJob.shouldFinished to first check for job != nil and
job.Start != nil and return false (not finished) if either is nil, then perform
the time.Since(*job.Start) > job.Timeout comparison; ensure you reference the
balanceRangeSchedulerJob receiver, its Start field and Timeout field when adding
the nil-guards so the behavior is safe for the regression case.

---

Duplicate comments:
In `@pkg/schedule/schedulers/balance_range.go`:
- Around line 467-475: The checkJob branch that handles timeout needs to return
immediately after marking the job finished to avoid falling through into
prepare; update the block inside checkJob where it calls job.shouldFinished(),
s.conf.finishLocked(index), s.cleanJobStatus(cluster, job),
balanceRangeExpiredCounter.Inc(), and the log.Info call so that it returns false
(or otherwise exits) right after cleanup and logging, preventing
prepare/prepareRange from scheduling another operator in the same tick.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c16d46e1-b0bf-4117-a0bd-4a0dfe8d55ed

📥 Commits

Reviewing files that changed from the base of the PR and between 9d2b70fd176c11a0d06b118e9e095b0f54e8c8a4 and 65ccf3a2f6cc8b734e7b9168eb799b2349fa813a.

📒 Files selected for processing (3)

pkg/schedule/schedulers/balance_range.go
pkg/schedule/schedulers/balance_range_test.go
pkg/schedule/schedulers/evict_leader.go

🚧 Files skipped from review as they are similar to previous changes (1)

pkg/schedule/schedulers/evict_leader.go

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

pkg/schedule/schedulers/evict_leader.go (1)

176-180: ⚠️ Potential issue | 🟠 Major

Keep alreadyConfigured separate from pausedByRequest.

This helper now returns the same true value for two different states, but the rollback paths at Line 424, Line 442, Line 449, Line 454, and Line 464 still treat it as “this request paused the store”. A rejected update can therefore clear the existing pause on an already-configured store even though this request never called PauseLeaderTransfer.

💡 One way to fix it

+type pauseLeaderTransferResult struct {
+	alreadyConfigured bool
+	pausedByRequest   bool
+}
+
-func (conf *evictLeaderSchedulerConfig) pauseLeaderTransferIfStoreNotExist(id uint64) (bool, error) {
+func (conf *evictLeaderSchedulerConfig) pauseLeaderTransferIfStoreNotExist(id uint64) (pauseLeaderTransferResult, error) {
 	conf.RLock()
 	defer conf.RUnlock()
 	if _, exist := conf.StoreIDWithRanges[id]; exist {
-		return true, nil
+		return pauseLeaderTransferResult{alreadyConfigured: true}, nil
 	}
 	if err := conf.cluster.PauseLeaderTransfer(id, constant.In); err != nil {
-		return false, err
+		return pauseLeaderTransferResult{}, err
 	}
-	return true, nil
+	return pauseLeaderTransferResult{pausedByRequest: true}, nil
 }

-	exist, err = handler.config.pauseLeaderTransferIfStoreNotExist(id)
+	pauseState, err = handler.config.pauseLeaderTransferIfStoreNotExist(id)
 	...
-		handler.config.resumeLeaderTransferIfExist(id)
+		if pauseState.pausedByRequest {
+			handler.config.resumeLeaderTransferIfExist(id)
+		}
 		...
-	} else if exist {
+	} else if pauseState.alreadyConfigured {
 		ranges = handler.config.getRanges(id)
 	}

Also applies to: 423-465

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@pkg/schedule/schedulers/evict_leader.go` around lines 176 - 180, The helper
that checks/conflicts and calls conf.cluster.PauseLeaderTransfer conflates
"alreadyConfigured" and "pausedByRequest" into a single boolean; change it to
return two distinct booleans (e.g., alreadyConfigured, pausedByRequest) so
callers can distinguish stores that were paused by this request from stores that
were already configured; update the helper (the code that reads
conf.StoreIDWithRanges and calls conf.cluster.PauseLeaderTransfer) to set
alreadyConfigured=true when the store exists in conf.StoreIDWithRanges and set
pausedByRequest=true only when PauseLeaderTransfer is actually invoked
successfully; then update all rollback paths that currently assume the single
true to only call the resume/unpause logic when pausedByRequest is true (leave
alreadyConfigured alone).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/schedule/schedulers/balance_range.go`:
- Around line 121-135: The parsed timeout can be negative which causes
shouldFinished() to immediately return true; after parsing timeoutStr with
time.ParseDuration in the block surrounding timeoutStr and err, add a validation
that the resulting timeout is strictly positive (e.g., timeout > 0) and if not,
return a 400 via handler.rd.JSON with a clear message (e.g., "timeout must be
positive") and do not assign job.Timeout; keep existing handling of empty string
and parse errors intact and reference timeoutStr, time.ParseDuration,
job.Timeout and shouldFinished() when locating the change.

---

Duplicate comments:
In `@pkg/schedule/schedulers/evict_leader.go`:
- Around line 176-180: The helper that checks/conflicts and calls
conf.cluster.PauseLeaderTransfer conflates "alreadyConfigured" and
"pausedByRequest" into a single boolean; change it to return two distinct
booleans (e.g., alreadyConfigured, pausedByRequest) so callers can distinguish
stores that were paused by this request from stores that were already
configured; update the helper (the code that reads conf.StoreIDWithRanges and
calls conf.cluster.PauseLeaderTransfer) to set alreadyConfigured=true when the
store exists in conf.StoreIDWithRanges and set pausedByRequest=true only when
PauseLeaderTransfer is actually invoked successfully; then update all rollback
paths that currently assume the single true to only call the resume/unpause
logic when pausedByRequest is true (leave alreadyConfigured alone).

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0dfccb36-3fe0-44a2-9af9-d9cd001700a9

📥 Commits

Reviewing files that changed from the base of the PR and between 65ccf3a2f6cc8b734e7b9168eb799b2349fa813a and 1d91e5fcf353812fa109020f177c7fdab2e43ab7.

📒 Files selected for processing (3)

pkg/schedule/schedulers/balance_range.go
pkg/schedule/schedulers/balance_range_test.go
pkg/schedule/schedulers/evict_leader.go

🚧 Files skipped from review as they are similar to previous changes (1)

pkg/schedule/schedulers/balance_range_test.go

pkg/schedule/schedulers/balance_range.go

codecov · 2026-03-10T03:40:38Z

Codecov Report

❌ Patch coverage is 49.29577% with 36 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.95%. Comparing base (525524f) to head (be9cf36).
⚠️ Report is 43 commits behind head on master.

❌ Your patch status has failed because the patch coverage (49.29%) is below the target coverage (74.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #10312      +/-   ##
==========================================
+ Coverage   78.78%   78.95%   +0.17%     
==========================================
  Files         525      532       +7     
  Lines       70824    71846    +1022     
==========================================
+ Hits        55796    56725     +929     
- Misses      11004    11095      +91     
- Partials     4024     4026       +2

Flag	Coverage Δ
unittests	`78.95% <49.29%> (+0.17%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

liyishuai

Tesed on my side. The AI review is worth considering. Otherwise LGTM.

ti-chi-bot · 2026-03-10T04:12:34Z

@liyishuai: adding LGTM is restricted to approvers and reviewers in OWNERS files.

Details

In response to this:

Tesed on my side. The AI review is worth considering. Otherwise LGTM.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Signed-off-by: tongjian <1045931706@qq.com>

okJiang

Please link the corresponding issue and explain your changes in the PR description.

Even when using AI to fix bugs, we should still follow the process of submitting a bug issue -> pull request. The process of submitting a bug issue can also be completed by AI, and we should try to follow atomic commits, with one PR fixing one issue.

bufferflies · 2026-03-12T11:44:38Z

/retest

rleungx · 2026-03-13T03:49:00Z

pkg/schedule/schedulers/evict_leader.go

+			for _, item := range val {
+				s, ok := item.(string)
+				if !ok {
+					handler.config.resumeLeaderTransferIfExist(id)


If there is a wrong request, e.g., an invalid range type, will it resume incorrectly?

Good catch. I think the original code is pretty good; we should ignore the other type.

Signed-off-by: tongjian <1045931706@qq.com>

bufferflies · 2026-03-17T08:24:41Z

CollectStringOption

// CollectStringOption is used to collect string using from input map for given option
func CollectStringOption(option string, input map[string]any, collectors ...func(v string)) error {
	if v, ok := input[option].(string); ok {
		for _, c := range collectors {
			c(v)
		}
		return nil
	}
	return errs.ErrOptionNotExist.FastGenByArgs(option)
}

It will return ErrOptionNotExist if the key does not exist or is not a string. Do you suggest disguising them?

pkg/schedule/schedulers/transfer_witness_leader_test.go

Signed-off-by: tongjian <1045931706@qq.com>

okJiang · 2026-03-20T08:15:35Z

/retest

bufferflies · 2026-03-20T09:43:46Z

/retest

bufferflies · 2026-03-20T11:04:08Z

/retest

bufferflies · 2026-03-25T11:37:23Z

/ping @rleungx

bufferflies · 2026-03-31T06:35:47Z

/ping @lhy1024 @rleungx

lhy1024 · 2026-04-01T03:32:22Z

server/api/scheduler.go

 		defaultTimeout := "30m"
 		if err := apiutil.CollectStringOption("timeout", input, collector); err != nil {
 			if errors.ErrorEqual(err, errs.ErrOptionNotExist) {
 				collector(defaultTimeout)
 			} else {
-				h.r.JSON(w, http.StatusInternalServerError, err.Error())
+				h.r.JSON(w, http.StatusBadRequest, err.Error())
 				return
 			}
 		}


CollectStringOption returns ErrOptionNotExist for non-string type and option not exist.

The core issue is that in the create path, {"timeout":123} is still treated as “timeout not provided”, so it silently falls back to the default value instead of returning 400.

func TestCollectStringOptionTypeMismatchMatchesErrOptionNotExist(t *testing.T) { input := map[string]any{"timeout": 123} err := CollectStringOption("timeout", input) require.Error(t, err) require.True(t, perrors.ErrorEqual(err, errs.ErrOptionNotExist)) }

go test ./pkg/utils/apiutil -run '^TestCollectStringOptionTypeMismatchMatchesErrOptionNotExist$' -count=1 ok github.com/tikv/pd/pkg/utils/apiutil 0.007s

It must be string lit, can't be number lit

Should we distinguish between non-string type and option not exist for timeout?

pkg/schedule/schedulers/balance_range.go

rleungx · 2026-04-01T05:31:01Z

server/api/scheduler.go

 		}
 	}

 	if err := h.AddScheduler(tp, args...); err != nil {


If there is a bad rule, it still returns 500, instead of 400.

good catch, done

Actually, AddScheduler can also return 500 sometimes.

Signed-off-by: bufferflies <1045931706@qq.com>

ti-chi-bot · 2026-04-02T02:42:31Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lhy1024, liyishuai, okJiang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [lhy1024,okJiang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot · 2026-04-02T02:42:33Z

[LGTM Timeline notifier]

Timeline:

2026-03-20 08:15:52.437771143 +0000 UTC m=+517679.525428680: ☑️ agreed by okJiang.
2026-04-02 02:42:32.11721576 +0000 UTC m=+405757.322575807: ☑️ agreed by lhy1024.

Signed-off-by: bufferflies <1045931706@qq.com>

bufferflies · 2026-04-02T03:27:30Z

/retest

bufferflies · 2026-04-02T07:11:35Z

/retest

ai scan potential panix

db0ecac

Signed-off-by: tongjian <1045931706@qq.com>

ti-chi-bot bot added do-not-merge/needs-linked-issue release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the dco. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 7, 2026

coderabbitai bot reviewed Mar 7, 2026

View reviewed changes

bufferflies force-pushed the bugfix/ai branch from 9d2b70f to 65ccf3a Compare March 9, 2026 12:24

coderabbitai bot reviewed Mar 9, 2026

View reviewed changes

bufferflies force-pushed the bugfix/ai branch from 65ccf3a to 1d91e5f Compare March 10, 2026 03:30

ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 10, 2026

coderabbitai bot reviewed Mar 10, 2026

View reviewed changes

pkg/schedule/schedulers/balance_range.go Show resolved Hide resolved

liyishuai approved these changes Mar 10, 2026

View reviewed changes

bufferflies mentioned this pull request Mar 10, 2026

schedule, server: harden scheduler config input validation #10311

Closed

ti-chi-bot bot removed the do-not-merge/needs-linked-issue label Mar 10, 2026

remove unuseless change

c4f9234

Signed-off-by: tongjian <1045931706@qq.com>

bufferflies force-pushed the bugfix/ai branch from 1d91e5f to c4f9234 Compare March 10, 2026 08:21

ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 10, 2026

bufferflies changed the title *: fix potential panic by ai *: fix potential bug by ai Mar 10, 2026

bufferflies requested a review from okJiang March 10, 2026 08:26

okJiang reviewed Mar 10, 2026

View reviewed changes

bufferflies requested a review from rleungx March 13, 2026 02:57

rleungx reviewed Mar 13, 2026

View reviewed changes

bufferflies changed the title *: fix potential bug by ai *: fix potential bug by ai in pkg/schedule/schedulers Mar 13, 2026

evict leader

c6d35b9

Signed-off-by: tongjian <1045931706@qq.com>

bufferflies mentioned this pull request Mar 18, 2026

pkg/schedule/schedulers: avoid panics on malformed scheduler config input #10449

Closed

ti-chi-bot bot added the do-not-merge/needs-triage-completed label Mar 18, 2026

bufferflies requested a review from okJiang March 20, 2026 06:51

okJiang reviewed Mar 20, 2026

View reviewed changes

pkg/schedule/schedulers/transfer_witness_leader_test.go Outdated Show resolved Hide resolved

address comment

011fe79

Signed-off-by: tongjian <1045931706@qq.com>

ti-chi-bot bot removed the do-not-merge/needs-triage-completed label Mar 20, 2026

okJiang approved these changes Mar 20, 2026

View reviewed changes

ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Mar 20, 2026

bufferflies requested a review from rleungx March 20, 2026 11:28

bufferflies requested a review from lhy1024 March 31, 2026 02:12

lhy1024 reviewed Apr 1, 2026

View reviewed changes

rleungx reviewed Apr 1, 2026

View reviewed changes

address comment

47a5532

Signed-off-by: bufferflies <1045931706@qq.com>

lhy1024 approved these changes Apr 2, 2026

View reviewed changes

ti-chi-bot bot added the lgtm label Apr 2, 2026

ti-chi-bot bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Apr 2, 2026

add new error ErrOptionTypeInvalid

be9cf36

Signed-off-by: bufferflies <1045931706@qq.com>

bufferflies force-pushed the bugfix/ai branch from 39c0c46 to be9cf36 Compare April 2, 2026 03:05

ti-chi-bot bot merged commit 9ead470 into tikv:master Apr 2, 2026
57 of 65 checks passed

Conversation

bufferflies commented Mar 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

What is changed and how does it work?

Check List

Release note

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 7, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 7, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codecov bot commented Mar 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

liyishuai left a comment

Choose a reason for hiding this comment

Uh oh!

ti-chi-bot bot commented Mar 10, 2026

Uh oh!

okJiang left a comment

Choose a reason for hiding this comment

Uh oh!

bufferflies commented Mar 12, 2026

Uh oh!

rleungx Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bufferflies commented Mar 17, 2026

Uh oh!

Uh oh!

okJiang commented Mar 20, 2026

Uh oh!

bufferflies commented Mar 20, 2026

Uh oh!

bufferflies commented Mar 20, 2026

Uh oh!

bufferflies commented Mar 25, 2026

Uh oh!

bufferflies commented Mar 31, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bufferflies commented Mar 7, 2026 •

edited

Loading

coderabbitai bot commented Mar 7, 2026 •

edited

Loading

codecov bot commented Mar 10, 2026 •

edited

Loading

rleungx Mar 13, 2026 •

edited

Loading