Skip to content

*: fix potential bug by ai in pkg/schedule/schedulers#10312

Merged
ti-chi-bot[bot] merged 6 commits intotikv:masterfrom
bufferflies:bugfix/ai
Apr 2, 2026
Merged

*: fix potential bug by ai in pkg/schedule/schedulers#10312
ti-chi-bot[bot] merged 6 commits intotikv:masterfrom
bufferflies:bugfix/ai

Conversation

@bufferflies
Copy link
Copy Markdown
Contributor

@bufferflies bufferflies commented Mar 7, 2026

What problem does this PR solve?

Issue Number: Close #10449

What is changed and how does it work?

- It removes several unsafe type assertions on JSON input (like input["x"].(string)), replacing them with checked parsing and clear 400 Bad Request responses, so malformed payloads no longer panic PD.  
  - Key places: pkg/schedule/schedulers/balance_range.go, pkg/schedule/schedulers/evict_leader.go, pkg/schedule/schedulers/grant_hot_region.go, pkg/schedule/schedulers/grant_leader.go
- It fixes a risky type cast in transfer witness handling by making RecvRegionInfo safe (nil on wrong scheduler type) and guarding before channel send.  
  - pkg/schedule/schedulers/transfer_witness_leader.go, pkg/schedule/schedulers/scheduler_controller.go
- It corrects HTTP error behavior in scheduler API paths:
  - invalid user options now return 400 instead of 500
  - avoids nil dereference when delete redirect fails and response is nil
  - handles “handler unavailable” cases explicitly  
  - server/api/scheduler.go
- It adds regression tests specifically asserting no panic and expected status codes for bad/odd inputs.  
  - pkg/schedule/schedulers/balance_range_test.go, pkg/schedule/schedulers/evict_leader_test.go, pkg/schedule/schedulers/grant_hot_region_test.go, pkg/schedule/schedulers/grant_leader_test.go, pkg/schedule/schedulers/transfer_witness_leader_test.go

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Code changes

Side effects

  • Possible performance regression
  • Increased code complexity
  • Breaking backward compatibility

Related changes

Release note

None.

Summary by CodeRabbit

  • Bug Fixes

    • Stricter input validation and clearer client-facing errors for scheduler config updates (engine, rule, alias, timeouts, store IDs, ranges), safer control flow to avoid runtime errors and unintended pauses/resumes, and more accurate HTTP status codes for invalid input.
  • Tests

    • Added tests for invalid field types, malformed ranges, wrong scheduler types, and config update error cases to prevent regressions.

Signed-off-by: tongjian <1045931706@qq.com>
@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the dco. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 7, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 7, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds defensive input validation and type checks across multiple scheduler handlers, tightens HTTP error responses and control flow to avoid panics/nil derefs, and introduces HTTP/unit tests covering invalid input scenarios.

Changes

Cohort / File(s) Summary
Balance Range Scheduler
pkg/schedule/schedulers/balance_range.go, pkg/schedule/schedulers/balance_range_test.go
Validate engine, rule, alias as strings and restrict engine to allowed values; convert/validate rule via core.NewRule; parse optional timeout string to duration with positivity check; update error messages. Adds test TestAddBalanceRangeJobWithInvalidFieldType.
Evict Leader Scheduler
pkg/schedule/schedulers/evict_leader.go, pkg/schedule/schedulers/evict_leader_test.go
pauseLeaderTransferIfStoreNotExist early-returns when store tracked and surfaces PauseLeaderTransfer errors. updateConfig accepts ranges as []string or []any with per-element validation and falls back to existing ranges when id present. Adds HTTP tests for valid string-array ranges and invalid element types.
Grant Hot Region Scheduler
pkg/schedule/schedulers/grant_hot_region.go, pkg/schedule/schedulers/grant_hot_region_test.go
Validate "store-leader-id" is a string before parsing to uint64; return 400 on wrong type. Adds test asserting 400 Bad Request for invalid store-leader-id type.
Transfer Witness Leader Scheduler
pkg/schedule/schedulers/transfer_witness_leader.go, pkg/schedule/schedulers/transfer_witness_leader_test.go
RecvRegionInfo now safely checks type assertion and returns nil on mismatch to avoid panics. Adds test TestRecvRegionInfoWithWrongSchedulerType.
Scheduler Controller
pkg/schedule/schedulers/scheduler_controller.go
Cache RecvRegionInfo result in CheckTransferWitnessLeader and return early on nil to prevent nil channel usage.
Scheduler API
server/api/scheduler.go
Change error responses: use BadRequest for invalid client inputs, preserve default timeout when option missing, use upstream response status on redirect delete errors, and ensure early returns after sending error responses.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • okJiang
  • JmPotato

Poem

🐰 I hopped through fields with twitching nose,

engines checked, rules lined in tidy rows,
timeouts parsed, and ranges neatly typed,
no panics, no nils — my work's well-typed,
carrot cake later, for careful code that glows.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 8.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title mentions fixing potential bugs in pkg/schedule/schedulers, which aligns with the primary changes across multiple scheduler files that add input validation and error handling.
Description check ✅ Passed The pull request description includes all required sections and provides substantive details about changes in the commit-message block.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
server/api/scheduler.go (1)

145-151: ⚠️ Potential issue | 🟠 Major

Malformed timeout values still fall back to 30m.

apiutil.CollectStringOption returns ErrOptionNotExist both when timeout is absent and when it is present but not a string (pkg/utils/apiutil/apiutil.go:337-345). With this branch, a payload like {"timeout":123} still appends the default and creates the scheduler instead of returning 400. Check raw key presence in input before applying the default.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/api/scheduler.go` around lines 145 - 151, The branch treating
ErrOptionNotExist as "use default" is incorrect because
apiutil.CollectStringOption returns ErrOptionNotExist both for a missing key and
for a key with a non-string value; update the logic in the scheduler handler
around apiutil.CollectStringOption("timeout", input, collector) to first check
raw presence of the "timeout" key in input (e.g., via a map lookup on input) and
only call collector(defaultTimeout) when the key is absent; if the key exists
but CollectStringOption returns ErrOptionNotExist (meaning the value is
malformed), return a 400 using h.r.JSON with the error message instead of
falling back to default; keep the existing behavior for other errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/schedule/schedulers/balance_range.go`:
- Around line 351-353: The early-nil-start guard in checkJob is letting an
invalid running job be marked finished but then allows checkJob to continue and
reach prepare/Schedule; fix this by detecting job == nil or job.Start == nil,
calling finishLocked(job) to mark it finished/clean up, and then immediately
return (stop further processing) from checkJob so it cannot proceed to prepare
or Schedule; update the branch around job == nil || job.Start == nil in checkJob
to perform finishLocked and return immediately.

In `@pkg/schedule/schedulers/evict_leader.go`:
- Around line 176-182: The helper currently conflates "store already configured"
with "we paused it", causing rollbacks in updateConfig to resume stores that
this request never paused; change the helper used in evict_leader logic (the
function that checks conf.StoreIDWithRanges and calls
conf.cluster.PauseLeaderTransfer) to return separate flags (e.g., exists and
pausedByRequest or a small result struct) instead of a single boolean, set
pausedByRequest only when conf.cluster.PauseLeaderTransfer(id, constant.In)
succeeds, and update all callers (including resumeLeaderTransferIfExist and the
rollback paths in updateConfig) to only call resumeLeaderTransfer when
pausedByRequest is true (leave stores that merely existed untouched).

In `@pkg/schedule/schedulers/grant_leader.go`:
- Around line 268-293: The error branches in the ranges parsing block always
call handler.config.cluster.ResumeLeaderTransfer(id, constant.Out) which can
unpause existing stores; change the logic in
pkg/schedule/schedulers/grant_leader.go so ResumeLeaderTransfer is only called
if this request previously paused the store. Introduce a local boolean (e.g.,
paused) set to true where PauseLeaderTransfer is invoked for new stores (the
existing PauseLeaderTransfer call around lines ~257-263) and then check that
paused flag before calling handler.config.cluster.ResumeLeaderTransfer(id,
constant.Out) in the ranges validation error/default branches inside the ranges
handling switch.

---

Outside diff comments:
In `@server/api/scheduler.go`:
- Around line 145-151: The branch treating ErrOptionNotExist as "use default" is
incorrect because apiutil.CollectStringOption returns ErrOptionNotExist both for
a missing key and for a key with a non-string value; update the logic in the
scheduler handler around apiutil.CollectStringOption("timeout", input,
collector) to first check raw presence of the "timeout" key in input (e.g., via
a map lookup on input) and only call collector(defaultTimeout) when the key is
absent; if the key exists but CollectStringOption returns ErrOptionNotExist
(meaning the value is malformed), return a 400 using h.r.JSON with the error
message instead of falling back to default; keep the existing behavior for other
errors.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d0e9d009-8d30-4d17-8018-37955c835928

📥 Commits

Reviewing files that changed from the base of the PR and between 95cde21 and db0ecac.

📒 Files selected for processing (12)
  • pkg/schedule/schedulers/balance_range.go
  • pkg/schedule/schedulers/balance_range_test.go
  • pkg/schedule/schedulers/evict_leader.go
  • pkg/schedule/schedulers/evict_leader_test.go
  • pkg/schedule/schedulers/grant_hot_region.go
  • pkg/schedule/schedulers/grant_hot_region_test.go
  • pkg/schedule/schedulers/grant_leader.go
  • pkg/schedule/schedulers/grant_leader_test.go
  • pkg/schedule/schedulers/scheduler_controller.go
  • pkg/schedule/schedulers/transfer_witness_leader.go
  • pkg/schedule/schedulers/transfer_witness_leader_test.go
  • server/api/scheduler.go

Comment on lines +351 to +353
if job == nil || job.Start == nil {
return true
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Nil-start jobs are still eligible for one scheduling round.

This avoids the panic, but checkJob keeps going after finishLocked (Lines 470-496). A running job restored without Start will be marked finished here and can still reach prepare/Schedule once in the same tick. Return immediately after finishing that invalid job state.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/schedule/schedulers/balance_range.go` around lines 351 - 353, The
early-nil-start guard in checkJob is letting an invalid running job be marked
finished but then allows checkJob to continue and reach prepare/Schedule; fix
this by detecting job == nil or job.Start == nil, calling finishLocked(job) to
mark it finished/clean up, and then immediately return (stop further processing)
from checkJob so it cannot proceed to prepare or Schedule; update the branch
around job == nil || job.Start == nil in checkJob to perform finishLocked and
return immediately.

Comment on lines +176 to +182
if _, exist := conf.StoreIDWithRanges[id]; exist {
return true, nil
}
if err := conf.cluster.PauseLeaderTransfer(id, constant.In); err != nil {
return false, err
}
return true, nil
return false, nil
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Bad updates can now unpause an existing evict-leader store.

Because this helper now returns true for stores that are already configured, the rollback paths in updateConfig still call resumeLeaderTransferIfExist(id) on 400s (Lines 424, 442, 449, 454, and 464) even though this request never paused that store. A request like {"store_id":1,"batch":0} will now fail validation and also clear the scheduler’s pause state for store 1. Track pausedByRequest separately from exists, and only resume when this request actually called PauseLeaderTransfer.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/schedule/schedulers/evict_leader.go` around lines 176 - 182, The helper
currently conflates "store already configured" with "we paused it", causing
rollbacks in updateConfig to resume stores that this request never paused;
change the helper used in evict_leader logic (the function that checks
conf.StoreIDWithRanges and calls conf.cluster.PauseLeaderTransfer) to return
separate flags (e.g., exists and pausedByRequest or a small result struct)
instead of a single boolean, set pausedByRequest only when
conf.cluster.PauseLeaderTransfer(id, constant.In) succeeds, and update all
callers (including resumeLeaderTransferIfExist and the rollback paths in
updateConfig) to only call resumeLeaderTransfer when pausedByRequest is true
(leave stores that merely existed untouched).

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/schedule/schedulers/balance_range.go (1)

350-351: ⚠️ Potential issue | 🔴 Critical

shouldFinished still panics on nil job or missing Start.

This still dereferences job.Start unconditionally, so the new regression case with job == nil / job.Start == nil will panic instead of returning safely.

Suggested fix
 func (job *balanceRangeSchedulerJob) shouldFinished() bool {
+	if job == nil || job.Start == nil {
+		return true
+	}
 	return time.Since(*job.Start) > job.Timeout
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/schedule/schedulers/balance_range.go` around lines 350 - 351, The
shouldFinished method currently dereferences job.Start unconditionally causing a
panic when job or job.Start is nil; update
balanceRangeSchedulerJob.shouldFinished to first check for job != nil and
job.Start != nil and return false (not finished) if either is nil, then perform
the time.Since(*job.Start) > job.Timeout comparison; ensure you reference the
balanceRangeSchedulerJob receiver, its Start field and Timeout field when adding
the nil-guards so the behavior is safe for the regression case.
♻️ Duplicate comments (1)
pkg/schedule/schedulers/balance_range.go (1)

467-475: ⚠️ Potential issue | 🟠 Major

Return immediately after timing out the job.

Once this branch marks the job as finished, checkJob still falls through into prepare and can schedule one more operator in the same tick. Stop here after cleanup.

Suggested fix
 	if job.shouldFinished() {
 		if err := s.conf.finishLocked(index); err != nil {
 			balancePersistFailedCounter.Inc()
 			return false
 		}
 		s.cleanJobStatus(cluster, job)
 		balanceRangeExpiredCounter.Inc()
 		log.Info("balance key range job finished due to timeout", zap.String("alias", job.Alias), zap.Uint64("job-id", job.JobID))
+		return false
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/schedule/schedulers/balance_range.go` around lines 467 - 475, The
checkJob branch that handles timeout needs to return immediately after marking
the job finished to avoid falling through into prepare; update the block inside
checkJob where it calls job.shouldFinished(), s.conf.finishLocked(index),
s.cleanJobStatus(cluster, job), balanceRangeExpiredCounter.Inc(), and the
log.Info call so that it returns false (or otherwise exits) right after cleanup
and logging, preventing prepare/prepareRange from scheduling another operator in
the same tick.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@pkg/schedule/schedulers/balance_range.go`:
- Around line 350-351: The shouldFinished method currently dereferences
job.Start unconditionally causing a panic when job or job.Start is nil; update
balanceRangeSchedulerJob.shouldFinished to first check for job != nil and
job.Start != nil and return false (not finished) if either is nil, then perform
the time.Since(*job.Start) > job.Timeout comparison; ensure you reference the
balanceRangeSchedulerJob receiver, its Start field and Timeout field when adding
the nil-guards so the behavior is safe for the regression case.

---

Duplicate comments:
In `@pkg/schedule/schedulers/balance_range.go`:
- Around line 467-475: The checkJob branch that handles timeout needs to return
immediately after marking the job finished to avoid falling through into
prepare; update the block inside checkJob where it calls job.shouldFinished(),
s.conf.finishLocked(index), s.cleanJobStatus(cluster, job),
balanceRangeExpiredCounter.Inc(), and the log.Info call so that it returns false
(or otherwise exits) right after cleanup and logging, preventing
prepare/prepareRange from scheduling another operator in the same tick.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c16d46e1-b0bf-4117-a0bd-4a0dfe8d55ed

📥 Commits

Reviewing files that changed from the base of the PR and between 9d2b70fd176c11a0d06b118e9e095b0f54e8c8a4 and 65ccf3a2f6cc8b734e7b9168eb799b2349fa813a.

📒 Files selected for processing (3)
  • pkg/schedule/schedulers/balance_range.go
  • pkg/schedule/schedulers/balance_range_test.go
  • pkg/schedule/schedulers/evict_leader.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/schedule/schedulers/evict_leader.go

@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 10, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
pkg/schedule/schedulers/evict_leader.go (1)

176-180: ⚠️ Potential issue | 🟠 Major

Keep alreadyConfigured separate from pausedByRequest.

This helper now returns the same true value for two different states, but the rollback paths at Line 424, Line 442, Line 449, Line 454, and Line 464 still treat it as “this request paused the store”. A rejected update can therefore clear the existing pause on an already-configured store even though this request never called PauseLeaderTransfer.

💡 One way to fix it
+type pauseLeaderTransferResult struct {
+	alreadyConfigured bool
+	pausedByRequest   bool
+}
+
-func (conf *evictLeaderSchedulerConfig) pauseLeaderTransferIfStoreNotExist(id uint64) (bool, error) {
+func (conf *evictLeaderSchedulerConfig) pauseLeaderTransferIfStoreNotExist(id uint64) (pauseLeaderTransferResult, error) {
 	conf.RLock()
 	defer conf.RUnlock()
 	if _, exist := conf.StoreIDWithRanges[id]; exist {
-		return true, nil
+		return pauseLeaderTransferResult{alreadyConfigured: true}, nil
 	}
 	if err := conf.cluster.PauseLeaderTransfer(id, constant.In); err != nil {
-		return false, err
+		return pauseLeaderTransferResult{}, err
 	}
-	return true, nil
+	return pauseLeaderTransferResult{pausedByRequest: true}, nil
 }
-	exist, err = handler.config.pauseLeaderTransferIfStoreNotExist(id)
+	pauseState, err = handler.config.pauseLeaderTransferIfStoreNotExist(id)
 	...
-		handler.config.resumeLeaderTransferIfExist(id)
+		if pauseState.pausedByRequest {
+			handler.config.resumeLeaderTransferIfExist(id)
+		}
 		...
-	} else if exist {
+	} else if pauseState.alreadyConfigured {
 		ranges = handler.config.getRanges(id)
 	}

Also applies to: 423-465

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/schedule/schedulers/evict_leader.go` around lines 176 - 180, The helper
that checks/conflicts and calls conf.cluster.PauseLeaderTransfer conflates
"alreadyConfigured" and "pausedByRequest" into a single boolean; change it to
return two distinct booleans (e.g., alreadyConfigured, pausedByRequest) so
callers can distinguish stores that were paused by this request from stores that
were already configured; update the helper (the code that reads
conf.StoreIDWithRanges and calls conf.cluster.PauseLeaderTransfer) to set
alreadyConfigured=true when the store exists in conf.StoreIDWithRanges and set
pausedByRequest=true only when PauseLeaderTransfer is actually invoked
successfully; then update all rollback paths that currently assume the single
true to only call the resume/unpause logic when pausedByRequest is true (leave
alreadyConfigured alone).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/schedule/schedulers/balance_range.go`:
- Around line 121-135: The parsed timeout can be negative which causes
shouldFinished() to immediately return true; after parsing timeoutStr with
time.ParseDuration in the block surrounding timeoutStr and err, add a validation
that the resulting timeout is strictly positive (e.g., timeout > 0) and if not,
return a 400 via handler.rd.JSON with a clear message (e.g., "timeout must be
positive") and do not assign job.Timeout; keep existing handling of empty string
and parse errors intact and reference timeoutStr, time.ParseDuration,
job.Timeout and shouldFinished() when locating the change.

---

Duplicate comments:
In `@pkg/schedule/schedulers/evict_leader.go`:
- Around line 176-180: The helper that checks/conflicts and calls
conf.cluster.PauseLeaderTransfer conflates "alreadyConfigured" and
"pausedByRequest" into a single boolean; change it to return two distinct
booleans (e.g., alreadyConfigured, pausedByRequest) so callers can distinguish
stores that were paused by this request from stores that were already
configured; update the helper (the code that reads conf.StoreIDWithRanges and
calls conf.cluster.PauseLeaderTransfer) to set alreadyConfigured=true when the
store exists in conf.StoreIDWithRanges and set pausedByRequest=true only when
PauseLeaderTransfer is actually invoked successfully; then update all rollback
paths that currently assume the single true to only call the resume/unpause
logic when pausedByRequest is true (leave alreadyConfigured alone).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0dfccb36-3fe0-44a2-9af9-d9cd001700a9

📥 Commits

Reviewing files that changed from the base of the PR and between 65ccf3a2f6cc8b734e7b9168eb799b2349fa813a and 1d91e5fcf353812fa109020f177c7fdab2e43ab7.

📒 Files selected for processing (3)
  • pkg/schedule/schedulers/balance_range.go
  • pkg/schedule/schedulers/balance_range_test.go
  • pkg/schedule/schedulers/evict_leader.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/schedule/schedulers/balance_range_test.go

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 10, 2026

Codecov Report

❌ Patch coverage is 49.29577% with 36 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.95%. Comparing base (525524f) to head (be9cf36).
⚠️ Report is 43 commits behind head on master.

❌ Your patch status has failed because the patch coverage (49.29%) is below the target coverage (74.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10312      +/-   ##
==========================================
+ Coverage   78.78%   78.95%   +0.17%     
==========================================
  Files         525      532       +7     
  Lines       70824    71846    +1022     
==========================================
+ Hits        55796    56725     +929     
- Misses      11004    11095      +91     
- Partials     4024     4026       +2     
Flag Coverage Δ
unittests 78.95% <49.29%> (+0.17%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@liyishuai liyishuai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tesed on my side. The AI review is worth considering. Otherwise LGTM.

@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot bot commented Mar 10, 2026

@liyishuai: adding LGTM is restricted to approvers and reviewers in OWNERS files.

Details

In response to this:

Tesed on my side. The AI review is worth considering. Otherwise LGTM.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Signed-off-by: tongjian <1045931706@qq.com>
@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 10, 2026
@bufferflies bufferflies changed the title *: fix potential panic by ai *: fix potential bug by ai Mar 10, 2026
@bufferflies bufferflies requested a review from okJiang March 10, 2026 08:26
Copy link
Copy Markdown
Member

@okJiang okJiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please link the corresponding issue and explain your changes in the PR description.

Even when using AI to fix bugs, we should still follow the process of submitting a bug issue -> pull request. The process of submitting a bug issue can also be completed by AI, and we should try to follow atomic commits, with one PR fixing one issue.

@bufferflies
Copy link
Copy Markdown
Contributor Author

/retest

@bufferflies bufferflies requested a review from rleungx March 13, 2026 02:57
for _, item := range val {
s, ok := item.(string)
if !ok {
handler.config.resumeLeaderTransferIfExist(id)
Copy link
Copy Markdown
Member

@rleungx rleungx Mar 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a wrong request, e.g., an invalid range type, will it resume incorrectly?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I think the original code is pretty good; we should ignore the other type.

@bufferflies bufferflies changed the title *: fix potential bug by ai *: fix potential bug by ai in pkg/schedule/schedulers Mar 13, 2026
Signed-off-by: tongjian <1045931706@qq.com>
@bufferflies
Copy link
Copy Markdown
Contributor Author

CollectStringOption

// CollectStringOption is used to collect string using from input map for given option
func CollectStringOption(option string, input map[string]any, collectors ...func(v string)) error {
	if v, ok := input[option].(string); ok {
		for _, c := range collectors {
			c(v)
		}
		return nil
	}
	return errs.ErrOptionNotExist.FastGenByArgs(option)
}

It will return ErrOptionNotExist if the key does not exist or is not a string. Do you suggest disguising them?

Signed-off-by: tongjian <1045931706@qq.com>
@okJiang
Copy link
Copy Markdown
Member

okJiang commented Mar 20, 2026

/retest

@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Mar 20, 2026
@bufferflies
Copy link
Copy Markdown
Contributor Author

/retest

1 similar comment
@bufferflies
Copy link
Copy Markdown
Contributor Author

/retest

@bufferflies bufferflies requested a review from rleungx March 20, 2026 11:28
@bufferflies
Copy link
Copy Markdown
Contributor Author

/ping @rleungx

@bufferflies bufferflies requested a review from lhy1024 March 31, 2026 02:12
@bufferflies
Copy link
Copy Markdown
Contributor Author

/ping @lhy1024 @rleungx

Comment on lines 144 to 152
defaultTimeout := "30m"
if err := apiutil.CollectStringOption("timeout", input, collector); err != nil {
if errors.ErrorEqual(err, errs.ErrOptionNotExist) {
collector(defaultTimeout)
} else {
h.r.JSON(w, http.StatusInternalServerError, err.Error())
h.r.JSON(w, http.StatusBadRequest, err.Error())
return
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CollectStringOption returns ErrOptionNotExist for non-string type and option not exist.

The core issue is that in the create path, {"timeout":123} is still treated as “timeout not provided”, so it silently falls back to the default value instead of returning 400.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

func TestCollectStringOptionTypeMismatchMatchesErrOptionNotExist(t *testing.T) {
        input := map[string]any{"timeout": 123}

        err := CollectStringOption("timeout", input)
        require.Error(t, err)
        require.True(t, perrors.ErrorEqual(err, errs.ErrOptionNotExist))
  }
go test ./pkg/utils/apiutil -run '^TestCollectStringOptionTypeMismatchMatchesErrOptionNotExist$' -count=1
ok    github.com/tikv/pd/pkg/utils/apiutil    0.007s

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It must be string lit, can't be number lit

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we distinguish between non-string type and option not exist for timeout?

}
}

if err := h.AddScheduler(tp, args...); err != nil {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a bad rule, it still returns 500, instead of 400.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, done

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, AddScheduler can also return 500 sometimes.

Signed-off-by: bufferflies <1045931706@qq.com>
@ti-chi-bot ti-chi-bot bot added the lgtm label Apr 2, 2026
@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot bot commented Apr 2, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lhy1024, liyishuai, okJiang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Apr 2, 2026
@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot bot commented Apr 2, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-03-20 08:15:52.437771143 +0000 UTC m=+517679.525428680: ☑️ agreed by okJiang.
  • 2026-04-02 02:42:32.11721576 +0000 UTC m=+405757.322575807: ☑️ agreed by lhy1024.

Signed-off-by: bufferflies <1045931706@qq.com>
@bufferflies
Copy link
Copy Markdown
Contributor Author

/retest

1 similar comment
@bufferflies
Copy link
Copy Markdown
Contributor Author

/retest

@ti-chi-bot ti-chi-bot bot merged commit 9ead470 into tikv:master Apr 2, 2026
57 of 65 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved dco-signoff: yes Indicates the PR's author has signed the dco. lgtm release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

pkg/schedule/schedulers: avoid panics on malformed scheduler config input

5 participants