
lightning: restore two-level split/scatter while keeping limiter behavior | tidb-test=release-8.1.2#66312

Merged
ti-chi-bot[bot] merged 4 commits into pingcap:feature/release-8.1-gsort-test from OliverS929:feature/release-8.1-gsort-test
Feb 24, 2026

@OliverS929
Contributor

What problem does this PR solve?

Issue Number: ref #66311

Problem Summary:

  • #59609 introduced two-level split/scatter (coarse + fine) to reduce region concentration during large import workloads.
  • #62419 introduced split/ingest rate limiting but removed the coarse split/scatter stage.
  • With large split key sets, this can increase split concentration and skew region distribution.

What changed and how does it work?

  • Restore two-level split/scatter in pkg/lightning/backend/local/localhelper.go:
    • run coarse split/scatter first when len(splitKeys) > coarseGrainedSplitKeysThreshold (64; initially 100, lowered during review)
    • then run fine-grained split/scatter on all split keys
  • Keep #62419 limiter behavior and apply it to both levels.
    • both coarse and fine stages share the same limiter path
    • batch cap by burst is still preserved
  • Add/extend unit tests in pkg/lightning/backend/local/localhelper_test.go:
    • large key set triggers coarse + fine
    • small key set remains fine-only
    • coarse-stage failure returns immediately
    • limiter is still enforced after two-level restoration

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Lightning: restore two-level split/scatter before ingest and keep rate limiting for both coarse and fine stages.

@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 19, 2026
@tiprow

tiprow bot commented Feb 19, 2026

Hi @OliverS929. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@OliverS929 OliverS929 changed the title lightning: restore two-level split/scatter while keeping limiter behavior lightning: restore two-level split/scatter while keeping limiter behavior | tidb-test=release-8.1.2 Feb 19, 2026
@OliverS929
Contributor Author

/test mysql-test

@tiprow

tiprow bot commented Feb 19, 2026

@OliverS929: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.


In response to this:

/test mysql-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@codecov

codecov bot commented Feb 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (feature/release-8.1-gsort-test@93fa53f). Learn more about missing BASE report.

Additional details and impacted files
@@                         Coverage Diff                         @@
##             feature/release-8.1-gsort-test     #66312   +/-   ##
===================================================================
  Coverage                                  ?   71.0029%           
===================================================================
  Files                                     ?       1479           
  Lines                                     ?     427509           
  Branches                                  ?          0           
===================================================================
  Hits                                      ?     303544           
  Misses                                    ?     103338           
  Partials                                  ?      20627           
Flag Coverage Δ
unit 71.0029% <100.0000%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 52.9656% <0.0000%> (?)
parser ∅ <0.0000%> (?)
br 41.5992% <0.0000%> (?)

@D3Hunter
Contributor

@pantheon-ai review

@lance6716
Contributor

limiter = rate.NewLimiter(rate.Limit(eventLimit), burstPerSec*ratePerSecMultiplier)
batchCnt = min(batchCnt, burstPerSec)
}
if len(splitKeys) > 100 {
Contributor


Maybe use a smaller value, such as 64. In nextgen, each region is about 1 GB and our current subtask size is around 100 GB, so there are nearly 100 regions and this check might not trigger.

Also make it a constant.

Contributor Author


I changed it to a named threshold constant and lowered to 64. PTAL.

Comment on lines +329 to +332
type splitAndScatterCounterClient struct {
split.SplitClient
splitCount atomic.Int32
}
Contributor


Maybe generate a mock implementation of split.SplitClient to make adding tests easier; see the example in gen_mock in the Makefile.

Contributor Author


I took the approach of removing the one-off split client test structs and reusing/extending the existing testSplitClient, with callback injection added. PTAL.

@lance6716
Contributor

lance6716 commented Feb 24, 2026

@pantheon-ai[bot] PTAL

-- update


Don't know how to trigger it. The [bot] part of the name is not included 🤔 @wjhuang2016 Can you tell us?

@D3Hunter
Contributor

Remember to cherry-pick this to the master branch, which also doesn't have it now.

@D3Hunter
Contributor

@pantheon-bot review this pr

@pantheon-ai

pantheon-ai bot commented Feb 24, 2026

Hi @D3Hunter,

You're not in the allow list for this repository. Only users in the allow list can trigger Pantheon here. Please contact the repo maintainer if you need access.


@wjhuang2016
Member

@pantheon-bot review this pr

@pantheon-ai

pantheon-ai bot commented Feb 24, 2026

Hi @wjhuang2016,

You're not in the allow list for this repository. Only users in the allow list can trigger Pantheon here. Please contact the repo maintainer if you need access.


@OliverS929
Contributor Author

remember to pick to master branch which also doesn't have it now

Yeah, will do once this PR is merged.

@8rosemary8

@pantheon-bot review this pr

@pantheon-ai

pantheon-ai bot commented Feb 24, 2026

Hi @8rosemary8,

You're not in the allow list for this repository. Only users in the allow list can trigger Pantheon here. Please contact the repo maintainer if you need access.


@D3Hunter
Contributor

@pantheon-bot review

@pantheon-ai

pantheon-ai bot commented Feb 24, 2026

This PR restores Lightning’s two-level split/scatter (coarse + fine) in localhelper.go: coarse split/scatter runs first when len(splitKeys) > 100, followed by fine-grained split/scatter on all keys, to reduce region concentration for large imports. It preserves the split/ingest limiter behavior from #62419 and applies it consistently to both stages (including burst caps). Unit tests were added/extended to cover large vs small key sets, coarse-stage failure handling, and limiter enforcement after the restoration.


Contributor

@D3Hunter D3Hunter left a comment


rest lgtm

"golang.org/x/time/rate"
)

const coarseGrainedSplitKeysThreshold = 64
Contributor


Maybe add a comment explaining why we chose this value.
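A documented version of the constant might look like the sketch below. The rationale text paraphrases the earlier review discussion (regions of about 1 GB, subtasks of about 100 GB); `estimatedSplitKeys` and `needsCoarseSplit` are hypothetical helpers added only to make the arithmetic checkable, and "one split key per region" is an approximation.

```go
package main

// coarseGrainedSplitKeysThreshold is the split-key count above which a coarse
// split/scatter pass runs before the fine-grained pass.
//
// Rationale (paraphrasing the review discussion): in nextgen a region is about
// 1 GB and a subtask is about 100 GB, so a subtask yields roughly 100 regions,
// and about as many split keys. A `> 100` check would then rarely trigger the
// coarse pass; 64 keeps it active for subtasks of that size.
const coarseGrainedSplitKeysThreshold = 64

// estimatedSplitKeys approximates one split key per region (hypothetical).
func estimatedSplitKeys(subtaskBytes, regionBytes int64) int64 {
	if regionBytes <= 0 {
		return 0
	}
	return subtaskBytes / regionBytes
}

// needsCoarseSplit reports whether the coarse stage would run for numKeys keys.
func needsCoarseSplit(numKeys int) bool {
	return numKeys > coarseGrainedSplitKeysThreshold
}
```

With these numbers the estimate is 100 keys: `100 > 100` is false, while `100 > 64` is true.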

@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Feb 24, 2026
@ti-chi-bot ti-chi-bot bot added the approved label Feb 24, 2026
@D3Hunter
Contributor

/cherry-pick master

@ti-chi-bot
Member

@D3Hunter: once the present PR merges, I will cherry-pick it on top of master in the new PR and assign it to you.


In response to this:

/cherry-pick master

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@D3Hunter
Contributor

@lance6716 ptal

return local.splitAndScatterRegionInBatchesWithLimiter(ctx, splitKeys, batchCnt, limiter)
}

func getCoarseGrainedSplitKeys(splitKeys [][]byte) [][]byte {



Corner Case Analysis: getCoarseGrainedSplitKeys

I've analyzed this function for corner cases and found one critical bug and several observations:

🔴 P0 - Critical: Empty Input Panic

Issue: When splitKeys is empty (nil or len==0), the function panics with index out of range:

// When len(splitKeys) == 0:
sqrtCnt := int(math.Sqrt(0)) = 0
// Loop doesn't execute (i < 0 is false)
// But then this executes:
if i-sqrtCnt != len(splitKeys)-1  // 0 - 0 != -1 is true
    splitKeys[len(splitKeys)-1]   // splitKeys[-1] → PANIC!

Test confirmation: I created and ran comprehensive tests that confirmed this panic.

Current protection: The caller only invokes this when len(splitKeys) > 100, so it's not reachable in production currently. However, this is fragile if:

  • The function is called directly elsewhere
  • Future refactoring changes the guard condition
  • Unit tests call it directly

Recommendation: Add defensive guard:

func getCoarseGrainedSplitKeys(splitKeys [][]byte) [][]byte {
    if len(splitKeys) == 0 {
        return nil
    }
    // ... rest of function
}

All Other Cases Work Correctly

Tested extensively with inputs ranging from n=1 to n=122:

  • ✅ No hangs detected
  • ✅ Last key always included exactly once
  • ✅ No duplicate keys introduced
  • ✅ Output maintains sorted order

📝 Minor: Test Comment Needs Update

In the test comment:

// 121 keys => coarse pass(11 keys, 1 batch)

Should be:

// 121 keys => coarse pass(12 keys, 1 batch)

The function returns 12 coarse keys for n=121, not 11 (indices: 0,11,22,...,110,120).

📊 Test Coverage

The existing test covers the "last key not duplicated" case (n=122) well. Consider adding:

  • Empty input test (currently would panic)
  • Case where last key must be appended (e.g., n=100, n=121)

I can provide the test code if needed.
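A version of the helper with the suggested guard could look like this. It is reconstructed from the trace quoted in the review (stride of `int(math.Sqrt(n))`, then a tail check on `i-sqrtCnt`), not copied from the merged source, so details may differ. It also reproduces the index arithmetic above: 121 keys yield 12 coarse keys (indices 0, 11, ..., 110, 120), and 122 keys yield 12 with no duplicated tail.

```go
package main

import "math"

// getCoarseGrainedSplitKeys samples every sqrt(n)-th key and then appends the
// last key if the stride did not already land on it.
func getCoarseGrainedSplitKeys(splitKeys [][]byte) [][]byte {
	// Suggested guard: for n == 0, sqrtCnt would be 0, the loop would not run,
	// and the tail check below would index splitKeys[-1].
	if len(splitKeys) == 0 {
		return nil
	}
	sqrtCnt := int(math.Sqrt(float64(len(splitKeys)))) // >= 1 whenever n >= 1
	var coarse [][]byte
	i := 0
	for ; i < len(splitKeys); i += sqrtCnt {
		coarse = append(coarse, splitKeys[i])
	}
	// i-sqrtCnt is the last index taken by the loop; add the tail only if it differs.
	if i-sqrtCnt != len(splitKeys)-1 {
		coarse = append(coarse, splitKeys[len(splitKeys)-1])
	}
	return coarse
}
```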


Contributor Author


In the current call path, empty splitKeys cannot reach this helper because the caller guards it with len(splitKeys) > coarseGrainedSplitKeysThreshold. I can still add an explicit len == 0 guard here to improve the helper's robustness and future-proofing, if other reviewers consider it a must.

@OliverS929
Contributor Author

/retest

@tiprow

tiprow bot commented Feb 24, 2026

@OliverS929: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.


In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot bot added the lgtm label Feb 24, 2026
@ti-chi-bot

ti-chi-bot bot commented Feb 24, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: D3Hunter, lance6716

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Feb 24, 2026
@ti-chi-bot

ti-chi-bot bot commented Feb 24, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-02-24 05:59:17.197677109 +0000 UTC m=+161829.712471718: ☑️ agreed by D3Hunter.
  • 2026-02-24 12:57:56.610609328 +0000 UTC m=+186949.125403956: ☑️ agreed by lance6716.

@ti-chi-bot ti-chi-bot bot merged commit 4ffecda into pingcap:feature/release-8.1-gsort-test Feb 24, 2026
18 of 19 checks passed
@ti-chi-bot
Member

@D3Hunter: new pull request created to branch master: #66354.
But this PR has conflicts, please resolve them!


In response to this:

/cherry-pick master

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Feb 24, 2026
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@OliverS929 OliverS929 deleted the feature/release-8.1-gsort-test branch February 27, 2026 03:47

Labels

approved lgtm release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.


6 participants