importer: sample a portion of compressed files to speed up import spec generation #64769

Merged
ti-chi-bot[bot] merged 15 commits into pingcap:master from joechenrh:sample-partial on Dec 11, 2025

joechenrh (Contributor) commented Nov 29, 2025

What problem does this PR solve?

Issue Number: close #64770

Problem Summary:

What changed and how does it work?

For compressed files, getting the compression ratio of every file can be time-consuming. Because the ratio we get is only a rough estimate anyway, this PR samples only the first 512 files of each compression type (this limit could be made configurable) and uses the harmonic mean of the sampled ratios as the average compression ratio.
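For illustration only, the sampling idea described above can be sketched in Python (the actual change is in TiDB's Go code and may differ in detail). The names `estimate_ratios`, `sample_ratio`, and `MAX_SAMPLES_PER_TYPE` are hypothetical, and the ratio is assumed here to mean uncompressed size divided by compressed size.

```python
from collections import defaultdict

# Sample cap per compression type, taken from the PR description.
MAX_SAMPLES_PER_TYPE = 512


def harmonic_mean(values):
    """Return n / sum(1/v) for a non-empty sequence of positive numbers."""
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("values must be positive")
    return len(values) / sum(1.0 / v for v in values)


def estimate_ratios(files, sample_ratio, max_samples=MAX_SAMPLES_PER_TYPE):
    """Estimate one average compression ratio per compression type.

    files: iterable of (path, compression_type) pairs.
    sample_ratio: hypothetical callback that inspects one file and returns
        its ratio (assumed: uncompressed size / compressed size).
    Only the first `max_samples` files of each type are inspected; the
    remaining files of that type reuse the averaged ratio.
    """
    sampled = defaultdict(list)
    for path, ctype in files:
        if len(sampled[ctype]) < max_samples:
            sampled[ctype].append(sample_ratio(path))
    return {ctype: harmonic_mean(r) for ctype, r in sampled.items()}


if __name__ == "__main__":
    # Made-up input: 1,000 zstd files and one gzip file with fixed ratios.
    files = [(f"part-{i}.csv.zst", "zstd") for i in range(1000)]
    files.append(("extra.csv.gz", "gzip"))
    ratios = estimate_ratios(
        files, lambda p: 2.0 if p.endswith(".zst") else 4.0
    )
    print(ratios)
```

With this shape, the expensive per-file inspection runs at most 512 times per compression type no matter how many files match. A harmonic mean is never larger than the arithmetic mean of the same positive values, so a few highly compressible files raise the estimate less; whether that is the motivation for choosing it is not stated in the PR.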

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Create 10,000 zstd files on ks3 and import them using an 8C instance.

Before:

mysql> import into test.t1 from "s3://global-sort/joechenrh/zstd/*.csv.zst?access-key=xxxxxx&secret-access-key=xxxxxx&endpoint=xxxxxx&force-path-style=false&region=Beijing&provider=ks" with thread=8, detached;
+--------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+----------+-------+---------+------------------+---------------+----------------+----------------------------+------------+----------+------------+------------------+----------+-------------------------+---------------------+-----------------------+----------------+--------------+
| Job_ID | Group_Key | Data_Source                                                                                                                                                                             | Target_Table | Table_ID | Phase | Status  | Source_File_Size | Imported_Rows | Result_Message | Create_Time                | Start_Time | End_Time | Created_By | Last_Update_Time | Cur_Step | Cur_Step_Processed_Size | Cur_Step_Total_Size | Cur_Step_Progress_Pct | Cur_Step_Speed | Cur_Step_ETA |
+--------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+----------+-------+---------+------------------+---------------+----------------+----------------------------+------------+----------+------------+------------------+----------+-------------------------+---------------------+-----------------------+----------------+--------------+
|      1 | NULL      | s3://global-sort/joechenrh/zstd/*.csv.zst?access-key=xxxxxx&endpoint=xxxxxx&force-path-style=false&provider=ks&region=Beijing&secret-access-key=xxxxxx | `test`.`t1`  |      114 |       | pending | 35.98GiB         |          NULL |                | 2025-12-10 05:43:08.049237 | NULL       | NULL     | root@%     | NULL             | NULL     | NULL                    | NULL                | NULL                  | NULL           | NULL         |
+--------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+----------+-------+---------+------------------+---------------+----------------+----------------------------+------------+----------+------------+------------------+----------+-------------------------+---------------------+-----------------------+----------------+--------------+
1 row in set (3 min 9.709 sec)

After:

mysql> import into test.t1 from "s3://global-sort/joechenrh/zstd/*.csv.zst?access-key=xxxxxx&secret-access-key=xxxxxx&endpoint=xxxxxx&force-path-style=false&region=Beijing&provider=ks" with thread=8, detached;
+--------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+----------+-------+---------+------------------+---------------+----------------+----------------------------+------------+----------+------------+------------------+----------+-------------------------+---------------------+-----------------------+----------------+--------------+
| Job_ID | Group_Key | Data_Source                                                                                                                                                                             | Target_Table | Table_ID | Phase | Status  | Source_File_Size | Imported_Rows | Result_Message | Create_Time                | Start_Time | End_Time | Created_By | Last_Update_Time | Cur_Step | Cur_Step_Processed_Size | Cur_Step_Total_Size | Cur_Step_Progress_Pct | Cur_Step_Speed | Cur_Step_ETA |
+--------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+----------+-------+---------+------------------+---------------+----------------+----------------------------+------------+----------+------------+------------------+----------+-------------------------+---------------------+-----------------------+----------------+--------------+
|      1 | NULL      | s3://global-sort/joechenrh/zstd/*.csv.zst?access-key=xxxxxx&endpoint=xxxxxx&force-path-style=false&provider=ks&region=Beijing&secret-access-key=xxxxxx | `test`.`t1`  |      114 |       | pending | 35.98GiB         |          NULL |                | 2025-12-10 05:43:08.049237 | NULL       | NULL     | root@%     | NULL             | NULL     | NULL                    | NULL                | NULL                  | NULL           | NULL         |
+--------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+----------+-------+---------+------------------+---------------+----------------+----------------------------+------------+----------+------------+------------------+----------+-------------------------+---------------------+-----------------------+----------------+--------------+
1 row in set (11.757 sec)

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue do-not-merge/needs-tests-checked release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 29, 2025
@joechenrh joechenrh added skip-issue-check Indicates that a PR no need to check linked issue. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 29, 2025

tiprow bot commented Nov 29, 2025

Hi @joechenrh. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 29, 2025
@joechenrh joechenrh changed the title from "importer: only same part of the files to get compression ratio" to "importer: only sample part of the files to get compression ratio" Nov 29, 2025
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
@joechenrh joechenrh marked this pull request as draft November 29, 2025 06:29
@ti-chi-bot ti-chi-bot bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 29, 2025
joechenrh and others added 2 commits November 29, 2025 02:26
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
Signed-off-by: Ruihao Chen <ruihao.chen@pingcap.cn>
@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 2, 2025
@joechenrh joechenrh marked this pull request as ready for review December 2, 2025 07:08
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 2, 2025
@joechenrh joechenrh changed the title from "importer: only sample part of the files to get compression ratio" to "importer: speed up import spec generation" Dec 2, 2025

codecov bot commented Dec 2, 2025

Codecov Report

❌ Patch coverage is 46.26866% with 36 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.6468%. Comparing base (b5e9bbc) to head (5bb4166).
⚠️ Report is 55 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #64769        +/-   ##
================================================
- Coverage   74.7349%   68.6468%   -6.0881%     
================================================
  Files          1889       1867        -22     
  Lines        515296     515259        -37     
================================================
- Hits         385106     353709     -31397     
- Misses       106380     139119     +32739     
+ Partials      23810      22431      -1379     
Flag Coverage Δ
integration 41.6228% <0.0000%> (-6.5409%) ⬇️
unit 66.0241% <46.2686%> (-6.2787%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 52.8700% <ø> (ø)
parser ∅ <ø> (∅)
br 38.2999% <ø> (-24.8844%) ⬇️

Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
Comment thread pkg/lightning/mydump/region.go Outdated
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
@ti-chi-bot ti-chi-bot bot added approved needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Dec 5, 2025
D3Hunter (Contributor) commented Dec 5, 2025

Maybe add a manual test with a large number of GZ files to see how much this speeds up the precheck part.

joechenrh (Contributor, Author) commented

/hold
Waiting for the manual test.

@ti-chi-bot ti-chi-bot bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 5, 2025
joechenrh (Contributor, Author) commented

/unhold
Manual test results updated.

@ti-chi-bot ti-chi-bot bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 10, 2025
D3Hunter (Contributor) commented

/retest


tiprow bot commented Dec 10, 2025

@D3Hunter: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

GMHDBJD (Collaborator) left a comment

LGTM


ti-chi-bot bot commented Dec 11, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: D3Hunter, GMHDBJD

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Dec 11, 2025

ti-chi-bot bot commented Dec 11, 2025

[LGTM Timeline notifier]

Timeline:

  • 2025-12-05 07:54:15.799114322 +0000 UTC m=+595600.612891894: ☑️ agreed by D3Hunter.
  • 2025-12-11 04:21:51.46442842 +0000 UTC m=+1101256.278205992: ☑️ agreed by GMHDBJD.


tiprow bot commented Dec 11, 2025

@joechenrh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name         Commit   Details  Required  Rerun command
fast_test_tiprow  92153ca  link     true      /test fast_test_tiprow

Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
joechenrh (Contributor, Author) commented

/retest


tiprow bot commented Dec 11, 2025

@joechenrh: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

joechenrh (Contributor, Author) commented

/retest


tiprow bot commented Dec 11, 2025

@joechenrh: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

joechenrh (Contributor, Author) commented

/retest


tiprow bot commented Dec 11, 2025

@joechenrh: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

@ti-chi-bot ti-chi-bot bot merged commit e60489c into pingcap:master Dec 11, 2025
27 checks passed
@joechenrh joechenrh deleted the sample-partial branch December 11, 2025 10:41
D3Hunter (Contributor) commented Apr 9, 2026

/cherry-pick release-nextgen-20251011

ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Apr 9, 2026
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot (Member) commented

@D3Hunter: new pull request created to branch release-nextgen-20251011: #67654.
But this PR has conflicts, please resolve them!

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

Labels

approved lgtm release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

importinto: scanning large amount of compressed files is slow

5 participants