
importer: sample a portion of compressed files to speed up import spec generation (#64769)#67654

Merged
ti-chi-bot[bot] merged 2 commits into pingcap:release-nextgen-20251011 from
ti-chi-bot:cherry-pick-64769-to-release-nextgen-20251011
Apr 10, 2026

Conversation


@ti-chi-bot ti-chi-bot commented Apr 9, 2026

This is an automated cherry-pick of #64769

What problem does this PR solve?

Issue Number: close #64770

Problem Summary:

What changed and how does it work?

For compressed files, computing the compression ratio of every file can be time-consuming. Since the ratio we get is a rough value anyway, we sample only the first 512 files (this may be made configurable later) for each compression type and use the harmonic mean as the average compression ratio.
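The sampling-and-averaging step can be sketched as follows (illustrative Go; the function name and the cap constant are assumptions, not the PR's actual identifiers):

```go
package main

import "fmt"

// maxSampledFiles caps how many files per compression type are
// sampled; the PR mentions 512 (possibly configurable later).
const maxSampledFiles = 512

// harmonicMean averages sampled compression ratios. Compared with
// the arithmetic mean, the harmonic mean damps the influence of a
// few files with unusually high ratios, which keeps the estimated
// total real size from being inflated by outliers.
func harmonicMean(ratios []float64) float64 {
	if len(ratios) == 0 {
		return 1.0 // no samples: assume uncompressed
	}
	var sumInv float64
	for _, r := range ratios {
		sumInv += 1.0 / r
	}
	return float64(len(ratios)) / sumInv
}

func main() {
	// Two sampled files: one compresses 2x, one 4x.
	fmt.Println(harmonicMean([]float64{2.0, 4.0})) // ≈ 2.667, not 3.0
}
```

A file's estimated real size is then its compressed size multiplied by this averaged ratio, so the importer no longer has to open and scan every file.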

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Created 10,000 zstd files on ks3 and imported with an 8C instance.

Before:

mysql> import into test.t1 from "s3://global-sort/joechenrh/zstd/*.csv.zst?access-key=xxxxxx&secret-access-key=xxxxxx&endpoint=xxxxxx&force-path-style=false&region=Beijing&provider=ks" with thread=8, detached;
+--------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+----------+-------+---------+------------------+---------------+----------------+----------------------------+------------+----------+------------+------------------+----------+-------------------------+---------------------+-----------------------+----------------+--------------+
| Job_ID | Group_Key | Data_Source                                                                                                                                                                             | Target_Table | Table_ID | Phase | Status  | Source_File_Size | Imported_Rows | Result_Message | Create_Time                | Start_Time | End_Time | Created_By | Last_Update_Time | Cur_Step | Cur_Step_Processed_Size | Cur_Step_Total_Size | Cur_Step_Progress_Pct | Cur_Step_Speed | Cur_Step_ETA |
+--------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+----------+-------+---------+------------------+---------------+----------------+----------------------------+------------+----------+------------+------------------+----------+-------------------------+---------------------+-----------------------+----------------+--------------+
|      1 | NULL      | s3://global-sort/joechenrh/zstd/*.csv.zst?access-key=xxxxxx&endpoint=xxxxxx&force-path-style=false&provider=ks&region=Beijing&secret-access-key=xxxxxx | `test`.`t1`  |      114 |       | pending | 35.98GiB         |          NULL |                | 2025-12-10 05:43:08.049237 | NULL       | NULL     | root@%     | NULL             | NULL     | NULL                    | NULL                | NULL                  | NULL           | NULL         |
+--------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+----------+-------+---------+------------------+---------------+----------------+----------------------------+------------+----------+------------+------------------+----------+-------------------------+---------------------+-----------------------+----------------+--------------+
1 row in set (3 min 9.709 sec)

After:

mysql> import into test.t1 from "s3://global-sort/joechenrh/zstd/*.csv.zst?access-key=xxxxxx&secret-access-key=xxxxxx&endpoint=xxxxxx&force-path-style=false&region=Beijing&provider=ks" with thread=8, detached;
+--------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+----------+-------+---------+------------------+---------------+----------------+----------------------------+------------+----------+------------+------------------+----------+-------------------------+---------------------+-----------------------+----------------+--------------+
| Job_ID | Group_Key | Data_Source                                                                                                                                                                             | Target_Table | Table_ID | Phase | Status  | Source_File_Size | Imported_Rows | Result_Message | Create_Time                | Start_Time | End_Time | Created_By | Last_Update_Time | Cur_Step | Cur_Step_Processed_Size | Cur_Step_Total_Size | Cur_Step_Progress_Pct | Cur_Step_Speed | Cur_Step_ETA |
+--------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+----------+-------+---------+------------------+---------------+----------------+----------------------------+------------+----------+------------+------------------+----------+-------------------------+---------------------+-----------------------+----------------+--------------+
|      1 | NULL      | s3://global-sort/joechenrh/zstd/*.csv.zst?access-key=xxxxxx&endpoint=xxxxxx&force-path-style=false&provider=ks&region=Beijing&secret-access-key=xxxxxx | `test`.`t1`  |      114 |       | pending | 35.98GiB         |          NULL |                | 2025-12-10 05:43:08.049237 | NULL       | NULL     | root@%     | NULL             | NULL     | NULL                    | NULL                | NULL                  | NULL           | NULL         |
+--------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+----------+-------+---------+------------------+---------------+----------------+----------------------------+------------+----------+------------+------------------+----------+-------------------------+---------------------+-----------------------+----------------+--------------+
1 row in set (11.757 sec)

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Summary by CodeRabbit

  • New Features

    • Improved real-size estimation for compressed Parquet files during data import by sampling compression ratios to produce more accurate size calculations.
  • Tests

    • Added test coverage for compressed file handling in data import operations to validate estimation behavior.

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot ti-chi-bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. type/cherry-pick-for-release-nextgen-20251011 labels Apr 9, 2026
@ti-chi-bot

@D3Hunter This PR has conflicts; I have put it on hold.
Please resolve them or ask others to resolve them, then comment /unhold to remove the hold label.


ti-chi-bot bot commented Apr 9, 2026

@ti-chi-bot: If you want to know how to resolve it, please read the guide in TiDB Dev Guide.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.


coderabbitai bot commented Apr 9, 2026

📝 Walkthrough

Walkthrough

Adds sampling-based compression-aware real-size estimation to importer initialization, using per-format sampling (Parquet special-case) and a cached harmonic-mean compression estimator; also increases a test shard count and adds a test for scanning many compressed files.

Changes

Cohort / File(s) Summary
Build config
pkg/executor/importer/BUILD.bazel
Incremented go_test shard_count from 32 to 33.
Importer logic
pkg/executor/importer/import.go
Added estimateCompressionRatio(...), a compressionEstimator with capped sampling and harmonic-mean aggregation, Parquet sampling path, and updated LoadDataController.InitDataFiles to detect format once (via sync.Once) and apply sampled size expansion when computing fileMeta.RealSize.
Tests
pkg/executor/importer/import_test.go
Added TestInitCompressedFiles which creates many *.csv.gz files, enables a failpoint to force sampling behavior, and verifies InitDataFiles with a glob pattern succeeds.

Sequence Diagram

sequenceDiagram
    participant Client
    participant InitDataFiles
    participant Detector
    participant Sampler
    participant Estimator
    participant SizeCalc

    Client->>InitDataFiles: InitDataFiles(globPattern)
    InitDataFiles->>Detector: detectAndUpdateFormat() [sync.Once]
    Detector-->>InitDataFiles: sourceType

    InitDataFiles->>Sampler: sample files (bounded)
    Sampler-->>Estimator: sampled stats
    Estimator->>Estimator: compute harmonic mean ratio
    Estimator-->>InitDataFiles: sizeExpansionRatio

    InitDataFiles->>SizeCalc: estimate(file) * fileSize * sizeExpansionRatio
    SizeCalc-->>InitDataFiles: estimated real sizes
    InitDataFiles-->>Client: completed

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

component/import, size/XXL, ok-to-test

Suggested reviewers

  • D3Hunter

Poem

🐰 I nibble bytes in quiet rows,

I sample where the wind still blows,
A harmonic hop, a ratio true,
Millions of files — I peek at few,
Now import dances light and new 🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title clearly summarizes the main change: sampling compressed files instead of scanning all to speed up import spec generation.
Description check ✅ Passed The PR description includes issue number, problem summary, explanation of changes, completed testing checklist, and manual test results demonstrating performance improvement.
Linked Issues check ✅ Passed The PR implements sampling of compressed files with harmonic mean estimation to avoid opening every file, directly addressing issue #64770's objective to reduce scanning time for large numbers of compressed input files.
Out of Scope Changes check ✅ Passed All changes are directly related to implementing compression ratio sampling for import spec generation. The Bazel shard count adjustment and test additions support the main sampling feature.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@joechenrh
Contributor

/unhold

@ti-chi-bot ti-chi-bot bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 9, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
pkg/executor/importer/import_test.go (1)

325-328: Trim the fixture count to the actual sampling boundary.

Creating 2048 files is heavier than needed, and the magic number will drift if maxSampledCompressedFiles changes. maxSampledCompressedFiles + 1 is enough to cross the new cutoff and keeps the test targeted.

♻️ Proposed fix
-	for i := range 2048 {
+	for i := 0; i < maxSampledCompressedFiles+1; i++ {
 		fileName := filepath.Join(tempDir, fmt.Sprintf("test_%d.csv.gz", i))
 		require.NoError(t, os.WriteFile(fileName, []byte{}, 0o644))
 	}

As per coding guidelines "Keep test changes minimal and deterministic; avoid broad golden/testdata churn unless required."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/executor/importer/import_test.go` around lines 325 - 328, The test
currently creates 2048 files unnecessarily; replace the hardcoded range(2048)
with a minimal deterministic value based on the sampling boundary (use
maxSampledCompressedFiles + 1) so the test only produces one more than the
cutoff and remains correct if maxSampledCompressedFiles changes; update the loop
that builds fileName and writes empty files (the block creating test_%d.csv.gz)
to iterate up to maxSampledCompressedFiles+1 instead of 2048.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: cdafe954-5d67-439b-840b-23f2deb13b79

📥 Commits

Reviewing files that changed from the base of the PR and between a8ad696 and d833dd1.

📒 Files selected for processing (3)
  • pkg/executor/importer/BUILD.bazel
  • pkg/executor/importer/import.go
  • pkg/executor/importer/import_test.go

Comment thread pkg/executor/importer/import_test.go
Comment thread pkg/executor/importer/import.go Outdated
Comment on lines +1227 to +1229
rows, rowSize, err := mydump.SampleStatisticsFromParquet(ctx, filePath, store)
if err != nil {
return 1.0, err


⚠️ Potential issue | 🟠 Major

Keep parquet size estimation best-effort.

estimateCompressionRatio now returns an error when parquet sampling fails, and the once.Do path propagates that out of InitDataFiles. The old path degraded to FileSize on estimation errors, so one unreadable or corrupt sampled parquet file now aborts spec generation instead of just losing the optimization.

🐛 Proposed fix
 rows, rowSize, err := mydump.SampleStatisticsFromParquet(ctx, filePath, store)
 if err != nil {
-	return 1.0, err
+	logutil.Logger(ctx).Warn("fail to sample parquet statistics, fallback to file size",
+		zap.String("path", filePath),
+		zap.Error(err),
+	)
+	return 1.0, nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/executor/importer/import.go` around lines 1227 - 1229, The parquet
sampling error should be handled as a best-effort fallback instead of returning
an error that aborts InitDataFiles: in estimateCompressionRatio, catch errors
from mydump.SampleStatisticsFromParquet (the call that currently returns rows,
rowSize, err) and on failure log or warn and fall back to using FileSize (or a
default compression ratio) to compute and return the compression estimate rather
than propagating the error; ensure the once.Do path in InitDataFiles no longer
receives an error from estimateCompressionRatio so one unreadable/corrupt
parquet file won't abort spec generation.

Comment on lines +1308 to +1310
if _, ok := r.ratio.Load(compressTp); ok {
return compressRatio
}


⚠️ Potential issue | 🟡 Minor

Return the published aggregate after cache initialization.

If another worker stores r.ratio[compressTp] between the fast-path Load and taking mu, Line 1309 returns the just-sampled per-file ratio instead of the cached aggregate. That leaves a small tail of files with mixed estimates after the cache is initialized.

🐛 Proposed fix
-	if _, ok := r.ratio.Load(compressTp); ok {
-		return compressRatio
+	if v, ok := r.ratio.Load(compressTp); ok {
+		return v.(float64)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/executor/importer/import.go` around lines 1308 - 1310, The fast-path
returns the just-sampled per-file compressRatio if another worker stores
r.ratio[compressTp] between the initial r.ratio.Load(compressTp) check and
acquiring the mutex; to fix this, after acquiring the mutex (r.mu) re-check
r.ratio.Load(compressTp) and if an aggregate is now present return that cached
aggregate instead of the per-file compressRatio; otherwise proceed to initialize
and publish the aggregate as before (ensure you use the same
compressTp/compressRatio symbols and release the mutex after).

Comment thread pkg/executor/importer/import.go Outdated

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (3)
pkg/executor/importer/import.go (3)

1231-1233: ⚠️ Potential issue | 🟠 Major

Keep parquet expansion sampling best-effort.

estimateCompressionRatio now returns the SampleParquetRowSize error, and the once.Do path propagates that out of InitDataFiles. One unreadable/corrupt sampled parquet file now aborts spec generation even though this value is only used to estimate RealSize. Falling back to 1.0/file size with a warning would preserve the old behavior.

🐛 Proposed fix
 rows, rowSize, err := mydump.SampleParquetRowSize(ctx, fileMeta, store)
 if err != nil {
-	return 1.0, err
+	logutil.Logger(ctx).Warn("fail to sample parquet statistics, fallback to file size",
+		zap.String("path", filePath),
+		zap.Error(err),
+	)
+	return 1.0, nil
 }

Also applies to: 1457-1464

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/executor/importer/import.go` around lines 1231 - 1233, The parquet
sampling call in estimateCompressionRatio (calling mydump.SampleParquetRowSize)
currently returns its error and causes the once.Do path in InitDataFiles to
abort spec generation; change estimateCompressionRatio to treat
SampleParquetRowSize failures as non-fatal: catch the error, log a warning
including fileMeta/err, and fall back to using compression ratio = 1.0 (or file
size-based RealSize) instead of returning the error so InitDataFiles/once.Do
won't propagate the failure; update both call sites (around the rows,rowSize
assignment and the duplicate block at the other location) to preserve the old
best-effort behavior.

1295-1314: ⚠️ Potential issue | 🟡 Minor

Return the cached aggregate after the locked re-check.

If another worker stores r.ratio[compressTp] between the fast-path Load and taking mu, Lines 1312-1314 still return this file's sampled ratio instead of the published harmonic mean. That leaves a small tail of files with mixed estimates.

🐛 Proposed fix
-	if _, ok := r.ratio.Load(compressTp); ok {
-		return compressRatio
+	if v, ok := r.ratio.Load(compressTp); ok {
+		return v.(float64)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/executor/importer/import.go` around lines 1295 - 1314, The code does a
second r.ratio.Load(compressTp) check under r.mu but still returns the local
compressRatio; change the locked re-check in the function handling file sampling
so that if r.ratio.Load(compressTp) is present after acquiring r.mu it returns
the cached aggregate (the value stored in r.ratio for compressTp) instead of
returning the just-sampled compressRatio; locate symbols r.ratio, compressTp,
r.mu, and compressRatio in the import.go sampling/ratio logic and update the
control flow to return the stored value when present, otherwise continue to
store/use compressRatio.

1404-1413: ⚠️ Potential issue | 🟠 Major

Use the same RealSize calculation for exact-path imports.

The glob branch now applies ce.estimate(...) * sizeExpansionRatio, but the exact-path branch still calls mydump.EstimateRealSizeForFile at Line 1413. IMPORT INTO '/a.parquet' and IMPORT INTO '/a*.parquet' can therefore derive different RealSize and chunk sizing for the same source. Please factor the new logic into a shared helper and call it from both branches.

Also applies to: 1452-1474

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/executor/importer/import.go` around lines 1404 - 1413, The exact-path
branch is still calling mydump.EstimateRealSizeForFile while the glob branch
uses the new ce.estimate(...) * sizeExpansionRatio logic, causing inconsistent
RealSize and chunking; extract the new RealSize computation into a shared helper
(e.g., computeRealSize(ctx, engineContext, fileMeta, sizeExpansionRatio, s))
that encapsulates the ce.estimate(...) * sizeExpansionRatio fallback to
mydump.EstimateRealSizeForFile, then replace the direct call to
mydump.EstimateRealSizeForFile in the exact-path code that sets
fileMeta.RealSize (after
detectAndUpdateFormat/getSourceType/ParseCompressionOnFileExtension) with a call
to this helper, and make the same replacement in the other affected block
(around the 1452-1474 region) so both glob and exact-path use the identical
RealSize logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 2f2788ff-222b-4b12-b814-2199797e5e27

📥 Commits

Reviewing files that changed from the base of the PR and between d833dd1 and 47019ae.

📒 Files selected for processing (2)
  • pkg/executor/importer/import.go
  • pkg/executor/importer/import_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/executor/importer/import_test.go

@joechenrh
Contributor

/retest

Collaborator

@Benjamin2037 Benjamin2037 left a comment


LGTM

@ti-chi-bot ti-chi-bot bot added cherry-pick-approved Cherry pick PR approved by release team. approved needs-1-more-lgtm Indicates a PR needs 1 more LGTM. and removed do-not-merge/cherry-pick-not-approved labels Apr 9, 2026

ti-chi-bot bot commented Apr 9, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Benjamin2037, joechenrh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@joechenrh
Contributor

/retest

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Apr 9, 2026

ti-chi-bot bot commented Apr 9, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-09 12:46:59.503569373 +0000 UTC m=+1046824.708929420: ☑️ agreed by Benjamin2037.
  • 2026-04-09 12:53:40.860382734 +0000 UTC m=+1047226.065742791: ☑️ agreed by joechenrh.

@joechenrh
Contributor

/retest

1 similar comment
@joechenrh
Contributor

/retest

@joechenrh
Contributor

/retest

1 similar comment
@joechenrh
Contributor

/retest


codecov bot commented Apr 10, 2026

Codecov Report

❌ Patch coverage is 60.20408% with 39 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (release-nextgen-20251011@a8ad696). Learn more about missing BASE report.

Additional details and impacted files
@@                      Coverage Diff                      @@
##             release-nextgen-20251011     #67654   +/-   ##
=============================================================
  Coverage                            ?   71.8595%           
=============================================================
  Files                               ?       1833           
  Lines                               ?     493020           
  Branches                            ?          0           
=============================================================
  Hits                                ?     354282           
  Misses                              ?     115390           
  Partials                            ?      23348           
Flag Coverage Δ
unit 71.8595% <60.2040%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 56.3493% <0.0000%> (?)
parser ∅ <0.0000%> (?)
br 46.5631% <0.0000%> (?)

@joechenrh
Contributor

/retest

2 similar comments
@dillon-zheng

/retest

@joechenrh
Contributor

/retest

@ti-chi-bot ti-chi-bot bot merged commit 133f195 into pingcap:release-nextgen-20251011 Apr 10, 2026
18 checks passed
@ti-chi-bot ti-chi-bot bot deleted the cherry-pick-64769-to-release-nextgen-20251011 branch April 10, 2026 04:55

Labels

approved cherry-pick-approved Cherry pick PR approved by release team. lgtm release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. type/cherry-pick-for-release-nextgen-20251011
