importsdk, importer: fix sampled source size in import estimate #67492
Conversation
Hi @GMHDBJD. Thanks for your PR. PRs from untrusted users cannot be marked as trusted. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/executor/importer/sampler_test.go`:
- Around line 242-290: The test
"sql_source_size_uses_consumed_bytes_not_buffered_progress" currently has only
20 rows so SampleFileImportKVSize reads to EOF; update the fixture so the
sampler only reads a prefix by either (a) increasing the number of inserted rows
to exceed the sampler's rowsPerFile threshold (e.g. >30 short rows) or (b)
temporarily reducing the sampler limit such as maxSampleFileSize before calling
SampleFileImportKVSize, then keep the assertions but change the expectation to
ensure sampled.SourceSize reflects the consumed prefix (not full file) — target
symbols: the test function name string, the created content builder,
NewLoadDataController/Plan usage, and the SampleFileImportKVSize call to force a
partial sample.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: c006d2c9-e94c-4023-b132-79c0413df7d5
📒 Files selected for processing (2):
- pkg/executor/importer/sampler.go
- pkg/executor/importer/sampler_test.go
Codecov Report ❌ Patch coverage is
Additional details and impacted files:
@@ Coverage Diff @@
## master #67492 +/- ##
================================================
- Coverage 77.7173% 77.5790% -0.1383%
================================================
Files 1959 1943 -16
Lines 543377 543573 +196
================================================
- Hits 422298 421699 -599
- Misses 120238 121872 +1634
+ Partials 841 2 -839
/retest
// Sampling needs per-row source bytes, not buffered reader progress.
// SQL/CSV parsers expose byte offsets through Pos(), while parquet Pos()
// is row-count based and must fall back to the row-size estimate.
This will also overestimate the source size for compressed files. Do we need to add a comment for this?
D3Hunter left a comment:
Summary
- Total findings: 1
- Inline comments: 1
- Summary-only findings (no inline anchor): 0
Findings (highest risk first)
🟡 [Minor] (1)
- Parquet and fallback source-size branches lack targeted regression coverage (
pkg/executor/importer/sampler.go:388; pkg/executor/importer/sampler_test.go:238)
	return sourceSize, dataKVSize, indexKVSize, nil
}

func (s *kvSizeSampler) sampledRowSourceSize(parser mydump.Parser, startPos int64, row mydump.Row) int64 {
🟡 [Minor] Parquet and fallback source-size branches lack targeted regression coverage
Why
The patch introduces format-specific source-size logic in sampledRowSourceSize, but the new test only validates the SQL consumed-bytes path and does not exercise the newly added Parquet or non-positive delta fallback branches.
Scope
pkg/executor/importer/sampler.go:388; pkg/executor/importer/sampler_test.go:238
Risk if unchanged
Future parser position behavior changes can silently skew sampled source-size estimation for Parquet or edge parser offsets, which may mis-tune IMPORT resource planning without an obvious failure signal.
Evidence
sampledRowSourceSize adds if s.cfg.Format == DataFormatParquet { return int64(row.Length) } and if rowDelta := endPos - startPos; rowDelta > 0 { ... } fallback logic, while the added test sql_source_size_uses_consumed_bytes_not_buffered_progress covers only SQL input.
Change request
Add a UT for it: a case for DataFormatParquet and a case for the non-parquet endPos <= startPos fallback path, so each new sizing branch is pinned by deterministic assertions.
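For reference, here is a standalone sketch of the branch logic this finding describes. The names and signature are simplified stand-ins (the real method takes `mydump.Parser` and `mydump.Row`); only the branch order mirrors the review's description.

```go
package main

import "fmt"

const dataFormatParquet = "parquet" // stand-in for the importer's DataFormatParquet constant

// sampledRowSourceSize mirrors the sizing branches described in the review:
// parquet uses the row-length estimate (its Pos() counts rows, not bytes),
// text formats use the parser position delta when it is positive, and
// otherwise fall back to the row-length estimate.
func sampledRowSourceSize(format string, startPos, endPos int64, rowLength int) int64 {
	if format == dataFormatParquet {
		return int64(rowLength) // parquet Pos() is row-count based
	}
	if rowDelta := endPos - startPos; rowDelta > 0 {
		return rowDelta // consumed source bytes for SQL/CSV
	}
	return int64(rowLength) // non-positive delta: fall back to row-size estimate
}

func main() {
	fmt.Println(sampledRowSourceSize("csv", 100, 164, 40))         // 64: byte delta wins
	fmt.Println(sampledRowSourceSize(dataFormatParquet, 3, 4, 40)) // 40: row length
	fmt.Println(sampledRowSourceSize("sql", 200, 200, 40))         // 40: fallback branch
}
```

The requested unit tests would pin exactly these three branches with deterministic inputs.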
/hold

/unhold

/hold
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: D3Hunter, joechenrh. The full list of commands accepted by this bot can be found here; the pull request process is described here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
/unhold
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/importsdk/file_scanner_test.go (1)
270-303: Add one more case for >1 `CREATE TABLE` selection logic. This subtest is good, but it only covers the single `CREATE TABLE` path. Please add a sibling case with multiple `CREATE TABLE` statements to directly validate the table/schema matching branch behavior in `buildEstimateCreateTableStmt`. As per coding guidelines: "Prefer extending existing test suites and fixtures over creating new scaffolding."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/importsdk/file_scanner_test.go` around lines 270 - 303, Add a sibling subtest to EstimateImportDataSizeMultiStatementSchema that exercises the multi-CREATE TABLE branch in buildEstimateCreateTableStmt: create files under a new temp dir including a schema file containing multiple CREATE TABLE statements (e.g., CREATE TABLE users... and CREATE TABLE orders...), associated CSVs for each table, then instantiate NewFileScanner with defaultSDKConfig (skipInvalidFiles=true) and call EstimateImportDataSize; assert that estimate.Tables contains entries for both "users" and "orders" with positive SourceSize/TiKVSize and that TotalSourceSize/TotalTiKVSize equal the sum of the tables. Reference the existing subtest, NewFileScanner, EstimateImportDataSize, and buildEstimateCreateTableStmt when adding the case.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/importsdk/file_scanner.go`:
- Around line 444-451: The matching logic in estimateCreateTableStmtMatchesMeta
incorrectly rejects matches when tblMeta.DB is empty but createStmt.Table.Schema
is non-empty; update estimateCreateTableStmtMatchesMeta so that after verifying
the table name (createStmt.Table.Name) it treats an empty tblMeta.DB as a
wildcard and returns true (i.e., if tblMeta.DB == "" return true) before
comparing createStmt.Table.Schema.String() to tblMeta.DB; keep the existing
behavior for empty createStmt.Table.Schema.String().
---
Nitpick comments:
In `@pkg/importsdk/file_scanner_test.go`:
- Around line 270-303: Add a sibling subtest to
EstimateImportDataSizeMultiStatementSchema that exercises the multi-CREATE TABLE
branch in buildEstimateCreateTableStmt: create files under a new temp dir
including a schema file containing multiple CREATE TABLE statements (e.g.,
CREATE TABLE users... and CREATE TABLE orders...), associated CSVs for each
table, then instantiate NewFileScanner with defaultSDKConfig
(skipInvalidFiles=true) and call EstimateImportDataSize; assert that
estimate.Tables contains entries for both "users" and "orders" with positive
SourceSize/TiKVSize and that TotalSourceSize/TotalTiKVSize equal the sum of the
tables. Reference the existing subtest, NewFileScanner, EstimateImportDataSize,
and buildEstimateCreateTableStmt when adding the case.
📒 Files selected for processing (2):
- pkg/importsdk/file_scanner.go
- pkg/importsdk/file_scanner_test.go
func estimateCreateTableStmtMatchesMeta(createStmt *ast.CreateTableStmt, tblMeta *mydump.MDTableMeta) bool {
	if !strings.EqualFold(createStmt.Table.Name.String(), tblMeta.Name) {
		return false
	}
	if createStmt.Table.Schema.String() == "" {
		return true
	}
	return strings.EqualFold(createStmt.Table.Schema.String(), tblMeta.DB)
}
There was a problem hiding this comment.
Handle empty tblMeta.DB when matching schema-qualified CREATE TABLE statements.
If tblMeta.DB is empty, current logic rejects otherwise valid matches (CREATE TABLE db.tbl ...) and can fail estimation when multiple CREATE TABLE statements exist.
Suggested fix
func estimateCreateTableStmtMatchesMeta(createStmt *ast.CreateTableStmt, tblMeta *mydump.MDTableMeta) bool {
if !strings.EqualFold(createStmt.Table.Name.String(), tblMeta.Name) {
return false
}
- if createStmt.Table.Schema.String() == "" {
+ if createStmt.Table.Schema.String() == "" || tblMeta.DB == "" {
return true
}
return strings.EqualFold(createStmt.Table.Schema.String(), tblMeta.DB)
}

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
func estimateCreateTableStmtMatchesMeta(createStmt *ast.CreateTableStmt, tblMeta *mydump.MDTableMeta) bool {
	if !strings.EqualFold(createStmt.Table.Name.String(), tblMeta.Name) {
		return false
	}
	if createStmt.Table.Schema.String() == "" || tblMeta.DB == "" {
		return true
	}
	return strings.EqualFold(createStmt.Table.Schema.String(), tblMeta.DB)
}
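To see the suggested wildcard behavior in isolation, here is a self-contained sketch using plain strings instead of the `ast.CreateTableStmt`/`mydump.MDTableMeta` types (the helper name and signature are hypothetical):

```go
package main

import (
	"fmt"
	"strings"
)

// stmtMatchesMeta sketches the suggested matching rule: the table name must
// match case-insensitively; an empty schema on either side (statement or
// table meta) acts as a wildcard; otherwise schemas compare case-insensitively.
func stmtMatchesMeta(stmtSchema, stmtTable, metaDB, metaName string) bool {
	if !strings.EqualFold(stmtTable, metaName) {
		return false
	}
	if stmtSchema == "" || metaDB == "" {
		return true
	}
	return strings.EqualFold(stmtSchema, metaDB)
}

func main() {
	fmt.Println(stmtMatchesMeta("db1", "Users", "", "users")) // true: empty metaDB is a wildcard
	fmt.Println(stmtMatchesMeta("db1", "users", "DB2", "users")) // false: schemas differ
	fmt.Println(stmtMatchesMeta("", "users", "db2", "users")) // true: empty statement schema
}
```

The first call is exactly the case the review flags: a schema-qualified `CREATE TABLE db.tbl` against a meta with an empty DB, which the pre-fix logic would reject.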
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/importsdk/file_scanner.go` around lines 444 - 451, The matching logic in
estimateCreateTableStmtMatchesMeta incorrectly rejects matches when tblMeta.DB
is empty but createStmt.Table.Schema is non-empty; update
estimateCreateTableStmtMatchesMeta so that after verifying the table name
(createStmt.Table.Name) it treats an empty tblMeta.DB as a wildcard and returns
true (i.e., if tblMeta.DB == "" return true) before comparing
createStmt.Table.Schema.String() to tblMeta.DB; keep the existing behavior for
empty createStmt.Table.Schema.String().
What problem does this PR solve?
Issue Number: ref #67240
Problem Summary:
`EstimateImportDataSize` sampled source bytes from buffered reader progress for SQL/CSV files. That inflated the sampled source-size denominator and could collapse the final `totalTiKVSize` estimate to an unrealistically small value. For `s3://tidbcloud-samples/sp500insight_new/`, the estimate was `totalSourceSize=211686583` but `totalTiKVSize=147600`.
- `Pos()` deltas as sampled source bytes for SQL/CSV files instead of buffered `ScannedPos()` progress.
- `Row.Length` fallback because parquet `Pos()` is row-count based.

Check List
Tests
Manual test details:
- `go test ./pkg/executor/importer -run 'TestSampleIndexSizeRatio' -tags=intest,deadlock`
- `go test ./pkg/importsdk -run 'TestFileScanner/EstimateImportDataSize' -tags=intest,deadlock`
- `make lint`
- `s3://tidbcloud-samples/sp500insight_new/`: `totalTiKVSize=147600` to `totalTiKVSize=348537047`

Side effects
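As a recap of the sizing change being tested above, here is a minimal sketch of summing per-row `Pos()` deltas. The parser type is a hypothetical stand-in (the real code uses `mydump.Parser`), but it shows why the sampled size tracks bytes the parser consumed rather than how far a buffered reader has read ahead.

```go
package main

import "fmt"

// rowParser is a toy parser whose Pos() reports bytes consumed by parsed
// rows, independent of any read-ahead a buffered reader may have done.
type rowParser struct {
	rows []string
	pos  int64
	idx  int
}

// ReadRow consumes the next row and advances the parsed-byte position.
func (p *rowParser) ReadRow() bool {
	if p.idx >= len(p.rows) {
		return false
	}
	p.pos += int64(len(p.rows[p.idx]))
	p.idx++
	return true
}

func (p *rowParser) Pos() int64 { return p.pos }

// sampleSourceSize sums per-row Pos() deltas for up to maxRows rows, so the
// sampled source size reflects only the consumed prefix of the file.
func sampleSourceSize(p *rowParser, maxRows int) int64 {
	var size int64
	for i := 0; i < maxRows; i++ {
		start := p.Pos()
		if !p.ReadRow() {
			break
		}
		size += p.Pos() - start
	}
	return size
}

func main() {
	p := &rowParser{rows: []string{"a,1\n", "bb,2\n", "ccc,3\n"}}
	fmt.Println(sampleSourceSize(p, 2)) // 9: bytes of the first two rows only
}
```

With buffered progress, a single sampled row could be charged for a whole read-ahead block, inflating the source-size denominator exactly as described in the problem summary.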
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.