importsdk, importer, importinto: add import size estimate #67241
ti-chi-bot[bot] merged 6 commits into pingcap:master from
Conversation
📝 Walkthrough

Adds import-size estimation: extends the import SDK and FileScanner with an API to estimate per-table and total import TiKV sizes, threads CSV/charset config into the SDK, and refactors executor sampling to compute source, data-KV, and index-KV sizes per file.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client as SDK Consumer
    participant FS as FileScanner
    participant Loader as Loader/Metadata
    participant Sampler as SampleFileImportKVSize
    participant Result as ImportDataSizeEstimate
    Client->>FS: EstimateImportDataSize(ctx)
    FS->>Loader: Discover databases & tables
    loop per table
        FS->>FS: estimateOneTableSize()
        FS->>FS: build KVSizeSampleConfig & TableInfo
        FS->>Sampler: SampleFileImportKVSize(ctx, cfg, tbl, dataStore, files)
        Sampler->>Sampler: parse files, encode rows, accumulate SourceSize/DataKV/IndexKV
        Sampler-->>FS: SampledKVSizeResult
        FS->>FS: compute table TiKV estimate and append
    end
    FS-->>Client: ImportDataSizeEstimate (tables + totals)
```
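The per-table math behind this flow is a ratio scale: sample a few rows, compute encoded-KV bytes per source byte, then multiply by the table's total source size. A minimal self-contained sketch (the function name and the fallback behavior are illustrative, not the SDK's actual API):

```go
package main

import "fmt"

// estimateTiKVSize is a hypothetical mirror of the estimation math:
// scale the table's total source size by the sampled KV/source ratio,
// falling back to the raw source size when sampling yields nothing usable.
func estimateTiKVSize(totalSourceSize, sampledSourceSize, sampledKVSize int64) int64 {
	if sampledSourceSize <= 0 || sampledKVSize <= 0 {
		return totalSourceSize // fallback: assume KV size ≈ source size
	}
	ratio := float64(sampledKVSize) / float64(sampledSourceSize)
	return int64(float64(totalSourceSize) * ratio)
}

func main() {
	// 1 GiB of source data whose sample encoded to 1.5x its source bytes.
	fmt.Println(estimateTiKVSize(1<<30, 4096, 6144))
}
```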
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
/ok-to-test
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@             Coverage Diff              @@
##             master     #67241        +/-   ##
================================================
+ Coverage   77.7732%   78.6386%   +0.8653%
================================================
  Files          2016       1951        -65
  Lines        552852     551762      -1090
================================================
+ Hits         429971     433898      +3927
+ Misses       121139     117410      -3729
+ Partials       1742        454      -1288
```
Flags with carried forward coverage won't be shown.
🧹 Nitpick comments (3)

pkg/importsdk/file_scanner.go (2)

341-344: Fallback to source size when sampling yields non-positive results.

The fallback at lines 341-344 returns `tblMeta.TotalSize` as the TiKV estimate when sampling produces non-positive values. This is a reasonable safeguard, but it may overestimate (source size ≠ KV size) or underestimate depending on the data format. Consider logging a warning when this fallback is triggered to aid debugging.

📝 Suggested logging improvement

```diff
 if sampledSize.SourceSize <= 0 || sampledSize.TotalKVSize() <= 0 {
+	s.logger.Warn("sampling returned non-positive size, using source size as estimate",
+		zap.String("database", dbMeta.Name),
+		zap.String("table", tblMeta.Name),
+		zap.Int64("sourceSize", sampledSize.SourceSize),
+		zap.Int64("kvSize", sampledSize.TotalKVSize()))
 	return tblMeta.TotalSize, nil
 }
```
47-47: Consider adding a doc comment to the interface method.

The new `EstimateImportDataSize` method on the `FileScanner` interface lacks documentation. A brief comment explaining its purpose would help consumers of this API.

📝 Suggested documentation

```diff
 GetTotalSize(ctx context.Context) int64
+// EstimateImportDataSize samples source data to estimate the final TiKV size
+// after encoding. Returns per-table and aggregated size estimates.
 EstimateImportDataSize(ctx context.Context) (*ImportDataSizeEstimate, error)
 Close() error
```

pkg/executor/importer/sampler.go (1)

117-128: Returning partial results alongside an error may mask sampling failures.

The function accumulates results across files and returns both the accumulated `result` and `firstErr`. If sampling fails for one file, callers receive partial data without clear indication of which tables succeeded. Consider whether this "best-effort" behavior is intentional or if you should fail fast on the first error.
ℹ️ Review info

⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: fa78505a-ef74-42a0-a3ff-91e57d5ccb07

📒 Files selected for processing (8)
- lightning/pkg/importinto/importer.go
- pkg/executor/importer/sampler.go
- pkg/importsdk/BUILD.bazel
- pkg/importsdk/config.go
- pkg/importsdk/file_scanner.go
- pkg/importsdk/file_scanner_test.go
- pkg/importsdk/mock/sdk_mock.go
- pkg/importsdk/model.go
🧹 Nitpick comments (2)

pkg/executor/importer/import.go (1)

1104-1122: Edge case: nil vs empty slice semantics.

The function treats `columnNames == nil` (line 1108) specially by returning columns as-is, but an empty slice `[]string{}` would fail the length check on lines 1105-1107 if `cols` is non-empty. This appears intentional based on caller usage patterns, but consider adding a brief comment explaining the nil vs empty slice distinction.

📝 Suggested documentation

```diff
 func reorderColumnsByNames(cols []*table.Column, columnNames []string) ([]*table.Column, error) {
 	if len(cols) != len(columnNames) {
 		return nil, exeerrors.ErrColumnsNotMatched
 	}
+	// nil columnNames means no reordering needed (preserve original order)
 	if columnNames == nil {
 		return cols, nil
 	}
```
Verify each finding against the current code and only fix it if needed. In `@pkg/executor/importer/import.go` around lines 1104 - 1122, The function reorderColumnsByNames currently treats columnNames == nil as "no reordering" but treats an empty slice as a length-mismatch error (ErrColumnsNotMatched); add a concise comment above reorderColumnsByNames explaining this nil-vs-empty-slice semantic (mention cols, columnNames and ErrColumnsNotMatched) so future readers understand that nil means "leave cols as-is" while an empty slice is validated against cols length and triggers the error.pkg/importsdk/file_scanner.go (1)
296-339: Consider adding a log for fallback estimation.

When `sampledSize.SourceSize <= 0 || sampledSize.TotalKVSize() <= 0` (line 335), the method falls back to using `tblMeta.TotalSize` directly as the TiKV size estimate. This fallback may produce inaccurate estimates (source size ≠ TiKV size). Consider logging a warning to help users understand when estimates may be less accurate.

💡 Suggested improvement

```diff
 if sampledSize.SourceSize <= 0 || sampledSize.TotalKVSize() <= 0 {
+	s.logger.Warn("sampling returned non-positive sizes, falling back to source size as TiKV estimate",
+		zap.String("table", tblMeta.Name),
+		zap.Int64("sourceSize", sampledSize.SourceSize),
+		zap.Int64("totalKVSize", sampledSize.TotalKVSize()))
 	return tblMeta.TotalSize, nil
 }
```
ℹ️ Review info

⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: fe5118e1-feeb-4805-b6cd-13cf6f800bda

📒 Files selected for processing (5)
- pkg/executor/importer/import.go
- pkg/executor/importer/kv_encode.go
- pkg/executor/importer/sampler.go
- pkg/executor/importer/sampler_test.go
- pkg/importsdk/file_scanner.go
🔍 Starting code review for this PR...

🔍 New commits detected — starting re-review...

/retest
```go
		return 0, errors.Trace(err)
	}
	if sampledSize.SourceSize <= 0 || sampledSize.TotalKVSize() <= 0 {
		return tblMeta.TotalSize, nil
```
Should this return 0 when sampling succeeds but there are no data rows?
```go
	err = scanner.CreateSchemaAndTableByName(ctx, "db1", "nonexistent")
	require.Error(t, err)
})
```
Should we add some CSV-based tests here as well?
ingress-bot left a comment
This review was generated by AI and should be verified by a human reviewer.
Manual follow-up is recommended before merge.
Summary
- Total findings: 6
- Inline comments: 6
- Summary-only findings (no inline anchor): 0

Findings (highest risk first)

⚠️ [Major] (1)
- Production estimation path depends on test-only utilities (pkg/importsdk/file_scanner.go:397, pkg/util/mock/fortest.go:22)

🟡 [Minor] (5)
- Dead method `reorderColumns` left after refactoring (pkg/executor/importer/import.go:1172)
- `sampleIndexSizeRatio` discards valid partial results on sampling error (pkg/executor/importer/sampler.go:130)
- Bare boolean literals at `generateCSVConfig` call site obscure parameter meaning (pkg/executor/importer/sampler.go:217)
- Parser (and underlying reader) leaked when HandleSkipNRows or SetPos fails after parser creation (pkg/executor/importer/sampler.go:250)
- FileScanner and SDK interface expansion breaks external implementors (pkg/importsdk/file_scanner.go:46, pkg/importsdk/sdk.go:23)
```go
	}
	for _, dbMeta := range dbMetas {
		for _, tblMeta := range dbMeta.Tables {
			singleReplicaSize, err := s.estimateOneTableSize(ctx, tblMeta)
```
AI-assisted review: EstimateImportDataSize currently fails the whole request on the first bad table, while GetTableMetas already honors skipInvalidFiles and can return partial results. Is that difference intentional?
It seems a bit surprising from the SDK caller's perspective: with skipInvalidFiles=true, metadata discovery can still succeed for the valid tables, but size estimation becomes all-or-nothing. If that behavior is intended, it may be worth documenting explicitly; otherwise, this path may want to follow the same partial-result behavior.
Actionable comments posted: 1

♻️ Duplicate comments (1)

pkg/executor/importer/sampler.go (1)

129-136: ⚠️ Potential issue | 🟠 Major

Keep usable sample data when sampling hits an error.

This path still throws valid sample data away in two places: `sampleIndexSizeRatio` returns 0 on any error, and `sampleOneFile` zeroes out bytes/KVs already collected before a late read/encode/add failure. `CalResourceParams` can then proceed with `indexSizeRatio == 0` even though the sample already contains enough data to produce a usable estimate.

💡 Suggested fix

```diff
 func (e *LoadDataController) sampleIndexSizeRatio(
 	ctx context.Context,
 	ksCodec []byte,
 ) (float64, error) {
 	result, err := e.sampleKVSize(ctx, ksCodec)
-	if err != nil {
-		return 0, err
-	}
-	if result.DataKVSize == 0 {
-		return 0, nil
+	if result == nil || result.DataKVSize == 0 {
+		return 0, err
 	}
-	return float64(result.IndexKVSize) / float64(result.DataKVSize), nil
+	return float64(result.IndexKVSize) / float64(result.DataKVSize), err
 }
```

```diff
 var (
 	count        int
 	readRowCache []types.Datum
 	readFn       = parserEncodeReader(parser, chunk.Chunk.EndOffset, chunk.GetKey())
 	kvBatch      = newEncodedKVGroupBatch(ksCodec, maxRowCount)
 )
+finalize := func(retErr error) (int64, uint64, uint64, error) {
+	dataKVSize, indexKVSize := kvBatch.groupChecksum.DataAndIndexSumSize()
+	return sourceSize, dataKVSize, indexKVSize, retErr
+}
 for count < maxRowCount {
 	row, closed, readErr := readFn(ctx, readRowCache)
 	if readErr != nil {
-		return 0, 0, 0, readErr
+		return finalize(readErr)
 	}
 	if closed {
 		break
 	}
@@
 	kvs, encodeErr := encoder.Encode(row.row, row.rowID)
 	row.resetFn()
 	if encodeErr != nil {
-		return 0, 0, 0, common.ErrEncodeKV.Wrap(encodeErr).GenWithStackByArgs(chunk.GetKey(), row.startPos)
+		return finalize(common.ErrEncodeKV.Wrap(encodeErr).GenWithStackByArgs(chunk.GetKey(), row.startPos))
 	}
 	if _, err = kvBatch.add(kvs); err != nil {
-		return 0, 0, 0, err
+		return finalize(err)
 	}
 	count++
 }
-dataKVSize, indexKVSize = kvBatch.groupChecksum.DataAndIndexSumSize()
-return sourceSize, dataKVSize, indexKVSize, nil
+return finalize(nil)
 }
```

Also applies to: 354-383
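The finalize-closure idea in the suggested fix can be demonstrated on its own: whatever was accumulated before a mid-loop failure is returned with the error instead of being zeroed. All names here are stand-ins for the sampler's internals:

```go
package main

import (
	"errors"
	"fmt"
)

// sampleRows accumulates source bytes row by row; a finalize closure
// returns whatever was collected so far together with the error, so a
// late failure no longer discards usable sample data.
func sampleRows(rows []int64, failAt int) (int64, error) {
	var sourceSize int64
	finalize := func(retErr error) (int64, error) {
		return sourceSize, retErr
	}
	for i, r := range rows {
		if i == failAt { // stand-in for a read/encode/add failure
			return finalize(errors.New("read error"))
		}
		sourceSize += r
	}
	return finalize(nil)
}

func main() {
	got, err := sampleRows([]int64{10, 20, 30}, 2)
	fmt.Println(got, err != nil) // partial total survives the error
}
```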
ℹ️ Review info

⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 18a0f6a1-3464-4e96-b23b-547b35233303

📒 Files selected for processing (8)
- pkg/executor/importer/BUILD.bazel
- pkg/executor/importer/import.go
- pkg/executor/importer/sampler.go
- pkg/executor/importer/sampler_test.go
- pkg/executor/importer/table_import.go
- pkg/importsdk/BUILD.bazel
- pkg/importsdk/file_scanner.go
- pkg/importsdk/file_scanner_test.go
✅ Files skipped from review due to trivial changes (1)
- pkg/importsdk/BUILD.bazel
🚧 Files skipped from review as they are similar to previous changes (3)
- pkg/executor/importer/sampler_test.go
- pkg/importsdk/file_scanner_test.go
- pkg/importsdk/file_scanner.go
Actionable comments posted: 1

🧹 Nitpick comments (1)

pkg/importsdk/file_scanner.go (1)

39-47: Document the new SDK method's contract.

A short comment on `EstimateImportDataSize` would help callers understand that this is an estimate, that `TiKVSize` is single-replica size, and whether `skipInvalidFiles` can yield partial results.

As per coding guidelines, "Keep exported-symbol doc comments, and prefer semantic constraints over name restatement."
ℹ️ Review info

⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: e6a2ea2d-3ab6-4e3c-8ea7-c2b2b5dea9ac

📒 Files selected for processing (4)
- pkg/executor/importer/import.go
- pkg/executor/importer/import_test.go
- pkg/importsdk/file_scanner.go
- pkg/importsdk/file_scanner_test.go
✅ Files skipped from review due to trivial changes (1)
- pkg/importsdk/file_scanner_test.go
```go
func (s *fileScanner) EstimateImportDataSize(ctx context.Context) (*ImportDataSizeEstimate, error) {
	dbMetas := s.loader.GetDatabases()
	if len(dbMetas) == 0 {
		return nil, errors.Annotatef(ErrNoDatabasesFound, "source=%s", s.sourcePath)
	}

	result := &ImportDataSizeEstimate{
		Tables: make([]TableDataSizeEstimate, 0),
	}
	for _, dbMeta := range dbMetas {
		for _, tblMeta := range dbMeta.Tables {
			singleReplicaSize, err := s.estimateOneTableSize(ctx, tblMeta)
			if err != nil {
				if s.config.skipInvalidFiles {
					s.logger.Warn("skipping table during size estimation", zap.String("database", dbMeta.Name), zap.String("table", tblMeta.Name), zap.Error(err))
					continue
				}
				return nil, err
			}
			tableEstimate := TableDataSizeEstimate{
				Database:   dbMeta.Name,
				Table:      tblMeta.Name,
				SourceSize: tblMeta.TotalSize,
				TiKVSize:   singleReplicaSize,
			}
			result.Tables = append(result.Tables, tableEstimate)
			result.TotalSourceSize += tableEstimate.SourceSize
			result.TotalTiKVSize += tableEstimate.TiKVSize
```
Don't mix sampled logical bytes with loader totals.
estimateOneTableSize scales the sampled TotalKVSize/SourceSize ratio by tblMeta.TotalSize, and EstimateImportDataSize exposes that same loader total as SourceSize. When the scanner is created with real-size estimation disabled, those totals can stay at on-disk/compressed bytes, so gzip/zstd inputs get underestimated by roughly the compression ratio. Please force real-size estimation for this API or recompute the multiplier from real sizes in this path before applying the ratio.
Also applies to: 300-345
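The size of the error is easy to see with numbers: if the loader total is compressed bytes while the sampled ratio comes from decompressed rows, the estimate is low by roughly the compression ratio. All figures below are hypothetical:

```go
package main

import "fmt"

// estimateWith scales a total source size by the sampled KV/source ratio.
func estimateWith(totalSourceSize, sampledSourceSize, sampledKVSize int64) int64 {
	return int64(float64(totalSourceSize) * float64(sampledKVSize) / float64(sampledSourceSize))
}

func main() {
	const (
		compressedTotal = int64(100 << 20) // 100 MiB gzip bytes on disk
		realTotal       = int64(400 << 20) // ~4x that after decompression
		sampledSource   = int64(4096)      // decompressed bytes in the sample
		sampledKV       = int64(6144)      // encoded KV bytes for those rows
	)
	wrong := estimateWith(compressedTotal, sampledSource, sampledKV) // mixes compressed total with real ratio
	right := estimateWith(realTotal, sampledSource, sampledKV)       // real sizes on both sides
	fmt.Println(right / wrong)
}
```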
🔍 Starting code review for this PR...
ingress-bot left a comment
This review was generated by AI and should be verified by a human reviewer.
Manual follow-up is recommended before merge.
Summary
- Total findings: 6
- Inline comments: 5
- Summary-only findings (no inline anchor): 1

Findings (highest risk first)

🟡 [Minor] (5)
- Exported `SampleFileImportKVSize` parameter `ksCodec` is a cryptic abbreviation (pkg/executor/importer/sampler.go:115)
- Stale doc comment on `generateCSVConfig` references non-existent `LoadDataWorker` (pkg/executor/importer/import.go:1175)
- Exported `HandleSkipNRows` uses vague "Handle" verb that hides its actual action (pkg/executor/importer/import.go:1745)
- validateKVSizeSampleConfig rejects DataFormatDelimitedData which newLoadDataParser accepts identically to CSV (pkg/executor/importer/sampler.go:197, pkg/executor/importer/import.go:1685)
- EstimateImportDataSize continues iterating tables after context cancellation when skipInvalidFiles is true (pkg/importsdk/file_scanner.go:215)

🧹 [Nit] (1)
- KVSizeSampleConfig omits two generateCSVConfig parameters that are hardcoded in the sampler (pkg/executor/importer/sampler.go:216, pkg/executor/importer/sampler.go:77)

Unanchored findings

🟡 [Minor] (1)
- Exported `SampleFileImportKVSize` parameter `ksCodec` is a cryptic abbreviation
  - Request: Rename the parameter to `keyspaceCodec` in the exported function signature, or add a doc comment line explaining that `ksCodec` is the keyspace codec prefix (nil for the default keyspace).
```diff
@@ -1178,28 +1173,38 @@ func (e *LoadDataController) GetFieldCount() int {
 }

 // GenerateCSVConfig generates a CSV config for parser from LoadDataWorker.
```
🟡 [Minor] Stale doc comment on generateCSVConfig references non-existent LoadDataWorker
Impact
The comment on the new standalone generateCSVConfig function says "generates a CSV config for parser from LoadDataWorker," but this function is a pure helper with explicit parameters and no relationship to any LoadDataWorker type.
Maintainers looking for context will be misled into searching for a type that does not exist at this call site.
Scope
pkg/executor/importer/import.go:1175—generateCSVConfig
Evidence
Line 1175 carries // GenerateCSVConfig generates a CSV config for parser from LoadDataWorker. above the unexported generateCSVConfig function. The identical comment on the exported method at line 1203 has the same stale reference, but the standalone function's copy was introduced by this PR when the logic was extracted.
Change request
Update the comment on generateCSVConfig (line 1175) to describe its actual parameters and purpose, e.g., "generateCSVConfig builds a CSVConfig from the given field/line settings." Consider also fixing the method's comment at line 1203 to reference LoadDataController instead of LoadDataWorker.
```diff
-// GenerateCSVConfig generates a CSV config for parser from LoadDataWorker.
+// generateCSVConfig builds a CSVConfig from the given field/line format settings.
```
```diff
 // HandleSkipNRows skips the first N rows of the data file.
-func (e *LoadDataController) HandleSkipNRows(parser mydump.Parser) error {
+func HandleSkipNRows(parser mydump.Parser, ignoreLines uint64) error {
```
🟡 [Minor] Exported HandleSkipNRows uses vague "Handle" verb that hides its actual action
Impact
The function was promoted from a receiver method to a standalone exported function. As a method, "Handle" inherited context from its receiver; as a package-level export, the name reads as generic dispatch rather than a specific action.
Callers in pkg/executor/load_data.go and table_import.go now import a name that does not signal that rows are being consumed and discarded from the parser.
Scope
pkg/executor/importer/import.go:1745—HandleSkipNRows
Evidence
HandleSkipNRows is called from three sites: sampler.go:259, table_import.go:352, and load_data.go:403. The function body calls parser.ReadRow in a loop, consuming and discarding rows. The verb "Handle" is flagged by checklist item 7 as a generic name to avoid when a specific alternative exists.
Change request
Rename to SkipNRows (or SkipIgnoredRows). The function's sole action is to skip rows; "Handle" adds no information.
```diff
 func validateKVSizeSampleConfig(cfg *KVSizeSampleConfig) error {
 	switch cfg.Format {
 	case DataFormatCSV, DataFormatSQL, DataFormatParquet:
```
🟡 [Minor] validateKVSizeSampleConfig rejects DataFormatDelimitedData which newLoadDataParser accepts identically to CSV
Impact
The new public API SampleFileImportKVSize rejects format "delimited data" during validation even though the underlying parser treats it identically to "csv". A caller constructing a KVSizeSampleConfig with the LOAD DATA default format would hit an unexpected unsupported-format error.
Scope
pkg/executor/importer/sampler.go:197—validateKVSizeSampleConfig
pkg/executor/importer/import.go:1685—newLoadDataParser
Evidence
validateKVSizeSampleConfig switches on DataFormatCSV, DataFormatSQL, DataFormatParquet (line 198) and rejects everything else. newLoadDataParser at line 1685 handles case DataFormatDelimitedData, DataFormatCSV: — the two formats share the same parser branch. Current callers (IMPORT INTO and SDK) never pass DataFormatDelimitedData, but the public API contract doesn't document this restriction.
Change request
Add DataFormatDelimitedData to the accepted formats in validateKVSizeSampleConfig, or document in KVSizeSampleConfig.Format that only the three canonical format strings are accepted.
```diff
 for _, tblMeta := range dbMeta.Tables {
 	singleReplicaSize, err := s.estimateOneTableSize(ctx, tblMeta)
 	if err != nil {
 		if s.config.skipInvalidFiles {
```
🟡 [Minor] EstimateImportDataSize continues iterating tables after context cancellation when skipInvalidFiles is true
Impact
When context is cancelled mid-estimation and skipInvalidFiles=true, the loop processes all remaining tables. Each call fails due to the cancelled context, producing a spurious warning log per table.
For sources with many tables, this delays cancellation propagation and floods the log with misleading "skipping table" warnings that obscure the real cause (context cancelled, not invalid file).
Scope
pkg/importsdk/file_scanner.go:215—EstimateImportDataSize
Evidence
In EstimateImportDataSize, the error from estimateOneTableSize is checked only against skipInvalidFiles. A context cancellation error (ctx.Err() != nil) is indistinguishable from a schema parse error, so the continue path fires for every remaining table.
Change request
Check ctx.Err() before continuing the loop so that context cancellation short-circuits immediately regardless of skipInvalidFiles.
```diff
 }

 func (s *kvSizeSampler) generateCSVConfig() *config.CSVConfig {
 	return generateCSVConfig(s.cfg.FieldNullDef, s.cfg.LineFieldsInfo, true, false)
```
🧹 [Nit] KVSizeSampleConfig omits two generateCSVConfig parameters that are hardcoded in the sampler
Impact
The exported KVSizeSampleConfig documents itself as the complete parser/encoder settings for sampling, but kvSizeSampler.generateCSVConfig() hardcodes inImportInto=true and nullValueOptEnclosed=false, which control AllowEmptyLine, QuotedNullIsText, and UnescapedQuote in the resulting CSV config.
Scope
pkg/executor/importer/sampler.go:216—kvSizeSampler.generateCSVConfig
pkg/executor/importer/sampler.go:77—KVSizeSampleConfig
Evidence
generateCSVConfig (import.go:1176) takes 4 parameters; only 2 (FieldNullDef, LineFieldsInfo) come from KVSizeSampleConfig. The other 2 are hardcoded at sampler.go:217 as true, false.
Change request
Consider adding InImportInto and NullValueOptEnclosed as fields on KVSizeSampleConfig (defaulting to true and false respectively) so the config type fully describes the parsing surface.
Since the first review failed, the bot started a new round after detecting new commits... Maybe ignore it.
[LGTM Timeline notifier] Timeline:
/retest
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Benjamin2037, joechenrh, OliverS929

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
/retest
What problem does this PR solve?
Issue Number: close #67240
Problem Summary:
Premium needs the final import size estimate before an import starts, so it can expand disk in advance.
What changed and how does it work?
- Add `EstimateImportDataSize` to import SDK.

Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.
Summary by CodeRabbit
New Features
Improvements
Tests
Chores