
pkg/importsdk: fix wildcard generation for subdir csv files#67472

Merged
ti-chi-bot[bot] merged 2 commits into pingcap:master from GMHDBJD:fix/importsdk-subdir-csv-pattern
Apr 2, 2026

@GMHDBJD
Collaborator

@GMHDBJD GMHDBJD commented Mar 31, 2026

What problem does this PR solve?

Issue Number: close #67471

Problem Summary:

CSV data files that belong to the same table can be stored under sibling subdirectories such as dir/subdir1/*.csv and dir/subdir2/*.csv. The old generic wildcard fallback built one flat prefix*suffix pattern across the whole path, but filepath.Match does not allow * to match /, so TiDB could not derive a valid unique wildcard for these files.
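The constraint can be checked directly with the standard library. The following is a minimal illustration, not code from this PR; the helper name `matches` is made up for the example, and it assumes a platform where the path separator is `/` (on Windows, filepath.Match treats `\` as the separator instead).

```go
package main

import (
	"fmt"
	"path/filepath"
)

// matches reports whether name matches pattern under filepath.Match rules.
// A malformed pattern is treated as "no match" for this illustration.
func matches(pattern, name string) bool {
	ok, err := filepath.Match(pattern, name)
	return err == nil && ok
}

func main() {
	path := "dir/subdir1/a.csv"

	// Flat prefix*suffix pattern: '*' would have to cross a '/'.
	fmt.Println(matches("dir/subdir*.csv", path)) // false on Unix-like systems

	// Per-segment pattern: every '*' stays inside one path component.
	fmt.Println(matches("dir/subdir*/*.csv", path)) // true
}
```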

What changed and how does it work?

The generic prefix/suffix fallback is now directory-aware when all candidate paths have the same number of /-separated components. It generates the wildcard component by component so every * stays within a single path segment. For the subdirectory CSV case above, TiDB now derives dir/subdir*/*.csv, which can be validated correctly by filepath.Match.
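The component-by-component idea can be sketched roughly as follows. This is an illustrative approximation written for this description, not the code in pkg/importsdk/pattern.go: the names `segmentPattern`, `flatPattern`, `commonPrefix`, and `commonSuffix` are hypothetical, the input is assumed to be non-empty ASCII paths, and the result is only a candidate that still has to be validated by the caller.

```go
package main

import (
	"fmt"
	"strings"
)

// commonPrefix returns the longest common prefix of a non-empty slice.
// It trims byte by byte, so it assumes ASCII input.
func commonPrefix(ss []string) string {
	p := ss[0]
	for _, s := range ss[1:] {
		for !strings.HasPrefix(s, p) {
			p = p[:len(p)-1]
		}
	}
	return p
}

// commonSuffix returns the longest common suffix of a non-empty slice.
func commonSuffix(ss []string) string {
	p := ss[0]
	for _, s := range ss[1:] {
		for !strings.HasSuffix(s, p) {
			p = p[1:]
		}
	}
	return p
}

// flatPattern builds prefix*suffix over whole strings. If every string is
// identical, it is returned unchanged (no wildcard is needed).
func flatPattern(ss []string) string {
	prefix := commonPrefix(ss)
	same := true
	rest := make([]string, len(ss))
	for i, s := range ss {
		same = same && s == prefix
		rest[i] = s[len(prefix):] // keep prefix and suffix from overlapping
	}
	if same {
		return prefix
	}
	return prefix + "*" + commonSuffix(rest)
}

// segmentPattern applies flatPattern to each '/'-separated component when all
// paths have the same depth; otherwise it falls back to the flat pattern.
func segmentPattern(paths []string) string {
	if len(paths) == 0 {
		return ""
	}
	split := make([][]string, len(paths))
	for i, p := range paths {
		split[i] = strings.Split(p, "/")
		if len(split[i]) != len(split[0]) {
			return flatPattern(paths) // mixed depth: flat fallback
		}
	}
	out := make([]string, len(split[0]))
	for col := range out {
		column := make([]string, len(paths))
		for row := range split {
			column[row] = split[row][col]
		}
		out[col] = flatPattern(column)
	}
	return strings.Join(out, "/")
}

func main() {
	fmt.Println(segmentPattern([]string{
		"dir/subdir1/a.csv", "dir/subdir1/b.csv", "dir/subdir2/c.csv",
	})) // dir/subdir*/*.csv
	fmt.Println(segmentPattern([]string{"dir/a.csv", "dir/sub/b.csv"})) // dir/*.csv
}
```

In the mixed-depth call the sketch returns `dir/*.csv`, which cannot match `dir/sub/b.csv`; that is why a generated pattern is treated as a candidate rather than a guaranteed match.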

This PR also adds regression coverage for both wildcard validation and wildcard generation.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Validation commands:

  • pushd pkg/importsdk >/dev/null && go test -run '^(TestValidatePattern|TestGenerateWildcardPath)$' -tags=intest,deadlock && popd >/dev/null
  • make lint

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

`IMPORT INTO` can infer wildcard paths for CSV files stored under sibling subdirectories when the table's files differ by directory segment.

Summary by CodeRabbit

  • Bug Fixes

    • Improved wildcard pattern matching to safely handle file paths within nested directory structures, ensuring patterns work correctly across multiple directory levels.
  • Tests

    • Expanded test coverage for pattern validation and wildcard path generation, including scenarios with CSV files in multiple subdirectories.

@ti-chi-bot ti-chi-bot bot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Mar 31, 2026
@pantheon-ai

pantheon-ai bot commented Mar 31, 2026

@GMHDBJD I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details.

⏳ This process typically takes 10-30 minutes depending on the complexity of the changes.

ℹ️ Learn more details on Pantheon AI.

@ti-chi-bot ti-chi-bot bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 31, 2026
@tiprow

tiprow bot commented Mar 31, 2026

Hi @GMHDBJD. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@coderabbitai

coderabbitai bot commented Mar 31, 2026

📝 Walkthrough

Walkthrough

The change enhances wildcard pattern generation to handle CSV files stored in subdirectories. The implementation splits file paths by "/" and generates per-segment prefix/suffix patterns when all paths have the same number of components, enabling patterns like dir/subdir*/*.csv, while falling back to flat prefix/suffix behavior for paths with differing component counts.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Pattern Generation Logic (pkg/importsdk/pattern.go) | Refactored generatePrefixSuffixPattern into a wrapper that conditionally generates slash-segment-safe wildcard patterns. Added generateFlatPrefixSuffixPattern helper to preserve original flat behavior. New logic splits paths by "/" and generates per-segment patterns only when component counts match and exceed 1, otherwise falls back to flat prefix/suffix. |
| Pattern Tests (pkg/importsdk/pattern_test.go) | Extended TestValidatePattern with nested directory fixtures and positive assertion for dir/subdir*/*.csv pattern. Extended TestGenerateWildcardPath with new dataset (files5/allFiles5) containing CSV paths in multiple subdirectories, validating wildcard generation returns dir/subdir*/*.csv. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

Hopping through folders where CSV files hide,
We split the paths smartly by / as our guide,
Now subdir*/*.csv patterns take flight,
The rabbit's delight—wildcard patterns done right! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main change: fixing wildcard generation for CSV files in subdirectories. |
| Description check | ✅ Passed | The PR description is comprehensive, includes the issue number (close #67471), explains the problem and solution clearly, provides validation commands, and includes a proper release note. |
| Linked Issues check | ✅ Passed | The PR directly addresses issue #67471 by implementing directory-aware wildcard generation for subdirectory CSV files, generating patterns like dir/subdir*/*.csv that filepath.Match can validate. |
| Out of Scope Changes check | ✅ Passed | All changes are focused on the wildcard generation fix and related test coverage; no out-of-scope modifications are present. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

Command failed




@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
pkg/importsdk/pattern.go (1)

185-189: Soften the contract in this comment.

generatePrefixSuffixPattern still returns a candidate that needs isValidPattern validation. In mixed-depth cases it can intentionally fall back to a flat pattern that does not satisfy “all and only” on its own, so the current wording overstates the guarantee.

✏️ Suggested wording
-// generatePrefixSuffixPattern returns a wildcard pattern that matches all and only the given paths.
+// generatePrefixSuffixPattern builds a candidate wildcard pattern for the given paths.
+// The caller must still validate that it matches all and only the intended files.
 // When all paths have the same number of '/'-separated components, it generates the wildcard
 // component by component so every '*' stays within a single path segment, which is required by
 // filepath.Match.

As per coding guidelines, "Comments SHOULD explain non-obvious intent, constraints, invariants, concurrency guarantees, SQL/compatibility contracts, or important performance trade-offs, and SHOULD NOT restate what the code already makes clear."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/importsdk/pattern.go` around lines 185 - 189, Update the doc comment for
generatePrefixSuffixPattern to soften its contract: state that it produces a
candidate wildcard pattern which may require validation via isValidPattern
rather than guaranteeing it matches “all and only” the given paths; explicitly
mention that in mixed-depth cases the function intentionally falls back to a
flat pattern that might not satisfy the stricter guarantee and therefore must be
checked with isValidPattern, and keep the rest of the implementation and
behavior unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 99badddb-1bdf-46db-abe2-cea14d9f9769

📥 Commits

Reviewing files that changed from the base of the PR and between 55d31cd and 74f5731.

📒 Files selected for processing (2)
  • pkg/importsdk/pattern.go
  • pkg/importsdk/pattern_test.go

@codecov

codecov bot commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 94.11765% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.6075%. Comparing base (8412422) to head (1369aef).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #67472        +/-   ##
================================================
- Coverage   77.7173%   77.6075%   -0.1098%     
================================================
  Files          1959       1943        -16     
  Lines        543377     543523       +146     
================================================
- Hits         422298     421815       -483     
- Misses       120238     121706      +1468     
+ Partials        841          2       -839     
| Flag | Coverage Δ |
|---|---|
| integration | 41.0316% <ø> (+4.8568%) ⬆️ |
| unit | 76.7855% <94.1176%> (+0.4424%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Components | Coverage Δ |
|---|---|
| dumpling | 61.5065% <ø> (ø) |
| parser | ∅ <ø> (∅) |
| br | 48.9204% <ø> (-12.0597%) ⬇️ |

@ingress-bot

🔍 Starting code review for this PR...

@ingress-bot

🔍 New commits detected — starting re-review...


@ti-chi-bot ti-chi-bot bot added approved needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Apr 1, 2026
Contributor

@D3Hunter D3Hunter left a comment


Summary

  • Total findings: 1
  • Inline comments: 1
  • Summary-only findings (no inline anchor): 0
Findings (highest risk first)

🟡 [Minor] (1)

  1. generatePrefixSuffixPattern comment overstates guarantee and hides required post-validation (pkg/importsdk/pattern.go:185)

Comment thread pkg/importsdk/pattern.go
return prefix + "*" + suffix
}

// generatePrefixSuffixPattern returns a wildcard pattern that matches all and only the given paths.
Contributor


🟡 [Minor] generatePrefixSuffixPattern comment overstates guarantee and hides required post-validation

Why
The doc says the function returns a wildcard that matches all and only the given paths, but this helper can return an unchecked candidate and relies on caller-side isValidPattern for that exactness guarantee.

Scope
pkg/importsdk/pattern.go:185

Risk if unchanged
A future callsite can treat this helper as self-validating and skip isValidPattern, which risks producing patterns that do not match intended files or that match an unintended set.

Evidence
The mixed-depth fallback returns generateFlatPrefixSuffixPattern(paths) at pkg/importsdk/pattern.go:200-201, and generatePrefixSuffixPattern itself performs no validity check. Exactness is enforced only where callers run isValidPattern, such as pkg/importsdk/pattern.go:56.

Change request
Please add a short why-comment for this contract: describe this function as generating a candidate pattern, and state that callers must run isValidPattern before treating it as an all-and-only match.
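The "all and only" contract that callers are expected to enforce can be sketched as below. This is a hypothetical stand-in written for illustration: the real isValidPattern signature is not shown in this thread, so the function name `matchesAllAndOnly` and its parameters are assumptions, and the example assumes `/` is the path separator.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// matchesAllAndOnly reports whether pattern matches every file in tableFiles
// and no other file in allFiles. It is an assumed approximation of the
// caller-side validation, not the actual isValidPattern.
func matchesAllAndOnly(pattern string, tableFiles, allFiles []string) bool {
	wanted := make(map[string]struct{}, len(tableFiles))
	for _, f := range tableFiles {
		wanted[f] = struct{}{}
		ok, err := filepath.Match(pattern, f)
		if err != nil || !ok {
			return false // the candidate misses one of the table's files
		}
	}
	for _, f := range allFiles {
		if _, isTableFile := wanted[f]; isTableFile {
			continue
		}
		if ok, err := filepath.Match(pattern, f); err != nil || ok {
			return false // the candidate also matches a foreign file
		}
	}
	return true
}

func main() {
	tableFiles := []string{"dir/subdir1/a.csv", "dir/subdir2/b.csv"}
	allFiles := append([]string{"dir/other/c.csv"}, tableFiles...)

	fmt.Println(matchesAllAndOnly("dir/subdir*/*.csv", tableFiles, allFiles)) // true
	fmt.Println(matchesAllAndOnly("dir/*/*.csv", tableFiles, allFiles))       // false: too broad
	fmt.Println(matchesAllAndOnly("dir/subdir*.csv", tableFiles, allFiles))   // false: '*' cannot cross '/'
}
```

A generator that only proposes candidates stays simple, while a single check like this one decides whether a candidate is safe to use.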

Contributor

@joechenrh joechenrh left a comment


Rest LGTM
(So the implicit assumption is that all file paths of the same table need to have the same directory depth, since neither the SDK nor IMPORT INTO can handle the mixed-depth case. 🤔)

@ti-chi-bot

ti-chi-bot bot commented Apr 1, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: D3Hunter, joechenrh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Apr 1, 2026
@ti-chi-bot

ti-chi-bot bot commented Apr 1, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-01 03:27:26.845265628 +0000 UTC m=+322052.050625684: ☑️ agreed by D3Hunter.
  • 2026-04-01 14:02:25.653561786 +0000 UTC m=+360150.858921843: ☑️ agreed by joechenrh.

@GMHDBJD
Collaborator Author

GMHDBJD commented Apr 1, 2026

/retest

@tiprow

tiprow bot commented Apr 1, 2026

@GMHDBJD: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@hawkingrei
Member

/retest

@hawkingrei
Member

/retest


@GMHDBJD
Collaborator Author

GMHDBJD commented Apr 2, 2026

/retest


@GMHDBJD
Collaborator Author

GMHDBJD commented Apr 2, 2026

/retest


@hawkingrei
Member

/retest

@GMHDBJD
Collaborator Author

GMHDBJD commented Apr 2, 2026

/retest


@hawkingrei
Member

/retest


@ti-chi-bot ti-chi-bot bot merged commit 01ddf6c into pingcap:master Apr 2, 2026
35 checks passed

Labels

approved lgtm release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support wildcard generation for CSV files stored in subdirectories

5 participants