import: honor Spark legacy Parquet datetime metadata #67908
D3Hunter wants to merge 17 commits into pingcap:master
Conversation
Skipping CI for Draft Pull Request.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings. Use the following commands to manage reviews, or the checkboxes below for quick actions.
📝 Walkthrough
Adds Spark legacy datetime detection and rebasing to Parquet parsing/conversion, introduces a new rebase implementation, updates Parquet writer option handling and mydump build inputs, and adds unit and integration tests plus embedded Spark-legacy Parquet fixtures and minor Bazel test config tweaks.
Sequence Diagram
sequenceDiagram
participant Reader as Parquet Reader
participant Parser as Parser (parquet_parser.go)
participant Converter as Type Converter (parquet_type_converter.go)
participant Rebase as Spark Rebase (spark_rebase.go)
Reader->>Parser: Read footer & file metadata
Parser->>Parser: Extract org.apache.spark.* keys, version, timezone
Parser->>Parser: Build per-column sparkRebaseMicros lookup or none
Reader->>Converter: Provide raw column values + sparkRebaseMicros
Converter->>Converter: Decode raw value (Arrow constructors)
alt spark rebase lookup present
Converter->>Rebase: rebaseSparkJulianToGregorianMicros / rebaseJulianToGregorianDays
Rebase->>Rebase: Use version cutoff, timezone table, or hybrid conversion
Rebase-->>Converter: Return rebased micros/days or error
Converter->>Converter: Convert rebased value to Go time
else no rebasing
Converter->>Converter: Convert raw value to Go time
end
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
⚠️ Warning: the review ran into problems. Git failed to clone the repository.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Hi @D3Hunter. Thanks for your PR. PRs from untrusted users cannot be marked as trusted. I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@D3Hunter I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details. ⏳ This process typically takes 10-30 minutes depending on the complexity of the changes. ℹ️ Learn more details on Pantheon AI.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/lightning/mydump/parquet_writer.go (1)
166-207: ⚠️ Potential issue | 🟡 Minor — Validate options before opening the object writer.
The unsupported-option branch now returns after s.Create, leaving the object writer unclosed. Classify addOpts and build the schema/properties before creating the writer, or add cleanup for every early return.
Suggested direction:

```diff
 func WriteParquetFile(path, fileName string, pcolumns []ParquetColumn, rows int, addOpts ...any) error {
-	s, err := getStore(path)
-	if err != nil {
-		return err
-	}
-	writer, err := s.Create(context.Background(), fileName, nil)
-	if err != nil {
-		return err
-	}
-	wrapper := &writeWrapper{Writer: writer}
+	var extraProps []parquet.WriterProperty
+	writerOpts := make([]file.WriteOption, 0, len(addOpts)+1)
+	for _, opt := range addOpts {
+		switch v := opt.(type) {
+		case parquet.WriterProperty:
+			extraProps = append(extraProps, v)
+		case file.WriteOption:
+			writerOpts = append(writerOpts, v)
+		default:
+			return fmt.Errorf("unsupported parquet writer option type %T", opt)
+		}
+	}
 	fields := make([]schema.Node, len(pcolumns))
 	opts := make([]parquet.WriterProperty, 0, len(pcolumns)*2)
@@
 	}
 	node, _ := schema.NewGroupNode("schema", parquet.Repetitions.Required, fields, -1)
-	var writerOpts []file.WriteOption
-	for _, opt := range addOpts {
-		switch v := opt.(type) {
-		case parquet.WriterProperty:
-			opts = append(opts, v)
-		case file.WriteOption:
-			writerOpts = append(writerOpts, v)
-		default:
-			return fmt.Errorf("unsupported parquet writer option type %T", opt)
-		}
-	}
+	opts = append(opts, extraProps...)
 	props := parquet.NewWriterProperties(opts...)
 	writerOpts = append(writerOpts, file.WithWriterProps(props))
+
+	s, err := getStore(path)
+	if err != nil {
+		return err
+	}
+	writer, err := s.Create(context.Background(), fileName, nil)
+	if err != nil {
+		return err
+	}
+	wrapper := &writeWrapper{Writer: writer}
 	pw := file.NewParquetWriter(wrapper, node, writerOpts...)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/lightning/mydump/parquet_writer.go` around lines 166 - 207, The WriteParquetFile function currently opens the object writer via s.Create before validating addOpts, so any early return (e.g., unsupported option in the switch over addOpts) leaks the writer; move the addOpts classification and schema/property construction (the loop building fields and opts and the switch over addOpts) to occur before calling s.Create (getStore and s.Create should be invoked only after options are validated and fields/opts prepared), or if you prefer to keep s.Create where it is, ensure every early return closes writer (wrapper.Close/ writer.Close) and handles errors; update references in this function (WriteParquetFile, getStore, s.Create, addOpts, fields, opts, writer/wrapper) accordingly so no path returns with the object writer left open.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/lightning/mydump/parquet_type_converter.go`:
- Around line 405-410: The TIMESTAMP_MILLIS rebasing can overflow when computing
val*1000 before passing to rebaseSparkJulianToGregorianMicros; modify the block
that checks converted.sparkRebaseTimeZoneID to first validate that val is within
safe bounds (e.g. ensure val <= math.MaxInt64/1000 and val >= math.MinInt64/1000
or compare against the known millis cutoff) and return an error if it would
overflow, only then multiply by 1000 and call
rebaseSparkJulianToGregorianMicros(converted.sparkRebaseTimeZoneID, val*1000);
ensure the function handling (and error return) remains unchanged for safe
values.
---
Outside diff comments:
In `@pkg/lightning/mydump/parquet_writer.go`:
- Around line 166-207: The WriteParquetFile function currently opens the object
writer via s.Create before validating addOpts, so any early return (e.g.,
unsupported option in the switch over addOpts) leaks the writer; move the
addOpts classification and schema/property construction (the loop building
fields and opts and the switch over addOpts) to occur before calling s.Create
(getStore and s.Create should be invoked only after options are validated and
fields/opts prepared), or if you prefer to keep s.Create where it is, ensure
every early return closes writer (wrapper.Close/ writer.Close) and handles
errors; update references in this function (WriteParquetFile, getStore,
s.Create, addOpts, fields, opts, writer/wrapper) accordingly so no path returns
with the object writer left open.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 7748fdc9-e2bd-4b73-927e-aa174ae131ad
⛔ Files ignored due to path filters (2)
- tests/realtikvtest/importintotest/spark-legacy-date.gz.parquet is excluded by !**/*.parquet
- tests/realtikvtest/importintotest/spark-legacy-datetime.gz.parquet is excluded by !**/*.parquet
📒 Files selected for processing (10)
- br/pkg/metautil/BUILD.bazel
- pkg/importsdk/BUILD.bazel
- pkg/lightning/mydump/BUILD.bazel
- pkg/lightning/mydump/parquet_parser.go
- pkg/lightning/mydump/parquet_parser_test.go
- pkg/lightning/mydump/parquet_type_converter.go
- pkg/lightning/mydump/parquet_writer.go
- pkg/lightning/mydump/spark_rebase_micros_generated.go
- tests/realtikvtest/importintotest/BUILD.bazel
- tests/realtikvtest/importintotest/parquet_test.go
Codecov Report
❌ Patch coverage — additional details and impacted files:
@@ Coverage Diff @@
## master #67908 +/- ##
================================================
+ Coverage 77.5894% 79.4259% +1.8364%
================================================
Files 1982 1995 +13
Lines 548964 551480 +2516
================================================
+ Hits 425938 438018 +12080
+ Misses 122221 111991 -10230
- Partials 805 1471 +666
Flags with carried forward coverage won't be shown. Click here to find out more.
/retest
@D3Hunter: PRs from untrusted users cannot be marked as trusted. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
🔍 Starting code review for this PR...
ingress-bot
left a comment
This review was generated by AI and should be verified by a human reviewer.
Manual follow-up is recommended before merge.
Summary
- Total findings: 11
- Inline comments: 11
- Summary-only findings (no inline anchor): 0
Findings (highest risk first)
🚨 [Blocker] (1)
- Spark legacy rebase is enabled without mixed-version rollout guard (pkg/lightning/mydump/parquet_parser.go:768, pkg/lightning/mydump/parquet_type_converter.go:405, pkg/dxf/importinto/job.go:62, pkg/dxf/importinto/proto.go:46)
⚠️ [Major] (5)
- Unknown Spark timezone is silently coerced to UTC instead of surfacing incompatibility (pkg/lightning/mydump/parquet_parser.go:337, pkg/lightning/mydump/parquet_parser_test.go:595)
- Legacy Spark timestamp rebasing falls back to UTC before parser location defaults are applied (pkg/lightning/mydump/parquet_parser.go:768, pkg/lightning/mydump/parquet_parser.go:455, pkg/lightning/mydump/loader.go:630)
- No regression test pins the Spark 3.0.x INT96-vs-datetime cutoff split (pkg/lightning/mydump/parquet_parser.go:94, pkg/lightning/mydump/parquet_parser.go:95, pkg/lightning/mydump/parquet_parser_test.go:321)
- WriteParquetFile no longer communicates its accepted option contract (pkg/lightning/mydump/parquet_writer.go:166)
- WriteParquetFile switched to untyped varargs and dropped compile-time option contracts (pkg/lightning/mydump/parquet_writer.go:166)
🟡 [Minor] (5)
- Spark rebase policy is now split across generated and handwritten tables (pkg/lightning/mydump/parquet_type_converter.go:43, pkg/lightning/mydump/spark_rebase_micros_generated.go:15)
- Legacy timestamp rebasing repeats timezone index lookup on every value (pkg/lightning/mydump/parquet_type_converter.go:301, pkg/lightning/mydump/parquet_type_converter.go:405, pkg/lightning/mydump/parquet_type_converter.go:483)
- INT96 conversion now truncates sub-microsecond precision instead of preserving canonical rounding (pkg/lightning/mydump/parquet_type_converter.go:501, pkg/types/time.go:183)
- Exported WriteParquetFile variadic type change breaks typed-slice callers (pkg/lightning/mydump/parquet_writer.go:166)
- Unsupported writer-option path returns without closing the created object writer (pkg/lightning/mydump/parquet_writer.go:166, pkg/lightning/mydump/parquet_writer.go:206)
♻️ Duplicate comments (1)
pkg/lightning/mydump/parquet_type_converter.go (1)
525-528: ⚠️ Potential issue | 🟡 Minor — Preserve INT96 sub-microsecond rounding behavior.
nanosOfDay/int64(time.Microsecond) truncates before types.FromGoTime can apply TiDB's usual nearest-microsecond rounding, so INT96 values with non-zero sub-microsecond nanos can import 1µs lower than the previous path. This was already raised in an earlier review thread.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/lightning/mydump/parquet_type_converter.go` around lines 525 - 528, The int96ToUnixMicros function currently truncates sub-microsecond precision by doing nanosOfDay/int64(time.Microsecond); instead, compute and return the timestamp in nanoseconds (totalNanoseconds := (julianDay-julianDayOfUnixEpoch)*int64(24*time.Hour) + nanosOfDay) so callers can convert to time.Time / use types.FromGoTime and let TiDB's nearest-microsecond rounding happen there; update all call sites of int96ToUnixMicros to accept/handle nanoseconds (or rename to int96ToUnixNanos) and only divide by int64(time.Microsecond) at the final conversion step where types.FromGoTime is used.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/lightning/mydump/parquet_type_converter.go`:
- Around line 525-528: The int96ToUnixMicros function currently truncates
sub-microsecond precision by doing nanosOfDay/int64(time.Microsecond); instead,
compute and return the timestamp in nanoseconds (totalNanoseconds :=
(julianDay-julianDayOfUnixEpoch)*int64(24*time.Hour) + nanosOfDay) so callers
can convert to time.Time / use types.FromGoTime and let TiDB's
nearest-microsecond rounding happen there; update all call sites of
int96ToUnixMicros to accept/handle nanoseconds (or rename to int96ToUnixNanos)
and only divide by int64(time.Microsecond) at the final conversion step where
types.FromGoTime is used.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 428367ed-3f64-4807-9013-1c7aef5b4e77
📒 Files selected for processing (4)
- pkg/lightning/mydump/parquet_parser.go
- pkg/lightning/mydump/parquet_parser_test.go
- pkg/lightning/mydump/parquet_type_converter.go
- pkg/lightning/mydump/parquet_writer.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/lightning/mydump/parquet_parser_test.go
		return nil, errors.Trace(err)
	}
}
case parquet.Types.Int96:
It's a bad choice to store legacy converted type instead of using logical type directly. I don't remember why I wrote this, may just following the old logic 😢. Perhaps we can refactor it later.
as logical type might be invalid?
tidb/pkg/lightning/mydump/parquet_parser.go
Lines 637 to 638 in d705928
Yes, logical type is not mandatory when writing files. But maybe we can convert "converted type" to "logical type".
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by: joechenrh. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval in a comment.
[LGTM Timeline notifier] Timeline:
/cherry-pick release-nextgen-20251011
@D3Hunter: once the present PR merges, I will cherry-pick it on top of release-nextgen-20251011 in the new PR and assign it to you.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.
|
/cherry-pick release-nextgen-202603 |
|
@D3Hunter: once the present PR merges, I will cherry-pick it on top of release-nextgen-202603 in the new PR and assign it to you. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository. |
What problem does this PR solve?
Issue Number: close #67849
Problem Summary:
IMPORT INTO can read Spark-written Parquet files (Aurora snapshots are also written this way) whose footer marks ancient DATE/TIMESTAMP values as using Spark's legacy hybrid Julian/Gregorian calendar. The previous Parquet read path did not honor that Spark metadata, so ancient values could be imported on the wrong calendar axis, for example importing 0001-01-01 00:00:00 as 0000-12-30 00:00:00.
This PR teaches the Lightning/MyDump Parquet parser to detect Spark legacy datetime and INT96 footer metadata, including Spark version and timezone keys, and to apply Spark-compatible legacy Julian-to-Gregorian rebasing for DATE, TIMESTAMP_MILLIS, TIMESTAMP_MICROS, and INT96 values before converting them to TiDB datums.
It also adds Spark rebase switch-table data, parser/converter unit coverage for legacy and modern Spark Parquet metadata, and RealTiKV IMPORT INTO regression coverage using Spark legacy Parquet fixtures.
Check List
Tests
benchmark for the rebase method for handling datetime
We dump the expected data as Parquet in legacy mode, then import it with both the new and the old binary. For date/datetime, `new` now equals `expect`:

|          | new vs expect | old vs expect |
|----------|---------------|---------------|
| Date     | 0 mismatches  | 25 mismatches |
| Datetime | 0 mismatches  | 92 mismatches |

Below are the first 5 rows:
Date
Datetime
Validated locally:
- make bazel_prepare
- git diff --check upstream/master...HEAD
- ./tools/check/failpoint-go-test.sh pkg/lightning/mydump -run 'TestParquetVariousTypes/(spark_legacy_datetime_rebase|spark_legacy_date_switches|legacy_timestamp_rebase_utc|legacy_timestamp_rebase_non_utc|spark_legacy_timestamp_rebase_uses_spark_zone_tables|spark_legacy_timestamp_default_zone_exists_in_table|spark_legacy_timestamp_rebase_uses_utc_when_zone_table_is_missing|spark_legacy_timestamp_before_table_range_uses_hybrid_calendar_fallback)'

Not run locally:
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Chores