
import: honor Spark legacy Parquet datetime metadata #67908

Open

D3Hunter wants to merge 17 commits into pingcap:master from D3Hunter:fix-parquet-rebase-mode

Conversation

@D3Hunter
Contributor

@D3Hunter D3Hunter commented Apr 20, 2026

What problem does this PR solve?

Issue Number: close #67849

Problem Summary:

IMPORT INTO can read Spark-written Parquet files (Aurora snapshots are written the same way) whose footer marks ancient DATE/TIMESTAMP values as using Spark's legacy hybrid Julian/Gregorian calendar. The previous Parquet read path did not honor that Spark metadata, so ancient values could be imported on the wrong calendar axis, for example importing 0001-01-01 00:00:00 as 0000-12-30 00:00:00.

What changed and how does it work?

This PR teaches the Lightning/MyDump Parquet parser to detect Spark legacy datetime and INT96 footer metadata, including Spark version and timezone keys, and to apply Spark-compatible legacy Julian-to-Gregorian rebasing for DATE, TIMESTAMP_MILLIS, TIMESTAMP_MICROS, and INT96 values before converting them to TiDB datums.

It also adds Spark rebase switch-table data, parser/converter unit coverage for legacy and modern Spark Parquet metadata, and RealTiKV IMPORT INTO regression coverage using Spark legacy Parquet fixtures.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)

Benchmark results for the rebase method used for datetime handling:

BenchmarkRebaseSparkJulianToGregorianMicros/Table-10          136634576   8.453 ns/op   0 B/op   0 allocs/op
BenchmarkRebaseSparkJulianToGregorianMicros/BeforeSwitch-10    51227137   29.22 ns/op   0 B/op   0 allocs/op

We dumped the expected data as Parquet in legacy mode, then imported it with both the new and the old binary. Here are the date/datetime results; the new binary now matches the expected values exactly.

group     rows compared  new vs expect                  old vs expect                  missing rows
date      33             matches exactly, 0 mismatches  does not match, 25 mismatches  none
datetime  117            matches exactly, 0 mismatches  does not match, 92 mismatches  none

Below are the first 5 rows:

Date

rn  tdate       tnewdate            tolddate
1   0001-01-01  0001-01-01 (MATCH)  0000-12-30 (DIFF)
2   0100-02-28  0100-02-28 (MATCH)  0100-02-26 (DIFF)
3   0100-03-01  0100-03-01 (MATCH)  0100-02-28 (DIFF)
4   0200-02-28  0200-02-28 (MATCH)  0200-02-27 (DIFF)
5   0200-03-01  0200-03-01 (MATCH)  0200-03-01 (MATCH)

Datetime

rn  tdatetime                   tnewdatetime                        tolddatetime
1   0001-01-01 00:00:00.000000  0001-01-01 00:00:00.000000 (MATCH)  0000-12-30 00:00:00.000000 (DIFF)
2   0001-01-01 00:00:00.000001  0001-01-01 00:00:00.000001 (MATCH)  0000-12-30 00:00:00.000001 (DIFF)
3   0001-06-15 12:34:56.789123  0001-06-15 12:34:56.789123 (MATCH)  0001-06-13 12:34:56.789123 (DIFF)
4   0001-12-31 23:59:59.999999  0001-12-31 23:59:59.999999 (MATCH)  0001-12-29 23:59:59.999999 (DIFF)
5   0050-01-01 00:00:00.000000  0050-01-01 00:00:00.000000 (MATCH)  0049-12-30 00:00:00.000000 (DIFF)

Validated locally:

make bazel_prepare
git diff --check upstream/master...HEAD
./tools/check/failpoint-go-test.sh pkg/lightning/mydump -run 'TestParquetVariousTypes/(spark_legacy_datetime_rebase|spark_legacy_date_switches|legacy_timestamp_rebase_utc|legacy_timestamp_rebase_non_utc|spark_legacy_timestamp_rebase_uses_spark_zone_tables|spark_legacy_timestamp_default_zone_exists_in_table|spark_legacy_timestamp_rebase_uses_utc_when_zone_table_is_missing|spark_legacy_timestamp_before_table_range_uses_hybrid_calendar_fallback)'

Not run locally:

go test -run 'TestImportParquetWithSparkLegacy(Date|DateTimes)' -tags=intest,deadlock ./tests/realtikvtest/importintotest/...
make lint

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

Fix IMPORT INTO from Spark legacy Parquet files to honor legacy datetime rebasing metadata.

Summary by CodeRabbit

  • New Features

    • Added Spark-legacy Parquet rebasing with timezone-aware handling for accurate DATE/TIMESTAMP imports.
  • Bug Fixes

    • Fixed INT96 and negative pre-epoch timestamp handling and improved rounding/precision for time values.
  • Tests

    • Added extensive tests, embedded Spark legacy Parquet fixtures, and a benchmark covering rebasing and precision scenarios.
  • Chores

    • Updated build/test configurations and widened decoding/runtime option support to improve parsing reliability and test parallelism.

@ti-chi-bot

ti-chi-bot bot commented Apr 20, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Apr 20, 2026
@coderabbitai

coderabbitai bot commented Apr 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds Spark legacy datetime detection and rebasing to Parquet parsing/conversion, introduces a new rebase implementation, updates Parquet writer option handling and mydump build inputs, and adds unit and integration tests plus embedded Spark-legacy Parquet fixtures and minor Bazel test config tweaks.

Changes

  • Build / Test config (br/pkg/metautil/BUILD.bazel, pkg/importsdk/BUILD.bazel, pkg/lightning/mydump/BUILD.bazel, tests/realtikvtest/importintotest/BUILD.bazel): Adjusted test sharding and timeouts, added //pkg/parser/ast to a test dep, included Arrow/parquet deps and generated sources in the mydump build, and embedded Spark legacy Parquet fixtures for tests.
  • Parser: metadata & column setup (pkg/lightning/mydump/parquet_parser.go): Capture file metadata once, ensure a non-nil effective timezone, and populate per-column sparkRebaseMicros lookups from Parquet footer Spark metadata.
  • Spark rebase implementation (pkg/lightning/mydump/spark_rebase.go): New implementation detecting Spark legacy flags/version/timezone from the footer and computing Julian↔Gregorian rebase lookups and hybrid-calendar conversions for microseconds/days.
  • Type conversion & INT96 fixes (pkg/lightning/mydump/parquet_type_converter.go): Switch to Arrow time constructors, apply conditional Spark-style rebasing for DATE/TIMESTAMP/INT96, fix INT96 negative-time handling, and propagate rebase errors.
  • Parquet writer options (pkg/lightning/mydump/parquet_writer.go): WriteParquetFile variadic changed to ...any; runtime-classify options into parquet.WriterProperty and file.WriteOption, reorder init, and error on unsupported option types.
  • Unit tests & benchmark (pkg/lightning/mydump/parquet_parser_test.go): Large test additions: an Int96 helper, many subtests for rounding and Spark legacy rebasing across versions/timezones, plus a benchmark for the rebase lookup.
  • Integration tests & fixtures (tests/realtikvtest/importintotest/parquet_test.go): Embed Spark legacy Parquet fixtures and add two integration tests verifying IMPORT INTO handles Spark legacy DATE and DATETIME imports; adds testkit usage.

Sequence Diagram

```mermaid
sequenceDiagram
    participant Reader as Parquet Reader
    participant Parser as Parser (parquet_parser.go)
    participant Converter as Type Converter (parquet_type_converter.go)
    participant Rebase as Spark Rebase (spark_rebase.go)

    Reader->>Parser: Read footer & file metadata
    Parser->>Parser: Extract org.apache.spark.* keys, version, timezone
    Parser->>Parser: Build per-column sparkRebaseMicros lookup or none
    Reader->>Converter: Provide raw column values + sparkRebaseMicros
    Converter->>Converter: Decode raw value (Arrow constructors)
    alt spark rebase lookup present
        Converter->>Rebase: rebaseSparkJulianToGregorianMicros / rebaseJulianToGregorianDays
        Rebase->>Rebase: Use version cutoff, timezone table, or hybrid conversion
        Rebase-->>Converter: Return rebased micros/days or error
        Converter->>Converter: Convert rebased value to Go time
    else no rebasing
        Converter->>Converter: Convert raw value to Go time
    end
```

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested Labels

component/import, component/lightning, ok-to-test

Suggested Reviewers

  • joechenrh
  • OliverS929
  • GMHDBJD
  • Benjamin2037

Poem

🐰 I hopped through footers, bytes, and time,
I nibbled Julian days until they rhyme,
With timezone maps and tiny rebase hops,
Old Spark timestamps now find proper clocks,
Parquet imports prance, the rabbit nods in chimes.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 29.63%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Title check: ✅ Passed. The title 'import: honor Spark legacy Parquet datetime metadata' directly and clearly describes the main change (teaching the Parquet importer to respect Spark legacy datetime metadata), which aligns with all modified files and the PR's core objective.
  • Linked Issues check: ✅ Passed. The PR comprehensively addresses issue #67849 by implementing Spark legacy datetime metadata detection, Julian-to-Gregorian rebasing for DATE/TIMESTAMP/INT96 values, and adding unit and integration tests, with validation showing the fix correctly imports 0001-01-01 as expected instead of 0000-12-30.
  • Out of Scope Changes check: ✅ Passed. All changes are directly scoped to Spark legacy Parquet datetime rebasing: the new spark_rebase.go implementation, parser/converter updates, test files, BUILD files, and one unrelated test timeout change in temporarytabletest. The timeout change is minor and included in the commit.
  • Description check: ✅ Passed. The PR description is complete with all required sections filled: issue number linked, problem summary explaining the Spark legacy datetime metadata issue, detailed explanation of changes, comprehensive test checklist with unit tests and manual validation results, side effects assessed, documentation impact noted, and a release note provided.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.



@ti-chi-bot ti-chi-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 20, 2026
@tiprow

tiprow bot commented Apr 20, 2026

Hi @D3Hunter. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo, meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@D3Hunter D3Hunter changed the title lightning: honor Spark legacy Parquet datetime metadata import: honor Spark legacy Parquet datetime metadata Apr 20, 2026
@D3Hunter D3Hunter marked this pull request as ready for review April 20, 2026 09:32
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 20, 2026
@pantheon-ai

pantheon-ai bot commented Apr 20, 2026

@D3Hunter I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details.

⏳ This process typically takes 10-30 minutes depending on the complexity of the changes.

ℹ️ Learn more details on Pantheon AI.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/lightning/mydump/parquet_writer.go (1)

166-207: ⚠️ Potential issue | 🟡 Minor

Validate options before opening the object writer.

The unsupported-option branch now returns after s.Create, leaving the object writer unclosed. Classify addOpts and build the schema/properties before creating the writer, or add cleanup for every early return.

Suggested direction
 func WriteParquetFile(path, fileName string, pcolumns []ParquetColumn, rows int, addOpts ...any) error {
-	s, err := getStore(path)
-	if err != nil {
-		return err
-	}
-	writer, err := s.Create(context.Background(), fileName, nil)
-	if err != nil {
-		return err
-	}
-	wrapper := &writeWrapper{Writer: writer}
+	var extraProps []parquet.WriterProperty
+	writerOpts := make([]file.WriteOption, 0, len(addOpts)+1)
+	for _, opt := range addOpts {
+		switch v := opt.(type) {
+		case parquet.WriterProperty:
+			extraProps = append(extraProps, v)
+		case file.WriteOption:
+			writerOpts = append(writerOpts, v)
+		default:
+			return fmt.Errorf("unsupported parquet writer option type %T", opt)
+		}
+	}
 
 	fields := make([]schema.Node, len(pcolumns))
 	opts := make([]parquet.WriterProperty, 0, len(pcolumns)*2)
@@
 	}
 
 	node, _ := schema.NewGroupNode("schema", parquet.Repetitions.Required, fields, -1)
-	var writerOpts []file.WriteOption
-	for _, opt := range addOpts {
-		switch v := opt.(type) {
-		case parquet.WriterProperty:
-			opts = append(opts, v)
-		case file.WriteOption:
-			writerOpts = append(writerOpts, v)
-		default:
-			return fmt.Errorf("unsupported parquet writer option type %T", opt)
-		}
-	}
+	opts = append(opts, extraProps...)
 	props := parquet.NewWriterProperties(opts...)
 	writerOpts = append(writerOpts, file.WithWriterProps(props))
+
+	s, err := getStore(path)
+	if err != nil {
+		return err
+	}
+	writer, err := s.Create(context.Background(), fileName, nil)
+	if err != nil {
+		return err
+	}
+	wrapper := &writeWrapper{Writer: writer}
 	pw := file.NewParquetWriter(wrapper, node, writerOpts...)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/lightning/mydump/parquet_writer.go` around lines 166 - 207, The
WriteParquetFile function currently opens the object writer via s.Create before
validating addOpts, so any early return (e.g., unsupported option in the switch
over addOpts) leaks the writer; move the addOpts classification and
schema/property construction (the loop building fields and opts and the switch
over addOpts) to occur before calling s.Create (getStore and s.Create should be
invoked only after options are validated and fields/opts prepared), or if you
prefer to keep s.Create where it is, ensure every early return closes writer
(wrapper.Close/ writer.Close) and handles errors; update references in this
function (WriteParquetFile, getStore, s.Create, addOpts, fields, opts,
writer/wrapper) accordingly so no path returns with the object writer left open.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/lightning/mydump/parquet_type_converter.go`:
- Around line 405-410: The TIMESTAMP_MILLIS rebasing can overflow when computing
val*1000 before passing to rebaseSparkJulianToGregorianMicros; modify the block
that checks converted.sparkRebaseTimeZoneID to first validate that val is within
safe bounds (e.g. ensure val <= math.MaxInt64/1000 and val >= math.MinInt64/1000
or compare against the known millis cutoff) and return an error if it would
overflow, only then multiply by 1000 and call
rebaseSparkJulianToGregorianMicros(converted.sparkRebaseTimeZoneID, val*1000);
ensure the function handling (and error return) remains unchanged for safe
values.

---

Outside diff comments:
In `@pkg/lightning/mydump/parquet_writer.go`:
- Around line 166-207: The WriteParquetFile function currently opens the object
writer via s.Create before validating addOpts, so any early return (e.g.,
unsupported option in the switch over addOpts) leaks the writer; move the
addOpts classification and schema/property construction (the loop building
fields and opts and the switch over addOpts) to occur before calling s.Create
(getStore and s.Create should be invoked only after options are validated and
fields/opts prepared), or if you prefer to keep s.Create where it is, ensure
every early return closes writer (wrapper.Close/ writer.Close) and handles
errors; update references in this function (WriteParquetFile, getStore,
s.Create, addOpts, fields, opts, writer/wrapper) accordingly so no path returns
with the object writer left open.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 7748fdc9-e2bd-4b73-927e-aa174ae131ad

📥 Commits

Reviewing files that changed from the base of the PR and between b6200ce and 5f08ad1.

⛔ Files ignored due to path filters (2)
  • tests/realtikvtest/importintotest/spark-legacy-date.gz.parquet is excluded by !**/*.parquet
  • tests/realtikvtest/importintotest/spark-legacy-datetime.gz.parquet is excluded by !**/*.parquet
📒 Files selected for processing (10)
  • br/pkg/metautil/BUILD.bazel
  • pkg/importsdk/BUILD.bazel
  • pkg/lightning/mydump/BUILD.bazel
  • pkg/lightning/mydump/parquet_parser.go
  • pkg/lightning/mydump/parquet_parser_test.go
  • pkg/lightning/mydump/parquet_type_converter.go
  • pkg/lightning/mydump/parquet_writer.go
  • pkg/lightning/mydump/spark_rebase_micros_generated.go
  • tests/realtikvtest/importintotest/BUILD.bazel
  • tests/realtikvtest/importintotest/parquet_test.go

@codecov

codecov bot commented Apr 20, 2026

Codecov Report

❌ Patch coverage is 81.56863% with 47 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.4259%. Comparing base (d0712ac) to head (dd2a4b8).
⚠️ Report is 14 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #67908        +/-   ##
================================================
+ Coverage   77.5894%   79.4259%   +1.8364%     
================================================
  Files          1982       1995        +13     
  Lines        548964     551480      +2516     
================================================
+ Hits         425938     438018     +12080     
+ Misses       122221     111991     -10230     
- Partials        805       1471       +666     
Flag Coverage Δ
integration 46.7977% <21.5139%> (+12.4577%) ⬆️
unit 76.6633% <81.5686%> (+0.3364%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 61.5065% <ø> (+0.0901%) ⬆️
parser ∅ <ø> (∅)
br 66.0241% <ø> (+5.5069%) ⬆️

@D3Hunter
Contributor Author

/retest

@tiprow

tiprow bot commented Apr 20, 2026

@D3Hunter: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo, meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ingress-bot

🔍 Starting code review for this PR...


@ingress-bot ingress-bot left a comment


This review was generated by AI and should be verified by a human reviewer.
Manual follow-up is recommended before merge.

Summary

  • Total findings: 11
  • Inline comments: 11
  • Summary-only findings (no inline anchor): 0
Findings (highest risk first)

🚨 [Blocker] (1)

  1. Spark legacy rebase is enabled without mixed-version rollout guard (pkg/lightning/mydump/parquet_parser.go:768, pkg/lightning/mydump/parquet_type_converter.go:405, pkg/dxf/importinto/job.go:62, pkg/dxf/importinto/proto.go:46)

⚠️ [Major] (5)

  1. Unknown Spark timezone is silently coerced to UTC instead of surfacing incompatibility (pkg/lightning/mydump/parquet_parser.go:337, pkg/lightning/mydump/parquet_parser_test.go:595)
  2. Legacy Spark timestamp rebasing falls back to UTC before parser location defaults are applied (pkg/lightning/mydump/parquet_parser.go:768, pkg/lightning/mydump/parquet_parser.go:455, pkg/lightning/mydump/loader.go:630)
  3. No regression test pins the Spark 3.0.x INT96-vs-datetime cutoff split (pkg/lightning/mydump/parquet_parser.go:94, pkg/lightning/mydump/parquet_parser.go:95, pkg/lightning/mydump/parquet_parser_test.go:321)
  4. WriteParquetFile no longer communicates its accepted option contract (pkg/lightning/mydump/parquet_writer.go:166)
  5. WriteParquetFile switched to untyped varargs and dropped compile-time option contracts (pkg/lightning/mydump/parquet_writer.go:166)

🟡 [Minor] (5)

  1. Spark rebase policy is now split across generated and handwritten tables (pkg/lightning/mydump/parquet_type_converter.go:43, pkg/lightning/mydump/spark_rebase_micros_generated.go:15)
  2. Legacy timestamp rebasing repeats timezone index lookup on every value (pkg/lightning/mydump/parquet_type_converter.go:301, pkg/lightning/mydump/parquet_type_converter.go:405, pkg/lightning/mydump/parquet_type_converter.go:483)
  3. INT96 conversion now truncates sub-microsecond precision instead of preserving canonical rounding (pkg/lightning/mydump/parquet_type_converter.go:501, pkg/types/time.go:183)
  4. Exported WriteParquetFile variadic type change breaks typed-slice callers (pkg/lightning/mydump/parquet_writer.go:166)
  5. Unsupported writer-option path returns without closing the created object writer (pkg/lightning/mydump/parquet_writer.go:166, pkg/lightning/mydump/parquet_writer.go:206)


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
pkg/lightning/mydump/parquet_type_converter.go (1)

525-528: ⚠️ Potential issue | 🟡 Minor

Preserve INT96 sub-microsecond rounding behavior.

nanosOfDay/int64(time.Microsecond) truncates before types.FromGoTime can apply TiDB’s usual nearest-microsecond rounding, so INT96 values with non-zero sub-microsecond nanos can import 1µs lower than the previous path. This was already raised in an earlier review thread.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/lightning/mydump/parquet_type_converter.go` around lines 525 - 528, The
int96ToUnixMicros function currently truncates sub-microsecond precision by
doing nanosOfDay/int64(time.Microsecond); instead, compute and return the
timestamp in nanoseconds (totalNanoseconds :=
(julianDay-julianDayOfUnixEpoch)*int64(24*time.Hour) + nanosOfDay) so callers
can convert to time.Time / use types.FromGoTime and let TiDB's
nearest-microsecond rounding happen there; update all call sites of
int96ToUnixMicros to accept/handle nanoseconds (or rename to int96ToUnixNanos)
and only divide by int64(time.Microsecond) at the final conversion step where
types.FromGoTime is used.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/lightning/mydump/parquet_type_converter.go`:
- Around line 525-528: The int96ToUnixMicros function currently truncates
sub-microsecond precision by doing nanosOfDay/int64(time.Microsecond); instead,
compute and return the timestamp in nanoseconds (totalNanoseconds :=
(julianDay-julianDayOfUnixEpoch)*int64(24*time.Hour) + nanosOfDay) so callers
can convert to time.Time / use types.FromGoTime and let TiDB's
nearest-microsecond rounding happen there; update all call sites of
int96ToUnixMicros to accept/handle nanoseconds (or rename to int96ToUnixNanos)
and only divide by int64(time.Microsecond) at the final conversion step where
types.FromGoTime is used.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 428367ed-3f64-4807-9013-1c7aef5b4e77

📥 Commits

Reviewing files that changed from the base of the PR and between 77d044d and 29fb239.

📒 Files selected for processing (4)
  • pkg/lightning/mydump/parquet_parser.go
  • pkg/lightning/mydump/parquet_parser_test.go
  • pkg/lightning/mydump/parquet_type_converter.go
  • pkg/lightning/mydump/parquet_writer.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/lightning/mydump/parquet_parser_test.go


return nil, errors.Trace(err)
}
}
case parquet.Types.Int96:
Contributor


It's a bad choice to store the legacy converted type instead of using the logical type directly. I don't remember why I wrote it this way; I may just have been following the old logic 😢. Perhaps we can refactor it later.

Contributor Author


Because the logical type might be invalid?

logicalType := desc.LogicalType()
if logicalType.IsValid() {

Contributor


Yes, logical type is not mandatory when writing files. But maybe we can convert "converted type" to "logical type".

@ti-chi-bot

ti-chi-bot bot commented Apr 21, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: joechenrh
Once this PR has been reviewed and has the lgtm label, please assign benjamin2037, yujuncen for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Apr 21, 2026
@ti-chi-bot

ti-chi-bot bot commented Apr 21, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-21 07:02:51.780202396 +0000 UTC m=+2062976.985562454: ☑️ agreed by joechenrh.


@D3Hunter
Contributor Author

/cherry-pick release-nextgen-20251011

@ti-chi-bot
Member

@D3Hunter: once the present PR merges, I will cherry-pick it on top of release-nextgen-20251011 in the new PR and assign it to you.

Details

In response to this:

/cherry-pick release-nextgen-20251011

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@D3Hunter
Contributor Author

/cherry-pick release-nextgen-202603

@ti-chi-bot
Member

@D3Hunter: once the present PR merges, I will cherry-pick it on top of release-nextgen-202603 in the new PR and assign it to you.

Details

In response to this:

/cherry-pick release-nextgen-202603

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.


Labels

needs-1-more-lgtm Indicates a PR needs 1 more LGTM. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

IMPORT INTO parquet ignores Spark legacy datetime metadata and imports 0001-01-01 as 0000-12-30

4 participants