lightning_compress local+gzip checkpoint restore can deterministically fail with ErrChecksumMismatch after restart #67293

@GMHDBJD

Description


Bug Report

1. Minimal reproduce step (Required)

This issue is about a flaky Lightning integration test, not a user-facing regression.
I found it while triaging pull-lightning-integration-test / Jenkins G01 for PR #67241, and the failure looks unrelated to that PR.

To reproduce locally:

  1. Prepare the normal Lightning integration test environment described in lightning/tests/README.md.
  2. On current source (I reproduced at 91838e970ad95cf58777718bf402c624d027b918), apply the following debug-only patch to pin the first crash to `compress`.`multi_rows` and preserve both Lightning stderr streams:
diff --git a/lightning/pkg/importer/chunk_process.go b/lightning/pkg/importer/chunk_process.go
@@
 		failpoint.Inject("SlowDownWriteRows", func() {
 			deliverLogger.Warn("Slowed down write rows")
 			finished := rc.status.FinishedFileSize.Load()
 			total := rc.status.TotalFileSize.Load()
 			deliverLogger.Warn("PrintStatus Failpoint",
 				zap.Int64("finished", finished),
 				zap.Int64("total", total))
 		})
+		failpoint.Inject("FailAfterWriteRowsForTable", func(v failpoint.Value) {
+			if targetTable, ok := v.(string); ok && targetTable == t.tableName {
+				deliverLogger.Warn("FailAfterWriteRowsForTable", zap.String("targetTable", targetTable))
+				panic("failpoint: FailAfterWriteRowsForTable")
+			}
+		})
 		failpoint.Inject("FailAfterWriteRows", nil)
diff --git a/lightning/tests/lightning_compress/run.sh b/lightning/tests/lightning_compress/run.sh
@@
-	# Set minDeliverBytes to a small enough number to only write only 1 row each time
-	# Set the failpoint to kill the lightning instance as soon as one row is written
+	# Set minDeliverBytes to a small enough number to write only 1 row each time.
+	# For local+gzip, pin the crash to compress.multi_rows so the flaky checkpoint restore
+	# path is reproducible instead of depending on whichever table writes first.
 	PKG="github.com/pingcap/tidb/lightning/pkg/importer"
-	export GO_FAILPOINTS="$PKG/SlowDownWriteRows=sleep(1000);$PKG/FailAfterWriteRows=panic;$PKG/SetMinDeliverBytes=return(1)"
+	TARGET_TABLE='`compress`.`multi_rows`'
+	export GO_FAILPOINTS="$PKG/SlowDownWriteRows=sleep(1000);$PKG/SetMinDeliverBytes=return(1)"
+	if [ "$BACKEND" = 'local' ] && [ "$compress" = 'gzip' ]; then
+	  export GO_FAILPOINTS="$GO_FAILPOINTS;$PKG/FailAfterWriteRowsForTable=return(\"$TARGET_TABLE\")"
+	else
+	  export GO_FAILPOINTS="$GO_FAILPOINTS;$PKG/FailAfterWriteRows=panic"
+	fi
+	first_stderr="$TEST_DIR/$TEST_NAME.$BACKEND.$compress.first.stderr"
+	second_stderr="$TEST_DIR/$TEST_NAME.$BACKEND.$compress.second.stderr"
+	rm -f "$first_stderr" "$second_stderr"
@@
-	run_lightning --backend $BACKEND --config "$CUR/$BACKEND-config.toml" -d "$CUR/data.$compress" --enable-checkpoint=1 2> /dev/null
+	run_lightning --backend $BACKEND --config "$CUR/$BACKEND-config.toml" -d "$CUR/data.$compress" --enable-checkpoint=1 2> "$first_stderr"
+	first_status=$?
 	set -e
+	[ "$first_status" -ne 0 ] || exit 1
@@
-	run_lightning --backend $BACKEND --config "$CUR/$BACKEND-config.toml" -d "$CUR/data.$compress" --enable-checkpoint=1 2> /dev/null
+	run_lightning --backend $BACKEND --config "$CUR/$BACKEND-config.toml" -d "$CUR/data.$compress" --enable-checkpoint=1 2> "$second_stderr"
+	second_status=$?
 	set -e
+	if [ "$second_status" -ne 0 ]; then
+	  cat "$second_stderr"
+	  exit "$second_status"
+	fi
  3. Build the Lightning integration-test binaries:
make build_for_lightning_integration_test
  4. Run the single test case:
TEST_NAME=lightning_compress lightning/tests/run.sh --no-tiflash

I ran step 4 twice in a row. Both runs failed in the same place.
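For reference, the conditional GO_FAILPOINTS composition from the patched run.sh can be exercised on its own. A minimal sketch (the `build_failpoints` helper is illustrative, not part of the patch; the failpoint names and terms match the diff above):

```shell
#!/bin/sh
# Mirror the patched run.sh: a table-scoped crash failpoint for local+gzip,
# the original global panic for every other backend/compression pair.
PKG="github.com/pingcap/tidb/lightning/pkg/importer"
TARGET_TABLE='`compress`.`multi_rows`'

build_failpoints() {
  backend="$1"
  compress="$2"
  fp="$PKG/SlowDownWriteRows=sleep(1000);$PKG/SetMinDeliverBytes=return(1)"
  if [ "$backend" = 'local' ] && [ "$compress" = 'gzip' ]; then
    # Pin the first crash to one table so checkpoint restore is deterministic.
    fp="$fp;$PKG/FailAfterWriteRowsForTable=return(\"$TARGET_TABLE\")"
  else
    fp="$fp;$PKG/FailAfterWriteRows=panic"
  fi
  printf '%s\n' "$fp"
}

build_failpoints local gzip
build_failpoints tidb gzip
```

This keeps the branch logic testable outside the harness, so a wrong failpoint string shows up before a full test run.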

2. What did you expect to see? (Required)

lightning/tests/lightning_compress/run.sh should either pass, or at least fail deterministically for the same reason after restart.
The resume path should not depend on which table happens to receive the first write under the global FailAfterWriteRows failpoint.

3. What did you see instead (Required)

After pinning the first crash to `compress`.`multi_rows` in local+gzip, the second restore run fails consistently with:

tidb lightning encountered error: [Lighting:Restore:ErrChecksumMismatch]checksum mismatched remote vs local => (checksum: 17625213566641260363 vs 6131315339012058461) (total_kvs: 1 vs 4) (total_bytes:35 vs 141)

Relevant evidence from /tmp/lightning_test/lightning.log:

[WARN] [chunk_process.go:751] [FailAfterWriteRowsForTable] [table=`compress`.`multi_rows`] [path=compress.multi_rows.000000000.csv.gz:0] [targetTable=`compress`.`multi_rows`]
[ERROR] [import.go:1417] ["restore all tables data failed"] [error="[Lighting:Restore:ErrChecksumMismatch]checksum mismatched remote vs local => (checksum: 17625213566641260363 vs 6131315339012058461) (total_kvs: 1 vs 4) (total_bytes:35 vs 141)"]
[ERROR] [import.go:175] [-] [table=`compress`.`empty_strings`] [status=checksum] [error="[Lighting:Restore:ErrChecksumMismatch]checksum mismatched remote vs local => (checksum: 17625213566641260363 vs 6131315339012058461) (total_kvs: 1 vs 4) (total_bytes:35 vs 141)"]

The second run's stderr is preserved at:

  • /tmp/lightning_test/lightning_compress.local.gzip.second.stderr
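Since the second run's stderr is now kept, the mismatch numbers can be pulled out of the preserved log mechanically. A small sketch (`extract_mismatch` and the temp log are illustrative; the message format matches the excerpt above):

```shell
#!/bin/sh
# Extract the first remote-vs-local total_kvs pair from a Lightning log.
extract_mismatch() {
  grep -o 'total_kvs: [0-9]* vs [0-9]*' "$1" | head -n 1
}

log=$(mktemp)
# Sample line copied from the checksum-mismatch error above.
cat > "$log" <<'EOF'
[ERROR] [import.go:1417] ["restore all tables data failed"] [error="[Lighting:Restore:ErrChecksumMismatch]checksum mismatched remote vs local => (checksum: 17625213566641260363 vs 6131315339012058461) (total_kvs: 1 vs 4) (total_bytes:35 vs 141)"]
EOF
extract_mismatch "$log"   # prints: total_kvs: 1 vs 4
```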

Before this debug patch, the test was flaky because:

  • the global FailAfterWriteRows could hit compress.escapes or another table first
  • both run_lightning invocations redirected stderr to /dev/null
  • the script later failed with a misleading count(*) = 0 assertion on compress.multi_rows
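The stderr-handling change in the patch follows a general pattern worth keeping: write each run's stderr to its own file and replay it on failure. A hedged sketch (`run_and_keep_stderr` is an illustrative helper, not the harness's API):

```shell
#!/bin/sh
# Run a command with stderr captured to a file; on failure, replay the
# captured stderr (which the old `2> /dev/null` redirect used to discard)
# and propagate the command's exit status.
run_and_keep_stderr() {
  log="$1"; shift
  "$@" 2> "$log"
  status=$?
  if [ "$status" -ne 0 ]; then
    cat "$log" >&2
  fi
  return "$status"
}

tmp=$(mktemp)
run_and_keep_stderr "$tmp" sh -c 'echo "simulated lightning crash" >&2; exit 3' \
  || echo "exit status: $?"   # prints: exit status: 3
```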

4. What is your TiDB version? (Required)

Reproduced on current source at 91838e970ad95cf58777718bf402c624d027b918.

bin/tidb-server -V from the local integration-test environment:

Release Version: v8.4.0-this-is-a-placeholder
Edition: Community
Git Commit Hash: None
Git Branch: None
UTC Build Time: None
GoVersion: go1.23.12
Race Enabled: false
Check Table Before Drop: false
Store: unistore
Kernel Type: Classic
