lightning_compress local+gzip checkpoint restore can deterministically fail with ErrChecksumMismatch after restart #67293

@GMHDBJD

Description


Bug Report

1. Minimal reproduce step (Required)

This issue is about a flaky Lightning integration test, not a user-facing regression.
I found it while triaging pull-lightning-integration-test / Jenkins G01 for PR #67241, and the failure looks unrelated to that PR.

To reproduce locally:

  1. Prepare the normal Lightning integration test environment described in lightning/tests/README.md.
  2. On current source (I reproduced at 91838e970ad95cf58777718bf402c624d027b918), apply the following debug-only patch to pin the first crash to `compress`.`multi_rows` and preserve both Lightning stderr streams:
diff --git a/lightning/pkg/importer/chunk_process.go b/lightning/pkg/importer/chunk_process.go
@@
 		failpoint.Inject("SlowDownWriteRows", func() {
 			deliverLogger.Warn("Slowed down write rows")
 			finished := rc.status.FinishedFileSize.Load()
 			total := rc.status.TotalFileSize.Load()
 			deliverLogger.Warn("PrintStatus Failpoint",
 				zap.Int64("finished", finished),
 				zap.Int64("total", total))
 		})
+		failpoint.Inject("FailAfterWriteRowsForTable", func(v failpoint.Value) {
+			if targetTable, ok := v.(string); ok && targetTable == t.tableName {
+				deliverLogger.Warn("FailAfterWriteRowsForTable", zap.String("targetTable", targetTable))
+				panic("failpoint: FailAfterWriteRowsForTable")
+			}
+		})
 		failpoint.Inject("FailAfterWriteRows", nil)
diff --git a/lightning/tests/lightning_compress/run.sh b/lightning/tests/lightning_compress/run.sh
@@
-	# Set minDeliverBytes to a small enough number to only write only 1 row each time
-	# Set the failpoint to kill the lightning instance as soon as one row is written
+	# Set minDeliverBytes to a small enough number to write only 1 row each time.
+	# For local+gzip, pin the crash to compress.multi_rows so the flaky checkpoint restore
+	# path is reproducible instead of depending on whichever table writes first.
 	PKG="github.com/pingcap/tidb/lightning/pkg/importer"
-	export GO_FAILPOINTS="$PKG/SlowDownWriteRows=sleep(1000);$PKG/FailAfterWriteRows=panic;$PKG/SetMinDeliverBytes=return(1)"
+	TARGET_TABLE='`compress`.`multi_rows`'
+	export GO_FAILPOINTS="$PKG/SlowDownWriteRows=sleep(1000);$PKG/SetMinDeliverBytes=return(1)"
+	if [ "$BACKEND" = 'local' ] && [ "$compress" = 'gzip' ]; then
+	  export GO_FAILPOINTS="$GO_FAILPOINTS;$PKG/FailAfterWriteRowsForTable=return(\"$TARGET_TABLE\")"
+	else
+	  export GO_FAILPOINTS="$GO_FAILPOINTS;$PKG/FailAfterWriteRows=panic"
+	fi
+	first_stderr="$TEST_DIR/$TEST_NAME.$BACKEND.$compress.first.stderr"
+	second_stderr="$TEST_DIR/$TEST_NAME.$BACKEND.$compress.second.stderr"
+	rm -f "$first_stderr" "$second_stderr"
@@
-	run_lightning --backend $BACKEND --config "$CUR/$BACKEND-config.toml" -d "$CUR/data.$compress" --enable-checkpoint=1 2> /dev/null
+	run_lightning --backend $BACKEND --config "$CUR/$BACKEND-config.toml" -d "$CUR/data.$compress" --enable-checkpoint=1 2> "$first_stderr"
+	first_status=$?
 	set -e
+	[ "$first_status" -ne 0 ] || exit 1
@@
-	run_lightning --backend $BACKEND --config "$CUR/$BACKEND-config.toml" -d "$CUR/data.$compress" --enable-checkpoint=1 2> /dev/null
+	run_lightning --backend $BACKEND --config "$CUR/$BACKEND-config.toml" -d "$CUR/data.$compress" --enable-checkpoint=1 2> "$second_stderr"
+	second_status=$?
 	set -e
+	if [ "$second_status" -ne 0 ]; then
+	  cat "$second_stderr"
+	  exit "$second_status"
+	fi
  3. Build the Lightning integration-test binaries:
make build_for_lightning_integration_test
  4. Run the single test case:
TEST_NAME=lightning_compress lightning/tests/run.sh --no-tiflash

I ran step 4 twice in a row. Both runs failed in the same place.
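For reference, the conditional GO_FAILPOINTS composition from the patched run.sh can be exercised on its own. A minimal sketch (the `build_failpoints` helper is illustrative, not part of the patch; the failpoint names and terms match the diff above):

```shell
#!/bin/sh
# Mirror the patched run.sh: a table-scoped crash failpoint for local+gzip,
# the original global panic for every other backend/compression pair.
PKG="github.com/pingcap/tidb/lightning/pkg/importer"
TARGET_TABLE='`compress`.`multi_rows`'

build_failpoints() {
  backend="$1"
  compress="$2"
  fp="$PKG/SlowDownWriteRows=sleep(1000);$PKG/SetMinDeliverBytes=return(1)"
  if [ "$backend" = 'local' ] && [ "$compress" = 'gzip' ]; then
    # Pin the first crash to one table so checkpoint restore is deterministic.
    fp="$fp;$PKG/FailAfterWriteRowsForTable=return(\"$TARGET_TABLE\")"
  else
    fp="$fp;$PKG/FailAfterWriteRows=panic"
  fi
  printf '%s\n' "$fp"
}

build_failpoints local gzip
build_failpoints tidb gzip
```

This keeps the branch logic testable outside the harness, so a wrong failpoint string shows up before a full test run.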

2. What did you expect to see? (Required)

lightning/tests/lightning_compress/run.sh should either pass, or at least fail deterministically for the same reason after restart.
The resume path should not depend on which table happens to receive the first write under the global FailAfterWriteRows failpoint.

3. What did you see instead (Required)

After pinning the first crash to `compress`.`multi_rows` in local+gzip, the second restore run fails consistently with:

tidb lightning encountered error: [Lighting:Restore:ErrChecksumMismatch]checksum mismatched remote vs local => (checksum: 17625213566641260363 vs 6131315339012058461) (total_kvs: 1 vs 4) (total_bytes:35 vs 141)

Relevant evidence from /tmp/lightning_test/lightning.log:

[WARN] [chunk_process.go:751] [FailAfterWriteRowsForTable] [table=`compress`.`multi_rows`] [path=compress.multi_rows.000000000.csv.gz:0] [targetTable=`compress`.`multi_rows`]
[ERROR] [import.go:1417] ["restore all tables data failed"] [error="[Lighting:Restore:ErrChecksumMismatch]checksum mismatched remote vs local => (checksum: 17625213566641260363 vs 6131315339012058461) (total_kvs: 1 vs 4) (total_bytes:35 vs 141)"]
[ERROR] [import.go:175] [-] [table=`compress`.`empty_strings`] [status=checksum] [error="[Lighting:Restore:ErrChecksumMismatch]checksum mismatched remote vs local => (checksum: 17625213566641260363 vs 6131315339012058461) (total_kvs: 1 vs 4) (total_bytes:35 vs 141)"]

The second run's stderr is preserved at:

  • /tmp/lightning_test/lightning_compress.local.gzip.second.stderr
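Since the second run's stderr is now kept, the mismatch numbers can be pulled out of the preserved log mechanically. A small sketch (`extract_mismatch` and the temp log are illustrative; the message format matches the excerpt above):

```shell
#!/bin/sh
# Extract the first remote-vs-local total_kvs pair from a Lightning log.
extract_mismatch() {
  grep -o 'total_kvs: [0-9]* vs [0-9]*' "$1" | head -n 1
}

log=$(mktemp)
# Sample line copied from the checksum-mismatch error above.
cat > "$log" <<'EOF'
[ERROR] [import.go:1417] ["restore all tables data failed"] [error="[Lighting:Restore:ErrChecksumMismatch]checksum mismatched remote vs local => (checksum: 17625213566641260363 vs 6131315339012058461) (total_kvs: 1 vs 4) (total_bytes:35 vs 141)"]
EOF
extract_mismatch "$log"   # prints: total_kvs: 1 vs 4
```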

Before this debug patch, the test was flaky because:

  • the global FailAfterWriteRows could hit compress.escapes or another table first
  • both run_lightning invocations redirected stderr to /dev/null
  • the script later failed with a misleading count(*) = 0 assertion on compress.multi_rows
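The stderr-handling change in the patch follows a general pattern worth keeping: write each run's stderr to its own file and replay it on failure. A hedged sketch (`run_and_keep_stderr` is an illustrative helper, not the harness's API):

```shell
#!/bin/sh
# Run a command with stderr captured to a file; on failure, replay the
# captured stderr (which the old `2> /dev/null` redirect used to discard)
# and propagate the command's exit status.
run_and_keep_stderr() {
  log="$1"; shift
  "$@" 2> "$log"
  status=$?
  if [ "$status" -ne 0 ]; then
    cat "$log" >&2
  fi
  return "$status"
}

tmp=$(mktemp)
run_and_keep_stderr "$tmp" sh -c 'echo "simulated lightning crash" >&2; exit 3' \
  || echo "exit status: $?"   # prints: exit status: 3
```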

4. What is your TiDB version? (Required)

Reproduced on current source at 91838e970ad95cf58777718bf402c624d027b918.

bin/tidb-server -V from the local integration-test environment:

Release Version: v8.4.0-this-is-a-placeholder
Edition: Community
Git Commit Hash: None
Git Branch: None
UTC Build Time: None
GoVersion: go1.23.12
Race Enabled: false
Check Table Before Drop: false
Store: unistore
Kernel Type: Classic
