Bug Report
1. Minimal reproduce step (Required)
This issue is about a flaky Lightning integration test, not a user-facing regression.
I found it while triaging pull-lightning-integration-test / Jenkins G01 for PR #67241, and the failure looks unrelated to that PR.
To reproduce locally:
- Prepare the normal Lightning integration test environment described in
lightning/tests/README.md.
- On current source (I reproduced at
91838e970ad95cf58777718bf402c624d027b918), apply the following debug-only patch to pin the first crash to `compress`.`multi_rows` and preserve both Lightning stderr streams:
diff --git a/lightning/pkg/importer/chunk_process.go b/lightning/pkg/importer/chunk_process.go
@@
failpoint.Inject("SlowDownWriteRows", func() {
deliverLogger.Warn("Slowed down write rows")
finished := rc.status.FinishedFileSize.Load()
total := rc.status.TotalFileSize.Load()
deliverLogger.Warn("PrintStatus Failpoint",
zap.Int64("finished", finished),
zap.Int64("total", total))
})
+ failpoint.Inject("FailAfterWriteRowsForTable", func(v failpoint.Value) {
+ if targetTable, ok := v.(string); ok && targetTable == t.tableName {
+ deliverLogger.Warn("FailAfterWriteRowsForTable", zap.String("targetTable", targetTable))
+ panic("failpoint: FailAfterWriteRowsForTable")
+ }
+ })
failpoint.Inject("FailAfterWriteRows", nil)
diff --git a/lightning/tests/lightning_compress/run.sh b/lightning/tests/lightning_compress/run.sh
@@
- # Set minDeliverBytes to a small enough number to only write only 1 row each time
- # Set the failpoint to kill the lightning instance as soon as one row is written
+ # Set minDeliverBytes to a small enough number to write only 1 row each time.
+ # For local+gzip, pin the crash to compress.multi_rows so the flaky checkpoint restore
+ # path is reproducible instead of depending on whichever table writes first.
PKG="github.com/pingcap/tidb/lightning/pkg/importer"
- export GO_FAILPOINTS="$PKG/SlowDownWriteRows=sleep(1000);$PKG/FailAfterWriteRows=panic;$PKG/SetMinDeliverBytes=return(1)"
+ TARGET_TABLE='`compress`.`multi_rows`'
+ export GO_FAILPOINTS="$PKG/SlowDownWriteRows=sleep(1000);$PKG/SetMinDeliverBytes=return(1)"
+ if [ "$BACKEND" = 'local' ] && [ "$compress" = 'gzip' ]; then
+ export GO_FAILPOINTS="$GO_FAILPOINTS;$PKG/FailAfterWriteRowsForTable=return(\"$TARGET_TABLE\")"
+ else
+ export GO_FAILPOINTS="$GO_FAILPOINTS;$PKG/FailAfterWriteRows=panic"
+ fi
+ first_stderr="$TEST_DIR/$TEST_NAME.$BACKEND.$compress.first.stderr"
+ second_stderr="$TEST_DIR/$TEST_NAME.$BACKEND.$compress.second.stderr"
+ rm -f "$first_stderr" "$second_stderr"
@@
- run_lightning --backend $BACKEND --config "$CUR/$BACKEND-config.toml" -d "$CUR/data.$compress" --enable-checkpoint=1 2> /dev/null
+ run_lightning --backend $BACKEND --config "$CUR/$BACKEND-config.toml" -d "$CUR/data.$compress" --enable-checkpoint=1 2> "$first_stderr"
+ first_status=$?
set -e
+ [ "$first_status" -ne 0 ] || exit 1
@@
- run_lightning --backend $BACKEND --config "$CUR/$BACKEND-config.toml" -d "$CUR/data.$compress" --enable-checkpoint=1 2> /dev/null
+ run_lightning --backend $BACKEND --config "$CUR/$BACKEND-config.toml" -d "$CUR/data.$compress" --enable-checkpoint=1 2> "$second_stderr"
+ second_status=$?
set -e
+ if [ "$second_status" -ne 0 ]; then
+ cat "$second_stderr"
+ exit "$second_status"
+ fi
- Build the Lightning integration-test binaries:
make build_for_lightning_integration_test
- Run the single test case:
TEST_NAME=lightning_compress lightning/tests/run.sh --no-tiflash
I ran the final step twice in a row. Both runs failed in the same place.
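For reference, GO_FAILPOINTS is a semicolon-separated list of `path=term` entries, so the wiring the patch adds can be sketched standalone. The package path and table name below are copied from the patch; nothing here executes Lightning itself:

```shell
# Compose the failpoint list the same way the patched run.sh does.
# PKG and TARGET_TABLE are taken verbatim from the debug patch above.
PKG="github.com/pingcap/tidb/lightning/pkg/importer"
TARGET_TABLE='`compress`.`multi_rows`'
BACKEND=local
compress=gzip

GO_FAILPOINTS="$PKG/SlowDownWriteRows=sleep(1000);$PKG/SetMinDeliverBytes=return(1)"
if [ "$BACKEND" = local ] && [ "$compress" = gzip ]; then
    # Table-scoped: only the chunk processor for TARGET_TABLE panics.
    GO_FAILPOINTS="$GO_FAILPOINTS;$PKG/FailAfterWriteRowsForTable=return(\"$TARGET_TABLE\")"
else
    # Global: whichever table writes its first row panics, which is
    # exactly the nondeterminism this report is about.
    GO_FAILPOINTS="$GO_FAILPOINTS;$PKG/FailAfterWriteRows=panic"
fi
export GO_FAILPOINTS
echo "$GO_FAILPOINTS"
```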
2. What did you expect to see? (Required)
lightning/tests/lightning_compress/run.sh should either pass, or at least fail deterministically for the same reason after restart.
The resume path should not depend on which table happens to receive the first write under the global FailAfterWriteRows failpoint.
3. What did you see instead (Required)
After pinning the first crash to `compress`.`multi_rows` in local+gzip, the second restore run fails consistently with:
tidb lightning encountered error: [Lighting:Restore:ErrChecksumMismatch]checksum mismatched remote vs local => (checksum: 17625213566641260363 vs 6131315339012058461) (total_kvs: 1 vs 4) (total_bytes:35 vs 141)
Relevant evidence from /tmp/lightning_test/lightning.log:
[WARN] [chunk_process.go:751] [FailAfterWriteRowsForTable] [table=`compress`.`multi_rows`] [path=compress.multi_rows.000000000.csv.gz:0] [targetTable=`compress`.`multi_rows`]
[ERROR] [import.go:1417] ["restore all tables data failed"] [error="[Lighting:Restore:ErrChecksumMismatch]checksum mismatched remote vs local => (checksum: 17625213566641260363 vs 6131315339012058461) (total_kvs: 1 vs 4) (total_bytes:35 vs 141)"]
[ERROR] [import.go:175] [-] [table=`compress`.`empty_strings`] [status=checksum] [error="[Lighting:Restore:ErrChecksumMismatch]checksum mismatched remote vs local => (checksum: 17625213566641260363 vs 6131315339012058461) (total_kvs: 1 vs 4) (total_bytes:35 vs 141)"]
The saved stderr from the second run is now preserved in:
/tmp/lightning_test/lightning_compress.local.gzip.second.stderr
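With the stderr streams preserved instead of sent to /dev/null, the mismatch is greppable after the fact. A minimal sketch, using a temp file with the quoted error in place of the real saved stderr:

```shell
# Sketch: pull the checksum-mismatch line out of a preserved stderr file.
# A temp file stands in for the real
# /tmp/lightning_test/lightning_compress.local.gzip.second.stderr.
stderr_file=$(mktemp)
cat > "$stderr_file" <<'EOF'
tidb lightning encountered error: [Lighting:Restore:ErrChecksumMismatch]checksum mismatched remote vs local => (checksum: 17625213566641260363 vs 6131315339012058461) (total_kvs: 1 vs 4) (total_bytes:35 vs 141)
EOF
mismatch=$(grep -c 'ErrChecksumMismatch' "$stderr_file")
echo "checksum mismatch lines: $mismatch"
```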
Before this debug patch, the test was flaky because:
- the global FailAfterWriteRows failpoint could hit compress.escapes or another table first,
- both run_lightning invocations redirected stderr to /dev/null, and
- the script later failed with a misleading count(*) = 0 assertion on compress.multi_rows.
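The stderr/exit-status handling the patch adds around each run_lightning call is the usual `set +e` capture idiom. A minimal standalone sketch, with a fake command standing in for run_lightning (assumed here to crash, as the injected failpoint makes it do):

```shell
# Sketch of the set +e / capture-status / set -e pattern from the patched
# run.sh. fake_lightning stands in for run_lightning: it writes to stderr
# and exits nonzero, simulating the injected panic on the first run.
fake_lightning() {
    echo "simulated lightning panic output" >&2
    return 3
}

stderr_file=$(mktemp)

set +e                            # tolerate the expected crash
fake_lightning 2> "$stderr_file"  # stderr is preserved, not discarded
first_status=$?
set -e

# The first run MUST fail; a clean exit means the failpoint never fired.
[ "$first_status" -ne 0 ] || exit 1
echo "first run failed as expected (status=$first_status)"
```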
4. What is your TiDB version? (Required)
Reproduced on current source at 91838e970ad95cf58777718bf402c624d027b918.
bin/tidb-server -V from the local integration-test environment:
Release Version: v8.4.0-this-is-a-placeholder
Edition: Community
Git Commit Hash: None
Git Branch: None
UTC Build Time: None
GoVersion: go1.23.12
Race Enabled: false
Check Table Before Drop: false
Store: unistore
Kernel Type: Classic
Additional context:
The original failure was observed in pull-lightning-integration-test / Jenkins G01 for PR #67241 (importsdk, importer, importinto: add import size estimate).