lightning: TestRegionJobBaseWorker is flaky in nextgen CI (sync: negative WaitGroup counter)

## Bug Report

### 1. Minimal reproduce step (Required)

Observed in CI run:
- Pipeline: `pull_unit_test_next_gen`
- Run: `#11975` (2026-03-05)
- Node log: https://do.pingcap.net/jenkins/blue/rest/organizations/jenkins/pipelines/pingcap/pipelines/tidb/pipelines/pull_unit_test_next_gen/runs/11975/nodes/54/log/?start=0

Failure happens in:
- `//pkg/lightning/backend/local:local_test (shard 27 of 50)`
- `TestRegionJobBaseWorker/if_the_region_has_no_leader,_rescan_the_region`

The first attempt failed with a panic:
- `sync: negative WaitGroup counter`
- stack includes:
  - `pkg/lightning/backend/local/job_worker.go:95` (`mockJobWgDone` failpoint path)
  - `pkg/lightning/backend/local/region_job.go:312` (`regionJob.done -> wg.Done`)

The same shard then passed on retry, and Bazel marked it `FLAKY`.

Root-cause analysis from the code path:
- In the test helper `prepareAndExecute`:
  - `jobInCh <- job` is executed before `jobWg.Add(1)`.
- In the no-leader subtest, the failpoint `mockJobWgDone` is set to `return(3)`.
- The worker can pick up the job immediately and execute `w.jobWg.Add(-3)` before the producer goroutine executes `jobWg.Add(1)`.
- When this interleaving happens, the WaitGroup counter goes negative and `sync` panics.
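
The ordering problem can be illustrated with a standalone sketch. This is a simplified model, not the actual test code: the helper name `doneBeforeAdd` is hypothetical, and it uses a plain `Done()` rather than the failpoint's `Add(-3)` path. It forces the bad interleaving deterministically by decrementing before the producer's `Add(1)`.

```go
package main

import (
	"fmt"
	"sync"
)

// doneBeforeAdd is a hypothetical helper that models the bad interleaving:
// the worker-side decrement runs before the producer-side Add(1).
// It returns the recovered panic message, or "no panic" if none occurred.
func doneBeforeAdd() (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()

	var wg sync.WaitGroup
	wg.Done() // worker finishes first: counter would go from 0 to -1
	wg.Add(1) // producer accounting arrives too late and is never reached
	return "no panic"
}

func main() {
	fmt.Println(doneBeforeAdd())
}
```

In the real test the two operations run on different goroutines, so the panic only appears when the scheduler happens to run the worker first, which would explain why the shard passes on retry.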

Relevant code:
- `pkg/lightning/backend/local/job_worker_test.go:134-135`
- `pkg/lightning/backend/local/job_worker.go:93-96`
- `pkg/lightning/backend/local/region_job.go:312`

### 2. What did you expect to see? (Required)

`TestRegionJobBaseWorker` should be deterministic and should not panic with `sync: negative WaitGroup counter`.

### 3. What did you see instead (Required)

A flaky panic in CI:
- The first attempt fails with `sync: negative WaitGroup counter`.
- The retry passes, and the test target is reported as `FLAKY`.

### 4. What is your TiDB version? (Required)

N/A for SQL runtime (this is a unit-test-only failure in CI).

Test context:
- TiDB repo `master` nextgen unit test pipeline
- CI run `pull_unit_test_next_gen #11975` on 2026-03-05

---

Potential fix direction:
- In `prepareAndExecute`, move `jobWg.Add(1)` before `jobInCh <- job`, so the producer's accounting always happens before the worker can decrement the counter (via `wg.Done` or the failpoint path).
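
A minimal sketch of the proposed ordering, assuming a single worker that calls `Done()` once per job. The `job` type and the `runOnce` function are hypothetical stand-ins for illustration, not the code in `job_worker_test.go`:

```go
package main

import (
	"fmt"
	"sync"
)

// job is a hypothetical stand-in for the region job used by the real test.
type job struct{ id int }

// runOnce enqueues n jobs with the accounting done before the hand-off,
// and returns how many jobs the worker processed.
func runOnce(n int) int {
	var wg sync.WaitGroup
	jobInCh := make(chan job)
	workerDone := make(chan struct{})
	processed := 0

	// Worker: decrements the counter once for every job it receives.
	go func() {
		defer close(workerDone)
		for range jobInCh {
			processed++
			wg.Done()
		}
	}()

	for i := 0; i < n; i++ {
		wg.Add(1)             // account for the job first
		jobInCh <- job{id: i} // then hand it to the worker
	}

	wg.Wait()       // every Add(1) above precedes its matching Done()
	close(jobInCh)  // let the worker exit
	<-workerDone    // makes the read of processed below race-free
	return processed
}

func main() {
	fmt.Println(runOnce(1000))
}
```

Because each `Add(1)` completes before the corresponding channel send, the worker can never observe a job whose increment has not yet been recorded, regardless of scheduling.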

lightning: TestRegionJobBaseWorker is flaky in nextgen CI (sync: negative WaitGroup counter) #66702
