Potential deadlock in TestWriteTxnPanicWithoutApply because batchTx lock is not released after panic #21939

@zhengshi1998

What happened?

Bug Description
When running the unit test TestWriteTxnPanicWithoutApply in server/etcdserver/txn/txn_test.go, the backend's background goroutine appears to block forever, because a lock is never released after a panic.

Execution Flow

  1. betesting.NewDefaultTmpBackend(t) creates a backend instance and starts a background goroutine executing (*backend).run().
  2. Inside backend.run(), the goroutine periodically invokes backend.batchTx.safePending(), which attempts to acquire the backend.batchTx mutex.
  3. In the main goroutine, the test then executes Txn(ctx, zaptest.NewLogger(t), txn, false, s, &lease.FakeLessor{}). This call is expected to panic because the context has already been canceled.
  4. During Txn() execution, a write transaction is created through kv.Write(), which acquires the backend.batchTx lock. Under normal execution, the lock is released by txnWrite.End().
  5. However, the panic occurs during: txnResp, err := txn(ctx, lg, txnWrite, rt, isWrite, txnPath). As a result, the function exits before txnWrite.End() is reached, leaving the backend.batchTx lock permanently held.

Resulting Hang
The sequence leading to the issue is:

  1. The main test goroutine acquires backend.batchTx.
  2. A panic occurs before the lock is released.
  3. The write transaction cleanup txnWrite.End() is skipped.
  4. The background backend.run() goroutine subsequently blocks forever while attempting to acquire backend.batchTx in safePending().
  5. The backend goroutine never exits.

Root Cause
backend.batchTx is acquired before invoking code that may panic, but its release depends on normal control flow (txnWrite.End()). Since panic unwinds the stack before End() is executed, the mutex remains locked indefinitely. The lock release should be protected by a deferred cleanup mechanism to ensure it executes even when a panic occurs.

What did you expect to happen?

The background goroutine executing backend.run() should be able to return cleanly.

How can we reproduce it (as minimally and precisely as possible)?

  1. Add a sleep statement to the test in server/etcdserver/txn/txn_test.go:
func TestWriteTxnPanicWithoutApply(t *testing.T) {
    // Keep the test alive for 1 second after the original code finishes.
    defer time.Sleep(1000 * time.Millisecond)
    // original test code
}
  2. Add print statements that mark the start and exit of backend.run() in server/storage/backend/backend.go:
func (b *backend) run() {
    fmt.Println("----- backend.run() begins -----")
    defer fmt.Println("----- backend.run() ends -----")
    // original backend.run() code
}
  3. Run the test: go test -run TestWriteTxnPanicWithoutApply. The test exits in 1 second, but the "----- backend.run() ends -----" line is never printed, so backend.run() never exits.
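
As a complement to the print statements, dumping all goroutine stacks shows where a leaked goroutine is parked. The sketch below uses only the standard library's runtime.Stack; blockedWorker, leakedDump, and mentions are hypothetical names, and the leaked lock is simulated with a plain sync.Mutex rather than etcd's backend.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
	"sync"
	"time"
)

// blockedWorker is a hypothetical stand-in for a background loop that needs mu.
func blockedWorker(mu *sync.Mutex, started chan<- struct{}) {
	close(started)
	mu.Lock() // blocks forever if the lock was leaked
	mu.Unlock()
}

// allStacks returns the stack traces of all goroutines in the process.
func allStacks() string {
	buf := make([]byte, 1<<20)
	n := runtime.Stack(buf, true) // true: include every goroutine
	return string(buf[:n])
}

// leakedDump leaks a lock, starts a worker that waits on it, and returns
// a stack dump taken while the worker is still waiting.
func leakedDump() string {
	mu := &sync.Mutex{}
	mu.Lock() // simulate a lock that is never released
	started := make(chan struct{})
	go blockedWorker(mu, started)
	<-started
	time.Sleep(50 * time.Millisecond) // give the worker time to park on Lock
	return allStacks()
}

// mentions reports whether the dump contains a frame with the given name.
func mentions(dump, name string) bool {
	return strings.Contains(dump, name)
}

func main() {
	dump := leakedDump()
	fmt.Println(dump)
	fmt.Println("blockedWorker still present:", mentions(dump, "blockedWorker"))
}
```

In a real test, a similar dump taken just before the test returns would be expected to list the goroutine running backend.run() together with the call it is blocked in, which is more specific than the absence of an "ends" line.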

Anything else we need to know?

One possible fix is to change the following code in the Txn function in server/etcdserver/txn/txn.go:

if isWrite {
	txnRead.End()
	txnWrite = kv.Write(trace)
} else {
	txnWrite = mvcc.NewReadOnlyTxnWrite(txnRead)
}
txnResp, err := txn(ctx, lg, txnWrite, rt, isWrite, txnPath)
txnWrite.End()

to

if isWrite {
	txnRead.End()
	txnWrite = kv.Write(trace)
} else {
	txnWrite = mvcc.NewReadOnlyTxnWrite(txnRead)
}
defer txnWrite.End()
txnResp, err := txn(ctx, lg, txnWrite, rt, isWrite, txnPath)

so that even if txn panics, txnWrite.End() still runs and releases the lock.
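
The effect of the deferred call can be checked in isolation with a small sketch. The types and functions below (toyTxnWrite, beginWrite, applyWithDefer, lockFreeAfterPanic) are hypothetical and only mimic the shape of a write transaction that holds a lock until End is called; sync.Mutex.TryLock requires Go 1.18 or later.

```go
package main

import (
	"fmt"
	"sync"
)

// toyTxnWrite is a hypothetical stand-in for a write transaction that holds
// a lock from creation until End is called.
type toyTxnWrite struct{ mu *sync.Mutex }

// beginWrite acquires the lock and returns the transaction handle.
func beginWrite(mu *sync.Mutex) *toyTxnWrite {
	mu.Lock()
	return &toyTxnWrite{mu: mu}
}

// End releases the lock held by the transaction.
func (w *toyTxnWrite) End() { w.mu.Unlock() }

// applyWithDefer registers End before running work, so End also runs
// while a panic unwinds the stack.
func applyWithDefer(mu *sync.Mutex, work func()) {
	w := beginWrite(mu)
	defer w.End()
	work()
}

// lockFreeAfterPanic runs a panicking workload through applyWithDefer and
// reports whether the lock can be acquired afterwards.
func lockFreeAfterPanic() bool {
	var mu sync.Mutex
	func() {
		defer func() { _ = recover() }() // swallow the simulated panic
		applyWithDefer(&mu, func() { panic("simulated txn failure") })
	}()
	if mu.TryLock() {
		mu.Unlock()
		return true
	}
	return false
}

func main() {
	fmt.Println("lock free after panic:", lockFreeAfterPanic())
}
```

This prints "lock free after panic: true". One trade-off to keep in mind: a deferred End runs when the enclosing function returns, not immediately after the call it protects, so any statements that follow in the same function execute while the lock is still held. Whether that matters for Txn depends on the rest of the real function, which this sketch does not model.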

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.6.12
Git SHA: 90b034a
Go Version: go1.25.10
Go OS/Arch: linux/amd64

$ etcdctl version
etcdctl version: 3.6.12
API version: 3.6

Etcd configuration (command line flags or environment variables)

Not applicable. I just ran the unit test.

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
Not applicable.

$ etcdctl --endpoints=<member list> endpoint status -w table
Not applicable.

Relevant log output
