Bug report criteria
What happened?
Bug Description
When running the unit test TestWriteTxnPanicWithoutApply in server/etcdserver/txn/txn_test.go, the backend's background goroutine appears to deadlock because a lock is never released after a panic.
Execution Flow
betesting.NewDefaultTmpBackend(t) creates a backend instance and starts a background goroutine executing (*backend).run().
- Inside backend.run(), the goroutine periodically invokes backend.batchTx.safePending(), which attempts to acquire the backend.batchTx mutex.
- In the main goroutine, the test then executes Txn(ctx, zaptest.NewLogger(t), txn, false, s, &lease.FakeLessor{}). This txn is expected to panic because the context has already been canceled.
- During Txn() execution, a write transaction is created through kv.Write(), which acquires the backend.batchTx lock. Under normal execution, the lock is released by txnWrite.End().
- However, the panic occurs during txnResp, err := txn(ctx, lg, txnWrite, rt, isWrite, txnPath). As a result, the function exits before txnWrite.End() is reached, leaving the backend.batchTx lock permanently held.
Resulting Hang
The sequence leading to the issue is:
- The main test goroutine acquires backend.batchTx.
- A panic occurs before the lock is released.
- The write transaction cleanup txnWrite.End() is skipped.
- The background backend.run() goroutine subsequently blocks forever while attempting to acquire backend.batchTx in safePending().
- The backend goroutine never exits.
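The sequence can be simulated outside etcd. The following is a minimal sketch, not etcd code: fakeBackend, writeAndPanic, runExits and leakScenario are hypothetical names, and a plain sync.Mutex is assumed to stand in for the backend.batchTx lock.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fakeBackend is a hypothetical stand-in for the backend; it is not etcd code.
type fakeBackend struct {
	mu    sync.Mutex    // plays the role of the backend.batchTx lock
	stopc chan struct{} // asks run() to stop
	donec chan struct{} // closed when run() returns
}

func newFakeBackend() *fakeBackend {
	b := &fakeBackend{stopc: make(chan struct{}), donec: make(chan struct{})}
	go b.run()
	return b
}

// run mimics a periodic loop that must take the lock on every tick.
func (b *fakeBackend) run() {
	defer close(b.donec)
	t := time.NewTicker(10 * time.Millisecond)
	defer t.Stop()
	for {
		select {
		case <-t.C:
		case <-b.stopc:
			return
		}
		b.mu.Lock() // blocks forever if the lock was leaked
		b.mu.Unlock()
	}
}

// writeAndPanic takes the lock and panics before any unlock can happen.
func (b *fakeBackend) writeAndPanic() {
	b.mu.Lock()
	panic("simulated txn failure")
}

// runExits requests a stop and reports whether run() returns within the timeout.
func runExits(b *fakeBackend, timeout time.Duration) bool {
	close(b.stopc)
	select {
	case <-b.donec:
		return true
	case <-time.After(timeout):
		return false
	}
}

// leakScenario replays the sequence: lock, panic, recover, then try to stop run().
func leakScenario() bool {
	b := newFakeBackend()
	func() {
		defer func() { _ = recover() }() // the caller recovers the expected panic
		b.writeAndPanic()
	}()
	time.Sleep(50 * time.Millisecond) // give run() time to block on the leaked lock
	return runExits(b, 200*time.Millisecond)
}

func main() {
	fmt.Println("run() exited:", leakScenario())
}
```

In this sketch run() is stuck in Lock() when the stop request arrives, so it never observes stopc and leakScenario is expected to report false.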
Root Cause
backend.batchTx is acquired before invoking code that may panic, but its release depends on normal control flow (txnWrite.End()). Because a panic unwinds the stack without ever reaching the End() call, the mutex remains locked indefinitely. The lock release should be protected by a deferred cleanup so that it executes even when a panic occurs.
What did you expect to happen?
The background goroutine executing backend.run() should return safely.
How can we reproduce it (as minimally and precisely as possible)?
- Add a sleep statement to the test in server/etcdserver/txn/txn_test.go:
    func TestWriteTxnPanicWithoutApply(t *testing.T) {
        defer time.Sleep(1000 * time.Millisecond)
        // original test code
    }
- Add print statements marking the start and exit of backend.run() in server/storage/backend/backend.go:
    func (b *backend) run() {
        fmt.Println("----- backend.run() begins -----")
        defer fmt.Println("----- backend.run() ends -----")
        // original backend.run() code
    }
- Run the test: go test -run TestWriteTxnPanicWithoutApply. You will see that the test exits after 1 second but backend.run() never exits (the "ends" marker is never printed).
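As an alternative to print statements, a lingering goroutine can also be spotted by dumping all goroutine stacks with runtime.Stack and searching for the function name. The sketch below is only an illustration: allStacks, goroutineAlive and lingeringWorker are hypothetical helpers, and lingeringWorker stands in for a goroutine that never returns.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// allStacks returns the stack traces of all goroutines as one string,
// growing the buffer until the dump fits.
func allStacks() string {
	buf := make([]byte, 1<<16)
	for {
		n := runtime.Stack(buf, true)
		if n < len(buf) {
			return string(buf[:n])
		}
		buf = make([]byte, 2*len(buf))
	}
}

// goroutineAlive reports whether any goroutine stack mentions name.
func goroutineAlive(name string) bool {
	return strings.Contains(allStacks(), name)
}

// lingeringWorker is a hypothetical goroutine that blocks until released.
func lingeringWorker(started chan<- struct{}, block <-chan struct{}) {
	close(started)
	<-block
}

func main() {
	started, block := make(chan struct{}), make(chan struct{})
	go lingeringWorker(started, block)
	<-started
	fmt.Println("still running:", goroutineAlive("lingeringWorker"))
}
```

In the scenario from this report, the name to search for would presumably be the run method of the backend, checked after the test body has finished.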
Anything else we need to know?
One possible fix is inside the Txn function in server/etcdserver/txn/txn.go, where we can change
    if isWrite {
        txnRead.End()
        txnWrite = kv.Write(trace)
    } else {
        txnWrite = mvcc.NewReadOnlyTxnWrite(txnRead)
    }
    txnResp, err := txn(ctx, lg, txnWrite, rt, isWrite, txnPath)
    txnWrite.End()
to
    if isWrite {
        txnRead.End()
        txnWrite = kv.Write(trace)
    } else {
        txnWrite = mvcc.NewReadOnlyTxnWrite(txnRead)
    }
    defer txnWrite.End()
    txnResp, err := txn(ctx, lg, txnWrite, rt, isWrite, txnPath)
so that even if txn panics, txnWrite.End() is still executed and releases the lock.
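The difference between the two shapes can be checked in isolation. This is a minimal sketch under stated assumptions, not etcd's implementation: fakeTxn, begin, apply, withoutDefer, withDefer and lockFreeAfter are hypothetical names, and a sync.Mutex stands in for the backend.batchTx lock.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeTxn is a hypothetical stand-in for a write transaction; it is not etcd's TxnWrite.
type fakeTxn struct{ mu *sync.Mutex }

// begin acquires the lock, as the report describes kv.Write() doing.
func begin(mu *sync.Mutex) *fakeTxn {
	mu.Lock()
	return &fakeTxn{mu: mu}
}

// End releases the lock.
func (t *fakeTxn) End() { t.mu.Unlock() }

// apply stands in for the inner txn(...) call that may panic.
func apply(fail bool) {
	if fail {
		panic("simulated txn failure")
	}
}

// withoutDefer mirrors the current shape: End() runs only on normal return.
func withoutDefer(mu *sync.Mutex, fail bool) {
	t := begin(mu)
	apply(fail)
	t.End()
}

// withDefer mirrors the proposed shape: End() also runs while a panic unwinds.
func withDefer(mu *sync.Mutex, fail bool) {
	t := begin(mu)
	defer t.End()
	apply(fail)
}

// lockFreeAfter runs f, recovers any panic, and reports whether the lock is free afterwards.
func lockFreeAfter(f func(*sync.Mutex, bool), fail bool) bool {
	var mu sync.Mutex
	func() {
		defer func() { _ = recover() }()
		f(&mu, fail)
	}()
	free := mu.TryLock() // TryLock requires Go 1.18 or newer
	if free {
		mu.Unlock()
	}
	return free
}

func main() {
	fmt.Println("without defer, lock free after panic:", lockFreeAfter(withoutDefer, true))
	fmt.Println("with defer, lock free after panic:", lockFreeAfter(withDefer, true))
}
```

One consequence worth checking before adopting the change: with defer, End() runs when Txn returns rather than immediately after the txn call, so any code that follows that call inside Txn would run while the lock is still held.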
Etcd version (please run commands below)
$ etcd --version
etcd Version: 3.6.12
Git SHA: 90b034a
Go Version: go1.25.10
Go OS/Arch: linux/amd64
$ etcdctl version
etcdctl version: 3.6.12
API version: 3.6
Etcd configuration (command line flags or environment variables)
Not applicable. I just ran the unit test.
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
$ etcdctl member list -w table
Not applicable.
$ etcdctl --endpoints=<member list> endpoint status -w table
Not applicable.
Relevant log output