Deadlock in _process_event_persist_queue_task causes all message delivery to hang (single-process) #19601
Description
In single-process Synapse (no workers), all message delivery permanently hangs
after startup. No events are persisted. The lock warning appears after ~5 minutes:
synapse.handlers.worker_lock - WARNING - Lock timeout is getting excessive: 640s. There may be a deadlock.
Affected versions
Introduced in: 1.148.0 (exact version unconfirmed; 1.149.1 is confirmed broken)
Not affected: 1.147.1
Root cause
message.py acquires the new_event_during_purge_lock read lock for a room, then
awaits persist_events(...). The persistence queue task starts and tries to
acquire the same lock from the same instance.
The table worker_read_write_locks has PRIMARY KEY (lock_name, lock_key, instance_name)
— one row per instance. A second INSERT from the same instance raises IntegrityError,
so try_acquire_read_write_lock returns None, and the task waits forever.
Meanwhile message.py is waiting for the task to finish → classic deadlock.
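The primary-key conflict can be reproduced in isolation. A sketch with a simplified column set (the real table has more columns; only the PRIMARY KEY matters here):

```python
import sqlite3

# Minimal recreation of the worker_read_write_locks primary key
# (column set simplified; only the PRIMARY KEY matters here).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE worker_read_write_locks (
        lock_name TEXT NOT NULL,
        lock_key TEXT NOT NULL,
        instance_name TEXT NOT NULL,
        write_lock BOOLEAN NOT NULL,
        PRIMARY KEY (lock_name, lock_key, instance_name)
    )
    """
)

row = ("new_event_during_purge_lock", "!room:example.org", "master", False)
conn.execute("INSERT INTO worker_read_write_locks VALUES (?, ?, ?, ?)", row)

# The same instance taking the "reentrant" read lock again is a second
# INSERT with an identical primary key: IntegrityError.
same_instance_err = None
try:
    conn.execute("INSERT INTO worker_read_write_locks VALUES (?, ?, ?, ?)", row)
except sqlite3.IntegrityError as e:
    same_instance_err = e
print(same_instance_err)

# A different instance_name succeeds, which is why multi-worker
# deployments don't hit this.
conn.execute(
    "INSERT INTO worker_read_write_locks VALUES (?, ?, ?, ?)",
    ("new_event_during_purge_lock", "!room:example.org", "worker-1", False),
)
```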
The comment in the code says:
# We might already have taken out the lock, but since this is just a
# "read" lock its inherently reentrant.
This is incorrect — the implementation does NOT support reentrancy.
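The deadlock pattern can be sketched with a toy in-memory lock standing in for the database table (names are illustrative, not Synapse's actual API):

```python
import asyncio

# Toy in-memory stand-in for worker_read_write_locks: one entry per
# (lock_name, lock_key, instance_name), mirroring the PRIMARY KEY.
held = set()

def try_acquire(name, key, instance):
    row = (name, key, instance)
    if row in held:  # the same-instance INSERT would raise IntegrityError
        return None
    held.add(row)
    return row

async def persist_queue_task():
    # The redundant acquisition inside _process_event_persist_queue_task:
    # it retries forever, because "master" already holds the row.
    while try_acquire("new_event_during_purge_lock", "!room:x", "master") is None:
        await asyncio.sleep(0.01)

async def send_message():
    # message.py path: take the read lock, then wait for persistence.
    try_acquire("new_event_during_purge_lock", "!room:x", "master")
    await persist_queue_task()  # never returns

async def main():
    try:
        await asyncio.wait_for(send_message(), timeout=0.2)
        return "completed"
    except asyncio.TimeoutError:
        return "deadlocked"

print(asyncio.run(main()))  # -> deadlocked
```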
Affected file
synapse/storage/controllers/persist_events.py
Function: _process_event_persist_queue_task (~line 372)
Fix
Remove the redundant lock acquisition from _process_event_persist_queue_task.
All callers (message.py:1081, federation_server.py:1302, room_member.py:662)
already hold this read lock before calling persist_events().
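A hedged sketch of what the fixed control flow looks like, using the same kind of toy in-memory lock (the caller takes the read lock once; the queue task no longer re-acquires it):

```python
import asyncio

held = set()

def try_acquire(row):
    if row in held:
        return None
    held.add(row)
    return row

ROW = ("new_event_during_purge_lock", "!room:x", "master")

async def persist_queue_task():
    # After the fix: no lock acquisition here. The callers listed above
    # already hold the read lock when persist_events() is invoked.
    await asyncio.sleep(0)  # stand-in for the actual persistence work
    return "persisted"

async def send_message():
    assert try_acquire(ROW) is not None  # caller takes the read lock once
    try:
        return await persist_queue_task()  # completes: no second acquisition
    finally:
        held.discard(ROW)  # release on exit

print(asyncio.run(send_message()))  # -> persisted
```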
Diagnosis
- py-spy shows the reactor idle (doPoll) and all thread-pool threads idle → coroutines
are stuck waiting on each other, with no I/O pending
- worker_read_write_locks shows the read lock being renewed every 30s but never released
- Only single-process deployments are affected (each worker has its own instance_name,
so the PK conflict doesn't occur)
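To confirm the stuck lock from the database, look for rows that keep being renewed. A sketch against SQLite (the real schema has more columns; the last_renewed_ts column name is assumed here from the renew-every-30s behaviour):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
# Simplified schema; last_renewed_ts (ms since epoch) is assumed from the
# renewal behaviour observed above.
conn.execute(
    """CREATE TABLE worker_read_write_locks (
        lock_name TEXT, lock_key TEXT, instance_name TEXT,
        write_lock BOOLEAN, last_renewed_ts BIGINT)"""
)
now_ms = int(time.time() * 1000)
conn.execute(
    "INSERT INTO worker_read_write_locks VALUES (?, ?, ?, ?, ?)",
    ("new_event_during_purge_lock", "!room:x", "master", False, now_ms),
)

# Freshly renewed (within the last 30s) but, per the application logs,
# held for minutes: renewed, never released.
stuck = conn.execute(
    "SELECT lock_name, lock_key, instance_name "
    "FROM worker_read_write_locks WHERE last_renewed_ts > ?",
    (now_ms - 30_000,),
).fetchall()
print(stuck)
```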