
Deadlock in _process_event_persist_queue_task causes all message delivery to hang (single-process) #19601

@proterian

Description


In single-process Synapse (no workers), all message delivery permanently hangs
after startup. No events are persisted. The lock warning appears after ~5 minutes:

  synapse.handlers.worker_lock - WARNING - Lock timeout is getting excessive: 640s. There may be a deadlock.

Affected versions

Introduced in: 1.148.0 (exact version unconfirmed; confirmed broken in 1.149.1)
Not affected: 1.147.1

Root cause

message.py acquires new_event_during_purge_lock (read) for a room, then
awaits persist_events(...). The persistence queue task starts and tries to
acquire the same lock from the same instance.
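The shape of the deadlock can be reproduced with a plain asyncio.Lock standing in for Synapse's DB-backed lock (a minimal sketch; like the DB-backed lock, asyncio.Lock is not reentrant):

```python
import asyncio

async def demo() -> str:
    lock = asyncio.Lock()  # non-reentrant, standing in for the DB-backed lock

    async def persist_queue_task() -> str:
        # The queue task tries to take the same lock a second time.
        async with lock:
            return "persisted"

    async with lock:  # the caller (message.py) already holds the lock...
        try:
            # ...and now awaits the queue task, which blocks forever;
            # the timeout here only exists to make the demo terminate.
            return await asyncio.wait_for(persist_queue_task(), timeout=0.1)
        except asyncio.TimeoutError:
            return "deadlock"

outcome = asyncio.run(demo())
print(outcome)  # → deadlock
```

In production there is no timeout: the caller waits on the task, the task waits on the lock, and the lock is held by the caller.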

The table worker_read_write_locks has PRIMARY KEY (lock_name, lock_key, instance_name),
so a given instance can hold each lock at most once. A second INSERT from the same
instance raises IntegrityError, so try_acquire_read_write_lock returns None, and the
task waits forever for a lock it can never get.
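The primary-key collision can be demonstrated against a stripped-down version of the table (a sketch only; the real schema carries additional columns such as the lock token and renewal timestamp):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE worker_read_write_locks (
        lock_name     TEXT NOT NULL,
        lock_key      TEXT NOT NULL,
        instance_name TEXT NOT NULL,
        write_lock    BOOLEAN NOT NULL,
        PRIMARY KEY (lock_name, lock_key, instance_name)
    )
    """
)

row = ("new_event_during_purge_lock", "!room:example.org", "master", False)

# First acquisition of the read lock from instance "master" succeeds...
conn.execute("INSERT INTO worker_read_write_locks VALUES (?, ?, ?, ?)", row)

# ...but a second acquisition from the SAME instance violates the PK,
# which is why try_acquire_read_write_lock returns None.
try:
    conn.execute("INSERT INTO worker_read_write_locks VALUES (?, ?, ?, ?)", row)
    conflict = False
except sqlite3.IntegrityError:
    conflict = True

print(conflict)  # → True
```

With workers, the second INSERT would carry a different instance_name and succeed, which is why only single-process deployments hit this.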

Meanwhile, message.py is waiting for the queue task to finish: a classic deadlock.

The comment in the code says:

  # We might already have taken out the lock, but since this is just a
  # "read" lock its inherently reentrant.

This is incorrect: the implementation does NOT support reentrancy.
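For the comment to be true, the lock would need a per-holder count so that a repeat acquire by the same holder succeeds rather than colliding on the primary key. A hypothetical sketch (not Synapse's actual API):

```python
from collections import defaultdict

class ReentrantReadLock:
    """Hypothetical counting read lock: a repeat acquire by the same
    holder increments a count instead of failing, and the lock is only
    released when the count drops back to zero."""

    def __init__(self) -> None:
        self._counts = defaultdict(int)

    def try_acquire(self, holder: str) -> bool:
        self._counts[holder] += 1
        return True

    def release(self, holder: str) -> None:
        self._counts[holder] -= 1
        if self._counts[holder] <= 0:
            del self._counts[holder]

lock = ReentrantReadLock()
first = lock.try_acquire("master")
second = lock.try_acquire("master")  # succeeds where the PK-based scheme fails
print(first, second)  # → True True
```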

Affected file

synapse/storage/controllers/persist_events.py
Function: _process_event_persist_queue_task (~line 372)

Fix

Remove the redundant lock acquisition from _process_event_persist_queue_task.
All callers (message.py:1081, federation_server.py:1302, room_member.py:662)
already hold this read lock before calling persist_events().
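The effect of the fix can be sketched with a plain asyncio.Lock standing in for the DB-backed lock: the queue task no longer re-acquires the lock that its caller already holds.

```python
import asyncio

async def persist_with_callers_lock() -> str:
    lock = asyncio.Lock()  # stand-in for new_event_during_purge_lock

    async def persist_queue_task() -> str:
        # Fixed behaviour: no second acquisition here; the task relies on
        # the caller (message.py / federation_server.py / room_member.py)
        # already holding the read lock.
        return "persisted"

    async with lock:  # the caller still takes the lock once, as before
        return await asyncio.wait_for(persist_queue_task(), timeout=1.0)

result = asyncio.run(persist_with_callers_lock())
print(result)  # → persisted
```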

Diagnosis

  • py-spy shows the reactor idle (doPoll) and all thread pool threads idle:
    coroutines are stuck waiting on each other, with no I/O pending
  • worker_read_write_locks shows the read lock being renewed every 30s but
    never released
  • Only single-process deployments are affected: each worker has its own
    instance_name, so the PK conflict cannot occur with workers
