
Deadlock in _process_event_persist_queue_task causes all message delivery to hang (single-process) #19601

@proterian

Description


In single-process Synapse (no workers), all message delivery permanently hangs
after startup. No events are persisted. The lock warning appears after ~5 minutes:

  synapse.handlers.worker_lock - WARNING - Lock timeout is getting excessive: 640s. There may be a deadlock.

Affected versions

Introduced in: 1.148.0 (exact version unconfirmed; confirmed broken in 1.149.1)
Not affected: 1.147.1

Root cause

message.py acquires new_event_during_purge_lock (read) for a room, then
awaits persist_events(...). The persistence queue task starts and tries to
acquire the same lock from the same instance.
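The shape of the deadlock can be reproduced with a plain asyncio.Lock standing in for Synapse's DB-backed lock (a minimal sketch; like the DB-backed lock, asyncio.Lock is not reentrant):

```python
import asyncio

async def demo() -> str:
    lock = asyncio.Lock()  # non-reentrant, standing in for the DB-backed lock

    async def persist_queue_task() -> str:
        # The queue task tries to take the same lock a second time.
        async with lock:
            return "persisted"

    async with lock:  # the caller (message.py) already holds the lock...
        try:
            # ...and now awaits the queue task, which blocks forever;
            # the timeout here only exists to make the demo terminate.
            return await asyncio.wait_for(persist_queue_task(), timeout=0.1)
        except asyncio.TimeoutError:
            return "deadlock"

outcome = asyncio.run(demo())
print(outcome)  # → deadlock
```

In production there is no timeout: the caller waits on the task, the task waits on the lock, and the lock is held by the caller.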

The table worker_read_write_locks has PRIMARY KEY (lock_name, lock_key, instance_name),
so a given instance can hold each lock at most once. A second INSERT from the same
instance raises IntegrityError, so try_acquire_read_write_lock returns None, and the
task waits forever for a lock it can never get.
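The primary-key collision can be demonstrated against a stripped-down version of the table (a sketch only; the real schema carries additional columns such as the lock token and renewal timestamp):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE worker_read_write_locks (
        lock_name     TEXT NOT NULL,
        lock_key      TEXT NOT NULL,
        instance_name TEXT NOT NULL,
        write_lock    BOOLEAN NOT NULL,
        PRIMARY KEY (lock_name, lock_key, instance_name)
    )
    """
)

row = ("new_event_during_purge_lock", "!room:example.org", "master", False)

# First acquisition of the read lock from instance "master" succeeds...
conn.execute("INSERT INTO worker_read_write_locks VALUES (?, ?, ?, ?)", row)

# ...but a second acquisition from the SAME instance violates the PK,
# which is why try_acquire_read_write_lock returns None.
try:
    conn.execute("INSERT INTO worker_read_write_locks VALUES (?, ?, ?, ?)", row)
    conflict = False
except sqlite3.IntegrityError:
    conflict = True

print(conflict)  # → True
```

With workers, the second INSERT would carry a different instance_name and succeed, which is why only single-process deployments hit this.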

Meanwhile, message.py is waiting for the queue task to finish: a classic deadlock.

The comment in the code says:

  # We might already have taken out the lock, but since this is just a
  # "read" lock its inherently reentrant.

This is incorrect: the implementation does NOT support reentrancy.
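For the comment to be true, the lock would need a per-holder count so that a repeat acquire by the same holder succeeds rather than colliding on the primary key. A hypothetical sketch (not Synapse's actual API):

```python
from collections import defaultdict

class ReentrantReadLock:
    """Hypothetical counting read lock: a repeat acquire by the same
    holder increments a count instead of failing, and the lock is only
    released when the count drops back to zero."""

    def __init__(self) -> None:
        self._counts = defaultdict(int)

    def try_acquire(self, holder: str) -> bool:
        self._counts[holder] += 1
        return True

    def release(self, holder: str) -> None:
        self._counts[holder] -= 1
        if self._counts[holder] <= 0:
            del self._counts[holder]

lock = ReentrantReadLock()
first = lock.try_acquire("master")
second = lock.try_acquire("master")  # succeeds where the PK-based scheme fails
print(first, second)  # → True True
```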

Affected file

synapse/storage/controllers/persist_events.py
Function: _process_event_persist_queue_task (~line 372)

Fix

Remove the redundant lock acquisition from _process_event_persist_queue_task.
All callers (message.py:1081, federation_server.py:1302, room_member.py:662)
already hold this read lock before calling persist_events().
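The effect of the fix can be sketched with a plain asyncio.Lock standing in for the DB-backed lock: the queue task no longer re-acquires the lock that its caller already holds.

```python
import asyncio

async def persist_with_callers_lock() -> str:
    lock = asyncio.Lock()  # stand-in for new_event_during_purge_lock

    async def persist_queue_task() -> str:
        # Fixed behaviour: no second acquisition here; the task relies on
        # the caller (message.py / federation_server.py / room_member.py)
        # already holding the read lock.
        return "persisted"

    async with lock:  # the caller still takes the lock once, as before
        return await asyncio.wait_for(persist_queue_task(), timeout=1.0)

result = asyncio.run(persist_with_callers_lock())
print(result)  # → persisted
```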

Diagnosis

  • py-spy shows the reactor idle (doPoll) and all thread pool threads idle:
    coroutines are stuck waiting on each other, with no I/O pending
  • worker_read_write_locks shows the read lock being renewed every 30s but
    never released
  • Only single-process deployments are affected: each worker has its own
    instance_name, so the PK conflict cannot occur with workers
