
High CPU and DB usage upon start on Synapse v1.138.0 #18925

@anoadragon453

Description

After upgrading the element.io homeserver to Synapse v1.138.0, we saw high CPU, memory and DB pressure.

We were seeing many calls to the _fetch_event_list and cancel_delayed_state_events database queries, and the *getEvent* cache was growing quickly, which ballooned memory usage.

My theory is that #18858 is the cause. The fix allows the surrounding for loop to keep iterating (whereas before it would exit early with an exception). get_event calls _fetch_event_list, so we are now tight-looping, fetching a single state event at a time. We then call cancel_delayed_state_events on each state event individually.

It would be more efficient to fetch events in batches (say, 500 at a time) and cancel each batch with a single query.
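
To make the batching idea concrete, here is a rough sketch. It is not Synapse code: FakeStore, get_events, cancel_delayed_state_events_batch and process_deltas are hypothetical names invented for illustration, and the store just counts round trips instead of talking to a database. It only shows the shape of the change, which is one fetch and one cancel per batch rather than per event.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple

# (room_id, event_type, state_key) identifying a piece of room state.
StateKey = Tuple[str, str, str]


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


class FakeStore:
    """Hypothetical stand-in for the database layer; it only counts round trips."""

    def __init__(self, events: Dict[str, StateKey]) -> None:
        self.events = events
        self.queries = 0
        self.cancelled: List[StateKey] = []

    def get_events(self, event_ids: List[str]) -> List[StateKey]:
        # One round trip for the whole batch (assumed bulk fetch).
        self.queries += 1
        return [self.events[e] for e in event_ids if e in self.events]

    def cancel_delayed_state_events_batch(self, keys: List[StateKey]) -> None:
        # One round trip cancelling every delayed event matching the batch.
        self.queries += 1
        self.cancelled.extend(keys)


def process_deltas(store: FakeStore, event_ids: Iterable[str], batch_size: int = 500) -> None:
    """Handle state deltas in batches instead of one event at a time."""
    for batch in batched(event_ids, batch_size):
        state_keys = store.get_events(batch)
        if state_keys:
            store.cancel_delayed_state_events_batch(state_keys)


if __name__ == "__main__":
    events = {f"$ev{i}": ("!room:example.org", "m.room.topic", str(i)) for i in range(1200)}
    store = FakeStore(events)
    process_deltas(store, events.keys())
    print(store.queries, len(store.cancelled))
```

In this toy setup, 1,200 deltas with a batch size of 500 take 6 round trips (3 fetches and 3 cancels), where the per-event approach would take 2,400.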


Since this loop operates on state deltas from self._store.get_delayed_events_stream_pos() to self._store.get_room_max_stream_ordering(), one can assume that the intensive operations will run for a bounded time and then settle. However, because the earlier code aborted early, it's likely that delayed_events_stream_pos is currently very far behind on most deployments.

Understandably, element.io found this too intensive and rolled Synapse back before it completed.
