Upon upgrading the element.io homeserver, we found high CPU, memory and DB pressure.
We were seeing many calls to the _fetch_event_list and cancel_delayed_state_events database queries, and the *getEvent* cache growing quickly, ballooning memory.
My theory is that #18858 is the cause. The fix allows the surrounding for loop to keep iterating (whereas before it would exit early with an exception). get_event calls _fetch_event_list, so we are now tight-looping and fetching a single state event at a time. We then call cancel_delayed_state_events on each state event individually.
It would be more efficient to fetch a batch of (say) 500 events at a time and cancel their delayed events with a single query per batch as well.
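To illustrate the batching idea, here is a rough, standalone sketch. It is not Synapse code: `batched`, `cancel_in_batches`, `fetch_events` and `cancel_for_events` are hypothetical names standing in for whatever bulk fetch and bulk cancel methods the store would need to expose, and 500 is just the example batch size from above.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def cancel_in_batches(
    event_ids: Iterable[str],
    # Hypothetical bulk fetch: event IDs -> {event_id: event}.
    fetch_events: Callable[[List[str]], Dict[str, dict]],
    # Hypothetical bulk cancel: takes all events of one batch at once.
    cancel_for_events: Callable[[List[dict]], None],
    batch_size: int = 500,
) -> int:
    """Process event IDs in batches and return the number of batches."""
    batches = 0
    for chunk in batched(event_ids, batch_size):
        # One fetch per batch instead of one per event.
        events = fetch_events(chunk)
        found = [events[eid] for eid in chunk if eid in events]
        if found:
            # One cancel per batch instead of one per event.
            cancel_for_events(found)
        batches += 1
    return batches
```

With 1200 state deltas this would issue 3 fetches and at most 3 cancels, rather than 1200 of each. The real change would also need to preserve whatever per-event checks the current loop performs before cancelling.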
Since this loop operates on state deltas from self._store.get_delayed_events_stream_pos() to self._store.get_room_max_stream_ordering(), one can assume that the intensive operations will run for a bounded time and then settle. However, because the earlier code aborted early, it's likely that the delayed_events_stream_pos is currently very far behind for most deployments.
Understandably, element.io found this too intensive and rolled Synapse back before it completed.