Prune old rows in device_lists_changes_in_room table. #19473

erikjohnston merged 32 commits into develop
We called `txn.rowcount` *after* we used the txn for something else, so we were no longer counting rows deleted by the prune but rows inserted into the cache stream. This caused a tight loop.
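The pitfall can be reproduced in miniature with `sqlite3` (a sketch with simplified, hypothetical table names, not the actual Synapse code): `Cursor.rowcount` always reflects the most recent statement, so it must be read before the cursor is reused.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE device_lists_changes_in_room (stream_id INTEGER)")
conn.execute("CREATE TABLE cache_stream (name TEXT)")
conn.executemany(
    "INSERT INTO device_lists_changes_in_room VALUES (?)",
    [(i,) for i in range(5)],
)

txn = conn.cursor()
txn.execute("DELETE FROM device_lists_changes_in_room WHERE stream_id < 3")
num_deleted = txn.rowcount  # correct: 3 rows were deleted by the prune

# Reusing the same cursor overwrites rowcount with the INSERT's count.
txn.execute("INSERT INTO cache_stream VALUES ('invalidation')")
stale_count = txn.rowcount  # now 1: the INSERT, not the DELETE

print(num_deleted, stale_count)  # 3 1
```

Reading `rowcount` after the cache-stream write makes the prune loop believe it is still deleting rows, hence the tight loop.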
    return num_deleted

    num_rows_deleted = 0
    while True:
Are we happy that there isn't much in the way of rate control on these deletions?
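For illustration, a batched deletion loop with crude rate control might look like the following sketch (simplified schema, not the actual Synapse code; `batch_size` and `pause_s` are made-up knobs):

```python
import sqlite3
import time

def prune_in_batches(conn, prune_before, batch_size=100, pause_s=0.0):
    """Delete rows older than `prune_before` in small batches, sleeping
    between batches so each transaction stays short. (Hypothetical sketch.)"""
    total = 0
    while True:
        with conn:  # one transaction per batch
            cur = conn.execute(
                "DELETE FROM device_lists_changes_in_room "
                "WHERE rowid IN ("
                "  SELECT rowid FROM device_lists_changes_in_room"
                "  WHERE stream_id < ? LIMIT ?)",
                (prune_before, batch_size),
            )
        total += cur.rowcount
        if cur.rowcount < batch_size:
            return total  # a short batch means nothing older is left
        time.sleep(pause_s)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE device_lists_changes_in_room (stream_id INTEGER)")
conn.executemany(
    "INSERT INTO device_lists_changes_in_room VALUES (?)",
    [(i,) for i in range(250)],
)
deleted_total = prune_in_batches(conn, prune_before=200, batch_size=100)
print(deleted_total)  # 200
```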
Co-authored-by: Olivier 'reivilibre' <oliverw@element.io>
Co-authored-by: Olivier 'reivilibre' <olivier@librepush.net>
reivilibre
left a comment
Seems like it's basically there, save for a bit of potential raciness and maybe a test?
    if changes is not None:
        local_changes = {(u, d) for u, d in changes if self.hs.is_mine_id(u)}
    else:
        # The `device_lists_stream_id` is too old, so we need to fall back
how much of a pain would it be to stand up a test against this case? I worry this is probably untested code. Being an edge case, it will therefore probably go unnoticed if it breaks until a very awkward moment. Would be nice to have something
Co-authored-by: Olivier 'reivilibre' <oliverw@element.io>
Instead of checking it in a separate transaction, check in the transaction we're reading the table from.
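A sketch of what "check in the same transaction" could look like, with a simplified schema and a hypothetical helper name (not the actual Synapse function): reading the pruned-up-to marker inside the same transaction as the table read means a concurrent prune cannot slip in between the check and the query.

```python
import sqlite3

def get_changes_if_complete(txn, since_stream_id):
    """Hypothetical sketch: return changes since `since_stream_id`, or None
    if the table has been pruned past that point and cannot give a
    complete answer. The marker is read in the *same* transaction."""
    txn.execute(
        "SELECT stream_id FROM device_lists_changes_in_room_max_pruned_stream_id"
    )
    row = txn.fetchone()
    max_pruned = row[0] if row else 0
    if since_stream_id < max_pruned:
        return None  # data has been pruned; caller must fall back
    txn.execute(
        "SELECT user_id, device_id FROM device_lists_changes_in_room "
        "WHERE stream_id > ? ORDER BY stream_id",
        (since_stream_id,),
    )
    return txn.fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE device_lists_changes_in_room
        (user_id TEXT, device_id TEXT, stream_id INTEGER);
    CREATE TABLE device_lists_changes_in_room_max_pruned_stream_id
        (stream_id INTEGER);
    INSERT INTO device_lists_changes_in_room VALUES
        ('@a:hs', 'DEV1', 11), ('@b:hs', 'DEV2', 12);
    INSERT INTO device_lists_changes_in_room_max_pruned_stream_id VALUES (10);
""")
txn = conn.cursor()
res_pruned = get_changes_if_complete(txn, 5)   # pruned past 5
res_ok = get_changes_if_complete(txn, 10)      # complete answer available
print(res_pruned, res_ok)
```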
bb6c559 to 756675c
    "synapse.storage.databases.main.devices.PRUNE_DEVICE_LISTS_CHANGES_IN_ROOM_AGE",
    Duration(minutes=1),
    )
    def test_local_device_changes_sent_to_new_servers_on_un_partial_state(
I imagine it was reasonably fiddly to get this test all pieced together; thanks for writing it!
68b2415 to 5e5c3f2
Rather than trying to infer it from the minimum ID in the table. We were seeing issues in CI due to not all device list updates having entries in the `device_lists_changes_in_room` table, because the user wasn't in any rooms. This meant that the returned minimum ID was larger than expected, causing failures when calling `get_all_device_list_changes(..)`. By explicitly tracking the max pruned ID, we don't have to worry about problems with trying to infer it.
5e5c3f2 to fee4c3f
@reivilibre sorry, the worker integration tests raised an issue around the fact that taking the minimum stream ID of …
reivilibre
left a comment
Probably sane, but it would be good to have a really good idea of what the new table is for; it feels like the kind of thing someone will dig up years down the line and have to scratch their head over.
It doesn't sound insane, so it's probably just a matter of getting the illustration right, e.g. with a step-by-step sequence showing what goes wrong with the 'simple' MIN(stream_id) approach.
    -- the table cannot provide a complete answer.
    --
    -- This replaces the previous approach of using MIN(stream_id) on the
    -- device_lists_changes_in_room table, which incorrectly returned 0 when
Ironically this new table is more confusing to me, as in I don't see why this is preventing anything.
By my reading, COALESCE(MIN(stream_id), 0) is the reason the previous approach returned 0 when empty. (So I think the comment here is maybe a bit misleading, or doesn't quite reveal the reason we need this?)
Trying to understand the difference, here is a step-by-step comparison table as I see it.

Starting from an empty table:

| Step | Old `_get_min_device_lists_changes_in_room` | New `_get_max_pruned_device_lists_changes_in_room_txn` |
|---|---|---|
| Seed (empty table) | returns 0 | returns 0 |
| New rows added with stream_ids 2..=15 | returns 2 | still returns 0 |
| Prune with prune_before_stream_id 10 | now returns MIN(stream_id) = 10 | now returns 10 (explicitly stored) |

Starting from a populated table:

| Step | Old `_get_min_device_lists_changes_in_room` | New `_get_max_pruned_device_lists_changes_in_room_txn` |
|---|---|---|
| Seed (populated table with stream_ids 2..=9) | returns MIN(stream_id) = 2 | returns MIN(stream_id) - 1 = 1 |
| New rows added with stream_ids 10..=15 | still returns 2 | still returns 1 |
| Prune with prune_before_stream_id 10 | now returns MIN(stream_id) = 10 | now returns 10 (explicitly stored) |
Apart from the initial state before the first prune, it looks like both approaches behave the same between prunes?
What am I missing? Maybe the device_lists_changes_in_room rows get deleted as rooms get purged?
> By my reading, COALESCE(MIN(stream_id), 0) is the reason the previous approach returned 0 when empty. (So I think the comment here is maybe a bit misleading, or doesn't quite reveal the reason we need this?)
Argh, sorry that comment should have been deleted. I was messing around with testing getting LLM to generate the patch and it misunderstood the rationale (but got the change correct). I changed it locally but it got swallowed.
The actual problem (as per the commit message) comes from when we generate stream positions that don't have associated data in the `device_lists_changes_in_room` table, e.g. because the user isn't in any rooms. In that case, if we insert a row into the `device_lists_stream` table but not into `device_lists_changes_in_room`, and then a later update inserts rows into both, then MIN(stream_id) will return a stream ID greater than the first row's, even though its data hasn't been pruned (there was just no associated data to fetch).
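That failure mode can be walked through with a toy schema (table names simplified from the real ones): a stream position is allocated in `device_lists_stream` with no matching row in `device_lists_changes_in_room`, so MIN(stream_id) on the latter overshoots even though nothing was pruned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE device_lists_stream (stream_id INTEGER);
    CREATE TABLE device_lists_changes_in_room (stream_id INTEGER);
""")
# Update 1: user is in no rooms, so only device_lists_stream gets a row.
conn.execute("INSERT INTO device_lists_stream VALUES (1)")
# Update 2: user is in a room, so both tables get a row.
conn.execute("INSERT INTO device_lists_stream VALUES (2)")
conn.execute("INSERT INTO device_lists_changes_in_room VALUES (2)")

# Nothing has been pruned, yet the inferred "minimum unpruned" ID is 2,
# which wrongly implies stream position 1 was pruned away.
(inferred,) = conn.execute(
    "SELECT COALESCE(MIN(stream_id), 0) FROM device_lists_changes_in_room"
).fetchone()
print(inferred)  # 2
```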
reivilibre
left a comment
LGTM otherwise, thanks!
    -- it's safe to read from that table for a given stream_id — if the
    -- requested stream_id is < the value here, the data has been pruned and
    -- the table cannot provide a complete answer.
    CREATE TABLE IF NOT EXISTS device_lists_changes_in_room_max_pruned_stream_id (
would be good to still have a little distilled comment about why we need this table, which is that device_lists_stream somehow ties into this (I don't have a great picture of what this is, otherwise I'd suggest some wording)
Follows on from #19473. We should be recording where we have deleted up to in the same transaction as we perform the delete, rather than at the end. This code only starts deleting rows after a month (and the original PR isn't in a release yet), so no server should have run into this problem yet. Also let's log more regularly, as the initial set of deletions will likely take a long time.
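A sketch of recording the pruned-up-to position in the same transaction as each batched delete (simplified schema, hypothetical helper name, and assuming distinct stream_ids): if the process crashes between batches, the marker can never be ahead of what was actually deleted.

```python
import sqlite3

def prune_batch_txn(conn, prune_before, batch_size=100):
    """Hypothetical sketch: delete one batch of old rows and advance the
    pruned-up-to marker in the *same* transaction. Returns rows deleted."""
    with conn:  # one transaction per batch; commit (or rollback) atomically
        rows = conn.execute(
            "SELECT stream_id FROM device_lists_changes_in_room "
            "WHERE stream_id < ? ORDER BY stream_id LIMIT ?",
            (prune_before, batch_size),
        ).fetchall()
        if not rows:
            return 0
        max_deleted = rows[-1][0]
        conn.execute(
            "DELETE FROM device_lists_changes_in_room WHERE stream_id <= ?",
            (max_deleted,),
        )
        # Advance the marker only as far as this batch actually deleted.
        conn.execute(
            "UPDATE device_lists_changes_in_room_max_pruned_stream_id "
            "SET stream_id = ?",
            (max_deleted,),
        )
        return len(rows)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE device_lists_changes_in_room (stream_id INTEGER);
    CREATE TABLE device_lists_changes_in_room_max_pruned_stream_id (stream_id INTEGER);
    INSERT INTO device_lists_changes_in_room_max_pruned_stream_id VALUES (0);
""")
conn.executemany(
    "INSERT INTO device_lists_changes_in_room VALUES (?)",
    [(i,) for i in range(1, 21)],
)
while prune_batch_txn(conn, prune_before=15, batch_size=5):
    pass
(marker,) = conn.execute(
    "SELECT stream_id FROM device_lists_changes_in_room_max_pruned_stream_id"
).fetchone()
print(marker)  # 14
```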
Fixes #13043
The usages of the table mostly already handle the case where we don't have old entries correctly, as that was needed when we first added the table.
I arbitrarily set the prune time to 30 days. The only use for old entries is for sync streams that haven't synced since then, and we should very rarely see sync streams that haven't been used in 30 days.
Reviewable commit-by-commit.