Revert "Fix LaterGauge metrics to collect from all servers (#18751)"#18789
Merged
Revert "Fix LaterGauge metrics to collect from all servers (#18751)"#18789
LaterGauge metrics to collect from all servers (#18751)"#18789Conversation
This reverts commit 076db0a.
MadLittleMods
commented
Aug 6, 2025
devonh
approved these changes
Aug 6, 2025
3 tasks
Closed
3 tasks
MadLittleMods
added a commit
that referenced
this pull request
Sep 2, 2025
…18791) Re-introduce: #18751 that was reverted in #18789 (explains why the PR was reverted in the first place). - Adds a `cleanup` pattern that cleans up metrics from each homeserver in the tests. Previously, the list of hooks built up until our CI machines couldn't operate properly, see #18789 - Fix long-standing issue with `synapse_background_update_status` metrics only tracking the last database listed in the config (see #18791 (comment))
itsoyou
pushed a commit
to famedly/synapse
that referenced
this pull request
Oct 13, 2025
…18791) Re-introduce: element-hq/synapse#18751 that was reverted in element-hq/synapse#18789 (explains why the PR was reverted in the first place). - Adds a `cleanup` pattern that cleans up metrics from each homeserver in the tests. Previously, the list of hooks built up until our CI machines couldn't operate properly, see element-hq/synapse#18789 - Fix long-standing issue with `synapse_background_update_status` metrics only tracking the last database listed in the config (see element-hq/synapse#18791 (comment))
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR reverts #18751
Why revert?
@reivilibre found that our CI was failing in bizarre ways (thanks for stepping up to dive into this 🙇). Examples:
twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended by signal 9.twisted.internet.error.ProcessTerminated: A process has ended with a probable error condition: process ended by signal 15.More detailed part of the log
https://github.com/element-hq/synapse/actions/runs/16758038107/job/47500520633#step:9:6809
With more debugging (thanks @devonh for also stepping in as maintainer), we were finding that the CI was consistently failing at
test_exposed_to_prometheuswhich was a bit of smoke because of all of the metrics changes that were merged recently.Locally, although I wasn't able to reproduce the bizarre errors, I could easily see increased memory usage (~20GB vs ~2GB) and the
test_exposed_to_prometheustest taking a while to complete when running a full test run (SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests).After updating
test_exposed_to_prometheusto dump thelatest_metrics_response = generate_latest(REGISTRY), I could see that it's a massive 3.2GB response. Inspecting the contents, we can see 4.1M (4,137,123) entries for justsynapse_background_update_status{server_name="test"} 3.0which is aLaterGauge. I don't think we have 4.1M test cases so it's also unclear why we end up with so many samples but it does make sense that we do see a lot of duplicates because eachHomeserverTestCasewill create a homeserver for each test case that will callLaterGauge.register_hook(...)(part of the #18751 changes).tests/storage/databases/main/test_metrics.py(simple code for dumping the metrics response)After reverting the #18751 changes, running the full test suite locally doesn't result in memory spikes and seems to run normally.
Dev notes
Discussion in the
#synapse-dev:matrix.orgroom.Pull Request Checklist
EventStoretoEventWorkerStore.".code blocks.