
Refactor cache metrics to be homeserver-scoped #18604

Merged
MadLittleMods merged 29 commits into develop from madlittlemods/per-hs-metrics-cache
Jul 16, 2025

Conversation

Contributor

@MadLittleMods MadLittleMods commented Jun 27, 2025

Refactor cache metrics to be homeserver-scoped (add server_name label to cache metrics).

Part of #18592

This can be reviewed commit by commit to skip over some of the bulk refactor. There are some fixes later in the history, and I'd prefer to keep that history rather than clean it all up in a rebase.

Testing strategy

See behavior of previous metrics listener

  1. Add the metrics listener in your homeserver.yaml
    listeners:
      - port: 9323
        type: metrics
        bind_addresses: ['127.0.0.1']
  2. Start the homeserver: poetry run synapse_homeserver --config-path homeserver.yaml
  3. Fetch http://localhost:9323/metrics
  4. Observe response includes the cache metrics (synapse_util_caches_cache_size, synapse_util_caches_cache_hits, synapse_util_caches_cache_evicted_size, etc)

See behavior of the http metrics resource

  1. Add the metrics resource to a new or existing http listener in your homeserver.yaml
    listeners:
      - port: 9322
        type: http
        bind_addresses: ['127.0.0.1']
        resources:
          - names: [metrics]
            compress: false
  2. Start the homeserver: poetry run synapse_homeserver --config-path homeserver.yaml
  3. Fetch http://localhost:9322/_synapse/metrics (it's just a GET request so you can even do it in the browser)
  4. Observe response includes the cache metrics (synapse_util_caches_cache_size, synapse_util_caches_cache_hits, synapse_util_caches_cache_evicted_size, etc): example, example from develop
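
The manual check in step 4 of either procedure can also be scripted. This is a minimal sketch, assuming the endpoint serves the Prometheus text exposition format; `cache_metric_lines` and `fetch_metrics` are hypothetical helper names, not part of Synapse:

```python
from urllib.request import urlopen

# Metric name prefix shared by the cache metrics listed in the steps above.
CACHE_METRIC_PREFIX = "synapse_util_caches_cache_"


def cache_metric_lines(metrics_text: str) -> list[str]:
    """Return the sample lines for cache metrics.

    `# HELP` and `# TYPE` comment lines start with `#`, so they are skipped.
    """
    return [
        line
        for line in metrics_text.splitlines()
        if line.startswith(CACHE_METRIC_PREFIX)
    ]


def fetch_metrics(url: str) -> str:
    """Fetch the raw metrics text from a running homeserver (plain GET request)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")
```

With a homeserver running, `cache_metric_lines(fetch_metrics("http://localhost:9322/_synapse/metrics"))` should return a non-empty list, and after this PR each line should carry a `server_name` label.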

Dev notes

LruCache/@cached, CacheMetric

register_cache(


ExpiringCache(

ResponseCache(

StreamChangeCache(

TTLCache(
	WellKnownResolver( -> MatrixFederationAgent(

LruCache(
	DeferredCache( -> DeferredCacheDescriptor( -> _CachedFunctionDescriptor( -> cached(
	AsyncLruCache(
	DictionaryCache(

Todo

  • Update @cached
  • Ensure scripts-dev/mypy_synapse_plugin.py works correctly with cached functions
    • This was more relevant when I was thinking I needed to change @cached more but should be fine with how we've done it.

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct (run the linters)

```
synapse/replication/tcp/streams/_base.py:568: error: Cannot determine type of "_device_list_id_gen"  [has-type]
synapse/storage/databases/main/event_push_actions.py:256: error: Cannot determine type of "server_name" in base class "ReceiptsWorkerStore"  [misc]
synapse/storage/databases/main/event_push_actions.py:256: error: Cannot determine type of "server_name" in base class "EventsWorkerStore"  [misc]
synapse/storage/databases/main/event_push_actions.py:256: error: Cannot determine type of "_instance_name" in base class "ReceiptsWorkerStore"  [misc]
synapse/storage/databases/main/metrics.py:64: error: Cannot determine type of "server_name" in base class "ReceiptsWorkerStore"  [misc]
synapse/storage/databases/main/metrics.py:64: error: Cannot determine type of "server_name" in base class "EventsWorkerStore"  [misc]
synapse/storage/databases/main/metrics.py:64: error: Cannot determine type of "_instance_name" in base class "ReceiptsWorkerStore"  [misc]
synapse/storage/databases/main/push_rule.py:118: error: Cannot determine type of "_instance_name" in base class "ReceiptsWorkerStore"  [misc]
synapse/storage/databases/main/push_rule.py:118: error: Cannot determine type of "server_name" in base class "ReceiptsWorkerStore"  [misc]
synapse/storage/databases/main/push_rule.py:118: error: Cannot determine type of "server_name" in base class "EventsWorkerStore"  [misc]
synapse/storage/databases/main/account_data.py:60: error: Cannot determine type of "_instance_name" in base class "ReceiptsWorkerStore"  [misc]
synapse/storage/databases/main/account_data.py:60: error: Cannot determine type of "server_name" in base class "ReceiptsWorkerStore"  [misc]
synapse/storage/databases/main/account_data.py:60: error: Cannot determine type of "server_name" in base class "EventsWorkerStore"  [misc]
synapse/storage/databases/main/__init__.py:114: error: Cannot determine type of "server_name" in base class "PresenceStore"  [misc]
synapse/storage/databases/main/__init__.py:114: error: Cannot determine type of "server_name" in base class "ReceiptsWorkerStore"  [misc]
synapse/storage/databases/main/__init__.py:114: error: Cannot determine type of "server_name" in base class "ClientIpWorkerStore"  [misc]
synapse/storage/databases/main/__init__.py:114: error: Cannot determine type of "server_name" in base class "DeviceInboxWorkerStore"  [misc]
synapse/storage/databases/main/__init__.py:114: error: Cannot determine type of "server_name" in base class "EventsWorkerStore"  [misc]
synapse/storage/databases/main/__init__.py:114: error: Cannot determine type of "_instance_name" in base class "ReceiptsWorkerStore"  [misc]
synapse/storage/databases/main/__init__.py:114: error: Cannot determine type of "_instance_name" in base class "DeviceInboxWorkerStore"  [misc]
synapse/app/generic_worker.py:117: error: Cannot determine type of "_instance_name" in base class "DeviceInboxWorkerStore"  [misc]
synapse/app/generic_worker.py:117: error: Cannot determine type of "_instance_name" in base class "ReceiptsWorkerStore"  [misc]
Found 22 errors in 7 files (checked 937 source files)
```
Comment thread synapse/util/caches/__init__.py Outdated
from prometheus_client.core import Gauge

from synapse.config.cache import add_resizable_cache
from synapse.metrics import INSTANCE_LABEL_NAME
Contributor Author

The metrics being refactored to be homeserver scoped are in this file.

The rest of the changes are to support that change and supply the server_name to the instance label.

Comment on lines +157 to +162
class HasServerName(Protocol):
    server_name: str
    """
    The homeserver name that this cache is associated with (used to label the metric)
    (`hs.hostname`).
    """
Contributor Author

@MadLittleMods MadLittleMods Jul 1, 2025

This pattern is copied from Measure

class HasClock(Protocol):
    clock: Clock

(The Measure pattern is also updated in #18601)
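
For readers unfamiliar with the pattern: `Protocol` gives structural typing, so any object with a `server_name: str` attribute is accepted without inheriting from a shared base class. A minimal sketch of the idea follows. `FakeStore`, `cache_metric_labels`, and the label names in the returned dict are hypothetical and used only for illustration:

```python
from typing import Protocol


class HasServerName(Protocol):
    """Anything exposing a `server_name` attribute satisfies this protocol."""

    server_name: str


class FakeStore:
    """Hypothetical stand-in for a class that owns a cache (not a Synapse class)."""

    def __init__(self, server_name: str) -> None:
        self.server_name = server_name


def cache_metric_labels(cache_name: str, owner: HasServerName) -> dict[str, str]:
    """Build metric labels from the cache name and the owner's homeserver name."""
    return {"name": cache_name, "server_name": owner.server_name}


# `FakeStore` never subclasses `HasServerName`, yet a type checker accepts it
# because it has the required attribute.
labels = cache_metric_labels("get_user_by_id", FakeStore("hs1.example.com"))
```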

@MadLittleMods MadLittleMods marked this pull request as ready for review July 1, 2025 00:59
@MadLittleMods MadLittleMods requested a review from a team as a code owner July 1, 2025 00:59
Comment thread synapse/metrics/__init__.py Outdated
Comment thread synapse/handlers/profile.py Outdated
Conflicts:
	synapse/http/federation/matrix_federation_agent.py
	synapse/http/federation/well_known_resolver.py
	synapse/storage/_base.py
	synapse/storage/controllers/state.py
Member

@anoadragon453 anoadragon453 left a comment

This LGTM! Thank you for updating the docstrings of each function you touched to include all parameter names.

Comment thread synapse/metrics/__init__.py Outdated
Comment on lines +75 to +76
Normally, this would be set automatically by the Prometheus server scraping the data but
since we support multiple instances of Synapse running in the same process and all
Member

absolute nit, readability:

Suggested change
- Normally, this would be set automatically by the Prometheus server scraping the data but
- since we support multiple instances of Synapse running in the same process and all
+ Normally, this would be set automatically by the Prometheus server scraping the data. But
+ since we support multiple instances of Synapse running in the same process and all

Contributor Author

@MadLittleMods MadLittleMods Jul 16, 2025

That is from an old version of the diff. This is the latest:

SERVER_NAME_LABEL = "server_name"
"""
The `server_name` label is used to identify the homeserver that the metrics correspond
to. Because we support multiple instances of Synapse running in the same process and all
metrics are in a single global `REGISTRY`, we need to manually label any metrics.
In the case of a Synapse homeserver, this should be set to the homeserver name
(`hs.hostname`).
We're purposely not using the `instance` label for this purpose as that should be "The
<host>:<port> part of the target's URL that was scraped.". Also: "In Prometheus
terms, an endpoint you can scrape is called an *instance*, usually corresponding to a
single process." (source: https://prometheus.io/docs/concepts/jobs_instances/)
"""

Comment thread Cargo.lock Outdated
@MadLittleMods MadLittleMods merged commit 88785db into develop Jul 16, 2025
74 of 76 checks passed
@MadLittleMods MadLittleMods deleted the madlittlemods/per-hs-metrics-cache branch July 16, 2025 21:04
@MadLittleMods
Contributor Author

Thanks for the review @anoadragon453 🦜

MadLittleMods added a commit that referenced this pull request Jul 30, 2025
Same changelog as #18604 so they merge
MadLittleMods added a commit that referenced this pull request Aug 1, 2025
Follow-up to #18604

Previously, our cache metrics did include the `server_name` label as
expected, but we were only seeing the last server being reported. This
happened because we called
`CACHE_METRIC_REGISTRY.register_hook(metric_name, metric.collect)`, where
the `metric_name` only took the cache name into account, so the hook would be
overwritten every time we spawned a new server.

This PR updates the register logic to include the `server_name` so we
have a hook for every cache on every server as expected.

I noticed this problem thanks to some [tests in the Synapse Pro for
Small Hosts](element-hq/synapse-small-hosts#173)
repo that sanity check all metrics to ensure that we can see each metric
includes data from multiple servers.
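
The overwrite described in this follow-up can be reproduced with a simplified model. This is a sketch only, not the actual `CACHE_METRIC_REGISTRY` implementation: the dict-based registries and both `register_*` functions are hypothetical, and the cache and server names are made up:

```python
from typing import Callable

Hook = Callable[[], int]


def register_by_cache_name(
    hooks: dict, cache_name: str, server_name: str, hook: Hook
) -> None:
    """Buggy keying: the key ignores `server_name`, so later servers overwrite earlier ones."""
    hooks[cache_name] = hook


def register_by_cache_and_server(
    hooks: dict, cache_name: str, server_name: str, hook: Hook
) -> None:
    """Fixed keying: one hook per (cache, server) pair."""
    hooks[(cache_name, server_name)] = hook


buggy: dict = {}
fixed: dict = {}
for server_name, size in [("hs1", 10), ("hs2", 25)]:
    # Bind `size` as a default argument so each hook reports its own value.
    hook = lambda size=size: size
    register_by_cache_name(buggy, "get_user", server_name, hook)
    register_by_cache_and_server(fixed, "get_user", server_name, hook)

# Collect what each registry would report when scraped.
buggy_sizes = sorted(h() for h in buggy.values())
fixed_sizes = sorted(h() for h in fixed.values())
```

Here `buggy_sizes` contains only the value from the last server registered, while `fixed_sizes` contains one value per server, which matches the symptom and the fix described above.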