Refactor cache metrics to be homeserver-scoped#18604
Conversation
``` synapse/replication/tcp/streams/_base.py:568: error: Cannot determine type of "_device_list_id_gen" [has-type] synapse/storage/databases/main/event_push_actions.py:256: error: Cannot determine type of "server_name" in base class "ReceiptsWorkerStore" [misc] synapse/storage/databases/main/event_push_actions.py:256: error: Cannot determine type of "server_name" in base class "EventsWorkerStore" [misc] synapse/storage/databases/main/event_push_actions.py:256: error: Cannot determine type of "_instance_name" in base class "ReceiptsWorkerStore" [misc] synapse/storage/databases/main/metrics.py:64: error: Cannot determine type of "server_name" in base class "ReceiptsWorkerStore" [misc] synapse/storage/databases/main/metrics.py:64: error: Cannot determine type of "server_name" in base class "EventsWorkerStore" [misc] synapse/storage/databases/main/metrics.py:64: error: Cannot determine type of "_instance_name" in base class "ReceiptsWorkerStore" [misc] synapse/storage/databases/main/push_rule.py:118: error: Cannot determine type of "_instance_name" in base class "ReceiptsWorkerStore" [misc] synapse/storage/databases/main/push_rule.py:118: error: Cannot determine type of "server_name" in base class "ReceiptsWorkerStore" [misc] synapse/storage/databases/main/push_rule.py:118: error: Cannot determine type of "server_name" in base class "EventsWorkerStore" [misc] synapse/storage/databases/main/account_data.py:60: error: Cannot determine type of "_instance_name" in base class "ReceiptsWorkerStore" [misc] synapse/storage/databases/main/account_data.py:60: error: Cannot determine type of "server_name" in base class "ReceiptsWorkerStore" [misc] synapse/storage/databases/main/account_data.py:60: error: Cannot determine type of "server_name" in base class "EventsWorkerStore" [misc] synapse/storage/databases/main/__init__.py:114: error: Cannot determine type of "server_name" in base class "PresenceStore" [misc] synapse/storage/databases/main/__init__.py:114: error: Cannot determine type of "server_name" in base class "ReceiptsWorkerStore" [misc] synapse/storage/databases/main/__init__.py:114: error: Cannot determine type of "server_name" in base class "ClientIpWorkerStore" [misc] synapse/storage/databases/main/__init__.py:114: error: Cannot determine type of "server_name" in base class "DeviceInboxWorkerStore" [misc] synapse/storage/databases/main/__init__.py:114: error: Cannot determine type of "server_name" in base class "EventsWorkerStore" [misc] synapse/storage/databases/main/__init__.py:114: error: Cannot determine type of "_instance_name" in base class "ReceiptsWorkerStore" [misc] synapse/storage/databases/main/__init__.py:114: error: Cannot determine type of "_instance_name" in base class "DeviceInboxWorkerStore" [misc] synapse/app/generic_worker.py:117: error: Cannot determine type of "_instance_name" in base class "DeviceInboxWorkerStore" [misc] synapse/app/generic_worker.py:117: error: Cannot determine type of "_instance_name" in base class "ReceiptsWorkerStore" [misc] Found 22 errors in 7 files (checked 937 source files) ```
| from prometheus_client.core import Gauge | ||
|
|
||
| from synapse.config.cache import add_resizable_cache | ||
| from synapse.metrics import INSTANCE_LABEL_NAME |
There was a problem hiding this comment.
The metrics being refactored to be homeserver scoped are in this file.
The rest of the changes are to support that change and supply the server_name to the instance label.
| class HasServerName(Protocol): | ||
| server_name: str | ||
| """ | ||
| The homeserver name that this cache is associated with (used to label the metric) | ||
| (`hs.hostname`). | ||
| """ |
There was a problem hiding this comment.
This pattern is copied from Measure
synapse/synapse/util/metrics.py
Lines 95 to 96 in a2bee2f
(The Measure pattern is also updated in #18601)
Conflicts: synapse/http/federation/matrix_federation_agent.py synapse/http/federation/well_known_resolver.py synapse/storage/_base.py synapse/storage/controllers/state.py
anoadragon453
left a comment
There was a problem hiding this comment.
This LGTM! Thank you for updating the docstrings of each function you touched to include all parameter names.
| Normally, this would be set automatically by the Prometheus server scraping the data but | ||
| since we support multiple instances of Synapse running in the same process and all |
There was a problem hiding this comment.
absolute nit, readability:
| Normally, this would be set automatically by the Prometheus server scraping the data but | |
| since we support multiple instances of Synapse running in the same process and all | |
| Normally, this would be set automatically by the Prometheus server scraping the data. But | |
| since we support multiple instances of Synapse running in the same process and all |
There was a problem hiding this comment.
That is from an old version of the diff. This is the latest:
synapse/synapse/metrics/__init__.py
Lines 69 to 82 in fc10a5e
|
Thanks for the review @anoadragon453 🦜 |
Same changelog as #18604 so they merge
Follow-up to #18604 Previously, our cache metrics did include the `server_name` label as expected but we were only seeing the last server being reported. This was caused because we would `CACHE_METRIC_REGISTRY.register_hook(metric_name, metric.collect)` where the `metric_name` only took into account the cache name so it would be overwritten every time we spawn a new server. This PR updates the register logic to include the `server_name` so we have a hook for every cache on every server as expected. I noticed this problem thanks to some [tests in the Synapse Pro for Small Hosts](element-hq/synapse-small-hosts#173) repo that sanity check all metrics to ensure that we can see each metric includes data from multiple servers.
Refactor cache metrics to be homeserver-scoped (add
server_namelabel to cache metrics).Part of #18592
This can be reviewed commit by commit to skip over some of the bulk refactor but there are some fixes down the line and I'd prefer to keep the history than clean it all up in a rebase.
Testing strategy
See behavior of previous
metricslistenermetricslistener in yourhomeserver.yamlpoetry run synapse_homeserver --config-path homeserver.yamlhttp://localhost:9323/metricssynapse_util_caches_cache_size,synapse_util_caches_cache_hits,synapse_util_caches_cache_evicted_size, etc)See behavior of the
httpmetricsresourcemetricsresource to a new or existinghttplisteners in yourhomeserver.yamlpoetry run synapse_homeserver --config-path homeserver.yamlhttp://localhost:9322/_synapse/metrics(it's just aGETrequest so you can even do in the browser)synapse_util_caches_cache_size,synapse_util_caches_cache_hits,synapse_util_caches_cache_evicted_size, etc): example, example fromdevelopDev notes
LruCache/@cached,CacheMetricTodo
@cachedscripts-dev/mypy_synapse_plugin.pyworks correctly with cached functions@cachedmore but should be fine with how we've done it.Pull Request Checklist
EventStoretoEventWorkerStore.".code blocks.