Add a way to expose metrics from the Docker image (SYNAPSE_ENABLE_METRICS) #19324

MadLittleMods merged 25 commits into `develop`

Conversation
```dockerfile
COPY ./docker/start.py /start.py
COPY ./docker/conf /conf

EXPOSE 8008/tcp 8009/tcp 8448/tcp
```
8009 was removed because its ACME support was removed in matrix-org/synapse#10194.
```python
# Keep the `shared_config` up to date with the `shared_extra_conf` from each
# worker.
shared_config = {
    **worker_config["shared_extra_conf"],
    # We combine `shared_config` second to avoid overwriting existing keys
    # because TODO: why?
    **shared_config,
}
```
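For reference, the `**` spreads here follow plain dict-literal semantics: later entries win on key conflicts, so putting `shared_config` second preserves values accumulated on earlier loop iterations. A standalone sketch of just that behavior (the worker option names are made up for illustration; this is not the script's actual worker list):

```python
# Plain dict-literal semantics: later entries win on key conflicts, so
# spreading `shared_config` second preserves already-accumulated values.
shared_config: dict = {}
workers = [  # hypothetical worker configs, for illustration only
    {"shared_extra_conf": {"enable_media_repo": False}},
    {"shared_extra_conf": {"enable_media_repo": True, "update_user_directory_from_worker": "user_dir1"}},
]
for worker_config in workers:
    shared_config = {
        **worker_config["shared_extra_conf"],
        **shared_config,  # second: keys already present are not overwritten
    }

print(shared_config)
# {'enable_media_repo': False, 'update_user_directory_from_worker': 'user_dir1'}
```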
```dockerfile
# SYNAPSE_ENABLE_METRICS=1). Metrics for workers are on ports starting from 19091 but
# since these are dynamic we don't expose them by default.
EXPOSE 19090/tcp
```
In a future PR, I think it would be useful to add a Prometheus service discovery endpoint to make it easy to discover all of the workers and their random ports here.
…rate-config`

This is necessary as the Docker image actually uses `--generate-config` to generate the main homeserver config. It's only in worker mode that it uses the other route.
Let `ServerConfig.generate_config_section(...)` figure it out
```
# regardless of the SYNAPSE_LOG_LEVEL setting.
# * SYNAPSE_LOG_TESTING: if set, Synapse will log additional information useful
#   for testing.
# * SYNAPSE_USE_UNIX_SOCKET: TODO
```
Something for a future PR to address ⏩
```yaml
{% if SYNAPSE_ENABLE_METRICS %}
  - type: metrics
    # The main process always uses the same port 19090
    #
    # Prometheus does not support Unix sockets so we don't bother with
    # `SYNAPSE_USE_UNIX_SOCKET`, https://github.com/prometheus/prometheus/issues/12024
    port: 19090
{% endif %}
```
The whole config situation for our Docker image is pretty confusing. This `docker/conf/homeserver.yaml` is only used for the `migrate_config` mode (lines 218 to 230 at 7a24faf).

For the generate mode, we also have a bunch of changes to support `SYNAPSE_ENABLE_METRICS` (see `synapse/config`).
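To make the template gate above concrete, here is a minimal sketch of how a truthy vs. unset `SYNAPSE_ENABLE_METRICS` toggles that block when the Jinja template is rendered (the template string is abbreviated and hypothetical; this is not the image's actual rendering code):

```python
from jinja2 import Template

# Abbreviated stand-in for the listener section of the template above.
template = Template(
    """\
listeners:
{% if SYNAPSE_ENABLE_METRICS %}
  - type: metrics
    port: 19090
{% endif %}
"""
)

# A truthy value renders the metrics listener...
print(template.render(SYNAPSE_ENABLE_METRICS=True))
# ...while leaving it unset (falsy/undefined) drops the block entirely.
print(template.render())
```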
```python
def strtobool(val: str) -> bool:
    """Convert a string representation of truth to True or False

    True values are 'y', 'yes', 't', 'true', 'on', and '1'; false values
    are 'n', 'no', 'f', 'false', 'off', and '0'. Raises ValueError if
    'val' is anything else.

    This is lifted from distutils.util.strtobool, with the exception that it actually
    returns a bool, rather than an int.
    """
    val = val.lower()
    if val in ("y", "yes", "t", "true", "on", "1"):
        return True
    elif val in ("n", "no", "f", "false", "off", "0"):
        return False
    else:
        raise ValueError("invalid truth value %r" % (val,))
```
This is copied from `synapse/util/stringutils.py` (lines 249 to 265 at 7a24faf).
In `docker/start.py`, we don't seem to have any dependencies on Synapse code, so I just lifted it over here.
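For illustration, the way such a helper would typically be consumed against the environment variable might look like this (a sketch using the `strtobool` defined above; not the exact `start.py` wiring):

```python
import os

# Parse the env var with the helper quoted above; unset defaults to "0"/False.
enable_metrics = strtobool(os.environ.get("SYNAPSE_ENABLE_METRICS", "0"))

if enable_metrics:
    # e.g. add the `metrics` listener / `enable_metrics: true` to the
    # generated config (illustrative placeholder).
    print("metrics enabled")
```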
```yaml
listeners:
- port: 8008
- bind_addresses:
  - ::1
  - 127.0.0.1
  port: 8008
  resources:
  - compress: false
    names:
    - client
    - federation
  tls: false
  type: http
  x_forwarded: true
```
These changes are generated from `poetry run scripts-dev/generate_sample_config.sh`.

This has changed because we're now passing in a default set of listeners instead of doing raw string manipulation. See `synapse/config/_base.py` and `synapse/config/server.py`.
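Schematically, the difference is between splicing YAML strings and dumping a data structure. A sketch of the latter (the listener dict mirrors the sample output above; this is not the actual `generate_config_section` code):

```python
import yaml

# The default listener expressed as data rather than a template string.
default_listeners = [
    {
        "port": 8008,
        "tls": False,
        "type": "http",
        "x_forwarded": True,
        "bind_addresses": ["::1", "127.0.0.1"],
        "resources": [{"names": ["client", "federation"], "compress": False}],
    }
]

# Dumping a data structure yields valid, consistently formatted YAML by
# construction, unlike raw string manipulation.
print(yaml.safe_dump({"listeners": default_listeners}, default_flow_style=False))
```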
anoadragon453 left a comment:
I agree that the way we generate worker configs for testing is pretty convoluted. I'm also conscious that ESS is trying to do the same thing for production services, albeit with proprietary components.
Either way, thanks for this. Very useful option!
```
# regardless of the SYNAPSE_LOG_LEVEL setting.
# * SYNAPSE_LOG_TESTING: if set, Synapse will log additional information useful
#   for testing.
# * SYNAPSE_USE_UNIX_SOCKET: TODO
```
```diff
-# * SYNAPSE_USE_UNIX_SOCKET: TODO
+# * SYNAPSE_USE_UNIX_SOCKET: TODO (https://github.com/prometheus/prometheus/issues/12024)
```
Let's add the blocker here as well.
This is a general doc comment for SYNAPSE_USE_UNIX_SOCKET.
prometheus/prometheus#12024 isn't relevant to link there.
Thanks for the review @anoadragon453 🦋
…configuration scripts (#19323)

For reference, this PR used to include this whole `shared_config` block in the diff. But #19324 was merged first, which introduced parts of it already. Here is what this code used to look like: https://github.com/element-hq/synapse/blob/566670c363915691826b5b435c4aa7acde61b408/docker/configure_workers_and_start.py#L865-L868

---

Original context for why it was changed this way: matrix-org/synapse#14921 (comment)

Previously, this code made me question two things:

1. Do we actually use `worker_config["shared_extra_conf"]` in the templates?
   - At first glance, I couldn't see why we're updating `shared_extra_conf` here. It's not used in the `worker.yaml.j2` template, so all of this seemed a bit pointless.
   - Turns out, updating `shared_extra_conf` itself is pointless and it's being done as a convenient place to mix the objects to get things right in `shared_config` (confusing).
1. Does it actually do anything?
   - Because `shared_config` starts out as an empty object, my first glance made me think we were just updating with an empty object and then just re-assigning. But because we're in a loop, we actually accumulate the `shared_extra_conf` from each worker.

I'm not sure whether I'm capturing my confusion well enough here, but basically, this made me spend time trying to figure out what/why we're doing things this way, and we can use a clearer pattern to accomplish the same thing.

---

This change is spawning from looking at the `docker/configure_workers_and_start.py` script in order to add a metrics listener ([upcoming PR](#19324)).
- Update `synapse_xxx` (server-level) metrics to use `server_name="$server_name"` instead of `instance="$instance"`
- Add `synapse_server_name_info` metric to map Synapse `server_name`s to the `instance`s they're hosted on (see the sketch after this list)
- For process-level metrics, update to use `xxx * on (instance, job, index) group_left(server_name) synapse_server_name_info{server_name="$server_name"}`
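The `synapse_server_name_info` metric follows the common Prometheus `_info` convention: a constant-`1` series whose labels exist only to be joined onto other series via `group_left` (Prometheus attaches `instance`/`job` at scrape time). A minimal sketch with `prometheus_client` — the registration below is illustrative, not Synapse's actual code:

```python
from prometheus_client import Gauge, generate_latest

# Info-style metric: a constant 1 whose labels carry the server_name mapping.
server_name_info = Gauge(
    "synapse_server_name_info",
    "Maps a Synapse server_name to the process exposing it",
    ["server_name"],
)
server_name_info.labels(server_name="my.docker.synapse.server").set(1)

# The exposition output includes a line like:
#   synapse_server_name_info{server_name="my.docker.synapse.server"} 1.0
print(generate_latest().decode())
```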
All of the changes here are backwards compatible with whatever people
were doing before with their Prometheus/Grafana dashboards.
Previously, the recommendation was to use the `instance` label to group everything under the same server (https://github.com/element-hq/synapse/blob/803e4b4d884b2de4b9e20dc47ffb59a983b8a4b5/docs/metrics-howto.md#L93-L147). But the `instance` label has a special meaning, and we're actually abusing it by using it that way:

> `instance`: The `<host>:<port>` part of the target's URL that was scraped.
>
> *-- https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series*
Since #18592 (Synapse
`v1.139.0`), we now have the `server_name` label to use instead.
---
Additionally, the assumption that a single process is serving a single
server is no longer true with [Synapse Pro for small
hosts](https://docs.element.io/latest/element-server-suite-pro/synapse-pro-for-small-hosts/overview/).
Part of element-hq/synapse-small-hosts#106
### Motivating use case
Although this change also benefits [Synapse Pro for small
hosts](https://docs.element.io/latest/element-server-suite-pro/synapse-pro-for-small-hosts/overview/)
(element-hq/synapse-small-hosts#106), this is
actually spawning from adding Prometheus metrics to our workerized
Docker image (#19324,
#19336) with a more correct
label setup (without `instance`) and wanting the dashboard to be better.
### Testing strategy

1. Make sure your firewall allows the Docker containers to communicate with the host (`host.docker.internal`) so they can access exposed ports of other Docker containers. We want to allow Synapse to access the Prometheus container and Grafana to access the Prometheus container.
   - `sudo ufw allow in on docker0 comment "Allow traffic from the default Docker network to the host machine (host.docker.internal)"`
   - `sudo ufw allow in on br-+ comment "(from Matrix Complement testing) Allow traffic from custom Docker networks to the host machine (host.docker.internal)"`
   - [Complement firewall docs](https://github.com/matrix-org/complement/blob/ee6acd9154bbae2d0071a9cb39593c0a5e37268b/README.md#potential-conflict-with-firewall-software)
1. Build the Docker image for Synapse: `docker build -t matrixdotorg/synapse -f docker/Dockerfile .` ([docs](https://github.com/element-hq/synapse/blob/7a24fafbc376b9bffeb3277b1ad4aa950720c96c/docker/README-testing.md#building-and-running-the-images-manually))
1. Generate config for Synapse:
   ```
   docker run -it --rm \
     --mount type=volume,src=synapse-data,dst=/data \
     -e SYNAPSE_SERVER_NAME=my.docker.synapse.server \
     -e SYNAPSE_REPORT_STATS=yes \
     -e SYNAPSE_ENABLE_METRICS=1 \
     matrixdotorg/synapse:latest generate
   ```
1. Start Synapse:
   ```
   docker run -d --name synapse \
     --mount type=volume,src=synapse-data,dst=/data \
     -p 8008:8008 \
     -p 19090:19090 \
     matrixdotorg/synapse:latest
   ```
1. You should be able to see metrics from Synapse at http://localhost:19090/_synapse/metrics
1. Create a Prometheus config (`prometheus.yml`):
   ```yaml
   global:
     scrape_interval: 15s
     scrape_timeout: 15s
     evaluation_interval: 15s
   scrape_configs:
     - job_name: prometheus
       scrape_interval: 15s
       metrics_path: /_synapse/metrics
       scheme: http
       static_configs:
         - targets:
             # This should point to the Synapse metrics listener (we're using
             # `host.docker.internal` because this is from within the Prometheus
             # container)
             - host.docker.internal:19090
   ```
1. Start Prometheus (update the volume bind mount to the config you just saved somewhere):
   ```
   docker run \
     --detach \
     --name=prometheus \
     --add-host host.docker.internal:host-gateway \
     -p 9090:9090 \
     -v ~/Documents/code/random/prometheus-config/prometheus.yml:/etc/prometheus/prometheus.yml \
     prom/prometheus
   ```
1. Make sure you're seeing some data in Prometheus. On http://localhost:9090/query, search for `synapse_build_info`
1. Start [Grafana](https://hub.docker.com/r/grafana/grafana):
   ```
   docker run -d --name=grafana --add-host host.docker.internal:host-gateway -p 3000:3000 grafana/grafana
   ```
1. Visit the Grafana dashboard, http://localhost:3000/ (Credentials: `admin`/`admin`)
1. **Connections** -> **Data Sources** -> **Add data source** -> **Prometheus**
   - Prometheus server URL: `http://host.docker.internal:9090`
1. Import the Synapse dashboard: `contrib/grafana/synapse.json`
To test workers, you can use the testing strategy from #19336 (assumes both changes from this PR and the other PR are combined).
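If you'd rather script the smoke test than eyeball the browser, a small check along these lines works (a sketch; assumes the ports exposed in the steps above):

```python
import urllib.request

# Scrape the main process metrics listener exposed in the steps above.
with urllib.request.urlopen("http://localhost:19090/_synapse/metrics") as resp:
    body = resp.read().decode()

# `synapse_build_info` should be present if Synapse is exporting metrics.
assert "synapse_build_info" in body, "metrics are served but look empty"
print("OK: got %d bytes of metrics" % len(body))
```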
…all workers in Docker image (#19336)

Add a Prometheus [HTTP service discovery](https://prometheus.io/docs/prometheus/latest/http_sd/) endpoint for easy discovery of all workers in the Docker image.

Follow-up to #19324

Spawning from wanting to [run a load test](element-hq/synapse-rust-apps#397) against the Complement Docker image of Synapse and see metrics from the homeserver.

`GET http://<synapse_container>:9469/metrics/service_discovery`

```json5
[
  {
    "targets": [ "<host>", ... ],
    "labels": { "<labelname>": "<labelvalue>", ... }
  },
  ...
]
```

The metrics from each worker can also be accessed via `http://<synapse_container>:9469/metrics/worker/<worker_name>`, which is what the service discovery response points to behind the scenes. This way, you only need to expose a single port (9469) to access all metrics.

<details>
<summary>Real HTTP service discovery response</summary>

```json5
[
  { "targets": ["localhost:9469"], "labels": { "job": "event_persister", "index": "1", "__metrics_path__": "/metrics/worker/event_persister1" } },
  { "targets": ["localhost:9469"], "labels": { "job": "event_persister", "index": "2", "__metrics_path__": "/metrics/worker/event_persister2" } },
  { "targets": ["localhost:9469"], "labels": { "job": "background_worker", "index": "1", "__metrics_path__": "/metrics/worker/background_worker1" } },
  { "targets": ["localhost:9469"], "labels": { "job": "event_creator", "index": "1", "__metrics_path__": "/metrics/worker/event_creator1" } },
  { "targets": ["localhost:9469"], "labels": { "job": "user_dir", "index": "1", "__metrics_path__": "/metrics/worker/user_dir1" } },
  { "targets": ["localhost:9469"], "labels": { "job": "media_repository", "index": "1", "__metrics_path__": "/metrics/worker/media_repository1" } },
  { "targets": ["localhost:9469"], "labels": { "job": "federation_inbound", "index": "1", "__metrics_path__": "/metrics/worker/federation_inbound1" } },
  { "targets": ["localhost:9469"], "labels": { "job": "federation_reader", "index": "1", "__metrics_path__": "/metrics/worker/federation_reader1" } },
  { "targets": ["localhost:9469"], "labels": { "job": "federation_sender", "index": "1", "__metrics_path__": "/metrics/worker/federation_sender1" } },
  { "targets": ["localhost:9469"], "labels": { "job": "synchrotron", "index": "1", "__metrics_path__": "/metrics/worker/synchrotron1" } },
  { "targets": ["localhost:9469"], "labels": { "job": "client_reader", "index": "1", "__metrics_path__": "/metrics/worker/client_reader1" } },
  { "targets": ["localhost:9469"], "labels": { "job": "appservice", "index": "1", "__metrics_path__": "/metrics/worker/appservice1" } },
  { "targets": ["localhost:9469"], "labels": { "job": "pusher", "index": "1", "__metrics_path__": "/metrics/worker/pusher1" } },
  { "targets": ["localhost:9469"], "labels": { "job": "device_lists", "index": "1", "__metrics_path__": "/metrics/worker/device_lists1" } },
  { "targets": ["localhost:9469"], "labels": { "job": "device_lists", "index": "2", "__metrics_path__": "/metrics/worker/device_lists2" } },
  { "targets": ["localhost:9469"], "labels": { "job": "stream_writers", "index": "1", "__metrics_path__": "/metrics/worker/stream_writers1" } },
  { "targets": ["localhost:9469"], "labels": { "job": "main", "index": "1", "__metrics_path__": "/metrics/worker/main" } }
]
```

</details>

And here is how it ends up as targets in Prometheus (http://localhost:9090/targets): (image)

### Testing strategy
1. Make sure your firewall allows the Docker containers to communicate with the host (`host.docker.internal`) so they can access exposed ports of other Docker containers. We want to allow Synapse to access the Prometheus container and Grafana to access the Prometheus container.
   - `sudo ufw allow in on docker0 comment "Allow traffic from the default Docker network to the host machine (host.docker.internal)"`
   - `sudo ufw allow in on br-+ comment "(from Matrix Complement testing) Allow traffic from custom Docker networks to the host machine (host.docker.internal)"`
   - [Complement firewall docs](https://github.com/matrix-org/complement/blob/ee6acd9154bbae2d0071a9cb39593c0a5e37268b/README.md#potential-conflict-with-firewall-software)
1. Build the Docker image for Synapse: `docker build -t matrixdotorg/synapse -f docker/Dockerfile . && docker build -t matrixdotorg/synapse-workers -f docker/Dockerfile-workers .` ([docs](https://github.com/element-hq/synapse/blob/7a24fafbc376b9bffeb3277b1ad4aa950720c96c/docker/README-testing.md#building-and-running-the-images-manually))
1. Start Synapse:
   ```
   docker run -d --name synapse \
     --mount type=volume,src=synapse-data,dst=/data \
     -e SYNAPSE_SERVER_NAME=my.docker.synapse.server \
     -e SYNAPSE_REPORT_STATS=no \
     -e SYNAPSE_ENABLE_METRICS=1 \
     -p 8008:8008 \
     -p 9469:9469 \
     matrixdotorg/synapse-workers:latest
   ```
   - Also try with workers:
     ```
     docker run -d --name synapse \
       --mount type=volume,src=synapse-data,dst=/data \
       -e SYNAPSE_SERVER_NAME=my.docker.synapse.server \
       -e SYNAPSE_REPORT_STATS=no \
       -e SYNAPSE_ENABLE_METRICS=1 \
       -e SYNAPSE_WORKER_TYPES="\
         event_persister:2, \
         background_worker, \
         event_creator, \
         user_dir, \
         media_repository, \
         federation_inbound, \
         federation_reader, \
         federation_sender, \
         synchrotron, \
         client_reader, \
         appservice, \
         pusher, \
         device_lists:2, \
         stream_writers=account_data+presence+receipts+to_device+typing" \
       -p 8008:8008 \
       -p 9469:9469 \
       matrixdotorg/synapse-workers:latest
     ```
1. You should be able to see the Prometheus service discovery endpoint at http://localhost:9469/metrics/service_discovery
1. Create a Prometheus config (`prometheus.yml`):
   ```yaml
   global:
     scrape_interval: 15s
     scrape_timeout: 15s
     evaluation_interval: 15s
   scrape_configs:
     - job_name: synapse
       scrape_interval: 15s
       metrics_path: /_synapse/metrics
       scheme: http
       # We set `honor_labels` so that each service can set their own `job` label
       #
       # > honor_labels controls how Prometheus handles conflicts between labels that are
       # > already present in scraped data and labels that Prometheus would attach
       # > server-side ("job" and "instance" labels, manually configured target
       # > labels, and labels generated by service discovery implementations).
       # >
       # > *-- https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config*
       honor_labels: true
       # Use HTTP service discovery
       #
       # Reference:
       # - https://prometheus.io/docs/prometheus/latest/http_sd/
       # - https://prometheus.io/docs/prometheus/latest/configuration/configuration/#http_sd_config
       http_sd_configs:
         - url: 'http://localhost:9469/metrics/service_discovery'
   ```
1. Start Prometheus (update the volume bind mount to the config you just saved somewhere):
   ```
   docker run \
     --detach \
     --name=prometheus \
     --add-host host.docker.internal:host-gateway \
     -p 9090:9090 \
     -v ~/Documents/code/random/prometheus-config/prometheus.yml:/etc/prometheus/prometheus.yml \
     prom/prometheus
   ```
1. Make sure you're seeing some data in Prometheus.
   On http://localhost:9090/query, search for `synapse_build_info`
1. Start [Grafana](https://hub.docker.com/r/grafana/grafana):
   ```
   docker run -d --name=grafana --add-host host.docker.internal:host-gateway -p 3000:3000 grafana/grafana
   ```
1. Visit the Grafana dashboard, http://localhost:3000/ (Credentials: `admin`/`admin`)
1. **Connections** -> **Data Sources** -> **Add data source** -> **Prometheus**
   - Prometheus server URL: `http://host.docker.internal:9090`
1. Import the Synapse dashboard: https://github.com/element-hq/synapse/blob/develop/contrib/grafana/synapse.json
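A quick way to sanity-check the discovery document from a script (a sketch; assumes the endpoint and port exposed in the steps above):

```python
import json
import urllib.request

# Pull the HTTP service discovery document and list the advertised targets.
with urllib.request.urlopen("http://localhost:9469/metrics/service_discovery") as resp:
    target_groups = json.load(resp)

for group in target_groups:
    labels = group.get("labels", {})
    print(group["targets"], labels.get("job"), labels.get("__metrics_path__"))
```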
Add a way to expose metrics from the Docker image (`SYNAPSE_ENABLE_METRICS`).

Spawning from wanting to run a load test against the Complement Docker image of Synapse and see metrics from the homeserver.
### Why not just provide your own homeserver config?
Probably possible, but it gets tricky when you try to use the workers variant of the Docker image (`docker/Dockerfile-workers`). The way to work around it would probably be to `yq` edit everything in a script and change `/data/homeserver.yaml` and `/conf/workers/*.yaml` to add the `metrics` listener, and then modify `/conf/workers/shared.yaml` to add `enable_metrics: true`. Doesn't spark much joy.
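For a sense of what that workaround would entail, a rough sketch with PyYAML instead of `yq` (the paths are the image's; everything else is illustrative, and you'd still need to repeat the listener edit for every `/conf/workers/*.yaml`):

```python
import yaml

# Add a `metrics` listener to the main homeserver config...
with open("/data/homeserver.yaml") as f:
    config = yaml.safe_load(f)
config.setdefault("listeners", []).append({"type": "metrics", "port": 19090})
with open("/data/homeserver.yaml", "w") as f:
    yaml.safe_dump(config, f)

# ...and flip the flag in the shared worker config.
with open("/conf/workers/shared.yaml") as f:
    shared = yaml.safe_load(f)
shared["enable_metrics"] = True
with open("/conf/workers/shared.yaml", "w") as f:
    yaml.safe_dump(shared, f)
```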
### Testing strategy

1. Make sure your firewall allows the Docker containers to communicate with the host (`host.docker.internal`) so they can access exposed ports of other Docker containers. We want to allow Synapse to access the Prometheus container and Grafana to access the Prometheus container.
   - `sudo ufw allow in on docker0 comment "Allow traffic from the default Docker network to the host machine (host.docker.internal)"`
   - `sudo ufw allow in on br-+ comment "(from Matrix Complement testing) Allow traffic from custom Docker networks to the host machine (host.docker.internal)"`
1. Build the Docker image for Synapse: `docker build -t matrixdotorg/synapse -f docker/Dockerfile .` ([docs](https://github.com/element-hq/synapse/blob/7a24fafbc376b9bffeb3277b1ad4aa950720c96c/docker/README-testing.md#building-and-running-the-images-manually))
1. Create a Prometheus config (`prometheus.yml`)
1. Make sure you're seeing some data in Prometheus; search for `synapse_build_info`
1. Visit the Grafana dashboard (Credentials: `admin`/`admin`)
   - Prometheus server URL: `http://host.docker.internal:9090`

### Dev notes
- `instance` vs `job` labels, https://prometheus.io/docs/concepts/jobs_instances/
- matrix.org: Refactor metrics to be scoped to the homeserver #18592 (comment)
- `SYNAPSE_METRICS_UNIX_SOCKETS`; mentioned in Prometheus Enhancements realtyem/synapse-workers#3, but also there is a comment that Prometheus doesn't support this yet
- `8009`