Context: open-telemetry/opentelemetry-go#7883 (comment)
After prometheus/docs#2835, both the Prometheus proto and OM 2.0 formats will allow exemplars greater than 128 runes, with the intention that scrapers will enforce limits on the size of exemplars. We should plan for how to remove client-side limits when the new protocol is used.
There is also a pressing problem in the OpenTelemetry Prometheus exporter. Today, the OTel Prometheus exporter doesn't do any truncation, so users can have exemplars dropped if an attribute filter is applied to a metric (the filtered attributes are automatically added as labels on the exemplar, which can push it over the limit). We would like to fix this, but I'm trying to figure out whether we should fix it in the OTel exporter or in the Prometheus client.
The primary reason to explore having the Prometheus client control limits and truncation is that it could apply different behavior depending on the scrape protocol. If we add a limit in the OpenTelemetry exporter instead, users will need to configure it differently based on the expected scrape protocol.
Design thoughts/ideas:
- My primary goal today is implementing truncation somewhere. Ideally truncation would be controllable by users in some way (e.g. by prioritizing the first labels passed), so that OTel can make sure the trace_id/span_id aren't truncated.
- Could truncation just become the default behavior? It might lead to more confusion ("where are my labels?"), but it is hard to imagine how it would break a user since it is making a hard limit softer.
- Should we make this configurable? Maybe an enum similar to `HandlerOpts.ErrorHandling`: `ExemplarErrorHandling: TruncateExemplar` vs `ExemplarErrorHandling: DropExemplar`.
- Currently, the error occurs during the exemplar update itself. Assuming OM 2.0 doesn't include client-side limits on exemplars, should we move the exemplar length validation to scrape time?
  - That would be problematic if we drop invalid exemplars (rather than truncating), since we could overwrite a valid exemplar with an invalid one that is then dropped. Today, the invalid one would just not overwrite the valid one.
Overall, switching from dropping to truncating seems like a good change, but I'm looking for other opinions.
cc @bwplotka @krajorama @ywwg @ArthurSens @NesterovYehor