feat(sdk-trace): implement span processor metrics #6504
anuraaga wants to merge 12 commits into open-telemetry:main
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #6504 +/- ##
==========================================
+ Coverage 95.76% 95.78% +0.02%
==========================================
Files 375 376 +1
Lines 12739 12801 +62
Branches 3013 3026 +13
==========================================
+ Hits 12200 12262 +62
Misses 539 539
const meter = config?.meterProvider
  ? config.meterProvider.getMeter('@opentelemetry/sdk-trace')
  : createNoopMeter();
this._metrics = new SpanProcessorMetrics(
In the SpanProcessorMetrics constructor, a callback is added to queueSize using meter.createObservableUpDownCounter().addCallback(). Looks like this callback is never removed. Since the callback captures this (to call getQueueSize), it prevents the SpanProcessor and its metrics from being garbage collected even after shutdown.
Oh thanks - I can add explicit cleanup. But just curious, shouldn't the GC still be able to handle cycles like that, or is there something else at play?
It's not a cycle issue. The problem is that the Meter lives longer than the span processor and holds onto the callback. Since the callback references the processor, the processor can never be garbage collected, even after shutdown(). Removing the callback in shutdown() should fix it.
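For reference, a minimal sketch of that kind of cleanup, assuming the instrument and callback are kept as fields on the class (the class shape and field names here are illustrative, not the PR's actual code):

```ts
import {
  Meter,
  ObservableResult,
  ObservableUpDownCounter,
} from '@opentelemetry/api';

class SpanProcessorMetricsSketch {
  private _queueSize: ObservableUpDownCounter;
  private _queueSizeCallback: (result: ObservableResult) => void;

  constructor(meter: Meter, private _getQueueSize: () => number) {
    this._queueSize = meter.createObservableUpDownCounter(
      'otel.sdk.processor.span.queue.size',
      { unit: '{span}' }
    );
    // The callback closes over `this`, and the Meter keeps a reference to it
    // until it is removed, so the processor cannot be collected while attached.
    this._queueSizeCallback = result => {
      result.observe(this._getQueueSize());
    };
    this._queueSize.addCallback(this._queueSizeCallback);
  }

  shutdown(): void {
    // Detach the callback so the longer-lived Meter no longer references this instance.
    this._queueSize.removeCallback(this._queueSizeCallback);
  }
}
```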
LGTM, modulo the removeCallback thing that Jackson brought up.
My manual test (in "experimental/packages/opentelemetry-sdk-node/"):
play.js
const {trace} = require('@opentelemetry/api');
const {NodeSDK} = require('./build/src/index.js');
const sdk = new NodeSDK();
sdk.start();
process.once('beforeExit', async () => {
return sdk.shutdown();
});
const tracer = trace.getTracer('manual');
tracer.startActiveSpan('myspan', span => {
span.end();
})
// Stay alive.
setInterval(() => {
console.log('.')
}, 10000);

Run a mock/dev collector that shows a summary of received OTLP:
npx @elastic/mockotlpserver -o summary,spacer

Run the play script:
OTEL_NODE_EXPERIMENTAL_SDK_METRICS=true \
OTEL_METRIC_EXPORT_INTERVAL=5000 \
OTEL_METRIC_EXPORT_TIMEOUT=5000 \
node play.js

Wait a bit, and the telemetry summary is:
------ metrics ------
sum "otel.sdk.span.started" (unit={span}, aggTemp=cumulative): 1 { 'otel.span.parent.origin': 'none', 'otel.span.sampling_result': 'RECORD_AND_SAMPLE' }
sum "otel.sdk.span.live" (unit={span}, aggTemp=cumulative): 0 { 'otel.span.sampling_result': 'RECORD_AND_SAMPLE' }
sum "otel.sdk.processor.span.queue.capacity" (unit={span}, aggTemp=cumulative): 2048 { 'otel.component.type': 'batching_span_processor', 'otel.component.name': 'batching_span_processor/0' }
sum "otel.sdk.processor.span.queue.size" (unit={span}, aggTemp=cumulative): 1 { 'otel.component.type': 'batching_span_processor', 'otel.component.name': 'batching_span_processor/0' }
------ trace 3dc16f (1 span) ------
span a80ba5 "myspan" (0.1ms, SPAN_KIND_INTERNAL, service.name=unknown_service:node, scope=manual)
------ metrics ------
sum "otel.sdk.span.started" (unit={span}, aggTemp=cumulative): 1 { 'otel.span.parent.origin': 'none', 'otel.span.sampling_result': 'RECORD_AND_SAMPLE' }
sum "otel.sdk.span.live" (unit={span}, aggTemp=cumulative): 0 { 'otel.span.sampling_result': 'RECORD_AND_SAMPLE' }
sum "otel.sdk.processor.span.processed" (unit={span}, aggTemp=cumulative): 1 { 'otel.component.type': 'batching_span_processor', 'otel.component.name': 'batching_span_processor/0' }
sum "otel.sdk.processor.span.queue.capacity" (unit={span}, aggTemp=cumulative): 2048 { 'otel.component.type': 'batching_span_processor', 'otel.component.name': 'batching_span_processor/0' }
sum "otel.sdk.processor.span.queue.size" (unit={span}, aggTemp=cumulative): 0 { 'otel.component.type': 'batching_span_processor', 'otel.component.name': 'batching_span_processor/0' }
Which looks good to me.
    ])
  : getSpanProcessorsFromEnv(
      sdkMetricsEnabled ? this._meterProvider : undefined
    );
nit: This ternary is getting a bit complex; it would be clearer as an if-block.
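A rough sketch of that restructuring, purely for illustration; the actual condition and surrounding code in the PR differ, and the helper name `resolveSpanProcessors` and its parameters are hypothetical:

```ts
import { MeterProvider } from '@opentelemetry/api';
import { SpanProcessor } from '@opentelemetry/sdk-trace-base';

// Hypothetical helper showing the if-block shape; getSpanProcessorsFromEnv is
// assumed to accept an optional MeterProvider, as in the snippet above.
function resolveSpanProcessors(
  configured: SpanProcessor[] | undefined,
  sdkMetricsEnabled: boolean,
  meterProvider: MeterProvider,
  getSpanProcessorsFromEnv: (mp?: MeterProvider) => SpanProcessor[]
): SpanProcessor[] {
  if (configured !== undefined) {
    return configured;
  }
  return getSpanProcessorsFromEnv(
    sdkMetricsEnabled ? meterProvider : undefined
  );
}
```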
  : getSpanProcessorsFromEnv();
// While SDK metrics are unstable, we require an opt-in.
// https://opentelemetry.io/docs/specs/semconv/otel/sdk-metrics/
const sdkMetricsEnabled = getBooleanFromEnv(
Ah, there is a merge issue here now, because #6433 went in.
Yup, fixed it - it seems the code in the two PRs was just different enough that it automerged unexpectedly 😂
Which problem is this PR solving?
I am helping to implement SDK internal metrics:
https://opentelemetry.io/docs/specs/semconv/otel/sdk-metrics/
This PR implements the span processor metrics.
Implementation in Java: https://github.com/open-telemetry/opentelemetry-java/pull/7895/changes#diff-57fb1f394e77f4d9b90d05aa09755a0b75d3c55bed16176c18f87850788dd664
After this PR is merged, I will send a very similar one for the log processor, and then SDK metrics should be complete.
/cc @trentm
Short description of the changes
Implement metrics for span processors per semconv
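For reviewers who want to try this programmatically rather than via NodeSDK, a rough usage sketch based on the config?.meterProvider snippet in the diff above; the exact option name and config shape are assumptions, not confirmed API:

```ts
import { MeterProvider } from '@opentelemetry/sdk-metrics';
import {
  BatchSpanProcessor,
  ConsoleSpanExporter,
  NodeTracerProvider,
} from '@opentelemetry/sdk-trace-node';

// Assumption: the processor config gains an optional `meterProvider` field,
// as suggested by the `config?.meterProvider` snippet in the diff above.
const meterProvider = new MeterProvider();
const spanProcessor = new BatchSpanProcessor(new ConsoleSpanExporter(), {
  meterProvider,
});

const tracerProvider = new NodeTracerProvider({
  spanProcessors: [spanProcessor],
});
tracerProvider.register();
```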
Type of change
How Has This Been Tested?
Checklist: