Add configurable metric_timestamp_source, granularity, stable host ID, and NodeOperationDetail dedup for APM service map processor#6672
Conversation
…c timestamps (opensearch-project#6672) - Remove randomKey UUID label that caused cardinality explosion (new throwaway time series per metric emission) - Add stable trace_processor_host_id label (hostname) to prevent cross-host collisions in AMP without creating new series each window - Split timestamp granularity by metric type: - Sum metrics (request/error/fault): seconds-truncated timestamps for stable time series with minimal late-span collision risk - Histogram metrics (latency): minutes-truncated timestamps to aggregate more samples for richer bucket distributions Signed-off-by: Vamsi Manohar <reddyvam@amazon.com>
7e1f629 to
b4304f5
Compare
…c timestamps (opensearch-project#6672) - Remove randomKey UUID label that caused cardinality explosion (new throwaway time series per metric emission) - Add stable trace_processor_host_id label (hostname) to prevent cross-host collisions in AMP without creating new series each window - Split timestamp granularity by metric type: - Sum metrics (request/error/fault): seconds-truncated timestamps for stable time series with minimal late-span collision risk - Histogram metrics (latency): minutes-truncated timestamps to aggregate more samples for richer bucket distributions Signed-off-by: Vamsi Manohar <reddyvam@amazon.com>
dlvenable
left a comment
There was a problem hiding this comment.
Thank you @vamsimanohar for this contribution!
|
|
||
| private static final String SPANS_DB_SIZE = "spansDbSize"; | ||
| private static final String SPANS_DB_COUNT = "spansDbCount"; | ||
| private static final String HOST_ID = resolveHostId(); |
There was a problem hiding this comment.
This should probably be non-static and performed during the construction of the processor.
There was a problem hiding this comment.
Done. Changed HOST_ID from a static field to a non-static hostId instance field, resolved during construction via resolveHostId().
| labels.put("environment", serverSpan.getEnvironment()); | ||
| labels.put("service", serverSpan.getServiceName()); | ||
| labels.put("operation", serverSpan.getOperationName()); | ||
| labels.put("trace_processor_host_id", hostId); |
There was a problem hiding this comment.
We call this processor "service map", not "trace processor." Is this an existing label name? Might this name cause confusion?
There was a problem hiding this comment.
Good catch — renamed from trace_processor_host_id to service_map_processor_host_id to match the processor name.
| @ExtendWith(MockitoExtension.class) | ||
| class ApmServiceMapMetricsUtilTest { | ||
|
|
||
| private static final String TEST_HOST_ID = "test-host-1"; |
There was a problem hiding this comment.
Make this a non-static member. Generate it to a random value in @BeforeEach.
There was a problem hiding this comment.
Done. Changed TEST_HOST_ID to a non-static testHostId field, generated as a random UUID in @BeforeEach.
| String firstTimestamp = metrics.get(0).getTime(); | ||
| String secondTimestamp = metrics.get(1).getTime(); | ||
| assertTrue(firstTimestamp.compareTo(secondTimestamp) <= 0, | ||
| assertTrue(firstTimestamp.compareTo(secondTimestamp) <= 0, |
There was a problem hiding this comment.
Use Hamcrest lessThanOrEqualTo:
assertThat("Metrics should be sorted by timestamp"
firstTimestamp, lessThanOrEqualTo(secondTimestamp));
There was a problem hiding this comment.
Done. Switched to Hamcrest lessThanOrEqualTo for timestamp comparison.
| labels.put("remoteEnvironment", decoration.getRemoteEnvironment()); | ||
| labels.put("remoteService", decoration.getRemoteService()); | ||
| labels.put("remoteOperation", decoration.getRemoteOperation()); | ||
| labels.put("trace_processor_host_id", hostId); |
There was a problem hiding this comment.
This should be a private static final. I see it used three times in this file.
Also, maybe we should have small method named putUniqueLabels() that this and line 133 use. It is small, but this will help us keep a consistent view if things change over time.
There was a problem hiding this comment.
Done. Extracted HOST_ID_LABEL as a private static final constant, and added a putCommonLabels() helper method used by both generateMetricsForClientSpan and generateMetricsForServerSpan.
| * without revealing the actual hostname in emitted metrics. | ||
| * Falls back to a random UUID if hostname resolution fails. | ||
| */ | ||
| private static String resolveHostId() { |
There was a problem hiding this comment.
I wonder if we should perform this in Data Prepper core and expose it somehow in data-prepper-api. We get the hostname (or maybe IP address) in Data Prepper core for peer-forwarding. Maybe we want to consistency here. I'm not sure on the effort to value yet though.
There was a problem hiding this comment.
Done in this PR. Created a shared HostContext utility class in data-prepper-api (org.opensearch.dataprepper.model.host.HostContext) that uses InetAddress.getLocalHost().getHostName() — the same pattern used by LeaseBasedSourceCoordinator and EnhancedLeaseBasedSourceCoordinator in core. The processor now calls HostContext.getHostname() instead of resolving inline. This gives us a single shared hostname source that other components can also use.
| final byte[] hash = digest.digest(hostname.getBytes(java.nio.charset.StandardCharsets.UTF_8)); | ||
| return Hex.encodeHexString(hash).substring(0, 16); | ||
| } catch (Exception e) { | ||
| LOG.warn("Failed to resolve hostname for trace_processor_host_id, using random UUID: {}", e.getMessage()); |
There was a problem hiding this comment.
Can we have a better fallback here?
There was a problem hiding this comment.
Done. The hostname resolution is now handled by HostContext in data-prepper-api, which follows the same pattern as LeaseBasedSourceCoordinator — using InetAddress.getLocalHost().getHostName() with a clear IllegalStateException if resolution fails. The processor hashes the hostname via SHA-256 (truncated to 16 hex chars) for the label value.
| private static String resolveHostId() { | ||
| try { | ||
| final String hostname = InetAddress.getLocalHost().getHostName(); | ||
| final java.security.MessageDigest digest = java.security.MessageDigest.getInstance("SHA-256"); |
There was a problem hiding this comment.
Use imports instead of fully qualified names. There are a few places where this needs to be corrected.
There was a problem hiding this comment.
Done. Replaced all fully qualified names with proper imports.
106051c to
f0e87e1
Compare
…y, stable host ID, and NodeOperationDetail dedup for APM service map processor (opensearch-project#6672) Signed-off-by: vamsimanohar <vamsimanohar@users.noreply.github.com> Signed-off-by: Vamsi Manohar <reddyvam@amazon.com>
b598c2a to
f6a7469
Compare
…y, stable host ID, and NodeOperationDetail dedup for APM service map processor (opensearch-project#6672) Signed-off-by: vamsimanohar <vamsimanohar@users.noreply.github.com> Signed-off-by: Vamsi Manohar <reddyvam@amazon.com>
f6a7469 to
2a004f2
Compare
…y, stable host ID, and NodeOperationDetail dedup for APM service map processor (opensearch-project#6672) Signed-off-by: vamsimanohar <vamsimanohar@users.noreply.github.com> Signed-off-by: Vamsi Manohar <reddyvam@amazon.com>
2a004f2 to
736211a
Compare
…y, stable host ID, and NodeOperationDetail dedup for APM service map processor (opensearch-project#6672) Signed-off-by: vamsimanohar <vamsimanohar@users.noreply.github.com> Signed-off-by: Vamsi Manohar <reddyvam@amazon.com>
736211a to
f649776
Compare
…y, stable host ID, and NodeOperationDetail dedup for APM service map processor (opensearch-project#6672) Signed-off-by: vamsimanohar <vamsimanohar@users.noreply.github.com> Signed-off-by: Vamsi Manohar <reddyvam@amazon.com>
f649776 to
4c2fe57
Compare
Summary
Enhances the
otel_apm_service_mapprocessor with configurable timestamp handling, stable host identification, and reduced storage overhead for service map events.Changes
metric_timestamp_sourceconfig option (default:arrival_time)arrival_time: Usesclock.instant()at window evaluation — all spans in a window share the same timestamp, eliminating late-span collision risk in Prometheus/AMPspan_end_time: Uses the span'sendTimefield — provides span-aligned timestamps but risksErrDuplicateSampleForTimestampfor late-arriving spansmetric_timestamp_granularityconfig option (default:seconds)seconds: Truncates timestamps to second boundaries (1s collision window inspan_end_timemode)minutes: Truncates to minute boundaries (60s collision window — use only if coarser granularity is acceptable)Stable
service_map_processor_host_idlabel replaces random UUIDHostContextindata-prepper-api), truncated to 16 hex charsNodeOperationDetail dedup across all traces in a window
operationConnectionHash(ornodeConnectionHashfor leaves)Shared
HostContextutility indata-prepper-apiHostContext.getHostname()returns hostname or"unknown"gracefully (no crash on resolution failure)Configuration Example
Test Plan
MetricTimestampSource,MetricTimestampGranularity,HostContext, and config defaultsservice_map_processor_host_idIssues Resolved
Closes #6710