Add configurable metric_timestamp_source, granularity, stable host ID, and NodeOperationDetail dedup for APM service map processor#6672

Open
vamsimanohar wants to merge 1 commit into opensearch-project:main from vamsimanohar:remove-randomkey-apm-metrics
@vamsimanohar vamsimanohar commented Mar 24, 2026

Summary

Enhances the otel_apm_service_map processor with configurable timestamp handling, stable host identification, and reduced storage overhead for service map events.

Changes

  • metric_timestamp_source config option (default: arrival_time)

    • arrival_time: Uses clock.instant() at window evaluation — all spans in a window share the same timestamp, eliminating late-span collision risk in Prometheus/AMP
    • span_end_time: Uses the span's endTime field — provides span-aligned timestamps but risks ErrDuplicateSampleForTimestamp for late-arriving spans
  • metric_timestamp_granularity config option (default: seconds)

    • seconds: Truncates timestamps to second boundaries (1s collision window in span_end_time mode)
    • minutes: Truncates to minute boundaries (60s collision window — use only if coarser granularity is acceptable)
  • Stable service_map_processor_host_id label replaces random UUID

    • SHA-256 hash of hostname (via shared HostContext in data-prepper-api), truncated to 16 hex chars
    • Stable across restarts, consistent across all components
  • NodeOperationDetail dedup across all traces in a window

    • Same service relationship from different traces now emits one document per window instead of one per span
    • Dedup by operationConnectionHash (or nodeConnectionHash for leaves)
  • Shared HostContext utility in data-prepper-api

    • HostContext.getHostname() returns hostname or "unknown" gracefully (no crash on resolution failure)
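A minimal sketch of how the two timestamp options above might combine, assuming hypothetical names (`Source`, `Granularity`, `resolve`) that are not taken from the PR. It only illustrates choosing a base instant and truncating it with `Instant.truncatedTo`:

```java
import java.time.Clock;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

// Hypothetical names for illustration; the PR's actual classes may differ.
final class MetricTimestampSketch {

    enum Source { ARRIVAL_TIME, SPAN_END_TIME }

    enum Granularity {
        SECONDS(ChronoUnit.SECONDS),
        MINUTES(ChronoUnit.MINUTES);

        private final ChronoUnit unit;

        Granularity(ChronoUnit unit) {
            this.unit = unit;
        }

        Instant truncate(Instant instant) {
            return instant.truncatedTo(unit);
        }
    }

    // Picks the base instant according to the source, then truncates it.
    static Instant resolve(Source source, Granularity granularity,
                           Clock clock, Instant spanEndTime) {
        final Instant base = source == Source.ARRIVAL_TIME ? clock.instant() : spanEndTime;
        return granularity.truncate(base);
    }

    public static void main(String[] args) {
        // A fixed clock stands in for the moment the window is evaluated.
        final Clock windowClock = Clock.fixed(Instant.parse("2026-03-24T10:15:42.900Z"), ZoneOffset.UTC);
        final Instant spanEnd = Instant.parse("2026-03-24T10:14:07.250Z");

        System.out.println(resolve(Source.ARRIVAL_TIME, Granularity.SECONDS, windowClock, spanEnd));
        System.out.println(resolve(Source.SPAN_END_TIME, Granularity.MINUTES, windowClock, spanEnd));
    }
}
```

In `ARRIVAL_TIME` mode the span's own end time is ignored, so every span evaluated in the same window resolves to the same instant. That is the property the description relies on to avoid duplicate-timestamp collisions for late spans.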

Configuration Example

processor:
  - otel_apm_service_map:
      window_duration: 60s
      metric_timestamp_source: arrival_time
      metric_timestamp_granularity: seconds
      group_by_attributes:
        - "deployment.environment"

Test Plan

  • All existing unit tests pass
  • New tests for MetricTimestampSource, MetricTimestampGranularity, HostContext, and config defaults
  • Verified in running environment: all 4 metrics present in Prometheus with stable service_map_processor_host_id
  • Verified NodeOperationDetail dedup: each window emits exactly one document per unique relationship hash
  • Verified timestamps are seconds-truncated in both metrics and NodeOperationDetail
  • CI passing (Java 11, 17, 25)
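The NodeOperationDetail dedup checked in the test plan can be pictured as a keep-first-per-key pass over the events of one window. This is a sketch under assumptions: the `Detail` class, its fields, and the convention that a leaf node has a null `operationConnectionHash` are hypothetical stand-ins, not the processor's actual types.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NodeOperationDedupSketch {

    // Hypothetical stand-in for a NodeOperationDetail event.
    static final class Detail {
        final String traceId;
        final String operationConnectionHash; // assumed to be null for leaf nodes
        final String nodeConnectionHash;

        Detail(String traceId, String operationConnectionHash, String nodeConnectionHash) {
            this.traceId = traceId;
            this.operationConnectionHash = operationConnectionHash;
            this.nodeConnectionHash = nodeConnectionHash;
        }

        // Dedup by operationConnectionHash, or nodeConnectionHash for leaves.
        String dedupKey() {
            return operationConnectionHash != null ? operationConnectionHash : nodeConnectionHash;
        }
    }

    // Keeps the first event seen for each key and preserves arrival order.
    static List<Detail> dedupWindow(List<Detail> windowDetails) {
        final Map<String, Detail> firstByKey = new LinkedHashMap<>();
        for (Detail detail : windowDetails) {
            firstByKey.putIfAbsent(detail.dedupKey(), detail);
        }
        return new ArrayList<>(firstByKey.values());
    }
}
```

Two events with the same relationship hash from different traces collapse to one document for the window, while a leaf event keyed by its node hash is kept separately.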

Issues Resolved

Closes #6710

vamsimanohar added a commit to vamsimanohar/data-prepper that referenced this pull request Mar 25, 2026
…c timestamps (opensearch-project#6672)

- Remove randomKey UUID label that caused cardinality explosion (new
  throwaway time series per metric emission)
- Add stable trace_processor_host_id label (hostname) to prevent
  cross-host collisions in AMP without creating new series each window
- Split timestamp granularity by metric type:
  - Sum metrics (request/error/fault): seconds-truncated timestamps
    for stable time series with minimal late-span collision risk
  - Histogram metrics (latency): minutes-truncated timestamps to
    aggregate more samples for richer bucket distributions

Signed-off-by: Vamsi Manohar <reddyvam@amazon.com>
@vamsimanohar vamsimanohar force-pushed the remove-randomkey-apm-metrics branch 2 times, most recently from 7e1f629 to b4304f5 on March 25, 2026 00:08
@vamsimanohar vamsimanohar marked this pull request as ready for review March 25, 2026 00:14
Member

@dlvenable dlvenable left a comment

Thank you @vamsimanohar for this contribution!


private static final String SPANS_DB_SIZE = "spansDbSize";
private static final String SPANS_DB_COUNT = "spansDbCount";
private static final String HOST_ID = resolveHostId();
Member

This should probably be non-static and performed during the construction of the processor.

Member Author

Done. Changed HOST_ID from a static field to a non-static hostId instance field, resolved during construction via resolveHostId().
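One way to picture the change from a static field to an instance field is the sketch below. The class name and the `Supplier<String>` constructor parameter are assumptions made for illustration; the PR resolves the value via `resolveHostId()` during construction.

```java
import java.util.function.Supplier;

// Hypothetical processor skeleton; only the host ID handling is shown.
final class ServiceMapProcessorSketch {

    // Instance field: resolved once per processor instance, not at class load.
    private final String hostId;

    ServiceMapProcessorSketch(Supplier<String> hostIdResolver) {
        this.hostId = hostIdResolver.get();
    }

    String getHostId() {
        return hostId;
    }
}
```

Because resolution happens in the constructor, each instance can be given its own value, for example a test-specific ID, without touching static state.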

labels.put("environment", serverSpan.getEnvironment());
labels.put("service", serverSpan.getServiceName());
labels.put("operation", serverSpan.getOperationName());
labels.put("trace_processor_host_id", hostId);
Member

We call this processor "service map", not "trace processor." Is this an existing label name? Might this name cause confusion?

Member Author

Good catch — renamed from trace_processor_host_id to service_map_processor_host_id to match the processor name.

@ExtendWith(MockitoExtension.class)
class ApmServiceMapMetricsUtilTest {

private static final String TEST_HOST_ID = "test-host-1";
Member

Make this a non-static member. Set it to a random value in @BeforeEach.

Member Author

Done. Changed TEST_HOST_ID to a non-static testHostId field, generated as a random UUID in @BeforeEach.

String firstTimestamp = metrics.get(0).getTime();
String secondTimestamp = metrics.get(1).getTime();
assertTrue(firstTimestamp.compareTo(secondTimestamp) <= 0,
Member

Use Hamcrest lessThanOrEqualTo:

assertThat("Metrics should be sorted by timestamp",
  firstTimestamp, lessThanOrEqualTo(secondTimestamp));

Member Author

Done. Switched to Hamcrest lessThanOrEqualTo for timestamp comparison.

labels.put("remoteEnvironment", decoration.getRemoteEnvironment());
labels.put("remoteService", decoration.getRemoteService());
labels.put("remoteOperation", decoration.getRemoteOperation());
labels.put("trace_processor_host_id", hostId);
Member

This should be a private static final. I see it used three times in this file.

Also, maybe we should have a small method named putUniqueLabels() that both this and line 133 use. It is small, but it will help us keep a consistent view if things change over time.

Member Author

Done. Extracted HOST_ID_LABEL as a private static final constant, and added a putCommonLabels() helper method used by both generateMetricsForClientSpan and generateMetricsForServerSpan.
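A rough sketch of what such a constant and helper could look like. The method signatures, the `serverSpanLabels` and `clientSpanLabels` wrappers, and the assumption that client spans also carry the local environment, service, and operation labels are all hypothetical; only the label names come from the diff snippets in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CommonLabelsSketch {

    // Single definition of the label name, so a rename touches one line.
    private static final String HOST_ID_LABEL = "service_map_processor_host_id";

    // Hypothetical signature; the PR's putCommonLabels() may take different arguments.
    static void putCommonLabels(Map<String, String> labels, String environment,
                                String service, String operation, String hostId) {
        labels.put("environment", environment);
        labels.put("service", service);
        labels.put("operation", operation);
        labels.put(HOST_ID_LABEL, hostId);
    }

    static Map<String, String> serverSpanLabels(String environment, String service,
                                                String operation, String hostId) {
        final Map<String, String> labels = new LinkedHashMap<>();
        putCommonLabels(labels, environment, service, operation, hostId);
        return labels;
    }

    static Map<String, String> clientSpanLabels(String environment, String service,
                                                String operation, String remoteService,
                                                String hostId) {
        final Map<String, String> labels = new LinkedHashMap<>();
        putCommonLabels(labels, environment, service, operation, hostId);
        labels.put("remoteService", remoteService); // client-only label
        return labels;
    }
}
```

Routing both span types through one helper keeps the shared label set identical, which is the consistency concern raised in the review comment.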

* without revealing the actual hostname in emitted metrics.
* Falls back to a random UUID if hostname resolution fails.
*/
private static String resolveHostId() {
Member

I wonder if we should perform this in Data Prepper core and expose it somehow in data-prepper-api. We get the hostname (or maybe IP address) in Data Prepper core for peer-forwarding. Maybe we want consistency here. I'm not sure about the effort-to-value trade-off yet, though.

Member Author

Done in this PR. Created a shared HostContext utility class in data-prepper-api (org.opensearch.dataprepper.model.host.HostContext) that uses InetAddress.getLocalHost().getHostName() — the same pattern used by LeaseBasedSourceCoordinator and EnhancedLeaseBasedSourceCoordinator in core. The processor now calls HostContext.getHostname() instead of resolving inline. This gives us a single shared hostname source that other components can also use.
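For readers unfamiliar with the pattern, a shared hostname utility along these lines might look like the sketch below. It is not the PR's `HostContext` class: the class name is deliberately different, and the `"unknown"` fallback follows the behaviour stated in the PR summary rather than any code shown in this thread.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Illustrative stand-in for a shared hostname utility.
final class HostContextSketch {

    private static final String UNKNOWN_HOSTNAME = "unknown";

    private HostContextSketch() {
    }

    // Returns the local hostname, or "unknown" if it cannot be resolved.
    static String getHostname() {
        try {
            final String hostname = InetAddress.getLocalHost().getHostName();
            return (hostname == null || hostname.isEmpty()) ? UNKNOWN_HOSTNAME : hostname;
        } catch (UnknownHostException e) {
            return UNKNOWN_HOSTNAME;
        }
    }
}
```

Keeping the lookup in one place means every component that needs a host identity sees the same value and the same failure behaviour.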

final byte[] hash = digest.digest(hostname.getBytes(java.nio.charset.StandardCharsets.UTF_8));
return Hex.encodeHexString(hash).substring(0, 16);
} catch (Exception e) {
LOG.warn("Failed to resolve hostname for trace_processor_host_id, using random UUID: {}", e.getMessage());
Member

Can we have a better fallback here?

Member Author

Done. The hostname resolution is now handled by HostContext in data-prepper-api, which follows the same pattern as LeaseBasedSourceCoordinator — using InetAddress.getLocalHost().getHostName() with a clear IllegalStateException if resolution fails. The processor hashes the hostname via SHA-256 (truncated to 16 hex chars) for the label value.
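The hashing step described here can be sketched with the JDK alone. The diff snippet in this thread uses `Hex.encodeHexString`; the version below swaps in a manual hex loop so it has no external dependency, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class HostIdSketch {

    private static final int HASH_BYTES_KEPT = 8; // 8 bytes = 16 hex characters

    // SHA-256 of the hostname, truncated to the first 16 hex characters.
    static String hashHostname(String hostname) {
        try {
            final MessageDigest digest = MessageDigest.getInstance("SHA-256");
            final byte[] hash = digest.digest(hostname.getBytes(StandardCharsets.UTF_8));
            final StringBuilder hex = new StringBuilder(HASH_BYTES_KEPT * 2);
            for (int i = 0; i < HASH_BYTES_KEPT; i++) {
                // Mask to an unsigned value so negative bytes format as two hex digits.
                hex.append(String.format("%02x", hash[i] & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hashHostname("abc")); // prints ba7816bf8f01cfea
    }
}
```

The same hostname always yields the same 16-character value, so the label stays stable across restarts without exposing the hostname itself in emitted metrics.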

private static String resolveHostId() {
try {
final String hostname = InetAddress.getLocalHost().getHostName();
final java.security.MessageDigest digest = java.security.MessageDigest.getInstance("SHA-256");
Member

Use imports instead of fully qualified names. There are a few places where this needs to be corrected.

Member Author

Done. Replaced all fully qualified names with proper imports.

@vamsimanohar vamsimanohar force-pushed the remove-randomkey-apm-metrics branch from 106051c to f0e87e1 on April 14, 2026 15:59
vamsimanohar added a commit to vamsimanohar/data-prepper that referenced this pull request Apr 14, 2026
…y, stable host ID, and NodeOperationDetail dedup for APM service map processor (opensearch-project#6672)

Signed-off-by: vamsimanohar <vamsimanohar@users.noreply.github.com>

Signed-off-by: Vamsi Manohar <reddyvam@amazon.com>
@vamsimanohar vamsimanohar force-pushed the remove-randomkey-apm-metrics branch from b598c2a to f6a7469 on April 14, 2026 20:17
@vamsimanohar vamsimanohar changed the title from "Remove randomKey label from APM service map metrics" to "Add configurable metric_timestamp_source, granularity, stable host ID, and NodeOperationDetail dedup for APM service map processor" on Apr 14, 2026
@vamsimanohar vamsimanohar force-pushed the remove-randomkey-apm-metrics branch from f6a7469 to 2a004f2 on April 14, 2026 20:18
@vamsimanohar vamsimanohar force-pushed the remove-randomkey-apm-metrics branch from 2a004f2 to 736211a on April 14, 2026 21:17
@vamsimanohar vamsimanohar force-pushed the remove-randomkey-apm-metrics branch from 736211a to f649776 on April 14, 2026 21:24
@vamsimanohar vamsimanohar force-pushed the remove-randomkey-apm-metrics branch from f649776 to 4c2fe57 on April 14, 2026 21:31
Development

Successfully merging this pull request may close these issues.

APM service map metrics: cardinality explosion, Prometheus compatibility, and late-span data loss
