Adopt TracingChannel for observability #1036

@logaretm

I'd like to propose adding first-class TracingChannel support to the Anthropic Node.js SDK, following the pattern established by undici in Node.js core and adopted across the npm ecosystem.

TracingChannel is a higher-level API built on top of diagnostics_channel, designed specifically for tracing async operations. It provides structured lifecycle channels (start, end, error, asyncStart, asyncEnd) and propagates async context correctly. That propagation is the piece monkey-patching approaches lack, and its absence is what makes them fragile in real-world async applications.
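To make the lifecycle concrete, here is a minimal sketch of a subscriber observing one traced promise. The channel name `example:demo.operation` and the `runTraced` helper are illustrative assumptions, not part of any SDK; only `tracingChannel`, `subscribe`, and `tracePromise` come from `node:diagnostics_channel`.

```javascript
// CommonJS so the sketch runs as a plain Node.js script.
const dc = require('node:diagnostics_channel');

// Hypothetical channel name, used only for this demonstration.
const channel = dc.tracingChannel('example:demo.operation');

// Record each lifecycle event in the order the subscriber receives it.
const seen = [];
channel.subscribe({
  start() { seen.push('start'); },
  end() { seen.push('end'); },
  asyncStart() { seen.push('asyncStart'); },
  // tracePromise sets ctx.result before the async events are published.
  asyncEnd(ctx) { seen.push(`asyncEnd:${ctx.result}`); },
  error(ctx) { seen.push(`error:${ctx.error.message}`); },
});

// Traces one async operation and returns the events observed for it.
async function runTraced() {
  seen.length = 0;
  const context = { model: 'example-model' };
  const result = await channel.tracePromise(async () => 'ok', context);
  return { result, events: [...seen] };
}
```

For a promise that resolves, the subscriber sees start, end, asyncStart, and asyncEnd in that order: end marks the end of the synchronous part of the call, and asyncEnd marks settlement of the promise.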

Current APM instrumentations use IITM (import-in-the-middle) for ESM and RITM (require-in-the-middle) for CJS to monkey-patch SDK internals. This has several fragility concerns:

  • Runtime lock-in: both RITM and IITM rely on Node.js-specific module loader internals (Module._resolveFilename, module.register()). They don't work on Bun or Deno, which implement the Node.js API surface but not the module loader internals. The Anthropic SDK explicitly supports Node.js, Deno, and Bun, making monkey-patching especially inadequate.
  • ESM fragility: IITM is built on Node.js's module customization hooks, which are still evolving and have been a persistent source of breakage in the OTel JS ecosystem.
  • Initialization ordering: both require instrumentation to be set up before the SDK is first require()'d / import'd. Get the order wrong and instrumentation silently does nothing, which is very hard to debug in production.
  • Bundling and Externalization: Users have to ensure their instrumented modules are externalized, which is becoming very difficult to guarantee with more and more frameworks bundling server-side code into single executables, binaries, or deployment files.

The current instrumentation landscape for the Anthropic SDK illustrates this problem well. There is no official @opentelemetry/instrumentation-anthropic for JavaScript. Instead, APM vendors have independently built their own solutions:

  • Sentry uses IITM to intercept require('@anthropic-ai/sdk'), replaces the Anthropic constructor, and creates a deep recursive Proxy around the client instance to intercept method calls (client.messages.create, client.messages.stream, client.completions.create, etc.). This spans ~800 lines across 6 files: constructor wrapping, deep Proxy creation, method interception, streaming event accumulation, attribute extraction, error mapping, and type definitions.
  • Traceloop's OpenLLMetry patches Messages.prototype.create and Completions.prototype.create directly.
  • Arize AI's OpenInference patches Messages.prototype.create with its own span creation logic.

Every vendor independently replicates the same logic: intercept construction or patch prototypes, extract model/token attributes, handle streaming chunk accumulation, map error types. With native TracingChannel support, all of this reduces to one subscription per channel.

If the Anthropic SDK emits structured events through TracingChannel, instrumentation libraries become subscribers, not patches. Each tool listens independently with no ordering concerns, no clobbering, and no internal API dependency.


Proposed Tracing Channels

All channels use the Node.js TracingChannel API, which provides start, end, asyncStart, asyncEnd, and error sub-channels automatically. The channel design aims to support the OpenTelemetry Semantic Conventions for Generative AI systems, enabling APM vendors to produce standard gen_ai.* spans and attributes from the emitted events.

| TracingChannel | Tracks | Context fields |
| --- | --- | --- |
| @anthropic-ai/sdk:messages.create | Non-streaming message creation (messages.create without stream: true) | model, params |
| @anthropic-ai/sdk:messages.stream | Streaming message creation (messages.stream() or messages.create with stream: true), from request initiation to stream completion | model, params |
| @anthropic-ai/sdk:completions.create | Legacy completions endpoint | model, params |

Why Separate Channels

Each API method gets its own TracingChannel. This follows the diagnostics_channel design philosophy: many purpose-focused channels with their own subscriber sets, so dispatch is extremely cheap. Subscribers listen only to the operations they care about rather than filtering a firehose channel, which would add continuous overhead on every published message.

This also eliminates the need for a method discriminator field in the context — the channel name itself identifies the operation.

Operations like messages.countTokens, models.get, and messages.batches.create are administrative/utility calls that don't represent AI inference operations. APMs generally don't create GenAI spans for these. They are excluded to keep the channels focused on the operations that matter for tracing.

Context Properties

Shared across all channels:

| Field | Source | OTel attribute it enables |
| --- | --- | --- |
| model | params.model | gen_ai.request.model |
| params | Raw request parameters object | APMs extract: gen_ai.request.temperature, gen_ai.request.top_p, gen_ai.request.top_k, gen_ai.request.max_tokens, gen_ai.input.messages, gen_ai.system_instructions, gen_ai.request.available_tools |
| result | Raw response object (auto-set by TracingChannel on completion) | APMs extract: gen_ai.response.id, gen_ai.response.model, gen_ai.response.finish_reasons, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.cache_creation_input_tokens, gen_ai.usage.cache_read_input_tokens, gen_ai.response.text, gen_ai.response.tool_calls |

Why Raw Params and Response

The context passes raw params and the auto-set result (the API response) rather than pre-extracting individual attributes. This follows the pattern established by framework TracingChannel proposals (h3, Hono, Elysia) where raw objects are passed and APMs extract what they need. Benefits:

  1. Forward-compatible. New API parameters and response fields (thinking, citations, new content block types) are automatically available to subscribers without SDK changes.
  2. No duplication. model is a convenience accessor for the most common attribute. Everything else comes from the raw objects.
  3. Privacy is the subscriber's concern. The SDK emits what it has. APMs decide what to record based on their own recordInputs/recordOutputs policies.
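As a rough illustration of the subscriber side of this split, the sketch below maps the raw context onto a few gen_ai.* attributes. The helper name `extractAttributes` and its `recordInputs` option are hypothetical; the response fields it reads (`id`, `stop_reason`, `usage.input_tokens`, `usage.output_tokens`) follow the shape of a Messages API response.

```javascript
// Hypothetical subscriber-side helper: derives gen_ai.* attributes from the
// raw context fields (model, params, result) described in the table above.
function extractAttributes(ctx, options = {}) {
  const params = ctx.params ?? {};
  const result = ctx.result;

  const attrs = {
    'gen_ai.request.model': ctx.model,
    'gen_ai.request.max_tokens': params.max_tokens,
    'gen_ai.request.temperature': params.temperature,
  };

  // Inputs are only recorded when the subscriber's own policy allows it.
  if (options.recordInputs) {
    attrs['gen_ai.input.messages'] = params.messages;
  }

  // result is only present once the operation has completed successfully.
  if (result) {
    attrs['gen_ai.response.id'] = result.id;
    attrs['gen_ai.response.model'] = result.model;
    attrs['gen_ai.response.finish_reasons'] = result.stop_reason ? [result.stop_reason] : [];
    attrs['gen_ai.usage.input_tokens'] = result.usage?.input_tokens;
    attrs['gen_ai.usage.output_tokens'] = result.usage?.output_tokens;
  }

  // Drop attributes the request or response did not provide.
  return Object.fromEntries(Object.entries(attrs).filter(([, value]) => value !== undefined));
}
```

A subscriber would typically call such a helper from its asyncEnd handler, where both ctx.params and ctx.result are available.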

Streaming

For non-streaming requests, tracePromise wraps the full operation: start fires before the request, and asyncEnd fires when the response promise settles (preceded by error if it rejects).

For streaming (messages.stream() or stream: true on messages.create), the SDK returns a stream object immediately, but the work continues until all chunks arrive. The TracingChannel lifecycle should cover the full duration, from request initiation to stream completion. The result is populated with the final accumulated message (including total token usage, finish reasons, response ID, and content blocks) when the stream ends. This ensures APM spans reflect total generation time, not just time to first chunk.

This is particularly important for the Anthropic SDK because streaming responses arrive as a sequence of typed events (message_start, content_block_start, content_block_delta, content_block_stop, message_delta, message_stop). Today, every APM vendor independently implements streaming event accumulation logic. With TracingChannel, the SDK handles this internally and exposes the final accumulated result on asyncEnd.
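One possible way to stretch the lifecycle over the whole stream is sketched below. It is not the SDK's implementation: `traceStream`, `accumulate`, `fakeEvents`, and the `example:messages.stream` channel name are hypothetical, and the accumulator handles only text deltas. The idea is to hand tracePromise a promise that settles when iteration finishes, while the caller still receives the stream immediately.

```javascript
// CommonJS so the sketch runs as a plain Node.js script.
const dc = require('node:diagnostics_channel');

// Hypothetical helper: start is published now, asyncEnd when the stream finishes.
function traceStream(channel, context, source, accumulate) {
  let settle;
  const finished = new Promise((resolve, reject) => { settle = { resolve, reject }; });
  // The trace stays open until `finished` settles; stream errors reach the consumer below.
  channel.tracePromise(() => finished, context).catch(() => {});

  return (async function* () {
    const events = [];
    try {
      for await (const event of source) {
        events.push(event);
        yield event;
      }
    } catch (err) {
      settle.reject(err);
      throw err;
    } finally {
      // Also runs when the consumer stops early; a no-op if already rejected.
      settle.resolve(accumulate(events));
    }
  })();
}

// Hypothetical, text-only accumulator over the typed streaming events.
function accumulate(events) {
  const message = { id: undefined, text: '', stop_reason: null };
  for (const event of events) {
    if (event.type === 'message_start') message.id = event.message.id;
    if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
      message.text += event.delta.text;
    }
    if (event.type === 'message_delta') message.stop_reason = event.delta.stop_reason;
  }
  return message;
}

// Stand-in for a real event stream.
async function* fakeEvents() {
  yield { type: 'message_start', message: { id: 'msg_example' } };
  yield { type: 'content_block_delta', delta: { type: 'text_delta', text: 'Hel' } };
  yield { type: 'content_block_delta', delta: { type: 'text_delta', text: 'lo' } };
  yield { type: 'message_delta', delta: { stop_reason: 'end_turn' } };
}

async function demo() {
  const channel = dc.tracingChannel('example:messages.stream'); // hypothetical name
  const log = [];
  channel.subscribe({
    start() { log.push('start'); },
    asyncEnd(ctx) { log.push(`asyncEnd:${ctx.result.text}`); },
  });
  const types = [];
  const stream = traceStream(channel, { model: 'example-model' }, fakeEvents(), accumulate);
  for await (const event of stream) types.push(event.type);
  await new Promise((resolve) => setImmediate(resolve)); // let asyncEnd publish
  return { types, log };
}
```

In this sketch a stream that is never iterated never ends its trace, so a real implementation would also need to handle abandonment and cancellation.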


Example: What the SDK Emits

A simplified sketch of what the instrumentation looks like inside the SDK:

import dc from 'node:diagnostics_channel';

// One TracingChannel per traced method; the channel name identifies the operation.
const messagesCreateChannel = dc.tracingChannel('@anthropic-ai/sdk:messages.create');
const messagesStreamChannel = dc.tracingChannel('@anthropic-ai/sdk:messages.stream');

// Inside messages.create (non-streaming).
// _makeRequest stands in for the SDK's internal request call.
async function create(params) {
  // Skip tracing only when hasSubscribers is explicitly false.
  if (messagesCreateChannel.hasSubscribers === false) {
    return this._makeRequest(params);
  }

  const context = { model: params.model, params };
  return messagesCreateChannel.tracePromise(() => this._makeRequest(params), context);
}

// Inside messages.stream.
// Simplified: tracePromise on its own settles as soon as the stream object is returned.
// Covering the full stream duration needs manual lifecycle management (see Streaming).
async function stream(params) {
  if (messagesStreamChannel.hasSubscribers === false) {
    return this._makeStreamingRequest(params);
  }

  const context = { model: params.model, params };
  return messagesStreamChannel.tracePromise(() => this._makeStreamingRequest(params), context);
}

Each method gets its own channel. No Proxy, no constructor wrapping, no stream accumulation logic pushed onto consumers.


How APM Tools Use This

Today: Deep Proxy on the Client Constructor

Taking Sentry as an example, their Anthropic instrumentation uses IITM to intercept require('@anthropic-ai/sdk'), replaces the Anthropic constructor, and creates a deep recursive Proxy around the resulting client instance to intercept method calls at arbitrary nesting depth. This spans ~800 lines across 6 files:

  • instrumentation.ts (~100 lines): IITM module patching, constructor wrapping, prototype chain preservation
  • index.ts (~280 lines): deep Proxy creation, method interception, span lifecycle management, request/response attribute extraction
  • streaming.ts: async iterable wrapping, streaming event accumulation, tool call reconstruction across fragmented events
  • utils.ts (~80 lines): message extraction, error type mapping, system prompt handling
  • constants.ts: method registry mapping API paths to operation types
  • types.ts (~130 lines): type definitions for responses, streaming events, content blocks, options

This approach has several problems:

  • Replaces the Anthropic constructor, wrapping every client instance even if no APM is listening
  • Deep recursive Proxy on every property access. Every client.messages.create() call goes through multiple levels of Proxy get traps
  • IITM dependency. Only works on Node.js, not on Deno or Bun where the SDK also runs
  • Stream accumulation is fragile. Each vendor independently implements streaming event processing, reconstructing tool calls from content_block_start / content_block_delta / content_block_stop sequences
  • Each APM vendor builds their own. Sentry, Traceloop, Arize AI all independently replicate the same deep-proxy or prototype-patching pattern

With TracingChannel: Subscribe to Structured Events

import dc from 'node:diagnostics_channel';
// Assumption: the subscriber uses the OpenTelemetry API; any tracer with a similar span interface works.
import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('anthropic-subscriber'); // illustrative tracer name

// Subscribe to each channel independently: only pay for what you listen to
const handlers = {
  start(ctx) {
    // ctx.model, ctx.params available
    ctx.span = tracer.startSpan(`chat ${ctx.model}`);
  },
  asyncEnd(ctx) {
    // ctx.result is auto-set by TracingChannel with the API response
    // (or accumulated final message for streaming)
    ctx.span?.end();
  },
  error(ctx) {
    // Published before asyncEnd when the operation rejects
    ctx.span?.recordException(ctx.error);
  },
};

dc.tracingChannel('@anthropic-ai/sdk:messages.create').subscribe(handlers);
dc.tracingChannel('@anthropic-ai/sdk:messages.stream').subscribe(handlers);

What changes for APM vendors:

| Concern | Monkey-patching (today) | TracingChannel (proposed) |
| --- | --- | --- |
| Setup | IITM intercepts require('@anthropic-ai/sdk') before first import | Subscribe to diagnostics_channel at any time. No ordering constraint |
| Scope | Replace constructor + deep Proxy on every client instance | One subscription per method of interest |
| Method interception | Recursive Proxy intercepts every property access on the client, even non-traced methods | No proxying. SDK emits events at execution time, subscribers observe |
| Streaming | Each vendor independently processes streaming events, accumulates content blocks, reconstructs tool calls from fragmented events | SDK handles accumulation internally; subscribers see a single span with the final message |
| Multi-vendor | Each vendor builds their own deep-proxy + stream accumulation logic | Independent subscribers, no interference |
| Teardown | Cannot cleanly remove a recursive Proxy from a constructor replacement | unsubscribe(), clean and reversible |
| Runtime support | IITM: Node.js only. SDK runs on Node.js, Deno, and Bun | Any runtime with diagnostics_channel |
| Maintenance | External packages must track SDK internal structure changes across Stainless regenerations | Native, maintained as part of the SDK |
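The multi-vendor and teardown rows can be illustrated with a small sketch. The channel name `example:messages.create` and the two "vendor" handler objects are assumptions made for the demonstration; `subscribe`, `unsubscribe`, and `tracePromise` are the real TracingChannel methods.

```javascript
// CommonJS so the sketch runs as a plain Node.js script.
const dc = require('node:diagnostics_channel');

// Hypothetical channel name, used only for this demonstration.
const channel = dc.tracingChannel('example:messages.create');

// Two independent subscribers standing in for two APM vendors.
const calls = { vendorA: 0, vendorB: 0 };
const vendorA = { start() { calls.vendorA += 1; } };
const vendorB = { start() { calls.vendorB += 1; } };

channel.subscribe(vendorA);
channel.subscribe(vendorB);

async function demo() {
  // Both subscribers observe the first traced call.
  await channel.tracePromise(async () => 'first', {});

  // Teardown is the inverse of setup: pass the same handlers object.
  const removed = channel.unsubscribe(vendorA);

  // Only vendorB observes the second traced call.
  await channel.tracePromise(async () => 'second', {});

  return { removed, calls };
}
```

Neither subscriber needs to know about the other, and removing one leaves the other's subscription untouched.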

Implementation Notes

Insertion Points

The Anthropic SDK is generated by Stainless. TracingChannel instrumentation should be added at the resource method level, where each API method (messages.create, messages.stream, completions.create) calls into the core HTTP client. This is where the operation type, model, and parameters are known.

Stainless supports custom code that persists across regeneration. TracingChannel support could be added as:

  1. A core middleware in the HTTP client pipeline, triggered for specific resource methods
  2. Custom method wrappers at the resource class level, configured through Stainless

The exact integration point is an implementation detail. The key requirement is that events are emitted at the right lifecycle moments with the right context.

Async Model

All SDK methods return Promises. tracePromise is the correct wrapper for non-streaming operations. Streaming operations need manual lifecycle management (see the Streaming section above).

shouldTrace Helper

const shouldTrace = (ch) => ch.hasSubscribers !== false;

This treats undefined (Node 18, where the aggregated hasSubscribers is broken) as "trace anyway" and false (Node 20+) as "skip". See Node.js #54470 for background.

Zero-Cost Guarantee

Context objects should only be constructed inside a hasSubscribers guard:

if (shouldTrace(messagesCreateChannel)) {
  const context = { model: params.model, params };
  return messagesCreateChannel.tracePromise(fn, context);
} else {
  return fn();
}

When no APM subscribes, the overhead is a single boolean check per API call.


Backward Compatibility

The change is zero-cost when no subscribers are registered, because hasSubscribers is checked before any context object is constructed. On runtimes where TracingChannel is unavailable, tracing is silently skipped.

Since the Anthropic SDK supports Node.js, Deno, and Bun, a cross-runtime loading pattern is needed:

let dc;
try {
  if (typeof process !== 'undefined' && typeof process.getBuiltinModule === 'function') {
    dc = process.getBuiltinModule('node:diagnostics_channel');
  }
  if (!dc) {
    dc = require('node:diagnostics_channel');
  }
} catch {
  // diagnostics_channel not available on this runtime, no-op
}

  • typeof process guard: safe in browsers and edge runtimes where process doesn't exist
  • getBuiltinModule path: bundler-invisible (no static import to resolve), works in Node 20.16+ and 22.3+, Deno, and Bun 1.2.7+
  • require fallback: covers older Node, Bun, and Cloudflare Workers (with nodejs_compat)
  • try/catch: swallows the error in browsers or any runtime without diagnostics_channel
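Building on that loader, the sketch below shows one way the SDK could degrade to a no-op when diagnostics_channel is missing. The helpers `getTracingChannel` and `traceCall` are hypothetical names, and passing the context as a factory function is an assumption made here so that no context object is built on the untraced path.

```javascript
// Loader condensed from the pattern above.
let dc;
try {
  if (typeof process !== 'undefined' && typeof process.getBuiltinModule === 'function') {
    dc = process.getBuiltinModule('node:diagnostics_channel');
  }
  if (!dc) {
    dc = require('node:diagnostics_channel');
  }
} catch {
  // diagnostics_channel not available on this runtime, no-op
}

// Hypothetical helper: returns undefined when TracingChannel is unavailable.
function getTracingChannel(name) {
  return typeof dc?.tracingChannel === 'function' ? dc.tracingChannel(name) : undefined;
}

// Hypothetical helper: runs fn untraced when there is no channel or no subscriber.
// makeContext is only invoked on the traced path.
function traceCall(channel, fn, makeContext) {
  if (!channel || channel.hasSubscribers === false) {
    return fn();
  }
  return channel.tracePromise(fn, makeContext());
}
```

Call sites then stay identical on every runtime, for example `traceCall(messagesCreateChannel, () => doRequest(params), () => ({ model: params.model, params }))`, where `doRequest` stands in for the SDK's internal request function.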

Prior Art

This approach follows the same pattern already adopted or in progress by other libraries across AI / ML, frameworks, databases, and other categories.

Would love to hear if there's appetite for this. Happy to put together a PR with the implementation if so.
