
RFC : Agentic AI Eval Platform : opensearch-genai-sdk-py #2591

@vamsimanohar

Description

Summary

This RFC proposes opensearch-genai-sdk-py — an OTEL-native SDK for instrumenting and scoring agentic applications with OpenSearch as the backend.

POC implementations are published under @vamsimanohar for community testing. Package names and repos will move to opensearch-project upon approval.

  • Python: GitHub | pip install opensearch-genai-sdk-py (PyPI)
  • TypeScript: GitHub | npm install opensearch-genai-sdk (npm)

Background

The GenAI observability space has two open-source instrumentation ecosystems:

OpenLLMetry (by Traceloop) — Provides auto-instrumentation for 30+ LLM libraries (OpenAI, Anthropic, Bedrock, LangChain, etc.) and @workflow/@task/@agent/@tool decorators. Uses gen_ai.* attributes aligned with official OTEL GenAI semantic conventions — Traceloop is actively contributing these upstream to OpenTelemetry. Instrumentors register via Python entry_points, enabling runtime auto-discovery.

OpenInference (by Arize/Phoenix) — Similar auto-instrumentation with its own attribute conventions (openinference.span.kind). Also uses entry_points for discovery.

Both produce standard OTEL spans — the traces are portable. However, each comes with an SDK (traceloop-sdk, phoenix-otel) that bundles setup, decorators, and score/eval APIs. The problem: scores and evaluations in these SDKs are locked to their respective backends. Traceloop's user_feedback.create() POSTs to api.traceloop.com. Phoenix's eval APIs write to the Phoenix server. The traces are open, but the scores are not.

Why OpenLLMetry

We chose to align with OpenLLMetry because:

  • It contributes gen_ai.* semantic conventions directly to the OpenTelemetry project
  • Its instrumentors cover the broadest set of LLM/agent frameworks (30+)

The SDK discovers and activates OpenLLMetry instrumentors via the opentelemetry_instrumentor entry point group.

Why We Need a New SDK

Two reasons: instrumentation defaults and scoring.

Instrumentation

Existing SDKs fall short in three areas:

  1. Cloud-specific authorization — Cloud-hosted OpenSearch deployments often require specialized authentication (e.g., AWS SigV4, OAuth, API keys) on OTLP requests. Most existing SDKs only support basic HTTP authentication. Without cloud-native auth support, developers on managed platforms cannot send traces to their OpenSearch clusters.

  2. OpenSearch defaults — Endpoint URLs, Data Prepper paths, service naming conventions, batch processor settings — developers shouldn't have to figure these out. register() should just work for OpenSearch.

  3. Auto-instrumentation — OpenLLMetry instrumentors monkey-patch LLM libraries (OpenAI, Anthropic, Bedrock, etc.) to emit OTEL spans on every call. They register via Python entry_points under the opentelemetry_instrumentor group. register() discovers and activates all installed instrumentors automatically — developers just pip install the instrumentor packages they need.
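The discovery step in (3) can be sketched with the standard library's entry-point API. This is a simplified illustration; the function names are not the SDK's actual internals.

```python
# Simplified sketch of instrumentor auto-discovery via Python entry points.
# Function names here are illustrative, not the SDK's real internals.
from importlib.metadata import entry_points


def discover_instrumentors(group: str = "opentelemetry_instrumentor"):
    """Load every instrumentor class that installed packages register
    under the given entry point group."""
    try:
        eps = entry_points(group=group)          # Python 3.10+
    except TypeError:
        eps = entry_points().get(group, [])      # Python 3.8/3.9 fallback
    return [ep.load() for ep in eps]


def activate_all():
    for instrumentor_cls in discover_instrumentors():
        # BaseInstrumentor.instrument() monkey-patches the target library
        instrumentor_cls().instrument()
```

Because discovery keys off the entry point group rather than a hardcoded list, `pip install`-ing a new instrumentor package is enough for it to be picked up on the next startup.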

Scoring

OTEL has no concept of scores or evaluations. Every platform builds this as a proprietary API:

Scores need a transport — Offline evaluations and human feedback need to be stored alongside traces. Other platforms use proprietary HTTP APIs to their backends. We emit scores as OTEL spans with gen_ai.evaluation.* attributes following the OTEL GenAI semantic conventions — they flow through the same exporter pipeline, same authentication, same Data Prepper endpoint. No separate connection. Users bring their own evaluation frameworks (autoevals, RAGAS, custom) and submit results via score().

Proposal

opensearch-genai-sdk-py

A single package with three capabilities:

1. register() — One-line OTEL setup

from opensearch_genai_sdk import register

# Cloud-hosted with authentication (e.g., AWS SigV4)
register(
    endpoint="https://pipeline.us-east-1.osis.amazonaws.com/v1/traces",
    auth="sigv4"
)

# Self-hosted Data Prepper
register(endpoint="http://dataprepper:21890/opentelemetry/v1/traces")

# gRPC
register(endpoint="grpc://otel-collector:4317")

  • Creates TracerProvider, exporter, and processor with OpenSearch defaults
  • Supports cloud-specific authentication methods (AWS SigV4, OAuth, etc.)
  • Supports both HTTP and gRPC OTLP transport
  • Auto-discovers and activates installed OpenLLMetry instrumentors

2. Decorators — Trace custom functions

from opensearch_genai_sdk import workflow, task, agent, tool

@workflow(name="qa_pipeline")
def run(question: str) -> str:
    return my_agent(question)

@agent(name="research_agent")
def my_agent(question: str) -> str:
    results = search(question)
    return summarize(results)

@tool(name="web_search")
def search(query: str) -> list:
    """Search the web for information."""
    return search_api.query(query)

Auto-instrumentors trace library calls (OpenAI, Anthropic, etc.). Decorators trace your code — your agents, workflows, and tools. Together they produce a complete trace:

qa_pipeline                            ← @workflow (your code)
  └── invoke_agent research_agent      ← @agent (your code)
      ├── execute_tool web_search      ← @tool (your code)
      └── openai.chat                  ← auto-instrumentor (monkey-patched)

Creates standard OTEL spans with gen_ai.operation.name attributes following OTEL GenAI semantic conventions. Agent spans use invoke_agent operation name with gen_ai.agent.name; tool spans use execute_tool with gen_ai.tool.name, gen_ai.tool.type, gen_ai.tool.description, and gen_ai.tool.call.arguments/gen_ai.tool.call.result for I/O. Supports sync, async, and generator functions.
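A minimal sketch of what a decorator like @tool does under the hood. The _span context manager below stands in for the OTEL tracer call, and the attribute capture shown is simplified; real internals are richer than this.

```python
# Illustrative sketch of a @tool decorator; _span is a stand-in for the
# SDK's actual tracer.start_as_current_span() call.
import functools
from contextlib import contextmanager


@contextmanager
def _span(name, attributes):
    # The real SDK opens an OTEL span here and attaches the attributes.
    yield {"name": name, "attributes": attributes}


def tool(name: str):
    """Wrap a function in an execute_tool span carrying gen_ai.tool.*
    attributes (simplified)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            attributes = {
                "gen_ai.operation.name": "execute_tool",
                "gen_ai.tool.name": name,
                "gen_ai.tool.description": fn.__doc__ or "",
                "gen_ai.tool.call.arguments": repr((args, kwargs)),
            }
            with _span(f"execute_tool {name}", attributes):
                return fn(*args, **kwargs)
        return wrapper
    return decorator
```

The span opens before the call and closes after it returns, so nested calls (an agent invoking a tool, a tool calling an LLM) naturally produce the parent/child hierarchy shown in the trace tree above.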

3. score() — Scores as OTEL spans

from opensearch_genai_sdk import score

# Span-level: score a specific LLM call or tool execution
score(name="accuracy", value=0.95, trace_id="abc123", span_id="def456",
      explanation="Weather data matches ground truth", source="heuristic")

# Trace-level: score an entire workflow
score(name="relevance", value=0.92, trace_id="abc123",
      explanation="Response addresses the user's query", source="llm-judge")

# Session-level: score across multiple traces in a conversation
score(name="user_satisfaction", value=0.88, conversation_id="session-123",
      label="satisfied", source="human")

Emits scores as gen_ai.evaluation.result OTEL spans with gen_ai.evaluation.* attributes following the OTEL GenAI semantic conventions. Supports three scoring levels: span-level (trace_id + span_id), trace-level (trace_id only), and session-level (conversation_id). No separate client, no separate auth — same pipeline as traces. Users bring their own evaluation frameworks (autoevals, RAGAS, custom) and submit results through score().
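The mapping from score() arguments onto span attributes and scoring levels can be sketched roughly as follows. Attribute names follow the gen_ai.evaluation.* conventions referenced above; the helper itself and its return shape are illustrative, not the SDK's real internals.

```python
# Illustrative mapping of score() arguments to gen_ai.evaluation.*
# attributes; not the SDK's actual internal function.
def score_attributes(name, value, source, trace_id=None, span_id=None,
                     conversation_id=None, explanation=None, label=None):
    """Translate score() arguments into span attributes and infer the
    scoring level from which identifiers were supplied."""
    attrs = {
        "gen_ai.evaluation.name": name,
        "gen_ai.evaluation.score.value": value,
        "gen_ai.evaluation.source": source,
    }
    if explanation is not None:
        attrs["gen_ai.evaluation.explanation"] = explanation
    if label is not None:
        attrs["gen_ai.evaluation.score.label"] = label
    # Level inference mirrors the three call shapes above:
    if trace_id and span_id:
        level = "span"
    elif trace_id:
        level = "trace"
    elif conversation_id:
        level = "session"
    else:
        raise ValueError("score() needs trace_id, span_id, or conversation_id")
    return attrs, level
```

Since the output is just span attributes, the score rides the same BatchSpanProcessor and exporter as every other span, which is the whole point of the design.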

Why an SDK (not just OTEL)

| Capability | Standard OTEL | This SDK |
| --- | --- | --- |
| Trace export to OpenSearch | Manual setup | register() with defaults |
| Cloud authentication | Limited support | Pluggable auth (SigV4, OAuth, etc.) |
| @workflow/@agent/@task/@tool | Not available | Built-in |
| Auto-instrument LLM libraries | Manual per-library | Auto-discovered (OpenLLMetry) |
| Score/feedback submission | Not in OTEL spec | score() as OTEL spans |

Everything underneath is standard OTEL. Remove the SDK and traces still export.

Architecture

Developer Code
    │
    ├── @workflow / @agent / @task / @tool    ← SDK decorators
    ├── score(name, value, trace_id)          ← SDK score API
    │
    ▼
TracerProvider (OTEL SDK)
    │
    ├── HTTP Exporter  ──→  Data Prepper / Cloud Ingestion  ──→  OpenSearch
    │   (+ Auth)
    └── gRPC Exporter  ──→  Any OTEL Collector              ──→  OpenSearch

Design Principles

  1. OTEL-native — All data is standard OTEL spans. No proprietary wire format.
  2. GenAI conventions — Uses gen_ai.operation.name, gen_ai.agent.*, gen_ai.tool.*, and gen_ai.evaluation.* attributes aligned with OTEL GenAI semantic conventions.
  3. OpenLLMetry-aligned — Discovers and activates OpenLLMetry instrumentors via opentelemetry_instrumentor entry points.
  4. Scores in OTEL — Scores flow through the same pipeline as traces. No separate connection or API.
  5. Single package — pip install opensearch-genai-sdk-py gets everything. Optional extras for cloud providers ([aws], etc.).

Status

POC implementations are published under @vamsimanohar for testing:

  • Python: 74 unit tests, verified end-to-end via pip install opensearch-genai-sdk-py
  • TypeScript: 29 tests, verified end-to-end via npm install opensearch-genai-sdk
  • Both HTTP and gRPC OTLP verified with mini collectors

Package ownership will be transferred to opensearch-project upon RFC approval.

Future Considerations

  • Multi-cloud support — Additional authentication methods beyond AWS SigV4 (Azure, GCP, OAuth, etc.)
  • OpenInference support — OpenInference (by Arize/Phoenix) is a second instrumentation ecosystem with its own entry point group (openinference_instrumentor). Support could be added to discover instrumentors from both ecosystems, giving developers auto-instrumentation regardless of which instrumentor packages they install.
  • Data Prepper routing — Route gen_ai.evaluation.result spans to a dedicated score index.
  • Dashboard integration — Score visualization in OpenSearch Dashboards.
  • Eval orchestration — A future evaluate() function could orchestrate dataset-based evaluations (dataset → task → scorers → results), creating the right span hierarchy and emitting scores automatically. This would be compatible with frameworks like autoevals or custom scorers.
  • What if OTEL adds native score/eval support? — If OpenTelemetry adds scores and evaluations to the spec, the SDK would adopt those conventions and become a thin convenience layer: OpenSearch defaults, cloud auth, and auto-instrumentation discovery. The core value — making OpenSearch work out of the box for GenAI observability — remains regardless.
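The dataset → task → scorers → results flow mentioned for a future evaluate() could look something like this. This is purely speculative, since the RFC only lists it as a future consideration; every name here is hypothetical.

```python
# Speculative sketch of a future evaluate() orchestrator; all names are
# hypothetical, as the RFC only lists this as a future consideration.
def evaluate(dataset, task, scorers):
    """Run `task` over each example, apply each scorer to the output,
    and collect one result row per (example, scorer) pair."""
    results = []
    for example in dataset:
        output = task(example["input"])
        for scorer in scorers:
            results.append({
                "input": example["input"],
                "output": output,
                "score_name": scorer.__name__,
                "value": scorer(output, example.get("expected")),
            })
    # A real implementation would also emit each row via score() so the
    # results land in OpenSearch alongside the task's traces.
    return results
```

Under this shape, autoevals scorers or plain functions with an `(output, expected)` signature would slot in directly as `scorers`.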

FAQ - Proposal Requirements

What/Why

What are you proposing?

An OpenTelemetry-native Python SDK (opensearch-genai-sdk-py) that provides one-line setup for comprehensive LLM observability using OpenSearch as the backend. Core capabilities include:

  • register() for automatic OTEL pipeline configuration with OpenSearch defaults
  • Four decorators (@workflow, @task, @agent, @tool) for tracing custom code
  • score() function for submitting evaluation results as OTEL spans
  • Auto-discovery and activation of 30+ LLM provider instrumentors
  • Built-in cloud authentication support (AWS SigV4, OAuth, etc.)

Which users have asked for this feature?

  • Cloud platform users: Multiple requests for cloud-native authentication support in existing OTEL SDKs for managed OpenSearch services
  • AI developers: Community feedback on complexity of setting up OpenTelemetry for LLM applications, particularly around instrumentor discovery and endpoint configuration
  • Enterprise users: Need for unified observability and evaluation platform with self-hosted deployment options (no vendor lock-in)
  • OpenSearch community: Growing demand for AI-native tooling as evidenced by the GenAI features roadmap in OpenSearch Dashboards

What problems are you trying to solve?

When developing LLM applications, AI developers want to instrument their code with minimal setup so they can focus on building features rather than debugging observability configuration.

When deploying to cloud platforms, platform engineers want to use managed OpenSearch services with proper authentication so they can securely ingest telemetry data without managing infrastructure.

When evaluating AI systems, ML engineers want to submit scores through the same pipeline as traces so they can correlate evaluation results with execution data in a unified view.

What is the developer experience going to be?

REST API Impact: None. This is a client-side SDK that connects TO existing OpenSearch APIs via OTLP protocol.

New APIs:

  • register(endpoint, auth="auto|sigv4|oauth", **kwargs) - OTEL pipeline setup
  • @workflow/@task/@agent/@tool decorators - span creation for custom functions
  • score(name, value, trace_id=None, span_id=None, conversation_id=None, **kwargs) - evaluation submission

CLI: No CLI interface. Pure Python library.

Configuration: Environment variable support for OTEL_* variables (service name, etc.)

Are there any security considerations?

The SDK integrates with OpenSearch's existing security model:

  • Cloud authentication: Uses platform-native credential providers (e.g., botocore for AWS, standard OAuth flows) for secure authentication
  • No credential storage: Leverages existing credential providers, no additional secrets management
  • Transport security: Supports TLS for OTLP HTTP/gRPC transport
  • Attribute filtering: Input/output capture is configurable and includes truncation to prevent large payloads

Are there any breaking changes to the API?

No breaking changes to any OpenSearch APIs. This is a new client library that uses existing OTLP ingestion endpoints.

What is the user experience going to be?

Installation and Setup:

pip install opensearch-genai-sdk-py[aws]  # Cloud provider extras for authentication

Basic Usage:

from opensearch_genai_sdk import register, workflow, agent, tool, score

# One-line setup with cloud authentication
register(
    endpoint="https://my-opensearch-cluster.com/v1/traces",
    auth="sigv4"  # or "oauth", "basic", etc.
)

# Trace custom code
@workflow("qa_pipeline")
def answer_question(question: str) -> str:
    return my_agent(question)

@agent("research_agent") 
def my_agent(question: str) -> str:
    context = search_web(question)  # Auto-instrumented if OpenAI/etc installed
    return generate_answer(context, question)

@tool("web_search")
def search_web(query: str) -> list:
    return search_api.query(query)

# Submit evaluation scores
score("relevance", 0.95, trace_id="abc123", source="llm-judge")

Dashboard Experience: Traces and scores appear in OpenSearch Dashboards observability views (existing trace analytics functionality).

Are there breaking changes to the User Experience?

No breaking changes to existing OpenSearch user flows. This adds new telemetry data to existing observability dashboards.

Why should it be built? Any reason not to?

Value to build:

  • Strategic positioning: Establishes OpenSearch as the leading open-source AI observability platform
  • Cloud integration: Native support for managed OpenSearch services fills critical gap for cloud users
  • Developer velocity: Reduces AI observability setup from hours to minutes
  • Open ecosystem: Prevents vendor lock-in compared to proprietary solutions (LangSmith, Arize)

Risks if not built:

  • OpenSearch loses mindshare to proprietary AI observability platforms
  • Cloud users continue struggling with OTEL + managed service integration
  • Community fragments across incompatible instrumentation solutions

Risks if built:

  • Maintenance overhead for keeping up with rapidly evolving AI frameworks
  • Competition with existing OTEL SDK ecosystem (positioning as complement, not replacement)

What will it take to execute?

Technical requirements:

  • Core development: Python SDK implementation (completed in POC)
  • Testing: Comprehensive test suite covering OTEL, cloud auth, instrumentor integration (74 tests implemented)
  • Documentation: API docs, examples, integration guides
  • CI/CD: GitHub Actions for testing, security scanning, PyPI publishing (implemented)

Dependencies:

  • OpenTelemetry Python SDK (stable)
  • Cloud provider SDKs for authentication (botocore for AWS, etc.)
  • OpenLLMetry instrumentor ecosystem (actively maintained)

Performance considerations:

  • Spans are batched and exported asynchronously (standard OTEL behavior)
  • Input/output capture includes size limits (10KB) to prevent memory issues
  • Auto-instrumentation discovery happens once at startup
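The size-limit behavior above might be implemented along these lines. The 10KB figure comes from this RFC; the helper itself is an illustrative assumption, not the SDK's actual code.

```python
# Illustrative input/output truncation before span attribute capture.
MAX_CAPTURE_BYTES = 10_000  # 10KB cap on captured I/O, per the RFC


def truncate_capture(value: str, limit: int = MAX_CAPTURE_BYTES) -> str:
    """Clamp captured input/output to `limit` UTF-8 bytes before it is
    attached to a span attribute, so oversized payloads never reach the
    exporter."""
    encoded = value.encode("utf-8")
    if len(encoded) <= limit:
        return value
    # errors="ignore" drops a multi-byte character split at the boundary
    # instead of raising UnicodeDecodeError.
    return encoded[:limit].decode("utf-8", errors="ignore") + "...[truncated]"
```

Measuring in bytes rather than characters keeps the cap meaningful for non-ASCII payloads, where one character can occupy up to four UTF-8 bytes.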

Integration assumptions:

  • OpenSearch cluster with OTLP ingestion capability (Data Prepper or cloud ingestion service)
  • Standard OTEL trace storage and querying via existing OpenSearch APIs

Any remaining open questions?

Future enhancements:

  • Multi-language support: TypeScript/JavaScript SDK (POC exists)
  • Dashboard integration: Native AI observability views in OpenSearch Dashboards
  • Advanced evaluation: Dataset-based evaluation orchestration with evaluate() function
  • OpenInference compatibility: Support for Arize instrumentor ecosystem alongside OpenLLMetry

Long-term questions:

  • OTEL standardization: If OpenTelemetry adds native evaluation/scoring support, how does the SDK adapt?
  • Data Prepper enhancements: Should evaluation spans route to dedicated score indices?
  • Community adoption: What's the strategy for migrating users from existing observability platforms?
