
RFC : Agentic AI Eval Platform : opensearch-genai-sdk-py #2591

@vamsimanohar

Description

Summary

This RFC proposes opensearch-genai-sdk-py — an OTEL-native SDK for instrumenting and scoring agentic applications with OpenSearch as the backend.

POC implementations are published under @vamsimanohar for community testing. Package names and repos will move to opensearch-project upon approval.

  • Python: GitHub | pip install opensearch-genai-sdk-py (PyPI)
  • TypeScript: GitHub | npm install opensearch-genai-sdk (npm)

Background

The GenAI observability space has two open-source instrumentation ecosystems:

OpenLLMetry (by Traceloop) — Provides auto-instrumentation for 30+ LLM libraries (OpenAI, Anthropic, Bedrock, LangChain, etc.) and @workflow/@task/@agent/@tool decorators. Uses gen_ai.* attributes aligned with official OTEL GenAI semantic conventions — Traceloop is actively contributing these upstream to OpenTelemetry. Instrumentors register via Python entry_points, enabling runtime auto-discovery.

OpenInference (by Arize/Phoenix) — Similar auto-instrumentation with its own attribute conventions (openinference.span.kind). Also uses entry_points for discovery.

Both produce standard OTEL spans — the traces are portable. However, each comes with an SDK (traceloop-sdk, phoenix-otel) that bundles setup, decorators, and score/eval APIs. The problem: scores and evaluations in these SDKs are locked to their respective backends. Traceloop's user_feedback.create() POSTs to api.traceloop.com. Phoenix's eval APIs write to the Phoenix server. The traces are open, but the scores are not.

Why OpenLLMetry

We chose to align with OpenLLMetry because:

  • It contributes gen_ai.* semantic conventions directly to the OpenTelemetry project
  • Its instrumentors cover the broadest set of LLM/agent frameworks (30+)

The SDK discovers and activates OpenLLMetry instrumentors via the opentelemetry_instrumentor entry point group.

Why We Need a New SDK

Two reasons: instrumentation defaults and scoring.

Instrumentation

Existing SDKs fall short in three areas:

  1. Cloud-specific authorization — Cloud-hosted OpenSearch deployments often require specialized authentication (e.g., AWS SigV4, OAuth, API keys) on OTLP requests. Most existing SDKs only support basic HTTP authentication. Without cloud-native auth support, developers on managed platforms cannot send traces to their OpenSearch clusters.

  2. OpenSearch defaults — Endpoint URLs, Data Prepper paths, service naming conventions, batch processor settings — developers shouldn't have to figure these out. register() should just work for OpenSearch.

  3. Auto-instrumentation — OpenLLMetry instrumentors monkey-patch LLM libraries (OpenAI, Anthropic, Bedrock, etc.) to emit OTEL spans on every call. They register via Python entry_points under the opentelemetry_instrumentor group. register() discovers and activates all installed instrumentors automatically — developers just pip install the instrumentor packages they need.
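The discovery step in (3) can be sketched with the standard library's entry-point API. This is a simplified illustration; the function names are not the SDK's actual internals.

```python
# Simplified sketch of instrumentor auto-discovery via Python entry points.
# Function names here are illustrative, not the SDK's real internals.
from importlib.metadata import entry_points


def discover_instrumentors(group: str = "opentelemetry_instrumentor"):
    """Load every instrumentor class that installed packages register
    under the given entry point group."""
    try:
        eps = entry_points(group=group)          # Python 3.10+
    except TypeError:
        eps = entry_points().get(group, [])      # Python 3.8/3.9 fallback
    return [ep.load() for ep in eps]


def activate_all():
    for instrumentor_cls in discover_instrumentors():
        # BaseInstrumentor.instrument() monkey-patches the target library
        instrumentor_cls().instrument()
```

Because discovery keys off the entry point group rather than a hardcoded list, `pip install`-ing a new instrumentor package is enough for it to be picked up on the next startup.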

Scoring

OTEL has no concept of scores or evaluations. Every platform builds this as a proprietary API:

Scores need a transport — Offline evaluations and human feedback need to be stored alongside traces. Other platforms use proprietary HTTP APIs to their backends. We emit scores as OTEL spans with gen_ai.evaluation.* attributes following the OTEL GenAI semantic conventions — they flow through the same exporter pipeline, same authentication, same Data Prepper endpoint. No separate connection. Users bring their own evaluation frameworks (autoevals, RAGAS, custom) and submit results via score().

Proposal

opensearch-genai-sdk-py

A single package with three capabilities:

1. register() — One-line OTEL setup

from opensearch_genai_sdk import register

# Cloud-hosted with authentication (e.g., AWS SigV4)
register(
    endpoint="https://pipeline.us-east-1.osis.amazonaws.com/v1/traces",
    auth="sigv4"
)

# Self-hosted Data Prepper
register(endpoint="http://dataprepper:21890/opentelemetry/v1/traces")

# gRPC
register(endpoint="grpc://otel-collector:4317")

  • Creates TracerProvider, exporter, and processor with OpenSearch defaults
  • Supports cloud-specific authentication methods (AWS SigV4, OAuth, etc.)
  • Supports both HTTP and gRPC OTLP transport
  • Auto-discovers and activates installed OpenLLMetry instrumentors

2. Decorators — Trace custom functions

from opensearch_genai_sdk import workflow, task, agent, tool

@workflow(name="qa_pipeline")
def run(question: str) -> str:
    return my_agent(question)

@agent(name="research_agent")
def my_agent(question: str) -> str:
    results = search(question)
    return summarize(results)

@tool(name="web_search")
def search(query: str) -> list:
    """Search the web for information."""
    return search_api.query(query)

Auto-instrumentors trace library calls (OpenAI, Anthropic, etc.). Decorators trace your code — your agents, workflows, and tools. Together they produce a complete trace:

qa_pipeline                            ← @workflow (your code)
  └── invoke_agent research_agent      ← @agent (your code)
      ├── execute_tool web_search      ← @tool (your code)
      └── openai.chat                  ← auto-instrumentor (monkey-patched)

Creates standard OTEL spans with gen_ai.operation.name attributes following OTEL GenAI semantic conventions. Agent spans use invoke_agent operation name with gen_ai.agent.name; tool spans use execute_tool with gen_ai.tool.name, gen_ai.tool.type, gen_ai.tool.description, and gen_ai.tool.call.arguments/gen_ai.tool.call.result for I/O. Supports sync, async, and generator functions.
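A minimal sketch of what a decorator like @tool does under the hood. The _span context manager below stands in for the OTEL tracer call, and the attribute capture shown is simplified; real internals are richer than this.

```python
# Illustrative sketch of a @tool decorator; _span is a stand-in for the
# SDK's actual tracer.start_as_current_span() call.
import functools
from contextlib import contextmanager


@contextmanager
def _span(name, attributes):
    # The real SDK opens an OTEL span here and attaches the attributes.
    yield {"name": name, "attributes": attributes}


def tool(name: str):
    """Wrap a function in an execute_tool span carrying gen_ai.tool.*
    attributes (simplified)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            attributes = {
                "gen_ai.operation.name": "execute_tool",
                "gen_ai.tool.name": name,
                "gen_ai.tool.description": fn.__doc__ or "",
                "gen_ai.tool.call.arguments": repr((args, kwargs)),
            }
            with _span(f"execute_tool {name}", attributes):
                return fn(*args, **kwargs)
        return wrapper
    return decorator
```

The span opens before the call and closes after it returns, so nested calls (an agent invoking a tool, a tool calling an LLM) naturally produce the parent/child hierarchy shown in the trace tree above.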

3. score() — Scores as OTEL spans

from opensearch_genai_sdk import score

# Span-level: score a specific LLM call or tool execution
score(name="accuracy", value=0.95, trace_id="abc123", span_id="def456",
      explanation="Weather data matches ground truth", source="heuristic")

# Trace-level: score an entire workflow
score(name="relevance", value=0.92, trace_id="abc123",
      explanation="Response addresses the user's query", source="llm-judge")

# Session-level: score across multiple traces in a conversation
score(name="user_satisfaction", value=0.88, conversation_id="session-123",
      label="satisfied", source="human")

Emits scores as gen_ai.evaluation.result OTEL spans with gen_ai.evaluation.* attributes following the OTEL GenAI semantic conventions. Supports three scoring levels: span-level (trace_id + span_id), trace-level (trace_id only), and session-level (conversation_id). No separate client, no separate auth — same pipeline as traces. Users bring their own evaluation frameworks (autoevals, RAGAS, custom) and submit results through score().
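The mapping from score() arguments onto span attributes and scoring levels can be sketched roughly as follows. Attribute names follow the gen_ai.evaluation.* conventions referenced above; the helper itself and its return shape are illustrative, not the SDK's real internals.

```python
# Illustrative mapping of score() arguments to gen_ai.evaluation.*
# attributes; not the SDK's actual internal function.
def score_attributes(name, value, source, trace_id=None, span_id=None,
                     conversation_id=None, explanation=None, label=None):
    """Translate score() arguments into span attributes and infer the
    scoring level from which identifiers were supplied."""
    attrs = {
        "gen_ai.evaluation.name": name,
        "gen_ai.evaluation.score.value": value,
        "gen_ai.evaluation.source": source,
    }
    if explanation is not None:
        attrs["gen_ai.evaluation.explanation"] = explanation
    if label is not None:
        attrs["gen_ai.evaluation.score.label"] = label
    # Level inference mirrors the three call shapes above:
    if trace_id and span_id:
        level = "span"
    elif trace_id:
        level = "trace"
    elif conversation_id:
        level = "session"
    else:
        raise ValueError("score() needs trace_id, span_id, or conversation_id")
    return attrs, level
```

Since the output is just span attributes, the score rides the same BatchSpanProcessor and exporter as every other span, which is the whole point of the design.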

Why an SDK (not just OTEL)

| Capability | Standard OTEL | This SDK |
| --- | --- | --- |
| Trace export to OpenSearch | Manual setup | register() with defaults |
| Cloud authentication | Limited support | Pluggable auth (SigV4, OAuth, etc.) |
| @workflow/@agent/@task/@tool | Not available | Built-in |
| Auto-instrument LLM libraries | Manual per-library | Auto-discovered (OpenLLMetry) |
| Score/feedback submission | Not in OTEL spec | score() as OTEL spans |

Everything underneath is standard OTEL. Remove the SDK and traces still export.

Architecture

Developer Code
    │
    ├── @workflow / @agent / @task / @tool    ← SDK decorators
    ├── score(name, value, trace_id)          ← SDK score API
    │
    ▼
TracerProvider (OTEL SDK)
    │
    ├── HTTP Exporter  ──→  Data Prepper / Cloud Ingestion  ──→  OpenSearch
    │   (+ Auth)
    └── gRPC Exporter  ──→  Any OTEL Collector              ──→  OpenSearch

Design Principles

  1. OTEL-native — All data is standard OTEL spans. No proprietary wire format.
  2. GenAI conventions — Uses gen_ai.operation.name, gen_ai.agent.*, gen_ai.tool.*, and gen_ai.evaluation.* attributes aligned with OTEL GenAI semantic conventions.
  3. OpenLLMetry-aligned — Discovers and activates OpenLLMetry instrumentors via opentelemetry_instrumentor entry points.
  4. Scores in OTEL — Scores flow through the same pipeline as traces. No separate connection or API.
  5. Single package — pip install opensearch-genai-sdk-py gets everything. Optional extras for cloud providers ([aws], etc.).

Status

POC implementations are published under @vamsimanohar for testing:

  • Python: 74 unit tests, verified end-to-end via pip install opensearch-genai-sdk-py
  • TypeScript: 29 tests, verified end-to-end via npm install opensearch-genai-sdk
  • Both HTTP and gRPC OTLP verified with mini collectors

Package ownership will be transferred to opensearch-project upon RFC approval.

Future Considerations

  • Multi-cloud support — Additional authentication methods beyond AWS SigV4 (Azure, GCP, OAuth, etc.)
  • OpenInference support — OpenInference (by Arize/Phoenix) is a second instrumentation ecosystem with its own entry point group (openinference_instrumentor). Support could be added to discover instrumentors from both ecosystems, giving developers auto-instrumentation regardless of which instrumentor packages they install.
  • Data Prepper routing — Route gen_ai.evaluation.result spans to a dedicated score index.
  • Dashboard integration — Score visualization in OpenSearch Dashboards.
  • Eval orchestration — A future evaluate() function could orchestrate dataset-based evaluations (dataset → task → scorers → results), creating the right span hierarchy and emitting scores automatically. This would be compatible with frameworks like autoevals or custom scorers.
  • What if OTEL adds native score/eval support? — If OpenTelemetry adds scores and evaluations to the spec, the SDK would adopt those conventions and become a thin convenience layer: OpenSearch defaults, cloud auth, and auto-instrumentation discovery. The core value — making OpenSearch work out of the box for GenAI observability — remains regardless.
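The dataset → task → scorers → results flow mentioned for a future evaluate() could look something like this. This is purely speculative, since the RFC only lists it as a future consideration; every name here is hypothetical.

```python
# Speculative sketch of a future evaluate() orchestrator; all names are
# hypothetical, as the RFC only lists this as a future consideration.
def evaluate(dataset, task, scorers):
    """Run `task` over each example, apply each scorer to the output,
    and collect one result row per (example, scorer) pair."""
    results = []
    for example in dataset:
        output = task(example["input"])
        for scorer in scorers:
            results.append({
                "input": example["input"],
                "output": output,
                "score_name": scorer.__name__,
                "value": scorer(output, example.get("expected")),
            })
    # A real implementation would also emit each row via score() so the
    # results land in OpenSearch alongside the task's traces.
    return results
```

Under this shape, autoevals scorers or plain functions with an `(output, expected)` signature would slot in directly as `scorers`.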

FAQ - Proposal Requirements

What/Why

What are you proposing?

An OpenTelemetry-native Python SDK (opensearch-genai-sdk-py) that provides one-line setup for comprehensive LLM observability using OpenSearch as the backend. Core capabilities include:

  • register() for automatic OTEL pipeline configuration with OpenSearch defaults
  • Four decorators (@workflow, @task, @agent, @tool) for tracing custom code
  • score() function for submitting evaluation results as OTEL spans
  • Auto-discovery and activation of 30+ LLM provider instrumentors
  • Built-in cloud authentication support (AWS SigV4, OAuth, etc.)

Which users have asked for this feature?

  • Cloud platform users: Multiple requests for cloud-native authentication support in existing OTEL SDKs for managed OpenSearch services
  • AI developers: Community feedback on complexity of setting up OpenTelemetry for LLM applications, particularly around instrumentor discovery and endpoint configuration
  • Enterprise users: Need for unified observability and evaluation platform with self-hosted deployment options (no vendor lock-in)
  • OpenSearch community: Growing demand for AI-native tooling as evidenced by the GenAI features roadmap in OpenSearch Dashboards

What problems are you trying to solve?

When developing LLM applications, AI developers want to instrument their code with minimal setup so they can focus on building features rather than debugging observability configuration.

When deploying to cloud platforms, platform engineers want to use managed OpenSearch services with proper authentication so they can securely ingest telemetry data without managing infrastructure.

When evaluating AI systems, ML engineers want to submit scores through the same pipeline as traces so they can correlate evaluation results with execution data in a unified view.

What is the developer experience going to be?

REST API Impact: None. This is a client-side SDK that connects TO existing OpenSearch APIs via OTLP protocol.

New APIs:

  • register(endpoint, auth="auto|sigv4|oauth", **kwargs) - OTEL pipeline setup
  • @workflow/@task/@agent/@tool decorators - span creation for custom functions
  • score(name, value, trace_id=None, span_id=None, conversation_id=None, **kwargs) - evaluation submission

CLI: No CLI interface. Pure Python library.

Configuration: Environment variable support for OTEL_* variables (service name, etc.)

Are there any security considerations?

The SDK integrates with OpenSearch's existing security model:

  • Cloud authentication: Uses platform-native credential providers (e.g., botocore for AWS, standard OAuth flows) for secure authentication
  • No credential storage: Leverages existing credential providers, no additional secrets management
  • Transport security: Supports TLS for OTLP HTTP/gRPC transport
  • Attribute filtering: Input/output capture is configurable and includes truncation to prevent large payloads

Are there any breaking changes to the API?

No breaking changes to any OpenSearch APIs. This is a new client library that uses existing OTLP ingestion endpoints.

What is the user experience going to be?

Installation and Setup:

pip install opensearch-genai-sdk-py[aws]  # Cloud provider extras for authentication

Basic Usage:

from opensearch_genai_sdk import register, workflow, agent, tool, score

# One-line setup with cloud authentication
register(
    endpoint="https://my-opensearch-cluster.com/v1/traces",
    auth="sigv4"  # or "oauth", "basic", etc.
)

# Trace custom code
@workflow("qa_pipeline")
def answer_question(question: str) -> str:
    return my_agent(question)

@agent("research_agent") 
def my_agent(question: str) -> str:
    context = search_web(question)  # Auto-instrumented if OpenAI/etc installed
    return generate_answer(context, question)

@tool("web_search")
def search_web(query: str) -> list:
    return search_api.query(query)

# Submit evaluation scores
score("relevance", 0.95, trace_id="abc123", source="llm-judge")

Dashboard Experience: Traces and scores appear in OpenSearch Dashboards observability views (existing trace analytics functionality).

Are there breaking changes to the User Experience?

No breaking changes to existing OpenSearch user flows. This adds new telemetry data to existing observability dashboards.

Why should it be built? Any reason not to?

Value to build:

  • Strategic positioning: Establishes OpenSearch as the leading open-source AI observability platform
  • Cloud integration: Native support for managed OpenSearch services fills critical gap for cloud users
  • Developer velocity: Reduces AI observability setup from hours to minutes
  • Open ecosystem: Prevents vendor lock-in compared to proprietary solutions (LangSmith, Arize)

Risks if not built:

  • OpenSearch loses mindshare to proprietary AI observability platforms
  • Cloud users continue struggling with OTEL + managed service integration
  • Community fragments across incompatible instrumentation solutions

Risks if built:

  • Maintenance overhead for keeping up with rapidly evolving AI frameworks
  • Competition with existing OTEL SDK ecosystem (positioning as complement, not replacement)

What will it take to execute?

Technical requirements:

  • Core development: Python SDK implementation (completed in POC)
  • Testing: Comprehensive test suite covering OTEL, cloud auth, instrumentor integration (74 tests implemented)
  • Documentation: API docs, examples, integration guides
  • CI/CD: GitHub Actions for testing, security scanning, PyPI publishing (implemented)

Dependencies:

  • OpenTelemetry Python SDK (stable)
  • Cloud provider SDKs for authentication (botocore for AWS, etc.)
  • OpenLLMetry instrumentor ecosystem (actively maintained)

Performance considerations:

  • Spans are batched and exported asynchronously (standard OTEL behavior)
  • Input/output capture includes size limits (10KB) to prevent memory issues
  • Auto-instrumentation discovery happens once at startup
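The size-limit behavior above might be implemented along these lines. The 10KB figure comes from this RFC; the helper itself is an illustrative assumption, not the SDK's actual code.

```python
# Illustrative input/output truncation before span attribute capture.
MAX_CAPTURE_BYTES = 10_000  # 10KB cap on captured I/O, per the RFC


def truncate_capture(value: str, limit: int = MAX_CAPTURE_BYTES) -> str:
    """Clamp captured input/output to `limit` UTF-8 bytes before it is
    attached to a span attribute, so oversized payloads never reach the
    exporter."""
    encoded = value.encode("utf-8")
    if len(encoded) <= limit:
        return value
    # errors="ignore" drops a multi-byte character split at the boundary
    # instead of raising UnicodeDecodeError.
    return encoded[:limit].decode("utf-8", errors="ignore") + "...[truncated]"
```

Measuring in bytes rather than characters keeps the cap meaningful for non-ASCII payloads, where one character can occupy up to four UTF-8 bytes.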

Integration assumptions:

  • OpenSearch cluster with OTLP ingestion capability (Data Prepper or cloud ingestion service)
  • Standard OTEL trace storage and querying via existing OpenSearch APIs

Any remaining open questions?

Future enhancements:

  • Multi-language support: TypeScript/JavaScript SDK (POC exists)
  • Dashboard integration: Native AI observability views in OpenSearch Dashboards
  • Advanced evaluation: Dataset-based evaluation orchestration with evaluate() function
  • OpenInference compatibility: Support for Arize instrumentor ecosystem alongside OpenLLMetry

Long-term questions:

  • OTEL standardization: If OpenTelemetry adds native evaluation/scoring support, how does the SDK adapt?
  • Data Prepper enhancements: Should evaluation spans route to dedicated score indices?
  • Community adoption: What's the strategy for migrating users from existing observability platforms?
