Summary
This RFC proposes opensearch-genai-sdk-py — an OTEL-native SDK for instrumenting and scoring agentic applications with OpenSearch as the backend.
POC implementations are published under @vamsimanohar for community testing. Package names and repos will move to opensearch-project upon approval.
- Python: GitHub | `pip install opensearch-genai-sdk-py` (PyPI)
- TypeScript: GitHub | `npm install opensearch-genai-sdk` (npm)
Background
The GenAI observability space has two open-source instrumentation ecosystems:
OpenLLMetry (by Traceloop) — Provides auto-instrumentation for 30+ LLM libraries (OpenAI, Anthropic, Bedrock, LangChain, etc.) and @workflow/@task/@agent/@tool decorators. Uses gen_ai.* attributes aligned with official OTEL GenAI semantic conventions — Traceloop is actively contributing these upstream to OpenTelemetry. Instrumentors register via Python entry_points, enabling runtime auto-discovery.
OpenInference (by Arize/Phoenix) — Similar auto-instrumentation with its own attribute conventions (openinference.span.kind). Also uses entry_points for discovery.
Both produce standard OTEL spans — the traces are portable. However, each comes with an SDK (traceloop-sdk, phoenix-otel) that bundles setup, decorators, and score/eval APIs. The problem: scores and evaluations in these SDKs are locked to their respective backends. Traceloop's user_feedback.create() POSTs to api.traceloop.com. Phoenix's eval APIs write to the Phoenix server. The traces are open, but the scores are not.
Why OpenLLMetry
We chose to align with OpenLLMetry because:
- It contributes `gen_ai.*` semantic conventions directly to the OpenTelemetry project
- Its instrumentors cover the broadest set of LLM/agent frameworks (30+)
The SDK discovers and activates OpenLLMetry instrumentors via the opentelemetry_instrumentor entry point group.
Why We Need a New SDK
Two reasons: instrumentation defaults and scoring.
Instrumentation
Existing SDKs don't support cloud deployment authentication patterns:
- Cloud-specific authorization — Cloud-hosted OpenSearch deployments often require specialized authentication (e.g., AWS SigV4, OAuth, API keys) on OTLP requests. Most existing SDKs only support basic HTTP authentication. Without cloud-native auth support, developers on managed platforms cannot send traces to their OpenSearch clusters.
- OpenSearch defaults — Endpoint URLs, Data Prepper paths, service naming conventions, batch processor settings — developers shouldn't have to figure these out. `register()` should just work for OpenSearch.
- Auto-instrumentation — OpenLLMetry instrumentors monkey-patch LLM libraries (OpenAI, Anthropic, Bedrock, etc.) to emit OTEL spans on every call. They register via Python `entry_points` under the `opentelemetry_instrumentor` group. `register()` discovers and activates all installed instrumentors automatically — developers just `pip install` the instrumentor packages they need.
Scoring
OTEL has no concept of scores or evaluations. Every platform builds this as a proprietary API:
- Scores need a transport — Offline evaluations and human feedback need to be stored alongside traces. Other platforms use proprietary HTTP APIs to their backends. We emit scores as OTEL spans with `gen_ai.evaluation.*` attributes following the OTEL GenAI semantic conventions — they flow through the same exporter pipeline, same authentication, same Data Prepper endpoint. No separate connection. Users bring their own evaluation frameworks (autoevals, RAGAS, custom) and submit results via `score()`.
Proposal
opensearch-genai-sdk-py
A single package with three capabilities:
1. register() — One-line OTEL setup
```python
from opensearch_genai_sdk import register

# Cloud-hosted with authentication (e.g., AWS SigV4)
register(
    endpoint="https://pipeline.us-east-1.osis.amazonaws.com/v1/traces",
    auth="sigv4",
)

# Self-hosted Data Prepper
register(endpoint="http://dataprepper:21890/opentelemetry/v1/traces")

# gRPC
register(endpoint="grpc://otel-collector:4317")
```
- Creates TracerProvider, exporter, and processor with OpenSearch defaults
- Supports cloud-specific authentication methods (AWS SigV4, OAuth, etc.)
- Supports both HTTP and gRPC OTLP transport
- Auto-discovers and activates installed OpenLLMetry instrumentors
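As a sketch of one of these defaults, transport selection could key off the endpoint scheme, matching the `https://`, `http://`, and `grpc://` examples above (the helper name and return values are assumptions, not the SDK's actual internals):

```python
# Illustrative sketch: choose an OTLP transport from the endpoint scheme.
# The helper name and return values are assumptions for this RFC sketch.
from urllib.parse import urlparse

def select_transport(endpoint: str) -> str:
    """Map an endpoint URL to an OTLP transport: 'grpc' or 'http'."""
    scheme = urlparse(endpoint).scheme
    if scheme == "grpc":
        return "grpc"
    if scheme in ("http", "https"):
        return "http"
    raise ValueError(f"Unsupported endpoint scheme: {scheme!r}")
```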
2. Decorators — Trace custom functions
```python
from opensearch_genai_sdk import workflow, task, agent, tool

@workflow(name="qa_pipeline")
def run(question: str) -> str:
    return my_agent(question)

@agent(name="research_agent")
def my_agent(question: str) -> str:
    results = search(question)
    return summarize(results)

@tool(name="web_search")
def search(query: str) -> list:
    """Search the web for information."""
    return search_api.query(query)
```
Auto-instrumentors trace library calls (OpenAI, Anthropic, etc.). Decorators trace your code — your agents, workflows, and tools. Together they produce a complete trace:
```text
qa_pipeline                         ← @workflow (your code)
└── invoke_agent research_agent     ← @agent (your code)
    ├── execute_tool web_search     ← @tool (your code)
    └── openai.chat                 ← auto-instrumentor (monkey-patched)
```
Creates standard OTEL spans with gen_ai.operation.name attributes following OTEL GenAI semantic conventions. Agent spans use invoke_agent operation name with gen_ai.agent.name; tool spans use execute_tool with gen_ai.tool.name, gen_ai.tool.type, gen_ai.tool.description, and gen_ai.tool.call.arguments/gen_ai.tool.call.result for I/O. Supports sync, async, and generator functions.
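To make the attribute mapping concrete, here is a stripped-down sketch of what a `@tool` decorator could record. A real implementation would open and end an OTEL span; here the attributes are collected into a dict purely so the mapping is visible, and the decorator itself is illustrative:

```python
# Illustrative @tool decorator: records the attributes a real span would
# carry. Not the SDK implementation -- a dict stands in for the span.
import functools
import json

def tool(name: str):
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            attributes = {
                "gen_ai.operation.name": "execute_tool",
                "gen_ai.tool.name": name,
                "gen_ai.tool.description": fn.__doc__ or "",
                "gen_ai.tool.call.arguments": json.dumps(
                    {"args": args, "kwargs": kwargs}),
            }
            result = fn(*args, **kwargs)
            attributes["gen_ai.tool.call.result"] = json.dumps(result)
            wrapper.last_attributes = attributes  # a real span would end here
            return result
        return wrapper
    return decorate
```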
3. score() — Scores as OTEL spans
```python
from opensearch_genai_sdk import score

# Span-level: score a specific LLM call or tool execution
score(name="accuracy", value=0.95, trace_id="abc123", span_id="def456",
      explanation="Weather data matches ground truth", source="heuristic")

# Trace-level: score an entire workflow
score(name="relevance", value=0.92, trace_id="abc123",
      explanation="Response addresses the user's query", source="llm-judge")

# Session-level: score across multiple traces in a conversation
score(name="user_satisfaction", value=0.88, conversation_id="session-123",
      label="satisfied", source="human")
```
Emits scores as gen_ai.evaluation.result OTEL spans with gen_ai.evaluation.* attributes following the OTEL GenAI semantic conventions. Supports three scoring levels: span-level (trace_id + span_id), trace-level (trace_id only), and session-level (conversation_id). No separate client, no separate auth — same pipeline as traces. Users bring their own evaluation frameworks (autoevals, RAGAS, custom) and submit results through score().
Why an SDK (not just OTEL)
| Capability | Standard OTEL | This SDK |
|---|---|---|
| Trace export to OpenSearch | Manual setup | `register()` with defaults |
| Cloud authentication | Limited support | Pluggable auth (SigV4, OAuth, etc.) |
| Auto-instrument LLM libraries | Manual per-library | Auto-discovered (OpenLLMetry) |
| `@workflow`/`@agent`/`@task`/`@tool` | Not available | Built-in |
| Score/feedback submission | Not in OTEL spec | `score()` as OTEL spans |
Everything underneath is standard OTEL. Remove the SDK and traces still export.
Architecture
```text
Developer Code
│
├── @workflow / @agent / @task / @tool   ← SDK decorators
├── score(name, value, trace_id)         ← SDK score API
│
▼
TracerProvider (OTEL SDK)
│
├── HTTP Exporter ──→ Data Prepper / Cloud Ingestion ──→ OpenSearch
│   (+ Auth)
└── gRPC Exporter ──→ Any OTEL Collector ──→ OpenSearch
```
Design Principles
- OTEL-native — All data is standard OTEL spans. No proprietary wire format.
- GenAI conventions — Uses `gen_ai.operation.name`, `gen_ai.agent.*`, `gen_ai.tool.*`, and `gen_ai.evaluation.*` attributes aligned with OTEL GenAI semantic conventions.
- OpenLLMetry-aligned — Discovers and activates OpenLLMetry instrumentors via `opentelemetry_instrumentor` entry points.
- Scores in OTEL — Scores flow through the same pipeline as traces. No separate connection or API.
- Single package — `pip install opensearch-genai-sdk-py` gets everything. Optional extras for cloud providers (`[aws]`, etc.).
Status
POC implementations are published under @vamsimanohar for testing:
- Python: 74 unit tests, verified end-to-end via `pip install opensearch-genai-sdk-py`
- TypeScript: 29 tests, verified end-to-end via `npm install opensearch-genai-sdk`
- Both HTTP and gRPC OTLP verified with mini collectors
Package ownership will be transferred to opensearch-project upon RFC approval.
Future Considerations
- Multi-cloud support — Additional authentication methods beyond AWS SigV4 (Azure, GCP, OAuth, etc.)
- OpenInference support — OpenInference (by Arize/Phoenix) is a second instrumentation ecosystem with its own entry point group (`openinference_instrumentor`). Support could be added to discover instrumentors from both ecosystems, giving developers auto-instrumentation regardless of which instrumentor packages they install.
- Data Prepper routing — Route `gen_ai.evaluation.result` spans to a dedicated score index.
- Dashboard integration — Score visualization in OpenSearch Dashboards.
- Eval orchestration — A future `evaluate()` function could orchestrate dataset-based evaluations (dataset → task → scorers → results), creating the right span hierarchy and emitting scores automatically. This would be compatible with frameworks like autoevals or custom scorers.
- What if OTEL adds native score/eval support? — If OpenTelemetry adds scores and evaluations to the spec, the SDK would adopt those conventions and become a thin convenience layer: OpenSearch defaults, cloud auth, and auto-instrumentation discovery. The core value — making OpenSearch work out of the box for GenAI observability — remains regardless.
FAQ - Proposal Requirements
What/Why
What are you proposing?
An OpenTelemetry-native Python SDK (opensearch-genai-sdk-py) that provides one-line setup for comprehensive LLM observability using OpenSearch as the backend. Core capabilities include:
- `register()` for automatic OTEL pipeline configuration with OpenSearch defaults
- Four decorators (`@workflow`, `@task`, `@agent`, `@tool`) for tracing custom code
- `score()` function for submitting evaluation results as OTEL spans
- Auto-discovery and activation of 30+ LLM provider instrumentors
- Built-in cloud authentication support (AWS SigV4, OAuth, etc.)
Which users have asked for this feature?
- Cloud platform users: Multiple requests for cloud-native authentication support in existing OTEL SDKs for managed OpenSearch services
- AI developers: Community feedback on complexity of setting up OpenTelemetry for LLM applications, particularly around instrumentor discovery and endpoint configuration
- Enterprise users: Need for unified observability and evaluation platform with self-hosted deployment options (no vendor lock-in)
- OpenSearch community: Growing demand for AI-native tooling as evidenced by the GenAI features roadmap in OpenSearch Dashboards
What problems are you trying to solve?
When developing LLM applications, AI developers want to instrument their code with minimal setup so they can focus on building features rather than debugging observability configuration.
When deploying to cloud platforms, platform engineers want to use managed OpenSearch services with proper authentication so they can securely ingest telemetry data without managing infrastructure.
When evaluating AI systems, ML engineers want to submit scores through the same pipeline as traces so they can correlate evaluation results with execution data in a unified view.
What is the developer experience going to be?
REST API Impact: None. This is a client-side SDK that connects to existing OpenSearch ingestion endpoints via the OTLP protocol.
New APIs:
- `register(endpoint, auth="auto|sigv4|oauth", **kwargs)` - OTEL pipeline setup
- `@workflow`/`@task`/`@agent`/`@tool` decorators - span creation for custom functions
- `score(name, value, trace_id=None, span_id=None, conversation_id=None, **kwargs)` - evaluation submission
CLI: No CLI interface. Pure Python library.
Configuration: Environment variable support for OTEL_* variables (service name, etc.)
Are there any security considerations?
The SDK integrates with OpenSearch's existing security model:
- Cloud authentication: Uses platform-native credential providers (e.g., botocore for AWS, standard OAuth flows) for secure authentication
- No credential storage: Leverages existing credential providers, no additional secrets management
- Transport security: Supports TLS for OTLP HTTP/gRPC transport
- Attribute filtering: Input/output capture is configurable and includes truncation to prevent large payloads
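The truncation guard can be as simple as the sketch below; the helper name and marker suffix are illustrative, and the 10 KB default mirrors the payload limit mentioned under performance considerations:

```python
# Illustrative attribute-truncation guard. The 10 KB default mirrors the
# payload limit mentioned elsewhere in this RFC; the marker is arbitrary.
def truncate_attribute(value: str, limit: int = 10 * 1024) -> str:
    """Cap an attribute value so large prompts/outputs don't bloat spans."""
    if len(value) <= limit:
        return value
    return value[:limit] + "...[truncated]"
```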
Are there any breaking changes to the API?
No breaking changes to any OpenSearch APIs. This is a new client library that uses existing OTLP ingestion endpoints.
What is the user experience going to be?
Installation and Setup:
```shell
pip install "opensearch-genai-sdk-py[aws]"  # Cloud provider extras for authentication
```
Basic Usage:
```python
from opensearch_genai_sdk import register, workflow, agent, tool, score

# One-line setup with cloud authentication
register(
    endpoint="https://my-opensearch-cluster.com/v1/traces",
    auth="sigv4",  # or "oauth", "basic", etc.
)

# Trace custom code
@workflow("qa_pipeline")
def answer_question(question: str) -> str:
    return my_agent(question)

@agent("research_agent")
def my_agent(question: str) -> str:
    context = search_web(question)  # Auto-instrumented if OpenAI/etc. installed
    return generate_answer(context, question)

@tool("web_search")
def search_web(query: str) -> list:
    return search_api.query(query)

# Submit evaluation scores
score("relevance", 0.95, trace_id="abc123", source="llm-judge")
```
Dashboard Experience: Traces and scores appear in OpenSearch Dashboards observability views (existing trace analytics functionality).
Are there breaking changes to the User Experience?
No breaking changes to existing OpenSearch user flows. This adds new telemetry data to existing observability dashboards.
Why should it be built? Any reason not to?
Value to build:
- Strategic positioning: Establishes OpenSearch as the leading open-source AI observability platform
- Cloud integration: Native support for managed OpenSearch services fills critical gap for cloud users
- Developer velocity: Reduces AI observability setup from hours to minutes
- Open ecosystem: Prevents vendor lock-in compared to proprietary solutions (LangSmith, Arize)
Risks if not built:
- OpenSearch loses mindshare to proprietary AI observability platforms
- Cloud users continue struggling with OTEL + managed service integration
- Community fragments across incompatible instrumentation solutions
Risks if built:
- Maintenance overhead for keeping up with rapidly evolving AI frameworks
- Competition with existing OTEL SDK ecosystem (positioning as complement, not replacement)
What will it take to execute?
Technical requirements:
- Core development: Python SDK implementation (completed in POC)
- Testing: Comprehensive test suite covering OTEL, cloud auth, instrumentor integration (74 tests implemented)
- Documentation: API docs, examples, integration guides
- CI/CD: GitHub Actions for testing, security scanning, PyPI publishing (implemented)
Dependencies:
- OpenTelemetry Python SDK (stable)
- Cloud provider SDKs for authentication (botocore for AWS, etc.)
- OpenLLMetry instrumentor ecosystem (actively maintained)
Performance considerations:
- Spans are batched and exported asynchronously (standard OTEL behavior)
- Input/output capture includes size limits (10KB) to prevent memory issues
- Auto-instrumentation discovery happens once at startup
Integration assumptions:
- OpenSearch cluster with OTLP ingestion capability (Data Prepper or cloud ingestion service)
- Standard OTEL trace storage and querying via existing OpenSearch APIs
Any remaining open questions?
Future enhancements:
- Multi-language support: TypeScript/JavaScript SDK (POC exists)
- Dashboard integration: Native AI observability views in OpenSearch Dashboards
- Advanced evaluation: Dataset-based evaluation orchestration with an `evaluate()` function
- OpenInference compatibility: Support for Arize instrumentor ecosystem alongside OpenLLMetry
Long-term questions:
- OTEL standardization: If OpenTelemetry adds native evaluation/scoring support, how does the SDK adapt?
- Data Prepper enhancements: Should evaluation spans route to dedicated score indices?
- Community adoption: What's the strategy for migrating users from existing observability platforms?