This design describes an LLM evaluation platform built natively on the OpenSearch ecosystem. The platform uses OpenSearch indices as the sole data store, OTel Collector for OTLP ingestion and span processing, OpenSearch Job Scheduler for async processing, and OpenSearch Dashboards plugins for the UI.
The data model is grounded in the OpenTelemetry GenAI Semantic Conventions (gen_ai.* attribute namespace). OTLP spans arrive with standard gen_ai.* attributes and are indexed directly into OpenSearch without lossy transformation. This means any OTel-instrumented LLM application (Strands, OpenAI SDK, Bedrock SDK, etc.) can send telemetry to the platform with zero custom mapping.
The system supports three evaluation modes:
Online Agent Trace Evaluation: Automatic post-ingestion scoring of live production traces (reference-free only). Source: EVAL_ONLINE.
Offline Agent Trace Evaluation: Platform-orchestrated batch evaluation against curated Eval_Sets with Ground_Truth. Source: EVAL_OFFLINE.
Local Evaluation: Scores computed client-side by the user's SDK (Strands, DeepEval, Ragas) and submitted via the Scores API. The platform is a passive receiver. Source: SDK.
Key architectural decisions:
OpenSearch as the sole data store -- all entities (spans, scores, eval sets, experiments, jobs) are stored in dedicated OpenSearch indices. No relational database.
OTel Collector for ingestion and processing -- OTLP telemetry flows through OTel Collector pipelines into OpenSearch. The gen_ai.* span attributes are indexed as-is. OTel Collector handles trace-group metric aggregation and Prometheus metric emission. The existing APM index template's dynamic field mapping ("dynamic": "true" on the attributes field) automatically indexes any new gen_ai.* span attributes without requiring schema changes.
GenAI semantic conventions as the canonical schema -- the platform does not define a custom trace/observation schema. It uses the OTel gen_ai.* attributes directly, extended with platform-specific attributes under the eval.* namespace for evaluation-only fields.
Job Scheduler for async work -- native OpenSearch Job Scheduler plugin for LLM-as-a-Judge, deterministic evaluators, and RAG metrics. Only involved in Online and Offline modes.
OSD Plugin for UI -- a single OpenSearch Dashboards plugin using OUI components provides all evaluation UI views.
Passive receiver for Local evaluation -- third-party SDKs compute scores client-side and submit them via the Scores API.
OSS library delegation for scoring via Python Agent Service -- Online and Offline agent trace evaluators delegate scoring to the Python Agent Service, which hosts a Strands-based eval agent that invokes Strands Eval, DeepEval, and Ragas for actual scoring logic. The eval-scheduler-plugin communicates with the Python Agent Service over its internal API; the Python Agent Service owns the LLM provider connection.
No artificial scoping -- spans are global documents. Multi-tenancy is handled at the OpenSearch index level via the security plugin.
Dual-write metrics architecture -- OTel Collector enriches spans with pre-aggregated trace-group fields (traceGroupFields.genAi.*) for fast OpenSearch queries (PPL) while simultaneously emitting derived metrics to Prometheus for time-series analysis (PromQL). OpenSearch handles trace detail and search; Prometheus handles metric aggregation and alerting.
GenAI OTel Conventions Alignment
The platform's data model maps directly to the OTel GenAI semantic conventions (status: Development). The key span types and their gen_ai.operation.name values:
| OTel Operation | `gen_ai.operation.name` | Platform Concept | Description |
|---|---|---|---|
| Chat completion | `chat` | Generation observation | LLM inference call |
| Text completion | `text_completion` | Generation observation | Legacy completion call |
| Embeddings | `embeddings` | Embedding observation | Vector embedding call |
| Invoke agent | `invoke_agent` | Trace (root span) | Top-level agent invocation |
| Create agent | `create_agent` | Agent setup span | Agent initialization |
| Execute tool | `execute_tool` | Tool call observation | Tool/function execution |
| Content generation | `generate_content` | Generation observation | Multimodal generation |
The platform extends the standard gen_ai.* namespace with eval.* attributes for evaluation-specific metadata that has no OTel equivalent:
| Custom Attribute | Type | Description |
|---|---|---|
| `eval.score.name` | keyword | Score metric name |
| `eval.score.value` | float | Numeric score value |
| `eval.score.source` | keyword | One of: EVAL_ONLINE, EVAL_OFFLINE, SDK, ANNOTATION, API |
| `eval.experiment.run_id` | keyword | Links span to an experiment run |
| `eval.experiment.set_id` | keyword | Links span to an eval set |
| `eval.experiment.item_id` | keyword | Links span to a specific test case |
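For illustration, a minimal indexed span document combining standard `gen_ai.*` attributes with the platform's `eval.*` extension might look like the sketch below (all field values are hypothetical):

```python
# Hypothetical example of an indexed span document: standard OTel GenAI
# attributes plus the platform's eval.* extension fields.
span_doc = {
    "traceId": "abc123",
    "spanId": "def456",
    "parentSpanId": "",
    "attributes": {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": "claude-sonnet-4-5",
        "gen_ai.usage.input_tokens": 812,
        "gen_ai.usage.output_tokens": 143,
        # Platform extension: links this span to an experiment run
        "eval.experiment.run_id": "run-001",
        "eval.experiment.set_id": "set-001",
        "eval.experiment.item_id": "item-017",
    },
}

# eval.* attributes live alongside gen_ai.* and are auto-indexed by the
# dynamic mapping on the attributes field.
eval_keys = [k for k in span_doc["attributes"] if k.startswith("eval.")]
```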
Scoping Model
Spans: Global documents. No artificial project scoping field. Queried via filters (time range, tags, gen_ai.agent.name, gen_ai.request.model, etc.).
Sessions: Correlated via gen_ai.conversation.id (the OTel convention for session/thread tracking).
Eval sets: Independent named collections. Can be used by multiple experiment runs.
Experiment runs: Reference eval sets via evalSetId. Produce run items linking test cases to traces.
Multi-tenancy: OpenSearch security plugin (index-level permissions, roles). The eval platform itself is tenant-unaware.
Local evaluation scores: Arrive via Scores API with source: SDK, referencing traceId and carrying evaluator metadata.
Agent Root Span Identification
Agent root spans in raw trace data are identified by an empty or null parentSpanId, or by gen_ai.operation.name = invoke_agent.
End-to-end flows for the three evaluation modes:

```mermaid
sequenceDiagram
    participant App as User Application
    participant SDK as Instrumentation Library
    participant OC as OTel Collector
    participant OS as OpenSearch
    participant PROM as Prometheus
    participant JS as eval-scheduler-plugin
    participant PAS as Python Agent Service
    participant LLM as LLM Provider
    participant SDK3P as Third-Party SDK

    Note over App,LLM: Online Agent Trace Evaluation Flow
    App->>SDK: Instrumented function call
    SDK->>OC: OTLP spans (gen_ai.* attributes)
    OC->>OC: Aggregate traceGroupFields.genAi.*
    OC->>OS: Index enriched spans
    OC->>PROM: Emit derived metrics
    JS->>OS: Poll for new spans matching trigger filters
    JS->>JS: Create PENDING eval jobs
    JS->>OS: Read span data
    JS->>PAS: Eval request (evaluator config + span data)
    PAS->>LLM: Eval agent invokes library (Strands/DeepEval/Ragas)
    LLM-->>PAS: Score response
    PAS-->>JS: Structured score result
    JS->>OS: Write eval_score (source: EVAL_ONLINE)

    Note over App,LLM: Offline Agent Trace Evaluation Flow
    SDK->>OS: Fetch eval set items
    loop For each experiment item
        SDK->>App: Call user function(input)
        App-->>SDK: output
        SDK->>OC: OTLP spans (eval.experiment.* tags)
        SDK->>OS: Write experiment_run_item
    end
    SDK->>OS: Write run-level scores (source: EVAL_OFFLINE)
    Note over JS,PAS: Server-side evaluation (same as online, with ground truth)
    JS->>OS: Poll for new spans tagged with eval.experiment.run_id
    JS->>OS: Read span data + expectedOutput from eval_experiments
    JS->>PAS: Eval request (evaluator config + span data + expectedOutput)
    PAS->>LLM: Eval agent invokes library (Strands/DeepEval/Ragas)
    LLM-->>PAS: Score response
    PAS-->>JS: Structured score result
    JS->>OS: Write eval_score (source: EVAL_OFFLINE)

    Note over App,SDK3P: Local Evaluation Flow
    App->>SDK3P: Run evaluation (Strands/DeepEval/Ragas)
    SDK3P->>LLM: LLM call (if metric requires it)
    LLM-->>SDK3P: Score response
    SDK3P->>OC: OTLP spans (trace telemetry)
    SDK3P->>OS: POST /api/scores (source: SDK)
```
Evaluation Algorithm Dependencies
The platform does not implement evaluation algorithms from scratch. For Online and Offline agent trace evaluation, the eval-scheduler-plugin delegates to the Python Agent Service, which hosts a Strands-based eval agent invoking OSS libraries: Strands Eval (agent trajectory, tool-use, multi-step reasoning), DeepEval (GEval, hallucination, relevancy, faithfulness), and Ragas (context precision/recall, answer faithfulness/relevancy).
The eval-scheduler-plugin sends requests with the Evaluator_Template config (library, metric, model, target span data). The Python Agent Service constructs the library call, manages the LLM connection, and returns structured scores. Evaluator_Templates are thin wrappers — each specifies library, metric, provider, and parameters. LLM provider config is pluggable at the template level via Strands SDK's model abstraction.
Components and Interfaces
1. OTel Collector Pipeline
Responsibility: Receives OTLP telemetry, aggregates trace-group metrics, emits derived metrics to Prometheus, and indexes spans into OpenSearch.
Validation: Malformed OTLP payloads are rejected at the receiver level. The processor validates required span fields (traceId, spanId, timestamps) and drops documents missing them, logging errors to a dead-letter index.
Interface:
Input: OTLP gRPC (port 4317) and HTTP (port 4318)
Output: OpenSearch bulk index API, Prometheus remote write
Pipeline Stages:
OTLP Receiver: Accepts gRPC and HTTP OTLP payloads
Trace Processor (Trace-Group Aggregation): Buffers spans by traceId, computes trace-level aggregates (traceGroupFields.genAi.*), and writes them back to every span in the trace. Extends the existing traceGroupFields pattern used for standard APM metrics (duration, status) with GenAI-specific aggregations.
OpenSearch Exporter: Indexes enriched spans into otel-v1-apm-span-* indices
Pre-aggregated fields computed at ingest time and written to each span document within a trace. These denormalized fields enable the agent trace list view to display aggregate statistics without expensive query-time aggregations.
| Field | Type | Calculation |
|---|---|---|
| traceGroupFields.genAi.totalTokens | Long | Sum of input + output tokens across all spans |
| traceGroupFields.genAi.inputTokens | Long | Sum of `gen_ai.usage.input_tokens` |
| traceGroupFields.genAi.outputTokens | Long | Sum of `gen_ai.usage.output_tokens` |
| traceGroupFields.genAi.llmCallCount | Integer | Count of spans where `gen_ai.operation.name` = chat |
| traceGroupFields.genAi.toolCallCount | Integer | Count of spans where `gen_ai.operation.name` = execute_tool |
| traceGroupFields.genAi.errorCount | Integer | Count where status.code = 2 |
Note: Token cost estimation (traceGroupFields.genAi.estimatedCost) is deferred from P0 due to pricing table maintenance complexity.
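The aggregation logic above can be sketched as a pure function over a trace's spans. This is a simplified illustration of the calculation rules, not the collector's actual processor code:

```python
def aggregate_trace_group(spans):
    """Compute traceGroupFields.genAi.* aggregates for one trace.

    Simplified sketch of the collector's trace-group aggregation;
    `spans` is a list of dicts with an `attributes` sub-dict.
    """
    input_tokens = sum(s["attributes"].get("gen_ai.usage.input_tokens", 0) for s in spans)
    output_tokens = sum(s["attributes"].get("gen_ai.usage.output_tokens", 0) for s in spans)
    return {
        "traceGroupFields.genAi.inputTokens": input_tokens,
        "traceGroupFields.genAi.outputTokens": output_tokens,
        "traceGroupFields.genAi.totalTokens": input_tokens + output_tokens,
        "traceGroupFields.genAi.llmCallCount": sum(
            1 for s in spans if s["attributes"].get("gen_ai.operation.name") == "chat"
        ),
        "traceGroupFields.genAi.toolCallCount": sum(
            1 for s in spans if s["attributes"].get("gen_ai.operation.name") == "execute_tool"
        ),
        "traceGroupFields.genAi.errorCount": sum(
            1 for s in spans if s.get("status", {}).get("code") == 2
        ),
    }
```

The result is written back onto every span document in the trace, so list views can sort and filter on these fields without query-time aggregation.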
Aggregate Metrics (Prometheus):
OTel Collector derives and emits the following core metrics to Prometheus from span attributes:

| Metric | Type | Description |
|---|---|---|
| `gen_ai.client.token.usage` | Counter | Token consumption by type |
| `gen_ai.client.operation.duration` | Histogram | LLM call latency distribution |
Metric dimensions: gen_ai.operation.name, gen_ai.system, gen_ai.request.model (normalized to model family), gen_ai.response.model (normalized), service.name, gen_ai.token.type (input/output).
Cardinality Management: High-cardinality fields (traceId, spanId, gen_ai.conversation.id) excluded from metric dimensions. Model IDs normalized to family names (e.g., anthropic.claude-sonnet-4-5-20250929-v1:0 → claude-sonnet-4-5). Estimate: ~9,000 series per customer.
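One way to implement the model-family normalization is sketched below. The stripping rules (provider prefix, trailing date stamp, `:version` suffix) are illustrative assumptions; a real implementation would maintain per-provider rules:

```python
import re

def normalize_model_family(model_id: str) -> str:
    """Reduce a full model ID to a family name for metric dimensions.

    Illustrative sketch: strips a leading "provider." prefix, a
    ":version" suffix, and a trailing -YYYYMMDD(-vN) stamp.
    """
    # Drop a leading "provider." prefix if present
    family = model_id.split(".", 1)[-1]
    # Drop a ":<version>" suffix
    family = family.split(":", 1)[0]
    # Drop a trailing date stamp like -20250929 or -20250929-v1
    family = re.sub(r"-\d{8}(-v\d+)?$", "", family)
    return family
```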
Deduplication: If client-side instrumentation already emits gen_ai.client.token.usage, OTel Collector adds source=span_derived to distinguish its derived metrics.
Large Content Fields: gen_ai.input.messages and gen_ai.output.messages are indexed but not analyzed ("index": false). Full-text search on content fields is opt-in.
Long-Running Traces: Configuration should support increased flush intervals or root-span-triggered flushing for 60+ minute agent conversations.
2. Eval Platform REST API
Responsibility: CRUD operations for eval sets, experiments, scores, and evaluator configs. Exposed as server-side routes within the OSD Plugin.
Endpoints (key routes):
| Method | Path | Description | Req |
|---|---|---|---|
| POST | /api/eval-sets | Create eval set | 4.1 |
| GET | /api/eval-sets | List eval sets | 4.7 |
| POST | /api/eval-sets/{id}/experiments | Add experiment to eval set | 4.2 |
| PUT | /api/eval-sets/{id}/experiments/{eid} | Update experiment (versioned) | 4.6 |
| POST | /api/experiment-runs | Create experiment run | 5.1 |
| POST | /api/experiment-runs/{id}/items | Create run item | 5.2 |
| POST | /api/scores | Submit score | 3.2, 10.1 |
| POST | /api/score-configs | Create score config | 3.1 |
| POST | /api/evaluator-templates | Create evaluator template | 8.1 |
| POST | /api/deterministic-evaluators | Create deterministic evaluator | 21.2 |
| POST | /api/annotation-queues | Create annotation queue | 9.1 |
Authentication: API calls authenticated via API keys or OSD session tokens. Multi-tenancy via OpenSearch security plugin.
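A client-side sketch of assembling a score submission for `POST /api/scores` (the field names follow the eval_scores index schema in this design; the idempotency-key derivation and the auth header shape are assumptions):

```python
import hashlib

def build_score_payload(trace_id: str, name: str, value: float,
                        source: str = "SDK", **extra) -> dict:
    """Assemble a Scores API payload (illustrative sketch)."""
    payload = {
        "traceId": trace_id,
        "name": name,
        "value": value,
        "dataType": "NUMERIC",
        "source": source,  # EVAL_ONLINE, EVAL_OFFLINE, SDK, ANNOTATION, or API
        **extra,
    }
    # Deterministic idempotency key so retried submissions upsert
    # rather than create duplicates (derivation is an assumption).
    payload["idempotencyKey"] = hashlib.sha256(
        f"{trace_id}:{name}:{source}".encode()
    ).hexdigest()
    return payload

# A client would then POST this, e.g. with requests (auth header shape assumed):
#   requests.post(f"{base_url}/api/scores", json=payload,
#                 headers={"Authorization": f"ApiKey {api_key}"})
```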
Span Queries via PPL
Span browsing, searching, and detail retrieval use OpenSearch PPL (Piped Processing Language) queries against the _plugins/_ppl endpoint. The OSD Plugin constructs PPL strings from UI filter state.
Example queries:
```
-- List agent root spans with pre-computed aggregates (Req 15.1)
source = otel-v1-apm-span-*
| where parentSpanId = '' AND isnotnull(`attributes.gen_ai.operation.name`)
| fields traceId, name, durationInNanos, status.code,
    traceGroupFields.genAi.totalTokens, traceGroupFields.genAi.llmCallCount
| sort - startTime | head 100

-- Get all child spans for a trace (Req 15.3)
source = otel-v1-apm-span-*
| where traceId = 'abc123' | sort startTime

-- Aggregate latency by model
source = otel-v1-apm-span-*
| where `attributes.gen_ai.operation.name` = 'chat'
| stats avg(durationInNanos) as avg_latency, count() as call_count by `attributes.gen_ai.request.model`
```
PromQL (Prometheus dashboards): sum(rate(gen_ai_client_token_usage[5m])) by (gen_ai_system), histogram_quantile(0.99, sum(rate(gen_ai_client_operation_duration_bucket[5m])) by (le, gen_ai_response_model))
3. eval-scheduler-plugin
Responsibility: The eval-scheduler-plugin is a lightweight OpenSearch plugin that owns all async evaluation work — detecting new spans for online agent trace evaluation, executing LLM-as-a-Judge scoring, running deterministic evaluators, computing RAG metrics, and managing job lifecycle. It uses the OpenSearch Job Scheduler SPI as its scheduling infrastructure.
Why a custom plugin: Job Scheduler is an SPI framework — it provides scheduling infrastructure (interval/cron triggers, distributed locking, job persistence) but requires a consumer plugin to define job types and execution logic. The plugin implements a polling sweeper to bridge span ingestion and evaluation execution.
Job Scheduler SPI Integration:
The plugin implements three Job Scheduler SPI interfaces:
| Interface | Implementation | Purpose |
|---|---|---|
| JobSchedulerExtension | EvalSchedulerExtension | Registers the plugin with Job Scheduler, declares job index (eval_job_metrics) and runners |
| ScheduledJobParameter | EvalJobParameter | Defines the job document schema stored in eval_job_metrics |
| ScheduledJobRunner | EvalTriggerSweeper, EvalJobExecutor | Contains the execution logic invoked when a scheduled job fires |
The plugin uses Job Scheduler's IntervalSchedule trigger type to register two recurring scheduled jobs:
Trigger Sweeper (EvalTriggerSweeper) — runs every 5–10s, polls for new spans matching online agent trace evaluation triggers and offline experiment traces (via eval.experiment.run_id tags), creates PENDING job documents
Job Executor (EvalJobExecutor) — runs every 2–5s, picks up PENDING job documents ordered by priority, executes evaluations, writes scores
Job Scheduler's CronSchedule is not used: the polling pattern requires sub-minute granularity, which cron expressions cannot express but IntervalSchedule can.
The sweeper is the plugin's span detection mechanism for both online and offline agent trace evaluation. It maintains a lastSweepTime watermark per trigger configuration and queries otel-v1-apm-span-* for spans indexed since the last sweep.
For online agent trace evaluation, it matches root spans against trigger filter criteria (e.g., gen_ai.agent.name, gen_ai.operation.name, tags). For offline agent trace evaluation, it detects spans tagged with eval.experiment.run_id and joins them with the corresponding eval_experiments documents to retrieve ground truth (expectedOutput). In both cases, it creates PENDING job documents in eval_job_metrics.
Sweeper pseudocode:
```
// Online: poll for new root spans matching trigger filters
for each onlineTrigger:
    query otel-v1-apm-span-* WHERE startTime >= lastSweepTime
        AND parentSpanId = "" AND matchesTriggerFilter(trigger)
    for each hit (deduplicate by targetSpanId + evaluatorId):
        createPendingJob(spanId, evaluatorId, priority=HIGH)
    updateLastSweepTime(trigger)

// Offline: poll for new spans tagged with eval.experiment.run_id
for each pendingExperimentRun:
    query otel-v1-apm-span-* WHERE eval.experiment.run_id = runId
        AND parentSpanId = ""
    for each hit:
        groundTruth = fetchExperiment(eval.experiment.item_id)  // join for expectedOutput
        for each evaluator (deduplicate by targetSpanId + evaluatorId):
            createPendingJob(spanId, evaluatorId, priority=NORMAL, groundTruth)
    updateLastSweepTime(run)
```
Deduplication: before creating a job, the sweeper checks eval_job_metrics for an existing targetSpanId + evaluatorId combination to prevent duplicate evaluations.
Job Executor (EvalJobExecutor):
The executor picks up PENDING jobs, acquires a distributed lock via LockService, reads span data, and delegates to the Python Agent Service.
```
1. Query eval_job_metrics WHERE status=PENDING ORDER BY priority DESC, createdAt ASC
2. For each job:
   a. Acquire distributed lock via LockService (skip if another node holds it)
   b. Mark IN_PROGRESS
   c. Read target span data from OpenSearch
   d. Load EvaluatorTemplate config
   e. Send eval request to Python Agent Service (evaluatorConfig + spanData)
   f. Write score to eval_scores, mark COMPLETED
   g. On failure: retry with exponential backoff (2^retryCount * 1000ms)
      or mark FAILED if maxRetries exceeded
   h. Release lock in finally block (TTL fallback: 5min if node crashes)
```
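The retry backoff in step g (2^retryCount * 1000ms) works out to 1s, 2s, 4s for retries 0–2:

```python
def retry_backoff_ms(retry_count: int, base_ms: int = 1000) -> int:
    """Exponential backoff delay between job retries: 2^retryCount * base."""
    return (2 ** retry_count) * base_ms
```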
Distributed Locking: Uses Job Scheduler's LockService. Lock keyed by jobId, configurable TTL (default 5min). If a node crashes, lock expires and job returns to PENDING. Satisfies Req 18.5 (horizontal scaling without duplicate execution).
Priority Queue: Implemented at query level — PENDING jobs sorted by priority DESC, createdAt ASC. Priority values: HIGH=3, NORMAL=2, LOW=1. Online agent trace eval jobs (HIGH) are always picked up before offline batch jobs (NORMAL). Optional concurrency limits per priority level prevent batch starvation.
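The priority-ordered pickup query can be expressed as a standard OpenSearch search body. This sketch assumes a numeric `priorityValue` companion field in the job document, since the keyword values HIGH/NORMAL/LOW would not sort correctly as strings:

```python
def pending_jobs_query(batch_size: int = 10) -> dict:
    """Search body for eval_job_metrics: PENDING jobs, highest priority
    first, oldest first within a priority. Sketch of the query shape;
    `priorityValue` (HIGH=3, NORMAL=2, LOW=1) is an assumed numeric field.
    """
    return {
        "size": batch_size,
        "query": {"term": {"status": "PENDING"}},
        "sort": [
            {"priorityValue": {"order": "desc"}},
            {"createdAt": {"order": "asc"}},
        ],
    }
```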
Batch Job Creation (Offline): When an Experiment_Run completes trace capture, the trigger sweeper detects new spans via eval.experiment.run_id tags, joins with eval_experiments for ground truth, and creates PENDING jobs with priority: NORMAL.
Plugin Configuration (opensearch.yml):
```yaml
eval.scheduler.trigger_sweep_interval: "5s"     # How often sweeper polls for new matching spans
eval.scheduler.job_executor_interval: "2s"      # How often executor picks up PENDING jobs
eval.scheduler.executor_batch_size: 10          # Max jobs per executor cycle
eval.scheduler.lock_ttl_minutes: 5              # Job Scheduler LockService TTL
eval.scheduler.max_retries: 3                   # Default max retries per job
eval.scheduler.online_concurrency_limit: 20     # Max concurrent online agent trace eval jobs
eval.scheduler.offline_concurrency_limit: 50    # Max concurrent offline agent trace eval jobs
eval.scheduler.agent_service_endpoint: "http://localhost:8080"  # Python Agent Service URL
eval.scheduler.agent_service_timeout_ms: 45000  # Timeout for eval requests to agent service
```
Latency Budget (<60s SLA):
| Phase | Budget | Notes |
|---|---|---|
| Span indexing + refresh | ~5s | OpenSearch refresh_interval |
| Trigger sweep detection | 0–10s | Depends on sweep interval (configurable) |
| Job pickup by executor | 0–5s | Depends on executor interval |
| Lock acquisition | <100ms | Job Scheduler LockService, local cluster op |
| Span data read | <500ms | Single document fetch |
| Python Agent Service call | 5–30s | Network hop + eval library + LLM provider latency |
| Score write | <500ms | Single document index |
| **Total** | ~15–45s | Well within 60s SLA for typical cases |
4. Python Agent Service (Eval Agent)
Responsibility: Hosts the Strands-based eval agent that executes LLM-as-a-Judge scoring, RAG metric computation, and any evaluation logic requiring LLM provider access. The eval-scheduler-plugin delegates all LLM-dependent evaluation work to this service.
Context: The Python Agent Service is a broader OpenSearch initiative that provides a unified Python backend for AI-powered assistants. It uses Strands SDK as the orchestration framework and follows a multi-agent pattern with a top-level orchestrator routing requests to specialized sub-agents. The eval platform registers an Eval Agent as a specialized sub-agent within this service.
Why delegate: Eval libraries (Strands Eval, DeepEval, Ragas) and LLM provider SDKs are Python-native. The Java plugin stays focused on scheduling/locking. New eval methods ship as Python library updates without touching the Java plugin. The Python Agent Service already manages LLM credentials, connection pooling, and OTel observability.
Eval Agent Architecture:
The Eval Agent is a specialized sub-agent in the Python Agent Service's agent registry. It receives requests from the eval-scheduler-plugin over an internal API (not user-facing). The agent uses Strands SDK's @tool decorator to expose a run_evaluation tool that resolves the evaluator (library + metric + model config), executes the evaluation, and returns structured scores with explanation and an executionTraceId linking back to the OTel trace of the eval LLM call.
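A minimal sketch of the `run_evaluation` dispatch logic: resolve library + metric to a scorer, execute, and return a structured result. The registry and the placeholder scorer are illustrative assumptions (the real agent would wrap this in Strands SDK's `@tool` decorator and invoke the actual eval libraries):

```python
def _deepeval_faithfulness(span_data: dict, model_config: dict) -> dict:
    # Placeholder scorer: the real service would construct a DeepEval
    # faithfulness metric with an LLM judge and score span_data.
    return {"name": "faithfulness", "value": 1.0, "dataType": "NUMERIC"}

# (library, metric) -> scoring callable; extended as new metrics ship
EVALUATOR_REGISTRY = {
    ("deepeval", "faithfulness"): _deepeval_faithfulness,
}

def run_evaluation(evaluator_config: dict, span_data: dict) -> dict:
    """Resolve the evaluator, execute it, return structured scores."""
    key = (evaluator_config["library"], evaluator_config["metric"])
    scorer = EVALUATOR_REGISTRY.get(key)
    if scorer is None:
        raise ValueError(f"No evaluator registered for {key}")
    score = scorer(span_data, evaluator_config.get("modelConfig", {}))
    return {
        "scores": [score],
        "explanation": None,       # populated by LLM-judge evaluators
        "executionTraceId": None,  # filled from the OTel context of the eval call
    }
```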
Internal API (eval-scheduler-plugin → Python Agent Service):
| Method | Path | Description |
|---|---|---|
| POST | /api/eval-agent/evaluate | Execute a single evaluation |
| POST | /api/eval-agent/evaluate/batch | Execute batch evaluations |
| GET | /api/eval-agent/health | Health check |
Request/Response: The evaluate request carries evaluatorConfig (library, metric, modelConfig, outputSchema) and spanData (input, output, context, expectedOutput). For offline agent trace evaluation, expectedOutput is populated with ground truth; for online, it's null. The response returns scores (array of name/value/dataType), explanation, and executionTraceId.
Deterministic evaluators: Deterministic evaluators (regex, JSON validity, exact match) run directly in the eval-scheduler-plugin. Only LLM Judge and RAG evaluations go to the Python Agent Service.
Observability: All eval agent operations are OTel-instrumented. The executionTraceId links scores to execution traces for debugging.
Deployment: Runs as a sidecar or co-located service. Stateless — all config and state in OpenSearch indices. Horizontally scalable independently.
5. OSD Eval Plugin
Responsibility: Complete evaluation UI as an OpenSearch Dashboards plugin.
Agent trace list columns (selected):

| Column | Source | Notes |
|---|---|---|
| Aggregates (tokens, call counts) | traceGroupFields.genAi.* | Pre-aggregated at ingest, not computed at query time |
| Latency | durationInNanos | Root span duration |
| Input/Output | gen_ai.input.messages, gen_ai.output.messages | Opt-in, may contain sensitive data |
Waterfall View: Inline with span bar: span name, operation type icon, latency, status. Conditional: token count (chat), model name (chat), tool name (tool_call), agent name (invoke_agent).
All data is stored in OpenSearch indices. The spans index uses the OTel GenAI semantic conventions directly. Evaluation-specific entities use dedicated indices under the eval_* prefix.
Spans Index (otel-v1-apm-span-*)
This index stores OTLP spans as-is, preserving all gen_ai.* attributes from the OTel semantic conventions. OTel Collector indexes spans without lossy transformation. The existing APM index template's dynamic field mapping automatically indexes gen_ai.* attributes without schema changes. Trace-group fields are pre-aggregated by OTel Collector at ingest time.
Key design notes: gen_ai.operation.name distinguishes span types. gen_ai.conversation.id is the OTel standard for session correlation. parentSpanId provides parent-child hierarchy. gen_ai.input/output.messages indexed but not analyzed ("index": false) due to size. traceGroupFields.genAi.* pre-aggregated by OTel Collector at ingest. Dynamic mapping on attributes auto-indexes new gen_ai.* attributes. eval.* attributes link spans to experiments. totalCost deferred from P0.
Scores Index (eval_scores)
Source values: EVAL_ONLINE, EVAL_OFFLINE, SDK, ANNOTATION, API. Idempotency: upsert by idempotencyKey via _update with doc_as_upsert. Settings: 3 shards, 1 replica, 5s refresh.
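The idempotent write corresponds to an `_update` call with `doc_as_upsert`, sketched here as the request body a client would send (the choice of idempotencyKey as document ID is an assumption):

```python
def score_upsert_request(score: dict) -> dict:
    """Body for POST eval_scores/_update/{idempotencyKey}: creates the
    document if absent, otherwise overwrites the submitted fields."""
    return {"doc": score, "doc_as_upsert": True}

# e.g. with opensearch-py:
#   client.update(index="eval_scores", id=score["idempotencyKey"],
#                 body=score_upsert_request(score))
```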
| Field | Type | Field | Type |
|---|---|---|---|
| id | keyword | source | keyword |
| name | keyword | traceId | keyword |
| value | float | spanId | keyword |
| stringValue | keyword | sessionId | keyword |
| dataType | keyword | experimentRunId | keyword |
| authorUserId | keyword | configId | keyword |
| comment | text | queueId | keyword |
| metadata | object | executionTraceId | keyword |
| environment | keyword | timestamp | date |
| createdAt | date | updatedAt | date |
| idempotencyKey | keyword | | |
Score Configs Index (eval_score_configs)
| Field | Type | Field | Type |
|---|---|---|---|
| id | keyword | isArchived | boolean |
| name | keyword | minValue | float |
| dataType | keyword | maxValue | float |
| categories | nested (label: keyword, value: float) | description | text |
| createdAt | date | updatedAt | date |
Eval Sets Index (eval_sets)
| Field | Type | Field | Type |
|---|---|---|---|
| id | keyword | metadata | object |
| name | text + keyword | inputSchema | object (not indexed) |
| description | text | expectedOutputSchema | object (not indexed) |
| createdAt | date | updatedAt | date |
Experiments Index (eval_experiments)
Versioning: updates create a new document with a new validFrom. Latest active version: filter status=ACTIVE, sort by validFrom desc per lineageId.
| Field | Type | Field | Type |
|---|---|---|---|
| id | keyword | sourceTraceId | keyword |
| evalSetId | keyword | sourceSpanId | keyword |
| input | object | status | keyword |
| expectedOutput | object | lineageId | keyword |
| metadata | object | validFrom | date |
| createdAt | date | | |
Experiment Runs Index (eval_experiment_runs)
| Field | Type | Field | Type |
|---|---|---|---|
| id | keyword | description | text |
| evalSetId | keyword | metadata | object |
| name | text + keyword | createdAt | date |
Experiment Run Items Index (eval_experiment_run_items)
| Field | Type | Field | Type |
|---|---|---|---|
| id | keyword | traceId | keyword |
| experimentRunId | keyword | spanId | keyword |
| experimentId | keyword | error | text |
| createdAt | date | | |
Evaluator Templates Index (eval_evaluator_templates)
evalLibrary (e.g., deepeval, ragas, strands_eval) and evalMetric (e.g., faithfulness, geval) identify which OSS library and metric to invoke. promptTemplate is optional (custom LLM Judge only).
| Field | Type | Field | Type |
|---|---|---|---|
| id | keyword | modelConfig.provider | keyword |
| name | text + keyword | modelConfig.modelName | keyword |
| evalLibrary | keyword | modelConfig.temperature | float |
| evalMetric | keyword | modelConfig.maxTokens | integer |
| promptTemplate | text | outputSchema | nested (scoreName, dataType, valueMapping) |
| targetType | keyword | evaluationMode | keyword |
| createdAt | date | updatedAt | date |
Deterministic Evaluators Index (eval_deterministic_evaluators)
| Field | Type | Field | Type |
|---|---|---|---|
| id | keyword | evaluationMode | keyword |
| name | text + keyword | targetType | keyword |
| evaluatorType | keyword | scoreConfigId | keyword |
| configuration | object (not indexed) | createdAt | date |
| updatedAt | date | | |
Annotation Queues Index (eval_annotation_queues)
| Field | Type | Field | Type |
|---|---|---|---|
| id | keyword | scoreConfigIds | keyword |
| name | text + keyword | assignedUserIds | keyword |
| description | text | createdAt | date |
| updatedAt | date | | |
Annotation Tasks Index (eval_annotation_tasks)
Locking: optimistic concurrency via _seq_no and _primary_term. Lock release job sweeps for expired locks.
| Field | Type | Field | Type |
|---|---|---|---|
| id | keyword | lockedBy | keyword |
| queueId | keyword | lockedAt | date |
| targetId | keyword | lockTimeout | integer |
| targetType | keyword | completedBy | keyword |
| status | keyword | completedAt | date |
| createdAt | date | | |
Job Metrics Index (eval_job_metrics)
| Field | Type | Field | Type |
|---|---|---|---|
| jobId | keyword | experimentItemId | keyword |
| jobType | keyword | expectedOutput | object |
| status | keyword | retryCount | integer |
| priority | keyword | maxRetries | integer |
| evaluatorId | keyword | error | text |
| evaluatorType | keyword | processingTimeMs | long |
| targetId | keyword | createdAt | date |
| targetType | keyword | startedAt | date |
| completedAt | date | | |
Key Data Model Relationships
```mermaid
erDiagram
    SPAN ||--o{ SPAN : "parent-child via parentSpanId"
    SPAN ||--o{ SCORE : "scored by"
    SESSION ||--o{ SPAN : "groups via gen_ai.conversation.id"
    EVAL_SET ||--o{ EXPERIMENT : contains
    EVAL_SET ||--o{ EXPERIMENT_RUN : "executed as"
    EXPERIMENT_RUN ||--o{ EXPERIMENT_RUN_ITEM : contains
    EXPERIMENT_RUN_ITEM ||--|| EXPERIMENT : references
    EXPERIMENT_RUN_ITEM ||--|| SPAN : "linked to via traceId"
    SCORE_CONFIG ||--o{ SCORE : validates
    EVALUATOR_TEMPLATE ||--o{ SCORE : produces
    DETERMINISTIC_EVALUATOR ||--o{ SCORE : produces
    ANNOTATION_QUEUE ||--o{ ANNOTATION_TASK : contains
    ANNOTATION_TASK ||--o{ SCORE : "produces via review"
    SPAN {
        keyword traceId
        keyword spanId
        keyword parentSpanId
        keyword gen_ai_operation_name
        keyword gen_ai_request_model
        keyword gen_ai_conversation_id
        keyword gen_ai_agent_name
        long gen_ai_usage_input_tokens
        long gen_ai_usage_output_tokens
    }
    SCORE {
        keyword id
        keyword name
        float value
        keyword dataType
        keyword source
        keyword traceId
        keyword spanId
        keyword experimentRunId
    }
    EVAL_SET {
        keyword id
        keyword name
        object inputSchema
        object expectedOutputSchema
    }
    EXPERIMENT {
        keyword id
        keyword evalSetId
        object input
        object expectedOutput
        keyword status
    }
    EXPERIMENT_RUN {
        keyword id
        keyword evalSetId
        keyword name
    }
    EXPERIMENT_RUN_ITEM {
        keyword id
        keyword experimentRunId
        keyword experimentId
        keyword traceId
        text error
    }
```
OTel GenAI Convention to Platform Concept Mapping
This table shows how the platform's UI concepts map to OTel span attributes, eliminating the need for a custom schema translation layer:
| Platform UI Concept | OTel Span Attribute | Notes |
|---|---|---|
| Trace (root) | Span where parentSpanId is null or `gen_ai.operation.name` = invoke_agent | Root span of a trace |
| Observation (child) | Any child span within a trace | Linked via parentSpanId |
| Generation | `gen_ai.operation.name` in (chat, text_completion, generate_content) | LLM inference call |
| Tool call | `gen_ai.operation.name` = execute_tool | Tool/function execution |
| Embedding | `gen_ai.operation.name` = embeddings | Vector embedding call |
| Retrieval | `gen_ai.operation.name` = execute_tool with `gen_ai.tool.type` = datastore | RAG retrieval step |
| Session | `gen_ai.conversation.id` | Groups related traces |
| Model | `gen_ai.request.model` / `gen_ai.response.model` | Model identifier |
| Provider | `gen_ai.provider.name` | e.g., openai, aws.bedrock, anthropic |
| Input tokens | `gen_ai.usage.input_tokens` | Token count |
| Output tokens | `gen_ai.usage.output_tokens` | Token count |
| Agent name | `gen_ai.agent.name` | Human-readable agent identifier |
| Agent ID | `gen_ai.agent.id` | Unique agent identifier |
| Tool name | `gen_ai.tool.name` | Tool identifier |
| Environment | `resource.deployment.environment` | OTel resource attribute |
| Service | `resource.service.name` | OTel resource attribute |
OTel Evaluation Event Alignment
The OTel GenAI conventions define a gen_ai.evaluation.result event for capturing evaluation results. The platform's Score documents align with this event:
| OTel Event Attribute | Platform Score Field | Notes |
|---|---|---|
| `gen_ai.evaluation.name` | name | Score metric name |
| `gen_ai.evaluation.score.value` | value | Numeric score |
| `gen_ai.evaluation.score.label` | stringValue | Categorical/boolean label |
| `gen_ai.evaluation.explanation` | comment | Evaluator reasoning |
| `gen_ai.response.id` | traceId / spanId | Links score to evaluated span |
When scores are submitted via OTLP as gen_ai.evaluation.result events (Local evaluation mode), the OTel Collector maps them to eval_scores documents. When scores are submitted via the REST API, the platform stores them directly.
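A sketch of the collector-side mapping from a `gen_ai.evaluation.result` event to an eval_scores document, using the attribute correspondence in the table above:

```python
def evaluation_event_to_score(event_attrs: dict, trace_id: str, span_id: str) -> dict:
    """Map a gen_ai.evaluation.result event's attributes onto an
    eval_scores document (illustrative sketch of the mapping)."""
    return {
        "name": event_attrs.get("gen_ai.evaluation.name"),
        "value": event_attrs.get("gen_ai.evaluation.score.value"),
        "stringValue": event_attrs.get("gen_ai.evaluation.score.label"),
        "comment": event_attrs.get("gen_ai.evaluation.explanation"),
        "traceId": trace_id,
        "spanId": span_id,
        "source": "SDK",  # Local evaluation mode
    }
```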
Correctness Properties
Each property is universally quantified and suitable for property-based testing.
| # | Property | Description | Reqs |
|---|---|---|---|
| 1 | OTLP Telemetry Round-Trip | Ingesting valid OTLP spans and reading back produces equivalent gen_ai.* attribute values | 1.2, 1.3, 13.8, 14.8 |
| 2 | Span Hierarchy Preservation | N spans with parent-child tree → N stored documents, valid parentSpanId refs, isomorphic tree | |
All tests must be decoupled: each `it` or `test` block runs independently and concurrently, never depending on the actions or outcomes of other tests, with no shared mutable state between tests.
References
RFC: Python Agent Service for OpenSearch — The Python Agent Service that hosts the Strands-based eval agent used by the eval-scheduler-plugin for LLM-dependent evaluation execution.
Overview
This design describes an LLM evaluation platform built natively on the OpenSearch ecosystem. The platform uses OpenSearch indices as the sole data store, OTel Collector for OTLP ingestion and span processing, OpenSearch Job Scheduler for async processing, and OpenSearch Dashboards plugins for the UI.
The data model is grounded in the OpenTelemetry GenAI Semantic Conventions (gen_ai.* attribute namespace). OTLP spans arrive with standard gen_ai.* attributes and are indexed directly into OpenSearch without lossy transformation. This means any OTel-instrumented LLM application (Strands, OpenAI SDK, Bedrock SDK, etc.) can send telemetry to the platform with zero custom mapping.
The system supports three evaluation modes:
- Online Agent Trace Evaluation: automatic post-ingestion scoring of live production traces (reference-free only). Source: EVAL_ONLINE.
- Offline Agent Trace Evaluation: platform-orchestrated batch evaluation against curated Eval_Sets with Ground_Truth. Source: EVAL_OFFLINE.
- Local Evaluation: scores computed client-side by the user's SDK (Strands, DeepEval, Ragas) and submitted via the Scores API; the platform is a passive receiver. Source: SDK.
Key architectural decisions:
- OpenSearch as the sole data store: all entities (spans, scores, eval sets, experiments, jobs) live in dedicated OpenSearch indices; no relational database.
- OTel Collector for ingestion and processing: OTLP telemetry flows through OTel Collector pipelines into OpenSearch. The gen_ai.* span attributes are indexed as-is. OTel Collector handles trace-group metric aggregation and Prometheus metric emission. The existing APM index template's dynamic field mapping ("dynamic": "true" on the attributes field) automatically indexes any new gen_ai.* span attributes without requiring schema changes.
- OTel GenAI conventions as the schema: spans carry standard gen_ai.* attributes directly, extended with platform-specific attributes under the eval.* namespace for evaluation-only fields.
- Dual query paths: trace-group aggregates (traceGroupFields.genAi.*) enable fast OpenSearch queries (PPL) while derived metrics are emitted to Prometheus for time-series analysis (PromQL). OpenSearch handles trace detail and search; Prometheus handles metric aggregation and alerting.
GenAI OTel Conventions Alignment
The platform's data model maps directly to the OTel GenAI semantic conventions (status: Development). The key span types carry these gen_ai.operation.name values: chat, text_completion, embeddings, invoke_agent, create_agent, execute_tool, generate_content.
The platform extends the standard gen_ai.* namespace with eval.* attributes for evaluation-specific metadata that has no OTel equivalent: eval.score.name, eval.score.value, eval.score.source, eval.experiment.run_id, eval.experiment.set_id, eval.experiment.item_id.
Scoping Model
- Trace scope: filtered by span attributes (gen_ai.agent.name, gen_ai.request.model, etc.).
- Session scope: grouped by gen_ai.conversation.id (the OTel convention for session/thread tracking).
- Experiment scope: experiments reference an evalSetId and produce run items linking test cases to traces.
- Local scores: submitted with source: SDK, referencing traceId and carrying evaluator metadata.
Agent Root Span Identification
Agent root spans in raw trace data are identified by:
- parentSpanId = "" (no parent), and
- a gen_ai.operation.name attribute exists (e.g., invoke_agent)
Architecture
```mermaid
graph TB
  subgraph "Client Layer"
    PY[Python Instrumentation Library]
    TS[TypeScript Instrumentation Library]
    APP[User LLM Application]
    SDK3P[Third-Party SDKs - Strands / DeepEval / Ragas]
  end
  subgraph "Ingestion Layer - OTel Collector"
    OC[OTel Collector]
    OC_OTLP[OTLP gRPC/HTTP Receiver]
    OC_TRACE[Trace Processor - Trace-Group Aggregation]
    OC_SINK_OS[OpenSearch Exporter]
    OC_SINK_PROM[Prometheus Remote Write Exporter]
    OC_OTLP --> OC_TRACE
    OC_TRACE --> OC_SINK_OS
    OC_TRACE --> OC_SINK_PROM
  end
  subgraph "OpenSearch Cluster"
    subgraph "Data Indices"
      IDX_SPANS[otel-v1-apm-span - gen_ai attributes + traceGroupFields.genAi]
      IDX_SCORES[eval_scores]
    end
    subgraph "Config Indices"
      IDX_SC[eval_score_configs]
      IDX_ET[eval_evaluator_templates]
      IDX_DE[eval_deterministic_evaluators]
      IDX_AQ[eval_annotation_queues]
    end
    subgraph "Eval Indices"
      IDX_ES[eval_sets]
      IDX_EX[eval_experiments]
      IDX_ER[eval_experiment_runs]
      IDX_ERI[eval_experiment_run_items]
      IDX_AT[eval_annotation_tasks]
    end
    subgraph "Operational Indices"
      IDX_JM[eval_job_metrics]
    end
    JS[eval-scheduler-plugin]
  end
  subgraph "Python Agent Service"
    PAS[Strands Orchestrator]
    EVAL_AGENT[Eval Agent - Strands]
    PAS --> EVAL_AGENT
  end
  subgraph "Metrics Layer"
    PROM[Prometheus]
  end
  subgraph "LLM Providers"
    LLM[Bedrock / OpenAI / Anthropic]
  end
  subgraph "OpenSearch Dashboards"
    OSD[OSD Eval Plugin]
    subgraph "P0 Views"
      V1[Agent Trace List View]
      V1M[Trace List Metrics Summary]
      V9[Agent Trace Timeline / Waterfall View]
      V3D[Agent Span Detail View]
    end
    subgraph "P1 Views"
      V10[Agent Call Graph View]
    end
    subgraph "Eval Views"
      V2[Sessions]
      V3[Eval Sets & Experiments]
      V4[Experiment Runs]
      V5[Annotation Queues]
      V6[Scores & Analytics]
      V7[Evaluators]
      V8[Dashboards]
      V11[Agent Map / Agent Path]
    end
  end
  APP --> PY & TS
  APP --> SDK3P
  PY & TS -->|OTLP spans with gen_ai.* attrs| OC_OTLP
  PY & TS -->|REST API| IDX_ES & IDX_EX & IDX_ER & IDX_ERI & IDX_SCORES
  SDK3P -->|OTLP spans| OC_OTLP
  SDK3P -->|Scores API - source: SDK| IDX_SCORES
  OC_SINK_OS -->|index| IDX_SPANS
  OC_SINK_PROM -->|remote write| PROM
  IDX_SPANS -->|polling sweep| JS
  JS -->|eval request| PAS
  EVAL_AGENT -->|eval library call| LLM
  JS -->|eval scores| IDX_SCORES
  JS -->|job metrics| IDX_JM
  OSD --> V1 & V1M & V9 & V3D & V10
  OSD --> V2 & V3 & V4 & V5 & V6 & V7 & V8 & V11
  OSD -->|PPL queries| IDX_SPANS & IDX_SCORES & IDX_ES & IDX_ER
  OSD -->|PromQL queries| PROM
```
Component Interaction Flow
```mermaid
sequenceDiagram
  participant App as User Application
  participant SDK as Instrumentation Library
  participant OC as OTel Collector
  participant OS as OpenSearch
  participant PROM as Prometheus
  participant JS as eval-scheduler-plugin
  participant PAS as Python Agent Service
  participant LLM as LLM Provider
  participant SDK3P as Third-Party SDK
  Note over App,LLM: Online Agent Trace Evaluation Flow
  App->>SDK: Instrumented function call
  SDK->>OC: OTLP spans (gen_ai.* attributes)
  OC->>OC: Aggregate traceGroupFields.genAi.*
  OC->>OS: Index enriched spans
  OC->>PROM: Emit derived metrics
  JS->>OS: Poll for new spans matching trigger filters
  JS->>JS: Create PENDING eval jobs
  JS->>OS: Read span data
  JS->>PAS: Eval request (evaluator config + span data)
  PAS->>LLM: Eval agent invokes library (Strands/DeepEval/Ragas)
  LLM-->>PAS: Score response
  PAS-->>JS: Structured score result
  JS->>OS: Write eval_score (source: EVAL_ONLINE)
  Note over App,LLM: Offline Agent Trace Evaluation Flow
  SDK->>OS: Fetch eval set items
  loop For each experiment item
    SDK->>App: Call user function(input)
    App-->>SDK: output
    SDK->>OC: OTLP spans (eval.experiment.* tags)
    SDK->>OS: Write experiment_run_item
  end
  SDK->>OS: Write run-level scores (source: EVAL_OFFLINE)
  Note over JS,PAS: Server-side evaluation (same as online, with ground truth)
  JS->>OS: Poll for new spans tagged with eval.experiment.run_id
  JS->>OS: Read span data + expectedOutput from eval_experiments
  JS->>PAS: Eval request (evaluator config + span data + expectedOutput)
  PAS->>LLM: Eval agent invokes library (Strands/DeepEval/Ragas)
  LLM-->>PAS: Score response
  PAS-->>JS: Structured score result
  JS->>OS: Write eval_score (source: EVAL_OFFLINE)
  Note over App,SDK3P: Local Evaluation Flow
  App->>SDK3P: Run evaluation (Strands/DeepEval/Ragas)
  SDK3P->>LLM: LLM call (if metric requires it)
  LLM-->>SDK3P: Score response
  SDK3P->>OC: OTLP spans (trace telemetry)
  SDK3P->>OS: POST /api/scores (source: SDK)
```
Evaluation Algorithm Dependencies
The platform does not implement evaluation algorithms from scratch. For Online and Offline agent trace evaluation, the eval-scheduler-plugin delegates to the Python Agent Service, which hosts a Strands-based eval agent invoking OSS libraries: Strands Eval (agent trajectory, tool-use, multi-step reasoning), DeepEval (GEval, hallucination, relevancy, faithfulness), and Ragas (context precision/recall, answer faithfulness/relevancy).
The eval-scheduler-plugin sends requests with the Evaluator_Template config (library, metric, model, target span data). The Python Agent Service constructs the library call, manages the LLM connection, and returns structured scores. Evaluator_Templates are thin wrappers — each specifies library, metric, provider, and parameters. LLM provider config is pluggable at the template level via Strands SDK's model abstraction.
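A template is therefore just configuration data. The sketch below is a hypothetical example document — field names follow the eval_evaluator_templates index described later (evalLibrary, evalMetric, promptTemplate); the exact schema and the modelConfig shape are assumptions:

```python
# Hypothetical Evaluator_Template document, for illustration only.
faithfulness_template = {
    "evalLibrary": "ragas",          # which OSS library to invoke
    "evalMetric": "faithfulness",    # metric within that library
    "modelConfig": {                 # pluggable LLM provider via Strands SDK
        "provider": "bedrock",
        "model": "claude-sonnet-4-5",
    },
    "promptTemplate": None,          # only used for custom LLM Judge metrics
}
```

Swapping the provider or model is a template-level change only; no plugin code is touched.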
Components and Interfaces
1. OTel Collector Pipeline
Responsibility: Receives OTLP telemetry, aggregates trace-group metrics, emits derived metrics to Prometheus, and indexes spans into OpenSearch.
Validation: Malformed OTLP payloads are rejected at the receiver level. The processor validates required span fields (traceId, spanId, timestamps) and drops documents missing them, logging errors to a dead-letter index.
Interface:
Pipeline Stages:
- OTLP gRPC/HTTP Receiver: accepts OTLP telemetry from instrumented applications.
- Trace Processor: groups spans by traceId, computes trace-level aggregates (traceGroupFields.genAi.*), and writes them back to every span in the trace. Extends the existing traceGroupFields pattern used for standard APM metrics (duration, status) with GenAI-specific aggregations.
- OpenSearch Exporter: indexes spans into otel-v1-apm-span-* indices.
- Prometheus Remote Write Exporter: emits derived metrics (gen_ai.client.token.usage, gen_ai.client.operation.duration) to Prometheus.
Trace-Group Fields (GenAI):
Pre-aggregated fields computed at ingest time and written to each span document within a trace. These denormalized fields enable the agent trace list view to display aggregate statistics without expensive query-time aggregations.
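The ingest-time aggregation can be sketched as a pure function over the spans of one trace. This is a simplified illustration — real spans are OTLP documents, represented here as plain dicts:

```python
# Sketch of the Trace Processor's GenAI trace-group aggregation (simplified).
def compute_gen_ai_trace_group(spans):
    attrs = lambda s: s.get("attributes", {})
    input_tokens = sum(attrs(s).get("gen_ai.usage.input_tokens", 0) for s in spans)
    output_tokens = sum(attrs(s).get("gen_ai.usage.output_tokens", 0) for s in spans)
    return {
        "totalTokens": input_tokens + output_tokens,
        "inputTokens": input_tokens,
        "outputTokens": output_tokens,
        # one counter per operation type, per the field list below
        "llmCallCount": sum(1 for s in spans
                            if attrs(s).get("gen_ai.operation.name") == "chat"),
        "toolCallCount": sum(1 for s in spans
                             if attrs(s).get("gen_ai.operation.name") == "execute_tool"),
        "errorCount": sum(1 for s in spans
                          if s.get("status", {}).get("code") == 2),
    }
```

The result is denormalized onto every span document in the trace, so the trace list view needs no query-time aggregation.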
| Field | Source |
|---|---|
| traceGroupFields.genAi.totalTokens | input + output tokens for the trace |
| traceGroupFields.genAi.inputTokens | sum of gen_ai.usage.input_tokens |
| traceGroupFields.genAi.outputTokens | sum of gen_ai.usage.output_tokens |
| traceGroupFields.genAi.llmCallCount | spans with gen_ai.operation.name = chat |
| traceGroupFields.genAi.toolCallCount | spans with gen_ai.operation.name = execute_tool |
| traceGroupFields.genAi.errorCount | spans with status.code = 2 |
Aggregate Metrics (Prometheus):
OTel Collector derives and emits three core metrics to Prometheus from span attributes:
- gen_ai.client.token.usage
- gen_ai.client.operation.duration
Metric dimensions: gen_ai.operation.name, gen_ai.system, gen_ai.request.model (normalized to model family), gen_ai.response.model (normalized), service.name, gen_ai.token.type (input/output).
Cardinality Management: High-cardinality fields (traceId, spanId, gen_ai.conversation.id) are excluded from metric dimensions. Model IDs are normalized to family names (e.g., anthropic.claude-sonnet-4-5-20250929-v1:0 → claude-sonnet-4-5). Estimate: ~9,000 series per customer.
Deduplication: If client-side instrumentation already emits gen_ai.client.token.usage, the OTel Collector adds source=span_derived to distinguish its derived metrics.
Large Content Fields:
gen_ai.input.messages and gen_ai.output.messages are indexed but not analyzed ("index": false). Full-text search on content fields is opt-in.
Long-Running Traces: Configuration should support increased flush intervals or root-span-triggered flushing for 60+ minute agent conversations.
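The model-family normalization described under Cardinality Management can be sketched as below. The real mapping rules live in collector configuration; this function and its heuristics are an assumption for illustration:

```python
import re

# Illustrative sketch of model-ID normalization for metric cardinality control,
# e.g. "anthropic.claude-sonnet-4-5-20250929-v1:0" -> "claude-sonnet-4-5".
def normalize_model_family(model_id: str) -> str:
    name = model_id.split(".")[-1]          # drop provider prefix
    name = name.split(":")[0]               # drop ":0"-style version suffix
    return re.sub(r"-\d{8}.*$", "", name)   # drop date/revision tail
```

IDs without a provider prefix or date tail pass through unchanged, so the function is safe to apply uniformly before labeling metrics.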
2. Eval Platform REST API
Responsibility: CRUD operations for eval sets, experiments, scores, and evaluator configs. Exposed as server-side routes within the OSD Plugin.
Endpoints (key routes):
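The route list was not preserved in this rendering. A hypothetical sketch consistent with the entities elsewhere in this design (only POST /api/scores appears explicitly, in the Local Evaluation flow) might look like:

```
POST   /api/eval_sets            # hypothetical route names
GET    /api/eval_sets/{id}
POST   /api/experiments
POST   /api/experiment_runs
POST   /api/scores               # Scores API (source: SDK), per the flow above
GET    /api/evaluators
```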
Authentication: API calls authenticated via API keys or OSD session tokens. Multi-tenancy via OpenSearch security plugin.
Span Queries via PPL
Span browsing, searching, and detail retrieval use OpenSearch PPL (Piped Processing Language) queries against the _plugins/_ppl endpoint. The OSD Plugin constructs PPL strings from UI filter state.
Example queries:
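The original examples were lost in this rendering; the following are illustrative PPL queries only — exact field paths depend on the index mapping:

```
source = otel-v1-apm-span-* | where attributes.gen_ai.operation.name = 'chat'
  | stats count() by attributes.gen_ai.request.model
source = otel-v1-apm-span-* | where traceGroupFields.genAi.errorCount > 0
  | sort - startTime | head 50
```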
PromQL (Prometheus dashboards):
sum(rate(gen_ai_client_token_usage[5m])) by (gen_ai_system), histogram_quantile(0.99, sum(rate(gen_ai_client_operation_duration_bucket[5m])) by (le, gen_ai_response_model))
3. eval-scheduler-plugin
Responsibility: The eval-scheduler-plugin is a lightweight OpenSearch plugin that owns all async evaluation work — detecting new spans for online agent trace evaluation, executing LLM-as-a-Judge scoring, running deterministic evaluators, computing RAG metrics, and managing job lifecycle. It uses the OpenSearch Job Scheduler SPI as its scheduling infrastructure.
Why a custom plugin: Job Scheduler is an SPI framework — it provides scheduling infrastructure (interval/cron triggers, distributed locking, job persistence) but requires a consumer plugin to define job types and execution logic. The plugin implements a polling sweeper to bridge span ingestion and evaluation execution.
Job Scheduler SPI Integration:
The plugin implements three Job Scheduler SPI interfaces:
| SPI Interface | Implementation | Notes |
|---|---|---|
| JobSchedulerExtension | EvalSchedulerExtension | registers the job index (eval_job_metrics) and runners |
| ScheduledJobParameter | EvalJobParameter | job parameters persisted in eval_job_metrics |
| ScheduledJobRunner | EvalTriggerSweeper, EvalJobExecutor | executes scheduled work |
The plugin uses Job Scheduler's
IntervalSchedule trigger type to register two recurring scheduled jobs:
- Trigger Sweeper (EvalTriggerSweeper) — runs every 5–10s, polls for new spans matching online agent trace evaluation triggers and offline experiment traces (via eval.experiment.run_id tags), creates PENDING job documents
- Job Executor (EvalJobExecutor) — runs every 2–5s, picks up PENDING job documents ordered by priority, executes evaluations, writes scores
Job Scheduler's
CronSchedule is not used — the polling pattern requires sub-minute granularity that interval scheduling provides.
Job Types:
- online_agent_trace_eval
- offline_agent_trace_eval_item
- offline_agent_trace_eval_run
- annotation_lock_release
Job Document Schema (stored in
eval_job_metrics):
```json
{
  "jobId": "keyword",
  "jobType": "keyword (online_agent_trace_eval | offline_agent_trace_eval_item | offline_agent_trace_eval_run)",
  "status": "keyword (PENDING | RUNNING | COMPLETED | FAILED)",
  "priority": "keyword (HIGH | NORMAL | LOW)",
  "evaluatorId": "keyword",
  "evaluatorType": "keyword (LLM_JUDGE | DETERMINISTIC | RAG)",
  "targetSpanId": "keyword",
  "targetType": "keyword (SPAN | EXPERIMENT_RUN_ITEM)",
  "experimentItemId": "keyword",
  "expectedOutput": "object",
  "retryCount": "integer",
  "maxRetries": "integer",
  "error": "text",
  "createdAt": "date",
  "startedAt": "date",
  "completedAt": "date"
}
```
Trigger Sweeper (
EvalTriggerSweeper):
The sweeper is the plugin's span detection mechanism for both online and offline agent trace evaluation. It maintains a lastSweepTime watermark per trigger configuration and queries otel-v1-apm-span-* for spans indexed since the last sweep.
For online agent trace evaluation, it matches root spans against trigger filter criteria (e.g., gen_ai.agent.name, gen_ai.operation.name, tags). For offline agent trace evaluation, it detects spans tagged with eval.experiment.run_id and joins them with the corresponding eval_experiments documents to retrieve ground truth (expectedOutput). In both cases, it creates PENDING job documents in eval_job_metrics.
Sweeper pseudocode:
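The original pseudocode block did not survive this rendering; the following is a hedged sketch of the sweep loop, with search_spans / find_job / create_job as hypothetical stand-ins for the plugin's OpenSearch calls:

```python
# Sketch of the EvalTriggerSweeper loop (simplified; real implementation is Java).
def sweep(triggers, last_sweep_time, search_spans, find_job, create_job, now):
    for trigger in triggers:
        watermark = last_sweep_time.get(trigger["id"], 0)
        # query spans indexed since the per-trigger watermark
        for span in search_spans(trigger["filter"], since=watermark):
            # deduplicate on (targetSpanId, evaluatorId) before enqueueing
            if find_job(span["spanId"], trigger["evaluatorId"]) is None:
                create_job({
                    "jobType": "online_agent_trace_eval",
                    "status": "PENDING",
                    "priority": "HIGH",
                    "evaluatorId": trigger["evaluatorId"],
                    "targetSpanId": span["spanId"],
                })
        last_sweep_time[trigger["id"]] = now  # advance the watermark
```

The same loop serves offline evaluation by matching on eval.experiment.run_id tags and attaching expectedOutput to the job document.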
Deduplication: before creating a job, the sweeper checks eval_job_metrics for an existing targetSpanId + evaluatorId combination to prevent duplicate evaluations.
Job Executor (
EvalJobExecutor):
The executor picks up PENDING jobs, acquires a distributed lock via LockService, reads span data, and delegates to the Python Agent Service.
Distributed Locking: Uses Job Scheduler's LockService. Lock keyed by jobId, configurable TTL (default 5min). If a node crashes, the lock expires and the job returns to PENDING. Satisfies Req 18.5 (horizontal scaling without duplicate execution).
Priority Queue: Implemented at query level — PENDING jobs sorted by
priority DESC, createdAt ASC. Priority values: HIGH=3, NORMAL=2, LOW=1. Online agent trace eval jobs (HIGH) are always picked up before offline batch jobs (NORMAL). Optional concurrency limits per priority level prevent batch starvation.
Batch Job Creation (Offline): When an Experiment_Run completes trace capture, the trigger sweeper detects new spans via eval.experiment.run_id tags, joins with eval_experiments for ground truth, and creates PENDING jobs with priority: NORMAL.
Plugin Configuration (opensearch.yml):
Latency Budget (<60s SLA):
4. Python Agent Service (Eval Agent)
Responsibility: Hosts the Strands-based eval agent that executes LLM-as-a-Judge scoring, RAG metric computation, and any evaluation logic requiring LLM provider access. The eval-scheduler-plugin delegates all LLM-dependent evaluation work to this service.
Context: The Python Agent Service is a broader OpenSearch initiative that provides a unified Python backend for AI-powered assistants. It uses Strands SDK as the orchestration framework and follows a multi-agent pattern with a top-level orchestrator routing requests to specialized sub-agents. The eval platform registers an Eval Agent as a specialized sub-agent within this service.
Why delegate: Eval libraries (Strands Eval, DeepEval, Ragas) and LLM provider SDKs are Python-native. The Java plugin stays focused on scheduling/locking. New eval methods ship as Python library updates without touching the Java plugin. The Python Agent Service already manages LLM credentials, connection pooling, and OTel observability.
Eval Agent Architecture:
The Eval Agent is a specialized sub-agent in the Python Agent Service's agent registry. It receives requests from the eval-scheduler-plugin over an internal API (not user-facing). The agent uses Strands SDK's @tool decorator to expose a run_evaluation tool that resolves the evaluator (library + metric + model config), executes the evaluation, and returns structured scores with explanation and an executionTraceId linking back to the OTel trace of the eval LLM call.
Internal API (eval-scheduler-plugin → Python Agent Service):
Request/Response: The evaluate request carries evaluatorConfig (library, metric, modelConfig, outputSchema) and spanData (input, output, context, expectedOutput). For offline agent trace evaluation, expectedOutput is populated with ground truth; for online, it's null. The response returns scores (array of name/value/dataType), explanation, and executionTraceId.
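Concretely, the payloads described above might look like the following. These are illustrative values only — the exact wire format is not specified in this design:

```python
# Hypothetical evaluate request/response payloads for the internal API.
evaluate_request = {
    "evaluatorConfig": {
        "library": "deepeval",
        "metric": "faithfulness",
        "modelConfig": {"provider": "bedrock", "model": "claude-sonnet-4-5"},
        "outputSchema": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "spanData": {
        "input": "What is the refund policy?",
        "output": "Refunds are accepted within 30 days.",
        "context": ["Refunds: 30-day window."],
        "expectedOutput": None,  # populated for offline eval; null for online
    },
}
evaluate_response = {
    "scores": [{"name": "faithfulness", "value": 0.92, "dataType": "NUMERIC"}],
    "explanation": "Answer is grounded in the provided context.",
    "executionTraceId": "4bf92f3577b34da6a3ce929d0e0e4736",
}
```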
Observability: All eval agent operations are OTel-instrumented. The
executionTraceIdlinks scores to execution traces for debugging.Deployment: Runs as a sidecar or co-located service. Stateless — all config and state in OpenSearch indices. Horizontally scalable independently.
5. OSD Eval Plugin
Responsibility: Complete evaluation UI as an OpenSearch Dashboards plugin.
Plugin Registration:
Navigation Structure:
P0 Agent Tracing Views:
traceGroupFields.genAi.*)P1 Agent Tracing Views:
Eval Platform Views:
gen_ai.conversation.id)Agent Trace List View Columns:
Each column maps to specific OTel GenAI semantic convention attributes:
status.code,error.typegen_ai.operation.namegen_ai.agent.name(agent),gen_ai.request.model(LLM),gen_ai.tool.name(tool)traceGroupFields.genAi.totalTokensdurationInNanosgen_ai.input.messages,gen_ai.output.messagesWaterfall View: Inline with span bar: span name, operation type icon, latency, status. Conditional: token count (chat), model name (chat), tool name (tool_call), agent name (invoke_agent).
Span Detail View (conditional by
gen_ai.operation.name):chat→ Messages, Model, Tokens, Temperature, Finish Reason.execute_tool→ Tool Name, Arguments, Result.invoke_agent→ Agent Name/ID. Common → Trace/Span IDs, Service, Timestamps, Duration, Status.Query Architecture: Trace list views use pre-computed
traceGroupFields.genAi.*— no aggregation at query time. Metric dashboards query Prometheus.6. Python Instrumentation Library
Responsibility: Instruments Python LLM applications using OTel GenAI conventions, provides eval set/experiment/score APIs.
Core API:
OTLP Export: Uses OpenTelemetry Python SDK. Spans carry standard
gen_ai.*attributes:gen_ai.operation.name,gen_ai.request.model,gen_ai.usage.input_tokens,gen_ai.usage.output_tokens,gen_ai.input.messages,gen_ai.output.messages, etc.7. TypeScript Instrumentation Library
Responsibility: Same as Python library but for TypeScript/Node.js.
Data Models
All data is stored in OpenSearch indices. The spans index uses the OTel GenAI semantic conventions directly. Evaluation-specific entities use dedicated indices under the
eval_*prefix.Spans Index (
otel-v1-apm-span-*)This index stores OTLP spans as-is, preserving all
gen_ai.*attributes from the OTel semantic conventions. OTel Collector indexes spans without lossy transformation. The existing APM index template's dynamic field mapping automatically indexesgen_ai.*attributes without schema changes. Trace-group fields are pre-aggregated by OTel Collector at ingest time.{ "mappings": { "dynamic": "true", "properties": { "traceId": { "type": "keyword" }, "spanId": { "type": "keyword" }, "parentSpanId": { "type": "keyword" }, "traceGroup": { "type": "keyword" }, "name": { "type": "text", "fields": { "keyword": { "type": "keyword" } } }, "kind": { "type": "keyword" }, "startTime": { "type": "date_nanos" }, "endTime": { "type": "date_nanos" }, "durationInNanos": { "type": "long" }, "status": { "properties": { "code": { "type": "keyword" }, "message": { "type": "text" } } }, "traceGroupFields": { "properties": { "endTime": { "type": "date_nanos" }, "durationInNanos": { "type": "long" }, "statusCode": { "type": "integer" }, "genAi": { "properties": { "totalTokens": { "type": "long" }, "inputTokens": { "type": "long" }, "outputTokens": { "type": "long" }, "llmCallCount": { "type": "integer" }, "toolCallCount": { "type": "integer" }, "errorCount": { "type": "integer" } } } } }, "attributes": { "dynamic": "true", "properties": { "gen_ai.operation.name": { "type": "keyword" }, "gen_ai.provider.name": { "type": "keyword" }, "gen_ai.request.model": { "type": "keyword" }, "gen_ai.response.model": { "type": "keyword" }, "gen_ai.request.temperature": { "type": "float" }, "gen_ai.request.max_tokens": { "type": "integer" }, "gen_ai.request.top_p": { "type": "float" }, "gen_ai.usage.input_tokens": { "type": "long" }, "gen_ai.usage.output_tokens": { "type": "long" }, "gen_ai.response.id": { "type": "keyword" }, "gen_ai.response.finish_reasons": { "type": "keyword" }, "gen_ai.conversation.id": { "type": "keyword" }, "gen_ai.agent.id": { "type": "keyword" }, "gen_ai.agent.name": { "type": "keyword" }, "gen_ai.agent.description": { "type": "text" }, 
"gen_ai.tool.name": { "type": "keyword" }, "gen_ai.tool.type": { "type": "keyword" }, "gen_ai.tool.call.id": { "type": "keyword" }, "gen_ai.tool.description": { "type": "text" }, "gen_ai.input.messages": { "type": "object", "enabled": true, "index": false }, "gen_ai.output.messages": { "type": "object", "enabled": true, "index": false }, "gen_ai.system_instructions": { "type": "object", "enabled": true, "index": false }, "gen_ai.tool.definitions": { "type": "object", "enabled": false }, "gen_ai.tool.call.arguments": { "type": "object", "enabled": true }, "gen_ai.tool.call.result": { "type": "object", "enabled": true }, "gen_ai.data_source.id": { "type": "keyword" }, "gen_ai.output.type": { "type": "keyword" }, "error.type": { "type": "keyword" }, "server.address": { "type": "keyword" }, "server.port": { "type": "integer" } } }, "resource": { "properties": { "service.name": { "type": "keyword" }, "service.version": { "type": "keyword" }, "deployment.environment": { "type": "keyword" }, "telemetry.sdk.name": { "type": "keyword" }, "telemetry.sdk.version": { "type": "keyword" } } }, "eval.experiment.run_id": { "type": "keyword" }, "eval.experiment.set_id": { "type": "keyword" }, "eval.experiment.item_id": { "type": "keyword" }, "latency": { "type": "float" }, "tags": { "type": "keyword" }, "bookmarked": { "type": "boolean" }, "createdAt": { "type": "date" } } }, "settings": { "index": { "number_of_shards": 5, "number_of_replicas": 1, "refresh_interval": "5s" } } }Key design notes:
- gen_ai.operation.name distinguishes span types.
- gen_ai.conversation.id is the OTel standard for session correlation.
- parentSpanId provides the parent-child hierarchy.
- gen_ai.input/output.messages are indexed but not analyzed ("index": false) due to size.
- traceGroupFields.genAi.* is pre-aggregated by OTel Collector at ingest. Dynamic mapping on attributes auto-indexes new gen_ai.* attributes.
- eval.* attributes link spans to experiments.
- totalCost is deferred from P0.
Scores Index (eval_scores)
Source values: EVAL_ONLINE, EVAL_OFFLINE, SDK, ANNOTATION, API. Idempotency: upsert by idempotencyKey via _update with doc_as_upsert. Settings: 3 shards, 1 replica, 5s refresh.
Score Configs Index (
eval_score_configs)
Eval Sets Index (eval_sets)
Experiments Index (eval_experiments)
Versioning: updates create a new document with a new validFrom. Latest active version: filter status=ACTIVE, sort by validFrom desc per lineageId.
Experiment Runs Index (eval_experiment_runs)
Experiment Run Items Index (eval_experiment_run_items)
Evaluator Templates Index (eval_evaluator_templates)
evalLibrary (e.g., deepeval, ragas, strands_eval) and evalMetric (e.g., faithfulness, geval) identify which OSS library and metric to invoke. promptTemplate is optional (custom LLM Judge only).
Deterministic Evaluators Index (eval_deterministic_evaluators)
Annotation Queues Index (eval_annotation_queues)
Annotation Tasks Index (eval_annotation_tasks)
Locking: optimistic concurrency via _seq_no and _primary_term. Lock release job sweeps for expired locks.
Job Metrics Index (eval_job_metrics)
Key Data Model Relationships
```mermaid
erDiagram
  SPAN ||--o{ SPAN : "parent-child via parentSpanId"
  SPAN ||--o{ SCORE : "scored by"
  SESSION ||--o{ SPAN : "groups via gen_ai.conversation.id"
  EVAL_SET ||--o{ EXPERIMENT : contains
  EVAL_SET ||--o{ EXPERIMENT_RUN : "executed as"
  EXPERIMENT_RUN ||--o{ EXPERIMENT_RUN_ITEM : contains
  EXPERIMENT_RUN_ITEM ||--|| EXPERIMENT : references
  EXPERIMENT_RUN_ITEM ||--|| SPAN : "linked to via traceId"
  SCORE_CONFIG ||--o{ SCORE : validates
  EVALUATOR_TEMPLATE ||--o{ SCORE : produces
  DETERMINISTIC_EVALUATOR ||--o{ SCORE : produces
  ANNOTATION_QUEUE ||--o{ ANNOTATION_TASK : contains
  ANNOTATION_TASK ||--o{ SCORE : "produces via review"
  SPAN {
    keyword traceId
    keyword spanId
    keyword parentSpanId
    keyword gen_ai_operation_name
    keyword gen_ai_request_model
    keyword gen_ai_conversation_id
    keyword gen_ai_agent_name
    long gen_ai_usage_input_tokens
    long gen_ai_usage_output_tokens
  }
  SCORE {
    keyword id
    keyword name
    float value
    keyword dataType
    keyword source
    keyword traceId
    keyword spanId
    keyword experimentRunId
  }
  EVAL_SET {
    keyword id
    keyword name
    object inputSchema
    object expectedOutputSchema
  }
  EXPERIMENT {
    keyword id
    keyword evalSetId
    object input
    object expectedOutput
    keyword status
  }
  EXPERIMENT_RUN {
    keyword id
    keyword evalSetId
    keyword name
  }
  EXPERIMENT_RUN_ITEM {
    keyword id
    keyword experimentRunId
    keyword experimentId
    keyword traceId
    text error
  }
```
OTel GenAI Convention to Platform Concept Mapping
This table shows how the platform's UI concepts map to OTel span attributes, eliminating the need for a custom schema translation layer:
| Platform Concept | OTel Mapping |
|---|---|
| Agent root span | parentSpanId is null or gen_ai.operation.name = invoke_agent |
| Span hierarchy | parentSpanId |
| LLM call | gen_ai.operation.name in (chat, text_completion, generate_content) |
| Tool call | gen_ai.operation.name = execute_tool |
| Embedding | gen_ai.operation.name = embeddings |
| Retrieval (RAG) | gen_ai.operation.name = execute_tool with gen_ai.tool.type = datastore |
| Session | gen_ai.conversation.id |
| Model | gen_ai.request.model / gen_ai.response.model |
| Provider | gen_ai.provider.name (openai, aws.bedrock, anthropic) |
| Input tokens | gen_ai.usage.input_tokens |
| Output tokens | gen_ai.usage.output_tokens |
| Agent name | gen_ai.agent.name |
| Agent ID | gen_ai.agent.id |
| Tool name | gen_ai.tool.name |
| Environment | resource.deployment.environment |
| Service | resource.service.name |
OTel Evaluation Event Alignment
The OTel GenAI conventions define a gen_ai.evaluation.result event for capturing evaluation results. The platform's Score documents align with this event:
| OTel Event Attribute | Platform Score Field |
|---|---|
| gen_ai.evaluation.name | name |
| gen_ai.evaluation.score.value | value |
| gen_ai.evaluation.score.label | stringValue |
| gen_ai.evaluation.explanation | comment |
| gen_ai.response.id | traceId / spanId |
When scores are submitted via OTLP as gen_ai.evaluation.result events (Local evaluation mode), the OTel Collector maps them to eval_scores documents. When scores are submitted via the REST API, the platform stores them directly.
Correctness Properties
Each property is universally quantified and suitable for property-based testing.
Property descriptions include: gen_ai.* attribute values; parentSpanId refs, isomorphic tree; validFrom — latest active query returns one per lineageId; {{expectedOutput}}; error.type; parentSpanId tree → valid tree with one root, no cycles; gen_ai.conversation.id → all spans returned in chronological order; parentSpanId identified as parallel; execute_tool + datastore spans → RAG_Context = gen_ai.tool.call.result; {{contexts}} from datastore tool results, {{question}} from root input.
Error Handling
Ingestion Errors
Score Validation Errors
Eval Set / Experiment Errors
Evaluation Errors
Annotation Errors
Testing Strategy
Dual Testing Approach
Property-Based Testing Configuration
Feature: opensearch-eval-platform, Property {N}: {property_title}
Test Organization
Property-based tests are distributed across components: OTel pipeline (P1-3), score validation (P4-5, P18), JSON schema (P6), experiment versioning (P7-8), name enforcement (P9), experiment runner (P10-12), eval mode (P13), LLM Judge (P14-15), annotations (P16-17, P37), agreement metrics (P19), instrumentation (P20), span queries (P21), span tree (P22, P28-29), sessions (P23-24), job scheduler (P25-27, P39), agent map/path (P30-31), deterministic evaluators (P32-33), RAG (P34-36), entity CRUD (P38). TypeScript components use fast-check + Jest; Python SDK uses Hypothesis + pytest.
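Property 1 (OTLP round-trip) can be sketched as follows. For brevity this sketch replaces Hypothesis generators with stdlib random, and `index_span` is a stand-in for the real ingest-plus-read-back path:

```python
import random
import string

# Stand-in for "ingest a span, then read it back from OpenSearch".
def index_span(attrs: dict) -> dict:
    return {"attributes": dict(attrs)}

def random_attrs(rng):
    key = "gen_ai." + "".join(rng.choices(string.ascii_lowercase, k=8))
    return {key: rng.randint(0, 10**6)}

# Property 1: reading back an ingested span yields equivalent gen_ai.* values.
rng = random.Random(0)
for _ in range(100):
    attrs = random_attrs(rng)
    assert index_span(attrs)["attributes"] == attrs
```

The real suite would express the same property with Hypothesis's @given over generated OTLP spans, per the configuration above.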
Test Independence
All tests must be decoupled. Each
it or test block runs independently and concurrently. Tests must never depend on the action or outcome of previous or subsequent tests. No shared mutable state between tests.
References