RFC : Agentic AI Eval Platform : High level Design

## Overview

This design describes an LLM evaluation platform built natively on the OpenSearch ecosystem. The platform uses OpenSearch indices as the sole data store, OTel Collector for OTLP ingestion and span processing, OpenSearch Job Scheduler for async processing, and OpenSearch Dashboards plugins for the UI.

The data model is grounded in the [OpenTelemetry GenAI Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/) (`gen_ai.*` attribute namespace). OTLP spans arrive with standard `gen_ai.*` attributes and are indexed directly into OpenSearch without lossy transformation. This means any OTel-instrumented LLM application (Strands, OpenAI SDK, Bedrock SDK, etc.) can send telemetry to the platform with zero custom mapping.

The system supports three evaluation modes:

- **Online Agent Trace Evaluation**: Automatic post-ingestion scoring of live production traces (reference-free only). Source: `EVAL_ONLINE`.
- **Offline Agent Trace Evaluation**: Platform-orchestrated batch evaluation against curated Eval_Sets with Ground_Truth. Source: `EVAL_OFFLINE`.
- **Local Evaluation**: Scores computed client-side by the user's SDK (Strands, DeepEval, Ragas) and submitted via the Scores API. The platform is a passive receiver. Source: `SDK`.

Key architectural decisions:

1. **OpenSearch as the sole data store** -- all entities (spans, scores, eval sets, experiments, jobs) are stored in dedicated OpenSearch indices. No relational database.
2. **OTel Collector for ingestion and processing** -- OTLP telemetry flows through OTel Collector pipelines into OpenSearch. The `gen_ai.*` span attributes are indexed as-is. OTel Collector handles trace-group metric aggregation and Prometheus metric emission. The existing APM index template's dynamic field mapping (`"dynamic": "true"` on the `attributes` field) automatically indexes any new `gen_ai.*` span attributes without requiring schema changes.
3. **GenAI semantic conventions as the canonical schema** -- the platform does not define a custom trace/observation schema. It uses the OTel `gen_ai.*` attributes directly, extended with platform-specific attributes under the `eval.*` namespace for evaluation-only fields.
4. **Job Scheduler for async work** -- native OpenSearch Job Scheduler plugin for LLM-as-a-Judge, deterministic evaluators, and RAG metrics. Only involved in Online and Offline modes.
5. **OSD Plugin for UI** -- a single OpenSearch Dashboards plugin using OUI components provides all evaluation UI views.
6. **SDK-driven experiment execution** -- Python and TypeScript instrumentation libraries orchestrate offline experiments.
7. **Passive receiver for Local evaluation** -- third-party SDKs compute scores client-side and submit them via the Scores API.
8. **OSS library delegation for scoring via Python Agent Service** -- Online and Offline agent trace evaluators delegate scoring to the [Python Agent Service](https://github.com/opensearch-project/OpenSearch/issues/20602), which hosts a Strands-based eval agent that invokes Strands Eval, DeepEval, and Ragas for actual scoring logic. The eval-scheduler-plugin communicates with the Python Agent Service over its internal API; the Python Agent Service owns the LLM provider connection.
9. **No artificial scoping** -- spans are global documents. Multi-tenancy is handled at the OpenSearch index level via the security plugin.
10. **Dual-write metrics architecture** -- OTel Collector enriches spans with pre-aggregated trace-group fields (`traceGroupFields.genAi.*`) for fast OpenSearch queries (PPL) while simultaneously emitting derived metrics to Prometheus for time-series analysis (PromQL). OpenSearch handles trace detail and search; Prometheus handles metric aggregation and alerting.

### GenAI OTel Conventions Alignment

The platform's data model maps directly to the OTel GenAI semantic conventions (status: Development). The key span types and their `gen_ai.operation.name` values:

| OTel Operation     | `gen_ai.operation.name` | Platform Concept       | Description                |
| ------------------ | ----------------------- | ---------------------- | -------------------------- |
| Chat completion    | `chat`                  | Generation observation | LLM inference call         |
| Text completion    | `text_completion`       | Generation observation | Legacy completion call     |
| Embeddings         | `embeddings`            | Embedding observation  | Vector embedding call      |
| Invoke agent       | `invoke_agent`          | Trace (root span)      | Top-level agent invocation |
| Create agent       | `create_agent`          | Agent setup span       | Agent initialization       |
| Execute tool       | `execute_tool`          | Tool call observation  | Tool/function execution    |
| Content generation | `generate_content`      | Generation observation | Multimodal generation      |

The platform extends the standard `gen_ai.*` namespace with `eval.*` attributes for evaluation-specific metadata that has no OTel equivalent:

| Custom Attribute          | Type    | Description                                             |
| ------------------------- | ------- | ------------------------------------------------------- |
| `eval.score.name`         | keyword | Score metric name                                       |
| `eval.score.value`        | float   | Numeric score value                                     |
| `eval.score.source`       | keyword | One of: EVAL_ONLINE, EVAL_OFFLINE, SDK, ANNOTATION, API |
| `eval.experiment.run_id`  | keyword | Links span to an experiment run                         |
| `eval.experiment.set_id`  | keyword | Links span to an eval set                               |
| `eval.experiment.item_id` | keyword | Links span to a specific test case                      |

### Scoping Model

- **Spans**: Global documents. No artificial project scoping field. Queried via filters (time range, tags, `gen_ai.agent.name`, `gen_ai.request.model`, etc.).
- **Sessions**: Correlated via `gen_ai.conversation.id` (the OTel convention for session/thread tracking).
- **Eval sets**: Independent named collections. Can be used by multiple experiment runs.
- **Experiment runs**: Reference eval sets via `evalSetId`. Produce run items linking test cases to traces.
- **Multi-tenancy**: OpenSearch security plugin (index-level permissions, roles). The eval platform itself is tenant-unaware.
- **Local evaluation scores**: Arrive via Scores API with `source: SDK`, referencing `traceId` and carrying evaluator metadata.

### Agent Root Span Identification

Agent root spans in raw trace data are identified by:

1. No parent -- `parentSpanId = ""`
2. Agent operation -- `gen_ai.operation.name` exists (e.g., `invoke_agent`)

```sql
source = otel-v1-apm-span-*
| where isnull(parentSpanId) and isnotnull(`attributes.gen_ai.agent.name`)
```

## Architecture

```mermaid
graph TB
    subgraph "Client Layer"
        PY[Python Instrumentation Library]
        TS[TypeScript Instrumentation Library]
        APP[User LLM Application]
        SDK3P[Third-Party SDKs - Strands / DeepEval / Ragas]
    end

    subgraph "Ingestion Layer - OTel Collector"
        OC[OTel Collector]
        OC_OTLP[OTLP gRPC/HTTP Receiver]
        OC_TRACE[Trace Processor - Trace-Group Aggregation]
        OC_SINK_OS[OpenSearch Exporter]
        OC_SINK_PROM[Prometheus Remote Write Exporter]
        OC_OTLP --> OC_TRACE
        OC_TRACE --> OC_SINK_OS
        OC_TRACE --> OC_SINK_PROM
    end

    subgraph "OpenSearch Cluster"
        subgraph "Data Indices"
            IDX_SPANS[otel-v1-apm-span - gen_ai attributes + traceGroupFields.genAi]
            IDX_SCORES[eval_scores]
        end
        subgraph "Config Indices"
            IDX_SC[eval_score_configs]
            IDX_ET[eval_evaluator_templates]
            IDX_DE[eval_deterministic_evaluators]
            IDX_AQ[eval_annotation_queues]
        end
        subgraph "Eval Indices"
            IDX_ES[eval_sets]
            IDX_EX[eval_experiments]
            IDX_ER[eval_experiment_runs]
            IDX_ERI[eval_experiment_run_items]
            IDX_AT[eval_annotation_tasks]
        end
        subgraph "Operational Indices"
            IDX_JM[eval_job_metrics]
        end
        JS[eval-scheduler-plugin]
    end

    subgraph "Python Agent Service"
        PAS[Strands Orchestrator]
        EVAL_AGENT[Eval Agent - Strands]
        PAS --> EVAL_AGENT
    end

    subgraph "Metrics Layer"
        PROM[Prometheus]
    end

    subgraph "LLM Providers"
        LLM[Bedrock / OpenAI / Anthropic]
    end

    subgraph "OpenSearch Dashboards"
        OSD[OSD Eval Plugin]
        subgraph "P0 Views"
            V1[Agent Trace List View]
            V1M[Trace List Metrics Summary]
            V9[Agent Trace Timeline / Waterfall View]
            V3D[Agent Span Detail View]
        end
        subgraph "P1 Views"
            V10[Agent Call Graph View]
        end
        subgraph "Eval Views"
            V2[Sessions]
            V3[Eval Sets & Experiments]
            V4[Experiment Runs]
            V5[Annotation Queues]
            V6[Scores & Analytics]
            V7[Evaluators]
            V8[Dashboards]
            V11[Agent Map / Agent Path]
        end
    end

    APP --> PY & TS
    APP --> SDK3P
    PY & TS -->|OTLP spans with gen_ai.* attrs| OC_OTLP
    PY & TS -->|REST API| IDX_ES & IDX_EX & IDX_ER & IDX_ERI & IDX_SCORES
    SDK3P -->|OTLP spans| OC_OTLP
    SDK3P -->|Scores API - source: SDK| IDX_SCORES
    OC_SINK_OS -->|index| IDX_SPANS
    OC_SINK_PROM -->|remote write| PROM
    IDX_SPANS -->|polling sweep| JS
    JS -->|eval request| PAS
    EVAL_AGENT -->|eval library call| LLM
    JS -->|eval scores| IDX_SCORES
    JS -->|job metrics| IDX_JM
    OSD --> V1 & V1M & V9 & V3D & V10
    OSD --> V2 & V3 & V4 & V5 & V6 & V7 & V8 & V11
    OSD -->|PPL queries| IDX_SPANS & IDX_SCORES & IDX_ES & IDX_ER
    OSD -->|PromQL queries| PROM
```

### Component Interaction Flow

```mermaid
sequenceDiagram
    participant App as User Application
    participant SDK as Instrumentation Library
    participant OC as OTel Collector
    participant OS as OpenSearch
    participant PROM as Prometheus
    participant JS as eval-scheduler-plugin
    participant PAS as Python Agent Service
    participant LLM as LLM Provider
    participant SDK3P as Third-Party SDK

    Note over App,LLM: Online Agent Trace Evaluation Flow
    App->>SDK: Instrumented function call
    SDK->>OC: OTLP spans (gen_ai.* attributes)
    OC->>OC: Aggregate traceGroupFields.genAi.*
    OC->>OS: Index enriched spans
    OC->>PROM: Emit derived metrics
    JS->>OS: Poll for new spans matching trigger filters
    JS->>JS: Create PENDING eval jobs
    JS->>OS: Read span data
    JS->>PAS: Eval request (evaluator config + span data)
    PAS->>LLM: Eval agent invokes library (Strands/DeepEval/Ragas)
    LLM-->>PAS: Score response
    PAS-->>JS: Structured score result
    JS->>OS: Write eval_score (source: EVAL_ONLINE)

    Note over App,LLM: Offline Agent Trace Evaluation Flow
    SDK->>OS: Fetch eval set items
    loop For each experiment item
        SDK->>App: Call user function(input)
        App-->>SDK: output
        SDK->>OC: OTLP spans (eval.experiment.* tags)
        SDK->>OS: Write experiment_run_item
    end
    SDK->>OS: Write run-level scores (source: EVAL_OFFLINE)
    Note over JS,PAS: Server-side evaluation (same as online, with ground truth)
    JS->>OS: Poll for new spans tagged with eval.experiment.run_id
    JS->>OS: Read span data + expectedOutput from eval_experiments
    JS->>PAS: Eval request (evaluator config + span data + expectedOutput)
    PAS->>LLM: Eval agent invokes library (Strands/DeepEval/Ragas)
    LLM-->>PAS: Score response
    PAS-->>JS: Structured score result
    JS->>OS: Write eval_score (source: EVAL_OFFLINE)

    Note over App,SDK3P: Local Evaluation Flow
    App->>SDK3P: Run evaluation (Strands/DeepEval/Ragas)
    SDK3P->>LLM: LLM call (if metric requires it)
    LLM-->>SDK3P: Score response
    SDK3P->>OC: OTLP spans (trace telemetry)
    SDK3P->>OS: POST /api/scores (source: SDK)
```

## Evaluation Algorithm Dependencies

The platform does not implement evaluation algorithms from scratch. For Online and Offline agent trace evaluation, the eval-scheduler-plugin delegates to the [Python Agent Service](https://github.com/opensearch-project/OpenSearch/issues/20602), which hosts a Strands-based eval agent invoking OSS libraries: Strands Eval (agent trajectory, tool-use, multi-step reasoning), DeepEval (GEval, hallucination, relevancy, faithfulness), and Ragas (context precision/recall, answer faithfulness/relevancy).

The eval-scheduler-plugin sends requests with the Evaluator_Template config (library, metric, model, target span data). The Python Agent Service constructs the library call, manages the LLM connection, and returns structured scores. Evaluator_Templates are thin wrappers — each specifies library, metric, provider, and parameters. LLM provider config is pluggable at the template level via Strands SDK's model abstraction.

## Components and Interfaces

### 1. OTel Collector Pipeline

**Responsibility**: Receives OTLP telemetry, aggregates trace-group metrics, emits derived metrics to Prometheus, and indexes spans into OpenSearch.

**Validation**: Malformed OTLP payloads are rejected at the receiver level. The processor validates required span fields (`traceId`, `spanId`, timestamps) and drops documents missing them, logging errors to a dead-letter index.

**Interface**:

- Input: OTLP gRPC (port 4317) and HTTP (port 4318)
- Output: OpenSearch bulk index API, Prometheus remote write

**Pipeline Stages**:

1. **OTLP Receiver**: Accepts gRPC and HTTP OTLP payloads
2. **Trace Processor (Trace-Group Aggregation)**: Buffers spans by `traceId`, computes trace-level aggregates (`traceGroupFields.genAi.*`), and writes them back to every span in the trace. Extends the existing `traceGroupFields` pattern used for standard APM metrics (duration, status) with GenAI-specific aggregations.
3. **OpenSearch Exporter**: Indexes enriched spans into `otel-v1-apm-span-*` indices
4. **Prometheus Remote Write Exporter**: Emits derived time-series metrics (`gen_ai.client.token.usage`, `gen_ai.client.operation.duration`) to Prometheus

**Trace-Group Fields (GenAI)**:

Pre-aggregated fields computed at ingest time and written to each span document within a trace. These denormalized fields enable the agent trace list view to display aggregate statistics without expensive query-time aggregations.

| Field                                       | Type    | Calculation                                                        |
| ------------------------------------------- | ------- | ------------------------------------------------------------------ |
| `traceGroupFields.genAi.totalTokens`        | Long    | Sum of input + output tokens across all spans                      |
| `traceGroupFields.genAi.inputTokens`        | Long    | Sum of `gen_ai.usage.input_tokens`                                 |
| `traceGroupFields.genAi.outputTokens`       | Long    | Sum of `gen_ai.usage.output_tokens`                                |
| `traceGroupFields.genAi.llmCallCount`       | Integer | Count of spans where `gen_ai.operation.name` = `chat`              |
| `traceGroupFields.genAi.toolCallCount`      | Integer | Count of spans where `gen_ai.operation.name` = `execute_tool`      |
| `traceGroupFields.genAi.errorCount`         | Integer | Count where `status.code` = 2                                      |

> **Note**: Token cost estimation (`traceGroupFields.genAi.estimatedCost`) is deferred from P0 due to pricing table maintenance complexity.

**Aggregate Metrics (Prometheus)**:

OTel Collector derives and emits three core metrics to Prometheus from span attributes:

| Metric                              | Type      | Description                    |
| ----------------------------------- | --------- | ------------------------------ |
| `gen_ai.client.token.usage`         | Counter   | Token consumption by type      |
| `gen_ai.client.operation.duration`  | Histogram | LLM call latency distribution  |

Metric dimensions: `gen_ai.operation.name`, `gen_ai.system`, `gen_ai.request.model` (normalized to model family), `gen_ai.response.model` (normalized), `service.name`, `gen_ai.token.type` (input/output).

**Cardinality Management**: High-cardinality fields (`traceId`, `spanId`, `gen_ai.conversation.id`) excluded from metric dimensions. Model IDs normalized to family names (e.g., `anthropic.claude-sonnet-4-5-20250929-v1:0` → `claude-sonnet-4-5`). Estimate: ~9,000 series per customer.

**Deduplication**: If client-side instrumentation already emits `gen_ai.client.token.usage`, OTel Collector adds `source=span_derived` to distinguish its derived metrics.

**Large Content Fields**: `gen_ai.input.messages` and `gen_ai.output.messages` are indexed but not analyzed (`"index": false`). Full-text search on content fields is opt-in.

**Long-Running Traces**: Configuration should support increased flush intervals or root-span-triggered flushing for 60+ minute agent conversations.

### 2. Eval Platform REST API

**Responsibility**: CRUD operations for eval sets, experiments, scores, and evaluator configs. Exposed as server-side routes within the OSD Plugin.

**Endpoints** (key routes):

| Method | Path                                  | Description                    | Req       |
| ------ | ------------------------------------- | ------------------------------ | --------- |
| POST   | /api/eval-sets                        | Create eval set                | 4.1       |
| GET    | /api/eval-sets                        | List eval sets                 | 4.7       |
| POST   | /api/eval-sets/{id}/experiments       | Add experiment to eval set     | 4.2       |
| PUT    | /api/eval-sets/{id}/experiments/{eid} | Update experiment (versioned)  | 4.6       |
| POST   | /api/experiment-runs                  | Create experiment run          | 5.1       |
| POST   | /api/experiment-runs/{id}/items       | Create run item                | 5.2       |
| POST   | /api/scores                           | Submit score                   | 3.2, 10.1 |
| POST   | /api/score-configs                    | Create score config            | 3.1       |
| POST   | /api/evaluator-templates              | Create evaluator template      | 8.1       |
| POST   | /api/deterministic-evaluators         | Create deterministic evaluator | 21.2      |
| POST   | /api/annotation-queues                | Create annotation queue        | 9.1       |

**Authentication**: API calls authenticated via API keys or OSD session tokens. Multi-tenancy via OpenSearch security plugin.

**Span Queries via PPL**

Span browsing, searching, and detail retrieval use OpenSearch PPL (Piped Processing Language) queries against the `_plugins/_ppl` endpoint. The OSD Plugin constructs PPL strings from UI filter state.

Example queries:

```sql
-- List agent root spans with pre-computed aggregates (Req 15.1)
source = otel-v1-apm-span-*
| where parentSpanId = '' AND isnotnull(`attributes.gen_ai.operation.name`)
| fields traceId, name, durationInNanos, status.code,
         traceGroupFields.genAi.totalTokens, traceGroupFields.genAi.llmCallCount
| sort - startTime | head 100

-- Get all child spans for a trace (Req 15.3)
source = otel-v1-apm-span-*
| where traceId = 'abc123' | sort startTime

-- Aggregate latency by model
source = otel-v1-apm-span-*
| where `attributes.gen_ai.operation.name` = 'chat'
| stats avg(durationInNanos) as avg_latency, count() as call_count by `attributes.gen_ai.request.model`
```

**PromQL** (Prometheus dashboards): `sum(rate(gen_ai_client_token_usage[5m])) by (gen_ai_system)`, `histogram_quantile(0.99, sum(rate(gen_ai_client_operation_duration_bucket[5m])) by (le, gen_ai_response_model))`

### 3. eval-scheduler-plugin

**Responsibility**: The `eval-scheduler-plugin` is a lightweight OpenSearch plugin that owns all async evaluation work — detecting new spans for online agent trace evaluation, executing LLM-as-a-Judge scoring, running deterministic evaluators, computing RAG metrics, and managing job lifecycle. It uses the OpenSearch [Job Scheduler](https://github.com/opensearch-project/job-scheduler) SPI as its scheduling infrastructure.

**Why a custom plugin**: Job Scheduler is an SPI framework — it provides scheduling infrastructure (interval/cron triggers, distributed locking, job persistence) but requires a consumer plugin to define job types and execution logic. The plugin implements a polling sweeper to bridge span ingestion and evaluation execution.

**Job Scheduler SPI Integration**:

The plugin implements three Job Scheduler SPI interfaces:

| Interface                | Implementation                  | Purpose                                                                 |
| ------------------------ | ------------------------------- | ----------------------------------------------------------------------- |
| `JobSchedulerExtension`  | `EvalSchedulerExtension`        | Registers the plugin with Job Scheduler, declares job index (`eval_job_metrics`) and runners |
| `ScheduledJobParameter`  | `EvalJobParameter`              | Defines the job document schema stored in `eval_job_metrics`            |
| `ScheduledJobRunner`     | `EvalTriggerSweeper`, `EvalJobExecutor` | Contains the execution logic invoked when a scheduled job fires  |

The plugin uses Job Scheduler's `IntervalSchedule` trigger type to register two recurring scheduled jobs:

1. **Trigger Sweeper** (`EvalTriggerSweeper`) — runs every 5–10s, polls for new spans matching online agent trace evaluation triggers and offline experiment traces (via `eval.experiment.run_id` tags), creates PENDING job documents
2. **Job Executor** (`EvalJobExecutor`) — runs every 2–5s, picks up PENDING job documents ordered by priority, executes evaluations, writes scores

Job Scheduler's `CronSchedule` is not used — the polling pattern requires sub-minute granularity that interval scheduling provides.

**Job Types**:

| Job Type                              | Priority | Trigger                   | Concurrency                  |
| ------------------------------------- | -------- | ------------------------- | ---------------------------- |
| `online_agent_trace_eval`             | HIGH     | New span matching filter  | Single-item, <60s SLA        |
| `offline_agent_trace_eval_item`       | NORMAL   | Experiment run batch      | Per-item, configurable limit |
| `offline_agent_trace_eval_run`        | NORMAL   | Experiment run completion | Per-run                      |
| `annotation_lock_release`             | LOW      | Timer                     | Periodic sweep               |

**Job Document Schema** (stored in `eval_job_metrics`):

```json
{
  "jobId": "keyword",
  "jobType": "keyword (online_agent_trace_eval | offline_agent_trace_eval_item | offline_agent_trace_eval_run)",
  "status": "keyword (PENDING | RUNNING | COMPLETED | FAILED)",
  "priority": "keyword (HIGH | NORMAL | LOW)",
  "evaluatorId": "keyword",
  "evaluatorType": "keyword (LLM_JUDGE | DETERMINISTIC | RAG)",
  "targetSpanId": "keyword",
  "targetType": "keyword (SPAN | EXPERIMENT_RUN_ITEM)",
  "experimentItemId": "keyword",
  "expectedOutput": "object",
  "retryCount": "integer",
  "maxRetries": "integer",
  "error": "text",
  "createdAt": "date",
  "startedAt": "date",
  "completedAt": "date"
}
```

**Trigger Sweeper (`EvalTriggerSweeper`)**:

The sweeper is the plugin's span detection mechanism for both online and offline agent trace evaluation. It maintains a `lastSweepTime` watermark per trigger configuration and queries `otel-v1-apm-span-*` for spans indexed since the last sweep.

For online agent trace evaluation, it matches root spans against trigger filter criteria (e.g., `gen_ai.agent.name`, `gen_ai.operation.name`, tags). For offline agent trace evaluation, it detects spans tagged with `eval.experiment.run_id` and joins them with the corresponding `eval_experiments` documents to retrieve ground truth (`expectedOutput`). In both cases, it creates PENDING job documents in `eval_job_metrics`.

**Sweeper pseudocode**:

```
// Online: poll for new root spans matching trigger filters
for each onlineTrigger:
    query otel-v1-apm-span-* WHERE startTime >= lastSweepTime
        AND parentSpanId = "" AND matchesTriggerFilter(trigger)
    for each hit (deduplicate by targetSpanId + evaluatorId):
        createPendingJob(spanId, evaluatorId, type=HIGH)
    updateLastSweepTime(trigger)

// Offline: poll for new spans tagged with eval.experiment.run_id
for each pendingExperimentRun:
    query otel-v1-apm-span-* WHERE eval.experiment.run_id = runId
        AND parentSpanId = ""
    for each hit:
        groundTruth = fetchExperiment(eval.experiment.item_id)  // join for expectedOutput
        for each evaluator (deduplicate by targetSpanId + evaluatorId):
            createPendingJob(spanId, evaluatorId, type=NORMAL, groundTruth)
    updateLastSweepTime(run)
```

Deduplication: before creating a job, the sweeper checks `eval_job_metrics` for an existing `targetSpanId` + `evaluatorId` combination to prevent duplicate evaluations.

**Job Executor (`EvalJobExecutor`)**:

The executor picks up PENDING jobs, acquires a distributed lock via `LockService`, reads span data, and delegates to the Python Agent Service.

```
1. Query eval_job_metrics WHERE status=PENDING ORDER BY priority DESC, createdAt ASC
2. For each job:
   a. Acquire distributed lock via LockService (skip if another node holds it)
   b. Mark IN_PROGRESS
   c. Read target span data from OpenSearch
   d. Load EvaluatorTemplate config
   e. Send eval request to Python Agent Service (evaluatorConfig + spanData)
   f. Write score to eval_scores, mark COMPLETED
   g. On failure: retry with exponential backoff (2^retryCount * 1000ms)
      or mark FAILED if maxRetries exceeded
   h. Release lock in finally block (TTL fallback: 5min if node crashes)
```

**Distributed Locking**: Uses Job Scheduler's `LockService`. Lock keyed by `jobId`, configurable TTL (default 5min). If a node crashes, lock expires and job returns to PENDING. Satisfies Req 18.5 (horizontal scaling without duplicate execution).

**Priority Queue**: Implemented at query level — PENDING jobs sorted by `priority DESC, createdAt ASC`. Priority values: HIGH=3, NORMAL=2, LOW=1. Online agent trace eval jobs (HIGH) are always picked up before offline batch jobs (NORMAL). Optional concurrency limits per priority level prevent batch starvation.

**Batch Job Creation (Offline)**: When an Experiment_Run completes trace capture, the trigger sweeper detects new spans via `eval.experiment.run_id` tags, joins with `eval_experiments` for ground truth, and creates PENDING jobs with `priority: NORMAL`.

**Plugin Configuration** (`opensearch.yml`):

```yaml
eval.scheduler.trigger_sweep_interval: "5s"     # How often sweeper polls for new matching spans
eval.scheduler.job_executor_interval: "2s"       # How often executor picks up PENDING jobs
eval.scheduler.executor_batch_size: 10           # Max jobs per executor cycle
eval.scheduler.lock_ttl_minutes: 5               # Job Scheduler LockService TTL
eval.scheduler.max_retries: 3                    # Default max retries per job
eval.scheduler.online_concurrency_limit: 20      # Max concurrent online agent trace eval jobs
eval.scheduler.offline_concurrency_limit: 50     # Max concurrent offline agent trace eval jobs
eval.scheduler.agent_service_endpoint: "http://localhost:8080"  # Python Agent Service URL
eval.scheduler.agent_service_timeout_ms: 45000   # Timeout for eval requests to agent service
```

**Latency Budget (<60s SLA)**:

| Phase                    | Budget   | Notes                                          |
| ------------------------ | -------- | ---------------------------------------------- |
| Span indexing + refresh  | ~5s      | OpenSearch refresh_interval                     |
| Trigger sweep detection  | 0–10s    | Depends on sweep interval (configurable)        |
| Job pickup by executor   | 0–5s     | Depends on executor interval                    |
| Lock acquisition         | <100ms   | Job Scheduler LockService, local cluster op     |
| Span data read           | <500ms   | Single document fetch                           |
| Python Agent Service call | 5–30s   | Network hop + eval library + LLM provider latency |
| Score write              | <500ms   | Single document index                           |
| **Total**                | **~15–45s** | Well within 60s SLA for typical cases        |

### 4. Python Agent Service (Eval Agent)

**Responsibility**: Hosts the Strands-based eval agent that executes LLM-as-a-Judge scoring, RAG metric computation, and any evaluation logic requiring LLM provider access. The eval-scheduler-plugin delegates all LLM-dependent evaluation work to this service.

**Context**: The [Python Agent Service](https://github.com/opensearch-project/OpenSearch/issues/20602) is a broader OpenSearch initiative that provides a unified Python backend for AI-powered assistants. It uses Strands SDK as the orchestration framework and follows a multi-agent pattern with a top-level orchestrator routing requests to specialized sub-agents. The eval platform registers an Eval Agent as a specialized sub-agent within this service.

**Why delegate**: Eval libraries (Strands Eval, DeepEval, Ragas) and LLM provider SDKs are Python-native. The Java plugin stays focused on scheduling/locking. New eval methods ship as Python library updates without touching the Java plugin. The Python Agent Service already manages LLM credentials, connection pooling, and OTel observability.

**Eval Agent Architecture**:

The Eval Agent is a specialized sub-agent in the Python Agent Service's agent registry. It receives requests from the eval-scheduler-plugin over an internal API (not user-facing). The agent uses Strands SDK's `@tool` decorator to expose a `run_evaluation` tool that resolves the evaluator (library + metric + model config), executes the evaluation, and returns structured scores with explanation and an `executionTraceId` linking back to the OTel trace of the eval LLM call.

**Internal API** (eval-scheduler-plugin → Python Agent Service):

| Method | Path                          | Description                    |
| ------ | ----------------------------- | ------------------------------ |
| POST   | /api/eval-agent/evaluate      | Execute a single evaluation    |
| POST   | /api/eval-agent/evaluate/batch | Execute batch evaluations      |
| GET    | /api/eval-agent/health        | Health check                   |

**Request/Response**: The evaluate request carries `evaluatorConfig` (library, metric, modelConfig, outputSchema) and `spanData` (input, output, context, expectedOutput). For offline agent trace evaluation, `expectedOutput` is populated with ground truth; for online, it's `null`. The response returns `scores` (array of name/value/dataType), `explanation`, and `executionTraceId`.

**Deterministic evaluators**: Deterministic evaluators (regex, JSON validity, exact match) run directly in the eval-scheduler-plugin. Only LLM Judge and RAG evaluations go to the Python Agent Service.

**Observability**: All eval agent operations are OTel-instrumented. The `executionTraceId` links scores to execution traces for debugging.

**Deployment**: Runs as a sidecar or co-located service. Stateless — all config and state in OpenSearch indices. Horizontally scalable independently.

### 5. OSD Eval Plugin

**Responsibility**: Complete evaluation UI as an OpenSearch Dashboards plugin.

**Plugin Registration**:

```typescript
export class EvalPlugin implements Plugin {
  setup(core: CoreSetup) {
    core.application.register({
      id: "eval-platform",
      title: "LLM Evaluation",
      navLinkStatus: AppNavLinkStatus.visible,
      mount: async (params: AppMountParameters) => {
        const { renderApp } = await import("./application");
        return renderApp(params);
      },
    });
  }
}
```

**Navigation Structure**:

P0 Agent Tracing Views:
- Agent Trace List View (root spans with aggregated metrics from `traceGroupFields.genAi.*`)
- Trace List Metrics Summary (total traces, total spans, total tokens, latency P50/P99)
- Agent Trace Timeline / Waterfall View (hierarchical span visualization with inline metadata)
- Agent Span Detail View (operation-specific attributes: chat, tool_call, invoke_agent)

P1 Agent Tracing Views:
- Agent Call Graph View (topology visualization of agent/LLM/tool relationships within a trace)

Eval Platform Views:
- Sessions (list, detail -- correlated via `gen_ai.conversation.id`)
- Eval Sets (CRUD, experiments management)
- Experiment Runs (list, detail, comparison)
- Annotation Queues (manage, review interface)
- Scores & Analytics (distributions, agreement, trends)
- Evaluators (LLM Judge templates, deterministic, RAG metrics)
- Dashboards (custom charts, default dashboards -- PPL for trace views, PromQL for metric dashboards)
- Agent Map / Agent Path (multi-trace analytics)

**Agent Trace List View Columns**:

Each column maps to specific OTel GenAI semantic convention attributes:

| Column       | Source                                                                                                     | Notes                                                    |
| ------------ | ---------------------------------------------------------------------------------------------------------- | -------------------------------------------------------- |
| Status       | `status.code`, `error.type`                                                                                | ✓ green (OK), ✗ red (ERROR)                              |
| Type / Kind  | `gen_ai.operation.name`                                                                                    | Friendly names: Chat, Agent, Tool, Embeddings, etc.      |
| Name         | `gen_ai.agent.name` (agent), `gen_ai.request.model` (LLM), `gen_ai.tool.name` (tool)                      | Varies by span type                                      |
| Token Usage  | `traceGroupFields.genAi.totalTokens`                                                                      | Pre-aggregated at ingest, not computed at query time      |
| Latency      | `durationInNanos`                                                                                          | Root span duration                                       |
| Input/Output | `gen_ai.input.messages`, `gen_ai.output.messages`                                                          | Opt-in, may contain sensitive data                       |

**Waterfall View**: Inline with span bar: span name, operation type icon, latency, status. Conditional: token count (chat), model name (chat), tool name (tool_call), agent name (invoke_agent).

**Span Detail View** (conditional by `gen_ai.operation.name`): `chat` → Messages, Model, Tokens, Temperature, Finish Reason. `execute_tool` → Tool Name, Arguments, Result. `invoke_agent` → Agent Name/ID. Common → Trace/Span IDs, Service, Timestamps, Duration, Status.

**Query Architecture**: Trace list views use pre-computed `traceGroupFields.genAi.*` — no aggregation at query time. Metric dashboards query Prometheus.

### 6. Python Instrumentation Library

**Responsibility**: Instruments Python LLM applications using OTel GenAI conventions, provides eval set/experiment/score APIs.

**Core API**:

```python
# Tracing -- emits spans with gen_ai.* attributes
@observe(name="my_function")
def my_llm_call(input: str) -> str: ...

# Experiment execution
results = client.run_experiment(
    eval_set_id="es-123",
    fn=my_llm_call,
    evaluators=[accuracy_evaluator],
    run_name="v2-prompt-test"
)

# Score submission
client.score(
    trace_id="tr-123",
    name="user_feedback",
    value=1.0,
    data_type="NUMERIC"
)
```

**OTLP Export**: Uses OpenTelemetry Python SDK. Spans carry standard `gen_ai.*` attributes: `gen_ai.operation.name`, `gen_ai.request.model`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.input.messages`, `gen_ai.output.messages`, etc.

### 7. TypeScript Instrumentation Library

**Responsibility**: Same as Python library but for TypeScript/Node.js.

```typescript
const result = await observe({ name: "my_function" }, async (span) => {
  return await myLlmCall(input);
});

const results = await client.runExperiment({
  evalSetId: "es-123",
  fn: myLlmCall,
  evaluators: [accuracyEvaluator],
  runName: "v2-prompt-test",
});

await client.score({
  traceId: "tr-123",
  name: "user_feedback",
  value: 1.0,
  dataType: "NUMERIC",
});
```

## Data Models

All data is stored in OpenSearch indices. The spans index uses the OTel GenAI semantic conventions directly. Evaluation-specific entities use dedicated indices under the `eval_*` prefix.

### Spans Index (`otel-v1-apm-span-*`)

This index stores OTLP spans as-is, preserving all `gen_ai.*` attributes from the OTel semantic conventions. OTel Collector indexes spans without lossy transformation. The existing APM index template's dynamic field mapping automatically indexes `gen_ai.*` attributes without schema changes. Trace-group fields are pre-aggregated by OTel Collector at ingest time.

```json
{
  "mappings": {
    "dynamic": "true",
    "properties": {
      "traceId": { "type": "keyword" },
      "spanId": { "type": "keyword" },
      "parentSpanId": { "type": "keyword" },
      "traceGroup": { "type": "keyword" },
      "name": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "kind": { "type": "keyword" },
      "startTime": { "type": "date_nanos" },
      "endTime": { "type": "date_nanos" },
      "durationInNanos": { "type": "long" },
      "status": {
        "properties": {
          "code": { "type": "keyword" },
          "message": { "type": "text" }
        }
      },
      "traceGroupFields": {
        "properties": {
          "endTime": { "type": "date_nanos" },
          "durationInNanos": { "type": "long" },
          "statusCode": { "type": "integer" },
          "genAi": {
            "properties": {
              "totalTokens": { "type": "long" },
              "inputTokens": { "type": "long" },
              "outputTokens": { "type": "long" },
              "llmCallCount": { "type": "integer" },
              "toolCallCount": { "type": "integer" },
              "errorCount": { "type": "integer" }
            }
          }
        }
      },
      "attributes": {
        "dynamic": "true",
        "properties": {
          "gen_ai.operation.name": { "type": "keyword" },
          "gen_ai.provider.name": { "type": "keyword" },
          "gen_ai.request.model": { "type": "keyword" },
          "gen_ai.response.model": { "type": "keyword" },
          "gen_ai.request.temperature": { "type": "float" },
          "gen_ai.request.max_tokens": { "type": "integer" },
          "gen_ai.request.top_p": { "type": "float" },
          "gen_ai.usage.input_tokens": { "type": "long" },
          "gen_ai.usage.output_tokens": { "type": "long" },
          "gen_ai.response.id": { "type": "keyword" },
          "gen_ai.response.finish_reasons": { "type": "keyword" },
          "gen_ai.conversation.id": { "type": "keyword" },
          "gen_ai.agent.id": { "type": "keyword" },
          "gen_ai.agent.name": { "type": "keyword" },
          "gen_ai.agent.description": { "type": "text" },
          "gen_ai.tool.name": { "type": "keyword" },
          "gen_ai.tool.type": { "type": "keyword" },
          "gen_ai.tool.call.id": { "type": "keyword" },
          "gen_ai.tool.description": { "type": "text" },
          "gen_ai.input.messages": { "type": "object", "enabled": true, "index": false },
          "gen_ai.output.messages": { "type": "object", "enabled": true, "index": false },
          "gen_ai.system_instructions": { "type": "object", "enabled": true, "index": false },
          "gen_ai.tool.definitions": { "type": "object", "enabled": false },
          "gen_ai.tool.call.arguments": { "type": "object", "enabled": true },
          "gen_ai.tool.call.result": { "type": "object", "enabled": true },
          "gen_ai.data_source.id": { "type": "keyword" },
          "gen_ai.output.type": { "type": "keyword" },
          "error.type": { "type": "keyword" },
          "server.address": { "type": "keyword" },
          "server.port": { "type": "integer" }
        }
      },
      "resource": {
        "properties": {
          "service.name": { "type": "keyword" },
          "service.version": { "type": "keyword" },
          "deployment.environment": { "type": "keyword" },
          "telemetry.sdk.name": { "type": "keyword" },
          "telemetry.sdk.version": { "type": "keyword" }
        }
      },
      "eval.experiment.run_id": { "type": "keyword" },
      "eval.experiment.set_id": { "type": "keyword" },
      "eval.experiment.item_id": { "type": "keyword" },
      "latency": { "type": "float" },
      "tags": { "type": "keyword" },
      "bookmarked": { "type": "boolean" },
      "createdAt": { "type": "date" }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 5,
      "number_of_replicas": 1,
      "refresh_interval": "5s"
    }
  }
}
```

**Key design notes**:

**Key design notes**: `gen_ai.operation.name` distinguishes span types. `gen_ai.conversation.id` is the OTel standard for session correlation. `parentSpanId` provides parent-child hierarchy. `gen_ai.input/output.messages` indexed but not analyzed (`"index": false`) due to size. `traceGroupFields.genAi.*` pre-aggregated by OTel Collector at ingest. Dynamic mapping on `attributes` auto-indexes new `gen_ai.*` attributes. `eval.*` attributes link spans to experiments. `totalCost` deferred from P0.

### Scores Index (`eval_scores`)

Source values: `EVAL_ONLINE`, `EVAL_OFFLINE`, `SDK`, `ANNOTATION`, `API`. Idempotency: upsert by `idempotencyKey` via `_update` with `doc_as_upsert`. Settings: 3 shards, 1 replica, 5s refresh.

| Field | Type | Field | Type |
|-------|------|-------|------|
| id | keyword | source | keyword |
| name | keyword | traceId | keyword |
| value | float | spanId | keyword |
| stringValue | keyword | sessionId | keyword |
| dataType | keyword | experimentRunId | keyword |
| authorUserId | keyword | configId | keyword |
| comment | text | queueId | keyword |
| metadata | object | executionTraceId | keyword |
| environment | keyword | timestamp | date |
| createdAt | date | updatedAt | date |
| idempotencyKey | keyword | | |

### Score Configs Index (`eval_score_configs`)

| Field | Type | Field | Type |
|-------|------|-------|------|
| id | keyword | isArchived | boolean |
| name | keyword | minValue | float |
| dataType | keyword | maxValue | float |
| categories | nested (label: keyword, value: float) | description | text |
| createdAt | date | updatedAt | date |

### Eval Sets Index (`eval_sets`)

| Field | Type | Field | Type |
|-------|------|-------|------|
| id | keyword | metadata | object |
| name | text + keyword | inputSchema | object (not indexed) |
| description | text | expectedOutputSchema | object (not indexed) |
| createdAt | date | updatedAt | date |

### Experiments Index (`eval_experiments`)

Versioning: updates create a new document with a new `validFrom`. Latest active version: filter `status=ACTIVE`, sort by `validFrom` desc per `lineageId`.

| Field | Type | Field | Type |
|-------|------|-------|------|
| id | keyword | sourceTraceId | keyword |
| evalSetId | keyword | sourceSpanId | keyword |
| input | object | status | keyword |
| expectedOutput | object | lineageId | keyword |
| metadata | object | validFrom | date |
| createdAt | date | | |

### Experiment Runs Index (`eval_experiment_runs`)

| Field | Type | Field | Type |
|-------|------|-------|------|
| id | keyword | description | text |
| evalSetId | keyword | metadata | object |
| name | text + keyword | createdAt | date |

### Experiment Run Items Index (`eval_experiment_run_items`)

| Field | Type | Field | Type |
|-------|------|-------|------|
| id | keyword | traceId | keyword |
| experimentRunId | keyword | spanId | keyword |
| experimentId | keyword | error | text |
| createdAt | date | | |

### Evaluator Templates Index (`eval_evaluator_templates`)

`evalLibrary` (e.g., `deepeval`, `ragas`, `strands_eval`) and `evalMetric` (e.g., `faithfulness`, `geval`) identify which OSS library and metric to invoke. `promptTemplate` is optional (custom LLM Judge only).

| Field | Type | Field | Type |
|-------|------|-------|------|
| id | keyword | modelConfig.provider | keyword |
| name | text + keyword | modelConfig.modelName | keyword |
| evalLibrary | keyword | modelConfig.temperature | float |
| evalMetric | keyword | modelConfig.maxTokens | integer |
| promptTemplate | text | outputSchema | nested (scoreName, dataType, valueMapping) |
| targetType | keyword | evaluationMode | keyword |
| createdAt | date | updatedAt | date |

### Deterministic Evaluators Index (`eval_deterministic_evaluators`)

| Field | Type | Field | Type |
|-------|------|-------|------|
| id | keyword | evaluationMode | keyword |
| name | text + keyword | targetType | keyword |
| evaluatorType | keyword | scoreConfigId | keyword |
| configuration | object (not indexed) | createdAt | date |
| updatedAt | date | | |

### Annotation Queues Index (`eval_annotation_queues`)

| Field | Type | Field | Type |
|-------|------|-------|------|
| id | keyword | scoreConfigIds | keyword |
| name | text + keyword | assignedUserIds | keyword |
| description | text | createdAt | date |
| updatedAt | date | | |

### Annotation Tasks Index (`eval_annotation_tasks`)

Locking: optimistic concurrency via `_seq_no` and `_primary_term`. Lock release job sweeps for expired locks.

| Field | Type | Field | Type |
|-------|------|-------|------|
| id | keyword | lockedBy | keyword |
| queueId | keyword | lockedAt | date |
| targetId | keyword | lockTimeout | integer |
| targetType | keyword | completedBy | keyword |
| status | keyword | completedAt | date |
| createdAt | date | | |

### Job Metrics Index (`eval_job_metrics`)

| Field | Type | Field | Type |
|-------|------|-------|------|
| jobId | keyword | experimentItemId | keyword |
| jobType | keyword | expectedOutput | object |
| status | keyword | retryCount | integer |
| priority | keyword | maxRetries | integer |
| evaluatorId | keyword | error | text |
| evaluatorType | keyword | processingTimeMs | long |
| targetId | keyword | createdAt | date |
| targetType | keyword | startedAt | date |
| completedAt | date | | |

### Key Data Model Relationships

```mermaid
erDiagram
    SPAN ||--o{ SPAN : "parent-child via parentSpanId"
    SPAN ||--o{ SCORE : "scored by"
    SESSION ||--o{ SPAN : "groups via gen_ai.conversation.id"
    EVAL_SET ||--o{ EXPERIMENT : contains
    EVAL_SET ||--o{ EXPERIMENT_RUN : "executed as"
    EXPERIMENT_RUN ||--o{ EXPERIMENT_RUN_ITEM : contains
    EXPERIMENT_RUN_ITEM ||--|| EXPERIMENT : references
    EXPERIMENT_RUN_ITEM ||--|| SPAN : "linked to via traceId"
    SCORE_CONFIG ||--o{ SCORE : validates
    EVALUATOR_TEMPLATE ||--o{ SCORE : produces
    DETERMINISTIC_EVALUATOR ||--o{ SCORE : produces
    ANNOTATION_QUEUE ||--o{ ANNOTATION_TASK : contains
    ANNOTATION_TASK ||--o{ SCORE : "produces via review"

    SPAN {
        keyword traceId
        keyword spanId
        keyword parentSpanId
        keyword gen_ai_operation_name
        keyword gen_ai_request_model
        keyword gen_ai_conversation_id
        keyword gen_ai_agent_name
        long gen_ai_usage_input_tokens
        long gen_ai_usage_output_tokens
    }
    SCORE {
        keyword id
        keyword name
        float value
        keyword dataType
        keyword source
        keyword traceId
        keyword spanId
        keyword experimentRunId
    }
    EVAL_SET {
        keyword id
        keyword name
        object inputSchema
        object expectedOutputSchema
    }
    EXPERIMENT {
        keyword id
        keyword evalSetId
        object input
        object expectedOutput
        keyword status
    }
    EXPERIMENT_RUN {
        keyword id
        keyword evalSetId
        keyword name
    }
    EXPERIMENT_RUN_ITEM {
        keyword id
        keyword experimentRunId
        keyword experimentId
        keyword traceId
        text error
    }
```

### OTel GenAI Convention to Platform Concept Mapping

This table shows how the platform's UI concepts map to OTel span attributes, eliminating the need for a custom schema translation layer:

| Platform UI Concept | OTel Span Attribute                                                            | Notes                                      |
| ------------------- | ------------------------------------------------------------------------------ | ------------------------------------------ |
| Trace (root)        | Span where `parentSpanId` is null or `gen_ai.operation.name` = `invoke_agent`  | Root span of a trace                       |
| Observation (child) | Any child span within a trace                                                  | Linked via `parentSpanId`                  |
| Generation          | `gen_ai.operation.name` in (`chat`, `text_completion`, `generate_content`)     | LLM inference call                         |
| Tool call           | `gen_ai.operation.name` = `execute_tool`                                       | Tool/function execution                    |
| Embedding           | `gen_ai.operation.name` = `embeddings`                                         | Vector embedding call                      |
| Retrieval           | `gen_ai.operation.name` = `execute_tool` with `gen_ai.tool.type` = `datastore` | RAG retrieval step                         |
| Session             | `gen_ai.conversation.id`                                                       | Groups related traces                      |
| Model               | `gen_ai.request.model` / `gen_ai.response.model`                               | Model identifier                           |
| Provider            | `gen_ai.provider.name`                                                         | e.g., `openai`, `aws.bedrock`, `anthropic` |
| Input tokens        | `gen_ai.usage.input_tokens`                                                    | Token count                                |
| Output tokens       | `gen_ai.usage.output_tokens`                                                   | Token count                                |
| Agent name          | `gen_ai.agent.name`                                                            | Human-readable agent identifier            |
| Agent ID            | `gen_ai.agent.id`                                                              | Unique agent identifier                    |
| Tool name           | `gen_ai.tool.name`                                                             | Tool identifier                            |
| Environment         | `resource.deployment.environment`                                              | OTel resource attribute                    |
| Service             | `resource.service.name`                                                        | OTel resource attribute                    |

### OTel Evaluation Event Alignment

The OTel GenAI conventions define a `gen_ai.evaluation.result` event for capturing evaluation results. The platform's Score documents align with this event:

| OTel Event Attribute            | Platform Score Field | Notes                         |
| ------------------------------- | -------------------- | ----------------------------- |
| `gen_ai.evaluation.name`        | `name`               | Score metric name             |
| `gen_ai.evaluation.score.value` | `value`              | Numeric score                 |
| `gen_ai.evaluation.score.label` | `stringValue`        | Categorical/boolean label     |
| `gen_ai.evaluation.explanation` | `comment`            | Evaluator reasoning           |
| `gen_ai.response.id`            | `traceId` / `spanId` | Links score to evaluated span |

When scores are submitted via OTLP as `gen_ai.evaluation.result` events (Local evaluation mode), the OTel Collector maps them to `eval_scores` documents. When scores are submitted via the REST API, the platform stores them directly.

## Correctness Properties

Each property is universally quantified and suitable for property-based testing.

| # | Property | Description | Reqs |
|---|----------|-------------|------|
| 1 | OTLP Telemetry Round-Trip | Ingesting valid OTLP spans and reading back produces equivalent `gen_ai.*` attribute values | 1.2, 1.3, 13.8, 14.8 |
| 2 | Span Hierarchy Preservation | N spans with parent-child tree → N stored documents, valid `parentSpanId` refs, isomorphic tree | 1.4, 1.6 |
| 3 | Malformed Payload Rejection | Missing required fields → rejected, span count unchanged | 1.5 |
| 4 | Score Validation Against Config | Score accepted iff config exists, not archived, dataType matches, value in range, category valid | 3.2, 3.3, 3.5, 10.3 |
| 5 | Score Idempotency | N submissions with same idempotency key → exactly one document with latest values | 3.6, 10.4 |
| 6 | JSON Schema Validation | Experiments accepted iff input/expectedOutput validate against Eval_Set schemas | 4.3, 4.4, 4.5 |
| 7 | Experiment Versioning | K updates → K+1 documents with distinct `validFrom`; latest active query returns one per `lineageId` | 4.6, 4.7 |
| 8 | Experiment Archival Exclusion | N active + M archived → default query returns exactly N | 4.8 |
| 9 | Unique Name Enforcement | Duplicate Eval_Set or Experiment_Run names rejected | 4.9, 5.5 |
| 10 | Run Error Isolation | K of N items fail → K error items, N-K valid traceIds, no early abort | 5.6, 6.5 |
| 11 | Run Summary Correctness | Accurate totalItems/successfulItems/failedItems; aggregates over successful only | 5.3, 6.6 |
| 12 | SDK Experiment Execution | N items + E evaluators → 1 run, N items, N traces, N×E scores | 6.1–6.4 |
| 13 | Evaluation Mode Enforcement | OFFLINE evaluator on online trigger → rejected; ONLINE templates exclude `{{expectedOutput}}` | 8.8–8.10, 21.3, 22.5, 22.6 |
| 14 | LLM Judge Score Production | Valid library response → 1 Score + 1 execution trace; invalid → no Score | 8.3–8.5 |
| 15 | Multi-Criteria Evaluation | K score mappings → exactly K Score documents from one LLM call | 8.11 |
| 16 | Annotation Task Locking | Opened task locked; no concurrent acquisition; expired lock → PENDING | 9.4, 9.7 |
| 17 | Annotation Score Creation | Completed task → source=ANNOTATION, correct authorUserId, validates against config | 9.5, 9.6 |
| 18 | Multiple Scores Per Target | K scores of different names on one span → all K independently stored and queryable | 10.5 |
| 19 | Inter-Rater Agreement | Pearson/Spearman for NUMERIC, Cohen's Kappa for CATEGORICAL, F1 for BOOLEAN | 11.2, 11.3, 11.5 |
| 20 | Instrumentation Capture | Decorated call → span with input/output messages, valid timestamps, errors in `error.type` | 13.1–13.3 |
| 21 | Span Filtering Correctness | Every returned span satisfies all predicates; no matching span excluded | 15.2, 15.5 |
| 22 | Span Tree Reconstruction | `parentSpanId` tree → valid tree with one root, no cycles | 15.3 |
| 23 | Session Ordering | Same `gen_ai.conversation.id` → all spans returned in chronological order | 16.1, 16.2 |
| 24 | Session Aggregates | N root spans → correct trace count, total latency, score summaries | 16.3, 16.4 |
| 25 | Job Retry/Failure | Failed job retried with exponential backoff; exceeds maxRetries → FAILED | 18.3, 18.4 |
| 26 | Job Priority Ordering | All HIGH jobs picked up before NORMAL | 18.7 |
| 27 | Batch Job Count | N items × E evaluators → N×E jobs | 18.8 |
| 28 | Critical Path | Valid root-to-leaf path with greatest cumulative latency | 19.6 |
| 29 | Parallel Branch Detection | K spans sharing `parentSpanId` identified as parallel | 19.7 |
| 30 | Agent Map Aggregates | Edge/node metrics correctly computed as aggregates across matching traces | 20.3, 20.4 |
| 31 | Agent Path Extraction | Identical paths aggregated; flow widths sum to total trace count | 20.8 |
| 32 | Deterministic Evaluator Correctness | Each evaluator produces correct result per spec (exact match, regex, JSON, etc.) | 21.1 |
| 33 | Deterministic No Execution Trace | Deterministic evaluator → no execution trace created | 21.8 |
| 34 | RAG Context Extraction | `execute_tool` + `datastore` spans → RAG_Context = `gen_ai.tool.call.result` | 22.1 |
| 35 | RAG Score Range | Score value in [0.0, 1.0] | 22.2 |
| 36 | RAG Template Variables | `{{contexts}}` from datastore tool results, `{{question}}` from root input | 22.3, 22.4 |
| 37 | Annotation Bulk Creation | M items added to queue → M PENDING tasks | 9.3 |
| 38 | Entity CRUD Round-Trip | Create + read back → matching fields | 3.1, 3.4, 4.1, 4.2, 5.1, 5.2, 8.1, 9.1, 9.2, 10.1, 10.2, 21.2 |
| 39 | Job Metrics Recording | Completed/failed job → metrics document with accurate processingTimeMs, status, retryCount | 18.6 |

## Error Handling

### Ingestion Errors

| Error                                                 | Handling                                                                            | Req      |
| ----------------------------------------------------- | ----------------------------------------------------------------------------------- | -------- |
| Malformed OTLP payload                                | Reject at OTel Collector receiver, return gRPC/HTTP error, log to dead-letter index | 1.5      |
| Missing required fields (traceId, spanId, timestamps) | Drop document, log warning                                                          | 1.2, 1.3 |
| OTel Collector to OpenSearch write failure             | Retry with backoff, DLQ after max retries                                           | 1.1      |

### Score Validation Errors

| Error                               | Handling                                                                       | Req |
| ----------------------------------- | ------------------------------------------------------------------------------ | --- |
| Score value outside config range    | Reject with 400, message: "Value {v} outside range [{min}, {max}]"             | 3.3 |
| Score dataType mismatch with config | Reject with 400, message: "DataType {actual} does not match config {expected}" | 3.2 |
| Referenced configId not found       | Reject with 404, message: "Score config {id} not found"                        | 3.5 |
| Referenced configId is archived     | Reject with 400, message: "Score config {id} is archived"                      | 3.5 |

### Eval Set / Experiment Errors

| Error                              | Handling                                                     | Req |
| ---------------------------------- | ------------------------------------------------------------ | --- |
| Duplicate eval set name            | Reject with 409, message: "Eval set '{name}' already exists" | 4.9 |
| Experiment input fails JSON Schema | Reject with 400, message with JSON Schema validation errors  | 4.5 |
| Duplicate experiment run name      | Reject with 409                                              | 5.5 |

### Evaluation Errors

| Error                                        | Handling                                                          | Req      |
| -------------------------------------------- | ----------------------------------------------------------------- | -------- |
| OSS eval library returns invalid score       | Mark job FAILED, no score created, error logged                   | 8.5      |
| LLM provider timeout/error                   | Retry with exponential backoff up to maxRetries                   | 18.3     |
| Job exceeds max retries                      | Mark FAILED, record error for operator review                     | 18.4     |
| OFFLINE evaluator assigned to online agent trace trigger | Reject with 400, descriptive error about Ground_Truth requirement | 8.10     |
| Experiment item function throws              | Record error on Experiment_Run_Item, continue remaining items     | 5.6, 6.5 |

### Annotation Errors

| Error                                    | Handling                                              | Req |
| ---------------------------------------- | ----------------------------------------------------- | --- |
| Concurrent lock attempt on same task     | Return 409, reviewer sees "task already in review"    | 9.4 |
| Lock timeout exceeded                    | Background job releases lock, task returns to PENDING | 9.7 |
| Annotation score fails config validation | Reject with 400, reviewer sees validation error       | 9.6 |

## Testing Strategy

### Dual Testing Approach

- **Unit tests**: Verify specific examples, edge cases, integration points, and error conditions
- **Property-based tests**: Verify universal properties across randomly generated inputs

### Property-Based Testing Configuration

- **Library**: [fast-check](https://github.com/dubzzz/fast-check) for TypeScript components (OSD Plugin, TypeScript SDK), [Hypothesis](https://hypothesis.readthedocs.io/) for Python SDK
- **Minimum iterations**: 100 per property test
- **Tag format**: `Feature: opensearch-eval-platform, Property {N}: {property_title}`

### Test Organization

Property-based tests are distributed across components: OTel pipeline (P1-3), score validation (P4-5, P18), JSON schema (P6), experiment versioning (P7-8), name enforcement (P9), experiment runner (P10-12), eval mode (P13), LLM Judge (P14-15), annotations (P16-17, P37), agreement metrics (P19), instrumentation (P20), span queries (P21), span tree (P22, P28-29), sessions (P23-24), job scheduler (P25-27, P39), agent map/path (P30-31), deterministic evaluators (P32-33), RAG (P34-36), entity CRUD (P38). TypeScript components use fast-check + Jest; Python SDK uses Hypothesis + pytest.

### Test Independence

All tests must be decoupled. Each `it` or `test` block runs independently and concurrently. Tests must never depend on the action or outcome of previous or subsequent tests. No shared mutable state between tests.


## References

- [RFC: Python Agent Service for OpenSearch](https://github.com/opensearch-project/OpenSearch/issues/20602) — The Python Agent Service that hosts the Strands-based eval agent used by the eval-scheduler-plugin for LLM-dependent evaluation execution.


Field	Type	Calculation
`traceGroupFields.genAi.totalTokens`	Long	Sum of input + output tokens across all spans
`traceGroupFields.genAi.inputTokens`	Long	Sum of `gen_ai.usage.input_tokens`
`traceGroupFields.genAi.outputTokens`	Long	Sum of `gen_ai.usage.output_tokens`
`traceGroupFields.genAi.llmCallCount`	Integer	Count of spans where `gen_ai.operation.name` = `chat`
`traceGroupFields.genAi.toolCallCount`	Integer	Count of spans where `gen_ai.operation.name` = `execute_tool`
`traceGroupFields.genAi.errorCount`	Integer	Count where `status.code` = 2

Interface	Implementation	Purpose
`JobSchedulerExtension`	`EvalSchedulerExtension`	Registers the plugin with Job Scheduler, declares job index (`eval_job_metrics`) and runners
`ScheduledJobParameter`	`EvalJobParameter`	Defines the job document schema stored in `eval_job_metrics`
`ScheduledJobRunner`	`EvalTriggerSweeper`, `EvalJobExecutor`	Contains the execution logic invoked when a scheduled job fires

OTel Event Attribute	Platform Score Field	Notes
`gen_ai.evaluation.name`	`name`	Score metric name
`gen_ai.evaluation.score.value`	`value`	Numeric score
`gen_ai.evaluation.score.label`	`stringValue`	Categorical/boolean label
`gen_ai.evaluation.explanation`	`comment`	Evaluator reasoning
`gen_ai.response.id`	`traceId` / `spanId`	Links score to evaluated span

OTel Operation	`gen_ai.operation.name`	Platform Concept	Description
Chat completion	`chat`	Generation observation	LLM inference call
Text completion	`text_completion`	Generation observation	Legacy completion call
Embeddings	`embeddings`	Embedding observation	Vector embedding call
Invoke agent	`invoke_agent`	Trace (root span)	Top-level agent invocation
Create agent	`create_agent`	Agent setup span	Agent initialization
Execute tool	`execute_tool`	Tool call observation	Tool/function execution
Content generation	`generate_content`	Generation observation	Multimodal generation

Custom Attribute	Type	Description
`eval.score.name`	keyword	Score metric name
`eval.score.value`	float	Numeric score value
`eval.score.source`	keyword	One of: EVAL_ONLINE, EVAL_OFFLINE, SDK, ANNOTATION, API
`eval.experiment.run_id`	keyword	Links span to an experiment run
`eval.experiment.set_id`	keyword	Links span to an eval set
`eval.experiment.item_id`	keyword	Links span to a specific test case

Metric	Type	Description
`gen_ai.client.token.usage`	Counter	Token consumption by type
`gen_ai.client.operation.duration`	Histogram	LLM call latency distribution

Method	Path	Description	Req
POST	/api/eval-sets	Create eval set	4.1
GET	/api/eval-sets	List eval sets	4.7
POST	/api/eval-sets/{id}/experiments	Add experiment to eval set	4.2
PUT	/api/eval-sets/{id}/experiments/{eid}	Update experiment (versioned)	4.6
POST	/api/experiment-runs	Create experiment run	5.1
POST	/api/experiment-runs/{id}/items	Create run item	5.2
POST	/api/scores	Submit score	3.2, 10.1
POST	/api/score-configs	Create score config	3.1
POST	/api/evaluator-templates	Create evaluator template	8.1
POST	/api/deterministic-evaluators	Create deterministic evaluator	21.2
POST	/api/annotation-queues	Create annotation queue	9.1

Job Type	Priority	Trigger	Concurrency
`online_agent_trace_eval`	HIGH	New span matching filter	Single-item, <60s SLA
`offline_agent_trace_eval_item`	NORMAL	Experiment run batch	Per-item, configurable limit
`offline_agent_trace_eval_run`	NORMAL	Experiment run completion	Per-run
`annotation_lock_release`	LOW	Timer	Periodic sweep

Phase	Budget	Notes
Span indexing + refresh	~5s	OpenSearch refresh_interval
Trigger sweep detection	0–10s	Depends on sweep interval (configurable)
Job pickup by executor	0–5s	Depends on executor interval
Lock acquisition	<100ms	Job Scheduler LockService, local cluster op
Span data read	<500ms	Single document fetch
Python Agent Service call	5–30s	Network hop + eval library + LLM provider latency
Score write	<500ms	Single document index
Total	~15–45s	Well within 60s SLA for typical cases

Method	Path	Description
POST	/api/eval-agent/evaluate	Execute a single evaluation
POST	/api/eval-agent/evaluate/batch	Execute batch evaluations
GET	/api/eval-agent/health	Health check

Column	Source	Notes
Status	`status.code`, `error.type`	✓ green (OK), ✗ red (ERROR)
Type / Kind	`gen_ai.operation.name`	Friendly names: Chat, Agent, Tool, Embeddings, etc.
Name	`gen_ai.agent.name` (agent), `gen_ai.request.model` (LLM), `gen_ai.tool.name` (tool)	Varies by span type
Token Usage	`traceGroupFields.genAi.totalTokens`	Pre-aggregated at ingest, not computed at query time
Latency	`durationInNanos`	Root span duration
Input/Output	`gen_ai.input.messages`, `gen_ai.output.messages`	Opt-in, may contain sensitive data

Field	Type	Field	Type
id	keyword	source	keyword
name	keyword	traceId	keyword
value	float	spanId	keyword
stringValue	keyword	sessionId	keyword
dataType	keyword	experimentRunId	keyword
authorUserId	keyword	configId	keyword
comment	text	queueId	keyword
metadata	object	executionTraceId	keyword
environment	keyword	timestamp	date
createdAt	date	updatedAt	date
idempotencyKey	keyword

Field	Type	Field	Type
id	keyword	isArchived	boolean
name	keyword	minValue	float
dataType	keyword	maxValue	float
categories	nested (label: keyword, value: float)	description	text
createdAt	date	updatedAt	date

Field	Type	Field	Type
id	keyword	sourceTraceId	keyword
evalSetId	keyword	sourceSpanId	keyword
input	object	status	keyword
expectedOutput	object	lineageId	keyword
metadata	object	validFrom	date
createdAt	date

Field	Type	Field	Type
id	keyword	modelConfig.provider	keyword
name	text + keyword	modelConfig.modelName	keyword
evalLibrary	keyword	modelConfig.temperature	float
evalMetric	keyword	modelConfig.maxTokens	integer
promptTemplate	text	outputSchema	nested (scoreName, dataType, valueMapping)
targetType	keyword	evaluationMode	keyword
createdAt	date	updatedAt	date

Field	Type	Field	Type
id	keyword	scoreConfigIds	keyword
name	text + keyword	assignedUserIds	keyword
description	text	createdAt	date
updatedAt	date

Field	Type	Field	Type
id	keyword	lockedBy	keyword
queueId	keyword	lockedAt	date
targetId	keyword	lockTimeout	integer
targetType	keyword	completedBy	keyword
status	keyword	completedAt	date
createdAt	date

Field	Type	Field	Type
jobId	keyword	experimentItemId	keyword
jobType	keyword	expectedOutput	object
status	keyword	retryCount	integer
priority	keyword	maxRetries	integer
evaluatorId	keyword	error	text
evaluatorType	keyword	processingTimeMs	long
targetId	keyword	createdAt	date
targetType	keyword	startedAt	date
completedAt	date

Platform UI Concept	OTel Span Attribute	Notes
Trace (root)	Span where `parentSpanId` is null or `gen_ai.operation.name` = `invoke_agent`	Root span of a trace
Observation (child)	Any child span within a trace	Linked via `parentSpanId`
Generation	`gen_ai.operation.name` in (`chat`, `text_completion`, `generate_content`)	LLM inference call
Tool call	`gen_ai.operation.name` = `execute_tool`	Tool/function execution
Embedding	`gen_ai.operation.name` = `embeddings`	Vector embedding call
Retrieval	`gen_ai.operation.name` = `execute_tool` with `gen_ai.tool.type` = `datastore`	RAG retrieval step
Session	`gen_ai.conversation.id`	Groups related traces
Model	`gen_ai.request.model` / `gen_ai.response.model`	Model identifier
Provider	`gen_ai.provider.name`	e.g., `openai`, `aws.bedrock`, `anthropic`
Input tokens	`gen_ai.usage.input_tokens`	Token count
Output tokens	`gen_ai.usage.output_tokens`	Token count
Agent name	`gen_ai.agent.name`	Human-readable agent identifier
Agent ID	`gen_ai.agent.id`	Unique agent identifier
Tool name	`gen_ai.tool.name`	Tool identifier
Environment	`resource.deployment.environment`	OTel resource attribute
Service	`resource.service.name`	OTel resource attribute

#	Property	Description	Reqs
1	OTLP Telemetry Round-Trip	Ingesting valid OTLP spans and reading back produces equivalent `gen_ai.*` attribute values	1.2, 1.3, 13.8, 14.8
2	Span Hierarchy Preservation	N spans with parent-child tree → N stored documents, valid `parentSpanId` refs, isomorphic tree	1.4, 1.6
3	Malformed Payload Rejection	Missing required fields → rejected, span count unchanged	1.5
4	Score Validation Against Config	Score accepted iff config exists, not archived, dataType matches, value in range, category valid	3.2, 3.3, 3.5, 10.3
5	Score Idempotency	N submissions with same idempotency key → exactly one document with latest values	3.6, 10.4
6	JSON Schema Validation	Experiments accepted iff input/expectedOutput validate against Eval_Set schemas	4.3, 4.4, 4.5
7	Experiment Versioning	K updates → K+1 documents with distinct `validFrom`; latest active query returns one per `lineageId`	4.6, 4.7
8	Experiment Archival Exclusion	N active + M archived → default query returns exactly N	4.8
9	Unique Name Enforcement	Duplicate Eval_Set or Experiment_Run names rejected	4.9, 5.5
10	Run Error Isolation	K of N items fail → K error items, N-K valid traceIds, no early abort	5.6, 6.5
11	Run Summary Correctness	Accurate totalItems/successfulItems/failedItems; aggregates over successful only	5.3, 6.6
12	SDK Experiment Execution	N items + E evaluators → 1 run, N items, N traces, N×E scores	6.1–6.4
13	Evaluation Mode Enforcement	OFFLINE evaluator on online trigger → rejected; ONLINE templates exclude `{{expectedOutput}}`	8.8–8.10, 21.3, 22.5, 22.6
14	LLM Judge Score Production	Valid library response → 1 Score + 1 execution trace; invalid → no Score	8.3–8.5
15	Multi-Criteria Evaluation	K score mappings → exactly K Score documents from one LLM call	8.11
16	Annotation Task Locking	Opened task locked; no concurrent acquisition; expired lock → PENDING	9.4, 9.7
17	Annotation Score Creation	Completed task → source=ANNOTATION, correct authorUserId, validates against config	9.5, 9.6
18	Multiple Scores Per Target	K scores of different names on one span → all K independently stored and queryable	10.5
19	Inter-Rater Agreement	Pearson/Spearman for NUMERIC, Cohen's Kappa for CATEGORICAL, F1 for BOOLEAN	11.2, 11.3, 11.5
20	Instrumentation Capture	Decorated call → span with input/output messages, valid timestamps, errors in `error.type`	13.1–13.3
21	Span Filtering Correctness	Every returned span satisfies all predicates; no matching span excluded	15.2, 15.5
22	Span Tree Reconstruction	`parentSpanId` tree → valid tree with one root, no cycles	15.3
23	Session Ordering	Same `gen_ai.conversation.id` → all spans returned in chronological order	16.1, 16.2
24	Session Aggregates	N root spans → correct trace count, total latency, score summaries	16.3, 16.4
25	Job Retry/Failure	Failed job retried with exponential backoff; exceeds maxRetries → FAILED	18.3, 18.4
26	Job Priority Ordering	All HIGH jobs picked up before NORMAL	18.7
27	Batch Job Count	N items × E evaluators → N×E jobs	18.8
28	Critical Path	Valid root-to-leaf path with greatest cumulative latency	19.6
29	Parallel Branch Detection	K spans sharing `parentSpanId` identified as parallel	19.7
30	Agent Map Aggregates	Edge/node metrics correctly computed as aggregates across matching traces	20.3, 20.4
31	Agent Path Extraction	Identical paths aggregated; flow widths sum to total trace count	20.8
32	Deterministic Evaluator Correctness	Each evaluator produces correct result per spec (exact match, regex, JSON, etc.)	21.1
33	Deterministic No Execution Trace	Deterministic evaluator → no execution trace created	21.8
34	RAG Context Extraction	`execute_tool` + `datastore` spans → RAG_Context = `gen_ai.tool.call.result`	22.1
35	RAG Score Range	Score value in [0.0, 1.0]	22.2
36	RAG Template Variables	`{{contexts}}` from datastore tool results, `{{question}}` from root input	22.3, 22.4
37	Annotation Bulk Creation	M items added to queue → M PENDING tasks	9.3
38	Entity CRUD Round-Trip	Create + read back → matching fields	3.1, 3.4, 4.1, 4.2, 5.1, 5.2, 8.1, 9.1, 9.2, 10.1, 10.2, 21.2
39	Job Metrics Recording	Completed/failed job → metrics document with accurate processingTimeMs, status, retryCount	18.6

Error	Handling	Req
Malformed OTLP payload	Reject at OTel Collector receiver, return gRPC/HTTP error, log to dead-letter index	1.5
Missing required fields (traceId, spanId, timestamps)	Drop document, log warning	1.2, 1.3
OTel Collector to OpenSearch write failure	Retry with backoff, DLQ after max retries	1.1

Error	Handling	Req
Score value outside config range	Reject with 400, message: "Value {v} outside range [{min}, {max}]"	3.3
Score dataType mismatch with config	Reject with 400, message: "DataType {actual} does not match config {expected}"	3.2
Referenced configId not found	Reject with 404, message: "Score config {id} not found"	3.5
Referenced configId is archived	Reject with 400, message: "Score config {id} is archived"	3.5

Error	Handling	Req
Duplicate eval set name	Reject with 409, message: "Eval set '{name}' already exists"	4.9
Experiment input fails JSON Schema	Reject with 400, message with JSON Schema validation errors	4.5
Duplicate experiment run name	Reject with 409	5.5

RFC : Agentic AI Eval Platform : High level Design #2592

Description

Overview

GenAI OTel Conventions Alignment

Scoping Model

Agent Root Span Identification

Architecture

Component Interaction Flow

Evaluation Algorithm Dependencies

Components and Interfaces

1. OTel Collector Pipeline

2. Eval Platform REST API

3. eval-scheduler-plugin

4. Python Agent Service (Eval Agent)

5. OSD Eval Plugin

6. Python Instrumentation Library

7. TypeScript Instrumentation Library

Data Models

Spans Index (otel-v1-apm-span-*)

Scores Index (eval_scores)

Score Configs Index (eval_score_configs)

Eval Sets Index (eval_sets)

Experiments Index (eval_experiments)

Experiment Runs Index (eval_experiment_runs)

Experiment Run Items Index (eval_experiment_run_items)

Evaluator Templates Index (eval_evaluator_templates)

Deterministic Evaluators Index (eval_deterministic_evaluators)

Annotation Queues Index (eval_annotation_queues)

Annotation Tasks Index (eval_annotation_tasks)

Job Metrics Index (eval_job_metrics)

Key Data Model Relationships

OTel GenAI Convention to Platform Concept Mapping

OTel Evaluation Event Alignment

Correctness Properties

Error Handling

Ingestion Errors

Score Validation Errors

Eval Set / Experiment Errors

Evaluation Errors

Annotation Errors

Testing Strategy

Dual Testing Approach

Property-Based Testing Configuration

Test Organization

Test Independence

References

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Spans Index (`otel-v1-apm-span-*`)

Scores Index (`eval_scores`)

Score Configs Index (`eval_score_configs`)

Eval Sets Index (`eval_sets`)

Experiments Index (`eval_experiments`)

Experiment Runs Index (`eval_experiment_runs`)

Experiment Run Items Index (`eval_experiment_run_items`)

Evaluator Templates Index (`eval_evaluator_templates`)

Deterministic Evaluators Index (`eval_deterministic_evaluators`)

Annotation Queues Index (`eval_annotation_queues`)

Annotation Tasks Index (`eval_annotation_tasks`)

Job Metrics Index (`eval_job_metrics`)

Error	Handling	Req
OSS eval library returns invalid score	Mark job FAILED, no score created, error logged	8.5
LLM provider timeout/error	Retry with exponential backoff up to maxRetries	18.3
Job exceeds max retries	Mark FAILED, record error for operator review	18.4
OFFLINE evaluator assigned to online agent trace trigger	Reject with 400, descriptive error about Ground_Truth requirement	8.10
Experiment item function throws	Record error on Experiment_Run_Item, continue remaining items	5.6, 6.5

Error	Handling	Req
Concurrent lock attempt on same task	Return 409, reviewer sees "task already in review"	9.4
Lock timeout exceeded	Background job releases lock, task returns to PENDING	9.7
Annotation score fails config validation	Reject with 400, reviewer sees validation error	9.6