RFC : Agentic AI Eval Platform : High level Design #2592

@lezzago

Description

Overview

This design describes an LLM evaluation platform built natively on the OpenSearch ecosystem. The platform uses OpenSearch indices as the sole data store, OTel Collector for OTLP ingestion and span processing, OpenSearch Job Scheduler for async processing, and OpenSearch Dashboards plugins for the UI.

The data model is grounded in the OpenTelemetry GenAI Semantic Conventions (gen_ai.* attribute namespace). OTLP spans arrive with standard gen_ai.* attributes and are indexed directly into OpenSearch without lossy transformation. This means any OTel-instrumented LLM application (Strands, OpenAI SDK, Bedrock SDK, etc.) can send telemetry to the platform with zero custom mapping.

The system supports three evaluation modes:

  • Online Agent Trace Evaluation: Automatic post-ingestion scoring of live production traces (reference-free only). Source: EVAL_ONLINE.
  • Offline Agent Trace Evaluation: Platform-orchestrated batch evaluation against curated Eval_Sets with Ground_Truth. Source: EVAL_OFFLINE.
  • Local Evaluation: Scores computed client-side by the user's SDK (Strands, DeepEval, Ragas) and submitted via the Scores API. The platform is a passive receiver. Source: SDK.

Key architectural decisions:

  1. OpenSearch as the sole data store -- all entities (spans, scores, eval sets, experiments, jobs) are stored in dedicated OpenSearch indices. No relational database.
  2. OTel Collector for ingestion and processing -- OTLP telemetry flows through OTel Collector pipelines into OpenSearch. The gen_ai.* span attributes are indexed as-is. OTel Collector handles trace-group metric aggregation and Prometheus metric emission. The existing APM index template's dynamic field mapping ("dynamic": "true" on the attributes field) automatically indexes any new gen_ai.* span attributes without requiring schema changes.
  3. GenAI semantic conventions as the canonical schema -- the platform does not define a custom trace/observation schema. It uses the OTel gen_ai.* attributes directly, extended with platform-specific attributes under the eval.* namespace for evaluation-only fields.
  4. Job Scheduler for async work -- native OpenSearch Job Scheduler plugin for LLM-as-a-Judge, deterministic evaluators, and RAG metrics. Only involved in Online and Offline modes.
  5. OSD Plugin for UI -- a single OpenSearch Dashboards plugin using OUI components provides all evaluation UI views.
  6. SDK-driven experiment execution -- Python and TypeScript instrumentation libraries orchestrate offline experiments.
  7. Passive receiver for Local evaluation -- third-party SDKs compute scores client-side and submit them via the Scores API.
  8. OSS library delegation for scoring via Python Agent Service -- Online and Offline agent trace evaluators delegate scoring to the Python Agent Service, which hosts a Strands-based eval agent that invokes Strands Eval, DeepEval, and Ragas for actual scoring logic. The eval-scheduler-plugin communicates with the Python Agent Service over its internal API; the Python Agent Service owns the LLM provider connection.
  9. No artificial scoping -- spans are global documents. Multi-tenancy is handled at the OpenSearch index level via the security plugin.
  10. Dual-write metrics architecture -- OTel Collector enriches spans with pre-aggregated trace-group fields (traceGroupFields.genAi.*) for fast OpenSearch queries (PPL) while simultaneously emitting derived metrics to Prometheus for time-series analysis (PromQL). OpenSearch handles trace detail and search; Prometheus handles metric aggregation and alerting.

GenAI OTel Conventions Alignment

The platform's data model maps directly to the OTel GenAI semantic conventions (status: Development). The key span types and their gen_ai.operation.name values:

| OTel Operation | gen_ai.operation.name | Platform Concept | Description |
|---|---|---|---|
| Chat completion | chat | Generation observation | LLM inference call |
| Text completion | text_completion | Generation observation | Legacy completion call |
| Embeddings | embeddings | Embedding observation | Vector embedding call |
| Invoke agent | invoke_agent | Trace (root span) | Top-level agent invocation |
| Create agent | create_agent | Agent setup span | Agent initialization |
| Execute tool | execute_tool | Tool call observation | Tool/function execution |
| Content generation | generate_content | Generation observation | Multimodal generation |

The platform extends the standard gen_ai.* namespace with eval.* attributes for evaluation-specific metadata that has no OTel equivalent:

| Custom Attribute | Type | Description |
|---|---|---|
| eval.score.name | keyword | Score metric name |
| eval.score.value | float | Numeric score value |
| eval.score.source | keyword | One of: EVAL_ONLINE, EVAL_OFFLINE, SDK, ANNOTATION, API |
| eval.experiment.run_id | keyword | Links span to an experiment run |
| eval.experiment.set_id | keyword | Links span to an eval set |
| eval.experiment.item_id | keyword | Links span to a specific test case |

Scoping Model

  • Spans: Global documents. No artificial project scoping field. Queried via filters (time range, tags, gen_ai.agent.name, gen_ai.request.model, etc.).
  • Sessions: Correlated via gen_ai.conversation.id (the OTel convention for session/thread tracking).
  • Eval sets: Independent named collections. Can be used by multiple experiment runs.
  • Experiment runs: Reference eval sets via evalSetId. Produce run items linking test cases to traces.
  • Multi-tenancy: OpenSearch security plugin (index-level permissions, roles). The eval platform itself is tenant-unaware.
  • Local evaluation scores: Arrive via Scores API with source: SDK, referencing traceId and carrying evaluator metadata.

Agent Root Span Identification

Agent root spans in raw trace data are identified by:

  1. No parent -- parentSpanId = ""
  2. Agent operation -- gen_ai.operation.name exists (e.g., invoke_agent)

source = otel-v1-apm-span-*
| where parentSpanId = '' and isnotnull(`attributes.gen_ai.operation.name`)

Architecture

graph TB
    subgraph "Client Layer"
        PY[Python Instrumentation Library]
        TS[TypeScript Instrumentation Library]
        APP[User LLM Application]
        SDK3P[Third-Party SDKs - Strands / DeepEval / Ragas]
    end

    subgraph "Ingestion Layer - OTel Collector"
        OC[OTel Collector]
        OC_OTLP[OTLP gRPC/HTTP Receiver]
        OC_TRACE[Trace Processor - Trace-Group Aggregation]
        OC_SINK_OS[OpenSearch Exporter]
        OC_SINK_PROM[Prometheus Remote Write Exporter]
        OC_OTLP --> OC_TRACE
        OC_TRACE --> OC_SINK_OS
        OC_TRACE --> OC_SINK_PROM
    end

    subgraph "OpenSearch Cluster"
        subgraph "Data Indices"
            IDX_SPANS[otel-v1-apm-span - gen_ai attributes + traceGroupFields.genAi]
            IDX_SCORES[eval_scores]
        end
        subgraph "Config Indices"
            IDX_SC[eval_score_configs]
            IDX_ET[eval_evaluator_templates]
            IDX_DE[eval_deterministic_evaluators]
            IDX_AQ[eval_annotation_queues]
        end
        subgraph "Eval Indices"
            IDX_ES[eval_sets]
            IDX_EX[eval_experiments]
            IDX_ER[eval_experiment_runs]
            IDX_ERI[eval_experiment_run_items]
            IDX_AT[eval_annotation_tasks]
        end
        subgraph "Operational Indices"
            IDX_JM[eval_job_metrics]
        end
        JS[eval-scheduler-plugin]
    end

    subgraph "Python Agent Service"
        PAS[Strands Orchestrator]
        EVAL_AGENT[Eval Agent - Strands]
        PAS --> EVAL_AGENT
    end

    subgraph "Metrics Layer"
        PROM[Prometheus]
    end

    subgraph "LLM Providers"
        LLM[Bedrock / OpenAI / Anthropic]
    end

    subgraph "OpenSearch Dashboards"
        OSD[OSD Eval Plugin]
        subgraph "P0 Views"
            V1[Agent Trace List View]
            V1M[Trace List Metrics Summary]
            V9[Agent Trace Timeline / Waterfall View]
            V3D[Agent Span Detail View]
        end
        subgraph "P1 Views"
            V10[Agent Call Graph View]
        end
        subgraph "Eval Views"
            V2[Sessions]
            V3[Eval Sets & Experiments]
            V4[Experiment Runs]
            V5[Annotation Queues]
            V6[Scores & Analytics]
            V7[Evaluators]
            V8[Dashboards]
            V11[Agent Map / Agent Path]
        end
    end

    APP --> PY & TS
    APP --> SDK3P
    PY & TS -->|OTLP spans with gen_ai.* attrs| OC_OTLP
    PY & TS -->|REST API| IDX_ES & IDX_EX & IDX_ER & IDX_ERI & IDX_SCORES
    SDK3P -->|OTLP spans| OC_OTLP
    SDK3P -->|Scores API - source: SDK| IDX_SCORES
    OC_SINK_OS -->|index| IDX_SPANS
    OC_SINK_PROM -->|remote write| PROM
    IDX_SPANS -->|polling sweep| JS
    JS -->|eval request| PAS
    EVAL_AGENT -->|eval library call| LLM
    JS -->|eval scores| IDX_SCORES
    JS -->|job metrics| IDX_JM
    OSD --> V1 & V1M & V9 & V3D & V10
    OSD --> V2 & V3 & V4 & V5 & V6 & V7 & V8 & V11
    OSD -->|PPL queries| IDX_SPANS & IDX_SCORES & IDX_ES & IDX_ER
    OSD -->|PromQL queries| PROM

Component Interaction Flow

sequenceDiagram
    participant App as User Application
    participant SDK as Instrumentation Library
    participant OC as OTel Collector
    participant OS as OpenSearch
    participant PROM as Prometheus
    participant JS as eval-scheduler-plugin
    participant PAS as Python Agent Service
    participant LLM as LLM Provider
    participant SDK3P as Third-Party SDK

    Note over App,LLM: Online Agent Trace Evaluation Flow
    App->>SDK: Instrumented function call
    SDK->>OC: OTLP spans (gen_ai.* attributes)
    OC->>OC: Aggregate traceGroupFields.genAi.*
    OC->>OS: Index enriched spans
    OC->>PROM: Emit derived metrics
    JS->>OS: Poll for new spans matching trigger filters
    JS->>JS: Create PENDING eval jobs
    JS->>OS: Read span data
    JS->>PAS: Eval request (evaluator config + span data)
    PAS->>LLM: Eval agent invokes library (Strands/DeepEval/Ragas)
    LLM-->>PAS: Score response
    PAS-->>JS: Structured score result
    JS->>OS: Write eval_score (source: EVAL_ONLINE)

    Note over App,LLM: Offline Agent Trace Evaluation Flow
    SDK->>OS: Fetch eval set items
    loop For each experiment item
        SDK->>App: Call user function(input)
        App-->>SDK: output
        SDK->>OC: OTLP spans (eval.experiment.* tags)
        SDK->>OS: Write experiment_run_item
    end
    SDK->>OS: Write run-level scores (source: EVAL_OFFLINE)
    Note over JS,PAS: Server-side evaluation (same as online, with ground truth)
    JS->>OS: Poll for new spans tagged with eval.experiment.run_id
    JS->>OS: Read span data + expectedOutput from eval_experiments
    JS->>PAS: Eval request (evaluator config + span data + expectedOutput)
    PAS->>LLM: Eval agent invokes library (Strands/DeepEval/Ragas)
    LLM-->>PAS: Score response
    PAS-->>JS: Structured score result
    JS->>OS: Write eval_score (source: EVAL_OFFLINE)

    Note over App,SDK3P: Local Evaluation Flow
    App->>SDK3P: Run evaluation (Strands/DeepEval/Ragas)
    SDK3P->>LLM: LLM call (if metric requires it)
    LLM-->>SDK3P: Score response
    SDK3P->>OC: OTLP spans (trace telemetry)
    SDK3P->>OS: POST /api/scores (source: SDK)

Evaluation Algorithm Dependencies

The platform does not implement evaluation algorithms from scratch. For Online and Offline agent trace evaluation, the eval-scheduler-plugin delegates to the Python Agent Service, which hosts a Strands-based eval agent invoking OSS libraries: Strands Eval (agent trajectory, tool-use, multi-step reasoning), DeepEval (GEval, hallucination, relevancy, faithfulness), and Ragas (context precision/recall, answer faithfulness/relevancy).

The eval-scheduler-plugin sends requests with the Evaluator_Template config (library, metric, model, target span data). The Python Agent Service constructs the library call, manages the LLM connection, and returns structured scores. Evaluator_Templates are thin wrappers — each specifies library, metric, provider, and parameters. LLM provider config is pluggable at the template level via Strands SDK's model abstraction.
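As an illustration, a hypothetical Evaluator_Template document and a minimal validation check might look like the following. Field names and the validation helper are assumptions for this sketch, not the confirmed index schema:

```python
# Hypothetical Evaluator_Template document as it might be stored in
# eval_evaluator_templates. All field names here are illustrative.
faithfulness_template = {
    "templateId": "et-faithfulness-v1",
    "library": "ragas",                    # strands_eval | deepeval | ragas
    "metric": "answer_faithfulness",       # metric name understood by the library
    "modelConfig": {                       # pluggable via Strands SDK's model abstraction
        "provider": "bedrock",
        "modelId": "anthropic.claude-sonnet-4-5-20250929-v1:0",
        "temperature": 0.0,
    },
    "parameters": {"threshold": 0.7},
    "outputSchema": {"value": "float", "explanation": "text"},
}

def validate_template(template: dict) -> bool:
    """Sanity-check a template before it is sent to the Python Agent Service."""
    required = {"templateId", "library", "metric", "modelConfig"}
    return required.issubset(template) and template["library"] in {
        "strands_eval", "deepeval", "ragas",
    }
```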

Components and Interfaces

1. OTel Collector Pipeline

Responsibility: Receives OTLP telemetry, aggregates trace-group metrics, emits derived metrics to Prometheus, and indexes spans into OpenSearch.

Validation: Malformed OTLP payloads are rejected at the receiver level. The processor validates required span fields (traceId, spanId, timestamps) and drops documents missing them, logging errors to a dead-letter index.

Interface:

  • Input: OTLP gRPC (port 4317) and HTTP (port 4318)
  • Output: OpenSearch bulk index API, Prometheus remote write

Pipeline Stages:

  1. OTLP Receiver: Accepts gRPC and HTTP OTLP payloads
  2. Trace Processor (Trace-Group Aggregation): Buffers spans by traceId, computes trace-level aggregates (traceGroupFields.genAi.*), and writes them back to every span in the trace. Extends the existing traceGroupFields pattern used for standard APM metrics (duration, status) with GenAI-specific aggregations.
  3. OpenSearch Exporter: Indexes enriched spans into otel-v1-apm-span-* indices
  4. Prometheus Remote Write Exporter: Emits derived time-series metrics (gen_ai.client.token.usage, gen_ai.client.operation.duration) to Prometheus

Trace-Group Fields (GenAI):

Pre-aggregated fields computed at ingest time and written to each span document within a trace. These denormalized fields enable the agent trace list view to display aggregate statistics without expensive query-time aggregations.

| Field | Type | Calculation |
|---|---|---|
| traceGroupFields.genAi.totalTokens | Long | Sum of input + output tokens across all spans |
| traceGroupFields.genAi.inputTokens | Long | Sum of gen_ai.usage.input_tokens |
| traceGroupFields.genAi.outputTokens | Long | Sum of gen_ai.usage.output_tokens |
| traceGroupFields.genAi.llmCallCount | Integer | Count of spans where gen_ai.operation.name = chat |
| traceGroupFields.genAi.toolCallCount | Integer | Count of spans where gen_ai.operation.name = execute_tool |
| traceGroupFields.genAi.errorCount | Integer | Count where status.code = 2 |

Note: Token cost estimation (traceGroupFields.genAi.estimatedCost) is deferred from P0 due to pricing table maintenance complexity.
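The aggregation the trace processor performs can be sketched in Python as follows (simplified span shape for illustration; the collector operates on OTLP-encoded spans, not plain dicts):

```python
def aggregate_trace_group(spans: list[dict]) -> dict:
    """Compute traceGroupFields.genAi.* for one trace.

    The result is written back to every span document in the trace, so the
    trace list view can read aggregates without query-time aggregation.
    """
    attrs = [s.get("attributes", {}) for s in spans]
    input_tokens = sum(a.get("gen_ai.usage.input_tokens", 0) for a in attrs)
    output_tokens = sum(a.get("gen_ai.usage.output_tokens", 0) for a in attrs)
    return {
        "totalTokens": input_tokens + output_tokens,
        "inputTokens": input_tokens,
        "outputTokens": output_tokens,
        "llmCallCount": sum(a.get("gen_ai.operation.name") == "chat" for a in attrs),
        "toolCallCount": sum(a.get("gen_ai.operation.name") == "execute_tool" for a in attrs),
        "errorCount": sum(s.get("status", {}).get("code") == 2 for s in spans),
    }
```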

Aggregate Metrics (Prometheus):

OTel Collector derives the following core metrics from span attributes and emits them to Prometheus:

| Metric | Type | Description |
|---|---|---|
| gen_ai.client.token.usage | Counter | Token consumption by type |
| gen_ai.client.operation.duration | Histogram | LLM call latency distribution |

Metric dimensions: gen_ai.operation.name, gen_ai.system, gen_ai.request.model (normalized to model family), gen_ai.response.model (normalized), service.name, gen_ai.token.type (input/output).

Cardinality Management: High-cardinality fields (traceId, spanId, gen_ai.conversation.id) are excluded from metric dimensions. Model IDs are normalized to family names (e.g., anthropic.claude-sonnet-4-5-20250929-v1:0 → claude-sonnet-4-5). Estimate: ~9,000 series per customer.

Deduplication: If client-side instrumentation already emits gen_ai.client.token.usage, OTel Collector adds source=span_derived to distinguish its derived metrics.

Large Content Fields: gen_ai.input.messages and gen_ai.output.messages are stored in the document _source but not indexed for search ("index": false). Full-text search on content fields is opt-in.

Long-Running Traces: Configuration should support increased flush intervals or root-span-triggered flushing for 60+ minute agent conversations.

2. Eval Platform REST API

Responsibility: CRUD operations for eval sets, experiments, scores, and evaluator configs. Exposed as server-side routes within the OSD Plugin.

Endpoints (key routes):

| Method | Path | Description | Req |
|---|---|---|---|
| POST | /api/eval-sets | Create eval set | 4.1 |
| GET | /api/eval-sets | List eval sets | 4.7 |
| POST | /api/eval-sets/{id}/experiments | Add experiment to eval set | 4.2 |
| PUT | /api/eval-sets/{id}/experiments/{eid} | Update experiment (versioned) | 4.6 |
| POST | /api/experiment-runs | Create experiment run | 5.1 |
| POST | /api/experiment-runs/{id}/items | Create run item | 5.2 |
| POST | /api/scores | Submit score | 3.2, 10.1 |
| POST | /api/score-configs | Create score config | 3.1 |
| POST | /api/evaluator-templates | Create evaluator template | 8.1 |
| POST | /api/deterministic-evaluators | Create deterministic evaluator | 21.2 |
| POST | /api/annotation-queues | Create annotation queue | 9.1 |

Authentication: API calls authenticated via API keys or OSD session tokens. Multi-tenancy via OpenSearch security plugin.

Span Queries via PPL

Span browsing, searching, and detail retrieval use OpenSearch PPL (Piped Processing Language) queries against the _plugins/_ppl endpoint. The OSD Plugin constructs PPL strings from UI filter state.

Example queries:

-- List agent root spans with pre-computed aggregates (Req 15.1)
source = otel-v1-apm-span-*
| where parentSpanId = '' AND isnotnull(`attributes.gen_ai.operation.name`)
| fields traceId, name, durationInNanos, status.code,
         traceGroupFields.genAi.totalTokens, traceGroupFields.genAi.llmCallCount
| sort - startTime | head 100

-- Get all child spans for a trace (Req 15.3)
source = otel-v1-apm-span-*
| where traceId = 'abc123' | sort startTime

-- Aggregate latency by model
source = otel-v1-apm-span-*
| where `attributes.gen_ai.operation.name` = 'chat'
| stats avg(durationInNanos) as avg_latency, count() as call_count by `attributes.gen_ai.request.model`

PromQL (Prometheus dashboards):

sum(rate(gen_ai_client_token_usage[5m])) by (gen_ai_system)

histogram_quantile(0.99, sum(rate(gen_ai_client_operation_duration_bucket[5m])) by (le, gen_ai_response_model))

3. eval-scheduler-plugin

Responsibility: The eval-scheduler-plugin is a lightweight OpenSearch plugin that owns all async evaluation work — detecting new spans for online agent trace evaluation, executing LLM-as-a-Judge scoring, running deterministic evaluators, computing RAG metrics, and managing job lifecycle. It uses the OpenSearch Job Scheduler SPI as its scheduling infrastructure.

Why a custom plugin: Job Scheduler is an SPI framework — it provides scheduling infrastructure (interval/cron triggers, distributed locking, job persistence) but requires a consumer plugin to define job types and execution logic. The plugin implements a polling sweeper to bridge span ingestion and evaluation execution.

Job Scheduler SPI Integration:

The plugin implements three Job Scheduler SPI interfaces:

| Interface | Implementation | Purpose |
|---|---|---|
| JobSchedulerExtension | EvalSchedulerExtension | Registers the plugin with Job Scheduler, declares job index (eval_job_metrics) and runners |
| ScheduledJobParameter | EvalJobParameter | Defines the job document schema stored in eval_job_metrics |
| ScheduledJobRunner | EvalTriggerSweeper, EvalJobExecutor | Contains the execution logic invoked when a scheduled job fires |

The plugin uses Job Scheduler's IntervalSchedule trigger type to register two recurring scheduled jobs:

  1. Trigger Sweeper (EvalTriggerSweeper) — runs every 5–10s, polls for new spans matching online agent trace evaluation triggers and offline experiment traces (via eval.experiment.run_id tags), creates PENDING job documents
  2. Job Executor (EvalJobExecutor) — runs every 2–5s, picks up PENDING job documents ordered by priority, executes evaluations, writes scores

Job Scheduler's CronSchedule is not used — the polling pattern requires sub-minute granularity, which IntervalSchedule provides but cron expressions (minute-level at best) do not.

Job Types:

| Job Type | Priority | Trigger | Concurrency |
|---|---|---|---|
| online_agent_trace_eval | HIGH | New span matching filter | Single-item, <60s SLA |
| offline_agent_trace_eval_item | NORMAL | Experiment run batch | Per-item, configurable limit |
| offline_agent_trace_eval_run | NORMAL | Experiment run completion | Per-run |
| annotation_lock_release | LOW | Timer | Periodic sweep |

Job Document Schema (stored in eval_job_metrics):

{
  "jobId": "keyword",
  "jobType": "keyword (online_agent_trace_eval | offline_agent_trace_eval_item | offline_agent_trace_eval_run | annotation_lock_release)",
  "status": "keyword (PENDING | RUNNING | COMPLETED | FAILED)",
  "priority": "keyword (HIGH | NORMAL | LOW)",
  "evaluatorId": "keyword",
  "evaluatorType": "keyword (LLM_JUDGE | DETERMINISTIC | RAG)",
  "targetSpanId": "keyword",
  "targetType": "keyword (SPAN | EXPERIMENT_RUN_ITEM)",
  "experimentItemId": "keyword",
  "expectedOutput": "object",
  "retryCount": "integer",
  "maxRetries": "integer",
  "error": "text",
  "createdAt": "date",
  "startedAt": "date",
  "completedAt": "date"
}

Trigger Sweeper (EvalTriggerSweeper):

The sweeper is the plugin's span detection mechanism for both online and offline agent trace evaluation. It maintains a lastSweepTime watermark per trigger configuration and queries otel-v1-apm-span-* for spans indexed since the last sweep.

For online agent trace evaluation, it matches root spans against trigger filter criteria (e.g., gen_ai.agent.name, gen_ai.operation.name, tags). For offline agent trace evaluation, it detects spans tagged with eval.experiment.run_id and joins them with the corresponding eval_experiments documents to retrieve ground truth (expectedOutput). In both cases, it creates PENDING job documents in eval_job_metrics.

Sweeper pseudocode:

// Online: poll for new root spans matching trigger filters
for each onlineTrigger:
    query otel-v1-apm-span-* WHERE startTime >= lastSweepTime
        AND parentSpanId = "" AND matchesTriggerFilter(trigger)
    for each hit (deduplicate by targetSpanId + evaluatorId):
        createPendingJob(spanId, evaluatorId, priority=HIGH)
    updateLastSweepTime(trigger)

// Offline: poll for new spans tagged with eval.experiment.run_id
for each pendingExperimentRun:
    query otel-v1-apm-span-* WHERE eval.experiment.run_id = runId
        AND parentSpanId = ""
    for each hit:
        groundTruth = fetchExperiment(eval.experiment.item_id)  // join for expectedOutput
        for each evaluator (deduplicate by targetSpanId + evaluatorId):
            createPendingJob(spanId, evaluatorId, priority=NORMAL, groundTruth)
    updateLastSweepTime(run)

Deduplication: before creating a job, the sweeper checks eval_job_metrics for an existing targetSpanId + evaluatorId combination to prevent duplicate evaluations.
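The dedup check can be sketched as a pure function over candidate hits and existing job documents (field names mirror the job schema above; the real implementation runs this as a query against eval_job_metrics rather than in memory):

```python
def dedupe_jobs(candidates: list[dict], existing_jobs: list[dict]) -> list[dict]:
    """Drop candidates whose (targetSpanId, evaluatorId) pair already has a job.

    Also deduplicates within the candidate batch itself, so one sweep cycle
    never creates two jobs for the same span/evaluator combination.
    """
    seen = {(j["targetSpanId"], j["evaluatorId"]) for j in existing_jobs}
    new_jobs = []
    for candidate in candidates:
        key = (candidate["targetSpanId"], candidate["evaluatorId"])
        if key not in seen:
            seen.add(key)
            new_jobs.append(candidate)
    return new_jobs
```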

Job Executor (EvalJobExecutor):

The executor picks up PENDING jobs, acquires a distributed lock via LockService, reads span data, and delegates to the Python Agent Service.

1. Query eval_job_metrics WHERE status=PENDING ORDER BY priority DESC, createdAt ASC
2. For each job:
   a. Acquire distributed lock via LockService (skip if another node holds it)
   b. Mark RUNNING
   c. Read target span data from OpenSearch
   d. Load EvaluatorTemplate config
   e. Send eval request to Python Agent Service (evaluatorConfig + spanData)
   f. Write score to eval_scores, mark COMPLETED
   g. On failure: retry with exponential backoff (2^retryCount * 1000ms)
      or mark FAILED if maxRetries exceeded
   h. Release lock in finally block (TTL fallback: 5min if node crashes)
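
Step g's backoff schedule is a straightforward function of the retry count:

```python
def retry_delay_ms(retry_count: int) -> int:
    """Exponential backoff on job failure: 2^retryCount * 1000ms.

    retry 0 -> 1s, retry 1 -> 2s, retry 2 -> 4s; after maxRetries the
    job is marked FAILED instead of being rescheduled.
    """
    return (2 ** retry_count) * 1000
```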

Distributed Locking: Uses Job Scheduler's LockService. Lock keyed by jobId, configurable TTL (default 5min). If a node crashes, lock expires and job returns to PENDING. Satisfies Req 18.5 (horizontal scaling without duplicate execution).

Priority Queue: Implemented at query level — PENDING jobs sorted by priority DESC, createdAt ASC. Priority values: HIGH=3, NORMAL=2, LOW=1. Online agent trace eval jobs (HIGH) are always picked up before offline batch jobs (NORMAL). Optional concurrency limits per priority level prevent batch starvation.
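The ordering the executor relies on can be sketched as follows (in the plugin this is an OpenSearch sort clause on the eval_job_metrics query, not in-memory Python):

```python
PRIORITY_RANK = {"HIGH": 3, "NORMAL": 2, "LOW": 1}

def next_batch(pending_jobs: list[dict], batch_size: int = 10) -> list[dict]:
    """Order PENDING jobs by priority DESC, createdAt ASC and take one batch."""
    ordered = sorted(
        pending_jobs,
        key=lambda job: (-PRIORITY_RANK[job["priority"]], job["createdAt"]),
    )
    return ordered[:batch_size]
```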

Batch Job Creation (Offline): When an Experiment_Run completes trace capture, the trigger sweeper detects new spans via eval.experiment.run_id tags, joins with eval_experiments for ground truth, and creates PENDING jobs with priority: NORMAL.

Plugin Configuration (opensearch.yml):

eval.scheduler.trigger_sweep_interval: "5s"     # How often sweeper polls for new matching spans
eval.scheduler.job_executor_interval: "2s"       # How often executor picks up PENDING jobs
eval.scheduler.executor_batch_size: 10           # Max jobs per executor cycle
eval.scheduler.lock_ttl_minutes: 5               # Job Scheduler LockService TTL
eval.scheduler.max_retries: 3                    # Default max retries per job
eval.scheduler.online_concurrency_limit: 20      # Max concurrent online agent trace eval jobs
eval.scheduler.offline_concurrency_limit: 50     # Max concurrent offline agent trace eval jobs
eval.scheduler.agent_service_endpoint: "http://localhost:8080"  # Python Agent Service URL
eval.scheduler.agent_service_timeout_ms: 45000   # Timeout for eval requests to agent service

Latency Budget (<60s SLA):

| Phase | Budget | Notes |
|---|---|---|
| Span indexing + refresh | ~5s | OpenSearch refresh_interval |
| Trigger sweep detection | 0–10s | Depends on sweep interval (configurable) |
| Job pickup by executor | 0–5s | Depends on executor interval |
| Lock acquisition | <100ms | Job Scheduler LockService, local cluster op |
| Span data read | <500ms | Single document fetch |
| Python Agent Service call | 5–30s | Network hop + eval library + LLM provider latency |
| Score write | <500ms | Single document index |
| Total | ~15–45s | Well within the 60s SLA for typical cases |

4. Python Agent Service (Eval Agent)

Responsibility: Hosts the Strands-based eval agent that executes LLM-as-a-Judge scoring, RAG metric computation, and any evaluation logic requiring LLM provider access. The eval-scheduler-plugin delegates all LLM-dependent evaluation work to this service.

Context: The Python Agent Service is a broader OpenSearch initiative that provides a unified Python backend for AI-powered assistants. It uses Strands SDK as the orchestration framework and follows a multi-agent pattern with a top-level orchestrator routing requests to specialized sub-agents. The eval platform registers an Eval Agent as a specialized sub-agent within this service.

Why delegate: Eval libraries (Strands Eval, DeepEval, Ragas) and LLM provider SDKs are Python-native. The Java plugin stays focused on scheduling/locking. New eval methods ship as Python library updates without touching the Java plugin. The Python Agent Service already manages LLM credentials, connection pooling, and OTel observability.

Eval Agent Architecture:

The Eval Agent is a specialized sub-agent in the Python Agent Service's agent registry. It receives requests from the eval-scheduler-plugin over an internal API (not user-facing). Using Strands SDK's @tool decorator, the agent exposes a run_evaluation tool that resolves the evaluator (library + metric + model config), executes the evaluation, and returns structured scores with an explanation and an executionTraceId linking back to the OTel trace of the eval LLM call.

Internal API (eval-scheduler-plugin → Python Agent Service):

| Method | Path | Description |
|---|---|---|
| POST | /api/eval-agent/evaluate | Execute a single evaluation |
| POST | /api/eval-agent/evaluate/batch | Execute batch evaluations |
| GET | /api/eval-agent/health | Health check |

Request/Response: The evaluate request carries evaluatorConfig (library, metric, modelConfig, outputSchema) and spanData (input, output, context, expectedOutput). For offline agent trace evaluation, expectedOutput is populated with ground truth; for online, it's null. The response returns scores (array of name/value/dataType), explanation, and executionTraceId.
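Shaped after the fields listed above, a request/response pair for POST /api/eval-agent/evaluate might look like the following. The exact JSON contract is illustrative, and all values are made up for the example:

```python
# Illustrative request body for POST /api/eval-agent/evaluate.
evaluate_request = {
    "evaluatorConfig": {
        "library": "deepeval",
        "metric": "faithfulness",
        "modelConfig": {
            "provider": "bedrock",
            "modelId": "anthropic.claude-sonnet-4-5-20250929-v1:0",
        },
        "outputSchema": {"value": "float", "explanation": "text"},
    },
    "spanData": {
        "input": "What is the refund policy?",
        "output": "Refunds are available within 30 days of purchase.",
        "context": ["Policy doc: refunds are accepted within 30 days."],
        # Populated with ground truth for offline eval; null (None) for online.
        "expectedOutput": None,
    },
}

# Illustrative response body returned by the Eval Agent.
evaluate_response = {
    "scores": [{"name": "faithfulness", "value": 0.92, "dataType": "NUMERIC"}],
    "explanation": "The answer is fully supported by the retrieved context.",
    "executionTraceId": "4bf92f3577b34da6a3ce929d0e0e4736",
}
```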

Deterministic evaluators: Deterministic evaluators (regex, JSON validity, exact match) run directly in the eval-scheduler-plugin; only LLM Judge and RAG evaluations go to the Python Agent Service.

Observability: All eval agent operations are OTel-instrumented. The executionTraceId links scores to execution traces for debugging.

Deployment: Runs as a sidecar or co-located service. Stateless — all config and state in OpenSearch indices. Horizontally scalable independently.

5. OSD Eval Plugin

Responsibility: Complete evaluation UI as an OpenSearch Dashboards plugin.

Plugin Registration:

// Core types come from OSD core/public; the exact import path depends on where the plugin lives.
import { Plugin, CoreSetup, AppMountParameters, AppNavLinkStatus } from "../../../src/core/public";

export class EvalPlugin implements Plugin {
  setup(core: CoreSetup) {
    core.application.register({
      id: "eval-platform",
      title: "LLM Evaluation",
      navLinkStatus: AppNavLinkStatus.visible,
      mount: async (params: AppMountParameters) => {
        const { renderApp } = await import("./application");
        return renderApp(params);
      },
    });
  }
}

Navigation Structure:

P0 Agent Tracing Views:

  • Agent Trace List View (root spans with aggregated metrics from traceGroupFields.genAi.*)
  • Trace List Metrics Summary (total traces, total spans, total tokens, latency P50/P99)
  • Agent Trace Timeline / Waterfall View (hierarchical span visualization with inline metadata)
  • Agent Span Detail View (operation-specific attributes: chat, tool_call, invoke_agent)

P1 Agent Tracing Views:

  • Agent Call Graph View (topology visualization of agent/LLM/tool relationships within a trace)

Eval Platform Views:

  • Sessions (list, detail -- correlated via gen_ai.conversation.id)
  • Eval Sets (CRUD, experiments management)
  • Experiment Runs (list, detail, comparison)
  • Annotation Queues (manage, review interface)
  • Scores & Analytics (distributions, agreement, trends)
  • Evaluators (LLM Judge templates, deterministic, RAG metrics)
  • Dashboards (custom charts, default dashboards -- PPL for trace views, PromQL for metric dashboards)
  • Agent Map / Agent Path (multi-trace analytics)

Agent Trace List View Columns:

Each column maps to specific OTel GenAI semantic convention attributes:

| Column | Source | Notes |
|---|---|---|
| Status | status.code, error.type | ✓ green (OK), ✗ red (ERROR) |
| Type / Kind | gen_ai.operation.name | Friendly names: Chat, Agent, Tool, Embeddings, etc. |
| Name | gen_ai.agent.name (agent), gen_ai.request.model (LLM), gen_ai.tool.name (tool) | Varies by span type |
| Token Usage | traceGroupFields.genAi.totalTokens | Pre-aggregated at ingest, not computed at query time |
| Latency | durationInNanos | Root span duration |
| Input/Output | gen_ai.input.messages, gen_ai.output.messages | Opt-in; may contain sensitive data |

Waterfall View: Inline with span bar: span name, operation type icon, latency, status. Conditional: token count (chat), model name (chat), tool name (tool_call), agent name (invoke_agent).

Span Detail View (conditional by gen_ai.operation.name): chat → Messages, Model, Tokens, Temperature, Finish Reason. execute_tool → Tool Name, Arguments, Result. invoke_agent → Agent Name/ID. Common → Trace/Span IDs, Service, Timestamps, Duration, Status.

Query Architecture: Trace list views use pre-computed traceGroupFields.genAi.* — no aggregation at query time. Metric dashboards query Prometheus.

6. Python Instrumentation Library

Responsibility: Instruments Python LLM applications using OTel GenAI conventions, provides eval set/experiment/score APIs.

Core API:

# Tracing -- emits spans with gen_ai.* attributes
@observe(name="my_function")
def my_llm_call(input: str) -> str: ...

# Experiment execution
results = client.run_experiment(
    eval_set_id="es-123",
    fn=my_llm_call,
    evaluators=[accuracy_evaluator],
    run_name="v2-prompt-test"
)

# Score submission
client.score(
    trace_id="tr-123",
    name="user_feedback",
    value=1.0,
    data_type="NUMERIC"
)

OTLP Export: Uses OpenTelemetry Python SDK. Spans carry standard gen_ai.* attributes: gen_ai.operation.name, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.input.messages, gen_ai.output.messages, etc.

7. TypeScript Instrumentation Library

Responsibility: Same as Python library but for TypeScript/Node.js.

const result = await observe({ name: "my_function" }, async (span) => {
  return await myLlmCall(input);
});

const results = await client.runExperiment({
  evalSetId: "es-123",
  fn: myLlmCall,
  evaluators: [accuracyEvaluator],
  runName: "v2-prompt-test",
});

await client.score({
  traceId: "tr-123",
  name: "user_feedback",
  value: 1.0,
  dataType: "NUMERIC",
});

Data Models

All data is stored in OpenSearch indices. The spans index uses the OTel GenAI semantic conventions directly. Evaluation-specific entities use dedicated indices under the eval_* prefix.

Spans Index (otel-v1-apm-span-*)

This index stores OTLP spans as-is, preserving all gen_ai.* attributes from the OTel semantic conventions. OTel Collector indexes spans without lossy transformation. The existing APM index template's dynamic field mapping automatically indexes gen_ai.* attributes without schema changes. Trace-group fields are pre-aggregated by OTel Collector at ingest time.

{
  "mappings": {
    "dynamic": "true",
    "properties": {
      "traceId": { "type": "keyword" },
      "spanId": { "type": "keyword" },
      "parentSpanId": { "type": "keyword" },
      "traceGroup": { "type": "keyword" },
      "name": {
        "type": "text",
        "fields": { "keyword": { "type": "keyword" } }
      },
      "kind": { "type": "keyword" },
      "startTime": { "type": "date_nanos" },
      "endTime": { "type": "date_nanos" },
      "durationInNanos": { "type": "long" },
      "status": {
        "properties": {
          "code": { "type": "keyword" },
          "message": { "type": "text" }
        }
      },
      "traceGroupFields": {
        "properties": {
          "endTime": { "type": "date_nanos" },
          "durationInNanos": { "type": "long" },
          "statusCode": { "type": "integer" },
          "genAi": {
            "properties": {
              "totalTokens": { "type": "long" },
              "inputTokens": { "type": "long" },
              "outputTokens": { "type": "long" },
              "llmCallCount": { "type": "integer" },
              "toolCallCount": { "type": "integer" },
              "errorCount": { "type": "integer" }
            }
          }
        }
      },
      "attributes": {
        "dynamic": "true",
        "properties": {
          "gen_ai.operation.name": { "type": "keyword" },
          "gen_ai.provider.name": { "type": "keyword" },
          "gen_ai.request.model": { "type": "keyword" },
          "gen_ai.response.model": { "type": "keyword" },
          "gen_ai.request.temperature": { "type": "float" },
          "gen_ai.request.max_tokens": { "type": "integer" },
          "gen_ai.request.top_p": { "type": "float" },
          "gen_ai.usage.input_tokens": { "type": "long" },
          "gen_ai.usage.output_tokens": { "type": "long" },
          "gen_ai.response.id": { "type": "keyword" },
          "gen_ai.response.finish_reasons": { "type": "keyword" },
          "gen_ai.conversation.id": { "type": "keyword" },
          "gen_ai.agent.id": { "type": "keyword" },
          "gen_ai.agent.name": { "type": "keyword" },
          "gen_ai.agent.description": { "type": "text" },
          "gen_ai.tool.name": { "type": "keyword" },
          "gen_ai.tool.type": { "type": "keyword" },
          "gen_ai.tool.call.id": { "type": "keyword" },
          "gen_ai.tool.description": { "type": "text" },
          "gen_ai.input.messages": { "type": "object", "enabled": true, "index": false },
          "gen_ai.output.messages": { "type": "object", "enabled": true, "index": false },
          "gen_ai.system_instructions": { "type": "object", "enabled": true, "index": false },
          "gen_ai.tool.definitions": { "type": "object", "enabled": false },
          "gen_ai.tool.call.arguments": { "type": "object", "enabled": true },
          "gen_ai.tool.call.result": { "type": "object", "enabled": true },
          "gen_ai.data_source.id": { "type": "keyword" },
          "gen_ai.output.type": { "type": "keyword" },
          "error.type": { "type": "keyword" },
          "server.address": { "type": "keyword" },
          "server.port": { "type": "integer" }
        }
      },
      "resource": {
        "properties": {
          "service.name": { "type": "keyword" },
          "service.version": { "type": "keyword" },
          "deployment.environment": { "type": "keyword" },
          "telemetry.sdk.name": { "type": "keyword" },
          "telemetry.sdk.version": { "type": "keyword" }
        }
      },
      "eval.experiment.run_id": { "type": "keyword" },
      "eval.experiment.set_id": { "type": "keyword" },
      "eval.experiment.item_id": { "type": "keyword" },
      "latency": { "type": "float" },
      "tags": { "type": "keyword" },
      "bookmarked": { "type": "boolean" },
      "createdAt": { "type": "date" }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 5,
      "number_of_replicas": 1,
      "refresh_interval": "5s"
    }
  }
}

Key design notes:

  • gen_ai.operation.name distinguishes span types (chat, execute_tool, embeddings, invoke_agent, ...).
  • gen_ai.conversation.id is the OTel-standard attribute for session correlation.
  • parentSpanId provides the parent-child hierarchy.
  • gen_ai.input.messages and gen_ai.output.messages are stored but not indexed ("index": false) due to their size.
  • traceGroupFields.genAi.* is pre-aggregated by OTel Collector at ingest time.
  • Dynamic mapping on attributes automatically indexes any new gen_ai.* attribute without schema changes.
  • eval.* attributes link spans to experiments.
  • totalCost is deferred from P0.
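As a concrete example of querying these attributes, the following sketch builds the body of a search that returns one trace's LLM generation spans. Field paths follow the mapping above; the trace id is illustrative, and the body would be sent via opensearch-py's `client.search()`.

```python
# Sketch: fetch all LLM-call spans of a single trace, filtering on the
# keyword-mapped gen_ai.operation.name attribute.
generation_spans_query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"traceId": "tr-123"}},  # illustrative trace id
                {"terms": {"attributes.gen_ai.operation.name": [
                    "chat", "text_completion", "generate_content",
                ]}},
            ]
        }
    },
    "sort": [{"startTime": "asc"}],
}
```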

Scores Index (eval_scores)

Source values: EVAL_ONLINE, EVAL_OFFLINE, SDK, ANNOTATION, API. Idempotency: upsert by idempotencyKey via _update with doc_as_upsert. Settings: 3 shards, 1 replica, 5s refresh.

| Field | Type | Field | Type |
|---|---|---|---|
| id | keyword | source | keyword |
| name | keyword | traceId | keyword |
| value | float | spanId | keyword |
| stringValue | keyword | sessionId | keyword |
| dataType | keyword | experimentRunId | keyword |
| authorUserId | keyword | configId | keyword |
| comment | text | queueId | keyword |
| metadata | object | executionTraceId | keyword |
| environment | keyword | timestamp | date |
| createdAt | date | updatedAt | date |
| idempotencyKey | keyword | | |
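The idempotent write described above can be sketched as follows. `score_upsert_request` is a hypothetical helper that assembles the arguments for opensearch-py's `client.update()` call, using the idempotency key as the document `_id` so repeated submissions collapse into one document.

```python
# Sketch: idempotent score write via _update with doc_as_upsert.
def score_upsert_request(score: dict) -> dict:
    return {
        "index": "eval_scores",
        "id": score["idempotencyKey"],          # _id = idempotency key
        "body": {"doc": score, "doc_as_upsert": True},
    }

req = score_upsert_request({
    "idempotencyKey": "tr-123:user_feedback",   # illustrative key
    "traceId": "tr-123",
    "name": "user_feedback",
    "value": 1.0,
    "dataType": "NUMERIC",
    "source": "API",
})
```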

Score Configs Index (eval_score_configs)

| Field | Type | Field | Type |
|---|---|---|---|
| id | keyword | isArchived | boolean |
| name | keyword | minValue | float |
| dataType | keyword | maxValue | float |
| categories | nested (label: keyword, value: float) | description | text |
| createdAt | date | updatedAt | date |

Eval Sets Index (eval_sets)

| Field | Type | Field | Type |
|---|---|---|---|
| id | keyword | metadata | object |
| name | text + keyword | inputSchema | object (not indexed) |
| description | text | expectedOutputSchema | object (not indexed) |
| createdAt | date | updatedAt | date |

Experiments Index (eval_experiments)

Versioning: updates create a new document with a new validFrom. Latest active version: filter status=ACTIVE, sort by validFrom desc per lineageId.

| Field | Type | Field | Type |
|---|---|---|---|
| id | keyword | sourceTraceId | keyword |
| evalSetId | keyword | sourceSpanId | keyword |
| input | object | status | keyword |
| expectedOutput | object | lineageId | keyword |
| metadata | object | validFrom | date |
| createdAt | date | | |
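The latest-active-version rule above can be expressed as a single field-collapse query: filter to ACTIVE, sort by validFrom descending, and collapse on lineageId so each lineage returns only its newest version. The eval set id is illustrative.

```python
# Sketch: resolve the latest active version per lineageId with field collapsing.
latest_active_query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"status": "ACTIVE"}},
                {"term": {"evalSetId": "es-123"}},   # illustrative id
            ]
        }
    },
    "collapse": {"field": "lineageId"},   # one hit per lineage
    "sort": [{"validFrom": "desc"}],      # newest version wins within each group
}
```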

Experiment Runs Index (eval_experiment_runs)

| Field | Type | Field | Type |
|---|---|---|---|
| id | keyword | description | text |
| evalSetId | keyword | metadata | object |
| name | text + keyword | createdAt | date |

Experiment Run Items Index (eval_experiment_run_items)

| Field | Type | Field | Type |
|---|---|---|---|
| id | keyword | traceId | keyword |
| experimentRunId | keyword | spanId | keyword |
| experimentId | keyword | error | text |
| createdAt | date | | |

Evaluator Templates Index (eval_evaluator_templates)

evalLibrary (e.g., deepeval, ragas, strands_eval) and evalMetric (e.g., faithfulness, geval) identify which OSS library and metric to invoke. promptTemplate is optional (custom LLM Judge only).

| Field | Type | Field | Type |
|---|---|---|---|
| id | keyword | modelConfig.provider | keyword |
| name | text + keyword | modelConfig.modelName | keyword |
| evalLibrary | keyword | modelConfig.temperature | float |
| evalMetric | keyword | modelConfig.maxTokens | integer |
| promptTemplate | text | outputSchema | nested (scoreName, dataType, valueMapping) |
| targetType | keyword | evaluationMode | keyword |
| createdAt | date | updatedAt | date |

Deterministic Evaluators Index (eval_deterministic_evaluators)

| Field | Type | Field | Type |
|---|---|---|---|
| id | keyword | evaluationMode | keyword |
| name | text + keyword | targetType | keyword |
| evaluatorType | keyword | scoreConfigId | keyword |
| configuration | object (not indexed) | createdAt | date |
| updatedAt | date | | |

Annotation Queues Index (eval_annotation_queues)

| Field | Type | Field | Type |
|---|---|---|---|
| id | keyword | scoreConfigIds | keyword |
| name | text + keyword | assignedUserIds | keyword |
| description | text | createdAt | date |
| updatedAt | date | | |

Annotation Tasks Index (eval_annotation_tasks)

Locking: optimistic concurrency via _seq_no and _primary_term. Lock release job sweeps for expired locks.

| Field | Type | Field | Type |
|---|---|---|---|
| id | keyword | lockedBy | keyword |
| queueId | keyword | lockedAt | date |
| targetId | keyword | lockTimeout | integer |
| targetType | keyword | completedBy | keyword |
| status | keyword | completedAt | date |
| createdAt | date | | |
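The optimistic lock acquisition described above might look like the following sketch. `lock_request` is a hypothetical helper that produces the arguments for an index call guarded by `if_seq_no` / `if_primary_term` (OpenSearch's optimistic-concurrency parameters); if another reviewer wrote first, the write fails with a version conflict (409) instead of overwriting the lock.

```python
import time

# Sketch: acquire an annotation-task lock with optimistic concurrency control.
def lock_request(task_id: str, seq_no: int, primary_term: int, user: str) -> dict:
    return {
        "index": "eval_annotation_tasks",
        "id": task_id,
        "if_seq_no": seq_no,              # from the read that returned the task
        "if_primary_term": primary_term,
        "body": {
            "status": "IN_REVIEW",
            "lockedBy": user,
            "lockedAt": int(time.time() * 1000),
        },
    }
```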

Job Metrics Index (eval_job_metrics)

| Field | Type | Field | Type |
|---|---|---|---|
| jobId | keyword | experimentItemId | keyword |
| jobType | keyword | expectedOutput | object |
| status | keyword | retryCount | integer |
| priority | keyword | maxRetries | integer |
| evaluatorId | keyword | error | text |
| evaluatorType | keyword | processingTimeMs | long |
| targetId | keyword | createdAt | date |
| targetType | keyword | startedAt | date |
| completedAt | date | | |

Key Data Model Relationships

erDiagram
    SPAN ||--o{ SPAN : "parent-child via parentSpanId"
    SPAN ||--o{ SCORE : "scored by"
    SESSION ||--o{ SPAN : "groups via gen_ai.conversation.id"
    EVAL_SET ||--o{ EXPERIMENT : contains
    EVAL_SET ||--o{ EXPERIMENT_RUN : "executed as"
    EXPERIMENT_RUN ||--o{ EXPERIMENT_RUN_ITEM : contains
    EXPERIMENT_RUN_ITEM ||--|| EXPERIMENT : references
    EXPERIMENT_RUN_ITEM ||--|| SPAN : "linked to via traceId"
    SCORE_CONFIG ||--o{ SCORE : validates
    EVALUATOR_TEMPLATE ||--o{ SCORE : produces
    DETERMINISTIC_EVALUATOR ||--o{ SCORE : produces
    ANNOTATION_QUEUE ||--o{ ANNOTATION_TASK : contains
    ANNOTATION_TASK ||--o{ SCORE : "produces via review"

    SPAN {
        keyword traceId
        keyword spanId
        keyword parentSpanId
        keyword gen_ai_operation_name
        keyword gen_ai_request_model
        keyword gen_ai_conversation_id
        keyword gen_ai_agent_name
        long gen_ai_usage_input_tokens
        long gen_ai_usage_output_tokens
    }
    SCORE {
        keyword id
        keyword name
        float value
        keyword dataType
        keyword source
        keyword traceId
        keyword spanId
        keyword experimentRunId
    }
    EVAL_SET {
        keyword id
        keyword name
        object inputSchema
        object expectedOutputSchema
    }
    EXPERIMENT {
        keyword id
        keyword evalSetId
        object input
        object expectedOutput
        keyword status
    }
    EXPERIMENT_RUN {
        keyword id
        keyword evalSetId
        keyword name
    }
    EXPERIMENT_RUN_ITEM {
        keyword id
        keyword experimentRunId
        keyword experimentId
        keyword traceId
        text error
    }

OTel GenAI Convention to Platform Concept Mapping

This table shows how the platform's UI concepts map to OTel span attributes, eliminating the need for a custom schema translation layer:

| Platform UI Concept | OTel Span Attribute | Notes |
|---|---|---|
| Trace (root) | Span where parentSpanId is null or gen_ai.operation.name = invoke_agent | Root span of a trace |
| Observation (child) | Any child span within a trace | Linked via parentSpanId |
| Generation | gen_ai.operation.name in (chat, text_completion, generate_content) | LLM inference call |
| Tool call | gen_ai.operation.name = execute_tool | Tool/function execution |
| Embedding | gen_ai.operation.name = embeddings | Vector embedding call |
| Retrieval | gen_ai.operation.name = execute_tool with gen_ai.tool.type = datastore | RAG retrieval step |
| Session | gen_ai.conversation.id | Groups related traces |
| Model | gen_ai.request.model / gen_ai.response.model | Model identifier |
| Provider | gen_ai.provider.name | e.g., openai, aws.bedrock, anthropic |
| Input tokens | gen_ai.usage.input_tokens | Token count |
| Output tokens | gen_ai.usage.output_tokens | Token count |
| Agent name | gen_ai.agent.name | Human-readable agent identifier |
| Agent ID | gen_ai.agent.id | Unique agent identifier |
| Tool name | gen_ai.tool.name | Tool identifier |
| Environment | resource.deployment.environment | OTel resource attribute |
| Service | resource.service.name | OTel resource attribute |
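The dispatch rules in the table above can be captured in a small classifier. The function below is an illustrative sketch: the attribute names are the OTel GenAI conventions, but the concept labels it returns are hypothetical, not a platform API.

```python
# Sketch: map a span's gen_ai.* attributes to a platform UI concept.
def classify_span(attrs: dict) -> str:
    op = attrs.get("gen_ai.operation.name")
    if op in ("chat", "text_completion", "generate_content"):
        return "generation"
    if op == "execute_tool":
        # datastore tools are treated as RAG retrieval steps
        if attrs.get("gen_ai.tool.type") == "datastore":
            return "retrieval"
        return "tool_call"
    if op == "embeddings":
        return "embedding"
    if op == "invoke_agent":
        return "trace_root"
    return "observation"  # any other child span
```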

OTel Evaluation Event Alignment

The OTel GenAI conventions define a gen_ai.evaluation.result event for capturing evaluation results. The platform's Score documents align with this event:

| OTel Event Attribute | Platform Score Field | Notes |
|---|---|---|
| gen_ai.evaluation.name | name | Score metric name |
| gen_ai.evaluation.score.value | value | Numeric score |
| gen_ai.evaluation.score.label | stringValue | Categorical/boolean label |
| gen_ai.evaluation.explanation | comment | Evaluator reasoning |
| gen_ai.response.id | traceId / spanId | Links score to evaluated span |

When scores are submitted via OTLP as gen_ai.evaluation.result events (Local evaluation mode), the OTel Collector maps them to eval_scores documents. When scores are submitted via the REST API, the platform stores them directly.
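The collector-side translation can be sketched as a pure mapping function. `evaluation_event_to_score` is a hypothetical helper; the field pairing follows the alignment table above.

```python
# Sketch: map a gen_ai.evaluation.result event's attributes to an
# eval_scores document (Local evaluation mode, hence source=SDK).
def evaluation_event_to_score(event_attrs: dict, trace_id: str, span_id: str) -> dict:
    return {
        "name": event_attrs.get("gen_ai.evaluation.name"),
        "value": event_attrs.get("gen_ai.evaluation.score.value"),
        "stringValue": event_attrs.get("gen_ai.evaluation.score.label"),
        "comment": event_attrs.get("gen_ai.evaluation.explanation"),
        "traceId": trace_id,
        "spanId": span_id,
        "source": "SDK",
    }
```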

Correctness Properties

Each property is universally quantified and suitable for property-based testing.

| # | Property | Description | Reqs |
|---|---|---|---|
| 1 | OTLP Telemetry Round-Trip | Ingesting valid OTLP spans and reading back produces equivalent gen_ai.* attribute values | 1.2, 1.3, 13.8, 14.8 |
| 2 | Span Hierarchy Preservation | N spans with parent-child tree → N stored documents, valid parentSpanId refs, isomorphic tree | 1.4, 1.6 |
| 3 | Malformed Payload Rejection | Missing required fields → rejected, span count unchanged | 1.5 |
| 4 | Score Validation Against Config | Score accepted iff config exists, not archived, dataType matches, value in range, category valid | 3.2, 3.3, 3.5, 10.3 |
| 5 | Score Idempotency | N submissions with same idempotency key → exactly one document with latest values | 3.6, 10.4 |
| 6 | JSON Schema Validation | Experiments accepted iff input/expectedOutput validate against Eval_Set schemas | 4.3, 4.4, 4.5 |
| 7 | Experiment Versioning | K updates → K+1 documents with distinct validFrom; latest active query returns one per lineageId | 4.6, 4.7 |
| 8 | Experiment Archival Exclusion | N active + M archived → default query returns exactly N | 4.8 |
| 9 | Unique Name Enforcement | Duplicate Eval_Set or Experiment_Run names rejected | 4.9, 5.5 |
| 10 | Run Error Isolation | K of N items fail → K error items, N-K valid traceIds, no early abort | 5.6, 6.5 |
| 11 | Run Summary Correctness | Accurate totalItems/successfulItems/failedItems; aggregates over successful only | 5.3, 6.6 |
| 12 | SDK Experiment Execution | N items + E evaluators → 1 run, N items, N traces, N×E scores | 6.1–6.4 |
| 13 | Evaluation Mode Enforcement | OFFLINE evaluator on online trigger → rejected; ONLINE templates exclude {{expectedOutput}} | 8.8–8.10, 21.3, 22.5, 22.6 |
| 14 | LLM Judge Score Production | Valid library response → 1 Score + 1 execution trace; invalid → no Score | 8.3–8.5 |
| 15 | Multi-Criteria Evaluation | K score mappings → exactly K Score documents from one LLM call | 8.11 |
| 16 | Annotation Task Locking | Opened task locked; no concurrent acquisition; expired lock → PENDING | 9.4, 9.7 |
| 17 | Annotation Score Creation | Completed task → source=ANNOTATION, correct authorUserId, validates against config | 9.5, 9.6 |
| 18 | Multiple Scores Per Target | K scores of different names on one span → all K independently stored and queryable | 10.5 |
| 19 | Inter-Rater Agreement | Pearson/Spearman for NUMERIC, Cohen's Kappa for CATEGORICAL, F1 for BOOLEAN | 11.2, 11.3, 11.5 |
| 20 | Instrumentation Capture | Decorated call → span with input/output messages, valid timestamps, errors in error.type | 13.1–13.3 |
| 21 | Span Filtering Correctness | Every returned span satisfies all predicates; no matching span excluded | 15.2, 15.5 |
| 22 | Span Tree Reconstruction | parentSpanId tree → valid tree with one root, no cycles | 15.3 |
| 23 | Session Ordering | Same gen_ai.conversation.id → all spans returned in chronological order | 16.1, 16.2 |
| 24 | Session Aggregates | N root spans → correct trace count, total latency, score summaries | 16.3, 16.4 |
| 25 | Job Retry/Failure | Failed job retried with exponential backoff; exceeds maxRetries → FAILED | 18.3, 18.4 |
| 26 | Job Priority Ordering | All HIGH jobs picked up before NORMAL | 18.7 |
| 27 | Batch Job Count | N items × E evaluators → N×E jobs | 18.8 |
| 28 | Critical Path | Valid root-to-leaf path with greatest cumulative latency | 19.6 |
| 29 | Parallel Branch Detection | K spans sharing parentSpanId identified as parallel | 19.7 |
| 30 | Agent Map Aggregates | Edge/node metrics correctly computed as aggregates across matching traces | 20.3, 20.4 |
| 31 | Agent Path Extraction | Identical paths aggregated; flow widths sum to total trace count | 20.8 |
| 32 | Deterministic Evaluator Correctness | Each evaluator produces correct result per spec (exact match, regex, JSON, etc.) | 21.1 |
| 33 | Deterministic No Execution Trace | Deterministic evaluator → no execution trace created | 21.8 |
| 34 | RAG Context Extraction | execute_tool + datastore spans → RAG_Context = gen_ai.tool.call.result | 22.1 |
| 35 | RAG Score Range | Score value in [0.0, 1.0] | 22.2 |
| 36 | RAG Template Variables | {{contexts}} from datastore tool results, {{question}} from root input | 22.3, 22.4 |
| 37 | Annotation Bulk Creation | M items added to queue → M PENDING tasks | 9.3 |
| 38 | Entity CRUD Round-Trip | Create + read back → matching fields | 3.1, 3.4, 4.1, 4.2, 5.1, 5.2, 8.1, 9.1, 9.2, 10.1, 10.2, 21.2 |
| 39 | Job Metrics Recording | Completed/failed job → metrics document with accurate processingTimeMs, status, retryCount | 18.6 |

Error Handling

Ingestion Errors

| Error | Handling | Req |
|---|---|---|
| Malformed OTLP payload | Reject at OTel Collector receiver, return gRPC/HTTP error, log to dead-letter index | 1.5 |
| Missing required fields (traceId, spanId, timestamps) | Drop document, log warning | 1.2, 1.3 |
| OTel Collector to OpenSearch write failure | Retry with backoff, DLQ after max retries | 1.1 |

Score Validation Errors

| Error | Handling | Req |
|---|---|---|
| Score value outside config range | Reject with 400, message: "Value {v} outside range [{min}, {max}]" | 3.3 |
| Score dataType mismatch with config | Reject with 400, message: "DataType {actual} does not match config {expected}" | 3.2 |
| Referenced configId not found | Reject with 404, message: "Score config {id} not found" | 3.5 |
| Referenced configId is archived | Reject with 400, message: "Score config {id} is archived" | 3.5 |

Eval Set / Experiment Errors

| Error | Handling | Req |
|---|---|---|
| Duplicate eval set name | Reject with 409, message: "Eval set '{name}' already exists" | 4.9 |
| Experiment input fails JSON Schema | Reject with 400, message with JSON Schema validation errors | 4.5 |
| Duplicate experiment run name | Reject with 409 | 5.5 |

Evaluation Errors

| Error | Handling | Req |
|---|---|---|
| OSS eval library returns invalid score | Mark job FAILED, no score created, error logged | 8.5 |
| LLM provider timeout/error | Retry with exponential backoff up to maxRetries | 18.3 |
| Job exceeds max retries | Mark FAILED, record error for operator review | 18.4 |
| OFFLINE evaluator assigned to online agent trace trigger | Reject with 400, descriptive error about Ground_Truth requirement | 8.10 |
| Experiment item function throws | Record error on Experiment_Run_Item, continue remaining items | 5.6, 6.5 |
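The retry schedule referenced above can be sketched as capped exponential backoff, with the job marked FAILED once retryCount reaches maxRetries. The base delay and cap below are illustrative; the design does not specify them.

```python
# Sketch: exponential backoff with a ceiling, plus the terminal-failure check.
def next_retry_delay_ms(retry_count: int, base_ms: int = 1000, cap_ms: int = 60000) -> int:
    """Delay before attempt retry_count + 1: base * 2^retry_count, capped."""
    return min(cap_ms, base_ms * (2 ** retry_count))

def should_fail(retry_count: int, max_retries: int) -> bool:
    """True once the job has exhausted its retry budget."""
    return retry_count >= max_retries
```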

Annotation Errors

| Error | Handling | Req |
|---|---|---|
| Concurrent lock attempt on same task | Return 409, reviewer sees "task already in review" | 9.4 |
| Lock timeout exceeded | Background job releases lock, task returns to PENDING | 9.7 |
| Annotation score fails config validation | Reject with 400, reviewer sees validation error | 9.6 |

Testing Strategy

Dual Testing Approach

  • Unit tests: Verify specific examples, edge cases, integration points, and error conditions
  • Property-based tests: Verify universal properties across randomly generated inputs

Property-Based Testing Configuration

  • Library: fast-check for TypeScript components (OSD Plugin, TypeScript SDK), Hypothesis for Python SDK
  • Minimum iterations: 100 per property test
  • Tag format: Feature: opensearch-eval-platform, Property {N}: {property_title}
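As an example of the property style, here is Property 5 (score idempotency) written as a stdlib-only sketch. The real suites would use Hypothesis or fast-check as noted above; the in-memory dict stands in for the eval_scores index, and the 100-iteration loop mirrors the minimum-iterations setting.

```python
import random

def upsert(store: dict, score: dict) -> None:
    store[score["idempotencyKey"]] = score  # doc_as_upsert semantics

def check_score_idempotency(values: list) -> None:
    """Property 5: N submissions with one key -> one document, latest value."""
    store = {}
    for v in values:
        upsert(store, {"idempotencyKey": "k1", "name": "acc", "value": v})
    assert len(store) == 1                     # exactly one document
    assert store["k1"]["value"] == values[-1]  # latest value wins

# 100 randomly generated cases, per the configuration above
for _ in range(100):
    check_score_idempotency([random.random() for _ in range(random.randint(1, 20))])
```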

Test Organization

Property-based tests are distributed across components: OTel pipeline (P1-3), score validation (P4-5, P18), JSON schema (P6), experiment versioning (P7-8), name enforcement (P9), experiment runner (P10-12), eval mode (P13), LLM Judge (P14-15), annotations (P16-17, P37), agreement metrics (P19), instrumentation (P20), span queries (P21), span tree (P22, P28-29), sessions (P23-24), job scheduler (P25-27, P39), agent map/path (P30-31), deterministic evaluators (P32-33), RAG (P34-36), entity CRUD (P38). TypeScript components use fast-check + Jest; Python SDK uses Hypothesis + pytest.

Test Independence

All tests must be decoupled. Each it or test block runs independently and concurrently. Tests must never depend on the action or outcome of previous or subsequent tests. No shared mutable state between tests.
