
[RFC] Agentic AI Eval Platform: Agent-eval-scheduler plugin Design #2599

@lezzago

Description

Purpose & Motivation

The Agentic AI Eval Platform is built natively on OpenSearch, using OTel Collector for span ingestion and OpenSearch indices as the sole data store. The platform needs an asynchronous evaluation engine that detects newly ingested spans, runs evaluations (LLM-as-a-Judge, RAG metrics, deterministic checks), and writes scores back to OpenSearch — all without manual intervention.

OpenSearch Job Scheduler provides the scheduling infrastructure (interval triggers, distributed locking, job persistence), but it is an SPI framework — it requires a consumer plugin to define job types and execution logic. No existing OpenSearch plugin provides evaluation-specific orchestration. The agent-eval-scheduler plugin fills this gap: it implements the Job Scheduler SPI to schedule two recurring jobs that bridge span ingestion and evaluation execution.

The plugin's core architectural principle is the connection-based evaluation routing model. Each evaluation request is routed to a specific backend (Python Agent Service or ML Commons) using a specific protocol (REST or AG-UI) as determined by the Eval Agent Connection linked in the trigger configuration. There are no global backend or protocol settings — the connection is the single source of truth for how an evaluation is executed.


Key Concepts

  • Eval Agent Connection — A registered connection to an evaluation backend, defining the backend type (PYTHON_AGENT_SERVICE or ML_COMMONS), communication protocol (REST or AGUI), endpoint, and timeout. Stored in the eval_agent_connections index.

  • Eval Search Filter — A saved query configuration that matches spans by criteria (agent name, operation, tags) and links them to evaluators via evaluator assignments. Each assignment pairs an evaluator with a specific Eval Agent Connection, determining the backend and protocol at execution time.

  • Job Document — A document in the eval_job_metrics index representing a single evaluation job. Tracks status (PENDING/RUNNING/COMPLETED/FAILED), priority, retry count, and timing metadata.

  • Evaluator Template — A configuration specifying which OSS library, metric, model, and parameters to use for an LLM-dependent evaluation. Does not specify backend or protocol — those come from the connection.

  • Deterministic Evaluator — An evaluator that runs directly in the plugin without external calls (regex match, JSON validity, exact match, contains). Produces scores with minimal latency.


Architecture Overview

```mermaid
graph TB
    subgraph "OpenSearch Cluster"
        subgraph "Every Data Node"
            PLUGIN[agent-eval-scheduler-plugin]
            TS[Trigger Sweeper<br/><i>polls for new spans</i>]
            JE[Job Executor<br/><i>runs evaluations</i>]
            DET[Deterministic Engine<br/><i>regex, JSON, exact match</i>]
            PLUGIN --> TS
            PLUGIN --> JE
            JE --> DET
        end

        JS[Job Scheduler SPI<br/><i>scheduling + LockService</i>]
        PLUGIN -->|registers via SPI| JS

        subgraph "Indices"
            SPANS[otel-v1-apm-span-*]
            SCORES[eval_scores]
            JOBS[eval_job_metrics]
            CONNS[eval_agent_connections]
            FILTERS[eval_search_filters]
            TEMPLATES[eval_evaluator_templates]
        end
    end

    subgraph "Evaluation Backends"
        PAS[Python Agent Service<br/><i>Strands-based, external</i>]
        MLC[ML Commons Agent Framework<br/><i>in-cluster</i>]
    end

    LLM[LLM Providers<br/><i>Bedrock / OpenAI / Anthropic</i>]

    TS -->|poll new spans| SPANS
    TS -->|read trigger configs| FILTERS
    TS -->|create PENDING jobs| JOBS

    JE -->|pick up PENDING jobs| JOBS
    JE -->|read span data| SPANS
    JE -->|resolve connection| CONNS
    JE -->|load evaluator config| TEMPLATES
    JE -->|write scores| SCORES

    JE -->|REST or AG-UI| PAS
    JE -->|Execute Agent API or AG-UI| MLC

    PAS --> LLM
    MLC --> LLM
```

The plugin runs on every data node. Job Scheduler's LockService ensures each job is executed by exactly one node — no external coordination required.


How It Works

1. Set up connections. An operator registers one or more Eval Agent Connections via the plugin's REST API (/_plugins/_eval/connections). Each connection defines a backend type, protocol, endpoint, and timeout. For example: a Python Agent Service over REST at http://eval-agent:8080, or an ML Commons agent via AG-UI with agent ID agent-abc123.
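A connection registration request might look like the following. The field names come from the Eval Agent Connection data model below; the exact request shape is illustrative, not the final API contract.

```json
{
  "name": "python-eval-rest",
  "backendType": "PYTHON_AGENT_SERVICE",
  "protocol": "REST",
  "endpoint": "http://eval-agent:8080",
  "timeoutMs": 30000
}
```

Posted to /_plugins/_eval/connections, this would register a Python Agent Service backend reachable over synchronous REST.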

2. Create evaluation triggers. The operator creates Eval Search Filters (/_plugins/_eval/search-filters) that define which spans to evaluate and how. Each filter specifies span matching criteria (e.g., gen_ai.agent.name = "my-agent"), an evaluation mode (online or offline), and a list of evaluator assignments. Each assignment pairs an evaluator template with a specific connection — different evaluators in the same filter can use different backends.
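A filter body might look like the sketch below, using the field names from the Eval Search Filter data model. The evaluator and connection IDs are hypothetical; note that the two assignments route through different connections, illustrating per-evaluator backend routing.

```json
{
  "name": "my-agent-quality",
  "evaluationMode": "ONLINE",
  "spanMatchCriteria": { "agentName": "my-agent" },
  "evaluatorAssignments": [
    { "evaluatorId": "faithfulness-judge", "connectionId": "conn-python-rest" },
    { "evaluatorId": "answer-relevance",   "connectionId": "conn-mlcommons-agui" }
  ]
}
```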

3. Trigger Sweeper detects spans. The Trigger Sweeper runs on a configurable interval (default 5s). It loads all active Eval Search Filters, queries otel-v1-apm-span-* for root spans indexed since each filter's lastSweepTime watermark, and creates PENDING Job Documents for each span × evaluator match. Online jobs get priority: HIGH; offline jobs get priority: NORMAL. Before creating a job, the sweeper checks for existing targetSpanId + evaluatorId combinations to prevent duplicates.
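The duplicate check can be sketched as follows. This is an in-memory illustration with made-up names; the actual plugin would query eval_job_metrics for existing targetSpanId + evaluatorId pairs rather than hold a Set.

```java
import java.util.*;

public class SweepDedup {
    /**
     * Given the set of already-seen "spanId|evaluatorId" keys and a list of
     * candidate {targetSpanId, evaluatorId} pairs, return only the pairs
     * that need a new PENDING job. Duplicates are skipped regardless of
     * the existing job's status.
     */
    public static List<String[]> newJobs(Set<String> existingKeys,
                                         List<String[]> candidates) {
        List<String[]> fresh = new ArrayList<>();
        for (String[] c : candidates) {      // c = {targetSpanId, evaluatorId}
            String key = c[0] + "|" + c[1];
            if (existingKeys.add(key)) {     // add() returns false if already present
                fresh.add(c);
            }
        }
        return fresh;
    }
}
```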

4. Job Executor processes jobs. The Job Executor runs on a configurable interval (default 2s). It queries for PENDING jobs ordered by priority (HIGH before NORMAL before LOW), acquires a distributed lock per job, and executes the evaluation. For LLM-dependent evaluators, it resolves the Eval Agent Connection from the job, selects the matching backend implementation, and calls it using the connection's protocol. For deterministic evaluators, it runs the check in-plugin (no external call). On success, it writes scores to eval_scores and marks the job COMPLETED. On failure, it retries with exponential backoff (2^retryCount × 1000ms) up to a configurable max.
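The backoff formula above (2^retryCount × 1000ms) can be written as a small helper. Class and method names are illustrative, not taken from the plugin source.

```java
public class EvalRetryBackoff {
    static final long BASE_MS = 1000L;

    /** Milliseconds to wait before the next attempt: 2^retryCount * 1000ms. */
    public static long nextDelayMs(int retryCount) {
        return (1L << retryCount) * BASE_MS;
    }

    /** Epoch millis at which the job becomes eligible for pickup again. */
    public static long nextEligibleTime(long nowMs, int retryCount) {
        return nowMs + nextDelayMs(retryCount);
    }
}
```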

5. Deterministic evaluators skip the network. Simple checks — exact match, regex, JSON validity, contains — execute directly in the Java plugin. No backend call, no LLM invocation. This keeps latency minimal for straightforward quality gates.
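Three of the four deterministic checks can be sketched directly; JSON validity is omitted here since it would use a JSON parser (e.g., Jackson) rather than plain string operations. Names are illustrative, not the plugin's actual evaluator classes.

```java
import java.util.regex.Pattern;

public class DeterministicChecks {
    /** Exact string equality between actual output and expected output. */
    public static boolean exactMatch(String actual, String expected) {
        return actual.equals(expected);
    }

    /** Substring containment check. */
    public static boolean contains(String actual, String needle) {
        return actual.contains(needle);
    }

    /** True if the pattern matches anywhere in the actual output. */
    public static boolean regexMatch(String actual, String pattern) {
        return Pattern.compile(pattern).matcher(actual).find();
    }
}
```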


Evaluation Backends

The plugin uses an EvaluationBackend abstraction with two implementations, making backends interchangeable from the Job Executor's perspective.

Python Agent Service — An external Python service hosting a Strands-based eval agent. It invokes OSS evaluation libraries (Strands Eval, DeepEval, Ragas) and manages LLM provider connections. The plugin communicates with it over REST (synchronous HTTP) or AG-UI (streaming events), depending on the connection's protocol setting.

ML Commons Agent Framework — OpenSearch's native ML framework, running in-cluster. The plugin sends evaluation requests via the Execute Agent API (_plugins/_ml/agents/{agent_id}/_execute) or opens an AG-UI stream. No external service deployment required.

Both backends support both protocols. The protocol is a property of the connection, not a global setting — an operator can have one connection using REST and another using AG-UI, even to the same backend type. The Job Executor calls evaluateRest() or evaluateAgui() based on the resolved connection, and both return the same normalized score format.
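The interface described above might look like the following sketch. The type and method names are assumptions based on this RFC (the source names evaluateRest() and evaluateAgui()); the real signatures will be defined during implementation.

```java
public class BackendSketch {
    enum Protocol { REST, AGUI }

    /** Normalized score shape returned by every backend/protocol combination. */
    record EvalScore(String evaluatorId, double score) {}

    interface EvaluationBackend {
        EvalScore evaluateRest(String evaluatorId, String spanJson);
        EvalScore evaluateAgui(String evaluatorId, String spanJson);

        /** Dispatch on the protocol resolved from the Eval Agent Connection. */
        default EvalScore evaluate(Protocol protocol, String evaluatorId, String spanJson) {
            return protocol == Protocol.REST
                    ? evaluateRest(evaluatorId, spanJson)
                    : evaluateAgui(evaluatorId, spanJson);
        }
    }
}
```

The Job Executor only depends on the interface; adding a third backend would not change the execution loop.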


Data Models

The plugin manages three core data models, each stored in its own OpenSearch index.

Entity Relationships

```mermaid
erDiagram
    EVAL_SEARCH_FILTER ||--o{ EVALUATOR_ASSIGNMENT : contains
    EVALUATOR_ASSIGNMENT }o--|| EVAL_AGENT_CONNECTION : "routes via"
    EVALUATOR_ASSIGNMENT }o--|| EVALUATOR_TEMPLATE : references
    EVAL_SEARCH_FILTER ||--o{ JOB_DOCUMENT : "triggers creation of"
    JOB_DOCUMENT }o--|| EVAL_AGENT_CONNECTION : "resolved at execution"

    EVAL_SEARCH_FILTER {
        keyword id
        text name
        keyword evaluationMode
        object spanMatchCriteria
        nested evaluatorAssignments
        date lastSweepTime
    }

    EVAL_AGENT_CONNECTION {
        keyword id
        text name
        keyword backendType
        keyword protocol
        keyword endpoint
        integer timeoutMs
        keyword status
    }

    JOB_DOCUMENT {
        keyword jobId
        keyword jobType
        keyword status
        integer priority
        keyword evaluatorId
        keyword connectionId
        keyword targetSpanId
        integer retryCount
        date nextEligibleTime
    }

    EVALUATOR_TEMPLATE {
        keyword id
        keyword library
        keyword metric
        object modelConfig
    }
```

Eval Agent Connection (eval_agent_connections)

Represents a registered connection to an evaluation backend.

| Field | Description |
| --- | --- |
| id | Unique identifier |
| name | Human-readable connection name |
| backendType | PYTHON_AGENT_SERVICE or ML_COMMONS |
| protocol | REST or AGUI |
| endpoint | For Python Agent Service: HTTP URL (e.g., http://eval-agent:8080). For ML Commons: agent ID (e.g., agent-abc123) |
| timeoutMs | Request timeout in milliseconds |
| status | ACTIVE or INACTIVE — only active connections can be used for new jobs |

Eval Search Filter (eval_search_filters)

Defines which spans to evaluate and how, linking span criteria to evaluators via connections.

| Field | Description |
| --- | --- |
| id | Unique identifier |
| name | Human-readable filter name |
| evaluationMode | ONLINE (real-time) or OFFLINE (batch/experiment) |
| spanMatchCriteria | Query criteria: agent name, operation name, tags, custom filters |
| evaluatorAssignments | List of { evaluatorId, connectionId } pairs — each assignment routes an evaluator through a specific connection |
| lastSweepTime | Per-filter watermark tracking the most recent sweep position |

Job Document (eval_job_metrics)

Represents a single evaluation job created by the Trigger Sweeper and processed by the Job Executor.

| Field | Description |
| --- | --- |
| jobId | Unique identifier |
| jobType | online_agent_trace_eval, offline_agent_trace_eval_item, offline_agent_trace_eval_run, or annotation_lock_release |
| status | Current state: PENDING, RUNNING, COMPLETED, or FAILED |
| priority | HIGH (3) for online, NORMAL (2) for offline, LOW (1) for annotation lock release |
| evaluatorId | Reference to the evaluator template |
| connectionId | Reference to the Eval Agent Connection used for execution |
| targetSpanId | The span being evaluated |
| retryCount | Number of retry attempts so far |
| nextEligibleTime | Earliest time the job can be picked up (supports exponential backoff) |

Job Status Lifecycle

```mermaid
stateDiagram-v2
    [*] --> PENDING: Trigger Sweeper creates job
    PENDING --> RUNNING: Job Executor acquires lock
    RUNNING --> COMPLETED: Evaluation succeeds
    RUNNING --> FAILED: Max retries exceeded
    RUNNING --> PENDING: Transient failure, retry eligible
    note right of PENDING: nextEligibleTime = now + 2^retryCount × 1000ms
    FAILED --> [*]
    COMPLETED --> [*]
```

Plugin Name Suggestions

| Name | Rationale |
| --- | --- |
| agent-eval-scheduler-plugin (current) | Descriptive, clear purpose. Slightly generic — "scheduler" understates the evaluation execution responsibility. |
| eval-engine-plugin | Emphasizes that the plugin is the evaluation execution engine, not just a scheduler. Captures both orchestration and execution. |
| eval-orchestrator-plugin | Highlights the orchestration role (sweep → create jobs → delegate → write scores). Aligns with the multi-step pipeline nature. |
| eval-worker-plugin | Emphasizes the async worker pattern. Common in job-processing systems. |
| eval-pipeline-plugin | Captures the sweep → detect → evaluate → score pipeline. May conflict with OpenSearch's existing "pipeline" concept (ingest pipelines). |
| eval-runner-plugin | Simple, action-oriented. "Runner" aligns with Job Scheduler's ScheduledJobRunner SPI interface. |

Recommendation: eval-engine-plugin — it best captures the dual responsibility of job orchestration and evaluation execution, and distinguishes the plugin from a pure scheduling wrapper.


Key Design Decisions

  1. Job Scheduler SPI as infrastructure — The plugin does not implement its own scheduling, locking, or job persistence. It implements the three SPI interfaces (JobSchedulerExtension, ScheduledJobParameter, ScheduledJobRunner) and lets Job Scheduler handle the rest.

  2. EvaluationBackend abstraction — A common interface with two implementations (Python Agent Service, ML Commons), each supporting both REST and AG-UI protocols. New backends can be added without changing core job execution logic.

  3. Priority-based job pickup — Jobs are queried by priority DESC, createdAt ASC, ensuring online evaluations (HIGH) always execute before offline batch work (NORMAL) and annotation lock releases (LOW).

  4. Deduplication at creation time — The Trigger Sweeper checks for existing targetSpanId + evaluatorId combinations before creating jobs, preventing duplicate evaluations regardless of job status.

  5. Deterministic evaluators run in-plugin — Simple evaluations (regex, JSON validity, exact match, contains) execute directly in Java without external service calls, minimizing latency.

  6. Horizontal scaling via LockService — Every data node runs the plugin. Job Scheduler's distributed LockService ensures exactly-once execution per job across the cluster. Throughput scales linearly with cluster size.
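The priority-based pickup in decision 3 can be expressed as an OpenSearch query against eval_job_metrics. This is a sketch using the field names from the Job Document model; the plugin's actual query may differ.

```json
{
  "query": {
    "bool": {
      "filter": [
        { "term":  { "status": "PENDING" } },
        { "range": { "nextEligibleTime": { "lte": "now" } } }
      ]
    }
  },
  "sort": [
    { "priority":  { "order": "desc" } },
    { "createdAt": { "order": "asc" } }
  ]
}
```

The range filter on nextEligibleTime is what makes exponential backoff work: a retried job simply stays invisible to the executor until its backoff window elapses.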


Configuration

All scheduling behavior is configurable via opensearch.yml under the eval.scheduler.* namespace, including sweep intervals, executor batch size, lock TTL, retry limits, and per-type concurrency limits. Sensible defaults are provided (e.g., 5s sweep interval, 2s executor interval, 3 max retries). Invalid values are rejected at startup with descriptive errors.
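A hypothetical opensearch.yml fragment is shown below. Only the 5s sweep interval, 2s executor interval, and 3 max retries come from this RFC; the setting names and the remaining values are placeholders to be finalized during implementation.

```yaml
# Illustrative settings only -- names are not final
eval.scheduler.sweep_interval: 5s        # Trigger Sweeper poll interval (default per RFC)
eval.scheduler.executor_interval: 2s     # Job Executor poll interval (default per RFC)
eval.scheduler.max_retries: 3            # Max retry attempts (default per RFC)
eval.scheduler.executor_batch_size: 50   # Placeholder value
eval.scheduler.lock_ttl: 60s             # Placeholder value
```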


FAQ - New Repository Proposal: agent-eval-scheduler-plugin

This section proposes creating a new repository under the opensearch-project GitHub organization to host the plugin described in this RFC. The content below addresses the questions from the opensearch-project proposal template.

What are you proposing?

A new repository under opensearch-project for the agent-eval-scheduler-plugin — the Java-based OpenSearch plugin described in this RFC. The plugin implements asynchronous evaluation job orchestration for the Agentic AI Eval Platform by consuming the OpenSearch Job Scheduler SPI. It is a core backend component of the platform described in the high-level design RFC (dashboards-observability#2592).

What users have asked for this feature?

  • The Agentic AI Eval Platform high-level design RFC identifies the need for an async evaluation engine that runs natively within OpenSearch.
  • The observability community has expressed interest in LLM evaluation capabilities integrated into the OpenSearch ecosystem — specifically for scoring agent traces using LLM-as-a-Judge, RAG metrics, and deterministic checks.
  • Existing evaluation tools (DeepEval, Ragas, Strands Eval) run externally and require separate infrastructure. Users want evaluation orchestration that leverages OpenSearch's native scheduling, indexing, and distributed execution capabilities.

What problems are you trying to solve?

When agent traces are ingested into OpenSearch via OTel Collector, there is no automated way to evaluate their quality (correctness, faithfulness, relevance) without external orchestration. OpenSearch Job Scheduler provides scheduling infrastructure but is an SPI framework — it requires a consumer plugin to define job types and execution logic. No existing OpenSearch plugin provides evaluation-specific orchestration.

When new agent traces are ingested into OpenSearch, a platform operator wants to automatically detect and evaluate those traces using configured evaluators and backends, so they get quality scores written back to OpenSearch without manual intervention or external orchestration.

What is the developer experience going to be?

The plugin exposes REST APIs under the /_plugins/_eval/ namespace:

  • /_plugins/_eval/connections — CRUD for Eval Agent Connections (register evaluation backends with backend type, protocol, endpoint, timeout)
  • /_plugins/_eval/search-filters — CRUD for Eval Search Filters (configure span matching criteria and evaluator-to-connection assignments)

No changes to existing OpenSearch APIs. The plugin depends on the Job Scheduler plugin (existing SPI dependency). All configuration is via opensearch.yml under the eval.scheduler.* namespace. See the Architecture Overview and Data Models sections above for full details.

Security considerations

  • The plugin integrates with OpenSearch's security plugin for index-level access control. All plugin indices (eval_job_metrics, eval_agent_connections, eval_search_filters) are subject to standard OpenSearch security policies.
  • Eval Agent Connection endpoints (external URLs, agent IDs) are stored as configuration — operators control which backends are reachable.
  • No new authentication mechanisms are introduced; the plugin relies on OpenSearch's existing security model.

Breaking changes to the API

None. This is a new plugin with new REST endpoints. No existing OpenSearch APIs are modified.

What is the user experience going to be?

  1. Operator installs the plugin alongside Job Scheduler.
  2. Operator registers Eval Agent Connections via REST API (e.g., a Python Agent Service over REST, or an ML Commons agent via AG-UI).
  3. Operator creates Eval Search Filters that define which spans to evaluate and which evaluator + connection to use.
  4. The plugin automatically detects new spans, creates evaluation jobs, executes them via the configured backends, and writes scores to eval_scores.
  5. Operators monitor job status and metrics via the eval_job_metrics index.

No breaking changes to existing user experience. The plugin is entirely additive.

Why should it be built? Any reason not to?

Why build it:

  • The Agentic AI Eval Platform needs an async evaluation engine that runs natively within OpenSearch. Without this plugin, evaluation orchestration would require external infrastructure (Airflow, Step Functions, custom cron jobs), adding operational complexity.
  • The plugin leverages Job Scheduler's existing distributed locking and scheduling infrastructure, avoiding reinventing these capabilities.
  • It enables horizontal scaling — the plugin runs on every data node, and Job Scheduler's LockService ensures exactly-once execution per job. Throughput scales linearly with cluster size.
  • The connection-based architecture allows operators to manage multiple evaluation backends independently, each with its own protocol and endpoint configuration.

Why a separate repository:

  • The plugin is a standalone OpenSearch server-side plugin (Java) with its own build lifecycle, release cadence, and SPI dependency on Job Scheduler.
  • It does not belong in OpenSearch Core — it is domain-specific evaluation orchestration, not core search/indexing functionality.
  • It does not belong in dashboards-observability — that repository hosts OpenSearch Dashboards UI components, not backend OpenSearch plugins.
  • It follows the same pattern as other standalone OpenSearch plugins in the organization (e.g., job-scheduler, anomaly-detection, index-management).

Potential concern: The plugin introduces new OpenSearch indices and REST endpoints. However, these are isolated to the eval.* namespace and do not affect existing functionality.

What will it take to execute?

  • The plugin will be bootstrapped from the opensearch-plugin-template-java.
  • Dependencies: OpenSearch Core (build dependency), Job Scheduler plugin (SPI runtime dependency).
  • Key components: EvalSchedulerExtension (plugin entry point), EvalTriggerSweeper, EvalJobExecutor, EvaluationBackend interface with PythonAgentServiceBackend and MLCommonsBackend implementations, DeterministicEvaluatorEngine, REST API handlers for connections and search filters.
  • Testing: Property-based tests using jqwik for correctness properties, unit tests for edge cases and error handling, integration tests with embedded OpenSearch cluster.
  • License: Apache License 2.0. No third-party dependencies that are incompatible with Apache-2.0.
  • Publication targets: Maven Snapshots / Sonatype Nexus, Maven Central.
  • Initial maintainers: [To be confirmed — list proposed maintainers here]

Any remaining open questions?

  • Final plugin name: The working name is agent-eval-scheduler-plugin, but alternatives like eval-engine-plugin or eval-orchestrator-plugin may better capture the plugin's dual responsibility. Community input is welcome — see the Plugin Name Suggestions section above.
  • AG-UI protocol specification: The plugin supports AG-UI as a streaming communication protocol alongside REST. The AG-UI integration details will be finalized as the protocol matures.
  • ML Commons Agent Framework integration: The exact request/response format for the Execute Agent API integration will be finalized during implementation.
  • Release coordination: The plugin is a backend component of the broader Agentic AI Eval Platform. The UI components live in dashboards-observability. Coordination on release cadence and compatibility will be needed.
