Purpose & Motivation
The Agentic AI Eval Platform is built natively on OpenSearch, using OTel Collector for span ingestion and OpenSearch indices as the sole data store. The platform needs an asynchronous evaluation engine that detects newly ingested spans, runs evaluations (LLM-as-a-Judge, RAG metrics, deterministic checks), and writes scores back to OpenSearch — all without manual intervention.
OpenSearch Job Scheduler provides the scheduling infrastructure (interval triggers, distributed locking, job persistence), but it is an SPI framework — it requires a consumer plugin to define job types and execution logic. No existing OpenSearch plugin provides evaluation-specific orchestration. The agent-eval-scheduler plugin fills this gap: it implements the Job Scheduler SPI to schedule two recurring jobs that bridge span ingestion and evaluation execution.
The plugin's core architectural principle is the connection-based evaluation routing model. Each evaluation request is routed to a specific backend (Python Agent Service or ML Commons) using a specific protocol (REST or AG-UI) as determined by the Eval Agent Connection linked in the trigger configuration. There are no global backend or protocol settings — the connection is the single source of truth for how an evaluation is executed.
Key Concepts
Eval Agent Connection — A registered connection to an evaluation backend, defining the backend type (PYTHON_AGENT_SERVICE or ML_COMMONS), communication protocol (REST or AGUI), endpoint, and timeout. Stored in the eval_agent_connections index.
Eval Search Filter — A saved query configuration that matches spans by criteria (agent name, operation, tags) and links them to evaluators via evaluator assignments. Each assignment pairs an evaluator with a specific Eval Agent Connection, determining the backend and protocol at execution time.
Job Document — A document in the eval_job_metrics index representing a single evaluation job. Tracks status (PENDING → RUNNING → COMPLETED/FAILED), priority, retry count, and timing metadata.
Evaluator Template — A configuration specifying which OSS library, metric, model, and parameters to use for an LLM-dependent evaluation. Does not specify backend or protocol — those come from the connection.
Deterministic Evaluator — An evaluator that runs directly in the plugin without external calls (regex match, JSON validity, exact match, contains). Produces scores with minimal latency.
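For concreteness, a registered Eval Agent Connection might look like the following document (field names follow the Data Models section below; the values are illustrative):

```json
{
  "name": "python-eval-rest",
  "backendType": "PYTHON_AGENT_SERVICE",
  "protocol": "REST",
  "endpoint": "http://eval-agent:8080",
  "timeoutMs": 30000,
  "status": "ACTIVE"
}
```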
Architecture Overview
graph TB
subgraph "OpenSearch Cluster"
subgraph "Every Data Node"
PLUGIN[agent-eval-scheduler-plugin]
TS[Trigger Sweeper<br/><i>polls for new spans</i>]
JE[Job Executor<br/><i>runs evaluations</i>]
DET[Deterministic Engine<br/><i>regex, JSON, exact match</i>]
PLUGIN --> TS
PLUGIN --> JE
JE --> DET
end
JS[Job Scheduler SPI<br/><i>scheduling + LockService</i>]
PLUGIN -->|registers via SPI| JS
subgraph "Indices"
SPANS[otel-v1-apm-span-*]
SCORES[eval_scores]
JOBS[eval_job_metrics]
CONNS[eval_agent_connections]
FILTERS[eval_search_filters]
TEMPLATES[eval_evaluator_templates]
end
end
subgraph "Evaluation Backends"
PAS[Python Agent Service<br/><i>Strands-based, external</i>]
MLC[ML Commons Agent Framework<br/><i>in-cluster</i>]
end
LLM[LLM Providers<br/><i>Bedrock / OpenAI / Anthropic</i>]
TS -->|poll new spans| SPANS
TS -->|read trigger configs| FILTERS
TS -->|create PENDING jobs| JOBS
JE -->|pick up PENDING jobs| JOBS
JE -->|read span data| SPANS
JE -->|resolve connection| CONNS
JE -->|load evaluator config| TEMPLATES
JE -->|write scores| SCORES
JE -->|REST or AG-UI| PAS
JE -->|Execute Agent API or AG-UI| MLC
PAS --> LLM
MLC --> LLM
The plugin runs on every data node. Job Scheduler's LockService ensures each job is executed by exactly one node — no external coordination required.
How It Works
1. Set up connections. An operator registers one or more Eval Agent Connections via the plugin's REST API (/_plugins/_eval/connections). Each connection defines a backend type, protocol, endpoint, and timeout. For example: a Python Agent Service over REST at http://eval-agent:8080, or an ML Commons agent via AG-UI with agent ID agent-abc123.
2. Create evaluation triggers. The operator creates Eval Search Filters (/_plugins/_eval/search-filters) that define which spans to evaluate and how. Each filter specifies span matching criteria (e.g., gen_ai.agent.name = "my-agent"), an evaluation mode (online or offline), and a list of evaluator assignments. Each assignment pairs an evaluator template with a specific connection — different evaluators in the same filter can use different backends.
3. Trigger Sweeper detects spans. The Trigger Sweeper runs on a configurable interval (default 5s). It loads all active Eval Search Filters, queries otel-v1-apm-span-* for root spans indexed since each filter's lastSweepTime watermark, and creates PENDING Job Documents for each span × evaluator match. Online jobs get priority: HIGH; offline jobs get priority: NORMAL. Before creating a job, the sweeper checks for existing targetSpanId + evaluatorId combinations to prevent duplicates.
4. Job Executor processes jobs. The Job Executor runs on a configurable interval (default 2s). It queries for PENDING jobs ordered by priority (HIGH before NORMAL before LOW), acquires a distributed lock per job, and executes the evaluation. For LLM-dependent evaluators, it resolves the Eval Agent Connection from the job, selects the matching backend implementation, and calls it using the connection's protocol. For deterministic evaluators, it runs the check in-plugin (no external call). On success, it writes scores to eval_scores and marks the job COMPLETED. On failure, it retries with exponential backoff (2^retryCount × 1000ms) up to a configurable max.
5. Deterministic evaluators skip the network. Simple checks — exact match, regex, JSON validity, contains — execute directly in the Java plugin. No backend call, no LLM invocation. This keeps latency minimal for straightforward quality gates.
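The retry and priority mechanics in steps 3–4 can be sketched compactly. This is an illustrative Python sketch, not the plugin's Java code; the names `backoff_ms` and `pickup_order` are ours, while the delay formula and the numeric priority weights (HIGH=3, NORMAL=2, LOW=1) follow the values stated in this RFC:

```python
# Step 4's retry delay: exponential backoff of 2^retryCount × 1000 ms.
def backoff_ms(retry_count: int) -> int:
    return (2 ** retry_count) * 1000

# Step 4's pickup order: HIGH before NORMAL before LOW,
# oldest job first within each priority level.
PRIORITY = {"HIGH": 3, "NORMAL": 2, "LOW": 1}

def pickup_order(jobs: list) -> list:
    return sorted(jobs, key=lambda j: (-PRIORITY[j["priority"]], j["createdAt"]))
```

So a job on its third retry waits 8 seconds before becoming eligible again, and online (HIGH) jobs always drain ahead of offline batch work.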
Evaluation Backends
The plugin uses an EvaluationBackend abstraction with two implementations, making backends interchangeable from the Job Executor's perspective.
Python Agent Service — An external Python service hosting a Strands-based eval agent. It invokes OSS evaluation libraries (Strands Eval, DeepEval, Ragas) and manages LLM provider connections. The plugin communicates with it over REST (synchronous HTTP) or AG-UI (streaming events), depending on the connection's protocol setting.
ML Commons Agent Framework — OpenSearch's native ML framework, running in-cluster. The plugin sends evaluation requests via the Execute Agent API (_plugins/_ml/agents/{agent_id}/_execute) or opens an AG-UI stream. No external service deployment required.
Both backends support both protocols. The protocol is a property of the connection, not a global setting — an operator can have one connection using REST and another using AG-UI, even to the same backend type. The Job Executor calls evaluateRest() or evaluateAgui() based on the resolved connection, and both return the same normalized score format.
Data Models
The plugin manages three core data models, each stored in its own OpenSearch index.
Entity Relationships
erDiagram
EVAL_SEARCH_FILTER ||--o{ EVALUATOR_ASSIGNMENT : contains
EVALUATOR_ASSIGNMENT }o--|| EVAL_AGENT_CONNECTION : "routes via"
EVALUATOR_ASSIGNMENT }o--|| EVALUATOR_TEMPLATE : references
EVAL_SEARCH_FILTER ||--o{ JOB_DOCUMENT : "triggers creation of"
JOB_DOCUMENT }o--|| EVAL_AGENT_CONNECTION : "resolved at execution"
EVAL_SEARCH_FILTER {
keyword id
text name
keyword evaluationMode
object spanMatchCriteria
nested evaluatorAssignments
date lastSweepTime
}
EVAL_AGENT_CONNECTION {
keyword id
text name
keyword backendType
keyword protocol
keyword endpoint
integer timeoutMs
keyword status
}
JOB_DOCUMENT {
keyword jobId
keyword jobType
keyword status
integer priority
keyword evaluatorId
keyword connectionId
keyword targetSpanId
integer retryCount
date nextEligibleTime
}
EVALUATOR_TEMPLATE {
keyword id
keyword library
keyword metric
object modelConfig
}
Eval Agent Connection (eval_agent_connections)
Represents a registered connection to an evaluation backend.
| Field | Description |
| --- | --- |
| `id` | Unique identifier |
| `name` | Human-readable connection name |
| `backendType` | `PYTHON_AGENT_SERVICE` or `ML_COMMONS` |
| `protocol` | `REST` or `AGUI` |
| `endpoint` | For Python Agent Service: HTTP URL (e.g., `http://eval-agent:8080`). For ML Commons: agent ID (e.g., `agent-abc123`) |
| `timeoutMs` | Request timeout in milliseconds |
| `status` | `ACTIVE` or `INACTIVE` — only active connections can be used for new jobs |
Eval Search Filter (eval_search_filters)
Defines which spans to evaluate and how, linking span criteria to evaluators via connections.

| Field | Description |
| --- | --- |
| `id` | Unique identifier |
| `name` | Human-readable filter name |
| `evaluationMode` | `ONLINE` (real-time) or `OFFLINE` (batch/experiment) |
| `spanMatchCriteria` | Span matching criteria (agent name, operation, tags) |
| `evaluatorAssignments` | `{ evaluatorId, connectionId }` pairs — each assignment routes an evaluator through a specific connection |
| `lastSweepTime` | Watermark of the last sweep; spans indexed after it are picked up next |

Job Document (eval_job_metrics)
Represents a single evaluation job created by the Trigger Sweeper and processed by the Job Executor.

| Field | Description |
| --- | --- |
| `jobId` | Unique identifier |
| `jobType` | `online_agent_trace_eval`, `offline_agent_trace_eval_item`, `offline_agent_trace_eval_run`, or `annotation_lock_release` |
| `status` | `PENDING`, `RUNNING`, `COMPLETED`, or `FAILED` |
| `priority` | `HIGH` (3) for online, `NORMAL` (2) for offline, `LOW` (1) for annotation lock release |
| `evaluatorId` | Evaluator template to run |
| `connectionId` | Eval Agent Connection resolved at execution time |
| `targetSpanId` | Root span being evaluated |
| `retryCount` | Number of failed attempts so far |
| `nextEligibleTime` | Earliest time a retried job may be picked up again |

Job Status Lifecycle
stateDiagram-v2
[*] --> PENDING: Trigger Sweeper creates job
PENDING --> RUNNING: Job Executor acquires lock
RUNNING --> COMPLETED: Evaluation succeeds
RUNNING --> FAILED: Max retries exceeded
RUNNING --> PENDING: Transient failure, retry eligible
note right of PENDING: nextEligibleTime = now + 2^retryCount × 1000ms
FAILED --> [*]
COMPLETED --> [*]

Plugin Name Suggestions

| Name | Rationale |
| --- | --- |
| `agent-eval-scheduler-plugin` (current) | Working name used throughout this RFC. |
| `eval-engine-plugin` | Emphasizes that the plugin is the evaluation execution engine, not just a scheduler. Captures both orchestration and execution. |
| `eval-orchestrator-plugin` | Highlights the orchestration role (sweep → create jobs → delegate → write scores). Aligns with the multi-step pipeline nature. |
| `eval-worker-plugin` | Emphasizes the async worker pattern. Common in job-processing systems. |
| `eval-pipeline-plugin` | Captures the sweep → detect → evaluate → score pipeline. May conflict with OpenSearch's existing "pipeline" concept (ingest pipelines). |
| `eval-runner-plugin` | Simple, action-oriented. "Runner" aligns with Job Scheduler's `ScheduledJobRunner` SPI interface. |

Recommendation: eval-engine-plugin — it best captures the dual responsibility of job orchestration and evaluation execution, and distinguishes the plugin from a pure scheduling wrapper.
Key Design Decisions
Job Scheduler SPI as infrastructure — The plugin does not implement its own scheduling, locking, or job persistence. It implements the three SPI interfaces (JobSchedulerExtension, ScheduledJobParameter, ScheduledJobRunner) and lets Job Scheduler handle the rest.
EvaluationBackend abstraction — A common interface with two implementations (Python Agent Service, ML Commons), each supporting both REST and AG-UI protocols. New backends can be added without changing core job execution logic.
Priority-based job pickup — Jobs are queried by priority DESC, createdAt ASC, ensuring online evaluations (HIGH) always execute before offline batch work (NORMAL) and annotation lock releases (LOW).
Deduplication at creation time — The Trigger Sweeper checks for existing targetSpanId + evaluatorId combinations before creating jobs, preventing duplicate evaluations regardless of job status.
Deterministic evaluators run in-plugin — Simple evaluations (regex, JSON validity, exact match, contains) execute directly in Java without external service calls, minimizing latency.
Horizontal scaling via LockService — Every data node runs the plugin. Job Scheduler's distributed LockService ensures exactly-once execution per job across the cluster. Throughput scales linearly with cluster size.
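The four deterministic checks are simple enough to show inline. This is an illustrative Python sketch; the plugin runs equivalent logic in Java, in-process, with no network calls:

```python
import json
import re

def exact_match(output: str, expected: str) -> bool:
    # Strict string equality.
    return output == expected

def contains(output: str, needle: str) -> bool:
    # Substring presence check.
    return needle in output

def regex_match(output: str, pattern: str) -> bool:
    # True if the pattern matches anywhere in the output.
    return re.search(pattern, output) is not None

def json_validity(output: str) -> bool:
    # True if the output parses as JSON.
    try:
        json.loads(output)
        return True
    except ValueError:
        return False
```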
Configuration
All scheduling behavior is configurable via opensearch.yml under the eval.scheduler.* namespace, including sweep intervals, executor batch size, lock TTL, retry limits, and per-type concurrency limits. Sensible defaults are provided (e.g., 5s sweep interval, 2s executor interval, 3 max retries). Invalid values are rejected at startup with descriptive errors.
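A hypothetical opensearch.yml fragment follows. The eval.scheduler.* namespace and the defaults (5s sweep, 2s executor, 3 retries) come from this RFC, but the individual key names are illustrative, not final:

```yaml
eval.scheduler.sweep_interval: 5s        # Trigger Sweeper poll interval (illustrative key name)
eval.scheduler.executor_interval: 2s     # Job Executor poll interval (illustrative key name)
eval.scheduler.max_retries: 3            # attempts before a job is marked FAILED
eval.scheduler.executor_batch_size: 50   # illustrative value
eval.scheduler.lock_ttl: 60s             # illustrative value
```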
FAQ - New Repository Proposal: agent-eval-scheduler-plugin
This section proposes creating a new repository under the opensearch-project GitHub organization to host the plugin described in this RFC. The content below addresses the questions from the opensearch-project proposal template.
What are you proposing?
A new repository under opensearch-project for the agent-eval-scheduler-plugin — the Java-based OpenSearch plugin described in this RFC. The plugin implements asynchronous evaluation job orchestration for the Agentic AI Eval Platform by consuming the OpenSearch Job Scheduler SPI. It is a core backend component of the platform described in the high-level design RFC (dashboards-observability#2592).
What users have asked for this feature?
The observability community has expressed interest in LLM evaluation capabilities integrated into the OpenSearch ecosystem — specifically for scoring agent traces using LLM-as-a-Judge, RAG metrics, and deterministic checks.
Existing evaluation tools (DeepEval, Ragas, Strands Eval) run externally and require separate infrastructure. Users want evaluation orchestration that leverages OpenSearch's native scheduling, indexing, and distributed execution capabilities.
What problems are you trying to solve?
When agent traces are ingested into OpenSearch via OTel Collector, there is no automated way to evaluate their quality (correctness, faithfulness, relevance) without external orchestration. OpenSearch Job Scheduler provides scheduling infrastructure but is an SPI framework — it requires a consumer plugin to define job types and execution logic. No existing OpenSearch plugin provides evaluation-specific orchestration.
When new agent traces are ingested into OpenSearch, a platform operator wants to automatically detect and evaluate those traces using configured evaluators and backends, so they get quality scores written back to OpenSearch without manual intervention or external orchestration.
What is the developer experience going to be?
The plugin exposes REST APIs under the /_plugins/_eval/ namespace:
/_plugins/_eval/connections — CRUD for Eval Agent Connections (register evaluation backends with backend type, protocol, endpoint, timeout)
/_plugins/_eval/search-filters — CRUD for Eval Search Filters (configure span matching criteria and evaluator-to-connection assignments)
No changes to existing OpenSearch APIs. The plugin depends on the Job Scheduler plugin (existing SPI dependency). All configuration is via opensearch.yml under the eval.scheduler.* namespace. See the Architecture Overview and Data Models sections above for full details.
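A hypothetical request body for POST /_plugins/_eval/search-filters, showing how each evaluator assignment pairs an evaluator with a connection (field names follow the Data Models section; the IDs are placeholders):

```json
{
  "name": "my-agent-online-evals",
  "evaluationMode": "ONLINE",
  "spanMatchCriteria": { "gen_ai.agent.name": "my-agent" },
  "evaluatorAssignments": [
    { "evaluatorId": "faithfulness-judge", "connectionId": "python-eval-rest" },
    { "evaluatorId": "json-validity-check", "connectionId": "mlc-agui" }
  ]
}
```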
Security considerations
The plugin integrates with OpenSearch's security plugin for index-level access control. All plugin indices (eval_job_metrics, eval_agent_connections, eval_search_filters) are subject to standard OpenSearch security policies.
Eval Agent Connection endpoints (external URLs, agent IDs) are stored as configuration — operators control which backends are reachable.
No new authentication mechanisms are introduced; the plugin relies on OpenSearch's existing security model.
Breaking changes to the API
None. This is a new plugin with new REST endpoints. No existing OpenSearch APIs are modified.
What is the user experience going to be?
Operator installs the plugin alongside Job Scheduler.
Operator registers Eval Agent Connections via REST API (e.g., a Python Agent Service over REST, or an ML Commons agent via AG-UI).
Operator creates Eval Search Filters that define which spans to evaluate and which evaluator + connection to use.
The plugin automatically detects new spans, creates evaluation jobs, executes them via the configured backends, and writes scores to eval_scores.
Operators monitor job status and metrics via the eval_job_metrics index.
No breaking changes to existing user experience. The plugin is entirely additive.
Why should it be built? Any reason not to?
Why build it:
The Agentic AI Eval Platform needs an async evaluation engine that runs natively within OpenSearch. Without this plugin, evaluation orchestration would require external infrastructure (Airflow, Step Functions, custom cron jobs), adding operational complexity.
The plugin leverages Job Scheduler's existing distributed locking and scheduling infrastructure, avoiding reinventing these capabilities.
It enables horizontal scaling — the plugin runs on every data node, and Job Scheduler's LockService ensures exactly-once execution per job. Throughput scales linearly with cluster size.
The connection-based architecture allows operators to manage multiple evaluation backends independently, each with its own protocol and endpoint configuration.
Why a separate repository:
The plugin is a standalone OpenSearch server-side plugin (Java) with its own build lifecycle, release cadence, and SPI dependency on Job Scheduler.
It does not belong in OpenSearch Core — it is domain-specific evaluation orchestration, not core search/indexing functionality.
It does not belong in dashboards-observability — that repository hosts OpenSearch Dashboards UI components, not backend OpenSearch plugins.
A standalone repository follows the established pattern of other OpenSearch plugins (job-scheduler, anomaly-detection, index-management).
Potential concern: The plugin introduces new OpenSearch indices and REST endpoints. However, these are isolated to the eval.* namespace and do not affect existing functionality.
What will it take to execute?
Key components: EvalSchedulerExtension (plugin entry point), EvalTriggerSweeper, EvalJobExecutor, EvaluationBackend interface with PythonAgentServiceBackend and MLCommonsBackend implementations, DeterministicEvaluatorEngine, REST API handlers for connections and search filters.
Testing: Property-based tests using jqwik for correctness properties, unit tests for edge cases and error handling, integration tests with embedded OpenSearch cluster.
License: Apache License 2.0. No third-party dependencies that are incompatible with Apache-2.0.
Initial maintainers: [To be confirmed — list proposed maintainers here]
Any remaining open questions?
Final plugin name: The working name is agent-eval-scheduler-plugin, but alternatives like eval-engine-plugin or eval-orchestrator-plugin may better capture the plugin's dual responsibility. Community input is welcome — see the Plugin Name Suggestions section above.
AG-UI protocol specification: The plugin supports AG-UI as a streaming communication protocol alongside REST. The AG-UI integration details will be finalized as the protocol matures.
ML Commons Agent Framework integration: The exact request/response format for the Execute Agent API integration will be finalized during implementation.
Release coordination: The plugin is a backend component of the broader Agentic AI Eval Platform. The UI components live in dashboards-observability. Coordination on release cadence and compatibility will be needed.