Replies: 2 comments 2 replies
Lots of great ideas here. Curious about your thoughts on some things:
On type safety, that clarifies things -- the original framing didn't make clear the difference between evaluating a Python framework and this one. It's good that you are using type safety as the initial guard.

On tool calls, I think we are in violent agreement. The "special" things to test are "Did the agent call the right tool?" and "Did the agent call it the right way?" Traditional unit tests do not work for those, but they do work for testing the tool calls themselves. If the tool call is a complex mathematical calculation, then traditional unit tests (probably property-based tests) work just fine. The exception would be if the tool itself uses an LLM to do its work, like in an MCP Sampling situation.

On RAG, precision and recall have their challenges, but -- and I suppose this applies to my understanding of the evals writ large -- how do you come up with the numbers? I see a lot of numbers like 300 seconds, 2-5 tool calls, 200 lines for a real plan. Outside of performance, because I know how "slow" feels, and token spend, because I can open my wallet and see how much money I have, I am unaware of heuristics for what kinds of numbers make sense for my situation. Of course this is an issue writ large when you test anything non-deterministic, but I don't know if my tests are useful because I am making up numbers. This is the area I am struggling with, tbh.
Summary
Why agents need evaluation
Unlike traditional software, AI agents are non-deterministic -- the same input can produce different outputs across runs. They are also multi-step systems: an agent plans, selects actions, calls LLMs, invokes tools, sometimes replans. A correct final output can hide a broken intermediate step. A failing output can hide that 6 out of 7 steps worked perfectly.
Traditional testing (`assertEquals(expected, actual)`) does not work here. You cannot assert an exact LLM response. You need to express what "good enough" looks like: did the agent complete? Does the output contain the right information? Was it fast enough? Did it stay within budget? Did each tool call return something useful?

And this is not just a testing problem. In production, models get updated, latencies shift, costs change. An agent that met its quality bar last month can silently degrade. You need the same quality criteria running continuously in production, publishing metrics and alerting when things drift.
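Those quality criteria are scores, not booleans. As a sketch of the idea (the curve below is an illustration, not this framework's actual formula), a latency check can degrade proportionally past its budget instead of flipping straight from pass to fail -- which is what makes silent drift visible before it becomes failure:

```java
public class LatencyScoreSketch {

    // Illustrative proportional latency score (not the framework's formula):
    // 1.0 at or under budget, then linearly down to 0.0 at twice the budget.
    static double latencyScore(long actualMs, long budgetMs) {
        if (actualMs <= budgetMs) return 1.0;
        double over = (double) (actualMs - budgetMs) / budgetMs;
        return Math.max(0.0, 1.0 - over);
    }

    public static void main(String[] args) {
        // A drifting agent: still above a 0.8 threshold, but visibly degrading.
        System.out.println(latencyScore(1_000, 2_000));  // 1.0
        System.out.println(latencyScore(2_300, 2_000));  // 0.85
        System.out.println(latencyScore(4_500, 2_000));  // 0.0 -- hard failure
    }
}
```

A dashboard plotting this score over time shows degradation long before the pass/fail line is crossed.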
Getting started
Add the dependency:
Write your first evaluation test:
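The framework's builder API is not reproduced in this excerpt, so the following is a plain-Java sketch of what such a test checks: named rules over one agent invocation, asserted together. `Invocation`, `Rule`, and `failures` are hypothetical stand-ins, not framework types; in the real test this logic lives behind the `Eval` builder inside a JUnit `@Test` method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class MyAgentEvalSketch {

    // Stand-in for the framework's view of one agent run (hypothetical).
    record Invocation(String output, long latencyMs, boolean completed) {}

    // Stand-in for a named evaluation rule (hypothetical).
    record Rule(String name, Predicate<Invocation> check) {}

    // Collect the names of all rules that fail -- the "assertAll" idea.
    static List<String> failures(Invocation inv, List<Rule> rules) {
        List<String> failed = new ArrayList<>();
        for (Rule r : rules) {
            if (!r.check().test(inv)) {
                failed.add(r.name());
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        Invocation inv = new Invocation("Paris is the capital of France.", 1_200, true);
        List<Rule> rules = List.of(
                new Rule("completed", Invocation::completed),
                new Rule("containsAll", i -> i.output().contains("Paris")),
                new Rule("latencyBudget", i -> i.latencyMs() <= 5_000));
        // An empty list means every rule passed; a JUnit test would assert that.
        System.out.println(failures(inv, rules));
    }
}
```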
It's a standard JUnit test -- run it like any other test in your project:
```shell
mvn test -Dtest=MyAgentEvalTest
```

No special runner, no plugin -- it's a regular JUnit test. The evaluation report is logged automatically.
Enable in production (`application.yml`):

What kinds of evaluation?
The framework supports evaluation at 7 levels of an agent's execution -- agent, LLM call, action, tool call, plan, tool loop, and RAG -- because different levels catch different problems.
One API for tests and production
Same `Eval` criteria, two usages -- only the last line changes:

- `.assertAll(inv)` -- throws on failure
- `.build()` as a `@Bean` -- Micrometer metrics published on every agent run

Scores like `latencyBudget` are proportional (0.0 to 1.0), not just pass/fail. This lets you detect degradation before it crosses the failure threshold -- in dashboards, you see the score dropping from 0.95 to 0.82 over a week, and you investigate before it hits 0.8 and starts failing.

What's Included
24+ Built-in Rules
Organized by category, all composable:
- Latency: `LatencyBudgetRule`, `MaxDurationRule`
- Content: `NotEmpty`, `ExactMatch`, `ContainsAll`, `ContainsAny`, `ContainsNone`, `RegexMatch`, `StartsWith`, `EndsWith`
- Length: `MinLength`, `MaxLength`, `WordCountRange`, `SentenceCount`
- Numeric: `NumericRange`, `Budget`, `Ratio`
- Similarity: `Levenshtein`, `Jaccard`, `Cosine`
- JSON: `JsonValidity`, `JsonSchema`
- LLM judge: `LlmJudgeRule`, `LlmBinaryJudgeRule` + 11 presets
- RAG: `Faithfulness`, `Hallucination`, `ContextRecall`, `ContextualRelevance`, `AnswerRelevancy`

7 Convenience Factory Classes
Pre-wired evaluators so developers never need to compose rules + extractors manually:
- `AgentEvalRules` -- 25+ methods (completion, output, latency, cost, actions, tools, sub-agents, LLM judge)
- `LlmEvalRules` -- response content, format, latency, LLM judge
- `ActionEvalRules` -- success, latency, tool usage
- `ToolCallEvalRules` -- result content, format, latency, schema validation
- `PlanEvalRules` -- steps, goal, step count
- `ToolLoopEvalRules` -- convergence, iteration efficiency, latency
- `RagEvalRules` -- faithfulness, relevancy, recall, result count

All proportional-scoring methods support an optional `threshold` parameter.

5 Composite Evaluators
Combine rules with logic beyond "all must pass":
- `AllPassEvaluator` -- AND logic (all must pass, score = min)
- `AnyPassEvaluator` -- OR logic (any can pass, score = max)
- `CompositeEvaluator` -- weighted average with configurable threshold
- `ConditionalEvaluator` -- evaluate only when a condition is met
- `ThresholdOverrideEvaluator` -- override the pass/fail threshold of any evaluator

Examples:
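The five strategies reduce to simple score arithmetic. The sketch below shows the AND / OR / weighted math on a set of per-rule scores in 0.0-1.0; `Weighted` and the method names are illustrative stand-ins, not the framework's own evaluator types:

```java
import java.util.List;

public class CompositeScoringSketch {

    // Hypothetical weighted part; the framework's real types are not shown here.
    record Weighted(double score, double weight) {}

    // AllPassEvaluator idea: AND logic, aggregate score = min.
    static double allPass(List<Double> scores) {
        return scores.stream().mapToDouble(Double::doubleValue).min().orElse(0.0);
    }

    // AnyPassEvaluator idea: OR logic, aggregate score = max.
    static double anyPass(List<Double> scores) {
        return scores.stream().mapToDouble(Double::doubleValue).max().orElse(0.0);
    }

    // CompositeEvaluator idea: weighted average, compared to a configurable threshold.
    static double weightedAverage(List<Weighted> parts) {
        double total = 0.0, weightSum = 0.0;
        for (Weighted p : parts) {
            total += p.score() * p.weight();
            weightSum += p.weight();
        }
        return weightSum == 0.0 ? 0.0 : total / weightSum;
    }

    public static void main(String[] args) {
        List<Double> scores = List.of(0.9, 0.6, 1.0);
        System.out.println(allPass(scores));  // 0.6 -- the weakest rule dominates
        System.out.println(anyPass(scores));  // 1.0 -- the strongest rule dominates
        // Accuracy weighted 3x over style: (0.9*3 + 0.6*1) / 4 ~= 0.825
        double composite = weightedAverage(List.of(new Weighted(0.9, 3.0), new Weighted(0.6, 1.0)));
        System.out.println(composite >= 0.8);  // clears an 0.8 threshold
    }
}
```

The min/max choice is why `AllPassEvaluator` surfaces the weakest rule in dashboards while `AnyPassEvaluator` tolerates individual misses.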
Production Monitoring
- Register `Eval` beans, enable in YAML, and evaluations run on every agent execution
- Sampling to control overhead (`sampling-rate: 0.1`)
- `@EvalBlocking` -- quality gate that prevents non-conforming results from being returned

Evaluation Reports
Every `.assertAll()` call logs a detailed report -- both when an evaluation passes and when it fails.

- `EvaluationReportWriter` collects all test results and writes a single file at JVM shutdown
- Aggregate accessors: `allPassed()`, `passRate()`, `meanScore()`, `minScore()`, `maxScore()`, `byLevel()`

LLM-as-Judge
Use an LLM to evaluate qualities that deterministic rules cannot check (coherence, safety, goal achievement). Each preset comes with a built-in prompt -- you do not write the evaluation criteria yourself:
11 built-in presets with ready-to-use prompts: goal achievement, safety, coherence, completeness, tone & style, instruction following, groundedness, plan completeness, plan feasibility, and more.
For specific needs, write your own evaluation prompt:
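As a sketch only -- the `{context}`, `{question}`, and `{reply}` placeholders and the PASS/FAIL protocol below are illustrative assumptions, not this framework's template syntax -- a custom rubric in the usual LLM-as-judge style might read:

```text
You are evaluating a customer-support agent's reply.

Criteria: the reply must (1) answer the user's question directly,
(2) cite only facts present in the provided context, and
(3) keep a polite, professional tone.

Context: {context}
User question: {question}
Agent reply: {reply}

Answer with exactly one word: PASS or FAIL.
```

Constraining the judge to a single-token verdict keeps the result machine-parseable, at the cost of losing partial credit.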
Extensibility
Custom rules, extractors, and evaluators are first-class:
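The framework's actual SPI is not reproduced in this excerpt, but conceptually a custom rule is a named function from an extracted value to a proportional score. As a stand-in sketch (the `StringRule` interface here is hypothetical), a rule that penalizes banned terms in an output might look like:

```java
import java.util.List;

public class CustomRuleSketch {

    // Hypothetical shape of a rule: a name plus a scoring function
    // over an extracted String, returning a proportional 0.0-1.0 score.
    interface StringRule {
        String name();
        double score(String extracted);
    }

    // Custom rule: deduct a proportional penalty per banned term found.
    static StringRule containsNoneOf(List<String> banned) {
        return new StringRule() {
            public String name() { return "containsNoneOf" + banned; }
            public double score(String s) {
                long hits = banned.stream().filter(s.toLowerCase()::contains).count();
                return banned.isEmpty() ? 1.0 : 1.0 - (double) hits / banned.size();
            }
        };
    }

    public static void main(String[] args) {
        StringRule rule = containsNoneOf(List.of("acme corp", "globex"));
        System.out.println(rule.score("Try our own product instead."));  // 1.0
        System.out.println(rule.score("Acme Corp might also work."));    // 0.5
    }
}
```

Because the score stays proportional rather than binary, such a rule composes cleanly with the weighted and threshold-based evaluators described above.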