Replies: 2 comments 2 replies
Lots of great ideas here. Curious about your thoughts on some things:
On type safety, that clarifies things -- the original framing didn't make clear the difference between evaluating a Python framework and this one. It's good that you are using type safety as the initial guard.

On tool calls, I think we are in violent agreement. The "special" things to test are "Did the agent call the right tool?" and "Did the agent call it the right way?" Traditional unit tests do not work for those, but they do work for testing the tool calls themselves. If the tool call is a complex mathematical calculation, then traditional unit tests (probably property-based tests) work just fine. The exception would be if the tool itself uses an LLM to do its work, like in an MCP Sampling situation.

On RAG, precision and recall have their challenges, but -- and I suppose this applies to my understanding of the evals writ large -- how do you come up with the numbers? I see a lot of numbers like 300 seconds, 2-5 tool calls, 200 lines for a real plan. Outside of performance, because I know how "slow" feels, and token spend, because I can open my wallet and see how much money I have, I am unaware of heuristics for what kinds of numbers make sense for my situation. Of course this is an issue writ large when you test anything non-deterministic, but I don't know if my tests are useful because I am making up numbers. This is the area I am struggling with, tbh.
Summary
Why agents need evaluation
Unlike traditional software, AI agents are non-deterministic -- the same input can produce different outputs across runs. They are also multi-step systems: an agent plans, selects actions, calls LLMs, invokes tools, sometimes replans. A correct final output can hide a broken intermediate step. A failing output can hide that 6 out of 7 steps worked perfectly.
Traditional testing (`assertEquals(expected, actual)`) does not work here. You cannot assert an exact LLM response. You need to express what "good enough" looks like: did the agent complete? Does the output contain the right information? Was it fast enough? Did it stay within budget? Did each tool call return something useful?

And this is not just a testing problem. In production, models get updated, latencies shift, costs change. An agent that met its quality bar last month can silently degrade. You need the same quality criteria running continuously in production, publishing metrics and alerting when things drift.
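Those quality criteria are scores, not booleans. As a sketch of the idea (the curve below is an illustration, not this framework's actual formula), a latency check can degrade proportionally past its budget instead of flipping straight from pass to fail -- which is what makes silent drift visible before it becomes failure:

```java
public class LatencyScoreSketch {

    // Illustrative proportional latency score (not the framework's formula):
    // 1.0 at or under budget, then linearly down to 0.0 at twice the budget.
    static double latencyScore(long actualMs, long budgetMs) {
        if (actualMs <= budgetMs) return 1.0;
        double over = (double) (actualMs - budgetMs) / budgetMs;
        return Math.max(0.0, 1.0 - over);
    }

    public static void main(String[] args) {
        // A drifting agent: still above a 0.8 threshold, but visibly degrading.
        System.out.println(latencyScore(1_000, 2_000));  // 1.0
        System.out.println(latencyScore(2_300, 2_000));  // 0.85
        System.out.println(latencyScore(4_500, 2_000));  // 0.0 -- hard failure
    }
}
```

A dashboard plotting this score over time shows degradation long before the pass/fail line is crossed.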
Getting started
Add the dependency:
Write your first evaluation test:
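The framework's builder API is not reproduced in this excerpt, so the following is a plain-Java sketch of what such a test checks: named rules over one agent invocation, asserted together. `Invocation`, `Rule`, and `failures` are hypothetical stand-ins, not framework types; in the real test this logic lives behind the `Eval` builder inside a JUnit `@Test` method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class MyAgentEvalSketch {

    // Stand-in for the framework's view of one agent run (hypothetical).
    record Invocation(String output, long latencyMs, boolean completed) {}

    // Stand-in for a named evaluation rule (hypothetical).
    record Rule(String name, Predicate<Invocation> check) {}

    // Collect the names of all rules that fail -- the "assertAll" idea.
    static List<String> failures(Invocation inv, List<Rule> rules) {
        List<String> failed = new ArrayList<>();
        for (Rule r : rules) {
            if (!r.check().test(inv)) {
                failed.add(r.name());
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        Invocation inv = new Invocation("Paris is the capital of France.", 1_200, true);
        List<Rule> rules = List.of(
                new Rule("completed", Invocation::completed),
                new Rule("containsAll", i -> i.output().contains("Paris")),
                new Rule("latencyBudget", i -> i.latencyMs() <= 5_000));
        // An empty list means every rule passed; a JUnit test would assert that.
        System.out.println(failures(inv, rules));
    }
}
```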
It's a standard JUnit test -- run it like any other test in your project:
```shell
mvn test -Dtest=MyAgentEvalTest
```

No special runner, no plugin -- it's a regular JUnit test. The evaluation report is logged automatically.
Enable in production (`application.yml`):

What kinds of evaluation?
The framework supports evaluation at 7 levels of an agent's execution -- agent, LLM call, action, tool call, plan, tool loop, and RAG -- because different levels catch different problems.
One API for tests and production
Same `Eval` criteria, two usages -- only the last line changes:

- `.assertAll(inv)` -- throws on failure
- `.build()` as a `@Bean` -- Micrometer metrics published on every agent run

Scores like `latencyBudget` are proportional (0.0 to 1.0), not just pass/fail. This lets you detect degradation before it crosses the failure threshold -- in dashboards, you see the score dropping from 0.95 to 0.82 over a week, and you investigate before it hits 0.8 and starts failing.

What's Included
24+ Built-in Rules
Organized by category, all composable:
- Latency: `LatencyBudgetRule`, `MaxDurationRule`
- Content: `NotEmpty`, `ExactMatch`, `ContainsAll`, `ContainsAny`, `ContainsNone`, `RegexMatch`, `StartsWith`, `EndsWith`
- Length: `MinLength`, `MaxLength`, `WordCountRange`, `SentenceCount`
- Numeric: `NumericRange`, `Budget`, `Ratio`
- Similarity: `Levenshtein`, `Jaccard`, `Cosine`
- JSON: `JsonValidity`, `JsonSchema`
- LLM judge: `LlmJudgeRule`, `LlmBinaryJudgeRule` + 11 presets
- RAG: `Faithfulness`, `Hallucination`, `ContextRecall`, `ContextualRelevance`, `AnswerRelevancy`

7 Convenience Factory Classes
Pre-wired evaluators so developers never need to compose rules + extractors manually:
- `AgentEvalRules` -- 25+ methods (completion, output, latency, cost, actions, tools, sub-agents, LLM judge)
- `LlmEvalRules` -- response content, format, latency, LLM judge
- `ActionEvalRules` -- success, latency, tool usage
- `ToolCallEvalRules` -- result content, format, latency, schema validation
- `PlanEvalRules` -- steps, goal, step count
- `ToolLoopEvalRules` -- convergence, iteration efficiency, latency
- `RagEvalRules` -- faithfulness, relevancy, recall, result count

All proportional-scoring methods support an optional `threshold` parameter.

5 Composite Evaluators
Combine rules with logic beyond "all must pass":
- `AllPassEvaluator` -- AND logic (all must pass, score = min)
- `AnyPassEvaluator` -- OR logic (any can pass, score = max)
- `CompositeEvaluator` -- weighted average with configurable threshold
- `ConditionalEvaluator` -- evaluate only when a condition is met
- `ThresholdOverrideEvaluator` -- override the pass/fail threshold of any evaluator

Examples:
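The five strategies reduce to simple score arithmetic. The sketch below shows the AND / OR / weighted math on a set of per-rule scores in 0.0-1.0; `Weighted` and the method names are illustrative stand-ins, not the framework's own evaluator types:

```java
import java.util.List;

public class CompositeScoringSketch {

    // Hypothetical weighted part; the framework's real types are not shown here.
    record Weighted(double score, double weight) {}

    // AllPassEvaluator idea: AND logic, aggregate score = min.
    static double allPass(List<Double> scores) {
        return scores.stream().mapToDouble(Double::doubleValue).min().orElse(0.0);
    }

    // AnyPassEvaluator idea: OR logic, aggregate score = max.
    static double anyPass(List<Double> scores) {
        return scores.stream().mapToDouble(Double::doubleValue).max().orElse(0.0);
    }

    // CompositeEvaluator idea: weighted average, compared to a configurable threshold.
    static double weightedAverage(List<Weighted> parts) {
        double total = 0.0, weightSum = 0.0;
        for (Weighted p : parts) {
            total += p.score() * p.weight();
            weightSum += p.weight();
        }
        return weightSum == 0.0 ? 0.0 : total / weightSum;
    }

    public static void main(String[] args) {
        List<Double> scores = List.of(0.9, 0.6, 1.0);
        System.out.println(allPass(scores));  // 0.6 -- the weakest rule dominates
        System.out.println(anyPass(scores));  // 1.0 -- the strongest rule dominates
        // Accuracy weighted 3x over style: (0.9*3 + 0.6*1) / 4 ~= 0.825
        double composite = weightedAverage(List.of(new Weighted(0.9, 3.0), new Weighted(0.6, 1.0)));
        System.out.println(composite >= 0.8);  // clears an 0.8 threshold
    }
}
```

The min/max choice is why `AllPassEvaluator` surfaces the weakest rule in dashboards while `AnyPassEvaluator` tolerates individual misses.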
Production Monitoring
- Register `Eval` beans, enable in YAML, and evaluations run on every agent execution
- Sampling to control overhead (`sampling-rate: 0.1`)
- `@EvalBlocking` -- quality gate that prevents non-conforming results from being returned

Evaluation Reports
Every `.assertAll()` call logs a detailed report -- both when an evaluation passes and when it fails.

- `EvaluationReportWriter` collects all test results and writes a single file at JVM shutdown
- Aggregate accessors: `allPassed()`, `passRate()`, `meanScore()`, `minScore()`, `maxScore()`, `byLevel()`

LLM-as-Judge
Use an LLM to evaluate qualities that deterministic rules cannot check (coherence, safety, goal achievement). Each preset comes with a built-in prompt -- you do not write the evaluation criteria yourself:
11 built-in presets with ready-to-use prompts: goal achievement, safety, coherence, completeness, tone & style, instruction following, groundedness, plan completeness, plan feasibility, and more.
For specific needs, write your own evaluation prompt:
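As a sketch only -- the `{context}`, `{question}`, and `{reply}` placeholders and the PASS/FAIL protocol below are illustrative assumptions, not this framework's template syntax -- a custom rubric in the usual LLM-as-judge style might read:

```text
You are evaluating a customer-support agent's reply.

Criteria: the reply must (1) answer the user's question directly,
(2) cite only facts present in the provided context, and
(3) keep a polite, professional tone.

Context: {context}
User question: {question}
Agent reply: {reply}

Answer with exactly one word: PASS or FAIL.
```

Constraining the judge to a single-token verdict keeps the result machine-parseable, at the cost of losing partial credit.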
Extensibility
Custom rules, extractors, and evaluators are first-class:
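The framework's actual SPI is not reproduced in this excerpt, but conceptually a custom rule is a named function from an extracted value to a proportional score. As a stand-in sketch (the `StringRule` interface here is hypothetical), a rule that penalizes banned terms in an output might look like:

```java
import java.util.List;

public class CustomRuleSketch {

    // Hypothetical shape of a rule: a name plus a scoring function
    // over an extracted String, returning a proportional 0.0-1.0 score.
    interface StringRule {
        String name();
        double score(String extracted);
    }

    // Custom rule: deduct a proportional penalty per banned term found.
    static StringRule containsNoneOf(List<String> banned) {
        return new StringRule() {
            public String name() { return "containsNoneOf" + banned; }
            public double score(String s) {
                long hits = banned.stream().filter(s.toLowerCase()::contains).count();
                return banned.isEmpty() ? 1.0 : 1.0 - (double) hits / banned.size();
            }
        };
    }

    public static void main(String[] args) {
        StringRule rule = containsNoneOf(List.of("acme corp", "globex"));
        System.out.println(rule.score("Try our own product instead."));  // 1.0
        System.out.println(rule.score("Acme Corp might also work."));    // 0.5
    }
}
```

Because the score stays proportional rather than binary, such a rule composes cleanly with the weighted and threshold-based evaluators described above.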