# 12.3 Testing
This document describes the testing strategy and infrastructure for ZeroClaw. It covers the three-layer testing architecture (unit tests, integration tests, and end-to-end tests), the mock infrastructure used to isolate components, and the specific test coverage for critical subsystems including agent orchestration, tool execution, channel integration, and scheduler behavior.
For information about the CI/CD workflows that execute these tests, see CI/CD Workflows. For information about the agent turn cycle being tested, see Agent Turn Cycle.
ZeroClaw implements a three-layer testing strategy that progressively validates system behavior from individual functions to full agent orchestration cycles:
```mermaid
graph TB
subgraph "Testing Layers"
E2E["E2E Tests<br/>tests/agent_e2e.rs<br/>Full agent orchestration"]
Integration["Integration Tests<br/>Module #[cfg(test)]<br/>Cross-component behavior"]
Unit["Unit Tests<br/>In-module tests<br/>Function-level validation"]
end
subgraph "Test Infrastructure"
MockProvider["MockProvider<br/>ScriptedProvider<br/>FailingProvider"]
MockTools["Mock Tools<br/>EchoTool<br/>CountingTool<br/>FailingTool"]
MockMemory["Mock Memory<br/>none backend<br/>sqlite for persistence"]
TestHelpers["Test Helpers<br/>build_agent()<br/>text_response()<br/>tool_response()"]
end
subgraph "System Under Test"
Agent["Agent::turn()"]
Dispatcher["ToolDispatcher<br/>NativeToolDispatcher<br/>XmlToolDispatcher"]
Scheduler["Scheduler"]
Channels["Channel implementations"]
Tools["Tool implementations"]
end
E2E --> Agent
Integration --> Dispatcher
Integration --> Scheduler
Integration --> Channels
Unit --> Tools
E2E --> MockProvider
E2E --> MockTools
E2E --> MockMemory
E2E --> TestHelpers
Integration --> MockProvider
Integration --> MockTools
Integration --> TestHelpers
Agent --> Dispatcher
style E2E fill:#f9f9f9
style Integration fill:#f9f9f9
style Unit fill:#f9f9f9
```
E2E tests validate full agent orchestration cycles through the public Agent API without touching external services. They use mock providers and tools to script multi-turn conversations and verify that the agent correctly loops between LLM calls and tool execution.
Test Coverage:
- Simple text responses (no tool calls)
- Single tool call → final response
- Multi-step tool chains (tool A → tool B → tool C)
- XML dispatcher integration
- Multi-turn conversation coherence
- Unknown tool recovery
- Parallel tool dispatch
Key Test Functions:
| Test Function | Purpose |
|---|---|
| `e2e_simple_text_response` | User message → LLM text response |
| `e2e_single_tool_call_cycle` | Tool call → execution → final response |
| `e2e_multi_step_tool_chain` | Chain of 2+ tool calls before response |
| `e2e_xml_dispatcher_tool_call` | XML-tagged tool calls work end-to-end |
| `e2e_multi_turn_conversation` | Sequential turns maintain coherence |
| `e2e_unknown_tool_recovery` | Graceful handling of missing tools |
| `e2e_parallel_tool_dispatch` | Multiple tools in single response |
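The tool-call loop these scenarios exercise can be sketched as follows. The `Response` enum and `turn` function below are illustrative stand-ins for ZeroClaw's real `ChatResponse` and `Agent::turn()`, reduced to a synchronous form with no provider or dispatcher wiring:

```rust
// Illustrative sketch only: the agent alternates between provider
// responses and tool execution until a plain-text reply arrives.
enum Response {
    Text(String),
    ToolCall(String),
}

fn turn(mut responses: Vec<Response>, executed: &mut Vec<String>) -> String {
    responses.reverse(); // pop() then yields FIFO order
    while let Some(r) = responses.pop() {
        match r {
            Response::Text(t) => return t, // final answer ends the turn
            Response::ToolCall(name) => executed.push(name), // "run" the tool, loop again
        }
    }
    "done".into()
}

fn main() {
    // Single tool call → final response, as in e2e_single_tool_call_cycle.
    let mut executed = Vec::new();
    let out = turn(
        vec![Response::ToolCall("echo".into()), Response::Text("hi".into())],
        &mut executed,
    );
    assert_eq!(out, "hi");
    assert_eq!(executed, vec!["echo"]);
}
```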
Unit tests live in #[cfg(test)] modules alongside the code they validate. They test individual functions and edge cases with minimal dependencies.
Agent Tests (src/agent/tests.rs):
The agent test suite covers 20+ scenarios organized into categories:
```mermaid
graph LR
subgraph "Agent Test Categories"
Basic["Basic Flow<br/>1. Text response<br/>2. Single tool<br/>3. Multi-step chain"]
Edge["Edge Cases<br/>4. Max iterations<br/>5. Unknown tool<br/>6. Tool failure<br/>7. Provider error"]
History["History Management<br/>8. Trimming<br/>12. Mixed text+tool<br/>18. Conversation fidelity"]
Memory["Memory Integration<br/>9. Auto-save<br/>15. Context enrichment"]
Dispatcher["Dispatcher Modes<br/>10. Native vs XML<br/>14. System prompts"]
Other["Other Scenarios<br/>11. Empty responses<br/>13. Multi-tool batch<br/>16. Serialization<br/>19. Builder validation"]
end
Basic --> AgentTurn["Agent::turn()"]
Edge --> AgentTurn
History --> AgentTurn
Memory --> AgentTurn
Dispatcher --> AgentTurn
Other --> AgentTurn
```
Scheduler Tests (src/cron/scheduler.rs):
Validates cron job execution with security policy enforcement:
| Test Function | Validates |
|---|---|
| `run_job_command_success` | Shell command execution succeeds |
| `run_job_command_failure` | Non-zero exit codes handled |
| `run_job_command_times_out` | Jobs killed after timeout |
| `run_job_command_blocks_disallowed_command` | Security: command allowlist |
| `run_job_command_blocks_forbidden_path_argument` | Security: path restrictions |
| `run_job_command_blocks_readonly_mode` | Security: read-only enforcement |
| `run_job_command_blocks_rate_limited` | Security: rate limiting |
| `execute_job_with_retry_recovers_after_first_failure` | Retry logic with backoff |
| `execute_job_with_retry_exhausts_attempts` | Retry exhaustion |
Channel Tests (src/channels/mattermost.rs):
Tests channel-specific parsing and security:
```mermaid
graph TB
subgraph "Mattermost Test Scenarios"
Parse["Message Parsing<br/>parse_mattermost_post()"]
Thread["Threading Logic<br/>thread_replies config"]
Mention["Mention Detection<br/>mention_only mode"]
Security["Security Checks<br/>Allowlist<br/>Self-ignore<br/>Old message filter"]
end
Parse --> MattermostChannel["MattermostChannel"]
Thread --> MattermostChannel
Mention --> MattermostChannel
Security --> MattermostChannel
Parse -.tests.-> ParseTests["mattermost_parse_post_basic<br/>mattermost_parse_post_thread<br/>mattermost_parse_post_ignore_self"]
Thread -.tests.-> ThreadTests["mattermost_parse_post_thread_replies_enabled<br/>mattermost_parse_post_thread_replies_disabled"]
Mention -.tests.-> MentionTests["mattermost_mention_only_accepts_mention<br/>mattermost_mention_only_rejects_no_mention<br/>find_bot_mention_spans"]
Security -.tests.-> SecurityTests["mattermost_allowlist_wildcard<br/>mattermost_parse_post_allowlist_deny"]
```
ScriptedProvider (src/agent/tests.rs):
Returns pre-scripted responses in FIFO order, enabling deterministic test scenarios. Once the queue is exhausted, it falls back to a plain "done" text response.
```mermaid
graph LR
Test["Test Setup"] --> Queue["Response Queue<br/>Vec<ChatResponse>"]
Queue --> Provider["ScriptedProvider"]
Provider --> Agent["Agent::turn()"]
Agent -->|"chat()"| Provider
Provider --> Pop["Pop next response"]
Pop --> Return["Return to Agent"]
Agent -->|"Records request"| History["Request History<br/>Vec<Vec<ChatMessage>>"]
```
Key Methods:
- `new(responses: Vec<ChatResponse>)` — Initialize with response queue
- `request_count()` — Assert how many LLM calls were made
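The FIFO pattern can be sketched in a few lines. This is a simplified synchronous stand-in (the real provider is async and returns full `ChatResponse` values); the struct shape here is illustrative, not ZeroClaw's actual code:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Hypothetical minimal response type; the real ChatResponse also
// carries tool calls and metadata.
#[derive(Clone, Debug)]
struct ChatResponse {
    text: String,
}

// Pops scripted responses in FIFO order; an exhausted queue yields
// a terminal "done" reply, matching the behavior described above.
struct ScriptedProvider {
    queue: Mutex<VecDeque<ChatResponse>>,
    requests: Mutex<usize>,
}

impl ScriptedProvider {
    fn new(responses: Vec<ChatResponse>) -> Self {
        Self {
            queue: Mutex::new(responses.into()),
            requests: Mutex::new(0),
        }
    }

    fn chat(&self) -> ChatResponse {
        *self.requests.lock().unwrap() += 1;
        self.queue
            .lock()
            .unwrap()
            .pop_front()
            .unwrap_or(ChatResponse { text: "done".into() })
    }

    fn request_count(&self) -> usize {
        *self.requests.lock().unwrap()
    }
}

fn main() {
    let p = ScriptedProvider::new(vec![ChatResponse { text: "first".into() }]);
    assert_eq!(p.chat().text, "first");
    assert_eq!(p.chat().text, "done"); // exhausted queue falls back to "done"
    assert_eq!(p.request_count(), 2);
}
```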
MockProvider (tests/agent_e2e.rs):
A simpler variant used by the E2E tests, following the same FIFO pattern.
FailingProvider:
Always returns Err to test error propagation.
EchoTool:
Echoes its message argument back as output. Used to validate tool argument passing.
CountingTool:
Tracks invocation count via shared Arc<Mutex<usize>>. Used to verify parallel dispatch and multi-step chains.
```mermaid
graph LR
Tool["CountingTool"] --> Counter["Arc<Mutex<usize>><br/>Shared counter"]
Test1["Test Thread 1"] --> Tool
Test2["Test Thread 2"] --> Tool
Tool --> Execute["execute()"]
Execute --> Increment["*count += 1"]
Execute --> Return["ToolResult"]
Test1 --> Assert["assert_eq!(*count, expected)"]
Test2 --> Assert
```
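The shared-counter pattern can be sketched like this; the tool and the test hold clones of the same `Arc`, so the test can assert on the invocation count afterwards. The types here are simplified stand-ins (a `String` instead of the real `ToolResult`):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Illustrative sketch of CountingTool's shared-counter mechanism.
#[derive(Clone)]
struct CountingTool {
    count: Arc<Mutex<usize>>,
}

impl CountingTool {
    // Returns the tool plus a clone of the counter for the test to inspect.
    fn new() -> (Self, Arc<Mutex<usize>>) {
        let count = Arc::new(Mutex::new(0));
        (Self { count: count.clone() }, count)
    }

    fn execute(&self) -> String {
        *self.count.lock().unwrap() += 1; // record the invocation
        "ok".to_string() // stands in for the real ToolResult
    }
}

fn main() {
    let (tool, count) = CountingTool::new();
    // Simulate parallel dispatch from several threads.
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let t = tool.clone();
            thread::spawn(move || t.execute())
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(*count.lock().unwrap(), 4);
}
```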
FailingTool:
Returns ToolResult { success: false, error: Some(...) } to test recovery.
PanickingTool:
Returns Err(anyhow!) to test catastrophic failure handling.
Response Builders:
| Function | Purpose |
|---|---|
| `text_response(text: &str)` | Plain text `ChatResponse` |
| `tool_response(calls: Vec<ToolCall>)` | Native tool call `ChatResponse` |
| `xml_tool_response(name, args)` | XML-tagged tool call response |

Agent Builders:
| Function | Purpose |
|---|---|
| `build_agent(provider, tools)` | Standard agent with `NativeToolDispatcher` |
| `build_agent_xml(provider, tools)` | Agent with `XmlToolDispatcher` |
| `build_agent_with_memory(...)` | Agent with custom memory backend |
| `build_agent_with_config(...)` | Agent with custom `AgentConfig` |
Memory Helpers:
```rust
fn make_memory() -> Arc<dyn Memory> {
    // Returns Memory with backend="none" (no persistence)
}

fn make_sqlite_memory() -> (Arc<dyn Memory>, TempDir) {
    // Returns Memory with backend="sqlite" for persistence tests
}
```
Tests that the agent doesn't run indefinitely when the LLM keeps calling tools:
```mermaid
sequenceDiagram
participant Test
participant Agent
participant Provider as ScriptedProvider
participant Tool
Test->>Provider: Queue 10 tool_response()
Test->>Agent: turn("infinite loop")
loop Until max_tool_iterations=3
Agent->>Provider: chat()
Provider-->>Agent: tool_response
Agent->>Tool: execute()
Tool-->>Agent: ToolResult
end
Agent-->>Test: Err("maximum tool iterations")
```
Test: `turn_bails_out_at_max_iterations` validates that the agent returns an error after `max_tool_iterations` is exceeded, preventing runaway loops.
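The bail-out logic reduces to a bounded loop. The sketch below is a hypothetical condensation, not the real `Agent::turn()`: a provider scripted to always request tools trips the cap, while one that eventually answers completes normally:

```rust
// Simplified model of the iteration cap: `wants_tool` stands in for
// "the provider's i-th response contained a tool call".
fn run_turn(
    max_tool_iterations: usize,
    wants_tool: impl Fn(usize) -> bool,
) -> Result<String, String> {
    for i in 0..max_tool_iterations {
        if !wants_tool(i) {
            return Ok("final response".into());
        }
        // Tool would be executed here; the loop then asks the provider again.
    }
    Err("maximum tool iterations".into())
}

fn main() {
    // Provider scripted to always call tools → error after the cap.
    assert!(run_turn(3, |_| true).is_err());
    // Provider that answers on the second call → normal completion.
    assert!(run_turn(3, |i| i == 0).is_ok());
}
```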
Validates that conversation history doesn't grow unbounded:
```mermaid
graph TB
Start["Agent starts<br/>max_history_messages=6"] --> Loop["Send 11 messages"]
Loop --> Check["Check history length"]
Check --> Assert1["history.len() <= 7<br/>(6 messages + 1 system)"]
Check --> Assert2["First message is system prompt"]
subgraph "Trimming Logic"
Trim["trim_history()"]
System["Preserve system prompt<br/>(always index 0)"]
Recent["Keep 6 most recent<br/>non-system messages"]
end
Loop --> Trim
Trim --> System
Trim --> Recent
```
Test: `history_trims_after_max_messages` sends more messages than `max_history_messages` and verifies that the system prompt is preserved and the history length is capped.
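The trimming invariant can be demonstrated with a minimal sketch. This is not ZeroClaw's actual `trim_history()`; messages are modeled as plain strings with the system prompt assumed at index 0:

```rust
// Keep the system prompt (index 0) plus at most `max` recent messages,
// dropping the oldest non-system messages first.
fn trim_history(history: &mut Vec<String>, max: usize) {
    while history.len() > max + 1 {
        history.remove(1); // index 0 is the system prompt; drop the next-oldest
    }
}

fn main() {
    let mut h = vec!["system".to_string()];
    for i in 0..11 {
        h.push(format!("msg {i}"));
    }
    trim_history(&mut h, 6);
    assert!(h.len() <= 7); // 6 messages + 1 system prompt
    assert_eq!(h[0], "system"); // system prompt preserved
    assert_eq!(h.last().unwrap(), "msg 10"); // most recent message kept
}
```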
Cron job execution tests validate that SecurityPolicy blocks unsafe operations:
```mermaid
graph TB
Job["CronJob"] --> Validate["Scheduler validation"]
Validate --> Check1["can_act()?<br/>ReadOnly blocks all writes"]
Validate --> Check2["is_rate_limited()?<br/>max_actions_per_hour"]
Validate --> Check3["is_command_allowed()?<br/>allowed_commands list"]
Validate --> Check4["forbidden_path_argument()?<br/>14 system dirs blocked"]
Check1 -->|Blocked| Deny["Return:<br/>(false, 'blocked by security policy')"]
Check2 -->|Blocked| Deny
Check3 -->|Blocked| Deny
Check4 -->|Blocked| Deny
Check1 -->|Allowed| Execute["Command::new('sh')"]
Check2 -->|Allowed| Execute
Check3 -->|Allowed| Execute
Check4 -->|Allowed| Execute
Execute --> Result["(bool, String)"]
```
Tests validate each security layer independently:
- `run_job_command_blocks_readonly_mode`
- `run_job_command_blocks_rate_limited`
- `run_job_command_blocks_disallowed_command`
- `run_job_command_blocks_forbidden_path_argument`
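Condensed, the gate sequence looks like the sketch below; the `SecurityPolicy` fields and `validate` function are illustrative simplifications of the real policy type, with each check able to veto execution before the command runs:

```rust
// Hypothetical condensed form of the scheduler's validation chain:
// the first failing check wins and returns (false, reason).
struct SecurityPolicy {
    read_only: bool,
    rate_limited: bool,
    allowed_commands: Vec<&'static str>,
    forbidden_paths: Vec<&'static str>,
}

fn validate(policy: &SecurityPolicy, command: &str, args: &[&str]) -> (bool, String) {
    if policy.read_only {
        return (false, "blocked by security policy: read-only".into());
    }
    if policy.rate_limited {
        return (false, "blocked by security policy: rate limited".into());
    }
    if !policy.allowed_commands.iter().any(|c| *c == command) {
        return (false, "blocked by security policy: command not allowed".into());
    }
    if args
        .iter()
        .any(|a| policy.forbidden_paths.iter().any(|p| a.starts_with(p)))
    {
        return (false, "blocked by security policy: forbidden path".into());
    }
    (true, "ok".into())
}

fn main() {
    let policy = SecurityPolicy {
        read_only: false,
        rate_limited: false,
        allowed_commands: vec!["echo"],
        forbidden_paths: vec!["/etc"],
    };
    assert!(validate(&policy, "echo", &["hello"]).0); // allowed
    assert!(!validate(&policy, "rm", &["x"]).0); // not on allowlist
    assert!(!validate(&policy, "echo", &["/etc/passwd"]).0); // forbidden path
}
```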
Validates that agent conversations persist to memory when auto_save = true:
```mermaid
sequenceDiagram
participant Test
participant Agent
participant Memory as SQLite Memory
Test->>Memory: Create with sqlite backend
Test->>Agent: build_agent_with_memory(auto_save=true)
Test->>Agent: turn("Remember this fact")
Agent->>Agent: Process turn
Agent->>Memory: store(user_message)
Agent->>Memory: store(assistant_response)
Test->>Memory: count()
Memory-->>Test: count >= 2 ✓
Note over Test: Separate test with auto_save=false
Test->>Agent: turn("test message")
Test->>Memory: count()
Memory-->>Test: count == 0 ✓
```
Tests:
- `auto_save_stores_messages_in_memory` — Verifies `count() >= 2` after one turn
- `auto_save_disabled_does_not_store` — Verifies `count() == 0` when disabled
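The assertion pattern both tests rely on can be sketched with an in-memory stand-in for the SQLite backend; `MemoryStore` and this `run_turn` are illustrative names, not ZeroClaw's real API:

```rust
use std::sync::{Arc, Mutex};

// In-memory stand-in for the persistence backend used by the tests.
#[derive(Default)]
struct MemoryStore {
    entries: Mutex<Vec<String>>,
}

impl MemoryStore {
    fn store(&self, entry: &str) {
        self.entries.lock().unwrap().push(entry.to_string());
    }
    fn count(&self) -> usize {
        self.entries.lock().unwrap().len()
    }
}

// When auto_save is on, both sides of the exchange are persisted.
fn run_turn(memory: &Arc<MemoryStore>, auto_save: bool, user_msg: &str) {
    let reply = "assistant reply"; // stands in for the provider response
    if auto_save {
        memory.store(user_msg);
        memory.store(reply);
    }
}

fn main() {
    let mem = Arc::new(MemoryStore::default());
    run_turn(&mem, true, "Remember this fact");
    assert!(mem.count() >= 2); // auto_save=true persists both messages

    let mem_off = Arc::new(MemoryStore::default());
    run_turn(&mem_off, false, "test message");
    assert_eq!(mem_off.count(), 0); // auto_save=false stores nothing
}
```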
Validates that both NativeToolDispatcher (structured tool calls) and XmlToolDispatcher (XML-tagged calls) produce equivalent behavior:
```mermaid
graph TB
subgraph "Native Dispatcher Path"
Native["NativeToolDispatcher"] --> SendSpecs["should_send_tool_specs() = true"]
SendSpecs --> LLM1["LLM receives tool specs"]
LLM1 --> Structured["Returns structured<br/>ChatResponse.tool_calls"]
Structured --> Parse1["parse_tool_calls()"]
end
subgraph "XML Dispatcher Path"
XML["XmlToolDispatcher"] --> NoSpecs["should_send_tool_specs() = false"]
NoSpecs --> LLM2["LLM receives system prompt<br/>with XML instructions"]
LLM2 --> Tagged["Returns XML-tagged text<br/><tool_call>...</tool_call>"]
Tagged --> Parse2["parse_tool_calls()"]
end
Parse1 --> Execute["ToolExecutionResult"]
Parse2 --> Execute
```
Tests:
- `xml_dispatcher_parses_and_loops` — E2E test with XML format
- `native_dispatcher_sends_tool_specs` — Verifies `should_send_tool_specs()`
- `xml_dispatcher_does_not_send_tool_specs` — Verifies XML omits specs
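The XML path's core operation is extracting tagged payloads from free text. The sketch below is a minimal stand-alone extractor, not ZeroClaw's actual `parse_tool_calls()` (the real parser also decodes tool names and arguments):

```rust
// Extract the payloads between <tool_call> tags from otherwise plain text.
fn parse_tool_calls(text: &str) -> Vec<&str> {
    let mut calls = Vec::new();
    let mut rest = text;
    while let Some(start) = rest.find("<tool_call>") {
        let after = &rest[start + "<tool_call>".len()..];
        if let Some(end) = after.find("</tool_call>") {
            calls.push(after[..end].trim()); // keep the tagged payload
            rest = &after[end + "</tool_call>".len()..]; // continue past this call
        } else {
            break; // unterminated tag: stop parsing
        }
    }
    calls
}

fn main() {
    let reply = "Sure. <tool_call>echo {\"message\":\"hi\"}</tool_call> done";
    assert_eq!(parse_tool_calls(reply), vec!["echo {\"message\":\"hi\"}"]);
    assert!(parse_tool_calls("no calls here").is_empty());
}
```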
```bash
# Run all tests (unit + integration)
cargo test

# Run agent tests only
cargo test --lib agent::tests

# Run scheduler tests only
cargo test --lib cron::scheduler::tests

# Run with output (show println!/tracing)
cargo test -- --nocapture

# Run specific test by name
cargo test turn_bails_out_at_max_iterations
```

```bash
# Run E2E tests
cargo test --test agent_e2e

# Run specific E2E test
cargo test --test agent_e2e e2e_multi_step_tool_chain
```

```bash
# Generate coverage with tarpaulin (if installed)
cargo tarpaulin --out Html --output-dir coverage

# Or with llvm-cov
cargo llvm-cov --html --output-dir coverage
```
```mermaid
graph TB
subgraph "Test Files"
E2E["tests/agent_e2e.rs<br/>E2E integration tests<br/>Public API boundary"]
AgentTests["src/agent/tests.rs<br/>20+ agent turn scenarios<br/>Mock providers + tools"]
SchedulerTests["src/cron/scheduler.rs<br/>#[cfg(test)] mod tests<br/>Security + retry logic"]
ChannelTests["src/channels/mattermost.rs<br/>#[cfg(test)] mod tests<br/>Parsing + mention detection"]
end
subgraph "Shared Test Infrastructure"
MockProvider["Mock Providers<br/>ScriptedProvider<br/>FailingProvider"]
MockTools["Mock Tools<br/>EchoTool<br/>CountingTool<br/>FailingTool"]
Helpers["Test Helpers<br/>build_agent()<br/>make_memory()<br/>text_response()"]
end
E2E --> MockProvider
E2E --> MockTools
E2E --> Helpers
AgentTests --> MockProvider
AgentTests --> MockTools
AgentTests --> Helpers
SchedulerTests --> TempConfig["TempDir + test_config()"]
ChannelTests --> JsonFixtures["JSON test fixtures"]
style E2E fill:#f9f9f9
style AgentTests fill:#f9f9f9
style SchedulerTests fill:#f9f9f9
style ChannelTests fill:#f9f9f9
```
Sources:
- tests/agent_e2e.rs:1-354
- src/agent/tests.rs:1-900
- src/cron/scheduler.rs:469-650
- src/channels/mattermost.rs:450-650
| Subsystem | Unit Tests | Integration Tests | E2E Tests |
|---|---|---|---|
| Agent Turn Cycle | ✅ 20+ scenarios in `src/agent/tests.rs` | — | ✅ 8 scenarios in `tests/agent_e2e.rs` |
| Tool Execution | ✅ Mock tools (echo, fail, count) | — | ✅ Full dispatch cycle |
| Tool Dispatchers | ✅ Native vs XML comparison | — | ✅ XML parser integration |
| History Management | ✅ Trimming, system prompt preservation | — | ✅ Multi-turn coherence |
| Memory Integration | ✅ Auto-save, backend switching | — | — |
| Scheduler | ✅ Execution, retries, security checks | — | — |
| Channels | ✅ Mattermost parsing, threading, mentions | — | — |
| Security Policy | ✅ Command/path blocking, rate limits | — | — |
| Provider Error Handling | ✅ `FailingProvider` propagation | — | — |
Key Gaps (areas without automated tests):
- Gateway endpoints (`/webhook`, `/pair`, `/whatsapp`) — No E2E HTTP tests visible
- Memory recall/context enrichment — Unit tests exist but limited E2E coverage
- Multi-agent delegation — No visible tests for sub-agent orchestration
- Hardware tools (GPIO, serial, USB) — No visible tests in provided files
Sources:
- tests/agent_e2e.rs:196-354
- src/agent/tests.rs:1-900
- src/cron/scheduler.rs:469-650
- src/channels/mattermost.rs:450-650