12.3 Testing



Purpose and Scope

This document describes the testing strategy and infrastructure for ZeroClaw. It covers the three-layer testing architecture (unit tests, integration tests, and end-to-end tests), the mock infrastructure used to isolate components, and the specific test coverage for critical subsystems including agent orchestration, tool execution, channel integration, and scheduler behavior.

For information about the CI/CD workflows that execute these tests, see CI/CD Workflows. For information about the agent turn cycle being tested, see Agent Turn Cycle.

Testing Architecture Overview

ZeroClaw implements a three-layer testing strategy that progressively validates system behavior from individual functions to full agent orchestration cycles:

graph TB
    subgraph "Testing Layers"
        E2E["E2E Tests<br/>tests/agent_e2e.rs<br/>Full agent orchestration"]
        Integration["Integration Tests<br/>Module #[cfg(test)]<br/>Cross-component behavior"]
        Unit["Unit Tests<br/>In-module tests<br/>Function-level validation"]
    end
    
    subgraph "Test Infrastructure"
        MockProvider["MockProvider<br/>ScriptedProvider<br/>FailingProvider"]
        MockTools["Mock Tools<br/>EchoTool<br/>CountingTool<br/>FailingTool"]
        MockMemory["Mock Memory<br/>none backend<br/>sqlite for persistence"]
        TestHelpers["Test Helpers<br/>build_agent()<br/>text_response()<br/>tool_response()"]
    end
    
    subgraph "System Under Test"
        Agent["Agent::turn()"]
        Dispatcher["ToolDispatcher<br/>NativeToolDispatcher<br/>XmlToolDispatcher"]
        Scheduler["Scheduler"]
        Channels["Channel implementations"]
        Tools["Tool implementations"]
    end
    
    E2E --> Agent
    Integration --> Dispatcher
    Integration --> Scheduler
    Integration --> Channels
    Unit --> Tools
    
    E2E --> MockProvider
    E2E --> MockTools
    E2E --> MockMemory
    E2E --> TestHelpers
    
    Integration --> MockProvider
    Integration --> MockTools
    Integration --> TestHelpers
    
    Agent --> Dispatcher
    
    style E2E fill:#f9f9f9
    style Integration fill:#f9f9f9
    style Unit fill:#f9f9f9


Test Layer Responsibilities

End-to-End Tests (tests/agent_e2e.rs)

E2E tests validate full agent orchestration cycles through the public Agent API without touching external services. They use mock providers and tools to script multi-turn conversations and verify that the agent correctly loops between LLM calls and tool execution.

Test Coverage:

Simple text responses (no tool calls)
Single tool call → final response
Multi-step tool chains (tool A → tool B → tool C)
XML dispatcher integration
Multi-turn conversation coherence
Unknown tool recovery
Parallel tool dispatch

Key Test Functions:

Test Function	Purpose
`e2e_simple_text_response`	User message → LLM text response
`e2e_single_tool_call_cycle`	Tool call → execution → final response
`e2e_multi_step_tool_chain`	Chain of 2+ tool calls before response
`e2e_xml_dispatcher_tool_call`	XML-tagged tool calls work end-to-end
`e2e_multi_turn_conversation`	Sequential turns maintain coherence
`e2e_unknown_tool_recovery`	Graceful handling of missing tools
`e2e_parallel_tool_dispatch`	Multiple tools in single response

Sources:

tests/agent_e2e.rs:196-354

Unit Tests (Module-Level)

Unit tests live in #[cfg(test)] modules alongside the code they validate. They test individual functions and edge cases with minimal dependencies.

Agent Tests (src/agent/tests.rs):

The agent test suite covers 20+ scenarios organized into categories:

graph LR
    subgraph "Agent Test Categories"
        Basic["Basic Flow<br/>1. Text response<br/>2. Single tool<br/>3. Multi-step chain"]
        Edge["Edge Cases<br/>4. Max iterations<br/>5. Unknown tool<br/>6. Tool failure<br/>7. Provider error"]
        History["History Management<br/>8. Trimming<br/>12. Mixed text+tool<br/>18. Conversation fidelity"]
        Memory["Memory Integration<br/>9. Auto-save<br/>15. Context enrichment"]
        Dispatcher["Dispatcher Modes<br/>10. Native vs XML<br/>14. System prompts"]
        Other["Other Scenarios<br/>11. Empty responses<br/>13. Multi-tool batch<br/>16. Serialization<br/>19. Builder validation"]
    end
    
    Basic --> AgentTurn["Agent::turn()"]
    Edge --> AgentTurn
    History --> AgentTurn
    Memory --> AgentTurn
    Dispatcher --> AgentTurn
    Other --> AgentTurn

Sources:

src/agent/tests.rs:1-42

Scheduler Tests (src/cron/scheduler.rs):

Validates cron job execution with security policy enforcement:

Test Function	Validates
`run_job_command_success`	Shell command execution succeeds
`run_job_command_failure`	Non-zero exit codes handled
`run_job_command_times_out`	Jobs killed after timeout
`run_job_command_blocks_disallowed_command`	Security: command allowlist
`run_job_command_blocks_forbidden_path_argument`	Security: path restrictions
`run_job_command_blocks_readonly_mode`	Security: read-only enforcement
`run_job_command_blocks_rate_limited`	Security: rate limiting
`execute_job_with_retry_recovers_after_first_failure`	Retry logic with backoff
`execute_job_with_retry_exhausts_attempts`	Retry exhaustion

Sources:

src/cron/scheduler.rs:469-650
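The two retry tests pin down a simple contract. A job that fails once and then succeeds is reported as a success. A job that never succeeds stops after a fixed number of attempts. The sketch below illustrates that contract under stated assumptions: `execute_with_retry`, its parameters, and the doubling backoff are hypothetical, and the job is modeled as a closure returning `(bool, String)` (success flag plus output). The real function in `src/cron/scheduler.rs` may be shaped differently.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical retry helper: runs `job` up to `1 + retries` times and returns
/// the first successful result, or the last failure if no attempt succeeds.
pub fn execute_with_retry<F>(mut job: F, retries: u32, base_backoff: Duration) -> (bool, String)
where
    F: FnMut() -> (bool, String),
{
    let mut attempt: u32 = 0;
    loop {
        let result = job();
        // Stop on success, or once the retry budget is spent.
        if result.0 || attempt == retries {
            return result;
        }
        // Exponential backoff before the next attempt: base, 2*base, 4*base, ...
        thread::sleep(base_backoff.saturating_mul(2u32.saturating_pow(attempt)));
        attempt += 1;
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn recovers_after_first_failure() {
        let mut calls = 0;
        let (ok, out) = execute_with_retry(
            || {
                calls += 1;
                if calls == 1 { (false, "boom".to_string()) } else { (true, "ok".to_string()) }
            },
            2,
            Duration::ZERO,
        );
        assert!(ok);
        assert_eq!(out, "ok");
        assert_eq!(calls, 2);
    }

    #[test]
    fn exhausts_attempts() {
        let mut calls = 0;
        let (ok, _) = execute_with_retry(|| { calls += 1; (false, "fail".to_string()) }, 2, Duration::ZERO);
        assert!(!ok);
        assert_eq!(calls, 3); // 1 initial attempt + 2 retries
    }
}
```

Making the base delay a parameter lets tests pass `Duration::ZERO`, which keeps them fast and deterministic while still exercising the retry path.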

Channel Tests (src/channels/mattermost.rs):

Tests channel-specific parsing and security:

graph TB
    subgraph "Mattermost Test Scenarios"
        Parse["Message Parsing<br/>parse_mattermost_post()"]
        Thread["Threading Logic<br/>thread_replies config"]
        Mention["Mention Detection<br/>mention_only mode"]
        Security["Security Checks<br/>Allowlist<br/>Self-ignore<br/>Old message filter"]
    end
    
    Parse --> MattermostChannel["MattermostChannel"]
    Thread --> MattermostChannel
    Mention --> MattermostChannel
    Security --> MattermostChannel
    
    Parse -.tests.-> ParseTests["mattermost_parse_post_basic<br/>mattermost_parse_post_thread<br/>mattermost_parse_post_ignore_self"]
    Thread -.tests.-> ThreadTests["mattermost_parse_post_thread_replies_enabled<br/>mattermost_parse_post_thread_replies_disabled"]
    Mention -.tests.-> MentionTests["mattermost_mention_only_accepts_mention<br/>mattermost_mention_only_rejects_no_mention<br/>find_bot_mention_spans"]
    Security -.tests.-> SecurityTests["mattermost_allowlist_wildcard<br/>mattermost_parse_post_allowlist_deny"]

Sources:

src/channels/mattermost.rs:450-650
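The allowlist and mention-only tests reduce to two small predicates. The following sketch shows one way to write them. The function names (`is_user_allowed`, `mention_spans`, `accept_in_mention_only`), the `"*"` wildcard handling, and the rule that a mention must not touch other username characters are assumptions for illustration, not the code in `src/channels/mattermost.rs`.

```rust
/// Hypothetical allowlist check: "*" admits every user, otherwise the
/// user id must appear verbatim in the list.
pub fn is_user_allowed(allowed_users: &[&str], user_id: &str) -> bool {
    allowed_users.iter().any(|u| *u == "*" || *u == user_id)
}

/// Hypothetical mention finder: byte spans of "@<bot_username>" that are not
/// glued to adjacent username characters ("@botx" and "a@bot" do not count).
pub fn mention_spans(text: &str, bot_username: &str) -> Vec<(usize, usize)> {
    let needle = format!("@{bot_username}");
    let is_name_char = |c: char| c.is_alphanumeric() || c == '_' || c == '-';
    text.match_indices(needle.as_str())
        .filter(|&(start, m)| {
            let end = start + m.len();
            let clear_before = text[..start].chars().next_back().map_or(true, |c| !is_name_char(c));
            let clear_after = text[end..].chars().next().map_or(true, |c| !is_name_char(c));
            clear_before && clear_after
        })
        .map(|(start, m)| (start, start + m.len()))
        .collect()
}

/// In mention-only mode, a post is accepted only if it mentions the bot.
pub fn accept_in_mention_only(text: &str, bot_username: &str) -> bool {
    !mention_spans(text, bot_username).is_empty()
}
```

Returning byte spans rather than a plain boolean would let a caller strip the mention from the text before handing the message to the agent.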

Mock Infrastructure

Mock Providers

ScriptedProvider (src/agent/tests.rs):

Returns pre-scripted responses in FIFO order, enabling deterministic test scenarios. Once the queue is exhausted, it returns a plain "done" text response.

graph LR
    Test["Test Setup"] --> Queue["Response Queue<br/>Vec&lt;ChatResponse&gt;"]
    Queue --> Provider["ScriptedProvider"]
    Provider --> Agent["Agent::turn()"]
    Agent -->|"chat()"| Provider
    Provider --> Pop["Pop next response"]
    Pop --> Return["Return to Agent"]
    
    Agent -->|"Records request"| History["Request History<br/>Vec&lt;Vec&lt;ChatMessage&gt;&gt;"]

Key Methods:

new(responses: Vec<ChatResponse>) — Initialize with response queue
request_count() — Assert how many LLM calls were made

MockProvider (tests/agent_e2e.rs):

A simpler variant used by the E2E tests, following the same FIFO pattern.

FailingProvider:

Always returns Err from chat(), to test that provider errors propagate to the caller of Agent::turn().
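A minimal sketch of the scripted-provider pattern, assuming simplified synchronous stand-ins for `ChatMessage`, `ChatResponse`, and the `Provider` trait (the real types carry more fields and the real trait is likely async). In this sketch the queue and the request log sit behind a `Mutex` so that `chat()` can take `&self`; that allows one provider instance to be shared with the agent while the test still calls `request_count()` on it.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Simplified, hypothetical stand-ins for the real message and response types.
#[derive(Clone, Debug, PartialEq)]
pub struct ChatMessage {
    pub role: String,
    pub content: String,
}

#[derive(Clone, Debug, PartialEq)]
pub struct ChatResponse {
    pub text: Option<String>,
    pub tool_calls: Vec<String>, // tool names only, for brevity
}

/// Synchronous provider trait (assumed; the real one is likely async).
pub trait Provider {
    fn chat(&self, messages: &[ChatMessage]) -> Result<ChatResponse, String>;
}

/// Replays scripted responses in FIFO order and records every request.
pub struct ScriptedProvider {
    responses: Mutex<VecDeque<ChatResponse>>,
    requests: Mutex<Vec<Vec<ChatMessage>>>,
}

impl ScriptedProvider {
    pub fn new(responses: Vec<ChatResponse>) -> Self {
        Self {
            responses: Mutex::new(VecDeque::from(responses)),
            requests: Mutex::new(Vec::new()),
        }
    }

    /// How many chat() calls have been made so far.
    pub fn request_count(&self) -> usize {
        self.requests.lock().unwrap().len()
    }
}

impl Provider for ScriptedProvider {
    fn chat(&self, messages: &[ChatMessage]) -> Result<ChatResponse, String> {
        self.requests.lock().unwrap().push(messages.to_vec());
        // Next scripted response, or a plain "done" once the queue is empty.
        let next = self.responses.lock().unwrap().pop_front();
        Ok(next.unwrap_or_else(|| ChatResponse {
            text: Some("done".to_string()),
            tool_calls: Vec::new(),
        }))
    }
}

/// Always fails, to exercise error propagation.
pub struct FailingProvider;

impl Provider for FailingProvider {
    fn chat(&self, _messages: &[ChatMessage]) -> Result<ChatResponse, String> {
        Err("scripted provider failure".to_string())
    }
}
```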


Mock Tools

EchoTool:

Echoes its message argument back as output. Used to validate tool argument passing.

CountingTool:

Tracks invocation count via shared Arc<Mutex<usize>>. Used to verify parallel dispatch and multi-step chains.

graph LR
    Tool["CountingTool"] --> Counter["Arc&lt;Mutex&lt;usize&gt;&gt;<br/>Shared counter"]
    Test1["Test Thread 1"] --> Tool
    Test2["Test Thread 2"] --> Tool
    Tool --> Execute["execute()"]
    Execute --> Increment["*count += 1"]
    Execute --> Return["ToolResult"]
    Test1 --> Assert["assert_eq!(*count, expected)"]
    Test2 --> Assert

FailingTool:

Returns ToolResult { success: false, error: Some(...) } to test recovery.

PanickingTool:

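Despite its name, returns Err(anyhow!(...)) from execute() rather than panicking. Used to test handling of hard tool errors, as opposed to the soft failure (success: false) reported by FailingTool.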
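The mock tools can be sketched against a simplified, hypothetical `Tool` trait: a name plus a synchronous `execute` taking a single message string. The real trait's signature and the real `ToolResult` fields differ. The point the sketch illustrates is the split between a soft failure (an `Ok` result with `success: false`) and a hard failure (an `Err` from `execute()` itself), plus the shared-counter pattern used by `CountingTool`.

```rust
use std::sync::{Arc, Mutex};

// Simplified, hypothetical stand-ins for the real Tool trait and ToolResult.
#[derive(Clone, Debug, PartialEq)]
pub struct ToolResult {
    pub success: bool,
    pub output: String,
    pub error: Option<String>,
}

pub trait Tool {
    fn name(&self) -> &str;
    fn execute(&self, message: &str) -> Result<ToolResult, String>;
}

/// Echoes its input, so a test can check what arguments reached the tool.
pub struct EchoTool;
impl Tool for EchoTool {
    fn name(&self) -> &str { "echo" }
    fn execute(&self, message: &str) -> Result<ToolResult, String> {
        Ok(ToolResult { success: true, output: message.to_string(), error: None })
    }
}

/// Increments a counter that the test shares through an Arc clone.
pub struct CountingTool {
    pub count: Arc<Mutex<usize>>,
}
impl Tool for CountingTool {
    fn name(&self) -> &str { "counter" }
    fn execute(&self, _message: &str) -> Result<ToolResult, String> {
        let mut count = self.count.lock().unwrap();
        *count += 1;
        Ok(ToolResult { success: true, output: format!("call #{}", *count), error: None })
    }
}

/// Soft failure: execute() returns Ok, but the result reports success = false.
pub struct FailingTool;
impl Tool for FailingTool {
    fn name(&self) -> &str { "fail" }
    fn execute(&self, _message: &str) -> Result<ToolResult, String> {
        Ok(ToolResult {
            success: false,
            output: String::new(),
            error: Some("intentional failure".to_string()),
        })
    }
}

/// Hard failure: execute() itself returns Err (it does not actually panic).
pub struct PanickingTool;
impl Tool for PanickingTool {
    fn name(&self) -> &str { "panic" }
    fn execute(&self, _message: &str) -> Result<ToolResult, String> {
        Err("tool blew up".to_string())
    }
}
```

Because the test keeps its own clone of the `Arc<Mutex<usize>>`, it can read the count after the agent (or several threads) has used the tool, without needing access to the tool object itself.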


Test Helpers

Response Builders:

Function	Purpose
`text_response(text: &str)`	Plain text `ChatResponse`
`tool_response(calls: Vec<ToolCall>)`	Native tool call `ChatResponse`
`xml_tool_response(name, args)`	XML-tagged tool call response

Agent Builders:

Function	Purpose
`build_agent(provider, tools)`	Standard agent with `NativeToolDispatcher`
`build_agent_xml(provider, tools)`	Agent with `XmlToolDispatcher`
`build_agent_with_memory(...)`	Agent with custom memory backend
`build_agent_with_config(...)`	Agent with custom `AgentConfig`
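The response builders are thin constructors. The sketch below assumes field names (`text`, `tool_calls`, and a `ToolCall` with `id`, `name`, `arguments`) that are not confirmed by this page; in particular, the JSON payload inside the `<tool_call>` tag is an assumption about the XML format.

```rust
// Simplified, hypothetical stand-ins; the real ChatResponse/ToolCall differ.
#[derive(Clone, Debug, PartialEq)]
pub struct ToolCall {
    pub id: String,
    pub name: String,
    pub arguments: String, // raw JSON text
}

#[derive(Clone, Debug, PartialEq)]
pub struct ChatResponse {
    pub text: Option<String>,
    pub tool_calls: Vec<ToolCall>,
}

/// Plain text reply with no tool calls.
pub fn text_response(text: &str) -> ChatResponse {
    ChatResponse { text: Some(text.to_string()), tool_calls: Vec::new() }
}

/// Native (structured) tool-call reply: calls travel in `tool_calls`, not in text.
pub fn tool_response(calls: Vec<ToolCall>) -> ChatResponse {
    ChatResponse { text: None, tool_calls: calls }
}

/// XML-style reply: the call is embedded in the text and `tool_calls` stays empty.
/// The JSON body inside the tag is an assumed format.
pub fn xml_tool_response(name: &str, args: &str) -> ChatResponse {
    let body = format!(r#"{{"name":"{name}","arguments":{args}}}"#);
    text_response(&format!("<tool_call>{body}</tool_call>"))
}
```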

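Memory Helpers (signatures shown, bodies elided):

fn make_memory() -> Arc<dyn Memory> {
    // Returns a Memory with backend = "none" (no persistence)
}

fn make_sqlite_memory() -> (Arc<dyn Memory>, TempDir) {
    // Returns a Memory with backend = "sqlite" for persistence tests.
    // Keep the returned TempDir alive for the whole test: dropping it
    // deletes the directory that holds the database.
}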


Critical Test Scenarios

Tool Call Loop Termination

Tests that the agent doesn't run indefinitely when the LLM keeps calling tools:

sequenceDiagram
    participant Test
    participant Agent
    participant Provider as ScriptedProvider
    participant Tool
    
    Test->>Provider: Queue 10 tool_response()
    Test->>Agent: turn("infinite loop")
    
    loop Until max_tool_iterations=3
        Agent->>Provider: chat()
        Provider-->>Agent: tool_response
        Agent->>Tool: execute()
        Tool-->>Agent: ToolResult
    end
    
    Agent-->>Test: Err("maximum tool iterations")

Test: turn_bails_out_at_max_iterations validates that the agent returns an error after max_tool_iterations is exceeded, preventing runaway loops.

Sources:

src/agent/tests.rs:445-474
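The bail-out rule can be illustrated with a stripped-down, synchronous turn loop. This is a sketch, not `Agent::turn()`: `run_turn` and `Reply` are hypothetical, the provider and tool dispatch are passed in as closures, and each loop iteration counts as one LLM call. If every one of the `max_tool_iterations` calls asks for a tool, the turn ends with an error instead of looping forever.

```rust
/// Hypothetical, simplified LLM reply: either final text or a tool request.
#[derive(Clone, Debug)]
pub enum Reply {
    Text(String),
    ToolCall(String), // tool name
}

/// Sketch of a turn loop with an iteration cap.
/// `chat` stands in for the provider, `run_tool` for tool dispatch.
pub fn run_turn<C, T>(mut chat: C, mut run_tool: T, max_tool_iterations: usize) -> Result<String, String>
where
    C: FnMut(&[String]) -> Reply,
    T: FnMut(&str) -> String,
{
    let mut history: Vec<String> = Vec::new();
    for _ in 0..max_tool_iterations {
        match chat(history.as_slice()) {
            // A text reply ends the turn successfully.
            Reply::Text(text) => return Ok(text),
            // A tool request is executed and its result fed back next iteration.
            Reply::ToolCall(name) => {
                let result = run_tool(&name);
                history.push(format!("tool {name} -> {result}"));
            }
        }
    }
    Err(format!("agent exceeded maximum tool iterations ({max_tool_iterations})"))
}
```

Scripting more tool calls than the cap allows (10 against a cap of 3, as in the diagram) and then asserting on both the error and the number of provider calls is what makes such a test catch runaway loops.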

History Trimming

Validates that conversation history doesn't grow unbounded:

graph TB
    Start["Agent starts<br/>max_history_messages=6"] --> Loop["Send 11 messages"]
    Loop --> Check["Check history length"]
    Check --> Assert1["history.len() <= 7<br/>(6 messages + 1 system)"]
    Check --> Assert2["First message is system prompt"]
    
    subgraph "Trimming Logic"
        Trim["trim_history()"]
        System["Preserve system prompt<br/>(always index 0)"]
        Recent["Keep 6 most recent<br/>non-system messages"]
    end
    
    Loop --> Trim
    Trim --> System
    Trim --> Recent

Test: history_trims_after_max_messages sends more messages than max_history_messages and verifies the system prompt is preserved and history length is capped.

Sources:

src/agent/tests.rs:588-620
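The trimming rule (pin the system prompt, keep the N most recent other messages) fits in a few lines. A minimal sketch with hypothetical `Role` and `Message` types; the real `trim_history()` may count messages differently.

```rust
#[derive(Clone, Debug, PartialEq)]
pub enum Role {
    System,
    User,
    Assistant,
}

#[derive(Clone, Debug, PartialEq)]
pub struct Message {
    pub role: Role,
    pub content: String,
}

/// Keep a leading system prompt (if present) plus the `max_history_messages`
/// most recent non-system messages, dropping the oldest ones first.
pub fn trim_history(history: &mut Vec<Message>, max_history_messages: usize) {
    let has_system = history.first().map_or(false, |m| m.role == Role::System);
    let start = if has_system { 1 } else { 0 };
    let non_system = history.len() - start;
    if non_system > max_history_messages {
        let excess = non_system - max_history_messages;
        // Remove the oldest messages right after the system prompt.
        history.drain(start..start + excess);
    }
}
```

With `max_history_messages = 6`, sending 11 user messages leaves 7 entries: the system prompt followed by messages 5 through 10.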

Security Policy Enforcement (Scheduler)

Cron job execution tests validate that SecurityPolicy blocks unsafe operations:

graph TB
    Job["CronJob"] --> Validate["Scheduler validation"]
    
    Validate --> Check1["can_act()?<br/>ReadOnly blocks all writes"]
    Validate --> Check2["is_rate_limited()?<br/>max_actions_per_hour"]
    Validate --> Check3["is_command_allowed()?<br/>allowed_commands list"]
    Validate --> Check4["forbidden_path_argument()?<br/>14 system dirs blocked"]
    
    Check1 -->|Blocked| Deny["Return:<br/>(false, 'blocked by security policy')"]
    Check2 -->|Blocked| Deny
    Check3 -->|Blocked| Deny
    Check4 -->|Blocked| Deny
    
    Check1 -->|Allowed| Execute["Command::new('sh')"]
    Check2 -->|Allowed| Execute
    Check3 -->|Allowed| Execute
    Check4 -->|Allowed| Execute
    
    Execute --> Result["(bool, String)"]

Tests validate each security layer independently:

run_job_command_blocks_readonly_mode
run_job_command_blocks_rate_limited
run_job_command_blocks_disallowed_command
run_job_command_blocks_forbidden_path_argument
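Each of the four checks in the diagram is a guard that short-circuits with a blocked result. The sketch below shows that gate assuming a hypothetical, simplified `SecurityPolicy`: the field names, the order of the checks, the prefix match for paths, and the exact messages are all assumptions. It only decides whether a command may run and stops short of spawning `sh`.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Autonomy {
    ReadOnly,
    Supervised,
}

/// Hypothetical, simplified policy; the real SecurityPolicy differs.
pub struct SecurityPolicy {
    pub autonomy: Autonomy,
    pub allowed_commands: Vec<String>,
    pub forbidden_paths: Vec<String>,
    pub max_actions_per_hour: u32,
    pub actions_this_hour: u32,
}

impl SecurityPolicy {
    fn can_act(&self) -> bool {
        self.autonomy != Autonomy::ReadOnly
    }
    fn is_rate_limited(&self) -> bool {
        self.actions_this_hour >= self.max_actions_per_hour
    }
    fn is_command_allowed(&self, command: &str) -> bool {
        // Only the program name (first word) is checked against the allowlist.
        let program = command.split_whitespace().next().unwrap_or("");
        self.allowed_commands.iter().any(|c| c == program)
    }
    fn forbidden_path_argument(&self, command: &str) -> Option<String> {
        // Naive prefix match on arguments; a real check would normalize paths.
        command
            .split_whitespace()
            .skip(1)
            .find(|arg| self.forbidden_paths.iter().any(|p| arg.starts_with(p.as_str())))
            .map(str::to_string)
    }
}

/// Returns (allowed, message); a real runner would spawn `sh -c` only when allowed.
pub fn check_job_command(policy: &SecurityPolicy, command: &str) -> (bool, String) {
    let blocked = |why: &str| (false, format!("blocked by security policy: {why}"));
    if !policy.can_act() {
        return blocked("read-only mode");
    }
    if policy.is_rate_limited() {
        return blocked("rate limit exceeded");
    }
    if !policy.is_command_allowed(command) {
        return blocked("command not allowed");
    }
    if let Some(path) = policy.forbidden_path_argument(command) {
        return blocked(&format!("forbidden path {path}"));
    }
    (true, "allowed".to_string())
}
```

Keeping each check a separate method is what makes it practical to test every layer independently, one blocked case per test, as the four test names above do.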


Memory Auto-Save Round-Trip

Validates that agent conversations persist to memory when auto_save = true:

sequenceDiagram
    participant Test
    participant Agent
    participant Memory as SQLite Memory
    
    Test->>Memory: Create with sqlite backend
    Test->>Agent: build_agent_with_memory(auto_save=true)
    Test->>Agent: turn("Remember this fact")
    
    Agent->>Agent: Process turn
    Agent->>Memory: store(user_message)
    Agent->>Memory: store(assistant_response)
    
    Test->>Memory: count()
    Memory-->>Test: count >= 2 ✓
    
    Note over Test: Separate test with auto_save=false
    Test->>Agent: turn("test message")
    Test->>Memory: count()
    Memory-->>Test: count == 0 ✓

Tests:

auto_save_stores_messages_in_memory — Verifies count() >= 2 after one turn
auto_save_disabled_does_not_store — Verifies count() == 0 when disabled

Sources:

src/agent/tests.rs:626-666
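The two auto-save tests only assert on the number of stored entries, so the behavior under test can be sketched with an in-process stand-in for the SQLite backend. The `Memory` trait, `VecMemory`, `finish_turn`, and the storage keys here are hypothetical simplifications; the real trait is richer, and the real test persists to SQLite in a temporary directory.

```rust
use std::sync::Mutex;

/// Hypothetical minimal memory trait; the real Memory trait is richer (and async).
pub trait Memory {
    fn store(&self, key: &str, content: &str);
    fn count(&self) -> usize;
}

/// In-process stand-in for a persistent backend.
#[derive(Default)]
pub struct VecMemory {
    entries: Mutex<Vec<(String, String)>>,
}

impl Memory for VecMemory {
    fn store(&self, key: &str, content: &str) {
        self.entries.lock().unwrap().push((key.to_string(), content.to_string()));
    }
    fn count(&self) -> usize {
        self.entries.lock().unwrap().len()
    }
}

/// Sketch of an end-of-turn hook: with auto_save on, both sides of the
/// exchange are stored; with it off, memory is left untouched.
pub fn finish_turn(memory: &dyn Memory, auto_save: bool, user: &str, reply: &str) {
    if auto_save {
        memory.store("user_msg", user);
        memory.store("assistant_resp", reply);
    }
}
```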

Dispatcher Comparison (Native vs XML)

Validates that both NativeToolDispatcher (structured tool calls) and XmlToolDispatcher (XML-tagged calls) produce equivalent behavior:

graph TB
    subgraph "Native Dispatcher Path"
        Native["NativeToolDispatcher"] --> SendSpecs["should_send_tool_specs() = true"]
        SendSpecs --> LLM1["LLM receives tool specs"]
        LLM1 --> Structured["Returns structured<br/>ChatResponse.tool_calls"]
        Structured --> Parse1["parse_tool_calls()"]
    end
    
    subgraph "XML Dispatcher Path"
        XML["XmlToolDispatcher"] --> NoSpecs["should_send_tool_specs() = false"]
        NoSpecs --> LLM2["LLM receives system prompt<br/>with XML instructions"]
        LLM2 --> Tagged["Returns XML-tagged text<br/>&lt;tool_call&gt;...&lt;/tool_call&gt;"]
        Tagged --> Parse2["parse_tool_calls()"]
    end
    
    Parse1 --> Execute["ToolExecutionResult"]
    Parse2 --> Execute

Tests:

xml_dispatcher_parses_and_loops: E2E test with XML format
native_dispatcher_sends_tool_specs: Verifies should_send_tool_specs() returns true
xml_dispatcher_does_not_send_tool_specs: Verifies should_send_tool_specs() returns false (XML omits specs)
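The dispatcher difference comes down to two things: whether tool specs are sent to the provider, and where tool calls are read from. A sketch with a hypothetical one-method trait; `XmlToolDispatcher::parse_tool_calls` here only extracts the raw text between `<tool_call>` and `</tool_call>` and leaves decoding the payload to the caller, which is a simplification of whatever the real parser does.

```rust
/// Hypothetical one-method trait; the real ToolDispatcher does more.
pub trait ToolDispatcher {
    fn should_send_tool_specs(&self) -> bool;
}

/// Structured tool calling: specs go to the provider, calls come back as data.
pub struct NativeToolDispatcher;

impl ToolDispatcher for NativeToolDispatcher {
    fn should_send_tool_specs(&self) -> bool {
        true
    }
}

/// Prompt-based tool calling: tools are described in the system prompt and
/// calls arrive as <tool_call>...</tool_call> tags inside the reply text.
pub struct XmlToolDispatcher;

impl ToolDispatcher for XmlToolDispatcher {
    fn should_send_tool_specs(&self) -> bool {
        false
    }
}

impl XmlToolDispatcher {
    /// Returns the trimmed payload of every complete <tool_call> tag, in order.
    pub fn parse_tool_calls(&self, text: &str) -> Vec<String> {
        const OPEN: &str = "<tool_call>";
        const CLOSE: &str = "</tool_call>";
        let mut calls = Vec::new();
        let mut rest = text;
        while let Some(start) = rest.find(OPEN) {
            let after_open = &rest[start + OPEN.len()..];
            // An unterminated tag ends parsing; its content is ignored.
            let Some(end) = after_open.find(CLOSE) else { break };
            calls.push(after_open[..end].trim().to_string());
            rest = &after_open[end + CLOSE.len()..];
        }
        calls
    }
}
```

Ignoring an unterminated tag rather than guessing at its payload is a conservative choice: a truncated model reply then produces no tool call instead of a malformed one.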


Running Tests

Unit and Integration Tests

# Run all tests (unit + integration)
cargo test

# Run agent tests only
cargo test --lib agent::tests

# Run scheduler tests only
cargo test --lib cron::scheduler::tests

# Run with output (show captured stdout/stderr, e.g. println!)
cargo test -- --nocapture

# Run specific test by name
cargo test turn_bails_out_at_max_iterations

End-to-End Tests

# Run E2E tests
cargo test --test agent_e2e

# Run specific E2E test
cargo test --test agent_e2e e2e_multi_step_tool_chain

Coverage Report

# Generate coverage with tarpaulin (if installed)
cargo tarpaulin --out Html --output-dir coverage

# Or with cargo-llvm-cov (if installed)
cargo llvm-cov --html --output-dir coverage


Test Organization and File Structure

graph TB
    subgraph "Test Files"
        E2E["tests/agent_e2e.rs<br/>E2E integration tests<br/>Public API boundary"]
        
        AgentTests["src/agent/tests.rs<br/>20+ agent turn scenarios<br/>Mock providers + tools"]
        
        SchedulerTests["src/cron/scheduler.rs<br/>#[cfg(test)] mod tests<br/>Security + retry logic"]
        
        ChannelTests["src/channels/mattermost.rs<br/>#[cfg(test)] mod tests<br/>Parsing + mention detection"]
    end
    
    subgraph "Shared Test Infrastructure"
        MockProvider["Mock Providers<br/>ScriptedProvider<br/>FailingProvider"]
        MockTools["Mock Tools<br/>EchoTool<br/>CountingTool<br/>FailingTool"]
        Helpers["Test Helpers<br/>build_agent()<br/>make_memory()<br/>text_response()"]
    end
    
    E2E --> MockProvider
    E2E --> MockTools
    E2E --> Helpers
    
    AgentTests --> MockProvider
    AgentTests --> MockTools
    AgentTests --> Helpers
    
    SchedulerTests --> TempConfig["TempDir + test_config()"]
    ChannelTests --> JsonFixtures["JSON test fixtures"]
    
    style E2E fill:#f9f9f9
    style AgentTests fill:#f9f9f9
    style SchedulerTests fill:#f9f9f9
    style ChannelTests fill:#f9f9f9


Coverage Areas Summary

Subsystem	Unit Tests	Integration Tests	E2E Tests
Agent Turn Cycle	✅ 20+ scenarios in `src/agent/tests.rs`	—	✅ Scenarios listed under Key Test Functions, in `tests/agent_e2e.rs`
Tool Execution	✅ Mock tools (echo, fail, count)	—	✅ Full dispatch cycle
Tool Dispatchers	✅ Native vs XML comparison	—	✅ XML parser integration
History Management	✅ Trimming, system prompt preservation	—	✅ Multi-turn coherence
Memory Integration	✅ Auto-save, backend switching	—	—
Scheduler	✅ Execution, retries, security checks	—	—
Channels	✅ Mattermost parsing, threading, mentions	—	—
Security Policy	✅ Command/path blocking, rate limits	—	—
Provider Error Handling	✅ FailingProvider propagation	—	—

Key Gaps (areas with missing or limited automated test coverage):

Gateway endpoints (/webhook, /pair, /whatsapp) — No E2E HTTP tests visible
Memory recall/context enrichment — Unit tests exist but limited E2E coverage
Multi-agent delegation — No visible tests for sub-agent orchestration
Hardware tools (GPIO, serial, USB) — No visible tests in provided files
