12.3 Testing

Nikolay Vyahhi edited this page Feb 19, 2026 · 2 revisions

Testing


Purpose and Scope

This document describes the testing strategy and infrastructure for ZeroClaw. It covers the three-layer testing architecture (unit tests, integration tests, and end-to-end tests), the mock infrastructure used to isolate components, and the specific test coverage for critical subsystems including agent orchestration, tool execution, channel integration, and scheduler behavior.

For information about the CI/CD workflows that execute these tests, see CI/CD Workflows. For information about the agent turn cycle being tested, see Agent Turn Cycle.


Testing Architecture Overview

ZeroClaw implements a three-layer testing strategy that progressively validates system behavior from individual functions to full agent orchestration cycles:

graph TB
    subgraph "Testing Layers"
        E2E["E2E Tests<br/>tests/agent_e2e.rs<br/>Full agent orchestration"]
        Integration["Integration Tests<br/>Module #[cfg(test)]<br/>Cross-component behavior"]
        Unit["Unit Tests<br/>In-module tests<br/>Function-level validation"]
    end
    
    subgraph "Test Infrastructure"
        MockProvider["MockProvider<br/>ScriptedProvider<br/>FailingProvider"]
        MockTools["Mock Tools<br/>EchoTool<br/>CountingTool<br/>FailingTool"]
        MockMemory["Mock Memory<br/>none backend<br/>sqlite for persistence"]
        TestHelpers["Test Helpers<br/>build_agent()<br/>text_response()<br/>tool_response()"]
    end
    
    subgraph "System Under Test"
        Agent["Agent::turn()"]
        Dispatcher["ToolDispatcher<br/>NativeToolDispatcher<br/>XmlToolDispatcher"]
        Scheduler["Scheduler"]
        Channels["Channel implementations"]
        Tools["Tool implementations"]
    end
    
    E2E --> Agent
    Integration --> Dispatcher
    Integration --> Scheduler
    Integration --> Channels
    Unit --> Tools
    
    E2E --> MockProvider
    E2E --> MockTools
    E2E --> MockMemory
    E2E --> TestHelpers
    
    Integration --> MockProvider
    Integration --> MockTools
    Integration --> TestHelpers
    
    Agent --> Dispatcher
    
    style E2E fill:#f9f9f9
    style Integration fill:#f9f9f9
    style Unit fill:#f9f9f9


Test Layer Responsibilities

End-to-End Tests (tests/agent_e2e.rs)

E2E tests validate full agent orchestration cycles through the public Agent API without touching external services. They use mock providers and tools to script multi-turn conversations and verify that the agent correctly loops between LLM calls and tool execution.

Test Coverage:

  • Simple text responses (no tool calls)
  • Single tool call → final response
  • Multi-step tool chains (tool A → tool B → tool C)
  • XML dispatcher integration
  • Multi-turn conversation coherence
  • Unknown tool recovery
  • Parallel tool dispatch

Key Test Functions:

| Test Function | Purpose |
| --- | --- |
| e2e_simple_text_response | User message → LLM text response |
| e2e_single_tool_call_cycle | Tool call → execution → final response |
| e2e_multi_step_tool_chain | Chain of 2+ tool calls before response |
| e2e_xml_dispatcher_tool_call | XML-tagged tool calls work end-to-end |
| e2e_multi_turn_conversation | Sequential turns maintain coherence |
| e2e_unknown_tool_recovery | Graceful handling of missing tools |
| e2e_parallel_tool_dispatch | Multiple tools in single response |


Unit Tests (Module-Level)

Unit tests live in #[cfg(test)] modules alongside the code they validate. They test individual functions and edge cases with minimal dependencies.

Agent Tests (src/agent/tests.rs):

The agent test suite covers 20+ scenarios organized into categories:

graph LR
    subgraph "Agent Test Categories"
        Basic["Basic Flow<br/>1. Text response<br/>2. Single tool<br/>3. Multi-step chain"]
        Edge["Edge Cases<br/>4. Max iterations<br/>5. Unknown tool<br/>6. Tool failure<br/>7. Provider error"]
        History["History Management<br/>8. Trimming<br/>12. Mixed text+tool<br/>18. Conversation fidelity"]
        Memory["Memory Integration<br/>9. Auto-save<br/>15. Context enrichment"]
        Dispatcher["Dispatcher Modes<br/>10. Native vs XML<br/>14. System prompts"]
        Other["Other Scenarios<br/>11. Empty responses<br/>13. Multi-tool batch<br/>16. Serialization<br/>19. Builder validation"]
    end
    
    Basic --> AgentTurn["Agent::turn()"]
    Edge --> AgentTurn
    History --> AgentTurn
    Memory --> AgentTurn
    Dispatcher --> AgentTurn
    Other --> AgentTurn

Scheduler Tests (src/cron/scheduler.rs):

Validates cron job execution with security policy enforcement:

| Test Function | Validates |
| --- | --- |
| run_job_command_success | Shell command execution succeeds |
| run_job_command_failure | Non-zero exit codes handled |
| run_job_command_times_out | Jobs killed after timeout |
| run_job_command_blocks_disallowed_command | Security: command allowlist |
| run_job_command_blocks_forbidden_path_argument | Security: path restrictions |
| run_job_command_blocks_readonly_mode | Security: read-only enforcement |
| run_job_command_blocks_rate_limited | Security: rate limiting |
| execute_job_with_retry_recovers_after_first_failure | Retry logic with backoff |
| execute_job_with_retry_exhausts_attempts | Retry exhaustion |
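
The two retry tests in the table above exercise a retry-with-backoff loop. The following is a minimal sketch of such a loop, not the scheduler's actual code: the function name `execute_with_retry`, the doubling backoff, and the `(bool, String, u32)` return shape are assumptions made for illustration.

```rust
use std::time::Duration;

/// Runs `job` up to `max_attempts` times, sleeping between failed attempts.
/// The delay doubles after each failure (an assumed backoff policy).
/// Returns (success, last output, attempts used).
pub fn execute_with_retry<F>(
    mut job: F,
    max_attempts: u32,
    base_delay: Duration,
) -> (bool, String, u32)
where
    F: FnMut() -> (bool, String),
{
    let mut delay = base_delay;
    let mut last_output = String::new();
    for attempt in 1..=max_attempts {
        let (success, output) = job();
        if success {
            return (true, output, attempt);
        }
        last_output = output;
        // Only sleep if another attempt is still coming.
        if attempt < max_attempts {
            std::thread::sleep(delay);
            delay *= 2;
        }
    }
    (false, last_output, max_attempts)
}
```

A test in the style of execute_job_with_retry_recovers_after_first_failure would pass a closure that fails once and then succeeds, and assert that two attempts were used.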


Channel Tests (src/channels/mattermost.rs):

Tests channel-specific parsing and security:

graph TB
    subgraph "Mattermost Test Scenarios"
        Parse["Message Parsing<br/>parse_mattermost_post()"]
        Thread["Threading Logic<br/>thread_replies config"]
        Mention["Mention Detection<br/>mention_only mode"]
        Security["Security Checks<br/>Allowlist<br/>Self-ignore<br/>Old message filter"]
    end
    
    Parse --> MattermostChannel["MattermostChannel"]
    Thread --> MattermostChannel
    Mention --> MattermostChannel
    Security --> MattermostChannel
    
    Parse -.tests.-> ParseTests["mattermost_parse_post_basic<br/>mattermost_parse_post_thread<br/>mattermost_parse_post_ignore_self"]
    Thread -.tests.-> ThreadTests["mattermost_parse_post_thread_replies_enabled<br/>mattermost_parse_post_thread_replies_disabled"]
    Mention -.tests.-> MentionTests["mattermost_mention_only_accepts_mention<br/>mattermost_mention_only_rejects_no_mention<br/>find_bot_mention_spans"]
    Security -.tests.-> SecurityTests["mattermost_allowlist_wildcard<br/>mattermost_parse_post_allowlist_deny"]
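
Mention detection in mention_only mode comes down to finding where the bot's handle occurs in a post. The sketch below shows one way to compute such spans. The function name `find_mention_spans`, its signature, and the rule for what counts as a longer username are assumptions; the real find_bot_mention_spans may differ.

```rust
/// Returns the byte ranges of every `@bot_username` mention in `text`,
/// skipping matches that are only a prefix of a longer handle.
pub fn find_mention_spans(text: &str, bot_username: &str) -> Vec<(usize, usize)> {
    let needle = format!("@{}", bot_username);
    let mut spans = Vec::new();
    let mut offset = 0;
    while let Some(pos) = text[offset..].find(needle.as_str()) {
        let start = offset + pos;
        let end = start + needle.len();
        // "@botty" must not count as a mention of "@bot".
        let extends_name = text[end..]
            .chars()
            .next()
            .map_or(false, |c| c.is_alphanumeric() || c == '_' || c == '-');
        if !extends_name {
            spans.push((start, end));
        }
        offset = end;
    }
    spans
}
```

Under these assumptions, a mention_only check accepts a post when the returned vector is non-empty and rejects it otherwise.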


Mock Infrastructure

Mock Providers

ScriptedProvider (src/agent/tests.rs):

Returns pre-scripted responses in FIFO order, enabling deterministic test scenarios. Once the queue is exhausted, it returns a "done" text response.

graph LR
    Test["Test Setup"] --> Queue["Response Queue<br/>Vec&lt;ChatResponse&gt;"]
    Queue --> Provider["ScriptedProvider"]
    Provider --> Agent["Agent::turn()"]
    Agent -->|"chat()"| Provider
    Provider --> Pop["Pop next response"]
    Pop --> Return["Return to Agent"]
    
    Agent -->|"Records request"| History["Request History<br/>Vec&lt;Vec&lt;ChatMessage&gt;&gt;"]

Key Methods:

  • new(responses: Vec<ChatResponse>) — Initialize with response queue
  • request_count() — Assert how many LLM calls were made
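
The FIFO behavior described above can be sketched as follows. This is a simplified, synchronous, std-only illustration: the `ChatMessage` and `ChatResponse` shapes and the `chat()` signature are assumptions, and the real provider trait is probably async.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Simplified stand-ins for the real message and response types (assumed shapes).
#[derive(Debug, Clone, PartialEq)]
pub struct ChatMessage {
    pub role: String,
    pub content: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ChatResponse {
    pub text: String,
}

/// Hands out queued responses in FIFO order and records every request it receives.
pub struct ScriptedProvider {
    responses: Mutex<VecDeque<ChatResponse>>,
    requests: Mutex<Vec<Vec<ChatMessage>>>,
}

impl ScriptedProvider {
    pub fn new(responses: Vec<ChatResponse>) -> Self {
        Self {
            responses: Mutex::new(responses.into()),
            requests: Mutex::new(Vec::new()),
        }
    }

    /// Synchronous stand-in for the provider's chat call.
    pub fn chat(&self, messages: &[ChatMessage]) -> ChatResponse {
        self.requests.lock().unwrap().push(messages.to_vec());
        self.responses
            .lock()
            .unwrap()
            .pop_front()
            // An exhausted queue falls back to a plain "done" response.
            .unwrap_or_else(|| ChatResponse { text: "done".to_string() })
    }

    /// Number of chat calls made so far.
    pub fn request_count(&self) -> usize {
        self.requests.lock().unwrap().len()
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn exhausted_queue_returns_done() {
        let provider = ScriptedProvider::new(vec![]);
        assert_eq!(provider.chat(&[]).text, "done");
        assert_eq!(provider.request_count(), 1);
    }
}
```

Interior mutability through `Mutex` lets the provider be shared immutably with the agent while the test still inspects the recorded requests afterwards.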

MockProvider (tests/agent_e2e.rs):

A simpler variant for E2E tests that follows the same FIFO pattern.

FailingProvider:

Always returns Err to test error propagation.


Mock Tools

EchoTool:

Echoes its message argument back as output. Used to validate tool argument passing.

CountingTool:

Tracks its invocation count via a shared Arc<Mutex<usize>>. Used to verify parallel dispatch and multi-step chains.

graph LR
    Tool["CountingTool"] --> Counter["Arc&lt;Mutex&lt;usize&gt;&gt;<br/>Shared counter"]
    Test1["Test Thread 1"] --> Tool
    Test2["Test Thread 2"] --> Tool
    Tool --> Execute["execute()"]
    Execute --> Increment["*count += 1"]
    Execute --> Return["ToolResult"]
    Test1 --> Assert["assert_eq!(*count, expected)"]
    Test2 --> Assert

FailingTool:

Returns ToolResult { success: false, error: Some(...) } to test recovery.

PanickingTool:

Returns Err(anyhow!(...)) instead of a ToolResult to test catastrophic failure handling.
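
To make the four mock tools concrete, here is a std-only sketch. The `Tool` trait, the `ToolResult` fields beyond `success` and `error`, and the use of `String` as the error type (in place of anyhow) are assumptions for illustration; the real trait is likely async and takes structured arguments.

```rust
use std::sync::{Arc, Mutex};

/// Assumed result shape; `success` and `error` follow the fields named above.
#[derive(Debug, Clone, PartialEq)]
pub struct ToolResult {
    pub success: bool,
    pub output: String,
    pub error: Option<String>,
}

/// Hypothetical synchronous tool trait.
pub trait Tool: Send + Sync {
    fn execute(&self, message: &str) -> Result<ToolResult, String>;
}

pub struct EchoTool;
impl Tool for EchoTool {
    fn execute(&self, message: &str) -> Result<ToolResult, String> {
        Ok(ToolResult { success: true, output: message.to_string(), error: None })
    }
}

pub struct CountingTool {
    pub count: Arc<Mutex<usize>>,
}
impl Tool for CountingTool {
    fn execute(&self, _message: &str) -> Result<ToolResult, String> {
        let mut count = self.count.lock().unwrap();
        *count += 1;
        Ok(ToolResult { success: true, output: format!("call #{}", *count), error: None })
    }
}

pub struct FailingTool;
impl Tool for FailingTool {
    fn execute(&self, _message: &str) -> Result<ToolResult, String> {
        // A "soft" failure: the call returns, but reports success = false.
        Ok(ToolResult {
            success: false,
            output: String::new(),
            error: Some("intentional failure".to_string()),
        })
    }
}

pub struct PanickingTool;
impl Tool for PanickingTool {
    fn execute(&self, _message: &str) -> Result<ToolResult, String> {
        // A "hard" failure: no ToolResult at all.
        Err("catastrophic failure".to_string())
    }
}

/// Calls one CountingTool from `threads` threads and returns the final count.
pub fn run_in_parallel(threads: usize) -> usize {
    let count = Arc::new(Mutex::new(0usize));
    let tool = Arc::new(CountingTool { count: Arc::clone(&count) });
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let tool = Arc::clone(&tool);
            std::thread::spawn(move || {
                tool.execute("").unwrap();
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let total = *count.lock().unwrap();
    total
}
```

Keeping the counter outside the tool, behind an `Arc`, is what allows a test to assert on the count after the tool itself has been moved into the agent.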


Test Helpers

Response Builders:

| Function | Purpose |
| --- | --- |
| text_response(text: &str) | Plain text ChatResponse |
| tool_response(calls: Vec<ToolCall>) | Native tool call ChatResponse |
| xml_tool_response(name, args) | XML-tagged tool call response |

Agent Builders:

| Function | Purpose |
| --- | --- |
| build_agent(provider, tools) | Standard agent with NativeToolDispatcher |
| build_agent_xml(provider, tools) | Agent with XmlToolDispatcher |
| build_agent_with_memory(...) | Agent with custom memory backend |
| build_agent_with_config(...) | Agent with custom AgentConfig |

Memory Helpers:

fn make_memory() -> Arc<dyn Memory> {
    // Returns Memory with backend="none" (no persistence)
}

fn make_sqlite_memory() -> (Arc<dyn Memory>, TempDir) {
    // Returns Memory with backend="sqlite" for persistence tests
}



Critical Test Scenarios

Tool Call Loop Termination

Tests that the agent doesn't run indefinitely when the LLM keeps calling tools:

sequenceDiagram
    participant Test
    participant Agent
    participant Provider as ScriptedProvider
    participant Tool
    
    Test->>Provider: Queue 10 tool_response()
    Test->>Agent: turn("infinite loop")
    
    loop Until max_tool_iterations=3
        Agent->>Provider: chat()
        Provider-->>Agent: tool_response
        Agent->>Tool: execute()
        Tool-->>Agent: ToolResult
    end
    
    Agent-->>Test: Err("maximum tool iterations")

Test: turn_bails_out_at_max_iterations validates that the agent returns an error after max_tool_iterations is exceeded, preventing runaway loops.
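
The bail-out behavior can be illustrated with a small loop. This is a sketch under assumptions, not the agent's implementation: `ModelReply`, `run_turn`, and the closure-based provider and tool are hypothetical stand-ins, and the exact error text is only assumed to contain "maximum tool iterations".

```rust
/// What the scripted model returns on each call (simplified, assumed shape).
#[derive(Debug, Clone)]
pub enum ModelReply {
    Text(String),
    ToolCall(String),
}

/// Alternates between asking the model and running a tool, giving up once
/// `max_tool_iterations` rounds have passed without a text reply.
pub fn run_turn(
    mut next_reply: impl FnMut() -> ModelReply,
    mut run_tool: impl FnMut(&str) -> String,
    max_tool_iterations: usize,
) -> Result<String, String> {
    for _ in 0..max_tool_iterations {
        match next_reply() {
            ModelReply::Text(text) => return Ok(text),
            ModelReply::ToolCall(name) => {
                // A real agent would append the tool output to the history here.
                let _output = run_tool(&name);
            }
        }
    }
    Err(format!(
        "maximum tool iterations ({}) exceeded",
        max_tool_iterations
    ))
}
```

With a provider that always answers with a tool call and a cap of 3, the loop runs the tool three times and then returns the error, which is the shape of assertion turn_bails_out_at_max_iterations makes.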


History Trimming

Validates that conversation history doesn't grow unbounded:

graph TB
    Start["Agent starts<br/>max_history_messages=6"] --> Loop["Send 11 messages"]
    Loop --> Check["Check history length"]
    Check --> Assert1["history.len() <= 7<br/>(6 messages + 1 system)"]
    Check --> Assert2["First message is system prompt"]
    
    subgraph "Trimming Logic"
        Trim["trim_history()"]
        System["Preserve system prompt<br/>(always index 0)"]
        Recent["Keep 6 most recent<br/>non-system messages"]
    end
    
    Loop --> Trim
    Trim --> System
    Trim --> Recent

Test: history_trims_after_max_messages sends more messages than max_history_messages and verifies the system prompt is preserved and history length is capped.
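
A trimming routine with the properties this test checks might look like the sketch below. The `ChatMessage` shape and the `trim_history` signature are assumptions; only the behavior (system prompt stays at index 0, the most recent non-system messages are kept) is taken from the description above.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct ChatMessage {
    pub role: String,
    pub content: String,
}

/// Keeps a leading system prompt, if present, plus the `max_history_messages`
/// most recent other messages.
pub fn trim_history(history: &mut Vec<ChatMessage>, max_history_messages: usize) {
    let has_system = history.first().map_or(false, |m| m.role == "system");
    let start = if has_system { 1 } else { 0 };
    let non_system = history.len() - start;
    if non_system > max_history_messages {
        let excess = non_system - max_history_messages;
        // Drop the oldest non-system messages; index 0 is left untouched.
        history.drain(start..start + excess);
    }
}
```

With max_history_messages = 6 and 11 user messages appended after a system prompt, the history ends at 7 entries: the system prompt followed by the six newest messages.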


Security Policy Enforcement (Scheduler)

Cron job execution tests validate that SecurityPolicy blocks unsafe operations:

graph TB
    Job["CronJob"] --> Validate["Scheduler validation"]
    
    Validate --> Check1["can_act()?<br/>ReadOnly blocks all writes"]
    Validate --> Check2["is_rate_limited()?<br/>max_actions_per_hour"]
    Validate --> Check3["is_command_allowed()?<br/>allowed_commands list"]
    Validate --> Check4["forbidden_path_argument()?<br/>14 system dirs blocked"]
    
    Check1 -->|Blocked| Deny["Return:<br/>(false, 'blocked by security policy')"]
    Check2 -->|Blocked| Deny
    Check3 -->|Blocked| Deny
    Check4 -->|Blocked| Deny
    
    Check1 -->|Allowed| Execute["Command::new('sh')"]
    Check2 -->|Allowed| Execute
    Check3 -->|Allowed| Execute
    Check4 -->|Allowed| Execute
    
    Execute --> Result["(bool, String)"]

Tests validate each security layer independently:

  • run_job_command_blocks_readonly_mode
  • run_job_command_blocks_rate_limited
  • run_job_command_blocks_disallowed_command
  • run_job_command_blocks_forbidden_path_argument
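
The layered checks can be sketched as a single function that rejects on the first failing layer. Everything here is a simplified assumption: the `SecurityPolicy` fields, the `check` method, the whitespace-based command splitting, and the prefix-based path match are illustrative only, and the sketch reports what it would run instead of spawning a shell.

```rust
/// Hypothetical, simplified policy; fields loosely follow the checks in the diagram.
pub struct SecurityPolicy {
    pub read_only: bool,
    pub actions_this_hour: u32,
    pub max_actions_per_hour: u32,
    pub allowed_commands: Vec<String>,
    pub forbidden_paths: Vec<String>,
}

impl SecurityPolicy {
    /// Returns `Err(reason)` for the first layer that rejects the command.
    pub fn check(&self, command: &str) -> Result<(), String> {
        if self.read_only {
            return Err("read-only mode".to_string());
        }
        if self.actions_this_hour >= self.max_actions_per_hour {
            return Err("rate limit exceeded".to_string());
        }
        let mut parts = command.split_whitespace();
        let program = parts.next().unwrap_or("");
        if !self.allowed_commands.iter().any(|c| c == program) {
            return Err(format!("command not allowed: {}", program));
        }
        // Naive prefix match on each remaining argument.
        if let Some(arg) =
            parts.find(|arg| self.forbidden_paths.iter().any(|p| arg.starts_with(p.as_str())))
        {
            return Err(format!("forbidden path argument: {}", arg));
        }
        Ok(())
    }
}

/// Mirrors the `(bool, String)` result shape from the diagram without running anything.
pub fn run_job_command(policy: &SecurityPolicy, command: &str) -> (bool, String) {
    match policy.check(command) {
        Err(reason) => (false, format!("blocked by security policy: {}", reason)),
        Ok(()) => (true, format!("would run: {}", command)),
    }
}
```

Testing each layer independently, as the four tests above do, means building a policy that passes every check except the one under test.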


Memory Auto-Save Round-Trip

Validates that agent conversations persist to memory when auto_save = true:

sequenceDiagram
    participant Test
    participant Agent
    participant Memory as SQLite Memory
    
    Test->>Memory: Create with sqlite backend
    Test->>Agent: build_agent_with_memory(auto_save=true)
    Test->>Agent: turn("Remember this fact")
    
    Agent->>Agent: Process turn
    Agent->>Memory: store(user_message)
    Agent->>Memory: store(assistant_response)
    
    Test->>Memory: count()
    Memory-->>Test: count >= 2 ✓
    
    Note over Test: Separate test with auto_save=false
    Test->>Agent: turn("test message")
    Test->>Memory: count()
    Memory-->>Test: count == 0 ✓

Tests:

  • auto_save_stores_messages_in_memory — Verifies count() >= 2 after one turn
  • auto_save_disabled_does_not_store — Verifies count() == 0 when disabled
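
The round-trip can be sketched with an in-memory store standing in for the SQLite backend. The `Memory` trait methods, the `InMemoryStore` type, and the toy `Agent` below are assumptions made for illustration; the real trait and agent are richer and likely async.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical minimal memory interface.
pub trait Memory: Send + Sync {
    fn store(&self, key: &str, content: &str);
    fn count(&self) -> usize;
}

/// In-memory stand-in for the sqlite backend.
#[derive(Default)]
pub struct InMemoryStore {
    entries: Mutex<Vec<(String, String)>>,
}

impl Memory for InMemoryStore {
    fn store(&self, key: &str, content: &str) {
        self.entries
            .lock()
            .unwrap()
            .push((key.to_string(), content.to_string()));
    }

    fn count(&self) -> usize {
        self.entries.lock().unwrap().len()
    }
}

/// Toy agent that only models the auto-save decision.
pub struct Agent {
    memory: Arc<dyn Memory>,
    auto_save: bool,
}

impl Agent {
    pub fn new(memory: Arc<dyn Memory>, auto_save: bool) -> Self {
        Self { memory, auto_save }
    }

    pub fn turn(&self, user_message: &str) -> String {
        // Stand-in for the LLM call.
        let reply = format!("ack: {}", user_message);
        if self.auto_save {
            self.memory.store("user", user_message);
            self.memory.store("assistant", &reply);
        }
        reply
    }
}
```

Because the test keeps its own `Arc` handle to the store, it can call `count()` after the turn without reaching into the agent.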


Dispatcher Comparison (Native vs XML)

Validates that both NativeToolDispatcher (structured tool calls) and XmlToolDispatcher (XML-tagged calls) produce equivalent behavior:

graph TB
    subgraph "Native Dispatcher Path"
        Native["NativeToolDispatcher"] --> SendSpecs["should_send_tool_specs() = true"]
        SendSpecs --> LLM1["LLM receives tool specs"]
        LLM1 --> Structured["Returns structured<br/>ChatResponse.tool_calls"]
        Structured --> Parse1["parse_tool_calls()"]
    end
    
    subgraph "XML Dispatcher Path"
        XML["XmlToolDispatcher"] --> NoSpecs["should_send_tool_specs() = false"]
        NoSpecs --> LLM2["LLM receives system prompt<br/>with XML instructions"]
        LLM2 --> Tagged["Returns XML-tagged text<br/>&lt;tool_call&gt;...&lt;/tool_call&gt;"]
        Tagged --> Parse2["parse_tool_calls()"]
    end
    
    Parse1 --> Execute["ToolExecutionResult"]
    Parse2 --> Execute

Tests:

  • xml_dispatcher_parses_and_loops — E2E test with XML format
  • native_dispatcher_sends_tool_specs — Verifies should_send_tool_specs()
  • xml_dispatcher_does_not_send_tool_specs — Verifies XML omits specs
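
The difference between the two paths can be sketched with a small trait. The trait shape below is an assumption built around the two method names from the diagram, and tool calls are represented as opaque strings because the payload format inside the tool_call tags is not specified here.

```rust
/// Hypothetical dispatcher trait with the two methods named in the diagram.
pub trait ToolDispatcher {
    fn should_send_tool_specs(&self) -> bool;
    /// Returns the raw payload of each tool call found in the response.
    fn parse_tool_calls(&self, response_text: &str, structured_calls: &[String]) -> Vec<String>;
}

pub struct NativeToolDispatcher;
pub struct XmlToolDispatcher;

impl ToolDispatcher for NativeToolDispatcher {
    fn should_send_tool_specs(&self) -> bool {
        true
    }

    fn parse_tool_calls(&self, _response_text: &str, structured_calls: &[String]) -> Vec<String> {
        // Structured calls arrive already parsed; pass them through.
        structured_calls.to_vec()
    }
}

impl ToolDispatcher for XmlToolDispatcher {
    fn should_send_tool_specs(&self) -> bool {
        false
    }

    fn parse_tool_calls(&self, response_text: &str, _structured_calls: &[String]) -> Vec<String> {
        const OPEN: &str = "<tool_call>";
        const CLOSE: &str = "</tool_call>";
        let mut calls = Vec::new();
        let mut rest = response_text;
        while let Some(start) = rest.find(OPEN) {
            let after_open = &rest[start + OPEN.len()..];
            match after_open.find(CLOSE) {
                Some(end) => {
                    calls.push(after_open[..end].trim().to_string());
                    rest = &after_open[end + CLOSE.len()..];
                }
                // Unterminated tag: ignore the remainder.
                None => break,
            }
        }
        calls
    }
}
```

Both implementations return the same list type, which is what lets the rest of the loop treat the two dispatchers interchangeably.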



Running Tests

Unit and Integration Tests

# Run all tests (unit + integration)
cargo test

# Run agent tests only
cargo test --lib agent::tests

# Run scheduler tests only
cargo test --lib cron::scheduler::tests

# Run with output (show println!/tracing)
cargo test -- --nocapture

# Run specific test by name
cargo test turn_bails_out_at_max_iterations

End-to-End Tests

# Run E2E tests
cargo test --test agent_e2e

# Run specific E2E test
cargo test --test agent_e2e e2e_multi_step_tool_chain

Coverage Report

# Generate coverage with tarpaulin (if installed)
cargo tarpaulin --out Html --output-dir coverage

# Or with llvm-cov
cargo llvm-cov --html --output-dir coverage



Test Organization and File Structure

graph TB
    subgraph "Test Files"
        E2E["tests/agent_e2e.rs<br/>E2E integration tests<br/>Public API boundary"]
        
        AgentTests["src/agent/tests.rs<br/>20+ agent turn scenarios<br/>Mock providers + tools"]
        
        SchedulerTests["src/cron/scheduler.rs<br/>#[cfg(test)] mod tests<br/>Security + retry logic"]
        
        ChannelTests["src/channels/mattermost.rs<br/>#[cfg(test)] mod tests<br/>Parsing + mention detection"]
    end
    
    subgraph "Shared Test Infrastructure"
        MockProvider["Mock Providers<br/>ScriptedProvider<br/>FailingProvider"]
        MockTools["Mock Tools<br/>EchoTool<br/>CountingTool<br/>FailingTool"]
        Helpers["Test Helpers<br/>build_agent()<br/>make_memory()<br/>text_response()"]
    end
    
    E2E --> MockProvider
    E2E --> MockTools
    E2E --> Helpers
    
    AgentTests --> MockProvider
    AgentTests --> MockTools
    AgentTests --> Helpers
    
    SchedulerTests --> TempConfig["TempDir + test_config()"]
    ChannelTests --> JsonFixtures["JSON test fixtures"]
    
    style E2E fill:#f9f9f9
    style AgentTests fill:#f9f9f9
    style SchedulerTests fill:#f9f9f9
    style ChannelTests fill:#f9f9f9


Coverage Areas Summary

| Subsystem | Unit Tests | Integration Tests | E2E Tests |
| --- | --- | --- | --- |
| Agent Turn Cycle | ✅ 20+ scenarios in src/agent/tests.rs | | ✅ 8 scenarios in tests/agent_e2e.rs |
| Tool Execution | ✅ Mock tools (echo, fail, count) | | ✅ Full dispatch cycle |
| Tool Dispatchers | ✅ Native vs XML comparison | | ✅ XML parser integration |
| History Management | ✅ Trimming, system prompt preservation | | ✅ Multi-turn coherence |
| Memory Integration | ✅ Auto-save, backend switching | | |
| Scheduler | ✅ Execution, retries, security checks | | |
| Channels | ✅ Mattermost parsing, threading, mentions | | |
| Security Policy | ✅ Command/path blocking, rate limits | | |
| Provider Error Handling | ✅ FailingProvider propagation | | |

Key Gaps (areas with no or only limited automated tests in the reviewed files):

  • Gateway endpoints (/webhook, /pair, /whatsapp) — No E2E HTTP tests visible
  • Memory recall/context enrichment — Unit tests exist but limited E2E coverage
  • Multi-agent delegation — No visible tests for sub-agent orchestration
  • Hardware tools (GPIO, serial, USB) — No visible tests in provided files
