
13.3 Production Configuration

Nikolay Vyahhi edited this page Feb 19, 2026 · 2 revisions



This page covers production deployment configuration best practices for ZeroClaw, including security hardening, resource limits, monitoring, and operational tuning. For deployment methods, see Docker Deployment and Native Binary Deployment. For the configuration file reference, see Configuration File Reference.


Configuration Hierarchy

ZeroClaw uses a three-tier configuration system with environment variables taking precedence over config.toml, which takes precedence over built-in defaults.

Configuration Priority

flowchart LR
    A["Environment Variables"] -->|highest priority| B["config.toml"]
    B --> C["Built-in Defaults"]
    C -->|lowest priority| D["Runtime Behavior"]

Sources: README.md:492-599, src/security/secrets.rs:1-227
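
The precedence chain above can be sketched as a std-only lookup. This is an illustrative sketch, not the actual resolver: `resolve_port` and the `ZEROCLAW_PORT` variable name are assumptions.

```rust
use std::env;

// Hypothetical resolver illustrating the three-tier precedence:
// environment variable > config.toml value > built-in default.
// `ZEROCLAW_PORT` is an assumed variable name for this sketch.
fn resolve_port(config_value: Option<u16>) -> u16 {
    env::var("ZEROCLAW_PORT")
        .ok()
        .and_then(|v| v.parse().ok()) // 1. environment variable (highest)
        .or(config_value)             // 2. config.toml value
        .unwrap_or(3000)              // 3. built-in default (lowest)
}
```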

Key Configuration Files

| File Path | Purpose | Required |
|---|---|---|
| ~/.zeroclaw/config.toml | Primary configuration | Yes |
| ~/.zeroclaw/.secret_key | Secret encryption key | Auto-created |
| ~/.zeroclaw/auth-profiles.json | OAuth profiles (encrypted) | Optional |
| workspace/MEMORY_SNAPSHOT.md | Memory backup | Auto-generated |

Sources: README.md:492-599, src/security/secrets.rs:36-51


Security Hardening

Five-Layer Security Model

flowchart TD
    Request["Incoming Request"] --> L1["Layer 1: Network Isolation"]
    L1 --> L2["Layer 2: Authentication"]
    L2 --> L3["Layer 3: Authorization"]
    L3 --> L4["Layer 4: Execution Isolation"]
    L4 --> L5["Layer 5: Data Protection"]
    
    L1 --> L1A["127.0.0.1 bind<br/>Tunnel required"]
    L2 --> L2A["PairingGuard<br/>Bearer tokens"]
    L3 --> L3A["SecurityPolicy<br/>Autonomy levels<br/>Allowlists"]
    L4 --> L4A["RuntimeAdapter<br/>Docker sandbox"]
    L5 --> L5A["SecretStore<br/>ChaCha20-Poly1305"]

Sources: README.md:380-431, src/security/pairing.rs:1-231, src/security/secrets.rs:1-227

Secret Management

ZeroClaw encrypts secrets at rest using ChaCha20-Poly1305 AEAD with a local key file.

Configuration:

[secrets]
encrypt = true  # Enable encryption (default: true)

Key File Location: ~/.zeroclaw/.secret_key (mode 0600)

Encryption Format:

  • Current: enc2:<hex(nonce || ciphertext || tag)> (ChaCha20-Poly1305)
  • Legacy: enc:<hex(xor_ciphertext)> (auto-migrates on load)

Migration Detection:

// Check if secret needs upgrade from legacy XOR
SecretStore::needs_migration(value)  // Returns true for "enc:" prefix
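
Given the prefix semantics above, the check reduces to a string-prefix test. A minimal std-only sketch (not the actual `SecretStore` code):

```rust
// Legacy XOR values carry an `enc:` prefix; the current AEAD format uses
// `enc2:`, which does not match `enc:` (the fourth character differs),
// so a plain prefix test suffices for this sketch.
fn needs_migration(value: &str) -> bool {
    value.starts_with("enc:")
}
```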

Production Recommendation: Always enable encryption. For compliance scenarios requiring plaintext (audit logs, SIEM integration), set encrypt = false and use external secret management (Vault, AWS Secrets Manager).

Sources: src/security/secrets.rs:1-283, README.md:556-558

Gateway Authentication

Production gateway configuration enforces two-factor protection: one-time pairing code plus bearer token.

[gateway]
port = 3000
host = "127.0.0.1"              # Localhost-only binding
require_pairing = true           # Enforce pairing (default: true)
allow_public_bind = false        # Refuse 0.0.0.0 without tunnel

Pairing Flow:

sequenceDiagram
    participant G as Gateway
    participant C as Client
    participant PG as PairingGuard
    
    G->>PG: Generate 6-digit code on startup
    G->>G: Print code to console
    C->>G: POST /pair<br/>X-Pairing-Code: 123456
    G->>PG: try_pair(code)
    PG->>PG: Verify code (constant-time compare)
    PG->>PG: Generate bearer token (zc_<64-hex>)
    PG->>PG: Hash token (SHA-256) for storage
    PG-->>C: {"token": "zc_..."}
    C->>G: POST /webhook<br/>Authorization: Bearer zc_...
    G->>PG: is_authenticated(token)
    PG->>PG: Compare SHA-256(token) against stored hash
    PG-->>G: true/false

Sources: src/security/pairing.rs:36-151, README.md:525-530
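
The constant-time compare step in the diagram can be sketched as an XOR-accumulating loop. This is illustrative only; the real guard lives in src/security/pairing.rs.

```rust
// Compare two byte strings without short-circuiting on the first mismatch,
// so response timing does not leak how many leading bytes matched.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}
```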

Brute Force Protection:

  • Max attempts: 5 failed pairing attempts
  • Lockout duration: 300 seconds (5 minutes)
  • Token storage: SHA-256 hashes only (no plaintext)

Production Best Practice: Rotate bearer tokens periodically using zeroclaw auth commands.

Sources: src/security/pairing.rs:16-36
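
The lockout bookkeeping implied by these limits can be sketched as follows; `AttemptTracker` is a hypothetical struct for illustration, not the actual PairingGuard implementation.

```rust
use std::time::{Duration, Instant};

// Tracks failed pairing attempts and locks out after the limits above:
// 5 failures trigger a 300-second lockout.
struct AttemptTracker {
    failures: u32,
    locked_until: Option<Instant>,
}

impl AttemptTracker {
    const MAX_ATTEMPTS: u32 = 5;
    const LOCKOUT: Duration = Duration::from_secs(300);

    fn new() -> Self {
        Self { failures: 0, locked_until: None }
    }

    fn is_locked(&self) -> bool {
        self.locked_until.map_or(false, |t| Instant::now() < t)
    }

    fn record_failure(&mut self) {
        self.failures += 1;
        if self.failures >= Self::MAX_ATTEMPTS {
            self.locked_until = Some(Instant::now() + Self::LOCKOUT);
        }
    }
}
```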

Channel Allowlists

Production deployments should use explicit allowlists for all channels to prevent unauthorized access.

Default Behavior: Empty allowlist = deny all inbound messages

[channels_config.telegram]
allowed_users = ["alice_username", "123456789"]  # Username or numeric ID

[channels_config.discord]
allowed_users = ["987654321098765432"]  # Discord user ID

[channels_config.slack]
allowed_users = ["U01234ABC"]  # Slack member ID

Wildcard (Testing Only):

allowed_users = ["*"]  # ⚠️ Allow all — use only for temporary testing

Sources: README.md:394-438

Filesystem Scoping

[autonomy]
workspace_only = true  # Restrict to workspace directory (default: true)
forbidden_paths = [
    "/etc", "/root", "/proc", "/sys",
    "~/.ssh", "~/.gnupg", "~/.aws"
]

Built-in Protections:

  • 14 system directories blocked by default
  • Null byte injection detection
  • Symlink escape prevention via path canonicalization

Sources: README.md:385-390
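
A sketch of how these protections compose, assuming workspace_only = true: reject null bytes, then canonicalize and require the result to stay under the workspace root, which defeats symlink escapes. `is_path_allowed` is a hypothetical helper, not the actual SecurityPolicy code.

```rust
use std::path::{Path, PathBuf};

// Returns true only if `candidate` resolves to a path inside `workspace`.
fn is_path_allowed(workspace: &Path, candidate: &str) -> bool {
    if candidate.contains('\0') {
        return false; // null byte injection
    }
    // Canonicalization resolves symlinks, so a link pointing outside the
    // workspace fails the prefix check below.
    let resolved: PathBuf = match Path::new(candidate).canonicalize() {
        Ok(p) => p,
        Err(_) => return false, // nonexistent or unreadable path
    };
    resolved.starts_with(workspace)
}
```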


Resource Management

Docker Resource Constraints

Production docker-compose.yml example with resource limits:

services:
  zeroclaw:
    image: ghcr.io/zeroclaw-labs/zeroclaw:latest
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 2G
        reservations:
          cpus: '0.5'
          memory: 512M

Sources: docker-compose.yml:42-50

Runtime Adapter Settings

[runtime]
kind = "docker"  # "native" or "docker"

[runtime.docker]
image = "alpine:3.20"
network = "none"              # Isolate from network
memory_limit_mb = 512         # Hard limit
cpu_limit = 1.0               # CPU shares (1.0 = 1 core)
read_only_rootfs = true       # Immutable root filesystem
mount_workspace = true        # Mount workspace at /workspace

Production Recommendation: Use docker runtime for untrusted tool execution. Use native for trusted environments where Docker overhead is unacceptable.

Sources: README.md:537-548

Memory Backend Selection

graph LR
    A["Workload Type"] --> B{"Data Volume"}
    B -->|< 10k memories| C["sqlite<br/>Full-text search<br/>Hybrid vector + keyword"]
    B -->|> 100k memories| D["postgres<br/>Remote persistence<br/>Multi-agent shared state"]
    B -->|Ephemeral| E["none<br/>No-op backend<br/>Stateless operation"]

Sources: README.md:330-377

Configuration:

[memory]
backend = "sqlite"            # "sqlite", "postgres", "lucid", "markdown", "none"
auto_save = true              # Auto-persist conversations

# PostgreSQL example
[storage.provider.config]
provider = "postgres"
db_url = "postgres://user:pass@host:5432/zeroclaw"
schema = "public"
table = "memories"
connect_timeout_secs = 15

Sources: README.md:346-365


Persistence Strategy

Memory Snapshot System

ZeroClaw exports MemoryCategory::Core to MEMORY_SNAPSHOT.md for Git visibility and disaster recovery.

flowchart TD
    A["Agent Startup"] --> B{"brain.db exists?"}
    B -->|No| C{"MEMORY_SNAPSHOT.md exists?"}
    C -->|Yes| D["hydrate_from_snapshot()"]
    D --> E["Recreate brain.db from snapshot"]
    E --> F["Normal operation"]
    B -->|Yes| F
    C -->|No| F
    
    F --> G["Periodic export_snapshot()"]
    G --> H["Write core memories to<br/>MEMORY_SNAPSHOT.md"]

Sources: src/memory/snapshot.rs:1-471

Key Functions:

| Function | Purpose | Trigger |
|---|---|---|
| export_snapshot() | Export core memories to Markdown | Manual / on-shutdown |
| hydrate_from_snapshot() | Restore from Markdown to SQLite | Auto on cold-boot if DB missing |
| should_hydrate() | Check if hydration needed | Startup check |

File Locations:

  • Snapshot: <workspace>/MEMORY_SNAPSHOT.md
  • Database: <workspace>/memory/brain.db

Sources: src/memory/snapshot.rs:26-200
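
The cold-boot decision in the table can be sketched with the file locations above. A std-only sketch under those assumed paths, not the actual snapshot module:

```rust
use std::path::Path;

// Hydrate only when the database is missing but a snapshot survives.
fn should_hydrate(workspace: &Path) -> bool {
    let db = workspace.join("memory/brain.db");
    let snapshot = workspace.join("MEMORY_SNAPSHOT.md");
    !db.exists() && snapshot.exists()
}
```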

Backup Strategy

Production Checklist:

  1. Mount persistent volumes:

    volumes:
      - zeroclaw-data:/zeroclaw-data  # Must match WORKDIR in Dockerfile
  2. Periodic backup:

    # Backup entire workspace
    tar -czf zeroclaw-backup-$(date +%F).tar.gz ~/.zeroclaw/workspace
    
    # Or backup SQLite directly
    sqlite3 ~/.zeroclaw/workspace/memory/brain.db ".backup brain-$(date +%F).db"
  3. Git-track snapshot:

    cd ~/.zeroclaw/workspace
    git add MEMORY_SNAPSHOT.md
    git commit -m "Memory snapshot $(date +%F)"

Sources: docker-compose.yml:34-36, src/memory/snapshot.rs:17-90


Monitoring & Observability

Health Check Endpoints

graph LR
    A["Health Check"] --> B["/health endpoint"]
    A --> C["zeroclaw status"]
    A --> D["zeroclaw doctor"]
    
    B --> B1["Always public<br/>No authentication"]
    C --> C1["System status<br/>Config validation"]
    D --> D1["Deep diagnostics<br/>Channel health"]

Sources: README.md:225-233

HTTP Health Check:

curl -f http://localhost:3000/health
# Returns: {"status": "ok"}

Docker Compose Health Check:

healthcheck:
  test: ["CMD", "zeroclaw", "status"]
  interval: 60s
  timeout: 10s
  retries: 3
  start_period: 10s

Sources: docker-compose.yml:53-59

Diagnostics Commands

| Command | Purpose | Output |
|---|---|---|
| zeroclaw status | Overall system health | Config paths, provider, memory backend |
| zeroclaw doctor | Deep diagnostics | Daemon freshness, scheduler status |
| zeroclaw channel doctor | Channel health | Per-channel reachability, auth status |
| zeroclaw auth status | OAuth status | Profile validity, token expiry |

Sources: README.md:225-233

Logging Configuration

Tracing Levels via Environment:

export RUST_LOG=zeroclaw=info,zeroclaw::gateway=debug

Log Targets:

| Module | Key Events |
|---|---|
| zeroclaw::gateway | Request handling, pairing, rate limiting |
| zeroclaw::channels | Message ingestion, allowlist checks |
| zeroclaw::security | Authorization decisions, policy violations |
| zeroclaw::memory | Snapshot export/hydrate, query performance |

Production Recommendation: Use structured logging (JSON) for SIEM integration:

// tracing-subscriber with JSON formatter (requires the crate's
// "json" and "env-filter" features)
use tracing_subscriber::EnvFilter;

tracing_subscriber::fmt()
    .json()
    .with_env_filter(EnvFilter::from_default_env())
    .init();

Sources: Cargo.toml:40-42

Metrics (Prometheus)

ZeroClaw includes the prometheus crate for metrics export.

Configuration:

# Future: metrics endpoint configuration
[observability]
metrics_enabled = true
metrics_port = 9090

Available Metrics (from code structure):

  • Request counts by endpoint
  • Rate limit violations
  • Provider API call latency
  • Memory operation latency

Sources: Cargo.toml:45


Performance Tuning

Build Profiles

graph TD
    A["Build Target"] --> B{"Build Profile"}
    B -->|Development| C["cargo build"]
    B -->|Production| D["cargo build --release"]
    B -->|High-memory machines| E["cargo build --profile release-fast"]
    
    C --> C1["Debug symbols<br/>No optimization<br/>Fast compile"]
    D --> D1["opt-level=z<br/>codegen-units=1<br/>3.4 MB binary"]
    E --> E1["opt-level=z<br/>codegen-units=8<br/>Faster compile"]

Sources: Cargo.toml:161-173

Production Build:

cargo build --release --locked
# Binary size: ~8.8 MB on macOS arm64 (measured Feb 2026)
# Memory footprint: ~4-5 MB for common CLI operations

Docker Multi-Stage Build:

# Stage 1: Builder (cached dependencies)
FROM rust:1.93-slim AS builder
COPY Cargo.toml Cargo.lock ./
RUN cargo build --release --locked

# Stage 2: Production runtime (distroless)
FROM gcr.io/distroless/cc-debian13:nonroot AS release
COPY --from=builder /app/zeroclaw /usr/local/bin/zeroclaw

Sources: Dockerfile:1-113, README.md:63-98

Runtime Optimization

Memory Backend Performance:

| Backend | Query Latency | Write Latency | Storage | Use Case |
|---|---|---|---|---|
| sqlite | ~2ms (FTS5) | ~5ms | Local file | Single-agent, full search |
| postgres | ~10ms (network) | ~15ms | Remote DB | Multi-agent, shared state |
| markdown | ~1ms (grep) | ~0.5ms | .md files | Human-readable, Git-tracked |
| none | 0ms | 0ms | None | Stateless, ephemeral |

Sources: README.md:330-377

Provider Resilience:

# ReliableProvider wraps all providers with retry logic
[provider]
max_retries = 3
backoff_multiplier = 2.0
timeout_secs = 60
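
The delay schedule these settings imply can be sketched as follows; the base delay of 1 s is an assumption for illustration, and `backoff_delays` is a hypothetical helper, not the actual ReliableProvider code.

```rust
// Compute the exponential backoff schedule: each retry waits
// base * multiplier^attempt seconds.
fn backoff_delays(base_secs: f64, multiplier: f64, max_retries: u32) -> Vec<f64> {
    (0..max_retries)
        .map(|attempt| base_secs * multiplier.powi(attempt as i32))
        .collect()
}
```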

Key resilience features:

  • Exponential backoff on transient errors
  • API key rotation (multiple keys in env)
  • Model fallback (default_model → fallback_model)

Sources: architecture diagrams; providers module structure

Browser Backend Selection

[browser]
backend = "auto"  # "agent_browser", "rust_native", "computer_use", "auto"

Performance Comparison:

| Backend | Startup | Memory | Availability |
|---|---|---|---|
| agent_browser | ~200ms (Node.js) | ~100 MB | npm install |
| rust_native | ~50ms | ~30 MB | cargo build --features browser-native |
| computer_use | ~10ms (sidecar) | ~50 MB | External sidecar |

Sources: src/tools/browser.rs:1-700


Network Configuration

Gateway Binding

[gateway]
host = "127.0.0.1"           # Localhost-only (production default)
port = 3000
allow_public_bind = false    # Refuse 0.0.0.0 without tunnel

Public Bind Protection:

// Gateway refuses 0.0.0.0 unless tunnel active or explicit override
if is_public_bind(&host) && !tunnel_active && !allow_public_bind {
    anyhow::bail!("Refusing public bind without tunnel");
}
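
A minimal sketch of what such a predicate might check; the actual is_public_bind lives in the gateway code, and this list of wildcard hosts is an assumption.

```rust
// Treat wildcard bind addresses (IPv4 and IPv6 forms) as public.
fn is_public_bind(host: &str) -> bool {
    matches!(host, "0.0.0.0" | "::" | "[::]")
}
```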

Sources: src/security/pairing.rs:225-230, README.md:387-391

Tunnel Requirements

Production deployments must use tunnels for remote access:

[tunnel]
provider = "cloudflare"  # "cloudflare", "tailscale", "ngrok", "custom"

Tunnel Matrix:

| Provider | Transport | Use Case |
|---|---|---|
| Cloudflare | HTTPS | Public webhook endpoints (WhatsApp, etc.) |
| Tailscale | WireGuard | Private mesh networks |
| ngrok | HTTPS | Development, temporary exposure |
| Custom | Any | Custom tunnel binary |

HTTPS Enforcement:

  • WhatsApp webhook: Requires HTTPS (Meta Cloud API validation)
  • Pairing over public net: Bearer tokens should only traverse HTTPS

Sources: README.md:456-491

Port Mapping

# docker-compose.yml
ports:
  - "${HOST_PORT:-3000}:3000"  # Override with HOST_PORT=8080

Production Recommendation: Use non-standard ports (e.g., 8443) to reduce automated scanner noise.

Sources: docker-compose.yml:38-40


Scaling Considerations

Horizontal Scaling

ZeroClaw's stateless design enables horizontal scaling with a shared backend:

graph TD
    LB["Load Balancer"] --> G1["Gateway Instance 1"]
    LB --> G2["Gateway Instance 2"]
    LB --> G3["Gateway Instance 3"]
    
    G1 --> PG["PostgreSQL<br/>Shared Memory"]
    G2 --> PG
    G3 --> PG
    
    G1 --> RD["Redis<br/>Rate Limit State"]
    G2 --> RD
    G3 --> RD

Configuration:

[memory]
backend = "postgres"
[storage.provider.config]
provider = "postgres"
db_url = "postgres://shared-db:5432/zeroclaw"

# Rate limiting requires shared state (not yet implemented)
# Future: Redis adapter for distributed rate limiting

Sources: README.md:346-365

Resource Scaling

Single-agent optimal configuration:

  • CPU: 1-2 cores
  • Memory: 512 MB - 2 GB (depends on memory backend size)
  • Storage: 1 GB (SQLite + workspace)

Multi-agent coordinator configuration:

  • CPU: 4+ cores (parallel tool execution)
  • Memory: 4-8 GB (multiple sub-agent contexts)
  • Storage: 10 GB+ (large conversation histories)

Sources: docker-compose.yml:42-50, README.md:63-74


Production Deployment Checklist

Security

  • Enable secret encryption (secrets.encrypt = true)
  • Enable gateway pairing (gateway.require_pairing = true)
  • Configure channel allowlists (no ["*"] wildcards)
  • Enable workspace scoping (autonomy.workspace_only = true)
  • Use Docker runtime for untrusted tools (runtime.kind = "docker")
  • Configure tunnel provider (tunnel.provider)
  • Restrict forbidden paths (autonomy.forbidden_paths)

Resource Management

  • Set CPU limits (deploy.resources.limits.cpus)
  • Set memory limits (deploy.resources.limits.memory)
  • Configure runtime constraints (runtime.docker.memory_limit_mb)
  • Select appropriate memory backend (memory.backend)

Persistence

  • Mount persistent volumes (volumes: zeroclaw-data:/zeroclaw-data)
  • Schedule backups (SQLite + MEMORY_SNAPSHOT.md)
  • Git-track workspace for version control
  • Test restore procedure

Monitoring

  • Configure health checks (healthcheck.test)
  • Set up structured logging (RUST_LOG)
  • Enable metrics endpoint (when available)
  • Configure alerting on health check failures

Performance

  • Build with --release --locked
  • Use appropriate build profile (release vs release-fast)
  • Tune provider timeout (provider.timeout_secs)
  • Select optimal browser backend (browser.backend)

Network

  • Bind to localhost (gateway.host = "127.0.0.1")
  • Configure tunnel (tunnel.provider)
  • Use non-standard ports in production
  • Enforce HTTPS for public endpoints

Sources: All sections above
