
09.2 Hybrid Search

Nikolay Vyahhi edited this page Feb 19, 2026 · 3 revisions

Hybrid Search


Purpose and Scope

This page documents ZeroClaw's hybrid search implementation, which combines vector similarity search with keyword-based full-text search to provide robust memory recall. The system uses SQLite as the backing store with custom implementations for both search modes, requiring no external dependencies like Pinecone, Elasticsearch, or LangChain.

For information about available memory backends and how to configure them, see Memory Backends. For memory export and backup functionality, see Memory Snapshot.


Architecture Overview

ZeroClaw's hybrid search operates as a two-stage retrieval system that merges results from both vector and keyword search paths using configurable weights.

flowchart TB
    Query["User Query / Recall Request"]
    
    subgraph "Parallel Search Execution"
        VectorPath["Vector Search Path"]
        KeywordPath["Keyword Search Path"]
    end
    
    subgraph "Vector Search"
        Embed["Generate Query Embedding<br/>(EmbeddingProvider)"]
        VectorDB["SQLite BLOB Storage<br/>Cosine Similarity Scan"]
        VectorResults["Vector Results<br/>(similarity scores)"]
    end
    
    subgraph "Keyword Search"
        Tokenize["Tokenize Query<br/>(FTS5 tokenizer)"]
        FTS5["FTS5 Virtual Table<br/>BM25 Ranking"]
        KeywordResults["Keyword Results<br/>(BM25 scores)"]
    end
    
    subgraph "Merge Layer"
        Normalize["Normalize Scores<br/>(0.0 - 1.0 range)"]
        Weight["Apply Weights<br/>vector_weight * vec_score +<br/>keyword_weight * kw_score"]
        Dedupe["Deduplicate & Sort<br/>by Combined Score"]
    end
    
    FinalResults["Ranked Memory Chunks"]
    
    Query --> VectorPath
    Query --> KeywordPath
    
    VectorPath --> Embed
    Embed --> VectorDB
    VectorDB --> VectorResults
    
    KeywordPath --> Tokenize
    Tokenize --> FTS5
    FTS5 --> KeywordResults
    
    VectorResults --> Normalize
    KeywordResults --> Normalize
    Normalize --> Weight
    Weight --> Dedupe
    Dedupe --> FinalResults

Sources: README.md:330-345


Vector Search Component

Embedding Generation

The vector search path begins by converting the query text into a high-dimensional embedding vector using an EmbeddingProvider implementation. The provider is configurable via memory.embedding_provider:

| Provider Value | Behavior |
|----------------|----------|
| `"none"` | No-op provider; vector search disabled (keyword-only mode) |
| `"openai"` | Uses OpenAI's embedding API (requires `OPENAI_API_KEY`) |
| `"custom:https://..."` | Custom OpenAI-compatible embedding endpoint |
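A minimal sketch of how this configuration value could be parsed; the enum and function names here are illustrative, not taken from the ZeroClaw source:

```rust
// Hypothetical parser for the `memory.embedding_provider` config value.
#[derive(Debug, PartialEq)]
enum EmbeddingProviderKind {
    None,
    OpenAi,
    Custom(String), // endpoint URL from "custom:https://..."
}

fn parse_provider(value: &str) -> EmbeddingProviderKind {
    match value {
        "none" => EmbeddingProviderKind::None,
        "openai" => EmbeddingProviderKind::OpenAi,
        other if other.starts_with("custom:") => {
            EmbeddingProviderKind::Custom(other["custom:".len()..].to_string())
        }
        // Unknown values fall back to keyword-only mode.
        _ => EmbeddingProviderKind::None,
    }
}

fn main() {
    assert_eq!(parse_provider("none"), EmbeddingProviderKind::None);
    assert_eq!(
        parse_provider("custom:https://example.com/v1/embeddings"),
        EmbeddingProviderKind::Custom("https://example.com/v1/embeddings".into())
    );
    println!("ok");
}
```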

Sources: README.md:346-377

Storage Format

Embeddings are stored as BLOBs directly in SQLite tables, typically in a memories or similar table structure. Each row contains:

  • id: Unique memory chunk identifier
  • content: Original text content
  • embedding: Binary-encoded vector (typically float32 array)
  • metadata: Timestamps, tags, source information
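The source does not pin down the exact byte layout of the BLOB; a common convention, sketched below, is to store each `f32` as four little-endian bytes:

```rust
// Sketch: encode/decode an embedding vector as a flat little-endian byte
// buffer suitable for a SQLite BLOB column. The layout is an assumption.
fn encode_embedding(v: &[f32]) -> Vec<u8> {
    v.iter().flat_map(|x| x.to_le_bytes()).collect()
}

fn decode_embedding(blob: &[u8]) -> Vec<f32> {
    blob.chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect()
}

fn main() {
    let v = vec![0.1_f32, -2.5, 3.0];
    let blob = encode_embedding(&v);
    assert_eq!(blob.len(), 12); // 3 floats x 4 bytes each
    assert_eq!(decode_embedding(&blob), v);
    println!("ok");
}
```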

Cosine Similarity Search

Vector retrieval performs a full table scan computing cosine similarity between the query embedding and each stored embedding:

similarity(A, B) = (A · B) / (||A|| * ||B||)

Results are ranked by similarity score (1.0 = identical, -1.0 = opposite, 0.0 = orthogonal).
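The formula translates directly to Rust; this standalone sketch (not the actual ZeroClaw code) also guards against zero-length vectors:

```rust
// Cosine similarity: (A . B) / (||A|| * ||B||), returning 0.0 for a
// zero vector to avoid division by zero.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

fn main() {
    // Identical direction -> 1.0; orthogonal -> 0.0; opposite -> -1.0.
    assert!((cosine_similarity(&[1.0, 0.0], &[1.0, 0.0]) - 1.0).abs() < 1e-6);
    assert!(cosine_similarity(&[1.0, 0.0], &[0.0, 1.0]).abs() < 1e-6);
    assert!((cosine_similarity(&[1.0, 0.0], &[-1.0, 0.0]) + 1.0).abs() < 1e-6);
    println!("ok");
}
```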

Sources: README.md:336-337


Keyword Search Component

FTS5 Virtual Tables

The keyword search path leverages SQLite's FTS5 (Full-Text Search 5) extension, which creates inverted indices for fast text retrieval. FTS5 tables mirror the structure of the main memory table but are optimized for tokenized search.

Sources: README.md:338-339

BM25 Ranking

FTS5 uses the BM25 (Best Match 25) algorithm to score document relevance. BM25 improves on TF-IDF by:

  1. Term Frequency Saturation: Logarithmic scaling prevents repeated terms from dominating scores
  2. Document Length Normalization: Long documents don't gain an unfair advantage from inflated raw term counts
  3. Inverse Document Frequency: Rare terms score higher than common terms
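The three factors above combine in the standard BM25 scoring formula (note that SQLite's FTS5 `bm25()` auxiliary function reports the negated score, so more relevant rows have numerically smaller values):

$$
\text{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
$$

where $f(q_i, D)$ is the frequency of term $q_i$ in document $D$, $|D|$ is the document length, $\mathrm{avgdl}$ is the average document length in the index, and $k_1 = 1.2$, $b = 0.75$ are the defaults FTS5 uses.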

The FTS5 query syntax supports:

  • Boolean operators: AND, OR, NOT
  • Phrase matching: "exact phrase"
  • Proximity search: NEAR(term1 term2, distance)
  • Prefix matching: term*

Sources: README.md:338-339


Hybrid Merge Algorithm

The merge layer combines vector and keyword results using a weighted linear combination. This is implemented in a custom merge function.

flowchart LR
    VectorScores["Vector Scores<br/>(0.0 - 1.0)"]
    KeywordScores["Keyword Scores<br/>(0.0 - 1.0)"]
    
    VectorWeight["vector_weight<br/>(default: 0.7)"]
    KeywordWeight["keyword_weight<br/>(default: 0.3)"]
    
    Combine["Combined Score =<br/>vec_weight * vec_score +<br/>kw_weight * kw_score"]
    
    Sort["Sort by Combined Score<br/>(descending)"]
    
    VectorScores --> Combine
    KeywordScores --> Combine
    VectorWeight --> Combine
    KeywordWeight --> Combine
    
    Combine --> Sort

Merge Function (vector.rs)

The custom weighted merge function in vector.rs performs:

  1. Score Normalization: Both vector similarity and BM25 scores are normalized to [0.0, 1.0]
  2. Weight Application: Applies vector_weight and keyword_weight from config
  3. Deduplication: Merges duplicate results (same memory chunk ID) by taking max combined score
  4. Sorting: Orders final results by combined score descending
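The four steps above can be sketched as follows. This is an illustrative model, not the actual vector.rs code: scores are assumed to be pre-normalized to [0.0, 1.0], duplicates within a path are merged by taking the max, and a chunk absent from one path contributes zero on that side:

```rust
use std::collections::HashMap;

// (chunk id, normalized score) pairs from each search path.
fn merge(
    vector_hits: &[(u64, f32)],
    keyword_hits: &[(u64, f32)],
    vector_weight: f32,
    keyword_weight: f32,
) -> Vec<(u64, f32)> {
    // Best normalized score per chunk id on each path (dedupe within a path).
    let mut vec_best: HashMap<u64, f32> = HashMap::new();
    for &(id, s) in vector_hits {
        let e = vec_best.entry(id).or_insert(0.0);
        if s > *e { *e = s; }
    }
    let mut kw_best: HashMap<u64, f32> = HashMap::new();
    for &(id, s) in keyword_hits {
        let e = kw_best.entry(id).or_insert(0.0);
        if s > *e { *e = s; }
    }
    // Weighted linear combination across the union of chunk ids.
    let mut ids: Vec<u64> = vec_best.keys().chain(kw_best.keys()).copied().collect();
    ids.sort_unstable();
    ids.dedup();
    let mut out: Vec<(u64, f32)> = ids
        .into_iter()
        .map(|id| {
            let v = vec_best.get(&id).copied().unwrap_or(0.0);
            let k = kw_best.get(&id).copied().unwrap_or(0.0);
            (id, vector_weight * v + keyword_weight * k)
        })
        .collect();
    // Final ranking: combined score, descending.
    out.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    out
}

fn main() {
    // Chunk 1: 0.7 * 0.9 = 0.63; chunk 2: 0.7 * 0.4 + 0.3 * 1.0 = 0.58.
    let ranked = merge(&[(1, 0.9), (2, 0.4)], &[(2, 1.0), (3, 0.8)], 0.7, 0.3);
    assert_eq!(ranked[0].0, 1);
    assert_eq!(ranked[1].0, 2);
    println!("ok");
}
```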

Configuration

Weights are configured in ~/.zeroclaw/config.toml:

[memory]
vector_weight = 0.7    # Emphasize semantic similarity
keyword_weight = 0.3   # De-emphasize exact keyword matches

Tuning Guidelines:

| Use Case | vector_weight | keyword_weight | Rationale |
|----------|---------------|----------------|-----------|
| Semantic search | 0.8 - 1.0 | 0.0 - 0.2 | Prioritize meaning over exact terms |
| Technical queries | 0.5 | 0.5 | Balance semantic and exact matches |
| Exact recall | 0.0 - 0.2 | 0.8 - 1.0 | Prioritize keyword precision |

Sources: README.md:340-341, README.md:346-377


Embedding Provider System

classDiagram
    class EmbeddingProvider {
        <<trait>>
        +embed(text: String) Result~Vec~f32~~
        +embed_batch(texts: Vec~String~) Result~Vec~Vec~f32~~~
    }
    
    class OpenAIEmbedding {
        -api_key: String
        -model: String
        +embed(text) Result~Vec~f32~~
        +embed_batch(texts) Result~Vec~Vec~f32~~~
    }
    
    class CustomEmbedding {
        -endpoint: String
        -api_key: Option~String~
        +embed(text) Result~Vec~f32~~
        +embed_batch(texts) Result~Vec~Vec~f32~~~
    }
    
    class NoopEmbedding {
        +embed(text) Result~Vec~f32~~
        +embed_batch(texts) Result~Vec~Vec~f32~~~
    }
    
    EmbeddingProvider <|.. OpenAIEmbedding
    EmbeddingProvider <|.. CustomEmbedding
    EmbeddingProvider <|.. NoopEmbedding

Provider Implementations

| Provider | Configuration | Behavior |
|----------|---------------|----------|
| OpenAI | `embedding_provider = "openai"` | Uses text-embedding-ada-002 model via OpenAI API |
| Custom | `embedding_provider = "custom:https://..."` | OpenAI-compatible embedding endpoint |
| Noop | `embedding_provider = "none"` | Returns empty vector; disables vector search |
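The trait from the class diagram might look like the following sketch; the error type and the default `embed_batch` implementation are assumptions:

```rust
// Hypothetical rendering of the EmbeddingProvider trait; the real
// definition (error type, ownership of arguments) may differ.
trait EmbeddingProvider {
    fn embed(&self, text: &str) -> Result<Vec<f32>, String>;

    // Default: embed each text individually; providers with a batch API
    // would override this.
    fn embed_batch(&self, texts: &[String]) -> Result<Vec<Vec<f32>>, String> {
        texts.iter().map(|t| self.embed(t)).collect()
    }
}

/// No-op provider: returns an empty vector, effectively disabling vector search.
struct NoopEmbedding;

impl EmbeddingProvider for NoopEmbedding {
    fn embed(&self, _text: &str) -> Result<Vec<f32>, String> {
        Ok(Vec::new())
    }
}

fn main() {
    let p = NoopEmbedding;
    assert!(p.embed("hello").unwrap().is_empty());
    let batch = p.embed_batch(&["a".to_string(), "b".to_string()]).unwrap();
    assert_eq!(batch.len(), 2);
    println!("ok");
}
```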

API Key Resolution

Embedding providers resolve API keys using the same priority order as LLM providers:

  1. Explicit api_key in memory config
  2. Provider-specific environment variable (e.g., OPENAI_API_KEY)
  3. Generic API_KEY environment variable
  4. Encrypted secret store (~/.zeroclaw/.secret_key)
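The priority chain can be sketched as follows; the function name and error handling are illustrative, and the encrypted-store step is stubbed out:

```rust
use std::env;

// Hypothetical sketch of the documented resolution order.
fn resolve_api_key(
    config_key: Option<&str>,
    provider_env_var: &str,
) -> Option<String> {
    // 1. Explicit api_key in memory config wins.
    if let Some(k) = config_key {
        return Some(k.to_string());
    }
    // 2. Provider-specific environment variable (e.g. OPENAI_API_KEY).
    if let Ok(k) = env::var(provider_env_var) {
        return Some(k);
    }
    // 3. Generic API_KEY environment variable.
    if let Ok(k) = env::var("API_KEY") {
        return Some(k);
    }
    // 4. Encrypted secret store (~/.zeroclaw/.secret_key) -- omitted here.
    None
}

fn main() {
    // Config value takes precedence regardless of environment state.
    assert_eq!(
        resolve_api_key(Some("sk-test"), "ZEROCLAW_EXAMPLE_UNSET_VAR"),
        Some("sk-test".to_string())
    );
    println!("ok");
}
```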

Sources: README.md:342-343


Embedding Cache

To minimize redundant API calls and improve recall latency, ZeroClaw maintains an LRU (Least Recently Used) cache for embeddings.

Cache Table Structure

The embedding_cache table in SQLite stores:

| Column | Type | Description |
|--------|------|-------------|
| text_hash | TEXT PRIMARY KEY | SHA-256 hash of input text |
| embedding | BLOB | Binary-encoded vector |
| created_at | TIMESTAMP | Cache entry creation time |
| accessed_at | TIMESTAMP | Last access time (for LRU) |

Cache Operations

flowchart TD
    Query["Memory Recall Query"]
    Hash["Compute SHA-256<br/>of query text"]
    CacheLookup["SELECT FROM<br/>embedding_cache<br/>WHERE text_hash = ?"]
    
    CacheHit{Cache Hit?}
    
    UpdateAccess["UPDATE accessed_at<br/>SET NOW()"]
    ReturnCached["Return Cached<br/>Embedding"]
    
    CallProvider["Call EmbeddingProvider<br/>embed(text)"]
    InsertCache["INSERT INTO<br/>embedding_cache<br/>(or replace if full)"]
    ReturnNew["Return New<br/>Embedding"]
    
    Query --> Hash
    Hash --> CacheLookup
    CacheLookup --> CacheHit
    
    CacheHit -->|Yes| UpdateAccess
    UpdateAccess --> ReturnCached
    
    CacheHit -->|No| CallProvider
    CallProvider --> InsertCache
    InsertCache --> ReturnNew

LRU Eviction

When the cache reaches capacity (configurable, typically 10,000 entries):

  1. Identify the row with the oldest accessed_at timestamp
  2. Delete that row
  3. Insert the new embedding

This ensures frequently queried embeddings remain cached while stale entries are evicted.
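The policy can be modeled in memory as follows; the real implementation operates on the SQLite embedding_cache table, and this sketch substitutes a logical clock for accessed_at timestamps:

```rust
use std::collections::HashMap;

// In-memory model of the LRU eviction policy described above.
struct EmbeddingCache {
    capacity: usize,
    // text_hash -> (embedding blob, accessed_at as a logical clock)
    entries: HashMap<String, (Vec<u8>, u64)>,
    clock: u64,
}

impl EmbeddingCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, entries: HashMap::new(), clock: 0 }
    }

    fn get(&mut self, hash: &str) -> Option<Vec<u8>> {
        self.clock += 1;
        let clock = self.clock;
        self.entries.get_mut(hash).map(|(blob, accessed)| {
            *accessed = clock; // UPDATE accessed_at on hit
            blob.clone()
        })
    }

    fn insert(&mut self, hash: String, blob: Vec<u8>) {
        self.clock += 1;
        if self.entries.len() >= self.capacity && !self.entries.contains_key(&hash) {
            // Evict the entry with the oldest accessed_at.
            if let Some(oldest) = self
                .entries
                .iter()
                .min_by_key(|(_, (_, t))| *t)
                .map(|(k, _)| k.clone())
            {
                self.entries.remove(&oldest);
            }
        }
        self.entries.insert(hash, (blob, self.clock));
    }
}

fn main() {
    let mut cache = EmbeddingCache::new(2);
    cache.insert("a".into(), vec![1]);
    cache.insert("b".into(), vec![2]);
    cache.get("a"); // refresh "a"
    cache.insert("c".into(), vec![3]); // evicts "b" (oldest accessed_at)
    assert!(cache.get("b").is_none());
    assert!(cache.get("a").is_some());
    assert!(cache.get("c").is_some());
    println!("ok");
}
```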

Sources: README.md:344-345


Text Chunking

Before storing memories, long-form content is split into chunks to improve retrieval granularity and context relevance.

Line-Based Markdown Chunker

ZeroClaw uses a line-based chunking strategy that preserves markdown structure:

flowchart TD
    Document["Input Document<br/>(Markdown)"]
    
    ParseLines["Parse Lines<br/>(split by \\n)"]
    
    DetectHeadings["Detect Headings<br/>(# ## ### etc.)"]
    
    GroupContent["Group Content<br/>Under Headings"]
    
    SplitOversize["Split Oversize Chunks<br/>(max_chunk_size)"]
    
    PreserveContext["Preserve Heading Context<br/>in each chunk"]
    
    Chunks["Output Chunks<br/>(with metadata)"]
    
    Document --> ParseLines
    ParseLines --> DetectHeadings
    DetectHeadings --> GroupContent
    GroupContent --> SplitOversize
    SplitOversize --> PreserveContext
    PreserveContext --> Chunks

Chunking Strategy

  1. Heading Preservation: Each chunk retains its parent heading hierarchy for context

    • Example: ## Installation > ### Prerequisites preserved in chunk metadata
  2. Size Limits: Configurable max_chunk_size (typically 1000-2000 characters)

    • Oversize paragraphs are split at sentence boundaries
  3. Overlap: Optional sliding window overlap (default: 100 characters)

    • Ensures context isn't lost at chunk boundaries
  4. Code Block Handling: Code blocks are treated as atomic units

    • Never split mid-block, even if exceeding size limit
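A minimal sketch of the heading-grouping core of such a chunker (size limits, overlap, and code-block handling are omitted; all names are illustrative, not from the ZeroClaw source):

```rust
// Group lines under the most recent heading and carry the heading path
// as each chunk's metadata.
#[derive(Debug)]
struct Chunk {
    heading_path: Vec<String>,
    content: String,
}

fn chunk_markdown(doc: &str) -> Vec<Chunk> {
    let mut chunks: Vec<Chunk> = Vec::new();
    let mut path: Vec<String> = Vec::new();
    let mut current = String::new();

    // Emit the accumulated buffer as a chunk (if non-empty).
    let flush = |chunks: &mut Vec<Chunk>, path: &[String], buf: &mut String| {
        if !buf.trim().is_empty() {
            chunks.push(Chunk {
                heading_path: path.to_vec(),
                content: std::mem::take(buf).trim().to_string(),
            });
        } else {
            buf.clear();
        }
    };

    for line in doc.lines() {
        let level = line.chars().take_while(|&c| c == '#').count();
        if level > 0 && line.chars().nth(level) == Some(' ') {
            // A heading closes the previous chunk and updates the path:
            // "## X" truncates to depth 1, then pushes "X".
            flush(&mut chunks, &path, &mut current);
            path.truncate(level - 1);
            path.push(line[level + 1..].to_string());
        } else {
            current.push_str(line);
            current.push('\n');
        }
    }
    flush(&mut chunks, &path, &mut current);
    chunks
}

fn main() {
    let doc = "# Install\nRun setup.\n## Prerequisites\nNeeds Rust.\n";
    let chunks = chunk_markdown(doc);
    assert_eq!(chunks.len(), 2);
    assert_eq!(chunks[0].heading_path, vec!["Install"]);
    assert_eq!(chunks[1].heading_path, vec!["Install", "Prerequisites"]);
    assert_eq!(chunks[1].content, "Needs Rust.");
    println!("ok");
}
```

Because the walk is a single deterministic pass over lines, the same input always yields the same chunks, which is the property the next section relies on.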

Why Line-Based?

  • Preserves Structure: Markdown headings provide natural semantic boundaries
  • Context-Aware: Parent headings give additional relevance signals
  • Fast: No complex NLP parsing required
  • Deterministic: Same input always produces same chunks

Sources: README.md:344-345


Reindexing

Over time, the memory system may require reindexing to:

  • Rebuild corrupted FTS5 indices
  • Re-embed content with a new embedding provider
  • Update to a newer embedding model
  • Compact database after many deletions

Safe Atomic Reindex

The reindex operation is designed to be safe and atomic:

sequenceDiagram
    participant User
    participant Memory as Memory Backend
    participant FTS5 as FTS5 Virtual Table
    participant Embeddings as Embedding Table
    participant Provider as EmbeddingProvider
    
    User->>Memory: Trigger Reindex
    Memory->>Memory: BEGIN TRANSACTION
    
    Memory->>FTS5: DROP VIRTUAL TABLE IF EXISTS memories_fts
    Memory->>FTS5: CREATE VIRTUAL TABLE memories_fts
    
    loop For each memory chunk
        Memory->>Embeddings: Check if embedding exists
        alt Missing Embedding
            Memory->>Provider: embed(content)
            Provider-->>Memory: embedding vector
            Memory->>Embeddings: INSERT embedding
        end
        
        Memory->>FTS5: INSERT INTO memories_fts
    end
    
    Memory->>Memory: COMMIT TRANSACTION
    Memory-->>User: Reindex Complete

Reindex Operations

  1. FTS5 Rebuild:

    • Drops and recreates FTS5 virtual table
    • Re-inserts all content for fresh indexing
    • Updates BM25 term statistics
  2. Re-embedding:

    • Scans all memory chunks
    • Identifies missing or null embeddings
    • Calls EmbeddingProvider.embed() for each
    • Updates embedding BLOBs atomically
  3. Atomicity:

    • Entire operation runs within a SQLite transaction
    • If any step fails, all changes roll back
    • No partial or corrupt state

Triggering Reindex

Reindex is typically triggered:

  • Manually via CLI: zeroclaw memory reindex
  • Automatically on provider change (if configured)
  • On database health check failure

Sources: README.md:346-347


Configuration Reference

Complete hybrid search configuration options:

[memory]
# Backend type (must be "sqlite" for hybrid search)
backend = "sqlite"

# Auto-save conversation context to memory
auto_save = true

# Embedding provider for vector search
# Options: "none", "openai", "custom:https://..."
embedding_provider = "none"

# Hybrid merge weights (must sum to <= 1.0)
vector_weight = 0.7      # Weight for vector similarity scores
keyword_weight = 0.3     # Weight for keyword BM25 scores

# Optional: SQLite open timeout when file is locked
sqlite_open_timeout_secs = 30

# Optional: Override storage provider
# [storage.provider.config]
# provider = "postgres"    # Use PostgreSQL instead of SQLite
# db_url = "postgres://..."

Performance Tuning

| Parameter | Impact | Recommendation |
|-----------|--------|----------------|
| vector_weight | Higher = more semantic | 0.7-0.9 for conversational queries |
| keyword_weight | Higher = more exact | 0.3-0.5 for technical/code queries |
| embedding_provider | API latency | Use caching; consider a local model |
| sqlite_open_timeout_secs | Lock contention | Increase for high concurrency |

Keyword-Only Mode

To disable vector search entirely (zero API calls, lower latency):

[memory]
backend = "sqlite"
embedding_provider = "none"
vector_weight = 0.0
keyword_weight = 1.0

This mode uses FTS5 BM25 only, suitable for:

  • Exact keyword recall scenarios
  • No API key / offline environments
  • Performance-critical deployments

Sources: README.md:346-377


Integration with Agent Core

The hybrid search system is invoked automatically during the agent turn cycle:

sequenceDiagram
    participant Agent as Agent Turn Loop
    participant Memory as Memory Backend
    participant Hybrid as Hybrid Search
    participant Provider as LLM Provider
    
    Agent->>Memory: Recall relevant context<br/>recall(query, limit)
    Memory->>Hybrid: Execute hybrid search
    
    par Vector Search
        Hybrid->>Hybrid: Generate query embedding
        Hybrid->>Hybrid: Cosine similarity scan
    and Keyword Search
        Hybrid->>Hybrid: Tokenize query
        Hybrid->>Hybrid: FTS5 BM25 search
    end
    
    Hybrid->>Hybrid: Merge & rank results
    Hybrid-->>Memory: Ranked chunks
    Memory-->>Agent: Recalled context
    
    Agent->>Agent: Append context to<br/>system prompt
    Agent->>Provider: chat(messages + context)
    Provider-->>Agent: Response
    
    alt auto_save enabled
        Agent->>Memory: Store conversation<br/>store(content)
    end

Recall Tool

The agent can explicitly trigger memory recall via the recall tool:

{
  "tool_calls": [{
    "id": "call_abc123",
    "function": {
      "name": "recall",
      "arguments": "{\"query\": \"authentication implementation\", \"limit\": 5}"
    }
  }]
}

The hybrid search executes and returns top-ranked results to the LLM context.

Sources: README.md:330-345


Summary

ZeroClaw's hybrid search provides production-grade memory retrieval with:

  • Zero external dependencies: All components (vector storage, keyword search, merge logic) implemented in SQLite
  • Configurable weights: Tune vector vs. keyword emphasis per deployment
  • LRU caching: Minimize embedding API calls for frequent queries
  • Atomic operations: Safe reindexing with full transaction support
  • Multiple embedding providers: OpenAI, custom endpoints, or none (keyword-only mode)

The system balances semantic understanding (vector search) with exact matching (keyword search), delivering robust recall across diverse query types without requiring heavyweight infrastructure like Pinecone or Elasticsearch.

Sources: README.md:330-347

