Chunk size (800) exceeds embedding model token limit (256 tokens / ~512 chars) #390

@sidonsoft

Problem

The default CHUNK_SIZE = 800 (characters) in miner.py can produce chunks that exceed the 256-token input limit of the default embedding model.

Details:

  • Chunk size: 800 characters (line 56 in miner.py)
  • Embedding model: all-MiniLM-L6-v2 (ChromaDB default via ONNX)
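  • Model token limit: 256 tokens (estimated here as ~512 characters; the exact character count depends on how the text tokenizes)
  • Result: Content beyond the token limit (~512 chars by this estimate) is silently truncated before embedding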

Impact

  1. Lost context: Text at roughly character positions 512-800 of each chunk is not included in the embedding
  2. Reduced recall: Semantic search may miss relevant content that falls in the truncated portion
  3. Silent degradation: No errors or warnings are raised, so users don't know this is happening
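
To make the silent truncation visible, a check along the following lines could run at mining time. This is only a sketch: `approx_token_count` and `oversized_chunks` are hypothetical names, not part of mempalace, and the regex tokenizer is a crude stand-in. A real check would pass in the embedding model's own tokenizer, which typically yields more tokens than a word count.

```python
import re
import warnings

MAX_TOKENS = 256  # limit quoted in this issue for all-MiniLM-L6-v2


def approx_token_count(text):
    """Crude stand-in tokenizer: counts words and punctuation marks.

    Assumption: the model's real tokenizer splits words further, so this
    count is a lower bound rather than an exact figure.
    """
    return len(re.findall(r"\w+|[^\w\s]", text))


def oversized_chunks(chunks, count_tokens=approx_token_count, max_tokens=MAX_TOKENS):
    """Warn about, and return, (index, token_count) for chunks over the limit."""
    over = []
    for i, chunk in enumerate(chunks):
        n = count_tokens(chunk)
        if n > max_tokens:
            warnings.warn(f"chunk {i} has {n} tokens; limit is {max_tokens}")
            over.append((i, n))
    return over
```

Passing a different `count_tokens` callable lets the same check work with whichever tokenizer matches the configured embedding model.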

Suggested Fixes

Option 1: Reduce default chunk size

# miner.py line 56-57
CHUNK_SIZE = 400  # Was 800 - safer for 256 token limit
CHUNK_OVERLAP = 50  # Was 100
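
For illustration, this is how the two constants would interact in a plain character-window chunker. It is a minimal sketch under the assumption that chunks are fixed-size character windows; `chunk_text` is a hypothetical helper and the actual splitting logic in miner.py may differ.

```python
def chunk_text(text, chunk_size=400, overlap=50):
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper, not the miner.py implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # each window starts this far after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With these values a 1000-character input yields three chunks starting at offsets 0, 350 and 700, none longer than 400 characters.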

Option 2: Add CLI configuration

mempalace mine <path> --chunk-size 400 --chunk-overlap 50
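
A possible shape for these flags, sketched with the standard library's argparse. How the mempalace CLI is actually built is not known here, so the parser layout and the `build_parser` and `parse_args` names are assumptions; the defaults mirror the current values of 800 and 100 mentioned above.

```python
import argparse


def build_parser():
    """Hypothetical parser for `mempalace mine <path>` with chunking flags."""
    parser = argparse.ArgumentParser(prog="mempalace")
    sub = parser.add_subparsers(dest="command", required=True)
    mine = sub.add_parser("mine", help="mine files under a path")
    mine.add_argument("path")
    mine.add_argument("--chunk-size", type=int, default=800,
                      help="maximum characters per chunk")
    mine.add_argument("--chunk-overlap", type=int, default=100,
                      help="characters shared between adjacent chunks")
    return parser


def parse_args(argv):
    """Parse argv and reject combinations that cannot produce valid chunks."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.chunk_size <= 0:
        parser.error("--chunk-size must be positive")
    if not 0 <= args.chunk_overlap < args.chunk_size:
        parser.error("--chunk-overlap must be >= 0 and smaller than --chunk-size")
    return args
```

Validating the overlap against the chunk size up front avoids a chunker that never advances.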

Option 3: Auto-detect model limits
Query the embedding function for its max tokens and adjust chunking accordingly.
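
A rough sketch of what that could look like. It assumes the embedding function exposes its limit through a `max_tokens` attribute, which is a hypothetical name used for illustration; the default ChromaDB embedding function may not expose anything like it, in which case the fallback of 256 applies. The 2 characters-per-token ratio is the estimate used in this issue, and the 1/8 overlap matches the ratio of the existing constants.

```python
def resolve_chunk_size(embedding_fn, fallback_max_tokens=256,
                       chars_per_token=2.0, headroom=0.9):
    """Derive (chunk_size, overlap) in characters from a model token limit.

    `max_tokens` is a hypothetical attribute; if the embedding function does
    not provide it, the fallback limit is used instead.
    """
    max_tokens = getattr(embedding_fn, "max_tokens", None) or fallback_max_tokens
    # Leave some headroom because the characters-per-token ratio is an estimate.
    chunk_size = int(max_tokens * chars_per_token * headroom)
    overlap = chunk_size // 8  # same 1:8 ratio as 100/800 and 50/400
    return chunk_size, overlap
```

For a function with no detectable limit this gives a chunk size of 460 characters with an overlap of 57.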

Environment

  • mempalace version: 3.0.0
  • ChromaDB version: 1.5.7
  • Embedding: Default (all-MiniLM-L6-v2 via ONNX)

Priority

Medium: search still works, but result quality could be significantly improved.

Labels: area/mining (File and conversation mining), bug (Something isn't working)
