Chunk size (800) exceeds embedding model token limit (256 tokens / ~512 chars)
Problem
The default CHUNK_SIZE = 800 in miner.py exceeds the token limit of the default embedding model.
Details:
- Chunk size: 800 characters (line 56 in miner.py)
- Embedding model: all-MiniLM-L6-v2 (ChromaDB default via ONNX)
- Model token limit: 256 tokens (~512 characters)
- Result: Content beyond ~512 chars is silently truncated before embedding
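To put a rough number on the loss, here is a small sketch of the arithmetic. It assumes the ratio used in this issue (256 tokens covering about 512 characters, so roughly 2 characters per token); the real ratio depends on the text and the tokenizer. The function name `truncated_fraction` is made up for illustration and is not part of mempalace.

```python
def truncated_fraction(chunk_chars: int,
                       max_tokens: int = 256,
                       chars_per_token: float = 2.0) -> float:
    """Estimate the share of a chunk that falls past the model's token window.

    chars_per_token is an assumption taken from this issue's ~512 char figure,
    not a measured value.
    """
    visible_chars = max_tokens * chars_per_token
    if chunk_chars <= visible_chars:
        return 0.0
    return (chunk_chars - visible_chars) / chunk_chars


# With the current default of 800 characters, about a third of each
# chunk would not reach the embedding under this estimate.
print(f"{truncated_fraction(800):.0%}")
```

Under the same assumption, a 400-character chunk fits entirely inside the window.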
Impact
- Lost context: Important information in positions 512-800 of each chunk is not included in the embedding
- Reduced recall: Semantic search may miss relevant content in truncated portions
- Silent degradation: No errors or warnings — users don't know this is happening
Suggested Fixes
Option 1: Reduce default chunk size
# miner.py, lines 56-57
CHUNK_SIZE = 400  # was 800; safer for the 256-token limit
CHUNK_OVERLAP = 50  # was 100
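For reference, this is how the two constants would interact in a simple character-based sliding window. It is a minimal sketch only: I have not reproduced the actual chunking logic in miner.py, and `chunk_text` is a hypothetical helper written for this issue.

```python
CHUNK_SIZE = 400     # proposed default from Option 1
CHUNK_OVERLAP = 50   # proposed default from Option 1


def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    """Split text into windows of at most `size` chars, each sharing `overlap` chars with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
        start += size - overlap
    return chunks


# A 1000-character input yields three chunks, none longer than 400 characters.
print([len(c) for c in chunk_text("a" * 1000)])
```

With these values every chunk stays under the ~512 character estimate, at the cost of producing roughly twice as many chunks per file.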
Option 2: Add CLI configuration
mempalace mine <path> --chunk-size 400 --chunk-overlap 50
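A possible shape for the flags, sketched with argparse. I don't know how the mempalace CLI is actually built, so the parser layout below is an assumption; only the flag names `--chunk-size` and `--chunk-overlap` come from the command proposed above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; the real mempalace CLI may be organised differently."""
    parser = argparse.ArgumentParser(prog="mempalace")
    sub = parser.add_subparsers(dest="command", required=True)
    mine = sub.add_parser("mine", help="index files under a path")
    mine.add_argument("path")
    mine.add_argument("--chunk-size", type=int, default=400,
                      help="maximum characters per chunk")
    mine.add_argument("--chunk-overlap", type=int, default=50,
                      help="characters shared between consecutive chunks")
    return parser


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Reject combinations that would make the chunker loop or misbehave.
    if args.chunk_size <= 0 or not 0 <= args.chunk_overlap < args.chunk_size:
        parser.error("--chunk-overlap must be >= 0 and smaller than --chunk-size")
    return args


args = parse_args(["mine", "docs", "--chunk-size", "300", "--chunk-overlap", "30"])
print(args.path, args.chunk_size, args.chunk_overlap)
```

Validating the pair at parse time means a bad combination fails immediately instead of partway through a mining run.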
Option 3: Auto-detect model limits
Query the embedding function for its max tokens and adjust chunking accordingly.
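I'm not aware of a public ChromaDB attribute that reports the default embedding function's token limit, so the sketch below falls back on a small lookup table instead. Everything in it is an assumption for illustration: the table `KNOWN_MAX_TOKENS`, the helper `safe_chunk_size`, and the 2 characters per token ratio (taken from the ~512 char figure in this issue). It also emits a warning when it clamps, which would address the silent degradation noted under Impact.

```python
import warnings

# Hypothetical table; the single entry uses the limit quoted in this issue.
KNOWN_MAX_TOKENS = {"all-MiniLM-L6-v2": 256}

# Rough ratio implied by "256 tokens / ~512 chars"; an estimate, not a measurement.
CHARS_PER_TOKEN = 2.0


def safe_chunk_size(model_name: str, requested: int,
                    chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Clamp a requested chunk size (in characters) to the model's estimated window."""
    max_tokens = KNOWN_MAX_TOKENS.get(model_name)
    if max_tokens is None:
        return requested  # unknown model: leave the user's setting alone
    limit = int(max_tokens * chars_per_token)
    if requested > limit:
        warnings.warn(
            f"chunk size {requested} exceeds the estimated {limit}-char window "
            f"of {model_name}; using {limit} instead"
        )
        return limit
    return requested


print(safe_chunk_size("all-MiniLM-L6-v2", 800))
```

A more accurate version would count tokens with the model's own tokenizer rather than estimating from characters.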
Environment
- mempalace version: 3.0.0
- ChromaDB version: 1.5.7
- Embedding: Default (all-MiniLM-L6-v2 via ONNX)
Priority
Medium — search works but quality could be significantly improved.