Enhance robustness of memory search with jieba-next CJK tokenizer and native Markdown parsing

## Problem Statement

There are two **severe** design flaws that carry a high risk of compromising information retrieval effectiveness and degrading overall reliability, especially in CJK (specifically Chinese) contexts:

- The current lack of robust **tokenization** and **semantic parsing** forms a critical bottleneck for practical usage in multilingual (notably CJK) environments.
- LLMs may produce malformed or inconsistent Markdown output: their Markdown *dialect* is shaped by black-box pre-training/RLHF, so formatting is unstable even with explicit prompts. In addition, infrastructure issues (network fluctuations, cluster load balancing) often cause errors in complex structures such as tables.
- Plain regex chunking and standard tokenizers (as in the current `store.py`/`chunker.py`) lead to poor recall, fuzzy chunking, and ineffective BM25 search in Chinese.
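
To illustrate the last point, here is a minimal sketch of why a whitespace/regex tokenizer breaks term matching for Chinese. `naive_tokenize` is a hypothetical stand-in that I am assuming resembles a typical `\w+` tokenizer; it is not copied from `store.py`.

```python
import re


def naive_tokenize(text: str) -> list[str]:
    # Hypothetical stand-in for a regex tokenizer; NOT taken from store.py.
    # In Python 3, \w matches CJK characters, so an unspaced Chinese
    # sentence comes back as one long token.
    return re.findall(r"\w+", text.lower())


doc_tokens = naive_tokenize("记忆检索系统支持中文分词")
query_tokens = naive_tokenize("中文分词")

# The query is a substring of the document, but term-level matching
# (which is what BM25 relies on) finds no shared token.
overlap = set(doc_tokens) & set(query_tokens)
print(doc_tokens, query_tokens, overlap)
```

Because the whole sentence becomes a single term, the query term never matches exactly, which is one plausible source of the poor recall described above.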

## Proposal: Tokenizer and Markdown Parser Recommendations

- **Chinese tokenizer:** Prioritize [jieba-next](https://github.com/mxcoras/jieba-next) (most API-rich, good Windows support) for CJK tokenization. Alternatively, [jieba-rs](https://github.com/messense/jieba-rs) offers higher speed but a narrower API that covers only basic use cases.
- **Markdown parser:** Favor [Mistune](https://github.com/lepture/mistune), as it is actively maintained and provides a structured AST in Python. Alternatively, [md4c](https://github.com/mity/md4c) is very fast, but its SAX-style (event callback) parser design makes precise chunking more fragile to engineer.
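
As a rough sketch of what AST-based chunking could look like with Mistune: the code below assumes the Mistune 3.x token layout (heading level under `attrs["level"]`, leaf text under `raw`); older versions use different keys. The helper names (`parse_markdown`, `plain_text`, `chunk_by_heading`) are hypothetical and not part of this project or of Mistune.

```python
from typing import Any

Token = dict[str, Any]


def parse_markdown(text: str) -> list[Token]:
    """Parse Markdown into Mistune's AST (a list of block-level token dicts)."""
    import mistune  # third-party; imported lazily so the helpers below need only the stdlib

    markdown = mistune.create_markdown(renderer="ast")
    return markdown(text)


def plain_text(token: Token) -> str:
    """Flatten a token and its inline children into plain text."""
    if token.get("type") in ("softbreak", "linebreak"):
        return " "
    if "children" in token:
        return "".join(plain_text(child) for child in token["children"])
    return token.get("raw", "")  # assumption: Mistune 3.x stores leaf text in "raw"


def chunk_by_heading(tokens: list[Token]) -> list[dict[str, Any]]:
    """Group block tokens into one chunk per heading."""
    chunks: list[dict[str, Any]] = []
    current: dict[str, Any] = {"heading": "", "level": 0, "body": []}
    for token in tokens:
        kind = token.get("type")
        if kind == "blank_line":
            continue
        if kind == "heading":
            # Close the previous chunk before starting a new one.
            if current["heading"] or current["body"]:
                chunks.append(current)
            current = {
                "heading": plain_text(token),
                "level": token["attrs"]["level"],  # assumption: 3.x layout
                "body": [],
            }
        else:
            text = plain_text(token).strip()
            if text:
                current["body"].append(text)
    if current["heading"] or current["body"]:
        chunks.append(current)
    return chunks
```

Usage would be along the lines of `chunk_by_heading(parse_markdown(memory_text))`. Nested blocks such as lists are flattened without separators here, so a real chunker would likely want to handle them explicitly.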

## Potential Consequences

Failure to address these design flaws will:
- Severely limit semantic search quality, especially for Chinese/CJK agents or knowledge bases
- Hinder adoption as a memory solution in multilingual/global settings
- Leave heading/paragraph detection brittle, increasing the risk of false negatives and false positives during recall

**Request:** Please consider prioritizing a robust chunker and BM25 tokenizer/pluggable analyzer as a core architectural improvement.
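
To make the "pluggable analyzer" idea concrete, here is a minimal sketch under my own assumptions: the analyzer is just a callable from text to terms, and `BM25Index` is a hypothetical toy Okapi BM25 scorer, not the project's existing implementation. The character-bigram analyzer is a dependency-free fallback; a dictionary-based segmenter (for example a wrapper around whichever jieba-style tokenizer is chosen) could be passed in the same way.

```python
import math
from collections import Counter
from typing import Callable

Analyzer = Callable[[str], list[str]]


def bigram_analyzer(text: str) -> list[str]:
    """Fallback analyzer: overlapping character bigrams, whitespace removed."""
    chars = "".join(text.split())
    if len(chars) < 2:
        return [chars] if chars else []
    return [chars[i:i + 2] for i in range(len(chars) - 1)]


class BM25Index:
    """Toy Okapi BM25 index whose tokenization is injected (hypothetical API)."""

    def __init__(self, analyzer: Analyzer, k1: float = 1.5, b: float = 0.75):
        self.analyzer = analyzer
        self.k1, self.b = k1, b
        self.docs: list[Counter] = []
        self.doc_freq: Counter = Counter()

    def add(self, text: str) -> None:
        terms = Counter(self.analyzer(text))
        self.docs.append(terms)
        self.doc_freq.update(terms.keys())  # each doc counts once per term

    def search(self, query: str) -> list[tuple[int, float]]:
        n = len(self.docs)
        if n == 0:
            return []
        avg_len = sum(sum(d.values()) for d in self.docs) / n
        query_terms = set(self.analyzer(query))
        results = []
        for doc_id, terms in enumerate(self.docs):
            doc_len = sum(terms.values())
            score = 0.0
            for term in query_terms:
                tf = terms.get(term, 0)
                if tf == 0:
                    continue
                df = self.doc_freq[term]
                idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
                norm = tf + self.k1 * (1 - self.b + self.b * doc_len / avg_len)
                score += idf * tf * (self.k1 + 1) / norm
            if score > 0:
                results.append((doc_id, score))
        return sorted(results, key=lambda pair: pair[1], reverse=True)


index = BM25Index(analyzer=bigram_analyzer)
index.add("记忆检索系统支持中文分词")
index.add("Markdown 解析器生成结构化语法树")
hits = index.search("中文分词")  # list of (doc_id, score), best match first
```

Keeping the analyzer behind a plain callable means the tokenizer can be swapped per language or per deployment without touching the scoring code.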
