Summary
Four prior issues (#27, #29, #39, #125) documented specific methodology and attribution problems in MemPalace's public-facing benchmark claims. These issues were already acknowledged by maintainers. One week later (2026-04-14), the headline claims on the repo About text, the README.md, mempalaceofficial.com, and mempalaceofficial.com/reference/benchmarks.html still reproduce the same framing, including a comparison table that places retrieval-recall numbers and binary-QA-accuracy numbers under a single column labeled "LongMemEval R@5."
This issue enumerates each surviving claim alongside the specific prior issue that documented the problem and the maintainer acknowledgement. Verbatim quotes throughout; every external number verifiable from the linked sources.
The central methodology point
R@5 (recall at 5) is a retrieval metric. QA accuracy is an end-to-end answer-correctness metric. They are not in any way comparable. A system can have 100% R@5 and 40% QA accuracy. Presenting R@5 numbers next to QA accuracy numbers under a single column header is not a rounding error; it is a category error that inflates MemPalace's relative standing against systems that published different metrics. This exact flaw was flagged one week ago.
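The gap is easy to see with toy numbers. A minimal sketch (the data and helper functions below are hypothetical illustrations, not MemPalace's harness):

```python
# Toy illustration: recall@5 (retrieval) and QA accuracy (end-to-end)
# are computed over different things and can diverge arbitrarily.

def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """1.0 if any relevant ID appears in the top-k retrieved IDs."""
    return 1.0 if set(retrieved_ids[:k]) & set(relevant_ids) else 0.0

def qa_accuracy(predicted, gold):
    """1.0 if the generated answer matches the gold answer (binary)."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0

# Hypothetical questions: retrieval finds a relevant session every time...
questions = [
    {"retrieved": ["s3", "s1", "s9"], "relevant": ["s1"],
     "predicted": "Paris", "gold": "Paris"},
    {"retrieved": ["s2", "s7", "s4"], "relevant": ["s7"],
     "predicted": "1992", "gold": "1994"},   # ...but the answer is wrong.
]

r5 = sum(recall_at_k(q["retrieved"], q["relevant"]) for q in questions) / len(questions)
acc = sum(qa_accuracy(q["predicted"], q["gold"]) for q in questions) / len(questions)
print(r5, acc)  # 1.0 0.5 -- perfect retrieval recall, 50% QA accuracy
```

Two questions, perfect retrieval on both, correct answer on one: 100% R@5 sits next to 50% QA accuracy on identical data, which is why the two numbers cannot share a column.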
1 - "Highest-scoring AI memory system ever benchmarked"
Current public-facing text (verified 2026-04-14):
Repo About: "The highest-scoring AI memory system ever benchmarked. And it's free."
README.md headline banner: "The highest-scoring AI memory system ever benchmarked. And it's free." with "96.6% LongMemEval R@5 raw mode, zero API calls."
mempalaceofficial.com: "96.6% recall on LongMemEval in raw mode."
What #27 established and maintainers acknowledged in #39:
From #27 (lhl): "96.6% LongMemEval R@5 (headline, positioned as MemPalace's score) - Real score, but measured in 'raw mode' - uncompressed verbatim text stored in ChromaDB, standard nearest-neighbor retrieval. The palace structure (wings/rooms/halls) is not involved. This measures ChromaDB's default embedding model performance, not MemPalace."
From #39 (gizmax, independent reproduction on M2 Ultra): "the --mode raw runner builds a fresh chromadb.EphemeralClient() per question and never touches the palace" code paths.
Maintainer's reply on #39: "the palace structure is for navigation/organization, not retrieval boost. AAAK is an experimental compression layer, not the storage default."
Status 2026-04-14: The About text and README headline continue to attribute the 96.6% to MemPalace as a memory system. Per the maintainer's own acknowledgment, the number measures ChromaDB's default sentence-transformer model with no MemPalace-specific retrieval code path involved. The correct attribution is "ChromaDB default embeddings," which is a third-party off-the-shelf component, not a MemPalace result.
2 - Comparison table mixes R@5 with QA accuracy
Current public-facing text (mempalaceofficial.com/reference/benchmarks.html, verified 2026-04-14):
Comparison vs Published Systems

| System | LongMemEval R@5 | API Required | Cost |
| --- | --- | --- | --- |
| MemPalace (hybrid) | 100% | Optional | Free |
| Supermemory ASMR | ~99% | Yes | - |
| MemPalace (raw) | 96.6% | None | Free |
| Mastra | 94.87% | Yes | API costs |
| Hindsight | 91.4% | Yes | API costs |
| Mem0 | ~85% | Yes | $19-249/mo |
What each cited source actually publishes:

| Row | MemPalace table claim | Source's actual published metric |
| --- | --- | --- |
| Mastra 94.87% | LongMemEval R@5 | "Overall Accuracy - Each question is evaluated as correct/incorrect (binary)" with gpt-5-mini |
Mastra's 94.87% is binary QA accuracy, not R@5. Mem0 does not publish a LongMemEval number at all; the ~85% in the table has no cited source and does not match Mem0's actual published figure of 66.9% on a different benchmark (LoCoMo, not LongMemEval).
What #29 already established:
From #29, point 4: "LongMemEval runner measures retrieval, not QA. The benchmark never generates answers or invokes judges - it only checks if labeled session IDs appear in top-5 results (recall_any@5), fundamentally different from the published leaderboard's end-to-end QA task."
From #29, point 5: "ConvoMem metric mismatch. The 92.9% is retrieval-based; Mem0's published numbers are end-to-end QA accuracy. Comparing different metrics on the same dataset."
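Per #29's description, the runner's recall_any@5 scoring reduces to a set-membership check. A minimal sketch of the scoring rule as quoted (illustrative, not the actual runner code):

```python
def recall_any_at_5(top5_session_ids, labeled_session_ids):
    """Per the description in #29: a question counts as a hit if ANY
    labeled evidence session ID appears in the top-5 retrieved session
    IDs. No answer is generated and no judge is invoked."""
    return int(bool(set(top5_session_ids) & set(labeled_session_ids)))

# A hit says nothing about whether a correct answer could be produced
# from the retrieved sessions:
print(recall_any_at_5(["s12", "s03", "s44", "s07", "s91"], ["s44"]))  # 1
print(recall_any_at_5(["s12", "s03", "s44", "s07", "s91"], ["s88"]))  # 0
```

A leaderboard QA-accuracy number, by contrast, is scored only after an answer is generated and judged, so the two figures measure different stages of the pipeline.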
Status 2026-04-14: #29 was closed as completed. The same flaw identified there (MemPalace retrieval-recall presented as comparable to competitor QA accuracy) is reproduced on the LongMemEval comparison table on mempalaceofficial.com, and reappears in the ConvoMem table on the same page (verbatim):
| System | Score |
| --- | --- |
| MemPalace | 92.9% |
| Gemini (long context) | 70-82% |
| Block extraction | 57-71% |
| Mem0 (RAG) | 30-45% |
The column header is "Score" - undefined. MemPalace's 92.9% is retrieval-based (per #29, point 5). Mem0's published ConvoMem numbers are end-to-end QA accuracy. Mixing the two under a single "Score" column is the same category error as the LongMemEval table, one level down the page.
3 - "100%" scores still headlined after acknowledgment
Current public-facing text (mempalaceofficial.com/reference/benchmarks.html, verified 2026-04-14):
| Mode | R@5 | LLM Required | Cost/query |
| --- | --- | --- | --- |
| Hybrid v4 + rerank | 100% | Haiku | ~$0.001 |
And in the LoCoMo table on the same page (verbatim, mempalaceofficial.com/reference/benchmarks.html, "LoCoMo (1,986 multi-hop QA pairs)"):
| Mode | R@10 | LLM |
| --- | --- | --- |
| Hybrid v5 + Sonnet rerank (top-50) | 100% | Sonnet |
| bge-large + Haiku rerank (top-15) | 96.3% | Haiku |
| Hybrid v5 (top-10, no rerank) | 88.9% | None |
| Session, no rerank (top-10) | 60.3% | None |
The column header is R@10; the top row's mode is (top-50). The two cannot both be true descriptions of the same evaluation. Either the mode is not top-50, or the metric is not R@10. The non-reranked baseline on the same page - Session, no rerank (top-10) - 60.3% - is the honest retrieval-recall signal for this system on this benchmark.
What #125 established and bensig acknowledged:
From #125 (rohithzr): "The '100% LongMemEval score' involved 'teaching to the test' on three specific questions, with a held-out score of 98.4%. The LoCoMo benchmark reports '100% by construction' when top-k retrieval exceeds corpus size."
From #29, point 2: "The 100% bypasses retrieval. With top-k=50, every conversation (19-32 sessions each) is fully retrieved. The system reduces to 'dump every session into Claude Sonnet, ask Sonnet which one matches.'"
bensig's reply on #125: "100% shouldn't be headlined like that, we've updated the readme a bit more now."
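The "by construction" mechanism in the two quotes above is plain arithmetic: once top-k meets or exceeds a conversation's session count, the retrieval stage returns every session and any session-level recall is guaranteed. A quick sanity check (the session counts come from #29; everything else is illustrative):

```python
def retrieval_saturated(top_k, num_sessions):
    """True when top-k retrieval must return every session in the
    conversation, making any session-level recall metric 100% by
    construction rather than by merit."""
    return top_k >= num_sessions

# LoCoMo conversations contain 19-32 sessions each (per #29),
# so a top-50 retrieval stage saturates on every conversation:
assert all(retrieval_saturated(50, n) for n in range(19, 33))

# A top-10 retrieval stage, by contrast, never trivially saturates
# on those conversation sizes:
assert not any(retrieval_saturated(10, n) for n in range(19, 33))
```

Once the retrieval stage is saturated, whatever score the pipeline reports is measuring the downstream reranking LLM, not retrieval.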
Status 2026-04-14: Both 100% numbers remain in the public-facing benchmark tables, one week after the acknowledgment that they shouldn't be headlined. Three specific problems with the 100% rows are not disclosed anywhere on the page:
The "R@10" header does not match the "(top-50)" mode. The retrieval stage returns 50 candidates before Sonnet reranks them. Per #29 (Multiple issues with benchmark methodology and scoring), LoCoMo conversations contain 19-32 sessions each - top-50 exceeds the per-conversation session count in every case, so the retrieval stage returns every session. What the row actually reports is Sonnet's recall at 10 given the entire conversation as input: an LLM-judging task, not a retrieval task.
LoCoMo contains adversarial / unanswerable questions by design. The 1,986-question LoCoMo set includes questions whose answer is not present in the conversation (the adversarial category exists specifically to penalize systems that hallucinate support). Retrieval recall on those questions cannot be 100%. There is no supporting chunk in the corpus to retrieve. A headline "100% R@10" across the full set is either mathematically impossible or does not actually include the adversarial subset (Category 5). Neither the exclusion nor the question-count actually evaluated is disclosed.
The LongMemEval 100% held-out figure (98.4% on 450 untouched questions, per #125, "BEAM 100K benchmark results - first end-to-end answer quality evaluation") is not shown in the comparison table where 100% appears.
4 - End-to-end QA results exist and are omitted
What #125 established (BEAM benchmark, end-to-end answer quality, not retrieval):
From #125 (rohithzr): "Raw ChromaDB achieves 49% overall score, while MemPalace-specific modes score lower (43%, 43.6%, 26.2%, 27.9%). The analysis identifies critical gaps: summarization reaches only 35%, event ordering 32%, and contradiction resolution 40%."
Status 2026-04-14: These are the only end-to-end answer-quality numbers published for MemPalace. They are not in the README, the About text, or the mempalaceofficial.com benchmarks page. The positioning of MemPalace as "the highest-scoring AI memory system ever benchmarked" relies exclusively on retrieval-recall numbers while omitting the only QA-accuracy benchmark performed against the system.
The competitor numbers in the comparison table (Mastra 94.87%, Mem0 66.9% on LoCoMo) are QA accuracy. The MemPalace QA accuracy on BEAM raw is 49%. Apples-to-apples comparison does not appear anywhere in the public-facing claims.
The pattern
2026-04-07 evening: README gets a partial correction ("A Note from Milla & Ben - April 7, 2026") addressing AAAK and the palace-structure-boost claim. The headline banner and About text are not changed.
2026-04-07 to 2026-04-14: mempalaceofficial.com publishes the comparison tables documented above, extending rather than correcting the framing.
2026-04-09: @rohitg00, maintainer of the competing agentmemory project, filed Discussion #747 independently documenting most of the issues above. Five days later: zero maintainer reply, zero comments.
2026-04-14: About text, README headline, and mempalaceofficial.com comparison tables still contain the claims documented above as misleading.
One week is a long time for a single-line About text and a README banner to remain unfixed after the responsible maintainer acknowledged the problem.
Verification
Every external number and quotation in this issue is verifiable from the cited source. The Mastra and Mem0 metric definitions are quoted verbatim from their research pages. The MemPalace README and repo About text are quoted from the live GitHub pages as of 2026-04-14. The mempalaceofficial.com homepage content is quoted from the 2026-04-14 archive at https://archive.is/ZXbI3 (the live site was returning errors at the time of filing). The mempalaceofficial.com/reference/benchmarks.html content is quoted from the same-day captured text; a public archive of that page has been requested.
TL;DR: MemPalace is not "The highest-scoring AI memory system ever benchmarked." MemPalace's own unmanipulated retrieval baseline on LoCoMo without rerank is 60.3% R@10 (verbatim from mempalaceofficial.com/reference/benchmarks.html).