
The MemPalace README.md and official website (mempalaceofficial.com) still contain misrepresentations and false benchmark claims #875

@dial481

Description


TL;DR: MemPalace is not "The highest-scoring AI memory system ever benchmarked."

Summary

Four prior issues (#27, #29, #39, #125) documented specific methodology and attribution problems in MemPalace's public-facing benchmark claims, and maintainers acknowledged them at the time. One week later (2026-04-14), the headline claims in the repo About text, the README.md, mempalaceofficial.com, and mempalaceofficial.com/reference/benchmarks.html still reproduce the same framing, including a comparison table that places retrieval-recall numbers and binary QA-accuracy numbers under a single column labeled "LongMemEval R@5."

This issue enumerates each surviving claim alongside the specific prior issue that documented the problem and the maintainer acknowledgement. Verbatim quotes throughout; every external number verifiable from the linked sources.

The central methodology point

R@5 (recall at 5) is a retrieval metric. QA accuracy is an end-to-end answer-correctness metric. The two are not comparable: a system can score 100% R@5 and 40% QA accuracy on the same questions. Presenting R@5 numbers next to QA-accuracy numbers under a single column header is not a rounding error; it is a category error that inflates MemPalace's relative standing against systems that published a different metric. This exact flaw was flagged one week ago.
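
To make the category error concrete, here is a minimal sketch with toy data - the function names and values are mine, not from any cited run:

```python
# Toy illustration only - not data from any cited benchmark.
# recall_any@5 (a retrieval metric): did any gold session id land in
# the top-5 retrieved candidates?
def recall_any_at_5(retrieved_ids, gold_ids):
    return any(g in retrieved_ids[:5] for g in gold_ids)

# QA accuracy (an end-to-end metric): did the generated answer match
# the gold answer?
def qa_correct(generated, gold):
    return generated.strip().lower() == gold.strip().lower()

# A retriever can surface the right session every time (R@5 = 100%)
# while the answering stage still gets the question wrong:
retrieved = ["s3", "s1", "s9", "s2", "s7"]
print(recall_any_at_5(retrieved, gold_ids=["s3"]))  # True  -> perfect R@5
print(qa_correct("March 2021", "June 2022"))        # False -> wrong answer
```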

1 - "Highest-scoring AI memory system ever benchmarked"

Current public-facing text (verified 2026-04-14):

  • Repo About: "The highest-scoring AI memory system ever benchmarked. And it's free."
  • README.md headline banner: "The highest-scoring AI memory system ever benchmarked. And it's free." with "96.6% LongMemEval R@5 raw mode, zero API calls."
  • mempalaceofficial.com: "96.6% recall on LongMemEval in raw mode."

What #27 established and maintainers acknowledged in #39:

From #27 (lhl): "96.6% LongMemEval R@5 (headline, positioned as MemPalace's score) - Real score, but measured in 'raw mode' - uncompressed verbatim text stored in ChromaDB, standard nearest-neighbor retrieval. The palace structure (wings/rooms/halls) is not involved. This measures ChromaDB's default embedding model performance, not MemPalace."

From #39 (gizmax, independent reproduction on M2 Ultra): the --mode raw runner "builds a fresh chromadb.EphemeralClient() per question and never touches the palace" code paths.

Maintainer's reply on #39: "the palace structure is for navigation/organization, not retrieval boost. AAAK is an experimental compression layer, not the storage default."

Status 2026-04-14: The About text and README headline continue to attribute the 96.6% to MemPalace as a memory system. Per the maintainer's own acknowledgment, the number measures ChromaDB's default sentence-transformer model with no MemPalace-specific retrieval code path involved. The correct attribution is "ChromaDB default embeddings," which is a third-party off-the-shelf component, not a MemPalace result.
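
For readers who have not opened the runner, a minimal sketch of what that acknowledged raw-mode path reduces to - reconstructed from the quotes above, not taken from the MemPalace codebase; the collection name, documents, and query text are placeholders:

```python
import chromadb

# Per the #39 reproduction: a fresh in-memory client per question,
# ChromaDB's default sentence-transformer embedding function, and a
# plain top-5 nearest-neighbor query. No palace structure is involved.
client = chromadb.EphemeralClient()
collection = client.create_collection("raw_mode")  # placeholder name

# Store verbatim session text; ChromaDB embeds it with its default model.
collection.add(
    ids=["session-1", "session-2"],
    documents=["...verbatim session text...", "...more session text..."],
)

# Standard nearest-neighbor retrieval - the code path the 96.6% R@5
# figure actually measures, per the maintainer's reply on #39.
results = collection.query(query_texts=["the benchmark question"], n_results=5)
print(results["ids"])
```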

2 - Comparison table mixes R@5 with QA accuracy

Current public-facing text (mempalaceofficial.com/reference/benchmarks.html, verified 2026-04-14):

Comparison vs Published Systems

| System | LongMemEval R@5 | API Required | Cost |
| --- | --- | --- | --- |
| MemPalace (hybrid) | 100% | Optional | Free |
| Supermemory ASMR | ~99% | Yes | - |
| MemPalace (raw) | 96.6% | None | Free |
| Mastra | 94.87% | Yes | API costs |
| Hindsight | 91.4% | Yes | API costs |
| Mem0 | ~85% | Yes | $19-249/mo |

What each cited source actually publishes:

| Row | MemPalace table claim | Source's actual published metric | Benchmark | Cited URL |
| --- | --- | --- | --- | --- |
| Mastra | 94.87% LongMemEval R@5 | "Overall Accuracy - Each question is evaluated as correct/incorrect (binary)" with gpt-5-mini | LongMemEval-S (QA accuracy, not R@5) | mastra.ai/research/observational-memory |
| Mem0 | ~85% LongMemEval R@5 | 66.9% overall accuracy on LoCoMo - no LongMemEval number published | LoCoMo (different benchmark entirely, and a different metric) | mem0.ai/research |

Mastra's 94.87% is binary QA accuracy, not R@5. Mem0 does not publish a LongMemEval number at all; the ~85% in the table has no cited source and does not match Mem0's actual published figure of 66.9% on a different benchmark (LoCoMo, not LongMemEval).

What #29 already established:

From #29, point 4: "LongMemEval runner measures retrieval, not QA. The benchmark never generates answers or invokes judges - it only checks if labeled session IDs appear in top-5 results (recall_any@5), fundamentally different from the published leaderboard's end-to-end QA task."

From #29, point 5: "ConvoMem metric mismatch. The 92.9% is retrieval-based; Mem0's published numbers are end-to-end QA accuracy. Comparing different metrics on the same dataset."

Status 2026-04-14: #29 was closed as completed. The same flaw identified there (MemPalace retrieval-recall presented as comparable to competitor QA accuracy) is reproduced on the LongMemEval comparison table on mempalaceofficial.com, and reappears in the ConvoMem table on the same page (verbatim):

| System | Score |
| --- | --- |
| MemPalace | 92.9% |
| Gemini (long context) | 70-82% |
| Block extraction | 57-71% |
| Mem0 (RAG) | 30-45% |

The column header is "Score" - undefined. MemPalace's 92.9% is retrieval-based (per #29, point 5). Mem0's published ConvoMem numbers are end-to-end QA accuracy. Mixing the two under a single "Score" column is the same category error as the LongMemEval table, one level down the page.

3 - "100%" scores still headlined after acknowledgment

Current public-facing text (mempalaceofficial.com/reference/benchmarks.html, verified 2026-04-14):

| Mode | R@5 | LLM Required | Cost/query |
| --- | --- | --- | --- |
| Hybrid v4 + rerank | 100% | Haiku | ~$0.001 |

And in the LoCoMo table on the same page (verbatim, mempalaceofficial.com/reference/benchmarks.html, "LoCoMo (1,986 multi-hop QA pairs)"):

| Mode | R@10 | LLM |
| --- | --- | --- |
| Hybrid v5 + Sonnet rerank (top-50) | 100% | Sonnet |
| bge-large + Haiku rerank (top-15) | 96.3% | Haiku |
| Hybrid v5 (top-10, no rerank) | 88.9% | None |
| Session, no rerank (top-10) | 60.3% | None |

The column header is R@10; the top row's mode is (top-50). The two cannot both be true descriptions of the same evaluation. Either the mode is not top-50, or the metric is not R@10. The non-reranked baseline on the same page - Session, no rerank (top-10) - 60.3% - is the honest retrieval-recall signal for this system on this benchmark.

What #125 established and bensig acknowledged:

From #125 (rohithzr): "The '100% LongMemEval score' involved 'teaching to the test' on three specific questions, with a held-out score of 98.4%. The LoCoMo benchmark reports '100% by construction' when top-k retrieval exceeds corpus size."

From #29, point 2: "The 100% bypasses retrieval. With top-k=50, every conversation (19-32 sessions each) is fully retrieved. The system reduces to 'dump every session into Claude Sonnet, ask Sonnet which one matches.'"

bensig's reply on #125: "100% shouldn't be headlined like that, we've updated the readme a bit more now."

Status 2026-04-14: Both 100% numbers remain in the public-facing benchmark tables, one week after the acknowledgment that they should not be headlined. Three specific problems with the 100% rows are not disclosed anywhere on the page:

  1. The "R@10" header does not match the "(top-50)" mode. The retrieval stage returns 50 candidates before Sonnet reranks them. Per #29 ("Multiple issues with benchmark methodology and scoring"), LoCoMo conversations contain 19-32 sessions each; top-50 exceeds the per-conversation session count in every case, so the retrieval stage returns every session. What the row actually reports is Sonnet's recall at 10 given the entire conversation as input: an LLM-judging task, not a retrieval task (see the sketch after this list).

  2. LoCoMo contains adversarial / unanswerable questions by design. The 1,986-question LoCoMo set includes questions whose answer is not present in the conversation (the adversarial category exists specifically to penalize systems that hallucinate support). Retrieval recall on those questions cannot be 100%. There is no supporting chunk in the corpus to retrieve. A headline "100% R@10" across the full set is either mathematically impossible or does not actually include the adversarial subset (Category 5). Neither the exclusion nor the question-count actually evaluated is disclosed.

  3. The LongMemEval 100% held-out figure (98.4% on 450 untouched questions, per #125, "BEAM 100K benchmark results - first end-to-end answer quality evaluation") is not shown in the comparison table where 100% appears.
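
A minimal sketch of the "by construction" problem from item 1, with item 2 restated as arithmetic - toy session ids; the 19-32 session counts and the recall_any@k definition come from #29, the 1,986 question count from the page itself:

```python
# recall_any@k as described in #29: a hit if any gold session id
# appears among the top-k retrieved candidates.
def recall_any_at_k(retrieved_ids, gold_ids, k):
    return any(g in retrieved_ids[:k] for g in gold_ids)

# LoCoMo conversations contain 19-32 sessions each (per #29). With
# top-k=50, retrieval returns every session, so the gold session is
# always present and "recall" is 100% before the reranker ever runs:
sessions = [f"s{i}" for i in range(32)]          # largest conversation
assert recall_any_at_k(sessions, ["s31"], k=50)  # true for any gold id

# Item 2 as arithmetic: if a of the 1,986 questions are adversarial
# (no supporting session exists to retrieve), full-set retrieval recall
# is capped at (1986 - a) / 1986, which is below 100% for any a > 0.
```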

4 - End-to-end QA results exist and are omitted

What #125 established (BEAM benchmark, end-to-end answer quality, not retrieval):

From #125 (rohithzr): "Raw ChromaDB achieves 49% overall score, while MemPalace-specific modes score lower (43%, 43.6%, 26.2%, 27.9%). The analysis identifies critical gaps: summarization reaches only 35%, event ordering 32%, and contradiction resolution 40%."

Status 2026-04-14: These are the only end-to-end answer-quality numbers published for MemPalace. They are not in the README, the About text, or the mempalaceofficial.com benchmarks page. The positioning of MemPalace as "the highest-scoring AI memory system ever benchmarked" relies exclusively on retrieval-recall numbers while omitting the only QA-accuracy benchmark performed against the system.

The competitor numbers in the comparison table (Mastra 94.87%, Mem0 66.9% on LoCoMo) are QA accuracy. MemPalace's QA accuracy on BEAM raw is 49%. No apples-to-apples comparison appears anywhere in the public-facing claims.

The pattern

One week is a long time for a single-line About text and a README banner to remain unfixed after the responsible maintainer acknowledged the problem.

Verification

Every external number and quotation in this issue is verifiable from the cited source. The Mastra and Mem0 metric definitions are quoted verbatim from their research pages. The MemPalace README and repo About text are quoted from the live GitHub pages as of 2026-04-14. The mempalaceofficial.com homepage content is quoted from the 2026-04-14 archive at https://archive.is/ZXbI3 (the live site was returning errors at the time of filing). The mempalaceofficial.com/reference/benchmarks.html content is quoted from the same-day captured text; a public archive of that page has been requested.
