Summary
Four prior issues (#27, #29, #39, #125) documented specific methodology and attribution problems in MemPalace's public-facing benchmark claims. These issues were already acknowledged by maintainers. One week later (2026-04-14), the headline claims on the repo About text, the README.md, mempalaceofficial.com, and mempalaceofficial.com/reference/benchmarks.html still reproduce the same framing, including a comparison table that places retrieval-recall numbers and binary-QA-accuracy numbers under a single column labeled "LongMemEval R@5."
This issue enumerates each surviving claim alongside the specific prior issue that documented the problem and the maintainer acknowledgement. Verbatim quotes throughout; every external number verifiable from the linked sources.
The central methodology point
R@5 (recall at 5) is a retrieval metric. QA accuracy is an end-to-end answer-correctness metric. They are not in any way comparable. A system can have 100% R@5 and 40% QA accuracy. Presenting R@5 numbers next to QA accuracy numbers under a single column header is not a rounding error; it is a category error that inflates MemPalace's relative standing against systems that published different metrics. This exact flaw was flagged one week ago.
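The gap is easy to see with toy numbers. A minimal sketch (the data and helper functions below are hypothetical illustrations, not MemPalace's harness):

```python
# Toy illustration: recall@5 (retrieval) and QA accuracy (end-to-end)
# are computed over different things and can diverge arbitrarily.

def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """1.0 if any relevant ID appears in the top-k retrieved IDs."""
    return 1.0 if set(retrieved_ids[:k]) & set(relevant_ids) else 0.0

def qa_accuracy(predicted, gold):
    """1.0 if the generated answer matches the gold answer (binary)."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0

# Hypothetical questions: retrieval finds a relevant session every time...
questions = [
    {"retrieved": ["s3", "s1", "s9"], "relevant": ["s1"],
     "predicted": "Paris", "gold": "Paris"},
    {"retrieved": ["s2", "s7", "s4"], "relevant": ["s7"],
     "predicted": "1992", "gold": "1994"},   # ...but the answer is wrong.
]

r5 = sum(recall_at_k(q["retrieved"], q["relevant"]) for q in questions) / len(questions)
acc = sum(qa_accuracy(q["predicted"], q["gold"]) for q in questions) / len(questions)
print(r5, acc)  # 1.0 0.5 -- perfect retrieval recall, 50% QA accuracy
```

Two questions, perfect retrieval on both, correct answer on one: 100% R@5 sits next to 50% QA accuracy on identical data, which is why the two numbers cannot share a column.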
1 - "Highest-scoring AI memory system ever benchmarked"
Current public-facing text (verified 2026-04-14):
Repo About: "The highest-scoring AI memory system ever benchmarked. And it's free."
README.md headline banner: "The highest-scoring AI memory system ever benchmarked. And it's free." with "96.6% LongMemEval R@5 raw mode, zero API calls."
mempalaceofficial.com: "96.6% recall on LongMemEval in raw mode."
What #27 established and maintainers acknowledged in #39:
From #27 (lhl): "96.6% LongMemEval R@5 (headline, positioned as MemPalace's score) - Real score, but measured in 'raw mode' - uncompressed verbatim text stored in ChromaDB, standard nearest-neighbor retrieval. The palace structure (wings/rooms/halls) is not involved. This measures ChromaDB's default embedding model performance, not MemPalace."
From #39 (gizmax, independent reproduction on M2 Ultra): "the --mode raw runner builds a fresh chromadb.EphemeralClient() per question and never touches the palace" code paths.
Maintainer's reply on #39: "the palace structure is for navigation/organization, not retrieval boost. AAAK is an experimental compression layer, not the storage default."
Status 2026-04-14: The About text and README headline continue to attribute the 96.6% to MemPalace as a memory system. Per the maintainer's own acknowledgment, the number measures ChromaDB's default sentence-transformer model with no MemPalace-specific retrieval code path involved. The correct attribution is "ChromaDB default embeddings," which is a third-party off-the-shelf component, not a MemPalace result.
2 - Comparison table mixes R@5 with QA accuracy
Current public-facing text (mempalaceofficial.com/reference/benchmarks.html, verified 2026-04-14):
Comparison vs Published Systems

| System | LongMemEval R@5 | API Required | Cost |
| --- | --- | --- | --- |
| MemPalace (hybrid) | 100% | Optional | Free |
| Supermemory ASMR | ~99% | Yes | - |
| MemPalace (raw) | 96.6% | None | Free |
| Mastra | 94.87% | Yes | API costs |
| Hindsight | 91.4% | Yes | API costs |
| Mem0 | ~85% | Yes | $19-249/mo |
What each cited source actually publishes:

| Row | MemPalace table claim | Source's actual published metric |
| --- | --- | --- |
| Mastra 94.87% | LongMemEval R@5 | "Overall Accuracy - Each question is evaluated as correct/incorrect (binary)" with gpt-5-mini |
Mastra's 94.87% is binary QA accuracy, not R@5. Mem0 does not publish a LongMemEval number at all; the ~85% in the table has no cited source and does not match Mem0's actual published figure of 66.9% on a different benchmark (LoCoMo, not LongMemEval).
What #29 already established:
From #29, point 4: "LongMemEval runner measures retrieval, not QA. The benchmark never generates answers or invokes judges - it only checks if labeled session IDs appear in top-5 results (recall_any@5), fundamentally different from the published leaderboard's end-to-end QA task."
From #29, point 5: "ConvoMem metric mismatch. The 92.9% is retrieval-based; Mem0's published numbers are end-to-end QA accuracy. Comparing different metrics on the same dataset."
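Per #29's description, the runner's recall_any@5 scoring reduces to a set-membership check. A minimal sketch of the scoring rule as quoted (illustrative, not the actual runner code):

```python
def recall_any_at_5(top5_session_ids, labeled_session_ids):
    """Per the description in #29: a question counts as a hit if ANY
    labeled evidence session ID appears in the top-5 retrieved session
    IDs. No answer is generated and no judge is invoked."""
    return int(bool(set(top5_session_ids) & set(labeled_session_ids)))

# A hit says nothing about whether a correct answer could be produced
# from the retrieved sessions:
print(recall_any_at_5(["s12", "s03", "s44", "s07", "s91"], ["s44"]))  # 1
print(recall_any_at_5(["s12", "s03", "s44", "s07", "s91"], ["s88"]))  # 0
```

A leaderboard QA-accuracy number, by contrast, is scored only after an answer is generated and judged, so the two figures measure different stages of the pipeline.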
Status 2026-04-14: #29 was closed as completed. The same flaw identified there (MemPalace retrieval-recall presented as comparable to competitor QA accuracy) is reproduced on the LongMemEval comparison table on mempalaceofficial.com, and reappears in the ConvoMem table on the same page (verbatim):
| System | Score |
| --- | --- |
| MemPalace | 92.9% |
| Gemini (long context) | 70-82% |
| Block extraction | 57-71% |
| Mem0 (RAG) | 30-45% |
The column header is "Score" - undefined. MemPalace's 92.9% is retrieval-based (per #29, point 5). Mem0's published ConvoMem numbers are end-to-end QA accuracy. Mixing the two under a single "Score" column is the same category error as the LongMemEval table, one level down the page.
3 - "100%" scores still headlined after acknowledgment
Current public-facing text (mempalaceofficial.com/reference/benchmarks.html, verified 2026-04-14):
| Mode | R@5 | LLM Required | Cost/query |
| --- | --- | --- | --- |
| Hybrid v4 + rerank | 100% | Haiku | ~$0.001 |
And in the LoCoMo table on the same page (verbatim, mempalaceofficial.com/reference/benchmarks.html, "LoCoMo (1,986 multi-hop QA pairs)"):
| Mode | R@10 | LLM |
| --- | --- | --- |
| Hybrid v5 + Sonnet rerank (top-50) | 100% | Sonnet |
| bge-large + Haiku rerank (top-15) | 96.3% | Haiku |
| Hybrid v5 (top-10, no rerank) | 88.9% | None |
| Session, no rerank (top-10) | 60.3% | None |
The column header is R@10; the top row's mode is (top-50). The two cannot both be true descriptions of the same evaluation. Either the mode is not top-50, or the metric is not R@10. The non-reranked baseline on the same page - Session, no rerank (top-10) - 60.3% - is the honest retrieval-recall signal for this system on this benchmark.
What #125 established and bensig acknowledged:
From #125 (rohithzr): "The '100% LongMemEval score' involved 'teaching to the test' on three specific questions, with a held-out score of 98.4%. The LoCoMo benchmark reports '100% by construction' when top-k retrieval exceeds corpus size."
From #29, point 2: "The 100% bypasses retrieval. With top-k=50, every conversation (19-32 sessions each) is fully retrieved. The system reduces to 'dump every session into Claude Sonnet, ask Sonnet which one matches.'"
bensig's reply on #125: "100% shouldn't be headlined like that, we've updated the readme a bit more now."
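The "by construction" mechanism in the two quotes above is plain arithmetic: once top-k meets or exceeds a conversation's session count, the retrieval stage returns every session and any session-level recall is guaranteed. A quick sanity check (the session counts come from #29; everything else is illustrative):

```python
def retrieval_saturated(top_k, num_sessions):
    """True when top-k retrieval must return every session in the
    conversation, making any session-level recall metric 100% by
    construction rather than by merit."""
    return top_k >= num_sessions

# LoCoMo conversations contain 19-32 sessions each (per #29),
# so a top-50 retrieval stage saturates on every conversation:
assert all(retrieval_saturated(50, n) for n in range(19, 33))

# A top-10 retrieval stage, by contrast, never trivially saturates
# on those conversation sizes:
assert not any(retrieval_saturated(10, n) for n in range(19, 33))
```

Once the retrieval stage is saturated, whatever score the pipeline reports is measuring the downstream reranking LLM, not retrieval.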
Status 2026-04-14: Both 100% numbers remain in the public-facing benchmark tables, one week after the acknowledgment that they shouldn't be headlined. Three specific problems with the 100% rows are not disclosed anywhere on the page:
The "R@10" header does not match the "(top-50)" mode. The retrieval stage returns 50 candidates before Sonnet reranks them. Per #29 (Multiple issues with benchmark methodology and scoring), LoCoMo conversations contain 19-32 sessions each - top-50 exceeds the per-conversation session count in every case, so the retrieval stage returns every session. What the row actually reports is Sonnet's recall at 10 given the entire conversation as input: an LLM-judging task, not a retrieval task.
LoCoMo contains adversarial / unanswerable questions by design. The 1,986-question LoCoMo set includes questions whose answer is not present in the conversation (the adversarial category exists specifically to penalize systems that hallucinate support). Retrieval recall on those questions cannot be 100%. There is no supporting chunk in the corpus to retrieve. A headline "100% R@10" across the full set is either mathematically impossible or does not actually include the adversarial subset (Category 5). Neither the exclusion nor the question-count actually evaluated is disclosed.
The LongMemEval 100% held-out figure (98.4% on 450 untouched questions, per #125, "BEAM 100K benchmark results - first end-to-end answer quality evaluation") is not shown in the comparison table where 100% appears.
4 - End-to-end QA results exist and are omitted
What #125 established (BEAM benchmark, end-to-end answer quality, not retrieval):
From #125 (rohithzr): "Raw ChromaDB achieves 49% overall score, while MemPalace-specific modes score lower (43%, 43.6%, 26.2%, 27.9%). The analysis identifies critical gaps: summarization reaches only 35%, event ordering 32%, and contradiction resolution 40%."
Status 2026-04-14: These are the only end-to-end answer-quality numbers published for MemPalace. They are not in the README, the About text, or the mempalaceofficial.com benchmarks page. The positioning of MemPalace as "the highest-scoring AI memory system ever benchmarked" relies exclusively on retrieval-recall numbers while omitting the only QA-accuracy benchmark performed against the system.
The competitor numbers in the comparison table (Mastra 94.87%, Mem0 66.9% on LoCoMo) are QA accuracy. The MemPalace QA accuracy on BEAM raw is 49%. Apples-to-apples comparison does not appear anywhere in the public-facing claims.
The pattern
2026-04-07 evening: README gets a partial correction ("A Note from Milla & Ben - April 7, 2026") addressing AAAK and the palace-structure-boost claim. The headline banner and About text are not changed.
2026-04-07 to 2026-04-14: mempalaceofficial.com publishes the comparison tables documented above, extending rather than correcting the framing.
2026-04-09: @rohitg00, maintainer of the competing agentmemory project, filed Discussion #747 independently documenting most of the issues above. Five days later: zero maintainer reply, zero comments.
2026-04-14: About text, README headline, and mempalaceofficial.com comparison tables still contain the claims documented above as misleading.
One week is a long time for a single-line About text and a README banner to remain unfixed after the responsible maintainer acknowledged the problem.
Verification
Every external number and quotation in this issue is verifiable from the cited source. The Mastra and Mem0 metric definitions are quoted verbatim from their research pages. The MemPalace README and repo About text are quoted from the live GitHub pages as of 2026-04-14. The mempalaceofficial.com homepage content is quoted from the 2026-04-14 archive at https://archive.is/ZXbI3 (the live site was returning errors at the time of filing). The mempalaceofficial.com/reference/benchmarks.html content is quoted from the same-day captured text; a public archive of that page has been requested.
TL;DR: MemPalace is not "The highest-scoring AI memory system ever benchmarked." MemPalace's own unmanipulated retrieval baseline on LoCoMo without rerank is 60.3% R@10 (verbatim from mempalaceofficial.com/reference/benchmarks.html).