This project was originally forked from mem0-evaluation at commit 393a4fd5a6cfeb754857a2229726f567a9fadf36.
It contains the code for running benchmarks on the LoCoMo dataset with different memory methods:
- LangMem
- Mem0
- Zep
- Basic RAG
- Naive LLM
- Memobase
- We ran the Memobase experiments ourselves and took the other methods' results from the paper "Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory".
- We mainly report the LLM Judge Score (higher is better).
| Method | Single-Hop(%) | Multi-Hop(%) | Open Domain(%) | Temporal(%) | Overall(%) |
|---|---|---|---|---|---|
| Mem0 | 67.13 | 51.15 | 72.93 | 55.51 | 66.88 |
| Mem0-Graph | 65.71 | 47.19 | 75.71 | 58.13 | 68.44 |
| LangMem | 62.23 | 47.92 | 71.12 | 23.43 | 58.10 |
| Zep | 61.70 | 41.35 | 76.60 | 49.31 | 65.99 |
| OpenAI | 63.79 | 42.92 | 62.29 | 21.71 | 52.90 |
| Memobase(v0.0.32) | 63.83 | 52.08 | 71.82 | 80.37 | 70.91 |
| Memobase(v0.0.37) | 70.92 | 46.88 | 77.17 | 85.05 | 75.78 |
What is the LLM Judge Score?
The LoCoMo benchmark provides long conversations together with a set of questions about them. The LLM Judge Score uses an LLM (e.g. OpenAI gpt-4o) to judge whether the answer generated by a memory method matches the ground truth: the score is 1 if the judge deems them equivalent, else 0.
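For intuition, a judge call can be as simple as the sketch below. This is illustrative only, not the repo's exact implementation (the real judge prompt lives in prompts.py); the prompt text and function name here are assumptions.

```python
# Illustrative sketch of a binary LLM judge (not the repo's exact code; see prompts.py).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_judge_score(question: str, ground_truth: str, predicted: str) -> int:
    """Return 1 if the LLM judges the predicted answer equivalent to the ground truth, else 0."""
    prompt = (
        f"Question: {question}\n"
        f"Ground truth answer: {ground_truth}\n"
        f"Predicted answer: {predicted}\n"
        "Do the two answers convey the same information? Reply with only YES or NO."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return 1 if resp.choices[0].message.content.strip().upper().startswith("YES") else 0
```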
We attached the artifacts of Memobase under fixture/memobase/:
- v0.0.32
  - `fixture/memobase/results_0503_3000.json`: predicted answers from Memobase Memory
  - `fixture/memobase/memobase_eval_0503_3000.json`: LLM Judge results of the predicted answers
- v0.0.37
  - `fixture/memobase/results_0710_3000.json`: predicted answers from Memobase Memory
  - `fixture/memobase/memobase_eval_0710_3000.json`: LLM Judge results of the predicted answers
To generate the latest scores, run:

python generate_scores.py --input_path="fixture/memobase/memobase_eval_0710_3000.json"

Output:
Mean Scores Per Category:
bleu_score f1_score llm_score count type
category
1 0.3516 0.4629 0.7092 282 single_hop
2 0.4758 0.6423 0.8505 321 temporal
3 0.1758 0.2293 0.4688 96 multi_hop
4 0.4089 0.5155 0.7717 841 open_domain
Overall Mean Scores:
bleu_score 0.3978
f1_score 0.5145
llm_score 0.7578
dtype: float64
❕ We updated the results from the Zep team (Zep*). See this issue for detailed reports and artifacts.
| Method | Single-Hop(%) | Multi-Hop(%) | Open Domain(%) | Temporal(%) | Overall(%) |
|---|---|---|---|---|---|
| Zep* | 74.11 | 66.04 | 67.71 | 79.79 | 75.14 |
Download the locomo10.json file and place it under dataset/
Create a `.env` file with your API keys and configuration. You must set the below envs:
# OpenAI API key for GPT models and embeddings
OPENAI_API_KEY="your-openai-api-key"
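If you want a quick sanity check that the variables are picked up, a minimal snippet is below (assuming `python-dotenv` is installed; whether the repo's scripts load `.env` this way is an assumption):

```python
# Quick check that the .env values are visible to Python (assumes python-dotenv is installed).
import os
from dotenv import load_dotenv

load_dotenv()  # loads key=value pairs from ./.env into the process environment
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is missing from .env"
```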
Below are the detailed requirements for each method.

Deps

pip install memobase

Env

You can get a free API key from Memobase Cloud, or deploy an instance locally.

MEMOBASE_API_KEY=XXXXX
MEMOBASE_PROJECT_URL=http://localhost:8019 # OPTIONAL

Command
# memorize the data
make run-memobase-add
# answer the benchmark
make run-memobase-search
# evaluate the results
python evals.py --input_file results.json --output_file evals.json
# print the final scores
python generate_scores.py --input_path="evals.json"
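For context, `run-memobase-add` inserts the LoCoMo conversations into Memobase and `run-memobase-search` queries the stored memory when answering. A rough sketch of that flow with the `memobase` SDK follows; the client and method names are taken from the Memobase quickstart as we understand it and may differ across versions, so treat them as assumptions.

```python
# Rough sketch of the Memobase add/search flow (method names per the Memobase quickstart;
# verify against the SDK version you install).
from memobase import MemoBaseClient, ChatBlob

client = MemoBaseClient(
    project_url="http://localhost:8019",  # MEMOBASE_PROJECT_URL
    api_key="XXXXX",                      # MEMOBASE_API_KEY
)

uid = client.add_user()
user = client.get_user(uid)

# "add" phase: insert conversation turns as chat blobs
user.insert(ChatBlob(messages=[
    {"role": "user", "content": "I moved to Berlin last spring."},
    {"role": "assistant", "content": "Nice! How are you liking Berlin so far?"},
]))
user.flush()  # make sure memory extraction finishes before querying

# "search" phase: read the user's memory to ground the answer prompt
print(user.profile())
```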
Deps

pip install mem0

Env
# Mem0 API keys (for Mem0 and Mem0+ techniques)
MEM0_API_KEY="your-mem0-api-key"
MEM0_PROJECT_ID="your-mem0-project-id"
MEM0_ORGANIZATION_ID="your-mem0-organization-id"

Command
Same as the Memobase commands, but replace `memobase` with `mem0`. See [all commands](#Memory Techniques).
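For reference, the hosted Mem0 add/search flow looks roughly like the sketch below (based on the public `mem0` client; the exact message format and parameters are assumptions, so check the SDK version you install):

```python
# Rough sketch of the Mem0 platform add/search flow (verify against the installed SDK).
from mem0 import MemoryClient

client = MemoryClient(api_key="your-mem0-api-key")

# "add" phase: store conversation turns under a user id
client.add(
    [
        {"role": "user", "content": "I moved to Berlin last spring."},
        {"role": "assistant", "content": "Nice! How are you liking Berlin so far?"},
    ],
    user_id="locomo-speaker-1",
)

# "search" phase: retrieve memories relevant to a benchmark question
hits = client.search("Where does the speaker live?", user_id="locomo-speaker-1")
print(hits)
```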
Deps

pip install zep_cloud

Env

ZEP_API_KEY="api-key-from-zep"

Command

Same as the Memobase commands, but replace `memobase` with `zep`. See [all commands](#Memory Techniques).
Deps

pip install langgraph langmem

Env

EMBEDDING_MODEL="text-embedding-3-small" # or your preferred embedding model

Command

LangMem runs in a single step: `make run-langmem`. Then evaluate and score the results with the same `evals.py` and `generate_scores.py` commands as Memobase. See [all commands](#Memory Techniques).
The remaining methods don't require extra deps/envs.
# Run Memobase experiments
make run-memobase-add # Add memories using Memobase
make run-memobase-search # Search memories using Memobase
# Run Mem0 experiments
make run-mem0-add # Add memories using Mem0
make run-mem0-search # Search memories using Mem0
# Run Mem0+ experiments (with graph-based search)
make run-mem0-plus-add # Add memories using Mem0+
make run-mem0-plus-search # Search memories using Mem0+
# Run RAG experiments
make run-rag # Run RAG with chunk size 500
make run-full-context # Run RAG with full context
# Run LangMem experiments
make run-langmem # Run LangMem
# Run Zep experiments
make run-zep-add # Add memories using Zep
make run-zep-search # Search memories using Zep
# Run OpenAI experiments
make run-openai # Run OpenAI experiments

To evaluate results, run:
python evals.py --input_file [path_to_results] --output_file [output_path]

This script:
- Processes each question-answer pair
- Calculates BLEU and F1 scores automatically (a rough F1 sketch follows this list)
- Uses an LLM judge to evaluate answer correctness
- Saves the combined results to the output file
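For intuition, the F1 here is the usual token-overlap F1 between the predicted and ground-truth answers; a minimal sketch is below (the actual implementation lives under `metrics/` and may tokenize or normalize text differently):

```python
# Minimal token-overlap F1 sketch (the repo's metrics/ code may tokenize/normalize differently).
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    pred_tokens = prediction.lower().split()
    gt_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gt_tokens)  # per-token overlap counts
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)
```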
Generate final scores with:
python generate_scores.py

This script:
- Loads the evaluation metrics data
- Calculates mean scores for each category (BLEU, F1, LLM), as sketched after this list
- Reports the number of questions per category
- Calculates overall mean scores across all categories
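Conceptually, this is a `pandas` group-by over the per-question evaluation records. A rough sketch, assuming `evals.json` is a flat list of records with `category`, `bleu_score`, `f1_score`, and `llm_score` fields (field names inferred from the output format, not guaranteed):

```python
# Rough sketch of the per-category aggregation (field names inferred from the example output).
import json
import pandas as pd

with open("evals.json") as f:
    records = json.load(f)  # assumed: a flat list of per-question metric dicts

df = pd.DataFrame(records)
per_category = df.groupby("category")[["bleu_score", "f1_score", "llm_score"]].mean()
per_category["count"] = df.groupby("category").size()

print("Mean Scores Per Category:")
print(per_category)
print("Overall Mean Scores:")
print(df[["bleu_score", "f1_score", "llm_score"]].mean())
```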
Example output:
Mean Scores Per Category:
bleu_score f1_score llm_score count
category
1 0.xxxx 0.xxxx 0.xxxx xx
2 0.xxxx 0.xxxx 0.xxxx xx
3 0.xxxx 0.xxxx 0.xxxx xx
Overall Mean Scores:
bleu_score 0.xxxx
f1_score 0.xxxx
llm_score 0.xxxx
.
├── src/ # Source code for different memory techniques
│ ├── memobase_client/ # Implementation of the Memobase
│ ├── memzero/ # Implementation of the Mem0 technique
│ ├── openai/ # Implementation of the OpenAI memory
│ ├── zep/ # Implementation of the Zep memory
│ ├── rag.py # Implementation of the RAG technique
│ └── langmem.py # Implementation of the Language-based memory
├── metrics/ # Code for evaluation metrics
├── results/ # Results of experiments
├── dataset/ # Dataset files
├── evals.py # Evaluation script
├── run_experiments.py # Script to run experiments
├── generate_scores.py # Script to generate scores from results
└── prompts.py # Prompts used for the models