This project was originally forked from mem0-evaluation at commit 393a4fd5a6cfeb754857a2229726f567a9fadf36.
It contains the code for running benchmarks on the LoCoMo dataset with different memory methods:
- LangMem
- Mem0
- Zep
- Basic RAG
- Naive LLM
- Memobase
- We ran the Memobase experiments ourselves and took the other methods' results from the paper "Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory".
- We mainly report the LLM Judge Score (higher is better).
| Method | Single-Hop(%) | Multi-Hop(%) | Open Domain(%) | Temporal(%) | Overall(%) |
|---|---|---|---|---|---|
| Mem0 | 67.13 | 51.15 | 72.93 | 55.51 | 66.88 |
| Mem0-Graph | 65.71 | 47.19 | 75.71 | 58.13 | 68.44 |
| LangMem | 62.23 | 47.92 | 71.12 | 23.43 | 58.10 |
| Zep | 61.70 | 41.35 | 76.60 | 49.31 | 65.99 |
| OpenAI | 63.79 | 42.92 | 62.29 | 21.71 | 52.90 |
| Memobase(v0.0.32) | 63.83 | 52.08 | 71.82 | 80.37 | 70.91 |
| Memobase(v0.0.37) | 70.92 | 46.88 | 77.17 | 85.05 | 75.78 |
What is the LLM Judge Score?
The LoCoMo benchmark provides long conversations together with a set of questions about them. The LLM Judge Score uses an LLM (e.g. OpenAI gpt-4o) to judge whether the answer generated by a memory method matches the ground truth: the score is 1 if the judge deems them equivalent, else 0.
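For intuition, a judge call can be as simple as the sketch below. This is illustrative only, not the repo's exact implementation (the real judge prompt lives in prompts.py); the prompt text and function name here are assumptions.

```python
# Illustrative sketch of a binary LLM judge (not the repo's exact code; see prompts.py).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_judge_score(question: str, ground_truth: str, predicted: str) -> int:
    """Return 1 if the LLM judges the predicted answer equivalent to the ground truth, else 0."""
    prompt = (
        f"Question: {question}\n"
        f"Ground truth answer: {ground_truth}\n"
        f"Predicted answer: {predicted}\n"
        "Do the two answers convey the same information? Reply with only YES or NO."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return 1 if resp.choices[0].message.content.strip().upper().startswith("YES") else 0
```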
We attached the artifacts of Memobase under fixture/memobase/:
- v0.0.32
  - `fixture/memobase/results_0503_3000.json`: predicted answers from Memobase Memory
  - `fixture/memobase/memobase_eval_0503_3000.json`: LLM Judge results of the predicted answers
- v0.0.37
  - `fixture/memobase/results_0710_3000.json`: predicted answers from Memobase Memory
  - `fixture/memobase/memobase_eval_0710_3000.json`: LLM Judge results of the predicted answers
To generate the latest scores, run:

python generate_scores.py --input_path="fixture/memobase/memobase_eval_0710_3000.json"

Output:
Mean Scores Per Category:
bleu_score f1_score llm_score count type
category
1 0.3516 0.4629 0.7092 282 single_hop
2 0.4758 0.6423 0.8505 321 temporal
3 0.1758 0.2293 0.4688 96 multi_hop
4 0.4089 0.5155 0.7717 841 open_domain
Overall Mean Scores:
bleu_score 0.3978
f1_score 0.5145
llm_score 0.7578
dtype: float64
❕ We updated the results from the Zep team (Zep*). See this issue for detailed reports and artifacts.
| Method | Single-Hop(%) | Multi-Hop(%) | Open Domain(%) | Temporal(%) | Overall(%) |
|---|---|---|---|---|---|
| Zep* | 74.11 | 66.04 | 67.71 | 79.79 | 75.14 |
Download the locomo10.json file and place it under dataset/
Create a `.env` file with your API keys and configuration. You must set the below envs:
# OpenAI API key for GPT models and embeddings
OPENAI_API_KEY="your-openai-api-key"
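If you want a quick sanity check that the variables are picked up, a minimal snippet is below (assuming `python-dotenv` is installed; whether the repo's scripts load `.env` this way is an assumption):

```python
# Quick check that the .env values are visible to Python (assumes python-dotenv is installed).
import os
from dotenv import load_dotenv

load_dotenv()  # loads key=value pairs from ./.env into the process environment
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is missing from .env"
```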
Below are the detailed requirements for each method.

Deps

pip install memobase

Env

You can get a free API key from Memobase Cloud, or deploy an instance locally.

MEMOBASE_API_KEY=XXXXX
MEMOBASE_PROJECT_URL=http://localhost:8019 # OPTIONAL

Command
# memorize the data
make run-memobase-add
# answer the benchmark
make run-memobase-search
# evaluate the results
python evals.py --input_file results.json --output_file evals.json
# print the final scores
python generate_scores.py --input_path="evals.json"
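For context, `run-memobase-add` inserts the LoCoMo conversations into Memobase and `run-memobase-search` queries the stored memory when answering. A rough sketch of that flow with the `memobase` SDK follows; the client and method names are taken from the Memobase quickstart as we understand it and may differ across versions, so treat them as assumptions.

```python
# Rough sketch of the Memobase add/search flow (method names per the Memobase quickstart;
# verify against the SDK version you install).
from memobase import MemoBaseClient, ChatBlob

client = MemoBaseClient(
    project_url="http://localhost:8019",  # MEMOBASE_PROJECT_URL
    api_key="XXXXX",                      # MEMOBASE_API_KEY
)

uid = client.add_user()
user = client.get_user(uid)

# "add" phase: insert conversation turns as chat blobs
user.insert(ChatBlob(messages=[
    {"role": "user", "content": "I moved to Berlin last spring."},
    {"role": "assistant", "content": "Nice! How are you liking Berlin so far?"},
]))
user.flush()  # make sure memory extraction finishes before querying

# "search" phase: read the user's memory to ground the answer prompt
print(user.profile())
```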
Deps

pip install mem0

Env
# Mem0 API keys (for Mem0 and Mem0+ techniques)
MEM0_API_KEY="your-mem0-api-key"
MEM0_PROJECT_ID="your-mem0-project-id"
MEM0_ORGANIZATION_ID="your-mem0-organization-id"

Command
Same as the Memobase commands, but replace `memobase` with `mem0`. See [all commands](#Memory Techniques).
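For reference, the hosted Mem0 add/search flow looks roughly like the sketch below (based on the public `mem0` client; the exact message format and parameters are assumptions, so check the SDK version you install):

```python
# Rough sketch of the Mem0 platform add/search flow (verify against the installed SDK).
from mem0 import MemoryClient

client = MemoryClient(api_key="your-mem0-api-key")

# "add" phase: store conversation turns under a user id
client.add(
    [
        {"role": "user", "content": "I moved to Berlin last spring."},
        {"role": "assistant", "content": "Nice! How are you liking Berlin so far?"},
    ],
    user_id="locomo-speaker-1",
)

# "search" phase: retrieve memories relevant to a benchmark question
hits = client.search("Where does the speaker live?", user_id="locomo-speaker-1")
print(hits)
```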
Deps

pip install zep_cloud

Env

ZEP_API_KEY="api-key-from-zep"

Command

Same as the Memobase commands, but replace `memobase` with `zep`. See [all commands](#Memory Techniques).
Deps

pip install langgraph langmem

Env

EMBEDDING_MODEL="text-embedding-3-small" # or your preferred embedding model

Command

LangMem runs in a single step: `make run-langmem`. Then evaluate and score the results with the same `evals.py` and `generate_scores.py` commands as Memobase. See [all commands](#Memory Techniques).
The remaining methods don't require extra deps/envs.
# Run Memobase experiments
make run-memobase-add # Add memories using Memobase
make run-memobase-search # Search memories using Memobase
# Run Mem0 experiments
make run-mem0-add # Add memories using Mem0
make run-mem0-search # Search memories using Mem0
# Run Mem0+ experiments (with graph-based search)
make run-mem0-plus-add # Add memories using Mem0+
make run-mem0-plus-search # Search memories using Mem0+
# Run RAG experiments
make run-rag # Run RAG with chunk size 500
make run-full-context # Run RAG with full context
# Run LangMem experiments
make run-langmem # Run LangMem
# Run Zep experiments
make run-zep-add # Add memories using Zep
make run-zep-search # Search memories using Zep
# Run OpenAI experiments
make run-openai # Run OpenAI experiments

To evaluate results, run:
python evals.py --input_file [path_to_results] --output_file [output_path]

This script:
- Processes each question-answer pair
- Calculates BLEU and F1 scores automatically (a rough F1 sketch follows this list)
- Uses an LLM judge to evaluate answer correctness
- Saves the combined results to the output file
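For intuition, the F1 here is the usual token-overlap F1 between the predicted and ground-truth answers; a minimal sketch is below (the actual implementation lives under `metrics/` and may tokenize or normalize text differently):

```python
# Minimal token-overlap F1 sketch (the repo's metrics/ code may tokenize/normalize differently).
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    pred_tokens = prediction.lower().split()
    gt_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gt_tokens)  # per-token overlap counts
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)
```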
Generate final scores with:
python generate_scores.py

This script:
- Loads the evaluation metrics data
- Calculates mean scores for each category (BLEU, F1, LLM), as sketched after this list
- Reports the number of questions per category
- Calculates overall mean scores across all categories
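Conceptually, this is a `pandas` group-by over the per-question evaluation records. A rough sketch, assuming `evals.json` is a flat list of records with `category`, `bleu_score`, `f1_score`, and `llm_score` fields (field names inferred from the output format, not guaranteed):

```python
# Rough sketch of the per-category aggregation (field names inferred from the example output).
import json
import pandas as pd

with open("evals.json") as f:
    records = json.load(f)  # assumed: a flat list of per-question metric dicts

df = pd.DataFrame(records)
per_category = df.groupby("category")[["bleu_score", "f1_score", "llm_score"]].mean()
per_category["count"] = df.groupby("category").size()

print("Mean Scores Per Category:")
print(per_category)
print("Overall Mean Scores:")
print(df[["bleu_score", "f1_score", "llm_score"]].mean())
```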
Example output:
Mean Scores Per Category:
bleu_score f1_score llm_score count
category
1 0.xxxx 0.xxxx 0.xxxx xx
2 0.xxxx 0.xxxx 0.xxxx xx
3 0.xxxx 0.xxxx 0.xxxx xx
Overall Mean Scores:
bleu_score 0.xxxx
f1_score 0.xxxx
llm_score 0.xxxx
.
├── src/ # Source code for different memory techniques
│ ├── memobase_client/ # Implementation of the Memobase
│ ├── memzero/ # Implementation of the Mem0 technique
│ ├── openai/ # Implementation of the OpenAI memory
│ ├── zep/ # Implementation of the Zep memory
│ ├── rag.py # Implementation of the RAG technique
│ └── langmem.py # Implementation of the Language-based memory
├── metrics/ # Code for evaluation metrics
├── results/ # Results of experiments
├── dataset/ # Dataset files
├── evals.py # Evaluation script
├── run_experiments.py # Script to run experiments
├── generate_scores.py # Script to generate scores from results
└── prompts.py # Prompts used for the models