
Locomo Benchmark for Various Memory Backends

This project was originally forked from mem0-evaluation at commit 393a4fd5a6cfeb754857a2229726f567a9fadf36.

This project contains the code for running the Locomo benchmark with different memory methods:

  • langmem
  • mem0
  • zep
  • basic RAG
  • naive LLM
  • Memobase

Results

| Method | Single-Hop (%) | Multi-Hop (%) | Open Domain (%) | Temporal (%) | Overall (%) |
|---|---|---|---|---|---|
| Mem0 | 67.13 | 51.15 | 72.93 | 55.51 | 66.88 |
| Mem0-Graph | 65.71 | 47.19 | 75.71 | 58.13 | 68.44 |
| LangMem | 62.23 | 47.92 | 71.12 | 23.43 | 58.10 |
| Zep | 61.70 | 41.35 | 76.60 | 49.31 | 65.99 |
| OpenAI | 63.79 | 42.92 | 62.29 | 21.71 | 52.90 |
| Memobase (v0.0.32) | 63.83 | 52.08 | 71.82 | 80.37 | 70.91 |
| Memobase (v0.0.37) | 70.92 | 46.88 | 77.17 | 85.05 | 75.78 |

What is LLM Judge Score?

The Locomo benchmark provides long conversations and a set of questions about them. The LLM Judge Score uses an LLM (e.g. OpenAI gpt-4o) to judge whether the answer produced by a memory method matches the ground truth: the score is 1 if it does, and 0 otherwise.
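
For illustration, below is a minimal sketch of such a judge using the OpenAI Python SDK. The actual prompt and parsing used in this repo live in evals.py; the prompt wording here is only a placeholder.

```python
# Minimal sketch of an LLM judge; the real judging prompt and parsing are in evals.py.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_judge(question: str, ground_truth: str, predicted: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\n"
                f"Ground truth answer: {ground_truth}\n"
                f"Predicted answer: {predicted}\n"
                "Does the predicted answer convey the same fact as the ground truth? "
                "Reply with exactly CORRECT or WRONG."
            ),
        }],
    )
    return 1 if "CORRECT" in resp.choices[0].message.content.upper() else 0
```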

We attach the Memobase artifacts under fixture/memobase/:

  • v0.0.32

    • fixture/memobase/results_0503_3000.json: predicted answers from Memobase Memory
    • fixture/memobase/memobase_eval_0503_3000.json: LLM Judge results of predicted answers
  • v0.0.37

    • fixture/memobase/results_0710_3000.json: predicted answers from Memobase Memory
    • fixture/memobase/memobase_eval_0710_3000.json: LLM Judge results of predicted answers

To generate the latest scores, run:

python generate_scores.py --input_path="fixture/memobase/memobase_eval_0710_3000.json"

Output:

Mean Scores Per Category:
          bleu_score  f1_score  llm_score  count         type
category
1             0.3516    0.4629     0.7092    282   single_hop
2             0.4758    0.6423     0.8505    321     temporal
3             0.1758    0.2293     0.4688     96    multi_hop
4             0.4089    0.5155     0.7717    841  open_domain

Overall Mean Scores:
bleu_score    0.3978
f1_score      0.5145
llm_score     0.7578
dtype: float64

❕ We have updated the results from the Zep team (Zep*). See this issue for detailed reports and artifacts.

| Method | Single-Hop (%) | Multi-Hop (%) | Open Domain (%) | Temporal (%) | Overall (%) |
|---|---|---|---|---|---|
| Zep* | 74.11 | 66.04 | 67.71 | 79.79 | 75.14 |

🔍 Dataset

Download the locomo10.json file and place it under dataset/
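
To confirm the file is in place and parses correctly, an optional quick check (the JSON structure itself is not assumed here):

```python
# Sanity check: make sure dataset/locomo10.json exists and parses as JSON.
import json

with open("dataset/locomo10.json") as f:
    data = json.load(f)
print(f"Loaded {len(data)} top-level entries from dataset/locomo10.json")
```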

🚀 Getting Started

Prerequisites

Create a .env file with your API keys and configuration. At minimum, you must set the following environment variables:

# OpenAI API key for GPT models and embeddings
OPENAI_API_KEY="your-openai-api-key"
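
To double-check that the variables are picked up, an optional snippet (this assumes python-dotenv is installed; the scripts themselves may load the .env differently):

```python
# Quick check that the .env file is readable; assumes python-dotenv is available.
import os
from dotenv import load_dotenv

load_dotenv()
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is missing from .env"
print("OPENAI_API_KEY loaded")
```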

Below are the detailed requirements for each method.

Run Memobase

Deps

pip install memobase

Env

You can get a free API key from Memobase Cloud, or deploy an instance locally:

MEMOBASE_API_KEY=XXXXX
MEMOBASE_PROJECT_URL=http://localhost:8019 # OPTIONAL
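
To verify connectivity before running the benchmark, a minimal check based on the Memobase Python SDK; treat the exact client signature as an assumption if your SDK version differs:

```python
# Connectivity check; assumes the memobase SDK exposes MemoBaseClient and ping().
import os
from memobase import MemoBaseClient

client = MemoBaseClient(
    project_url=os.getenv("MEMOBASE_PROJECT_URL", "http://localhost:8019"),
    api_key=os.environ["MEMOBASE_API_KEY"],
)
assert client.ping(), "Cannot reach the Memobase server"
```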

Command

# memorize the data
make run-memobase-add 
# answer the benchmark
make run-memobase-search 
# evaluate the results
python evals.py --input_file results.json --output_file evals.json
# print the final scores
python generate_scores.py --input_path="evals.json"

Run Mem0

Deps

pip install mem0ai

Env

# Mem0 API keys (for Mem0 and Mem0+ techniques)
MEM0_API_KEY="your-mem0-api-key"
MEM0_PROJECT_ID="your-mem0-project-id"
MEM0_ORGANIZATION_ID="your-mem0-organization-id"
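
A quick connectivity sketch for the Mem0 platform client; the exact MemoryClient signature may differ between SDK versions, so treat the call below as an assumption:

```python
# Connectivity sketch for the Mem0 platform; check your installed SDK version
# for the exact MemoryClient keyword arguments.
import os
from mem0 import MemoryClient

client = MemoryClient(api_key=os.environ["MEM0_API_KEY"])
# Expect an empty list for a user that has no memories yet.
print(client.get_all(user_id="locomo-smoke-test"))
```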

Command

Same as the Memobase commands, but replace memobase with mem0. See [all commands](#memory-techniques).

Run Zep

Deps

pip install zep_cloud

Env

ZEP_API_KEY="api-key-from-zep"
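
A quick connectivity sketch for Zep Cloud, assuming the zep_cloud SDK exposes the Zep client class:

```python
# Connectivity sketch for Zep Cloud; creates a throwaway user to confirm the key works.
import os
from zep_cloud.client import Zep

client = Zep(api_key=os.environ["ZEP_API_KEY"])
client.user.add(user_id="locomo-smoke-test")
```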

Command

Same as the Memobase commands, but replace memobase with zep. See [all commands](#memory-techniques).

Run langmem

Deps

pip install langgraph langmem

Env

EMBEDDING_MODEL="text-embedding-3-small"  # or your preferred embedding model
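
To sanity-check the embedding model setting, an optional snippet (assuming an OpenAI embedding model and the OpenAI Python SDK):

```python
# Confirms the configured embedding model is reachable via the OpenAI API.
import os
from openai import OpenAI

client = OpenAI()
resp = client.embeddings.create(
    model=os.getenv("EMBEDDING_MODEL", "text-embedding-3-small"),
    input="hello",
)
print(len(resp.data[0].embedding), "dimensions")
```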

Command

Unlike Memobase, LangMem runs as a single step: make run-langmem. See [all commands](#memory-techniques).

Other methods

The remaining methods don't require any extra dependencies or environment variables.

Memory Techniques

# Run Memobase experiments
make run-memobase-add         # Add memories using Memobase
make run-memobase-search      # Search memories using Memobase

# Run Mem0 experiments
make run-mem0-add         # Add memories using Mem0
make run-mem0-search      # Search memories using Mem0

# Run Mem0+ experiments (with graph-based search)
make run-mem0-plus-add    # Add memories using Mem0+
make run-mem0-plus-search # Search memories using Mem0+

# Run RAG experiments
make run-rag              # Run RAG with chunk size 500
make run-full-context     # Run RAG with full context

# Run LangMem experiments
make run-langmem          # Run LangMem

# Run Zep experiments
make run-zep-add          # Add memories using Zep
make run-zep-search       # Search memories using Zep

# Run OpenAI experiments
make run-openai           # Run OpenAI experiments

📊 Evaluation

To evaluate results, run:

python evals.py --input_file [path_to_results] --output_file [output_path]

This script:

  1. Processes each question-answer pair
  2. Calculates BLEU and F1 scores automatically
  3. Uses an LLM judge to evaluate answer correctness
  4. Saves the combined results to the output file
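
For reference, the token-level F1 in step 2 is usually computed along the lines of the sketch below; the exact implementation lives in metrics/ and may normalize or tokenize differently.

```python
# Sketch of a token-overlap F1 score (SQuAD-style); the repo's metrics/ code
# may differ in normalization and tokenization details.
from collections import Counter

def f1_score(prediction: str, ground_truth: str) -> float:
    pred_tokens = prediction.lower().split()
    gt_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)
```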

📈 Generating Scores

Generate final scores with:

python generate_scores.py

This script:

  1. Loads the evaluation metrics data
  2. Calculates mean scores for each category (BLEU, F1, LLM)
  3. Reports the number of questions per category
  4. Calculates overall mean scores across all categories

Example output:

Mean Scores Per Category:
         bleu_score  f1_score  llm_score  count
category                                       
1           0.xxxx    0.xxxx     0.xxxx     xx
2           0.xxxx    0.xxxx     0.xxxx     xx
3           0.xxxx    0.xxxx     0.xxxx     xx

Overall Mean Scores:
bleu_score    0.xxxx
f1_score      0.xxxx
llm_score     0.xxxx
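
The per-category aggregation above boils down to a pandas groupby. The sketch below assumes one row per question with category, bleu_score, f1_score and llm_score fields; the real loading logic is in generate_scores.py.

```python
# Sketch of the per-category aggregation; the rows here are hypothetical and
# would normally be flattened from the evals.py output file.
import pandas as pd

rows = [
    {"category": 1, "bleu_score": 0.35, "f1_score": 0.46, "llm_score": 1},
    {"category": 1, "bleu_score": 0.20, "f1_score": 0.30, "llm_score": 0},
    {"category": 2, "bleu_score": 0.48, "f1_score": 0.64, "llm_score": 1},
]
df = pd.DataFrame(rows)

summary = df.groupby("category")[["bleu_score", "f1_score", "llm_score"]].mean()
summary["count"] = df.groupby("category").size()
print("Mean Scores Per Category:")
print(summary.round(4))

print("\nOverall Mean Scores:")
print(df[["bleu_score", "f1_score", "llm_score"]].mean().round(4))
```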

📁 Project Structure

.
├── src/                  # Source code for different memory techniques
│   ├── memobase_client/  # Implementation of the Memobase technique
│   ├── memzero/          # Implementation of the Mem0 technique
│   ├── openai/           # Implementation of the OpenAI memory
│   ├── zep/              # Implementation of the Zep memory
│   ├── rag.py            # Implementation of the RAG technique
│   └── langmem.py        # Implementation of the Language-based memory
├── metrics/              # Code for evaluation metrics
├── results/              # Results of experiments
├── dataset/              # Dataset files
├── evals.py              # Evaluation script
├── run_experiments.py    # Script to run experiments
├── generate_scores.py    # Script to generate scores from results
└── prompts.py            # Prompts used for the models