A comprehensive evaluation framework for benchmarking various memory layers on long-term conversational memory tasks. This framework provides a unified pipeline for memory construction, memory retrieval, and question answering evaluation.
- Checkpoint, Recovery & Rerun: Progress is saved automatically during memory construction. If interrupted, simply re-run the script; it will skip already-processed trajectories and resume where it left off. Use the `--rerun` flag to force rebuilding memories from scratch when needed.
- Non-Invasive Token Cost Monitoring: Built-in token consumption tracking for LLM API calls. It uses monkey-patching to intercept calls without modifying any baseline's internal code.
- Modular Architecture: Clean separation between memory layers, datasets, and evaluation logic. Adding a new memory layer only requires implementing the `MemBaseLayer` interface and registering it in the `membase` package. New datasets can be added by subclassing `MemBaseDataset` and registering them.
- Multiple Baselines & Datasets: See Supported Memory Layers and Supported Datasets below.
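The non-invasive token monitoring described above boils down to wrapping an LLM client's call method at runtime. Here is a minimal, self-contained sketch of that monkey-patching idea; `FakeLLMClient`, `TokenMonitor`, and the whitespace-based token count are illustrative stand-ins, not the framework's actual API:

```python
import functools

# Illustrative stand-in for a baseline's LLM client. The real framework
# patches the actual API client class instead of touching baseline code.
class FakeLLMClient:
    def complete(self, prompt: str) -> str:
        return f"answer to: {prompt}"

class TokenMonitor:
    """Counts (approximate) tokens flowing through a patched LLM call site."""

    def __init__(self):
        self.prompt_tokens = 0
        self.completion_tokens = 0

    def patch(self, cls, method_name: str) -> None:
        original = getattr(cls, method_name)

        @functools.wraps(original)
        def wrapper(instance, prompt, *args, **kwargs):
            result = original(instance, prompt, *args, **kwargs)
            # Whitespace split is a crude token estimate for this sketch;
            # a real monitor would read usage fields from the API response.
            self.prompt_tokens += len(prompt.split())
            self.completion_tokens += len(result.split())
            return result

        setattr(cls, method_name, wrapper)  # baseline code stays untouched

monitor = TokenMonitor()
monitor.patch(FakeLLMClient, "complete")
client = FakeLLMClient()
client.complete("summarize the conversation")
print(monitor.prompt_tokens, monitor.completion_tokens)  # → 3 5
```

Because the wrapper is installed on the class at runtime, every baseline that goes through the same client is monitored without any edits to its vendored source.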
```
MemBase/
├── memory_construction.py   # CLI: Stage 1 – Build memories from trajectories
├── memory_search.py         # CLI: Stage 2 – Retrieve memories for each query
├── memory_evaluation.py     # CLI: Stage 3 – Answer questions and evaluate
├── envs/                    # Requirements for each baseline
├── examples/                # Usage examples and tutorials
└── membase/                 # Core package
    ├── __init__.py          # Package-level re-exports
    ├── runners/             # Runner classes for programmatic pipeline execution
    ├── configs/             # Configuration classes for each memory layer
    ├── datasets/            # Dataset loaders
    ├── layers/              # Memory layer implementations
    ├── baselines/           # Vendored baseline source code
    ├── inference_utils/     # QA and evaluation operators
    ├── model_types/         # Data models (dataset, memory)
    └── utils/               # Token monitoring, monkey-patching, file utilities
```
- Python >= 3.12 is required.
- Conda (Anaconda or Miniconda) is recommended for environment management.
⚠️ Important: Different memory baselines may have conflicting dependencies. We strongly recommend creating a separate virtual environment for each baseline to avoid dependency conflicts.
Each memory baseline has its own requirements file in the `envs/` directory. Below are two examples:
Example: Setting up the environment for A-MEM

```shell
conda create -n amem_env python=3.12 -y
conda activate amem_env
pip install -r envs/amem_requirements.txt
```

Example: Setting up the environment for EverMemOS

```shell
conda create -n evermemos_env python=3.12 -y
conda activate evermemos_env
pip install -r envs/evermemos_requirements.txt
```

Repeat the same pattern for other baselines using the corresponding requirements file in `envs/`.
The evaluation of all memory baselines follows a three-stage pipeline:
1. Memory Construction: User interaction trajectories are fed incrementally (message by message) into the memory layer, which builds and updates its internal memory state as each message arrives.
2. Memory Retrieval: Given the constructed memory, this stage retrieves the top-k most relevant memory units for each evaluation query.
3. Question Answering & Evaluation: Using the retrieved memories as context, a question-answering model generates answers for each question. A judge model then evaluates whether the generated answers match the ground truth, producing final accuracy metrics.
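The retrieval stage can be illustrated with a short, self-contained sketch. The bag-of-words cosine similarity below is only a stand-in for whatever scoring each memory layer actually uses; `top_k` and the sample memories are hypothetical:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse bag-of-words vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query: str, memories: list[str], k: int = 2) -> list[str]:
    # Score every stored memory unit against the query, keep the best k.
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(m.lower().split())), m) for m in memories]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:k]]

memories = [
    "Alice adopted a cat named Mochi in March",
    "Bob prefers tea over coffee",
    "Alice's cat Mochi likes sleeping on the keyboard",
]
result = top_k("what is the name of Alice's cat", memories)
# The two cat-related memories outrank the unrelated one.
print(result)
```

The retrieved units then become the context passed to the question-answering model in the final stage.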
See the `examples/` directory for step-by-step tutorials:
| Example | Description |
|---|---|
| Evaluate A-MEM on LongMemEval | Run the full three-stage evaluation pipeline (construction, retrieval, QA) using A-MEM on LongMemEval |
| Evaluate Mem0 on LoCoMo | Evaluate Mem0 (with Kuzu graph store) on LoCoMo, with a custom question-answering prompt and adversarial question filtering |
| Evaluate MemOS on LoCoMo | Evaluate MemOS with vLLM-served embedding on LoCoMo, with adversarial question filtering |
| Download Models | Download pre-trained embedding and reranker models from Hugging Face |
Besides the CLI scripts, all three pipeline stages are available as importable Runner classes under `membase.runners`, so you can drive the evaluation from Python scripts or notebooks without shell commands.
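The programmatic pattern looks roughly like the sketch below. The real classes live in `membase.runners`; the class names, method signatures, and toy logic here are purely illustrative, with each stage's real work replaced by a trivial stand-in:

```python
# Hypothetical sketch of chaining the three stages from Python.
class ConstructionRunner:
    def run(self, trajectories: list[list[str]]) -> dict:
        # Stage 1: feed messages into the memory layer one by one.
        return {"memories": [m for traj in trajectories for m in traj]}

class SearchRunner:
    def run(self, state: dict, queries: list[str], k: int = 2) -> dict:
        # Stage 2: retrieve top-k memory units per query (no real scoring here).
        return {q: state["memories"][:k] for q in queries}

class EvaluationRunner:
    def run(self, retrieved: dict, ground_truth: dict) -> float:
        # Stage 3: QA + judge collapsed into a substring-containment check.
        hits = sum(
            any(ground_truth[q] in mem for mem in mems)
            for q, mems in retrieved.items()
        )
        return hits / len(retrieved)

state = ConstructionRunner().run([["likes jazz", "owns a bike"]])
retrieved = SearchRunner().run(state, ["hobby?"])
accuracy = EvaluationRunner().run(retrieved, {"hobby?": "jazz"})
print(accuracy)  # → 1.0
```

The appeal of the Runner design is that each stage's output is a plain Python value, so intermediate results can be inspected, cached, or swapped out in a notebook between stages.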
Contributions are welcome! Please feel free to submit issues or pull requests. More baselines will be added.