A comprehensive evaluation framework for benchmarking various memory layers on long-term conversational memory tasks. This framework provides a unified pipeline for memory construction, memory retrieval, and question answering evaluation.
- Checkpoint, Recovery & Rerun: Progress is saved automatically during memory construction. If interrupted, simply re-run the script; it will skip already-processed trajectories and resume where it left off. Use the `--rerun` flag to force rebuilding memories from scratch when needed.
- Non-Invasive Token Cost Monitoring: Built-in token consumption tracking for LLM API calls. It uses monkey-patching to intercept calls without modifying any baseline's internal code.
- Modular Architecture: Clean separation between memory layers, datasets, and evaluation logic. Adding a new memory layer only requires implementing the `MemBaseLayer` interface and registering it in the `membase` package. New datasets can be added by subclassing `MemBaseDataset` and registering them.
- Multiple Baselines & Datasets: See Supported Memory Layers and Supported Datasets below.
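The non-invasive token monitoring described above boils down to wrapping an LLM client's call method at runtime. Here is a minimal, self-contained sketch of that monkey-patching idea; `FakeLLMClient`, `TokenMonitor`, and the whitespace-based token count are illustrative stand-ins, not the framework's actual API:

```python
import functools

# Illustrative stand-in for a baseline's LLM client. The real framework
# patches the actual API client class instead of touching baseline code.
class FakeLLMClient:
    def complete(self, prompt: str) -> str:
        return f"answer to: {prompt}"

class TokenMonitor:
    """Counts (approximate) tokens flowing through a patched LLM call site."""

    def __init__(self):
        self.prompt_tokens = 0
        self.completion_tokens = 0

    def patch(self, cls, method_name: str) -> None:
        original = getattr(cls, method_name)

        @functools.wraps(original)
        def wrapper(instance, prompt, *args, **kwargs):
            result = original(instance, prompt, *args, **kwargs)
            # Whitespace split is a crude token estimate for this sketch;
            # a real monitor would read usage fields from the API response.
            self.prompt_tokens += len(prompt.split())
            self.completion_tokens += len(result.split())
            return result

        setattr(cls, method_name, wrapper)  # baseline code stays untouched

monitor = TokenMonitor()
monitor.patch(FakeLLMClient, "complete")
client = FakeLLMClient()
client.complete("summarize the conversation")
print(monitor.prompt_tokens, monitor.completion_tokens)  # → 3 5
```

Because the wrapper is installed on the class at runtime, every baseline that goes through the same client is monitored without any edits to its vendored source.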
```
MemBase/
├── memory_construction.py   # CLI: Stage 1 – Build memories from trajectories
├── memory_search.py         # CLI: Stage 2 – Retrieve memories for each query
├── memory_evaluation.py     # CLI: Stage 3 – Answer questions and evaluate
├── envs/                    # Requirements for each baseline
├── examples/                # Usage examples and tutorials
└── membase/                 # Core package
    ├── __init__.py          # Package-level re-exports
    ├── runners/             # Runner classes for programmatic pipeline execution
    ├── configs/             # Configuration classes for each memory layer
    ├── datasets/            # Dataset loaders
    ├── layers/              # Memory layer implementations
    ├── baselines/           # Vendored baseline source code
    ├── inference_utils/     # QA and evaluation operators
    ├── model_types/         # Data models (dataset, memory)
    └── utils/               # Token monitoring, monkey-patching, file utilities
```
- Python >= 3.12 is required.
- Conda (Anaconda or Miniconda) is recommended for environment management.
⚠️ Important: Different memory baselines may have conflicting dependencies. We strongly recommend creating a separate virtual environment for each baseline to avoid dependency conflicts.
Each memory baseline has its own requirements file in the `envs/` directory. Below are two examples:
Example: Setting up the environment for A-MEM

```shell
conda create -n amem_env python=3.12 -y
conda activate amem_env
pip install -r envs/amem_requirements.txt
```

Example: Setting up the environment for EverMemOS

```shell
conda create -n evermemos_env python=3.12 -y
conda activate evermemos_env
pip install -r envs/evermemos_requirements.txt
```

Repeat the same pattern for other baselines using the corresponding requirements file in `envs/`.
The evaluation of all memory baselines follows a three-stage pipeline:
1. Memory Construction: User interaction trajectories are fed incrementally (message by message) into the memory layer, which builds and updates its internal memory state as each message arrives.
2. Memory Retrieval: Given the constructed memory, this stage retrieves the top-k most relevant memory units for each evaluation query.
3. Question Answering & Evaluation: Using the retrieved memories as context, a question-answering model generates answers for each question. A judge model then evaluates whether the generated answers match the ground truth, producing final accuracy metrics.
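The retrieval stage can be illustrated with a short, self-contained sketch. The bag-of-words cosine similarity below is only a stand-in for whatever scoring each memory layer actually uses; `top_k` and the sample memories are hypothetical:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse bag-of-words vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query: str, memories: list[str], k: int = 2) -> list[str]:
    # Score every stored memory unit against the query, keep the best k.
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(m.lower().split())), m) for m in memories]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:k]]

memories = [
    "Alice adopted a cat named Mochi in March",
    "Bob prefers tea over coffee",
    "Alice's cat Mochi likes sleeping on the keyboard",
]
result = top_k("what is the name of Alice's cat", memories)
# The two cat-related memories outrank the unrelated one.
print(result)
```

The retrieved units then become the context passed to the question-answering model in the final stage.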
See the `examples/` directory for step-by-step tutorials:
| Example | Description |
|---|---|
| Evaluate A-MEM on LongMemEval | Run the full three-stage evaluation pipeline (construction, retrieval, QA) using A-MEM on LongMemEval |
| Evaluate Mem0 on LoCoMo | Evaluate Mem0 (with Kuzu graph store) on LoCoMo, with a custom question-answering prompt and adversarial question filtering |
| Evaluate MemOS on LoCoMo | Evaluate MemOS with vLLM-served embedding on LoCoMo, with adversarial question filtering |
| Download Models | Download pre-trained embedding and reranker models from Hugging Face |
Besides the CLI scripts, all three pipeline stages are available as importable Runner classes under `membase.runners`, so you can drive the evaluation from Python scripts or notebooks without shell commands.
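The programmatic pattern looks roughly like the sketch below. The real classes live in `membase.runners`; the class names, method signatures, and toy logic here are purely illustrative, with each stage's real work replaced by a trivial stand-in:

```python
# Hypothetical sketch of chaining the three stages from Python.
class ConstructionRunner:
    def run(self, trajectories: list[list[str]]) -> dict:
        # Stage 1: feed messages into the memory layer one by one.
        return {"memories": [m for traj in trajectories for m in traj]}

class SearchRunner:
    def run(self, state: dict, queries: list[str], k: int = 2) -> dict:
        # Stage 2: retrieve top-k memory units per query (no real scoring here).
        return {q: state["memories"][:k] for q in queries}

class EvaluationRunner:
    def run(self, retrieved: dict, ground_truth: dict) -> float:
        # Stage 3: QA + judge collapsed into a substring-containment check.
        hits = sum(
            any(ground_truth[q] in mem for mem in mems)
            for q, mems in retrieved.items()
        )
        return hits / len(retrieved)

state = ConstructionRunner().run([["likes jazz", "owns a bike"]])
retrieved = SearchRunner().run(state, ["hobby?"])
accuracy = EvaluationRunner().run(retrieved, {"hobby?": "jazz"})
print(accuracy)  # → 1.0
```

The appeal of the Runner design is that each stage's output is a plain Python value, so intermediate results can be inspected, cached, or swapped out in a notebook between stages.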
Contributions are welcome! Please feel free to submit issues or pull requests. More baselines will be added.