Loomem benchmarks

Loomem scores 75.0% (375 / 500) on LongMemEval-S, a public long-term-memory question-answering benchmark — running fully self-hosted, with on-device embeddings and no external database. This page documents the exact setup behind the number; the full harness, prompts, dataset pin, and raw per-question results are being prepared for publication so the run can be independently reproduced.

What LongMemEval measures

LongMemEval is a public benchmark for long-term memory in chat assistants: across long, multi-session histories it tests information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. We evaluate on LongMemEval-S (cleaned) — the xiaowu0162/longmemeval-cleaned variant — so the questions match, one-to-one, the set used in published Mem0 and Letta write-ups.

The result

Benchmark	Metric	Loomem
LongMemEval-S (cleaned)	QA accuracy (judge-scored)	75.0% — 375 / 500

Configuration (for reproducibility)

Setting	Value
Loomem engine	commit `caaccd1`, single local instance
Embeddings	local `multilingual-e5-small` (384-dim, ONNX) — no OpenAI embeddings
Retrieval	hybrid BM25 + vector + graph, `top_k = 20`
Reranking	off
Reader & judge model	GPT-4.1 (same reader and judge as the reference runs)
Dataset	LongMemEval-S (cleaned), 500 questions

How to read this

Self-run, single configuration. This is our own run, not a third-party leaderboard. The config above is the full setup; the harness, exact prompts, dataset pin, and raw per-question results are being prepared for publication so the run can be independently reproduced.
The reader model matters. LongMemEval scores blend retrieval quality with the reader model and its prompt. We report the reader/judge (GPT-4.1) so the comparison is apples-to-apples; a terse reader prompt in particular can penalise preference-style questions where the answer must be applied, not just stated.
Embeddings were fully local. The 75.0% is with on-device e5-small embeddings, so it reflects the default offline setup rather than a cloud-embedding best case.
LongMemEval is saturating. Top systems now cluster near the ceiling, so we are also evaluating on newer suites (LongMemEval-V2, MemoryArena) and will publish those as they stabilise.

Loomem is open source (Apache-2.0); see how it compares to other memory layers on the comparison page.

Benchmark run 2026-06-26. LongMemEval: Wu et al., “LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory”. Dataset: xiaowu0162/longmemeval-cleaned.