Every AI agent today remembers too much and understands too little. SAMD is the first benchmark that gives agents ground-truth labels for what to keep, what to prune, and what to let go, turning consolidation from a runtime guess into something a model can actually learn.
The standard approach (store everything, retrieve the most similar) breaks down across three independent failure modes that compound over time. Every existing benchmark misses them entirely because it only tests retrieval, never the store itself.
"Every sleep cycle an agent runs generates thousands of ground-truth consolidation decisions. Every system built so far throws them away."
Two state-of-the-art systems, PRISM (schema-guided compression) and RSPM (sleep-cycle consolidation), solve different halves of the problem. Neither produces a learned policy. SAMD closes the gap.
| System | Retrieves | Compresses | Prunes / Decays | Learned Policy |
|---|---|---|---|---|
| Standard RAG | ✓ | ✗ | ✗ | ✗ |
| PRISM (Jayalath et al., EMNLP 2025) | ✓ | ✓ | ✗ | ✗ |
| RSPM (COLM 2026) | ✓ | ✓ | ✓ | ✗ |
| SAMD Fine-tuned ← Target | ✓ | ✓ | ✓ | ✓ |
Every benchmark before SAMD asks whether the agent found the right memory. SAMD asks the harder question: whether that memory entry should have been there at all. Six ground-truth labels cover every case.
A hard negative is a memory entry that looks valid but is temporally wrong. Syntactically correct, semantically plausible, and outdated. No existing benchmark contains them. SAMD generates them systematically by swapping superseded values into active schema slots.
Schema-grounded hard negatives: syntactically valid · semantically plausible · temporally incorrect
SAMD makes specific, verifiable predictions that differentiate it from prior work. If either hypothesis fails, the research direction is falsified. That is the point.
Last week I was at the YC x Google DeepMind event. Demis Hassabis was in the room. The conversation kept circling back to the same thing: agents that cannot learn from their own history are fundamentally limited. The data infrastructure to fix that does not exist yet. I left that room more convinced than ever that SAMD is the right problem at the right time.
I have been building this alone: the consolidation system, the label taxonomy, the hard-negative construction method. The pieces are there. But what I need now are people who have seen this problem from the inside. Researchers who have shipped agents into production and watched them fail at memory. People who can tell me where my hypotheses are wrong before I find out the hard way.