Synthetic Agent Memory Dataset · 2026

Teaching Agents
What to Forget

Every AI agent today remembers too much and understands too little. SAMD is the first benchmark that gives agents ground-truth labels for what to keep, what to prune, and what to let go, turning consolidation from a runtime guess into something a model can actually learn.

93% retry-loop reduction

30% memory footprint reduction

19K labeled memory entries

6 consolidation label classes

The Problem

Long-horizon agents that store everything remember nothing.

The standard approach (store everything, retrieve the most similar) breaks down across three independent failure modes that compound over time. Every existing benchmark misses them entirely because it only tests retrieval, never the store itself.

∞

Memory Bloat

The vector store grows without bound. By episode 50, the context window is noise. No entry is ever retired. The agent drowns in its own history.

↺

Repetition Loops

Failed approaches are retried indefinitely because failure traces are never flagged. The agent re-discovers the same dead ends on every new task.

⊘

Knowledge Staleness

Retrievers rank by semantic similarity, not temporal truth. A preference from six months ago surfaces as confidently as today's. Stale knowledge poisons fresh decisions.

"Every sleep cycle an agent runs generates thousands of ground-truth consolidation decisions. Every system built so far throws them away."

The core insight behind SAMD

Zero. That is how many of those labels exist in any current benchmark. Not because the signal is missing (it is produced every time a consolidation system runs), but because nobody thought to save it. SAMD saves it. Every KEEP, every PRUNE, every PROMOTE decision becomes a training example. Consolidation stops being an engineering workaround and becomes something a model can learn.

Capability Gap

The missing piece no one built.

Two state-of-the-art systems, PRISM (schema-guided compression) and RSPM (sleep-cycle consolidation), solve different halves of the problem. Neither produces a learned policy. SAMD closes the gap.

System	Retrieves	Compresses	Prunes / Decays	Learned Policy
Standard RAG	✓	✗	✗	✗
PRISM (Jayalath et al., EMNLP 2025)	✓	✓	✗	✗
RSPM (COLM 2026)	✓	✓	✓	✗
SAMD Fine-tuned ← Target	✓	✓	✓	✓

Label Taxonomy

Six classes. One question: should this have existed?

Every benchmark before SAMD asks whether the agent found the right memory. SAMD asks the harder question: whether that memory entry should have been there at all. Six ground-truth labels cover every case.

KEEP

Retain: relevant, accurate, current. This entry earns its place in the store.

PRUNE

Delete: irrelevant, redundant, or actively harmful. Keeping this makes the agent worse.

PROMOTE-L2

Elevate to domain rule: a pattern that recurs across tasks in a domain. Lift it above episodic recall.

PROMOTE-L1

Elevate to universal rule: a pattern that generalizes agent-wide. The agent should apply this everywhere.
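
To make the taxonomy concrete, here is a minimal sketch of how labeled entries might be represented and applied. It covers only the four classes named above; the names `ConsolidationLabel`, `LabeledEntry`, and `apply_labels` are hypothetical and are not part of any released SAMD schema.

```python
from dataclasses import dataclass
from enum import Enum


class ConsolidationLabel(Enum):
    # Only the four classes named in the text; any others are not modeled here.
    KEEP = "KEEP"
    PRUNE = "PRUNE"
    PROMOTE_L2 = "PROMOTE-L2"  # elevate to a domain rule
    PROMOTE_L1 = "PROMOTE-L1"  # elevate to a universal rule


@dataclass(frozen=True)
class LabeledEntry:
    text: str                   # the memory entry as stored
    label: ConsolidationLabel   # ground-truth consolidation decision


def apply_labels(entries):
    """Split labeled entries into the episodic store and two rule tiers."""
    store, domain_rules, universal_rules = [], [], []
    for entry in entries:
        if entry.label is ConsolidationLabel.KEEP:
            store.append(entry.text)
        elif entry.label is ConsolidationLabel.PROMOTE_L2:
            domain_rules.append(entry.text)
        elif entry.label is ConsolidationLabel.PROMOTE_L1:
            universal_rules.append(entry.text)
        # PRUNE entries are dropped on purpose.
    return store, domain_rules, universal_rules
```

In this sketch a pruned entry simply never reaches any tier, which is the point of the label: the question is whether the entry should exist, not whether it can be found.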

The case that breaks every current agent.

A hard negative is a memory entry that looks valid but is temporally wrong. Syntactically correct, semantically plausible, and outdated. No existing benchmark contains them. SAMD generates them systematically by swapping superseded values into active schema slots.

6 MONTHS AGO User prefers dark mode. Always apply it.

→

TODAY User switched to light mode last week.

HARD NEGATIVE, LOOKS VALID "User prefers dark mode": syntactically valid, semantically plausible, temporally wrong.

PRUNE

CORRECT ENTRY "User prefers light mode": current ground truth.

KEEP

Schema-grounded hard negatives: syntactically valid · semantically plausible · temporally incorrect
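
The slot-swapping idea can be sketched in a few lines. This is an illustration under stated assumptions, not the SAMD generation pipeline: the `(day, value)` history format, the `{value}` template, and the function name `make_examples` are all hypothetical.

```python
def make_examples(template, history):
    """Label one schema slot's value history.

    template: present-tense pattern with a {value} placeholder (hypothetical).
    history:  list of (day, value) pairs, in any order.
    """
    if not history:
        return []
    ordered = sorted(history, key=lambda pair: pair[0])  # oldest first
    current = ordered[-1][1]
    # The most recent value is the ground truth for this slot.
    examples = [(template.format(value=current), "KEEP")]
    seen = {current}
    for _, old in ordered[:-1]:
        if old not in seen:  # skip repeats and values that were later restored
            seen.add(old)
            # Superseded value rendered through the same template: it reads
            # exactly like a live entry, which is what makes it a hard negative.
            examples.append((template.format(value=old), "PRUNE"))
    return examples


pairs = make_examples("User prefers {value} mode.", [(0, "dark"), (180, "light")])
# pairs == [("User prefers light mode.", "KEEP"),
#           ("User prefers dark mode.", "PRUNE")]
```

Because the stale and current entries share one template, nothing in the surface text separates them. Only the slot history does.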

Testable Claims

Two falsifiable hypotheses. Both measurable.

SAMD makes specific, verifiable predictions that differentiate it from prior work. If either hypothesis fails, the research direction is falsified. That is the point.

H1 CONSOLIDATION BEATS RETRIEVAL

A learned consolidation policy outperforms RAG on long-horizon QA

A model fine-tuned on SAMD labels will outperform standard RAG on LoCoMo, without any test-time memory management. The difference: it learned what is worth keeping, not just what is most similar.

Baseline: RAG with cosine-similarity retrieval, no memory management
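
For reference, the baseline's retrieval step can be sketched as plain top-k cosine similarity over stored embeddings. The vectors below are toy values chosen for illustration, and `cosine` and `retrieve` are hypothetical helper names, not code from any evaluated system.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def retrieve(query_vec, store, k=3):
    """Return the k stored texts most similar to the query.

    store: list of (text, embedding) pairs. Nothing is ever removed from it,
    and the score ignores how old an entry is.
    """
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


store = [
    ("User prefers dark mode.", [1.0, 0.0]),   # stale, toy embedding
    ("User prefers light mode.", [1.0, 0.1]),  # current, toy embedding
    ("Build failed on step 3.", [0.0, 1.0]),
]
top = retrieve([1.0, 0.0], store, k=2)
# top == ["User prefers dark mode.", "User prefers light mode."]
```

With these toy vectors the stale preference outranks the current one, because the ranking sees only similarity. That is the behavior H1 expects a learned consolidation policy to beat.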

H2 STRUCTURED NEGATIVES WIN

Schema-grounded hard negatives produce steeper learning curves than random negatives

Hard negatives built from structured schema slot-swapping will train better consolidation policies than randomly sampled negatives, demonstrating that compression and consolidation are complementary supervision signals.

Baseline: same model architecture, random negatives instead

Expected Contributions

Three things that do not exist yet.

SAMD: the dataset itself

The first corpus where every agentic memory entry carries a ground-truth consolidation label derived from real agent success and failure. Evaluated across 19K SWE-bench issues and long-horizon LoCoMo dialogues of 150+ turns. Planned open release.

Consolidation as a supervised learning objective

An empirical characterization of memory consolidation as something a model can be trained to do, not just engineered around. If H1 holds, this reframes how the field thinks about long-horizon agent memory at the architecture level.

Schema-grounded hard negative construction

A novel method for generating temporally adversarial training cases by swapping superseded schema slot values into active memory positions. The first method that unifies structured compression and biologically inspired sleep-cycle consolidation as complementary supervision signals.

This is bigger than one person.

Last week I was at the YC x Google DeepMind event. Demis Hassabis was in the room. The conversation kept circling back to the same thing: agents that cannot learn from their own history are fundamentally limited. The data infrastructure to fix that does not exist yet. I left that room more convinced than ever that SAMD is the right problem at the right time.

I have been building this alone: the consolidation system, the label taxonomy, the hard-negative construction method. The pieces are there. But what I need now are people who have seen this problem from the inside. Researchers who have shipped agents into production and watched them fail at memory. People who can tell me where my hypotheses are wrong before I find out the hard way.

Memory Consolidation · Agentic AI Systems · Ground-Truth Supervision · Long-Horizon Agents
