
Demystifying Agent Memory

12 min read
Zhipeng Ye, Member of Technical Staff
Tejas Manoj Ghone, Member of Technical Staff

This post covers how we think about agent memory, and how we're building it for our RAG 2.0 agents.

Foundation models are stateless by default. They don't pick up where you left off, they don't learn your domain over time, and they don't get to know individual users. That matters whenever the work is multi-turn, but it matters even more when getting the task right depends on understanding the environment the agent is operating in.

Most production agent work isn't a single prompt and a single answer. There's a chain of tool calls, intermediate results, things that didn't work, and feedback from the user along the way. Memory is the mechanism that lets the agent carry all of that forward and keep improving.

For an enterprise production agent, memory is what sustains high performance over time. It becomes a small system of its own: one that holds what's true about the domain, gets better at tool use, and adapts to how people actually use the product.


Why memory matters for agents

Foundation models give you reasoning, and the context windows keep getting bigger. Memory is still what makes that reasoning compound across turns.

Take a RAG agent answering questions over an enterprise corpus. To answer a question well, it might need to remember which sources it already checked, which query rewrites came back empty, which document is the authoritative one when two say different things, and how the user corrected its last answer. Strip that out and the agent re-does work it already did, drops context it already had, and never gets better.
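To make that concrete, here's a minimal sketch of the kind of per-session state that has to live somewhere (the field names are illustrative, not our actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class SessionState:
    """Illustrative per-session state for a RAG agent; not our actual schema."""
    sources_checked: set = field(default_factory=set)     # source IDs already retrieved
    failed_rewrites: list = field(default_factory=list)   # query rewrites that came back empty
    authoritative: dict = field(default_factory=dict)     # topic -> the source to trust on conflict
    user_corrections: list = field(default_factory=list)  # feedback on earlier answers this session

    def already_checked(self, source_id: str) -> bool:
        return source_id in self.sources_checked
```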

The problem gets worse the more tool-driven the agent gets. ReAct-style planning is essentially a multi-turn conversation with the foundation model itself. Every tool call leaves state behind. Every retrieval produces evidence. Every thumbs-down is a piece of supervision. Agents that learn from these signals over time keep improving.


Where memory lives: in the model, or next to it

There are two places memory can sit.

Inside the model, as parametric memory — knowledge baked into weights through fine-tuning, RL, or other training. The ceiling is high in principle: the model internalizes the patterns directly. In production it's painful. Training is expensive, iteration is slow, the result is hard to inspect, and it's pinned to a specific model version. When the next foundation model comes out, none of that work transfers.

Outside the model, as non-parametric memory or latent memory — data the model reads at inference time. This is just memory-as-data: prompts, retrieved snippets, structured state, KV-cache, vectors. It's easier to update, easier to validate, easier to explain, and it survives model upgrades.

For most production agents today, non-parametric memory is the more practical choice. You can see what's stored, decide what should change, and ship updates without retraining anything. That's where we focus.


A vector database is not a memory system

The shortcut a lot of teams take is to dump past interactions into a vector store and call that memory. That's a useful piece, but it's not enough on its own.

A vector store can hand you something that looks similar to the current query. It doesn't decide what's worth remembering in the first place, when something has gone stale, how to compact a noisy trace into a useful lesson, or whether the thing it just retrieved is actually safe to act on. None of that comes for free with the index.

A memory system that actually works needs opinions about all of those. What gets written. What gets thrown away. What gets retrieved when. How proposed updates are validated before they start influencing behavior. How citations and source grounding survive compaction. The storage problem is the easy part.
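A rough sketch of what those opinions look like in code (the thresholds and names below are invented for illustration; they're policy decisions a vector index won't make for you):

```python
import time
from dataclasses import dataclass

@dataclass
class MemoryItem:
    text: str
    source: str         # provenance: where this memory came from
    written_at: float   # unix timestamp, used for staleness checks
    score: float = 0.0  # retrieval similarity, filled in at read time

class MemoryPolicy:
    """Illustrative write/read policies layered on top of a vector store."""
    MAX_AGE_SECONDS = 30 * 24 * 3600  # hypothetical staleness horizon
    MIN_RELEVANCE = 0.75              # hypothetical relevance floor

    def worth_writing(self, item: MemoryItem) -> bool:
        # Decide what gets remembered at all; skip unattributed or trivial traces.
        return bool(item.source) and len(item.text.split()) > 5

    def is_stale(self, item: MemoryItem) -> bool:
        return time.time() - item.written_at > self.MAX_AGE_SECONDS

    def safe_to_use(self, item: MemoryItem) -> bool:
        # A retrieved memory still needs relevance and freshness checks
        # before it's allowed to influence behavior.
        return item.score >= self.MIN_RELEVANCE and not self.is_stale(item)
```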

How we layer memory

We split memory into multiple layers that solve different problems. They're not interchangeable, and we deliberately don't try to make one mechanism do all of them.

Working memory: the context window

Within a single session, the context window is the memory that keeps the conversation coherent. The user's request, the recent turns, the tool results that just came back, intermediate state, the source evidence — all of it lives there.

The naive version is to keep stuffing everything back into the prompt. It works on simple cases but degrades in production, where queries and the environment are far more complex. Tokens grow, latency grows with them, and once the context gets long enough you start hitting the lost-in-the-middle problem well before the window is technically full.

So we treat the window as a budget. Tool results get trimmed down to the fields that actually carry signal. Big payloads get compacted before they reach the model. Source text stays — that's the grounding — but the structural overhead around it doesn't. Raw JSON is fine for machines, but for a model trying to reason over the content, repeated field names and braces are tokens that aren't doing anything for you. When the content is what matters, we strip the wrapper.
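As a minimal sketch of the idea (the kept fields and the budget here are invented; real compaction is more careful than a character cap):

```python
import json

# Hypothetical: the fields of a search-tool result that actually carry signal.
KEEP_FIELDS = ("title", "snippet", "source", "score")

def compact_tool_result(raw_json: str, max_chars: int = 2000) -> str:
    """Strip the JSON wrapper from a list of search results, keep the signal."""
    results = json.loads(raw_json)
    lines = []
    for r in results:
        kept = {k: r[k] for k in KEEP_FIELDS if k in r}
        # Flatten to plain text: repeated braces and field names are tokens
        # that do nothing for a model reasoning over the content.
        lines.append(" | ".join(f"{k}: {v}" for k, v in kept.items()))
    return "\n".join(lines)[:max_chars]  # crude budget cap for illustration
```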

It's not really about shorter prompts. It's about signal density per token, and the agent gets visibly sharper when you push that up.


Procedural memory: how to search this corpus

Every RAG agent we ship runs over a different corpus, and a generic model doesn't know any of the things that make a corpus tractable: the terminology people actually use, which collections to trust, which metadata fields are reliable, and how retrieval typically fails on this data.

Procedural memory is where we capture that. We use ACE-style optimization, as described in an earlier blog post, to let the agent reflect on its own tool-call traces, pull out which strategies worked, write the lessons into a search playbook, and validate proposed updates against eval metrics before anything gets committed. Over time the agent learns things like which filters help on which query types, when to broaden a search versus narrow it, which sources to trust when there's a conflict, and how to recover when retrieval comes back empty.

This is operational, not factual. It's not telling the agent what's true; it's telling the agent how to look. One thing we're deliberate about: the foundation model itself stays in the loop for the reflection, distillation, and validation. That way every base-model upgrade lifts the memory mechanism along with the agent.
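A stripped-down sketch of that loop (the function names and playbook format are illustrative; in practice this is ACE-style optimization validated against our eval suite):

```python
def update_playbook(playbook, traces, reflect, evaluate, min_gain=0.0):
    """Propose lessons from tool-call traces; commit only those that improve evals.

    `reflect(traces)` asks the foundation model to distill candidate lessons,
    e.g. "on part-number queries, filter by product line". `evaluate(playbook)`
    runs the eval set and returns a scalar score. Both are illustrative stand-ins.
    """
    baseline = evaluate(playbook)
    for lesson in reflect(traces):
        trial_score = evaluate(playbook + [lesson])
        if trial_score - baseline > min_gain:
            playbook = playbook + [lesson]  # validated: the lesson gets committed
            baseline = trial_score
    return playbook
```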

Semantic memory: stable facts about the domain

Semantic memory is the facts about the environment: what the product does, what the corpus is, what internal terms mean, which sources count as authoritative, what constraints apply. Things the agent should know before it ever runs a retrieval, and that don't need to be re-derived every query.

When users describe an agent in natural language, we extract the key facts and persist them into the agent's configuration directly. For a technical-support agent, that might mean what the product is, which doc collections it has access to, what specific internal terms mean, and which sources to prefer when there's overlap. The payoff is fewer redundant retrievals, better grounding, and an agent that behaves consistently from one session to the next.
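As an illustration (the schema below is hypothetical, not our actual configuration format), those extracted facts end up as structured configuration rather than free text the agent has to re-retrieve:

```python
from dataclasses import dataclass, field

@dataclass
class AgentConfig:
    """Hypothetical semantic-memory config for a technical-support agent."""
    product: str
    doc_collections: list = field(default_factory=list)  # collections the agent can search
    glossary: dict = field(default_factory=dict)          # internal term -> meaning
    source_priority: list = field(default_factory=list)   # preferred sources when docs overlap

config = AgentConfig(
    product="Acme Observability Platform",                # illustrative values only
    doc_collections=["runbooks", "release-notes"],
    glossary={"OTel": "our OpenTelemetry-based instrumentation"},
    source_priority=["runbooks", "release-notes"],
)
```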

Behavioral memory: what users have actually taught us

Once an agent is in production, the actual traffic carries insights the spec didn't. You see what people actually ask, where the phrasing is ambiguous, which answer formats they push back on, where the agent keeps tripping over the same gap.

Based on user feedback signals, we use prompt optimization methods like GEPA and ACE to fold those patterns back into the agent's prompt and behavior, instead of treating every lesson as a fact to be retrieved later. For self-reflection, we lean on comprehensive evaluation metrics like LMUnit alongside direct user feedback as improvement signals. The recurring stuff — common intent shapes, common misreadings, preferred response styles — becomes part of how the agent operates rather than something it has to look up.

It's the closest thing in this stack to the agent learning from its users, and it's where we get the most leverage on quality once enough real traffic has come through.
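One way to picture the aggregation step (the labels and thresholds here are made up, and candidate changes still go through the same evaluation gates as everything else):

```python
from collections import Counter

def recurring_patterns(feedback_events, min_count=20):
    """Fold individual feedback events into recurring behavioral lessons.

    Each event is assumed to carry an `issue` label, e.g. from an LLM
    classifier over thumbs-down comments (a hypothetical upstream step).
    Only patterns seen often enough become candidate prompt updates.
    """
    counts = Counter(e["issue"] for e in feedback_events)
    return [issue for issue, n in counts.most_common() if n >= min_count]

# e.g. ["answers run too long on how-to questions", "misreads 'deploy' as 'install'"]
```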


Why the layers matter together

Each layer is solving a different problem: working memory keeps the current task coherent, procedural memory makes the agent better at using its tools, semantic memory anchors what's true, behavioral memory adapts to how the product is actually used.

Not every piece of context belongs in a vector embedding. Not every lesson belongs in the prompt. Not every user correction should change long-term behavior. Memory can go wrong and degrade performance, so we keep the layers separate and use different mechanisms for each — that's how we avoid one bad update bleeding into everything else.


What goes wrong

Non-parametric memory is the right foundation, but it isn't free. Writes are far riskier than reads, and a careless write can cause performance regressions. Splitting memory into layers lets us apply the right evaluation and monitoring mechanisms to each layer, using tools like LMUnit and an expert-grounded test suite to guardrail every change.

Retrieved memory goes stale. Summaries quietly drop the detail you actually needed. Compaction loses evidence. A bad memory written once can degrade behavior for a long time before anyone notices. Prompt-level memory competes with task context for the model's attention. Optimization loops will happily overfit to a narrow eval set if you let them.

We design around the failure modes, not around the happy path. Memory updates go through evaluation gates before they take effect. Procedural playbooks get refreshed on a regular cadence. Semantic facts carry provenance. Retrieved memories get a relevance check. Compaction has to preserve citations and evidence, not just shrink the text. None of this is glamorous, but the bulk of the engineering effort in a real memory system goes here, not into the storage layer.
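One small example of that kind of guardrail (illustrative only, and assuming [1]-style citation markers): a compacted summary is only allowed to replace the original evidence if it kept every citation.

```python
import re

CITATION = re.compile(r"\[\d+\]")  # assumes [1]-style citation markers

def compaction_preserves_citations(original: str, compacted: str) -> bool:
    """Reject a compaction that silently drops source citations."""
    return set(CITATION.findall(original)) <= set(CITATION.findall(compacted))

# A compaction that loses "[2]" fails the gate and never replaces the original.
```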

Where this is going

Our work so far has leaned on layered, non-parametric memory because that's where the production tradeoffs are honest right now: explainable, scalable, updateable, and portable across model versions. We add a layer only when it's necessary, to keep the system as simple as possible. That already covers most of what a real Q&A agent needs — keeping context clean, getting better at search, holding domain knowledge, adapting to usage.

There's a real ceiling, though, and it's worth being honest about it. With non-parametric memory, the model isn't learning in the deepest sense. It's reading, retrieving, summarizing, and conditioning on external state. Parametric and latent approaches let experience live in a form the model uses more directly, and we expect those methods to reshape how agent memory works. For now, non-parametric memory is what gets you reliable agents in production, with the control and iteration speed real teams need. In the long run, a good memory system should be simple, scalable, and effective, and as the research matures we'll keep exploring and experimenting with new memory mechanisms to support our agents.