There is a quiet failure mode at the center of almost every AI product shipping today, and it has nothing to do with model size or reasoning capability. The model you talked to yesterday does not remember talking to you. The agent that helped you draft a strategy document last week begins this week as a stranger. The assistant you have used for six months knows you no better than it did on day one.
This is not a small problem. It is the bottleneck.
We at Myah Research spend a lot of time on this question, because the gap between an LLM that can answer well in a single conversation and an agent that can genuinely work alongside you over months is almost entirely a memory gap. The reasoning is already there. The knowledge is already there. What is missing is the thing that lets a system carry yesterday into today.
This piece is an attempt to lay out, in some detail, what that gap actually is. Why it has not been solved by bigger context windows. Why retrieval is not the same thing as memory. And what we think a real memory system has to do.
The stateless problem
Every modern large language model is, in its default form, completely stateless. Each prompt arrives as if the universe began with that prompt. The model has no internal record of having spoken to you before, no record of what you discussed, no record of what worked and what did not. Whatever happened in your previous conversation is, from the model's perspective, as if it never happened.
There is a useful analogy I keep coming back to. Imagine an employee who is brilliant, capable, articulate, and who arrives at work each morning with no memory of any previous day. You can hand them a folder of notes about who they are and what they have been doing, and they will read it, do excellent work, and go home. The next morning the same person walks in and you hand them the same folder. They have no opinion about you. No accumulated taste for how you like things done. No sense of which projects matter and which were experiments. The folder is the only thing standing between you and starting from zero every single day.
This is roughly what every chat product on the market is doing right now. The folder is the prompt. The folder is also your only protection against amnesia.
Why the obvious answer does not work
The first instinct, when you notice this problem, is the same instinct most product teams have had for the last two years: just put more in the context window. Newer models accept hundreds of thousands of tokens. Some accept millions. Why not just put everything the user has ever said in front of the model and let it sort things out?
There are a few reasons this does not actually work, and they are worth understanding because they reveal what a real solution has to look like.
Cost compounds in ways that are easy to ignore
A million-token context window sounds like a lot of room until you do the arithmetic on what it costs to send a million tokens through a frontier model on every single message. Even if cost per token continues to fall, the cost of each message grows linearly with conversation history, which means total spend grows roughly quadratically with the number of messages. A user who interacts with your product daily for a year sends thousands of messages, and on each one you are paying to re-read all of the previous ones. The economics fall apart long before the technical limits do.
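To make that arithmetic concrete, here is a rough sketch. The price, message size, and usage pattern are assumptions chosen for illustration, not quotes for any particular model or vendor; the point is the shape of the curve, not the exact dollar figure.

```python
# Rough cost sketch. Every number below is an illustrative assumption.

PRICE_PER_MILLION_INPUT_TOKENS = 3.00   # assumed dollars per million input tokens
TOKENS_PER_MESSAGE = 500                # assumed average size of one exchange
MESSAGES_PER_DAY = 20
DAYS = 365

def cost_full_history() -> float:
    """Re-send the entire conversation history with every message."""
    total_tokens = 0
    history = 0
    for _ in range(MESSAGES_PER_DAY * DAYS):
        history += TOKENS_PER_MESSAGE
        total_tokens += history          # every message re-reads all prior turns
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

def cost_fixed_memory(summary_tokens: int = 2_000) -> float:
    """Send only a bounded memory summary plus the current message."""
    per_message = summary_tokens + TOKENS_PER_MESSAGE
    total_tokens = per_message * MESSAGES_PER_DAY * DAYS
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

print(f"full history: ${cost_full_history():,.0f}")   # grows quadratically with messages
print(f"fixed memory: ${cost_fixed_memory():,.0f}")   # grows linearly with messages
```

Under these toy assumptions the full-history approach lands in the tens of thousands of dollars per user per year, while a bounded memory representation stays in the tens. The exact figures will be wrong; the gap between the two curves will not be.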
Long context degrades performance rather than improving it
This is the part that surprises people. There is now a substantial body of research showing that as context windows fill up, model performance does not stay flat. It degrades. Models begin to lose track of what is in the middle of long contexts, begin to weight recent information disproportionately, and begin to make mistakes they would not make on the same information presented in a focused prompt. OpenAI's own guidance on their reasoning models discourages overloading them with examples for exactly this reason. More context does not produce better answers past a certain point. It produces worse ones.
A pile of history is not the same as understanding
Even if cost and degradation were not problems, you would still be left with the fundamental issue: an enormous unfiltered transcript of past conversations is not what a human assistant uses to remember you. A good human collaborator does not replay every conversation in their head before responding to you. They have an internal model of who you are. They have opinions, formed over time, about how you think and what you care about. They have noticed patterns. The transcript is the input that produced that model. The model itself is what gets used.
Putting raw history into the context window is like asking someone to re-read your entire email archive before every reply. It is technically possible. It is also nothing like how memory actually works.
Why retrieval is not memory
The second answer the field has reached for is retrieval-augmented generation, usually called RAG. The idea is straightforward: store everything in a vector database, embed each piece of text into a high-dimensional vector, and when a new query comes in, find the chunks most similar to that query and put them in the context window.
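Here is a minimal sketch of that mechanism. The embed function is a stand-in for a real embedding model, so the vectors it produces carry no meaning; the only point is the shape of the store-embed-retrieve loop.

```python
import numpy as np

# Minimal sketch of the retrieve-by-similarity step described above.
# embed() is a placeholder; a real system would call an embedding model.

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in embedding: a fixed pseudo-random unit vector per text."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

class VectorStore:
    def __init__(self):
        self.chunks: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, chunk: str) -> None:
        self.chunks.append(chunk)
        self.vectors.append(embed(chunk))

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        scores = [float(q @ v) for v in self.vectors]   # dot product of unit vectors = cosine similarity
        top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        return [self.chunks[i] for i in top]

store = VectorStore()
store.add("User mentioned their favorite color is blue.")
store.add("User is planning their daughter's birthday party in November.")
print(store.retrieve("help me plan the party", k=1))
```

Notice what the loop does and does not do: it ranks stored text by similarity to the query, and nothing else. There is no step where two stored chunks get connected to each other.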
RAG is a useful tool. It is also, by itself, a poor substitute for memory.
The problem is that RAG retrieves on similarity, and similarity is not the same as relevance. If a user mentions in passing in March that their favorite color is blue, and then in November mentions they are planning their daughter's birthday party, a RAG system will not connect those two facts, because the words "favorite color" and "birthday party" are not particularly close in embedding space. A good human assistant would connect them immediately. The RAG system retrieves what looks similar to what was just said, not what is actually relevant.
There is a useful analogy from the team at Letta, who write about this clearly: imagine asking a student to write a book report by handing them a pile of shredded pages from the book sorted by which ones contain the most occurrences of the word "theme." That is what RAG does. It finds the shreds. It does not produce understanding. Understanding requires integration over time, not lookup at query time.
There is a deeper issue underneath this. Retrieval is reactive. It only fires when something asks for it. A real memory system has to do work even when nobody is talking to it. It has to notice that two things observed weeks apart are connected. It has to form hypotheses about what a person cares about, and it has to test those hypotheses against new information as it comes in. None of that happens in a vector database. A vector database is a library, and the library only opens when you walk in and ask for a book.
What memory actually needs to do
The most useful reframe I have come across in this field is the idea that memory in AI systems should be understood as reasoning, not storage. The team at Plastic Labs, who have written extensively on this, put it well: human memory is not a recording. It is a continuous process of prediction, error correction, and consolidation. We do not store the past as a fixed transcript. We rebuild it constantly, and the act of rebuilding is what produces understanding.
For an AI system to genuinely remember a user, it has to do something similar. It has to take the raw stream of interactions and convert it, over time, into a structured representation of who that person is and what matters to them. That representation has to update as new information arrives, and it has to update in ways that account for older context that the new information bears on.
There are three things this kind of system has to handle, and they correspond roughly to what cognitive scientists call episodic, semantic, and procedural memory.
Episodic memory: what happened
Episodic memory is the record of specific events. The conversation last Tuesday. The decision made in the planning session three weeks ago. The mistake that got corrected last month. An agent without episodic memory cannot reference its own history with you in any meaningful way. It cannot say "we decided this last week." It cannot remember that the email it sent yesterday is the one you are replying to today.
Storing conversation transcripts is the easy part of episodic memory. Knowing which transcripts to consult, and in what order, and how to integrate them into the current moment, is the hard part. This is where most current systems fall down. They have the data. They do not have the index that knows which data is relevant to what is happening right now.
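One way to make that concrete: the record worth keeping is not the transcript itself but a thin layer of metadata over it, written when the conversation is digested rather than when it is queried. The sketch below assumes a summary and topic tags exist for each episode, and uses a deliberately naive relevance score (topic overlap decayed by age) purely as a placeholder for the hard part.

```python
from dataclasses import dataclass, field
from datetime import datetime

# One possible shape for an episodic record. The transcript is kept,
# but the usable part is the metadata layered on top of it.

@dataclass
class Episode:
    when: datetime
    summary: str                    # one-line gist, written at consolidation time
    topics: set[str] = field(default_factory=set)
    transcript_id: str = ""         # pointer to the raw log, consulted only if needed

def score(episode: Episode, current_topics: set[str], now: datetime) -> float:
    """Naive placeholder: favor topical overlap, decay with age."""
    overlap = len(episode.topics & current_topics)
    age_days = (now - episode.when).days + 1
    return overlap / age_days

def relevant_episodes(episodes: list[Episode], current_topics: set[str],
                      now: datetime, k: int = 3) -> list[Episode]:
    return sorted(episodes, key=lambda e: score(e, current_topics, now), reverse=True)[:k]
```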
Semantic memory: what you know
Semantic memory is the structured knowledge an agent accumulates about the world and about you. Your role at work. Your preferences. The names of the people in your life. The recurring projects you care about. The vocabulary you use for things in your business. None of this is stored as conversation. It is stored as structured facts, which is what makes it usable.
The interesting research question here is how those facts get formed. You cannot rely on the user to type them in. You cannot rely on regex or keyword extraction to catch them. The system has to read the conversation, decide what is worth remembering, decide how to phrase it as a fact, and decide where to file it. And then, critically, it has to revise those facts as new information comes in. If the user tells you in January they are working on Project A, and in May they mention Project A is finished, the fact in the structured representation needs to update. Otherwise the system spends the rest of the year asking about a project that no longer exists.
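A minimal sketch of the revision behavior, assuming the extraction step has already happened and facts arrive as structured subject-attribute-value triples. In a real system a model would produce those triples from conversation; here they are typed in by hand to show how an old belief gets superseded rather than silently kept.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Fact:
    subject: str
    attribute: str
    value: str
    observed_at: datetime
    superseded: bool = False

class FactStore:
    def __init__(self):
        self.facts: list[Fact] = []

    def upsert(self, subject: str, attribute: str, value: str, when: datetime) -> None:
        for fact in self.facts:
            if fact.subject == subject and fact.attribute == attribute and not fact.superseded:
                fact.superseded = True      # keep the old belief for history, stop acting on it
        self.facts.append(Fact(subject, attribute, value, when))

    def current(self, subject: str, attribute: str) -> str | None:
        live = [f for f in self.facts
                if f.subject == subject and f.attribute == attribute and not f.superseded]
        return live[-1].value if live else None

store = FactStore()
store.upsert("user", "active_project", "Project A", datetime(2025, 1, 10))
store.upsert("user", "active_project", "Project A finished", datetime(2025, 5, 2))
print(store.current("user", "active_project"))   # the January fact no longer drives behavior
```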
Procedural memory: how things get done
Procedural memory is the knowledge of how to do things. For an agent, this includes the user's preferred way of writing emails, the structure they like for status updates, the tone they use with different audiences, the order they like tasks executed in. This kind of knowledge cannot be stated directly. It has to be inferred from behavior. The agent watches the user accept some outputs and reject others, and over time it builds up a sense of what fits and what does not.
This is the hardest of the three, and the one current systems are worst at, because procedural knowledge is implicit. It does not show up in any single message. It shows up in the pattern across hundreds of messages. A real memory system has to be doing the work of pattern recognition continuously, in the background, without waiting to be asked.
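A toy sketch of what that pattern recognition could look like, with hand-written feature detectors standing in for whatever signals a real system would learn: tally which features show up in drafts the user kept versus drafts they rejected, and surface the ones that discriminate.

```python
from collections import Counter

# Naive sketch of procedural inference. The feature detectors are
# illustrative stand-ins, not a proposal for a real feature set.

FEATURES = {
    "bullet_points": lambda text: "- " in text,
    "short_paragraphs": lambda text: all(len(p) < 400 for p in text.split("\n\n")),
    "greets_by_name": lambda text: text.lower().startswith("hi "),
}

def feature_counts(drafts: list[str]) -> Counter:
    counts = Counter()
    for draft in drafts:
        for name, detect in FEATURES.items():
            if detect(draft):
                counts[name] += 1
    return counts

def inferred_preferences(accepted: list[str], rejected: list[str]) -> dict[str, float]:
    a, r = feature_counts(accepted), feature_counts(rejected)
    prefs = {}
    for name in FEATURES:
        accept_rate = a[name] / max(len(accepted), 1)
        reject_rate = r[name] / max(len(rejected), 1)
        prefs[name] = accept_rate - reject_rate   # positive means the user tends to keep it
    return prefs
```

Nothing here is stated by the user. The signal only exists in the accumulated accept-and-reject record, which is exactly why it never shows up in any single message.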
The consolidation problem
There is one more piece of this that does not get discussed enough, and it might be the most important. Human memory does most of its real work when you are not actively using it. Sleep is when memories get sorted, integrated, pruned, and connected to other memories. The brain runs reasoning loops over the day's experiences and decides what to keep, what to throw away, and what to merge with what.
AI systems do not do this. By default, an LLM only thinks when you are talking to it. The moment you stop sending messages, the model goes quiet. Whatever raw conversation just happened sits in a database somewhere, untouched, waiting for the next time it might be retrieved.
This is a missed opportunity, and it is one of the most actively studied frontiers in agent research right now. The basic idea, sometimes called sleep-time compute, is that an agent should be doing reasoning work in the background between user interactions. Re-reading recent conversations. Updating its model of the user. Forming hypotheses. Cross-referencing things observed in different sessions. The work that a human assistant does in the shower or on the drive home, when the day's events are settling into something coherent.
Research from Letta and UC Berkeley on this approach has shown that doing work offline, before the next query arrives, can produce roughly equivalent answers using a small fraction of the compute at query time. This is not just a performance optimization. It is a different shape of system entirely. It means the agent is doing real work even when no user is present, in the same way an employee is doing real work even when no manager is watching.
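As a sketch of the shape of that system, here is what a background consolidation loop might look like, with the model calls and memory stores left as explicit placeholders. The only point is where the expensive reading and integration happens: between sessions, not at query time.

```python
import time

# Sketch of a background consolidation loop, run between user sessions.
# llm_summarize and llm_extract_facts stand in for language model calls;
# episode_store and fact_store stand in for whatever stores the system uses.

def llm_summarize(transcript: str) -> str:
    raise NotImplementedError("placeholder for an LLM call")

def llm_extract_facts(transcript: str) -> list[tuple[str, str, str]]:
    raise NotImplementedError("placeholder for an LLM call")

def consolidate(unprocessed_transcripts: list[str], episode_store, fact_store) -> None:
    """Do the slow reading and integration now, so query time stays cheap."""
    for transcript in unprocessed_transcripts:
        episode_store.append(llm_summarize(transcript))
        for subject, attribute, value in llm_extract_facts(transcript):
            fact_store.upsert(subject, attribute, value)

def idle_loop(get_new_transcripts, episode_store, fact_store, interval_s: int = 600) -> None:
    """Poll for new raw conversation and fold it into structured memory."""
    while True:
        batch = get_new_transcripts()
        if batch:
            consolidate(batch, episode_store, fact_store)
        time.sleep(interval_s)
```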
What we are building toward
When we talk about memory at Myah, we mean something specific. We do not mean a transcript log. We do not mean a vector database with similarity search bolted on. We mean a system that does the work of building and maintaining an understanding of the people it works with, continuously, in the background, in a way that compounds over time.
The technical pieces of this are not exotic. They are well documented in the recent literature. You need a pipeline that consumes interactions and converts them into structured representations. You need a reasoning layer that can run during idle time and refine those representations as new information arrives. You need a retrieval layer that knows the difference between similar text and relevant text. You need the ability to forget, or at least to deprioritize, things that have stopped mattering. You need a way for the agent to reference its own evolving understanding when it is producing output, not just at the moment a question is asked.
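One way to carve those pieces up is as a handful of interfaces. The names below are illustrative only and do not correspond to any particular product's API; they are a sketch of the responsibilities, not an implementation.

```python
from typing import Protocol

class IngestionPipeline(Protocol):
    def ingest(self, interaction: str) -> None:
        """Convert a raw interaction into structured memory updates."""

class IdleReasoner(Protocol):
    def consolidate(self) -> None:
        """Run during downtime: merge, cross-reference, and revise memories."""

class RelevanceRetriever(Protocol):
    def recall(self, situation: str) -> list[str]:
        """Return what matters for this situation, not just what looks similar."""

class ForgettingPolicy(Protocol):
    def decay(self) -> None:
        """Deprioritize or drop memories that have stopped mattering."""

class MemoryAwareAgent(Protocol):
    def respond(self, message: str) -> str:
        """Produce output that draws on the evolving model of the user."""
```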
None of this is impossible. Some parts of it are already shipping in production systems. What is mostly missing right now is the willingness to treat memory as a first-class problem, on the same level as the model itself, instead of treating it as an afterthought you can solve with a longer prompt.
The agents that will matter over the next few years will be the ones that learn from the people they work with. Not in the sense of fine tuning on user data. In the much more ordinary sense of remembering what was said, noticing what mattered, and showing up the next morning a little smarter about you than they were the day before.
That is the bottleneck. Solving it is most of the work.
