The Infra Pod - Building the Future of AI with Long-term Memory (Chat with Charles from Letta)
Episode Date: August 11, 2025

In this episode of The Infra Pod, Tim (Essence VC) and Ian (Keycard.sh) delve into the fascinating world of AI memory with Charles, the CEO and co-founder of Letta. They explore the intricacies of memory in AI, its current state, how it's implemented in various applications, and its potential to revolutionize productivity tools such as coding assistants. Charles shares insights on how Letta is leading the way by creating platforms for agents with long-term memory, the future implications of shared memory systems, and the concept of "sleep time compute." Tune in to gain a deeper understanding of why memory could become more valuable than the models themselves in the future of AI.

00:24 The Concept of Memory in AI
02:04 Expanding on AI Memory
02:17 Implementing Effective AI Memory
01:55 Current and Future State of AI Memory
11:37 The Challenges and Opportunities of AI Memory
11:37 Real-World Applications and Limitations
35:14 Introducing Letta's Solutions
36:35 Future Trends and Predictions
Transcript
Welcome to the InfraPod.
This is Tim from Essence, and Ian, let's go.
Hey, this is Ian, lover of agentic workflows,
believer in Level 5 autonomy.
I'm super excited today to be joined by Charles,
the CEO and co-founder of Letta AI.
Charles, how are you doing?
Tell us a little about yourself
and what got you to start Letta.
Yeah, I'm doing great.
Thanks for asking.
So prior to Letta, Letta is about a year old now.
I graduated about a year ago from the PhD program at Berkeley,
with a few of my colleagues, my co-founder,
also the head of research at Letta.
We were all doing kind of our PhDs in various things at Berkeley.
I was doing more of like RL.
Co-founder Sarah is more of an ML systems person,
you know, like feature stores,
like the predecessors to vector databases.
And then Kevin was working on like more pure NLP.
And then the last year of our PhD,
we all shared a cubicle together,
shared an advisor.
The last year of our PhD we worked on this project,
MemGPT, together.
And MemGPT, I think, to place it in time, it was probably the first, like, academic paper that had, like, a very large GitHub repo associated with it that was addressing the whole memory problem with LLMs.
So at the time, like, the reason we started working on MemGPT was because GPT-4 was just so damn good.
And it also had such a small context window, like 4K tokens on the API.
So, yeah, the idea with MemGPT was you just kind of make the AI or the language model self-aware that it has a memory problem,
and you give it the tools to address this memory problem,
kind of like read, write operations to external memory.
And it just kind of worked.
Like the first prototype we made,
I think it just worked out of the box.
It's kind of crazy.
I think function calling had just kind of happened around the same time as well.
So that's the reason we were thinking more from like a function calling perspective.
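To make that concrete, here is a hypothetical sketch of the MemGPT idea as described: memory read/write operations exposed to the model as ordinary function-calling tools, with the current memory re-rendered into every context window. The tool name and schema are illustrative, not the paper's exact API.

```python
# Hypothetical sketch: memory as function-calling tools, MemGPT-style.
import json

MEMORY = {"human": "", "persona": "I am a helpful assistant."}

def core_memory_append(section: str, content: str) -> str:
    """Tool the model can call to persist an important fact."""
    MEMORY[section] = (MEMORY[section] + "\n" + content).strip()
    return "OK"

# OpenAI-style tool schema; names are illustrative.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "core_memory_append",
        "description": "Write an important fact to persistent memory.",
        "parameters": {
            "type": "object",
            "properties": {
                "section": {"type": "string", "enum": list(MEMORY)},
                "content": {"type": "string"},
            },
            "required": ["section", "content"],
        },
    },
}]

def system_prompt() -> str:
    # The model is made "self-aware" of its memory problem: memory
    # contents are injected into the context on every turn.
    return ("You have a limited context window. Save important facts "
            "with core_memory_append.\n"
            f"Current memory:\n{json.dumps(MEMORY, indent=2)}")
```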
And then Leta, the company is really about taking these ideas to the limit.
And I think the ideas around memory management and kind of the greater system,
what do you need beyond the language model to create
AI that is a lot more like humans than just a chatbot? Those are the problems that we're
trying to solve as a company. Well, and when you say memory, can you expand on how you define
memory in this context? And what is stored in that memory? What types of data would
you find, for what types of use cases? Yeah, I think memory is super overloaded. So it's a little bit
tricky to talk about because I think as soon as you mentioned memory, people have different ideas
in their head. I think today, if you just off the cuff mentioned memory and AI, people probably
think about, like, conversational memory that's, like, baked into ChatGPT memory or
Gemini memory.
Like, every single chat bot now has, like, a memory bank.
And that's really about, you know, just capturing important information that should persist
indefinitely in the context window.
And that's also kind of what the MemGPT paper was very focused on.
So if you say something like, you know, I really prefer short, concise responses, don't
give me any more slop.
The AI should probably, like, write that to memory.
And for every subsequent chat conversation, always kind of behave in that manner.
Similarly, if you're, like, doing a coding assistant, and you say, like, hey, we always write our code this way.
Or, like, we always finish a PR in this style.
If you're using an AI coding assistant, it should also write that to memory and, like, kind of do that in perpetuity until you change it.
That's what people often think about when they think about, like, AI memory.
I think another thing that happened with the MemGPT research paper is we were talking about this idea of the LLM OS.
And I think the concept there of memory is actually much more general.
it's all just about tokens
and it's about the real primitives
of the machine you're building
and I think with anything
that's AI built on top of language models
it's a language model
it's tokens in tokens out
and I think memory in that context
is really just all the tokens
what are the tokens going into the compute unit
what are the tokens coming out
and memory management is just token management
so obviously that you know
semantic memory or like conversational memory
that's a subset of that kind of memory
but I think everything to do with
making LLMs into agents is all about context management, like tool calling. It's all about how do you
format the tools, how do you extract the tool calls. Reasoning, it's all about how do you, like, properly
start the reasoning injection and, like, how do you end it faster? Like, these papers, you know,
they have, like, variable reasoning effort by continually appending, you know, "think harder, think
harder," or appending, you know, "hold on, wait a minute, let me think again." And then you run the
LLM inference again. I think that's also memory management, in that it's token management. So I think
the concept of memory is actually much more broad than just like this conversational memory.
It's really just about turning the language model into the intelligent machine.
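As a concrete illustration of "reasoning effort as token management," here is a hedged sketch of the appending trick Charles describes; `complete` is a placeholder for any text-completion client, not a specific API.

```python
# Sketch: variable reasoning effort by appending a continuation cue
# and re-running inference, as described above.

def complete(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def think_harder(question: str, extra_rounds: int = 2) -> str:
    transcript = f"Question: {question}\nReasoning:"
    transcript += complete(transcript)
    for _ in range(extra_rounds):
        # Appending the cue forces the model to keep reasoning
        # instead of committing to its first answer.
        transcript += "\nHold on, wait a minute, let me think again."
        transcript += complete(transcript)
    transcript += "\nFinal answer:"
    return complete(transcript)
```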
I mean, that's an incredible, incredibly good description.
I'm curious, so today with Letta, and more broadly, like, where are we in that journey?
Like, how much of this exists?
How much more is there to build?
How, like, where are we in this sort of development?
And then, like, yeah, help us understand sort of like, do these machines exist and work great?
or are we, like, early in this process?
Yeah.
We're early, but we're definitely further along
than when we first started the MemGPT project.
So I think, like, when MemGPT came out,
function calling had just started.
It was in its infancy.
I think in the original MemGPT paper,
if you look at the V1,
there's like a caveat at the end of the paper
that's like, by the way,
everything you're looking at in this paper
only works with GPT-4.
Like, it does not work with any open model.
And that very quickly actually changed
and open models started supporting function calling better.
But I think some of the primitives of this LLM OS have started to settle.
And I think one of those primitives is tool calling.
Everything is agentic.
Everything's orchestrated through tool calling.
And I think that is what you're seeing in these much more agentic applications like Claude Code.
I think everyone would agree Claude Code is, like, you know, really leaning heavily agentic.
You know, it's not a workflow.
It's an agent.
And Claude Code just leans extremely heavily on tool calling.
The Claude Plays Pokemon example, right?
It has memory, but the memory is orchestrated through tool calls, like MemGPT,
kind of like making these files that constitute the agent's memory.
So I think one thing that has happened is we've kind of standardized on tool calling.
And I think one thing that hasn't happened beyond that is understanding how to have functional
AI memory that is not just a tool.
People have often asked me this question, like, why isn't Letta just an MCP server?
Like, why can I not just have memory as a tool in the same way have like my Google Drive
as a tool?
And I think a very quick counterexample to make this point
is, like, if you want to do something like sleep time compute,
which you can talk about more later,
where you have an agent that's kind of observing everything that's happening
with the main agent, kind of a subconscious agent,
and it's maintaining a very long-term,
really, really high-quality memory bank,
if you want to plug that into Cursor as an MCP tool,
well, what are the kind of, like, interfaces of the MCP tool,
the memory MCP tool?
It's probably something like create_memory and read_memory.
But you are now relying on Cursor to always call create_memory.
Right? And is Cursor, when it's in the middle of your vibe coding session, going to be properly calling create_memory when it needs to be calling create_memory? Probably not. You're basically relying on the other client or the service to also be good at memory management. If you Google, how do I do AI memory, I think a lot of the ideas from the original MemGPT paper around tools and memory, like rewrite, that kind of thing, that's everywhere. Like, there are startups that are built entirely around this premise, like a memory bank for your AI that's maybe, like, an MCP server. But I think, you know, part of the reason
we still talk about memory is because it's still a very big problem.
I think once memory is solved, we stop talking about it, because it's just, like, you know,
it's obvious and it, like, exists in every single AI API. But it's still not solved.
And I think part of the reason it's not solved is because we haven't yet transcended
beyond the memory as a tool.
I think memory is just way too fundamental to AI to just be a tool, in the same way Google Drive
is a tool.
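For illustration, here is that counterexample in code form: a hypothetical memory-as-a-tool server whose usefulness depends entirely on whether the host ever calls it. The class and method names are made up, not a real MCP server's API.

```python
# Hypothetical memory-as-a-tool interface, and its failure mode.

class MemoryToolServer:
    def __init__(self):
        self.memories: list[str] = []

    def create_memory(self, text: str) -> None:
        self.memories.append(text)

    def read_memory(self, query: str) -> list[str]:
        return [m for m in self.memories if query.lower() in m.lower()]

# The failure mode: the host app (e.g. an IDE agent) decides when to
# call these. If it never calls create_memory during a long session,
# the memory bank silently stays empty -- memory quality now depends
# entirely on someone else's agent loop.
```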
That makes total sense.
You know, one of the things I've often thought about like MCP, I have some, you know,
when MCP came out, I was like, this is a beautiful like version one.
not even version 0.01 of a future
standardized protocol for an operating system
and kind of what you're saying is like
memory as a tool doesn't work,
because you have to rely on the top level
LLM to actually call the tool.
It's like you actually need a protocol layer
that's between the fundamental model,
the data that's flowing into the model and flowing out,
and a place to plug in. And I think that's true
not just of memory, but it's fundamentally true
of how you can build pluggable agentic
experiences, is that right now
we're fundamentally reliant on, you know, whatever's orchestrating the LLM
to actually then call a tool in flow, and there's no guarantee of when that tool gets called.
There's no, like, hook system, right, that says, hey, pre-submitting the context window,
run this thing.
Like, we can't, like, register hooks or do anything fun that you'd find in, like,
a natural, normal deterministic system.
Tool calls are non-deterministic.
You know, that's part of the problem.
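A sketch of the kind of hook layer Ian is gesturing at: deterministic callbacks that run before every context submission, rather than tools the model may or may not decide to call. This is entirely hypothetical; no such standard protocol layer exists today.

```python
# Hypothetical pre-submit hook layer for an LLM orchestrator.
from typing import Callable

Hook = Callable[[list[dict]], list[dict]]  # rewrites the message list

PRE_SUBMIT_HOOKS: list[Hook] = []

def register_pre_submit(hook: Hook) -> None:
    PRE_SUBMIT_HOOKS.append(hook)

def submit(messages: list[dict]) -> list[dict]:
    # Hooks run unconditionally on every inference call -- unlike a
    # tool call, which is a non-deterministic model decision.
    for hook in PRE_SUBMIT_HOOKS:
        messages = hook(messages)
    return messages  # then hand off to the LLM client

# e.g. always inject the latest memory block before submission:
register_pre_submit(lambda msgs: (
    [{"role": "system", "content": "Memory: user prefers concise replies."}]
    + msgs))
```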
I'm also very curious to sort of think about, like, in traditional computing, like,
management of memory has been one of the, like, most important components broad space, right?
Like, we've invested a lot of time, like, in the last 20 years to ensure that the right bytes are in the L1
cache versus L2 cache versus in RAM versus, you know, in swap versus, like, having to be read out
of, like, a magnetic disk.
And that hierarchy has helped us a lot here.
But, like, I'm curious, what are the parallels here? Like, are there latency
and other, like, parallels in terms of this new, broader memory management for agents?
Or is that, like, a solved problem?
I'm kind of curious how you manage it.
Because I can think about, like, memory for an agent, like,
it could be an absolute massive shit ton of data.
If there's, like, a personal agent, like,
there's a lot to know about Ian and what he's done
in order to be able to answer some random, like, natural prompt
I may send to some sort of agentic system.
So I'd love to sort of also get your thoughts on that as well.
Yeah, I think maybe talk about the latency thing first.
So if you've taken any sort of like operating system class or whatever,
or I think if you, like, in the PhD program at Berkeley,
you have to, like, take a prelim
and you have to memorize the latency table, you know,
where it's like staggered, like, log scale.
But I think with, like, traditional operating systems
and, like, memory management,
the number one problem you hit is, like, latency on retrieval.
And that's, like, how you design all your algorithms,
just, like, minimize latency on retrieval.
And then with language models, at least currently now
and probably for the foreseeable future,
retrieval is an inconsequential amount of latency
compared to the actual LLM inference.
So actually, I don't think retrieval is, like, the hardest part, or that you'd have to think about your, you know, your memory placement algorithms in the context of retrieving from, you know, some vector database, like pgvector or whatever.
I think it's actually more on the creation side.
Like, creating these memories and updating these memories is actually very, very latency intensive, right?
Because you can imagine, just to create a memory, like today, you know, Charles's mom's name is Brenda, that is one LLM call.
but probably that memory when it's created
and you have a ton of prior context
should be very cascading.
It should kind of like inform
every other memory that was created in the system.
It should probably itself cascade
into like thousands of LLM invocations.
And I think what will end up happening
as these systems mature
and we kind of get out of, you know,
like chatbots and move more towards long-running,
stateful agents is you'll have these context windows
where segments of the context window
that have memory are so enriched
by so many LLM invocations
and you'll have like these pieces
of the context window that effectively costs
thousands of dollars worth of
Cloud API credits to have been created
and they were created over millions of trajectories.
The analogy breaks down a little bit
in the dynamics of retrieval versus generation cost
because, yeah, with the traditional OS,
like the generation cost is kind of fixed
by the clock speed of your CPU or something.
Whereas, yeah, I think with LLMs
it's a little bit different, more dynamic.
There's just so much we can talk about.
I think we can talk about the memory thing,
but maybe I want to jump into the sleep time compute a little bit.
Maybe tell us about
what sleep time compute is, because I think folks may
not actually know what that term is, but I think
the idea is actually pretty cool, and
it does work in
conjunction with memory, I think
a lot, too. Yeah, so I think one
very obvious difference between
LM-driven intelligence and human intelligence is that
human brains, like, they don't turn off. Like, your
brain doesn't, like, drop to like zero voltage
and, like, you just go into a cryogenic state
every night. Like, when you go to sleep, your brain
is still, like, kicking, things are firing.
But with language models, you have to
actively, like, run the machine,
to get the intelligence to, like, kick on.
I think if you're trying to make human-like intelligence from LLMs,
it becomes very unnatural to kind of go back to a chatbot
or chat thread that you have with some agent.
And that agent has done literally nothing since you were gone, right?
It hasn't reflected.
It hasn't attempted to do anything, like, proactive.
It's just been sitting there static.
Like, the weights were cached somewhere,
or maybe not even cached.
They were, like, just written to disk, and so they've been loading.
They've been loaded.
So I think sleep time compute, at a very high level, is
about, can we just run language models all the time? And if we can run language models all the
time, how can we make that actually benefit their intelligence? And I think, like, you know,
you mentioned that the key element to this is memory. I think if you are running language models all
the time, if they don't have memory, there's no point, because they won't remember anything that
happened when they were running while you're gone. So this entire thing hinges on having memory. And
I think the other context to look at sleep time compute from is the test time compute angle,
which is kind of why we named it sleep time compute,
there's been a lot of focus,
just a tremendous amount of energy around,
hey, it turns out that if you just run the language model for longer
when you ask a question,
it becomes more intelligent.
It's kind of like an emerging capability, right?
But the thing is,
it's actually very expensive to run language models
when you ask them to do something.
That's why it's so painful, you know,
to kick on Deep Research or o3.
I mean, it's like, if o3 was as fast as 4o,
why would anyone ever use 4o?
But the fact is it's a tradeoff.
You're navigating some, like, Pareto
frontier. And I think if you could basically do the same thing, like get the same kind of
benefits, like increase the intelligence by spending more compute, but in a way that is like more
asynchronous, where the human doesn't have to be, like, waiting for this intelligence, like, turbo boost
to kick in. It just kicked in while they were gone. And I think that actually, if you do have some
sort of mechanism, you have some sort of idea about how can we scale compute when the user's not
interacting with the agent and make that agent more intelligent, that really is a pretty tremendous
unlock, because it's actually much more scalable than test time compute. Because test time
compute is extremely expensive, as it has to be done in real time, where the user notices. For sleep time compute,
you can also just saturate empty compute you have in your data center. Like, I mean, all these
companies would love to saturate their compute, right? They have batch APIs, they want people to
use them, they don't want, like, that electricity to go to waste. And so, yeah, test time compute is the idea
that I'm thinking harder, right, when a context comes in, where sleep time compute is basically pre-computing
that hard thinking in a prior process and memorizing
it. That's kind of how I view it. And it doesn't feel like those are actually in
conflict with each other. You can actually do both, right? You can actually do test time compute
if I don't know what context I will be passing in in the first place. But based on that,
maybe it can even inform what other sleep time compute can be done. I just wonder how
you view it, because I feel like these are so new. I don't know if people even know what
examples people even do right now. Like, what are some real-world examples people are actually using?
And what have you found, folks, are they choosing one or another, or are they choosing both?
Or maybe it's so new that nobody is even doing these right now.
Yeah, I think it's pretty new.
I think like a lot of the scaffolding and engineering people have set up has not been designed
to run agents for very long periods of time asynchronously.
Everything is kind of like meant for this pattern.
It was previously for this pattern where you, like, spawn a POST SSE request against, like,
the LLM API, you get back the tokens, you hold for maybe, like, 20 seconds, five seconds.
And then now, you know, the infrastructure has changed a little bit, and we expect, like, you either open, like, a heinously long POST SSE, or you maybe, like, poll instead, like on the Responses API.
And that's for test time compute.
But to give a concrete example, there really actually is no batch agents API, other than the one we built in the Letta open source project, because very few people are actually running agents in batches.
And, like, that's what you would do for sleep time compute, right?
You'd have, like, a ton of agents all running, you know, in a huge
batch, but taking advantage of the underlying, like, LLM batch invocations.
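Roughly, "running agents in batches" means stepping many sleep-time agents concurrently so their requests can ride an underlying LLM batch API. A generic asyncio sketch of that shape, not Letta's actual batch agents API:

```python
# Sketch: many sleep-time agents stepped concurrently.
import asyncio

async def step_agent(agent_id: str) -> str:
    # Stand-in for one agent step (build context, call the LLM via a
    # batch endpoint, apply any memory writes).
    await asyncio.sleep(0)  # placeholder for the batched LLM call
    return f"{agent_id}: reflected and updated memory"

async def sleep_time_pass(agent_ids: list[str]) -> list[str]:
    # Fan out all agents at once; the provider's batch API amortizes cost.
    return await asyncio.gather(*(step_agent(a) for a in agent_ids))

if __name__ == "__main__":
    print(asyncio.run(sleep_time_pass([f"agent-{i}" for i in range(100)])))
```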
I think a very obvious example of, like, why you want sleep time compute is in very
common use cases, like coding.
Like, I think if you, kind of, like, initiate Claude Code on your laptop, on your computer,
and you get sidetracked and you, like, have to go respond to Slack,
if you're living in a world where, like, inference is free for a moment,
Claude Code should have kind of, like, been inspecting your repo, kind of probing around
the recent, like, git log. It should have built a mental map,
kind of like it made, like, a deep wiki, you know, like the Devin DeepWiki thing.
It should have done that by itself,
because any sort of question you now ask
Claude Code when you start pair programming,
it will have much better answers for,
because it will have done that, like, pre-computation.
and then similarly, when you kind of, like, wrap up your session
with Claude Code,
Claude Code probably should be looking at the entire transcript
of your interaction and reflecting
on what happened and thinking like
hey what were some patterns I noticed
about the thing that they kept on asking me to do
that I could probably use to like update my memory
and make my future behavior better
I noticed that we kind of looped a lot editing, like, the globals.css file.
And it turns out that, like, when I do that, I should really also edit, like, tailwind.config or whatever.
And I'll write that to memory.
So there's just so much, so many things you can do related to memory, pre-computation, post-computation in just coding alone.
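A minimal sketch of that post-session reflection, assuming a placeholder `complete` LLM client; the file names are hypothetical:

```python
# Sketch: end-of-session reflection that writes durable memory.
from pathlib import Path

def complete(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

REFLECT_PROMPT = """You are reviewing a coding session transcript.
List patterns the user repeatedly asked for (style, files that change
together, workflows) as short rules to remember.

Transcript:
{transcript}
"""

def reflect_on_session(transcript_path: str,
                       memory_path: str = "MEMORY.md") -> None:
    transcript = Path(transcript_path).read_text()
    rules = complete(REFLECT_PROMPT.format(transcript=transcript))
    # Append, don't overwrite: memory should accumulate across sessions.
    with open(memory_path, "a") as f:
        f.write("\n" + rules)
```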
I think this is actually where like because there's so much energy and incentive around making these coding systems better, I think that's one of the first places you'll see sleep time compute really heavily applied.
you'll have products that basically have
mega max mode
it's no longer just max mode but it's mega max mode
you don't care about cost at all
we'll turn this on and Claude Code will be cooking
the whole day. Like, it'll never not be thinking
about your repo, right? And that's an example of
sleep time compute really taken to the limit.
it's an incredibly cool concept
right? Because, like, functionally, if we were to step back
to a lower level, like, the fundamental problem we
have, if we're, like, on a primitive
level, is how do you give, like, good instructions,
and then how do you ensure that what goes
on top of the instructions, that, like, the data
that's being fed into the LLM, is the right data,
like, to manage what's in the context window, right?
Like, it doesn't matter that we have fixed context windows.
It matters about, like, what's the right context in the context window,
which prompts the right, you know, statistical inference
that then results the right result at the end of the day.
And so this is actually, like, it's super, super fucking cool.
Like, just, I don't always swear on the podcast, but, like.
But, like, it is incredibly cool.
And one of the things, just listening to you, I think about, like, Claude Code and my own,
like, my own small team. We've gone very AI forward, you know,
adopted all the coding
agents, but the worst part is, like, we don't have shared memory.
And the best form of shared memory we have today is, like, these files on disk.
So I'd love to, like, talk to us about how you think about the future of agentic systems
that are multi-tenant, right?
Like, a lot of the time people talk about this in the context of, like, oh, it's Ian's agent.
But it's like, in the context of coding, I don't really want it to be Ian's agent.
And maybe Ian's agent shares memory with other people's agents, right?
Like, there's value in sort of multi-tenancy, and there's value
in the shared context.
And so even that, it sounds like a really interesting problem space
that is actually quite different from anything you had before.
Because if you think of, like, traditional systems, pre-LLMs,
it's like, yeah, I might be preparing a cache,
or I might be doing some roll-ups and putting that someplace,
but it was incredibly structured.
It was incredibly bounded, with static task boundaries.
And this is not, right?
This is like a process of, like, we want the system to self-discover net new knowledge
and then store the net new knowledge in a way that allows
the system to use that knowledge in the future when the time comes, right? So yeah, I'm curious
how you've thought about, you know, the sharing, the querying, and where you kind of see the
future of agentic systems based on the idea that, hey, like, obviously, like memory is like a
foundational element. What's this sort of look like in five years? And how does that change
the way that like software works? Yeah, I think the shared memory thing is super important. I think
it's also incredibly exciting to think about because a lot of the conversations so far has been
about emulating humans, but there's no reason why humans should be the limit.
I mean, I think in some cases, like, consumer, you might want to actually have perfectly
imperfect human-like memory where, like, the AI forgets in the same way a human does.
I think for, like, productivity tools, like coding, your memory systems should be superhuman, right?
It should do things that, like, no team of engineers could ever dream about doing, right?
I feel like, you know, as a small engineering team, it would be kind of sick if we all had,
like, neural links and we had, like, shared, you know, memory on, like, the projects we're working on,
right? But we definitely don't live in that future right now. But you can definitely do that with
agents, right? I mean, in Leta, for example, because we write all the memories to just a table
on Postgres, it means that it's extremely easy to have shared memory where basically agents are
actually dynamically reading from shared memory blocks. And if they write an update, it's immediately
broadcast, because it's just written and, you know, it's transacted in the read when you
actually do the LLM invocation. That is really the future. I think, like, the future is you will have,
like, everyone in your company running Claude Code, or
Cursor, whatever new thing comes out, Cline, and everything will be sharing memory. It probably is not going to be sharing the entire context window, but segments of the context window will be read from, like, some common source. And I think this is also why the true canonical data format for memory is text. And I think we are already starting to see that, with, like, Cursor rules, the Claude Code instruction files. It all looks the same. It's all text. And I think the same thing will happen, you know, even in five years, with this really advanced shared memory, where you'll
have all these different systems that are all reading memory from the same location.
And it will actually look like something that's very human readable because then the human can
also read it.
I think there's this really dramatic advantage to like storing everything in English tokens.
It doesn't have to be English, but just like readable tokens because it means that human
in the loop is not like anything crazy.
It's just a human looking at the same file and editing it.
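As a sketch of the shared-memory-block idea described above: memory is a row of plain text in a database, many agents render the same block into their context, and a write by any one of them is visible to all on the next read. The schema and queries are illustrative, not Letta's actual tables; sqlite3 stands in for Postgres so the sketch runs anywhere.

```python
# Sketch: shared memory blocks as rows of plain text.
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for Postgres
db.execute("CREATE TABLE blocks (label TEXT PRIMARY KEY, value TEXT)")
db.execute("INSERT INTO blocks VALUES "
           "('project', 'Monorepo; Tailwind for styling.')")

def read_block(label: str) -> str:
    row = db.execute("SELECT value FROM blocks WHERE label=?", (label,))
    return row.fetchone()[0]

def write_block(label: str, value: str) -> None:
    # A single UPDATE is the "broadcast": every agent that reads this
    # block before its next LLM call sees the new text.
    db.execute("UPDATE blocks SET value=? WHERE label=?", (value, label))
    db.commit()

# Two different agents (Claude Code, Cursor, ...) rendering one block:
context_a = f"Shared memory:\n{read_block('project')}"
write_block("project", read_block("project")
            + " globals.css and tailwind.config change together.")
context_b = f"Shared memory:\n{read_block('project')}"  # sees the update
```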
And yeah, I think that's partially why, some people also ask me, like, you know,
why wouldn't you want to just do, like, parametric memory and bake everything into the weights?
And like, well, once you bake it into the weights,
it's really hard to understand what the memory is.
And often, like, in coding, you want to know what the memory is.
Like, it's really painful for an AI to be behaving in a way you don't understand
because something is in the memory.
It's even worse if you can't see what's in the memory, like, ChatGPT, right?
You know, it's like behaving in some weird way and you can't even see, like,
why it's behaving in that way.
But you know it's because something in the system prompt is, like, really dirty or, like, corrupted.
But, yeah, I think it'll still be text in five years.
And all these agents,
if they're, like, productivity related, you're going to have, like, fleets of agents running,
and they'll all be sharing memory, right?
And I'm curious how much thought you've given to, like, I was just talking, Tim and I always
say this as we do the podcast, we were just talking about this.
And, like, the thing that immediately came to my mind was, you know, one is we've always said,
like, data has gravity.
I mean, that's why, like, once you deploy data to AWS, it's really hard to get off AWS,
because your data is there, right?
And you have to pay all these egress fees.
And it's very clear like in this conversation that like, okay, the next stage of this is
actual memory has gravity, right?
where whichever agent I've interacted with has the best memory,
which is ultimately like the best personalization of that agent's behavior
to whatever set of tasks it is that I've done with it and repeatedly do.
Like that has gravity to it.
I'm curious, like have you thought about, in the context of coding,
like, the company that has the best set of shared memory,
and has the highest, most intelligent memory for their code base,
is probably going to have the highest velocity with these coding tools.
Right.
And so this is like a new form of advantage.
It's not just enough to, like, deploy to all my developers, here's Claude Code.
I have to, like, deploy to all my developers, here's Claude Code,
and here's, like, a shared memory system, a shared thing that makes, like, Claude Code more than just, you know, a 50% speedup.
It's, like, multi-factor speed up because it has learned about the system, the problem domain we're working on.
And it's specialized to us.
And that's IP.
So I'm kind of curious, you know, have you thought about how this sort of changes, like, SaaS or software, or how people buy software, or how companies buy
software, and, like, how people think about it? Like, are there concepts, like, are we going to live in a world
where people bring their own memory, and that memory is, like, you know, we invest a lot of time
into, like, safeguarding that memory? Because ultimately, like, the memory is almost like a shadow
of the human in some way. Like, I'm curious where your brain has gone on it, you've obviously been thinking
about this a lot, and I have so many, so many endless questions. Yeah, I think memory definitely is
the upcoming moat. The memory will just be much more important than the models. And
I think what that means for like the competitive landscape of like tools and
companies, it's hard to predict.
But I think 100%, you know, the frontier labs that are very invested in consumer
facing products or even dev tool facing products, they are aware of this.
And they want to make sure that they capitalize on this advantage.
And, you know, like ChatGPT, they want to make memory as good as possible.
Because if it's incredibly good, incredibly realistic, it makes the experience better: it raises the ceiling
and it doesn't drop the floor.
The problem, I think, with memory today in ChatGPT is that it
drops the floor stochastically, which is very bad.
Once they fix that, which they will, and it just
completely raises the ceiling, now
when your friend tells you,
hey, have you been on, like, Gemini or DeepSeek,
you're kind of like, well, you know,
ChatGPT already knows everything about me.
And I know the day zero experience will be worse.
So I just don't want to switch.
And like, it's just too much work for me.
So from a consumer angle, I think memory will become the thing
that, like, makes these products incredibly sticky.
From a dev tooling perspective,
I think it's similar in some ways,
that I think companies, you know, the companies that are making the best
or have the most funding around, like, making coding tools,
they also probably want the memory to be a lock-in feature
where, you know, you're using Cursor for longer and longer.
Like, you're accumulating more and more memory on Cursor's cloud,
like the paid service.
And memory is something that will just, like, make your experience on Cursor
much better than it is on Claude Code, and vice versa.
I think more from like a dev tooling infrastructure perspective
and maybe just like business data,
I think memory will become so important that actually for a lot of companies,
they just will not use these tools
that have closed memory systems.
And it just will be like table stakes
that you have to like have open memory
or like we have to be able to bring our own memory.
And in many ways, that's kind of what Letta is.
It's a database for your memory
and all your context for AI.
And one interesting thing that I've observed
from working with like some of these
larger enterprise companies is that I think
today like business data
and all your information about your users,
it's stored in the same way it's been stored for like a decade, right?
It's like very structured
and it's like maybe a lot of it's in like
Salesforce, a lot of it's, like, in XYZ location, and that means that the most valuable, like,
asset you have is maybe like your CRM or something. In the future, when everything is kind of
like driven by agents, actually the parallel to that is the most important and most valuable
asset you have are these like memories of your customers. And they're basically like simulacra
of your customers. And when you have those memories, you can do crazy things where you can like
run simulations on like understanding what they want to buy. What are they going to do? Like if I put
them in my app, how will they, like, interact with my app, based off of the memory I've
compiled on them? But that memory, it's kind of like the exact opposite of a CRM. It's a flat file.
It's one, you know, markdown file. Maybe it's like 10. It's very human readable. So I think, yeah,
the most invaluable business data, you know, the future will be like memories of customers and
memories of your users, whoever they are, as a company.
And I'm very curious, because I think people have an intuitive understanding of what a memory
means, at a very abstract level. But when it comes down to even, like, the architecture,
the actual way to retrieve memory, how you store memory, it sounds like it's actually
very simple. Plain text, you know, really just a bunch of facts sometimes, right? And I'm very
curious how you view memory, where we are today, versus what other advancements do we need
around memory? Because I feel like memory is maybe so new that we don't really
know how to even like evaluate memory. Like what is a good memory system? What is a bad memory system?
Is it just about how many facts you can store?
Is it about, like, can I retrieve
the right facts?
Or, like, how do you think about
even, like, benchmarking memory,
like, this is actually a good retrieval
versus a bad retrieval?
And what is sort of
the way you or your customers
are even thinking about it, like,
you know, is it about the latency?
Or maybe give us some of the factors
that are really important on the memory side.
yeah I think
when technology is moving
this fast, it's very hard for
benchmarks to lead. That's why
everyone's doing vibe-based evaluations, right?
But I think benchmarks are important.
Also, in the world where you can optimize very easily
against benchmarks with things like RL,
benchmarks become even more important.
But for us, I think, the most
important quality of, like, memory
is kind of like task-dependent.
So if you have an agent and the agent is responsible
for making recommendations based off of its memory
of a user, then clearly the KPI is, like,
are the recommendations better?
And there's like downstream metrics, like, did they click on the recommendations?
Or if you ask the user, you know, qualitatively, what do you think about this?
Thumbs up, thumbs down?
Do they rank it thumbs up more than, like, the traditional RecSys version where it's just,
you know, like a number that pops out of the black box?
And I think for chat-based applications, it's obviously going to be like very feedback-based too.
It's like, you know, is engagement higher because, you know, clearly if memory is better,
then engagement should be higher.
And I think that's probably one way that, like, OpenAI, like, tries to track the impact
of their memory. But I'm sure even OpenAI has a very hard time materializing very precisely
what are the metrics related to memory that inform us if the memory is good. And I'm not sure
if those metrics actually have higher signal than their highest-taste testers internally just
looking at the memory and being like, is this good on my own account or is this bad? And I think
that actually is probably the highest signal. And I think memory, like one way to think about memory
and like why memory is important. One analogy I give sometimes is like there's a model predictive
or like a world model's version of memory,
which is that, like, you know,
if we, like, froze the entire world simulation
we're running in right now.
And we had to, like, the task was now to predict exactly the next word
that's going to come out of Charles's mouth.
There's one way you could do this,
which is you have an incredibly,
you have an infinite context model,
and that infinite context model gets to, like,
observe every single step of my life,
at, like, second-by-second granularity, from when I was, like, a zygote in the womb
all the way until this moment in time right now.
And that's like a tremendous amount of data.
It's like,
whatever, insane amount of data, but it's an infinite context model, so it can look at all
that data. And that model has to predict the next word that's coming out of my mouth. And then
there's another version of this, which is you give a one-page summary to a model of like,
kind of me, my personality, like what I'm doing today, kind of like really written to try to
help the model predict what's going to come out of my mouth. And you like compare those two
to each other. And really the ground truth is like the world simulation that had everything.
And what you want the memory to do is like you want the memory to be as close of a reconstruction
to predicting the next step as like the full context version.
And I think that's really what like evolution has done with our brains.
That's like how memory works with humans, right?
And I think another interesting twist here is that the machines we have created,
like this new version of AI, the way it reasons, the way it emulates humans,
is it's like emulating human reasoning written down into text.
And humans don't, we are not infinite context machines.
We're finite context machines.
So that means that the LLM should then be much better at doing this sort of, like,
short context reasoning as opposed to long context reasoning.
So that's really another comment, generally, about, like, long context versus short context.
And why like short context can often be better.
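One hedged way to operationalize that "world model" framing: score a memory summary by how close its predictive power gets to the full history's. Here `logprob_of` is a placeholder for any API that returns token log-likelihoods, and the metric itself is an assumption for illustration, not an established benchmark.

```python
# Sketch: memory quality as predictive reconstruction.

def logprob_of(target: str, context: str) -> float:
    raise NotImplementedError(
        "sum of target-token log-likelihoods given context")

def memory_quality(next_utterance: str,
                   full_history: str,
                   summary: str) -> float:
    """Ratio of summed log-likelihoods (both negative): ~1.0 means the
    one-page memory is nearly as predictive as the full history; much
    larger means the summary lost predictive signal."""
    full = logprob_of(next_utterance, full_history)
    short = logprob_of(next_utterance, summary)
    return short / full if full else float("nan")
```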
You know, I was working at, like, a data privacy company eight years ago, six years ago.
Anyways, it's a whole other story.
But one of the things I always think about is like the minute the data is released, it's free.
You never get it back.
Right.
And one of the challenges for any company that's trying to like have a moat moving forward is,
well, the minute that data leaves the confines
of your network, you no longer have a hold over anyone.
And a good example is how, like, this portable memory,
this memory stuff, like, plays into those frameworks.
The other thing that struck me as you were describing,
I don't know if you've seen the movie Inside Out,
but they have this, like, great scene
where they show how, like, memories go to long-term memory.
I don't know.
It's just a good...
I think I've seen it.
It's like a big, like, a marble dungeon or something, right?
Exactly.
And they have, like, all these, like,
you can see this graph where they connect these different memories together
and creates this formulation.
It's a good movie.
And actually, I think it
described kind of what you're actually doing quite well, for an animated movie, like, if you ignore all
the emotion stuff and just think about, like, how do we turn experiences into artifacts.
I'm super curious though, like, what's the performance of this? Like, how long does
it take to, like, create these memories? And are there measurable, like, have you come up
with, like, ways to measure the efficiency and, like, the quality of a memory? Like, how do you
know that a memory is... a memory is, like, a compression, right, of some observation, it's just a
compressed thing.
Like, how do you think about that?
Are we just up against, you know, fundamental information theory, you know, like, limitations?
Kind of curious to understand sort of what the limitations are here and how the performance
of a system like this works.
Yeah, and I think self-improvement is the other question I have.
Yeah, does it self-improve?
Yeah, I think self-improvement is, like, very, very important.
I think, like, in many ways, the whole thesis of our company led it is that we want to
create, like, memory that works so that you can have self-improving agents.
I don't necessarily mean self-improving
in that we'll have some sort of, like, singularity event where, like, it's an exponential that never
stops. But it's more like self-improving in that you would never wipe an agent. Like, today, every
agent is just, like, constantly being wiped. Like, you kind of restart, restart. Even though the
trajectory is, like, seven hours, you know, the Anthropic thing about a seven-hour-long Claude Code
run or something, well, it's not like that Claude Code run continued another seven hours. They
wiped it and then started it again, right? And I think that's just not the way, like, humans work.
And I think the reason we do that is because we just don't have like very good functional memory.
And on the latency side, one thing that we believe as a company is that we really should also be leaning on, like, AI to do the memory management.
And I think that's how you kind of latch on to all the improvements that will be happening with the models.
Like you basically want models to be deciding how to organize this memory.
And it turns out the models are actually very good at doing this.
They're often much better than any sort of, like, heuristics you could derive yourself.
So one thing that means is that you actually do want to use, like, very powerful, chunky models
to do memory management, like memory creation, memory rewriting.
It's definitely something that's high enough latency that you want to parallelize it.
You don't want to run the memory manager as a blocking process on the main agent.
And obviously that has, like, interesting consequences that you might be like one step
behind or things you have to like address related to that.
But I think that also will, like, improve over time.
I think latency, we live in a world where like intelligence will scale, latency will drive
down, cost will drive down.
So I think that's sort of design where you have memory management done by other agents
is very well positioned for the future.
And I think similarly, it also relates to how do you want to structure memory.
Well, why don't you just let it be text and let the AI decide how it wants to structure memory.
If the AI decides that the perfect way to structure the memory for this conversation is a Mermaid diagram,
you know, it can just make a Mermaid diagram in text.
If it wants to make a graph, it can make a graph in text,
really leaning on the models to do all this.
But yeah, when you lean on the models, it's very, very expensive.
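A sketch of that non-blocking memory manager, with a placeholder `complete` client standing in for the powerful "chunky" model; the prompt and structure are illustrative, not Letta's implementation.

```python
# Sketch: memory management as a parallel, non-blocking process.
import threading

def complete(prompt: str) -> str:
    raise NotImplementedError("plug in a powerful model here")

MEMORY = {"user": "Prefers concise replies."}

def rewrite_memory(new_events: str) -> None:
    # Let the model decide structure: prose, bullets, even a Mermaid
    # diagram, as long as it stays text.
    MEMORY["user"] = complete(
        "Rewrite this memory to incorporate the new events. Choose any "
        "text structure you think is best.\n"
        f"Memory:\n{MEMORY['user']}\nNew events:\n{new_events}")

def main_agent_turn(user_msg: str) -> None:
    # Memory management runs in parallel, so the user never waits on
    # it -- at the cost of the memory being one step behind.
    threading.Thread(target=rewrite_memory, args=(user_msg,),
                     daemon=True).start()
    # ... answer the user with the current MEMORY snapshot ...
```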
So what is Letta?
We haven't asked you yet what the company is.
We've talked a lot about memory, but tell us what Letta is, what your products are,
and how Letta helps people solve memory.
Letta, I think, very concisely put, is a platform to build agents that have long-term memory on.
And you can build them extremely fast on Letta, because the way we have designed everything
is that memory is table stakes.
You have to go out of your way to create an agent that does not have very advanced long-term
memory in Letta.
So if you want to build, like, a workflow in Letta that looks like n8n, you can do it,
but you're going to have to turn off a lot of things.
By default, you know, the agents we see our developers creating,
there are agents that are intended to be run indefinitely
and kind of have steady-state, self-improving memory.
And in terms of the actual product, like the product surface,
we have an agent's API.
So it's a developer toolkit.
So basically, you can either self-host or you can use a cloud,
but you use an agent's API instead of a chat API,
and the agent under the hood can have, like,
many agents in parallel managing its memory.
So you end up having an agent that has extremely high-quality memory,
but that memory is white-box.
So you can also, with the same API, like read and write to it manually.
So, yeah, the product is an agent's API, and there's an open-source self-hosted version of it.
There's a cloud version.
And any agent you create in this API has all the memory, like the latest and greatest memory based off of the research that we do as a company as well.
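For flavor, a minimal sketch of that agents API, assuming the letta-client Python SDK; the parameter names are approximate, so check the docs before relying on them.

```python
# Sketch: creating a Letta agent with memory on by default
# (approximate letta-client usage; verify against the docs).
from letta_client import Letta

client = Letta(token="YOUR_LETTA_API_KEY")  # or self-host and point at it

# Memory is table stakes: the agent gets editable memory blocks that
# persist across every future conversation.
agent = client.agents.create(
    model="openai/gpt-4o-mini",
    embedding="openai/text-embedding-3-small",
    memory_blocks=[
        {"label": "human", "value": "Name not known yet."},
        {"label": "persona", "value": "A helpful, long-lived assistant."},
    ],
)

# Same API: message the agent; its memory is white-box and can also be
# read and written directly.
response = client.agents.messages.create(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "My mom's name is Brenda."}],
)
```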
We have so much we can ask you, but we have to jump to our favorite section, what we call the spicy future.
Spicy futures.
So we've probably heard somewhat
what your spicy hot take is already,
but give us your spicy hot take
about AI or infra or whatever.
Yeah, I think my most relevant hot take
is that memory is more valuable than the models.
And I think right now that may not be true
because people are treating their agents
like throwaway workflow processes.
But I think as soon as, like, the memory nut is cracked,
or like more people kind of like shift
to running stateful agents, the memory very clearly, in my opinion, just becomes more valuable
than the model. Because I think we can all generally agree if we believe in, you know,
we use like very powerful AI tooling that, you know, in a few years, you will have effectively
like, employees that are at your company, that are living on Slack, and they might have, like,
one-, two-, three-, or ten-year tenures. They've been there for a very long time. And just think about
today, you know, how often you swap models, and how, like, the best frontier model is swapping
from provider to provider. If an agent is really just the state, the memory, the context
window that we were talking about, and the model, and then maybe some runtime that fuses
these two things together, in a world where we have these agents that are running, as long as
humans live or even surpassing human lifetimes, why would the model be more important than the context
or the state of the memory? I think that has very strong implications for what the landscape of
the most valuable AI companies looks like, right? Because I think that would kind of imply that the most
valuable AI companies, they are memory companies. And their number one asset is storing this
context, storing this memory. And the model is like, you know, this commodity that you can use
whatever model you want, maybe you use their model. They made the model the best for their full
stack experience. But the real asset at that company, you know, if you had to, like, break in there
and grab a hard drive, Mission Impossible style, and, like, walk out the door with the most
valuable thing, it's not the model weights. You're walking out with the agent memory.
How far along are we? Like, what are examples of companies that actually even do memory well?
Like, memory is just starting to pop up on, like, ChatGPT or Anthropic. Like, it feels like we're
still very, I mean, Anthropic is very early in comparison to ChatGPT. But, like, where are we
at in terms of, like, which agents have the best formulation of this, that a human can use and have
this experience? Like, are we early? Are we so early that most haven't even witnessed, like, what
this actually will look like in practice?
Yeah, I think we're still pretty early.
I mean, I think a lot of people use ChatGPT, and ChatGPT is pretty aggressive in trying to,
like, push the memory experience on consumers, which I think makes a ton of sense as a consumer
facing company.
But I think the fact that, you know, the average developer that uses maybe Claude Code and
also uses ChatGPT and uses, you know, whatever, isn't saying things like, you know, I didn't
try Gemini because, like, I just didn't bother, all my stuff is in ChatGPT.
The fact that no one has said that yet means we're very early, but that is coming soon.
I think it's just a matter of time.
I think it could be six months from now, it could be eight months from now.
We're definitely still very early.
And I think another aspect of being, like, a sign that we're very early is that the way people run agents is often very ephemeral.
There's very few of these agents in production anywhere today that run and they have like indefinite lifetimes.
Like Claude Code, still the default experience is, like, you are typing Claude Code,
you type claude, and it spawns Claude Code, but that's a fresh process. I think there will be a time
in the very near future where you are rebooting a session that, like, is running in perpetuity.
I think the number one thing that these teams, like Anthropic, also have to solve is this
thing I talked about earlier, of the floor dropping. I think people get extremely frustrated
when the floor drops because of memory. And I think for many teams, that means the easiest way to
solve this problem is to just not force memory on people. I think OpenAI, they're kind of
forward-looking in that they are okay with dropping the floor, because they're like, we have to,
like, be the first to have, like, cracked memory on the consumer side. So we're going to drop the floor
for some people, but we're also going to try to raise the ceiling. So I think that will kind of follow
for a lot of other companies. And when you say drop the floor, you mean that, like, the quality
of the experience degrades because they're actually, like, creating the wrong memories, and
those memories are adding noise instead of signal, is that it? Exactly, yeah. There's
this, um, blog by, like, Simon, right, where he, I think he was talking about how he tried to create
an image in ChatGPT, and then the image had, like, a weird sign of, like, Hawaii or something
in it. I might be misremembering this, but it's like, why is this here? I didn't ask for, like, a sign of
Hawaii. And then ChatGPT explained itself, like, well, I know that you like Hawaii from
our previous chats, you know, so I put it in there for you. That's an example of dropping the
floor. It's, like, making the product experience worse for some use case because you were pretty
heavy-handed with the memory. What do you think the formulation of the future user
experiences here? Like, do you think users are going to have to, like, select which memories
to include, or do you think we can be intelligent? Like, this still comes back to, like,
a core information retrieval problem of, like, how do I pull what context into the window,
right? Or are we kind of stuck with, like, the information retrieval solutions we
have, or do we need something new? Like, where do you think, like, the breakthrough is on solving
some of these problems? Like, yeah, well, I think it's a very open research problem. So I think
one thing I didn't answer in your question earlier was about, like, precise metrics, or, like,
being very objective about, like, understanding what makes good memory. And I think there is an
information theoretical perspective of a memory, which is that the optimal memory, especially if
it's short form, short context, is highly predictive of what will happen next, right? And I think you can
do things under this framework where it's kind of, like, you know,
many-outcome prediction, where you're saying, hey, like, the user today, when they're going to
chatgpt.com in this future world where ChatGPT is one thread, you have no chat history.
Very early, I will be able to determine if they're coming to code, if they're coming to
ask me to generate Studio Ghibli images, or if they are asking for, like, fitness advice, right?
And very early, depending on that, I will load a different memory, right?
And that's one way to, you know, prevent dropping the floor.
I think part of the reason, like, we're dropping the floor is because I think people like
the superhuman aspect of, like, ChatGPT only knowing about
what you're doing right now, because it doesn't have any clutter from anything else.
So yeah, I think there is an open research problem here of, like, how do you
use this information-theoretic perspective of predictability to actually, like, determine
the optimal memory construction.
But I do think the future definitely, at least on the consumer side, is, like, the one thread thing,
because, you know, researchers at these companies will push this agenda and make it better
and better and better.
And it's just, I think, a better user experience for the average user of a product with insane
distribution like the ChatGPT chat window, right? To just, like, one thread everything. A lot of
people, actually, I have learned recently, one-thread ChatGPT, like, they don't actually use multiple chat windows,
maybe more, like, less technical people. But yep.
Totally. Well, Charles, this has been incredible.
We appreciate you coming on the pod. I think everybody has learned a lot, I know I have. Where can people
find out more about you and Letta on the internet? Yeah, the best place to go is, um, if you're a developer, just
head straight to our docs page, docs.letta.com.
Like I said, you can basically create these agents with this sort of memory
I'm talking about, this really advanced memory that leverages sleep time compute
with agents running in the background.
And you can do that in, like, one API call.
Just like create agent.
And immediately it has this sort of memory baked in.
So, yeah, that's the best place to go.
And we also have a very active Discord,
if you're a developer on Discord and want to, like, chat with the team or anyone else
building on the platform.
Awesome. Thank you so much.
Of course.
Thank you.