The Infra Pod - Building the Future of AI with Long-term Memory (Chat with Charles from Letta)
Episode Date: August 11, 2025

In this episode of The Infra Pod, Tim (Essence VC) and Ian (Keycard.sh) delve into the fascinating world of AI memory with Charles, the CEO and co-founder of Letta. They explore the intricacies of memory in AI, its current state, how it's implemented in various applications, and its potential to revolutionize productivity tools such as coding assistants. Charles shares insights on how Letta is leading the way by creating platforms for agents with long-term memory, the future implications of shared memory systems, and the concept of "sleep time compute." Tune in to gain a deeper understanding of why memory could become more valuable than the models themselves in the future of AI.

00:24 The Concept of Memory in AI
02:04 Expanding on AI Memory
02:17 Implementing Effective AI Memory
01:55 Current and Future State of AI Memory
11:37 The Challenges and Opportunities of AI Memory
11:37 Real-World Applications and Limitations
35:14 Introducing Letta's Solutions
36:35 Future Trends and Predictions
Transcript
Welcome to the InfraPod.
This is Tim from Essence, and Ian, let's go.
Hey, this is Ian, lover of agentic workflows,
believer in Level 5 autonomy.
I'm super excited today to be joined by Charles,
the CEO and co-founder of Letta AI.
Charles, how are you doing?
Tell us a little about yourself
and what got you to start Letta.
Yeah, I'm doing great.
Thanks for asking.
So prior to Letta, Letta is about a year old now.
I graduated about a year ago from the PhD program at Berkeley,
with a few of my colleagues, my co-founder,
also the head of research at Letta.
We were all doing kind of our PhDs in various things at Berkeley.
I was doing more of like RL.
Co-founder Sarah is more of an ML systems person,
you know, like feature stores,
like the predecessors to vector databases.
And then Kevin was working on like more pure NLP.
And then the last year of our PhD,
we all shared a cubicle together,
shared an advisor.
The last year of our PhD we worked on this project,
MemGPT, together.
And MemGPT, I think, to place it in time, it was probably the first, like, academic paper that had, like, a very large GitHub repo associated with it that was addressing the whole memory problem with LLMs.
So at the time, like, the reason we started working on MemGPT was because GPT-4 was just so damn good.
And it also had such a small context window, like 4K tokens on the API.
So, yeah, the idea with MemGPT was you just kind of make the AI or the language model self-aware that it has a memory problem,
and you give it the tools to address this memory problem,
kind of like read, write operations to external memory.
And it just kind of worked.
Like the first prototype we made,
I think it just worked out of the box.
It's kind of crazy.
I think function calling had just kind of happened around the same time as well.
So that's the reason we were thinking more from like a function calling perspective.
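To make that concrete, here is a hypothetical sketch of the MemGPT idea as described: memory read/write operations exposed to the model as ordinary function-calling tools, with the current memory re-rendered into every context window. The tool name and schema are illustrative, not the paper's exact API.

```python
# Hypothetical sketch: memory as function-calling tools, MemGPT-style.
import json

MEMORY = {"human": "", "persona": "I am a helpful assistant."}

def core_memory_append(section: str, content: str) -> str:
    """Tool the model can call to persist an important fact."""
    MEMORY[section] = (MEMORY[section] + "\n" + content).strip()
    return "OK"

# OpenAI-style tool schema; names are illustrative.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "core_memory_append",
        "description": "Write an important fact to persistent memory.",
        "parameters": {
            "type": "object",
            "properties": {
                "section": {"type": "string", "enum": list(MEMORY)},
                "content": {"type": "string"},
            },
            "required": ["section", "content"],
        },
    },
}]

def system_prompt() -> str:
    # The model is made "self-aware" of its memory problem: memory
    # contents are injected into the context on every turn.
    return ("You have a limited context window. Save important facts "
            "with core_memory_append.\n"
            f"Current memory:\n{json.dumps(MEMORY, indent=2)}")
```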
And then Leta, the company is really about taking these ideas to the limit.
And I think the ideas around memory management and kind of the greater system,
what do you need beyond the language model to create
AI that is a lot more like humans than just a chatbot? Those are the problems that we're
trying to solve as a company. Well, and when you say memory, can you expand on how you define
memory in this context? And what is stored in that memory? What types of data would
you find, for what types of use cases? Yeah, I think memory is super overloaded. So it's a little bit
tricky to talk about because I think as soon as you mentioned memory, people have different ideas
in their head. I think today, if you just off the cuff mentioned memory and AI, people probably
think about, like, conversational memory that's, like, baked into ChatGPT memory or
Gemini memory.
Like, every single chat bot now has, like, a memory bank.
And that's really about, you know, just capturing important information that should persist
indefinitely in the context window.
And that's also kind of what the MemGPT paper was very focused on.
So if you say something like, you know, I really prefer short, concise responses, don't
give me any more slop.
The AI should probably, like, write that to memory.
And for every subsequent chat conversation, always kind of behave in that manner.
Similarly, if you're, like, doing a coding assistant, and you say, like, hey, we always write our code this way.
Or, like, we always finish a PR in this style.
If you're using an AI coding assistant, it should also write that to memory and, like, kind of do that in perpetuity until you change it.
That's what people often think about when they think about, like, AI memory.
I think another thing that happened with the MemGPT research paper is we were talking about this idea of the LLM OS.
And I think the concept there of memory is actually much more general.
it's all just about tokens
and it's about the real primitives
of the machine you're building
and I think with anything
that's AI built on top of language models
it's a language model
it's tokens in tokens out
and I think memory in that context
is really just all the tokens
what are the tokens going into the compute unit
what are the tokens coming out
and memory management is just token management
so obviously that you know
semantic memory or like conversational memory
that's a subset of that kind of memory
but I think everything to do with
making LLMs into agents is all about context management, like tool calling. It's all about how do you
format the tools, how do you extract the tool calls. Reasoning, it's all about how do you, like, properly
start the reasoning injection and, like, how do you end it faster? Like, these papers, you know,
they have, like, variable reasoning effort by continually appending, you know, "think harder, think
harder," or appending, you know, "hold on, wait a minute, let me think again." And then you run the
LLM inference again. I think that's also memory management, in that it's token management. So I think
the concept of memory is actually much more broad than just like this conversational memory.
It's really just about turning the language model into the intelligent machine.
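As a concrete illustration of "reasoning effort as token management," here is a hedged sketch of the appending trick Charles describes; `complete` is a placeholder for any text-completion client, not a specific API.

```python
# Sketch: variable reasoning effort by appending a continuation cue
# and re-running inference, as described above.

def complete(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def think_harder(question: str, extra_rounds: int = 2) -> str:
    transcript = f"Question: {question}\nReasoning:"
    transcript += complete(transcript)
    for _ in range(extra_rounds):
        # Appending the cue forces the model to keep reasoning
        # instead of committing to its first answer.
        transcript += "\nHold on, wait a minute, let me think again."
        transcript += complete(transcript)
    transcript += "\nFinal answer:"
    return complete(transcript)
```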
I mean, that's an incredible, incredibly good description.
I'm curious, so today with Letta, and more broadly, like, where are we in that journey?
Like, how much of this exists?
How much more is there to build?
How, like, where are we in this sort of development?
And then, like, yeah, help us understand sort of like, do these machines exist and work great?
or are we, like, early in this process?
Yeah.
We're early, but we're definitely further along
than when we first started the MemGPT project.
So I think, like, when MemGPT came out,
function calling had just started.
It was in its infancy.
I think in the original MemGPT paper,
if you look at the V1,
there's like a caveat at the end of the paper
that's like, by the way,
everything you're looking at in this paper
only works with GPT-4.
Like, it does not work with any open model.
And that very quickly actually changed
and open models started supporting function calling better.
But I think some of the primitives of this LLM OS have started to settle.
And I think one of those primitives is tool calling.
Everything is agentic.
Everything's orchestrated through tool calling.
And I think that is what you're seeing in these much more agentic applications like Claude Code.
I think everyone would agree Claude Code is, like, you know, really leaning heavily agentic.
You know, it's not a workflow.
It's an agent.
And Claude Code just leans extremely heavily on tool calling.
The Claude Plays Pokemon example, right?
It has memory, but the memory is orchestrated through tool calls, like MemGPT,
kind of like making these files that constitute the agent's memory.
So I think one thing that has happened is we've kind of standardized on tool calling.
And I think one thing that hasn't happened beyond that is understanding how to have functional
AI memory that is not just a tool.
People have often asked me this question, like, why isn't Letta just an MCP server?
Like, why can I not just have memory as a tool in the same way have like my Google Drive
as a tool?
And I think a very quick counterexample to make this point
is, like, if you want to do something like sleep time compute,
which you can talk about more later,
where you have an agent that's kind of observing everything that's happening
with the main agent, kind of a subconscious agent,
and it's maintaining a very long-term,
really, really high-quality memory bank,
if you want to plug that into Cursor as an MCP tool,
well, what are the kind of, like, interfaces of the MCP tool,
the memory MCP tool?
It's probably something like create_memory and read_memory.
But you are now relying on Cursor to always call create_memory.
Right? And is Cursor, when it's in the middle of your vibe coding session, going to be properly calling create_memory when it needs to be calling create_memory? Probably not. You're basically relying on the other client or the service to also be good at memory management. If you Google, how do I do AI memory, I think a lot of the ideas from the original MemGPT paper around tools and memory, like rewrite, that kind of thing, that's everywhere. Like, there are startups that are built entirely around this premise, like a memory bank for your AI that's maybe, like, an MCP server. But I think, you know, part of the reason
we still talk about memory is because it's still a very big problem.
I think once memory is solved, we stop talking about it, because it's just, like, you know,
it's obvious and it, like, exists in every single AI API. But it's still not solved.
And I think part of the reason it's not solved is because we haven't yet transcended
beyond the memory as a tool.
I think memory is just way too fundamental to AI to just be a tool, in the same way Google Drive
is a tool.
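For illustration, here is that counterexample in code form: a hypothetical memory-as-a-tool server whose usefulness depends entirely on whether the host ever calls it. The class and method names are made up, not a real MCP server's API.

```python
# Hypothetical memory-as-a-tool interface, and its failure mode.

class MemoryToolServer:
    def __init__(self):
        self.memories: list[str] = []

    def create_memory(self, text: str) -> None:
        self.memories.append(text)

    def read_memory(self, query: str) -> list[str]:
        return [m for m in self.memories if query.lower() in m.lower()]

# The failure mode: the host app (e.g. an IDE agent) decides when to
# call these. If it never calls create_memory during a long session,
# the memory bank silently stays empty -- memory quality now depends
# entirely on someone else's agent loop.
```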
That makes total sense.
You know, one of the things I've often thought about like MCP, I have some, you know,
when MCP came out, I was like, this is a beautiful like version one.
not even version 0.01 of a future
standardized protocol for an operating system
and kind of what you're saying is like
memory as a tool doesn't work,
because you have to rely on the top level
LLM to actually call the tool.
It's like you actually need a protocol layer
that's between the fundamental model,
the data that's flowing into the model and flowing out,
and a place to plug in. And I think that's true
not just of memory, but it's fundamentally true
of how you can build pluggable agentic
experiences, is that right now
we're fundamentally reliant on, you know, whatever's orchestrating the LLM
to actually then call a tool in flow, and there's no guarantee of when that tool gets called.
There's no, like, hook system, right, that says, hey, pre-submitting the context window,
run this thing.
Like, we can't, like, register hooks or do anything fun that you'd find in, like,
a natural, normal deterministic system.
Tool calls are non-deterministic.
You know, that's part of the problem.
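A sketch of the kind of hook layer Ian is gesturing at: deterministic callbacks that run before every context submission, rather than tools the model may or may not decide to call. This is entirely hypothetical; no such standard protocol layer exists today.

```python
# Hypothetical pre-submit hook layer for an LLM orchestrator.
from typing import Callable

Hook = Callable[[list[dict]], list[dict]]  # rewrites the message list

PRE_SUBMIT_HOOKS: list[Hook] = []

def register_pre_submit(hook: Hook) -> None:
    PRE_SUBMIT_HOOKS.append(hook)

def submit(messages: list[dict]) -> list[dict]:
    # Hooks run unconditionally on every inference call -- unlike a
    # tool call, which is a non-deterministic model decision.
    for hook in PRE_SUBMIT_HOOKS:
        messages = hook(messages)
    return messages  # then hand off to the LLM client

# e.g. always inject the latest memory block before submission:
register_pre_submit(lambda msgs: (
    [{"role": "system", "content": "Memory: user prefers concise replies."}]
    + msgs))
```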
I'm also very curious to sort of think about, like, in traditional computing, like,
management of memory has been one of the, like, most important components broad space, right?
Like, we've invested a lot of time, like, in the last 20 years to ensure that the right bytes are in the L1
cache versus L2 cache versus in RAM versus, you know, in swap versus, like, having to be read out
of, like, a magnetic disk.
And that hierarchy has helped us a lot here.
But, like, I'm curious, what are the parallels here? Like, are there latency
and other, like, parallels in terms of this new, broader memory management for agents?
Or is that, like, a solved problem?
I'm kind of curious how you manage it.
Because I can think about, like, memory for an agent, like,
it could be an absolute massive shit ton of data.
If there's, like, a personal agent, like,
there's a lot to know about Ian and what he's done
in order to be able to answer some random, like, natural prompt
I may send to some sort of agentic system.
So I'd love to sort of also get your thoughts on that as well.
Yeah, I think maybe talk about the latency thing first.
So if you've taken any sort of like operating system class or whatever,
or I think if you, like, in the PhD program at Berkeley,
you have to, like, take a prelim
and you have to memorize the latency table, you know,
where it's like staggered, like, log scale.
But I think with, like, traditional operating systems
and, like, memory management,
the number one problem you hit is, like, latency on retrieval.
And that's, like, how you design all your algorithms,
just, like, minimize latency on retrieval.
And then with language models, at least currently now
and probably for the foreseeable future,
retrieval is an inconsequential amount of latency
compared to the actual LLM inference.
So actually, I don't think retrieval is, like, the hardest part, or that you'd have to think about your, you know, your memory placement algorithms in the context of retrieving from, you know, some vector database, like pgvector or whatever.
I think it's actually more on the creation side.
Like, creating these memories and updating these memories is actually very, very latency intensive, right?
Because you can imagine, just to create a memory, like today, you know, Charles's mom's name is Brenda, that is one LLM call.
but probably that memory when it's created
and you have a ton of prior context
should be very cascading.
It should kind of like inform
every other memory that was created in the system.
It should probably itself cascade
into like thousands of LLM invocations.
And I think what will end up happening
as these systems mature
and we kind of get out of, you know,
like chatbots and move more towards long-running,
stateful agents is you'll have these context windows
where segments of the context window
that have memory are so enriched
by so many LLM invocations
and you'll have like these pieces
of the context window that effectively costs
thousands of dollars worth of
Cloud API credits to have been created
and they were created over millions of trajectories.
The analogy breaks down a little bit
in the dynamics of retrieval versus generation cost
because, yeah, with the traditional OS,
like the generation cost is kind of fixed
by the clock speed of your CPU or something.
Whereas, yeah, I think with LLMs
it's a little bit different, more dynamic.
There's just so much we can talk about.
I think we can talk about the memory thing,
but maybe I want to jump into the sleep time compute a little bit.
Maybe tell us about
what sleep time compute is, because I think folks may
not actually know what that term is, but I think
the idea is actually pretty cool, and
it does work in
conjunction with memory, I think
a lot, too. Yeah, so I think one
very obvious difference between
LM-driven intelligence and human intelligence is that
human brains, like, they don't turn off. Like, your
brain doesn't, like, drop to like zero voltage
and, like, you just go into a cryogenic state
every night. Like, when you go to sleep, your brain
is still, like, kicking, things are firing.
But with language models, you have to
actively, like, run the machine,
to get the intelligence to, like, kick on.
I think if you're trying to make human-like intelligence from LLMs,
it becomes very unnatural to kind of go back to a chatbot
or chat thread that you have with some agent.
And that agent has done literally nothing since you were gone, right?
It hasn't reflected.
It hasn't attempted to do anything, like, proactive.
It's just been sitting there static.
Like, the weights were cached somewhere,
or maybe not even cached.
They were, like, just written to disk, and so they've been loading.
They've been loaded.
So I think sleep time compute, at a very high level, is
about, can we just run language models all the time? And if we can run language models all the
time, how can we make that actually benefit their intelligence? And I think, like, you know,
you mentioned that the key element to this is memory. I think if you are running language models all
the time, if they don't have memory, there's no point, because they won't remember anything that
happened when they were running while you're gone. So this entire thing hinges on having memory. And
I think the other context to look at sleep time compute from is the test time compute angle,
which is kind of why we named it sleep time compute,
there's been a lot of focus,
just a tremendous amount of energy around,
hey, it turns out that if you just run the language model for longer
when you ask a question,
it becomes more intelligent.
It's kind of like an emerging capability, right?
But the thing is,
it's actually very expensive to run language models
when you ask them to do something.
That's why it's so painful, you know,
to kick on Deep Research or o3.
I mean, it's like, if o3 was as fast as 4o,
why would anyone ever use 4o?
But the fact is it's a tradeoff.
You're navigating some, like, Pareto
frontier. And I think if you could basically do the same thing, like get the same kind of
benefits, like increase the intelligence by spending more compute, but in a way that is like more
asynchronous, where the human doesn't have to be, like, waiting for this intelligence, like, turbo boost
to kick in. It just kicked in while they were gone. And I think that actually, if you do have some
sort of mechanism, you have some sort of idea about how can we scale compute when the user's not
interacting with the agent and make that agent more intelligent, that really is a pretty tremendous
unlock, because it's actually much more scalable than test time compute. Because test time
compute is extremely expensive, as it has to be done in real time, where the user notices. For sleep time compute,
you can also just saturate empty compute you have in your data center. Like, I mean, all these
companies would love to saturate their compute, right? They have batch APIs, they want people to
use them, they don't want, like, that electricity to go to waste. And so, yeah, test time compute is the idea
that I'm thinking harder, right, when a context comes in, where sleep time compute is basically pre-computing
that hard thinking in a prior process and memorizing
it. That's kind of how I view it. And it doesn't feel like those are actually in
conflict with each other. You can actually do both, right? You can actually do test time compute
if I don't know what context I will be passing in in the first place. But based on that,
maybe it can even inform what other sleep time compute can be done. I just wonder how
you view it, because I feel like these are so new. I don't know if people even know what
examples people even do right now. Like, what are some real-world examples people are actually using?
And what have you found, folks, are they choosing one or another, or are they choosing both?
Or maybe it's so new that nobody is even doing these right now.
Yeah, I think it's pretty new.
I think like a lot of the scaffolding and engineering people have set up has not been designed
to run agents for very long periods of time asynchronously.
Everything is kind of like meant for this pattern.
It was previously for this pattern where you, like, spawn a POST SSE request against, like,
the LLM API, you get back the tokens, you hold for maybe, like, 20 seconds, five seconds.
And then now, you know, the infrastructure has changed a little bit, and we expect, like, you either open, like, a heinously long POST SSE, or you maybe, like, poll instead, like on the Responses API.
And that's for test time compute.
But to give a concrete example, there really actually is no batch agents API, other than the one we built in the Letta open source project, because very few people are actually running agents in batches.
And, like, that's what you would do for sleep time compute, right?
You'd have, like, a ton of agents all running, you know, in a huge
batch, but taking advantage of the underlying, like, LLM batch invocations.
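Roughly, "running agents in batches" means stepping many sleep-time agents concurrently so their requests can ride an underlying LLM batch API. A generic asyncio sketch of that shape, not Letta's actual batch agents API:

```python
# Sketch: many sleep-time agents stepped concurrently.
import asyncio

async def step_agent(agent_id: str) -> str:
    # Stand-in for one agent step (build context, call the LLM via a
    # batch endpoint, apply any memory writes).
    await asyncio.sleep(0)  # placeholder for the batched LLM call
    return f"{agent_id}: reflected and updated memory"

async def sleep_time_pass(agent_ids: list[str]) -> list[str]:
    # Fan out all agents at once; the provider's batch API amortizes cost.
    return await asyncio.gather(*(step_agent(a) for a in agent_ids))

if __name__ == "__main__":
    print(asyncio.run(sleep_time_pass([f"agent-{i}" for i in range(100)])))
```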
I think a very obvious example of, like, why you want sleep time compute is in very
common use cases, like coding.
Like, I think if you, kind of, like, initiate Claude Code on your laptop, on your computer,
and you get sidetracked and you, like, have to go respond to Slack,
if you're living in a world where, like, inference is free for a moment,
Claude Code should have kind of, like, been inspecting your repo, kind of probing around
the recent, like, git log. It should have built a mental map,
kind of like it made, like, a deep wiki, you know, like the Devin DeepWiki thing.
It should have done that by itself,
because any sort of question you now ask
Claude Code when you start pair programming,
it will have much better answers for,
because it will have done that, like, pre-computation.
and then similarly, when you kind of, like, wrap up your session
with Claude Code,
Claude Code probably should be looking at the entire transcript
of your interaction and reflecting
on what happened and thinking like
hey what were some patterns I noticed
about the thing that they kept on asking me to do
that I could probably use to like update my memory
and make my future behavior better
I noticed that we kind of looped a lot editing, like, the globals.css file.
And it turns out that, like, when I do that, I should really also edit, like, tailwind.config or whatever.
And I'll write that to memory.
So there's just so much, so many things you can do related to memory, pre-computation, post-computation in just coding alone.
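A minimal sketch of that post-session reflection, assuming a placeholder `complete` LLM client; the file names are hypothetical:

```python
# Sketch: end-of-session reflection that writes durable memory.
from pathlib import Path

def complete(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

REFLECT_PROMPT = """You are reviewing a coding session transcript.
List patterns the user repeatedly asked for (style, files that change
together, workflows) as short rules to remember.

Transcript:
{transcript}
"""

def reflect_on_session(transcript_path: str,
                       memory_path: str = "MEMORY.md") -> None:
    transcript = Path(transcript_path).read_text()
    rules = complete(REFLECT_PROMPT.format(transcript=transcript))
    # Append, don't overwrite: memory should accumulate across sessions.
    with open(memory_path, "a") as f:
        f.write("\n" + rules)
```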
I think this is actually where like because there's so much energy and incentive around making these coding systems better, I think that's one of the first places you'll see sleep time compute really heavily applied.
you'll have products that basically have
mega max mode
it's no longer just max mode but it's mega max mode
you don't care about cost at all
we'll turn this on and Claude Code will be cooking
the whole day. Like, it'll never not be thinking
about your repo, right? And that's an example of
sleep time compute really taken to the limit.
it's an incredibly cool concept
right? Because, like, functionally, if we were to step back
to a lower level, like, the fundamental problem we
have, if we're, like, on a primitive
level, is how do you give, like, good instructions,
and then how do you ensure that what goes
on top of the instructions, that, like, the data
that's being fed into the LLM, is the right data,
like, to manage what's in the context window, right?
Like, it doesn't matter that we have fixed context windows.
It matters about, like, what's the right context in the context window,
which prompts the right, you know, statistical inference
that then results the right result at the end of the day.
And so this is actually, like, it's super, super fucking cool.
Like, just, I don't always swear on the podcast, but, like.
But, like, it is incredibly cool.
And one of the things, just listening to you, I think about, like, Claude Code and my own,
like, my own small team. We've gone very AI forward, you know,
adopted all the coding
agents, but the worst part is, like, we don't have shared memory.
And the best form of shared memory we have today is, like, these files on disk.
So I'd love to, like, talk to us about how you think about the future of agentic systems
that are multi-tenant, right?
Like, a lot of the time people talk about this in the context of, like, oh, it's Ian's agent.
But it's like, in the context of coding, I don't really want it to be Ian's agent.
And maybe Ian's agent shares memory with other people's agents, right?
Like, there's value in sort of multi-tenancy, and there's value
in the shared context.
And so even that, it sounds like a really interesting problem space
that is actually quite different from anything you had before.
Because if you think of, like, traditional systems, pre-LLMs,
it's like, yeah, I might be preparing a cache,
or I might be doing some roll-ups and putting that someplace,
but it was incredibly structured.
It was incredibly bounded, with static task boundaries.
And this is not, right?
This is like a process of, like, we want the system to self-discover net new knowledge
and then store the net new knowledge in a way that allows
the system to use that knowledge in the future when the time comes, right? So yeah, I'm curious
how you've thought about, you know, the sharing, the querying, and where you kind of see the
future of agentic systems based on the idea that, hey, like, obviously, like memory is like a
foundational element. What's this sort of look like in five years? And how does that change
the way that like software works? Yeah, I think the shared memory thing is super important. I think
it's also incredibly exciting to think about because a lot of the conversations so far has been
about emulating humans, but there's no reason why humans should be the limit.
I mean, I think in some cases, like, consumer, you might want to actually have perfectly
imperfect human-like memory where, like, the AI forgets in the same way a human does.
I think for, like, productivity tools, like coding, your memory systems should be superhuman, right?
It should do things that, like, no team of engineers could ever dream about doing, right?
I feel like, you know, as a small engineering team, it would be kind of sick if we all had,
like, neural links and we had, like, shared, you know, memory on, like, the projects we're working on,
right? But we definitely don't live in that future right now. But you can definitely do that with
agents, right? I mean, in Leta, for example, because we write all the memories to just a table
on Postgres, it means that it's extremely easy to have shared memory where basically agents are
actually dynamically reading from shared memory blocks. And if they write an update, it's immediately
broadcast, because it's just written and, you know, it's transacted in the read when you
actually do the LLM invocation. That is really the future. I think, like, the future is you will have,
like, everyone in your company running Claude Code, or
Cursor, whatever new thing comes out, Cline, and everything will be sharing memory. It probably is not going to be sharing the entire context window, but segments of the context window will be read from, like, some common source. And I think this is also why the true canonical data format for memory is text. And I think we are already starting to see that, with, like, Cursor rules, the Claude Code instruction files. It all looks the same. It's all text. And I think the same thing will happen, you know, even in five years, with this really advanced shared memory, where you'll
have all these different systems that are all reading memory from the same location.
And it will actually look like something that's very human readable because then the human can
also read it.
I think there's this really dramatic advantage to like storing everything in English tokens.
It doesn't have to be English, but just like readable tokens because it means that human
in the loop is not like anything crazy.
It's just a human looking at the same file and editing it.
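As a sketch of the shared-memory-block idea described above: memory is a row of plain text in a database, many agents render the same block into their context, and a write by any one of them is visible to all on the next read. The schema and queries are illustrative, not Letta's actual tables; sqlite3 stands in for Postgres so the sketch runs anywhere.

```python
# Sketch: shared memory blocks as rows of plain text.
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for Postgres
db.execute("CREATE TABLE blocks (label TEXT PRIMARY KEY, value TEXT)")
db.execute("INSERT INTO blocks VALUES "
           "('project', 'Monorepo; Tailwind for styling.')")

def read_block(label: str) -> str:
    row = db.execute("SELECT value FROM blocks WHERE label=?", (label,))
    return row.fetchone()[0]

def write_block(label: str, value: str) -> None:
    # A single UPDATE is the "broadcast": every agent that reads this
    # block before its next LLM call sees the new text.
    db.execute("UPDATE blocks SET value=? WHERE label=?", (value, label))
    db.commit()

# Two different agents (Claude Code, Cursor, ...) rendering one block:
context_a = f"Shared memory:\n{read_block('project')}"
write_block("project", read_block("project")
            + " globals.css and tailwind.config change together.")
context_b = f"Shared memory:\n{read_block('project')}"  # sees the update
```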
And yeah, I think that's partially why, some people also ask me, like, you know,
why wouldn't you want to just do, like, parametric memory and bake everything into the weights?
And like, well, once you bake it into the weights,
it's really hard to understand what the memory is.
And often, like, in coding, you want to know what the memory is.
Like, it's really painful for an AI to be behaving in a way you don't understand
because something is in the memory.
It's even worse if you can't see what's in the memory, like, ChatGPT, right?
You know, it's like behaving in some weird way and you can't even see, like,
why it's behaving in that way.
But you know it's because something in the system prompt is, like, really dirty or, like, corrupted.
But, yeah, I think it'll still be text in five years.
And all these agents,
if they're, like, productivity related, you're going to have, like, fleets of agents running,
and they'll all be sharing memory, right?
And I'm curious how much thought you've given to, like, I was just talking, Tim and I always
say this as we do the podcast, we were just talking about this.
And, like, the thing that immediately came to my mind was, you know, one is we've always said,
like, data has gravity.
I mean, that's why, like, once you deploy data to AWS, it's really hard to get off AWS,
because your data is there, right?
And you have to pay all these egress fees.
And it's very clear like in this conversation that like, okay, the next stage of this is
actual memory has gravity, right?
where whichever agent I've interacted with has the best memory,
which is ultimately like the best personalization of that agent's behavior
to whatever set of tasks it is that I've done with it and repeatedly do.
Like that has gravity to it.
I'm curious, like have you thought about, in the context of coding,
like, the company that has the best set of shared memory,
and has the highest, most intelligent memory for their code base,
is probably going to have the highest velocity with these coding tools.
Right.
And so this is like a new form of advantage.
It's not just enough to, like, deploy to all my developers, here's Claude Code.
I have to, like, deploy to all my developers, here's Claude Code,
and here's, like, a shared memory system, a shared thing that makes, like, Claude Code more than just, you know, a 50% speedup.
It's, like, multi-factor speed up because it has learned about the system, the problem domain we're working on.
And it's specialized to us.
And that's IP.
So I'm kind of curious, you know, have you thought about how this sort of changes, like, SaaS or software, or how people buy software, or how companies buy
software, and, like, how people think about it? Like, are there concepts, like, are we going to live in a world
where people bring their own memory, and that memory is, like, you know, we invest a lot of time
into, like, safeguarding that memory? Because ultimately, like, the memory is almost like a shadow
of the human in some way. Like, I'm curious where your brain has gone on it, you've obviously been thinking
about this a lot, and I have so many, so many endless questions. Yeah, I think memory definitely is
the upcoming moat. The memory will just be much more important than the models. And
I think what that means for like the competitive landscape of like tools and
companies, it's hard to predict.
But I think 100%, you know, the frontier labs that are very invested in consumer
facing products or even dev tool facing products, they are aware of this.
And they want to make sure that they capitalize on this advantage.
And, you know, like ChatGPT, they want to make memory as good as possible.
Because if it's incredibly good, incredibly realistic, it makes the experience better: it raises the ceiling
and it doesn't drop the floor.
The problem, I think, with memory today in ChatGPT is that it
drops the floor stochastically, which is very bad.
Once they fix that, which they will, and it just
completely raises the ceiling, now
when your friend tells you,
hey, have you been on, like, Gemini or DeepSeek,
you're kind of like, well, you know,
ChatGPT already knows everything about me.
And I know the day zero experience will be worse.
So I just don't want to switch.
And like, it's just too much work for me.
So from a consumer angle, I think memory will become the thing
that, like, makes these products incredibly sticky.
From a dev tooling perspective,
I think it's similar in some ways,
that I think companies, you know, the companies that are making the best
or have the most funding around, like, making coding tools,
they also probably want the memory to be a lock-in feature
where, you know, you're using Cursor for longer and longer.
Like, you're accumulating more and more memory on Cursor's cloud,
like the paid service.
And memory is something that will just, like, make your experience on Cursor
much better than it is on Claude Code, and vice versa.
I think more from like a dev tooling infrastructure perspective
and maybe just like business data,
I think memory will become so important that actually for a lot of companies,
they just will not use these tools
that have closed memory systems.
And it just will be like table stakes
that you have to like have open memory
or like we have to be able to bring our own memory.
And in many ways, that's kind of what Letta is.
It's a database for your memory
and all your context for AI.
And one interesting thing that I've observed
from working with like some of these
larger enterprise companies is that I think
today like business data
and all your information about your users,
it's stored in the same way it's been stored for like a decade, right?
It's like very structured
and it's like maybe a lot of it's in like
Salesforce, a lot of it's, like, in XYZ location, and that means that the most valuable, like,
asset you have is maybe like your CRM or something. In the future, when everything is kind of
like driven by agents, actually the parallel to that is the most important and most valuable
asset you have are these like memories of your customers. And they're basically like simulacra
of your customers. And when you have those memories, you can do crazy things where you can like
run simulations on like understanding what they want to buy. What are they going to do? Like if I put
them in my app, how will they, like, interact with my app, based off of the memory I've
compiled on them? But that memory, it's kind of like the exact opposite of a CRM. It's a flat file.
It's one, you know, markdown file. Maybe it's like 10. It's very human readable. So I think, yeah,
the most invaluable business data, you know, the future will be like memories of customers and
memories of your users, whoever they are, as a company.
And I'm very curious, because I think people have an intuitive understanding of what a memory
means, at a very abstract level. But when it comes down to even, like, the architecture,
the actual way to retrieve memory, how you store memory, it sounds like it's actually
very simple. Plain text, you know, really just a bunch of facts sometimes, right? And I'm very
curious how you view memory, where we are today, versus what other advancements do we need
around memory? Because I feel like memory is maybe so new that we don't really
know how to even like evaluate memory. Like what is a good memory system? What is a bad memory system?
Is it just about how many facts you can store?
Is it about, like, can I retrieve
the right facts?
Or, like, how do you think about
even, like, benchmarking memory,
like, this is actually a good retrieval
versus a bad retrieval?
And what is sort of
the way you or your customers
are even thinking about it, like,
you know, is it about the latency?
Or maybe give us some of the factors
that are really important on the memory side.
yeah I think
when technology is moving
this fast, it's very hard for
benchmarks to lead. That's why
everyone's doing vibe-based evaluations, right?
But I think benchmarks are important.
Also, in the world where you can optimize very easily
against benchmarks with things like RL,
benchmarks become even more important.
But for us, I think, the most
important quality of, like, memory
is kind of like task-dependent.
So if you have an agent and the agent is responsible
for making recommendations based off of its memory
of a user, then clearly the KPI is, like,
are the recommendations better?
And there's like downstream metrics, like, did they click on the recommendations?
Or if you ask the user, you know, qualitatively, what do you think about this?
Thumbs up, thumbs down?
Do they rank it thumbs up more than, like, the traditional RecSys version where it's just,
you know, like a number that pops out of the black box?
And I think for chat-based applications, it's obviously going to be like very feedback-based too.
It's like, you know, is engagement higher because, you know, clearly if memory is better,
then engagement should be higher.
And I think that's probably one way that, like, OpenAI, like, tries to track the impact
of their memory. But I'm sure even OpenAI has a very hard time materializing very precisely
what are the metrics related to memory that inform us if the memory is good. And I'm not sure
if those metrics actually have higher signal than their highest-taste testers internally just
looking at the memory and being like, is this good on my own account or is this bad? And I think
that actually is probably the highest signal. And I think memory, like one way to think about memory
and like why memory is important. One analogy I give sometimes is like there's a model predictive
or like a world model's version of memory,
which is that, like, you know,
if we, like, froze the entire world simulation
we're running in right now.
And we had to, like, the task was now to predict exactly the next word
that's going to come out of Charles's mouth.
There's one way you could do this,
which is you have an incredibly,
you have an infinite context model,
and that infinite context model gets to, like,
observe every single step of my life,
at, like, second-by-second granularity, from when I was, like, a zygote in the womb
all the way until this moment in time right now.
And that's like a tremendous amount of data.
It's like,
whatever, insane amount of data, but it's an infinite context model, so it can look at all
that data. And that model has to predict the next word that's coming out of my mouth. And then
there's another version of this, which is you give a one-page summary to a model of like,
kind of me, my personality, like what I'm doing today, kind of like really written to try to
help the model predict what's going to come out of my mouth. And you like compare those two
to each other. And really the ground truth is like the world simulation that had everything.
And what you want the memory to do is like you want the memory to be as close of a reconstruction
to predicting the next step as like the full context version.
And I think that's really what like evolution has done with our brains.
That's like how memory works with humans, right?
And I think another interesting twist here is that the machines we have created,
like this new version of AI, the way it reasons, the way it emulates humans,
is it's like emulating human reasoning written down into text.
And humans don't, we are not infinite context machines.
We're finite context machines.
So that means that the LLM should then be much better at doing this sort of, like,
short context reasoning as opposed to long context reasoning.
So that's really another comment, generally, about, like, long context versus short context.
And why like short context can often be better.
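One hedged way to operationalize that "world model" framing: score a memory summary by how close its predictive power gets to the full history's. Here `logprob_of` is a placeholder for any API that returns token log-likelihoods, and the metric itself is an assumption for illustration, not an established benchmark.

```python
# Sketch: memory quality as predictive reconstruction.

def logprob_of(target: str, context: str) -> float:
    raise NotImplementedError(
        "sum of target-token log-likelihoods given context")

def memory_quality(next_utterance: str,
                   full_history: str,
                   summary: str) -> float:
    """Ratio of summed log-likelihoods (both negative): ~1.0 means the
    one-page memory is nearly as predictive as the full history; much
    larger means the summary lost predictive signal."""
    full = logprob_of(next_utterance, full_history)
    short = logprob_of(next_utterance, summary)
    return short / full if full else float("nan")
```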
You know, I was working at, like, a data privacy company eight years ago, six years ago.
Anyways, it's a whole other story.
But one of the things I always think about is like the minute the data is released, it's free.
You never get it back.
Right.
And one of the challenges for any company that's trying to like have a moat moving forward is,
well, the minute that data leaves the confines
of your network, you no longer have a hold over anyone.
And a good example is how, like, this portable memory,
this memory stuff, like, plays into those frameworks.
The other thing that struck me as you were describing,
I don't know if you've seen the movie Inside Out,
but they have this, like, great scene
where they show how, like, memories go to long-term memory.
I don't know.
It's just a good...
I think I've seen it.
It's like a big, like, a marble dungeon or something, right?
Exactly.
And they have, like, all these, like,
you can see this graph where they connect these different memories together
and creates this formulation.
It's a good movie.
And actually, I think it
described kind of what you're actually doing quite well, for an animated movie, like, if you ignore all
the emotion stuff and just think about, like, how do we turn experiences into artifacts.
I'm super curious though, like, what's the performance of this? Like, how long does
it take to, like, create these memories? And are there measurable, like, have you come up
with, like, ways to measure the efficiency and, like, the quality of a memory? Like, how do you
know that a memory is... a memory is, like, a compression, right, of some observation, it's just a
compressed thing.
Like, how do you think about that?
Are we just up against, you know, fundamental information theory, you know, like, limitations?
Kind of curious to understand sort of what the limitations are here and how the performance
of a system like this works.
Yeah, and I think self-improvement is the other question I have.
Yeah, does it self-improve?
Yeah, I think self-improvement is, like, very, very important.
I think, like, in many ways, the whole thesis of our company led it is that we want to
create, like, memory that works so that you can have self-improving agents.
I don't necessarily mean self-improving
in that we'll have some sort of, like, singularity event where, like, it's an exponential that never
stops. But it's more like self-improving in that you would never wipe an agent. Like, today, every
agent is just, like, constantly being wiped. Like, you kind of restart, restart. Even though the
trajectory is, like, seven hours, you know, the Anthropic thing about a seven-hour-long Claude Code
run or something, well, it's not like that Claude Code run continued another seven hours. They
wiped it and then started it again, right? And I think that's just not the way, like, humans work.
And I think the reason we do that is because we just don't have like very good functional memory.
And on the latency side, one thing that we believe as a company is that we really should also be leaning on, like, AI to do the memory management.
And I think that's how you kind of latch on to all the improvements that will be happening with the models.
Like you basically want models to be deciding how to organize this memory.
And it turns out the models are actually very good at doing this.
They're often much better than any sort of, like, heuristics you could derive yourself.
So one thing that means is that you actually do want to use, like, very powerful, chunky models
to do memory management, like memory creation, memory rewriting.
It's definitely something that's high enough latency that you want to parallelize it.
You don't want to run the memory manager as a blocking process on the main agent.
And obviously that has, like, interesting consequences that you might be like one step
behind or things you have to like address related to that.
But I think that also will, like, improve over time.
I think latency, we live in a world where like intelligence will scale, latency will drive
down, cost will drive down.
So I think that's sort of design where you have memory management done by other agents
is very well positioned for the future.
And I think similarly, it also relates to how do you want to structure memory.
Well, why don't you just let it be text and let the AI decide how it wants to structure memory.
If the AI decides that the perfect way to structure the memory for this conversation is a Mermaid diagram,
you know, it can just make a Mermaid diagram in text.
If it wants to make a graph, it can make a graph in text,
really leaning on the models to do all this.
But yeah, when you lean on the models, it's very, very expensive.
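A sketch of that non-blocking memory manager, with a placeholder `complete` client standing in for the powerful "chunky" model; the prompt and structure are illustrative, not Letta's implementation.

```python
# Sketch: memory management as a parallel, non-blocking process.
import threading

def complete(prompt: str) -> str:
    raise NotImplementedError("plug in a powerful model here")

MEMORY = {"user": "Prefers concise replies."}

def rewrite_memory(new_events: str) -> None:
    # Let the model decide structure: prose, bullets, even a Mermaid
    # diagram, as long as it stays text.
    MEMORY["user"] = complete(
        "Rewrite this memory to incorporate the new events. Choose any "
        "text structure you think is best.\n"
        f"Memory:\n{MEMORY['user']}\nNew events:\n{new_events}")

def main_agent_turn(user_msg: str) -> None:
    # Memory management runs in parallel, so the user never waits on
    # it -- at the cost of the memory being one step behind.
    threading.Thread(target=rewrite_memory, args=(user_msg,),
                     daemon=True).start()
    # ... answer the user with the current MEMORY snapshot ...
```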
So what is Letta?
We haven't asked you yet what the company is.
We've talked a lot about memory, but tell us what Letta is, what your products are,
and how Letta helps people solve memory.
Letta, I think, very concisely put, is a platform to build agents that have long-term memory on.
And you can build them extremely fast on Letta, because the way we have designed everything
is that memory is table stakes.
You have to go out of your way to create an agent that does not have very advanced long-term
memory in Letta.
So if you want to build, like, a workflow in Letta that looks like n8n, you can do it,
but you're going to have to turn off a lot of things.
By default, you know, the agents we see our developers creating,
there are agents that are intended to be run indefinitely
and kind of have steady-state, self-improving memory.
And in terms of the actual product, like the product surface,
we have an agent's API.
So it's a developer toolkit.
So basically, you can either self-host or you can use a cloud,
but you use an agent's API instead of a chat API,
and the agent under the hood can have, like,
many agents in parallel managing its memory.
So you end up having an agent that has extremely high-quality memory,
but that memory is white-box.
So you can also, with the same API, like read and write to it manually.
So, yeah, the product is an agent's API, and there's an open-source self-hosted version of it.
There's a cloud version.
And any agent you create in this API has all the memory, like the latest and greatest memory based off of the research that we do as a company as well.
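For flavor, a minimal sketch of that agents API, assuming the letta-client Python SDK; the parameter names are approximate, so check the docs before relying on them.

```python
# Sketch: creating a Letta agent with memory on by default
# (approximate letta-client usage; verify against the docs).
from letta_client import Letta

client = Letta(token="YOUR_LETTA_API_KEY")  # or self-host and point at it

# Memory is table stakes: the agent gets editable memory blocks that
# persist across every future conversation.
agent = client.agents.create(
    model="openai/gpt-4o-mini",
    embedding="openai/text-embedding-3-small",
    memory_blocks=[
        {"label": "human", "value": "Name not known yet."},
        {"label": "persona", "value": "A helpful, long-lived assistant."},
    ],
)

# Same API: message the agent; its memory is white-box and can also be
# read and written directly.
response = client.agents.messages.create(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "My mom's name is Brenda."}],
)
```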
We have so much we can ask you, but we have to jump to our favorite section, what we call the spicy future.
Spicy futures.
So we've probably heard somewhat
what your spicy hot take is already,
but give us your spicy hot take
about AI or infra or whatever.
Yeah, I think my most relevant hot take
is that memory is more valuable than the models.
And I think right now that may not be true
because people are treating their agents
like throwaway workflow processes.
But I think as soon as, like, the memory nut is cracked,
or like more people kind of like shift
to running stateful agents, the memory very clearly, in my opinion, just becomes more valuable
than the model. Because I think we can all generally agree if we believe in, you know,
we use like very powerful AI tooling that, you know, in a few years, you will have effectively
like, employees that are at your company, that are living on Slack, and they might have, like,
one-, two-, three-, or ten-year tenures. They've been there for a very long time. And just think about
today, you know, how often you swap models, and how, like, the best frontier model is swapping
from provider to provider. If an agent is really just the state, the memory, the context
window that we were talking about, and the model, and then maybe some runtime that fuses
these two things together, in a world where we have these agents that are running, as long as
humans live or even surpassing human lifetimes, why would the model be more important than the context
or the state of the memory? I think that has very strong implications for what the landscape of
the most valuable AI companies looks like, right? Because I think that would kind of imply that the most
valuable AI companies, they are memory companies. And their number one asset is storing this
context, storing this memory. And the model is like, you know, this commodity that you can use
whatever model you want, maybe you use their model. They made the model the best for their full
stack experience. But the real asset at that company, you know, if you had to, like, break in there
and grab a hard drive, Mission Impossible style, and, like, walk out the door with the most
valuable thing, it's not the model weights. You're walking out with the agent memory.
How far along are we? Like, what are examples of companies that actually even do memory well?
Like, memory is just starting to pop up on, like, ChatGPT or Anthropic. Like, it feels like we're
still very, I mean, Anthropic is very early in comparison to ChatGPT. But, like, where are we
at in terms of, like, which agents have the best formulation of this, that a human can use and have
this experience? Like, are we early? Are we so early that most haven't even witnessed, like, what
this actually will look like in practice?
Yeah, I think we're still pretty early.
I mean, I think a lot of people use ChatGPT, and ChatGPT is pretty aggressive in trying to,
like, push the memory experience on consumers, which I think makes a ton of sense as a consumer
facing company.
But I think the fact that, you know, the average developer that uses maybe Claude Code and
also uses ChatGPT and uses, you know, whatever, isn't saying things like, you know, I didn't
try Gemini because, like, I just didn't bother, all my stuff is in ChatGPT.
The fact that no one has said that yet means we're very early, but that is coming soon.
I think it's just a matter of time.
I think it could be six months from now, it could be eight months from now.
We're definitely still very early.
And I think another aspect of being, like, a sign that we're very early is that the way people run agents is often very ephemeral.
There's very few of these agents in production anywhere today that run and they have like indefinite lifetimes.
Like Claude Code, still the default experience is, like, you are typing Claude Code,
you type claude, and it spawns Claude Code, but that's a fresh process. I think there will be a time
in the very near future where you are rebooting a session that, like, is running in perpetuity.
I think the number one thing that these teams, like Anthropic, also have to solve is this
thing I talked about earlier, of the floor dropping. I think people get extremely frustrated
when the floor drops because of memory. And I think for many teams, that means the easiest way to
solve this problem is to just not force memory on people. I think OpenAI, they're kind of
forward-looking in that they are okay with dropping the floor, because they're like, we have to,
like, be the first to have, like, cracked memory on the consumer side. So we're going to drop the floor
for some people, but we're also going to try to raise the ceiling. So I think that will kind of follow
for a lot of other companies. And when you say drop the floor, you mean that, like, the quality
of the experience degrades because they're actually, like, creating the wrong memories, and
those memories are adding noise instead of signal, is that it? Exactly, yeah. There's
this, um, blog by, like, Simon, right, where he, I think he was talking about how he tried to create
an image in ChatGPT, and then the image had, like, a weird sign of, like, Hawaii or something
in it. I might be misremembering this, but it's like, why is this here? I didn't ask for, like, a sign of
Hawaii. And then ChatGPT explained itself, like, well, I know that you like Hawaii from
our previous chats, you know, so I put it in there for you. That's an example of dropping the
floor. It's, like, making the product experience worse for some use case because you were pretty
heavy-handed with the memory. What do you think the formulation of the future user
experiences here? Like, do you think users are going to have to, like, select which memories
to include, or do you think we can be intelligent? Like, this still comes back to, like,
a core information retrieval problem of, like, how do I pull what context into the window,
right? Or are we kind of stuck with, like, the information retrieval solutions we
have, or do we need something new? Like, where do you think, like, the breakthrough is on solving
some of these problems? Like, yeah, well, I think it's a very open research problem. So I think
one thing I didn't answer in your question earlier was about, like, precise metrics, or, like,
being very objective about, like, understanding what makes good memory. And I think there is an
information theoretical perspective of a memory, which is that the optimal memory, especially if
it's short form, short context, is highly predictive of what will happen next, right? And I think you can
do things under this framework where it's kind of, like, you know,
many-outcome prediction, where you're saying, hey, like, the user today, when they're going to
chatgpt.com in this future world where ChatGPT is one thread, you have no chat history.
Very early, I will be able to determine if they're coming to code, if they're coming to
ask me to generate Studio Ghibli images, or if they are asking for, like, fitness advice, right?
And very early, depending on that, I will load a different memory, right?
And that's one way to, you know, prevent dropping the floor.
I think part of the reason, like, we're dropping the floor is because I think people like
the superhuman aspect of, like, ChatGPT only knowing about
what you're doing right now, because it doesn't have any clutter from anything else.
So yeah, I think there is an open research problem here of, like, how do you
use this information-theoretic perspective of predictability to actually, like, determine
the optimal memory construction.
But I do think the future definitely, at least on the consumer side, is, like, the one thread thing,
because, you know, researchers at these companies will push this agenda and make it better
and better and better.
And it's just, I think, a better user experience for the average user of a product with insane
distribution like the ChatGPT chat window, right? To just, like, one thread everything. A lot of
people, actually, I have learned recently, one-thread ChatGPT, like, they don't actually use multiple chat windows,
maybe more, like, less technical people. But yep.
Totally. Well, Charles, this has been incredible.
We appreciate you coming on the pod. I think everybody has learned a lot, I know I have. Where can people
find out more about you and Letta on the internet? Yeah, the best place to go is, um, if you're a developer, just
head straight to our docs page, docs.letta.com.
Like I said, you can basically create these agents with this sort of memory
I'm talking about, this really advanced memory that leverages sleep time compute
with agents running in the background.
And you can do that in, like, one API call.
Just like create agent.
And immediately it has this sort of memory baked in.
So, yeah, that's the best place to go.
And we also have a very active Discord,
if you're a developer on Discord and want to, like, chat with the team or anyone else
building on the platform.
Awesome. Thank you so much.
Of course.
Thank you.