Latent Space: The AI Engineer Podcast - Context Engineering for Agents - Lance Martin, LangChain

Episode Date: September 11, 2025

Lance: https://www.linkedin.com/in/lance-martin-64a33b5/
How Context Fails: https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html
How New Buzzwords Get Created: https://www.dbre...unig.com/2025/07/24/why-the-term-context-engineering-matters.html
Context Engineering: https://rlancemartin.github.io/2025/06/23/context_engineering/ https://docs.google.com/presentation/d/16aaXLu40GugY-kOpqDU4e-S0hD1FmHcNyF0rRRnb1OU/edit?usp=sharing
Manus Post: https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus
Cognition Post: https://cognition.ai/blog/dont-build-multi-agents
Multi-Agent Researcher: https://www.anthropic.com/engineering/multi-agent-research-system
Human-in-the-loop + Memory: https://github.com/langchain-ai/agents-from-scratch

- Bitter Lesson in AI Engineering -
Hyung Won Chung on the Bitter Lesson in AI Research:
Bitter Lesson w/ Claude Code:
Learning the Bitter Lesson in AI Engineering: https://rlancemartin.github.io/2025/07/30/bitter_lesson/
Open Deep Research: https://github.com/langchain-ai/open_deep_research https://academy.langchain.com/courses/deep-research-with-langgraph
Scaling and building things that “don’t yet work”:

- Frameworks -
Roast framework at Shopify / standardization of orchestration tools:
MCP adoption within Anthropic / standardization of protocols:
How to think about frameworks: https://blog.langchain.com/how-to-think-about-agent-frameworks/
RAG benchmarking: https://rlancemartin.github.io/2025/04/03/vibe-code/
Simon’s talk with memory-gone-wrong: https://simonwillison.net/2025/Jun/6/six-months-in-llms/

Timestamps
00:00 Introduction and Background
00:53 The Rise of Context Engineering
01:57 Context Engineering vs Prompt Engineering
05:56 The Five Categories of Context Engineering
10:02 Multi-Agent Systems and Context Isolation
14:48 Classical Retrieval vs Agentic Search
17:12 LLMs.txt and MCP Servers
24:51 Context Pruning and Memory Management
37:25 Memory Systems and Human-in-the-Loop
42:55 The Bitter Lesson Applied to AI Engineering
51:21 Frameworks, Abstractions, and Building for the Future

Transcript
Starting point is 00:00:03 Hey, everyone. Welcome to the Latent Space podcast. This is Alessio, founder of Kernel Labs, and I'm joined by swyx, founder of Smol AI. Hello, hello. We are so happy to be in the remote studio with Lance Martin from LangChain, LangGraph, and everything else he does. Welcome. It's great to be here. I'm a long-time listener to the pod, and it's finally great to be on. You've been part of our orbit for a while. You spoke at one of the AIEs, and also, obviously, we're pretty close with LangChain. Recently, though, you've also been doing a lot of tutorials. I remember you did R1 Deep Researcher, which is a pretty popular project, and async ambient agents. But the thing that really sort of prompted me to reach out and say, okay, it's finally time for the Lance Martin pod, is your recent work on
Starting point is 00:00:50 context engineering, which is all the rage. How'd you get into it? Well, you know, it's funny. Buzzwords emerge oftentimes when people have a shared experience, and I think lots of people started building agents kind of early this year, mid this year, quote unquote, the year of agents. And I think what happened is, when you put together an agent, it's just tool calling in a loop. It's relatively simple to lay out, but it's actually quite tricky to get to work well. In particular, managing context with agents is a hard problem. Karpathy had put out that tweet canonizing the term context engineering. And he mentioned this nice definition, which is that context engineering is the challenge of feeding an LLM just the right context for the next step,
Starting point is 00:01:32 which is highly applicable to agents. And I think that really resonated with a lot of people. I in particular had that experience over the past year working on agents, and I wrote about that a little bit in my piece talking about building Open Deep Research over the past year. So I think it was kind of an interesting point that the term captured a common experience that many people were having, and it took hold because of that. How do you define the lines between prompt engineering and context engineering? So is prompt optimization context engineering in your mind?
Starting point is 00:02:02 Like, I think people are confused. Are we replacing the term? What is it? Well, I think that, you know, prompt engineering is kind of a subset of context engineering. I think when we move from chat models and chat interactions to agents, there's a big shift that occurs. So with chat models, working with ChatGPT, the human message is really the primary input. And, of course, a lot of time and effort is spent on crafting the right message that's passed
Starting point is 00:02:28 to the model. With agents, the game is a bit trickier, though, because the agent's getting context not just from the human; context is now also flowing in from tool calls during the agent trajectory. And so I think this was really the key challenge that I observed and many people observed: when you put together an agent, you're not only managing the system instructions, the system prompt, and of course the user instructions. You also have to manage all this context that's flowing in at each step over the course of a large number of tool calls. And I think there's been a number of good pieces on this. Manus put out a great piece,
Starting point is 00:03:00 talking about context engineering with Manus. And they made the point that the typical Manus task is like 50 tool calls. Anthropic's multi-agent researcher is another nice example of this. They mentioned that the typical production agent, and this is probably referring to Claude Code, though it could be other agents that they've produced, is like hundreds of tool calls.
Starting point is 00:03:20 When I had my first experience with this, and I think many people have this experience, you put together an agent, you're sold the story that it's just tool calling in a loop. That's pretty simple. You put it together. I was building deep research. These research tool calls
Starting point is 00:03:32 are pretty token heavy. And suddenly you're finding that my deep researcher, for example, with a naive tool calling loop was using 500,000 tokens. It was like $1 to $2 per run. I think this is an experience that many people had.
Starting point is 00:03:45 And I think the challenge is realizing that, oof, building agents is actually a little bit tricky, because if you just naively plumb in the context from each of those tool calls, you just hit the context window
Starting point is 00:03:58 of the LLM. That's kind of the obvious problem. But also, Jeff from Chroma spoke about this on the recent pod. There's all these weird and idiosyncratic failure modes as context gets longer. So Jeff has that nice report on context rot. And so you have both these problems happening. If you build a naive agent, context is flowing in from all these tool calls. It could be dozens to hundreds.
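To make the token blow-up described here concrete, the following is a rough back-of-the-envelope sketch, not something shown in the episode. It assumes a naive loop that appends every tool result in full and re-sends the whole history each turn; the 4,000 tokens per tool result and the 1,000-token base prompt are illustrative assumptions, while the 50 tool calls echoes the Manus figure mentioned above.

```python
def naive_loop_tokens(num_tool_calls: int, tokens_per_result: int,
                      base_prompt_tokens: int = 1_000) -> tuple[int, int]:
    """Return (final_context_size, total_input_tokens) for a naive agent loop.

    Assumption: every tool result is appended to the message history in full,
    and the entire history is sent to the model again on every turn.
    """
    context = base_prompt_tokens
    total_input = 0
    for _ in range(num_tool_calls):
        total_input += context        # the model reads the whole history this turn
        context += tokens_per_result  # the raw tool result is appended afterwards
    return context, total_input


final_context, total_input = naive_loop_tokens(num_tool_calls=50, tokens_per_result=4_000)
print(f"context after last call: {final_context:,} tokens")
print(f"input tokens read across all turns: {total_input:,}")
```

Under these assumptions the context grows linearly with the number of tool calls, but the total tokens read across the run grow roughly quadratically, which is why cost and context-length degradation show up together.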
Starting point is 00:04:21 And there's degradation in performance with respect to context length. And also the trivial problem of hitting the context window itself. So this was kind of, I think, the motivation for this new idea that it's actually very important to engineer the context that you're feeding to an agent. And that spawned into a bunch of different ideas that I put together in the blog post that people are using to handle this, drawn from Anthropic, from my own experience, from Manus and others. So I'm just going to put some of the relevant materials on screen just because we like to, you know, part this is going to do. We'd like to have some visual aid. We did our
Starting point is 00:04:55 post on GPT-5 and we called it Thinking with Tools. So part of what tools are for is to get context. And I think with using tools to obtain more context, the agent can figure out what context it needs if you just tell it to. And then the other one is, actually, I thought you did a blog post on this, but apparently it was just like, this is it. I will say it's funny. And actually, I was hoping you'd bring this up. I also have a blog post, but it's all moving so quickly that I did a meetup after the blog post and updated the story a little bit with this meetup. So actually, this is a better thing to show. But I do have a blog post too. But things changed between my blog post and the meetup, which were like two weeks
Starting point is 00:05:32 apart. So that's how quickly these things are moving. Exactly. That's the blog post. Should we do this sequentially then? I think it's actually okay to just hit the meetup. Because it's just easier to follow one thing. And it's like a superset of the blog post story. Okay. How do you define the five categories? So, I mean, I understand what offload kind of means, but can you maybe, yeah, go deeper? Yeah, yeah. Let's walk through these, actually. When I talked about naive agents, and the first time I built an agent, the agent makes a bunch of tool calls. Those tool calls are passed back to the LLM at each turn,
Starting point is 00:06:07 and you naively just plumb all that context back. And of course, what you see is the context window grows significantly because this tool feedback is accumulating in your message history. A perspective that Manus shared in particular I thought was really good. It's important and useful to offload context. Don't just naively send back the full context of each of your tool calls.
Starting point is 00:06:31 So they talk about this idea of using the file system as externalized memory. Rather than just writing back the full content of your tool calls, which could be token-heavy, write those to disk, and you can write back a summary, or it could be a URL, something so that the agent knows it's retrieved a thing.
Starting point is 00:06:47 It can fetch that on demand, but you're not just naively pushing all that raw context back to the model. So that's this offloading concept. Note that it could be a file system. It could also be, for example, agent state. So LangGraph, for example, has this notion of state. So it could be kind of the agent runtime state object.
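The offloading idea described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Manus or LangGraph implementation: `offload_tool_result`, `fetch_offloaded`, and the message dictionary shape are hypothetical names chosen for the example, the summary is assumed to come from a separate summarization step, and a temporary directory stands in for the agent's file system.

```python
import hashlib
import tempfile
from pathlib import Path

# Assumption: a scratch directory stands in for the agent's externalized memory.
OFFLOAD_DIR = Path(tempfile.mkdtemp(prefix="agent_offload_"))


def offload_tool_result(tool_name: str, raw_output: str, summary: str) -> dict:
    """Save the raw tool output to disk and return a compact history message.

    Only the summary and a pointer to the file go back into the message
    history; the token-heavy raw output stays on disk.
    """
    digest = hashlib.sha256(raw_output.encode("utf-8")).hexdigest()[:12]
    path = OFFLOAD_DIR / f"{tool_name}_{digest}.txt"
    path.write_text(raw_output, encoding="utf-8")
    return {
        "role": "tool",
        "name": tool_name,
        "content": f"{summary}\n[full output saved at {path}]",
        "path": str(path),
    }


def fetch_offloaded(path: str) -> str:
    """Hypothetical file tool the agent can call to read the raw output on demand."""
    return Path(path).read_text(encoding="utf-8")


raw = "long scraped page text " * 2_000
message = offload_tool_result("web_search", raw, summary="Page covers X, Y, and Z.")
print(len(raw), "characters offloaded;", len(message["content"]), "characters kept in history")
```

Because the raw output is still on disk, the summary can be aggressive without being irreversible: the agent can always call the fetch tool if the summary turns out to be insufficient.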
Starting point is 00:07:07 It could be the file system. But the point is you're not just plumbing all the context from your tool calls back into the agent's message history. You're saving it to an externalized system. You're fetching it as needed. This saves token costs significantly. So that's the offloading concept. I guess the question on the offloading is, what's the minimum summary, metadata, or whatever you need to keep in the context to let the model
Starting point is 00:07:33 understand what's in the offloaded context? Like if you're doing deep research, obviously you're offloading the full pages maybe, but how do you generate an effective summary or blurb about what's in the file? This is actually a very interesting and important point. So I'll give an example from what I did with Open Deep Research. So Open Deep Research is a deep research agent that I've been working on for about a year. And it's now, according to Deep Research Bench, the best-performing deep research agent, at least on that particular benchmark.
Starting point is 00:08:01 So it's pretty good. Listen, it's not as good as OpenAI's Deep Research, which uses end-to-end RL. It's all fully open source, and it's pretty strong. So I just do carefully prompted summarization. I try to prompt the summarization model to give an exhaustive set of bullet points of the key things that are in the post, just so the agent can know whether to retrieve the full context later. So I think if you're doing summarization, prompting carefully for recall, compressing it but making sure that all the key bullet points necessary for the LLM to know what's in that piece of full context are there, is actually very important when you're doing this kind of summarization step.
Starting point is 00:08:42 Now, Cognition had a really nice blog post talking about this as well. And they mentioned you can really spend a lot of time on summarization, so I don't want to trivialize it, but at least my experience has been that it's worked quite effectively. Prompt a model carefully to capture exactly. So in this post, they talk a lot about even using a fine-tuned model for performing summarization. In this case, they're talking about agent-to-agent boundaries and summarizing, for example, message history, but the same challenges apply to summarizing, for example, the full contents
Starting point is 00:09:13 of token-heavy tool calls so the model knows what's in context. I basically spent a lot of time prompt engineering to make sure my summaries capture with high recall what's in the document, but compress the content significantly. I do think that the compression, that was also part of the meetup findings of yesterday, where we were at the context engineering meetup that Chroma hosted,
Starting point is 00:09:34 that you do want frequent compression because you don't want to hit the context rot limit. I'm not sure there's much else to say. Offloading is important, and you should probably do it. There was also a really interesting link. I guess somebody, I think Dex, was linking it to the concept of multi-agents: why you do want multi-agents is because you can compress and load in different things based on the role of the agent.
Starting point is 00:09:57 And probably a single agent would not have all the context. That's exactly right. And actually, one of the other big themes I talk about quite a bit is context isolation with multi-agent. And I do think this does link back to the Cognition take, which is interesting. Their argument against multi-agent is literally called Don't Build Multi-Agents. Correct. And what they're arguing is a few different things.
Starting point is 00:10:24 One of the main things is that it is difficult to communicate sufficient context to subagents. They talk a lot about spending time on that summarization or compression step. They even use a fine-tuned model to ensure that all the relevant information gets through. So they actually show it a little bit down below as kind of a linear agent, but even at those agent-to-agent boundaries, they talk a lot about being careful about how you compress information and pass it between agents. Yeah, I think the biggest question for me is, coding is kind of the main use case that I have. And I think I still haven't figured out how much value there is in showing how the implementation was made to then write, if you have a subagent that writes tests or you have a type agent that does different things.
Starting point is 00:11:08 How much do you need to explain to it about how you got to the place the codebase is in versus not? And then does it only need to return the test back into the context of the main agent? If it has to fix some code to match the test, should it say that to the main agent? It's clear to me in the deep research case because it's kind of atomic pieces of content that you're going through. But when you have state that depends between the subagents, I think that's the thing that's still unclear to me. That's one of the most important points about this context isolation bucket. So Cognition argues, which actually I think is a very reasonable argument, don't do subagents, because each subagent implicitly makes decisions and those decisions can conflict. So you have subagent one doing a bunch of tasks, subagent two doing a bunch of tasks.
Starting point is 00:12:02 Those decisions may be conflicting, and then when you try to compile the full result, in your example of coding, there could be tricky conflicts. I found this to be the case as well. And I think a perspective I like on this is: use multi-agent in cases where there's very clear and easy parallelization of tasks. Walden Yan from Cognition spoke on this quite a bit. He talks about this idea of read versus write tasks. So, for example, if each subagent is writing some component of your final solution, that's much harder. They have to communicate, like you're saying. And agent-to-agent communication is still quite early. But with deep research, it's really only reading. They're just doing context collection, and you can do a write from all that shared
Starting point is 00:12:47 context after all the subagents work. And I found this worked really well for deep research, and actually Anthropic reports on this too. So their deep researcher just uses parallelized subagents for research collection, and they do the writing in one shot at the end. So this works great. So it's a very nuanced point that what you apply context isolation to, in terms of the problem, yes, you can see this is their work, matters significantly. Coding may be much harder. In particular, if you're having each subagent
Starting point is 00:13:18 create one component of your system, there's many potentially implicitly conflicting decisions each of the subagents is making. When you try to compile a full system, there may be lots of conflicts. With research, you're just doing context gathering in each of those subagent steps, and you're writing in a single step.
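The read-versus-write split described here can be sketched as a fan-out of read-only subagents followed by a single write step. This is a minimal sketch under stated assumptions, not the Anthropic or Open Deep Research implementation: `research_subagent` and `write_report` are hypothetical stubs standing in for LLM-plus-search loops, and threads stand in for whatever parallel runtime an agent framework provides.

```python
from concurrent.futures import ThreadPoolExecutor


def research_subagent(subtopic: str) -> str:
    """Hypothetical read-only subagent: gathers context and returns notes.

    In a real system this would be an isolated LLM loop with search tools;
    here it is a stub so the control flow is runnable.
    """
    return f"notes on {subtopic}"


def write_report(question: str, notes: list[str]) -> str:
    """Single write step: one writer sees all gathered context at once,
    so no two subagents make conflicting writing decisions."""
    body = "\n".join(f"- {note}" for note in notes)
    return f"Report: {question}\n{body}"


def deep_research(question: str, subtopics: list[str]) -> str:
    # Fan out: each subagent only reads and collects, in its own context.
    with ThreadPoolExecutor(max_workers=4) as pool:
        notes = list(pool.map(research_subagent, subtopics))  # input order is preserved
    # Fan in: the write happens once, after all context is collected.
    return write_report(question, notes)


print(deep_research("How do agents manage context?", ["offloading", "retrieval", "pruning"]))
```

The design choice is that subagents never produce parts of the final artifact, only inputs to it, which is the property that makes this pattern easier for research than for coding.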
Starting point is 00:13:36 So I think this was kind of a key tension between the Cognition take, don't do multi-agents, and the Anthropic take, hey, multi-agents work really well. It depends on the problem you're trying to solve with multi-agents. So this was a very subtle
Starting point is 00:13:49 and interesting point. What you apply multi-agents to matters tremendously, and how you use them. I like the take: apply multi-agents to problems that are easily parallelizable and read-only, for example, context gathering for deep research, and do the final, quote-unquote,
Starting point is 00:14:04 write, in this case, report writing, at the end. I think this is trickier for coding agents. I did find it interesting that Claude Code now allows for sub-agents. So they obviously have some belief
Starting point is 00:14:16 that this can be done well. At least it can be done. But I still think I actually kind of agree with Walden's take. It can be very tricky in the case of coding if sub-agents
Starting point is 00:14:25 are doing tasks that need to be highly coordinated. I think that's a well-explained contrast and comparison. Not much to add there. I think it's interesting that they have different
Starting point is 00:14:35 use cases and different architectures evolved. I don't know if that's a permanent thing, that might fall to the bitter lesson, as you would put it. Yes. We should probably talk about
Starting point is 00:14:45 some of the other parts of the system that you set up. Yeah. Because there's a lot of interesting techniques there. Let's talk about classic old retrieval. So RAG has obviously been in the air for many years now, obviously well before LLMs
Starting point is 00:15:00 and this whole wave. One thing I found pretty interesting is, for example, different code agents take very different approaches to retrieval. Varun from Windsurf shared an interesting perspective on how they approach retrieval
Starting point is 00:15:14 in the context of Windsurf. So they use classic code chunking along carefully designed semantic boundaries, embedding those chunks. So classic semantic similarity vector search and retrieval. But they also combine that with, for example,
Starting point is 00:15:30 grep. They then also mentioned knowledge graphs. They then talk about combining those results and doing re-ranking. So this is kind of your classic, complicated, multi-step RAG pipeline. Now, what's interesting is Boris from Anthropic, with Claude Code, has taken a very different approach. He's spoken about this quite a bit. Claude Code doesn't do any indexing. It's just doing, quote-unquote, agentic retrieval, just using simple tool calls,
Starting point is 00:15:56 for example, using grep, to kind of poke around your files, no indexing whatsoever, and it obviously works extremely well. So there's very different approaches to RAG and retrieval that different code agents are taking. And this seems to be kind of an interesting and emerging theme. Like when do you actually need more hardcore indexing? When can you just get away with simple agentic search using very basic file tools? Yeah, one of the more viral moments from one of our recent podcasts was Boris's pod with us, and Cline also mentioning that they just don't do code indexing.
Starting point is 00:16:33 They just use agentic search. And that's a really good 80-20. And then if you really want to fine-tune, probably you want to do a little mix, but maybe you don't have to do it for your needs. Yeah, I actually just saw Cline posted, I think, yesterday, talking about how they only use grep. They don't do indexing.
Starting point is 00:16:49 And so I think within the retrieval area of context engineering, there are some interesting tradeoffs you can make with respect to whether you're doing classic vector store-based semantic search and retrieval with a relatively complicated pipeline, like Varun's talking about with Windsurf, or just good old agentic search with basic file tools. I will note I actually did a benchmark on this myself. I think there's a shared blog post somewhere.
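To illustrate what "agentic search with basic file tools" can mean in practice, here is a minimal sketch of a grep-style tool that an agent could call instead of querying a prebuilt index. It is an assumption-laden illustration, not how Claude Code or Cline implement search: `grep_tool`, its parameters, and the hit format are hypothetical, and it uses only the Python standard library.

```python
import re
from pathlib import Path


def grep_tool(pattern: str, root: str, glob: str = "*.py", max_hits: int = 20) -> list[str]:
    """Hypothetical file-search tool: regex search over files, no index required.

    Returns at most `max_hits` lines formatted as "path:line_number: text",
    which keeps the result small enough to pass back into the agent's context.
    """
    regex = re.compile(pattern)
    hits: list[str] = []
    for path in sorted(Path(root).rglob(glob)):
        if not path.is_file():
            continue
        try:
            lines = path.read_text(encoding="utf-8", errors="ignore").splitlines()
        except OSError:
            continue  # unreadable file: skip rather than fail the whole search
        for lineno, line in enumerate(lines, start=1):
            if regex.search(line):
                hits.append(f"{path}:{lineno}: {line.strip()}")
                if len(hits) >= max_hits:
                    return hits
    return hits


# Example call an agent might make: find function definitions under the current directory.
for hit in grep_tool(r"^def \w+", root=".", max_hits=5):
    print(hit)
```

There is nothing to build or keep in sync here, which is the maintenance argument made for this approach; the trade-off is that the agent may need several search calls to find what a good index would return in one.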
Starting point is 00:17:13 I'll bring it up right now. Yep. I actually looked at this a bit myself. This was a while ago. But I compared three different ways to do retrieval on all the LangGraph documentation for a set of 20 coding questions related to LangGraph. So I basically wanted to allow different code agents to write LangGraph for me by retrieving from our docs. I tested Claude Code and Cursor.
Starting point is 00:17:39 I used three different approaches for grabbing documentation. So one was, I took all of our docs, around 3 million tokens. I indexed them in a vector store and just did classic old vector store search and retrieval. I also used an llms.txt with just a simple file loader tool. So that's more like the agentic search: just basically look at this llms.txt file, which has all of the URLs of our documents
Starting point is 00:18:06 with some basic description, and let the LLM, or the code agent in this case, just make tool calls to fetch specific docs of interest. And I also just tried context stuffing. So take all the docs, 3 million tokens, and just feed them all to the code agent. So these are just some results I found comparing Claude Code to Cursor.
Starting point is 00:18:26 And interestingly, what I actually found, and this is only my particular test case, is that llms.txt with good descriptions, which is very simple, it's just basically a markdown file with all the URLs of your documentation and a description of what's in each doc, just passed to the code agent with a simple tool to grab files, is extremely effective.
Starting point is 00:18:52 And what happens is the code agent can just say, okay, here's the question, I need to grab this doc and read it, I'll read it, I need to grab this doc, read it. This worked really well for me, and I actually use this all the time. So I personally don't do vector store indexing. llms.txt with a simple search tool with Claude Code is kind of my go-to. Claude Code, in this case, this was done a few months ago. These things are always changing.
Starting point is 00:19:14 At this particular point in time, Claude Code actually outperformed Cursor for my test case. This actually Claude Code pilled me, and I did this back in April. So I've been kind of on Claude Code since. But that was really it. So this kind of goes to the point that Boris has been making about Claude Code, and Cline as well. You give an LLM access to simple file tools. In this case, I actually use an llms.txt to help it out
Starting point is 00:19:36 so it can actually know what's in each file. It's extremely effective and much simpler and easier to maintain than building an index. So that's just my own experience as well. The scaled-up form of llms.txt, which I really like and use quite a bit, is actually DeepWiki from Cognition. So I made a little Chrome extension for myself where, for any repo, including yours, I can just hit DeepWiki. And this is an llms.txt, kind of, but also I read it.
Starting point is 00:20:05 This is a great example. And actually, I think this could be a very nice approach. Take a repo, compile it down to some kind of easily readable. Yeah, llms.txt. What I actually found was that even using an LLM to write the descriptions helped a lot. So I have actually a little package on my GitHub
Starting point is 00:20:22 where it can rip through documentation and just pass it to a cheap LLM to write a high-quality summary of each doc. This works extremely well. And so that llms.txt then has LLM-generated descriptions. Yeah, this one. This is a little repo. It got almost no attention,
Starting point is 00:20:39 but I found it to be very useful. So basically it's trivial. You just point it to some documentation. It can kind of rip through it, grab all the pages, send each one to an LLM, and the LLM writes a nice description and compiles it into an llms.txt file.
Starting point is 00:20:51 I found when I did this, and I fed that to Claude Code, Claude Code is extremely good at saying, okay, based on the description, here's the page I should load. Here's the page I should load for the question asked. I use it when I'm trying to generate an llms.txt for new documentation. I've done this for LangGraph.
Starting point is 00:21:08 I've done it for a few other libraries that I use frequently. You just give that to Claude Code. Then Claude Code can rip through and grab docs really effectively. Super simple. The only catch is I found that the descriptions in your llms.txt matter a lot because the LLM actually has to use the descriptions to know what to read. You know, anyway, that's just a nice little utility that I use all the time.
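The llms.txt workflow described here (one link plus one description per doc, with the descriptions doing the routing) can be sketched as follows. This is an illustrative sketch, not the package mentioned in the episode: `build_llms_txt`, `parse_llms_txt`, and `first_sentence` are hypothetical names, `first_sentence` is a stand-in for the cheap LLM that would write each description, and the example.com URLs are placeholders. The `- [title](url): description` line shape follows the link-list format used by llms.txt files.

```python
import re


def first_sentence(text: str) -> str:
    """Stand-in for a cheap LLM summarizer (assumption): keep the first sentence."""
    return text.split(".")[0].strip() + "."


def build_llms_txt(title: str, pages: dict[str, str], describe) -> str:
    """Compile {url: page_text} into an llms.txt-style markdown file.

    `describe` is any callable that turns page text into a short description;
    in practice this is where a carefully prompted LLM call would go.
    """
    lines = [f"# {title}", "", "## Docs", ""]
    for url, text in pages.items():
        lines.append(f"- [{url}]({url}): {describe(text)}")
    return "\n".join(lines) + "\n"


LINK_RE = re.compile(r"^- \[(?P<name>[^\]]+)\]\((?P<url>[^)]+)\):\s*(?P<desc>.*)$")


def parse_llms_txt(content: str) -> list[dict]:
    """Return the entries an agent would scan to decide which doc to fetch."""
    entries = []
    for line in content.splitlines():
        match = LINK_RE.match(line.strip())
        if match:
            entries.append(match.groupdict())
    return entries


pages = {
    "https://example.com/docs/state": "Explains how graph state is defined. More detail follows.",
    "https://example.com/docs/tools": "Shows how to bind tools to a model. More detail follows.",
}
llms_txt = build_llms_txt("Example Docs", pages, describe=first_sentence)
print(llms_txt)
print(parse_llms_txt(llms_txt))
```

Since the agent only sees the descriptions before deciding what to load, the quality of whatever `describe` produces largely determines retrieval quality, which matches the caveat raised above.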
Starting point is 00:21:30 When we had Cline on, they said the Context7 MCP by Upstash, which is like an MCP for project documentation and stuff like that, was one of the most used. Have you tried it? Have you seen anything else like that that automates some of this stuff away? Well, you know, it's funny. We have an MCP server for LangGraph documentation that basically gives, for example, Claude Code the llms.txt file and a simple file search tool.
Starting point is 00:21:55 Now, Claude has built-in fetch tools, but at the time we built it, it didn't. But it's a very simple MCP server that exposes llms.txt to, for example, Claude Code. It's called mcpdoc. So it's a little, very simple utility. I use that all the time, extremely useful. You basically can just point it to all the llms.txt files you want to work with. Well, the MCP docs have an MCP server that you can search the docs with. So it's kind of turtles all the way down. But I guess my question is, should this be one server per project, you know, or at some point are you going to have kind of a meta server? And I think part of it is, you know, once you move on from just doing tool calling in servers to doing things
Starting point is 00:22:39 like sampling and, you know, prompts and resources and stuff like that, you can do a lot of the extraction in the server itself as well. And again, it goes back to your point on context engineering. It's like maybe you do all that work, not in the context, but in the server, and then you just put the final piece that you care about in the context. But it seems very early. Yeah, this is actually a very interesting point. I've spoken with folks from Anthropic about this quite a bit. I found that storing prompts in MCP servers is actually pretty important,
Starting point is 00:23:10 in particular to tell the LLM or code agent how to use the server. And so I actually end up doing separate servers for different projects with specific prompts. And you can also store resources, so sometimes I'll have specific resources for that particular project in the server itself. So I actually don't mind separating servers project-wise, with the project-specific context and prompts necessary for that particular task. Yeah, a lot of people actually may have missed some features of the MCP spec, and you do have prompts in there. It's probably one of the first actual features that they have, which actually may be kind of underrated. Like people kind of view MCP as just tool integration,
Starting point is 00:23:53 but there's actually a lot of stuff in here, including sampling, which is underrated too. That's exactly right. And actually, the prompting thing is pretty important because even to use our little simple mcpdoc server for LangGraph docs, I found it's better, of course, if you prompt it. But then I had to put in the README initially, like, oh, okay, here's how you should prompt it.
Starting point is 00:24:14 But of course, that prompt can just live in the server itself. So you can kind of compartmentalize the prompt necessary for the LLM to use the server effectively within the server itself. And this was a problem I saw initially. A lot of people were using our mcpdoc server and finding, oh, this doesn't work well. And it's like, oh, it's a skill issue. You need to prompt it better. But then that's our problem.
Starting point is 00:24:32 The prompt should actually live in the server and should be available to the code agent. Right. So it knows how to use a server. Right. So that's maybe retrieval. And that's a whole. Retrieval is a big theme. It obviously predates this new term of context engineering, but there's a lot going on in
Starting point is 00:24:46 the retrieval bucket. It certainly is an important subset of context engineering. I'm wondering if there's any other trends in retrieval before you leave the topic. You know, I think one other thing I was tracking was just ColBERT and the general concept of late interaction.
Starting point is 00:25:00 I don't know if you guys do a ton on that, but some sort of in-between element between full agentic and full pre-indexing, two-phase indexing maybe is what I would call it. Any comments on that? I haven't personally looked at ColBERT very much.
Starting point is 00:25:17 I've played with it only a little bit. So I don't have much perspective there, unfortunately. All right, happy to move on. We could talk about reducing context briefly. Everyone's had an experience with this, because if you use Claude Code, you've hit 95% of the context window and Claude Code's about to perform compaction.
Starting point is 00:25:35 So that's a very intuitive and obvious case in which you want to do some kind of context reduction when you're near the context window. I think an interesting take here, though, is there's a lot of other opportunities for using summarization. We talked about it a little bit previously with offloading,
Starting point is 00:25:51 but actually at tool call boundaries is a pretty reasonable place to do some kind of compaction or pruning. I use that in Open Deep Research. Hugging Face actually has a very interesting open deep research implementation. It actually uses, like, not a coding agent,
Starting point is 00:26:05 but the code-agent implementation. So instead of tool calls as JSON, tool calls are actually code blocks. They go to a coding environment that actually runs the code. And one argument they make there is that they perform some kind of summarization or compaction and only send back limited context to the LLM, and leave the raw tool call output itself, which is often token heavy as we're talking about deep research, in the environment. So it's another example. Anthropic, in their multi-agent researcher, also does summarization of findings.
Starting point is 00:26:42 So I think you see pruning show up all over the place. It's pretty intuitive. I think an interesting counter to pruning was made by Manus. They make the point and the warning that pruning comes with risk, particularly if it's irreversible. And Cognition kind of hits this too. They talk about how we have to be very careful with summarization. You can even fine-tune models to do it effectively. That's actually why Manus kind of has the perspective that you should definitely use context offloading.
Starting point is 00:27:12 So perform tool calls, offload the raw observations to, for example, disk, so you have them. Then sure, do some kind of pruning or summarization, like Alessio was asking before, to pass useful information back to the LLM, but you still have that raw context available to you, so you don't have kind of lossy compression or lossy summarization. So I think that's an important and useful caveat to note on the point of summarization or pruning. You have to be careful about information loss. This is something that people do disagree on, and I'll just flag this, on pruning mistakes, pruning wrong paths. Manus says keep it in, so you can learn from the mistakes. Other people would say that, well, once you've made a mistake, it's going to keep going down
Starting point is 00:27:58 that path where there's a mistake, and you've got to unwind. Or you've just got to, like, prune it and tell it, do not do the thing I know to be wrong. So then you just do the other thing. I don't know if you have an opinion, but I would call this out. There was someone that spoke yesterday who would have disagreed with this. That's actually very interesting. Drew Breunig has a nice blog post that hits this point. He talks about this theme of context poisoning, and apparently Gemini reports on this in their technical report. He talked about how, for example, a model can produce a hallucination, and that hallucination is stuck in the history of the agent. And it can kind of poison the context, so to speak, and kind of steer the agent off track. And I think he cited a very specific example from
Starting point is 00:28:37 Gemini 2.5 playing Pokemon, which they mentioned in the technical report. So that's one perspective on this issue: we should be very careful about mistakes in context that can poison the context. That's perspective one. Perspective two is, like you're saying, if an agent makes a mistake, for example, calling a tool, you should leave that in so it knows how to correct. So I think there is an interesting tension there. I will note it does seem that Claude Code will leave failures in. I notice when I work with it, for example, it'll kind of have an error, the error will get printed, and it'll kind of use that to correct. So, in my experience working with agents in particular, for tool call errors, I actually like to keep them in,
Starting point is 00:29:15 personally. That's just been my experience. I don't try to prune them. Also, for what it's worth, it can be kind of tricky to prune from the message history. You have to decide when to do it. So you're introducing a bunch more code you have to manage. So I'm not sure I love the idea of kind of selectively trying to prune your message history when you're building an agent. It can add more logic you need to manage within your kind of agent scaffolding or harness. It's a classic sort of precision-recall tradeoff, but, like, sort of reinvented for context in an agentic workflow. Exactly. Exactly. Right. Well, we're on the topic of Drew. Drew is obviously another really good author. He's coined a bunch of, like, sort of context engineering lore.
Starting point is 00:29:56 Any other commentary on stuff that, you know, you particularly like or disagree with? So he and I did a meetup on this. And I kind of like this quote from Stewart Brand. It was kind of comical. If you want to know where the future is being made, look for where language is being invented and lawyers are congregating. And he was talking about this idea of why buzzwords emerge. And he actually was the one who turned me onto this idea that a term like context engineering catches fire because it captures an experience that many people are having. They don't come out of nowhere. And if you scroll down a little bit, he kind of talks about this. He has a whole post about kind of, I think it's how to build a buzzword. But he talks a lot about this idea that successful buzzwords are capturing a common experience that many of us feel. And I think that's kind of the genesis of context engineering, also largely because many of us build agents. There are lots of ways that can be quite tricky. And, oh,
Starting point is 00:30:46 context engineering is kind of what I've been doing. And you hear a number of people saying that, and then it kind of resonates. And you say, oh, okay, yes. That describes my experience. So I think that's just an interesting aside on kind of how language emerges anthropologically in different communities.
Starting point is 00:31:00 I will co-sign this because that's exactly what I used to coin or come up with AI engineer. AI engineer. No, exactly. This is because people were trying to hire software engineers that were more optimistic with AI, and engineers wanted to work at companies that would respect their work, you know, and maybe also come out from the baggage of classical ML engineering.
Starting point is 00:31:23 A lot of AI engineers don't even need to use PyTorch because you can just prompt and do typical software engineering. And I think that's probably the right way, at least in a world where most of the frontier models are coming from closed labs. I think an interesting counter on this is when, for example, people try to create language that doesn't really resonate, that doesn't capture common experience, it tends to flop. Which is to say that buzzwords kind of co-evolve with the ecosystem. They tend to kind of become big and resonate because they actually capture experience. Many people try to coin terms that don't actually resonate and that go nowhere. Do you have experience with that? I'm the worst at naming things, but you do a good job, Shawn. Yes.
Starting point is 00:32:04 You nailed it, the few ones you put on Latent Space. So that's right. Cool. Well, you know, I wanted to talk about context engineering. Okay, so, sorry, I don't know if I sidetracked you a little bit. No, that's perfect. The meta stuff on Thali. That hits a lot of the major themes.
Starting point is 00:32:20 I can maybe just talk very briefly about one more. We could talk about the bitter lesson and some other things. Yeah. If you go back to that table, I just wanted to give Manus a shout-out because I thought they had one other very interesting point. Oh, the table that you had. Yes, exactly. We've talked about offloading,
Starting point is 00:32:37 reducing context, retrieval, context isolation. Those are, I think, the big ones you can see very commonly used. I do want to highlight Manus. I thought they had a very interesting take here about caching,
Starting point is 00:32:48 and it's a good argument. When people have the experience of building an agent, the fact that it runs in a loop and that all those prior tool calls are passed back through every time is quite a shock the first time you build an agent.
Starting point is 00:33:01 You add tokens with every tool call, and you incur that token cost on every pass through your agent. And so Manus talks about the idea of just caching your prior message history. It's a good idea. I haven't done it personally, but it seems quite reasonable. So caching reduces both latency and cost significantly. But don't most of the APIs auto-cache for you? I mean, if you're using, like, OpenAI, you would just automatically have a cache hit.
Starting point is 00:33:25 I'm actually not sure that's the case. For example, when you're building agents, you're passing your message history back through every time. As far as I know, it's stateless. There are different APIs for this across the different providers, but especially if you use just the Responses API, the new one, it should be that if you're never modifying the state, which is good for you if you believe that you shouldn't compress conversation history, bad for you if you do.
Starting point is 00:33:49 If you never modify the state, then you can just use the Assistants API, and everything that you pass in prior is going to be cached, which is kind of nice. Anthropic used to require a weird header thing, and they've made it more automatic. Yeah, okay, so that's a good call-out. So I had used Anthropic's caching header explicitly in the past, but it may be the case that caching is automatically done for you, which is fantastic if that's the case. I think it's a good call-out for Manus. Yeah, Gemini also introduced implicit caching.
Starting point is 00:34:15 It's really hard to keep up. Like, you basically have to follow everyone on Twitter and just, like, read everything. So that I must have a bullet bot for it. Yeah, yeah, yeah, yeah. Well, you know, it's interesting, though. So APIs are now supporting caching more and more. That's fantastic. I'd used Anthropic's explicit caching header in the past. I do think an important and subtle point here is that caching doesn't solve the long context problem. So it of course solves the problem of, like, latency and cost, but if you still have 100,000 tokens in context, whether it's cached or not, the LLM is utilizing that context.
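The distinction between cost and context length can be made concrete with a little arithmetic. The following is a minimal sketch, not any provider's billing formula: the function name, the `cached_discount` of 0.1, and the per-turn token counts are all assumptions chosen for illustration. It treats everything already in the history as a cacheable prefix that is billed at a discount on each pass.

```python
def agent_loop_token_stats(tokens_per_turn, cached_discount=0.1):
    """Track context growth and billed tokens for an agent loop.

    tokens_per_turn: new tokens appended to the history on each turn.
    cached_discount: hypothetical price multiplier for cached prefix tokens
                     (1.0 means no caching benefit at all).
    """
    context = 0          # tokens currently in the message history
    total_input = 0      # tokens the model reads, summed over all turns
    billed = 0.0         # token-equivalents paid for, after the cache discount
    for new_tokens in tokens_per_turn:
        cached_prefix = context              # history from earlier turns
        context += new_tokens                # history grows every turn
        total_input += context               # the whole history is re-read
        billed += cached_prefix * cached_discount + new_tokens
    return {
        "final_context": context,
        "total_input_tokens": total_input,
        "billed_token_equivalents": billed,
    }


with_cache = agent_loop_token_stats([1000] * 4, cached_discount=0.1)
without_cache = agent_loop_token_stats([1000] * 4, cached_discount=1.0)
print(with_cache)
print(without_cache)
```

In this toy model the billed amount drops with caching, but `final_context` is identical in both runs, which is the point made above: the model still has to attend over the full history.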
Starting point is 00:34:49 This came up. I actually asked Anton this at their context rot meetup or in their context rot webinar, and they kind of mentioned that the characterization of context rot that they made, they would expect to apply whether or not you're using caching. Caching shouldn't actually help you with all the context rot and long context problems. It absolutely helps you with latency and cost. I do wonder what else can be cached. I feel like this is definitely a form of lock-in because you ideally want to be able to run prompts across multiple providers and all that. Yeah, caching is a hard problem. Like, I think ultimately you control your destiny if you can run your own open models
Starting point is 00:35:28 because then you can also control the caching. Everything else is just a half approximation of that. That's right. That's exactly right. That is overall broad context engineering. I don't know if you have any other takes from, like, the meetup yesterday or questions. No, I think my main take from yesterday was, like, the quality of compacting. I think one of the charts showed that using the automated compacting of, like, OpenCode
Starting point is 00:35:53 and some of these tools is basically the same as not doing it, on, like, the quality of what you get from the previous instructions. And I think Jeff had this chart where, like, curated compacting is, like, 2x better. But I'm like, you know, how do you do curated compacting? I think that's something that maybe we can do a future blog post on. I think that's interesting to me, like, how do you compact, especially for coding agents, things that can get very, very long. I think for things like deep research, it's like, once I get the report, it's fine, you know. But for coding, it's like, well, I would like to keep building. I found that, like, even when you're writing tests or you're doing changes,
Starting point is 00:36:31 having the previous history is, like, helpful to the model. It seems to perform better when it knows why it made certain decisions. And I think how to extract that in a way that is, like, more token efficient is still unclear. I don't have an answer, but maybe, like, a request for work by people listening. Yeah. You know, that's a great point. It actually echoes some of Walden Yan's points from Cognition, also, that the summarization or compaction step is just non-trivial.
Starting point is 00:36:59 You have to be very careful with it. Devin uses a fine-tuned model for doing summarization within the context of coding. So they obviously spent a lot of time and effort on that particular step. And Manus kind of calls out that they are very careful about information loss. Whenever they do pruning, compaction, summarization, they always use a file system to offload things so they can retrieve it. So it's a good call out that compaction is risky when you're building agents. and very tricky.
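The offload-then-summarize pattern described here can be sketched in a few lines. This is an illustrative sketch only, not Manus's or Devin's implementation: `offload_observation`, `restore_observation`, and `truncate_summary` are hypothetical names, and the truncating summarizer is a stand-in for whatever LLM or fine-tuned model would do the real summarization.

```python
import hashlib
import tempfile
from pathlib import Path


def truncate_summary(text, max_chars=200):
    # Stand-in for a real summarizer (an LLM call in practice).
    return text if len(text) <= max_chars else text[:max_chars] + "..."


def offload_observation(raw, store_dir, summarize=truncate_summary):
    """Write the raw tool output to disk and return a compact reference.

    Only the returned dict (summary + path) goes back into the message
    history; the raw text stays on disk, so the compaction is reversible.
    """
    store_dir = Path(store_dir)
    store_dir.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16] + ".txt"
    path = store_dir / name
    path.write_text(raw, encoding="utf-8")
    return {"summary": summarize(raw), "path": str(path)}


def restore_observation(ref):
    # Retrieve the full observation if the summary turns out to be too lossy.
    return Path(ref["path"]).read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    raw_output = "search result line\n" * 500   # token-heavy tool output
    ref = offload_observation(raw_output, tmp)
    print(len(raw_output), len(ref["summary"]))
    assert restore_observation(ref) == raw_output
```

The design choice is that summarization is allowed to be lossy because the original is always one file read away.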
Starting point is 00:37:26 You know, I think there was previously a lot of interest in memory. And I was thinking about the interplay between memory and context engineering. I mean, are they kind of the same thing? Is it just a rebrand? Are there parts of memory? And, you know, you guys recently relaunched LangMem. That's also a form of context engineering.
Starting point is 00:37:44 But I don't know if there's, like, a qualitative philosophical difference. Yeah. So that's a good thing to hit, actually. I'd maybe think about this on two dimensions: writing memories, reading memories, and then the degree of automation on both of those. So take the simplest case, which actually I quite like,
Starting point is 00:38:03 Claude Code. How do they do it? Well, for reading memories, they just suck in your CLAUDE.md files every time. So every time you spin up Claude Code, it pulls in all your CLAUDE.md files. For writing memories, the user specifies, hey, I want to save this to memory,
Starting point is 00:38:18 and then Claude Code writes it to CLAUDE.md. So on this axis of, like, degree of automation, it's kind of like the zero-zero. It's very simple, and it's kind of very, like, Boris-pilled, like, super simple, and I actually quite like it. Now, the other extreme is maybe ChatGPT. So behind the scenes, ChatGPT decides when to write memories and it decides when to suck them in. And actually, I thought Simon at AI Engineer had a great talk on this, and it wasn't about memory, but he hit memory in the talk. And he mentioned, I don't know if you remember this, but it was a failure
Starting point is 00:38:50 mode in image generation, because he wanted an image of a particular scene and it sucked in his location and put it in the image. Like, it sucked in, like, Half Moon Bay or something and stuck it in the image. And it was a case of memory retrieval gone wrong. He didn't actually want that. So even in a product like ChatGPT that has spent a lot of time on memory, it's non-trivial. And I think my take is the writing of memories is tricky, like, when the system should actually write memories is non-trivial. Reading of memories actually kind of converges with the general concept of retrieval. Like, memory retrieval at large scale is just retrieval, right?
Starting point is 00:39:28 I kind of view them as the same. It's retrieval in a certain context, which is your past conversations. That's right. Which, you know, is different than retrieval from a knowledge base, different than retrieval from the public web. By the way, this is Simon's write-up on his website here, where he was just trying to generate an image, and then suddenly it shows up. There you go.
Starting point is 00:39:46 Actually, it's a subtle point. I don't know exactly what OpenAI does under the hood with respect to memory retrieval. My guess is they're indexing your past conversations and using semantic vector search and probably other things. So it may still be using some kind of knowledge base or vector store for retrieval. So in that sense, I kind of view it just simply as, you know, in the case of sophisticated memory retrieval, it is just like a complex RAG system, in the same way we talked about with, like, Varun and building Windsurf. It's kind of a multi-step RAG pipeline. So I kind of view memories, at least the reading part, as just, you know, it's just retrieval.
Starting point is 00:40:20 And actually, I quite like Claude's approach. It's very simple. Just the trivial thing. Just suck it in every time. I would also highlight the semantic differences that you've established, you know, episodic, semantic, procedural, and background memory processing. We've done an episode with the Letta folks on sleep-time compute. I think these are just, like, if you have ambient agents, very long-running agents,
Starting point is 00:40:40 you're going to run into this kind of context engineering, which was previously the domain of memory. And I would say that the classic context engineering discussion doesn't have this stuff. Not yet. So actually, there's an interesting point there. I did a course on building ambient agents, and I built this little email assistant that I use to run my email. I actually think this is maybe a sidebar on memory. Memory pairs really well with human-in-the-loop. So for example, in my little email assistant, it's just an agent that runs my email. I have the opportunity to pause it before it sends off an email and correct it if I want, like, change the tone of this email, or I can literally just modify the tool call. I have a little UI for that.
Starting point is 00:41:20 And every time, when you have these ambient agents, you edit, for example, or you give it feedback, you edit the tool call itself, that feedback can be sucked into memory. And that's exactly what I do. So actually, I think memory pairs very nicely with human-in-the-loop. And, like, when you're using human-in-the-loop to make corrections to a system, that should be captured in memory. And so that's a very nice way to use memory in kind of a narrow way that's just capturing user preferences over time. And it actually uses an LLM to reflect on the changes I made, reflect on the prior instructions in memory, and just update the instructions based upon my edits.
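A rough sketch of that reflection step might look like the following. It is an assumption-laden illustration, not the code from the course: `update_memory`, `REFLECT_PROMPT`, and the `llm` callable (any function that maps a prompt string to a completion string) are hypothetical names introduced here.

```python
from typing import Callable

# Hypothetical prompt template; the wording is illustrative only.
REFLECT_PROMPT = (
    "You maintain standing instructions for an email assistant.\n"
    "Current instructions:\n{instructions}\n\n"
    "The assistant proposed this draft:\n{proposed}\n\n"
    "The user edited it to:\n{edited}\n\n"
    "Rewrite the instructions so future drafts match the user's preference. "
    "Return only the updated instructions."
)


def update_memory(instructions: str, proposed: str, edited: str,
                  llm: Callable[[str], str]) -> str:
    """Fold a human-in-the-loop edit back into stored instructions."""
    if proposed == edited:
        # The user approved the draft unchanged, so there is nothing to learn.
        return instructions
    prompt = REFLECT_PROMPT.format(
        instructions=instructions, proposed=proposed, edited=edited
    )
    return llm(prompt).strip()


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call so the sketch runs offline.
    return "Keep emails short and use a casual tone.\n"


memory = "Keep emails short."
memory = update_memory(
    memory,
    proposed="Dear Sir, I am writing to confirm our meeting.",
    edited="Hey, confirming our meeting!",
    llm=fake_llm,
)
print(memory)
```

Writing to memory only when the human actually changed something keeps the update narrow, which matches the idea of capturing user preferences over time rather than storing everything.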
Starting point is 00:41:55 And that's a very simple and effective way to use memory when you're building ambient agents that I quite like. There is a course which you can find on the GitHub. And yeah, I mean, you know, you guys have done plenty of talks on ambient agents. That's right. But I think it's a very good point that it's often kind of confusing when to use memory. I think a very clear place to use it is when you're building agents that have human-in-the-loop, because human-in-the-loop is a great place to update your agent memory with your
Starting point is 00:42:20 preferences. So it kind of gets smarter over time and learns. It's exactly what I do with my little email assistant. So Harrison, I'm sure, I think he said this publicly, uses an email assistant for all his emails. He gets a lot as a CEO. I get much fewer, so I'm just a lowly guy, but I still use it. And that's a very nice way to use memory, to kind of pair it with human-in-the-loop. Yeah, totally. I've tried to use the email assistant before, but, like, you know, I'm still very married to my Superhuman. Yeah, fair enough. That's right. That's right. That's the coverage that we planned on context engineering. You have a little bit on
Starting point is 00:42:54 the bitter lesson that we could wrap up with. Yeah, that's a fun theme to hit on a little bit. I'd love to hear your perspective. So there's a great talk from Hyung Won Chung, previously OpenAI, now at MSL, on the bitter lesson and his approach to AI research. The take is compute 10xes every five years
Starting point is 00:43:16 for the same cost, of course. We all know that. The history of machine learning has Yeah, exactly this slide. Exactly. History of machine learning has shown that actually capturing this scaling is the most important thing. In particular, algorithms that are more general
Starting point is 00:43:30 with fewer inductive biases and more data and compute tend to beat algorithms with more, for example, hand-tuned features and inductive biases built in, which is to say, just letting a machine learn how to think itself with more compute and data,
Starting point is 00:43:46 rather than trying to teach a machine how we think, tends to be better. So that's kind of the bitter lesson piece simply stated. So his argument is this subtle point that at any point in time, when you're, for example, doing research, you typically need to add some amount of structure to get the performance you want at a given level of compute. But over time, that structure can bottleneck your further progress. And that's kind of what he's showing here, is that in the low compute regime, kind of on the left of that x-axis, adding more structure, for example, more modeling assumptions, more inductive biases, is better
Starting point is 00:44:25 than less. But as compute grows, less structure, and this is exactly the bitter lesson point, less structure, more general, tends to win out. So his argument was we should add structure at a given point in time in order to get something to work with the level of compute that we have today, but remember to remove it later. And a lot of his argument was, like, people often forget to remove that structure later. And I think my link here is that I think this applies to AI engineering too. And if you kind of scroll down, I have the same chart showing my little, exactly. This is my little example of building deep research over the course of a year. So I started with a highly structured research workflow, didn't use tool calling. I embedded a bunch of
Starting point is 00:45:11 assumptions about how research should be conducted. In particular, don't use tool calling because everyone knows tool calling is not reliable. This was back in 2024. Decompose the problem into a set of sections and parallelize each one, with those sections written in parallel into the final report. What I found is you're building LLM applications on top of models that are improving exponentially. So while the workflow was more reliable than building an agent back in '24, that flipped pretty quickly as LLMs got better and better. It's exactly like was mentioned in the Stanford talk: you have to be constantly reassessing your assumptions when you're building AI applications, given the capabilities of the models. I talk a lot here about the specific structure
Starting point is 00:45:51 I added, the fact that I used a workflow because we know tool calling doesn't work. This was back in 2024. The fact that I decomposed the problem because that's how I thought I should perform research. And this basically bottlenecked me. I couldn't use MCP as MCP got, for example, much more popular. I couldn't take advantage of the fact that tool calling was getting significantly better over time. So then I moved to an agent, started to remove structure, allowed for tool calling, let the agent decide the research path. A subtle mistake that I made, which links back to that point about failing to remove structure: I actually wrote the report sections within each sub-agent. So this kind of links back to what we talked about with sub-agents in isolation.
Starting point is 00:46:34 Sub-agents just don't communicate effectively with one another. So if you write report sections in each sub-agent, the final report is actually pretty disjoint. This is exactly Alessio's challenge and problem about using multi-agent. So I actually hit that exact problem. So I ripped out the independent writing and did a one-shot writing at the end. And this is the current kind of version of Open Deep Research, which is quite good. And this is kind of the thing that's, at least on Deep Research Bench, the best performing deep research assistant, at least that's open source. So it was kind of my own arc, although we do have some results with GPT-5 that are quite strong. So, you know, the models are always getting better. And so indeed, our open source
Starting point is 00:47:10 assistant actually takes advantage and rides that wave. But I actually kind of experienced, I felt like I actually got bitter-lessoned myself, because I started with a system that was very reliable for the current state of models back in mid-2024, early 2024. But I was completely bottlenecked as models got better. I had to rip out the entire system and rebuild it twice, rechecking my assumptions, in order to kind of capture the gains of the model. So I just want to flag, I think this is an interesting point. It's hard to build on top of rapidly improving model capability. And actually, I really enjoyed, from AI Engineer, Boris's talk on Claude Code, and they're very bitter-lesson-pilled. He talks a lot
Starting point is 00:47:51 about the fact that they make Claude Code very simple and very general because of this fact. They want to give users unfettered access to the model without much scaffolding around it. But I think it's an interesting consideration in AI engineering that we're building on top of models that are improving exponentially. And one of the points he makes is a corollary of the bitter lesson: more general things around the model tend to win.
Starting point is 00:48:16 And so when building applications, we should be thinking about this. We should be adding the structure necessary to get things to work today, but keeping a close eye on models improving rapidly and removing structure in order to unbottleneck ourselves. I think that was my takeaway. So I really liked the talk from Hyung Won Chung.
Starting point is 00:48:35 I think that's worth everyone listening to. And I think a lot of lessons apply to AI engineering. I think this is similar to incumbents adopting AI, putting AI in existing tools. Because you already have the workflow, right? So you already have all the structure. You just put AI. It becomes better. But then the AI native approaches catch up as the models get better.
Starting point is 00:48:55 And then there's no way for existing products to remove the structure because the structure is the product. Yes. And that's why then you have, you know, Cursor and Windsurf being better than VS Code for, like, AI-native things, just because they didn't have to deal with removing things. And it's why Cognition is, like, you know, again, it doesn't even think about the IDE as, like, the first thing. The IDE is, like, a piece of the agent. And so I think you see this in a lot of markets, which is, like, hey, again, if you have a workflow and you put AI in, the workflow is better. But, like, the workflow is not the end goal. I think we're now at a place where, like, you should just
Starting point is 00:49:30 start without a lot of structure just because now the models are like so good. But I think the first two and a half years of the market, there was kind of like the stance of like, should I just put AI into the workflow that works? Should I rewrite the workflow? But the workflow is not that good because the models are not that good. But I think we're past that point now.
Starting point is 00:49:47 That's an amazing example. Actually, if you show your chart again, there's another interesting point in your chart. In the earlier model regime, the structured approach is actually better. And so an interesting take on this: Jared Kaplan, a co-founder of Anthropic, has a great talk at Startup School
Starting point is 00:50:03 from a couple weeks ago. And he mentions this point that oftentimes building products that explicitly don't quite work yet can be a good approach, because the model under them is improving exponentially and it'll kind of unlock the product. We saw that with Cursor. Part of the Cursor lore is that it did not work particularly well, Claude 3.5 hits, and then boom, it kind of unlocks the product. And so you kind of hit that knee in the curve when the model capability catches up to the product needs. But in that earlier regime, the structured approach appears better. So it's kind of this interesting, subtle point that for a while, the more structured approach appears better,
Starting point is 00:50:40 then the model finally hits the capability needed to unlock your product, and suddenly your product just takes off. There's kind of another corollary to this, that you can get tricked into thinking your structured approach is indeed better, because it'll be better for a while until the model catches up with less structured approaches. Your chart looks very similar to the Windsurf chart. I got to bring it up because I was involved in the writing of this one. Isn't this similar? There's a ceiling, you know, and then, like, boom, you go slow. This is almost like the bitter lesson, but in, like, enterprise. That's right. For me, okay, the lines are important, but to me the bullet points are the main thing. If you understand the bullet points, then you can actually learn from the mistakes
Starting point is 00:51:20 of others, right? There is one spicy take on this, which is, like, you know, how much is LangGraph aligned with the bitter lesson? Yes. Obviously you guys are aware of it, so it's not going to be a surprise. But I do think that making abstractions easy to unwind is very important if you believe in the bitter lesson, which you do. No, no, this is super important, actually. And I actually talk about this at the end of the post. Yeah. There's an interesting subtlety when you talk about agent frameworks. A lot of people are anti-framework. I completely understand and am sympathetic to those points. But I think when people talk about frameworks, there are two different things. So there can be a low-level orchestration framework. There's a great talk, for example,
Starting point is 00:52:02 from Shopify. They built this orchestration framework called Roast internally. And it's basically LangGraph. It's some kind of way to build internal orchestration workflows with LLMs. And Roast, like LangGraph, provides you low-level building blocks, nodes, edges, state, which you can compose into agents or compose into workflows. I don't hate that. I like working with low-level building blocks. They're pretty easy to tear down and rebuild. In fact, I used, for example, LangGraph to build Open Deep Research. I had a workflow. I ripped it out. I rebuilt it as an agent. The building blocks are low-level, just nodes, edges, state. But the thing I'm sympathetic to is there's also, in addition to just kind of low-level orchestration
Starting point is 00:52:45 frameworks, there are also agent abstractions: "from framework import agent". That is actually where you can get into more trouble, because you might not know what's behind that abstraction. I think when a lot of people are anti-framework, what they're really saying is they're largely anti-abstraction, which I'm actually very sympathetic to. And I don't particularly like agent abstractions for that exact reason. And I think Walden Yan made a good point. Like, we're very early in the arc of agents. We're like in the HTML era. And agent abstractions are problematic because you don't necessarily know what's under the hood of the abstraction. You don't understand it. And if I was building, for example, you know, Open Deep Research with an abstraction,
Starting point is 00:53:21 I wouldn't necessarily know how to rip it apart and rebuild it when models got better. So I'm actually wary of abstractions. I'm very sympathetic to that part of the critique of frameworks. But I don't hate low-level orchestration frameworks that just provide nodes and edges. You can just recombine them in any way you want. And then the question is, why use orchestration at all? And actually, I use LangGraph because you get some nice utilities. You get checkpointing, you get state management.
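To make "nodes, edges, state" plus checkpointing concrete, here is a toy orchestration loop in plain Python. To be clear about assumptions: this is not LangGraph's or Roast's API. `TinyGraph`, `search`, and `write` are hypothetical names invented for this sketch, and the checkpointing is just an in-memory list.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, object]


class TinyGraph:
    """Toy orchestration: named nodes, linear edges, shared state."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Callable[[State], State]] = {}
        self.edges: Dict[str, str] = {}
        self.checkpoints: List[Tuple[str, State]] = []

    def add_node(self, name: str, fn: Callable[[State], State]) -> None:
        self.nodes[name] = fn

    def add_edge(self, src: str, dst: str) -> None:
        self.edges[src] = dst

    def run(self, start: str, state: State) -> State:
        current: Optional[str] = start
        while current is not None:
            # Each node returns a partial update that is merged into the state.
            state = {**state, **self.nodes[current](state)}
            # Record a snapshot after every node so a run can be inspected.
            self.checkpoints.append((current, dict(state)))
            current = self.edges.get(current)   # None ends the run
        return state


def search(state: State) -> State:
    return {"notes": f"notes on {state['topic']}"}


def write(state: State) -> State:
    return {"report": str(state["notes"]).upper()}


graph = TinyGraph()
graph.add_node("search", search)
graph.add_node("write", write)
graph.add_edge("search", "write")
print(graph.run("search", {"topic": "caching"}))
```

Because the pieces are just functions and a dictionary of edges, swapping a fixed workflow for a different shape means re-wiring a few `add_edge` calls rather than unpicking an opaque agent class, which is the property being argued for above.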
Starting point is 00:53:46 It's low-level stuff. And that's the way I happen to use LangGraph. And that's why I like LangGraph. And that's actually why I've found a lot of customers like LangGraph. It's not necessarily for the agent abstraction, which I agree can be much trickier. Some people like agent abstractions. That's completely fine as long as you understand
Starting point is 00:54:02 what's under the hood. But I think that's a very interesting debate about frameworks. I think the critique should be aimed a little bit more at abstractions, because often people don't know what's under the hood. For those who are looking for resources,
Starting point is 00:54:15 it was a bit hard to find the Shopify talk. Yeah, it's unlisted now. I don't know why it's unlisted, but it's a nice talk. I found it through this Chinese ripoff of the talk. Yeah, it's actually hard to find now. I think there should be a BrowseComp
Starting point is 00:54:29 where you find obscure YouTube videos because that's something I'm very good at. It's kind of my bread and butter. It's good. And what's funny is, this talk follows exactly the arc we often see when we're talking to companies about LangGraph.
Starting point is 00:54:42 It is: people want to build agents and workflows internally. Everyone rolls their own. It becomes hard to manage and coordinate and review code. In this context of large organizations, it can be very helpful to have a standard library or framework that people are using, with low-level components that are easily
Starting point is 00:54:58 composable. That's what they built with Roast. That's effectively what LangGraph is. And that's why a lot of people like LangGraph. I actually thought the talk on MCP from, I believe it was John Welsh. Yes, I think that was like a super underrated talk. I tried yelling about it. No one listened to me.
Starting point is 00:55:16 But like, you know, if you've listened this far into the podcast, do us a favor. Do actually listen to John Welsh's talk. It's actually very good. It's very good. He makes a case for a lot of the reason why people, for example, enterprises, larger companies, like LangGraph, which is the fact that when tool calling got good within Anthropic, you know, sometime mid-last year, he actually makes this point explicitly.
Starting point is 00:55:38 So he mentions, okay, so you're Anthropic, tool calling gets good in mid-2024, everyone's building their own integrations, it becomes complete chaos. And that's actually where MCP came from. Let's build a standard protocol for accessing tools. Everyone adopts it, it's much easier to have auth and have review, and you minimize cognitive load. And this is actually the argument for standardized tooling, whether it be frameworks or otherwise, within larger orgs: it's practicality.
Starting point is 00:56:06 And his whole talk is making that very pragmatic point, which is actually why people do tend to like frameworks, for example, in large organizations. And then ship it as a gateway. This is the other big thing that they do. That's right. Lance, you've been so generous with your time. Thank you. Any shameless plugs?
Starting point is 00:56:25 Calls to action, stuff like that. Yeah, if you've made it this far, thanks for listening. We have a bunch of different courses I've taught, one on ambient agents, one on building Open Deep Research. So I actually was very inspired by Karpathy. He had a tweet a long time ago talking about building on-ramps. So he talked about how he had his micrograd repo. A few people looked at it, but not that many.
Starting point is 00:56:43 He made a YouTube video, and that created an on-ramp and the repo skyrocketed in popularity. So I like this one-two punch of building a thing, like Open Deep Research, then creating a class so people can actually understand how to build it themselves. And I kind of like that: build a thing, create an on-ramp for it. So I have a class on building Open Deep Research.
Starting point is 00:57:00 Feel free to take it; it's free. It walks through a bunch of notebooks on how I built it. And you can see the agent is quite good. We even have better results coming out soon with GPT-5. So if you want an open source deep research agent, have a look at it. It's been pretty fun to build. And that's exactly what I talked about in that Bitter Lesson blog post as well.
Starting point is 00:57:17 Awesome, Lance. Thank you for joining. Yeah, a lot of fun. Great to be on. Thank you.
