Latent Space: The AI Engineer Podcast - ⚡️GPT5-Codex-Max: Training Agents with Personality, Tools & Trust — Brian Fioca + Bill Chen, OpenAI

Starting point is 00:00:03 Okay, we're here at AIE Code, and we have two of our speakers. Bill and Brian, welcome to Latent Space. Hi. Thank you for having us. Bill, Brian, I know you've been a listener for a little bit. Oh, yeah.

Starting point is 00:00:16 What's your take on Latent Space? Like, what role does it perform in your function at OpenAI? Yeah. I mean, first of all, love the name. Okay. I'm a massive latent space context management versa. What's the story behind the name, while we have the chance? Yeah.

Starting point is 00:00:30 So at the start, we never had Latent Space as the name. It was called L-Space. Interesting. And one of my readers donated the domain name, latent.space. He's like, you want it? I'm like, yeah. Awesome. So Latent just came accidentally. See, it was in the ether, but I didn't have the domain. Yeah, so I just called it L-Space. L-Space is like a Discworld reference. Nice. Yeah, no, it's amazing. I love it because you're like always on the

Starting point is 00:00:58 cutting edge and it goes into a lot of detail about all the things that I should be keeping up with. It's part of my job and there's so much to keep up with, right? So there's only so many sources of really good, high-quality information for what's happening on a deep level. Well, you guys have your own podcast now. So I'm like, you know, competition. Yeah, well, I still listen to yours and I still think yours is really good. So you guys, I guess, are representing the startups team, Codex, all the things. You just launched Codex.

Starting point is 00:01:28 Yeah. Codex Max. Yep. But a genit. Yeah. Yesterday. Yep, we're going to name names. Yeah.

Starting point is 00:01:35 I do. People do make fun. I think Tibo was like, yeah, you know, we're good at a lot of things, but not naming. I was like, well, why call it Max? Was there any internal discussion? Yeah, I mean, it's complicated because it needs to be differentiated from the previous one. And the idea is that Max can run for a really long time. It can go 24 hours or more.

Starting point is 00:01:54 I've actually sort of had it go for more than that. And the name is, you know... Is that inside Codex on the web? When you say a really long time, 24 hours... Oh, on my

Starting point is 00:02:05 oh, that's, I think that was on the web inside of Kornats, I'm not sure. But I've actually done it on my local computer for

Starting point is 00:02:12 for quite a bit longer than 24 hours over the course of a couple of days, by closing my laptop and then reopening it. But the name, you know,

Starting point is 00:02:19 you could come up with something like Pro, but Pro is sort of slower, more thoughtful. Max is about sort of speed and

Starting point is 00:02:27 maximization, like maximalist. So this model, it can run for a long time. But for the same types of problems, it can also actually get to the right answer faster. So it's simply better and faster. Yeah. So I think part of what you guys are speaking about

Starting point is 00:02:49 is the training that goes into something like this. Yeah, that's right. Basically people just kind of wave their hands and say RL. But, like, what specifically have you learned about what's a good party's all of son? So I got to, I mean, this sounds weird to say, but I was lucky enough to be really close to the training team while GPT-5 was training. And one of the big things that we focused on, Bill was there too, we focused on personality, right? So it's really important to build trust with developers for how a model works.

Starting point is 00:03:20 And if a model doesn't act the way that you expect it to, or if it doesn't work alongside you well, you're not going to really trust it. You're not going to get as much out of it. So for coding, we thought, okay, well, what is the best personality for a coder, for a pair programmer, for somebody you trust? And how do we eval against that? How do we come up with behavioral characteristics? And we came up with things like communication. It needs to keep you apprised of what's going on while it's working. Planning. Like, come up with a strategy, do some searching around, gather context, figure out what to do before you

Starting point is 00:03:55 just dive in, if it makes sense to. And then, you know, check your work, right? And so these are just best software engineering practices that turn out to be behavioral characteristics, and we can measure the model's performance on those behaviors and grade it that way. Yeah, I will say that another key aspect of how we trained the model is that we work really, really closely with some of our top coding partners. A lot of those folks live on the bleeding edge, and so they have a lot more understanding of the particularities than we do, and we really focused on those areas and dove deeply into those. Yeah, that's right, especially tools, right? So like different

Starting point is 00:04:39 harnesses have different tools. Some people have context, like semantic search. Some people have different ways of doing code edits. And initially, you know, our models are trained the way they were trained to use tools. And that kind of bakes in a habit. And so we've been getting the models better at using different types of tools. Yeah, there's a lot to follow up on, but I'll go tools first and then I'll go back to the personality piece. But engineering-wise, I think the communication when

Starting point is 00:05:08 5-Codex just came out was, well, this is the model trained for our Codex, not necessarily yours, right? Has that message changed for other startups using the 5-Codex model? Right, no, so Codex is, just to be clear, Codex is the frontier coding model

Starting point is 00:05:23 that we have that is optimized for its harness. The Codex team is very focused on creating a coding agent, and they want it to work perfectly inside of the shape of the harness and API that we have. So they're completely unbounded. It's open source. Yes, that's open source. And the model is available in the API. So that's what they focus on. Yeah. And then the conflict is, well, you just said other startups have other tools, and obviously I know that. It is possible. Like, one thing to mention here is, I think we can probably disentangle a little bit sort of the Codex models

Starting point is 00:05:57 apart from sort of the mainline models. The Codex models are sort of focused on the agent itself, right? Like the Codex agent itself, the model has been trained with an agent specifically in mind. It actually turned out to be sometimes even easier to integrate, because we come into it with a firm opinion on what the best way of using it looks like. And so some folks that we work with actually really appreciate that we come into it with that opinion. Whereas for the other ones that have more general or specific tools that they definitely need,

Starting point is 00:06:31 the mainline model is the one that's more general in that sense. And that's sort of what Brian was referring to when he talked about GPT-5's tools. Yeah, so 5.1, non-Codex, is more general across the board. It can respond to things that are, it's much broader than just coding. It has coding capabilities that are also mirrored in Codex, and they work together to keep that trued up. But since it's more general, it does have more steerability

Starting point is 00:06:59 And when you're implementing tools, the model can get bogged down if it sees a tool that it isn't used to, and it might take more time thinking about how to use it or make more mistakes. So our recommendation is, if you want to go bleeding edge and coding focused, pay attention to the Codex line

Starting point is 00:07:18 and the Codex SDK and the Codex models, because that's the one that's really aimed at that. You'll have to do some work to look at how we're implementing our tools inside of Codex to maximize its capability without bogging it down. But people are having success bending it in ways that maybe we haven't thought of. If any come to mind, I always want to pry... Sure. Yeah. Do you have any examples?

Starting point is 00:07:44 You say bending it in ways you haven't thought of. Yeah, so Codex is trained with terminal tools in mind. And so what we thought would be the case is that you would essentially have to strip out all of the tools except for the terminal tools. But some partners of ours discovered that you can actually still keep a lot of the tools, just named in the same way as the terminal tools and with the same input and output. And all of a sudden the tool-call performance jumped up by a lot.
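
As an illustration of the pattern just described (keeping a custom tool but giving it a terminal-style name and terminal-style input and output), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `RG_TOOL` description, the `dispatch` helper, and the tool-call shape are hypothetical, not the Codex harness or any OpenAI API. The search itself is plain Python that prints matches in a `file:line:text` layout similar to what ripgrep prints with line numbers.

```python
import re
from pathlib import Path

# Hypothetical tool description: the name and arguments copy a terminal
# command (ripgrep's `rg PATTERN PATH`) instead of a bespoke schema.
RG_TOOL = {
    "name": "rg",
    "description": "Search files under a path for a regex, like `rg -n PATTERN PATH`.",
    "parameters": {"pattern": "string", "path": "string"},
}


def rg(pattern: str, path: str) -> str:
    """Return matches as `file:line:text`, one per line."""
    regex = re.compile(pattern)
    hits = []
    for file in sorted(Path(path).rglob("*")):
        if not file.is_file():
            continue
        try:
            lines = file.read_text(encoding="utf-8").splitlines()
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(lines, start=1):
            if regex.search(line):
                hits.append(f"{file}:{number}:{line}")
    return "\n".join(hits)


def dispatch(tool_call: dict) -> str:
    """Route a tool call shaped like {'name': ..., 'arguments': {...}} (assumed shape)."""
    handlers = {RG_TOOL["name"]: rg}
    return handlers[tool_call["name"]](**tool_call["arguments"])
```

The design choice being sketched is only the naming and I/O shape: the model sees something that looks like a command it already has habits for, while the implementation behind it can be anything.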

Starting point is 00:08:19 Yeah. And Codex loves ripgrep. So if you make a ripgrep tool and tell it to use it, it'll use it. If you call it grep, it actually does a little bit worse, but if you call it rg, it actually does really well. Right. Yeah, yes. This is

Starting point is 00:08:33 something that we ourselves only just discovered. This is one of the coolest things about model training: they literally develop habits, just like a person does. Like, if you're working on some podcasting tool, right, you're really good at editing. And then somebody makes you use a different one, it's going to slow

Starting point is 00:08:49 you down, you're going to get kind of bogged down and make mistakes. Sure, but, like, yes, that's very human, but I don't know if I'd call it cool, because they're supposed to generalize. Well, right, that's the end goal, yes, of course. And so that's what we're doing with the 5 series of models, that they're way more general.

Starting point is 00:09:08 And Codex is focused on maximizing coding, and those are sort of the two horizons that we're working on. Yeah, awesome. I want to go back to personality. I know you hate that word sometimes. Hate it. It means different things to different people. Yeah.

Starting point is 00:09:23 And when it comes to people who are very, very keen on model research, model personality is much more like, I think really what your topic would say is like it warms your friend you guys for your, I agree with understanding people's emotional state, whatever. And so it's really jarring when that is also applied to coding agents where like, well, I got a talk to Vichie. Like, Silicon Valley HB ultra is also saying hands on, but it could be paid on do it. the freight. Awesome. I think the other thing is also, does it matter? Because you said a lot of things about, like, commenting is that you're going to user engagement or that. Does it matter if it's so quantized anyway, right? Like, you're going for 24 hours. You're closing your laptop. You have the extra high parameter now. Does it matter? Exactly. So we're in this world right now where we're in between, a situation where the models don't quite have the trust of senior engineers or engineers doing very important work. And so we found, our customers have found, that people really want to follow along with what it's doing

Starting point is 00:10:30 so they can interject or stop it, or at least understand what it's thinking, so they don't waste all kinds of time on a rollout that they have to throw away. So with the 5 series, because it's more general, and it's just about as good at coding as Codex for a lot of things, we've taught it to be more communicative. And so it has preambles before tool calls.

Starting point is 00:10:50 It'll say things like, I'm about to go look for this. Yeah, and you can steer that really well. I actually really like it. I have, I've created like a personality. I tweeted about this. I created a personality for my coding agent because I really like my tools to be kind of like fun to work with if I'm in there with them. And so I have it, it's got this like, it gets really excited if we do something together.

Starting point is 00:11:10 And like, because I want to wake up in the morning and be like, oh, I'm going to go work on this project with my buddy, 5.1, right? But some people don't like that. And also, like you said, for long-running agentic tasks, that can get in the way. You're burning tokens that don't really matter if it's running in the cloud. So with 5.1, you can turn that off. You can prompt it not to do that. But the Codex model can't actually do that. And it relies on the reasoning summarizer to give you that update.
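
To make the "you can prompt it not to do that" point concrete, here is a small Python sketch of switching preambles on or off through instructions. The instruction wording and the `build_instructions` helper are hypothetical, written only for illustration; the actual prompts behind any product are not shown in this conversation.

```python
BASE_INSTRUCTIONS = "You are a coding agent working in the user's repository."

# Hypothetical wording for the two communication styles.
PREAMBLE_ON = "Before each tool call, say in one short sentence what you are about to do."
PREAMBLE_OFF = "Do not narrate before tool calls. Call tools directly and report only the final result."


def build_instructions(interactive: bool) -> str:
    """Pick the communication style based on whether a person is following along."""
    style = PREAMBLE_ON if interactive else PREAMBLE_OFF
    return f"{BASE_INSTRUCTIONS}\n{style}"
```

An interactive pairing session would call `build_instructions(True)`, while a long-running background task, where nobody reads the narration, would call `build_instructions(False)` to avoid spending tokens on it.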

Starting point is 00:11:35 I guess more broadly, what should people know or think about in terms of coding models in general? More broadly than just this release, just like, what trends are you seeing, what discussions are active? Our talk today is focused on talking a little bit about the trend that we're seeing: the abstraction layer is really starting to move upwards from the model layer to the agent layer. As I said, we're seeing our models starting to be a little bit more opinionated, especially

Starting point is 00:12:12 with regard to a coding model like Codex. And the models are really good at doing certain things when they're inside of a certain harness, a certain type of search. And so we're actually packaging that up more closely, so we're actually shipping this entire agent altogether, and

Starting point is 00:12:32 then you can actually build on top of that agent. That's one of the patterns that we're seeing here: rather than focusing on optimizing with every single model release, you're actually just able to plug an agent like Codex into your platform and use it out of the box. Yeah, and you're seeing Zed use this,

Starting point is 00:12:47 GitHub, VS Code lets you just package the whole agent to work inside of it. That way, if you're building a coding tool like Zed, and you don't feel like having a whole team keep up with every single model release and every single API change and how to update the harness to do different kinds of sandboxing and all that kind of stuff, you can just build one layer above.

Starting point is 00:13:08 And that is actually super powerful, because coding is just one agentic behavior. It turns out it's a really nice one to start with, because you can measure the performance sometimes more easily than with a lot of other ones. But it also gives the model the capability, right? So we started out with chatbots. Like, you're having a conversation. Let's give the chatbot a tool to use. Okay, so now you have an agent that can run commands.

Starting point is 00:13:34 Well, let's give the chatbot agent a Codex to use. So now if it doesn't have a tool, it can make a tool that it needs to solve a problem. Right? So that's another layer of abstraction, and it's not just coding. You can write software that has an agent that can spin up a Codex instance and write a custom plug-in for your software for that customer's API, right? And so now your software is self-customizable, because it has its own team of people inside that can do integrations at launch. Yeah, solving integration engineering is a CI.
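
The idea above (an agent that, when it lacks a tool, asks a coding agent to build one) can be sketched as a small registry in Python. This is a toy under stated assumptions: `fake_coding_agent` is a hypothetical stand-in that returns hard-coded source, where a real system would call an actual coding agent, and `ToolRegistry` is an illustrative name, not an existing library.

```python
from typing import Callable, Dict

Tool = Callable[[str], str]


def fake_coding_agent(tool_name: str) -> str:
    """Stand-in for a coding agent: returns Python source for the requested tool."""
    if tool_name == "shout":
        return "def tool(text):\n    return text.upper() + '!'\n"
    raise ValueError(f"don't know how to build {tool_name!r}")


class ToolRegistry:
    """Looks up tools by name and builds missing ones on demand."""

    def __init__(self, builder: Callable[[str], str]):
        self.builder = builder
        self.tools: Dict[str, Tool] = {}

    def get(self, name: str) -> Tool:
        if name not in self.tools:
            source = self.builder(name)
            namespace: dict = {}
            # Generated code should be sandboxed and reviewed before it runs;
            # a bare exec is used here only to keep the sketch short.
            exec(source, namespace)
            self.tools[name] = namespace["tool"]
        return self.tools[name]
```

With this sketch, `ToolRegistry(fake_coding_agent).get("shout")("hi")` builds the tool on first use and returns `"HI!"`; later calls reuse the cached tool.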

Starting point is 00:14:07 Yeah, one thing I'm finding at this conference so far, even early, like the first ten or so talks, I think people are starting to really explore sub-agents, agents that are more abstract, agents that use agents. And we used to call it multi-agent. I don't know why it was on now. I don't know if there's any thoughts on your end about this, where you can tool-call. I guess a very basic example is what you just said, which is that the agent can create another instance of Codex

Starting point is 00:14:36 that creates a tool and then drops it, and just uses the tool. Is there a case for scaling, like, sub-agents? TGISERI, A you go? Yeah, I think so. I mean, Codex Max was designed for that. So it has its own compaction and context management. Codex Max manages its own context window.
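
For readers unfamiliar with the term, compaction generally means replacing older conversation history with a shorter summary once the context gets too large. The Python sketch below shows that general idea only; it is not a description of how Codex Max does it, and the word-count "tokenizer", the `compact` function, and the placeholder summarizer are all assumptions made for illustration.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def compact(history: List[str], budget: int, keep_recent: int,
            summarize: Callable[[List[str]], str]) -> List[str]:
    """Replace older messages with one summary when the history exceeds the budget."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    total = sum(count_tokens(message) for message in history)
    if total <= budget or len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(older)] + recent


def naive_summary(messages: List[str]) -> str:
    """Placeholder summarizer; a real one would ask a model to write the summary."""
    return f"[summary of {len(messages)} earlier messages]"
```

The trade-off in any scheme like this is what to keep verbatim: recent turns usually stay intact because the next step depends on them, while older turns are reduced to whatever the summarizer preserves.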

Starting point is 00:14:56 And so it can run basically forever without you having to worry about it while it's inside of the Codex harness. And that lets you do a lot of different things. You can essentially have it hand off its own context to other sub-agents, right? So letting it sort of spawn different agents to do more of its work in parallel

Starting point is 00:15:16 and all kinds of things like that. So it's built for that. We're just starting to see the indications of what that means. But that's, I think, the future, and we're really excited about that. Yeah. I think, like I said, the trend that we're sort of observing here, moving up the abstraction layer to the agent layer, really allows you to do a lot of cool things like brand new spaceship, spinning up a few agents, creating new abstractions as the long-running agent workflow

Starting point is 00:15:49 continues, and right now, we're building all the primitives as well as the models, specifically with that in mind. Yeah, and it's really about moving the threshold up further, right? Like I was saying before, I now trust Codex to do some of my hardest work. I haven't written a single line of code by

Starting point is 00:16:07 hand in months, because I know what I can trust it to do. You're the Forbes person that said that in the last way for us. Yeah, no, it's real. I mean, I've actually launched something. There's an open source project that I did. There was a Codex upgrade pack for migrating from Completions to Responses that was totally written by

Starting point is 00:16:23 Codex. And I didn't write a single line of that code. And now it's out there. It's open source. That's true of most of the folks at OpenAI. Well, initially when Codex first launched, it was around 50% of folks at OpenAI who started using it, but now it's most of the folks at OpenAI. That's very true. We use it

Starting point is 00:16:39 every day. The way that we do it is we're really good at evals. Right? Like, in order to develop trust and build a product that can do more than you designed it for, which is what we're talking about here. You're making an agent that can solve its own problems. You have to get really good at figuring out

Starting point is 00:16:55 how to build those guardrails and evals around what it's doing and what it's allowed to do, and check it in production. So we have all of this platform tooling now around agent traces and rollout traces, and coming up with evals for that and building, you know, graders and all the things you need to sort of maximize

Starting point is 00:17:11 the pipeline. So you can let it go and then be like, okay, I don't really like the way it did that. Grade it. Have it meta-prompt itself so that next time it actually follows better best practices. One of the biggest is you use in terms of which is the organizational capabilities

Starting point is 00:17:27 that OpenAI see messaged is a prior to. We see more about that. Like, why is that suddenly a big priority now? Obviously, I think there was a lot of this that OpenAI always did internally, but now it's like a team that is more outward-facing and then you be able to this random era.

Starting point is 00:17:43 The path to AGI really goes through evals. And, well, I'm sorry, that was a little... It's so true. Repeated way too many times. But I think there are a lot of academic evals, right? There's like SWE-bench, there's others, like, you name it.

Starting point is 00:18:01 But I think there's a slight lack of evals of the real world, on what people care about the most. And we want to make sure that whatever we're developing, model-wise as well as product-wise, is aligned and is actually making the most useful impact on this world. And applied evals is really in that direction, capturing all of those sorts of real-world use cases and things for us to hill-climb together. I like to think of it as like we have, I mean, people say it's a PhD and an API, right? But if you, you know, you hire a PhD

Starting point is 00:18:37 student, they don't know how to do the job. You have to give them a job description. Okay, that's a prompt, right? So now you have your policy, and then you have them do the job and they're going to kind of flail around, right? So they need mentorship, they need guardrails, they need evals, performance reviews on how to do their job, the best practices. And so what we're doing is we're trying to put our models out there and see what they're good at and what they're not good at, talking to our customers. They're like, oh, we could really use your model for more things if it could do this one thing. Here's our eval for it. Or help us build those evals with you so that we can see where we're deficient and go back

Starting point is 00:19:12 and train the model to be able to do that job in a way that we wouldn't normally get to see. Yeah. How do you do multi-turn evals? So I think that's the really hard thing. I mean, sometimes you need multi-turn if it doesn't get it right on the first go, but if it gets it right on the first go, then it's no longer multi-turn, right? So then what? Do you want to take it? I have some ideas. Oh, yeah, you go. I mean, I've built a few myself.

Starting point is 00:19:39 This is sort of my personal work. I think this is an area that people are just now getting into, right? We have LLM as a judge. You can use LLM as a judge to look at an entire trajectory and see, okay, over the course of all of this, how well did it perform, what did it do? And then you could maybe walk it back a step to the part that you don't like,

Starting point is 00:19:59 and then you could have the model run the next step with the instructions, grade it on that, and then have it improve itself. Oh, I don't like the way that you... We do this all the time inside of harnesses. It's like, that was a good answer, but I don't really like how long it took you to get there. So can you give yourself better

Starting point is 00:20:17 instructions for doing that next time? It'll write something and we'll add it in there, and then suddenly it's better, right? So that's one way of doing it. Yeah, I think with multi-turn evals, for most of the companies or startups that we work with these days, the agent runs in a multi-turn way, right? And so, therefore, if you can build an agentic harness that works in a multi-turn way, you can eval it. And then there are also academic benchmarks that already do this in some ways, like tau-bench, and now we have tau²-bench, which does this particularly well,

Starting point is 00:20:51 and we'll certainly take inspiration from that. I have this idea. I call it a job interview eval. I haven't finished it. But really, if you're evaluating a coding agent, what do you want it to be able to do? You want it to be able to take an underspecified... imagine you were interviewing a developer. You give them a problem. Hey, go implement a string reverse

Starting point is 00:21:12 or whatever. And then it's up to them to ask, okay, well, I need more information. What are the constraints here? And then you judge them on that. And then they start implementing it. You give them some modifications. You grade them on that. You can imagine building, you know, with an LLM, a rollout that is

Starting point is 00:21:30 conversational, where the model responds, and you can kind of grade the whole thing. Yeah. One thing I would love, and this is the feature request part of the podcast, is a batch multi-turn eval API. You know, the Batch API is single turn, but you can't really batch multi-turn requests. Is that already doable? Batch multi-turn requests. I don't believe it.
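
The "job interview eval" idea described above (give an underspecified task, see whether the agent asks for constraints, then grade the whole conversation) could be sketched roughly as follows in Python. This is a toy harness under stated assumptions: the interviewer turns are scripted, `grade` is a rule-based stand-in for an LLM judge, and `scripted_candidate` exists only to exercise the loop. None of these names come from the speakers' own unfinished eval.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def run_interview(candidate: Callable[[List[Message]], str],
                  interviewer_turns: List[str]) -> List[Message]:
    """Play scripted interviewer turns against the candidate and record the transcript."""
    transcript: List[Message] = []
    for prompt in interviewer_turns:
        transcript.append({"role": "interviewer", "content": prompt})
        transcript.append({"role": "candidate", "content": candidate(transcript)})
    return transcript


def grade(transcript: List[Message]) -> Dict[str, bool]:
    """Rule-based stand-in for an LLM judge that reads the whole trajectory."""
    replies = [m["content"] for m in transcript if m["role"] == "candidate"]
    return {
        "asked_clarifying_question": "?" in replies[0],
        "produced_code": any("def " in reply for reply in replies),
    }


def scripted_candidate(transcript: List[Message]) -> str:
    """Toy candidate: asks one question first, then answers with code."""
    if len(transcript) == 1:
        return "Should it handle Unicode, and can I use slicing?"
    return "def reverse(s):\n    return s[::-1]"
```

Running `grade(run_interview(scripted_candidate, ["Implement a string reverse.", "ASCII only; slicing is fine."]))` marks both checks as passed. In a real eval the candidate would be the agent under test and the grader would be a model scoring the full transcript against a rubric.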

Starting point is 00:21:54 You can't do it yet, but yeah, I think that's a really valid one. Because you need evals to be as cheap as possible. Yes. They're not that time-sensitive. And you want to run them overnight when things are cheap. Yes. Well, feedback taken. Feedback taken, man.

Starting point is 00:22:09 But that's the thing. Every day we're trying to make the platform better. And right now, evals is certainly part of it. We make product feature updates as we talk to people like you. Yeah. They're like, hey, can you do this? I mean, it's super like, yeah, if I'm going to throw thousands of runs at this thing, you know, I should probably spend some time worrying about costs.

Starting point is 00:22:26 Speaking of which, what are you trying to eval, though? I mean, Devin and Cascade. You know, and, uh, so I have a personal side project where I want to make Devin for non-coding. Oh. Because I really love Devin so much. Like, their Slack. My kind of semi-hot take that I'm floating around, just to see how it feels,

Starting point is 00:22:46 is I think Slack is the ultimate user interface for work, right? I don't want to read email. I just read Slack all day. I interact with my email agent through Slack. So basically I'm building a Devin for email. Yeah. Well, that's the thing, like you can use,

Starting point is 00:23:01 you can use Devin to do that, right? Like a coding agent, like Codex, a CLI, it used to be back in the old days. Like, I started out in the 90s working at IBM as a system administrator, and I had to write my own custom software and bash scripts and whatever to actually solve real problems every day. And so I had this, you know, toolkit of scripts that I made, right, that were organizing file directories or doing other random things that weren't necessarily writing code. Yeah, yeah.

Starting point is 00:23:28 And so you can get it, for non-coding use cases, to just sort through your email using, like, Elm or something, right, in the terminal, or have it generate snippets of video clips from YouTube that you can watch later, or things like that. You know, I never thought about that, but I do that all the time as part of Latent Space. Yeah, I should probably invest in that tooling. I had Codex go through my really messy directory of all of these experiments that I was running and

Starting point is 00:23:56 completely organize them and, like, put them in shape, and it was so wonderful. I used it for something that's more boring: organizing my desktop. Yeah, you know, we have a lot of files on the desktop, and Codex is really good. People think... Like IMG_01416.

Starting point is 00:24:12 JPG or that thing. Yeah, well, just find all the images and put them in one folder. I think even that, that's something Codex can do. I think that's one of the big themes that we're also seeing: coding tools breaking out of coding and into just about everything. They're personal automation. Exactly. Because if you think about before graphical user interfaces and browsers, like,

Starting point is 00:24:32 how did we interact with a computer? We did so through a terminal. And we did so by writing commands and writing code and stringing them together inside of the terminal. So if you think about it, those coding agents are actually computer use agents, but for the terminal. Yes, yeah. They're actually incredibly general.
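
The desktop cleanup mentioned above ("just find all the images and put them in one folder") is the kind of small script a coding agent can write and run in a terminal. Here is a hedged Python sketch of what such a script might look like; the `collect_images` name, the `Images` folder name, and the list of suffixes are illustrative assumptions, not output from Codex.

```python
import shutil
from pathlib import Path
from typing import List

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".heic"}


def collect_images(source: Path, dest_name: str = "Images") -> List[Path]:
    """Move image files at the top level of `source` into `source/dest_name`."""
    dest = source / dest_name
    dest.mkdir(exist_ok=True)
    moved = []
    for item in sorted(source.iterdir()):
        if item.is_file() and item.suffix.lower() in IMAGE_SUFFIXES:
            target = dest / item.name
            if target.exists():
                continue  # never overwrite a file that is already there
            shutil.move(str(item), str(target))
            moved.append(target)
    return moved
```

Calling `collect_images(Path.home() / "Desktop")` would move matching files; trying it on a copy of the folder first is the safer way to check what it does.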

Starting point is 00:24:51 I would say that coding agents today are still not vision-native enough. Like, you have to try to get it to use vision. And oftentimes it still fails. We should use vision a lot for. Yeah, I would say, you know, I was going to end the episode by asking for your 2026 predictions. Like, we sit down this time next year, what do you want to see? You know, what do you hope to see? I'll just kick it off with the easy one.

Starting point is 00:25:13 Yeah. More computer use. And I think where you say things like, oh, we'll have a coding agent build its own integration to your application. A lot of applications don't have APIs, don't have MCPs. The only thing you have is a UI, right? Yeah. Because they're legacy or because they don't want you to take the data.

Starting point is 00:25:32 But while the data is yours, you just have to, like, in a non-provisioned way, take it through the UI. Yeah. Yeah. And I can continue just by saying that that's definitely going to be something, I think, that we'll be capable of in '26. But also, the other thing that I am really looking forward to is Codex being able to do more, right? We're already starting to talk about how Codex, or our coding agents, can use computers in novel ways. We're going to be able to see more and more general use

Starting point is 00:26:15 as well. I really want to see the trust level go up even further, right? Like, at OpenAI I get to work with some of the most amazing developers I've ever worked with in my life. They're incredible, like some crazy tech leads. I wish every company, no matter whether they're a small dev shop in Alaska, where I worked for a while, or OpenAI, could have on their team capabilities that you would only be able to get at a top-tier firm, right? So all of my teammates at all of these places could turn to a coding model and be like, hey, how do we do this crazy awful refactor that we have to do to get us to support this new customer that we have?

Starting point is 00:26:51 Or like, wow, there's so much of a mess here. Or like, what's the best way to actually implement this new technology? And have it be so trusted and so right and so smart that, you know, we can actually perform better than we could normally get access to. Yeah. I think that's going to be it. Any final calls or something? Oh, yeah. We're Brian and Bill at OpenAI, and yeah, feel free to find us on our Twitter, socials, whatever,

Starting point is 00:27:18 and then let us know how you're building. Yeah, and we love working with startups, and anytime you have feedback about, you really wish the model could do this or the product can do this, and you could unlock some massive capability, just let us know. Yeah, amazing. We'll do. That's it. Thank you, guys. Nice. Thank you.

