Latent Space: The AI Engineer Podcast - [State of Post-Training] From GPT-4.1 to 5.1: RLVR, Agent & Token Efficiency — Josh McGrath, OpenAI

Starting point is 00:00:00 Latent Space. Welcome. How would you introduce yourself? Yeah, I work on a bunch of the thinking models at OpenAI.

Starting point is 00:00:18 And like recently I've been sort of focused on doing search-related stuff. But yeah, just a post-training researcher at OpenAI. Yeah, and you were on with us for GPT-4.1. We were talking with Michelle, who is on maternity leave. I didn't know that. And now we're at 5.1. It's been a whole generation. Yeah, it's been wild. And like, you know, 4.1 was a non-thinking model. And then since then I,

Starting point is 00:00:43 you know, we sort of switched into doing. Is that the last? Was it your last? No, so we still are releasing non-thinking models. But that one was the one that we did that was like API-specific non-thinking. So, you know, focus has shifted a little. Yeah. How'd you get into post-training? So previously, before OpenAI, I was doing like pre-training data curation stuff. And I think what I was seeing from like the news and looking at papers is like, oh, it seems like a lot of that, you know, not pre-training is dead, but I was like, oh, there's going to be so much interesting stuff in post-training.

Starting point is 00:01:17 And at that point, I was like, I really want to like make some contributions there. And I mean, it's not even necessarily that like pre-training was dead, but it was definitely changing and like, you know, do I want to make compute efficiency wins of like 3% or do I want to like change the behavior by 40%? And honestly, it just seemed more exciting to go to post-training and many late nights later. That's definitely true. It's a different kind of data and engineering discipline too. It's very strange. Like the kind of work that you need, especially in RL, like scaling it. Yeah, definitely. I think like, for example, the number of moving parts in an RL run is just a lot higher.

Starting point is 00:01:58 Like, in some ways... Do you do order of magnitude or... I don't know if I could do order of magnitude, but if you think about, like, pre-training, you know, you're moving tokens to many machines, and then you're getting, like, basically a scalar from them, and then you're backpropping. Yeah.

Starting point is 00:02:11 The issue with RL is, like, you're doing tasks, and each task could have, like, a different grading setup. And each one of those different grading setups, that's, like, more infrastructure. And so, you know, when I'm staying up late, trying to figure out what's going on with a run, it could be in way more things than there are in a pre-training run, generally. Yeah.
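To make the "each task has its own grading setup" point concrete, here is a minimal sketch. It is not OpenAI's infrastructure: the task kinds, the `Rollout` type, and the grader functions are all hypothetical names chosen for illustration. The idea it shows is only that every task kind adds another code path that can fail during a run.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Rollout:
    task_kind: str   # hypothetical label, e.g. "math" or "code"
    output: str      # what the model produced
    reference: str   # what the grader compares against


def grade_math(r: Rollout) -> float:
    # Exact match on the final answer after stripping whitespace.
    return 1.0 if r.output.strip() == r.reference.strip() else 0.0


def grade_code(r: Rollout) -> float:
    # Stand-in for a sandboxed test run: here it is only a substring check.
    return 1.0 if r.reference in r.output else 0.0


# Registry mapping each task kind to its grader (assumed structure).
GRADERS: Dict[str, Callable[[Rollout], float]] = {
    "math": grade_math,
    "code": grade_code,
}


def grade(r: Rollout) -> float:
    # Dispatch to the grader for this task kind; unknown kinds fail loudly.
    if r.task_kind not in GRADERS:
        raise KeyError(f"no grader registered for task kind {r.task_kind!r}")
    return GRADERS[r.task_kind](r)
```

In pre-training the training signal is one loss per batch; in a setup like this, a silent bug in any single grader can distort the reward for a whole slice of the run.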

Starting point is 00:02:33 And does it matter if you own the code of the task or is it an outsourced third party person? Or, you know, my sense of it and the external sense of it, obviously I don't see it up close, is that you work with a lot of external partners. And I'm sure also some internal stuff. But which is better? Honestly, I don't think I'll comment too much on like how many external partners. There are some and there's some internal. Yeah, there's, we do like...

Starting point is 00:02:59 The technical trade-off of like, well, shit, like, I don't own this code. Oh, okay. So, well, when it comes to I don't own this code, actually, like, when, you know, when I'm babysitting a run or something, it doesn't really matter if it's like internal, external, whatever. Like, do I understand the system that's going underneath? And I think you end up having to, like, jump into a lot more code that you're like, I actually don't know what this does.

Starting point is 00:03:24 because I'll be watching the, you know, I work on my pieces of a run. And then there's also, you know, other people working on it. And like, do I understand what their code is doing? So that way, like, 12:30 in the morning when I'm like, something looks wrong and it's, I'm like looking at this code, can I like get context fast enough to understand? Throw Codex at it. Oh, I use Codex so much. It's really changed how I work. I feel like there's a degree to which like sometimes I feel trapped by Codex, because if I'm,

Starting point is 00:03:54 I spend like, you know, 30, 40 minutes writing something that looks like a design doc or something, Codex can do more work than I can do in a few hours in like 15 minutes. But then, like, what do I do during those 15 minutes after? And like the, it's actually just like really changed how the flow of my day goes because I have to somehow now manage these like 40 minute sessions with like 15 minutes where like I could do something. But it's actually not nearly as effective as like, this new flow to the day. So I think I'm still getting used to that, honestly.

Starting point is 00:04:28 Yeah, yeah. I think it should be interesting for, like, also just codebase understanding when you're encountering unfamiliar code. Absolutely. So you, briefly, before we started, talked a little bit about the shopping model,

Starting point is 00:04:41 which is like the latest hottest thing. And obviously, we're just recording it right after Black Friday, Saturday, Cyber Monday. First of all, any interesting findings from basically releasing shopping in ChatGPT right into that period? Okay, well, I think the first

Starting point is 00:04:53 thing is, I don't know why I would say in a meeting in, you know, August or so, like, oh, hey, Black Friday's coming up. Like, maybe we could, maybe we could do a release by then. In hindsight, like, wait, why would I say something like that? They're like, yes. Now you own it. Yeah. Exactly. I guess the most interesting thing to me is the new interruptibility and like the sort of qualitative experience of using it. And the same thing happens with Codex, right? Like you, you write a prompt and you can like press escape and say like, oh, I, like, I messed something up. And we actually did the same thing in the shopping model.

Starting point is 00:05:26 So it shows you its chain of thought with, like, what products it's looking at. And you can write it new messages saying, like, oh, you know, I actually wanted this. Yeah, like, I wanted USB-C on this or whatever it is. And, like, I think that's a really new interesting, like, interaction paradigm that we have in a couple of our different services. And I'm excited to see how people use it and if they enjoy it. Yeah.

Starting point is 00:05:48 Why did it have to be its own model and not just, like, a new tool? Stay tuned. I think, like, there's no reason that we couldn't do it in the same model eventually, but I think, you know, if we want to try out new things, sometimes it makes sense to make a new model. And I think it just made sense to this time say, like, can we do a deep research style model, but like for shopping where it's going to look really hard all across the internet for different things? You know, I think if you look at like deep research, the original one, and GPT-5 Thinking on like high reasoning today, I think you'll see that like eventually the models all sort of converge in their capabilities. Yeah. Would you say that this is a discussion also a little spicy that I've kicked off

Starting point is 00:06:28 in the community? Maybe 30% of the community is still using deep research. A lot of them have moved over to just using 5 Thinking as deep research. Is that the spiritual successor, are they direct replacements, are there things that we lose in the original deep research model if we do that?

Starting point is 00:06:45 I mean, I think if you look at our published evals, they look like basically on par if it's not better. So like, I mean, that's personally what I do. I use Thinking on high versus using the deep research model. But like, you know, I think, as we've learned over the past few months, there are sometimes people who prefer the quirks of like one model over another. And so people like the deep research model, you know, more power to them. People like 4o.

Starting point is 00:07:11 Anything special in the 4o post-training that like, are people like really responding to personality? Is that like a differentiator that people really care about? And it's a part of your job to care about personality. Yeah, I mean, definitely people like care quite a bit about personality. I think like over the past few months, we've been working a lot on giving users more choice over what personality they want. Right, which is the toggles. Yeah, yeah. So now we have those toggles. What's your favorite toggle? Honestly, custom instructions for like, I want, I personally want my model to like be a tool. And so like, I don't, I don't necessarily like want the warmth or anything. I just want some answers because I'm, you know, mostly using it at work.

Starting point is 00:07:48 Yeah, so I call this the Anton versus Clippy divide. So Anton is the, Silicon Valley HBO. Okay. It's a machine. It only does work. It doesn't try to be helpful or friendly or anything. It tries to be helpful, but like doesn't try to be cheery. Whereas Clippy tries to be cheery.

Starting point is 00:08:04 And I'm like, well, stop smiling at me. I'm like having problems. So it sounds like you also come down on the side of like using it. Anton. Yeah. I think a lot of developers want Anton. Yeah. They're just like, it just quietly does its work.

Starting point is 00:08:16 And when it's done, it shuts up. Yeah. Yeah. Well, I think like we're doing a lot of work to provide both like, Antons and Clippys, and I hope they all like it. Yeah. So just generally, I was thinking about, like, well, what can we update people on post-training? You know, what do we know today in NeurIPS 2025 that we didn't know in NeurIPS 2024?

Starting point is 00:08:38 I would say, like, a lot of people at the time, there was still like this whole PPO versus DPO discussion that was there. That was the whole era. Yeah. And since then, we've moved on to RLVR, and I think a lot of like agent-specific RL training. I guess like am I missing any large chunks of the post-training debates that are going on? Yeah, I mean, so not necessarily debates internal, but like my read personally from like looking at different papers that are coming out,

Starting point is 00:09:08 when you look at like an RLVR paper or like a RLHF paper, they read more like an optimization paper. And to me like the sort of interesting thing that's going on is we have this like spectrum of how high quality a signal is. So, like, really, at the end of the day, like, RLHF, RLVR, they're both policy gradient methods, but what's different is just, like, the input data. And it's always interesting to me that we call RLHF non-verifiable,

Starting point is 00:09:37 because we've trained this model to be good at, like, predicting human feedback. So in some sense, that's, like, verification. But obviously... It's human preference rather than truth. Yeah, yeah. But, like, if the... If, like, your value...

Starting point is 00:09:50 of truth is like does the user like this more? Like there's something strange, that I think we haven't like looked at that axis of, okay, well, how like sort of clean is this signal, how much do I trust it? And like I totally agree that you know you don't necessarily trust the RLHF signal as much as like is this the solution to this polynomial. But I think there's a whole spectrum of like how high quality is a signal, what's going to happen when I like do a lot of optimization against it. And that's very different than I think worrying about like the variance of different gradients, which I think is what you end up seeing in a lot of the papers that are currently coming out, rather than being like very data-centric. They're pretty

Starting point is 00:10:28 optimization-centric, even though I think the innovation really is where the data is coming from. Yeah. And before, I want to go broad before I go deep. Yeah. Any other discussions that you may be having at NeurIPS or sort of around this time on post-training debates? Like what are, what are, you meet your peers at Anthropic and DeepMind? What do you talk about? Well, Anthropic and DeepMind, we're all saying I'm working on stuff and things. You know, we're not... And I think, like, it's more so talking a lot more broadly with my friends there. Or we're just talking about, man, the infra's so hard to keep up.

Starting point is 00:11:04 We're not necessarily talking too much about methods directly. Because on one level, it kind of doesn't matter. Yeah. And I think also, like, there's something that's very different about academic work where, like, what really matters is how narrativeizable it is. And I think that's one of the reasons you see a lot of optimization papers come out is a lot of the data work, there's a less clear narrative around it. I think the data and the scaling is actually more important than the specific.

Starting point is 00:11:32 Yeah, but it doesn't have like necessarily the same narrative that you get out of like some of the papers that you see here. And so like there becomes more of a like given a specific vertical, how do I like understand that? And I wish there were actually more papers on it here, but I think it can sometimes be harder to wrap up into a clean story. Yeah, that's also something that, like, we're actually having a lot of conversations about with other folks as well.

Starting point is 00:11:58 Like, what's next, right? Like, where do you go from here now that we have, like, some kind of roadmap? I think what's interesting also for me is,

Starting point is 00:12:06 I guess the innovations that are exposed by the Chinese models are maybe copies or, like, discussions of what's going on in the labs. I think obviously GRPO, you mentioned a lot of these RL optimizations. They come on as, they present themselves as optimizations. GRPO came out in the DeepSeekMath paper, which when it came out, I read it and I was like, okay, this is kind of cool. It's like a little bit cheaper.

Starting point is 00:12:31 But like it does seem to have more broad impacts, I think, on the industry as a whole than was initially appreciated. I just want to, I don't feel like we've processed that enough. Yeah, definitely. I mean, like, yeah, as you said, it came out in the DeepSeekMath paper, and like it's an interesting optimization method, but the more interesting thing is that they have a new reward signal

Starting point is 00:12:51 that we can really, really trust. Like when, you know, you find the answer to a math problem, it's a lot less debatable than like, oh, well, is this thing that the human preferred actually what we want to do? Yeah, like, you want to be right at math. Yeah, yeah.
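As a rough illustration of the two ideas discussed here, the sketch below pairs a verifiable reward (exact match on a math answer) with the group-relative advantage normalization that GRPO is known for: sample several answers to the same prompt, then score each one relative to the group's mean and standard deviation. The function names, the `eps` constant, and the use of population standard deviation are assumptions for illustration; this is not the DeepSeekMath code and it omits the policy update itself.

```python
from statistics import mean, pstdev
from typing import List


def verifiable_reward(answer: str, reference: str) -> float:
    # The reward is checkable: the final answer either matches or it does not.
    return 1.0 if answer.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    # Normalize each reward against the other samples for the same prompt.
    # eps (an assumed constant) avoids division by zero when all rewards agree.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt whose reference answer is "12".
samples = ["12", "13", "12", "7"]
rewards = [verifiable_reward(s, "12") for s in samples]
advantages = group_relative_advantages(rewards)
```

Correct samples get a positive advantage and incorrect ones a negative advantage, with no learned value model involved. If every sample in the group gets the same reward, all advantages are zero and that prompt contributes no gradient.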

Starting point is 00:13:05 And so I think in some ways, that's underappreciated in, I would say, what's getting published. Yeah. Yeah. Let's talk about, I guess, long horizon. Yeah. What do people consider in terms of like very long horizon?

Starting point is 00:13:18 Like we're talking like 30 hours, you know, more than a day of autonomy. Is it just more of the same or is there anything like sort of qualitatively different? Okay. So first off, what I would first say is I tend to think more in terms of like actual number of tokens than time. Because I think. Yeah. The human in the loop can take a while. Yeah.

Starting point is 00:13:37 Well, and also like it gives you a different measure to optimize against, right? Like as I was saying earlier with, um, when I use Codex. It does something that would take me much longer. It would take me like four hours, in 10 minutes. What we can actually push on there is token efficiency. So like, yeah, and that is a huge, huge research area. Yeah. And so you can see like from 5 to 5.1 our overall evals, you know, we bumped some. But if you look at a 2D plot of how many tokens it takes for us to get that, it went way down. And so I think that's like a difference. When you had that? Like, that was such a great chart.
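The "2D plot" of eval score against tokens spent can be read as a Pareto comparison: a model is strictly better if it scores at least as well while spending no more tokens. The sketch below shows that comparison under stated assumptions. The `EvalPoint` type, the model names, and every number in it are invented for illustration and are not published results.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvalPoint:
    name: str
    score: float        # fraction of tasks solved, between 0 and 1
    mean_tokens: float  # average tokens generated per task


def dominates(a: EvalPoint, b: EvalPoint) -> bool:
    # a dominates b if it is no worse on both axes and strictly better on one.
    no_worse = a.score >= b.score and a.mean_tokens <= b.mean_tokens
    strictly_better = a.score > b.score or a.mean_tokens < b.mean_tokens
    return no_worse and strictly_better


def pareto_frontier(points: List[EvalPoint]) -> List[EvalPoint]:
    # Keep only the points that no other point dominates.
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Invented numbers, purely to exercise the functions.
points = [
    EvalPoint("model-a", 0.70, 12000),
    EvalPoint("model-b", 0.72, 6000),
    EvalPoint("model-c", 0.60, 3000),
]
frontier = pareto_frontier(points)
```

Here the hypothetical `model-b` dominates `model-a` (slightly higher score at half the tokens), which is the shape of improvement described above: a small bump in score but a large drop in tokens.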

Starting point is 00:14:15 Dude, I live by those charts. Like, that, that was your chart? Okay. Not necessarily that, but like, that shape of chart. Like, I think that's something that we think about a lot, just because it contributes so much to your experience, like, how long does it take to do this task? Yeah. And I think the other thing is, as you're pushing that token efficiency, it changes, you know, how many tool calls can I make? And, like, how many different things can the agent do in a reasonable number of tokens that we can actually serve. And so I personally think in terms of tokens, yeah. I think the interesting thing or the hard to understand thing from the outside

Starting point is 00:14:51 is having an explicit router in GPT-5, but then also basically having an implicit router in terms of the thinking, spending thing, that conflates a little bit, right? Like at some point, you do kind of need to merge them or else you're just going to get these weird bumps where sometimes the router at the top decides something and it's wrong. And actually, if you just handed it to GPT-5, it would have figured it out. Yeah, and I think, you know,

Starting point is 00:15:15 we'll figure out the correct abstractions over time. I think, like, there's a... Is the intention still to merge? Because that's what it was said in the paper. Yeah, I think, like, eventually, you know, we'll have AGI and, like, you're not going to have to worry too much about how hard to think directly.

Starting point is 00:15:29 It'll just, you know, we'll have one tool that you always go to, and it knows how long to think for and things like that. I think that the abstractions and the way that we drive these things today, it'll change. And like, you know, I think even the amount that we've changed from, you know, having a non-thinking model to you can choose between two. And like, you know, now we can sort of route and how hard do you want to think? We're adding lots of knobs and, you know, eventually it'll probably simplify. Yeah. Another super interesting knob that everyone is doing is context compaction or memory compaction. What's going on there?

Starting point is 00:16:02 Nothing to share at the moment. Let me share. Clearly an important feature, clearly inspired by Codex usage as well, obviously. But I think, like, from the engineer's point of view, it feels like I used to do that as part of my harness, and now it's the model doing it for me. And I don't know how to think about that, like, in terms of, I guess, I'm used to having more control, and now I have less. Yeah, is there a specific? Like, there's no specific question. I'm just getting, like, feedback on, like, well, is this a trend that, like, we need, where you, it's basically a permanent fact of life from here on out. Oh, I see. You know, I don't know.

Starting point is 00:16:40 I worked on long context. That was why I was on last, for 4.1, where we, you know, I think 10x'd the effective context window for 4.1. And so there'll always be some dance of like, well, if we want to push as much as what we can do, not only should we increase the length of the context window, but like we should also have strategies for keeping that context window available for as long as possible. I'm guessing that both things will sort of happen just because we want to put as much power into the models as possible. Yeah.
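For readers who have not built this into a harness, here is a minimal sketch of the harness-side version of compaction that the host describes: once the conversation exceeds a token budget, older messages are replaced with a summary and only the most recent ones are kept verbatim. Everything here is an assumption for illustration: the whitespace token counter, the `compact` function, and the stub summarizer (a real harness would call a model to summarize and use the model's tokenizer to count). It says nothing about how any model does this internally.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Crude stand-in: counts whitespace-separated words, not real tokens.
    return len(text.split())


def compact(
    messages: List[str],
    budget: int,
    keep_recent: int,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    # Leave the history alone while it fits, or when there is nothing old to fold.
    total = sum(count_tokens(m) for m in messages)
    if total <= budget or len(messages) <= keep_recent:
        return messages
    split = len(messages) - keep_recent
    old, recent = messages[:split], messages[split:]
    # Replace the old prefix with a single summary message.
    return [summarize(old)] + recent


def stub_summarize(old: List[str]) -> str:
    # Placeholder summarizer; a real one would preserve the important facts.
    return f"[summary of {len(old)} earlier messages]"
```

This sketch does not check that the compacted history actually fits the budget; a fuller version would loop or summarize more aggressively. The trade-off raised in the conversation is about who owns this logic, the harness or the model.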

Starting point is 00:17:08 Yeah, I think we're still in a period where we should all be expecting changes in the interfaces that all of the models give to us. That way we can improve the models. Because if we lock the interface, I think what would be sad from my perspective is if we lock the interface, if we discover something new about models, we might sort of trap that improvement under an interface that needs to change. Got it. Talking about long context as well, there is some discussion about, I guess, context rot or like the utilization of the context. Even if you gave us like a million token context, we probably wouldn't use all of it.

Starting point is 00:17:41 What's the recommendation there? Where are things going? Are we going to have, I guess, perfect context by next year? Is that an impossible dream? I don't know. No, it's not an impossible dream. I think I'll give a shout out to some of the evals that we did for 4.1 called Graph Walks.

Starting point is 00:17:57 I love Graph Walks. We covered this in the podcast. Yeah, yeah, we did. I think if you look over time, all of those evals are so fine. And I think one of the interesting things about that is you have to do complicated transformations across the entire context window. Like that's sort of the issue with those heat map plots of the, those different. A needle in a haystack there.

Starting point is 00:18:19 Yeah, but the problem is if you only have to sample from one point in the context window, it's like sort of easy. Whereas with those graphwalks problems, you're having to do multiple transformations across the entire context window. And so I think keep watching those. I think they've been climbing. They'll continue to climb. I would say that that's definitely like a temporary issue that we are climbing on over time. Yeah. So, and then like, is 10 million tokens realistic?
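To show why a graph-walk task is harder than retrieving a single needle, here is a toy task in the spirit of what is described above: the context is a long list of directed edges, and the question asks for every node at an exact breadth-first depth from a start node, so the answer depends on edges scattered across the whole context. The node naming, prompt wording, and function names are assumptions; this is not the actual Graph Walks eval code.

```python
import random
from collections import deque
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def make_edges(n_nodes: int, n_edges: int, seed: int) -> List[Edge]:
    # Random directed edges over nodes named n0, n1, ... (hypothetical naming).
    rng = random.Random(seed)
    nodes = [f"n{i}" for i in range(n_nodes)]
    return [(rng.choice(nodes), rng.choice(nodes)) for _ in range(n_edges)]


def nodes_at_depth(edges: List[Edge], start: str, depth: int) -> Set[str]:
    # Ground truth: breadth-first search, then keep nodes at exactly `depth`.
    adj: Dict[str, List[str]] = {}
    for src, dst in edges:
        adj.setdefault(src, []).append(dst)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if dist[u] == depth:
            continue  # no need to expand beyond the target depth
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return {node for node, d in dist.items() if d == depth}


def build_prompt(edges: List[Edge], start: str, depth: int) -> str:
    # The edge list fills the context; the question comes last.
    lines = [f"{src} -> {dst}" for src, dst in edges]
    question = f"List every node exactly {depth} BFS steps from {start}."
    return "\n".join(lines) + "\n" + question
```

A needle test asks the model to copy one fact from one position. Here a single missed edge anywhere in the context changes the answer, which is the "multiple transformations across the entire context window" property.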

Starting point is 00:18:45 Is 100 million? Like, where does, is there a natural end or there's no end and we just are going as far as the eye can see? Oh, gosh. I don't know. Like, what do you think? Yeah. I feel like, okay, there are use cases that require billions. And there are use cases that require many, many billions, maybe trillions.

Starting point is 00:19:02 Yeah. Out of curiosity, like, what would be billions of tokens? We just had a context engineering discussion about like a code base over support issues for a company and it was 100,000 documents totaling about 8 billion tokens. You can't stick that in a context window for now. That's fair. I guess the, so I would still say like I don't know, but I think I've been like really surprised. It reminds me of when I was doing like more information retrieval stuff and like BM25 and these like very simple like n-gram indexes were like just super hard to beat.
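For context on why BM25 is a "very simple" baseline, the whole scoring function fits in a few lines. The sketch below implements one common Okapi BM25 variant (the one with `+ 1` inside the logarithm, which keeps the inverse document frequency non-negative), with whitespace tokenization and the parameter defaults `k1=1.5` and `b=0.75` as assumptions. The example documents are invented.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    # Tokenize by lowercasing and splitting on whitespace (a simplification).
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates via k1; b controls length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


# Invented support-style documents, for illustration only.
docs = [
    "refund policy for orders",
    "shipping times and tracking",
    "how to reset a password",
]
scores = bm25_scores("reset password", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
```

There is no training and no embedding model here, only term counts, which is part of why such lexical baselines are hard to beat on keyword-heavy queries.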

Starting point is 00:19:32 I think the agents with grep are like, they feel really similar to me where it's like, it's just unreasonably effective. So then I will not use your 10 million token context window, even if you gave it. Maybe, but like, what if we're using that context window

Starting point is 00:19:47 in service of like some larger goal that just has a lot of sub-search calls? Which is why I'm saying, like, I just don't know. And I think that's what makes it so exciting. Yeah, yeah. I would say also like the other modalities like video would eat up a lot and like then obviously the hard sciences have proteins and all that, where a lot of information is encoded in physics. So, I mean, yeah, I have mixed feelings

Starting point is 00:20:16 about it just because I'm like well this will never scale not with like full attention and we we probably just need to invest in systems anyway which means we're good with what we have I mean like get your graph walks up But like, I don't know if we need like 10, 100 X, when actually maybe we need to figure out ways to 1,000, 1 million X. Yeah. Right? Like, these are just different slopes.

Starting point is 00:20:40 I mean, I'm glad that you're happy with the current context windows. I think my dream would be to push it and see what happens anyway. But the engineer's incentive is always to say, well, the systems matter more than the models. And the researcher's incentive is to say, well, screw your systems. We'll just push the models. Oh, no. It's so differently.

Starting point is 00:20:59 Yeah. I think that's one of the most like sort of beautiful things about post-training at OpenAI is everyone. Co-design. Yeah, it's also co-designed. Like, you know, I spend a lot of time just doing our system stuff. And I also do lots of stuff like where I'm making Graph Walks. And I'm like doing a lot more like things on the learning side. And I think it's a great culture to have a place where people just move seamlessly between the two.

Starting point is 00:21:24 Yeah. What are you guys hiring for? Presumably you're hiring. What are you guys hiring for that is hard to hire? What is the skill set that it's like, we really need this, can't find it, please everyone go skill up on this. This is definitely my personal opinion here: I think we're still having trouble, not at OpenAI, but I think as a whole, producing lots of people that do want to do lots of both systems work and ML work. And I think if you're trying to push the frontier, you don't know which place is currently bottlenecking the frontier. And it changes all the time.

Starting point is 00:21:56 I mean, even within one project, it might change multiple times where the current bottleneck is. But I think the education system we have right now isn't really optimized for that. So, like, I personally, I studied math and then I was very, very lucky to have some, like, great mentors after school that, like, taught me to be a good software engineer. But it seems like if we're going to be in this place for a while, and I think we will be, we should probably be producing more students that are great at doing both, you know, distributed systems and, like, a lot of core engineering, as well as the statistics and other, like, things that are required to be a good machine learning researcher. If we were to throw Codex at it, obviously, we can't do Codex at everything. That's why you still, let's say, which will progress faster, which is more solvable by LLM?

Starting point is 00:22:43 That is a, that's a spicy question. You can't say they're both equally hard. I don't know. Maybe they are. I mean, they're differently hard. I think one is more hill climbable than the other, which is it? Because then we can go do it. Okay.

Starting point is 00:22:55 I think one thing that's slightly simpler about some of the ML research, like, you know, ML research is also distributed systems, to be clear. But like some of the things that I would say like get traditionally called ML research are things that you can treat a bit more of as a black box, whereas like, you know, the environment to train on, you know, building these different systems is actually just like complicated. engineering problem. And so theoretically, I would say that they're like

Starting point is 00:23:29 probably roughly equal. But I think that there's some there's some amount of effort I feel like to making the, the environments for it. Yeah. But they require yes. Yeah. They require GPUs in themselves as well. Yeah. Yeah. I guess they both would. But yeah,

Starting point is 00:23:46 that would be my guess. But I don't have much confidence in it. So a lot of people are building this like AI scientist, right? They automate research. You guys have your benchmark on PaperBench, and though that's the one area that, um, like for example at Cognition we've just decided to not do because it's so hard. Okay. Any other people on the post-training team that you want to shout out, who have done like interesting work this year? They should get more attention but

Starting point is 00:24:12 they're not getting credit. Well, okay, for sure everyone on the shopping team that I was just working with. So like Andrew Hoyle, um, Manuka Stratta, John Hallman, all great people. Yeah, Isa Fulford, obviously the manager for it. And she was the original deep research person? Yeah, yeah. There was like three of them. Yeah, yeah. And so definitely that part of the team. But I mean, everyone, everyone is so great. Like, I think it's hard

Starting point is 00:24:35 to pick out a list. It's a really fun time on post-training right now. It's exciting every day. Yeah, it feels like we're all enjoying our Diet Coke together in the office late at night. Yeah. Oh, I did want to squeeze this in before we end. Nobody actually

Starting point is 00:24:51 serious is saying that pre-training is dead. It's just a meme. There's a lot of work going on in pre-training. And in fact, actually, a lot of my researcher friends are saying too much money is going to post-training. That's also spicy. I don't know. One of the charts I hold in memory from this year is the Grok 4 chart. I don't know if you've seen it. But it's basically saying, well, we scaled pre-training to here and about this level of compute. And now we're spending the same level of compute on post-training as well.

Starting point is 00:25:18 That's very controversial, I guess, to me, because we're all used to post-training taking orders of magnitude less data, compute, whatever. And obviously we're scaling that up now. Do we get to a point where they're equal? I don't know. But that's a topic for conversation. I think how much do we invest in this versus more like different pre-training? Yeah. Yeah. So first off, neither one of those that I think it's really interesting to sort of be living through something that I, you know, all my other like historic or technological revolutions, things that I read about in history books. And like, this was live as this happened.

Starting point is 00:25:56 Yeah, this one, we don't know the end yet. Yeah, and so there's this almost like fog of war where I'm like, oh, did people think that like, we got like the steam, like the steam engine and they would have, you know, the factories. I don't know if you know this, but like the factories, they used to be like very linear because you had to drive like one motor across it in an entire room. And it made it so when electricity got developed, they just tried to do the same thing. And they're like, ah, this isn't all that useful. And it took, I think, like a couple of decades before.

Starting point is 00:26:22 they realized, wait, if we have electricity, we can move the little, like, stations in whatever is most ergonomic. And then, you know, manufacturing was transformed by electricity. And I think, like, it really gives me no confidence in being like, oh, this thing is dead. Yeah, our timelines are so short. Yeah. Yeah. Yeah. The way, like, good ideas get experimented and funded and propagated, actually, that's still a human timeline. It's not on an AI timeline. Yeah. Yeah. And so I think, like things will maybe be like dormant, but it'll be spiky. Like there'll be all some, you know, some, yeah, yeah. And then we'll all feel different. It's like we're, what's, what's the meme? It's so over. We're so back. Yeah. It's going to be that many times. And I think having like a, some, some emotional

Starting point is 00:27:05 stabilizing to it is probably going to be good for, for everyone's sanity. Yeah. More sanity. Well, thank you so much for joining. Thanks for all the great post-training this year. Yeah, thank you. Yeah, continue giving feedback. I love to hear what you think. Yeah, awesome.

