Big Technology Podcast - OpenAI Chief Research Officer Mark Chen: GPT 4.5 is Live and Scaling Isn’t Dead

Episode Date: February 27, 2025

Mark Chen is the chief research officer at OpenAI. Chen joins Big Technology Podcast to discuss the debut of GPT 4.5, the company's largest model, which is going live today. In this bonus episode, Chen speaks about what the new model says about the AI scaling wall, how scaling traditional GPT models compares to reasoning models like OpenAI's o1, how important EQ is for AI models today, whether product matters more than models, and how OpenAI's talent bench looks after last year's departures. Tune in for an inside look at what it took to build OpenAI's newest and biggest large language model, and why OpenAI is committed to pushing the frontier forward.

Transcript
Starting point is 00:00:00 OpenAI Chief Research Officer Mark Chen is here to talk about the release of GPT 4.5, the company's largest and best model yet, which is coming out today. We'll dive in right after this. Welcome to Big Technology Podcast, a show for cool-headed, nuanced conversation of the tech world and beyond. We're joined today by Mark Chen, the chief research officer at OpenAI, who's here to talk about the company's newest release, GPT 4.5. Yes, it's finally here, and it is debuting today. Great to see you. Welcome to the show. Thank you so much for having me on. Thanks for being here. This is, in four and a half years of the show, our first OpenAI interview, so hopefully the first of many. We appreciate you jumping into the water like this, and it's on big news with the release of GPT 4.5.
Starting point is 00:00:47 Yeah. So GPT 4.5, really, it signifies the latest milestone in our predictable scaling paradigm. So, you know, previous models that have fit this paradigm have been GPT-3, 3.5, 4, and now this is the latest thing. It signifies an order of magnitude improvement over the last models, kind of commensurate with the jump from 3.5 to 4. I think the question that most of our listeners are going to be asking, and certainly we've asked on our show in the past couple of months, is why isn't this GPT-5? I mean, what is it going to take to get to GPT-5? Yeah. Well, I think GPT-5, you know, whenever we make these naming decisions, right, we try to keep with a sense of what the trends are. So, again, when it comes to predictable scaling, right, going from 3 to 3.5, you can kind of predict out, you know, what an order of magnitude of improvement in, you know, the amount of compute that you train the model with, in terms of efficiency improvements, will buy you. And we find this model kind of aligns with what 4.5 would be.
Starting point is 00:01:53 So we want to name it what it is. Okay. But there's been so much talk about when GPT-5 is going to come. Correct me if I'm wrong, but I think there's been a longer wait between GPT 4 and 4.5 than there was between, let's say, GPT 3.5 and 4. And I don't know, is this like because we're seeing a lot of hype from OpenAI folks on Twitter about what's coming next? Maybe it's because this is probably the most impatient industry in the world, with the most impatient users in the world. But it seems to me like the expectations for GPT-5 are built up pretty high.
Starting point is 00:02:31 And so I'm curious, from your perspective, do you think it's going to be hard to meet those expectations whenever that GPT-5 model does come out? Well, I don't think so. And one of the fundamental reasons is because we now have two different axes on which we can scale, right? So, GPT 4.5, this is our latest scaling experiment along the axis of unsupervised learning, but there's also reasoning. And when you ask about kind of like why there seems to be, you know, a little bit bigger of a gap in release time between 4 and 4.5, we've been really largely focused on developing the reasoning paradigm as well. So I think, you know, our research program is really an exploratory research program, right?
Starting point is 00:03:12 We're looking into all avenues of how we can scale our models. And over the last, you know, one and a half to two years, we've really found a new, very exciting paradigm through reasoning, which we're also scaling. And so I think GPT-5 really could be the combination of a lot of these things coming together. Okay. So you talk about how there's been a lot of work toward reasoning. We, of course, have seen that with o1.
Starting point is 00:03:37 There's a lot of buzz about DeepSeek. And now we're talking about, again, one of the more traditional scaled-up large language models with GPT 4.5. So the big question here, and I think it was on a lot of people's minds when it came to this upcoming release, we thought, was it going to be 4.5 or 5? Anyway, it doesn't matter. The big question is, can AI models continue to scale when you add more compute, more data, more power to them? It seems like you have an answer to this. So I'm curious to hear your point of view on what you've learned about the scaling wall, given your development of this model,
Starting point is 00:04:17 and whether we're going to hit it, whether we're already seeing some diminishing returns from scaling. Yeah, I really kind of have a different framing around scaling. So when it comes to unsupervised learning, right, you want to put in more ingredients like compute, algorithmic efficiencies, and more data.
Starting point is 00:04:36 And GPT 4.5 really is proof that we can continue the scaling paradigm. And this paradigm is not the antithesis of reasoning, right? You need knowledge in order to build reasoning on top of, right? A model can't kind of go in blind and just learn reasoning from scratch. So we find these two paradigms to be fairly complementary, and we think, you know, they have feedback loops on each other.
Starting point is 00:05:00 So, yeah, GPT 4.5, again, it is smart in different ways from the ways that reasoning models are smart, right? When you look at the model today, it has a lot more world knowledge. When we look at kind of comparisons against GPT-4o, you'll see that for everyday use cases, people prefer it by a margin of 60%. For actual productivity and knowledge work against GPT-4o, there's almost like a 70% preference rate. So people are really responding to this model. And it's this knowledge that we can leverage for our reasoning models in the future. So what are the examples? Like, you talk about everyday knowledge work, what are some of the
Starting point is 00:05:41 examples that you would use GPT 4.5 for, that you would prefer it over a reasoning model? Yeah. So I wouldn't say that. It's a different profile from a reasoning model, right? So with a larger model, it takes more time to kind of process and think through the query, but it's also giving you an immediate response back. So this is very similar to what a GPT-4 would have done for you, right? Whereas, I think, with something like o1, you get a model where you give a query and it can think for several minutes. And I think these are fundamentally kind of different tradeoffs, right? You have a model that immediately comes back to you, doesn't do much thinking, but comes up with a better
Starting point is 00:06:28 answer versus a model that, you know, thinks for a while and then comes up with an answer. And, you know, we find that in a lot of areas, like creative writing, for instance. Again, this is stuff that we want to test over the next one or two months, but we find that there are areas like creative writing where this model outshines reasoning models. Okay, so writing. Any other use cases? Yeah, so there's writing, I think some coding use cases as well. We also find that kind of like, you know,
Starting point is 00:07:02 there are some particular kinds of scientific domains where this model outshines in terms of the amount of knowledge that it can display. Okay. I'm going to come back to benchmarks in a moment. But I want to keep on this scaling question, because I think there's been a lot of conversation about it in public, and it's great to be speaking with you from OpenAI to sort of get to the bottom of what's happening. So the first is the question that folks have about this size. And you don't talk about the size of the models, which is, you know, fair. But they're big, right? This is the largest
Starting point is 00:07:35 model that OpenAI has ever released, GPT 4.5. So I'm actually curious to hear: at this size, does adding, you know, similar amounts of compute, similar amounts of data get you the same returns that it did before? Or are we already starting to see the returns from adding these resources tail off? No, no, we are seeing the same returns. And I do want to stress that GPT 4.5 is that next point on this unsupervised learning paradigm. And, you know, we're very rigorous about how we do this. We make projections, based on all the models we've trained before, of what performance to expect. And in this case, you know, we put together the scaling machinery, and this is the point that lies at that next order of magnitude.
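[Editor's note: the "predictable scaling" projection Chen describes, fitting performance across past training runs and extrapolating to the next order of magnitude of compute, can be sketched as a simple power-law fit. All numbers below are invented for illustration; OpenAI's actual data, metrics, and methodology are not public.]

```python
import numpy as np

# Hypothetical (compute, loss) points from past training runs; the real values
# are not public -- these numbers are made up purely for illustration.
compute = np.array([1e21, 1e22, 1e23, 1e24])   # training FLOPs
loss = np.array([2.9, 2.5, 2.17, 1.88])        # validation loss

# Scaling laws posit loss ~= a * C**(-b), which is a straight line in log-log
# space, so a linear fit on the logs recovers the trend.
slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), 1)

# Extrapolate one order of magnitude past the largest run; the finished model
# can then be checked against the predicted point.
predicted_loss = 10 ** (intercept + slope * np.log10(1e25))
print(f"predicted loss at 1e25 FLOPs: {predicted_loss:.2f}")
```

With these made-up points the fit predicts a loss of roughly 1.6 at the next order of magnitude; the practical value of such a projection is that a new run landing far off the fitted line signals a problem with the scaling machinery.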
Starting point is 00:08:14 So what's it been like getting here? I mean, again, we talked about how there was a period of time that was longer than the last interval, and part of that was focused on reasoning. But there have also been some reports that OpenAI had to start and stop a couple times to get this to work, and it really had to fight through some thorny issues
Starting point is 00:08:37 to get it to be this step change, as you're saying. So talk a little bit about the process, and maybe you can confirm or deny some of the things that we've heard about having to start and stop again and retrain to get here. Actually, I think it's interesting that this is a point that's attributed to this model, because in developing all of our foundation models, right, they are all experiments, right? I think, you know, running all of the foundation models oftentimes does involve stopping at certain parts, just kind of analyzing what's going on, and then restarting the runs. And I don't think that this is a characteristic of GPT 4.5.
Starting point is 00:09:16 It's something that we've done with, you know, GPT-4, with o-series models. And, you know, they are largely experiments, right? We want to go in, diagnose them in the middle, and if we want to make some interventions, we should make interventions. But I wouldn't characterize this as kind of something that we do for GPT 4.5 that we don't do for other models. We've already talked a little bit about reasoning versus these traditional GPT models. But it makes me think of DeepSeek.
Starting point is 00:09:44 And I think you already gave a pretty compelling answer as to what you would use one of these models for versus a reasoning model. But there's another thing that DeepSeek did that is worth discussing, which is that they made their models much more efficient. And it's kind of interesting, like, when I talk to you about, all right, you need data, you need compute, you need power, you're like, yeah, and you need model optimizations, which is something that people often overlook. And just going back to DeepSeek for a moment, the model optimization, the fact that they went from
Starting point is 00:10:13 basically querying the entire knowledge base to mixture of experts, where they were able to sort of route the queries to certain parts of the model instead of lighting it all up, is credited with helping them get more efficient. So I just want to turn it over to you, without commenting on what they did, or you can if you want. But I'm actually more curious what OpenAI is doing on that front, and whether you did similar optimizations with GPT 4.5, and are you able to run these large models more efficiently, and if so, how? Yeah, so I would say kind of the process of making a model efficient to serve, I often see as fairly decoupled from developing the core capability of the model, right?
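[Editor's note: the mixture-of-experts idea referenced in the question, routing each input to a few specialized sub-networks instead of activating the whole model, can be sketched in a few lines of toy code. This is a generic illustration of the technique, not OpenAI's or DeepSeek's implementation; every shape and name here is invented.]

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_layer(x, experts, gate_w, top_k=2):
    """Route token vector x to its top_k experts instead of running all of them."""
    scores = x @ gate_w                              # one routing score per expert
    top = np.argsort(scores)[-top_k:]                # indices of the best-scoring experts
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                         # softmax over the selected experts only
    # Only the selected experts run; the rest of the parameters stay idle.
    out = np.zeros_like(x)
    for w, i in zip(weights, top):
        W, b = experts[i]                            # each expert is a small feed-forward layer
        out += w * (x @ W + b)
    return out

d, n_experts = 8, 4
experts = [(rng.normal(size=(d, d)), rng.normal(size=d)) for _ in range(n_experts)]
gate_w = rng.normal(size=(d, n_experts))
y = moe_layer(rng.normal(size=d), experts, gate_w)
print(y.shape)
```

The efficiency win Chen and the host allude to is visible in the loop: total parameters grow with the number of experts, but compute per token depends only on the `top_k` experts the gate selects.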
Starting point is 00:10:54 And we see a lot of work being done on the inference stack, right? I think that's something that DeepSeek did very well. And it's also something that we push on a lot, right? We care about serving these models at cheap cost to all users, and we push on that quite a bit. So I think this is irrespective of GPT-4 or reasoning models. We're always applying that pressure to be able to inference more cheaply, and I think we've done a good job with that over time, right? Like, the costs have dropped many orders of magnitude since we first launched GPT-4. And so are there, like, I mean, maybe tell me if this is too elementary, but the move towards, for instance, mixture of experts.
Starting point is 00:11:21 Is that more of a reasoning thing, or can you apply that in GPT 4.5? So that is an architectural element of language models. I think pretty much all large language models today utilize mixture of experts. And it's something that applies equally to efficiency wins in foundation models like GPT 4.5 as it does to reasoning models. So you were able to use that
Starting point is 00:11:51 out of the gate as well, basically? Yeah, we've definitely explored mixture of experts, as well as a number of other architectural improvements, in GPT 4.5. Okay, great. So we have a Discord with some members of the Big Technology listener and reader group. And, you know, a theme that's come up recently. It's kind of interesting to be talking
Starting point is 00:12:19 with you right now about an extremely large model, because a theme that the people in the Discord can't stop talking about is just how small, niche models are potentially going to be the future. I'll just read you one comment that we had over the past few days: For me, the future is very much aligned with niche models existing in workflows, and less so these general-purpose God models. So clearly OpenAI has a different thesis here. And I am curious to hear your perspective on what we get with the big models versus the niche models. And do you see them in competition or as complements? Help us think through that. Yeah. Yeah. So I think one important thing is we also serve
Starting point is 00:12:55 models that are smaller, right? Like, we serve our flagship frontier models, but we also serve mini models, right, which are cost-efficient ways that you can access the capabilities, or fairly close to frontier capabilities, for much lower cost, right? And we think that's an important part of this comprehensive portfolio here. Fundamentally at OpenAI, though, we're in the business of advancing the frontier of intelligence, and that involves developing the best models that we can. And I think really kind of what we're motivated by is pushing that out as much as possible.
Starting point is 00:13:27 We think there's always going to be use cases at the frontiers of intelligence. You know, we think that, you know, going from the 99.9th percentile in mathematics to the best in the world in mathematics, right? Like, that difference means something to us. Like, I think what, you know, the best human scientists can discover is tangibly different, right, from what you or I can discover. So we're motivated by pushing the intelligence frontier as far as possible. And at the same time, we want to make these capabilities cheaper and more cost-effective to serve for everyone. So we don't think the niche models will go away.
Starting point is 00:14:03 We want to build these foundation models and also figure out how to deliver these capabilities at cost over time. So that's always been our philosophy. There's always going to be some juice there in those last bits of intelligence. Yeah, so let's talk about that because we have a debate on the show often,
Starting point is 00:14:19 what matters more, the products or the model. I'm on team model. We have Ranjan Roy, who comes on on Fridays. He's team products. He's basically like, just take what you have now and productize it. And I say, well, you could probably do more with a better model. But I have to be honest, I'm kind of at a loss for words sometimes about what getting from that 99th percentile in math to the best in the world at math will do.
Starting point is 00:14:42 So actually, I'm curious to hear your answer on this one. What does building the best model in the world do that you couldn't do otherwise? 100%. And I think really it signals a shift, right? Like, I think if you just think about, hey, you take the current models and you build the best surface for them, that's certainly something you should always be doing and exploring. I think three years ago, that looked like chat, right? We launched ChatGPT.
Starting point is 00:15:07 And today, when you take the best models and the best capabilities, I think it looks a little bit more like agents, right? And I think reasoning and agents, they're very, very much coupled, right? When you think about what makes a good agent, it's something that you can kind of sit back, let it do its own thing, and you're fairly confident it'll come back with something that you want, right? And I think reasoning is the engine that powers that, right? Like you have the model go and try something out, and if it can't succeed on the first try,
Starting point is 00:15:37 it should be able to be like, oh, well, why didn't I succeed? And what's a better approach for me to do? So, you know, I think very much kind of like the capabilities are always changing and the surface is always changing as a response. And we're always exploring what the best surface for the current capabilities looks like.
Starting point is 00:15:52 But just to hammer home on this, like, what does that improvement in the model get you? Like, what do you think it will enable? Yeah, yeah. So, I mean, I think agents of all forms, right? When you look at stuff like deep research, for instance, right, it gives you the ability to essentially kind of get a fully formed report on any single topic that you might be interested in, right? I've used it to even put together, like, hour-long talks, and it goes and really kind of synthesizes all this information out there and really organizes it, comes up with lessons, allows you to do deep discovery, allows you to, you know, like, dig into almost any topic that you're interested
Starting point is 00:16:34 in. So I feel like just the amount of information and synthesis that's available to you now is just really rapidly evolving. So basically it's not as simple as, like, just go make deep research better with the model you have now. Am I reading between the lines the right way, saying that what you're expressing here is that if you make the model better, then the product is going to get better inherently? Take deep research, for instance. 100%, 100%. Yeah. And that's something that is not enabled unless you have models of a certain level of capability,
Starting point is 00:17:07 both in reasoning and in the foundational unsupervised learning sense. Okay. You know, it's interesting. I guess this is one question I've had in the back of my mind, and I'm just going to ask it to you again, just so I'm sure I'm clear on it. My view, maybe erroneously, was that your industry was just going to move from these massive models to the massive models with reasoning. But you're actually saying that there's a dual track here.
Starting point is 00:17:34 Yeah, yeah. So I think we're always pushing the frontier, right? And we, I think even since, you know, five, six years ago, the prevailing way to do that was to up the scale, right? And so we've been upping the scale in unsupervised learning. We've been upping the scale in reasoning. But at the same time, right, you care about serving mini models. You care about serving models that are cost effective, that can deliver capabilities at a cheaper cost. And that will often be sufficient for a lot of use cases, right?
Starting point is 00:18:01 And the mission isn't just about pushing the biggest, most costly models. It's about having that and also a portfolio of models that people can use cheaply for their use cases. Okay, so let's quickly talk before we leave about the upgrades that you're seeing in 4.5 compared to 4. So I'm curious if you can just run us through, at a very high level, the benchmarks it hits versus the benchmarks of the previous models. And then I'll just throw a double question in here. I've already read your blog post, and so I have an idea of what's coming. By the way, we're going to release this just as the news is released. So it seems like you're also making a statement in some ways, saying, like, yes, we have the traditional benchmarks, but we also need to measure how this model works with EQ, as opposed
Starting point is 00:18:48 to just, you know, pure intelligence. So, yeah, just hit us with the benchmark improvements, and then why you think it's important for us to look at both of these in conjunction. Right, yeah. So, I mean, along all the traditional metrics, like things like, you know, GPQA, AIME, you know, the traditional kind of benchmarks that we track, this does signify, you know, an order of magnitude, about at the same level of jump from 3.5 to 4.
Starting point is 00:19:04 There's a kind of interesting focus here also on, I would say, more vibes-based benchmarks, right? And I think that's actually important to highlight, because every single time we've launched a model, there is a discovery process of what the kind of interesting use cases out there are going to be. We notice here, you know, it's actually a much more emotionally intelligent model. You know, you can kind of see examples in the blog post later today, but, like, how it responds to, you know, queries about, you know, a hard situation, or, you know, advice in a particularly difficult situation. It responds in a more emotionally intelligent way.
Starting point is 00:19:48 I think there's also, you can kind of see, like, this may be a kind of silly example, right? But if you ask any of the previous models to create ASCII art for you, they mostly just fall down. This one can do it almost flawlessly. Pretty well. And so there are just so many kind of footprints of improved capabilities. And I think things like creative writing will showcase this. One of the things that I think I picked up in the examples that you've given so far is that it doesn't seem like it feels the need to write, you know, a thesis for every response. Like, one user was like, I'm having a hard time.
Starting point is 00:20:27 And it actually responded succinctly, as a human would, as opposed to maybe the traditional, you know, here are three paragraphs of self-care routines you can do for yourself. Yeah, yeah, yeah. And that speaks to the emotional intelligence, right? It's not like, oh, I see that you're feeling bad, here are, like, five ways you could feel better, right? That just doesn't feel like a grounded, kind of compassionate response. And here you just get something that's direct, to the point, and really invites the user to say more. So I think there's going to be a criticism.
Starting point is 00:20:52 I'm anticipating it. And let's talk about it right now. People will say, okay, OpenAI was talking about these traditional benchmarks. Now it's talking about emotional intelligence. It's shifting the goalposts and wants us to pay attention to something else. What's your response there? Well, I really don't think the accurate characterization is that it doesn't hit the benchmarks that we expect it to. So when you look at kind of the development of 3 to 3.5 to 4 to 4.5, this does hit the
Starting point is 00:21:20 benchmarks that we expect. And I think the main thing is, like, you know, it's all about use case discovery every time you put a new model out there. And in many senses, like, GPT-4 is already very smart, right? And this parallel is kind of like when we were putting GPT-4 out, right? It's like, we saw it hit all the right benchmarks that we expected it to, but what are users going to resonate with? That was the key question.
Starting point is 00:21:46 And I think that's the question that we're asking today with GPT 4.5 as well. And we're inviting people to be like, hey, you know, we did some early explorations. We see that it's more emotionally intelligent. You know, we see that it's a better creative writer. But what do you see here? Yep. All right, Mark. So I've been seeing you, and we mentioned this before we started recording.
Starting point is 00:22:03 I've been seeing you in all the OpenAI videos about every release. So, first of all, great to speak to you live. But also, over the past year, we've seen a lot of exodus out of OpenAI. Maybe the media plays it up too much. Probably we do. But I am kind of curious what it's like working within OpenAI and how you see the talent bench inside the company. You recently became chief research officer just a few months ago. And now, look, we have a new foundational model.
Starting point is 00:22:27 So just give us a sense as to what the talent situation is inside. Honestly, it's still, I think, the most world-class AI organization. I would say that there's a separation between the talent bar at OpenAI and any other firm out there. And when it comes to kind of people leaving, you know, the AI landscape changes a lot, probably more so than any other field out there, right? The field three months ago looks different from the field three months before that. And I think it's kind of just natural in the development of AI that some people will have their own thesis about, here's the way I want to develop AI, and go try it their own way. I think that's healthy.
Starting point is 00:23:05 And it also gives an opportunity for people internally to shine. And we've never had a shortage of people internally who are willing to step up. And we've seen that a lot. And I really just love the bench that we have here. Very cool. All right, folks, GPT 4.5 is out today for OpenAI Pro users; next week, it's coming out for Plus, Team, Enterprise, and EDU. Mark, great to see you.
Starting point is 00:23:29 Thank you again for spending time. You're about to go and do the livestream, so I'm very grateful that you spent the time with me today. Thank you so much. I really appreciate your time, too. Thanks for having me. Well, let's do it again soon. And folks, we shouted out the argument between Ranjan and me.
Starting point is 00:23:42 We'll go into that, and everything more we can share about GPT 4.5, coming up tomorrow on the Friday show. Thanks for listening. Thanks again to Mark and OpenAI for the interview. And we'll see you next time on Big Technology Podcast.
