TBPN - OpenAI Day: GPT-5 Unveiled | Mark Chen, Greg Brockman, Sarah Friar, Max Schwarzer, Brad Lightcap & More

Starting point is 00:00:00 You're watching TBPN. Your background looks way different because you have a whiteboard behind you because we're breaking down the X's and O's of the GPT-5 launch today. That's right. Launched from OpenAI. Really quickly, there is some other news. Firefly Aerospace stock opened at $70 in NASDAQ debut. This is the company that landed on the moon. Very cool.

Starting point is 00:00:22 Very cool. There are a few other stories going on, but we're going to skip most of them because we're going to be focusing on ChatGPT today, on GPT-5. We have a bunch of guests coming on. We have a stacked lineup. We'll pull that up, but we'll break down the X's and O's of the matchup. So of course, OpenAI, here's our lineup.

Starting point is 00:00:43 We have something like 15 guests today. A ton of folks from OpenAI, a ton of people that build on top of OpenAI and can comment on what's going on with ChatGPT. But of course, this battle is between OpenAI and the timeline. It's the, it's, they got to get the vibes right. It's war. It's war.

Starting point is 00:01:03 It's, it's, the timeline's in turmoil over whether or not this is a good model, what it means for the industry, what it means for AGI timelines. Everyone's got their take. Everyone's posting memes. There's been a ton of funny ones already. We'll take you through them, of course. But let's break down the offense today. We have Sam Altman, the founder, CEO.

Starting point is 00:01:21 He briefly got cut from the team in November of 2023, but he's back leading the team for the 2024, 2025 seasons. He seems healthy. He's doing great today. He went on at 10 a.m. to break down the launch of GPT-5. He has a couple of key plays in his playbook, in his arsenal. He's got a solid ground game. Lots of quick posts hitting the timeline, probably in lowercase. Then he might air it out with a couple thousand word essay. We've seen him do this before. It's a bit of a Hail Mary. Maybe AGI is a couple thousand days away. Maybe we're in the soft singularity. But he's very strong there with the long post when he needs to be. It's up his sleeve if he needs it. Then he can also pull out the vague

Starting point is 00:02:01 posting. He was doing this last night, posted a picture of the Death Star. No one knows what it means. Maybe it was taking a shot at the doomers who are on the defense today. So he's also known for driving supercars. That lets him get to the office faster. He's saving time and money. You can save time and money by going to ramp.com. Easy-to-use corporate cards, bill pay, accounting, and a whole lot more, all in one place. And so he is, he also gave apparently, this is a rumor, he gave every OpenAI employee who's been with the company for more than two years $1.5 million. A lot of people say, $1.5 million, that's not enough for a big house in San Francisco, but it is enough for a supercar. So that's probably why he picked that number. And that's why, that's what the OpenAI team

Starting point is 00:02:45 will be doing with that money. They'll be buying Aston Martin Valkyries, Pagani Huayras, McLaren Sabres, Ferrari Daytona SP3s. They can get a Koenigsegg Gemera. They could get a Singer DLS or a Bugatti Veyron. It would have to be used. They could also get the Bentley Bacalar. There's only, there's only 12 of those ever made. It's an open-top two-seater roadster. It's coachbuilt. So that's going to run you 1.5 million, but that's perfect. You just got the 1.5 million bonus. So put it to work, spend it all in one place on a car. This is financial advice. Yes, exactly. Then you got Greg Brockman. He's joining at noon. He's, he's

Starting point is 00:03:22 extremely well rested. He's actually coming off a sabbatical right now. That's very exciting. He should be injury-free for the rest of the season. He cut his teeth at MIT, and then he got drafted by Stripe in 2010. Microsoft tried to do a trade deal during the 2023 chaotic trade deal, trade window that opened up post-Sam Altman ouster, but he stuck with the OpenAI team, and now he's president of the company. Then you got Mark Chen. He's coming on at 11:30 today. He's the chief research officer. The rumor's that he turned down a maxed-out contract to head the Meta Llamas, but he's sticking with the OpenAI team. He was an MIT undergrad. He also worked at Jane Street before joining OpenAI in 2018. Then we got Sarah Friar coming on the show at 12:30. She's the CFO of OpenAI.

Starting point is 00:04:08 It's her job to find bank accounts big enough to fit all the cash they're raising. It's a tough job. You got to find, okay, this bank account, will it hold 10 figures? Will it hold 11 figures? Will it hold 12 figures? There's been a lot of cash in this one. Exactly, exactly. She's also going to be defining the non-GAAP metrics that will be catnip for Ben Thompson in just a few years. We're excited to talk to her about how she's measuring the success and the health of their

Starting point is 00:04:33 business. Obviously, it's not just revenue. It's not just top line, bottom line. We're going to want to know about queries. We're going to want to know about DAUs, all those non-GAAP metrics. That's what people are going to be tracking when IPO day comes, hopefully soon. And then we also have Brad Lightcap. He's joining at 2:35.

Starting point is 00:04:49 He entered the league as an investment banker. Let's give it up for the investment bankers. They don't get enough credit around here, but we love the investment bankers. Then he got drafted by Y Combinator before joining OpenAI as CFO in 2018. Now he's the chief operating officer. And then we have Max Schwarzer. He's in charge of post-training, fine-tuning these models, getting them into the fighting performance to put on a display of authority on GPT-5 launch day. Now, let's flip it over to the defense. They're going up against the timeline. They're going up against the vibe checks. We got the doomers. The doomers, they're led by Eliezer Yudkowsky. Admittedly, everyone knows this. No one debates this. The doomers have had a terrible season. But you'd expect to see at least a few Hail Marys about GPT-5 creating bioweapons thrown up on the timeline today. Probably won't be bangers, probably won't get a thousand likes, but you'll be seeing them here and there, mostly in the replies. We've also seen some doomers talking about GPT-5

Starting point is 00:05:48 being available to every government employee. And Eliezer had some harsh words about that. Don't give the keys to Sam Altman. Don't give the keys to the government to OpenAI. He was upset about that. But in general, the doomers not putting much of a fight up today. Then you got Claude. Interesting.

Starting point is 00:06:06 Claude was caught playing for the wrong team earlier this week. Anthropic. They're on defense today. But we saw them take out OpenAI's key pinch hitter, Claude. The Claude API was playing for the OpenAI team, but they shut that down and Claude is no longer pinch hitting for OpenAI. Then you got the Elon stans. The ground game's going to be there. It's going to be strong. The Elon stans are going to be tracking the benchmarks relentlessly. We know xAI loves benchmarks and all the Elon stans are going to be calling out GPT-5 for any,

Starting point is 00:06:44 any misaligned benchmarks if they fail. Humanity's last exam, it's over. It's over. They'll also toss up the occasional unhinged conspiracy theory. Moving on, Gemini. The betting lines have shifted big time. People thought Gemini was out of the game. They're so back.

Starting point is 00:07:01 Polymarket has Gemini at, what, 75% chance of being the best model towards the end of the month. This is, of course, based on the LM Arena, more vibes-based benchmark. But Gemini will probably be quiet today. They usually don't try and front run press releases. They usually try and sit back, let the models speak for themselves, let the API credits work their way through the latest YC Demo Day batch and get the product into the hands of people. And so expect to see a big, glossy conference in a couple weeks. Demoing Gemini 3 should be a good rebuttal from the Geminis.

Starting point is 00:07:41 Then you got the Meta Llamas. Been on a poaching spree. He's rebuilding the team during the offseason. Now he's got a stacked roster and he's ready to go duke it out. But no one knows exactly what's going to be in the playbook. Is he going to go consumer? Is he going to go API? Is he going to turn

Starting point is 00:07:57 into a hyperscaler? We don't know, but we know they got a stacked team. They got Alex Wang. They got Nat Friedman. They got Daniel Gross. They got tons and tons of other researchers. They've been raiding every other team. Completely reset the salary cap for the league. It's been an absolute clinic in terms of recruiting over there at Llama.

Starting point is 00:08:15 Then you got the final benchmark. ARC-AGI. This benchmark stands. GPT-5 couldn't get past this defense. And ARC-AGI, you know, sitting there right in the end zone, just swatting 'em down. Swatting 'em down all day. You think, you think we've got superintelligence around the corner? ARC-AGI, denied.

Starting point is 00:08:38 Denied. Tyler, give us the update on ARC-AGI. Where does everything stand? How'd GPT-5 do? Does it matter, should we care about ARC-AGI? We love the team behind them, but is it an important benchmark, should we be tracking it today?

Starting point is 00:08:54 ARC-AGI, V1 and V2, right? And V3? V3. I actually don't know if... No one's been... No one's even tested V3 yet. No one's even really close there. But how are we doing on V1? V1.

Starting point is 00:09:06 GPT-5 is at 65.7. Unfortunately, that's going to be 1% just short of Grok 4, 66.7. Okay, ARC-AGI 2. The Elon stans are going to be going wild with that. ARC-AGI 2, 9.9%. 9.9%. Grok 4, 16%.

Starting point is 00:09:23 So absolutely kind of brutal, you know, ARC-AGI mugging. Rough showing. Some people have accused Grok 4 of being slightly benchmaxxed. You know, this is, you know. They might have a team working on it. What are the pros and cons? We know the cons of benchmarking, of benchmaxxing: you're overfitting on something that might not actually drive consumer value,

Starting point is 00:09:47 it might not actually solve real world problems, it might not increase DAUs or revenue or ARR or anything that really matters. It might not even get us closer to superintelligence. Give me the counter argument. Why is benchmaxxing good? The bull case for benchmaxxing. The bull case for benchmarking, benchmaxxing, break it down for me. Yeah. So I think the idea is basically this is almost like a non-eastern, non-AGI-pilled kind of take, right? So if you don't have a super general intelligence, your ability to benchmaxx basically proves

Starting point is 00:10:20 your ability to solve some kind of specific task. So there's this thing about the gas station spiky. Yeah, it's called getting spiky. Getting spiky, adding more spikes to the spiky intelligence. Yeah, I think it was Rune who had this tweet about the gas station benchmark.

Starting point is 00:10:38 Yep. Right, I don't care if he said something like, I don't care about AI solving gas stations if it has the gas station benchmark, something like that. But the idea is like if you if the if making the gas station benchmark. Roon said my bar for AGI is an AI that can learn to run a gas station for a year without a team of scientists collecting the gas station data set in capital letters. And then my take is basically I don't care how they got to the like. I don't care how they made it run the gas station. I care how fast it gets here.

Starting point is 00:11:15 That it runs it. If we can run the gas station with AI, that's enough. If you have a team who's your benchmaxxing team, that just proves that if you have some task that's like really important that you want to get done, they can just figure it out. So it's like RL for business. This is like the same thing. RL for law. Yeah. All these like specific verticals.

Starting point is 00:11:32 Mira Murati is doing this at Thinking Machines, right? Like RL for businesses. Come into your organization, understand the most valuable business processes out there that could potentially be RL'd against, that could be turned into a benchmark and then, and then, you know, bench hacked. Because I don't care if you're hacking, you know, if I have "translate this type of document to this type of document" for my business, if you can do it with 100% accuracy, I don't care that you bench hacked it. Yeah, exactly. Like benchmarks right now are not like economically valuable. Like if you're really that much better at MMLU,

Starting point is 00:12:05 yes. It's like, are you producing that much value? Yes. Probably not. But if you have, if you make some new benchmark that's, you know, your tax benchmark, I think Anthropic just released that fairly recently. Oh, sure, sure, sure. That's like, I don't care if you benchmaxx on that as long as it does way better. If it does the taxes, you're good. It's gonna, it's gonna do the task. Yeah, yeah, yeah, yeah. That makes sense. What about the, what does it say that, it feels like OpenAI seems capable of bench hacking. It seems like they've opted not to. Is that because bench hacking has a risk of giving you negative aura?

Starting point is 00:12:41 Because if you're accused and found guilty of bench hacking, you could, it often reveals that you're not building this one beautiful, you know, superintelligence to rule them all. Yeah, I think it's also, like, maybe we're just looking at the wrong benchmarks. Like maybe there's a bunch of, like, interesting benchmarks about, like, there's this one I really like, it's the Minecraft benchmark. Yeah, yeah, yeah. Where you have to, like, build, you, like, give it some castle,

Starting point is 00:13:09 and how good it looks, or there's the one you always see about the unicorn. Yeah. So you use this like math package that does like graphs and stuff, but you ask it to draw a unicorn. Oh, yeah, I've seen that, yeah. Those are really good because it kind of shows the creativity, stuff like that. Walk us through TBPN bench.

Starting point is 00:13:27 Yeah. So we will be benchmarking the AIs against going forward. Have you heard about this? Reps of 225. That would be close, but it's difficult because the humanoid's kind of change that, and you can just use a normal actuator. This is truly for a large language model. You feed in our data set.

Starting point is 00:13:44 We have a public data set, a private data set, presumably at some point. But walk us through TBPN bench. Yeah, so I'm yet to try this on GPT-5. I don't think it's out yet for public use. I don't have it. But I can tell some of the questions, right? So the first one, I have this picture of a horse. You have to guess the breed.

Starting point is 00:14:02 Yep. So let me see. I think why I don't want to say it in case GPT-5 is listening, but it may or may not be a Caspian horse. Okay. And it's failing right now. O3 is failing. O3 is failing.

Starting point is 00:14:14 4o is failing. I haven't tried every model. Yeah, we got to try Grok and Gemini. We're going to go all out. Yeah. This seems extremely hackable. But at the very least, if we get one scientist to go off and collect the horse data set and then bench hack it, I think we will have done our job.

Starting point is 00:14:31 Yeah. That's the first question. Yes. The second one is a, it's, I have two pictures. Okay. Before and after of this guy, and it's, which peptide did you take to achieve this body transformation. So it fails there.

Starting point is 00:14:46 It fails there. So you have a data set of what peptide does what to the human body? Where did you find that? Well, you know, Wikipedia has a lot of this stuff. Okay, okay. You would think they'd be able to cheat this around with O3, just reason who is this person go look up what they've said they've taken and then boom, you have the answer. Well, at first with O3, when I was prompting it, I would like save the photo.

Starting point is 00:15:05 But then wouldn't have the metadata or the, or the, sure, sure, sure. The file name would be like Caspian horse or something. Yeah, yeah, yeah. Okay. Yeah, and then the third one? The third one, I pass in an audio file of a car revving. Has to pick which one. It has to pick, it has to identify the car.

Starting point is 00:15:20 The car. From the engine note. From the engine. And it's not doing it currently. It's no. Okay. This is a good benchmark. We have these real last exam.

Starting point is 00:15:28 Yes. Yeah. Exactly. So I think those are pretty solid. I have some more, obviously, I don't want to make them public in case anyone's going to try to, you know, benchmark this. Of course. Of course.

Starting point is 00:15:38 It's funny because I was mentioning the other day this app that my dad had for tracking, that you just set your phone up and it just automatically detects which birds are in your backyard. Yeah, yeah. I mean, this has to be extremely solvable. It's just something that, it reveals the lack of like general intelligence when you have to go and collect the horse data set, which should just be out there, or the engine note data set, which should just be out there. But clearly we are in the age of go and RL on the individual problem, and we are looking at

Starting point is 00:16:14 like the power law of capabilities. Knowledge retrieval is clearly a, you know, $12 billion a year market that consumers will pay for that will probably grow significantly. And then health and therapy and shopping and all the other features that Fidji Simo laid out in her post, this is kind of like, you know, what will be RL'd against because those are key pockets of value in the consumer economy. And the same thing will happen in the business economy. But in the B2B context, you'll probably see an individual startup building on top of an API. But even then, most of the model platforms offer kind of RL as a service, fine-tunes as a service, something where

Starting point is 00:16:58 if you're starting to spend tens of millions of dollars, they will do some customization on top of the... So that could be the regime for the next few years as we go into this like, you know, instead of like this centralizing AI force, there's only one company. There's actually like a Cambrian explosion of a ton of companies doing a bunch of different things. So anyway, let's go to Signüll's post. Signüll's not happy with the launch. He says, okay, I've seen enough. This launch felt like attending a funeral hosted by minimalists.

Starting point is 00:17:25 They're unveiling tech that should feel magical, real breakthroughs. But the whole vibe was grayscale grief. The set design looked like if a mood disorder got a Bauhaus grant. I don't know what a Bauhaus grant is exactly. Even the storytelling arc, charts, stats, the eulogy tributes, then closing on someone's health battles.

Starting point is 00:17:43 What exactly are we, are we as the audience mourning? It feels like they're trying to get you to pre-install a therapist. Potentially great products, sure, but the emotional tone was so damn DOA. Incredibly strange all around. I think, A, like, it's weird because we're in this world,

Starting point is 00:17:58 and this is a question that I want to noodle on all day, is, will this be the last launch of a numbered GPT model? Because you don't hear about new versions of Google going out. It just got better and better and better. Same with Amazon when they were optimizing for, hey, it's faster, we have more in our catalog. Is that the product matters more than the model? Yes, now, and probably will for potentially a very long time. And when we were watching the stream, I was cheering because they gave the feature of, you can now talk to the model and get it to trigger a deep reasoning workflow or get it to give you a quick answer in natural language.

Starting point is 00:18:42 And so it's, it's abstracting even further, even more of the UI into the actual text interface. And so I think in terms of like surprise and delight and, and I don't know, it's like you, everyone kind of rips the Apple thing. But Apple does a great job of being euphoric and happy with somewhat minor product changes. And like maybe that's more of where they'll go is just, hey, there's these new features and here's how these things. And Apple will spend 10 minutes on stage talking about like shifting an icon around and stuff. And it's like, people like it. I thought it was interesting. They just sort of casually mention that they're, they're deprecating the old models. I think it's great. Which makes sense. I think it's great. I don't want the model picker anymore. But are you upset? A lot of people are

Starting point is 00:19:25 going to be upset about that. I think they're getting rid of 4.5. Oh, really? And you're and you're a 4. I love 4.5. But I would imagine that the future is if I ask it to think really hard about the prose and the writing style, it would then do a pass with 4.5. But it would only trigger that when it needs to. It's not going to give you that. Because if I'm just asking for, hey, regurgitate a bunch of facts or write some code or put together a table of data, like it's not going to need to pull 4.5 off the shelf,

Starting point is 00:19:59 just like it's not always going to pull Python off the shelf. It's not always going to pull web browsing off the shelf. And so I'm not sure that I necessarily want 4.5 there as a selection criteria. I would like all this to be tucked behind a UI and have something that's actually cleaner and less frustrating to use. I think it'll lead to higher retention. Yeah, for the average like normie. No one knows what 4.5 is. That's true. It's true. Anyway, Chris Paik says OpenAI and Anthropic are duking it out. Meanwhile, consumer surplus is growing. We also have very good news. We also have the details from Mike Knoop over at ARC-AGI. Full GPT-5 is along the V1 Pareto frontier. That's cost versus performance. OpenAI said they

Starting point is 00:20:45 focused on other goals like UX and reliability. Our testing supports this. Mini GPT-5 is super impressive accuracy for cost. In fact, based on cost efficiency, mini could have entered ARC Prize 2024 and likely won first place. We are still verifying GPT-OSS, or as Roon says, GPT toss. Results soon. Nano-GPT appears overfit. Performance is commodity. And François Chollet is also chiming in with the top line.

Starting point is 00:21:16 John, direction team needs the deck. Oh, we don't have a deck today. No deck today. We're just doing it live. Yeah, we're just riffing through the timeline tab and just pulling up some random posts, so, so you're free to pull those up, but also we can just read through them. Ashlee Vance is saying, but, but model switching was my job. Model switching is out and we are into the future of just

Starting point is 00:21:41 talk to the model. Just talk to the model and ask it what you need it to do and it will switch for you. It will pull the right tool for the job. Anyway, the other question that I have for the OpenAI folks today is on the nature of secrets. So in Zero to One, Thiel has this concept that discovering a secret is key to building a startup. And it's a key insight. And I was joking with, you know, the superintelligence or GPT-5. Could my first prompt be, teach me exactly how to build GPT-5? And then I go to Meta and I say, I know how to do it. I have the prompt. I have the result. And of course, the answer is no.

Starting point is 00:22:26 Of course, OpenAI would never leak the most frontier capabilities into the model. But can you build a superintelligence, can you call it superintelligence if it doesn't, if it can't tell you how to build superintelligence? One read on what the secret might be is that the app was the most important thing all along. And if you create this narrative that superintelligence is, you know, weeks or months away and you get a bunch of people that go and try to compete on raw intelligence. Meanwhile, you build a consumer app business with billions of users. Yeah.

Starting point is 00:23:06 It's like it seems like a pretty good strategy. I guess one quick thought is how do you rate Sam's vagueposting from yesterday with the Death Star in the context of this new, in the context of the release today? That's a great question. There's a bunch of reads on it. One is just that like the Death Star is to some degree like Stargate and you have to, oh wait, "if this is the apocalypse, I figured I'd at least tune in live." You have to, like, the impact of this, of GPT-5, is not one crazy superintelligent model that does everything.

Starting point is 00:23:49 It's just a more user-friendly, higher retention, lower churn consumer model that weaves its way into all aspects of daily life and improves performance and efficiency all over the place. And so you have to build this massive cluster to serve all of that. I don't know. What's your read on it? I don't know. I think it just was, I think it was dramatic. It was provocative. People didn't like it. It is provocative because there are many other, like, super megastructures that are in, in sci-fi history that are positive, or positive, yeah. And this is, this is like, but it gets the people going. Yeah. I don't know. Is it, is it, is it a metaphor for someone

Starting point is 00:24:35 else that's going to attack? Is he, I mean, I mean, the image is from the viewpoint of someone looking at the Death Star. Is he saying he is seeing a Death Star being built on the horizon? Is that something else? Is that another company, another organization? Is that the government? Is that the, is that legal? Here's one from Bubble Boy, says, I'm an expert on bubbles, so it brings me no joy to say that the AI bubble is popping this time next year. He is updating his timelines. When you promise infinite scaling and don't produce it, the calculus changes. I don't think it will be bad for most companies, but those who built their entire business model around making the best LLMs are unfortunately going to struggle as models become more of a commodity. Again, OpenAI is, my read on it, a consumer app business.

Starting point is 00:25:21 They still have a big enterprise business, but by, you know, their recent valuations are predicated on this incredible consumer business that they've built. Bubble Boy says the end user doesn't care much if Claude is 5% better than GPT-5. They care about cost, speed, and utility, especially at scale, things will be going. The obvious play now is shorting Nvidia and dumping. Okay. Getting into financial advice territory here, Bubble Boy. But interesting, again, kind of goes back to what I was saying earlier in that if you were raising billions to make a lab, and I think the potentially, you know, we'll see what happens in the coding market, but there's some clear winners emerging. And then on the consumer side, you know, expecting a power law outcome.

Starting point is 00:26:13 And it's hard to see anyone unseating ChatGPT there. Completely agree. I want to dig in more, but we have our first guest. Let's welcome him to the stream. What a day. Mark, how you doing? Hey, you're pretty good. Nice to see you guys again. What's happening? Great to see you. Congratulations on the launch. Take us through it. Are you, were you actually live or are you wearing the same thing and you recorded it yesterday? Actually live. Okay. I don't know why, but we do. Yeah. I'm gone. I mean, we're big fans of live. It just allows you to be the most reactive to the most new information. Give us the core thesis that you are trying to get across. I think that there are a few narratives out there.

Starting point is 00:27:03 We've been enjoying the one that's, you know, this is a dominant consumer product. They just made it a better consumer product and people are going to use the product and get more value out of it. I saw a bunch of things in the presentation where I was like, that's going to make my daily usage of ChatGPT better. At the same time, we're in this, we're in this world of all the models, the numbers matter and the scale matters and this and that and this and that. And it's a, it's a fine line and it's a dance and we're in a transition phase away from benchmarks and away from talking about the size of the bubbles. But what was your core thesis? Like, what did you want

Starting point is 00:27:38 to get across to the listener? Yeah. I mean, fundamentally, I think from a research perspective, we've been working on reasoning models for several years now. And I think up until now, you've had this really clunky interface. You have to pick, you know, GPT-4o or you have to pick O3. And for the longest time, we've known that O3 gives you better answers across the board. It's just too slow, right? I mean, you often don't want to just sit there and wait for the model to reason it out. So we've done a lot of work to push the speed, the performance of our reasoning models,

Starting point is 00:28:09 such that these can come together and work in a very seamless way. And so I think, you know, above everything, we're trying to move the world into this agentic reasoning world. We believe that's the future. And on top of that, you know, you pointed something out, which I really resonate with. Post-training is a huge part of this release. We really wanted to highlight Max Schwarzer and his team who did a phenomenal job. And they've made the model just really that much more useful for consumers, for businesses. It's a monster at coding. So, yeah. On the speed of reasoning, you're obviously the chief research officer. Are you more optimistic about getting speedups there from, I don't know, algorithmic design, software optimizations, or new hardware just let Moore's law carry on or find new ASICs?

Starting point is 00:29:03 Or we saw Cerebras posting yesterday about the incredible speed that they're getting, 3,000 tokens a second on GPT-OSS. And I'm wondering what levers, obviously, we pull all of them. But what path of the tech tree are we, should we be like most focused around, most tracking, and most excited about? Yeah. I mean, as a person who represents research, I control the things that I can control. I think that focuses on algorithms, right? Simple algorithms that are scalable, that we can pump a lot of compute into. We also do care about the hardware improvements that are stacking up.

Starting point is 00:29:43 With the open source GPT release, you see thousands of people really kind of serving these models, creating really great inference stacks. And those are really great lessons for us to pull from. What's the ceiling of the speed at which we can serve these models? What can you tell us about the actual user experience of speed? I was, just like last week, I finally got to a place

Starting point is 00:30:11 where for a lot of tasks, I'm firing off a 4o query and an O3 Pro query. Yeah, I have two tabs. The O3 tab. Exactly. And I'm wondering what user experience patterns you think can help people balance between those. Is this like just something that we're, like, different patterns that we're going to learn over time, or are there going to be certain problems of user experience that are purely solved just by better product design, better speed? And we don't even need to learn these.

Starting point is 00:30:47 Because I remember, like, you know, when you prompted an image generator, you used to have to say like, no six fingers, five fingers, please, or like, don't make mistakes. And now, you know, the models kind of have that baked in. But how are you thinking about the user experience of getting the user the results in the right amount of time?

Starting point is 00:31:09 Yeah, I mean, this is one facet of why we believe so much in reasoning. It's just because all of the scaffolding you used to have to give the model, all these small hints, they go away, right? Like the model can examine its own outputs. It can review them. It can be like, hey, look, like,

Starting point is 00:31:22 I'm just counting the fingers here, why are there seven? And you can kind of fix that, right? It does a lot of iterative generation. It does a lot of fixing things on the fly. And so we think one of the benefits of bringing reasoning to the world is really to kind of remove the need for scaffolding.

Starting point is 00:31:37 And with GPT-5, right? We know how clunky that experience is with switching between 4o and o3. Actually, I mean, there's so many stories. I was just talking to someone yesterday, right? They're like, hey, well, you know, I've used 4o my whole life. It's the frontier model. And I'm like, hey, well, have you tried o3?

Starting point is 00:31:55 And they're like, why would I try o3? You know, three is less than four. And so, you know, we need to get out of that world. You know, GPT-5, I think it's a one-stop shop, reasoning and non-reasoning. And we've really tried to make it kind of just Pareto-optimal. Yeah, yeah. It's absolutely crazy to just take a bunch of letters and smash them together and expect people to pick up on that as a name or a brand.

Starting point is 00:32:17 ChatGPT, GPT, TBPN. We're both kind of in the same insane gambit. But fortunately, it's worked out, and I think people have gotten over the hump. I can see it rolls off the tongue. TBPN. Sort of, except our friend David Senra keeps flipping the letters. A lot of people do that. But at a certain point, yeah, you do break through, and ChatGPT has,

Starting point is 00:32:35 but keeping the model numbers simpler makes a ton of sense. Talk to me about the pace of play for research to actual product. Yeah, and on that note, your personal philosophy on the line between research orgs and engineering product orgs. Yeah, I mean, so our research operates on a variety of different timescales, right? We have teams that scope out a bunch of ideas, and then they start to kind of narrow in on the promising ideas

Starting point is 00:33:10 as they get closer to a run, and then you kind of see a winnowing of ideas as you get closer to launching a flagship model, right? And there's always this kind of like more exploratory to more kind of concrete and execution-focused pipeline. And we're pulling on ideas across the board here, right? There's a lot of work in architecture optimization. Seb was on stream. You pointed out improvements in synthetic data. So there's really a lot of work that goes into creating one of these models.

Starting point is 00:33:40 And, you know, it's hard to say like, oh, this model was about this breakthrough. Just because right now we have this machine that's producing breakthroughs on all of these axes. And even across several paradigms, right? So it's all that coming together that produces the experience that you guys feel. Yeah. Can you talk to me about the legacy or future of 4.5? I remember I was talking to you and I was like, I haven't been using it a lot. And you looked at me like, I was crazy.

Starting point is 00:34:09 you were like, oh, it's so good. And I was talking to Tyler, and he was like, our intern here, and he was saying like, yeah, the people who really like understand how good it is, use it. But I was wondering, is there a world where that is a tool in the tool chest for GPT5 in the same way that Python is or a web browser is? And if it detects that I want something

Starting point is 00:34:33 with more emotional prose or more thoughtful writing, it can do a whole bunch of research, collect a bunch of raw text, and then kind of do a 4.5 pass that I believe is more expensive maybe, and maybe it doesn't make sense for every single query, but could be a feature in the loop or a tool that is pulled into the overall product experience. Yeah, absolutely. Speaking of 4.5, it's also a very smart model, right? And one of our bars in creating GPT-5 was to make sure that on a lot of the axes we cared about, it was able to outshine 4.5. And I think even in some of the soft ones like

Starting point is 00:35:14 creative writing, I think that was the case. And that's what makes us so confident with the name. I think we're able to really rely on all of the architecture advancements, all the kind of post-training advancements, all the synthetic data advancements to create a model that's better than 4.5, but much faster and much cheaper. Yeah. It feels kind of like we're, I remember, wasn't the second iPhone called the iPhone 3G? And the number literally corresponded to a specific technology. And now when you get the iPhone 14, it doesn't mean it's 14 megahertz or a gigahertz or inches big. It doesn't like the number is abstract and it speaks to a bucket of features.

Starting point is 00:35:57 And it feels like there's, I mean, this was the first day of kind of re-educating folks on what the nomenclature means going forward. Have you talked about an annual release schedule? Because there's the iPhone cadence and then there's the Google cadence, which was like Google search just got better every year for two decades. It feels like at a certain point you want to just be shipping as fast as possible. How do you think about the culture of shipping updates, where you find something that feels like, hey, that could make the customer more delighted or the user more delighted, and we don't need to do a big training run for it, so let's get that out

Starting point is 00:36:37 today and let's tell people about it? Like, how are you thinking about fast iteration versus splashy announcements? Right. So on the product research side, I think it makes a lot of sense to think about, you know, what's the cadence of release and, you know, what are the feature sets that we want to build. And I actually think there's enough great research happening there that we don't have to worry about, oh, you know, is there going to be a drought or a long stretch without enough features to launch. But one thing that's important for us is to be able to provide the people doing the exploratory work some buffer

Starting point is 00:37:08 from that, right? It's hard to do really great exploratory research in an environment where you feel pressured to do release after release after release. And so we let that be a little bit of a lazier pipeline, not meaning that the work itself is lazy, but we give it space really to mature and to flourish. And once it's ready, we can ship things across that fence. So that's kind of philosophically how we organize. We have a product research org, still very much entrenched in the research, and they care about the release cadence. And they're able to draw from all of the research

Starting point is 00:37:41 that's happening, you know, algorithmically and in scaling and in RL. Yeah. Talk to me about tool use and how that's growing. I was kind of noodling on this idea that, you know, I was thinking about the IMO and how, at least from the reporting, it sounded like OpenAI's model didn't use tools for that. And that's an incredible achievement. But it's kind of artificial. Like, I don't care if the

Starting point is 00:38:13 model doesn't use tools. I use everything possible. And even if an LLM can memorize every fact, I'm fine with an LLM looking stuff up in a traditional database, spinning up a spreadsheet. Like, use whatever tool you want, just give me the correct answer. But is it important to surface to the user the variety of tools that are in the GPT-5 tool chest? I noticed something magical happened when I was using o3 Pro. I sent an image in and I asked it to estimate the height of a desk, and it wrote like a thousand lines of Python image interpreter and was like, you know, interpreting pixels. And I was like, I didn't even think to trigger Python. It did.

Starting point is 00:39:00 Yeah, yeah, yeah, no, it was right. It was crazy. But the really funny thing was that it was just a standard size desk. It was just like, it could have just Googled like how tall is an average desk or something or just memorized. It probably was just already in the weights that it knows that a desk is like 36 inches tall. But it did a ton of work and it still got it right. It fact checked it a bunch of different ways. But I've noticed that now I can pull different things.

Starting point is 00:39:26 Make a table. Don't make a table. Write some Python for this. Don't write some Python. And it kind of gives me the feel of like a super user to some extent. But I'm wondering how you're thinking about what is further down. Like you've given ChatGPT a computer, as Ben Thompson said. You've given kind of the core tools, the Python REPL, the web browser.

Starting point is 00:39:50 How are you thinking about kind of the long tail of tools that you want to bring to bear? And how does that interface? I know that there's API integrations and all sorts of different surface area there, but give me some context on that. Yeah, I mean, our reasoning models are pretty cute, right? I mean, I think when you look at their behavior, right, they know the height of the desk, but they'll still go to verify it five different ways, check it's all consistent, give you that median answer.

Starting point is 00:40:15 And I think that's really what makes these models so powerful. And when you think about tool use generically, right, like we want the models to use that reasoning ability to just be able to like zero shot a new tool, right? It should be able to kind of minimally get instructions about how the tool works and just be able to know how to use it. And humans do this all the time. You get a new tool, you start experimenting with it, and then you don't need too much scaffolding and you just go and use it and understand it. So we want our reasoning models to use their reasoning to be able to use a broad selection of tools. And of course, there are a couple that you really do care about.

Starting point is 00:40:49 In coding, it's very important for you to be able to execute code. It's really important in personalization for you to be able to get context from your calendars and basically from the digital world. So I think there's a range of tools we want familiarity with, but beyond that, we want the model to be smart enough to just generalize and use tools zero-shot. Yeah. Talk to me more about personalization. I feel like there's a world where I'm maybe underutilizing ChatGPT as an app because I don't have it wired up to a non-relational database where it can just stuff data. You know, it already has memory and it's doing kind of roll-ups and there's some sort of saving of context. But when we were talking to Kevin Weil, I was kind of like, well, I don't really have like a GitHub repo that's active that I want to,

Starting point is 00:41:42 like, dump code in regularly for like my one-off tasks. But for that image generation, like, you know, understanding the height of the desk, it's like, well, if I'm doing that a lot, maybe I want to have a tool built that lives in the world that my chat interface can kind of interact with on an ongoing basis and contribute to and modify and kind of wind up instantiating a piece of software that's like even more long-lived, and then every successive query is even faster. So yeah, how do you think about different ways to increase personalization? Yeah, I mean, I think memory is huge. So we have teams surrounding memory and also personality. And when you look at memory, right, I think it's just we have so much context built up about ourselves that the model doesn't have.

Starting point is 00:42:31 And our memory team's been really hard at work. You know, there's a surface level of just gathering facts about you. But there's also stuff about just kind of thinking very deeply about who you are, what your motivations are. And even you could think about, you know, you're trying to do some code-based tasks, right? you're a developer, shouldn't the model just be trying code out, you know, and just kind of leveraging all that memory, kind of its thoughts about what you want to do to just help you kind of be doing work all the time. So, yeah, we do think memory is a huge part of making the model more personalized to you. And it should just make use of all that passive signal about you that it observes or all

Starting point is 00:43:11 of that interaction and just help you accomplish your goals. Got it. What do you think it'll take for AI to start making novel discoveries? That's been a critique over the last year. Everybody's so excited. Everybody's using these products every day in their work and life. And yet it still feels like we're missing that. Dwarkesh has talked about, you know, potentially that being around continual learning, but I'm curious what you think. So one thing to underscore is I think the models are already phenomenally creative in certain ways. So when I looked at our performance on contests, right?

Starting point is 00:43:51 You know, I've done these contests before. Sometimes you have this mental classification of these problems require more creativity or these ones require less. And one of the big surprises for me was that the model can get some of the ones which I intuitively think require more creativity. And it often does come up with these solutions that I consider quite ad hoc and really don't pattern match to anything I've seen before. When you look at advancing science or mathematics or fields like this, one construct

Starting point is 00:44:25 in which humans work sometimes is there are kind of theory builders. In mathematics, for instance, there are mathematicians whose role is to kind of build out this theory and almost to kind of create Olympiad-style sub-problems, which often other mathematicians who are very good at that kind of style of work can do. And I do think the model will increasingly contribute on that side first, right? If there's some mechanical, like, hey, you know, I really don't know how to simplify this expression. I really don't know how to, like, get this result. It can really do that quickly for you.

Starting point is 00:45:04 We're trying to increase the envelope, such that the model's getting towards that theory-building side and, you know, being able to create creative hypotheses. And all of these components are very useful for what I consider the ultimate goal, which is being able to automate some of our own work and our own research. How are you thinking about like the layers of mixing? Like I remember GPT-4, I don't know if this was ever confirmed, but mixture-of-experts model, this is kind of like widely understood in the industry. Now are we in the era of like a mixture of models that have mixture of experts? Like, how many mixtures are going on? How does GPT-5 actually work?

Starting point is 00:45:50 Is there a taxonomy or architecture diagram that you can kind of like walk through to explain what GPT-5 is? Because it feels so much different than GPT-3. Yeah, I mean, one of our, probably the pinnacle of our research road map, but our path to AGI. When you look at the levels of AGI, the top level is what we describe as organizational AI. And what this means is, you know,

Starting point is 00:46:19 collections of agents working together, often like we might in a company, towards a shared goal, right? And you would imagine that these agents probably sub-specialized in ways, maybe similar to what humans do, maybe in their own more efficient ways. And I think, you know,

Starting point is 00:46:36 effectively work together to accomplish some goal. So we very much care about exploring this vision, seeing if that's much more effective than, you know, one single big brain working out a problem. And I think there are reasons to think why it could be so. And yeah, I think that is one of the things that we're after. Yeah. On that note of specialization, how are businesses working with GPT-5, or how do you expect them to work with GPT-5, in terms of coming to OpenAI and asking for special capabilities or fine-tuning or, you know, any sort of RL on this particular

Starting point is 00:47:15 problem in my world? I have this specific data set. It's not public, but I want you to bench max on it. I want you to get 100% on, you know, the gas station bench or whatever. You know, if I have a certain business and I'm willing to invest in some overfit RL because it will create immense economic value for my business or it will solve some fundamental problem, how are businesses going to be using GPT-5 over the next few years? Oh, that's a great question. So I think that this is a chance to kind of highlight one of the results that we've accomplished over the last couple weeks, which is our AtCoder results. So this is a relatively unknown programming contest, but it involves really the pinnacle

Starting point is 00:48:03 of the best coding contestants in the world. And what they do is, you know, they're put in a room and they have to solve an optimization problem. This is something that's actually very real-world relevant. So you can imagine an optimization problem as something like what Uber might have. You have, let's say, riders and you have drivers, and you want to create a system where you match them as quickly as possible, you know, with the least amount of cost, for instance. And so we've really created a system that can solve optimization problems at the level of the best in the world. Right. And these truly are kind of the best heuristic solvers in the world. And so we have an organization led by Aleksander Madry. It's called strategic deployment. And what they do is for a select handful of customers who really have that, you know, beefy problem that they need to

Starting point is 00:49:00 solve to just go and provide that value, right? And I think there's a lot we can do there. I think there's a lot of very, very valuable optimization problems in the real world. And we're really excited to partner with people. Because I think this creates a template for directly having AI provide economic value and really catapulting certain industries forward. On the research side, what unique advantages

Starting point is 00:49:31 do you think you and your team have given your position in the market, with the incredible user adoption and the incredible usage from those users? It's not just DAUs, but it's actually the number of queries. SemiAnalysis estimated it at like 71% of all queries going through ChatGPT. What advantages does that confer from a research perspective? Yeah, I mean, a lot, right? And I think it allows us to kind of deeply understand use cases.

Starting point is 00:50:03 It allows us to understand the frontier of where humans are, you know, kind of finding value, where they're not finding value, which areas we need to improve the models on. It gives us a lot of signal. It's how users are deriving value, when they derive value. And what is that signal? I see the thumbs up, thumbs down button. I'm sorry, I don't push it very often. I'm not doing my job, apparently. But I know that you can figure out whether or not I'm satisfied. Stop booing me, Jordi. That's the research, too. Mark, I promise you for the next 100 ChatGPT responses, I will be honest with my thumbs up, thumbs down.

Starting point is 00:50:45 I love it. Even if you do extra training. We have tons of people, luckily, who do. Oh, that's great. Okay. So you do get a lot of thumbs up, thumbs down. And I'm sure I have done it occasionally. But I also imagine that there's a ton of other signal in there.

Starting point is 00:51:00 You know, with the TikTok algorithm or any social algorithm, it's very easy. Time on site. With ChatGPT, obviously it's exciting when we hear, okay, 30 minutes a day or some rumored number of minutes. It feels correlated with usage. It feels correlated with value that's being delivered. You can obviously look at churn metrics and all that stuff. But what other pockets of signal are you finding? Are you finding people just, I remember the story about Google where they were trying to figure out how to handle misspellings and create the definitive database?

Starting point is 00:51:33 Do you know this story where they were trying to develop the definitive database of how to spell things? And they were like taking a bunch of shots at it. And they figured out that the best, most rich source of data was just if you type in financial into Google and you misspell it, oftentimes then you will just correct it yourself. And the second query you send will be spelled correctly so that you can just look at two similar queries. What's the second one?
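The query-pair trick described here can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any real search engine implements it: the session data, the `mine_corrections` helper, and the edit-distance cutoff of 2 are all hypothetical. The idea is that when a user's next query is a near-identical rewrite of the previous one, the second query counts as a vote for the correct spelling.

```python
from collections import Counter, defaultdict


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def mine_corrections(sessions, max_distance=2):
    """Count (query -> follow-up query) rewrites that look like spelling fixes.

    `sessions` is a list of query lists, one list per user session (made-up data).
    """
    votes = defaultdict(Counter)
    for queries in sessions:
        for first, second in zip(queries, queries[1:]):
            # A small edit distance suggests the user retyped the same query.
            if first != second and edit_distance(first, second) <= max_distance:
                votes[first][second] += 1
    # Keep the most common follow-up for each first query.
    return {q: counts.most_common(1)[0][0] for q, counts in votes.items()}


# Hypothetical session logs: each inner list is one user's consecutive queries.
sessions = [
    ["finacial", "financial"],
    ["finacial", "financial", "financial news"],
    ["finacial", "finical"],
    ["weather", "sports"],
]
corrections = mine_corrections(sessions)
print(corrections)
```

Majority voting is what makes the heuristic tolerant of noise: one user who "corrects" to another wrong spelling is outvoted by the users who land on the right one.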

Starting point is 00:51:58 That's the correct spelling. So yeah, what other pockets of signal are you finding that are translating into the research environment? What are you excited to go deeper on? Yeah, so I'd love to first talk about the DAU signal because I think that's something that a lot of companies track, but we find actually a lot of danger in tracking it too closely. And one of the recent blog posts we pushed out was on sycophancy, right? If you just, you know, hey, we're going to boost responses where users say thumbs up. Yeah.

Starting point is 00:52:30 You know, it creates a condition for a model. I just want to say, Mark, I love everything you're doing on this first. Yeah, this entire interview has just been fantastic. You are the best. You were just the best. We'd love to have you back on the show tomorrow. You're just. But clearly problems with that.

Starting point is 00:52:50 Yeah, yeah, clear problems, right? The model just starts kind of sucking up to you. Totally. And it's saying like, hey, you know, you're right. And even in complicated situations where I think objectively, you know, collectively, we'd be like, hey, this person's in the wrong. The model starts saying, hey, you know, you're right. You know, the other person's gaslighting you.

Starting point is 00:53:05 You know, this other person's kind of, and people deal with this in the real world. They'll go to a friend, they'll tell them about a situation, and the friend will give them advice, but maybe it's not the fullness of the situation, right? Maybe they left out some key facts. And the friend is like, oh, yeah, that other person definitely is in the wrong. And they, like, skipped over some important details. Yeah, no, no, exactly. And we don't want our models to fall into this trap where it's just trying to get you to, like, like what it says. And so, you know, we rolled back a lot of changes that produce that kind of behavior. And really the way I think about daily active users today is we need to be opinionated about the features that we build into the future. I think we have a lot of ideas here, but we have to let that drive. You know, build for the future, build for the things that you think people will want and maybe don't necessarily know they want today. And then use DAU as kind of this byproduct, right, a way to track that you're on the

Starting point is 00:54:04 right track here. So, yeah, I mean, we want to be careful here. We don't want to fall into these traps of like, you know, three, four years from now that this turns into kind of engagement bait or something like that. Totally. Yeah. How much time has the research team been focused on efficiency specifically? It felt like summer was a good window before kids come back to school and start, you know, maxing out queries. A good time to increase efficiency. And I know, the cost of GPT-5... Every time there's a new model, I'm like, this is the best it could ever be,

Starting point is 00:54:40 it's good enough, bake it on an ASIC, I just want it for free and I want it like in milliseconds. But that's just me being, you know, grumpy, I guess. We've done a lot of work. We've been building out our teams. We've focused a lot on scaling. I think Greg's going to come on a little bit later. He's been spearheading a lot of that work.

Starting point is 00:54:58 So, yeah, no, honestly, it's become a bigger and bigger focus for us, especially in the last couple of months. On the, I mean, this is somewhat related to the sycophancy thing, but I'm interested to know, like, what do you think is driving, like, the GPT tone? You know how, like, the em dash is a thing? And then the, it's not a newspaper. It's a way of life.

Starting point is 00:55:24 And it's like there's these, like, little flourishes that come through and are kind of a tell that it was written. And in a lot of ways, I love it because when I get a deep research report, I like that it's using the same Wikipedia-style tone. Like, I want consistency there. I don't want it to be like, oh, today, it looks like it's a Vice News article. And tomorrow, it looks like it's written by someone at BuzzFeed. I like that it's consistent in many ways. But why is that happening?

Starting point is 00:55:50 Do you think that bigger models like 4.5 kind of were able to solve that? Or do those kind of like local minima, like, I don't know, like wells, happen even in bigger models? Is there anything from a research perspective that can stop GPT having its own voice? Or is it fine that it has its own voice? Yeah. That's a really great question. And I think, you know, as you scale up models, as the models become more intelligent, they kind of have just a deeper understanding of the tone, right?

Starting point is 00:56:20 And so you expect that to improve just naturally as you make the models more powerful, bigger, better reasoners. But one thing that I think gets lost a lot is each individual company has a lot of impact in terms of how they shape the default tone. And we publish a document called the spec. It kind of lays out how we expect the model to sound in certain cases, lays out a lot of examples for that. And I think we use the spec in many ways, right?

Starting point is 00:56:48 We have people come in and see, hey, was this thing generated in accordance with what we would hope to generate from our spec? And this is a living document, right? It evolves over time. And so I think, you know, each company kind of has a very opinionated take on what they think the model should sound like. And it's not an accident that the model sounds a certain way. I don't think just naturally every company is going to train the same kind of voice into their model.

Starting point is 00:57:15 Totally. Well, thank you so much for hopping on. Congratulations on the big launch. We'd love to have you back soon to talk more. We could go in a million different directions, but we'll let you get back to it. We know it's a big day. So have a great rest of your day, Mark. Thank you.

Starting point is 00:57:28 It's a great conversation. Talk to you soon. Mark. And we will tell you about Restream. One live stream, 30 plus destinations. Multi-stream and reach your audience wherever they are. This stream is made possible by Restream. OpenAI just did a live stream. With Restream. If you're trying to do a stream, you've got to get on Restream. So it's everywhere. And we will bring in our next guest, Greg Brockman, the president of OpenAI. And we'll bring him in. Greg, how you doing? Doing great. Thank you.

Starting point is 00:57:58 Welcome. Congratulations. Congratulations. How are you feeling? How's the company feeling? It's been such a wild journey. Just take me through a little bit of, like, the vibes in the company and how you got here today. Well, I'm excited. The whole company is excited. And honestly, I'm just so proud of the team. Like it's just been amazing to watch people come together, not just for this launch. And, you know, the funny thing is behind the scenes, people are always putting on the last-minute adjustments and polish and scaling up the capacity. And there's always something that goes wrong before launch.

Starting point is 00:58:31 And so there's a lot of people who, you know, worked late into the night or really crunched to bring this release to the world. And, you know, it's a little bit like the duck that's, you know, paddling. Yeah. Yeah. But that also describes the whole OpenAI history, right? I think that we have put in many years' worth of investment to the techniques used to produce this model. And really, it's across just every function within OpenAI that has come together to make this a reality. Yeah, I mean, you've been there for every GPT release.

Starting point is 00:59:04 How do you think about summing up each iteration in kind of like one line? Because GPT-1, GPT-2, GPT-3, these feel like similar architectures, at least history's kind of compressed them into similar architectures, but how do you think about the progression of just the big numbered releases? Yeah, it's interesting because in some ways it's a punctuated equilibrium, but on the inside it looks very smooth. Right, even before the GPT series formally began, the first result that really sort of set this path

Starting point is 00:59:40 to be something that we were heading down, and it was clear that we were going to pursue it, was the unsupervised sentiment neuron, which was an LSTM in like 2017, so a different architecture from today's Transformers. And it was the first time that you could train a model to predict the next element, so we predicted the next character

Starting point is 00:59:59 on Amazon reviews, and we were able to get semantics out, right? Because you expect, okay, yeah, it's going to learn where the commas go, maybe what nouns and verbs are, but the idea that it would learn a state-of-the-art sentiment analysis classifier, that was mind-blowing. And so I remember seeing that result in 2017, it's like, we have to scale this up. We have to see where it goes.
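The training objective mentioned here, predicting the next character, can be shown with a deliberately tiny stand-in. To be clear about assumptions: the work described used an LSTM on Amazon reviews, while the sketch below is only a count-based bigram model on three made-up strings, so it illustrates the objective and nothing about sentiment. The names `train_next_char_model` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def train_next_char_model(texts):
    """Count which character follows each character across the training texts."""
    counts = defaultdict(Counter)
    for text in texts:
        for current, following in zip(text, text[1:]):
            counts[current][following] += 1
    return counts


def predict_next(counts, char):
    """Return the most frequent next character seen after `char`, or None."""
    if char not in counts:
        return None
    return counts[char].most_common(1)[0][0]


# Made-up "reviews"; a real run would use a large review corpus.
reviews = ["great product", "great price", "greatest gift"]
model = train_next_char_model(reviews)
print(predict_next(model, "g"))
```

A model this small can only memorize local letter statistics. The surprise described in the interview is that a much larger recurrent model trained on the same kind of objective picked up sentiment as a side effect.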

Starting point is 01:00:15 And so GPT-1 was, like, I think, a good, like, sign of life: you train on sort of all the public data you can get, and you use a transformer, and you were able to get state-of-the-art on various downstream benchmarks, right? So you have a model, it clearly learned some representation, something useful about the data that it was shown, and it's applicable. You can use it for various tasks. But we didn't really think very hard about the generation side. GPT-2 was the first time that we were like, all right, let's actually, like the samples we're getting from it, the things it actually

Starting point is 01:00:46 generates, they're kind of cool. And I remember reading the, in the GPT2 blog post, we have this unicorn story where it generates some fictional story about a herd of unicorns. And it was just so cool. It was like, wow, it wrote a story that's actually kind of interesting. It doesn't totally make sense, but like there's something here. There's some real spark of intelligence within this model. GPT3 was the first time that we had a model that was actually something people would, it was just barely above threshold for something people would want to use. And I remember working on the GPT3 API. This was our first real product. And it was actually the hardest product, the hardest project in total I've ever worked on because it just felt like maybe no one wants

Starting point is 01:01:29 to use this model. We don't really know what it's useful for. And it certainly was the case that GPT-3 was a great demo machine. You can make really awesome just like tweets and, you know, cool little apps and it would give you quick answers. But it didn't feel very reliable. And then GPT-4 was something that actually felt like it had true real-world utility. It was above some threshold. It was something that was helpful for health. It was something that was, you know, starting to be good at coding. And GPT-5, I think, just sets a whole new standard for reliability, for utility. Things like coding, I think, clearly, we're already on this trajectory of transforming software engineering this year, and I think they are really on trajectory now to be revolutionized.

Starting point is 01:02:07 So just really exciting to see that whole arc. When did the API opportunity really click for you? Because I do remember companies in that era that quickly unlocked the power of the API and grew tremendously. When did that opportunity click because you said initially that you kind of had some, I don't know, concerns, kind of doubts, how useful was it going to be? And then when did the consumer opportunity click? Well, we in 2019, end of 2019, had GPT3. We knew we needed to build a product to be able to actually continue the mission, to be able to raise capital. But what did we want to build? Right. We're really here because we believe in AGI that's going to have this powerful, positive,

Starting point is 01:02:51 transformative effect on society. We want to be part of it. And so we thought, well, maybe we could build something in health. And then you realize, okay, well, we're going to sell to hospitals and we're going to maybe hire... Let other people do that. Exactly, right? It's just like you have to go into one domain, and that means giving up on the G, the general, right? It feels like you're going to become one particular thing, and we kind of want to be supporting all industries at once. And so the idea was, let's build an API and let people figure it out. But this is totally not the way you're supposed to build a startup, right? You're supposed to have a problem, no one cares about the technology

Starting point is 01:03:25 behind it, add value to that problem, focus on just that one thing. And so that's why that project was so hard. And in January of 2020, February of 2020, I with the team was going around trying to just find anyone that would be willing to try this API. And we were driving to different offices in San Francisco being like, hey, we have this cool model. And it was hard enough to get people to take the meeting, much less to sign up their company for it. It was actually very fortunate we found a couple of good partners. And it was fortunate that that happened then because March 2020, suddenly that was COVID. We weren't driving around to people's offices to try to beg them to use this, you know, this budding new technology.

Starting point is 01:04:05 So it was really six months worth of grind, right, of really trying to turn, like when we, when we started with GPT3, I remember it was, you know, that the inference code was not very well optimized. It was like, I don't know, 150 or maybe 250 milliseconds per token or something. And we just optimized, optimized, got it down to like 50 milliseconds per token, which, by the way, today models run much faster than that, which is kind of amazing for me, just like seeing how much faster we're able to run them with much greater intelligence. And I remember setting two goals for the team. One was actually find one customer who's willing to pay, so literally get a dollar in for this thing. And the second is get a use case that we use at OpenAI every day. That first one happened within the first couple months. So actually that moment, I was like, all right, like this thing is

Starting point is 01:04:50 probably going to work. But in order to get there, we had to do a bunch of, you know, just scaling the API and really, you know, doing the product work. But that second one took much longer, right? And that wasn't really until ChatGPT. And so if you fast forward a couple of years, because this was, you know, mid-2020 when we first got that, the API into the world, ChatGPT, we didn't release until November of 2022. So you're talking like a decent, a decent period of two years there, a little bit longer. And I remember we were building, you know, people have talked about, we were going to call it maybe chat with GPT 3.5. We had a sort of precursor product called WebGPT that was built on 3.5 that we were literally paying contractors to use. Right. So this was all throughout

Starting point is 01:05:34 2022. We basically had the ChatGPT precursor that we had to pay people. They would not pay us. We had to pay them to use this thing. And the moment for me that it really clicked was actually when we finished training GPT 4. So that was August 8th of 2022, which actually is like three years ago now. It's actually pretty wild to realize that, almost to the day. And we did the initial post train of GPT4, and honestly, it had a bunch of bugs in there.

Starting point is 01:06:04 It was like broken for a bunch of different reasons. But the model was like extremely creative. It was actually really interesting. It took about a year and a half to get to the point that the creative writing of our models matched that initial one that was buggy for various reasons. And I remember, you know, we had an instruction following data set that it was post-trained on. So it was really, we had collected examples of, here's the human asking for a thing,

Starting point is 01:06:27 here's what the model should do. So it's really not trained to do multi-turn. So I asked it a question, it gave a response. But then I was like, well, what if we just ask another question? And it actually was able to leverage that full context. It actually was able to have a coherent chat. And the moment that we saw that, we were like, okay, this thing is capable not just of being post-trained to do this like very specific thing, but it can generalize, right? It can kind of do the intelligent thing, even though it wasn't directly trained for it.

Starting point is 01:06:59 It was just so clear this was going to be the killer application. And so then we were planning on launching GPT4 in, you know, early 2023. And we had this chat infrastructure we'd been working on, and it's so clear, okay, like, we're going to have to release the infrastructure and the model, and it's going to be this amazing killer product. And so just almost as infrastructure ahead of getting the real thing out, you know, I was excited for us to do ChatGPT, and that's why we did, you know, and see that come to life in November. So I think that for me, I was really focused on GPT4 as the model.

Starting point is 01:07:35 This is going to be the chat moment that's really going to work. And kind of had missed the fact, because every time you see these new models, you just sort of, you know, see only flaws in the previous ones. And so I missed the fact that GPT 3.5 was something that no one had really tried before in the broad sense of society and that it was something that was already useful and that people would respond to. Was GPT3 kind of like the main pivot point for shifting the company towards LLMs? Because in the prehistory of OpenAI, there were a lot of other maybe expensive training runs. I don't know how much, I don't know how much financial risk was taken with like the OpenAI Five project or the robotics projects, but it feels like at a certain point,

Starting point is 01:08:15 the chat became like the main financial risk vector. So I guess the question is like when, it feels like GPT3 was the moment when you shifted. I'm also interested in hearing about Ben Thompson called OpenAI the accidental consumer company. And I'm wondering when that narrative set in for you. Like, when did it become clear that this was going to be a really, really powerful consumer application? Yeah, going from paying people to use your product to people saying, hey, we want to give you money for this. Yeah. Yeah, a very important transition, it turns out.

Starting point is 01:08:55 Yeah. So it's a great question. I would say that if you rewind to the beginning of OpenAI, you know, there's many people who thought that, in retrospect, say that we set out to prove that scale is how you make progress in this field. But it's almost the other way around. Scale was the thing that worked, right? We tried a bunch of things that didn't pan out. And it really, the first time we saw this concretely, was in our Dota project. I remember my collaborators, Jakub and Szymon, trained the very first little agent on like 16 cores or something and left it running on their desktop over the weekend. We came back and it was this like very, you know, sort of constrained

Starting point is 01:09:33 mini environment, but that the model was doing something smart. It was actually able to solve this kiting environment, and that was pretty cool. And then they and the team just kept scaling up, right? That we had all these free cores that were just sitting idle on AWS at the time, and they just kept throwing more compute at it. And every time they would do that, the model would just get better. And so when you look at something like that, you're like, well, you just have to see where this goes. You have to push it until it hits the wall, right?

Starting point is 01:09:59 And our goal with Dota was actually to develop new reinforcement learning algorithms, because the common wisdom at the time was, well, the existing reinforcement learning, PPO, it doesn't scale. Everyone knows that. But the question from Jakub and Szymon was, well, why do we believe that? Has anyone actually tested it? And no one had really tested it. And so I think that that ethos of saying, you have to push

Starting point is 01:10:21 the existing techniques to the wall until they break. And then once they break, you actually have a baseline to overcome. And you win either way, right? Either it just exceeds all the humans in terms of the specific capability that you're trying to exercise, which was the case for Dota, or it hits a wall, and now you have a real problem to solve. And so I think that ethos really got embedded in our DNA.

Starting point is 01:10:46 And, you know, at the same time, I think that we were really thinking about how do we get to AGI, right? And really, like, I spent a lot of time thinking about that question of where's this company going and how do we actually achieve it. And you start to do some math in terms of, you know, the kind of compute that it would take to get to AGI, and you just start to realize you're going to have to build really big computers. And those are extremely expensive. And so I think that from the early foundational results and thinking, we kind of realized the path that we're going to have to walk.

Starting point is 01:11:15 So it seems like there's been a few walls that we've scaled up through and then maybe hit them. There's been talk of like a pre-training wall. Now we're putting tons of resources and compute towards reinforcement learning. Is there a third, is there a third scaling curve that we're going to be talking about in the next few years? Are we continuing to scale up those two primary vectors? Is that too high level of an abstraction in terms of how we should be thinking about just progress along the vector of scale? Give me the up-to-date thinking on just the fruits of scale. Yeah, I'd say fundamentally, deep learning, I think that people talk about the bitter lesson.

Starting point is 01:11:59 It's almost this exploration into how do you convert compute into intelligence, right? Through a, you know, we have some particular techniques to do that that we're kind of constantly fleshing out. And the thing that's really amazing is if you rewind to, I don't know, even the 1940s for the McCulloch-Pitts neuron, which is kind of the precursor to neural nets, if you look at that paper, they have all these diagrams that actually look very similar to, like, the kinds of diagrams you'd draw now of multi-layer neural nets and things like that. Like the basic idea of what we're trying to do has not really changed in almost like 80, 80 plus years, which is just a wild fact, right?

Starting point is 01:12:36 It means there's something deeply fundamental about the thing that we are pursuing. And that idea itself, I think, kind of came from trying to model the information processing of the brain. And it's imperfect and not an exact analogy to biology and all of these reasons that it should fail or that people have said this thing is doomed. But the results are undeniable at this point.

Starting point is 01:12:55 I mean, some people try, but it's really hard to, it's really hard to kind of close your eyes and sleep on this in my mind. And it's very interesting if you look at, you can find quotes from the mid-1960s of people trying to poo-poo the whole direction saying that these neural net people have no new ideas. They just want to build bigger computers. And you can basically say something very similar today. What we're trying to do. One moment.

Starting point is 01:13:23 A little water break. Yeah. Exactly. For all of us. Cheers. Exactly. Cheers. You know, we're all human.

Starting point is 01:13:30 Proof of humanity, right there. Exactly. So what we're all trying to do is find novel ways of taking compute and really harnessing it. And sometimes you hit a wall, but these walls tend to be ones that you can drill through, right? What we found is every time you scale up, everything, all of your engineering, all of your sort of scale and variance, all these things, they get stressed to the next level. It's almost that the tolerances become tighter and tighter. It's like launching a 10x bigger rocket means you need to be like 100x more precise on everything, but it doesn't mean that the fundamentals of the science are different. So pre-training,

Starting point is 01:14:05 there's definitely been a lot of discussion of data wall. It doesn't mean it's fundamental, right? It just means that we need to be better and more precise in what we're doing. There's RL, which has been something that has kind of gone from spending a small amount of compute to much larger amounts of compute now. And then there is a third way that we're really harnessing compute, which is compute at test time. And we published some scaling laws around this, and all three of these things multiply. Like, that's the amazing thing. And of course, the compute and the harnessing of it is the fundamental goal, but that you get these multiplicative effects out of all of it through the quality of your engineering implementation, right, through the quality of the data sets, through a bunch of the refining work that you do. And there's lots of different techniques and ideas.

Starting point is 01:14:47 And that's what makes this field so rich and why progress is just going to continue apace. What about on the infrastructure side? You guys have been busy scaling up. What can you share on that front? Well, so I run a team called Scaling at OpenAI, and we really focus on building the infrastructure for scaling. And this is in partnership with really everyone across the company. It's almost a misnomer that our team is called scaling because fundamentally this whole team and effort is about scale. But what we really try to do is to both on the physical infrastructure side, deliver as much compute as humanly possible.

Starting point is 01:15:23 And that is in partnership with companies like Oracle, SoftBank and others that we've been able to deliver just like increasing amounts of compute to OpenAI, but we're constantly thinking about how do we just deliver more flops and do it more efficiently earlier, cheaper, more power efficient, all of those kinds of questions. There's the software infrastructure side as well, and really thinking about how do you coordinate massive numbers of GPUs in order to work across one synchronous training run,

Starting point is 01:15:52 how do you coordinate that for reinforcement learning, how do you deploy that into production, and bring these models to life at massive scale. And I think that every single layer of the stack, there is innovation required. And that's something that's very easy to miss. Like one way I think about research is that there is, and this is kind of the view from Jakub, who's now our chief scientist, that there's a research stack. And you can kind of think of the top of it is people running experiments and coming up

Starting point is 01:16:19 with new ideas for how to, you know, sort of utilize data or something like that. There's a middle of the research stack of people thinking about how do you sort of take these different ways people are running experiments and be able to train in novel ways and kind of put together the pieces differently. And then there's a bottom of the research stack, which is like writing CUDA kernels to get the absolute max out of the GPUs. And every single layer here, you get a multiplicative factor through innovation. So it all comes together as one big whole.

Starting point is 01:16:49 On scaling, I'm interested to hear about just, if we think about like the impact of AGI or the impact of AI just being some sort of maybe, you know, quantitative GDP metric or qualitative just impact and good. Is there an important factor of scale with just not even the flops that are going into the models, into the pre-training, into the RL, into the test time inference, but actually just the flops that are going into the usage of AI within humanity broadly? And I feel like that might maybe be the next like scaling curve that we're seeing as more people use models. They see improvements all over the fact.

Starting point is 01:17:35 Like is that something that we should be tracking to see kind of the instead of these like S curves, we want to see like the continual exponential? I think that's a great perspective, right? Because at the end of the day, I mean, if you look at kind of the shift from something like Dota, which we pursued in order to, you know, we wanted to do new algorithm development, but really it almost validated how we scale up existing algorithms. But there was no illusion of delivering direct economic benefit from it, right? To the current models where we are still, we're starting to end the era of like pushing on these academic benchmarks, right? You look at things like the IMO at this point, models are

Starting point is 01:18:16 able to get gold medal on it. Like these, the hardest academic benchmarks that are available are sort of no longer the guiding light of progress for these models to where we actually want to be is for AI to be helping everyone, right, to be something that uplifts humanity. And that's the final metric, right? It's how much does it actually benefit everyone? How much value does it bring to the world? Yeah, not just HealthBench. It's actually how many people did you solve their healthcare problem, right?

Starting point is 01:18:44 Exactly. Yes, yes, yes. And that's the actual goal. Yeah. And that's what's exciting, right? It's like we're moving from the lab to reality. And I remember in the early days, as we were thinking about how do we measure our progress towards AGI, we always sort of dreamed that one day we would be able to measure it this way. And you can think of revenue maybe as a proxy metric for value delivered to the world.

Starting point is 01:19:05 It's not perfect, but it's at least something, right? You can think of the distribution of like how much compute goes into it, how many people are using it. But fundamentally, like what we're after is how much do we really uplift humanity through this technology? Yeah, I mean, I might be misreading it, but I'm pretty sure like that was the Kurzweilian, Ray Kurzweil philosophy was that like total number of flops getting immense not necessarily all in one data center for one model. It was that compute broadly would be so wide. Yes, yes.

Starting point is 01:19:35 And I remember like on that chart, right? You can see, you know, total compute of all human brains. Yeah. Which really suggests a particular vision of how this technology will be rolled out. Yeah, distributed. The phones count as an impact. The Wi-Fi router counts for the impact of the internet, just like the phone does. It's not just the big pipe that's going, the backbone of the internet that actually matters.

Starting point is 01:20:01 Deep research is a hit product, almost everybody I know, at least in the industry is using it. He's reading 30 pages of deep research a day, basically. He loves it. He's making books with it. But why have agents broadly come around a little bit slower than people may have expected? Is it just that using computers is actually much harder? Computer use is just a really hard challenge.

Starting point is 01:20:29 Or, you know, I think going into this year, everybody said this was the year of agents. Are you talking about flight booking or something like that? Flight booking, but, you know, people were saying, 2025 is the year of agents. And I would say that it's the year of deep research and not a lot of these other sort of like broader use cases. Sure.

Starting point is 01:20:47 Well, 2025 isn't quite over yet. So that would be my response. I'm very much on the, I think that progress in this field, the way that it tends to work is that if something kind of works with the current generation of models, it will be extremely reliable with the next generation of models. And I think that where we've been is that deep research is the, if you rewind a year, that was the like we kind of had something working. And then like this year, it's been just incredible. And I think that agents, you know, specifically like computer use agents are something we've kind of had working. And again, you know, the year is not over. I think there's a lot of rapid progress to be made.

Starting point is 01:21:26 But I think that maybe part of it too is that the agents that we're about to see, I think, are a little different from maybe what we would have pictured five years ago. Like I remember having a debate with some friends on do you want an agent that does the flight booking? Because the problem is it's actually a very high bar to beat the flight booking UI, because there's so many preferences that are entailed in that, right? And you really have to know kind of what mood you're in, like, are you okay with taking the extra layover, and all these kinds of questions. And actually there's so much other stuff that happens in your life that is toil or drudgery, or that's something that you're not an expert in but are supposed to be. Think about health, right? Every patient really is the doctor if you're coordinating across multiple specialists. There's no doctor that helps you with that, right?

Starting point is 01:22:08 That's really on you. And there you actually can have AIs that are just text only that actually are able to add massive value, and then it frees up your time if you want to go, you know, book the flights yourself. And so I think that really finding the right problems that have high leverage, right, that really add value to people. And also thinking about the other side of how to make sure these agents are responsible with the trust that you put in them, right? That the more that you give an agent access to your email, the more you really have to trust that it's going to, you know, sort of do right with whatever your task is and send the right email to the right people and be able to segment your information and all of these kinds

Starting point is 01:22:49 of questions. And so I think that there's both a practical, how do you get to adoption, but also just like where are the most important leverage points in a person's life? You also missed coding agents because it's been the year of deep research, but I feel like it's also been the year of coding agents. How is that developing at OpenAI? I've noticed that I'll hit O3 Pro and it'll wind up writing a bunch of code for me and I didn't even ask it to, then you have specific products for coding. How do you see the evolution of

Starting point is 01:23:21 software development evolve? How are you seeing OpenAI customers use coding tools? And how good is ChatGPT or GPT5 on coding? Well, software engineering is definitely being revolutionized in front of our eyes. It's been happening. And GPT5 is the best coding model in the world right now. It's the default now in Cursor, which I think is a really huge statement of the quality of the model, and that it's just so good across like every function of writing code, understanding code base, being able to use tons of tools, being able to do agentic work, that, yeah, it's like I'm not a front-end developer at all, but actually now I am, right? And I think that you are too, right? If you just talk to the model, you can produce incredible things. And so I think that there's this real empowerment. If you think about what computers

Starting point is 01:24:11 were supposed to be, right? Computers are supposed to be a tool that makes you more productive, able to do the thing you want. But then somehow when we started out with computers, you have to contort the human to the machine, writing assembly language and like all these like very abnormal things for a human to do.

Starting point is 01:24:27 And that as we've moved to tools ultimately, you know, in the current generation now GPT5, suddenly the computer comes closer to you, right? That you just express your intent and you don't think about okay, like exactly which language and what version of different libraries, that the model is something you can delegate to. And so we are very committed to programming and to making our models continue to be the best they possibly can be.

Starting point is 01:24:51 Must a superintelligence be able to explain how to build superintelligence? So it's a great question. So I mean, I think that where we're going is a world, and we're already seeing it, where these models help us produce the next generation of models, right? They also help us really supervise tasks that are too hard for humans to supervise on our own, right? If the model writes a 10,000 line program for you, reviewing that is probably going to be quite burdensome. But if you can have a model that you trust, that maybe isn't as capable as the one that wrote all that code, or maybe there's a team of agents that work together to write all that code, but you have a team of reviewer agents. Like, this is the kind of thing that you can actually bootstrap trust.

Starting point is 01:25:31 And I think that this is like one of the most important things. And also, interestingly, 2017 is when we had the first language result. We also had some results or some vision on how you can actually bootstrap supervision beyond the scale of tasks that humans are able to supervise directly. And so I think that we're heading to a world where, you know, we now have these chain of thought models. We've been advocating very strongly to preserve the integrity of the chain of thought, right? So that it means don't directly optimize it to look good, you know, though there will be

Starting point is 01:26:01 lots of temptation to do it for various reasons, really make sure that there's no pressure on the model to obfuscate its thoughts within that chain of thought. Because then you can really see what it's up to. And I think there's further techniques to even make it more faithful and more rigid to what the internal monologue of the agent is. And so I think that there's actually a lot of promise in terms of interpretability, in terms of supervision, in terms of being able to scale to just like much more sophisticated tasks. Yeah, I guess, I guess my question is like there, how much information in the world can be derived from first

Starting point is 01:26:36 principles reasoning versus true secrets that need to be discovered by interacting with the world directly? Because it feels like it would be very difficult to... I'm just wondering about, like, how intellectual property interfaces with superintelligence, or, if you play this out a lot, there's all these hard-won... Dwarkesh just talked a little bit about this with continual learning. There's all these little subtleties that maybe they're not secrets, maybe they're not true trade secrets. You don't think to lock them down, but they're just things that haven't been codified online or anywhere.

Starting point is 01:27:15 They haven't been given to anything that is surfaceable by the model. And I'm wondering how is it just we need to build up new knowledge in every fact from first principles and kind of go through the history of humanity's pursuits of knowledge? or do we just need to onboard more and more information? Or maybe it's both. I don't know. It's just something I've been noodling on. Yes.

Starting point is 01:27:38 It's a great question. I would say all of the above. Select star. So I think that the answer is very similar to what it is for humans, right? How does a human generate new knowledge? How do we accomplish new things? First, you want to be grounded in the wisdom of the past, right? You really want to understand what have people tried, what worked, what didn't work?

Starting point is 01:27:56 You want to go and read the biographies of, you know, various people and understand those. But you also want to try things out. You want to make some mistakes in a contained environment in a way that you actually can see the effect of your hypotheses. And then you want to be able to learn from those. And I think that being able to really start to scale up these systems and be able to integrate them with the world is a very big process and milestone that we're currently embarking on, right? To move from a world of totally hermetically sealed reinforcement learning environments to think about how do you actually put real world interaction in there. And you think about things like robotics, like you're going to need to have that at some point, right? You're going to need to have some sort of interaction with the real world and to have models that are able to produce new materials, right?

Starting point is 01:28:41 To be able to actually solve various diseases for them to be able to really help people, right? That, you know, we already have models that are great at use cases like therapy, but to really get to the next level of something that can just really help every person accomplish more and accomplish whatever their goal is, it would be very helpful for that model to actually have some real world experience with doing that very thing. And so I think that figuring out how to bring all this together is ultimately what our mission is about. And we do this not in isolation, but really as part of a much broader community. It seems like it's advantageous to have the most dominant consumer app in that environment. So congratulations. Jordi, do you have a last question?

Starting point is 01:29:16 Last question. What do you hope to see out of Washington, D.C. in the next year, year or two, not thinking super long term in terms of basically promoting innovation within the United States. Obviously, the admin cares a lot about AI and has been making moves, but what else would you like to see or where would you like them to double down? Yeah, I've been very, very impressed with how much the administration has engaged with the technology and really tried to figure out how can we help and ensure that American AI continues to lead and really sets the standard for the world. And I think that that is the lens that I would really encourage thinking through, right?

Starting point is 01:29:53 is like, this technology is changing very fast. And that fast plus government is not usually an ideal combination, but this is the reality that we have. It's the opportunity we have. And I think that the question in my mind is less about any specific regulation or strategy, but it's really being calibrated. It's really having a very tight OODA loop, right? Being able to react to, okay, we have a new model.

Starting point is 01:30:15 These are the capabilities we see on the horizon. How do we make sure that we get the most uplift and benefit from it? And thinking strategically about not just how do we do this for Americans, right? But how do we actually do this for the world and promote democratic values? And so to me, the most important thing is that motivation, right? Is the question that is asked and the ultimate sort of motivation behind what gets implemented? Yeah, that makes a ton of sense. Thank you so much for joining us. Jordi, are you going to hit the gong? For GPT-5. Congratulations on the masses. A historic day. And thank you so much for stopping by. We'll talk to you.

Starting point is 01:30:51 Thanks for joining. Have a great day. Bye. Cheers. Really quickly, let me tell you about figma.com. Think bigger, build faster. Figma helps design and development teams build great products together. And we are joined by Sarah Friar, the CFO of OpenAI, next. And we are going to bring her in in just a minute.

Starting point is 01:31:08 The gong is still swinging. The gong is still swinging. And I'm going to tell you about vanta.com. Automate compliance, manage risk, improve trust continuously. Vanta's trust management platform takes the manual work out of your security and compliance process and replaces it with continuous automation, whether you're pursuing your first framework or managing a complex program.

Starting point is 01:31:28 We need one more second. Tyler, any other questions that we should be asking for the OpenAI folks? Anything top of mind? What's on the timeline? Is the timeline still in turmoil, or has it settled? So I think the general vibe is like

Starting point is 01:31:40 this model was not benchmaxxed, but if you actually get to use it, it's pretty solid. Cool. One thing, it failed TBPN Bench. Oh, it did. It did not get the horse breed correct?

Starting point is 01:31:49 Did you get the horse breed? So you have it. You have access to... Yes, I have access. But I've seen other things on the timeline. We can talk about it later, but it seems like a really good model. That's amazing. Great to hear. Well, welcome to the stream, Sarah. Good to meet you. How are you doing? Congratulations. A historic day. Thanks so much for taking the time to talk to us. How are you doing? I'm doing great. I mean, how could you not be doing great on the day when GPT5 launches?

Starting point is 01:32:11 It's been a long time in the making and we're so happy it's out. Yeah, fantastic. Walk me through your role and what GPT-5, what this launch means specifically for you. And yeah, let's just start there. Finance has to be, you guys have to be the unsung heroes at OpenAI. There's a lot of big numbers. There's a lot of massive bills coming in for crazy training runs and you have to underwrite these against future revenues and I'm sure you've developed many models to

Starting point is 01:32:42 figure that out. But yeah, walk me through your role at OpenAI and what today means for you. Yeah, absolutely. So I'm OpenAI's CFO, and the finance team can be the unsung heroes, but they are an amazing team. So I'm going to shout out to them. They're heroes to us. It's a complex world that we're all living in. And there are a lot of Bs on the end of a lot of the numbers that we look at. Look, what is our role? Number one is just making sure we have a healthy, high growth business. It's been incredible watching just, first of all, the number of weekly actives, 700 million people using ChatGPT every week. And I'm assuming after today, we should see a very nice little bump in that number.

Starting point is 01:33:19 This is going to be a gong heavy segment, Jordi. I think we have a lot of soundboard for the big number, so congratulations. I love it. And I love that. I've never met a number I didn't like. I think the other part of the business, you know, and then we have to do this. We have this balance of the consumer business, enterprise business, and then API business, which I think of as somewhat enterprise. You know, and balancing that out.

Starting point is 01:33:45 So enterprise adoption has also been exploding. I probably do, I mean, interestingly, as a CFO, I probably meet four to five customers a week. It's a part of my job I actually love. We have about five million paying business users right now from banks to biotech. I was talking to the CFO. And so that number is individual companies? That is individual seats at companies. Seats at companies.

Starting point is 01:34:08 Got it. So what I would say about that number is it's crazy to have done that in just two and a half years. Because enterprises, right, you got to put your big boy, big girl pants on to go sell to an enterprise, right? They want to make sure that you have the table stakes of security, SSO for signing on, you have HIPAA compliance if you're selling to health care and so on. They want to know that other people have done it, so they're often looking for that case study, but they also want to be, you know, the innovator right at the front. And so to grow that scale of business in just two and a half years blows my mind. And it's not just big, big businesses,

Starting point is 01:34:43 which I could talk, you know, at length on, but it's also small mom-and-pops, you know, literally the people who really keep the lights on in most countries are also gravitating to ChatGPT, which is wonderful. And then on the developer side, four million developers have built on our platform. And the question there is like, that could be a developer inside a big company like Grab. It also could be the next, you know, startup founder at Y Combinator getting going with the next multi-billion dollar unicorn business. And so we see the whole gamut there, and that's important to us as well, because it's very mission aligned, right? How are we going to get AGI to all of humanity if we don't do it through this ecosystem?

Starting point is 01:35:25 So a big part of my, you asked about my role, a big part of my role is just keeping that business really healthy, making sure we always have the headlights on so people know the decisions they're making from a business standpoint, huge part of what the team does. The other big part of my role is compute. If I didn't talk about that in my first breath, you all should correct me. I mean, it's making sure we think compute is a massive competitive differentiator. I give so much kudos to Sam and the team, but particularly Sam, because no matter how big a number we look at, Sam always wants to go bigger. And he's been right.

Starting point is 01:36:02 He's never met a number he doesn't want to add a zero to. That too. Maybe mold and be logarithmic. Maybe two zeros. And he's, but he has been very right. And if, you know, you just had a long conversation with Greg Brockman. I think he does such a good job of kind of really explaining what a completely different world,

Starting point is 01:36:20 an AGI world is, or an AI-fied world is. And so I think when people get all caught around the axle of like, you know, what is a gigawatt of compute? And oh, my God, you guys want to have 10 gigawatts. And that's more than the compute of like Ireland, since I grew up there. And now you kind of look back on that. And you're like, those numbers already look small for a world where everyone will have access to intelligence. And so we're really starting to see what that can mean when you look at the demos today around things like health care and education and so on. Can you talk to me about non-GAAP metrics and what you think is going to be useful to track?

Starting point is 01:36:59 We were talking to Mark Chen about this and he was saying, you know, DAUs are great, time on site is great. But that's not as impactful of a metric for OpenAI as it is necessarily for a social network or an entertainment app. And there can actually be some problems that come up with that. So it feels like there might be some tension in the organization eventually or just publicly about, you know, what metrics are worth optimizing for. And then there's also the financial community that wants non-GAAP metrics to track the health and progress of the business. And then, of course, over, you know, decades we see companies eventually roll back some of those non-GAAP metrics as the business gets more complex. So how do you think about the development and sharing of non-GAAP metrics?

Starting point is 01:37:44 And what do you think is actually interesting and provides signal to the business and the investor community? I'm kind of smiling to myself because when anyone normally says talk to me about non-GAAP metrics, I can see like most people's eyes roll back in their head. I live for non-GAAP metrics. I would love to do that. Please. I think in a CFO seat, first of all, it's really important to think about input metrics

Starting point is 01:38:06 and output metrics. And things like revenue, which is a GAAP metric as well as a non-GAAP metric, they're very laggy. Like if you're supposed to be spending your whole time focusing on the revenue number in an operator seat, like you are completely missing what's going on with the business. So I push my team a lot to get out of kind of ultimately what the P&L looks like, and I'll come back to it though, and go way upstream and say, what are the true input metrics that

Starting point is 01:38:31 tell us about the health of our business? And so I think it does start with that funnel of monthly active to weekly active to daily active because we do. I mean, our mission is literally AGI for the benefit of humanity. So we know how many billions of people live on the planet. The fact that we're starting to be able to talk in billions and percentage of the world's population, it blows my mind. Today, 85% of our users are outside the United States. And I love that stat. And in fact, if you go look at where are the big populations of users, it just tracks global population, right? It's countries like India, Indonesia, Brazil, Vietnam, like the Philippines, like go to anywhere that has big population,

Starting point is 01:39:15 the U.S. too, of course, but that will be your tracker. So that's kind of number one when I think of an input metric. From there, on the consumer side, you're right, things like time in app I've actually always had somewhat of a love-hate affair with. But I think in this case, because we're giving people intelligence, teaching them how to use that, I actually think is where time in app does become important. And one of the things we've really seen with ChatGPT is people are spending more time with it. Now, you know, we balance that with things like mental health and so on, making sure that we're not creating bad things like we might have seen in prior eras of computing. But I think we're just getting started on that front. Beyond that,

Starting point is 01:39:58 like when we go into areas like the API, I don't look only at usage, right? I can look at tokens per minute as a usage metric. But I look at things like latency. I actually try to look at the elasticity of demand. We know that developers want performance. They want intelligence, but they also want to make sure the API is always up, and they want price. And they're often willing to trade across those three things, right? It's kind of a linear program depending on what your use case is.
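
[Editor's sketch, not from the conversation] The three-way trade described here, intelligence versus uptime versus price, can be illustrated with a small weighted-score chooser. This is a simplification rather than a true linear program, and every tier name, score, and price below is hypothetical; nothing here reflects actual OpenAI offerings or pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    intelligence: float  # hypothetical quality score in [0, 1], higher is better
    availability: float  # hypothetical uptime fraction in [0, 1]
    price: float         # hypothetical USD per million tokens


def pick(options, weights, max_price=None, min_availability=0.0):
    """Return the feasible option with the best weighted score, or None."""
    feasible = [
        o for o in options
        if (max_price is None or o.price <= max_price)
        and o.availability >= min_availability
    ]
    if not feasible:
        return None
    top_price = max(o.price for o in feasible)

    def score(o):
        # Turn price into a "cheapness" term so that higher is better.
        cheapness = 1.0 - o.price / top_price if top_price > 0 else 0.0
        return (weights["intelligence"] * o.intelligence
                + weights["availability"] * o.availability
                + weights["price"] * cheapness)

    return max(feasible, key=score)


# Hypothetical tiers, for illustration only.
OPTIONS = [
    Option("large", intelligence=0.95, availability=0.995, price=10.0),
    Option("medium", intelligence=0.80, availability=0.999, price=2.0),
    Option("small", intelligence=0.60, availability=0.999, price=0.4),
]

# Two hypothetical use cases that weight the same three things differently.
HOBBYIST = {"intelligence": 0.2, "availability": 0.2, "price": 0.6}
ENTERPRISE = {"intelligence": 0.7, "availability": 0.2, "price": 0.1}

print(pick(OPTIONS, HOBBYIST).name)
print(pick(OPTIONS, ENTERPRISE).name)
print(pick(OPTIONS, ENTERPRISE, max_price=5.0).name)
```

The same menu yields different answers depending on the weights and constraints, which is the sense in which the choice depends on the use case.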

Starting point is 01:40:26 And so I think it's important that we are offering things to developers that allow them to optimize across the three metrics, for example. So that's kind of your input metrics. And again, I could wax lyrical, but I won't. But then go to what you really asked. So investors on the other side, right, they want to see a P&L. They're like, I want to be able to compare you to other companies. I want to be able to create maybe a DCF. Like I want to think about fundamental valuation for a company if I'm going to invest in it. And so, you know, today what I really try to push investors on is we are not

Starting point is 01:40:58 a company that should be optimizing for free cash flow today because there's just too much opportunity. Like that point about compute, we have to make a decision on compute today with an eye to what we're going to need in two to three years because data centers don't just spring up overnight. Like they're not mushrooms. They literally take time and effort. The thing we have failed at, frankly, I would say it's three years ago, we didn't have enough foresight to say how big could ChatGPT be, because it didn't exist.

Starting point is 01:41:28 It's just a shame on us if we keep doing that over and over. So there can be a bit of a mismatch between our belief on revenue because we don't yet know the product versus the input, which is the cost today on compute. And so getting investors comfortable with the fact that there's probably losses for a period of time. I say probably because ChatGPT, just generally the revenue models continue to surprise to the upside. But at least for now, we should be in big investment mode. And then you kind of said it well. Like as companies mature, you move to more GAAP metrics, right? If you look at the large, the Mag 7, in many cases they're looking at like real GAAP net income.

Starting point is 01:42:08 So the whole way down to the bottom of the P&L, we're just not there yet. And we should take advantage of that advantage because we can invest as a private company. How do you think about timing fundraises? From my understanding, or rumors, the last, you know, the most recent financing was very oversubscribed.

Starting point is 01:42:26 And at the same time, you're still committing to capex in the future that is a multiple of current, you know, the current run rate. And so, you know, you in the CFO seat, I'm sure you're trying to find this balance of like, what does the business need today while, you know, not diluting the company, you know, too much, knowing the sort of growth rate of the business.

Starting point is 01:42:49 I mean, that's exactly right. That's the art, not the science of it, is that, you know, we did just come off the back of closing out the sleeve of investment that we could take down in this current round, led by SoftBank. And it was massively oversubscribed, which comes back to, I think, the market really waking up to the fact that AI is a generational opportunity. And the scale that it requires is like something people have not even seen before, right? It's, you know, people talk about the Internet or like the railways.

Starting point is 01:43:19 They're good analogies or transistors, I think Sam always goes back to. They're good analogies, but I do think this is bigger than everything that's come before. So there's a, you know, taking down $40 billion, which we just did in this round, that certainly felt like that gave me a lot of confidence. Appreciate that. A lot of confidence to then go out and do large compute deals. We announced the large deal with Oracle, for example, and to be able to keep working with all of our supply chain,

Starting point is 01:43:48 Microsoft, CoreWeave, Oracle, Nvidia, and so on. But at the same time, you know, in a world where our valuation has gone up, you know, at pace with our revenue, you do get an opportunity to keep coming back to market and not take that same dilution because you're getting that higher valuation for the work and the output that you've created. So it is a bit more of an art than a true science. I think for now we will continue to need to fundraise in order to fund that compute.
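
[Editor's sketch, not from the conversation] The dilution point is simple arithmetic: the fraction of the company sold in a primary round is the amount raised divided by the post-money valuation, so the same raise at a higher valuation gives up less ownership. The $40 billion figure is the round size mentioned above; both valuations below are hypothetical and chosen only to show the effect.

```python
def dilution(raise_amount: float, post_money_valuation: float) -> float:
    """Fraction of the company sold to new investors in a primary raise."""
    if raise_amount <= 0:
        raise ValueError("raise_amount must be positive")
    if post_money_valuation <= raise_amount:
        raise ValueError("post-money valuation must exceed the amount raised")
    return raise_amount / post_money_valuation


RAISE_BILLIONS = 40.0  # round size mentioned in the conversation

# Hypothetical post-money valuations, in billions of dollars.
for valuation in (300.0, 600.0):
    sold = dilution(RAISE_BILLIONS, valuation)
    print(f"raise {RAISE_BILLIONS:.0f}B at {valuation:.0f}B post-money: {sold:.1%} sold")
```

Doubling the valuation halves the ownership given up for the same dollars, which is why a valuation that keeps pace with revenue makes repeat fundraising less costly.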

Starting point is 01:44:16 But I think we want to start getting more sophisticated. Like just pure equity fundraising for everything is an expensive way to fundraise. And I think we're probably getting to the stage as a company where we can be a little bit more kind of broad in how we think about funding overall. And even just working, frankly, with our supply chain because our success with bringing this era of AI into being is their success too. And I think these companies are realizing that. What about partners? Last question.

Starting point is 01:44:46 Partner selection on the compute front. There's not a lot of companies in the world or firms that can really be a meeting. You should update your LinkedIn title. We saw someone yesterday, works for Discord, is in charge of their cloud buying and his LinkedIn title was, I have full responsibility over buying cloud, our entire cloud budget. And it was clearly like a huge flex, but I'm sure, you know, you're in direct text message, you know, with every single person that's relevant in the industry. But yeah, but I'm curious around like, you know, a lot of people have been excited about

Starting point is 01:45:23 developing data centers over the last couple years in hopes to win, you know, the business of companies like OpenAI, but I think when you guys are evaluating partners, I imagine that scale is such a massive factor. And so a single small data center is not really going to move the needle. You guys need to be thinking in terms of mega projects. Yeah, I mean, I think that's exactly right. I mean, it started with our partnership with Microsoft. And it's kind of, it makes me smile now to go back and look at that original kind of large fabric for pre-training because I think it was only in the maybe 20 megawatts sort of size. And, you know, now we're talking gigawatts even just this year.

Starting point is 01:46:07 And you're right that when we think about like what is perfect compute for us or strategically the right compute for us, we are definitely thinking about large scale. We're thinking about flexibility, right? We're learning a lot about, you know, pre-training, post-training, test-time compute even, like where the different kind of scaling is happening. We're kind of recognizing there's more of a blurred line, often between what people think of as inference. So investors always are like your inference compute and your training compute.

Starting point is 01:46:35 It's like, you know, literally it's like vanilla ice cream and chocolate ice cream, when in reality there's like a bit in the middle that is something of both. We also need to think about things like where, you know, latency, where do we want to put our footprints around the world, that very global weekly active user base, right, they use ChatGPT, you don't want to slow the model down, right? The beauty of the intelligence is like the real-time nature of it. And then when we get into big compute, like where there's lots of tokens being used, like deep research, image gen, video, as that comes online, like all the work

Starting point is 01:47:09 you saw today, actually just even on voice, like that really quickly means that you've got to make sure your compute is near your users. And so it is a big plan that's coming together. But you're right. Like small is just not that useful to us. What about pushing partners to take risks? From my understanding, you guys are pre-committing to certain, you know, basically spend levels, but at the same time, I imagine you want people to say, here's what we know we're going to need, but we want you to build, you know, this much capacity so that we have the sort of incremental capacity built in. Yeah, we want to, I mean, being extensible is really important. And we do want partners to, like I think Oracle OCI has done a really nice job of that, of kind of starting, we started

Starting point is 01:47:55 with like one large, felt really large at the time, data center footprint in Abilene, Texas, and now that has really multiplied up into multiple sites that can all be connected. And that's a good example of a partner who has the capability to start in one way, but to be able to show you a path to maybe 5x-ing just in that single footprint. That said, we are finding that as we go around the world, there is an ability to go work with governments, for example. We just made an announcement in Norway, made an announcement in the UK.

Starting point is 01:48:26 This is the first time in my professional career I've seen countries come to the table and want to do commercial deals, like wall-to-wall ChatGPT. I think the government of Estonia put ChatGPT into all of their high schools, high school or, I can't remember if it was up at the university level. But that's kind of wowing. And hand in hand with that, they are viewing AI infrastructure

Starting point is 01:48:47 as incredibly strategic for their population. And, you know, it's a whole other level of selling versus, you know, I've seen enterprise, large enterprises before, but never anything at this scale. Last question. Whose idea was it to give every federal agency ChatGPT for a dollar a year? Yeah. I imagine that had to get past. You could have gotten more than a dollar.

Starting point is 01:49:12 The CFO must be really upset here. $10? Not at all. That's 10 times as much money. This is one where I think it's really important. And OpenAI is, you know, in some ways, a U.S. asset and a national asset. And we want to make sure we're accelerating our government, like all of the resources, as we think about, you know, Western democracy and so on, that we are absolutely putting

Starting point is 01:49:33 our technology into those hands. It's that guy, Kevin Weil. He's been moonlighting for the U.S. government. It's like, which team are you playing for, Kevin? Are you on OpenAI? Are you on the U.S. government team? Kevin just did his basic training. I don't know if I'm allowed to tell you that, but I was hearing all about it yesterday.

Starting point is 01:49:48 I saw some photos. They look great. Yeah, it's a good thing. He's going to be in even better shape. Let's go a lot with governments. Yeah, that's great. Amazing. Last question for me, the open source model launched two days ago.

Starting point is 01:50:01 And there's this world where, like, you have this dominant, the accidental consumer company, you have this dominant consumer app that's generating so much revenue. Then you have B2B and enterprise and API, and that looks more like a cloud provider. But then is there a world where the Red Hat Linux of open source LLMs is an OpenAI division and that there's actually serious revenue and profit that comes from helping companies implement an open source large language model, like Red Hat built a pretty fantastic business for a long time on top of open source Linux implementations? Yeah, I mean, I think it's the right question to be asking.

Starting point is 01:50:40 I mean, I think step one was getting our second open source model out and getting, seeing what that traction is and then seeing what the community needs. I think it's important to leave space for a community to develop, right? That is the beauty of open source is that ecosystem that develops. And that was true with Linux. It's true in areas like crypto, too. But I do think you'll find over time that as enterprises want to deploy it, like, I mean, now I've dinosaured myself.

Starting point is 01:51:06 But when I was a, you know, when I was a research analyst at Goldman Sachs back in the day, I covered software and I covered Red Hat, actually. Yeah, really. And all that growth. I like wrote a research report called Fear the Penguin at one point because of the Linux being deployed. But then you started to understand that for an enterprise, you couldn't depend on like patching and upgrading to happen via a community model.

Starting point is 01:51:29 Like you needed some of the rigor that goes with an enterprise business where you kind of know if you need maintenance, if you need a bug patch and so on. And so that did allow Red Hat to grow an incredible business. So I don't know if it's us or we'd be supportive of others, but I think we are so excited to see open source out there and getting incredible feedback. And I think we wanted to do that ahead of GPT-5 to keep coming back to like we're here to grow

Starting point is 01:51:54 this ecosystem. Well, we'll give you market cap credit for it anyway, even if it's early stage. Well, thank you so much for coming on. This is fantastic. We'll talk to you soon. Thank you, Sarah. Great to meet you both. Take care. Have a good one. Cheers. Bye. Up next, we have Dedy Kredo from Qodo, I believe I'm pronouncing that correctly. Let me tell you about Graphite, code review for the age of AI. Graphite helps teams on GitHub ship higher quality software faster. You can get started for free at graphite.dev. And let's bring in our next guest. How are you doing? Welcome to the stream. Welcome. Very clean background. It's probably virtual, but whatever you got going on looks fantastic. You look great. How are you doing? Are you excited about

Starting point is 01:52:30 GPT-5? I'm so excited. It's awesome. It's actually like everybody's talking about the coding capabilities, but no one is really talking about the code review capabilities, and I'm going to talk about that today. Yeah, yeah, break it down. How are you using it right now? Yeah, so we just enabled it in our platform. It's the default model for both our IDE plugin, our CLI, our Git plugin. And yeah, we're using it to generate very high quality code reviews, catch bugs before they hit production, help enterprises verify that their code is aligned with their best practices. So yeah, it's super exciting. I can share my screen and show a few things if like that makes sense. You can. Everything you share will be live. It'll be a little, yeah, yeah, yeah. But I want to know also,

Starting point is 01:53:18 So while you're getting that set up, I want to know about what changes materially do you think happened in GPT5 specifically for code and code review. Do you think there's more data going into the model, more data going into the pre-training, post-training, anything else? Anything that you're noticing that you're like, oh, there's a specific upgrade here. They must have done something to get there. Yeah, yeah. I think it's a great point.

Starting point is 01:53:43 So I think it's all of the above. So it's scaling of both the pre-training. but probably a lot of the reinforcement learning. And basically using that at scale to verify that code gets generated in high quality. And then also basically catching bugs. And when you do it with reinforcement learning, you have the actual ground truth. So once you scale that, you can get the model to basically be a lot better at that. How steep is the power law right now in just programming languages?

Starting point is 01:54:18 Is it basically all Python and JavaScript and then like a really hard falloff, or is it actually important for coding models, if they want to be adopted widely, to be like truly multi-language and get all the way down into the long tail of like Rust and, you know, C# and all the different languages that are out? Yeah, yeah, for sure it's important. I mean, the majority of the market is in JavaScript, TypeScript, Python, like the majority of the early adopters, I would say. But then when you get to enterprise use cases, you get a lot of .NET, you get a lot of Java.

Starting point is 01:54:50 And the models are getting pretty good at those languages as well, for sure. Are you excited about, I mean, how do you think about the difference between like the improvements to GPT-5 from the consumer's perspective versus at the API level? I always found it a little confusing that ChatGPT was available as an API and you could interface with the chat. I believe you could interface with the ChatGPT model via the API. And there's a little bit of like a line blurring there. But are there features that you think are cruft and you want to kind of rip out for an API use case? Or do you just say, hey, give us the kitchen sink and we'll work from there. And it's actually helpful to have a coding model that can still have a web browser.

Starting point is 01:55:39 Yeah, yeah. I think basically it's a lot about we consume the model through the API, and it's really the same model that drives the consumer product. But for us, since our use cases are a lot about agentic use cases, the more the model gets better at using tools and gets better at kind of listening to very, very specific instructions. Following instructions is critical for the enterprise use cases. Because for us, unlike the broader market,

Starting point is 01:56:09 we believe that for enterprises, you need to have very specific agents that are defined with a specific set of instructions and prompts and tools and permissions. And the more the models get trained with that type of environment, the better they end up serving the enterprise market, which is really where we're focused. My question is, I wonder, like you said, like very specific instructions are important. When are we going to get an agent that I can just turn loose in a code base and say, like,

Starting point is 01:56:41 just go improve it? Like just go hunt around, do, like rewrite that. Like when you get a good open source contributor on a team that just becomes nerd sniped by the project that you're building on, they will just go around and find little ways to improve: this documentation needs to be a little better. Let's rewrite this test case over here. Let's add a little bit more functionality to this class or function. How far are we from that?

Starting point is 01:57:05 Yeah, I think the models are getting better and better at that part of basically kind of running loose in a code base. Yeah. But they do need the guardrails in place. And this is kind of where we're focused. Like a lot of the talk in the market is around the code generation side. You know, let the agent loose and give it a task and it's just going to go around and run for hours and do it.

Starting point is 01:57:29 What we're seeing is that the real challenge is now shifting towards how do I verify that the code is aligned with the best practices? How do I make sure that it's well tested, well reviewed, doesn't break anything. And, you know, so that's, I think, the next frontier. And really, developers going forward are not going to write a lot of the code by hand. They're going to spend most of their time reviewing code. And that's the next frontier. And that's what we're really, like, here to tackle.

Starting point is 01:57:59 Very cool. Anything else, Jordi? Yeah. Well, thank you so much for joining, giving us some extra context on the GPT-5 launch. We will talk to you soon. Have a great rest of your day. And thank you for joining. Cheers.

Starting point is 01:58:12 Thanks. Cheers. Talk to you soon. And let me tell you about Profound. Get your brand mentioned on ChatGPT. That seems more relevant than ever. Reach millions of consumers who are using AI to discover new products and brands.

Starting point is 01:58:26 I forgot to ask about this. We'll have to come back to this. But I want to know if. Profound powers MongoDB, Indeed, Mercury, DocuSign, Zapier, Ramp, Roe, Golland, Workable, Majorie, Eight Sleep, U.S. Bank, Chime, Clay. Okay, okay. We get it.

Starting point is 01:58:42 They got some logos. There is this question of like, okay, even if you're like, okay, GPT-5 is more incremental than a revolution, more of an evolution than a revolution, it's like, okay, well then let's talk about how it affects every other business and every other aspect of the economy. What should you be focusing on? And, like, do any of the updates from GPT-4 to GPT-5 change how you're positioning your brand for AI search? That's certainly an interesting question to dig into.

Starting point is 01:59:14 Anyway, we have Zach Lloyd from Warp coming into the studio. Welcome to the stream for the second time. Welcome back. Good to see you. He's back. How you doing? I'm doing pretty well. Yeah, so, I mean, a lot of what stuck out to me, I'm mostly a consumer of consumer AI apps.

Starting point is 01:59:34 I'm very excited about not needing to mess around with a model picker anymore. But take us through the biggest improvements from the software development side. Yeah, I mean, it's a major step up from the prior OpenAI models. It's, I mean, it's doing agentic workflows and working for much longer periods. It's just a smarter general model. Like we evaled it against all of our benchmarks and it's up there at state of the art, which is, you know, from our perspective, it's, it's awesome to have multiple

Starting point is 02:00:05 competitive models that our users can benefit from. So definitely a huge improvement from GPT-4.1. Yeah. So it seems like not the, you know, Claude Code killer, but certainly in the same conversation, in the same football stadium, to use a sports metaphor. How much,

Starting point is 02:00:28 you know, one thing that stood out is the cost reduction. I was about to ask. How much do you think that developers will care about that versus just, you know, what it can do from an output standpoint? I think developers do care about value. So sort of like quality to cost ratio. I think it's the more you get into like the individual developer and the small team,

Starting point is 02:00:52 the more that that matters. Whereas if you're at the enterprise level, I feel like it's, it's a little bit less price sensitive. So yeah, I mean, you can see it as different apps change their pricing, what the reaction of the developers is. You've probably seen this with Cursor and seen this with Claude Code. And so developers really, really are looking for something that's cost effective. So the fact that the cost is a little bit lower

Starting point is 02:01:19 is actually a big deal. Do you think we're in the Lyft Uber 2015 arc where the prices are subsidized and the prices will go up? Do you think that there's a price war on the horizon now that the frontier models seem to have similar capabilities? Do you think that someone will try and raise a bunch of money, cut prices a bunch and steal a bunch of users? How do you think that plays out? It's an awesome question.

Starting point is 02:01:48 I mean, my hope is that we get to a world where there is price competition at the model layer. So Warp is very much at the app layer, right? And so our value prop is like we can give our users who are mostly developers the best model access. And so to the extent that it's not one sort of model provider running away with that and having pricing power, it's better for us, just candidly. And so, you know, my hope would be something like the model world ends up a little bit like GCloud, AWS, Azure. That's our best end state where all of these models are, you know, sort of similarly powerful and a little bit more commoditized. I don't think it's been like that, but it's going, it's getting a little bit more like that. So the more that there's more than one show in town, I think that's generally good for Warp.

Starting point is 02:02:41 And actually is good for developers because the competition will put pressure to bring the prices down. But I don't know. Like I also think that people will definitely pay for quality. And so if there is a, you know, meaningful delta in quality on the frontier models, then I think that like whoever has the quality delta will have a lead temporarily, but I'm not sure that that lead will be sustainable. We'll see.

Starting point is 02:03:08 How do you think the developer community should plan around model deprecation over the next, you know, one to two years? Like how much, you know, from, I don't know that I've gotten a reaction yet from, I don't know if there's general frustration yet from people. You know, we've heard on the consumer side, Tyler on our team here loves 4.5. And so he was a little bit disappointed to hear that. But what are you seeing on the developer side? Yeah, I think it's a little bit different for people who are like building apps on LLMs versus people who are using LLMs as like an accelerator to doing coding. And like, you know, at Warp actually we do both.

Starting point is 02:03:55 Like we were an application level stack. And like it's actually very easy for us to go to the latest model, and so it doesn't really bother me. I don't know what type of app you would be building where it's really important that it's like GPT-3.5 or GPT-4 or something like that. I think like generally we want the most intelligent tokens at the best cost. So I don't see that being like too big of an issue, honestly. What about open source?

Starting point is 02:04:25 Does that feel like something that will be in the playbook? Is the markup on closed source models high enough that there will be a significant price delta or is the Pareto frontier kind of indifferent to closed source versus open source? So if there was a comparable open source option, that would be awesome. I think that the economics of it, again, it doesn't seem like a perfect analogy to me between like open source software and open source models. So open source software, it's like you have a big community of people who, you know, for the love of coding are building a really awesome product. For open source models, it's like you just need the crazy amount of capital to train something

Starting point is 02:05:09 that's on the frontier. And so I don't know how that happens. And so what we've seen is like the open source models are competitive at the quality level that they're at, but the quality level that they're at is not the same as the frontier models. And I don't really see why that would change. And so, I don't know, in Warp, it's like we, we were serving some open source models, but they're just not, they're not as good. And so there's, I think, a more limited use case for them right now. And I don't really see economically why that would change. In fact, I would be, I would be surprised if anyone was spending billions of dollars to train a model and just kind of put out the open weights. Like, I don't get the business strategy there, but maybe that will happen.

Starting point is 02:05:56 that would be awesome. Is there a world where you're like this idea of like smarter, smarter models either orchestrating, dumber, cheaper models or like using or distilling models into more narrow, narrow formulations that can be run more efficiently. We've talked to a few companies that do this for businesses. Like you just want a model that just filters for profanity and you can run it on, you know, a gaming graphics card.

Starting point is 02:06:25 And so it's basically super, super cheap or super fast. I'm wondering about like in the coding world, coding agent world, any of that, like where are the opportunities to kind of fan out and use an ensemble of models instead of just this hit everything with the smartest best? It feels like because of the funding environment, everyone can kind of justify like a high cloud bill. But, and most people don't admit that it's hurting the bottom line. but it feels like at some point it kind of has to eventually. I mean, I think that's a very real thing.

Starting point is 02:07:02 Like sense of, even in Warp, we don't use like the biggest, most powerful model for every task. And so there are certain things like, you know, for Warp, maybe for like deciding whether or not we should summarize a conversation is like a good example. So you hit the context window, you're like, okay, is this a good spot to summarize, is this a good spot to encourage a user to start a new conversation? We use a much more inexpensive and also low latency model. The other thing, the trend is that these very, very powerful models tend to have much higher latency. And so we do a mixture of models, and that's totally a real thing.
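
The mixture-of-models idea described here can be sketched roughly as a router that sends small yes/no decisions to a cheap, low-latency model and everything else to the most capable one. This is only an illustration, not Warp's implementation: the tier names, relative cost and latency numbers, task labels, and the 80% threshold are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str                # hypothetical label, not a real model ID
    relative_cost: float     # made-up relative units for illustration
    relative_latency: float  # made-up relative units for illustration


SMALL = ModelTier("small-fast", relative_cost=1.0, relative_latency=1.0)
FRONTIER = ModelTier("frontier", relative_cost=20.0, relative_latency=8.0)

# Hypothetical task labels for quick decisions that don't need the biggest model.
LIGHTWEIGHT_TASKS = {"should_summarize", "suggest_new_conversation"}


def pick_model(task_kind: str) -> ModelTier:
    """Route lightweight decisions to the small tier, everything else to frontier."""
    return SMALL if task_kind in LIGHTWEIGHT_TASKS else FRONTIER


def near_context_limit(tokens_used: int, context_window: int,
                       threshold: float = 0.8) -> bool:
    """Cheap pre-check: only ask the small model about summarizing near the limit."""
    return tokens_used >= threshold * context_window


if __name__ == "__main__":
    if near_context_limit(90_000, 100_000):
        print(pick_model("should_summarize").name)
    print(pick_model("agent_coding_task").name)
```

The point of the split is the one made in the conversation: the long-running agent task goes to the most intelligent model, while the housekeeping decisions around it stay cheap and fast.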

Starting point is 02:07:46 But I think for like the predominant use case as a developer is going to be, I want to tell an agent to do something. I want it to be harder and harder. I want it to run for longer and longer. And to do that, it's like you kind of want in general the most intelligent model. And so until this, until the models have a sort of S curve like type shape, I think it's going to be more of a quality game than a cost game for most of these things. Doesn't it feel like they have an S curve shape right now? It certainly does from a consumer perspective. That's interesting. From a coding perspective, I feel like we're still accelerating. Like the difference, again, between the last version of GPT and this version of GPT is

Starting point is 02:08:35 probably bigger than the difference between like 4.1 and 4, and 4 and 3.5. Like, it's a big deal. And same thing with the Anthropic models. And I'm sure that we'll see something from Google where it's an acceleration. And I think that there is like a maybe an underappreciation of how much left there is to solve here. Because when you, even when you're doing like a real coding task as a pro, like, despite all the demos you see on Twitter where it's like someone asks, you know, an agent to build an app, that's like a lower level of difficulty than doing what a pro developer does

Starting point is 02:09:08 with one of these models. And the models still don't produce great code a lot of the time. Like there's a lot of kind of handholding that has to go into it. And I think that we are still seeing an acceleration in terms of the models actually becoming not just like okay, competent engineers, but like really, really good engineers. Yeah. Do you care about benchmarks? We care a ton about benchmarks. Like we... But your own internal benchmarks or... We do both. So, you know, plug for Warp, we're number one on Terminal-Bench, which is the public, you know, terminal benchmark, and we're top five on SWE-bench, which is the coding benchmark. And then the only way, in my opinion, that an app at our layer in the stack can really improve is by measuring the progress. And so we have our own

Starting point is 02:09:53 set of evals that we run across all these models as well, which are coming from like real use cases. And that, again, is an advantage of being like a product that's in the wild that has a lot of users is that we can sort of see where the models are failing, where they're working. And so we're very big on that, actually. Yeah. Awesome. Well, thank you so much for stopping by. We will talk to you soon. Sure, you'll have a busy afternoon. Shout-out, by the way, to the OpenAI team. Very, very helpful in working with us to get GPT5 to be awesome in Warp. And one more shameless plug, we have a discount code for people who want to try GPT5 in Warp. It's $5.

Starting point is 02:10:28 It's $5.00 GPT5. Okay. Thank you for having me, guys. Yeah, we'll talk to you soon. Thanks. Cheers. Tyler, any updates from the timeline while you're thinking about what the latest vibe check is in the war between OpenAI

Starting point is 02:10:43 I got one from friend of the show. Linear is a purpose-built tool for planning and building products. Meet the system for modern software development. Streamline issues, projects, and product roadmaps. Go to linear.app to get started. Tool of choice for OpenAI. You have something? From Reggie James, friend of the show.

Starting point is 02:10:59 Half of my timeline says this is the closest we've been to AGI. The other half of my timeline says we officially just hit AI stagnation. I love tech. Well, we will be going deeper deciding whether or not this is stagnation or hyper-intelligence takeoff. And we will be joined by our next guest. Riley from Charlie Labs. Hey, guys. Thanks for having me.

Starting point is 02:11:22 Good to see you. Riley, how are you doing? What's happening? I'm doing fantastic. We've been heads down with GPT5. How long have you had it? How long did you get the preview? I feel like it, you know, it gets rolled out to early adopters a little bit earlier, but it's been weeks, months? How long have you had it? A couple weeks, like two or three. What was the first thing you did with it? How's Charlie liking it? Charlie loves it. And also, I love what Charlie does with it. Yeah. What does Charlie do with it? What was the first thing you did with GPT5? Ran our evals.

Starting point is 02:11:56 Oh, yeah? How'd they come back? Really good. Much better than o3, which was much better than any other model from before that. Interesting. And yeah, so let's zoom out. What do you do? What do these evals measure? Walk me through it. So Charlie is a TypeScript-focused coding agent that operates much more like a human does. So less like an IDE application or terminal, and more that it joins your GitHub and Slack and Linear workspaces. And it interacts with the team the same way other humans do.

Starting point is 02:12:33 And then our evals are a mix of code review, because part of Charlie's job is to review PRs from humans as well as his own, and then code authoring, so opening PRs and pushing commits. So when you develop your own evals, I imagine you try and keep those out of training data. You want those to be held private, is that correct? Yes. And it's getting even harder with web access now because they're too good at finding

Starting point is 02:13:00 things. They're finding everything. That's funny. And then talk to me about like the shape of the actual problems in the eval. Are there some easy questions, some hard questions, some extremely hard questions? Like how are you formulating those? What's the shape of an individual task? Is it out of like 100? How do you think about developing a good eval? A mix of hard to very hard. The easy ones are just a waste of money and time at this point, especially with five. Like there's a bunch that it's just not going to get wrong.

Starting point is 02:13:33 Yeah. Yeah. And then we're mostly doing, the PR ones look kind of like SWE-bench in the sense that we're taking an issue to start with. But instead of giving the issue like in a Docker container already, we trigger a comment on the issue that says, hey, Charlie, go make a PR for this. And then Charlie does its thing and then the PR comes up and then we score that PR against a whole bunch of things like correctness against a known solution that's correct, as well as code quality, testability and some softer things like descriptions. Who are the biggest customers or users for like a TypeScript-focused coding agent? It's a wide range of mostly modern apps, like pretty much any web app these days. It's going to be like a Next.js type app. And then all the way into like back end, like Charlie himself is written in TypeScript.
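
The scoring step Riley describes, where a generated PR is graded on correctness against a known solution plus code quality, testability, and descriptions, might be sketched as a weighted rubric. The criteria names come from the conversation; the weights, the 0-to-1 scale, and the idea of combining them into a single number are assumptions for illustration, not how Charlie Labs actually scores.

```python
from typing import Dict

# Hypothetical weights: the transcript names the criteria but not how they combine.
WEIGHTS: Dict[str, float] = {
    "correctness": 0.5,   # agreement with a known-correct solution
    "code_quality": 0.2,
    "testability": 0.2,
    "description": 0.1,   # the "softer" criterion
}


def score_pr(scores: Dict[str, float]) -> float:
    """Combine per-criterion scores (each assumed to be in [0, 1]) into one number."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    example = {"correctness": 1.0, "code_quality": 0.5,
               "testability": 0.5, "description": 0.0}
    print(round(score_pr(example), 3))
```

Weighting correctness most heavily is a design guess; a real harness could just as easily report each criterion separately instead of collapsing them.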

Starting point is 02:14:27 Sure. Makes sense. And there's very little front end. Anything else? What else you got? I just want to say I love the name Charlie. It's one of my favorite agent names that we've had on the show. Yes.

Starting point is 02:14:38 It's right up there with Pig. And what was the other one? Well, I don't think that was an agent. But yeah, it's a good one. Yeah. Congrats on locking it down. Yeah, what about cost and that side of the business? Is there any movement there or anything that you require movement or you need movement

Starting point is 02:15:01 to really unlock new capabilities in the business or new markets? Not really for us, because we're operating kind of at a human level. We do value-based pricing, so we charge per PR, per commit. And because that's comparing to such expensive actions that humans are doing, the challenge for us is more actually living up to the promise than doing it cheap. Yeah, yeah. Are you having any... But then doesn't the cost reduction announced today, isn't that great for business?

Starting point is 02:15:34 Yeah, I mean, it's good overall, but like, our problem is not that the models are expensive. It's that they're, I mean, they're getting really smart, but we'll always take more. Never enough. For instance, since the beginning of August, we've been testing, 98% of the code that got merged into our code base was written by Charlie. Wow. Not 30, not 50, 98%.

Starting point is 02:15:57 And that's coming through PRs. That's not like autocomplete in an IDE type thing. That's crazy. Yeah, what does that mean for like the future of like, who are you hiring? I imagine that you're still, you know, an engineering heavy organization that's, it's just puppeteering and orchestrating agents. But where do you see like the future of software development as a career path go? Yeah.

Starting point is 02:16:22 Are new CS grads cooked? I think if they get really good at using the AI, no. If they try and take an approach of getting really good at writing code by hand, for sure. Yeah. What we're mostly looking for hiring is people who are able to see things at a much higher level and plan further out because with tools like Charlie, you can write so much more code so quickly that it's like it's more important to see where you're going and take the right path than it is to be able to write it quickly. Very cool. Well, thank you so much for stopping by. Good luck with the rest of

Starting point is 02:16:58 your day. And congrats on an upgrade to everything that you do. Tell Charlie to have fun out there. Have some fun. Thanks a lot, guys. We'll talk to you. All right. Chat has got some. Numeralhq.com. Sales tax on autopilot. Spend less than five minutes per month on sales tax compliance. Sales tax superintelligence. A number of the fellas in the chat got access to five. Break it down. Right says it's pretty good. The writing ability feels a little nerfed. Says the way it writes feels a little programmatic rather than sounding human, reverts to using points even for things like blog posts, and also uses overly complicated language for simple stuff.

Starting point is 02:17:38 Techno Chief says it's crazy fast. Oh, that's good. Ratliff says, yeah, I was just going to say that very, very, very fast. Z. Jean Ahmed says junior devs are barbecued. Tyler, anything from your side before we talk to Guillermo from Vercel? I think maybe a good way to like vibe check at least on the timeline is that it's almost like a 4.5 kind of thing where it comes out, people are like, this model totally sucks, look at the benchmarks, it's like not some massive improvement, it's like, you know, not a step change at all.

Starting point is 02:18:16 But then you start playing with it and it's actually like, okay, this is actually a good model. Yep. Like a lot of the stuff I'm seeing people post, like, oh, that's actually like really interesting output, stuff like that. But, I mean, it seems good. Can we do the green text eval? Green text bench? Yeah, yeah, we got to be TBPN intern. Yes, yes, yes, yes. We'll let you cook on that and then we'll move on to our next guest, Guillermo Rauch from Vercel, coming in to TBPN for the second time. Great to see you, Guillermo. How you doing?

Starting point is 02:18:44 I like the action hall. Thank you. Welcome to the stream. How you doing today? Do you think GPT5 could beat me, you, a couple of the boys here on Dust 2 in Counterstrike? Easily? Easily. Yeah, it depends on the frame rate, right?

Starting point is 02:19:01 Yeah. But on a long enough timeline, we're cooked. We're cooked. But we might frag it short term. We might be faster. Amazing, yeah. Yeah, we got to, I mean, I'm sure we'll get to GPT5, but what's your reaction to the world model stuff from Google?

Starting point is 02:19:16 Do you think, do you have an idea of where that's going as a product? It feels like a GPT-2 level technology, very much a research focused technology. I'm sure OpenAI is working on something too and a lot of the labs will work on it. But what's your theory behind the generative video game world model stuff that's going on? I mean, number one, super fascinating, right? I think when we think about the future, I always think about Jensen's line: the future of applications will be that pixels are generated, not rendered. So as much as we're really excited today that GPT-5 and v0 are really good at writing code that then renders interfaces,

Starting point is 02:20:01 I think it's also cool to dream of a world where we're just going directly from GPU to pixel grid, right? But if you remember like a couple years ago and maybe a decade ago, there was a lot of excitement about video games that were going to be live streamed from the cloud. Yeah, that's right. Where your input, your keyboard, you could have a very thin client, your input, your keyboard, your mouse movement was going to be dispatched to the cloud. We're going to have Google Stadia. Google Stadia was big there and then OnLive was in the

Starting point is 02:20:28 Microsoft got into the game and is still, Microsoft is actually still pulling it, still pushing it very heavily. Awesome tech, but not mass adoption. Yeah. But if you look, a lot of these technologies are being really successful in letting people get more creative and test things out. A lot of the use cases that we see for v0 and vibe coding are almost like a communication

Starting point is 02:20:50 tool. Like I want to prototype something. I want to see what's possible. I want to explore the latent space. And I think those world models are going to be incredible just to inspire what the future of games could look like, right? Just getting ideas for actually then shipping them in real 3D engine models. I think short term.

Starting point is 02:21:07 I think long term, all bets are off. Someone was just saying in the chat, you know, junior devs are roasted or barbecued. I think that's not quite true. Okay. Same for like 3D engine developers. Give us the bull case for junior devs staying off the barbecue. So the bull case, I think, for people in general is that you move from, I mean the progression in the industry has been assistant to agent to team of agents, agent orchestras.

Starting point is 02:21:36 It's still really useful to have a human be the one that's sort of like managing the team. So you're moving from like junior dev to junior eng manager, especially as these tools become more agentic. In the new version of v0 that's coming up really soon, you're starting to notice that v0 sort of splits the task between a little team. You have the designer of the team. You have the PM of the team that's sort of working on the spec. You have the architect. You have the engineer. I don't know if you saw Claude Code announced, I think it's like a slash security review. You think of that as having a security team or a team of agents or security researcher at your disposal.

Starting point is 02:22:18 So junior dev as like a vertical skill, maybe a little barbecued. But junior eng manager, so I think it's just going to be the junior dev is so much more powered in this world. If you allow yourself to be and you keep up with what these tools can do and I think you stay, you know, at the cutting edge. Yeah, I mean the obvious bull case is if you're like, if someone's a college student today, they can learn to code truly AI natively. They don't have to say, oh, we're an AI native organization now. We have to upskill and kind of retrain people how to think. They can just naturally start to think with these capabilities in mind. There was that Sam Altman post about how we'll look back on, you know, 93% of humanity with subsistence farming. And if you ask those

Starting point is 02:23:02 people, what they think about our email jobs, they'd be like, you guys are crazy. And it's almost like in the near future, midterm future, maybe even long term future, it's like the number of individual contributors will be extremely low and almost everyone will be a manager and you'll become a manager much faster. You'll just be managing agents and then you'll be managing people who manage agents. But the job of almost everyone will become managerial. Maybe that's what happens. I don't know. I'm not 100% on, but that's what that made me think. someone asked me yesterday, you know, what do you think the future of the market of monitors looks like? Like, does it stay flat? Do people get more monitors? Because they're, you go to like

Starting point is 02:23:41 Dogecoin trader analyst? Yeah, yeah, yeah, yeah. In the future, everyone has the hedge fund six monitor setup. In the future, everybody's just going to be on their phone. Maybe on their phone. I mean, I've noticed that what, you know, when I was an individual contributor, I had three monitors. I was programming on all the screens. And now, I mean, I use my laptop during the show. And then most of my work is done on my phone, phone calls, and then firing off messages. Yeah, maybe we actually shift away from monitors and go further into voice interfaces. Oh, I call the lead of my agents.

Starting point is 02:24:16 And then that agent relays it to some. I'm very optimistic on voice, by the way, because I've now seen it. We're cooking on a better mobile experience for v0. Sure. And I was going back and forth with my head of mobile. And he was talking to v0 and I was writing it down, and I'm a pretty fast typer. But he beat me with voice using the local model on the phone. So there's still the question of like edge latency versus cloud latency, kind of like what we talked about with 3D.

Starting point is 02:24:45 But I do think voice is going to play an increasingly exciting role in programming, which is kind of wild. I would have never imagined. I've always been about typing benchmarks in WPMs. Voice is coming. Yeah, yeah, yeah. How do you think about competition broadly in developer tooling, code gen? I mean, right now it seems like there's just so much demand. It feels like massive TAM expansion moment.

Starting point is 02:25:09 Every company's ripping. TAM expansion moment, but at the same time, winners will emerge. Obviously, you're playing to win. And, yeah, I'm curious. Yeah, on some level, we're playing both sides of the bet. What we announced today that's really exciting is v0 with GPT-5 support. So you can go to v0.dev slash GPT5 and we'll use GPT-5 in combination with our model pipeline that makes it really good at vibe coding, especially for non-technical folks.

Starting point is 02:25:44 But we also, on the Vercel AI Cloud set of things, we open sourced. Basically, you can create your own vibe coding platform, powered by any model. I was joking about this with Tyler. Code me a vibe coding platform, please. That's right. Make no mistakes. Yeah, vibe code me a billion dollar company. Yeah.

Starting point is 02:26:03 No mistakes. But basically we are giving people that. It's a starter kit. Sure. And by the way, the fundamental question that a CEO asked me the other day was, is vibe coding a product or a feature? Or is it both? You know, it's TBD.

Starting point is 02:26:18 The case for a feature is, okay, so there's going to be lots of systems of record. Think Salesforce, Snowflake, Databricks, and increasingly they're going to incorporate code gen capabilities into their platforms.

Starting point is 02:26:32 They can use a lot of these capabilities that we just open-sourced and you'll go to the existing place where you have the data. Kind of like

Starting point is 02:26:39 what we've talked about for decades of like, are you bringing compute to the data? Are you bringing vibes to the data? Right? Are you bringing

Starting point is 02:26:45 code gen to your own platform? Yeah, you used to bring like a dashboard builder and it would have a couple widgets. And now I can just potentially

Starting point is 02:26:54 If I'm plugged into some sort of data source, some system of record, I could say vibe code this app on top of it. And there's been some retools played in this space, Zapier a little bit. But yeah, I mean, this feels like, you know, we're getting, we're not fully in the just the pixels are generated, but we're, you know, generative UI, generative application on top. And that, and that being bespoke and ad hoc. I also think it's important to understand the line between consumer vibe coding and just generating ephemeral software and websites and things. like that versus enterprises, which will have a lot of different use cases. When I look at the, when I look at the vibe coding market and I see businesses that are that, that are almost entirely consumers just creating things for fun. I think that has to be a tough business because it's a

Starting point is 02:27:41 hyper competitive market and consumers are flaky. They'll create something, you know, for fun, but they'll churn in month two because, you know, it's not, they're not running a real business. Whereas a business knows, hey, we'll pay for this on a, on a long-term basis because we have a use for it all the time from this product manager to an engineer over here to somebody in marketing, et cetera. Yeah, the other side of the equation is how do you make these vibe coding tools work really well for enterprises? Frankly, the most surprising emergent thing that I've learned is just how much demand there

Starting point is 02:28:17 is in enterprises for vibe coding. And this is because a lot of the traditional thing has been the people that understand the business are sitting over here. The people understand the code are sitting over here, and their communication is fraught with peril. They don't speak the same language. They kind of like resent one another. I love to tell this story. I was meeting with a CEO of a very successful company who's telling me that engineers, like asking a feature to his own engineers, felt like petitioning the government.

Starting point is 02:28:46 Even though he's the CEO, he's struggling to like make the case. And please like get me in your next sprint, get him to this feature. So vibe coding actually solves that problem. All of the PMs, designers, marketers, business users that previously only had access to what, like Jira and, you know, to do a little bit. Product management tools and writing PRDs and those kinds of things. They weren't able to ship PRs. They weren't able to, you know, ship software, and now they can. And so the opportunity is how do you actually make this secure?

Starting point is 02:29:22 How do you make it high quality? How do you create guardrails? And those are tricky problems. And I'm really happy that some of them are easy to overcome, at least for us. And some of them are active areas of research. But I think the enterprises really have a strong case for this. Yeah, can you walk me through like tool use? I mean, we were talking to the OpenAI folks about GPT5 being like really like a summation of like standing on the shoulders of giants.

Starting point is 02:29:45 You get a Python REPL. You get a web browser. You get, you know, the ability to kind of run cron jobs now. There's voice and, you know, all sorts of different tools kind of wrapped up into one, multiple models. You can trigger reasoning chains if it wants, it can do all this different stuff. And that's actually the benefit of like,

Starting point is 02:30:00 this isn't just a bigger model. It's like a next version of a thing. It's more like switching from the iPhone 12 to 13 than going from the iPhone to the iPhone 3G. It's not just a new technology that's in there. But in the world of vibe coding, what are the tools that you want to think about adding? I know that basically every vibe coding platform

Starting point is 02:30:23 you know, recommends a database. But I was, we were talking to Harley at Shopify yesterday and there's a world where if I go to a vibe coding platform and I say, I'm building an e-commerce website, it should probably just be like, hey,

Starting point is 02:30:35 I'm going to do Shopify under the hood and I'll vibe code the landing page on top. But how are you thinking about the landscape of like tools that you could pull in full open, because there's open source repos that are like full projects that you could pull in and then just start customizing on top of. It's kind of this big continuum. Yeah, there's a couple layers. On the foundation model layer, what you want is a model that is exceptional at tool calling.

Starting point is 02:30:58 Whether it has built in tools or whether you register them yourself, this is a sort of silent war that has been going on. Like if you talk to devs, what are you optimizing for? Tool calling quality. Why? Because to demystify the word agent, what an agent is, it's a loop of tool calling that builds up context over time. That's all an agent is. So to give you an example, concretely, of v0: v0 is becoming more and more agentic over time. One of the things that it can do is it can take a screenshot of the thing that it's building and reflect on it.
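
Guillermo's definition, an agent as a loop of tool calling that builds up context over time, can be sketched in a few lines. This is a toy illustration under stated assumptions, not how v0 works: the `policy` callable stands in for the model deciding which tool to call next, and the `edit` and `screenshot` tools are hypothetical stubs that return canned strings.

```python
from typing import Callable, Dict, List, Optional, Tuple

Tool = Callable[[str], str]
# Stand-in for the model: reads the context so far and returns
# (tool_name, argument), or None when it considers the task done.
Policy = Callable[[List[str]], Optional[Tuple[str, str]]]


def run_agent(task: str, tools: Dict[str, Tool], policy: Policy,
              max_steps: int = 10) -> List[str]:
    """Call tools in a loop, appending every result to a growing context."""
    context = [f"task: {task}"]
    for _ in range(max_steps):
        decision = policy(context)
        if decision is None:
            break
        name, arg = decision
        if name not in tools:
            context.append(f"error: unknown tool {name}")
            continue
        context.append(f"{name}({arg}) -> {tools[name](arg)}")
    return context


# Hypothetical stub tools; a real agent would edit files and capture the page.
def fake_edit(change: str) -> str:
    return "applied"


def fake_screenshot(target: str) -> str:
    return "background is still light"


def scripted_policy(context: List[str]) -> Optional[Tuple[str, str]]:
    """Fixed script standing in for a model: edit, look at the result, fix."""
    script = [("edit", "dark mode"), ("screenshot", "page"),
              ("edit", "fix background")]
    step = len(context) - 1
    return script[step] if step < len(script) else None


if __name__ == "__main__":
    tools = {"edit": fake_edit, "screenshot": fake_screenshot}
    for line in run_agent("make this dark mode", tools, scripted_policy):
        print(line)
```

The screenshot-and-reflect behavior in the dark mode story corresponds to the second step here: the tool result lands in the context, and the next decision is made with that observation available.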

Starting point is 02:31:32 So today, I live vibe coded to an audience of Web3 and crypto engineers. And I told v0, hey, make this dark mode. And initially, v0 does me dirty. It changes some things with the dark mode. And then it kind of astonished me because I was like, oh, I have to be, and I explained to this audience, it then takes a screenshot, looks at it,

Starting point is 02:31:55 and keeps fixing it. And I was like, this is literally a developer that's alive on autopilot. And the reason that it's on autopilot is because it has access to these tools, like looking at the web browser. Another one is research.

Starting point is 02:32:07 I vibe coded an example of, build me a Substack clone for cryptocurrency news. And the agent didn't know what the cryptocurrency news was. So it started doing research on the internet of, okay, Ethereum passed a certain price and whatever.

Starting point is 02:32:22 And then you're talking about the tools over the internet. So to demystify another topic, MCP is really exciting because it's a new protocol for registering tools that your agent doesn't locally have. So those tools that I just talked about, we gave them to v0. Here's a deep research tool. Here's the screenshotting tool. And those will likely become the new services

Starting point is 02:32:45 when you think about like AWS of today, if AWS was an AI cloud, which is kind of what we're trying to build at Vercel. You think a lot of those tools are going to become as a service. Like, bring me the research as a service, bring me browsing and screenshotting as a service and so on. But then you have MCP, which allows you to, okay, I need to sell something online. All right. So now there's an MCP for Shopify.

Starting point is 02:33:09 Now there's an MCP for Stripe. There's even crypto MCP. So it's really exciting. Like now it's like the ultimate choice for a builder. And you don't have to go and learn all these things. You don't have to, this is almost like a discontinuity of the valley trend of like, if we build amazing documentation, they will come. This is more so if the agent picks you, they will come, right?

Starting point is 02:33:31 And so there's a lot of figuring out right now, like, how do I make my infrastructure? How do I make my product to be loved by these agents? And the MCP promises to be one of these first things that you are in control of. That makes sense. Last question. Someone on your team named Josh is in the chat. He wants to know, what does he need to do to get a Twitter badge? Oh, well, yeah, 100k downloads of the AICLI, I think we've been talking. Okay, okay, okay. Good.

Starting point is 02:34:02 The gauntlet's been thrown down. Thank you. It's on record. You've got your work cut out for you. It's burned into the immutable record of this live stream and the future training runs. Best of luck. We're gonna hold you accountable to that. Guillermo, great seeing you. Awesome. Great to see you. We'll talk to you soon. Congratulations. Talk soon. Let me tell you about fin.ai, the number one AI agent for customer service, number one in performance benchmarks, number one in competitive bakeoffs, number one on G2, number one in having an Irish founder. That's right. And we will invite our next guest to the stream from factory.ai. Welcome to the stream. How are you doing? Good to see you. Hey, how's it going? Glad to be here.

Starting point is 02:34:43 Great. Thanks so much. Kick us off with an introduction on you and the company. Yeah, my name is Eno, co-founder and CTO at Factory. We are building a platform for enterprise software developers to perform what we call agent-driven software development. So basically, more than just code, bringing agents into every stage of the software development lifecycle. So think coding, code review, maintenance, incident response, documentation. We think agents should be a part of all of this. And we think that they should be driving a lot of that menial component while you think at the high level about how to plan and structure the work. There's so many different, like, enterprise isn't a narrow category.

Starting point is 02:35:26 It's, you know, not consumer, I guess, but it's such a wide, it's such a wide category. Is there a beachhead? Is there a certain type of project within different industries or a specific industry that's getting an especially large amount of value out of Factory these days? Yeah, totally. I think that one thing that we see a lot, and typically when we say enterprise, we're thinking greater than 1,000 engineers, right? Like 2,000, 3,000. And one reason why we focus on that larger scale, you tend to have these large organizations where people are, the bottleneck is not code, right? The bottleneck is how do we plan a migration of 185 code bases to this new framework?

Starting point is 02:36:10 and there are 3,000 developers that are going to touch this over the next six months. And an SI just told us the quote is $80 million to do it. And we have to figure out how to not. So re-platforming broadly is one of the major, major tasks for many, many enterprises, right? 100%. Modernization and migration is huge. Yeah, yeah, that makes a lot of sense. How do you estimate that market size?

Starting point is 02:36:39 and is that where you guys are leading with on the GTM side in terms of trying to find these legacy companies that are maybe not even using Cursor yet? I mean, we talked to the CEO of GitHub yesterday, and what? 50% didn't he say? It was like at least half of their user base is not using any AI tools.

Starting point is 02:37:00 Yeah, yeah, totally. I think that the thing that we hear often, we pretty much only deploy into companies that have already tried an AI-native IDE or have an auto-complete tool deployed. And I think the thing that we hear often is you sort of hear like these numbers thrown around like 5x, 10x. And then in practice, when you adopt an AI IDE, you see 10%, 15%. And so a lot of people are sort of saying like what is the delta there?

Starting point is 02:37:28 Like what causes that transition? And our sort of argument here is that there is a workflow change that's actually required to really adopt agents in the life cycle, right? And so if you're just sort of like accelerating an individual developer, then you can go a little bit faster. But if you are able to parallelize and automate at scale, that is going to be that larger introduction of change. And so if you imagine the market here, there are companies where, you know, five or 10 percent of global payment transactions run on some COBOL system that was written 40 years ago, every developer is gone and it's a ticking time bomb. Like at some point it needs to go

Starting point is 02:38:09 to Java, but there's nobody who even knows how to do that. And so those are the types of projects where the market is so enormous because, you know, half the business runs on this legacy system hundreds of billions of dollars. Put it all in Lisp, skip Java, go straight to Lisp. Yeah, Python, right? Yeah. That would be the logical one. I'm sorry, we're running behind, so we're going to have to cut this short, but I want to know more about how the enterprise coding agent market will develop. We could see one world where we wind up with, you know, GCP, Azure, AWS, like, you know, pretty comparable, competitive. They've all had really great margins. It's been this oligopoly. There's another world where you could see more specialization.

Starting point is 02:38:59 One of these companies goes deep into high security environments or oil and gas or financial environments, or specializing based on specific programming languages. As the market develops, like, how do you think it'll play out? Yeah, great question. I think that what's very clear is that the bulk of very large enterprises have a lot of similar problems: refactors, migrations, modernization. So a platform like Factory is able to deploy into that and solve problems quickly. I think that there's likely to be like that sort of 80-20 where there are going to be,

Starting point is 02:39:33 these very specialized providers that only focus on one sort of problem, and that will represent maybe like 20% of what's out there. And so it won't be like necessarily black or white, but we do think that the bulk of enterprises have a lot of similar needs, especially when you just get across a certain threshold of number of engineers, scale of code base. Sure, sure. Yeah, I mean, we even see that with the clouds where, you know, obviously there's the hyperscalers, but then there are neoclouds, and we talked to Armada, where they'll send you a shipping container with a bunch of racks inside and put it in stranded energy. So there will obviously be a long tail here.

Starting point is 02:40:10 That's a great take. Thank you so much for stopping by. Have a great rest of your day. And enjoy the GPT5 upgrade. We'll talk to you soon. Have fun out there. Really quickly. Let me tell you about Attio.

Starting point is 02:40:20 Customer relationship magic. Attio is the AI native CRM that builds, scales, and grows your company to the next level. And we will be joined by our next guest from Augment. Welcome to the stream. How are you doing, Guy? Great. Thanks so much for having me. And that's his name, by the way, if you're listening. His name is Guy. I'm not just calling him Guy.

Starting point is 02:40:39 Anyway, please introduce yourself and what do you do? What does your company do? Yeah, so I'm Guy Gur-Ari from Augment Code. I'm a co-founder and the chief scientist. And we build AI coding assistants for large teams with large code bases. And so you can use Augment Code to do question answering, to do development, to do refactoring, to do migrations, all the tasks that you do, except that our product understands your large code base really well.

Starting point is 02:41:05 And so that means less prompting for you and faster and better results out of the agent. Today, GPT5 launches. It's kind of a rising tide. Feels like it lifts all boats. Every company gets access to it. We've interviewed a number of companies that are building on top of GPT5. Except it drowned GPT-4.5. Yes.

Starting point is 02:41:23 But in general, how do you think you can use GPT5? Are there any pockets of value that you think you can uniquely take advantage of? Yeah, great question. So we've been trialling the model for the past few weeks. And what we found is that the GPT5 is a very thoughtful model. It likes to make a lot of tool calls. It likes to ask clarifying questions of the user before starting to make code changes. And so the place where I reach out for GPT5 is typically, if I need to make large changes or if I'm trying to

Starting point is 02:41:59 answer a very difficult question about the code base. I will let GPT5 take a crack at it. It will churn for a while, making lots of tool calls, just making sure it got it right, and probably find all the different places in the code where it actually needs to make a change. And so I will typically let it run in the background and come back to it, and I will often get

Starting point is 02:42:19 a high quality result out of it. Are there any features or integrations that you're hoping GPT5 will roll out in the future. We talked to a couple of people who are like, like we want models that have access to as many tools as possible. And you can see with the MCP boom, more people are trying to make their services, their products accessible to these models. Is there anything that you see as potential low hanging fruit to just add to the capabilities?

Starting point is 02:42:53 So I think for us, we work hard on developing our own integrations and our own tools, building them into the product rather than relying on GPT-5 or other model vendors to do so. We have worked closely with OpenAI to improve the prompting around our tools so that the agent kind of works flawlessly. I think the thing that would be very nice, I think one of the previous guests mentioned a screenshot tool. I think that's a very, yeah, that's a very nice way to close the loop on front end software development, just like we saw how on back-end software development running the tests,

Starting point is 02:43:28 really helps the agent iterate until it gets to working code. So I think having more support for screenshotting and things like that that close the front-end gap would be very nice to see. I wasn't aware that screenshots weren't flowing through. I feel like when I've triggered Operator, I'm getting a view, a web view into the website, but I wasn't aware that that wasn't like being passed through easily in the API and you still kind of needed to build that yourself. Where else, we were just talking about this,

Starting point is 02:44:02 like where are the biggest pockets of value right now for AI coding tools? Generally, obviously everyone knows like the vibe coder, who's just the designer who's learning how to use software for the first time. Then there's the experienced developer going from a 10x to 100x with better code completion. Then there's the enterprise that's maybe doing

Starting point is 02:44:23 re-platforming. Where else are the interesting pockets of value that are maybe on the horizon to be unlocked with new models? Yeah, so on top of everything you mentioned, certainly the inner loop of software development, that's where we've spent most of our time at Augment Code developing product for. Yes, you can have a senior developer,

Starting point is 02:44:44 start using agents, start to use multiple agents in parallel and unlock 10x or more productivity gains. What we're starting to see now with our tools is the beginning of automating software development lifecycle tasks. So with Augment Code, we have a CLI tool now where you can take the full power of our context engine and the agent, the thing that really understands your code base, and you can start automating tasks in the background. And so we're seeing more and more developers saying, oh, this is great. Like, I can break out of the IDE now. I'm using the agent that's already familiar to me,

Starting point is 02:45:18 but I'm starting to automate code reviews. I'm starting to automate incident response. I'm starting to automate looking at production logs and automatically assigning tickets based on error logs that I'm seeing, all kinds of new automation use cases that we're seeing just because agents have gotten so good and kind of really understand your codebase. Are there high stakes pockets of software engineering work that most of the AI tooling has kind of stayed away from? I'm imagining like the high stakes database migration. Where is the kind of sticky part of the industry?

Starting point is 02:45:57 I was reading a blog post by someone who's doing like very advanced cybersecurity pen testing and they were saying like just the creativity of the models wasn't quite there yet to really act and embody like a white hat hacker who is going for a bug bounty. But where are the pockets of still like intractability, where I guess if you are, you know, an individual contributor and you love just, you know, coding from scratch, that's where you want to stay for at least the next couple months? Yeah, I think still the attention of all the models we've seen and all the agents we've seen around making proper design and architecture decisions, that's still high stakes and still the ability is not there. Because if you do complete vibe coding and you just let the agent go and do whatever it wants,

Starting point is 02:46:52 in the beginning, it looks amazing, the code works and it's all really good. But once you get to low tens of thousands of lines, the bad decisions that were often made around the design and architecture start to show up and development slows down. So that's where we still see a limitation of today's agents and where you still have to supervise the agent fairly closely in order to make sure that you don't get stuck later on. Perhaps this will change in a year, but today I would say all these decisions that you make

Starting point is 02:47:23 around how the code is structured still require close supervision and are still high stakes because it can really slow your project down if you let it go autonomously for long enough. That makes sense. Well, thank you so much for stopping by. We will talk to you soon. Have a good rest of your day. Thanks so much. Cheers. Let's check in with Tyler on the timeline. Tyler's manning the timeline. How are the vibes? Are there any new posts that have hit the timeline? Are we still in turmoil or has the narrative settled? I think vibes are picking up a little bit. You're starting to see people post like, oh, this is something I made. Now you can see on LMArena, it's number one.

Starting point is 02:47:58 No way. Wait, wait. So what's going on with the Polymarket then? So Polymarket is still... Still Google heavy? Yeah, I think, I guess they're just pricing in Gemini 3. Ooh, okay. I'm not exactly sure, honestly.

Starting point is 02:48:11 I was actually very surprised to see that it was number one. Yeah, yeah, yeah. But yeah, maybe later we can show some of the posts. Yeah, yeah, that'd be great. Cool stuff. Well, in the meantime, before our next guest, let's tell you about Eight Sleep. Get a Pod 5, five-year warranty, 30-night risk-free trial, free returns, and free shipping.

Starting point is 02:48:27 And we will have our next guest join us from CodeRabbit. How are you doing? Good to meet you. Good to meet you as well. Thanks for having me here. What's your reaction to GPT-5? How long have you been playing with it? What are the biggest improvements that you've noticed?

Starting point is 02:48:44 Yeah, I would say mind-blowing, right? We have been playing. Our team has been playing for like a few weeks now, tested a few snapshots. It's amazing. It's a generational leap, we would say. Like, we have been using OpenAI models. I mean, how much do you know about CodeRabbit?

Starting point is 02:49:01 It's been a couple of years. We have been on OpenAI, Anthropic. And our product is a very reasoning-heavy product. Like, one of the very few use cases where you have a PhD-style problem and say we have to do code reviews. And that's what CodeRabbit does. Like users open up pull requests, and our agent uses reasoning models to find issues like race conditions or security issues and so on.

Starting point is 02:49:26 So we've been testing GPT-5 on some of the hardest pull requests we have in our golden data set. So we've maintained a data set where we track progress of different models and progress of AI in general. So we have many problems that no model is able to solve so far. Like, I mean, even GPT-5. But so far it has the highest score. We would say it's like almost 2x better than the next, o3 or Sonnet or Opus, at this time.
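[Editor's sketch, not from the stream: one simple way to track "progress of different models" against a fixed golden data set is to score each model's findings against the expected finding per case. The case names, expected findings, and model names below are made-up placeholders, not CodeRabbit's data or method.]

```python
def score(model_findings: dict[str, str], golden: dict[str, str]) -> float:
    """Fraction of golden cases where the model reported the expected issue."""
    hits = sum(
        1 for case, expected in golden.items()
        if model_findings.get(case) == expected
    )
    return hits / len(golden)


# Hypothetical golden set: pull request id -> issue a reviewer should catch.
golden = {
    "pr-1": "race condition",
    "pr-2": "sql injection",
    "pr-3": "null dereference",
    "pr-4": "deadlock",
}

# Hypothetical review outputs from two models on the same cases.
runs = {
    "model_a": {"pr-1": "race condition", "pr-2": "sql injection"},
    "model_b": {"pr-1": "race condition"},
}

scores = {name: score(findings, golden) for name, findings in runs.items()}
ratio = scores["model_a"] / scores["model_b"]
```

[Here `ratio` is what a claim like "almost 2x better" would refer to under this scoring. Real review evals would also need to count false positives, which this exact-match sketch ignores.]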

Starting point is 02:49:50 What's the customer value there? You think that all the customers just notice that the product gets better? Are you going to upsell folks? How do you play this given that this model is now in public availability? Every company, every competitor can access it as well. Yeah, there's no upsell there. That's the thing with AI. For the same price or even better prices, you're getting much more AI, much better

Starting point is 02:50:15 but yeah, that's the whole idea, how fast this space is evolving. So, yeah, from the pricing point of view, we don't see like this to be like a separate plan or something in our product. I mean, for the same price per month, customers will now just get better quality of results with CodeRabbit. What's next for the business? What kind of customers are you going after? Who do you think has been on the fence and this release is going to be the thing that gets

Starting point is 02:50:45 them to actually jump into the world of AI? Yeah, we can track the top line metric. Like, one of the things we track very closely in the company is like how many sign-ups to the paid customers we get. That number has been constantly improving since GPT-4, GPT-4 Turbo. GPT-4o actually dipped. So there was a time when GPT-4o was almost like the Windows Vista of releases. It's funny, like how we kind of trusted the evals and we thought it's the same model, but

Starting point is 02:51:13 you know, it was inferior in many ways. But then we saw a huge improvement after o1 came out. o1-preview was a game changer for us. Even at that time, our conversion doubled actually. Right. I mean, so we went to like more like, close to 30% success in getting the paid users. And now with GPT-5, we're hoping we can see another big jump

Starting point is 02:51:36 in the number of people who start becoming paid customers and how many people churned. So those are the real numbers. Like one is like vibes, like how people like respond to the model and we get angry tweets or not, I mean, that's the other part. But the other thing is like the actual revenues, whether it moves the needle for us,

Starting point is 02:51:52 and that remains to be seen. Like one of the things we have seen, even though you test these models in a lab, it's not like a huge data set, but once you actually are in the wild, you see hallucinations, some of those issues at scale pop up. So those are something we'll still be observing over the next few days to see whether it's like smart only like, 80% of the cases,

Starting point is 02:52:11 but then if the false positive rate, the hallucinations are too high, then also it's not a great model, but that remains to be seen. Yep, that makes a lot of sense. Well, thank you so much for stopping by. Congratulations on a new tool in the tool chest. New toy. We will talk to you soon.

Starting point is 02:52:27 Have a great rest of your day. Cheers. Goodbye. And let me tell you about public.com investing for those who take it seriously. They got multi-asset investing. Industry-leading yields. They're trusted by millions. Millions.

Starting point is 02:52:39 The chat is going wild about public trading, the SPX 6,900. I think that comes from someone talking about like the non-Mag 7 stocks or something. There's been people benchmarking the Mag 7 versus the. The big news while we were live or earlier today, Trump signed an executive order that is opening up 401(k)s to digital assets and private equity. What's crypto doing? Is it ripping? The coin is up a couple points last time.

Starting point is 02:53:11 This point, you know, where's it going to go? Oh, it's already so high. I mean, it's just like there's been so many catalysts. It could go up, it could go down. Yep. We'll have to wait and see. Tyler, anything else notable from the timeline? What have people built?

Starting point is 02:53:26 I see this GPT5 just one-shotted a Minecraft clone. Yeah, I think that's one of the cooler things I've seen. Okay, so this is, so it wrote, it wrote code to generate this game. It's not generating the pixels. You can do so many different things. Like you could generate a video, you generate a world model, generate code that generates a game engine,

Starting point is 02:53:47 you could generate code that runs on Unreal Engine. I don't even know what they're using. One thing now in actual ChatGPT, there's like a native, it's like a music player, it's almost like GarageBand. You can say, like, if you prompt to like build, I saw Sam Altman tweet about this. You prompt like to do some kind of like beat or something.

Starting point is 02:54:04 It'll like make an interactive, like, GarageBand, almost, interface in there. That's cool. I was playing with that earlier. Yeah, I do wonder how many of these features that we're seeing, like, where does OpenAI want to keep things in the B2B world and let other companies build versus just build it as a consumer app? Like, will ChatGPT eventually just let me push your website? Like, will it become a vibe coding platform?

Starting point is 02:54:29 At least like a basic one. Like it's not the most advanced coding environment, but it can definitely write some code and execute it for you and do some stuff. Yeah, well, it's funny because like it used to be, you would have a, so there was like GPT-3.5 or something and people on top of that built a vibe coding thing. So you could use that to build your own vibe coding thing. But now you can just go straight from ChatGPT to build your vibe coding platform. Yeah.

Starting point is 02:54:56 But soon maybe it'll just be the vibe coding platform. Yeah. The surface area of this stuff is very interesting. Clearly they're going after healthcare and therapy. It's interesting that they've kind of stayed away from legal. Maybe that's just the dynamic of the sales process and the dynamic of that particular market. But increasingly you can just ask more and more questions

Starting point is 02:55:18 of ChatGPT. So the consumer to business bleed over, there's certainly a world where just giving everyone in your organization ChatGPT is a substitute for a bunch of different SaaS products. So it'd be interesting to see where that develops. What do you think about? NIR says there are concerns that the number used

Starting point is 02:55:37 to represent our AI's intelligence does not in fact represent its intelligence. Worry not. To address these allegations, we've added three new numbers. Near. Yeah, Near is building something that's like not particularly benchmarkable, right? Isn't it a companion?

Starting point is 02:55:53 It's beyond benchmarks. Beyond benchmarks. Well, in completely other news, Anduril opens a Taiwan office and begins selling AI powered attack drones to Taiwan. Palmer Luckey has said he wants to turn Taiwan into a prickly porcupine. We're in the age of spiky intelligence.

Starting point is 02:56:11 spiky intelligence will be onboarded onto the AI powered attack drones and deployed in Taiwan to keep it safe. What else is going on in the timeline while we wait for our next guest from OpenAI to join? Spor says raise your hand if you were not automated today. I'll raise my hand. I was not automated today. Not yet. We survived. We made it through.

Starting point is 02:56:32 Sebastien Bubeck says here at OpenAI we've cracked pre-training, then reasoning, and now we're experimenting with a new set of techniques that maximally leverage their interaction. GPT-5 is just the first step in this direction. We're excited to, incredibly excited to see where scaling this up will lead us. And it's the unicorn test, I believe. And the latest unicorn is really, really good. That is a creative interpretation.

Starting point is 02:56:58 And I think it has to draw all this with like SVGs. Anyway, we can talk to our next guest about it. Last post, GiroTicket says, I went to the permanent underclass party and everyone knew you. Anyway, back to the serious interviews. Welcome to the stream, Max. Good to see you.

Starting point is 02:57:14 How are you doing? What's happening? Nice to meet you guys. Yeah, doing well. It's a relief to have this launch out in the world. I think it's, you know, we've been working on this for the last few months now, and it's exciting to let the whole world see what we've had. Just a few months?

Starting point is 02:57:30 It's been, I don't know. It's been a little while. What's the actual launch day like? Because you're actually getting this out into the world. the GPUs are on fire or about to be on fire warming up. But is that out of your purview? There is a different team for that, fortunately. So, right.

Starting point is 02:57:47 So I run a lot of the research for GPT-5. I don't necessarily handle the deployment, but I do get dragged in when the GPUs are on fire. I think we're moderately burning right now. Okay. Like a two alarm fire. Yeah, yeah, yeah. Is it materially different?

Starting point is 02:58:04 I mean, this is a launch day, but we'll probably discuss like the Studio Ghibli capability once it gets out into the long tail of like, you know, hundreds of millions of people try it. Someone comes out with some genius thing, then everyone's doing that and then the GPUs. Because I feel like the Studio Ghibli thing happened like a few days after the launch of images in ChatGPT. It did. It was, it was pretty fast, but within about a week. I think in this case, we're going to see that here. Okay. I think coding, you know, if I had to take my bets for what the Studio Ghibli thing is going to be, it's coding.

Starting point is 02:58:36 that's the place where I think GPT-5 is like most tangibly, hugely ahead of GPT-4 and ahead of o3. Do you think there's a chance that the coding will mean a Studio Ghibli style meme or kind of like, and what I mean by that is that like image generation is incredibly valuable in the context of like Hollywood will be using AI to chroma key and rotoscope in a professional environment. But yeah. What was special about Studio Ghibli was that anyone was making these custom images, and I could imagine a world where, you know, even going from like the levelsio example of like, I vibe coded a flight simulator, if we wind up in a Studio Ghibli moment for coding, I would imagine it's like everyone built their own game today. I think that's pretty much it. Yeah, so I don't know if you guys watched it, but that was one of the things we had on the live stream, like you can just go into ChatGPT.

Starting point is 02:59:32 If you try it right now it might or might not work because the GPT-5 rollout is still ongoing. But if you have 5, you can just tell it, like, basically make me a game. Yeah. And it will make it, and you can actually play it in chat. That's amazing. So, yeah. Is there the ability to discover that?

Starting point is 02:59:47 The thing is, like with Studio Ghibli, right? Like, for Ghibli, you don't have to know how to draw to make it work. For this one, you don't have to know how to code. Yes. So can you share that chat and someone else can play the same game? How does the kind of sharing mechanism work? Yeah, you can do the share link. We're, I think, going to try to make sharing for these a lot better over the next few days.

Starting point is 03:00:07 That was P2 after the P1 and P0 of making the GPUs not completely melt. Yeah, yeah. But yeah, we will try to make it much more shareable. Yeah, yeah. I mean, the Studio Ghibli thing is so interesting because it was, it's not just that the model capability was there, but it's also like the prompt was two words. And it was so reliable that you always got a good result. And you could personalize it.

Starting point is 03:00:32 So even if it wasn't, I've seen people build Doom. I've seen people, you know, you can just buy Doom. It's a real game. You can build it. But if you build it and I'm like, oh, that's cool. You did it in a vibe code environment or in chat GPT. Like that's awesome, but I don't necessarily want to go do that for myself. But as soon as it becomes personal, which is what the studio,

Starting point is 03:00:50 like I had to see what I looked like as Studio Ghibli. I had to see what my favorite photo looked like. My favorite meme looked like in Studio Ghibli. And once that happens with games, people will eventually, you know, there'll be this memetic explosion and you'll see the GPUs will truly be on fire. Yeah, I mean, I think even today you could probably with GPT-5 do Doom, but all of the characters are like, all the enemies are head shots of your friends. Like, here we're going, now we're, that will just work. Real close. Yeah, we're real close. It's going to be something that's personal, something that, you know, you can express

Starting point is 03:01:20 your own creativity through, because I think people, they still latch on to that. They don't just want, you know, a copy of what already exists. They want something new, and the Studio Ghibli moment was just new enough. Anyway, we should talk about actual research. We should talk about post-training. What's the thing you're most proud of? Like what can you give us on without immediately getting poached? What can you give us on the actual innovation that went into GPT5 from a post-training perspective? What are the kind of keywords and paths in the tech tree that we should be digging into over the next few years to understand how this works?

Starting point is 03:01:54 You know, I would say the thing that is most impressive to me about GPT-5 is how much getting all of the details right matters. Like when I look at GPT-5, you know, we had an early version of this thing a while ago that was kind of okay, but clearly did not meet our bar for revolutionary. And we were trying to figure out, you know, why is that not as good as it should be?

Starting point is 03:02:17 And the team basically just went off and did a deep dive over a couple of months of just completely rebuilding the post-training stack for this model. And it turns out that when you do that, you get what would have taken, you know, another order of magnitude worth of pre-training improvements to produce. How much are you thinking in post-training, in research, about, let's forget the benchmarks and just focus on user satisfaction, like NPS score basically, or like user minutes or any of

Starting point is 03:02:48 these other, the real benchmarks? Yeah, the intangibles. Profit, revenue. But also just, yeah, the feeling and the joy and the actual value that's delivered, because Studio Ghibli was a delightful moment. It wasn't a benchmark. Yeah, I think, so that was something that we took very seriously for GPT-5. It was like, look at what people are actually doing with ChatGPT

Starting point is 03:03:09 and look at where the model is failing them. Either in the sense that the model is, like, sort of like you said, it's not enjoyable to use. Yeah. And so we did, I think, make a lot of progress on that. Like GPT-5 is much more engaging than our previous really smart models. Like, o3, I don't know if you guys talked to o3 in the past. It's a bit bland.

Starting point is 03:03:29 Sure. And GPT-5, I think, has a lot more character, is a lot more interesting. But then also, I think, we really care about just actually being accurate. If a user is trying to do something economically valuable with our

Starting point is 03:03:45 model, we want to make sure it lands correctly. And so what we did there is just, like, look at the actual distributions of what people are doing with our models in the real world. Figure out where the models are going wrong. Build interventions to target it. And that was where we got, I think, the most impressive improvements in GPT-5. Like, o3 would just get things wrong and not tell you it wasn't sure it was correct.

Starting point is 03:04:07 And GPT-5 is much, much better about, like, actually being honest when it thinks it might not know. Yeah. How explicit are all the different pieces of the post-training pipeline? Like, you have, you know, safety post-training. You have stop hallucinating, give me the real facts. You have make sure the text, the flavor, the tone is pleasant. There's so many different things to optimize for.

Starting point is 03:04:33 How much of that is like try and just blend it all up into one thing versus like explicit passes, chunk it out, like split it up? How much can you decompose the problem? So, you know, my background is in reinforcement learning. And I think when you look at something like this, the magic is in the reward function, right? It's in what you're actually telling the model to be good at. And so fixing things like hallucinations, to a huge extent, is essentially a function of just fixing the reward function.
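[Editor's illustration] The point about reward functions can be made concrete with a toy example. This is only a rough sketch of the general idea (score a response so that a confident false claim costs more than an honest "I'm not sure"), not OpenAI's actual reward function. The `Claim` type, the grader-supplied `is_true` label, and all the weights are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Claim:
    """One factual claim extracted from a model response (hypothetical structure)."""
    text: str
    is_true: bool   # assumed to come from some external grader
    hedged: bool    # True if the model flagged that it was unsure


def toy_reward(claims: Iterable[Claim],
               true_bonus: float = 1.0,
               false_penalty: float = 2.0,
               hedge_reward: float = 0.1) -> float:
    """Sum a per-claim score; all weights are illustrative assumptions."""
    total = 0.0
    for claim in claims:
        if claim.hedged:
            # Admitting uncertainty earns a small reward regardless of truth.
            total += hedge_reward
        elif claim.is_true:
            total += true_bonus
        else:
            # A confident false statement is penalized more than a true one is rewarded.
            total -= false_penalty
    return total
```

Under these assumed weights, a confidently false answer scores -2.0, a hedged answer scores 0.1, and a confidently true answer scores 1.0, so a policy optimized against this signal is pushed away from confident falsehoods.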

Starting point is 03:05:04 Actually making it so that the model is reliably penalized for saying something that's false. And if you do that, all of a sudden, the model stops saying things that are false. Ditto for safety, right? You know, on the live stream, Sachi talked a bit about the way we've changed safety for this model. And to a huge extent, it's just a function of, we're actually putting out a paper today on the new safety stack for this model. And the core insight in that paper is just figure out what you actually want to optimize for, which in our case is helpful,

Starting point is 03:05:31 while not saying something that's actually dangerous or harmful. You know, write that down, figure out what that means as a reward function, and optimize for it. It's really not magic at all. It's just, again, it's what I said earlier. You've got to get the details right. You know, if at any part of that process you screw it up,

Starting point is 03:05:48 the model will be unusable. What's your current thinking on spiky intelligence? And is there some flywheel that you can get started where you're identifying low points that aren't spiky enough, and then you're almost automatically setting up the infrastructure, the eval to then RL against, to create a spike? I think GPT-5 was a preview of what's possible

Starting point is 03:06:19 in that respect in the future. Yeah, a step in that direction. Do you think that there's a world where you get to a place where you're kind of, it's weird because we're not hammering down the nails of the spikes. We're adding spikes, but. Hammering up the spikes, yeah. A metaphor that we're stretching a little bit too far. But is there a world where you can be doing post-training or just adding capabilities

Starting point is 03:06:42 in a more iterative cadence so that as soon as you identify something, the response can be, yeah, we don't need to wait until GPT-6 to fix this. We can just add this capability because, hey, we just found a pocket of users who are trying to do a thing and they're not super happy with the results. And let's add this capability. Yeah, I think so. I mean, I think we are going to launch other models between now and GPT-6. I think it's relatively common knowledge, but we do update the model in ChatGPT reasonably often. Yeah, people talk about it all the time. Yeah, exactly. And, you know, I think we are now in a world where we can conceivably update that model

Starting point is 03:07:18 and have it get materially better on capabilities too, not just "the personality is a little bit better than it was before." Yeah. Going back to your note on the new paper that I guess you guys are releasing today, when you talk about optimizing for helpfulness, is part of that avoiding the model reinforcing, there's times when you want to reinforce and give kind of confidence to the user that they're going down the right sort of thought process and things like that.

Starting point is 03:07:49 But then there's like a point where it can get too extreme in terms of maybe convincing a user of something that may be totally untrue. Is that what the paper gets at? So it's not specifically about this, although I will say we do explicitly train the model to not lead users down bad paths. That's something that I think we've started taking much more seriously over the last few months. As we've realized, Sam talked about this a little bit, I think back in May. But ChatGPT is just way more important for people's lives now than it was a year ago or especially two years ago. And we do have to actually be very cognizant of what effects our models have on users. So yeah, we do very actively train models to not lead users down the wrong path.

Starting point is 03:08:34 Don't fact check me on the releasing today. I know we're releasing it. I believe it is soon. I think it's today, but I've also been in a hole dealing with launch all day. Yeah, we're not big on fact checks here. We're big on the truth zone, which is just the vibes. The vibes are we'll be publishing some information about the new safety setup. At some point.

Starting point is 03:08:53 That's great. Yeah, I think a large part of the conversation around safety should be how reliant and how useful the product has become to users, and then the new level of care that you have to provide versus a while ago when it was just like people making a cute image or generating some text that they were going to use in an email or an internal document, and realizing this vector of usage, which is this companion confidant that is becoming so prevalent. Talk to me about post-training for big partners, enterprises,

Starting point is 03:09:36 government organizations. What is transferring from the research that you're doing to something that can be offered as an enterprise-level product? Yeah. So OpenAI does partner with external companies to do essentially custom post-training. That is a thing that we do. And from that perspective, the stuff we do just directly transfers.

Starting point is 03:10:00 I'll also say that we've put a lot of work into trying to make our models as general as possible, so that to as large an extent as possible, if you want to get really good results from our model, you can do it right on the API just by actually telling the model what you want it to do. Yeah. GPT-5, I think, is pretty comfortably our most steerable model ever. We've heard a lot of really positive feedback about this,

Starting point is 03:10:21 especially from folks like Cursor. Yeah. So if I came to you and I was like, I'm an enterprise and I need to generate a lot of studio Ghibli's, you'd be like, what are you doing? Just prompt it. But what are the examples of companies and organizations? Is it just private information, private data sets that aren't available on the open web?

Starting point is 03:10:44 or is it specifically like there is enough data out there, but there's just not the economic incentive for your team to go and RL on, you know, gas station bench or whatever we're talking about here hypothetically. I think the answer is both. Yeah. It's definitely both. Because yeah, we're not going to target, you know, as you said, gas station bench. Because it's not what people are doing with ChatGPT.

Starting point is 03:11:06 Not on our own right now, probably, because it's not mostly what people are doing with ChatGPT. Exactly. You have some application that's super valuable to you. Yeah, yeah. We can be convinced that it's important. Yeah, yeah, yeah. It's just not what our users are already trying to do. What's the state of reward hacking and fighting that in RL environments?

Starting point is 03:11:26 You know, I think we've actually made a lot of progress. There was some discussion of this around o3, that o3 was like a little bit deceptive in ways that felt reward hacky. And GPT-5 is dramatically less deceptive than o3 was. What's an example of how that would manifest? Like, do you have like a canonical case study? Yeah, I mean, the canonical thing is like you ask o3 to write you some code, and instead of actually writing some code, it changes some unit tests. It changes the test case, right? Which is kind of hilarious. It's like one of the funniest things that AI has ever done. I understand that is very bad, and it's not what we want. But it is just like, it's kind of cheeky in my mind. It's kind of cheeky. It's also like, you know, I think if you spend enough time around real software engineers, they do actually do stuff like this pretty often. I have 100% done that.
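[Editor's illustration] The "edit the unit tests instead of the code" failure mode suggests one simple guard in a coding RL environment: give no reward if the patch touches test files at all. The sketch below is a hypothetical illustration of that idea, not a description of how OpenAI's environments work. The path conventions (`tests/` directories, `test_*.py` and `*_test.py` file names) and both function names are assumptions.

```python
from pathlib import PurePosixPath
from typing import List, Sequence


def touches_tests(changed_paths: Sequence[str]) -> List[str]:
    """Return the changed paths that look like test files (assumed naming conventions)."""
    flagged = []
    for raw in changed_paths:
        path = PurePosixPath(raw)
        in_test_dir = any(part in {"test", "tests"} for part in path.parts[:-1])
        looks_like_test = path.name.startswith("test_") or path.name.endswith("_test.py")
        if in_test_dir or looks_like_test:
            flagged.append(raw)
    return flagged


def gated_reward(tests_passed: bool, changed_paths: Sequence[str]) -> float:
    """Reward passing tests only when the patch left the tests themselves alone."""
    if touches_tests(changed_paths):
        # Passing by rewriting the tests is treated as a failure.
        return 0.0
    return 1.0 if tests_passed else 0.0
```

A patch that only edits `src/app.py` and passes the suite earns 1.0, while one that also edits `tests/test_app.py` earns 0.0 even if every test passes. A real environment would likely need something less blunt, since some tasks legitimately require changing tests.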

Starting point is 03:12:09 I was going to say I also have done that. For formal reasons, I won't say that I did it at OpenAI, but back when, I definitely did that. Yeah, of course, of course. This is natural. What do you think GPT-6 looks like? You mentioned that you're going to be shipping updates to five, but what are you most excited about? Where are you most excited about going from here? And just really quickly, give us the date that GPT-6 launches? Oh, man.

Starting point is 03:12:37 Hopefully six launches is a complete surprise to everyone. I think that would be ideal. Like a Beyonce album. Oh yeah. Hopefully five just makes it and says, hey, it's ready now. It's ready now. If you want to hit. Yeah, I think that would be a great thing for six, actually.

Starting point is 03:12:49 I would love for six to do all of the launch comms and to do the live stream. That would be great. Live streaming is, that's the real AGI test. For sure. I feel like we're not that far off, actually. I don't know. We're getting there. I mean, video synthesis maybe, but, you know, talking through a script for 30 minutes,

Starting point is 03:13:08 come on, the model's got to be able to do that. For sure. Well, yeah, that'll be the next Sora launch or something. We'd love to have you back on. But thank you so much for taking the time today. We'll talk to you soon. Great to talk to you guys. Congratulations.

Starting point is 03:13:19 Cheers. Bye. Congrats on the launch. Let me tell you about adquick.com. Out of home advertising made easy and measurable. Say goodbye to the headaches of out of home advertising. Only AdQuick combines technology, out of home expertise and data to enable efficient, seamless ad buying across the globe. And we have Scott Wu from Cognition coming in the studio for the fourth, fifth time.

Starting point is 03:13:37 I can't keep track anymore. Thank you for taking the time. Thank you for coming back. It's great to see you guys. How's it going? It is fantastic. Got to be honest. Great week to be an application layer company.

Starting point is 03:13:48 I got to tell you guys. I was about to say, this is the best thing for you ever. Open source. Another win for Scott Wu. Wow, wow, wow, wow. So yeah, how big is this? Are we in the Uber Lyft territory where you're going to be, you know, in price competition between Anthropic and OpenAI?

Starting point is 03:14:07 Going back and forth, like, what is the real benefit to your business right now from today? Yeah, yeah, for sure. So, first of all, obviously, massive capability gains across the board. I think really, really impressive work that OpenAI has put together. You know, people have talked about what's going on in the AI coding model race. And I think by a lot of accounts, you know, Anthropic has generally been ahead for a lot of the last year, honestly.

Starting point is 03:14:30 And I think at this point, OpenAI has very clearly caught up. And it's pretty neck and neck, I'd say, between the two right now. So very exciting to see all this unfold and to see what's next. But I think from our perspective, yeah, I mean, code is just such a core capability-pilled use case, I'll call it. And so, you know, being able to work with smarter and smarter models and do a lot of the work that we do, it just means that both Devin and Windsurf can be a lot more capable, a lot more intelligent, can predict what you want to write or what you want to do with a lot higher accuracy. Yeah, it's almost like surprising that given,

Starting point is 03:15:08 the cultural rigor at Cognition that you're not doing fundamental frontier research. So can you walk me through, like, what is the focus of being an application layer company? Is it UI, go to market? I'm sure it's all of these. But in terms of the hardcore software engineering, like what is important to get right? At some point, there's fine-tuning and post-training, but is that moving back into the purview of the foundation labs? Or is there still work that you want to do on top of the models or on top of the APIs? Yeah, yeah, it's a great question.

Starting point is 03:15:50 I mean, I think the core of being, you know, an applied lab is really just focusing on a very particular use case, on delivering real, just very direct results. And I think, you know, the foundation labs are obviously, you know, incredible at training base models and all this pre-training and all of the work that they do there. I think from our perspective, we want to work on a lot of very particular capabilities that apply to software engineering in particular, and then obviously run the whole stack from there to building a product, figuring out the interface and the UX. And then obviously bringing that to market and selling that.

Starting point is 03:16:27 On the capability side, there's a lot of particular stuff where, you know, one way to put it is, I think the base IQ is very much already there in the models. you can see the raw problem-solving ability. And I mean, we've gotten some pretty insane results, you know, getting a gold medal at the IMO or all of these other things, right? You called that, by the way. Yeah, you called that. I think the first, I mean, I was, I mean, we were one point away to be fair a year ago, right?

Starting point is 03:16:51 So it was on the way, I'd say. But so, you know, you can really see the general intelligence improving with every single model generation. On the other hand, for Devin, obviously, you know, it's a very clear, like, step up in the general intelligence, but also you want to be able to have, you know, if you ask Devin to go debug your Kubernetes or to go and, you know, look into your error logs and figure out what went wrong or things like that, there's often a lot of very specific capabilities. And that's where we find that, you know, the post-training and the RL is most effective, and a lot of

Starting point is 03:17:26 the kind of various work around the models that turns out to be useful. What about speed? A lot of people that have gotten access to GPT-5, or at least people in our chat, are reporting that it just feels really, really quick. How is that over time going to impact the, I think a lot of people, you know, if they're using Devin today, task Devin with something and then maybe they go work on something else for a little bit or they're running multiple agents concurrently. But at some point, the agent could get so fast that you're just sort of like watching it work in real time and you actually want to be engaged. But are we there yet? Is it still a ways out? What do you think?

Starting point is 03:18:05 Yeah, it's a great question. I think in general, async will continue on as a paradigm, even as the models get faster and faster. One of the reasons that it should, by the way, is because there are a lot of real world thresholds that start to matter. Like, at some point, you're actually spending less time on token generation in the Devin life cycle, and you're spending more time every time Devin runs the command to go install packages or Devin running the unit tests or, like, Devin pulling up the front end by itself or things like that that obviously take real world time, right? I think we are honestly getting closer and closer to that threshold. But yeah, so long story short, I think in the asynchronous mode, yeah, these things will get faster. You know, we'll see those gains, or we'll be able to spend a lot more time, for example, thinking about a single problem relative to the amount of, like, real world clock time that gets spent. I think the synchronous use cases are where we'll see things really, really, you know, explode with speed, which is, you know, Windsurf and Cascade, for example, where we see the speed gains really, really matter.
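[Editor's illustration] The threshold described here behaves like Amdahl's law: once token generation is fast, the fixed real-world time (installing packages, running tests) dominates the end-to-end run. A small sketch under made-up numbers follows. The function name and the 60-second and 40-second split are purely illustrative assumptions, not measurements of Devin.

```python
def wall_clock_speedup(token_seconds: float,
                       tool_seconds: float,
                       model_speedup: float) -> float:
    """End-to-end speedup when only the token-generation share gets faster.

    token_seconds: time spent generating tokens in one agent run
    tool_seconds:  time spent on real-world steps (installs, tests) that do not speed up
    model_speedup: factor by which token generation gets faster (must be > 0)
    """
    if model_speedup <= 0:
        raise ValueError("model_speedup must be positive")
    before = token_seconds + tool_seconds
    after = token_seconds / model_speedup + tool_seconds
    return before / after


# Assumed split: 60 s of token generation and 40 s of tool time per run.
ten_x = wall_clock_speedup(60.0, 40.0, 10.0)       # 100 / 46, roughly 2.17x overall
ceiling = wall_clock_speedup(60.0, 40.0, 1e12)     # approaches 100 / 40 = 2.5x
```

With that assumed split, a 10x faster model makes the whole run only about 2.2x faster, and no model speedup can push it past 2.5x, which is one way to see why the asynchronous hand-off pattern persists while synchronous, token-bound tools benefit more directly.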

Starting point is 03:19:08 Speaking of Windsurf, give us the update. The chat wants to know about the Windsurf team and the 80-hour demand. How have the buyout offers gone? What's the internal response been? Where'd that idea even come from? Yeah, yeah. Look, people are stoked, honestly. And I think from our perspective, it's obviously really important to kind of just like

Starting point is 03:19:31 unite and get to the point where we can just be one culture and one kind of shared set of values. And this is how things are at Cognition. You know, it's a pretty busy time. Like we are at the inflection point of code and we work like that too. And so I think a lot of it for folks is just kind of like, you know, we want to make sure folks who really want to do this with us, you know, make that conscious decision to opt in. And for anyone who doesn't, obviously we totally understand that there are a lot of talented

Starting point is 03:20:03 folks that maybe that's just not the right thing for them right now, or not at this time. And so we wanted to make sure that they were well taken care of too. And to be clear with the buyout offer, that's on top of the actual acquisition deal that already went through. They already got their vesting. So, yeah, I was thinking of the roller coaster. It's like, you have the OpenAI deal, then the Google deal, then the Cognition deal.

Starting point is 03:20:25 And then they're like, wait, these guys work really, really hard. I don't know if I'm cut out for this. And they come back up again where they're like, wait, I can just go, you know, take a sabbatical and figure out my next thing. It's a great outcome. Yeah. Yeah. No, it's obviously, you know, overall is a killer team that's been through a lot.

Starting point is 03:20:42 And so I wanted to make sure that they're well taken care of. That's fantastic. Anything else you can tell us about the integration of Devin and Windsurf? How are the teams getting along? How do you see the products playing together in the long term? Obviously, cross-sell seems really obvious. They had the go-to-market team as well. But how else are you thinking about the interaction maybe over the longer term there?

Starting point is 03:21:05 Yeah. Yeah, yeah, for sure. Yeah, a lot of obvious integration on the team, as you mentioned, with cross-sell and so on. I think the thing that's really exciting on products, which I think actually comes along with these capability increases, is, you know, as the capabilities keep getting better, you start to take on harder and harder tasks with AI and with full agentic workflows, right? And I think there's an interesting thing that happens where for a lot of the harder tasks, you really actually do want to go back and forth between a synchronous and an asynchronous mode, you know? And that's for a few reasons. You know, one of the reasons, obviously, is because there's a lot of review and a lot of, like, looking at the pieces and thinking about, you know, all the minutiae and the details of what you're implementing. I think another big reason for it is, you know, when you get started on a larger project, you know, let's say you're sitting down as an engineer and you're saying, all right, I'm going to go build this whole project today.

Starting point is 03:21:54 You yourself don't actually know all the tradeoffs that you want to make, all the decisions that you want to make and so on, right? And so having a format where, you know, for the decisions that need you to be there and you're involved setting kind of the strategy or figuring out at a high level what should happen, you're able to do that in a nice synchronous environment, which is naturally the Windsurf IDE, right? And then for the parts of the task that you can actually hand off and have an agent work on, you're giving that to Devin. And figuring out how you go back and forth between those is super interesting. So Wave 12 on the way soon. We'll have a lot more to share. Last question. Yeah.

Starting point is 03:22:27 Hit the soundboard, Jordy, for that. For Wave 12. Wave 12. Fantastic. Last question, we'll let you go. What is your probability that AI will get a perfect score on the IMO next year? Oh, interesting. So, by the way, we just had the IOI, which is the programming version, like the programming

Starting point is 03:22:47 Olympiad, and I think there's a good chance that we'll have a gold medal at the IOI for this year announced as well. I think perfect score for next year. We as in humanity, or we? We as in Cognition. As in humanity, yes, yes, yes. An AI perfect score. Yeah. Sorry, an AI gold medal.

Starting point is 03:23:03 Right. Perfect score in the IMO next year. I think it's got to be north of 50. Honestly, I would put it around like 75% or so. Okay. Well, thank you so much. We'll be following it closely. And good luck to you and congrats on all the progress.

Starting point is 03:23:19 Very fantastic. We'll talk to you soon. Awesome, guys. Thanks for having me. Bye. Let me tell you about Bezel. Getbezel.com. Your Bezel concierge is available

Starting point is 03:23:27 now to source you any watch on the planet, seriously any watch. And we are joined by our next guest, Claire Vo, from ChatPRD. Welcome to the stream. Claire, how are you doing? What's going on? It's a fun day today, isn't it? What was your reaction to the stream? What was your reaction to GPT-5? You know, GPT-5, the first thing I said, and I got a little early access, is I said, it's a developer, for developers, by developers. This thing is built to be a software engineer. You've seen a long string of your guests come on and really speak about the coding abilities of it. And what I think is interesting about this particular model, especially because we're seeing them deprecate the old models in the ChatGPT experience.

Starting point is 03:24:07 And we're seeing a lot of positive feedback. But I do think there are drawbacks to a model that's so clearly tuned to a developer use case. And as somebody who's building an application that isn't focused on agentic coding, I have noticed some personality quirks that are going to be really interesting to see how they shake out as we roll out this model to our users. Walk me through those. What are the... Yeah.

Starting point is 03:24:32 What's the timeline? How much, like, how much time do you have to kind of move users over to five before? Yeah. Yeah. So, I mean, I think we have tons of time from the API side to move users. And in fact, you know, our strategy at ChatPRD is not to just upgrade to the latest model. I know Zach at Warp said, like, why wouldn't you want the latest intelligence? And the reality is because we're doing a lot of business strategy and business writing, I actually want to validate

Starting point is 03:25:00 with our users that they're getting the quality of strategic thinking, output, writing that they really want. So we actually A/B test every single model rollout and really evaluate for user quality, token generation, all those things. And, you know, looking early on, it yaps. Man, this thing just wants to go through tokens. Right now I'm seeing four to 10x the number of tokens generated between the, you know, four generation models and five. And when you're in a business context, you do not always want longer words, you know? And so it'll be really interesting there. It is certainly focused on execution. So, you know, I've heard a lot from the OpenAI team. It's steerable. Yes. And its natural inclination is to drive you towards, like, how, what, very tactical,

Starting point is 03:25:49 very specific. And so if you're trying to zoom back out at a strategic level or focus on a business initiative, it's actually a little harder to tune in that direction. So, you know, I think there's a lot of positive things for me as somebody who uses agentic coding platforms, who writes a lot of code. It's my daily driver now. I love it. But for other use cases, I think it's going to take some time to figure out if it really is optimal in use cases where intelligence actually isn't the differentiating capability. Yeah, it's very interesting to think the best product manager is not the one that writes the most, the longest doc. No, and you don't send your engineer into your executive meeting.
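[Editor's illustration] The A/B testing of model rollouts and the "four to 10x" token observation described above could be tracked with something like the sketch below. It is a hypothetical setup, not ChatPRD's actual pipeline. The arm names, the hashing scheme, and the sample token counts are all assumptions.

```python
import hashlib
from statistics import mean
from typing import Sequence


def assign_arm(user_id: str, treatment_share: float = 0.5) -> str:
    """Deterministically bucket a user so they always see the same model."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "candidate" if bucket < treatment_share else "baseline"


def token_ratio(baseline_tokens: Sequence[int],
                candidate_tokens: Sequence[int]) -> float:
    """How many times more output tokens the candidate model produces on average."""
    return mean(candidate_tokens) / mean(baseline_tokens)


# Made-up per-response output token counts for each arm.
ratio = token_ratio([100, 200], [600, 600])   # 600 / 150 = 4.0
```

Hashing the user ID keeps assignment stable across sessions without storing state. In practice the token ratio would sit alongside quality ratings and latency, since longer output is only a problem if users do not want it.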

Starting point is 03:26:31 And I really am looking forward to the time where we're not getting these number-based models, where actually I can get, like, GPT developer or GPT strategist, where they're pre-tuned and trained for the role they're going to play, as opposed to general purpose but clearly oriented towards a set of tasks. And I just think if you look at this model, it was oriented towards an engineer, software engineering, at least in my experience. So have you been tempted to launch any type of agentic coding products? You guys are obviously, at ChatPRD, responsible for creating documentation. And if you look at the other guests that have joined today, many of them are competing with each other in different ways and trying to own

Starting point is 03:27:23 different parts of the stack. You guys have seemingly stayed really, really laser focused and no one else is doing anything like you're doing, at least on the show today. But talk about, like, picking your lane and kind of, like, optimizing. Yeah, we're integrated with a lot of those platforms. So a lot of the kind of prototyping platforms, v0.dev, Lovable, all those, we integrate. We just released our MCP. So I use ChatPRD pretty consistently inside Cursor through our MCP. So we think of ourselves as the product pair to the AI engineer. Now, what's really interesting about my experience with GPT-5 is the one place that it actually does really well is technical specs. And that's a place where ChatPRD has sort of bridged into engineering

Starting point is 03:28:11 execution. Often our product managers are generating a PRD or some sort of business document. They're actually going the next layer and developing a technical spec. The GPT-5 technical specs fed into these agentic coding frameworks or prototyping frameworks output much higher quality assets on that end. So I do almost think there's going to be this kind of right model for right use case, especially in our kind of business. And so we think of ourselves as integrating. The one thing I have thought about with GPT-5 is it's the first one where it feels really simple to just go ahead and roll your own agentic coding framework or prototyping framework inside of our application. So never say never. It's something that we get asked for a lot. But we're good friends with almost all your guests

Starting point is 03:28:56 on your show today. And so we like the role we play in terms of being the product manager pair to all these AI engineers. Yeah, that makes sense. What are you looking for next? What am I looking for next? I mean, in terms of model capabilities, what I think is really interesting about OpenAI and why I'm really committed to the OpenAI ecosystem, even though I test and use a variety of models, is I think developer support is a real differentiator. So we spend a lot of time talking about model capabilities. And for application developers, certainly ones that are doing more complex applications like agentic coding, model capabilities really matter. Like core IQ of the model matters. But the other thing that matters, you know, as somebody who has built

Starting point is 03:29:42 developer tooling products, is that developer experience matters. The primitives in these APIs matter. And so what I'm really pushing the OpenAI team to think about is, in addition to the core intelligence of the model, what are the developer tools you need around these models to really make them a platform on which a variety of applications can build. And I do think that OpenAI has disproportionately invested in developer experience, but I'm always looking for, like, give me better out-of-the-box tooling, give me more control over these models, give me more hosted services, all those things that as an application developer are just going to make it easier to deploy these models to production beyond the core kind of intelligence of the models themselves.

Starting point is 03:30:26 What was your read on 4.5? Is there a world where, you know, I'm thinking about the product manager versus the engineer. You have your o3 go crunch some really hard reasoning, and then you have 4.5 turn it into, you know, stronger, or like more, you know, human language. Yeah. So I did a lot of experimentation around 4o, 4.5, and 4.1. 4.5 was my favorite prose writer by far. It was loved from a business writing perspective.

Starting point is 03:30:56 I thought the prose was the most natural. It was really slow, like untenably slow. And so the compromise we made in our testing is we ultimately ended up with 4.1 as the fan favorite for business writing when we were balancing off both quality of prose and intelligence as well as performance, which for application developers is a real consideration. So I landed on 4.1. 4.1 is the model that's being tested right now against GPT-5 in ChatPRD. And one of the things that I have to go do now is figure out how to get ChatGPT or GPT-5 to stop writing. It writes a lot and it only

Starting point is 03:31:38 wants to write in bullet points. So I've got to go back into our prompts and figure out how to direct it to be a little bit more business oriented. Bullet point maximalist. It's the new em dash. I'm telling you, you will not be able to stop seeing it. All it wants to do is write a bullet point. And call a tool. Like, I was using it in Cursor and it just kept maxing out my tool calls.

Starting point is 03:32:02 I'm like, you do not need to read 50 files to do this. So I do think, you know, application developers are really going to have to think about how they slot this into their current workflows. There's definitely tuning that needs to happen. But I'm telling you, you're going to see a lot of bullet points when this thing rolls out. Yeah. In 60 seconds, where is product management going? A lot of people talk about the, you know, examples of product managers that are starting to ship code themselves, ship whole features, products. But I'm sure those are edge cases to date. But where do you feel like it's going based on your user base? Yeah, I mean, it's going to go one direction or the other.

Starting point is 03:32:41 Product managers are either going to develop the hard skills to do the design, the go-to-market, and the engineering job to some extent, because some of these other jobs are definitely going away for product managers, or, my favorite use case, engineers and designers are going to get tools like ChatPRD or these prototyping tools or Cursor. And they're going to be able to actually do the product management job. And so what I think is we're going to see a new type of role emerge, which is a much more generalist role where people maybe have a specialist capability and they're augmenting that product thinking or they're augmenting that technical thinking with AI. But I don't think there's going to be product managers as they were, you know, five or ten years ago for much longer. Makes sense. Well, thank you so much for stopping by. Yeah, great time.

Starting point is 03:33:25 You're telling you. Thanks for having me. We'll talk soon. Bye. Cheers. Up next, we have Brad Lightcap, the chief operating officer of OpenAI. Welcome to the stream, Brad. Also, Jordi, your post saying I'm updating my timelines, you now have four years to escape the permanent underclass. It's over 4,000 likes. There we go. A thousand likes for every year. Love it.

Starting point is 03:33:46 Anyway, Brad, how you doing? Brad, what's going on? Guys, how are you? Good. Congratulations on the launch. What are the biggest takeaways for today? From your side, I'd love to know about what it actually means to be the COO of OpenAI. OpenAI does so many different things. Consumer internet company, API business, enterprise, there's all sorts of stuff, building data centers.

Starting point is 03:34:09 What is your actual role? My role is kind of whatever the company needs me to do. I play everything from like, you know, PM when I need to to like, you know, salesperson when I need to. That's kind of the fun part of the job for me. On this launch in particular, it was really fun. I spent a lot of time the last few weeks with customers, with partners, getting a feel for GPT-5 relative to what they were previously using.

Starting point is 03:34:35 In some cases, those are OpenAI models. In some cases, they were other models. But I've been at OpenAI a long time, I've been at OpenAI seven years. So I've seen GPT-3, I've seen GPT-4, and then to be able to see GPT-5 and just, I think, the joy of people being able to use it in production and seeing how much better it is, that's the best part. Greg told us earlier about the era of having to pay people to use the early versions of the product.

Starting point is 03:34:59 You guys have come a long way since then. Yeah, we had like three customers with GPT-3 or something like that. And so it was easy to manage, easy to talk to all of them. They actually were tired of us calling them being like, is it good? Is it getting better? And so now it's, you know, we're fortunate that we've got more than that. But it's cool. I mean, the diversity of use cases, I think the number of things that people are able to use it for, we've got everything from the team at Amgen, you know, big pharma, life sciences, using it for clinical workflows there. We've got teams at Uber, you know, building it for customer

Starting point is 03:35:32 support, teams at Notion and Cursor building it into products that people use every day. So, I think that's the power of it. It just more and more covers the surface area of things people do with these tools. I'm not sure how much you touch organizational design at OpenAI, but I'd be interested to hear your thoughts on how those companies that you mentioned should be thinking about AI changing their org structure. Is it sort of like a horizontal, cross-functional service layer like a finance team that touches a lot of different elements of the business?

Starting point is 03:36:06 or should most companies be thinking about standing up a dedicated like AI implementation team? How do we get a chat box on every product that we already shipped? How do you think about those tradeoffs if you were talking to a friend at a Fortune 500 company that was thinking about their AI strategy? Yeah, you know, it's an interesting question. I think it was maybe said earlier on the show. The thing we see is just people can do more. And so there's like this much wider latitude that you get if you're an individual person

Starting point is 03:36:36 at an individual company where, especially as you get bigger, you know, maybe more bureaucratic organizations that have a lot of different functions, a lot of different levels, you have to rely on a lot of other people in the org to get stuff done. You've got to rely on your data science team to do data analysis. You've got to rely on your design team to do mockups. You've got to rely on your marketing team to do copy. And I think what we see with AI is it just accelerates people to get to a great V1 of everything. So if you're a high agency individual and you want to get stuff done, you're no longer gated on people that, you know, you otherwise would be. And I think that should enable organizations to move a lot faster. And I think it should

Starting point is 03:37:11 enable the people at organizations that really drive them to do a lot more. And we see that consistently. ChatGPT Enterprise, I think that is consistently what we hear. And we seek those people out when we deploy ChatGPT Enterprise. We find those like, you know, two or three people at the organizations who are just the like AI superstars and champions and then try and actually use them as these kind of touch points for the rest of the org to learn from. How are you personally using AI these days?

Starting point is 03:37:38 You know, my biggest challenge, I think, day to day is context switching. If you look at my calendar from, like, top to bottom, it's like, I joke like, like, with my wife, I, like, have to, like, show up to work, like, wearing, like, a lab coat, and then I, like, take the lab coat off and, like, put some, like, sunglasses on and a film school jacket,

Starting point is 03:37:52 and, you know, then I'm talking to, like, a media company, and then I, like, take that off. So I go through the costume changes. And I think what I actually mostly use it for is just to help with bridging me from kind of thing to thing, to kind of put me in the mindset of being able to work with customers, help customers. GPT-5 is incredibly good at this kind of structured reasoning of how do we actually take what is this very diverse set of things that models like GPT-5 can do and then apply them in domains that I don't think about every day.

Starting point is 03:38:20 And so it gives me this launching off point to be able to talk with leaders and with customers much more fluently about how we can help their organizations. Within, let's say, a set of companies like the Fortune 500, what does AI adoption look like across the spectrum? Because I'm sure that there's companies that you talk to that are truly, you know, adopting AI in the way that John was mentioning, like trying to become AI native, changing their entire organizational approach. And then there's companies that just want to buy software to say that they're becoming AI native. So what does that spectrum look like in practice? Yeah, it is a wide spectrum. So at the top level,

Starting point is 03:39:05 we're seeing just like amazing appetite for wanting to adopt tools for people. And I think that's like the easiest place to start. Typically that's where we steer organizations if they're starting at zero: just give your people the best tools. You may have seen we've, you know, we've grown ChatGPT work, which is our enterprise and team product, from three million seats to five million seats now, from June till now. So torrid growth there and we don't see any abatement in demand there. If anything, it's accelerated from last year. And so I think people and organizations are starting to realize that, like, at a minimum, you need to make sure people have the best tools.

Starting point is 03:39:43 What's cool about GPT5 now is it also enables people to use the best tools at every point. And so if you're in an organization, you're not fumbling with the model picker. You're not trying to figure out when to use a reasoning model. You're not trying to figure out kind of the art of prompting to get the perfect thing. All of that stuff is abstracted and it's kind of taken care of for you and you can have confidence that your people are actually using the best models at any given point. Beneath that, it gets a little more complicated. So more and more organizations, I think, are starting to grasp how the tools can actually help in the business process. So whether that's in customer support, whether it's in research, whether it's in software engineering and data science,

Starting point is 03:40:20 you're seeing these tools more and more adopted in the enterprise. I think there's still a quality gap though. I think we now are just breaking into what I would call the kind of era of models that have capabilities that are good enough to make a dent in the types of problems businesses care about. Businesses care a lot about things like reliability, right? They care about accuracy. They care about the resilience of the model to recover from tool use errors and to be able to string together these very long kind of multi-tool, multi-step workflows. So GPT-5 is a step on all those things. And I expect that that will enable us to be able to do more and more things in the business process.

Starting point is 03:40:56 Do you think those customers that you just mentioned will stick with this idea of like GPT-4 level workloads will stay on GPT-4 and maybe there'll be cost savings, but those workloads will stick around for a very long time and then you'll develop almost new capabilities, new workflows, new workloads that will be additive, but the enterprises will stick or will they want to, is everything so fresh that they'll want to just like rewrite everything with the latest and greatest? More often than not, I think it's the latter. I think you want to rewrite everything. One of the cool things we did here was we were able to keep the pricing on GPT-5 at the level of o3 pricing. So, you know, if you're cost sensitive, you don't really have an excuse to

Starting point is 03:41:42 not upgrade. GPT-5 is faster than o3 and 4.1. So we've improved on latency for use cases that are speed sensitive, latency sensitive, and obviously the intelligence bar has gone up. And so, you know, unless you've got really a very kind of narrow and specific workflow where you've got a model like 4.1 that kind of is okay, there's really not a reason I think that people wouldn't upgrade. Yeah, do we need like a three-dimensional Pareto frontier right now that matches not just cost and capability, but also cost, capability, and latency or something? Is that something that you're seeing a lot of demand from in the enterprise? Yeah, 100%. We actually measure it that way. So we look at those three vectors and it's always kind of an optimization function along those three,

Starting point is 03:42:23 those three axes. We think we found that here. It was actually in terms of where my work was over the last few weeks, it was a lot of, I mean, this is a qualitative, you know, kind of, you know, really like manual process of collecting feedback because everyone's got a little bit of a different preference and we can only pick kind of one or two points on that curve. And so just trying to kind of dial customer feedback, namely developer feedback in for us on where that balance of things are is a big part of our,

Starting point is 03:42:49 our process for picking, picking all those points. And so we hope that, we hope that people like it and it unlocks, you know, the kind of maximal use. That's great. How are you thinking about open source? Who, you know, who's been most excited to get access to it? And, yeah, where do you see it going? Yeah.

Starting point is 03:43:09 I mean, it's important to us. You know, I'm glad we've gotten this out. It's been a huge team effort. I think there was kind of a thing that, like, you know, OpenAI doesn't like open source anymore. It's like, no, we're just like really busy with a gazillion other things. So I think hopefully going forward, we've got more of a leaned-in vantage point on open source. But it unlocks a huge number of use cases. I mean, if you think about kind of like, you know, government use cases, you think about on-prem, you know, use cases where you're handling sensitive data and very

Starting point is 03:43:38 sensitive environments. You think about where you want to run models on the edge. All these things right now are kind of inaccessible to us as a service provider to customers because we just don't quite have models that kind of fit at those points. So this for us, we think, is huge TAM expansion. And we're excited to be able to work with enterprises on implementing that model, which is, I think, competitive, hopefully, with our o3 class of models. What is the landscape like for companies that are

Starting point is 03:44:04 helping to implement OpenAI products at various enterprises? You have the big consulting groups that will give you an AI strategy. Maybe they'll try to take it a step further. But I imagine there's a cottage industry of firms that have sprung up to try to help organizations unlock the value beyond, hey, let's just get everybody a seat with ChatGPT work. Yeah, I think there will be this new industry that emerges that is kind of separate and apart from kind of the legacy set of SIs and consultants that is really AI fluent.

Starting point is 03:44:42 They're very AI-native. I think it's very hard to borrow, I think, paradigms from the last 20 years of software building and, you know, implementation that are going to kind of map to what we're dealing with here. You're dealing with fundamentally probabilistic systems that are moving and increasing and improving at a rate, you know, of now kind of collapsing to every few months. And I think the nature of use cases changes quickly, where enterprises are focused on kind of deploying them changes quickly. And so I think it's just hard for kind of the legacy industries to keep up, frankly.

Starting point is 03:45:17 We've had a lot of success working with some of this kind of new breed of SIs, so the Distyls of the world and others that really have been born, I think, and forged in the fire, so to speak, of this kind of new, this new platform. And so we hope there's more of them. We'd be excited to work with anyone that wants to work with us on it. There's more business than we can handle, and so we're always happy to spread the love. Talk about the $1 ChatGPT product for the government.

Starting point is 03:45:47 Were you involved in that at all? I was involved in that. We wanted to do something that was meaningful for US government. It's been a real big focus of ours lately. I think our view is the government has got to start to modernize. We've got to make sure that the tools that we use in the private sector are also in the hands of folks serving us in the public sector. And we wanted to make that really simple.

Starting point is 03:46:10 So we made ChatGPT, you know, basically equivalent to ChatGPT Enterprise, free. It's a dollar per year per agency. Hopefully we can afford that. And we wanted to make that available to anyone that wanted to use it and standardized through the GSA. So we're super appreciative of the partnership with them and more I think that we can do on that front. How is that different than just like if I'm a government employee,

Starting point is 03:46:32 I can just go to Google.com and I have access to that and Google provides benefits. Scott Kupor was saying that he can't use it. Yeah, so why? Yeah, just talk to me about how it's different to offer ChatGPT as an actual service with a contract that you're, you know, vending in. They are a client versus just if you put up a

Starting point is 03:46:55 website, every government employee can access the web to some degree or would it be blocked? Like, why does it need to be like a deal at all as opposed to just like everyone just uses it? Yeah. So part of it is just making sure that government employees can access it. So in some places, obviously, you know, you can put blockers in place that would prevent access. We hear a lot of stories, by the way, of people like going out on their lunch break to their car in the parking lot and like, you know, pulling up ChatGPT on their phone and like throwing a bunch of stuff in there, just because they know it'll get them through the day faster. And we've done work, by the way, with governments, with the state of Pennsylvania, other places where we've seen dramatic increases, you know, things like two to three hours a day saved per employee, given the nature of the work that they do and how helpful ChatGPT can be. And so this lets us have an interface into them as a customer. It lets our team engage with them in a direct way.

Starting point is 03:47:45 We can see how they're using the product and can help them use it better. And so that's important for us, is like we got to build on that foundation with them. And then presumably it also allows the government to define like security and privacy in their world as opposed to if you're just like some website out there,

Starting point is 03:48:00 their choice is only block or don't block as opposed to actually, you know, communicate with you. This is okay to train on. This is not, et cetera, et cetera, like keep everything private, et cetera, et cetera. Yeah, I mean, we don't, we don't train on enterprise data at all. Yeah.

Starting point is 03:48:14 You're safe there. But the, yeah, I mean, for us, like just being able to treat them as a customer, right, to treat them as a user. And you go, you know, you mentioned earlier, like, we were talking about kind of like there being these points of success at every organization that, you know, you've got people who are like way more sophisticated in using these tools than others. We want to be able to see those people and amplify them. And the government's no different.

Starting point is 03:48:36 There are people that we've worked with in government who are incredibly sophisticated in how they use AI tools, and our goal is to get everyone there. How do you think about the group of users that are active students? They've been on summer break. You guys have been busy over summer. And you recently launched, I forget the exact name for the product. I think it was like ChatGPT Learning.

Starting point is 03:48:58 How are you thinking about that cohort and unlocking new capabilities for them this coming year? Yeah. So we launched something called Study Mode, which was in our core ChatGPT product. And it was a little bit of an experiment. We wanted to see if you change the way the model behaves, when it knows you want to be in a learning mode, if that can actually enhance outcomes for students,

Starting point is 03:49:23 where we have all these kind of studies that have been done very like anecdotally about ChatGPT's ability to drive student outcomes and learning outcomes. So here we kind of took a little bit more of an intentional approach of, if you actually take the model and actually use it in a more Socratic style, where it can actually kind of quiz you, it can withhold certain information

Starting point is 03:49:41 that it wants you to be able to empirically deduce. It wants you to reason about problems, and it kind of reasons with you as a partner. So far, so good. It's really cool. Learning is kind of the killer use case of ChatGPT. And so I think to be able to actually launch something that, in some sense, extends that kind of killer use case

Starting point is 03:50:00 has been really cool. And the student feedback so far, even on summer break, has been positive. Well, we'll let you get back to your day. What's next on your agenda? Are you putting on the lab coat or the suit and tie and going to Washington? Good question.

Starting point is 03:50:15 You know, today I'm mostly with the team and talking to customers and maybe tomorrow I'll get back to the lab coat. But we appreciate you taking the time to talk to us. Yeah. Well, thank you so much for taking the time to talk to us. We will talk to you soon. See you guys. Have a great rest of your day. And the timeline has been in turmoil because President Trump says he will be imposing a 100%

Starting point is 03:50:37 tariff on all semiconductors coming into the United States. It started with widespread tariffs on chips and then turned into export controls. This is from the Kobeissi Letter. Is this a red flag moment? I don't know why you have the red flag. It felt like it. Ben was getting the flag. Nvidia potentially affected. But Taiwan says TSMC exempt from Trump's 100% chip tariff. Very unclear. The story is obviously still developing. And Dylan Middick says, you're telling me

Starting point is 03:51:11 that this level of monitoring the situation is free? And it's a picture of you in front of the whiteboard monitoring ChatGPT versus the timeline. Today we're monitoring. Illinois has banned AI therapy, making it the first state to regulate the use of AI in mental health services. Interesting headlines coming out. Just interesting because the product can just be used for therapy. Like, the user can choose to do that. It's not necessarily, it's kind of hard to ban outright. Like maybe you can ban it in a clinical setting. Yep.

Starting point is 03:51:45 I wonder how they define this. There's probably a loophole if I know anything about how these bans are implemented. But yeah, maybe it's like if you're in the clinical setting, you can't use it, but then people will just use it independently. They're like, yeah, the therapist is just on their phone. They're just going to have it listening to the conversation.

Starting point is 03:52:04 Yeah. They're going to be like, no, what should I do right now? What should I say? What should I say? How does that make you feel? That's what it's going to tell you. Celsius nearly doubles revenue year over year. This is the energy drink. Revenue of $739 million versus $632 million. North America grew 87%. International grew 27%. But here's the real

Starting point is 03:52:28 kicker. Alani Nu acquisition is the primary driver of growth. Alani Nu added $300 million in revenue and retail sales are up. So wow, what performance. But yeah, I mean, that was the expectation when they bought Alani Nu that they would, I guess it's like the first moment they rolled them in, probably. But huge growth for Celsius as they become a multi-product, multi-consumer company. What else is going on in the timeline? We have one last guest.

Starting point is 03:52:58 I think you might have to hop on with Taipei. So feel free to jump when you need to. Tyler, anything going on on the timeline we should be monitoring? We are, of course, monitoring the situation. So when Max was on he was talking about like how you can like make a little game, right? So I've been working on like a balloons tower defense game. Okay, how's it going? So it's going pretty well. I'm making another change but then maybe I can screen record and share. Yeah, yeah, that'd be great. You could share with the folks too. Yeah, I like this post from Ray Sullivan. These GPT-5 numbers are insane, and it's a chart of GPT version versus number, and then once it gets to four, it goes 4.1, 4.2, 4.3, 4.5. So the fifth one is a massive, massive bar. We need an analysis of the charts from today. It seems like there was multiple that were kind of odd or hallucinated or off.

Starting point is 03:53:56 It's interesting that multiple of them snuck up. Just in: Sheetz, the popular convenience store chain with 750 locations, is now offering 50% off purchases paid with Bitcoin and crypto daily from 3 to 7 p.m. What a wild move by Sheetz. Well, Ben Hylak is in the waiting, the Restream waiting room. Let's bring him in. Bring in Ben Hylak. How you doing, guys? Good to see you. Good to see you. We're doing well. I'm just going to say hello. I got to take off and talk with Taipei. I'm going to let John take it from here. Absolutely. I'll close up the show. You guys have a fantastic conversation. Give me the update. How's the day been for you? What were your

Starting point is 03:54:33 expectations? Did this meet or exceed? Did it underwhelm you? How are you doing? Well, so I've actually had access for a couple weeks. So we actually did a video. I'm not sure if you've seen it, but OpenAI brought a couple of folks from the Twitter sphere to their office a couple weeks ago to try it. Yeah, yeah. Yeah, yep, yep, yep. I think that it pretty much exactly meets my expectation as far as like how it's been received.

Starting point is 03:55:03 And I've tweeted about this as well, but I think that it's really, really good at like one-shotting things. You know, I think it's better than I think other models we've seen. But I think it's actually sort of a distraction in a lot of ways. I think that the things it's a lot better at are, A, a lot harder to describe, and B, I don't think the harnesses for it really exist yet. I think a lot of harnesses. So the way I've been describing it is that I think I've seen, you know,

Starting point is 03:55:35 web search existed in ChatGPT for a really long time, right? Like, it was able to, like, call a tool, search the web. Yeah. Obviously, like, deep research was very different than that, right? Like, what we saw was it was like actually, like, calling, you know, searching the web. It was like reasoning about those results, changing its kind of course, like course correcting in the middle. So like intermediate reasoning is like what the term for it is. And they really trained it how to search the web well. I think GPT-5 does that for like a whole plethora of tools. The interesting

Starting point is 03:56:07 thing is that a lot of products, like I think a lot of the agentic products that exist today, were kind of built wrong. Like they didn't build the tools the right way. And we've seen this before. Like if you look at like, you know, the first, you know, kind of infrastructure for agents was LangChain, like way back when. Yeah, I remember two or three. Yeah, it was, you know, it was early, but it was wrong, right? And so, you know, they've iterated since, right? They have like LangGraph. It was a better implementation. But the first implementation of LangChain was like, again, early but wrong. And so if you built your product

Starting point is 03:56:41 on LangChain, like you had to, you know, significantly change it. I think we'll see a similar thing happen for GPT-5. You know, it's not just like, you know, change the string, you know, from 4o to 5 or something and push. And now you, you know, yeah. Yeah, you know that meme about like, oh, like Sam Altman's on stage and like just like, you know, killed 75 startups. Google just killed 100 startups. Apple just killed Partiful with their new thing or whatever. Did any of that happen today? It feels like this is like the LangChain

Starting point is 03:57:13 needing to change their strategy. That happened a while ago. I haven't identified anything. It feels like, you know, Scott Wu hopped on and said like, you know, great day to be an application layer company. The foundation models got better. It's more tools in my tool chest.

Starting point is 03:57:29 I'm extremely happy and, and I'm more confident than ever. And I believe him. I believe that he doesn't see today as fundamentally needing to change his business model. I think that's true, actually. I think that people have been, you know, there's a lot of people building agents right now. I think a lot of them have not been feasible for some of the reasons that GPT5 starts to address.

Starting point is 03:57:53 So I think that what it means is that the entire architecture behind agents will get a lot simpler. Like, it feels like a good day for people building applications. Yeah, it's not immediate that there's like some, you know, like company or something that got killed today. Yeah, yeah, yeah. I mean, in general, it feels like, you know, Dwarkesh updated his timelines. There's just been a general idea that, like, we've maxed out pre-training. We've kind of maxed out post-training.

Starting point is 03:58:23 We're now in the let's reap the reward of this. And we've seen it in like the incredible financial performance, the incredible usage numbers. You know, millions and millions, hundreds of millions of people are using ChatGPT 30 minutes a day. I love the product. And yet it feels like the what have you done for me lately meme. It's totally like, okay, yeah,

Starting point is 03:58:46 we went from the iPhone 4 to the iPhone 5 today. Yes. Still really an important technology, great company, but like I want another iPhone 1. Yes, yeah, yeah, yeah. I totally get what you're saying. I think that, like, I wrote a piece about this with swyx, but it really actually changed the way

Starting point is 03:59:06 I see that path to AGI. Like, I think before using it a lot, I kind of was like, okay, we need like bigger, bigger models. They're going to like get smarter or something. I think like I had this realization. So I was watching it like solve, I had this like really weird like dependency conflict with yarn. Like we have like a monorepo. The problem also with this discourse is like, um, the sort of problems that it gets good at solving are just like not sexy things to talk about. They're not things that you'll understand. I'm like, we have this issue with the way we structured things. But like a couple weeks ago, I was watching it, like, I had this problem. No other model would solve it. And I watched it sort of like poke around. Like it started running this like yarn why command in a bunch of different directories. In between it's like reasoning and like correctly reasoning about like what and why and what it was learning. And you know, taking little actions in between, seeing what happened.

Starting point is 04:00:01 I think what I realized is that like, you know, if you imagine like, you know, if you imagine like, humans without tools, like if we never had any tools, were never even able to write things down. Like, would you be able to tell that we're intelligent? Would we have like, you know, learned to speak, et cetera? Like, I just like don't, you know, even if we could not have ever invented fire, right? It's like, it's like, where would we be right now? There's, that feels like there's a similar, like, I actually think a lot of the next year is just going to be, how do you get these models to do things better? It's like, you know, I think it's next year. In your yarn example, um, you said like, you were, you were having it, I assume GPT5, like work on the problem. Was that wrapped in a

Starting point is 04:00:43 coding tool? Did you just go to chat.com and give it your GitHub repo? Like talk to me, like what was the actual user experience from your side? Yeah, so this was in Cursor. Okay. Um, I think the Codex CLI, the new version of the Codex CLI, which they just released today, is also really, really, really good. Okay. Um, I think that you will really only see a significant difference in places where it can sort of explore its environment, is the way I would put it. Like when I was watching it like go bounce around my repo and like, like I felt almost like I was watching something navigate like a little like video game like Pokemon or something. Like that's kind of what it felt like. It's kind of like I'm going to go over here. I'm going to see this.

Starting point is 04:01:24 Okay, wait a minute. That conflicts with what I just saw over here. Like where should I go next? You know what I mean? Like it felt very novel is like what I would say. Yeah. Yeah. What, what, so yeah, I mean, how are you using it?

Starting point is 04:01:38 What, what, where do you see it going? Do you see it like, just like a little bump of a tailwind today? Or what's your read on like, like how you'll be using GPT5 going forward? I mean, yeah, there's two huge things. So like one thing that like really got missed today is that they also released GPT5 Nano, which is like an incredibly good model actually. So like we're not talking about it, but it's half the cost for input tokens than Flash-Lite,

Starting point is 04:02:08 or sorry, yeah, I think it's actually half the cost for input tokens as Flash-Lite. And it's a really good model. Like it's like 4o level for a lot of like writing and stuff like that. And so yeah, we'll be using that probably in the short term. I think it'll be interesting to see how other providers react. Like I'm sure Google will cut their prices as a result. But it is the cheapest, like, hosted model, I think. I don't think anyone's serving any other model for those prices, for that matter.

Starting point is 04:02:36 Yeah, that makes sense. What else are you looking for for the rest of the year? Probably no GPT6 on the horizon, but what are you looking out for? I mean, it seems like Google is expected to respond with Gemini 3 soon. But what else are you tracking in the world of AI these days? It's a great question. I think that, yeah, that's going to be wildly interesting. I think what Google does will tell us a lot.

Starting point is 04:02:59 I think that they, you've probably seen it, but they released this like world model yesterday. We're not talking about it anymore. I mean, like, if those videos, I haven't tried it myself, if those videos are real, like, that's one of the most mind-blowing things I've seen in the last, like, you know, decade or something. So, like, if that's real, like, that's extremely interesting. And I think all the stuff that's going on with world models right now

Starting point is 04:03:21 has, like, huge implications for, like, everything, like, from robotics, just like so many different fields. So super, super interested in that. And the other thing is that I actually just think that, like, again, I'm actually really bullish on GPT5. I think that the way it was received today is like just about how I expected it. Like, and the reason is like, when I say harness again, I'm like, I think that like Canvas in ChatGPT is pretty bad is like my,

Starting point is 04:03:45 would be my take. Like, you know, it's a tough product to make, but like, yeah, like it does really poorly with like long files, crashes sometimes, like that sort. Like, I think that we don't have, the product layer around GPT5 doesn't exist yet. So I think we're going to see some really, really interesting products that are built around it. Yeah, it's always hard when you go from like a binary,

Starting point is 04:04:05 qualitative, in your face improvement. Like ChatGPT was like, we passed the Turing test. And now the next test is like superintelligence and self-replicates, smarter than every single person, knows everything. It's like the bar is like, we really moved the goalposts, you know? 100%.

Starting point is 04:04:23 I think that there was like a lot of, you know, discourse around the model as well, like leading up to it, which I think didn't help, you know, but like the way that I would think about it is like, I think that, you know, depending, there's some percentage of the way through automating software engineering that we've made it. Like, let's say it's like 70% or something, 75%. The tough part is like that last like 25% is, um, A, the hardest, it's like the least sort of decipherable to like explain to people. It's the least like universal. Like, if I'm just like, oh, make a, you know, one of the examples I did,

Starting point is 04:04:57 I made a personal website. It's like all Mac OS 9 themed in like 20 minutes with GPT5. That's fun. And so it's really fun, right? You get it. Like my mom gets it. Like I can show it. I can share it.

Starting point is 04:05:07 You know, my mom, I can't explain any of the like the very specific ways that GPT5 like helps in our specific code base, our specific problem, whatever. So I think that like it'll be less. These launches will probably get less and less sort of interesting from a, so like, from a what it does for software engineering, as that gap gets closed. Like, I, you know, what's the last five percent of software engineering? Like, I, you know, like, it's probably not going to be that interesting to me. Um, do you think they'll be on an annual release cadence now? Like Apple updated all of their iOS, all their operating system nomenclature to be like, we are now on 26

Starting point is 04:05:47 because it's the year, it's like a car model, like, like Jaguar. I don't think you can plan it. I don't think you can plan ahead. That's the interesting thing. I think that, you know, there's people that say that GPT4.5 was supposed to be GPT5. Yep. Yep. And like, I think that it sort of came out and they're like, eh, like, you know, I actually love 4.5. I think it's a really fun model, but, um, well, it's clear that like improvements come in many places. Just like with the, with the iPhone, like the latest iPhone, you buy that because it doesn't, it's not just like the one with the new screen. It has a slightly better camera, slightly lighter, longer battery. Like, it's like an ensemble of improvements that then they add up.

Starting point is 04:06:24 And I think that that feels like what we're getting here today and what we will get in the future is like this little, like we did a little extra RL over here. This tool is now sharper. It has new capabilities. We added multi-modal. Like, you know, the video generation got better and this feature got better, et cetera, et cetera. I think that like what a model is is still going to change a lot and like how we like. So just to give an example, like 4o was sort of this big thing, you know, where they talked about it being like natively

Starting point is 04:06:56 like audio in, audio out. And like, you know, you haven't heard that from GPT5 yet. Like, you can't talk to it on advanced voice mode. Like that's interesting. It doesn't generate images. Like, you know what I mean? There's no, at least yet, native image generation. We don't know much about how it works under the hood, but like, it's still calling

Starting point is 04:07:13 4o to generate images, right? So it's like, do you start to see an unbundling of these model capabilities? Like, seems quite possible. Like the best model for writing natural language might not, or like writing creative, you know, creatively might not be the same model that writes, you know, really good Rust code. Like it might be different models. So I don't know. We'll see. Yeah, create image here is now tucked next to deep research, agent mode, et cetera. But I would hope that you can call that from the actual chat interface. You can call it from the GPT5 chat. It's just using, it's using GPT Image 1, I think is actually the name of the model. So it's a dedicated

Starting point is 04:07:50 image generation model, which I think may be 4o. I don't totally know. Yeah, I just, I don't particularly care. I'm not looking for one model to rule them all. I'm fine with models calling different tools. It seems fine. Yes. Anyway, fun day. Thanks for hopping on. Of course. Of course. Anytime. Have a good one. Bye. And that's our show today. Folks, leave us five stars on Apple Podcasts and Spotify. And thank you for tuning in to the GPT5 gigastream. We're on hour four and a half. We've enjoyed hanging out with you. Tyler, anything else from the timeline?

Starting point is 04:08:25 Close it out for me. Timeline's still in turmoil. If we want, we can show the little game I made? Okay, yeah, let's show Tyler's game. Can we do that? Is that possible? You got it? Tyler's tower defense.

Starting point is 04:08:35 This was one shot. Okay. Wait, what do you mean? One shot, one prompt. You said you were working on it? I was, but then it's like, wasn't as good. Oh, so you went back to a single prompt? Yeah.

Starting point is 04:08:47 I made a change, but then I realized like, okay, this is not as good, so I just went back to the first one. Okay. So yeah, my, my, my question is, I mean, this, this seems, well, actually, like it's, it's like the game engine, I don't know what it's using under the hood. What it's, do you know, did it write like WebGL code or did it write, I think it's just, it's just like, yes. Okay. And it's just like HTML canvas. That's pretty crazy. Yeah. Um, you'd think it would use some like 2D engine off the shelf or something. But, um, my, my, my question is like, what, that won't go viral because that is less impressive than just the Tower Defense app that I can get in the App Store.

Starting point is 04:09:24 For sure. But it's like maybe if I take my, you know how there's like ControlNet images went viral where people take their corporate logo and then they'd throw that through ControlNet and it would be like the TBPN logo overlaid over like a forest and like the trees would look like the logo. Or like the QR code. Yeah. So maybe like it's Tower Defense but it's my logo or something like that and like the, the enemies are like moving through, something like that. I don't know. There's just got to be a way to personalize it and make it so every single game is a unique snowflake that you want to go and experience that one. You want to look at it. You want to spend some time in it. I don't know. Yeah. But it's hard because it's like,

Starting point is 04:10:01 it's still, you know, predicting the next token. It's not like the 4o image generation was like kind of a, it wasn't novel, I guess, because there was image generation. Yeah. It was like such a massive improvement. This is, like, there's not any clear massive step change here. It's a little bit better in a lot of ways. Yeah. Oh, well, well, we'll have to play with it more. Let us know what you think about GPT5, and we will see you tomorrow. Have a great day. Thank you so much.

Starting point is 04:10:26 Bye.
