TBPN Live - OpenAI Day: GPT-5 Unveiled | Mark Chen, Greg Brockman, Sarah Friar, Max Schwarzer, Brad Lightcap & More

Episode Date: August 7, 2025

(01:18) - AI Model Whiteboard Breakdown (26:23) - Mark Chen, Chief Research Officer at OpenAI, discusses the recent launch of GPT-5, emphasizing its enhanced reasoning capabilities and seamless integration of various AI models to improve user experience. He highlights the model's ability to perform complex tasks more efficiently, reducing the need for users to choose between different model versions. Chen also touches on the importance of personalization and memory in AI, aiming to make interactions more intuitive and tailored to individual users. (57:52) - Greg Brockman, co-founder and president of OpenAI, discusses the evolution of the GPT series, highlighting the progression from GPT-1's foundational capabilities to GPT-5's transformative impact on software engineering. He reflects on the challenges and breakthroughs in developing these models, emphasizing the importance of scaling and infrastructure in achieving advanced AI functionalities. Brockman also touches on the broader implications of AI, including its role in enhancing human productivity and the necessity for responsible development to maximize societal benefits. (01:31:55) - Sarah Friar, OpenAI's Chief Financial Officer since June 2024, previously served as CEO of Nextdoor and CFO at Square. She discusses the rapid growth of ChatGPT, now with 700 million weekly active users, the expansion of enterprise adoption to 5 million paying business users, and the importance of substantial investments in compute infrastructure to support future AI developments. (01:52:13) - Dedy Kredo, Co-Founder and Chief Product Officer of Qodo (formerly CodiumAI), discusses the integration of GPT-5 into their platform to enhance code review processes. He highlights the model's improved capabilities in generating high-quality code reviews, identifying bugs before production, and ensuring enterprise code aligns with best practices.
Kredo emphasizes the importance of AI agents in automating code review tasks while maintaining human oversight to verify code quality and adherence to standards. (01:59:11) - Zach Lloyd, founder and CEO of Warp, discusses the significant advancements in AI models, emphasizing their enhanced capabilities and cost-effectiveness, which are particularly beneficial for individual developers and small teams. He highlights the importance of competition among model providers to drive down prices and improve quality, expressing hope for a future where multiple competitive models coexist, similar to cloud service providers. Additionally, Lloyd addresses the challenges of model deprecation, noting that for application-level stacks like Warp, transitioning to the latest models is straightforward and advantageous. (02:11:15) - Riley Tomasek is a serial entrepreneur and the Founder & CEO of Charlie Labs, home of an AI-driven "autonomous TypeScript engineer" designed to accelerate code reviews and merge processes. Previously, he co-founded Flight (acquired by Figma) and launched Dexa, an AI platform that transforms podcast discovery. Riley holds a B.Sc. in Mathematics & Computer Science from the University of British Columbia and has a track record of building developer-friendly tools and interfaces. (02:18:31) - Guillermo Rauch, founder and CEO of Vercel, discusses the transformative impact of AI on software development, emphasizing the shift towards "vibe coding," where natural language prompts generate code and user interfaces, making software creation more accessible. He highlights the role of AI agents in automating tasks, enabling developers to focus on higher-level management and creative processes. Rauch also explores the future of developer tools, noting the importance of integrating AI capabilities to enhance productivity and streamline workflows. 
(02:34:28) - Eno Reyes, co-founder and CTO of Factory, discusses how their platform integrates AI agents into every stage of the software development lifecycle, including coding, code review, maintenance, incident response, and documentation. He highlights the platform's focus on large enterprises with over 1,000 engineers, addressing challenges like migrating numerous codebases to new frameworks and modernizing legacy systems. Reyes emphasizes that while AI tools can accelerate individual developers, significant productivity gains require workflow changes that incorporate agents throughout the development process. (02:40:20) - Guy Gur-Ari, co-founder and Chief Scientist at Augment Code, discusses the company's AI coding assistant designed for large teams with extensive codebases, emphasizing its capabilities in question answering, development, refactoring, and migrations. He highlights the thoughtful nature of GPT-5, noting its propensity for tool calls and clarifying questions before code modifications, making it particularly effective for complex tasks. Gur-Ari also mentions Augment's focus on developing proprietary integrations and tools, aiming to enhance the agent's performance without relying solely on external model vendors. (02:48:20) - Harjot Gill, CEO of CodeRabbit, discusses the significant improvements observed with GPT-5 in their AI-driven code review platform, noting a near doubling in performance compared to previous models. He emphasizes that these enhancements will be available to customers at no additional cost, reflecting the rapid evolution of AI capabilities. Gill also highlights the company's focus on monitoring real-world performance metrics, such as user conversion rates and potential issues like hallucinations, to ensure the model's effectiveness and reliability. 
(02:52:34) - Timeline (02:57:01) - Max Schwarzer, a leading researcher at OpenAI, discusses the recent launch of GPT-5, highlighting its significant advancements in coding capabilities and its potential to revolutionize user interactions by enabling the creation of personalized applications without prior coding knowledge. He emphasizes the importance of refining the post-training process to enhance the model's accuracy and reliability, particularly in reducing hallucinations and improving user engagement. Schwarzer also touches on the future trajectory of AI development, expressing optimism about the integration of reinforcement learning to extend AI's applicability beyond textual domains into real-world interactions. (03:13:26) - Scott Wu, co-founder and CEO of Cognition, discusses the significant advancements in AI coding models, noting that OpenAI has caught up to Anthropic, leading to a competitive landscape. He emphasizes the importance of integrating AI agents like Devin into software engineering workflows to enhance capabilities and efficiency. Wu also highlights the evolving role of engineers, suggesting a shift from "bricklayers" to "architects" as AI tools handle more complex tasks. (03:23:21) - Claire Vo, founder of ChatPRD and former Chief Product Officer at LaunchDarkly, discusses the developer-centric design of GPT-5, noting its enhanced coding capabilities but expressing concerns about its verbosity and tendency to produce lengthy outputs. She emphasizes the importance of validating new models with users, especially in business contexts where concise communication is crucial. Vo also highlights the need for AI models tailored to specific roles, such as strategists, to better serve diverse professional needs. 
(03:33:34) - Brad Lightcap, OpenAI's Chief Operating Officer, discusses his multifaceted role, which includes responsibilities ranging from project management to sales, and emphasizes the significant improvements observed with the launch of GPT-5. He highlights the diverse applications of OpenAI's models across various industries, such as pharmaceuticals, customer support, and everyday productivity tools, underscoring the transformative impact of AI on organizational efficiency...

Transcript
Discussion
Starting point is 00:00:00 You're watching TBPN. Your background looks way different because you have a whiteboard behind you, because we're breaking down the X's and O's of the GPT-5 launch today. That's right. That's right. Launched from OpenAI. Really quickly, there is some other news. Firefly Aerospace stock opened at $70 in its NASDAQ debut. This is the company that landed on the moon. Very cool. Very cool.
Starting point is 00:00:23 There are a few other stories going on, but we're going to skip most of them because we're going to be focusing on ChatGPT today, on GPT-5. We have a bunch of guests coming on. We have a stacked lineup, we'll pull that up, but we'll break down the X's and O's of the matchup. So of course, OpenAI, here's our lineup. We have something like 15 guests today. A ton of folks from OpenAI, a ton of people that build on top
Starting point is 00:00:50 of OpenAI and can comment on what's going on with ChatGPT. But of course, this battle is between OpenAI and the timeline. It's the, it's, they got to get the vibes right. It's war. It's war. It's, it's, the timeline's in turmoil over whether or not this is a good model, what it means for the industry, what it means for AGI timelines. Everyone's got their take. Everyone's posting memes.
Starting point is 00:01:13 There's been a ton of funny ones already. We'll take you through them, of course. But let's break down the offense today. We have Sam Altman, the founder, CEO. He briefly got cut from the team in November of 2023, but he's back leading the team for the 2024, 2025 seasons. He seems healthy, doing great today. He went on at 10 a.m. to break down the launch of GPT-5. He has a couple of key plays in his playbook, in his arsenal. He's got a solid ground game. Lots of quick
Starting point is 00:01:42 posts hitting the timeline, probably in lowercase. Then he might air it out with a couple-thousand-word essay. We've seen him do this before. It's a bit of a Hail Mary. Maybe AGI is a couple thousand days away. Maybe we're in the soft singularity. But he's very strong there with the long post when he needs to be. It's up his sleeve if he needs it. Then he can also pull out the vague posting. He was doing this last night, posted a picture of the Death Star. No one knows what it means. Maybe it was taking a shot at the doomers, who were on the defense today. So he's also known for driving supercars. That lets him get to the office faster. He's saving time and money. You can save time and money by going to
Starting point is 00:02:17 ramp.com. Easy-to-use corporate cards, bill pay, and accounting, and a whole lot more, all in one place. And so he also, apparently, this is a rumor, gave every OpenAI employee who's been with the company for more than two years $1.5 million. A lot of people say, $1.5 million, that's not enough for a big house in San Francisco, but it is enough for a supercar. So that's probably why he picked that number. And that's what the OpenAI team will be doing with that money. They'll be buying Aston Martin Valkyries, Pagani Huayras, McLaren Sabres,
Starting point is 00:02:51 Ferrari Daytona SP3s. They can get a Koenigsegg Gemera. They could get a Singer DLS, or a Bugatti Veyron. It would have to be used. They could also get the Bentley Bacalar. There's only 12 of those ever made. It's an open-top two-seater roadster. It's coachbuilt.
Starting point is 00:03:07 So that's going to run you $1.5 million, but that's perfect. You just got the $1.5 million bonus. So put it to work, spend it all in one place, on a car. This is financial advice. Yes, exactly. Then you got Greg Brockman. He's joining at noon. He's extremely well-rested.
Starting point is 00:03:23 He's actually coming off a sabbatical right now. That's very exciting. He should be injury-free for the rest of the season. He cut his teeth at MIT, and then he got drafted by Stripe in 2010. Microsoft tried to do a trade during the chaotic 2023 trade window that opened up post-Sam Altman ouster, but he stuck with the OpenAI team, and now he's president of the company. Then you got Mark Chen.
Starting point is 00:03:47 He's coming on at 11:30 today. He's the chief research officer. The rumor is that he turned down a maxed-out contract from the Meta Llamas. But he's sticking with the OpenAI team. He was an MIT undergrad. He also worked at Jane Street before joining OpenAI in 2018. Then we got Sarah Friar coming on the show at 12:30. She's the CFO of OpenAI.
Starting point is 00:04:08 It's her job to find bank accounts big enough to hold all the cash they're raising. It's a tough job. You got to find, okay, this bank account, will it hold 10 figures? Will it hold 11 figures? Will it hold 12 figures? Exactly, exactly. She's also going to be defining the non-GAAP metrics that will be catnip for Ben Thompson in just a few years. We're excited to talk to
Starting point is 00:04:30 her about how she's measuring the success and the health of their business. Obviously, it's not just revenues, not just top line, bottom line. We're going to want to know about queries. We're going to want to know about DAUs, all those non-GAAP metrics. That's what people are going to be tracking when IPO day comes, hopefully soon. And then we also have Brad Lightcap. He's joining at 2:35. He entered the league as an investment banker. Let's give it up for the investment bankers. They don't get enough credit around here, but we love the investment bankers. Then he got drafted by Y Combinator before joining OpenAI as CFO in 2018. Now he's the chief operating officer.
Starting point is 00:05:05 And then we have Max Schwarzer. He's in charge of post-training, fine-tuning these models, getting them into fighting shape to put on a display of authority on GPT-5 launch day. Now, let's flip it over to the defense. They're going up against the timeline. They're going up against the vibe checks. We got the doomers. The doomers, they're led by Eliezer Yudkowsky.
Starting point is 00:05:27 Admittedly, everyone knows this. No one debates this. The doomers have had a terrible season. But you'd expect to see at least a few Hail Marys about GPT-5 creating bioweapons thrown up on the timeline today. Probably won't be bangers. Probably won't get a thousand likes, but you'll be seeing them here and there, mostly in the replies.
Starting point is 00:05:43 We've also seen some doomers talking about GPT-5 being available to every government employee. And Eliezer had some harsh words about that. Don't give the keys to Sam Altman. Don't give the keys to the government, to OpenAI. He was upset about that. But in general, the doomers are not putting up much of a fight today. Then you got Claude.
Starting point is 00:06:05 Interesting. Claude was caught playing for the wrong team earlier this week. Anthropic, they're on defense today. But we saw them take out OpenAI's key pinch hitter. The Claude Code API was playing for the OpenAI team. But they shut that down, and Claude is no longer pinch-hitting for OpenAI.
Starting point is 00:06:26 Then you got the Elon stans. The ground game's going to be there. It's going to be strong. The Elon stans are going to be tracking the benchmarks relentlessly. We know xAI loves the benchmarks, and all the Elon stans are going to be calling out GPT-5 for any misaligned benchmarks. If they fail Humanity's Last Exam, it's over. It's over. They'll also toss up the occasional unhinged conspiracy theory.
Starting point is 00:06:52 Moving on: Gemini. The betting lines have shifted big time. People thought Gemini was out of the game. They're so back. Polymarket has Gemini at, what, 75% chance of being the best model towards the end of the month. This is, of course, based on LMArena, the more vibes-based benchmark. But Gemini will probably be quiet today. They usually don't try and front-run press releases. They usually try and sit back. Let the models speak for themselves. Let the API credits work their way through the latest YC Demo Day batch and get the product into the hands of people.
Starting point is 00:07:30 And so expect to see a big, glossy conference in a couple weeks. Demoing Gemini 3 should be a good rebuttal from the Geminis. Then you got the Meta Llamas. Zuck's been on a poaching spree. He's rebuilding the team during the offseason. Now he has a stacked roster and he's ready to go duke it out.
Starting point is 00:07:51 But no one knows exactly what's going to be in the playbook. Is he going to go consumer? Is he going to go API? Is he going to turn into a hyperscaler? We don't know, but we know they got a stacked team. They got Alex Wang. They got Nat Friedman. They got Daniel Gross.
Starting point is 00:08:03 They got tons and tons of other researchers. They've been raiding every other team. Completely reset the salary cap for the league. And it's been an absolute clinic in terms of recruiting over there at Llama. Then you got the final benchmark: ARC-AGI. This benchmark stands. GPT-5 couldn't get past this defense, and ARC-AGI, you know, sitting there right in the end
Starting point is 00:08:30 zone, just swatting him down, swatting them down all day. You think, you think, you think we're, superintelligence around the corner? ARC-AGI: denied. Tyler, give us the update on ARC-AGI. Where does everything stand? How did GPT-5 do? Does it matter? Should we care about ARC-AGI? We love the team behind them, but is it an important benchmark? Should we be tracking it today? Yeah, okay, so there's ARC-AGI, V1 and V2, right? On both of them? V3.
Starting point is 00:08:58 V3, I actually don't know if... No one's been, no one's even tested V3 yet. No one's even really close there. But how did we do on V1? V1, GPT-5 is at 65.7. Unfortunately, that's going to be just 1% short of Grok 4's 66.7. Okay, ARC-AGI 2. The Elon stans are going to be going wild with that.
Starting point is 00:09:17 ARC-AGI 2: 9.9 percent. Nine point nine percent. Grok 4: 16 percent. So absolutely kind of brutal, you know, ARC-AGI mogging. Rough showing. Some people have accused Grok 4 of being slightly benchmaxxed. Yes. You know, you might have a team working on it. What are the pros and cons? We know the cons of benchmarking, uh, of benchmaxing: you're overfitting on something that might not actually drive consumer value. It might not actually solve real-world problems. It might not increase DAUs or revenue or ARR or anything that really matters. It might not even get us closer to superintelligence. Give me the counterargument. Why is benchmaxing good? The bull case for
Starting point is 00:10:02 benchmaxing. The bull case for benchmaxing, break it down for me. Yeah. So I think the idea is basically, this is almost like a non-AGI-pilled kind of take. So if you don't have a super general intelligence. Yep. Your ability to benchmax basically proves your ability to solve some, like, kind of specific
Starting point is 00:10:25 tasks. So there's this thing about the gas station spiky. Yeah. It's called getting spiky. Getting spiky. Adding more spikes
Starting point is 00:10:32 to the spiky intelligence. Yeah. I think it was Roon who had this tweet about the gas station benchmark. Yep. Right. I don't care if he said
Starting point is 00:10:40 something like, I don't care about AI solving the gas station if it has the gas station benchmark, something like that. But the idea is, like, if you, if the, if making the gas station benchmark... Roon said, my bar for AGI is an AI that can learn to run a gas station for a year without a team of scientists collecting the gas station data set, in capital letters. Yeah. And then my take is basically, I don't care how they got to the, like,
Starting point is 00:11:12 I don't care how they made it run the gas station. I care how fast it runs it. If we can run the gas station with AI, that's the win. If you have a team, who's your benchmaxing team, that just proves that if you have some task that's, like, really important that you want to get done, they can just figure it out. So it's like RL for business. This is like the same thing, RL for law.
Starting point is 00:11:30 All these, like, specific verticals. Mira Murati is doing this at Thinking Machines, right? Like RL for businesses. Come into your organization, understand the most valuable business processes out there that could potentially be RLed against, could be turned into a benchmark, and then, you know, bench hacked. Because I don't care if you're hacking, you know. If I have, translate this type of document to this type of document for my business, if you can do it with 100% accuracy, I don't care that you bench hacked it. Yeah, exactly.
Starting point is 00:12:00 Like, benchmarks right now are not, like, economically valuable. Like, if you're really that much better at MMLU, yes, it's like, are you producing that much value? Yes. Probably not. But if you make some new benchmark, that's, you know, your tax benchmark, I think Anthropic just released that fairly recently. Sure, sure, sure. That's like, I don't care if you benchmax on that, as long as it does way better. If it does the taxes, it's going to do the task.
Starting point is 00:12:21 Yeah, yeah, yeah, yeah, yeah, that makes sense. What about the, what does it say that it feels like Open AI seems capable of bench hacking? It seems like they've opted not to. Is that because bench hacking has a risk of giving you negative aura? Because if you're accused and found guilty of bench hacking, you could, it often reveals that you're not building this one beautiful, you know, super intelligence to rule them all. Yeah, I think it's also, like, maybe we're just looking at the wrong benchmarks. Like, maybe there's a bunch of, like, interesting benchmarks about, like, there's this one I really like, it's the Minecraft benchmark.
Starting point is 00:13:06 Where you have to, like, build, you like, give it some castle and how good it looks. Or there's the one you always see about the unicorn. Yeah. So you use this like math package that does like grass and stuff, but you ask it to draw a unicorn. Oh, yeah. Those are really good because it kind of shows the creativity, stuff like that. Walk us through TBPN bench and what we will be benchmarking the AIs against going forward. Have you heard about this?
Starting point is 00:13:31 Reps of 225. That would be close, but it's difficult because the humanoids kind of change that and you can just use a normal actuator. This is truly for a large language model. model you feed in our data set we have a public data set a private data set presumably at some point but walk us through tbPN bench yes so i'm yet to try this on jp5 i don't think it's out yet okay for public use of this i don't have it um but i can i can tell some of the questions right so the first one um i have this picture of a horse you have to guess the breed yep so um let me see i think why i don't want to say it in case jubt 5's listening
Starting point is 00:14:07 but it is may or may not be a caspian horse okay um and it's failing right now it's O3 is failing. O3 is failing. O3 is failing. I haven't tried every model. Yeah, we got to try GROC and Gemini. We're going to all out. Yeah.
Starting point is 00:14:20 This seems extremely hackable, but at the very least, if we get one scientist to go off and collect the horse data set and then bench hack it, I think we will have done our job. Yeah. So that's the first question. Yes. The second one is a, it's, I have two pictures. Okay. Before and after of this guy, and it's which peptide did you take? to achieve this body transformation.
Starting point is 00:14:44 Yep, yep, yep. So it fails there. It fails there. So you have a data set of what peptide does what to the human body? Where did you find that? Well, you know, Wikipedia has a lot of this stuff. Okay, okay. You'd think they'd be able to cheat this around with O3,
Starting point is 00:14:57 just reason who is this person go look up what they've said they've taken and then, boom, you have the answer. Well, at first with O3, when I was prompting it, I would, like, save the photo, but then would have the metadata or the file name would be like Caspian horse or something. Yeah, yeah, yeah, yeah. Yeah, and then the third one? The third one, I pass in an audio file of a car revving, has to pick which one. It has to pick, it has to identify the car.
Starting point is 00:15:20 The car, yeah. From the engine note. From the engine. And it's not doing it currently. It's, no, it's wrong. This is a good benchmark. Humanity's real last exam. Yes, yeah, exactly.
Starting point is 00:15:31 So I think those are pretty solid. I have some more, obviously, I don't want to make them public in case anyone's going to try to, you know, benchmark this. Of course, of course. We'll see, hopefully. It's funny because. I was mentioning the other day this app that my dad had of tracking that you just set your phone up and it just automatically detects which birds are in your backyard. Yeah, yeah.
Starting point is 00:15:52 Yeah, I mean, this has to be extremely solvable. It's just something that it reveals the lack of like general intelligence when you have to go and collect the horse data set, which should just be out there, or the engine note data set, which should just be out there. But clearly we are in the age of go in RL on the individual problem and we are looking at like the power law of capabilities. Knowledge retrieval is clearly a, you know, $12 billion a year market that consumers will pay for. That will probably grow significantly. And then health and therapy and shopping and all the other features that Fiji-Simo laid out in her post. this is kind of like, you know, what will be RLed against because those are key pockets of value in the consumer economy. And the same thing will happen in the business economy.
Starting point is 00:16:46 But in the B2B context, you'll probably see an individual startup building on top of an API. But even then, most of the model platforms offer kind of RL as a service, fine tunes as a service, something where if you're starting to spend tens of millions of dollars, they will do some customization on top of the model. So that could be the regime for the next few years as we go into this, like, you know, instead of like this centralizing AI force, there's only one company. There's actually like a Cambrian explosion of a ton of companies doing a bunch of different things. So anyway, let's go to Signal's post. Signal's not happy with the launch. He says, okay, I've seen enough. This launch felt like attending a funeral hosted by minimalists.
Starting point is 00:17:25 They're unveiling tech that should feel magical, real breakthroughs. But the whole vibe was grayscale grief. The set design looked like if mood disorder got. about about house grant um i don't know what a bow house grant is exactly uh even the story telling our chart styles the eulogy tributes then closing on someone's health battles what exactly are we are we as the audience morning it feels like where they're trying to get you to pre-install a therapist potentially great products sure but the emotional tone was so damn d oa incredibly strange all the rind i i think a like like it's weird because we're in this world and this is a question that
Starting point is 00:17:59 i want to noodle on all day is is will this be the last launch of of a numbered GPT model because you don't hear about new, new versions of Google going out. You just, it just got better and better and better. Same thing with Amazon when they were optimizing for, hey, it's faster, we have more on our catalog. Our takeaway was, is that the product matters more than the model now and probably will for potentially a very long time.
Starting point is 00:18:26 And when we were watching the stream, I was cheering because they gave the feature of you can now, talk to the model and get it to trigger a deep reasoning workflow or get it to give you a quick answer in natural language. And so it's abstracting even for even more of the UI into the actual text interface. And so I think in terms of like surprise and delight and and I don't know, it's like you, everyone kind of rips the Apple thing. But Apple does a great job of being euphoric and happy with somewhat minor product changes. And like maybe that's, more of where they'll go is just hey there's these new features and here's how these things and
Starting point is 00:19:08 Apple will spend 10 minutes on stage talking about like shifting an icon around and stuff and it's like I thought it was interesting they just sort of casually mention that they're they're deprecating the old models I think it's great which makes sense I think it's great I don't want the model picker anymore but are you upset a lot of people are going to be well I think they're getting rid of 4.5 oh really and you're and you're a 4.5 fan I love 4.5 but but I I would imagine that the future is, if I ask it to think really hard about the prose and the writing style, it would then do a pass with 4.5. But it would only trigger that when it needs to. It's not going to give you that.
Starting point is 00:19:49 Because if I'm just asking for, hey, regurgitate a bunch of facts or write some code or put together a table of data, like it's not going to need to pull 4.5 off the shelf, just like it's not always going to pull Python off the shelf. It's not always going to pull web browsing off the shelf. And so I'm not sure that I necessarily want 4.5 there as a selection criteria. I would like all this to be tucked behind a UI and have something that's actually cleaner and less frustrating to use. I think it'll lead to higher attention. Yeah, for the average, like, normie. No one knows what 4.5 is.
Starting point is 00:20:23 That's true, it's true. Anyway, Chris Pikes' OpenAI and Anthropic are duking it out. Meanwhile, consumer surplus is growing. We also have very good news. We also have the details from Mike Noop over at ARC-AGI. Full GPT-5 is along the V-1 Pareto frontier. That's cost versus performance. Open A-I said they focused on other goals like UX and reliability.
Starting point is 00:20:48 Our testing supports this. Mini GPT-5 is super impressive accuracy for cost. In fact, based on cost efficiency, Mini could have entered Ark Prize 2024 and likely won first place. We are still verifying GPT OSS, or as Rune says, GPT toss. Results soon, nano-GPT appears overfit. Performance is commodity, and Francois Chalet is also chiming in with the top line. John, production team needs the deck.
Starting point is 00:21:19 Oh, we don't have a deck today. No deck today. We're just doing it live. Yeah, we're just riffing through the timeline tab and just pulling up some random posts so so you're free to pull those up but also we can just read through them Ashley Vance is saying but but model switching was my job model switching is out and we are into the future of just talk to the model just talk to the model and ask it what you need to do and it will switch for you it will pull the right
Starting point is 00:21:46 tool for the job anyway the other question that I have for the open AI folks today is on the nature of secrets so in zero to one TIL has this concept that that discovering a secret is key to building a startup and it's a key insight. And I was joking with, you know, the super intelligence or GPT5. Could my first prompt be, teach me exactly how to build GPT5? And then I go to meta and I say, I know how to do it. I have the prompt. I have the result.
Starting point is 00:22:24 And of course, the answer is no. Of course, Open AI would never leak the most frontier capabilities into the model, but can you build a super intelligence, can you call it super intelligence if it doesn't, if it can't tell you how to build super intelligence? One read on what the secret might be is that the app was the most important thing all along. And if you create this narrative that superintelligence is, you know, weeks or months away, and you get a bunch of people that go and try to compete on raw intelligence. Meanwhile, you build a consumer app business with billions of users. Yeah. It's like it seems like a pretty good strategy.
Starting point is 00:23:09 I guess one quick thought is, how do you rate Sam's vague posting from yesterday with the Death Star, in the context of the release today? That's a great question. There's a bunch of reads on it. One is just that the Death Star is, to some degree, like Stargate. Oh wait, if this is the apocalypse, I figured I'd at least tune in live. But the impact of GPT-5 is not one crazy superintelligent model that does everything.
Starting point is 00:23:49 It's just a more user-friendly, higher, retention, lower churn consumer model that weaves its way into all aspects of daily life and improves performance and efficiency all over the place. And so you have to build this massive cluster to serve all of that. I don't know. What's your read on it? I don't know.
Starting point is 00:24:12 I think it just was, I think it was dramatic. It was provocative. It is provocative. People didn't like it. It is provocative because there are many other, like, super megastructures that are in sci-fi history that are positive. Or positive, yeah. Yeah.
Starting point is 00:24:28 And this is, this is like. But it gets the people going. Yeah. I don't know. Is it a metaphor for someone else that's going to attack? Is he, I mean, the image is from the viewpoint of someone looking at the death star. Is he saying he is seeing a death star being built on the horizon? Is that something else?
Starting point is 00:24:45 Is that another company, another organization? Is that the government? Is that the, is that legal? Here's a read from Bubble Boy says, I am an expert on bubbles, so it brings me no joy to say that the AI bubble is popping this time next year. He is updating his timelines.
Starting point is 00:25:02 When you promise infinite scaling and don't produce it, the calculus changes. I don't think it will be bad for most companies, but those who built their entire business model around making the best LLMs are unfortunately gonna struggle as models become more of a commodity. Again, open AI is a, And my read on it is a consumer app business, right?
Starting point is 00:25:21 They still have a big enterprise business, but by, you know, their recent valuations are predicated on this incredible consumer business that they've built. Bubble Boy says the end user doesn't care much if Claude is 5% better than GPT5. They care about cost, speed, and utility, especially at scale, things will be going. The obvious play now is shorting a video and dumping, okay. Getting into financial advice. getting into financial advice territory here bubble boy but interesting again kind of goes back to what
Starting point is 00:25:53 i was saying earlier in that if you were raising billions to make a lab and i think the potentially enthrop you know we'll see what happens in the coding market but there's some clear winners emerging and then on the consumer side you know expecting a power law outcome and and it's hard to see anyone on seating chat GPT there completely agree I want to dig in more but we have our first guest let's welcome him to the stream what a day mark how you doing pretty good nice to see you guys again what's happening uh congratulations on the launch take us through it are you were you actually live or are you wearing the same thing and you recorded yesterday actually like okay I don't know why but we do
Starting point is 00:26:43 I mean, we're big fans of live. I mean, it just allows you to be the most reactive to the most new information. Give us the core thesis that you are trying to get across. I think that there are a few narratives out there. We've been enjoying the one that's, you know, this is a dominant consumer product. They just made it a better consumer product and people are going to use the product and get more value at it. I saw a bunch of things in the presentation where I was like, that's going to make my daily usage of chat GPT better. At the same time, we're in this, we're in this world of all
Starting point is 00:27:24 the models, the numbers matter and the scale matters and this and this and that. And it's a fine line and it's a dance. And we're in a transition phase away from benchmarks and away from talking about the size of the bubbles. But what was your core thesis? Like what did you want to get across to the listener? Yeah. I mean, fundamentally, I think from a research perspective, We've been working on reasoning models for several years now. And I think until now, you've had this really clunky interface. You have to pick, you know, GPD-40 or you have to pick O3. And for the longest time, we've known that O3 gives you better answers across the board.
Starting point is 00:27:58 It's just too slow, right? I mean, you often don't want to just sit there and wait for the model, reason it out. So you've done a lot of work to push the speed, the performance of our reasoning models, such that these can come together and work in a very seamless way. And so I think, you know, above everything, we're trying to move the world into this agentic reasoning world. We believe that's the future. And on top of that, you know, you pointed something out, which I really resonate with. Post-training is a huge part of this release.
Starting point is 00:28:28 We really wanted to highlight Max Schwarzer and his team who did a phenomenal job. And they've made the model just really that much more useful for consumers, for businesses. It's a monster at coding. So, yeah. on the on the speed of reasoning you're obviously the chief research officer it are you more optimistic about getting speed ups there from i don't know algorithmic design software optimizations or new hardware just let moore's law carry on or find new a6 or we saw cerebrus posting yesterday about um the incredible speed that they're getting 3,000 tokens a second on GPTOSS. And I'm wondering what levers, obviously, we pull all of them, but what path of the tech tree are we, should we be like most focused around, most tracking and most excited about? Yeah. I mean, as a person who represents research, I control the
Starting point is 00:29:30 things that I can't control. I think a lot of that focuses on algorithms, right? Simple algorithms that are scalable, that we can pump a lot of compute into. We also do. We also do care about the hardware improvements that are stacking up. With the open source repeat release, you see thousands of people, really kind of serving these models, creating really great inference stacks, and those are really great lessons for us to pull from. You know, what's the ceiling of the speed in which we can serve these models? What can you tell us about the actual user experience of speed?
Starting point is 00:30:07 I was, I'm, I've, just like last week, I finally got to a place where like, for a lot of tasks, I'm, I'm firing off a 4-0 query and an O3 Pro query. Yeah, I just have, I have two tabs. Yeah, I've two. The O3 tab. Exactly. And I'm wondering what user experience patterns you think can help people balance between those. Is this like just something that we're like different patterns that we're going to learn over time or different, Or are there going to be certain problems of user experience that are purely solved just by better product design, better speed, and we don't even need to learn these.
Starting point is 00:30:47 Because I remember, like, you know, when you prompted an image generator, you used to have to say, like, don't know six fingers, five fingers, please, or like, don't make mistakes. And now, you know, the models kind of have that baked in. But how are you thinking about the user experience of getting the user, the results in the right? amount of time. Yeah, I mean, this is one facet of why we believe so much in reasoning. It's just because all of the scaffolding you used to have to give the model, all of these small hints, they go away, right? Like, the model can examine its own outputs.
Starting point is 00:31:20 It can review them. It can be like, hey, look, like, I'm just counting the fingers here. Why are there seven? And it can kind of fix that, right? It does a lot of iterative generation. It does a lot of fixing things on the fly. And so we think one of the benefits of bringing reasoning to the world is really to kind of remove the need for scaffolding. And with GPD-5, right, we know how clunky that that
Starting point is 00:31:41 experiences with switching between 40 and O3. Actually, I mean, there's so many stories. I was just talking to someone yesterday, right? They're like, hey, well, you know, I've used 4-0 my whole life, right? It's the frontier model. And I'm like, hey, well, have you tried O3? And they're like, why would I try O3? You know, three is less than four. And so, you know, we need to get out of that world. Your GPD-5, I think it's a one-stop-shop reasoning and reasoning and we've really tried to make it kind of just pre-optimal yeah yeah it's absolutely crazy to just take a bunch of letters and smash them together and expect people to pick up on that as a name or a brand chat GPT TBPN we're both
Starting point is 00:32:19 kind of in the same insane gambit but fortunately it's worked out and I think people have gotten over the hump but see it pulls off the tongue yeah sort of except our friend David Senra keeps flipping the letters a lot of people do that but at a certain point yeah you do breakthrough in chat GPT has, but keeping the model numbers simpler makes a ton of sense. Talk to me about the pace of play for research to actual product. Yeah, and on that note, the line between your personal philosophy on the line between research orgs and engineering product orgs. Yeah, I mean, so our research operates on a variety of different timescales, right? We have teams that they scope out a bunch of
Starting point is 00:33:04 ideas and then they start to kind of narrow in on the promising ideas as they get closer to a run. And then you kind of see a winnowing of ideas as you get closer to launching a flagship model, right? And there's always this kind of like explore, more exploratory to more kind of concrete and execution focused pipeline. And we're pulling on ideas across the board here, right? There's a lot of work in architecture optimization. Seb was on stream. He pointed out improvements in synthetic data. So there's really a lot of work that goes into creating one of these models. And it's hard to say like, oh, this model was about this breakthrough. Just because right now we have this machine that's producing breakthroughs on all of these axes and even across
Starting point is 00:33:51 several paradigms, right? So it's all that coming together that produces the experience that you guys feel. Yeah. Can you talk to me about the legacy or future of 4.5? I remember I was talking to you and I was like I haven't been using it a lot and you looked at me like I was crazy You were like oh, it's so good and I was talking to Tyler and he was like our intern here and he was saying like yeah the people who really like Understand how good it is use it But but I was I was wondering is there a world where that is a tool in the tool chest for GPT 5 in the same way that Python is or web browser is and if if it detects that I want something with more more emotional prose or more thoughtful writing, it can do a whole bunch of research, collect a bunch of raw text, and then kind of do a 4.5 pass that I believe is more expensive maybe and maybe doesn't
Starting point is 00:34:46 make sense for every single query, but could be a feature in the loop or a tool that is pulled into the overall product experience. Yeah, absolutely. Speaking of 4.5, it's also a very smart model, right and one of our bars in creating gpd5 was to make sure that on a lot of the axes we cared about that it was able to outshine 4.5 and i think even in some of the soft ones like creative writing i think that was the case and that's what makes us so confident with the name i think we're able to really rely on all of the architecture advancements all the kind of post-training advancements all the synthetic data advancements to create a model that's better than 4.5 but much faster and much cheaper.
Starting point is 00:35:32 Yeah. It feels kind of like we're, I remember, wasn't the second iPhone called the iPhone 3G? And the number literally corresponded to a specific technology. And now when you get the iPhone 14, it doesn't mean it's 14 megahertz or a gigahertz or inches big. It doesn't, like the number is abstract and it speaks to a bucket of features. And it feels like there's, I mean,
Starting point is 00:35:59 this was the first day of kind of re-educating folks on what the nomenclature means going forward. Have you talked about an annual release schedule or like or because there's the iPhone cadence and then there's the Google cadence which was like Google search just got better every year for two decades. It feels like at a certain point you want to just be shipping as fast as possible. How do you think about the culture of shipping updates that you know you find something that feels like, hey, that could make the customer more delighted or the user more delighted. And we don't need to do a big training run for it. So let's get that out today.
Starting point is 00:36:38 And let's tell people about it. Like, how are you thinking about fast iteration versus splashy announcements? Right. So on the product research side, I think it makes a lot of sets to think about, you know, what's the cadence of release and, you know, what are the feature sets that we want to build? And I actually think there's enough great research happening there that we don't have to worry about, oh, you know, is there going to be a drought or a long stretch without enough features to launch?
Starting point is 00:37:02 But one thing that's important for us is to be able to provide the people doing the exploratory work some buffer from that, right? It's hard to do really great exploratory research in an environment where you feel pressured to do release after release after release. And so we let that be a little bit of a lazier pipeline, not meaning that the work itself is lazy,
Starting point is 00:37:21 but we give it space really to mature and to flourish. And once it's ready, we can ship things across across that fence. So that's kind of philosophically how we organize. We have a product research org still very much entrenched in the research and they care about the release cadence and they're able to draw from all of the research that's happening, you know, algorithmically and in scaling and in RL. Yeah. Talk to me about tool use and how that's growing. I was I was kind of noodling on this idea that, you know, the, I was thinking about the IMO. and how, at least from the reporting, it sounded like Open AI's model didn't use tools for that,
Starting point is 00:38:07 and that's an incredible achievement, but it's kind of artificial. Like, I don't care if the model doesn't use tools. I use everything possible, and even if an LLM can memorize every fact, I'm fine with an LLM looking stuff up in a traditional database, spinning up a spreadsheet. Like, use whatever tool you want, just give me the correct answer. But do we have, is it important to give surface to the user the variety of tools that are in the GPT5 tool chest? I noticed something magical happened when I was using GPT, I was using O3 Pro, I sent an image in and I asked to estimate the height of a desk and it wrote like a thousand lines of Python image interpreter and was like interpreting pixels. And I was like, I didn't even think to trigger Python.
Starting point is 00:39:00 It did. Yeah, yeah, yeah, no, it was right. It was crazy. But the really funny thing was that it was just a standard-sized desk. It was just like, it could have just Googled, like, how tall is an average desk or something? Or just memorized. It probably was just already in the weights that it knows that a desk is like 36 inches tall. But it did a ton of work and it still got it right.
Starting point is 00:39:20 It fact-checked it a bunch of different ways. But I've noticed that now I can pull different things. Make a table. Don't make a table. Write some Python for this. Don't write some Python. And it kind of gives me the feel of like a super user to some extent. But I'm wondering how you're thinking about what is further down. Like you've given chat GPT a computer, as Ben Thompson said. You've you've given kind of the core tools, the Python, repel, the web browser. How are you thinking about kind of the long tail of tools that you want to bring to bear? And how does that interface? I know that there's. API integrations and all sorts of different surface area there, but give me some context on that. Yeah. I mean, our reasoning models are pretty cute, right? I mean, I think they, when you look at their behavior, right, they know the height of the desk, but they'll still go to verify it five different ways. It's all consistent, give you that median answer.
Starting point is 00:40:15 And I think that's really what makes these models so powerful. And when you think about tool use generically, right, like we want the models to use that reasoning ability to just be able to like zero shot a new tool, right? It should be able to kind of minimally get instructions about how the tool works and just be able to know how to use it, right? And humans do this all the time. You get a new tool, you start experimenting with it, and then you don't need too much scaffolding
Starting point is 00:40:39 and you just go and go and use it and understand it. So we want our reasoning models to use their reasoning to be able to use a broad selection of tools. And of course, there are a couple that you really do care about. In coding, it's very important for you to be able to execute code. It's really important in personalization for you to be able to get context from your calendars and from basically from the digital world. So I think there's a range of tools we are familiarity with, but beyond that, we want the model be smart enough to just generalize and use tool zero shot.
Starting point is 00:41:09 Yeah. Talk to me more about personalization. I feel like there's a world where I feel like I'm maybe under utilizing chat GPT as an app because I don't have it wired up to a a non-relational database where it can just stuff data from, you know, it already has memory and it's doing kind of roll-ups and there's some sort of saving of context, but I was, when we were talking to Kevin Wheel, I was kind of like, well, like I don't really have like a GitHub repo that's active that I wanna like dump code in regularly
Starting point is 00:41:44 for like my one-off tasks, but for that image generation, like, you know, understanding the height of the desk, it's like, well, if I'm doing that a lot, Maybe I want to have a tool built that lives in the world that my chat interface can kind of interact with on an ongoing basis and contribute to and modify and kind of wind up instantiating a piece of software that's like even more long lived. And then every successive query is even faster. So, yeah, how do you think about different ways to increase personalization? Yeah, I mean, I think memory is huge. So we have teams surrounding memory and also personality.
Starting point is 00:42:23 And when you look at memory, right, I think it's just we have so much context built up about ourselves that the model doesn't have. And our memory team's been really hard at work. You know, there's a surface level of just gathering facts about you. But there's also stuff about just kind of thinking very deeply about who you are, what your motivations are. And even you can think about, you know, you're trying to do some code-based tasks, right? a developer, shouldn't the model just be trying code out, you know, and just kind of leveraging all that memory, kind of its thoughts about what you want to do to just help you kind of be doing work all the time. So, yeah, we do think memory is a huge part of making the model more personalized
Starting point is 00:43:05 to you. And it should just make use of all that passive signal about you that it observes or all of that interaction and just help you accomplish your goals. What do you think it'll take for AI to start making novel discoveries. That's been a critique over the last year is everybody's so excited. Everybody's using these products every day and in their work and life, and yet it still feels like we're missing that. Dwork Keshe's talked about, you know, potentially that being around continual learning, but I'm curious what you think. So one thing to underscore is I think the models are already phenomenally creative in certain ways. So when I looked at, our performance on
Starting point is 00:43:50 contests, right? You know, I've done these contests before. Sometimes you have this mental classification of these problems require more creativity or these ones require less. And one of the big surprises for me was that the model can get some of the ones which I intuitively think require
Starting point is 00:44:06 more creativity. And, you know, it often does come up with these solutions that I consider quite ad hoc and really don't pattern match to anything I've seen before. When you look at, you know, advancing science or mathematics or fields like this, one thing that construct in which humans work sometimes is there are kind of three builders.
Starting point is 00:44:30 In mathematics, for instance, there are mathematicians whose role are to kind of build out this theory and almost to kind of create, you know, Olympiad style sub-problems, which often other mathematicians who are very good at that kind of. of style of work can do. And I do think kind of the model will increasingly contribute on that side first, right? If there's some mechanical like, hey, you know, I really don't know how to simplify this expression. I really don't know how to like get this result. It can really do that quickly for you. We're trying to increase the envelope such as the models getting towards that theory building side and, you know, being able to create creative hypotheses. And all of these components are
Starting point is 00:45:16 very useful for what I consider the ultimate goal, which is being able to automate some of our own work and our own research. How are you thinking about like the layers of mixing? Like I remember GPT4, I don't know if this was ever confirmed, but mixture of experts model, this is kind of like widely understood in the industry. Now are we in the era of like a mixture of models that have mixture of experts? Like how many mixtures are going on? How does GPT-5 actually work? Is there a taxonomy or architecture diagram that you can kind of like walk through to explain what GPT-5 is because it feels so much different than GPT-3? Yeah, I mean, one of our, probably the pinnacle of our research road map, but our path to AGI. When you look at the levels of AGI, the top level is what we described as
Starting point is 00:46:15 organizational AI. And what this means is, you know, collections of agents working together often like we might in a company towards a shared goal, right? And you would imagine that these agents probably sub-specialized in ways, maybe similar to what humans do, maybe in their own more efficient ways. And I think, you know, effectively work together to accomplish them goal. So we very much care about exploring this vision, seeing that's much more effective. than, you know, one single big brain working on a problem. And I think there are reasons to think why it could be so. And yeah, I think that that is one of the things that we're after.
Starting point is 00:46:56 Yeah, on that note of specialization, how are businesses working with GPT-5? Or how do you expect them to work with GPT-5 in terms of coming to OpenAI and asking for special capabilities or fine-tuning or, you know, any sort of RL on this particular problem in my world. I have this specific data set. It's not public, but I want a hyper, I want you to bench max on it. I want you to get 100% on, you know, the gas station bench or whatever. You know, if I'm, if I have a certain business and I'm willing to invest in sort of some
Starting point is 00:47:34 some overfit RL because it will create immense economic value for my business or it'll solve some fundamental problem, how can, how are businesses going to be using GPT5 over the next few Oh, that's a great question. So I think that this is a chance to kind of highlight one of the results that we've accomplished over the last couple weeks, which is our at-coder results. So this is a relatively unknown programming contest, but it involves really the pinnacle of the best coding contestants in the world. And what they do is, you know, they're put in a room and they have to solve an optimization problem. This is something that's actually very real-world. relevant. So you can imagine an optimization problem as something like what Uber might have. You have, let's say, riders and you have drivers, and you want to kind of create a system where you match them as quickly as possible, you know, with the least amount of cost, for instance. And so we've really created a system that can solve optimization problems at the level of the best in the world. And these truly are the kind of the best here,
Starting point is 00:48:45 solvers in the world. And so we have an organization led by Alexander Madri. It's called Strategic Deployment. And what they do is for a select handful of customers who really have that beefy problem that they need to solve
Starting point is 00:49:01 to just go and provide that value. And I think there's a lot we can do there. I think there's a lot of very, very valuable optimization problems in the real world. And we're really excited to partner with people. Because I think this creates a template for directly having AI provide economic value and really catapulting certain industries forward. On the research side, what unique advantages do you think you and
Starting point is 00:49:32 your team have given your position in the market with the incredible user adoption and the incredible usage from those users. It's not just DAUs, but it's actually the number of queries, semi-analysis estimated at like 71% of all queries going through chat GPT. What advantages does that confer from a research perspective? Yeah, I mean, a lot, right? And I think it allows us to kind of deeply understand use cases. It allows us to understand the frontier of where humans are, you know, kind of finding value, where they're not finding value, which areas that we need to improve the models on. It gives us a lot of signal to how users are deriving value,
Starting point is 00:50:17 when they derive value. And what is that signal? Like I see the thumbs up, thumbs down button. I'm sorry, I don't push it very often. I'm not doing my job, apparently. But I know that you can figure out whether or not I'm satisfied. Stop booing me, Jordy. That's the research team.
Starting point is 00:50:37 Okay, Mark, I promise you for the next 100, Chad GPT responses, I will, I will be honest with my thumbs up, thumbs down. I love it. Just to help you do extra training. We have tons of people, luckily, who do. Oh, that's great. Okay. So you do get a lot of thumbs up, thumbs down. And I'm sure I have done it occasionally. But I also imagine that there's a ton of other signal in there. You know, with the TikTok algorithm or, you know, any social algorithm, it's very easy. Time on site. But with chat GPT, obviously it's exciting when we hear, okay, 30 minutes a day or some rumored number of minutes, it feels correlated with usage, feels correlated with value that's being delivered.
Starting point is 00:51:17 You can obviously look at churn metrics and all that stuff, but what other pockets of signal are you finding? I mean, I remember the story about Google, where they were trying to figure out how to handle misspellings and create the definitive database. Do you know this story? They were trying to develop the definitive database of how to spell things, and they were taking a bunch of shots at it, and they figured out that the richest source of data was just this: if you type "financial" into Google and misspell it, oftentimes you will just correct it yourself, and the second query you send will be spelled correctly. So they can just look at two similar queries and ask, what's the second one?
Starting point is 00:51:58 That's the correct spelling. So, yeah, what other pockets of signal are you finding that are translating into the research environment? What are you excited to go deeper on? Yeah, so I'd love to first talk about the DAU signal, because I think, you know, that's something a lot of companies track, but we actually find a lot of danger in tracking it too closely. One of the recent blog posts we pushed out was on sycophancy, right? If you just say, you know, hey, we're going to boost responses where users say thumbs up... Yeah. You know, it creates a condition for the model.
Starting point is 00:52:33 I just want to say, Mark, I love everything you're doing on this front. Yeah, this entire interview has just been fantastic. You are the best. We'd love to have you back on the show tomorrow. But clearly problems with that. Yeah, yeah, clear problems, right? The model just starts kind of sucking up to you. Totally.
Starting point is 00:52:54 And it's saying like, hey, you know, you're right. And even in complicated situations where I think objectively, you know, collectively, we'd be like, hey, this person's in the wrong. The model starts saying, hey, you know, you're right. You know, the other person's gaslighting you. you know this other person's kind of and and people deal with people deal with this in the real world they'll go to a friend they'll tell them about a situation and the friend will give them advice but maybe it's not the entire it's not the fullness of the situation right maybe they left
Starting point is 00:53:59 out some key facts, and the friend is like, oh yeah, that other person is definitely in the wrong, and they skipped over some important details. Yeah, exactly. And we don't want our models to fall into this trap where they're just trying to get you to like what they say. And so, you know, we rolled back a lot of changes that produced that kind of behavior. And really, the way I think about daily active users today is: we need to be opinionated about the features that we build for the future. I think we have a lot of ideas here, but we have to let that drive. You know, build for the future, build for the things that you think people will want, even if they don't necessarily know they want them today.
Starting point is 00:54:20 And then use DAU as kind of a byproduct, right, a way to track that you're on the right track here. So, yeah, I mean, we want to be careful here. We don't want to fall into these traps where, you know, three or four years from now this turns into kind of engagement bait or something like that. Totally. Yeah. How much time has the research team been focused on efficiency specifically? It felt like summer was a good window, before kids come back to school and start maxing out queries, to increase efficiency. And I know the costs of GPT-5 are significantly lower. I'm like, this is the best it could ever be.
Starting point is 00:54:40 It's good enough, bake it on an ASIC. I just want it for free, and I want it in milliseconds. But that's just me being grumpy, I guess. We've done a lot of work. We've been building out our different teams. We focused a lot on scaling. I think Grant's going to come on a little bit later; he's been spearheading a lot of that work.
Starting point is 00:54:59 So, yeah, no, honestly, it's become a bigger and bigger focus for us, especially in the last couple of months. I mean, this is somewhat related to the sycophancy thing, but I'm interested to know: what do you think is driving, like, the GPT tone? You know, how, like, the em dash is a thing, and then the "it's not a newspaper, it's a way of life" construction. It's like there's these little flourishes that come through and are kind of a tell that it was AI-written. And in a lot of ways I love it, because when I get a deep research report, I like that it's using the same
Starting point is 00:55:37 Wikipedia-style tone. Like, I want consistency there. I don't want it to be like, oh, today it looks like it's a Vice News article and tomorrow it looks like it's written by someone at BuzzFeed. I like that it's consistent in many ways. But why is that happening, do you think? Were bigger models like 4.5 kind of able to solve that, or do those kind of, like, local minima, like, I don't know, wells happen even in bigger models? Is there anything from a research perspective that can stop GPT from having its own voice? Or is it fine that it has its own voice? Yeah, that's a really great question.
Starting point is 00:56:12 And I think, you know, as you scale up models, as the models become more intelligent, they kind of have just a deeper and deeper understanding of the tone, right? And so you expect that to improve just naturally as you make the models more powerful, bigger, better reasoners. But one thing that I think gets lost a lot is each individual company has a lot of impact
Starting point is 00:56:33 in terms of how they shape the default tone. And we publish a document called the Model Spec. It kind of lays out how we expect the model to sound in certain cases, lays out a lot of examples for that. And I think we use the spec in many ways, right? We have people come in and see, hey, was this thing generated in accordance with what we would hope to generate from our spec? And this is a living document, right?
Starting point is 00:56:57 It evolves over time. And so I think, you know, each company kind of has a very opinionated take on what they think the model should sound like. And it's not an accident that the models sound a certain way. I don't think just naturally every company is going to train the same kind of voice into their model. Totally. Well, thank you so much for hopping on. Congratulations on the big launch. We'd love to have you back soon to talk more. We could go in a million different directions, but we'll let you get back to it. We know it's a big day. So have a great rest of your day, Mark. Thank you so much. It was a great conversation. Talk to you soon. And we will tell you about
Starting point is 00:57:33 Restream, one live stream, 30 plus destinations. Multi-stream and reach your audience, wherever they are. This stream is made possible by Restream. OpenAI just did a live stream. With Restream. If you're trying to do a stream, you've got to get on Restream, so it's everywhere. And we will bring in our next guest, Greg Brockman, the president of OpenAI. And we'll bring him in. Greg, how are you doing? Doing great. Thank you for having me. Welcome. Congratulations.
Starting point is 00:58:00 How are you feeling? How's the company feeling? It's been such a wild journey. Just take me through a little bit of, like, the vibes at the company and how you got here today. Well, I'm excited. The whole company is excited. And honestly, I'm just so proud of the team. Like, it's just been amazing to watch people come together, not just for this launch.
Starting point is 00:58:21 And you know, the funny thing is behind the scenes that people are always putting in the last-minute adjustments and polish and scaling up the capacity. And there's always something that goes wrong before launch day. So there's a lot of people who, you know, worked late into the night and really crunched to bring this release to the world. And, you know, it's a little bit like the duck that's, you know, paddling. But that also describes the whole OpenAI history, right? Is that I think that we have put in many years' worth of investment into the techniques used to produce this model. And really, it's across just every function within OpenAI that has come together to make this a reality.
Starting point is 00:58:59 Yeah. I mean, you've been there for every GPT release. How do you think about summing up each iteration in kind of like one line? Because GPT-1, GPT-2, GPT-3, these feel like similar architectures, at least histories, at least histories kind of compress them into similar architectures. But how do you think about the progression of just the big numbered releases? Yeah, it's interesting because in some ways it's a punctuated equilibrium, but on the inside it looks very smooth, right? Even before the GPT series formally began, the first result that really sort of set this path to be something that we were heading down, where it was clear that we were going to pursue it, was the unsupervised sentiment neuron, which was an LSTM, in like 2017, so a different architecture from today's transformers. And it was the first time that you could train a model to predict the next element. So we predicted the next character on
Starting point is 00:59:59 Amazon reviews and we were able to get semantics out, right? Because you expect, okay, yeah, it's going to learn where the commas go, maybe what nouns and verbs are, but the idea that it would learn a state-of-the-art sentiment analysis classifier? That was mind-blowing. And so I remember seeing that result in 2017 and thinking, we have to scale this up. We have to see where it goes. And so GPT-1 was, I think, a good sign of life: you train on sort of all the public data you can get, you use a transformer, and you're able to get state-of-the-art on various downstream benchmarks, right? So you have a model. It clearly learned some representation, something useful about the data that it was shown, and it's applicable. You can use it for various tasks. We didn't
Starting point is 01:00:39 really think very hard about the generation side. GPT-2 was the first time that we were like, all right, actually, the samples we're getting from it, the things it actually generates, they're kind of cool. And I remember reading, in the GPT-2 blog post, we have this unicorn story where it generates some fictional story about a herd of unicorns. And it was just so cool. It was like, wow, it wrote a story that's actually kind of interesting. It doesn't totally make sense, but like there's something here. There's some real spark of intelligence within this model. GPT-3 was the first time that we had a model that was just barely above threshold for something people would want to
Starting point is 01:01:15 use. And I remember working on the GPT-3 API. This was our first real product. And it was actually the hardest product, the hardest project in total, I've ever worked on, because it just felt like maybe no one wants to use this model. We don't really know what it's useful for. And it certainly was the case that GPT-3 was a great demo machine. You could make really awesome tweets and, you know, cool little apps, and it would give you quick answers. But it didn't feel very reliable.
Starting point is 01:01:43 And then GPT-4 was something that actually felt like it had true real-world utility. It was above some threshold. It was something that was helpful for health. It was something that was helpful for, you know, starting to be good at coding. And GPT-5, I think, just sets a whole new standard for the reliability, for the utility. Things like coding, clearly, we're already on this trajectory of transforming software engineering, and this year I think it's really on trajectory to be revolutionized. So just really exciting to see that whole arc.
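The sentiment-neuron setup Brockman describes earlier in this answer, training a model purely to predict the next character of Amazon reviews, can be illustrated with a toy stand-in. This sketch uses a bigram count model instead of the actual character-level LSTM, and the corpus and names are purely hypothetical:

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each character, which characters tend to follow it."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def predict_next(counts, ch):
    """Return the most likely next character after `ch`."""
    return counts[ch].most_common(1)[0][0]

# Hypothetical stand-in for the Amazon-reviews corpus.
reviews = "this product is great. this price is great. this is great."
model = train_bigram(reviews)
print(predict_next(model, "g"))  # "r", the most frequent follower of "g"
```

The actual result was that a large LSTM trained on this bare next-character objective developed an internal unit tracking sentiment; a count model obviously cannot do that, but it shows the objective itself, which asks for nothing beyond predicting the next element.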
Starting point is 01:02:10 When did the API opportunity really click for you? Because I do remember companies in that era that quickly unlocked the power of the API and grew tremendously. When did that opportunity click? Because you said initially that you kind of had some, I don't know, concerns, kind of doubts, about how useful it was going to be. And then when did the consumer opportunity click? Well, at the end of 2019, we had GPT-3. We knew we needed to build a product to be able to actually continue the mission, to be able to raise capital. But what did we want to build? We're really here because we believe in AGI that's going to have this powerful, positive transformative effect on society.
Starting point is 01:02:53 and we want to be part of it. And so we thought, well, maybe we could build something in health. And then you realize, okay, well, we're going to sell to hospitals and we're going to maybe hire... Well, let other people do that. Exactly, right? It's just like you have to go into one domain. And that means giving up on the G, the general, right? It's like it feels like you're going to become one particular thing.
Starting point is 01:03:12 Whereas we kind of want to be supporting all industries at once. And so the idea was, let's build an API and let people figure it out. But this is totally not the way you're supposed to build a startup. Right? You're supposed to have a problem. No one cares about the technology behind it. Add value to that problem. Focus on just that one thing. And so that's why that project was so hard. And in, you know, January of 2020, February of 2020, I, with the team, was going around trying to just find anyone that would be willing to try this API. And we were driving to different offices in San Francisco being like, hey, we have this cool model. And it was hard enough to get people to take the meeting, much less to sign up their company for it. It was actually very fortunate that we found a couple of good partners. And it was fortunate that that happened then, because March 2020, suddenly, that was COVID. We weren't driving around to people's offices to try to beg them to use this, you know, this budding new technology.
Starting point is 01:04:05 So it was really six months' worth of grind, right, of really trying to turn... Like, when we started with GPT-3, I remember the inference code was not very well optimized. It was like, I don't know, 150 or maybe 250 milliseconds per token or something. And we just optimized, optimized, got it down to like 50 milliseconds per token, which, by the way, today's models run much faster than that, which is kind of amazing for me, just seeing how much faster we're able to run them with much greater intelligence. And I remember setting two goals for the team. One was to actually find one customer who's willing to pay, so literally get a dollar in for this thing. And the second was to get a use case that we use at OpenAI every day. That first one happened within the first couple
Starting point is 01:04:46 months. So actually, at that moment, I was like, all right, this thing is probably going to work. But in order to get there, we had to do a bunch of, you know, just scaling the API and really, you know, doing the product work. But that second one took much longer, right? And that wasn't really until ChatGPT. And so if you fast forward a couple of years, because this was, you know, mid-2020 when we first got the API into the world, ChatGPT we didn't release until November of 2022. So you're talking a decent period of two years there, a little bit longer. And I remember we were building, you know, people have talked about, we were going to call it maybe Chat with GPT-3.5. We had a sort of precursor product called WebGPT that was
Starting point is 01:05:28 built on 3.5 that we were literally paying contractors to use. Right. So this was all throughout 2022. We basically had the ChatGPT precursor that we had to pay people to use. They would not pay us; we had to pay them to use this thing. And the moment for me that really clicked was actually when we finished training GPT-4. So that was August 8th of 2022, which actually is like three years ago now, almost to the day, which is pretty wild to realize. And we did the initial post-train of GPT-4. And honestly, it had a bunch of bugs in there.
Starting point is 01:06:04 It was like broken for a bunch of different reasons. But the model was, like, extremely creative. It was actually really interesting. It took about a year and a half to get to the point that the creative writing of our models matched that initial one that was buggy for various reasons. And I remember, you know, we had an instruction-following dataset that it was post-trained on. So really, we had collected examples of, here's the human asking for a thing, here's what the model should do.
Starting point is 01:06:28 So it was really not trained to do multi-turn. So you'd ask it a question, it gave a response. But then I was like, well, what if we just ask another question? And it actually was able to leverage that full context. It actually was able to have a coherent chat. And the moment that we saw that, we were like, okay, this thing is capable not just of being post-trained to do this very specific thing, but it can generalize, right? It can kind of do the intelligent thing, even though it wasn't directly trained for it. It was just so clear this was going to be the killer, killer application.
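The generalization Brockman describes here, a model post-trained only on single-turn instruction pairs still holding a coherent chat, works because the whole conversation can be flattened into one context string for the next-token predictor. A minimal sketch; the role labels and formatting are illustrative assumptions, not OpenAI's actual prompt format:

```python
def flatten_chat(messages):
    """Serialize a chat history into a single prompt string so a
    next-token model sees the entire conversation as context."""
    lines = [f"{role}: {text}" for role, text in messages]
    lines.append("assistant:")  # cue the model to continue as the assistant
    return "\n".join(lines)

history = [
    ("user", "What is the capital of France?"),
    ("assistant", "Paris."),
    ("user", "And its population?"),  # only resolvable via the earlier turns
]
print(flatten_chat(history))
```

Because the earlier turns sit in the context window, a model that was never explicitly trained on dialogue can still resolve "its" to Paris, which is the generalization being described.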
Starting point is 01:07:02 And so then we were planning on launching GPT-4 in, you know, early 2023, and we had this chat infrastructure we'd been working on. And it was so clear, okay, we're going to have to release the infrastructure and the model, and it's going to be this amazing killer product. And so, almost as infrastructure ahead of getting the real thing out, you know, I was excited for us to do ChatGPT, and that's why we did, you know, see that come to life in November. So I think that for me, I was really focused on GPT-4 as the model. This is going to be the chat moment that's really going to work.
Starting point is 01:07:37 And I kind of had missed the fact, because every time you see these new models, you just sort of, you know, see only flaws in the previous ones. And so I missed the fact that GPT-3.5 was something that no one had really tried before in the broad sense of society, and that it was something that was already useful and that people would respond to. Was GPT-3 kind of like the main pivot point for shifting the company towards LLMs? Because in the prehistory of OpenAI, there were a lot of other maybe expensive training runs. I don't know how much financial risk was taken with, like, the OpenAI Five project or the robotics projects,
Starting point is 01:08:13 but it feels like at a certain point, the chat became like the main financial risk vector. So I guess the question is, it feels like GPT-3 was the moment when you shifted. I'm also interested in hearing about, Ben Thompson called OpenAI the accidental consumer company, and I'm wondering when that narrative set in for you. Like, when did it become clear that this was going to be a really, really powerful consumer application?
Starting point is 01:08:47 Yeah, going from paying people to use your product to people saying, hey, we want to give you money for this. Yeah. Yeah, a very important transition, it turns out. Yeah. So, yeah, it's a great question. I would say that if you rewind to the beginning of OpenAI, you know, there's many people who, in retrospect, say that we set out to prove that scale is how you make progress in this field. But it's almost the other way around.
Starting point is 01:09:12 Scale was the thing that worked, right? We tried a bunch of things that didn't pan out. And really, the first time we saw this concretely was in our Dota project. I remember my collaborators, Jakub and Szymon, trained the very first little agent on like 16 cores or something and left it running on their desktop over the weekend. And we came back and it was this very, you know,
Starting point is 01:09:33 sort of constrained mini environment. But the model was doing something smart. It was actually able to solve this kiting environment, and that was pretty cool. And then they and the team just kept scaling up, right? We had all these free cores that were just sitting idle on AWS at the time, and they just kept throwing more compute at it. And every time they would do that, the model would just get better. And so when you look at something like that, you're like, well, you just have to see where this goes. You have to push it until it hits the wall, right? And our
Starting point is 01:09:59 goal with Dota was actually to develop new reinforcement learning algorithms, because the common wisdom at the time was, well, the existing reinforcement learning, PPO, it doesn't scale. Everyone knows that. But the question from Jakub and Szymon was, well, why do we believe that? Has anyone actually tested it? And no one had really tested it. And so I think that that ethos of saying you have to push the existing techniques to the wall until they break.
Starting point is 01:10:24 And then once they break, you actually have a baseline to overcome. And you win either way, right? Either it just exceeds all the humans in terms of the specific capability that you're trying to exercise, which was the case for Dota, or it hits a wall, and now you have a real problem to solve. And so I think that ethos really got embedded in our DNA. And, you know, at the same time, I think that we were really thinking about how do we get to AGI, right?
Starting point is 01:10:50 And really, like, I spent a lot of time thinking about that question of where's this company going and how do we actually achieve it. And you start to do some math in terms of, you know, the kind of compute that it would take to get to AGI. And you just start to realize you're going to have to build really big computers, and those are extremely expensive. And so I think that from the early foundational results and thinking, we kind of realized the path that we were going to have to walk. So it seems like there's been a few walls that we've scaled up through and then maybe hit them. There's been talk of like a pre-training wall.
Starting point is 01:11:23 Now we're putting tons of resources and compute towards reinforcement learning. Is there a third scaling curve that we're going to be talking about in the next few years, or are we continuing to scale up those two primary vectors? Is that too high-level of an abstraction in terms of how we should be thinking about just progress along the vector of scale? Like, give me the up-to-date thinking on just the fruits of scale. Yeah, I'd say fundamentally deep learning. I think that people talk about the bitter lesson. It's almost this exploration into how do you convert compute into intelligence, right? Through, you know, some particular techniques to do that that we're kind of constantly fleshing out.
Starting point is 01:12:10 And the thing that's really amazing is if you rewind to, I don't know, even the 1940s, for the McCulloch-Pitts neuron, which is kind of the precursor to neural nets, if you look at that paper, they have all these diagrams that are actually very similar to the kinds of diagrams we draw now of multi-layer neural nets and things like that. Like, the basic idea of what we're trying to do has not really changed in almost, like, 80-plus years, which is just a wild fact, right? It means there's something deeply fundamental about the thing that we are pursuing. And that idea itself, I think, kind of came from trying to model the information processing of the brain. And it's imperfect and not an exact analogy to biology, and all these reasons that it should fail or that people have said this thing is doomed.
Starting point is 01:12:53 But the results are undeniable at this point. I mean, some people try, but it's really hard to kind of close your eyes and sleep on this, in my mind. And it's very interesting if you look at, you can find quotes from the mid-1960s of people trying to pooh-pooh the whole direction, saying that these neural net people have no new ideas, they just want to build bigger computers. And you can basically say something very similar today. What we're trying to do. One moment. A little water break.
Starting point is 01:13:25 Yeah. Exactly. For all of us. Cheers. Exactly. Cheers. I'll tell you, you know, we're all human. Proof of humanity right there.
Starting point is 01:13:32 Exactly. So what we're all trying to do is find novel ways of taking compute and really harnessing it. And sometimes you hit a wall, but these walls tend to be ones that you can drill through, right? What we found is every time you scale up, everything, all of your engineering, all of your sort of scale invariances, all these things, they get stressed to the next level. It's almost that the tolerances become tighter and tighter. It's like launching a 10x bigger rocket means you need to be, like, 100x more precise on everything, but it doesn't mean that the fundamentals of the science are different. So pre-training, there's definitely been a lot of discussion of a data wall. Doesn't mean it's fundamental, right? It just
Starting point is 01:14:09 means that we need to be better and more precise about what we're doing. There's RL, which has kind of gone from spending a small amount of compute to much larger amounts of compute now. And then there is a third way that we're really harnessing compute, which is compute at test time. And we've published some scaling laws around this. And all three of these things multiply. Like, that's the amazing thing. And of course, the compute and the harnessing of it is the fundamental goal, but you get these multiplicative effects out of all of it
Starting point is 01:14:38 through the quality of your engineering implementation, through the quality of the data sets, through a bunch of the refining work that you do. And there's lots of different techniques and ideas. And that's what makes this field so rich and why progress is just going to continue apace. What about on the infrastructure side? You guys have been busy scaling up.
Starting point is 01:14:56 What can you share on that front? Well, so I run a team called Scaling at OpenAI, and we really focus on building the infrastructure for scaling, and this is in partnership with really everyone across the company. It's almost a misnomer that our team is called Scaling, because fundamentally this whole company and effort is about scale. But what we really try to do, on the physical infrastructure side, is deliver as much compute as humanly possible. And that is in partnership with companies like Oracle, SoftBank, and others, through which we've been able to deliver just increasing amounts of compute to OpenAI. But we're constantly thinking about how do we just deliver more flops and do it more efficiently, earlier, cheaper, more power efficient,
Starting point is 01:15:40 all of those kinds of questions. There's the software infrastructure side as well, and really thinking about how do you coordinate massive numbers of GPUs in order to work across one synchronous training run, how do you coordinate that for reinforcement learning, how do you deploy that into production and bring these models to life at massive scale. And I think that at every single layer of the stack,
Starting point is 01:16:03 there is innovation required. And that's something that's very easy to miss. Like, one way I think about research is that there is, and this is kind of the view from Jakub, who's now our chief scientist, that there's a research stack. And you can kind of think of the top of it as people running experiments
Starting point is 01:16:18 and coming up with new ideas for how to, you know, sort of utilize data or something like that. There's a middle of the research stack of people thinking about how do you take these different ways people are running experiments and be able to train in novel ways and kind of put together the pieces differently. And then there's a bottom of the research stack,
Starting point is 01:16:36 which is like writing CUDA kernels to get the absolute max out of the GPUs. And at every single layer here, you get a multiplicative factor through innovation. So it all comes together as one big whole. On scaling, I'm interested to hear about, if we think about the impact of AGI, the impact of AI, as being some sort of maybe quantitative GDP metric or qualitative
Starting point is 01:17:04 just impact and good, is there an important factor of scale with not even the flops that are going into the models, into the pre-training, into the RL, into the test-time inference, but actually just the flops that are going into the usage of AI within humanity broadly? And I feel like that might maybe be the next, like, scaling curve: as more people use models, they see improvements all over the place. Like, is that something that we should be tracking, so that instead of these, like, S curves, we see, like, the continual exponential? I think that's a great perspective, right? Because at the end of the day, I mean, if you look at kind of the shift from something like Dota, which we pursued because, you know, we wanted to do new algorithm development, but really it almost validated
Starting point is 01:18:00 how we scale existing algorithms. But there was no illusion of delivering direct economic benefit from it, right? To the current models, where we're starting to end the era of pushing on these academic benchmarks, right? You look at things like the IMO; at this point, models are able to get a gold medal on it. The hardest academic benchmarks that are available are sort of no longer the guiding light of progress for these models. Where we actually want to be is for AI to be helping everyone, right, to be something that uplifts humanity. And that's the final metric, right? Is how much does it actually benefit everyone? How much value does it bring to the world?
Starting point is 01:18:38 Yeah, not just HealthBench. It's actually how many people did you solve their health care problem for, right? Exactly. Yes, yes, yes. And that's the actual goal. And that's what's exciting, right? It's like we're moving from the lab to reality. And I remember in the early days, as we were thinking about how do we measure our progress towards AGI, we always sort of dreamed that one day we would be able to measure it this way. And you can think of revenue maybe as a proxy metric for value delivered to the world. It's not perfect, but it's at least something, right? You can think of the distribution of, like, how much compute goes into it, how many people are using it.
Starting point is 01:19:13 But fundamentally, what we're after is how much do we really uplift humanity through this technology? Yeah, I mean, I might be misreading it, but I'm pretty sure that was the Ray Kurzweil philosophy: the total number of flops getting immense, not necessarily all in one data center for one model. It was that compute broadly would be so wide. Yes, yes. And I remember on that chart, right, you can see, you know, the total compute of all human brains, which really suggests a particular vision of how this technology will be rolled out. Yeah, distributed.
Starting point is 01:19:48 The phones count toward the impact. The Wi-Fi router counts toward the impact of the internet, just like the phone does; it's not just the big pipe, the backbone of the internet, that actually matters. Deep research: hit product. Almost everybody I know, at least in the industry, is using it. He's reading 30 pages of deep research a day, basically. He loves it. He's making books with it. But why have agents broadly come around a little bit slower than people may have expected? Is it just that using computers is actually much harder, that computer use is just a really hard challenge? Or, you know, I think going into this year everybody said this was the year of agents.
Starting point is 01:20:34 Are you talking about flight booking or something like that? But, you know, people were saying 2025 is the year of agents, and I would say that it's been the year of deep research and not a lot of these other sort of broader use cases. Sure. Well, 2025 isn't quite over yet, so that would be my response. And I'm very much of the view that progress in this field, the way that it tends to work, is that if something kind of works with the current generation of models, it will be extremely reliable with the next generation of models. And I think that where we've been is that deep research is the... if you rewind a year, that was the, like, we kind of had something working, and then this year it's been just incredible. And I think that agents, you know, specifically, like, computer use agents
Starting point is 01:21:19 are something we've kind of had working, and again, you know, the year is not over. I think there's a lot of rapid progress to be made. But I think that maybe part of it, too, is that the agents that we're about to see are a little different from maybe what we would have pictured five years ago. Like, I remember having a debate with some friends on, do you want an agent that does the flight booking? Because the problem is it's actually a very high bar to beat the flight booking UI, because there are so many preferences that are entailed in that. You really have to know kind of what mood you're in. Like, are you okay with, like, taking the extra
Starting point is 01:21:48 layover, and all these kinds of questions. And actually there's so much other stuff that happens in your life that is toil or drudgery, or that's something that you're not an expert in. Think about health, right? Every patient really is the doctor: if you're coordinating across multiple specialists, there's no doctor that helps you with that, right? That's really on you. And there you actually can have AIs that are just text only that are able to add massive value. And then it frees up your time if you want to go, you know, book the flights yourself. And so I think that really finding the right problems
Starting point is 01:22:24 that have high leverage, right, that really add value to people. And also thinking about the other side: how to make sure these agents are responsible with the trust that you put in them, right? The more that you give an agent access to your email, the more you really have to trust that it's going to, you know, do right with whatever your task is and send the right email to the right people, and be able to, you know, segment your information, and all these kinds of questions. And so I think that there's both a practical how do you get to adoption, but also just where are the most important leverage points in a person's life? You also missed coding agents, because it's been the year of deep research, but I feel like it's
Starting point is 01:23:03 also been the year of coding agents. How is that developing at OpenAI? I've noticed that I'll hit O3 Pro and it'll wind up writing a bunch of code for me and I didn't even ask it to. Then you have specific products for coding. How do you see software development evolving? How are you seeing OpenAI customers use coding tools? And how good is ChatGPT or GPT-5 at coding? Well, software engineering is definitely being revolutionized in front of our eyes. It's been happening. And GPT-5 is the best coding model in the world right now. It's the default now in Cursor, which I think is a really huge statement of the quality of the model, and it's just so good across, like, every function: writing code, understanding code bases, being able to use tons of tools,
Starting point is 01:23:55 being able to do agentic work, that, yeah, it's like I'm not a front-end developer at all, but actually now I am, right? And I think that you are too, right? If you just talk to the model, you can produce incredible things. And so I think that there's this real empowerment. If you think about what computers were supposed to be, right? Computers are supposed to be a tool that makes you more productive, able to do the thing you want. But then somehow when we started out with computers, you have to contort the human to the machine, right? Writing assembly language and like all these like very abnormal things for a human to do.
Starting point is 01:24:26 And as we've moved to tools, ultimately, you know, in the current generation now, GPT-5, suddenly the computer comes closer to you, right? You just express your intent, and you don't think about, okay, exactly which language and what, you know, version of different libraries. The model is something you can delegate to. And so we are very committed to programming and to making our models continue to be the best they possibly can be. Must a superintelligence be able to explain how to build superintelligence?
Starting point is 01:24:58 So it's a great question. I mean, I think that where we're going is a world, and we're already seeing it, where these models help us produce the next generation of models, right? They also help us supervise tasks that are too hard for humans to supervise on our own, right? If the model writes a 10,000-line program for you, reviewing that is probably going to be quite burdensome. But if you can have a model that you trust, that maybe isn't as capable as the one that wrote all that code, or maybe there's a team of agents that work together to write all that code, but you have a team of reviewer agents. Like, this is the kind of thing where you can actually bootstrap trust. And I think that this is like one of the most important things.
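(A toy sketch of the reviewer-agent idea Brockman describes above: a capable "writer" produces a patch, and cheaper, narrower "reviewers" each verify one checkable property. The agents here are stand-in Python functions invented for illustration; this is not OpenAI's actual pipeline.)

```python
# Toy sketch of bootstrapping trust with reviewer agents: a strong
# writer produces code, and weaker reviewers each check one narrow,
# verifiable property. All agents below are illustrative stand-ins.

def writer_agent(task: str) -> str:
    """Stands in for a capable model writing a large patch."""
    return "def add(a, b):\n    return a + b\n"

def style_reviewer(code: str) -> bool:
    """Cheap check: no leftover TODO markers in the patch."""
    return "TODO" not in code

def behavior_reviewer(code: str) -> bool:
    """Cheap check: run the patch and probe one known input/output."""
    namespace: dict = {}
    exec(code, namespace)  # execute the generated code in a fresh namespace
    return namespace["add"](2, 3) == 5

def review_board(code: str) -> bool:
    """Accept the patch only if every narrow reviewer agrees."""
    reviewers = [style_reviewer, behavior_reviewer]
    return all(review(code) for review in reviewers)

patch = writer_agent("implement add")
print(review_board(patch))  # True: one big review, split into small verifiable checks
```

The point of the sketch is the shape of the trust argument: no single reviewer has to understand the whole 10,000-line change, only a narrow property it can actually verify.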
Starting point is 01:25:34 And also, interestingly, 2017 is when we had the first language results. We also had some results, or some vision, on how you can actually bootstrap supervision beyond the scale of tasks that humans are able to supervise directly. And so I think that we're heading to a world where, you know, we now have these chain-of-thought models. We've been advocating very strongly to preserve the integrity of the chain of thought, right? What that means is: don't directly optimize it to look good, you know, though there will be lots of temptation to do that for various reasons.
Starting point is 01:26:04 Really make sure that there's no pressure on the model to obfuscate its thoughts within that chain of thought, because then you can really see what it's up to. And I think there's further techniques to make it even more faithful to what the internal monologue of the agent is. And so I think that there's actually a lot of promise in terms of interpretability, in terms of supervision,
Starting point is 01:26:23 in terms of being able to scale to just much more sophisticated tasks. Yeah, I guess my question is, like, how much information in the world can be derived from first-principles reasoning versus true secrets that need to be discovered by interacting with the world directly? Because it feels like it would be very difficult to, well, I'm just wondering about, like, how
Starting point is 01:26:52 intellectual property interfaces with superintelligence. Or, if you play this out a lot, there's all these hard-won, Dwarkesh just talked a little bit about this with continual learning, there's all these little subtleties. Maybe they're not secrets, maybe they're not true trade secrets, you don't think to lock them down, but they're just things that haven't been codified online or anywhere. They haven't been given to anything that is surfaceable by the model. And I'm wondering, is it just that we need to build up new knowledge and every fact from first principles and kind of go through the history of humanity's pursuit of knowledge, or do we just
Starting point is 01:27:34 need to onboard more and more information, or maybe it's both. I don't know. It's just something I've been noodling on. Yes, it's a great question. I would say all of the above, you know: select all. So I think that the answer is very similar to what it is for humans, right? How does a human generate new knowledge?
Starting point is 01:27:49 How do we accomplish new things? First, you want to be grounded in the wisdom of the past, right? You really want to understand what have people tried, what worked, what didn't work. You want to go and read the biographies of, you know, various people and understand those. But you also want to try things out, right? You want to make, you know, some mistakes in a contained environment, in a way that you actually can see the effect of your hypotheses, and then you want to be able to learn from those. And I think that being able to really start to scale up these systems and be able to integrate them with the world is a very
Starting point is 01:28:21 big process and milestone that we're currently embarking on, right? To move from a world of totally hermetically sealed reinforcement learning environments to thinking about how do you actually put real-world interaction in there. And you think about things like robotics: you're going to need to have that at some point, right? You're going to need to have some sort of interaction with the real world, and to have models that are able to produce new materials, right? To be able to actually solve various diseases, for them to be able to really help people. You know, we already have models that are great at use cases like therapy, but to really get to the next level, of something that can just really help every person accomplish more and accomplish whatever their goal is, it would
Starting point is 01:28:58 be very helpful for that model to actually have some real-world experience with doing that very thing. And so I think that figuring out how to bring all this together is ultimately what our mission is about. And we do this not in isolation, but really as part of a much broader community. It seems like it's advantageous to have the most dominant consumer app in that environment. So congratulations. Jordy, do you have a last question? Last question. What do you hope to see out of Washington, D.C. in the next year, year or two,
Starting point is 01:29:22 not thinking super long term, in terms of basically promoting innovation within the United States? Obviously, the admin cares a lot about AI and has been making moves. But what else would you like to see, or where would you like them to double down? Yeah, I've been very, very impressed with how much the administration has engaged with the technology and really tried to figure out how can we help and ensure that American AI continues to lead and really sets the standard for the world. And I think that that is the lens that I would really encourage thinking through, right? It's like, this technology is changing very fast, and fast plus government is not usually an ideal combination, but this is the reality that we have. It's the opportunity we have. And I think that the question in my mind is less about any specific regulation or strategy,
Starting point is 01:30:09 but it's really being calibrated. It's really having a very tight feedback loop, right? Being able to react to: okay, we have a new model, these are the capabilities we see on the horizon, how do we make sure that we get the most uplift and benefit from it? And thinking strategically about not just how do we do this for Americans, right? But how do we actually do this for the world and promote democratic values? And so to me, the most important thing is that motivation, right?
Starting point is 01:30:31 It's the question that is asked and the ultimate sort of motivation behind what gets implemented. Yeah, that makes a ton of sense. Thank you so much for joining us. Jordy, are you going to hit the gong? For GPT-5. Congratulations on the massive, historic day. And thank you so much for stopping by.
Starting point is 01:30:51 We'll talk to you soon. Have a great day. Thank you for having me. Cheers. Really quickly, let me tell you about figma.com. Think bigger. Build faster. Figma helps design and development teams build great products together.
Starting point is 01:31:01 And we are joined by Sarah Friar, the CFO of OpenAI, next. And we are going to bring her in in just a minute. The gong is still swinging. The gong is still swinging. And I'm going to tell you about vanta.com. Automate compliance, manage risk, improve trust continuously. Vanta's trust management platform takes the manual work out of your security and compliance process and replaces it with continuous automation, whether you're pursuing your first framework or managing a complex program.
Starting point is 01:31:27 We need one more second. Tyler, any other questions that we should be asking the OpenAI folks? Anything top of mind? What's on the timeline? Is the timeline still in turmoil, or has it settled? So I think the general vibe is, like, this model was not benchmaxxed, but if you actually get to use it, it's pretty solid. Cool.
Starting point is 01:31:45 One thing: it failed TBPN bench. Oh, it did. It did not get the horse breed. Did you get the horse breed? Wait, so you have it. You have access to it. Yes, I've seen other things on the timeline. We can talk about it later, but it seems like a really good model.
Starting point is 01:31:57 That's amazing. Great to hear. Well, welcome to the stream, Sarah. Good to meet you. How are you doing? Congratulations, a historic day. Thanks so much for taking the time to talk to us. How are you doing? I'm doing great. I mean, how could you not be doing great on the day when GPT-5 launches? It's been a long time in the making, and we're so happy it's out. Yeah, fantastic. Walk me through your role and what GPT-5, what this launch, means specifically for you. And, yeah, let's just start there. Finance has to be, you guys have to be the unsung heroes at OpenAI. There's a lot of big numbers. There's a lot of massive bills coming in for crazy training runs,
Starting point is 01:32:37 and you have to underwrite these against future revenues, and I'm sure you've developed many models to figure that out. But, yeah, walk me through your role at OpenAI and what today means for you. Yeah, absolutely. So I'm OpenAI's CFO. Finance may be the unsung heroes, but they're an amazing team, so I'm going to shout out to them. They're heroes to us.
Starting point is 01:32:55 It's a complex world that we're all living in, and there are a lot of B's on the end of a lot of the numbers that we look at. Look, what is our role? Number one is just making sure we have a healthy, high-growth business. It's been incredible watching, first of all, the number of weekly actives: 700 million people using ChatGPT every week, and I'm assuming after today, we should see a very nice little bump in that number. This is going to be a gong-heavy segment, Jordy. We have a lot of soundboard for the big numbers, so congratulations. I love it.
Starting point is 01:33:26 And I love, I've never met a number I didn't like. I think the other part of the business, and then we have to do this balance of the consumer business, the enterprise business, and then the API business, which I think of as somewhat enterprise, you know, and balancing that out. So enterprise adoption has also been exploding. Interestingly, as a CFO, I probably meet four to five customers a week. It's a part of my job I actually love.
Starting point is 01:33:55 We have about 5 million paying business users right now, from banks to biotech. I was talking to the CFO. And so that number is individual companies? That is individual seats at companies. Seats and companies, got it. So what I would say about that number is it's crazy to have done that in just two and a half years. Because enterprises, right, you've got to put your big-boy, big-girl pants on to go sell to an enterprise, right? They want to make sure that you have the table stakes of security,
Starting point is 01:34:23 SSO for signing on, HIPAA compliance if you're selling to healthcare, and so on. They want to know that other people have done it, so they're often looking for that case study, but they also want to be, you know, the innovator right at the front. And so to grow that scale of business in just two and a half years blows my mind. And it's not just big, big businesses, which I could, you know, go on about, but it's also small mom-and-pops, you know, literally the people who really keep the lights on in most countries, who are also gravitating to ChatGPT, which is wonderful. And then on the developer side, four million developers have built on our platform.
Starting point is 01:35:00 And the question there is, like, that could be a developer inside a big company like Grab. It also could be the next startup founder in Y Combinator getting going with the next multi-billion-dollar unicorn business. And so we see the whole gamut there, and that's important to us as well, because it's very mission-aligned, right? How are we going to get AGI to all of humanity if we don't do it through this ecosystem? So a big part of my, you asked about my role,
Starting point is 01:35:27 big part of my role is just keeping that business really healthy, making sure we always have the headlights on so people know the decisions they're making from a business standpoint, huge part of what the team does. The other big part of my role is compute. If I didn't talk about that in my first breath,
Starting point is 01:35:43 you all should correct me. I mean, it's making sure, we think compute is a massive competitive differentiator. I give so much kudos to Sam and the team, but particularly Sam, because no matter how big a number we look at, Sam always wants to go bigger. And he's been right. He's never met a number he doesn't want to add a zero to. That too. Maybe it'll be logarithmic. Maybe two zeros. But he has
Starting point is 01:36:11 been very right. And, you know, you just had a long conversation with Greg Brockman. I think he does such a good job of really explaining what a completely different world an AGI world is, or an AI-ified world is. And so I think when people get all wrapped around the axle of, like, you know, what is a gigawatt of compute? And oh my God, you guys want to have 10 gigawatts, and that's more than the compute of, like, Ireland, and I grew up there.
Starting point is 01:36:38 And now you kind of look back on that, and you're like, those numbers already look small for a world where everyone will have access to intelligence, and we're really starting to see what that can mean when you look at the demos today around things like healthcare and education and so on. Can you talk to me about non-GAAP metrics and what you think is going to be useful to track? We were talking to Mark Chen about this, and he was saying, you know, DAUs are great, time on site is great.
Starting point is 01:37:04 But that's not as impactful of a metric for OpenAI as it is necessarily for a social network or an entertainment app. And there can actually be some problems that come up with that. So it feels like there might be some tension in the organization eventually, or just publicly, about, you know, what metrics are worth optimizing for. And then there's also the financial community that wants non-GAAP metrics to track the health and progress of the business. And then of course, over, you know, decades,
Starting point is 01:37:34 we see companies eventually roll back some of those non-GAAP metrics as the business gets more complex. So how do you think about the development and sharing of non-GAAP metrics? And what do you think is actually interesting and provides signal to the business and the investor community? I'm kind of smiling to myself,
Starting point is 01:37:52 because when anyone normally says, talk to me about non-GAAP metrics, I can see most people's eyes roll back in their head. I live for non-GAAP metrics. I would love to do that. Please. I think in a CFO seat, first of all, it's really important to think about input metrics
Starting point is 01:38:06 and output metrics. And things like revenue, which is a GAAP metric as well as a non-GAAP metric, are very laggy. Like, if you're spending your whole time focusing on the revenue number in an operator seat, you are completely missing what's going on with the business.
Starting point is 01:38:21 So I push my team a lot to get out of, kind of, ultimately what the P&L looks like, and I'll come back to it, though, and go way upstream and say, what are the true input metrics that tell us about the health of our business? And so I think it does start with that funnel of monthly actives to weekly actives to daily actives, because we do. I mean, our mission is literally AGI for the benefit of humanity. So we know how many billions of people live on the planet, and the fact that we're starting to be able to talk in billions and percentages of the world
Starting point is 01:38:51 population, it blows my mind. Today, 85% of our users are outside the United States. And I love that stat. And in fact, if you go look at where the big populations of users are, it just tracks global population, right? It's countries like India, Indonesia, Brazil, Vietnam, the Philippines. Go anywhere that has a big population, the U.S. too, of course, and that will be your tracker. So that's kind of number one when I think of an input metric. From there, on the consumer side, you're right, things like time in app I've actually always had somewhat of a love-hate affair with. But I think in this case, because we're giving people intelligence and teaching them how to use it, I actually think this is where time in app does become important.
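(The monthly-to-weekly-to-daily funnel Friar describes is commonly summarized as stickiness ratios; here is a toy illustration. Only the 700 million weekly actives comes from the conversation; the monthly and daily figures below are invented for the example.)

```python
# Toy illustration of the MAU -> WAU -> DAU input-metric funnel.
# Only the 700M weekly actives is a figure from the interview; the
# monthly and daily numbers are hypothetical placeholders.

def stickiness(narrow: int, wide: int) -> float:
    """Fraction of the wider-window actives who also show up in the narrower window."""
    return narrow / wide

mau = 1_000_000_000  # hypothetical monthly actives
wau = 700_000_000    # weekly actives cited in the interview
dau = 350_000_000    # hypothetical daily actives

print(f"WAU/MAU stickiness: {stickiness(wau, mau):.0%}")  # 70%
print(f"DAU/WAU stickiness: {stickiness(dau, wau):.0%}")  # 50%
```

Rising ratios mean the wider-window audience is converting into habitual use, which is the "input metric" reading of engagement rather than the lagging revenue one.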
Starting point is 01:39:40 And one of the things we've really seen with ChatGPT is people are spending more time with it. Now, you know, we balance that with things like mental health and so on, making sure that we're not creating bad things like we might have seen in prior eras of computing, but I think we're just getting started on that front. Beyond that, when we go into areas like the API, I don't look only at usage, right? I can look at tokens per minute as a usage metric, but I look at things like latency. I actually try to look at the elasticity of demand, right? We know that developers want performance. They want intelligence, but they also want to make sure the
Starting point is 01:40:17 API is always up, and they want price. And they're often willing to trade across those three things, right? It's kind of a linear program, depending on what your use case is. And so I think it's important that we are offering things to developers that allow them to optimize across those three metrics, for example. So those are kind of your input metrics. And again, I could wax lyrical, but I won't. But then, to what you really asked: investors, on the other side, they want to see a P&L. They're like, I want to be able to compare you to other companies.
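(The three-way developer tradeoff Friar describes, intelligence versus latency versus price, can be sketched as a simple weighted score. The model names, numbers, and weights below are all invented for illustration; a real selector would plug in measured benchmarks and published pricing.)

```python
# Sketch of choosing an API offering across the intelligence /
# latency / price tradeoff. All names and figures are hypothetical.

from dataclasses import dataclass

@dataclass
class Offering:
    name: str
    intelligence: float  # higher is better, 0-1 scale
    latency_ms: float    # lower is better
    usd_per_mtok: float  # price per million tokens, lower is better

def score(o: Offering, w_int: float, w_lat: float, w_price: float) -> float:
    """Weighted utility: reward intelligence, penalize latency and price."""
    return w_int * o.intelligence - w_lat * (o.latency_ms / 1000) - w_price * o.usd_per_mtok

offerings = [
    Offering("big-model", 0.95, 4000, 10.0),   # hypothetical numbers
    Offering("mini-model", 0.80, 800, 1.0),
    Offering("nano-model", 0.60, 200, 0.2),
]

# A latency- and price-sensitive use case weights those terms heavily:
best = max(offerings, key=lambda o: score(o, w_int=1.0, w_lat=0.5, w_price=0.1))
print(best.name)  # nano-model
```

Changing the weights models the "elasticity" point: a research workload that only cares about intelligence would set the latency and price weights near zero and land on the biggest model instead.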
Starting point is 01:40:42 I want to be able to create maybe a DCF. Like, I want to think about fundamental valuation for a company if I'm going to invest in it. And so, you know, today what I really try to push investors on is: we are not a company that should be optimizing for free cash flow today, because there's just too much opportunity. Like that point about compute, we have to make a decision on compute today with an eye to what we're going to need in two to three years, because data centers don't just spring up overnight. They're not mushrooms. They literally take time and effort.
Starting point is 01:41:17 The thing we have failed at, frankly, I would say, is three years ago we didn't have enough foresight to say how big ChatGPT could be, because it didn't exist. It's just, shame on us if we keep doing that over and over. So there can be a bit of a mismatch between our belief on revenue, because we don't yet know the product, versus the input, which is the cost today on compute. And so getting investors comfortable with the fact that there's probably losses
Starting point is 01:41:57 for a period of time. I say probably because ChatGPT, just generally, the revenue models continue to surprise to the upside, but at least for now, we should be in big investment mode. And then you kind of said it well: as companies mature, you move to more GAAP metrics, right? If you look at the Mag 7, in many cases they're looking at real GAAP net income, the whole way down to the bottom of the P&L. We're just not there yet. And we should take advantage of that advantage, because we can invest as a private company. How do you think about timing fundraises? From my understanding, or rumors, the most recent financing was very oversubscribed, and at the same time you're still committing to CapEx in the future that is a multiple of the current run rate. And so, in the CFO seat, I'm sure you're trying to
Starting point is 01:42:38 find this balance of, like, what does the business need today, while, you know, not diluting the company too much, knowing the sort of growth rate of the business. I mean, that's exactly right. That's the art, not the science, of it. You know, we did just come off the back of closing out the sleeve of investment that we could take down in this current round, led by SoftBank. And it was massively oversubscribed, which comes back to, I think, the market really waking up to the fact that AI is a generational opportunity. And the scale that it requires is like something people have not
Starting point is 01:43:14 even seen before, right? You know, people talk about the internet or the railways. They're good analogies, or transistors, which I think Sam always goes back to. They're good analogies, but I do think this is bigger than everything that's come before. So, you know, taking down $40 billion, which we just did in this round, certainly felt like it gave me a lot of confidence. Appreciate that. A lot of confidence to then go out and do large compute deals, right? We announced the large deal with Oracle, for example, and to be able to keep working with all of our supply chain: Microsoft, CoreWeave, Oracle, Nvidia, and so on. But at the same time, you know, in a world where our valuation has gone up at pace with our
Starting point is 01:43:58 revenue, you do get an opportunity to keep coming back to market and not take that same dilution, because you're getting that higher valuation for the work and the output that you've created. So it is a bit more of an art than a true science. I think for now we will continue to need to fundraise in order to fund that compute. But I think we want to start getting more sophisticated. Just pure equity fundraising for everything is an expensive way to fundraise, and I think we're probably getting to the stage as a company where we can be a little bit more broad in how we think about funding overall.
Starting point is 01:44:30 And even just working, frankly, with our supply chain, because our success in bringing this era of AI into being is their success too, and I think these companies are realizing that. What about partners? Last question. Partner selection on the compute front. There's not a lot of companies or firms in the world that can really meet your needs. You should update your LinkedIn title. We saw someone yesterday who works for Discord and is in charge of their cloud buying. And his LinkedIn title was: I have full responsibility over our entire cloud budget. And it was clearly, like, a huge flex. But I'm sure, you know, you're in direct text message, you know, with every single
Starting point is 01:45:17 person that's relevant in the industry. But, yeah, I'm curious: a lot of people have been excited about developing data centers over the last couple of years in hopes of winning, you know, the business of companies like OpenAI. But I think when you guys are evaluating partners, I imagine that scale is such a massive factor. And so a single small data center is not really going to move the needle. You guys need to be thinking in terms of mega projects.
Starting point is 01:45:45 Yeah, I mean, I think that's exactly right. I mean, it started with our partnership with Microsoft. And it makes me smile now to go back and look at that original kind of large fabric for pre-training, because I think it was only in the maybe 20-megawatt sort of size. And, you know, now we're talking gigawatts even just this year. And you're right that when we think about what is perfect compute for us, or strategically the right compute for us, we are definitely thinking about large scale. We're thinking about flexibility, right?
Starting point is 01:46:17 We're learning a lot about, you know, pre-training, post-training, test-time compute, even where the different kinds of scaling are happening. We're recognizing there's more of a blurred line between what people think of as inference and training. So investors always are like, your inference compute and your training compute, as if it's, you know, literally vanilla ice cream and chocolate ice cream, when in reality there's a bit in the middle that is something of both. We also need to think about things like latency: where do we want to put our footprints around the world for that very global weekly active user base, right? As they use ChatGPT, you don't want to slow the model down, right?
Starting point is 01:46:56 The beauty of the intelligence is the real-time nature of it. And then when we get into big compute, where there's lots of tokens being used, like deep research, image gen, video as that comes online, all the work you saw today, actually even just on voice, that really quickly means that you've got to make sure your compute is near your users. And so it is a big plan that's coming together, but you're right. Small is just not that useful to us. What about pushing partners to take risks? From my understanding, you guys are pre-committing to certain, you know, basically spend levels. But at the same time, I imagine you want to be able to say to people: here's what we know we're going to need, but we want you to build this much capacity, so that we have the sort of incremental capacity built in.
Starting point is 01:47:44 Yeah, I mean, being extensible is really important. And we do want to see that in partners. I think Oracle, OCI, has done a really nice job of that, of kind of starting: we started with one large, it felt really large at the time, data center footprint in Abilene, Texas, and now that has really multiplied up into multiple sites that can all be connected. And that's a good example of a partner who has the capability to start in one way, but to be able to show you a path to maybe 5x-ing just in that single footprint. That said, we are finding that as we go around the world, there is an ability to
Starting point is 01:48:20 go work with governments, for example. We just made an announcement in Norway, made an announcement in the UK. This is the first time in my professional career I've seen countries come to the table wanting to do commercial deals, like wall-to-wall ChatGPT. I think the government of Estonia put ChatGPT into all of their high schools, high school or, I guess it was up at the university level. But that's kind of wowing. And hand in hand with that, they are viewing AI infrastructure as incredibly strategic for their population. And, you know, it's a whole other level of selling versus, you know, I've seen enterprise, large enterprises before, but never anything at this scale. Last question. Whose idea was it to give every federal agency ChatGPT for a dollar a year?
Starting point is 01:49:07 Yeah. I imagine that had to get past you. You could have gotten more than a dollar. The CFO must be really upset here. Ten dollars. That's ten times as much money. This is one where I think it's really important. OpenAI is, you know, in some ways a U.S. asset and national asset. And we want to make sure we're accelerating our government, like all of the resources, as we think about, you know, Western democracy and so on, that we are absolutely putting our technology into those hands. It's that guy, Kevin Weil. He's been moonlighting for the U.S. government. It's like, which team are you playing for, Kevin? Are you on OpenAI? Are you on the U.S. government? Kevin just did his basic training. I don't know if I'm allowed to tell you that, but I was hearing all about it yesterday. I saw some photos. They look great. Yeah, it's a good thing.
Starting point is 01:49:51 He's going to be in even better shape. And we're working with a lot of governments globally. Yeah, that's great. Amazing. Last question for me. You know, the open-source model launched two days ago. And there's this world where, like, you have this dominant, the accidental consumer company, you have this dominant consumer app that's generating so much revenue. Then you have B2B and enterprise and API, and that looks more like a cloud provider.
Starting point is 01:50:14 But then is there a world where the Red Hat Linux of open-source LLMs is an OpenAI division? And that there's actually serious revenue and profit that comes from helping companies implement an open-source large language model, like Red Hat built a pretty fantastic business for a long time on top of open-source Linux implementations. Yeah. I mean, I think it's the right question to be asking. I think step one was getting our second open-source model out, seeing what that traction is, and then seeing what the community needs. I think it's important to leave space for a
Starting point is 01:51:18 community to develop, right? That is the beauty of open source, that ecosystem that develops. That was true with Linux. It's true in areas like crypto, too. But I do think you'll find that over time, as enterprises want to deploy it... I mean, now I've dinosaured myself, but when I was a research analyst at Goldman Sachs back in the day, I covered software, and I covered Red Hat actually. Yeah, really. All that growth. I, like, wrote a research report called Fear the Penguin at one point, because Linux was being deployed. But then you started to understand that for an enterprise, you couldn't depend on patching and upgrading to happen via a community model. You needed some of the rigor that goes with an enterprise business, where you kind of know if you need maintenance, if you need a bug patch, and so on. And so that did allow Red Hat to grow an incredible business. So I don't know if it's us, or if we'd be supportive of others, but I think we are so excited to see open source out there and getting incredible feedback. And I think we wanted to do that ahead of GPT-5 to keep coming back to, like, we're here to grow this ecosystem. Well, we'll give you market cap credit for it anyway, even if it's early stage. Well, thank you so much for coming on.
Starting point is 01:52:00 This is fantastic. We'll talk to you soon. Thank you, sir. Great to see you. Take care. Have a good one. Cheers. Bye. Up next, we have Dedy Kredo from Qodo, I believe I'm pronouncing that correctly.
Starting point is 01:52:09 Let me tell you about Graphite: code review for the age of AI. Graphite helps teams on GitHub ship higher-quality software faster. You can get started for free at graphite.dev. And let's bring in our next guest. How are you doing? Welcome to the stream. Welcome. Ooh, very clean background.
Starting point is 01:52:25 I know it's probably virtual, but whatever you got going on, it looks fantastic. You look great. How are you doing? Are you excited about GPT5? I'm so excited. It's awesome.
Starting point is 01:52:35 It's actually like everybody's talking about the coding capabilities. Please. But no one is really talking about the code review capabilities, and I'm going to talk about that today. Yeah, yeah, break it down. How are you using it right now? Yeah, so we're just enabling it within our platform. It's the default model for both our IDE plugin, our CLI, and our Git plugin.
Starting point is 01:52:57 And yeah, we're using it to generate very high quality code reviews, catch bugs before they hit production, help enterprises verify that their code is aligned with their best practices. So yeah, it's super exciting. I can share my screen and show a few things if that makes sense. You can. Everything you share will be live. It'll be a little... yeah, yeah, yeah. Please. But I want to know also, while you're getting that set up, I want to know about what changes materially do you think happened in GPT-5 specifically for code and code review. Do you think there's more data going into the model, more data going into the pre-training, post-training, anything else? Anything that you're noticing where you're like, oh, there's a specific upgrade here. They must have done something to get
Starting point is 01:53:40 there. Yeah, yeah, I think it's a great point. So I think it's all of the above: scaling of both the pre-training and probably a lot of the reinforcement learning, and basically using that at scale to verify that code gets generated in high quality, and then also basically catching bugs. And when you do it with reinforcement learning, you have the actual ground truth. So once you scale that, you can get the model to basically be a lot better at that. How steep is the power law right now in just programming languages? Is it basically all Python, JavaScript, and then, like, a really hard fall-off?
Starting point is 01:54:22 Or is it actually important for coding models, if they want to be adopted widely, to be truly multi-language and get all the way down into the long tail of, like, the Rust and the, you know, C# and all the different languages that are out there? Yeah. Yeah, for sure. It's important to... I mean, the majority of the market is in JavaScript, TypeScript, Python, like the majority of the early adopters, I would say. But then when you get to enterprise use cases, you get a lot of .NET, you get a lot of Java. And the models are getting pretty good at those languages as well, for sure. Are you excited about, I mean, how do you think about the difference between, like,
Starting point is 01:55:01 the improvements to GPT-5 from the consumer's perspective versus at the API level? I always found it a little confusing that ChatGPT was available as an API, and you could interface with, I believe, the ChatGPT model via the API. And there's a little bit of line blurring there, but are there features that you think are cruft and you want to kind of rip out for an API use case?
Starting point is 01:55:30 Or do you just say, hey, give us the kitchen sink and we'll work from there? And it's actually helpful to have a coding model that can still have a web browser. Yeah, yeah. I think basically it's a lot about, we consume the model through the API, and it's really the same model that drives the consumer product.
Starting point is 01:55:48 But for us, since our use cases are a lot about agentic use cases, the more the model gets better at using tools and gets better at kind of listening to very, very specific instructions... Following instructions is critical for the enterprise use cases. Because for us, unlike the broader market, we believe that for enterprises, you need to have very specific agents that are defined with a specific set of instructions and prompts and tools and permissions. And the more the models get trained in that type of environment, the better they end up serving
Starting point is 01:56:24 the enterprise market, which is really where we're focused. My question is, I wonder, like you said, very specific instructions are important. When are we going to get an agent that I can just turn loose in a code base and say, just go improve it? Just go hunt around, rewrite that. Like when you get a good open source contributor on a team who just becomes nerd sniped by the project that you're building on,
Starting point is 01:56:53 they will just go around and find little ways to improve: this documentation needs to be a little better, let's rewrite this test case over here, let's add a little bit more functionality to this class or function. How far are we from that? Yeah, I think the models are getting better and better at that part of basically kind of running loose in a code base.
Starting point is 01:57:13 Yeah. But they do need the guardrails in place, and this is kind of where we're focused. A lot of the talk in the market is around the code generation side, you know, let the agent loose, give it a task, and it's just going to go around and run for hours and do it.
Starting point is 01:57:29 What we're seeing is that the real challenge is now shifting towards: how do I verify that the code is aligned with the best practices? How do I make sure that it's well tested, well reviewed, doesn't break anything, you know? So that's, I think, the next frontier. And really, developers going forward are not going to write a lot of the code by hand. They're going to spend most of their time reviewing code, and that's the next frontier. That's what we're really here to tackle. Very cool. Anything else? All right. Yeah, well
Starting point is 01:58:02 thank you so much for joining, giving us some extra context on the GPT-5 launch. We will talk to you soon. Have a great rest of your day. And thank you for joining. Cheers. Thanks. Cheers. Talk to you soon. Uh, and let me tell you about Profound. Get your brand mentioned on ChatGPT. That seems more relevant than ever. Reach millions of consumers who are using AI to discover new products and brands. I forgot to ask about this. We'll have to come back to this. But I want to know if, uh, Profound powers MongoDB, Indeed, Mercury, DocuSign, Zapier, Ramp, Ro, Golland, Workable, Majority, Eight Sleep, U.S. Bank, Chime, Clay. Okay, okay, we get it.
Starting point is 01:58:43 They got some logos. There is this question of, like, okay, even if you think GPT-5 is more incremental than revolutionary, more of an evolution than a revolution, it's like, okay, well then let's talk about how it affects every other business and every other aspect of the economy. What should you be focusing on? And do any of the updates from GPT-4 to GPT-5 change how you're positioning your brand for AI search? That's certainly an interesting question to dig into. Anyway, we have Zach Lloyd from Warp coming into the studio. Welcome to the stream for the second time. Welcome back.
Starting point is 01:59:20 Good to see you. He's back. How are you doing? I'm doing pretty well, you know. Yeah. So, I mean, a lot of what stuck out to me, and I'm mostly a consumer of consumer AI apps, is I'm very excited about not needing to mess around with a model picker anymore. But take us through the biggest improvements from the software development side. Yeah, I mean, it's a major step up from the prior OpenAI models. It's doing agentic workflows and work for much longer periods. It's just a smarter general model. We evaled it against all of our benchmarks and it's up there at state of the art, which, from our perspective, is awesome: multiple competitive models that our users can benefit from.
Starting point is 02:00:09 So definitely a huge improvement from GPT-4.1. Yeah. So it seems like not the, you know, Claude Code killer, but certainly in the same conversation, in the same football stadium, if we're using a sports metaphor. How much, you know... one thing that stood out is the cost reduction. How much do you think developers will care about that versus just what it can do from an output standpoint?
Starting point is 02:00:47 I think developers do care about value, so sort of the quality-to-cost ratio. The more you get into the individual developer and the small team, the more that matters, whereas at the enterprise level, I feel like it's a little bit less price sensitive. So, yeah, I mean, you can see, as different apps change their pricing, what the reaction of the developers is. You've probably seen this with Cursor and seen this with Claude Code. And so developers really, really are looking for something that's cost effective. So the fact that the cost is a little bit lower is actually a big deal.
Starting point is 02:01:21 Do you think we're in the Lyft Uber 2015 arc where the prices are subsidized and the prices will go up? Do you think that there's a price war on the horizon now that the frontier models seem to be similar capabilities? Do you think that someone will try and raise a bunch of money, cut prices a bunch and steal a bunch of users? How do you think that plays out? It's an awesome question. I mean, my hope is that we get to a world where there is price competition at the model layer. So Warp is very much at the app layer, right? And so our value prop is like we can give our users who are mostly developers the best model access.
Starting point is 02:02:05 And so to the extent that it's not one sort of model provider running away with that and having pricing power, it's better for us, just candidly. And so, you know, my hope would be something like the model world ends up a little bit like Google Cloud, AWS, Azure. That's our best end state, where all of these models are, you know, sort of similarly powerful and a little bit more commoditized. I don't think it's been like that, but it's getting a little bit more like that. And so the more there's more than one game in town, I think that's generally good for Warp. And it's actually good for developers, because competition will put pressure on bringing the prices down. But I don't know. Like, I also think that people will definitely pay for quality.
Starting point is 02:02:52 And so if there is a, you know, meaningful delta in quality on the frontier models, then I think that whoever has the quality delta will have a lead temporarily, but I'm not sure that that lead will be sustainable. We'll see. How do you think the developer community should plan around model deprecation over the next, you know, one to two years? Like, how much... I don't know that I've gotten a reaction yet, I don't know if there's general frustration yet from people. You know, we've heard on the consumer side, Tyler on our team here loves 4.5. Yeah. And so he was a little bit disappointed to hear that. But what are you seeing on the developer side? Yeah.
Starting point is 02:03:33 I think it's a little bit different for people who are building apps on LLMs versus people who are using LLMs as, like, an accelerator to do coding. And, you know, at Warp actually we do both. We're an application-level stack, and it's actually very easy for us to go to the latest model. And so it doesn't really bother me. I don't know what type of app you would be building where it's really important that it's, like, GPT-3.5 or GPT-4, something like that. I think generally we want the most intelligent tokens at the best
Starting point is 02:04:19 cost. So I don't see that being too big of an issue, honestly. What about open source? Does that feel like something that will be in the playbook? Is the markup on closed source models high enough that there will be a significant price delta, or is the Pareto frontier kind of indifferent to closed source versus open source? So if there was a comparable open source option, that would be awesome. I think that the economics of it... again, it doesn't seem like a perfect analogy to me between open source software and open source models. With open source software, it's like you have a big community of people who, you know,
Starting point is 02:04:59 for the love of coding are building a really awesome product. For open source models, it's like you just need a crazy amount of capital to train something that's on the frontier. And so I don't know how that happens. And so what we've seen is the open source models are competitive at the quality level that they're at, but the quality level that they're at is not the same as the frontier models. And I don't really see why that would change. And so, I don't know, in Warp, we're serving some open source models, but they're just not as good. And so there's, I think, a more limited use case for them right now. And I don't really see economically why that would change. In fact,
Starting point is 02:05:45 I would be surprised if anyone was spending billions of dollars to train a model and just kind of put out the open weights. Like, I don't get the business strategy there, but maybe that will happen. That would be awesome. Is there a world for this idea of smarter models either orchestrating dumber, cheaper models, or distilling models into more narrow formulations that can be run more efficiently? We've talked to a few companies that do this for businesses. You just want a model that filters for profanity, and you can run it on, you know, a gaming graphics card, and so it's basically super cheap, super fast. I'm wondering about, in the coding world, the coding agent world, any of that, like,
Starting point is 02:06:35 where are the opportunities to kind of fan out and use an ensemble of models instead of just hitting everything with the smartest, best model? It feels like, because of the funding environment, everyone can kind of justify a high cloud bill, and most people don't admit that it's hurting the bottom line, but it feels like at some point it kind of has to, eventually. I mean, I think that's a very real thing, in the sense that even in Warp we don't use the biggest, most powerful model for every task. And so there are certain things... like, for Warp, maybe deciding whether or not we should summarize a conversation is a good example. So you hit the context window, you're like,
Starting point is 02:07:23 okay, is this a good spot to summarize? Is this a good spot to encourage a user to start a new conversation? We use a much more inexpensive and also low latency model, right? The other thing, the trend, is that these very, very powerful models tend to have much higher latency. And so we do a mixture of models, and that's totally a real thing. But I think the predominant use case as a developer is going to be: I want to tell an agent to do something.
Starting point is 02:07:55 I want it to be harder and harder. I want it to run for longer and longer. And to do that, you kind of want, in general, the most intelligent model. And so until the models have a sort of S-curve-type shape, I think it's going to be more of a quality game than a cost game for most of these things. Doesn't it feel like they have an S-curve shape right now? It certainly does from a consumer perspective. That's interesting.
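The mixed-model routing described above, a cheap, low-latency model for classification-style decisions like whether to summarize at the context-window boundary, with the frontier model reserved for long agentic tasks, might be sketched roughly like this. The model names, task kinds, and `pick_model` helper are all hypothetical illustrations, not Warp's actual implementation.

```python
# Hypothetical sketch of cheap-vs-frontier model routing.
# Model names and task kinds are illustrative assumptions only.
from dataclasses import dataclass

CHEAP_MODEL = "small-fast-model"    # assumed name for a low-latency model
FRONTIER_MODEL = "frontier-model"   # assumed name for the smartest model

# Lightweight, classification-style decisions that don't need frontier smarts.
CHEAP_TASK_KINDS = {"summarize_check", "new_conversation_prompt"}

@dataclass
class Task:
    kind: str    # e.g. "summarize_check" or "agentic_coding"
    prompt: str

def pick_model(task: Task) -> str:
    """Route cheap decisions to the fast model, everything else to the frontier."""
    return CHEAP_MODEL if task.kind in CHEAP_TASK_KINDS else FRONTIER_MODEL

print(pick_model(Task("summarize_check", "Should we summarize here?")))  # small-fast-model
print(pick_model(Task("agentic_coding", "Refactor this module")))        # frontier-model
```

The point of the split is the one Lloyd makes: the decision itself is trivial and latency-sensitive, so paying frontier prices (and frontier latency) for it buys nothing.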
Starting point is 02:08:23 From a coding perspective, I feel like we're still accelerating. Like the difference between the last version of GPT and this version of GPT is probably bigger than the difference between 4.1 and 4, and between 4 and 3.5. Interesting. It's a big deal. And same thing with the Anthropic models, and I'm sure that we'll see something from Google where it's an acceleration. And I think there is maybe an underappreciation of how much is left to solve here, because even when you're doing a real coding task as a pro, despite all the demos you see on Twitter where someone asks an agent to build an app, that's a lower level of difficulty than doing what a pro developer does with one of these models. And the models still don't produce great code a lot of the time. There's a lot of handholding that has to go into it. And I think that we are still seeing an acceleration in terms of the models actually becoming not just, like, okay competent engineers, but really, really good engineers. Yeah. Do you care about benchmarks? We care a ton about benchmarks. Like we, um... but your own internal
Starting point is 02:09:31 benchmarks, or? We do both. So, you know, plug for Warp: we're number one on Terminal-Bench, which is the public terminal benchmark, and we're top five on SWE-bench, which is the coding benchmark. And then the only way, in my opinion, that an app at our layer in the stack can really improve is by measuring the progress. And so we have our own internal set of evals that we run across all these models as well, which come from real use cases. And that, again, is an advantage of being a product that's in the wild that has a lot of users: we can sort of see where the models are failing, where they're working. And so we're very big on that, actually, yeah.
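An internal eval harness of the kind Lloyd describes, a fixed set of tasks drawn from real usage run against each candidate model with pass rates compared side by side, could be sketched like this. The `call_model` stub and the exact-match grading are stand-ins; a real harness would call a model API and use a richer grader.

```python
# Toy sketch of an internal eval harness: fixed tasks, multiple models,
# pass rates compared side by side. `call_model` is a stand-in stub
# that pretends "model-b" always answers correctly.
def call_model(model: str, prompt: str) -> str:
    # A real harness would call the model's API here.
    return "expected answer" if model == "model-b" else "wrong answer"

# Tasks would come from real user sessions; these are placeholders.
TASKS = [
    {"prompt": "task 1", "expected": "expected answer"},
    {"prompt": "task 2", "expected": "expected answer"},
]

def pass_rate(model: str, tasks: list) -> float:
    """Fraction of tasks whose output matches the expected answer."""
    passed = sum(call_model(model, t["prompt"]) == t["expected"] for t in tasks)
    return passed / len(tasks)

scores = {model: pass_rate(model, TASKS) for model in ("model-a", "model-b")}
print(scores)  # {'model-a': 0.0, 'model-b': 1.0}
```

Running the same fixed task set across every new model release is what lets an app-layer product say, as Lloyd does, that a new model is "up there at state of the art" against its own workload rather than only against public benchmarks.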
Starting point is 02:10:10 Awesome. Well, thank you so much for stopping by. We will talk to you soon. Sure you'll have a busy afternoon. Shout-out, by the way, to the OpenAI team, very, very helpful in working with us to get GPT-5 to be awesome in Warp. And one more shameless plug: we have a discount code for people who want to try GPT-5 in Warp. It's $5.
Starting point is 02:10:30 Thank you for having me, guys. Yeah, we'll talk to you soon. Thank you all. Cheers. Tyler, any updates from the timeline while you're thinking about what the latest vibe check is in the war between OpenAI... I got one from a friend of the show. Linear is a purpose-built tool for planning and building products. Meet the system for modern software development.
Starting point is 02:10:50 And streamline issues, projects, and product roadmaps. Go to linear.app to get started. Tool of choice for OpenAI. You have something? From Reggie James, friend of the show: half of my timeline says this is the closest we've been to AGI. The other half of my timeline says we officially just hit AI stagnation.
Starting point is 02:11:07 I love tech. Well, we will be going deeper, deciding whether or not this is stagnation or a hyper-intelligence takeoff. And we will be joined by our next guest, Riley from Charlie Labs. Hey guys, thanks for having me. Good to see you, Riley. How are you doing? What's happening?
Starting point is 02:11:25 I'm doing fantastic. We've been heads down with GPT5. How long have you had it? How long did you get the preview? I feel like it, you know, it gets rolled out to early adopters a little bit earlier, but it's been weeks, months? How long have you had it? We're a couple, a couple weeks, like two or three. What was the first one you did with it?
Starting point is 02:11:44 How's Charlie liking it? Charlie loves it. And also I love what Charlie does with it. Yeah. What does Charlie do with it? What was the first thing you did with GPT-5? Ran our evals. Oh, yeah? How'd they come back? Really good. Much better than O3, which was much better than any other model we've run before that. Interesting. And yeah, so let's zoom out. What do you do? What do these evals measure? Walk me through it.
Starting point is 02:12:13 So Charlie is a TypeScript-focused coding agent that operates much more like a human does. So less like an IDE application or terminal, and more: joins your GitHub and Slack and Linear workspaces, and interacts with the team the same way other humans do. And then our evals are a mix of code review, because part of Charlie's job is to review PRs from humans as well as his own, and code authoring, so opening PRs and pushing commits. So when you develop your own evals, I imagine you try and keep those out of any training data; you want those to be held private, is that correct?
Starting point is 02:12:55 Yes, and it's getting even harder with web access now because they're too good at finding things. They're finding everything. That's funny. And then talk to me about like the shape of those, of the actual problems in the Eval. Are you doing, are there some easy questions, some hard questions, some extremely hard questions,
Starting point is 02:13:17 task? Is it scored out of, like, 100? How do you think about developing a good eval? A mix of hard to very hard. The easy ones are just a waste of money and time at this point, especially with 5. There's a bunch that it's just not going to get wrong. And then the PR ones we're mostly doing look kind of like SWE-bench, in the sense that we're taking an issue to start with. But instead of giving the issue in a Docker container already, we trigger a comment on the issue that says, hey, Charlie, go make a PR for this.
Starting point is 02:13:50 And then Charlie does its thing, and then the PR comes up, and then we score that PR against a whole bunch of things, like correctness against a known solution that's correct, as well as code quality, testability, and some softer things like descriptions. Who are the biggest customers or users for, like, a TypeScript-focused coding agent? It's a wide range of mostly modern apps. Pretty much any web app these days is going to be, like, a Next.js-type app, and then all the way into the backend; Charlie himself is written in TypeScript. Sure, makes sense. There's very little front end. Anything else? True. What else you got? I just want to say I love the name Charlie. It's one of my favorite agent names that we've had on the show. Yes, it's right up there with Pig. And what was the other one? Well, I don't think that was an agent, but yeah, it's a good one. Yeah, congrats on locking it down. Yeah. What about
Starting point is 02:14:49 cost and that side of the business? Is there any movement there, or anything where you require movement or need movement to really unlock new capabilities in the business or new markets? Not really for us, because we're operating kind of at a human level. We do value-based pricing, so we charge per PR, per commit. And because that's comparing to such expensive actions that humans are doing, the challenge for us is more actually living up to the promise than doing it cheaper. Yeah. But then doesn't the cost reduction announced today, isn't that great for the business? Yeah, I mean, it's good overall, but that's our problem: our problem is not that the models are expensive. It's that they're, I mean, they're
Starting point is 02:15:40 getting really smart, but I'll always take more. Never enough. For instance, since the beginning of August, we've been tracking it, and 98% of the code that got merged into our code base was written by Charlie. Wow. Not 30, not 50, 98%. And that's coming through PRs. That's not, like, autocomplete in an IDE-type thing. That's crazy.
Starting point is 02:16:02 Yeah, what does that mean for the future of, like, who are you hiring? I imagine that you're still, you know, an engineering-heavy organization that's just puppeteering and orchestrating agents, but where do you see the future of software development as a career path going? Yeah, are new CS grads cooked? I think if they get really good at using the AI, no. If they try and take an approach of getting really good at writing code by hand, for sure. Yeah. What we're mostly looking for in hiring is people who are able to see things at a much higher level and plan further out, because with tools like Charlie,
Starting point is 02:16:45 you can write so much more code so quickly that it's more important to see where you're going and take the right path than it is to be able to write it quickly. Very cool. Well, thank you so much for stopping by. Good luck with the rest of your day. And congrats on an upgrade to everything that you do. Tell Charlie to have fun out there. Have some fun. Thanks a lot, guys. We'll talk to you soon. Let me tell you about numeralhq.com. Sales tax on autopilot: spend less than five minutes per month on sales tax compliance. Sales tax superintelligence. A number of the fellas in the chat got access to five.
Starting point is 02:17:22 Break it down. Regov says it's pretty good. The writing ability feels a little nerfed. Says the way it writes feels a little programmatic rather than sounding human, reverts to using bullet points even for things like blog posts, and also uses overly complicated language for simple stuff. Techno chief says it's crazy fast. Oh, that's good.
Starting point is 02:17:43 Ratliff says, yeah, I was just going to say that, very, very, very fast. Z. Jean Ahmed says junior devs are barbecued. Tyler, anything from your side before we talk to Guillermo from Vercel? I think maybe a good way to vibe check, at least on the timeline, is that it's almost like a 4.5 kind of thing, where it comes out and people are like, this model totally sucks, look at the benchmarks, it's not some massive improvement, not a step change at all. But then you start playing with it, and it's actually, okay, this is actually a good model. Yeah, a lot of the stuff I'm seeing people post is like, oh, that's actually really interesting output, stuff like that. It seems good. Can we do the green text eval? Green text bench. Yeah, yeah, we got a TBPN intern. Yes, yes, yes. We'll let you cook on that, and then we will move on to our next guest, Guillermo Rauch from Vercel, coming in to TBPN. For the second time, great to see you, Guillermo.
Starting point is 02:18:43 How are you doing? I like the action hall. Thank you. Welcome to the stream. How are you doing today? Do you think GPT-5 could beat me, you, and a couple of the boys here on Dust 2 in Counter-Strike? Easily. Easily.
Starting point is 02:18:59 Yeah, it depends on the frame rate, right? Yeah. But on a long enough timeline, we're cooked. We're cooked. But we might frag it short term, and we might be faster. Amazing, yeah. Yeah, we got to... I mean, I'm sure we'll get to GPT-5, but what's your reaction to the world model stuff from Google? Do you have an idea of where that's going as a product?
Starting point is 02:19:19 It feels like a GPT-2-level technology, very much a research-focused technology. I'm sure OpenAI is working on something too, and a lot of the labs will work on it. But what's your theory behind the generative video game world model stuff that's going on? I mean, number one, super fascinating, right? When we think about the future, I always think about Jensen's line that the future of applications is that pixels are generated, not rendered. So as much as we're really excited today that GPT-5 and v0 are really good at writing code
Starting point is 02:19:58 that then renders interfaces, I think it's also cool to dream of a world where we're just going directly from GPU to pixel grid, right? But if you remember, a couple years ago, maybe a decade ago, there was a lot of excitement about video games that were going to be live streamed from the cloud. Yeah, that's right. Where you could have a very thin client: your input, your keyboard, your mouse movement was going to be dispatched to the cloud. We had Google Stadia, right? Big there. And then Microsoft got into the game, and Microsoft is actually still pushing it very heavily.
Starting point is 02:20:37 Awesome tech, but not mass adoption. But if you look at it, a lot of these technologies are being really successful in letting people get more creative and test things out. A lot of the use cases that we see for v0 and vibe coding are almost like a communication tool. Like, I want to prototype something. I want to see what's possible. I want to explore the latent space. I think those world models are going to be incredible just to inspire what the future of games could look like, right? Just getting ideas for actually then shipping them in a real 3D engine model. I think short term. I think long term, all bets are off.
Starting point is 02:21:09 Someone was saying in the chat, you know, junior devs are roasted or barbecued. I think that's not quite true. Same for, like, 3D engine developers. Give us the bull case for junior devs staying off the barbecue. So the bull case for, I think, people in general is that you move from... I mean, the progression in the industry has been assistant, to agent, to team of agents, to agent orchestrator. It's still really useful to have a human be the one that's sort of managing the team. Yeah. So you're moving from, like, junior dev to junior eng manager, especially as these tools become more agentic.
Starting point is 02:21:50 In the new version of v0 that's coming out really soon, you're starting to notice that v0 sort of splits the task between a little team. You have the designer of the team. You have the PM of the team that's sort of working on the spec. You have the architect. You have the engineer. I don't know if you saw Claude Code announced, I think, a slash security review. Think of that as having a security team, or a team of agents, or a security researcher at your disposal. So junior dev as, like, a vertical skill, maybe a little barbecued, but junior eng manager... So I think the junior dev is so much more empowered in this world if you allow yourself to be, and you keep up with what these tools can do, and I think
Starting point is 02:22:32 you stay, you know, at the cutting edge. Yeah. I mean, the obvious bull case is, if someone's a college student today, they can learn to code truly AI-natively. They don't have to say, oh, we're an AI-native organization, now we have to upskill and kind of retrain people how to think. They can just naturally start to think that way. I remember there was that Sam Altman post about how we'll look back on, you know, 93% of humanity doing subsistence farming, and if you ask those people what they think about our email jobs, they'd be like, you guys are crazy. And it's almost like in the near future, midterm future, maybe even long-term future,
Starting point is 02:23:10 it's like the number of individual contributors will be extremely low, and almost everyone will be a manager. And you'll become a manager much faster. You'll just be managing agents. And then you'll be managing people who manage agents. But the job of almost everyone will become managerial. Maybe that's what happens. I don't know, I'm not 100% on that, but that's what that made me think.
Starting point is 02:23:30 Someone asked me yesterday, you know, what do you think the future of the market for monitors looks like? Like, does it stay flat? Do people get more monitors? That's hilarious. You go, like, full Dogecoin trader analyst. In the future, everyone has the hedge fund six-monitor setup. In the future, everybody's just going to be on their phone. Maybe on their phone.
Starting point is 02:23:53 I mean, I've noticed that, you know, when I was an individual contributor, I had three monitors. I was programming on all the screens. And now, I mean, I use my laptop during the show, and then most of my work is done on my phone: phone calls, and then firing off messages. Yeah, maybe we actually shift away from monitors and go further into voice interfaces. Oh, I call the lead of my agents.
Starting point is 02:24:16 And then that agent relays it to some code agents. I'm very optimistic on voice, by the way, because I've now seen it. I saw what we're cooking on a better mobile experience for v0. Sure. And I was going back and forth with my head of mobile, and he was talking to v0, and I was typing it out, and I'm a pretty fast typer, but he beat me with voice, using the local model on the phone. So there's still the question of, like, edge latency versus cloud latency, kind of like what we
Starting point is 02:24:44 talked about with 3D. But I do think voice is going to play an increasingly exciting role in programming, which I would have never imagined. I've always been about, like, typing benchmarks in WPMs. Voice is coming. Yeah, yeah. How do you think about competition broadly in developer tooling, codegen? I mean, right now it seems like there's just so much. It feels like a massive TAM-expansion moment. Every company's ripping. TAM-expansion moment, but at the same time winners will emerge. Obviously, you're playing to win, and yeah, I'm curious. Yeah, on some level, we're playing both sides of the bet. What we announced today that's really exciting is v0 with GPT-5 support.
Starting point is 02:25:31 So you can go to v0.dev slash GPT-5, and we'll use GPT-5 in combination with our model pipeline that makes it really good at vibe coding, especially for non-technical folks. But we also, on the Vercel AI Cloud side of things, open-sourced, basically, the ability to create your own vibe-coding platform, powered by any model.
Starting point is 02:25:53 I was joking about this with Tyler. Vibe code me a vibe-coding platform, please. That's right. Make no mistake. Yeah, vibe code me a billion-dollar company. Yeah. No mistakes. But basically we are giving people that.
Starting point is 02:26:06 It's a starter kit. Sure. And by the way, the fundamental question that a CEO asked me the other day was, is vibe coding a product or a feature? Or is it both? You know, it's TBD. The case for a feature is, okay, there's going to be lots of systems of record. Think Salesforce, Snowflake, Databricks. And increasingly, they're going to
Starting point is 02:26:29 incorporate codegen capabilities into their platforms. They can use a lot of these capabilities that we just open-sourced, and you'll go to their existing place where you have the data. Kind of like what we've talked about for decades of, like, are you bringing compute to the data? Are you bringing vibes to the data, right? Are you bringing codegen to your own platform? You used to bring, like, a dashboard builder, and it would have a couple widgets, and now, if I'm plugged into some sort of data source, some system of record, I could potentially just say, vibe code this app on top of it. There have been some tools, Retool's played in this space, Zapier a little bit. But yeah, I mean, this feels like, you know,
Starting point is 02:27:08 we're not fully in the just-the-pixels-are-generated world, but we're, you know, getting close: a UI-generative application on top, and that being bespoke, ad hoc. I also think it's important to understand the line between consumer vibe coding, just generating ephemeral software and websites and things like that, versus enterprises, which will have a lot of different use cases. When I look at the vibe-coding market and I see businesses that are almost entirely consumers just creating things for fun, I think that has to be a tough business, because it's a hyper-competitive market and consumers are flaky. They'll create something, you know, for fun, but
Starting point is 02:27:47 they'll churn in month two because, you know, they're not running a real business, whereas a business knows, hey, we'll pay for this on a long-term basis because we have a use for it all the time, from this product manager to an engineer over here to somebody in marketing, et cetera. Yeah, the other side of the equation is how do you make these vibe-coding tools work really well for enterprises? Frankly, the most surprising thing that I've learned is just how much demand there is in enterprises for vibe coding. And this is because a lot of the traditional thing has been: the people that understand the business are sitting over here.
Starting point is 02:28:25 The people that understand the code are sitting over here. And their communication is fraught with peril. Like, you don't speak the same language. They kind of, like, resent one another. I love to tell this story. I was meeting with the CEO of a very successful company who was telling me that asking for a feature from his own engineers felt like petitioning the government.
Starting point is 02:28:46 Even though he's the CEO, he's struggling to make the case. And please, like, get me in your next sprint, get me this feature. So vibe coding actually solves that problem. All of the PMs, designers, marketers, business users that previously only had access to, what, like Jira and, you know, little business and product-management tools, writing PRDs and those kinds of things, they weren't able to ship PRs, they weren't able to, you know, ship software, and now they can. And so the opportunity is: how do you actually make this secure? How do you make it high quality? How do you create guardrails? And those are tricky
Starting point is 02:29:27 problems. And I'm really happy that some of them are easy to overcome, at least for us. And some of them are active areas of research, but I think the enterprises really have a strong case for this. Yeah, can you walk me through, like, tool use? We were talking to the OpenAI folks about GPT-5 being really, like, a summation, like standing on the shoulders of giants. You get a Python REPL. You get a web browser. You get, you know, the ability to kind of run cron jobs now. There's voice and, you know, all sorts of different tools kind of wrapped up into one, multiple models. It can trigger reasoning chains if it wants. It can do all this different stuff. And that's actually the benefit: this isn't just a bigger model. It's like
Starting point is 02:30:01 the next version of a thing. It's more like switching from the iPhone 12 to the 13 than going from the original iPhone to the iPhone 3G. It's not just a new technology that's in there. But in the world of vibe coding, what are the tools that you want to think about adding? I know that basically every vibe-coding platform, you know, recommends a database. But we were talking to Harley at Shopify yesterday,
Starting point is 02:30:29 and there's a world where, if I go to a vibe-coding platform and I say I'm building an e-commerce website, it should probably just be like, hey, I'm going to do Shopify under the hood and I'll vibe code the landing page on top. But how are you thinking about the landscape of, like, tools that you could pull in? Fully open, because there are open-source repos that are, like, full projects that you could pull in and then just start customizing on top of.
Starting point is 02:30:49 It's kind of this big continuum. Yeah, there's a couple layers. At the foundation model layer, what you want is a model that is exceptional at tool calling. Whether it has built-in tools or whether you register them yourself, this is, like, a sort of silent war that has been going on. Like, if you talk to devs: what are you optimizing for? Tool-calling quality. Why? Because, to demystify the word agent, what an agent is, is a loop of tool calling that builds up context over time. That's all an agent is. So,
Starting point is 02:31:21 to give you a concrete example: v0. v0 is becoming more and more agentic over time. One of the things it can do is take a screenshot of the thing that it's building and reflect on it. So today, I live vibe-coded to an audience of Web3 and crypto engineers, and I told v0, hey, make this dark mode. And initially, v0 does me dirty. It changes some things in the dark mode, and then it kind of astonished me, because I was like, oh, I'm going to have to explain this to the audience. It then takes a screenshot, looks at it, and keeps fixing it.
Starting point is 02:31:56 And I was like, this is literally a live developer on autopilot. And the reason it's on autopilot is because it has access to these tools, like looking at the web browser. Another one is research. I vibe-coded an example of: build me a Substack clone for cryptocurrency news. And the agent didn't know what the cryptocurrency news was. So it started doing research on the internet of, okay, Ethereum passed a certain price and whatever.
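The definition a few lines back, that an agent is just a loop of tool calls accumulating context, can be sketched in a few lines of Python. This is an illustrative stub, not v0's actual pipeline or any real SDK; `model` and `tools` are hypothetical stand-ins:

```python
# Minimal agent loop: ask the model for the next action, run the tool it
# picked, append the result to context, and repeat until it says "done".
# The screenshot-and-reflect behavior described above is just one more
# tool inside this same loop.

def run_agent(model, tools, task, max_steps=10):
    context = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = model(context)            # model chooses a tool and args
        if action["tool"] == "done":
            return action["args"]          # final answer
        result = tools[action["tool"]](**action["args"])
        # the tool result becomes new context for the next iteration
        context.append({"role": "tool", "tool": action["tool"], "content": result})
    return None                            # gave up after max_steps
```

A stub model that screenshots once and then finishes is enough to exercise the whole loop, which is the point: the "agent" is the loop, not the model.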
Starting point is 02:32:22 And then you're talking about the tools over the internet. So to demystify another topic: MCP is really exciting because it's a new protocol for registering tools that your agent doesn't locally have. So those tools that I just talked about, we gave them to v0. Here's a deep-research tool. Here's the screenshotting tool. And those will likely become the new services when you think about, like, the AWS of today.
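The idea of registering tools an agent doesn't natively have, which is what MCP standardizes, can be mimicked with a toy registry. To be clear, this is a hypothetical sketch of the concept, not the real MCP protocol or its SDK; the tool names (`deep_research`, `screenshot`) just echo the examples above:

```python
# Toy tool registry in the spirit of MCP: tools are registered by name
# with a description, an agent can list what's available, and call any
# tool by name. Hypothetical sketch, not the actual MCP wire protocol.

class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, name, description, fn):
        self._tools[name] = {"description": description, "fn": fn}

    def list_tools(self):
        # what an agent would see when it asks "what can I do here?"
        return {n: t["description"] for n, t in self._tools.items()}

    def call(self, name, **kwargs):
        return self._tools[name]["fn"](**kwargs)

registry = ToolRegistry()
registry.register("deep_research", "search the web and summarize",
                  lambda query: f"summary for: {query}")
registry.register("screenshot", "render a URL and return an image",
                  lambda url: f"png of {url}")
```

The "agent picks you" point then becomes concrete: the description string is the only thing the agent sees before deciding which tool to call.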
Starting point is 02:32:48 If AWS was an AI cloud, which is kind of what we're trying to build at Vercel, like, you think a lot of those tools are going to become as-a-service. Like, bring me deep research as a service, bring me browsing and screenshotting as a service, and so on. But then you have MCP, which allows you to say, okay, I need to sell something online. All right. So now there's an MCP for Shopify. Now there's an MCP for Stripe. There's even a crypto MCP. So it's really exciting. Now it's like the ultimate choice for a builder. And you don't have to go and learn all these things. This is almost like a discontinuity in the Valley trend of, like, if we build amazing documentation, they will come. This is more: if the agent picks you, they will come. Right. And so there's a lot of figuring out right now, like, how do I make my infrastructure,
Starting point is 02:33:35 how do I make my product, be loved by these agents? And MCP promises to be one of the first things that you are in control of. That makes sense. Last question. Someone on your team named Josh is in the chat. He wants to know: what does he need to do to get a Twitter badge? Oh. Well, yeah, 100K downloads of the AI CLI, I think, is what we've been talking about.
Starting point is 02:33:59 Okay. Okay. Got it. Good work. The gauntlet's been thrown down. Josh, thank you. It's on record. It's burned into the immutable record of this live stream and the future training runs. Best of luck, Josh.
Starting point is 02:34:11 I'll be accountable now. We're going to hold you accountable to that, Guillermo. Great seeing you. Awesome. Great to see you. We'll talk to you soon. Congratulations. Let me tell you about Fin.ai, the number one AI agent for customer service.
Starting point is 02:34:22 Number one in performance benchmarks, number one in competitive bakeoffs, number one in G2, number one in having an Irish founder. That's right. And we will invite our next guest to the stream, from Factory.AI. Welcome to the stream. How are you doing? Good to see you. Hey, how's it going?
Starting point is 02:34:43 Glad to be here. Thanks so much. Kick us off with an introduction on you and the company. Yeah, my name is Eno, co-founder, CTO at Factory. We are building a platform for enterprise software developers to perform what we call agent-driven software development. So basically more than just code, bringing agents into every stage of the software development life cycle. So think coding, code review, maintenance, incident response, documentation. We think
Starting point is 02:35:11 agents should be a part of all of this. And we think that they should be driving a lot of that menial component while you think at the high level about how to plan and structure the work. There are so many different, like, enterprises; it's not a narrow category. It's, you know, not consumer, I guess, but it's such a wide category. Is there a beachhead? Is there a certain type of project within different industries, or a specific industry, that's getting an especially large amount of value out of Factory these days? Yeah, totally. I think that one thing that we see a lot, and typically when we say enterprise, we're thinking greater than 1,000 engineers, right? Like 2,000, 3,000. And one reason why we focus on that larger scale: you tend to have these large
Starting point is 02:35:58 organizations where the bottleneck is not code, right? The bottleneck is: how do we plan a migration of 185 codebases to this new framework? And there are 3,000 developers that are going to touch this over the next six months. And an SI just told us the quote is $80 million to do it. And we have to figure out how to do it. So re-platforming broadly is one of the major, major tasks for many, many enterprises, right? 100%, modernization and migration is huge. Yeah, yeah, that makes a lot of sense.
Starting point is 02:36:35 How do you estimate that market size, and is that where you guys are leading on the GTM side, in terms of trying to find these legacy companies that are maybe not even using Cursor yet? I mean, we talked to the CEO of GitHub yesterday, and what, 50%, didn't he say? It was like at least half of their user base
Starting point is 02:36:58 is not using any AI tools. Yeah, totally. We pretty much only deploy into companies that have already tried an AI-native IDE or have an auto-complete tool deployed. And I think that the thing that we hear often is, you sort of hear these numbers thrown around, like 5x, 10x. And then in practice, when you adopt an AI IDE, you see 10%, 15%. And so a lot of people are sort of asking, what is the delta there,
Starting point is 02:37:28 like, what causes that transition? And our sort of argument here is that there is a workflow change that's actually required to really adopt agents in the life cycle, right? And so if you're just sort of accelerating an individual developer, then you can go a little bit faster. But if you are able to parallelize and automate at scale, that is going to be the larger change. And so if you imagine the market here,
Starting point is 02:37:55 there are companies where, you know, 5 or 10% of global payment transactions run on some COBOL system that was written 40 years ago. Every developer is gone, and it's a ticking time bomb. At some point, it needs to go to Java, but there's nobody who even knows how to do that. And so those are the types of projects where the market is so enormous, because half the business runs on this legacy system, hundreds of billions of dollars. Put it all in Lisp, skip Java, go straight to Lisp. Yeah, exactly.
Starting point is 02:38:26 Yeah, I thought that would be the logical one. I'm sorry, we're running behind, so we're going to have to cut this short. But I want to know more about how the enterprise coding-agent market will develop. We could see one world where we wind up with, you know, GCP, Azure, AWS, like, you know, pretty comparable, competitive. They've all had really great margins. It's been this oligopoly. There's another world where you could see more specialization: one of these companies goes deep into high-security environments, or oil and gas, or financial environments, or specializing based
Starting point is 02:39:08 on specific programming languages. As the market develops, like, how do you think it'll play out? Yeah, great question. I think that what's very clear is that the bulk of very large enterprises have a lot of similar problems: refactors, migrations, modernization. So a platform like Factory is able to deploy into that and solve problems quickly. I think that there's likely to be, like, that sort of 80-20, where there are going to be these very specialized providers that only focus on one sort of problem,
Starting point is 02:39:38 and that will represent maybe, like, 20% of what's out there. And so it won't necessarily be black or white, but we do think that the bulk of enterprises have a lot of similar needs, especially when you get across a certain threshold of number of engineers, scale of codebase. Sure, sure. Yeah, I mean, we even see that with the clouds, where, you know, obviously there's the hyperscalers, but then there are neoclouds. And we talked to Armada, where they'll send you a shipping container with a bunch of racks inside and put it on stranded energy.
Starting point is 02:40:07 So there will obviously be a long tail here. That's a great take. Thank you so much for stopping by. Have a great rest of your day. And enjoy the GPT-5 upgrade. We'll talk to you soon. Have fun out there. Really quickly, let me tell you about Attio, customer relationship magic. Attio is the AI-native CRM that builds, scales, and grows your company to the next level. And we will be joined by our next guest from Augment. Welcome to the stream.
Starting point is 02:40:31 How are you doing, Guy? Great. Thanks so much for having me. And that's his name, by the way, if you're listening. His name is Guy. I'm not just calling him "guy." Anyway, please introduce yourself. What do you do? What does your company do?
Starting point is 02:40:43 Yeah, so I'm Guy Gur-Ari from Augment Code. I'm a co-founder and the chief scientist. And we build AI coding assistants for large teams with large codebases. And so you can use Augment Code to do question answering, to do development, to do refactoring, to do migrations, all the tasks that you do, except that our product understands your large codebase really well. And so that means less prompting for you and faster and better results out of the agent. Today, GPT-5 launches. It's kind of a rising tide. Feels like it lifts all boats.
Starting point is 02:41:15 Every company gets access to it. We've interviewed a number of companies that are building on top of GPT-5. Except it drowned GPT-4.5. Yes. But in general, how do you think you can use GPT-5? Are there any pockets of value that you think you can uniquely take advantage of? Yeah, great question. So we've been trialing the model for the past few weeks. And what we found is that GPT-5 is a very thoughtful model. It likes to make a lot of tool calls. It likes to ask clarifying questions of the user before starting to make code changes. And so the place where I reach for GPT-5 is typically if I need to make large changes
Starting point is 02:41:58 or if I'm trying to answer a very difficult question about the codebase. I will let GPT-5 take a crack at it. It will churn for a while, making lots of tool calls, just making sure it got it right, and probably find all the different places in the code where it actually needs to make a change. And so I will typically let it run in the background and come back to it.
Starting point is 02:42:18 And I will often get a high-quality result out of it. Are there any features or integrations that you're hoping GPT-5 will roll out in the future? We talked to a couple of people who are like, we want models that have access to as many tools as possible, and you can see with the MCP boom, more people are trying to make their services, their products, accessible to these models. Is there anything that you see as potential low-hanging fruit to just add to the capability? So I think for us, we work hard on developing our own integrations and our own tools, building them into the product rather than relying on GPT-5 or other model vendors to do so. We have worked closely with OpenAI to improve the prompting around our tools so that the agent kind of works flawlessly.
Starting point is 02:43:12 I think the thing that would be very nice, I think one of the previous guests mentioned a screenshot tool. I think that's a very nice way to close the loop on front-end software development. Just like we saw on back-end software development how running the tests automatically really helps the agent iterate until it gets to working code, I think having more support for screenshotting and things like that, that close the front-end gap, would be very nice to see. I wasn't aware that screenshots weren't flowing through.
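The back-end feedback loop mentioned here, running the tests automatically so the agent iterates until the code works, can be sketched as a simple retry loop. `propose_patch` and `run_tests` are hypothetical stand-ins for a model call and a test runner, not any vendor's API:

```python
# Toy version of the test-driven agent loop described above: propose a
# patch, run the tests, feed failing output back as context, and repeat
# until the suite passes or we run out of attempts.

def iterate_until_green(propose_patch, run_tests, max_attempts=5):
    feedback = None
    for attempt in range(1, max_attempts + 1):
        patch = propose_patch(feedback)    # a model call in a real system
        ok, output = run_tests(patch)      # e.g. shelling out to pytest
        if ok:
            return patch, attempt
        feedback = output                  # failing output becomes context
    raise RuntimeError("tests still failing after max_attempts")
```

The guest's screenshot suggestion is the same shape: swap `run_tests` for a render-and-screenshot step, and the visual diff becomes the feedback signal.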
Starting point is 02:43:44 I feel like when I've triggered Operator, I'm getting a view, a web view, into the website. But I wasn't aware that that wasn't, like, being passed through easily in the API, and you still kind of needed to build that yourself. Where else, we were just talking about this, where are the biggest pockets of value right now for AI coding tools generally? Obviously everyone knows, like, the vibe coder, who's just the designer who's learning how to use software for the first time. Then there's the experienced developer going from a 10x to 100x with better code completion.
Starting point is 02:44:25 There's the enterprise that's, you know, maybe doing re-platforming. Where else are the interesting pockets of value that are maybe on the horizon to be unlocked with new models? Yeah. So on top of everything you mentioned, certainly the inner loop of software development, that's where we've spent most of our time at Augment Code developing product for. Yes, you can have a senior developer starting to use agents, starting to use multiple agents in parallel, and unlock 10x or more productivity gains. What we're starting to see now with our tools is the beginning of automating software
Starting point is 02:44:56 development lifecycle tasks. So with Augment Code, we have a CLI tool now where you can take the full power of our context engine and the agent, the thing that really understands your codebase, and you can start automating tasks in the background. And so we're seeing more and more developers saying, oh, this is great. Like, I can break out of the IDE now. I'm using the agent that's already familiar to me, but I'm starting to automate code reviews. I'm starting to automate incident response. I'm starting to automate looking at
Starting point is 02:45:24 production logs and automatically assigning tickets based on error logs that I'm seeing. All kinds of new automation use cases that we're seeing, just because agents have gotten so good and really understand your codebase. Are there high-stakes pockets of software engineering work that most of the AI tooling has kind of stayed away from? I'm imagining, like, the high-stakes database migration. Where is the kind of sticky part of the industry? I was reading a blog post by someone who was doing, like, very advanced cybersecurity pen testing, and they were saying, like, just the creativity of the models wasn't quite there yet to really
Starting point is 02:46:08 come up with, to really act and embody, like, a white-hat hacker who was going for a bug bounty. But where are the pockets of still, like, intractability, where, I guess, if you're, you know, the individual contributor who loves just, you know, coding from scratch, that's where you want to stay for at least the next couple months? Yeah, I think still, with all the models we've seen and all the agents we've seen, making proper design and architecture decisions, that's still high stakes, and the ability is still not there. Because if you do complete vibe coding and you just let the agent go and do whatever it wants, in the beginning it looks amazing: the code works, and it's all really good. But once you get to low tens of thousands of lines,
Starting point is 02:46:59 the bad decisions that were often made around the design and architecture start to show up, and development slows down. So that's where we still see a limitation of today's agents, and where you still have to supervise the agent closely in order to make sure that you don't get stuck later on. Perhaps this will change in a year, but today I would say all these decisions that you make around how the code is structured still require close supervision and are still high stakes,
Starting point is 02:47:28 because it can really slow your project down if you let it go autonomously for long enough. Yeah. That makes sense. Well, thank you so much for stopping by. We will talk to you soon. Have a good rest of your day. Thanks so much. Cheers.
Starting point is 02:47:39 Let's check in with Tyler on the timeline. Tyler's manning the timeline. How are the vibes? Are there any new posts that have hit the timeline? Are we still in turmoil, or has the narrative settled? I think vibes are picking up a little bit. You're starting to see people post, like, oh, this is something I made. Now you can see on LMArena, it's number one.
Starting point is 02:47:58 No way. Wait, wait, so what's going on with the Polymarket then? So, Polymarket is still, let's see. Still Google-heavy? Yeah, I think, I guess they're just pricing in Gemini 3. Ooh, okay. I'm not exactly sure, honestly. I was actually very surprised to see that
Starting point is 02:48:13 one. Yeah, yeah, maybe later we can show some of the posts. Yeah, that'd be great. Cool stuff. Um, well, in the meantime, before our next guest, let's tell you about Eight Sleep. Get a Pod 5: five-year warranty, 30-night risk-free trial, free returns, and free shipping. And we will, uh, have our next guest join us, from CodeRabbit. How are you doing? Good to meet you. Good to meet you. Uh, good to be here. Uh, what's your reaction to GPT-5? How long have you been playing with it? What are the biggest improvements that you've noticed? Yeah, I would say mind-blowing, right? I mean, our team has been playing with it for, like, a few weeks now, um, tested a few snapshots, and it's amazing. It's a generational
Starting point is 02:48:54 leap, we would say. Like, we have been using OpenAI models since, you know, CodeRabbit began; it's been like a couple of years we have been on OpenAI and Anthropic. Um, and our product is a very reasoning-heavy product, like one of the very few use cases where you have a PhD-style problem, where we have to do code reviews. That's what CodeRabbit does: users open a pull request, our agent uses reasoning models to find issues, like race conditions or security issues and so on. So we've been testing GPT-5 on some of the hardest pull requests
Starting point is 02:49:29 we have in our golden data set. So we've maintained a data set where we track progress of different models and progress of AI in general. So we have many problems that no model has been able to solve so far, not even, I mean, GPT-5. But so far it has the highest score. We would say it's almost 2x better than the next best, o3 or Sonnet or Opus, at this time. What's the customer value there?
Starting point is 02:49:52 You think that all the customers just notice that the product gets better? Are you going to upsell folks? Are you going to, like, how do you play this, given that this model is now in public availability? Every company, every competitor can access it as well. Yeah, there is no upsell there. That's the thing with AI: for the same price, or even better prices, you're getting much more AI, much better AI. That's the whole idea of how fast this space is evolving. So yeah, from the pricing point of view, we don't see this being, like, a separate plan or something in our
Starting point is 02:50:24 product. I mean, for the same price per month, customers will now just get better quality of results with CodeRabbit. What's next for the business? What kind of customers are you going after? Who do you think has been on the fence, and this release is going to be the thing that gets them to actually jump into the world of AI? Yeah, we can track the top-line metric. Like, one of the things we track very closely in the company is how many signups convert to paid customers. That number has been constantly improving since GPT-4, GPT-4 Turbo, GPT-4o. It actually dipped, so there was a time when GPT-4o was almost like a Windows Vista of OpenAI. It's funny,
Starting point is 02:51:10 like, how we kind of trusted the evals and we thought it's the same model, but, you know, it was inferior in many ways. Then we saw a huge improvement after o1 came out; o1-preview was a game changer for us. Even at that time, our conversion doubled, actually. Right, I mean, so we went to, like, more like close to 30% success in getting the paid users.
Starting point is 02:51:31 And now with GPT-5, we're hoping we can see another big jump in the number of people who start becoming paid customers, and how many people churn. So those are the real numbers. Like, one is, like, vibes, like how people respond to the model and whether we get angry tweets or not, I mean, that's the one part. But the other thing is, like, the actual revenues, whether it moves the needle for us. And that can be seen. Like, one of the things we have seen: even though you test these models in a lab,
Starting point is 02:51:57 it's not like a huge data set, but once you actually are in the wild, you see hallucination, some of those issues at scale, pop up. So those are something we'll still be observing over the next few days, to see whether it, like, spots only, like, 80% of the cases, but then if the false-positive rate, the hallucinations, are too high, then also it's not a great model. But that remains to be seen. Yep, that makes a lot of sense. Well, thank you so much for stopping by.
Starting point is 02:52:20 Congratulations on a new tool in the tool chest. New toy. We will talk to you soon. Have a great rest of your day. Cheers. Goodbye. And let me tell you about public.com: investing for those who take it seriously. They've got multi-asset investing, industry-leading yields,
Starting point is 02:52:37 trusted by millions. Millions. The chat is going wild about public trading. The SPX 6,600. I think that comes from someone talking about the non-Mag 7 stocks or something. There's been people benchmarking the Mag 7 versus the... The big news while we were live, or earlier today: Trump signed an executive order that is opening up 401(k)s to digital assets and private equity. What's crypto doing? Is it ripping? Bitcoin is up a couple points, last I checked. At this point, you know, where's it going to go? It's already so high. I mean, it's just like, there's been so many catalysts. It could go up, it could go down. Yep, we'll have to wait and see. Um, Tyler, anything else notable from the timeline? What have people built? I see this: GPT-5 just one-shot a Minecraft clone. Yeah, I think that's
Starting point is 02:53:32 one of the cooler things I've seen. Okay, so this is, so it wrote code to generate this game; it's not generating the pixels. You can do so many different things. Like, you could generate a video, generate a world model, generate code that generates a game engine, you could generate code that runs on Unreal Engine. I don't even know what they're using. One thing now, in actual ChatGPT,
Starting point is 02:53:54 there's, like, a native, like, it's like a music player, it's almost like GarageBand. You can say, like, if you prompt it to build, I saw a Sam Altman tweet about this, you prompt it to do some kind of, like, beat or something, it'll, like, make an interactive, like, almost-GarageBand interface in there. That's cool.
Starting point is 02:54:08 I was playing with that earlier. Yeah, I do wonder how many of these features that we're seeing, like, where does OpenAI want to keep things in the B2B world and let other companies build, versus just build it as a consumer app? Like, will ChatGPT eventually just let me publish a website? Like, will it become a vibe-coding platform, at least, like, a basic one? Like, it's not the most advanced coding environment,
Starting point is 02:54:35 but it can definitely write some code and execute it for you and do some stuff. Yeah, well, it's funny, because it used to be, you would have, like, so there was, like, GPT-3.5 or something, and people on top of that built a vibe-coding thing. So you could use that to build your own vibe-coding thing. But now you can just go straight from
Starting point is 02:54:53 ChatGPT to build your vibe-coding platform. Yeah. But soon, maybe it'll just be the vibe-coding platform. Yeah, the surface area of this stuff is very interesting. Clearly they're going after healthcare and therapy. It's interesting that they've kind of stayed away from legal; maybe that's just the dynamic of the sales process and the dynamic of that particular market. But, I mean, increasingly you can just ask more and more questions
Starting point is 02:55:18 of ChatGPT. So on the consumer-to-business bleed-over, there's certainly a world where just giving everyone in your organization ChatGPT is a substitute for a bunch of different SaaS products. So it'll be interesting to see where that develops. What are you thinking about? Near says, there are concerns that the number used to represent our AI's intelligence does not, in fact, represent its intelligence. Worry not: to address these allegations, we've added three new numbers. Near.
Starting point is 02:55:48 Yeah, Near is building something that's not particularly benchmarkable, right? Isn't it a companion? It's beyond benchmarks. Beyond benchmarks. Well, in completely other news, Anduril opens a Taiwan office and begins selling AI-powered attack drones to Taiwan. Palmer Luckey has said he wants to turn Taiwan into a prickly porcupine. We're in the age of spiky intelligence.
Starting point is 02:56:10 That spiky intelligence will be onboarded onto the AI-powered attack drones and deployed in Taiwan to keep it safe. What else is going on in the timeline while we wait for our next guest from OpenAI to join? Spore says, raise your hand if you are not automated today. I'll raise my hand. I was not automated today. Not yet. We survived. Sébastien Bubeck says, here at OpenAI, we've cracked pre-training, then reasoning, and now we're experimenting with a new set of techniques that maximally leverage their interaction.
Starting point is 02:56:43 GPT-5 is just the first step in this direction. We're incredibly excited to see where scaling this up will lead us. And it's the unicorn test, I believe. And the latest unicorn is really, really good. That is a creative interpretation. And I think it has to draw all this with, like, SVGs. Anyway, we can talk to our next guest about it. GeroTicket says, I went to the permanent underclass party and everyone knew you.
Starting point is 02:57:09 Anyway, back to the serious interviews. Welcome to the stream, Max. Good to see you. How are you doing? What's happening? Nice to meet you guys. Yeah, doing well. It's a relief to have this launch out in the world.
Starting point is 02:57:21 I think it's, you know, we've been working on this for the last few months now, and it's exciting to let the whole world see what we've had. Yeah. Just a few months? It's been, I don't know. It's been a little while. What's the actual launch day like? Because you're actually getting this out into the world.
Starting point is 02:57:37 The GPUs are on fire, or about to be on fire, warming up. But is that out of your purview? There is a different team for that, fortunately. Right. So I run a lot of the research for GPT-5. I don't necessarily handle the deployment, but I do get dragged in when the GPUs are on fire. I think we're moderately burning right now.
Starting point is 02:57:59 Like a tool arm fire. Yeah, yeah, yeah. Is it materially different? I mean, this is a launch day, but we'll probably discover, like, the Studio Ghibli capability once it gets out into the long tail of, like, you know, hundreds of millions of people try it. Someone comes out with some genius thing. Then everyone's doing that.
Starting point is 02:58:17 And then the GPUs... Because I feel like the Studio Ghibli thing happened, like, a few days after the launch of images in ChatGPT. It did. It was pretty fast, within about a week. I think in this case, we're going to see that here. I think coding, you know, if I had to place my bets on what the Studio Ghibli thing is going to be, it's coding. That's the place where I think GPT-5 is most tangibly, hugely ahead of GPT-4 and ahead of o3.
Starting point is 02:58:43 Do you think there's a chance that the coding will mean a Studio Ghibli-style meme? And what I mean by that is: image generation is incredibly valuable in the context of, like, Hollywood using AI to chroma-key and rotoscope in a professional environment. But what was special about Studio Ghibli was that anyone was making these custom images. And I could imagine a world where, you know, even going from the Levelsio example of "I vibe-coded a flight simulator," if we wind up in a Studio Ghibli moment for coding,
Starting point is 02:59:20 I would imagine it's like everyone built their own game today. I think that's pretty much it. Yeah. So I don't know if you guys watched the livestream. That was one of the things we had on the livestream. Like, you can just go into ChatGPT. If you try it right now, it might or might not work, because the GPT-5 rollout is still ongoing. But if you have GPT-5, you can just tell it, basically, make me a game.
Starting point is 02:59:40 Yeah. And it will make it, and you can actually play it in ChatGPT. That's amazing. So I, yeah. Is there the ability to discover that? The thing is, with Studio Ghibli, right, you don't have to know how to draw to make it work. For this one, you don't have to know how to code.
Starting point is 02:59:54 Yes. But can you share that chat, and someone else can play the same game? How does the sharing mechanism work? Yeah, you can do the share link. We're, I think, going to try to make sharing for these a lot better over the next few days. That was P2, after the P1 and P0 of making the GPUs not completely melt. But yeah, we will try to make it much more shareable. Yeah, yeah.
Starting point is 03:00:17 I mean, the Studio Ghibli thing is so interesting because it's not just that the model capability was there; it's also that the prompt was two words, and it was so reliable that you always got a good result, and you could personalize it. So even if it wasn't... I've seen people build Doom. I've seen people... you know, you can just buy Doom. It's a real game. You can build it.
Starting point is 03:00:41 But if you build it and I'm like, oh, that's cool, you did it in a vibe-code environment or in ChatGPT, like, that's awesome. But I don't necessarily want to go do that for myself. But as soon as it becomes personal, which is what the Studio Ghibli moment was... I had to see what I looked like as Studio Ghibli. I had to see what my favorite photo looked like, what my favorite meme looked like, in Studio Ghibli.
Starting point is 03:00:57 And once that happens with games, people will eventually... you know, there'll be this memetic explosion, and you'll see the GPUs will truly be on fire. Yeah. I mean, I think even today you could probably, with GPT-5, do Doom, but all of the enemies are headshots of your friends. That will just work.
Starting point is 03:01:16 Yeah, we're real close. It's going to be something that's personal, something that you can express your own creativity through, because I think people still latch on to that. They don't just want a copy of what already exists. They want something new. And the Studio Ghibli moment was just new enough. Anyway, we should talk about actual research. We should talk about post-training.
Starting point is 03:01:33 What's the thing you're most proud of? What can you give us, without, you know, immediately getting poached, on the actual innovation that went into GPT-5 from a post-training perspective? What are the kind of keywords and paths in the tech tree that we should be digging into over the next few years to understand how this works? You know, I would say the thing that is most impressive to me about GPT-5 is how much
Starting point is 03:02:01 getting all of the details right matters. Like, when I look at GPT-5... you know, we had an early version of this thing a while ago that was kind of okay, but clearly did not meet our bar for revolutionary. And we were trying to figure out, you know, why is that not as good as it should be? And the team basically just went off and did a deep dive over a couple of months of completely rebuilding the post-training stack for this model. And it turns out that when you do that, you get what would have taken another order of magnitude worth of pre-training improvements to produce. How much are you thinking in post-training, in research, about: let's forget the benchmarks
Starting point is 03:02:41 and just focus on user satisfaction, like NPS score basically, or user minutes, or any of these other real benchmarks. Yeah, the intangibles of how people feel when they're using it. Yeah, the feeling and the joy and the actual value that's delivered, because Studio Ghibli was a delightful moment. It wasn't a benchmark. Yeah, that was something that we took very seriously for GPT-5: look at what people are actually doing with ChatGPT and look at where the model is failing them. Either in the sense that the model is, sort of like you said, not
Starting point is 03:03:15 enjoyable to use. And so we did, I think, make a lot of progress on that. Like, GPT-5 is much more engaging than our previous really smart models. Like, o3... I don't know if you guys talked to o3 in the past. It's a bit bland. Sure. And GPT-5, I think, has a lot more character, is a lot more interesting. But then also, I think we really care about just actually being accurate.
Starting point is 03:03:40 But if a user is trying to do something economically valuable with our model, we want to make sure it lands correctly. And so what we did there is just look at the actual distributions of what people are doing with our models in the real world, figure out where the models are going wrong, and build interventions to target it. And that was where we got, I think, the most impressive improvements in GPT-5. Like, o3 would just get things wrong and not tell you it wasn't sure it was incorrect. GPT-5 is much, much better about actually being
Starting point is 03:04:10 honest when it thinks it might not know. Yeah. How explicit are all the different pieces of the post-training pipeline? Like, you have safety post-training; you have "stop hallucinating, give me the real facts"; you have "make sure the flavor, the tone, is pleasant." There are so many different things to optimize for. How much of that is, like, try to just blend it all up into one thing, versus explicit passes? Chunk it out, split it up? How much can you decompose the problem? So, you know, my background is in reinforcement learning, and I think, when you look at something like this, the magic is in the reward function, right? It's in what you're actually telling the
Starting point is 03:04:55 model to be good at. And so fixing things like hallucinations, to a huge extent, is essentially a function of just fixing the reward function: actually making it so that the model is reliably penalized for saying something that's false. And if you do that, all of a sudden the model stops saying things that are false. Ditto for safety, right? You know, on the livestream, Saachi talked a bit about the way we've changed safety for this model. And to a huge extent, it's just a function of,
Starting point is 03:05:22 we're actually putting out a paper today on the new safety stack for this model. And the core insight in that paper is just figure out what you actually want to optimize for, which in our case is helpfulness, conditional on not saying something that's actually dangerous or harmful. You know, write that down, figure out what that means
Starting point is 03:05:38 as a reward function, then optimize for it. It's really not magic at all. It's just, again, what I said earlier: you've got to get the details right. If at any part of that process you screw it up, the model will be unusable. What's your current thinking on spiky intelligence? And is there some flywheel that you can get started, where you're identifying low points that aren't spiky enough, and then you're almost automatically setting up the infrastructure, the eval, to then RL against to create a spike? I think GPT-5 was a preview of what's possible
Starting point is 03:06:19 in that respect in the future. Yeah. A step in that direction. Do you think that there's a world where you get to a place where you're kind of, it's weird because we're not hammering down the nails of the spikes. We're adding spikes, but... Hovering up the spikes, yeah.
Starting point is 03:06:33 It's this weird metaphor that we're stretching a little bit too far, but is there a world where you can be doing post-training, or just adding capabilities, in a more iterative cadence? So that as soon as you identify something, the response can be: yeah, we don't need to wait until GPT-6 to fix this. We can just add this capability because, hey, we just found a pocket of users who are trying to do a thing and they're not super happy with the results. Let's add this capability. Yeah, I think so.
Starting point is 03:07:04 I mean, I think we are going to launch other models between now and GPT-6. I think it's relatively common knowledge, but we do update the model in ChatGPT reasonably often. Yeah, people talk about it all the time. Yeah, exactly. And I think we are now in a world where we can conceivably update that model and have it get materially better on capabilities too. Yeah. Not just on, you know, the personality is a little bit better than it was before. Yeah.
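Max's framing of safety and hallucination fixes as reward-function design ("helpfulness, conditional on not saying something that's actually dangerous or harmful") can be sketched as a toy scoring rule. Everything below is illustrative: the function name, weights, and penalty value are invented for this sketch, not OpenAI's actual training stack.

```python
def reward(helpfulness: float, is_harmful: bool, is_false: bool,
           penalty: float = -1.0) -> float:
    """Toy conditional reward: helpfulness only counts when the
    response is neither harmful nor false; otherwise a flat penalty
    dominates, so the policy is reliably pushed away from confident
    hallucinations and unsafe answers."""
    if is_harmful or is_false:
        return penalty
    return helpfulness

# Under this rule, an honest "I'm not sure" (modest helpfulness,
# nothing false) outscores a confident fabrication.
assert reward(0.4, is_harmful=False, is_false=False) > \
       reward(0.9, is_harmful=False, is_false=True)
```

The point of the sketch is the one Max makes: if the reward ever pays out for a false-but-fluent answer, the model learns to produce them, so the details of the conditional have to be exactly right.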
Starting point is 03:07:26 Going back to your note on the new paper that I guess you guys are releasing today: when you talk about optimizing for helpfulness, is part of that avoiding the model reinforcing the wrong things? There are times when you want to reinforce, and give the user confidence that they're going down the right sort of thought process. But then there's a point where it can get too extreme, in terms of maybe convincing a user of something that may be totally untrue. Is that what the paper gets at? So it's not specifically about this, although I will say we do explicitly train the model to not lead users down bad paths. That's something that I think we've started taking much more seriously
Starting point is 03:08:12 over the last few months. As we've realized... Sam talked about this a little bit, I think, back in May, but ChatGPT is just way more important for people's lives now than it was a year ago, or especially two years ago. And we do have to actually be very cognizant of what effects our models have on users. So yeah, we do very actively train models to not lead users down bad paths. Don't fact-check me on the releasing today. I know we're releasing it. I believe it is soon. I think it's today, but I've also been in a hole dealing with launch all day. Yeah, we're not big on fact-checks here. We're big on the truth zone, which is just the vibes. The vibes are: we'll be publishing some information about the new safety setup.
Starting point is 03:08:51 At some point. That's great. Yeah, I think a large part of the conversation around safety should be how reliant on the product users have become and how useful it is, and then the new level of care that you have to provide, versus a while ago, when it was just people making a cute image or generating some text they were going to use in an email or an internal document. And realizing that this vector of usage, this companion-confidant role, is becoming so prevalent. Talk to me about post-training for big partners, enterprises, government organizations. What is transferring from the research that you're doing to something that can be offered
Starting point is 03:09:44 as an enterprise-level product? Yeah, so OpenAI does partner with external companies to do essentially custom post-training. That is a thing that we do. And from that perspective, the stuff we do just directly transfers. I'll also say that we've put a lot of work into trying to make our models as general as possible. To as large an extent as possible, if you want to get really good results from our model, you can do it right on the API
Starting point is 03:10:10 just by actually telling the model what you want it to do. Yeah. Right. Like, GPT-5, I think, is pretty comfortably our most steerable model ever. We've heard a lot of really positive feedback about this, especially from folks like Cursor.
Starting point is 03:10:23 Yeah. So if I came to you and I was like, I'm an enterprise and I need to generate a lot of Studio Ghiblis, you'd be like, what are you doing? Just prompt it. Probably. What are the examples of companies and organizations... is it just private information,
Starting point is 03:10:39 private data sets that aren't available on the open web? Or is it specifically that there is enough data out there, but there's just not the economic incentive for your team to go and RL on gas-station-bench, or whatever we're talking about here, hypothetically? I think the answer is both. Yeah, it's definitely both. Because, yeah, we're not going to target,
Starting point is 03:11:03 as you said, gas-station-bench. Because it's probably not on our radar right now, because it's not what most people are doing with ChatGPT. Exactly. If you have some application that's super valuable to you, we can be convinced that it's important. Yeah, yeah, yeah. It's just not what our users are already trying to do.
Starting point is 03:11:19 What's the state of reward hacking, and fighting that in RL environments? You know, I think we've actually made a lot of progress. There was some discussion of this around o3, that o3 was a little bit deceptive in ways that felt reward-hacky, and GPT-5 is dramatically less deceptive than o3 was. What's an example of how that would manifest? Do you have a canonical case study? Yeah, I mean, the canonical thing is you ask o3 to write you some code,
Starting point is 03:11:48 and instead of actually writing the code, it changes some unit tests. Changes the test case, right? Which is kind of hilarious. It's one of the funniest things that AI has ever done. I understand it is very bad and it's not what we want, but it is kind of cheeky, in my mind. It's kind of cheeky. It's also, you know, I feel like if you spend enough
Starting point is 03:12:04 time around real software engineers, they do actually do stuff like this pretty often. I have 100% done that. I was going to say, I also have done that. For formal reasons, I won't say that I did it at OpenAI, but I definitely did that. Yeah, of course, of course. This is natural. What do you think GPT-6 looks like? You mentioned that you're going to be shipping updates to five, but what are you most excited about? Where are you most excited about going? And, just really quickly, give us the date that GPT-6 launches. Oh man...
Starting point is 03:12:38 Hopefully six launches as a complete surprise to everyone. I think that would be ideal. Like a Beyoncé album. Oh yeah. Hopefully five just makes it and says, hey, it's ready now if you want to hit go. Yeah, I think
Starting point is 03:12:48 that would be a great thing for six, actually. I would love for six to do all of the launch comms and to do the livestream. That would be very cool. Livestreaming? That's the real AGI test. For sure, for sure. I feel like we're not that far off, actually.
Starting point is 03:13:01 I don't know. I mean, video synthesis, maybe. But, you know, talking through a script for 30 minutes? Come on, the model's got to be able to do that, for sure. Well, yeah, that'll be the next Sora launch or something. We'd love to have you back on, but thank you so much for taking the time today. We'll talk to you soon. Great to talk to you guys. Congratulations. Cheers. Bye. Congrats on the launch. Let me tell you about AdQuick.com: out-of-home advertising made easy and measurable. Say goodbye to the headaches of out-of-home advertising. Only AdQuick combines technology, out-of-home expertise, and data to enable efficient, seamless ad buying across the globe. And we have Scott Wu from Cognition coming in the studio for the fourth, fifth time. I can't keep track anymore. Thank you for taking the time.
Starting point is 03:13:40 Thank you for coming back. It's great to see you guys. It is fantastic. Got to be honest, great week to be an application-layer company, I got to tell you guys. I was about to say. You get a model. You get a model. Open source. Another win for Scott Wu. Wow, wow, wow. 4.1. Yes. So, yeah, how big is this? Are we in Uber-Lyft territory, where you're going to be in price competition between Anthropic and OpenAI going back and forth? What is the real benefit to your business right now, from today?
Starting point is 03:14:12 Yeah, yeah, for sure. So, first of all, obviously, massive capability gains across the board; really, really impressive work that OpenAI has put together. You know, people have talked about what's going on in the AI coding model race, and I think by a lot of accounts, Anthropic has generally been ahead for a lot of the last year, honestly. And I think at this point, OpenAI has very clearly caught up, and it's pretty neck and neck, I'd say, between the two right now. And so I'm very excited to see all this unfold and to see what's next. But I think from our perspective, yeah, I mean, code is just such a core capability use case, I'll call it.
Starting point is 03:14:50 And so, you know, being able to work with smarter and smarter models and do a lot of the work that we do, it just means that both Devin and Windsurf can be a lot more capable, a lot more intelligent, can predict what you want to write or what you want to do with a lot higher accuracy. Yeah, it's almost surprising that, given the cultural rigor at Cognition, you're not doing fundamental frontier research. So can you walk me through: what is the focus of being an application-layer company? Is it UI, go-to-market? I'm sure it's all of these. But in terms of the hardcore software engineering, what is important to get right? At some point, there's fine-tuning and post-training, but is that moving back into the purview of the foundation labs?
Starting point is 03:15:43 Or is there still work that you want to do on top of the models, or on top of the weights? Yeah, it's a great question. I mean, I think the core of being an applied lab is really just focusing on a very particular use case, on delivering real, very direct results. And I think the foundation labs are obviously incredible at training base models and all this pre-training and all of the work that they do there. I think from our perspective, we want to work on a lot of very particular capabilities that apply to software engineering in particular, and then obviously run the whole stack from there: building a product, figuring out the interface and the UX, and then bringing that to market and selling it. On the capability side, there's a lot of particular stuff where, one way to put it is, I think the base IQ is very much already there in the models, and you can see the raw problem-solving ability. I mean, we've gotten some pretty insane results, getting a gold medal at the IMO and all of these other things, right?
Starting point is 03:16:44 You called that, by the way. Yeah, you called that. I think the first... you know, I mean, we were one point away, to be fair, a year ago, right? So it was on the way, I'd say. But yeah, so you can really see the general intelligence improving with every single model generation. On the other hand, for Devin, obviously, there's a very clear step up in the general intelligence, but also you want to be able to, you know, if you ask Devin to go debug your
Starting point is 03:17:10 Kubernetes, or to go and look into your error logs and figure out what went wrong, or things like that. There's often a lot of very specific capabilities, and that's where we find that the post-training and the RL is most effective, along with a lot of the various work around the models that turns out to be useful. What about speed? A lot of people that have gotten access to GPT-5, at least in our chat, are reporting that it just feels really, really quick. How is that, over time, going to impact things? I think a lot of people, if they're using Devin today, task Devin with something and then maybe go work on something else for a little bit, or they're running multiple agents concurrently. But at some point, the agent could
Starting point is 03:17:56 get so fast that you're just sort of watching it work in real time, and you actually want to be engaged. But are we there yet? Is it still a ways out? What do you think? Yeah, it's a great question. I think in general, async will continue on as a paradigm, even as the models get faster and faster. One of the reasons it should, by the way, is because there are a lot of real-world thresholds that start to matter. Like, at some point, you're actually spending less time on token generation in the Devin lifecycle, and you're spending more time on every time Devin runs the command to go install packages, or Devin running the unit tests, or Devin pulling up the front end by itself, or things like that, that obviously take real-world time, right? I think
Starting point is 03:18:37 we are honestly getting closer and closer to that threshold. But yeah, long story short, I think in the asynchronous mode, these things will get faster, we'll see those gains, or we'll be able to spend a lot more time, for example, thinking about a single problem relative to the amount of real-world clock time that gets spent. I think the synchronous use cases are where we'll see things really explode with speed, which is, you know, Windsurf and Cascade, for example, where we see the speed gains really, really matter.
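Scott's point about real-world thresholds is essentially an Amdahl's-law argument: once decoding gets fast enough, an agent's wall-clock time is dominated by the commands it runs (package installs, unit tests, builds). A toy model, with all numbers invented for illustration:

```python
def task_seconds(tokens: int, tokens_per_sec: float,
                 tool_seconds: float) -> float:
    """Wall-clock time for one agent task: time spent generating
    tokens, plus the irreducible time spent running real-world
    commands (installs, tests, builds)."""
    return tokens / tokens_per_sec + tool_seconds

# 20k generated tokens and 90s of command execution (made-up numbers):
# at 50 tok/s decoding dominates; at 1,000 tok/s the tools dominate,
# and further model speedups barely move the total.
slow = task_seconds(20_000, 50, 90)     # 490.0 seconds
fast = task_seconds(20_000, 1_000, 90)  # 110.0 seconds
```

Past that crossover, a faster model mostly shifts the experience from async ("check back later") toward synchronous ("watch it work"), which is the Windsurf-versus-Devin split Scott describes.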
Starting point is 03:19:08 Speaking of Windsurf, give us the update. The chat wants to know about the Windsurf tea and the 80-hour demand. How have the buyout offers gone? What's the internal response been? Where'd that idea even come from? Yeah, yeah. Look, people are stoked, honestly. And I think from our perspective, it's obviously really important to kind of just unite and get to the point where we can be one culture and one
Starting point is 03:19:36 kind of shared set of values. And this is how things are at Cognition: it's a pretty busy time. We are at the inflection point of code, and we work like that too. And so I think a lot of it, for folks, is just, you know, we want to make sure folks who really want to do this with us make that conscious decision to opt in. And for anyone who doesn't, obviously, we totally understand. There are a lot of talented folks for whom maybe that's just not the right thing right now, or not at this time. And so we wanted to make sure that they were, well,
Starting point is 03:20:10 we'll take them care of too. And to be clear with the buyout offer, that's on top of the actual acquisition deal that already went through. They already got their vesting. So, yeah, I was thinking of the roller coaster. It's like, you have the opening I deal, then the Google deal, then the cognition deal.
Starting point is 03:20:25 And then they're like, wait, these guys work really, really hard. I don't know if I'm cut out for this. And they come back up again or they're like, wait, I can just go, you know, take a sabbatical and figure out my next thing. So it's a great outcome. Yeah.
Starting point is 03:20:37 Yeah. No, and obviously, overall, it's a killer team that's been through a lot. And so I wanted to make sure that they're well taken care of. That's fantastic. Anything else you can tell us about the integration of Devin and Windsurf? How are the teams getting along? How do you see the products playing together in the long term? Obviously, cross-sell seems really obvious. They had the go-to-market team as well.
Starting point is 03:21:00 But how else are you thinking about the interaction, maybe over the longer term there? Yeah, yeah, for sure. A lot of obvious integration on the team, as you mentioned, with cross-sell and so on. I think the thing that's really exciting on product, which actually comes along with these capability increases, is that as the capabilities keep getting better, you start to take on harder and harder tasks with AI and with full agentic workflows, right? And there's an interesting thing that happens where, for a lot of the harder tasks, you really do want to go back and forth between a synchronous and an asynchronous mode, and that's for a few reasons. One of the reasons, obviously, is because there's a lot of review, a lot of looking at the pieces and thinking about all the minutiae and the details of what you're implementing. I think another big reason is, when you get started on a larger project... let's say you're sitting down as an engineer and you're saying, all right, I'm going to go build this whole project today. You yourself don't actually know all the tradeoffs you want to make, all the decisions you want to make, and so on, right? And so having a format where, for the decisions that need you to be there, setting the strategy or figuring out at a high level what should happen, you're able to do that in a nice synchronous environment, which is naturally the Windsurf IDE, right? And then for the parts of the task that you can actually
Starting point is 03:22:14 hand off and have an agent work on, you're giving that to Devin. And figuring out how you go back and forth between those is super interesting. So, Wave 12 on the way soon. Hopefully we'll have a lot more to share. Last question. Yeah, hit the soundboard, Jordi, for that. Wave 12. Wave 12. Fantastic. Last question, then we'll let you go. What is your probability that AI will get a perfect score on the IMO next year? Oh, interesting. So, by the way, we just had the IOI, which is the programming version, the programming olympiad, and I think there's a good chance that we'll have a gold medal at the IOI for this year announced as well. I think a perfect score for next year... Wait, we as in humanity, or we as in Cognition? As in humanity, yes, yes. An AI perfect score. Yeah.
Starting point is 03:23:01 Sorry, an AI gold medal. Right. A perfect score on the IMO next year. I think it's got to be north of 50%, honestly. I would put it around 75% or so. Okay. Well, thank you so much.
Starting point is 03:23:13 We'll be following it closely. And good luck to you, and congrats on all the progress. Very fantastic. We'll talk to you soon. Awesome, guys. Thanks for having me. Bye. Let me tell you about Bezel.
Starting point is 03:23:23 getbezel.com. Your Bezel concierge is available now to source you any watch on the planet. Seriously, any watch. And we are joined by our next guest, Claire Vo, from ChatPRD. Welcome to the stream. Claire, how are you doing? What's going on? It's a fun day today, isn't it? It's a fun day. What was your reaction to the stream? What was your reaction to GPT-5? You know, GPT-5... the first thing I said, and I got a little early access, is: it's a developer, for developers, by developers. This thing is built to be a software engineer. You've seen a long string of your guests come on and really speak about
Starting point is 03:23:58 the coding abilities of it. What I think is interesting about this particular model, especially because we're seeing them deprecate the old models in the ChatGPT experience, and we're seeing a lot of positive feedback, is that I do think there are drawbacks to a model that's so clearly tuned to a developer use case. And as somebody who's building an application that isn't focused on agentic coding, I have noticed some personality quirks that are going to be really interesting to see how they shake out as we roll out this model to our users. Walk me through those. What's the timeline?
Starting point is 03:24:34 How much, like, how much time do you have to kind of move users over to five before? Yeah. Yeah. So, I mean, I think we have tons of time from the API side to move, move users. And in fact, you know, our strategy at chat purity is not to just upgrade to the latest model. I know Zach at Warp said, like, why wouldn't you want the latest intelligence? And the reality is because we're doing a lot of business strategy and business writing, I actually want to validate with our users that they're getting the quality of strategic
Starting point is 03:25:03 thinking, output, and writing that they really want. So we actually A/B test every single model rollout and really evaluate for user quality, token generation, all those things. And, you know, looking early on, it yaps. Man, this thing just wants to go through tokens. Right now I'm seeing 4 to 10x the number of tokens generated between the 4-generation models and 5, and when you're in a business context, you do not always want more words, you know. And so it'll be really interesting there. It is certainly
Starting point is 03:25:36 focused on execution so i you know i've heard a lot from the open a i team it's steerable yes and its natural inclination is to drive you towards like how what very tactical very specific and so if you're trying to zoom back out at a um strategic level or focus on a business initiative, it's actually a little harder to tune in that direction. So, you know, I think there's a lot of positive things for me as somebody who uses agentic coding platforms, who writes a lot of code. It's my daily driver now. I love it.
Starting point is 03:26:08 But for other use cases, I think it's going to take some time to figure out if it really is optimal in use cases where intelligence actually isn't the differentiating capability. Yeah, it's very interesting to think the best product manager is not the one that writes the longest doc. No, and you don't send your engineer into your executive meeting. And I really am looking forward to the time when we're not getting these number-based models, where actually I can get like GPT Developer or GPT Strategist, where they're pre-tuned and trained for the role they're going to play, as opposed to general purpose but clearly oriented towards
Starting point is 03:26:54 a set of tasks. And I just think if you look at this model, it was oriented towards an engineer, software engineering, at least in my experience. So have you been tempted to launch any type of agent like agented coding products? You are, you guys are obviously responsible at Chet PRD responsible for creating documentation. And if you look at the other guests that have joined today, many of them are competing with each other in different ways and trying to own different parts of the stack. You guys have seemingly stayed really, really laser focus and no one else is doing anything like you're doing, at least on the show today. But talk about like picking your lane and kind of like optimizing. Yeah, we're integrated with a lot of those platforms.
Starting point is 03:27:43 So a lot of the kind of prototyping platforms, v0.dev, Lovable, all those, we integrate with. We just released our MCP, so I use ChatPRD pretty consistently inside Cursor through our MCP. So we think of ourselves as the product pair to the AI engineer. Now, what's really interesting about my experience with GPT-5 is the one place it actually does really well is technical specs. And that's a place where ChatPRD has sort of bridged into engineering execution. Often our product managers are generating a PRD or some sort of business document, and then they're actually going the next layer and developing a technical spec. GPT-5 technical specs fed into these agentic coding frameworks or prototyping frameworks output much higher quality assets on that
Starting point is 03:28:29 end. So I do almost think there's going to be this kind of right model for the right use case, especially in our kind of business. And so we think of ourselves as integrating. The one thing I have thought about with GPT-5 is it's the first one where it feels really simple to just go ahead and roll your own agentic coding framework or prototyping framework inside of our application. So never say never. It's something that we get asked for a lot. But we're good friends with almost all the guests on your show today, and we like the role we play in terms of being the product manager pair to all these AI engineers. Yeah, that makes sense. What are you looking for next? What am I looking for next? I mean, in terms of model capability, what I think is really
Starting point is 03:29:15 interesting about OpenAI, and why I'm really committed to the OpenAI ecosystem even though I test and use a variety of models, is I think developer support is a real differentiator. So we spend a lot of time talking about model capabilities. And for application developers, certainly ones that are doing more complex applications like agentic coding, model capabilities really matter. Like, the core IQ of the model matters. But the other thing that matters, as somebody who has built developer tooling products: developer experience matters, the primitives in these APIs matter.
Starting point is 03:29:48 And so what I'm really pushing the OpenAI team to think about is, in addition to the core intelligence of the model, what are the developer tools you need around these models to really make them a platform on which a variety of applications can build? And I do think that OpenAI has disproportionately invested in developer experience, but I'm always looking for: give me better out-of-the-box tooling, give me more control over these models, give me more hosted services, all those things that, as an application developer, are just going to make it easier to deploy these models to production beyond the core intelligence of the models themselves. What was your read on 4.5? Is there a world where, you know, I'm thinking about the product manager versus the engineer, you have your o3 go crunch some really hard reasoning, and then you have 4.5 turn it into, you know, stronger prose, or more, you know, human language? Yeah. So I did a lot of experimentation around 4o,
Starting point is 03:30:48 4-5 and 4-1. 4-5 was my favorite prose writer by far. It was loved from a business writing perspective. I thought the pros was the most natural. It was really slow. Like, untenably slow. And so the compromise we made in our testing
Starting point is 03:31:09 is we ultimately ended up with 4-1 as the fan favorite for business writing when we were balancing off both quality of prose and intelligence as well as performance, which for application developers is a real consideration. So I landed on 4-1. 4-1 is the model that's being tested right now against GPD-5 in chat pyr-D. And one of the things that I have to go do now is figure out how to get chat G-PT or GPT-5
Starting point is 03:31:35 to stop writing. It writes a lot and it only wants to write in bullet points. So I've got to go back into our prompts and figure out how to direct it to be a little bit more business-oriented. bullet point maximalist it's the new m dash i'm telling you you will not be able to stop seeing it it just all it wants to do is write a bullet point and call a tool like it i was using an incursor and it just kept maxing out my tool calls i'm like you do not need to read 50 files to do this so i do think you know application developers are really going to have to think about how they slot this
Starting point is 03:32:10 into their current workflows there's definitely tuning that needs to happen but i'm telling you you're going to see a lot of bullet points when this thing rolls out yeah in 60 seconds where is product management going a lot of people talk about the you know examples of product managers that are starting to ship code themselves ship whole features products but I'm sure those are edge cases to date but but where do you feel like it's going based on on your user base yeah I mean it's going one direction of the other product managers are either going to develop the hard skills to do the design, the go-to-market and the engineering job to some extent because some of these other jobs are definitely going away for product managers or my favorite use case, engineers and
Starting point is 03:32:55 designers are going to get tools like chat parity or these prototyping tools or cursor and they're going to be able to actually do the product management job. And so what I think is we're going to see a new type of role emerge, which is a much more generalist role where people maybe have a specialist capability and they're augmenting that product thinking or they're augmenting that technical thinking with with AI. But I don't think there's going to be product managers as they were, you know, five or ten years ago for much longer. Makes sense. Well, thank you so much for stopping by. Yeah, great time. Thank you. Thanks for having you. We'll talk soon. Bye. Cheers. Up next, we have Brad Lightcap, the chief operating officer of Open AI. Welcome to stream,
Starting point is 03:33:33 Brad. Also, Jordie, your post saying, I'm updating my timelines. You now have four years to Escape the Permanent Underclass has over 4,000 likes. There we go. A thousand likes for every year. Love it. Anyway, Brad, how you doing? Brad, what's going on? Guys, how are you good?
Starting point is 03:33:49 Congratulations on the launch. What are the biggest takeaways for today? From your side, I'd love to know about what it actually means to be the COO of Open AI. Open AI does so many different things, consumer internet company, API business, enterprise. There's all sorts of stuff, building data centers. What is your actual? role? My role is kind of whatever the company needs me to do. I play everything from like, you know, PM when I need to to like, you know, salesperson when I need to. That's kind of the
Starting point is 03:34:22 fun part of the job for me. On this launch in particular, it was really fun. I spent a lot of time last few weeks with customers, with partners, getting a feel for GPD5 relative to what they were previously using. In some cases, those are open AI models. In some cases, they were other models. But, you know, I've been opening I a long time, been an opening I seven years. So I've seen GPD3, I've seen GPD4, and then to be able to see GPD5 and, you know, just I think the joy of people being able to use it in production and seeing how much better it is, that's the best part. Greg told us earlier about the era having to pay people to use the early versions of the product. You guys have come a long way since then. Yeah, we had like three customers
Starting point is 03:35:03 with GPD3 or something like that. And so it was easy to manage, easy to talk to all of them. They actually were tired of us calling them, being like, is it good? Is it getting better? And so now it's, you know, we're fortunate that we've got more than that. But it's cool. I mean, the diversity of use cases, I think the number of things that people are able to use it for, we've got everything from the team at Amgen, you know, big pharma, life sciences, using it for clinical workflows there.
Starting point is 03:35:29 We've got teams at Uber's, you know, building it for customer support, teams at Notion and cursor building it into products that people use every day. So I think that's the power of it, is it just more and more coverage of service area of things people do, you know, with these tools. I'm not sure how much you touch organizational design at OpenAI, but I'd be interested to hear your thoughts on how those companies that you mentioned should be thinking about AI changing their org structure. Is it sort of like a horizontal, cross-functional service layer like, you know, a finance team that touches a lot of different elements of the business? Or should most companies be thinking about standing up a dedicated like AI implementation team? How do we get a chat box on every product that we already shipped? Like how do you think about those tradeoffs if you were talking to a, you know, a friend
Starting point is 03:36:20 at a Fortune 500 company that was thinking about their AI strategy? Yeah. You know, it's an interesting question. I think it was maybe said earlier on the show. The thing we see is just people can do more. And so there's like this much wider latitude that you get if you're an individual person at an individual company where, especially as you get bigger, you know, maybe more bureaucratic organizations that have a lot of different functions, a lot of different levels, you have to
Starting point is 03:36:44 rely on a lot of other people in the org to get stuff done. You've got to rely on your data science team to do data analysis. You've got to rely on your design team to do mockups. You've got to rely on your marketing team to do copy. And I think what we see with AI is it just accelerates people to get to a great V1 of everything. So if you're a high agency individual and you want to get stuff done, you're no longer gated on people that, you know, you otherwise would be. And I think that should enable organizations to move a lot faster. And I think it should enable the people at organizations that really drive them to do a lot more. And we see that consistently. Chat Chabit Enterprise, I think that is consistently what we hear. And we seek those people out when we deploy
Starting point is 03:37:22 ChatGPT Enterprise. We find those, like, you know, two or three people at the organizations who are just the AI superstars and champions, and then try and actually use them as these kind of touch points for the rest of the org to learn from. How are you personally using AI these days? You know, my biggest challenge day to day is context switching. If you look at my calendar from top to bottom, I joke with my wife that I have to show up to work wearing a lab coat, and then I take the lab coat off and put some sunglasses on and a film school jacket, and, you know, then I'm talking to a media company, and then I take that off. So I go through the costume changes. And I think what
Starting point is 03:37:58 I actually mostly use it for is just to help with bridging me from thing to thing, to kind of put me in the mindset of being able to work with customers, help customers. GPT-5 is incredibly good at this kind of structured reasoning of how do we actually take what is this very diverse set of things that models like GPT-5 can do and then apply them in domains that I don't think about every day. And so it gives me this launching-off point to be able to talk with leaders and with customers much more fluently about how we can help their organizations. Within, let's say, a set of companies like the Fortune 500, what does AI adoption look like across the spectrum? Because I'm sure that there's companies that you talk to that are
Starting point is 03:38:40 truly adopting AI in the way that John was mentioning, like trying to become AI native, changing their entire organizational approach. And then there's companies that just want to buy software to say that they can, that they're becoming AI native. So what? What does that spectrum look like in practice? Yeah, it is a wide spectrum. So at the top level, we're seeing just like amazing appetite for wanting to adopt tools for people. And I think that's like the easiest place to start. Typically, that's where we steer organizations if they're starting at zero is just give your people the best tools.
Starting point is 03:39:19 You may have seen, you know, we've grown chat GPT work, which is our enterprise and team product from 3 million seats to 5 million seats now. from June till now. So toward growth there, and we don't see any abatement in demand there, if anything, it's accelerated from last year. And so I think people and organizations are starting to realize that, like, at a minimum, you need to make sure people have the best tools.
Starting point is 03:39:43 What's cool about GPT5 now is it also enables people to use the best tools at every point. And so if you're in an organization, you're not fumbling with the model picker, you're not trying to figure out when to use a reasoning model, you're not trying to figure out kind of the art of prompting to get the perfect thing. all of that stuff is abstracted and it's kind of taken care of for you.
Starting point is 03:40:00 And you can have confidence that your people are actually using the best models at any given point. Beneath that, it gets a little more complicated. So more and more organizations, I think, are starting to grasp how the tools can actually help in the business process. So whether that's in customer support, whether it's in research, whether it's in software engineering and data science, you're seeing these tools more and more adopted in the enterprise. I think there's still a quality gap, though. I think we now are just breaking into what I would call the kind of era of models that have capabilities that are good enough to make a dent in the types of problems businesses care about.
Starting point is 03:40:35 Businesses care a lot about things like reliability, right? They care about accuracy. They care about the resiliency of the model to recover from tool-use errors and to be able to string together these very long kind of multi-tool, multi-step workflows. So GPT-5 is a step up on all those things. And I expect that that will enable us to be able to do more and more things in the business process. Do you think those customers that you just mentioned
Starting point is 03:40:58 will stick with this idea of like GPT4 level workloads will stay on GPT4 and maybe there'll be cost savings, but those workloads will stick around for a very long time. And then you'll develop almost new capabilities, new workflows, new workloads that will be additive, but the enterprises will stick? Or will they want to, is everything so fresh that they'll want to just like rewrite everything
Starting point is 03:41:26 with the latest and greatest. More often than not, I think it's the latter. I think you want to rewrite everything. One of the cool things we did here was we were able to keep the pricing on GPD5 at the level of O3 pricing. So if you're cost sensitive, you don't really have an excuse to not upgrade.
Starting point is 03:41:44 GPD5 is faster than O3 and 4.1. So we've improved on latency for sensitive use cases that are speed sensitive, latency sensitive. And obviously the intelligence bar has gone up. And so, unless you've got really a very kind of narrow and specific workflow where you've got a model like 4-1 that kind of is okay, there's really not a reason I think that people wouldn't upgrade. Yeah, do we need like a three-dimensional Pareto Frontier right now that matches not just cost and capability but also cost capability and latency or something? Is that something that you're seeing a lot of demand from in the enterprise? Yeah, 100%.
Starting point is 03:42:17 We actually measure it that way. So we look at those three vectors and it's always kind of an optimization function along those three, those three. those three axes we think we found that here it was actually in terms of where my work was over the last few weeks it was a lot of i mean it's a qualitative you know and kind of you know really like manual process of collecting feedback because everyone's got a little bit of a different preference and we can only pick kind of one or two points on that curve and so just trying to kind of dial customer feedback namely developer feedback in for us on where that that balance of things are is a big part of our process for picking all those all those points and so we're
Starting point is 03:42:54 We hope that people like it, and it unlocks, you know, the kind of maximal use. That's great. How are you thinking about open source? Who, you know, who's been most excited to get access to it? And, yeah, where do you see it going? Yeah, I mean, it's important to us. You know, I'm glad we've gotten this out. It's been a huge team effort.
Starting point is 03:43:15 I think there was kind of a thing that, like, you know, OpenAI doesn't like open source anymore. It's like, no, we're just really busy with a gazillion other things. So I think hopefully going forward we've got more of a leaned-in vantage point on open source. But it unlocks a huge number of use cases. I mean, if you think about, you know, government use cases, you think about on-prem use cases where you're handling sensitive data in very sensitive environments, you think about where you want to run models on the edge: all these things right now are kind of inaccessible to us as a service provider to customers
Starting point is 03:43:47 because we just don't quite have models that kind of fit at those points. So this for us, we think, is huge TAM expansion, and we're excited to be able to work with enterprises on implementing that model, which is, I think, you know, competitive, hopefully with our O3 class of models. What is the landscape like for companies that are helping to implement OpenAI products at various enterprises? You have the, you know, big consulting groups that will give you an AI strategy. Maybe they'll try to take it a step further, but I imagine there's a cottage industry of, you know,
Starting point is 03:44:20 of firms that have sprung up to try to help organizations unlock the value beyond, hey, let's just get everybody a seat with ChatchipT work? Yeah, I think there will be this new industry that emerges that is kind of separate and apart from kind of the legacy set of SIs and consultants that is really AI fluent. They're very AI native. I think it's very hard to borrow, I think, paradigms from the last 20 years of software building. you know, implementation that are going to kind of map to what we're dealing with here. You're dealing with fundamentally probabilistic systems that are moving and increasing and
Starting point is 03:45:01 improving at a rate, you know, of now kind of collapsing to every few months. And I think the nature of use cases changes quickly, where enterprises are focused on kind of deploying them changes quickly. And so I think it's just hard for kind of the legacy industries to keep up, frankly. We've had a lot of success working with some of this kind of of new breed of SI, so the distills of the world and others that really have been born, I think, in forged in the fire, so to speak, of this kind of new, this new platform. And so we hope there's more of them. We'd be excited to work with anyone that wants to work with us on it. There's more business than we can handle. And so we're always happy to spread the
Starting point is 03:45:39 love. Talk about the $1 chat GPT product for the government. Were you involved in that at all? I was involved in that. We wanted to do something that was meaningful for U.S. government. It's been a real big focus of ours lately. I think our view is the government has got to start to modernize. We've got to make sure that the tools that we use in the private sector are also in the hands of folks serving us in the public sector. And we wanted to make that really simple.
Starting point is 03:46:10 So we made chat ChachapT, you know, basically equivalent to ChachapT enterprise free. It's a dollar per year per agency. Hopefully we can afford that. And we wanted to make that available to anyone that wanted to use it and standardized their GSA. So we're super appreciative of the partnership with them and more I think that we can do on that front. How is that different than just like,
Starting point is 03:46:31 if I'm a government employee, I can just go to Google.com and I have access to that and Google provides benefits. Scott Kapoor was saying that he can't use he can't use it. So, yeah, why? Yeah, just talk to me about how it's different to offer chat GPT as an actual service with a contract that you're that you're you know vending in you're actually like they are a client versus just if you put up a
Starting point is 03:46:55 website every government employee can access the web to some degree or would it be blocked like what why does it need to be like a deal at all as opposed to just like everyone just uses it yeah so part of it is just making sure that government employees can access it so sure in places obviously you know you you can put blockers in place that wouldn't prevent access we hear a lot of stories, by the way, of people like going out in their lunch break to their car in the parking lot and, like, you know, pulling up chat GPT on their phone and, like, throwing a bunch of stuff in there just to like, because they know it'll get them through the day faster. And we've done
Starting point is 03:47:27 work, by the way, with governments, with the state of Pennsylvania and other places where we've seen dramatic increases, you know, things like two to three hours a day saved per employee, given the nature of the work that they do and how helpful chat chat chat can be. And so this lets us have an interface into them as a customer. It lets our team engage with them in a direct way. how they're using the product and can help them use it better and so that's that's important for us is like we got to build on that foundation with them and then presumably it also allows the the government to define like security and privacy in their world as opposed to if you're just like some website out there they
Starting point is 03:48:00 their choices only block or don't block as opposed to actually you know communicate with you this is okay to train on this is not etc etc like keep everything private etc etc yeah I mean we don't we don't train on on enterprise data yeah yeah you're safe there but But yeah, I mean, for us, like, just being able to treat them as a customer, right, to treat them as a user. And you know, you mentioned earlier, like we were talking about kind of like there being these points of success at every organization that, you know, you've got people who are like way more sophisticated in using these tools than others. We want to be able to see those people and amplify them. And the government's no different.
Starting point is 03:48:36 There are people that we've worked with in government who are incredibly sophisticated in how they use AI tools and our goal is to get everyone there. how do you think about the group of users that are active students they've been on summer break you guys have been busy over summer do you're you thinking about uh and you recently launched uh i forget the exact name for the product i think it was like chat tpt learning how are you thinking about that cohort and unlocking new capabilities for them uh this coming year yeah so we launched something called study mode um which uh was in our core chat chp t product and um it was a little bit of an experiment we wanted to see if you change the way the model behaves when it can kind of, when it knows you want to be in a learning mode, if that can actually enhance outcomes for students, where we have all these kind of studies that have been done very like anecdotally about ChachyPT's ability to drive student outcomes and learning outcomes. So here we kind of took a little bit more of an intentional approach of if you actually model, take the model and actually use it in a more Socratic style where it can actually
Starting point is 03:49:39 to kind of quiz you. It can withhold certain information that it wants you to be able to empirically deduce. It wants you to reason about problems and it kind of reasons with you as a partner. So far so good. It's really cool. Learning is kind of the killer use case of chat GPT. And so I think to be able to actually launch something that is in some sense extends that kind of killer use case has been really cool. And the student feedback so far, even on summer break, has been positive. Well, we'll let you get back to your day. What's next on your agenda? Are you putting on the lab coat or the suit and tie and going to Washington? Good question. You know, today I'm mostly with the team and talking to customers and maybe tomorrow I'll get back to the lab coat,
Starting point is 03:50:22 but the meantime. We appreciate you taking the time to talk to us. Yeah, well, thank you so much for taking the time to talk to us. Great to see you. We will talk to you guys. Have a great rest of your day. And the timeline has been in turmoil because President Trump says he will be imposing a 100% tariff on all semiconductors coming into the United States. It started with widespread tariffs on chips and then turned into export controls. This is from the Kobe Yesi letter. Is this a red flag moment?
Starting point is 03:50:49 I don't know why you have the red flag. It felt like it. Ben was getting the flag ready. And Vier potentially affected. But Taiwan says TSM exempt from Trump's 100% chip tariff. Very unclear. The story is obviously still developing. And Dylan Middick says, you're telling me that this level of monitoring the situation is free and it's a picture of you in front of the whiteboard, monitoring the chat GPT versus the timeline today.
Starting point is 03:51:20 We're monitoring. Illinois has banned AI therapy, making it the first state to regulate the use of AI and mental health services. Interesting headline that's coming out. It's interesting because the product can just be used with therapy, like the user can choose to do that. It's not necessarily kind of hard to ban outright. Maybe you can ban it in a clinical setting. Yep. I wonder how they define this. There's probably a loophole if I know anything about how these bans are implemented. But yeah, maybe it's like if you're in the clinical setting, you can't be, you can't use it, but then people will just use it independently. I'm like,
Starting point is 03:51:58 therapist is just on their phone. They're going to be going to the car. They're going to be having, no, they're just going to have it listening to the conversation. Yeah. They're going to be like, what should I do right now? What should I say? What should I say? How does that make you feel? That's what it's going to tell you. Celsius nearly doubles revenue year over year. This is the energy drink. Revenue of $739 million versus $632 million.
Starting point is 03:52:23 North America grew 87%. International grew 27%. But here's the real kicker. Alani Nuu acquisition is the primary driver of growth. Alani New added $300 million in revenue and retail sales are up. So wow, what performance. But yeah, I mean, that was the expectation when they bought Alani News that they would, I guess it's like the first moment they rolled them in, probably. But huge growth for Celsius as they become multi-product, multi-consumer company.
Starting point is 03:52:55 What else is going on in the timeline? We have one last guest. I think you might have to hop on with Taipei. So feel free to jump when you need to. Tyler, anything going on on the timeline we should be monitoring. of course monitoring the situation um i've been so so when max was on he was talking about like how you can like make a little game right so i've been working on uh like a balloons tower defense game okay how's it going so it's going pretty well um i'm i'm making another change but then
Starting point is 03:53:23 maybe I can screen record it. Yeah, yeah, that'd be great. You could share it with the folks too. Yeah. I like this post from Ray Sullivan: these GPT-5 numbers are insane. And it's a chart of GPT version versus number, and once it gets to four, it goes 4.1, 4.2, 4.3, 4.5. So the fifth one is a massive, massive bar. We need an analysis of the charts from today. It seems like there were multiple that were kind of odd or hallucinated or off. It's interesting that multiple of them snuck in. Just in: Sheetz, the popular convenience store chain with 750 locations, is now offering 50% off
Starting point is 03:54:05 purchases paid with Bitcoin and crypto, daily from 3 to 7 p.m. What a wild move by Sheetz. Well, Ben Hylak is in the Restream waiting room. Let's bring him in. Let's bring him in. There he is. Good to see you. Good to see you. We're doing well. I'm just going to say hello. I've got to take off and talk with Taipei. I'm going to let John take it from here. Absolutely. I'll close up the show. You guys have a fantastic conversation. Give me the update. How's the day been for you? What were your expectations? Did this meet them, exceed them? Did it underwhelm you? How are you doing? Well, so I've actually had access for a couple of weeks. So we actually did a video. I'm not sure if you've seen it,
Starting point is 03:54:44 but opening eye, I brought a couple of folks from the Twitter sphere to their office a couple weeks ago to try it. Yeah, yeah. Yep, yep, yep. I think that it pretty much exactly meets my expectation as far as like how how it's been received. And I've tweeted about this as well, but I think that it's really, really good at like one-shotting things. You know, I think it's better than I think other models we've seen. But I think it's actually sort of a distraction in a lot of ways. I think that the things that's a lot better at are, A, a lot harder to describe. And B, I don't think the harnesses for it really exist yet.
Starting point is 03:55:27 I think a lot of harnesses. So the way I've been describing it is that I think I've seen. seen, you know, web search existed in chat, GPT for a really long time, right? Like it was able to like call a tool search the web. Yeah. Obviously like deep research was very different than that, right? Like what we saw was it was like actually like calling,
Starting point is 03:55:48 you know, searching the web. It was like reasoning about those results, changing its kind of course, like course correcting in the middle. So like intermediate reasoning is like what is the term for it. And they really trained it how to search the web well. I think GPT5 does that. for like a whole plethora of tools. The interesting thing is that a lot of products, like,
Starting point is 03:56:10 I think a lot of the agentic products that exist today where it kind of built wrong, like they weren't built, they didn't build their tools the right way. And we've seen this before. Like, if you look at like, you know, the first, you know, kind of infrastructure for agents was Langchain, like way back when, like two or three. Yeah, it was, it was, you know, it was early, but it was wrong, right?
Starting point is 03:56:31 And so like anybody that, you know, But they've iterated since, right? They have like Langraph, it was a better implementation. But the first implementation of Langchain was like, again, early but wrong. And so if you built your product on Langtain, like you had to, you know, significantly change it. I think we'll see a similar thing happen for GP5. You know, it's not just like, you know, change of the string and get, you know, from, you know, four out of five or something and push and now you, you know, yeah.
Starting point is 03:56:55 Yeah, you know that meme about like, oh, like Sam Holman stood on stage and like just like, you know, killed 75 startups. Google just killed 100 startups. Apple just killed Partifle with their new thing or whatever. Did any of that happen today? It feels like it feels like this is like the Langchain needing to change their strategy. That happened a while ago. I haven't identified anything.
Starting point is 03:57:19 It feels like, you know, Scott Wu hopped on and said like, you know, great day to be an application layer company. The foundation models got better. It's more tools in my tool chest. I'm extremely happy. and I'm more confident than ever. And I believe him. I believe that he doesn't see today as fundamentally needing to change his business model.
Starting point is 03:57:40 I think that's true, actually. I think that people have been, you know, there's a lot of people building agents right now. I think a lot of them have not been feasible for some of the reasons that GPT5 starts to address. So I think that what it means is that the entire architecture behind agents will get a lot simpler. like, it feels like a good day for people building applications. Yeah, it's not immediate that there's like some, you know, like company or something I got killed today. Yeah, yeah, yeah. I mean, in general, it feels like, you know, Dorcasch updated his timelines.
Starting point is 03:58:16 There's just been a general idea that like we've maxed out pre-training. We've kind of maxed out post-training. We're now in the let's reap the reward of this. And we've seen it in like the incredible financial. financial performance, the incredible usage numbers, you know, millions and millions, hundreds of millions of people are using Chachapee 30 minutes a day. I love the product. And yet it feels like the what have you done for me lately, meme? It's totally like, okay, yeah, we went from the iPhone 4 to the iPhone 5 today. Yes.
Starting point is 03:58:50 Still really an important technology, great company, but like I want another iPhone 1. Yes, yeah, yeah. I totally get what you're saying. I think that, like, I wrote a piece about this with Swix, but it really actually changed the way I see that path to AGI. I think before using it a lot, I kind of was like, okay, we need like bigger, bigger models. They're going to like get smarter or something. I think like I had this realization, so I was watching it like solve. I had this like really weird like dependency conflict with yarn, like we have like a mono repo. It's like the point of the problem also with this discourse is like,
Starting point is 03:59:29 the sort of problems that gets good at solving are just like not sexy things to talk about. They're not things that you'll understand. I'm like, we have this issue with our, like, the way we structured things. And like, but like, a couple weeks ago, I was watching it like, I had this problem, no other model would solve it. And I watched it sort of like poke around, like, it started running this like Y command in a bunch of different directories in between as like reasoning and like correctly reasoning about like what and why and what it was learning. And it, you know, taking little actions in between. seeing what happened, I think what I realized is that like, you know, if you imagine like
Starting point is 04:00:06 humans without tools, like if we never had any tools, we're never even able to write things down, like, would you be able to tell that we're intelligent? Would we have like, you know, learn to speak, et cetera? Like, I just like don't, you know, even if we could not have ever invented fire, right? It's like, it's like, where would we be right now? There's, that feels like there's a similar, like I actually think a lot of the next year is just going to be, how do you get these models to do things better is like you know i think it's next year uh in your yarn uh example um you said like you were you were having it i assume gpt5 like work on the problem was that wrapped in a coding tool did you just go to chat dot com and give it your github repo like
Starting point is 04:00:49 talk to me like what was the actual user experience from your side yeah so this was in cursor I think the Codex CLI, the new version of the Codex CLI, which I just released today, is also really, really, really good. Okay. I think that you will really only see a significant difference in places where it can sort of explore its environment is the way I would put it. Like, when I was watching it like go bounce around my repo and like, like, I felt almost like I was watching something navigate like a little like video game like Pokemon or something. Like that's kind of what it felt like. It's kind of like, I'm going to go over here. I'm going to see this. Okay, wait a minute. That conflicts with what I just saw over here. Like, where should I go next? Do you know what I mean? Like it felt very novel is like what I would say. Yeah. Yeah, yeah. So yeah, I mean, how are you using it? Where do you see it going? Do you see it like just like a little bump of a tailwind today or what's your read on like how you'll be using GPT5 going forward? I mean, yeah, there's two huge things. So like one thing that like,
Starting point is 04:01:54 really got missed today is that they also released GP5 Nano, which is like an incredibly good model, actually. So like, we're not talking about it, but it's half the cost for input tokens, then Flashlight, or sorry, yeah, I think it's actually half the cost input tokens and Flashlight, and it's a really good model. Like, it's like 4-0 level for a lot of like writing and stuff like that. And so yeah, we'll be using that probably in the short term.
Starting point is 04:02:22 I think it'll be interesting to see how, other providers react. Like, I'm sure Google will cut their prices as a result. But it is the cheapest, like, hosted model, I think, that I don't think anyone's serving at any other model for those prices for that matter. Yeah, that makes sense. What else are you looking for for the rest of the year? Probably no GPT6 on the horizon, but what are you looking out for?
Starting point is 04:02:44 I mean, it seems like Google is expected to respond with Gemini 3 soon. But what else are you tracking in the world of AI these days? It's a great question. I think that, yeah, that's going to be wildly interesting. I think what Google does will tell us a lot. I think that they, you've probably seen it, but you know, they released this like world model yesterday. We're kind of not talking about it anymore.
Starting point is 04:03:06 I mean, like, if those videos, I haven't tried it myself, if those videos are real, like that's, that's one of the most mind-blowing things I've seen in the last, like, you know, decade or something. So, like, if that's real, like, that's extremely interesting. And I think has all the stuff that's going on with world models right now, has like huge implications for like everything like from robotics, just like so many different fields. So super, super interested in that.
Starting point is 04:03:29 And the other thing is that I actually just think that like, again, I'm actually really bullish on cheap T5. I think that the way it was received today is like just about how I expected it. Like, and the reason is like when I say harness again, I'm like, I think that like canvas and chat t is pretty bad is like my, would be my take. Like, you know, it's a tough product to make, but like, yeah, like it does really poorly with like long five. crashes sometimes, like that sort. Like, I think that we don't have the product layer around GP5 doesn't exist yet. So I think we're going to see some really, really interesting products that are built around it. Yeah, it's always hard when you go from like a binary, qualitative, in your face improvement, GPT.
Starting point is 04:04:09 Like chat GPT was like, we passed the touring test. And now the next test is like super intelligence and self-replicates, it's smarter than every single person knows everything. It's like the bar is like we really moved the goalpost. 100%. I think that there was like a lot of, you know, discourse around the model as well, like leading up to it, which I think didn't help, you know. But like the way that I would think about it is like, I think that, you know, depending, there's some percentage of the way through automating software engineering that we've made it. Like, let's say it's like 70% or something,
Starting point is 04:04:40 75%. The tough part is like that last like 25% is, um, A, the hardest, it's like the least um sort of decipherable to like explain to people it's the least like um universal like if i'm just like oh make uh you know one of the examples i did i made a personal website it's like all macOS9 themed in like 20 minutes with jp5 um and so it's really fun right you get it like my mom gets it like i can show it i can share it you get it you know my mom i can't explain any of the like the very specific ways that 2505 like helps in our specific code base our specific problem whatever um So I think that, like, it'll be less, and these launches will probably get less and less sort of interesting from a, like, from a, like, from a, what it does for software engineering as that gap gets closed. Like, I, you know, what's the last five percent of software engineering? Like, I, you know, like, I, it's probably not going to be that interesting to me.
Starting point is 04:05:37 Do you think they'll be on an annual release cadence now? Like, Apple updated all of their iOS, all their operating system nomenclature to be like, we are now on 26. because it's the year it's like a car model like jaguar i don't think you can plan it i don't think you can plan ahead like that's the interesting thing is like i think that you know there's people that say that gpte 4.5 was supposed to be gpte 5 yep um and like i think that it sort of came out and they're like eh like you know i actually love 4.5 i think it's a really fun model but um well it's clear that like improvements come in many places just like with the with the with the iPhone like the latest iPhone and you buy that because it doesn't it's not just like the one with the new screen it has a slightly better camera slightly lighter longer battery like it's like an ensemble of improvements that then
Starting point is 04:06:23 they add up and i think that that feels like what we're getting here today and what we will get in the future is like this little bit like we did a little extra rl over here this tool is now sharper has new capabilities we added multi-modal like you know the video generation got better and this feature got better etc etc i think that like what a model is is still going to change a lot lot and like how we value like so just give an example like 4-0 was sort of this big thing you know where they talked about it being like natively multimodal you know taking in even like video at some point video in video out like audio in audio out and like you know you haven't heard that from gt5 yet like you can't talk to it on advanced voice mode like it doesn't it doesn't
Starting point is 04:07:04 generate image it like you know what I mean it's there's no at least yet native image generation we don't know much about how it works under the hood but like it's still calling 4-0 to generate images, right? So it's like, do you start to see an unbundling of these model capabilities, like seems quite possible? Like, the best model for writing natural language might not, or like writing creative, you know, creatively might not be the same model that writes, you know, really good rust code. Like, it might be different models. So I don't know. We'll see. Yeah, create image here is now tucked next to deep research agent mode, et cetera. But I would hope that you can call that from the actual chat interface. You can call it. You can call it. You can call it.
Starting point is 04:07:44 from the GPT5 chat. It's just using, it's using GPT image one, I think is actually the name of the model. So it's a dedicated image generation model, which I think it may be 40. I don't totally know. Yeah, I just, I don't particularly care. I'm not looking for one model to rule them all. I'm fine if with models calling different tools. It seems fine. Yes. Anyway, fun day. Thanks for hopping on. Of course. Of course. Anytime. Have a good one. Bye. And that's our show today. folks leave us five stars on apple podcast and Spotify and thank you for tuning in to the gpt five gigastream we're on hour four and a half uh we've enjoyed hanging out with you Tyler anything else from the timeline close it out for me timeline still in turmoil if we want we can show the little game
Starting point is 04:08:29 I made okay yeah let's show Tyler's game can we do that is that false you got it Tyler's tower defense okay this was this was one shot okay I didn't wait what you mean one shot one prompt you said you were working on it I was, but then it's like, wasn't as good. Oh, so you went back to a single prompt? Yeah, I made it change, but then I realized like, okay, this is not as good. So I just went back to the first one. Okay.
Starting point is 04:08:52 So, yeah, my question is, I mean, this seems, well, actually, like, it's, like the game engine. I don't know what it's using under the hood. Do you know, did it write, like, WebGL code, or did it write, like, Godot? I think it's just, like, J.S. Okay. And it's just like it's email canvas. That's pretty crazy. Yeah.
Starting point is 04:09:10 You'd think it would use some like 2D engine off the shelf or something. But my question is like what that won't go viral because that is less impressive than just the Tower Defense app that I can get in the app store. For sure. But it's like maybe if I take my, you know how there's like control net images went viral where people would take their corporate logo and then they'd throw that through control net and it would be like the TBPN logo overlaid over like a forest and like the trees would look like the logo. Yeah, or like the QR code. Yeah, so maybe like it's tower defense, but it's my logo or something like that. And like the, the enemies are like moving through something like that. I don't know.
Starting point is 04:09:49 There's just got to be a way to personalize it and make it so every single game is a unique snowflake that you want to go and experience that one. You want to look at it. You want to spend some time in it. I don't know. Yeah. It's hard because it's like it's still, you know, predicting the next token. It's not like image, the four image generation was like kind of a, it wasn't novel, I guess, because there was image generation. It was like such a massive improvement.
Starting point is 04:10:11 This is like there's not any clear massive step change here. It's a little bit better in a lot of ways. Yeah. Oh well, well, we'll have to play with it more. Let us know what you think about GPT5 and we will see you tomorrow. Have a great day. Thank you so much.
Starting point is 04:10:26 Bye.
