a16z Podcast - GPT-5 Breakdown – w/ OpenAI Researchers Isa Fulford & Christina Kim

Episode Date: August 8, 2025

ChatGPT-5 just launched, marking a major milestone for OpenAI and the entire AI ecosystem. Fresh off the live stream, Erik Torenberg was joined in the studio by three people who played key roles in making this model a reality:

Christina Kim, Researcher at OpenAI, who leads the core models team on post-training
Isa Fulford, Researcher at OpenAI, who leads deep research and the ChatGPT agent team on post-training
Sarah Wang, General Partner at a16z, who's led our investment in OpenAI since 2021

They discuss what's actually new in ChatGPT-5: from major leaps in reasoning, coding, and creative writing to meaningful improvements in trustworthiness, behavior, and post-training techniques.

We also discuss:
How GPT-5 was trained, including RL environments and why data quality matters more than ever
The shift toward agentic workflows: what "agents" really are, why async matters, and how it's empowering a new golden age of the "ideas guy"
What GPT-5 means for builders, startups, and the broader AI ecosystem going forward

Whether you're an AI researcher, founder, or curious user, this is the deep-dive conversation you won't want to miss.

Timecodes:
0:00 ChatGPT Origins
1:57 Model Capabilities & Coding Improvements
4:00 Model Behaviors & Sycophancy
6:15 Usage, Pricing & Startup Opportunities
8:03 Broader Impact & AGI Discourse
16:56 Creative Writing & Model Progress
32:37 Training, Data & Reflections
36:21 Company Growth & Culture
41:39 Closing Thoughts & Mission

Resources:
Find Christina on X: https://x.com/christinahkim
Find Isa on X: https://x.com/isafulf
Find Sarah on X: https://x.com/sarahdingwang

Stay Updated:
Let us know what you think: https://ratethispodcast.com/a16z
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://x.com/eriktorenberg

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures

Transcript
Starting point is 00:00:00 I mean, I think it's pretty unique at OpenAI to be able to work on something that's so generally useful. I mean, it's like everything they tell you not to do at a startup, it's just like, your user is anyone. You just kind of take it for granted that you literally have this like wizard in your pocket. We're trying to make the most capable thing. And we're also trying to make it useful to as many people as possible and accessible to as many people as possible. I think we heard this with GPT-5 internally when people were testing and they're like, oh, I thought I asked a really hard question. I feel a little bit insulted that it only thought for like two seconds.
Starting point is 00:00:30 Or like when it doesn't even want to think at all. Today's episode was recorded the day GPT-5 launched. A major milestone not just for OpenAI, but for the entire AI ecosystem. Joining me in the studio, fresh off the launch live stream, were three people who were instrumental in making this model a reality. Christina Kim, researcher at OpenAI, who leads the core models team on post-training. Isa Fulford, researcher at OpenAI, who leads the deep research and ChatGPT agent team on post-training. And a16z general partner Sarah Wang, who's helped lead our investment in OpenAI since 2021.
Starting point is 00:01:02 We talk about what's new in GPT-5, from major leaps in coding and creative writing to meaningful improvements in reasoning, behavior, and trust. We also get into training, RL environments, and why data quality is more important than ever. We also cover agents, what that word actually means, the paradigm shift for async workflows, and the golden age for the idea guys. Let's get into it. As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund. Please note that a16z and its affiliates may also maintain investments in the companies discussed in this podcast.
Starting point is 00:01:45 For more details, including a link to our investments, please see a16z.com forward slash disclosures. So, slow news day. Not much going on for you guys. Thank you for, thank you for coming out. No, obviously, you know, Tina, you were just on the live stream. We're recording the day of. Congratulations. Thank you. For those who are unfamiliar, why don't you introduce what you guys do at OpenAI? Yeah, I'm Christina. I lead the core models team on post-training. I'm Isa. I lead the deep research, like, ChatGPT agent team on post-training. And Tina, you've been here for, or you've both been here for a while now.
Starting point is 00:02:24 Do you want to share a little bit of your history at the company? Yeah, I've been at OpenAI for about four years now. I originally worked on WebGPT, which was the original first LLM using tool use. But it was just one question. So the model learned how to use the browser tool, but you only asked one question and you got an answer back. And then we kind of just had this realization like, oh, normally when you have questions, you have more questions after that.
Starting point is 00:02:46 And so they started building this chatbot. And then that's what eventually became ChatGPT. And what have been the reactions so far? You know, it's only been a few hours, but you did the live stream. Like, what are any reflections, any reactions, what can you tell us the day of? I'm honestly really excited. I think that obviously we have some great eval numbers and numbers are always really exciting.
Starting point is 00:03:07 But I think the thing I'm, like, really excited about with this model is just that it's way more useful across, like, all the things that people actually use chat for. The eval numbers look good, but then also, like, when people use it, I think they'll notice quite a big difference in the utility of it. I mean, these are my personal use cases: I use it for coding and writing all the time, and it's just a huge step change. Yeah. Sarah, you've been involved in helping lead our investment since 2021. Why don't you either share more or tee up how you've been thinking about this as it relates to coding or more broadly? Yeah, well, actually, just on
Starting point is 00:03:42 the topic of coding, it was a huge deal to have Michael Truell come on there and not only showcase the capabilities, but also say this is the best coding model in the market. And so just curious, to the extent that you can share, what did you do differently to get these results? Yeah, I think huge shout out to the team, especially Michelle Pokrass. Like I said, getting the eval numbers right is one thing, but to get the actual usability and, like, how great it is at coding, I think it takes a lot of detail and care. I think the team put a lot of effort into datasets and thinking about
Starting point is 00:04:18 reward models for this. But I think it's just literally caring so much about getting coding working well. And maybe actually just to double click on front-end web development, I mean, we've seen as sort of investors in the ecosystem, that's obviously taken off in the last six to eight months. If you could pinpoint the improvement to that piece specifically, is it more around aesthetics or is there sort of another capability leap forward in terms of what we can do with front-end web development? I think there's going to be a lot more we can do with front-end. I think the way we've made this big leap,
Starting point is 00:04:55 I mean, if you compare it to o3's front-end coding capabilities, this is just totally next level. It feels very different. And I think it kind of just goes back to what I was saying. The team just really cared about, like, nailing front-end. And that means, like, getting the best data, thinking about the aesthetics of the model and all of these things. I think it's just all those details that are really coming together
Starting point is 00:05:14 and making the model, like, great at front-end. Really exciting to see. We loved the demos in the live stream, too. I wanted to ask about model behaviors, because I know you worked on that, too. How did you guys think about that for GPT-5? There are a lot of things that, you know, we've talked about in prior models, like sycophancy and characteristics like that. How did you guys think about that for this one?
Starting point is 00:05:36 What did you guys change or tweak? Yeah, the design of this model has been very, very intentional for model behavior, especially with the sycophancy issues that we had, like, a few months ago with 4o. And we've just spent a lot of time thinking about, like, yeah, what is the ideal behavior? And I think for post-training, what's really, or one of the reasons I really like post-training, is it feels more like an art than maybe even other areas of research. Because you kind of have to make all these trade-offs, right? You have to think about, for my rewards, all these different rewards I could be optimizing during the run. Like, how does that trade off against it, right?
Starting point is 00:06:06 Like, I want the assistant to be, like, super helpful and engaging. But maybe that's a bit too engaging, and getting too engaging gets you the overly effusive, like, assistants that we have. So I think it's really a balancing act of trying to figure out, like, what are the characteristics and what do we want this model to actually feel like? And I think we were really excited with GPT-5 because it was kind of a time to, like, reset and rethink, especially since it's so easy to make something very engaging in an unhealthy way. How can we make this a very healthy, helpful assistant? Say more about how you achieved this kind of reduction in hallucinations, but also deception. What's the relation between
Starting point is 00:06:44 those? I guess for me, I find hallucinations and deception pretty related. So the model, and we kind of saw this a lot with the reasoning models, the reasoning model would understand that it didn't have some ability, but then it still really wanted to respond. I think we really baked it into the models that they want to be helpful, and so they're like, whatever I can say to be helpful in that moment. And that's kind of what we consider for, like, deception versus hallucination. Sometimes the model literally, it seems that they will just say something quickly.
Starting point is 00:07:10 And we kind of see a lot of this reduction with the thinking, when the models are able to think step by step, they can actually, like, pause instead of blurting out an answer, which is kind of what it felt like a lot of the previous models were doing with hallucinations. Over the next few weeks as you're evaluating usage, what are the biggest questions that you're having or that you're sort of anticipating being potentially answered? I'm just really curious to see how all of these things reflect in usage, right? Like, I think coding is way, way better. Like, what does this actually unlock for people?
Starting point is 00:07:38 And I think we're really excited to be offering these models at the price points that we have because I think this actually like unlocks like a lot more use cases that really weren't there before. Maybe like previous competitor models are good at coding, but the price point is not as exciting. And so I think with this number of capabilities that we have in this model and the price point, I'm kind of excited to see like all the new startups and like developers like doing things on top of it. Yeah, we're excited too. But by the way, just on the topic of usage, you obviously have a lot of products with a ton of usage already. And since we have one of the deep research gurus here too,
Starting point is 00:08:12 So how did deep research, ChatGPT Operator, sort of your existing products, inform how you went about approaching GPT-5? One thing that's interesting is with reinforcement learning, training a model to be good at a specific capability is very data efficient. You don't need that many examples to teach it something new. And so the way that we think about it on my team is we're trying to push capabilities and things that are useful to people. So, like, deep research was the first model to do, like, very comprehensive browsing.
Starting point is 00:08:46 But then when o3 came out, it was also good at comprehensive browsing. And that's because we were able to take the datasets that we created for the, you know, frontier agent models and then contribute them back to the frontier reasoning models. So we always want to make sure that the capabilities that we're pushing with agents make it into the flagship models as well. Yeah, that's great. Very self-reinforcing. You mentioned all the startups that you're excited to see come out of this; can you flesh out what you think that could look like, or even just high level, some opportunities you're more excited about because of this?
Starting point is 00:09:19 I mean, people always say vibe coding. I think basically, like, non-technical people have such a powerful tool at their hands. I think really you just need some good idea, and you're not going to be limited by the fact that you don't know how to code something. Like, you saw two of our demos, which were front-end coding, in the beginning. And that just literally took minutes. I think that would have honestly taken me like a week to actually build fully interactive. And so I think we're just going to have a lot more, I would expect maybe a lot more, like, indie type of businesses built around this because of the fact
Starting point is 00:09:46 that, like, you just need to have the idea, write a simple prompt, and then you get the full-fledged app out. It's the world of the ideas guy. Yeah. It's our time. Yeah. Finally. How about in the broader sort of AGI discourse, what does this mean or accelerate, or not? Like, how do we think about the broader AI discourse in terms of what GPT-5 means here, or does it change the conversation in any sort of way? I think with GPT-5,
Starting point is 00:10:13 because it's, like, new, and it's obviously state of the art in all the things we talked about, I think it's showing that, like, we can continue pushing the frontier here. And I feel like there's always people like, oh, we're hitting a wall, things aren't actually improving,
Starting point is 00:10:27 and I think the interesting thing is, I feel like we've almost saturated a lot of these evals, and the real metric of how good our models are getting is, I think, going to be, like, usage, right? Like, what are the new use cases that are being unlocked? And how many more people are using this in their daily lives to help them across multiple tasks? So I feel like that's actually, like, the ultimate metric that I'm excited about in terms of, like, are we getting to AGI?
Starting point is 00:10:52 Yeah. Actually, I think Greg made this comment about how he was comparing the last model to this model, and the benchmark went from 98 to 99. He's like, clearly we've saturated the benchmarks. At least on that front, which I think is instruction following. What benchmarks do you pay attention to? Like, how do you guys think about evals, right? Because given you're already saturating what's out there to a large extent, or doing very well along those dimensions,
Starting point is 00:11:16 what actually gets you to push the frontier? I mean, usage would be kind of post-model-release. But before you get there, what are you guys looking to internally to help guide you? Is it a lot of internal evals that you created? You know, is it early access to startups seeing what they think? Maybe it's a combo of all of the above. But how do you weigh all those things?
Starting point is 00:11:36 Yeah, I mean, I think on our team, we really work backwards from the capabilities we want the models to have. So maybe we want it to be good at creating slide decks or something, or good at editing spreadsheets. And then if evals for those things don't exist, we try to make evals that are representative measures of that capability in a way that's actually going to be useful for users. A lot of those are internal; we'll collect them maybe from human experts or, you know, try and synthetically create examples, or we'll actually look at usage data. And then for us, we'll just try and hill climb on those. Yeah, I think we make this joke a lot internally that if you want to nerd-snipe someone into working on something, you just need to make a good eval, and then people will be so happy to try to hill climb that. Yeah. I like what you said about starting with the capabilities first. How do you prioritize which ones you actually are shooting for?
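As a rough sketch of the eval workflow described above, building an eval for a capability and then hill-climbing on it, something like the toy harness below captures the shape. The task set, the string-match grading, and the model stub are all invented for illustration; real internal evals are far richer than this.

```python
# Toy capability eval: each task pairs a prompt with a grader that returns 1.0 on a pass.
from typing import Callable, List, Tuple

Task = Tuple[str, Callable[[str], float]]

def run_eval(model_fn: Callable[[str], str], tasks: List[Task]) -> float:
    """Score a model on a task set; the aggregate is the number a team hill-climbs."""
    scores = [grader(model_fn(prompt)) for prompt, grader in tasks]
    return sum(scores) / len(scores)

# Hypothetical "spreadsheet editing" style tasks with simple string-match grading.
tasks: List[Task] = [
    ("Sum the column: 2, 3, 5", lambda out: 1.0 if "10" in out else 0.0),
    ("Rename the header 'amt' to 'amount'", lambda out: 1.0 if "amount" in out else 0.0),
]

def dummy_model(prompt: str) -> str:
    # Stand-in for a real model call.
    return "amount = 10"

print(f"eval score: {run_eval(dummy_model, tasks):.2f}")  # 1.00 for this stub
```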
Starting point is 00:12:25 Let's say there's a dimension of maybe going deeper into everyday use versus getting much deeper into the expert use cases. How do you think about that trade-off? What does that trade-off mean, practically speaking? And what do you guys prioritize when? I mean, I think it's pretty unique at OpenAI to be able to work on something that's so generally useful. I mean, it's like everything they tell you not to do at a startup is just like, your user is anyone. Like for deep research, we wanted it
Starting point is 00:12:56 your user is anyone. Like for deep research, we wanted it. to be good across every single domain someone might want to do research in and I think you only have the privilege of doing that if you work at a company that has like huge distribution and like all different kinds of users. So yeah, I mean I think if you choose a capability that's quite general like online research,
Starting point is 00:13:18 But then, yeah, sometimes it is hard to decide to focus on one specific thing, because there are just so many different verticals that you could choose from. But I think in some cases, maybe coding will be really important. So then a specific team will focus on coding. But I think in general, because the capabilities are so general, usually the next model improvement just kind of improves performance on a pretty broad range. Yeah, I think we've kind of seen this with the progression of even the models that we've had in ChatGPT,
Starting point is 00:13:41 like as the model gets smarter, it's better at instruction following, it's better at tool use, and just more things get unlocked as we continue to make smarter models. So I think a good chunk of our team also does focus on just getting general intelligence up, because I think the wins that we get from there are, like, pretty great. Whenever we get a new base model, it's just like, oh wow, suddenly this clicks, it works. And I think we kind of saw that moment with Operator, because we had been working on computer usage. But I think it was hard to finally get the model to actually work without, like, the
Starting point is 00:14:12 multimodal capabilities to really support it. Like, you couldn't have something like Operator when it launched. Yeah, it's the same thing: everyone was talking about agents,
Starting point is 00:14:37 but we didn't really have a way of actually training useful agents. I mean, everyone was talking, there were all these agent demos, but nothing that actually really worked. But I think when we saw the reinforcement learning algorithm working really well on math and physics problems and coding problems, it became pretty clear, just from reading through the chain of thought, okay, this thing's actually, like, thinking and reasoning and backtracking,
Starting point is 00:15:02 and to build something that's able to navigate the real world, it also needs to have that ability. So we realized, okay, this is the thing that's going to actually let us get to useful agents. And so I think it's interesting at OpenAI because you have people pushing, like, you know, foundational algorithms, getting really good at math, getting a gold medal in the IMO. And then on post-training,
Starting point is 00:15:19 we'll often take those methods and try and figure out how to make things that are most useful and usable to all of our users. How much of the improvements are coming from the architecture versus the data versus the scale? Like, how do you sort of think about that? In my opinion, I'm very data-pilled. Like, I think data is very important.
Starting point is 00:15:38 I think deep research was so good because Isa put so much thought and, like, careful attention into the data curation that they did, and thinking about all the different use cases she wanted to have represented. So I'm on team data. Yeah. I mean, I think all of them are very important, but especially now that we have such an efficient way of learning, high-quality data is even more important. Maybe on the data topic, we've been talking a lot about RL environments.
Starting point is 00:16:09 It's a popular space for startups who all want to work with you guys. And I was curious just to get your thoughts on this, since you're data-pilled. But what are the bottlenecks that you see for the next stage? Maybe tying it to RL environments, is there sort of a lack of good, realistic RL environments, and is that sort of the next frontier, which maybe creates an opportunity for these startups? Once you, you know, sort of are able to really work within an environment that takes a long time to build, these are not, you know, sort of built in a day or two, you can actually automate labor to the full extent of, like, compute, you know,
Starting point is 00:16:51 the way that you would need computer use to do. Yeah, I think, in my opinion, there's a lot of value in getting really good tasks, and getting really good tasks requires really good RL environments. I think the more complicated, the more realistic, the more simulated we can make them, the better we'll get. And I think we're kind of seeing that tasks matter more at this point, given the fact that we have such a strong algorithm. So I think creating data and figuring out the best tasks to train on is one of the big questions we have. Yeah, there's some generalization from training on, like, one website to another, but if you want to get really, really good at something, the best thing to do is just train on that exact thing. Right.
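To make the point about tasks and RL environments concrete, here is a minimal sketch of what a text-based environment for a tool-using agent can look like: the agent emits tool calls, the environment returns observations, and reward arrives only when a final answer is graded. The tool names, observations, and grading rule are invented for illustration, not a description of any real training environment.

```python
from typing import Tuple

class ToyBrowsingEnv:
    """Toy episodic RL environment for an agent with browser/terminal-style tools."""

    def __init__(self, goal: str, answer: str, max_steps: int = 20):
        self.goal, self.answer, self.max_steps = goal, answer, max_steps
        self.steps = 0

    def reset(self) -> str:
        self.steps = 0
        return f"GOAL: {self.goal}\nTools: browser.search(q), terminal.run(cmd), submit(text)"

    def step(self, action: str) -> Tuple[str, float, bool]:
        self.steps += 1
        if action.startswith("submit"):
            # Sparse reward: graded only on the final answer, like a verifiable task.
            reward = 1.0 if self.answer.lower() in action.lower() else 0.0
            return "episode finished", reward, True
        if action.startswith("browser.search"):
            obs = "search results: ..."  # a real env would return page text here
        elif action.startswith("terminal.run"):
            obs = "exit code 0"
        else:
            obs = "unknown tool"
        return obs, 0.0, self.steps >= self.max_steps

env = ToyBrowsingEnv(goal="find the capital of France", answer="paris")
print(env.reset())
print(env.step("browser.search('capital of France')"))
print(env.step("submit('Paris')"))  # reward 1.0 on a correct final answer
```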
Starting point is 00:17:29 So, yeah, I think we're definitely just constrained by what things we can represent in a way that we can train on. Like the ChatGPT agent, for example, has such general tools. It has a browser and a terminal. And between those two things, you can basically do most of the tasks that a human does on a computer. So in theory, you can ask it to do anything that you can do on your computer. It's obviously not good enough to do that yet, but with the tools it has, in theory, you can push it really, really far. So now we just have to make it really good at all those things by, you know, training on
Starting point is 00:17:50 way more things. Let's talk about creative writing. Maybe talk about the improvements there, how you think about it. That's one of my favorite improvements in GPT-5. The writing, I honestly find it's very tender and touching, especially for a lot of the creative writing that we want to do. We were thinking through, like, a bunch of different samples for the live stream.
Starting point is 00:18:27 And, like, every time I was like, oh, that's, like, actually, that, like, hits. Like, it's good. And it's, like, spooky. And I'm just like, oh, this feels like someone, like, a person should have written this. But I think it's really cool because you can actually really use it for helping you with things. Like, my example that I did in the live stream was helping me write a eulogy, something that's, like, kind of hard to write. Especially since writing isn't really something a lot of people are good at. Like, I'm personally a very, very bad writer.
Starting point is 00:18:53 That's not true. I think it's... But it makes a better story. Compared to maybe the other things I'm better at. But it's so great to have this tool to help me, like, craft whatever. Like, I use it literally for things as simple as Slack messages, to figure out how to phrase something well. And it'll help give me some iterations.
Starting point is 00:19:14 How to say something to the team. I want to see those prompts. Yeah. We're now all just looking for em dashes. That was good to say. We're like, where do you stand on the em dash? Where do you stand on the em dash discourse? I like em dashes.
Starting point is 00:19:25 I used them normally, and now people think I'm just using it. I know, I know. I know. I know. Me too. Going back to the discourse for a second, Sam said in his interview with Jack, if you had said 10 years ago that we would get models at the level of sort of PhD students, I would think, wow, the world looks so different, and yet we've basically taken it for granted.
Starting point is 00:19:48 Do you think basically the improvements are similar, like, as soon as we get them, we're just going to be like, oh, you know, now this is the standard? Or do you think at some point this is going to be like, oh my God? Like, how do you think about people's ability to sort of acclimate or adjust? Yeah, I mean, it seems like people adjust really quickly, don't you think? Yeah, like whatever happens, basically. I feel like ChatGPT got released and everyone was like, wow, that's so cool.
Starting point is 00:20:11 But then you just kind of take it for granted that you literally have this, like, wizard in your pocket. You can, like, ask it whatever random thought you have, and it just pops out like a good essay, and you're like, oh, okay, cool, that's what's happening. I guess people adapt to things rather quickly, in my opinion, with technology, and it is really easy. And I think because the form factor is so easy, even with, like, new tools, like deep research and ChatGPT agent, it's presented in such an easy way that people already know how to interface with. Like, I think as long as that's true, even with the models getting, like, much smarter than us, I think it's still going to be quite approachable to people.
Starting point is 00:20:46 Do you think the jump from GPT-4 to 5 was bigger, or 3 to 4? Or maybe 3.5 to 4. I mean, at least one thing for me and my usage of it is sometimes I'm wondering if I have hard enough questions to ask it to actually, like, highlight the difference. Right. Because when it gets to a point where it's just answering what you need so well, it's almost harder to tell the difference in some areas.
Starting point is 00:21:09 But with writing, yeah, I've been using it for a few weeks and it's just kind of blown me away in a way that models previously haven't. Maybe I'm biased, maybe it's recency bias, but I think the jump from 4 to 5 is most impressive for me. Because I guess with 3.5, when we first released it, the most common use case for me then was still just coding. But now, even though 4 was better at coding, I feel like the jump between 4 and 5 in terms of, like, breadth of ability to do things,
Starting point is 00:21:34 it's just way different and way more. And it can just handle a lot more complex things than before, with the context length being much longer as well. I think the jump from 4 to 5 to me is much bigger. Is there anything the model categorically can't do? I guess for 5, we don't really take, like, actions in the real world yet. We're going to team it up with agent for that. Yeah, as I said, you could ask the agent to do anything,
Starting point is 00:22:01 but it's not capable enough to do everything you want it to do yet. We take a conservative approach, especially with, like, asking the user for confirmation before doing any kind of action that's irreversible. So, like, sending an email or ordering something, booking something. So I think I can imagine quite a number of tasks where you'd want to take like bulk actions, which you might not be able to do right now because it would ask you every single time.
Starting point is 00:22:26 But I think as people get more comfortable using these things and as they get better and you trust them more, you might allow it to do things for you without checking in with you as much. Maybe just to build on that question, in terms of what it can't do today, but what you would sort of direct future research toward, if you look at coding,
Starting point is 00:22:47 something like end-to-end DevOps, for example, that feels like the logical next set of capabilities. Do you guys think we'll get there in, I don't know what you'll name it, but 5.5 or GPT-6? How far are we from something like that? Yeah, I don't know about the exact thing of DevOps, but I do feel like with the models getting much smarter,
Starting point is 00:23:06 one other thing that came to my mind when you asked the question is, like, longer-running tasks and things like that. GPT-5 is great because, yeah, within, like, a couple minutes, maybe you get a full-fledged app, but then what would it look like if you actually gave it, like, an hour, a day, a week? What can it actually get done? And I think there's going to be a lot of interesting stuff.
Starting point is 00:23:25 We're interested to see what will happen there. Yeah, I think a lot of it is not just about the model capability, but it's actually, like, how you set it up in a way to do things. Like, I'm sure that you could build something that's, like, monitoring, you know, your Humio or, like, Datadog, whatever. With these current models, it's just, like, setting up the harness to make that possible. And same for, like, agentic tasks. I think a lot of things that will be quite useful will be when the agent, like, proactively does something for you, which I don't think is impossible today. It's just not, like, set up that way. But eventually, like, as it proactively does things for you, then we might get feedback on whether that was useful, and we can make it even better at, like, triggering.
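A small sketch of the harness idea mentioned above: the model capability may already exist, and the missing piece is plumbing that watches a signal and proactively wakes the model up. The metric source, threshold, and model call below are hypothetical placeholders, not a real Datadog or Humio integration.

```python
import random
import time

def read_error_rate() -> float:
    """Hypothetical stand-in for polling a monitoring API (Datadog, Humio, etc.)."""
    return random.random()

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call that triages the alert."""
    return f"[model analysis of: {prompt}]"

def watch(threshold: float = 0.9, polls: int = 5, interval_s: float = 0.1) -> None:
    # The harness, not the model, decides when to wake the agent up.
    for _ in range(polls):
        rate = read_error_rate()
        if rate > threshold:
            print(call_model(f"Error rate spiked to {rate:.2f}; summarize likely causes."))
        time.sleep(interval_s)

watch()
```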
Starting point is 00:23:57 But eventually, like, as it proactively does things for you, then we might get feedback on whether that was useful and we can make it, like, even better at, like, triggering. Agents is probably, or agent is probably the most overused word of 2025. That being said, your agent's launch was extremely exciting. What does that word mean to you in the context of capabilities that you'd like to build in the near term or have already built? And what is sort of most important that the agent is able to do on behalf of your users? I guess my very general definition would just be something that does work, useful work for me on my behalf with, I would say, asynchronously. So like you'd kind of leave it and then come back and get either get a result or like a question about what it's doing.
Starting point is 00:24:42 And then in terms of, I guess, roadmap for agents, I mean, longer term, you want it to be able to do anything that, you know, a chief of staff or a system or something like that would do for you. But I think in the more immediate term, there are a lot of new capabilities that we launched in ChatGPT agent that we just want to improve. So one of the main capabilities is deep research, so just being really good at synthesizing information from the internet. But also, I think we can improve capabilities on synthesizing information from all of the services that you use and, like, private data that you have. And then also being better at creating and editing artifacts like docs or slides and spreadsheets, because I think so much of the work that's useful that people do in their jobs is basically just research and making something.
Starting point is 00:25:36 But then also, I personally love all the consumer use cases, like making it better at shopping or planning a trip, and those kinds of things are also really fun. And so that also involves taking an action, which is interesting because it's often kind of the last step of a task, and it's maybe a task that would take less time for a human, and it's actually very hard, like a very hard research question, to get it to do something, or book something, or use a calendar picker. But yeah, once you have the end-to-end flow working really well, it can basically do anything. Yeah, that's incredible. On the shopping piece, I now do not make a single large-ticket purchase without having ChatGPT put all the options in a table for me along the
Starting point is 00:26:22 dimensions I care about. It's incredible. But I want to push on the async piece, because I don't know if you would agree with this, but it felt like a revelation to me, at least at the beginning of the year, that people were willing to wait. Because you kind of think, oh, we want it faster, like the value prop of this tool is that it gives me the answer fast, right? That was sort of very 2024. Clearly, this paradigm has shifted. People are willing to wait for high-quality, high-value answers and work. How do you think about the trade-off between how long something takes, how long you take to get something back to the user, versus the value that you're providing? And, like, what do you think is the ideal frontier for something like
Starting point is 00:27:03 that? Yeah, it's interesting because I built retrieval for ChatGPT and was on the browsing team before this. Tina was also on the browsing team. And we were always making these trade-offs and optimizations for latency. And so we were thinking, how can you best fill the context with information you've retrieved so that the answer is pretty good in a few seconds? And so I think with deep research, I was just very excited to remove latency as a constraint. And since we're going for these tasks that are really hard for humans to do and would take humans many hours to do, I think we felt like, you know, if you asked an analyst to do this and it would take them 10 hours or two days, it seems reasonable that someone would be willing to wait, like, five
Starting point is 00:27:45 minutes in your product. So I think we just kind of made that bet. And luckily, it seems like it's the case. But I do also think that, you know, initially people are like, oh, this is amazing, it's doing all this work that would have taken me so long. And now people are like, okay, but I want it now, I want it in 30 seconds. Right, to the point on the bar changing. Because, yeah, I was going to say, is there any sort of rule of thumb, and I'm sure it's constantly shifting, where as long as you're 10 times faster than it would take the human to do, they're willing to wait for it? Or is that just constantly shifting sand? I think with these launches, people's expectations keep changing. Yeah. I don't think we have, like, a specific number. One thing that's interesting is, I think
Starting point is 00:28:29 sometimes people are just biased to thinking that the longer answer is more, like, thorough or has done more work, which I don't necessarily think is the case. Like deep research, for example, always gives you a really long report. But sometimes, for me, I don't want to read this whole long report. I actually don't like that. So agent will only give you a long report if you ask for it. But I think sometimes, since now you're not always getting a really long report, people are like, wait, I've been waiting, where's my long report? But sometimes it's really hard to find a specific piece of information, and it would have also taken a human a long time, because it's on, like, page 10 of the results where it finds this information.
Starting point is 00:29:04 So I think it's interesting also how you can condition people's expectations with the product, so that when you change it, or, like, with deep research, it always thinks for a really long time, which again, I don't necessarily think is a feature, but I think now people are really used to the amount of time that they wait. Definitely. I think we heard this with GPT-5 internally when people were testing and they're like, oh, I thought I asked a really hard question. I feel a little bit insulted that it only thought for, like, two seconds.
Starting point is 00:29:30 Or like when it doesn't even want to think at all. It's like the Mark Twain line: I didn't have time to write you a short letter, so I wrote you a long one. Yeah, yeah. Do you want to talk about the bottlenecks? Like, why don't we have reliable agents yet? What are the main bottlenecks as you see them? Yeah, I think a big part of it is, the things that we train on, it's often really good at.
Starting point is 00:29:52 And then sometimes with the things outside of that, it can be a bit, sometimes it's good at those things, sometimes it's not good at those things. So I think, yeah, creating more data across like a broader range of things that we want it to be good at. I think also what's interesting with agents is we have this, like,
Starting point is 00:30:10 when something is doing something on your behalf and it has access to your, you know, your private data and the things that you use, it's kind of more scary the different things it could do to achieve its final goal. You know, in theory, if you asked it to buy you something and like make sure that I like it, it could go and buy five things just to make sure that you liked one of them, which you might not necessarily want.
Starting point is 00:30:35 So I think that there's definitely like having oversight during training is also like an interesting area. I think there's just like new things that we have to like develop to, you know, push these agents even further. So yeah, I think that that's part of it. And then also like as every time we have a smarter, like, base model or something like this, it improves every model that's built on top of that. So I think that will also help, especially with, like, multimodal capabilities, as Tina said, with, like, computer use. Because it's, like, just literally looking at screenshots of the, of a web page.
Starting point is 00:31:12 And it's, like, a little interesting, because of the way that humans, like, focus on specific things. It's a lot to expect a model to just take a whole image and be able to take in everything about the image, when, like, when we're looking at something, we'll focus on a specific thing. Yeah, I think there's lots of room for improvement in lots of areas. Sorry, that was kind of a general answer. No, no. Well, actually, maybe that last example gets into something that we were curious about, which is, and this ties back to training data as well, but what sort of, I guess, what specific categories of browsing tasks are challenging for agents today? And, like, I don't know if you have thoughts on how you'd overcome this for sort of
Starting point is 00:31:52 the next iteration of the model. I mean, I think one thing is, like, pre-training is based on what data is available, right? And so when we've done this pre-training, there's not much data out there to begin with of people using computers. Like, computer usage is not really a thing where there's lots of data out there, and this is something we actually have to seek out now that this is a capability that we want. So I think that's actually probably a big one, just for general improvements of, like, computer
Starting point is 00:32:18 usage. Do you think you'll lean more heavily on human data vendors to help collect that? Given it doesn't exist, to your point, like, recorded in the way that maybe is most helpful for training. Like, it is probably the most useful application of the models, you know, at least for knowledge work. How do you overcome that? I mean, I think one cool thing is, for example, for the initial deep research, there's not really any dataset that exists for browsing in the same way that you have a math dataset that already exists. So we had to create all this data. But once you have good browsing models or good computer use models, you can bootstrap them to help you make synthetic data. So I think that's a pretty promising area. Christina, can you explain what mid-training is, and how it sort of, what does it achieve that pre- or post-training doesn't?
Starting point is 00:32:52 So I think with your pre-training runs, these are the big runs. These are the massive ones; that's what we're building all these giant clusters for. You can kind of think of mid-training literally, it's for, like, middle. We do it after pre-training, but before post-training. You can kind of think of it as a way to extend the model's intelligence without having to do a whole new pre-training run. So this is mostly just focused on data, on top of the pre-trained models.
Starting point is 00:33:30 So this is a way for us to do things like updating the knowledge cutoff of these models, right? When you pre-train, you're kind of like, okay, shoot, now we're kind of stuck at this date and we can't ever update it again, and it doesn't quite make sense to put all that data into post-training. And so mid-training is just a smaller pre-training run to help expand the model's intelligence and, like, up-to-dateness. Christina, did you work on WebGPT?
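As a toy sketch of the mid-training idea described above, a much smaller continued-pretraining pass, plain next-token prediction on fresh data, run on top of an already pre-trained model and before post-training: the tiny model, byte-level tokenization, data, and learning rate here are all invented for illustration.

```python
import torch
import torch.nn as nn

vocab_size, dim = 256, 64
# Stand-in for a pre-trained LM; in practice you would load a real checkpoint, e.g.
# model.load_state_dict(torch.load("pretrained_checkpoint.pt"))
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))

fresh_docs = ["news published after the original knowledge cutoff ..."]  # new data only
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # small LR: refresh, don't retrain
loss_fn = nn.CrossEntropyLoss()

for doc in fresh_docs:  # far fewer tokens than a full pre-training run
    tokens = torch.tensor([min(ord(c), vocab_size - 1) for c in doc])
    inputs, targets = tokens[:-1], tokens[1:]  # predict the next byte
    logits = model(inputs)
    loss = loss_fn(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"mid-training step, loss = {loss.item():.3f}")
```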
Starting point is 00:33:52 Yes, I did. Okay, so you're basically like an AI historian. Yes, yes. She also watched some compus. I'm an elder. So can you like reflect back a little bit to, you know, four years ago, five years ago and sort of reflect on like what are the biggest thing? Like if you were to predict the five years out, like what are the inflection points
Starting point is 00:34:11 Honestly, with WebGPT, the main thing we were excited about was just trying to ground these language models. They had so many issues with hallucinations and the model just saying random things, and, like, we didn't really do that in the training sense, so how do we make sure the model is actually up to date, like, most factually up to date? So that's kind of how we thought about it: oh, let's give it a browsing tool, I think that makes sense. And then, yeah, like I said, that kind of went on to, oh, you actually want to keep asking questions, so what would a chatbot look like? But at this point, I think there had been a few chatbots by a few other companies, and I feel like a chatbot is also, like, a very common AI thing to think of. But they were quite unpopular at the time, so we weren't really even sure that this was actually something useful for people to work on, or for people to use, or will people be excited about this? Is this really, like, a research innovation? Are we remaking the Turing test here? But I think it kind of clicked for me that maybe there was actually something interesting happening here. We gave early access to about 50 people, most of those people being, like, people I lived with at the time. And two of my roommates just used it all the time. They just, like, would never stop using it. And they would just have these long conversations, and they would ask it, like, quite technical things, because they're also AI researchers. And so I was just like, oh, this is, like, kind of interesting. Like, I don't know.
Starting point is 00:35:27 And at the time we were kind of thinking, like, okay, we kind of have this chatbot. Should we make this, like, a really specific, like, meeting-bot type of thing? Do we make it a coding helper? But it was interesting to see my two roommates just use it for anything and everything and just, like, literally be chatting with it, like, the whole workday as they were using it. So I was like, oh, this is kind of interesting. But then it was also interesting to see that the majority of the people that I gave access to on that 50-person list, like, didn't really use it that much. But I was like, oh, there's clearly, like, something here, but it's maybe not quite for everyone yet, but there's something here.
Starting point is 00:35:58 When did you realize, like, I'm working at one of the most important companies of this generation? Like, when was the moment where you were like, hey, this is something, I obviously believe it's important, that's why I joined, but that you realized, like, the scale and significance? Honestly, I kind of had this moment before I joined OpenAI. Like, I think with the scaling laws paper and GPT-3, it kind of hit me that, like, if this exponential is true, there's not really much else I want to spend
Starting point is 00:36:22 my life working on. And, like, I want to be part of this, like, story. Like, I think there's going to be so many interesting things unlocked with this, and I think this is probably the next, like, step level in terms of technology. That kind of made me realize, like, oh, I should probably go start reading about deep learning and figure out how I can get into one of these labs. Isa, what was your moment? I think for me, it was also before I started working at OpenAI. I think I first learned about OpenAI in an AI class or some kind of computer science class, and they were saying, like, oh, they trained on the whole internet. It's like, oh, that's so crazy. Like, what is this company? And then I started using GPT-3. I think I was, like, a power user of the OpenAI Playground. And at a certain point, I had early access to these, like, different OpenAI features, like embeddings and things like that. And I just became this, like, big OpenAI fan, which is a little embarrassing, but, you know, it's fine because it got me here. And eventually they're like, okay, you're stalking us. Do you want to interview here?
Starting point is 00:37:21 But yeah, I think it was pretty clear to me just from how much I was using GPT-3, which isn't even comparable to what we have now, like, it just pales in comparison. But from then I was hooked and just trying to figure out a way to work here. Maybe a question more on the company-building front. We all sort of read and reread Calvin French-Owen's piece, his reflections on working at OpenAI. I'm curious, and you don't have to comment on that piece unless you want to, but I would love your reflections on the change that you've seen over the last four years, or, you know, even less than that, given I think that was only covering one year of change. But what are the biggest things that you've seen change at OpenAI? I mean, when I first joined OpenAI, the applied team was 10 engineers or something.
Starting point is 00:38:05 It's just, like, we didn't really have this product arm. We had just launched the API. It was just a completely different world. And I think AI is in most people's minds now after ChatGPT, but pre-ChatGPT, people didn't really know what AI was or really thought about it as much. It's kind of cool working at a place where, like, my parents know what I do now. Like, that's really cool. And I think the company, obviously, is just a lot bigger. But I think with that, we can just take a lot more bets.
Starting point is 00:38:31 I think when I first joined OpenAI, there were obviously way fewer people. Like, it was much, much smaller. It was around, like, 200-ish people, and I think we're close to a few thousand now, for sure. Yeah. When I joined, it was also a few hundred, before ChatGPT, so it's obviously very different in how much, you know, all of your friends have heard of what you work on. But I think culturally, even though the company is much bigger, I still think we've maintained this, it still feels very much like a startup. I think some people who come from a startup are surprised, like, oh, I'm working even harder than when I was working at the startup that I founded.
Starting point is 00:39:06 I think ideas can still come from anywhere. And if you just take initiative and want to make something happen, you can. And it doesn't really matter how senior you are or anything like that. I think we've been able to maintain that culture, which I think is pretty special. Yeah, we definitely reward agency. And I think that's always been true. And I think, especially on the research side, the teams are quite small. Like, when Isa was working on deep research, it was, like, two people still.
Starting point is 00:39:27 So, like, I think we still do that on the research side. Like, most research teams are quite small and nimble for that reason. And earlier you said, you know, we do something at OpenAI which startups never do, which is, you know, try to appeal to every single person with the product. Are there other things that come to mind that OpenAI just does differently than your peers or other startups, or things that we may not appreciate from the outside? I mean, I think it's different for different teams, but my team collaborates so closely with the applied side, like, the engineering team and the product team and design team, in a way that, I think sometimes, like, research can be quite separate from the rest of the company. But for us, it's so integrated. We all sit together. You know, sometimes, like, the researchers will help with implementing something. I'm not sure the engineers are always happy about it,
Starting point is 00:40:22 but we'll try to, like, get into the front-end code, and vice versa, like, they'll help us with things that we're doing for, like, model training runs and things like that. So I think some of the, like, product teams are quite integrated,
Starting point is 00:40:35 and I think for post-training it's a pretty common pattern, which I think just lets you move really quickly. I guess one thing that I think is unique about OpenAI is that you're both very much a consumer
Starting point is 00:40:52 company by revenue, et cetera, and products, but also an enterprise company. How does that work internally? Like, what would you guys consider yourselves? Or is that even just the wrong paradigm to think about? Yeah, I mean, I guess if you tie it to the mission, it's like we're trying to make the most capable thing. And we're also trying to make it useful to as many people as possible and accessible to as many people as possible. So in that framing, I think it makes a lot of sense. The concept of taste has become also very widely used. What does good taste mean within OpenAI? How do you know it when you see it?
Starting point is 00:41:27 And is that something that even in a world where everything, the cost to produce everything just keeps going down and down? Is that the one thing that's not commoditizable? Or is that also shifting given maybe that can go into the training data? No, I think taste is quite important, especially now that like it is, like I said, like our models are getting smarter. It's easier to use them as tools. So I think having the right direction matters a lot now.
Starting point is 00:41:52 and, like, having the right intuitions and the right questions you want to ask. So I would say maybe it matters more now than before. I think also I've been surprised by how often the thing that is the most simple, like, easiest to explain, is the thing that works the best. And so sometimes it sounds, seems very obvious, but, you know, it's quite hard to get the details of something right. But I think usually good researcher taste is just, like, simplifying the problem to the dumbest thing or the most simple thing
Starting point is 00:42:22 you can do. Yeah, I feel like with every research release we do, when people figure out what happened there, they're like, oh, that's so simple. Like, oh, I should have thought of that, obviously. Obviously that would have worked. But I think it's knowing to try that obvious, or, like, at the time not obvious, thing that is obvious in hindsight. Yeah. And then all of the details around it, yeah, the hyperparameters and all these things and, like, the infra, that's obviously very hard. But the actual concept itself is usually pretty straightforward. Hmm. Very cool. Taste is Occam's razor. Yeah. So sort of in closing here, obviously a historic day. Do you want to contextualize sort of
Starting point is 00:42:58 with what this means in the context of the mission and, you know, where you've been to get to now, and where you're going? Yeah, I think with GPT-5, the word that's been in my mind throughout all of this is usable. And I think the thing that we're excited about is getting this out to everyone. We're excited to get our best reasoning models out to free users now, and just getting this, our smartest model yet, to everyone. And I'm just excited to see what people are going to actually use it for. That's a great place to wrap. Tina, Isa, thanks so much for coming on the podcast. Yeah, thank you. Thank you for having us.
Starting point is 00:43:32 Thanks for listening to the a16z podcast. If you enjoyed the episode, let us know by leaving a review at ratethispodcast.com slash a16z. We've got more great conversations coming your way. See you next time. Thank you.
