The a16z Show - GPT-5 and Agents Breakdown – w/ OpenAI Researchers Isa Fulford & Christina Kim

Starting point is 00:00:00 I mean, I think it's pretty unique at OpenAI to be able to work on something that's so generally useful. I mean, it's like everything they tell you not to do at a startup. It's just like your user is anyone. You just kind of take it for granted that you literally have this like wizard in your pocket. We're trying to make the most capable thing. And we're also trying to make it useful to as many people as possible and accessible to as many people as possible. I think we hear this with GPT-5 internally when people were testing and they're like, oh, I thought I asked like a really hard question. I feel like a little bit insulted that it's after like two seconds. Or like when it doesn't even want to think at all.

Starting point is 00:00:32 Today's episode was recorded the day GPT-5 launched, a major milestone not just for OpenAI, but for the entire AI ecosystem. Joining me in the studio, fresh off the launch live stream, were three people instrumental in making this model a reality. Christina Kim, researcher at OpenAI,

Starting point is 00:00:48 who leads the core models team on post-training. Isa Fulford, researcher at OpenAI, who leads the deep research and ChatGPT agent team on post-training, and a16z general partner Sarah Wang, who's helped lead our investment in OpenAI since 2021. We talk about what's new in GPT-5, from major leaps in coding and creative writing to meaningful improvements in reasoning, behavior, and trust. We also get into training, RL environments, and why data quality is more important than ever.

Starting point is 00:01:15 We also cover agents, what that word actually means, the paradigm shift for async workflows, and the golden age for the idea guys. Let's get into it. As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund. Please note that a16z and its affiliates may also maintain investments in the companies discussed in this podcast.

Starting point is 00:01:45 For more details, including a link to our investments, please see a16z.com/disclosures. So slow news day. Not much going on for you guys. Thank you for coming out. No, obviously, you know, Tina, you were just on the live stream. We're recording day of. Congratulations. Thank you.

Starting point is 00:02:07 For those who are unfamiliar, why don't you introduce what you guys do at OpenAI? Yeah, I'm Christina. I lead the core model team on post-training. I'm Isa. I lead the deep research and ChatGPT agent team on post-training. And Tina, you've been here for, or you've both been here for a while now. Tina, why don't you give a little bit of your history at the company? Yeah, I've been at OpenAI for about four years now.

Starting point is 00:02:29 I originally worked on WebGPT, which was the original first LLM using tool use, but it was just one question. So the model learned how to use the browser tool, but you only asked one question and you got an answer back. And then we kind of just had this realization like, oh, normally when you have questions, you have more questions after that. And so we started building this chatbot, and then that eventually became ChatGPT. And what have been the reactions so far? You know, it's only been a few hours, but in your live stream, like what are any reflections, any react, what can,

Starting point is 00:02:59 you tell us the day of? I'm honestly really excited. I think that obviously we have some great eval numbers and numbers are always really exciting, but I think the thing I'm really excited about with this model is just it's way more useful. Like, across like all the things that people actually use chat for. And it's not just like, and it's, I think the eval numbers look good, but then also like when people use it, I think they'll notice quite a big difference in the utility of it.

Starting point is 00:03:24 I mean, this is my personal use cases. I use it for coding and writing all the time. And it's just a huge step change. Yeah. Sarah, you've been involved in helping lead our investment since 2021. Why don't you either share more or tee up how you've been thinking about sort of this as it relates to coding or more broadly? Yeah, well, actually, just on the topic of coding, it was a huge deal to have Michael Truell come on there and not only showcase the capabilities, but also say this is the best coding model in the market. And so just curious to the extent that you can share, what did you do differently to get these results? Yeah, I think huge shout out to the team, especially Michelle Pokrass.

Starting point is 00:04:03 I mean, I think to get these things right, like eval numbers is one thing, like I said, but to get the actual usability and like how great it is at coding, I think it takes a lot of detail and care. I think the team put a lot of effort into datasets and thinking about the reward models for this. But I think it's just literally just caring so much about getting coding working well. And maybe actually just to double click on front-end web development. I mean, we've seen as sort of investors in the ecosystem, that's obviously taken off in the last six to eight months. If you could pinpoint the improvement to that piece specifically,

Starting point is 00:04:41 is it more around aesthetics, or is there sort of another capability leap forward in terms of what we can do with front-end web development? I think there's going to be a lot more we can do with front-end. I think the way we've gotten this big leap, I mean, if you compare it to o3's front-end coding capability, this is just totally next level. It feels very different. And I think it kind of just goes back to what I was saying.

Starting point is 00:05:02 The team just really cared about like nailing front end. And that means like getting the best data, like thinking about the aesthetics of the model and all of these things. I think it's just all those details that are really coming together and making the model like great at front end. Really exciting to see. Loved the demos in the live stream too. I wanted to ask about model behaviors because I know you worked on that too. But how did you guys think about that for GPT-5? And there are a lot of things that, you know, we've talked about in prior models, like sycophancy and characteristics like that. How did you guys think about it for this? What did you guys change or tweak? Yeah, the design of this model has been very,

Starting point is 00:05:40 very intentional for model behavior, especially with the sycophancy issues that we had like a few months ago with 4o. And we've just spent a lot of time thinking about like, yeah, what is the ideal behavior? And I think for post-training, one of the reasons I really like post-training is it feels more like an art than maybe even like other areas of research because you kind of have to make all these tradeoffs, right? Like you have to think about, like for my rewards, like all these different rewards I could be optimizing during the run, like how does that trade off against it, right? Like I want the assistant to be like super like helpful and engaging, but maybe that's like a bit too engaging, and getting too engaging gets to the overly effusive like assistants that we have. So I think it's really like a

Starting point is 00:06:18 balancing act of trying to figure out like what are like the characteristics and like what do we want this model to actually feel like. And I think we were really excited with GPT-5 because it was kind of a time to like reset and rethink, especially since it's so easy to make something, I think, very engaging in an unhealthy way. How can we make this like a very healthy, helpful assistant? Say more about how you achieved such a reduction in hallucinations, but also deception. What's the relation between those? I guess like for me, I find hallucinations and deception pretty related. So the model, and we kind of saw this a lot with the reasoning models. Like the reasoning model would understand that it didn't have some ability, but then it still really wanted to respond.

Starting point is 00:06:56 I think we've really baked it into the models that they want to be helpful. And so they're like, whatever I can say to be helpful in that moment. And that's kind of what we consider for like deception versus hallucination. Sometimes the model like literally, it seems that they will just say something quickly. And we kind of see a lot of this reduction with the thinking. When the models are able to take it step by step, they actually can like pause before blurting out an answer, which is kind of what it feels like with a lot of the previous models for hallucinations. Over the next few weeks

Starting point is 00:07:24 as you're evaluating usage, what are the biggest questions that you're having or that you're sort of anticipating being potentially answered? I'm just really curious to see how all of these things reflect in usage, right?

Starting point is 00:07:35 Like, I think coding is way, way better. Like, what does this actually unlock for people? And I think we're really excited to be offering these models at the price points that we have because I think this actually like unlocks like a lot more use cases that really weren't there before.

Starting point is 00:07:46 Maybe like previous competitor models are good at coding, but the price point is not as exciting. And so I think with this number of capabilities that we have in this model and the price point, I'm kind of excited to see all the new startups and like developers like doing things on top of it. Yeah, we're excited too. But by the way, just on the topic of usage, you obviously have a lot of products with a ton of usage already. And since we have one of the deep research gurus here too, how did deep research, ChatGPT, Operator, sort of your existing products inform how you went about

Starting point is 00:08:20 approaching GPT-5? One thing that's interesting is with reinforcement learning, training a model to be good at a specific capability is very data-efficient. You don't need that many examples to teach it something new. And so the way that we think about it on my team is we're trying to push capabilities and things that are useful to people. So deep research was the first model to do very comprehensive browsing. But then when o3 came out, it was also good at comprehensive browsing.

Starting point is 00:08:50 That's because we're able to take the data sets that we've created for the, you know, frontier agent models and then contribute them back to the frontier reasoning models. We always want to make sure that the capabilities that we're pushing with agents make it into the flagship models as well. Yeah, that's great. Very self-reinforcing. You mentioned all the startups that you're excited to see come. Like, flesh out what you think that could look like, or even just high-level some opportunities you're more excited

Starting point is 00:09:18 about because of this. I mean, people always say vibe coding. I think basically like non-technical people like have such a powerful tool at their hands. I think really you just need some good idea and like you're not going to be limited by the fact that like you don't know how to code something. Like you saw two of our demos, which were front-end coding, in the beginning, and that just literally took minutes. I think that would have honestly taken me like a week to actually build, like fully interactive. And so I think we're just going to have a lot more. I would expect like maybe a lot more like indie type of like

Starting point is 00:09:44 businesses built around this because of the fact that like you just need to have the idea, write a simple prompt, and then you get the full-fledged app. It's the world of the ideas guy. Yeah. It's our time. Yeah. Finally. How about in the broader sort of AGI discourse, like, what does this mean or accelerate

Starting point is 00:10:04 or not, or like, how do we think about sort of the broader AI discourse in terms of what does GPT-5 mean here or change the conversation in any sort of way? I think with GPT-5, it kind of says like a new, it's obviously state of the art in like all the things we talked about. But I think it's showing that like, you know, we can continue pushing the frontier here. And I feel like there's always people like, oh, we're hitting a wall. Like, things aren't actually improving. And I think the interesting thing is I feel like we've almost saturated a lot of these evals and the real like metric of like how good our models are getting is I think going to be like usage, right? Like what are the new use cases that are

Starting point is 00:10:39 being unlocked and like how many more people are using this in their daily lives to help them like across multiple tasks? So I feel like that's actually like the ultimate metric that I'm excited about in terms of like, are we getting to AGI? Yeah. Actually, I think Greg made this comment about how he was comparing the last model to this model and the benchmark went from 98 to 99. It's like, clearly we saturated the benchmarks, at least on that front, which I think is instruction following.

Starting point is 00:11:06 What benchmarks do you pay attention to? Like, how do you guys think about evals, right? Because given you're already saturating what's out there to a large extent or doing very well along those dimensions, what actually gets you to push the frontier? I mean, so usage would be kind of post the model release, but before you get there, what are you guys looking to internally to help guide you? Is it a lot of internal evals that you created? You know, is it early access to startups, seeing what they think? Maybe it's a combo of all the above, but how do you weigh all those things? Yeah, I mean, I think on our team,

Starting point is 00:11:38 we really work backwards from the capabilities we want the models to have. So maybe we want it to be good at creating slide decks or something, or editing spreadsheets. And then if evals for those things don't exist, we try to make evals that are representative measures of that capability in a way that's actually going to be useful for users. And then a lot of those are internal. We'll collect them maybe from human experts or, you know, try and synthetically create examples, or we'll actually look at usage data. And then for us, we'll just try and hill climb on those. Yeah. I think we make this joke a lot internally that like if you want to nerd-snipe someone into working on something, you just need to make a good eval and then people are going to be so

Starting point is 00:12:25 happy to try to hill climb that. Yeah. I like what you said about starting with the capabilities first. How do you prioritize which you actually are shooting for? Let's say there's a dimension of maybe deeper into everyday use versus getting much deeper into the expert use cases. How do you think about that trade-off, what does that trade-off mean, practically speaking? And what do you guys prioritize when? I mean, I think it's pretty unique at OpenAI to be able to work on something that's so generally useful. I mean, it's like everything they tell you not to do at a startup. It's just like your user is anyone. Like for deep research, we wanted it to be good across like every single domain someone might want to do research in. And I think you only have the like

Starting point is 00:13:05 privilege of doing that if you work at a company that has like huge distribution and like all different kinds of users. So yeah, I mean, I think if you choose a capability that's quite general, like online research, you just have to make sure that you represent like a distribution of tasks across loads of different domains if you want to get good at all of them. But then, yeah, sometimes it's hard to decide to focus on one specific thing because there are just so many different verticals that you could choose from. But I think in some cases maybe like coding will be really important. So then, you know, a specific team will focus on coding. But I think in general, because the capabilities are so general, usually like the

Starting point is 00:13:47 next model improvement just kind of improves performance on a pretty broad range. Yeah, I think we've kind of seen this like with the progression of even the models that we have in ChatGPT. Like as a model gets smarter, it's better at instruction following. It's better at tool use. And like some more things get unlocked as we just continue to make smarter models. So I think, like, a good chunk of our team also, like, does focus on just getting general intelligence up, because I think the wins that we get from there are, like Isa's saying, like, pretty great. Whenever we get a new base model, it's just, like, oh, wow, suddenly this clicks, it works. And I think we kind of saw that moment with, like, Operator, because we had been working on computer usage,

Starting point is 00:14:25 but I think it was hard to finally get the model to actually, without, like, the multimodal capabilities to really support it, like you couldn't have something like Operator when it launched. Yeah, it's the same thing with everyone was talking about agents, but we didn't really have a way of actually training useful agents. I mean, I think everyone was talking, there were all these agent demos, but nothing that actually really works. But I think when we saw the reinforcement learning algorithm working really well on math and physics problems and coding problems,

Starting point is 00:14:52 it became pretty clear, like, just from reading through the chain of thought, okay, this thing's actually like thinking and reasoning and backtracking, and to build something that's able to navigate the real world, it also needs to have that ability. So we realized, okay, like this is a thing that's going to actually let us get to useful agents. And so I think it's interesting at OpenAI because you have people pushing, like, you know, foundational algorithms, getting really good at math, getting a gold medal

Starting point is 00:15:17 at the IMO. And then on post-training, we'll often take like those methods and try and figure out how to make things that are most useful and like usable to like all of our users. How much of the improvements are coming from the architecture versus the data versus the scale? Like where, how do you sort of think about that? In my opinion, I'm very data-pilled. Like, I think data is very important.

Starting point is 00:15:38 I think, like, I think deep research was so good because Isa put so much thought and, like, careful attention to, like, the data curation that they did and thinking about all the different use cases she wanted to have represented. So I'm on team data. Yeah, I mean, I think all are very important, but especially now that we have such an efficient way of learning, high-quality data is even more important. Maybe on the data topic, we've been talking a lot about RL environments. It's a popular space for startups who all want to work with you guys. And I was curious just to get your thoughts on this since you're data-pilled. But what are the bottlenecks that you see for the next stage? Is that, I mean, maybe tying it to RL environments, is there sort of a lack of good, realistic

Starting point is 00:16:29 RL environments, and that's sort of the next frontier, which maybe creates an opportunity for these startups, that once you, you know, sort of are able to really work within an environment that takes a long time to build, these are not, you know, sort of built in a day or two, that you can actually automate labor to the full extent of, like, the way that you would need computer use to do. Yeah, I think in my opinion, I do think there's a lot of value in getting really good tasks, and getting really good tasks requires really good RL environments. I think the more complicated, the more realistic, the more simulated we can make them, I think the better we'll get. And I think we're kind of seeing that like tasks matter just

Starting point is 00:17:11 like tasks matter more at this point, given the fact that we have such a strong algorithm. So I think the data, creating data and figuring out like the best tasks to train on is like one of the big questions we have. Yeah, like there's some generalization from training on like one website to another. But if you want to get really, really good at something, the best thing to do is just like train on that exact thing. Right. So yeah, I think we're definitely just constrained by the things that we can represent in a way that we can train on.

Starting point is 00:17:39 Like the ChatGPT agent, for example, has such a general tool. It has a browser and a terminal. And between those two things, you can basically do most of the tasks that a human does on a computer. So in theory, you can ask it to do anything that you can do on your computer. It's obviously not good enough to do that yet. But with the tools it has in theory, you can push it really, really far.

Starting point is 00:18:00 So now we just have to make it really good at all those things by, you know, training on way more things. Yeah. Let's talk about creative writing. Maybe talk about the improvements there, how you think about it. That's one of my favorite improvements in GPT-5. The writing, I honestly find it's very tender and touching, especially for a lot of the creative writing that we want to do.

Starting point is 00:18:24 We were thinking through like a bunch of different samples for the live stream. And like every time I was like, oh, that's like actually, like, that hits. Like it's good and it's like spooky, and I'm just like, oh, this feels like someone should have written this. But I think it's really cool because you can actually really use it for, like, helping you with things. Like my example I did in the live stream was helping me write a eulogy, something that's like kind of hard to write, especially since writing isn't really something a lot of people are good at. Like I'm personally a very, very bad writer. That's not true. I think it makes a better story.

Starting point is 00:18:58 Yeah. I think it's true. Compared to maybe the other things I'm better at. But it's so great to have this tool to help me, like, craft whatever. Like, I use it literally for as simple things as, like, Slack messages to figure out, like, how to phrase this well. And it'll give me some iteration. How to say something to the team. I want to see those prompts.

Starting point is 00:19:17 Yeah. We're now all just looking for em dashes. I was going to say. Right? We're like, where do you stand on the em dash discourse? I like the em dash. I do that normally now. People think I'm just unique.

Starting point is 00:19:27 I know, I know. I know, me too. Going back to the discourse for a second, Sam said in his interview with Jack, he said, if you had said 10 years ago that we would get models at the level of sort of PhD students, I would think, wow, the world looks so different, and yet we've basically taken it for granted. Do you think basically the improvements are similar, like as soon as we get them, we're just going to be like, oh, you know, now this is the standard, or do you think at some point there's going to be like, oh my god, this is like, how do you think about sort of people's ability to

Starting point is 00:20:00 sort of acclimate or adjust? Yeah, I mean, it seems like people adjust really quickly, don't you think? Yeah, like whatever happens. I feel like, it's actually. And everyone was like, wow, that's so cool. But then you just kind of take it for granted that you literally have this like wizard in your pocket. You can like ask it whatever, whatever random thought you have.

Starting point is 00:20:17 And it just pops out like a good essay and you're like, oh, okay, cool. That's what's happening. I guess people adapt to things rather quickly, in my opinion. With technology, it is really easy. I think because the form factor is so easy, even with new tools, like deep research and ChatGPT agent, it's presented in such an easy way

Starting point is 00:20:34 that people already know how to interface with. Like I think as long as that's true, even with the models getting much smarter than us, like, I think it's still going to be quite approachable to people. Do you think the jump from GPT-4 to 5 was bigger or 3 to 4, or maybe 3 and a half to 4? I mean, at least one thing for me and my usage of it is sometimes I'm wondering if I have hard enough questions to ask it to actually like highlight the difference.

Starting point is 00:21:01 Right. Because when it gets to a point where it's just answering what you need so well, it's like almost harder to tell the difference in some areas. But with writing, yeah, I've been using it for a few weeks and it's just kind of blown me away in a way that models previously haven't. Maybe I'm biased, recency bias, but I think the jump from four to five is most impressive for me because I guess with 3.5, when we'd just first released it, the most common use case for me then also was still just for coding. But now, like, even though four was better at coding, I feel like the jump between four and five in terms of like breadth of ability to do things is just way different and way more. And it can just handle a lot more complex things than like before, like with the context

Starting point is 00:21:41 length being much longer as well. Like, yeah. I think the jump from four to five to me is like much bigger. Is there anything the model categorically can't do? I guess for five, we don't really take like actions in the real world yet. We're going to team up with agent for that. Yes.

Starting point is 00:21:57 Yeah, as I said, you could ask the agent to do anything, but it's not capable enough to do everything you want it to do yet. We take a conservative approach, especially with, like, asking the user for confirmation before doing any kind of action that's irreversible. So like sending an email or ordering something, booking something. So I think I can imagine quite a number of tasks where you'd want to take like bulk action. which you might not be able to do right now because it would ask you every single time.

Starting point is 00:22:26 But I think as people get more comfortable using these things and as they get better and you trust them more, you might allow it to do things for you without checking in with you as much. Maybe just to build on that question, in terms of what it can't do today, but what you would sort of direct future research toward, if you look at coding,

Starting point is 00:22:47 something like end-to-end DevOps, for example, that feels like the logical next set of capabilities. Do you guys think we'll get there? And I don't know what you'll name it, but 5.5 or GPT-6, how far are we from something like that? Yeah, I don't know about the exact thing of DevOps, but I do feel like with the models getting much smarter, one other thing that came to my mind when you asked me the question

Starting point is 00:23:09 is like longer running tasks and things like that. I think, like, GPT-5 is great because like, yeah, within like a couple minutes, maybe you get a full-fledged app, but then what would it look like if you actually gave it like an hour, like a day, a week? What can actually get done? And I think there's going to be a lot of interesting stuff.

Starting point is 00:23:25 We're interested to see what will happen there. Yeah, I think a lot of it is not just about the model capability, but it's actually like how you set it up in a way to do things. Like I'm sure that you could build something that's like monitoring, you know, your Humio or like Datadog, whatever. Like with these current models, it's just like setting up the harness like to make that possible. And same for like agentic tasks. I think a lot of things that will be quite useful will be when the

Starting point is 00:23:51 agent proactively does something for you, which I don't think is impossible today. It's just not like set up that way, but eventually like as it proactively does things for you, then we might get feedback on whether that was useful and we can make it like even better at like triggering. Agents, or agent, is probably the most overused word of 2025. That being said, your agent launch is extremely exciting. What does that word mean to you in the context of capabilities that you'd like to build in the near term or have already built?

Starting point is 00:24:20 and what is sort of most important that the agent is able to do on behalf of your users? I guess my very general definition would just be something that does work, useful work for me on my behalf, I would say asynchronously, so you'd kind of leave it and then come back and either get a result or like a question about what it's doing. And then in terms of, I guess, roadmap for agents, I mean, longer term, you want it to be able to do anything that, you know, a chief of staff or assistant or something like that would do for you. But I think in the more immediate term, there are a lot of new capabilities that we launched in ChatGPT agent that we just want to improve.

Starting point is 00:25:05 So one of the main capabilities is deep research, so just being really good at synthesizing information from the internet. But also I think we can improve capabilities on synthesizing information from like all of the services that you use and private data that you have. And then also being better at creating and editing artifacts, like docs or slides and spreadsheets, because I think so much of like the work that's useful that people do in their jobs is basically just research and making something. But then also, I personally love all the consumer use cases, like making it better at like shopping or planning a trip and those kinds of things are like also really fun.

Starting point is 00:25:45 And so that also involves like taking an action, which is interesting because it's kind of the last step often of a task. And it's maybe a task that would take less time for a human. And it's like a very hard, like a very hard research question to like get it to do something or like book something or use a calendar picker. But yeah, once you have the end-to-end flow working really well, it can basically do anything. Yeah, that's incredible. On the shopping piece, I now do not make a single large-ticket purchase

Starting point is 00:26:18 without having ChatGPT put all the options in a table for me along the dimensions I care about. It's incredible. But I want to push on the async piece because I don't know if you would agree with this, but it felt like a revelation to me at least at the beginning of the year that people were willing to wait. So you kind of think about, oh, we want it faster. Like the value prop of this tool is that it gives me the answer fast, right? That was sort of very 2024.

Starting point is 00:26:44 Clearly, this paradigm has shifted. People are willing to wait for high-quality, high-value answers in work. How do you think about the trade-off between how long something takes, how long you take to get something back to the user, versus the value that you're providing? And, like, what do you think is the ideal frontier for something like that? Yeah, it's interesting because I built the retrieval on ChatGPT and was on the browsing team before this.

Starting point is 00:27:10 Tina was also on the browsing team. And we were always making these trade-offs and optimizations for latency. And so we're thinking, how can you best fill the context with information you've retrieved so that the answer is pretty good in a few seconds. And so I think with deep research, I was just very excited to remove latency as a constraint. And since we're going for these tasks that are really hard for humans to do and would take humans many hours to do, I think we felt like, you know, if you asked an analyst to do this and it would take them 10 hours or two days, seems reasonable that someone would be willing to wait like five minutes in your product.

Starting point is 00:27:47 So I think that was the, we just kind of made that bet. And luckily it seems like it's the case. But I do also think that, you know, initially people are like, oh, this is amazing. It's doing all this work. That would have taken me so long. And now people are like, okay, but I want it now. I want it in 30 seconds. Right.

Starting point is 00:28:04 To the point on the bar changing. Because, yeah, I was going to say, is there any sort of rule of thumb? And I'm sure it's constantly shifting, where as long as you're 10 times faster than it would take the human to do, they're willing to wait for it? Or is that just constantly shifting sand? I think with these launches, people's expectations keep changing. Yeah, I do think we have like a specific number. One thing that's interesting is I think sometimes people are just biased to thinking that the longer answer is more like thorough or has done more work for it, which I don't necessarily think is the case. Like deep research, for example, always gives you a really long report. But sometimes for me,

Starting point is 00:28:42 I don't want to read this whole long report. I actually don't like that. So agent, like, it will only give you a long report if you ask for it. But I think sometimes people, since now there, you're still always getting a really long report, they're like, wait, I've been waiting. Like, where's my long report? But sometimes it's like really hard to find a specific piece of information and would have also taken a human a long time because it's on like page 10 of the results where it finds this information. So I think it's interesting also how you can condition people's expectations with a product so that when you change, or like with deep research it always thinks for a really long time, which again, I don't necessarily think is a feature,

Starting point is 00:29:16 but I think now people are like really used to the amount of time that they wait. Definitely. I think we hear this with GPT-5, but internally when people were testing and they're like, oh, I thought I asked like a really hard question. I feel like a little bit insulted that it thought for like two seconds or like when it doesn't even want to think at all. It's like the Mark Twain line. I didn't have time to write you a short letter.

Starting point is 00:29:36 I wrote you a long one. Yeah, yeah. When you talk about the bottlenecks, like, why don't we have reliable agents yet? What are the main bottlenecks as you see them? Yeah, I think a big part of it is the things that we train on, it's often really good at. And then sometimes with the things outside of that, it can be a bit, sometimes it's good at those things, sometimes it's not good at those things.

Starting point is 00:30:00 So I think, yeah, creating more data across like a broader range of things that we want it to be good at. I think also what's interesting with agents is, when something is doing something on your behalf and it has access to your, you know, your private data and the things that you use, it's kind of more scary the different things it could do to achieve its final goal. You know, in theory, if you asked it to buy you something and, like, make sure that you like it, it could go and buy five things just to make sure that you liked one of them. Right. Which you might not necessarily want. So I think having oversight during training is also like an interesting area.

Starting point is 00:30:41 I think there's just like new things that we have to like develop to, you know, push these agents even further. So yeah, I think that that's part of it. And then also like every time we get to have a smarter like base model or something like this, it improves every model that's built on top of that. So I think that will also help, especially with like multimodal capabilities, as Tina said, with computer use. Because it's like just literally looking at screenshots of the web page.

Starting point is 00:31:12 And it's like, it's a little interesting because the way that humans like focus on specific things, it's like, it's a lot to expect a model to just like take a whole image and be able to like know everything about the image when like when we're looking at something we'll like focus on a specific thing. Yeah, I think that there's just lots of room for improvement in lots of areas. Sorry, that was kind of a general answer. No, no, well, actually, I was going to, maybe that last example gets into something that we were curious about, which is, and this ties back to training data as well, but what sort of, I guess, what specific categories of browsing tasks are challenging for agents today? And, like, I don't know if you have thoughts on how you'd overcome this for sort of the next iteration of the model.

Starting point is 00:31:55 I mean, I think one thing is, like, so pre-training, it's based on, like, what data is available, right? And so I think when we've done these pre-training runs, there's not much data out there to begin with, with people using computers. Like computer usage is not really a thing that there's lots of data out there for. And this is something we actually have to seek out now that this is a capability that we want.

Starting point is 00:32:13 So I think that's actually probably a big one, just for general improvements of computer usage. Do you think you'll lean more heavily on human data vendors to help collect that? Or given it doesn't exist, to your point, like recorded in the way that maybe it's most helpful for training? Like how do we, but it is probably the most useful application of the models to, you know, at least knowledge work.

Starting point is 00:32:36 Like, how do you overcome that? I mean, I think one cool thing is for, for example, for initial deep research, there's not really any data sets that exist for browsing in the same way that you have a math data set that already exists. So we have to create all this data. But once you have good browsing models or good computer use models, you can like bootstrap them to help you make synthetic data. So I think that's a pretty promising area.

Starting point is 00:32:58 Christina, can you explain what mid-training is, and how it sort of, what does it achieve that pre or post doesn't? So I think with your pre-training runs, these are like your, these are the big runs. These are the massive ones, like that's what we're building all these giant clusters for. So you can kind of think of mid-training as literally, it's like the middle. Like we do it after pre-training, but before post-training. You can kind of think of it as a way to like extend the model's intelligence without having to do a whole new pre-training run. So this is mostly just focused on data and it's off of the pre-training models.

Starting point is 00:33:30 So this is a way for us to do things like updating the knowledge cutoff of these models, right? So when you pre-train, you're kind of like, okay, shoot, now we're kind of stuck in this date and we can't ever update it again. And it doesn't quite make sense to put all that data into post-training. And so mid-training is just a smaller pre-training run to help expand like the model's intelligence and like up-to-dateness. Christina, did you work on WebGPT? Yes, I did. Okay, so you're basically like an AI historian. Yes, yes.

Starting point is 00:33:56 She also watched some computer use. I'm an elder. So can you like reflect back a little bit to you know four years ago five years ago and sort of reflect on like what are the biggest things, like if you were to predict five years out, like what are the inflection points or biggest things that would have surprised you? Honestly with WebGPT the main thing we were just excited about was like trying to ground these language models like it's they had so many issues with like hallucinations and the model just saying random things and like the fact of we didn't really do in the training sense so like the fact of like how do we make sure the model is actually up to date like most factually up to date. So then that's kind of how we thought about like, oh, let's give it a browsing tool. I think that makes sense. And then, yeah, like I said, that kind of went on from like, oh, I actually want to keep asking questions. So what would a chatbot look like?

Starting point is 00:34:40 But at this point, I think there had been a few chatbots by a few other companies. And I feel like a chatbot is also like a very common AI thing to think of. But they were quite unpopular at the time. So we weren't really even sure that like this is actually something useful for people to work on or like people to use or will people be excited about this. Is this really like a research innovation? Like, are we remaking the Turing test here? But I think it kind of clicked for me that like maybe there was actually something interesting happening here.

Starting point is 00:35:07 We gave early access to about 50 people, most of those people being like people I lived with at the time. And two of my roommates just used it all the time. They just like would never stop using it. And they would just have these long conversations. And they would ask it like quite technical things because they're also AI researchers. And so I was just like, oh, this is like kind of interesting. Like I don't know.

Starting point is 00:35:27 And at the time we're kind of thinking like, okay, we kind of have this chat. Do we make this like a really specific like meeting bot type of thing? Do we like make it a coding helper? But it was interesting to see my two roommates just use it like for anything and everything and just like literally be chatting with it like the whole workday as they were using it. So I was like, oh, this is kind of interesting. But then it was also interesting to see like the majority of the people that I gave access to on that 50 person list like didn't really use it that much.

Starting point is 00:35:51 But I was like, oh, it's like there's clearly like something here, but it's like not quite maybe for everyone yet, but there's something here. When did you realize, like, I'm working at one of the most important companies of this generation? Like, when was the moment where you were like, hey, this is something that I obviously believe is important. That's why I joined, but that you realized like the scale and significance. Honestly, I kind of had this moment before I joined OpenAI. Like, I think with the scaling laws paper with GPT-3, it just like kind of hit me that, like, if this exponential is true, like, there's not really much else I want to spend my life working on.

Starting point is 00:36:23 And like, I want to be part of this, like, story. Like, I think there's going to be so many interesting things unlocked with this. And I think this is probably the next step level in terms of technology that it kind of made me realize, oh, I should probably go start reading about deep learning and figure out how I can get into one of these labs. Isa, what was your moment?

Starting point is 00:36:41 I think for me, it was also before I started working at OpenAI. I think I first learned about OpenAI in an AI class or some kind of computer science class and they were saying like, oh, they trained on the whole internet. It's like, oh, that's so crazy. Like what is this company? And then started using GPT-3, like, I think I was a power user of the OpenAI Playground. And at a certain point, like, had early access to these, like, different OpenAI features, like embeddings and things like that. And just became this, like, big OpenAI fan, which is like a little embarrassing, but, you know, it's fine because it got me here.

Starting point is 00:37:16 And eventually they're like, okay, like, you're stalking us. Do you want to interview here? But yeah, I think it was like pretty clear to me, by just how much I was using GPT-3, which, compared to what we have now, like, just pales in comparison, but I was like, from then I was hooked and just trying to figure out a way to work here. Maybe a question more on the company building front. We all sort of read and reread Calvin French-Owen's piece, just his reflections on working at OpenAI. Curious, and you don't have to comment on that piece unless you want to, but would love your reflections on the change that you've seen over the last four years or, you know, or even less than that, given I think that only covered one year of change. But what are the biggest things that you've seen change at OpenAI? I mean, when I first joined OpenAI, the applied team was 10 engineers or something.

Starting point is 00:38:05 It's just like we didn't really have this like product arm. We had just launched the API. It was just a completely different world. And I think AI is in most people's mind now after ChatGPT, but I think pre-ChatGPT, like people didn't really know what AI was or really like thought about it as much. It's kind of cool working at a place that like my parents know what I do now. And like, that's really cool. And I think the company obviously is just a lot bigger. But I think with that, we can just take a lot more bets. I think when I first joined OpenAI, there were obviously way less people.

Starting point is 00:38:36 Like it was much, much smaller. It was around like 200-ish people. And I think we're close to like a few thousand for sure. Yeah, when I joined it was also a few hundred before ChatGPT. So it's obviously, yeah, very different in how much, you know, all of your friends have heard of, you know, what you work on. But I think culturally, obviously the company is much bigger. I still think we've maintained this. It still feels very much like a startup.

Starting point is 00:39:01 I think some people who come from a startup are surprised at like, oh, I'm working even harder than when I was working on the startup that I founded. I think ideas can still come from anywhere. And if you just take initiative and want to make something happen, you can. And this doesn't really matter, like, how senior you are or anything like that. I think we've been able to maintain that culture,

Starting point is 00:39:16 which I think is pretty special. Yeah, we definitely reward agency. And I think that's like always been true. And I think, especially on the research side, the teams are quite small. Like when Isa was working on deep research, it was like two people still. So like I think we still do that on the research side. Like most research teams are quite small and nimble for that reason.

Starting point is 00:39:35 And earlier you said, you know, we do something at OpenAI, which startups never do, which is, you know, try to appeal to every single person with the product. Are there other things that come to mind that OpenAI just does differently than your peers or other startups or things that we may not appreciate being on the outside? I mean, I think it's different for different teams, but my team collaborates so closely with the applied, like, the engineering team and the product team and design team in a way that I think sometimes, like research can be quite separate from like the rest of the company. But for us, it's like so

Starting point is 00:40:12 integrated. We all sit together. You know, sometimes like the researchers will help with like implementing something. I'm not sure that engineers are always happy about it, but we'll try. They'll get out of the front end code. But and vice versa, like, they'll help us with things that we're doing for, like, model training runs and things like that. So I think some of the, like, product teams are quite integrated. I think for post-training, it's a pretty common pattern, which I think just lets you move really quickly. I guess one thing that I think is unique about OpenAI is that you're both very much a consumer company by revenue, etc.,

Starting point is 00:40:54 products, but also an enterprise company. How does that internally, like, what would you guys consider yourself? Or is that even just the wrong paradigm to think about? Yeah, I mean, I guess if you tie it to the mission, it's like, we're trying to make the most capable thing. And we're also trying to make it useful to as many people as possible and accessible to as many people as possible. So, like, in that framing, I think it makes a lot of sense.

Starting point is 00:41:17 The concept of taste has become also very widely used. What does good taste mean within OpenAI? How do you know it when you see it? And is that something that even in a world where everything, the cost to produce everything just keeps going down and down? Is that the one thing that's not commoditizable? Or is that also shifting given maybe that can go into the training data? No, I think taste is quite important, especially now that, like I said,

Starting point is 00:41:45 like our models are getting smarter. It's easier to use them as tools. So I think having the right direction matters a lot now and like having the right intuitions and like the right questions you want to ask. So I would say maybe it matters more now than before. I think also I've been surprised by how often the thing that is the most simple, like easy to explain is the thing that works the best. And so sometimes it seems very obvious.

Starting point is 00:42:10 But, you know, it's quite hard to get the details of something right. But I think usually good researcher taste is just like simplifying the problem to like the dumbest thing or the most simple thing you can do. Yeah, I feel like with every, like, research release we do, and when people figure out what happened there, they're like, oh, that's so simple. Like, oh, I should like that obviously. Obviously, that would have worked. But I think it's like knowing to try that like obvious or like at the time not obvious thing that is obvious in hindsight. Yeah. And then all of the details around. Yeah. The hyperparameters and all these things. Like the infra is obviously like very hard. But the actual concept itself is usually pretty straightforward. Hmm. Very cool. Taste is Occam's razor. Yeah. So sort of in closing here, obviously historic day, you want to contextualize sort of what this means in context of the mission and, you know, where you've been to get to now, to where you're going?

Starting point is 00:43:04 Yeah, I think with GPT-5, the word that's like been in my mind throughout all of this is like usable. And I think the thing that we're excited about is getting this out to everyone. We're excited to get like our best reasoning models out to free users now. And I think just getting our smartest model yet to like everyone. And I'm just excited to see what people are going to actually use it for. That's a great place to wrap. Tina, Isa, thanks so much for coming on the podcast. Yeah, thank you.

Starting point is 00:43:29 Thank you for having us. Thanks for listening to the a16z podcast. If you enjoyed the episode, let us know by leaving a review at ratethispodcast.com/a16z. We've got more great conversations coming your way. See you next time.
