No Priors: Artificial Intelligence | Technology | Startups - Inside Deep Research with Isa Fulford: Building the Future of AI Agents
Episode Date: April 24, 2025
On this episode of No Priors, Sarah sits down with Isa Fulford, one of the masterminds behind deep research. They unpack how the initiative began, the role of human expert data, and what it takes to build agents with real-world capability and even taste. Isa shares the differences between deep research and OpenAI's o3 model, the challenges around latency, and how she sees agent capabilities evolving. Plus, OpenAI has announced that deep research is free for all US users starting today.
Sign up for new podcasts every week. Email feedback to show@no-priors.com
Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @IsaFulf
Show Notes:
0:00 Deep research's inception & evolution
6:12 Data creation
7:20 Reinforcement fine-tuning
9:05 Why human expert data matters
11:23 Failure modes of agents
13:55 The roadmap ahead for Deep Research
18:32 How do agents develop taste?
19:29 Experience and path to building a broadly capable agent
22:03 Deep research vs. o3
25:55 Latency
27:56 Predictions for agent capabilities
Transcript
Hi, listeners, and welcome back to No Priors.
Today, I'm joined by Isa Fulford, one of the pioneering minds behind OpenAI's deep research.
This is a new agentic product that OpenAI released in February of this year, which uses reasoning and tools like web browsing to complete multi-step research tasks for you.
Today, they're making it free to all U.S. users.
Welcome, Isa.
Isa, thank you for doing this.
Thank you so much for having me.
You and your team have shipped, like, one of the most exciting AI products of late.
I use it a lot, deep research.
Where did the idea come from?
Tell me the origin story.
Yeah, so around a year ago now, we were very excited about the progress internally on this new reinforcement learning algorithm.
We were seeing a lot of progress on math problems and science problems and coding problems.
And at the same time, I was working with my friend Yash, who works at OpenAI, on a few side projects, and we were very interested in agents generally and kind of wondered if we could apply the same algorithms to tasks that may be more in line with what the average user would do every day.
And so the two first things we were thinking about were online browsing tasks, because I think in a lot of different professions, people do just have to do a lot of research, synthesize a lot of information.
and then come back with a report.
And then we're also thinking about software engineering.
We kind of have been working on those things.
I've been focusing on browsing.
So to start, with the math and coding problems
that people were already training on,
those data sets already exist.
You know, you can have a math problem
with a ground truth answer and you can train on those.
But for browsing, it's kind of more open-ended.
You don't really have data sets like that that exist.
So we really started by grounding the research in what product use cases we
actually wanted the final model to be good at. So we literally would write out just a list of
things: I hope the model could find this list of products for me and rank them by, like,
these reviews from Reddit or something like that. Or I want it to be able to write a literature
review on this topic. I feel like a lot of people, when they think about, you know, browsing and
agents, they land on the same like two, three transactional use cases that I actually don't think
are particularly inspiring, right? So it tends to be, like, order a burger on DoorDash or something
like that. Or I feel like ordering flowers is also like a really common one. Why do you think
you came up with like such a different set of goals for the agent? Yeah. So I think before we
focused on taking write actions (which those are examples of), we wanted to
get really good at synthesizing information from a large number of sources, on mostly read-only tasks.
That was for a number of reasons. Firstly, just a huge number of knowledge work professions mostly do that.
So it would be quite useful for those groups of people.
Secondly, I think the overall goal for OpenAI is to create an AGI that can make new scientific discoveries.
And we kind of felt that a prerequisite to that is to be able to synthesize information.
You know, if you can't write a literature review, you're not going to be able to write a new scientific paper.
So it felt very in line with the company's broader goals.
It's also very meta, because you have helped make an AI that makes me better at learning, and it's learning.
Yeah. I hadn't thought about that. I love that. More practically, with read-only tasks, maybe the safety question is a bit more constrained. So it was a good thing to start with as well.
Yeah, it seems that in the read-only space, people were also not nearly as ambitious as you were going in, or you and Yash were going in, about, like, maybe it could understand this set of
things for me. Okay, so you thought of these end evals and came up with a set of tasks that could be
auto-gradable, or fit a set of characteristics that made them a better fit for the algorithms. And then what?
That was actually a huge process in itself. I think we initially had built a demo to pitch people
on this idea, and there was no model training involved. It was fully just prompted models with the UI
pitching the vision of what this product could look like. And so I think after that, then we were at the point
we actually had to start thinking about how are we going to do this, how are we going to create
the data, how are we going to train the model, what tools do we have to create to enable the model
to browse the internet effectively? And that was a lot of iteration. I was working very closely
with Edward Sun and a few other people on this. And so we also collaborated a lot with the RL
team. I think it was definitely a big undertaking. And a good thing about it was we were able to
work uninterrupted for quite a few months, making the numbers go up. So I think it was nice to have not too much pressure to ship something really quickly.
And we were just able to iterate and get it to a good state.
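As a rough illustration of the setup described here, creating browsing tasks that can be graded automatically during RL training, a minimal sketch follows. Every name and the grading rule are my own illustration, not OpenAI's actual data format:

```python
# Hypothetical sketch: an auto-gradable browsing task pairs a prompt with a
# programmatic grader, so an RL loop can score model outputs without a human
# in the loop. Names and the grading rule here are illustrative only.
from dataclasses import dataclass
from typing import Callable

@dataclass
class BrowsingTask:
    prompt: str
    grade: Callable[[str], float]  # maps a model answer to a reward in [0, 1]

def contains_all(required: list[str]) -> Callable[[str], float]:
    """Reward = fraction of required ground-truth facts found in the answer."""
    def grader(answer: str) -> float:
        hits = sum(1 for fact in required if fact.lower() in answer.lower())
        return hits / len(required)
    return grader

# e.g. "find all papers co-authored by A and B" becomes auto-gradable once
# the ground-truth titles are known (placeholder titles below)
task = BrowsingTask(
    prompt="List the papers co-authored by researcher A and researcher B.",
    grade=contains_all(["paper one", "paper two", "paper three"]),
)

print(task.grade("They wrote Paper One and Paper Two together."))
```

A real grader would be far more robust (fuzzy matching, model-based judging), but the key property is the same: the task ships with a scoring function, so training can run without a human in the loop.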
Did you have a favorite, like, most important task? We had a few tasks. People would just propose
different tasks. One of them was to find all of the papers that Liam Fedus and Barret Zoph
had written together. I think there were 11. The model now can find most of them or all of them.
Okay. We would always ask that question. And then,
another one, which the model actually can't answer anymore, probably for good reason, but
finding the middle name of one of our co-workers. And then personally, I think I started using
it pretty early on for actually finding information for like product recommendations, travel.
And I think actually quite a few people internally, we had kind of a Streamlit playground that
people would just use. A lot of people had found it and were using it. Sam told me he used it to buy a
bunch of things. Every time it would go down, people would message us. Like, what happened? We need to
use the model, even when it was a previous version that honestly wasn't that good. So I think that was a
good initial sign. Yeah. What can you say about the actual bulk of the work, like the tool
creation and the data creation? So for the data, we did a bunch of different things. We used
human trainers. For some of it, we kind of had to come up with new ways, new kinds of data sets,
I guess, and we had to figure out how to design data sets to exercise the kind of skills that we
wanted the model to learn. And then you have to make a way to grade those data sets as you're
training them. And then you also have to make good tools for the model to be able to actually
complete the task successfully. So right now we just have the browsing tool, which is a text-based browser,
but it can see embedded images and like open PDFs. And then also it has access to a Python tool
so it can do analysis and calculations and plot graphs and things like that.
But you can imagine in future versions, we'll just expand the tool set.
And so the model will just become more capable, but we'll also need to make data sets that actually make the model
exercise all of those different tools and figure out how to use them and backtrack and, you know,
all these different things during training.
So that it's actually able to, like, flexibly answer new problems from users in the product.
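The tool setup Isa describes, a text-based browsing tool plus a Python tool that the model can call mid-task, implies a harness that dispatches model-issued tool calls and returns observations. A minimal sketch, with stand-in tool implementations and made-up names rather than OpenAI's actual interface:

```python
# Minimal sketch of a tool-dispatch harness: the model emits tool calls,
# the harness executes them and feeds the results back as observations.
# Tool names and signatures here are invented stand-ins.
from typing import Any, Callable

def run_browser(query: str) -> str:
    """Stand-in for the text-based browsing tool."""
    return f"[page text for: {query}]"

def run_python(code: str) -> str:
    """Stand-in for the Python tool; a real harness would sandbox this."""
    namespace: dict[str, Any] = {}
    exec(code, namespace)  # illustrative only; never exec untrusted code unsandboxed
    return str(namespace.get("result", ""))

TOOLS: dict[str, Callable[[str], str]] = {"browser": run_browser, "python": run_python}

def dispatch(tool: str, argument: str) -> str:
    """Run one model-issued tool call and return the observation text."""
    if tool not in TOOLS:
        return f"error: unknown tool {tool!r}"
    return TOOLS[tool](argument)

print(dispatch("browser", "deep research release notes"))
print(dispatch("python", "result = sum(range(10))"))
```

Expanding the tool set, as she suggests future versions will, amounts to adding entries to the registry and training the model against data that exercises each tool.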
It is clear that reinforcement fine tuning on very powerful base models can do very useful things now.
That's super exciting. What advice would you have for startups or other companies who are thinking
about doing RFT for a particular task as to when it's worth doing or when they can just try to
do sort of just traditional orchestration where agents are a component? So I think in general,
you will always get a model better at a specific task if you train on that task. But we also see
a lot of generalization from training on one kind of task to, you know, other domains. So
you can train a reasoning model on mostly math, coding, other reasoning kind of problems. And it
will be good at writing. But if you, you know, trained it on that specific task, it would be better
at it. I think if you have a very specific task that you think is so different to anything that the
model was likely trained on, and you try it a bunch of times yourself and you've tried a lot of different
prompts and it's just really not good at it. So maybe it's some genetic sequencing task or something
that's just so out of distribution for the model that it doesn't know how to figure it out. I think
that is a good time to try reinforcement fine tuning. Or if you have a task that is so critical to your
business workflow that getting the extra 10, 15% performance is really make or break, then
probably try it. But if it's something that you think, oh, the model's pretty good at, but it gets things
wrong, you know, some percentage of the time. And then you see with every next model that's
released, it gets a little bit better. It might not be worth the effort if the model naturally is
just going to get better at those things. So that would be my recommendation. Okay, great, great
advice. You've talked about needing to use human experts to create some of this data. I think of
browsing as a somewhat universal task. I guess there are, you know, better and worse browsers.
Like, where do you feel like you need expertise, or what do you know
about browsing expertise that you didn't before, or information gathering expertise? Yeah, I guess
it's one of those things where basically every single profession involves, you know, having a
question or wanting to do research in a domain and then having to find information from many
different sources to synthesize an answer. And like, while doing that, you have to have the
expertise to reason about, is this a useful source? Is this not? Is this, you know, should I include
this? Is this like completely off topic, whatever? Like, that is kind of, you know,
universal to most jobs or most, you know, scientific domains, any kind of anything.
And the cool thing with RL is that you don't necessarily need to know the whole process
of how the person would do the research. You just have to know what the task is and what the
outcome should be. And the model will just learn during training how to get from the problem
to a good answer. So I think we just took a pretty broad approach. I think that's one thing
that if you work at a place like Open AI, I think you can do what they would tell most startups
not to do and just try and focus on a really broad set of users and just get experts in loads of
different domains and try and see if you can go after everything at once, which was the approach that
we took. And then we also created a lot of synthetic data sets and things like that. But the human
data was definitely a really key part for making this model successful. Did any of the learned
planning from the model across these domains surprise you, like in terms of the path to find
the perfect handbag or the restaurant in Japan or the set of papers that was relevant?
Yeah, I guess sometimes it will use search terms that I wouldn't necessarily have used
or, you know, we didn't teach it to plan up front, but sometimes we'll see that it
does end up making a plan up front before starting its research.
Sometimes the model will do smart things and try to get around restrictions you put on it.
So you have to make sure that it's not hacking, you know, and trying to use a different search
engine other than the search engine that you gave it or something like that.
Like, it will do smart things that you have to make sure you're looking out for, you know,
in case you want to not allow the models to do those things.
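One simple way a harness might enforce the kind of restriction mentioned here, for example keeping the model on the sanctioned search backend rather than letting it route around to a different search engine, is an allowlist check on outgoing requests. The hostnames below are invented for illustration:

```python
# Sketch of an allowlist guard against the "use a different search engine"
# behavior described above. The hostname is invented for illustration.
from urllib.parse import urlparse

ALLOWED_HOSTS = {"search.internal.example"}  # the sanctioned search backend

def is_allowed(url: str) -> bool:
    """Reject tool calls that try to reach a non-sanctioned host."""
    host = urlparse(url).hostname or ""
    return host in ALLOWED_HOSTS

print(is_allowed("https://search.internal.example/q?term=llm+papers"))  # True
print(is_allowed("https://other-search-engine.example/?q=llm"))         # False
```

In a real training harness the check would sit at the network layer rather than inside the tool code, so the policy cannot be bypassed by a clever tool call.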
Maybe we can actually use this as a, like, a moment to talk about some of the failure modes.
Like, how did you think about some of the classic issues with agents, like maybe, you know,
compounding error or distraction or even safety?
Yeah, so I think with deep research, since it can't actually take actions,
there aren't kind of the same class of typical agent safety problems you would think of.
But I think the fact that the responses are much more comprehensive and take longer means that people will trust them more.
So I think maybe hallucinations is a bigger problem.
While this model hallucinates less than any model that we've ever released, it's still possible for it to hallucinate, most times because it will infer something incorrectly from one of its sources.
So that's part of the reason we have citations, because it's very important
that the user is able to check where the information came from. And if it's not correct,
they can hopefully figure it out. But yeah, that's definitely one of the biggest model limitations
and something that we're actively always working on to improve. In terms of future agents,
I think the ideal agent will be able to do research and take actions on your behalf. And so I think
that's a much harder question that we need to address. And it's kind of at that point when
capabilities and safety kind of converge where an agent is not useful if you can't trust it to
do a task in a way that doesn't have unintended side effects that you don't want. Like if you ask
it to do a task for you and then in the process it sends an embarrassing email or something like
this, you know, that's not a successful completion of the task. So I think that is going to be a
much more interesting and difficult safety area that we're starting to tackle. You can tell me if you
just don't have a projection here, but do you think people are going to want
explicit guardrails? Do you think you can learn a bunch of those characteristics in the model
itself? If you've used Operator, I'm sure you have, you have to confirm every write action.
I think to start with, that makes a lot of sense. You want to build trust with users. And as the
models become more capable, maybe you've seen it successfully do things a few times and you'll
start to trust it more. And so maybe you allow it: okay, you don't have to ask
me every time you send an email to these people. Like, that's fine. But I do think that as these
agents start to roll out, we will definitely want to have guardrails and confirmation.
Just so, you know, while they're not at the end-state capability, we still want to make sure
we have a good level of oversight. But I think that they will get so good that we'll just trust
them to do things on our behalf. What are some of the obvious ways you feel like deep research
as a product is going to get better? Yeah, I mean, it's going to extend into write actions. You just implied
that at some point. Yes. I mean, I think maybe, you know, the ideal state would
be to have a unified agent that can do all of these different things. Anything that you would
delegate to a co-worker, it should be able to do. How are we going to make decisions about if it's like
Sarah, you do this versus agent, please do this? Yeah, I guess. Or is it always just try the agent
first? Probably. I mean, I would try the agent first. It's my work. It's kind of the pattern
of every time the model becomes more capable, the level of abstraction of the human becomes
higher, if that makes sense. Like the task you're asking it to do is just higher and higher level,
but you're still initiating the task. So, you know, maybe previously, a year ago, I was asking
it to write a function for me, and now I'm asking it to write a whole file, and maybe
next year it will, you know, make a whole PR for me or something like that. So I still think
we'll be in the driving seat. As to deep research, I think obvious next steps for deep research
would also be to have access to private data, like be able to do research over, you know, any
internal documentation or GitHub, whatever it is.
There's a golden thread here because when we first met you were working on retrieval
and I was like, there cannot be only one person at this company working on retrieval.
All roads lead back to retrieval.
So I think that will be really cool.
And then eventually taking write actions or calling APIs.
And then obviously there are just a lot of things that the model is not perfect at now that we just need to improve.
But I think we have a really cool working relationship with the reinforcement learning team.
So a lot of teams will contribute data sets to the big runs that they do.
So we contribute data sets.
And then as they train models with a ton of compute, then it just becomes a better base model for us to continue training from.
So I just think the capabilities are compounding.
So this was not a low-key research preview, but a side project that turned into a very interesting, you know, internally pitched project.
How do you think about, like, what is a product that OpenAI, or at least you yourself, want to work on
independently versus what belongs in the core research path.
A cool thing about Open AI is that even though the company is bigger,
I think the culture of anyone being able to have an idea and prove it out and then push it to
completion is still, you know, it's still been maintained as the company has grown.
For me personally, I'm always motivated to work on things that I will use myself. With deep
research, for example, I do use it a lot for, you know, looking at various things, travel
recommendations. I think I'm probably a daily active user. It's fun when you get to dogfood.
I think I'm a DAU now. Oh, yeah. Amazing. Yeah. I'm burning a lot of GPUs. Are there use cases where, like,
you know, you're the original expert, are there ways that you or Yash, or, like, ways you've seen the
user base use it, that you'd encourage people to use deep research? I'm always interested to see
people using it in domains
that I have absolutely no expertise in.
For example, in medical research or
I've seen a lot of different scientists posting about
how they've used deep research and how it helped them do something.
To me, that's the most interesting because
when we were working on it, I obviously
had no way of judging whether
an output is good or not. So seeing
experts actually ratify
deep research responses is useful.
An area that I was surprised
to see people using the
model in was code search
for coding questions, I think, like, use the latest package or latest version of whatever
repo to help me write this file or something for data analysis as well. That's also something
the model's already pretty good at, and I think we'll just continue to get better at. I think
uploading a file or something like that and having it do some analysis for you or do some research
and then create a report with numerical analysis is pretty interesting. I actually haven't tried
this. And it's not a browsing task. Like, what makes the model particularly
good at this? Or what is it capable of? Is it really like multi-step and then being able to
do planning and understanding of the task and produce a report that's cohesive? Yeah. I think also
the base model, the model that we started fine-tuning from, o3, is just a very capable model.
It's trained on many different data sets, including a lot of coding, reasoning and math
tasks. So that inherited capability is pretty strong. And then when you add the browsing on top of
that, it's still able to do that analysis. So I think those two together can be quite powerful.
Before the podcast, we were just talking about the idea of learning taste or preferences from
users, like opening eyes just released a bunch of memory features. How do you think that deep research
could or, you know, just agents in general could evolve to take into account, like how people
want to learn or their information ingestion preferences? Yeah, I think agent memory will definitely be
very important. It would be very annoying if, every time
you ask it to do a task, you have to repeat the same information, how you want it to do the task,
everything about you, which currently for deep research, you do have to do. And I think as tasks
get more complex, and right now it will take five to 30 minutes, you can imagine in the future
it might take hours or days to complete a task that you ask the model to do. You definitely want
the model's research to be compounding. You don't want it to have to start fresh every
time. So I don't necessarily have a good answer, but I think it's something that will be very
important. There is a common understanding between many people at some of the leading labs that
like the recipe to AGI is, I'd say, somewhat known or, you know, there's confidence on this.
And, you know, the return of RL is very exciting for everyone. The stance that I've heard from you
and from others is both enthusiasm on like, this seems to work. We're going to get real capability
out of it. It's quite data efficient. And it's going to be a lot of work. Tell me a little bit about
like the emotional experience of building deep research and if that changes your view at all.
I agree with everything you said. I think it's so impressive to see how data efficient
the algorithm is. I guess the data you train on is much higher quality and smaller. So actually
curating that is an undertaking and then making sure that the model has access to all the tools
that a human would have access to to do the work that they need to do and then making sure that
you represent tasks that people will find useful or doing their jobs in a way that you can,
you know, judge whether the model did a good job or not is also hard. And there's so many
other challenges for pre-training, where you have so much more data and have to do all of these
different things. I think it's just a different challenge, and both are compounding. Like,
you need a really good base model to be able to do RL. And then for our team, we just do more RL.
So, yeah, it's, like, all very compounding.
But I think that we, everybody does kind of see a pretty clear path to this, like, broadly
capable agent.
Do you think there are, like, big blockers to progress of, like, you said, maybe not exactly
describing it as the next iteration of deep research, but just confidence that, you know,
we're going to have these unified agent capabilities and it will feel like a coworker.
What stands between us and that?
There's a lot of really hard safety
questions that we need to figure out. You know, we would never ship anything that we don't have
very high confidence is safe. And I think the stakes are way higher when it can, when it has access
to your GitHub repositories and your passwords and your private data. So I think that's a really
big challenge. I guess also if you want the model to be able to do tasks that take many, many hours,
finding efficient ways to manage context, kind of similar to the memory thing. But if you're doing a
task for a really long time, you're going to run out of context. So what's an efficient way of
dealing with that, allowing the model to continue to do its thing? And then, yeah, just the task
of making the data and making the tools. I mean, I've said this already a few times, but that's a lot
of work. I was just looking at my history of queries. My user request is, like, I want to see what
things I asked of deep research versus other models, in particular, in my memory. But it has ranged
from, like, obviously, you know, trying to get up to speed on a market for a company I'm
looking at or on a technical topic or travel planning. It's a big one. Also, I have looked for
things that are taste related. So I'll be like, okay, I like, you know, this set of books for
these reasons. I want you to, you know, actually just give me a long-form summary of a bunch
of other things you think I should read and explain why. I realize I don't have a super clear
model of, like, when deep research should be better than o3. What instinct can you give me here?
Deep research is very good when you have a very specific query or well-defined query,
so maybe not a general overview of a topic, but you're looking for some specific information
and you think it would be supplemented by existing research online. Even if that information
is also, you know, information we also trained the base model on, I think having
live access to it is quite useful. So if I have any instinct about, like, directing it to retrieval or
particular sources, that focusing is useful. I think so. And also we trained it to have much longer
outputs than I think the, you know, normal models would. So if you're looking for something very
comprehensive, maybe sometimes too comprehensive for some tasks, I think deep research will be
useful for those things. Connect this for me to a deep research, like, fashion
task. I've used it to find new brands. So I'll say these are the kinds of brands I like.
Please find new brands where I can find this specific coat that looks like this one or something
like that. And then it's very good at finding those versus the, I think the base model or the normal
model will say it will give you some brands, but it won't necessarily fit all of the constraints
that I had given. Like, I wanted, say, this fake fur coat that's this length, from
this season or something. It's not going to be able to do that because it just won't have the
up-to-date information and also just won't necessarily be able to deal with all of the constraints
in a query, like in one shot. o3 isn't browsing as comprehensively. I'll use it to find things
where I'm looking for a very specific thing that would take me hours to find. So I'm looking
for this very specific item or, you know, sweater that is probably available on The RealReal or somewhere,
but I can't find it. Or I'm looking for an Airbnb with like very specific constraints. So
I think those kinds of things deep research is good for. And then more,
general, like, high-level things you should use, like, normal search for.
Yes. Well, I will admit, I have had some multi-year browsing slash shopping tasks,
but I'm now making a cron job for deep research. I want to ask just one more experience question,
which is, was there a particular, like, win or failure that surprised you in the training
of deep research? It really was one of those things where we thought that, you know, training
on browsing tasks would work, you know, felt like we had good conviction in it, but actually
the first time you train a model on a new dataset using this algorithm and seeing it actually
working and playing with the model was pretty incredible, even though we thought it would work.
So honestly, just that it worked so well was pretty surprising, even though we thought it would,
if that makes sense.
Yeah, yeah, it's the surreal experience of, like, oh, the path is paved, but...
strawberries or whatever. Exactly. But then sometimes some of the things that it fails at are also
surprising. Like sometimes it will make a mistake where it will do such smart things and then make
a mistake where I'm just thinking, why are you doing that? Stop. So I think there's definitely a lot of
room for improvement. But yeah, we've been impressed with the model so far. I'm used to all my
technology tools being instantaneous. Deep research is not instantaneous. It's thinking and using
tools. Can it be faster? Yeah, I do think there's a good middle ground in between where sometimes
you don't want it to do really deep research, but you want it to do more than a search.
And I think that we will release things soon that people will be happy about and we'll fill that gap.
Okay. I don't know how to communicate this preference, but I want to, like, toggle at some point, because I would say this to a human: I want you to do as good of a job as you possibly can in the next five minutes.
Yeah. See, that's something where I think it seems like bad UX to actually make the user make that decision. The model should just be better at
knowing how much time to think. I think we made a decision when training the model that we just
are going to go for max thinking time every time. So I'm sure I will ask it a really simple query
sometimes just to test and then get quite frustrated that it's still thinking. So I do think
that that's also an area for improvement. It's, you know, knowing how long to think for. But yeah,
I suspect with deep research, we will always be focusing on the tasks that take the maximum length of
time, and then I think, like, o3 or, you know, o-next will have a better in-between. What is an
example of a task you can imagine deep research taking a day on, in the future? I mean, there's some
GPU smoking. Yeah, I think anything that would take... I mean, right now, in five or 30 minutes, it can
do what human experts rate would take many hours. So I guess in an hour it could do something that would
take a human days; in a day it could, you know, do something that would take a human weeks.
Obviously, there will be a lot of challenges to get it to scale like that. But I think you can
imagine it doing a research project that would have taken weeks to complete or, like, write a thesis
or something like that. Okay. I'm going to make our intern compete with it over the next
couple months then. Yeah. Sounds good. If you were to project forward a year, which is a really
long time in AI land. What is something that you think will surprise people that agents can do?
And that will actually be released. So it takes the safety considerations into account.
A general agent that could do a lot of the, you know, help you do a lot of the tasks that you
would do in a lot of different areas. Like for me, I do a lot of coding. I'm hoping that there'll be
an agent that is pretty proficient at coding, but that I will just trust. I'll give it a task
and it will hopefully make a PR or something, but maybe I can ask the same agent to help me book
a trip to Korea or something. I hope that we'll get to a more unified experience. But I also
think that the rate at which these models are improving is going to be pretty surprising to
most people. Why do you think a unified experience is important? Or why do you think that makes sense?
Because I think today it's quite different to think about. Obviously, ChatGPT is one
experience that's very encompassing. But there are models that people use in different contexts
like, you know, next-line completion type models for coding that just feel like a very different
setting. I think that you'll probably want both. Like, you'll probably want an experience where you can
at some point override or interrupt the model and say, oh, no, I didn't mean that. Or you can take
over and like start typing something. Yeah. Especially in the short term as the models are not as
capable as humans in a lot of areas and are more capable in other areas. So I think it will be
a combination of like you asking the model to do something, but then when maybe to go with the
coding example, then maybe you're also in your your VS code or whatever it is, your cursor and
it's been doing something for you, but you can also like actually type and, you know, write some
of it yourself. So I think it will be a combination of those things. But I kind of want it to be
something that is just, like, having a co-worker on Slack, or, like, a remote co-worker. You can just
ask to do things for you, send them a Slack message, and then they'll start doing it.
And then you can, like, review their work or, you know, help at some point.
But it seems like a pretty nice, like, general interface.
And you don't have to think about which agent should I ask to do which task.
Like, you should just be able to figure it out.
The mental model I have for this is my general ethos is actually I love the people I work with.
I prefer to work with fewer people, with less management overhead, all things considered,
because each person has more context and I have more understanding of them.
And so, like, the universally useful agent is attractive.
Yeah, and you only have to tell it something once and it will remember and then it will
have state on everything you're working on, things like that.
Awesome.
Well, this has been a great conversation, Isa.
Thanks for doing this.
And thank you for the product release.
Thank you so much for having me.
And thank you for using deep research.
Find us on Twitter at @NoPriorsPod.
Subscribe to our YouTube channel.
if you want to see our faces, follow the show on Apple Podcasts, Spotify, or wherever you listen.
That way you get a new episode every week.
And sign up for emails or find transcripts for every episode at no-priors.com.