No Priors: Artificial Intelligence | Technology | Startups - Inside Deep Research with Isa Fulford: Building the Future of AI Agents
Episode Date: April 24, 2025
On this episode of No Priors, Sarah sits down with Isa Fulford, one of the masterminds behind deep research. They unpack how the initiative began, the role of human expert data, and what it takes to build agents with real-world capability and even taste. Isa shares the differences between deep research and OpenAI's o3 model, the challenges around latency, and how she sees agent capabilities evolving. Plus, OpenAI has announced that deep research is free for all US users starting today.
Sign up for new podcasts every week. Email feedback to show@no-priors.com
Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @IsaFulf
Show Notes:
0:00 Deep research's inception & evolution
6:12 Data creation
7:20 Reinforcement fine-tuning
9:05 Why human expert data matters
11:23 Failure modes of agents
13:55 The roadmap ahead for Deep Research
18:32 How do agents develop taste?
19:29 Experience and path to building a broadly capable agent
22:03 Deep research vs. o3
25:55 Latency
27:56 Predictions for agent capabilities
Transcript
Hi, listeners, and welcome back to No Priors.
Today, I'm joined by Isa Fulford, one of the pioneering minds behind OpenAI's deep research.
This is a new agentic product that OpenAI released in February of this year, which uses reasoning and tools like web browsing to complete multi-step research tasks for you.
Today, they're making it free to all U.S. users.
Welcome, Isa.
Isa, thank you for doing this.
Thank you so much for having me.
You and your team have shipped, like, one of the most exciting AI products of late.
I use it a lot, deep research.
Where did the idea come from?
Tell me the origin story.
Yeah, so around a year ago now, we were very excited about the progress internally on this new reinforcement learning algorithm.
We were seeing a lot of progress on math problems and science problems and coding problems.
And at the same time, I was working with my friend Yash, who works at OpenAI, on a few side projects, and we were very interested in agents generally and kind of wondered if we could apply the same algorithms to tasks that may be more in line with what the average user would do every day.
And so the two first things we were thinking about were online browsing tasks, because I think in a lot of different professions, people do just have to do a lot of research, synthesize a lot of information.
and then come back with a report.
And then we're also thinking about software engineering.
We kind of have been working on those things.
I've been focusing on browsing.
So to start, with the math and coding problems
that people were already training on,
those data sets already exist.
You know, you can have a math problem
with a ground truth answer and you can train on those.
But for browsing, it's kind of more open-ended.
You don't really have data sets like that that exist.
So we really started by grounding the research in what product use cases we
actually wanted the final model to be good at. So we literally would write out just a list of
things: I hope the model could find this list of products for me and rank them by, like,
these reviews from Reddit or something like that. Or I want it to be able to write a literature
review on this topic. I feel like a lot of people, when they think about, you know, browsing and
agents, they land on the same like two, three transactional use cases that I actually don't think
are particularly inspiring, right? So it tends to be, like, order a burger on DoorDash or something
like that. Or I feel like ordering flowers is also like a really common one. Why do you think
you came up with like such a different set of goals for the agent? Yeah. So I think before we
focused on taking write actions (which those are examples of), we wanted to
get really good at synthesizing information from a large number of sources, on mostly read-only tasks.
That was for a number of reasons. Firstly, just a huge number of knowledge work professions mostly do that.
So it would be quite useful for those groups of people.
Secondly, I think the overall goal for OpenAI is to create an AGI that can make new scientific discoveries.
And we kind of felt that a prerequisite to that is to be able to synthesize information.
You know, if you can't write a literature review, you're not going to be able to write a new scientific paper.
So it felt very in line with the company's broader goals.
It's also very meta, because you have helped make an AI that makes me better at learning, and it's learning.
Yeah. I hadn't thought about that. I love that. More practically, with read-only tasks, maybe the safety question is a bit more constrained. So it was a good thing to start with as well.
Yeah, it seems that in the read-only space, people were also not nearly as ambitious as you were going in, or you and Yash were going in, about, like, maybe it could understand this set of
things for me. Okay, so you thought of these end evals and came up with a set of tasks that could be
auto-gradable, or fit a set of characteristics that made them a better fit for the algorithms. And then what?
That was actually a huge process in itself. I think we initially had built a demo to pitch people
on this idea, and there was no model training involved. It was fully just prompted models with the UI
pitching the vision of what this product could look like. And so I think after that, then we were at the point
we actually had to start thinking about how are we going to do this, how are we going to create
the data, how are we going to train the model, what tools do we have to create to enable the model
to browse the internet effectively? And that was a lot of iteration. I was working very closely
with Edward Sun and a few other people on this. And so we also collaborated a lot with the RL
team. I think it was definitely a big undertaking. And a good thing about it was we were able to
work uninterrupted for quite a few months, making the numbers go up. So I think it was nice to have not too much pressure to ship something really quickly.
And we were just able to iterate and get it to a good state.
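As a rough illustration of the setup described here, creating browsing tasks that can be graded automatically during RL training, a minimal sketch follows. Every name and the grading rule are my own illustration, not OpenAI's actual data format:

```python
# Hypothetical sketch: an auto-gradable browsing task pairs a prompt with a
# programmatic grader, so an RL loop can score model outputs without a human
# in the loop. Names and the grading rule here are illustrative only.
from dataclasses import dataclass
from typing import Callable

@dataclass
class BrowsingTask:
    prompt: str
    grade: Callable[[str], float]  # maps a model answer to a reward in [0, 1]

def contains_all(required: list[str]) -> Callable[[str], float]:
    """Reward = fraction of required ground-truth facts found in the answer."""
    def grader(answer: str) -> float:
        hits = sum(1 for fact in required if fact.lower() in answer.lower())
        return hits / len(required)
    return grader

# e.g. "find all papers co-authored by A and B" becomes auto-gradable once
# the ground-truth titles are known (placeholder titles below)
task = BrowsingTask(
    prompt="List the papers co-authored by researcher A and researcher B.",
    grade=contains_all(["paper one", "paper two", "paper three"]),
)

print(task.grade("They wrote Paper One and Paper Two together."))
```

A real grader would be far more robust (fuzzy matching, model-based judging), but the key property is the same: the task ships with a scoring function, so training can run without a human in the loop.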
Did you have a favorite, like, most important task? We had a few tasks. People would just propose
different tasks. One of them was to find all of the papers that Liam Fedus and Barret Zoph
had written together. I think there were 11. The model now can find most of them or all of them.
Okay. We would always ask that question. And then,
another one, which the model actually can't answer anymore, probably for good reason, but
finding the middle name of one of our co-workers. And then personally, I think I started using
it pretty early on for actually finding information for like product recommendations, travel.
And I think actually quite a few people internally, we had kind of a Streamlit playground that
people would just use. A lot of people had found it and were using it. Sam told me he used it to buy a
bunch of things. Every time it would go down, people would message us. Like, what happened? We need to
use the model, even when it was a previous version that honestly wasn't that good. So I think that was a
good initial sign. Yeah. What can you say about the actual bulk of the work, like the tool
creation and the data creation? So for the data, we did a bunch of different things. We used
human trainers. For some of it, we kind of had to come up with new ways, new kinds of data sets,
I guess, and we had to figure out how to design data sets to exercise the kind of skills that we
wanted the model to learn. And then you have to make a way to grade those data sets as you're
training them. And then you also have to make good tools for the model to be able to actually
complete the task successfully. So right now we just have the browsing tool, which is a text-based browser,
but it can see embedded images and like open PDFs. And then also it has access to a Python tool
so it can do analysis and calculations and plot graphs and things like that.
But you can imagine in future versions, we'll just expand the tool set.
And so the model will just become more capable, but we'll also need to make data sets that actually make the model
exercise all of those different tools and figure out how to use them and backtrack and, you know,
all these different things during training.
So that it's actually able to, like, flexibly answer new problems from users in the product.
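The tool setup Isa describes, a text-based browsing tool plus a Python tool that the model can call mid-task, implies a harness that dispatches model-issued tool calls and returns observations. A minimal sketch, with stand-in tool implementations and made-up names rather than OpenAI's actual interface:

```python
# Minimal sketch of a tool-dispatch harness: the model emits tool calls,
# the harness executes them and feeds the results back as observations.
# Tool names and signatures here are invented stand-ins.
from typing import Any, Callable

def run_browser(query: str) -> str:
    """Stand-in for the text-based browsing tool."""
    return f"[page text for: {query}]"

def run_python(code: str) -> str:
    """Stand-in for the Python tool; a real harness would sandbox this."""
    namespace: dict[str, Any] = {}
    exec(code, namespace)  # illustrative only; never exec untrusted code unsandboxed
    return str(namespace.get("result", ""))

TOOLS: dict[str, Callable[[str], str]] = {"browser": run_browser, "python": run_python}

def dispatch(tool: str, argument: str) -> str:
    """Run one model-issued tool call and return the observation text."""
    if tool not in TOOLS:
        return f"error: unknown tool {tool!r}"
    return TOOLS[tool](argument)

print(dispatch("browser", "deep research release notes"))
print(dispatch("python", "result = sum(range(10))"))
```

Expanding the tool set, as she suggests future versions will, amounts to adding entries to the registry and training the model against data that exercises each tool.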
It is clear that reinforcement fine tuning on very powerful base models can do very useful things now.
That's super exciting. What advice would you have for startups or other companies who are thinking
about doing RFT for a particular task as to when it's worth doing or when they can just try to
do sort of just traditional orchestration where agents are a component? So I think in general,
you will always get a model better at a specific task if you train on that task. But we also see
a lot of generalization from training on one kind of task to, you know, other domains. So
you can train a reasoning model on mostly math, coding, other reasoning kind of problems. And it
will be good at writing. But if you, you know, trained it on that specific task, it would be better
at it. I think if you have a very specific task that you think is so different to anything that the
model was likely trained on, and you try it a bunch of times yourself and you've tried a lot of different
prompts and it's just really not good at it. So maybe it's some genetic sequencing task or something
that's just so out of distribution for the model that it doesn't know how to figure it out. I think
that is a good time to try reinforcement fine tuning. Or if you have a task that is so critical to your
business workflow that getting the extra 10, 15% performance is really make or break, then
probably try it. But if it's something that you think, oh, the model's pretty good at, but it gets things
wrong, you know, some percentage of the time. And then you see with every next model that's
released, it gets a little bit better. It might not be worth the effort if the model naturally is
just going to get better at those things. So that would be my recommendation. Okay, great, great
advice. You've talked about needing to use human experts to create some of this data. I think of
browsing as a somewhat universal task. I guess there are, you know, better and worse browsers.
Like, where do you feel like you need expertise, or what do you know
about browsing expertise that you didn't before, or information gathering expertise? Yeah, I guess
it's one of those things where basically every single profession involves, you know, having a
question or wanting to do research in a domain and then having to find information from many
different sources to synthesize an answer. And like, while doing that, you have to have the
expertise to reason about, is this a useful source? Is this not? Is this, you know, should I include
this? Is this like completely off topic, whatever? Like, that is kind of, you know,
universal to most jobs or most, you know, scientific domains, any kind of anything.
And the cool thing with RL is that you don't necessarily need to know the whole process
of how the person would do the research. You just have to know what the task is and what the
outcome should be. And the model will just learn during training how to get from the problem
to a good answer. So I think we just took a pretty broad approach. I think that's one thing
that if you work at a place like Open AI, I think you can do what they would tell most startups
not to do and just try and focus on a really broad set of users and just get experts in loads of
different domains and try and see if you can go after everything at once, which was the approach that
we took. And then we also created a lot of synthetic data sets and things like that. But the human
data was definitely a really key part for making this model successful. Did any of the learned
planning from the model across these domains surprise you, like in terms of the path to find
the perfect handbag or the restaurant in Japan or the set of papers that was relevant?
Yeah, I guess sometimes it will use search terms that I wouldn't necessarily have used
or, you know, we didn't teach it to plan up front, but sometimes we'll see that it
does end up making a plan up front before starting its research.
Sometimes the model will do smart things and try to get around restrictions you put on it.
So you have to make sure that it's not hacking, you know, and trying to use a different search
engine other than the search engine that you gave it or something like that.
Like, it will do smart things that you have to make sure you're looking out for, you know,
in case you want to not allow the models to do those things.
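One simple way a harness might enforce the kind of restriction mentioned here, for example keeping the model on the sanctioned search backend rather than letting it route around to a different search engine, is an allowlist check on outgoing requests. The hostnames below are invented for illustration:

```python
# Sketch of an allowlist guard against the "use a different search engine"
# behavior described above. The hostname is invented for illustration.
from urllib.parse import urlparse

ALLOWED_HOSTS = {"search.internal.example"}  # the sanctioned search backend

def is_allowed(url: str) -> bool:
    """Reject tool calls that try to reach a non-sanctioned host."""
    host = urlparse(url).hostname or ""
    return host in ALLOWED_HOSTS

print(is_allowed("https://search.internal.example/q?term=llm+papers"))  # True
print(is_allowed("https://other-search-engine.example/?q=llm"))         # False
```

In a real training harness the check would sit at the network layer rather than inside the tool code, so the policy cannot be bypassed by a clever tool call.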
Maybe we can actually use this as a, like, a moment to talk about some of the failure modes.
Like, how did you think about some of the classic issues with agents, like maybe, you know,
compounding error or distraction or even safety?
Yeah, so I think with deep research, since it can't actually take actions,
there aren't kind of the same class of typical agent safety problems you would think of.
But I think the fact that the responses are much more comprehensive and take longer means that people will trust them more.
So I think maybe hallucinations is a bigger problem.
While this model hallucinates less than any model that we've ever released, it's still possible for it to hallucinate, most times because it will infer something incorrectly from one of its sources.
So that's part of the reason we have citations, because it's very important
that the user is able to check where the information came from. And if it's not correct,
they can hopefully figure it out. But yeah, that's definitely one of the biggest model limitations
and something that we're actively always working on to improve. In terms of future agents,
I think the ideal agent will be able to do research and take actions on your behalf. And so I think
that's a much harder question that we need to address. And it's kind of at that point when
capabilities and safety kind of converge where an agent is not useful if you can't trust it to
do a task in a way that doesn't have unintended side effects that you don't want. Like if you ask
it to do a task for you and then in the process it sends an embarrassing email or something like
this, you know, that's not a successful completion of the task. So I think that is going to be a
much more interesting and difficult safety area that we're starting to tackle. You can tell me if you
just don't have a projection here, but do you think people are going to want
explicit guardrails? Do you think you can learn a bunch of those characteristics in the model
itself? If you've used Operator, I'm sure you have, you have to confirm every write action.
I think to start with, that makes a lot of sense. You want to build trust with users. And as the
models become more capable, maybe you've seen it successfully do things a few times and you'll
start to trust it more. And so maybe you allow it: okay, you don't have to ask
me every time you send an email to these people. Like, that's fine. But I do think that as these
agents start to roll out, we will definitely want to have guardrails and confirmation.
Just so, you know, while they're not at the end-state capability, we still want to make sure
we have a good level of oversight. But I think that they will get so good that we'll just trust
them to do things on our behalf. What are some of the obvious ways you feel like deep research
as a product is going to get better? Yeah, I mean, it's going to extend into write actions. You just implied
that at some point. Yes. I mean, I think maybe, you know, the ideal state would
be to have a unified agent that can do all of these different things. Anything that you would
delegate to a co-worker, it should be able to do. How are we going to make decisions about if it's like
Sarah, you do this versus agent, please do this? Yeah, I guess. Or is it always just try the agent
first? Probably. I mean, I would try the agent first. It's my work. It's kind of the pattern
of every time the model becomes more capable, the level of abstraction of the human becomes
higher, if that makes sense. Like the task you're asking it to do is just higher and higher level,
but you're still initiating the task. So, you know, maybe previously, a year ago, I was asking
it to write a function for me, and now I'm asking it to write a whole file, and maybe
next year it will, you know, make a whole PR for me or something like that. So I still think
we'll be in the driving seat. As to deep research, I think obvious next steps for deep research
would also be to have access to private data, like be able to do research over, you know, any
internal documentation or GitHub, whatever it is.
There's a golden thread here because when we first met you were working on retrieval
and I was like, there cannot be only one person at this company working on retrieval.
All roads lead back to retrieval.
So I think that will be really cool.
And then eventually taking write actions or calling APIs.
And then obviously there are just a lot of things that the model is not perfect at now that we just need to improve.
But I think we have a really cool working relationship with the reinforcement learning team.
So a lot of teams will contribute data sets to the big runs that they do.
So we contribute data sets.
And then as they train models with a ton of compute, then it just becomes a better base model for us to continue training from.
So I just think the capabilities are compounding.
So this was not a low-key research preview, but a side project that turned into a very interesting, you know, internally pitched project.
How do you think about, like, what is a product that OpenAI, or at least you yourself, want to work on
independently versus what belongs in the core research path.
A cool thing about Open AI is that even though the company is bigger,
I think the culture of anyone being able to have an idea and prove it out and then push it to
completion is still, you know, it's still been maintained as the company has grown.
For me personally, I'm always motivated to work on things that I will use myself. With deep
research, for example, I do use it a lot for, you know, looking at various things, travel
recommendations. I think I'm probably a daily active user. It's fun when you get to dogfood.
I think I'm a DAU now. Oh, yeah. Amazing. Yeah. I'm burning a lot of GPUs. Are there use cases where, like,
you know, you're the original expert, are there ways that you or Yash, or, like, ways you've seen the
user base use it, that you'd encourage people to use deep research? I'm always interested to see
people using it in domains
that I have absolutely no expertise in.
For example, in medical research or
I've seen a lot of different scientists posting about
how they've used deep research and how it helped them do something.
To me, that's the most interesting because
when we were working on it, I obviously
had no way of judging whether
an output is good or not. So seeing
experts actually ratify
deep research responses is useful.
An area that I was surprised
to see people using the
model in was code search
for coding questions, I think, like, use the latest package or latest version of whatever
repo to help me write this file or something for data analysis as well. That's also something
the model's already pretty good at, and I think we'll just continue to get better at. I think
uploading a file or something like that and having it do some analysis for you or do some research
and then create a report with numerical analysis is pretty interesting. I actually haven't tried
this. And it's not a browsing task. Like, what makes the model particularly
good at this? Or what is it capable of? Is it really like multi-step and then being able to
do planning and understanding of the task and produce a report that's cohesive? Yeah. I think also
the base model, the model that we started fine-tuning from, o3, is just a very capable model.
It's trained on many different data sets, including a lot of coding, reasoning and math
tasks. So that inherited capability is pretty strong. And then when you add the browsing on top of
that, it's still able to do that analysis. So I think those two together can be quite powerful.
Before the podcast, we were just talking about the idea of learning taste or preferences from
users, like opening eyes just released a bunch of memory features. How do you think that deep research
could or, you know, just agents in general could evolve to take into account, like how people
want to learn or their information ingestion preferences? Yeah, I think agent memory will definitely be
very important. It would be very annoying if, every time
you ask it to do a task, you have to repeat the same information, how you want it to do the task,
everything about you, which currently for deep research, you do have to do. And I think as tasks
get more complex, and right now it will take five to 30 minutes, you can imagine in the future
it might take hours or days to complete a task that you ask the model to do. You definitely want
the model's research to be compounding. You don't want it to have to start fresh every
time. So I don't necessarily have a good answer, but I think it's something that will be very
important. There is a common understanding between many people at some of the leading labs that
like the recipe to AGI is, I'd say, somewhat known or, you know, there's confidence on this.
And, you know, the return of RL is very exciting for everyone. The stance that I've heard from you
and from others is both enthusiasm on like, this seems to work. We're going to get real capability
out of it. It's quite data efficient. And it's going to be a lot of work. Tell me a little bit about
like the emotional experience of building deep research and if that changes your view at all.
I agree with everything you said. I think it's so impressive to see how data efficient
the algorithm is. I guess the data you train on is much higher quality and smaller. So actually
curating that is an undertaking and then making sure that the model has access to all the tools
that a human would have access to to do the work that they need to do and then making sure that
you represent tasks that people will find useful or doing their jobs in a way that you can,
you know, judge whether the model did a good job or not is also hard. And there's so many
other challenges for pre-training, where you have so much more data and have to do all of these
different things. I think it's just a different challenge, and both are compounding. Like,
you need a really good base model to be able to do RL. And then for our team, we just do more RL.
So, yeah, it's, like, all very compounding.
But I think that we, everybody does kind of see a pretty clear path to this, like, broadly
capable agent.
Do you think there are, like, big blockers to progress of, like, you said, maybe not exactly
describing it as the next iteration of deep research, but just confidence that, you know,
we're going to have these unified agent capabilities and it will feel like a coworker.
What stands between us and that?
There's a lot of really hard safety
questions that we need to figure out. You know, we would never ship anything that we don't have
very high confidence is safe. And I think the stakes are way higher when it can, when it has access
to your GitHub repositories and your passwords and your private data. So I think that's a really
big challenge. I guess also if you want the model to be able to do tasks that take many, many hours,
finding efficient ways to manage context, kind of similar to the memory thing. But if you're doing a
task for a really long time, you're going to run out of context. So what's an efficient way of
dealing with that, allowing the model to continue to do its thing? And then, yeah, just the task
of making the data and making the tools. I mean, I've said this already a few times, but that's a lot
of work. I was just looking at my history of queries. My user request is, like, I want to see what
things I asked of deep research versus other models, in particular, in my memory. But it has ranged
from, like, obviously, you know, trying to get up to speed on a market for a company I'm
looking at or on a technical topic or travel planning. It's a big one. Also, I have looked for
things that are taste related. So I'll be like, okay, I like, you know, this set of books for
these reasons. I want you to, you know, actually just give me a long-form summary of a bunch
of other things you think I should read and explain why. I realize I don't have a super clear
model of, like, when deep research should be better than o3. What instinct can you give me here?
Deep research is very good when you have a very specific query or well-defined query,
so maybe not a general overview of a topic, but you're looking for some specific information
and you think it would be supplemented by existing research online. Even if that information
is also, you know, information we also trained the base model on, I think having
live access to it is quite useful. So if I have any instinct about, like, directing it to retrieval or
particular sources, that focusing is useful. I think so. And also we trained it to have much longer
outputs than I think the, you know, normal models would. So if you're looking for something very
comprehensive, maybe sometimes too comprehensive for some tasks, I think deep research will be
useful for those things. Connect this for me to a deep research, like, fashion
task. I've used it to find new brands. So I'll say these are the kinds of brands I like.
Please find new brands where I can find this specific coat that looks like this one or something
like that. And then it's very good at finding those versus the, I think the base model or the normal
model will say it will give you some brands, but it won't necessarily fit all of the constraints
that I had given. Like, I wanted, say, this fake fur coat that's this length, from
this season or something. It's not going to be able to do that because it just won't have the
up-to-date information and also just won't necessarily be able to deal with all of the constraints
in a query, like in one shot. o3 isn't browsing as comprehensively. I'll use it to find things
where I'm looking for a very specific thing that would take me hours to find. So I'm looking
for this very specific item or, you know, sweater that is probably available on The RealReal or somewhere,
but I can't find it. Or I'm looking for an Airbnb with like very specific constraints. So
I think those kinds of things deep research is good for. And then more,
general, like, high-level things you should use, like, normal search for.
Yes. Well, I will admit, I have had some multi-year browsing slash shopping tasks,
but I'm now making a cron job for deep research. I want to ask just one more experience question,
which is, was there a particular, like, win or failure that surprised you in the training
of deep research? It really was one of those things where we thought that, you know, training
on browsing tasks would work, you know, felt like we had good conviction in it, but actually
the first time you train a model on a new dataset using this algorithm and seeing it actually
working and playing with the model was pretty incredible, even though we thought it would work.
So honestly, just that it worked so well was pretty surprising, even though we thought it would,
if that makes sense.
Yeah, yeah, it's the surreal experience of, like, oh, the path is paved, but...
strawberries or whatever. Exactly. But then sometimes some of the things that it fails at are also
surprising. Like sometimes it will make a mistake where it will do such smart things and then make
a mistake where I'm just thinking, why are you doing that? Stop. So I think there's definitely a lot of
room for improvement. But yeah, we've been impressed with the model so far. I'm used to all my
technology tools being instantaneous. Deep research is not instantaneous. It's thinking and using
tools. Can it be faster? Yeah, I do think there's a good middle ground in between where sometimes
you don't want it to do really deep research, but you want it to do more than a search.
And I think that we will release things soon that people will be happy about and we'll fill that gap.
Okay. I don't know how to communicate this preference, but I want to, like, toggle at some point, because I would say this to a human: I want you to do as good of a job as you possibly can in the next five minutes.
Yeah. See, that's something where I think it seems like bad UX to actually make the user make that decision. The model should just be better at
knowing how much time to think. I think we made a decision when training the model that we just
are going to go for max thinking time every time. So I'm sure I will ask it a really simple query
sometimes just to test and then get quite frustrated that it's still thinking. So I do think
that that's also an area for improvement. It's, you know, knowing how long to think for. But yeah,
I suspect with deep research, we will always be focusing on the tasks that take the maximum length of
time, and then I think, like, o3 or, you know, o-next will have a better in-between. What is an
example of a task you can imagine deep research taking a day on, in the future? I mean, there's some
GPU smoking. Yeah, I think anything that would take... I mean, right now, in five or 30 minutes, it can
do what human experts rate would take many hours. So I guess in an hour it could do something that would
take a human days; in a day it could, you know, do something that would take a human weeks.
Obviously, there will be a lot of challenges to get it to scale like that. But I think you can
imagine it doing a research project that would have taken weeks to complete or, like, write a thesis
or something like that. Okay. I'm going to make our intern compete with it over the next
couple months then. Yeah. Sounds good. If you were to project forward a year, which is a really
long time in AI land. What is something that you think will surprise people that agents can do?
And that will actually be released. So it takes the safety considerations into account.
A general agent that could do a lot of the, you know, help you do a lot of the tasks that you
would do in a lot of different areas. Like for me, I do a lot of coding. I'm hoping that there'll be
an agent that is pretty proficient at coding, but that I will just trust. I'll give it a task
and it will hopefully make a PR or something, but maybe I can ask the same agent to help me book
a trip to Korea or something. I hope that we'll get to a more unified experience. But I also
think that the rate at which these models are improving is going to be pretty surprising to
most people. Why do you think a unified experience is important? Or why do you think that makes sense?
Because I think today it's quite different to think about. Obviously, ChatGPT is one
experience that's very encompassing. But there are models that people use in different contexts
like, you know, next-line completion type models for coding that just feel like a very different
setting. I think that you'll probably want both. Like, you'll probably want an experience where you can
at some point override or interrupt the model and say, oh, no, I didn't mean that. Or you can take
over and like start typing something. Yeah. Especially in the short term as the models are not as
capable as humans in a lot of areas and are more capable in other areas. So I think it will be
a combination of like you asking the model to do something, but then when maybe to go with the
coding example, then maybe you're also in your your VS code or whatever it is, your cursor and
it's been doing something for you, but you can also like actually type and, you know, write some
of it yourself. So I think it will be a combination of those things. But I kind of want it to be
something that is just, like, having a co-worker on Slack, or, like, a remote co-worker. You can just
ask to do things for you, send them a Slack message, and then they'll start doing it.
And then you can, like, review their work or, you know, help at some point.
But it seems like a pretty nice, like, general interface.
And you don't have to think about which agent should I ask to do which task.
Like, you should just be able to figure it out.
The mental model I have for this is my general ethos is actually I love the people I work with.
I prefer to work with fewer people, with less management overhead, all things considered,
because each person has more context and I have more understanding of them.
And so, like, the universally useful agent is attractive.
Yeah, and you only have to tell it something once and it will remember and then it will
have state on everything you're working on, things like that.
Awesome.
Well, this has been a great conversation, Isa.
Thanks for doing this.
And thank you for the product release.
Thank you so much for having me.
And thank you for using deep research.
Find us on Twitter at @NoPriorsPod.
Subscribe to our YouTube channel.
if you want to see our faces, follow the show on Apple Podcasts, Spotify, or wherever you listen.
That way you get a new episode every week.
And sign up for emails or find transcripts for every episode at no-priors.com.