OpenAI Podcast - Why Tejal Patwardhan stopped underestimating the models - Episode 21
Episode Date: June 16, 2026
The old tests are getting too easy. Tejal Patwardhan leads OpenAI’s frontier evals team, which is finding new ways to measure and forecast progress as models become more capable. She and host Andrew Mayne discuss why evals matter for research, how benchmarks can break or get gamed, and what models need to be judged on next.
Chapters
00:00:24 Growing up at OpenAI
00:03:10 Why reasoning changed everything
00:06:28 What made o1 surprising
00:11:20 Why old benchmarks stopped working
00:14:45 What makes a good benchmark
00:17:35 Why evals are getting harder
00:22:09 Measuring voice and vision models
00:24:48 Testing models on real science
00:33:23 How OpenAI tracks frontier progress
00:40:47 What AI means for work
Transcript
Hello, I'm Andrew Mayne, and welcome to the OpenAI Podcast. On today's episode, we're talking
with research lead Tejal Patwardhan about the need to build frontier evals as old benchmarks get saturated.
Generally bad. Benchmaxing is bad. How can we make these models useful for people in their real work?
We were really nervous because we were like, this human baseline is kind of hard. We don't know if the
model is going to beat it. But we should never underestimate the model.
Tejal, I have a question. How did you end up where you are? What brought you to OpenAI?
Oh, I thought we weren't going to start with this.
Tejal, I have a question for you.
What would you like to start with?
Can we start with what you did when you started at OpenAI, and you can work backwards?
Don't you want to talk about your early days?
No.
I grew up at OpenAI.
Okay.
Tell me a bit about your journey here working inside artificial intelligence, inside OpenAI.
So I joined OpenAI in fall '23. It was right after ChatGPT had come out, GPT-4 was out, and OpenAI had started its Superalignment team. I joined the Preparedness team, which was getting started as we began to look at how capable these models were becoming and think about what the next generation of models would look like. At the time, it was extremely exciting, because right after I joined was when some of the early results for the reasoning models had started to pick up. We were thinking: if these models really take off, what will the future of capabilities look like, and how can we be prepared for that future? And so we did a whole bunch of work on threat modeling and on what evals we should be running.
How do we think about releasing a model like this?
It was a very exciting time to join.
What got you interested in this area?
Yeah.
Well, to me, evals are really exciting because they're a way to measure and understand what our models can do and to see progress before it happens. There's this term, capability overhang, which is the idea that the models will be capable of things long before people actually adopt them and use them for those capabilities. There might be cultural or legal or regulatory barriers to using a capability even before it's ready. And so being someone who can help develop and measure our models via evals helps you really understand what this technology can do and see the future before it happens, which is very interesting. I also think it's important because it can help ready the world for what's happening.
When I originally started here, part of why I was really excited to work on some of the preparedness evals was because I thought these models were getting very capable. And it felt like a lot of my friends in my real life didn't really understand how powerful these models would soon become, because they'd look at a ChatGPT output and be like, yeah, it's hallucinating, it's kind of not that smart, and it kind of reads like AI slop. And it's like, well, that's now. But the question is the slope. If the slope is very high, then change might be happening much faster than one would expect. And so I think one of the greatest services we can do is measure and share with the world what progress looks like, especially because there's often this capability overhang before people really understand and feel that in the models themselves. So that's part of why I think all of this is very important.
Reasoning was such an exciting moment. And for most of the world, that didn't happen until a year later, when they found out about this. But what was that like for you, to all of a sudden understand that if you gave the models a longer time to think about things, you got better results, even though the size hadn't gotten bigger?
It was a really fun time. In some of the early experiments, which we've talked about now, the model was trained really just on math. And I remember there was this set of experiments where Nat McAleese was like, hey, the model is trained on math, but if you eval it on GPQA, which was this benchmark with biology and chemistry and physics problems, the model is doing really well. This is very interesting. And smarter models are much smarter. He had put together this forecast that said, at the time, that if progress kept going, within six months we'd have human-level performance on science from just training on math. And we were like, oh my gosh, that's crazy.
At the time, this was extremely locked down. We kind of found our way to curl to be able to see some model outputs, and you were like, wow, this is one of the smartest things I've ever seen. I've never seen a model reason like this before. It was just like, if this becomes a paradigm that continues to scale.
But then we looked back and you were like, you know, GPQA was PhD-level biology, chemistry, and physics, and you were like, what is that? We really need professional level. And we just kept changing the stakes of what counted. But yeah, it was very cool.
I remember early on when AP Bio was the benchmark, just to try to see if the model could do that.
But what's interesting, since you brought this up, is that a lot of the stuff that comes out of OpenAI is math-focused.
Math has been useful because it's more objectively verifiable in some way.
So some of the earlier problems that we trained on, it was just easier to do RL and scale up the reasoning paradigm on math.
And math is also useful in various ways.
You know, it's like one of the core, you know, types of science.
But also in many ways, it's just happened by coincidence to be a thing that we focused on.
But it's not necessarily the end product of what we even want to focus on in research.
Like we're now realizing, okay, if we can do this for math, can we scale this up for other types of science, for professional work, for, you know, for capabilities that are useful to humans on a personal level.
And so I think math is more like the proof point versus like the end goal.
But it does seem like you said, though, that if something is able to think for a long time, break something down into steps and think through them as you have to do for really complex medical problems, it does just carry over.
Well, this is a big debate.
So, like, some of it definitely carries over, like the general idea of reasoning can be useful.
But then also there could be some domain specific skills or tools or types of reasoning that you would need in different domains.
Like, for example, for coding, you need to be able to actually write and execute code and test code if you want to scale up a coding agent.
And so something we've thought about a lot in terms of both evals and then also training is how do we make sure we also give the model the skills and tools and affordances that it would need to reason in that particular domain.
And some of the benefits of math will translate.
And then also you might need some domain specific scaffolding to really pull out its full abilities.
Like kind of, you know, like a general high school or liberal arts education and then like a specialized education.
Reasoning models were just a very interesting moment because I think they changed a lot of the ways we thought about what was possible with just a certain amount of compute, if you let a model think longer and gave it the opportunity to come up with more complex answers. Were there any interesting things that happened with o1 that surprised you?
So the o1 release process was very exciting. We had been thinking about the reasoning paradigm for a very long time. And there were people who were worried about making sure we didn't release it too soon, just because it felt like a paradigm shift, possibly the thing that got us to AGI. Like I said at the beginning, we thought we'd have AGI in six months when some of the early runs were happening. And so there was this question of, okay, how do we put this out responsibly? How do we test this technology?
During the initial launch review for o1, during some of our cybersecurity tests, there was one of the first examples of the model breaking out of the sandbox. We published about this. It was supposed to be in this Docker container during this capture the flag, and the model found this security vulnerability in how we had implemented the capture-the-flag scenario, and it broke out. And we were all like, oh no, what else has the model done if it did this? It was kind of a feel-the-AGI moment, one of many. I feel like ever since then, there have been many other such moments where the model has done something really surprising or intelligent or novel that we didn't even think of when we were doing the tests. And then you would come back and look at the transcripts and results and be like, wow, these guys, they're clever, they're clever. And then it was just very important that we published and made sure the world knew the models can do this sort of thing.
Yeah. There was this period right before o1 was announced. A lot of people were like, oh, looks like we've hit the wall. It's been a few months since anything's happened. Then o1 came out, and they're like, what's a wall?
Hitting the wall is just so not the right way to think about it. Yeah, I get very frustrated when I see posts like that, because I feel like I've been looking at this model improvement and this progress for a long time, and it just keeps getting better. Like, it just keeps getting better. And if I look at our research roadmap now, I see no signs of stopping. Things are just going to keep getting better. This is going to be a really crazy year. A lot of really cool research is going to come out. And I think this is probably true across the whole industry. So, yeah, if anything, people really under-expect from the models.
It seems like, though, OpenAI releases a lot. They tell people where things are headed and say that this looks interesting. Sometimes people forget this. Or you get rumors of stuff like Q-Star.
Q-Star, man.
Very interesting.
But no, people don't realize.
I don't know.
I feel like we try to be very open and say like, hey, guys, here are some plots.
Like the lines are going up.
Things are really capable.
I think maybe there's this meme that, oh, the researchers, they don't understand.
The models are only good at math and research, but not good at things in the real world.
But I just don't think that's true.
I think even people from other occupations who have transitioned into OpenAI are starting to see that our models are picking up all sorts of things.
And I know it's like it might seem like the researchers are trying to overhype the model or something.
But if anything, I think we're underhyping the power of them.
You brought up AGI.
If I brought GPT-4 from March 2023 back into, let's say, 2020, I think people would have called it that.
And now we have a very different idea of this.
People talk to AI every day, they have long conversations with it. Nobody talks about the Turing test anymore, which is one that nobody really understood what he was trying to explain, you know.
But now we're well past that period.
Is there an eval for AGI?
Yeah.
I mean, the models passed the Turing test and no one talked about it.
It's kind of crazy.
Yeah, like I think models are pretty much indistinguishable from humans in many, many situations.
In terms of the test for AGI, I mean, there's the classic one: whether a model can do most economically valuable work.
And I think people are increasingly using the model for large parts of their work.
And I think there'll be like a big spectrum and debate of like when exactly this happened.
But gosh, I certainly feel like Codex does a lot of work for me.
And I feel very lucky to have unlimited tokens, you know.
So that's certainly.
Another reason to come work here.
Please join.
Yeah.
But yeah, I think there'll just be a moment when people realize that they're using the models for so much of their work. And also with the scientific breakthroughs that we're going to see, I think at some point it'll be incontrovertible that these models are really, really powerful.
We're getting mathematics experts talking about how good the models are getting at that
and we're getting physicists talking about doing that.
I think that we're starting to see some real work come out of it which is just exciting.
Yeah.
So you brought up part of the problem with some of the earlier evals.
A lot of them were inherited from older natural language processing methods and stuff.
And when you're looking for ways to measure the success of this, some of these were literally so simplistic that pretty much those benchmarks got passed, and then you had to figure out new categories of stuff. How have these been evolving?
It used to be that our models couldn't pass even the academic benchmarks, so to speak. Classic tests that someone would take in high school or college, or more multiple-choice types of questions. And as the models got smarter, we had to make things more and more realistic. So one of the first benchmarks that we put out more publicly was this benchmark called SWE-bench Verified, which was testing how well the model could interact with real codebases in Python, like Django, and complete PRs and that sort of thing, and pass unit tests. And then those became even more advanced, where we were like, okay, can the model take multi-step actions in some complex environment, take actions on the computer, take actions that link up to the real world, like with some of our wet labs and biology work. So I think over time, as the models keep getting better, we have to be more ambitious with how long-horizon and how realistic our measurements are. And doing that is very fun because you have to stay ahead of the pace of progress.
So, two terms I want you to unpack. When we talk about benchmarks, you often hear benchmaxing.
Yeah. Benchmaxing is, I would say, this idea that someone training a model is just trying to look good on some evaluation or benchmark and not actually making the model generally useful. And I would say that's generally not super helpful because you want the model to be good at the real thing that the user might want to do. You don't just care about it looking good in some marketing copy, because when a user uses it, they'll be like, hey, this is not quite what I signed up for.
And so generally bad.
Benchmaxing is bad.
Yeah.
And I think the way I've heard it explained that kind of makes sense is that you have X amount of compute budget, time, how much you're going to spend on it.
And you can spend a large part of that making the model just overall very good.
Or I can say I'm going to spend 90% of it so my evals are going to look really good when I release it.
And sometimes we've seen people literally just use those evals for it, and it comes out and it's like, oh, that's a great model.
And then you find out, oh, it's only good at that.
Yeah, that's not a great experience for the user.
So I think something that the OpenAI research program has done quite well is try to be very disciplined about making sure we are investing in general model improvements in the areas that really matter.
And then, you know, you'll run some evals at the end for comparison.
But the goal should not be, oh, we just want to look good on an eval.
We want to make a model that's useful to push forward the frontier of science or push forward the frontier of work or something like this.
And I think Jakub has done a really good job also of enforcing this throughout the research org.
Like we should be really scientific and honest.
And that's included, you know, we've published results where our models were not the best before.
We just want to publish the reality and make sure that we are painting a very accurate picture of what our models can do and then aim to make them useful in the real world as much as we can.
You mentioned the software engineering bench as one of the metrics that's maybe not as useful now, and we hear the term saturated.
Explain what it means when a benchmark is saturated.
Saturated is when a model is close to passing all of the questions correctly,
like getting close to 100% on the test.
And once a benchmark is saturated, it's not super useful because you can't really tell
models apart with that test.
It's like comparing two geniuses on like a high school math exam.
They might just both pass, but that's not very useful as you're trying to separate
really, really smart pieces of intelligence.
So the challenge is always to make more and more difficult, realistic, unsaturated benchmarks that you can then measure models against over time and forecast sort of where progress is going.
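To make the idea of saturation concrete, here is a minimal sketch in Python. It is only an illustration: the function names, the 0.05 headroom threshold, and all of the scores are hypothetical and do not come from the episode.

```python
def headroom(scores, ceiling=1.0):
    """Distance between the best model's score and the maximum possible score."""
    return ceiling - max(scores.values())


def is_saturated(scores, ceiling=1.0, threshold=0.05):
    """Treat a benchmark as saturated once the best score is within
    `threshold` of the ceiling (the threshold is an arbitrary assumption)."""
    return headroom(scores, ceiling) <= threshold


def spread(scores):
    """Gap between the best and worst model: how well the test tells models apart."""
    return max(scores.values()) - min(scores.values())


# Hypothetical pass rates for two models on an old and a newer, harder benchmark.
old_benchmark = {"model_a": 0.97, "model_b": 0.98}
new_benchmark = {"model_a": 0.18, "model_b": 0.41}

for name, scores in [("old", old_benchmark), ("new", new_benchmark)]:
    print(name, "saturated:", is_saturated(scores), "spread:", round(spread(scores), 2))
```

On the hypothetical old benchmark both models sit near the ceiling, so the small gap between them says little; the harder benchmark leaves room to separate them.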
How do you do that now? How do you figure out what a good benchmark's going to be?
Yeah, I mean, the best benchmarks, I think, are really realistic and measure something people actually care about.
So one of our first forays into doing this, which we published a while ago now, was called GDPval. I was really excited about the idea of having a measurement for how the models could interact with the real world. We were really having this crisis of evals where we kept training successively better models, and on SWE-bench they looked about the same, because they were all doing really well and we were reaching the top of what that benchmark could measure. And we were like, man, we have no idea how to measure what people actually want to use our models for. And so there was very much a, hey, the Bureau of Labor Statistics has a list of all the top jobs and all the top tasks per job. Like, if you're a financial analyst, doing an investment diligence, or writing a legal memo, or writing a paper based on a piece of research, or something like this. And the idea was, can we actually give the model those tasks that someone would want done in real life, with the context they would have at the time, and then see how the model solves those tasks?
And at the time, when we tested one of the earliest models on this benchmark, it got, you know, less than 20%. If you compared how well a model would do on this well-specified work task to a human, the model was way worse. I'm really proud of the org for being like, actually, you know what, we should publish this new way to measure and forecast progress on real-world economic impact. And it's been very useful to a lot of economists. And also our models now are the best. It's very cool, because I think at the time we were not really investing in real-world work in some of our training programs and weren't even measuring or tracking it. And I think now there's a lot more focus on how we can make these models useful for people in their real work, like for real scientists. And this kind of helped catalyze a wake-up call that, hey, maybe we should also think about how to measure how stuff is used in the real world. So that was pretty cool.
But now we're like, okay, this benchmark's probably too easy because it's extremely well-specified. Each of the prompts is hundreds of words of, I want you to go to this spreadsheet and make this change and do this thing and then take that calculation and put it in a memo. It's very detailed. And I think the next step is, how do we give the model as much ambiguity as you would give a report in the real world? If a manager asks, hey, can you run this analysis for me, they should go figure out what to do, put that together, run the analysis, and give you an output. And so I think we've been working a lot on more realistic ways to measure real work in the real world, whether that's in science, for personal use, or even for enterprise.
There seems to be something to the idea of instead of hiding a benchmark, putting it out there
because internally as an org, you go like, okay, this can't stand.
Yeah. It really motivates research also. I think people want to know the truth, and they want to know where we can be better and deliver a better model for our users.
And so knowing the gaps is quite useful.
What do you think the current limitations are right now with the ways that we're doing evals?
I think the types of work that we're doing now with Codex and with our latest reasoning models like 5.5 are just at such a different level of capability than what we had even six months ago, where a static benchmark just doesn't measure how long you can get work out of these things. These models can work for days or weeks for you. Internally in research, we've had the models run for really long periods of time to do work. And one of the problems with an automated eval is you kind of need it to run within some amount of time and get results to be able to look at them. So a lot of the ways that we're measuring models now also include looking at production usage, looking at real-world use by people, and seeing what they're using it for and what types of tasks they're able to get done, because the time horizon of how much work is done by the model is just getting so much longer.
It was interesting watching, for instance, long context.
There was kind of this early race for companies to say that, hey, our models can take, you know, 100,000 tokens, a million tokens, whatever.
But there wasn't a lot of evaluation on how well that was.
And then we got needle in the haystack, which is a method of seeing if it could find a word or whatever.
And I think that people sort of assumed that that was a solved problem, but it wasn't.
It was just the benchmarks weren't really good.
And then we had to have better benchmarks.
And is that what made it better, that people could finally spend more attention solving that problem once they understood where it was failing?
Yeah, we definitely have better benchmarks for this sort of thing now.
And then also sometimes these problems reveal gaps in how we're thinking about training.
So one example is we used to think, oh, what matters is just how much context you can stuff into the model at test time. Whereas now it seems that you can just dump a bunch of files in a container and the model can kind of grep around and search for what it needs, when it needs it. This ability to have search or tools to figure out what context you should use can be more efficient than just stuffing everything in the context. And we wouldn't have really realized that without trying it out and then seeing how that performed on various benchmarks. So I think this makes the model a lot more useful because, for example, now the model can search over a whole repo and find the files that you need and understand the context of where you're making changes. And the same is true for many work contexts, where folks in Codex can now upload their local file system.
And, you know, you might have made PowerPoints before or sent Slacks that are
relevant to the work that you're doing now.
And the model can sort of search over that context with tool calls.
And so we're not as limited by how much you can literally stuff into context because the model can search.
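The contrast between stuffing everything into context and letting the model search can be sketched roughly as follows. This is an illustrative toy in Python, not how Codex or any OpenAI tool is implemented: the in-memory file set, the `grep` helper, and the `stuff_everything` helper are all hypothetical.

```python
import re

# Hypothetical in-memory "file system" standing in for files dumped in a container.
files = {
    "billing/invoice.py": "import decimal\ndef compute_tax(amount):\n    return amount * decimal.Decimal('0.2')\n",
    "docs/notes.txt": "Meeting notes\nNothing about taxes here.\n",
    "billing/report.py": "from billing.invoice import compute_tax\nprint(compute_tax(100))\n",
}


def stuff_everything(files):
    """Context-stuffing baseline: concatenate every file into one big prompt."""
    return "\n".join(f"# {path}\n{text}" for path, text in files.items())


def grep(files, pattern):
    """Search tool: return (path, line number, line) for each matching line."""
    regex = re.compile(pattern)
    hits = []
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((path, lineno, line))
    return hits


hits = grep(files, r"compute_tax")
searched_chars = sum(len(line) for _, _, line in hits)
stuffed_chars = len(stuff_everything(files))
print(f"search returned {len(hits)} lines ({searched_chars} chars) vs {stuffed_chars} chars stuffed")
```

With a tool like this, only the matching lines need to enter the context, and the unrelated notes file is never read into it at all.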
Do you have any favorite evals?
My favorite eval?
I mean, GDPval is my favorite public eval.
Okay.
But I have many internal evals.
I will say the name of one of them.
It's called Houdini Bench, and I cannot explain further.
Oh, my God. You know, I was a magician, right?
No.
Yeah.
Maybe.
I don't know if you'd pass Houdini Bench.
No, I'd probably not pass Houdini Bench.
That was actually one of the things I played around with on some of the early vision models: using photographs of magic tricks and stuff and seeing how it did.
That's very cool.
Yeah, multimodal brings a whole new element.
Like, I remember when 4o had first come out, there was a group of us sitting on the roof of this building, and our minds were just so blown by the idea of having a real-time voice model.
And then we were like, how do we even eval this thing?
Right?
Because the whole paradigm of doing things in text and code and on your computer is just completely blown away if there's like a voice interaction in real time.
Something that was really interesting about that launch is, and we said this publicly at the time, is we actually delayed the public launch by six weeks as we were figuring out how to make sure the model was safe.
This was 4o.
Yeah, because this was before the elections, actually.
And so there was like a lot of worry of, oh, if the model can, in real time, talk to you with a realistic sounding voice, could this be used for persuasive propaganda or this sort of thing?
And it was very cool.
The company delayed the launch to make sure we could build out all of these tests and build in mitigations to make sure the models couldn't be used for this sort of thing.
Well, it seemed like that was a very complicating factor as these models became multimodal. I remember early on with GPT-4, with GPT-4 vision back when that came out, I have terrible handwriting, and I could write a prompt by hand and all of a sudden it would solve it. And you realize, oh, it's not a text prompt, it's a visual prompt. And then with the audio models, when you're doing audio in, audio out, the model could emulate things and could do stuff in such different ways. So where do you begin trying to figure out how you're going to measure that?
Yeah, I mean, it's just a lot of work. Usually, for any of these, we start with: what would humans do in this case? So you would have a set of inputs that you put into the model and a set of outputs you would evaluate. And then you can build up from there: okay, can we automate some of these? Can we build a new platform to measure this sort of thing at scale? But for some of the natively multimodal models, you just have to rip apart a bunch of your infra and make stuff work. This was also true with Sora, where we were interested in making sure the videos weren't overly realistic or couldn't be used for the wrong thing. And that required, especially from safety, building up a whole new stack of evals and mitigations, including refusals at the model level and monitoring when this was being used in prod. And yeah, it requires a whole new stack of thinking.
Yeah. Well, that's the thing, too: when you start to think about it, how do you prioritize one eval over another? When do you decide, or do you just go, look, this one's saturated, we move on? Because even though you may not be trying to optimize towards certain public benchmarks, you still have to figure out, what's important to us now? Like, there was a time when OpenAI was leading in code, and then there was a time when it wasn't. Now it is again, but there was a dark period where that happened.
Yeah, we try not to get distracted by public benchmarks too much because it can be kind of noisy.
Internally we have this thing called AGI index, which is inspired by the idea of CPI or inflation, where you have some weighted basket of goods and you're tracking the price of those goods. It's the same thing for us: we have this basket of evals that includes measurements across all of the core areas we're interested in. That can include alignment, it can include safety, it can include capabilities. It's just sort of what you want from your model.
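The weighted-basket idea can be sketched as a weighted average over eval scores, in the same spirit as a price index over a basket of goods. This is only an illustration: the episode does not say which evals or weights the internal index uses, so the category names, scores, and weights below are hypothetical.

```python
def basket_index(scores, weights):
    """Weighted average of eval scores, analogous to a CPI-style basket.

    `scores` and `weights` map eval names to values; both are assumptions
    for illustration, not the composition of any real internal index.
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same evals")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in weights) / total_weight


# Hypothetical basket: one score per area, with capabilities weighted double.
scores = {"capabilities": 0.6, "safety": 0.9, "alignment": 0.8}
weights = {"capabilities": 2.0, "safety": 1.0, "alignment": 1.0}

index = basket_index(scores, weights)
print(f"index = {index:.3f}")  # (0.6*2 + 0.9 + 0.8) / 4
```

Swapping a saturated eval for a harder one changes the basket, which is why a design like this needs the basket to be versioned if index values are compared over time.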
And we just iterate. We keep updating that index to represent more and more the difficult version of what we want our models to do. We track that index internally and try not to be distracted by, you know, trying to benchmax some public benchmark or something like that. It's more having a blend of evals across the different domains that we care about, across science or work, and then also safety and alignment, and making sure we keep making progress on that weighted basket. Try to stay focused.
We've watched this evolution of these evals. We've watched the evolution of the models. And I've talked to people here working in the sciences, people who are active in the sciences, not just researchers who like science or computer science, but people who are in biology, mathematics. Can you tell me what's going on with the evals at the scientific frontier? Because we're at this point now where it seems like we're going to see meaningful results.
Yeah, I think the work in some of our science evals is some of our most exciting. So, in the past few years of evals that we've made public, the first tier was this eval called Frontier Science Olympiad, which was kind of the equivalent of the math Olympiad-style evals that we had before, where we were measuring how well the models could do on high school Olympiad-style problems in biology, chemistry, and physics. They were shorter-answer, but still quite hard, and the models weren't very good yet. And then the next phase we did was Frontier Science Research, which is also public, and people can run this. It measured how well models could help complete unfinished biology, chemistry, and physics theses. So we had people who were PhDs or professors in these fields who had some text that was not published, like maybe part of their thesis, and we turned that into an evaluation where the model was given maybe some input data or some initial starting point, and it had to fill out the rest of that paper and was judged against a rubric for how well it did. And that was starting to measure, okay, are the models starting to do research? Are they using tools, this sort of thing?
And then one of the final iterations of this was to see how well the model could do in the real world in a wet lab.
And so we worked with this company called Ginko Biowworks that has a bunch of really cool
automated wet lab robots where the model had to optimize this protocol for protein synthesis.
And the idea was the model would generate a protocol.
and then they would actually automatically test it in the wet lab,
or they would put in the reagents the model suggested,
and then see what protein yield they got.
And this was for a protein that's sort of related to this ovarian cancer drug
or it's sort of a toy scenario for that.
And the model, like, we were really nervous at first
because we were like, this human baseline is kind of hard.
We don't know if the model is going to beat it.
But we should never underestimate the models because, you know,
it's just the curve is pretty clear.
Every cycle it just got better and better, beat the human baseline,
and set the state of the art on how efficiently, in cost per yield, the model could generate this protein.
And I think that's just the start of how if we give these models optimization problems,
like, you know, go try to figure out how inexpensive you can make this vaccine or, you know,
synthesize this protein that's important for a drug.
The model can just go and keep optimizing these protocols with real world inputs.
And it was one of our first times de-risking an eval that's actually connected to the real world.
Like, we weren't waiting for a piece of code to run.
We were waiting for the robot to finish the experiment so we could record how much protein was synthesized.
And yeah, I just think the models are going to do so much science for us.
It's going to be really interesting.
Well, that was exciting because that was just like, I think, GPT-5, and it hadn't gone through any sort of,
here's how to be a scientist.
And now these models have progressed a lot since, and they have a lot more real-world experience with this.
Yeah, that wasn't even with one of our best models.
It was like just an early reasoning model.
And so I think, yeah, all of these things stack.
We'll have better pre-training.
We have better RL and post-training, and we're going to get a lot better at using these models at test time to really elicit their capabilities.
And I think the next generation of evals is really about how can we have these models take actions in the real world and solve sort of unsolved problems for us that would take humans a long time, you know, some of these scientific problems that we haven't been able to put enough effort against.
It's like, well, now we have all of these agents that can spend compute to solve problems for us and try to steer them towards what would be useful.
It does seem like that brings in a new challenge, though.
Do you think that evals are going to get a lot more complex?
Yeah, I mean, we have the saying on our team that pain is the moat.
I really think a lot of operations in the physical world will become part of the bottlenecks
in being able to measure what the models can do because even just starting with digital,
there's so much more scaffolding and infrastructure work we need to do to run these.
Like now if you want to test how well Codex does, it's like, well, the model is calling APIs.
It's like taking actions on your computer and in your browser.
It's making artifacts for you.
It's writing and running and executing that code.
It's just so much more complex to measure that model.
And that's only digital.
Now if you want to measure how the model could interact with the physical world,
there's all sorts of ops and logistics that you need to have a really smooth process for
to see how you can deploy these things at scale.
And yeah, I think a lot of the work is actually shifting from being like theory or math or even programming.
I feel like people don't program that much,
they just ask Codex, and more shifting towards like planning, operations, physical stuff,
or at least my job has shifted a lot that way.
And those things are very hard.
It's actually kind of easy to just like write something like in a corner.
It's a lot harder when you have to manage all of these operations and logistics.
It's exciting, but it seems like part of the challenge is these aren't just simple evals anymore.
They take more compute.
They take more time.
When you're trying to do a long horizon eval, you know, it's long.
You have to wait a long time to get the outcome on that.
Yeah, definitely.
So it's both a lot more work to come up with the evals and run them at scale.
And also if, you know, the work takes a longer amount of time,
we don't get the signal as fast.
So we have to invest more in scaling laws where we can predict, okay, well, if by one day the model looks like this,
then we can forecast that at seven days it would look like this and sort of come up with trends
so that we can get signal faster.
Otherwise, we're just like stuck there waiting for a week to get an update, which is not
the most productive way to spend time.
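The forecasting idea described here, using an early reading to predict where a long-horizon run will land, can be sketched roughly as follows. This is only an illustration: the log-linear trend, the function names `fit_log_linear` and `forecast`, and all of the numbers are assumptions made up for the example, not anything described in the episode.

```python
import math

def fit_log_linear(hours, scores):
    """Least-squares fit of score = a + b * ln(hours).

    The log-linear shape is an assumed trend for illustration only.
    """
    xs = [math.log(h) for h in hours]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def forecast(a, b, hours):
    """Extrapolate the fitted trend to a later point in the run."""
    return a + b * math.log(hours)

# Hypothetical scores observed during the first day of a long-horizon eval.
early_hours = [1, 3, 6, 12, 24]
early_scores = [0.10, 0.21, 0.28, 0.35, 0.42]

a, b = fit_log_linear(early_hours, early_scores)
day7 = forecast(a, b, 7 * 24)
print(f"fitted trend: score = {a:.3f} + {b:.3f} * ln(hours)")
print(f"forecast at day 7: {day7:.3f}")
```

A real version would presumably need to check the forecast against completed runs and handle saturation, since a score cannot keep growing past its maximum.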
I have certain benchmarks and things I use to test every time a new model comes out to find out how it's personally useful to me.
And one thing I tell people who run businesses or other things is to think about your own evals, things that will tell you where something is.
Because sometimes people might try something, they might try ChatGPT six months ago and go like, eh, it wasn't good, it didn't do this.
They don't realize how fast things move.
Do you have any advice for people on how to figure out how to come up with a benchmark?
Yeah, I mean, things move really fast. Things change every couple of weeks,
and I feel like people are not as awake to that. In my job, I'm one of the first people in the world to see some of the most powerful models.
So I'm extremely AGI-pilled.
And I think progress is happening a lot faster.
What have you seen?
I've seen good models, man.
Yeah, but progress is happening a lot faster than people would think.
And I think the best eval, honestly, is just to dogfood or use the model.
Like, people should just try to use the models as much as they can.
And even if there are things that they think the model didn't do well one week, they should just try
again the next week, it'll probably work. I think one of the things that should be obvious to
people kind of outside AI is how the really good frontier AI companies are using these tools internally.
And that's why things are speeding up and getting more capable. Yeah, I basically try to have the
model take a first pass of everything that I do. Like whether it's, you know, sending a Slack message,
like understanding what experiment to perform next, like any management stuff, ops, logistics.
Like you'd have the model take a first pass. And then if the model's not good,
we figure out how to put that in the eval.
I'm excited about the computer-use evals. Like, just watching the performance of Codex,
the computer use is just light years beyond where it was just, you know, maybe eight months ago.
And it seems like those things are just going to get faster and better.
My prediction is, like, probably by the end of the year, it'll use my computer better and
faster than I do.
Yeah.
Yes.
I think so.
The models have some advantages over you, right?
Like they can call a connector or a plugin, which is a much faster mode of communication
than you on your computer having to, like, go click into a service.
and understand every page and then copy some data back and forth,
or even writing some service to call that API or MCP or whatever.
It's like more work for the human than for the model.
So the model has that advantage.
And the model can just be faster if it's trained to navigate a browser or desktop,
whether it's through the accessibility tree or through code.
So the models have an advantage over us.
And I think for a long time there was really no product deployment that was very effective.
We launched Operator and ChatGPT agent a while ago, and those were really useful for showing, like, this could be possible,
but the latency on those models was just too high.
They were just super slow, and I don't think people used them at super high scale yet.
But we've now reached sort of a tipping point where doing things like asking the model
to read my Slack for me, or like go schedule a bunch of calendar invites and optimize the rooms,
is faster for me than it would have been to do it myself.
And I think, yeah, people are not ready.
Also, a lot of people haven't tried the stuff out because it's all launched so recently.
But everyone should go get the computer use plugins and use those and like install all the
plugins and all the good connectors that will make things faster.
Then you'll be mind-blown.
Let's talk about Frontier Evals.
Yeah.
So the goal of the Frontier Evals team is really to measure and forecast progress of the frontier
models at OpenAI to better understand where we are, where we're going, and sort of try to share
that with the world.
And one of the things I think the team has tried to do is to help publish and open source as much as we can.
So, you know, some evals that we've helped open source include like SWE-bench Verified, which helped measure progress on coding, MLE-bench, which was a way to measure how well models could train other models and sort of track the progress of machine learning engineering skills in our models.
PaperBench, which was a way to measure how well models could replicate real top machine learning papers from like ICML or
ICLR, and GDPval, which, you know, helped measure how well models could perform on real
world tasks across, you know, over 40 occupations. And the goal for all of these has been,
you know, the models might not seem good now, but if you just plot how they increase with each,
you know, the results that improve with each model generation, often when people say like,
oh, well, I expect this will take like a year or whatever, they overestimate
how much time it will take to saturate a benchmark. And like even my
own or people on my team's predictions are often like not ambitious enough for how fast things
will change. And so I just think we're trying to do our service in helping inform the world
about what is possible. I think some of these research acceleration evals in particular are
quite interesting. Like when we first started, we had this eval called the OpenAI research
interview eval, which was just taking the researcher questions that we asked people applying to
OpenAI and putting those in an eval. And the model blasted through that pretty quickly. It
like definitely can pass our interviews right now, which I think has caused a whole other slew of
downstream questions on like, how do we make sure people don't cheat on the interviews and like
how do we actually measure research talent? But I think all of this is very useful because
measuring internal progress is like kind of a way to measure the lever by which the models
will keep getting better, faster, like sort of the acceleration of the slope of improvement,
so to speak.
And, yeah, I think having ways to measure model progress is just good information.
I've heard that in some of the evals that were out there for a while, it turned out
that there were actually errors in the questions, that that was an issue with some of the evals,
that with some of the publicly available ones you actually couldn't score above a certain
level.
And if you did, it was actually because you were training on the data and people looked at
that and found out like, oh, there's actually, this is not the right answer. Yeah, this is a problem
with a lot of public benchmarks. I think, like, so the original reason for SWE-bench Verified was
because we wanted to run SWE-bench and half the problems were like either broken or
underspecified. And, you know, people in the industry were publishing results on this as some metric
of how well you did. And we were like, well, we should at least try to fix it and then like share that
so we can have a better yardstick. But I think one of the reasons that public benchmarks maybe
aren't always as, you know, battle tested as we'd like, is that they tend to be like,
you know, someone in a lab, like an academic lab, like had a good idea and like wanted to write a
paper, but they never had to run that eval at scale in like a production training run or a
production-level eval sweep for a launch. And just when you run some of this stuff at scale,
it like breaks or falls over and you like catch all of these bugs. And so I kind of think
sitting in a lab and being closer to product is a forcing function for making sure the quality
of your measurements is really high because like we're not doing this to like look good in a paper.
We're like doing this.
Like it has to work because it has to work for our systems at scale.
So it kind of forces the quality to be high.
And it seems like one of the things that can happen as these models become incredibly capable
is that sometimes they solve a problem by taking the laziest path; they can give you the memorized answer instead of solving it.
And we saw that with counting, like how many of a certain letter are in a word or whatever.
And it was often the model, if you prompt it right, it would get the answer right.
But if you didn't prompt it the right way, it would just sort of throw you an answer.
Yeah, that brings up all sorts of interesting concepts.
I mean, so there's one concept of memorization, which is the idea that the model literally
knows the answer and doesn't have to really think or reason to solve.
It's just like regurgitating something it already knows.
And that makes the measurement not super useful because you're just measuring whether
you happen to have trained on that data a ton versus whether the model learned the skill or tool
or capability you were trying to measure.
So one way to avoid that
is to try to be really clean and disciplined
about your data, not including any benchmarks
or any evals that you want to measure,
and that helps solve sort of the first problem
that you laid out.
So that's one thing.
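As a rough illustration of the data-hygiene point above, one common way to keep eval items out of training data is an n-gram overlap filter. The sketch below is an assumption-laden toy: the function names, the whitespace tokenization, the default n-gram length, and the sample strings are all made up for the example and are not described in the episode.

```python
def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in a piece of text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def filter_training_docs(train_docs, eval_items, n=8):
    """Drop any training document that shares an n-gram with an eval item."""
    eval_grams = set()
    for item in eval_items:
        eval_grams |= ngrams(item, n)
    return [doc for doc in train_docs if not (ngrams(doc, n) & eval_grams)]

# Hypothetical data: one eval question and two candidate training documents.
eval_items = ["What is the yield of protein X under protocol B?"]
train_docs = [
    "Forum post: what is the yield of protein X under protocol B? Someone asked this.",
    "Unrelated text about cooking pasta with tomato sauce and basil.",
]

clean_docs = filter_training_docs(train_docs, eval_items, n=4)
print(clean_docs)
```

Exact n-gram matching only catches verbatim or near-verbatim leaks; paraphrased or translated copies of an eval item would pass through a filter like this.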
And then there's this other thing
where the model can kind of like reward hack
or sometimes like cheat to solve an eval.
And that's very much a question of having clean eval design
where you like sort of test these at scale,
see if there's any hacks,
and make sure those environments
that you're testing
don't have the hacks as something that's possible for the model to do.
And that just requires a lot of quality control to make sure like the eval is not overly
hackable.
Yeah.
Yeah, because it seems like there were some very simple ones, like grade school math and whatnot, where
if you just changed it a little bit, some of the early models would get confused and give you the wrong answer,
even though they were actually capable of solving it.
The model just goes, oh, this one, I've got it.
And then, you know, that's happened with, like, the "should I drive my car to the car wash?" problem.
Yeah, yeah, so like the models can get tricked.
To me, if the model didn't do well on that, like, it should have been smarter. Like we should also like have the models be a bit more
robust to being tricked. But this also relates to this idea of capability elicitation or like
trying to measure the models in the best way, which is especially important for our safety testing.
Like for example, if you want to measure how well the model can, you know, find vulnerabilities
or, you know, do some of the cyber security stuff, you want to make sure the model is not just getting
tricked by the problem, and that you've really measured the true capability. And so there's a lot
of like prompt tuning and like changing the harness and sometimes like even doing like a
fine-tune to get the model maximally ready to solve that challenge that we do to make sure if we say,
oh, the model's not good at some like very risky capability. We can be a bit more sure before we say
that. When I was a kid, I loved reading these Encyclopedia Brown stories, these little mysteries
and you had to solve them. And with GPT-4, I would write custom ones for it just in case somebody
had like tipped all these answers to it out there. But that was a pain to kind of do that. And it's
exciting to think now I can have a model write something, or come up with some new eval.
So how helpful have the models been now for that?
Yeah.
They're semi-useful.
Yeah.
Okay.
I think we're in this like phase of model development where, um, sometimes the outputs are still
kind of sloppy.
Yeah.
And they require like, um, human QC or like oversight to make sure the quality is still high and
like we're not getting tricked.
So I would say people sometimes are surprised that we still have a lot of human
intervention and involvement in the evals,
just because, you know,
evals can be a lower N than training data,
and you want to make sure every single point that you're testing,
every data point is very high quality.
And so this is one of the areas
where, like, a human touch can be quite nice.
We're seeing some interesting trends
where jobs that actually touch AI seem to be more in demand
because it's made people more productive.
How are you tracking this?
How do you look for areas
where you think this is going to have an impact?
Yeah, these are very difficult questions.
I think people are not calibrated to how much work our models will be able to do.
And how quickly, like, across a wide variety of jobs.
And right now, the models are still mostly just good at tasks versus a job.
Like, there's a lot more to a job than a task, right?
Like, you have to figure out what you want to work on, navigate, like, ambiguity.
Like, you might have coworkers that you're collaborating with and, like, communicating with.
And then you might, like, figure out what task you want to do and then give that to a
model. And that's kind of the phase we're at now where it's a lot of, I mean, even in my job,
the model is like doing individual tasks for me, but I'm still doing a lot of the thinking and
planning and that sort of thing. And I think people aren't even calibrated to that. Like I feel
like people in software and research are a lot more calibrated. By calibrated, I mean like realize
how capable the models are compared to some of my friends in other industries. And I like wish people
just tried the models more and saw because the people who try and see first, like they'll
start to really get it. But I also think the models are going to start to be able to do the stuff,
like the delegating part at some point too, maybe not too far from now. The figuring out what to
work on, navigating ambiguity, like writing the spec that the model then executes on. And people
should really start to think about, okay, what happens in the maximally AGI-pilled world
where even just for digital work, the model can come up with what to do, do it, execute on it,
interact with the real world.
Like, you know, there's entire businesses now, like, you see,
like, stories of, like, unicorns where it was, like, mostly AI and a few employees
that were, like, able to drive all of this value.
And so I do think there's this question of, you know, are we realizing how big this might be?
Personally, I think the opportunity space is getting bigger.
The most AGI-pilled people I know, the people who are using tools like
Codex all the time, are doing way more now.
They're more productive now because they don't have to do
the tasks, because the AI gets better at handling certain jobs.
Like, cool, there are five jobs I need done now because I can do more.
And I think, if we just think about the light cone of potential, where we can be is bigger than we can imagine.
And I think these tools just help us get there faster, not narrow it.
I think that it's probably some mix of things.
Yeah.
Even if you have models that can speed up paperwork, like think about like a clinical trial for a drug, right?
It's like people spend months putting together all this paperwork, like hundreds of pages of like why they should be able to do the trial.
and they like submit it to the FDA.
And then there's like a 35% chance it gets rejected because they like made a mistake or forgot
something.
They revise.
And finally you can do the trial.
And you know, these processes are good, but it just takes a long time.
And then the trial is, you know, you have a case and a control or whatever and you're like
documenting symptoms and tracking these for like just documenting what happens for a long time.
And then doing a bunch of data analysis.
Like a lot of this is just documentation or data analysis or sort of like very classically
digital work.
And I think if models can help accelerate all parts of this, you know, for health, for energy, manufacturing, policy research, education, this will be very accelerative. We will have hopefully, you know, faster, cheaper, better goods. And that's really good for people. It's, like, very good for the individual consumer. So I think that is, like, something people should be excited about. But we should be very thoughtful about how to navigate the transition to that world in a way that's thoughtful and, like, responsible.
Excellent. Thank you, Tejal. Thank you for having me.
