Latent Space: The AI Engineer Podcast - [State of Code Evals] After SWE-bench, Code Clash & SOTA Coding Benchmarks recap — John Yang
Episode Date: December 31, 2025

From creating SWE-bench in a Princeton basement to shipping CodeClash, SWE-bench Multimodal, and SWE-bench Multilingual, John Yang has spent the last year and a half watching his benchmark become the de facto standard for evaluating AI coding agents, trusted by Cognition (Devin), OpenAI, Anthropic, and every major lab racing to solve software engineering at scale. We caught up with John live at NeurIPS 2025 to dig into the state of code evals heading into 2026: why SWE-bench went from ignored (October 2023) to the industry standard after Devin’s launch (and how Walden emailed him two weeks before the big reveal), how the benchmark evolved from Django-heavy to nine languages across 40 repos (including JavaScript, Rust, Java, C, and Ruby), why unit tests as verification are limiting and long-running agent tournaments might be the future (CodeClash: agents maintain codebases, compete in arenas, and iterate over multiple rounds), the proliferation of SWE-bench variants (SWE-bench Pro, SWE-bench Live, SWE-fficiency, AlgoTune, SciCode) and how benchmark authors are now justifying their splits with curation techniques instead of just “more repos,” why Tau-bench’s “impossible tasks” controversy is actually a feature not a bug (intentionally including impossible tasks flags cheating), the tension between long autonomy (5-hour runs) vs. interactivity (Cognition’s emphasis on fast back-and-forth), how Terminal-bench unlocked creativity by letting PhD students and non-coders design environments beyond GitHub issues and PRs, the academic data problem (companies like Cognition and Cursor have rich user interaction data, academics need user simulators or compelling products like LMArena to get similar signal), and his vision for CodeClash as a testbed for human-AI collaboration: freeze model capability, vary the collaboration setup (solo agent, multi-agent, human+agent), and measure how interaction patterns change as models climb the ladder from code completion to full codebase reasoning.

We discuss:
* John’s path: Princeton → SWE-bench (October 2023) → Stanford PhD with Diyi Yang and the Iris Group, focusing on code evals, human-AI collaboration, and long-running agent benchmarks
* The SWE-bench origin story: released October 2023, mostly ignored until Cognition’s Devin launch kicked off the arms race (Walden emailed John two weeks before: “we have a good number”)
* SWE-bench Verified: the curated, high-quality split that became the standard for serious evals
* SWE-bench Multimodal and Multilingual: nine languages (including JavaScript, Rust, Java, C, and Ruby) across 40 repos, moving beyond the Django-heavy original distribution
* The SWE-bench Pro controversy: independent authors used the “SWE-bench” name without John’s blessing, but he’s okay with it (“congrats to them, it’s a great benchmark”)
* CodeClash: John’s new benchmark for long-horizon development, where agents maintain their own codebases, edit and improve them each round, then compete in arenas (programming games like Halite, economic tasks like GDP optimization)
* SWE-fficiency (Jeffrey Ma, John’s high school classmate): optimize code for speed without changing behavior (parallelization, SIMD operations)
* AlgoTune, SciCode, Terminal-bench, Tau-bench, SecBench, SRE-bench: the Cambrian explosion of code evals, each diving into different domains (security, SRE, science, user simulation)
* The Tau-bench “impossible tasks” debate: some tasks are underspecified or impossible, but John thinks that’s actually a feature (flags cheating if you score above 75%)
* Cognition’s research focus: codebase understanding (retrieval++), helping humans understand their own codebases, and automatic context engineering for LLMs (research sub-agents)
* The vision: CodeClash as a testbed for human-AI collaboration (vary the setup across solo agent, multi-agent, and human+agent, freeze model capability, and measure how interaction changes as models improve)

John Yang
* SWE-bench: https://www.swebench.com
* X: https://x.com/jyangballin

Timestamps
00:00:00 Introduction: John Yang on SWE-bench and Code Evaluations
00:00:31 SWE-bench Origins and Devin's Impact on the Coding Agent Arms Race
00:01:09 SWE-bench Ecosystem: Verified, Pro, Multimodal, and Multilingual Variants
00:02:17 Moving Beyond Django: Diversifying Code Evaluation Repositories
00:03:08 CodeClash: Long-Horizon Development Through Programming Tournaments
00:04:41 From Halite to Economic Value: Designing Competitive Coding Arenas
00:06:04 Ofir's Lab: SWE-fficiency, AlgoTune, and SciCode for Scientific Computing
00:07:52 The Benchmark Landscape: Tau-bench, Terminal-bench, and User Simulation
00:09:20 The Impossible Task Debate: Refusals, Ambiguity, and Benchmark Integrity
00:12:32 The Future of Code Evals: Long Autonomy vs Human-AI Collaboration
00:14:37 Call to Action: User Interaction Data and Codebase Understanding Research

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Transcript
We're here at NeurIPS with John Yang of SWE-bench and many other things.
But welcome.
Thanks so much for having me.
Yeah, really happy to be here.
Last year I talked to Ofir and, I think, Carlos as well, one of your co-authors.
How's SWE-bench doing?
Just generally, the project is like one and a half years old.
Yeah, yeah. I think one and a half years old in terms of when it was actually useful.
Yeah. We put it out in October '23, and then people didn't really touch it too much.
And then, of course, Cognition came on the scene, and Devin was an amazing release.
And I think after that, it kind of kicked off the arms race.
Did they tell you beforehand, or did they just show up?
You know, I got an email about, like, two weeks before. I think it was from Walden.
It was like, hey, you know, we have a good number on it.
I was like, wow, congrats. You know, thanks for using it.
And then the release was, like, mind-blowing.
I was like, wow, these guys did an excellent job.
Amazing.
And then SWE-bench Verified was, like, maybe last year.
That's right.
Yeah.
Catch us up this year.
Like you have other languages.
There's like a whole bunch of varieties of SWE-bench now.
Yeah.
So what should people know?
Yeah, for sure.
I think there's a couple extensions that have happened.
One is, like, more SWE-benches: SWE-bench Pro, SWE-bench Live.
Oh, SWE-bench Pro, was that with you guys?
Because it looks independent.
It's like different authors.
It's completely independent.
So they just called themselves SWE-bench Pro without your blessing?
I think we're okay with it.
When it came out, we were like, oh, cool, interesting.
It would have been fun to be part of it.
But, you know, I mean, congrats to them.
It's a great benchmark.
Yeah.
But yeah, multimodal.
Yeah, we did multimodal and multilingual.
And I think, like, those have multilingual.
Is it like JavaScript?
What else?
Yeah, yeah.
Multilingual is like nine languages across like 40 repos.
But yeah, you got 'em: like JavaScript, Rust, Java, C, you know, Ruby.
Yeah, yeah, you got 'em.
Yeah.
And then core SWE-bench itself, a lot of people, like, they talk about the Django focus.
Yes.
Is there, like, I don't know, how do we move past Django?
Yeah, for sure.
I mean, it's cool to see a lot of the newer benchmarks, like, really try to diversify the repos.
Like, in the two follow-ups we did with multimodal and multilingual, we made it a point to do that.
So I think...
You can also just put out SWE-bench 2025 and just...
That is true.
And do a new distribution.
Yeah, yeah. So it's been cool to see the follow-ups. I think quietly, and it's an open question for me, I'm excited to see how people curate the next sets. Like, it's kind of interesting to see in the literature or in their blog posts how they're justifying why they're creating their separate split. The earlier ones were like, oh, more languages, more repos. And then I think now people are like, well, ours is more difficult because of this curation technique. And I'm, yeah, I'm excited to see how long that lasts and, you know, where we're going to, like, guide the evaluations towards.
Yeah. And more recently, you're working on CodeClash.
Yes, that's right.
So let's get people... you've already done other episodes, other podcasts about it.
Yeah, I'll refer people to your chat with Andy. But just give people, like, a one- or two-sentence version.
Yeah, no, happy to do it, especially on your podcast. It's on.
Yeah, so basically the idea is, I don't like unit tests as a form of verification.
And I also think there's the issue with SWE-bench where all of the task instances are independent of each other.
So the moment you have the model kind of submit it, it's done, you know, and that's the end of the story, the end of the episode, you know.
So with CodeClash, what we're thinking is, let's try to really evaluate long-horizon development, and development on a codebase that is consequential and conditioned upon what a model did, you know, before to that codebase.
And so the general idea is you have two or more language models and they play a programming tournament.
And what that means is each model maintains their own codebase, and in each round of the tournament,
first, they get to edit and improve their codebase however they see fit, very self-determined.
And then in the competition phase, those two code bases are pitted against each other.
So the code bases are run and there's generally an arena.
You know, we have a lot of diverse arenas.
But the arena determines whether, like, codebase A is better than codebase B.
And then you kind of repeat that across multiple rounds.
As determined by an LLM judge?
Yeah.
Yeah.
So LLM judge is definitely one of the mechanisms.
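The round structure described here (an edit phase, then a competition phase, repeated over several rounds) could be sketched roughly as follows. This is only an illustration, not CodeClash's actual code: `Player`, `run_tournament`, `toy_edit`, and `toy_arena` are hypothetical names, and the toy arena rule stands in for a real game or judge.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Player:
    """One competitor: a model name plus the codebase it maintains."""
    name: str
    codebase: Dict[str, str] = field(default_factory=dict)  # filename -> source


# Hypothetical interfaces: an edit step lets a model change its own codebase,
# an arena runs two codebases against each other and names the winner.
EditFn = Callable[[Player, int], None]
ArenaFn = Callable[[Player, Player], str]


def run_tournament(a: Player, b: Player, edit: EditFn, arena: ArenaFn,
                   rounds: int) -> Dict[str, int]:
    """Repeat edit phase then competition phase; codebases persist across rounds."""
    wins = {a.name: 0, b.name: 0}
    for round_no in range(rounds):
        # Edit phase: each player improves its own codebase as it sees fit.
        edit(a, round_no)
        edit(b, round_no)
        # Competition phase: the arena decides which codebase is better.
        wins[arena(a, b)] += 1
    return wins


def toy_edit(player: Player, round_no: int) -> None:
    """Stand-in for a model editing its repo: add one file per round."""
    player.codebase[f"round_{round_no}.py"] = f"# {player.name} edit {round_no}\n"


def toy_arena(a: Player, b: Player) -> str:
    """Toy rule (an assumption for the demo): the larger codebase wins."""
    def size(p: Player) -> int:
        return sum(len(src) for src in p.codebase.values())
    return a.name if size(a) >= size(b) else b.name


if __name__ == "__main__":
    print(run_tournament(Player("a"), Player("bb"), toy_edit, toy_arena, rounds=3))
```

The point of the sketch is the dependency between rounds: whatever a player did to its codebase in round 0 is still there in round 2, which is what separates this setup from independent task instances.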
We started with some pretty like simple programming games.
So one of the cooler ones is like Halite, which, uh, Michael...
Oh yeah. I played it for Jane Street.
Yes.
That's right.
That's right.
You know, that's awesome.
Yeah.
Halite one, two, three.
Like, Michael Truell of Cursor wrote this game.
Two Sigma, not Jane Street.
Yes.
Oh, Two Sigma.
I worked at Two Sigma.
I'm like, oh, there you go.
This is too long ago.
There you go.
Yeah, 2016 at this point, but we're bringing it back, you know.
Halite is fun.
I would say if you've never done a programmatic competition
where you have to control fleets of ships and attack things and defend things
and collect resources.
Yeah, it's like play StarCraft, but you can code.
Yeah, exactly, yeah.
A lot of games.
Yeah.
Are there non-games, or are you only games?
I think that's an excellent point.
So for kind of the initial release, for scientific purposes,
we kind of use existing programming games.
The current ongoing effort is, you know,
to build economically valuable arenas.
That's, you know, the popular word these days.
Yeah, SWE-Lancer is a big one this year.
Yeah, GDPval.
Awesome.
Yeah, just, I mean, I think the big selling point of Terminal-bench and SWE-bench and these evals is that they're really close to real-world utility.
And so I think it's resolvable for code.
And that's what we're working on.
Yeah.
Okay.
Yeah.
So you're part of Ofir's group.
Yes.
The other students have also been putting out a lot of other stuff.
What would you highlight?
Yeah.
No, I mean, Ofir is such a prolific mentor when it comes to benchmarking.
SWE-fficiency I really like, in the line of performance.
What's the deal on that one?
Yeah, for sure.
So SWE-fficiency was written by this PhD student called Jeffrey Ma, who happened to be my high school classmate.
And the idea there was, like, you take a codebase and you just want to, you know,
do modifications that will literally make the code run faster.
So I think it's like parallelization, SIMD operations, stuff like that.
So no behavior change just faster?
Exactly.
Keep the unit test passing, but I want better runtime.
Okay.
Yeah.
Yeah.
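To make the "no behavior change, just faster" criterion concrete, here is a minimal sketch of such a check. It is not SWE-fficiency's harness: `compare`, `slow_sum_squares`, and `fast_sum_squares` are hypothetical names, and a plain equality check on a few inputs stands in for a real unit test suite.

```python
import timeit
from typing import Callable, Iterable, Tuple


def compare(baseline: Callable[[int], int], candidate: Callable[[int], int],
            inputs: Iterable[int], repeats: int = 5) -> Tuple[bool, float]:
    """Return (same_behavior, speedup) for candidate relative to baseline."""
    cases = list(inputs)
    # Behavior check: the optimized code must agree with the original.
    same_behavior = all(baseline(x) == candidate(x) for x in cases)

    def best_seconds(fn: Callable[[int], int]) -> float:
        runs = timeit.repeat(lambda: [fn(x) for x in cases], number=1, repeat=repeats)
        return max(min(runs), 1e-9)  # guard against a zero reading

    return same_behavior, best_seconds(baseline) / best_seconds(candidate)


def slow_sum_squares(n: int) -> int:
    """Original: O(n) loop over 0..n-1."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_sum_squares(n: int) -> int:
    """Optimized: closed form for the same sum, valid for n >= 0."""
    return (n - 1) * n * (2 * n - 1) // 6


if __name__ == "__main__":
    same, speedup = compare(slow_sum_squares, fast_sum_squares, [0, 1, 10, 10_000])
    print(f"same behavior: {same}, speedup: {speedup:.1f}x")
```

A speedup is only counted when the behavior check passes; a faster candidate that changes outputs would be rejected rather than rewarded.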
And then there's AlgoTune that is kind of in line with that.
And then there's also kind of pushing along, like, the scientific coding domain.
SciCode.
Yeah, exactly.
So I go to.
SciCode is awesome.
They did like a quick.
And for people, the way I explain SciCode is, it's HumanEval, but better.
Yes, exactly. Exactly. I think, you know, there's a lot of good stuff that these days
where, yeah, that's the way to go. Which is, like, SWE-bench is expensive to run. Any
agentic benchmark is expensive to run. Actually, you do need some completion benchmarks. Yeah,
just, just, just complete. Exactly. Like, you know, you can do well on those first and then sort
of graduate to the multi-turn expensive stuff. Yeah. Yeah. Okay. Other than that, just like broadly,
other work in the field in 2025.
In terms of coding evals,
obviously we should shout out METR.
They use SWE-bench and they have a very interesting,
like, I guess, human-hours-worked number.
Yeah, they have, like, the x-axis being sort of the runtime
and, or yeah, the y-axis being the completion,
you know, like we can do more long-running SWE-agent tasks.
I think the projections are quite interesting.
And I definitely appreciate them kind of using
SWE-bench Verified to sort of proxy a lot of these things.
But yeah, they're great.
Okay. Any other work that, like, caught your eye?
Yeah, I mean, I think within the, okay, Terminal-bench, SWE-bench, yeah, Critical Point was kind of cool.
Critical Point?
Yeah, it's like a very new benchmark that Ofir did.
And I think it's kind of related to physics.
There's this one called SecBench, kind of related to cybersecurity.
Yeah, exactly.
SRE-bench, which I think is affiliated with LOD.
It's just cool to kind of see people really dive into different coding domains.
and then stepping a little bit outside of coding.
I personally think it's quite interesting
to think about the user simulator stuff.
So like Tau-bench...
Vending-Bench, too.
Yeah, and Vending-Bench.
I've got mixed feelings.
Yeah, no, I'm interested.
Well, I mean, it's like you're sampling one path.
I don't know how realistic it is, to be honest.
Yeah, it's just yellow of this.
But it is cool.
No, for sure.
Yeah, I agree.
I think it's a good initial effort.
To me, I think it's super cool to see companies,
like, you know, I'm sure Mercor and stuff
are focusing on building environments, like for code, beyond code.
And so I think it might be interesting to have like work gym style stuff.
This is stuff that my advisor, Diyi Yang at Stanford, thinks about a lot.
So yeah.
Yeah.
I just realized we were talking about Terminal-bench.
Yes.
We have the honor.
Folks.
Yeah, yeah.
You know, really, really good work just overall.
Yeah.
Let's talk about Tau-bench.
Yeah, because you mentioned Tau-bench.
Yes, yes.
There's some discussion, or some people are saying
that Tau-bench is impossible to get a high score on
because some of the tasks are underspecified
or just impossible.
Yeah.
I don't know if you're up to speed on that.
It's a little bit spicy.
Yeah, it's a bit spicy.
I think I saw, so I, you know,
like I worked with Shunyu and Karthik back in Princeton very closely.
I think Karthik, I just saw, posted a tweet kind of...
Defending it?
Yeah, like rebutting some of these claims.
Yeah, I mean, it's...
I think I get the concern.
But yeah, I think it also brings up just maybe like interesting research problems to solve of like, okay, like why is it impossible?
Is it the ambiguity?
Is it kind of the user simulator that has issues?
And I think generally we all agree that, you know, we'll improve on these things over time for you, boss.
So I actually really like benchmarks that intentionally, I think we should intentionally include impossible tasks as a flag.
Yeah.
Of like, hey, you're cheating.
Yes.
It's kind of sad that, like, Karthik actually is defending it, because the master move would be like, oh yeah, you caught us.
That was, you know, like, everyone reporting above 75 in Tau-bench retail, you've been cheating.
Yeah. Oh, interesting. That would be, that would be cool. Yeah. I mean, yeah, you'll have to ask the Tau-bench authors,
but yeah, no, that's fun. Yeah, I think ImpossibleBench was a recent benchmark.
Maybe from, was it from Anthropic? I don't know, but they basically took SWE-bench Verified and
they changed the issues to make them impossible. And they checked like how often the models would be like,
I actually just can't do this.
I don't know what's going on.
Oh, like for refusals?
Yes, yes, yes.
Oh, how do they do?
I thought that was interesting.
I think they're all,
the models are all kind of attempting
and saying like, oh, I did it, you know,
so maybe not great.
That's cool.
But that's an important one.
Yeah.
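The idea of deliberately including impossible tasks as a flag can be sketched as a small audit over a results table. This is an illustration only: `audit_run` and the task IDs are hypothetical, and nothing here reflects how Tau-bench or ImpossibleBench actually score runs.

```python
from typing import Dict, List, Set, Union

Report = Dict[str, Union[float, bool, List[str]]]


def audit_run(results: Dict[str, bool], impossible_ids: Set[str]) -> Report:
    """Flag a run that reports success on tasks known to be impossible.

    results maps task id -> whether the run claims the task was solved.
    impossible_ids is the hidden set of tasks that cannot be solved honestly.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    total = len(results)
    solved = sum(results.values())
    impossible_in_run = [t for t in results if t in impossible_ids]
    passed_impossible = sorted(t for t in impossible_in_run if results[t])
    ceiling = (total - len(impossible_in_run)) / total  # best honest score
    return {
        "score": solved / total,
        "ceiling": ceiling,
        "impossible_passed": passed_impossible,
        "suspicious": bool(passed_impossible) or solved / total > ceiling,
    }


if __name__ == "__main__":
    claimed = {"t1": True, "t2": True, "t3": False, "t4": True}
    print(audit_run(claimed, impossible_ids={"t4"}))
```

Under this scheme a score above the honest ceiling, or any pass on a planted task, is a signal to inspect the run, and a model that says "I can't do this" on a planted task is behaving correctly.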
How do coding evals evolve next year?
Wow, that's a great question.
I mean, honestly, I think people will make more SWE-benches.
I think Terminal-bench has really got something going,
where you ask people to, you know...
with SWE-bench, you're confined
in some sense to the domain of issues and PRs that already exist, which I think has its benefits of
being close to reality and natural. But I think with Terminal Bench, there's a lot of creativity
that you can infuse into that. So I would personally be really excited. Like the 2.0 job was really
excellent. And I'd be super excited to see, you know, 3.0, 4.0.
Because of like the environments. Yeah. I mean, the environments, you know, bringing more people
into the fold, you know, I think, correct me if I'm wrong, Mike. But early on, you had PhD students,
very smart CS people, who are adding tasks. And, you know, what does that look like when you fold in more coding environments for non-coding tasks, non-coding environments in general, and ask people to make stuff there? So that's pretty cool.
And then, of course, for myself, I think just this long-running SWE-agent kind of thing feels very compelling. I think the vision of, like, hey, I tell it a goal, I don't have to be super specific about my task, I have a decent verifier that proxies what I want, something literally like a codebase that makes the most money in this setting, you know, that's my verifier. And I walk away for five hours, the thing is just running, I'm hanging out with you, talking to my friends, I come back, and it gives me literally a SOTA codebase on that task. I think that would be super cool.
Okay, I'll push back. I'm part-time at Cognition, yes, and we are emphasizing a lot of interactivity, because the point is that you're going to underspecify, right? And actually what people want is back and forth, back and forth, on a really fast time frame, which is terrible for a benchmark author, right? Because how do you do that?
Yeah.
But realistic.
Yeah.
So I think this is where I'm a little bit anxious or cautious about this push for long autonomy, right? We're going to, I mean, you know, let's say this time next year we'll have... five hours is pessimistic. Like, yeah, it'll be 24.
Yeah, right, days.
But I don't know if that actually materially changes the industry.
So we'll push it.
As in evals, you know, we have the people who make evals here.
Yeah.
We push the industry in the ways that we want to push it.
But I don't know if, like, that's a productive way, because that's more of a stunt. Like, yeah, it's a proof of concept, an existence proof.
It can be done.
Yeah.
Yeah.
But will you use it in real life?
Yeah.
Yeah.
I mean, honestly, to me, I think there's potentially room for growth.
So I would actually agree with your take here.
I mean, with my lab at Stanford, with Diyi, like, there's a, you know, her emphasis is on human-AI
collaboration.
And so I definitely don't believe in this idea of just kind of getting rid of the human.
But, yeah, maybe just like finding the balance of like, you know, just because the developer
ecosystem is so diverse and there's so many participants in it who want different things
out of it, like just enabling different levels of abstraction.
And, you know, it depends on the task.
like there's settings where you want to be, you know, more involved and more sort of hands-on,
and so you want to use Windsurf for that.
But then maybe there's kind of this general data processing thing.
It's just a lot of JSON parsing.
You don't really care about.
And that's the one I kind of want to walk away from and just let it figure it out.
Yeah.
So, yeah, I would agree with you generally.
Yeah.
Amazing.
Any calls to action?
What do you want help on how can people, I guess, like, find more of your work?
Definitely.
For the call to action, super jealous of all the great data that Cognition
and, you know, Cursor get. Like, that user interaction data is, like, really fascinating.
From the academic standpoint, it feels like there's two difficult approaches to resolving that.
Either you build, like, a really compelling product, like LMArena, that people use
consistently, which is, I mean, really tricky in and of itself.
Or you build, like, really good user simulators that try to mimic sort of these settings.
But that is also, like, non-trivial.
I don't think it's as simple as, hey, ChatGPT, act like a human, right?
So it would be really cool to sort of get inspiration of like what exactly does that data look like or between the two like what's the best way to scale up sort of evaluating human AI interaction.
And then I think for visibility for my own work, pushing more arenas.
Like, I think for CodeClash, what I'm excited about is the current framing is really long-running SWE-agents.
But, you know, you could have multi-agent, like two agents work together on the codebase, and what happens?
Or you have a human and an agent work on the codebase versus just AIs.
What happens there?
You know, like when the models improve and hopefully they hill climb and they become better at
digesting logs and iterating on analysis, you know, how does human-AI interaction, like,
change with model capability?
And so I'm kind of hoping, you know, I'm trying to inspire and convince people that it's a very
cool test bed where you can do a lot of different sort of combinations of like human
AI on different arenas, playing one arena at a time, N arenas at a time.
You know, I just, you know.
Yeah, I think I'm very interested to work with you on the interaction stuff.
Oh, that would be awesome.
And then I think one more thing I'll add is, Cognition
is going to be pushing a lot of codebase understanding,
which is kind of codebase retrieval plus-plus.
Yes.
And mostly it is helping humans understand their own codebases better, to enable humans.
Yeah.
Or to sort of mind-meld the human with the machine to do,
the highest possible task that LLMs could not do alone, humans couldn't do alone.
And then the other thing is also, like, basically automatic context engineering for an LLM.
So that is like sort of like a research subagent that we're working on.
That's so awesome. Yeah.
So I don't know what the benchmark would be because like how do you benchmark understanding?
That is.
Apart from I think like it's also mostly like you freeze a repo, have some manually curated
answers and then, you know, pose trivia questions.
that's very easy to saturate, so I don't know how else you do it.
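The "freeze a repo, curate answers, pose trivia questions" setup mentioned here might look like the sketch below. It is an assumption-laden illustration: `RepoQuestion`, `normalize`, `score`, the commit string, and the file names are all hypothetical, and exact-match scoring is the simplest possible grader.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class RepoQuestion:
    """A trivia-style question about one frozen snapshot of a repository."""
    commit: str                 # snapshot the curated answer refers to
    prompt: str
    accepted: Tuple[str, ...]   # manually curated acceptable answers


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting noise is ignored."""
    return " ".join(text.lower().split())


def score(answer_fn: Callable[[RepoQuestion], str],
          questions: Sequence[RepoQuestion]) -> float:
    """Fraction of questions whose answer matches a curated answer exactly."""
    if not questions:
        return 0.0
    correct = 0
    for q in questions:
        accepted = {normalize(a) for a in q.accepted}
        if normalize(answer_fn(q)) in accepted:
            correct += 1
    return correct / len(questions)


if __name__ == "__main__":
    # Hypothetical questions about a hypothetical repo at commit "abc123".
    questions = [
        RepoQuestion("abc123", "Which file defines the retry helper?", ("utils/retry.py",)),
        RepoQuestion("abc123", "What is the default timeout in seconds?", ("30", "30s")),
    ]
    print(score(lambda q: "utils/retry.py", questions))
```

Because the answer key is fixed and small, a system can memorize it, which is one way to read the saturation concern raised in the conversation.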
I think Silas tweeted a while ago,
like sort of like the Wiki, the Code Wiki,
that's incredible. I mean, I use it on a...
Yeah, Google actually just came out with their own version.
Oh, yeah.
Yeah, with the Antigravity people, that's...
No, no, no, this is like a separate...
It's a different team.
Yeah, yeah, gotcha, gotcha.
But cool, that's the state of code.
Yep.
