Latent Space: The AI Engineer Podcast - [State of Code Evals] After SWE-bench, Code Clash & SOTA Coding Benchmarks recap — John Yang

Episode Date: December 31, 2025

From creating SWE-bench in a Princeton basement to shipping CodeClash, SWE-bench Multimodal, and SWE-bench Multilingual, John Yang has spent the last year and a half watching his benchmark become the de facto standard for evaluating AI coding agents, trusted by Cognition (Devin), OpenAI, Anthropic, and every major lab racing to solve software engineering at scale.

We caught up with John live at NeurIPS 2025 to dig into the state of code evals heading into 2026: why SWE-bench went from ignored (October 2023) to the industry standard after Devin’s launch (and how Walden emailed him two weeks before the big reveal), how the benchmark evolved from Django-heavy to nine languages across 40 repos (JavaScript, Rust, Java, C, Ruby), why unit tests as verification are limiting and long-running agent tournaments might be the future (CodeClash: agents maintain codebases, compete in arenas, and iterate over multiple rounds), the proliferation of SWE-bench variants (SWE-bench Pro, SWE-bench Live, SWE-Efficiency, AlgoTune, SciCode) and how benchmark authors are now justifying their splits with curation techniques instead of just “more repos,” why Tau-bench’s “impossible tasks” controversy is actually a feature not a bug (intentionally including impossible tasks flags cheating), the tension between long autonomy (5-hour runs) vs. interactivity (Cognition’s emphasis on fast back-and-forth), how Terminal-bench unlocked creativity by letting PhD students and non-coders design environments beyond GitHub issues and PRs, the academic data problem (companies like Cognition and Cursor have rich user interaction data, academics need user simulators or compelling products like LMArena to get similar signal), and his vision for CodeClash as a testbed for human-AI collaboration: freeze model capability, vary the collaboration setup (solo agent, multi-agent, human+agent), and measure how interaction patterns change as models climb the ladder from code completion to full codebase reasoning.

We discuss:
* John’s path: Princeton → SWE-bench (October 2023) → Stanford PhD with Diyi Yang and the Iris Group, focusing on code evals, human-AI collaboration, and long-running agent benchmarks
* The SWE-bench origin story: released October 2023, mostly ignored until Cognition’s Devin launch kicked off the arms race (Walden emailed John two weeks before: “we have a good number”)
* SWE-bench Verified: the curated, high-quality split that became the standard for serious evals
* SWE-bench Multimodal and Multilingual: nine languages (JavaScript, Rust, Java, C, Ruby) across 40 repos, moving beyond the Django-heavy original distribution
* The SWE-bench Pro controversy: independent authors used the “SWE-bench” name without John’s blessing, but he’s okay with it (“congrats to them, it’s a great benchmark”)
* CodeClash: John’s new benchmark for long-horizon development, where agents maintain their own codebases, edit and improve them each round, then compete in arenas (programming games like Halite, economic tasks like GDP optimization)
* SWE-Efficiency (Jeffrey Ma, John’s high school classmate): optimize code for speed without changing behavior (parallelization, SIMD operations)
* AlgoTune, SciCode, Terminal-bench, Tau-bench, SecBench, SRE-bench: the Cambrian explosion of code evals, each diving into different domains (security, SRE, science, user simulation)
* The Tau-bench “impossible tasks” debate: some tasks are underspecified or impossible, but John thinks that’s actually a feature (flags cheating if you score above 75%)
* Cognition’s research focus: codebase understanding (retrieval++), helping humans understand their own codebases, and automatic context engineering for LLMs (research sub-agents)
* The vision: CodeClash as a testbed for human-AI collaboration: vary the setup (solo agent, multi-agent, human+agent), freeze model capability, and measure how interaction changes as models improve

John Yang
* SWE-bench: https://www.swebench.com
* X: https://x.com/jyangballin

Timestamps
00:00:00 Introduction: John Yang on SWE-bench and Code Evaluations
00:00:31 SWE-bench Origins and Devin's Impact on the Coding Agent Arms Race
00:01:09 SWE-bench Ecosystem: Verified, Pro, Multimodal, and Multilingual Variants
00:02:17 Moving Beyond Django: Diversifying Code Evaluation Repositories
00:03:08 Code Clash: Long-Horizon Development Through Programming Tournaments
00:04:41 From Halite to Economic Value: Designing Competitive Coding Arenas
00:06:04 Ofir's Lab: SWE-ficiency, AlgoTune, and SciCode for Scientific Computing
00:07:52 The Benchmark Landscape: TAU-bench, Terminal-bench, and User Simulation
00:09:20 The Impossible Task Debate: Refusals, Ambiguity, and Benchmark Integrity
00:12:32 The Future of Code Evals: Long Autonomy vs Human-AI Collaboration
00:14:37 Call to Action: User Interaction Data and Codebase Understanding Research

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe

Transcript
Starting point is 00:00:00 We're here at NeurIPS with John Yang of SWE-bench and many other things. But welcome. Thanks so much for having me. Yeah, really happy to be here. Last year I talked to Ofir and, I think, Carlos as well, one of your co-authors. How's SWE-bench doing? Just generally, the project is like one and a half years old. Yeah, yeah. I think one and a half years old in terms of when it was actually useful.
Starting point is 00:00:36 Yeah. We put it out October '23 and then people didn't really touch it too much. And then of course, like, Cognition came on the scene and Devin was an amazing release. And I think after that, it kind of kicked off the arms race. Did they tell you beforehand, or did they just show up? You know, I got an email about, like, two weeks before. I think it was from Walden. It was like, hey, you know, we have a good number on it. I was like, wow, congrats. You know, thanks for using it. And then the release was, like, mind-blowing.
Starting point is 00:01:03 I was like, wow, these guys did an excellent job. Amazing. And then SWE-bench Verified was like maybe last year. That's right. Yeah. Catch us up this year. Like you have other languages. There's like a whole bunch of varieties of SWE-bench now.
Starting point is 00:01:18 Yeah. So what should people know? Yeah, for sure. I think there's a couple extensions that have happened. One is like more SWE-benches: SWE-bench Pro, SWE-bench Live. Oh, SWE-bench Pro, was that with you guys? Because it looks independent. It's like different authors.
Starting point is 00:01:31 It's completely independent. So they just called themselves SWE-bench Pro without your blessing. I think we're okay with it. When it came out, we were like, oh, cool, interesting. It would have been fun to be part of it. But, you know, I mean, congrats to them. It's a great benchmark. Yeah.
Starting point is 00:01:46 But yeah, multimodal. Yeah, we did multimodal and multilingual. And I think, like, those have... Multilingual, is it like JavaScript? What else? Yeah, yeah. Multilingual is like nine languages across like 40 repos. But yeah, you've got, like, JavaScript, Rust, Java, C, you know, Ruby.
Starting point is 00:02:06 Yeah, yeah, you've got 'em. Yeah. And then, of course, SWE-bench itself: a lot of people, like, they talk about the Django focus. Yes. Is there, I don't know, how do we move past Django? Yeah, for sure. I mean, it's cool to see a lot of the newer benchmarks, like, really try to diversify the repos. Like, in the two follow-ups we did with multimodal and multilingual, we made it a point to do that.
Starting point is 00:02:30 So I think you can also just put out SWE-bench 2025 and just... That is true. And do a new distribution. Yeah, yeah. So it's been cool to see the follow-ups. I think quietly, and it's an open question for me, I'm excited to see how people curate the next sets. Like, it's kind of interesting to see in the literature or in their blog posts how they're justifying why they're creating their separate split. The earlier ones were like, oh, more languages, more repos. And then I think now people are like, well, ours is more difficult because of this curation technique. And I'm, yeah, I'm excited to see how long that lasts and, you know, where we're going to, like, guide the evaluations towards. Yeah. And more recently,
Starting point is 00:03:09 you're working on CodeClash. Yes, that's right. So let's get people... you've already done other episodes, other podcasts about it. Yeah, I'll refer people to your chat with Andy. But just give people, like, a one- or two-sentence... Yeah, no, happy to do it, especially on your podcast. It's on. Yeah, so basically the idea is, I don't like unit tests as a form of verification. And I also think there's the issue with SWE-bench where all of the task instances are independent of each other. So the moment you have the model kind of submit it, oh, it's done, you know, and that's
Starting point is 00:03:40 the end of the story, the end of the episode, you know. So with CodeClash, what we're thinking is, let's try to really evaluate long-horizon development, and development on a codebase that is consequential and conditioned upon what a model did before to that codebase. And so the general idea is you have two or more language models and they play a programming tournament. And what that means is each model maintains their own codebase, and in each round of the tournament, first they get to edit and improve their codebase however they see fit, very self-determined. And then in the competition phase, those two codebases are pitted against each other.
Starting point is 00:04:20 So the codebases are run and there's generally an arena. You know, we have a lot of diverse arenas, but the arena determines, like, codebase A is better than codebase B. And then you kind of repeat that across multiple rounds. As determined by an LLM judge? Yeah. Yeah. So LLM judge is definitely one of the mechanisms.
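The round structure John describes (an edit phase, then a competition phase, repeated with each codebase carrying over) can be sketched roughly as follows. This is a toy illustration, not CodeClash's actual implementation: `Player`, `run_tournament`, `grow_by`, and `longer_bot_wins` are hypothetical names, the "agents" are plain functions rather than language models, and the arena is a stand-in that simply prefers the longer file.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical representation: a codebase is a mapping of filename -> source text.
Codebase = Dict[str, str]
EditFn = Callable[[Codebase, int], Codebase]   # the agent's edit phase for one round
Arena = Callable[[Codebase, Codebase], int]    # 0: first wins, 1: second wins, -1: draw


@dataclass
class Player:
    name: str
    edit: EditFn
    codebase: Codebase = field(default_factory=dict)
    wins: int = 0


def run_tournament(a: Player, b: Player, arena: Arena, rounds: int) -> List[str]:
    """Alternate edit and competition phases; return the winner's name per round."""
    history: List[str] = []
    for round_idx in range(rounds):
        # Edit phase: each player modifies its own codebase, and the result
        # persists, so later rounds depend on earlier decisions.
        for player in (a, b):
            player.codebase = player.edit(player.codebase, round_idx)
        # Competition phase: the arena compares the two codebases.
        outcome = arena(a.codebase, b.codebase)
        if outcome == 0:
            a.wins += 1
            history.append(a.name)
        elif outcome == 1:
            b.wins += 1
            history.append(b.name)
        else:
            history.append("draw")
    return history


def grow_by(n: int) -> EditFn:
    """Toy 'agent' that appends n characters to bot.py every round."""
    def edit(codebase: Codebase, round_idx: int) -> Codebase:
        updated = dict(codebase)
        updated["bot.py"] = updated.get("bot.py", "") + "x" * n
        return updated
    return edit


def longer_bot_wins(first: Codebase, second: Codebase) -> int:
    """Toy arena: the codebase with the longer bot.py wins."""
    len_first = len(first.get("bot.py", ""))
    len_second = len(second.get("bot.py", ""))
    if len_first == len_second:
        return -1
    return 0 if len_first > len_second else 1
```

In a real setting the edit function would be an agent session and the arena would run the game (or, as mentioned above, consult an LLM judge); the point of the sketch is only that state persists across rounds, unlike independent SWE-bench task instances.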
Starting point is 00:04:38 We started with some pretty simple programming games. So one of the cooler ones is, like, Halite, which, uh, Michael... Oh yeah. I played it for Jane Street. Yes. That's right. That's right. You know, that's awesome. Yeah.
Starting point is 00:04:50 Halite one, two, three. Like, Michael Truell of Cursor wrote this game. Two Sigma, not Jane Street. Yes. Oh, Two Sigma. I worked at Two Sigma. I'm like, oh, there you go. This is too long ago.
Starting point is 00:05:01 There you go. Yeah, 2016 at this point, but we're bringing it back, you know. Halite is fun. I would say, if you've never done a programmatic competition where you have to control fleets of ships and attack things and defend things and collect resources... Yeah, it's like playing StarCraft, but you can code. Yeah, exactly, yeah.
Starting point is 00:05:20 A lot of games. Yeah. Are there non-games, or are you focused on games? I think that's an excellent point. So for kind of the initial release, for scientific purposes, we kind of used existing programming games. The current ongoing effort is, you know, to build economically valuable arenas.
Starting point is 00:05:37 That's, you know, the popular word these days. Yeah, SWE-Lancer is a big one this year. Yeah, GDPval. Awesome. Yeah, just, I mean, I think the big selling point of Terminal-bench and SWE-bench and these evals is that it's really close to real-world utility. And so I think it's resolvable for code. And that's what we're working on.
Starting point is 00:05:56 Yeah. Okay. Yeah. So you're part of Ofir's group. Yes. The other students have also been putting out a lot of other stuff. What would you highlight? Yeah.
Starting point is 00:06:05 No, I mean, Ofir is such a prolific mentor when it comes to benchmarking. SWE-Efficiency, I really like, in the line of performance. What's the deal on that one? Yeah, for sure. So SWE-Efficiency was written by this PhD student called Jeffrey Ma, who happened to be my high school classmate. And the idea there was, like, you take a codebase and you just want to, you know, do modifications that will literally make the code run faster.
Starting point is 00:06:28 So I think it's like parallelization, SIMD operations, stuff like that. So no behavior change, just faster? Exactly. Keep the unit tests passing, but I want better runtime. Okay. Yeah. Yeah. And then there's AlgoTune that is kind of in line with that.
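The "keep the unit tests passing, but I want better runtime" idea can be sketched as a two-gate check: first confirm behavior is unchanged, then measure speedup. This is a minimal sketch under stated assumptions, not the benchmark's official metric or harness; `score_optimization`, `best_runtime`, and the three toy functions are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Iterable


def best_runtime(fn: Callable[[int], int], arg: int, repeats: int = 5) -> float:
    """Return the fastest of several timed calls, to reduce timing noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        best = min(best, time.perf_counter() - start)
    return best


def score_optimization(
    baseline: Callable[[int], int],
    candidate: Callable[[int], int],
    test_inputs: Iterable[int],
    workload: int,
) -> float:
    """Return the speedup factor, or 0.0 if the candidate changes behavior."""
    # Gate 1: the candidate must agree with the baseline on every test input.
    for x in test_inputs:
        if candidate(x) != baseline(x):
            return 0.0
    # Gate 2: compare runtimes on a larger workload (guard against a zero reading).
    candidate_time = max(best_runtime(candidate, workload), 1e-9)
    return best_runtime(baseline, workload) / candidate_time


def slow_sum(n: int) -> int:
    """Baseline: sum 0..n-1 with an explicit loop."""
    total = 0
    for i in range(n):
        total += i
    return total


def fast_sum(n: int) -> int:
    """Optimized candidate: closed-form sum with identical results."""
    return n * (n - 1) // 2


def wrong_sum(n: int) -> int:
    """Fast but incorrect candidate: sums 1..n instead of 0..n-1."""
    return n * (n + 1) // 2
```

Here `score_optimization(slow_sum, fast_sum, [0, 1, 10], 200_000)` should come out well above 1, while `wrong_sum` scores 0.0 because it fails the behavior gate on the input 1, no matter how fast it runs.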
Starting point is 00:06:44 And then there's also kind of pushing along, like, the scientific coding domain. SciCode. Yeah, exactly. So I go to. SciCode is awesome. They did like a quick. And for people, the way I explain SciCode is, it's HumanEval, but better.
Starting point is 00:06:59 Yes, exactly. Exactly. I think, you know, there's a lot of good stuff these days where, yeah, that's the way to go. Which is, like, SWE-bench is expensive to run. Any agentic benchmark is expensive to run. Actually, you do need some completion benchmarks. Yeah, just, just, just complete. Exactly. Like, you know, you can do well on those first and then sort of graduate to the multi-turn expensive stuff. Yeah. Yeah. Okay. Other than that, just like broadly, other work in the field in 2025. In terms of coding evals, obviously we should shout out METR.
Starting point is 00:07:28 They use SWE-bench and they have a very interesting, like, I guess human-hours-worked number. Yeah, they like the x-axis being sort of the runtime and their, or yeah, y-axis being the completion, you know, like we can do more long-running SWE-agent tasks. I think the projections are quite interesting. And I definitely appreciate them kind of using SWE-bench Verified to sort of proxy a lot of these things.
Starting point is 00:07:51 But yeah, they're great. Okay. Any other work that, like, caught your eye? Yeah, I mean, I think within the, okay, Terminal-bench, SWE-bench, yeah, Critical Point was kind of cool. Critical Point? Yeah, it's like a very new benchmark that Ofir did. And I think it's kind of related to physics. There's this one called SecBench, kind of related to cybersecurity. Yeah, exactly.
Starting point is 00:08:13 SRE-bench, which I think is affiliated with LOD. It's just cool to kind of see people really dive into different coding domains, and then stepping a little bit outside of coding. I personally think it's quite interesting to think about the user simulator stuff. So like Tau... Vending-Bench, too. Yeah, and Vending-Bench.
Starting point is 00:08:32 I've got mixed feelings. Yeah, no, I'm interested. Well, I mean, it's like you're sampling one path. I don't know how realistic it is, to be honest. Yeah, it's just yellow of this. But it is cool. No, for sure. Yeah, I agree.
Starting point is 00:08:43 I think it's a good initial effort. To me, I think it's super cool to see companies, like, you know, I'm sure Mercor and stuff are focusing on building environments, like for code, beyond code. And so I think it might be interesting to have, like, work gym style stuff. This is stuff that my advisor, Diyi Yang at Stanford, thinks about a lot. So yeah. Yeah.
Starting point is 00:09:03 I just realized we were talking about Terminal-bench. Yes. We have the honor. Folks. Yeah, yeah. You know, really, really good work just overall. Yeah. It's not about Tau-bench.
Starting point is 00:09:13 Yeah, because you mentioned Tau-bench. Yes, yes. There's some discussion, or some people are saying, that Tau-bench is impossible to get a high score on because some of the tasks are underspecified or just impossible. Yeah. I don't know if you're up to speed on that.
Starting point is 00:09:30 It's a little bit spicy. Yeah, it's a bit spicy. I think I saw, so I, you know, like, I worked with Shunyu and Karthik back in Princeton very closely. I think Karthik, I just saw, posted a tweet kind of... Defending it? Yeah, like rebutting some of these claims. Yeah, I mean, it's...
Starting point is 00:09:47 I think I get the concern. But yeah, I think it also brings up just maybe like interesting research problems to solve of like, okay, like why is it impossible? Is it the ambiguity? Is it kind of the user simulator that has issues? And I think generally we all agree that, you know, we'll improve on these things over time for you, boss. So I actually really like benchmarks that intentionally, I think we should intentionally include impossible tasks as a flag. Yeah. Of like, hey, you're cheating.
Starting point is 00:10:12 Yes. It's kind of sad that, like, Karthik actually is defending it, because the master move would be like, oh yeah, you caught us, that was, you know, like, everyone reporting above 75 on Tau-bench retail, you've been cheating. Yeah. Oh, interesting. That would be, that would be cool. Yeah. I mean, yeah, you'll have to ask the Tau-bench authors, but yeah, no, that's fun. Yeah, I think there was... ImpossibleBench was a recent benchmark. Maybe from, was it from Anthropic? I don't know, but they basically took SWE-bench Verified and they changed the issues to make them impossible. And they checked, like, how often the models would be like, I actually just can't do this.
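The "impossible tasks as a flag" idea can be sketched as a simple audit: if a benchmark deliberately contains tasks that cannot be solved, any submission that reports passing one is suspect, and the share of solvable tasks gives a ceiling on an honest score. This is a toy sketch with hypothetical names (`audit_submission`, the task IDs); it does not reproduce how Tau-bench or ImpossibleBench are actually scored, and the 75 figure quoted in the conversation is not derived here.

```python
from typing import Dict, List, Set, Union

AuditReport = Dict[str, Union[float, bool, List[str]]]


def audit_submission(results: Dict[str, bool], impossible: Set[str]) -> AuditReport:
    """Compare a reported score against the ceiling implied by impossible tasks.

    results: task id -> whether the submission claims to have passed it.
    impossible: ids of tasks the benchmark authors know cannot be solved.
    """
    total = len(results)
    # Tasks that are both claimed as passed and known to be impossible.
    solved_impossible = sorted(
        task for task, passed in results.items() if passed and task in impossible
    )
    impossible_in_run = sum(1 for task in results if task in impossible)
    return {
        "score": sum(results.values()) / total,
        # Highest score reachable without passing any impossible task.
        "honest_ceiling": (total - impossible_in_run) / total,
        "solved_impossible": solved_impossible,
        "suspicious": bool(solved_impossible),
    }
```

For example, with four tasks of which two are impossible, a submission claiming three passes has a score of 0.75 against an honest ceiling of 0.5, so it gets flagged.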
Starting point is 00:10:48 I don't know what's going on. Oh, like for refusals? Yes, yes, yes. Oh, how did they do? I thought that was interesting. I think they're all, the models are all kind of attempting and saying like, oh, I did it, you know,
Starting point is 00:10:58 so maybe not great. That's cool. But that's an important one. Yeah. How do coding evals evolve next year? Wow, that's a great question. I mean, honestly, I think people will make more SWE-benches. I think Terminal-bench has really got something going
Starting point is 00:11:13 where you ask people to, you know... with SWE-bench, you're confined in some sense to the domain of issues and PRs that already exist, which I think has its benefits of being close to reality and natural. But I think with Terminal-bench, there's a lot of creativity that you can infuse into that. So I would personally be really excited. Like, the 2.0 job was really excellent. And I'd be super excited to see, you know, 3.0, 4.0. Because of, like, the environments? Yeah. I mean, the environments, you know, bringing more people into the fold. You know, I think, correct me if I'm wrong, Mike, but early on, you had PhD students,
Starting point is 00:11:45 very smart CS people who are adding tasks. And, you know, what does that look like when you fold in more coding environments for non-coding tasks, non-coding environments in general, and ask people to make stuff there? So that's pretty cool. And then of course, for myself, I think just, like, this long-running SWE-agent kind of thing just feels very compelling. I think the vision of, like, hey, I tell it a goal, I don't have to be super specific about my task, I have, like, a decent verifier that proxies what I want, something literally like a codebase that makes the most money in this, like, setting, you know, like that's my verifier, you know, and I walk away for five hours. The thing is just running. I'm hanging out with you, talking to my friends. I come back and it gives me, like, literally a SOTA codebase on that, you know,
Starting point is 00:12:31 task. I think that would be super cool. Okay, I'll push back. We're part-time at Cognition. Yes. And we are emphasizing a lot of interactivity, because the point is that you're going to underspecify, right? And actually what people want is back and forth, back and forth, on, like, a really fast time frame, which is terrible for a benchmark author, right? Because how do you do that? Yeah. But realistic. Yeah. So, um, I think, like, this is where I'm a little bit anxious or cautious about this push for long autonomy, right? We're gonna, I mean, you know, let's say this time next year we'll have... five hours is pessimistic. Like, yeah, it'll be 24. Yeah, right. Days.
Starting point is 00:13:13 But I don't know if that actually materially changes the industry. So we'll push it. As in evals, you know, we have the people, people make evals here. Yeah. We push the industry in ways that we want it to push. But I don't know if, like, that's a productive way, because that's more of, like, a stunt. Like, yeah, it's a proof of concept, an existence proof. It can be done. Yeah.
Starting point is 00:13:35 Yeah. But will you use it in real life? Yeah. Yeah. I mean, honestly, to me, I think there's potentially room for growth. So I would actually agree with your take here. I mean, with my lab at Stanford, with Diyi, like, there's a, you know, her emphasis is on human-AI collaboration.
Starting point is 00:13:52 And so I definitely don't believe in this idea of just kind of getting rid of the human. But, yeah, maybe just like finding the balance of like, you know, just because the developer ecosystem is so diverse and there's so many participants in it who want different things out of it, like just enabling different levels of abstraction. And, you know, it depends on the task. Like, there's settings where you want to be, you know, more involved and more sort of hands-on, and so you want to use Windsurf for that. But then maybe there's kind of this general data processing thing.
Starting point is 00:14:22 It's just a lot of JSON parsing you don't really care about. And that's the one I kind of want to walk away from and just let it figure it out. Yeah. So, yeah, I would agree with you generally. Yeah. Amazing. Any calls to action?
Starting point is 00:14:32 What do you want help on? How can people, I guess, like, find more of your work? Definitely. For the call to action, super jealous of all the great data that Cognition and, you know, Cursor would get. Like, that user interaction data is, like, really fascinating. From the academic standpoint, it feels like there's two difficult approaches to resolving that. Either you build, like, a really compelling product, like LMArena, that people use consistently, which is, I mean, really tricky in and of itself. Or you build, like, really good user simulators that try to mimic sort of these settings.
Starting point is 00:15:06 But that is also, like, non-trivial. I don't think it's as simple as, hey, ChatGPT, act like a human, right? So it would be really cool to sort of get inspiration of, like, what exactly does that data look like, or between the two, like, what's the best way to scale up sort of evaluating human-AI interaction. And then I think for visibility for my own work, pushing more arenas. Like, I think for CodeClash, what I'm excited about is, the current framing is really long-running SWE-agents. But you know, you could have multi-agents, like two agents work together on the codebase, and what happens? You have a human and an agent work on the codebase versus just AIs. What happens there?
Starting point is 00:15:47 You know, like when the models improve and hopefully they hill climb and they become better at digesting logs and iterating on analysis, you know, how does human-AI interaction, like, change with model capability? And so I'm kind of hoping, you know, I'm trying to inspire and convince people that it's a very cool testbed where you can do a lot of different sort of combinations of, like, human-AI on different arenas, playing one arena at a time, N arenas at a time. You know, I just, you know. Yeah, I'd be very interested to work with you on the interaction stuff.
Starting point is 00:16:19 Oh, that would be awesome. And then I think one more thing I'll add is, Cognition is going to be pushing a lot of codebase understanding, which is kind of codebase retrieval plus-plus. Yes. And mostly it is helping humans understand their own codebases better, to enable humans. Yeah. Or to sort of mind-meld the human with the machine to do
Starting point is 00:16:40 the highest possible task that LLMs could not do alone, humans couldn't do alone. And then the other thing is also, like, basically automatic context engineering for an LLM. So that is, like, sort of like a research sub-agent that we're working on. That's so awesome. Yeah. So I don't know what the benchmark would be, because, like, how do you benchmark understanding? That is. Apart from, I think, like, it's also mostly like you freeze a repo, have some manually curated answers and then, you know, pose trivia questions.
Starting point is 00:17:09 That's very easy to saturate, so I don't know how else you do it. I think Silas tweeted a while ago, like, sort of like the Wiki, the Code Wiki, that's incredible. I mean, I use it on a... Yeah, Google actually just came out with their own version. Oh, yeah. Yeah, with the Antigravity people, that's... No, no, no, this is like a separate...
Starting point is 00:17:27 It's a different team. Yeah, yeah, gotcha, gotcha. But cool, that's the state of code. Yep.
