Latent Space: The AI Engineer Podcast - [State of RL/Reasoning] IMO/IOI Gold, OpenAI o3/GPT-5, and Cursor Composer — Ashvin Nair, Cursor

Starting point is 00:00:00 Okay, we're here at NeurIPS. We're recording special Latent Space coverage of the folks at NeurIPS, and we're here with Ashvin from Cursor. Welcome. Hi, yeah, thanks for having me. So, I guess, like, Ashvin from Cursor is like a new identity. I didn't even know if I should say that because you only joined Cursor three months ago. Before that, you were at OpenAI working on o1.

Starting point is 00:00:19 Before that, Berkeley, PhD in RL, but focused on robotics. Robotics, yeah. Is it weird switching from robotics to language models? Okay, this is kind of interesting because a lot of people have been kind of doing this. I mean, OpenAI, yeah, you got a robot. I actually was at OpenAI in 2017 also.

Starting point is 00:00:38 Was it on robotics? Yeah, I was interning right before my PhD, and I worked on robotics there. 2017, is that like Japan and that's Japan over there? He was famously OpenAI's first intern. Oh, really? Okay, then he might have been before. But yeah, there were like 15 interns. It was a very different company.

Starting point is 00:00:55 It was just like robotics, Dota, and like 15 interns that summer, all having like pretty exciting individual projects. Like, uh, yeah, that set of interns, if you look over there now, it's kind of cool. Yeah. Um, but yeah, anyone from that class that, like, you would shout out? Um, like, uh, there's just like a lot of cool papers that came out. Like Lerrel Pinto, now he's at NYU. Um, yeah, the, uh, the person who leads, um, reasoning at xAI, I forgot his name. Uh, well, he led... no, Eric? Yeah, I forgot his name, but, um, he worked on like K-FAC and stuff, I think. Um... Did you know? Greg?

Starting point is 00:01:31 Not Greg, but yeah. But yeah, it was like an exciting time to be there. But yeah, I think robotics is a pretty good fit for LLMs because, like, the switch ends up being pretty, like, you know, you kind of do similar things. Like you want to look at a lot of data. It's like kind of hard to get stuff working in the robotics world. I think, you know, it kind of builds like very gritty people who like look at data a lot, that kind of thing.

Starting point is 00:01:56 So yeah, for whatever reason I think, like, that transfer, like, yeah, it's happening a lot and it makes a lot of sense. One of my NeurIPS highlights so far: I had dinner, it's like a small group dinner, with Lex Fridman yesterday. And Lex used to be in robotics. And he was like,

Starting point is 00:02:10 my assessment of robotics people: robotics people are the best to talk to at NeurIPS because they're the most grounded, he says, because they don't have a choice. They work with the real world. Yeah, look at data. And then, like, the most unhinged,

Starting point is 00:02:22 the most detached from reality are, like, the simulation people. I see. Yeah, yeah, yeah. Yeah. I think I agree. Yeah. Yeah, and I actually did a little bit of both during my PhD.

Starting point is 00:02:32 Like I would kind of, you know, prototype ideas in sim and then get them working on real-world robots. And, yeah, I mean, probably robotics is where you kind of feel AI the least, right? Because it's just so far away from working. Now I think, like, I think over the last year maybe, there's been demos that have been super interesting from like Physical Intelligence and like Sunday and stuff that, yeah, I'm starting to be like, okay, like this kind of feels like... Sunday robots themselves?

Starting point is 00:02:57 I haven't seen them. Apparently they've been doing demos, I'm pretty keen on seeing that. Yeah, I've seen the Physical Intelligence ones live, and yeah, it's pretty impressive. Like, just, like, in someone's living room, like, folding laundry and stuff. Like, you know, you can just, like, toss it in there. Everybody must be materialized. Yeah, yeah, yeah, yeah. Okay, and last thing on robotics, and you can kind of pivot to o1/o3.

Starting point is 00:03:20 Just, OpenAI is, like, restarting a robotics team? Is that serious? Is that? I actually know very little about it because, yeah, I was in like a pretty different part of the org. So, yeah, I mean, I think it's serious. I think there's a ton of excitement around robotics right now. I'm actually kind of curious what drives it because I don't think I fully understand.

Starting point is 00:03:42 You know, like there's been like crazy raises and stuff recently, right? For robotics companies. I guess my own view on it. And so when I left robotics in 2022, I thought I would actually come back to robotics. I think my view on it now is that it feels like LLM agents are going to be like a trillion dollar market before robotics is maybe even like a $10 billion market.

Starting point is 00:04:05 And this is just because, I mean, you know, LLM agents already create value out in the world. Robotics, it's like kind of hard to make the case that, like, you know, kind of AI robotics, like, does anything that useful yet. And then once it does something useful,

Starting point is 00:04:20 then you have to make the unit economics work out. And I think that's also quite hard. Like reliability, you know, I mean, these robots have to be, like, fixed and this kind of thing. So I think it's kind of hard. I would say the market is kind of efficient in that the software LLM companies are raising tens of billions. Yeah.

Starting point is 00:04:39 And then the robotics companies are raising hundreds of millions. I think very recently, it's been like single-digit billions. Oh, really? Okay. Yeah. So I think that's like the maybe surprising thing to me, is that, like, it feels like it's ahead of where it's actually at. Yeah.

Starting point is 00:04:54 Like, I would say that robotics is in kind of like the GPT-1 to GPT-2 era right now. Okay. And I haven't worked on a robot explicitly. What task would qualify as like, oh, that's the inflection? It's a little bit like you know it when you see it. I thought the, like, Sunday demos were kind of cool. Like maybe it's like starting to get there, and the details matter a lot, where it's kind of like it has to be in like a new scenario.

Starting point is 00:05:19 Like in one that you haven't seen before and maybe on like... Generally OOD. Yeah, exactly. And I think that was kind of what GPT-2 was too, right? It's kind of like you start to see hints of like cool generalization. And I think that's fine. Like, you know, it doesn't have to like work out of the box. But yeah, I think at this point especially it still feels like in robotics,

Starting point is 00:05:40 You're not exactly investing in a technology probably. You're just investing in a team. Yeah. Yeah, I'm not in the space whatsoever. Sure. That's kind of my impression. It's actually nice when you're not in there. Because you know as much as like most basically everyone else.

Starting point is 00:05:51 Yeah, yeah. So we just kind of speculate. Exactly. In my people, there's a robotics team at OpenAI. Yeah. Back to language models. Did you join for o1 or were you like... So I joined right before ChatGPT in like, I think September of '22. Yeah.

Starting point is 00:06:07 So, yeah, actually, yeah, I was like pretty burnt out from my PhD and I was like, okay, I'm going to go to this like chill research lab, and then, like, yeah, ChatGPT happens and, like, you know, everything kind of blew up and like a lot of stuff got kind of like refocused. But what, I guess what... So obviously ChatGPT surprised OpenAI. What did they tell you they were looking for you to do? And then obviously it changed.

Starting point is 00:06:32 Yeah, I mean, so I joined on the codegen team. The Codex. Yeah, like, Codex. Exactly, yeah. It was like the team that shipped Codex. But by the time I joined, we were kind of more so working on the model doing tool use and these kind of things. Yeah.

Starting point is 00:06:46 And so like very related to the chat. Like we're kind of like a sister team to the team that, like, shipped ChatGPT. Yeah. Yeah, exactly. So yeah, so we're just kind of working on making the models like smarter, like kind of programming competitions, like, yeah, how to do like SFT for that, that kind of stuff. Would an IOI gold have felt reachable at that time? Oh, yeah, crazy.

Starting point is 00:07:08 Like, I think, and this is something I, like, repeat to people again and again these days. Like, if you told me that we could have gotten IOI gold then, I would have just assumed that we could all just go on vacation. Like, you know, it's all over, like AI solved. Like, no point in working anymore. We got it. Yeah, it feels like nothing that much has changed, right?

Starting point is 00:07:29 Like, life is still the same. Yeah. Yeah, so I think that's like super interesting. Yeah, I don't have a great way to explain it. But I think that's actually like what I spent a lot of time thinking about is like, you know, why is that the case? Yeah. Because, yeah, I mean, you kind of see this again and again in AI, right,

Starting point is 00:07:43 with like solving chess and then like it doesn't really matter, and solving Go. And yeah, so you keep seeing it. But yeah, I think it, like, surprises you every single time. Yeah. I think maybe, I think one is we keep moving the goalposts a bit. We're very good at that. And then two is I think actually just our definitions of what constitutes AGI are bad. And we don't actually mean what we say when we say, oh, when we have achieved this, then we have AGI. So like clearly when we have achieved our goal with a language model,

Starting point is 00:08:12 we have AGI. It's wrong. Yeah. And I think shifting the goalposts to some extent is correct. Like, we keep Goodharting whatever goalpost we have. And I think it's kind of hard to like... To be Goodharted is like too negative. It's like, I will cheat to do what you asked me to do. But I don't think it was cheating. It was just scaling test-time compute. At a meta level, I think the community, not cheating, but makes a lot of like implicit decisions

Starting point is 00:08:37 to go after the, you know, evals and benchmarks that matter the most. So, SWE-bench Verified for sure. Yeah, exactly. But, yeah, IOI, hopefully not that Goodharted. Well, but like, it kind of clearly is to some extent, right? Because, like, you know, most programmers in the world cannot do IOI at any decent level, but like we're still struggling to like automate most programming jobs.

Starting point is 00:09:01 Or like, you know, there's a lot of stuff to do. So it's like, it's like where language models are here, like junior, senior dev, and then suddenly for IOI, there's like a spike. Exactly. And like there's something to switch about that. Yeah, okay. I kind of saw this at a meta level also with RL research. So yeah, I did my PhD with Sergey Levine at Berkeley from like 2017 to '22. And that era of RL research was like super interesting because RL was like super hyped, right, like starting from about DQN in like 2015, and a lot of the methods that people were really excited about are like, you know, off-policy learning, like value functions, like these kind of things. And somehow that stuff hasn't really panned out, I would say, and it's not exactly clear why. But in the academic literature we thought we were making a ton of progress, and I think in retrospect I have to say that we probably kind of overfit to the benchmarks

Starting point is 00:09:55 pretty heavily. And, you know, how I see this in retrospect is that we gave ourselves a lot of, like, new knobs to tune and then implicitly kind of tuned those to fit the benchmarks. Everyone knew that we were doing that at some level, but I think it's hard to appreciate that it's not just happening for a single paper; at kind of like a meta level, for the whole community, that's happening too. And I think the result is that, like, I don't know, like a lot of the RL research that came out of that era, I don't think is like that used, you know? And I think it's for a similar reason, that basically you were kind of like benchmark maxing. I will flat out say there was an RL winter. Entire startups that were founded based on that premise at the time basically gave up.

Starting point is 00:10:36 Some of them died, some of them pivoted, whatever. Yeah, yeah. Yeah, I think in, so because I was like in academia, there's still quite a lot of excitement over it. But yeah, it still felt quite academic. And yeah, I in that era was a little bit frustrated because I felt like, you know, one of the pitfalls of academia is that it doesn't really reward, like, simple ideas that work,

Starting point is 00:10:59 and instead kind of tends to reward, like, kind of mathier ideas. Those mathier ideas also give you these, like, kind of implicit knobs to tune that allow you to, like, overfit. While, you know, the things that actually work tend to be kind of simple ones that have fewer knobs and just generalize to, like, many things without... There's just, like, less secret sauce to it, apart from just throw a lot of compute at it. Exactly, exactly. But those are things that tend to like...

Starting point is 00:11:24 It's like not intellectually interesting. Yeah, exactly. And from an academic point of view, it's like, oh, like, why am I sitting in school? Like, yeah, I think for a lot of people who do PhDs, they're kind of wired in a way that they want to, like, think about interesting new stuff. Yeah. And, yeah, like, you know, the scaling era kind of like, you know, probably stuck to that. Scaling era. Is scaling era over since we're proud of?

Starting point is 00:11:45 I think I've just been, like, paged into that from Ilya Sutskever's interview. I don't think it's over, but there's definitely something interesting happening, like the thing I was saying about, like, IOI and IMO. I think we'll still continue more or less on the same track. Like, clearly, you know, like, these labs are, like, releasing their new pre-trained models, and they're, like, still doing, like, much better than before. So I think scaling is still happening, but I think it's happening in a different way, or it's worth like seriously

Starting point is 00:12:21 interrogating why it is that we're not just like automating all jobs right now. I think my view is something like: RL, the way it's applied to LLMs right now, is kind of a weird, funny tool where it doesn't really generalize beyond the training distribution that much. It generalizes to some extent,

Starting point is 00:12:39 and it generalizes in interesting ways, but it's like very peaky, right? Like it can kill the training distribution completely. It can be like best in the world at it with like not that much effort really. But yeah, it doesn't really generalize. So I think what we have to do is

Starting point is 00:12:54 bring the world of economically useful tasks in distribution for RL, if we commit to using RL as a tool. And, you know, it might be the case that maybe there's some like cool continual learning thing or something that like shifts the paradigm next year or something like that. But it really feels like if RL is the tool, then

Starting point is 00:13:12 yeah, a big thing that needs to happen is, like, it doesn't feel like intelligence of the models is the bottleneck. It's more like, you just have products that bring the entire context of what someone wants to do into the product so that the LLM can see it. And then you do RL on top of that.

Starting point is 00:13:31 Yeah. Have you seen GDPval? Yeah, I've seen it, yeah, yeah. Is that basically what you're envisioning? Yeah, I haven't looked at GDPval closely. I actually haven't seen exactly, like, roughly, yeah. And recap, it's 128 tasks across like any white-collar job

Starting point is 00:13:47 that takes more than 5% of GDP, right? And they basically created all the context to eval on it and evaluated every model. Famously, OpenAI has evals too. Whoever runs that one always finds that Anthropic's are the best.

Starting point is 00:14:02 Yeah, yeah. It's, uh, yeah, props to them for, uh, being published. Yeah, it's doing that, yeah. It's an actual science. I think it's good. But, but like, I think like, in a sense of like, generalizing beyond

Starting point is 00:14:14 coding competitions to economically useful tasks, that is it. I think that is the... What is more important for GG6? Yeah, what I'd like to do is kind of, like, I just haven't read the GDPval traces closely. It's not clear to me, you know, like, what does the job of an accountant entail

Starting point is 00:14:31 and, like, what kind of context needs to be in the product? Yeah, so you can actually do it. They have, like, PDFs. I see, I see. Like, so they try to go as close to source documents as well. I see, I see. Yeah. Yeah, so, yeah, I think, like, roughly operating in this kind of thing is what I envision.

Starting point is 00:14:43 Yeah. Because it can be like an artificial, like, oh, let me clean up this data for you. Exactly. To make it easy for the LLM to process. No. PDF in an agent, go. Yeah, I think that's roughly the right shape of the thing. And I guess how I imagine this being operationalized is that you'd want to co-design the product and the model.

Starting point is 00:15:02 So that like the product for whatever it is. I mean, coding is kind of maybe the easiest first step because most of the context that you care about is just your code base. And like being able to run stuff in the terminal and that kind of stuff. And still, like, we're not that close really to automating it necessarily. But, you know, for, like, all the other jobs, the context is, like, insane, right? It's, like, all the conversations you've had with your coworkers, like, your Slack messages, you know, like, for my... So at OpenAI, I was working on kind of, like, hyperparameter scaling research. And I actually wrote not that much code.

Starting point is 00:15:32 Like, grid search or neural architecture search? No, more, like, understanding how different, like, science of deep learning in, like, 2020, where it's like, oh, you have to, like, initialize the layers in a particular way to get good scaling laws. Kind of the analog of that for RL. The thing is, I didn't write a ton of code. So for the LLM, you know, like, writing code is not the bottleneck. But it's more like, you know, over the course of a year, I like run sweeps, look at like the interaction between different hyperparameters and kind of build up that

Starting point is 00:16:04 knowledge for like a year of like just different graphs. And to do my job, the model would also need all those things in context, you know, to like successfully like, you know, kind of automate my job. And you'd kind of want a product that allows you to like bring all that context in. Did you have to build it for yourself or? Oh, no. I mean, like, no, I mean, like I, you know, those graphs are just sitting in my head. Yeah.

Starting point is 00:16:30 Right. So I think it would be pretty hard to like go automate that job. But I think what you need to do is build a product that kind of, yeah, brings that context in, and then you want to do RL on top of that to, like, teach the models to use that context. Yeah. Another conversation that I think

Starting point is 00:16:49 has really come to a head this year is kind of the death of one model fits all. I feel like the point of the G in AGI is like one model fits all. I think OpenAI has clearly abandoned that this year. Oh, why do you say that? Fidji Simo writing a blog post with the title that we are no longer doing one model fits all.

Starting point is 00:17:09 Okay, interesting. And I think Mark Chen or one of the other senior people that are not Sam also saying this in a podcast. So basically like the idea was you started with Codex. Someone else was doing InstructGPT. Then we launched GPT-4, 4o, I guess, o1. And o1 was kind of supposed to be like a reasoning one model fits all. And then we merged the 4o and o1/o3 lines into 5, and now we're splitting it out into 5 and 5-Codex again.

Starting point is 00:17:41 It's like just a weird... Well, Ophi is very guilty. I mean, you know, I don't think you should interpret those as like scientific facts about the universe. It's just more like OpenAI has a tendency to ship the org chart, basically. Yeah. Right? The world has the tendency.

Starting point is 00:17:57 Yeah, exactly. So I think a lot of it's related to that. But yeah, it's what you mean by like... Yeah, actually, I do wonder if, yeah, like, the current reasoning paradigm is just kind of fitting itself to this kind of peaky-in-certain-areas thing. I don't think it's so much a matter of like model capacity though. It's just more of another kind of organizational thing, that like if you care really like a lot about coding, you probably don't have the data to do all the other stuff.

Starting point is 00:18:24 I don't think it's so much a matter of, like... if you had all the data, probably you would benefit from just like training on all of it. And you'll get some generalization between these. But it's hard to find like one organization that cares about all these at once. Yeah, yeah. Yeah. So before I double-click on just like the o-series at OpenAI, I do like to ask OpenAI people who were there: do you have a favorite blip story? Yeah, the blip was crazy for me. Like, yeah, it was like Thanksgiving. Like, everyone remembers where they were, what they were...

Starting point is 00:18:53 Yeah, yeah, exactly. I was at Thanksgiving with two OpenAI friends, actually, and then one of them on like Friday afternoon is like, oh, like, Sam Altman just got fired. We were just like co-working together. I'm like, what? Oh, ha-ha. Like, good joke. And then, yeah, it was crazy. And then, yeah, it was just like a crazy weekend of just like ups and downs.

Starting point is 00:19:17 Like, you know, we thought, yeah, like... You signed a letter? Yeah, I did. It's like 95% of people signed. Yeah, yeah, yeah. Yeah, I thought, you know, like. I bought to Microsoft or? Well, I think maybe I had a slightly more complicated, like, I actually do think

Starting point is 00:19:33 that governance feels really important to me. Yeah. Because it does feel like, no matter if we hit AGI in like two years or 10 or whatever, it's not clear that we have a good structure for the governance of it. Okay. And so it is a question that I think we like probably should spend more time on. And I was like during that period just pretty willing to be like, you know what? Like let's forget about the like equity and stuff. Like, you know, I think it's like good and healthy to have a conversation about like how exactly the governance should work. Okay. You care about this. Yeah? Uh-huh. Right. So now the OpenAI nonprofit has this like secret shadow board of members that determine when we've reached AGI.

Starting point is 00:20:12 Yeah. Yeah. Better? Yeah. I don't have a like maybe, I would say I don't have an answer. Like, you know, like it's just, it's not big. Above my pay grade, but like. Yeah.

Starting point is 00:20:22 And even back then, I was kind of like, well, I don't care. I do care quite a lot. When the blip happened, one of my reactions was like, well, you know, this nonprofit board stuff, like, actually if it takes such, like, surprising, maybe erratic actions, like maybe you'd rather just have like, you know, a thing like the Microsoft board, which is kind of like, you know, like probably like all the pensions of the world. But like serious people, but also like, you know, the stakeholders are kind of like the whole world because everyone's kind of, you know, through their pensions or something, invested in it.

Starting point is 00:20:55 Like maybe that is a bit more of a democratic way to run things than having like seven people run it. But yeah, I don't really know. It feels like we haven't solved governance, like, at all, though, right? Like, forget AI. Even stuff like unhealthy food or, like, social media, it kind of feels like, like, whatever the kind of, like, capitalistic incentive is, like, doesn't actually, like, capture kind of good outcomes for society, maybe. Yeah. So about, like, the transition into reasoning, right? You shocked me by,

Starting point is 00:21:27 by mentioning that the reasoning team is 300 people? It's a... It's kind of like, you know, now that... like, you know, when o3 was kind of structured as a product, like, I think it just, like, gets, like, larger and larger, how many people worked on it. So, yeah, I think I've, like, lost track of the numbers, but, yeah, like, a lot of people contribute to the different aspects

Starting point is 00:21:47 of, like, safety and whatever, evals. Yeah, so, like, original o1, like, I saw the video. It's, like, a dozen people, you know. Yeah, well, even then, like, if you look at all the contributors, it was probably more, like, 50, 200 people. Okay. Yeah. So, so, I mean, like, let's tell that story from your point of view,

Starting point is 00:22:03 figuring out what does RL mean there, and I guess was this a branch of any other prior work that you wanted to credit? Yeah, so I think, yeah, like setting the scene, I guess, you know, in like 2023, people were kind of talking about, oh, like, are scaling laws dead, this kind of stuff. Every year, every year. Yeah, yeah, but especially, I think especially that year, it felt pretty, like, serious, you know? Yeah, I think in general, OpenAI is really good about, like, having conviction in something and just like really, like, from first principles, like, going after it.

Starting point is 00:22:38 And I think like the people who are kind of most responsible for that are probably like Ilya Sutskever and Jakub Pachocki. I think even like Dota was kind of more or less the same template in some ways, right? And that was 2017. And so a lot of the people there have kind of this like AGI-in-their-bones kind of point of view. And they've basically been convinced that like RL would be the way to get there. So I think for a long, long time, people have been convinced that something like that should work. And it's just that it started to work once the, like, kind of pre-training

Starting point is 00:23:14 got good enough. Okay. Yeah. I think human feedback is kind of like a bit of like a side branch because, yeah, you can't really pour that much compute into it, right? It's like, you take the model and you like elicit it to be a little bit better in terms of personality. But, like, the people there were really convinced that at some point, you know, it's not about copying the internet. Like, you can go, yeah, do RL and, like, you know, that's like the path to, like, getting much better intelligence. So I think it was kind of like a long line of kind of like returning to RL in, like, different ways. And then it's just that around like, yeah, 2023 is when it started like really clicking. And it's kind of interesting because even, you know, it's not like

Starting point is 00:23:55 those initial models performed, like, way better than the existing models, because they're like smaller scale. But people were very good at being like, oh, like, this is kind of interesting. Like, you know, the reasoning trace that you see here is kind of not something that you've really seen be so accurate in other models like this one. Kind of similar to how I think a lot of people didn't really think of GPT or GPT-2 as something that was like super compelling, probably. I know that I personally didn't like GPT-2 that much. GPT-2, I was like, okay, whatever, and then GPT-3 happened and I'm like, oh whoa, like I feel a lot of FOMO finishing my

Starting point is 00:24:33 PhD. It's kind of that, where I think it takes a bit of like first-principles conviction to like, yeah, decide that, like, oh, this thing, like, there's something here and we should really go scale it up. And OpenAI's really good about, once you decide that something is good, then you just like scale it up all the way. Yeah, was there an internal prototype pre-o1 that was like, okay, this is the thing, we'll fund it to scale it up, right? Like, there usually is. Yeah, yeah, exactly.

Starting point is 00:24:57 What was the thing? What was the demo that like really sort of sold? Just this like, you know, like running RL on even like a pretty small model, producing like very interesting reasoning traces and like getting like surprisingly good scores on math. Yeah. In a way that we couldn't have done without like a bunch more pre-training. And then, you know, once that looks good, then, you know, more and more resources to just like scaling up that

Starting point is 00:25:23 new law. Yeah. And, you know, things like adding tool use and this kind of stuff. Yeah. I think a lot of people make a lot of headlines on the

Starting point is 00:25:34 large models, but I think it's very underappreciated, the minis, how well distillation works. Any comments or just like discoveries on... Yeah, nothing much to say there. I was also like not super involved in the mini stuff.

Starting point is 00:25:48 I think maybe one thing, not exactly related to that, but like, it seems like externally people are kind of very like, oh, like research seems to come in these big leaps. Okay. But I think internally at OpenAI, it feels very smooth.

Starting point is 00:26:02 Like you have a bunch of experiments. Yeah. Some of them have inconclusive results, but maybe you stack them. Yeah, exactly. You stack them and just like you keep scaling, you keep like having like different runs that, you know, get a little better each time. Okay.

Starting point is 00:26:13 So I think that's maybe one other aspect. That's like a little underappreciated is that like, I don't know, like in the media, there's just these wild swings between like, Oh, it's so like, Googleers were in it. Yeah, exactly. And I think like internally at Big Labs, it's just kind of like, oh, we're just like chugging along.

Starting point is 00:26:29 Like maybe this month is a little better than last month or something. But it's like not as crazy up and down. I think the question is, there used to be more of this, and now I think there's less, which is, well, the stuff we've released, we're like, you know, internally we're like six months ahead. Like, part of the reason why people at OpenAI weren't that excited about ChatGPT's launch was because they already had GPT-4. They're like, oh, we just put this out.

Starting point is 00:26:54 Like, we're already way ahead. I think now people are just releasing things as they have them. I think, yeah, especially because there's some like competitive pressure, right? Yeah, yeah. I think people are probably pretty worried that, like, if you let a lead linger for too long, that will, like, grab a lot of market share. Like, I don't know, like, Nano Banana Pro right now is probably like, you know, it's like, pretty good.

Starting point is 00:27:15 A month. So I would say, like, now the internal to external lead time is about one to two months. Yeah, yeah. Which is exactly. Tiny. Pretty short, yeah. Anything else on reasoning side? I guess you can talk about

Starting point is 00:27:27 specifically the work on coding. Anything that surprised you or is an external misconception on the o1/o3 side before I go to Cursor? Well, not really. Like, yeah, I mean, it's pretty cool. Like, I think it felt already by like maybe early 2024 like, oh, wow, like this recipe like really works. And we can see how far we take it.

Starting point is 00:27:49 And so I think, you know, it was like very steady progress. And by that point, it was probably pretty predictable that we could, like, you know, really, like, smash, like, you know, things like IMO or IOI. Yeah, one funny thing that kind of happened is while this was happening, I went to this conference called The Curve, which is about, like, kind of AI progress. And Joseph Gordon-Levitt. Yeah, I went last year. This was before the o1 stuff was released. Yeah. And I, like, went to this thing where people were kind of making bets on where we would be on Epoch AI's, like,

Starting point is 00:28:22 like the math, the Epoch math exam and, like, Humanity's Last Exam and stuff, I guess. And their estimates were like, oh, we'll be at like 10, 20% in like 2027. And I think at the time, there was like, you know, models internally that were like already better than their estimates. So it's like off by like, you know, two years or something. And the interesting thing is like those are also people who are kind of like, you know, predicting that there would be like Dyson spheres by like 2035 or something. Okay.

Starting point is 00:28:51 Simpsies, so the current estimate is way under. Yeah, they're too pessimistic in the short term, too optimistic in the long term? Yeah, well, I don't know if, like, there might be Dyson spheres by like 2035. Like, I don't really know. But I think that is like one interesting aspect is that, yeah, I think people still seem pretty miscalibrated in different ways. I do really appreciate how that community makes predictions, though. Yeah, because I think most of the rest of the world just kind of like cynically says like, I saw this the whole time.

Starting point is 00:29:23 Like, yeah. So I do appreciate that. Is this EA adjacent? Yeah, I think I think it's, yeah, exactly. It's like that. It's like that group. Yeah, yeah. I like that they like to sort of register their opinions ahead of time.

Starting point is 00:29:36 Yeah. And I think broadly, the people who've been, you know, the capabilities predictions in that group have been broadly correct if you look, you know, from like 2015 to 2020 or something, like where I think a lot of people kind of thought that AI was like a sham or like, you know, not really going to be that useful for a long time. And actually, you know, it is, it's somewhere around like a 2030-ish thing that like you will probably reach like human level intelligence. Yeah. It's weird. So like I feel like a skeptic when I keep saying, like everyone always predicts that AGI happens in their lifetime. And that's very convenient for whoever.

Starting point is 00:30:11 And like we have a consistent view of history where you see the people in the 1800s and 1900s making predictions. It somehow always lands in their lifetime, whatever the thing is. But like, this time it might happen. Almost surely, right? Like, I'm pretty sure. Yeah. Yeah. Yeah.

Starting point is 00:30:26 So it's an interesting observation, like how different are we from our predecessors in terms of development of technology? Yeah. Did the DeepSeek moment this year, also this year, crazy, change anything internally? Not really, yeah. I think that was, I think more so just, like, surprise that it created such a moment. Like, it was kind of confusing, right? It was like DeepSeek shows that

Starting point is 00:30:53 Nvidia chips are actually more useful than previously thought and like Nvidia's stock like goes down a bunch. Like it was kind of like... I think it's more like, okay, well, I'll steelman that side, which is, well, you don't need the top of the line Nvidia chips. You can just use the sort of previous generation or the shackled ones they sell to China

Starting point is 00:31:14 to do an equivalent amount of work for a recent model. I see. Yeah, but then it was also, I guess the feeling in OpenAI was that like, well, I think we had a better model already at the time, right? So, and it was quite valuable. Like, like smarter models were clearly quite valuable. So you kind of wanted to be at the frontier. Okay, so I wasn't quite framing this as like a race dynamics thing between labs. It was just also more like, well, were they right?

Starting point is 00:31:44 Was their approach right? They had R1-Zero, which is kind of a really cool branch. So more like commentary on what we learned about RL this year. Yeah. Yeah. Well, it does seem like basically a lot of the labs have kind of like converged onto some similar-ish way of doing RL. And they're all kind of back at the same level of like frontier again. Like even the Anthropic models like the Opus 4.5.

Starting point is 00:32:11 It has this kind of like there's this like ARC-AGI-2 plot that looks exactly like the OpenAI ones. Right? Like so I think everyone seems to be converging on a pretty similar form of RL. Yeah, it's kind of interesting. I think people basically figured out in one way or another to achieve more or less the same thing. Yeah.

Starting point is 00:32:27 Let's talk about the move to Cursor. Yeah. Why is Cursor accumulating, enjoying so many cool RL people? Yeah. Yeah, I've actually kind of already talked about this so far, I guess. So, yeah, I think from the perspective of Cursor, it's like, you know, nice not to be so dependent on, like, external labs for everything.

Starting point is 00:32:48 And like, I think there's also like unique opportunities to co-design the product with the model in ways that we couldn't do unless we actually, you know, built the model ourselves and like had access to, yeah, making it good. Yeah. So, yeah, that's kind of like broadly why Cursor is so excited. Okay, I'll push back a little bit, right? OpenAI has infinite resources. Infinite data, has Codex.

Starting point is 00:33:15 You could have just stayed. Yeah, yeah. Well, actually, right around when I was leaving is when, like, I think people started actually, like, using Codex a lot. So that was kind of like, that like happened right after I left. So that's kind of funny. So mostly people are using Cursor internally, maybe a bit of Windsurf because it's left over from the previous thing. Sure, yeah, yeah, exactly. So it wasn't that obvious.

Starting point is 00:33:36 But actually, I think more to the point, this thing I was saying about, like, RL is kind of a tool that doesn't really generalize that well. So what you want to do is bring the entire like kind of test distribution inside your training distribution. I saw the opportunity to do that at Cursor kind of like directly. I think the Cursor folks are also just like really excited about that kind of vision. And it's just like a small place where, you know, like the product people sit like right next to the ML people. And I think there's a lot of potential there. You can kind of see that. Recently Jacob Jackson had this blog post about like online Tab where like we're doing policy updates.

Starting point is 00:34:14 It's every two hours. Exactly. Like a policy update every two hours or something. And I think that's the type of thing that, you know, I think it's like a little hard to do,

Starting point is 00:34:22 it's like very hard to imagine that at OpenAI, for example, just because like, you know, the product is this like kind of complicated thing. And also like the product people and RL people are pretty like, you know, on like different sides of the org.

Starting point is 00:34:34 I think if you put your mind to it, you would. It's like, you know, Tab is an autocomplete. It's a smaller model. It's, you know, it's not as complex, I guess, as the LLM.

Starting point is 00:34:43 Yeah, but I don't think that's really this, like, you know, I think we, I don't think that's why Cursor was able to do it. It's actually more about, like, just the org itself being kind of like smaller and a bit more like focused. Yeah. Well, I mean, since you're indulging this, I think the question about continual learning, which obviously is a big theme. It's always been a big theme, but bigger this year. Is, well, don't you need to curate your data? You can't just like chuck whatever your users are doing straight in because that tends to get you towards the middle of the distribution when actually you want to spike it.

Starting point is 00:35:13 I guess it depends how you're thinking about continual learning. I mean, I don't know, like, humans are quite good about dealing with bad data too, right? Like, you can see someone doing something dumb and decide, like, you're not going to do it. Filter it out, yeah. Yeah, but, like, it's not even actually filtered out. Like, you have, you know, presumably some kind of value function that, like, says that if you see someone touch a hot stove, like, you're not going to go, you don't need to. You don't need to, like, it's not just filtering it out. You're actually not going to do it, right?

Starting point is 00:35:38 You could rediscover hot stoves on first place. Yeah, but, like, you don't need to. So I think there's something pretty deep there. Yeah, it seems like we're kind of like a few orders of magnitude of, like, kind of data efficiency basically away from, like, that kind of, like, you know, you do something once or, like, you make a mistake. Like, you, yeah, you introduce, like, a bug in your code. You're not going to do it again.

Starting point is 00:36:02 But the models will happily just like keep doing it. Even within the same context, but definitely, you know, of course, across contexts. Yeah. So I think there's something like interesting and deep there. Like maybe, yeah, I suspect that it will be kind of like paradigm shifting in the next like year or something, but I have no idea like, you know, what it might be.

Starting point is 00:36:19 Yeah. So primarily you've worked on Composer, Tab, and maybe search? So I have, I've actually just worked on Composer. Yeah, and that's kind of like the main focus of the company, basically, or like the ML group is shipping a better, shipping a better Composer. Can you describe, I guess, the impressive,

Starting point is 00:36:40 brag a bit about the ML group. Yeah, yeah. I mean, I think the ML group is great. It's like, you know, it's just like 20, 25 people. And, you know, I was like, honestly, like, pleasantly, like, very, very surprised at, like, how good Composer is, like, given the size of the group. And, you know, it's not, like, a big research lab yet.

Starting point is 00:37:00 And, yeah, I think it's, like, a really good model. You can kind of see that in the reception. And I think it's kind of the start of hints of, like, co-design with a product in some ways, because I think one of the reasons that people really like it is it's smart enough that you actually want to use it, and it's also fast, so you kind of like stay in the loop with the model while you use it. Because I think all the other smart models were so slow that you want to go kind of context switch away and come back, and that sucks, you know. Like, just as like a programmer, it just sucks to kind of context switch. It kind of like gives you ADHD. Like it's like

Starting point is 00:37:38 really terrible. Yeah, I agree. And I think, yeah, it's like one step in the direction of like being able to be more sync. I think that's, like, the whole company is just really, you know, full of people who want to, you know, code, even like the co-founders, you know, like actually, the co-founders are often some of the best, like, high-taste testers, which also kind of gives you a lot of like reassurance that you're going to ship good stuff. So yeah. Any example test that like maybe Composer doesn't solve yet, but you're really motivated to solve? Yeah, ironically, I feel like I am actually like a low-taste tester in some ways because I don't know

Starting point is 00:38:13 like you know I just like write like slow like machine learning code and just like think about algorithms and stuff all day. I think more broadly I'm super excited about co-designing the product so that you can actually, you know, not just, right now we're getting better and better at

Starting point is 00:38:29 like answering user prompts and I think that's why Composer 1 is like quite good, but you know what we're really aiming for is like more like, you know, automate software engineering as a process where you like write code, you go look at Datadog, look at what's happening, then come back and like, you know, maybe have some hypotheses about what's better, like, rerun stuff. I think that's the type of thing

Starting point is 00:38:51 that we actually want to make the model do. And I do think that Cursor is kind of like uniquely positioned to do that in the sense of like, you know, if we can kind of, if a lot of what a software engineer does kind of ends up in the product, I think we can use that to like get better and better at, you know, not just writing code, but kind of like the whole job. Yeah, I think that's inspiring. Just to double-click on just any sort of RL insights, Sasha and I have talked a lot about like the internal tooling that you've had for all the like the cluster visualizations. Is that helpful? Is that what every lab has? Yeah, I think the tooling at Cursor is actually really good. I think because, you know, it's just kind of like a, people are just down to like

Starting point is 00:39:32 vibe code stuff. They like do test their own stuff. So we just have like a lot of good tooling where you can have like an SSH session into like our own like user environment or something and like, you know, see if like code runs the way that like users got it to run, like this kind of thing. I think that's actually, yeah, quite nice. I think basically one of the big lessons in ML in general is that you want to be like really close to your data and understand your data well, and yeah, I think we're like kind of, yeah, again, kind of like uniquely positioned to do that well. Especially because it's all internal tooling, you're not buying anything. Yeah, it's just like internal.

Starting point is 00:40:07 And part of it is just that we're also working on a product where you can understand it really well because it's a code product. Well, like, you know, if, I don't know, at OpenAI, if I was like to look at like a biology question, I have no idea, like, you know, what this is about. Yeah. Yeah, yeah, yeah, yeah. Interesting. Okay.

Starting point is 00:40:24 So I think that's a good overview of everything. I guess other than the, we covered OpenAI and Cursor, just interesting RL work that other people are doing, that you're like still mulling over, it's influential to your thinking, good papers, anything like that. Yeah, you know, unfortunately, I've kind of gotten in the habit, especially at OpenAI, of, like, not reading that much external work and just, like, reading people's, like, Slack posts internally. That's, like, the main, like, way to, like, you know, like, learn new stuff.

Starting point is 00:40:54 No super inspiring recent things have popped up to me. I do think that this, like, kind of vibe of, like, yeah, continual learning like does feel like, I think there's something super interesting there. And like it feels like maybe even in academia people could make like a big crack at it. And continual learning specifically meaning kind of what Tab is doing. Yeah, maybe what Tab is doing, but also just like kind of like in-context learning, but with like infinite memory or something so that you don't, once you experience something in context, it should just like be in your weights and you shouldn't have to like make that same mistake again,

Starting point is 00:41:29 that kind of thing. Why do you think there's, okay, so it should be in your weights, but there's a finite capacity for the weights to remember things. You will forget things if you do that too much, right? Not really. I mean, you know, you start out by memorizing or, you know, like learning from trillions of tokens. Yeah. Now you're going to experience like thousands or maybe millions of tokens and somehow like, you know, we can, and those, the million tokens are kind of into.

Starting point is 00:41:56 And you only need one epoch. Yeah, exactly. So crazy. Yeah. So it feels like if you could learn enough about those million tokens that you're actually in deployment on, I don't think you should need, like, I don't think there's a risk of overloading the capacity of your model, right? Because you can train on a trillion tokens, and it's like, fine. Right, right. So proportionally it's a drop in the water. Yeah, exactly. Yeah. Unless you run it for years. And, you know, at some point it's sort. Maybe, yeah. Yeah. So basically, I find it very curious. I've only had one podcast on information theory of

Starting point is 00:42:26 language models. Like, what is the theoretical capacity? How much are we using? And you should probably track that. Yeah. Yeah. Yeah. Yeah. That's a good idea. Yeah. Like, treat the weights, if you want to store things in weights, okay, if it is a hard drive, what's the capacity of the hard drive, how much can be stored in there? We know the capacity. It is the number of bits that, you know, but the parameters

Starting point is 00:42:47 physically cannot store more than that. Yeah, yeah, yeah. And it's, yeah, I've heard that there's this kind of like, someone recently at Cursor, Jacob, kind of brought up this view. I don't know if it's like a more public view that's like, oh, there's kind of like a hard drive view of, you know, neural networks and kind of like a CPU view of

Starting point is 00:43:04 neural networks where, you know, is what's happening in the weights, like, yeah, memorizing stuff? Or is it like you're, like, having some, like, few circuits that do a lot of work? Yeah. This kind of thing. And, yeah, I don't know. Yeah. You know, I would love to, uh, yeah, there's, like, actually so many of these kind of more sciencey questions that I would, like, love to explore some time.

Starting point is 00:43:22 But then it really kind of conflicts with, like, empirical stuff, you know? Like, unfortunately, at any given moment in time, it doesn't seem like the best route for, like, improving something, especially in the short run, but even in the next couple of years, is understanding some of these questions. Yeah, I mean, I guess this is technically supposed to be the role of academia, but it's also hard to explore those ideas there without enough compute. But yeah, actually, I would love to, like, go at some point,

Starting point is 00:43:48 you know, like, and return to exploring these kind of like fundamental science ideas. Okay. This is a, I'm kind of springing this on you, so you can take some time. What is a good RL interview question that if somebody can answer, they should join Cursor immediately? Ooh, it's a hard question. I assume you do interviews. Yeah, yeah.

Starting point is 00:44:06 Well, actually, at Cursor, we do like work trials, and it's like two-day work trials that I actually think that that's like more representative. Because you plug in and see how they behave. Exactly. So I actually think it's like more valuable. This is honestly less of a thing about how you understand RL and a bit more like were you around in the like 2017 to 22 era. But it's like, why is off-policy RL unstable? is kind of, I think, like, a good question to, like, yeah, dive into. I don't actually know, so I'm digging into it.

Starting point is 00:44:36 Yeah. Cool. Thank you. That was a great conversation. Do you have any sort of call to action? Yeah, I mean, you know, we are definitely hiring at Cursor. So, yeah, if you're interested in working on, especially, like, kind of data and rewards for code, I think that's, like, a huge need.

Starting point is 00:44:54 Yeah, please, like, get in touch. Yeah. That's it? Yeah. Thank you. Sweet. Thank you.
