Latent Space: The AI Engineer Podcast - Why RL Won — Kyle Corbitt, OpenPipe (acq. CoreWeave)

Starting point is 00:00:04 Hey, everyone. Welcome to the Late in Space podcast. This is Alessio, founder of Colonel Labs, and I'm joined by Swix, editor of Laton Space. Hello, hello, and we're so excited to have Kyle finally in the studio. Welcome. Hey, I'm very excited to be here. Kyle, you're CEO, co-founder, co-founder, CEO of OpenPipe, which started two years ago and recently got acquired by CoreWeave. Congrats. Thanks. I think you might be our first, like, started and exited founder that we've had on the pod. maybe,ish. I don't know. I'm not, I'm not keeping.

Starting point is 00:00:37 Especially on that timeline. Well, I don't think I was exited when we, I don't remember if we set this up before or after we announced we were getting acquired. I specifically pinged you because you got, I think you got acquired. You've been on my list to watch. Obviously, you've spoken three times at AIE. And you've been on my list of like, when is it a good time to have an open pipe or fine-tuning, RL discussion? And then you got acquired. And I'm like, okay, yeah, that's a good time to talk about it.

Starting point is 00:01:04 Also because I think like it gives us a window to talk about acquisitions, consolidation, like what should be an independent company, what maybe doesn't have to be anyway. But we'll maybe do this chronologically so we don't get too far ahead of ourselves. You were famously director of startup school. Yes. Maybe for people who don't know like what is startup school. Did that make you become like fall in love with the color orange? I'm wearing an orange shirt for those who are listening. A very bright orange shirt.

Starting point is 00:01:32 This is my conference shirt. felt like, you know, it was appropriate for the pot as well. So yes, I was at I was at Ycombinator for about four and a half years and led the startup school team there. So startup school, it's, it's changed over the years. It meant one thing before I was there. It means another thing now. But during the time I was at YC, startup school was basically all of the external facing a lot of the content, certainly all of the tech. So it was, it was things like we had a like a MOOC effectively where founders could come in. They could learn about how to start a company. They could get advice from YC founders, YC partners.

Starting point is 00:02:04 We had a co-founder matching service that we built, which actually worked really well. We got a lot of people through. Our total, like, you know, I guess technically I can't, that probably doesn't matter anymore, but a very large fraction of the batches that went through YC while I was there were directly attributable to people that we found and end up recruiting to YC through their experience to at Starp School. So that was kind of what we were working on. Yeah, I was got, I was kind of consider it as like the Scout program for YC.

Starting point is 00:02:32 Yeah. Right, like the YC before the YC. Any notable, like, famous people that met as part of your co-founder matching? Because I'm always very negative on those things because, like, it's like online dating. Like the chances of success is super low. Yeah. When it works, it's really nice. You know, that's a great question.

Starting point is 00:02:47 I left. So we launched that product probably nine months before I left. And so I don't know what the long-term outcomes were of that specifically. Yeah. So you left YC. You spent a year in the kind of the wilderness. You went through YC S-23. What's that journey like?

Starting point is 00:03:03 What's the... You know, I was very excited about AI things in general. This was... So I left YC, I guess, beginning of 2022. And I was trying out a bunch of different things. Ended up landing on what turned into open pipe in early 2023. This was, let's see, so I'd been working. So my co-founder's my brother, my little brother, which has been fun journey on its own.

Starting point is 00:03:24 We were looking at different ideas. And one thing we realized was we actually started the company immediately after the GPD4 launch. And what we saw as the opportunity in the market at the time, which has changed since then, was GPD4 was insanely expensive and extremely powerful. But there was an opportunity to distill, like, specific workflows from GPD4 down to much smaller, much cheaper models. And there was like a very clear value prop there, given how expensive GPD4 was, it was hard to deploy in production, but you could sort of like take those abilities and deploy them much more cheaply. So that was kind of the first thing we built was this kind of very managed, very clean distillation. flow. What was that process like in the beginning to get people to actually care? Because I'm assuming most people are doing experimentation, but like didn't really have these large production workflows that

Starting point is 00:04:10 they needed to like distill down. And then I think maybe once we got there, the models get cheaper and faster. So what was like the initial, you know, six, nine months of the company through the evolution of the models? Yeah. So it worked. It was great. So I mean, it did take us a while. I guess we formed the company early, maybe March of 2020, 2023. By the time we launched our product, it was August, I want to say. There were some like different things we were trying in between. And actually it was not hard to find people and get them excited. There weren't very many. I mean, this was even late 2023. There weren't very many people in production, but anyone who did have production workflows, it was extremely painful. Like, you know, they were paying hundreds of thousands of dollars a month to open AI. So it was very easy to convince them to try this out. And so we got our first three customers after launching probably within a month. And we were doing significant revenue. Over the next six months, we actually got to a million in ARR. over about a eight-month period following that launch, so by the latter part of 2024. So actually, yes, initial traction was super strong, very clear value prop.

Starting point is 00:05:10 But then, as you were alluding to, kind of like, there was just this slow march of like the frontier model token prices, just dropping over and over by, you know, 3, 5X over and over again, which kind of ate away our value prop over time. What was the process of like fine-tuning the model? Because even the open models were not that great, you know? And so what were maybe the bottlenecks? Like instead of having three to get those,

Starting point is 00:05:31 like 30 customers. Did you feel like in the beginning it was like a matter of like just the market growing like the open source models not being good enough like the fine tuning not being simple, efficient enough? The pain point, I guess repeating what I said before was the price was too eye on the closed models, but you couldn't just drop in an open model and replace them because like you're saying the quality was quite bad, especially as you're moving to two smaller model sizes, but larger models, open models weren't even available at that time. So that's kind of where the value prop was was like, hey, the closed models are too expensive, at least the ones that are performance enough to do the job. The open ones are not good enough. We have like a very clear managed flow.

Starting point is 00:06:03 The way the flow worked was quite simple. You simply put in our SDK. It's a drop in replacement for the open AI SDK. It's capturing. You continue to use GPD4 in production for a period of time. We're capturing the requests and responses. And then we had just a very clean managed flow where it's like, okay, at some point you say, hey, I want to distill this down and you train on that. And then, you know, we provided an API that was a direct drop in replacement. You would just change kind of the inference URL and you were using your own model in it, your app continued working. Yeah, I think the market analysis here, because I was also exploring, starting a business around that at the time, and that's why I ended up not investing, was basically you get

Starting point is 00:06:43 squeezed between the GPU providers who also want to do fine-tuning as a service because then that makes people more sticky, and the labs who keep putting out distilled versions of like something, whatever mini versions of their models. was the analysis on the neocloud side because you kind of also want to host the inference. Yeah. Honestly, we, like I said, felt very squeezed from the frontier labs that were putting out just more capable models at lower cost. I did not see the competition ever really materialize from the neoclouds, from the GPU providers.

Starting point is 00:07:16 Everybody had an offering and fine-tuning. When we talked to customers, nobody used them because they just were really hard to use. So I do think that, like, you know, call it a product thing, I guess. Like, it's not their focus. Yeah. Who cares? Yeah. Interesting.

Starting point is 00:07:30 Developer experience matters. It does. Yeah. Still does. Did. I don't know. Maybe it doesn't matter anymore. Now, we just have coding models to everything.

Starting point is 00:07:36 No, it still does. Like, when you have thinking machines launching an API and people getting excited about the API, you're like, yeah, okay, that's pure developer experience there. That's fair. Yeah. Yeah. What's the, I'm just going through the chronological list here. Yeah. It was like the Mistral, 7B, find two and kind of like one of the big inflection points like in the history of the company.

Starting point is 00:07:55 It's like, okay, this is like a good open model and like the 7B size or is that just. Mistral and mixed trial. That was like a golden period of fine-tuning startups because mistrial was like a credible open source model. Yeah, they were really strong models better than the Lama 2 that they were, you know, effectively replacing. And they also have a super open license, which I think the licensing has become maybe less of a concern over time at the margin because people are getting used to maybe. But at the time, that was like a pretty big deal, that they had this fully open Apache 2 license. And, you know, yeah, maybe they have their own, like, IP issues with how they train it.

Starting point is 00:08:35 I don't know. I have no inside information there. But at least the guarantee they were making of people using their model. Yeah, I call this mistral washing. Yes. As long as it's like, it's, you know, constant the sparkling region of France called mistral. It's okay. Don't ask about what goes into it.

Starting point is 00:08:48 There's plausible deniability. Exactly. Arm's length connection there, yeah. Okay. There was this mistral period. But Jan 2024, you talked about S-Lora. And there was a period of time where LORAS became more important. I feel like they then became less important.

Starting point is 00:09:02 And I don't know what's like the rise and fall of LORAS for you as a business. Yeah. So LORAs have really, really, so if you're predicated on the fact that you're doing fine-tuning at all, LORAs have very, very attractive properties relative to doing a full fine-tune, right? Because if you're doing a Laura, you can at training time, it makes, it helps some. using less memory to train. But it really, where it really helps you out is at inference time. Because if you're doing lauras, then when you deploy it for inference, you can multiplex,

Starting point is 00:09:29 you know, basically an arbitrarily large number of lauras on the same GPU deployment that lets you do things like do per token pricing as opposed to GPU hour pricing. It just gives you a much more flexibility at deployment time. I'm actually still allurable, like for the record. You know, you're talking about the rise and fall. I think, I think Laura's, you know, their future is still out there. I mean, they're cool again because of thinking machines. Yeah.

Starting point is 00:09:50 I felt very vindicated by that blog post for the record. just I guess for listeners, thinking machines put out like a week or two ago with a blog post doing quite a lot of research on the tradeoffs between Laura's and full fine tuning and in various different training regimes. I think the reason Laura's were uncool for a while was mostly just because like fine tuning was uncool. Like I think if you're doing fine tuning anyway, like, lores are still like, you know, in many cases the way you want to do it. But not that many people were doing fine tuning. As a marketing guy, Laura's had bad marketing. Like they were just like, oh, like you can't afford full fine tuning. Here's like the Walmart, like store brand fine tuning.

Starting point is 00:10:25 No, that's fair. There is some of that. I think we didn't have a huge issue. Like, we've had to do some user education, like, hey, just try it. I think for the training runs that, like the types of training runs that were interested in where it's like, hey, I'm doing a relatively lightweight customization of an existing model for a specific task. There's really no downside to using Allura. And there's a lot of like upsides from an like in, for a simplicity point of view.

Starting point is 00:10:47 I agree that there's like a branding issue around that. Hopefully the thinking machines blog post. kind of like you know like rank one i like you know i think there's there's different hyperparameters the laura's that you can use to to make yourself happy the fact that john schumann was like nope like we're actually banking the company on this at least for now is a pretty big vote of confidence i you know i feel i think it's surprising that no one's done the research prior to to them and i was talking to one of the machines prior to their launch who had come from one of the big labs and and what that research at all was like oh no everyone doing post-trainer research inside

Starting point is 00:11:20 this big lab uses Laura's. I mean, not for like the full run, but like when they're doing like their experiments, they'll just use Laura's on a base model to run the experiments and it works fine. For listeners of the pod, that was leaked in one of the pods that we released, but it's up to you to find it. Cool. And then,

Starting point is 00:11:37 so then it was the first World's Fair. You talked about you probably don't need fine tuning as a fine tuning founder. Basically, I think your talks are really good. I would recommend people watch all of them. What I pulled out was you had a piece of advice. So your talk title is obviously somewhat intentionally clickbaity, but your actual advice on when people should fine tune is when it's cost, latency, or quality consistency that you really care about. Yeah, I mostly stand by that.

Starting point is 00:12:01 I don't think it's changed. And the biggest one we see today, and this is true for kind of like classical SFT, it's also true for the RL stuff we're doing today. Cross my fingers is not always the thing. But the main one I see that really it drives fine-tuning is if you have to move to a smaller model, and it's typically for latency reasons, and this is usually like real-time voice, So if you're sort of forced into a smaller model anyway, then there's a very high chance that doing some tuning on that model is going to get you, like it will be necessary, basically, to have a successful deployment. So we see that a lot coming from customers that, again, have those latency requirements. There's other reasons as well. Sometimes, for whatever reason, you really have to deploy on a single GPU.

Starting point is 00:12:39 You have to deploy within your own cloud. And you want a, you know, you basically have to use a smaller model to do that. So basically, in the case where you're forced to a smaller model anyway, then fine tuning it is often necessary. I would say for 90% of use cases where you aren't forced to a smaller model, then it's still not a good ROI. And you probably shouldn't invest in it today. How do you quantify these things? So costs, right? Could always be lower.

Starting point is 00:13:05 So is there kind of like a threshold of like cost to ROI? Like, because it's also hard to figure out how much is going to cost to do the fine tune. Because you need to get the data and all of that. Like, do you have a mental model of that? This is sort of like a function of the total amount of overhead required. I'd say there's two parts on the cost side, and then, you know, there's multiple parts on the benefit side. On the cost side, the main things you have to think about are the upfront effort required to get an actual like training system set up for your task. And that can be quite variable, but I would say at a minimum, you're going to have to dedicate a couple of weeks of like a fairly competent engineer's time.

Starting point is 00:13:43 and if you have like a very complex system and you're doing RL and you need to set up a whole environment, it could be a lot longer. It could be, you know, a couple of months of time. So that's just like a fixed cost you have to pay. There's also like an ongoing carrying cost where once you've committed to doing fine tuning, it does make other parts of your stack less flexible, less nimble because whenever you're updating your prompt or like you're at new context or whatever, like now you have to like, you know, spend a few hours training a model and that's just going to like slow down your, your iterations. cycle, which is a real cost. And in many cases, that's the larger cost. So you only want to do that if, like, the benefits are large enough. The dollar cost, I would say, is basically never a factor. It's just so much less than the time, the amount you're spending this engineer to do the work. But it's not, I mean, each of these runs is between five and a couple hundred dollars.

Starting point is 00:14:36 And it's just, you don't have to do that many of them. Yeah, because most of the data is like first party. Yeah. Right. Okay. When was the switch to RL? Was it when 01 preview came out? You were maybe like, okay, it's time to move on from SFT?

Starting point is 00:14:50 Yeah. So that was a big moment for us with, you know, there's all the leaks before that about strawberry and all this and like, you know, a lot of people talking about, okay, how are they doing it? We realized through that like, okay, someone's figured out how to make RL actually work with LLMs, which was not a thing. I mean, it was a thing like some people had played around with before that, but it wasn't like I think many people were thinking about.

Starting point is 00:15:10 And so our bet at that point was, yes, let's figure out whether this works for tasks specifically. And the space we just, I think it's important to kind of like tease out different parts of the market. I think with the release of 01, and this has been like proved out many times with releases since then, I think like there's now like a very strong consensus that like, okay, on the frontier model like general purpose model side, investments in RL are paying off. I think I don't think most people would argue with that. You're, especially as you're getting into these agentic tasks and training them to do that, like, it seems very clear. Well, obviously, the big lattice are paying, like, ridiculous amounts of money for these environments and everything. But also, like, they're actually getting really good results. The model's coming out.

Starting point is 00:15:53 You know, we're seeing it especially on the coding model side, but like in other contexts as well, we're seeing the sort of especially agentic use is working way better because of this. So I think, like, even late 2024, it was pretty clear that, like, RL was going to work in that context. And then the question in our mind was like, can we apply this in a different segment of the business, which is kind of like task specific customization? And so the question is like, does that work well? How much effort does that take? Is it going to be something that ends up being unnecessary because, oh, the big labs can just like train on every single task and the base models are going to be just good at everything. And so there's, you know, no benefit to it. So those were kind of the open questions in our mind.

Starting point is 00:16:28 But it seemed like there was like at least a good enough bet that, you know, we wanted to try it out. Yeah. And you've had this agent reinforcement training framework and you did the email. agent, it's kind of like the first proof of concept. Was that obvious to do email? Was it obvious to call it that way? What was like the behind the scene? How should we package this?

Starting point is 00:16:45 So what I told our team, and this was, we decided to go all in RL in January of 2025. And we've been doing some experience before that. We released before that kind of like an RL model that had, you know, would generate like hacker news titles from articles, which is a fun project. So we've done a little bit before that. But that was kind of like, hey, we're going to bet the company on, not in a literal sense. Like we could have done something else later. but like this is like the thing that we're going to spend all of our time working on for at least a few months.

Starting point is 00:17:10 And like what I told our team at that time in January 25 was like there's probably like a 25% chance that this is the right direction in the sense that like a year or two years from now, all the companies, you know, everyone doing inference should be doing RL and task specific training so that like their models just way way better at their task is a relatively low chance. But it was sort of like one of those big of true things. Like if that is true, if it turns out that like just doing RL on your task is just like something everyone should be doing and it's and it's just, you know, teaching these agents continually, teaching them through experience is just going to be a huge benefit than like being the first people working on that would be a really, really like awesome position to be in. So that's how we thought about it is like, you know, less than 50% chance, but really big outcome if not. If so, I think since that time and I've been very transparent with this like with our team and like when I'm talking to other people like I don't think the chance that that. that is the right approach is 100% yet. I think that we're still in the process, even after going through this and, and, you know, doing that of like figuring out.

Starting point is 00:18:10 But the probabilities in my mind are going in the right direction. Like, now I think they're actually, like, today, I was actually just thinking about this with another conversation. I think that the chances that, like, everyone should be or, you know, everyone who's deploying an agent at scale should be doing an URL with it, either as part of sort of like a, you know, like pre-deployment or even like continuously as it's deployed, that that's like the pattern that that's going to get to. I'd say there's like a 55, 60% chance that that's just like the better thing to do. And that's informed by kind of like our experiments working with customers. So anyway, not 100%.

Starting point is 00:18:40 But like going all the way back to your question, like, no, it was not obvious. It was an informed bet. You know, it's still a bet, but one that I'm feeling pretty good about right now. One thing I think that is tricky about just as he's onboarding onto this space is all the math. I remember reading the DPO paper. I think they were at New Rip's for 2023. And people are very excited about it. some of it's like just being pretentious for a paper but some of it's actually like real complexity

Starting point is 00:19:05 you know you don't have like a PhD like a prior sort of ML background how do you sort of come to grips with it like what would the best ways to get around it for you I would probably push back on that a little bit I don't think the math is actually that complicated I think that like when you you know you see the PPO equation or something with all the symbols like if that's your first intro to it then it feels very complicated but I think like if you were to show that exact same equation, just like code, not maybe not pie torch code, because that you also have to understand. But if you just, like, did the naive implementation in like Python and like showed someone like, hey, this is, this is kind of like how we're computing the loss here, who was

Starting point is 00:19:41 like a strong engineer. Like, I think it's actually like quite grokable. So yeah, I mean, like, I don't think it's like the buried entry is that high. I think you just have to like believe you can do it and then like spend some time staring at it. That would be what I would recommend. It's like, you know, you can read the papers and look at the equation. I think, actually, this is one area where OEMs have been super helpful. If I'm reading a new paper and I look at one of those equations and I'm like, I don't understand how this new term they introduced, like, corresponds to like these other terms, then I can like dump like all the context around it into, you know, GPD5 and say like,

Starting point is 00:20:15 hey, can you like write this out of Python for me and show me what they're doing differently? And that's super helpful for kind of like my background, I guess. Yep. The way I put it is, I wish that all these papers would just publish with pseudocode or just straight up Python instead of math. Yeah. Because you actually just need to look at the implementation. I know like Jeremy Howard's been beating this drum for years. I know.

Starting point is 00:20:34 I know. I mean, there's a literal website I call papers with code. And like people just keep not following it. I remember interviewing the DPO guys when they're at New Reps. And it was just like, they were just very obsessed with like proving in principle equivalence to PPO. and like it was just like it was very hard to follow I'll definitely say that and I think like now obviously at some point like GRPO kind of took over the general consensus it was very strange because I think when deep seek first like started talking about it it was viewed as an optimization they tend to just generally catch everything as an optimization but I think the leader insight which I think you touched on in one of your blog posts was that no actually it makes comparisons independence rather than global. And that's actually what unlocks some monos, like, sort of self-supervised RL.

Starting point is 00:21:28 Yeah, I mean, it's interesting. There's real pros and cons. If you're moving from PPO or something similar to it to GRPO, there are some big pros. I mean, one pro is just sort of like operational simplicity. Like there's a whole extra model you need for this value model you need for PPO that you can throw away with GRPO. And that just, like, makes your life easier. you don't have to train that model, but also, like, there's, like, no hyperparameters around that model that you have to configure. So, so that that's nice. Another thing is the benefit that you're

Starting point is 00:21:57 talking about, which we've observed. So the way GRPO works is, is you have to do, like, you know, a set of different trajectories or set of different rollouts all in parallel with the exact same environment, the exact same conditions, and then you score each of them. And GRPO uses the differences in those scores to promote the trajectories that do better and sort of, like, decrease the probability of the ones that did worse. Because they do it in sort of a group relative way, the only, it lets you be a little bit looser with how you score them potentially. Like, you don't have to necessarily have a globally aware scoring function. You just need some scoring function that is able to distinguish between this small set of things you have in front of you. And then that's easier.

Starting point is 00:22:36 That's easier for a human. You know, if you tell a human which of these, choose which of these is better. It's easier for them to do than say, like, is this one good or bad in absolutely terms? Yeah. So that's nice. The big downside, the huge downside of geography. And I think actually the reason why GRPO actually is likely to be a dead end and we probably will not be continue using it indefinitely. The fact that you need to have these parallel rollouts in order to train on it is actually that like that makes the data generation much more complicated because you need a fully reproducible environment to be able to do these sort of parallel rollouts. And it turns out in practice, that's like getting that setup is the hardest challenge today with getting RL working is like actually designing this robot. useable, you know, environment that you can run all of this training in. Most companies, and that's not true.

Starting point is 00:23:24 Like, sometimes that's easy to do. Like, there's certain situations where you can do that. But for the work we do at least, where we're training agents on real code bases to, like, operate, like, you know, real applications, it turns out it's, like, really, really hard to sandbox those things in a way that's, like, totally reproducible. And PPO, now in practice, a lot of times when you're training with PPO, you also will use an environment like that because it lets you do a bunch of runs and be more data efficient. But at least in principle, you have the option with PPO.

Starting point is 00:23:50 You can actually like purely train on like say real production traces of like real people interacting with your app. And so you don't have to have a simulated environment at all, which makes the deployment like much easier. Can you double click on why it's hard to do the sandboxing? Because in principle, we just capture all the inputs. Yeah. Well, you don't need to just capture all the inputs. You need you need a system that reacts the same way your production system does. And in many different ways.

Starting point is 00:24:18 And so let's say your Airbnb, right? I'm bringing this up because this is like an example of one that like, you know, companies have gone out and built sandboxes. Like if you're Airbnb and you're trying to, you want to train an agent to like, maybe you're not Airbnb, fine. You're a company like us that's trying to train an agent to like do really well at operating Airbnb and booking on your behalf, right? Like you have to build a copy of the Airbnb website that reacts to you as the user

Starting point is 00:24:44 the exact same way that the real one does with the same failure modes, right? Because if you don't include the same failure modes and bugs they have, then like, when one of those bugs comes up in production, your agent's going to have no idea what to do with it, it's just going to fall over. You also need to simulate if this is like a sort of cooperative agent, right, where it's getting human input as well and kind of like working with the human to get something done, which in practice is the way a lot of these are deployed. You also need to simulate the user.

Starting point is 00:25:05 And I mean, you can do the naive thing and just say, oh, we're going to have a separate LLM that with a system prompt that is like the user simulator. And we do that. But it's like, okay, but like the breadth of, ways a user might respond, there's like a lot more diversity in that than the actual diversity you'll get in practice when you have this like simulated user. And so then it's like, okay, well, is this environment close enough to how a real user would interact that like, you know, if a user says something different, that it's going to know what to do. And the answer in many cases

Starting point is 00:25:30 is no. If you're just purely training on kind of like an LM user simulator, it's going to have its own idea of like what the correct way to answer is and the breadth of like a way a human might respond in the situation is wider. And your agent. just may not be able to deal with that. Do you feel like it's hard to build the simulations as a company that needs to build the product that lets everybody do it? Or do you feel like even for the individual companies that own the code base that are like domain experts in their own product is still just like a very hard infrastructure problem?

Starting point is 00:26:01 I think it's still very hard. You know, like ideally all companies should have this anyway because they're getting, you know, if you're doing end-end testing, like theoretically, if you're following best practices, you would have one of those set up. When we talk to enterprises almost universally, that's like not something. that really exists. So there are some startups. Like there's some companies we've talked to that do have it.

Starting point is 00:26:18 And we can just like use that. But it's a very, very small number that actually have an environment like that. And I think it's hard to do it. And like there's lots of like weird bugs that don't show up an environment like that. And even if they do have a testing environment, they don't have it populated with like full realistic data, which is also like important so that it understands how to, you know, interact. So I think in practice, it's hard in both cases. Maybe it's easier for the company.

Starting point is 00:26:43 but at the same time, depending on the quality of the company's engineers, it might not be easy for them either. Yeah. How do you classify the types of environments? So you have formal environments, like a compiler, you know, you can put in there. So you don't need to do any work. They just work. Then you have this kind of like RL environment startups in a way that are building a bank environment. They're building these things that are not digital twins or whatever term of like the actual environments, but they're like close to it.

Starting point is 00:27:11 And then on top of it, you have helping people trying to build the exact replica of their thing. There's obviously value in like the formally verified ones. We verified that. Do you think there's value in this like RL environment startups that are building like somewhat generic but test specific environments? And then if none of those work, then what do we do instead of GRPO? I guess the question. Yeah. I suspect there is value in that.

Starting point is 00:27:37 You know, I think the, you know, the folks buying those environments, and training on them in the big labs would have the best knowledge on how well they work. I think they probably work okay. I think they probably also are like, you know, and we'll see maybe with the next generation of models released, like how well they transfer. I would say so far, it seems like they don't train well enough. Like if you use, you know, Open AI's agent interface, it's like okay. Or if you use the computer use products that everybody's putting out, they're like, okay,

Starting point is 00:28:08 but like not reliable enough to like actually. like let go do something interesting unsupervised in the world. And I think if the, you know, if the environments they were training in and were high enough fidelity, then they would be good enough in the same way that like coding agents can go much further because I think that in that case, we do have environments that are much higher fidelity because it's a much simpler environment in a lot of ways. It's like it's a code base. It's like maybe running a web browser. Like it's, it's much easier to capture the full realistic environment in that context. For those who are interested, when you make a reference to our own environment startups selling to the big labs,

Starting point is 00:28:43 they're selling it for a lot of money. Yeah. Like at least seven figures, right? That's my understanding. I don't know. I'm not a buyer. Please, please, like, drop data points because, like, people who are not in Silicon Valley don't know this.

Starting point is 00:28:55 And, like, it's, like, probably the current thing in VC, which is our own environment startups. Anyway, I just... A lot of them. It's, like, 20 of them, apparently. Yeah. But it's like a small number. I know that, yeah, all the labs are buying ad hoc.

Starting point is 00:29:10 But in a way, it's almost like they don't even care. It's not a product. It's like they're basically like paying the company to build an environment, ad hoc for them. It's a very services business. But I mean, if you're spending like a billion dollar in a training run. Yeah, but like you can specialize in like we are the one that does e-commerce. Like we are the e-commerce.

Starting point is 00:29:28 So come to us for e-commerce. Yeah. Go to the other guys for like social media. Go to the other guys for like, I don't know. But I'm curious. Your take is like, how do you need to get the data out to make it fit in your training run? Especially when you get to like this larger labs, I think that are like very sophisticated post-training pipelines. And I don't know if there's like a way to just build a company where it's like you just send them a CSV of like data.

Starting point is 00:29:51 It needs to be very integrated in it. But I'm curious what you've seen working with customers too. So for RL, like the whole way this works is, you know, it has to sort of be getting feedback from the real environment. So I don't see a world where it's as simple as like, hey, you can, you know, there's like a CSV type approach. I guess you could code anything as a CSV, but if you try hard enough. For oral to work, you have to be looking at real runs, ideally of your actual agent in its current state across, with an environment as real as possible. So you have to like look at actually. And like the data format's like actually super simple.

Starting point is 00:30:28 Like it's just like basically a list of, you know, like chat completion messages. It's effectively whatever. Tool calls. Yeah, exactly. Yeah, it's whatever your agent will be seeing and doing when it's running. So getting the data is not hard, but what's hard is like when you're doing one of these runs and your agent makes a tool call, okay, now that tool call has to connect, you know, somehow it's got to get data back from something and that data has to look like it will look

Starting point is 00:30:49 in real usage. So setting up that whole part of the system is the challenge. And then for just a reference job for more people, Web Arena is my first instance of this kind of thing where you literally have a Docker container that has like a clone of Reddit, clone of Wikipedia, clone of GitLab, clone of CMS, and a clone of an e-commerce place. And I think since then, there's like mine to web, maybe. I don't know if there's other large, well-known academic environments where people are basically using these as benchmarks, but probably also it's pretty useful for trading. So if you want to check out those things,

Starting point is 00:31:24 you can definitely check there. I think the question for you is, as someone who bet on SFT, then you bet on RLFT, and then now you see these guys making a lot of money. why didn't you go there? It seems to me like that definitely is a services heavy business at the moment, as it's presently constituted. I'm sure that these companies are all developing different kinds of secret sauce on how to do this more quickly. So that's part of it.

Starting point is 00:31:46 I don't particularly enjoy services businesses. But, you know, I also kind of feel like we will move towards a world where either the big labs, like, it's one of those businesses where like the only customers right now are like, whatever, four, maybe, maybe six big labs that like, you know, you. you know, are training these models on environments. And I don't think I'm a little... Right. What's the tab?

Starting point is 00:32:08 Yeah. But, you know, like, look, you can say the same about scale AI and all of their competitors that are like, you know, many billion dollar companies that have basically the exact same customer set. So, so, yeah. I may work out. Yeah. Unless you, I don't know if you want to do a small shameless plug for Various.

Starting point is 00:32:25 Oh, yeah. I mean, so Various, one of our portfolio companies, they work with the people building the agents, not with the model on like their internal tool call loop. So they can observe all the internal traces and build the data to then have like a open pipe do the RFT on the thing. I think in the enterprise, we've seen a lot of that, especially for chat pots. It's like the less sexy use case, but like they work with a lot of financial services company where their customers go in there and say, what's my balance? Like, when did I do this transaction? And those are all tool calls, you know, and they need a way to test and improve that behavior.

Starting point is 00:32:58 And the models haven't got in that much better because these tools are like badly documented. They're like badly named. I think that's kind of like the problem with a lot of the agent builders that are not AI native companies. It's like they just put this like very generic tools in the thing. And then they expect it to work like magic. And these simulations kind of help them also have the usual compliance things. It's like before shipping this, we tested that it doesn't give financial advice. We tested that, you know, there's all these different things.

Starting point is 00:33:27 So I'm curious to see how much the companies generalize, you know, I think like, There's has a lot of success in, like, highly regulated environments because of different requirements. But I'm curious if you have a different way to segment the market of, like, when you think about RL, there's, like, environments that are, like, low stakes. There's, like, environment that are, like, high stakes. There's environment that have implicit rules that are made by the SEC or other government agencies. How do you think about it? Yeah. I don't know that that segmentation is necessarily the most relevant.

Starting point is 00:34:01 I'd have to think more about that segmentation, whether it's, you know, there's like a strong difference in how useful RL is across those sectors. Where I see the segmentation is something basically just like capabilities based, where it's like, hey, if I'm trying to do something that's like much more advanced and, you know, maybe like long horizon, then RL can probably get me a much better behavior. And I might almost think that like, yeah, those sort of like more compliance. Like I feel like in those kind of environments, you probably don't want your agent doing very much because Because then it's like you can't make any guarantees about what it might do. And so you're probably not doing these long horizon things. And maybe RL is not going to get you what you want. But I don't know.

Starting point is 00:34:41 Yeah. I haven't thought about it too much. Yeah. I think like a lot of the customers don't necessarily end up doing RL anyway. It's almost like the simulation and the environment. It's like a way for them to understand the paths that the agent can take and less about we need to then use that data to do fine tuning. But I think it's like it's going to be a spectrum.

Starting point is 00:35:00 Yeah. What replaces your PO? Yeah, it's a good question. We need the alpha. Yeah, I mean, I don't know is the short answer. I do think this is like a fairly high salience question in the research community. I think there's a lot of folks like trying to figure that out. Every paper has a variant.

Starting point is 00:35:16 Yeah, but a lot of, but I think, you know, the big question is like, are we doing, you know, normalization based on grouping or in some other way, right? That's, that's like, I would say, like, I would claim we're just going to keep calling it GRPO as long as the normalization is done. within like a group, even though, yeah, there's a lot of things that, like, probably should get their own names, a lot of things that have tried to get their own names and have failed on the marketing side. Yeah. I think something that like doesn't require group level normalization, which a lot of, you know, older things didn't probably works, but I think that the older things also are really finicky. So there's, there may be other kinds of simplification. And I don't know exactly what, what those will be. Where do you put the prompt optimization thing? We did a Dev Day episode and

Starting point is 00:35:56 we mentioned Japa and then everybody came out of the woodwork on, on Twitter. D.S.I. Brose. Yeah, exactly. Tell me, have you or people you talked to tried Jepa? I want to know, like, what? I read the paper. I'm just like, look, like, the prompt layer updates are not the same as weights updates, which they're just comparing apples and oranges. And I talked with a few people of respect on this, on the RL side.

Starting point is 00:36:19 And they kind of validate it. Like, the way that these grad students market their papers is their thing beats the current hot thing. And the current hot thing is GRPO. But like, they're just not. that comparable. I disagree with that. Like, I actually think they are comparable in the sense that it depends on for what purpose, right? But like, if I'm a company and trying to like get the best performance on my agent, like, I don't care if you're changing my prompt or if you're changing

Starting point is 00:36:43 my weight. So if you get better performance on my agent, you know, I'm, I'm happy. On that front, I do think they're comparable. And we've evaluated, I mean, we evaluated like like the answer was you are going to do both. If you really want max performance, you're going to do both. Yeah. We valued everything from Disp. And we evaluated JEP as well. And it's like, It just doesn't work. Okay. That's going to be the... Fighting words.

Starting point is 00:37:06 Jepa doesn't work. It didn't work on the problems we tried it on. It just didn't. It got like a minor boost over the sort of like more naive prompt we had and was just like, it was like, okay, just kind of like our naive prompt with our model gets maybe like 50% on this benchmark. And like Jepa got to 56 and we do our own. We get to like 96. I mean, it was just like not even comparable.

Starting point is 00:37:26 And so maybe we were holding it wrong. You see, so both sides are claiming skill issue, right? So what they would say is you probably used it wrong. And then what's fair? Or all people are saying that probably JEPA guys, when they set up the GRPO benchmark, it wasn't a very fair comparison, which is exactly what my source said. It's hard to tell. You know, everyone has, everyone has, is trying to get to some version of the truth.

Starting point is 00:37:47 Yeah. But what I will say is like we want it. I mean, I don't know if I would say it goes so far as to say we want it to work, but we certainly want to know if it works. Like, that's like actually very relevant to like the product for building on us on. efficient to get there. Yeah. And then you should have been able to go working.

Starting point is 00:38:01 That's, yeah. It's actually kind of more credible now that you're like, you know, you're part of a larger core weave that you're not. Obviously, because I think JEPA maybe is, uh, makes, uh, open pipe like less relevant. I, I totally would disagree with that. Okay. Because like the level we see ourselves operating out is actually, we're not like, RL bros trying to figure out like the use case for all RL.

Starting point is 00:38:23 We're like, hey, we're working with all these enterprises, we all these big companies we're talking to and we're trying to figure out like how we make. make their stuff work better. And so, like, I personally am very motivated. Like, if something like JEPA works, like, okay, let's, let's build a product around that. That's how I think about Open Pipe at least. No, I mean, that's a good clarification to make. Even more so, you actually took a sincere look at it and you concluded that there was nothing to do, nothing to build. Well, you know, maybe we were holding it wrong. So we had shown you on the podcast a while ago. And like, I think he's been a proponent of automatic prompts optimization and this idea that, like, you can do a lot more in the prompts than you

Starting point is 00:38:56 can do in weights. And in principle, I'm biased inclined to believe that something like a DSI, something like a JEPA works. So I'm very surprised to hear this. Yeah, like we keep trying it, you know, we tried the MIPRO B2 stuff that was hyped before that. Also, okay, I should not bury the lead on the best argument for this, which is it basically JEPA models how the big labs do their system prompts. It's genetic evolution, you know, and they sort of incrementally evolved, based on the overall e-vals that they have, it's slow because it's done by humans, but Jepa theoretically improves it,

Starting point is 00:39:34 it automates this. Okay, hold on. Is the Quinn of the Big Labs have something... This is news, this is interesting. No, no, no, no, no. This is philosophically the same. I'm not saying, like... Oh, sure, but, like,

Starting point is 00:39:44 you're injecting a whole lot of human intuition and kind of, like, potentially out-of-band information. We have the best model in the world, which is humanity, or, like, smart humans. And now we're doing JEPA using dumb LMs. Right, but they're all... also like the humans can bring an out of baddened information that like maybe is not captured in the actual like, you know, the eval. Like they can be like, oh yes, technically this did well on the eval, but it's like not really, you know, like I would suspect that a lot of that ends up getting injected through that human being in the loop. Yeah, yeah. I've always been very surprised at how these guys work on their system prompts, which are tens of thousands of words long. And there's no ablations. They just kind of pick what seems to work and then chuck it in there. And that is the cloud system prompt.

Starting point is 00:40:30 I can't argue with success. Is GPD5 the first model that had a prompt optimizer by one of the large labs? I believe so, but I don't remember. Cloud Workbench had this like a year and a half ago, if you see it that way. It just wasn't like fully automated, but it was extremely good for this time. I kept telling people about it and nobody believed me. Do we know if they used it internally? Cloud Workbench?

Starting point is 00:40:50 Yeah. Okay. Why not? Oh, I don't know. Like I just, my experience, you know, knowing a lot of people at these labs, is like they launch a lot of products because like some team is super excited about this product. But that, I wouldn't put that much weight on it just because they launched it. For some measure of use internally, I'm sure the people I talk to are biased.

Starting point is 00:41:08 I don't know if you fully explored that. Yeah, no, I think that's a, it's just interesting that now it's been acknowledged that like the LLM can improve your prompt. And so I think like Japan always also writing this way of like, okay, maybe we can do this programmatically. But I think the long tail of people just prompts really badly. And so I think there's some value there. Versus, once you go into URL, you already have a more sophisticated audience.

Starting point is 00:41:32 You know, like who gets to do GRPO, people that are really smarter. Who gets to do prompt optimization? Like, everybody's trying to do it. Yeah, that's fair. Maybe our baseline was. I know. Your naive prompt is probably like, you know, top 10 percentile of prompts that people put in these algorithms. Sure.

Starting point is 00:41:47 I'll take it. Yeah. And then the other thing that comes to mind as you were talking about things, injecting things out of ban and all that, I think it's a broader trend that I'm tracking for Wolfsford-26, which is the move to online evals. The way that we do e-vals today is probably too locked down. You're kind of fighting the war that you already know should be fought, and you're not fighting the wars that you don't know about because you didn't plan for it, whatever.

Starting point is 00:42:10 How can we sort of move more online evils into our JEPA process? Maybe that's what it is. That part I'm much more bullish on. And we can make the analogy, like we can pull in kind of like RL intuition here, which is if you're doing JEPA on a sort of static data set of like, oh, this is the input, this is like what makes a good or bad output, then like as you're updating your prompt, like your information, the data you're training on becomes less useful, right? Because it's generated by, you know, because it's based on kind of like the problems you were running into before. And that's the same problem you have with RL where you have this concept of being off policy, where it's like as you're doing training, you really want to be training on rollouts that came from the latest version of your model. Because if you train on some that came from further back, then it's like, it's sort of staled.

Starting point is 00:42:53 data and it's like not it's no longer representing the current issues with your model. And so if you try and correct for the issues that existed back then, it may not actually be helping you that much. And I think, you know, for either RL or prompt optimization, that's definitely true. I think that like one way to apply that in practice is exactly what you're saying, where you're using the actual data from your, your real evals. You have some way of saying like, hey, either people flagging these or Nellin flagging these or some way of saying like this was a good or bad output.

Starting point is 00:43:21 I totally agree with you. If you're bringing that into your process, I'm much more optimistic that you're going to get good results. Yeah. And the pipelines are not set up. This is like analytics and UX people, like, trying to being drawn into the ML process, which they've never been done before. If I had to make it as a big theme for next year, this is going to be it. No, I agree. And I mean, I think that like all of the sort of observability people like platforms see that and like are trying to figure out what the right shape is.

Starting point is 00:43:49 I haven't seen the right shape yet, but yes, it seems like a theme for next year. Statsig? Maybe. Yeah, I haven't used them, but opening eyes seems to like them. Yeah, I mean, like, I do think, like, buying, you know, an experimentation platform makes sense. And, like, you know, I think it's started, like, I've said before on the podcast, I think, that I'm very bullish on model routing as a feature, but less bullish on model routing companies because of exactly stuff like this where, like, it is just.

Starting point is 00:44:19 going to get absorbed into a model. It's a very big part of building the process. You probably don't want to, and it's not that hard. It's, it's, it's, it's, it's, it's, it's, it's, it's, it's, it's, it's, it's, it's, it's, it's, like, connecting pipes and making sure things are set up so that it's easy to use that data. I have a question for you, a general question. So, what fraction of tokens generated by, say, like, the end of 2026? Do you think are going to come from open source models versus proprietary models? Oh, oh, oh, that's a fun question. So we have an answer from anchor, from free interest where it was like it's 5% and going down. I think it's going to go up because of the amount of enterprise adoption of open models that I'm seeing. And also- Because there's a lot of

Starting point is 00:45:02 demand. Like there's, the enterprises would much rather be on open models if they actually could get the performance they're looking for. Yeah. For cause, for privacy, all that stuff. And I think like, basically, honestly, it's just literally like, we may have hit quote-unquote AGI in a sense of like, it is the average LLM is capable of the work of the average human. Not the best human, but the average human, sure. Like, it's actually pretty decent at customer service. And it's actually pretty decent and like, I don't know, transcribing things from PDFs, whatever. So, like, yeah, I mean, totally, I think that should rise.

Starting point is 00:45:38 But people who believe that it should rise to like 50% are out of their minds. And I think it's a true question. We should take coding out. I think once you take coding out, I think, yeah. I can be like 15, 20%. But I think with coding, it's still going to be very low. Because these max plans are like so subsidized and so many tokens are being generated. Like, Anthropic is like, you know, 50% of the revenue is like.

Starting point is 00:46:00 It's your claim that it'll mostly be, you know, that coding will mostly be closed models because the tokens are subsidized or because the models are just so much better. I think as long as, I mean, I'm paying 200 bucks a month and it's like, I'm spending thousands of dollars. Like by accident, by accident I pay with like my credit. card and I spend like a hundred bucks in like an hour and it's like by this is like the thing that no way wants to talk about for anthopic like anthopic went from like one billion in revenue to five billion and it was like ooh-hoo yay and then like what's the margins you have this like goose me and going like what's the margins right um yeah they say it's like six percent there you are

Starting point is 00:46:36 part of the six percent that is abusing everything so everyone else i'm not abusing you're the lost leader it's not like i'm rotating account i'm just using the product for you know it's like yeah yeah but like through you people like hear about cloud code they pay paid it $200 a month and they don't use it and they pay for your influence. Yeah. Thank you. Thank you, everyone. Keep doing it. I don't have to go away. But I think like I don't really see, it's hard to see a world in which QuenCoder or whatever model replaces that. Because between quality and costs, it's like to make, to generate data sum on tokens for 200 bucks a month. I don't know how anybody can like offer like to get or fireworks. They can not really offer it at that price.

Starting point is 00:47:16 And the quality is not as good. But the reason they can't offer that price. is because of the subsidies, right? Which is not like the long-term, like, sustainable dynamic. I mean, it's interesting because both Anthropic and Obenai are building their own infrared. And, like, they're going to get to a place where they're going to have idle GPUs that they own. And so they will also be incentivized to have 100% utilization. And so, you know, they will subsidize some of it. Just the same way, you know, if you go on SF compute, like, you pay a buck 40 for like an H-100 instead of like the 220 list.

Starting point is 00:47:48 the price on AWS. So I think it will continue, but again, it depends on whether or not they actually have the $500 billion like they were saying, which I think they do. You know, just to be clear, I think Stargate will go online. But once it goes online, then it's like, well... If they figure out how to pay for $500 billion worth of compute, then they probably can subsidize for a while. I think they have the 500B.

Starting point is 00:48:08 They're going bigger. Isn't it obvious? What do you mean by half? At the start of this year, when they announced Stargate, people were like, oh, you don't even have 10. Elon was like, you don't even have 10. Whatever. And then Satya's like, I'm good for my 80. But like now, now we're seeing all the money start coming in. And like probably it's in order of like 200, 300, 300 billion, like that you could probably get raised and committed. And they're going to get the rest. Like it's fine. Like I think that the plan is actually a lot bigger. Can I just say I love this

Starting point is 00:48:39 industry? It's like, yeah, they've got like two or three hundred billion. And like, what's another couple hundred billion? There's no other industry in the history of the world where you can Yeah, yeah, it is stupid, but also like, do you doubt it? Like, I don't. I like, yeah, that's fair. No, like, I literally, like, after last week, I think maybe two weeks ago with the whole Oracle and Vedia and then even AMD deal, I'm like, oh, like, these guys not only, they've locked down Stargate one, they're working on Stargate 2, wherever that that is.

Starting point is 00:49:09 And like, the sheer ambition is like freaking crazy. There is still one more shoe to drop, which is the non-es, the non-sharegion. sovereign wealth funding that OpenDIA needs to get, which they've promised to drop by the end of this year. And my money is on, they have to do a coin. Like, I'm not a crypto guy at all, but like, you know, this is going to be like an open AI coin. This is the one AI founder that has his own coin already.

Starting point is 00:49:32 Yeah. And like he needs more money. And he said that they will come up with new innovative financing methods. What else is there? Yeah. They're already in a token selling business. But you've got to. That's a great line.

Starting point is 00:49:44 Like buy an opening a token, they translate to a GPT5 token. Like, you're sure? It's a stable coin. You'd have to get, you'd have to get a lot of political buy-in, I think, to take that level of risk. The White House that is most crypto-friendly since the dawn of time. I guess, like, Elon's out of there now, so maybe they can get the, make, make the friends, yeah. I think it's doable. We'll see, you know, like, who knows?

Starting point is 00:50:11 Yeah, for what it's worth, I've, nobody's, like, this is a, like, this is a, this is a me theory. I don't have any inside information. Yeah, should we go back to the ruler? Yeah, sorry, right. Open fire. Anyways, we were saying, I think this story takes us to July 25 when you release Ruler, which you got easy mode for RL rewards. And then, I mean, shortly after you get acquired in September. So maybe you just want to talk through the summer, you know, what was the vision? Then maybe now the acquisition came together. Yeah, absolutely. So, you know, I mentioned my initial, like, opinion of like how likely this, this, this, direction was to work was maybe 25%, we're up to, you know, 55% or so. And rulers actually

Starting point is 00:50:48 a big update on, that got me from the 25 to the 50. So let me, you know, I guess just for context there. So basically, there are several problems you have to solve if you want to use RL successfully. The problems you have to solve, I mean, some of them are just like really dumb, basic, like, hey, you got to get the infra and like the libraries have all really sucked and been built by, you know, PhD students who don't know, like, how to build reliable software. So there's, there's like all these, like, practical issues that, that we're working through. That's one thing. And that's, that's kind of what we're trying to solve with art. But even after you've got that solved, you've got like major issues, which is like,

Starting point is 00:51:19 you got to know if your, if your agent is actually, or, you know, whatever system you're using on RL is doing a good job, right? That's, that's fundamental. You have to have a reward. You have to know it's doing well or poorly. Sometimes that's easy to do. If you're solving like a math problem or something, you can come up with a data set of math problems and the known solution and check if it's the same. On the coding side, there's been a lot of like innovative work around, I mean, there's, first of all, like a lot of open data and a lot of, you know, I think the approach a lot of companies take is you find existing test cases and then you break them. But there's sort of like a way to figure out if you can run the test case, right, and see if your code fixes it or not.

Starting point is 00:51:54 In a lot of other domains, it's like much more murky. It's like what is a good job versus a bad job? How do I know if I did a good job? And you really need that information. So we've tried a bunch of different things. Ruler is a library that we released. Which, let me, relative universal LM elicited rewards. Thank you.

Starting point is 00:52:10 Yes. And the way it works is basically this depends on the sort of GRPO insight, which was mentioning earlier, that you actually don't, with GRPO, it has this nice property where you don't have to have like an absolute judge of the truth. You just have to judge relatively. And so simplifying a lot is basically just LMS judge on a whole group. So you say, okay, this is the task I'm trying to achieve. Here's four different runs of an agent trying to achieve it. Which of these did best? And it stack rank some. And it turns out that works phenomenally well. with the GRPO, like way better than I expected, way better than, you know, anyone who kind of like I talked to before we actually tried this expected because it's sort of in, in the L.M. He was in his judge, it can have sort of like self-ground because it's, it's just getting these relative ranks, right? So it doesn't have to like have like an omniscient view of like what good or bad looks like. So that has worked at basically everything we threw it at. We've done it with a bunch of client projects. We've done a bunch of our own customers.

Starting point is 00:53:05 It basically just works. Like it's basically like I honestly kind of feel like the reward assignment problem. is like fairly solved. Yeah, it's fantastic. Just any LMS judge off the hook? We've tried it with so many things. So one of the results we published was we used Quen 2.514B as the model we're training.

Starting point is 00:53:23 And as the judge, we used Quen 2.532B, which is like not, I mean, it's fine, but it's like much worse than any frontier model, right? And even with that combination, we were able to get our agent doing like state of the art better than any frontier model on the tasks we tried it on, even with like an extremely the weak judge model. So it really doesn't depend on having like a really great judge model in practice. So yeah, it's just like, it's just not something we've had to worry about since then at all. So that's kind of like checked off. So that's sort of like got me like a significant increase and like, okay, this is actually something people can apply. This is now something that's packaged up.

Starting point is 00:53:56 People can just use our, it's a, we open source to everything. You can use it off the shelf. If you stick in your train and run, it will probably just work. So that leaves the remaining problem, which we were, I guess we were talking about the amount of order, but like that remains the the environment problem, right? That's like the one big remaining piece that like we don't know yet how to automate or remove and requires a lot of manual work for every single task. For listeners, you know, this is why I kind of refer to it as self-supervised because it is like removes more and more of the human judgment. And like the history of machine learning all the way from like, I guess, the start of like image net and everything is really like that insight of like you should just

Starting point is 00:54:36 take humans increasingly out of it and scale up the data you can just throw in there with no supervision. Yeah, yeah, totally. Yeah, it's really awesome. Are you bullish on dedicated LMS judge models? Have you looked at those bespoke labs? We did an episode with them, and they're really trying to carve our niche in there. We've looked into it. We've trained some ourselves.

Starting point is 00:54:56 We've also, like, used some off the shelf. There's an evaluation benchmark that the AI2 people put together. A reward bench. And so reward bench is kind of like trying to. benchmark models on serving as LMS. And reward models are LMS judged in your mind. It's the same, same thing. Yeah, yeah, yeah.

Starting point is 00:55:12 They have mildly different. Depends on the task. Like, LM is judged is usually more sort of product facing and reward. Reward modeling is much more specific within like a chat task. Which is that used to be the old meaning of reward model. I don't know. Maybe terminology has changed. Like, I think, I think they're pretty equivalent.

Starting point is 00:55:31 I understand that, yeah, I can see your side. Anyway, so, so yeah, reward bench is kind of. of like, and so we've tried a bunch of off that. The thing is, like, I guess my, my maybe meta take on this is any task that is extremely common is going to end up in, like, as a specific, like, part of the training data for the frontier labs. And LMS judge is just something everybody is doing in so many different contexts that you have to assume that all the frontier labs have a bunch of, like, LMS judge style tasks that they're training their models on. And I do believe that if something does kind of like make it in a like more than minor way into their training data that like they're

Starting point is 00:56:09 going to do at least as good a job as a dedicated model. So I don't think there's probably a lot of alpha in dedicated LMS judges just because it's something that like the let me caveat that and say like if you've got like a very, very specific task that's like weird and has weird requirements and you have a lot of data on what's good or bad, then like training a reward model for your specific task I think could still work or you know, fine-tuning an LMS judge on your specific task could work. I'm pretty bearish on like, hey, this is a model that is trained as an LMS judge, but it's a generic LMS judge that can be used to judge anything.

Starting point is 00:56:42 I just don't think you're going to beat the frontier labs on that. Yeah. One other version of this that is not quite an LLM, but some people are thinking about it is something that we're working on for a future episode, which is world models. And very sexy. Yeah, very sexy. First applied in video, as far as I can tell for Genie 3, Genie 1,2, 3, and now with code. and potentially with virtual cells for AI bio.

Starting point is 00:57:07 Any exploration there that's interesting to you? Yeah. So we've been playing around with it a little bit. It's one of the directions that I'm fairly optimistic on for solving the environment problem specifically. Because if you think about it, like a world model, it's a simulated environment. That's like what its whole purpose, right? So if you get one that's... But in an LLM like thing, not like a Docker.

Starting point is 00:57:29 Yes. Yeah, yeah. So it's like, you know, whatever. hallucinating, generating, imagining the responses you'll get from the world. So you can imagine, right, if you had like a really, really great world model that you're training on, yeah, it's like your agent that you're using, it would go out and make some tool call and then this world model model would generate, hey, this is like probably what the tool call. And if you have a smart enough, strong enough one, then it could keep its own, you know, effective internal state of like the changes

Starting point is 00:57:54 you made so far and how that affects. So we've played around with it some. You know, I think if we can get it to work really well, then that could be a solution for the environment problem, which where you just take a bunch of production traces and use those to condition your world model so it understands your specific system and what its failure modes are and then train against that world model. And the resultant, the agent that you train with that would then be able to perform in your real environment. So I do think it's like a really interesting area of research. Yeah. And did you see the meta-cold world model work? I don't think I saw that one.

Starting point is 00:58:27 Okay. Yeah, it was like two weeks ago. We've just confirmed the guy for AIU code in November. And it's really interesting. Like the world model is... Oh, sorry. You're talking about the meta one? Yeah.

Starting point is 00:58:39 Okay. Yes, I did. I saw that way. I said a lot of syllables. It may not have parsed. But like, yeah, it's literally like having a debugger as the environment, as the world model, and opening up the execution trace to the model to see what's going on and see the state and track the state as the code executes.

Starting point is 00:58:54 Seems to be smart and, you know, exploits the unique situation of, you know, exploits the unique situation of code environments where we can actually do these things. Yeah, I think the way they envision that model being used is a little different. Like, I think they're trying. Actually, I'm curious. I'll have to see the talk. But my understanding from that paper is like the goal they're imagining is this is almost sort of like a pre-training step. And then now that this model understands code really, really well, we can then use it as basically

Starting point is 00:59:20 like a code generation or a coding agent of some kind. Okay. Yeah. Which I think makes sense. That's almost more like a different kind of. of pre-training, I would say. The way I'm interested in applying world models is as not, it's basically as its own end, right, where it's like actually the goal is to come out of this with something that simulates the world, which is not something you really need in code at all,

Starting point is 00:59:39 because it's so easy to like run code. And you don't need to model what will happen if you execute this code typically because you can just execute the code and see what happens. Right. For training purposes. But if closely models how we think about code when we code, is we kind of mentally execute the model as we type and Google like, is that what we really want? Yeah, I don't know. Anyway, it's the first model that met us released since the MSL reorganization. We know, you know, just based on our context, they're very, very interested in code models as a path to AGI, which I'm also, of course, very interested in.

Starting point is 01:00:11 I know we kept in here for a while. Let's wrap up on the acquisition. So a lot of people say, you know, companies are not sold, their bot. What was that process like for you? Did it just happen? Like, what was the behind the scenes? Yeah. Yeah, so that was driven by actually mostly the weights and bias is founding team.

Starting point is 01:00:28 Lucas? Yeah. So, so yeah, Lucas and Sean particularly. So they, you know, had recently been acquired by Corweave. And Corweave was looking to, you know, continue growing up the stack. And so, yeah, they approached me and were like, hey, you know, like, no pressure. But like, this is like an area that we think is really promising. And we, you know, would you like to work here?

Starting point is 01:00:48 And so that's how the conversation started. It was like long. It was pretty painful. There were points as late as the week before we actually signed where it was like unclear if it was actually going to happen. So that part was super painful. However, we've been there a month now. We just shipped a product yesterday, which I'm super excited about.

Starting point is 01:01:06 It's been fantastic working there so far. Like I was like very concerned. I was like, okay, yes, this is great. We make a lot of money by selling our company. But like, is the work environment going to like really really suck? And I was like, well, I guess that's just a risk we'll have to take. It's been fantastic. Like, it's honestly been way, way better than I could have imagined.

Starting point is 01:01:21 Do you go down to the office, the one down here? I was there today. We work for, so I'm based in Seattle. So they have a small office up there that we work for. Ways and Biasis office in San Francisco is fantastic. If you have the chance, go visit. They do do all hackathons and co-working things. Yeah, there's a hackathon going on in a month or so.

Starting point is 01:01:37 Every week there's a hackathon. But yeah, I mean, so do you consider yourself working for weights and biases or core reef? Or both. And open pipe, too. No. No, yeah. So we, so we, I report to the weights and biases. like, yeah, founders. So we're within that organization.

Starting point is 01:01:55 In the org chart, we're there. I don't know. Like branding-wise, they're trying to say everything kind of that's not being sold to like big labs is kind of weights and biases. So like our stuff we're launching is weights and biases branded. Yeah. It's not. Yeah, not core we've branded as much.

Starting point is 01:02:13 I don't know. It's still like, they're still figuring it out. And what's the product you launched? We launched serverless reinforcement learning. Basically, it lets you offload all of the GPU management. You don't have to worry about crashes and out of memories and like, you know, scaling up and down. We handle all that for you. And you just like define your environment.

Starting point is 01:02:31 You define your reward function. And then you just like every time you run a step, you kind of like ship back to our back end. Hey, these are the trajectories. These are the rewards now update my model. And we just like make it work for you. It makes it way easier. Yeah. Okay.

Starting point is 01:02:44 Very thinky like. It is very thinky like. I love the thinking machines launch. I think they have a really good idea. It's also very validating. How did this takes a long? to appear. Like, it seems like... I don't know. Yeah, we were...

Starting point is 01:02:56 But that's... I felt this way about everything. Like, there's so many things that should exist, like, clearly. I just think there's, like, still not enough people, like, smart people working in the space. Like, honestly, we need... Like, I realize that there's, like, you know, like a lot of people just feels like there's still a lot of low-hanging fruit. Nobody's doing. Okay. One thing I saw from your post was your North Star as the RL team at CoreWeave is to build an old world where every agent learns continually from his real world experience.

Starting point is 01:03:21 So you're touching on the... hot topic of the moment, continual learning. What else do we need to get there? I super believe that. And like, that's basically the vision where I'm like, you know, I keep talking about these percentages, 25. If we get to the world where we build that, then I think it's just like the advantages are huge. They're clear. Everyone should just deploy their agents that way. We want to be like the team that builds the software that makes that easy to do. So I talk to a lot of engineers at our customers and they're trying to deploy agents. And it's so easy to get the initial prototype and like something that like kind of works well.

Starting point is 01:03:53 It is so hard to get from that to something that, like, you hire confident is reliable enough to actually deploy in production. And when you actually look at what those failure modes look like, it's like, oh, yeah, like, we know if it gets in this situation or if it gets like these kind of like inputs, like it behaves funnily. But then it's like, yeah, you can update your problem to address that. But like, that's not scalable because at a certain point it's like going to start breaking other things. You know, you don't know what it's breaking. You really want some way to just like say, okay, look, this thing you did there. That was the wrong thing. Just like adjust this behavior when you get in this.

Starting point is 01:04:22 and then, you know, otherwise carry on, right? And that's what we can do with RL. And that's what we can do with continual learning is like, we don't have to like have this concept of like, oh, up front, I'm like trying to make the perfect model that solves everything. It's like, I'm trying to make a model that's good enough. I can deploy it in production. And then when these errors come in, I'm going to say, oh, you know, exactly.

Starting point is 01:04:40 I mean, very analogous to how you train a human employee. Like, be like, oh, no, actually, that's not what you should do in that situation. All right, fix that and carry on. And that's just going to make this whole process so much easier. And I think that, you know, like, I think that there is. today, like 10 times as much AI inference that could exist than is existing right now, just purely with projects that are like sitting in the proof of concept stage and have not been deployed because there's like huge bucket of those.

Starting point is 01:05:06 And it's all about this kind of like reliability issue where it's like, okay, like it works in controlled circumstances. There's areas where it doesn't work. And so if we can solve this problem, there's that like 90% of the like inference market, like addressable market today that's just going to like come online because we've solved that problem. So that's what we want to do. I'm super excited about it.

Starting point is 01:05:24 And I think we have very concrete ideas on the specific pieces we need to make that work. And we just have to execute against them. Do you feel like the online REL is more susceptible to like the reward hacking, especially as you're like short in this loop and like you don't spend this much time like looking at the different checkpoints? I'm not that worried about it. And the reason why is because it is reward hacking is quite easy to detect once it starts happening. Because once the model's found some hack, it just starts like doing it all the time. It's like, oh, yes, this worked great.

Starting point is 01:05:53 I'm just going to keep doing it. And so you notice very quickly, whoa, it's doing this thing. And assuming you're using at least in part, an LLM as judge, to, like, determine which ones are good and bad, it's so easy to just throw in an extra turn and be like, hey, that weird thing that you keep doing. Like, if it does that, like, that's bad. Give it a low reward. So we've done this with a bunch of customers.

Starting point is 01:06:10 And like, reward hacking does happen. But, like, you just see it. And you, like, adjust your, you know, reward prompt and it just goes away. What's a thing from YC that guided you through your entrepreneurship journey? and what's one thing that maybe you like find that you disagree with Yon? Oh, that's a good question. One thing that I really identify with and I've tried to do a good job is kind of like, you know, sort of, I think they say like, hold your problem tight and your solution loosely, right? Where it's like, that's what you did?

Starting point is 01:06:39 Yeah, spend a lot of time thinking about what is the problem people trying to solve. And then it's like, don't be too bought into like the way you're solving it today. I think that's super important. everyone, you know, it's very easy to get that balance wrong if you're not thinking about it very consciously. That's something I disagree with. That's a good question. I think there's lots of things I disagree with, but I don't have it like cashed in that direction in my brain.

Starting point is 01:07:04 I don't know. Like I definitely have disagreed with lots of specific piece of advice, but yeah, I don't have like a great answer right now. I'll bridge it for you in case in case something comes up. Sam Altman's like, you know, everything I said as president of YC. was wrong for opening eye, right? Like, do B2B ended up doing B2C. You know, you should ship, like, products often, like ended up being installed for two years.

Starting point is 01:07:27 Yeah. Yeah. Actually, I think that, that second one does resonate with me a lot. Like, we have tried to ship really quickly and just kind of, like, sort of, like, follow the gradient of the market. I think if I do another startup, like, and I don't know, maybe this is just me, like, being beat up by the market too much.

Starting point is 01:07:44 If I do another startup, like, I would, like, I think at least some, points I probably would have done better to be like heads down and execute my vision for longer and like kind of like go for the more ambitious thing, but it would take longer to sort of like prove value, which is definitely not the YC way. But I think if you have like, I don't know, a good vision and good taste, then like that can like work quite well. Yeah, we'll see what that is whenever that comes out. But thanks for your time. This is a great overview of everything. Thank you. This has been a super fun conversation. Thanks to both you. Awesome.

Latent Space: The AI Engineer Podcast - Why RL Won — Kyle Corbitt, OpenPipe (acq. CoreWeave)

There aren't comments yet for this episode. Click on any sentence in the transcript to leave a comment.