Latent Space: The AI Engineer Podcast - [NeurIPS Best Paper] 1000 Layer Networks for Self-Supervised RL — Kevin Wang et al, Princeton

Episode Date: January 2, 2026

From undergraduate research seminars at Princeton to winning a Best Paper award at NeurIPS 2025, Kevin Wang, Ishaan Javali, Michał Bortkiewicz, Tomasz Trzcinski, and Benjamin Eysenbach defied conventional wisdom by scaling reinforcement learning networks to 1,000 layers deep, unlocking performance gains that the RL community thought impossible. We caught up with the team live at NeurIPS to dig into the story behind RL1000: why deep networks have worked in language and vision but failed in RL for over a decade (spoiler: it's not just about depth, it's about the objective), how they discovered that self-supervised RL (learning representations of states, actions, and future states via contrastive learning) scales where value-based methods collapse, the critical architectural tricks that made it work (residual connections, layer normalization, and a shift from regression to classification), why scaling depth is more parameter-efficient than scaling width (linear vs. quadratic growth), how JAX and GPU-accelerated environments let them collect hundreds of millions of transitions in hours (the data abundance that unlocked scaling in the first place), the "critical depth" phenomenon where performance doesn't just improve but multiplies once you cross 15M+ transitions and add the right architectural components, why this isn't just "make networks bigger" but a fundamental shift in RL objectives (their code doesn't have a line saying "maximize rewards"; it's pure self-supervised representation learning), how deep teacher, shallow student distillation could unlock deployment at scale (train frontier capabilities with 1000 layers, distill down to efficient inference models), the robotics implications (goal-conditioned RL without human supervision or demonstrations, scaling architecture instead of scaling manual data collection), and their thesis that RL is finally ready to scale like language and vision, not by throwing compute at value functions, but by borrowing the self-supervised, representation-learning paradigms that made the rest of deep learning work.

We discuss:

* The self-supervised RL objective: instead of learning value functions (noisy, biased, spurious), they learn representations where states along the same trajectory are pushed together and states along different trajectories are pushed apart, turning RL into a classification problem
* Why naive scaling failed: doubling depth degraded performance, while doubling again with residual connections and layer norm suddenly skyrocketed performance in one environment, unlocking the "critical depth" phenomenon
* Scaling depth vs. width: depth grows parameters linearly, width grows them quadratically, so depth is more parameter-efficient and sample-efficient for the same performance
* The JAX + GPU-accelerated environments unlock: collecting thousands of trajectories in parallel meant data wasn't the bottleneck, and crossing 15M+ transitions was when deep networks really paid off
* The blurring of RL and self-supervised learning: their code doesn't maximize rewards directly; it's an actor-critic goal-conditioned RL algorithm, but the learning burden shifts to classification (cross-entropy loss, representation learning) instead of TD error regression
* Why scaling batch size unlocks at depth: traditional RL doesn't benefit from larger batches because networks are too small to exploit the signal, but once you scale depth, batch size becomes another effective scaling dimension

RL1000 Team (Princeton)

* 1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities: https://openreview.net/forum?id=s0JVsx3bx1

Timestamps

00:00:00 Introduction: Best Paper Award and NeurIPS Poster Experience
00:01:11 Team Introductions and Princeton Research Origins
00:03:35 The Deep Learning Anomaly: Why RL Stayed Shallow
00:04:35 Self-Supervised RL: A Different Approach to Scaling
00:05:13 The Breakthrough Moment: Residual Connections and Critical Depth
00:07:15 Architectural Choices: Borrowing from ResNets and Avoiding Vanishing Gradients
00:07:50 Clarifying the Paper: Not Just Big Networks, But Different Objectives
00:08:46 Blurring the Lines: RL Meets Self-Supervised Learning
00:09:44 From TD Errors to Classification: Why This Objective Scales
00:11:06 Architecture Details: Building on BRO and SimBa
00:12:05 Robotics Applications: Goal-Conditioned RL Without Human Supervision
00:13:15 Efficiency Trade-offs: Depth vs Width and Parameter Scaling
00:15:48 JAX and GPU-Accelerated Environments: The Data Infrastructure
00:18:05 World Models and Next State Classification
00:22:37 Unlocking Batch Size Scaling Through Network Capacity
00:24:10 Compute Requirements: State-of-the-Art on a Single GPU
00:21:02 Future Directions: Distillation, VLMs, and Hierarchical Planning
00:27:15 Closing Thoughts: Challenging Conventional Wisdom in RL Scaling

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe

Transcript
Starting point is 00:00:00 Welcome to Latent Space, live at NeurIPS. We are basically trying to provide the best, optimal
Starting point is 00:00:17 sort of podcast experience of NeurIPS for people who are not here. And congrats on your paper. How's it feel? Yeah, it was very exciting. Yeah. We had a poster yesterday, and then today we'll have an oral talk.
Starting point is 00:00:27 Were you just, like, mobbed? Oh yeah. There were a lot of people. It was like three hours straight of, you know, waves of people that we were trying to talk to, but... So I've never received a best paper. Do you just find out on the website?
Starting point is 00:00:39 Like, what... Oh, I just woke up one day and checked my email, and then... Ah, they just tell... It was like, oh, I just saw an email: you've been awarded best paper. But maybe you know from the reviews as well, right? So I think...
Starting point is 00:00:54 Yeah, we knew from the reviews that we did well, but there's a difference between doing well on the reviews and getting best paper. So that part we didn't actually know. Yeah. Okay, so I skipped a little bit. Maybe we can go one by one and introduce, you know, who you are and what you did on the team. I'm Kevin.
Starting point is 00:01:12 I was an undergrad at Princeton, and I just graduated. And, yeah, I guess I led the project, like, started the project. And then I was very happy to collaborate with Ishaan and Michał and Ben also. Right. And were you in, like, the same research group? Like, what's the social context? So, yeah, so we're all from Princeton.
Starting point is 00:01:32 Yeah. With F. And thanks to Ellen for booking you guys. So this project actually started from an IW seminar, so an independent work research seminar that Ben was teaching. And this was actually one of my first experiences in ML research, so it was really valuable to get that experience. And then Ishaan was also in that seminar and working on adjacent things.
Starting point is 00:01:53 So we collaborated a lot during that seminar. And then, yeah, the project turned out to have some pretty cool results. And then later on, Michał, who was working on sort of similar things, also joined the project, and it became a good collaboration. Yeah. And I don't know if any of you guys want to chime in on other elements of deciding on this problem. So my lab works on deep reinforcement learning. But historically, deep meant like two or three or four layers. Not 1,000.
Starting point is 00:02:25 When Kevin and Ishaan wanted to try really deep networks, I was kind of very skeptical it was going to work. I've tried this before. It doesn't work. Other people have tried this before and it didn't work. So I was very, very skeptical starting out. I don't know if I made this clear at the time, but that was my prior going in. But do you view your job as, like, screening, or like, hey guys, this probably isn't going to work, you should try a different idea? You know, or should you be encouraging, even if it's dumb? It's selecting bets. Yeah. And this was a bet I was willing to make. What made you willing to make the bet? It seemed relatively low cost, in that we, Michał in particular, had spent the past year developing infrastructure
Starting point is 00:03:05 that made it a lot easier to run some of these experiments. And the precedent was that deeper networks should do a whole lot better. Like, that's what the deep learning revolution has been over the last decade. Yeah, I know. Why don't we try making them deeper? And reinforcement learning was this one anomaly where we continued to use these really shallow networks. And that's particularly true in the settings that we were looking at, where you're starting from scratch, you're starting from nothing. Any other perspectives you guys want to chime in with?
Starting point is 00:03:30 I guess maybe I should just give an overview of our project. Yes, okay, sorry, yes. So the way that I view our project is that if you look at the landscape of deep learning, you have NLP, like language, vision, and then RL. And as Ben alluded to, in language and in vision we've sort of converged on these paradigms of scaling to massive networks, right? Like hundreds of billions of parameters, trillions of parameters. And there's been, you know, a lot of gain in deep learning
Starting point is 00:03:57 from that. But then it seems like in the third branch of deep learning, in deep RL, that has not yet been the case. I was very surprised coming into, you know, Ben's class and seminar when I was looking at the networks. Why were we just using a simple two-layer MLP for these frontier, state-of-the-art RL algorithms? And so I was very curious.
Starting point is 00:04:18 Can we design RL algorithms, can we put together a recipe for RL, that allows it to scale in potentially analogous ways to how language and vision scale? And so what we did is, we know that traditional RL, let's say value-based RL, doesn't really scale. This is pretty clear from the literature. So we tried a different approach to RL called self-supervised RL, where instead of learning a value function,
Starting point is 00:04:42 we're learning representations of states, actions, and future states, such that the representations along the same trajectory are pushed together and the representations along different trajectories are pushed apart. And this is just a different approach to RL that allows us to learn in a self-supervised manner. So we can solve tasks and reach goals without any human-crafted reward signal. And so we know that self-supervised learning
Starting point is 00:05:04 is scalable in these different areas of deep learning. So can self-supervised RL scale in similar ways? When we first tried it, it actually didn't work. We made the networks deeper. Performance totally degraded. But then we also,
Starting point is 00:05:17 but then I separately was like, there's also some other work in the literature. We tried residual connections, and there are a few other architectural components that we had to put into the recipe. And then all of a sudden, one day, I ran this experiment,
Starting point is 00:05:32 and there was this one environment in which doubling the depth didn't really do anything, but doubling the depth again with these different components suddenly skyrocketed performance in this one environment. Getting this to work was very non-trivial, in the sense that usually, when you think about doing hyperparameter optimization,
Starting point is 00:05:51 we try changing A, see if it makes it better; try changing B, see whether it makes it better. And if we just made the depth bigger, it made it worse. If we just added residual connections, it didn't make it better. And it was really this combination of factors that Kevin and Ishaan figured out that really made this work. And as a precursor to that, we also tried scaling along different dimensions: scaling the batch size, scaling the width of the networks or the hidden layers. And the effect, yeah, was pretty much similar to just scaling depth naively. And then once we started introducing residual connections, these specific architectural choices,
Starting point is 00:06:24 that's when we saw these significant jumps in performance, these critical depths at which performance multiplies by a pretty huge factor. And that's where we really noticed unlocking some significant performance gains, as opposed to scaling just along width, which did yield some performance improvements. But when you look at the number of parameters that your network has as you grow width, it's roughly quadratic, as opposed to growing depth, where it's roughly linear. So in some sense it's more parameter efficient,
Starting point is 00:06:51 and also more sample efficient, from the experiments that we conducted. In some ways you're sort of replicating stuff that is seen in the wild, but on a very small model that you can study. Would you say that? Yeah. So to add to what Kevin said earlier, we saw these huge performance improvements in language models and image generation models by making them larger, making them deeper, which seems very intuitive. Yeah. And so that's why in our work we draw from foundational research, right, like residual networks, which employ residual connections to avoid vanishing gradients. And that's something that we show in some of the ablations in our paper,
Starting point is 00:07:27 further down, probably in the appendices, where we did experiments without these residual connections. And so it's sort of borrowing these concepts that have existed in other fields, applying them to this setting with RL, and showing that it works. Before Ben has to go, I'll leave the last word to him. What additional work does this inspire that you want to push on next? I think there's one thing I'd clarify about the paper, and then I'll directly answer the question.
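The architectural recipe discussed above (residual connections plus layer normalization, stacked to large depth) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the team's code: the block layout (one bias-free dense layer, layer norm without learned scale or shift, ReLU, then the skip connection) and the helper names `residual_block`, `deep_forward`, and `random_blocks` are hypothetical choices made for readability.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def dense(x, weights):
    """Bias-free linear layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, v) for v in x]


def residual_block(x, weights):
    """Assumed block layout: x + relu(layer_norm(dense(x)))."""
    h = relu(layer_norm(dense(x, weights)))
    # The skip connection keeps an identity path through every block.
    return [a + b for a, b in zip(x, h)]


def deep_forward(x, blocks):
    """Apply a stack of residual blocks in sequence."""
    for weights in blocks:
        x = residual_block(x, weights)
    return x


def random_blocks(depth, width, seed=0):
    """Hypothetical initializer: `depth` square weight matrices of size `width`."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(width)
    return [
        [[rng.gauss(0.0, scale) for _ in range(width)] for _ in range(width)]
        for _ in range(depth)
    ]


# Usage: push one 8-dimensional input through 64 stacked blocks.
out = deep_forward([0.5] * 8, random_blocks(depth=64, width=8))
```

Because each block adds its output to its input, a block whose weights are all zero reduces to the identity, which is one intuition for why very deep stacks of this form stay trainable.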
Starting point is 00:07:53 I think the thing I might clarify about the paper is, I think a lot of people reading the title are like, wow, big networks, they're great. I'll take big networks. You solved it now. We can just go. Yeah, we just take big networks, add them to PPO, add them to SAC,
Starting point is 00:08:04 add them to your favorite reinforcement learning algorithm. But I think that's actually not the main conclusion. I think the main conclusion is that using big networks not only requires these architectural tricks, but also, as Kevin mentioned before, it requires using a different objective. This objective doesn't actually use rewards in it. And so there's another word in the title,
Starting point is 00:08:22 reinforcement learning, that also might be a little bit of a misnomer, because we aren't directly trying to maximize rewards. Our code doesn't have a line of code saying maximize rewards here. And so, at the end of the day, is this a reinforcement learning method? I don't know. It looks much more similar to the self-supervised methods in other areas of machine learning. And so I think that the method and the work really stand at some sort of interesting intersection of reinforcement learning and self-supervised learning research. We had this little figure on the bottom left of the poster, which was a screenshot of a slide from Yann LeCun,
Starting point is 00:08:59 talking about how to build intelligent systems and whether that's going to be done by unsupervised learning or supervised learning or reinforcement learning. And I think what our paper really suggests is that the boundary between these things is really blurry, and maybe the keys to building intelligent systems are going to be leveraging insights from all of them. Yeah, the layer cake. Exactly.
Starting point is 00:09:19 Well, thank you for your time. I know you have to go soon, unfortunately. Yeah, thank you so much for coming. I think that the insight of blurring things is interesting. I don't know if you, like, you were talking about the abstraction layer of representation learning. I don't know if that triggers anything in terms of the mix between self-supervised
Starting point is 00:09:38 and reinforcement learning. Is that something fundamental that you've discovered, or that people don't understand when they read the paper? Yeah. I think the best way that I would explain it is that we know that standard RL is not super scalable. And so why can this different approach or different objective for RL be scalable? I think it's because we're fundamentally shifting the burden of learning from something like Q-learning, or regressing to TD errors,
Starting point is 00:10:04 which we know is quite spurious and noisy and biased, to fundamentally a classification problem. We're trying to classify whether a future state is along the same trajectory or along a different trajectory. And we do this with representation learning. And we know that classification, cross-entropy loss, and representation learning are scalable in the deep learning literature, right? If we think about language and some of the objectives there. So in some sense, we're kind of
Starting point is 00:10:28 blurring the lines. We're doing reinforcement learning. It's still an actor-critic reinforcement learning algorithm. It's a goal-conditioned reinforcement learning algorithm. But the objective, the burden of learning, of solving the RL task, shifts to something that's more similar to
Starting point is 00:10:42 objectives that you might see in language and vision, which we know have scaled so much. And so I think, yeah, I think that's one of the fundamental insights that we've seen: it seems like by approaching RL in this different way, we were able to scale our networks
Starting point is 00:10:59 significantly beyond what is standardly used in RL. Can I jump in? I will just give a bit more context about the architecture because, yeah, we use another objective, the InfoNCE, so the contrastive loss. However, the architecture is quite similar to previous works, previous papers, like BRO or SimBa,
Starting point is 00:11:26 SimBa v1, SimBa v2. So we also tweaked this architecture a bit. However, it's not that we invented the wheel for the first time. It's the merging of the architecture and the objective that makes the scale really go up and performance follow the scale. I think that's something that we should probably mine deeper. Do you think, I guess, like, what domains, what industry,
Starting point is 00:11:53 like, you've applied this on multiple different types of networks or data sets. Is there a particular affinity that you think is sort of low-hanging fruit? Yeah. So actually, if you look at a lot of our tasks, they're sort of robotics tasks in particular. So personally, I'd be very curious about how a work like this could impact the robotics field. My understanding of robotics is that there are kind of a few different approaches.
Starting point is 00:12:20 One approach is we want to train robots using imitation learning. So we try to collect an insane amount of data. We have a ton of human supervision, and we try to scale up this data and learn with imitation learning. But on the other hand, perhaps there's another approach, which is, for example, goal-conditioned reinforcement learning, where we can actually train robotic agents, train RL agents, to solve meaningful tasks with absolutely no human supervision, no demonstrations. And it's much more scalable. So yeah. So this could serve as an alternate approach. And perhaps instead of scaling data,
Starting point is 00:12:51 scaling manual human supervision, which is not super scalable, if there are ways to make goal-conditioned reinforcement learning scalable, then we can just scale the architecture or we can scale... Because you're focused on the new objectives, yeah. Right, with certain different objectives.
Starting point is 00:13:05 I think it could be very exciting to see how that can affect a field like robotics, for example. Yeah. Double click on just one thing, on the efficiency, which you were talking about. I would expect the very deep, the deeper it is,
Starting point is 00:13:18 it should be quadratically worse. I'm not familiar with the pre-existing literature. I'm just sort of working out intuitions. But basically, what are the tradeoffs that you've found that you might want to warn people about? Because you are the guy who mentioned efficiency. Sure, sure. Yeah. So I was referring to one of the figures on our poster, also in our paper,
Starting point is 00:13:39 where we compare the number of parameters that models have as we scale along the axis of depth and as we scale along the axis of width. Yeah. Our most basic baseline architecture would be a width of 256, so hidden layers of 256 neurons, and a depth of four hidden layers. And so the point I was making there is that when you scale along depth, the number of parameters that your model has is going to grow roughly linearly. Whereas with width, you're making your layer outputs wider, and then the input to the next layer
Starting point is 00:14:10 is also growing as well. And so the number of parameters your network has then grows approximately quadratically. And so one of the experiments we did was examining, as we grow the number of parameters in our model by scaling along these two different choices, which one, for the same approximate number of parameters, yields better performance. And the depth curve kind of goes like this,
Starting point is 00:14:30 it jumps up pretty fast. That's present throughout our paper. For width, it grows a little bit more slowly. And so the takeaway from that is that if you are a bit more resource constrained, scaling along depth might be better, because there are fewer parameters, a smaller number of learnable parameters.
Starting point is 00:14:46 Width is expensive. Width is expensive. Exactly. And in general, of course, more parameters is also going to be more expensive. So that's just another consideration to think about when using these networks, I suppose. Yeah.
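The linear-versus-quadratic argument above can be checked with a small parameter counter. This is a rough sketch for a plain MLP, not the team's architecture: the baseline of width 256 and depth 4 comes from the discussion, while the function name `mlp_params` and the input and output sizes (32 and 64) are hypothetical values chosen only for illustration.

```python
def mlp_params(in_dim, width, depth, out_dim):
    """Count weights and biases of an MLP with `depth` hidden layers of size `width`."""
    params = in_dim * width + width                   # input -> first hidden layer
    params += (depth - 1) * (width * width + width)   # hidden -> hidden layers
    params += width * out_dim + out_dim               # last hidden -> output layer
    return params


# Baseline from the discussion: width 256, depth 4 (input/output sizes are assumed).
baseline = mlp_params(in_dim=32, width=256, depth=4, out_dim=64)
deeper = mlp_params(in_dim=32, width=256, depth=8, out_dim=64)   # double the depth
wider = mlp_params(in_dim=32, width=512, depth=4, out_dim=64)    # double the width

for name, count in [("baseline", baseline), ("2x depth", deeper), ("2x width", wider)]:
    print(f"{name}: {count} parameters ({count / baseline:.2f}x baseline)")
```

Each extra hidden layer adds a fixed `width * width + width` parameters, so depth scales the count linearly, whereas doubling the width roughly quadruples every hidden-to-hidden weight matrix.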
Starting point is 00:14:57 Any other rules of thumb like that that I can extract? This is just the most basic one that I could think of. Yeah. I don't know if there are any others? Yeah, I guess, on your original question about the tradeoffs, one of the tradeoffs, one of the limitations that we state is,
Starting point is 00:15:11 obviously, if you make the networks bigger, it will take longer to run, right? So if you double the depth, at some level of depth, it might take twice as long to make a forward pass through the network, right? However, within our paper, for most environments, we are able to saturate, to get to almost perfect performance, and we don't even need to get to a thousand layers; maybe just 64 layers, for example, is sufficient. And in this regime, the latency of the network is not necessarily, actually,
Starting point is 00:15:42 not necessarily a significant bottleneck. You can imagine there are a lot of tasks, especially in RL, in which collecting data might be the bottleneck, right? And making forward passes through our network may not be the bottleneck. And so in our environment, in our research,
Starting point is 00:15:57 we specifically used the JaxGCRL environment, which is a JAX-based GPU-accelerated environment. So we can collect thousands of environment trajectories in parallel at the same time, so that we're able to, like... Oh, this is built in. Right. This is built in, so that we can collect a thousand trajectories at the same time along all these environments, and so we make sure that we have enough data to learn from.
Starting point is 00:16:23 Wow. That's like work they've got out. Okay. I don't know if you want to expand upon that, on JaxGCRL. And most people are familiar with PyTorch, maybe less familiar with JAX. I think JAX is getting traction, especially in the RL field, because for online reinforcement learning, getting as much data as you can is the most important thing. There's got to be a PyTorch equivalent, but anyway.
Starting point is 00:16:50 Any tips for other people also exploring this kind of rollout? Yeah, so I think I can also recommend, like, for goal-conditioned RL, I'm recommending JaxGCRL, but there are also multi-agent JAX implementations and others. So going back to our paper, if you look at the plots, we only see this huge performance increase when we cross 50 million transitions. So I think the data is crucial here. I guess even to build on that,
Starting point is 00:17:24 I like drawing analogies to successes in other areas of deep learning. For example, in large language models, the reason why we're able to scale to such large networks is that we found a paradigm in which we can leverage the entire internet scale of data to learn. Right. And so data in RL traditionally has been hard to come by. But now,
Starting point is 00:17:43 with these GPU-accelerated environments, we can collect hundreds of millions of time steps of data within just a few hours. And so I think that this serves as a really good testbed for us to be able to also find ways to scale up network capacity and get similar kinds of gains. Are you saying that you
Starting point is 00:17:59 have a difference, that you would do pre-training differently in LLMs? Like, what's the different objective now? Yeah, very simply, the paradigm that you're referencing is next word or next token, right? It's very robust.
Starting point is 00:18:18 How do you change that? Oh, I'm not saying that. I want to leverage insights from that to apply to RL. I feel like you should go the other way. You think we should go the other way? Maybe. I mean, that would be a very interesting research direction too. But actually, even on that point, one of the things I was thinking about is that
Starting point is 00:18:34 the way that our objective works is, in some sense, not exactly next word prediction, but it's kind of like next state prediction. Imagine you're at some current state and some current action. And we want to predict whether or not this certain state is a future state along the same trajectory or a different trajectory. And so in some sense, we are actually doing some sort of implicit world model. I don't know if that's a bad word these days. Or like in language, you do a cross-entropy loss to classify the next token, right? And here we're
Starting point is 00:19:04 just doing a binary classification of whether or not some next state is some... Yeah, yeah, yeah. And so I do see that there are some parallels here that perhaps we should dig into deeper, to see what is the core of what enables deep learning to scale. And then how can we leverage that, how can we distill those insights and then apply them across all different fields, whether it's language or reinforcement learning. Yeah. Did you get my meaning about the world model stuff?
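The classification framing described above (is this candidate state a future state of the same trajectory, or of a different one?) can be written as a small binary cross-entropy loss over learned representations. This is a hedged sketch, not the team's implementation: it assumes a dot-product critic between a state-action representation and a future-state representation, treats matching indices in a batch as positives and all other pairings as negatives, and uses hypothetical names such as `binary_contrastive_loss`. Earlier in the conversation the team refers to an InfoNCE contrastive loss, which is a batch-wise softmax variant of the same idea.

```python
import math


def critic_logit(sa_repr, future_repr):
    """Assumed critic: dot product between the two representations."""
    return sum(a * b for a, b in zip(sa_repr, future_repr))


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def binary_contrastive_loss(sa_reprs, future_reprs):
    """Binary cross-entropy over all (state-action, future-state) pairings.

    sa_reprs[i] and future_reprs[i] are assumed to come from the same
    trajectory (label 1); every pairing with i != j is a negative (label 0).
    """
    n = len(sa_reprs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            z = critic_logit(sa_reprs[i], future_reprs[j])
            if i == j:
                total -= log_sigmoid(z)    # same trajectory: push together
            else:
                total -= log_sigmoid(-z)   # different trajectory: push apart
    return total / (n * n)


# Usage: representations aligned with their own future states give a lower loss
# than the same representations paired with the wrong future states.
sa = [[3.0, 0.0], [0.0, 3.0]]
aligned_loss = binary_contrastive_loss(sa, [[3.0, 0.0], [0.0, 3.0]])
swapped_loss = binary_contrastive_loss(sa, [[0.0, 3.0], [3.0, 0.0]])
```

Note that no reward appears anywhere in this loss, which matches the point made in the conversation that the learning signal comes from trajectory structure alone.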
Starting point is 00:19:32 Yeah, yeah. Actually, I think I might have heard Professor Eysenbach yesterday at the poster, and he was explaining to a couple of people that because this is doing representation learning, trying to learn these meaningful representations for a given state and action, and then for a given goal, in some sense you can think of it almost like learning a model of the environment, learning a model of the world, but without having to do any sort of next frame prediction or stuff
Starting point is 00:19:55 like that, which is a little bit more high dimensional and complex. Yeah. I think the angle that I'm trying to think about and push is, instead of learning the next world, you basically generate a number of candidate possible worlds and classify them, to your point, which is exactly how I do things. Let's say I'm playing poker, and I'm trying to classify what hands you have. Well, there's a range of hands based on what you're doing, and the more information I get, the more I resolve to,
Starting point is 00:20:23 oh, I know exactly what hand you have based on what you're showing, or you're bluffing, but that's a different thing. But you know what I mean? I feel like that is the ultimate angle of representation, which is the world, but I don't know if that is too vague compared to the more concrete types of world models that, let's say, the video gen people are doing. And then I guess one other thing I'm also exploring: you mentioned the deep models being slower or more expensive. Yeah, there is a trend in the inference world of making models shallower,
Starting point is 00:20:56 right? And I wonder if this short catchphrase I was thinking about, like, deep teacher, shallow student, would be a good deployment paradigm. Yeah. Like, you push the frontier capabilities with depth and then you distill. Actually, this is a good point. If you go to our website, this is one of the future directions that we list at the very bottom. Oh, okay. Yeah.
Starting point is 00:21:20 We would love to see if we could get similar performance. Like, we do achieve state-of-the-art performance on goal-conditioned RL in JaxGCRL by a significant amount. And so it was very exciting to see the frontier of the ability to train RL agents sort of pushed. And if we can do that in a way that also is just as efficient as standard networks, that would be very cool. So, you know,
Starting point is 00:21:45 because what you train doesn't have to be the same thing that you deploy at inference, right? You know what I mean? Like that. Yeah. So if there's a way to distill down to a small model or prune the model and still retain performance,
Starting point is 00:21:58 that's a very interesting research direction that we should be... Let's talk about other future directions. What else are your personal passions? Yeah, so currently I'm pursuing the direction of stitching in reinforcement learning. So we are trying to generalize reinforcement learning from shorter sub-behaviors
Starting point is 00:22:17 so that they are stitched, merged during the test time. And yeah, I think this is one of my last papers that I will tackle during the PhD. Personally, I'm very curious of like, can we, like, what's the, like, real, like, can we push? I'm curious about, like, advancing the frontier as much as possible. So if you actually look at our paper, we focus on the scaling depth, but we notice that we see that scaling with actually also improves performance. And we also find that actually by scaling depth, we actually unlock the ability to scale along batch size as well. So this is one of, yeah. So, okay, I guess.
Starting point is 00:22:53 Co-linear, like, yeah. Right. So, like, okay, I guess for context, like in traditional RL, like value-based RL, scaling batch size is not super effective. But we also can see, there's also other work in other areas of deep learning that show that scaling batch sites is only most effective when, there's like a large enough network capacity to take advantage of the skilled batch size. And we actually find that, you know, perhaps, you know, perhaps one hypothesis to not be, like, perhaps the reason why skilled batches isn't that effective in traditional RLs, because, like, we've been using these tiny networks that haven't been able to capture that.
Starting point is 00:23:22 And one of our experiments is that, like, because we enabled successful training of deep networks, we actually were able to, this is a great test bed for, you know, like testing this hypothesis. And we find that indeed, as we scale up the network capacity, we also unlock this different dimension of scaling batch size. And so all of that is to say that I'm very curious for someone, like, with enough compute to, like, take some of these environments, scale up depth to the maximum capability, also scale along width, also scale along batch size. And, like, basically, in the same way that in language we're scaling along so many different axes, can we unlock different dimensions of scaling as well, and what capabilities, and how far can we push the frontier
Starting point is 00:24:01 of training these RL agents from doing that. Before we pass, Sean, when you say enough compute, what kind of compute budget did you have? How does it? I just want to see what you guys got. Good question. So we wanted to make sure that this is, we wanted to make it such that, like, you know, it's quite accessible. So the nice thing is that all of our experiments, even the thousand-layer networks,
Starting point is 00:24:21 can be run on one single 80 gigabyte H100 GPU. So those dollars. Right, right, right. So everything can be run on one GPU. But in theory, if we had, you know, like, a distributed training setup and could just, like, blast compute through this and really wanted to push the frontier, it'd be very interesting to see how it goes. Yeah, cool. And I've actively been trying to learn as much as I can about vision language action models,
Starting point is 00:24:44 world models. At NeurIPS, going to a lot of machine language action models. Vision language. Vision language. And yeah, yeah, curious about applications of representation learning. Yeah, exactly, for robotics. I'm actively trying to explore more in that area. So just reading a lot of literature, talking to as many people. Yeah, we just released our episode with General Intuition. Oh, okay. Awesome. Where, if you know a bit about their history, they started as a gaming
Starting point is 00:25:11 clipping company, and they basically have a vision language action model. Yeah. Which I saw a preview. It was very impressive. I'm not sure exactly how transferable it is to embodied use cases, but it doesn't have to. Like, screen is fine, you know?
Starting point is 00:25:28 Like, yeah, I don't know if you have any takes on. Yeah, that's an exciting research direction. Definitely. Yeah, I think the concept of actions as something that you are outputting is actually not that popular in industry, only because text has completely dominated the last three years. And tool calling, which is just another form of structured text. And I feel like the action research is kind of like, I don't know what needs to happen
Starting point is 00:26:01 in order to unlock the next phase in that. I don't know if you've seen anything interesting out here. Shout it out. Yeah, there's a lot of cool work on leveraging pre-trained VLMs. You freeze it and then you apply it. Exactly. And then you put on top of that, like, some sort of experts to output actions. Also, like, systems for doing, like, hierarchical planning,
Starting point is 00:26:23 maybe outputting some higher-level plan. And this is, like, a larger network that takes a little longer to do inference. And so it outputs its plans with less frequency, in some sort of chunk. And then from there, there's, like, some sort of second system that operates a bit faster. I think there's quite a bit of interesting research in that direction.
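The two-rate setup described here (a larger, slower model that emits a chunk of plans infrequently, and a faster second system that acts at every step) can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration: `SlowPlanner`, `fast_controller`, `run_episode`, and the 1-D state are hypothetical stand-ins, not any system mentioned in the episode.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SlowPlanner:
    """Hypothetical stand-in for a large, slow model that plans in chunks."""
    chunk_size: int = 4
    calls: int = 0  # counts how often the expensive planner is invoked

    def plan(self, state: float, goal: float) -> List[float]:
        # Emit `chunk_size` evenly spaced waypoints from state toward goal.
        self.calls += 1
        step = (goal - state) / self.chunk_size
        return [state + step * (i + 1) for i in range(self.chunk_size)]


def fast_controller(state: float, waypoint: float, max_step: float = 1.0) -> float:
    """Hypothetical stand-in for a small, fast policy: a clipped move toward the waypoint."""
    delta = waypoint - state
    return max(-max_step, min(max_step, delta))


def run_episode(start: float, goal: float, planner: SlowPlanner, horizon: int = 16) -> float:
    state = start
    chunk: List[float] = []
    for t in range(horizon):
        # The slow system runs only once per chunk ...
        if t % planner.chunk_size == 0:
            chunk = planner.plan(state, goal)
        # ... while the fast system acts at every step.
        waypoint = chunk[t % planner.chunk_size]
        state += fast_controller(state, waypoint)
    return state


planner = SlowPlanner(chunk_size=4)
final_state = run_episode(start=0.0, goal=8.0, planner=planner, horizon=16)
print(final_state, planner.calls)
```

In this toy, the planner is called once every four control steps, which is the only point of the sketch: the expensive model's inference cost is amortized over a chunk of cheap low-level actions.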
Starting point is 00:26:40 So that's what I'm looking forward to. Cool. Final question. Hardest question you were asked at the poster session or just favorite encounter, anyone famous that you met? So I actually haven't gotten a chance to go to the conference that much. I'm actually working full time now. Oh, damn.
Starting point is 00:26:55 Yeah. So so far, I actually literally just got my badge a few moments before. So I guess I wouldn't be the best to answer that question. No, no, no, like, people ask you stuff, right? Oh, oh, at my poster. People asking you or meeting you, and, like, you know, just give a vibe of, like, what people are saying about it. People were very, I think it's sort of, like, very eye-opening.
Starting point is 00:27:20 I think that the general question is that people thought it's a very eye-opening paper. Because, like, the objective is quite simple. It's quite elegant. And for us to be able to, like, you know, like, I don't want to say, like, overturn, but, like, sort of challenge the conventional wisdom that, like, RL is not super scalable, and push it to such limits, like a thousand layers deep, and see continued improved performance. I think the general impression that I've gotten is that, you know, this could be, like, really cool, like if we can sort of build along this direction
Starting point is 00:27:50 and that like we can really scale along all these different dimensions and push the frontier of the ability for RL. I'm very curious to see how that goes. All right. Well, thank you so much for dropping by. Congrats on the paper again. And good luck in your future work. Thank you.
Starting point is 00:28:03 Thanks for having us. Yeah.
