Microsoft Research Podcast - Abstracts: NeurIPS 2024 with Dylan Foster

Episode Date: December 6, 2024

Can existing algorithms designed for simple reinforcement learning problems be used to solve more complex RL problems? Researcher Dylan Foster discusses the modular approach he and his coauthors explored in their 2024 NeurIPS paper on RL under latent dynamics.

Transcript
Starting point is 00:00:00 Welcome to Abstracts, a Microsoft Research podcast that puts the spotlight on world-class research in brief. I'm Amber Tingle. In this series, members of the research community at Microsoft give us a quick snapshot, or a podcast abstract, of their new and noteworthy papers. Our guest today is Dylan Foster. He is a principal researcher at Microsoft Research and co-author of a paper called Reinforcement Learning Under Latent Dynamics: Toward Statistical and Algorithmic Modularity. The work is among the oral presentations at this year's
Starting point is 00:00:46 Conference on Neural Information Processing Systems, or NeurIPS, in Vancouver. Dylan, welcome and thank you for joining us on the podcast. Thanks for having me. Let's start with a brief overview of this paper. Tell us about the problem this work addresses and why the research community should know about it. So this is kind of a theoretical work on reinforcement learning, or RL. When I say reinforcement learning, broadly speaking, this is talking about the question of how we can design AI agents that are capable of interacting with unknown environments and learning how to solve problems through trial and error. So this is part of a broader agenda we've been pursuing on theoretical foundations of RL. And the key questions we're looking at here are what are
Starting point is 00:01:37 called exploration and sample efficiency. So this just means we're trying to understand what are the algorithm design principles that can allow you to explore an unknown environment and learn as quickly as possible. What we're doing in this paper is looking at how you can most efficiently solve reinforcement learning problems where you're faced with very high dimensional observations, but the underlying dynamics of the system you're interacting with are simple. So this is a setting that occurs in a lot of natural reinforcement learning and control problems, especially in the context of, say, embodied decision making. So if you think about games like Pong, you know, the state of the game is extremely
Starting point is 00:02:26 simple: it's just the position and velocity of the ball and where the paddles are. But what we'd like to be able to do is learn to control or solve games like this from raw pixels or images, kind of in the same way that a human would just solve them from vision. So if you look at these types of problems, we call these RL with rich observations, or RL with latent dynamics. These are interesting because they require you to explore the system, but they also require representation learning: you want to be able to use neural nets to learn a mapping from, say, the images you see to the latent state of the system. This is a pretty interesting and non-trivial algorithmic problem. And what we do in this work is take a first step towards something like a unified understanding of how to solve these sorts of rich observation or latent dynamics RL problems.
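To make the setting concrete, here is a minimal Python sketch of the kind of problem being described. It is purely illustrative and not code from the paper; the class name, dynamics, and constants are all assumptions chosen to mirror the Pong example. The point is only that the true state is a handful of numbers, while the learner sees nothing but a high-dimensional, noisy pixel observation of it.

```python
# Illustrative sketch only, not code from the paper. Everything here is a
# hypothetical stand-in for the Pong example: the latent state is four numbers,
# but the agent only ever receives a noisy image rendered from that state.
import numpy as np

class LatentPong:
    """Latent state: (ball_x, ball_y, ball_vy, paddle_y). Observation: an image."""

    def __init__(self, size=32, seed=0):
        self.size = size
        self.rng = np.random.default_rng(seed)
        self.state = None

    def reset(self):
        self.state = np.array([0.8, 0.5, 0.03, 0.5])    # the entire latent state
        return self._render()

    def step(self, action):
        ball_x, ball_y, ball_vy, paddle_y = self.state
        ball_y += ball_vy
        if ball_y < 0.0 or ball_y > 1.0:                 # bounce off top/bottom walls
            ball_vy, ball_y = -ball_vy, float(np.clip(ball_y, 0.0, 1.0))
        paddle_y = float(np.clip(paddle_y + 0.05 * action, 0.0, 1.0))  # action in {-1, 0, +1}
        self.state = np.array([ball_x, ball_y, ball_vy, paddle_y])
        reward = float(abs(ball_y - paddle_y) < 0.1)     # reward for tracking the ball
        return self._render(), reward

    def _render(self):
        # The agent never sees self.state directly, only this pixel observation.
        img = np.zeros((self.size, self.size))
        img[int(self.state[1] * (self.size - 1)), int(self.state[0] * (self.size - 1))] = 1.0  # ball
        img[int(self.state[3] * (self.size - 1)), 0] = 1.0                                     # paddle
        return img + 0.05 * self.rng.standard_normal(img.shape)

env = LatentPong()
obs = env.reset()
print(obs.shape)  # (32, 32): 1,024 noisy pixels encoding a 4-number latent state
```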
Starting point is 00:03:11 So how did you go about developing this theoretical framework? Yeah, so if you look at these sorts of RL problems with latent dynamics, this is something that's actually received a lot of investigation in theory. A lot of this goes back to early work from our lab from around 2016, 2017. There are some really interesting results here, but progress was largely on a case-by-case basis, meaning there are many different ways that you can try to model the latent dynamics of your problem, and each of these somehow leads to a different
Starting point is 00:04:04 algorithm, right? You think very hard about one modeling assumption, you think about what an optimal algorithm would look like for it, and you end up writing an entire paper about it. And there's nothing wrong with that per se, but if you want to be able to iterate quickly, try different modeling assumptions, and see what works in practice, this is not really tenable. It's just too slow. And so the starting point for this work was to try to take a different and more modular
Starting point is 00:04:35 approach. So the idea is, there are many, many different types of systems or modeling assumptions for the dynamics that have already been studied extensively, and have entire papers about them, for the simpler setting in which you can directly see the state of the system. And so what we wanted to ask here is: is it possible to use these existing results in a more modular fashion? If someone has already written a paper on how to optimally solve a particular type of MDP, or Markov decision process, can we just take their algorithm as is and plug it into some kind of meta-algorithm that combines it with representation learning and uses it to solve the corresponding rich observation or latent dynamics RL problem?
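As a rough illustration of what "modular" means here, the sketch below (hypothetical, and not the algorithm from the paper) shows the shape of the interface being asked about: an existing algorithm written for the simple latent-state setting is reused unchanged inside a wrapper that also learns a decoder from observations to latent states. The interface names are assumptions made for the example.

```python
# Hypothetical interface sketch, not the construction from the paper. It only
# illustrates the modularity question: BaseRLAlgorithm stands in for an existing
# algorithm designed to see latent states directly, and RepresentationLearner
# stands in for the decoder that maps rich observations back to latent states.
from typing import Any, Protocol

class BaseRLAlgorithm(Protocol):
    def act(self, latent_state: Any) -> int: ...
    def update(self, latent_state: Any, action: int,
               reward: float, next_latent_state: Any) -> None: ...

class RepresentationLearner(Protocol):
    def decode(self, observation: Any) -> Any: ...
    def update(self, observation: Any, action: int,
               reward: float, next_observation: Any) -> None: ...

def meta_episode(env, base: BaseRLAlgorithm, rep: RepresentationLearner,
                 horizon: int = 100) -> float:
    """One episode of a naive meta-algorithm: decode each observation, then hand
    the decoded state to the base algorithm exactly as if it were the true state."""
    obs = env.reset()
    total = 0.0
    for _ in range(horizon):
        z = rep.decode(obs)                      # representation learning step
        action = base.act(z)                     # base algorithm used as a black box
        next_obs, reward = env.step(action)
        rep.update(obs, action, reward, next_obs)
        base.update(z, action, reward, rep.decode(next_obs))
        obs, total = next_obs, total + reward
    return total
```

Whether a wrapper of roughly this shape can ever be made statistically efficient is exactly what the paper probes; as Dylan explains next, without additional assumptions the answer turns out to be negative.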
Starting point is 00:05:47 What were your major findings? What did you learn during this process? We started by asking the question exactly in the way that I just posed it, right? Can we take existing algorithms and use them to solve rich observation RL problems in a modular fashion? And this turned out to be really tricky. There are a lot of natural algorithms you might try that seem promising at first but don't exactly work out. And what this led us to, the first main result in this paper, is actually a negative result. What we showed is that most well-studied types of systems, or MDPs, from the prior literature on RL, even if they're tractable when you're able to directly see the state of the system, can become statistically intractable once you add high-dimensional observations on top of them. That is, the number of attempts to explore the system that you need in order to learn a good decision-making policy becomes very, very large, much larger than the corresponding complexity if you were able to directly see the states of the system. You could look at this and say, I guess we're out of luck; maybe there's just no hope of solving these sorts of
Starting point is 00:07:06 problems. But that's perhaps a little too pessimistic. Really, the way you should interpret this result is just that you need more assumptions, and that's precisely what the second result we have in this paper is about. So our second result shows that you can bypass this impossibility result and achieve truly modular algorithms under a couple of different types of additional assumptions. Dylan, I'd like to know, and I'm sure our audience would too, what this work means when it comes to real-world application. What impact will this have on the research community? Yeah, so maybe I'll answer that with two different points. The first one is a broader point, which is why it's important to understand this problem of exploration and sample efficiency in reinforcement learning. If you look at the sort of setting we study in this paper, RL or decision making with high dimensional observations, on the empirical side, people have
Starting point is 00:08:12 made a huge amount of progress on this problem through deep reinforcement learning. This is what led to these amazing breakthroughs in solving games like Atari in the last decade. But if you look at these results, the gains are coming more from the inductive bias or the generalization abilities of deep learning and not necessarily from the specific algorithms. Current algorithms do not actually explore very deliberately, and so their sample complexity is very high. It's hard to draw a one-to-one comparison, but you can argue they need far more experience than a human would to solve these sorts of problems. So it's not clear that we're really anywhere near the ceiling of what can be
Starting point is 00:08:56 achieved in terms of how efficiently you can have an agent learn to solve new problems from trial and error. And I think better algorithms here could potentially be transformative in a lot of different domains. To get into this specific work, I think there are a couple of important takeaways for researchers. One is that by giving this impossibility result, which shows that RL with latent dynamics is impossible without further assumptions, we're narrowing down the search space where other researchers can look for efficient algorithms. The second takeaway is that we are showing that this problem becomes tractable when you make additional assumptions, but I view these more as a proof of concept. We're showing for the first time that it is possible to do something non-trivial, but I think a lot more
Starting point is 00:09:49 work and research will be required in order to build on this and take it to something that can lead to practical algorithms. Well, Dylan Foster, thank you for joining us today to discuss your paper on reinforcement learning under latent dynamics. We certainly appreciate it. Thanks a lot. Thanks for having me. And to our listeners, thank you all for tuning in. If you'd like to read Dylan's paper, you may find a link at aka.ms/abstracts. You can also find the paper on arXiv and on the NeurIPS conference website. I'm Amber Tingle from Microsoft Research, and we hope you'll join us next time on Abstracts.
