Microsoft Research Podcast - 117 - Provably efficient reinforcement learning with Dr. Akshay Krishnamurthy

Episode Date: June 3, 2020

MSR’s New York City lab is home to some of the best reinforcement learning research on the planet, but if you ask any of the researchers, they’ll tell you they’re very interested in getting it out of the lab and into the real world. One of those researchers is Dr. Akshay Krishnamurthy and today, he explains how his work on feedback-driven data collection and provably efficient reinforcement learning algorithms is helping to move the RL needle in the real-world direction. https://www.microsoft.com/research

Transcript
Starting point is 00:00:00 When you have casual conversations with people, you typically emphasize the final performance. And when you talk about these high-profile achievements, we don't often spend much time scrutinizing the sample efficiency. And one thing that we've been kicking around a little bit is actually, can we come up with these empirical test suites or benchmarks that really put data efficiency at the forefront. You're listening to the Microsoft Research Podcast, a show that brings you closer to the cutting edge of technology research and the scientists behind it. I'm your host, Gretchen Huizinga. MSR's New York City lab is home to some of the best reinforcement learning research on the planet. But if you ask any of the researchers, they'll tell you they're very interested in getting it out of the lab and into the real world.
Starting point is 00:00:52 One of those researchers is Dr. Akshay Krishnamurthy, and today he explains how his work on feedback-driven data collection and provably efficient reinforcement learning algorithms is helping to move the RL needle in the real world direction. That and much more on this episode of the Microsoft Research Podcast. Akshay Krishnamurthy, welcome to the podcast. Thanks for having me. Let's start by situating. You're a principal researcher at MSR in New York City, and part of what I'd like to call a small but tactical unit working on reinforcement learning in real-world settings. Both for the short term, you called that translational, and the long term. So to kick things off today, tell us how these two approaches are different and how they're being tackled in your lab. And then talk about how they're each making an impact
Starting point is 00:01:51 on the science of reinforcement learning, both practically and theoretically. I like to categorize our projects in terms of the time horizon with which we might push things into product. And the first kind of bucket here is more translational efforts. These are the kinds of things that maybe we hope to put in production in the next year or two years. And then the other bucket
Starting point is 00:02:14 is maybe more blue sky kind of efforts, things that maybe are five years out or 10 years out or just more high risk kind of projects. Let me talk a little bit about the translational stuff. Yeah. So our group maintains and develops this RL system that is currently deployed in production in several settings. Most of these are recommendation kind of settings.
Starting point is 00:02:37 And we're constantly adding features. We have this amazing group of engineers and data scientists working on this product. And the researchers are adding features to the system based on new research and also kind of based on requests and feedback from product groups. So this is, I think, a really cool way to do research where you're working with a product group on some deployments, and then they come to you and say, hey, we also want to do X. And then we think about it and we're like, oh,
Starting point is 00:03:06 we actually don't know how to do X. So then we like go do some research and try and figure out how to do that kind of thing. And then we circle back with the product group maybe in six months or one year when we figure out something. And one example of this is actually this project we did on recommendation when the sort of decision space that the algorithm is operating over is very, very rich and typically like of a combinatorial nature. So the prototypical example here is personalized news. So you don't just choose like one story to show to the user. You might choose like 10 stories and you have to rank them. So now your decision space is much, much more complicated. And it becomes much more difficult statistically to figure out what sort of recommendation rules are better. And so we sort of came up with some technique that's inspired by a lot of literature in
Starting point is 00:03:55 learning theory community. And we brought this back to the news team. And I think in the last one year, this has sort of been making its way to production in the system that we're running. So that's kind of like what the translational efforts look like. And the other side of the story is this blue sky or long-term stuff. And here, I think we're not that attached to applications when I think about this style of research. I mostly am thinking about what are the really challenging problems in reinforcement learning? And how can we come up with models and algorithms that address those issues? I don't really know the timescale that these things will pan out, right?
Starting point is 00:04:33 I think these problems are super hard. I'm definitely not the only one thinking about them. You know, we're making progress, but it's just not exactly clear to me when these things will make it to applications. But the hope is that, you know, as we build our understanding, we will be able to do that. Well, let's situate you, Akshay. Researchers often live at an intersection, or they like to say they do. So what's your address? Where does your research passion lie? And what gets you up in the morning? Yeah, so these days, I'm pretty interested in the statistical issues around reinforcement learning.
Starting point is 00:05:09 But this, I think, for me is part of a bigger story. I've always been interested in interactive learning, which typically when I tell this to people, they think of something else. But I usually use this to refer to any sort of learning setting where the algorithm engages in feedback-driven data collection. And this is, I think, one of the central engages in feedback-driven data collection. And this is, I think, one of the central problems in reinforcement learning. The agent operates in this world, makes a bunch of decisions, and the decisions it makes influence the data that the agent has. And so part of the challenge is actually learning to make the right decisions to collect the right data. So that's sort of where I came from. I've always been
Starting point is 00:05:44 interested in this sort of like broader topic of designing algorithms to collect the right data. So that's sort of where I came from. I've always been interested in this sort of like broader topic of designing algorithms to collect the right data. And the last four or five years, it's mostly been about reinforcement learning because that's a very hard version of that problem. Well, let's go a little deeper on the theoretical for a moment and talk about what you referred to as incentive structures in scientific circles. So what does the current landscape look like, particularly with regards to how benchmarks are set up and measured and how we might be thinking differently about incentive structures so that the reinforcement learning ball moves forward?
Starting point is 00:06:19 I think data efficiency is one of the core issues in RL. I was just mentioning this. And so I think most researchers are interested in moving towards real-world applications. And the ones that people typically kick around are recommendation, high-precision medicine, personalized medicine, robotics, these kinds of things. And in these settings, we may not have the luxury of collecting millions or billions of interactions before we produce some high quality policy. So I feel the real world applications are quite different from the kinds of simulated environments that we are currently using for benchmarking.
Starting point is 00:06:55 And don't get me wrong, the benchmarks are amazing. They've really pushed the field forward. But I also am kind of mindful about the way benchmarks also influence the community. So there's some kind of feedback loop here. Right. So these benchmarks kind of overemphasizes quality of final policy at the expense of how much experience did you need to get there. And I think when you have casual conversations with people, you typically emphasize the final performance. And when you talk about these high-profile achievements, we don't often spend much time scrutinizing the sample efficiency.
Starting point is 00:07:36 And one thing that we've been kicking around a little bit is actually, can we come up with these empirical test suites or benchmarks that really put data efficiency at the forefront. And I think these are kind of things that hopefully will catch on and actually nudge the community back to thinking about statistical issues. Right. Let's get down to the nitty gritty and talk about math. The beating heart of RL research is algorithms. And you told me they need to be able to do three things. So I'm going to have you get down to the nitty gritty on some impressive algorithms you've been working on in a minute. But first, give us a framework for the research. What things are reinforcement learning algorithms supposed to be able to do?
Starting point is 00:08:16 And where are we on the path towards cheap, fast, good pick three instead of pick two? This is actually something I took from one of John Langford's talks. And the core capabilities of RL algorithms are credit assignment, which is when you make a sequence of decisions and observe a certain outcome, you know how to allocate which decisions were responsible for that outcome. And this is not so easy because maybe you did 100 things and then at the end, you know, you are at a pot of gold. And is it because you turned left over here or turned right over there or whatever? That's credit assignment. The other issue is exploration. So you're an agent, you're in some complicated environments, you run around for some time. If you run around blindly, you will
Starting point is 00:09:01 not, you know, find all the corners of every nook and cranny that you might need to, you know, solve your problem. So you have to explore the environment. And the other one is generalization. If your environment is relatively complicated, you might not actually see the same thing twice. And what you'd hope for is that the skills you learn and the patterns you pick up in one place can transfer to other situations. So this is this thing that my colleague Dependra always talks about. He's saying, I learned how to bike in Brooklyn. And when I go to Amsterdam, I should also know how to bike. So we should be able to generalize our experience to new experiences.
Starting point is 00:09:38 Okay. So those are the three capabilities. And when you take any pair of these right now, we have relatively mature algorithms that often are used in production. And we have a pretty solid theory for how to do two of three. So I think the one that is pretty common these days is doing credit assignment and generalization without exploration. And so what's going on in a lot of these policy search kind of algorithms, like maybe you've heard of policy gradient methods. And so what these algorithms do is you have some current policy,
Starting point is 00:10:13 you execute it perhaps with some noise to sort of deviate a little bit, and you pick up some signal about what local changes to your policy are doing. And this is sort of solving the credit assignment, this deviation is addressing the credit assignment aspect. And your policy is maybe parametrized by some neural network. And so that's handling the generalization aspect. You don't have to keep track of every single state you might see. So then you do these local deviations and you make small improvements on your policy.
Starting point is 00:10:37 So these algorithms are actually pretty effective and they work really well in like continuous control kind of problems, but they kind of are doing a local search. And so they don't really work very well in problems where you have to do very sophisticated exploration. But maybe one research question is, can we bring an exploration into these style of algorithms? We certainly don't have robust algorithms for doing all three. The blue sky optimism is that we can do all three. Yeah. And I think even in this cheap, fast and good, the definitions keep moving.
Starting point is 00:11:09 So certainly I think now, like if you think about with the advent of the internet, we can access information very cheaply, very quickly, and it's very high quality. Yeah. And it's very different from maybe what it was 30 years ago or something. Absolutely. So the definitions keep moving and that's fine. But I think we should actually be optimistic to push for these things that maybe seem almost impossible. Well, let's talk about some specific algorithms since we're on that track right now.
Starting point is 00:11:51 The first one is one you call policy covered by inductive decoding or PCID. And the exploration around this work was published in 2019 in an ICML paper titled, Ready? Provably Efficient learning with rich observations. Tell us about PCID and how it moved the needle on provably data efficient methods. What were the key claims and contributions of this work? This paper is part of a longer effort
Starting point is 00:12:18 on this sort of blue sky track of designing algorithms with these three capabilities of credit assignment, generalization and exploration. So in an ICML 2017 paper, we came up with a more general theory for some sort of way to build algorithms with these three capabilities and some kind of environment assumptions under which these things are possible. But the algorithms in that paper are not particularly practical. So this ICML paper 2019 is more focused towards practicality, although I would still consider this a relatively theoretical paper. So what do we do?
Starting point is 00:12:55 So we have this new algorithm. It's called PCID. And we can use it to find near-optimal policies for a restricted class of environments that we call block MDPs. And it does this in a provably efficient way, both in terms of the amount of experience it requires and the total computation. And I think one thing I do want to highlight here is that these block MDPs, they're relatively structured environments, but they do actually require the agent to demonstrate all three of those core capabilities of credit assignment, generalization, and exploration. But the other thing I want to highlight is the phrase provably efficient that I mentioned. I don't want to suggest that other
Starting point is 00:13:35 algorithms do not have these capabilities. What I'm just saying is we are able to prove that our algorithm does. Perhaps the key claim is we're proving that this algorithm has some data efficiency guarantee. And that is something I kind of want to emphasize. So let me tell you like the high level of the approach. It's superficially similar to, I think, what's now a relatively common strategy, which is to create some auxiliary supervision
Starting point is 00:14:00 based on what the agent is doing and use that to identify some kind of compact representation of the world and then use the compact representation to drive an exploration module. So the way we sort of realize this is our representation is like a discrete latent state. And that's why this thing is called decoding. We decode the latent state of the world, and then we run one of these algorithms that can do exploration and credit assignment, but not generalization, on the decoded latent state. So that's like the high-level idea. And to explain more,
Starting point is 00:14:36 I have to tell you about this block MDP model. The model is this kind of stylized setting in which there exists some latent state space that's relatively small. In the literature, they might call this a tabular state space. But these states are never observed by the agent. Instead, the agent will be in a state, but it doesn't know that. And it will see an observation, which is like sort of emitted from a distribution associated with this state. And the observation might be extremely high dimensional and very, very rich. But the observation might actually have enough information for the agent to figure out what its latent state is.
Starting point is 00:15:13 But the agent sort of doesn't know how to do that compression. So there's some mapping in the world that takes observations and maps them to the latent state. And if the agent only knew this mapping, it could just run one of these tabular methods, which do exploration and credit assignment, but don't do generalization. So this model kind of encourages
Starting point is 00:15:35 a little bit of a factored approach where you do the generalization part to map observations to state, and then you do the other two things using a technique we already know. The key challenge here is that you never observe the to state. And then you do the other two things using a technique we already know. The key challenge here is that you never observe the latent state. So it's not that clear how are you going to learn this decoder. And this is sort of where the auxiliary supervision comes in. So we use supervised learning to train a predictor to predict some other stuff. And somehow we show that when you predict this other stuff
Starting point is 00:16:05 you recover the latent state the algorithm proceeds in this like forward manner so you take like a hundred steps and then you reset and then you take a hundred steps and you reset and each one of these is an episode and so we can assume we have the decoder at stage h some time step and we're going to use the decoder at the previous time to learn the decoder at the current time. And the way you do it is you take the observation at the current time and you predict the previous action and the previous latent state, or decoded state.
Starting point is 00:16:39 Right. You can imagine it kind of recovers the dynamics of the world because the reason you're seeing this observation now is because you were in that state yesterday and you took that action yesterday. So you recover like the backward dynamics of the process. And because the dynamics are governed by the latent state only, you kind of learn this compression. And then once you have the latent state at stage H, we learn how to visit every state that you could possibly
Starting point is 00:17:06 reach at stage H. And then that gives you a good data coverage. So that's like the exploration component. You get good data coverage at stage H and you use that to drive the process of learning the decoder at the next stage. So the PCID supervision is predict the previous learned state and action from the current observation. Okay, so what do we prove? We prove that if you are in a block MDP with not too many latent states, not too many actions per stage, and the time horizon is not too long, and then there's some technical assumptions,
Starting point is 00:17:42 we prove that this algorithm will find some way to visit every state of the world with good probability without too many samples. So the number of samples will be polynomial in the number of latent states, the time horizon, the number of actions, all these sort of relevant quantities. And then we learn how to visit every state of the world, which is solving the exploration problem. Right. And the generalization component is solving the exploration problem. Right. And the generalization component is baked in
Starting point is 00:18:07 because this is a block MDP and you have to do a decoding. On the computational side, we operate in this sort of, it's called the Oracle model of computation. So we assume that you have a class of functions, maybe it's like a neural network or something, and that you can solve supervised learning problems
Starting point is 00:18:24 over this neural network. And that's where this auxiliary supervision comes in. We train a neural network or whatever to map raw observation to previous state action. And we assume that you can actually solve those classification problems. And so that's the computation. Let me just mention the last piece, which is the first one where we actually ran some experiments. Because this algorithm is not super practical, but not super impractical either. It's somewhere in the middle. It's the Goldilocks one. Yeah.
Starting point is 00:18:51 It's one that we felt like we could implement. So the experiments, I think, are more of a proof of concept that it is actually doing something sensible than anything else. And what we show is that we sort of concoct this kind of very, very difficult environment for exploration. And we run these algorithms that don't do any exploration and they fail, obviously. And then we run an algorithm
Starting point is 00:19:18 that we give it cheating access to the latent state and we run an exploratory algorithm. So this is like an algorithm that does exploration and credit assignment but not generalization and it does extremely well and then we run our algorithm which does like not much worse than that thing okay so it's like the first one is like obviously not going to work the second one is obviously going to work and the question is how close are we to like the skyline that we could hope to achieve? And so we are getting there, but we're solving that harder
Starting point is 00:19:48 problem where you have to do the decoding. Okay. And that was 2019. We're all the way into 2020. Since then, you haven't sat around on your algorithms. You have a new paper in the pipeline with another algorithm that you think is actually pretty cool. The paper is called Kinematic State Abstraction and provably, there's that word again, efficient rich observation reinforcement learning. That's a mouthful and also a mathful. In this paper, you present another algorithm called HOMER. Tell us about HOMER and how it moves us closer to solving for X if X equals robust real-world reinforcement learning. And again, feel free to show your work. Yeah, so HOMER is, I would think of it as an improvement on PCID,
Starting point is 00:20:33 but the high-level claims are kind of similar. So we're still working in the block MDP. On a technical level, we're actually able to remove some of these technical assumptions that I didn't really explain to you in the previous story. But those are actually required by the PCID theory, and we can actually show that PCID empirically fails when some of those things are not satisfied. So that's actually kind of important. But I think there are two perhaps more important points about Homer. So Homer is much, much more robust in practice. And part of the reason is that in PCID we're trying to use the current observation to predict the previous state and the previous
Starting point is 00:21:11 action. Now we don't have access to the previous state so we predict the previous predicted state and this can cause some catastrophic failure when your previous predicted state is wrong. So you're trying to predict something that you're predicting yourself, which causes some sort of bubbling of errors that doesn't work very well. And so in Homer, we use a different auxiliary supervision problem that does not rely on the previous one. We still do this kind of inductive thing, but we train this decoder for time h, and then we throw it away.
Starting point is 00:21:45 And then we train a completely new decoder for time h plus one. And so this makes the algorithm somehow much, much more robust and much less prone to these kind of propagating errors, which is a very, very common problem in reinforcement learning. The other thing that we're doing in Homer, which is, I think, empirically very important, is in PCID, we are using the latent state space that we build. We build like a dynamics model over the latent state space, and we do what they call model-based planning. And this is also prone to failure when your model is wrong. Okay. So you're trying to plan in your model to reach some state, but your model is wrong, and so you just completely fail. And in Homer, we replaced this planning module with a policy search module that operates,
Starting point is 00:22:35 again, on the raw observations. So again, here, we're not using the decoders. We were just saying, we've collected a good data set that covers the whole world, and we can use the data to train a machine learning model to do the things we want to do. So those are the two reasons that Homer is like an empirical and in some ways a theoretical improvement. But the other more important point is that in the Homer project, we've identified a new structural notion that we think might be useful in the design of other algorithms. And this is also this first part of the title, which is this kinematic state abstraction. It's kind of some property of an environment where you might say two states of the world are very similar if the dynamics to and from those states are the same. You should be able to collapse these two things together because if one policy gets to one, it also gets to the other. So for the purpose of exploration, these two states are very coupled.
Starting point is 00:23:28 And for the purpose of forward prediction, they're also very coupled because they have the same dynamics going forward. And so this kinematic state abstraction is a way to formalize this. And it's kind of easy to show that in block MDPs, all the observations coming from a latent state are tied together. They induce the same forward and backward dynamics. This idea also inspires this new supervised learning auxiliary supervision problem that we use here. So this prediction problem is what they call contrastive learning. So what you're going to try and do is you're going to splice data sequences together in some real and fake way, and you're going to try to predict which ones are real and which ones are fake. So I'm the agent. I see some
Starting point is 00:24:05 observation. I take an action. I see some next observation. That's a real transition. So I have an observation, action, observation, triple. Now I'm at some new place in the world, a new observation. I take a different action. I see another observation. Let me swap the last two observations between these two triples. Those are fake transitions, right? And now if I can just somehow train a model to predict what's real and fake from this crazy data set that I just like made up, right? But I know the label because I'm doing the splicing. So I train a supervised learning model, just a classifier to predict what's real and what's fake. If you think about it, you can kind of just pick out the transition operator because you know what a real
Starting point is 00:24:43 transition is now. And that means you know what the distribution of the next state is going to be given your current state in action. And again, we can show that somehow when you solve this problem, you kind of are decoding the latent state. And then like sort of a lot of the stuff that we do with PCIDs is going through again. On the empirical side, it's still a relatively toy experiment, but we're more serious about comparing with the algorithms that people are running in practice. There was one algorithm that was pretty good. So this is one of these policy search methods. When you attach on an exploration module based on this idea called random network distillation. So it's this chain
Starting point is 00:25:23 of like 100 actions. So this thing got like to like level 25 or 30 or something. But Homer just crushed the whole thing because it's kind of designed to do so. I think it's in some ways demonstrating what is a very difficult exploration problem. The algorithms that people typically use do not solve those hard exploration problems. And Homer is sort of constructed to do so and does it.
Starting point is 00:26:03 Let's talk briefly about the section in every academic paper that addresses research limitations and what scientists are still working on. In general, what big challenges remain in this process or pipeline of algorithmic perfection, and how are you moving into unlit spaces to resolve them? Let's think about this on the theory side and the empirical side. So on the theory side, I think it's becoming important to identify more expressive models that emphasize the need for these core capabilities of generalization, exploration, and credit assignment. To date, we've been working on the block MDP model, which it certainly does these things, but it's quite limited in many ways. So a lot of problems tend
Starting point is 00:26:40 to have like a continuous but low dimensional latent state, and this is not captured by the block MDP. But we're also thinking about like somehow substantially more expressive models so we have some some new proposal for model that's like exponentially more expressive than the block mdp okay and whether we can build algorithms for those things so like homer and pcid really are leveraging this discrete latent state space in some very, very strong way. And because they're so attached to the model, they don't really work in the real world. Like your real world does not have that block MDP structure. And then what happens, right?
Starting point is 00:27:14 But the other frustration here is that we kind of know that some modeling assumptions are required for provable sample and computational efficiency. And so the question is, what are the assumptions that are relevant to applications? And perhaps we should just keep expanding until we cover most things, but it might not actually work in that way. Like maybe we'll have to sort of fragment and say, okay, for this type of application, we're going to focus on this set of assumptions, which are relevant, and come up with a specialized set of algorithms, and so on. So that's kind of the theory side.
Starting point is 00:27:48 I think we are sort of pushing on more expressive models and algorithms that address more general settings. On the empirical side, Homer and PCID are very much algorithms designed by theoreticians. And there's a question of whether we can make them more empirically effective.
Starting point is 00:28:01 I think there are problems that do have something close to block MDP structure. And if we can make them more empirically effective. I think there are problems that do have something close to block MDP structure. And if we can make these algorithms somehow more natural, more robust, more practically viable, maybe we can start pushing towards those kinds of problems. And so we have an ongoing effort around this where we're trying to identify some problems that might have the kind of structures
Starting point is 00:28:23 that Homer is sort of exploiting, and also trying to make Homer somehow a little bit more data efficient and these kind of things. It's about now that I always ask what keeps you up at night, both figuratively and literally. Reinforcement learning is a bit of a rock star when it's in a controlled or simulated environment, but once it hits the open world, the state space gets real, as we might say. So what could possibly go wrong, Akshay? And how are you working from the beginning to ensure that it doesn't? Well, to be honest, what usually keeps me up at night is some bug in a proof or something in my code that's not working. And this has been especially true in the recent months because I've been like working very, very hard on this fairly delicate
Starting point is 00:29:12 proof and keep getting stuck like every day. So mostly I can't sleep because I like cannot solve my technical problem. On the higher level, in terms of I think what could go wrong, I'm definitely aware that when you pursue these sort of blue sky efforts, they're necessarily pretty high risk. And it could be the case that when you take this theory first approach, or even when you are thinking about reinforcement learning in the way that we're thinking, it just might not pan out. So it just could be the case that we push on this for, you know, five years, 10 years, and then we just hit a dead end. And that would kind of be depressing. But I think it's part of what happens when you do this kind of high risk stuff.
Starting point is 00:29:49 Yeah. But I'm also a bit reassured that, you know, RL is a very, very popular topic. We have a large community of incredibly brilliant and amazing people that are working on these problems and everyone is tackling them with a different approach and different perspective. So even if what we're trying to do doesn't really pan out or doesn't do the things that we want, I think we will find something that makes sense as a community. And from like a scientific perspective, that sort of, I think, keeps me sane. And I always have conversations with people about this kind of stuff.
Starting point is 00:30:19 Like when I give a talk about our work, they're like, well, do you really think this is going to pan out? And I say, I don't know, but like someone should try it, right? So I think there's a real risk that maybe these approaches don't make sense. And obviously we're thinking about when they do make sense and trying to make them make sense, but maybe they will not work, right? And that kind of keeps me up at night a little bit, but usually it's the more technical stuff. The other concern about, I think, which is sort of what you're talking about is when we deploy these algorithms in the real world, what could go wrong? I certainly think a lot of things could go wrong. And there's a lot of things to be concerned about. Like, you know, there's biases, there's safety issues, there's fairness
Starting point is 00:30:57 issues, and there's like become a very vibrant community of people working on those kinds of problems. I think we're so far from applications. Like I don't encourage anyone to deploy Homer anywhere, but I think I would say that, you know, MSR has brilliant people working on these kinds of issues. And I think if, and when we do get closer to deploying something, I am super fortunate to feel like I can go talk to them and like loop them in and see like,
Starting point is 00:31:24 what should we be doing? And I don't claim to be an expert on those things at all. So I would just want to take advice from other people on these kinds of things. Well, it's story time here on the Microsoft Research Podcast. So let's hear yours. What got you started in high tech? What was your academic path toward research and how did you land in New York City, or at least post-COVID, the home office in Brooklyn, working on reinforcement learning for Microsoft Research? Yeah, so I grew up in a pretty technical family. I was born and raised in Silicon Valley. Both of my parents were engineers.
Starting point is 00:31:56 My brother is a computer scientist, so I think it was kind of in the cards. That said, I didn't think I wanted to be an academic probably through most of grad school. So when I went to college, I kind of had more of an entrepreneurial bent. I think this is supernatural for people in Silicon Valley. They all kind of want to do that. You hear about all these tech stars and so on. So I didn't really think I wanted to be an academic. I was, you know, hacking on random startup-y project ideas in my dorm room and stuff.
Starting point is 00:32:27 I did some industry internships at tech startups and these kind of things. And I actually didn't really like it. So kind of what happened is I did those things and I didn't enjoy it. And then I did a research project in undergrad. And then I also had this amazing opportunity to spend a summer in Tel Aviv doing a research project with a professor there. And that was like super fun and really kind of pushed me towards going to grad school. And then my last summer of grad school, I was like super fortunate that Alec and Miro like invited me to do an internship. And they were just like, hey, do you want to come spend
Starting point is 00:33:01 the summer in New York doing internship? And I was like, yeah, that sounds amazing. And so I did that. I really fell in love with the lab. Since then, I think John has really been an advocate for me. Lankford? John Lankford, yeah. So he encouraged me to apply for a postdoc. He was on my thesis committee. He wrote a letter for me.
Starting point is 00:33:18 So I came back for a postdoc. And then I spent two years as an assistant professor at UMass. And then he sort of encouraged me to come back as a researcher. I had to do this advertisement at a conference about MSR. And I think the thing that is real is like, I've been there for three different stages of my life and I keep coming back. So it is like really a glorious place to be, to do research. And I love it a lot.
Starting point is 00:33:42 So I think having these people like Alec Agarwal, Mira Dudik, John Langford, they have been super inspirational. They're really like role models for me and have really like given me an opportunity to, you know, be here and succeed and so on. What's one interesting thing we don't know about you? I leave it open-ended, so it can really be anything. Maybe it's a personal characteristic or life event that impacted your career or life. Maybe it's a side quest or a hobby that defines you outside your scientific work, or maybe it's just something fun or funny that people might not suspect about you. Yes, I was actually discussing this question last night with my partner,
Starting point is 00:34:22 and the conclusion is that I'm pretty cliche for like a tech bro. So I do all the things that like, you know, the San Francisco startup people do. I like to rock climb. I play Ultimate Frisbee. I read a ton of sci-fi. I am rereading The Three-Body Problem right now, which is just an excellent book. I cannot recommend it enough. I'm also kind of a foodie.
Starting point is 00:34:44 So in the pre-COVID times, Sid, Sen, and I would frequently sneak out of the office to grab fancy sweets. And being in New York, there's just tons of amazing restaurants and dessert places to go to. So when I go to conferences, I almost always bring my climbing equipment. And there's starting to be a group of people that are in this crew of machine learning climbers. And we typically go check out the local bouldering gyms or maybe even go to like outdoor spot for climbing. And there's also obviously a bunch of foodies in the ML community. So we have a small group of people who go check out like, you know, Michelin star restaurant and stuff. And so the academic lifestyle is kind of great for those kinds of things. What's the biggest thing you've
Starting point is 00:35:28 climbed? So I mostly boulder. So the biggest thing I climbed is not very tall. And mostly I actually do like indoor bouldering. So, you know, these climbing gyms are cropping up everywhere. At NeurIPS last year, me and a friend went to, so this was in Vancouver. There's this like, you know, world-class bouldering area just north in Squamish.
Starting point is 00:35:51 Yes. And so we went bouldering in Squamish. It was a bit raining, so the rocks were kind of wet, which means we couldn't do anything. I think I'm also not that good, so the outdoors is really just when you kind
Starting point is 00:36:06 of get humiliated. Nature slams you. And I think the other thing is like the way they do this grading, the grading outdoors is not commensurate with the grading indoors. So you kind of get humbled. We need an algorithm for that. Yeah. Well, as we close, and this has been super fun, I'd like you to dream out loud about the future for a minute and give our listeners a vision for next-gen reinforcement learning. I sometimes frame this now as if you're wildly successful, what will you be known for at the end of your career? And how will the research you and your colleagues have done made an impact on the world, both to the scientific community and people at large. So I really hope that we can be involved in the development of data-efficient, empirically effective reinforcement learning algorithms. I think it's a little bit hard to predict
Starting point is 00:36:56 where they will be effective. So this is this lamppost sort of phenomenon that we were talking about the other day. I don't know what all the applications might be for such algorithms if we have them. And this is part of this constrained thinking where like, you know, we think about applications that our current technologies can solve. And we don't think about applications that maybe new technologies might enable. Looking under the lamppost is this story, or it's like a metaphor. So there's a person and they lose their keys. It's the nighttime. They're looking for their keys under the lamppost because this story, or it's like a metaphor. So there's a person and they lose their keys. It's the nighttime, they're looking for their keys
Starting point is 00:37:28 under the lamppost because that's the only place where they can see. Even though they are 100% sure that is not where they lost their keys. It's just where there's light. So this is this thing of, you know, we work on the things that we know we can do. And the applications that we think about are the things
Starting point is 00:37:43 that are like just beyond our grasp. But maybe if we have some radically new technology, the set of applications will just really open up and we cannot really predict what those things might be. I think it would be really awesome to be a part of that effort and see what kind of things open up. But it would be really amazing to see that these algorithms do have like a positive impact on the world. Akshay Krishnamurthy, this has been more fun than a podcast about algorithms should be. Thank you so much for coming on. Thank you very much. To learn more about Dr. Akshay Krishnamurthy
Starting point is 00:38:27 and the very latest in reinforcement learning, visit microsoft.com slash research.
