Microsoft Research Podcast - 117 - Provably efficient reinforcement learning with Dr. Akshay Krishnamurthy
Episode Date: June 3, 2020. MSR's New York City lab is home to some of the best reinforcement learning research on the planet, but if you ask any of the researchers, they'll tell you they're very interested in getting it out of the lab and into the real world. One of those researchers is Dr. Akshay Krishnamurthy and today, he explains how his work on feedback-driven data collection and provably efficient reinforcement learning algorithms is helping to move the RL needle in the real-world direction. https://www.microsoft.com/research
Transcript
When you have casual conversations with people, you typically emphasize the final performance.
And when you talk about these high-profile achievements, we don't often spend much time scrutinizing the sample efficiency.
And one thing that we've been kicking around a little bit is actually,
can we come up with these empirical test suites or benchmarks that really put data efficiency at the forefront?
You're listening to the Microsoft Research Podcast, a show that brings you closer to
the cutting edge of technology research and the scientists behind it. I'm your host, Gretchen
Huizinga. MSR's New York City lab is home to some of the best reinforcement learning research on the planet.
But if you ask any of the researchers, they'll tell you they're very interested in getting it out of the lab and into the real world.
One of those researchers is Dr. Akshay Krishnamurthy, and today he explains how his work on feedback-driven data collection and provably efficient reinforcement learning algorithms is helping to move the RL needle in the real
world direction. That and much more on this episode of the Microsoft Research Podcast.
Akshay Krishnamurthy, welcome to the podcast.
Thanks for having me. Let's start by situating. You're a principal researcher at MSR in New York City,
and part of what I'd like to call a small but tactical unit working on reinforcement learning
in real-world settings, both for the short term, which you called translational, and for the long term.
So to kick things off today, tell us how these two approaches are different
and how they're being tackled in your lab. And then talk about how they're each making an impact
on the science of reinforcement learning, both practically and theoretically.
I like to categorize our projects in terms of the time horizon with which we might push things into
product. And the first kind of bucket here
is more translational efforts.
These are the kinds of things
that maybe we hope to put in production
in the next year or two years.
And then the other bucket
is maybe more blue sky kind of efforts,
things that maybe are five years out or 10 years out
or just more high risk kind of projects.
Let me talk a little bit about the translational stuff.
Yeah.
So our group maintains and develops this RL system
that is currently deployed in production in several settings.
Most of these are recommendation kind of settings.
And we're constantly adding features.
We have this amazing group of engineers and data scientists
working on this product.
And the researchers are
adding features to the system based on new research and also kind of based on requests
and feedback from product groups. So this is, I think, a really cool way to do research where
you're working with a product group on some deployments, and then they come to you and say,
hey, we also want to do X. And then we think about it and we're like, oh,
we actually don't know how to do X. So then we like go do some research and try and figure out
how to do that kind of thing. And then we circle back with the product group maybe in six months
or one year when we figure out something. And one example of this is actually this project we did
on recommendation when the sort of decision space that the algorithm is operating over is very,
very rich and typically like of a combinatorial nature. So the prototypical example here is
personalized news. So you don't just choose like one story to show to the user. You might choose
like 10 stories and you have to rank them. So now your decision space is much, much more complicated. And it becomes much more difficult statistically to figure out what sort of recommendation rules are better.
And so we sort of came up with some technique that's inspired by a lot of literature in the learning theory community. And we brought this back to the news team. And I think in the last year, this has sort of been making its way to production in the system that we're running.
So that's kind of like what the translational efforts look like.
And the other side of the story is this blue sky or long-term stuff.
And here, I think we're not that attached to applications when I think about this style of research.
I mostly am thinking about what are the really challenging problems in reinforcement learning?
And how can we come up with models and algorithms that address those issues?
I don't really know the timescale that these things will pan out, right?
I think these problems are super hard.
I'm definitely not the only one thinking about them.
You know, we're making progress, but it's just not exactly clear to me when these things
will make it to applications.
But the hope is that, you know, as we build our understanding, we will be able to do that. Well, let's situate you, Akshay. Researchers
often live at an intersection, or they like to say they do. So what's your address? Where does
your research passion lie? And what gets you up in the morning? Yeah, so these days, I'm
pretty interested in the statistical issues around reinforcement learning.
But this, I think, for me is part of a bigger story.
I've always been interested in interactive learning, which typically when I tell this to people, they think of something else.
But I usually use this to refer to any sort of learning setting where the algorithm engages
in feedback-driven data collection.
And this is, I think, one of the central problems in reinforcement learning. The agent operates in this world, makes a bunch of decisions, and the decisions it makes influence
the data that the agent has. And so part of the challenge is actually learning to make the right
decisions to collect the right data. So that's sort of where I came from. I've always been
interested in this sort of like broader topic of designing algorithms to collect the right data. And the last
four or five years, it's mostly been about reinforcement learning because
that's a very hard version of that problem. Well, let's go a little deeper on the theoretical
for a moment and talk about what you referred to as incentive structures in scientific circles. So
what does the current landscape look like,
particularly with regards to how benchmarks are set up and measured
and how we might be thinking differently about incentive structures
so that the reinforcement learning ball moves forward?
I think data efficiency is one of the core issues in RL.
I was just mentioning this.
And so I think most
researchers are interested in moving towards real-world applications. And the ones that people
typically kick around are recommendation, high-precision medicine, personalized medicine,
robotics, these kinds of things. And in these settings, we may not have the luxury of collecting
millions or billions of interactions before we produce some high quality policy. So I feel the real world applications are quite different from
the kinds of simulated environments that we are currently using for benchmarking.
And don't get me wrong, the benchmarks are amazing. They've really pushed the field forward.
But I also am kind of mindful about the way benchmarks also influence the community.
So there's some kind of feedback loop here.
Right.
So these benchmarks kind of overemphasize the quality of the final policy at the expense of how much experience you needed to get there.
And I think when you have casual conversations with people, you typically emphasize the final performance.
And when you talk about these high-profile achievements, we don't often spend much time scrutinizing the sample efficiency.
And one thing that we've been kicking around a little bit is actually, can we come up with these empirical test suites or benchmarks that really put data efficiency at the forefront? And I think these are kind of things that hopefully will catch on and actually
nudge the community back to thinking about statistical issues.
Right. Let's get down to the nitty gritty and talk about math. The beating heart of RL research
is algorithms. And you told me they need to be able to do three things. So I'm going to have
you get down to the nitty gritty on some impressive algorithms you've
been working on in a minute.
But first, give us a framework for the research.
What things are reinforcement learning algorithms supposed to be able to do?
And where are we on the path towards cheap, fast, good: pick three instead of pick two?
This is actually something I took from one of John Langford's talks.
And the core capabilities of RL algorithms are credit assignment, which is when you make a
sequence of decisions and observe a certain outcome, you know how to allocate which decisions
were responsible for that outcome. And this is not so easy because maybe you did 100 things and then at the end, you know,
you are at a pot of gold. And is it because you turned left over here or turned right over there
or whatever? That's credit assignment. The other issue is exploration. So you're an agent, you're
in some complicated environments, you run around for some time. If you run around blindly, you will
not, you know, find all the corners of every nook and cranny that you might need to, you know, solve your problem. So you have to explore the environment.
And the other one is generalization. If your environment is relatively complicated,
you might not actually see the same thing twice. And what you'd hope for is that the skills you
learn and the patterns you pick up in one place can transfer to other situations.
So this is this thing that my colleague Dipendra always talks about.
He's saying, I learned how to bike in Brooklyn.
And when I go to Amsterdam, I should also know how to bike.
So we should be able to generalize our experience to new experiences.
Okay.
So those are the three capabilities.
And when you take any pair of these right now,
we have relatively mature algorithms that often are used in production.
And we have a pretty solid theory for how to do two of three.
So I think the one that is pretty common these days is doing credit assignment and generalization without exploration.
And so what's going on in a lot of these policy search kind of algorithms, like maybe you've heard of policy gradient methods.
And so what these algorithms do is you have some current policy,
you execute it perhaps with some noise to sort of deviate a little bit, and you pick up some signal about what local changes to your policy are doing. This deviation is addressing the credit assignment aspect. And your policy is maybe parametrized
by some neural network.
And so that's handling the generalization aspect.
You don't have to keep track
of every single state you might see.
So then you do these local deviations
and you make small improvements on your policy.
So these algorithms are actually pretty effective
and they work really well
in like continuous control kind of problems,
but they kind of are doing a local search.
And so they don't really work very well in problems where you have to do very sophisticated exploration.
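To make that local-search picture a bit more concrete, here is a minimal sketch in Python of the loop being described: a parametrized policy, a small random deviation, and a local update when the deviation helps. The toy chain environment, horizon, and perturbation scale are all invented for illustration; this is a crude hill-climbing stand-in for a policy gradient method, not anyone's production algorithm.

```python
# A minimal sketch of "policy search with local deviations" (hypothetical toy setup).
import numpy as np

rng = np.random.default_rng(0)
NUM_STATES, NUM_ACTIONS, HORIZON = 5, 2, 20

def rollout(theta):
    """Run one episode of a softmax policy in a toy chain; reward for reaching the end."""
    state, total_reward = 0, 0.0
    for _ in range(HORIZON):
        logits = theta[state]
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        action = rng.choice(NUM_ACTIONS, p=probs)   # stochastic policy = built-in local deviation
        state = min(state + action, NUM_STATES - 1)
        total_reward += 1.0 if state == NUM_STATES - 1 else 0.0
    return total_reward

theta = np.zeros((NUM_STATES, NUM_ACTIONS))          # policy parameters (stand-in for a neural net)
for _ in range(200):
    # Perturb the parameters a little and keep the change if the perturbed policy did better:
    # credit assignment and generalization, but purely local search, with no exploration bonus.
    noise = 0.1 * rng.standard_normal(theta.shape)
    if rollout(theta + noise) >= rollout(theta):
        theta += noise
```

A real policy gradient method would estimate the gradient of expected return from those noisy rollouts rather than using this accept-or-reject rule, but the perturb-locally, improve-locally flavor is the same.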
But maybe one research question is, can we bring exploration into this style of algorithm? We certainly don't have robust algorithms for doing all three.
The blue sky optimism is that we can do all three. Yeah. And I think even in this cheap, fast and good, the definitions keep moving.
So certainly I think now, like if you think about with the advent of the internet, we
can access information very cheaply, very quickly, and it's very high quality.
Yeah.
And it's very different from maybe what it was 30 years ago or something.
Absolutely.
So the definitions keep moving and that's fine.
But I think we should actually be optimistic to push for these things that maybe seem almost
impossible. Well, let's talk about some specific algorithms since we're on that track right now.
The first one is one you call Policy Cover by Inductive Decoding, or PCID.
And the exploration around this work was published in 2019 in an ICML paper titled,
Ready?
Provably Efficient RL with Rich Observations via Latent State Decoding.
Tell us about PCID and how it moved the needle
on provably data efficient methods.
What were the key claims and contributions of this work?
This paper is part of a longer effort
on this sort of blue sky track of designing algorithms
with these three capabilities of credit assignment,
generalization and exploration. So in an ICML 2017 paper, we came up with a more general theory
for some sort of way to build algorithms with these three capabilities and some kind of
environment assumptions under which these things are possible. But the algorithms in that paper
are not particularly practical. So this ICML 2019 paper is more focused towards practicality, although I would still consider
this a relatively theoretical paper.
So what do we do?
So we have this new algorithm.
It's called PCID.
And we can use it to find near-optimal policies for a restricted class of environments that we call block MDPs.
And it does this in a provably efficient way, both in terms of the amount of experience it requires and the total computation.
And I think one thing I do want to highlight here is that these block MDPs, they're relatively structured environments,
but they do actually require the agent to demonstrate all three of those core
capabilities of credit assignment, generalization, and exploration. But the other thing I want to
highlight is the phrase provably efficient that I mentioned. I don't want to suggest that other
algorithms do not have these capabilities. What I'm just saying is we are able to prove that our
algorithm does. Perhaps the key claim is we're proving that this algorithm
has some data efficiency guarantee.
And that is something I kind of want to emphasize.
So let me tell you like the high level of the approach.
It's superficially similar to, I think,
what's now a relatively common strategy,
which is to create some auxiliary supervision
based on what the agent is doing
and use that to identify some kind of compact representation
of the world and then use the compact representation to drive an exploration module.
So the way we sort of realize this is our representation is like a discrete latent state.
And that's why this thing is called decoding.
We decode the latent state of the world, and then we run one
of these algorithms that can do exploration and credit assignment, but not generalization,
on the decoded latent state. So that's like the high-level idea. And to explain more,
I have to tell you about this block MDP model. The model is this kind of stylized setting in which
there exists some latent state space that's relatively
small. In the literature, they might call this a tabular state space. But these states are never
observed by the agent. Instead, the agent will be in a state, but it doesn't know that. And it will
see an observation, which is like sort of emitted from a distribution associated with this state.
And the observation might be extremely high dimensional and very, very rich.
But the observation might actually have enough information
for the agent to figure out what its latent state is.
But the agent sort of doesn't know how to do that compression.
So there's some mapping in the world that takes observations
and maps them to the latent state.
And if the agent only knew this mapping,
it could just run one of these tabular methods,
which do exploration and credit assignment,
but don't do generalization.
So this model kind of encourages
a little bit of a factored approach
where you do the generalization part
to map observations to state,
and then you do the other two things
using a technique we already know.
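Here is a small toy sketch of the block MDP setting as just described, with everything (sizes, dynamics, emission model) invented purely for illustration: a tiny latent state space the agent never observes, rich observations emitted from each latent state, and a hidden observation-to-state mapping that the algorithm would have to learn.

```python
# Toy block MDP sketch: a hidden tabular world emitting rich observations.
import numpy as np

rng = np.random.default_rng(0)
NUM_LATENT_STATES, NUM_ACTIONS, OBS_DIM = 3, 2, 128

# Latent (tabular) dynamics: transition[s, a] -> next latent state (unknown to the agent).
transition = rng.integers(NUM_LATENT_STATES, size=(NUM_LATENT_STATES, NUM_ACTIONS))

# Each latent state has its own emission pattern; observations are that pattern plus noise.
emission_means = rng.standard_normal((NUM_LATENT_STATES, OBS_DIM))

def emit(latent_state):
    """The agent only ever sees this rich observation, never latent_state itself."""
    return emission_means[latent_state] + 0.1 * rng.standard_normal(OBS_DIM)

def step(latent_state, action):
    next_state = transition[latent_state, action]
    return next_state, emit(next_state)

# The "decoder" the agent would like to learn: observation -> latent state.
# Here we can cheat because we built the world; the point of PCID and HOMER is to
# recover something like this mapping from data, without ever seeing the labels.
def oracle_decoder(observation):
    return int(np.argmin(np.linalg.norm(emission_means - observation, axis=1)))
```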
The key challenge here is that you never observe the latent state. So it's not that clear how you are going to learn this decoder. And this is sort of where the auxiliary supervision comes
in. So we use supervised learning to train a predictor to predict some other stuff. And somehow
we show that when you predict this other stuff
you recover the latent state. The algorithm proceeds in this like forward manner: so you take like a hundred steps and then you reset, and then you take a hundred steps and you reset, and each one of these is an episode. And so we can assume we have the decoder at stage h, some time step, and we're going to use the decoder at the previous time to learn the decoder at the current time.
And the way you do it is you take the observation at the current time
and you predict the previous action and the previous latent state,
or decoded state.
Right.
You can imagine it kind of recovers the dynamics of the world
because the reason you're seeing this observation now is because you were in that state yesterday and you took that action
yesterday.
So you recover like the backward dynamics of the process.
And because the dynamics are governed by the latent state only, you kind of learn this
compression.
And then once you have the latent state at stage H, we learn how to visit every state that you could possibly
reach at stage H. And then that gives you a good data coverage. So that's like the exploration
component. You get good data coverage at stage H and you use that to drive the process of
learning the decoder at the next stage. So the PCID supervision is predict the previous
learned state and action from the current observation.
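As a rough sketch of what that inductive, supervised step might look like in code, under the same toy conventions as above: a generic off-the-shelf classifier stands in for the paper's function class, and synthetic arrays stand in for real rollouts, so this illustrates the shape of the supervision rather than PCID itself.

```python
# Hedged sketch of PCID-style auxiliary supervision: from the observation at time h,
# predict the (decoded state, action) pair from time h-1.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def train_decoder_for_step_h(observations_h, decoded_states_prev, actions_prev, num_actions):
    """Learn a decoder for time h by predicting the previous (decoded state, action)."""
    # Collapse the (state, action) pair into a single class label for the classifier.
    labels = decoded_states_prev * num_actions + actions_prev
    clf = LogisticRegression(max_iter=1000).fit(observations_h, labels)
    # In the paper, the trained predictor is post-processed into a small discrete
    # decoded state space that drives exploration; that machinery is omitted here.
    return clf

# Hypothetical data collected at step h by exploring with the step h-1 policy cover:
obs_h = rng.standard_normal((500, 16))           # rich observations at time h
prev_states = rng.integers(3, size=500)          # decoded latent states at time h-1
prev_actions = rng.integers(2, size=500)         # actions taken at time h-1
decoder_h = train_decoder_for_step_h(obs_h, prev_states, prev_actions, num_actions=2)
```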
Okay, so what do we prove?
We prove that if you are in a block MDP with not too many latent states,
not too many actions per stage, and the time horizon is not too long,
and then there's some technical assumptions,
we prove that this algorithm will find some way to visit every state of the world with
good probability without too many samples.
So the number of samples will be polynomial in the number of latent states, the time horizon,
the number of actions, all these sort of relevant quantities.
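Schematically, and only as a paraphrase rather than the paper's actual theorem, the flavor of guarantee being described is a bound on the number of episodes of the following form, where S is the set of latent states, A the set of actions, H the horizon, delta the failure probability, and eta an assumed reachability parameter that is not named in the conversation:

```latex
% Schematic only -- not the paper's exact statement or constants.
\text{number of episodes} \;=\; \mathrm{poly}\!\left(|S|,\ |A|,\ H,\ \tfrac{1}{\eta},\ \log\tfrac{1}{\delta}\right)
```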
And then we learn how to visit every state of the world, which is solving the exploration
problem.
Right.
And the generalization component is baked in
because this is a block MDP
and you have to do a decoding.
On the computational side,
we operate in this sort of,
it's called the Oracle model of computation.
So we assume that you have a class of functions,
maybe it's like a neural network or something,
and that you can solve supervised learning problems
over this neural network. And that's where this auxiliary supervision comes in.
We train a neural network or whatever to map the raw observation to the previous state and action.
And we assume that you can actually solve those classification problems. And so that's the
computation. Let me just mention the last piece, which is the first one where we actually ran some
experiments. Because this algorithm is not super practical, but not super impractical either.
It's somewhere in the middle.
It's the Goldilocks one.
Yeah.
It's one that we felt like we could implement.
So the experiments, I think, are more of a proof of concept that it is actually doing something sensible than anything else.
And what we show is that we sort of concoct this kind of
very, very difficult environment for exploration.
And we run these algorithms
that don't do any exploration
and they fail, obviously.
And then we run an algorithm
that we give it cheating access
to the latent state
and we run an exploratory algorithm.
So this is like an algorithm
that does exploration and credit assignment but not generalization, and it does extremely well. And then we run our algorithm, which does like not much worse than that thing. Okay, so it's like: the first one is like obviously not going to work, the second one is obviously going to work, and the question is how close are we to like the skyline that we could hope to achieve? And so we are getting there, but we're solving that harder
problem where you have to do the decoding. Okay. And that was 2019. We're all the way
into 2020. Since then, you haven't sat around on your algorithms. You have a new paper in the
pipeline with another algorithm that you think is actually pretty cool. The paper is called Kinematic State Abstraction and Provably (there's that word again) Efficient Rich-Observation Reinforcement Learning. That's a mouthful and also a mathful. In this paper, you present another
algorithm called HOMER. Tell us about HOMER and how it moves us closer to solving for X
if X equals robust real-world reinforcement learning.
And again, feel free to show your work.
Yeah, so HOMER is, I would think of it as an improvement on PCID,
but the high-level claims are kind of similar.
So we're still working in the block MDP.
On a technical level, we're actually able to remove some of these technical assumptions
that I didn't really explain to you in the previous story. But those are actually required by the PCID theory, and we can actually show that PCID
empirically fails when some of those things are not satisfied. So that's actually kind of important.
But I think there are two perhaps more important points about Homer. So Homer is much, much more
robust in practice. And part of the reason is that
in PCID we're trying to use the current observation to predict the previous state and the previous
action. Now we don't have access to the previous state, so we predict the previous predicted state, and this can cause some catastrophic failure when your previous predicted state
is wrong. So you're trying to predict something that you're predicting yourself,
which causes some sort of bubbling of errors that doesn't work very well.
And so in Homer, we use a different auxiliary supervision problem
that does not rely on the previous one.
We still do this kind of inductive thing,
but we train this decoder for time h, and then we throw it away.
And then we train a completely new decoder for time h plus one. And so this makes the algorithm
somehow much, much more robust and much less prone to these kind of propagating errors,
which is a very, very common problem in reinforcement learning. The other thing that
we're doing in Homer, which is, I think, empirically very important, is in PCID, we are using the latent state space that we build.
We build like a dynamics model over the latent state space, and we do what they call model-based
planning. And this is also prone to failure when your model is wrong. Okay. So you're trying to
plan in your model to reach some state, but your model is wrong, and so you just completely fail.
And in Homer, we replaced this planning module with a policy search module that operates,
again, on the raw observations. So again, here, we're not using the decoders. We were just saying,
we've collected a good data set that covers the whole world, and we can use the data to train a machine learning model to do the things we want to do. So those are the two reasons that Homer is like an empirical and in some ways a theoretical improvement. But the other more
important point is that in the Homer project, we've identified a new structural notion that we
think might be useful in the design of other algorithms. And this is also this first part of
the title, which is this kinematic state abstraction. It's kind of some property of an environment where you might say
two states of the world are very similar if the dynamics to and from those states are the same.
You should be able to collapse these two things together because if one policy gets to one,
it also gets to the other. So for the purpose of exploration, these two states are very coupled.
And for the purpose of forward prediction, they're also very coupled because they have the same dynamics going forward. And so this kinematic state abstraction is a way to formalize this.
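One schematic way to write down the equivalence being described, as a paraphrase rather than the paper's formal definition, is to collapse two states s1 and s2 whenever the dynamics into and out of them agree for every action a and every other state s', writing T for the forward transition kernel and B for the backward dynamics (both of which are assumed notation here):

```latex
% Paraphrase of the kinematic abstraction idea; T and B are assumed notation.
s_1 \sim s_2 \quad\Longleftrightarrow\quad \forall a, s':\;
T(s' \mid s_1, a) = T(s' \mid s_2, a)
\quad\text{and}\quad
B(s', a \mid s_1) = B(s', a \mid s_2)
```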
And it's kind of easy to show that in block MDPs, all the observations coming from a latent state
are tied together. They induce the same forward and backward dynamics. This idea also inspires
this new supervised learning auxiliary supervision problem that we use here.
So this prediction problem is what they call contrastive learning. So what you're going to
try and do is you're going to splice data sequences together in some real and fake way,
and you're going to try to predict which ones are real and which ones are fake.
So I'm the agent. I see some
observation. I take an action. I see some next observation. That's a real transition. So I have
an observation, action, observation triple. Now I'm at some new place in the world, a new observation. I take a different action. I see another observation. Let me swap the last two observations between these two triples. Those are fake transitions, right? And now, if I can just somehow train a model to predict what's real and fake from this crazy data set that I just made up, right? But I know the label, because I'm doing the splicing. So I train a supervised learning model, just a classifier, to predict what's real and what's fake. If you think about it, you can kind of just pick out the transition operator, because you know what a real transition is now. And that means you know what the distribution of the next state is going to be given your current state and action.
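To make the splicing construction concrete, here is a hedged sketch in Python: real (observation, action, next observation) triples get label one, triples whose next observations have been swapped across examples get label zero, and an off-the-shelf classifier is trained to tell them apart. The data is synthetic and the classifier is a stand-in; this shows the flavor of the auxiliary problem, not HOMER's implementation.

```python
# Sketch of the contrastive "real vs. spliced" transition classification problem.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

def make_contrastive_dataset(obs, actions, next_obs):
    """Label real transitions 1; splice next-observations across examples for label 0."""
    shuffled_next = next_obs[rng.permutation(len(next_obs))]   # the "swap" step
    real = np.hstack([obs, actions[:, None], next_obs])
    fake = np.hstack([obs, actions[:, None], shuffled_next])
    features = np.vstack([real, fake])
    labels = np.concatenate([np.ones(len(real)), np.zeros(len(fake))])
    return features, labels

# Hypothetical batch of logged transitions (shapes are arbitrary):
obs = rng.standard_normal((1000, 8))
actions = rng.integers(2, size=1000).astype(float)
next_obs = rng.standard_normal((1000, 8))

X, y = make_contrastive_dataset(obs, actions, next_obs)
# We know the labels because we did the splicing ourselves, so this is ordinary
# supervised learning; the paper's claim that solving it recovers the latent state
# is not something this toy example attempts to demonstrate.
classifier = LogisticRegression(max_iter=1000).fit(X, y)
```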
And again, we can show that somehow when you solve this problem, you kind of are decoding the latent state. And then a lot of the stuff that we do with PCID goes through again. On the empirical side, it's still a relatively toy experiment,
but we're more
serious about comparing with the algorithms that people are running in practice. There was one
algorithm that was pretty good. So this is one of these policy search methods with an exploration module attached, based on this idea called random network distillation. So the environment is this chain of like 100 actions, and this thing got to like level 25 or 30 or something.
But Homer just crushed the whole thing
because it's kind of designed to do so.
I think it's in some ways demonstrating
what is a very difficult exploration problem.
The algorithms that people typically use
do not solve those hard exploration problems.
And Homer is sort of constructed to do so and does it.
Let's talk briefly about the section in every academic paper that addresses research limitations and what scientists are still working
on. In general, what big challenges remain in this process or pipeline of algorithmic
perfection, and how are you moving into unlit spaces to resolve them?
Let's think about this on the theory side and the empirical side. So on the theory side,
I think it's becoming important to identify more
expressive models that emphasize the need for these core capabilities of generalization,
exploration, and credit assignment. To date, we've been working on the block MDP model, which certainly does these things, but it's quite limited in many ways. So a lot of problems tend
to have like a continuous but low dimensional latent state, and this is not captured by the
block MDP. But we're also thinking about like somehow substantially more expressive models. So we have some new proposal for a model that's like exponentially more expressive than the block MDP, okay, and whether we can build algorithms for those things. So like HOMER and PCID really are leveraging this discrete latent state space in some very, very strong way.
And because they're so attached to the model, they don't really work in the real world.
Like your real world does not have that block MDP structure.
And then what happens, right?
But the other frustration here is that we kind of know that some modeling assumptions
are required for provable sample and computational efficiency.
And so the question is, what are the assumptions that are relevant to applications? And perhaps
we should just keep expanding until we cover most things, but it might not actually work in that way.
Like maybe we'll have to sort of fragment and say, okay, for this type of application, we're going to
focus on this set of assumptions, which are relevant, and come up with a specialized set of algorithms,
and so on.
So that's kind of the theory side.
I think we are sort of pushing on
more expressive models and algorithms
that address more general settings.
On the empirical side,
Homer and PCID are very much
algorithms designed by theoreticians.
And there's a question of whether
we can make them more empirically effective.
I think there are problems that do have
something close to block MDP structure.
And if we can make these algorithms somehow
more natural, more robust, more practically viable,
maybe we can start pushing towards those kinds of problems.
And so we have an ongoing effort around this
where we're trying to identify some problems
that might have the kind of structures
that Homer is sort of exploiting,
and also trying to make Homer somehow a little bit more data efficient and these kind of things.
It's about now that I always ask what keeps you up at night, both figuratively and literally.
Reinforcement learning is a bit of a rock star when it's in a controlled or simulated environment,
but once it hits the open world, the state space gets real, as we might say. So what could possibly go wrong, Akshay? And how
are you working from the beginning to ensure that it doesn't? Well, to be honest, what usually
keeps me up at night is some bug in a proof or something in my code that's not working. And this has been especially
true in the recent months because I've been like working very, very hard on this fairly delicate
proof and keep getting stuck like every day. So mostly I can't sleep because I like cannot solve
my technical problem. On the higher level, in terms of I think what could go wrong,
I'm definitely aware that when you pursue these sort of blue sky efforts, they're necessarily pretty high risk. And it could be the case that
when you take this theory first approach, or even when you are thinking about reinforcement
learning in the way that we're thinking, it just might not pan out. So it just could be the case
that we push on this for, you know, five years, 10 years, and then we just hit a dead end. And
that would kind of be depressing. But I think it's part of what happens when you
do this kind of high risk stuff.
Yeah.
But I'm also a bit reassured that, you know, RL is a very, very popular topic.
We have a large community of incredibly brilliant and amazing people that are working on these
problems and everyone is tackling them with a different approach and different perspective.
So even if what we're trying to do doesn't really pan out or doesn't do the things that we want,
I think we will find something that makes sense as a community.
And from like a scientific perspective, that sort of, I think, keeps me sane.
And I always have conversations with people about this kind of stuff.
Like when I give a talk about our work, they're like, well, do you really think this is going to pan out?
And I say, I don't know, but like someone should try it, right? So I think there's a real risk that
maybe these approaches don't make sense. And obviously we're thinking about when they do
make sense and trying to make them make sense, but maybe they will not work, right? And that
kind of keeps me up at night a little bit, but usually it's the more technical stuff.
The other concern, I think, which is sort of what you're talking about, is when we deploy these algorithms in the real world,
what could go wrong? I certainly think a lot of things could go wrong. And there's a lot of things
to be concerned about. Like, you know, there's biases, there's safety issues, there's fairness
issues, and there's like become a very vibrant community of people working on those kinds of
problems. I think we're so far from applications.
Like I don't encourage anyone to deploy Homer anywhere,
but I think I would say that, you know,
MSR has brilliant people working on these kinds of issues. And I think if, and when we do get closer
to deploying something,
I am super fortunate to feel like I can go talk to them
and like loop them in and see like,
what should we be doing? And I don't claim to be an expert on those things at all. So I would just
want to take advice from other people on these kinds of things. Well, it's story time here on
the Microsoft Research Podcast. So let's hear yours. What got you started in high tech? What
was your academic path toward research and how did you land in New York City, or at least post-COVID, the home office in Brooklyn,
working on reinforcement learning for Microsoft Research?
Yeah, so I grew up in a pretty technical family.
I was born and raised in Silicon Valley.
Both of my parents were engineers.
My brother is a computer scientist, so I think it was kind of in the cards.
That said, I didn't think I wanted to be an academic probably through most of grad school.
So when I went to college, I kind of had more of an entrepreneurial bent.
I think this is super natural for people in Silicon Valley.
They all kind of want to do that.
You hear about all these tech stars and so on.
So I didn't really think I wanted to be an academic.
I was, you know, hacking on random startup-y project ideas in my dorm room and stuff.
I did some industry internships at tech startups and these kind of things.
And I actually didn't really like it.
So kind of what happened is I did those things and I didn't enjoy it.
And then I did a research project in undergrad.
And then I also had this amazing opportunity to spend a summer in Tel Aviv doing a research project with a
professor there. And that was like super fun and really kind of pushed me towards going to grad
school. And then my last summer of grad school, I was like super fortunate that Alekh and Miro
like invited me to do an internship. And they were just like, hey, do you want to come spend
the summer in New York doing internship? And I was like, yeah, that sounds amazing. And so I did that.
I really fell in love with the lab.
Since then, I think John has really been an advocate for me.
Langford?
John Langford, yeah.
So he encouraged me to apply for a postdoc.
He was on my thesis committee.
He wrote a letter for me.
So I came back for a postdoc.
And then I spent two years as an assistant professor at UMass.
And then he sort of encouraged me to come back as a researcher.
I had to do this advertisement at a conference about MSR.
And I think the thing that is real is like, I've been there for three different stages
of my life and I keep coming back.
So it is like really a glorious place to be, to do research.
And I love it a lot.
So I think having these people like Alekh Agarwal, Miro Dudík,
John Langford, they have been super inspirational. They're really like role models for me and
have really like given me an opportunity to, you know, be here and succeed and so on.
What's one interesting thing we don't know about you? I leave it open-ended,
so it can really be anything. Maybe it's a personal characteristic
or life event that impacted your career or life. Maybe it's a side quest or a hobby that defines
you outside your scientific work, or maybe it's just something fun or funny that people might not
suspect about you. Yes, I was actually discussing this question last night with my partner,
and the conclusion is that I'm pretty cliche for like a tech bro.
So I do all the things that like, you know, the San Francisco startup people do.
I like to rock climb.
I play Ultimate Frisbee.
I read a ton of sci-fi.
I am rereading The Three-Body Problem right now, which is just an excellent book.
I cannot recommend it enough.
I'm also kind of a foodie.
So in the pre-COVID times, Sid Sen and I would
frequently sneak out of the office to grab fancy sweets. And being in New York, there's just tons
of amazing restaurants and dessert places to go to. So when I go to conferences, I almost always
bring my climbing equipment. And there's starting to be a group of people that are in this crew of machine
learning climbers. And we typically go check out the local bouldering gyms or maybe even go to like
outdoor spot for climbing. And there's also obviously a bunch of foodies in the ML community.
So we have a small group of people who go check out like, you know, Michelin star restaurant and
stuff. And so the academic lifestyle is kind of great for those kinds of things. What's the biggest thing you've
climbed? So I mostly boulder. So the biggest thing I climbed is not very tall. And mostly I actually
do like indoor bouldering. So, you know, these climbing gyms are cropping up everywhere.
At NeurIPS last year, me and a friend went to,
so this was in Vancouver.
There's this like,
you know,
world-class bouldering area
just north in Squamish.
Yes.
And so we went bouldering
in Squamish.
It was raining a bit,
so the rocks were kind of wet,
which means we couldn't do anything.
I think I'm also not that good,
so the outdoors is really just when you kind
of get humiliated. Nature slams you. And I think the other thing is like the way they do this
grading, the grading outdoors is not commensurate with the grading indoors. So you kind of get
humbled. We need an algorithm for that. Yeah. Well, as we close, and this has been super fun,
I'd like you to dream out loud about the future for a minute and give our listeners a vision for next-gen reinforcement learning.
I sometimes frame this now as if you're wildly successful, what will you be known for at the end of your career?
And how will the research you and your colleagues have done make an impact on the world, both for the scientific community and people at large?
So I really hope that we can be involved in the development of data-efficient, empirically
effective reinforcement learning algorithms. I think it's a little bit hard to predict
where they will be effective. So this is this lamppost sort of phenomenon that we were talking
about the other day. I don't know what all the applications might be for such algorithms if we have them.
And this is part of this constrained thinking where like, you know, we think about applications
that our current technologies can solve.
And we don't think about applications that maybe new technologies might enable.
Looking under the lamppost is this story, or it's like a metaphor.
So there's a person and they lose their keys.
It's the nighttime, they're looking for their keys
under the lamppost because that's the only place
where they can see.
Even though they are 100% sure that is not
where they lost their keys.
It's just where there's light.
So this is this thing of, you know,
we work on the things that we know we can do.
And the applications that we think about are the things
that are like just beyond our
grasp. But maybe if we have some radically new technology, the set of applications will just
really open up and we cannot really predict what those things might be. I think it would be really
awesome to be a part of that effort and see what kind of things open up. But it would be really amazing to see
that these algorithms do have like a positive impact on the world. Akshay Krishnamurthy,
this has been more fun than a podcast about algorithms should be.
Thank you so much for coming on. Thank you very much.
To learn more about Dr. Akshay Krishnamurthy
and the very latest in reinforcement learning,
visit microsoft.com slash research.