Microsoft Research Podcast - 075r - Reinforcement learning for the real world with Dr. John Langford and Rafah Hosn
Episode Date: January 1, 2020. This episode originally aired in May 2019. Dr. John Langford, a partner researcher in the Machine Learning group at Microsoft Research New York City, is a reinforcement learning expert who is working, in his own words, to solve machine learning. Rafah Hosn, also of MSR New York, is a principal program manager who’s working to take that work to the world. If that sounds like big thinking in the Big Apple, well, New York City has always been a “go big, or go home” kind of town, and MSR NYC is a “go big, or go home” kind of lab. Today, Dr. Langford explains why online reinforcement learning is critical to solving machine learning and how moving from the current foundation of a Markov decision process toward a contextual bandit future might be part of the solution. Rafah Hosn talks about why it’s important, from a business perspective, to move RL agents out of simulated environments and into the open world, and gives us an under-the-hood look at the product side of MSR’s “research, incubate, transfer” process, focusing on real-world reinforcement learning which, at Microsoft, is now called Azure Cognitive Services Personalizer.
Transcript
When John Langford and Rafah Hosn were on the podcast last May,
they gave us two perspectives on bringing reinforcement learning out of the lab and into the world,
highlighting the special relationship between science and business at MSR.
Whether you heard about their work in online reinforcement learning last spring,
or you're ringing in the new year with John and Rafa,
I know you'll enjoy Episode 75 of the Microsoft Research Podcast,
Reinforcement Learning for the Real World.
Welcome to another two-chair, two-mic episode of the Microsoft Research Podcast.
Today, we bring you the perspectives of two guests on the topic of reinforcement learning for online applications.
Since most research wants to be a product when it grows up,
we've brought in a brilliant researcher-program-manager duo to illuminate the classic research-incubate-transfer process in the context of real-world reinforcement learning.
You're listening to the Microsoft Research Podcast, a show that brings you closer to the cutting edge of technology research and the scientists behind it.
I'm your host, Gretchen Huizinga.
Dr. John Langford, a partner researcher in the machine learning group at Microsoft Research
New York City, is a reinforcement learning expert who is working, in his own words,
to solve machine learning. Rafah Hosn, also of MSR New York, is a principal program manager
who's working to take that work to the world. If that sounds like big thinking
in the Big Apple, well, New York City has always been a go-big-or-go-home kind of
town, and MSR NYC is a go-big-or-go-home kind of lab. Today, Dr. Langford explains
why online reinforcement learning is critical to solving machine learning and how moving from the current foundation of a Markov decision process toward a contextual
bandit future might be part of the solution. Rafah Hosn talks about why it's important from
a business perspective to move RL agents out of simulated environments and into the open world
and gives us an under-the-hood look at the product side of MSR's research incubate transfer process, focusing on real-world reinforcement learning, which at Microsoft is now called
Azure Cognitive Services Personalizer.
That and much more on this episode of the Microsoft Research Podcast. I've got two guests in the booth today, both working on some big research problems
in the Big Apple.
John Langford is a partner researcher in the Machine Learning Group at MSR NYC.
And Rafah Hosn, also at the New York Lab, is the principal program manager for the Personalization
Service,
also known as Real World Reinforcement Learning.
John and Rafa, welcome to the podcast.
Thank you.
Microsoft Research's New York Lab is relatively small in the constellation of MSR Labs,
but there's some really important work going on there.
So to get us started, tell us what each of you does for a living and how you work together. What gets you up in the morning? Rafa, why don't you start?
Okay, I'll start. So I wake up every day and think about all the great things that the
reinforcement learning researchers are doing. And first, I map what they're working on
to something that could be useful to customers. And then I think to myself,
how can we now take this great research, which typically comes in the form of a paper,
to a prototype, to an incubation, to something that Microsoft can make money out of.
That's a big thread, starting with a little seed and ending up with the big plant at the end.
Yes, we have to think big.
That's right. How about you, John?
I want to solve machine learning. And it's ambitious. But one of the things that you
really need to do if you want to solve machine learning is you need to solve reinforcement
learning, which is kind of a common basis for learning algorithms to learn from the
interaction with the real world. And so figuring out new ways to do this
or trying to expand the scope
of where we can actually apply these techniques
is what really drives me.
Can you go a little deeper into solve machine learning?
What would solving machine learning look like?
It would look like anything that you can pose
as a machine learning problem you can solve, right?
So I became interested in machine learning
back when I was an undergrad, actually. I went to a machine learning class and
I was like, ah, this is what I want to do for my life. And I've been pursuing it ever since.
And here you are.
So we're going to spend the bulk of our time today talking about the specific work you're
doing in reinforcement learning. But John, before we get into it, give us a little context as a
level set. From your perspective, what's unique about reinforcement learning within the machine learning universe?
And why is it an important part of MSR's research portfolio?
So most of the machine learning that's actually deployed is of the supervised learning variety.
And supervised learning is fundamentally about taking expertise from people and making that into some sort of learned function that you can then use to do some task.
Reinforcement learning is different because it's about taking information from the world
and learning a policy for interacting with the world so that you perform better in one way or another.
So that different source of information can be incredibly powerful because you can imagine a future where every time you type on the keyboard, the keyboard learns to understand you better.
Or every time you interact with some website, it understands better what your preferences are.
So the world just starts working better and better at interacting with people. And so reinforcement learning as a method within the machine learning world is different
from other methods because you deploy it in less known circumstances, or how would you
define that?
So it's different in many ways, but the key difference is the information source.
The consequence of that is that reinforcement learning can be surprising.
It can actually surprise you.
It can find solutions you might not have thought of
to problems that you pose to it.
That's one of the key things.
Another thing is it requires substantially more skill to apply
than supervised learning.
Supervised learning is pretty straightforward
as far as the statistics go,
while reinforcement learning, there's some real traps out there,
and you want to think carefully
about what you're doing. Let me go into a little more detail there. Let's suppose you need to make
a sequence of 10 steps and you want to maximize the reward that you get in those 10 steps, right?
So it might be the case that going left gives you a small reward immediately and then you get no
more rewards. While if you go right, you get no reward, and then you go left and then right and then
right and then left and then right, and so on, 10 times. Do it just the right way and you get a big
reward. So many reinforcement learning algorithms just learn to go left, because that gave the small
reward immediately. And that gap is not like a little gap. It's like you may require exponentially
many more samples to learn unless you actually gather the information in an intelligent,
conscious way. I'm grinning and no one can see it because I'm thinking that's how people operate
generally, you know? Actually, yeah. I mean, the way I explain reinforcement learning is the way you teach a puppy how to do a trick.
And the puppy may surprise you and do something else.
But the reward that John speaks of is the treat that you give the puppy when the puppy does what you are trying to teach it to do.
And sometimes they just surprise you and do something different. And actually, reinforcement learning has a very great affinity to Pavlovian psychology.
Well, back to your example, John, you're saying if you turn left, you get the reward immediately.
You get a small reward.
A small reward.
So the agent would have to go through many, many steps of this to figure out, don't go left because you'll get more later.
You get more later if you
go right and you take the right actions after you go right. Now imagine explaining this to a customer.
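To make the trap concrete for readers who like code, here is a small toy sketch of the 10-step problem John describes (my own illustration, not code from the episode; the reward values and the "correct" action sequence are arbitrary). A myopic learner locks onto the small immediate reward for going left, while undirected exploration needs on the order of 2^10 episodes just to observe the delayed big reward once.

```python
import random

HORIZON = 10                                   # number of decisions per episode
GOOD_SEQ = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]      # the one "right" 10-action sequence (arbitrary)

def episode(policy):
    """Run one episode under `policy` (0 = left, 1 = right) and return its reward."""
    for step in range(HORIZON):
        action = policy(step)
        if step == 0 and action == 0:          # going left first: small reward, then nothing more
            return 0.1
        if action != GOOD_SEQ[step]:           # any other wrong turn: no reward at all
            return 0.0
    return 1.0                                 # the full correct sequence: big delayed reward

# A myopic learner that has already seen the 0.1 reward just keeps going left.
myopic_avg = sum(episode(lambda step: 0) for _ in range(1000)) / 1000

# Undirected exploration: how many episodes until purely random play finds the big reward?
random.seed(0)
episodes_needed = 1
while episode(lambda step: random.randint(0, 1)) != 1.0:
    episodes_needed += 1

print(f"myopic average reward per episode: {myopic_avg:.2f}")                    # ~0.10
print(f"random play first hit the big reward after {episodes_needed} episodes")  # ~2**10 in expectation
```

The exact numbers don't matter; the point is the exponential gap between exploiting the first reward you stumble on and deliberately gathering the information needed to find the one that actually matters.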
And we will get there and I'll have you explain it. Rafa, let's talk for a second about the
personalization service, which is an instantiation of what you call real world reinforcement learning,
yeah?
That's right.
So you characterize it as a general framework for reinforcement learning algorithms that are
suitable for real-world applications. Unpack that a bit. Give us a short primer on real-world
reinforcement learning and why it's an important direction for reinforcement learning in general.
Yeah, I'll give you my version, and I'm sure John will chime
in. But, you know, much of the reinforcement learning that people hear about is almost
always done in a simulated environment where you can be creative as to what you simulate,
and you can generate, you know, gazillions of samples to make your agents work. Our type of
reinforcement learning, John's type of reinforcement learning, is something that
we deploy online. And what drives us, John and I, is to create or use this methodology to solve
real-world problems. And our goal is really to advance the science in order to help enterprises
maximize their business objective through the usage of real-world reinforcement
learning. So when I say real-world, these are models that we deploy in production with real
users getting real feedback and they learn on the job. Well, John, talk a little bit about
what Rafa has alluded to. There's an online real-world element to it, but prior to
this, reinforcement learning has had some big investments in the gaming space. Tell us the
difference and what happens when you move from a very closed environment to a very open environment
from a technical perspective. Yeah, so I guess the first thing to understand is why you'd want to do this,
because if reinforcement learning in simulators works great, then why do you need to do something
else? And I guess the answer is there are many things you just can't simulate. So an example
that I often give in talks is, would I be interested in a news article about Ukraine?
The answer is yes, because my wife is from Ukraine. But you would never know this.
Your simulator would never know this. There'd be no way for the policy to actually learn that
if you're learning in a simulator. So there are many problems where there are no good simulators.
And in those situations, you don't have a choice. So given that you don't have a choice,
you need to embrace the difficulties of the problem. So what are the difficulties of the
real-world reinforcement learning problems? Well, you don't have zillions of examples,
which are typically required for many of the existing deep reinforcement learning algorithms.
You need to be careful about how you use your samples. You need to use them with the
utmost efficiency in trying to do the learning.
Another element that happens is often when people
have simulators, those simulators are kind of
effectively stationary.
They stay the same throughout the process of training.
But in many of the real-world problems we encounter,
we run into all kinds of non-stationarities;
there are exogenous events.
The algorithms need to be very robust.
So the combination of using samples very efficiently and great robustness in these algorithms are kind of key offsetting elements from what you might see in other places.
Which is challenging. What about AlphaGo or Ms. Pac-Man or the other games that have been sort of flags waved about our progress in reinforcement learning?
I think those are fun applications. I really enjoy reading about them and learning about them.
I think it's a great demonstration of where the field has gotten, but I feel like this is the
issue of AI winter, right? So there was once a time when AI crashed, and that may happen again
because AI is now a buzzword. But I think it's important that we actually do things that have some real value in the world, which actually affect people's lives, because that's what creates a lasting wave of innovation and puts civilization into a new place.
Right.
So that's what I'm really seeking.
What season are we in now? I've heard there has been more than one AI winter and some people are saying it's AI spring.
I don't know.
Where do you see us in terms of that progress?
I think it's fair to say that there's a lot of froth in terms of people claiming things that are not going to come to pass. At the same time,
there is real value being created. Suddenly we can do things and things work better through some of
these techniques. And so there's kind of this mishmash of over-promised things that are going
to fail and there are things that are not over-promised and they will succeed. And so if
there's enough of those that succeed, then maybe you don't have a winter. Maybe it just becomes a long summer.
Like San Diego all the time.
Yeah, but I think to comment on John's point here, I think reinforcement learning is a nascent
technique compared to supervised learning. And what's important is to do the crawl, walk, run, right? So yeah, it's sexy
now and people are talking about it, but we need to rein it in from a business perspective as to,
you know, what are the classes of problems that we can satisfy the business leader with and satisfy
them effectively, right? And I think from a reinforcement learning perspective, John, correct me,
we are very much at the crawl phase in solving generic business problems.
I mean, we have solved some generic business problems, but we don't have widely deployed or deployable platforms for reusing those solutions over and over again.
And it's so easy to imagine many more applications than people have even tried.
So we're nowhere near a mature phase in terms of even simple kinds of reinforcement learning.
We are ramping up in our ability to solve real-world reinforcement learning problems.
And there's a huge ramp still to happen.
Heading towards your goal of solving machine learning.
Yes. But I mean, to be fair, though, we can actually satisfy some classes of
problems really well with nascent technology. So yes, we are nascent and the world is out there
for us to conquer. But I think we do have techniques that can solve a whole swath of
problems. And it's up to us to harvest that.
Well, let's continue the thread a little bit on the research areas of reinforcement learning.
And there's several that seem to be gaining traction.
Let's go sort of high level and talk about this one area that you're saying is basically creating a new foundation for reinforcement learning. What's wrong with the current foundation? What do we need the new
foundation for and what are you doing? The current foundation of reinforcement learning is called a
Markov decision process. The idea in a Markov decision process is that you have states and
actions and given a state, you take an action, then you have some distribution over the next
state. So that's kind of what the foundation is. It's how everybody describes your solutions.
And the core issue with this is that there are no good solutions when you have a large number
of states. All solutions kind of scale with the number of states. And so if you
have a small number of possible observations about the world, then you can employ these
theoretically motivated reinforcement learning algorithms, which are provably efficient,
and they will work well. But in the real world, you have a megapixel camera, which has two to the
one million or 16 to the one million possible inputs.
And so you never encounter the same thing twice.
And so you just can't even apply these algorithms.
It doesn't even make sense.
It's ridiculous.
So when I was a young graduate student, I was, of course, learning about Markov decision
processes and trying to figure out how to solve reinforcement learning better with them.
And then at some point after we had a breakthrough,
I realized that the breakthrough was meaningless
because it was all about these Markov decision processes.
And no matter what, it just never was going to get to the point
where you could actually do something useful.
So around 2007, I decided to start working on contextual bandits.
This is an expansion of what reinforcement
learning means in one sense, but a restriction in another sense. So instead of caring about
the reward of a long sequence of actions, we're going to care about the reward of the next action.
So that's a big simplification. On the other hand, instead of caring about the state,
we're going to care about an observation, and we're going to demand that our algorithms don't depend upon the number of possible observations,
just like they do in supervised learning.
So we studied this for several years.
We discovered how to create statistically efficient algorithms for these kinds of problems.
So that's kind of the foundation of the systems that we've been working on.
And then more recently, after cracking these contextual bandit problems, we wanted to address
a larger piece of reinforcement learning.
So now we're thinking about contextual decision processes, where you have a sequence of rounds,
and on each round you see some observation, you choose some action, and then you do that
again and again and
again. And then at the end of an episode, maybe 10 steps, maybe 100, you get a reward, right? So
there's some long delayed reward dependent upon all the actions you've taken and all the observations
you've made. And now it turns out that when these observations are generated by some small underlying
state space, which you do not know in advance and which is never told to you,
you can still learn.
You can still do reinforcement learning.
You can efficiently discover what a good policy is globally.
So the new foundation of reinforcement learning is about
creating a foundation for reinforcement learning algorithms
that can cope with a megapixel camera as an observation
rather than having like 10 discrete or 100 discrete states.
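For readers who want to see the shape of this, here is a minimal illustrative sketch of the contextual bandit loop John describes (my own toy code, not the team's system; the feature sizes, reward function, exploration rate, and learning rate are all made up). The learner keeps one small linear model per action rather than anything indexed by states, explores with epsilon-greedy, observes a reward only for the chosen action, and records the probability with which it acted alongside the context, action, and reward.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS, N_FEATURES, EPSILON = 3, 5, 0.2

# One linear reward model per action; the weights are all the learner keeps,
# so nothing here scales with the number of possible observations.
weights = np.zeros((N_ACTIONS, N_FEATURES))
logged = []  # each entry is the quad: (context, action, reward, probability)

def choose(context):
    """Epsilon-greedy over predicted rewards; return the action and its probability."""
    greedy = int(np.argmax(weights @ context))
    probs = np.full(N_ACTIONS, EPSILON / N_ACTIONS)
    probs[greedy] += 1.0 - EPSILON
    action = int(rng.choice(N_ACTIONS, p=probs))
    return action, probs[action]

def true_reward(context, action):
    # Stand-in for the real world (a click, a watch, ...); hidden from the learner.
    hidden = np.array([[1, 0, 0, 1, 0], [0, 1, 0, 0, 1], [0, 0, 1, 1, 1]], dtype=float)
    return float(hidden[action] @ context > 1.0)

for t in range(5000):
    context = rng.random(N_FEATURES)           # the observation (e.g., user features)
    action, prob = choose(context)
    reward = true_reward(context, action)      # feedback for the chosen action only
    logged.append((context, action, reward, prob))
    # Online update of the chosen action's model toward the observed reward.
    error = weights[action] @ context - reward
    weights[action] -= 0.05 * error * context

print("average reward over the last 1000 rounds:",
      round(float(np.mean([r for _, _, r, _ in logged[-1000:]])), 3))
```

The logged quad at the end is what makes the offline experimentation discussed later in the episode possible.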
And you're getting some good traction with this approach?
Yeah. I mean, contextual bandits are deployed in the real world and being used in many places
at this point. There's every reason to believe that if we can crack contextual decision processes,
which is our current agenda, that will be of great use as well.
Rafa, at their core, reinforcement learning systems are designed to be self-improving
and to kind of learn from the real world like humans do.
Yes.
Or puppies.
Or puppies.
And the real world is uncertain and risky.
Yes.
So how do you, from your perspective or from your angle, build trust with the customers that you interact with, both third-party and first-party customers, who are giving you access to their own real-life traffic online?
Yeah, this is an important topic when we start looking at how we do incubations in our team.
And we have a specific challenge, as you were saying, because if we were a supervised learning
model, we would go to a customer and say, hey, you know, give me a data set.
I'll run my algorithm.
If it improves, you deploy it.
We deploy it in an A-B test.
And if we are good, you're good to go.
Our system is deployed in production.
So here we are with customers and talking to them about advanced machine learning techniques from research.
And we want to deploy them in their online production system.
So as you can imagine, it becomes an interesting conversation.
So the way we approach this actually is by taking ideas from
product teams. So when we went and did our incubations, we did it with a hardened prototype,
meaning this is a prototype that's not your typical stitched up Python code that, you know, is hacky. We took a fair amount of time to harden it to the degree that
if you run it in production, it's not going to crash your customers' online production system.
So that's number one. And then when we approach customers, our system learns from the real world
and you do need a certain amount of traffic because our models are like
newborn puppies. They don't know any tricks. So you need to give them information in order to learn.
But what we typically do is we have a conversation with our customer and say, hey, you know,
yes, this is research, but it is hardened prototype. That's number one. And two, we use previous incubations as reference to newer ones. We borrow
ideas from how products go sell their prototypes, right? And then we, as a methodology, say to
customers when they have large volumes of traffic to give us a portion of their traffic, which is good enough for us to learn and prove the ROI,
but small enough for them to de-risk. And that methodology has worked very well for us.
De-risk is such a good word. Let's go a little further on that thread.
Talk a little bit about the cold start versus the warm start when you're deploying. So that's another
interesting conversation with our customers, especially those that are used to supervised
learning, where you train your model, right, with a lot of data and you deploy it and it's already
learned something. Our models in our personalization service start really cold. But the way John and the teams created those algorithms
allows us to learn very fast.
And the more traffic you give it, the faster it learns.
So I'll give you an example.
We deployed a pilot with Xbox top of home
where we were personalizing two of the three slots
or four slots that they have on the top of home.
And Xbox gets millions of events per day. So with only 6 million events per day, which is a fraction
of Xbox traffic, in about a couple of hours, we went from cold to very warm. So again, from a
de-risking with these conversations with our customers, first or third parties, we tend to say, yes,
it's cold start, but these algorithms learn super fast. And there's a certain amount of traffic flow
that enables that efficient learning. So we haven't had major problems. We start by making
our customers understand how the system works. And we go from there. Are there instances where you're coming into a warm start where there's some
existing data or infrastructure?
Yeah, so that definitely happens. It's typically more trouble than it's worth to actually use pre-existing data because when
you're training in a contextual bandit, you really need to capture four things,
the features, the action, the reward for the action, and then the probability of taking the
action. And almost always the probability is not recorded in any kind of reliable way,
if it was even randomized previously. So given that you lack one of those things,
there are ways to try to repair that. They kind of work, but they're kind of a pain.
They're not the kind of thing that you can do in an automatic fashion.
So typically we want to start with recording our own data so we can be sure that it is, in fact, good data.
Now, with that said, there are many techniques for taking into account pre-existing models, right? So we actually have a paper now on arXiv talking
about how to combine an existing supervised data source with a contextual bandit data source.
Another approach, which is commonly very helpful, is people may have an existing supervised system,
which may be very complex, and they may have built up a lot of features around that,
which may not even be appropriate.
Often there's a process around any kind of real system
where the learning algorithm and the features are kind of co-evolving,
and so moving away from either of them causes a degradation in performance.
So in that kind of situation, what you'd want to do is
tap the existing supervised models
to extract features which are very powerful,
and then, given those very powerful features, you can very quickly get to a
good solution. The exact mechanism of that extraction is going
to depend upon the representation that you're using. With a neural network, you
kind of rip off the top layer and use that. With a decision tree, or a boosted
decision tree, or a decision forest, you can use the leaf membership as a feature that you can then feed in for a very fast warm-up of a contextual bandit learner.
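Here is a hedged sketch of the tree-based warm start John mentions, using scikit-learn purely for illustration (the data, model, and sizes are invented; the neural-network variant would instead take the activations just below the output layer). The existing forest's leaf memberships become a one-hot feature vector that a contextual bandit learner could then consume.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

# Pretend this is the customer's existing supervised model and its training data.
rng = np.random.default_rng(0)
X_old = rng.random((2000, 20))
y_old = (X_old[:, 0] + X_old[:, 1] > 1.0).astype(int)
forest = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0).fit(X_old, y_old)

# apply() reports which leaf each tree routes an example to: one column per tree.
encoder = OneHotEncoder(handle_unknown="ignore").fit(forest.apply(X_old))

def warm_features(x):
    """Leaf-membership features for a single new observation x of shape (20,)."""
    leaves = forest.apply(x.reshape(1, -1))              # shape (1, n_estimators)
    return encoder.transform(leaves).toarray().ravel()   # one-hot leaf indicators

x_new = rng.random(20)
print("warm-start feature vector length:", warm_features(x_new).shape[0])
# These leaf features would be fed to the contextual bandit learner in place of
# (or alongside) the raw features, so it can reach a good policy with far fewer
# online samples than starting from scratch.
```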
John, talk about offline experimentation. What's going on there?
Yeah, so this is one of the really cool things that's possible when you're doing
shallow kinds of reinforcement learning with maybe one step or maybe two steps.
So if you record that quad of features, action, reward, and the probability, then it becomes possible to evaluate any policy that chooses amongst the set of available actions.
Okay, so what that means is that if you record this data, and then later you discover that maybe a different learning rate was helpful,
or maybe you should be taking this feature and that feature and combining them to make a new feature.
You can test to see exactly how that would have performed if you had deployed that policy at the time you're collecting data.
So this is amazing, because this means that you no longer need to use an A-B
test for the purpose of optimization. You still have reasons to use it for the purpose of safety,
but for optimization, you can do that offline in a minute rather than doing it online for two weeks
waiting to get the data necessary to actually learn. Yeah, just to pick up on why is this a gold nugget? Data scientists
spend a fair amount of time today designing models a priori and testing them in A-B tests,
only to learn two weeks after that they failed and they go back to ground zero.
So here you're running hundreds, if not thousands, of A-B tests on the spot. And when we talk about this to data scientists and enterprises, their
eyes light up. I mean, that is one of the key features of our system that just brightens the
day for many data scientists. It's a real pain for them to design models, run them in A-B. It's
very costly as well. So talk about productivity gains. It's immense when you
can run 100 to 200 A-B tests in a minute versus running one A-B test for two weeks.
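For readers curious why the logged probability makes this offline evaluation work, here is a minimal sketch of the inverse-propensity-score estimator on synthetic logs (my own illustration, not the production system; the logging policy and reward function are invented). Rounds where a candidate policy agrees with the logged action are reweighted by one over the logged probability, giving an unbiased estimate of how that policy would have performed on the same traffic.

```python
import numpy as np

rng = np.random.default_rng(1)
N_ACTIONS = 3

def logging_policy(context):
    """The system that was actually running: here, uniform exploration."""
    probs = np.full(N_ACTIONS, 1.0 / N_ACTIONS)
    action = int(rng.choice(N_ACTIONS, p=probs))
    return action, probs[action]

def true_reward(context, action):
    return float(context[action] > 0.5)        # hidden "world", used only to simulate logs

# Logged data in the quad form John describes: (context, action, reward, probability).
log = []
for _ in range(20000):
    context = rng.random(N_ACTIONS)
    action, prob = logging_policy(context)
    log.append((context, action, true_reward(context, action), prob))

def ips_value(policy, log):
    """Estimate the average reward `policy` would have earned on the logged traffic."""
    total = 0.0
    for context, action, reward, prob in log:
        if policy(context) == action:          # only rounds where the policies agree count...
            total += reward / prob             # ...reweighted by 1 / logged probability
    return total / len(log)

candidate = lambda context: int(np.argmax(context))   # a new policy to test offline
print("estimated value of the candidate policy:", round(ips_value(candidate, log), 3))
print("realized value of the logging policy:   ", round(float(np.mean([e[2] for e in log])), 3))
```

Trying a different candidate policy is just another call to ips_value on the same logs, which is what lets hundreds of variants be compared in minutes rather than one per two-week A-B test.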
Rafa, you work as a program manager within a research organization.
Yes.
And it's your job to bring science to the people.
Yes.
Talk about your process of research-incubate-transfer in a little more detail, in the context of how you develop RL prototypes, engineer them, and test them.
And specifically, maybe you could explain a couple of examples of this process of deployments that are out there already.
How are you living up to the code?
We have a decent size engineering team that supports our RL efforts in MSR.
And our job is twofold.
One is to, from a program management perspective, it's to really drive what it means to go from an algorithm to a prototype
and then validate whether that prototype has any market potential.
I take it upon myself as a program manager: researchers are creating these wonderful
academic papers with great algorithms, and some of them may have huge market potential. So this
market analysis happens actually in MSR. And we ask ourselves,
great algorithm, what are the classes of problems we can solve for it? And would people like relate
to these problems such that we could actually go and incubate them? And the incubation is a
validation of this market hypothesis. So that's what we do in our incubations. We are actually
trying to see whether this is something that we could potentially tech transfer to the product
team. And we've done this with contextual bandits in the context of personalization scenarios. So
contextual bandits is a technique, right? And so we ask ourselves, okay, with this technique, what classes of problems
can we solve very efficiently? And personalization was one of them. And we went and incubated it
first with MSN. Actually, John and the team incubated it with MSN first, and they got
a 26% lift. That's multi-million dollar revenue potential. So from a market potential, it really
made sense. So we went and said, okay, one customer is not statistically significant,
so we need to do more. And we spent a fair amount of time actually validating this idea
and validating the different types of personalization. So MSN was a news article personalization.
Recently, we did a page layout personalization with Surface.com Japan,
where they had four boxes on Surface.com Japan,
and they were wondering how to present these boxes based on the user that was visiting that page.
And guess what? We gave them 2,500 events.
So it was a short run pilot that we did with them.
We gave them an 80% lift, 80.
They were flabbergasted.
They couldn't believe.
And this was run on an A-B test.
So they had their page layout that their designers had specified for them, for all users,
running as the control. And they had our personalization engine running with our contextual bandit algorithm. And they ran it.
And for us, you know, 2,500 samples is not really a lot. But even with that, we gave them an 80% lift over their control. So these are the kinds of incubation that
when we go to our sister product team in Redmond and tell the story, they get super excited that
this could be a whole class of applications.
We talk a lot today about diversity, and that often means having different people on the team.
But there's other aspects, especially in reinforcement learning, that include diversity of perspective and approach. How do you address this in the work you're doing and how do you practically manage it?
One thing to understand is that research is an extreme sport in many ways. You're trying to do
something which nobody has ever done before. And so you need an environment that supports you in
doing this in many ways. It's hard for a single researcher to have all the abilities that are needed to succeed.
When you're learning to do research, you're typically learning a very narrow thing.
And over time, maybe that gets a little bit broader,
but it's still going to be the case that you just know a very narrow perspective
on how to solve a problem.
So one of the things that we actually do is on a weekly basis, we have an open problems discussion where a group of researchers gets together and one of them talks about the problem that they're interested in.
And then other people can chime in and say, oh, maybe we should look at it this way or think about it that way.
That helps, I think, sharpen the problems. And then in the process of solving problems, amazing things come up in discussion, but they can only come up
if you can listen to each other. And I guess the people that I prefer to work with are the ones who
listen carefully. There's a process of bouncing ideas off each other, discovering the flaws in
them, figuring out how to get around the flaws. This process can go
on. It's indefinite, but sometimes it lands. And when it lands, that moment when you discover
something, that's really something. Rafa, do you have anything to add to that?
So when I think about diversity in our lab, I think that to complement what John's saying,
I like to always also think about the diversity of disciplines.
So in our lab, we're not a big lab, but we have researchers, we have engineers, we have designers, and we have program managers.
And I think these skill sets are diverse, and yet they complement each other so well.
And I think that adds to the richness of what we have in our lab.
In the context of the work you do,
its applications and implications in the real world,
is there anything that keeps you up at night?
Any concerns you're working to mitigate, even as you work to innovate?
I think the answer is yes.
Anybody who understands the potential of machine learning, AI, whatever you want to call it,
understands that there are negative ways to use it, right? It is a tool and we need to try to
use the tool responsibly and we need to mitigate the downsides where we can see them in advance.
So I do wonder about this. Recently, we had a paper on fair machine learning, and we showed that any supervised
learning algorithm can, in a black box fashion, be turned into a fair supervised learning algorithm.
We demonstrated this both theoretically and experimentally. So that's a promising paper
that addresses a narrow piece of ethics around AI, I guess I would say. As we see more opportunities along these lines, we will
solve them. Yeah. Also use these techniques for the social good, right? I mean, as we are trying
to use them to monetize, also we should use them for the social good. How did each of you end up
at Microsoft Research in New York City? This is actually quite a story.
So I used to be at Yahoo Research.
One day, right about now, seven years ago, the head of Yahoo Research quit.
So we decided to essentially sell the New York Lab.
So we created a portfolio of everybody in the New York lab,
15 researchers there. We sent it around to various companies. Microsoft ended up getting 13 out of 15
people. And that was the beginning of Microsoft Research New York.
Rafa, how did you come to Microsoft Research New York City?
He told me I was going to revolutionize the world.
That's why I came over from IBM. So I actually had a wonderful job at IBM applying Watson
technologies for children's education. And one day a Microsoft recruiter called me and they said,
John Langford, renowned RL researcher, is looking for a program manager. You should interview with him. And I'm
like, okay. So I interviewed at Microsoft Research New York, spoke to many people. And at the time,
I, you know, I was comfortable in my job and I had other opportunities. But in his selling pitch to
me, John Langford calls me one day at home and he says, you should choose to come and work for Microsoft Research because we're going to revolutionize the world. And I think it sunk in that we can be at the cusp of something really big, and that got me. John, you started out our interview with, I want to solve machine learning.
Rafa, you have said that your ultimate goal is real world reinforcement learning for everyone.
What does the world look like if each of you is wildly successful?
Yeah, so there's a lot of things that are easy to imagine being a part of the future world that just aren't around now.
You should imagine that every computer interface learns to adapt to you rather than you needing to adapt to the user interface.
You could imagine lots of companies just working better. You could imagine a digital avatar that over time learns to help you book the flights
that you want to book or things like that, right? Often there's a lot of mundane tasks that people
do over and over again. And if you have a system that can record and learn from all the interactions
that you make with computers or with the Internet,
those tasks can happen on your behalf.
That could really ease the lives of people in many different ways.
Lots of things where there's an immediate sense of, oh, that was the right outcome or, oh, that was the wrong outcome, can be addressed with just the technology that we have already. And then there's technologies beyond that,
like the contextual decision processes that I was talking about, that may open up even more
possibilities in the future. Rafa? To me, what a bright future would look like is when we can
cast a lot of the issues that we see today, at the enterprise level and at the personal level, as
reinforcement learning problems that we can actually solve. And more importantly for me,
you know, as we work in technology and we develop all these techniques, the question is,
are we making the world a better world, right? And can we actually solve some hard problems like famine and diseases
with reinforcement learning? And maybe not now, but can it be the bright future that we look out
for? I hope so. I do too. John Langford, Rafah Hosn, thank you for joining us today.
Thank you.
Thank you.
To learn more about Dr. John Langford and Rafah Hosn and the quest to bring reinforcement learning to the real world,
visit Microsoft.com slash research.