Microsoft Research Podcast - 075 - Reinforcement learning for the real world with Dr. John Langford and Rafah Hosn

Episode Date: May 8, 2019

Dr. John Langford, a partner researcher in the Machine Learning group at Microsoft Research New York City, is a reinforcement learning expert who is working, in his own words, to solve machine learning. Rafah Hosn, also of MSR New York, is a principal program manager who’s working to take that work to the world. If that sounds like big thinking in the Big Apple, well, New York City has always been a “go big, or go home” kind of town, and MSR NYC is a “go big, or go home” kind of lab. Today, Dr. Langford explains why online reinforcement learning is critical to solving machine learning and how moving from the current foundation of a Markov decision process toward a contextual bandit future might be part of the solution. Rafah Hosn talks about why it’s important, from a business perspective, to move RL agents out of simulated environments and into the open world, and gives us an under-the-hood look at the product side of MSR’s “research, incubate, transfer” process, focusing on real world reinforcement learning which, at Microsoft, is now called Azure Cognitive Services Personalizer.

Transcript
Starting point is 00:00:00 Welcome to another two-chair, two-mic episode of the Microsoft Research Podcast. Today, we bring you the perspectives of two guests on the topic of reinforcement learning for online applications. Since most research wants to be a product when it grows up, we've brought in a brilliant researcher-program manager duo to illuminate the classic research incubate transfer process in the context of real-world reinforcement learning. You're listening to the Microsoft Research Podcast, a show that brings you closer to the cutting edge of technology research and the scientists behind it. I'm your host, Gretchen Huizinga. Dr. John Langford, a partner researcher in the Machine Learning Group at Microsoft Research New York City,
Starting point is 00:00:47 is a reinforcement learning expert who is working, in his own words, to solve machine learning. Rafah Hosn, also of MSR New York, is a principal program manager who's working to take that work to the world. If that sounds like big thinking in the Big Apple, well, New York City has always been a go-big-or-go-home kind of town, and MSR NYC is a go-big-or-go-home kind of lab. Today, Dr. Langford explains why online reinforcement learning is critical to solving machine learning and how moving from the current foundation of a Markov decision process toward a contextual bandit future might be part of the solution. Rafah Hosn talks about why it's important from a business perspective to move RL agents out of simulated environments
Starting point is 00:01:30 and into the open world, and gives us an under-the-hood look at the product side of MSR's research incubate transfer process, focusing on real-world reinforcement learning, which at Microsoft is now called Azure Cognitive Services Personalizer. That and much more on this episode of the Microsoft Research Podcast.
Starting point is 00:01:59 I've got two guests in the booth today, both working on some big research problems in the Big Apple. John Langford is a partner researcher in the machine learning group at MSR NYC. And Rafah Hosn, also at the New York Lab, is the principal program manager for the personalization service, also known as real-world reinforcement learning. John and Rafa, welcome to the podcast. Thank you. Microsoft Research's New York Lab is relatively small in the constellation of MSR labs, but there's some really important work going on there. So to get us started, tell us what each of you does for a living and how you work together. What gets you up in the morning?
Starting point is 00:02:39 Rafa, why don't you start? Okay, I'll start. So I wake up every day and think about all the great things that the reinforcement learning researchers are doing. And first I map what they're working on to something that could be useful to customers. And then I think to myself, how can we now take this great research, which typically comes in the form of a paper, to a prototype, to an incubation, to something that Microsoft can make money out of? That's a big thread, starting with the little seed and ending up with the big plant at the end.
Starting point is 00:03:15 Yes, we have to think big. That's right. How about you, John? I want to solve machine learning. And it's ambitious. But one of the things that you really need to do if you want to solve machine learning is you need to solve reinforcement learning, which is kind of a common basis for learning items to learn from interaction with the real world. And so figuring out new ways to do this or trying to expand the scope of where we can actually apply these techniques is what really drives me. Can you go a little deeper into solve machine learning? What would solving machine learning look like? It would look like anything
Starting point is 00:03:51 that you can pose as a machine learning problem you can solve, right? So I became interested in machine learning back when I was an undergrad, actually. Yeah. I went to a machine learning class and I was like, ah, this is what I want to do for my life. And I've been pursuing it ever since. And here you are. So we're going to spend the bulk of our time today talking about the specific work you're doing in reinforcement learning. But John, before we get into it, give us a little context as a level set. From your perspective, what's unique about reinforcement learning within the machine learning universe? And why is it an important part of MSR's research portfolio? So most of the machine learning that's actually deployed is of the supervised learning variety.
Starting point is 00:04:29 And supervised learning is fundamentally about taking expertise from people and making that into some sort of learned function that you can then use to do some task. Reinforcement learning is different because it's about taking information from the world and learning a policy for interacting with the world so that you perform better in one way or another. So that different source of information can be incredibly powerful because you can imagine a future where every time you type on the keyboard, the keyboard learns to understand you better. Or every time you interact
Starting point is 00:05:06 with some website, it understands better what your preferences are. So the world just starts working better and better and interacting with people. And so reinforcement learning as a method within the machine learning world is different from other methods because you deploy it in less known circumstances? Or how would you define that? So it's different in many ways, but the key difference is the information source. The consequence of that is that reinforcement learning can be surprising.
Starting point is 00:05:35 It can actually surprise you. They can find solutions you might have not thought of to problems that you pose to it. That's one of the key things. Another thing is it requires substantially more skill to apply than supervised learning. Supervised learning is pretty straightforward as far as the statistics go. While reinforcement learning, there's some real traps out there, and you want to think carefully about what you're doing. Let me go into a little more detail there. Please do.
Starting point is 00:06:00 Let's suppose you need to make a sequence of 10 steps and you want to maximize the reward that you get in those 10 steps, right? So it might be the case that going left gives you a small reward immediately and then you get no more rewards. While if you go right, you get no reward. And then you go left and then right and then right and then left and then right, and so on 10 times, done just the right way. You get a big reward, right? So many reinforcement learning algorithms would just learn to go left because that gave the small
Starting point is 00:06:30 reward immediately. And that gap is not like a little gap; it's like you may require exponentially many more samples to learn unless you actually gather the information in an intelligible, conscious way. I'm grinning and no one can see it because I'm thinking, that's how people operate generally, you know? Actually, yeah. I mean, the way I explain reinforcement learning is the way you teach a puppy how to do a trick. And the puppy may surprise you and do something else, but the reward that John speaks of is the treat that you give the
Starting point is 00:07:06 puppy when the puppy does what you are trying to teach it to do. And sometimes they just surprise you and do something different. And actually, reinforcement learning has a very great affinity to Pavlovian psychology. Well, back to your example, John, you're saying if you turn left, you get the reward immediately. You get a small reward. A small reward. So the agent would have to go through many, many steps of this to figure out, don't go left because you'll get more later. You get more later if you go right and you take the right actions after you go right. Now, imagine explaining this to a customer. And we will get there and I'll have you explain it. Rafa, let's talk for a second about the personalization service, which is an
Starting point is 00:07:52 instantiation of what you call real-world reinforcement learning, yeah? That's right. So you characterize it as a general framework for reinforcement learning algorithms that are suitable for real-world applications. Unpack that a bit. Give us a short primer on real-world reinforcement learning and why it's an important direction for reinforcement learning in general. Yeah, I'll give you my version, and I'm sure John will chime in. But, you know, much of the reinforcement learning that people hear about is almost always done in a simulated environment where you can be creative as to what you simulate and you can generate, you know, gazillions of samples to make your agents work. Our type of reinforcement, John's type of reinforcement learning is something
Starting point is 00:08:36 that we deploy online. And what drives us, John and I, is to create or use this methodology to solve real-world problems. And our goal is really to advance the science in order to help enterprises maximize their business objective through the usage of real-world reinforcement learning. So when I say real-world, these are models that we deploy in production with real users getting real feedback and they learn on the job. Well, John, talk a little bit about what Rafa has alluded to. There's an online real world element to it. But prior to this, reinforcement learning has had some big investments in the gaming space. Tell us the difference and what happens when you move from a very closed environment to a very open environment from a
Starting point is 00:09:31 technical perspective. Yeah, so I guess the first thing to understand is why you'd want to do this, because if reinforcement learning and simulators works great, then why do you need to do something else? And I guess the answer is there are many things you just can't simulate. So an example that I often give in talks is, would I be interested in a news article about Ukraine? The answer is yes, because my wife is from Ukraine. But you would never know this. Your simulator would never know this.
Starting point is 00:09:59 There'd be no way for the policy to actually learn that if you're learning in a simulator. So there are many problems where there are no good simulators, and in those situations, you don't have a choice. So given that you don't have a choice, you need to embrace the difficulties of the problem. So what are the difficulties of the real world reinforcement learning problems? Well, you don't have zillions of examples, which are typically required for many of the existing deep reinforcement learning algorithms. You need to be careful about
Starting point is 00:10:31 how you use your samples; you need to use them with maximum and utmost efficiency in trying to do the learning. Another element that happens is, often when people have simulators, those simulators are kind of effectively stationary; they stay the same throughout the process of training. But in real-world problems, many of them that we encounter, we run into all kinds of non-stationarities. There are exogenous events. The algorithms need to be very robust. So the combination of using samples very efficiently and great robustness in these algorithms
Starting point is 00:11:01 are kind of key offsetting elements from what you might see in other places. Which is challenging AlphaGo or Ms. Pac-Man or the other games that have been sort of flags waved about our progress in reinforcement learning. I think those are fun applications. I really enjoy reading about them and learning about them. I think it's a great demonstration of where the field has gotten,
Starting point is 00:11:24 but I feel like this is the issue of AI winter, right? So there was once a time when AI crashed. And that may happen again because AI is now a buzzword. But I think it's important that we actually do things that have some real value in the world, which actually affect people's lives, because that's what creates a lasting wave of innovation and puts civilization into a new place.
Starting point is 00:11:50 So that's what I'm really seeking. What season are we in now? I've heard there has been more than one AI winter, and some people are saying it's AI spring. I don't know. Where do you see us in terms of that progress? I think it's fair to say that there's a lot of froth in terms of people claiming things that are not going to come to pass. At the same time, there is real value being created. Suddenly we can do things and things work better through some of these
Starting point is 00:12:25 techniques. And so there's kind of this mishmash of over-promised things that are going to fail, and there are things that are not over-promised, and they will succeed. And so if there's enough of those that succeed, then maybe you don't have a winter. Maybe it just becomes a long summer. Like San Diego all the time. Yeah, but I think to comment on John's point here, I think reinforcement learning is a nascent technique compared to supervised learning. And what's important is to do the crawl, walk, run, right? So yeah, it's sexy now and people are talking about it, but we need to rein it in from a business perspective as to, you know, what
Starting point is 00:13:05 are the classes of problems that we can satisfy the business leader with and satisfy them effectively, right? And I think from a reinforcement learning perspective, John, correct me, we are very much at the crawl phase in solving generic business problems. We have solved some generic business problems, but we don't have widely deployed or deployable platforms for reusing those solutions over and over again. And it's so easy to imagine many more applications than people have even tried. So we're nowhere near a mature phase in terms of even simple kinds of reinforcement learning. We are ramping up in our ability
Starting point is 00:13:45 to solve real-world reinforcement learning problems. And there's a huge ramp still to happen. Heading towards your goal of solving machine learning. Yes. But I mean, to be fair, though, we can actually satisfy some classes of problems really well with nascent technology. So yes, we are nascent and the world is out there for us to conquer. But I think we do have techniques that can solve a whole swath of problems.
Starting point is 00:14:11 And it's up to us to harvest that. Well, let's continue the thread a little bit on the research areas of reinforcement learning and there are several that seem to be gaining traction. Let's go sort of high level and talk about this one area that you're saying is basically creating a new foundation for reinforcement learning. What's wrong with the current foundation? What do we need the new foundation for and what are you doing? The current foundation of reinforcement learning is called a Markov decision process. The idea in a Markov decision process is that you have states and actions and given a state,
Starting point is 00:14:57 you take an action, then you have some distribution over the next state. So that's kind of what the foundation is. It's how everybody describes your solutions. And the core issue with this is that there are no good solutions when you have a large number of states. All solutions kind of scale with the number of states. And so if you have a small number of possible observations about the world, then you can employ these theoretically motivated reinforcement learning algorithms, which are provably efficient, and they will work well. But in the real world, you have a megapixel camera, which has two to the one million or 16 to the one million possible inputs.
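As a rough illustration of the scaling issue John raises (this is not MSR code, and the feature dimension below is an arbitrary assumption), a tabular method keeps one estimate per distinct state, which is hopeless when observations essentially never repeat, whereas a method that scores features of an observation can generalize to inputs it has never seen:

```python
from collections import defaultdict
import numpy as np

# Tabular: one value per exact state. With a megapixel camera there are on the
# order of 16**1_000_000 possible frames, so almost every lookup is a miss and
# nothing learned about one frame transfers to another.
q_table = defaultdict(float)

def tabular_value(state_bytes: bytes) -> float:
    return q_table[state_bytes]          # only useful if this exact state recurs

# Feature-based: summarize the observation as a feature vector and score it
# with learned weights. Statistical cost now scales with the number of
# features, not with the number of distinct observations.
FEATURE_DIM = 512                        # arbitrary, for illustration
weights = np.zeros(FEATURE_DIM)

def featurized_value(features: np.ndarray) -> float:
    return float(weights @ features)     # generalizes across unseen observations
```

This is the sense in which the algorithms John describes are required not to depend on the number of possible observations.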
Starting point is 00:15:37 And so you never encounter the same thing twice. And so you just can't even apply these algorithms. It doesn't even make sense. It's ridiculous. So when I was a young graduate student, I was, of course, learning about Markov decision processes and trying to figure out how to solve reinforcement learning better with them. And then at some point after we had a breakthrough, I realized that the breakthrough was meaningless
Starting point is 00:15:59 because it was all about these Markov decision processes. And no matter what, it just never was going to get to the point where you could actually do something useful. So around 2007, I decided to start working on contextual bandits. This is an expansion of what reinforcement learning means in one sense, but a restriction in another sense. So instead of caring about the reward of a long sequence of actions, we're going to care about the reward of the next action. So that's a big simplification.
Starting point is 00:16:30 On the other hand, instead of caring about the state, we're going to care about an observation, and we're going to demand that our algorithms don't depend upon the number of possible observations, just like they do in supervised learning. So we studied this for several years. We discovered how to create statistically efficient algorithms for these kinds of problems. So that's kind of the foundation of the systems that we've been working on. And then more recently, after cracking these contextual bandit problems, we wanted to address a larger piece of reinforcement learning. So now we're thinking about contextual decision processes, where you have a sequence of rounds, and on each round you see some observation,
Starting point is 00:17:13 you choose some action, and then you do that again and again and again, and then at the end of an episode, maybe 10 steps, maybe 100, you get a reward, right? So there's some long delayed reward dependent upon all the actions you've taken and all the observations you've made. And now it turns out that when these observations are generated by some small underlying state space, which you do not know in advance and which is never told to you, you can still learn. You can still do reinforcement learning. You can efficiently discover what a good policy is globally. So the new foundation of reinforcement learning is about creating a foundation for reinforcement learning algorithms that can cope with a megapixel camera as an observation rather than having like 10 discrete or 100 discrete states. And you're getting some good traction with this approach? Yeah. I mean, contextual bandits
Starting point is 00:18:06 are deployed in the real world and being used in many places at this point. There's every reason to believe that if we can crack contextual decision processes, which is our current agenda, that will be of great use as well. Rafa, at its core, reinforcement learning systems are designed to be self-improving systems and kind of learn from the real world like humans do. Yes. Or puppies. Or puppies. And the real world is uncertain and risky. Yes. So how do you, from your perspective or from your angle, build trust with the customers that you interact with, both third-party and first-party customers, who are giving you access to their own real-life traffic online? Yeah, this is an important topic when we start looking at how we do incubations in our team. And we have a specific
Starting point is 00:18:59 challenge, as you were saying, because if we were a supervised learning model, we would go to a customer and say, hey, you know, give me a data set. I'll run my algorithm. If it improves, you deploy it. We deploy it in an A-B test. And if we are good, you're good to go. Our system is deployed in production. So here we are with customers and talking to them about advanced machine learning techniques from research, and we want to deploy them in their online production system. So as you can imagine, it becomes an interesting conversation. So the way we approach this actually is by taking ideas from product teams.
Starting point is 00:19:39 So when we went and did our incubations, we did it with a hardened prototype, meaning this is a prototype that's not your typical stitched up Python code that, you know, is hacky. We took a fair amount of time to harden it to the degree that if you run it in production, it's not going to crash your customer's online production system. So that's number one. And then when we approach customers, our system learns from the real world and you do need a certain amount of traffic because our models are like newborn puppies. They don't know any tricks. So you need to give them information in order to learn.
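To give a sense of what "learning on the job" can look like under the hood, here is a minimal epsilon-greedy contextual bandit loop in Python. It is purely illustrative: the feature count, exploration rate, learning rate, and reward signal are all made-up assumptions, and this is not the Personalizer implementation. The model starts completely cold, explores a little on every request, and updates from each observed reward; note that it also records the probability of the chosen action, which matters for the offline evaluation discussed later.

```python
import random
import numpy as np

N_ACTIONS, N_FEATURES, EPSILON, LEARNING_RATE = 3, 8, 0.2, 0.05  # made-up values

# One linear scorer per action; all zeros, i.e. a completely cold model.
weights = np.zeros((N_ACTIONS, N_FEATURES))

def choose(context: np.ndarray):
    """Pick an action with epsilon-greedy exploration; return it together with
    the probability it was chosen, which should be logged for later analysis."""
    best = int(np.argmax(weights @ context))
    if random.random() < EPSILON:
        action = random.randrange(N_ACTIONS)
    else:
        action = best
    prob = (1 - EPSILON) + EPSILON / N_ACTIONS if action == best else EPSILON / N_ACTIONS
    return action, prob

def update(context: np.ndarray, action: int, reward: float, prob: float):
    """Importance-weighted regression step toward the observed reward."""
    error = reward - float(weights[action] @ context)
    weights[action] += LEARNING_RATE * (error / prob) * context

# One simulated interaction; real traffic would supply the context and reward.
context = np.random.rand(N_FEATURES)
action, prob = choose(context)
reward = 1.0 if action == 0 else 0.0     # stand-in for a click / no-click signal
update(context, action, reward, prob)
```

Every interaction in a loop like this also yields the record of features, action, reward, and probability that John describes later, which is what makes counterfactual evaluation possible.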
Starting point is 00:20:24 But what we typically do is we have a conversation with our customer and say, hey, you know, yes, this is research, but it is a hardened prototype. That's number one. And two, we use previous incubations as reference to newer ones. We borrow ideas from how product teams go sell their prototypes, right? And then we, as a methodology, say to customers, when they have large volumes of traffic, to give us a portion of their traffic, which is good enough for us to learn and prove the ROI, but small enough for them to de-risk. And that methodology has worked very well for us.
Starting point is 00:21:10 De-risk is such a good word. Let's go a little further on that thread. Talk a little bit about the cold start versus the warm start when you're deploying. So that's another interesting conversation with our customers, especially those that are used to supervised learning, where you train your model, right, with a lot of data and you deploy it, and it's already learned something. Our models in our personalization service start really cold, but the way John and the teams created those algorithms allows us to learn very fast. And the more traffic you give it, the faster it learns. So I'll give you an example. We deployed a pilot with Xbox top of home
Starting point is 00:21:50 where we were personalizing two of the three slots or four slots that they have on the top of home. And Xbox gets millions of events per day. So with only 6 million events per day, which is a fraction of Xbox traffic, in about a couple of hours, we went from cold to very warm. So again, from a de-risking with these conversations with our customers, first or third parties, we tend to say, yes, it's cold start. But these algorithms learn super fast and there's a certain amount of traffic flow that enables that efficient learning. So we haven't had major problems. We start by making our customers understand how the system
Starting point is 00:22:32 works. And we go from there. Are there instances where you're coming into a warm start where there's some existing data or infrastructure? Yeah, so that definitely happens. It's typically more trouble than it's worth to actually use pre-existing data because when you're training a contextual bandit, you really need to capture four things: the features, the action, the reward for the action, and then the probability of taking the action. And almost always, the
Starting point is 00:23:05 probability is not recorded in any kind of reliable way, if it was even randomized previously. So given that you lack one of those things, there are ways to try to repair that. They kind of work, but they're kind of a pain, and not the kind of thing that you can do in an automatic fashion. So typically, we want to start with recording our own data so we can be sure that it is, in fact, good data. Now, with that said, there are many techniques for taking into account pre-existing models. So we actually have a paper now on arXiv talking
Starting point is 00:23:36 about how to combine an existing supervised data source with a contextual bandit data source. Another approach, which is commonly very helpful, is people may have an existing supervised system, which may be very complex. And they may have built up a lot of features around that, which may not even be appropriate. Often, there's a process around any kind of real system
Starting point is 00:24:00 where the learning algorithm and the features are kind of co-evolving. And so moving away from either of them causes a degradation of performance. Sure. So in that kind of situation, what you'd want to do is you want to tap the existing supervised models to extract features which are very powerful. And then given those very powerful features, you can very quickly get to a good solution. And so the exact mechanism of that extraction is going to depend
Starting point is 00:24:25 upon the representation that you're using. With a neural network, you kind of rip off the top layer and use that. With a decision tree or a boosted decision tree or a decision forest, you can use the leaf membership as a feature that you can then feed in for a very fast warmup of a contextual bandit learner. John, talk about offline experimentation. What's going on there? Yeah, so this is one of the really cool things that's possible when you're doing shallow kinds of reinforcement learning with maybe one step
Starting point is 00:24:54 or maybe two steps. So if you record that quad of features, action, reward, and the probability, then it becomes possible to evaluate any policy that chooses amongst the set of available actions. Okay, so what that means is that if you record this data and then later you discover that maybe a different learning rate was helpful or maybe you should be taking this feature and that feature
Starting point is 00:25:22 and combining them to make a new feature. You can test to see exactly how that would have performed if you had deployed that policy at the time you're collecting data. So this is amazing because this means that you no longer need to use an A-B test for the purpose of optimization. You still have reasons to use it for the purpose of safety. But for optimization, you can do that offline in a minute rather than doing it online for two weeks waiting to get the data necessary to actually learn. Yeah, just to pick up on why is this a gold nugget, data scientists spend a fair amount of time today designing models a priori and testing them in A-B test only to learn two weeks after that they failed and they go back to ground zero. So here you're running hundreds, if not thousands of A-B tests on the spot. And when we talk about this to data scientists and enterprises, their eyes light up. I mean, that is one of the key features of our system that just brightens the day for many data scientists.
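For readers who want to see the mechanics, here is a minimal sketch of the kind of offline (counterfactual) evaluation John describes, using the standard inverse propensity scoring estimator over the logged quad of features, action, reward, and probability. The log format, the numbers, and the two toy candidate policies are illustrative assumptions, not the Personalizer implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class LoggedEvent:
    features: Dict[str, float]  # context at decision time
    action: int                 # action the deployed (logging) policy chose
    reward: float               # reward observed for that action
    prob: float                 # probability the logging policy assigned to it

def ips_estimate(log: List[LoggedEvent], policy: Callable[[Dict[str, float]], int]) -> float:
    """Estimate the average reward a candidate policy would have earned,
    using only data collected by a different, randomized policy."""
    total = 0.0
    for e in log:
        if policy(e.features) == e.action:
            total += e.reward / e.prob   # reweight events where the candidate agrees
    return total / len(log)

# Toy log and two hypothetical candidate policies scored from the same data,
# with no new online experiment needed.
log = [
    LoggedEvent({"hour": 9.0},  action=0, reward=1.0, prob=0.5),
    LoggedEvent({"hour": 21.0}, action=1, reward=0.0, prob=0.25),
    LoggedEvent({"hour": 22.0}, action=1, reward=1.0, prob=0.25),
]
morning_first = lambda ctx: 0 if ctx["hour"] < 12 else 1
always_zero = lambda ctx: 0
print(ips_estimate(log, morning_first))  # 2.0 on this toy log
print(ips_estimate(log, always_zero))    # about 0.67
```

This is the sense in which hundreds of candidate policies can be compared in minutes from a single log, while an online A-B test remains the tool for final safety checks.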
Starting point is 00:26:27 It's a real pain for them to design models, run them in A-B. It's very costly as well. So talk about productivity gains. It's immense when you can run 100 to 200 A-B tests in a minute versus running one A-B test for two weeks. Rafa, you work as a program manager within a research organization. Yes. And it's your job to bring science to the people. Yes.
Starting point is 00:26:54 Talk about your research, incubate, transfer process in a little more detail, in the context of how you develop RL prototypes and engineer them and how you test them. And specifically, maybe you could explain a couple of examples of this process of deployments that are out there already. How are you living up to the code? We have a decent-sized engineering team that supports our RL efforts in MSR. And our job is twofold. One, from a program management perspective, is to really drive what it means to go from an algorithm to a prototype and then validate whether that prototype has any market potential.
Starting point is 00:27:38 I take it upon myself as a program manager when researchers are creating these wonderful academic papers with great algorithms, and some of them may have huge market potential. So this market analysis happens actually in MSR and we ask ourselves, great algorithm, what are the classes of problems we can solve with it? And would people relate to these problems such that we could actually go and incubate them? And the incubation is a validation of this market hypothesis. So that's what we do in our incubations. We are actually trying to see whether this is something that we could potentially tech transfer to the product team. And we've done this with contextual bandits in the context of personalization scenarios. So contextual bandits is a technique, right? And
Starting point is 00:28:34 so we ask ourselves, okay, with this technique, what classes of problems can we solve very efficiently? And personalization was one of them. And we went and incubated it first with MSN. Actually, John and the team incubated it with MSN first, and they got a 26% lift. That's multi-million dollar revenue potential. So from a market potential, it really made sense. So we went and said, okay, one customer is not statistically significant, so we need to do more. And we spent a fair amount of time actually validating this idea and validating the different types of personalization. So MSN was a news article personalization. Recently, we did a page layout personalization with Surface.com Japan, where they had four boxes on
Starting point is 00:29:26 Surface.com Japan, and they were wondering how to present these boxes based on the user that was visiting that page. And guess what? We gave them 2,500 events. So it was a short run pilot that we did with them. We gave them an 80% lift, 80. They were flabbergasted. They couldn't believe, and this was run on an A-B test. So they had their page layout that their designers had specified for them, for all users, running as the control. And they had our personalization engine running with our contextual bandit algorithm. And they ran it. And for us, you know, 2,500 samples is not really a lot. But even with that, we gave them an 80% lift over their control. So these are the kinds of incubation that when we go to our sister product team in Redmond and
Starting point is 00:30:22 tell the story, they get super excited that this could be a class of application that could work for the masses. John, there's a lot of talk today about diversity, and that often means having different people on the team. But there are other aspects, especially in reinforcement learning, that include diversity of perspective and approach. How do you address this in the work you're doing, and how do you practically manage it? One thing to understand is that research is an extreme sport in many ways. You're trying to do something which nobody has ever done before. And so you need an environment that supports you in doing this in many ways. It's hard for a single researcher to have all
Starting point is 00:31:19 the abilities that are needed to succeed. When you're learning to do research, you're typically learning a very narrow thing. And over time, maybe that gets a little bit broader, but it's still going to be the case that you just know a very narrow perspective on how to solve a problem. So one of the things that we actually do is on a weekly basis, we have an open problems discussion where a group of researchers gets together and one of them talks
Starting point is 00:31:45 about the problem that they're interested in. And then other people can chime in and say, oh, maybe you should look at it this way or think about it that way. That helps, I think, sharpen the problems. And then in the process of solving problems, amazing things come up in discussion, but they can only come up if you can listen to each other. I guess the people that I prefer to work with are the ones who listen carefully. There's a process of bouncing ideas off each other, discovering the flaws in them, figuring out how to get around the flaws. This process can go on. It's indefinite. But sometimes it lands.
Starting point is 00:32:21 And when it lands, that moment when you discover something, that's really something. Rafa, do you have anything to add to that? So when I think about diversity in our lab, I think that to complement what John's saying, I like to always also think about the diversity of disciplines. So in our lab, we're not a big lab, but we have researchers, we have engineers, we have designers, and we have program managers. And I think these skill sets are diverse, and yet they complement each other so well. And I think that adds to the richness of what we have in our lab. In the context of the work you do, its applications and implications in the real world, is there anything that keeps you up at night?
Starting point is 00:33:05 Any concerns you're working to mitigate, even as you work to innovate? I think the answer is yes. Anybody who understands the potential of machine learning, AI, whatever you want to call it, understands that there are negative ways to use it, right? It is a tool, and we need to try to use the tool responsibly, and we need to mitigate the downsides where we can see them in advance. So I do wonder about this. Recently, we had a paper on fair machine learning, and we showed that any supervised learning algorithm can, in a black box fashion, be turned into a fair supervised learning algorithm. We demonstrated this both
Starting point is 00:33:45 theoretically and experimentally. So that's a promising paper that addresses a narrow piece of ethics around AI, I guess I would say. As we see more opportunities along these lines, we will solve them. Yeah, also use these techniques for the social good, right? I mean, as we are trying to use them to monetize, also we should use them for the social good. How did each of you end up at Microsoft Research in New York City? This is actually quite a story. So I used to be at Yahoo Research. One day, right about now, seven years ago, the head of Yahoo Research quit. So we decided to essentially sell the New York lab. So we created a portfolio of everybody in the New York lab. There were 15 researchers there. We sent it around to various companies. Microsoft ended up getting 13 out of 15 people. And that was the
Starting point is 00:34:47 beginning of Microsoft Research New York. Rafa, how did you come to Microsoft Research New York City? He told me I was going to revolutionize the world. That's why I came over from IBM. So I actually had a wonderful job at IBM applying Watson technologies for children's education. And one day, a Microsoft recruiter called me and they said, John Langford, renowned RL researcher, is looking for a program manager. You should interview with him. And I'm like, okay. So I interviewed at Microsoft Research New York, spoke to many people. And at the time, I was comfortable in my job and I had other opportunities. But in his selling pitch to me, John Langford calls me one day at home and he says, you should choose to come and work for Microsoft Research because
Starting point is 00:35:38 we're going to revolutionize the world. And I think it sunk in that we can be at the cusp of something really big. and that got me really excited to join and that's how I ended up at Microsoft Research. As we close I'd like each of you to address a big statement that you've made. John you started out our interview with I want to solve machine learning. Rafa you have said that your ultimate goal is real-world reinforcement learning for everyone. What does the world look like if each of you is wildly successful? Yeah, so there's a lot of things that are easy to imagine being a part of the future world that just aren't around now. You should imagine that every computer interface learns to adapt to you rather than you needing to adapt to the user interface.
Starting point is 00:36:27 You could imagine lots of companies just working better. You could imagine a digital avatar that over time learns to help you book the flights that you want to book or things like that. Often there's a lot of mundane tasks that people do over and over again. And if you have a system that can record and learn from all the interactions that you make with computers or with the internet, it can happen on your behalf. That could really ease the lives of people in many different ways. Lots of things where there's an immediate sense of, oh, that was the right outcome or, oh, that was the wrong outcome, can be addressed with just the technology that we have already. And then there's technologies beyond that, like the contextual decision processes that I was talking about, that may open up even more possibilities in the future. Rafa?
Starting point is 00:37:26 To me, what a bright future would look like is when we can cast a lot of the issues that we see today, at the enterprise level and at the personal level, as reinforcement learning problems that we can actually solve. And more importantly for me, you know, as we work in technology and we develop all these techniques, the question is, are we making the world a better world, right? And can we actually solve some hard problems like famine and diseases with reinforcement learning? And maybe not now, but can it be the bright future that we look out for? I hope so.
Starting point is 00:38:07 I do too. John Langford, Rafah Hosn, thank you for joining us today. Thank you. Thank you. To learn more about Dr. John Langford and Rafah Hosn and the quest to bring reinforcement learning to the real world, visit Microsoft.com slash research.
