Microsoft Research Podcast - 075r - Reinforcement learning for the real world with Dr. John Langford and Rafah Hosn
Episode Date: January 1, 2020. This episode originally aired in May 2019. Dr. John Langford, a partner researcher in the Machine Learning group at Microsoft Research New York City, is a reinforcement learning expert who is working, in his own words, to solve machine learning. Rafah Hosn, also of MSR New York, is a principal program manager who’s working to take that work to the world. If that sounds like big thinking in the Big Apple, well, New York City has always been a “go big, or go home” kind of town, and MSR NYC is a “go big, or go home” kind of lab. Today, Dr. Langford explains why online reinforcement learning is critical to solving machine learning and how moving from the current foundation of a Markov decision process toward a contextual bandit future might be part of the solution. Rafah Hosn talks about why it’s important, from a business perspective, to move RL agents out of simulated environments and into the open world, and gives us an under-the-hood look at the product side of MSR’s “research, incubate, transfer” process, focusing on real-world reinforcement learning which, at Microsoft, is now called Azure Cognitive Services Personalizer.
Transcript
When John Langford and Rafah Hosn were on the podcast last May,
they gave us two perspectives on bringing reinforcement learning out of the lab and into the world,
highlighting the special relationship between science and business at MSR.
Whether you heard about their work in online reinforcement learning last spring,
or you're ringing in the new year with John and Rafa,
I know you'll enjoy Episode 75 of the Microsoft Research Podcast,
Reinforcement Learning for the Real World.
Welcome to another two-chair, two-mic episode of the Microsoft Research Podcast.
Today, we bring you the perspectives of two guests on the topic of reinforcement learning for online applications.
Since most research wants to be a product when it grows up,
we've brought in a brilliant researcher-program-manager duo to illuminate the classic research-incubate-transfer process in the context of real-world reinforcement learning.
You're listening to the Microsoft Research Podcast, a show that brings you closer to the cutting edge of technology research and the scientists behind it.
I'm your host, Gretchen Huizinga.
Dr. John Langford, a partner researcher in the machine learning group at Microsoft Research
New York City, is a reinforcement learning expert who is working, in his own words,
to solve machine learning. Rafah Hosn, also of MSR New York, is a principal program manager
who's working to take that work to the world. If that sounds like big thinking
in the Big Apple, well, New York City has always been a go-big-or-go-home kind of
town, and MSR NYC is a go-big-or-go-home kind of lab. Today, Dr. Langford explains
why online reinforcement learning is critical to solving machine learning and how moving from the current foundation of a Markov decision process toward a contextual
bandit future might be part of the solution. Rafah Hosn talks about why it's important from
a business perspective to move RL agents out of simulated environments and into the open world
and gives us an under-the-hood look at the product side of MSR's research incubate transfer process, focusing on real-world reinforcement learning, which at Microsoft is now called
Azure Cognitive Services Personalizer.
That and much more on this episode of the Microsoft Research Podcast. I've got two guests in the booth today, both working on some big research problems
in the Big Apple.
John Langford is a partner researcher in the Machine Learning Group at MSR NYC.
And Rafah Hosn, also at the New York Lab, is the principal program manager for the Personalization
Service,
also known as Real World Reinforcement Learning.
John and Rafa, welcome to the podcast.
Thank you.
Microsoft Research's New York Lab is relatively small in the constellation of MSR Labs,
but there's some really important work going on there.
So to get us started, tell us what each of you does for a living and how you work together. What gets you up in the morning? Rafa, why don't you start?
Okay, I'll start. So I wake up every day and think about all the great things that the
reinforcement learning researchers are doing. And first, I map what they're working on
to something that could be useful to customers. And then I think to myself,
how can we now take this great research, which typically comes in the form of a paper,
to a prototype, to an incubation, to something that Microsoft can make money out of.
That's a big thread, starting with a little seed and ending up with the big plant at the end.
Yes, we have to think big.
That's right. How about you, John?
I want to solve machine learning. And it's ambitious. But one of the things that you
really need to do if you want to solve machine learning is you need to solve reinforcement
learning, which is kind of a common basis for learning algorithms to learn from the
interaction with the real world. And so figuring out new ways to do this
or trying to expand the scope
of where we can actually apply these techniques
is what really drives me.
Can you go a little deeper into solve machine learning?
What would solving machine learning look like?
It would look like anything that you can pose
as a machine learning problem you can solve, right?
So I became interested in machine learning
back when I was an undergrad, actually. I went to a machine learning class and
I was like, ah, this is what I want to do for my life. And I've been pursuing it ever since.
And here you are.
So we're going to spend the bulk of our time today talking about the specific work you're
doing in reinforcement learning. But John, before we get into it, give us a little context as a
level set. From your perspective, what's unique about reinforcement learning within the machine learning universe?
And why is it an important part of MSR's research portfolio?
So most of the machine learning that's actually deployed is of the supervised learning variety.
And supervised learning is fundamentally about taking expertise from people and making that into some sort of learned function that you can then use to do some task.
Reinforcement learning is different because it's about taking information from the world
and learning a policy for interacting with the world so that you perform better in one way or another.
So that different source of information can be incredibly powerful because you can imagine a future where every time you type on the keyboard, the keyboard learns to understand you better.
Or every time you interact with some website, it understands better what your preferences are.
So the world just starts working better and better at interacting with people. And so reinforcement learning as a method within the machine learning world is different
from other methods because you deploy it in less known circumstances, or how would you
define that?
So it's different in many ways, but the key difference is the information source.
The consequence of that is that reinforcement learning can be surprising.
It can actually surprise you.
It can find solutions you might not have thought of
to problems that you pose to it.
That's one of the key things.
Another thing is it requires substantially more skill to apply
than supervised learning.
Supervised learning is pretty straightforward
as far as the statistics go,
while reinforcement learning, there's some real traps out there,
and you want to think carefully
about what you're doing. Let me go into a little more detail there. Let's suppose you need to make
a sequence of 10 steps and you want to maximize the reward that you get in those 10 steps, right?
So it might be the case that going left gives you a small reward immediately and then you get no
more rewards. While if you go right, you get no reward, and then you go left and then right and then
right and then left and then right, and so on, 10 times. Do it just the right way and you get a big
reward. So many reinforcement learning algorithms just learn to go left, because that gave the small
reward immediately. And that gap is not like a little gap. It's like you may require exponentially
many more samples to learn unless you actually gather the information in an intelligent,
conscious way. I'm grinning and no one can see it because I'm thinking that's how people operate
generally, you know? Actually, yeah. I mean, the way I explain reinforcement learning is the way you teach a puppy how to do a trick.
And the puppy may surprise you and do something else.
But the reward that John speaks of is the treat that you give the puppy when the puppy does what you are trying to teach it to do.
And sometimes they just surprise you and do something different. And actually, reinforcement learning has a very great affinity to Pavlovian psychology.
Well, back to your example, John, you're saying if you turn left, you get the reward immediately.
You get a small reward.
A small reward.
So the agent would have to go through many, many steps of this to figure out, don't go left because you'll get more later.
You get more later if you
go right and you take the right actions after you go right. Now imagine explaining this to a customer.
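To make the trap concrete for readers who like code, here is a small toy sketch of the 10-step problem John describes (my own illustration, not code from the episode; the reward values and the "correct" action sequence are arbitrary). A myopic learner locks onto the small immediate reward for going left, while undirected exploration needs on the order of 2^10 episodes just to observe the delayed big reward once.

```python
import random

HORIZON = 10                                   # number of decisions per episode
GOOD_SEQ = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]      # the one "right" 10-action sequence (arbitrary)

def episode(policy):
    """Run one episode under `policy` (0 = left, 1 = right) and return its reward."""
    for step in range(HORIZON):
        action = policy(step)
        if step == 0 and action == 0:          # going left first: small reward, then nothing more
            return 0.1
        if action != GOOD_SEQ[step]:           # any other wrong turn: no reward at all
            return 0.0
    return 1.0                                 # the full correct sequence: big delayed reward

# A myopic learner that has already seen the 0.1 reward just keeps going left.
myopic_avg = sum(episode(lambda step: 0) for _ in range(1000)) / 1000

# Undirected exploration: how many episodes until purely random play finds the big reward?
random.seed(0)
episodes_needed = 1
while episode(lambda step: random.randint(0, 1)) != 1.0:
    episodes_needed += 1

print(f"myopic average reward per episode: {myopic_avg:.2f}")                    # ~0.10
print(f"random play first hit the big reward after {episodes_needed} episodes")  # ~2**10 in expectation
```

The exact numbers don't matter; the point is the exponential gap between exploiting the first reward you stumble on and deliberately gathering the information needed to find the one that actually matters.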
And we will get there and I'll have you explain it. Rafa, let's talk for a second about the
personalization service, which is an instantiation of what you call real world reinforcement learning,
yeah?
That's right.
So you characterize it as a general framework for reinforcement learning algorithms that are
suitable for real-world applications. Unpack that a bit. Give us a short primer on real-world
reinforcement learning and why it's an important direction for reinforcement learning in general.
Yeah, I'll give you my version, and I'm sure John will chime
in. But, you know, much of the reinforcement learning that people hear about is almost
always done in a simulated environment where you can be creative as to what you simulate,
and you can generate, you know, gazillions of samples to make your agents work. Our type of
reinforcement learning, John's type of reinforcement learning, is something that
we deploy online. And what drives us, John and I, is to create or use this methodology to solve
real-world problems. And our goal is really to advance the science in order to help enterprises
maximize their business objective through the usage of real-world reinforcement
learning. So when I say real-world, these are models that we deploy in production with real
users getting real feedback and they learn on the job. Well, John, talk a little bit about
what Rafa has alluded to. There's an online real-world element to it, but prior to
this, reinforcement learning has had some big investments in the gaming space. Tell us the
difference and what happens when you move from a very closed environment to a very open environment
from a technical perspective. Yeah, so I guess the first thing to understand is why you'd want to do this,
because if reinforcement learning in simulators works great, then why do you need to do something
else? And I guess the answer is there are many things you just can't simulate. So an example
that I often give in talks is, would I be interested in a news article about Ukraine?
The answer is yes, because my wife is from Ukraine. But you would never know this.
Your simulator would never know this. There'd be no way for the policy to actually learn that
if you're learning in a simulator. So there are many problems where there are no good simulators.
And in those situations, you don't have a choice. So given that you don't have a choice,
you need to embrace the difficulties of the problem. So what are the difficulties of the
real-world reinforcement learning problems? Well, you don't have zillions of examples,
which are typically required for many of the existing deep reinforcement learning algorithms.
You need to be careful about how you use your samples. You need to use them with the
utmost efficiency in trying to do the learning.
Another element that happens is often when people
have simulators, those simulators are kind of
effectively stationary.
They stay the same throughout the process of training.
But in many of the real-world problems we encounter,
we run into all kinds of non-stationarities;
there are exogenous events.
The algorithms need to be very robust.
So the combination of using samples very efficiently and great robustness in these algorithms are kind of key offsetting elements from what you might see in other places.
Which is challenging. What about AlphaGo or Ms. Pac-Man or the other games that have been sort of flags waved about our progress in reinforcement learning?
I think those are fun applications. I really enjoy reading about them and learning about them.
I think it's a great demonstration of where the field has gotten, but I feel like this is the
issue of AI winter, right? So there was once a time when AI crashed, and that may happen again
because AI is now a buzzword. But I think it's important that we actually do things that have some real value in the world, which actually affect people's lives, because that's what creates a lasting wave of innovation and puts civilization into a new place.
Right.
So that's what I'm really seeking.
What season are we in now? I've heard there has been more than one AI winter and some people are saying it's AI spring.
I don't know.
Where do you see us in terms of that progress?
I think it's fair to say that there's a lot of froth in terms of people claiming things that are not going to come to pass. At the same time,
there is real value being created. Suddenly we can do things and things work better through some of
these techniques. And so there's kind of this mishmash of over-promised things that are going
to fail and there are things that are not over-promised and they will succeed. And so if
there's enough of those that succeed, then maybe you don't have a winter. Maybe it just becomes a long summer.
Like San Diego all the time.
Yeah, but I think to comment on John's point here, I think reinforcement learning is a nascent
technique compared to supervised learning. And what's important is to do the crawl, walk, run, right? So yeah, it's sexy
now and people are talking about it, but we need to rein it in from a business perspective as to,
you know, what are the classes of problems that we can satisfy the business leader with and satisfy
them effectively, right? And I think from a reinforcement learning perspective, John, correct me,
we are very much at the crawl phase in solving generic business problems.
I mean, we have solved some generic business problems, but we don't have widely deployed or deployable platforms for reusing those solutions over and over again.
And it's so easy to imagine many more applications than people have even tried.
So we're nowhere near a mature phase in terms of even simple kinds of reinforcement learning.
We are ramping up in our ability to solve real-world reinforcement learning problems.
And there's a huge ramp still to happen.
Heading towards your goal of solving machine learning.
Yes. But I mean, to be fair, though, we can actually satisfy some classes of
problems really well with nascent technology. So yes, we are nascent and the world is out there
for us to conquer. But I think we do have techniques that can solve a whole swath of
problems. And it's up to us to harvest that.
Well, let's continue the thread a little bit on the research areas of reinforcement learning.
And there's several that seem to be gaining traction.
Let's go sort of high level and talk about this one area that you're saying is basically creating a new foundation for reinforcement learning. What's wrong with the current foundation? What do we need the new
foundation for and what are you doing? The current foundation of reinforcement learning is called a
Markov decision process. The idea in a Markov decision process is that you have states and
actions and given a state, you take an action, then you have some distribution over the next
state. So that's kind of what the foundation is. It's how everybody describes your solutions.
And the core issue with this is that there are no good solutions when you have a large number
of states. All solutions kind of scale with the number of states. And so if you
have a small number of possible observations about the world, then you can employ these
theoretically motivated reinforcement learning algorithms, which are provably efficient,
and they will work well. But in the real world, you have a megapixel camera, which has two to the
one million or 16 to the one million possible inputs.
And so you never encounter the same thing twice.
And so you just can't even apply these algorithms.
It doesn't even make sense.
It's ridiculous.
So when I was a young graduate student, I was, of course, learning about Markov decision
processes and trying to figure out how to solve reinforcement learning better with them.
And then at some point after we had a breakthrough,
I realized that the breakthrough was meaningless
because it was all about these Markov decision processes.
And no matter what, it just never was going to get to the point
where you could actually do something useful.
So around 2007, I decided to start working on contextual bandits.
This is an expansion of what reinforcement
learning means in one sense, but a restriction in another sense. So instead of caring about
the reward of a long sequence of actions, we're going to care about the reward of the next action.
So that's a big simplification. On the other hand, instead of caring about the state,
we're going to care about an observation, and we're going to demand that our algorithms don't depend upon the number of possible observations,
just like they do in supervised learning.
So we studied this for several years.
We discovered how to create statistically efficient algorithms for these kinds of problems.
So that's kind of the foundation of the systems that we've been working on.
And then more recently, after cracking these contextual bandit problems, we wanted to address
a larger piece of reinforcement learning.
So now we're thinking about contextual decision processes, where you have a sequence of rounds,
and on each round you see some observation, you choose some action, and then you do that
again and again and
again. And then at the end of an episode, maybe 10 steps, maybe 100, you get a reward, right? So
there's some long delayed reward dependent upon all the actions you've taken and all the observations
you've made. And now it turns out that when these observations are generated by some small underlying
state space, which you do not know in advance and which is never told to you,
you can still learn.
You can still do reinforcement learning.
You can efficiently discover what a good policy is globally.
So the new foundation of reinforcement learning is about
creating a foundation for reinforcement learning algorithms
that can cope with a megapixel camera as an observation
rather than having like 10 discrete or 100 discrete states.
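For readers who want to see the shape of this, here is a minimal illustrative sketch of the contextual bandit loop John describes (my own toy code, not the team's system; the feature sizes, reward function, exploration rate, and learning rate are all made up). The learner keeps one small linear model per action rather than anything indexed by states, explores with epsilon-greedy, observes a reward only for the chosen action, and records the probability with which it acted alongside the context, action, and reward.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS, N_FEATURES, EPSILON = 3, 5, 0.2

# One linear reward model per action; the weights are all the learner keeps,
# so nothing here scales with the number of possible observations.
weights = np.zeros((N_ACTIONS, N_FEATURES))
logged = []  # each entry is the quad: (context, action, reward, probability)

def choose(context):
    """Epsilon-greedy over predicted rewards; return the action and its probability."""
    greedy = int(np.argmax(weights @ context))
    probs = np.full(N_ACTIONS, EPSILON / N_ACTIONS)
    probs[greedy] += 1.0 - EPSILON
    action = int(rng.choice(N_ACTIONS, p=probs))
    return action, probs[action]

def true_reward(context, action):
    # Stand-in for the real world (a click, a watch, ...); hidden from the learner.
    hidden = np.array([[1, 0, 0, 1, 0], [0, 1, 0, 0, 1], [0, 0, 1, 1, 1]], dtype=float)
    return float(hidden[action] @ context > 1.0)

for t in range(5000):
    context = rng.random(N_FEATURES)           # the observation (e.g., user features)
    action, prob = choose(context)
    reward = true_reward(context, action)      # feedback for the chosen action only
    logged.append((context, action, reward, prob))
    # Online update of the chosen action's model toward the observed reward.
    error = weights[action] @ context - reward
    weights[action] -= 0.05 * error * context

print("average reward over the last 1000 rounds:",
      round(float(np.mean([r for _, _, r, _ in logged[-1000:]])), 3))
```

The logged quad at the end is what makes the offline experimentation discussed later in the episode possible.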
And you're getting some good traction with this approach?
Yeah. I mean, contextual bandits are deployed in the real world and being used in many places
at this point. There's every reason to believe that if we can crack contextual decision processes,
which is our current agenda, that will be of great use as well.
Rafa, at their core, reinforcement learning systems are designed to be self-improving
and to kind of learn from the real world like humans do.
Yes.
Or puppies.
Or puppies.
And the real world is uncertain and risky.
Yes.
So how do you, from your perspective or from your angle, build trust with the customers that you interact with, both third-party and first-party customers, who are giving you access to their own real-life traffic online?
Yeah, this is an important topic when we start looking at how we do incubations in our team.
And we have a specific challenge, as you were saying, because if we were a supervised learning
model, we would go to a customer and say, hey, you know, give me a data set.
I'll run my algorithm.
If it improves, you deploy it.
We deploy it in an A-B test.
And if we are good, you're good to go.
Our system is deployed in production.
So here we are with customers and talking to them about advanced machine learning techniques from research.
And we want to deploy them in their online production system.
So as you can imagine, it becomes an interesting conversation.
So the way we approach this actually is by taking ideas from
product teams. So when we went and did our incubations, we did it with a hardened prototype,
meaning this is a prototype that's not your typical stitched up Python code that, you know, is hacky. We took a fair amount of time to harden it to the degree that
if you run it in production, it's not going to crash your customers' online production system.
So that's number one. And then when we approach customers, our system learns from the real world
and you do need a certain amount of traffic because our models are like
newborn puppies. They don't know any tricks. So you need to give them information in order to learn.
But what we typically do is we have a conversation with our customer and say, hey, you know,
yes, this is research, but it is hardened prototype. That's number one. And two, we use previous incubations as reference to newer ones. We borrow
ideas from how products go sell their prototypes, right? And then we, as a methodology, say to
customers when they have large volumes of traffic to give us a portion of their traffic, which is good enough for us to learn and prove the ROI,
but small enough for them to de-risk. And that methodology has worked very well for us.
De-risk is such a good word. Let's go a little further on that thread.
Talk a little bit about the cold start versus the warm start when you're deploying. So that's another
interesting conversation with our customers, especially those that are used to supervised
learning, where you train your model, right, with a lot of data and you deploy it and it's already
learned something. Our models in our personalization service start really cold. But the way John and the teams created those algorithms
allows us to learn very fast.
And the more traffic you give it, the faster it learns.
So I'll give you an example.
We deployed a pilot with Xbox top of home
where we were personalizing two of the three slots
or four slots that they have on the top of home.
And Xbox gets millions of events per day. So with only 6 million events per day, which is a fraction
of Xbox traffic, in about a couple of hours, we went from cold to very warm. So again, from a
de-risking with these conversations with our customers, first or third parties, we tend to say, yes,
it's cold start, but these algorithms learn super fast. And there's a certain amount of traffic flow
that enables that efficient learning. So we haven't had major problems. We start by making
our customers understand how the system works. And we go from there. Are there instances where you're coming into a warm start where there's some
existing data or infrastructure?
Yeah, so that definitely happens. It's typically more trouble than it's worth to actually use pre-existing data because when
you're training in a contextual bandit, you really need to capture four things,
the features, the action, the reward for the action, and then the probability of taking the
action. And almost always the probability is not recorded in any kind of reliable way,
if it was even randomized previously. So given that you lack one of those things,
there are ways to try to repair that. They kind of work, but they're kind of a pain.
They're not the kind of thing that you can do in an automatic fashion.
So typically we want to start with recording our own data so we can be sure that it is, in fact, good data.
Now, with that said, there are many techniques for taking into account pre-existing models, right? So we actually have a paper now on arXiv talking
about how to combine an existing supervised data source with a contextual bandit data source.
Another approach, which is commonly very helpful, is people may have an existing supervised system,
which may be very complex, and they may have built up a lot of features around that,
which may not even be appropriate.
Often there's a process around any kind of real system
where the learning algorithm and the features are kind of co-evolving,
and so moving away from either of them causes a degradation in performance.
So in that kind of situation, what you'd want to do is
tap the existing supervised models
to extract features which are very powerful,
and then, given those very powerful features, you can very quickly get to a
good solution. The exact mechanism of that extraction is going
to depend upon the representation that you're using. With a neural network, you
kind of rip off the top layer and use that. With a decision tree, or a boosted
decision tree, or a decision forest, you can use the leaf membership as a feature that you can then feed in for a very fast warm-up of a contextual bandit learner.
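Here is a hedged sketch of the tree-based warm start John mentions, using scikit-learn purely for illustration (the data, model, and sizes are invented; the neural-network variant would instead take the activations just below the output layer). The existing forest's leaf memberships become a one-hot feature vector that a contextual bandit learner could then consume.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

# Pretend this is the customer's existing supervised model and its training data.
rng = np.random.default_rng(0)
X_old = rng.random((2000, 20))
y_old = (X_old[:, 0] + X_old[:, 1] > 1.0).astype(int)
forest = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0).fit(X_old, y_old)

# apply() reports which leaf each tree routes an example to: one column per tree.
encoder = OneHotEncoder(handle_unknown="ignore").fit(forest.apply(X_old))

def warm_features(x):
    """Leaf-membership features for a single new observation x of shape (20,)."""
    leaves = forest.apply(x.reshape(1, -1))              # shape (1, n_estimators)
    return encoder.transform(leaves).toarray().ravel()   # one-hot leaf indicators

x_new = rng.random(20)
print("warm-start feature vector length:", warm_features(x_new).shape[0])
# These leaf features would be fed to the contextual bandit learner in place of
# (or alongside) the raw features, so it can reach a good policy with far fewer
# online samples than starting from scratch.
```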
John, talk about offline experimentation. What's going on there?
Yeah, so this is one of the really cool things that's possible when you're doing
shallow kinds of reinforcement learning with maybe one step or maybe two steps.
So if you record that quad of features, action, reward, and the probability, then it becomes possible to evaluate any policy that chooses amongst the set of available actions.
Okay, so what that means is that if you record this data, and then later you discover that maybe a different learning rate was helpful,
or maybe you should be taking this feature and that feature and combining them to make a new feature.
You can test to see exactly how that would have performed if you had deployed that policy at the time you're collecting data.
So this is amazing, because this means that you no longer need to use an A-B
test for the purpose of optimization. You still have reasons to use it for the purpose of safety,
but for optimization, you can do that offline in a minute rather than doing it online for two weeks
waiting to get the data necessary to actually learn. Yeah, just to pick up on why is this a gold nugget? Data scientists
spend a fair amount of time today designing models a priori and testing them in A-B tests,
only to learn two weeks after that they failed and they go back to ground zero.
So here you're running hundreds, if not thousands, of A-B tests on the spot. And when we talk about this to data scientists and enterprises, their
eyes light up. I mean, that is one of the key features of our system that just brightens the
day for many data scientists. It's a real pain for them to design models, run them in A-B. It's
very costly as well. So talk about productivity gains. It's immense when you
can run 100 to 200 A-B tests in a minute versus running one A-B test for two weeks.
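For readers curious why the logged probability makes this offline evaluation work, here is a minimal sketch of the inverse-propensity-score estimator on synthetic logs (my own illustration, not the production system; the logging policy and reward function are invented). Rounds where a candidate policy agrees with the logged action are reweighted by one over the logged probability, giving an unbiased estimate of how that policy would have performed on the same traffic.

```python
import numpy as np

rng = np.random.default_rng(1)
N_ACTIONS = 3

def logging_policy(context):
    """The system that was actually running: here, uniform exploration."""
    probs = np.full(N_ACTIONS, 1.0 / N_ACTIONS)
    action = int(rng.choice(N_ACTIONS, p=probs))
    return action, probs[action]

def true_reward(context, action):
    return float(context[action] > 0.5)        # hidden "world", used only to simulate logs

# Logged data in the quad form John describes: (context, action, reward, probability).
log = []
for _ in range(20000):
    context = rng.random(N_ACTIONS)
    action, prob = logging_policy(context)
    log.append((context, action, true_reward(context, action), prob))

def ips_value(policy, log):
    """Estimate the average reward `policy` would have earned on the logged traffic."""
    total = 0.0
    for context, action, reward, prob in log:
        if policy(context) == action:          # only rounds where the policies agree count...
            total += reward / prob             # ...reweighted by 1 / logged probability
    return total / len(log)

candidate = lambda context: int(np.argmax(context))   # a new policy to test offline
print("estimated value of the candidate policy:", round(ips_value(candidate, log), 3))
print("realized value of the logging policy:   ", round(float(np.mean([e[2] for e in log])), 3))
```

Trying a different candidate policy is just another call to ips_value on the same logs, which is what lets hundreds of variants be compared in minutes rather than one per two-week A-B test.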
Rafa, you work as a program manager within a research organization.
Yes.
And it's your job to bring science to the people.
Yes.
Talk about your process of research-incubate-transfer in a little more detail, in the context of how you develop RL prototypes, engineer them, and test them.
And specifically, maybe you could explain a couple of examples of this process of deployments that are out there already.
How are you living up to the code?
We have a decent size engineering team that supports our RL efforts in MSR.
And our job is twofold.
One is to, from a program management perspective, it's to really drive what it means to go from an algorithm to a prototype
and then validate whether that prototype has any market potential.
I take it upon myself as a program manager: researchers are creating these wonderful
academic papers with great algorithms, and some of them may have huge market potential. So this
market analysis happens actually in MSR. And we ask ourselves,
great algorithm, what are the classes of problems we can solve for it? And would people like relate
to these problems such that we could actually go and incubate them? And the incubation is a
validation of this market hypothesis. So that's what we do in our incubations. We are actually
trying to see whether this is something that we could potentially tech transfer to the product
team. And we've done this with contextual bandits in the context of personalization scenarios. So
contextual bandits is a technique, right? And so we ask ourselves, okay, with this technique, what classes of problems
can we solve very efficiently? And personalization was one of them. And we went and incubated it
first with MSN. Actually, John and the team incubated it with MSN first, and they got
a 26% lift. That's multi-million dollar revenue potential. So from a market potential, it really
made sense. So we went and said, okay, one customer is not statistically significant,
so we need to do more. And we spent a fair amount of time actually validating this idea
and validating the different types of personalization. So MSN was a news article personalization.
Recently, we did a page layout personalization with Surface.com Japan,
where they had four boxes on Surface.com Japan,
and they were wondering how to present these boxes based on the user that was visiting that page.
And guess what? We gave them 2,500 events.
So it was a short run pilot that we did with them.
We gave them an 80% lift, 80.
They were flabbergasted.
They couldn't believe.
And this was run on an A-B test.
So they had their page layout that their designers had specified for them, for all users,
running as the control. And they had our personalization engine running with our contextual bandit algorithm. And they ran it.
And for us, you know, 2,500 samples is not really a lot. But even with that, we gave them an 80% lift over their control. So these are the kinds of incubation that
when we go to our sister product team in Redmond and tell the story, they get super excited that
this could be a whole class of applications.
We talk a lot today about diversity, and that often means having different people on the team.
But there's other aspects, especially in reinforcement learning, that include diversity of perspective and approach. How do you address this in the work you're doing and how do you practically manage it?
One thing to understand is that research is an extreme sport in many ways. You're trying to do
something which nobody has ever done before. And so you need an environment that supports you in
doing this in many ways. It's hard for a single researcher to have all the abilities that are needed to succeed.
When you're learning to do research, you're typically learning a very narrow thing.
And over time, maybe that gets a little bit broader,
but it's still going to be the case that you just know a very narrow perspective
on how to solve a problem.
So one of the things that we actually do is on a weekly basis, we have an open problems discussion where a group of researchers gets together and one of them talks about the problem that they're interested in.
And then other people can chime in and say, oh, maybe we should look at it this way or think about it that way.
That helps, I think, sharpen the problems. And then in the process of solving problems, amazing things come up in discussion, but they can only come up
if you can listen to each other. And I guess the people that I prefer to work with are the ones who
listen carefully. There's a process of bouncing ideas off each other, discovering the flaws in
them, figuring out how to get around the flaws. This process can go
on. It's indefinite, but sometimes it lands. And when it lands, that moment when you discover
something, that's really something. Rafa, do you have anything to add to that?
So when I think about diversity in our lab, I think that to complement what John's saying,
I like to always also think about the diversity of disciplines.
So in our lab, we're not a big lab, but we have researchers, we have engineers, we have designers, and we have program managers.
And I think these skill sets are diverse, and yet they complement each other so well.
And I think that adds to the richness of what we have in our lab.
In the context of the work you do,
its applications and implications in the real world,
is there anything that keeps you up at night?
Any concerns you're working to mitigate, even as you work to innovate?
I think the answer is yes.
Anybody who understands the potential of machine learning, AI, whatever you want to call it,
understands that there are negative ways to use it, right? It is a tool and we need to try to
use the tool responsibly and we need to mitigate the downsides where we can see them in advance.
So I do wonder about this. Recently, we had a paper on fair machine learning, and we showed that any supervised
learning algorithm can, in a black box fashion, be turned into a fair supervised learning algorithm.
We demonstrated this both theoretically and experimentally. So that's a promising paper
that addresses a narrow piece of ethics around AI, I guess I would say. As we see more opportunities along these lines, we will
solve them. Yeah. Also use these techniques for the social good, right? I mean, as we are trying
to use them to monetize, also we should use them for the social good. How did each of you end up
at Microsoft Research in New York City? This is actually quite a story.
So I used to be at Yahoo Research.
One day, right about now, seven years ago, the head of Yahoo Research quit.
So we decided to essentially sell the New York Lab.
So we created a portfolio of everybody in the New York lab,
15 researchers there. We sent it around to various companies. Microsoft ended up getting 13 out of 15
people. And that was the beginning of Microsoft Research New York.
Rafa, how did you come to Microsoft Research New York City?
He told me I was going to revolutionize the world.
That's why I came over from IBM. So I actually had a wonderful job at IBM applying Watson
technologies for children's education. And one day a Microsoft recruiter called me and they said,
John Langford, renowned RL researcher, is looking for a program manager. You should interview with him. And I'm
like, okay. So I interviewed at Microsoft Research New York, spoke to many people. And at the time,
I, you know, I was comfortable in my job and I had other opportunities. But in his selling pitch to
me, John Langford calls me one day at home and he says, you should choose to come and work for Microsoft Research because we're going to revolutionize the world. And I think it sunk in that we can be at the cusp of something really big, and that got me. John, you started out our interview with, I want to solve machine learning.
Rafa, you have said that your ultimate goal is real world reinforcement learning for everyone.
What does the world look like if each of you is wildly successful?
Yeah, so there's a lot of things that are easy to imagine being a part of the future world that just aren't around now.
You should imagine that every computer interface learns to adapt to you rather than you needing to adapt to the user interface.
You could imagine lots of companies just working better. You could imagine a digital avatar that over time learns to help you book the flights
that you want to book or things like that, right? Often there's a lot of mundane tasks that people
do over and over again. And if you have a system that can record and learn from all the interactions
that you make with computers or with the Internet,
those tasks can happen on your behalf.
That could really ease the lives of people in many different ways.
Lots of things where there's an immediate sense of, oh, that was the right outcome or, oh, that was the wrong outcome, can be addressed with just the technology that we have already. And then there's technologies beyond that,
like the contextual decision processes that I was talking about, that may open up even more
possibilities in the future. Rafa? To me, what a bright future would look like is when we can
cast a lot of the issues that we see today, at the enterprise level and at the personal level, as
reinforcement learning problems that we can actually solve. And more importantly for me,
you know, as we work in technology and we develop all these techniques, the question is,
are we making the world a better world, right? And can we actually solve some hard problems like famine and diseases
with reinforcement learning? And maybe not now, but can it be the bright future that we look out
for? I hope so. I do too. John Langford, Rafah Hosn, thank you for joining us today.
Thank you.
Thank you.
To learn more about Dr. John Langford and Rafah Hosn and the quest to bring reinforcement learning to the real world,
visit Microsoft.com slash research.