Microsoft Research Podcast - 102 - Adaptive systems, machine learning and collaborative AI with Dr. Besmira Nushi

Episode Date: December 11, 2019

With all the buzz surrounding AI, it can be tempting to envision it as a stand-alone entity that optimizes for accuracy and displaces human capabilities. But Dr. Besmira Nushi, a senior researcher in the Adaptive Systems and Interaction group at Microsoft Research, envisions AI as a cooperative entity that enhances human capabilities and optimizes for team performance. On today’s podcast, Dr. Nushi talks about what it takes to develop collaborative AI systems and unpacks the unique challenges machine learning engineers face in their version of the software development cycle. She also reveals why understanding the “terrain of failure” can help researchers develop AI systems that perform as well in the real world as they do in the lab. https://www.microsoft.com/research

Transcript
Starting point is 00:00:00 What I'd like AI to be, I'd like it to be a technology that enables everyone, and that it's built for us, it's built for people. My parents should be able to use it, and environmental scientists should be able to use it and make new discoveries, or a policymaker in order to take good decisions. You're listening to the Microsoft Research Podcast, a show that brings you closer to the cutting edge of technology research and the scientists behind it. I'm your host, Gretchen Huizinga. With all the buzz surrounding AI, it can be tempting to envision it as a standalone entity that optimizes for accuracy and displaces human capabilities. But Dr. Besmira Nushi, a senior researcher in the Adaptive Systems and Interaction Group
Starting point is 00:00:48 at Microsoft Research, envisions AI as a cooperative entity that enhances human capabilities and optimizes for team performance. On today's podcast, Dr. Nushi talks about what it takes to develop collaborative AI systems and unpacks the unique challenges machine learning engineers face in their version of the software development cycle. She also reveals why understanding the terrain of failure can help researchers develop AI systems that perform as well in the real world
Starting point is 00:01:15 as they do in the lab. That and much more on this episode of the Microsoft Research Podcast. Besmira Nushi, welcome to the podcast. Thank you. It's great to be here. I've been following the podcast in the last year. And you know, it's always interesting. Every new episode is different. You've been following the podcast. I have. Well, I've talked to you before. Last time you were on a Research in Focus panel at Faculty Summit in 2017, and you talked about ML troubleshooting in real-time systems.
Starting point is 00:01:59 Let's go there again. As a senior researcher in the Adaptive Systems and Interaction Group, you work at what you call the intersection of human and machine intelligence. Yep. Which I love. So we'll get to your specific work in a minute. But in broad strokes, what's going on at that intersection? What gets you up in the morning? Well, the intersection is a rich field, and it really goes both ways. It goes into the direction of how can we build systems that learn from human feedback and input and intervention and maybe learn from the way people solve problems and understand the world. And it also goes in the other direction
Starting point is 00:02:38 in like how can we augment the human capabilities by using artificial intelligence systems. How can we make them more productive at work and putting the best of both worlds together? Let's talk a little bit more about this human-AI collaboration. You framed it in terms of complementarity because humans and machines have different strengths and weaknesses. And you've also characterized it as putting humans and machines together to, quote unquote, optimize for team performance. Elaborate on that for us. How should we understand AI as collaborator versus AI designed to work on its own? You know, people and algorithms, they have very different skills. We're really good in reasoning and imagination. And
Starting point is 00:03:25 machines are good in processing these terabytes of data for us and giving us these patterns. However, you know, if we can use the machine capabilities in an efficient way, we can be quicker and faster, as I said. But then on the other hand, you know, these are concepts that if you think deep about it, they're not that new. In the sense that when we invented personal computing in the 80s, this is one of the reasons why it became so successful, because the personal computer was suddenly this body that could help you do things faster and quicker. But then there is another thing that enabled that development in those years. And really, I think that that is the field of human-computer interaction. What HCI did in those years is that it made the interface understandable from a human perspective, and it really made the computation technology accessible for everybody.
Starting point is 00:04:20 So now we see billions of people around the world that use some form of computation without making any significant effort. Right. And I think that today we are in front of such forms of developments in artificial intelligence. We are in a way in the position that we can innovate in the way how people interact with AI technologies. But we still need to make that leap and make AI accessible for users. And this is what I mean by the fact that so far we have been optimizing AI for performance only and performance when the AI is designed to play alone in the field. But if it has to play together with a human, there are other scores that we need to
Starting point is 00:05:05 think about. For example, one of them is interpretability, in that people should be able to understand how a machine makes a prediction. Another one that we focus a lot on is predictability of errors. And what this really means is that if I'm working with an AI algorithm, I should be able to kind of understand that that algorithm is going to make mistakes. And this is important for me as a user because if I have the agency to make the final decision at the end, I need to know when it's right or wrong so that I can correct it at the right time as it goes. Let's drill in on the topic of AI as collaborator then. We've talked a little bit about AI working alone and it's designed to optimize for performance and speed. How do you then go about training ML models with collaborative properties in mind instead of optimizing for speed and performance? What are the tradeoffs in the algorithmic design,
Starting point is 00:06:09 and how do you go about enforcing them? Right, right. Yeah, so you're right that it is always a trade-off. It is a trade-off for the machine learning developer to decide which model to deploy. Should I deploy a model that is fully accurate on its own, or a model that optimizes team performance? But the difficulty is that this trade-off is not always as visible or as easy to access. So I see the work that we do in our group as an enabling paradigm that allows you to explore this trade-off or in a way to extend it and show it to the developer so that the developer can make the right choices. And there are two ways you can go about this. The first one is what happens during grid search.
Starting point is 00:06:52 So in machine learning, we call grid search this process where you try to search through many, many, many parameters. And through this search, you try to find the model that pleases you the most in that it is accurate. But what we suggest is that you should also be looking at these other scores that work for collaboration, like predictability of errors. So the second way you can go about this is to include these definitions in the training objective itself. And we have been doing more and more work in this part because we think that this explores the trade-off even more.
Starting point is 00:07:27 It extends it. It gives you a rich perspective of how many other parameters you can optimize and augment the objective function in a way that it should think about accuracy, but it should also think, with a certain factor, about these human collaboration scores. And the best way to go is to do both during training and during grid search so that you really get the algorithm that works best for humans. So you've talked a bit about enforcing properties within the algorithmic design. Unpack that a little bit more. Yeah, so the enforcement usually really comes from the optimization stage. During optimization in machine learning, we model a loss function. And this loss function is the function that gives a signal to the training algorithm on how well it is doing, whether it
Starting point is 00:08:19 is a good model or a bad model. And this is really the only signal that we get. And it's computed on the data, right? So we're saying that signal should be augmented with these human collaboration scores and be put together so that when we train the algorithm, these properties get enforced. The other way you could do it is by adding constraints in the training algorithm, saying that whatever you do, whatever model you find, you shouldn't go below or above a particular score.
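To make the two routes Nushi describes a little more concrete, here is a minimal sketch of an augmented training objective, assuming PyTorch and a plain classification setup. The specific collaboration term, a penalty on confident mistakes on the intuition that low-confidence errors are easier for a human teammate to anticipate and override, and the weight lam are illustrative assumptions, not the group's actual formulation.

```python
import torch.nn.functional as F

def collaboration_aware_loss(logits, targets, lam=0.5):
    # Standard task loss: how accurate the model is on its own.
    task_loss = F.cross_entropy(logits, targets)

    # Illustrative "human collaboration" term: penalize predictions that are
    # both wrong and highly confident, since confident errors are the hardest
    # for a human partner to predict and correct. This is a stand-in for a
    # predictability-of-errors score, not the actual metric from the research.
    probs = F.softmax(logits, dim=1)
    confidence, predictions = probs.max(dim=1)
    wrong = (predictions != targets).float()
    unpredictable_errors = (wrong * confidence).mean()

    # lam is the knob that trades pure accuracy against the collaboration score.
    return task_loss + lam * unpredictable_errors
```

The same quantity could also be used the other way Nushi mentions: as a filter during grid search, discarding any candidate model whose collaboration score falls outside an acceptable range, or simply as a metric reported next to accuracy when choosing which model to deploy.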
Starting point is 00:09:21 Well, let's turn our attention to the methodologies and tools you're working on for machine learning systems. And before we get specific, I'd like you to give us a general overview of the software development cycle writ large, talking about the differences between traditional software development and software engineering for ML systems. Yeah. So machine learning has really been until recently a research-like field in that the algorithms were in the lab and they were accessed by the machine learning scientists. But now that we have machine learning deployed out there in the field in different products, machine learning software is being combined with traditional software. However, these two are very different in nature, in that machine learning software can be non-deterministic. It may be hard for you to say what it is going to do. And it also may have a black box nature in that it's hard to understand what exactly it is going to say. And also it may make these types of mistakes that do not happen because of bugs in the code. It may just be
Starting point is 00:10:05 because the data that you are training the algorithm upon might either be insufficient, in that there is not enough to learn what you want to learn, or it may not quite resemble what is out there in the real world. It is just lab data, and it's never going to be as rich as the world that we live in. And also needless to say, this is a very subtle difference, but we often forget about it, is that in machine learning, we don't really write the code that is going to execute the program. We write the algorithm that is going to process the data and come up with a function that is the actual code.
Starting point is 00:10:43 So all those differences make the process very, very different. It is very data dependent. And in traditional software engineering, for example, we didn't have these parts of the life cycle that we currently care so much about. Like, for example, data collection and data cleaning, if it has mistakes. In fact, like according to a study that we recently did at Microsoft, collection and cleaning takes at least 50% of the time from a machine learning engineer. And, you know, this is significant.
Starting point is 00:11:14 It's a lot of time that is spent on these new stages. The other thing is versioning. So in traditional software, we know very well how to do versioning. We have tools like GitHub and other versioning tools that do that for us. But in machine learning, we have to version not only code, we have to version the model, we have to version the data and the parameters of the model. So there are all these things that are entangled together and that we need to version in the right way.
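As a rough sketch of the "version everything together" point, assuming nothing more than the Python standard library, the snippet below writes one manifest that ties a training run's code commit, data fingerprint, hyperparameters, and model artifact together. The file layout and field names are hypothetical, not a description of any Microsoft tooling.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def file_sha256(path: Path) -> str:
    # Fingerprint a file (dataset snapshot, serialized model, etc.).
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(data_path: Path, model_path: Path, params: dict,
                   code_commit: str, out_path: Path) -> None:
    # One record that links code, data, model, and parameters, so a result
    # can be traced back to the exact combination that produced it.
    manifest = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "code_commit": code_commit,            # e.g. the git SHA of the training code
        "data_sha256": file_sha256(data_path),
        "model_sha256": file_sha256(model_path),
        "hyperparameters": params,             # learning rate, architecture, etc.
    }
    out_path.write_text(json.dumps(manifest, indent=2))
```

In practice this is the kind of bookkeeping that dedicated experiment-tracking tools take care of, but the point is the same: the model and the data get versioned alongside the code, not instead of it.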
Starting point is 00:11:53 Okay, so let's go a little deeper there. You've got your traditional software engineer that is very comfortable with how this all works. And now you've got machine learning engineers that are adding on layer upon layer to this software development cycle. How are you addressing that? So far, what we have done is we've started with first understanding the needs of machine learning engineers, and understanding their background as well, you know, because machine learning engineers, they may come from different fields. Some of them may not have a computer science background. They may be data scientists. They may be statisticians. And the practices that are used in statistics and computer science may be very, very different. Within the same team, you may have people with so many different backgrounds, and you need to put
Starting point is 00:12:37 them together to speak the same language. So we started by trying to understand what is their background and what are their problems. And the number one challenge that they have is to have end-to-end support in tooling that can support all these different stages in the lifecycle. It will take some time, but I think we're getting closer. Well, have you laid out the stages as it were? I mean, the software development cycle is very precise. Yeah. And the machine learning cycle is a lot bigger, isn't it? It is. So we have defined a few stages and there's other work that has tried to do the same thing. We have stages like data collection, data cleaning, model training, feature engineering, and then model monitoring and debugging and
Starting point is 00:13:28 maintenance. So these are kind of the main stages, if I didn't forget any of them. But what is different? There's something that is very interesting in the difference between the two, is that the machine learning lifecycle is very experimental in that it is a little bit of trial and error in a way. This grid search that I mentioned earlier, it's a little bit of trial and error. You have to try different things to see whether it works for your domain. Maybe you clean the data once more. Maybe you add some more features or a different representation. So there is a lot of experimentation. And when there is a
Starting point is 00:14:02 lot of experimentation, there is a lot of uncertainty. You don't know as an engineer whether it's going to work or not. So it has changed even the way we plan and manage projects. Well, let's go a little deeper and talk about that troubleshooting and debugging that you're working on. It's a key challenge for all software systems, but it's particularly challenging for debugging a black box. Yeah. And especially for complex mission and safety critical software and hardware, which you're dealing with all the time in the real world. So how do you go about, let's get real, how do you go about designing methodologies for
Starting point is 00:14:40 the debugging phase in the ML development cycle? Yeah, it's a topic that is really timely in that if this is deployed in places that are high stakes, like in medicine or autonomous driving, this can really have either a very good impact or a very bad impact on people. Or flying. Or flying, yeah.
Starting point is 00:15:00 Exactly. It can have a different impact on people's lives. One of the things that we say in our work is that good debugging practices start from rigorous evaluation. You know how many times we hear things such as, this model is 90% accurate on a particular benchmark. And we use that one single score to describe the whole performance of one algorithm on the whole data set. Often that single number may hide so many important conditions of failure that we don't know about and that are so important to know if you are an engineer. What we suggest is that that performance
Starting point is 00:15:43 number should be sliced down into different demographics and different groups in the data so that we really understand, is there any pocket in the data that is maybe underrepresented and maybe the error rate is higher? Right. So these are the things that we suggest to do. And then we also continue and build interpretable models in order to explain exactly to the engineer when and how a machine learning model fails. And we often suggest to do this for different groups.
Starting point is 00:16:13 I'll just give you a simple example. We recently were looking at gender recognition software from image photos, and we noticed that when these models are trained only from celebrity data, they have a much higher error rate for women that have short hair, that do not have any eye makeup on, and also do not smile in the photo. It's complicated. It is all these different dimensions that are put together. And the process of finding this particular thing, for example, I would have never thought to go and look for it. But this is what the interpretable model gives to you. And, you know, it takes away a lot of load from the engineer if you can at least automate part of this process. So how are you automating parts of the process?
Starting point is 00:16:59 Yeah, so what we're doing is that we're really gathering this data together and we are asking engineers not only to store the final aggregate number of performance, but we are asking them to give us the performance numbers on each example. And at that point, you become super powerful in that you can put in an interpretable model that can slice and dice the data in the right way. A lot of people just say, I don't want my picture in anybody's data set. And therefore, you're precluding some of the important nuances that you might need in your data to get accurate models. So there is a tension between, you know, being ethical about collecting your data and being accurate in the data. I think as a community and also as an industry, we need to think deep about how to standardize this process.
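A minimal sketch of the per-example bookkeeping and slicing described in this exchange, assuming pandas and scikit-learn and hypothetical column names: record whether each prediction was right, break the single aggregate score down by subgroup, and fit a shallow decision tree over the error indicator so that its branches surface high-error pockets such as the short-hair, no-makeup, not-smiling cohort from the earlier example. This is illustrative only, not the group's tooling.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

def error_report(df: pd.DataFrame, feature_cols, label_col="label",
                 pred_col="prediction"):
    df = df.copy()
    # Per-example outcome instead of one aggregate accuracy number.
    df["is_error"] = (df[pred_col] != df[label_col]).astype(int)

    # 1. Slice the aggregate: error rate and support for every subgroup.
    per_group = (df.groupby(feature_cols)["is_error"]
                   .agg(["mean", "count"])
                   .sort_values("mean", ascending=False))

    # 2. Fit a small, interpretable model on the error indicator; its branches
    #    describe conditions under which failures concentrate.
    X = pd.get_dummies(df[feature_cols])
    tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20)
    tree.fit(X, df["is_error"])
    rules = export_text(tree, feature_names=list(X.columns))

    return per_group, rules
```

Called on a validation set with columns such as hair_length, eye_makeup, and smiling, this would return both a per-group error table and a readable set of rules, which is roughly the kind of signal an engineer needs before trusting a single 90% headline number.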
Starting point is 00:18:31 Well, as we've just kind of laid out, it's hard out there for a machine learning engineer. These people need a whole new tool belt for the job. How is your research in adaptive systems helping to equip them with the tools they need for an AI world? These methodologies that I just mentioned, in the last two years, we have worked hard with many people at MSR AI, but also in Microsoft Cognition, to build concrete tools that can automate part of this process. And the tool that we are building now, it's called Error Terrain Analysis. And it really helps in understanding the terrain of failure. This is an effort that I'm leading together with Ece Kamar and a lot of people from the Ethics and Society team that cares a lot about these types of problems in the company and broader than that. And really what we are doing with the tool is that we are building a workflow of processes
Starting point is 00:19:10 and modular visualizations that can be put together like Lego pieces so that you can go from one general view of errors to a more detailed one and even more detailed one in looking at particular instances. Let me ask you one question on that, because we talked, I'm hearkening back to our conversation at Faculty Summit, and we talked about how modularity is good, both in software development and for understanding how things work, but it also can have problems in the debugging, because you have these different modules that aren't all connected. And if something goes wrong with one or something is better in one and you've got an older module, it poses new problems. Yeah, it does.
Starting point is 00:19:52 In this case, we're thinking about these modules more as visualization modules, in that first you want to have a large overview of the data. And this would be like one module. And then you want to drill down into the other ones so that you do not get overwhelmed as an engineer, so that it is not too much information for you. Okay. Go back to a phrase you just used, the terrain of failure. Unpack that for us. If you think about it like, you know, as a set of mountains and hills and seaside, there are cases when, you know, the terrain of failure is really calm in parts of the data in that the examples are easy to classify. There is nothing special about them and everything is flat. And there are other cases where the data is so rich, there is so much diversity in the data, in like demographics or other properties, where the error can fluctuate a lot.
Starting point is 00:20:50 And we want to feel that terrain and to really understand what it looks like. That's one of the most evocative phrases I've heard. What other kinds of tools do ML engineers need that are being worked on sort of upstream in the research community? I'd like to mention a set of other tools that we are also building in the group. One of them is called InterpretML, and this is work that is led by Rich Caruana in the Adaptive Systems and Interaction Group. They're really building a tool set for building interpretable models and generating explanations from these models. Yet another tool is called, this is shiny new, it's called TensorWatch. And this is built by Shital Shah, who built this tool
Starting point is 00:21:37 for doing real-time debugging so that you can see the errors and the training loss of machine learning models on the fly. That said, I think that there is still a lot to do when it comes to stitching all this together into one single framework. And as I said, we need an end-to-end framework in versioning, in data provenance, data documentation, and in tools that can allow us to take these insights that we get from troubleshooting and debugging and integrate them back into the system for fixing them. And I will not claim that everything is going to be automated, but at least there is like
Starting point is 00:22:19 a workflow and a process if that happens. Well, at this point, I'll take good over automated. Right, right, right. Yep, yep. Well, hype notwithstanding, AI is still highly dependent on people. And I'm not sure that's a bad thing. I think that might be a good thing. Why does ML add a difficult layer to this idea of self-healing software? That's one of the things you talked about at Faculty Summit, where one component fixes another based on feedback. And how can strong causal reasoning tools and counterfactual analysis tools help us better understand what went wrong? Yeah, it is hard to heal machine learning software, but it is even harder to heal a system that has many, many machine learning components that are tied together. And the reason why that is difficult is because sometimes it is hard to understand the different dynamics and interactions between the components. We've done this work that I also talked about during the Faculty Summit on generating counterfactuals in the subcomponents in order to understand how these differences in the subcomponents
Starting point is 00:23:31 affect the larger systems. And again, we are using human intervention to generate these counterfactuals for us so that we can understand the dynamics better. Debadeepta Dey is starting a new stream of work in this space in order to optimize large systems that are integrative and to optimize them on the fly in real time. So this is something new that is happening.
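A toy sketch of the component-level counterfactual analysis just described, assuming a pipeline of chained components and written with plain Python placeholders rather than any real system: override one component's output with a counterfactual, which in the group's setting would come from human intervention, and measure how the end-to-end metric moves.

```python
def run_pipeline(components, example, counterfactuals=None):
    # components: ordered list of (name, callable) pairs; each callable maps
    # the previous stage's output to this stage's output.
    counterfactuals = counterfactuals or {}
    state = example
    for name, component in components:
        if name in counterfactuals:
            # Counterfactual intervention: use the provided output for this
            # component instead of whatever it would have produced itself.
            state = counterfactuals[name]
        else:
            state = component(state)
    return state

def component_blame(components, dataset, metric, counterfactual_fn):
    # For each component, estimate how much the end-to-end metric would
    # improve if that component alone behaved according to the counterfactual
    # (for example, a human correction) on every example.
    blame = {}
    for name, _ in components:
        gain = 0.0
        for example, truth in dataset:
            base = metric(run_pipeline(components, example), truth)
            fixed = metric(
                run_pipeline(components, example,
                             counterfactuals={name: counterfactual_fn(name, example)}),
                truth)
            gain += fixed - base
        blame[name] = gain / len(dataset)
    return blame
```

The gap between the base score and the "fixed" score gives a rough estimate of how much each component contributes to end-to-end failures, which is the kind of signal needed before deciding which component to retrain or update.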
Starting point is 00:24:12 In overall though, you know, the good news about causal reasoning in these systems for debugging particularly is that as opposed to other fields, like for example in medicine, we can actually run the system again. If we want to apply a fix and see how that works, we can apply the fix and see what the impact is, which is something that you cannot easily do in other fields. So that's good. The not as good news is that we still have to understand the dynamics of the components and we have to understand the data distribution, how the data is generated, in order to make the right assumptions when we do causal reasoning. Let's talk about Debadeepta's work, and it's all of your work, but he's like a focal point there. And he talked a little bit about this at Faculty Summit as well. These big systems with many parts and pieces, and you've got to be able to troubleshoot and debug in real time. I want you to talk a little bit more about how that variable changes the game. Yeah, so in his work, Debadeepta talks about things such as you might get new instances that are running through the system that the system has never seen before. It doesn't know how to optimize for these new instances. But by using the technique that he's building with reinforcement learning and off-policy learning, you can really try to adapt with fewer examples
Starting point is 00:25:23 and try to manage these instances that, you know, are not that well known for the system. Right. And so that's real world, real time life, which is what humans are good at, is adapting. And machines are still a ways back. It's kind of, yeah, adapting to an unknown world in a way. The uncertainty. All right. Well, I always like to know where research is on the delivery spectrum. And on a scale of 10 years out or more to already shipped, where can we find what I would call trustworthy AI in products and services now? And given what you've told us, how confident should we be that it's going to work as advertised? Yeah, so I think that there exists some grand opportunities for us as researchers to work with engineers together in order to really improve this tooling aspect for allowing rigorous evaluation and debugging.
Starting point is 00:26:26 And I think that if we put the right effort and if we do this the right way, we can really make progress in five years. In order to not really solve the generic intelligence problem, but in order to be able to make the right promises to the user. You know, one of the problems that we currently have is that we cannot really promise to the user or specify the performance of a system. We need to learn still how to do that and how to debug the bad cases. So if we kind of go in both ends, if we are able to explain the performance in the right way and also understand it in the right way, we can kind of meet in the middle with the user and set the expectations right. I would think it would be really important at this point to manage expectations, as they say, in terms of what you referred to as promises that you make to the user. So what are you doing in terms of communication and education about what you want to put out there in these real-time systems?
Starting point is 00:27:20 Yeah, so exactly one of the things that we'd like to do is to be able to generate these types of reports that can describe the behavior of the system under different conditions, for example, when you stand in a particular pose, meaning that you shouldn't be able to use the system in those cases. So being able to decompose it and break it down into these cases will set the expectations right for the user, and really to understand, if you see it on paper, here is green, here it's red, you can kind of understand that, well, this system is not perfect, and these are the right cases where I need to be careful. Now I need to ask what keeps you up at night. We could go in a couple directions here, either on the technical front or the social front, or maybe even both. What risks are you trying to mitigate and how are you going about that?
Starting point is 00:28:41 Yeah, so sometimes I wonder whether we are building the right thing. I worry that we end up building things that are isolated from the world and maybe not safe. So what I'd like AI to be, I'd like it to be a technology that enables everyone and that is built for us. It's built for people. My parents should be able to use it, and environmental scientists should be able to use it and make new discoveries, or a policymaker in order to take good decisions. And these are the things that we really have a responsibility for as researchers to make sure that we are building the right thing and that it's safe and it's a technology that we can rely on. Well, there's all kinds of iterations and in different verticals and in different horizontals where we're envisioning our future with AI.
Starting point is 00:29:32 A lot of companies are thinking, how can we do this for businesses, you know, with speech recognition and other places that have maybe some more nefarious purposes for AI. And they're not saying much about it. So is there anything you particularly see? Let's talk about the social front for a second, in terms of what we ought to be thinking now as potential end users of this? I think that there is a big question about how we manage the inventions that come up, either, you know, as academics or as industry, there are decisions that need to be made in terms of, like, how do you monitor and how do you report how you are using a certain product, right?
Starting point is 00:30:14 Because we see these questions coming up even for other technologies that are not really related to intelligence. And there should be some sort of protocol when you buy a certain product as a user to really claim which scenarios you are going to use it and for what reason. So that ends up on the upstream regulatory end of things, and it goes into much more of the ethics and policy around AI. Well, tell us your story, Besmira. How did you get started in computer
Starting point is 00:30:44 science? Where has your journey taken you? And how did you end up at Microsoft Research doing the work you're doing? I did my undergrad in Albania, which is my home country. So this is a small country in southeastern Europe. One interesting thing about how I started is that in Albania, computer science is a very gender balanced field in that my peers at the university, 50% of them were women. And in a way, I feel really lucky that I started my career
Starting point is 00:31:15 in such an environment. It gives you the type of confidence that maybe one wouldn't get if you are in a different environment. After that, I went for a master's. It was a double degree master's in Germany and in Italy. So I ended up spending one year in each of those. This was in data mining and HCI. Then I started my PhD. I spent five beautiful years in Switzerland at ETH Zurich. And this was again at the intersection of human computation and
Starting point is 00:31:45 machine learning. So in a way, this thing about me being at the intersection of machine learning and people has followed me in my career. And I think it has really been because I cannot give up any of them. The intersection keeps me motivated and it keeps me focused. And I kind of make sure that what I'm doing is useful and it is good for people out there. So from Switzerland to Redmond, what happened in between? Oh, wow. Yeah, so I came for an internship during my PhD here. I spent three months. Seattle is beautiful in the summer.
Starting point is 00:32:22 That's how we get you. Exactly. I like the group a lot. I still work with the same people. Who do you work with? I work with Ece Kamar very closely, Saleema Amershi, Eric Horvitz quite a lot. And, you know, we are surrounded by an amazing group of people who come from very diverse backgrounds. Well, continuing on a personal note,
Starting point is 00:32:46 tell us something we don't know about you. I mean, you already just did. I didn't know that about you. Spoiler alert. Yeah, right. Spoiler alert. I'm from Albania. Tell us something we don't know about you, a defining experience, an interesting hobby, a personal characteristic, a side quest, any of those that may have defined your direction in life? Yeah, so as you notice, I've moved quite a bit. U.S. is the fifth country I'm living in. And really, when I think about it, I've met so many interesting people. I've met dear friends during the years. And it's really these people that have shaped my personality and they have really helped me to think out of the box, to be creative, but also learn about
Starting point is 00:33:31 the different perspectives. All my friends and my network think in many different ways. They come from very diverse cultural backgrounds. And this really helps you to pause and think further, more than what you have learned in school or in papers and books. All right. So you've got Albanian, English, German, Italian. What else do you speak? I speak C++. Yep. C++. As we close, I want to give you the last word. What should our listeners know about what's next in adaptive systems? And I know you don't know all the answers. There's a lot of uncertainty there, just like the field.
Starting point is 00:34:11 But what are the big unanswered questions and who do you need to help you answer them? Yes, we have different directions in the adaptive systems and interaction group. There is the whole direction of interpretability and debugging. Then a lot happening on human-AI collaboration, either for decision-making or in the physical world for human-robot interaction. There is a lot of work happening in reinforcement learning and robotics and decision-making under uncertainty.
Starting point is 00:34:39 Overall, if I have to put a theme around this, is that we like to think about problems that are happening out there in the real world, so not in the lab. And we want to build trustworthy AI systems that operate out there. And as such, in all this diversity, we look for people that do have a strong technical background,
Starting point is 00:35:02 but we also look for people who can speak all these different languages and are eager to learn more about each other's field. Besmira Nushi, thank you for joining us today. Thanks for having me. To learn more about Dr. Besmira Nushi and the latest research at the intersection of human and machine intelligence, visit microsoft.com slash research.
