Microsoft Research Podcast - 102 - Adaptive systems, machine learning and collaborative AI with Dr. Besmira Nushi
Episode Date: December 11, 2019. With all the buzz surrounding AI, it can be tempting to envision it as a stand-alone entity that optimizes for accuracy and displaces human capabilities. But Dr. Besmira Nushi, a senior researcher in the Adaptive Systems and Interaction group at Microsoft Research, envisions AI as a cooperative entity that enhances human capabilities and optimizes for team performance. On today’s podcast, Dr. Nushi talks about what it takes to develop collaborative AI systems and unpacks the unique challenges machine learning engineers face in their version of the software development cycle. She also reveals why understanding the “terrain of failure” can help researchers develop AI systems that perform as well in the real world as they do in the lab. https://www.microsoft.com/research
Transcript
What I'd like AI to be, I'd like it to be a technology that enables everyone, and that
it's built for us, it's built for people.
My parents should be able to use it, environmental scientists should be able to use it and make
new discoveries, and a policymaker should be able to use it to make good decisions.
You're listening to the Microsoft Research Podcast, a show that brings you closer to
the cutting edge of technology research and the scientists behind it. I'm your host, Gretchen Huizinga.
With all the buzz surrounding AI, it can be tempting to envision it as a standalone entity
that optimizes for accuracy and displaces human capabilities. But Dr. Besmira Nushi, a senior researcher in the Adaptive Systems and Interaction Group
at Microsoft Research, envisions AI as a cooperative entity that enhances human capabilities
and optimizes for team performance.
On today's podcast, Dr. Nushi talks about what it takes to develop collaborative AI systems
and unpacks the unique challenges machine learning engineers face
in their version of the software development cycle.
She also reveals why understanding the terrain of failure
can help researchers develop AI systems
that perform as well in the real world
as they do in the lab.
That and much more on this episode
of the Microsoft Research Podcast.
Besmira Nushi, welcome to the podcast.
Thank you. It's great to be here. I've been following the podcast in the last year.
And you know, it's always interesting. Every new episode is different.
You've been following the podcast. I have. Well, I've talked to you before.
Last time you were on a Research in Focus panel at Faculty Summit in 2017, and you talked about ML troubleshooting in real-time systems.
Let's go there again. As a senior researcher in the Adaptive Systems and Interaction Group,
you work at what you call the intersection of human and machine intelligence.
Yep.
Which I love. So we'll get to your specific work in a minute. But in broad strokes,
what's going on at that intersection? What gets you up in the morning?
Well, the intersection is a rich field, and it really goes both ways. It goes in the direction of how
we can build systems that learn from human feedback and input and intervention, and maybe learn from
the way people solve problems and understand the world. And it also goes in the other direction,
in how we can augment human capabilities by using artificial intelligence systems.
How can we make people more productive at work, putting the best of both worlds together?
Let's talk a little bit more about this human-AI collaboration.
You framed it in terms of complementarity because humans and machines have different strengths and weaknesses.
And you've also characterized it as putting humans and machines together to, quote unquote, optimize for team performance.
Elaborate on that for us. How should we understand AI as collaborator versus AI designed to work on
its own? You know, people and algorithms have very different skills. We're really good
at reasoning and imagination, and
machines are good at processing these terabytes of data for us and giving us these patterns.
However, you know, if we can use the machine's capabilities in an efficient way, we can be
quicker, as I said. But then on the other hand, you know, these are concepts that,
if you think deeply about them, are not that new, in the sense that when we invented personal computing in the 80s, one of the reasons it became so successful was that the personal computer was suddenly this tool that could help you do things faster.
But then there is another thing that enabled that development in those years.
And really, I think that that is the field of human-computer interaction.
What HCI did in those years is that it made the interface understandable from a human perspective,
and it really made computing technology accessible to everybody.
So now we see billions of people around the world that use some form of computation without making any significant effort.
Right.
And I think that today we are in front of such forms of developments in artificial intelligence.
We are, in a way, in the position to innovate in how people interact with AI technologies.
But we still need to make that leap
and make AI accessible for users. And this is what I mean by the fact that so far we have been
optimizing AI for performance only and performance when the AI is designed to play alone in the
field. But if it has to play together with a human, there are other scores that we need to
think about. For example, one of them is interpretability, in that people should be able
to understand how a machine makes a prediction. Another one that we focus a lot on is predictability
of errors. And what this really means is that if I'm working with an AI algorithm, I should be able to understand when that algorithm is going to make mistakes.
And this is important for me as a user because if I have the agency to make the final decision at the end, I need to know when it's right or wrong so that I can correct it at the right time as it goes.
Let's drill in on the topic of AI as collaborator then.
We've talked a little bit about AI working alone and it's designed to optimize for performance and speed.
How do you then go about training ML models with collaborative properties in mind instead of optimizing for speed and performance?
What are the tradeoffs in the algorithmic design,
and how do you go about enforcing them? Right, right. Yeah, so you're right that it is always a trade-off. It is a trade-off for the machine learning developer to decide which model to
deploy. Should I deploy a model that is fully accurate by its own, or a model that optimizes
team performance? But the difficulty is that this trade-off is not
always as visible or as easy to access. So I see the work that we do in our group as an enabling
paradigm that allows you to explore this trade-off or in a way to extend it and show it to the
developer so that the developer can make the right choices.
And there are two ways you can go about this.
The first one is what happens during grid search.
So in machine learning, we call grid search this process where you search through
many, many combinations of parameters.
And through this search, you try to find the model that pleases you the most, in that it
is accurate.
But what we suggest is
that you should also be looking at these other scores that work for collaboration, like predictability
of errors. So the second way you can go about this is to include these definitions in the
training objective itself. And we have been doing more and more work in this part because we think that this explores the trade-off even more.
It extends it.
It gives you a rich perspective of how many other parameters you can optimize and augment the objective function in a way that it should think about accuracy.
But it should also think, with a certain weight, about these human collaboration scores.
Right.
And the best way to go is to do both, during training and during grid search, so that you really get the algorithm that works best for humans.
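To make that concrete, here is a rough sketch, not the group's actual tooling, of a grid search that selects a model by a combined score rather than accuracy alone. The error_predictability proxy and the 0.7/0.3 weighting are purely hypothetical stand-ins for whatever collaboration score is used in practice:

```python
# Hypothetical sketch: grid search that keeps the model with the best *combined*
# score of accuracy and a human-collaboration term, instead of accuracy alone.
from itertools import product

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier


def error_predictability(model, X_val, y_val):
    # Toy proxy for "predictability of errors": how well a very simple,
    # human-readable rule (a depth-2 tree) can anticipate the model's mistakes.
    errors = (model.predict(X_val) != y_val).astype(int)
    if errors.sum() in (0, len(errors)):
        return 1.0  # trivially predictable: the model is always right or always wrong
    rule = DecisionTreeClassifier(max_depth=2).fit(X_val, errors)
    return rule.score(X_val, errors)


def team_grid_search(X_tr, y_tr, X_val, y_val, alpha=0.7):
    best_model, best_score = None, float("-inf")
    for n_estimators, max_depth in product([50, 100, 200], [3, 5, None]):
        model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
        model.fit(X_tr, y_tr)
        acc = accuracy_score(y_val, model.predict(X_val))
        # Combined "team" objective: accuracy plus the collaboration proxy.
        combined = alpha * acc + (1 - alpha) * error_predictability(model, X_val, y_val)
        if combined > best_score:
            best_model, best_score = model, combined
    return best_model
```

With alpha set to 1.0 this collapses back to a plain accuracy-driven search, which is what makes the trade-off visible to the developer.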
So you've talked a bit about enforcing properties within the algorithmic design. Unpack that a little bit more.
Yeah, so the enforcement usually really comes from the optimization stage.
During optimization in machine learning, we model a loss function. And this loss function is the
function that gives a signal to the training algorithm on how well it is doing, whether it
is a good model or it is a bad model. And this is really the only signal that we get. And it's computed on the data, right? So we're saying that signal should be augmented with these human collaboration
scores and be put together so that when we train the algorithm, these properties get enforced.
The other way you could do it is by adding constraints in the training algorithm, saying
that whatever you do, whatever model you find, you shouldn't go below or above a particular score.
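A hedged sketch of what such an augmented loss could look like in code; the confidence-based penalty and the weight lam are illustrative stand-ins, not the group's actual formulation:

```python
# Illustrative only: the usual cross-entropy signal plus a made-up penalty term
# standing in for a human-collaboration score.
import torch.nn.functional as F


def collaborative_loss(logits, targets, lam=0.1):
    task_loss = F.cross_entropy(logits, targets)        # standard accuracy-driven signal
    confidence = F.softmax(logits, dim=1).max(dim=1).values
    collab_penalty = (1.0 - confidence).mean()           # hypothetical collaboration term
    return task_loss + lam * collab_penalty
```

The constraint-based variant described above would instead reject any candidate model whose collaboration score falls outside a fixed bound, regardless of its accuracy.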
Well, let's turn our attention to the methodologies and tools you're working on for machine learning
systems. And before we get specific, I'd like you to give us a general overview of the software
development cycle writ large, talking about the differences between traditional software
development and software engineering for ML systems.
Yeah. So machine learning has really been until recently a research-like field in that
the algorithms were in the lab and they were accessed by the machine learning scientists.
But now that we have machine learning deployed out there in the field in different products,
machine learning software is being combined with traditional software. However, these two are very
different in nature, in that machine learning software
can be non-deterministic. It may be hard for you to say what it is going to do. And it also may have
a black box nature in that it's hard to understand what exactly it is going to say. And also it may
make these types of mistakes that do not happen because of bugs in the code. It may just be
because the data that you are training the algorithm upon might either be insufficient,
in that there is not enough to learn what you want to learn, or it may not quite resemble what
is out there in the real world. It is just lab data, and it's never going to be as rich as the
world that we live in.
And also, needless to say, this is a very subtle difference, but we often forget about it:
in machine learning, we don't really write the code that the program is going to execute.
We write the algorithm that is going to process the data and come up with a function, and that
function is the actual code.
So all those differences make the process very,
very different. It is very data dependent. And in traditional software engineering, for example,
we didn't have these parts of the life cycle that we currently care so much about. Like,
for example, data collection and data cleaning, in case the data has mistakes. In fact, according to
a study that we recently did at Microsoft,
collection and cleaning takes at least 50% of the time
from a machine learning engineer.
And, you know, this is significant.
It's a lot of time that is spent into these new stages.
The other thing is versioning.
So in traditional software, we know very well how to do versioning. We have
tools like GitHub and other versioning tools that do that for us. But in machine learning,
we have to version not only code, we have to version the model, we have to version the data
and the parameters of the model. So there are all these things that are entangled together
and that we need to version in the right way.
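One minimal way to picture that kind of joint versioning, assuming a git repository and local data and model files (the fields and paths are illustrative, not a prescribed format), is a per-run manifest that ties code, data, model, and parameters together:

```python
# Hypothetical sketch of a per-run manifest for versioning an ML experiment.
import hashlib
import json
import subprocess
from datetime import datetime, timezone


def file_sha256(path):
    # Content hash so the exact data and model files can be identified later.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def write_run_manifest(data_path, model_path, params, out_path="run_manifest.json"):
    manifest = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "code_commit": subprocess.check_output(
            ["git", "rev-parse", "HEAD"]).decode().strip(),
        "data_sha256": file_sha256(data_path),
        "model_sha256": file_sha256(model_path),
        "hyperparameters": params,
    }
    with open(out_path, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest
```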
Okay, so let's go a little deeper there.
You've got your traditional software engineer that is very comfortable with how this all works. And now you've got machine learning engineers that are adding on layer upon layer to this software development cycle.
How are you addressing that?
So far, what we have done is we've started with first understanding the needs of machine
learning engineers, like, and understanding their background as well, you know, because
machine learning engineers, they may come from different fields. Some of them may not have a
computer science background. They may be data scientists. They may be statisticians. And the
practices that are used in statistics and computer science may be very, very different.
Within the same team, you may have people with so many different backgrounds, and you need to put
them together to speak the same language. So we started by trying to understand what is their
background and what are their problems.
And the number one challenge that they have is to have end-to-end support in tooling that can support all these different stages in the lifecycle.
It will take some time, but I think we're getting closer.
Well, have you laid out the stages as it were?
I mean, the software development cycle is very precise. Yeah. And the machine learning cycle is a lot bigger, isn't it?
It is. So we have defined a few stages and there's other work that have tried to do the
same thing. We have stages like data collection, data cleaning, model training, feature engineering, and then model monitoring and debugging and
maintenance.
So these are kind of the main stages, if I didn't forget any of them.
But what is different?
There's something that is very interesting in the difference between the two, is that
the machine learning lifecycle is very experimental in that it is a little bit of trial and error in a way. This grid search that I
mentioned earlier, it's a little bit of trial and error. You have to try different things to see
whether it works for your domain. Maybe you clean the data once more. Maybe you add some more
features or a different representation. So there is a lot of experimentation. And when there is a
lot of experimentation, there is a lot of uncertainty. You don't know as an engineer whether it's going to work or not. So it has changed even the way
we plan and manage projects. Well, let's go a little deeper and talk about that troubleshooting
and debugging that you're working on. It's a key challenge for all software systems,
but it's particularly challenging for debugging a black box.
Yeah.
And especially for complex mission and safety critical software and hardware, which you're
dealing with all the time in the real world.
So how do you go about, let's get real, how do you go about designing methodologies for
the debugging phase in the ML development cycle?
Yeah, it's a topic that is really timely
in that if this is deployed in places that are high stake,
like in medicine or autonomous driving,
this can really have either a very good impact
or a very bad impact on people.
Or flying.
Or flying, yeah.
Exactly.
It can have a different impact on people's life.
One of the things that we say in our work is that good debugging practices start from rigorous evaluation.
You know how many times we hear things such as, this model is 90% accurate on a particular benchmark.
And we use that one single score to describe the whole performance
of one algorithm on the whole data set. Often that single number may hide so
many important conditions of failure that we don't know about and that are so
important to know if you are an engineer. What we suggest is that that performance
number should be sliced down into different demographics
and different groups in the data so that we really understand, is there any pocket in
the data that is maybe underrepresented and maybe the error rate is higher?
Right.
So these are the things that we suggest to do.
And then we also continue and build interpretable models in order to explain exactly to the engineer
when and how a machine learning model fails.
And we often suggest to do this for different groups.
I'll just give you a simple example.
We recently were looking at gender recognition software from face photos, and we noticed
that when these models are trained only from celebrity data, they have a much higher error rate for women who have short hair, do not have any eye makeup on, and also do not smile in the photo.
It's complicated.
It is all these different dimensions that are put together.
And the process of finding this particular thing, for example, I would have never thought to go and look for it. But this is what the interpretable model gives to you. And, you know,
it takes away a lot of load from the engineer if you can at least automate part of this process.
So how are you automating parts of the process?
Yeah, so what we're doing is that we're really gathering this data together and we are asking engineers not only to store the final aggregate number of performance, but we are asking them
to give us the performance numbers on each example. And at that point, you become super
powerful, in that you can put in an interpretable model that can slice and dice the data in the right way.
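A minimal sketch of that idea, assuming per-example correctness and metadata live in a pandas DataFrame; the column names, thresholds, and the error_terrain_report helper are hypothetical, not the group's actual Error Terrain Analysis tool:

```python
# Hypothetical sketch: keep per-example results, slice error rates by group, and fit a
# shallow, interpretable tree on metadata to surface pockets where the model fails.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text


def error_terrain_report(df: pd.DataFrame, feature_cols, correct_col="correct"):
    failures = 1 - df[correct_col].astype(int)

    # Error rate per slice, instead of one aggregate accuracy number.
    per_group = df.assign(error=failures).groupby(feature_cols)["error"].mean()

    # Metadata must be numeric for the tree; one-hot encode categorical columns.
    X = pd.get_dummies(df[feature_cols])
    tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=50)
    tree.fit(X, failures)

    # The printed rules describe high-error pockets in human-readable terms.
    return per_group, export_text(tree, feature_names=list(X.columns))
```

The printed tree rules are what would surface a pocket like the short-hair, no-eye-makeup, no-smile group described above.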
A lot of people just say, I don't want my picture in anybody's data set. And therefore, you're precluding some of the important nuances that you might need in your data to get accurate models. So there is a tension between, you know, being ethical about collecting your data and being accurate in the data.
I think as a community and also as an industry, we need to think deeply about how to standardize this process.
Well, as we've just kind of laid out, it's hard out there for a machine learning engineer.
These people need a whole new tool belt for the job. How is your research in
adaptive systems helping to equip them with the tools they need for an AI world?
These methodologies that I just mentioned, in the last two years, we have worked hard
with many people at MSR AI, but also in Microsoft Cognition, to build concrete tools that can
automate part of this process.
And the tool that we are building now, it's called Error Terrain Analysis.
And it really helps in understanding the terrain of failure.
This is an effort that I'm leading together with Ece Kamar and a lot of people from the Ethics and Society team
that cares a lot about these types of problems in the company and broader than that.
And really what we are doing with the tool is that we are building a workflow of processes
and modular visualizations that can be put together like Lego pieces so that you can go from
one general view of errors to a more detailed one and even more detailed one in looking at
particular instances. Let me ask you one question on that, because we talked, I'm hearkening back to our conversation
at Faculty Summit, and we talked about how modularity is good, both in software development
and for understanding how things work, but it also can have problems in the debugging,
because you have these different modules that aren't all connected. And if something goes wrong with one or something is better in one
and you've got an older module, it poses new problems.
Yeah, it does.
In this case, we're thinking about these modules more as visualization modules,
in that first you want to have a large overview of the data.
And this would be like one module.
And then you want to drill down into the other ones so that you do not get overwhelmed as an engineer, so that it is not too much information for you.
Okay. Go back to a phrase you just used, the terrain of failure. Unpack that for us.
If you think about it like, you know, a set of mountains and hills and seaside, there are cases when, you know, the terrain
of failure is really calm in parts of the data in that the examples are easy to classify.
There is nothing special about them and everything is flat.
And there are other cases where the data is so rich, there is so much diversity in the data, in like demographics or other properties, where the error can fluctuate a lot.
And we want to feel that terrain and to really understand what it looks like.
That's one of the most evocative phrases I've heard.
What other kinds of tools do ML engineers need that are being worked on sort of upstream in the research
community? I'd like to mention a set of other tools that we are also building in the group.
One of them is called InterpretML, and this is work that is led by Rich Caruana in the Adaptive
Systems and Interaction Group. They're really building a tool set for building interpretable
models and generating explanations from these models. Yet another tool is called, this is
shiny new, it's called TensorWatch. And this is built by Shital Shah, who built this tool
for doing real-time debugging so that you can see the errors and the training loss of
machine learning models on the fly.
That said, I think that there is still a lot to do when it comes to stitching all this together into one single framework.
And as I said, we need an end-to-end framework for versioning, data provenance, data documentation,
and tools that allow us to take the insights that
we get from troubleshooting and debugging and integrate them back into the system
to fix what we find.
And I will not claim that everything is going to be automated, but at least there is
a workflow and a process for when that happens.
Well, at this point, I'll take good over automated.
Right, right, right. Yep, yep. Well, hype notwithstanding, AI is still highly dependent
on people. And I'm not sure that's a bad thing. I think that might be a good thing.
Why does ML add a difficult layer to this idea of self-healing software? That's one of the things you talked about at Faculty Summit, where one component fixes another based on feedback.
And how can strong causal reasoning tools and counterfactual analysis tools help us better understand what went wrong? Yeah, it is hard to heal a machine learning software, but it is even harder to heal a system that has many, many machine learning components that are tied together.
And the reason why that is difficult is because sometimes it is hard to understand the different dynamics and interactions between the components. We've done this work that I also talked during the faculty summit on generating counterfactuals
in the subcomponents in order to understand how these differences in the subcomponents
affect the larger systems.
And again, we are using human intervention to generate these counterfactuals for us so
that we can understand the dynamics better.
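A simplified way to picture that, as a sketch rather than the actual method: hold the pipeline fixed, substitute one component's output with a human-provided correction, and measure how far the end-to-end result moves. The function names and the distance argument are hypothetical:

```python
# Hypothetical sketch of component-level counterfactual analysis in a multi-component system.
def run_pipeline(components, x):
    # Each component consumes the previous component's output.
    for component in components:
        x = component(x)
    return x


def counterfactual_impact(components, x, index, corrected_output, distance):
    # End-to-end output as the system actually behaved.
    original = run_pipeline(components, x)

    # Re-run only the downstream components, starting from the corrected
    # (e.g. human-provided) output of the component at position `index`.
    patched = run_pipeline(components[index + 1:], corrected_output)

    # A large distance suggests this component's errors propagate strongly downstream.
    return distance(original, patched)
```

Repeating this per component gives a rough picture of which component's errors matter most to the overall system.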
Debadeepta Dey is starting a new line of work in this space in order to
optimize large systems that are integrative, and to optimize them on the fly, in real time.
So this is something new that is happening. Overall though, you know, the good news about
causal reasoning in these systems for debugging particularly is that as opposed to other fields,
like for example in medicine, we can actually run the system again. If we want to apply a fix and
to see how that works, we can apply the fix and see what is the impact, which is something that
you cannot easily do in other fields. So that's good. The not-as-good news is that we still have to understand the dynamics of the components, and we have to understand the data distribution, how the data is generated, in order to make the right assumptions when we do causal reasoning.
You mentioned Debadeepta Dey's work, and it's all of your work, but he's like a focal point there. And he talked a little bit about this at Faculty Summit as well. These big systems with many parts and pieces, and you've
got to be able to troubleshoot and debug in real time. I want you to talk a little bit more about
how that variable changes the game. Yeah, so in his work, Debadeepta talks about things such as
you might get new instances
that are running through the system that the system has never seen before. It doesn't know
how to optimize for these new instances. But by using the technique that he's building with
reinforcement learning and off-policy learning, you can really try to adapt with fewer examples
and try to manage these instances that, you know,
are not that well known for the system. Right. And so that's real world, real time life,
which is what humans are good at, adapting. And machines are still a ways back.
It's kind of, yeah, adapting to an unknown world in a way.
The uncertainty. All right. Well, I always like to know where research is on the delivery spectrum.
And on a scale of 10 years out or more to already shipped, where can we find what I would call trustworthy AI in products and services now?
And given what you've told us, how confident should we be that it's going to work as advertised?
Yeah, so I think that there exist some grand opportunities for us as researchers to work with engineers together in order to really improve this tooling aspect, allowing rigorous evaluation and debugging.
And I think that if we put in the right effort and if we do this the right way, we can really make progress in five years.
Not in order to solve the general intelligence problem, but in order to be able to make the right promises to the user.
You know, one of the problems that we currently have is that we cannot really promise to the user or specify the performance of a system.
We need to learn still how to do that and how to debug the bad cases. So if we kind of go in both ends, if we are able to explain the performance
in the right way and also understand it in the right way, we can kind of meet in the middle with
the user and set the expectations right. I would think it would be really important at this point to manage expectations, as they say,
in terms of what you referred to as promises that you make to the user. So what are you doing in
terms of communication and education about what you want to put out there in these real-time systems?
Yeah, so exactly one of the things that we'd like to do is to be able to generate these types of reports that can describe the behavior of the system, for example, that it does not work well when you stand in a particular pose, meaning that you
shouldn't be using the system in those cases.
So being able to decompose it and break it down in these cases will set the expectations
right for the user. And really, if you see it on paper, here it's green, here
it's red, you can kind of understand that, well, this system is not perfect.
And these are the right cases where I need to step in.
So let me ask what keeps you up at night.
We could go in a couple directions here, either on the technical front or the social front,
or maybe even both. What risks are you trying to mitigate and how are you going about that?
Yeah, so sometimes I wonder whether we are building the right thing. I worry that
we end up building things that are isolated from the world and maybe not safe. So what I'd like AI
to be, I'd like it to be a technology that enables everyone and that is built for us. It's built for
people. My parents should be able to use it, environmental scientists
should be able to use it and make new discoveries, and a policymaker should be able to use it to make good decisions.
And these are the things that we really have a responsibility for as researchers to make sure
that we are building the right thing and that it's safe and it's a technology that we can rely on.
Well, there are all kinds of iterations, in different verticals and different horizontals, where we're envisioning our future with AI.
A lot of companies are thinking, how can we do this for businesses, you know, with speech recognition and other places that have maybe some more nefarious purposes for AI.
And they're not saying much about it.
So is there anything you particularly see? Let's talk about the social front for a second,
in terms of what we ought to be thinking now as potential end users of this?
I think that there is a big question about how we manage the inventions that come up,
either, you know, as academics or as industry.
There are decisions that need to be made in terms of, like, how do you monitor
and how do you report how you are using a certain product, right?
Because we see these questions coming up
even for other technologies
that are not really related to intelligence.
And there should be some sort of protocol
when you buy a certain product as a user
to really declare in which scenarios you are going to use it and for what reason.
So that ends up on the upstream regulatory end of things, and it goes into much more of the ethics
and policy around AI. Well, tell us your story, Besmira. How did you get started in computer
science? Where has your journey taken you?
And how did you end up at Microsoft Research doing the work you're doing?
I did my undergrad in Albania, which is my home country.
So this is a small country in southeastern Europe.
One interesting thing about how I started is that in Albania,
computer science is a very gender balanced field in that my peers at
the university, 50% of them were women. And in a way, I feel really lucky that I started my career
in such an environment. It gives you the type of confidence that maybe one wouldn't get if you are
in a different environment. After that, I went for a master's.
It was a double degree master's in Germany and in Italy.
So I ended up spending one year in each of those.
This was in data mining and HCI.
Then I started my PhD.
I spent five beautiful years in Switzerland at ETH Zurich.
And this was again at the intersection of human computation and
machine learning. So in a way, this thing about me being at the intersection of machine learning
and people has followed me in my career. And I think it has really been because I cannot give
up any of them. The intersection keeps me motivated and it keeps me focused. And I kind of make sure that what I'm doing is useful and it is good for people out there.
So from Switzerland to Redmond, what happened in between?
Oh, wow.
Yeah, so I came for an internship during my PhD here.
I spent three months.
Seattle is beautiful in the summer.
That's how we get you.
Exactly.
I like the group a lot.
I still work with the same people.
Who do you work with?
I work with Ece Kamar very closely, Saleema Amershi, Eric Horvitz quite a lot.
And, you know, we are surrounded by an amazing group of people who come from very diverse backgrounds.
Well, continuing on a personal note,
tell us something we don't know about you. I mean, you already just did.
I didn't know that about you. Spoiler alert. Yeah, right. Spoiler alert. I'm from Albania.
Tell us something we don't know about you, a defining experience, an interesting hobby,
a personal characteristic, a side quest, any of those that may have defined
your direction in life? Yeah, so as you noticed, I've moved quite a bit. The U.S. is the fifth country
I've lived in. And really, when I think about it, I've met so many interesting people. I've met
dear friends during the years. And it's really these people that have shaped my personality
and they have really helped me to think out of the box, to be creative, but also learn about
the different perspectives. All my friends and my network think in many different ways. They come
from very diverse cultural backgrounds. And this really helps you to pause and think further, more than what you have learned in school or in papers and books.
All right. So you've got Albanian, English, German, Italian. What else do you speak?
I speak C++. Yep.
C++. As we close, I want to give you the last word.
What should our listeners know about what's next in adaptive systems?
And I know you don't know all the answers.
There's a lot of uncertainty there, just like the field.
But what are the big unanswered questions and who do you need to help you answer them?
Yes, we have different directions in the adaptive systems and interaction group.
There is the whole direction of interpretability and debugging.
Then a lot happening on human-AI collaboration,
either for decision-making or in the physical world
for human-robot interaction.
There is a lot of work happening in reinforcement learning
and robotics and decision-making under uncertainty.
Overall, if I have to put a theme around this,
it is that we like to think about problems
that are happening out there in the real world,
so not in the lab.
And we want to build trustworthy AI systems
that operate out there.
And as such, in all this diversity,
we look for people that do have a strong technical background,
but we also look for people who can speak
all these different languages
and are eager to learn more about each other's field.
Besmira Nushi, thank you for joining us today.
Thanks for having me.
To learn more about Dr. Besmira Nushi
and the latest research at the intersection of human and machine intelligence,
visit microsoft.com slash research.