Microsoft Research Podcast - 028 - Teaching Computers to See with Dr. Gang Hua
Episode Date: June 13, 2018

In technical terms, computer vision researchers “build algorithms and systems to automatically analyze imagery and extract knowledge from the visual world.” In layman’s terms, they build machines that can see. And that’s exactly what Principal Researcher and Research Manager Dr. Gang Hua and the Computer Vision Technology team are doing. Because being able to see is really important for things like the personal robots, self-driving cars, and autonomous drones we’re seeing more and more in our daily lives. Today, Dr. Hua talks about how the latest advances in AI and machine learning are making big improvements on image recognition, video understanding and even the arts. He also explains the distributed ensemble approach to active learning, where humans and machines work together in the lab to get computer vision systems ready to see and interpret the open world.
Transcript
If we look back 10, 15 years ago, you see the computer vision community is more diverse.
You see all kinds of machine learning methods.
You see all kinds of knowledge borrowed from physics, from the optics field,
all getting into this field to try to tackle the problem from multiple perspectives.
As we are emphasizing diversity everywhere,
I think the scientific community is going to be more healthy
if we have diverse perspectives.
You're listening to the Microsoft Research Podcast,
a show that brings you closer to the cutting edge of technology research
and the scientists behind it.
I'm your host, Gretchen Huizinga.
In technical terms, computer vision researchers build algorithms and systems to automatically analyze imagery and extract knowledge from the visual world.
In layman's terms, they build machines that can see.
And that's exactly what principal researcher and research manager Dr. Gang Hua and the computer vision technology team are doing. Because being able to see is really important
for things like the personal robots,
self-driving cars, and autonomous drones
we're seeing more and more in our daily lives.
Today, Dr. Hua talks about how the latest advances
in AI and machine learning are making big improvements
on image recognition, video understanding,
and even the arts.
He also explains the distributed ensemble approach
to active learning, where humans and machines work together in the lab to get computer vision
systems ready to see and interpret the open world. That and much more on this episode
of the Microsoft Research Podcast. Gang Hua.
Hi.
Hello, welcome to the podcast. Great to have you here.
Thanks for inviting me.
You're a principal researcher and the research manager at MSR, and your focus is computer vision research.
In broad strokes right now, what gets a computer vision researcher up in the morning? What's the big goal?
Yeah, computer vision is a relatively young research field. In general, you can think of this field as
trying to endow computers with the capability to see the world and interpret
the world just like humans. From a more technical point of view, the input to the computer is really just images and
videos.
You can think of them as a sequence of numbers.
But what we want to extract from these images and videos, from these numbers, is some sort
of structure of the world or some semantic information out of it.
For example, I could say this part of the image really corresponds
to a cat. That part of the image corresponds to a car. This type of interpretation. So
that's the goal of computer vision. For us humans, it looks to be a simple task to achieve,
but teaching computers to do it is hard. We really have made a lot of progress in the
past 10 years. But as a research field,
this thing has been around for 50 years, and there are still a lot of problems to tackle and address.
Yeah. In fact, you gave a talk about five years ago where you said, and I paraphrase,
after 30 years of research, why should we still care about face recognition research?
Tell us how you answered then and now where you think we are.
So I think, the status quo five years ago, I would say, at that moment, if we capture a snapshot of how research in facial recognition had progressed since the beginning of face recognition research, I would say we achieved a lot, but more in controlled environments where you could carefully control the lighting, the camera
setting, and all those kinds of things when you are framing the faces.
At that moment, five years ago, when we moved towards more wild settings, like faces
taken in uncontrolled environments, I would say there was a huge gap there in terms
of recognition accuracy.
But in the past five years, I would say the whole community also made a lot of progress
leveraging the more advanced deep learning techniques.
Even for facial recognition in the wild scenario, we've made a lot of progress and really have
pushed these things to a stage where a lot of commercial applications become feasible.
Okay.
Yeah.
So deep learning has really enabled, even recently, some great advances in the field
of computer vision and computer recognition of images.
Right.
So that's interesting when you talk about the difference between a highly controlled
situation versus recognizing things in the wild.
And I've had a couple of researchers on here who have said,
yeah, where computers fail is when the data sets are not full enough.
For example, dog, dog, dog, three-legged dog.
Is it still a dog?
So what kinds of things do deep learning techniques give you that you didn't have before in these recognition advances?
Yeah, that's a great question.
From a research perspective, you know, the power of deep learning resides in several factors.
The first thing is that it can conduct the learning in an end-to-end fashion and learn what's the right representation for
that semantic pattern. For example, when we're talking about a dog, if we really look into all
kinds of pictures of a dog, say my input is really a 64 by 64 image, and suppose each
pixel has 256 possible values to take, that's a huge space if you think about it combinatorially.
But when we talk about dog as a pattern,
like actually every pixel is correlated a little bit.
So the actual pattern for dog is going to reside
in a much lower dimensional space.
So the power of deep learning is that I can conduct the learning
in an end-to-end fashion and really learn
the right numerical representation for dog. And because of the deep structures, we can come
out with really complicated models, which can really digest a large amount of training data.
So that means like if my training data covered all kinds of variations, like all kinds of views of this pattern, eventually
I can recognize it in a broader setting because I have covered almost all the spaces.
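To make the combinatorics Dr. Hua mentions concrete: a 64 by 64 grayscale image with 256 possible values per pixel lives in a space of 256^4096 possible images, a number with nearly ten thousand decimal digits. Deep learning works because a real pattern like "dog" occupies a far lower-dimensional slice of that space. A quick check in Python (just the arithmetic, nothing from his actual systems):

```python
# Raw input space: each of the 64*64 pixels takes one of 256 values.
num_images = 256 ** (64 * 64)

# 256 = 2**8, so this is exactly 2**(8 * 64 * 64) = 2**32768 images.
assert num_images == 2 ** 32768

digits = len(str(num_images))
print(digits)  # the count has 9,865 decimal digits
```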
So another capability of deep learning is this kind of compositional behavior,
because it is a feedforward structure with a layered representation.
So when the information or image gets fed into
deep networks and it starts by extracting some very low-level image primitives, then gradually
the model can assemble all those primitives together and form higher and higher level
semantic structures. So in this sense, it captures all the small patterns corresponding
to the bigger patterns and compose them together to represent the final pattern. So that's why it
is very powerful, especially for visual recognition tasks. Right. So, yeah. So the broad umbrella of
CVPR is computer vision and pattern recognition. Right. And a lot of that pattern recognition
is what the techniques are really driving toward. Sure. Yeah. So that's actually what computer vision is trying to do, make sense
out of pixels. If we talk about it in a really mechanical way, I feed in the image, and
you either extract some numerical output or some symbolic output from it. The numerical output,
for example, could be a 3D point cloud, which describes the structure of the scene or the shape of an object.
It could also be corresponding to some semantic labels like dog and cat, as I mentioned at the beginning.
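Dr. Hua's description of layered composition, low-level primitives assembled into higher-level semantic structures, can be sketched with a toy two-layer example. This is a hand-rolled numpy sketch, not the deep networks his team actually uses; the filters and the "corner" unit are purely illustrative:

```python
import numpy as np

def conv2d(img, kernel):
    """Valid-mode 2D cross-correlation of a single-channel image."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

relu = lambda x: np.maximum(x, 0)

# Layer 1: low-level image primitives, horizontal and vertical edge detectors.
horiz = np.array([[1, 1], [-1, -1]], dtype=float)
vert = np.array([[1, -1], [1, -1]], dtype=float)

# A tiny 6x6 image containing a bright square (the "pattern").
img = np.zeros((6, 6))
img[2:4, 2:4] = 1.0

edges_h = relu(conv2d(img, horiz))  # fires along horizontal edges
edges_v = relu(conv2d(img, vert))   # fires along vertical edges

# Layer 2: compose the primitive maps into a higher-level unit that
# fires where a horizontal and a vertical edge meet, i.e. a corner.
corner = relu(edges_h + edges_v - 1.0)
print(corner.max() > 0)  # the composed "corner" unit fires -> True
```

Real networks learn these filters end to end from data instead of hand-coding them; the point here is only the stacking of simple primitives into bigger patterns.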
Right. So we'll get to labeling in a bit.
Sure.
It's an interesting part of the whole machine learning process that it has to be fed labels as well as pixels.
Sure.
Right?
Yeah.
You have three main areas of interest in your computer vision research that we talked about.
Video, faces, and arts and media.
Let's talk about each of those in turn and start
with your current research in what you call video understanding. Yes. Video
understanding, like the title sort of explains itself. Now the input
becomes a video stream. Instead of a single image, we're reasoning about
pixels and how they move. If we view computer vision reasoning about single
images as a spatial reasoning problem,
now we are talking about
a spatial-temporal reasoning problem
because video adds a third dimension,
the temporal dimension.
And if you look into
a lot of real-world problems,
we're talking about
continuous video streams,
whether it is a surveillance camera
in a building
or a traffic camera overseeing
highways. You have this constant flow of frames coming in and the object inside it is moving.
So you want to basically digest information out of it. When you talk about those kinds of cameras,
it gives us massive amounts of video, you know, constant stream of cameras and security in the 7-Eleven
and things like that. What is your group trying to do on behalf of humans with those video streams?
Sure. So one incubation project we are doing, like my team is building the foundational technology
there. One incubation project we're trying to do is to really analyze the traffic on roads.
If you think about a city, when they set up all
those traffic cameras, most of the video streams are actually wasted. But if you carefully think
about it, these cameras could be smart. Just think about one scenario where you want to more
intelligently control the traffic lights. So if in one direction I see a lot more traffic flow,
instead of having a fixed schedule for turning those red lights and green lights on and off, I could see, okay,
because this side has fewer cars or even no cars at this moment, I would allow the other direction,
the green light, to stay on longer so that the traffic can flow better. So that's just the
one type of application there. Could you please get that out there?
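The adaptive signal timing Dr. Hua describes can be sketched in a few lines. This is a toy rule with illustrative parameter names and values, not an actual traffic-control system; a real deployment would get the car counts from the camera's detection model:

```python
def green_time(cars_ns, cars_ew, base=20, per_car=2, max_green=60):
    """Toy adaptive signal: give each direction a green phase whose
    length (in seconds) scales with the number of cars counted there.
    All parameters are illustrative, not a real traffic-engineering spec."""
    ns = min(max_green, base + per_car * cars_ns)
    ew = min(max_green, base + per_car * cars_ew)
    # If one approach is empty, hand its slack to the busy direction.
    if cars_ns == 0:
        ns, ew = 0, min(max_green, ew + base)
    elif cars_ew == 0:
        ew, ns = 0, min(max_green, ns + base)
    return ns, ew

print(green_time(12, 0))  # busy north-south, empty east-west -> (60, 0)
```

That last call is exactly the scenario in the conversation: no cross traffic, so the busy direction keeps its green rather than waiting out a fixed schedule.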
I mean, yeah, because how many of us have sat at a traffic light when it's red
and there's no traffic coming the other way?
Exactly.
At all.
Why can't I go?
Yeah, you could also think about some other applications.
Like if we accumulated videos across years,
if citizens are requesting that we set up additional bicycle lanes, we could use the videos we have, analyze all the traffic data there, and then decide if it makes sense to set up a bike lane there.
If we set it up, will it significantly affect the other traffic flows? That can help cities make decisions like that.
I think this is so brilliant because a lot of times we make decisions based on, you know,
our own ideas rather than data that says, you know, hey, this is where a bike lane would be terrific. This is where it would actually ruin everything for everybody, right?
For sure. Yeah. Sometimes they leverage some other types of sensors to do that. Usually you
hire a company to set up some special equipment on the roads to do that.
But it's very costly and inefficient.
Meanwhile, all those cameras are just sitting there, and the video streams are already there.
So that's a fantastic explanation of what you can do with machine learning and video understanding.
Right.
Yeah.
Another area you care about is faces
and kind of harkens back to the
why should we still care
about facial recognition research?
But yeah.
And this line of research
has some really interesting applications.
Talk about what's happening
with facial recognition research.
Who's doing it and what's new?
Yeah.
So indeed,
if we look back,
facial recognition technology
has progressed at Microsoft.
I think that when I was at Live Labs Research, we set up the first facial recognition library,
which could be leveraged by different product teams.
Indeed, the first adopter is Xbox.
They tried to use facial recognition technology for automatically user login at that moment.
I think that's the first adoption.
Over time, the center of facial recognition research sort of migrated to Microsoft Research
Asia, where we still have a group of researchers I collaborate with.
We are continuously trying to push the state of the art out.
This has become more of a synergistic effort where we have engineering teams helping us to
gather more data and then we just train better models. Our research recently actually focused
more on a line of research we call identity-preserving face synthesis. Recently there has been a
big advancement in the deep learning community, which is using deep networks to build generative models
which can model the distribution of images so that you can draw from that distribution,
basically synthesize an image. You build a deep network whose output is an image.
So what we want to achieve is actually a step further. We want to synthesize faces. Well,
we want to keep the identity of those faces.
We don't want our algorithms to just randomly sample a set of faces out without any semantic information.
Say you want to generate a face of Brad Pitt, I want to really generate a face that looks like Brad Pitt.
If I want to generate a face similar to anybody I know, I think we just want
to be able to achieve that. So the identity preservation is the sort of outcome that you're
aiming for of the person that you're trying to generate the face of. Right. You know, tangentially,
I wonder if you get this technology going, does it morph with you as you get older and start to
recognize you? Or do you have to keep updating your face?
Yeah, that's indeed a very good question.
I would say in general, we actually have some ongoing research trying to tackle that problem.
I think for existing technology, yes, you need to update your face maybe from time to
time, especially if you've undergone a lot of changes.
For example, somebody could have done some plastic surgery.
That would basically break the current system.
Wait a minute, that's not you.
Sure, no, not me at all.
There are several ways you can think about it.
Human faces actually don't change much between age 17, 18, when you grow up,
all the way to maybe 50-ish.
So when kids are first born, their faces actually change a lot, because the bones are growing, and basically the shape and the skin can change a lot.
But once people mature into the adult stage, the change is very slow.
So we actually have some research where we're trying to model the aging process too; that will help establish
a better facial recognition system across ages. This is actually a very good kind of
technology which can get into the law enforcement domain. For example, some
missing kids could have been kidnapped by somebody.
But after many years, if you...
They look different.
Yeah, they look different. If smart facial recognition algorithms can match the original photos, you may be able to identify...
And say what they would look like at maybe 14 if they were kidnapped earlier.
Yes, yes, exactly.
Wow, that's a great application of that.
Well, let's talk about the other area that you're actively pursuing, and that's
media and the arts.
Tell us how research is overlapping with art, and particularly with your work in deep artistic
style transfer.
Sure.
If we look into people's desire, right?
First we need to eat, and we need to drink, and we need to sleep.
Okay.
Then once all these needs are fulfilled,
actually, we humans have a strong desire for art.
And creation.
And creation and things like that.
So this theme of research in computer vision,
we link it to a more artistic type of work, what we call media and arts,
basically using computer vision technologies
to give people good artistic enjoyment.
So the particular research project we have done in the past two years
is a sequence of algorithms where we can render an image into any sort of artistic styles you want,
as long as you provide an example of that artistic style.
For example, we can render an image to
Van Gogh's style. Van Gogh. Yeah. Or any other painter's painting style. Yeah. Or Picasso.
Yeah. All of them, if you can think of anything like that. Interesting. With pixels. With pixels,
yeah. Those are all, again, all done by deep networks and some deep learning technologies we
designed.
It sounds like you need a lot of disciplines to feed into this research.
Where are you getting all of your talent from in terms of?
In a sense, I would say our goal on this is about access. You know, artworks are not
necessarily accessible to everyone. Some of these artworks are really expensive.
Yeah.
By this kind of digital technology,
what we are trying to do is make these kinds of artworks
accessible to common users.
To democratize it.
Yeah, democratize it, as you mentioned.
Yeah, that's good.
Our algorithm allows us to build an explicit representation,
like a numerical representation, for each kind of style.
Then if you want to create new styles, we can blend them.
So it is like we are building an art space where we can explore in between,
to see how these visual effects evolve between two painters and things like that,
and even get a deeper understanding of how they compose their artistic styles.
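The "art space" idea, building a numerical representation per style and blending between them, can be sketched with Gram-matrix statistics, a common numerical summary of style in the research literature. The random feature maps below are stand-ins for features extracted from real paintings, and none of this is the team's actual model:

```python
import numpy as np

def style_stats(feature_map):
    """Gram matrix of a (channels, height, width) feature map: channel
    co-activation statistics, a common numerical summary of style."""
    f = feature_map.reshape(feature_map.shape[0], -1)
    return f @ f.T / f.shape[1]

def blend_styles(gram_a, gram_b, alpha):
    """Interpolate in 'art space': alpha=0 is style A, alpha=1 is style B."""
    return (1 - alpha) * gram_a + alpha * gram_b

rng = np.random.default_rng(0)
style_a = style_stats(rng.normal(size=(8, 16, 16)))  # stand-in for one painter
style_b = style_stats(rng.normal(size=(8, 16, 16)))  # stand-in for another

halfway = blend_styles(style_a, style_b, 0.5)  # a new, in-between style target
```

A full style-transfer system would then optimize an image, or train a feed-forward network, so the rendered image's feature statistics match `halfway`; this fragment only shows the representation-and-blend step that makes the in-between exploration possible.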
What's really interesting to me is that this is a really
quantitative field, computer science algorithms, a lot of math and numbers, and then you've got
art over here, which is much more metaphysical. And yet you're bringing them together and it's
revealing the artistic side of the quantitative brain. Sure. I think to bring all these things
together, the biggest tool we are leveraging is indeed statistics. Like all kinds of
machine learning algorithms, it's really trying to capture the statistics of the
pixels. We have been a little technical, but let's get a little more technical.
Some of your recently published work, and our listeners can find that on both the MSR website and your website.
Sure.
You talked about a new distributed ensemble approach to active learning.
Tell us about what you propose. How is it different, and what does
it promise?
Yeah, that's indeed a great question.
I think when we are talking about active learning, we are referring to a process where we have
some sort of human oracle involved in the learning process.
In traditional active learning, we are saying that I have a learning machine.
This learning machine can intelligently pick
up some data samples and ask the human oracle to provide a little bit more input. So the
learning machine actually picks the samples and asks the human oracle to actually provide,
for example, a label for this image. So in this work, when we're talking about the ensemble
machine, we are actually dealing with a more complicated problem.
We are actually trying to bring active learning into the crowdsourcing environment.
If you think about the Amazon Mechanical Turk, now this is really one of the biggest platforms where people send their data and ask the crowd workers to label all of them.
But in this process, if you are not careful,
the labels you collected from this process
for your data could be quite lousy.
Right.
Yeah, they may not be able to be used by you.
So in this process,
we actually tried to achieve two goals there.
The first goal,
we want to smartly distribute the data
so that we can make the labeling
most cost-effective, okay?
The second is that we want to actually assess the quality of all my crowd workers so that
maybe even in the online process, I can purposely send my data to the good workers to label.
So that's how our model works.
So actually we have an ensemble model, a distributed one.
Each crowd worker corresponds to one of these learning machines.
And we try to do a statistical check across all the models so that in the same process,
we actually come out with a quality score for each of the crowd workers on the fly. So that we can use the model to not only select the samples,
but also send the data to the labelers with the highest quality to label them.
That way, as these labeling efforts progress, I can quickly come out with a good model.
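A minimal sketch of the on-the-fly worker-quality idea: score each worker by agreement with the ensemble consensus, then route new data to the best scorers. The real model in the work described couples this with active sample selection and per-worker learning machines; the workers and labels below are made up:

```python
def majority_vote(worker_labels):
    """Ensemble consensus: for each item, take the most common label."""
    n = len(next(iter(worker_labels.values())))
    consensus = []
    for i in range(n):
        votes = [labels[i] for labels in worker_labels.values()]
        consensus.append(max(set(votes), key=votes.count))
    return consensus

def estimate_quality(worker_labels, consensus):
    """Score each worker by how often they agree with the consensus,
    a cheap on-the-fly proxy for label quality."""
    return {
        worker: sum(l == c for l, c in zip(labels, consensus)) / len(consensus)
        for worker, labels in worker_labels.items()
    }

# Three hypothetical crowd workers labeling the same six images.
worker_labels = {
    "w1": ["cat", "dog", "dog", "cat", "cat", "dog"],  # reliable
    "w2": ["cat", "dog", "dog", "cat", "dog", "dog"],  # one slip
    "w3": ["dog", "cat", "dog", "dog", "cat", "cat"],  # noisy
}
consensus = majority_vote(worker_labels)
quality = estimate_quality(worker_labels, consensus)
best_worker = max(quality, key=quality.get)  # route the next batch here
print(best_worker)
```

With only two possible labels and three workers, every item has a clear majority, so the consensus is well defined; real systems must also handle ties and adversarial workers.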
That leads me to the human-in-the-loop issue and the necessity of the checks and balances
between humans and machines.
Aside from what you're just talking about,
how are you tackling other issues of quality control by using humans with your machines?
I have been thinking about this problem for a while, mainly in the context of robotics.
If you think about any intelligent system, I would say,
unless you are in a really closed-world setting,
you may have a system which can run fully autonomously.
But whenever we hit the open world, current machine learning-based intelligent systems,
a lot of them are really not good at dealing with all kinds of open-world cases,
because there are corner cases which may not have been covered.
And variables that you don't think about.
Exactly. So one thing I have been thinking about is how we could really engage the
human in that loop, to not only help the intelligent agent when it needs help, but also at the same
time form some mechanism by which we can teach this agent to be able to handle similar situations in the future.
I will give you a very specific example.
When I was at Stevens Institute of Technology, I had a project from NIH, which we call the co-robots.
What kind of robots?
Co-robots are actually wheelchair robots.
The idea is that even if the user cannot move their legs, as long as they can move their head,
we can use a head-mounted camera
to track the pose of the head and let the user control the wheelchair
robot. But we don't want the user to control it all the time. So our goal is actually, say, in a home setting, we want these wheelchair robots to be able to carry
the user and move largely autonomously inside the room. Whenever the user gives guidance,
say, hey, I want to go to the bedroom, then the wheelchair robot would mostly do autonomous
navigation.
But if the robot sort of encounters a situation
it does not know how to deal with, for example, how to move around something,
then at that moment the robot is going to proactively ask the human for control.
Then the users will control the robots and deal with that situation.
Maybe next time these robots encounter a similar situation,
they're going to be able to deal with it on their own.
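The ask-for-help-then-remember loop Dr. Hua describes can be sketched as follows. This is a toy policy table with made-up situations, actions, and a confidence threshold, not the co-robot's actual controller:

```python
def navigate(situation, policy, human_control, confidence_threshold=0.7):
    """Human-in-the-loop control sketch: act autonomously when the
    policy is confident, otherwise hand control to the user and
    remember the demonstrated action for similar future situations."""
    action, confidence = policy.get(situation, (None, 0.0))
    if confidence >= confidence_threshold:
        return action, "autonomous"
    # The robot proactively asks the human, then learns from the answer.
    action = human_control(situation)
    policy[situation] = (action, 1.0)  # trust the human demonstration
    return action, "human"

policy = {"open hallway": ("go_forward", 0.95)}
teach = lambda situation: "turn_left"  # the human's demonstrated action

print(navigate("open hallway", policy, teach))       # ('go_forward', 'autonomous')
print(navigate("cluttered doorway", policy, teach))  # ('turn_left', 'human')
print(navigate("cluttered doorway", policy, teach))  # ('turn_left', 'autonomous')
```

The third call shows the payoff: after one human demonstration, the same situation is handled autonomously. A real system would generalize across similar situations rather than matching them exactly.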
What were you doing before you came here, and how did you end up at Microsoft Research?
This is my second term in Microsoft.
So, as I mentioned, the first term is between 2006 and 2009 when I was in a lab called LiveLabs.
That's my first time.
During that tenure, I established the first face recognition library.
Then I kind of got attracted by the external world a little bit. So I went
to Nokia Research, IBM Research, and I landed at Stevens Institute of Technology as a faculty
member there.
And that's over in New Jersey, right?
Yeah, that's in New Jersey, on the East Coast. Then in 2015, I came back to Microsoft
Research, but in the Beijing lab first. I transferred back here in 2017 because my family stayed here.
So now you are here in Redmond after Beijing.
How did that move happen?
My family always stayed in Seattle.
So Microsoft Research, Beijing lab is a great place.
I would say I really enjoyed it.
One of the unique things there is the super, super dynamic research intern program.
So year-round, there are several hundred interns actually working in the lab,
and they collaborate closely with their mentors.
I think it's a really dynamic environment there.
But because my family is in Seattle, I sort of explored a little bit,
and then the intelligence group was setting up this
computer vision group here. So that's why I
joined.
Back in Seattle again. Yeah.
So I
asked this question of all the researchers that come
on the podcast, and I'll ask you too.
Is there anything about your work
that we should be
concerned about? As I say, anything that keeps
you up at night?
I would say when we talk about,
especially in the commutation domain,
I think privacy is potentially the largest concern.
If you now look across all countries,
there are hundreds of millions of cameras
set up everywhere,
in public spaces or in buildings.
And, I would say, with the technology advancement, it is really not sci-fi to expect
that cameras could now really track people all the time.
I mean, everything has two sides, I would say.
Sure.
Yeah.
This, on one hand, could help us, for example, to better deal with criminals.
But for ordinary citizens, there are a lot of privacy concerns in that data.
So what kinds of things, and this is why I ask this question, because it prompts people to think,
okay, I have this power because of this great technology.
What could possibly go wrong?
Sure.
So what kinds of things can we be thinking about and instituting or implementing to not have that problem?
Microsoft has a big effort on GDPR. And I think that's great because this is a mechanism to ensure
everything we produce is actually aligned
with certain regulations.
On the other hand, everything needs to strike a balance between usability and security
or privacy.
Sure.
If you think about it, when you use some online services, your activities basically leave
traces there.
That's how they are used to better serve you in the future.
But if you want more convenience, sometimes you need to give a little bit of information
out, though you don't want to give all your information out.
I think the boundary is actually not black and white.
We simply need to carefully control that so that we get just the right amount of information
to serve the customer better, but not a lot of unneeded
information, or information that users are not comfortable giving.
So it seems like there's a trend towards permissions and agency of the user to say,
I'm comfortable with this, but I'm not comfortable with that.
Right.
As we finish up here,
talk about what you see on the horizon
for the next generation
of computer vision researchers.
What are the big unsolved problems
that might prompt
exciting breakthroughs
or just be the grind
for the next 10 years?
That's a great question
and also a very big question.
There are big problems
we actually
should tackle. If you think about like now, computer vision really leverages statistical
machine learning a lot. We can train recognition models which achieve great results. But
that process is largely still appearance-based. So we need to better bring some of the fundamentals of computer vision, which is 3D
geometry, into the perception process. And there are also other things, especially when we are
talking about video understanding. It's a holistic problem where you need to do spatial temporal
reasoning, and we need to be able to factor more cognitive concepts into this process, like causal
inference. If something happened,
what really caused this thing to happen? Machine learning techniques mostly deal with correlation
between data. Correlation and causality are two totally different concepts there. So I feel that
also needs to happen. And some of the fundamental problems like learning from small data
and even learning from language
potentially we need to address.
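Dr. Hua's distinction between correlation and causality can be illustrated with a toy confounder; the variable names are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(42)

# A hidden confounder (say, hot weather) drives both quantities below;
# neither causes the other, yet they end up strongly correlated.
weather = rng.normal(size=5000)
ice_cream_sales = weather + 0.3 * rng.normal(size=5000)
sunburns = weather + 0.3 * rng.normal(size=5000)

r = np.corrcoef(ice_cream_sales, sunburns)[0, 1]
print(round(r, 2))  # strong correlation, but no causal link between the two
```

A model trained only on the observed pairs would happily predict one from the other, which is exactly why inferring what caused what requires more than the correlational machinery most current learning techniques provide.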
Think about how we humans learn.
We learn in two ways.
Learning from experience,
but there are a lot of facts
that we learn from language.
For example, while we are
talking with each other,
indeed, similarly through language,
I already learned a lot from you,
for example. And I you. Sure. Yeah, that's a very compact information flow. We are now centrally
focused on deep learning. If we look back 10, 15 years ago, you see the computer vision
community is more diverse. You see all kinds of machine learning methods.
You see all kinds of knowledge
borrowed from physics,
from the optics field,
all getting into this field
to try to tackle the problem
from multiple perspectives.
As we are emphasizing diversity everywhere,
I think the scientific community
is going to be more healthy
if we have diverse perspectives
and tackle
the problem from multiple angles.
You know, that's great advice because as the community welcomes new researchers,
they want to have big thinkers, broad thinkers, divergent thinkers to sort of push for the
next big breakthrough.
Yeah, exactly.
Gang Hua, thank you for coming in. It's been really illuminating and I've really enjoyed our conversation.
Thank you very much.
To learn more about Dr.
Gang Hua and the amazing advances in computer vision, visit Microsoft.com slash research.