Microsoft Research Podcast - 028 - Teaching Computers to See with Dr. Gang Hua
Episode Date: June 13, 2018

In technical terms, computer vision researchers “build algorithms and systems to automatically analyze imagery and extract knowledge from the visual world.” In layman’s terms, they build machines that can see. And that’s exactly what Principal Researcher and Research Manager Dr. Gang Hua and the Computer Vision Technology team are doing. Because being able to see is really important for things like the personal robots, self-driving cars, and autonomous drones we’re seeing more and more in our daily lives. Today, Dr. Hua talks about how the latest advances in AI and machine learning are making big improvements on image recognition, video understanding and even the arts. He also explains the distributed ensemble approach to active learning, where humans and machines work together in the lab to get computer vision systems ready to see and interpret the open world.
Transcript
If we look back 10, 15 years ago, you see the computer vision community is more diverse.
You see all kinds of machine learning methods.
You see all kinds of knowledge borrowed from physics, from the optics field,
all getting into this field to try to tackle the problem from multiple perspectives.
As we are emphasizing diversity everywhere,
I think the scientific community is going to be more healthy
if we have diverse perspectives.
You're listening to the Microsoft Research Podcast,
a show that brings you closer to the cutting edge of technology research
and the scientists behind it.
I'm your host, Gretchen Huizinga.
In technical terms, computer vision researchers build algorithms and systems to automatically analyze imagery and extract knowledge from the visual world.
In layman's terms, they build machines that can see.
And that's exactly what principal researcher and research manager Dr. Gang Hua and the computer vision technology team are doing. Because being able to see is really important
for things like the personal robots,
self-driving cars, and autonomous drones
we're seeing more and more in our daily lives.
Today, Dr. Hua talks about how the latest advances
in AI and machine learning are making big improvements
on image recognition, video understanding,
and even the arts.
He also explains the distributed ensemble approach
to active learning, where humans and machines work together in the lab to get computer vision
systems ready to see and interpret the open world. That and much more on this episode
of the Microsoft Research Podcast. Gang Hua.
Hi.
Hello, welcome to the podcast. Great to have you here.
Thanks for inviting me.
You're a principal researcher and the research manager at MSR, and your focus is computer vision research.
In broad strokes right now, what gets a computer vision researcher up in the morning? What's the big goal?
Yeah, computer vision is a relatively young research field. In general, you can think of this field as
trying to endow computers with the capability to see the world and interpret
the world just like humans. From a more technical point of view, the input to the computer is really just images and
videos.
You can think of them as a sequence of numbers.
But what we want to extract from these images and videos, from these numbers, is some sort
of structure of the world or some semantic information out of it.
For example, I could say this part of the image really corresponds
to a cat. That part of the image corresponds to a car. This type of interpretation. So
that's the goal of computer vision. For us humans, it looks to be a simple task to achieve,
but teaching computers to do it is hard. We really have made a lot of progress in the
past 10 years. But as a research field,
this thing has been around for 50 years, and there are still a lot of problems to tackle and address.
Yeah. In fact, you gave a talk about five years ago where you said, and I paraphrase,
after 30 years of research, why should we still care about face recognition research?
Tell us how you answered then and now where you think we are.
So I think, the status quo five years ago, I would say, at that moment, if we capture a snapshot of how research in facial recognition had progressed since the beginning of face recognition research, I would say we achieved a lot, but more in controlled environments where you could carefully control the lighting, the camera
setting, and all those kinds of things when you are framing the faces.
At that moment, five years ago, when we moved towards more wild settings, like faces
taken in uncontrolled environments, I would say there was a huge gap there in terms
of recognition accuracy.
But in the past five years, I would say the whole community also made a lot of progress
leveraging the more advanced deep learning techniques.
Even for facial recognition in the wild scenario, we've made a lot of progress and really have
pushed these things to a stage where a lot of commercial applications become feasible.
Okay.
Yeah.
So deep learning has really enabled, even recently, some great advances in the field
of computer vision and computer recognition of images.
Right.
So that's interesting when you talk about the difference between a highly controlled
situation versus recognizing things in the wild.
And I've had a couple of researchers on here who have said,
yeah, where computers fail is when the data sets are not full enough.
For example, dog, dog, dog, three-legged dog.
Is it still a dog?
So what kinds of things do deep learning techniques give you that you didn't have before in these recognition advances?
Yeah, that's a great question.
From a research perspective, you know, the power of deep learning resides in several factors.
The first thing is that it can conduct the learning in an end-to-end fashion and learn what's the right representation for
that semantic pattern. For example, when we're talking about a dog, if we really look into all
kinds of pictures of a dog, say my input is really a 64 by 64 image, and suppose each
pixel has 256 possible values to take, that's a huge space if you think about it combinatorially.
But when we talk about dog as a pattern,
like actually every pixel is correlated a little bit.
So the actual pattern for dog is going to reside
in a much lower dimensional space.
So the power of deep learning is that I can conduct the learning
in an end-to-end fashion and really learn
the right numerical representation for dog. And because of the deep structures, we can come
out with really complicated models, which can really digest a large amount of training data.
So that means like if my training data covered all kinds of variations, like all kinds of views of this pattern, eventually
I can recognize it in a broader setting because I have covered almost all the spaces.
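To make the combinatorics Dr. Hua mentions concrete: a 64 by 64 grayscale image with 256 possible values per pixel lives in a space of 256^4096 possible images, a number with nearly ten thousand decimal digits. Deep learning works because a real pattern like "dog" occupies a far lower-dimensional slice of that space. A quick check in Python (just the arithmetic, nothing from his actual systems):

```python
# Raw input space: each of the 64*64 pixels takes one of 256 values.
num_images = 256 ** (64 * 64)

# 256 = 2**8, so this is exactly 2**(8 * 64 * 64) = 2**32768 images.
assert num_images == 2 ** 32768

digits = len(str(num_images))
print(digits)  # the count has 9,865 decimal digits
```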
So another capability of deep learning is this kind of compositional behavior,
because it is a feedforward structure with a layered representation.
So when the information or image gets fed into
deep networks and it starts by extracting some very low-level image primitives, then gradually
the model can assemble all those primitives together and form higher and higher level
semantic structures. So in this sense, it captures all the small patterns corresponding
to the bigger patterns and compose them together to represent the final pattern. So that's why it
is very powerful, especially for visual recognition tasks. Right. So, yeah. So the broad umbrella of
CVPR is computer vision and pattern recognition. Right. And a lot of that pattern recognition
is what the techniques are really driving toward. Sure. Yeah. So that's actually what computer vision is trying to do, make sense
out of pixels. If we talk about it in a really mechanical way, I feed in the image, and
you either extract some numerical output or some symbolic output from it. The numerical output,
for example, could be a 3D point cloud, which describes the structure of the scene or the shape of an object.
It could also be corresponding to some semantic labels like dog and cat, as I mentioned at the beginning.
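Dr. Hua's description of layered composition, low-level primitives assembled into higher-level semantic structures, can be sketched with a toy two-layer example. This is a hand-rolled numpy sketch, not the deep networks his team actually uses; the filters and the "corner" unit are purely illustrative:

```python
import numpy as np

def conv2d(img, kernel):
    """Valid-mode 2D cross-correlation of a single-channel image."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

relu = lambda x: np.maximum(x, 0)

# Layer 1: low-level image primitives, horizontal and vertical edge detectors.
horiz = np.array([[1, 1], [-1, -1]], dtype=float)
vert = np.array([[1, -1], [1, -1]], dtype=float)

# A tiny 6x6 image containing a bright square (the "pattern").
img = np.zeros((6, 6))
img[2:4, 2:4] = 1.0

edges_h = relu(conv2d(img, horiz))  # fires along horizontal edges
edges_v = relu(conv2d(img, vert))   # fires along vertical edges

# Layer 2: compose the primitive maps into a higher-level unit that
# fires where a horizontal and a vertical edge meet, i.e. a corner.
corner = relu(edges_h + edges_v - 1.0)
print(corner.max() > 0)  # the composed "corner" unit fires -> True
```

Real networks learn these filters end to end from data instead of hand-coding them; the point here is only the stacking of simple primitives into bigger patterns.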
Right. So we'll get to labeling in a bit.
Sure.
It's an interesting part of the whole machine learning process that it has to be fed labels as well as pixels.
Sure.
Right?
Yeah.
You have three main areas of interest in your computer vision research that we talked about.
Video, faces, and arts and media.
Let's talk about each of those in turn and start
with your current research in what you call video understanding. Yes. Video
understanding, like the title sort of explains itself. Now the input
becomes a video stream. Instead of a single image, we're reasoning about
pixels and how they move. If we view computer vision reasoning about single
images as a spatial reasoning problem,
now we are talking about
a spatial-temporal reasoning problem
because video adds a third dimension,
the temporal dimension.
And if you look into
a lot of real-world problems,
we're talking about
continuous video streams,
whether it is a surveillance camera
in a building
or a traffic camera overseeing
highways. You have this constant flow of frames coming in and the object inside it is moving.
So you want to basically digest information out of it. When you talk about those kinds of cameras,
it gives us massive amounts of video, you know, constant stream of cameras and security in the 7-Eleven
and things like that. What is your group trying to do on behalf of humans with those video streams?
Sure. So one incubation project we are doing, like my team is building the foundational technology
there. One incubation project we're trying to do is to really analyze the traffic on roads.
If you think about a city, when they set up all
those traffic cameras, most of the video streams are actually wasted. But if you carefully think
about it, these cameras could be smart. Just think about one scenario where you want to more
intelligently control the traffic lights. So if in one direction I see a lot more traffic flow,
instead of having a fixed schedule for turning those red lights and green lights on and off, I could see, okay,
because this side has fewer cars or even no cars at this moment, I would allow the other direction,
the green light, to stay on longer so that the traffic can flow better. So that's just the
one type of application there. Could you please get that out there?
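The adaptive signal timing Dr. Hua describes can be sketched in a few lines. This is a toy rule with illustrative parameter names and values, not an actual traffic-control system; a real deployment would get the car counts from the camera's detection model:

```python
def green_time(cars_ns, cars_ew, base=20, per_car=2, max_green=60):
    """Toy adaptive signal: give each direction a green phase whose
    length (in seconds) scales with the number of cars counted there.
    All parameters are illustrative, not a real traffic-engineering spec."""
    ns = min(max_green, base + per_car * cars_ns)
    ew = min(max_green, base + per_car * cars_ew)
    # If one approach is empty, hand its slack to the busy direction.
    if cars_ns == 0:
        ns, ew = 0, min(max_green, ew + base)
    elif cars_ew == 0:
        ew, ns = 0, min(max_green, ns + base)
    return ns, ew

print(green_time(12, 0))  # busy north-south, empty east-west -> (60, 0)
```

That last call is exactly the scenario in the conversation: no cross traffic, so the busy direction keeps its green rather than waiting out a fixed schedule.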
I mean, yeah, because how many of us have sat at a traffic light when it's red
and there's no traffic coming the other way?
Exactly.
At all.
Why can't I go?
Yeah, you could also think about some other applications.
Like if we accumulated videos across years,
if citizens are requesting that we set up additional bicycle lanes, we could use the videos we have, analyze all the traffic data there, and then decide if it makes sense to set up a bike lane there.
If we set it up, will it significantly affect the other traffic flows? That can help cities make decisions like that.
I think this is so brilliant because a lot of times we make decisions based on, you know,
our own ideas rather than data that says, you know, hey, this is where a bike lane would be terrific. This is where it would actually ruin everything for everybody, right?
For sure. Yeah. Sometimes they leverage some other types of sensors to do that. Usually you
hire a company to set up some special equipment on the roads to do that.
But it's very costly and inefficient.
Meanwhile, all those cameras are just sitting there, and the video streams are already there.
So that's a fantastic explanation of what you can do with machine learning and video understanding.
Right.
Yeah.
Another area you care about is faces
and kind of harkens back to the
why should we still care
about facial recognition research?
But yeah.
And this line of research
has some really interesting applications.
Talk about what's happening
with facial recognition research.
Who's doing it and what's new?
Yeah.
So indeed,
if we look back,
facial recognition technology
has progressed at Microsoft.
I think that when I was at Live Labs Research, we set up the first facial recognition library,
which could be leveraged by different product teams.
Indeed, the first adopter is Xbox.
They tried to use facial recognition technology for automatically user login at that moment.
I think that's the first adoption.
Over time, the center of facial recognition research sort of migrated to Microsoft Research
Asia, where we still have a group of researchers I collaborate with.
We are continuously trying to push the state of the art out.
This has become more of a synergistic effort where we have engineering teams helping us to
gather more data and then we just train better models. Our research recently actually focused
more on a line of research we call identity-preserving face synthesis. Recently there has been a
big advancement in the deep learning community, which is using deep networks to build generative models
which can model the distribution of images so that you can draw from that distribution,
basically synthesize an image. You build a deep network whose output is an image.
So what we want to achieve is actually a step further. We want to synthesize faces. Well,
we want to keep the identity of those faces.
We don't want our algorithms to just randomly sample a set of faces out without any semantic information.
Say you want to generate a face of Brad Pitt, I want to really generate a face that looks like Brad Pitt.
If I want to generate a face similar to anybody I know, I think we just want
to be able to achieve that. So the identity preservation is the sort of outcome that you're
aiming for of the person that you're trying to generate the face of. Right. You know, tangentially,
I wonder if you get this technology going, does it morph with you as you get older and start to
recognize you? Or do you have to keep updating your face?
Yeah, that's indeed a very good question.
I would say in general, we actually have some ongoing research trying to tackle that problem.
I think for existing technology, yes, you need to update your face maybe from time to
time, especially if you've undergone a lot of changes.
For example, somebody could have done some plastic surgery.
That would basically break the current system.
Wait a minute, that's not you.
Sure, no, not me at all.
There are several ways you can think about it.
Human faces actually don't change much between age 17, 18, when you grow up,
all the way to maybe 50-ish.
So when kids are first born, their faces actually change a lot, because the bones are growing, and basically the shape and the skin can change a lot.
But once people mature into the adult stage, the change is very slow.
So we actually have some research where we're trying to model the aging process too; that will help establish
a better facial recognition system across ages. This is actually a very good kind of
technology which can get into the law enforcement domain. For example, some
missing kids could have been kidnapped by somebody.
But after many years, if you...
They look different.
Yeah, they look different. If smart facial recognition algorithms can match the original photos, you may be able to identify...
And say what they would look like at maybe 14 if they were kidnapped earlier.
Yes, yes, exactly.
Wow, that's a great application of that.
Well, let's talk about the other area that you're actively pursuing, and that's
media and the arts.
Tell us how research is overlapping with art, and particularly with your work in deep artistic
style transfer.
Sure.
If we look into people's desire, right?
First we need to eat, and we need to drink, and we need to sleep.
Okay.
Then once all these needs are fulfilled,
actually, we humans have a strong desire for art.
And creation.
And creation and things like that.
So this theme of research in computer vision,
we link it to a more artistic type of work, what we call media and arts,
basically using computer vision technologies
to give people good artistic enjoyment.
So the particular research project we have done in the past two years
is a sequence of algorithms where we can render an image into any sort of artistic styles you want,
as long as you provide an example of that artistic style.
For example, we can render an image to
Van Gogh's style. Van Gogh. Yeah. Or any other painter's painting style. Yeah. Or Picasso.
Yeah. All of them, if you can think of anything like that. Interesting. With pixels. With pixels,
yeah. Those are all, again, all done by deep networks and some deep learning technologies we
designed.
It sounds like you need a lot of disciplines to feed into this research.
Where are you getting all of your talent from in terms of?
In a sense, I would say our goal on this is about access. You know, artworks are not
necessarily accessible to everyone. Some of these artworks are really expensive.
Yeah.
By this kind of digital technology,
what we are trying to do is make these kinds of artworks
accessible to common users.
To democratize it.
Yeah, democratize it, as you mentioned.
Yeah, that's good.
Our algorithm allows us to build an explicit representation,
like a numerical representation, for each kind of style.
Then if you want to create new styles, we can blend them.
So it is like we are building an art space where we can explore in between,
to see how these visual effects evolve between two painters and things like that,
and even get a deeper understanding of how they compose their artistic styles.
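The "art space" idea, building a numerical representation per style and blending between them, can be sketched with Gram-matrix statistics, a common numerical summary of style in the research literature. The random feature maps below are stand-ins for features extracted from real paintings, and none of this is the team's actual model:

```python
import numpy as np

def style_stats(feature_map):
    """Gram matrix of a (channels, height, width) feature map: channel
    co-activation statistics, a common numerical summary of style."""
    f = feature_map.reshape(feature_map.shape[0], -1)
    return f @ f.T / f.shape[1]

def blend_styles(gram_a, gram_b, alpha):
    """Interpolate in 'art space': alpha=0 is style A, alpha=1 is style B."""
    return (1 - alpha) * gram_a + alpha * gram_b

rng = np.random.default_rng(0)
style_a = style_stats(rng.normal(size=(8, 16, 16)))  # stand-in for one painter
style_b = style_stats(rng.normal(size=(8, 16, 16)))  # stand-in for another

halfway = blend_styles(style_a, style_b, 0.5)  # a new, in-between style target
```

A full style-transfer system would then optimize an image, or train a feed-forward network, so the rendered image's feature statistics match `halfway`; this fragment only shows the representation-and-blend step that makes the in-between exploration possible.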
What's really interesting to me is that this is a really
quantitative field, computer science algorithms, a lot of math and numbers, and then you've got
art over here, which is much more metaphysical. And yet you're bringing them together and it's
revealing the artistic side of the quantitative brain. Sure. I think to bring all these things
together, the biggest tool we are leveraging is indeed statistics. Like all kinds of
machine learning algorithms, it's really trying to capture the statistics of the
pixels. We have been a little technical, but let's get a little more technical.
Some of your recently published work, and our listeners can find that on both the MSR website and your website.
Sure.
You talked about a new distributed ensemble approach to active learning.
Tell us about what you propose. How is it different, and what does
it promise?
Yeah, that's indeed a great question.
I think when we are talking about active learning, we are referring to a process where we have
some sort of human oracle involved in the learning process.
In traditional active learning, we are saying that I have a learning machine.
This learning machine can intelligently pick
up some data samples and ask the human oracle to provide a little bit more input. So the
learning machine actually picks the samples and asks the human oracle to actually provide,
for example, a label for this image. So in this work, when we're talking about the ensemble
machine, we are actually dealing with a more complicated problem.
We are actually trying to bring active learning into the crowdsourcing environment.
If you think about the Amazon Mechanical Turk, now this is really one of the biggest platforms where people send their data and ask the crowd workers to label all of them.
But in this process, if you are not careful,
the labels you collected from this process
for your data could be quite lousy.
Right.
Yeah, they may not be able to be used by you.
So in this process,
we actually tried to achieve two goals there.
The first goal,
we want to smartly distribute the data
so that we can make the labeling
most cost-effective, okay?
The second is that we want to actually assess the quality of all my crowd workers so that
maybe even in the online process, I can purposely send my data to the good workers to label.
So that's how our model works.
So actually we have an ensemble model, a distributed one.
Each crowd worker corresponds to one of these learning machines.
And we try to do a statistical check across all the models so that in the same process,
we actually come out with a quality score for each of the crowd workers on the fly. So that we can use the model to not only select the samples,
but also send the data to the labelers with the highest quality to label them.
That way, as these labeling efforts progress, I can quickly come out with a good model.
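A minimal sketch of the on-the-fly worker-quality idea: score each worker by agreement with the ensemble consensus, then route new data to the best scorers. The real model in the work described couples this with active sample selection and per-worker learning machines; the workers and labels below are made up:

```python
def majority_vote(worker_labels):
    """Ensemble consensus: for each item, take the most common label."""
    n = len(next(iter(worker_labels.values())))
    consensus = []
    for i in range(n):
        votes = [labels[i] for labels in worker_labels.values()]
        consensus.append(max(set(votes), key=votes.count))
    return consensus

def estimate_quality(worker_labels, consensus):
    """Score each worker by how often they agree with the consensus,
    a cheap on-the-fly proxy for label quality."""
    return {
        worker: sum(l == c for l, c in zip(labels, consensus)) / len(consensus)
        for worker, labels in worker_labels.items()
    }

# Three hypothetical crowd workers labeling the same six images.
worker_labels = {
    "w1": ["cat", "dog", "dog", "cat", "cat", "dog"],  # reliable
    "w2": ["cat", "dog", "dog", "cat", "dog", "dog"],  # one slip
    "w3": ["dog", "cat", "dog", "dog", "cat", "cat"],  # noisy
}
consensus = majority_vote(worker_labels)
quality = estimate_quality(worker_labels, consensus)
best_worker = max(quality, key=quality.get)  # route the next batch here
print(best_worker)
```

With only two possible labels and three workers, every item has a clear majority, so the consensus is well defined; real systems must also handle ties and adversarial workers.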
That leads me to the human-in-the-loop issue and the necessity of the checks and balances
between humans and machines.
Aside from what you're just talking about,
how are you tackling other issues of quality control by using humans with your machines?
I have been thinking about this problem for a while, mainly in the context of robotics.
If you think about any intelligent system, I would say,
unless you are in a really closed-world setting,
you may have a system which can run fully autonomously.
But whenever we hit the open world, current machine learning-based intelligent systems,
a lot of them are really not good at dealing with all kinds of open-world cases,
because there are corner cases which may not have been covered.
And variables that you don't think about.
Exactly. So one thing I have been thinking about is how we could really engage the
human in that loop, to not only help the intelligent agent when it needs help, but also at the same
time form some mechanism by which we can teach this agent to be able to handle similar situations in the future.
I will give you a very specific example.
When I was at Stevens Institute of Technology, I had a project from NIH, which we call the co-robots.
What kind of robots?
Co-robots are actually wheelchair robots.
The idea is that even if the user cannot move their legs, as long as they can move their head,
we can use a head-mounted camera
to track the pose of the head and let the user control the wheelchair
robot. But we don't want the user to control it all the time. So our goal is actually, say, in a home setting, we want these wheelchair robots to be able to carry
the user and move largely autonomously inside the room. Whenever the user gives guidance,
say, hey, I want to go to the bedroom, then the wheelchair robot would mostly do autonomous
navigation.
But if the robot sort of encounters a situation
it does not know how to deal with, for example, how to move around something,
then at that moment the robot is going to proactively ask the human for control.
Then the users will control the robots and deal with that situation.
Maybe next time these robots encounter a similar situation,
they're going to be able to deal with it on their own.
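The ask-for-help-then-remember loop Dr. Hua describes can be sketched as follows. This is a toy policy table with made-up situations, actions, and a confidence threshold, not the co-robot's actual controller:

```python
def navigate(situation, policy, human_control, confidence_threshold=0.7):
    """Human-in-the-loop control sketch: act autonomously when the
    policy is confident, otherwise hand control to the user and
    remember the demonstrated action for similar future situations."""
    action, confidence = policy.get(situation, (None, 0.0))
    if confidence >= confidence_threshold:
        return action, "autonomous"
    # The robot proactively asks the human, then learns from the answer.
    action = human_control(situation)
    policy[situation] = (action, 1.0)  # trust the human demonstration
    return action, "human"

policy = {"open hallway": ("go_forward", 0.95)}
teach = lambda situation: "turn_left"  # the human's demonstrated action

print(navigate("open hallway", policy, teach))       # ('go_forward', 'autonomous')
print(navigate("cluttered doorway", policy, teach))  # ('turn_left', 'human')
print(navigate("cluttered doorway", policy, teach))  # ('turn_left', 'autonomous')
```

The third call shows the payoff: after one human demonstration, the same situation is handled autonomously. A real system would generalize across similar situations rather than matching them exactly.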
What were you doing before you came here, and how did you end up at Microsoft Research?
This is my second term in Microsoft.
So, as I mentioned, the first term is between 2006 and 2009 when I was in a lab called LiveLabs.
That's my first time.
During that tenure, I established the first face recognition library.
Then I kind of got attracted by the external world a little bit. So I went
to Nokia Research, IBM Research, and I landed at Stevens Institute of Technology as a faculty
member there.
And that's over in New Jersey, right?
Yeah, that's in New Jersey, on the East Coast. Then in 2015, I came back to Microsoft
Research, but in the Beijing lab first. I transferred back here in 2017 because my family stayed here.
So now you are here in Redmond after Beijing.
How did that move happen?
My family always stayed in Seattle.
So Microsoft Research, Beijing lab is a great place.
I would say I really enjoyed it.
One of the unique things there is the super, super dynamic research intern program.
So year-round, there are several hundred interns actually working in the lab,
and they collaborate closely with their mentors.
I think it's a really dynamic environment there.
But because my family is in Seattle, I sort of explored a little bit,
and then the intelligence group was setting up this
computer vision group here. So that's why I
joined.
Back in Seattle again. Yeah.
So I
asked this question of all the researchers that come
on the podcast, and I'll ask you too.
Is there anything about your work
that we should be
concerned about? As I say, anything that keeps
you up at night?
I would say when we talk about,
especially in the commutation domain,
I think privacy is potentially the largest concern.
If you now look across all countries,
there are hundreds of millions of cameras
set up everywhere,
in public spaces or in buildings.
And, I would say, with the technology advancement, it is really not sci-fi to expect
that cameras could now really track people all the time.
I mean, everything has two sides, I would say.
Sure.
Yeah.
This, on one hand, could help us, for example, to better deal with criminals.
But for ordinary citizens, there are a lot of privacy concerns in that data.
So what kinds of things, and this is why I ask this question, because it prompts people to think,
okay, I have this power because of this great technology.
What could possibly go wrong?
Sure.
So what kinds of things can we be thinking about and instituting or implementing to not have that problem?
Microsoft has a big effort on GDPR. And I think that's great because this is a mechanism to ensure
everything we produce is actually aligned
with certain regulations.
On the other hand, everything needs to strike a balance between usability and security
or privacy.
Sure.
If you think about it, when you use some online services, your activities basically leave
traces there.
That's how they are used to better serve you in the future.
But if you want more convenience, sometimes you need to give a little bit of information
out, though you don't want to give all your information out.
I think the boundary is actually not black and white.
We simply need to carefully control that so that we get just the right amount of information
to serve the customer better, but not a lot of unneeded
information, or information that users are not comfortable giving.
So it seems like there's a trend towards permissions and agency of the user to say,
I'm comfortable with this, but I'm not comfortable with that.
Right.
As we finish up here,
talk about what you see on the horizon
for the next generation
of computer vision researchers.
What are the big unsolved problems
that might prompt
exciting breakthroughs
or just be the grind
for the next 10 years?
That's a great question
and also a very big question.
There are big problems
we actually
should tackle. If you think about like now, computer vision really leverages statistical
machine learning a lot. We can train recognition models which achieve great results. But
that process is largely still appearance-based. So we need to better bring some of the fundamentals of computer vision, which is 3D
geometry, into the perception process. And there are also other things, especially when we are
talking about video understanding. It's a holistic problem where you need to do spatial temporal
reasoning, and we need to be able to factor more cognitive concepts into this process, like causal
inference. If something happened,
what really caused this thing to happen? Machine learning techniques mostly deal with correlation
between data. Correlation and causality are two totally different concepts there. So I feel that
also needs to happen. And some of the fundamental problems like learning from small data
and even learning from language
potentially we need to address.
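Dr. Hua's distinction between correlation and causality can be illustrated with a toy confounder; the variable names are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(42)

# A hidden confounder (say, hot weather) drives both quantities below;
# neither causes the other, yet they end up strongly correlated.
weather = rng.normal(size=5000)
ice_cream_sales = weather + 0.3 * rng.normal(size=5000)
sunburns = weather + 0.3 * rng.normal(size=5000)

r = np.corrcoef(ice_cream_sales, sunburns)[0, 1]
print(round(r, 2))  # strong correlation, but no causal link between the two
```

A model trained only on the observed pairs would happily predict one from the other, which is exactly why inferring what caused what requires more than the correlational machinery most current learning techniques provide.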
Think about how we humans learn.
We learn in two ways.
Learning from experience,
but there are a lot of facts
that we learn from language.
For example, while we are
talking with each other,
indeed, similarly through language,
I already learned a lot from you,
for example. And I you. Sure. Yeah, that's a very compact information flow. We are now centrally
focused on deep learning. If we look back 10, 15 years ago, you see the computer vision
community is more diverse. You see all kinds of machine learning methods.
You see all kinds of knowledge
borrowed from physics,
from the optics field,
all getting into this field
to try to tackle the problem
from multiple perspectives.
As we are emphasizing diversity everywhere,
I think the scientific community
is going to be more healthy
if we have diverse perspectives
and tackle
the problem from multiple angles.
You know, that's great advice because as the community welcomes new researchers,
they want to have big thinkers, broad thinkers, divergent thinkers to sort of push for the
next big breakthrough.
Yeah, exactly.
Gang Hua, thank you for coming in. It's been really illuminating and I've really enjoyed our conversation.
Thank you very much.
To learn more about Dr.
Gang Hua and the amazing advances in computer vision, visit Microsoft.com slash research.