Microsoft Research Podcast - 076 - Speech and language: the crown jewel of AI with Dr. Xuedong Huang

Episode Date: May 15, 2019

When was the last time you had a meaningful conversation with your computer… and felt like it truly understood you? Well, if Dr. Xuedong Huang, a Microsoft Technical Fellow and head of Microsoft's Speech and Language group, is successful, you will. And if his track record holds true, it'll be sooner than you think! On today's podcast, Dr. Huang talks about his role as Microsoft's Chief Speech Scientist, gives us some inside details on the latest milestones in speech and language technology, and explains how mastering speech recognition, translation and conversation will move machines further along the path from "perceptive AI" to "cognitive AI" and that much closer to truly human intelligence.

Transcript
Starting point is 00:00:00 At some point, let's say computers can understand 300 languages, can fluently communicate and converse. I have not run into a person who can speak 300 languages. And not only machine can fluently communicate and converse, but can comprehend, understand, and learn, and reason, and can really finish all the PhD courses in all the subjects. The knowledge acquisition, reasoning, is beyond anyone's individual capability. When that moment is here, you can think about how intelligent that AI is going to be.
Starting point is 00:00:47 You're listening to the Microsoft Research Podcast, a show that brings you closer to the cutting edge of technology research and the scientists behind it. I'm your host, Gretchen Huizinga. When was the last time you had a meaningful conversation with your computer and felt like it truly understood you? Well, if Dr. Xuedong Huang, a Microsoft Technical Fellow and head of Microsoft's Speech and Language Group, is successful, you will. And if his track record holds true, it'll be sooner than you think. On today's podcast, Dr. Huang talks about his role as Microsoft's chief speech scientist,
Starting point is 00:01:26 gives us some inside details on the latest milestones in speech and language technology, and explains how mastering speech recognition, translation, and conversation will move machines further along the path from perceptive AI to cognitive AI, and that much closer to truly human intelligence. That and much more on this episode of the Microsoft Research Podcast. Zhudong Huang, welcome to the podcast. Thank you. You are a Microsoft Technical Fellow in the Speech and Language Group, and you lead Microsoft's spoken language efforts.
Starting point is 00:02:11 So we're going to talk in depth about these in a bit, but first, as the company's chief speech scientist, give us a general view of what you do for a living and why you do it. What gets you up in the morning? Well, what we do is really make sure we have the best speech and language technology that can be used to empower a wide range of scenarios. The reason we have a group to do that is really, I feel that, you know, this is not only the most natural way for people to communicate as we're doing right now, but it's really the hardest AI challenges we're facing.
Starting point is 00:02:49 So that's what we do, trying to really drive breakthrough, deliver these awesome services on our cloud Azure services, and make sure we are satisfying a wide range of customers, both inside Microsoft and outside of Microsoft. There are three things, Biddy, if you want to frame this whole thing. Yeah. The first, we have the horsepower to really drive speech recognition accuracy, to drive the naturalness of our synthesis effort, to make sure translation quality is accurate when
Starting point is 00:03:22 you translate from English to Chinese or French or German. So there's really a lot of science behind that, making sure the accuracy, naturalness, latency, they are really world-class. So that's one. The second one is really, we not only provide technology, we deliver services on Azure, that from Office to Windows, Cortana, they all depend on the same cloud services.
Starting point is 00:03:50 And we also have edge devices, like our speech device SDK. So we want to make sure the speech on the edge and the cloud, they are really delivered in a modern fashion. That's the platform in the cloud and embedded. So that's the second. The platform is modern. The third one is really to show our love to the customer because we have a wide range of customers worldwide. We want to really delight and make sure our customer experience using speech translation is top notch. That's actually really three key things I do. AI horsepower, modernize our platform in the cloud
Starting point is 00:04:30 and on the edge, and love our customers. Well, and you've got a lot of teams working in these groups to tackle each of these pillars, we might call them. Yes, we have teams worldwide as well. And so the diversity is amazing because we are really trying to address the language barriers, trying to remove the language barriers. So we do have teams in China, we have teams in Germany, in Israel, in India, and in the U.S., of course. So we really work around the globe
Starting point is 00:05:06 trying to deal with these language challenges. So I want to start by quoting you to set the stage for our conversation today. You said speech and language is the crown jewel of AI. So unpack that for us. Well, we can think in the scale of human's evolution. And at some point, the language was born. That accelerated human's evolution.
Starting point is 00:05:33 If you think about all the animals on this planet, you know, there are animals running faster than humans. They can see better. Their teeth are sharper. Especially in the night. They're stronger. Yeah. They can actually hear better can see better. Their teeth are sharper. Especially in the night. They're stronger. Yeah. They can actually hear better, smell better.
Starting point is 00:05:49 Only we, humans, have the language. We can organize better. We can describe in science fiction term. We can really organize ourselves, create a constitution. So if you look at the humans, it is speech and language that set us apart from other animals. For artificial intelligence, speech and language will drive the evolution of AI, just like it did to humans. That's why it's the crown jewel of AI, and it's a tough one to crack.
Starting point is 00:06:24 Yeah. There's a whole one to crack. Yeah. There's a whole philosophical discussion on that topic alone, but it leads to some interesting questions about, you know, if you're wildly successful with machine language, what are these machines? So let's just actually, you know,
Starting point is 00:06:41 set our imagination off a little bit, right? And at some point, let's say computers can understand 300 languages, can fluently communicate and converse. I have not run into a person who can speak 300 languages. And not only machine can fluently communicate and converse, but can comprehend, understand, and learn, and reason, and can really finish all the PhD courses in all the subjects. The knowledge acquisition, reasoning, is beyond anyone's individual capability.
Starting point is 00:07:24 When that moment is here, you can think about how intelligent that AI is going to be. Is this something you envision? Yes. Do we want that? Yes. I think this world will be a much better place. I was in Japan just a few weeks ago,
Starting point is 00:07:47 carrying Microsoft Translator on my mobile devices. I was able to really communicate with Japanese who do not speak Chinese or English. It's already there. Microsoft Translator can speak the language I do not speak and helped me to be more productive when I was in Japan. So I'm all about that. Just scares me a little bit to think about a machine. We weren't first, we're not last, we're just next.
Starting point is 00:08:18 But, you know, there are two levels of intelligence. The first level is really perceptive intelligence. That is the ability to see, to hear, to smell. Then the higher level is cognitive intelligence. That is the ability to reason, to learn, and to acquire knowledge. Most of the AI breakthroughs we have today, they are in the perceptive level, such as speech recognition, speech synthesis, computer vision. But this high-level reasoning and the knowledge acquisition, cognitive capability is still far from being close to human's level. And what I'm excited about translation, it is really something between perceptive intelligence and cognitive intelligence.
Starting point is 00:09:11 And the fact that we're actually able to really build the success on the perceptive intelligence and expand into cognitive intelligence is quite a journey. And I do not know when we're going to reach that milestone. But that one is coming.
Starting point is 00:09:29 It's just a matter of time. Could take 50 years, but I think it is going to happen. We'll have to come back for another podcast to talk about that milestone because we're going to talk about a couple milestones in a minute. But first I want to do a little bit of backtracking because you've been around for a minute. But first, I want to do a little bit of backtracking, because you've been around for a while. And you started in Microsoft Research right about the time Rick Rashid was setting the organization up. And speech was one of the first groups that was formed. And according to
Starting point is 00:09:56 MSR lore, the goal of the group was to make speech mainstream. So give us a brief history of speech at MSR. How has the research gone from not mainstream in those early take risks and look far out days to being a presence in nearly every Microsoft product today? Before I joined Microsoft Research, I was also on the faculty and the CMU in Pittsburgh. So Big Brother was a professor there. I was a junior faculty member. So, I was doing my research mostly in the CMU on speech. Microsoft reached out and they wanted to set up a speech group. So, I moved actually on the first day of 1993, after the New Year's break, I flew from Pittsburgh to Seattle and started that journey and never changed. So that was the beginning of Microsoft Speech.
Starting point is 00:10:57 We were the research group that really started working on bringing speech to the developers. Right. So... Not just Blue Sky Research anymore. Not just Blue Sky Research. So we licensed technology from CMU. That's how we started. So we're very grateful to CMU's pioneering research in this area. So we were the research group, but we delivered the first speech API, SAPI, on Windows 95
Starting point is 00:11:24 as a research group. We were pretty proud of that because usually research is doing only blue sky research. We not only did blue sky research, continue to push the envelope, continue to improve the recognition accuracy, but we also worked with Windows, brought that technology to Windows developer.
Starting point is 00:11:45 So SAP was the first speech API in the industry on Windows. And that was really quite a journey. And then I eventually left research, joined the product group. I took the team, and it was also an exceptional Microsoft speech research group came with me, went to the product group. So this has been really a fascinating 27 years experience at Microsoft. I stopped doing speech after 2004, after we shipped the speech server. And I started many different things, including running the incubation for research as a startup.
Starting point is 00:12:28 Yeah. And I also worked as an architect for Satya Nadella when he was running Bing. Okay. I was helping incubating a wide range of AI projects from a foundational piece, like a GPU cluster, Project Philly, the deep learning toolkit, CNTK, and of course, speech research, all the way to the high-end solution, like customer care intelligence. About three years ago, I had the privilege to return to run a combined speech and language group. So basically, we were able to consolidate all the resources working on speech and the translation. And that was the story,
Starting point is 00:13:16 really, you know, the journey of my experience, a fascinating 27 years. Where does speech and language live right now? So, as I said, we moved back and forth multiple times between research and product group. Right now, we are sitting in cloud and AI group. This is a product group. We're part of this cloud services. And we provide company-wide and industry-wide speech and translation services. We also have a speech and dialogue research.
Starting point is 00:13:49 They are really operating like a research group. They are all researchers in our team. As what Rick has been saying, tech transfer is a full contact spot. We are not just a full contact spot. We are a one-body spot. So it's actually a very exciting group with a group of very talented, very innovative people. So it's still forward-thinking in the research mode. It's both forward-thinking and well-grounded.
Starting point is 00:14:21 We have to be grounded to deliver services from infrastructure to cost of serving. And we also have to be standing high to see the future, to define what is the solution that the people need and people want, even though the solution may not have existed and they may not know what it is at this moment. Well, let's talk about some specific research milestones that you've been involved in. They're really interesting. Three areas you've been involved in, conversational speech recognition, machine translation, and conversational Q&A. So let's start with the recognition. In 2016, you led a team that reached historical human
Starting point is 00:15:17 parity in transcribing conversational speech. Tell us about this. What was it part of? How did it come about? So in 2016, we reached the human parity on the broadly used switchboard conversational transcription task. That task has been used in the research community and the industry probably over 10 years. And 2017, we redefined the human priority milestone. So we're not competing with only one single person. We're competing with a group of people to transcribe the same task. So I would say 2017 is a really historical moment in comparison to a group of people transcribing the same task. Microsoft Speech Stack outperformed all four teams combined together. When I challenged our research group, nobody thought that was even feasible. But in less than two years, amazingly, when we had the conviction and the resource and the focus, magic indeed happened. So that was actually a fantastic moment for the team,
Starting point is 00:16:28 for science, for the technology stack. That was the first human priority milestone for my personal professional career. So I want to go in the weeds a little bit on this because this is interesting what you say. In two years, nobody thought it was possible, and then you did it. Tell us a little more about the technical aspects of how you accomplish this. So if you look at the history of speech research, the speech group pioneered many breakthroughs that got reused by others. Let's take translation as an example. So even for speech, in early 70s, by others. Let's take translation as an example. So,
Starting point is 00:17:05 even for speech, in early 70s, the speech recognition used more traditional AI, like a rule-based approach, expert system. And IBM Watson Research pioneered
Starting point is 00:17:19 statistic speech recognition using hidden marker model, using, you know, statistic language model. They really pushed the envelope and advanced the field. So,
Starting point is 00:17:33 that was a great moment. It was the same group of IBM speech researchers. They borrowed the same idea from speech recognition, applied that to translation. They rewr wrote translation history, really advanced the quality of translation substantially.
Starting point is 00:17:52 And after Hidden Marker Model, it was deep learning that started with speech recognition, neural speech recognition. And once again, translation borrowed the same thing with neural machine translation. That also advanced. So you can see the mirror of using technology speech people pioneered.
Starting point is 00:18:15 Actually, speech guys have been doing this, you know, systematic benchmarking funded by DARPA, very rigorous evaluation. That really changed how science and engineering could be evaluated. So there are many broad lessons from speech technology community that could have been used broadly beyond speech. So we got to train to deal with tough problems.
Starting point is 00:18:42 It's no wonder the same group of people could have achieved this historic milestone. Well, let's talk about another human parity milestone, the automatic Chinese to English news translation for the WMT 2017 task. And I had Arul Menezes on the show to talk all about that. But I'd love your perspective on whether and how, this kind of goes back to what we talked about at the beginning, whether and how you think machines can now compare to traditional human translation services
Starting point is 00:19:11 and why this work is an important breakthrough for barriers between people and cultures. So the second human priority breakthrough from my team is equally exciting. As I said, transcribing switchboard conversational speech is a great milestone, but it's really at a very low level and a perceptive AI level. Translation is a task that is between perceptive AI and the cognitive AI. Of course, translation is a harder task and nobody believed we could have achieved this.
Starting point is 00:19:48 So we set a goal in five years, let's see if we can achieve translation human parity on the sentence by sentence basis. So I want to really put that condition here. When human, like you and me, translate, we look at the whole paragraph, we have the broader context, we do a better job. So we limited ourselves because for the broader use the WMT, which is just news translation measured on the sentence by sentence level, it's a broader open research public benchmark. Even for that one, we thought it could have taken five years. So we applied the same principle,
Starting point is 00:20:28 built on the success we had on transcribing switchboard speech recognition. But this time, we actually went one step beyond. We partnered with Microsoft Research Group in Beijing, because it's a Chinese to English translation. So across Pacific, multiple teams in Microsoft Research Asia worked together days and nights.
Starting point is 00:20:52 Amazingly, this group of people surprised everyone. We delivered this in less than a year, reaching human parity on the historical translation level, better than professional people on the same task as measured by our scientists. So this time, really, we did something magic. I'm very proud of the team. I'm very proud of the collaboration. Well, another super interesting area that I'd love to talk about with you is what you call COCA, and that's C-O-Q-A, conversational Q&A.
Starting point is 00:21:29 So obviously we're talking about computers having this conversation with us, question and answer. Tell us about the work that's going on in this most human and perhaps most difficult of tasks in speech recognition technology. So this task is pioneered by Stanford researchers. It's even one step closer to cognitive AI. This is really machine reading comprehension task with conversation, with dialogue about the task.
Starting point is 00:21:59 Let's say you read a paragraph. Then we challenge the reader to answer correctly with a sequence of questions that are related. For example, if you read the paragraph about Bill Gates, the first question could have been, who is the founder of Microsoft? The second question could be related to the first one. How old is the person when the person started? Or you could have say, and when the person retired, how old was he? So that context relevancy is harder than simple
Starting point is 00:22:36 machine-meaning comprehension because there's a sequence of related question you have to answer, given the context. So for this latest breakthrough, and I have to give credit mostly to our colleagues in Beijing Research Lab, we have been pioneering this working together using shared resources and the infrastructure. And it's just amazing. I'm so impressed with the agility and the speed we have to achieve this amazing conversational question answering challenge. So, the leading researchers, they're all in Beijing,
Starting point is 00:23:14 will play a great and supporting role, helping Microsoft once again be the first to achieve human parity on this broadly watched AI task. Nobody believed anyone could have achieved this conversational Q&A human parity in such a short time.
Starting point is 00:23:35 And so we thought it might take two years. Once again, we broke historical record. Well, we've talked a little bit about the more technical aspects of what you're doing and how you're doing this. So on this last one, are there any other methodologies or techniques that you brought to the table to conquer this Q&A task? So Microsoft has accumulated 30 years of research and experiences in AI, right?
Starting point is 00:24:06 The natural language group in Beijing, they have been doing this in the last 20 years and have accumulated lots of talents, lot of experiences. And we basically use the deep learning and transfer learning. Also we build our success on top of the whole community. For example, Google, they delivered this fascinating technology called BERT. Is that an acronym? Yes, it's an acronym. It's an embedding technology.
Starting point is 00:24:35 We built the success on top of that, expanded that. That's how we achieved the human priority breakthrough. So it's really a reflection of the collective community. And I talked about the collaboration between Microsoft Research in Asia and our team in the US. Actually, this is a great example of collaboration of the whole industry. On the heels of everything that could possibly go right, and it's pretty exciting what you've described to us in this podcast, we do have to address what could possibly go wrong if you're successful. You want to enable computers to listen, hear, speak, translate, answer questions,
Starting point is 00:25:26 basically communicate with people. Does anything about that keep you up at night? Yes, absolutely. My worry is really someday humans can be too dependent on AI. And AI will never be perfect. AI would have a unique sort of biases. So I worry about that unconscious influence.
Starting point is 00:25:56 Right. So how to deal with that is really a broad societal issue that we have to be aware and we have to address. Because just like anyone, if you have an assistant you depend on, you absolutely know how much that assistant can influence you, change your agenda, change your opinion. AI one day is going to play the same role. AI will be biased.
Starting point is 00:26:26 And how do we deal with that is my top concern. If everything goes well, that is really a top issue we have to deal with. We have to learn how to deal with it. We do not know because we're not there yet. So what kinds of design thinking are you bringing to this as you build these tools that can speak and listen and converse? Because one of the biggest things is that human ability to impute human qualities to something that's not human. I think just, you know, there are enough responsible people working on AI.
Starting point is 00:27:04 And the good news is that we're not there yet, right? So we have time to work together to deal with that and make sure AI is going to really serve mankind, not to destroy mankind. So that's my top worry, what keeps me awake. But my short-term worry is really AI is not good enough. Not yet. And people, as Bill Gates
Starting point is 00:27:29 used to say, always overestimate what you can do in the short term and underestimate the impact in the long term. For this case, we cannot underestimate the long-term impact, long-term milestone. Okay. It's story time.
Starting point is 00:27:46 Good. Tell us a bit about your life. What's your story? What got you interested in research, particularly speech and language technology research? And what was your path to MSR? Good. I was a graduate student in Beijing's Tsinghua University.
Starting point is 00:28:03 At that time, my first computer was Apple II. So because you know Chinese language is not easy to type, so it was very cumbersome. So that necessity brought me to speech recognition. My dream at that time was as a graduate student in Tsinghua, actually was in AI, in AI of Tsinghua's, you know, graduate school. It was fantastic to have, you know, so many professors and the faculty members who had that long-term vision and set up the pioneering
Starting point is 00:28:41 environment for us to explore and experiment with. So I finished my master degree. I was in the PhD program and I have been working on speech recognition since 82, because I was enrolled, admitted to Tsinghua in 1982. That dream to make it easier for people to really communicate with machines never disappeared. So I have been working on this for over 30 years.
Starting point is 00:29:12 Even though on Microsoft for a shorter period of time, I stepped out of speech, but I was still doing something related. So I really thought this was a fascinating story. So I got some personal, really interesting story. As I said, you know, it was hard to type in Chinese when I was at the Tsinghua University. And I didn't finish my PhD at Tsinghua. I went to University of Edinburgh in Scotland. And I did finish my PhD there. But my personal pain point when I first landed in Edinburgh was really, I learned English, mostly American English, in China.
Starting point is 00:30:01 It wasn't that good because it wasn't my native language. But listening to Scottish professor talking was always challenging. But I was so grateful BBC had the closed captioning. Oh, funny. So I really learned my Scottish English from watching BBC. And I have to say that automatic captioning technology is available on Microsoft PowerPoint today. And that journey of personally paying points to what office PowerPoint teams can bring together is fascinating and personally extremely rewarding. I'm so grateful to see the technology I have to work on is going to help many other people
Starting point is 00:30:43 who are attending Scottish universities. You know, Arul talked about that PowerPoint service, and he was talking about people who had hearing disabilities. You give it a whole new... It's much broader because the language barrier is always there. Not everyone is as fluent. And I host many visitors. Almost every year, I'm hosting Tsinghua University MBA students.
Starting point is 00:31:12 And they all learn English. But their ability to converse and listen simply is not as good as native people here. So the simple fact that we are able to provide captioning on the PowerPoint presentation helped all of them to learn and understand much better. So this is actually a fairly broad scenario without even translating. Just the fact you have captioning will enhance the communication.
Starting point is 00:31:42 Right. And you know, we talked earlier about the different languages, and we talked a little bit about dialects, but we didn't really talk about accents within language. I mean, even in the United States, you go to various parts of the country and have a more difficult time understanding, even from your own country, just because of the accent.
Starting point is 00:32:00 That's why my Scottish English is a good story. And I hope I still have a little bit of Scottish accent. I hear it. Well, at the end of every podcast, I give my guests the last word. And since you're in human language technologies, it's particularly apropos for you. Now's your chance to say whatever you want to our listeners who might be interested in enabling computers to converse and communicate. What ought they to put boots on for? Working on speech and language. This is really the crown jewel of AI. You know, there's no more challenging task than this one, in my opinion, especially if you want to move from perceptive AI to cognitive AI, to get the ability to reason, to understand,
Starting point is 00:32:48 to acquire knowledge by reading, by conversing. It's just, you know, such a fundamental area that can improve everyone's life, improve everyone's productivity, make this world a much better place without the language barriers, without communication barriers, without understanding barriers.
Starting point is 00:33:13 Zhu Dong Huang, thank you so much for joining us on the podcast today. It's been fantastic. My pleasure. To learn more about Dr. Zhu Danghuang and the science of machine speech and language, visit microsoft.com slash research.
