Microsoft Research Podcast - 076 - Speech and language: the crown jewel of AI with Dr. Xuedong Huang
Episode Date: May 15, 2019
When was the last time you had a meaningful conversation with your computer… and felt like it truly understood you? Well, if Dr. Xuedong Huang, a Microsoft Technical Fellow and head of Microsoft's Speech and Language group, is successful, you will. And if his track record holds true, it'll be sooner than you think! On today's podcast, Dr. Huang talks about his role as Microsoft's Chief Speech Scientist, gives us some inside details on the latest milestones in speech and language technology, and explains how mastering speech recognition, translation and conversation will move machines further along the path from "perceptive AI" to "cognitive AI" and that much closer to truly human intelligence.
Transcript
At some point, let's say computers can understand 300 languages, can fluently communicate and converse.
I have not run into a person who can speak 300 languages.
And not only can the machine fluently communicate and converse, it can comprehend, understand, learn, and reason,
and can really finish all the PhD courses in all the subjects.
The knowledge acquisition, reasoning,
is beyond anyone's individual capability.
When that moment is here,
you can think about how intelligent that AI is going to be.
You're listening to the Microsoft Research Podcast, a show that brings you closer to the cutting edge of technology research and the scientists behind it.
I'm your host, Gretchen Huizinga.
When was the last time you had a meaningful conversation with your computer and felt like
it truly understood you?
Well, if Dr. Xuedong Huang, a Microsoft Technical Fellow and head of Microsoft's Speech and
Language Group is successful, you will.
And if his track record holds true, it'll be sooner than you think.
On today's podcast, Dr. Huang talks about his role as Microsoft's chief speech scientist,
gives us some inside details on the latest milestones in speech and language technology,
and explains how mastering speech recognition, translation, and conversation will move machines
further along the path from perceptive AI to cognitive AI, and that much closer to truly
human intelligence.
That and much more on this episode of the Microsoft Research Podcast.
Xuedong Huang, welcome to the podcast.
Thank you.
You are a Microsoft Technical Fellow in the Speech and Language Group, and you lead Microsoft's spoken language efforts.
So we're going to talk in depth about these in a bit, but first, as the company's chief speech scientist, give us a general view of what you do for a living and why you do it.
What gets you up in the morning?
Well, what we do is really make sure we have the best speech and language technology
that can be used to empower a wide range of scenarios.
The reason we have a group to do that is really,
I feel that, you know,
this is not only the most natural way
for people to communicate as we're doing right now,
but it's really the hardest AI challenge we're facing.
So that's what we do, trying to really drive breakthrough,
deliver these awesome services on our cloud Azure services,
and make sure we are satisfying a wide range of customers,
both inside Microsoft and outside of Microsoft.
There are three things, really, if you want to frame this whole thing.
Yeah.
The first, we have the horsepower to really drive speech recognition accuracy,
to drive the naturalness of our synthesis effort, to make sure translation quality is accurate when
you translate from English to Chinese or French or German.
So there's really a lot of science behind that,
making sure the accuracy, naturalness, latency,
they are really world-class.
So that's one.
The second one is really, we not only provide technology,
we deliver services on Azure,
that from Office to Windows, Cortana, they all depend on the same cloud services.
And we also have edge devices, like our speech device SDK.
So we want to make sure the speech on the edge and the cloud,
they are really delivered in a modern fashion.
That's the platform in the cloud and embedded. So that's the
second. The platform is modern. The third one is really to show our love to the customer because
we have a wide range of customers worldwide. We want to really delight and make sure our customer
experience using speech translation is top notch. That's actually really three key things I do.
AI horsepower, modernize our platform in the cloud
and on the edge, and love our customers.
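To make the platform pillar concrete, here is a minimal sketch of calling the Azure speech-to-text service from Python, assuming the azure-cognitiveservices-speech SDK as it is publicly documented; the subscription key and region are placeholders. It only illustrates the kind of cloud speech API being described, not how Office, Windows, or Cortana actually integrate it.

```python
# pip install azure-cognitiveservices-speech  (assumed environment)
import azure.cognitiveservices.speech as speechsdk

# Placeholder credentials: substitute your own Speech resource key and region.
speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="westus")

# Recognize a single utterance from the default microphone.
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)
result = recognizer.recognize_once()

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print("Recognized:", result.text)
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech could be recognized.")
```

The same Speech resource also exposes synthesis and speech-translation endpoints, which is the cloud-plus-edge platform idea referred to above.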
Well, and you've got a lot of teams working
in these groups to tackle each of these pillars,
we might call them.
Yes, we have teams worldwide as well. And so the diversity is amazing
because we are really trying to address the language barriers, trying to remove the language
barriers. So we do have teams in China, we have teams in Germany, in Israel, in India, and in the
U.S., of course. So we really work around the globe
trying to deal with these language challenges.
So I want to start by quoting you
to set the stage for our conversation today.
You said speech and language is the crown jewel of AI.
So unpack that for us.
Well, we can think on the scale of human evolution.
And at some point, language was born.
That accelerated human evolution.
If you think about all the animals on this planet,
you know, there are animals running faster than humans.
They can see better.
Their teeth are sharper.
Especially in the night.
They're stronger.
Yeah.
They can actually hear better, smell better.
Only we, humans, have the language.
We can organize better.
We can describe things in science-fiction terms.
We can really organize ourselves, create a constitution.
So if you look at the humans, it is speech and language that set us apart
from other animals. For artificial intelligence, speech and language will drive the evolution
of AI, just like it did to humans. That's why it's the crown jewel of AI, and it's a
tough one to crack.
Yeah.
There's a whole philosophical discussion on that topic alone,
but it leads to some interesting questions
about, you know,
if you're wildly successful
with machine language,
what are these machines?
So let's just actually, you know,
set our imagination off a little bit, right?
And at some point, let's say computers can understand 300 languages,
can fluently communicate and converse.
I have not run into a person who can speak 300 languages.
And not only can the machine fluently communicate and converse, it can comprehend, understand, learn, and reason,
and can really finish all the PhD courses in all the subjects.
The knowledge acquisition, reasoning,
is beyond anyone's individual capability.
When that moment is here,
you can think about how intelligent that AI is going to be.
Is this something you envision?
Yes.
Do we want that?
Yes.
I think this world will be a much better place.
I was in Japan just a few weeks ago,
carrying Microsoft Translator on my mobile devices.
I was able to really communicate with Japanese
who do not speak Chinese or English.
It's already there.
Microsoft Translator can speak the language I do not speak
and helped me to be more productive when I was in Japan.
So I'm all about that. It just scares me a little bit to think about a machine.
We weren't first, we're not last, we're just next.
But, you know, there are two levels of intelligence. The first level is really perceptive intelligence.
That is the ability to see, to hear, to smell.
Then the higher level is cognitive intelligence.
That is the ability to reason, to learn, and to acquire knowledge.
Most of the AI breakthroughs we have today, they are in the perceptive level, such as speech recognition, speech synthesis, computer vision.
But this high-level reasoning and knowledge acquisition, the cognitive capability, is still far from being close to the human level. And what I'm excited about with translation,
it is really something between perceptive intelligence
and cognitive intelligence.
And the fact that we're actually
able to really build the success
on the perceptive intelligence
and expand into cognitive intelligence
is quite a journey.
And I do not know
when we're going to reach that milestone.
But that one is coming.
It's just a matter of time.
Could take 50 years, but I think it is going to happen.
We'll have to come back for another podcast to talk about that milestone
because we're going to talk about a couple milestones in a minute.
But first, I want to do a little bit of backtracking, because you've been around for a
while. And you started in Microsoft Research right about the time Rick Rashid was setting
the organization up. And speech was one of the first groups that was formed. And according to
MSR lore, the goal of the group was to make speech mainstream. So give us a brief history of speech
at MSR. How has the research gone from not
mainstream in those early "take risks and look far out" days to being a presence in nearly every
Microsoft product today?
Before I joined Microsoft Research, I was also on the faculty at CMU
in Pittsburgh. So Big Brother was a professor there. I was a junior faculty
member. So I was doing my research mostly at CMU, on speech. Microsoft reached out
and they wanted to set up a speech group. So I moved on the first day of 1993; after the New Year's break, I flew from Pittsburgh to Seattle, started that journey, and never left.
So that was the beginning of Microsoft Speech.
We were the research group that really started working on bringing speech to the developers.
Right. So...
Not just Blue Sky Research anymore.
Not just Blue Sky Research.
So we licensed technology from CMU.
That's how we started.
So we're very grateful to CMU's pioneering research in this area.
So we were the research group, but we delivered the first speech API, SAPI, on Windows 95
as a research group.
We were pretty proud of that because usually research
is doing only blue sky research.
We not only did blue sky research,
continue to push the envelope,
continue to improve the recognition accuracy,
but we also worked with Windows,
brought that technology to Windows developer.
So SAPI was the first speech API in the industry on Windows.
And that was really quite a journey.
And then I eventually left research, joined the product group.
I took the team, and an exceptional group from Microsoft speech research came with me to the product
group.
So this has been really a fascinating 27-year experience at Microsoft.
I stopped doing speech after 2004, after we shipped the speech server.
And I started many different things, including running the incubation for research as a startup.
Yeah.
And I also worked as an architect for Satya Nadella when he was running Bing.
Okay.
I was helping incubate a wide range of AI projects, from foundational pieces,
like a GPU cluster, Project Philly, the deep learning toolkit, CNTK,
and of course, speech research, all the way to the high-end solution,
like customer care intelligence.
About three years ago, I had the privilege to return to run a combined speech and language group. So basically, we were able to
consolidate all the resources working on speech and translation. And that was the story,
really, you know, the journey of my experience, a fascinating 27 years.
Where does speech and language live right now?
So, as I said, we moved back and forth multiple times between research and product group.
Right now, we are sitting in cloud and AI group.
This is a product group.
We're part of this cloud services.
And we provide company-wide and industry-wide speech and translation services.
We also have speech and dialogue research.
They are really operating like a research group.
They are all researchers in our team.
As Rick has been saying, tech transfer is a full-contact sport.
We are not just a full-contact sport.
We are a one-body sport.
So it's actually a very exciting group with a group of very talented, very innovative people.
So it's still forward-thinking in the research mode.
It's both forward-thinking and well-grounded.
We have to be grounded to deliver services from infrastructure to cost of serving.
And we also have to be standing high to see the future, to define the solution that
people need and want, even though the solution may not exist yet and they may not
know what it is at this moment.
Well, let's talk about some specific research milestones
that you've been involved in. They're really interesting. Three areas you've
been involved in, conversational speech recognition, machine translation, and conversational Q&A.
So let's start with the recognition. In 2016, you led a team that reached historical human
parity in transcribing conversational speech. Tell us about this. What was it part of? How did it come about?
So in 2016, we reached human parity on the broadly used Switchboard conversational
transcription task. That task has been used in the research community and the industry
probably over 10 years. And in 2017, we redefined the human parity milestone. So we're not competing with only one
single person. We're competing with a group of people to transcribe the same task. So I would
say 2017 is a really historic moment: in comparison to a group of people transcribing
the same task, the Microsoft speech stack outperformed all four teams combined. When I challenged
our research group, nobody thought that was even feasible. But in less than two years,
amazingly, when we had the conviction and the resource and the focus, magic indeed happened. So that was actually a fantastic moment for the team,
for science, for the technology stack. That was the first human parity milestone
of my professional career.
So I want to go into the weeds a little bit on this, because
this is interesting, what you say. In two years, nobody thought it was possible, and then you did it.
Tell us a little more about the technical aspects of how you accomplished this.
So if you look at the history of speech research,
the speech group pioneered many breakthroughs that got reused by others.
Let's take translation as an example.
So even for speech, in the early 70s, speech recognition used more traditional AI, like a rule-based approach, an expert system.
And IBM Watson Research pioneered statistical speech recognition using hidden Markov models, using, you know, statistical language models.
They really pushed the envelope and advanced the field.
So that was a great moment. It was the same group of IBM speech researchers. They borrowed the same idea from speech recognition, applied that to translation.
They rewrote translation history, really advanced the quality of translation substantially.
And after the hidden Markov model, it was deep learning that started with speech recognition, neural speech recognition.
And once again, translation borrowed the same thing with neural machine translation. That also advanced the field.
So you can see the pattern of reusing technology
that speech people pioneered.
Actually, speech guys have been doing this,
you know, systematic benchmarking funded by DARPA,
very rigorous evaluation.
That really changed how science and engineering could
be evaluated.
So there are many broad lessons from speech technology community that could have been
used broadly beyond speech.
So we got trained to deal with tough problems.
It's no wonder the same group of people could have achieved this
historic milestone.
Well, let's talk about another human parity milestone, the automatic Chinese to English
news translation for the WMT 2017 task.
And I had Arul Menezes on the show to talk all about that.
But I'd love your perspective on whether and how, this kind of goes back to what we talked about at the beginning,
whether and how you think machines can now compare
to traditional human translation services
and why this work is an important breakthrough
for barriers between people and cultures.
So the second human parity breakthrough from my team
is equally exciting.
As I said, transcribing Switchboard conversational speech
is a great milestone, but it's really at a lower level, the perceptive AI level.
Translation is a task that is between perceptive AI and the cognitive AI.
Of course, translation is a harder task and nobody believed we could have achieved this.
So we set a goal in five years, let's see if we can achieve translation human parity
on the sentence by sentence basis.
So I want to really put that condition here.
When humans, like you and me, translate, we look at the whole paragraph, we have the broader context,
we do a better job. So we limited ourselves, because we used WMT, which
is just news translation measured at the sentence-by-sentence level; it's a broad, open, public research
benchmark. Even for that one, we thought it could have taken five years.
So we applied the same principle,
built on the success we had on transcribing
switchboard speech recognition.
But this time, we actually went one step beyond.
We partnered with Microsoft Research Group in Beijing,
because it's a Chinese to English translation.
So across the Pacific,
multiple teams in Microsoft Research Asia
worked together days and nights.
Amazingly, this group of people surprised everyone.
We delivered this in less than a year,
reaching human parity
on this historic translation task, better than
professional translators on the same task, as measured by our scientists. So this time, really, we did
something magic. I'm very proud of the team. I'm very proud of the collaboration.
Well, another super interesting area that I'd love to talk about with you is what you call CoQA,
and that's C-O-Q-A, conversational Q&A.
So obviously we're talking about computers having this conversation with us, question
and answer.
Tell us about the work that's going on in this most human and perhaps most difficult
of tasks in speech recognition technology.
So this task was pioneered by Stanford researchers.
It's even one step closer to cognitive AI.
This is really a machine reading comprehension task
with conversation, with dialogue about the text.
Let's say you read a paragraph.
Then we challenge the reader to correctly answer a sequence of questions
that are related.
For example, if you read the paragraph about Bill Gates, the first question could have
been, who is the founder of Microsoft?
The second question could be related to the first one.
How old was the person when the person started? Or you could say,
and when the person retired, how old was he? So that context relevancy is harder than simple
machine reading comprehension, because there's a sequence of related questions you have to answer,
given the context. So for this latest breakthrough, and I have to give credit
mostly to our colleagues in Beijing Research Lab, we have been pioneering this working together
using shared resources and the infrastructure. And it's just amazing. I'm so impressed with the agility and the speed we have to achieve
this amazing conversational
question answering challenge.
So,
the leading researchers, they're all in Beijing,
and we played a great supporting
role, helping Microsoft
once again be the first
to achieve human parity on this
broadly watched AI task.
Nobody believed anyone could have achieved
this conversational Q&A human parity
in such a short time.
And so we thought it might take two years.
Once again, we broke the historical record.
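To make the conversational aspect of a CoQA-style task concrete, here is a minimal sketch with a made-up passage, made-up questions, and a toy history-concatenation heuristic; it is not the actual CoQA benchmark format or Microsoft's system, only an illustration of why each question depends on the turns before it.

```python
# A CoQA-style example: one passage, then a sequence of related questions.
# The follow-up questions only make sense given the earlier turns.
passage = (
    "Bill Gates co-founded Microsoft in 1975 at age 19. "
    "He stepped down as CEO in 2000."
)

dialogue = [
    {"question": "Who founded Microsoft?", "answer": "Bill Gates"},
    {"question": "How old was he when he started?", "answer": "19"},      # "he" refers to turn 1
    {"question": "And when he stepped down as CEO?", "answer": "2000"},   # elliptical follow-up
]

def build_model_input(passage, dialogue, turn_index):
    """Toy heuristic: prepend the previous question/answer turns so a
    single-turn reader model sees the context it needs to resolve
    pronouns and ellipsis in the current question."""
    history = " ".join(
        f"Q: {t['question']} A: {t['answer']}"
        for t in dialogue[:turn_index]
    )
    current = dialogue[turn_index]["question"]
    return f"{history} Q: {current}", passage

question_with_history, context = build_model_input(passage, dialogue, 2)
print(question_with_history)
```

Concatenating the dialogue history is only one simple way to give a single-turn reader model conversational context; the systems described in this episode are considerably more sophisticated.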
Well, we've talked a little bit about
the more technical aspects of what you're doing
and how you're doing this.
So on this last one, are there any other methodologies or techniques that you brought to the table
to conquer this Q&A task?
So Microsoft has accumulated 30 years of research and experience in AI, right?
The natural language group in Beijing, they have been doing this for the last 20 years
and have accumulated a lot of talent, a lot of experience.
And we basically use deep learning and transfer learning.
We also build our success on top of the whole community.
For example, Google, they delivered this fascinating technology called BERT.
Is that an acronym?
Yes, it's an acronym.
It's an embedding technology.
We built the success on top of that, expanded that.
That's how we achieved the human parity breakthrough.
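As a rough illustration of the transfer-learning pattern described here, building a question answerer on top of a pretrained BERT-style model, here is a minimal sketch using the open-source Hugging Face transformers pipeline; the default model, the toy passage, and the trick of folding dialogue history into the question are all assumptions for demonstration, not the actual system that reached human parity.

```python
# pip install transformers torch  (assumed environment)
from transformers import pipeline

# The pipeline downloads a small extractive-QA model that was pretrained
# on raw text and then fine-tuned on a QA dataset: transfer learning.
qa = pipeline("question-answering")

context = (
    "Bill Gates co-founded Microsoft in 1975 at age 19. "
    "He stepped down as CEO in 2000."
)

# Crude conversational trick: prepend the earlier turn so a single-turn
# model can resolve "he" in the follow-up question.
history = "Q: Who founded Microsoft? A: Bill Gates."
question = history + " How old was he when he started?"

result = qa(question=question, context=context)
print(result["answer"], round(result["score"], 3))
```

This only shows the pretrain-then-fine-tune pattern available in the community toolbox; the human-parity CoQA result discussed above involved much more than a stock pipeline.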
So it's really a reflection of the collective community.
And I talked about the collaboration between Microsoft Research in Asia and our team in
the US.
Actually, this is a great example of collaboration of the whole industry.
On the heels of everything that could possibly go right, and it's pretty exciting what you've
described to us in this podcast, we do have to address what could possibly go wrong if you're
successful. You want to enable computers to listen, hear, speak, translate, answer questions,
basically communicate with people.
Does anything about that keep you up at night?
Yes, absolutely.
My worry is really someday humans
can be too dependent on AI.
And AI will never be perfect.
AI will have its own sort of biases.
So I worry about that unconscious influence.
Right.
So how to deal with that is really a broad societal issue
that we have to be aware of and have to address.
Because just like anyone, if you have an assistant you depend on,
you absolutely know how much that assistant can influence you,
change your agenda, change your opinion.
AI one day is going to play the same role.
AI will be biased.
And how do we deal with that is my top concern.
If everything goes well,
that is really a top issue we have to deal with.
We have to learn how to deal with it.
We do not know because we're not there yet.
So what kinds of design thinking are you bringing to this as you build these tools that can speak and listen and converse?
Because one of the biggest things is the human ability to impute human qualities to something that's not human.
I think just, you know, there are enough responsible people working on AI.
And the good news is that we're not there yet, right?
So we have time to work together to deal with that
and make sure AI is going to really serve mankind,
not to destroy mankind.
So that's my top worry, what keeps me awake.
But my short-term worry is really
AI is not good enough. Not yet.
And people, as Bill Gates
used to say, always
overestimate what you can do
in the short term and
underestimate the impact in the long term.
For this case, we cannot
underestimate the long-term
impact, long-term milestone.
Okay. It's story time.
Good.
Tell us a bit about your life.
What's your story?
What got you interested in research,
particularly speech and language technology research?
And what was your path to MSR?
Good.
I was a graduate student at Tsinghua University in Beijing.
At that time, my first computer was an Apple II.
Because, you know, the Chinese language is not easy to type,
it was very cumbersome.
So that necessity brought me to speech recognition.
My dream at that time, as a graduate student at Tsinghua,
was actually in AI,
in Tsinghua's, you know, AI graduate program. It was fantastic to have, you know, so many
professors and faculty members who had that long-term vision and set up a pioneering
environment for us to explore and experiment in.
So I finished my master's degree.
I was in the PhD program
and I have been working on speech recognition since 82,
because I was enrolled, admitted to Tsinghua in 1982.
That dream to make it easier for people
to really communicate with machines never disappeared.
So I have been working on this for over 30 years.
Even though at Microsoft, for a shorter period of time, I stepped out of speech, I was
still doing something related.
So I really thought this was a fascinating story. And I have a personal,
really interesting story. As I said, you know, it was hard to type in Chinese when I was at
Tsinghua University. And I didn't finish my PhD at Tsinghua. I went to the University of Edinburgh in Scotland.
And I did finish my PhD there.
But my personal pain point when I first landed in Edinburgh was really,
I learned English, mostly American English, in China.
It wasn't that good because it wasn't my native language.
But listening to Scottish professors talking was always challenging. But I was so grateful the BBC had closed captioning.
Oh, funny.
So I really learned my Scottish English from watching BBC.
And I have to say that automatic captioning technology is available in Microsoft PowerPoint
today. And that journey, from a personal pain point to what the Office PowerPoint team can bring
together, is fascinating and personally extremely rewarding.
I'm so grateful to see that the technology I work on is going to help many other people
who are attending Scottish universities.
You know, Arul talked about that PowerPoint service, and he was talking about people who
had hearing disabilities.
You give it a whole new...
It's much broader because the language barrier is always there.
Not everyone is as fluent.
And I host many visitors.
Almost every year, I'm hosting Tsinghua University MBA students.
And they all learn English.
But their ability to converse and listen simply is not as good as native speakers here. So the simple fact that we are able to provide captioning
on the PowerPoint presentation helped all of them
to learn and understand much better.
So this is actually a fairly broad scenario
without even translating.
Just the fact you have captioning
will enhance the communication.
Right.
And you know, we talked earlier about the different languages,
and we talked a little bit about dialects,
but we didn't really talk about accents within language.
I mean, even in the United States,
you go to various parts of the country
and have a more difficult time understanding,
even from your own country, just because of the accent.
That's why my Scottish English is a good story.
And I hope I still have a little bit
of Scottish accent.
I hear it.
Well, at the end of every podcast, I give my guests the last word.
And since you're in human language technologies, it's particularly apropos for you. Now's your
chance to say whatever you want to our listeners who might be interested in enabling computers
to converse and communicate. What ought they to put boots on for?
Working on speech and language. This is really the crown jewel of AI. You know, there's no more
challenging task than this one, in my opinion, especially if you want to move from perceptive AI to cognitive AI, to get the ability to reason, to understand,
to acquire knowledge by reading, by conversing.
It's just, you know, such a fundamental area
that can improve everyone's life,
improve everyone's productivity,
make this world a much better place
without the language barriers,
without communication barriers,
without understanding barriers.
Xuedong Huang,
thank you so much for joining us on the podcast today.
It's been fantastic.
My pleasure.
To learn more about Dr. Xuedong Huang and
the science of machine speech and language,
visit microsoft.com slash research.