Microsoft Research Podcast - 050 - Hearing in 3D with Dr. Ivan Tashev
Episode Date: November 14, 2018
After decades of research in processing audio signals, we've reached the point of so-called performance saturation. But recent advances in machine learning and signal processing algorithms have paved the way for a revolution in speech recognition technology and audio signal processing. Dr. Ivan Tashev, a Partner Software Architect in the Audio and Acoustics Group at Microsoft Research, is no small part of the revolution, having both published papers and shipped products at the forefront of the science of sound. On today's podcast, Dr. Tashev gives us an overview of the quest for better sound processing and speech enhancement, tells us about the latest innovations in 3D audio, and explains why the research behind audio processing technology is, thanks to variations in human perception, equal parts science, art and craft.
Transcript
You know, humans, they don't care about mean square error solution or maximum likelihood
solution.
They just want the sound to sound better for them.
And it's about human perception.
That's one of the very tricky parts in audio signal processing.
You're listening to the Microsoft Research Podcast, a show that brings you closer to
the cutting edge of technology research and the scientists behind it.
I'm your host, Gretchen Huizenga.
After decades of research in processing audio signals, we've reached the point of so-called performance saturation. But recent advances in machine learning and signal processing algorithms
have paved the way for a revolution in speech recognition technology and audio signal processing.
Dr. Ivan Tashev, a partner software architect in the audio and acoustics group at Microsoft
Research, is no small part of the revolution, having both published papers and shipped products
at the forefront of the science of sound.
On today's podcast, Dr. Tashev gives us an overview of the quest for better sound processing and speech enhancement,
tells us about the latest innovations in 3D audio,
and explains why the research behind audio processing technology is,
thanks to variations in human perception, equal parts science, art, and craft.
That and much more on this episode of the Microsoft Research Podcast.
Ivan Tashev, welcome to the podcast.
Thank you.
Great to have you here.
You're a partner software architect in the audio and acoustics group at Microsoft Research.
So in broad strokes, tell us about your work.
What gets you up in the morning?
What big questions are you asking?
What big problems are you trying to solve?
So in general, in the audio and acoustics research group, we do audio signal processing. That includes enhancing the sound captured by our microphones,
better sound reproduction using binaural audio, so-called spatial audio.
We do a lot of work in audio analytics,
recognition of audio objects, recognition of the audio
background. We design a lot of interesting audio devices. Our research ranges from applied research related to Microsoft products to blue-sky research far from what Microsoft's business is today.
So what's the ultimate goal, perfect sound?
Perfect sound is a very tricky thing because it is about human perception.
And this is very difficult to model using mathematical equations.
So the classic statistical signal processing
was established in 1947 with a paper published
by Norbert Wiener defining what we call today the Wiener filtering.
The approach is simple.
You have a process, you make a statistical model, you define an optimality criterion, take the first derivative, set it to zero, and voila, you have the analytical solution of the problem.
The problem is that you either have an approximate model
and find the solution analytically,
or you have a precise model which you cannot solve analytically.
The other thing is the optimality criterion.
You know, humans, they don't care about mean square error solution or
maximum likelihood solution. They just want the sound to sound better for them. And it's about
human perception. That's one of the very tricky parts in audio signal processing.
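As a rough illustration of that derive-and-set-to-zero recipe, here is a minimal sketch of the classic per-frequency Wiener suppression rule under a mean square error criterion. It is only an editorial sketch, assuming the noise spectrum is already known; the variable names and the crude speech-spectrum estimate are not from any Microsoft pipeline.

```python
import numpy as np

def wiener_gain(noisy_psd, noise_psd):
    """Classic Wiener suppression rule, one gain per frequency bin.

    Minimizing the mean square error E[|S - G*X|^2] per bin and setting the
    derivative with respect to G to zero gives G = S_ss / (S_ss + S_nn),
    i.e. the speech power spectrum divided by the noisy power spectrum.
    """
    speech_psd = np.maximum(noisy_psd - noise_psd, 1e-12)  # crude speech estimate
    return speech_psd / np.maximum(noisy_psd, 1e-12)

# Toy usage: one STFT frame and an assumed, pre-measured noise spectrum.
noisy_frame_psd = np.array([1.0, 0.5, 0.2, 0.05])
noise_psd = np.array([0.3, 0.3, 0.1, 0.04])
gains = wiener_gain(noisy_frame_psd, noise_psd)  # multiply into the noisy spectrum
```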
So where are we heading in audio signal processing in the era of machine learning and neural networks?
Machine learning entered signal processing, and that is the reason why we achieve significantly better results than using statistical signal processing.
Even more, we train the neural network using a certain cost function, and we can make the cost function itself another neural network, trained on human perception of better audio, which allows us to achieve a higher perceived quality of the speech enhancement we do using neural networks.
I'm not saying that we should use machine learning and neural networks in every single audio processing block. We have processing blocks which have a nice and clean analytical solution; they run fast and efficiently, and they will remain the same.
But in many cases, we operate with approximate models with not very natural optimality criteria.
And then this is where the machine learning shines.
This is where we can achieve much better results and provide a higher quality of our output signal.
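A minimal sketch of the idea of the cost function itself being another neural network: a small quality predictor is frozen and used as the loss for whatever enhancement model is being trained. The architecture, names, and shapes here are illustrative assumptions, not the actual perceptual model described in the interview.

```python
import torch
import torch.nn as nn

class QualityNet(nn.Module):
    """Stand-in for a network trained on human ratings of audio quality."""
    def __init__(self, n_bins=257):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_bins, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, spec):           # spec: (batch, frames, n_bins) magnitudes
        return self.net(spec).mean()   # predicted quality score, higher is better

quality_net = QualityNet()
for p in quality_net.parameters():
    p.requires_grad_(False)            # frozen: it acts purely as a cost function

def perceptual_loss(enhanced_spec):
    # Train the enhancement network to maximize the predicted quality.
    return -quality_net(enhanced_spec)
```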
One interesting area of research that you're doing is noise-robust speech recognition.
And this is where researchers are working to improve automatic speech recognition systems.
So what's the science behind this, and how are algorithms helping to clean up the signal?
We are witnessing a revolution in speech recognition. The classic speech recognizer was based on so-called
hidden Markov models, or HMMs.
And they served us quite well,
but the revolution came when neural networks
were implemented and trained to do speech recognition.
My colleagues in the speech research group
were the first to design a neural network-based
speech recognition algorithm which instantly showed better results than the existing production
HMM-based speech recognizer.
The speech recognition engine has one channel input.
While in audio processing, we can deal with multiple channels,
so-called microphone arrays, and they give us a sense of spatiality.
We can detect the direction where the sounds come from,
we can enhance that sound,
we can suppress sounds coming from other directions, and then feed that clean sound to the speech recognition engine.
The microphone array processing technology, combined with techniques like sound source localization and tracking, and sound source separation, allows us to even separate two simultaneously speaking
humans in the conference room and feed two separate instances of the speech recognizer
for meeting transcription.
Are you serious?
Yes, we can do that. Even more, the audio processing engine has more prior information, for example, the signal we send to the loudspeakers. And the goal of this engine is to remove that sound, which interferes with the sound we want to capture. This is also one of the oldest signal processing algorithms, and every single speakerphone has it.
But in all instances, it has been implemented as a mono acoustic echo cancellation.
At Microsoft, we were the first to design a stereo and surround sound echo canceller, despite a paper written by the inventor of acoustic echo cancellation himself, stating that stereo acoustic echo cancellation is not possible.
And it's relatively simple to understand.
You have two channels between the left and the right speaker coming to one microphone,
so you have one equation and two unknowns. And Microsoft released, as part of Kinect for Xbox, a surround sound echo cancellation
engine. Not that we solved five unknowns from one equation, but we just found a workaround
which was good enough for any practical purposes and allowed us to clean the surround sound coming from the Xbox
to provide a cleaner sound to the speech recognition engine.
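For readers who want to see what a mono acoustic echo canceller does before the stereo ambiguity even arises, here is a minimal normalized-LMS sketch: it adapts a filter on the known loudspeaker signal and subtracts the predicted echo from the microphone signal. The parameters and structure are illustrative assumptions, not Microsoft's implementation.

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, filter_len=256, mu=0.1, eps=1e-8):
    """Mono acoustic echo cancellation with a normalized LMS adaptive filter.

    far_end: samples sent to the loudspeaker (known in advance).
    mic:     microphone samples containing echo plus near-end speech.
    Returns the echo-reduced microphone signal.
    """
    w = np.zeros(filter_len)                  # estimate of the room echo path
    out = np.zeros_like(mic, dtype=float)
    for n in range(filter_len, len(mic)):
        x = far_end[n - filter_len:n][::-1]   # most recent far-end samples
        echo_hat = w @ x                      # predicted echo at the microphone
        e = mic[n] - echo_hat                 # residual: near-end speech plus error
        w += (mu / (x @ x + eps)) * e * x     # NLMS step toward the true echo path
        out[n] = e
    return out
```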
So did you write a paper and say, yes, it is possible, thank you very much?
I did write a paper.
Oh, you did?
And it was rejected with the most cruel feedback from the reviewers I have ever seen in my career.
It is the same as going to the French Academy of Sciences and proposing a perpetual motion machine. They decided back in the 18th century not to discuss papers about that.
When I received the rejection notice, I went downstairs in my lab, started the demo, and listened...
Let's talk about the anechoic chamber, or chambers, as I came to find out at Microsoft. And one's right here in Building 99, but there are others. And so phrases like "the quietest place on earth" and "where sound goes to die" are kind of sensational. But these are really interesting structures with really specific purposes, which I was interested to find out about.
So tell us about these anechoic or echo-free chambers.
How many are there here?
How are they different from one another?
And what are they used for?
So the anechoic chamber is just a room
insulated from the sounds outside.
In our case, it's a concrete cube
which does not touch the building
and sits on around half a meter of rubber
to prevent vibrations from the street
to come into the room.
And internally, the walls, the ceiling, and the
floor are covered with the sound absorption panels. This is pretty much it. What happens is that
the sound from the source reaches the microphone or the human ear only using the direct path. There is no reflection from the walls and there is no
other noise in the chamber. Pretty much, the anechoic chamber simulates the absence of a room,
and it's just an instrument for making acoustical measurements. What we do in the chamber is we measure the directivity patterns of microphones
or radiation patterns of loudspeakers as they are installed in the devices we design.
Initially, the anechoic chamber here in Microsoft Building 99, the headquarters of Microsoft Research, was the only one in Microsoft. But with our engagement with product teams, it became overcrowded, and our business partners decided to build their own anechoic chambers.
And today there are five in Microsoft Corporation. They can all perform the standard set of measurements, but all of them are a little bit different from each other.
For example, the quietest place on Earth, as recorded in the Guinness Book of Records, is the anechoic chamber in Building 87. And the largest anechoic chamber is in Studio B, which allows making measurements at lower frequencies than in the rest of the chambers.
Our chamber in Building 99 is the only one in Microsoft that allows human beings to stay in it for a prolonged amount of time, because we have air conditioning connected to the chamber.
It's a different story how much effort it cost us to keep the rumbling noise from the air conditioner out of the anechoic chamber, but this allowed us to do a lot of research on human spatial hearing in that chamber.
So drill in on that a little bit,
because coming from a video production background, the air conditioner in a building is always the
annoying part for the sound people. But you've got that figured out in the way that you've situated the air conditioning
unit and so on?
To remove this rumbling sound from the air conditioner, we installed a gigantic filter, which is under the floor of the entire equipment room. So think about a six-by-four-meter floor. And this is how we were able to reduce the sound from the air conditioner. Still, if you need to do very precise acoustical measurements, we have the ability to switch it off.
Okay. So back to what you had said about having humans in this room for prolonged periods of time.
I've heard that your brain starts to play tricks on you when you're in that quiet of a place for a prolonged period of time.
What's the deal there?
Okay, this is the human perception of the anechoic chamber.
Humans, in general, are, I would say, two and a half dimensional creatures. When we walk on the
ground, we don't have very good spatial hearing vertically. We do much better horizontally,
but also we count on the first reflection from the ground to use it as a distance cue.
When you enter the anechoic chamber, you subconsciously swallow.
And this is a reaction because your brain thinks that there is a difference in the pressure between
your inner ear and the atmosphere, which presses the eardrums and you cannot hear anything.
So that swallowing reaction is what you do when you're in an airplane
and the pressure actually changes
and you get the same perception in this room,
but the pressure didn't change.
Exactly.
But the problem in the room is that you cannot hear anything
just because there is no sound in the chamber.
And the other thing, what happens is you cannot hear that reflection
from the floor,
which is basically very hardwired in our brains.
We can distinguish two separate sounds when the time between them is more than a couple of milliseconds.
And when the sound source is far away,
this difference between the direct path and the reflection from the ground is less than that.
We hear this as one sound.
We start to perceive those two as separate sounds
when the sound source is closer than a couple of meters away, which means two jumps.
And then subconsciously alarm bells start to ring in our brain that,
hey, there is a sound source less than two
jumps away, watch out not to become the dinner, or maybe this is the dinner.
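A quick back-of-the-envelope check of that distance cue, with illustrative numbers of our own choosing (source and ears both about 1.7 meters above the ground, speed of sound 343 m/s):

```python
import numpy as np

def ground_reflection_delay_ms(distance_m, height_m=1.7, c=343.0):
    """Extra delay of the floor reflection relative to the direct path.

    Image-source model: with source and listener both at height_m, the
    reflected path length is sqrt(distance^2 + (2 * height)^2).
    """
    reflected = np.hypot(distance_m, 2.0 * height_m)
    return (reflected - distance_m) / c * 1e3

print(ground_reflection_delay_ms(10.0))  # ~1.6 ms: fuses with the direct sound
print(ground_reflection_delay_ms(2.0))   # ~5.7 ms: starts to be heard separately
```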
So the progress, though, of what your brain does and what your hearing does inside
the chamber for one minute, for 10 minutes. What happens?
So there is no sound.
And the brain tries to acquire as much information as possible.
And the situation when you don't get information is called information deprivation.
You first, after a minute or so, start to hear a shhh, which is actually the blood in the vessels of your ear.
Then after a couple of minutes, you start to hear your body sounds, your heartbeat, your breathing.
And with no other senses engaged, eyes closed, no sound coming in, you literally reach the stage of audio hallucinations after 10 to 15 minutes.
Our brains are pattern matching machines, so sooner or later the brain will start to
recognize sounds you have heard somewhere, different places.
We people from my team, we have not reached that stage, simply because when you work there, the door is open, the tools are clanking, we have conversations, etc., etc. But maybe someday I will have to lie there and close my eyes and see, can I reach the hallucination stage?
Well, let's talk about the research behind Microsoft Kinect.
And that's been a huge driver of innovations in this field.
Tell us how the legacy of research and hardware for Kinect led to progress in
other areas of Microsoft.
Kinect introduced two new modalities in human-machine interfaces: voice and gesture.
And it was a wildly successful product. Kinect entered the Guinness Book of Records for the
fastest selling electronic device in the history of
mankind. Microsoft sold 8 million devices in the first three months of production. Since then, most of the technologies in Kinect have been further developed. But even
during the first year of Kinect, Microsoft released Kinect for Windows,
which allowed researchers from all over the globe to do things we hadn't even thought of.
This is the so-called Kinect effect. We had more than 50 startups building their products using technologies from Microsoft Kinect.
Today, most of them are further developed, enhanced, and are part of our products.
I'll give just two examples.
The first is HoloLens.
The device does not have a mouse or keyboard, and the human-machine interface is built on three input modalities: gaze, gesture, and voice. In HoloLens, we have a depth camera quite similar to the one in Kinect, and we do gesture recognition using super-refined and improved algorithms, but they originate from the ones we had in Kinect.
The second example is also HoloLens.
HoloLens has four microphones, the same number as Kinect.
And I would say that the audio enhancement pipeline for getting the voice of the person wearing the device is the granddaughter of the audio pipeline we released in Kinect in 2010.
Now let's talk about one of the coolest projects you're working on. It's the spatial audio or 3D audio.
What's your team doing to make the 3D audio experience a reality?
In general, spatial audio or 3D audio is a technology
that allows us to project audio sources in any desired position, to be perceived by the human
being wearing headphones. This technology is not something new. Actually, we have instances of it in the late 19th century, when two microphones and two rented telephone lines were used for stereo broadcasting of a theatrical play.
Later, in the 20th century, there have been vinyl records marked to be listened to with headphones, because they were stereo recorded using a dummy head with two microphones in the ears.
This technology did not fly because of two major deficiencies.
The first is you move your head left and right, and the entire audio scene rotates with you.
The second is that your brain may not exactly like the spatial cues coming from the microphones in the ear of the dummy head.
And this is where we reach the topic of head-related transfer functions. Literally, if you have a sound source somewhere in the space, the sound from it reaches
your left and right ear in a slightly different way. It can be modeled as two filters. And if you
filter it through those two filters and play through headphones, your brain will perceive the
sound coming from that direction. If we know those pairs of filters for all directions around
you, this is called head-related transfer functions. The problem is that they are highly
individual. Head-related transfer functions are formed by the size and the dimensions of the head,
the position of the ears, the fine structure of the pinna,
the reflections from the shoulders.
And we did a lot of research to find a way to quickly generate personalized head-related
transfer functions.
We put in our anechoic chamber more than 400 subjects.
We measured their HRTFs.
We did a submillimeter precision scan of their head and torso.
And we did measurement of certain anthropometric dimensions of those subjects.
Today, we can just measure several dimensions of your head
and generate your personalized head-related transfer function.
We can do this even from a depth picture.
Literally, you can tell how you hear from the way you look.
And we have polished this technology to the extent that, in HoloLens, you have your spatial audio personalized without even knowing it.
You put the device on and you hear through your own personalized spatial hearing.
How does that do that automatically?
Silently, we measure certain anthropometrics of your head. Our engineering teams, our partners,
decided that there should not be anything visible
in the generation of that personalized spatial hearing.
So if I put this on, say the HoloLens headset,
it's going to measure me on the fly?
Mm-hmm.
And then the 3D audio will happen for me.
Both of us could have the headset on
and hear a noise in one of our ears
that's supposedly coming from behind us,
but it really isn't.
That's absolutely correct.
With the two loudspeakers in HoloLens
or in your headphones,
we can make you perceive the sound coming
from above, from below, from behind.
And this is actually the main difference between surround sound and 3D audio for headphones.
Surround sound has five or seven loudspeakers, but they are all in one plane.
So surround audio world is actually flat.
While with this spatial audio engine,
we can actually render audio above and below,
which opens pretty much a new frontier
in the expressiveness of audio, in what we can do.
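The two-filter idea behind all of this can be sketched very compactly: given a pair of head-related impulse responses for one direction, rendering a source there is just two convolutions, one per ear. The random impulse responses below are placeholders; a real system would use a measured or personalized HRTF set.

```python
import numpy as np

def render_binaural(mono, hrir_left, hrir_right):
    """Place a mono source at the direction encoded by the HRIR pair."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right])  # two channels, played over headphones

# Placeholder impulse responses; a personalized HRTF set would supply these
# per direction (and they would change as the listener turns their head).
rng = np.random.default_rng(0)
hrir_left = rng.standard_normal(128)
hrir_right = rng.standard_normal(128)
stereo = render_binaural(rng.standard_normal(16000), hrir_left, hrir_right)
```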
Listen, as you talk, I have a vision of a bat in my head
sending out signals and getting signals
and echolocations and...
We did that.
What?
We did that.
Okay, tell.
So one of our projects,
this is one of those more blue sky research projects,
is exactly about that.
What we wanted to explore is using audio as echolocation in the same way the bats see in complete darkness.
And we built a spherical loudspeaker array of eight transducers,
which sent ultrasound pulses towards a given
direction, and near it, an eight-element microphone array, which, through the technology called
beamforming, listens towards the same direction.
With this, we utilize the energy of the loudspeaker as well and reduce the amount of sounds coming from other directions.
And this allows us to measure the energy reflected by the object in that direction.
When you do the scanning of the space, you can create an image,
which is exactly the same as created from a depth camera using infrared light,
but with a fraction of the energy.
The ultimate goal eventually will be to get the same gesture recognition with one-tenth
or one-hundredth of the power necessary.
This is important for all portable battery-operated devices.
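Beamforming itself, listening toward one direction with a microphone array, can be sketched in its simplest delay-and-sum form. This is a generic illustration of the principle, not the ultrasound device described above; the geometry, sample rate, and integer-sample alignment are simplifying assumptions.

```python
import numpy as np

def delay_and_sum(mic_signals, mic_positions, look_dir, fs=48000, c=343.0):
    """Steer an array toward look_dir by time-aligning and averaging channels.

    mic_signals:   (num_mics, num_samples) array of recordings.
    mic_positions: (num_mics, 3) microphone coordinates in meters.
    look_dir:      unit vector pointing from the array toward the source.
    """
    # Mics closer to the source hear the wavefront earlier, so they get the
    # largest compensating delay; only relative delays matter here.
    delays = mic_positions @ look_dir / c
    delays -= delays.min()
    shifts = np.round(delays * fs).astype(int)
    aligned = [np.roll(sig, s) for sig, s in zip(mic_signals, shifts)]
    return np.mean(aligned, axis=0)  # sound from look_dir adds up coherently
```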
Yeah.
Speaking of that, accessibility is a huge area of interest for Microsoft right now, especially here in Microsoft Research with the AI for Accessibility initiative. And it's really revolutionizing access to technology for people with disabilities. Tell us how the research you're doing is finding its way into the projects and products in the arena of accessibility?
You know, accessibility finds a resonance among Microsoft employees.
The first application of our spatial audio technology was actually not HoloLens.
It was a project which was a kind of a grassroots project
when Microsoft employees worked with a charity organization called
Guide Dogs in the United Kingdom. And from the name, you can basically guess that they train guide dogs for people with blindness. The idea was to use the spatial audio to help
the visually impaired. Multiple teams in Microsoft Research, including my team, have actually been involved in overcoming a lot of problems.
And this whole story ended up with releasing a product called Soundscape,
which is a phone application which allows people with blindness
to navigate easier,
where the spatial audio acts like a finger pointer.
When the system says, "and on the left is the department store," that voice prompt actually comes from the direction where the department store is. And this is an additional spatial cue, which helps the orientation of visually impaired people.
Another interesting project we have been involved in is also a grassroots project.
It was driven by a girl who was hearing impaired.
She initiated a project during one of the yearly hackathons. And the project was triggered by the fact
that she was told by her neighbor, "Your CO2 alarm has been beeping for a week already. You have to replace the battery."
So we created a phone application
which was able to recognize a number of sounds
like CO2 alarm, fire alarm, door knock, phone ring,
baby crying, etc., and to signal the hearing-impaired person using vibration or the display. And this is to help them navigate and live a better life in our environment.
You have an interesting personal story.
Tell us a bit about your background.
Where did you grow up?
What got you interested in the work you're doing?
And how did you end up at Microsoft Research?
I was born in a small country in Southeastern Europe called Bulgaria.
I took my diploma in electronic engineering and my PhD in computer science from the Technical University of Sofia, and immediately after graduation, I started to work as a researcher there.
In 1998, I was an assistant professor in the Department of Electronic Engineering when Microsoft hired me.
And I moved to Washington State.
I spent two full shipping cycles in Microsoft engineering teams before moving to Microsoft Research in 2001. And what I learned during those two shipping cycles actually helped me a lot to talk better with the engineers during the technology transfers I have done with Microsoft engineering teams.
Yeah, and there's quite a bit of tech transfer that's coming out of your group. What are some examples of the things that have been blue-sky research at the beginning that are now finding their way into millions of users' desks and homes?
I have been lucky enough to be part of very strong research groups
and to learn from masters like Anoop Gupta or Rico Malvar.
My first project in Microsoft Research was called Distributed Meetings,
and we used a device to record meetings, to store them, and to process them.
Later, this device became the RoundTable device, which is part of many conference rooms worldwide. Then I decided to generalize the microphone array support I designed for the RoundTable device, and this became the microphone array support in Windows Vista.
The next challenge was to make this speech enhancement pipeline work in even harsher conditions, like the noisy car. I designed the algorithms and transferred them
to the first speech-driven infotainment system
in a mass production car.
And then the story continues with Kinect,
with HoloLens, many other products.
And this is another difference
between industrial research and academia.
The satisfaction from your work is measurable. You know how many homes your technology has been released to, and for how many people you have changed the way they live, entertain, or work.
As we close, Ivan, perhaps you could give some parting advice to those of our listeners that might be interested in the science of sound, so to speak.
What are the exciting challenges out there in audio and acoustics research, and what guidance would you offer would-be researchers in this area?
So, audio processing is a very interesting area of research because it is a mixture of art, craft, and science.
It is science because we work with mathematical models and we have repeatable results.
But it is an art because it's about human perception.
Humans have their own preferences and tastes, and this makes it very difficult to model
with mathematical models.
And it's also a craft.
There are always some small tricks and secret sauce which are not mathematical models but
make the algorithms from one lab work much better
than the algorithms from another lab.
Into the mixture, we have to add the powerful invasion of machine learning technologies,
neural networks, and artificial intelligence, which allow us to solve problems we thought
were unsolvable and to produce algorithms which work much better than the classic ones.
So the advice is learn signal processing and machine learning.
This combination is very powerful.
Ivan Tashev, thank you for joining us today.
Thank you.
To learn more about Dr. Ivan Tashev and how Microsoft Research is working to make sound, sound better, visit Microsoft.com slash research.