Microsoft Research Podcast - What’s Your Story: Ivan Tashev
Episode Date: February 1, 2024
Partner Software Architect Ivan Tashev talks about applying his expertise in audio signal processing to the design and study of audio components for Microsoft products such as Kinect and shares how a ...focus on what he can control has fueled professional success.
Learn more:
Ivan Tashev at Microsoft Research
Distributed Meetings: A Meeting Capture and Broadcasting System | Publication, December 2002
Research Collection: The Unseen History of Audio and Acoustics Research at Microsoft | Microsoft Research blog, August 2020
Transcript
To succeed in Microsoft,
you have to be laser-focused on what you are doing.
This is the thing you can change.
Focus on the problems you have to solve.
Do your job and be very good at it.
Those are the most important rules
I have used in my career in Microsoft.
Microsoft Research works at the cutting edge.
But how much do we know about the people behind the science and technology that we create?
This is What's Your Story, and I'm Johannes Gehrke.
In my 10 years with Microsoft, across product and research, I've been continuously excited
and inspired by the people I work with and I'm curious about how they became the talented and passionate people they are today.
So I sat down with some of them.
Now I'm sharing their stories with you.
In this podcast series, you'll hear from them about how they grew up, the critical choices
that shaped their lives, and their advice to others looking to carve a similar path.
In this episode, I'm talking with partner software architect Ivan Tashev in the anechoic chamber in Building 99 on our Redmond, Washington campus.
Constructed of concrete, rubber, and sound-absorbing panels, making it impervious to outside noise,
this chamber has played a significant role in Ivan's 25 years with Microsoft.
He's put his expertise in audio processing to work in the space, helping to design and study the audio components of such products as Kinect, Teams and
HoloLens. Here's my conversation with Ivan, beginning with his childhood in
Bulgaria, where he was raised by two history teachers.
So I'm born in a city called Yambol in Bulgaria, my origin country.
The city was created 2,000 years BC and now sits on the two shores of the river called
Tundzha.
It has always been an important transportation and agricultural center in the entire region.
And I grew up there in a family of two lecturers.
My parents were teaching history.
And they loved to travel.
So everywhere I go, I had excellent tourist guides with me.
This happened in this place, in this year.
Were there quizzes afterwards?
But it happened that I was more fond of engineering, technology, math, all of the devices, just mechanical things, just fascinated me.
When I read in a book about the parachutes, I decided that I had to try this and jumped
from the second floor of a building with an umbrella to see how much it would slow
me down.
It didn't.
And did you get hurt?
Oh, I ended with a twisted ankle. Nothing more.
So you were always hands-on, that's what you're telling me, right? Always the experimenter.
Yep. So I was doing a lot of this stuff, but also I had good teachers in math, and going to those competitions of mathematical Olympiads was something I started in fifth grade.
Pretty much every year they were well organized on school, city, regional level.
And I remember how in my sixth grade I won the first place of the regional Olympiad.
And the prize was an 8mm movie camera.
That, I would say, changed my life.
This is my hobby since then.
I have been holding this movie camera for several generations everywhere I go and travel.
In Moscow, in Kiev, in Venice, everywhere my parents
were traveling I was shooting 8mm films and I continue this
till today. Today I have a much better equipment but also very powerful
computers to do the processing. I produce three to five Blu-ray discs pretty much
every year: performances of the choir
or the dancing groups in the Bulgarian Cultural and Heritage Center of Seattle, mostly.
Wow, that's fascinating.
Was that hobby somehow connected to your entry into science and then actually doing a PhD
and then actually going into audio processing?
The mathematical high school I attended in the city where I was born was one of the five
strongest in the country, which means first math every day, two days twice, physics every day.
Around ninth grade at the end, we finished the entire high school curriculum
and started to study differentials and integrals,
something which is more towards the university math courses.
But this means that I had no problems entering any of the universities with mathematical exams.
I didn't even have to do that because I qualified in one year, my 11th grade, to become a member
of the Bulgarian national teams for the International Math Olympiad and for the International Physics
Olympiad.
And they actually coincided, so I had to choose
one and I chose physics. And since then I'm actually saying that math is the language of physics,
physics is the language of engineering. And that kind of showed the tendency. So literally I was
11th grade and I could legitimately point and choose any of the universities and I decided to go and
study electronic engineering in the Technical University of Sofia.
And then how did you end up in the US?
So that's another interesting story.
I defended my PhD, graduated from the university, defended my PhD thesis.
It was something amazing.
What was it on, actually?
It was a control system for a telescope,
but not just for observation of celestial objects, but for tracking and ranging the distance to the
satellites. Literally one measurement is you shoot with the laser, it goes to the satellite which is
60 centimeters in diameter, it returns back and you measure the time with accuracy of 100
picoseconds.
And this was part of studying how the Earth rotates, how the satellites move.
The data, there were around 44 stations like this in the entire Earth, and the data were
public and used by NASA for finalizing the models for those satellites, which later became GPS.
Used by the Russians to finalize the models for their GLONASS system.
Used by people who studied precession and rotation of the Earth.
A lot of interesting PhD theses came from the data, from the results of this device, including tides. For example,
I found that Balkan Peninsula moves up and down two meters every day because of the tides.
So the earth is liquid inside and there are tides under us in the same way as with the oceans.
Oh wow, super interesting. I actually wanted to come back, just to get the right kind of comparison for the unit. So picoseconds, right? Because I know what a nanosecond is. A nanosecond is 10 to the minus ninth of a second; a picosecond is 10 to the minus twelfth.
OK, good, good. Just to put that in perspective.
Exactly. So this was the accuracy. The light goes 30 centimeters in that time, in one nanosecond.
And we needed to go way shorter than that.
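To put rough numbers on the units discussed here, the following sketch converts a timing accuracy into a distance. It assumes the vacuum speed of light and ignores atmospheric delay and instrument effects, so it only illustrates the arithmetic and does not model the actual station. The function name `one_way_distance_m` is made up for this example.

```python
# Illustrative arithmetic only: vacuum speed of light, no atmospheric correction.
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0

def one_way_distance_m(round_trip_seconds):
    """Distance to a target given the round-trip travel time of a light pulse."""
    return SPEED_OF_LIGHT_M_PER_S * round_trip_seconds / 2.0

# How far light travels in one nanosecond (about 30 cm).
light_per_ns_cm = SPEED_OF_LIGHT_M_PER_S * 1e-9 * 100.0

# A 100-picosecond timing accuracy on the round trip corresponds to
# roughly 1.5 cm of uncertainty in the one-way distance.
range_resolution_cm = one_way_distance_m(100e-12) * 100.0

print(f"light travels {light_per_ns_cm:.2f} cm per nanosecond")
print(f"100 ps of round-trip timing is about {range_resolution_cm:.2f} cm of range")
```

The factor of two appears because the pulse covers the distance twice, out to the satellite and back.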
But why this project was so fascinating for me, can you imagine this is 1988, people having
Apple II or compatible computers playing with the joystick, a very famous game when you
have the crosshair in the space and you shoot
with laser the satellites. And I was sitting behind the ocular and moving a joystick and shooting
at the real satellites.
Not with the goal to destroy them, of course.
No, the energy of the laser was one joule. You can put your hand in front. But very short, one nanosecond. So it can go and return, and you do have the resolution to measure the distance. And after that I became assistant professor in the Technical University of Sofia.
How I came to... And then a friend of mine
came back from a scientific institution from the former Eastern Germany.
And he basically shared how much money West Germany has poured to the East German economy to change it, to bring
it up to the standards.
And that it was, I think, 900 billion Deutsche Marks.
But this went after the changes.
After the changes, after basically the East and West Germany united.
And then this was in the first nine years of the changes.
And then we looked at each other in the eyes and said, wait a minute, if you model this
as a first order system, this is the time constant.
And the process will finish after two times more of the time constant, and then we'll
need another 900 billion marks.
You cannot imagine how exact that prediction became about when East Germany would be economically equal to West Germany.
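The first-order reasoning in this story can be sketched numerically. A first-order system approaches its final value as 1 - exp(-t / tau), where tau is the time constant. The sketch below takes the nine years mentioned in the conversation as tau and treats "finished" as roughly three time constants, which is a common engineering convention and an assumption here, not something stated in the conversation. It models only the timing, not the money.

```python
import math

def fraction_complete(t_years, tau_years):
    """Step response of a first-order system: share of the final value reached at time t."""
    return 1.0 - math.exp(-t_years / tau_years)

TAU_YEARS = 9.0  # time constant taken from the nine years mentioned in the conversation

after_one_tau = fraction_complete(1 * TAU_YEARS, TAU_YEARS)     # about 0.632
after_three_tau = fraction_complete(3 * TAU_YEARS, TAU_YEARS)   # about 0.950

print(f"after {TAU_YEARS:.0f} years: {after_one_tau:.1%} of the way")
print(f"after {3 * TAU_YEARS:.0f} years: {after_three_tau:.1%} of the way")
```

Under these assumptions the process is about 63% complete after the first nine years and about 95% complete after two more time constants, which matches the "two times more" remark.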
But then we looked each other in the eyes and said, what about Bulgaria? We don't have West Bulgaria.
And then this started to make me think that most probably there will be a Technical University of Sofia, but in this economical crisis there will be no money for research,
none for
development, for building skills, for going to conferences. And then
pretty much around the same time somebody said, hey, you know Microsoft is coming here to hire.
And I sent my resume knowing that okay, I'm an assistant professor, I can program.
But that actually happened that I can program quite well, implementing all of those control
systems for the telescope, etc., etc.
And literally…
And so there was a programming testing as part of the interview?
Oh, the interview questions were three or four people, one hour, asking programming
questions.
The opening was for a software engineer.
Like on a whiteboard?
Like on a whiteboard.
And then I got an email saying, Ivan, we liked your performance, we want to bring
you to Redmond for further interviews.
I flew here in 1997.
After the interviews, I returned to my hotel, and the offer was waiting for me at the reception. So this is how we decided to move here to Redmond. And I started and went through two full shipping cycles of products.
So you didn't start out in MSR, right?
Nope.
Where were you first?
So actually, I was lucky enough. Both products were version 1.0. One of them was COM+. This is the transaction server and the COM technology, which is the backbone of Windows.
It was the component model, basically, at that point in time?
Component Object Model. Basically, creating an object, getting the interface, and calling the methods there.
And my experience with low-level programming on assembly language and microprocessor actually came in very handy here.
We shipped this as a part of Windows 2000, and the second product was the Microsoft Application
Center 2000, which was a cluster management system.
But both of them had nothing to do with signal processing, right?
Nope.
Except there was some load balancing in the application center, but they had nothing to
do with signal processing.
Just pure programming skills.
And then in the year of 2000, there was the first TechFest.
And I went to see it and said, wait a minute.
There are PhDs in this company, and they're doing this amazing research.
My place is here.
And TechFest, maybe you want to explain briefly what TechFest is?
TechFest is an annual event when researchers from Microsoft Research go and show and demonstrate technologies they have created.
So it used to be like in the Microsoft Conference Center, like a really big two-day event.
Microsoft Conference Center and basically visited by 6,000-7,000
Microsoft employees.
And usually Microsoft Research, all of the branches, were showing around 150 demos.
And it was amazing.
And that was the first such event.
Pretty much not only…
Oh, the very first time.
The very first TechFest.
Got it.
And pretty much not only me, but the rest of Microsoft Corporation learned that we do have a research
organization.
In short, in three months, I started in Microsoft Research.
How did you get a job here then?
How did that happen?
So seriously, visiting TechFest made me to think seriously that I should return back
to research.
And I opened a career website with potential openings, and there were two suitable for
me.
One of them was in Henrique Malvar's Signal Processing Group, and the other was in the
Communication, Collaboration, and Multimedia group led by Anoop Gupta.
So I sent my resume to both of them.
Anoop replied in 15 minutes.
Next week I was on informational with him.
When Rico replied, I already had an offer from Anoop to join the team.
Got it.
And that's where your focus on communication came from then?
Yes.
So our first project was RingCam.
So it's a 360 camera, eight-element microphone array in the base.
And the purpose was to record the meetings, to do a meeting diarization, to have a 360
view but also, based on the signal processing and face detection, to have a speaker view, separate camera for the whiteboard,
diarization based on who is speaking, based on the direction from the microphone array.
Honestly, even today when you read our 2002 paper (Ross Cutler was the creator of the 360 camera, I was doing the microphone array),
you say, wow,
that was something super exciting and super advanced.
And then you brought it all the way to shipping, right, and became a Microsoft product?
So, yes. At some point, it was actually monitored personally by Bill Gates.
And at some point…
So he was PMing it, basically?
He basically was…
He was just a graphic.
I personally installed the distributed meeting system in Bill Gates' conference room.
We do have basically 360 images with Bill Gates attending a meeting.
But anyway, it was believed that this is something important
and a product team was formed to make it a product.
Ross Cutler left Microsoft Research
and became architect of that team.
And this is what became the Microsoft RoundTable device.
It was licensed to Polycom
and for many years was sold as the Polycom CX5000.
Yeah, actually I remember when I was in many meetings, they used to have exactly the device
in the middle.
And the nice thing was that even somebody who was remote, you could see all the people
around the table and you got this really nice view of who was next to whom and not sort
of the transactional windows that you have right now in Teams.
That's a really interesting view.
So as you can see, a very exciting start.
But then Anoop went and became Bill Gates' technical assistant, and the signal processing people from his team were merged with Rico Malvar's signal processing team. And this is how
I continued to work on microphone arrays and speech enhancement.
And this is what I do till today.
And you mentioned amazing products from Microsoft, like Kinect and so on, right?
And so you were involved in the audio processing layer of all of those.
And they were actually then, part of it was designed here in this room?
Yep.
So tell me a little bit more about that.
You know, at the time, I was fascinated by a problem which was considered theoretically
impossible.
Multi-channel acoustic echo cancellation.
There was a paper written in 1998 by the inventor of the acoustic echo cancellation, from Bell
Labs, stating that stereo acoustic echo cancellation is not possible.
And he proved it?
Or what does that mean?
He just…
Look, it's very simple.
You have two unknowns, the two impulse responses from the left and the right loudspeaker, and one equation.
That's the microphone signal.
What I did was to circumvent this.
When you start Kinect, you'll hear some melodic signals, and this is the calibration.
At least you know the relation between the two unknowns.
And now we have one unknown, which is basically discovered using an adaptive filter, the classic acoustic echo cancellation.
So technically, Kinect became the first device ever shipped with surround sound acoustic echo cancellation. The first device ever that could recognize human speech from four and a half meters while the loudspeakers are blasting and gamers are listening to very loud levels of their loudspeakers.
So let me just tell the audience a little bit, what does it mean to do acoustic echo cancellation? What is it actually good for? What does it do?
So in general speech enhancement is removing unwanted noises and sounds from the desired
signal.
Some of them we don't know anything about, which is the surrounding noise.
For some of them we have a pretty good understanding.
This is the sound from our own loudspeakers.
So you send the signal to the loudspeakers and then try
to estimate on the fly how much of it is captured by the microphone and subtract this estimation,
and this is called acoustic echo cancellation. This is part of every single speakerphone.
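The estimate-and-subtract idea described here can be sketched with a normalized LMS (NLMS) adaptive filter, one textbook way to do single-channel echo cancellation. This is a toy illustration under simplifying assumptions (no near-end speech, no noise, a short made-up echo path) and is not the algorithm shipped in any Microsoft product. The names `nlms_echo_canceller`, `true_path`, and the parameter values are all hypothetical.

```python
import random

def nlms_echo_canceller(far_end, mic, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated loudspeaker echo from the microphone signal."""
    weights = [0.0] * taps      # running estimate of the loudspeaker-to-microphone path
    history = [0.0] * taps      # most recent loudspeaker samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate                       # what remains after subtraction
        energy = sum(h * h for h in history) + eps      # normalization term
        weights = [w + mu * error * h / energy for w, h in zip(weights, history)]
        residual.append(error)
    return residual, weights

# Toy setup: the microphone hears only the loudspeaker signal through a short echo path.
random.seed(0)
true_path = [0.6, 0.3, -0.2, 0.1]   # hypothetical room response, for illustration only
far_end = [random.uniform(-1.0, 1.0) for _ in range(4000)]
mic = [
    sum(h * far_end[n - k] for k, h in enumerate(true_path) if n - k >= 0)
    for n in range(len(far_end))
]

residual, weights = nlms_echo_canceller(far_end, mic)
echo_power = sum(v * v for v in mic[-500:]) / 500
residual_power = sum(v * v for v in residual[-500:]) / 500
print(f"echo power {echo_power:.4f}, residual power after adaptation {residual_power:.2e}")
```

In this idealized setting the filter weights converge to the echo path and the residual becomes negligible. With two loudspeakers playing correlated signals, the same single equation has two unknown paths, which is the difficulty discussed earlier in the conversation.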
This is one of the oldest applications of the adaptive filtering.
So the right way to think about this is that noise cancellation is cancelling unwanted noise from the outside.
Unknown noises.
Whereas, you know, acoustic echo cancellation is cancelling the own noise.
Which we know about.
Right, OK.
And that was an amazing work, but it also started actually in TechFest. I designed this surround sound echo cancellation and my
target was at the time we had a Windows Media Center. It was a device designed to
stay in a media room and controlling all of those loudspeakers and I made sure to
bring all of the VPs of Windows and Windows Media Center and then I noticed
that I started repeatedly
to see some faces which I didn't invite, I didn't know,
but they came over and over and over.
And after the meeting, after TechFest,
a person called me and said,
look, we are working on a thing
which your technology fits very well.
And this is how I started to work for Kinect.
And in the process of the work,
I had to go and talk with industrial designers
because of the design of the microphones,
with electrical designers because of the circuitry
and then requirements for identical microphone channels
and with the software team
which had to implement my algorithms.
And actually, at some point I had an office in their building and was literally embedded, working with them day and night, especially at the end of the shipping cycle when the device had to go out.
And this was not a time when you could go, like, in the device and, you know, update software on the device or anything. The device would go out as is, right?
Actually, this was one of the first devices like that.
It could?
Yep. Already Kinects were manufactured, they are boxed, they are already distributed to the stores, but there was a deadline when we had to provide the image when you connect
to your Xbox and it has to be uploaded.
But I get that, but then once it was actually connected to the Xbox, you could still update
the firmware on there?
Yes.
Oh, that's really cool.
Okay.
But it also has a deadline.
So that was an amazing trip.
It literally left all of us breathless.
There are plenty of serious technological challenges to overcome.
A lot of first technology was brought to this device to make sure.
And this is the audio.
And next to us were the video people and the gaming people
and the designers, and everybody was excited, working like hell so we can basically bring this to the customers.
Wow, that's super exciting. I mean, even just being involved. And I mean, I think that's one of the really big things that's so much fun here at Microsoft, right, that you can get whatever you do in the hands of, you know, millions, if not hundreds of millions, of people.
Coming back to your work now in audio signal processing, and that whole field is also being
revolutionized like many other fields right now with AI.
Absolutely.
Photography, one of the other fields that you're very passionate about, is also being
revolutionized with AI, of course.
Also revolutionized.
You know, in terms of changes that you've made in your career, how do you deal with
such changes?
This is something where you have been an expert in a certain class of algorithms, and now
suddenly it says, this is completely new technology coming along and we need to shift.
How are you dealing with this?
How do you deal with this personally?
In some sense, you're becoming a little bit of a dinosaur in a little bit while…
Oh, not at all!
Yeah, that's interesting.
I wouldn't be in research!
Exactly.
How did you overcome that?
So first, each one of us was working and trying to produce better and better technology.
And at the time, the signal processing, speech enhancement, most of the audio processing was based on statistical signal processing. You build the statistical models, distributions, hidden Markov models, and get certain improvements.
Get some information.
Yep. And all of us started to sense that this set of tools we have started to saturate. And it was simple: we use simple models we can derive. Let's say speech is a Gaussian distribution, noise is a Gaussian distribution; you derive the suppression rule.
But this is simplifying the reality.
If you apply a more precise model
of the speech signal distribution,
then you cannot derive easily the suppression rule.
For example, in the case of noise suppression.
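As one concrete instance of a suppression rule derived from Gaussian assumptions, the sketch below computes the classical Wiener gain for a single frequency bin. The conversation does not say which rule was used in practice, so this is offered only as a familiar textbook example. The function name and the variance values are illustrative.

```python
def wiener_gain(speech_variance, noise_variance):
    """Gain for one noisy frequency bin when speech and noise are modeled as Gaussian."""
    if noise_variance <= 0:
        raise ValueError("noise_variance must be positive")
    a_priori_snr = speech_variance / noise_variance
    return a_priori_snr / (1.0 + a_priori_snr)

gain_equal_power = wiener_gain(1.0, 1.0)     # speech and noise equally strong
gain_strong_speech = wiener_gain(9.0, 1.0)   # speech dominates, bin is mostly kept
gain_noise_only = wiener_gain(0.0, 1.0)      # no speech expected, bin is fully suppressed

print(gain_equal_power, gain_strong_speech, gain_noise_only)
```

The closed form exists because both models are Gaussian. With a more realistic speech distribution the optimal gain generally has no simple expression, which is the limitation being described.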
And it was literally hanging in the air that we have to find a way to learn from data.
And I have several papers actually before the neural networks to appear that let's get
a big data set and learn from the data.
So a more data-driven approach already.
Data-driven approach.
I have several papers on that. And by the way, they were not quite well
accepted by my audio processing community. All of them are published on a bordering conferences,
not in the core conferences. I got those papers rejected. But then appeared neural networks.
Not that they were something new. We had neural networks in the 80s and they
didn't work well. The miracle was that now we had an algorithm which allowed us to train
them. Literally, next year after the work of Geoff Hinton was published in the implementation
of deep learning, several things happened.
At first, my colleagues in the speech research group
started to do neural network-based speech recognition,
and I and my audio group started to do
neural network-based speech enhancement.
This is the year of 2013 or 2014.
We had a neural network-based speech enhancement algorithm
surpassing the existing
statistical signal processing algorithm literally instantly.
It was big, it was heavy, but better.
When did the first of these ship?
Can you tell any interesting ship stories about this?
The first neural network-based speech enhancement algorithm was shipped in 2020 in Teams.
OK.
We had to work with that team for quite a while. Actually, four years took us to work with Teams to find...
You see, here in the research, industrial research lab, we have a little bit different perspective.
It's not just to make it to work.
It's not just to make it a technology.
That technology has to be shippable. It has to meet a lot of other requirements
and limitations in memory and in CPU and in reliability.
It's one thing to publish a paper with very cool results with your limited data set
and completely different to throw this algorithm in the wild, where it can face everything.
And this is what cost us around four years before we shipped the first prototype in Teams.
That makes sense.
And I think a lot of the infrastructure was also not there at that point in time early
on, right, in terms of, you know, how do you upload a model to the client, even in terms
of all the model profiling, you know, architecture search, quantization, and other tooling that now exists where you can take a model
and then squeeze in on the right kind of computation footprint.
That's correct.
So you did all of that manually, I guess, at that point in time.
Initially, yes.
But new architectures arrived.
The cloud.
Wow! It was a savior.
You can press a button, you can get 100 or 1,000 machines.
You can run in parallel multiple architectures.
You can really select the optimal from every single standpoint.
Actually, what we did is we ended up with a set of speech enhancement algorithms.
Given computing power, we can tell you
what is the best architecture for this.
Or if you want to hit up this improvement,
I can tell you how much CPU you will need for that.
But that trade-off is also something very typical
for industrial research lab
and not very well understood in academia.
Makes sense.
Well, let me switch gears one last time.
Namely, I mean, you have made quite a few changes in your career throughout.
You started as an assistant professor, then became a core developer, then were a member
of a signal processing group, and now you're driving a lot of the audio processing research
for the company.
How do you deal with this change?
Do you have any advice for our listeners on how to keep your career going, especially as the rate of change seems to be accelerating all the time?
For 25 years in Microsoft Corporation, I have learned several rules I follow. The first is dealing with ambiguity. It is not just changes in the technology, but also changes in the team, the organizations, and so on. Accept that these are things you cannot change. These are things you cannot hide from. Just accept them and go on. And here comes the second rule. To succeed in Microsoft, you have to be laser-focused on what you are doing. This is the thing you can change. Focus on the problems you have to solve. Do your job and be very good at it. This is the most important. Those are the two most important rules I have used in my career in Microsoft.
Okay.
Super, super interesting, Ivan.
Thank you very much for this amazing conversation.
Thank you for the invitation, Johannes.
To learn more about Ivan's work or to see photos of Ivan pursuing his passion for shooting
film and video, visit aka.ms/researcherstories.