Instant Genius - Why realistic humanoid robots need to learn to lip-sync
Episode Date: April 19, 2021
In this week's episode of the Science Focus Podcast, commissioning editor Jason Goodyer speaks to Dr Carl Strathearn, a research fellow at the School of Computing at Edinburgh Napier University. He's currently conducting research on realistic humanoid robots, specifically on more realistically synchronising their speech and mouth movements. He tells us about how to get robots out of the Uncanny Valley, why the way a robot looks is so important, and why Data from Star Trek is an inspiration for his work.
Transcript
You're listening to the Science Focus podcast from the
BBC Science Focus magazine team, with the UK's best-selling science and technology monthly, available
in print and in several digital formats throughout the world. Find out more at
ScienceFocus.com or look out for us in your app store.
Hello and welcome to the Science Focus podcast.
I'm Jason Goodyer, commissioning editor at BBC Science Focus magazine.
Today I'm talking to Carl Strathearn, a roboticist and research fellow at the School of Computing
at Edinburgh Napier University.
He's currently conducting research on realistic humanoid robots, specifically on more
realistically synchronising their speech and mouth movements.
So thanks very much for joining us today, Carl.
No problem.
So yeah, just by way of background, one of the big talking points or maybe even driving factors behind research on, you know, realistic humanoid robotics is this so-called Uncanny Valley effect.
So sort of before we get into the actual meat of your work, can you just explain to the people listening what that is if they're unfamiliar with it?
Yes, the Uncanny Valley is a point
where things like robots, humanoid robots and CGI characters start to give us an eerie feeling.
And the reason for that is because they are not perfect representations of humans.
They never quite get there.
Because they never quite get there, they elicit these feelings of terror, unease, unfriendliness.
And that's the Uncanny Valley.
They call it a perceptual dip, which is basically
a point between being alive and being dead.
It's this idea of a zombie in between the two,
and humanoid robots and CGI characters,
because they exhibit similar kinds of qualities to a zombie,
fall into the Uncanny Valley.
Yeah, I like to say it sits somewhere between WALL-E,
like your sort of cute robot,
and then the T-1000 from Terminator 2.
Yeah.
So what's the sort of current thinking on the psychology?
What's going on here?
Why do people find these sort of human but not completely human robots
a bit, you know, a bit iffy, a bit creepy?
I think it's because, well, from birth, we're able to detect faces and we're able to analyze faces.
And faces play such an important part
in our communication.
And when we start to see things that shouldn't be there, out of place,
we do get that feeling of,
in the Uncanny Valley, it's called repulsion,
but I guess I call it negative feedback.
Like, it's unnatural feedback.
And one of the recent arguments that has come to light
is that this is starting to also occur in facial enhancement surgery,
so people who have their lips kind of
enhanced and things like that.
This can be considered as sort of the higher realms of the Uncanny Valley.
If I was to build a robot and it had these sort of enhancements,
and I said, oh, I'm trying to make it as real as possible,
people might say, well, it doesn't look completely real
because you've added these enhancements.
So on a kind of perceptual level,
they would consider that as the higher realms of the Uncanny Valley.
There are other types of Uncanny Valley as well.
It's not just appearance, it's in functionality as well, the way things move, the way robots move.
If a robot doesn't move the way we kind of expect it to move, then again, that gives that
feeling of unnaturalness and uneasiness, and that is kind of the essence of the Uncanny Valley effect.
Yeah, I remember there's the, what's he called, the Atlas robot, and I just thought that was
amazing and really fascinating, but other people, when they're sort of pushing him over and he's recovering
his balance, you know, some people are saying, oh, you know, that thing's going to turn on you.
Yeah, that's really just because it looks and behaves like a human, and because the human drive is, if we
see something that looks and behaves anything kind of like a human, we automatically start
to assume it must be able to feel and think and have emotions like a human, when it
doesn't. So it's that drive again, that kind of innate drive. Yeah, that's very interesting.
So moving on to the role you play in this.
So you focus on matching facial movements to speech.
So why is that important?
Why does that play such an important role in this effect?
Well, this all started from the Uncanny Valley theorem.
And the two key areas in the Uncanny Valley theorem are eyes and the mouth.
And when we communicate, our attention goes between the eyes and the mouth.
We look at the eyes to gauge attention and we look at the mouth for speech reading, for understanding.
And with robots particularly, anything that is kind of outside the realm of natural lip movements,
when the speech is coming out so perfectly, can be confusing and disorientating,
especially if you interact with it for a length of time.
I think the obvious choice is one of the recent Star Wars remakes, when they did a CGI character,
and the lip synchronisation was kind of off.
So yeah, that's where this project started, really.
It started off with: how can I take the systems that are used in CGI animation and games to turn speech into something called visemes,
which are kind of the lip positions, how can I take that software and recreate it for a robot?
How can I turn this into a robot?
Right.
That's where that started really.
So when I was first doing this project, I was actually helping teach in the animation department at the time because the previous university I was at didn't quite have a robotics department.
So that's where these ideas started to come together, because they use programs.
There's one called Oculus, which basically takes speech and converts it into a CGI mouth with lip positions.
So it automatically reads speech and extracts the visemes for the mouth positions.
And I wanted to do that with the robot.
So to start with, I created a robot mouth.
The robot mouth was modelled on the human mouth,
but before I did that, I looked at previous robotic mouth systems
to see what was missing.
And that was kind of really important just to be able to see
what are the key muscles, what muscles work together,
what can be left out of this mouth.
Obviously, it's a very small area,
and you're kind of confined in what you can actually put into a robotic mouth.
One of the key things that was missing
was something called the buccinator
muscles, which are the muscles at the corners of the mouth, not the cheek muscles, but the corners of the mouth.
And they are used for pursing and stretching the lips when we create vowel and consonant sounds.
So I replicated these muscles and I created this kind of robotic mouth prototype and I thought, right, the next stage is to create an application that can take these lip shapes and put them into this robotic mouth.
So we use something called a viseme chart, and it's something that's used a lot in
CGI and game design, which is basically a list of sounds, word sounds and letter sounds, and
the matching mouth shape.
And I made my robot do these shapes.
So for each, like the A's, R's and O's, I collected all these robotic mouth positions
and saved them into a configuration file
to be able to bring them out later and use them.
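The viseme chart and configuration file described here could be sketched roughly as follows. This is only an illustration: the viseme names, the actuator names (`jaw`, `lip_corners`, `tongue`) and every number are assumptions, not values from the actual robot.

```python
import json

# Hypothetical viseme chart: each mouth shape maps to normalised actuator
# targets (0.0 = relaxed, 1.0 = fully engaged). All values are illustrative.
VISEME_CHART = {
    "rest": {"jaw": 0.0, "lip_corners": 0.5, "tongue": 0.0},
    "AA": {"jaw": 0.9, "lip_corners": 0.6, "tongue": 0.1},
    "OO": {"jaw": 0.4, "lip_corners": 0.1, "tongue": 0.2},
    "EE": {"jaw": 0.3, "lip_corners": 0.9, "tongue": 0.4},
}


def save_chart(chart, path):
    """Write the chart to a JSON configuration file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(chart, f, indent=2)


def load_chart(path):
    """Read a chart back from a JSON configuration file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def pose_for(chart, viseme):
    """Return the actuator targets for a viseme, falling back to 'rest'."""
    return chart.get(viseme, chart["rest"])
```

Keeping the poses in a plain file means the mouth shapes can be re-tuned without touching the code that drives the mouth.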
The next part was how to create a system that can handle speech.
Now, previously in the other applications,
the speech was kind of a secondary thing.
You spoke and then you put it into a file, into the application,
and it read the file.
I needed to do it live.
There was no room to have some processing time,
because if you use processing time, then this idea of speech becomes unnatural.
In the conversation, you know, there are lots of huge pauses, which is unnatural.
So this is what I did.
I created a machine learning algorithm, and I was able to take speech synthesis, which is robotic speech like you have on Siri and various other applications.
Take that speech synthesis out of a laptop and put one end of it into something called a microprocessor and turn
that audio data back into numerical data.
And the other part of it also went back into a processing system.
So I can actually see the sound wave like you see on a normal,
like in a recording studio.
And then what I did is I created a machine learning algorithm
that could kind of recognize patterns in the incoming speech.
And that was done not by monitoring the speech itself as such,
but the patterns in the waveform.
So you're looking at kind of the peaks of the sound
and the length of each word and each sound.
And then basically feeding the system a bunch of samples.
So it kind of knew what it was looking for.
And when it came across it,
it was able to configure the robot mouth system
to match the positions that I matched on the chart.
And that worked surprisingly well.
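As a rough illustration of recognising patterns in a waveform from labelled samples, the sketch below uses a nearest-centroid classifier over two simple features, duration and peak amplitude. Both the choice of features and the classifier are assumptions made for the example; the interview does not say which algorithm was actually used.

```python
import math


def features(samples, sample_rate):
    """Return (duration in seconds, peak absolute amplitude) for one sound."""
    duration = len(samples) / sample_rate
    peak = max((abs(s) for s in samples), default=0.0)
    return (duration, peak)


def train(labelled_examples):
    """Average the feature vectors of each label into one centroid."""
    sums = {}
    for label, feat in labelled_examples:
        total, count = sums.get(label, ((0.0, 0.0), 0))
        sums[label] = ((total[0] + feat[0], total[1] + feat[1]), count + 1)
    return {label: (t[0] / n, t[1] / n) for label, (t, n) in sums.items()}


def classify(centroids, feat):
    """Pick the label whose centroid is closest to the incoming features."""
    return min(centroids, key=lambda label: math.dist(centroids[label], feat))


# Made-up training samples: (viseme label, (duration, peak amplitude)).
examples = [
    ("AA", (0.30, 0.9)),
    ("AA", (0.34, 0.8)),
    ("EE", (0.12, 0.4)),
    ("EE", (0.10, 0.5)),
]
centroids = train(examples)
```

With these made-up samples, `classify(centroids, (0.28, 0.85))` picks `"AA"`, and that label could then be used to look up the matching mouth position on the chart.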
And then the next thing was
the voice patterning system,
which is syllables.
So obviously when you talk,
your jaw moves up and down to the syllables.
And that was kind of the next stage,
to create this patterning system
so that,
if there was no sound,
the mouth was shut,
and the louder the sound,
the wider the robot mouth.
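The loudness rule described here (no sound, mouth shut; louder sound, wider mouth) can be sketched as a mapping from the loudness of a short audio frame to a jaw opening between 0 and 1. The RMS loudness measure, the silence threshold and the full-scale level are all assumed values for illustration.

```python
import math


def rms(frame):
    """Root-mean-square loudness of one frame of audio samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def jaw_opening(frame, silence_threshold=0.02, full_scale=0.5):
    """Map frame loudness to a jaw opening in the range 0.0 to 1.0.

    Below the (assumed) silence threshold the mouth stays shut; above it
    the opening grows with loudness and is capped at fully open.
    """
    level = rms(frame)
    if level < silence_threshold:
        return 0.0
    return min(1.0, level / full_scale)
```

Calling `jaw_opening` once per incoming frame keeps the jaw following the syllables without waiting for the whole utterance.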
And then there were tongue positions as well.
So there were tongue positions to include.
And then when I actually put it all together,
it was pretty amazing to see it work.
It was, like we were talking about, the Uncanny Valley.
I think it was one of the first times I actually sat with a robot, and it was very strange to see because, you know, you see all these weird parts working together.
But yeah, it was good.
Once it was configured and the system was trained, it was really quite accurate in some respects.
In the lip synchronisation, it was very accurate.
In some other parts, it wasn't.
But it held up pretty well
in the evaluation against existing robots.
Yeah, so for those who haven't seen your work, your robot,
it's a pretty realistic looking head.
And it's a head of an older gent.
How did you go about choosing your character for your robot?
I just find that really interesting.
Well, there's actually two robots in the experiment.
There was an old-looking one and a younger-looking robot.
The younger looking robot doesn't get as much attention
because I think the older looking robot looks more realistic
but they were both kind of produced with the idea of being
one was a younger version of the older one
so they were both kind of the same robot
and when I was doing the tests
because the mouth test was part of a wider test
which involved lots of different things like eyes and personality
so I wanted to compare how people interacted
with an older-looking robot and a younger-looking robot.
And I had two sample groups.
I had a sample group of older people and younger people.
And what I found is that younger people prefer to interact with the younger robot
and the older people prefer to interact with the older-looking robot.
And there were also personalities as well.
So I had to design an older personality and then a younger personality.
So I thought, well, I'm quite young, so I'll build the younger personality on myself,
so my interests. And then I thought, well, I know my dad
pretty well, and he's kind of old, so I've modelled the old one on him. So I had one that was
interested in what I'm interested in, and one that was interested in snooker and John Smith's.
So it was big. That's really interesting. Have there been any sort of like big studies done
on what the public or people who are going to be interacting with these robots would like
them to look like?
I'm not too sure about robots.
There certainly was in CGI characters,
but I actually wrote a paper just on this subject,
which was designing robots,
and I call it embodied artificial intelligence,
which is the personality of robots.
And it's really fascinating.
Actually, there was a robot called BINA48,
which was modelled on somebody.
It was supposed to act like a vessel for her.
So it's like a collection of her memories and life experiences.
But in terms of actual academic research, there was very little to go on.
One interesting thing I'm starting to really realise now is that there's been a huge movement away from academia into the private sector.
So we have, like, Hanson Robotics and Sophia.
And even in England we have Engineered Arts, and they have their robots,
their humanoid robots, and in Japan they have the Geminoid series, and Russia has a new one called
Promobot, which again is realistic humanoid robots for things like desk assistants and receptionists
and things like that. So yeah. Yeah, so I'm just sort of going into the nuts and bolts of
your work. So you say there you've got lips, teeth, tongue, jawbones, different facial
actuating muscles. So what's it actually made of? It was all 3D printed. Because it was
a rapid prototype and there were so many different versions of it, the whole system was
3D printed, but then some parts of it couldn't really stand up to the pressures of the mouth
working all the time, so I had to have them CNC'd in a special aluminium composite, which is kind of
a very thin, very light material. I think eventually what
I am hoping to do is
to be able to publish all this online,
kind of open source and let people
create their own prototypes and expand
on the system because
it has a high accuracy
but it's not totally accurate
and so there's still work there to be done
Because I've kind of moved on now to other stuff,
you know, I want to kind of leave it to the public,
well, engineers and roboticists interested in that,
to kind of expand on. Yeah,
so what was the design process
like then? What was your sort of
starting point and your initial goal?
My initial goal was to replicate the human mouth as closely as possible.
The speech synthesis was difficult to deal with
because we don't have accurate speech synthesis.
And I don't think it's ever really going to sound truly human.
Because human speech is so variable.
I think that's why my system works so well.
Because with speech synthesis, you can control that.
With human speech you can't.
So if I was to speak into my machine learning application and try and get the robot to replicate it, it's not going to do that.
Speech synthesis is very controlled.
It's not totally controlled, but there are limitations to it and you can kind of work within these parameters to get really good results.
So the other interesting point, and the reason why I designed it like I designed it, was because I knew from my experience that the previous humanoid robots out there, like Sophia,
do not use these kinds of technologies.
They simply have random jaw movements to sound.
And sometimes they do it very well.
They tend to do it very quickly,
so it's hard to see it exactly.
So when you do things quickly
and the speech is kind of at its normal pace,
then there is a little bit of scope there for, almost,
it tricks the human brain.
If the lips are
going slower, then, you know, you kind of see that.
But if things are going faster, you tend not to notice too much.
And I really wanted to see if I could kind of improve all this.
So from my studies, I was able to determine that using things like machine learning
is a lot more accurate and definitely the way to go to be doing these things,
rather than just randomized lip movements and positions
and things.
Yeah, that's really, because going back to you saying about CGI and video games,
like I've recently noticed, I don't know if you're familiar with it,
but I really like Demon's Souls and Dark Souls, those games.
And they recently did a sort of revamp of Demon's Souls,
which is quite old for the PlayStation 5.
And one of the things that was vastly improved was the synchronisation of the characters,
as they were speaking with their mouths.
it looked so much more natural than previously
where it was sort of like a badly dubbed
you know 80s movie or something.
Yeah.
So is that, that's like similar to the stuff you've been working on previously?
Yeah, you've pretty much hit the nail on the head.
But I'd also say that I imagine at the time
when you were playing them video games the first time around,
you might not have noticed that as much.
Or if you did notice it, you kind of thought,
well, that's just, that's still really good,
still really good attempt.
But with humanoid robotics, it's different because they're in front of you.
They're there.
CGI characters, they get away with a lot because they aren't there.
When you have a robot in a room in front of you,
there's very little hiding places for these things,
and you're able to kind of really pick out things that are going wrong
and things that are unnatural.
And that's one of the things that really came about in my studies
was how people have this kind of inbuilt ability
to recognize things that are not quite right.
and what you might think is a tiny thing
can actually give the whole game away.
Especially when we're considering,
I wrote another kind of model,
an idea called the Multimodal Turing Test,
which is also known as the Westworld test,
which is basically when you create a robot
and it gets to the point where you can't tell
the difference between the robot and the human.
Things like that.
And what goes into that as well,
so it was kind of a model that was,
it's based on like a triangle,
a hierarchy. And the closer you get to the top, the harder it is, of course, you know, to actually
get these nuances, these things, and things like lip synchronization, pupil dilation is another
area I've worked on, robotic pupil dilation. It's these tiny nuances that play a huge part in it,
because these are the things that give the game away, you know, and things like facial
tics or whatever, just these tiny nuances that we don't even realize are important in a
conversation suddenly become crucial. Yeah, that's really interesting. So sort of, yeah,
Westworld, I'd forgotten about that, actually. I really enjoyed that. So what are the potential
applications of this type of work? What's, what's the end goal? What do we want to do with it?
For me, I always use the example of Data from Star Trek as the perfect example for this, because Data,
he acts as like this very humanistic interface between lots of different things.
He acts as an interface between people and aliens.
So obviously aliens that don't speak English, and he acts as a translator,
but not only that, he also acts as an interface between things like the computer and a person.
So things that would be very difficult, calculations that would be very difficult,
he is able to translate that information and give it in a very simplified way,
in a very humanistic way, with emotion, with facial expressions,
and that's where I think this technology will eventually head.
I mean, we have to remember that not everybody can interact with technology effectively.
We're very privileged, I think, to have grown up with technology
and to be able to use technology, but there's lots of people in the world who don't have that.
And creating something like a humanoid robot would allow them to kind of integrate with technology
a lot more naturally.
So that's another kind of use.
I always think the Data example is a really good one,
rather than the Terminator stuff.
Yeah, the friendlier end of the spectrum.
Yeah.
So you mentioned there Data
translating alien languages and things.
I know there's some work in Japan that I've seen,
but are there any sort of differences
between different languages in this sort of stuff?
Yes, definitely.
Pronunciations and even regional dialects, like my Yorkshire accent, would be a huge factor in this.
Again, I think that's why having a machine learning algorithm is the way to go,
because these are the sort of things you can train the system on.
So yeah, that is a very, it's very interesting.
It's something, again, I think I'd be really interested in looking at later on
to see what the influences of language and dialects and accents
and things are.
So what do you think, like, the time frame is for this sort of thing, like, playing
the long game, when are we going to be seeing, like you say, an interface, say, when am I
going to have one in my home that can, you know, help, say, I don't know, maybe I'm elderly
or I'm disabled or something, that sort of, you know, I don't want to say robot butler,
but you know what I mean?
Well, you might be in luck, because Hanson Robotics have announced, it was only a few
weeks ago, that they're mass rolling out the Sophia robot. So that's their aim
for 2021, 2022, to start rolling out this Sophia model. But I'd argue that how
useful that would be is kind of massively up for debate, because Sophia is actually semi-autonomous,
not fully autonomous. So there's going to be certain things she can't do. You're going to have
to do them for her. And I think it might be too early to start even
thinking about distributing these humanoid robots out on a mass scale,
at least until they can start doing things fully by themselves and without any human aid.
And even then, you're still going to have to get it past kind of all the ethicists.
There's a lot of really good work done in AI ethics and robotic ethics.
So yeah, it's really hard to say.
I think it's a long way away.
But at the same time, there's lots of good research going on at the moment, which is also pushing it forward.
So it's very difficult to give you an answer to that.
Yeah, yeah, that's been great.
So you just mentioned earlier that now you're moving on to new projects.
So I just wanted to ask you, you know, what are you hoping to work on next?
What are your plans for the next few years?
Well, at the moment, I'm working on what we're calling
visually enhanced common sense language models,
which is basically allowing robots to use some level of human common sense.
So an example of this would be
if I had a robot and I had a vision system
and I asked it to find a pen
on a table, it could do that. No problem
because it has object recognition
and it could recognise a pen. But if the pen
was in a drawer, say
in a kitchen, and
you asked it to find the pen, it would spend
all day going around
looking for something but not ever opening the drawer.
So this idea of common sense knowledge would be
giving the robots some
ability to know that pens
are kept in drawers, clothes are
kept in a wardrobe. And these are things that are missing out. So it's like a cross between
language and vision. There's like a crossover in our common sense, what we call common sense
knowledge. So that's what I'm working on at the moment. And we're currently developing a robot
to help people with cooking tasks. So it's like a robot chef to use in a kitchen. But you're
able to ask it things and do things which you can't normally do with things like the Amazon Echo.
To give an example, if you ask that to give you a recipe or help you cook,
it gives you it in one solid block and it just reads the whole thing out.
And this would be more intuitive.
It would be more like an information giver, information follower kind of construct.
But there are also lots of common sense knowledge bases embedded in there.
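The kind of common sense knowledge mentioned here, such as the earlier example of pens being kept in drawers, could be sketched as a small lookup that tells a robot which closed containers are worth opening when an object is not in view. The table contents and function name are hypothetical, and a real system would use a language model or a much larger knowledge base.

```python
# Hypothetical common sense facts: where each kind of object is usually kept.
USUALLY_KEPT_IN = {
    "pen": ["drawer", "pencil case"],
    "shirt": ["wardrobe"],
}


def find_object(target, visible_objects, closed_containers):
    """Return where the target was found, or None if the search fails.

    visible_objects: names the vision system can currently see.
    closed_containers: container name -> contents revealed by opening it.
    """
    if target in visible_objects:
        return "in plain view"
    # Vision alone has failed, so only open the containers that common
    # sense says are likely places for this object.
    for container in USUALLY_KEPT_IN.get(target, []):
        if target in closed_containers.get(container, []):
            return container
    return None
```

Without the lookup table, the function could only ever report objects that are already visible, which is the pen-in-the-drawer failure described in the interview.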
So you'd be able to ask lots of things like,
I don't have a certain ingredient.
Is there another ingredient I could use instead of this?
And it would be able to do that as well.
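A step-by-step recipe dialogue with ingredient substitution might look something like the sketch below. The class, the substitution table and its entries are hypothetical and only illustrate the idea of giving one instruction at a time instead of reading out a whole recipe.

```python
# Hypothetical substitution knowledge; the entries are illustrative only.
SUBSTITUTES = {
    "butter": ["margarine", "olive oil"],
    "buttermilk": ["milk with lemon juice"],
}


class RecipeGuide:
    """Gives one recipe step at a time and answers substitution questions."""

    def __init__(self, steps):
        self.steps = list(steps)
        self.index = 0

    def next_step(self):
        """Return the next instruction, or None when the recipe is finished."""
        if self.index >= len(self.steps):
            return None
        step = self.steps[self.index]
        self.index += 1
        return step

    def substitute(self, ingredient, pantry):
        """Return the first known substitute that is in the pantry, else None."""
        for option in SUBSTITUTES.get(ingredient, []):
            if option in pantry:
                return option
        return None
```

The user sets the pace by asking for the next step, and can interrupt at any point to ask about a missing ingredient.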
Thanks for listening.
And if you've enjoyed this episode,
please do leave us a review.
This podcast was brought to you by the team behind BBC Science Focus magazine.
In the March issue, which is on sale now,
we talked to Tim Berners-Lee about whether we can make the internet great again.
We look at the experiment looking to bring hallucinogenic drugs to the NHS,
and we dive into plans to build a city on Mars.
And of course, there's much, much more inside,
and on our website, sciencefocus.com.
