Instant Genius - Why realistic humanoid robots need to learn to lip-sync

Starting point is 00:01:24 You're listening to the Science Focus podcast from the BBC Science Focus magazine team, with the UK's best-selling science and technology monthly, available in print and in several digital formats throughout the world. Find out more at sciencefocus.com or look out for us in your app store. Hello and welcome to the Science Focus podcast.

Starting point is 00:02:14 I'm Jason Goodyer, commissioning editor at BBC Science Focus magazine. Today I'm talking to Carl Strathearn, a roboticist and research fellow at the School of Computing at Edinburgh Napier University. He's currently conducting research on realistic humanoid robots, specifically on more realistically synchronising their speech and mouth movements. So thanks very much for joining us today, Carl. No problem. So yeah, just by way of background, one of the big talking points, or maybe even driving factors, behind research on realistic humanoid robotics is this so-called Uncanny Valley effect.

Starting point is 00:02:58 So before we get into the actual meat of your work, can you just explain to the people listening what that is, if they're unfamiliar with it? Yes, the Uncanny Valley is a point where things like humanoid robots and CGI characters start to give us an eerie feeling. And the reason for that is because they are not perfect representations of humans. They never quite get there. Because they never quite get there, they elicit these feelings of terror, unease, unfriendliness. And that's the Uncanny Valley. They call it a perceptual dip, which is basically

Starting point is 00:03:39 a point between being alive and being dead. It's this idea of a zombie in between the two, and humanoid robots and CGI characters, because they exhibit similar kinds of qualities to a zombie, fall into the Uncanny Valley. Yeah, I'd like to say it sits somewhere between WALL-E, your sort of cute robot, and the T-1000 from Terminator 2.

Starting point is 00:04:09 Yeah. So what's the current thinking on the psychology? What's going on here? Why do people find these sort of human, but not completely human, robots a bit, you know, a bit iffy, a bit creepy? I think it's because, well, from birth, we're able to detect faces and we're able to analyze faces. And faces play such an important part in our communication.

Starting point is 00:04:41 And when we start to see things that shouldn't be there, or are out of place, we do get that feeling of, in the Uncanny Valley it's called repulsion, but I guess I call it negative feedback. Like, it's unnatural feedback. And one of the recent arguments that has come to light is that this is starting to also occur in facial enhancement surgery, so people who have their lips kind of

Starting point is 00:05:09 enhanced and things like that. This can be considered as sort of the higher realms of the Uncanny Valley. If I was to build a robot and it had these enhancements, and I said, oh, I'm trying to make it as real as possible, people might say, well, it doesn't look completely real because you've added these enhancements. So on a kind of perceptual level, you would consider that as the higher realms of the Uncanny Valley.

Starting point is 00:05:36 There are other types of Uncanny Valley as well. It's not just appearance, it's in functionality as well, the way things move, the way robots move. If a robot doesn't move the way we expect it to move, then again, that gives that feeling of unnaturalness and uneasiness, and that is kind of the emphasis of the Uncanny Valley effect. Yeah, I remember there's the, what's he called, the Atlas robot, and I just thought that was amazing and really fascinating, but other people, when they're sort of pushing him over and he's recovering his balance, you know, some people are saying, oh, you know, that thing's going to turn on you. Yeah, that's really just because it looks and behaves like a human, and because the human drive is, if we

Starting point is 00:06:20 see something that looks and behaves anything like a human, we automatically start to assume it must be able to feel and think and have emotions like a human, when it doesn't. So it's that drive again, that kind of innate drive. Yeah, that's very interesting. So moving on to the role you play in this. So you focus on matching facial movements to speech. So why is that important? Why does that play such an important role in this effect? Well, this all started from the Uncanny Valley theorem.

Starting point is 00:07:01 And the two key areas in the Uncanny Valley theorem are the eyes and the mouth. And when we communicate, our attention goes between the eyes and the mouth. We look at the eyes to gauge attention and we look at the mouth for speech reading, for understanding. And with robots particularly, anything that is outside the realm of natural lip movements, when the speech is coming out so perfectly, can be confusing and disorientating, especially if you interact with it for a length of time. I think the obvious choice is one of the recent Star Wars remakes, when they did a CGI character and the lip synchronisation was kind of off.

Starting point is 00:07:49 So yeah, that's where this project started, really. It started off with: how can I take the systems that are used in CGI animation and games to turn speech into something called visemes, which are kind of the lip positions, and create that for a robot? How can I turn this into a robot? Right. That's where that started, really. So when I was first doing this project, I was actually helping teach in the animation department at the time, because the previous university I was at didn't quite have a robotics department. So that's where these ideas started to come together, because they use programs.

Starting point is 00:08:40 There's one called Oculus, which basically takes speech and converts it into a CGI mouth with lip positions. So it automatically reads speech and extracts the visemes for the mouth positions. And I wanted to do that with the robot. So to start with, I created a robot mouth. The robot mouth was modelled on the human mouth, but before I did that, I looked at previous robotic mouth systems to see what was missing. And that was really important, just to be able to see

Starting point is 00:09:11 what are the key muscles, what muscles work together, what can be left out of this mouth. Obviously, it's a very small area, and you're kind of confined in what you can actually put into a robotic mouth. One of the key things that was missing was something called the buccinator muscles, which are the muscles at the corners of the mouth, not the cheek muscles, but the corners of the mouth. And they are used for pursing and stretching the lips when we create vowel and consonant sounds.

Starting point is 00:09:40 So I replicated these muscles and I created this kind of robotic mouth prototype, and I thought, right, the next stage is to create an application that can take these lip shapes and put them into this robotic mouth. So we use something called a viseme chart, and it's something that's used a lot in CGI and game design, which is basically a list of sounds, word sounds and letter sounds, and the matching mouth shape. And I made my robot do these shapes. So for each, like the A's, R's and O's, all these robotic mouth positions I collected and saved into a configuration file, to be able to bring them out later and use them.
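
The viseme chart and configuration file described here can be pictured as a small lookup table saved to disk. The sketch below is a minimal illustration in Python, not the system from the interview: the viseme names, the actuator names (`jaw`, `lip_corners`, `lip_purse`) and every number are assumptions made up for the example.

```python
import json

# Hypothetical viseme chart: each mouth shape maps to normalised actuator
# targets between 0.0 and 1.0. Names and values are illustrative only.
VISEME_CHART = {
    "rest": {"jaw": 0.0, "lip_corners": 0.5, "lip_purse": 0.0},
    "AA":   {"jaw": 0.9, "lip_corners": 0.6, "lip_purse": 0.0},
    "EE":   {"jaw": 0.3, "lip_corners": 1.0, "lip_purse": 0.0},
    "OO":   {"jaw": 0.4, "lip_corners": 0.2, "lip_purse": 1.0},
    "MBP":  {"jaw": 0.0, "lip_corners": 0.5, "lip_purse": 0.3},
}


def save_chart(chart, path):
    """Write the chart to a JSON configuration file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(chart, f, indent=2)


def load_chart(path):
    """Read a previously saved chart back from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def pose_for(chart, viseme):
    """Return the stored pose, falling back to the resting mouth."""
    return chart.get(viseme, chart["rest"])
```

Saving the poses once and looking them up by name later keeps the slow part (posing the mouth by hand) separate from the fast part (choosing a pose while speech is playing).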

Starting point is 00:10:30 The next part was how to create a system that can handle speech. Now, previously in the other applications, the speech was kind of a secondary thing. You spoke and then you put it into a file, into the application, and it read the file. I needed to do it live. There was no room to have some processing time, because if you use processing time, then this idea of speech becomes unnatural.

Starting point is 00:10:55 And the conversation, you know, there's lots of huge pauses in the conversation, which is unnatural. So what I did is I created a machine learning algorithm, and I was able to take speech synthesis, which is robotic speech like you have on Siri and various other applications, take that speech synthesis out of a laptop and put one end of it into something called a microprocessor and turn that audio data back into numerical data. And the other part of it also went back into a processing system, so I can actually see the sound wave like you see in a recording studio.

Starting point is 00:11:40 And then what I did is I created a machine learning algorithm that could recognize patterns in the incoming speech. And that was done not by monitoring the speech itself as such, but the patterns in the waveform. So you're looking at kind of the pixel sound and the length of each word and each sound. And then basically feeding the system a bunch of samples, so it kind of knew what it was looking for.

Starting point is 00:12:09 And when it came across it, it was able to transfigure the robot mouth system to match to the positions that I matched on the chart. And that worked surprisingly well. And then the next thing was, it was the voice, the voice patterning system, which is syllables.
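
The interview does not say which machine learning algorithm was used, so the following is only a sketch of the general idea: summarise each chunk of waveform with a couple of numbers, feed the system labelled samples, then match new chunks to the closest known pattern. It swaps in a nearest-centroid classifier for simplicity. The two features (peak amplitude and the fraction of the chunk that is non-silent), the threshold, and the toy `tone` generator are all assumptions for illustration.

```python
import math


def features(chunk, threshold=0.05):
    """Summarise a chunk of samples as (peak amplitude, active fraction)."""
    if not chunk:
        return (0.0, 0.0)
    peak = max(abs(s) for s in chunk)
    active = sum(1 for s in chunk if abs(s) > threshold) / len(chunk)
    return (peak, active)


def train(labelled_chunks):
    """Average the features of each label's samples into one centroid."""
    sums = {}
    for label, chunk in labelled_chunks:
        f = features(chunk)
        total, count = sums.get(label, ((0.0, 0.0), 0))
        sums[label] = ((total[0] + f[0], total[1] + f[1]), count + 1)
    return {label: (t[0] / n, t[1] / n) for label, (t, n) in sums.items()}


def classify(centroids, chunk):
    """Return the label whose centroid is closest to the chunk's features."""
    f = features(chunk)
    return min(centroids, key=lambda label: math.dist(f, centroids[label]))


def tone(amplitude, n_active, n_total=100):
    """Toy waveform: a sine burst of n_active samples followed by silence."""
    return [amplitude * math.sin(0.3 * i) if i < n_active else 0.0
            for i in range(n_total)]


# Train on a few labelled samples, then classify an unseen chunk.
centroids = train([
    ("AA", tone(0.9, 80)), ("AA", tone(0.8, 90)),
    ("MBP", tone(0.1, 20)), ("MBP", tone(0.15, 10)),
])
label = classify(centroids, tone(0.85, 85))
```

The returned label would then be used to look up a stored mouth position. A real system would need richer features and far more samples than this toy example.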

Starting point is 00:12:29 So obviously when you talk, your jaw moves up and down to the syllables. And that was kind of the next stage, to create this patterning system so that, if there was no sound, the mouth was shut,

Starting point is 00:12:40 the louder the sound, the wider the robot mouth. And then there were tongue positions to include as well. And then when I actually put it all together, it was pretty amazing to see it work. It was, like we were talking about, the Uncanny Valley. I think it was one of the first times I actually sat with a robot, and it was very strange to see because, you know, you see all these weird parts working together.
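
The loudness rule described above (no sound, mouth shut; the louder the sound, the wider the mouth) can be sketched as a mapping from the loudness of each short frame of audio to a jaw opening. This is a minimal illustration under assumed values: the `silence` and `loud` thresholds, the frame size, and the linear mapping are choices made for the example and are not details from the interview.

```python
import math


def rms(frame):
    """Root-mean-square loudness of one frame of samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def jaw_opening(frame, silence=0.02, loud=0.5):
    """Map loudness to a jaw opening from 0.0 (shut) to 1.0 (fully open)."""
    level = rms(frame)
    if level <= silence:
        return 0.0  # no sound: keep the mouth shut
    return min(1.0, (level - silence) / (loud - silence))


def jaw_track(samples, frame_size=160):
    """Split a waveform into frames and return one jaw opening per frame."""
    return [jaw_opening(samples[i:i + frame_size])
            for i in range(0, len(samples), frame_size)]
```

Calling `jaw_track` on a stream of samples gives one opening value per frame, which could be sent to a jaw actuator alongside the lip shape chosen for that moment.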

Starting point is 00:13:06 But yeah, it was good. Once it was configured and the system was trained, it was really quite accurate in some respects. In the lip synchronisation, it was very accurate. In some other parts, it wasn't. But it held up pretty well in the evaluation against existing robots. Yeah, so for those who haven't seen your work, your robot, it's a pretty realistic-looking head.

Starting point is 00:13:37 And it's a head of an older, an older gent. How did you go about choosing your character for your robot? I just find that really interesting. Well, there's actually two robots in the experiment. There was an old-looking one and a younger-looking robot. The younger-looking robot doesn't get as much attention, because I think the older-looking robot looks more realistic, but they were both kind of produced with the idea of being

Starting point is 00:14:02 one was a younger version of the older one so they were both kind of the same robot and when I was doing the tests because the mouth test was part of a wider test which involved lots of different things like eyes and personality so I wanted to compare how people interacted with an older-looking robot and a younger-looking robot. And I had two sample groups.

Starting point is 00:14:27 I had a sample group of older people and younger people. And what I found is that younger people prefer to interact with the younger robot and the older people prefer to interact with the older-looking robot. And there were also personalities as well. So I had to design an older personality and then a younger personality. So I thought, well, I'm quite young, so I'll build the younger personality on myself, so my interests. And then I thought, well, I know my dad pretty well, and he's kind of old, so I've modelled the old one on him. So I had one that was

Starting point is 00:14:56 kind of interested in what I'm interested in, and one that was interested in snooker and John Smith's. So it was big. That's really interesting. Have there been any sort of big studies done on what the public, or people who are going to be interacting with these robots, would like them to look like? I'm not too sure about robots. There certainly was in CGI characters, but I actually wrote a paper just on this subject, which was designing robots,

Starting point is 00:15:25 and I call it embodied artificial intelligence, which is the personality of robots. And it's really fascinating. Actually, there was a robot called BINA48, which was modelled on somebody. It was supposed to act like a vessel for her. So it's like a collection of her memories and life experiences. But in terms of actual academic research, there was very little to go on.

Starting point is 00:15:53 One interesting thing I'm starting to really realise now is that there's been a huge movement away from academia into the private sector. So we have, like, Hanson Robotics and Sophia. And even in England we have Engineered Arts, and they have their robots, their humanoid robots. And in Japan they have the Geminoid series, and Russia have a new one called Promobot, which again is realistic humanoid robots for things like desk assistants and receptionists and things like that. So yeah. Yeah, so just going into the nuts and bolts of your work. So you say there you've got lips, teeth, tongue, jawbones, different facial actuating muscles. So what's it actually made of? It was all 3D printed. Because it was

Starting point is 00:16:49 a rapid prototype and there were so many different versions of it, the whole system was 3D printed, but then some parts of it couldn't really stand up to the pressures of the mouth working all the time, so I had to have them CNC'd in a special aluminium composite, which is a very thin, very light material. I think eventually what I am hoping to do is be able to publish all this online, kind of open source, and let people create their own prototypes and expand

Starting point is 00:17:20 on the system, because it has a high accuracy but it's not totally accurate, and so there's still work there to be done. Because I've kind of moved on now to other stuff, you know, I want to kind of leave it to the public, well, engineers and roboticists who are interested in that, to kind of expand on. Yeah,

Starting point is 00:17:37 so what was the design process like then? What was your sort of starting point and your initial goal? My initial goal was to replicate the human mouth as closely as possible. The speech synthesis was difficult to deal with because we don't have accurate speech synthesis. And I don't think it's ever really going to sound truly human. Because human speech is so variable.

Starting point is 00:18:06 I think that's why my system works so well. Because with speech synthesis, you can control that. With human speech you can't. So if I was to speak into my machine learning application and try and get the robot to replicate it, it's not going to do that. Speech synthesis is very controlled. It's not totally controlled, but there are limitations to it and you can kind of work within these parameters to get really good results. So the other interesting point, and the reason why I designed it like I designed it, was because I knew from my experience that the humanoid robots previously out there, like Sophia, do not use these kinds of technologies.

Starting point is 00:18:44 They simply have random jaw movements to sound. And sometimes they do it very well. They tend to do it very quickly, so it's hard to see it exactly. So when you do things quickly and the speech is kind of at its normal pace, then there is a little bit of scope there for almost, it tricks the human brain.

Starting point is 00:19:10 If the lips are going slower, then, you know, you kind of see that. But if things are going faster, you tend not to notice too much. And I really wanted to see if I could improve all this. So from my studies, I was able to determine that using things like machine learning is a lot more accurate and definitely the way to go to be doing these things, rather than just randomized lip movements and positions and things.

Starting point is 00:19:43 Yeah, that's really interesting, because going back to you saying about CGI and video games, I've recently noticed, I don't know if you're familiar with it, but I really like Demon's Souls and Dark Souls, those games. And they recently did a sort of revamp of Demon's Souls, which is quite old, for the PlayStation 5. And one of the things that was vastly improved was the synchronisation of the characters' mouths as they were speaking. It looked so much more natural than previously,

Starting point is 00:20:12 where it was sort of like a badly dubbed, you know, 80s movie or something. Yeah. So is that similar to the stuff you've been working on previously? Yeah, you've pretty much hit the nail on the head. But I'd also say that I imagine at the time when you were playing them video games the first time around, you might not have noticed that as much.

Starting point is 00:20:33 Or if you did notice it, you kind of thought, well, that's still really good, still a really good attempt. But with humanoid robotics, it's different because they're in front of you. They're there. CGI characters, they get away with a lot because they aren't there. When you have a robot in a room in front of you, there are very few hiding places for these things,

Starting point is 00:20:53 and you're able to really pick out things that are going wrong and things that are unnatural. And that's one of the things that really came about in my studies, how people have this kind of inbuilt ability to recognize things that are not quite right, and what you might think is a tiny thing can actually give the whole game away. Especially when we're considering,

Starting point is 00:21:14 I wrote another kind of model, an idea called the Multimodal Turing Test, which is also known as the Westworld test, which is basically when you create a robot and it gets to the point where you can no longer tell the difference between the robot and the human. Things like that. And what goes into that as well,

Starting point is 00:21:35 so it was kind of a model that's based on, like, a triangle, a hierarchy. And the closer you get to the top, the harder it is, of course, you know, to actually get these nuances, things like lip synchronization. Pupil dilation is another area I've worked on, robotic pupil dilation. It's these tiny nuances that play a huge part in it, because these are the things that give the game away, you know, and things like facial tics or whatever, just these tiny nuances that we don't even realize are important in a conversation suddenly become crucial. Yeah, that's really interesting. So sort of, yeah,

Starting point is 00:22:16 Westworld, I'd forgotten about that, actually. I really enjoyed that. So what are the potential applications of this type of work? What's the end goal? What do we want to do with it? For me, I always use the example of Data from Star Trek as the perfect example for this, because Data, he acts as this very humanistic interface between lots of different things. He acts as an interface between people and aliens. So obviously aliens that don't speak English, and he acts as a translator, but not only that, he also acts as an interface between things like the computer and a person. So things that would be very difficult, calculations that would be very difficult,

Starting point is 00:23:02 he is able to translate that information and give it in a very simplified way, in a very humanistic way, with emotion, with facial expressions, and that's where I think this technology will eventually head. I mean, we have to remember that not everybody can interact with technology effectively. We're very privileged, I think, to have grown up with technology and to be able to use technology, but there's lots of people in the world who don't have that. And creating something like a humanoid robot would allow them to integrate with technology a lot more naturally.

Starting point is 00:23:33 So that's another kind of use. I always think the Data example is a really good one, rather than the Terminator stuff. Yeah, the friendlier end of the spectrum. Yeah. So you mentioned there, like, alien Data, sorry, translating alien languages and things. Is there, so, I only know,

Starting point is 00:23:53 I know there's some work in Japan that I've seen, but are there any sort of differences between different languages in this sort of stuff? Yes, definitely. Pronunciations, and even regional dialects like my oxer accent, would be a huge factor in this. Again, I think that's why having a machine learning algorithm is the way to go, because these are the sort of things you can train the system on. So yeah, it's very interesting.

Starting point is 00:24:23 It's something, again, I think I'd be really interested in looking at later on, to see what the influences of language and dialects and accents and things are. So what do you think the time frame is for this sort of thing? Playing the long game, when are we going to be seeing, like you say, an interface? When am I going to have one in my home that can, you know, help, say, I don't know, maybe I'm elderly or I'm disabled or something, that sort of, you know, I don't want to say robot butler, but you know what I mean?

Starting point is 00:24:58 Well, you might be in luck, because Hanson Robotics have announced, it was only a few weeks ago, that they're mass rolling out the Sophia robot. So that's their aim for 2021, 2022, is to start rolling out this Sophia model. But I'd argue that how useful that would be is massively up for debate, because Sophia is actually semi-autonomous, not fully autonomous. So there's going to be certain things she can't do, that you're going to have to do for her. And I think it might be too early to start even thinking about distributing these humanoid robots on a mass scale, at least until they can start doing things fully by themselves and without any human aid.

Starting point is 00:25:45 And even then, you're still going to have to get it past all the ethicists. There's a lot of really good work done in AI ethics and robotic ethics. So yeah, it's really hard to say. I think it's a long way away. But at the same time, there's lots of good research going on at the moment, which is also pushing it forward. So it's very difficult to give you an answer to that. Yeah, yeah, that's been great. So you mentioned earlier that now you're moving on to new projects.

Starting point is 00:26:14 So I just wanted to ask you, you know, what are you hoping to work on next? What are your plans for the next few years? Well, at the moment, I'm working on what's called, we're calling, visually enhanced common sense language models, which is basically allowing robots to use some level of human common sense. So an example of this would be if I had a robot and I had a vision system and I asked it to find a pen

Starting point is 00:26:40 on a table, it could do that, no problem, because it has object recognition and it could recognise a pen. But if the pen was in a drawer, say in a kitchen, and you asked it to find the pen, it would spend all day going around looking for something but not ever opening the drawer.
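
The pen example can be sketched as a search plan that falls back on common-sense facts when vision alone fails. This is a minimal illustration, not the language model being described: the `USUALLY_KEPT_IN` table, the function name and the action strings are hypothetical, and the facts in the table are hand-written for the example.

```python
# Hypothetical common-sense facts: where an object is usually kept.
USUALLY_KEPT_IN = {
    "pen": ["drawer", "desk"],
    "clothes": ["wardrobe"],
    "milk": ["fridge"],
}


def search_plan(target, visible_objects, containers):
    """Return an ordered list of actions for finding `target`.

    If the object is already in view, just pick it up. Otherwise open the
    containers where it is usually kept first, then any remaining ones.
    """
    if target in visible_objects:
        return [f"pick up {target}"]
    likely = [c for c in USUALLY_KEPT_IN.get(target, []) if c in containers]
    others = [c for c in containers if c not in likely]
    return [f"open {c} and look for {target}" for c in likely + others]


# The pen is not in view, so the plan starts with the drawer.
plan = search_plan("pen", {"cup", "plate"}, ["cupboard", "drawer"])
```

A robot with only object recognition stops at the first branch. The second branch is where knowledge such as "pens are kept in drawers" changes what it does next.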

Starting point is 00:26:56 So this idea of common sense knowledge would be giving the robots some ability to know that pens are kept in drawers, clothes are kept in a wardrobe. And these are things that are missing. So it's like a cross between language and vision. There's a crossover in what we call common sense knowledge. So that's what I'm working on at the moment. And we're currently developing a robot to help people with cooking tasks. So it's like a robot chef to use in a kitchen. But you're

Starting point is 00:27:26 able to ask it things and do things which you can't normally do with things like the Amazon Echo. For example, if you ask that to give you a recipe or help you cook, it gives you it in one solid block and it just reads the whole thing out. And this would be more intuitive. It would be more like an information giver, information follower kind of construct. But also with lots of common sense knowledge bases embedded in there. So you'd be able to ask lots of things like, I don't have a certain ingredient.

Starting point is 00:27:54 Is there another ingredient I could use instead of this? And it would be able to do that as well. Thanks for listening. And if you've enjoyed this episode, please do leave us a review. This podcast was brought to you by the team behind BBC Science Focus magazine. In the March issue, which is on sale now, we talked to Tim Berners-Lee about whether we can make the internet great again.

Starting point is 00:28:16 We look at the experiment looking to bring hallucinogenic drugs to the NHS, and we dive into plans to build a city on Mars. And of course, there's much, much more inside, and on our website, sciencefocus.com. Thank you for listening to the Science Focus podcast from the BBC Science Focus magazine team, with the UK's best-selling science and technology monthly, available in print and in several digital formats throughout the world. Find out more at sciencefocus.com or look out for us in your app store.

