Podcast: Why realistic human robots need to learn to lip-sync

Podcast transcript: Dr Carl Strathearn on realistic humanoid robots

Read the transcript of our Science Focus Podcast episode with Dr Carl Strathearn below.

Jason Goodyer: Hello and welcome to the Science Focus podcast. I’m Jason Goodyer, Commissioning Editor at BBC Science Focus magazine.


In this week’s episode of the Science Focus Podcast, I speak to Dr Carl Strathearn, a research fellow at the School of Computing at Edinburgh Napier University. He’s currently conducting research on realistic humanoid robots, specifically on more realistically synchronising their speech and mouth movements.

So, just by way of background, one of the big talking points, or maybe even one of the driving factors, behind research on realistic humanoid robots is this so-called uncanny valley effect.

Carl Strathearn: Yes. The uncanny valley is a point where things like humanoid robots and CGI characters start to give us an eerie feeling. And the reason for that is because they are not perfect representations of humans; they never quite get there. They evoke these feelings of terror, unease and unfriendliness. And that's the uncanny valley.

It's a perceptual dip, a point between being alive and being dead. Basically, it's this idea of being a zombie, in between the two, and humanoid robots and CGI characters fall into the uncanny valley because they inhabit some of the qualities of a zombie.

Jason Goodyer: So what's the current thinking on the psychology? Why do people find these sort-of-human but not completely humanoid robots a bit iffy, a bit creepy?

Carl Strathearn: I think it's because from birth we're able to detect and analyse faces, and faces play such an important part in our communication. When we start to see things that shouldn't be there, things out of place, we get that feeling of repulsion. I call it negative feedback, like it's unnatural feedback.

And one of the arguments that has recently come to light is that this is also starting to occur with facial enhancement surgery. So people have their lips enhanced and things like that, and this can be considered as sort of the higher realms of the uncanny valley. If I were to build a robot and it had these enhancements and I said, 'I'm trying to make it as real as possible', people might say, 'well, it doesn't look completely real, because you've added these enhancements'.

There are other types of uncanny valley as well. It's not just appearance; it's in functionality too, the way robots move and so on. If a robot doesn't move the way we expect it to move, that again gives that feeling of unnaturalness and uneasiness.

Jason Goodyer: I've seen that with the Atlas robot. There was a video of the researchers pushing it over to get it to recover its balance, and I just thought that was really fascinating. And some people were writing comments saying, 'oh, you know, that thing's going to turn on you'.

Carl Strathearn: That's purely because it looks and behaves like a human. If we see something that looks or behaves anything like a human, we automatically start to assume it must be able to feel and think and have emotions like a human, when it doesn't. So it's that drive again.

Jason Goodyer: So moving on to the role you play in this: you focus on matching facial movements to speech. Why is that important? Why does it play such an important role in this effect?

Carl Strathearn: Well, this all started from the uncanny valley theory, and the two key areas in that theory are the eyes and the mouth. When we communicate, our attention goes between the eyes and the mouth: we look at the eyes for attention and we look at the mouth for speech reading, for understanding.

And with robots particularly, anything that is outside the realm of natural lip movement can be confusing and disorienting, especially if you're interacting over a certain amount of time. The obvious example is in one of the recent Star Wars movies: when they did a CGI character, the lip synchronisation was kind of off. But that's where this project started, really.

It started off with systems used in CGI animation and games that turn speech into something called visemes, which are essentially the lip positions. How can I take that software and recreate it for a robot? That's where I started, really.

When I was first doing this project, I was actually helping to teach in the animation department, because the previous university I was at didn't quite have a robotics department.

So that's where these ideas started to come together, because they use programmes like Oculus Lipsync, which basically takes speech and converts it into a CGI mouth with lip positions. It automatically reads speech and extracts the visemes for the mouth positions, and I wanted to do that with a robot.

So to start with, I created a robot mouth modelled on the human mouth. But before I did that, I looked at previous robotic mouth systems to see what was missing. And that was really important, to be able to see what the key muscles were, which muscles work together, and what could be left out of this mouth.

Obviously it's a very small area, and you are confined in what you can actually fit into a robotic mouth. One of the key things that was missing was the buccinator muscles, which are the muscles at the corners of the mouth, not the cheek muscles. They're used for pursing and stretching the lips when we create vowel and consonant sounds.

So I replicated these muscles and created a robotic mouth prototype. And I thought, right, the next stage is to create an application that can take these lip shapes and put them into this robotic mouth. So we used something called a viseme chart, which is used a lot in CGI and game design, and which is basically a list of sounds and their matching mouth shapes.

And I made my robot do these shapes, for the Ahs, Rs and Oos, all these robotic mouth positions, and I collected them and saved them into a configuration file so I could bring them out later and use them. The next part was how to create a system that can handle speech.
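As a rough illustration of the configuration-file idea: a viseme chart can map each sound class to a set of mouth positions that are saved once and reloaded at runtime. This is only a sketch; the viseme names, servo channels and angles below are hypothetical, not the values used in the project.

```python
import json

# Hypothetical viseme chart: each sound class maps to a set of servo
# angles (in degrees) for the jaw, lip corners and tongue.
VISEME_CHART = {
    "AH": {"jaw": 35, "lip_corners": 10, "tongue": 0},   # open vowel
    "OO": {"jaw": 15, "lip_corners": 40, "tongue": 0},   # pursed lips
    "R":  {"jaw": 10, "lip_corners": 25, "tongue": 20},  # tongue raised
    "M":  {"jaw": 0,  "lip_corners": 5,  "tongue": 0},   # closed lips
}

def save_visemes(path):
    """Save the calibrated mouth positions to a configuration file."""
    with open(path, "w") as f:
        json.dump(VISEME_CHART, f, indent=2)

def load_viseme(path, sound):
    """Bring a stored mouth position back out for use at runtime."""
    with open(path) as f:
        return json.load(f)[sound]

save_visemes("mouth_config.json")
print(load_viseme("mouth_config.json", "OO"))  # {'jaw': 15, 'lip_corners': 40, 'tongue': 0}
```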

Now, previously, in the other applications, the speech was kind of a secondary thing: you spoke, and then you put the recording into the application and it read the file. I wanted to do it live. There was no room for processing time, because if you use processing time, the speech becomes unnatural; you get lots of huge pauses in the conversation, which is unnatural.

So what I did was take speech synthesis, which is robotic speech like you have on Siri and various other applications, take that speech synthesis out of the laptop, and put one end of it into something called a microprocessor to turn that audio data back into numerical data. Part of it also went into a processing system, so I could actually see the sound wave, like you see in a recording studio.

And then I created a machine learning algorithm that could recognise patterns in the incoming speech. That was done not by monitoring the speech itself as such, but the patterns in the waveform. So you're looking at the pixel size and the length of each word and each sound, and then basically feeding the system a bunch of samples, so it knew what it was looking for.
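A toy version of that idea, matching an incoming chunk of audio to stored samples by waveform shape (duration and amplitude envelope) rather than by recognising the words themselves, might look like this. The templates and sample values are invented for illustration, not taken from the actual trained system.

```python
# Toy waveform matching: summarise each chunk of audio as a coarse
# amplitude envelope, then match it against stored sample envelopes.

def envelope(samples, bins=4):
    """Summarise a waveform as a coarse amplitude envelope."""
    step = max(1, len(samples) // bins)
    return [max(abs(s) for s in samples[i:i + step] or [0])
            for i in range(0, len(samples), step)][:bins]

# "Training" data: envelopes of known sounds, fed in beforehand.
TEMPLATES = {
    "AH": [0.2, 0.9, 0.8, 0.3],
    "OO": [0.4, 0.5, 0.5, 0.4],
    "M":  [0.1, 0.2, 0.1, 0.1],
}

def classify(samples):
    """Return the stored sound whose envelope is closest to the input."""
    env = envelope(samples)
    return min(TEMPLATES, key=lambda t: sum(
        (a - b) ** 2 for a, b in zip(env, TEMPLATES[t])))

print(classify([0.1, 0.2, 0.8, 0.9, 0.7, 0.8, 0.2, 0.3]))  # AH
```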

And when it came across one, it was able to drive the robot mouth system to match the positions I had matched on the chart, and that worked surprisingly well. The next thing was what I call the voice patterning system, which is about syllables: obviously, when you talk, your jaw moves up and down with the syllables.

And that was the next stage, to create this patterning system. If there was no sound, the mouth was shut; the louder the sound, the wider the robot mouth opened. Then there were tongue positions to include as well, and then I put it all together. It was pretty amazing to see it work.
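The voice-patterning rule, mouth shut in silence and opening wider with loudness, boils down to a simple mapping. A minimal sketch, with made-up calibration values:

```python
# Sketch of the voice-patterning rule: no sound means the mouth is
# shut; the louder the sound, the wider the jaw opens. The 40-degree
# maximum is a made-up calibration value.
MAX_JAW_DEGREES = 40

def jaw_angle(amplitude, max_amplitude=1.0):
    """Map loudness to jaw opening in degrees."""
    if amplitude <= 0:
        return 0  # silence: mouth stays shut
    level = min(amplitude / max_amplitude, 1.0)
    return round(MAX_JAW_DEGREES * level)

# Each syllable's loudness peak drives the jaw up and down.
print([jaw_angle(a) for a in [0.0, 0.5, 1.0, 0.2]])  # [0, 20, 40, 8]
```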

When we talk about the uncanny valley, I think it was one of the first times I actually sat with a robot, and it was very strange to see, because you see all these weird parts working together. Once it was configured and the system was trained, it was really quite accurate in some respects. In the lip synchronisation it was very accurate; in some other parts it wasn't. But it held up pretty well in the evaluation against existing robots.

Jason Goodyer: So for those who haven't seen your work, your robots, I'd say it's a pretty realistic-looking head, the head of an older gent. How did you go about choosing your character for your robot? I just find that really interesting.

Carl Strathearn: Well, there are actually two robots in the experiment: an older-looking one and a younger-looking robot. The younger-looking robot doesn't get as much attention, because I think the older robot looks more realistic. But they were produced with the idea that one was a younger version of the other, so you have kind of the same robot.

And when I was doing the tests, because the mouth test was part of a wider test, which involved lots of different things like eyes and personality, I wanted to compare how people interacted with an older-looking robot and a younger-looking robot. I had two sample groups, one of older people and one of younger people. And what I found is that the younger people preferred to interact with the younger robot and the older people preferred to interact with the older-looking robot.

There were also personalities as well. I had to design an older personality and a younger personality. So I thought, well, I'm quite young, so I'll base the younger personality on myself and my interests. And I thought, well, I know my dad pretty well, and he's kind of old, so I modelled the older one on him. So I had the younger one interested in what I'm interested in, and the older one interested in snooker and John Smith's.

Jason Goodyer: Have there been any big studies done on what the public, or the people who are going to be interacting with these robots, would like them to look like?

Carl Strathearn: I'm not too sure about robots. There certainly have been for CGI characters, but I actually wrote a paper just on this subject, on designing robots and what I call embodied artificial intelligence, which is the personality of robots. And it's really fascinating. There was also a robot called Bina48, which was modelled on somebody; it's supposed to act like a vessel for her, like a collection of memories and life experiences.

But in terms of actual academic research, there's very little to go on. One interesting thing I'm really starting to realise now is that there's been a huge movement away from academia into the private sector. So we have Hanson Robotics and Sophia; in England we have Engineered Arts and their humanoid robots; in Japan they have the Geminoid series; and Russia has a new one called Promobot, which again is a realistic humanoid robot for things like desk assistants and receptionists.

Jason Goodyer: So just going into the nuts and bolts of your work: it's got lips, teeth, a tongue, jawbones, different facial actuating muscles. What's it actually made of?

Carl Strathearn: It was all 3D printed, because it was rapid prototyping and there were so many different versions of it, so the whole system was 3D printed. But some parts of it couldn't really stand up to the pressures of the mouth working all the time, so I had to have them CNC'd in a special aluminium composite, which is a very thin, very light material.

Eventually what I'm hoping to do is publish all this online, open source, and let people create their own prototypes and expand on the system, because it has a high accuracy, but it's not totally accurate. So there's still work to be done, but I've kind of moved on now to other stuff. I want to leave it to the public, well, to the engineers and roboticists who are interested, to expand on.

Jason Goodyer: So what was the design process like, then? What was your starting point and your initial goal?

Carl Strathearn: My initial goal was to replicate the human mouth as closely as possible. And the speech synthesis was difficult to deal with, because we don't have accurate speech synthesis. I don't think it's ever really going to sound truly human, because human speech is so variable. I think that's why my system works so well: with speech synthesis you can control that, but with human speech you can't. So if I was to speak into my machine learning application and try to get the robot to replicate it, it's not going to do that.

Speech synthesis is very controlled. It's not totally controlled, but there are limitations to it, and you can work within these parameters to get really good results. One of the interesting points, and the reason why I designed it the way I did, is that I knew from my experience that the humanoid robots out there, like Sophia, do not use these kinds of technologies. They simply have random jaw movements to sound, and sometimes they do it very well, and they tend to do it very quickly.

So it's hard to see exactly what's happening. When you do things quickly and the speech is at its normal pace, there's a little bit of scope there; it almost tricks the human brain. If the lips are going slower, you kind of see it, but if things are going faster, you tend not to notice it too much.

And I really wanted to see if I could improve on all of this. From my studies, I was able to determine that using things like machine learning is a lot more accurate, and definitely the way to go for these things, rather than just randomised lip movements and positions.

Jason Goodyer: Going back to what you were saying about CGI in video games, I don't know if you're familiar with them, but I really like Demon's Souls and Dark Souls, those games. And they recently did a remake of Demon's Souls, which is quite old, for the PlayStation 5.

And one of the things that was vastly improved was the synchronisation of the characters' mouths as they were speaking. They looked so much more natural than previously, when it was sort of like a dubbed 80s movie or something. So it's not dissimilar to the stuff you've been working on.

Carl Strathearn: Yeah, that's pretty much hit the nail on the head. But I'd also say that I imagine at the time, when you were playing those video games the first time around, you might not have noticed it as much, and if you did notice it, you kind of thought, well, that's still really good. But with humanoid robotics it's different, because they're in front of you; they're there.

CGI characters get away with a lot because they aren't there. When you have a robot in a room in front of you, there are very few hiding places for these things, and you're able to really pick out the things that are going wrong, the things that are unnatural.

And one of the things that really came out of my studies was how people have this kind of inbuilt ability to recognise things that are not quite right. What you might think is a tiny thing can actually give the whole game away, especially when we're considering another model, an idea called the Multimodal Turing Test, also known as the Westworld test. That's basically when you create a robot and it gets to the point where you can no longer tell the difference between the robot and a human.

It was a model expressed like a triangle, a hierarchy, and the closer you get to the top, the harder it is to get these nuances, things like lip synchronisation. Pupil dilation is another area I've worked on, robotic pupil dilation. It's these tiny nuances that play a huge part, because these are the things that give the game away, things like facial tics. These tiny nuances that we don't even realise are important in a conversation suddenly become crucial.

Jason Goodyer: So what are the potential applications of this type of work? What's the end goal? What do we want to do with it?

Carl Strathearn: I always use Data from Star Trek as the perfect example for this, because he acts as this very humanistic interface between lots of different things: the interface between people and aliens, so obviously aliens that don't speak English, where he acts as a translator. But not only that, he also acts as the interface between things like the computer and a person.

So for things that would be very difficult, calculations that would be very difficult, he's able to translate that information and give it in a very simplified way, but a humanistic way, with emotion, with facial expressions. And that's where I think this technology will eventually head. I mean, we have to remember that not everybody can interact with technology effectively.

We're very privileged, I think, to have grown up with technology and to be able to use it. But there are lots of people in the world who don't have that, and creating something like a humanoid robot would allow them to integrate with technology a lot more naturally. So that's another use. I always think the Data example is a really good one, rather than the Terminator stuff.

Jason Goodyer: I know there's some work in Japan that we've seen, but are there any differences between languages in this sort of stuff?

Carl Strathearn: Yes, definitely. Pronunciations and even regional dialects, like my Yorkshire accent, would be a huge factor. And again, I think that's why having a machine learning algorithm is the way to go, because these are the sorts of things you can train the system on. So, yeah, it's very interesting. It is something I'd be really interested in looking at later on, to see what the influences of language, dialects and accents are.

Jason Goodyer: So what do you think the time frame is for this sort of thing, playing the long game? When are we going to be seeing, like you say, an interface? When am I going to have one in my home, say if I'm elderly or disabled? I don't want to say robot butler, but you know what I mean.

Carl Strathearn: Well, you might be in luck, because Hanson Robotics announced a few weeks ago that they are mass rolling out the Sophia robot. That's their aim for 2021/2022, to start rolling out this Sophia model. But I'd argue that how useful that would be is massively up for debate, because Sophia is actually semi-autonomous, not fully autonomous.

So there are going to be certain things she can't do that you're going to have to do for her. And I think it might be too early to start even thinking about dishing these humanoid robots out on a mass scale, at least until they can start doing things themselves without any human aid. And even then, you've still got to get past all of the ethicists; there's a lot of really good work being done in AI ethics and robot ethics.

So, yeah, it’s really hard to say. I think it’s a long way away. But at the same time, there’s lots of good research going on at the moment, which is also pushing it forward. So it’s very difficult to give you an answer to that.

Jason Goodyer: That's been great. You mentioned earlier that you've now moved on to new projects, so I just wanted to ask: what are you hoping to work on next? What are your plans for the next three years?

Carl Strathearn: Well, at the moment I'm working on what we're calling enhanced common sense language models, which is basically allowing robots to use some level of human common sense. An example of this would be: if I had a robot with a vision system and I asked it to find a pen on a table, it could do that, no problem, because of object recognition; it could recognise a pen. But if the pen was in a drawer, say in a kitchen, and you asked it to find a pen, it would spend all day going around looking for it without ever opening the drawer.

So this idea of common sense knowledge would be giving the robots some ability to know that pens are kept in drawers, clothes are kept in a wardrobe, and that these are the things current systems are missing. So it's like a cross between language and vision.
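In its simplest form, that kind of common-sense knowledge could be represented as priors linking objects to their likely containers, so the robot plans to look inside things instead of only scanning surfaces. This is a deliberately simplified sketch, not the language-model approach described here; the objects and locations are illustrative.

```python
# Minimal common-sense lookup: likely containers for everyday objects.
LIKELY_LOCATIONS = {
    "pen": ["drawer", "desk"],
    "clothes": ["wardrobe"],
    "mug": ["cupboard", "sink"],
}

def search_plan(target, visible_objects):
    """If the target isn't visible, plan to open its likely containers."""
    if target in visible_objects:
        return [f"pick up {target}"]
    return [f"open {place} and look inside"
            for place in LIKELY_LOCATIONS.get(target, [])]

print(search_plan("pen", ["table", "chair"]))
# ['open drawer and look inside', 'open desk and look inside']
```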

There's a crossover in what we call common sense knowledge, and that's what I'm working on at the moment. We're currently developing a robot to help people with cooking tasks, like a robot chef to use in a kitchen, but you're able to ask it things and have it do things which you can't normally do with something like the Amazon Echo, for example.

If you ask that to give you a recipe or help you cook, it gives it to you in one solid block; it just reads the whole thing out. This would be more intuitive, more of an information-giver and information-follower kind of construct, but with lots of common sense knowledge bases embedded in there. So you'd be able to ask lots of things, like: I don't have a certain ingredient, is there an ingredient I could use instead?
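The contrast with reading a recipe out in one solid block could be sketched as a step-by-step assistant with a tiny substitution knowledge base. The class name, recipe and substitutions below are invented for illustration, not part of the actual project:

```python
# Sketch of step-by-step recipe delivery with a small substitution
# knowledge base, instead of reading the recipe out in one block.
RECIPE = ["Dice the onion", "Fry it in butter", "Add the rice and stock"]
SUBSTITUTES = {"butter": "olive oil", "stock": "water with a stock cube"}

class CookingAssistant:
    def __init__(self, steps):
        self.steps = steps
        self.i = 0

    def next_step(self):
        """Give one instruction at a time, waiting for the user."""
        if self.i >= len(self.steps):
            return "All done!"
        step = self.steps[self.i]
        self.i += 1
        return step

    def substitute(self, ingredient):
        """Answer 'I don't have X, what can I use instead?'"""
        alt = SUBSTITUTES.get(ingredient)
        return f"Try {alt}." if alt else f"No substitute known for {ingredient}."

bot = CookingAssistant(RECIPE)
print(bot.next_step())           # Dice the onion
print(bot.substitute("butter"))  # Try olive oil.
```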



Let us know what you think of the episode with a review or a comment wherever you listen to your podcasts.