Conventional dubbing relies on matching speech sounds – phonemes – to the shape of an actor’s lips as he or she speaks, known as visemes – think ‘visual phoneme’. It’s hard work, requiring clever scripting and a skilled voice artist, and audiences are very quick to spot when it’s not quite right.
Taking a different approach to the problem, researchers at Disney Research and the University of East Anglia analysed the movements of the lips during speech, rather than looking just at single static snapshots. These movements are known as ‘dynamic visemes’.
By analysing the sequence of shapes made by the lips and feeding it into a computer, the researchers could automatically build new phrases that perfectly matched the actor's mouth movements – quite literally putting words in their mouth.
It turns out, for instance, that the phrase ‘clean swatches’ is visually the same as ‘dicier mutts’, ‘need no pots’ and ‘like to watch you’.
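Why can such different phrases look identical? Because many distinct speech sounds produce the same lip shape. The toy sketch below illustrates the idea with a handful of made-up viseme classes – these groupings are simplified assumptions for demonstration, not the researchers' actual dynamic-viseme model, which works on movement sequences rather than static categories:

```python
# Toy illustration: many phonemes map to the same lip shape, so
# phrases built from different sounds can look identical on the lips.
# The viseme classes below are simplified, hypothetical groupings.
VISEME_OF = {
    # bilabials all close the lips the same way
    "p": "lips-closed", "b": "lips-closed", "m": "lips-closed",
    # an open vowel
    "a": "open",
    # alveolar consonants are hard to tell apart visually
    "t": "tongue-ridge", "d": "tongue-ridge", "n": "tongue-ridge",
}

def viseme_sequence(phonemes):
    """Map a phoneme sequence to the lip shapes a viewer would see."""
    return [VISEME_OF[p] for p in phonemes]

# 'pat', 'bad' and 'mad' all show the same sequence of lip shapes,
# even though they contain different sounds.
print(viseme_sequence(["p", "a", "t"]))
print(viseme_sequence(["p", "a", "t"]) == viseme_sequence(["b", "a", "d"]))
print(viseme_sequence(["p", "a", "t"]) == viseme_sequence(["m", "a", "d"]))
```

A lip-reader watching any of those three words would see exactly the same thing – which is the ambiguity the dubbing technique exploits.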
“Dynamic visemes are a more accurate model of visual speech articulation than conventional visemes,” says lead researcher Sarah Taylor. “[They] can generate visually plausible phonetic sequences with far greater linguistic diversity.”
With computers doing all the heavy lifting, convincing dubbing should become much easier – so badly dubbed films will have no excuse.
With this method, around 90 per cent of lip movements in speech can be matched to more than one sound, opening up thousands of possible phrases for a single lip sequence.
“This work highlights the extreme level of ambiguity in visual-only speech recognition,” says Taylor.
Although if you were trying to read her lips, there were probably 100,000 other things she could have said.