When the face-swap app ZAO, shown in the screenshot below, went viral on the Chinese Internet, people panicked and wondered whether the whole world would soon be filled with fake faces. That gloomy future never came.
After tens of millions of dollars were spent on generating face-swap videos from an arbitrary face image supplied by ordinary WeChat users, it remains unclear how to monetize the technology in real-world applications. Lawmakers in California moved quickly to counter the fake-face industry, but targeted abuse in the political domain only. So far, unfortunately, the only application making money with face swapping is pornography, which is now rolling out fake videos bearing the faces of Hollywood stars. This business reality dampens technologists' enthusiasm for the powerful tools coming out of the deep learning research community.
A series of natural questions arises here: What about the generation of fake voices in the speech processing domain? Is the technology ready to generate high-fidelity voices that could fool the world? What business opportunities sit on top of fake voice generation?
To answer these questions, it helps to know a brief history of speech synthesis. It has been a core topic in speech processing research and industry for over five decades. While generative methods for synthesized images may be new, synthesized voices are already everywhere: in railway stations, hospitals, even your PS4 video games, almost every place you can imagine. Yet we don't call any of these synthesized voices fake, because we clearly know they are!
These voices never raised public concern, because of the constraints on how the underlying speech data is collected and used. Specifically, these voices are built upon tens of hours of speech from a single voice actor, recorded in a high-standard studio. Traditional speech synthesis, also known as Text-To-Speech (TTS), trains a synthesis model on this massive single-speaker corpus in order to generate an utterance for any given text. Concatenative approaches were the mainstream generative model, and their key challenge is smoothly handling the transitions between phonemes in the output utterance. Nobody really worried that their voice would be copied or abused, because collecting the right data for voice replication was too difficult: you can hardly sit a particular individual in a studio and have them read prescribed transcripts into a microphone!
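To make the concatenative idea concrete, here is a minimal sketch of the core operation: joining recorded units with a short crossfade at each boundary so the transition does not click. The sine-wave "units" below are toy stand-ins for recorded phoneme segments; a production system selects units from a large database and does far more sophisticated boundary smoothing.

```python
import numpy as np

SR = 16000  # assumed sample rate for the toy units

def crossfade_concat(units, fade_ms=10):
    """Concatenate recorded units, blending each boundary with a
    short linear crossfade to smooth the phoneme transition."""
    fade = int(SR * fade_ms / 1000)
    out = units[0]
    for unit in units[1:]:
        ramp = np.linspace(0.0, 1.0, fade)
        # mix the tail of the previous unit with the head of the next
        overlap = out[-fade:] * (1 - ramp) + unit[:fade] * ramp
        out = np.concatenate([out[:-fade], overlap, unit[fade:]])
    return out

# Toy "units": two 100 ms sine bursts standing in for phoneme recordings.
t = np.arange(SR // 10) / SR
a = np.sin(2 * np.pi * 220 * t)
b = np.sin(2 * np.pi * 330 * t)
wave = crossfade_concat([a, b])
```

Each join consumes one fade window, so the result is slightly shorter than the sum of its parts; the quality of a real system hinges on choosing units whose boundaries line up in pitch and spectrum before any such blending.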
Then deep learning came in. In the last few years, a Canadian company called Lyrebird released a few demonstrations that spread widely on the Internet: audio clips in the voices of Donald Trump, Hillary Clinton and other politicians. This was a significant step for the speech synthesis industry, because they used only the speakers' speech data already available on the Internet. Their technology was expected to handle variance along many different axes, including emotion, speaking rate, background noise, recording equipment and more.
From a technical perspective, the celebrity-voice demos may have shocked the world to some extent. But the business around the technology remains far from clearly scoped: it still needs a huge amount of reasonable-quality speech data, given the factors mentioned above, from each target speaker. The approach cannot generalize to new individuals when the system has access to only a few audio clips of their speech. Moreover, building a decent synthesis model for each new target speaker demands a huge amount of resources, weeks of work, say. Such high costs are not affordable, and a business in customized speech synthesis can hardly scale.
The key to a successful customized speech synthesis business lies in three questions: 1) how to generalize to a variety of voices with very different characteristics; 2) how to control the emotions and styles of the voices for a speaker's different expressions; 3) how to minimize cost and lower the bar for speech data collection.
As in other deep learning areas such as computer vision and NLP, embedding is the central methodology for tackling the problems above. Once the system possesses a robust embedding space covering voice characteristics, emotions and styles, the synthesis model can take arbitrary embeddings and generate highly diversified, good-quality speech, even when only a few seconds of the target speaker's audio are available. Again, as an introductory tech blog post, it is impossible and unwise to cover too many details. Hearing is believing! The following demo is based on public data from the Voice Conversion Challenge 2020. Each target speaker has only 70 utterances (in this case, Mandarin) for vocal-feature modelling, while our PVoice system is expected to generate arbitrary utterances in a different language (English).
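The core of the embedding idea can be sketched in a few lines. The toy "speaker encoder" below simply mean-pools acoustic frames into a fixed-size unit vector; real systems (d-vectors, x-vectors) learn this mapping with a neural network, but the property they aim for is the same: two clips from the same speaker land close together in the embedding space, and clips from different speakers land far apart. The frame data here is synthetic and purely illustrative.

```python
import numpy as np

def embed(frames):
    """Map a variable-length sequence of acoustic frames (T x D)
    to a fixed-size speaker embedding: mean-pool, then L2-normalize.
    A learned encoder would replace the mean-pooling here."""
    v = frames.mean(axis=0)
    return v / np.linalg.norm(v)

def similarity(e1, e2):
    # cosine similarity; the embeddings are already unit vectors
    return float(np.dot(e1, e2))

rng = np.random.default_rng(0)
speaker_a  = rng.normal(1.0, 0.1, size=(50, 8))   # toy frames, speaker A
speaker_a2 = rng.normal(1.0, 0.1, size=(30, 8))   # another clip, speaker A
speaker_b  = rng.normal(-1.0, 0.1, size=(40, 8))  # toy frames, speaker B

e_a, e_a2, e_b = embed(speaker_a), embed(speaker_a2), embed(speaker_b)
same_speaker = similarity(e_a, e_a2)   # high: same voice characteristics
diff_speaker = similarity(e_a, e_b)    # low: different voice
```

Once such an embedding is available, the synthesis model is conditioned on it at generation time, which is what lets a system produce a new speaker's voice from only a few seconds of enrollment audio instead of weeks of studio recordings.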
With customized speech synthesis at low cost and high fidelity, a wide spectrum of business opportunities opens up. One possibility is automatic dubbing, which converts videos from one language to another, combined with other technologies for synchronizing the visuals and audio. The following YouTube video, for example, was made by converting a fitness video originally in Mandarin into a new video in English. The goal is to break the language barrier on online platforms such as YouTube. You can read our previous post for more information.
Another possibility is audio books. With our in-house technology, PVoice can generate high-quality audio books, assigning a unique voice to every character in the book. This greatly enhances the fun of an audio book, essentially turning it into a mini audio show rather than a single-voice reading. We have also designed and deployed a complete pipeline supporting largely automatic conversion from an original Chinese text to a complete audio book in a different language, at a much lower cost than the traditional voice acting industry built on human performance. I will probably cover more details of the audio book business in the next post, with demos of different voices for different characters.
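The character-per-voice idea boils down to two steps: attribute each line of text to a character (or to the narrator), then synthesize it with that character's voice. The sketch below illustrates the flow with a hypothetical voice catalogue and a stub in place of the TTS call; the simple `Name: "..."` convention and all names here are made up for illustration, and a real pipeline would use a learned dialogue attributor.

```python
import re

# Hypothetical voice catalogue; a real system would select speaker
# embeddings rather than string IDs.
VOICES = {"narrator": "voice_00", "Alice": "voice_01", "Bob": "voice_02"}

def split_dialogue(text):
    """Split text into (character, line) pairs. Assumed convention:
    `Name: "quoted speech"`; anything else is narration."""
    segments = []
    for line in text.strip().splitlines():
        m = re.match(r'(\w+): "(.*)"', line.strip())
        if m:
            segments.append((m.group(1), m.group(2)))
        else:
            segments.append(("narrator", line.strip()))
    return segments

def synthesize(character, line):
    # Stand-in for a TTS call conditioned on the character's voice.
    return f"[{VOICES.get(character, 'voice_00')}] {line}"

book = '''The rain had stopped.
Alice: "Shall we go?"
Bob: "After you."'''

audio = [synthesize(c, l) for c, l in split_dialogue(book)]
```

In a full conversion pipeline, translation would sit between attribution and synthesis, and the per-character voice assignment is what turns a flat reading into the "mini audio show" described above.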
In short, driven by the growing capability of deep generative models, the speech synthesis industry is moving towards a new generation of systems: better quality, lower data-collection cost, better adaptivity and richer style and emotion diversity. The point is not to make fakes that confuse people, but to generate cheaper and better content to entertain everyone in the world.