Suspect portrait sketching has been widely used in crime fighting, with a long history even before the invention of cameras. Based on descriptions from witnesses, and sometimes guided by psychologists, professional sketch artists illustrate the facial image of a suspect. One example is the suspect in a series of infamous murders in South Korea in the 1980s. The sketch of the suspect is a fairly close match to photos of the culprit, Lee Chun-jae, taken around the time of the crimes; he confessed in late 2019, about 30 years after the murders.
With the emergence and explosive development of deep learning, especially the huge success of image generation models, AI researchers and practitioners are eagerly looking for real-world applications for these technologies. It becomes interesting to apply such technology to enable a new portraying method: converting directly from a person's speech to his/her facial image. This is useful in certain scenarios where there is no witness, only audio recordings of the suspect's speech.
Just a few days ago, we submitted a new preprint to arXiv [1], with a full demo with audio clips and corresponding output face images available here. The following are a few examples of face images generated by our model, called Speech Fusion to Face (SF2F), on the VoxCeleb dataset.
How good are the output face images? A basic conclusion is: a poorly trained automatic facial recognition model could retrieve the speaker, based on the face image SF2F generates from only 5 seconds of his/her speech, out of a candidate pool of 378 people, with a success rate of 36%. This is a huge boost in success rate over the best known solutions in the latest publications on this topic, such as [2] and [3], by a margin as large as 200%!
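To make the evaluation protocol concrete, here is a minimal sketch of retrieval-style scoring: a generated face is encoded, compared by cosine similarity against embeddings of the whole candidate pool, and counted as a success if the true speaker ranks first. The function name and the use of plain numpy arrays are illustrative assumptions, not the paper's actual evaluation code.

```python
import numpy as np

def retrieval_success_rate(generated, gallery, true_ids, k=1):
    """Fraction of queries whose true identity appears among the top-k
    gallery faces ranked by cosine similarity.

    generated: (n_queries, d) embeddings of faces generated from speech
    gallery:   (n_people, d)  embeddings of the candidate pool
    true_ids:  (n_queries,)   gallery row index of each query's speaker
    """
    # L2-normalise so a dot product equals cosine similarity
    g = generated / np.linalg.norm(generated, axis=1, keepdims=True)
    p = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ p.T                            # (n_queries, n_people)
    topk = np.argsort(-sims, axis=1)[:, :k]   # best k candidates per query
    hits = (topk == np.asarray(true_ids)[:, None]).any(axis=1)
    return float(hits.mean())
```

With k=1 this is exactly the "retrieve the right speaker out of 378 candidates" success rate quoted above, computed over however many query clips are available.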
There are two key findings in our method. First, it is crucial to apply appropriate pre-processing to the face images before feeding them to the machine learning model. When the face images are of poor quality, the model has to learn irrelevant things, such as the background, complicated emotions, etc. By deploying a group of automatic filtering strategies followed by minimal human correction, we generated a new face dataset, called HQ-VoxCeleb, in which HQ stands for High-Quality. Look at the example images below to see how much these efforts improve the quality of the face images.
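As a flavour of what such automatic filtering might look like, the sketch below rejects images that are too small or too blurry, using the variance of a Laplacian response as a crude sharpness score. The thresholds and the specific checks are illustrative assumptions; the actual HQ-VoxCeleb pipeline combines several filters plus human correction.

```python
import numpy as np

def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian response over a grayscale image.
    Low values suggest a blurry image; high values suggest sharp detail."""
    lap = (-4.0 * gray
           + np.roll(gray, 1, axis=0) + np.roll(gray, -1, axis=0)
           + np.roll(gray, 1, axis=1) + np.roll(gray, -1, axis=1))
    return float(lap.var())

def passes_quality_filters(gray, min_side=128, min_sharpness=50.0):
    """Keep an image only if it is large enough and sharp enough.
    `min_side` and `min_sharpness` are hypothetical thresholds."""
    h, w = gray.shape
    if min(h, w) < min_side:
        return False
    return laplacian_variance(gray) >= min_sharpness
```

Filters like these run cheaply over hundreds of thousands of frames, leaving only a small residue for manual inspection.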
The second finding concerns the appropriate connection between the vocal domain and the facial imaging domain in deep learning models. In short speech audio clips, different segments cover different pronunciations and therefore reflect very different vocal characteristics (and corresponding facial features) of the speaker. The common practice in existing methods is a brute-force integration of these important but scattered pieces of information, across a series of speech audio segments, in the vocal feature embedding space. A much smarter way is to delay the integration until the facial features are generally recognizable, so that the model can more accurately evaluate which facial features are important and how they complement each other. This leads to the new fusion strategy proposed in our approach.
As part of the alchemy of deep learning, we also identified some drawbacks in existing models, such as in the convolutional building blocks. Fine-tuning these blocks also brings a significant improvement in performance.
This is a big step towards the adoption of this technology in real-world applications. It will be interesting to see more and more breakthroughs in the coming generation of speech processing technologies. Please follow the posts on PVoice's web site or our LinkedIn home page.
[1] Speech Fusion to Face: Bridging the Gap Between Human's Vocal Characteristics and Facial Imaging. https://arxiv.org/abs/2006.05888
[2] Learning the Face Behind a Voice. https://speech2face.github.io/
[3] Face Reconstruction from Voice Using Generative Adversarial Networks. https://arxiv.org/abs/1905.10604