In the last two years, artificial intelligence has spread across the Internet through a wide range of tools and applications. Beyond chatbots, there are now highly realistic tools for generating images, audio and video, and even for turning audio into video. For that last task, Alibaba has built a tool that can bring any portrait to life with a song or a spoken recording.
The Chinese giant has developed an application that allows us to take an image as a reference and an audio track and combine them to generate a video of the person in the photo singing or speaking. Although this technique is not new, it is the first time we have seen such realistic results.
Alibaba creates EMO, the AI that converts a photo and audio into video
The team of researchers at Alibaba's Institute for Intelligent Computing has named its AI ‘EMO’, short for ‘Emote Portrait Alive’. The tool is capable of animating a portrait photo and generating videos of the person speaking or singing.
The project's official website shows multiple examples of the technology in action. Alibaba has also built one example from a video that OpenAI used to showcase Sora, its AI for generating realistic videos. According to the research paper, the AI is capable of creating fluid and expressive facial movements, as well as head poses that fit almost perfectly with the song or audio playing in the background.
“Traditional techniques often fail to capture the full spectrum of human expressions and the uniqueness of individual facial styles,” said Linrui Tian, lead author of the paper. “To solve these problems, we propose EMO, a novel framework that uses a direct audio-to-video synthesis approach, without the need for intermediate 3D models or facial landmarks.”
EMO is built on a diffusion model that converts audio into video. The researchers trained the model on a dataset of more than 250 hours of talking-head videos drawn from speeches, films, television shows, and performances by music artists.
Video generation algorithm and procedure. Image: Alibaba
Instead of using a 3D model to warp the photograph and simulate motion, EMO converts audio waveforms directly into video frames. This lets it capture the subtle movements and identity-specific quirks associated with natural speech.
According to the experiments described in the paper, EMO significantly outperforms state-of-the-art methods in video quality, identity preservation and expressiveness. And in truth, a look at the examples is enough to see that this AI is leagues ahead of existing models for making the person in a photograph speak or sing.
There is little doubt that once the tool is released, thousands of memes of celebrities singing or saying something completely absurd will appear on the Internet. However, it could also be a valuable tool for content creators, or a way to bring the face of a deceased relative back to life, to give a few examples.
Of course, the tool also carries serious ethical risks, since it could be misused to impersonate someone or spread disinformation.
For now, the tool is not available to the public, so we will have to wait for more information. All we can do for the moment is browse the videos published on the project website and marvel at their quality and realism.