Three fundamental technologies — chatbots, audio fakes, and deepfake videos — have improved to the point that creating digital, real-time clones of people is merely a matter of integrating the systems.
The fundamental technologies for creating digital clones of people — text, audio, and video that sound and look like a specific person — have rapidly advanced and are within striking distance of a future in which digital avatars can sound and act like specific people, Tamaghna Basu, co-founder and chief technology officer of neoEYED, a behavioral analytics firm, told attendees at the virtual Black Hat conference on Aug. 6.
While deepfake videos that superimpose a 3D model of a specific person over another person’s face have raised fears of propaganda videos, disinformation operations, and smear campaigns, successful digital clones could cause even more problems, especially for systems that use voice or facial recognition for access management, or when used to fool employees into accepting someone’s identity. While the current results of Basu’s experiment have numerous telltale signs that the subject is clearly not human, the relative success of the project demonstrates how close we may be to successfully creating simulated people.
“As you can clearly see, there is a gap, but this gap is about making the voice more convincing, making the facial expressions have more emotion, those are on the road map to be done,” he told attendees during his presentation. “The ultimate goal that I have, [building] an alternate [version of me] that can have a conversation over text, voice, and video,” seems achievable.
Inspired by futuristic shows such as Black Mirror, Basu decided to attempt to construct a digital clone of himself using three already existing technologies: chatbots, audio synthesis, and deepfake videos. The effort is less about original research and more about stitching together a variety of technologies. While the video version of his digital clone is choppy and the voice sounds generated, several friends who conversed with the chatbot version of his model thought he might be feeding the answers to the machine.
Such believable personalization suggests that, depending on how close two people are, a digital clone could fool one into thinking it’s the other person, he said.
“Our object was to get a positive Turing test, to convince them it is really me,” he said in a Dark Reading interview, adding: “One of the scariest parts is that if you have 100 friends in your Facebook, honestly speaking, there are very few relationships where people are very personal. So, the real problem is that it is easy to fake the relationship.”
The technology could spell trouble for identity verification technologies, he added. Basu’s company uses analytics to create behavioral profiles of people to protect identities — one reason why he decided to take an adversarial strategy and try to use behavioral profiles to create a clone. Digital clones that not only look and sound like another person but also have mannerisms and patterns of speaking that are similar to the subject will make social engineering easier.
At a high level, the technology is broken up into three parts, which Basu called the brain, the voice, and the face. The brain is a text chatbot engine that attempts to have an interactive chat using natural language processing. There are a variety of approaches to chatbots that can produce reasonable functionality, depending on the type of conversation. Limited-domain conversations, such as small talk and conversations seeking specific information, can often be rule-based.
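To illustrate what "rule-based" can mean for a limited-domain conversation, here is a minimal Python sketch. It is not Basu's implementation and does not use Rasa; the patterns, canned replies, and the `reply` function are all hypothetical, chosen only to show the general idea of matching an incoming message against a fixed list of rules.

```python
import re

# Hypothetical small-talk rules: each entry pairs a pattern with a canned reply.
# These are illustrative assumptions, not rules from Basu's project.
RULES = [
    (re.compile(r"\b(hi|hello|hey)\b", re.IGNORECASE), "Hey! How are you doing?"),
    (re.compile(r"\bhow are you\b", re.IGNORECASE), "Doing well, thanks. What about you?"),
    (re.compile(r"\b(bye|goodbye)\b", re.IGNORECASE), "Talk to you later."),
]

# Reply used when no rule matches the message.
FALLBACK = "Sorry, I didn't catch that."


def reply(message: str) -> str:
    """Return the reply of the first rule whose pattern matches the message."""
    for pattern, response in RULES:
        if pattern.search(message):
            return response
    return FALLBACK


if __name__ == "__main__":
    for text in ["Hello there", "So, how are you?", "What is the weather?"]:
        print(f"> {text}\n{reply(text)}")
```

A bot like this only handles the narrow set of inputs its rules anticipate, which is why open-ended conversation in a specific person's style calls for a trained model rather than a fixed rule list.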
Using a variety of different chat histories for a specific person, you can train such bots to use the same type of language as that person, he said during the presentation. “The brain is the engine which is the crux of the entire project. It knows what kind of questions to ask and how to answer those questions.”
Using an open source chatbot library known as Rasa, Basu created a system that could make small talk and hold conversations. Basu also used audio synthesis software and 500 samples of his voice averaging 10 seconds each to train the machine learning process. Better audio cloning will require as much as 10 hours of recording. He is playing around with accents.
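For a sense of scale, the figures above can be worked through in a few lines of Python. This is only back-of-the-envelope arithmetic on the numbers quoted in the talk (500 samples, about 10 seconds each, and up to 10 hours for better cloning); the variable and function names are illustrative.

```python
# Figures quoted in the presentation.
SAMPLES = 500          # number of voice samples used
AVG_SECONDS = 10       # average length of each sample, in seconds
TARGET_HOURS = 10      # upper estimate of audio needed for better cloning


def recorded_hours(samples: int, avg_seconds: float) -> float:
    """Total recorded audio, converted from seconds to hours."""
    return samples * avg_seconds / 3600


hours = recorded_hours(SAMPLES, AVG_SECONDS)
ratio = TARGET_HOURS / hours

print(f"Recorded audio: {SAMPLES * AVG_SECONDS} seconds (about {hours:.2f} hours)")
print(f"A 10-hour target is about {ratio:.1f} times that amount")
```

In other words, the training set amounts to roughly an hour and a half of audio, several times less than the upper figure cited for higher-quality results.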
For the face, he wanted to create it in near-real time and have the mouth match the words. Overall, identity attacks appear feasible and at this point merely require refinement, he said.