Voice Cloning for the Masses An AI project clones cartoon character voices.

Published

Apr 01, 2020

Reading time

1 min read

Are you secretly yearning to have a My Little Pony character voice your next online presentation? A new web app can make your dreams come true.

What’s new: 15.ai translates short text messages into the voices of popular cartoon and video game characters.

How it works: The model’s anonymous developer began the project in 2018 while an undergrad at MIT. In an email to The Batch, the coder declined to disclose details about how the model works but said it was inspired by the 2019 paper that pioneered transfer learning for text-to-speech models.

15.ai’s author didn’t disclose how its model differs from the implementation in the paper but said they stumbled on a technique that makes it possible to learn new voices with less than 15 minutes of training data.
The current roster of voices includes My Little Pony’s Princess Celestia and Team Fortress 2’s Soldier. Characters from the video game Fallout: New Vegas and animated series Steven Universe and Rick and Morty are in the works.

Behind the news: DeepMind made a significant advance in neural audio synthesis in 2016 with WaveNet. That work demonstrated that a neural net trained on multiple examples of similar speech or music could create passable facsimiles. Other teams continue to make advances, for example, in real-time systems.

Why it matters: Voice cloning could be enormously productive. In Hollywood, it could revolutionize the use of virtual actors. In cartoons and audiobooks, it could enable voice actors to participate in many more productions. In online education, kids might pay more attention to lessons delivered by the voices of favorite personalities. And how many YouTube how-to video producers would love to have a synthetic Morgan Freeman narrate their scripts?

Yes, but: Synthesizing a human actor’s voice without consent is arguably unethical and possibly illegal. And this technology will be catnip for deepfakers, who could scrape recordings from social networks to impersonate private individuals.

We’re thinking: Anyone who wants to synthesize Andrew’s voice has more than 15 minutes of deeplearning.ai courseware to train on.

Subscribe to The Batch