This is pretty amazing.
Via Emlyn O’Regan
Originally shared by Vincent Vanhoucke
File under ‘Big Deals’.
Neal Stephenson’s ‘The Diamond Age’ has a funky premise: in a future where technology has pretty much solved everything, one problem remains unsolved, generating human-sounding speech, so people called ‘ractors’ are hired to act out lectures delivered in spoken form.
For a while, it actually seemed believable that delivering human-sounding speech would be very hard, as synthesis technology consistently failed to deliver anything that would fool anyone for more than a few words.
Today the work from our colleagues at DeepMind feels like it’s leaped over the Uncanny Valley, and then some. The examples they provide sound fantastic. This is very exciting. I am really interested to hear what very long-form text sounds like, because that remains the ultimate challenge for TTS.
Damn, Daniel … this is good stuff. What blew my mind in the samples was actually hearing ‘vocal fry,’ as you do from human speakers but never from parametric, and rarely from concatenative systems.
Cool. I wasn’t aware of that concept before but it’s certainly ubiquitous.
I love this jumbled English:
https://storage.googleapis.com/deepmind-media/pixie/knowing-what-to-say/first-list/speaker-1.wav
https://storage.googleapis.com/deepmind-media/pixie/knowing-what-to-say/first-list/speaker-2.wav
https://storage.googleapis.com/deepmind-media/pixie/knowing-what-to-say/first-list/speaker-3.wav
https://storage.googleapis.com/deepmind-media/pixie/knowing-what-to-say/first-list/speaker-4.wav
https://storage.googleapis.com/deepmind-media/pixie/knowing-what-to-say/first-list/speaker-5.wav
https://storage.googleapis.com/deepmind-media/pixie/knowing-what-to-say/first-list/speaker-6.wav
Excellent research results. But, as with other botware, I’m not looking forward to deceptive voice spam.