Google has published a blog post describing the current capabilities of its speech generation AI.
In September 2024, Google launched “Illuminate,” a service that summarizes the content of papers and books and converts it into podcast-like conversational audio, and added the ability to provide an overview as conversational audio to its AI-powered note-taking app “NotebookLM.”
Years of research have made it possible to generate speech that lasts more than several tens of seconds, features multiple speakers, and sounds like natural conversation. SoundStream, which appeared in August 2021, made it possible to reconstruct speech while preserving information such as prosody and timbre. AudioLM, released in October 2022, made it possible to treat speech generation as a language modeling task that generates acoustic tokens.
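The idea of treating audio generation as language modeling over acoustic tokens can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the uniform quantizer stands in for a learned codec such as SoundStream, the bigram counter stands in for a large language model such as AudioLM, and the function names (`tokenize`, `fit_bigram`, `generate`, `detokenize`) are hypothetical. It is not how Google's systems are implemented.

```python
import math
import random
from collections import Counter, defaultdict

def tokenize(samples, levels=8):
    """Map each sample in [-1, 1] to an integer token (toy uniform quantizer)."""
    return [min(levels - 1, int((s + 1.0) / 2.0 * levels)) for s in samples]

def detokenize(tokens, levels=8):
    """Map each token back to the centre of its quantization bin."""
    return [(t + 0.5) / levels * 2.0 - 1.0 for t in tokens]

def fit_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, start, length, rng):
    """Sample tokens one at a time, each conditioned on the previous token."""
    out = [start]
    for _ in range(length - 1):
        followers = counts.get(out[-1])
        if not followers:
            break  # no observed continuation for this token
        choices, weights = zip(*followers.items())
        out.append(rng.choices(choices, weights=weights)[0])
    return out

# Toy "audio": a sine wave with a period of 16 samples.
wave = [math.sin(2 * math.pi * i / 16) for i in range(256)]
tokens = tokenize(wave)                                    # audio -> discrete tokens
model = fit_bigram(tokens)                                 # "language model" over tokens
generated = generate(model, tokens[0], 64, random.Random(0))
audio = detokenize(generated)                              # tokens -> audio samples
```

The point of the sketch is only the pipeline shape: once audio is represented as a sequence of discrete tokens, generating audio reduces to predicting the next token, just as in text generation.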
SoundStorm, which appeared in June 2023, demonstrated the ability to generate natural 30-second conversations featuring multiple speakers. As of October 2024, it is also possible to generate 2 minutes of audio. Generating 2 minutes of audio takes less than 3 seconds on a TPU v5e chip, which is more than 40 times faster than real time.
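The “more than 40 times” figure follows from simple arithmetic on the numbers quoted above. A minimal sketch, where the helper name `real_time_factor` is hypothetical and the 3-second value is treated as an upper bound on generation time:

```python
def real_time_factor(audio_seconds, generation_seconds):
    """Return how many seconds of audio are produced per second of compute."""
    if generation_seconds <= 0:
        raise ValueError("generation time must be positive")
    return audio_seconds / generation_seconds

audio_seconds = 2 * 60          # 2 minutes of generated audio
generation_upper_bound = 3.0    # "less than 3 seconds", so this is a ceiling

# Dividing by the ceiling on generation time gives a floor on the speedup.
speedup_lower_bound = real_time_factor(audio_seconds, generation_upper_bound)
print(f"at least {speedup_lower_bound:.0f}x faster than real time")
# prints "at least 40x faster than real time"
```

Because the generation time is below 3 seconds, the actual speedup is above this lower bound.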
In an example of the generated audio, the speakers exchange lines such as “Hey, did you hear about Google DeepMind’s audio generation achievements?” and “No, I missed it.”
To improve its ability to generate realistic conversations with multiple speakers, the model was pre-trained on hundreds of thousands of hours of audio data and then fine-tuned on a small dataset of unscripted conversations from a large number of voice actors, containing fillers such as “um” and “ah.” This makes it possible to reliably switch speakers during a conversation and to output audio with appropriate pauses and tone.
As an editorial staff member whose native language is Japanese, I can’t distinguish it from a conversation between native speakers, but native English speakers seem to find the generated conversation strange. Comments such as “very frustrating to hear” and “It sounds like you are reading from a prepared script” were found here and there.