Yeah, out of all the generative AI fields, voice generation at this point is like 95% of the way to producing convincing speech, even with consumer-level tools like ElevenLabs. That last 5% might not even be solvable currently: it's the moments where it gets the feeling, intonation, or pronunciation wrong because the only context you give it is a text input, which is why everything purely automated tends to fall apart quite fast.
They are, however, able to inaccurately summarize it in GLaDOS’s voice, which is a strong point in their favor.
Surely you’d need TTS for that one, too? Which one do you use? Is it open weights?
Zonos just came out, seems sick:
https://huggingface.co/Zyphra
There are also some “native” TTS LLMs like GLM 9B, which “capture” more information in the output than a pure text input can convey.
A website with zero information, and barely anything on their huggingface page. What’s exciting about this?
Ahh, you should link to the model
https://www.zyphra.com/post/beta-release-of-zonos-v0-1
Whoops, yeah, should have linked the blog.
I didn’t want to link the individual models because I’m not sure whether the hybrid or the pure-transformer variant is better.
Looks pretty interesting, thanks for sharing it
Especially voice cloning - the DRG Cortana Mission Control mod is one of the examples I like to use.