Skip to the content.

CLONE: Controllable and Lossless Non-Autoregressive End-to-End Text-to-Speech

The Overview of CLONE

Abstract

Some recent studies have demonstrated the feasibility of single-stage neural text-to-speech, which does not need to generate mel-spectrograms but generates the raw waveforms directly from the text. Single-stage text-to-speech often faces two problems: a) the one-to-many mapping problem due to multiple speech variations and b) insufficiency of high frequency reconstruction due to the lack of supervision of ground-truth acoustic features during training. To solve the a) problem and generate more expressive speech, we propose a novel phoneme-level prosody modeling method based on a variational autoencoder with normalizing flows to model underlying prosodic information in speech. We also use the prosody predictor to support end-to-end expressive speech synthesis. Furthermore, we propose the dual parallel autoencoder to introduce supervision of the ground-truth acoustic features during training to solve the b) problem enabling our model to generate high-quality speech. We compare the synthesis quality with state-of-the-art text-to-speech systems on an internal expressive English dataset. Both qualitative and quantitative evaluations demonstrate the superiority and robustness of our method for lossless speech generation while also showing a strong capability in prosody modeling.

MOS Sample (Section 4.5.1)

                  text                   VITS FastSpeech 2 + TFGAN Tacotron 2 + TFGAN CLONE without MBD CLONE
He’s still so hang up on his dead dog that no one can persuade him to eat.
Yes, we should try to contact more.
Oh, we’ve a lot to learn.
Do you know the way to Melbourne?
I like the traditions very much.
Well, I’ll restart the machine and hope for the best.
He is learning to ride a bicycle.
I’m not sure about that, we’d better ask the lady there.
Have you ever received any vaccination?
Can you tell me where to alter my suits in the downtown?

CMOS Sample (Section 4.5.1)

                  text                   48 kHz Recording 24 kHz VITS 24 kHz CLONE 48 kHz CLONE
So that’s where we should be looking to discover.
Perhaps it is no wonder people are happy to see him.
Maybe you crave a good burger now and again.
We had a taco party at my friend’s house, and I invited all these people.
I usually stick to a more natural red shade, but every once in a while I can’t resist going blue.

Prosody Variation (Section 4.5.2 - Prosody Variation)

Sample from -1 Sample from 1

Prosody Transfer (Section 4.5.2 - Prosody Transfer)

End-to-end Synthesis Transfer Result Reference

Prosody Reconstruction (Section 4.5.2 - Prosody Reconstruction)

Reconstruction Result Reference

Prosody Predictor Transfer

By replacing the speaker id of the prosody predictor, we can use the prosody of different speaker to synthesize the sound of the same timbre.

Speaker A Speaker A with Prosody Predictor of Speaker B Speaker B