Abstract

Zero-shot Text-To-Speech (TTS) synthesis shows great promise for personalized voice customization through voice cloning. However, current zero-shot TTS methods rely heavily on large model scales and extensive training datasets to ensure satisfactory performance and generalizability across speakers, raising concerns about both deployment cost and data security. In this paper, we present a lightweight and stable zero-shot TTS system. We introduce a novel TTS architecture designed to model linguistic content and speaker attributes from the source speech and the prompt speech, respectively. Furthermore, we present a two-stage self-distillation framework that constructs parallel data pairs to disentangle linguistic content from speaker identity at the level of the training data. Extensive experiments show that our system achieves excellent performance and superior stability on zero-shot TTS tasks. Moreover, it is markedly more computationally efficient, with real-time factors (RTFs) of 0.13 on CPU and 0.012 on GPU.

TTS Systems

Our proposed model was benchmarked against state-of-the-art zero-shot TTS models, including X-TTSv2, Vall-E, GPT-SoVITS, and CosyVoice. We reproduced Vall-E following the original paper and pre-trained it on a speech corpus of around 60,000 hours. X-TTSv2, GPT-SoVITS, and CosyVoice are the official models released by their respective authors, pre-trained on common open-source data.

| Model | Params↓ | Data (h)↓ | RTF$_{CPU}$↓ | RTF$_{GPU}$↓ |
|---|---|---|---|---|
| X-TTSv2 | 447M | N/A | 5.75 | 0.26 |
| Vall-E | 369M | 60K | 2.61 | 0.47 |
| GPT-SoVITS | 223M | 2K | 3.53 | 0.38 |
| CosyVoice | 414M | 173K | 15.3 | 1.89 |
| μSpeaker | 22.5M | 531 | 0.13 | 0.012 |
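The real-time factor (RTF) reported above is the standard ratio of wall-clock synthesis time to the duration of the generated audio; values below 1 mean faster-than-real-time synthesis. A minimal sketch of how such a measurement could be taken (the `synthesize` callable and sample rate are illustrative assumptions, not from the paper):

```python
import time

def real_time_factor(synthesize, text, sample_rate):
    """Measure RTF: wall-clock synthesis time / duration of generated audio.

    `synthesize` is a hypothetical TTS callable that takes text and
    returns a sequence of audio samples at `sample_rate` Hz.
    """
    start = time.perf_counter()
    audio = synthesize(text)
    elapsed = time.perf_counter() - start
    duration = len(audio) / sample_rate  # audio length in seconds
    return elapsed / duration
```

In practice one would average this over many utterances and discard the first (warm-up) run before reporting CPU and GPU figures.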

Zero-shot Audio Samples

The following examples present speech synthesized from prompt recordings of multiple speakers not included in the training dataset. For each prompt, we produced two sets of audio samples, each containing six clips generated by the different TTS systems. Among these, "our system w/o SD" and "our system w/ SD" denote the model without self-distillation ($\sigma=0$) and with self-distillation ($\sigma=0.8$), respectively.

Prompt speech X-TTSv2 VALL-E GPT-SoVITS CosyVoice Our system w/o SD Our system w/ SD