Zero-shot Text-to-Speech: How Does the Performance of HierSpeech++ Fare With Other Baselines?

20 Dec 2024

Abstract and 1 Introduction

2 Related Work

2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models

2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning

3 HierSpeech++ and 3.1 Speech Representations

3.2 Hierarchical Speech Synthesizer

3.3 Text-to-Vec

3.4 Speech Super-resolution

3.5 Model Architecture

4 Speech Synthesis Tasks

4.1 Voice Conversion and 4.2 Text-to-Speech

4.3 Style Prompt Replication

5 Experiment and Result, and Dataset

5.2 Preprocessing and 5.3 Training

5.4 Evaluation Metrics

5.5 Ablation Study

5.6 Zero-shot Voice Conversion

5.7 High-diversity but High-fidelity Speech Synthesis

5.8 Zero-shot Text-to-Speech

5.9 Zero-shot Text-to-Speech with 1s Prompt

5.10 Speech Super-resolution

5.11 Additional Experiments with Other Baselines

6 Limitation and Quick Fix

7 Conclusion, Acknowledgement and References

5.8 Zero-shot Text-to-Speech

We compared the zero-shot TTS performance of HierSpeech++ with four baselines: 1) YourTTS, a VITS-based end-to-end TTS model; 2) HierSpeech, an end-to-end TTS model using a hierarchical VAE; 3) VALL-E-X, a multilingual zero-shot TTS model based on a neural codec language model, for which we used an unofficial implementation that improves audio quality with a Vocos decoder; and 4) XTTS [12], the XTTS v1 TTS product from Coqui Corp., which is built on the open-source TTS model TorToise [5], the first model to be trained on an unprecedentedly large-scale speech dataset. For zero-shot TTS, we used noisy speech prompts from the test-clean and test-other subsets of LibriTTS. For the results in TABLE 7 and 8, HierSpeech++ synthesizes speech with T_ttv = 0.333 and T_h = 0.333.
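To make the two 0.333 settings quoted above more concrete, here is a minimal sketch of temperature-scaled Gaussian sampling. It assumes that both values act as sampling temperatures that scale the standard deviation of a latent distribution; that reading, the function name `sample_latent`, and the toy mean and standard deviation values are illustrative assumptions, not details taken from this section.

```python
import random

def sample_latent(mean, std, temperature, rng):
    """Draw z = mean + temperature * std * eps element-wise, with eps ~ N(0, 1)."""
    return [m + temperature * s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]

rng = random.Random(0)

# Toy distribution parameters (hypothetical, for illustration only).
mean = [0.0, 1.0, -1.0]
std = [1.0, 0.5, 2.0]

TEMPERATURE = 0.333  # the value quoted for both settings in the text

# A lower temperature keeps samples closer to the mean (less variation).
z = sample_latent(mean, std, TEMPERATURE, rng)

# With temperature 0 the noise term vanishes and the sample equals the mean.
z_at_mean = sample_latent(mean, std, 0.0, rng)
```

Under this assumption, lowering the temperature trades diversity for stability, which is one plausible reason to fix it at a small value for benchmark comparisons.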

The results demonstrate that our model is a strong zero-shot TTS model across the subjective and objective metrics. We conducted three MOS experiments, for naturalness, prosody (pMOS), and similarity. Our model significantly outperforms the baselines and even surpasses the ground truth in naturalness. However, XTTS performs better in pMOS, which suggests that learning prosody requires more data to improve expressiveness. Although the other models show limitations when synthesizing speech from noisy prompts, our model synthesizes speech robustly. Furthermore, our model achieves a better CER and WER than the ground truth, which also demonstrates its robustness. In summary, the results demonstrate the superiority of our model in naturalness, expressiveness, and robustness for zero-shot TTS.
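As a reminder of what the CER and WER figures measure, the sketch below computes both from a reference transcript and a recognized transcript using Levenshtein edit distance. It is a generic illustration, not the evaluation pipeline used for the tables: the function names are hypothetical, no text normalization is applied, and the recognized transcript is assumed to be already available as a string.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

# One substituted word out of three; one substituted character out of eleven.
example_wer = wer("the cat sat", "the bat sat")
example_cer = cer("the cat sat", "the bat sat")
```

Lower values are better for both metrics, so a synthesized utterance can score better than a ground-truth recording when the recognizer transcribes it more accurately.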

In addition, we could further improve the zero-shot TTS performance by introducing style prompt replication (SPR), as shown in the following subsection. Note that we do not apply SPR in TABLE 2-8. The audio could also be upsampled to 48 kHz. Lastly, we could synthesize noise-free speech even from noisy speech prompts. The details are described in Section 6.

TABLE 9: Results for different speech prompt lengths. We use all sentences longer than 10s from the test-clean subset of LibriTTS (1,002 samples). SPR denotes style prompt replication; we replicate a short prompt five times for robust style transfer. Because we slice the speech randomly without considering voiced/unvoiced parts, the results for 1s prompts are lower than the others. UTMOS is presented with standard deviation.
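The prompt handling described in the caption can be sketched as follows: cut a random short window from a longer utterance, then repeat it five times to form a longer style prompt. This is a minimal illustration under stated assumptions: the 16 kHz sampling rate, the function names, and the choice to replicate raw waveform samples (rather than a feature sequence) are hypothetical and are not specified in the caption.

```python
import random

SAMPLE_RATE = 16_000  # assumed sampling rate in Hz, for illustration only

def random_slice(waveform, seconds, rng, sample_rate=SAMPLE_RATE):
    """Cut a random window of the given duration, ignoring voiced/unvoiced content."""
    length = int(seconds * sample_rate)
    if length > len(waveform):
        raise ValueError("waveform is shorter than the requested slice")
    start = rng.randint(0, len(waveform) - length)
    return waveform[start:start + length]

def replicate_prompt(prompt, times=5):
    """Concatenate the prompt with itself `times` times along the time axis."""
    return list(prompt) * times

rng = random.Random(0)

# Stand-in for a 10s utterance: random samples in [-1, 1].
utterance = [rng.uniform(-1.0, 1.0) for _ in range(10 * SAMPLE_RATE)]

short_prompt = random_slice(utterance, seconds=1.0, rng=rng)  # 1s prompt
replicated = replicate_prompt(short_prompt, times=5)          # 5s after SPR
```

Because the slice position is random, a 1s window may fall largely on silence or unvoiced audio, which is consistent with the caption's explanation for the lower 1s results.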

This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.


[12]. https://github.com/coqui-ai/TTS

Authors:

(1) Sang-Hoon Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(2) Ha-Yeong Choi, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(3) Seung-Bin Kim, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(4) Seong-Whan Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea, and corresponding author.