Diffusion Models and Zero-shot Voice Cloning in Speech Synthesis: How Do They Fare?

17 Dec 2024

Abstract and 1 Introduction

2 Related Work

2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models

2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning

3 Hierspeech++ and 3.1 Speech Representations

3.2 Hierarchical Speech Synthesizer

3.3 Text-to-Vec

3.4 Speech Super-resolution

3.5 Model Architecture

4 Speech Synthesis Tasks

4.1 Voice Conversion and 4.2 Text-to-Speech

4.3 Style Prompt Replication

5 Experiment and Result, and Dataset

5.2 Preprocessing and 5.3 Training

5.4 Evaluation Metrics

5.5 Ablation Study

5.6 Zero-shot Voice Conversion

5.7 High-diversity but High-fidelity Speech Synthesis

5.8 Zero-shot Text-to-Speech

5.9 Zero-shot Text-to-Speech with 1s Prompt

5.10 Speech Super-resolution

5.11 Additional Experiments with Other Baselines

6 Limitation and Quick Fix

7 Conclusion, Acknowledgement and References

2.3 Diffusion Models

Diffusion models have also demonstrated strong generative performance in speech synthesis. Grad-TTS [63] first introduced a score-based decoder to generate a Mel-spectrogram, and Diff-VC demonstrated the high adaptation performance of diffusion models in zero-shot voice conversion scenarios. DiffSinger achieved state-of-the-art performance in the singing voice synthesis (SVS) task by generating a high-quality singing voice with powerful adaptation performance. DDDM-VC [10] significantly improved speech representation disentanglement [3] and voice conversion performance with a disentangled denoising diffusion model and prior mixup. Diff-HierVC [11] introduced a hierarchical voice style transfer framework that generates pitch contour and voice hierarchically based on diffusion models. Guided-TTS [33] and Guided-TTS 2 [38] have also shown good speaker adaptation performance for TTS. UnitSpeech [32] introduced unit-based speech synthesis with diffusion models. Furthermore, recent studies have applied diffusion models to latent representations. NaturalSpeech 2 [70] and HiddenSinger [24] utilized the acoustic representation of an audio autoencoder as a latent representation, and developed a conditional latent diffusion model for speech or singing voice synthesis. StyleTTS 2 [54] proposed a style latent diffusion for style adaptation. Although all the above models have shown powerful adaptation performance, their inference is slow because they generate iteratively. To speed up inference, CoMoSpeech [87] and Multi-GradSpeech [83] adopted a consistency model for a diffusion-based TTS model. Recently, VoiceBox [43] and P-Flow [39] utilized flow matching with optimal transport for fast sampling. However, these models still have a training-inference mismatch problem that arises from two-stage speech synthesis frameworks, and they are vulnerable to noisy target voice prompts.
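The speed argument above can be made concrete with a toy example. The sketch below is not taken from any of the cited systems; it is a one-dimensional illustration, under stated assumptions, of why a sampler whose trajectories are straight (as optimal-transport flow matching aims for) can use far fewer solver steps than one whose trajectories are curved. The constants `TARGET`, `START`, and `RATE`, and all function names, are hypothetical. Each call to a velocity function stands in for one neural network evaluation.

```python
import math

TARGET = 2.0   # hypothetical 1-D "data point" the sampler should reach
START = -1.0   # hypothetical starting noise sample
RATE = 3.0     # hypothetical rate of the curved field


def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps.

    Returns the final state and the number of velocity evaluations, which
    stands in for the number of network calls in a real sampler.
    """
    x = x0
    step = 1.0 / num_steps
    calls = 0
    for i in range(num_steps):
        t = i * step
        x = x + step * velocity(x, t)
        calls += 1
    return x, calls


def straight_velocity(x, t):
    # Velocity of the straight path x_t = (1 - t) * x0 + t * TARGET,
    # expressed through the current state: (TARGET - x) / (1 - t).
    return (TARGET - x) / (1.0 - t)


def curved_velocity(x, t):
    # A curved trajectory: exponential approach toward TARGET.
    return RATE * (TARGET - x)


def curved_exact(x0):
    # Closed-form solution of the curved ODE at t = 1.
    return TARGET - (TARGET - x0) * math.exp(-RATE)


if __name__ == "__main__":
    for n in (1, 10, 100):
        straight, calls = euler_sample(straight_velocity, START, n)
        curved, _ = euler_sample(curved_velocity, START, n)
        print(
            f"steps={n:3d} calls={calls:3d} "
            f"straight_err={abs(straight - TARGET):.2e} "
            f"curved_err={abs(curved - curved_exact(START)):.2e}"
        )
```

In this toy setting the straight field reaches the target with a single evaluation, while the curved field needs many steps before the Euler estimate approaches its exact solution. Real models learn the velocity or score with a network and operate on high-dimensional features, so the gap is less clean in practice.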

2.4 Zero-shot Voice Cloning

Zero-shot learning [81] for voice cloning is the task of synthesizing speech for a novel speaker who was not observed during training. A majority of the studies on voice cloning [25], [74] have focused on cloning voice styles, such as timbre and environment, and speaking styles, such as prosody and pronunciation. The work in [72] presented a reference encoder for prosody modeling, and GST [80] utilized learnable tokens for style modeling from reference speech or manual control. The work in [51] proposed fine-grained prosody control for expressive speech synthesis from reference speech. Multi-SpectroGAN [50] utilized adversarial feedback and a mixup strategy for expressive and diverse zero-shot TTS. Meta-StyleSpeech [59] introduced meta-learning for style modeling, and GenerSpeech [22] utilized a mix-style layer normalization for better generalization on out-of-domain style transfer. PVAE-TTS [44] utilized progressive style adaptation for high-quality zero-shot TTS. AdaSpeech [8] introduced adaptive layer normalization for adaptive speech synthesis. YourTTS [7] trained VITS [35] with a speaker encoder, and Grad-StyleSpeech [29] utilized a style-conditioned prior on a score-based Mel-spectrogram decoder for better adaptive TTS. Built upon VQTTS [17], TN-VQTTS [18] introduced timbre-normalized vector-quantized acoustic features for speaking style and timbre transfer. Meanwhile, there are text-prompt-based style generation models that can specify a speaking or voice style through text descriptions [19], [52], [84].
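Several of the methods above condition normalization layers on a style or speaker embedding. The following is a minimal sketch of that general idea, assuming a plain layer normalization whose gain and shift are predicted from a style vector by linear maps. It is not the architecture of AdaSpeech, Meta-StyleSpeech, or GenerSpeech; the function names, weight shapes, and example values are all hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights, bias, vec):
    """Apply y = W @ vec + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def adaptive_layer_norm(x, style, w_gain, b_gain, w_shift, b_shift):
    """Layer normalization whose gain and shift depend on a style vector.

    In a trained model the weights would be learned; here they are
    hypothetical inputs supplied by the caller.
    """
    gain = linear(w_gain, b_gain, style)
    shift = linear(w_shift, b_shift, style)
    normed = layer_norm(x)
    return [g * n + s for g, n, s in zip(gain, normed, shift)]


if __name__ == "__main__":
    features = [1.0, 2.0, 3.0, 4.0]          # hypothetical hidden features
    w_gain = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.5], [0.0, 0.0]]
    w_shift = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.0, 0.0]]
    b_gain = [1.0] * 4
    b_shift = [0.0] * 4
    for style in ([1.0, 0.0], [0.0, 1.0]):   # two hypothetical style vectors
        print(style, adaptive_layer_norm(
            features, style, w_gain, b_gain, w_shift, b_shift))
```

The same hidden features produce different outputs for different style vectors, which is how a single set of shared weights can adapt to an unseen speaker from a reference embedding alone.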

This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.

Authors:

(1) Sang-Hoon Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(2) Ha-Yeong Choi, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(3) Seung-Bin Kim, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(4) Seong-Whan Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea (corresponding author).