The Model Architecture for Text-to-Vec

17 Dec 2024

Abstract and 1 Introduction

2 Related Work

2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models

2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning

3 HierSpeech++ and 3.1 Speech Representations

3.2 Hierarchical Speech Synthesizer

3.3 Text-to-Vec

3.4 Speech Super-resolution

3.5 Model Architecture

4 Speech Synthesis Tasks

4.1 Voice Conversion and 4.2 Text-to-Speech

4.3 Style Prompt Replication

5 Experiment and Result, and 5.1 Dataset

5.2 Preprocessing and 5.3 Training

5.4 Evaluation Metrics

5.5 Ablation Study

5.6 Zero-shot Voice Conversion

5.7 High-diversity but High-fidelity Speech Synthesis

5.8 Zero-shot Text-to-Speech

5.9 Zero-shot Text-to-Speech with 1s Prompt

5.10 Speech Super-resolution

5.11 Additional Experiments with Other Baselines

6 Limitation and Quick Fix

7 Conclusion, Acknowledgement and References

3.5 Model Architecture

3.5.1 Text-to-Vec

The content encoder of the TTV consists of 16 layers of non-causal WaveNet with a hidden size of 256 and a kernel size of five. The content decoder consists of eight layers of non-causal WaveNet with a hidden size of 512 and a kernel size of five. The text encoder is composed of three unconditional Transformer networks and three prosody-conditional Transformer networks with a kernel size of nine, a hidden size of 256, and a filter size of 1024. We use a dropout rate of 0.2 for the text encoder. T-Flow consists of four residual coupling layers, each composed of a preConv, three Transformer blocks, and a postConv. Within the Transformer blocks, we adopt convolutional neural networks with a kernel size of five to encode adjacent information and AdaLN-Zero for better prosody style adaptation. For T-Flow, we use a hidden size of 256, a filter size of 1024, four attention heads, and a dropout rate of 0.1. For the pitch predictor, we use a source generator with the same structure as that of the HAG.
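The hyperparameters above can be collected into a small configuration object, which makes the per-module settings easier to compare at a glance. The following is a minimal sketch, not the authors' code: the class and field names (`WaveNetConfig`, `TextEncoderConfig`, `TFlowConfig`, `TTVConfig`) are hypothetical, and only the numeric values come from the text. The `head_dim` helper additionally assumes that the hidden size is split evenly across attention heads, as in standard multi-head attention.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WaveNetConfig:
    """Settings for one WaveNet stack (hypothetical container)."""
    n_layers: int
    hidden_size: int
    kernel_size: int
    causal: bool = False  # both TTV WaveNet stacks are described as non-causal


@dataclass(frozen=True)
class TextEncoderConfig:
    n_unconditional: int = 3        # unconditional Transformer networks
    n_prosody_conditional: int = 3  # prosody-conditional Transformer networks
    kernel_size: int = 9
    hidden_size: int = 256
    filter_size: int = 1024
    dropout: float = 0.2


@dataclass(frozen=True)
class TFlowConfig:
    n_coupling_layers: int = 4             # residual coupling layers
    transformer_blocks_per_layer: int = 3  # between preConv and postConv
    conv_kernel_size: int = 5              # convolutions inside Transformer blocks
    hidden_size: int = 256
    filter_size: int = 1024
    n_heads: int = 4
    dropout: float = 0.1
    adaln_zero: bool = True                # AdaLN-Zero for prosody style adaptation

    @property
    def total_transformer_blocks(self) -> int:
        return self.n_coupling_layers * self.transformer_blocks_per_layer

    @property
    def head_dim(self) -> int:
        # Assumption: hidden size is divided evenly across heads.
        return self.hidden_size // self.n_heads


@dataclass(frozen=True)
class TTVConfig:
    content_encoder: WaveNetConfig = field(
        default_factory=lambda: WaveNetConfig(n_layers=16, hidden_size=256, kernel_size=5))
    content_decoder: WaveNetConfig = field(
        default_factory=lambda: WaveNetConfig(n_layers=8, hidden_size=512, kernel_size=5))
    text_encoder: TextEncoderConfig = field(default_factory=TextEncoderConfig)
    t_flow: TFlowConfig = field(default_factory=TFlowConfig)


cfg = TTVConfig()
print(cfg.t_flow.total_transformer_blocks)  # 4 coupling layers x 3 blocks = 12
print(cfg.t_flow.head_dim)                  # 256 / 4 = 64
```

Freezing the dataclasses keeps the configuration immutable, so a value such as the T-Flow dropout rate cannot be changed by accident after the model has been built.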

Fig. 6: Inference scenarios for voice conversion and text-to-speech.

3.5.2 SpeechSR

SpeechSR consists of a single AMP block with an initial channel size of 32 and no upsampling layer. We use an NN upsampler to upsample the hidden representations. For the discriminators, we use the MPD with periods of [2, 3, 5, 7, 11] and the MS-STFTD with six different window sizes ([4096, 2048, 1024, 512, 256, 128]). Additionally, we use the DWTD, which has four sub-band discriminators.
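To make two of these components more concrete, the sketch below shows (a) nearest-neighbor upsampling, reading "NN upsampler" as nearest-neighbor (an assumption, since this section does not expand the abbreviation), and (b) how a period-based discriminator can view a waveform at one of the listed periods. The folding step follows the multi-period discriminator as formulated in HiFi-GAN (reflect-pad to a multiple of the period, then reshape to rows of that length); whether SpeechSR uses exactly this formulation is assumed here, not stated in this section. The constant and function names are hypothetical, and the code uses plain Python lists rather than tensors.

```python
from typing import List, Sequence

# Values taken from the text; the constant names are hypothetical.
AMP_INITIAL_CHANNELS = 32
MPD_PERIODS = (2, 3, 5, 7, 11)
MS_STFTD_WINDOW_SIZES = (4096, 2048, 1024, 512, 256, 128)
DWTD_NUM_SUBBANDS = 4


def nn_upsample(frames: Sequence[float], factor: int) -> List[float]:
    """Nearest-neighbor upsampling: repeat every frame `factor` times."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return [value for value in frames for _ in range(factor)]


def fold_by_period(signal: Sequence[float], period: int) -> List[List[float]]:
    """Reshape a 1-D signal into rows of length `period`.

    The signal is reflect-padded on the right so its length becomes a
    multiple of `period`. Each column of the result then holds samples
    that are `period` steps apart in the original signal.
    """
    samples = list(signal)
    n_pad = (-len(samples)) % period
    if n_pad >= len(samples):
        raise ValueError("signal is too short to reflect-pad for this period")
    if n_pad:
        # Reflect padding mirrors the signal around its last sample,
        # without repeating that last sample.
        samples += samples[-2:-n_pad - 2:-1]
    return [samples[i:i + period] for i in range(0, len(samples), period)]


print(nn_upsample([1.0, 2.0], 3))
print(fold_by_period(list(range(7)), 3))
```

For the seven-sample toy signal and period 3, two reflected samples are appended before folding, so the last row is completed from the mirrored tail rather than with zeros.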

This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.

Authors:

(1) Sang-Hoon Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(2) Ha-Yeong Choi, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(3) Seung-Bin Kim, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;

(4) Seong-Whan Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea (corresponding author).