2 RELATED WORK
2.1 Neural Codec Language Models
Conventional sequence-to-sequence auto-regressive TTS models, such as Tacotron [79], paved the way for speech synthesis technologies. TransformerTTS [53] first adopted a Transformer network for TTS, and VTN [28] likewise utilized a Transformer network for VC. However, these auto-regressive models suffer from slow inference speed, as well as a lack of robustness owing to the challenges of aligning text and acoustic representations and the difficulty of predicting a continuous acoustic representation. Recently, neural audio codec models [16], [89] have replaced conventional acoustic representations with a highly compressed audio codec that can reproduce the original waveform audio. Vall-E [78] was the first neural codec language model for speech synthesis, utilizing discrete audio units and language models. By scaling up the dataset to 60,000 h, Vall-E could perform in-context learning using a neural audio codec. However, it had the same limitations as auto-regressive TTS models, such as slow inference speed and a lack of robustness. Furthermore, such models depend heavily on their pre-trained neural audio codec, resulting in low-quality audio. To overcome this limitation, high-quality neural audio codec models, such as HiFi-Codec [85] and DAC [41], have been investigated. In addition, SPEAR-TTS [31] and Make-A-Voice [23] introduced a hierarchical speech synthesis framework, from semantic to acoustic tokens, to reduce the gap between text and speech. Moreover, to reduce inference time and improve the robustness of auto-regressive methods, SoundStorm [6] proposed parallel audio generation methods that generate the tokens of a neural audio codec. UniAudio [86] presented a multi-scale Transformer architecture to reduce the computational complexity of long audio sequences.
2.2 Non-autoregressive Models
For fast and robust speech synthesis, FastSpeech [68] introduced a duration predictor to synthesize speech in parallel, which significantly improved the robustness of speech synthesis by addressing limitations of auto-regressive models such as repeating and skipping. To reduce the one-to-many mapping problem in non-autoregressive speech synthesis, FastSpeech 2 [67] adopted a variance adaptor that can reflect pitch and energy information. However, these models require an external duration extractor to align the text and speech. Glow-TTS [34] introduced a monotonic alignment search and normalizing flow to learn the text-speech alignment and train the TTS model simultaneously. It also added a blank token interspersed between phoneme tokens to increase robustness. VITS [35] combined the TTS model and a neural vocoder using a VAE in an end-to-end TTS framework, with the aim of improving the quality of synthetic speech. NaturalSpeech [75] achieved human-level quality in single-speaker TTS by introducing a bidirectional normalizing flow and adopting differentiable duration modeling and phoneme pre-training. Moreover, HierSpeech [48] leveraged a self-supervised speech representation in end-to-end speech synthesis, which significantly reduced the information gap between text and speech, thus addressing speech mispronunciations. In addition, HierVST [45] utilized a hierarchical VAE for zero-shot voice style transfer, which significantly improved voice style transfer performance in end-to-end speech synthesis models without any labels. ZS-TTS [37], WavThruVec [71], and VQTTS [17] utilized a self-supervised speech representation as an intermediate acoustic representation for robust speech synthesis. NANSY++ [9] introduced a unified speech synthesizer for various voice applications such as TTS, VC, singing voice synthesis, and voice control. Some studies [26] have combined a parallel TTS with LLM-based prosody modeling for expressive speech synthesis.
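To make two of the mechanisms mentioned above concrete, the following is a minimal Python sketch, not the implementation of any cited system. It illustrates (i) interspersing a blank token between phoneme tokens, as described for Glow-TTS, and (ii) expanding per-phoneme representations by integer durations so that all output frames can be produced in parallel, in the spirit of a duration-based model such as FastSpeech. The function names `intersperse` and `length_regulate` are hypothetical, plain Python lists stand in for tensors, and placing a blank at both ends of the sequence as well as between tokens is an assumption of this sketch.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def intersperse(tokens: Sequence[T], blank: T) -> List[T]:
    """Return tokens with `blank` placed before, between, and after them.

    Assumption of this sketch: blanks are also added at both ends,
    so n tokens become 2n + 1 tokens.
    """
    out: List[T] = [blank] * (2 * len(tokens) + 1)
    out[1::2] = list(tokens)  # odd positions hold the original tokens
    return out


def length_regulate(frames: Sequence[T], durations: Sequence[int]) -> List[T]:
    """Repeat each per-phoneme item by its (predicted) integer duration.

    A duration of 0 drops the item; the expanded sequence has
    sum(durations) entries and no longer depends on step-by-step decoding.
    """
    if len(frames) != len(durations):
        raise ValueError("frames and durations must have the same length")
    expanded: List[T] = []
    for frame, duration in zip(frames, durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        expanded.extend([frame] * duration)
    return expanded


if __name__ == "__main__":
    phonemes = ["HH", "AH", "L", "OW"]
    # Hypothetical blank symbol "_" interleaved with the phoneme sequence.
    print(intersperse(phonemes, "_"))
    # Hypothetical durations, in frames, for each phoneme.
    print(length_regulate(phonemes, [2, 3, 1, 4]))
```

In this sketch, the durations are given directly; in a real system they would come from a duration predictor or from an alignment procedure, such as the external duration extractor or monotonic alignment search discussed above.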
This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.
Authors:
(1) Sang-Hoon Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(2) Ha-Yeong Choi, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(3) Seung-Bin Kim, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(4) Seong-Whan Lee, Fellow, IEEE with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea (corresponding author).