<aside> 💡

MoE-TTS

Author List: Heyang Xue*, Xuchen Song†, Yu Tang, Jianyu Chen, Yanru Chen, Yang Li, Yahui Zhou

*[email protected], †Corresponding author

</aside>

TL;DR: Description-based text-to-speech (TTS) models exhibit strong performance on in-domain text descriptions, i.e., those encountered during training. In real-world applications, however, the diverse range of user-generated descriptions inevitably introduces numerous out-of-domain inputs that challenge the text understanding capabilities of these systems. To address this issue, we propose MoE-TTS, a description-based TTS model designed to enhance the understanding of out-of-domain text descriptions. MoE-TTS employs a modality-based mixture-of-experts (MoE) approach to augment a pre-trained textual large language model (LLM) with a set of specialized weights adapted to the speech modality, while keeping the original LLM frozen during training. This design allows MoE-TTS to effectively leverage the pre-trained knowledge and text understanding abilities of textual LLMs. Our experimental results indicate that: first, even the most advanced closed-source commercial products can be challenged by carefully designed out-of-domain description test sets; second, MoE-TTS achieves superior performance in generating speech that more accurately reflects the descriptions. We encourage readers to listen to the demos at https://welkinyang.github.io/MoE-TTS/.


Introduction

In recent years, description-based text-to-speech (TTS) technology has been implemented in industrial applications [1][2], allowing users to precisely control the speaker and style characteristics of synthesized speech through natural language text descriptions (such as “clear, youthful voice with a magnetic tone”). This interaction method significantly lowers the threshold for speech customization and shows great potential in areas such as virtual assistants and audio content creation. However, current academic research faces two major bottlenecks:

<aside> ⚠️

The issues above cause the model to deviate from expectations when generating speech from user-provided out-of-domain descriptions.

In short, the core tension is this: real-world scenarios demand open-ended description understanding, while academic research is constrained by closed datasets and inefficient knowledge transfer. Based on this observation, we propose MoE-TTS, the first text-to-speech framework targeting out-of-domain description scenarios, centered on fully leveraging the pre-trained knowledge and text understanding capabilities of textual large language models (LLMs) to solve the generalization problem posed by out-of-domain descriptions. Our key contributions can be summarized as follows:

<aside> 🛠

Figure 1: Overview of MoE-TTS. MoE-TTS is initialized from a pre-trained textual LLM and transforms key components in the original Transformer blocks into mixture-of-expert layers. The original weights serve as text experts and the newly incorporated weights serve as speech experts.

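The routing scheme in Figure 1 can be sketched as follows. This is a minimal, hypothetical PyTorch illustration (not the authors' implementation): a Transformer linear layer becomes a two-expert MoE where the original weights serve as a frozen text expert and a copy, fine-tuned on speech, serves as the speech expert; tokens are routed deterministically by modality rather than by a learned gate. The class name `ModalityMoELinear` and the `is_speech` mask are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ModalityMoELinear(nn.Module):
    """Sketch of a modality-routed MoE layer: frozen text expert
    (original LLM weights) plus a trainable speech expert."""

    def __init__(self, text_linear: nn.Linear):
        super().__init__()
        # Original weights act as the text expert and stay frozen.
        self.text_expert = text_linear
        for p in self.text_expert.parameters():
            p.requires_grad = False
        # Speech expert is initialized from the text expert's weights
        # and is the only part updated during TTS training.
        self.speech_expert = nn.Linear(
            text_linear.in_features,
            text_linear.out_features,
            bias=text_linear.bias is not None,
        )
        self.speech_expert.load_state_dict(text_linear.state_dict())

    def forward(self, x: torch.Tensor, is_speech: torch.Tensor) -> torch.Tensor:
        # is_speech: boolean mask [batch, seq] marking speech-modality tokens.
        # Each token goes through exactly one expert, chosen by modality.
        text_out = self.text_expert(x)
        speech_out = self.speech_expert(x)
        return torch.where(is_speech.unsqueeze(-1), speech_out, text_out)
```

Because routing is decided by token modality instead of a learned router, no auxiliary load-balancing loss is needed, and text tokens always see the unmodified pre-trained weights, preserving the LLM's text understanding.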

Build MoE-TTS Step by Step