<aside> 💡

MoE-TTS

Author List: Heyang Xue*, Xuchen Song†, Yu Tang, Jianyu Chen, Yanru Chen, Yang Li, Yahui Zhou

*[email protected], †Corresponding author

</aside>

TL;DR: Description-based text-to-speech (TTS) models exhibit strong performance on in-domain text descriptions, i.e., those encountered during training. In real-world applications, however, the diverse range of user-generated descriptions inevitably introduces numerous out-of-domain inputs that challenge the text understanding capabilities of these systems. To address this issue, we propose MoE-TTS, a description-based TTS model designed to enhance the understanding of out-of-domain text descriptions. MoE-TTS employs a modality-based mixture-of-experts (MoE) approach to augment a pre-trained textual large language model (LLM) with a set of specialized weights adapted to the speech modality, while keeping the original LLM frozen during training. This design allows MoE-TTS to effectively leverage the pre-trained knowledge and text understanding abilities of textual LLMs. Our experimental results indicate that: first, even the most advanced closed-source commercial products can be challenged by carefully designed out-of-domain description test sets; second, MoE-TTS achieves superior performance in generating speech that more accurately reflects the descriptions. We encourage readers to listen to the demos at https://welkinyang.github.io/MoE-TTS/.


Introduction

In recent years, description-based text-to-speech (TTS) technology has been implemented in industrial applications [1][2], allowing users to precisely control the speaker and style characteristics of synthesized speech through natural language text descriptions (such as “clear, youthful voice with a magnetic tone”). This interaction method significantly lowers the threshold for speech customization and shows great potential in areas such as virtual assistants and audio content creation. However, current academic research faces two major bottlenecks:

<aside> ⚠️

The issues above cause the model to deviate from expectations when generating speech from user-provided out-of-domain descriptions.

In short, the core tension is this: real-world scenarios demand open-ended description understanding, while academic research is constrained by closed datasets and inefficient knowledge transfer. Based on this observation, we propose MoE-TTS, the first text-to-speech framework targeting out-of-domain description scenarios, centered on fully leveraging the pre-trained knowledge and text understanding capabilities of textual large language models (LLMs) to solve the generalization problem posed by out-of-domain descriptions. Our key contributions can be summarized as follows:

<aside> 🛠

Figure 1: Overview of MoE-TTS. MoE-TTS is initialized from a pre-trained textual LLM and transforms key components in the original Transformer blocks into mixture-of-expert layers. The original weights serve as text experts and the newly incorporated weights serve as speech experts.

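The routing scheme in Figure 1 can be sketched as follows. This is a minimal, hypothetical PyTorch illustration (not the authors' implementation): a Transformer linear layer becomes a two-expert MoE where the original weights serve as a frozen text expert and a copy, fine-tuned on speech, serves as the speech expert; tokens are routed deterministically by modality rather than by a learned gate. The class name `ModalityMoELinear` and the `is_speech` mask are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ModalityMoELinear(nn.Module):
    """Sketch of a modality-routed MoE layer: frozen text expert
    (original LLM weights) plus a trainable speech expert."""

    def __init__(self, text_linear: nn.Linear):
        super().__init__()
        # Original weights act as the text expert and stay frozen.
        self.text_expert = text_linear
        for p in self.text_expert.parameters():
            p.requires_grad = False
        # Speech expert is initialized from the text expert's weights
        # and is the only part updated during TTS training.
        self.speech_expert = nn.Linear(
            text_linear.in_features,
            text_linear.out_features,
            bias=text_linear.bias is not None,
        )
        self.speech_expert.load_state_dict(text_linear.state_dict())

    def forward(self, x: torch.Tensor, is_speech: torch.Tensor) -> torch.Tensor:
        # is_speech: boolean mask [batch, seq] marking speech-modality tokens.
        # Each token goes through exactly one expert, chosen by modality.
        text_out = self.text_expert(x)
        speech_out = self.speech_expert(x)
        return torch.where(is_speech.unsqueeze(-1), speech_out, text_out)
```

Because routing is decided by token modality instead of a learned router, no auxiliary load-balancing loss is needed, and text tokens always see the unmodified pre-trained weights, preserving the LLM's text understanding.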

Build MoE-TTS Step by Step