Semi-Supervised Speaker Adaptation for End-to-End Speech Synthesis with Pretrained Models [ABELAB]

sound, speech synthesis, Katsuki Inoue, Sunao Hara, Masanobu Abe

Semi-Supervised Speaker Adaptation for End-to-End Speech Synthesis with Pretrained Models

Katsuki Inoue, Sunao Hara, Masanobu Abe, Tomoki Hayashi, Ryuichi Yamamoto, Shinji Watanabe
ICASSP 2020, pp. 7634–7638, Barcelona, Spain, May 2020. (Oral/ONLINE PRESENTATION)

Abstract

Recently, end-to-end text-to-speech (TTS) models have achieved a remarkable performance, however, requiring a large amount of paired text and speech data for training. On the other hand, we can easily collect unpaired dozen minutes of speech recordings for a target speaker without corresponding text data. To make use of such accessible data, the proposed method leverages the recent great success of state-of-the-art end-to-end automatic speech recognition (ASR) systems and obtains corresponding transcriptions from pretrained ASR models. Although these models could only provide text output instead of intermediate linguistic features like phonemes, end-to-end TTS can be well trained with such raw text data directly. Thus, the proposed method can greatly simplify a speaker adaptation pipeline by consistently employing end-to-end ASR/TTS ecosystems. The experimental results show that our proposed method achieved comparable performance to a paired data adaptation method in terms of subjective speaker similarity and objective cepstral distance measures.

Download Links: IEEE Xplore | University's Repository (TBA)