Module Comparison of Transformer-TTS for Speaker Adaptation based on Fine-tuning
  • Proceedings of APSIPA Annual Summit and Conference 2020, pp. 826–830, virtual conference (Auckland, New Zealand), Dec. 2020. (Oral presentation, online)

Abstract

End-to-end text-to-speech (TTS) models have recently achieved remarkable results. However, training such models requires a large amount of text and audio data. Speaker adaptation methods based on fine-tuning have been proposed for constructing a TTS model from small-scale data. Although these methods can replicate the target speaker's voice quality, the synthesized speech suffers from deletions and/or repetitions of speech. Since the goal of speaker adaptation is to match the voice quality of the target speaker, fine-tuning only the necessary modules should reduce the amount of adaptation data required. In this paper, we clarify the role of each module in Transformer-TTS by leaving it unchanged (frozen) during fine-tuning. Specifically, we froze the character embedding, the encoder, and the layer that predicts the stop token, and we also omitted the loss function for estimating sentence endings. The experimental results showed the following: (1) fine-tuning the character embedding did not improve the deletions and/or repetitions of speech; (2) speech deletions increased when the encoder was not fine-tuned; (3) speech deletions were suppressed when the layer predicting the stop token was not fine-tuned; and (4) speech repetitions occurred frequently at sentence endings when the loss function estimating sentence endings was omitted.
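
The module-freezing setup described in the abstract can be sketched in code. Below is a minimal illustration in PyTorch (the paper does not specify a framework); the attribute names `embedding`, `encoder`, and `stop_linear`, the function names, and the learning rate are hypothetical, chosen only to mirror the paper's experimental conditions, and do not reflect the authors' actual implementation.

    import torch
    import torch.nn as nn

    def freeze(module: nn.Module) -> None:
        # Exclude the module's parameters from gradient updates,
        # so fine-tuning leaves it at its pretrained values.
        for param in module.parameters():
            param.requires_grad = False

    def setup_finetuning(model: nn.Module,
                         freeze_embedding: bool = False,
                         freeze_encoder: bool = False,
                         freeze_stop_layer: bool = False) -> torch.optim.Optimizer:
        # Freeze the selected modules, then optimize only what remains.
        # `embedding`, `encoder`, and `stop_linear` are hypothetical
        # attribute names for the character embedding, encoder, and
        # stop-token prediction layer of a Transformer-TTS model.
        if freeze_embedding:
            freeze(model.embedding)
        if freeze_encoder:
            freeze(model.encoder)
        if freeze_stop_layer:
            freeze(model.stop_linear)
        trainable = [p for p in model.parameters() if p.requires_grad]
        return torch.optim.Adam(trainable, lr=1e-4)

    def total_loss(mel_loss: torch.Tensor, stop_loss: torch.Tensor,
                   use_stop_loss: bool = True) -> torch.Tensor:
        # Condition (4): omit the stop-token (sentence-ending) loss term.
        return mel_loss + stop_loss if use_stop_loss else mel_loss

For example, `setup_finetuning(model, freeze_encoder=True)` would correspond to the condition in result (2), where the encoder is not fine-tuned, and `total_loss(mel_loss, stop_loss, use_stop_loss=False)` to the condition in result (4).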

Download Links: IEEE Xplore | PDF(APSIPA) | University's Repository (TBA)

  • Last updated: 2021/01/19 10:38