An Investigation to Transplant Emotional Expressions in DNN-based TTS Synthesis

Abstract

In this paper, we investigate deep neural network (DNN) architectures that transplant emotional expressions to improve the expressiveness of DNN-based text-to-speech (TTS) synthesis. DNNs are expected to be powerful at mapping linguistic information to acoustic features, and several DNN architectures proposed from multi-speaker and/or multi-language perspectives have shown good performance. We extend this idea to emotion transplantation by constructing shared emotion-dependent mappings. The following three types of DNN architecture are examined: (1) the parallel model (PM), with an output layer consisting of both speaker-dependent layers and emotion-dependent layers; (2) the serial model (SM), with an output layer consisting of emotion-dependent layers preceded by speaker-dependent hidden layers; (3) the auxiliary input model (AIM), with an input layer consisting of emotion and speaker IDs as well as linguistic feature vectors. The DNNs were trained on neutral speech uttered by 24 speakers, plus sad and joyful speech uttered by 3 of those 24 speakers. For synthesizing emotions unseen for the target speaker, subjective evaluation tests showed that the PM performs much better than the SM and slightly better than the AIM. The tests also showed that the SM is the best of the three models when the training data includes emotional speech uttered by the target speaker.
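To make the auxiliary input model (AIM) concrete, the sketch below shows its core idea: one-hot speaker and emotion IDs are concatenated with the linguistic feature vector, and all subsequent layers are shared. This is a minimal illustration with assumed layer sizes and random weights (`LING_DIM`, `HID_DIM`, `ACOUSTIC_DIM`, and the two-layer network are not from the paper); only the speaker/emotion counts follow the abstract.

```python
import numpy as np

N_SPEAKERS, N_EMOTIONS = 24, 3                  # 24 speakers; neutral/sad/joyful
LING_DIM, HID_DIM, ACOUSTIC_DIM = 300, 256, 60  # assumed dimensions for illustration

rng = np.random.default_rng(0)
W1 = rng.standard_normal((LING_DIM + N_SPEAKERS + N_EMOTIONS, HID_DIM)) * 0.01
W2 = rng.standard_normal((HID_DIM, ACOUSTIC_DIM)) * 0.01

def aim_forward(ling_feats, speaker_id, emotion_id):
    """Map a linguistic feature vector plus auxiliary IDs to acoustic features."""
    spk = np.eye(N_SPEAKERS)[speaker_id]        # one-hot speaker ID
    emo = np.eye(N_EMOTIONS)[emotion_id]        # one-hot emotion ID
    x = np.concatenate([ling_feats, spk, emo])  # auxiliary inputs joined to features
    h = np.tanh(x @ W1)                         # shared hidden layer
    return h @ W2                               # shared output layer

y = aim_forward(np.zeros(LING_DIM), speaker_id=5, emotion_id=1)
print(y.shape)  # (60,)
```

Because every layer is shared, switching the emotion ID at synthesis time is what lets the AIM produce an emotion the target speaker never uttered in training; the PM and SM instead dedicate separate output-layer branches to speakers and emotions.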

Download Links: IEEE Xplore | APSIPA Archive | University's Repository (TBA)

  • Last updated: 2020/02/14 20:16