A study on quality improvement of HMM-based synthesized voices using asymmetric bilinear model
Dinh-Anh, Tuan
Morikawa, Daisuke
Akagi, Masato
2016 RISP International Workshop on Nonlinear Circuits, Communications and Signal Processing (NCSP'16)
16, 2016-03, Research Institute of Signal Processing, Japan (信号処理学会)
HMM-based synthesized voices are intelligible but not natural, especially under limited-data conditions, because the speech spectra are over-smoothed in the time-frequency domain. Improving naturalness is a critical problem in HMM-based speech synthesis. One solution is to use voice conversion techniques to convert over-smoothed spectra into natural spectra. Although conventional conversion techniques transform speech spectra toward natural ones to improve naturalness, they introduce unexpected distortions that degrade the otherwise acceptable intelligibility of synthesized speech. The aim of this paper is to improve naturalness without degrading the intelligibility of synthesized speech by employing an asymmetric bilinear model (ABM) to separate intelligibility and naturalness. In this paper, an ABM is implemented in the modulation spectrum domain of the Mel-cepstral coefficient (MCC) sequence to enhance the fine structure of the spectral parameter trajectories generated from HMMs. Subjective evaluations carried out on English data confirm that the naturalness achieved by the proposed method is competitive with other methods under large-data conditions and outperforms them under limited-data conditions. Moreover, a modified rhyme test (MRT) shows that the acceptable intelligibility of synthesized speech is well preserved by the proposed method.
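
The abstract does not give implementation details, so the following is only a minimal sketch of the two ingredients it names: a modulation spectrum of an MCC sequence, and a generic asymmetric bilinear model fitted by SVD, in which an observation is modeled as a style-specific matrix times a content vector. The function names (`modulation_spectrum`, `fit_abm`, `translate_style`), the stacked data layout, and the reading of "style" as naturalness (synthesized vs. natural) and "content" as intelligibility are assumptions made for illustration, not the authors' method.

```python
import numpy as np


def modulation_spectrum(mcc, n_fft=None):
    """Magnitude spectrum of each MCC trajectory along the time axis.

    mcc: array of shape (n_frames, n_coeffs). Hypothetical helper; the
    paper's exact definition (windowing, log scaling) is not given here.
    """
    return np.abs(np.fft.rfft(mcc, n=n_fft, axis=0))


def fit_abm(Y, n_styles, model_dim):
    """Fit y^{sc} ~= A^s b^c by truncated SVD.

    Y: array of shape (n_styles * K, n_contents), where the K-dimensional
       observations for each style are stacked vertically (assumed layout).
    Returns A with shape (n_styles, K, model_dim) and B with shape
    (model_dim, n_contents).
    """
    K = Y.shape[0] // n_styles
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    A = U[:, :model_dim] * s[:model_dim]   # stacked style matrices
    B = Vt[:model_dim, :]                  # content vectors as columns
    return A.reshape(n_styles, K, model_dim), B


def translate_style(A, y_src, src, dst):
    """Estimate content from style `src`, then re-render it in style `dst`."""
    b = np.linalg.pinv(A[src]) @ y_src     # least-squares content estimate
    return A[dst] @ b


# Toy usage with synthetic data (not speech): 2 styles, 6-dim observations.
rng = np.random.default_rng(0)
A_toy = rng.standard_normal((2, 6, 3))
B_toy = rng.standard_normal((3, 10))
Y_toy = A_toy.reshape(12, 3) @ B_toy
A_fit, B_fit = fit_abm(Y_toy, n_styles=2, model_dim=3)
y_converted = translate_style(A_fit, Y_toy[:6], src=0, dst=1)
```

In this sketch the content vectors are shared across styles, so converting from one style to another changes only the style matrix. That is the property that would allow naturalness to be modified while the content factor is left untouched, under the assumed mapping of factors.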