Non-parallel training dictionary-based voice conversion with Variational Autoencoder
Vu, Ho-Tuan; Akagi, Masato
2018 RISP International Workshop on Nonlinear Circuits, Communications and Signal Processing (NCSP2018)
698, 2018-03-07, Research Institute of Signal Processing, Japan
In this paper, we present a dictionary-based voice conversion (VC) approach that requires neither parallel data nor linguistic labeling for the training process. Dictionary-based voice conversion is a class of methods that decompose speech into separate factors for manipulation. Non-negative matrix factorization (NMF) is the most common such method: it decomposes the input spectrum into a weighted linear combination of a set of bases (a dictionary) and weights. However, the requirement for parallel training data in this method causes two problems: 1) practical usability is limited when parallel data are not available, and 2) additional error from the alignment process degrades output speech quality. To alleviate these problems, this paper presents a dictionary-based VC approach that incorporates a Variational Autoencoder (VAE) to decompose the input speech spectrum into a speaker dictionary and weights without parallel training data. According to the evaluation results, the proposed method achieves better speech naturalness while retaining the same speaker similarity as NMF-based VC, even though unaligned data are used.
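As background, the NMF decomposition mentioned in the abstract factorizes a non-negative spectrogram V (frequency bins x frames) into a dictionary W of spectral bases and an activation (weight) matrix H, so that V ≈ WH. The sketch below is an illustrative NumPy implementation using the classic Lee–Seung multiplicative updates; the function name, toy data, and all parameters are assumptions for demonstration, not details taken from the paper.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-10, seed=0):
    """Factorize non-negative V (freq x frames) into a dictionary W
    (freq x rank) and activations H (rank x frames) by minimizing the
    Euclidean distance ||V - WH|| with multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    # Random non-negative initialization of dictionary and weights
    W = rng.random((n_freq, rank)) + eps
    H = rng.random((rank, n_frames)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative throughout
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "spectrogram": 20 frequency bins over 30 frames (illustrative data)
V = np.abs(np.random.default_rng(1).standard_normal((20, 30)))
W, H = nmf(V, rank=5)
print(W.shape, H.shape)  # (20, 5) (5, 30)
```

In dictionary-based VC with parallel data, the source and target dictionaries must be frame-aligned so that activations estimated on the source can be reused with the target dictionary; the paper's VAE-based approach removes that alignment requirement.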