Conference Paper Multimodal Speech Recognition Using Mouth Images from Depth Camera

Yasui, Yuki; Inoue, Nakamasa; Iwano, Koji; Shinoda, Koichi

pp. 1233-1236, 2017-12
Deep learning has proved effective in multimodal speech recognition using frontal facial images. In this paper, we propose a new deep learning method, a trimodal deep autoencoder, which takes as inputs not only audio signals and face images but also depth images of faces. We collected continuous speech data from 20 speakers with Kinect 2.0 and used them for our evaluation. In experiments at 10 dB SNR, our method reduced errors by 30% relative to audio-only speech recognition, from 34.6% to 24.2%. It is particularly effective for recognizing some consonants, including /k/ and /t/.
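
The 30% figure in the abstract is a relative reduction, not an absolute one: the error rate falls by about 10.4 percentage points, which is roughly 30% of the audio-only baseline. The short sketch below only reproduces that arithmetic from the two rates quoted above. The function and variable names are illustrative assumptions and are not taken from the paper.

```python
def relative_error_reduction(baseline: float, improved: float) -> float:
    """Return the fraction of the baseline error rate removed by the improved system."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline


# Error rates (%) quoted in the abstract for the 10 dB SNR condition.
audio_only = 34.6  # audio-only speech recognition
trimodal = 24.2    # proposed method (audio + face images + depth images)

absolute_drop = audio_only - trimodal  # in percentage points
relative_drop = relative_error_reduction(audio_only, trimodal)  # as a fraction

print(f"absolute: {absolute_drop:.1f} points, relative: {relative_drop:.1%}")
```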
