Conference Paper: Speaker Diarization Using Multi-Modal i-vectors

西, 史人 (Nishi, Fumito); 井上, 中順 (Inoue, Nakamasa); 篠田, 浩一 (Shinoda, Koichi)

We propose multi-modal i-vectors, which extend the audio i-vector framework for speaker verification to multi-modal speaker diarization in movies. In addition to the audio i-vector, which represents a speech utterance in an audio stream as a low-dimensional vector, we extract a visual i-vector from the faces in a video segment. The audio and visual i-vectors are concatenated into a multi-modal i-vector, and the resulting vectors are clustered in an unsupervised way. We evaluate our method on the Hannah movie dataset. Our experiments show that the diarization error rate improves from 68.3% with the audio stream alone to 65.5% with multi-modal i-vectors.
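The concatenate-then-cluster step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which normalization or clustering algorithm is used, so everything here is an assumption for illustration: length normalization of each modality, average-linkage agglomerative clustering with cosine distance, a known number of speakers, and hand-made toy vectors standing in for extracted i-vectors. The function names are hypothetical and do not come from the paper.

```python
import math

def l2_normalize(v):
    # Length-normalize so neither modality dominates after concatenation
    # (an assumption; the abstract does not describe any normalization).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def multimodal_ivector(audio_iv, visual_iv):
    # Concatenate the audio and visual i-vectors of one segment.
    return l2_normalize(audio_iv) + l2_normalize(visual_iv)

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def agglomerative_cluster(vectors, n_clusters):
    # Average-linkage agglomerative clustering (assumed algorithm).
    # Repeatedly merges the two closest clusters until n_clusters remain.
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(p, q) for p in clusters[i] for q in clusters[j]]
                d = sum(cosine_distance(vectors[p], vectors[q])
                        for p, q in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    # Convert the cluster membership lists into one label per segment.
    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for m in members:
            labels[m] = label
    return labels

# Toy stand-ins for per-segment i-vectors (real ones come from a trained extractor).
audio_ivs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
visual_ivs = [[0.8, 0.0, 0.1], [0.9, 0.1, 0.0], [0.0, 0.1, 0.9], [0.1, 0.0, 0.8]]

segments = [multimodal_ivector(a, v) for a, v in zip(audio_ivs, visual_ivs)]
labels = agglomerative_cluster(segments, n_clusters=2)
print(labels)
```

In this toy setting, segments that share a label are attributed to the same speaker. In practice the number of speakers is usually unknown, so a stopping threshold on the merge distance would replace the fixed `n_clusters` argument.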
