A Study on Robust Speech Recognition with Time Varying Speech Features
Speech feature extraction algorithms have become popular. Speech features can be used for various applications: biometric recognition, speech recognition, speaker identification, and so on. In these applications, a good speech feature can be obtained using Mel frequency cepstrum coefficients (MFCC), Linear Predictive Coding (LPC), Time varying LPC (TVLPC), and Perceptual Linear Predictive (PLP), among others. This thesis focuses on the use of TVLPC among feature extraction algorithms to improve the robustness of automatic speech recognition (ASR) systems against various multiplicative and additive noises. Time varying speech features (TVSF) are implemented in ASR with the aim of improving the recognition accuracy on a number of small reference speech databases. The significance of the study is based on the fact that both additive and multiplicative noises cause great performance degradation of ASR systems, thereby limiting the speech recognition accuracy in real environments. For this reason, feature correction, feature compensation and normalization approaches are considered in order to improve the robustness of a speech recognition system.

The performance degradation is partly due to statistical mismatch between the trained acoustic model of clean speech features and the noisy testing speech features. To reduce this feature-model mismatch, correction, compensation and normalization techniques are employed during both training and testing of speech features.

In order to achieve improved system performance, normalization in the modulation spectrum domain is used to remove non-speech components over a certain frequency range, using running spectrum analysis (RSA) as a band pass filter. In comparison to the other noise reduction techniques used in this study on robust speech recognition, the RSA filter has an advantage due to its adaptable parameters: the first and second pass band frequencies can easily be adjusted.
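As a rough illustration of band pass filtering in the modulation spectrum domain, the sketch below filters each feature trajectory along the frame axis by zeroing modulation frequencies outside a pass band. It is only a minimal sketch under stated assumptions: the function name `modulation_bandpass`, the FFT-masking approach, and the frame rate and pass band values in the usage example are hypothetical and are not taken from the thesis, whose RSA filter design may differ.

```python
import numpy as np


def modulation_bandpass(features, frame_rate_hz, f_low_hz, f_high_hz):
    """Band-pass each feature trajectory along the time (frame) axis.

    features: array of shape (n_frames, n_coeffs), one row per frame.
    frame_rate_hz: number of feature frames per second (assumed value).
    f_low_hz, f_high_hz: adjustable pass band edges in modulation frequency.
    """
    features = np.asarray(features, dtype=float)
    n_frames = features.shape[0]

    # Modulation spectrum: FFT of every coefficient trajectory over time.
    spectrum = np.fft.rfft(features, axis=0)
    mod_freqs = np.fft.rfftfreq(n_frames, d=1.0 / frame_rate_hz)

    # Keep only the components inside the pass band; zero everything else.
    keep = (mod_freqs >= f_low_hz) & (mod_freqs <= f_high_hz)
    spectrum[~keep, :] = 0.0

    # Back to the time domain with the original number of frames.
    return np.fft.irfft(spectrum, n=n_frames, axis=0)


if __name__ == "__main__":
    # Hypothetical example: 2 s of features at 100 frames per second.
    t = np.arange(200) / 100.0
    trajectory = 3.0 + np.sin(2 * np.pi * 4 * t) + np.sin(2 * np.pi * 30 * t)
    filtered = modulation_bandpass(trajectory[:, None], 100.0, 1.0, 16.0)
    print(filtered.shape)
```

In this toy signal the constant offset and the 30 Hz component fall outside the assumed 1 to 16 Hz pass band and are removed, while the 4 Hz component is kept. Changing `f_low_hz` and `f_high_hz` corresponds to adjusting the two pass band frequencies mentioned above.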
In addition, speech feature enhancement using dynamic range adjustment (DRA) is utilized. The enhancement is aimed at correcting the difference between clean and noisy speech features by normalizing the amplitude of the speech features. For the purpose of channel normalization, cepstrum mean subtraction (CMS) is used in this study.

Two alternative time varying speech feature (TVSF) methods are proposed and compared with conventional Mel frequency cepstral coefficient (MFCC) features for noisy speech recognition.

The first experimental study shows that fast Fourier transform (FFT) based Mel frequency cepstrum coefficients (MFCCs) together with directly converted time varying linear prediction (TVLPC) based MFCCs, which in this study are defined as time varying speech features (TVSF), show recognition accuracy competitive with that of FFT based MFCCs alone.

In the second experimental study, robustness of speech recognition is further improved by applying mel filtering and logarithmic transformations to short time windowed time varying coefficients before converting them to cepstrum coefficients, in place of direct-converted TVLPC speech features. Results show that RSA produces better performance than DRA and CMS/DRA on both similar pronunciation phrases and phrases uttered by elderly persons. The experimental study shows that the use of time varying speech features (TVSF) can produce improved speech recognition accuracy even if there is a mismatch between the training and testing data sets.
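The CMS and DRA steps named in the abstract can be sketched as follows. This is a minimal sketch, not the thesis implementation: the function names are hypothetical, CMS is taken to be per-utterance subtraction of each coefficient's time average, and DRA is assumed to scale each coefficient trajectory by its maximum absolute amplitude.

```python
import numpy as np


def cepstral_mean_subtraction(cepstra):
    """Subtract the per-coefficient mean over all frames of an utterance.

    cepstra: array of shape (n_frames, n_coeffs).
    A fixed channel adds a constant offset in the cepstral domain,
    so removing the time average cancels it.
    """
    cepstra = np.asarray(cepstra, dtype=float)
    return cepstra - cepstra.mean(axis=0, keepdims=True)


def dynamic_range_adjustment(cepstra, eps=1e-12):
    """Scale each coefficient trajectory so its peak magnitude is 1.

    Assumption: DRA is modeled here as division by the maximum absolute
    value of each coefficient over the utterance. eps guards against
    division by zero for an all-zero trajectory.
    """
    cepstra = np.asarray(cepstra, dtype=float)
    peak = np.abs(cepstra).max(axis=0, keepdims=True)
    return cepstra / np.maximum(peak, eps)


if __name__ == "__main__":
    # Hypothetical toy features: 3 frames, 2 cepstral coefficients.
    clean = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
    # Simulated constant channel offset.
    shifted = clean + np.array([7.0, -2.0])
    normalized = dynamic_range_adjustment(cepstral_mean_subtraction(shifted))
    print(normalized)
```

Applying CMS before DRA, as in the usage example, corresponds to the CMS/DRA combination compared in the results. Under the assumptions above, the output is unchanged by a constant offset or a positive overall gain applied to the input features.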
Hokkaido University (北海道大学). Doctor of Philosophy (Information Science)