Conference Paper

Perceptual MVDR-based cepstral coefficients (PMCCs) for high accuracy speech recognition.

Conference: 8th European Conference on Speech Communication and Technology, EUROSPEECH 2003 - INTERSPEECH 2003, Geneva, Switzerland, September 1-4, 2003
Source: DBLP
  • ABSTRACT: Acoustic feature extraction from speech constitutes a fundamental component of automatic speech recognition (ASR) systems. In this paper, we propose a novel feature extraction algorithm, perceptual-MVDR (PMVDR), which computes cepstral coefficients from the speech signal. This new feature representation is shown to model the speech spectrum better than traditional feature extraction approaches. Experimental results for small (40-word digits) to medium (5k-word dictation) vocabulary tasks show varying degrees of consistent improvement across different experiments; however, the new front-end is most effective in noisy car environments. The PMVDR front-end uses the minimum variance distortionless response (MVDR) spectral estimator to represent the upper envelope of the speech signal. Unlike Mel frequency cepstral coefficients (MFCCs), the proposed front-end does not utilize a filterbank. The effectiveness of the PMVDR approach is demonstrated by comparing speech recognition accuracy with that of the traditional MFCC front-end and the recently proposed PMCC front-end in both noise-free and real adverse environments. For speech recognition in noisy car environments, a 40-word vocabulary task, the PMVDR front-end provides a 36% relative decrease in word error rate (WER) over the MFCC front-end. Under simulated speaker stress conditions, a 35-word vocabulary task, the PMVDR front-end yields a 27% relative decrease in WER. For a noise-free dictation task with a 5k-word vocabulary, an 8% relative reduction in WER is reported. Finally, a novel analysis technique is proposed to quantify the noise robustness of an acoustic front-end. This analysis is conducted for the acoustic front-ends considered in the paper, and the results are presented.
    Speech Communication, 01/2008.
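
    The MVDR envelope described in the abstract above can be computed directly from linear-prediction (LP) coefficients. The following Python sketch shows that step using the standard LP-to-MVDR parameterization and an FFT-derived autocorrelation; the perceptual frequency warping that PMVDR applies before this stage is omitted here, and all function names and parameter values are illustrative rather than taken from the paper.

    import numpy as np


    def levinson_durbin(r, order):
        """LP coefficients a (a[0] = 1) and prediction error from autocorrelation r."""
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0]
        for i in range(1, order + 1):
            acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
            k = -acc / err
            a_new = a.copy()
            a_new[1:i] += k * a[i - 1:0:-1]
            a_new[i] = k
            a = a_new
            err *= 1.0 - k * k
        return a, err


    def mvdr_envelope(a, err, n_freq=257):
        """MVDR power spectrum from LP coefficients (Murthi-Rao parameterization)."""
        order = len(a) - 1
        mu = np.zeros(order + 1)
        for k in range(order + 1):
            i = np.arange(order - k + 1)
            mu[k] = np.sum((order + 1 - k - 2 * i) * a[i] * a[i + k]) / err
        omega = np.linspace(0.0, np.pi, n_freq)
        denom = mu[0] + 2.0 * np.cos(np.outer(omega, np.arange(1, order + 1))) @ mu[1:]
        return 1.0 / np.maximum(denom, 1e-10)


    def mvdr_cepstra(frame, lp_order=24, n_fft=512, n_ceps=13):
        """Cepstral coefficients of one (already warped) frame via the MVDR envelope."""
        power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
        autocorr = np.fft.irfft(power)[:lp_order + 1]
        a, err = levinson_durbin(autocorr, lp_order)
        env = mvdr_envelope(a, err, n_fft // 2 + 1)
        # Cepstrum: inverse DFT of the log envelope, keeping the low-order terms.
        return np.fft.irfft(np.log(env))[:n_ceps]

    A perceptually warped variant would replace the uniform FFT power spectrum with a frequency-warped one before the autocorrelation step, consistent with the abstract's point that no filterbank is involved.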
  • ABSTRACT: A proven method for achieving effective automatic speech recognition (ASR) in the presence of speaker differences is to perform acoustic feature speaker normalization. More effective speaker normalization methods are needed that require only limited computing resources for real-time performance. The most popular speaker normalization technique is vocal-tract length normalization (VTLN), despite the fact that it is computationally expensive. In this study, we propose a novel online VTLN algorithm entitled built-in speaker normalization (BISN), where normalization is performed on the fly within a newly proposed PMVDR acoustic front-end. The novel aspect of the algorithm is that conventional front-end processing with PMVDR and VTLN requires two separate warping stages, whereas the proposed BISN method uses a single speaker-dependent warp to achieve both the PMVDR perceptual warp and the VTLN warp simultaneously. This improved integration unifies the nonlinear warping performed in the front end and reduces computational requirements, thereby offering advantages for real-time ASR systems. Evaluations are performed for (i) an in-car extended digit recognition task, where an on-the-fly BISN implementation reduces the word error rate (WER) by a relative 24%, and (ii) a diverse noisy speech task (SPINE 2), where the relative WER improvement was 9%, both relative to the baseline speaker normalization method.
    EURASIP Journal on Audio, Speech, and Music Processing, 01/2008.
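
    The single-warp idea can be illustrated with first-order all-pass (bilinear) frequency warps: composing a perceptual warp and a speaker-dependent VTLN warp of this form yields another warp of the same form, so one combined warp factor can do the work of two warping stages. The sketch below assumes this bilinear form and hypothetical warp factors; the paper's exact BISN formulation may differ.

    import numpy as np


    def bilinear_warp(omega, alpha):
        """Warp normalized frequencies omega (radians) with a first-order all-pass map."""
        return omega + 2.0 * np.arctan(alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega)))


    def combined_warp_factor(alpha_perceptual, alpha_speaker):
        """Composing two all-pass warps is equivalent to one warp with this factor."""
        return (alpha_perceptual + alpha_speaker) / (1.0 + alpha_perceptual * alpha_speaker)


    if __name__ == "__main__":
        omega = np.linspace(0.0, np.pi, 257)
        # Hypothetical values: a Mel-like perceptual warp factor and a small
        # per-speaker VTLN factor (e.g. estimated by a likelihood grid search).
        alpha_p, alpha_s = 0.42, 0.05
        two_stage = bilinear_warp(bilinear_warp(omega, alpha_s), alpha_p)
        one_stage = bilinear_warp(omega, combined_warp_factor(alpha_p, alpha_s))
        print(np.max(np.abs(two_stage - one_stage)))  # ~0: one warp replaces two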
  • ABSTRACT: In this paper, we propose a maximum likelihood (ML) based frame selection approach. The fixed frame rate adopted in most state-of-the-art speech recognition systems can cause several problems, such as occasionally landing on noisy frames, assigning the same importance to every frame, and producing a pitch-asynchronous representation. To avoid these problems, our approach selects reliable frames at a fine resolution along the time axis. In a phoneme recognition task, we show that significant improvements are achieved with the frame selection approach compared to a system with a fixed frame rate.
    Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 06/2006.
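
    The abstract above does not spell out the selection criterion, so the sketch below assumes a simple likelihood-based rule: frames are extracted at a fine time shift, scored under a diagonal-covariance GMM, and the most likely frame in each small group is kept, thinning the fine-rate stream back to a conventional rate. The GMM, the grouping factor, and the function names are illustrative assumptions, not the paper's method.

    import numpy as np


    def gmm_loglik(features, weights, means, variances):
        """Per-frame log-likelihood of (T, D) features under a diagonal-covariance GMM."""
        diff = features[:, None, :] - means[None, :, :]                    # (T, K, D)
        comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=-1)
                - 0.5 * np.sum(diff ** 2 / variances, axis=-1))            # (T, K)
        m = comp.max(axis=-1, keepdims=True)                               # log-sum-exp
        return np.squeeze(m + np.log(np.sum(np.exp(comp - m), axis=-1, keepdims=True)), -1)


    def select_frames(features, weights, means, variances, group=4):
        """From frames computed at a fine shift, keep the most likely frame per group."""
        ll = gmm_loglik(features, weights, means, variances)
        keep = [start + int(np.argmax(ll[start:start + group]))
                for start in range(0, len(ll), group)]
        return features[keep], np.array(keep)


    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        feats = rng.normal(size=(400, 13))            # e.g. cepstra at a 2.5 ms shift
        w = np.full(8, 1.0 / 8)
        mu = rng.normal(size=(8, 13))
        var = np.ones((8, 13))
        selected, kept_idx = select_frames(feats, w, mu, var, group=4)
        print(selected.shape, kept_idx[:5])           # 100 frames kept (~10 ms rate)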