AURORA-2J: An evaluation framework for Japanese noisy speech recognition

Shinshu University, Shonai, Nagano, Japan
IEICE Transactions on Information and Systems (Impact Factor: 0.21). 03/2005; E88-D(3). DOI: 10.1093/ietisy/e88-d.3.535


This paper introduces AURORA-2J, an evaluation framework for Japanese noisy speech recognition. Speech recognition systems still need to become more robust to noisy environments, and such improvement requires standard evaluation corpora and assessment methodologies. Recently, the Aurora 2, 3, and 4 corpora and their evaluation scenarios have had a significant impact on noisy speech recognition research. AURORA-2J is a Japanese connected-digits corpus whose evaluation scripts are designed in the same way as those of Aurora 2, with the help of the European Telecommunications Standards Institute (ETSI) AURORA group. This paper describes the data collection, the baseline scripts, and the baseline performance. We also propose a new performance analysis method that considers differences in recognition performance among speakers. The method is based on word accuracy per speaker and reveals the degree of individual variation in recognition performance. In addition, we propose a categorization of the modifications applied to the original HTK baseline system, which helps in comparing systems and in identifying the technologies that best improve performance within each category.
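The per-speaker analysis described in the abstract can be illustrated with a short sketch. The speaker labels and error counts below are made up for illustration; the accuracy formula is the standard HTK-style word accuracy, (N - S - D - I) / N, computed per speaker and then summarized to expose individual variation.

```python
# Hypothetical sketch of per-speaker word accuracy analysis.
# Error counts (N, S, D, I) per speaker are invented for illustration.
from statistics import mean, stdev

def word_accuracy(n_words, subs, dels, ins):
    """HTK-style word accuracy: (N - S - D - I) / N."""
    return (n_words - subs - dels - ins) / n_words

per_speaker = {
    "spk01": (100, 4, 2, 1),
    "spk02": (100, 12, 5, 3),
    "spk03": (100, 6, 1, 0),
}

acc = {spk: word_accuracy(*counts) for spk, counts in per_speaker.items()}
# The spread (e.g. standard deviation) shows how much recognition
# performance differs across speakers, not just on average.
print(acc)
print(f"mean={mean(acc.values()):.3f} stdev={stdev(acc.values()):.3f}")
```

Comparing the per-speaker spread across systems, rather than only the mean accuracy, is the point of the proposed analysis.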



    • "0.10, 0.45, 0.55}. The silence and clean speech GMMs with 32 Gaussian distributions were trained by using speech data for the clean HMM training of CENSREC-1, also known as AURORA-2J (Nakamura et al., 2005). The digit contexts of AURORA-2J are exactly the same as those of AURORA-2 (Hirsch and Pearce, 2000). [Fig. 7: Example likelihood with Gaussian pruning and weight normalization.] "
    ABSTRACT: This paper proposes a robust voice activity detection (VAD) method that operates in the presence of noise. For noise-robust VAD, we have already proposed statistical models and a switching Kalman filter (SKF)-based technique. In this paper, we focus on a model re-estimation method using Gaussian pruning with weight normalization. The statistical model for SKF-based VAD is constructed using Gaussian mixture models (GMMs), and consists of pre-trained silence and clean speech GMMs and a sequentially estimated noise GMM. However, the composed model is not optimal in that it does not fully reflect the characteristics of the observed signal. Thus, to ensure the optimality of the composed model, we investigate a method for its re-estimation that reflects the characteristics of the observed signal sequence. Since our VAD method works through frame-wise sequential processing, processing with the smallest latency is very important. In this case, there are insufficient re-training data for a re-estimation of all the Gaussian parameters. To solve this problem, we propose a model re-estimation method that involves the extraction of reliable characteristics using Gaussian pruning with weight normalization. Namely, the proposed method re-estimates the model by pruning non-dominant Gaussian distributions that express the local characteristics of each frame and by normalizing the Gaussian weights of the remaining distributions. In an experiment using a speech corpus for VAD evaluation, CENSREC-1-C, the proposed method significantly improved the VAD performance compared with that of the original SKF-based VAD. This result confirmed that the proposed Gaussian pruning contributes to an improvement in VAD accuracy.
    Full-text · Article · Feb 2012 · Speech Communication
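The core operation of the abstract above, pruning non-dominant mixture components and renormalizing the surviving weights, can be sketched as follows. This is a simplified illustration that ranks components by their global mixture weight; in the actual method, dominance is judged frame-wise from the observed signal, and the numbers here are invented.

```python
import numpy as np

def prune_and_normalize(weights, means, covs, keep):
    """Keep the `keep` most dominant mixture components (here ranked by
    mixture weight) and renormalize the remaining weights to sum to one."""
    order = np.argsort(weights)[::-1][:keep]  # indices of dominant components
    w = weights[order]
    w = w / w.sum()                           # weight normalization after pruning
    return w, means[order], covs[order]

# A toy 4-component GMM (scalar means/variances for simplicity).
weights = np.array([0.5, 0.3, 0.15, 0.05])
means = np.arange(4.0)
covs = np.ones(4)

w, m, c = prune_and_normalize(weights, means, covs, keep=2)
print(w)  # the two dominant weights, renormalized: [0.625 0.375]
```

Renormalizing keeps the pruned model a valid probability distribution, so it can still be used for likelihood computation in the SKF-based detector.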
    • "The test conditions were almost the same as those in Sect. 2.1, except that the noisy continuous digit utterances included in the AURORA-2J database [11] were used and the two different noise reduction algorithms described below were adopted instead of algorithms (E) and (S): (SS) spectral subtraction with smoothing in the time direction [12] and (K) KLT-based comb filtering [13]. "
    ABSTRACT: This paper describes non-reference objective quality evaluation for noise-reduced speech. First, a subjective test is conducted in accordance with ITU-T Rec. P.835 to obtain the speech quality, the noise quality, and the overall quality of noise-reduced speech. Based on the results, we then propose an overall quality estimation model. The unique point of the proposed model is that the overall quality is estimated using only the previously estimated speech quality and noise quality, in contrast to conventional models, which utilize extracted acoustical features. Finally, we propose a non-reference objective quality evaluation method using the proposed model. The results of an experiment with different noise reduction algorithms and noise types confirmed that the proposed method gives more accurate estimates of the overall quality than the method described in ITU-T Rec. P.563.
    Preview · Article · Jun 2010 · IEICE Transactions on Communications
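The model class in the abstract above, predicting overall quality from speech quality and noise quality alone, can be sketched as a least-squares fit. The rating data below is invented for illustration and the linear form is only one plausible choice; the paper's actual model and coefficients would come from its P.835 subjective test results.

```python
import numpy as np

# Invented P.835-style ratings: speech quality (SIG), background noise
# quality (BAK), and overall quality (OVL) for five hypothetical conditions.
sig = np.array([4.2, 3.1, 2.5, 4.5, 3.8])
bak = np.array([3.9, 2.8, 2.0, 4.1, 3.0])
ovl = np.array([4.0, 2.9, 2.1, 4.3, 3.3])

# Fit OVL ~ a*SIG + b*BAK + c by least squares: the overall quality is
# estimated only from the two component quality scores, with no acoustic
# features involved.
X = np.column_stack([sig, bak, np.ones_like(sig)])
coef, *_ = np.linalg.lstsq(X, ovl, rcond=None)
est = X @ coef

print(np.round(coef, 3))
print("max abs residual:", np.max(np.abs(est - ovl)))
```

The appeal of this two-input design is that any method that can estimate SIG and BAK without a reference signal immediately yields a non-reference overall quality estimate.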
    • "Different types of noise, for example, subway, babble, car, exhibition, restaurant, street, airport, and station noise, are added to or convolved with them. In the baseline system [8], there are thirteen recognition units: eleven digit HMMs with 16 states and 20 Gaussian mixture components; one silence HMM with three states; and one short-pause HMM with one state. Thirty-six Gaussian mixture components are used for silence and short pause. "
    ABSTRACT: This paper describes a robust automatic speech recognition (ASR) system with less computation. Acoustic models of a hidden Markov model (HMM)-based classifier include various types of hidden factors such as speaker-specific characteristics, coarticulation, and the acoustic environment. If there exists a canonicalization process that can recover the margin of acoustic likelihoods between correct phonemes and other ones that is degraded by hidden factors, the robustness of ASR systems can be improved. In this paper, we introduce a canonicalization method that is composed of multiple distinctive phonetic feature (DPF) extractors, each corresponding to the canonicalization of one hidden factor, and a DPF selector that selects an optimum DPF vector as the input of the HMM-based classifier. The proposed method resolves gender factors and speaker variability, and eliminates noise factors by applying the canonicalization based on the DPF extractors and two-stage Wiener filtering. In the experiment on AURORA-2J, the proposed method provides higher word accuracy under clean training and a significant improvement of word accuracy at low signal-to-noise ratios (SNRs) under multi-condition training, compared to a standard ASR system with mel-frequency cepstral coefficient (MFCC) parameters. Moreover, the proposed method requires only two-fifths of the Gaussian mixture components and less memory to achieve accurate ASR.
    Full-text · Article · Mar 2008 · IEICE Transactions on Information and Systems
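The baseline topology quoted in the snippet above can be summarized in a small data-structure sketch. The digit labels are placeholders (the actual AURORA-2J models carry Japanese digit names); the state and mixture counts are the ones the snippet reports.

```python
# Sketch of the AURORA-2J HTK baseline recognition units as described in
# the snippet: eleven digit HMMs plus silence and short-pause models.
# Digit labels are placeholders, not the real model names.
digits = [f"digit_{i:02d}" for i in range(1, 12)]  # eleven digit HMMs

baseline = {d: {"states": 16, "mixtures": 20} for d in digits}
baseline["sil"] = {"states": 3, "mixtures": 36}   # silence model
baseline["sp"] = {"states": 1, "mixtures": 36}    # short-pause model

print(len(baseline))  # 13 recognition units in total
```

Writing the topology down this way makes the "thirteen recognition units" count and the per-unit parameters easy to check against a system's HTK model definitions.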