AURORA-2J: An Evaluation Framework for Japanese Noisy Speech Recognition

01/2005; DOI: 10.1093/ietisy/e88-d.3.535
Source: OAI

ABSTRACT This paper introduces an evaluation framework for Japanese noisy speech recognition named AURORA-2J. Speech recognition systems must still be made more robust to noisy environments, and this improvement requires the development of a standard evaluation corpus and assessment technologies. Recently, the Aurora 2, 3 and 4 corpora and their evaluation scenarios have had a significant impact on noisy speech recognition research. AURORA-2J is a Japanese connected digits corpus, and its evaluation scripts are designed in the same way as those of Aurora 2, with the help of the European Telecommunications Standards Institute (ETSI) AURORA group. This paper describes the data collection, the baseline scripts, and the baseline performance. We also propose a new performance analysis method that considers differences in recognition performance among speakers. This method is based on the word accuracy per speaker and reveals the degree of individual difference in recognition performance. We also propose a categorization of the modifications applied to the original HTK baseline system, which helps in comparing systems and in identifying the technologies that improve performance most within the same category.
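
The per-speaker analysis mentioned in the abstract can be illustrated with a minimal sketch. It assumes the HTK-style word accuracy, (N - S - D - I) / N, computed separately for each speaker and then summarized by its spread. The function names, the summary statistics, and the counts are hypothetical and chosen for illustration only; the abstract does not specify the exact statistics the paper reports.

```python
from statistics import mean, pstdev


def word_accuracy(n_words, substitutions, deletions, insertions):
    """HTK-style word accuracy in percent: (N - S - D - I) / N * 100."""
    if n_words <= 0:
        raise ValueError("n_words must be positive")
    return 100.0 * (n_words - substitutions - deletions - insertions) / n_words


def per_speaker_summary(counts):
    """Summarize accuracy per speaker.

    counts maps a speaker id to a tuple (N, S, D, I) of reference words,
    substitutions, deletions, and insertions for that speaker.
    """
    acc = {spk: word_accuracy(*c) for spk, c in counts.items()}
    values = list(acc.values())
    return {
        "per_speaker": acc,
        "mean": mean(values),
        "std": pstdev(values),  # population standard deviation across speakers
        "min": min(values),
        "max": max(values),
    }


# Hypothetical counts, not taken from the paper.
example_counts = {
    "spk01": (200, 4, 2, 1),
    "spk02": (200, 20, 10, 5),
    "spk03": (200, 0, 1, 0),
}
summary = per_speaker_summary(example_counts)
```

A large gap between the minimum and maximum, or a large standard deviation, would indicate that an averaged accuracy hides strong speaker-dependent differences.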

  • ABSTRACT: A novel method for extracting features from the frequency modulation (FM) in speech signals is proposed for robust speech recognition. To exploit multistream speech recognizers, each stream should compensate for the shortcomings of the other streams. In this light, FM features are promising as a complement to amplitude modulation (AM) features. To extract effective features from FM patterns, the proposed method applies data-driven modulation analysis to the instantaneous frequency. By evaluating the frequency responses of the temporal filters obtained by the proposed method, we confirmed that the modulation observed around 4 Hz is important for the discrimination of FM patterns, as in the case of AM features. We evaluated the robustness of our method by performing noisy speech recognition experiments. We confirmed that our FM features can improve the noise robustness of speech recognizers even when they are not combined with conventional AM and/or spectral envelope features. We also performed multistream speech recognition experiments. The experimental results show that the combination of the conventional AM system and the proposed FM system reduced word error by 43.6% at 10 dB SNR compared with the baseline MFCC system and by 20.2% compared with the conventional AM system. We investigated the complementarity of the AM and FM features by performing speech recognition experiments in artificial noisy environments. We found the FM features to be robust to wide-band noise, which clearly degrades the performance of AM features. Further, we evaluated the efficiency of multiconditional training. Although the performance of the proposed combination method was degraded by multiconditional training, we confirmed that the performance of the proposed FM method improved. Through a series of experiments, we confirmed that our FM features can be used as independent features as well as complementary features.
    Speech Communication. 01/2011; 53:716-725.
  • ABSTRACT: We propose an environment population projection (EPP) approach for rapid acoustic model adaptation to reduce environment mismatches with limited amounts of adaptation data. This approach consists of two stages: population construction and projection. In the population construction stage, we apply a sampling scheme to the adaptation data to construct an environment population based on acoustic models prepared in the training phase. With this sampling procedure, the environment samples in the population characterize the diverse acoustic information embedded in the adaptation data. Next, the projection stage estimates a function that maps the environment population into one set of acoustic models that matches the testing condition. With a well-constructed environment population, a simple projection function can enable the EPP approach to accurately characterize the testing environment even with a small amount of adaptation data. To examine the rapid adaptation ability of EPP, we used only one adaptation utterance and tested performance in both supervised and unsupervised adaptation modes on the Aurora-2 and Aurora-2J tasks. EPP achieves satisfactory performance in both modes on both tasks. On the Aurora-2J task, for example, EPP gives a relative word error rate (WER) reduction of 13.87% (from 8.58% to 7.39%) over our baseline in the unsupervised adaptation mode.
    2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 06/2011
  • ABSTRACT: This paper proposes a robust voice activity detection (VAD) method that operates in the presence of noise. For noise-robust VAD, we have already proposed statistical models and a switching Kalman filter (SKF)-based technique. In this paper, we focus on a model re-estimation method using Gaussian pruning with weight normalization. The statistical model for SKF-based VAD is constructed using Gaussian mixture models (GMMs), and consists of pre-trained silence and clean speech GMMs and a sequentially estimated noise GMM. However, the composed model is not optimal in that it does not fully reflect the characteristics of the observed signal. Thus, to ensure the optimality of the composed model, we investigate a method for its re-estimation that reflects the characteristics of the observed signal sequence. Since our VAD method works through frame-wise sequential processing, processing with the smallest possible latency is very important. In this case, there are insufficient re-training data for re-estimating all the Gaussian parameters. To solve this problem, we propose a model re-estimation method that extracts reliable characteristics using Gaussian pruning with weight normalization. Namely, the proposed method re-estimates the model by pruning the non-dominant Gaussian distributions that express the local characteristics of each frame and by normalizing the Gaussian weights of the remaining distributions. In an experiment using CENSREC-1-C, a speech corpus for VAD evaluation, the proposed method significantly improved VAD performance compared with that of the original SKF-based VAD. This result confirmed that the proposed Gaussian pruning contributes to an improvement in VAD accuracy.
    Speech Communication. 01/2012; 54:229-244.
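
The idea of Gaussian pruning with weight normalization described in the last abstract can be sketched as follows. This is a simplified illustration, not the authors' re-estimation procedure: it assumes a one-dimensional GMM, ranks components by their weighted likelihood for a single frame, keeps a fixed number of dominant components, and rescales the surviving weights so they sum to one. The function names, the ranking criterion, and the `keep` parameter are assumptions made for this example.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def prune_and_renormalize(x, weights, means, variances, keep):
    """Keep the `keep` components that dominate frame x and renormalize weights.

    Returns the indices of the kept components (in their original order) and
    their rescaled weights, which sum to one.
    """
    if not 1 <= keep <= len(weights):
        raise ValueError("keep must be between 1 and the number of components")
    # Weighted likelihood of each component for this frame (assumed criterion).
    scores = [w * gaussian_pdf(x, m, v)
              for w, m, v in zip(weights, means, variances)]
    ranked = sorted(range(len(weights)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    total = sum(weights[i] for i in kept)
    return kept, [weights[i] / total for i in kept]


# Hypothetical three-component mixture and frame value.
kept, new_weights = prune_and_renormalize(
    x=0.1,
    weights=[0.5, 0.3, 0.2],
    means=[0.0, 5.0, 0.5],
    variances=[1.0, 1.0, 1.0],
    keep=2,
)
```

In this example the component centered at 5.0 contributes almost nothing to the frame at 0.1, so it is pruned and the two remaining weights are rescaled.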
