AURORA-2J: An evaluation framework for Japanese noisy speech recognition

Shinshu University, Shonai, Nagano, Japan
IEICE Transactions on Information and Systems (Impact Factor: 0.19). 03/2005; E88-D(3). DOI: 10.1093/ietisy/e88-d.3.535

ABSTRACT: This paper introduces AURORA-2J, an evaluation framework for Japanese noisy speech recognition. Speech recognition systems still need to become more robust to noisy environments, and this improvement requires a standard evaluation corpus and assessment technology. Recently, the Aurora 2, 3 and 4 corpora and their evaluation scenarios have had a significant impact on noisy speech recognition research. AURORA-2J is a Japanese connected-digits corpus whose evaluation scripts were designed in the same way as those of Aurora 2, with the help of the European Telecommunications Standards Institute (ETSI) AURORA group. This paper describes the data collection, the baseline scripts, and their baseline performance. We also propose a new performance analysis method that accounts for differences in recognition performance among speakers. The method is based on word accuracy per speaker and reveals the degree of individual variation in recognition performance. Finally, we propose a categorization of the modifications applied to the original HTK baseline system, which helps in comparing systems and in identifying the technologies that improve performance most within the same category.
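The abstract does not spell out which per-speaker statistics the authors report, so the following is only a minimal sketch under stated assumptions. It uses the standard HTK definition of word accuracy, Acc = (N - S - D - I) / N × 100, where N is the number of reference words and S, D, I are substitutions, deletions and insertions, and applies it per speaker instead of once over the pooled test set. The names `SpeakerCounts`, `word_accuracy`, `per_speaker_summary` and the `threshold` parameter are hypothetical illustrations, not part of the AURORA-2J scripts.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class SpeakerCounts:
    """Alignment counts for one speaker (hypothetical container)."""
    n: int  # reference words
    s: int  # substitutions
    d: int  # deletions
    i: int  # insertions


def word_accuracy(c: SpeakerCounts) -> float:
    """HTK-style word accuracy in percent: (N - S - D - I) / N * 100."""
    if c.n <= 0:
        raise ValueError("at least one reference word is required")
    return 100.0 * (c.n - c.s - c.d - c.i) / c.n


def per_speaker_summary(counts: dict[str, SpeakerCounts],
                        threshold: float = 90.0) -> dict[str, float]:
    """Compare pooled accuracy with the spread of per-speaker accuracies.

    `threshold` is an illustrative cut-off (assumption) used to report
    the share of speakers whose accuracy falls below it.
    """
    per_spk = sorted(word_accuracy(c) for c in counts.values())
    # Pooled accuracy: sum the counts over all speakers first, then score once.
    pooled = SpeakerCounts(
        n=sum(c.n for c in counts.values()),
        s=sum(c.s for c in counts.values()),
        d=sum(c.d for c in counts.values()),
        i=sum(c.i for c in counts.values()),
    )
    return {
        "pooled": word_accuracy(pooled),
        "mean": mean(per_spk),
        "std": pstdev(per_spk),
        "min": per_spk[0],
        "max": per_spk[-1],
        "share_below": sum(a < threshold for a in per_spk) / len(per_spk),
    }


if __name__ == "__main__":
    demo = {
        "A": SpeakerCounts(n=100, s=2, d=1, i=1),   # 96.0 %
        "B": SpeakerCounts(n=100, s=10, d=5, i=5),  # 80.0 %
        "C": SpeakerCounts(n=50, s=1, d=0, i=0),    # 98.0 %
    }
    print(per_speaker_summary(demo))
```

With the made-up counts in the demo, the pooled accuracy is 90.0 % while individual speakers range from 80.0 % to 98.0 %. A single pooled figure hides exactly this kind of individual difference, which a per-speaker analysis is meant to expose.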

  • ABSTRACT: Conventional features for Automatic Speech Recognition and Sound Event Recognition, such as Mel-Frequency Cepstral Coefficients (MFCCs), have been shown to perform poorly in noisy conditions. We introduce an auditory feature based on the gammatone filterbank, the Selective Gammatone Envelope Feature (SGEF), for Robust Sound Event Recognition, in which channel selection and the filterbank envelopes are used to reduce the effect of noise in specific noise environments. In experiments with Hidden Markov Model (HMM) recognizers, we show that our feature significantly outperforms MFCCs in four different noisy environments at various signal-to-noise ratios.
    IEICE Transactions on Information and Systems 05/2012; E95.D(5):1229-1237. DOI:10.1587/transinf.E95.D.1229 · 0.19 Impact Factor
  • ABSTRACT: Cloud computing brings several advantages, such as flexibility, scalability, and ubiquity, in data acquisition, storage, and transmission. This can greatly benefit remote healthcare, among other applications. This paper proposes a cloud-based framework for speech-enabled healthcare. In the proposed framework, patients, or any healthy person seeking medical assistance, can send requests by speech commands. The commands are managed and processed in the cloud server. Any doctor with proper authentication can receive a request and, by analyzing it, assist the patient or person. This paper also proposes a new feature extraction technique, the interlaced derivative pattern (IDP), for an automatic speech recognition (ASR) system to be deployed on the cloud server. The IDP exploits the relative Mel-filter bank coefficients of the speech signal along different neighborhood directions. Experimental results show that the proposed IDP-based ASR system performs reasonably well even when the speech is transmitted via smartphones.
    Cluster Computing 01/2015; DOI:10.1007/s10586-015-0439-7 · 0.95 Impact Factor
  • ABSTRACT: We investigated the clustering of nonstationary noise used in the subjective and objective assessment of speech intelligibility. The 15-dimensional feature vector used for clustering consists of features typically used in MIR, and clustering was performed with the x-means method. We then validated the clustering results in tests using the Japanese Diagnostic Rhyme Test. We found that the noise in the JEIDA-NOISE database can be classified into three clusters, with significant differences in speech intelligibility among the clusters. Finally, we tested objective speech intelligibility assessment for each cluster using fwSNRseg and a logistic function. Clustering improved the performance of the objective assessment by about 0.01 compared with the case without clustering.
    Electronics and Communications in Japan 05/2014; 97(5):43–52. DOI:10.1002/ecj.11609 · 0.19 Impact Factor
