Conference Paper

EXTRACTING DEEP BOTTLENECK FEATURES USING STACKED AUTO-ENCODERS

DOI: 10.1109/ICASSP.2013.6638284 Conference: International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Volume: 38

ABSTRACT: In this work, a novel training scheme for generating bottleneck features from deep neural networks is proposed. A stack of denoising auto-encoders is first trained in a layer-wise, unsupervised manner. Afterwards, the bottleneck layer and an additional layer are added and the whole network is fine-tuned to predict target phoneme states. We perform experiments on a Cantonese conversational telephone speech corpus and find that increasing the number of auto-encoders in the network produces more useful features, but requires pre-training, especially when little training data is available. Using more unlabeled data for pre-training only yields additional gains. Evaluations on larger datasets and on different system setups demonstrate the general applicability of our approach. In terms of word error rate, relative improvements of 9.2% (Cantonese, ML training), 9.3% (Tagalog, BMMI-SAT training), 12% (Tagalog, confusion network combinations with MFCCs), and 8.7% (Switchboard) are achieved.
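The two-stage scheme from the abstract — layer-wise unsupervised pre-training of denoising auto-encoders, followed by appending a narrow bottleneck layer — can be sketched in plain numpy. This is a minimal illustration, not the authors' implementation: the 39-dimensional input, the 128-unit hidden layers, the 42-unit bottleneck, the masking-noise level, and the learning rate are all illustrative choices, and the supervised fine-tuning on phoneme states is only indicated, not implemented.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DenoisingAutoencoder:
    """One layer of the stack: corrupt the input, encode, reconstruct."""

    def __init__(self, n_in, n_hidden, noise=0.2, lr=0.1):
        self.W = rng.normal(0.0, 0.1, (n_in, n_hidden))  # tied weights
        self.b = np.zeros(n_hidden)   # encoder bias
        self.c = np.zeros(n_in)       # decoder bias
        self.noise, self.lr = noise, lr

    def encode(self, x):
        return sigmoid(x @ self.W + self.b)

    def train_step(self, x):
        # Masking noise: randomly zero a fraction of the input dimensions.
        xc = x * (rng.random(x.shape) > self.noise)
        h = self.encode(xc)
        r = sigmoid(h @ self.W.T + self.c)      # tied-weight reconstruction
        # Squared-error gradient, backpropagated through both sigmoids.
        dz2 = (r - x) * r * (1 - r)
        dz1 = (dz2 @ self.W) * h * (1 - h)
        dW = xc.T @ dz1 + dz2.T @ h             # tied weights: both paths
        n = len(x)
        self.W -= self.lr * dW / n
        self.b -= self.lr * dz1.sum(axis=0) / n
        self.c -= self.lr * dz2.sum(axis=0) / n

# Layer-wise pre-training: each auto-encoder learns to reconstruct the
# previous layer's (clean) output.
data = rng.random((64, 39))   # stand-in for 39-dim acoustic feature frames
layers, x = [], data
for n_hidden in (128, 128):
    da = DenoisingAutoencoder(x.shape[1], n_hidden)
    for _ in range(50):
        da.train_step(x)
    layers.append(da)
    x = da.encode(x)

# After pre-training, a narrow bottleneck layer (and a classification layer
# on top of it) would be appended and the whole stack fine-tuned on phoneme
# state targets; here we only show the forward pass to the bottleneck.
bottleneck_W = rng.normal(0.0, 0.1, (x.shape[1], 42))
bottleneck = np.tanh(x @ bottleneck_W)
print(bottleneck.shape)
```

After fine-tuning, the activations of the bottleneck layer would be taken as features for a conventional GMM/HMM recognizer, which is what the word-error-rate results above evaluate.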


Available from: Jonas Gehring, Jun 03, 2015
  •
    ABSTRACT: Two problems make Spoken Term Detection (STD) particularly challenging under low-resource conditions: the low quality of speech recognition hypotheses, and a high number of out-of-vocabulary (OOV) words. In this paper, we propose an intuitive way to handle OOV terms for STD on word-based Confusion Networks using phonetic similarities, and generalize it into a probabilistic and vocabulary-independent retrieval framework. We then reflect on how several heuristics and Machine Learning based methods can be incorporated into this framework to improve retrieval performance. We present experimental results on several low-resource languages from IARPA's Babel program, such as Assamese, Bengali, Haitian, and Lao.
    15th Annual Conference of the International Speech Communication Association (INTERSPEECH), 2014, Singapore; 09/2014
  •
    ABSTRACT: In this work, we propose a deep bottleneck feature architecture that is able to leverage data from multiple languages. We also show that tonal features are helpful for non-tonal languages. Evaluations are performed on a low-resource conversational telephone speech transcription task in Bengali, while additional data for DBNF training is provided in Assamese, Pashto, Tagalog, Turkish, and Vietnamese. We obtain relative reductions of up to 17.3% and 9.4% WER over mono-lingual GMMs and DBNFs, respectively.
    ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 05/2014
  •
    ABSTRACT: In the IARPA-sponsored program BABEL, we are faced with the challenge of training automatic speech recognition systems in sparse data conditions in very little time. In this paper, we show that by using multilingual bootstrapping techniques in combination with multilingual deep belief bottleneck features that are only fine-tuned on the target language, the training time of an LVCSR system can be essentially halved while the word error rate stays the same. We show this for recognition systems on Tagalog, making use of multilingual systems trained on the other four languages of the Babel base period: Cantonese, Pashto, Turkish, and Vietnamese.
    ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 05/2014