Conference Paper


DOI: 10.1109/ICASSP.2013.6638284 · Conference: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

ABSTRACT In this work, a novel training scheme for generating bottleneck features from deep neural networks is proposed. A stack of denoising auto-encoders is first trained in a layer-wise, unsupervised manner. Afterwards, the bottleneck layer and an additional layer are added, and the whole network is fine-tuned to predict target phoneme states. We perform experiments on a Cantonese conversational telephone speech corpus and find that increasing the number of auto-encoders in the network produces more useful features, but requires pre-training, especially when little training data is available. Using additional unlabeled data for pre-training alone yields further gains. Evaluations on larger datasets and on different system setups demonstrate the general applicability of our approach. In terms of word error rate, relative improvements of 9.2% (Cantonese, ML training), 9.3% (Tagalog, BMMI-SAT training), 12% (Tagalog, confusion network combination with MFCCs), and 8.7% (Switchboard) are achieved.
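
The training scheme described in the abstract can be summarized in a short sketch. Below is a minimal, illustrative PyTorch version (the paper itself predates PyTorch); the layer sizes, masking-noise level, input dimensionality, and number of target states are placeholder assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

def pretrain_denoising_autoencoders(frames, sizes, noise=0.2, steps=100):
    """Greedy layer-wise pre-training: each auto-encoder reconstructs its
    clean input from a corrupted copy; its encoder is then kept, and the
    next auto-encoder is trained on the resulting hidden codes."""
    encoders, inputs = [], frames
    for in_dim, out_dim in zip(sizes[:-1], sizes[1:]):
        enc = nn.Sequential(nn.Linear(in_dim, out_dim), nn.Sigmoid())
        dec = nn.Linear(out_dim, in_dim)
        opt = torch.optim.SGD([*enc.parameters(), *dec.parameters()], lr=0.1)
        for _ in range(steps):
            mask = (torch.rand_like(inputs) > noise).float()   # masking noise
            loss = nn.functional.mse_loss(dec(enc(inputs * mask)), inputs)
            opt.zero_grad(); loss.backward(); opt.step()
        encoders.append(enc)
        inputs = enc(inputs).detach()   # hidden codes feed the next layer
    return encoders

feat_dim, n_states = 429, 1500        # assumed: stacked input frames, target states
hidden, bn_dim = 1024, 42             # assumed hidden and bottleneck sizes
frames = torch.randn(256, feat_dim)   # stand-in for real speech frames
encoders = pretrain_denoising_autoencoders(frames, [feat_dim, hidden, hidden])

# Add the bottleneck layer plus one additional layer on top of the
# pre-trained stack, then fine-tune the whole network on phoneme-state targets.
net = nn.Sequential(
    *encoders,
    nn.Linear(hidden, bn_dim), nn.Sigmoid(),   # bottleneck layer
    nn.Linear(bn_dim, hidden), nn.Sigmoid(),   # additional layer
    nn.Linear(hidden, n_states),               # phoneme-state outputs
)
opt = torch.optim.SGD(net.parameters(), lr=0.1)
targets = torch.randint(0, n_states, (256,))   # stand-in state labels
loss = nn.functional.cross_entropy(net(frames), targets)
opt.zero_grad(); loss.backward(); opt.step()
```

After fine-tuning, the features are read off at the narrow bottleneck layer and the classifier layers above it are discarded, which is what makes the network usable as a feature extractor for a conventional GMM/HMM system.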

CITED BY

"It is nearly always used to provide a lower-dimensional representation on top of which a classifier such as logistic regression or a support vector machine is applied. Deep bottleneck features used in speech recognition [17][18] are one example. However, such approaches are less relevant to parametric synthesis, since it is not a classification problem."
    ABSTRACT: Nearly all Statistical Parametric Speech Synthesizers today use Mel Cepstral coefficients as the vocal tract parameterization of the speech signal. Mel Cepstral coefficients were never intended to work in a parametric speech synthesis framework, but as yet, there has been little success in creating a better parameterization that is more suited to synthesis. In this paper, we use deep learning algorithms to investigate a data-driven parameterization technique that is designed for the specific requirements of synthesis. We create an invertible, low-dimensional, noise-robust encoding of the Mel Log Spectrum by training a tapered Stacked Denoising Autoencoder (SDA). This SDA is then unwrapped and used as the initialization for a Multi-Layer Perceptron (MLP). The MLP is fine-tuned by training it to reconstruct the input at the output layer. This MLP is then split down the middle to form encoding and decoding networks. These networks produce a parameterization of the Mel Log Spectrum that is intended to better fulfill the requirements of synthesis. Results are reported for experiments conducted using this resulting parameterization with the ClusterGen speech synthesizer.
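
As a rough illustration of the encode/decode split described in this abstract (not the cited authors' code), the following sketch builds a symmetric, tapered auto-encoder MLP and cuts it at its narrowest layer; all layer dimensions are invented for the example.

```python
import torch.nn as nn

# Tapered auto-encoder MLP: 60 -> 40 -> 20 -> 40 -> 60 (sizes assumed).
# After fine-tuning it to reconstruct its input, it is split in the middle.
blocks = [nn.Sequential(nn.Linear(i, o), nn.Tanh())
          for i, o in [(60, 40), (40, 20), (20, 40), (40, 60)]]
autoencoder = nn.Sequential(*blocks)

encoder = nn.Sequential(*blocks[:2])  # Mel log spectrum -> 20-dim parameters
decoder = nn.Sequential(*blocks[2:])  # 20-dim parameters -> Mel log spectrum
```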
"Empirical studies have also shown that deep models can subsume many carefully designed speaker adaptation techniques [25]. Moreover, it has been observed that intermediate-layer representations (bottleneck features) taken from a deep neural network can considerably improve the performance of GMM-based acoustic modeling [14][15]. However, there is still limited formal understanding of how or why deep architectures lead to effective representations."
ABSTRACT: We propose a multi-layer feature extraction framework for speech, capable of providing invariant representations. A set of templates is generated by sampling the result of applying smooth, identity-preserving transformations (such as vocal tract length and tempo variations) to arbitrarily-selected speech signals. Templates are then stored as the weights of "neurons". We use a cascade of such computational modules to factor out different types of transformation variability in a hierarchy, and show that it improves phone classification over baseline features. In addition, we describe empirical comparisons of a) different transformations which may be responsible for the variability in speech signals and of b) different ways of assembling template sets for training. The proposed layered system is an effort towards explaining the performance of recent deep learning networks and the principles by which the human auditory cortex might reduce the sample complexity of learning in speech recognition. Our theory and experiments suggest that invariant representations are crucial in learning from complex, real-world data like natural speech. Our model is built on basic computational primitives of cortical neurons, thus making an argument about how representations might be learned in the human auditory cortex.
INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, September 2014
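
A minimal sketch of the template-and-pooling idea from this abstract (a generic illustration, not the cited authors' implementation): the input is projected onto transformed copies of each template, and pooling over the transformation orbit yields an approximately invariant feature. The circular-shift group below is an assumed stand-in for transformations such as tempo variation.

```python
import torch

def invariant_signature(x, templates, transforms):
    """x: (d,) signal; templates: list of (d,) tensors;
    transforms: list of functions mapping a (d,) tensor to a (d,) tensor."""
    feats = []
    for t in templates:
        # Dot products with every transformed template = orbit of responses.
        responses = torch.stack([x @ g(t) for g in transforms])
        feats.append(responses.max())   # pooling discards the transformation
    return torch.stack(feats)

# Toy usage: circular shifts as the (assumed) transformation group.
d = 16
shifts = [lambda t, k=k: torch.roll(t, k) for k in range(d)]
sig = invariant_signature(torch.randn(d),
                          [torch.randn(d) for _ in range(8)], shifts)
```

Because max pooling ranges over the full shift orbit, the signature is unchanged when the same shift is applied to the input, which is the invariance property the abstract argues for.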
"The max-pooling layer is inserted after each convolution layer." (quoted alongside the citing paper's Figure 4, "Architecture for the Deep Bottleneck Feature (DBNF) network" [12])
    ABSTRACT: The Kaldi toolkit is becoming popular for constructing automated speech recognition (ASR) systems. Meanwhile, in recent years, deep neural networks (DNNs) have shown state-of-the-art performance on various ASR tasks. This document describes our open-source recipes to implement fully-fledged DNN acoustic modeling using Kaldi and PDNN. PDNN is a lightweight deep learning toolkit developed under the Theano environment. Using these recipes, we can build up multiple systems including DNN hybrid systems, convolutional neural network (CNN) systems and bottleneck feature systems. These recipes are directly based on the Kaldi Switchboard 110-hour setup. However, adapting them to new datasets is easy to achieve.