Conference Paper

Teach an All-rounder with Experts in Different Domains

... However, we observe that increasing the amount of weakly-labeled data does not lead to noticeable gains over using one subset of the data. Inspired by teacher-student studies that facilitate the transfer of clean knowledge (Li et al., 2014; Hinton et al., 2015; You et al., 2019), we further design a teacher-student paradigm with multiple teachers, each trained with a subset of the hard-labeled, weakly-labeled data, and we feed soft-labeled, weakly-labeled/clean data to a student in a two-stage fashion, similar to the two-stage fine-tuning strategy mentioned earlier (Section 5). ...
... In our preliminary experiment, we observe that increasing the amount of weakly-labeled data does not lead to noticeable gains over using one subset of the weakly-labeled data. Inspired by previous teacher-student frameworks (You et al., 2019) that train a multi-domain student model with multiple teacher models to help knowledge transfer, where each teacher model is trained on a clean domain-specific subset, we extend the multi-teacher idea to let models learn better from large-scale weakly-labeled MRC data for improving a target MRC task. As introduced in Section 3, we can have multiple subsets of weakly-labeled data generated based on verbal-nonverbal knowledge extracted by different types of patterns (e.g., B_c or I in Section 2). ...
... Our teacher-student paradigm is most related to multi-domain teacher-student training for automatic speech recognition (You et al., 2019) and machine translation. Instead of clean domain-specific data, each of our teachers is trained with weakly-labeled data. ...
... Compared with previous multi-teacher student paradigms (You et al., 2019; Wang et al., 2020b), to train models to be strong teachers, we conduct iterative training and leverage large-scale weakly-labeled data rather than using clean, human-labeled data of similar tasks. ...
... In previous teacher-student frameworks for domain/knowledge distillation (You et al., 2019; Wang et al., 2020b; Sun et al., 2020a), multiple teachers are trained using different data. However, it is difficult to divide the QA-based weakly-labeled data into subsets by subjects or fine-grained types of knowledge needed for answering questions. ...
... In previous teacher-student frameworks (You et al., 2019; Wang et al., 2020b; Sun et al., 2020a), multiple teacher models are trained using different data. However, it is difficult to divide the weakly-labeled data based on existing QA instances into sub-datasets by subjects or fine-grained types of knowledge required for answering questions. ...
... Compared with previous multi-teacher student paradigms (You et al., 2019; Wang et al., 2020b), we conduct iterative training and leverage large-scale weakly-labeled data to train models to be teachers, instead of using human-labeled clean data of similar tasks. ...
Preprint
Full-text available
In spite of much recent research in the area, it is still unclear whether subject-area question-answering data is useful for machine reading comprehension (MRC) tasks. In this paper, we investigate this question. We collect a large-scale multi-subject multiple-choice question-answering dataset, ExamQA, and use incomplete and noisy snippets returned by a web search engine as the relevant context for each question-answering instance to convert it into a weakly-labeled MRC instance. We then propose a self-teaching paradigm to better use the generated weakly-labeled MRC instances to improve a target MRC task. Experimental results show that we can obtain an improvement of 5.1% in accuracy on a multiple-choice MRC dataset, C^3, demonstrating the effectiveness of our framework and the usefulness of large-scale subject-area question-answering data for machine reading comprehension.
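As a rough illustration of how such weakly-labeled MRC instances can be constructed, the sketch below packs a multiple-choice QA item together with noisy web-search snippets; the class and field names, the snippet concatenation, and the truncation limit are assumptions for illustration, not the authors' actual pipeline.
```python
# Hypothetical sketch: turn a subject-area multiple-choice QA item plus noisy
# search snippets into a weakly-labeled MRC instance. Names are assumptions.
from dataclasses import dataclass
from typing import List


@dataclass
class QAInstance:
    question: str
    options: List[str]
    answer_idx: int          # index of the correct option


@dataclass
class WeakMRCInstance:
    context: str             # concatenated, possibly noisy search snippets
    question: str
    options: List[str]
    answer_idx: int          # label inherited from the QA item, hence only weak supervision


def to_weak_mrc(qa: QAInstance, snippets: List[str], max_chars: int = 2000) -> WeakMRCInstance:
    """Use retrieved snippets as the passage; the snippets may be incomplete or
    noisy and are not guaranteed to support the answer."""
    context = " ".join(snippets)[:max_chars]
    return WeakMRCInstance(context, qa.question, qa.options, qa.answer_idx)


if __name__ == "__main__":
    qa = QAInstance("Which gas do plants mainly absorb for photosynthesis?",
                    ["Oxygen", "Carbon dioxide", "Nitrogen"], 1)
    snippets = ["Plants take in carbon dioxide during photosynthesis ...",
                "Ad: buy biology textbooks online ..."]
    print(to_weak_mrc(qa, snippets))
```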
... We observe that simply combining all the data, either in joint training or separate training, does not lead to noticeable gains compared to using one type of contextualized knowledge. Inspired by the previous work (You et al., 2019) that trains a student automatic speech recognition model with multiple teacher models, where each teacher model is trained on a domain-specific subset with a unique speaking style, we employ a teacher-student paradigm to inject multiple types of contextualized knowledge into a single student machine reader. ...
... Our teacher-student paradigm for knowledge integration is most related to multi-domain teacher-student training for automatic speech recognition (You et al., 2019) and machine translation. Instead of clean domain-specific human-labeled data, each of our teacher models is trained with weakly-labeled data. ...
Preprint
Full-text available
In this paper, we aim to extract commonsense knowledge to improve machine reading comprehension. We propose to represent relations implicitly by situating structured knowledge in a context instead of relying on a pre-defined set of relations, and we call it contextualized knowledge. Each piece of contextualized knowledge consists of a pair of interrelated verbal and nonverbal messages extracted from a script and the scene in which they occur as context to implicitly represent the relation between the verbal and nonverbal messages, which are originally conveyed by different modalities within the script. We propose a two-stage fine-tuning strategy to use the large-scale weakly-labeled data based on a single type of contextualized knowledge and employ a teacher-student paradigm to inject multiple types of contextualized knowledge into a student machine reader. Experimental results demonstrate that our method outperforms a state-of-the-art baseline by a 4.3% improvement in accuracy on the machine reading comprehension dataset C^3, wherein most of the questions require unstated prior knowledge.
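A schematic sketch of the two-stage fine-tuning strategy mentioned in this abstract, with a toy model and synthetic loaders standing in for the reader and the datasets; epoch counts and learning rates are illustrative assumptions, not the authors' settings.
```python
# Schematic two-stage fine-tuning: stage 1 adapts a reader on large-scale
# weakly-labeled data, stage 2 fine-tunes on the clean target data.
import torch
import torch.nn as nn


def fine_tune(model, loader, epochs, lr):
    """Stand-in for an ordinary supervised fine-tuning routine."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model


def two_stage_finetune(reader, weak_loader, clean_loader):
    reader = fine_tune(reader, weak_loader, epochs=1, lr=1e-5)   # stage 1: weakly-labeled data
    return fine_tune(reader, clean_loader, epochs=3, lr=2e-5)    # stage 2: clean target-task data


if __name__ == "__main__":
    reader = nn.Linear(128, 4)   # toy stand-in for a multiple-choice reader
    weak = [(torch.randn(32, 128), torch.randint(0, 4, (32,))) for _ in range(10)]
    clean = [(torch.randn(32, 128), torch.randint(0, 4, (32,))) for _ in range(3)]
    two_stage_finetune(reader, weak, clean)
```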
... Domain Distillation The generalization ability of the teacher model can be transferred to the student by using the class probabilities produced by the cumbersome model for training the small model (Hinton et al. 2014). Recent studies on speech recognition show that training student networks with multiple teachers achieves promising empirical results (You et al. 2019). Inspired by these observations, we propose to teach a unified model with multiple teachers trained on different domains. ...
... Through case studies, we found that word-level distillation produced more fluent outputs, possibly due to providing smoother target labels. This explains why word-level distillation is a widely-used implementation in multi-lingual and multi-domain tasks on top of Transformer-based models (Tan et al. 2019; You et al. 2019). Therefore, we applied word-level distillation in our work. ...
Article
The key challenge of multi-domain translation lies in simultaneously encoding both the general knowledge shared across domains and the particular knowledge distinctive to each domain in a unified model. Previous work shows that the standard neural machine translation (NMT) model, trained on mixed-domain data, generally captures the general knowledge but misses the domain-specific knowledge. In response to this problem, we augment the NMT model with additional domain transformation networks to transform the general representations to domain-specific representations, which are subsequently fed to the NMT decoder. To guarantee the knowledge transformation, we also propose two complementary supervision signals by leveraging the power of knowledge distillation and adversarial learning. Experimental results on several language pairs, covering both balanced and unbalanced multi-domain translation, demonstrate the effectiveness and universality of the proposed approach. Encouragingly, the proposed unified model achieves comparable results with the fine-tuning approach that requires multiple models to preserve the particular knowledge. Further analyses reveal that the domain transformation networks successfully capture the domain-specific knowledge as expected.
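A rough sketch of the domain transformation idea described in this abstract, under assumed module shapes; the residual formulation and the per-domain feed-forward design are our own illustrative choices, not the paper's released code.
```python
# Sketch: a small per-domain network maps the shared (general) encoder
# representation to a domain-specific representation consumed by the decoder.
import torch
import torch.nn as nn


class DomainTransform(nn.Module):
    def __init__(self, hidden_size: int, num_domains: int):
        super().__init__()
        # One lightweight feed-forward transform per domain (assumed design).
        self.transforms = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.ReLU(),
                          nn.Linear(hidden_size, hidden_size))
            for _ in range(num_domains)
        )

    def forward(self, general_repr: torch.Tensor, domain_id: int) -> torch.Tensor:
        # Residual connection keeps the shared knowledge; the transform adds
        # the domain-specific part.
        return general_repr + self.transforms[domain_id](general_repr)


if __name__ == "__main__":
    enc_out = torch.randn(2, 7, 512)          # (batch, src_len, hidden)
    dt = DomainTransform(hidden_size=512, num_domains=3)
    dec_in = dt(enc_out, domain_id=1)         # would be fed to the NMT decoder
    print(dec_in.shape)                       # torch.Size([2, 7, 512])
```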
... Knowledge distillation via single student: In knowledge distillation, as introduced by Hinton et al. [15], a single student tries to mimic either a single teacher [6,15,17,18,19,21,23,26,35,36,44] or multiple teachers [9,39,42]. Most such schemes transfer knowledge to the student by minimizing the error between the knowledge of the teacher and that of the student [15,18]. ...
... Employing several domain experts, Ruder et al. [19] propose to transfer knowledge from many teachers, where each teacher is an expert in one language, to one multi-lingual student for sentiment classification. Another study that utilizes MTSS to train an all-rounder single student achieves promising results on automatic speech recognition [30]. Unlike most knowledge distillation approaches, You et al. [28] utilize hints from intermediate layers as well as knowledge from output layers of the teacher network to train a single student network. ...
Preprint
Full-text available
To address the overwhelming size of Deep Neural Networks (DNNs), several compression schemes have been proposed, one of which is the teacher-student approach, which tries to transfer knowledge from a complex teacher network to a simple student network. In this paper, we propose a novel method called a teacher-class network, consisting of a single teacher and multiple student networks (i.e., a class of students). Instead of transferring knowledge to one student only, the proposed method transfers a chunk of knowledge about the entire solution to each student. Our students are not trained for problem-specific logits; they are trained to mimic knowledge (dense representation) learned by the teacher network. Thus, unlike the logits-based single-student approach, the combined knowledge learned by the class of students can be used to solve other problems as well. These students can be designed to satisfy a given budget; e.g., for comparative purposes we kept the collective parameters of all the students less than or equal to that of a single student in the teacher-student approach. These small student networks are trained independently, making it possible to train and deploy models on memory-deficient devices as well as on parallel processing systems such as data centers. The proposed teacher-class architecture is evaluated on several benchmark datasets, including MNIST, FashionMNIST, IMDB Movie Reviews, and CamVid, on multiple tasks including classification, sentiment classification, and segmentation. Our approach outperforms the state-of-the-art single-student approach in terms of accuracy as well as computational cost, and in many cases it achieves an accuracy equivalent to the teacher network while having 10-30 times fewer parameters.
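A minimal sketch of the teacher-class idea from this abstract, with toy architectures, an even chunk split, and an MSE mimic loss that are assumptions for illustration rather than the authors' code: each small student regresses one chunk of the teacher's dense representation, and the concatenated student outputs approximate the full representation.
```python
# Sketch: split the teacher's dense representation into chunks; each student
# mimics one chunk, so the students can be trained independently/in parallel.
import torch
import torch.nn as nn

TEACHER_DIM, NUM_STUDENTS = 256, 4
CHUNK = TEACHER_DIM // NUM_STUDENTS

teacher_encoder = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, TEACHER_DIM))
students = [nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, CHUNK))
            for _ in range(NUM_STUDENTS)]

x = torch.randn(32, 784)                       # a batch of flattened inputs
with torch.no_grad():
    dense = teacher_encoder(x)                 # teacher's dense representation

mse = nn.MSELoss()
for i, student in enumerate(students):
    target_chunk = dense[:, i * CHUNK:(i + 1) * CHUNK]
    loss = mse(student(x), target_chunk)       # each student mimics its chunk
    loss.backward()                            # followed by an optimizer step per student

# At inference, concatenating the student outputs recovers an approximation
# of the teacher's representation for downstream heads.
approx = torch.cat([s(x) for s in students], dim=1)
print(approx.shape)                            # torch.Size([32, 256])
```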
Article
Full-text available
In this paper, we present an improved feedforward sequential memory network (FSMN) architecture, namely Deep-FSMN (DFSMN), by introducing skip connections between memory blocks in adjacent layers. These skip connections enable information flow across different layers and thus alleviate the gradient vanishing problem when building very deep structures. As a result, DFSMN significantly benefits from these skip connections and the deep structure. We have compared the performance of DFSMN to BLSTM both with and without lower frame rate (LFR) on several large speech recognition tasks, including English and Mandarin. Experimental results show that DFSMN can consistently outperform BLSTM with dramatic gains, especially when trained with LFR using CD-Phone as modeling units. On the 2000-hour Fisher (FSH) task, the proposed DFSMN achieves a word error rate of 9.4% by purely using the cross-entropy criterion and decoding with a 3-gram language model, a 1.5% absolute improvement over the BLSTM. On a 20000-hour Mandarin recognition task, the LFR-trained DFSMN achieves more than a 20% relative improvement compared to the LFR-trained BLSTM. Moreover, we can easily adjust the lookahead filter order of the memory blocks in DFSMN to control the latency for real-time applications.
Conference Paper
Full-text available
Training thin deep networks following the student-teacher learning paradigm has received intensive attention because of its excellent performance. However, to the best of our knowledge, most existing work considers only a single teacher network. In practice, a student may have access to multiple teachers, and multiple teacher networks together provide comprehensive guidance that is beneficial for training the student network. In this paper, we present a method to train a thin deep network by incorporating multiple teacher networks not only in the output layer, by averaging the softened outputs (dark knowledge) from different networks, but also in the intermediate layers, by imposing a constraint on the dissimilarity among examples. We suggest that the relative dissimilarity between intermediate representations of different examples serves as a more flexible and appropriate guidance from teacher networks. Triplets are then utilized to encourage the consistency of these relative dissimilarity relationships between the student network and the teacher networks. Moreover, we leverage a voting strategy to unify the relative dissimilarity information provided by multiple teacher networks, which realizes their incorporation in the intermediate layers. Extensive experimental results demonstrate that our method is capable of generating a well-performing student network, with classification accuracy comparable or even superior to that of all teacher networks, yet having far fewer parameters and running much faster.
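One plausible formalization of the voted, dissimilarity-based intermediate-layer guidance sketched in this abstract; the margin loss, the distance metric, and the voting rule below are our own simplifications, not the paper's exact objective.
```python
# Sketch: each teacher "votes" on which of two examples is closer to an anchor
# in its intermediate representation; a margin loss pushes the student's
# intermediate features to respect the majority ordering.
import torch
import torch.nn.functional as F


def teacher_vote(t_anchor, t_pos, t_neg):
    """+1 if this teacher finds `pos` closer to `anchor` than `neg`, else -1 (per example)."""
    d_pos = F.pairwise_distance(t_anchor, t_pos)
    d_neg = F.pairwise_distance(t_anchor, t_neg)
    return torch.where(d_pos < d_neg, torch.ones_like(d_pos), -torch.ones_like(d_pos))


def voted_triplet_loss(student_feats, teacher_feats_list, margin=1.0):
    s_a, s_p, s_n = student_feats                        # student intermediate features
    votes = torch.stack([teacher_vote(*t) for t in teacher_feats_list]).sum(dim=0)
    order = torch.sign(votes)                            # majority ordering across teachers
    d_p = F.pairwise_distance(s_a, s_p)
    d_n = F.pairwise_distance(s_a, s_n)
    # If most teachers say "pos is closer", push d_p below d_n by a margin (and vice versa).
    return F.relu(order * (d_p - d_n) + margin).mean()


if __name__ == "__main__":
    student = tuple(torch.randn(8, 64, requires_grad=True) for _ in range(3))
    teachers = [tuple(torch.randn(8, 128) for _ in range(3)) for _ in range(3)]
    loss = voted_triplet_loss(student, teachers)
    loss.backward()
    print(float(loss))
```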
Conference Paper
Full-text available
We evaluate some recent developments in recurrent neural network (RNN) based speech enhancement in the light of noise-robust automatic speech recognition (ASR). The proposed framework is based on Long Short-Term Memory (LSTM) RNNs which are discriminatively trained according to an optimal speech reconstruction objective. We demonstrate that LSTM speech enhancement, even when used ‘naïvely’ as front-end processing, delivers competitive results on the CHiME-2 speech recognition task. Furthermore, simple, feature-level fusion based extensions to the framework are proposed to improve the integration with the ASR back-end. These yield a best result of 13.76% average word error rate, which is, to our knowledge, the best score to date.
Article
Full-text available
We describe the design of Kaldi, a free, open-source toolkit for speech recognition research. Kaldi provides a speech recognition system based on finite-state transducers (using the freely available OpenFst), together with detailed documentation and scripts for building complete recognition systems. Kaldi is written in C++, and the core library supports modeling of arbitrary phonetic-context sizes, acoustic modeling with subspace Gaussian mixture models (SGMM) as well as standard Gaussian mixture models, together with all commonly used linear and affine transforms. Kaldi is released under the Apache License v2.0, which is highly nonrestrictive, making it suitable for a wide community of users.
Article
Full-text available
Image methods are commonly used for the analysis of the acoustic properties of enclosures. In this paper we discuss the theoretical and practical use of image techniques for simulating, on a digital computer, the impulse response between two points in a small rectangular room. The resulting impulse response, when convolved with any desired input signal, such as speech, simulates room reverberation of the input signal. This technique is useful in signal processing or psychoacoustic studies. The entire process is carried out on a digital computer so that a wide range of room parameters can be studied with accurate control over the experimental conditions. A FORTRAN implementation of this model has been included.
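For context, the typical downstream use of such a simulated impulse response is sketched below: convolving a clean signal with a room impulse response (RIR) yields the reverberant signal. The RIR here is a synthetic exponentially decaying stand-in rather than an image-method computation, and the sampling rate, decay constant, and lengths are arbitrary choices.
```python
# Sketch: simulate room reverberation by convolving a clean signal with an RIR.
import numpy as np
from scipy.signal import fftconvolve

fs = 16000
clean = np.random.randn(fs * 2).astype(np.float32)      # 2 s of placeholder "speech"

# Stand-in RIR: a direct-path spike followed by an exponentially decaying noise tail.
t = np.arange(int(0.3 * fs)) / fs
rir = np.random.randn(t.size) * np.exp(-t / 0.05)
rir[0] = 1.0

reverberant = fftconvolve(clean, rir)[: clean.size]     # reverberated version of the input
print(reverberant.shape)
```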
Conference Paper
Full-text available
Hidden Markov model speech recognition systems typically use Gaussian mixture models to estimate the distributions of decorrelated acoustic feature vectors that correspond to individual subword units. By contrast, hybrid connectionist-HMM systems use discriminatively-trained neural networks to estimate the probability distribution among subword units given the acoustic observations. In this work we show a large improvement in word recognition performance by combining neural-net discriminative feature processing with Gaussian-mixture distribution modeling. By training the network to generate the subword probability posteriors, then using transformations of these estimates as the base features for a conventionally-trained Gaussian-mixture based system, we achieve relative error rate reductions of 35% or more on the multicondition Aurora noisy continuous digits task
Conference Paper
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
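A minimal sketch of the distillation objective this abstract describes, in its standard formulation: the student matches the temperature-softened class probabilities of the teacher (or ensemble), optionally mixed with the usual hard-label loss. The temperature and mixing weight below are illustrative assumptions, and this is a generic sketch rather than code from the paper.
```python
# Sketch of the standard knowledge distillation loss with temperature T.
import torch
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    # KL between softened teacher and student distributions, scaled by T^2
    # so gradient magnitudes stay comparable across temperatures.
    soft_loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                         soft_targets, reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss


if __name__ == "__main__":
    s = torch.randn(16, 10, requires_grad=True)   # student logits
    t = torch.randn(16, 10)                       # teacher/ensemble logits
    y = torch.randint(0, 10, (16,))               # hard labels
    loss = kd_loss(s, t, y)
    loss.backward()
    print(float(loss))
```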
Article
In this paper, we summarize recent progress made in deep learning based acoustic models and the motivation and insights behind the surveyed techniques. We first discuss models such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs) that can effectively exploit variable-length contextual information, and their various combinations with other models. We then describe models that are optimized end-to-end and emphasize feature representations learned jointly with the rest of the system, the connectionist temporal classification (CTC) criterion, and the attention-based sequence-to-sequence translation model. We further illustrate robustness issues in speech recognition systems, and discuss acoustic model adaptation, speech enhancement and separation, and robust training strategies. We also cover modeling techniques that lead to more efficient decoding and discuss possible future directions in acoustic model research.
Article
Currently, deep neural networks are the state of the art on problems such as speech recognition and computer vision. In this paper we empirically demonstrate that shallow feed-forward nets can learn the complex functions previously learned by deep nets and achieve accuracies previously only achievable with deep models. Moreover, in some cases the shallow nets can learn these deep functions using the same number of parameters as the original deep models. On the TIMIT phoneme recognition and CIFAR-10 image recognition tasks, shallow nets can be trained that perform similarly to complex, well-engineered, deeper convolutional models.
Article
Deep neural networks (DNNs) obtain significant accuracy improvements on many speech recognition tasks, and their power comes from deep and wide network structures with a very large number of parameters. It becomes challenging when we deploy DNNs on devices with limited computational and storage resources. The common practice is to train a DNN with a small number of hidden nodes and a small senone set using the standard training process, leading to significant accuracy loss. In this study, we propose to better address these issues by utilizing the DNN output distribution. To learn a DNN with a small number of hidden nodes, we minimize the Kullback-Leibler divergence between the output distributions of the small-size DNN and a standard large-size DNN by utilizing a large amount of un-transcribed data. For better senone set generation, we cluster the senones in the large set into a small one by directly relating the clustering process to the DNN parameters, as opposed to decoupling senone generation and DNN training as in the standard process. Evaluated on a short message dictation task, the proposed two methods achieve 5.08% and 1.33% relative word error rate reductions over the standard training method, respectively.
Conference Paper
Recently, a new acoustic model based on deep neural networks (DNN) has been introduced. While the DNN has generated significant improvements over GMM-based systems on several tasks, there has been no evaluation of the robustness of such systems to environmental distortion. In this paper, we investigate the noise robustness of DNN-based acoustic models and find that they can match state-of-the-art performance on the Aurora 4 task without any explicit noise compensation. This performance can be further improved by incorporating information about the environment into DNN training using a new method called noise-aware training. When combined with the recently proposed dropout training technique, a 7.5% relative improvement over the previously best published result on this task is achieved using only a single decoding pass and no additional decoding complexity compared to a standard DNN.
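A simplified sketch of the noise-aware training idea mentioned in this abstract: each acoustic feature frame is augmented with an estimate of the noise in that utterance so the DNN can condition on the environment. The noise estimator below (a mean over the first few frames) and the feature shapes are assumptions for illustration; the paper's actual estimator may differ.
```python
# Sketch: append a per-utterance noise estimate to every input frame.
import numpy as np


def add_noise_estimate(features: np.ndarray, n_noise_frames: int = 10) -> np.ndarray:
    """features: (num_frames, feat_dim) log-mel or similar acoustic features."""
    noise_est = features[:n_noise_frames].mean(axis=0)       # crude noise estimate
    tiled = np.tile(noise_est, (features.shape[0], 1))       # same estimate for every frame
    return np.concatenate([features, tiled], axis=1)         # (num_frames, 2 * feat_dim)


if __name__ == "__main__":
    utt = np.random.randn(300, 40)          # 300 frames of 40-dim features
    print(add_noise_estimate(utt).shape)    # (300, 80)
```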
Article
Currently, deep neural networks are the state of the art on problems such as speech recognition and computer vision. In this extended abstract, we show that shallow feed-forward networks can learn the complex functions previously learned by deep nets and achieve accuracies previously only achievable with deep models. Moreover, the shallow neural nets can learn these deep functions using a total number of parameters similar to the original deep model. We evaluate our method on TIMIT phoneme recognition task and are able to train shallow fully-connected nets that perform similarly to complex, well-engineered, deep convolutional architectures. Our success in training shallow neural nets to mimic deeper models suggests that there probably exist better algorithms for training shallow feed-forward nets than those currently available.
Article
We propose a novel context-dependent (CD) model for large-vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. The deep belief network pre-training algorithm is a robust and often helpful way to initialize deep neural networks generatively that can aid in optimization and reduce generalization error. We illustrate the key components of our model, describe the procedure for applying CD-DNN-HMMs to LVSR, and analyze the effects of various modeling choices on performance. Experiments on a challenging business search dataset demonstrate that CD-DNN-HMMs can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs, with an absolute sentence accuracy improvement of 5.8% and 9.2% (or relative error reduction of 16.0% and 23.2%) over the CD-GMM-HMMs trained using the minimum phone error rate (MPE) and maximum-likelihood (ML) criteria, respectively.
Conference Paper
Most speech recognizers are sensitive to the speaking style and the speaker's environment. This system extends an earlier robust continuous-observation HMM isolated-word recognition (IWR) system to continuous speech using the DARPA-robust (multi-condition with a pilot's facemask) database. Performance on a 207-word, perplexity-14 task is a 0.9% word error rate under office conditions, and 2.5% (best speaker) and 5% (4-speaker average) for the normal test condition of the database.
Making machines understand us in reverberant rooms
  • T. Yoshioka
  • A. Sehr
  • M. Delcroix
  • K. Kinoshita
  • R. Maas
  • T. Nakatani