ABSTRACT: A method for training a deep neural network comprises receiving and formatting speech data for the training, preconditioning a system of equations used to analyze the speech data in connection with the training by means of a non-fixed-point quasi-Newton preconditioning scheme, and employing flexible Krylov subspace solvers in response to variations in the preconditioning scheme across iterations of the training.
ABSTRACT: Deep Neural Networks (DNNs) have recently been shown to significantly outperform existing machine learning techniques in several pattern recognition tasks. DNNs are the state-of-the-art models used in image recognition, object detection, classification and tracking, and speech and language processing applications. The biggest drawback to DNNs has been the enormous cost in computation and time taken to train the parameters of the networks - often a tenfold increase relative to conventional technologies. Such training time costs can be mitigated by the application of parallel computing algorithms and architectures. However, these algorithms often run into difficulties because of inter-processor communication bottlenecks. In this paper, we describe how to enable parallel Deep Neural Network training on the IBM Blue Gene/Q (BG/Q) computer system. Specifically, we explore DNN training using the data-parallel Hessian-free second-order optimization algorithm. Such an algorithm is particularly well suited to parallelization across a large set of loosely coupled processors. BG/Q, with its excellent inter-processor communication characteristics, is an ideal match for this type of algorithm. The paper discusses how issues regarding the programming model and data-dependent imbalances are addressed. Results on large-scale speech tasks show that the performance on BG/Q scales linearly up to 4096 processes with no loss in accuracy. This allows us to train neural networks using billions of training examples in a few hours.
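The core of the data-parallel scheme the abstract describes can be sketched in a few lines: each worker computes a gradient on its own shard of the data, and an allreduce-style step averages the gradients so every worker applies an identical update. The quadratic loss, shard sizes, and learning rate below are illustrative assumptions, not details from the paper; a real BG/Q implementation would use MPI rather than this single-process simulation.

```python
import numpy as np

def allreduce_average(worker_grads):
    """Average gradients across workers, mimicking an MPI allreduce:
    every worker ends up holding the same averaged gradient."""
    return sum(worker_grads) / len(worker_grads)

def local_gradient(w, x, y):
    # Gradient of the mean squared error on one worker's local shard.
    return 2.0 * x.T @ (x @ w - y) / len(y)

# Hypothetical setup: 4 workers, each holding a shard of a least-squares
# problem (a toy stand-in for a DNN loss).
rng = np.random.default_rng(0)
w = np.zeros(3)
shards = [(rng.normal(size=(8, 3)), rng.normal(size=8)) for _ in range(4)]

grads = [local_gradient(w, x, y) for x, y in shards]
g = allreduce_average(grads)   # one "allreduce" communication step
w -= 0.1 * g                   # identical update applied on every worker
```

Because only gradients (not data) cross the network, the communication cost per step is fixed by the parameter count, which is why a machine with fast inter-processor communication scales well here.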
ABSTRACT: In this paper, we investigate how to scale up kernel methods to take on large-scale problems, on which deep neural networks have been prevailing. To this end, we leverage existing techniques and develop new ones. These techniques include approximating kernel functions with features derived from random projections, parallel training of kernel models with 100 million parameters or more, and new schemes for combining kernel functions as a way of learning representations. We demonstrate how to combine these ideas to implement large-scale kernel machines for challenging problems in automatic speech recognition. We validate our approaches with extensive empirical studies on real-world speech datasets on the task of acoustic modeling. We show that our kernel models are as competitive as well-engineered deep neural networks (DNNs). In particular, kernel models either attain similar performance to, or surpass, their DNN counterparts. Our work thus avails more tools to machine learning researchers in addressing large-scale learning problems.
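The "features derived from random projections" mentioned above are commonly realized as random Fourier features: inner products of the projected features approximate a shift-invariant kernel such as the Gaussian RBF. A minimal sketch, with the feature count and bandwidth chosen for illustration only:

```python
import numpy as np

def random_fourier_features(x, n_features, gamma, rng):
    """Map inputs so that dot products of the features approximate the
    RBF kernel exp(-gamma * ||a - b||^2) (Rahimi-Recht style)."""
    d = x.shape[1]
    # Frequencies drawn from the kernel's Fourier transform, plus random phases.
    w = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(x @ w + b)

rng = np.random.default_rng(0)
a = rng.normal(size=(5, 4))
z = random_fourier_features(a, n_features=20000, gamma=0.5, rng=rng)
approx = z @ z.T                             # approximate kernel matrix
sq = ((a[:, None, :] - a[None, :, :]) ** 2).sum(-1)
exact = np.exp(-0.5 * sq)                    # exact RBF kernel matrix
```

The payoff is that a kernel machine becomes an ordinary linear model in the feature space, so the parallel-training machinery used for linear and neural models applies directly.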
ABSTRACT: Convolutional Neural Networks (CNNs) are an alternative type of neural network that can be used to reduce spectral variations and model spectral correlations which exist in signals. Since speech signals exhibit both of these properties, we hypothesize that CNNs are a more effective model for speech compared to Deep Neural Networks (DNNs). In this paper, we explore applying CNNs to large vocabulary continuous speech recognition (LVCSR) tasks. First, we determine the appropriate architecture to make CNNs effective compared to DNNs for LVCSR tasks. Specifically, we focus on how many convolutional layers are needed, what an appropriate number of hidden units is, and what the best pooling strategy is. Second, we investigate how to incorporate speaker-adapted features, which cannot directly be modeled by CNNs as they do not obey locality in frequency, into the CNN framework. Third, given the importance of sequence training for speech tasks, we introduce a strategy to use ReLU+dropout during Hessian-free sequence training of CNNs. Experiments on 3 LVCSR tasks indicate that a CNN with the proposed speaker-adapted and ReLU+dropout ideas allows for a 12-14% relative improvement in WER over a strong DNN system, achieving state-of-the-art results on these 3 tasks.
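The mechanism by which a CNN "reduces spectral variations" is weight sharing along the frequency axis followed by pooling. The toy sketch below (frame size, kernel width, and pool size are arbitrary choices, not the paper's architecture) shows the two operations on a single frame of features:

```python
import numpy as np

def conv1d_freq(x, kernel):
    """Convolve along the frequency axis with one shared (weight-tied)
    kernel - the operation that detects the same spectral pattern at any
    frequency position."""
    k = len(kernel)
    return np.array([x[i:i + k] @ kernel for i in range(len(x) - k + 1)])

def max_pool(x, size):
    """Non-overlapping max pooling: keep the strongest response in each
    band, discarding its exact frequency position."""
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)

# Toy "log-mel" frame with 40 frequency bins, and the same frame shifted
# by one bin to imitate a small spectral variation.
rng = np.random.default_rng(0)
frame = rng.normal(size=40)
shifted = np.roll(frame, 1)
kernel = rng.normal(size=5)

pooled = max_pool(conv1d_freq(frame, kernel), size=3)
pooled_shifted = max_pool(conv1d_freq(shifted, kernel), size=3)
# Pooling makes most bands of the two representations agree, since a
# one-bin shift rarely moves the maximum out of its pooling region.
```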
ABSTRACT: Data augmentation using label preserving transformations has been shown to be effective for neural network training to make invariant predictions. In this paper we focus on data augmentation approaches to acoustic modeling using deep neural networks (DNNs) for automatic speech recognition (ASR). We first investigate a modified version of a previously studied approach using vocal tract length perturbation (VTLP) and then propose a novel data augmentation approach based on stochastic feature mapping (SFM) in a speaker adaptive feature space. Experiments were conducted on Bengali and Assamese limited language packs (LLPs) from the IARPA Babel program. Improved recognition performance has been observed after both cross-entropy (CE) and state-level minimum Bayes risk (sMBR) training of DNN models.
ABSTRACT: In this paper, we present a fast, vocabulary independent algorithm for spoken term detection (STD) that demonstrates a word-based index is sufficient to achieve good performance for both in-vocabulary (IV) and out-of-vocabulary (OOV) terms. Previous approaches have required that a separate index be built at the sub-word level and then expanded to allow for matching OOV terms. Such a process, while accurate, is expensive in both time and memory. In the proposed architecture, a word-level confusion network (CN) based index is used for both IV and OOV search. This is implemented using a flexible WFST framework. Comparisons on 3 Babel languages (Tagalog, Pashto and Turkish) show that CN-based indexing results in better performance compared with the lattice approach while being orders of magnitude faster and having a much smaller footprint.
ABSTRACT: In this paper, we investigate the problem of automatically selecting textual keywords for keyword search development and tuning on audio data for any language. Briefly, the method samples candidate keywords in the training data while trying to match a set of target marginal distributions for keyword features such as keyword frequency in the training or development audio, keyword length, frequency of out-of-vocabulary words, and TF-IDF scores. The method is evaluated on four IARPA Babel program base period languages. We show the use of the automatically selected keywords for keyword search system development and tuning. We also show that search performance is improved by tuning the decision threshold on the automatically selected keywords.
ABSTRACT: Many features used in speech recognition tasks are hand-crafted and are not always related to the objective at hand, that is, minimizing word error rate. Recently, we showed that replacing a perceptually motivated mel-filter bank with a filter bank layer that is learned jointly with the rest of a deep neural network was promising. In this paper, we extend filter learning to a speaker-adapted, state-of-the-art system. First, we incorporate delta learning into the filter learning framework. Second, we incorporate various speaker adaptation techniques, including VTLN warping and speaker identity features. On a 50-hour English Broadcast News task, we show that we can achieve a 5% relative improvement in word error rate (WER) using the filter and delta learning, compared to having a fixed set of filters and deltas. Furthermore, after speaker adaptation, we find that filter and delta learning allows for a 3% relative improvement in WER compared to a state-of-the-art CNN.
ABSTRACT: Mel-filter banks are commonly used in speech recognition, as they are motivated from theory related to speech production and perception. While features derived from mel-filter banks are quite popular, we argue that this filter bank is not really an appropriate choice as it is not learned for the objective at hand, i.e., speech recognition. In this paper, we explore replacing the filter bank with a filter bank layer that is learned jointly with the rest of a deep neural network. Thus, the filter bank is learned to minimize cross-entropy, which is more closely tied to the speech recognition objective. On a 50-hour English Broadcast News task, we show that we can achieve a 5% relative improvement in word error rate (WER) using the filter bank learning approach, compared to having a fixed set of filters.
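The idea in the two filter-learning abstracts above is that a filter bank is just a matrix applied to the power spectrum, so it can be treated as one more trainable layer. A sketch of the forward pass, with a hypothetical triangular (mel-like) initialization; filter counts, FFT size, and the flooring constant are illustrative assumptions:

```python
import numpy as np

def triangular_filters(n_filters, n_bins):
    """Hypothetical mel-like initialization: overlapping triangular
    filters spread evenly over the frequency axis."""
    centers = np.linspace(0, n_bins - 1, n_filters + 2)
    filters = np.zeros((n_filters, n_bins))
    f = np.arange(n_bins)
    for i in range(n_filters):
        lo, c, hi = centers[i], centers[i + 1], centers[i + 2]
        rise = np.clip((f - lo) / (c - lo), 0, None)
        fall = np.clip((hi - f) / (hi - c), 0, None)
        filters[i] = np.clip(np.minimum(rise, fall), 0, 1)
    return filters

def filterbank_layer(power_spectrum, weights):
    """Forward pass of the filter bank layer. In the learned setting the
    weights are updated by backpropagation together with the rest of the
    network, instead of being frozen at their initialization."""
    return np.log(weights @ power_spectrum + 1e-8)

W = triangular_filters(n_filters=40, n_bins=257)   # trainable parameters
signal = np.random.default_rng(0).normal(size=512)
spec = np.abs(np.fft.rfft(signal)) ** 2            # 257-bin power spectrum
features = filterbank_layer(spec, W)               # 40 log filter-bank outputs
```

Because the layer is differentiable in `W`, gradients of the cross-entropy loss flow into the filter shapes exactly as they do into any hidden-layer weight matrix.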
ABSTRACT: While Deep Neural Networks (DNNs) have achieved tremendous success for large vocabulary continuous speech recognition (LVCSR) tasks, training these networks is slow. Even to date, the most common approach to train DNNs is via stochastic gradient descent, serially on one machine. Serial training, coupled with the large number of training parameters (i.e., 10-50 million) and speech data set sizes (i.e., 20-100 million training points) makes DNN training very slow for LVCSR tasks. In this work, we explore a variety of different optimization techniques to improve DNN training speed. This includes parallelization of the gradient computation during cross-entropy and sequence training, as well as reducing the number of parameters in the network using a low-rank matrix factorization. Applying the proposed optimization techniques, we show that DNN training can be sped up by a factor of 3 on a 50-hour English Broadcast News (BN) task with no loss in accuracy. Furthermore, using the proposed techniques, we are able to train DNNs on a 300-hr Switchboard (SWB) task and a 400-hr English BN task, showing improvements between 9-30% relative over a state-of-the-art GMM/HMM system while the number of parameters of the DNN is smaller than the GMM/HMM system.
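The low-rank matrix factorization mentioned above replaces a weight matrix W of shape m x n with a product A @ B of shapes m x r and r x n, so the parameter count drops from m*n to r*(m + n). The layer sizes and rank below are hypothetical; truncated SVD is used here only to show the best rank-r approximation, whereas in training one would parameterize the layer as A @ B directly:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 1024, 2048, 128                 # hypothetical layer sizes and rank
W = rng.normal(size=(m, n))               # original dense weight matrix

# Truncated SVD: the best rank-r approximation of W in Frobenius norm.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * s[:r]                      # m x r factor
B = Vt[:r]                                # r x n factor

full_params = m * n                       # 2,097,152 parameters
low_rank_params = r * (m + n)             # 393,216 parameters
```

The factorization is most effective on the output layer of an acoustic model, where the number of context-dependent states makes n large and the softmax inputs are known to be low-rank in practice.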
Article · Nov 2013 · IEEE Transactions on Audio, Speech, and Language Processing
ABSTRACT: Spoken content in languages of emerging importance needs to be searchable to provide access to the underlying information. In this paper, we investigate the problem of extending data fusion methodologies from Information Retrieval for Spoken Term Detection on low-resource languages in the framework of the IARPA Babel program. We describe a number of alternative methods for improving keyword search performance. We apply these methods to Cantonese, a language that presents some new issues in terms of reduced resources and shorter query lengths. First, we present a score normalization methodology that improves keyword search performance by 20% on average. Second, we show that properly combining the outputs of diverse ASR systems performs 14% better than the best normalized ASR system.
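One widely used score normalization in spoken term detection, sketched below, rescales each keyword's detection scores so they sum to one, making a single decision threshold comparable across keywords; this is a generic sketch of the idea, not necessarily the exact methodology of the paper, and the keywords and scores are made up:

```python
def sum_to_one_normalize(hits):
    """Per-keyword sum-to-one normalization of detection scores.

    `hits` is a list of (keyword, score) pairs; each score is divided by
    the total score mass of its keyword, so rare and frequent keywords
    end up on a comparable scale for thresholding.
    """
    by_kw = {}
    for kw, score in hits:
        by_kw.setdefault(kw, []).append(score)
    totals = {kw: sum(scores) for kw, scores in by_kw.items()}
    return [(kw, score / totals[kw]) for kw, score in hits]

# Hypothetical detections for two keywords.
hits = [("keyword_a", 0.9), ("keyword_a", 0.1), ("keyword_b", 0.4)]
normalized = sum_to_one_normalize(hits)
```

After normalization the lone "keyword_b" hit scores 1.0 while the two "keyword_a" hits keep their relative ordering, which is the property that lets one global threshold behave sensibly across keywords.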
ABSTRACT: Automatic speech recognition is a core component of many applications, including keyword search. In this paper we describe experiments on acoustic modeling, language modeling, and decoding for keyword search on a Cantonese conversational telephony corpus collected as part of the IARPA Babel program. We show that acoustic modeling techniques such as the bootstrapped-and-restructured model and deep neural network acoustic model significantly outperform a state-of-the-art baseline GMM/HMM model, in terms of both recognition performance and keyword search performance, with improvements of up to 11% relative character error rate reduction and 31% relative maximum term weighted value improvement. We show that while an interpolated Model M and neural network LM improve recognition performance, they do not improve keyword search results; however, the advanced LM does reduce the size of the keyword search index. Finally, we show that a simple form of automatically adapted keyword search performs 16% better than a preindexed search system, indicating that out-of-vocabulary search is still a challenge.
ABSTRACT: The paper describes a state-of-the-art spoken term detection system in which significant improvements are obtained by diversifying the ASR engines used for indexing and combining the search results. First, we describe the design factors that, when varied, produce complementary STD systems and show that the performance of the combined system is 3 times better than the best individual component. Next, we describe different strategies for system combination and show that significant improvements can be achieved by normalizing the combined scores. We propose a classifier-based system combination strategy which outperforms a highly optimized baseline. The system described in this paper had the highest accuracy in the 2012 DARPA RATS evaluation.
ABSTRACT: In this paper, we provide an overview of the invited and contributed papers presented at the special session at ICASSP-2013, entitled "New Types of Deep Neural Network Learning for Speech Recognition and Related Applications," as organized by the authors. We also describe the historical context in which acoustic models based on deep neural networks have been developed. The technical overview of the papers presented in our special session is organized into five ways of improving deep learning methods: (1) better optimization; (2) better types of neural activation function and better network architectures; (3) better ways to determine the myriad hyper-parameters of deep neural networks; (4) more appropriate ways to preprocess speech for deep neural networks; and (5) ways of leveraging multiple languages or dialects that are more easily achieved with deep neural networks than with Gaussian mixture models.
ABSTRACT: Deep belief networks (DBNs) have shown impressive improvements over Gaussian mixture models for automatic speech recognition. In this work we use DBNs for audio-visual speech recognition; in particular, we use deep learning from audio and visual features for noise robust speech recognition. We test two methods for using DBNs in a multimodal setting: a conventional decision fusion method that combines scores from single-modality DBNs, and a novel feature fusion method that operates on mid-level features learned by the single-modality DBNs. On a continuously spoken digit recognition task, our experiments show that these methods can reduce word error rate by as much as 21% relative over a baseline multi-stream audio-visual GMM/HMM system.
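The feature fusion method described above operates on mid-level representations rather than final scores: the hidden activations of each single-modality network are concatenated and fed to a joint classifier. A toy sketch with untrained stand-in networks (the layer sizes and random weights are illustrative assumptions, not the paper's models):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_features(x, W, b):
    """Mid-level representation: the activations of one hidden layer of
    a single-modality network (a toy stand-in for a trained DBN)."""
    return sigmoid(W @ x + b)

rng = np.random.default_rng(0)
audio = rng.normal(size=40)     # e.g. acoustic features for one frame
visual = rng.normal(size=30)    # e.g. mouth-region visual features

Wa, ba = rng.normal(size=(64, 40)), np.zeros(64)   # audio-net hidden layer
Wv, bv = rng.normal(size=(48, 30)), np.zeros(48)   # visual-net hidden layer

# Feature fusion: concatenate the two mid-level representations; a joint
# classifier trained on this vector (omitted here) makes the decision.
fused = np.concatenate([hidden_features(audio, Wa, ba),
                        hidden_features(visual, Wv, bv)])
```

Decision fusion, by contrast, would run each network to completion and combine only the two output score vectors, discarding any cross-modal structure in the hidden layers.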
ABSTRACT: We present a system for keyword search on Cantonese conversational telephony audio, collected for the IARPA Babel program, that achieves good performance by combining postings lists produced by diverse speech recognition systems from three different research groups. We describe the keyword search task, the data on which the work was done, four different speech recognition systems, and our approach to system combination for keyword search. We show that the combination of four systems outperforms the best single system by 7%, achieving an actual term-weighted value of 0.517.
ABSTRACT: Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNNs), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
ABSTRACT: Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by means of decreasing the amount of data used for training and through reduction of the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we first develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS cannot be regarded as a fixed-point iteration, we further propose the employment of flexible Krylov subspace solvers that retain the desired theoretical convergence guarantees of their conventional counterparts. Second, we propose a new sampling algorithm, which geometrically increases the amount of data utilized for gradient and Krylov subspace iteration calculations. On a 50-hr English Broadcast News task, we find that these methodologies provide roughly a 1.5x speed-up, whereas on a 300-hr Switchboard task these techniques provide over a 2.3x speedup, with no loss in WER. These results suggest that even further speed-up is expected as problems scale and complexity grows.
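The geometric sampling idea in the abstract can be sketched as a schedule of subset sizes: early iterations estimate gradients and Krylov iterations on a small sample, and the sample grows geometrically until the full data set is reached. The totals, initial size, and growth factor below are illustrative assumptions, not the paper's settings:

```python
def geometric_sample_sizes(total, initial, growth=2.0):
    """Geometrically growing sampling schedule, capped at the full set.

    Returns the number of training points used at each successive
    optimization iteration.
    """
    sizes, n = [], float(initial)
    while n < total:
        sizes.append(int(n))
        n *= growth
    sizes.append(total)   # finish on the complete data set
    return sizes

# Hypothetical example: 1M training points, starting from a 50k sample.
schedule = geometric_sample_sizes(total=1_000_000, initial=50_000)
```

The saving comes from the early iterations, which make most of their progress on cheap, small samples; only the final iterations pay the full per-pass cost.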