Biing-Hwang Juang

Georgia Institute of Technology, Atlanta, Georgia, United States

Publications (154) · 276.52 Total Impact

  • Chao Weng, Dong Yu, Shinji Watanabe, Biing-Hwang Fred Juang
    ABSTRACT: In this work, we propose recurrent deep neural networks (DNNs) for robust automatic speech recognition (ASR). Full recurrent connections are added to a certain hidden layer of a conventional feedforward DNN, allowing the model to capture temporal dependencies in deep representations. A new backpropagation through time (BPTT) algorithm is introduced to make minibatch stochastic gradient descent (SGD) on the proposed recurrent DNNs more efficient and effective. We evaluate the proposed recurrent DNN architecture under the hybrid setup on both the 2nd CHiME challenge (track 2) and the Aurora-4 tasks. Experimental results on the CHiME challenge data show that the proposed system obtains consistent 7% relative WER improvements over the DNN systems, achieving state-of-the-art performance without front-end preprocessing, speaker adaptive training, or multiple decoding passes. On Aurora-4, the proposed system achieves a 4% relative WER improvement over a strong DNN baseline system.
    ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 05/2014
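The recurrent layer described above can be sketched in a few lines. A minimal forward pass, assuming a tanh hidden nonlinearity and illustrative weight shapes and names (not details taken from the paper):

```python
import numpy as np

def recurrent_dnn_forward(x_seq, W_in, W_rec, W_out):
    """Toy forward pass of a feedforward DNN whose one hidden layer has
    full recurrent connections: the hidden state at frame t mixes the
    current input with the previous hidden state."""
    h = np.zeros(W_rec.shape[0])
    outputs = []
    for x in x_seq:
        h = np.tanh(W_in @ x + W_rec @ h)   # recurrent hidden layer
        outputs.append(W_out @ h)           # output layer (pre-softmax)
    return np.array(outputs)
```

In a hybrid setup, each output row would feed a softmax over HMM states; BPTT would then propagate gradients through the `W_rec` chain across frames.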
  • Mingyu Chen, Ghassan AlRegib, Biing-Hwang Juang
    ABSTRACT: A 6D motion gesture is represented by a 3D spatial trajectory and augmented by another three dimensions of orientation. Using different tracking technologies, the motion can be tracked explicitly with the position and orientation or implicitly with the acceleration and angular speed. In this work, we address the problem of motion gesture recognition for command-and-control applications. Our main contribution is to investigate the relative effectiveness of various feature dimensions for motion gesture recognition in both user-dependent and user-independent cases. We introduce a statistical feature-based classifier as the baseline and propose an HMM-based recognizer, which offers more flexibility in feature selection and achieves better performance in recognition accuracy than the baseline system. Our motion gesture database which contains both explicit and implicit motion information allows us to compare the recognition performance of different tracking signals on a common ground. This study also gives an insight into the attainable recognition rate with different tracking devices, which is valuable for the system designer to choose the proper tracking technology.
    IEEE Transactions on Multimedia 04/2013; 15(3):561-571. · 1.75 Impact Factor
  • J. Wung, T.S. Wada, Biing-Hwang Juang
    ABSTRACT: This paper examines the effect of inter-channel decorrelation by sub-band resampling (SBR) on the performance of the robust acoustic echo cancellation (AEC) system based on the residual echo enhancement technique. Due to the flexibility of SBR, the decorrelation performance as measured by the coherence can be matched with other conventional decorrelation procedures. Given the same degree of decorrelation, we have shown previously that SBR achieves superior audio quality compared to other procedures. We show in this paper that SBR also provides higher stereophonic AEC performance in a very noisy condition, where the performance is evaluated by decomposing the true echo return loss enhancement and the misalignment per sub-band to better demonstrate the superiority of our decorrelation procedure over other methods.
    Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on; 01/2013
  • Yong Zhao, Biing-Hwang Juang
    ABSTRACT: In this paper, we propose a novel acoustic modeling framework, synchronous HMM, which takes full advantage of the capacity of the heterogeneous data sources and achieves an optimal balance between modeling accuracy and robustness. The synchronous HMM introduces an additional layer of substates between the HMM states and the Gaussian component variables. The substates have the capability to register long-span non-phonetic attributes, which are integrally called speech scenes in this study. The hierarchical modeling scheme allows an accurate description of probability distribution of speech units in different speech scenes. To address the data sparsity problem, a decision-based clustering algorithm is presented to determine the set of speech scenes and to tie the substate parameters. Moreover, we propose the multiplex Viterbi algorithm to efficiently decode the synchronous HMMs within a search space of the same size as for the standard HMMs. The experiments on the Aurora 2 task show that the synchronous HMMs produce a significant improvement in recognition performance over the HMM baseline at the expense of a moderate increase in the memory requirement and computational complexity.
    Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on; 01/2013
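For reference, the standard Viterbi decoder whose search-space size the multiplex algorithm matches can be sketched as follows (a generic log-domain implementation, not the paper's substate-augmented version):

```python
import numpy as np

def viterbi(log_A, log_B, log_pi):
    """Log-domain Viterbi decoding for an N-state HMM.
    log_A: (N, N) transition log-probs; log_B: (T, N) emission log-probs
    per frame; log_pi: (N,) initial log-probs. Returns the best state path."""
    T, N = log_B.shape
    delta = log_pi + log_B[0]
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        trans = delta[:, None] + log_A        # score of i -> j moves
        psi[t] = trans.argmax(axis=0)         # best predecessor per state
        delta = trans.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):             # backtrack
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```

The synchronous HMM adds a substate layer between states and Gaussians; the multiplex variant decodes it while keeping `delta` the same size as here.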
  • M.U. Bin Altaf, T. Butko, Biing-Hwang Juang
    ABSTRACT: Real-world sounds are ubiquitous and form an important part of the edifice of our cognitive abilities. Their perception combines signatures from the spectral and temporal domains, among others, yet traditionally their analysis focuses on frame-based spectral properties. We consider the problem of sound analysis from a perceptual perspective and investigate the temporal properties of a “footsteps” sound, which is particularly challenging from the time-frequency analysis viewpoint. We identify the irregular repetition of self-similarity and the sense of duration as significant to its perceptual quality and extract features using the Teager-Kaiser energy operator. We build an acoustic event detection system for “footsteps” which shows promising results for detection in cross-environmental conditions when compared with a conventional approach.
    Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on; 01/2013
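The Teager-Kaiser energy operator used above has a simple discrete form, Ψ[x(n)] = x(n)² − x(n−1)·x(n+1); for a pure sinusoid A·sin(Ωn) it evaluates exactly to the constant A²·sin²(Ω), which is what makes it useful as a local energy/modulation feature:

```python
import numpy as np

def teager_kaiser(x):
    """Discrete Teager-Kaiser energy operator:
    psi[n] = x[n]^2 - x[n-1] * x[n+1], valid for 1 <= n <= N-2."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]
```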
  • Jason Wung, Ted S. Wada, Mehrez Souden, Biing-Hwang Fred Juang
    ABSTRACT: It is well established that a decorrelation procedure is required in a multi-channel acoustic echo control system to mitigate the so-called non-uniqueness problem. A recently proposed technique that accomplishes decorrelation by resampling (DBR) has been shown to be advantageous; it achieves a superior performance in the echo reduction gain and offers the possibility of frequency selective decorrelation to further preserve the sound quality of the system. In this paper, we analyze with rigor the performance behavior of DBR in terms of coherence reduction and the resultant misalignment of an adaptive filter. We derive closed-form expressions for the performance bounds and validate the theoretical analysis with simulation.
    Applications of Signal Processing to Audio and Acoustics (WASPAA), 2013 IEEE Workshop on; 01/2013
  • Chao Weng, Biing-Hwang Juang
    ABSTRACT: In this work, we propose latent semantic rational kernels (LSRK) for topic spotting on spontaneous conversational speech. Rather than mapping the input weighted finite-state transducers (WFSTs) onto a high dimensional n-gram feature space as in n-gram rational kernels, the proposed LSRK maps the WFSTs onto a latent semantic space. Moreover, with the LSRK framework, all available external knowledge can be flexibly incorporated to boost the topic spotting performance. The experiments we conducted on a spontaneous conversational task, Switchboard, show that our method can achieve significant performance gain over the baselines from 27.33% to 57.56% accuracy and almost double the classification accuracy over the n-gram rational kernels in all cases.
    Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on; 01/2013
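The latent semantic mapping at the core of LSRK is, in its classical form, a truncated SVD of a term-document matrix. The sketch below shows that step on plain count vectors, which is an illustrative simplification: the paper applies the mapping to n-gram expectations computed over WFSTs.

```python
import numpy as np

def latent_semantic_map(term_doc, k):
    """Project documents (columns of a term-by-document count matrix)
    into a k-dimensional latent semantic space via truncated SVD."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k].T @ term_doc  # shape: (k, n_docs)
```

Documents with identical term statistics map to identical latent vectors, so a kernel computed in this space compares topics rather than raw n-grams.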
  • Mehrez Souden, Jason Wung, Biing-Hwang Fred Juang
    ABSTRACT: This paper introduces an approach to cluster and suppress acoustic echo signals in hands-free, full-duplex speech communication systems. We employ the instantaneous recursive estimate of the magnitude squared coherence (MSC) of the echo line signal and the microphone signal, and model it with a two-component Beta mixture distribution. Since we consider the case of multiple microphone pickup, we further integrate the normalized recording vector as a location feature into the proposed approach to achieve reliable soft decisions on the echo presence. The location information has been widely used for clustering-based blind source separation, and can be modeled using a Watson mixture distribution. Simulation evaluations of the proposed method show that it can achieve significant echo suppression performance.
    Applications of Signal Processing to Audio and Acoustics (WASPAA), 2013 IEEE Workshop on; 01/2013
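The instantaneous recursive MSC estimate mentioned above can be sketched with exponential smoothing of the auto- and cross-spectra. The smoothing factor and STFT array shapes here are illustrative assumptions, not values from the paper:

```python
import numpy as np

def recursive_msc(X, Y, lam=0.9):
    """Recursive magnitude squared coherence between two STFT-domain
    signals X, Y of shape (frames, bins): smoothed |Sxy|^2 / (Sxx * Syy)."""
    Sxx = np.zeros(X.shape[1])
    Syy = np.zeros(X.shape[1])
    Sxy = np.zeros(X.shape[1], dtype=complex)
    msc = np.empty_like(X, dtype=float)
    for t in range(X.shape[0]):
        Sxx = lam * Sxx + (1 - lam) * np.abs(X[t]) ** 2
        Syy = lam * Syy + (1 - lam) * np.abs(Y[t]) ** 2
        Sxy = lam * Sxy + (1 - lam) * X[t] * np.conj(Y[t])
        msc[t] = np.abs(Sxy) ** 2 / (Sxx * Syy + 1e-12)
    return msc
```

The MSC stays near 1 when the microphone picks up mostly echo (coherent with the line signal) and drops toward 0 under near-end activity, which is what the Beta mixture then models for the soft echo-presence decision.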
  • Chao Weng, Biing-Hwang Juang
    ABSTRACT: In this work, we present a complete framework of discriminative training using non-uniform criteria for keyword spotting: adaptive boosted non-uniform minimum classification error (MCE) on spontaneous speech. To further boost the spotting performance and tackle the potential issue of over-training in the non-uniform MCE proposed in our prior work, we make two improvements to the fundamental MCE optimization procedure. Furthermore, motivated by AdaBoost, we introduce an adaptive scheme to embed error cost functions together with model combinations during the decoding stage. The proposed framework is comprehensively validated on two challenging large-scale spontaneous conversational telephone speech (CTS) tasks in different languages (English and Mandarin), and the experimental results show it achieves significant and consistent figure of merit (FOM) gains over both ML and discriminatively trained systems.
    Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on; 01/2013
  • Yong Zhao, Biing-Hwang Juang
    ABSTRACT: In this paper, we present the Gauss-Newton method as a unified approach to estimating noise parameters of the prevalent nonlinear compensation models, such as vector Taylor series (VTS), data-driven parallel model combination (DPMC), and unscented transform (UT), for noise-robust speech recognition. While iterative estimation of noise means in a generalized EM framework has been widely known, we demonstrate that such approaches are variants of the Gauss-Newton method. Furthermore, we propose a novel noise variance estimation algorithm that is consistent with the Gauss-Newton principle. The formulation of the Gauss-Newton method reduces the noise estimation problem to determining the Jacobians of the corrupted speech parameters. For sampling-based compensations, we present two methods, sample Jacobian average (SJA) and cross-covariance (XCOV), to evaluate these Jacobians. The proposed noise estimation algorithm is evaluated for various compensation models on two tasks. The first is to fit a Gaussian mixture model (GMM) to artificially corrupted samples, and the second is to perform speech recognition on the Aurora 2 database. The significant performance improvements confirm the efficacy of the Gauss-Newton method for estimating the noise parameters of the nonlinear compensation models.
    IEEE Transactions on Audio Speech and Language Processing 10/2012; 20(8):2191-2206. · 1.68 Impact Factor
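The Gauss-Newton iteration the paper builds on has the generic form θ ← θ − (JᵀJ)⁻¹Jᵀr for a residual r(θ) with Jacobian J(θ). A minimal sketch with the residual and Jacobian abstracted as callables (in the paper these come from the VTS/DPMC/UT compensation models; here they are user-supplied):

```python
import numpy as np

def gauss_newton(residual, jacobian, theta0, iters=20):
    """Generic Gauss-Newton iteration for nonlinear least squares.
    residual(theta) -> (n,) residual vector; jacobian(theta) -> (n, p)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iters):
        r = residual(theta)
        J = jacobian(theta)
        # normal-equations step: solve (J^T J) dtheta = J^T r
        theta = theta - np.linalg.solve(J.T @ J, J.T @ r)
    return theta
```

As the paper notes, once the update is in this form, the whole noise-estimation problem reduces to supplying the right Jacobian of the corrupted-speech parameters.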
  • Qiang Fu, Yong Zhao, Biing-Hwang Juang
    ABSTRACT: The Bayes decision theory is the foundation of the classical statistical pattern recognition approach, with the expected error as the performance objective. For most pattern recognition problems, the “error” is conventionally assumed to be binary, i.e., 0 or 1, equivalent to error counting, independent of the specifics of the error made by the system. The term “error rate” is thus long considered the prevalent system performance measure. This performance measure, nonetheless, may not be satisfactory in many practical applications. In automatic speech recognition, for example, it is well known that some errors are more detrimental (e.g., more likely to lead to misunderstanding of the spoken sentence) than others. In this paper, we propose an extended framework for the speech recognition problem with non-uniform classification/recognition error cost which can be controlled by the system designer. In particular, we address the issue of system model optimization when the cost of a recognition error is class dependent. We formulate the problem in the framework of the minimum classification error (MCE) method, after appropriate generalization to integrate the class-dependent error cost into one consistent objective function for optimization. We present a variety of training scenarios for automatic speech recognition under this extended framework. Experimental results for continuous speech recognition are provided to demonstrate the effectiveness of the new approach.
    IEEE Transactions on Audio Speech and Language Processing 01/2012; 20:780-793. · 1.68 Impact Factor
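One simple instantiation of the class-dependent error cost idea: scale the usual sigmoid-smoothed MCE loss by ε(i, j), the designer-chosen cost of confusing correct class i with the strongest rival j. This is an illustrative form; the paper integrates the cost into the objective more generally.

```python
import numpy as np

def weighted_mce_loss(scores, label, cost, gamma=2.0):
    """MCE-style loss with a non-uniform error cost.
    scores: discriminant g_j per class; label: correct class index;
    cost[i][j]: designer-assigned cost of recognizing class i as class j."""
    rivals = [j for j in range(len(scores)) if j != label]
    best = max(rivals, key=lambda j: scores[j])
    d = scores[best] - scores[label]           # misclassification measure
    loss = 1.0 / (1.0 + np.exp(-gamma * d))    # smoothed 0-1 error count
    return cost[label][best] * loss            # class-dependent weighting
```

With a uniform cost matrix this reduces to plain error counting; raising ε for detrimental confusions makes the optimizer spend its capacity avoiding exactly those errors.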
  • Mingyu Chen, Ghassan Al-Regib, Biing-Hwang Juang
    ABSTRACT: Motion-based control is gaining popularity, and motion gestures form a complementary modality in human-computer interactions. To achieve more robust user-independent motion gesture recognition in a manner analogous to automatic speech recognition, we need a deeper understanding of the motions in gesture, which arouses the need for a 6D motion gesture database. In this work, we present a database that contains comprehensive motion data, including the position, orientation, acceleration, and angular speed, for a set of common motion gestures performed by different users. We hope this motion gesture database can be a useful platform for researchers and developers to build their recognition algorithms as well as a common test bench for performance comparisons.
    Proceedings of the Third Annual ACM SIGMM Conference on Multimedia Systems, MMSys 2012, Chapel Hill, NC, USA, February 22-24, 2012; 01/2012
  • Ted S. Wada, Biing-Hwang Juang
    ABSTRACT: This paper examines the technique of using a noise-suppressing nonlinearity in the adaptive filter error feedback loop of an acoustic echo canceler (AEC) based on the least mean square (LMS) algorithm when there is an interference at the near end. The source of distortion may be linear, such as local speech or background noise, or nonlinear due to speech coding used in the telecommunication networks. Detailed derivation of the error recovery nonlinearity (ERN), which "enhances" the filter estimation error prior to the adaptation in order to assist the linear adaptation process, will be provided. Connections to other existing AEC and signal enhancement techniques will be revealed. In particular, the error enhancement technique is well-founded in the information-theoretic sense and has strong ties to independent component analysis (ICA), which is the basis for blind source separation (BSS) that permits unsupervised adaptation in the presence of multiple interfering signals. The single-channel AEC problem can be viewed as a special case of semi-blind source separation (SBSS) where one of the source signals is partially known, i.e., the far-end microphone signal that generates the near-end acoustic echo. The system approach to robust AEC will be motivated, where a proper integration of the LMS algorithm with the ERN into the AEC "system" allows for continuous and stable adaptation even during double talk without precise estimation of the signal statistics. The error enhancement paradigm encompasses many traditional signal enhancement techniques and opens up an entirely new avenue for solving the AEC problem in a real-world setting. Index Terms: acoustic echo cancellation (AEC), error enhancement, error nonlinearity, independent component analysis (ICA), robust statistics, semi-blind source separation (SBSS), system approach to signal enhancement.
    IEEE Transactions on Audio Speech and Language Processing 01/2012; 20:175-189. · 1.68 Impact Factor
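The ERN idea above can be illustrated with a normalized LMS filter whose error passes through a soft clipping nonlinearity before adaptation. The tanh clip is an assumed stand-in for illustration; the paper derives the nonlinearity from the signal statistics rather than fixing it a priori.

```python
import numpy as np

def lms_with_ern(x, d, L=8, mu=0.5, clip=5.0):
    """Normalized LMS echo canceler with a robustness nonlinearity on the
    error. x: far-end (reference) signal; d: microphone signal; L: filter
    length. Returns the filter estimate and the residual error signal."""
    w = np.zeros(L)
    e_out = np.zeros(len(x))
    for n in range(L - 1, len(x)):
        xv = x[n - L + 1:n + 1][::-1]        # most recent L samples
        e = d[n] - w @ xv                    # a-priori estimation error
        e_out[n] = e
        e_nl = clip * np.tanh(e / clip)      # suppress large error spikes
        w = w + mu * e_nl * xv / (xv @ xv + 1e-12)
    return w, e_out
```

For small errors the nonlinearity is nearly linear, so convergence matches plain NLMS; large near-end bursts (double talk) are clipped, which is what keeps the adaptation stable.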
  • Mingyu Chen, Ghassan AlRegib, Biing-Hwang Juang
    ABSTRACT: A motion gesture can be represented by a 3D spatial trajectory and may be augmented by three additional dimensions of orientation. Depending on the tracking technology in use, the 6D motion gesture can be tracked explicitly with the position and orientation or implicitly with the acceleration and angular speed. In this work, we first present a motion gesture database which contains both explicit and implicit 6D motion information. This database allows us to compare the recognition performance over different tracking signals on a common ground. Our main contribution is to investigate the relative effectiveness of various feature dimensions in motion gesture recognition. Using a simple and primitive recognizer, we evaluate the recognition results of both explicit and implicit motion data. In our experiments, both user-dependent and user-independent cases are addressed. We also propose two general techniques to improve the recognition accuracy: smoothing and temporal extension. Our pilot study produces benchmark results that give an insight into the attainable recognition accuracy with different tracking devices.
    01/2012;
  • J. Wung, T.S. Wada, Biing-Hwang Juang
    ABSTRACT: This paper presents a novel decorrelation procedure by frequency-domain resampling in sub-bands. The new procedure expands on the idea of resampling in the frequency domain that efficiently and effectively alleviates the non-uniqueness problem for a multi-channel acoustic echo cancellation system while introducing minimal distortion to the signal. We show in theory and verify experimentally that the amount of decorrelation in each sub-band, measured in terms of the coherence, can be controlled arbitrarily by varying the resampling ratio per frequency bin. For perceptual evaluation, we adjust the sub-band resampling ratios to match the coherence given by other decorrelation procedures. The speech quality (PESQ) score from the proposed decorrelation procedure remains high at around 4.5, which is about the highest possible PESQ score after signal modification.
    Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on; 01/2012
  • Mingyu Chen, G. AlRegib, Biing-Hwang Juang
    ABSTRACT: Depending on the tracking technology in use, a 6D motion gesture can be tracked and represented explicitly by the position and orientation or implicitly by the acceleration and angular speed. In this work, we first present the reasoning for the definition and recognition of motion gestures. Five basic feature vectors are then derived from the 6D motion data. Our main contribution is to investigate the relative effectiveness of various feature dimensions for motion gesture recognition in both user dependent and user independent cases. We also propose a feature normalization procedure and prove its effectiveness in achieving “scale” invariance especially in the user independent case. Our study gives an insight into the attainable recognition rate with different tracking devices.
    Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on; 01/2012
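A minimal sketch of the kind of "scale" normalization discussed above: center a 3D gesture trajectory at its centroid and divide by its largest extent, so the recognizer sees the same shape regardless of how large or where the gesture was performed. This is an illustrative procedure, not necessarily the exact one in the paper.

```python
import numpy as np

def normalize_trajectory(points):
    """Translation- and scale-normalize a 3D gesture trajectory of
    shape (n_samples, 3)."""
    p = np.asarray(points, dtype=float)
    p = p - p.mean(axis=0)          # remove where the gesture was drawn
    scale = np.abs(p).max()         # largest extent along any axis
    return p / (scale + 1e-12)      # remove how big it was drawn
```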
  • Chao Weng, Biing-Hwang Juang
    ABSTRACT: This work focuses on a comparative study of discriminative training using non-uniform criteria for cross-layer acoustic modeling. Two kinds of discriminative training (DT) frameworks, minimum classification error like (MCE-like) and minimum phone error like (MPE-like), are augmented to allow error cost embedding at the phoneme (model) level. To facilitate this comparative study, we implement both augmented DT frameworks under the same umbrella, using the error cost derived from the same cross-layer confusion matrix. Experiments on the large-vocabulary WSJ0 task demonstrate the effectiveness of both DT frameworks with the formulated non-uniform error cost embedded. Several preliminary investigations on the effect of the dynamic range of the error cost are also presented.
    Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on; 01/2012
  • Yong Zhao, Biing-Hwang Juang
    ABSTRACT: We have recently proposed the stranded HMM to achieve a more accurate representation of heterogeneous data. As opposed to the regular Gaussian mixture HMM, the stranded HMM explicitly models the relationships among the mixture components. The transitions among mixture components encode possible trajectories of acoustic features for speech units. Accurately representing the underlying transition structure is crucial for the stranded HMM to produce an optimal recognition performance. In this paper, we propose to learn the stranded HMM structure by imposing sparsity constraints. In particular, entropic priors are incorporated in the maximum a posteriori (MAP) estimation of the mixture transition matrices. The experimental results showed that a significant improvement in model sparsity can be obtained with a slight sacrifice of the recognition accuracy.
    Signals, Systems and Computers (ASILOMAR), 2012 Conference Record of the Forty Sixth Asilomar Conference on; 01/2012
  • Antonio Moreno-Daniel, Jay G. Wilpon, Biing-Hwang Juang
    ABSTRACT: As ubiquitous access to vast and remote information sources from portable devices becomes commonplace, the need for users to perform searches in keyboard-unfriendly situations grows substantially, triggering increased demand for voice search sessions. This paper proposes a methodology that addresses different dimensions of scalability of mixed-initiative voice search in automatic spoken dialog systems. The strategy is based on splitting the complexity of the fully-constrained grammar (one that tightly covers the entire hypothesis space) into a fixed, low-complexity phonotactic grammar followed by an index mechanism that dynamically assembles a second-pass grammar consisting of only a handful of hypotheses. The experimental analysis demonstrates different dimensions of scalability achieved by the proposed method using actual Whitepages-residential data.
    Speech Communication. 01/2012; 54:351-367.
  • Yong Zhao, Biing-Hwang Juang
    ABSTRACT: Gaussian mixture (GMM)-HMMs, though the predominant modeling technique for speech recognition, are often criticized as inaccurate for modeling heterogeneous data sources. In this work, we propose the stranded Gaussian mixture (SGMM)-HMM, an extension of the GMM-HMM, to explicitly model the dependence among the mixture components, i.e., each mixture component is assumed to depend on the previous mixture component in addition to the state that generates it. In the evaluation over the Aurora 2 database, the proposed 20-mixture SGMM system obtains a WER of 8.07%, a 10% relative improvement over the baseline GMM system. The experiments demonstrate the discriminating power possessed by the mixture weights in this advanced form.
    Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on; 01/2012

Publication Stats

6k Citations
276.52 Total Impact Points

Institutions

  • 2005–2012
    • Georgia Institute of Technology
      • Center for Signal & Image Processing
      • School of Electrical & Computer Engineering
      Atlanta, Georgia, United States
  • 2011
    • Rutgers, The State University of New Jersey
      • Department of Electrical and Computer Engineering
      New Brunswick, NJ, United States
  • 2009–2011
    • Fondazione Bruno Kessler
      Trento, Trentino-Alto Adige, Italy
    • Università degli Studi di Trento
      Trento, Trentino-Alto Adige, Italy
  • 2008
    • Broadcom Corporation
      Irvine, California, United States
  • 2006
    • Institute of Electrical and Electronics Engineers
      Washington, D.C., United States
  • 1997–2001
    • Kyoto University
      Kyoto, Kyōto, Japan
  • 1995–1998
    • AT&T Labs
      Austin, Texas, United States