Biing-Hwang Juang

Georgia Institute of Technology, Atlanta, Georgia, United States


Publications (163) · 294.62 Total impact

  • Muhammad Umair Altaf · Taras Butko · Biing-Hwang Fred Juang
    ABSTRACT: We describe acoustic gaits: quantitative characteristics of the natural human gait derived from the sound of footsteps as a person walks normally. We introduce the acoustic gait profile, which is obtained from temporal signal analysis of footstep sounds collected by microphones, and illustrate some of the spatio-temporal gait parameters that can be extracted from it using three temporal signal analysis methods: the squared energy estimate, the Hilbert transform, and the Teager-Kaiser energy operator. Based on statistical analysis of the parameter estimates, we show that the spatio-temporal parameters and gait characteristics obtained from the acoustic gait profile can consistently and reliably estimate a subset of the clinical and biometric gait parameters currently used in standardized gait assessments. We conclude that the Teager-Kaiser energy operator provides the most consistent gait parameter estimates, showing the least variation across sessions and zones. Acoustic gait analysis requires only an inexpensive set of microphones and a computing device, yet yields an accurate and unobtrusive gait analysis system. This is in contrast to the expensive and intrusive systems currently used in laboratory gait analysis, such as force plates, pressure mats, and wearable sensors, some of which may alter the very gait parameters being measured.
    IEEE Transactions on Biomedical Engineering 03/2015; 62(8). DOI:10.1109/TBME.2015.2410142 · 2.23 Impact Factor
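For reference, the three energy measures named in the abstract above have simple discrete-time forms. Below is a minimal Python sketch of them (a hypothetical illustration, not the authors' code), applied to a synthetic transient standing in for a real footstep recording:

```python
import numpy as np
from scipy.signal import hilbert

def squared_energy(x):
    """Squared energy estimate: instantaneous energy as the squared sample."""
    return x ** 2

def hilbert_energy(x):
    """Energy envelope from the magnitude of the analytic signal."""
    return np.abs(hilbert(x)) ** 2

def teager_kaiser(x):
    """Teager-Kaiser energy operator: psi[x(n)] = x(n)^2 - x(n-1) * x(n+1)."""
    psi = np.empty_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    psi[0], psi[-1] = psi[1], psi[-2]  # replicate the edge samples
    return psi

# Toy usage: a decaying 50 Hz burst standing in for a footstep impact;
# an acoustic gait profile would smooth one of these measures over time.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 50 * t) * np.exp(-5 * t)
profile = np.convolve(teager_kaiser(x), np.ones(160) / 160, mode="same")
```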
  • Source
    Chao Weng · Dong Yu · Shinji Watanabe · Biing-Hwang Fred Juang
    ABSTRACT: In this work, we propose recurrent deep neural networks (DNNs) for robust automatic speech recognition (ASR). Full recurrent connections are added to a hidden layer of a conventional feedforward DNN, allowing the model to capture temporal dependencies in the deep representations. A new backpropagation through time (BPTT) algorithm is introduced to make minibatch stochastic gradient descent (SGD) on the proposed recurrent DNNs more efficient and effective. We evaluate the proposed recurrent DNN architecture in a hybrid setup on both the 2nd CHiME challenge (track 2) and the Aurora-4 task. Experimental results on the CHiME challenge data show that the proposed system obtains a consistent 7% relative WER improvement over DNN systems, achieving state-of-the-art performance without front-end preprocessing, speaker-adaptive training, or multiple decoding passes. On Aurora-4, the proposed system achieves a 4% relative WER improvement over a strong DNN baseline.
    ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 05/2014
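The recurrent layer described above trains with backpropagation through time. As context, here is a minimal numpy sketch of standard truncated BPTT for a single fully recurrent tanh layer; it is not the paper's new, more efficient BPTT variant, and the feature size, hidden size, and learning rate are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 40, 64                      # assumed feature and hidden sizes
Wx = rng.normal(0, 0.1, (H, D))    # input-to-hidden weights
Wh = rng.normal(0, 0.1, (H, H))    # full recurrent hidden-to-hidden weights
b = np.zeros(H)

def forward(X):
    """X: (T, D) feature frames -> hidden states (T, H)."""
    h, states = np.zeros(H), []
    for x in X:
        h = np.tanh(Wx @ x + Wh @ h + b)
        states.append(h)
    return np.stack(states)

def bptt_step(X, dLdH, lr=0.01):
    """Given dL/dh_t for every frame, backpropagate through time and update."""
    Hs = forward(X)
    gWx, gWh, gb = np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(b)
    carry = np.zeros(H)                                # gradient from step t+1
    for t in range(len(X) - 1, -1, -1):
        dh = (dLdH[t] + carry) * (1.0 - Hs[t] ** 2)    # through tanh
        h_prev = Hs[t - 1] if t > 0 else np.zeros(H)
        gWx += np.outer(dh, X[t])
        gWh += np.outer(dh, h_prev)
        gb += dh
        carry = Wh.T @ dh                              # propagate to step t-1
    for W, g in ((Wx, gWx), (Wh, gWh), (b, gb)):
        W -= lr * g                                    # plain SGD update
```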
  • Chao Weng · Biing-Hwang Fred Juang
    ABSTRACT: In this work, we formulate keyword spotting as a non-uniform error automatic speech recognition (ASR) problem and propose a model training methodology based on the non-uniform minimum classification error (MCE) approach. The main idea is to adapt the fundamental MCE criterion to reflect a cost-sensitive notion: in a keyword spotting task, errors on keywords are much more significant than errors on non-keywords. This cost sensitivity leads to an emphasis on keyword models during parameter optimization. We then present a system that exploits the weighted finite-state transducer (WFST) framework to implement non-uniform MCE efficiently. To further enhance non-uniform error cost minimization for keyword spotting, we formulate a technique called "adaptive boosted non-uniform MCE", which incorporates the idea of boosting. We validate the proposed framework on two challenging large-scale spontaneous conversational telephone speech (CTS) datasets in two different languages (English and Mandarin). Experimental results show that our framework achieves consistent and significant spotting performance gains over both the maximum likelihood estimation (MLE) baseline and conventional discriminatively trained systems with uniform error cost.
    IEEE/ACM Transactions on Audio, Speech, and Language Processing 01/2014; 23(2). DOI:10.1109/TASLP.2014.2381931
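The heart of the non-uniform MCE idea is a class-dependent weight on the usual smoothed error count. A hedged Python sketch of such a loss for a single token follows (generic MCE with an assumed error-cost vector; the paper's WFST-based implementation is far more involved):

```python
import numpy as np

def nonuniform_mce_loss(scores, label, cost, gamma=1.0, eta=2.0):
    """Smoothed MCE loss with a class-dependent error cost.

    scores : (M,) discriminant scores g_j(x), one per class
    label  : index of the correct class
    cost   : (M,) error cost per class; keywords get larger costs
    """
    g_correct = scores[label]
    rivals = np.delete(scores, label)
    # smoothed max over competing classes (eta controls the smoothness)
    g_rival = np.log(np.mean(np.exp(eta * rivals))) / eta
    d = g_rival - g_correct                            # misclassification measure
    smoothed_error = 1.0 / (1.0 + np.exp(-gamma * d))  # sigmoid error count
    return cost[label] * smoothed_error                # non-uniform: scale by cost
```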
  • Mingyu Chen · Ghassan AlRegib · Biing-Hwang Juang
    ABSTRACT: A 6D motion gesture is represented by a 3D spatial trajectory augmented by another three dimensions of orientation. Depending on the tracking technology, the motion can be tracked explicitly with position and orientation or implicitly with acceleration and angular speed. In this work, we address the problem of motion gesture recognition for command-and-control applications. Our main contribution is to investigate the relative effectiveness of various feature dimensions for motion gesture recognition in both user-dependent and user-independent cases. We introduce a statistical feature-based classifier as the baseline and propose an HMM-based recognizer, which offers more flexibility in feature selection and achieves better recognition accuracy than the baseline system. Our motion gesture database, which contains both explicit and implicit motion information, allows us to compare the recognition performance of different tracking signals on common ground. The study also gives insight into the recognition rate attainable with different tracking devices, which is valuable for system designers choosing a tracking technology.
    IEEE Transactions on Multimedia 04/2013; 15(3):561-571. DOI:10.1109/TMM.2012.2237024 · 1.78 Impact Factor
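An HMM-per-gesture recognizer of the kind evaluated above can be sketched with an off-the-shelf toolkit. The following assumes the hmmlearn package and generic (T, D) feature trajectories; it is a schematic baseline, not the authors' recognizer, and the state count is a guess.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM  # assumes the hmmlearn package

def train_gesture_models(data, n_states=5):
    """data: {gesture_name: list of (T_i, D) feature trajectories}."""
    models = {}
    for name, trajs in data.items():
        X = np.vstack(trajs)               # concatenated frames
        lengths = [len(t) for t in trajs]  # per-trajectory lengths
        m = GaussianHMM(n_components=n_states, covariance_type="diag",
                        n_iter=20)
        m.fit(X, lengths)
        models[name] = m
    return models

def classify(models, traj):
    """Label a trajectory by the model with the highest log-likelihood."""
    return max(models, key=lambda name: models[name].score(traj))
```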
  • Shinji Watanabe · Atsushi Nakamura · Biing-Hwang (Fred) Juang
    ABSTRACT: Linear regression for hidden Markov model (HMM) parameters is widely used for adaptive training in time-series pattern analysis, especially speech processing. The regression parameters are usually shared among sets of Gaussians in HMMs, where the Gaussian clusters are represented by a tree. This paper develops a fully Bayesian treatment of linear regression for HMMs that accounts for this regression tree structure by using variational techniques. It analytically derives the variational lower bound of the marginalized log-likelihood of the linear regression. Using this lower bound as an objective function, we can algorithmically optimize the tree structure and hyper-parameters of the linear regression rather than heuristically tweaking them as tuning parameters. Experiments on large-vocabulary continuous speech recognition confirm the generalizability of the proposed approach, especially when the amount of adaptation data is limited.
    Journal of Signal Processing Systems 03/2013; 74(3):341-358. DOI:10.1007/s11265-013-0785-8 · 0.56 Impact Factor
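The objective used above is a variational lower bound on the marginalized log-likelihood. Its generic shape, for observations O, latent state sequences Z, regression parameters W, and tree structure m, is the standard evidence lower bound (the paper's specific factorization and priors are not reproduced here):

```latex
\log p(O \mid m) \;\ge\; \mathcal{F}(q; m)
  \;=\; \mathbb{E}_{q(Z, W)}\!\left[ \log \frac{p(O, Z, W \mid m)}{q(Z, W)} \right],
\qquad q(Z, W) \approx q(Z)\, q(W).
```

Maximizing F over the variational posterior q, the tree structure m, and the hyper-parameters is what replaces the heuristic tuning mentioned in the abstract.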
  • Source
    Jason Wung · Ted S. Wada · Mehrez Souden · Biing-Hwang Fred Juang
    ABSTRACT: It is well established that a decorrelation procedure is required in a multi-channel acoustic echo control system to mitigate the so-called non-uniqueness problem. A recently proposed technique that accomplishes decorrelation by resampling (DBR) has been shown to be advantageous: it achieves superior performance in echo reduction gain and offers the possibility of frequency-selective decorrelation to further preserve the sound quality of the system. In this paper, we rigorously analyze the performance behavior of DBR in terms of coherence reduction and the resulting misalignment of an adaptive filter. We derive closed-form expressions for the performance bounds and validate the theoretical analysis with simulation.
    Applications of Signal Processing to Audio and Acoustics (WASPAA), 2013 IEEE Workshop on; 01/2013
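Coherence reduction, the quantity the analysis above bounds, is directly measurable. A toy Python check using scipy (synthetic signals standing in for the two loudspeaker channels; no real echo path is modeled):

```python
import numpy as np
from scipy.signal import coherence

fs = 16000
rng = np.random.default_rng(1)
s = rng.standard_normal(fs)                        # toy mono source
left = s
right = 0.9 * s + 0.01 * rng.standard_normal(fs)   # nearly coherent pair

f, msc = coherence(left, right, fs=fs, nperseg=512)
# msc near 1 across f signals the non-uniqueness problem; an effective
# decorrelation procedure (such as DBR) pushes it down, ideally only in
# frequency bands where the ear is least sensitive to the distortion.
```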
  • Chao Weng · Biing-Hwang Juang
    ABSTRACT: In this work, we present a complete framework of discriminative training using non-uniform criteria for keyword spotting on spontaneous speech: adaptive boosted non-uniform minimum classification error (MCE). To further boost spotting performance and address the potential over-training issue in the non-uniform MCE proposed in our prior work, we make two improvements to the fundamental MCE optimization procedure. Furthermore, motivated by AdaBoost, we introduce an adaptive scheme that embeds error cost functions together with model combinations during the decoding stage. The proposed framework is comprehensively validated on two challenging large-scale spontaneous conversational telephone speech (CTS) tasks in different languages (English and Mandarin), and the experimental results show that it achieves significant and consistent figure of merit (FOM) gains over both ML-trained and discriminatively trained systems.
    Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on; 01/2013
  • Chao Weng · Biing-Hwang Juang
    ABSTRACT: In this work, we propose latent semantic rational kernels (LSRK) for topic spotting on spontaneous conversational speech. Rather than mapping the input weighted finite-state transducers (WFSTs) onto a high-dimensional n-gram feature space as in n-gram rational kernels, the proposed LSRK maps the WFSTs onto a latent semantic space. Moreover, within the LSRK framework, all available external knowledge can be flexibly incorporated to boost topic spotting performance. Experiments on a spontaneous conversational task, Switchboard, show that our method achieves a significant performance gain over the baselines, from 27.33% to 57.56% accuracy, and almost doubles the classification accuracy of the n-gram rational kernels in all cases.
    Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on; 01/2013
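The latent semantic mapping can be illustrated on plain text, though the paper's LSRK operates on full WFST lattices rather than 1-best strings. A toy sklearn sketch (the documents and dimensions are made up):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["how is the weather today", "the game went into overtime",
        "rain is expected tomorrow", "the team won the championship"]
tfidf = TfidfVectorizer().fit_transform(docs)  # term/TF-IDF feature space
svd = TruncatedSVD(n_components=2, random_state=0)
Z = svd.fit_transform(tfidf)                   # latent semantic space
K = Z @ Z.T                                    # latent semantic kernel matrix
print(np.round(K, 2))  # weather docs pair up, sports docs pair up
```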
  • M. Umair Bin Altaf · Taras Butko · Biing-Hwang (Fred) Juang
    ABSTRACT: Real-world sounds are ubiquitous and form an important part of the edifice of our cognitive abilities. Their perception combines signatures from spectral and temporal domains, among others, yet traditionally their analysis focuses on frame-based spectral properties. We consider the problem of sound analysis from a perceptual perspective and investigate the temporal properties of the “footsteps” sound, which is particularly challenging from the time-frequency analysis viewpoint. We identify the irregular repetition of self-similarity and the sense of duration as significant to its perceptual quality, and we extract features using the Teager-Kaiser energy operator. We build an acoustic event detection system for “footsteps” that shows promising detection results in cross-environment conditions when compared with a conventional approach.
    Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on; 01/2013
  • Source
    Jason Wung · T.S. Wada · Biing-Hwang (Fred) Juang
    ABSTRACT: This paper examines the effect of inter-channel decorrelation by sub-band resampling (SBR) on the performance of a robust acoustic echo cancellation (AEC) system based on the residual echo enhancement technique. Due to the flexibility of SBR, its decorrelation performance, as measured by the coherence, can be matched with other conventional decorrelation procedures. Given the same degree of decorrelation, we have shown previously that SBR achieves superior audio quality compared to other procedures. We show in this paper that SBR also provides higher stereophonic AEC performance in very noisy conditions, where the performance is evaluated by decomposing the true echo return loss enhancement and the misalignment per sub-band to better demonstrate the superiority of our decorrelation procedure over other methods.
    Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on; 01/2013
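The two evaluation quantities used above, echo return loss enhancement (ERLE) and misalignment, have standard definitions; the paper decomposes them per sub-band, while this minimal Python sketch shows only the full-band versions:

```python
import numpy as np

def erle_db(d, e, eps=1e-12):
    """ERLE in dB: energy of the echo-bearing microphone signal d over the
    energy of the residual e left after cancellation."""
    return 10.0 * np.log10((np.sum(d ** 2) + eps) / (np.sum(e ** 2) + eps))

def misalignment_db(h_true, h_est):
    """Normalized misalignment in dB between the true echo path h_true and
    the adaptive filter estimate h_est; lower is better."""
    return 20.0 * np.log10(np.linalg.norm(h_true - h_est)
                           / np.linalg.norm(h_true))
```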
  • Yong Zhao · Biing-Hwang Juang
    ABSTRACT: In this paper, we propose a novel acoustic modeling framework, the synchronous HMM, which takes full advantage of heterogeneous data sources and achieves an optimal balance between modeling accuracy and robustness. The synchronous HMM introduces an additional layer of substates between the HMM states and the Gaussian component variables. The substates have the capability to register long-span non-phonetic attributes, which are collectively called speech scenes in this study. This hierarchical modeling scheme allows an accurate description of the probability distribution of speech units in different speech scenes. To address the data sparsity problem, a decision-based clustering algorithm is presented to determine the set of speech scenes and to tie the substate parameters. Moreover, we propose the multiplex Viterbi algorithm to efficiently decode synchronous HMMs within a search space of the same size as for standard HMMs. Experiments on the Aurora 2 task show that synchronous HMMs produce a significant improvement in recognition performance over the HMM baseline, at the expense of a moderate increase in memory requirements and computational complexity.
    Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on; 01/2013
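The multiplex Viterbi algorithm decodes the substate hypotheses within the same search space as standard decoding. For context, here is a log-domain Python sketch of the standard Viterbi recursion it extends (the multiplex bookkeeping itself is not shown):

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Standard Viterbi decoding in the log domain.

    log_pi: (S,)   log initial state probabilities
    log_A : (S, S) log transition matrix, entry (i, j) = log P(j | i)
    log_B : (T, S) log emission likelihoods per frame
    """
    T, S = log_B.shape
    delta = log_pi + log_B[0]            # best score ending in each state
    back = np.zeros((T, S), dtype=int)   # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A  # (from-state, to-state)
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + log_B[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):        # trace the best path backwards
        path.append(back[t][path[-1]])
    return path[::-1]
```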
  • Source
    Mehrez Souden · Jason Wung · Biing-Hwang Fred Juang
    ABSTRACT: This paper introduces an approach to clustering and suppressing acoustic echo signals in hands-free, full-duplex speech communication systems. We employ an instantaneous recursive estimate of the magnitude squared coherence (MSC) between the echo line signal and the microphone signal, and model it with a two-component Beta mixture distribution. Since we consider the case of multiple-microphone pickup, we further integrate the normalized recording vector as a location feature into the proposed approach to achieve reliable soft decisions on echo presence. Location information has been widely used for clustering-based blind source separation and can be modeled with a Watson mixture distribution. Simulation evaluations show that the proposed method achieves significant echo suppression performance.
    Applications of Signal Processing to Audio and Acoustics (WASPAA), 2013 IEEE Workshop on; 01/2013
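Fitting the two-component Beta mixture to coherence values in (0, 1) can be sketched with an EM-style loop. Since the exact M-step for Beta parameters has no closed form, this hypothetical Python sketch uses weighted moment matching, which may differ from the authors' estimator:

```python
import numpy as np
from scipy.stats import beta

def fit_beta_mixture(x, n_iter=50):
    """EM-style fit of a two-component Beta mixture to samples x in (0, 1)."""
    a = np.array([2.0, 5.0])   # initial shape parameters (assumed)
    b = np.array([5.0, 2.0])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each sample
        p = np.stack([pi[k] * beta.pdf(x, a[k], b[k]) for k in range(2)])
        r = p / p.sum(axis=0, keepdims=True)
        # M-step: weighted moments -> Beta shapes (moment-matching shortcut)
        for k in range(2):
            w = r[k] / r[k].sum()
            m = np.sum(w * x)
            v = np.sum(w * (x - m) ** 2)
            common = m * (1.0 - m) / v - 1.0
            a[k], b[k] = m * common, (1.0 - m) * common
        pi = r.mean(axis=1)
    return pi, a, b   # the high-mean component flags likely echo presence
```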
  • Yong Zhao · Biing-Hwang Juang
    ABSTRACT: In this paper, we present the Gauss-Newton method as a unified approach to estimating the noise parameters of prevalent nonlinear compensation models, such as vector Taylor series (VTS), data-driven parallel model combination (DPMC), and the unscented transform (UT), for noise-robust speech recognition. While iterative estimation of noise means in a generalized EM framework is widely known, we demonstrate that such approaches are variants of the Gauss-Newton method. Furthermore, we propose a novel noise variance estimation algorithm that is consistent with the Gauss-Newton principle. The Gauss-Newton formulation reduces the noise estimation problem to determining the Jacobians of the corrupted speech parameters. For sampling-based compensation, we present two methods, sample Jacobian average (SJA) and cross-covariance (XCOV), to evaluate these Jacobians. The proposed noise estimation algorithm is evaluated for various compensation models on two tasks: fitting a Gaussian mixture model (GMM) to artificially corrupted samples, and speech recognition on the Aurora 2 database. The significant performance improvements confirm the efficacy of the Gauss-Newton method for estimating the noise parameters of nonlinear compensation models.
    IEEE Transactions on Audio Speech and Language Processing 10/2012; 20(8):2191-2206. DOI:10.1109/TASL.2012.2199107 · 2.63 Impact Factor
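In the log-spectral domain, ignoring the channel term, the VTS mismatch function and its Jacobian have simple element-wise forms, which is all a Gauss-Newton step needs. A simplified Python sketch under these assumptions (diagonal Jacobians, equal per-Gaussian weights; not the paper's full algorithm):

```python
import numpy as np

def vts_mismatch(mu_x, mu_n):
    """Mismatch y = x + log(1 + exp(n - x)) and its Jacobian w.r.t. n."""
    g = mu_x + np.log1p(np.exp(mu_n - mu_x))  # corrupted-speech mean
    J = 1.0 / (1.0 + np.exp(mu_x - mu_n))     # dy/dn = sigmoid(n - x)
    return g, J

def gauss_newton_step(mu_n, clean_means, observed_means):
    """One Gauss-Newton update of the noise mean from per-Gaussian residuals."""
    JtJ, Jtr = 0.0, 0.0
    for mu_x, y in zip(clean_means, observed_means):
        g, J = vts_mismatch(mu_x, mu_n)
        JtJ += J * J               # element-wise normal equations
        Jtr += J * (y - g)         # (diagonal Jacobian assumption)
    return mu_n + Jtr / (JtJ + 1e-12)
```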
  • Source
    Chao Weng · Biing-Hwang Juang
    ABSTRACT: This work focuses on a comparative study of discriminative training using non-uniform criteria for cross-layer acoustic modeling. Two kinds of discriminative training (DT) frameworks, minimum classification error-like (MCE-like) and minimum phone error-like (MPE-like), are each augmented to allow error cost embedding at the phoneme (model) level. To facilitate this comparative study, we implement both augmented DT frameworks under the same umbrella, using the error cost derived from the same cross-layer confusion matrix. Experiments on the large-vocabulary WSJ0 task demonstrate the effectiveness of both DT frameworks with the formulated non-uniform error cost embedded. Several preliminary investigations into the effect of the dynamic range of the error cost are also presented.
    Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on; 03/2012
  • Source
    Yong Zhao · Andrej Ljolje · Diamantino Caseiro · Biing-Hwang Juang
    ABSTRACT: In this paper, we present a general algorithmic framework based on WFSTs for implementing a variety of discriminative training methods, such as MMI, MCE, and MPE/MWE. In contrast to ordinary word lattices, transducer-based lattices are more amenable to representing and manipulating the underlying hypothesis space and have a finer granularity, down to the HMM-state level. The transducers are processed into a two-layer hierarchy: at the high level, the structure is analogous to a word lattice, and at the lower level each word transition embodies an HMM-state subgraph for that word. This hierarchy, combined with appropriate customization of the transducers, leads to a flexible implementation for all of the training criteria discussed. The effectiveness of the framework is verified on two speech recognition tasks: Resource Management and AT&T SCANMail, an internal voicemail-to-text task.
    Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on; 03/2012
  • Antonio Moreno-Daniel · Jay G. Wilpon · Biing-Hwang Juang
    ABSTRACT: As ubiquitous access to vast and remote information sources from portable devices becomes commonplace, users increasingly need to perform searches in keyboard-unfriendly situations, triggering increased demand for voice search sessions. This paper proposes a methodology that addresses different dimensions of scalability of mixed-initiative voice search in automatic spoken dialog systems. The strategy is based on splitting the complexity of the fully constrained grammar (one that tightly covers the entire hypothesis space) into a fixed, low-complexity phonotactic grammar followed by an index mechanism that dynamically assembles a second-pass grammar consisting of only a handful of hypotheses. The experimental analysis demonstrates the different dimensions of scalability achieved by the proposed method using actual Whitepages residential data.
    Speech Communication 03/2012; 54(3):351-367. DOI:10.1016/j.specom.2011.09.006 · 1.55 Impact Factor
  • Source
    Qiang Fu · Yong Zhao · Biing-Hwang Juang
    ABSTRACT: Bayes decision theory is the foundation of the classical statistical pattern recognition approach, with the expected error as the performance objective. For most pattern recognition problems, the “error” is conventionally assumed to be binary, i.e., 0 or 1, equivalent to error counting, independent of the specifics of the error made by the system. The term “error rate” has thus long been considered the prevalent system performance measure. This performance measure, nonetheless, may not be satisfactory in many practical applications. In automatic speech recognition, for example, it is well known that some errors are more detrimental (e.g., more likely to lead to misunderstanding of the spoken sentence) than others. In this paper, we propose an extended framework for the speech recognition problem with a non-uniform classification/recognition error cost that can be controlled by the system designer. In particular, we address the issue of system model optimization when the cost of a recognition error is class dependent. We formulate the problem in the framework of the minimum classification error (MCE) method, after appropriate generalization to integrate the class-dependent error cost into one consistent objective function for optimization. We present a variety of training scenarios for automatic speech recognition under this extended framework. Experimental results for continuous speech recognition are provided to demonstrate the effectiveness of the new approach.
    IEEE Transactions on Audio Speech and Language Processing 03/2012; 20(3):780-793. DOI:10.1109/TASL.2011.2165279 · 2.63 Impact Factor
  • Yong Zhao · Biing-Hwang Juang
    ABSTRACT: Gaussian mixture (GMM)-HMMs, though the predominant modeling technique for speech recognition, are often criticized as inadequate for modeling heterogeneous data sources. In this work, we propose the stranded Gaussian mixture (SGMM)-HMM, an extension of the GMM-HMM, to explicitly model the dependence among mixture components: each mixture component is assumed to depend on the previous mixture component in addition to the state that generates it. In an evaluation on the Aurora 2 database, the proposed 20-mixture SGMM system obtains a WER of 8.07%, a 10% relative improvement over the baseline GMM system. The experiments demonstrate the discriminating power that the mixture weights possess in this advanced form.
    Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on; 01/2012
  • Source
    Mingyu Chen · Ghassan AlRegib · Biing-Hwang Juang
    ABSTRACT: A motion gesture can be represented by a 3D spatial trajectory and may be augmented by three additional dimensions of orientation. Depending on the tracking technology in use, the 6D motion gesture can be tracked explicitly with position and orientation or implicitly with acceleration and angular speed. In this work, we first present a motion gesture database that contains both explicit and implicit 6D motion information. This database allows us to compare the recognition performance of different tracking signals on common ground. Our main contribution is to investigate the relative effectiveness of various feature dimensions in motion gesture recognition. Using a simple and primitive recognizer, we evaluate the recognition results for both explicit and implicit motion data. In our experiments, both user-dependent and user-independent cases are addressed. We also propose two general techniques to improve recognition accuracy: smoothing and temporal extension. Our pilot study produces benchmark results that give insight into the recognition accuracy attainable with different tracking devices.
  • Source
    Mingyu Chen · Ghassan Al-Regib · Biing-Hwang Juang
    ABSTRACT: Motion-based control is gaining popularity, and motion gestures form a complementary modality in human-computer interaction. To achieve more robust user-independent motion gesture recognition in a manner analogous to automatic speech recognition, we need a deeper understanding of the motions in gestures, which motivates the need for a 6D motion gesture database. In this work, we present a database that contains comprehensive motion data, including position, orientation, acceleration, and angular speed, for a set of common motion gestures performed by different users. We hope this motion gesture database can serve as a useful platform for researchers and developers to build their recognition algorithms, as well as a common test bench for performance comparisons.
    Proceedings of the Third Annual ACM SIGMM Conference on Multimedia Systems, MMSys 2012, Chapel Hill, NC, USA, February 22-24, 2012; 01/2012

Publication Stats

7k Citations
294.62 Total Impact Points

Institutions

  • 2005–2014
    • Georgia Institute of Technology
      • Center for Signal & Image Processing
      • School of Electrical & Computer Engineering
      Atlanta, Georgia, United States
    • Nippon Telegraph and Telephone
      Tokyo, Japan
  • 2011
    • Rutgers, The State University of New Jersey
      • Department of Electrical and Computer Engineering
      New Brunswick, NJ, United States
  • 2010
    • CA Technologies
      New York, New York, United States
  • 2009
    • Università degli Studi di Trento
      Trento, Trentino-Alto Adige, Italy
  • 2008
    • Broadcom Corporation
      Irvine, California, United States
  • 2007
    • NTT DATA Corporation
      Tokyo, Japan
  • 2006
    • Institute of Electrical and Electronics Engineers
      Washington, D.C., United States
  • 1997–2001
    • Kyoto University
      Kyoto, Japan
  • 1998
    • AT&T Labs
      Austin, Texas, United States