Biing-Hwang Juang

Georgia Institute of Technology, Atlanta, Georgia, United States

Publications (166)

  • Mingyu Chen · Ghassan AlRegib · Biing-Hwang Juang
ABSTRACT: Air-writing refers to the writing of characters or words in free space by hand or finger movements. We address air-writing recognition problems in two companion papers. Part 2 addresses detecting and recognizing air-writing activities that are embedded in a continuous motion trajectory without delimitation. Detecting intended writing activity among superfluous finger movements unrelated to letters or words presents a challenge that must be treated separately from the traditional problem of pattern recognition. We first present a dataset that contains a mixture of writing and non-writing finger motions in each recording. The LEAP from Leap Motion is used for marker-free and glove-free finger tracking. We propose a window-based approach that automatically detects and extracts air-writing events in a continuous stream of motion data containing stray finger movements unrelated to writing; consecutive writing events are merged into a writing segment, and recognition performance is then evaluated on the detected segments. Our main contribution is to build an air-writing system encompassing both detection and recognition stages and to give insight into how the detected writing segments affect the recognition result. With leave-one-out cross-validation, the proposed system achieves an overall segment error rate of 1.15% for word-based recognition and 9.84% for letter-based recognition.
Article · Nov 2015 · IEEE Transactions on Human-Machine Systems (sketch follows)
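The window-based detection step above lends itself to a compact sketch: assume one writing-likelihood score per analysis window (e.g., from a binary classifier over finger-motion features) and merge consecutive above-threshold windows into writing segments. Names and thresholds here are illustrative, not the paper's implementation.

```python
import numpy as np

def detect_writing_segments(scores, threshold=0.5, min_windows=3):
    """Merge consecutive above-threshold analysis windows into
    writing segments; short runs are discarded as stray motion."""
    active = np.asarray(scores) > threshold
    segments, start = [], None
    for i, is_writing in enumerate(active):
        if is_writing and start is None:
            start = i                      # a writing event begins
        elif not is_writing and start is not None:
            if i - start >= min_windows:   # keep only sustained runs
                segments.append((start, i))
            start = None
    if start is not None and len(active) - start >= min_windows:
        segments.append((start, len(active)))
    return segments

# detect_writing_segments([0.1, 0.8, 0.9, 0.7, 0.2, 0.9]) -> [(1, 4)]
```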
  • Mingyu Chen · Ghassan AlRegib · Biing-Hwang Juang
ABSTRACT: Air-writing refers to the writing of linguistic characters or words in free space by hand or finger movements. Air-writing differs from conventional handwriting: the latter contains the pen-up-pen-down motion, while the former lacks such a delimited sequence of writing events. We address air-writing recognition problems in a pair of companion papers. In Part I, recognition of characters or words is accomplished based on six-degree-of-freedom hand motion data. We address air-writing on two levels: motion characters and motion words. Isolated air-writing characters can be recognized much like motion gestures, although with increased sophistication and variability. For motion word recognition, in which letters are connected and superimposed in the same virtual box in space, we build statistical models for words by concatenating clustered ligature models and individual letter models. A hidden Markov model is used for air-writing modeling and recognition. We show that motion data along dimensions beyond a 2-D trajectory can be beneficially discriminative for air-writing recognition. We investigate the relative effectiveness of various feature dimensions of optical and inertial tracking signals and report the attainable recognition performance for each. The proposed system achieves a word error rate of 0.8% for word-based recognition and 1.9% for letter-based recognition. We also evaluate, subjectively and objectively, the effectiveness of air-writing against text input with a virtual keyboard: air-writing and the virtual keyboard achieve 5.43 and 8.42 words per minute, respectively.
Article · Nov 2015 · IEEE Transactions on Human-Machine Systems (sketch follows)
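The word-modeling scheme above (letter HMMs joined by clustered ligature HMMs) can be pictured if an HMM is reduced to a flat list of emitting states. The dictionaries and the None-keyed fallback ligature below are hypothetical, and a real system would also splice transition probabilities rather than just concatenate states.

```python
def build_word_model(word, letter_hmms, ligature_hmms):
    """Concatenate letter models left to right, bridging each
    adjacent letter pair with a (clustered) ligature model."""
    states = list(letter_hmms[word[0]])
    for prev, cur in zip(word, word[1:]):
        bridge = ligature_hmms.get((prev, cur), ligature_hmms[None])
        states += list(bridge) + list(letter_hmms[cur])
    return states

# build_word_model("ab", letter_hmms, ligature_hmms) yields the
# states of 'a', then the (a, b) ligature, then the states of 'b'.
```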
  • Muhammad Umair Altaf · Taras Butko · Biing-Hwang Fred Juang
ABSTRACT: We describe acoustic gaits: quantitative characteristics of natural human gait derived from the sound of footsteps as a person walks normally. We introduce the acoustic gait profile, obtained by temporal signal analysis of footstep sounds collected by microphones, and illustrate some of the spatio-temporal gait parameters that can be extracted from it using three temporal signal analysis methods: the squared energy estimate, the Hilbert transform, and the Teager-Kaiser energy operator. Based on statistical analysis of the parameter estimates, we show that the spatio-temporal parameters and gait characteristics obtained from the acoustic gait profile can consistently and reliably estimate a subset of the clinical and biometric gait parameters currently used in standardized gait assessments. We conclude that the Teager-Kaiser energy operator provides the most consistent gait parameter estimates, showing the least variation across different sessions and zones. Acoustic gait analysis uses an inexpensive set of microphones with a computing device as an accurate and nonintrusive system. This contrasts with the expensive and intrusive systems currently used in laboratory gait analysis, such as force plates, pressure mats, and wearable sensors, some of which may alter the very gait parameters being measured.
Article · Mar 2015 · IEEE Transactions on Biomedical Engineering (sketch follows)
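Of the three analysis methods named above, the Teager-Kaiser energy operator has a particularly simple discrete form, sketched here; the smoothing step is an illustrative guess at how an envelope-like acoustic gait profile could be formed, not the paper's exact procedure.

```python
import numpy as np

def teager_kaiser(x):
    """Discrete Teager-Kaiser energy operator:
    psi[n] = x[n]**2 - x[n-1] * x[n+1]."""
    x = np.asarray(x, dtype=float)
    psi = np.empty_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    psi[0], psi[-1] = psi[1], psi[-2]  # replicate edge values
    return psi

def gait_profile(footsteps, fs, win_ms=50.0):
    """Moving-average smoothing of the rectified TKEO output gives a
    temporal envelope from which step events and inter-step
    intervals can be read off."""
    win = max(1, int(fs * win_ms / 1000.0))
    kernel = np.ones(win) / win
    return np.convolve(np.abs(teager_kaiser(footsteps)), kernel, mode="same")
```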
  • Chao Weng · Dong Yu · Shinji Watanabe · Biing-Hwang Fred Juang
ABSTRACT: In this work, we propose recurrent deep neural networks (DNNs) for robust automatic speech recognition (ASR). Full recurrent connections are added to a chosen hidden layer of a conventional feedforward DNN, allowing the model to capture temporal dependency in deep representations. A new backpropagation through time (BPTT) algorithm is introduced to make minibatch stochastic gradient descent (SGD) on the proposed recurrent DNNs more efficient and effective. We evaluate the proposed recurrent DNN architecture under the hybrid setup on both the 2nd CHiME challenge (track 2) and the Aurora-4 task. Experimental results on the CHiME challenge data show that the proposed system obtains a consistent 7% relative WER improvement over the DNN systems, achieving state-of-the-art performance without front-end preprocessing, speaker-adaptive training, or multiple decoding passes. On Aurora-4, the proposed system achieves a 4% relative WER improvement over a strong DNN baseline.
Conference Paper · May 2014 (sketch follows)
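A toy forward pass for the architecture described above: one hidden layer of an otherwise feedforward DNN receives full recurrent connections, so frame t sees the feedforward input plus the layer's own previous output. The sigmoid nonlinearity and the shapes are assumptions for illustration, not the paper's configuration.

```python
import numpy as np

def recurrent_layer_forward(X, W_in, W_rec, b):
    """X: (T, d_in) input frames; W_in: (d_in, d_h) feedforward
    weights; W_rec: (d_h, d_h) recurrent weights; b: (d_h,) bias.
    Returns (T, d_h) hidden activations, computed frame by frame,
    which is exactly the dependency BPTT later unrolls."""
    T, d_h = X.shape[0], b.shape[0]
    H = np.zeros((T, d_h))
    h = np.zeros(d_h)
    for t in range(T):
        h = 1.0 / (1.0 + np.exp(-(X[t] @ W_in + h @ W_rec + b)))
        H[t] = h
    return H
```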
  • Chao Weng · Biing-Hwang Fred Juang
ABSTRACT: In this work, we formulate keyword spotting as a non-uniform error automatic speech recognition (ASR) problem and propose a model training methodology based on the non-uniform minimum classification error (MCE) approach. The main idea is to adapt the fundamental MCE criterion to reflect the cost-sensitive notion that errors on keywords are much more significant than errors on non-keywords. This notion of cost sensitivity leads to an emphasis on keyword models in parameter optimization. We then present a system that takes advantage of the weighted finite-state transducer (WFST) framework to implement non-uniform MCE efficiently. To enhance the approach of non-uniform error cost minimization for keyword spotting, we further formulate a technique called "adaptive boosted non-uniform MCE," which incorporates the idea of boosting. We validate the proposed framework on two challenging large-scale spontaneous conversational telephone speech (CTS) datasets in two different languages (English and Mandarin). Experimental results show that our framework achieves consistent and significant spotting performance gains over both the maximum likelihood estimation (MLE) baseline and conventional discriminatively trained systems with uniform error cost.
Article · Jan 2014 · IEEE/ACM Transactions on Audio, Speech, and Language Processing (sketch follows)
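The cost-sensitive notion above reduces to a one-line change in the MCE loss: the smoothed misclassification measure is passed through a sigmoid as usual, and the resulting soft error count is scaled by a class-dependent cost that is larger for keywords than for fillers. Scores and costs below are invented for illustration.

```python
import math

def non_uniform_mce_loss(g_correct, g_competitor, error_cost, gamma=1.0):
    """d = -g_correct + g_competitor is the misclassification measure
    over discriminant (e.g., log-likelihood) scores; the sigmoid
    smooths the 0/1 error, and error_cost weights it per class."""
    d = -g_correct + g_competitor
    return error_cost / (1.0 + math.exp(-gamma * d))

# The same near-miss costs five times more on a keyword:
print(non_uniform_mce_loss(-120.0, -118.0, error_cost=5.0))  # keyword
print(non_uniform_mce_loss(-120.0, -118.0, error_cost=1.0))  # non-keyword
```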
  • Yong Zhao · Biing-Hwang Juang
ABSTRACT: In this paper, we propose a novel acoustic modeling framework, the synchronous HMM, which takes full advantage of heterogeneous data sources and achieves an optimal balance between modeling accuracy and robustness. The synchronous HMM introduces an additional layer of substates between the HMM states and the Gaussian component variables. The substates can register long-span non-phonetic attributes, collectively called speech scenes in this study. The hierarchical modeling scheme allows an accurate description of the probability distribution of speech units in different speech scenes. To address the data sparsity problem, a decision-based clustering algorithm is presented to determine the set of speech scenes and to tie the substate parameters. Moreover, we propose the multiplex Viterbi algorithm to decode synchronous HMMs efficiently within a search space of the same size as for standard HMMs. Experiments on the Aurora 2 task show that synchronous HMMs produce a significant improvement in recognition performance over the HMM baseline at the expense of a moderate increase in memory requirements and computational complexity.
Conference Paper · Oct 2013
  • Chao Weng · Biing-Hwang Juang
ABSTRACT: In this work, we present a complete framework of discriminative training using non-uniform criteria for keyword spotting: adaptive boosted non-uniform minimum classification error (MCE) on spontaneous speech. To further boost spotting performance and tackle the potential over-training issue in the non-uniform MCE proposed in our prior work, we make two improvements to the fundamental MCE optimization procedure. Furthermore, motivated by AdaBoost, we introduce an adaptive scheme that embeds error cost functions together with model combinations during the decoding stage. The proposed framework is comprehensively validated on two challenging large-scale spontaneous conversational telephone speech (CTS) tasks in different languages (English and Mandarin), and the experimental results show that it achieves significant and consistent figure of merit (FOM) gains over both ML-trained and discriminatively trained systems.
Conference Paper · Oct 2013
  • Chao Weng · Biing-Hwang Juang
ABSTRACT: In this work, we propose latent semantic rational kernels (LSRK) for topic spotting on spontaneous conversational speech. Rather than mapping the input weighted finite-state transducers (WFSTs) onto a high-dimensional n-gram feature space, as n-gram rational kernels do, the proposed LSRK maps the WFSTs onto a latent semantic space. Moreover, the LSRK framework allows all available external knowledge to be flexibly incorporated to boost topic spotting performance. Experiments on a spontaneous conversational task, Switchboard, show that our method achieves a significant performance gain over the baselines, raising accuracy from 27.33% to 57.56%, and almost doubles the classification accuracy of n-gram rational kernels in all cases.
Conference Paper · Oct 2013 (sketch follows)
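Stripped of the transducer machinery, the mapping above can be pictured on plain count vectors: project n-gram counts through a latent semantic basis and take the inner product there. The truncated SVD basis U_k is a hypothetical stand-in; the paper applies the mapping to whole WFST lattices, not 1-best counts.

```python
import numpy as np

def latent_semantic_kernel(counts_a, counts_b, U_k):
    """counts_*: (vocab_size,) n-gram count vectors; U_k:
    (vocab_size, k) latent basis learned from training transcripts.
    The kernel value is the inner product in the k-dim latent space."""
    return float((counts_a @ U_k) @ (counts_b @ U_k))
```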
  • M. Umair Bin Altaf · Taras Butko · Biing-Hwang (Fred) Juang
ABSTRACT: Real-world sounds are ubiquitous and form an important part of the edifice of our cognitive abilities. Their perception combines signatures from spectral and temporal domains, among others, yet their analysis has traditionally focused on frame-based spectral properties. We consider the problem of sound analysis from a perceptual perspective and investigate the temporal properties of a “footsteps” sound, which is particularly challenging from the time-frequency analysis viewpoint. We identify the irregular repetition of self-similarity and the sense of duration as significant to its perceptual quality, and extract features using the Teager-Kaiser energy operator. We build an acoustic event detection system for “footsteps” that shows promising detection results in cross-environment conditions when compared with a conventional approach.
Conference Paper · Oct 2013
  • Jason Wung · Ted S. Wada · Biing-Hwang (Fred) Juang
ABSTRACT: This paper examines the effect of inter-channel decorrelation by sub-band resampling (SBR) on the performance of a robust acoustic echo cancellation (AEC) system based on the residual echo enhancement technique. Owing to the flexibility of SBR, its decorrelation performance, as measured by coherence, can be matched to that of other conventional decorrelation procedures. Given the same degree of decorrelation, we have previously shown that SBR achieves superior audio quality compared to other procedures. In this paper we show that SBR also provides higher stereophonic AEC performance in very noisy conditions, where performance is evaluated by decomposing the true echo return loss enhancement and the misalignment per sub-band to better demonstrate the superiority of our decorrelation procedure over other methods.
Conference Paper · Oct 2013
  • Jason Wung · Ted S. Wada · Mehrez Souden · Biing-Hwang Fred Juang
ABSTRACT: It is well established that a decorrelation procedure is required in a multi-channel acoustic echo control system to mitigate the so-called non-uniqueness problem. A recently proposed technique that accomplishes decorrelation by resampling (DBR) has been shown to be advantageous: it achieves superior performance in echo reduction gain and offers the possibility of frequency-selective decorrelation to further preserve the sound quality of the system. In this paper, we rigorously analyze the performance behavior of DBR in terms of coherence reduction and the resulting misalignment of an adaptive filter. We derive closed-form expressions for the performance bounds and validate the theoretical analysis with simulation.
Conference Paper · Oct 2013 (sketch follows)
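A toy check of the quantity analyzed above, under assumptions of ours rather than the paper's setup: magnitude squared coherence between two channels before and after a crude full-band resampling stand-in for DBR (a 0.05% time-scale change).

```python
import numpy as np
from scipy.signal import coherence

fs = 16000
t = np.arange(2 * fs) / fs
rng = np.random.default_rng(0)
left = np.sin(2 * np.pi * 440 * t) + 0.01 * rng.standard_normal(t.size)
right = left.copy()                          # fully coherent channels
right_dec = np.interp(t, t * 1.0005, right)  # slightly resampled copy

_, C_before = coherence(left, right, fs=fs, nperseg=512)
_, C_after = coherence(left, right_dec, fs=fs, nperseg=512)
print(C_before.mean(), C_after.mean())  # mean MSC drops after resampling
```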
  • Mingyu Chen · Ghassan AlRegib · Biing-Hwang Juang
ABSTRACT: A 6D motion gesture is represented by a 3D spatial trajectory augmented by three further dimensions of orientation. Depending on the tracking technology, the motion can be tracked explicitly, via position and orientation, or implicitly, via acceleration and angular speed. In this work, we address the problem of motion gesture recognition for command-and-control applications. Our main contribution is to investigate the relative effectiveness of various feature dimensions for motion gesture recognition in both user-dependent and user-independent cases. We introduce a statistical feature-based classifier as the baseline and propose an HMM-based recognizer, which offers more flexibility in feature selection and achieves better recognition accuracy than the baseline system. Our motion gesture database, which contains both explicit and implicit motion information, allows us to compare the recognition performance of different tracking signals on common ground. This study also gives insight into the recognition rate attainable with different tracking devices, which is valuable for system designers choosing the proper tracking technology.
Article · Apr 2013 · IEEE Transactions on Multimedia (sketch follows)
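The baseline in the abstract is feature-based; a plausible, purely hypothetical minimal feature extractor over a 6-DOF trajectory might look like this (the paper's actual feature set may differ).

```python
import numpy as np

def statistical_features(traj):
    """traj: (T, 6) trajectory (3 position/acceleration dims plus 3
    orientation/angular-speed dims). Concatenates per-dimension
    mean, std, min, and max into a fixed 24-dim descriptor that a
    conventional classifier can consume regardless of T."""
    traj = np.asarray(traj, dtype=float)
    return np.concatenate([traj.mean(axis=0), traj.std(axis=0),
                           traj.min(axis=0), traj.max(axis=0)])
```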
  • Shinji Watanabe · Atsushi Nakamura · Biing-Hwang (Fred) Juang
ABSTRACT: Linear regression on hidden Markov model (HMM) parameters is widely used for adaptive training in time-series pattern analysis, especially in speech processing. The regression parameters are usually shared among sets of Gaussians in HMMs, where the Gaussian clusters are represented by a tree. This paper realizes a fully Bayesian treatment of linear regression for HMMs, taking this regression tree structure into account by using variational techniques. We analytically derive the variational lower bound of the marginalized log-likelihood of the linear regression. Using this lower bound as an objective function, we can algorithmically optimize the tree structure and hyperparameters of the linear regression rather than heuristically tweaking them as tuning parameters. Experiments on large-vocabulary continuous speech recognition confirm the generalizability of the proposed approach, especially when the amount of adaptation data is limited.
Article · Mar 2013 · Journal of Signal Processing Systems
  • Mehrez Souden · Jason Wung · Biing-Hwang Fred Juang
ABSTRACT: This paper introduces an approach for clustering and suppressing acoustic echo signals in hands-free, full-duplex speech communication systems. We employ an instantaneous recursive estimate of the magnitude squared coherence (MSC) between the echo line signal and the microphone signal, and model it with a two-component Beta mixture distribution. Since we consider multiple-microphone pickup, we further integrate the normalized recording vector as a location feature to achieve reliable soft decisions on echo presence. Location information has been widely used in clustering-based blind source separation and can be modeled with a Watson mixture distribution. Simulation evaluations show that the proposed method achieves significant echo suppression performance.
Conference Paper · Jan 2013 (sketch follows)
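The first stage above, the instantaneous recursive MSC estimate, can be sketched directly on STFT frames; the forgetting factor and initialization are illustrative choices, not the paper's settings.

```python
import numpy as np

def recursive_msc(X, Y, alpha=0.9):
    """X, Y: (T, n_bins) complex STFTs of the echo reference and the
    microphone signal. Smooth auto-/cross-spectra with forgetting
    factor alpha and form |S_xy|**2 / (S_xx * S_yy) per bin; these
    values are what a two-component Beta mixture would then model."""
    n_bins = X.shape[1]
    Sxx = np.full(n_bins, 1e-10)
    Syy = np.full(n_bins, 1e-10)
    Sxy = np.zeros(n_bins, dtype=complex)
    msc = np.zeros(X.shape, dtype=float)
    for t in range(X.shape[0]):
        Sxx = alpha * Sxx + (1 - alpha) * np.abs(X[t]) ** 2
        Syy = alpha * Syy + (1 - alpha) * np.abs(Y[t]) ** 2
        Sxy = alpha * Sxy + (1 - alpha) * X[t] * np.conj(Y[t])
        msc[t] = np.abs(Sxy) ** 2 / (Sxx * Syy)
    return msc
```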
  • Yong Zhao · Biing-Hwang Juang
ABSTRACT: We have recently proposed the stranded HMM to achieve a more accurate representation of heterogeneous data. As opposed to the regular Gaussian mixture HMM, the stranded HMM explicitly models the relationships among mixture components: the transitions among mixture components encode possible trajectories of acoustic features for speech units. Accurately representing the underlying transition structure is crucial for the stranded HMM to yield optimal recognition performance. In this paper, we propose to learn the stranded HMM structure by imposing sparsity constraints. In particular, entropic priors are incorporated into the maximum a posteriori (MAP) estimation of the mixture transition matrices. Experimental results show that a significant improvement in model sparsity can be obtained with only a slight sacrifice in recognition accuracy.
Conference Paper · Nov 2012
  • Yong Zhao · Biing-Hwang Juang
ABSTRACT: In this paper, we present the Gauss-Newton method as a unified approach to estimating the noise parameters of prevalent nonlinear compensation models, such as vector Taylor series (VTS), data-driven parallel model combination (DPMC), and the unscented transform (UT), for noise-robust speech recognition. While iterative estimation of noise means in a generalized EM framework is widely known, we demonstrate that such approaches are variants of the Gauss-Newton method. Furthermore, we propose a novel noise variance estimation algorithm that is consistent with the Gauss-Newton principle. The Gauss-Newton formulation reduces the noise estimation problem to determining the Jacobians of the corrupted speech parameters. For sampling-based compensation, we present two methods for evaluating these Jacobians: sample Jacobian average (SJA) and cross-covariance (XCOV). The proposed noise estimation algorithm is evaluated for various compensation models on two tasks: fitting a Gaussian mixture model (GMM) to artificially corrupted samples, and speech recognition on the Aurora 2 database. The significant performance improvements confirm the efficacy of the Gauss-Newton method for estimating the noise parameters of nonlinear compensation models.
Article · Oct 2012 · IEEE Transactions on Audio, Speech, and Language Processing (sketch follows)
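For the VTS case above, the mismatch function and the Jacobian that the Gauss-Newton step needs have simple closed forms in the log-Mel domain; a minimal sketch, with the channel term omitted for brevity:

```python
import numpy as np

def vts_corrupted_mean(x, n):
    """y = x + log(1 + exp(n - x)): corrupted-speech mean for clean
    mean x and additive-noise mean n, elementwise per log-Mel bin."""
    return x + np.log1p(np.exp(n - x))

def vts_jacobian_wrt_noise(x, n):
    """dy/dn = exp(n - x) / (1 + exp(n - x)) = sigmoid(n - x);
    dy/dx is simply 1 - dy/dn."""
    return 1.0 / (1.0 + np.exp(-(n - x)))
```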
  • Yong Zhao · Andrej Ljolje · Diamantino Caseiro · Biing-Hwang Juang
ABSTRACT: In this paper, we present a general algorithmic framework based on WFSTs for implementing a variety of discriminative training methods, such as MMI, MCE, and MPE/MWE. In contrast to ordinary word lattices, transducer-based lattices are more amenable to representing and manipulating the underlying hypothesis space and have a finer granularity, down to the HMM-state level. The transducers are processed into a two-layer hierarchy: at the high level, the structure is analogous to a word lattice, while at the lower level each word transition embodies an HMM-state subgraph for that word. This hierarchy, combined with appropriate customization of the transducers, leads to a flexible implementation of all the training criteria discussed. The effectiveness of the framework is verified on two speech recognition tasks: Resource Management and AT&T SCANMail, an internal voicemail-to-text task.
Conference Paper · Mar 2012
  • Chao Weng · Biing-Hwang Juang
ABSTRACT: This work presents a comparative study of discriminative training using non-uniform criteria for cross-layer acoustic modeling. Two kinds of discriminative training (DT) frameworks, minimum classification error-like (MCE-like) and minimum phone error-like (MPE-like), are each augmented to allow error cost embedding at the phoneme (model) level. To facilitate this comparison, we implement both augmented DT frameworks under the same umbrella, using error costs derived from the same cross-layer confusion matrix. Experiments on WSJ0, a large-vocabulary task, demonstrate the effectiveness of both DT frameworks with the formulated non-uniform error cost embedded. Several preliminary investigations into the effect of the dynamic range of the error cost are also presented.
Conference Paper · Mar 2012
  • Antonio Moreno-Daniel · Jay G. Wilpon · Biing-Hwang Juang
ABSTRACT: As ubiquitous access to vast and remote information sources from portable devices becomes commonplace, users increasingly need to perform searches in keyboard-unfriendly situations, driving up the demand for voice search sessions. This paper proposes a methodology that addresses different dimensions of scalability of mixed-initiative voice search in automatic spoken dialog systems. The strategy is based on splitting the complexity of the fully constrained grammar (one that tightly covers the entire hypothesis space) into a fixed, low-complexity phonotactic grammar followed by an index mechanism that dynamically assembles a second-pass grammar consisting of only a handful of hypotheses. The experimental analysis demonstrates the different dimensions of scalability achieved by the proposed method using actual Whitepages residential data.
Article · Mar 2012 · Speech Communication
  • Qiang Fu · Yong Zhao · Biing-Hwang Juang
ABSTRACT: Bayes decision theory is the foundation of the classical statistical pattern recognition approach, with the expected error as the performance objective. For most pattern recognition problems, the “error” is conventionally assumed to be binary, i.e., 0 or 1, equivalent to error counting, independent of the specifics of the error made by the system. The term “error rate” has thus long been the prevalent system performance measure. This measure, nonetheless, may not be satisfactory in many practical applications. In automatic speech recognition, for example, it is well known that some errors are more detrimental (e.g., more likely to lead to misunderstanding of the spoken sentence) than others. In this paper, we propose an extended framework for the speech recognition problem with a non-uniform classification/recognition error cost that can be controlled by the system designer. In particular, we address the issue of system model optimization when the cost of a recognition error is class dependent. We formulate the problem in the framework of the minimum classification error (MCE) method, after appropriate generalization to integrate the class-dependent error cost into one consistent objective function for optimization. We present a variety of training scenarios for automatic speech recognition under this extended framework. Experimental results for continuous speech recognition demonstrate the effectiveness of the new approach.
Article · Mar 2012 · IEEE Transactions on Audio, Speech, and Language Processing

Publication Stats

8k Citations
300.41 Total Impact Points

Institutions

  • 2005-2015
    • Georgia Institute of Technology
      • School of Electrical & Computer Engineering
      • Center for Signal & Image Processing
      Atlanta, Georgia, United States
  • 2009
    • NTT Communication Science Laboratories
Kyoto, Japan
    • Università degli Studi di Trento
Trento, Trentino-Alto Adige, Italy
  • 2008
    • Broadcom Corporation
      Irvine, California, United States
  • 2007
    • NTT DATA Corporation
Tokyo, Japan
  • 2004
    • Instituto Tecnológico de Estudios Superiores de Occidente
      Guadalajara, Jalisco, Mexico
  • 1996-1999
    • AT&T Labs
      Austin, Texas, United States
  • 1997-1998
    • Kyoto University
Kyoto, Japan