Jing Huang

IBM, Armonk, New York, United States

Publications (21) · 24.7 Total impact

  • Jing Huang, P.A. Olsen, V. Goel
    ABSTRACT: Discriminatively trained full-covariance Gaussian mixture models have been shown to outperform their diagonal-covariance counterparts on large vocabulary speech recognition tasks. However, a full-covariance model is much larger than a diagonal-covariance model and is therefore not practical for use in a real system. In this paper, we present a method to build a large discriminatively trained full-covariance model with a large (over 9000 hours) training corpus that still improves performance over the diagonal-covariance model. We then reduce the size of the full-covariance model to that of its baseline diagonal-covariance model by using a subspace constrained Gaussian mixture model (SCGMM). The resulting discriminatively trained SCGMM retains the performance of its corresponding full-covariance model, and improves 5% relative over the same-size diagonal-covariance model on a large vocabulary speech recognition task.
    Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on; 01/2013
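    A minimal NumPy sketch of the subspace constraint mentioned in this abstract (illustrative only, not the paper's implementation; the dimensions and the positive-definiteness offset are assumptions): each Gaussian keeps only a short coefficient vector, and its full precision matrix and linear term are reconstructed from a shared basis, which is how an SCGMM shrinks a full-covariance model back towards diagonal-covariance size.

        # Illustrative SCGMM parameter tying: per-Gaussian coefficients over a
        # shared basis of (symmetric matrix, vector) pairs.
        import numpy as np

        dim, n_basis, n_gauss = 40, 100, 5000   # feature dim, subspace size, Gaussians (assumed)
        rng = np.random.default_rng(0)

        basis_P = rng.standard_normal((n_basis, dim, dim))
        basis_P = 0.5 * (basis_P + basis_P.transpose(0, 2, 1))   # symmetrize
        basis_b = rng.standard_normal((n_basis, dim))

        # Only n_basis numbers per Gaussian instead of dim*(dim+3)/2 free
        # full-covariance parameters.
        lam = rng.standard_normal((n_gauss, n_basis)) * 0.01

        def gaussian_params(g):
            """Reconstruct (precision, linear term) of Gaussian g from the subspace."""
            P = np.tensordot(lam[g], basis_P, axes=1) + 10.0 * np.eye(dim)  # offset keeps P positive definite (assumption)
            b = lam[g] @ basis_b
            return P, b

        P0, b0 = gaussian_params(0)
        print(P0.shape, b0.shape)   # (40, 40) (40,)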
  • ABSTRACT: Modern speech applications utilize acoustic models with billions of parameters, and serve millions of users. Storing an acoustic model for each user is costly. We show, through the use of sparse regularization, that it is possible to obtain competitive adaptation performance by changing only a small fraction of the parameters of an acoustic model. This allows for the compression of speaker-dependent models: a capability that has important implications for systems with millions of users. We achieve performance comparable to the best Maximum A Posteriori (MAP) adaptation models while adapting only 5% of the acoustic model parameters. Thus it is possible to compress the speaker-dependent acoustic models by close to a factor of 20. The proposed sparse adaptation criterion improves three aspects of previous work: it combines ℓ0 and ℓ1 penalties, uses different adaptation rates for mean and variance parameters, and is invariant to affine transformations.
    Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on; 01/2012
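    A rough sketch of the combined-penalty idea (plain NumPy with made-up thresholds; the paper's actual optimization is not reproduced here): shrink the adaptation deltas with an ℓ1-style soft threshold, zero out what remains small with an ℓ0-style hard threshold, and keep only the surviving fraction of parameters per speaker.

        # Sparsifying speaker-adaptation deltas with combined l1/l0 thresholds
        # (illustrative values only).
        import numpy as np

        rng = np.random.default_rng(1)
        si_params = rng.standard_normal(1_000_000)                      # speaker-independent model
        sd_params = si_params + 0.01 * rng.standard_normal(1_000_000)   # after adaptation

        delta = sd_params - si_params
        l1_tau, l0_tau = 0.005, 0.02        # assumed thresholds

        delta = np.sign(delta) * np.maximum(np.abs(delta) - l1_tau, 0.0)  # l1: shrink
        delta[np.abs(delta) < l0_tau] = 0.0                               # l0: drop small changes

        sparse_sd = si_params + delta
        print("fraction of parameters changed:", np.count_nonzero(delta) / delta.size)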
  • Source
    ABSTRACT: Feature-space transforms such as feature-space maximum likelihood linear regression (FMLLR) are a very effective speaker adaptation technique, especially on mismatched test data. In this study, we extend the full-rank square matrix of FMLLR to a non-square matrix that uses neighboring feature vectors in estimating the adapted central feature vector. Through optimizing an appropriate objective function, we aim to filter out and transform features through the correlation of the feature context. We compare against conventional FMLLR, which considers the current feature vector only. Our experiments are conducted on automobile data with different speed conditions. Results show that context filtering improves word error rate by 23% over conventional FMLLR on noisy 60 mph data with an adapted ML model, and by 7%/9% over the discriminatively trained FMMI/BMMI models.
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011, May 22-27, 2011, Prague Congress Center, Prague, Czech Republic; 01/2011
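    The shape of the transform described above can be illustrated with a short sketch (an assumed form with a +/-1 frame window; the paper's estimation of the matrix is not shown): conventional FMLLR applies a square matrix to a single frame, while context filtering maps a stack of neighboring frames to one adapted frame with a non-square matrix.

        # Square FMLLR transform vs. non-square context-filtering transform.
        import numpy as np

        dim, context, T = 40, 1, 200                 # window of +/- 1 frame (assumption)
        feats = np.random.randn(T, dim)

        # Standard FMLLR: y_t = A x_t + b, with A square (dim x dim).
        A_sq, b_sq = np.eye(dim), np.zeros(dim)
        y_fmllr = feats @ A_sq.T + b_sq

        # Context filtering: y_t = A [x_{t-1}; x_t; x_{t+1}] + b, A of shape dim x 3*dim.
        win = 2 * context + 1
        A_ctx, b_ctx = 0.01 * np.random.randn(dim, win * dim), np.zeros(dim)
        padded = np.vstack([feats[:1]] * context + [feats] + [feats[-1:]] * context)
        stacked = np.hstack([padded[i:i + T] for i in range(win)])   # (T, 3*dim)
        y_ctx = stacked @ A_ctx.T + b_ctx
        print(y_fmllr.shape, y_ctx.shape)   # (200, 40) (200, 40)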
  • ABSTRACT: Maximum A Posteriori (MAP) adaptation is a powerful tool for building speaker-specific acoustic models. Modern speech applications utilize acoustic models with millions of parameters, and serve millions of users. Storing an acoustic model for each user in such settings is costly. However, speaker-specific acoustic models are generally similar to the acoustic model being adapted. By imposing sparseness constraints, we can save significantly on storage, and even improve the quality of the resulting speaker-dependent model. In this paper we utilize the ℓ1 or ℓ0 norm as a regularizer to induce sparsity. We show that we can obtain up to 95% sparsity with negligible loss in recognition accuracy, with both penalties. By removing small differences, which constitute “adaptation noise”, sparse MAP is actually able to improve upon MAP adaptation. Sparse MAP reduces the MAP word error rate by 2% relative at 89% sparsity.
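    A small sketch of where the storage saving comes from (plain NumPy on toy data; the MAP estimation itself is not shown): once the adaptation deltas are sparse, a per-speaker model can be stored as (index, value) pairs against the speaker-independent model instead of a full copy, which at ~95% sparsity is roughly the factor-of-20 compression quoted above.

        # Storing a speaker-dependent model as sparse deltas against the SI model.
        import numpy as np

        def compress(si, sd):
            delta = sd - si
            idx = np.flatnonzero(delta)
            return idx.astype(np.int32), delta[idx].astype(np.float32)

        def decompress(si, idx, vals):
            sd = si.copy()
            sd[idx] += vals
            return sd

        si = np.zeros(1_000_000, dtype=np.float32)
        sd = si.copy()
        sd[::20] += 0.1                      # pretend 5% of parameters were adapted

        idx, vals = compress(si, sd)
        print("stored values:", idx.size, "of", si.size)   # ~5% of the parameters
        assert np.allclose(decompress(si, idx, vals), sd)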
  • 01/2009: chapter Automatic speech recognition: pages 43-59; Springer.
  • Source
    ABSTRACT: Smart homes for the aging population have recently started attracting the attention of the research community. One of the problems of interest is that of monitoring the activities of daily living (ADLs) of the elderly, in order to help identify critical problems, aiming to improve their protection and general well-being. In this paper, we report on our initial attempts to recognize such activities, based on input from networks of far-field microphones distributed inside the home. We propose two approaches to the problem: The first models the entire activity, which typically covers long time spans, with a single statistical model, for example a hidden Markov model (HMM), a Gaussian mixture model (GMM), or GMM supervectors in conjunction with support vector machines (SVMs). The second is a two-step approach: It first performs acoustic event detection (AED) to locate distinctive events characteristic of the ADLs, followed by a post-processing stage that employs activity-specific language models (LMs) to classify the output sequences of detected events into ADLs. Experiments are reported on a corpus containing a small number of acted ADLs, collected as part of the Netcarity Integrated Project inside a two-room smart home. Our results show that SVM GMM-supervector modeling improves six-class ADL classification accuracy to 76%, compared to 56% achieved by the GMMs, while also outperforming HMMs by 8% absolute. Preliminary results from LM scoring of acoustic event sequences are comparable to those from GMMs on a three-class ADL classification task.
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2009, 19-24 April 2009, Taipei, Taiwan; 01/2009
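    The GMM-supervector representation used above can be sketched in a few lines (illustrative; the relevance factor, mixture size, and single-pass update are simplifying assumptions): MAP-adapt the means of a background GMM to one audio segment and stack them into a fixed-length vector that an SVM can then classify.

        # Build a GMM supervector for one segment by relevance-MAP mean adaptation.
        import numpy as np

        def supervector(frames, ubm_means, ubm_vars, ubm_weights, r=16.0):
            """frames: (T, D) features; UBM: (M, D) means/vars and (M,) weights."""
            diff = frames[:, None, :] - ubm_means[None, :, :]                  # (T, M, D)
            logp = -0.5 * np.sum(diff**2 / ubm_vars + np.log(2 * np.pi * ubm_vars), axis=2)
            logp += np.log(ubm_weights)
            post = np.exp(logp - logp.max(axis=1, keepdims=True))
            post /= post.sum(axis=1, keepdims=True)                            # (T, M) posteriors
            n = post.sum(axis=0)                                               # zeroth-order stats
            f = post.T @ frames                                                # first-order stats
            alpha = (n / (n + r))[:, None]
            adapted = alpha * (f / np.maximum(n, 1e-8)[:, None]) + (1 - alpha) * ubm_means
            return adapted.reshape(-1)                                         # (M*D,)

        M, D = 64, 24
        ubm = (np.random.randn(M, D), np.ones((M, D)), np.full(M, 1.0 / M))
        sv = supervector(np.random.randn(300, D), *ubm)
        print(sv.shape)   # (1536,)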
  • Source
    ABSTRACT: We present a fall classification and detection system to distinguish falls from other noise in the home environment using only a far-field microphone. We propose modeling each fall or noise segment using a GMM supervector, whose Euclidean distance measures the pairwise difference between audio segments. A Support Vector Machine built on a kernel between GMM supervectors is used to classify audio segments into falls and various types of noise. Experiments on the Netcarity fall dataset show that the proposed fall modeling and classification approach improves the fall-segment F-score to 67%, from 59% achieved by a standard GMM classifier. We also demonstrate that the proposed approach effectively improves fall detection accuracy by re-classifying confusable labels in the output of dynamic programming using the standard GMM classifier.
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2009, 19-24 April 2009, Taipei, Taiwan; 01/2009
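    The classification stage can be sketched as follows (assuming scikit-learn and a linear kernel between supervectors; data and labels are toy placeholders): the kernel matrix between segment supervectors is precomputed and handed to an SVM that separates falls from noise.

        # SVM on a precomputed kernel between GMM supervectors (toy data).
        import numpy as np
        from sklearn.svm import SVC

        rng = np.random.default_rng(2)
        train_sv = rng.standard_normal((40, 1536))    # supervectors of training segments
        test_sv = rng.standard_normal((10, 1536))
        train_y = rng.integers(0, 2, 40)              # 1 = fall, 0 = noise (toy labels)

        K_train = train_sv @ train_sv.T               # linear kernel, train vs. train
        K_test = test_sv @ train_sv.T                 # linear kernel, test vs. train

        clf = SVC(kernel="precomputed").fit(K_train, train_y)
        print(clf.predict(K_test))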
  • Jing Huang, Karthik Visweswariah
    INTERSPEECH 2009, 10th Annual Conference of the International Speech Communication Association, Brighton, United Kingdom, September 6-10, 2009; 01/2009
  • Source
    ABSTRACT: Automatic speech recognition (ASR) is a critical component for CHIL services. For example, it provides the input to higher-level technologies, such as summarization and question answering, as discussed in Chapter 8. In the spirit of ubiquitous computing, the goal of ASR in CHIL is to achieve a high performance using far-field sensors (networks of microphone arrays and distributed far-field microphones). However, close-talking microphones are also of interest, as they are used to benchmark ASR system development by providing a best-case acoustic channel scenario to compare against.
    12/2008: pages 43-59;
  • Source
    ABSTRACT: Robust speech processing constitutes a crucial component in the development of usable and natural conversational interfaces. In this paper we are particularly interested in human-computer interaction taking place in "smart" spaces - equipped with a number of far-field, unobtrusive microphones and camera sensors. Their availability allows multi-sensory and multi-modal processing, thus improving robustness of speech-based perception technologies in a number of scenarios of interest, for example lectures and meetings held inside smart conference rooms, or interaction with domotic devices in smart homes. In this paper, we overview recent work at IBM Research in developing state-of-the-art speech technology in smart spaces. In particular we discuss acoustic scene analysis, speech activity detection, speaker diarization, and speech recognition, emphasizing multi-sensory or multi-modal processing. The resulting technology is envisaged to allow far-field conversational interaction in smart spaces based on dialog management and natural language understanding of user requests.
    Hands-Free Speech Communication and Microphone Arrays, 2008. HSCMA 2008; 06/2008
  • Source
    ABSTRACT: The paper describes the IBM systems submitted to the NIST Rich Transcription 2007 (RT07) evaluation campaign for the speech-to-text (STT) and speaker-attributed speech-to-text (SASTT) tasks on the lecture meeting domain. Three testing conditions are considered, namely the multiple distant microphone (MDM), single distant microphone (SDM), and individual headset microphone (IHM) ones - the latter for the STT task only. The IBM system building process is similar to that employed last year for the STT Rich Transcription Spring 2006 evaluation (RT06s). However, a few technical advances have been introduced for RT07: (a) better speaker segmentation; (b) system combination via the ROVER approach applied over an ensemble of systems, some of which are built by randomized decision tree state-tying; and (c) development of a very large language model consisting of 152M n-grams, incorporating, among other sources, 525M words of web data, and used in conjunction with a dynamic decoder. These advances reduce STT word error rate (WER) in the MDM condition by 16% relative (8% absolute) over the IBM RT06s system, as measured on 17 lecture meeting segments of the RT06s evaluation test set, selected in this work as development data. In the RT07 evaluation campaign, both MDM and SDM systems perform competitively for the STT and SASTT tasks. For example, at the MDM condition, a 44.3% STT WER is achieved on the RT07 evaluation test set, excluding scoring of overlapped speech. When the STT transcripts are combined with speaker labels from speaker diarization, SASTT WER becomes 52.0%. For the STT IHM condition, the newly developed large language model is employed, but in conjunction with the RT06s IHM acoustic models. The latter are reused, due to lack of time to train new models to utilize additional close-talking microphone data available in RT07. Therefore, the resulting system achieves modest WERs of 31.7% and 33.4%, when using manual or automatic segmentation, respectively.
    Multimodal Technologies for Perception of Humans, International Evaluation Workshops CLEAR 2007 and RT 2007, Baltimore, MD, USA, May 8-11, 2007, Revised Selected Papers; 01/2007
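    The ROVER combination mentioned in (b) can be caricatured as per-slot voting (heavily simplified: real ROVER aligns the hypotheses into a word transition network by dynamic programming and can weight votes by word confidences; here the hypotheses are assumed to be pre-aligned, with "" marking a deletion).

        # Majority voting over pre-aligned hypotheses from several systems.
        from collections import Counter

        def rover_vote(aligned_hyps):
            """aligned_hyps: list of equal-length word lists from different systems."""
            combined = []
            for slot in zip(*aligned_hyps):
                word, _ = Counter(slot).most_common(1)[0]   # majority vote per slot
                if word:                                    # drop slots that vote ""
                    combined.append(word)
            return combined

        hyps = [
            ["the", "meeting", "starts", "at", "nine"],
            ["the", "meeting", "start",  "at", "nine"],
            ["a",   "meeting", "starts", "",   "nine"],
        ]
        print(rover_vote(hyps))   # ['the', 'meeting', 'starts', 'at', 'nine']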
  • Source
    INTERSPEECH 2007, 8th Annual Conference of the International Speech Communication Association, Antwerp, Belgium, August 27-31, 2007; 01/2007
  • Source
    ABSTRACT: We present the IBM systems for the Rich Transcription 2007 (RT07) speaker diarization evaluation task on lecture meeting data. We first overview our baseline system that was developed last year, as part of our speech-to-text system for the RT06s evaluation. We then present a number of simple schemes considered this year in our effort to improve speaker diarization performance, namely: (i) a better speech activity detection (SAD) system, a necessary pre-processing step to speaker diarization; (ii) use of word information from a speaker-independent speech recognizer; (iii) modifications to speaker cluster merging criteria and the underlying segment model; and (iv) use of speaker models based on Gaussian mixture models, and their iterative refinement by frame-level re-labeling and smoothing of decision likelihoods. We report development experiments on the RT06s evaluation test set that demonstrate that these methods are effective, resulting in dramatic performance improvements over our baseline diarization system. For example, changes in the cluster segment models and cluster merging methodology result in a 24.2% relative reduction in speaker error rate, whereas use of the iterative model refinement process and word-level alignment produces 36.0% and 9.2% relative speaker error reductions, respectively. The importance of the SAD subsystem is also shown, with SAD error reduction from 12.3% to 4.3% translating to a 20.3% relative reduction in speaker error rate. Unfortunately, however, the developed diarization system depends heavily on appropriately tuned thresholds in the speaker cluster merging process. Possibly as a result of over-tuning such thresholds, performance on the RT07 evaluation test set degrades significantly compared to that observed on development data. Nevertheless, our experiments show that the introduced techniques of cluster merging, speaker model refinement and alignment remain valuable in the RT07 evaluation.
    Multimodal Technologies for Perception of Humans, International Evaluation Workshops CLEAR 2007 and RT 2007, Baltimore, MD, USA, May 8-11, 2007, Revised Selected Papers; 01/2007
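    Item (iv) above, the iterative refinement of GMM speaker models, can be sketched roughly as follows (scikit-learn, SciPy, the mixture size and the smoothing window are all assumptions, and the data is synthetic): fit one GMM per speaker cluster, re-label every frame by likelihood, smooth the label sequence in time, and repeat.

        # Frame-level re-labeling and smoothing with per-speaker GMMs (toy data).
        import numpy as np
        from scipy.ndimage import median_filter
        from sklearn.mixture import GaussianMixture

        def refine_labels(feats, labels, n_speakers, iters=3, smooth=51):
            """feats: (T, D) frames; labels: (T,) initial speaker indices."""
            for _ in range(iters):
                gmms = [
                    GaussianMixture(n_components=8, covariance_type="diag", reg_covar=1e-3)
                    .fit(feats[labels == s])
                    for s in range(n_speakers)
                ]
                ll = np.stack([g.score_samples(feats) for g in gmms], axis=1)  # (T, n_speakers)
                labels = ll.argmax(axis=1)                     # frame-level re-labeling
                labels = median_filter(labels, size=smooth)    # smooth decisions in time
            return labels

        T, D, S = 3000, 19, 2
        feats = np.concatenate([np.random.randn(T // 2, D), 3 + np.random.randn(T // 2, D)])
        init = np.repeat([0, 1], T // 2)
        print(np.bincount(refine_labels(feats, init, S)))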
  • Source
    ABSTRACT: Speech processing of lectures recorded inside smart rooms has recently attracted much interest. In particular, the topic has been central to the Rich Transcription (RT) Meeting Recognition Evaluation campaign series, sponsored by NIST, with emphasis placed on benchmarking speech activity detection (SAD), speaker diarization (SPKR), speech-to-text (STT), and speaker-attributed STT (SASTT) technologies. In this paper, we present the IBM systems developed to address these tasks in preparation for the RT 2007 evaluation, focusing on the far-field condition of lecture data collected as part of the European project CHIL. For their development, the systems are benchmarked on a subset of the RT Spring 2006 (RT06s) evaluation test set, where they yield significant improvements for all SAD, SPKR, and STT tasks over RT06s results; for example, a 16% relative reduction in word error rate is reported in STT, attributed to a number of system advances discussed here. Initial results are also presented on SASTT, a task newly introduced in 2007 in place of the discontinued SAD. Index Terms: speech processing, speech recognition, speaker diarization, speech activity detection, lectures, smart rooms.
    INTERSPEECH 2007, 8th Annual Conference of the International Speech Communication Association, Antwerp, Belgium, August 27-31, 2007; 01/2007
  • Source
    ABSTRACT: We describe the IBM systems submitted to the NIST RT06s Speech-to-Text (STT) evaluation campaign on the CHIL lecture meeting data for three conditions: multiple distant microphone (MDM), single distant microphone (SDM), and individual headset microphone (IHM). The system building process is similar to the IBM conversational telephone speech recognition system. However, the best models for the far-field conditions (SDM and MDM) proved to be the ones that use neither variance normalization nor vocal tract length normalization. Instead, feature-space minimum-phone-error discriminative training yielded the best results. Due to the relatively small amount of CHIL-domain data, the acoustic models of our systems are built on publicly available meeting corpora, with maximum a-posteriori adaptation applied twice on CHIL data during training: first, at the initial speaker-independent model, and subsequently at the minimum phone error model. For language modeling, we utilized meeting transcripts, text from scientific conference proceedings, and spontaneous telephone conversations. On development data, chosen in our work to be the 2005 CHIL-internal STT evaluation test set, the resulting language model provided a 4% absolute gain in word error rate (WER), compared to the model used in last year's CHIL evaluation. Furthermore, the developed STT system significantly outperformed our last year's results, reducing close-talking microphone data WER from 36.9% to 25.4% on our development set. In the NIST RT06s evaluation campaign, both MDM and SDM systems scored well; however, the IHM system did poorly due to unsuccessful cross-talk removal.
    Machine Learning for Multimodal Interaction, Third International Workshop, MLMI 2006, Bethesda, MD, USA, May 1-4, 2006, Revised Selected Papers; 01/2006
  • ABSTRACT: In this paper, we describe the IBM system submitted to the NIST Rich Transcription Spring 2006 (RT06s) evaluation campaign for automatic speech activity detection (SAD). This SAD system has been developed and evaluated on CHIL lecture meeting data using far-field microphone sensors, namely a single distant microphone (SDM) configuration and a multiple distant microphone (MDM) condition. The IBM SAD system employs a three-class statistical classifier, trained on features that augment traditional signal energy ones with features based on acoustic phonetic likelihoods. The latter are obtained using a large speaker-independent acoustic model trained on meeting data. In the detection stage, after feature extraction and classification, the resulting sequence of classified states is further collapsed into segments belonging to only two classes, speech or silence, following two levels of smoothing. In the MDM condition, the process is repeated for every available microphone channel, and the outputs are combined based on a simple majority voting rule, biased towards speech. The system performed well at the RT06s evaluation campaign, resulting in 8.62% and 5.01% "speaker diarization error" in the SDM and MDM conditions, respectively.
    Machine Learning for Multimodal Interaction, Third International Workshop, MLMI 2006, Bethesda, MD, USA, May 1-4, 2006, Revised Selected Papers; 01/2006
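    The channel-combination rule at the end of the abstract can be written down directly (the bias threshold below is an illustrative assumption): per-channel speech/non-speech decisions are merged frame by frame with a vote that needs fewer than half of the channels to declare speech.

        # Majority voting across microphone channels, biased towards speech.
        import numpy as np

        def combine_channels(channel_decisions, speech_bias=0.4):
            """channel_decisions: (n_channels, T) array of 0 (silence) / 1 (speech)."""
            votes = channel_decisions.mean(axis=0)    # fraction of channels voting speech
            return (votes >= speech_bias).astype(int)

        decisions = np.array([
            [0, 1, 1, 1, 0, 0],   # channel 1
            [0, 0, 1, 1, 0, 1],   # channel 2
            [0, 1, 0, 1, 0, 0],   # channel 3
        ])
        print(combine_channels(decisions))   # [0 1 1 1 0 0]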
  • Jing Huang, Karthik Visweswariah
    INTERSPEECH 2005 - Eurospeech, 9th European Conference on Speech Communication and Technology, Lisbon, Portugal, September 4-8, 2005; 01/2005
  • Source
    Jing Huang, Daniel Povey
    INTERSPEECH 2005 - Eurospeech, 9th European Conference on Speech Communication and Technology, Lisbon, Portugal, September 4-8, 2005; 01/2005
  • Source
    ABSTRACT: It is well known that frontal video of the speaker’s mouth region contains significant speech information that, when combined with the acoustic signal, can improve accuracy and noise robustness of automatic speech recognition (ASR) systems. However, extraction of such visual speech information from full-face videos is computationally expensive, as it requires tracking faces and facial features. In addition, robust face detection remains challenging in practical human–computer interaction (HCI), where the subject’s posture and environment (lighting, background) are hard to control, and thus successfully compensate for. In this paper, in order to bypass these hindrances to practical bimodal ASR, we consider the use of a specially designed, wearable audio-visual headset, a feasible solution in certain HCI scenarios. Such a headset can consistently focus on the speaker’s mouth region, thus eliminating altogether the need for face tracking. In addition, it employs infrared illumination to provide robustness against severe lighting variations. We study the appropriateness of this novel device for audio-visual ASR by conducting both small- and large-vocabulary recognition experiments on data recorded using it under various lighting conditions. We benchmark the resulting ASR performance against bimodal data containing frontal, full-face videos collected at an ideal, studio-like environment, under uniform lighting. The experiments demonstrate that the infrared headset video contains comparable speech information to the studio, full-face video data, thus being a viable sensory device for audio-visual ASR.
    Speech Communication 10/2004; DOI:10.1016/j.specom.2004.10.007 · 1.55 Impact Factor
  • Source
    ABSTRACT: Visual speech is known to improve accuracy and noise robustness of automatic speech recognizers. However, almost all audio-visual ASR systems require tracking frontal facial features for visual information extraction, a computationally intensive and error-prone process. In this paper, we consider a specially designed infrared headset to capture audio-visual data, that consistently fo-cuses on the speaker's mouth region, thus eliminating the need for face tracking. We conduct small-vocabulary recognition experiments on such data, benchmarking their ASR performance against traditional frontal, full-face videos, collected both at an ideal studio-like environ-ment and at a more challenging office domain. By using the infrared headset, we report a dramatic improvement in visual-only ASR that amounts to a relative 30% and 54% word error rate reduction, compared to the studio and of-fice data, respectively. Furthermore, when combining the visual modality with the acoustic signal, the resulting rel-ative ASR gain with respect to audio-only performance is significantly higher for the infrared headset data.