Mark Hasegawa-Johnson

University of Illinois, Urbana-Champaign, Urbana, Illinois, United States


Publications (147) · 58.77 Total Impact Points

  •
    ABSTRACT: Auditory salience describes how much a particular auditory event attracts human attention. Previous attempts at automatic detection of salient audio events have been hampered by the challenge of defining ground truth. In this paper, ground truth for auditory salience is built up from annotations by human subjects of a large corpus of meeting room recordings. Following statistical purification of the data, an optimal auditory salience filter with linear discrimination is derived from the purified data. An automatic auditory salience detector based on optimal filtering of the Bark-frequency loudness achieves a 32% equal error rate. Expanding the feature vector to include other common feature sets does not improve performance. Consistent with intuition, the optimal filter looks like an onset detector in the time domain.
    Pattern Recognition Letters 01/2014; 38:78–85. · 1.27 Impact Factor
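For intuition, the onset-detector behavior of such a salience filter can be sketched as a difference-of-boxcars filter over a loudness envelope. This is an illustrative stand-in, not the learned optimal filter from the paper; the `width` parameter and the toy envelope are assumptions.

```python
def onset_salience(loudness, width=3):
    """Difference-of-boxcars filter: mean of the next `width` frames minus
    the mean of the previous `width` frames. Large positive values mark onsets."""
    n = len(loudness)
    out = []
    for t in range(n):
        past = loudness[max(0, t - width):t]
        future = loudness[t:t + width]
        past_mean = sum(past) / len(past) if past else 0.0
        future_mean = sum(future) / len(future) if future else 0.0
        out.append(future_mean - past_mean)
    return out

# A quiet passage followed by a sudden loud event: salience peaks at the onset.
env = [0.1] * 10 + [1.0] * 10
scores = onset_salience(env)
peak_frame = max(range(len(scores)), key=scores.__getitem__)   # frame 10
```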
  • Harsh Vardhan Sharma, Mark Hasegawa-Johnson
    ABSTRACT: Speech production errors characteristic of dysarthria are chiefly responsible for the low accuracy of automatic speech recognition (ASR) when used by people diagnosed with it. A person with dysarthria produces speech in a rather reduced acoustic working space, causing typical measures of speech acoustics to have values in ranges very different from those characterizing unimpaired speech. It is unlikely, then, that models trained on unimpaired speech will be able to adjust to this mismatch when acted on by one of the currently well-studied adaptation algorithms (which make no attempt to address this extent of mismatch in population characteristics). In this work, we propose an interpolation-based technique for obtaining a prior acoustic model from one trained on unimpaired speech, before adapting it to the dysarthric talker. The method computes a ‘background’ model of the dysarthric talker's general speech characteristics and uses it to obtain a more suitable prior model for adaptation (compared to the speaker-independent model trained on unimpaired speech). The approach is tested with a corpus of dysarthric speech acquired by our research group, on speech of sixteen talkers with varying levels of dysarthria severity (as quantified by their intelligibility). This interpolation technique is tested in conjunction with the well-known maximum a posteriori (MAP) adaptation algorithm, and yields improvements of up to 8% absolute and up to 40% relative, over the standard MAP-adapted baseline.
    Computer Speech & Language 09/2013; 27(6):1147–1162. · 1.46 Impact Factor
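The interpolation-plus-MAP idea can be sketched in a few lines. This is a minimal toy version: the interpolation weight `alpha`, the relevance factor `tau`, and the two-dimensional means are illustrative assumptions, and real systems adapt full GMM-HMMs rather than a single Gaussian mean.

```python
def interpolate_prior(si_mean, bg_mean, alpha=0.7):
    """Per-dimension convex combination of a speaker-independent mean and a
    'background' dysarthric-speech mean, used as the prior for MAP adaptation."""
    return [alpha * s + (1 - alpha) * b for s, b in zip(si_mean, bg_mean)]

def map_adapt(prior_mean, data, tau=10.0):
    """Classic MAP update of a Gaussian mean: a count-weighted blend of the
    prior mean and the sample mean of the adaptation data."""
    n = len(data)
    dim = len(prior_mean)
    sample_mean = [sum(x[d] for x in data) / n for d in range(dim)]
    w = n / (n + tau)
    return [w * m + (1 - w) * p for m, p in zip(sample_mean, prior_mean)]

prior = interpolate_prior([0.0, 2.0], [1.0, 0.0], alpha=0.5)   # [0.5, 1.0]
adapted = map_adapt(prior, [[1.0, 1.0]] * 10, tau=10.0)        # [0.75, 1.0]
```

With only ten adaptation frames, the update stays halfway between the interpolated prior and the talker's own sample mean; with more data, `w` approaches 1 and the prior's influence fades.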
  •
    ABSTRACT: Browsing large audio archives is challenging because of the limitations of human audition and attention. However, this task becomes easier with a suitable visualization of the audio signal, such as a spectrogram transformed to make unusual audio events salient. This transformation maximizes the mutual information between an isolated event's spectrogram and an estimate of how salient the event appears in its surrounding context. When such spectrograms are computed and displayed with fluid zooming over many temporal orders of magnitude, sparse events in long audio recordings can be detected more quickly and more easily. In particular, in a 1/10-real-time acoustic event detection task, subjects who were shown saliency-maximized rather than conventional spectrograms performed significantly better. Saliency maximization also improves the mutual information between the ground truth of nonbackground sounds and visual saliency, more than other common enhancements to visualization.
    ACM Transactions on Applied Perception 01/2013; 10(4):26:1-26:16. · 1.00 Impact Factor
  •
    ABSTRACT: Speech can be represented as a constellation of constricting vocal tract actions called gestures, whose temporal patterning with respect to one another is expressed in a gestural score. Current speech datasets do not come with gestural annotation and no formal gestural annotation procedure exists at present. This paper describes an iterative analysis-by-synthesis landmark-based time-warping architecture to perform gestural annotation of natural speech. For a given utterance, the Haskins Laboratories Task Dynamics and Application (TADA) model is employed to generate a corresponding prototype gestural score. The gestural score is temporally optimized through an iterative timing-warping process such that the acoustic distance between the original and TADA-synthesized speech is minimized. This paper demonstrates that the proposed iterative approach is superior to conventional acoustically-referenced dynamic timing-warping procedures and provides reliable gestural annotation for speech datasets.
    The Journal of the Acoustical Society of America 12/2012; 132(6):3980-9. · 1.65 Impact Factor
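The acoustic-distance-minimizing alignment at the core of such timing-warping procedures can be illustrated with a minimal dynamic time warping routine. This is a simplified stand-in for the paper's iterative analysis-by-synthesis loop; real systems align multidimensional acoustic feature vectors, not scalars.

```python
def dtw(a, b, dist=lambda x, y: abs(x - y)):
    """Minimal dynamic time warping cost between two 1-D sequences,
    using the usual step pattern (match, insertion, deletion)."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = dist(a[i - 1], b[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# Identical contours at different tempi align at zero cost.
cost = dtw([0, 1, 2, 3], [0, 0, 1, 1, 2, 2, 3, 3])   # 0.0
```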
  • Panying Rong, Torrey Loucks, Heejin Kim, Mark Hasegawa-Johnson
    ABSTRACT: A multimodal approach combining acoustics, intelligibility ratings, articulography and surface electromyography was used to examine the characteristics of dysarthria due to cerebral palsy (CP). CV syllables were studied by obtaining the slope of the F2 transition during the diphthong, tongue-jaw kinematics during the release of the onset consonant, and the related submental muscle activities, and by relating these measures to speech intelligibility. The results show that larger reductions of F2 slope are correlated with lower intelligibility in CP-related dysarthria. Among the three speakers with CP, the speaker with the lowest F2 slope and intelligibility showed the smallest tongue release movement and the largest jaw opening movement. The other two speakers with CP were comparable in the amplitude and velocity of tongue movements, but one speaker had abnormally prolonged jaw movement. The tongue-jaw coordination pattern found in the speakers with CP could be either compensatory or a reflection of an incompletely developed oromotor control system.
    Clinical Linguistics & Phonetics 09/2012; 26(9):806-22. · 0.78 Impact Factor
  • Heejin Kim, Mark Hasegawa-Johnson
    ABSTRACT: Second-formant (F2) locus equations represent a linear relationship between F2 measured at the vowel onset following stop release and F2 measured at the vowel midpoint in a consonant-vowel (CV) sequence. Prior research has used the slope and intercept of locus equations as indices to coarticulation degree and the consonant's place of articulation. This presentation addresses coarticulation degree and place of articulation contrasts in dysarthric speech, by comparing locus equation measures for speakers with cerebral palsy and control speakers. Locus equation data are extracted from the Universal Access Speech (Kim et al. 2008). The data consist of CV sequences with labial, alveolar, velar stops produced in the context of various vowels that differ in backness and thus in F2. Results show that for alveolars and labials, slopes are less steep and intercepts are higher in dysarthric speech compared to normal speech, indicating a reduced degree of coarticulation in CV transitions, while for front and back velars, the opposite pattern is observed. In addition, a second-order locus equation analysis shows a reduced separation especially between alveolars and front velars in dysarthric speech. Results will be discussed in relation to the horizontal tongue body positions in CV transitions in dysarthric speech.
    The Journal of the Acoustical Society of America 09/2012; 132(3):2089. · 1.65 Impact Factor
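A locus equation is simply a least-squares line fit of F2 at vowel onset against F2 at vowel midpoint across CV tokens; a minimal sketch follows (the Hz values are made up for illustration, not data from the study):

```python
def locus_equation(f2_onset, f2_mid):
    """Least-squares fit F2_onset = slope * F2_mid + intercept over a set of
    CV tokens. A slope near 1 indicates strong coarticulation with the vowel;
    a slope near 0 indicates the onset is anchored at the consonant's locus."""
    n = len(f2_mid)
    mx = sum(f2_mid) / n
    my = sum(f2_onset) / n
    sxx = sum((x - mx) ** 2 for x in f2_mid)
    sxy = sum((x - mx) * (y - my) for x, y in zip(f2_mid, f2_onset))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# Illustrative values in Hz: onsets pulled halfway toward an 1800 Hz locus.
f2_mid = [1000.0, 1400.0, 1800.0, 2200.0]
f2_onset = [1400.0, 1600.0, 1800.0, 2000.0]
slope, intercept = locus_equation(f2_onset, f2_mid)   # slope 0.5, intercept 900
```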
  •
    ABSTRACT: A video's soundtrack is usually highly correlated to its content. Hence, audio-based techniques have recently emerged as a means for video concept detection complementary to visual analysis. Most state-of-the-art approaches rely on manual definition of predefined sound concepts such as "engine sounds" or "outdoor/indoor sounds." These approaches come with three major drawbacks: manual definitions do not scale as they are highly domain-dependent, manual definitions are highly subjective with respect to annotators, and a large part of the audio content is omitted since the predefined concepts are usually found only in a fraction of the soundtrack. This paper explores how unsupervised audio segmentation systems like speaker diarization can be adapted to automatically identify low-level sound concepts similar to annotator-defined concepts and how these concepts can be used for audio indexing. Speaker diarization systems are designed to answer the question "who spoke when?" by finding segments in an audio stream that exhibit similar properties in feature space, i.e., sound similar. Using a diarization system, all the content of an audio file is analyzed and similar sounds are clustered. This article provides an in-depth analysis of the statistical properties of similar acoustic segments identified by the diarization system in a predefined document set and the theoretical fitness of this approach to discern one document class from another. It also discusses how diarization can be tuned in order to better reflect the acoustic properties of general sounds as opposed to speech and introduces a proof-of-concept system for multimedia event classification working with diarization-based indexing.
    International Journal of Multimedia Data Engineering & Management. 07/2012; 3(3):1-19.
  •
    ABSTRACT: In this work, we break the real-time barrier of human audition by producing rapidly searchable visualizations of the audio signal. We propose a saliency-maximized audio spectrogram as a visual representation that enables fast detection of audio events by a human analyst. This representation minimizes the time needed to examine a particular audio segment by embedding the information of the target events into visually salient patterns. In particular, we find a visualization function that transforms the original mixed spectrogram to maximize the mutual information between the label sequence of target events and the estimated visual saliency of the spectrogram features. Subject experiments using our human acoustic event detection software show that the saliency-maximized spectrogram significantly outperforms the original spectrogram in a 1/10-real-time acoustic event detection task.
    IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP); 01/2012
  • Sujeeth Bharadwaj, Raman Arora, Karen Livescu, Mark Hasegawa-Johnson
    ABSTRACT: We consider the problem of learning a linear transformation of acoustic feature vectors for phonetic frame classification, in a setting where articulatory measurements are available at training time. We use the acoustic and articulatory data together in a multi-view learning approach, in particular using canonical correlation analysis to learn linear transformations of the acoustic features that are maximally correlated with the articulatory data. We also investigate simple approaches for combining information shared across the acoustic and articulatory views with information that is private to the acoustic view. We apply these methods to phonetic frame classification on data drawn from the University of Wisconsin X-ray Microbeam Database. We find a small but consistent advantage to the multi-view approaches combining shared and private information, compared to the baseline acoustic features or unsupervised dimensionality reduction using principal components analysis.
    IEEE IWSML; 01/2012
  • Lae-Hoon Kim, Mark Hasegawa-Johnson
    ABSTRACT: Hands-free speech telephony and speech recognition in cars suffer from additive noise and reverberation. We propose an iterative blind channel estimation algorithm based on an analysis-by-synthesis loop closed around a multipath Generalized Sidelobe Canceller (GSC). By combining a post-filter with the proposed scheme, optimal speech enhancement in practical situations can be achieved. The algorithm is tested using simulated data and using real speech recordings from the AVICAR database.
  •
    ABSTRACT: Separating singing voices from music accompaniment is an important task in many applications, such as music information retrieval, lyric recognition and alignment. Music accompaniment can be assumed to be in a low-rank subspace, because of its repetition structure; on the other hand, singing voices can be regarded as relatively sparse within songs. In this paper, based on this assumption, we propose using robust principal component analysis for singing-voice separation from music accompaniment. Moreover, we examine the separation result by using a binary time-frequency masking method. Evaluations on the MIR-1K dataset show that this method can achieve around 1–1.4 dB higher GNSDR compared with two state-of-the-art approaches without using prior training or requiring particular features.
    IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP); 01/2012
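The binary time-frequency masking step applied after the low-rank/sparse decomposition can be sketched as follows. RPCA itself is omitted here; the toy magnitudes and the `gain` threshold are illustrative assumptions.

```python
def binary_mask(sparse_mag, lowrank_mag, gain=1.0):
    """Binary time-frequency mask: keep a bin for the voice when the sparse
    (voice) magnitude exceeds `gain` times the low-rank (accompaniment)
    magnitude in that bin."""
    return [[1 if s > gain * l else 0 for s, l in zip(srow, lrow)]
            for srow, lrow in zip(sparse_mag, lowrank_mag)]

# Toy 2x2 magnitude "spectrograms" (rows = frequency bins, cols = frames).
voice = [[0.9, 0.1], [0.4, 0.8]]
accomp = [[0.2, 0.7], [0.5, 0.3]]
mask = binary_mask(voice, accomp)   # [[1, 0], [0, 1]]
```

Applying the mask to the mixture spectrogram (element-wise product) before inverting the STFT yields the estimated vocal signal.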
  • Mark A Hasegawa-Johnson, Jui-Ting Huang, Sarah King, Xi Zhou
    ABSTRACT: An invariant feature is a nonlinear projection whose output shows less intra-class variability than its input. In machine learning, invariant features may be given a priori, on the basis of scientific knowledge, or they may be learned using feature selection algorithms. In the task of acoustic feature extraction for automatic speech recognition, for example, a candidate for a priori invariance is provided by the theory of phonological distinctive features, which specifies that any given distinctive feature should correspond to a fixed acoustic correlate (a fixed classification boundary between positive and negative examples), regardless of context. A learned invariance might, instead, project each phoneme into a high-dimensional Gaussian mixture supervector space, and in the high-dimensional space, learn an inter-phoneme distance metric that minimizes the distances among examples of any given phoneme. Results are available for both tasks, but it is not easy to compare them: learned invariance outperforms a priori invariance for some task definitions, and underperforms for other task definitions. As future work, we propose that the a priori invariance might be used to regularize a learned invariance projection.
    The Journal of the Acoustical Society of America 10/2011; 130(4):2524. · 1.65 Impact Factor
  • Heejin Kim, Mark Hasegawa-Johnson, Adrienne Perlman
    ABSTRACT: Imprecise production of fricatives has been noted in various types of dysarthria, but their acoustic properties have not been fully investigated. Using durational measures and spectral moment analysis, this study examines the acoustic characteristics of alveolar and post-alveolar voiceless fricatives produced by speakers with cerebral palsy (CP) in three different intelligibility levels. The following questions are addressed: (1) Are fricatives longer in CP-associated dysarthric speech, compared to controls? (2) What is the evidence of reduced acoustic distinctions between alveolar vs. post-alveolar fricatives? and (3) Are the intelligibility levels associated with acoustic measures of fricatives? Duration and the first three spectral moments were obtained from word initial fricatives produced by 18 American English native speakers (9 speakers diagnosed with spastic CP and 9 controls). Results showed that speakers with CP exhibited significantly longer duration of fricatives and a reduced distinction between alveolar versus post-alveolar fricatives compared to control speakers. A reduced place distinction in dysarthric speech was mostly due to lower first moments and higher third moments compared to normal speech. The group difference was greater for alveolar fricatives than for post-alveolar fricatives. Furthermore, as the intelligibility level decreased, durational increase and the degree of place overlap were consistently greater.
    The Journal of the Acoustical Society of America 10/2011; 130(4):2446. · 1.65 Impact Factor
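Spectral moment analysis treats the magnitude spectrum as a probability distribution over frequency; a minimal sketch follows (the toy spectrum is illustrative, not data from the study):

```python
def spectral_moments(freqs, magnitudes):
    """First three spectral moments of a magnitude spectrum treated as a
    distribution over frequency: mean (centroid), variance, and normalized
    skewness (third central moment divided by variance^1.5)."""
    total = sum(magnitudes)
    probs = [m / total for m in magnitudes]
    m1 = sum(f * p for f, p in zip(freqs, probs))
    m2 = sum((f - m1) ** 2 * p for f, p in zip(freqs, probs))
    m3 = sum((f - m1) ** 3 * p for f, p in zip(freqs, probs)) / m2 ** 1.5
    return m1, m2, m3

# A toy /s/-like spectrum: energy concentrated at high frequencies gives a
# high first moment; shifting energy downward (as reported for the speakers
# with CP) lowers it.
freqs = [2000.0, 4000.0, 6000.0, 8000.0]
m1, m2, m3 = spectral_moments(freqs, [0.1, 0.1, 0.4, 0.4])   # m1 = 6200.0 Hz
```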
  • Mark A Hasegawa-Johnson, Jui-Ting Huang, Xiaodan Zhuang
    ABSTRACT: Semi-supervised learning requires one to make assumptions about the data. This talk will discuss two different assumptions, and algorithms that instantiate those assumptions, for the tasks of acoustic modeling and pronunciation modeling in automatic speech recognition. First, the acoustic spectra corresponding to different phonemes overlap, but there is a tendency for the instantiations of each phoneme to cluster within a well-defined region of the feature space--a sort of "soft compactness" assumption. Softly compact distributions can be learned by an algorithm that encourages compactness without strictly requiring it, e.g., by maximizing likelihood of the unlabeled data, or even better, by minimizing its conditional class entropy. Second, the observed phone strings corresponding to coarticulated pronunciations of different words are also, often, indistinguishable, but can be transformed into a representation in which the degree of overlap is substantially reduced. The canonical phonetic pronunciations are transformed into an articulatory domain, possible mispronunciations are predicted based on a compactness criterion in the articulatory domain, and the result is transformed back into the phonetic domain, forming a finite state transducer that is able to effectively use hundreds of alternate pronunciations.
    The Journal of the Acoustical Society of America 10/2011; 130(4):2408. · 1.65 Impact Factor
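The conditional class entropy criterion mentioned for the first assumption can be sketched directly. The posteriors below are toy values; real systems compute this quantity from model posteriors over large amounts of unlabeled speech.

```python
import math

def mean_conditional_entropy(posteriors):
    """Average Shannon entropy (in nats) of per-frame class posteriors.
    Minimizing this over unlabeled data encourages confident assignments,
    i.e. 'softly compact' class regions in feature space."""
    def entropy(p):
        return -sum(q * math.log(q) for q in p if q > 0)
    return sum(entropy(p) for p in posteriors) / len(posteriors)

# Confident posteriors have lower conditional entropy than uncertain ones.
h_confident = mean_conditional_entropy([[0.99, 0.01], [0.98, 0.02]])
h_uncertain = mean_conditional_entropy([[0.5, 0.5], [0.6, 0.4]])
```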
  • Hao Tang, Stephen Mingyu Chu, Mark Hasegawa-Johnson, Thomas S Huang
    ABSTRACT: Content-based multimedia indexing, retrieval, and processing as well as multimedia databases demand the structuring of the media content (image, audio, video, text, etc.), one significant goal being to associate the identity of the content to the individual segments of the signals. In this paper, we specifically address the problem of speaker clustering, the task of assigning every speech utterance in an audio stream to its speaker. We offer a complete treatment of the idea of partially supervised speaker clustering, which refers to the use of our prior knowledge of speakers in general to assist the unsupervised speaker clustering process. By means of an independent training data set, we encode the prior knowledge at the various stages of the speaker clustering pipeline via 1) learning a speaker-discriminative acoustic feature transformation, 2) learning a universal speaker prior model, and 3) learning a discriminative speaker subspace, or equivalently, a speaker-discriminative distance metric. We study the directional scattering property of the Gaussian mixture model (GMM) mean supervector representation of utterances in the high-dimensional space, and advocate exploiting this property by using the cosine distance metric instead of the Euclidean distance metric for speaker clustering in the GMM mean supervector space. We propose to perform discriminant analysis based on the cosine distance metric, which leads to a novel distance metric learning algorithm, linear spherical discriminant analysis (LSDA). We show that the proposed LSDA formulation can be systematically solved within the elegant graph embedding general dimensionality reduction framework. Our speaker clustering experiments on the GALE database clearly indicate that 1) our speaker clustering methods based on the GMM mean supervector representation and vector-based distance metrics outperform traditional speaker clustering methods based on the “bag of acoustic features” representation and statistical model-based distance metrics, 2) our advocated use of the cosine distance metric yields consistent increases in the speaker clustering performance as compared to the commonly used Euclidean distance metric, 3) our partially supervised speaker clustering concept and strategies significantly improve the speaker clustering performance over the baselines, and 4) our proposed LSDA algorithm further leads to state-of-the-art speaker clustering performance.
    IEEE Transactions on Pattern Analysis and Machine Intelligence 08/2011; 34(5):959-71.
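The advocated cosine distance in the supervector space can be contrasted with the Euclidean distance in a few lines. The three-dimensional "supervectors" below are toy illustrations; real GMM mean supervectors have thousands of dimensions.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine of the angle between two vectors; scale-invariant."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two supervectors pointing in the same direction but scaled differently
# (e.g. nuisance variation in utterance energy): cosine distance is near
# zero, while Euclidean distance is large.
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]
d_cos = cosine_distance(u, v)
d_euc = euclidean_distance(u, v)
```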
  • Bowon Lee, Camille Goudeseune, Mark A. Hasegawa-Johnson
    ABSTRACT: This paper considers methods for audio display in a CAVE-type virtual reality theater, a 3 m cube with displays covering all six rigid faces. Headphones are possible since the user's headgear continuously measures ear positions, but loudspeakers are preferable since they enhance the sense of total immersion. The proposed solution consists of open-loop acoustic point control. The transfer function, a matrix of room frequency responses from the loudspeakers to the ears of the user, is inverted using multi-channel inversion methods, to create exactly the desired sound field at the user's ears. The inverse transfer function is constructed from impulse responses simulated by the image source method. This technique is validated by measuring a 2x2 matrix transfer function, simulating a transfer function with the same geometry, and filtering the measured transfer function through the inverse of the simulation. Since accuracy of the image source method decreases with time, inversion performance is improved by windowing the simulated response prior to inversion. Parameters of the simulation and inversion are adjusted to minimize residual reverberant energy; the best-case dereverberation ratio is 10 dB.
  • Xiaodan Zhuang, Xi Zhou, Mark A. Hasegawa-Johnson, Thomas S. Huang
    ABSTRACT: Effective object localization relies on an efficient and effective search method and on robust image representation and learning methods. Recently, the Gaussianized vector representation has been shown effective in several computer vision applications, such as facial age estimation, image scene categorization and video event recognition. However, all these tasks are classification and regression problems based on whole images; how this representation can be applied efficiently to object localization, which reveals the locations and sizes of objects, has not yet been explored. In this work, we present an efficient object localization approach for the Gaussianized vector representation, following a branch-and-bound search scheme introduced by Lampert et al. [5]. In particular, we design a quality bound for rectangle sets characterized by the Gaussianized vector representation for fast hierarchical search. This bound can be obtained for any rectangle set in the image with little extra computational cost, in addition to calculating the Gaussianized vector representation for the whole image. Further, we propose incorporating a normalization approach that suppresses the variation within the object class and the background class. Experiments on a multi-scale car dataset show that the proposed object localization approach based on the Gaussianized vector representation outperforms previous work using the histogram-of-keywords representation. The within-class variation normalization approach further boosts the performance. This chapter is an extended version of our paper at the 1st International Workshop on Interactive Multimedia for Consumer Electronics at ACM Multimedia 2009 [16].
    02/2011: pages 93-109;
  • INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy, August 27-31, 2011; 01/2011
  • I. Yücel Özbek, Mark Hasegawa-Johnson, Mübeccel Demirekler
    IEEE Transactions on Audio, Speech & Language Processing. 01/2011; 19:1180-1195.
  • Po-Sen Huang, Thyagaraju Damarla, Mark Hasegawa-Johnson
    ABSTRACT: Personnel detection at border crossings has recently become an important issue. To reduce the number of false alarms, it is important to discriminate between humans and four-legged animals. This paper proposes using enhanced summary autocorrelation patterns for feature extraction from seismic sensors, a multi-stage exemplar selection framework to learn an acoustic classifier, and temporal patterns from ultrasonic sensors. We compare the results of decision fusion with Gaussian mixture model classifiers and feature fusion with support vector machines. Experimental results show that the proposed methods improve the robustness of the system.
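The decision-fusion baseline can be sketched as a weighted combination of per-sensor class scores. This is a minimal illustration; the scores below are made-up log-likelihoods, not values from the paper, and equal sensor weights are an assumption.

```python
def fuse_decisions(scores_per_sensor, weights=None):
    """Late (decision-level) fusion: weighted sum of per-sensor class scores,
    e.g. GMM log-likelihoods; returns the index of the winning class."""
    n_sensors = len(scores_per_sensor)
    n_classes = len(scores_per_sensor[0])
    if weights is None:
        weights = [1.0 / n_sensors] * n_sensors
    fused = [sum(w * s[c] for w, s in zip(weights, scores_per_sensor))
             for c in range(n_classes)]
    return max(range(n_classes), key=fused.__getitem__)

# Class 0 = human, class 1 = animal; the seismic sensor is ambiguous, but the
# ultrasonic sensor strongly favors 'human', so the fused decision is class 0.
seismic = [-10.0, -10.5]
ultrasonic = [-8.0, -12.0]
decision = fuse_decisions([seismic, ultrasonic])   # 0
```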

Publication Stats

613 Citations
58.77 Total Impact Points


  • 2003–2013
    • University of Illinois, Urbana-Champaign
      • Department of Electrical and Computer Engineering
      • Department of Speech and Hearing Science
      Urbana, Illinois, United States
  • 2011
    • Hewlett-Packard
      • HP Labs
      Palo Alto, CA, United States
  • 2010
    • Oregon Health and Science University
      • Center for Spoken Language Understanding
      Portland, Oregon, United States
  • 2008
    • University of Victoria
      Victoria, British Columbia, Canada