Mark Hasegawa-Johnson

University of Illinois, Urbana-Champaign, Urbana, Illinois, United States

Publications (210) · 162.4 Total impact

  • Po-Sen Huang · Minje Kim · Mark Hasegawa-Johnson · Paris Smaragdis
    ABSTRACT: Monaural source separation is important for many real world applications. It is challenging in that, given only single channel information is available, there is an infinite number of solutions without proper constraints. In this paper, we explore joint optimization of masking functions and deep recurrent neural networks for monaural source separation tasks, including the monaural speech separation task, monaural singing voice separation task, and speech denoising task. The joint optimization of the deep recurrent neural networks with an extra masking layer enforces a reconstruction constraint. Moreover, we explore a discriminative training criterion for the neural networks to further enhance the separation performance. We evaluate our proposed system on TSP, MIR-1K, and TIMIT dataset for speech separation, singing voice separation, and speech denoising tasks, respectively. Our approaches achieve 2.30~4.98 dB SDR gain compared to NMF models in the speech separation task, 2.30~2.48 dB GNSDR gain and 4.32~5.42 dB GSIR gain compared to previous models in the singing voice separation task, and outperform NMF and DNN baseline in the speech denoising task.
    IEEE/ACM Transactions on Audio, Speech, and Language Processing 02/2015; 23(12). DOI:10.1109/TASLP.2015.2468583
  • Mark Hasegawa-Johnson · Jennifer Cole · Preethi Jyothi · Lav R. Varshney
    ABSTRACT: Transcribers make mistakes. Workers recruited in a crowdsourcing marketplace, because of their varying levels of commitment and education, make more mistakes than workers in a controlled laboratory setting. Methods for compensating transcriber mistakes are desirable because, with such methods available, crowdsourcing has the potential to significantly increase the scale of experiments in laboratory phonology. This paper provides a brief tutorial on statistical learning theory, introducing the relationship between dataset size and estimation error, then presents a theoretical description and preliminary results for two new methods that control labeler error in laboratory phonology experiments. First, we discuss the method of crowdsourcing over error-correcting codes. In the error-correcting-code method, each difficult labeling task is first factored, by the experimenter, into the product of several easy labeling tasks (typically binary). Factoring increases the total number of tasks, nevertheless it results in faster completion and higher accuracy, because workers unable to perform the difficult task may be able to meaningfully contribute to the solution of each easy task. Second, we discuss the use of explicit mathematical models of the errors made by a worker in the crowd. In particular, we introduce the method of mismatched crowdsourcing, in which workers transcribe a language they do not understand, and an explicit mathematical model of second-language phoneme perception is used to learn and then compensate their transcription errors. Though introduced as technologies that increase the scale of phonology experiments, both methods have implications beyond increased scale. The method of easy questions permits us to probe the perception, by untrained listeners, of complicated phonological models; examples are provided from the prosody of English and Hindi. The method of mismatched crowdsourcing permits us to probe, in more detail than ever before, the perception of phonetic categories by listeners with a different phonological system.
    01/2015; 6(3-4). DOI:10.1515/lp-2015-0012
  • Austin Chen · Mark A. Hasegawa-Johnson
    ABSTRACT: Many past studies have been conducted on speech/music discrimination due to the potential applications for broadcast and other media; however, it remains possible to expand the experimental scope to include samples of speech with varying amounts of background music. This paper focuses on the development and evaluation of two measures of the ratio between speech energy and music energy: a reference measure called speech-to-music ratio (SMR), which is known objectively only prior to mixing, and a feature called the stereo-input mix-to-peripheral level feature (SIMPL), which is computed from the stereo mixed signal as an imprecise estimate of SMR. SIMPL is an objective signal measure calculated by taking advantage of broadcast mixing techniques in which vocals are typically placed at stereo center, unlike most instruments. Conversely, SMR is a hidden variable defined by the relationship between the powers of portions of audio attributed to speech and music. It is shown that SIMPL is predictive of SMR and can be combined with state-of-the-art features in order to improve performance. For evaluation, this new metric is applied in speech/music (binary) classification, speech/music/mixed (trinary) classification, and a new speech-to-music ratio estimation problem. Promising results are achieved, including 93.06% accuracy for trinary classification and 3.86 dB RMSE for estimation of the SMR.
    IEEE/ACM Transactions on Audio, Speech, and Language Processing 12/2014; 22(12):2025-2033. DOI:10.1109/TASLP.2014.2359628
  • Kai Hsiang Lin · Pooya Khorrami · Jiangping Wang · Mark Hasegawa-Johnson · Thomas S. Huang
    ABSTRACT: In this paper, we propose a novel saliency-based algorithm to detect foreground regions in highly dynamic scenes. We first convert input video frames to multiple patch-based feature maps. Then, we apply temporal saliency analysis to the pixels of each feature map. For each temporal set of co-located pixels, the feature distance of a point from its kth nearest neighbor is used to compute the temporal saliency. By computing and combining temporal saliency maps of different features, we obtain foreground likelihood maps. A simple segmentation method based on adaptive thresholding is applied to detect the foreground objects. We test our algorithm on image sequences of dynamic scenes, including public datasets and a new challenging wildlife dataset we constructed. The experimental results demonstrate the proposed algorithm achieves state-of-the-art results.
    IEEE International Conference on Image Processing; 10/2014
  • Source
    ABSTRACT: In this paper, we address the problem of object class recognition via observations from actively selected views/modalities/features under limited resource budgets. A Partially Observable Markov Decision Process (POMDP) is employed to find optimal sensing and recognition actions with the goal of long-term classification accuracy. Heterogeneous resource constraints-such as motion, number of measurements and bandwidth-are explicitly modeled in the state variable, and a prohibitively high penalty is used to prevent the violation of any resource constraint. To improve recognition performance, we further incorporate discriminative classification models with POMDP, and customize the reward function and observation model correspondingly. The proposed model is validated on several data sets for multi-view, multi-modal vehicle classification and multi-view face recognition, and demonstrates improvement in both recognition and resource management over greedy methods and previous POMDP formulations.
    Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on; 06/2014
  • Yang Zhang · Zhijian Ou · Mark Hasegawa-Johnson
    ABSTRACT: Current model-based speech analysis tends to be incomplete - only a part of parameters of interest (e.g. only the pitch or vocal tract) are modeled, while the rest that might as well be important are disregarded. The drawback is that without joint modeling of parameters that are correlated, the analysis on speech parameters may be inaccurate or even incorrect. Under this motivation, we have proposed such a model called PAT (Probabilistic Acoustic Tube), where pitch, vocal tract and energy are jointly modeled. This paper proposes an improved version of PAT model, named PAT2, where both signal and probabilistic modeling are tremendously renovated. Compared to related works, PAT2 is much more comprehensive, which incorporates mixed excitation, glottal wave and phase modeling. Experimental results show its ability in decomposing speech into desirable parameters and its potential for speech synthesis.
    ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 05/2014
  • Po-Sen Huang · Minje Kim · Mark Hasegawa-Johnson · Paris Smaragdis
    ABSTRACT: Monaural source separation is useful for many real-world applications though it is a challenging problem. In this paper, we study deep learning for monaural speech separation. We propose the joint optimization of the deep learning models (deep neural networks and recurrent neural networks) with an extra masking layer, which enforces a reconstruction constraint. Moreover, we explore a discriminative training criterion for the neural networks to further enhance the separation performance. We evaluate our approaches using the TIMIT speech corpus for a monaural speech separation task. Our proposed models achieve about 3.8~4.9 dB SIR gain compared to NMF models, while maintaining better SDRs and SARs.
    ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 05/2014
  • Kyungtae Kim · Kai-Hsiang Lin · Dirk B. Walther · Mark A. Hasegawa-Johnson · Thomas S. Huang
    ABSTRACT: Auditory salience describes how much a particular auditory event attracts human attention. Previous attempts at automatic detection of salient audio events have been hampered by the challenge of defining ground truth. In this paper ground truth for auditory salience is built up from annotations by human subjects of a large corpus of meeting room recordings. Following statistical purification of the data, an optimal auditory salience filter with linear discrimination is derived from the purified data. An automatic auditory salience detector based on optimal filtering of the Bark-frequency loudness performs with 32% equal error rate. Expanding the feature vector to include other common feature sets does not improve performance. Consistent with intuition, the optimal filter looks like an onset detector in the time domain.
    Pattern Recognition Letters 03/2014; 38(1):78–85. DOI:10.1016/j.patrec.2013.11.010 · 1.55 Impact Factor
  • Po-Sen Huang · Li Deng · Mark Hasegawa-Johnson · Xiaodong He
    ABSTRACT: The recently developed deep learning architecture, a kernel version of the deep convex network (K-DCN), is improved to address the scalability problem when the training and testing samples become very large. We have developed a solution based on the use of random Fourier features, which possess the strong theoretical property of approximating the Gaussian kernel while rendering efficient computation in both training and evaluation of the K-DCN with large training samples. We empirically demonstrate that just like the conventional K-DCN exploiting rigorous Gaussian kernels, the use of random Fourier features also enables successful stacking of kernel modules to form a deep architecture. Our evaluation experiments on phone recognition and speech understanding tasks both show the computational efficiency of the K-DCN which makes use of random features. With sufficient depth in the K-DCN, the phone recognition accuracy and slot-filling accuracy are shown to be comparable or slightly higher than the K-DCN with Gaussian kernels while significant computational saving has been achieved.
    Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on; 10/2013
  • Source
    ABSTRACT: Browsing large audio archives is challenging because of the limitations of human audition and attention. However, this task becomes easier with a suitable visualization of the audio signal, such as a spectrogram transformed to make unusual audio events salient. This transformation maximizes the mutual information between an isolated event's spectrogram and an estimate of how salient the event appears in its surrounding context. When such spectrograms are computed and displayed with fluid zooming over many temporal orders of magnitude, sparse events in long audio recordings can be detected more quickly and more easily. In particular, in a 1/10-real-time acoustic event detection task, subjects who were shown saliency-maximized rather than conventional spectrograms performed significantly better. Saliency maximization also improves the mutual information between the ground truth of nonbackground sounds and visual saliency, more than other common enhancements to visualization.
    ACM Transactions on Applied Perception 10/2013; 10(4):26:1-26:16. DOI:10.1145/2536764.2536773 · 0.65 Impact Factor
  • Harsh Vardhan Sharma · Mark Hasegawa-Johnson
    ABSTRACT: Speech production errors characteristic of dysarthria are chiefly responsible for the low accuracy of automatic speech recognition (ASR) when used by people diagnosed with it. A person with dysarthria produces speech in a rather reduced acoustic working space, causing typical measures of speech acoustics to have values in ranges very different from those characterizing unimpaired speech. It is unlikely then that models trained on unimpaired speech will be able to adjust to this mismatch when acted on by one of the currently well-studied adaptation algorithms (which make no attempt to address this extent of mismatch in population characteristics). In this work, we propose an interpolation-based technique for obtaining a prior acoustic model from one trained on unimpaired speech, before adapting it to the dysarthric talker. The method computes a ‘background’ model of the dysarthric talker's general speech characteristics and uses it to obtain a more suitable prior model for adaptation (compared to the speaker-independent model trained on unimpaired speech). The approach is tested with a corpus of dysarthric speech acquired by our research group, on speech of sixteen talkers with varying levels of dysarthria severity (as quantified by their intelligibility). This interpolation technique is tested in conjunction with the well-known maximum a posteriori (MAP) adaptation algorithm, and yields improvements of up to 8% absolute and up to 40% relative, over the standard MAP adapted baseline.
    Computer Speech & Language 09/2013; 27(6):1147–1162. DOI:10.1016/j.csl.2012.10.002 · 1.75 Impact Factor
  • Source
    ABSTRACT: Speech can be represented as a constellation of constricting vocal tract actions called gestures, whose temporal patterning with respect to one another is expressed in a gestural score. Current speech datasets do not come with gestural annotation and no formal gestural annotation procedure exists at present. This paper describes an iterative analysis-by-synthesis landmark-based time-warping architecture to perform gestural annotation of natural speech. For a given utterance, the Haskins Laboratories Task Dynamics and Application (TADA) model is employed to generate a corresponding prototype gestural score. The gestural score is temporally optimized through an iterative timing-warping process such that the acoustic distance between the original and TADA-synthesized speech is minimized. This paper demonstrates that the proposed iterative approach is superior to conventional acoustically-referenced dynamic timing-warping procedures and provides reliable gestural annotation for speech datasets.
    The Journal of the Acoustical Society of America 12/2012; 132(6):3980-9. DOI:10.1121/1.4763545 · 1.50 Impact Factor
  • Sarah King · Mark Hasegawa-Johnson
    ABSTRACT: Acoustic-phonetic landmarks provide robust cues for speech recognition and are relatively invariant between speakers, speaking styles, noise conditions and sampling rates. The ability to detect acoustic-phonetic landmarks as a front-end for speech recognition has been shown to improve recognition accuracy. Biomimetic inter-spike intervals and average signal level have been shown to accurately convey information about acoustic-phonetic landmarks. This paper explores the use of inter-spike interval and average signal level as input features for landmark detectors trained and tested on mismatched conditions. These detectors are designed to serve as a front-end for speech recognition systems. Results indicate that landmark detectors trained using inter-spike intervals and signal level are relatively robust to both additive channel noise and changes in sampling rate. Mismatched conditions — differences in channel noise between training audio and testing audio — are problematic for computer speech recognition systems. Signal enhancement, mismatch-resistant acoustic features, and architectural compensation within the recognizer
    Proceedings of COLING 2012: Posters; 12/2012
  • Mohamed Elmahdy · Mark Hasegawa-Johnson · Eiman Mustafawi
    10/2012; DOI:10.5339/qfarf.2012.CSO3
  • Source
    ABSTRACT: Identification of network linkages through direct observation of human interaction has long been a staple of network analysis. It is, however, time consuming and labor intensive when undertaken by human observers. This paper describes the development and validation of a two-stage methodology for automating the identification of network links from direct observation of groups in which members are free to move around a space. The initial manual annotation stage utilizes a web-based interface to support manual coding of physical location, posture, and gaze direction of group members from snapshots taken from video recordings of groups. The second stage uses the manually annotated data as input for machine learning to automate the inference of links among group members. The manual codings were treated as observed variables and the theory of turn taking in conversation was used to model temporal dependencies among interaction links, forming a Dynamic Bayesian Network (DBN). The DBN was modeled using the Bayes Net Toolkit and parameters were learned using Expectation Maximization (EM) algorithm. The Viterbi algorithm was adapted to perform the inference in DBN. The result is a time series of linkages for arbitrarily long segments that utilizes statistical distributions to estimate linkages. The validity of the method was assessed through comparing the accuracy of automatically detected links to manually identified links. Results show adequate validity and suggest routes for improvement of the method.
    Social Networks 10/2012; 34(4):515-526. DOI:10.1016/j.socnet.2012.04.002 · 2.93 Impact Factor
  • Heejin Kim · Mark Hasegawa-Johnson
    ABSTRACT: Second-formant (F2) locus equations represent a linear relationship between F2 measured at the vowel onset following stop release and F2 measured at the vowel midpoint in a consonant-vowel (CV) sequence. Prior research has used the slope and intercept of locus equations as indices to coarticulation degree and the consonant's place of articulation. This presentation addresses coarticulation degree and place of articulation contrasts in dysarthric speech, by comparing locus equation measures for speakers with cerebral palsy and control speakers. Locus equation data are extracted from the Universal Access Speech (Kim et al. 2008). The data consist of CV sequences with labial, alveolar, velar stops produced in the context of various vowels that differ in backness and thus in F2. Results show that for alveolars and labials, slopes are less steep and intercepts are higher in dysarthric speech compared to normal speech, indicating a reduced degree of coarticulation in CV transitions, while for front and back velars, the opposite pattern is observed. In addition, a second-order locus equation analysis shows a reduced separation especially between alveolars and front velars in dysarthric speech. Results will be discussed in relation to the horizontal tongue body positions in CV transitions in dysarthric speech.
    The Journal of the Acoustical Society of America 09/2012; 132(3):2089. DOI:10.1121/1.4755719 · 1.50 Impact Factor
  • Panying Rong · Torrey Loucks · Heejin Kim · Mark Hasegawa-Johnson
    ABSTRACT: A multimodal approach combining acoustics, intelligibility ratings, articulography and surface electromyography was used to examine the characteristics of dysarthria due to cerebral palsy (CP). CV syllables were studied by obtaining the slope of F2 transition during the diphthong, tongue-jaw kinematics during the release of the onset consonant, and the related submental muscle activities and relating these measures to speech intelligibility. The results show that larger reductions of F2 slope are correlated with lower intelligibility in CP-related dysarthria. Among the three speakers with CP, the speaker with the lowest F2 slope and intelligibility showed smallest tongue release movement and largest jaw opening movement. The other two speakers with CP were comparable in the amplitude and velocity of tongue movements, but one speaker had abnormally prolonged jaw movement. The tongue-jaw coordination pattern found in the speakers with CP could be either compensatory or subject to an incompletely developed oromotor control system.
    Clinical Linguistics & Phonetics 09/2012; 26(9):806-22. DOI:10.3109/02699206.2012.706686 · 0.58 Impact Factor
  • Robert Mertens · Po-Sen Huang · Luke Gottlieb · Gerald Friedland · Ajay Divakaran · Mark Hasegawa-Johnson
    ABSTRACT: A video's soundtrack is usually highly correlated to its content. Hence, audio-based techniques have recently emerged as a means for video concept detection complementary to visual analysis. Most state-of-the-art approaches rely on manual definition of predefined sound concepts such as "engine sounds," "outdoor/indoor sounds." These approaches come with three major drawbacks: manual definitions do not scale as they are highly domain-dependent, manual definitions are highly subjective with respect to annotators and a large part of the audio content is omitted since the predefined concepts are usually found only in a fraction of the soundtrack. This paper explores how unsupervised audio segmentation systems like speaker diarization can be adapted to automatically identify low-level sound concepts similar to annotator defined concepts and how these concepts can be used for audio indexing. Speaker diarization systems are designed to answer the question "Who spoke when?" by finding segments in an audio stream that exhibit similar properties in feature space, i.e., sound similar. Using a diarization system, all the content of an audio file is analyzed and similar sounds are clustered. This article provides an in-depth analysis on the statistic properties of similar acoustic segments identified by the diarization system in a predefined document set and the theoretical fitness of this approach to discern one document class from another. It also discusses how diarization can be tuned in order to better reflect the acoustic properties of general sounds as opposed to speech and introduces a proof-of-concept system for multimedia event classification working with diarization-based indexing.
    07/2012; 3(3):1-19. DOI:10.4018/jmdem.2012070101
  • Source
    ABSTRACT: In this work, we break the real-time barrier of human audition by producing rapidly searchable visualizations of the audio signal. We propose a saliency-maximized audio spectrogram as a visual representation that enables fast detection of audio events by a human analyst. This representation minimizes the time needed to examine a particular audio segment by embedding the information of the target events into visually salient patterns. In particular, we find a visualization function that transforms the original mixed spectrogram to maximize the mutual information between the label sequence of target events and the estimated visual saliency of the spectrogram features. Subject experiments using our human acoustic event detection software show that the saliency-maximized spectrogram significantly outperforms the original spectrogram in a 1/10-real-time acoustic event detection task.
    ICASSP 2012 - 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 05/2012; DOI:10.1109/ICASSP.2012.6288368 · 4.63 Impact Factor
  • I. Yucel Ozbek · Mark Hasegawa-Johnson · Mubeccel Demirekler
    ABSTRACT: This paper presents a complete framework for articulatory inversion based on jump Markov linear systems (JMLS). In the model, the acoustic measurements and the position of each articulator are considered as observable measurement and continuous-valued hidden state of the system, respectively, and discrete regimes of the system are represented by the use of a discrete-valued hidden modal state. Articulatory inversion based on JMLS involves learning the model parameter set of the system and making inference about the state (position of each articulator) of the system using acoustic measurements. Iterative learning algorithms based on maximum-likelihood (ML) and maximum a posteriori (MAP) criteria are proposed to learn the model parameter set of the JMLS. It is shown that the learning procedure of the JMLS is a generalized version of hidden Markov model (HMM) training when both acoustic and articulatory data are given. In this paper, it is shown that the MAP-based learning algorithm improves modeling performance of the system and gives significantly better results compared to ML. The inference stage of the proposed algorithm is based on an interacting multiple models (IMM) approach, and done online (filtering), and/or offline (smoothing). Formulas are provided for IMM-based JMLS smoothing. It is shown that smoothing significantly improves the performance of articulatory inversion compared to filtering. Several experiments are conducted with the MOCHA database to show the performance of the proposed method. Comparison of the performance of the proposed method with the ones given in the literature shows that the proposed method improves the performance of state space approaches, making state space approaches comparable to the best published results.
    IEEE Transactions on Audio Speech and Language Processing 02/2012; 20(1):67-81. DOI:10.1109/TASL.2011.2157496 · 2.48 Impact Factor

Publication Stats

2k Citations
162.40 Total Impact Points


  • 2000-2015
    • University of Illinois, Urbana-Champaign
      • Department of Electrical and Computer Engineering
      Urbana, Illinois, United States
  • 2011
    • Middle East Technical University
      • Department of Electrical and Electronics Engineering
      Ankara, Ankara, Turkey
  • 2010
    • Bureau of Materials & Physical Research
      Springfield, Illinois, United States
  • 2008
    • University of Victoria
      Victoria, British Columbia, Canada