ABSTRACT: Monaural source separation is important for many real-world applications. It is challenging in that, given only single-channel information, there are infinitely many solutions without proper constraints. In this paper, we explore the joint optimization of masking functions and deep recurrent neural networks for monaural source separation tasks: monaural speech separation, monaural singing voice separation, and speech denoising. The joint optimization of the deep recurrent neural networks with an extra masking layer enforces a reconstruction constraint. Moreover, we explore a discriminative training criterion for the neural networks to further enhance separation performance. We evaluate our proposed system on the TSP, MIR-1K, and TIMIT datasets for the speech separation, singing voice separation, and speech denoising tasks, respectively. Our approaches achieve 2.30~4.98 dB SDR gain compared to NMF models in the speech separation task, 2.30~2.48 dB GNSDR gain and 4.32~5.42 dB GSIR gain compared to previous models in the singing voice separation task, and outperform NMF and DNN baselines in the speech denoising task.
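The masking layer described above can be sketched in a few lines. This is a minimal numpy illustration, not the authors' implementation: the function names, shapes, and the particular form of the discriminative penalty (a weighted cross-source term, with a hypothetical weight `gamma`) are assumptions for illustration.

```python
import numpy as np

def soft_mask_separate(y1_hat, y2_hat, mixture_mag, eps=1e-8):
    """Soft time-frequency masking layer: the two network output
    magnitudes are normalized into masks that sum to one per bin,
    then applied to the mixture magnitude spectrogram. By
    construction s1 + s2 == mixture, which is the reconstruction
    constraint the extra layer enforces."""
    m1 = np.abs(y1_hat) / (np.abs(y1_hat) + np.abs(y2_hat) + eps)
    s1 = m1 * mixture_mag
    s2 = (1.0 - m1) * mixture_mag
    return s1, s2

def discriminative_loss(s1, s2, t1, t2, gamma=0.05):
    """One plausible discriminative criterion: in addition to each
    estimate's own MSE, penalize (negatively weight) its similarity
    to the *other* source's target."""
    return (np.mean((s1 - t1) ** 2) + np.mean((s2 - t2) ** 2)
            - gamma * np.mean((s1 - t2) ** 2)
            - gamma * np.mean((s2 - t1) ** 2))
```

Because the masks sum to one in every bin, the two estimates always add back up to the mixture regardless of what the network outputs.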
ABSTRACT: Many past studies have been conducted on speech/music discrimination due to the potential applications for broadcast and other media; however, it remains possible to expand the experimental scope to include samples of speech with varying amounts of background music. This paper focuses on the development and evaluation of two measures of the ratio between speech energy and music energy: a reference measure called speech-to-music ratio (SMR), which is known objectively only prior to mixing, and a feature called the stereo-input mix-to-peripheral level feature (SIMPL), which is computed from the stereo mixed signal as an imprecise estimate of SMR. SIMPL is an objective signal measure calculated by taking advantage of broadcast mixing techniques in which vocals are typically placed at stereo center, unlike most instruments. Conversely, SMR is a hidden variable defined by the relationship between the powers of portions of audio attributed to speech and music. It is shown that SIMPL is predictive of SMR and can be combined with state-of-the-art features in order to improve performance. For evaluation, this new metric is applied in speech/music (binary) classification, speech/music/mixed (trinary) classification, and a new speech-to-music ratio estimation problem. Promising results are achieved, including 93.06% accuracy for trinary classification and 3.86 dB RMSE for estimation of the SMR.
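The center-panning observation behind SIMPL can be illustrated with a simple mid/side energy ratio. This is a hypothetical proxy, not the paper's actual feature definition: function name and the dB formulation are assumptions for illustration.

```python
import numpy as np

def simpl_feature(left, right, eps=1e-12):
    """Rough mid/side energy ratio in dB, a hypothetical stand-in for
    the SIMPL idea: speech mixed to stereo center lands in the mid
    channel, while spread instruments contribute more to the side."""
    mid = 0.5 * (left + right)   # center-panned content (speech + some music)
    side = 0.5 * (left - right)  # off-center content (mostly music)
    return 10.0 * np.log10((np.sum(mid ** 2) + eps)
                           / (np.sum(side ** 2) + eps))
```

A perfectly center-panned source yields a very large ratio (the side channel cancels), while a hard-panned source yields 0 dB (equal mid and side energy), so the ratio is informative about how much center-panned, speech-like energy the mix contains.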
ABSTRACT: Current model-based speech analysis tends to be incomplete: only a subset of the parameters of interest (e.g., only the pitch or the vocal tract) is modeled, while other, potentially important parameters are disregarded. The drawback is that without joint modeling of correlated parameters, the analysis of speech parameters may be inaccurate or even incorrect. With this motivation, we previously proposed a model called PAT (Probabilistic Acoustic Tube), in which pitch, vocal tract, and energy are jointly modeled. This paper proposes an improved version of the PAT model, named PAT2, in which both the signal model and the probabilistic model are substantially redesigned. Compared to related work, PAT2 is much more comprehensive, incorporating mixed excitation, glottal wave, and phase modeling. Experimental results show its ability to decompose speech into the desired parameters and its potential for speech synthesis.
ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 05/2014
ABSTRACT: Monaural source separation is useful for many real-world applications, though it is a challenging problem. In this paper, we study deep learning for monaural speech separation. We propose the joint optimization of the deep learning models (deep neural networks and recurrent neural networks) with an extra masking layer, which enforces a reconstruction constraint. Moreover, we explore a discriminative training criterion for the neural networks to further enhance the separation performance. We evaluate our approaches using the TIMIT speech corpus for a monaural speech separation task. Our proposed models achieve about 3.8~4.9 dB SIR gain compared to NMF models, while maintaining better SDRs and SARs.
ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 05/2014
ABSTRACT: Auditory salience describes how much a particular auditory event attracts human attention. Previous attempts at automatic detection of salient audio events have been hampered by the challenge of defining ground truth. In this paper, ground truth for auditory salience is built up from annotations by human subjects of a large corpus of meeting room recordings. Following statistical purification of the data, an optimal auditory salience filter with linear discrimination is derived from the purified data. An automatic auditory salience detector based on optimal filtering of the Bark-frequency loudness performs with 32% equal error rate. Expanding the feature vector to include other common feature sets does not improve performance. Consistent with intuition, the optimal filter looks like an onset detector in the time domain.
ABSTRACT: Browsing large audio archives is challenging because of the limitations of human audition and attention. However, this task becomes easier with a suitable visualization of the audio signal, such as a spectrogram transformed to make unusual audio events salient. This transformation maximizes the mutual information between an isolated event's spectrogram and an estimate of how salient the event appears in its surrounding context. When such spectrograms are computed and displayed with fluid zooming over many temporal orders of magnitude, sparse events in long audio recordings can be detected more quickly and more easily. In particular, in a 1/10-real-time acoustic event detection task, subjects who were shown saliency-maximized rather than conventional spectrograms performed significantly better. Saliency maximization also improves the mutual information between the ground truth of nonbackground sounds and visual saliency, more than other common enhancements to visualization.
ABSTRACT: Speech production errors characteristic of dysarthria are chiefly responsible for the low accuracy of automatic speech recognition (ASR) when used by people diagnosed with it. A person with dysarthria produces speech in a rather reduced acoustic working space, causing typical measures of speech acoustics to have values in ranges very different from those characterizing unimpaired speech. It is unlikely then that models trained on unimpaired speech will be able to adjust to this mismatch when acted on by one of the currently well-studied adaptation algorithms (which make no attempt to address this extent of mismatch in population characteristics). In this work, we propose an interpolation-based technique for obtaining a prior acoustic model from one trained on unimpaired speech, before adapting it to the dysarthric talker. The method computes a 'background' model of the dysarthric talker's general speech characteristics and uses it to obtain a more suitable prior model for adaptation (compared to the speaker-independent model trained on unimpaired speech). The approach is tested with a corpus of dysarthric speech acquired by our research group, on speech of sixteen talkers with varying levels of dysarthria severity (as quantified by their intelligibility). This interpolation technique is tested in conjunction with the well-known maximum a posteriori (MAP) adaptation algorithm, and yields improvements of up to 8% absolute and up to 40% relative over the standard MAP adapted baseline.
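The interpolation-then-MAP idea can be sketched for a single Gaussian mean. This is a minimal illustration under assumptions: the interpolation weight `lam`, the relevance factor `tau`, and the function names are hypothetical, and real acoustic models interpolate full HMM-GMM parameter sets, not one mean vector.

```python
import numpy as np

def interpolated_prior_mean(si_mean, background_mean, lam):
    """Pull the speaker-independent (unimpaired) mean toward a
    'background' model of the dysarthric talker to form a better
    prior for adaptation. lam in [0, 1] weights the SI model."""
    return lam * si_mean + (1.0 - lam) * background_mean

def map_adapt_mean(prior_mean, frames, tau=10.0):
    """Standard MAP update of a Gaussian mean with relevance factor
    tau: mu = (tau * prior + sum(x)) / (tau + n)."""
    n = len(frames)
    return (tau * prior_mean + frames.sum(axis=0)) / (tau + n)
```

With abundant adaptation data the MAP estimate approaches the data mean regardless of the prior; the interpolated prior matters most when adaptation data is scarce, which is the usual situation for dysarthric talkers.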
ABSTRACT: The hidden Markov model (HMM) is widely popular as the de facto tool for representing temporal data; in this paper, we add to its utility in the sequence clustering domain: we describe a novel approach that allows us to directly control purity in HMM-based clustering algorithms. We show that encouraging sparsity in the observation probabilities increases cluster purity and derive an algorithm based on ℓp regularization; as a corollary, we also provide a different and useful interpretation of the value of p in Rényi p-entropy. We test our method on the problem of clustering non-speech audio events from the BBC sound effects corpus. Experimental results confirm that our approach does learn purer clusters, with (unweighted) average purity as high as 0.88 - a considerable improvement over both the baseline HMM (0.72) and k-means clustering (0.69).
Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on; 01/2013
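The (unweighted) average purity quoted above is a standard metric and easy to state precisely; the sketch below is one straightforward reading of it (function name assumed), averaging each cluster's majority-class fraction uniformly over clusters.

```python
import numpy as np

def average_purity(cluster_ids, labels):
    """Unweighted average purity: for each cluster, the fraction of
    its items belonging to the cluster's majority class, averaged
    uniformly over clusters (each cluster counts equally, regardless
    of size)."""
    cluster_ids = np.asarray(cluster_ids)
    labels = np.asarray(labels)
    purities = []
    for c in np.unique(cluster_ids):
        members = labels[cluster_ids == c]
        _, counts = np.unique(members, return_counts=True)
        purities.append(counts.max() / len(members))
    return float(np.mean(purities))
```

Note that the weighted variant (weighting clusters by size) is also common; the unweighted form rewards small pure clusters more strongly.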
ABSTRACT: Speech can be represented as a constellation of constricting vocal tract actions called gestures, whose temporal patterning with respect to one another is expressed in a gestural score. Current speech datasets do not come with gestural annotation and no formal gestural annotation procedure exists at present. This paper describes an iterative analysis-by-synthesis landmark-based time-warping architecture to perform gestural annotation of natural speech. For a given utterance, the Haskins Laboratories Task Dynamics and Application (TADA) model is employed to generate a corresponding prototype gestural score. The gestural score is temporally optimized through an iterative timing-warping process such that the acoustic distance between the original and TADA-synthesized speech is minimized. This paper demonstrates that the proposed iterative approach is superior to conventional acoustically-referenced dynamic timing-warping procedures and provides reliable gestural annotation for speech datasets.
The Journal of the Acoustical Society of America 12/2012; 132(6):3980-9. DOI:10.1121/1.4763545
ABSTRACT: Acoustic-phonetic landmarks provide robust cues for speech recognition and are relatively invariant across speakers, speaking styles, noise conditions, and sampling rates. The ability to detect acoustic-phonetic landmarks as a front-end for speech recognition has been shown to improve recognition accuracy. Biomimetic inter-spike intervals and average signal level have been shown to accurately convey information about acoustic-phonetic landmarks. This paper explores the use of inter-spike interval and average signal level as input features for landmark detectors trained and tested on mismatched conditions. These detectors are designed to serve as a front-end for speech recognition systems. Results indicate that landmark detectors trained using inter-spike intervals and signal level are relatively robust to both additive channel noise and changes in sampling rate. Mismatched conditions, i.e., differences in channel noise between training audio and testing audio, are problematic for computer speech recognition systems; signal enhancement, mismatch-resistant acoustic features, and architectural compensation within the recognizer are among the approaches used to address them.
ABSTRACT: Identification of network linkages through direct observation of human interaction has long been a staple of network analysis. It is, however, time consuming and labor intensive when undertaken by human observers. This paper describes the development and validation of a two-stage methodology for automating the identification of network links from direct observation of groups in which members are free to move around a space. The initial manual annotation stage utilizes a web-based interface to support manual coding of physical location, posture, and gaze direction of group members from snapshots taken from video recordings of groups. The second stage uses the manually annotated data as input for machine learning to automate the inference of links among group members. The manual codings were treated as observed variables, and the theory of turn taking in conversation was used to model temporal dependencies among interaction links, forming a Dynamic Bayesian Network (DBN). The DBN was modeled using the Bayes Net Toolkit and parameters were learned using the Expectation-Maximization (EM) algorithm. The Viterbi algorithm was adapted to perform inference in the DBN. The result is a time series of linkages for arbitrarily long segments that utilizes statistical distributions to estimate linkages. The validity of the method was assessed by comparing the accuracy of automatically detected links to manually identified links. Results show adequate validity and suggest routes for improvement of the method.
Social Networks 10/2012; 34(4):515-526. DOI:10.1016/j.socnet.2012.04.002
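Viterbi decoding of the kind the study adapts to its DBN can be sketched in its plain HMM form. This is a generic textbook implementation, not the paper's adapted algorithm; array names and shapes are assumptions.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B_obs):
    """Plain Viterbi decoding over T observations and K hidden states.
    log_pi: (K,) initial log-probs; log_A: (K, K) transition log-probs
    (row = previous state); log_B_obs: (T, K) per-frame observation
    log-likelihoods. Returns the most likely state sequence."""
    T, K = log_B_obs.shape
    delta = log_pi + log_B_obs[0]            # best log-score ending in each state
    back = np.zeros((T, K), dtype=int)       # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A      # (K, K): prev state -> next state
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(K)] + log_B_obs[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):            # trace backpointers to the start
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

For link inference, the hidden state would encode the current speaker/addressee configuration and the observations the manual codings; the decoded path is then the time series of linkages.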
ABSTRACT: Second-formant (F2) locus equations represent a linear relationship between F2 measured at the vowel onset following stop release and F2 measured at the vowel midpoint in a consonant-vowel (CV) sequence. Prior research has used the slope and intercept of locus equations as indices of coarticulation degree and the consonant's place of articulation. This presentation addresses coarticulation degree and place of articulation contrasts in dysarthric speech by comparing locus equation measures for speakers with cerebral palsy and control speakers. Locus equation data are extracted from the Universal Access Speech (Kim et al. 2008). The data consist of CV sequences with labial, alveolar, and velar stops produced in the context of various vowels that differ in backness and thus in F2. Results show that for alveolars and labials, slopes are less steep and intercepts are higher in dysarthric speech compared to normal speech, indicating a reduced degree of coarticulation in CV transitions, while for front and back velars, the opposite pattern is observed. In addition, a second-order locus equation analysis shows a reduced separation especially between alveolars and front velars in dysarthric speech. Results will be discussed in relation to the horizontal tongue body positions in CV transitions in dysarthric speech.
The Journal of the Acoustical Society of America 09/2012; 132(3):2089. DOI:10.1121/1.4755719
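A locus equation is simply a least-squares line through (F2 midpoint, F2 onset) pairs, so the slope/intercept measures above reduce to a one-line fit. Function and variable names here are hypothetical; frequencies are assumed to be in Hz.

```python
import numpy as np

def locus_equation(f2_onset, f2_mid):
    """Fit the second-formant locus equation
        F2_onset = slope * F2_mid + intercept
    by least squares over a set of CV tokens for one consonant.
    A steeper slope (closer to 1) is conventionally read as a
    higher degree of CV coarticulation."""
    slope, intercept = np.polyfit(np.asarray(f2_mid, dtype=float),
                                  np.asarray(f2_onset, dtype=float), 1)
    return slope, intercept
```

Comparing the fitted (slope, intercept) pairs across consonant places and across speaker groups is exactly the kind of analysis the abstract reports.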
ABSTRACT: A multimodal approach combining acoustics, intelligibility ratings, articulography, and surface electromyography was used to examine the characteristics of dysarthria due to cerebral palsy (CP). CV syllables were studied by obtaining the slope of the F2 transition during the diphthong, tongue-jaw kinematics during the release of the onset consonant, and the related submental muscle activities, and relating these measures to speech intelligibility. The results show that larger reductions of F2 slope are correlated with lower intelligibility in CP-related dysarthria. Among the three speakers with CP, the speaker with the lowest F2 slope and intelligibility showed the smallest tongue release movement and the largest jaw opening movement. The other two speakers with CP were comparable in the amplitude and velocity of tongue movements, but one speaker had abnormally prolonged jaw movement. The tongue-jaw coordination pattern found in the speakers with CP could be either compensatory or subject to an incompletely developed oromotor control system.
ABSTRACT: A video's soundtrack is usually highly correlated to its content. Hence, audio-based techniques have recently emerged as a means for video concept detection complementary to visual analysis. Most state-of-the-art approaches rely on manual definition of predefined sound concepts such as "engine sounds" or "outdoor/indoor sounds." These approaches come with three major drawbacks: manual definitions do not scale as they are highly domain-dependent, manual definitions are highly subjective with respect to annotators, and a large part of the audio content is omitted since the predefined concepts are usually found only in a fraction of the soundtrack. This paper explores how unsupervised audio segmentation systems like speaker diarization can be adapted to automatically identify low-level sound concepts similar to annotator-defined concepts and how these concepts can be used for audio indexing. Speaker diarization systems are designed to answer the question "who spoke when?" by finding segments in an audio stream that exhibit similar properties in feature space, i.e., sound similar. Using a diarization system, all the content of an audio file is analyzed and similar sounds are clustered. This article provides an in-depth analysis of the statistical properties of similar acoustic segments identified by the diarization system in a predefined document set and the theoretical fitness of this approach to discern one document class from another. It also discusses how diarization can be tuned in order to better reflect the acoustic properties of general sounds as opposed to speech and introduces a proof-of-concept system for multimedia event classification working with diarization-based indexing.
ABSTRACT: In this work, we break the real-time barrier of human audition by producing rapidly searchable visualizations of the audio signal. We propose a saliency-maximized audio spectrogram as a visual representation that enables fast detection of audio events by a human analyst. This representation minimizes the time needed to examine a particular audio segment by embedding the information of the target events into visually salient patterns. In particular, we find a visualization function that transforms the original mixed spectrogram to maximize the mutual information between the label sequence of target events and the estimated visual saliency of the spectrogram features. Subject experiments using our human acoustic event detection software show that the saliency-maximized spectrogram significantly outperforms the original spectrogram in a 1/10-real-time acoustic event detection task.
Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on; 05/2012. DOI:10.1109/ICASSP.2012.6288368
ABSTRACT: This paper presents a complete framework for articulatory inversion based on jump Markov linear systems (JMLS). In the model, the acoustic measurements and the position of each articulator are considered as the observable measurement and the continuous-valued hidden state of the system, respectively, and discrete regimes of the system are represented by the use of a discrete-valued hidden modal state. Articulatory inversion based on JMLS involves learning the model parameter set of the system and making inference about the state (position of each articulator) of the system using acoustic measurements. Iterative learning algorithms based on maximum-likelihood (ML) and maximum a posteriori (MAP) criteria are proposed to learn the model parameter set of the JMLS. It is shown that the learning procedure of the JMLS is a generalized version of hidden Markov model (HMM) training when both acoustic and articulatory data are given. In this paper, it is shown that the MAP-based learning algorithm improves the modeling performance of the system and gives significantly better results compared to ML. The inference stage of the proposed algorithm is based on an interacting multiple models (IMM) approach and is performed online (filtering) and/or offline (smoothing). Formulas are provided for IMM-based JMLS smoothing. It is shown that smoothing significantly improves the performance of articulatory inversion compared to filtering. Several experiments are conducted with the MOCHA database to show the performance of the proposed method. Comparison of the performance of the proposed method with those given in the literature shows that the proposed method improves the performance of state space approaches, making them comparable to the best published results.
IEEE Transactions on Audio Speech and Language Processing 02/2012; 20(1):67-81. DOI:10.1109/TASL.2011.2157496
ABSTRACT: We consider the problem of learning a linear transformation of acoustic feature vectors for phonetic frame classification, in a setting where articulatory measurements are available at training time. We use the acoustic and articulatory data together in a multi-view learning approach, in particular using canonical correlation analysis to learn linear transformations of the acoustic features that are maximally correlated with the articulatory data. We also investigate simple approaches for combining information shared across the acoustic and articulatory views with information that is private to the acoustic view. We apply these methods to phonetic frame classification on data drawn from the University of Wisconsin X-ray Microbeam Database. We find a small but consistent advantage to the multi-view approaches combining shared and private information, compared to the baseline acoustic features or unsupervised dimensionality reduction using principal components analysis.
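The CCA step at the heart of this approach can be sketched with the standard whitening-plus-SVD solution. This is a generic textbook CCA, not the paper's pipeline; the regularizer `reg`, the function name, and the data layout (rows = frames) are assumptions.

```python
import numpy as np

def cca(X, Y, k, reg=1e-6):
    """Canonical correlation analysis: find k pairs of projection
    directions for X (e.g. acoustic) and Y (e.g. articulatory) whose
    images are maximally correlated. At test time only Wx is needed,
    since articulatory data is available only during training."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0] - 1
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])   # regularized covariances
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    # Whiten both views, then SVD of the whitened cross-covariance:
    # T = Lx^{-1} Cxy Ly^{-T}; singular values = canonical correlations.
    T = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    U, s, Vt = np.linalg.svd(T)
    Wx = np.linalg.solve(Lx.T, U[:, :k])
    Wy = np.linalg.solve(Ly.T, Vt[:k].T)
    return Wx, Wy, s[:k]
```

Projecting acoustic frames with `Wx` yields the "shared" features; the paper's combined systems would then concatenate such features with the original (or PCA-reduced) acoustic features carrying the view-private information.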