[Show abstract][Hide abstract] ABSTRACT: Aspects of speech production have provided inspiration for ideas in speech technologies throughout the history of speech processing research. This special issue was inspired by the 2013 Workshop on Speech Production in Automatic Speech Recognition in Lyon, France, and this introduction provides an overview of the included papers in the context of the current research landscape.
Full-text · Article · Mar 2016 · Computer Speech & Language
[Show abstract][Hide abstract] ABSTRACT: Transcribers make mistakes. Workers recruited in a crowdsourcing marketplace, because of their varying levels of commitment and education, make more mistakes than workers in a controlled laboratory setting. Methods for compensating transcriber mistakes are desirable because, with such methods available, crowdsourcing has the potential to significantly increase the scale of experiments in laboratory phonology. This paper provides a brief tutorial on statistical learning theory, introducing the relationship between dataset size and estimation error, then presents a theoretical description and preliminary results for two new methods that control labeler error in laboratory phonology experiments. First, we discuss the method of crowdsourcing over error-correcting codes. In the error-correcting-code method, each difficult labeling task is first factored, by the experimenter, into the product of several easy labeling tasks (typically binary). Factoring increases the total number of tasks; nevertheless, it results in faster completion and higher accuracy, because workers unable to perform the difficult task may be able to meaningfully contribute to the solution of each easy task. Second, we discuss the use of explicit mathematical models of the errors made by a worker in the crowd. In particular, we introduce the method of mismatched crowdsourcing, in which workers transcribe a language they do not understand, and an explicit mathematical model of second-language phoneme perception is used to learn and then compensate their transcription errors. Though introduced as technologies that increase the scale of phonology experiments, both methods have implications beyond increased scale. The method of easy questions permits us to probe the perception, by untrained listeners, of complicated phonological models; examples are provided from the prosody of English and Hindi. The method of mismatched crowdsourcing permits us to probe, in more detail than ever before, the perception of phonetic categories by listeners with a different phonological system.
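The error-correcting-code idea described in this abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration: the four labels, the five-question codebook, the majority vote per question, and nearest-codeword (Hamming) decoding are one common way to use error-correcting output codes, not necessarily the procedure used in the paper.

```python
from collections import Counter

# Hypothetical codebook: each hard label is a codeword, and each position
# is one easy yes/no question put to the crowd. The minimum Hamming
# distance between codewords is 3, so one wrong question can be corrected.
CODEBOOK = {
    "A": (0, 0, 0, 0, 0),
    "B": (0, 1, 1, 0, 1),
    "C": (1, 0, 1, 1, 0),
    "D": (1, 1, 0, 1, 1),
}

def majority_bit(answers):
    """Majority vote over the workers' answers to one binary question."""
    return Counter(answers).most_common(1)[0][0]

def hamming(u, v):
    """Number of positions in which two codewords differ."""
    return sum(a != b for a, b in zip(u, v))

def decode(answers_per_question):
    """Vote on each easy question, then pick the nearest codeword."""
    bits = tuple(majority_bit(a) for a in answers_per_question)
    return min(CODEBOOK, key=lambda label: hamming(CODEBOOK[label], bits))

# Three workers per question; the true label is "C" and the crowd gets
# the third question wrong by majority.
crowd_answers = [[1, 1, 0], [0, 0, 0], [0, 0, 1], [1, 1, 1], [0, 1, 0]]
print(decode(crowd_answers))
```

The design point is that no worker ever answers the hard four-way question directly; redundancy in the codebook absorbs a wrong answer on an individual easy question.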
[Show abstract][Hide abstract] ABSTRACT: Our goal is to improve understanding and use of numeric information (e.g., clinical test results) provided through portals to Electronic Health Record (EHR) systems by older adults with diverse numeracy and risk literacy abilities. We help older adults understand this information by emulating in portal environments best practices from face-to-face communication. To do this, we are developing a computer-based agent (CA) that will use nonverbal cues (e.g. voice intonation, facial expressions) as well as words to convey affective and cognitive meaning of the numbers and improve patient comprehension of the clinical information. The present paper describes a pilot study designed to evaluate the appropriateness and effectiveness of audio-video messages of a physician delivering clinical test results. These messages will serve as a template for the development of the CA. Older adult pilot participants generally understood the gist of the test results presented in the video messages. Participants’ affective responses to the messages were appropriate to the message’s level of risk: as the level of risk associated with the test results increased, positive affect decreased and negative affect increased. In addition, participants also thought the physician’s delivery matched the message content, and they thought that the messages were informative. These findings will be leveraged to finalize the materials for the primary study in which the impact of video and CA-based messages on patient comprehension of numeric information will be evaluated relative to standard formats used in patient portals.
[Show abstract][Hide abstract] ABSTRACT: We are interested in a multichannel transient acoustic signal classification task which suffers from additive/convolutional noise corruption. To address this problem, we propose a double-scheme classifier that takes advantage of multichannel data to improve noise robustness. Both schemes adopt task-driven dictionary learning as the basic framework, and exploit multichannel data at different levels: scheme 1 imposes a joint sparsity constraint while learning the dictionary and classifier; scheme 2 adopts beamforming at the signal formation level. In addition, a matched filter and robust cepstral coefficients are applied to improve the noise robustness of the input feature. Experiments show that the proposed classifier significantly outperforms the baseline algorithms.
[Show abstract][Hide abstract] ABSTRACT: Monaural source separation is important for many real-world applications. It is challenging in that, given that only single-channel information is available, there is an infinite number of solutions without proper constraints. In this paper, we explore joint optimization of masking functions and deep recurrent neural networks for monaural source separation tasks, including the monaural speech separation task, monaural singing voice separation task, and speech denoising task. The joint optimization of the deep recurrent neural networks with an extra masking layer enforces a reconstruction constraint. Moreover, we explore a discriminative training criterion for the neural networks to further enhance the separation performance. We evaluate our proposed system on the TSP, MIR-1K, and TIMIT datasets for speech separation, singing voice separation, and speech denoising tasks, respectively. Our approaches achieve 2.30~4.98 dB SDR gain compared to NMF models in the speech separation task, 2.30~2.48 dB GNSDR gain and 4.32~5.42 dB GSIR gain compared to previous models in the singing voice separation task, and outperform NMF and DNN baselines in the speech
Preview · Article · Feb 2015 · IEEE/ACM Transactions on Audio, Speech, and Language Processing
[Show abstract][Hide abstract] ABSTRACT: Many past studies have been conducted on speech/music discrimination due to the potential applications for broadcast and other media; however, it remains possible to expand the experimental scope to include samples of speech with varying amounts of background music. This paper focuses on the development and evaluation of two measures of the ratio between speech energy and music energy: a reference measure called speech-to-music ratio (SMR), which is known objectively only prior to mixing, and a feature called the stereo-input mix-to-peripheral level feature (SIMPL), which is computed from the stereo mixed signal as an imprecise estimate of SMR. SIMPL is an objective signal measure calculated by taking advantage of broadcast mixing techniques in which vocals are typically placed at stereo center, unlike most instruments. Conversely, SMR is a hidden variable defined by the relationship between the powers of portions of audio attributed to speech and music. It is shown that SIMPL is predictive of SMR and can be combined with state-of-the-art features in order to improve performance. For evaluation, this new metric is applied in speech/music (binary) classification, speech/music/mixed (trinary) classification, and a new speech-to-music ratio estimation problem. Promising results are achieved, including 93.06% accuracy for trinary classification and 3.86 dB RMSE for estimation of the SMR.
No preview · Article · Dec 2014 · IEEE/ACM Transactions on Audio, Speech, and Language Processing
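As a rough illustration of the two quantities in the abstract above, the sketch below computes a speech-to-music ratio in dB from the separate pre-mix signals, plus a crude center-versus-side level ratio from a stereo mix. Both formulas are assumptions: the abstract does not give the exact definition of SMR or of SIMPL, and `mid_to_side_db` is a hypothetical stand-in, not the SIMPL feature itself.

```python
import math

def power(x):
    """Mean squared amplitude of a list of samples."""
    return sum(s * s for s in x) / len(x)

def smr_db(speech, music):
    """Assumed SMR: speech power over music power, in dB, before mixing."""
    return 10.0 * math.log10(power(speech) / power(music))

def mid_to_side_db(left, right, eps=1e-12):
    """Hypothetical proxy: level of the stereo center relative to the sides.

    Relies on the mixing convention described in the abstract, in which
    vocals sit at stereo center while most instruments do not.
    """
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    return 10.0 * math.log10((power(mid) + eps) / (power(side) + eps))

# Toy signals: speech panned to the center, music panned fully to the sides.
speech = [1.0, -1.0, 1.0, -1.0]
music = [0.1, 0.1, -0.1, -0.1]
left = [s + m for s, m in zip(speech, music)]
right = [s - m for s, m in zip(speech, music)]
print(smr_db(speech, music), mid_to_side_db(left, right))
```

In this idealized case the stereo-derived value tracks the true SMR closely; with real mixes it would only be an imprecise estimate, which is how the abstract describes SIMPL.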
[Show abstract][Hide abstract] ABSTRACT: In this paper, we propose a novel saliency-based algorithm to detect foreground regions in highly dynamic scenes. We first convert input video frames to multiple patch-based feature maps. Then, we apply temporal saliency analysis to the pixels of each feature map. For each temporal set of co-located pixels, the feature distance of a point from its kth nearest neighbor is used to compute the temporal saliency. By computing and combining temporal saliency maps of different features, we obtain foreground likelihood maps. A simple segmentation method based on adaptive thresholding is applied to detect the foreground objects. We test our algorithm on image sequences of dynamic scenes, including public datasets and a new challenging wildlife dataset we constructed. The experimental results demonstrate that the proposed algorithm achieves state-of-the-art results.
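The kth-nearest-neighbor step in this abstract can be sketched for a single pixel location. This is a minimal illustration under stated assumptions: features are scalars, the distance is the absolute difference, and the function name `temporal_saliency` is hypothetical; the paper's patch-based features and distance measure are not specified here.

```python
def temporal_saliency(values, k):
    """Saliency of each frame's feature at one pixel location.

    `values` holds the feature of the same (co-located) pixel over time.
    A value is salient if even its kth nearest temporal neighbor is far
    away, i.e. if few other frames look like it.
    """
    saliency = []
    for i, v in enumerate(values):
        dists = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        saliency.append(dists[k - 1])
    return saliency

# A background that flickers between 10 and 12, with one foreground frame.
pixel_over_time = [10, 11, 10, 12, 50, 11]
print(temporal_saliency(pixel_over_time, k=2))
```

Recurring background values have close neighbors in time and receive low scores even when the scene is dynamic, while the rare foreground value stands out.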
[Show abstract][Hide abstract] ABSTRACT: In this paper, we address the problem of object class recognition via observations from actively selected views/modalities/features under limited resource budgets. A Partially Observable Markov Decision Process (POMDP) is employed to find optimal sensing and recognition actions with the goal of long-term classification accuracy. Heterogeneous resource constraints (such as motion, number of measurements and bandwidth) are explicitly modeled in the state variable, and a prohibitively high penalty is used to prevent the violation of any resource constraint. To improve recognition performance, we further incorporate discriminative classification models with POMDP, and customize the reward function and observation model correspondingly. The proposed model is validated on several data sets for multi-view, multi-modal vehicle classification and multi-view face recognition, and demonstrates improvement in both recognition and resource management over greedy methods and previous POMDP formulations.
[Show abstract][Hide abstract] ABSTRACT: Current model-based speech analysis tends to be incomplete: only some of the parameters of interest (e.g., only the pitch or the vocal tract) are modeled, while the rest, which may be equally important, are disregarded. The drawback is that, without joint modeling of correlated parameters, the analysis of speech parameters may be inaccurate or even incorrect. With this motivation, we previously proposed a model called PAT (Probabilistic Acoustic Tube), in which pitch, vocal tract and energy are jointly modeled. This paper proposes an improved version of the PAT model, named PAT2, in which both the signal modeling and the probabilistic modeling are substantially revised. Compared to related work, PAT2 is much more comprehensive, incorporating mixed excitation, glottal wave and phase modeling. Experimental results show its ability to decompose speech into the desired parameters and its potential for speech synthesis.
[Show abstract][Hide abstract] ABSTRACT: Monaural source separation is useful for many real-world applications though it is a challenging problem. In this paper, we study deep learning for monaural speech separation. We propose the joint optimization of the deep learning models (deep neural networks and recurrent neural networks) with an extra masking layer, which enforces a reconstruction constraint. Moreover, we explore a discriminative training criterion for the neural networks to further enhance the separation performance. We evaluate our approaches using the TIMIT speech corpus for a monaural speech separation task. Our proposed models achieve about 3.8~4.9 dB SIR gain compared to NMF models, while maintaining better SDRs and SARs.
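To make the "extra masking layer" and its reconstruction constraint concrete, here is a minimal sketch of a soft time-frequency mask applied to two network outputs. The function name, the use of magnitude values in plain lists, and the small `eps` term are assumptions for illustration; the abstract does not spell out the layer's exact form.

```python
def soft_mask_layer(y1, y2, mix, eps=1e-8):
    """Turn two raw source estimates into masked estimates of the mixture.

    y1, y2: network outputs for source 1 and source 2 (one value per
            time-frequency bin).
    mix:    mixture magnitude in the same bins.
    Each bin of the mixture is split in proportion |y1| : |y2|, so the two
    returned estimates always sum back to the mixture.
    """
    s1, s2 = [], []
    for a, b, z in zip(y1, y2, mix):
        m = abs(a) / (abs(a) + abs(b) + eps)  # soft mask value in [0, 1)
        s1.append(m * z)
        s2.append((1.0 - m) * z)
    return s1, s2

s1, s2 = soft_mask_layer([3.0, 1.0, 0.0], [1.0, 1.0, 2.0], [8.0, 4.0, 6.0])
print(s1, s2)
```

Because the mask is a deterministic function of the network outputs, it can sit at the end of the network and be trained jointly with it, which is the sense in which the abstract speaks of joint optimization.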
[Show abstract][Hide abstract] ABSTRACT: Auditory salience describes how much a particular auditory event attracts human attention. Previous attempts at automatic detection of salient audio events have been hampered by the challenge of defining ground truth. In this paper ground truth for auditory salience is built up from annotations by human subjects of a large corpus of meeting room recordings. Following statistical purification of the data, an optimal auditory salience filter with linear discrimination is derived from the purified data. An automatic auditory salience detector based on optimal filtering of the Bark-frequency loudness performs with 32% equal error rate. Expanding the feature vector to include other common feature sets does not improve performance. Consistent with intuition, the optimal filter looks like an onset detector in the time domain.
No preview · Article · Mar 2014 · Pattern Recognition Letters
[Show abstract][Hide abstract] ABSTRACT: Examining articulatory compensation has been important in understanding how the speech production system is organized, and how it relates to the acoustic and ultimately phonological levels. This paper offers a method that detects articulatory compensation in the acoustic signal, which is based on linear regression modeling of co-variation patterns between acoustic cues. We demonstrate the method on selected acoustic cues for spontaneously produced American English stop consonants. Compensatory patterns of cue variation were observed for voiced stops in some cue pairs, while uniform patterns of cue variation were found for stops as a function of place of articulation or position in the word. Overall, the results suggest that this method can be useful for observing articulatory strategies indirectly from acoustic data and testing hypotheses about the conditions under which articulatory compensation is most likely.
[Show abstract][Hide abstract] ABSTRACT: This paper investigates how prosodic elements such as prominences and prosodic boundaries in Hindi are perceived. We approach this using data from three sources: (i) native speakers of Hindi without any linguistic expertise, (ii) a linguistically trained expert in Hindi prosody, and (iii) classifiers trained on English for automatic prominence and boundary detection. We use speech from a corpus of Hindi narrative speech for our experiments. Our results indicate that non-expert transcribers do not have a consistent notion of prosodic prominences. However, they show considerable agreement regarding the placement of prosodic boundaries. Also, relative to the non-expert transcribers, there is higher agreement between the expert transcriber and the automatically derived labels for prominence (and prosodic boundaries); this suggests the possibility of using classifiers for the automatic prediction of these prosodic events in Hindi.
[Show abstract][Hide abstract] ABSTRACT: The hidden Markov model (HMM) is widely popular as the de facto tool for representing temporal data; in this paper, we add to its utility in the sequence clustering domain - we describe a novel approach that allows us to directly control purity in HMM-based clustering algorithms. We show that encouraging sparsity in the observation probabilities increases cluster purity and derive an algorithm based on lp regularization; as a corollary, we also provide a different and useful interpretation of the value of p in Renyi p-entropy. We test our method on the problem of clustering non-speech audio events from the BBC sound effects corpus. Experimental results confirm that our approach does learn purer clusters, with (unweighted) average purity as high as 0.88 - a considerable improvement over both the baseline HMM (0.72) and k-means clustering (0.69).
[Show abstract][Hide abstract] ABSTRACT: The recently developed deep learning architecture, a kernel version of the deep convex network (K-DCN), is improved to address the scalability problem when the training and testing samples become very large. We have developed a solution based on the use of random Fourier features, which possess the strong theoretical property of approximating the Gaussian kernel while rendering efficient computation in both training and evaluation of the K-DCN with large training samples. We empirically demonstrate that just like the conventional K-DCN exploiting rigorous Gaussian kernels, the use of random Fourier features also enables successful stacking of kernel modules to form a deep architecture. Our evaluation experiments on phone recognition and speech understanding tasks both show the computational efficiency of the K-DCN which makes use of random features. With sufficient depth in the K-DCN, the phone recognition accuracy and slot-filling accuracy are shown to be comparable or slightly higher than the K-DCN with Gaussian kernels while significant computational saving has been achieved.
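The approximation property mentioned in this abstract can be checked with a short sketch of random Fourier features for the Gaussian kernel. This shows only the standard feature construction, with hypothetical names (`make_rff`, `gaussian_kernel`) and arbitrary sizes; it says nothing about how the features are stacked inside the K-DCN.

```python
import math
import random

def make_rff(dim, n_features, sigma, seed=0):
    """Build a random Fourier feature map for a Gaussian kernel of width sigma.

    Frequencies are drawn from N(0, 1/sigma^2) and phases from U[0, 2*pi],
    so that phi(x) . phi(y) approximates exp(-||x - y||^2 / (2 sigma^2)).
    """
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
         for _ in range(n_features)]
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def phi(x):
        return [scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + bi)
                for w, bi in zip(W, b)]
    return phi

def gaussian_kernel(x, y, sigma):
    """Exact Gaussian kernel value, for comparison."""
    d2 = sum((a - c) ** 2 for a, c in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

phi = make_rff(dim=3, n_features=4000, sigma=1.5)
x, y = [0.2, -0.4, 1.0], [0.5, 0.1, 0.3]
approx = sum(p * q for p, q in zip(phi(x), phi(y)))
print(approx, gaussian_kernel(x, y, 1.5))
```

The computational saving comes from replacing an N-by-N kernel matrix over training samples with an explicit N-by-D feature matrix, where D (here 4000) is chosen by the user and the approximation error shrinks as D grows.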
[Show abstract][Hide abstract] ABSTRACT: Browsing large audio archives is challenging because of the limitations of human audition and attention. However, this task becomes easier with a suitable visualization of the audio signal, such as a spectrogram transformed to make unusual audio events salient. This transformation maximizes the mutual information between an isolated event's spectrogram and an estimate of how salient the event appears in its surrounding context. When such spectrograms are computed and displayed with fluid zooming over many temporal orders of magnitude, sparse events in long audio recordings can be detected more quickly and more easily. In particular, in a 1/10-real-time acoustic event detection task, subjects who were shown saliency-maximized rather than conventional spectrograms performed significantly better. Saliency maximization also improves the mutual information between the ground truth of nonbackground sounds and visual saliency, more than other common enhancements to visualization.
Full-text · Article · Oct 2013 · ACM Transactions on Applied Perception
[Show abstract][Hide abstract] ABSTRACT: Speech production errors characteristic of dysarthria are chiefly responsible for the low accuracy of automatic speech recognition (ASR) when used by people diagnosed with it. A person with dysarthria produces speech in a rather reduced acoustic working space, causing typical measures of speech acoustics to have values in ranges very different from those characterizing unimpaired speech. It is unlikely then that models trained on unimpaired speech will be able to adjust to this mismatch when acted on by one of the currently well-studied adaptation algorithms (which make no attempt to address this extent of mismatch in population characteristics). In this work, we propose an interpolation-based technique for obtaining a prior acoustic model from one trained on unimpaired speech, before adapting it to the dysarthric talker. The method computes a ‘background’ model of the dysarthric talker's general speech characteristics and uses it to obtain a more suitable prior model for adaptation (compared to the speaker-independent model trained on unimpaired speech). The approach is tested with a corpus of dysarthric speech acquired by our research group, on speech of sixteen talkers with varying levels of dysarthria severity (as quantified by their intelligibility). This interpolation technique is tested in conjunction with the well-known maximum a posteriori (MAP) adaptation algorithm, and yields improvements of up to 8% absolute and up to 40% relative, over the standard MAP adapted baseline.
No preview · Article · Sep 2013 · Computer Speech & Language
[Show abstract][Hide abstract] ABSTRACT: Speech can be represented as a constellation of constricting vocal tract actions called gestures, whose temporal patterning with respect to one another is expressed in a gestural score. Current speech datasets do not come with gestural annotation and no formal gestural annotation procedure exists at present. This paper describes an iterative analysis-by-synthesis landmark-based time-warping architecture to perform gestural annotation of natural speech. For a given utterance, the Haskins Laboratories Task Dynamics and Application (TADA) model is employed to generate a corresponding prototype gestural score. The gestural score is temporally optimized through an iterative time-warping process such that the acoustic distance between the original and TADA-synthesized speech is minimized. This paper demonstrates that the proposed iterative approach is superior to conventional acoustically-referenced dynamic time-warping procedures and provides reliable gestural annotation for speech datasets.
Full-text · Article · Dec 2012 · The Journal of the Acoustical Society of America
[Show abstract][Hide abstract] ABSTRACT: Acoustic-phonetic landmarks provide robust cues for speech recognition and are relatively invariant between speakers, speaking styles, noise conditions and sampling rates. The ability to detect acoustic-phonetic landmarks as a front-end for speech recognition has been shown to improve recognition accuracy. Biomimetic inter-spike intervals and average signal level have been shown to accurately convey information about acoustic-phonetic landmarks. This paper explores the use of inter-spike interval and average signal level as input features for landmark detectors trained and tested on mismatched conditions. These detectors are designed to serve as a front-end for speech recognition systems. Results indicate that landmark detectors trained using inter-spike intervals and signal level are relatively robust to both additive channel noise and changes in sampling rate. Mismatched conditions — differences in channel noise between training audio and testing audio — are problematic for computer speech recognition systems. Signal enhancement, mismatch-resistant acoustic features, and architectural compensation within the recognizer