About
386 Publications
28,207 Reads
6,665 Citations
Publications (386)
A review of techniques to improve distorted speech is presented, noting the strengths and weaknesses of common methods. Speech signals are discussed from the point of view of which features should be preserved to retain both naturalness and intelligibility. Enhancement methods range from classical spectral subtraction and Wiener filtering to recent...
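Classical spectral subtraction, named above, is simple enough to sketch. Below is a minimal NumPy illustration; the frame length, overlap, spectral floor, and the oracle noise-PSD input are illustrative assumptions, not details taken from the review:

```python
import numpy as np

def spectral_subtraction(noisy, noise_psd, frame=256, hop=128, floor=0.01):
    """Magnitude spectral subtraction, minimal sketch.

    noisy     : 1-D noisy speech signal
    noise_psd : estimated noise power spectrum, length frame//2 + 1
    floor     : spectral floor that limits musical-noise artifacts
    """
    window = np.hanning(frame)
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame, hop):
        seg = noisy[start:start + frame] * window
        spec = np.fft.rfft(seg)
        power = np.abs(spec) ** 2
        # Subtract the noise estimate, keeping a small spectral floor.
        clean_power = np.maximum(power - noise_psd, floor * power)
        # Reuse the noisy phase; only the magnitude is modified.
        clean = np.sqrt(clean_power) * np.exp(1j * np.angle(spec))
        out[start:start + frame] += np.fft.irfft(clean) * window
    return out
```

The spectral floor is the usual guard against negative power estimates; without it, isolated bins flip on and off between frames and produce the musical noise discussed later in this list.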
Speech is the primary way via which most humans communicate. Computers facilitate this transfer of information, especially when people interact with databases. While some methods to manipulate and interpret speech date back many decades (e.g., Fourier analysis), other processing techniques were developed late last century (e.g., linear predictive c...
Speech is the most common form of human communication, and many conversations use digital communication links. For efficient transmission, acoustic speech waveforms are usually converted to digital form, with reduced bit rates, while maintaining decoded speech quality. This paper reviews the history of speech coding techniques, from early mu-law lo...
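The early mu-law companding that this history begins with is a one-line nonlinearity; a small sketch with mu = 255 (the value used in North American G.711; the function names are ours):

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """mu-law companding of a signal in [-1, 1]: fine quantization near
    zero, coarse at large amplitudes, matching speech statistics."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_decode(y, mu=255):
    """Exact inverse of mu_law_encode."""
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu
```

In a real codec the companded value is then uniformly quantized to 8 bits; the compressor is what lets 8 bits cover the dynamic range that linear PCM would need roughly 13 bits for.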
A review of techniques to identify speakers from their voices is presented, noting strengths and weaknesses of various methods. Similar acoustic analysis has been often used for both speech and speaker recognition, despite the two tasks being quite different. Speaker biometrics from voice is far more indirect and subtle than the estimation of phone...
Automatic Speech Recognition (ASR) applications have increased greatly during the last decade due to the emergence of new devices and home automation hardware that can benefit greatly from allowing users to interact hands free, such as smart watches, earbuds, portable translators, and home assistants. ASR implementation for these applications inevi...
Intra-speaker variability, caused by emotional speech, is a real threat to the performance of speaker recognition systems. In fact, as human beings, we are constantly changing our emotional state. While many efforts have been made to increase automatic speaker verification (ASV) robustness towards channel effects or spoofing attacks, only a handful...
Spoofing attacks have been acknowledged as a serious threat to automatic speaker verification (ASV) systems. In this paper, we are specifically concerned with replay attack scenarios. As a countermeasure to the problem, we propose a front-end based on the blind estimation of the channel response magnitude and as a back-end a residual neural network...
Output-based instrumental speech quality assessment relies only on the received (processed) signal to predict quality. Such methods are called non-intrusive and are crucial in speech applications where reference clean signals are not accessible. In this paper, we propose a new non-intrusive instrumental quality measure based on the similarity betwe...
Infants are difficult to understand as they cannot communicate their requirements. This motivates us to decode their language in meaningful interpretations so that adults can understand the requirements of their children. In this chapter, the cry analysis techniques used so far are discussed and some experiments in this direction are reported. Spec...
The i-vector framework has been widely used to summarize speaker-dependent information present in a speech signal. Considered the state-of-the-art in speaker verification for many years, its potential to estimate speech recording distortion/quality has been overlooked. This paper is an attempt to fill this gap. We conduct a detailed analysis of how...
Approximately one-fifth of the world’s population suffers or has suffered from voice and speech production disorders due to disease or some other dysfunction. Thus, there is a clear need for objective ways to evaluate the quality of voice and speech as well as its link to vocal fold activity, to evaluate the complex interaction between the larynx a...
This paper provides an overview of recent approaches to deep learning as applied to speech processing tasks, primarily for automatic speech recognition, but also text-to-speech and speaker, language and emotion recognition. The focus is on efficient methods, addressing issues of accuracy, computation, storage, and delay. The discussion puts the spe...
In this paper, we introduce a document-specific context probabilistic latent semantic analysis (DCPLSA) model for speech recognition. This is an extension of a CPLSA model [1] where the probability of word is conditioned only on topics. The CPLSA model uses the bigram counts that are the number of appearances of the bigrams in the corpus. These cou...
In this paper, we present robust feature extractors that incorporate a regularized minimum variance distortionless response (RMVDR) spectrum estimator instead of the discrete Fourier transform-based direct spectrum estimator, used in many front-ends including the conventional MFCC, to estimate the speech power spectrum. Direct spectrum estimators,...
In this paper, we propose an algorithm to improve the performance of speaker identification systems. A baseline speaker identification system uses a scoring of a test utterance against all speakers' models; this could be termed as an evaluation at the observation level. In the proposed approach, and prior to the standard evaluation phase, an algori...
We propose a document-based Dirichlet class language model (DDCLM) for speech recognition using document-based n-gram events. In this model, the class is conditioned on the immediate history context and the document, and the word is conditioned on the class and the document in the original DCLM model [1]. In the DCLM model, the class informatio...
In this paper we study the performance of emotion recognition from cochlear implant-like spectrally reduced speech (SRS) using the conventional Mel-frequency cepstral coefficients and a Gaussian mixture model (GMM)-based classifier. Cochlear-implant-like SRS of each utterance fro...
This paper investigates the robustness of the warped discrete Fourier transform (WDFT)-based cepstral features for continuous speech recognition under clean and multistyle training conditions. In the MFCC and PLP front-ends, in order to approximate the nonlinear characteristics of the human auditory system in frequency, the speech spectrum is warpe...
In this paper, we introduce a novel topic n-gram count language model (NTNCLM) using topic probabilities of training documents and document-based n-gram counts. The topic probabilities for the documents are computed by averaging the topic probabilities of words seen in the documents. The topic probabilities of documents are multiplied by the docume...
This paper presents robust feature extractors for a continuous speech recognition task in matched and mismatched environments. The mismatched conditions may occur due to additive noise, different channel, and acoustic reverberation. In the conventional Mel-frequency cepstral coefficient (MFCC) feature extraction framework, a subband spectrum enhanc...
While considerable work has been done to characterize the detrimental effects of channel variability on automatic speaker verification (ASV) performance, little attention has been paid to the effects of room reverberation. This paper investigates the effects of room acoustics on the performance of two far-field ASV systems: GMM-UBM (Gaussian mixtur...
Studies of dysarthric speech rhythm have explored the possibility of distinguishing healthy speakers from dysarthric ones. These studies also allowed the detection of different types of dysarthria. The present paper aims at assessing the ability of rhythm metrics to perceive dysarthric severity levels. The study reports on the results of a statisti...
In this paper we introduce a robust feature extractor, dubbed as robust compressive gammachirp filterbank cepstral coefficients (RCGCC), based on an asymmetric and level-dependent compressive gammachirp filterbank and a sigmoid shape weighting rule for the enhancement of speech spectra in the auditory domain. The goal of this work is to improve the...
In this paper, we introduce a novel method of smoothing language models (LM) based on the semantic information found in ontologies that is especially adapted for limited-resources language modeling. We exploit the latent knowledge of language that is deeply encoded within ontologies. As such, this work examines the potential of using the semantic a...
In this paper, we present unsupervised language model (LM) adaptation approaches using latent Dirichlet allocation (LDA) and latent semantic marginals (LSM). The LSM is the unigram probability distribution over words that are calculated using LDA-adapted unigram models. The LDA model is used to extract topic information from a training corpus in an...
This work presents a noise spectrum estimator based on the Gaussian mixture model (GMM)-based speech presence probability (SPP) for robust speech recognition. Estimated noise spectrum is then used to compute a subband a posteriori signal-to-noise ratio (SNR). A sigmoid shape weighting rule is formed based on this subband a posteriori SNR to enhance...
In this paper, we investigate low-variance multitaper spectrum estimation methods to compute the mel-frequency cepstral coefficient (MFCC) features for robust speech and speaker recognition systems. In speech and speaker recognition, MFCC features are usually computed from a single-tapered (e.g., Hamming window) direct spectrum estimate, that is, t...
This paper presents regularized minimum variance distortion-less response (MVDR)-based cepstral features for robust continuous speech recognition. The mel-frequency cepstral coefficient (MFCC) features, widely used in speech recognition tasks, are usually computed from a direct spectrum estimate, that is, the squared magnitude of the discrete Fouri...
We propose a novel context-based probabilistic latent semantic analysis (PLSA) language model for speech recognition. In this model, the topic is conditioned on the immediate history context and the document in the original PLSA model. This allows computing all the possible bigram probabilities of the seen history context using the model. It proper...
Subjective speech quality assessment depends on listener “quality” opinions after hearing a particular test speech stimulus. Subjective scores are given based on a perception and quality judgment process that is unique to a particular listener. These processes are postulated to be dependent on the listener's internal reference of what good and bad...
In this paper, automatic speaker verification and gender detection using whispered speech is explored. Whispered speech, despite its reduced perceptibility, has been shown to convey relevant speaker identity and gender information. This study compares the performance of a GMM-UBM speaker verification system trained with normal and whispered speech...
We propose a language modeling (LM) approach using background n-grams and interpolated distanced n-grams for speech recognition using an enhanced probabilistic latent semantic analysis (EPLSA) derivation. PLSA is a bag-of-words model that exploits the topic information at the document level, which is inconsistent for the language modeling in speech...
We propose a language modeling (LM) approach using interpolated distanced n-grams into a latent Dirichlet language model (LDLM) [1] for speech recognition. The LDLM relaxes the bag-of-words assumption and document topic extraction of latent Dirichlet allocation (LDA). It uses default background n-grams where topic information is extracted from the...
The goal of speech emotion recognition (SER) is to identify the emotional or physical state of a human being from his or her voice. One of the most important things in a SER task is to extract and select relevant speech features with which most emotions could be recognized. In this paper, we present a smoothed nonlinear energy operator (SNEO)-based...
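The smoothed nonlinear energy operator builds on the Teager energy operator psi[n] = x[n]^2 - x[n-1]*x[n+1], followed by short-window smoothing. A toy sketch (the window choice and length are arbitrary here, not the paper's exact formulation):

```python
import numpy as np

def sneo(x, win=7):
    """Smoothed nonlinear energy operator: Teager energy followed by a
    normalized short smoothing window."""
    psi = np.empty_like(x, dtype=float)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    psi[0], psi[-1] = psi[1], psi[-2]          # replicate the edges
    w = np.hanning(win)
    return np.convolve(psi, w / w.sum(), mode="same")
```

A useful property: for a pure tone A*cos(w*n) the operator returns the constant A^2 * sin(w)^2, so it tracks both amplitude and frequency of an AM-FM signal in a single real-valued output.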
In this paper we present a robust feature extractor that includes the use of a smoothed nonlinear energy operator (SNEO)-based amplitude modulation features for a large vocabulary continuous speech recognition (LVCSR) task. SNEO estimates the energy required to produce the AM-FM signal, and then the estimated energy is separated into its amplitude...
The articles in this special issue are devoted to the topic of speech information processing: key technological and systems theories, and their applications.
As a pattern recognition application, automatic speech recognition (ASR) requires the extraction of useful features from its input signal, speech. To help determine relevance, human speech production and acoustic aspects of speech perception are reviewed, to identify acoustic elements likely to be most important for ASR. Common methods of estimatin...
MFCC (Mel-frequency cepstral coefficients) and PLP (perceptual linear prediction) coefficients or RASTA-PLP have demonstrated good results, whether used in combination with prosodic features as suprasegmental (long-term) information or used stand-alone as segmental (short-time) information. MFCC and PLP feature parameterization ai...
This paper presents a novel short-time frequency analysis algorithm, namely Instantaneous Harmonic Analysis (IHA), which can be used in Multiple Fundamental Frequency Estimation. Given a set of reference pitches, the objective of the algorithm is to transform the real-valued time-domain audio signal into a set of complex time-domain signals in such...
This paper presents a novel feature extractor for robust large vocabulary continuous speech recognition (LVCSR) task. For accurate and robust estimation of speech power spectrum we propose to compute the features from the regularized minimum variance distortionless response (regMVDR) spectral estimate instead of the windowed periodogram estimate. A...
In this paper we study the performance of the low-variance multi-taper Mel-frequency cepstral coefficient (MFCC) and perceptual linear prediction (PLP) features in a state-of-the-art i-vector speaker verification system. The MFCC and PLP features are usually computed from a Hamming-windowed periodogram spectrum estimate. Such a single-tapered spect...
This paper reports the results of acoustic investigation based on rhythmic classifications of speech from duration measurements carried out to distinguish dysarthric speech from healthy speech. The Nemours database of American dysarthric speakers is used throughout experiments conducted for this study. The speakers are eleven young adult males with...
Introduction
Acoustic analysis for robust speaker recognition
Distributed speaker recognition through UBM–GMM models
Performance evaluation of DSIDV
Conclusion
Bibliography
In this paper, we present two robust feature extractors that use a regularized minimum variance distortionless response (RMVDR) spectrum estimator instead of the discrete Fourier transform-based direct spectrum estimator, used in many front-ends including the conventional MFCC, for estimating the speech power spectrum. Direct spectrum estimators, e...
Accuracy of speaker verification is high under controlled conditions but falls off rapidly in the presence of interfering sounds. This is because spectral features, such as Mel-frequency cepstral coefficients (MFCCs), are sensitive to additive noise. MFCCs are a particular realization of warped-frequency representation with low-frequency focus. But...
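The low-frequency focus of MFCCs comes from the mel warping. One common closed form of the warp (the 2595*log10(1 + f/700) variant; others exist) can be sketched as:

```python
import numpy as np

def hz_to_mel(f):
    """Mel-scale warping: roughly linear below ~1 kHz, logarithmic above,
    so more filterbank channels land at low frequencies."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse warping, used to place triangular filter edges in Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

Triangular filters spaced uniformly on the mel axis therefore become progressively wider in Hz, which is exactly the low-frequency emphasis the abstract refers to.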
We introduce novel language model (LM) adaptation approaches using the latent Dirichlet allocation (LDA) model. Observed n-grams in the training set are assigned to topics using soft and hard clustering. In soft clustering, each n-gram is assigned to topics such that the total count of that n-gram for all topics is equal to the global count of that...
When Speech and Audio Signal Processing was published in 1999, it stood out from its competition in its breadth of coverage and its accessible, intuition-based style. This book was aimed at individual students and engineers excited about the broad span of audio processing and curious to understand the available techniques. Since then, with the advent o...
Although Hidden Markov Models (HMMs) are still the mainstream approach to speech recognition, their intrinsic limitations, such as the use of first-order Markov models or the assumption of independent and identically distributed frames, lead to the extensive use of higher-level linguistic information to produce satisfactory results. Therefore, researchers beg...
This paper presents a recognition engine especially tailored to the French language spoken in the Canadian province of New Brunswick. It studies a global monophone model that handles the linguistic variability found in the province. The study also explores the impact of speaker locality on recognition rate when using the global model. Three models...
This paper presents a soft noise compensation algorithm in the feature space to improve the noise robustness of HMM-based on-line automatic speech recognition (ASR) in unknown highly non-stationary acoustic environments. Current hard computing techniques fail to track and compensate the non-stationary noises properly in previously unseen acoustic e...
This paper presents improvements in a dialogue interpreter sub-system for an application that allows the user to interact by speech with a Radio-Frequency IDentification (RFID) network working in a highly noisy environment. A new dialog framework is proposed in order to give the human operators the ability to communicate with the system in a more n...
This paper presents a noise tracking and estimation algorithm for highly non-stationary noises using the Bayesian on-line spectral change point detection (BOSCPD) technique. In BOSCPD, the local minima search window update technique of minima controlled recursive averaging (MCRA) algorithm is made a function of spectral change point detection. The...
Current automatic speech recognition (ASR) works in off-line mode and needs prior knowledge of the stationary or quasi-stationary test conditions for expected word recognition accuracy. These requirements limit the application of ASR for real-world applications where test conditions are highly non-stationary and are not known a priori. This paper p...
This paper proposes an efficient codebook design for tree-structured vector quantization (TSVQ) that is embedded in nature. We modify two speech coding standards by replacing their original quantizers for line spectral frequencies (LSF’s) and/or Fourier magnitudes quantization with TSVQ-based quantizers. The modified coders are fine-granular bit-ra...
We introduce an unsupervised language model (LM) adaptation approach using latent Dirichlet allocation (LDA) and latent semantic marginals (LSM). LSM is a unigram probability distribution over words and is estimated using the LDA model. A hard-clustering method is used to form topics. Each document is assigned to a topic based on the maximum number...
This paper presents asymmetric taper (or window)-based robust Mel frequency cepstral coefficient (MFCC) feature extraction for automatic speech recognition (ASR). Commonly, MFCC features are computed from a symmetric Hamming-tapered direct-spectrum estimate. Symmetric tapers have linear phase and also imply longer time delay. In ASR systems, phase...
The goal of this work is to improve the robustness of speech recognition systems in additive noise and real-time reverberant environments. In this paper we present a compressive gammachirp filter-bank-based feature extractor that incorporates a method for the enhancement of auditory spectrum and a shorttime feature normalization technique, which, b...
Web-based learning is rapidly becoming the preferred way to quickly, efficiently, and economically create and deliver training or educational content through various communication media. This chapter presents systems that use speech technology to emulate the one-on-one interaction a student can get from a virtual instructor. A Web-based learning to...
This paper studies the low-variance multi-taper mel-frequency cepstral coefficient (MFCC) features in the state-of-the-art speaker verification. The MFCC features are usually computed using a Hamming-windowed DFT spectrum. Windowing reduces the bias of the spectrum but variance remains high. Recently, low-variance multi-taper MFCC features were stu...
In this paper we study low-variance multi-taper spectrum estimation methods to compute the mel-frequency cepstral coefficient (MFCC) features for robust speech recognition. In speech recognition, MFCC features are usually computed from a Hamming-windowed DFT spectrum. Although windowing helps in reducing the bias of the spectrum, variance remai...
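Averaging several orthogonally tapered periodograms is what lowers the variance relative to a single Hamming window. A minimal sketch using sine tapers (the taper family and count are illustrative; the papers' exact tapers may differ):

```python
import numpy as np

def multitaper_psd(frame, num_tapers=6):
    """Low-variance spectrum estimate: average the periodograms obtained
    from a set of orthogonal sine tapers over one analysis frame."""
    n = len(frame)
    t = np.arange(1, n + 1)
    psd = np.zeros(n // 2 + 1)
    for k in range(1, num_tapers + 1):
        # k-th sine taper: orthogonal to the others, unit energy.
        taper = np.sqrt(2.0 / (n + 1)) * np.sin(np.pi * k * t / (n + 1))
        psd += np.abs(np.fft.rfft(frame * taper)) ** 2
    return psd / num_tapers
```

Because the tapered periodograms are approximately uncorrelated, averaging K of them cuts the estimate's variance by roughly a factor of K, at the cost of a slightly wider spectral mainlobe.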
This paper reports the results of a comparative study on blind speech separation (BSS) of two types of convolutive mixtures. The separation criterion is based on Frequency Oriented Principal Components Analysis (FOPCA). This method is compared to two other well-known methods: the Degenerate Unmixing Evaluation Technique (DUET) and Convolutive Fast...
This paper presents an innovative rapid adaptation technique for tracking highly non-stationary acoustic noises. The novelty of this technique is that it can detect the acoustic change points from the spectral characteristics of the observed speech signal in rapidly changing non-stationary acoustic environments. The proposed innovative noise tracki...
A new blind speech separation (BSS) method of convolutive mixtures is presented. This method uses a sample-by-sample algorithm to perform the subband decomposition by mimicking the processing performed by the human ear. The unknown source signals are separated by maximizing the entropy of a transformed set of signal mixtures through the use of a gr...
This paper presents a system that allows the user to interact by speech with a Radio-Frequency IDentification (RFID) network working in a highly noisy environment. A new dialog framework is proposed in order to give the human operators the ability to communicate with the system in a more natural fashion. This is achieved by the implementation of th...
In this paper, we introduce the weighting of topic models in mixture language model adaptation using n-grams of the topic models. Topic clusters are formed by using a hard-clustering method assigning one topic to one document based on the maximum number of words chosen from a topic for that document in Latent Dirichlet Allocation (LDA) analysis. Th...
A major drawback of many speech enhancement methods in speech applications is the generation of an annoying residual noise with musical character. Although the Wiener filter introduces less musical noise than spectral subtraction methods, such noise, however, exists and is perceptually annoying to the listener. A potential solution to this artifact...
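The Wiener gain's smoother behaviour, together with the decision-directed a priori SNR estimate often paired with it, can be sketched per frequency bin; the parameter alpha and the function interface here are our assumptions, not the paper's method:

```python
import numpy as np

def wiener_gain(noisy_power, noise_psd, prev_clean_power, alpha=0.98):
    """Frequency-domain Wiener gain xi/(1+xi), with a decision-directed
    a priori SNR xi smoothed across frames by alpha."""
    post_snr = noisy_power / np.maximum(noise_psd, 1e-12)
    prio_snr = alpha * prev_clean_power / np.maximum(noise_psd, 1e-12) \
             + (1 - alpha) * np.maximum(post_snr - 1.0, 0.0)
    return prio_snr / (1.0 + prio_snr)
```

With alpha close to 1 the gain varies slowly from frame to frame, which is precisely what suppresses the isolated tonal bursts perceived as musical noise.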
This paper investigates several feature normalization techniques for use in an i-vector speaker verification system based on a mixture probabilistic linear discriminant analysis (PLDA) model. The objective of the feature normalization technique is to compensate for the effects of environmental mismatch. Here, we study short-time Gaussianization (ST...
In this paper, we developed soft computing models for on-line automatic speech recognition (ASR) based on Bayesian on-line inference techniques. Bayesian on-line inference for change point detection (BOCPD) is tested for on-line environmental learning using highly non-stationary noisy speech samples from the Aurora2 speech database. Significant impr...
In this paper we propose a k-NN/SASH phoneme classification algorithm that competes favourably with state-of-the-art methods. We apply a similarity search algorithm (SASH) that has been used successfully for classification of high dimensional texts and images. Unlike other search algorithms, the computational time of SASH is not affected by the dim...
A new approach for computing weights of topic models in language model (LM) adaptation is introduced. We formed topic clusters by a hard-clustering method assigning one topic to one document based on the maximum number of words chosen from a topic for that document in Latent Dirichlet Allocation (LDA) analysis. The new weighting idea is that the un...
This paper deals with blind speech separation of convolutive mixtures of sources. The separation criterion is based on Oriented Principal Components Analysis (OPCA) in the frequency domain. OPCA is a (second order) extension of standard Principal Component Analysis (PCA) aiming at maximizing the power ratio of a pair of signals. The convolutive mix...
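Maximizing the power ratio of a signal pair, as OPCA does, is a generalized symmetric eigenproblem. A sketch of extracting the dominant direction (the matrix names and the Cholesky-whitening route are our illustration, not the paper's implementation):

```python
import numpy as np

def opca_direction(A, B):
    """Direction w maximizing the power ratio (w'Aw)/(w'Bw), i.e. the top
    generalized eigenvector of (A, B), via Cholesky whitening of B."""
    L = np.linalg.cholesky(B)            # B = L L'
    Linv = np.linalg.inv(L)
    M = Linv @ A @ Linv.T                # standard symmetric problem
    vals, vecs = np.linalg.eigh(M)       # eigenvalues in ascending order
    w = Linv.T @ vecs[:, -1]             # map back from whitened space
    return w / np.linalg.norm(w)
```

Setting A and B to the covariance matrices of the two frequency-domain mixture signals gives the per-bin separating direction; standard PCA is recovered when B is the identity.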
In this paper, we propose a segment-based non-parametric method of monophone recognition. We pre-segment the speech utterance into its underlying phonemes using a group-delay-based algorithm. Then, we apply the k-NN/SASH phoneme classification technique to classify the hypothesized phonemes. Since phoneme boundaries are already known during the dec...
Contains research objectives and summary of research on fourteen research projects and reports on four research projects.
In this paper we propose an algorithm for estimating noise in highly non-stationary noisy environments, which is a challenging problem in speech enhancement. This method is based on minima-controlled recursive averaging (MCRA) whereby an accurate, robust and efficient noise power spectrum estimation is demonstrated. We propose a two-stage technique...
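The core recursive-averaging idea in MCRA (update the noise estimate mainly where speech is absent) reduces, in simplified form, to a speech-presence-weighted smoothing factor. A toy sketch that omits the minima-tracking stage used to obtain the speech-presence probability:

```python
import numpy as np

def update_noise_psd(noise_psd, frame_psd, speech_prob, alpha=0.95):
    """One recursive-averaging update of the noise spectrum estimate.
    speech_prob in [0, 1]: 1 freezes the estimate, 0 smooths normally."""
    a = alpha + (1 - alpha) * speech_prob   # time-varying smoothing factor
    return a * noise_psd + (1 - a) * frame_psd
```

When speech is confidently present (speech_prob near 1) the effective smoothing factor reaches 1 and the noise estimate is left untouched, so speech energy never leaks into it.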
In this paper we report the results of a comparative study on blind speech signal separation approaches. Three algorithms, Oriented Principal Component Analysis (OPCA), High Order Statistics (HOS), and Fast Independent Component Analysis (Fast-ICA), are objectively compared in terms of signal-to-interference ratio criteria. The results of experimen...
In this paper, we investigated and simulated the frame recursive dynamic mean bias removing technique in the cepstral domain with a time smoothing parameter in order to improve the robustness of automatic speech recognition (ASR) in real-time environments. The objective of this simulation was to examine the suitability of the frame recursive cepstra...
This paper presents the simulation results of a speaker identification and verification (SIDV) system that would be efficient for resource limited mobile devices. The proposed system works as a text-independent system within the distributed speech recognition (DSR) framework and is designed to identify a target speaker or imposter using short digit...
This paper proposes an efficient codebook design for tree-structured vector quantization (TSVQ), which is embedded in nature. We modify two speech coding standards by replacing their original quantizers for line spectral frequencies (LSF's) and/or Fourier magnitudes quantization with TSVQ-based quantizers. The modified coders are fine-granular bit-...
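The embedded nature of TSVQ comes from its greedy root-to-leaf search: every prefix of the bit path is itself a coarser codeword, so the bitstream can be truncated at any depth. A toy sketch with a hypothetical dict-based tree (not the modified coders' actual codebooks):

```python
import numpy as np

def tsvq_encode(x, node, path=""):
    """Greedy tree-structured VQ: at each internal node descend toward the
    closer child centroid; returns the bit path and the final codeword."""
    if "children" not in node:
        return path, node["centroid"]
    left, right = node["children"]
    d_left = np.linalg.norm(x - left["centroid"])
    d_right = np.linalg.norm(x - right["centroid"])
    child, bit = (left, "0") if d_left <= d_right else (right, "1")
    return tsvq_encode(x, child, path + bit)
```

The search visits only one branch per level, so encoding a depth-d tree costs O(d) distance computations instead of the O(2^d) of a full codebook search, which is the main attraction for low-complexity coders.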