Hervé Bourlard
  • Idiap Research Institute

About

482 Publications
48,052 Reads
15,089 Citations

Publications (482)
Preprint
In this work, we investigate if the wav2vec 2.0 self-supervised pretraining helps mitigate the overfitting issues with connectionist temporal classification (CTC) training to reduce its performance gap with flat-start lattice-free MMI (E2E-LFMMI) for automatic speech recognition with limited training data. Towards that objective, we use the pretrai...
Preprint
In this work, we propose lattice-free MMI (LFMMI) for supervised adaptation of a self-supervised pretrained acoustic model. We pretrain a Transformer model on a thousand hours of untranscribed Librispeech data, followed by supervised adaptation with LFMMI on three different datasets. Our results show that, by fine-tuning with LFMMI, we consistently obtain r...
Article
To assist the clinical diagnosis and treatment of speech dysarthria, automatic dysarthric speech detection techniques providing reliable and cost-effective assessment are indispensable. Based on clinical evidence on spectro-temporal distortions associated with dysarthric speech, we propose to automatically discriminate between healthy and dysarthri...
Preprint
Automatic dysarthric speech detection can provide reliable and cost-effective computer-aided tools to assist the clinical diagnosis and management of dysarthria. In this paper we propose a novel automatic dysarthric speech detection approach based on analyses of pairwise distance matrices using convolutional neural networks (CNNs). We represent utt...
Preprint
Automatic techniques in the context of motor speech disorders (MSDs) are typically two-class techniques aiming to discriminate between dysarthria and neurotypical speech or between dysarthria and apraxia of speech (AoS). Further, although such techniques are proposed to support the perceptual assessment of clinicians, the automatic and perceptual c...
Preprint
Full-text available
We present a simple wrapper that is useful to train acoustic models in PyTorch using Kaldi's LF-MMI training framework. The wrapper, called pkwrap (short form of PyTorch Kaldi wrapper), enables the user to utilize the flexibility provided by PyTorch in designing model architectures. It exposes the LF-MMI cost function as an autograd function. Other...
Technical Report
Full-text available
We present a simple wrapper that is useful to train acoustic models in PyTorch using Kaldi's LF-MMI training framework. The wrapper, called pkwrap (short form of PyTorch Kaldi wrapper), enables the user to utilize the flexibility provided by PyTorch in designing model architectures. It exposes the LF-MMI cost function as an autograd function. Other...
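The abstract above describes exposing a sequence-level cost to PyTorch as an autograd function. As a rough illustration of that pattern only (the real pkwrap binds Kaldi's LF-MMI objective and gradient; the quadratic stand-in cost, tensor shapes, and names below are invented for this sketch):

```python
import torch

class SequenceCost(torch.autograd.Function):
    """Stand-in sequence-level cost exposed to PyTorch autograd; in pkwrap
    the forward/backward would call into Kaldi's LF-MMI computation."""

    @staticmethod
    def forward(ctx, log_probs, targets):
        grad = log_probs - targets            # toy gradient of a quadratic cost
        ctx.save_for_backward(grad)
        return 0.5 * (grad ** 2).sum()

    @staticmethod
    def backward(ctx, grad_output):
        (grad,) = ctx.saved_tensors
        return grad_output * grad, None       # no gradient w.r.t. targets

log_probs = torch.randn(4, 100, 40, requires_grad=True)  # (batch, frames, states)
targets = torch.softmax(torch.randn(4, 100, 40), dim=-1)
loss = SequenceCost.apply(log_probs, targets)
loss.backward()                               # gradients flow back into the model
```

Because the cost is an ordinary autograd function, any PyTorch architecture can be trained against it, which is the flexibility the wrapper is after.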
Article
Competitive state-of-the-art automatic pathological speech intelligibility measures typically rely on regression training on a large number of features, require a large amount of healthy speech training data, or are applicable only to phonetically balanced scenarios where healthy and pathological speakers utter the same utterances. As a result, the...
Article
This paper focuses on the problem of query by example spoken term detection (QbE-STD) in a zero-resource scenario. State-of-the-art approaches primarily rely on dynamic time warping (DTW) based template matching techniques using phone posterior or bottleneck features extracted from a deep neural network (DNN). We use both monolingual and multilingual...
Article
We propose an information theoretic framework for quantitative assessment of acoustic models used in hidden Markov model (HMM) based automatic speech recognition (ASR). The HMM backend expects that (i) the acoustic model yields accurate state conditional emission probabilities for the observations at each time step, and (ii) the conditional probabi...
Preprint
Full-text available
This paper focuses on the problem of query by example spoken term detection (QbE-STD) in a zero-resource scenario. State-of-the-art approaches primarily rely on dynamic time warping (DTW) based template matching techniques using phone posterior or bottleneck features extracted from a deep neural network (DNN). We use both monolingual and multilingual...
Preprint
Full-text available
State-of-the-art solutions to query by example spoken term detection (QbE-STD) usually rely on bottleneck feature representation of the query and audio document to perform dynamic time warping (DTW) based template matching. Here, we present a study on QbE-STD performance using several monolingual as well as multilingual bottleneck features extracte...
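As a minimal sketch of the DTW template matching these QbE-STD abstracts refer to (plain full-sequence DTW with cosine distance; practical systems use subsequence variants and tuned path constraints, and the feature dimensions below are arbitrary):

```python
import numpy as np

def dtw_distance(query, doc):
    """DTW between a query and a document segment, both given as per-frame
    feature matrices (frames x dims), using cosine distance locally."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    d = doc / np.linalg.norm(doc, axis=1, keepdims=True)
    cost = 1.0 - q @ d.T                        # pairwise cosine distances
    Q, D = cost.shape
    acc = np.full((Q + 1, D + 1), np.inf)       # accumulated-cost table
    acc[0, 0] = 0.0
    for i in range(1, Q + 1):
        for j in range(1, D + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return acc[Q, D] / (Q + D)                  # length-normalised alignment cost

query = np.random.rand(20, 32)                  # e.g. 20 frames of bottleneck features
doc = np.random.rand(200, 32)
print(dtw_distance(query, doc))
```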
Article
Towards the goal of improving acoustic modeling for automatic speech recognition (ASR), this work investigates the modeling of senone subspaces in deep neural network (DNN) posteriors using low-rank and sparse modeling approaches. While DNN posteriors are typically very high-dimensional, recent studies have shown that the true class information is...
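A toy illustration of the low-rank view mentioned above: project high-dimensional posteriors onto their top principal directions and reconstruct (random stand-in posteriors and arbitrary sizes; the paper's actual models combine low-rank and sparse structure per senone):

```python
import numpy as np

# Random stand-ins for high-dimensional DNN posteriors: frames x senones.
P = np.random.dirichlet(np.ones(500), size=1000)
mean = P.mean(axis=0)
U, s, Vt = np.linalg.svd(P - mean, full_matrices=False)
k = 50                                          # retained subspace dimension
P_lowrank = (U[:, :k] * s[:k]) @ Vt[:k] + mean  # rank-k reconstruction
print(np.linalg.norm(P - P_lowrank) / np.linalg.norm(P))  # relative residual
```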
Article
Full-text available
This paper addresses the problem of detecting speech utterances from a large audio archive using a simple spoken query, hence referring to this problem as “Query by Example Spoken Term Detection” (QbE-STD). This still open pattern matching problem has been addressed in different contexts, often based on variants of the Dynamic Time Warping (DTW) al...
Conference Paper
Full-text available
In this work, we address the problem of query by example spoken term detection (QbE-STD) in a zero-resource scenario. State-of-the-art solutions usually rely on dynamic time warping (DTW) based template matching. In contrast, we propose here to tackle the problem as binary classification of images. Similar to the DTW approach, we rely on deep neural...
Article
Phoneme-based multilingual training and different cross-lingual adaptation techniques for Automatic Speech Recognition (ASR) are explored in Connectionist Temporal Classification (CTC)-based systems. The multilingual model is trained to model a universal IPA-based phone set using CTC loss function. While the same IPA symbol may not correspond to ac...
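A minimal PyTorch sketch of CTC training over a shared phone inventory, in the spirit of the universal IPA-based phone set described above (the inventory size, outputs, and labels below are random placeholders, not the paper's configuration):

```python
import torch

num_phones = 60                                  # universal phone set; blank = 0
log_probs = torch.randn(120, 4, num_phones).log_softmax(-1)  # (T, batch, C)
targets = torch.randint(1, num_phones, (4, 30))  # phone label sequences
input_lengths = torch.full((4,), 120)
target_lengths = torch.randint(10, 30, (4,))
ctc = torch.nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```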
Preprint
Multilingual models for Automatic Speech Recognition (ASR) are attractive as they have been shown to benefit from more training data, and better lend themselves to adaptation to under-resourced languages. However, initialisation from monolingual context-dependent models leads to an explosion of context-dependent states. Connectionist Temporal Class...
Article
We propose an information theoretic framework for quantitative assessment of acoustic modeling for hidden Markov model (HMM) based automatic speech recognition (ASR). Acoustic modeling yields the probabilities of HMM sub-word states for a short temporal window of speech acoustic features. We cast ASR as a communication channel where the input sub-w...
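The channel view in this abstract can be illustrated with a small self-contained calculation: treat the acoustic model as a discrete channel from true sub-word states to decoded states and measure the mutual information carried by a state-confusion table (the 3-state table below is invented for illustration; the paper's framework is considerably richer):

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) in bits from a joint probability table P(x, y)."""
    px = joint.sum(axis=1, keepdims=True)       # marginal over true states
    py = joint.sum(axis=0, keepdims=True)       # marginal over decoded states
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (px @ py)[mask])).sum())

counts = np.array([[90, 5, 5],                  # rows: true state
                   [10, 80, 10],                # cols: decoded state
                   [5, 15, 80]], dtype=float)
print(mutual_information(counts / counts.sum()))
```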
Article
Full-text available
Phonological classes define articulatory-free and articulatory-bound phone attributes. A deep neural network is used to estimate the probability of phonological classes from the speech signal. In theory, a unique combination of phone attributes forms a phoneme identity. Probabilistic inference of phonological classes thus enables estimation of their c...
Conference Paper
Full-text available
We present the first prototype of the SUMMA Platform: an integrated platform for multilingual media monitoring. The platform contains a rich suite of low-level and high-level natural language processing technologies: automatic speech recognition of broadcast media, machine translation, automated tagging and classification of named entities, semanti...
Article
Finding who spoke when in a collection of recordings, with speakers being uniquely identified across the database, is a challenging task. In this scenario, reasonable computing times and acoustic variation across recordings remain two major concerns to address in state-of-the-art speaker diarization systems. This paper extends prior work on diarizi...
Article
Conventional deep neural networks (DNN) for speech acoustic modeling rely on Gaussian mixture models (GMM) and hidden Markov model (HMM) to obtain binary class labels as the targets for DNN training. Subword classes in speech recognition systems correspond to context-dependent tied states or senones. The present work addresses some limitations of G...
Conference Paper
Full-text available
We cast the query by example spoken term detection (QbE-STD) problem as subspace detection where query and background subspaces are modeled as union of low-dimensional subspaces. The speech exemplars used for subspace modeling are class-conditional posterior probabilities estimated using deep neural network (DNN). The query and background training...
Conference Paper
Full-text available
Prosody in speech is manifested by variations of loudness, exaggeration of pitch, and specific phonetic variations of prosodic segments. For example, in the stressed and unstressed syllables, there are differences in place or manner of articulation, vowels in unstressed syllables may have a more central articulation, and vowel reduction may occur w...
Conference Paper
Full-text available
This paper shows that exemplar-based speech processing using class-conditional posterior probabilities admits a highly effective search strategy relying on posteriors' intrinsic sparsity structures. The posterior probabilities are estimated for phonetic and phonological classes using deep neural network (DNN) computational framework. Exploiting the...
Article
State-of-the-art speaker-recognition systems suffer from significant performance loss under degraded speech conditions and acoustic mismatch between enrolment and test phases. Past international evaluation campaigns, such as the NIST speaker recognition evaluation (SRE), have partly addressed these challenges in some evaluation conditions. This wo...
Article
The speech signal conveys information on different time scales, from the short (20–40 ms) or segmental scale, associated with phonological and phonetic information, to the long (150–250 ms) or supra-segmental scale, associated with syllabic and prosodic information. Linguistic and neurocognitive studies recognize the phonological classes at segmental l...
Article
Full-text available
We cast the problem of query by example spoken term detection (QbE-STD) as subspace detection where query and background are modeled as a union of low-dimensional subspaces. The speech exemplars used for subspace modeling consist of class-conditional posterior probabilities obtained from deep neural network (DNN). The query and background training...
Article
Full-text available
We hypothesize that optimal deep neural networks (DNN) class-conditional posterior probabilities live in a union of low-dimensional subspaces. In real test conditions, DNN posteriors encode uncertainties which can be regarded as a superposition of unstructured sparse noise to the optimal posteriors. We aim to investigate different ways to structure...
Conference Paper
Full-text available
Phonological features extracted by neural network have shown interesting potential for low bit rate speech vocoding. The span of phonological features is wider than the span of phonetic features, and thus fewer frames need to be transmitted. Moreover, the binary nature of phonological features enables a higher compression ratio at minor quality cos...
Article
In this paper, a method to use SGMM speaker vectors for speaker diarization is introduced. The architecture of the Information Bottleneck (IB) based speaker diarization is utilized for this purpose. The audio for speaker diarization is split into short uniform segments. Speaker vectors are obtained from a Subspace Gaussian Mixture Model (SGMM) syst...
Article
In this paper, the Kullback-Leibler hidden Markov model (KL-HMM) is applied for unsupervised diarization of speech. A general approach to speaker diarization is to split the audio into uniform segments, followed by one or more iterations of clustering of the segments and resegmentation of the audio. In the Information Bottleneck (IB) approach to dia...
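At the heart of both IB- and KL-HMM-based diarization is comparing segments through divergences between posterior (multinomial) distributions. A minimal sketch of that comparison, with invented 3-dimensional posteriors (actual systems estimate per-state multinomials over much larger inventories):

```python
import numpy as np

def kl(p, q, eps=1e-10):
    """KL divergence between two discrete posterior vectors."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

# Two speech segments summarised as averaged state-posterior vectors:
seg_a = np.array([0.7, 0.2, 0.1])
seg_b = np.array([0.6, 0.3, 0.1])
print(kl(seg_a, seg_b))   # small divergence -> candidates for the same cluster
```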
Article
Full-text available
A novel localization approach is proposed in order to find the position of an individual source using recordings of a single microphone in a reverberant enclosure. The multipath propagation is modeled by multiple virtual microphones as images of the actual single microphone and a multipath distance matrix is constructed whose components consist of...
Article
In this paper, the problem of speech source localization and separation from recordings of convolutive underdetermined mixtures is addressed. This problem is cast as recovering the spatio-spectral speech information embedded in a microphone array compressed measurements of the acoustic field. A model-based sparse component analysis framework is for...
Article
In this paper, a compressive sensing (CS) perspective to exemplar-based speech processing is proposed. Relying on an analytical relationship between CS formulation and statistical speech recognition (Hidden Markov Models - HMM), the automatic speech recognition (ASR) problem is cast as recovery of high-dimensional sparse word representation from th...
Conference Paper
Full-text available
We propose a novel formulation of the generalized cross correlation with phase transform (GCC-PHAT) for a pair of microphones in diffuse sound field. This formulation elucidates the links between the microphone distances and the GCC-PHAT output. Hence, it leads to a new model that enables estimation of the pairwise distances by optimizing over the...
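For reference, a standard GCC-PHAT time-delay estimator in NumPy (this is the classical baseline the abstract's formulation builds on, not the paper's pairwise-distance model; the signal parameters are arbitrary):

```python
import numpy as np

def gcc_phat(x, y, fs, max_tau=None):
    """Generalized cross correlation with phase transform between two
    microphone signals; returns the estimated time delay in seconds."""
    n = len(x) + len(y)                         # zero-pad to avoid circular wrap
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12              # PHAT weighting
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)
y = np.roll(x, 80)                              # simulate an 80-sample (5 ms) lag
print(gcc_phat(x, y, fs=16000))                 # ~ -0.005 s in this sign convention
```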
Article
Full-text available
We propose a sparse coding approach to address the problem of source-sensor localization and speech reconstruction. This approach relies on designing a dictionary of spatialized signals by projecting the microphone array recordings into the array manifolds characterized for different locations in a reverberant enclosure using the image model. Spars...
Article
Full-text available
We propose to model the acoustic space of deep neural network (DNN) class-conditional posterior probabilities as a union of low-dimensional subspaces. To that end, the training posteriors are used for dictionary learning and sparse coding. Sparse representation of the test posteriors using this dictionary enables projection to the space of training...
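A compact sketch of the dictionary-learning and sparse-coding pipeline this abstract describes, with random Dirichlet vectors standing in for DNN posteriors and arbitrary sizes (scikit-learn is one possible toolkit here; the paper does not prescribe it):

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Random stand-ins for training posteriors: 500 frames x 40 classes.
train_posteriors = np.random.dirichlet(np.ones(40), size=500)
dl = DictionaryLearning(n_components=20, transform_algorithm="lasso_lars",
                        transform_alpha=0.1, max_iter=10, random_state=0)
dl.fit(train_posteriors)                        # learn dictionary atoms

# Sparse codes project test posteriors onto the space of training posteriors.
test_posteriors = np.random.dirichlet(np.ones(40), size=5)
codes = dl.transform(test_posteriors)
print(codes.shape)                              # (5 test frames, 20 atoms)
```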
Article
Full-text available
Stochastic speech recognition has been cast as a natural realization of the compressive sensing problem in this work. The compressed acoustic observations are sub-word posterior probabilities obtained from a deep neural network. Dictionary learning and sparse recovery are exploited for inference of the high-dimensional sparse word posterior probabi...
Article
This paper introduces a nonlinear vector-based feature mapping approach to extract robust features for automatic speech recognition (ASR) of overlapping speech using a microphone array. We explore different configurations and additional sources of information to improve the effectiveness of the feature mapping. First, we investigate the full-vector...
Conference Paper
ROCKIT is a strategic roadmapping action in the area of multimodal conversational interaction technologies funded as a support action by the EU during 2014 and 2015. We envisage a future in which human-human, human-machine, and human-environment communication are not hampered by differences in language capability, accessibility, or knowledge of the...
Article
This paper addresses the problem of ad hoc microphone array calibration where only partial information about the distances between microphones is available. We construct a matrix consisting of the pairwise distances and propose to estimate the missing entries based on a novel Euclidean distance matrix completion algorithm by alternative low-rank ma...
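Once the distance matrix is complete, array geometry follows from classical multidimensional scaling. A sketch of that final step only (the paper's contribution, completing the missing entries via low-rank matrix completion, is not reproduced here):

```python
import numpy as np

def positions_from_edm(D, dim=2):
    """Classical MDS: pairwise distance matrix -> coordinates (up to a
    rigid transformation)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n         # centering matrix
    G = -0.5 * J @ (D ** 2) @ J                 # Gram matrix of centered points
    w, V = np.linalg.eigh(G)
    idx = np.argsort(w)[::-1][:dim]             # top-dim eigenpairs
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

mics = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
D = np.linalg.norm(mics[:, None] - mics[None, :], axis=-1)
print(positions_from_edm(D))                    # recovers the geometry up to rotation
```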
Article
In this paper, we investigate the diffuse field coherence model for microphone array pairwise distance estimation. We study the fundamental constraints and assumptions underlying this approach and propose evaluation methodologies to measure the adequacy of diffuseness for microphone array calibration. In addition, an enhanced scheme based on coheren...
Conference Paper
Full-text available
This paper presents a study on multilingual deep neural network (DNN) based acoustic modeling and its application to new languages. We investigate the effect of phone merging on multilingual DNN in context of rapid language adaptation. Moreover, the combination of multilingual DNNs with Kullback–Leibler divergence based acoustic modeling (KL-HMM) i...
Conference Paper
Full-text available
We address the problem of ad hoc microphone array calibration where some of the distances between the microphones cannot be measured. The conventional techniques require information about all the distances for accurate reconstruction of the array geometry. To alleviate this condition, we propose to exploit the properties of Euclidean distance matr...
Conference Paper
Full-text available
In this paper, the problem of multiple speaker localization via speech separation based on model-based sparse recovery is studied. We compare and contrast computational sparse optimization methods incorporating harmonicity and block structures as well as autoregressive dependencies underlying the spectrographic representation of speech signals. The res...
Conference Paper
Manual transcription of audio databases for automatic speech recognition (ASR) training is a costly and time-consuming process. State-of-the-art hybrid ASR systems that are based on deep neural networks (DNN) can exploit un-transcribed foreign data during unsupervised DNN pre-training or semi-supervised DNN training. We investigate the relevance of...
Article
Full-text available
We tackle the speech separation problem through modeling the acoustics of the reverberant chambers. Our approach exploits structured sparsity models to perform speech recovery and room acoustic modeling from recordings of concurrent unknown sources. The speakers are assumed to lie on a two-dimensional plane and the multipath channel is characterize...
Article
Under-resourced speech recognizers may benefit from data in languages other than the target language. In this paper, we report how to boost the performance of an Afrikaans automatic speech recognition system by using already available Dutch data. We successfully exploit available multilingual resources through (1) posterior features, estimated by m...
Article
Acoustic variability of speakers arises due to differences in their vocal tract characteristics. These individual speaker characteristics are reflected in a speech signal when speakers pronounce a given phoneme. The current work hypothesizes that clusters within a phoneme spoken by multiple speakers roughly correspond to different speakers. Based o...
Article
Successfully modeling overlapping speech is a crucial step towards improving the performance of current speaker diarization systems. In this direction, we present ongoing work on a novel Multi-Class Vector Taylor Series (MC-VTS) approach that models overlapping speech from knowledge of the individual speaker models and the feature extraction proces...
Article
Speaker diarization of a collection of recordings with uniquely identified speakers is a challenging task. A system addressing such a task must account for the inter-session variability present from recording to recording and must scale well to massive amounts of data. In this paper we use a two-stage approach to corpus-wide speaker diariza...
Article
Full-text available
Posterior features have been shown to yield very good performance in multiple contexts including speech recognition, spoken term detection, and template matching. These days, posterior features are usually estimated at the output of a neural network. More recently, sparse representation has also been shown to potentially provide additional advantag...
Article
Speech activity detection (SAD) is a conceptually simple task that still poses serious challenges for speech processing in a large variety of scenarios. Current energy-based and model-based approaches tend to directly segment speech and non-speech classes, but are not robust enough to non-stationary noise. In this paper, we use a multi-source activ...
Chapter
In the past twenty years, computers and networks have gained a prominent role in supporting human communication. This constitutes one of the most remarkable departures from their initial role as processors of large amounts of numeric data, in business or science, or as controllers of repetitive industrial operations. However, to offer truly innovat...
Chapter
When the Interactive Multimodal Information Management (IM2) declaration of intent was submitted in March 1999 to the Swiss National Science Foundation (SNSF), as an answer to the first call for proposals around a new research instrument referred to as National Center of Competence in Research (NCCR), research in multimodal human-computer interacti...
Book
In the past twenty years, computers and networks have gained a prominent role in supporting human communications. This book presents recent research in multimodal information processing, which demonstrates that computers can achieve more than what telephone calls or videoconferencing can do. The book offers a snapshot of current capabilities for th...
Conference Paper
Posterior based acoustic modeling techniques such as Kullback-Leibler divergence based HMM (KL-HMM) and Tandem are able to exploit out-of-language data through posterior features, estimated by a Multi-Layer Perceptron (MLP). In this paper, we investigate the performance of posterior based approaches in the context of under-resourced speech recognit...
Conference Paper
Kullback-Leibler divergence based hidden Markov models (KL-HMM) have recently been introduced as an efficient and principled way to directly model sequences of posterior vectors to perform Automatic Speech Recognition (ASR). Through efficient feature level adaptation and parsimonious use of parameters, KL-HMM was successfully applied to accented an...
Article
In the context of hybrid HMM/MLP Automatic Speech Recognition (ASR), this paper describes an investigation into a new type of stochastic phone space transformation, which maps “source” phone (or phone HMM state) posterior probabilities (as obtained at the output of a Multilayer Perceptron/MLP) into “destination” phone (HMM phone state) posterior pr...
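A toy rendering of such a stochastic phone-space transformation: destination posteriors obtained by multiplying source posteriors with a conditional probability matrix (the matrix and inventory sizes below are random and purely illustrative; in the paper the mapping is learned from data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Invented sizes: 45 source phones mapped onto 38 destination phones.
P_dest_given_src = rng.dirichlet(np.ones(38), size=45)   # rows sum to 1
src_post = rng.dirichlet(np.ones(45))                    # one frame of source posteriors
dest_post = src_post @ P_dest_given_src                  # mapped posteriors
print(dest_post.sum())                                   # still a valid posterior: 1.0
```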
Conference Paper
This paper addresses the application of missing data recovery via matrix completion for audio sensor networks. We propose a method based on Euclidean distance matrix completion for ad-hoc microphone array location calibration. This method can calibrate a full network from partial connectivity information. The pairwise distances of microphones in...
Conference Paper
In the last years, latent variable models such as factor analysis, probabilistic principal component analysis or subspace Gaussian mixture models have become almost ubiquitous in speech technologies. The key to its success is the joint modeling of multiple effects in the speech signal they address. In this paper, we propose a novel approach to use...
