About
194 Publications · 36,822 Reads
2,926 Citations
Current institution: Télécom Paris
Current position: Professor (Full)
Publications (194)
People often listen to music in noisy environments, seeking to isolate themselves from ambient sounds. Indeed, a music signal can mask some of the noise's frequency components due to the effect of simultaneous masking. In this article, we propose a neural network based on a psychoacoustic masking model, designed to enhance the music's ability to ma...
Recently, self-supervised learning methods based on masked latent prediction have proven to encode input data into powerful representations. However, during training, the learned latent space can be further transformed to extract higher-level information that could be more suited for downstream classification tasks. Therefore, we propose a new meth...
Large-scale pre-trained audio and image models demonstrate an unprecedented degree of generalization, making them suitable for a wide range of applications. Here, we tackle the specific task of sound-prompted segmentation, aiming to segment image regions corresponding to objects heard in an audio signal. Most existing approaches tackle this problem...
Training speech separation models in the supervised setting raises a permutation problem: finding the best assignment between the model predictions and the ground-truth separated signals. This inherently ambiguous task is customarily solved using Permutation Invariant Training (PIT). In this article, we instead consider using the Multiple Choice L...
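A minimal sketch of the PIT idea described above, assuming mean-squared error as the per-assignment loss (the training objective actually used in the paper may differ):

```python
# Sketch of Permutation Invariant Training (PIT): for each example, the
# loss is computed for every assignment of predictions to references,
# and the best (lowest) one is kept for training.
from itertools import permutations
import numpy as np

def pit_mse(predictions, references):
    """predictions, references: arrays of shape (n_sources, n_samples)."""
    n = predictions.shape[0]
    best = np.inf
    for perm in permutations(range(n)):
        loss = np.mean((predictions[list(perm)] - references) ** 2)
        best = min(best, loss)
    return best
```

The factorial number of assignments is one reason alternatives such as Multiple Choice Learning are attractive as the number of sources grows.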
In this paper, we propose a novel Self-Supervised-Learning scheme to train rhythm analysis systems and instantiate it for few-shot beat tracking. Taking inspiration from the Contrastive Predictive Coding paradigm, we propose to train a Log-Mel-Spectrogram Transformer encoder to contrast observations at times separated by hypothesized beat intervals...
Multimodal large language models have fueled progress in image captioning. These models, fine-tuned on vast image datasets, exhibit a deep understanding of semantic concepts. In this work, we show that this ability can be re-purposed for audio captioning, where the joint image-language decoder can be leveraged to describe auditory content associate...
Audio-text models trained via contrastive learning offer a practical approach to perform audio classification through natural language prompts, such as "this is a sound of" followed by category names. In this work, we explore alternative prompt templates for zero-shot audio classification, demonstrating the existence of higher-performing options. F...
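To make the prompt-based setup concrete, here is a hedged sketch of zero-shot classification with an audio-text embedding model; `embed_text`, the precomputed `audio_embedding`, and the default template are placeholders standing in for a real contrastive encoder (e.g. a CLAP-style model), not an API from the paper:

```python
# Zero-shot audio classification via natural-language prompts: score the
# audio clip against one text prompt per class by cosine similarity.
import numpy as np

def zero_shot_classify(audio_embedding, class_names, embed_text,
                       template="this is a sound of {}"):
    prompts = [template.format(name) for name in class_names]
    text_embeddings = np.stack([embed_text(p) for p in prompts])
    # cosine similarity between the audio clip and each prompt
    a = audio_embedding / np.linalg.norm(audio_embedding)
    t = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    scores = t @ a
    return class_names[int(np.argmax(scores))]
```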
Machine listening systems often rely on fixed taxonomies to organize and label audio data, key for training and evaluating deep neural networks (DNNs) and other supervised algorithms. However, such taxonomies face significant constraints: they are composed of application-dependent predefined categories, which hinders the integration of new or varie...
We introduce Annealed Multiple Choice Learning (aMCL) which combines simulated annealing with MCL. MCL is a learning framework handling ambiguous tasks by predicting a small set of plausible hypotheses. These hypotheses are trained using the Winner-takes-all (WTA) scheme, which promotes the diversity of the predictions. However, this scheme may con...
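The Winner-takes-all scheme mentioned above can be illustrated in a few lines; this shows only the basic hard-assignment step, not the simulated-annealing schedule that aMCL adds on top:

```python
# Winner-takes-all (WTA) step from Multiple Choice Learning: only the
# hypothesis closest to the target contributes to the loss (and hence
# receives gradient), which promotes diversity across hypotheses.
import numpy as np

def wta_loss(hypotheses, target):
    """hypotheses: (n_hypotheses, dim), target: (dim,)."""
    errors = np.sum((hypotheses - target) ** 2, axis=1)
    winner = int(np.argmin(errors))
    return errors[winner], winner
```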
Despite being trained on massive and diverse datasets, speech self-supervised encoders are generally used for downstream purposes as mere frozen feature extractors or model initializers before fine-tuning. The former severely limits the exploitation of large encoders, while the latter hurts the robustness acquired during pretraining, especially in...
Winner-takes-all training is a simple learning paradigm, which handles ambiguous tasks by predicting a set of plausible hypotheses. Recently, a connection was established between Winner-takes-all training and centroidal Voronoi tessellations, showing that, once trained, hypotheses should optimally quantize the shape of the conditional distribution...
Current state-of-the-art audio analysis systems rely on pre-trained embedding models, often used off-the-shelf as (frozen) feature extractors. Choosing the best one for a set of tasks is the subject of many recent publications. However, one aspect often overlooked in these works is the influence of the duration of audio input considered to extract...
The task of music structure analysis refers to automatically identifying the location and the nature of musical sections within a song. In the supervised scenario, structural annotations generally result from exhaustive data collection processes, which represents one of the main challenges of this task. Moreover, both the subjectivity of music stru...
Self-supervised learning (SSL) leverages large datasets of unlabeled speech to reach impressive performance with reduced amounts of annotated data. The high number of proposed approaches fostered the emergence of comprehensive benchmarks that evaluate their performance on a set of downstream tasks exploring various aspects of the speech signal. How...
Speech enhancement in ad-hoc microphone arrays is often hindered by the asynchronization of the devices composing the microphone array. Asynchronization comes from sampling time offsets and sampling rate offsets, which inevitably occur when the microphones are embedded in different hardware components. In this paper, we propose a deep neural network (...
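As a toy illustration of a sampling rate offset (not the paper's DNN-based method), one can simulate a clock drift by fractionally resampling one channel; `apply_sro` is a hypothetical helper:

```python
# Simulate a sampling rate offset (SRO): resample a signal as if its
# clock ran slightly fast, so it drifts against the other channels.
# A sampling time offset (STO) would instead be a constant shift.
import numpy as np

def apply_sro(signal, sro_ppm):
    """Resample `signal` with a clock running (1 + sro_ppm * 1e-6) too fast."""
    ratio = 1.0 + sro_ppm * 1e-6
    original_times = np.arange(len(signal))
    drifted_times = np.arange(0, len(signal) - 1, ratio)
    return np.interp(drifted_times, original_times, signal)
```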
Self-Supervised Learning (SSL) has allowed leveraging large amounts of unlabeled speech data to improve the performance of speech recognition models even with small annotated datasets. Despite this, speech SSL representations may fail while facing an acoustic mismatch between the pretraining and target datasets. To address this issue, we propose a...
Self-supervised learning (SSL) has recently allowed leveraging large datasets of unlabeled speech signals to reach impressive performance on speech tasks using only small amounts of annotated data. The high number of proposed approaches fostered the need and rise of extended benchmarks that evaluate their performance on a set of downstream tasks ex...
Adapting a segmentation model from a labeled source domain to a target domain, where a single unlabeled datum is available, is one of the most challenging problems in domain adaptation and is otherwise known as one-shot unsupervised domain adaptation (OSUDA). Most of the prior works have addressed the problem by relying on style transfer techniques, w...
Self-supervised learning (SSL) has allowed substantial progress in Automatic Speech Recognition (ASR) performance in low-resource settings. In this context, it has been demonstrated that larger self-supervised feature extractors are crucial for achieving lower downstream ASR error rates. Thus, better performance may come at the cost of longer infe...
Through solving pretext tasks, self-supervised learning leverages unlabeled data to extract useful latent representations replacing traditional input features in the downstream task. In audio and speech signal processing, a wide range of features were engineered through decades of research efforts. As it turns out, learning to predict such features...
Contrastive learning enables learning useful audio and speech representations without ground-truth labels by maximizing the similarity between latent representations of similar signal segments. In this framework various data augmentation techniques are usually exploited to help enforce desired invariances within the learned representations, improvi...
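A minimal NumPy sketch of an InfoNCE-style contrastive objective of the kind described above, assuming L2-normalized embeddings of paired (augmented) signal segments; the exact loss used in the paper may differ:

```python
# InfoNCE-style contrastive loss: paired segments (two views of the same
# signal) are pulled together; all other pairs in the batch act as
# negatives. Positives sit on the diagonal of the similarity matrix.
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) L2-normalized embeddings of paired segments."""
    logits = (z1 @ z2.T) / temperature            # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # positives on diagonal
```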
Through solving pretext tasks, self-supervised learning leverages unlabeled data to extract useful latent representations replacing traditional input features in the downstream task. In various application domains, including computer vision, natural language processing and audio/speech signal processing, a wide range of features were engineered th...
Deep neural network (DNN)-based speech enhancement algorithms in microphone arrays have now proven to be efficient solutions to speech understanding and speech recognition in noisy environments. However, in the context of ad-hoc microphone arrays, many challenges remain and raise the need for distributed processing. In this paper, we propose to ext...
Speech enhancement promises higher efficiency in ad-hoc microphone arrays than in constrained microphone arrays thanks to the wide spatial coverage of the devices in the acoustic scene. However, speech enhancement in ad-hoc microphone arrays still raises many challenges. In particular, the algorithms should be able to handle a variable number of mi...
Through solving pretext tasks, self-supervised learning (SSL) leverages unlabeled data to extract useful latent representations replacing traditional input features in the downstream task. A common pretext task consists in pretraining an SSL model on pseudo-labels derived from the original signal. This technique is particularly relevant for speech d...
Speech separation with several speakers is a challenging task because of the non-stationarity of speech and the strong signal similarity between interfering sources. Current state-of-the-art solutions can separate the different sources well using sophisticated deep neural networks, which are very tedious to train. When several microphones are av...
In this paper, we consider the detection of a decrease of engagement by users spontaneously interacting with a socially assistive robot in a public space. We first describe the UE-HRI dataset that collects spontaneous Human-Robot Interactions following the guidelines provided by the Affective Computing research community to collect data "in-the-wil...
Multichannel processing is widely used for speech enhancement, but several limitations appear when trying to deploy these solutions in the real world. Distributed sensor arrays that comprise several devices with a few microphones each are a viable alternative that allows for exploiting the multiple devices equipped with microphones that we are using in ou...
Audio-visual (AV) representation learning is an important task from the perspective of designing machines with the ability to understand complex events. To this end, we propose a novel multimodal framework that instantiates multiple instance learning. Specifically, we develop methods that identify events and localize corresponding AV cues in uncons...
In this paper we consider the detection of a decrease of engagement by users spontaneously interacting with a socially assistive robot in a public space. We first describe the UE-HRI dataset that collects spontaneous human–robot interactions following the guidelines provided by the affective computing research community to collect data “in-the-wild...
The task of predicting fine-grained user opinion based on spontaneous spoken language is a key problem arising in the development of Computational Agents as well as in the development of social-network-based opinion miners. Unfortunately, gathering reliable data on which a model can be trained is notoriously difficult and existing works rely only o...
In the last few years, several datasets have been released to meet the requirements of "hungry" yet promising data-driven approaches in music technology research. Since, for historical reasons, most investigations conducted in the field still revolve around music of the so-called "Western" tradition, the corresponding data, methodology and conclusi...
Events in music frequently exhibit small-scale temporal deviations (microtiming) with respect to the underlying regular metrical grid. In some cases, as in music from the Afro-Latin American tradition, such deviations appear systematically, disclosing their structural importance in rhythmic and stylistic configuration. In this work we explore the...
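A toy computation of microtiming deviations against a regular metrical grid, with illustrative values rather than data from the paper:

```python
# Microtiming as signed deviation: each detected onset is compared to
# the nearest position on a regular metrical grid.
import numpy as np

def microtiming_deviations(onsets, grid):
    """Return signed deviation (seconds) of each onset from its nearest grid line."""
    onsets, grid = np.asarray(onsets), np.asarray(grid)
    nearest = grid[np.argmin(np.abs(onsets[:, None] - grid[None, :]), axis=1)]
    return onsets - nearest

# e.g. a slightly "laid-back" second onset against a 0.5 s grid:
# microtiming_deviations([0.01, 0.54, 1.0], np.arange(0, 2, 0.5))
```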
In recent years the task of downbeat tracking has received increasing attention and the state of the art has been improved with the introduction of deep learning methods. Among proposed solutions, existing systems exploit short-term musical rules as part of their language modelling. In this work we show in an oracle scenario how including longer-te...
In this paper, we introduce a set of opinion annotations for the POM movie review dataset, composed of 1000 videos. The annotation campaign is motivated by the development of a hierarchical opinion prediction framework allowing one to predict the different components of the opinions (e.g. polarity and aspect) and to identify the corresponding textu...
This paper presents a supervised classification system for forecasting a potential user engagement breakdown in human-robot interaction. We define engagement breakdown as a failure to successfully complete a predefined interaction scenario, where the user leaves before the expected end. The goal is thus to detect as early as possible such a potenti...
In the physical sciences and engineering domains, music has traditionally been considered an acoustic phenomenon. From a perceptual viewpoint, music is naturally associated with hearing, i.e., the audio modality. Moreover, for a long time, the majority of music recordings were distributed through audio-only media, such as vinyl records, cassettes,...
We tackle the problem of audiovisual scene analysis for weakly-labeled data. To this end, we build upon our previous audiovisual representation learning framework to perform object classification in noisy acoustic environments and integrate audio source enhancement capability. This is made possible by a novel use of non-negative matrix factorizatio...
Affective computing is confronted with high inter-subject variability, in both emotional and physiological responses to a given stimulus. In a stimuli-shared framework, that is to say for different subjects who watch the same stimuli, the Inter-Subject Correlation (ISC) measured from electroencephalographic (EEG) recordings characterizes the correlations...
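One simple way to compute an ISC-style measure is the mean Pearson correlation between each subject's response and the average of the others; real EEG pipelines often rely on correlated component analysis instead, so treat this single-channel sketch as the basic idea only:

```python
# Leave-one-out Inter-Subject Correlation: correlate each subject's
# response to a shared stimulus with the mean response of all others.
import numpy as np

def isc(data):
    """data: (n_subjects, n_samples) responses to the same stimulus."""
    n = data.shape[0]
    corrs = []
    for s in range(n):
        others = data[np.arange(n) != s].mean(axis=0)
        corrs.append(np.corrcoef(data[s], others)[0, 1])
    return float(np.mean(corrs))
```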
In this paper we present a robust audio classification system to efficiently detect pulmonary edema. The system uses a feature learning technique based on non-negative matrix factorization (NMF); the learned features are then classified with logistic regression. A study was done to compare feature engineering approaches and feature selection techniques against NMF. Different NMF schemes were investigated...
Downbeat tracking consists of annotating a piece of musical audio with the estimated position of the first beat of each bar. In recent years, increasing attention has been paid to applying deep learning models to this task, and various architectures have been proposed, leading to a significant improvement in accuracy. However, there are few insight...
In this paper, the main goal is to detect a movie reviewer's opinion using hidden conditional random fields. This model allows us to capture the dynamics of the reviewer's opinion in the transcripts of long unsegmented audio reviews that are analyzed by our system. High level linguistic features are computed at the level of inter-pausal segments. T...
Audio-visual representation learning is an important task from the perspective of designing machines with the ability to understand complex events. To this end, we propose a novel multimodal framework that instantiates multiple instance learning. We show that the learnt representations are useful for classifying events and localizing their characte...
Motivated by Supervised Opinion Analysis, we propose a novel framework devoted to Structured Output Learning with Abstention (SOLA). The structure prediction model is able to abstain from predicting some labels in the structured output at a cost chosen by the user in a flexible way. For that purpose, we decompose the problem into the learning of a...
This chapter addresses sound scene and event classification in multiview settings, that is, settings where the observations are obtained from multiple sensors, each sensor contributing a particular view of the data (e.g., audio microphones, video cameras, etc.). We briefly introduce some of the techniques that can be exploited to effectively combin...
Most of the time it is nearly impossible to differentiate between particular types of sound events from a waveform only. Therefore, frequency-domain and time-frequency-domain representations have been used for years, providing representations of the sound signals that are more in line with human perception. However, these representations are usua...
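A typical example of such a time-frequency front end is a log-mel spectrogram, shown here with librosa (assumed installed); the paper studies learning representations rather than fixing them, so this is context, not its method:

```python
# Compute a log-mel spectrogram, a standard perceptually motivated
# time-frequency representation for sound event analysis.
import librosa
import numpy as np

y, sr = librosa.load(librosa.example("trumpet"))   # bundled example clip
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                     hop_length=512, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)     # perceptual (dB) scaling
```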
In this paper, we present a new dataset of spontaneous interactions between a robot and humans, of which 54 interactions (between 4 and 15 minutes each) are freely available for download and use. Participants were recorded while holding spontaneous conversations with the robot Pepper. The conversations started automatically when the robot d...
In this paper, we study the usefulness of various matrix factorization methods for learning features to be used for the specific acoustic scene classification (ASC) problem. A common way of addressing ASC has been to engineer features capable of capturing the specificities of acoustic environments. Instead, we show that better representations of th...
Nonnegative matrix factorisation (NMF) with β-divergence is a popular method to decompose real world data. In this paper we propose mini-batch stochastic algorithms to perform NMF efficiently on large data matrices. Besides the stochastic aspect, the mini-batch approach allows exploiting intensive computing devices such as general purpose graphical...
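A minimal sketch of one multiplicative β-divergence NMF update restricted to a mini-batch of data columns; the stochastic algorithms proposed in the paper are more elaborate (e.g. with accumulated statistics), so this only conveys the flavor:

```python
# One multiplicative beta-divergence update of NMF factors on a
# mini-batch of columns: V_batch ≈ W @ H_batch, all entries nonnegative.
# beta=1 corresponds to the Kullback-Leibler divergence.
import numpy as np

def mb_update(V_batch, W, H_batch, beta=1.0, eps=1e-12):
    WH = W @ H_batch + eps
    H_batch *= (W.T @ (WH ** (beta - 2) * V_batch)) / (W.T @ WH ** (beta - 1) + eps)
    WH = W @ H_batch + eps
    W *= ((WH ** (beta - 2) * V_batch) @ H_batch.T) / (WH ** (beta - 1) @ H_batch.T + eps)
    return W, H_batch
```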