About
Publications: 361
Reads: 84,607
Citations: 7,643
Publications (361)
This paper tackles two major problem settings for interpretability of audio processing networks, post-hoc and by-design interpretation. For post-hoc interpretation, we aim to interpret decisions of a network in terms of high-level audio objects that are also listenable for the end-user. This is extended to present an inherently interpretable mo...
We study cross-modal recommendation of music tracks to be used as soundtracks for videos. This problem is known as the music supervision task. We build on a self-supervised system that learns a content association between music and video. In addition to the adequacy of content, adequacy of structure is crucial in music supervision to obtain relevan...
This paper tackles two major problem settings for interpretability of audio processing networks, post-hoc and by-design interpretation. For post-hoc interpretation, we aim to interpret decisions of a network in terms of high-level audio objects that are also listenable for the end-user. This is extended to present an inherently interpretable model...
Supervised deep learning approaches to underdetermined audio source separation achieve state-of-the-art performance but require a dataset of mixtures along with their corresponding isolated source signals. Such datasets can be extremely costly to obtain for musical mixtures. This raises a need for unsupervised methods. We propose a novel unsupervis...
As music has become more available, especially on music streaming platforms, people have developed distinct preferences to fit their varying listening situations, also known as context. Hence, there has been a growing interest in considering the user's situation when recommending music. Previous works have proposed user-aware autot...
Understanding generalization in modern machine learning settings has been one of the major challenges in statistical learning theory. In this context, recent years have witnessed the development of various generalization bounds suggesting different complexity notions such as the mutual information between the data sample and the algorithm output, c...
This paper tackles post-hoc interpretability for audio processing networks. Our goal is to interpret decisions of a network in terms of high-level audio objects that are also listenable for the end-user. To this end, we propose a novel interpreter design that incorporates non-negative matrix factorization (NMF). In particular, a carefully regulariz...
We study cross-modal recommendation of music tracks to be used as soundtracks for videos. This problem is known as the music supervision task. We build on a self-supervised system that learns content association between music and video. In addition to adequacy of content, adequacy of structure is crucial in music supervision to obtain relevant reco...
Automatically recommending a video for a given music track, or a music track for a given video, has become an important asset for the audiovisual industry, for both user-generated and professional content. While both music and video have specific temporal organizations, most current works do not consider them and focus only on globally recommending a media item. As a first step...
Generative Adversarial Networks (GANs) have achieved excellent audio synthesis quality in recent years. However, making them operable with semantically meaningful controls remains an open challenge. An obvious approach is to control the GAN by conditioning it on metadata contained in audio datasets. Unfortunately, audio datasets often lack the de...
We present a new probabilistic model to address semi-nonnegative matrix factorization (SNMF), called Skellam-SNMF. It is a hierarchical generative model consisting of prior components, Skellam-distributed hidden variables and observed data. Two inference algorithms are derived: Expectation-Maximization (EM) algorithm for maximum a posteriori...
The goal of singing voice separation is to recover the vocals signal from music mixtures. State-of-the-art performance is achieved by deep neural networks trained in a supervised fashion. Since training data are scarce and music signals are extremely diverse, it remains challenging to achieve high separation quality across various recording and mix...
Neural network compression techniques have become increasingly popular as they can drastically reduce the storage and computation requirements for very large networks. Recent empirical studies have illustrated that even simple pruning strategies can be surprisingly effective, and several theoretical studies have shown that compressible networks (in...
Recent advances in Transformer models allow for unprecedented sequence lengths, due to linear space and time complexity. In the meantime, relative positional encoding (RPE) was proposed as beneficial for classical Transformers and consists in exploiting lags instead of absolute positions for inference. Still, RPE is not available for the recent lin...
Influenced by the field of Computer Vision, Generative Adversarial Networks (GANs) are often adopted for the audio domain using fixed-size two-dimensional spectrogram representations as the "image data". However, in the (musical) audio domain, it is often desired to generate output of variable duration. This paper presents VQCPC-GAN, an adversarial...
In this work, we study music/video cross-modal recommendation, i.e. recommending a music track for a video or vice versa. We rely on a self-supervised learning paradigm to learn from a large amount of unlabelled data. More precisely, we jointly learn aud...
Neural style transfer, which applies the artistic style of one image to another, became one of the most widely showcased computer vision applications shortly after its introduction. In contrast, related tasks in the music audio domain remained, until recently, largely untackled. While several style conversion methods tailored to musical sig...
Synthetic creation of drum sounds (e.g., in drum machines) is commonly performed using analog or digital synthesis, allowing a musician to sculpt the desired timbre by modifying various parameters. Typically, such parameters control low-level features of the sound and often have no musical meaning or perceptual correspondence. With the rise of Deep Le...
Style transfer is the process of changing the style of an image, video, audio clip or musical piece so as to match the style of a given example. Even though the task has interesting practical applications within the music industry, it has so far received little attention from the audio and music processing community. In this paper, we present Groov...
In this paper, we compare different audio signal representations, including the raw audio waveform and a variety of time-frequency representations, for the task of audio synthesis with Generative Adversarial Networks (GANs). We conduct the experiments on a subset of the NSynth dataset. The architecture follows the benchmark Progressive Growing Wass...
Most music streaming services rely on automatic recommendation algorithms to exploit their large music catalogs. These algorithms aim at retrieving a ranked list of music tracks based on their similarity with a target music track. In this work, we propose a method for direct recommendation based on the audio content without explicitly tagging the m...
Audio-visual (AV) representation learning is an important task from the perspective of designing machines with the ability to understand complex events. To this end, we propose a novel multimodal framework that instantiates multiple instance learning. Specifically, we develop methods that identify events and localize corresponding AV cues in uncons...
The gradient noise (GN) in the stochastic gradient descent (SGD) algorithm is often considered to be Gaussian in the large data regime by assuming that the classical central limit theorem (CLT) kicks in. This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE)...
Prior information about the target source can improve audio source separation quality but is usually not available with the necessary level of audio alignment. This has limited its usability in the past. We propose a separation model that can nevertheless exploit such weak information for the separation task while aligning it on the mixture as a by...
Matrix factorization techniques have proven to be useful in many unsupervised learning applications. Such techniques have been recently applied to Non Intrusive Load Monitoring (NILM), the process of breaking down the total electric consumption of a building into consumptions of individual appliances. While several studies addressed the NILM proble...
Research on style transfer and domain translation has clearly demonstrated the ability of deep learning-based algorithms to manipulate images in terms of artistic style. More recently, several attempts have been made to extend such approaches to music (both symbolic and audio) in order to enable transforming musical style in a similar manner. In th...
Stochastic gradient descent (SGD) has been widely used in machine learning due to its computational efficiency and favorable generalization properties. Recently, it has been empirically demonstrated that the gradient noise in several deep learning settings admits a non-Gaussian, heavy-tailed behavior. This suggests that the gradient noise can be mo...
Recent studies on diffusion-based sampling methods have shown that Langevin Monte Carlo (LMC) algorithms can be beneficial for non-convex optimization, and rigorous theoretical guarantees have been proven for both asymptotic and finite-time regimes. Algorithmically, LMC-based algorithms resemble the well-known gradient descent (GD) algorithm, where...
In the physical sciences and engineering domains, music has traditionally been considered an acoustic phenomenon. From a perceptual viewpoint, music is naturally associated with hearing, i.e., the audio modality. Moreover, for a long time, the majority of music recordings were distributed through audio-only media, such as vinyl records, cassettes,...
We tackle the problem of audiovisual scene analysis for weakly-labeled data. To this end, we build upon our previous audiovisual representation learning framework to perform object classification in noisy acoustic environments and integrate audio source enhancement capability. This is made possible by a novel use of non-negative matrix factorizatio...
Separation of existing audio into remixable elements is very useful to repurpose music audio. Applications include upmixing video soundtracks to surround sound (e.g. home theater 5.1 systems), facilitating music transcriptions, allowing better mashups and remixes for disk jockeys, and rebalancing sound levels on multiple instruments or voices recor...
Recent studies have illustrated that stochastic gradient Markov Chain Monte Carlo techniques have a strong potential in non-convex optimization, where local and global convergence guarantees can be shown under certain conditions. Building on this recent theory, in this study we develop an asynchronous-parallel stochastic L-BFGS algorithm for...
This paper presents a Bayesian framework for under-determined audio source separation in multichannel reverberant mixtures. We model the source signals as Student’s t latent random variables in a time-frequency domain. The specific structure of musical signals in this domain is exploited by means of a non-negative matrix factorization model. Conver...
One of the most general models of music signals considers that such signals can be represented as a sum of two distinct components: a tonal part that is sparse in frequency and temporally stable, and a transient (or percussive) part composed of short-term broadband sounds. In this paper, we propose a novel hybrid method built upon Nonnegative Matrix...
Audio-visual representation learning is an important task from the perspective of designing machines with the ability to understand complex events. To this end, we propose a novel multimodal framework that instantiates multiple instance learning. We show that the learnt representations are useful for classifying events and localizing their characte...
Parallel factor analysis (PARAFAC) is one of the most popular tensor factorization models. Even though it has proven successful in diverse application fields, the performance of PARAFAC usually hinges upon the rank of the factorization, which is typically specified manually by the practitioner. In this study, we develop a novel parallel and distri...
In recent years, there has been an increasing academic and industrial interest in analyzing the electrical consumption of commercial buildings. Whilst having similarities with the Non Intrusive Load Monitoring (NILM) tasks for residential buildings, the nature of the signals that are collected from large commercial buildings introduces additio...
Most of the time it is nearly impossible to differentiate between particular types of sound events from the waveform alone. Therefore, frequency-domain and time-frequency domain representations have been used for years, providing representations of sound signals that are more in line with human perception. However, these representations are usua...
In the field of Embodied Conversational Agents (ECAs), one of the main challenges is to generate socially believable agents. The long-term objective of the present study is to infer rules for the multimodal generation of agents' socio-emotional behaviour. In this paper, we introduce the Social Multimodal Association Rules with Timing (SMART) algorithm....
This paper addresses the problem of under-determined audio source separation in multichannel reverberant mixtures. We target a semi-blind scenario assuming that the mixing filters are known. Source separation is performed from the time-domain mixture signals in order to accurately model the convolutive mixing process. The source signals are howeve...
This paper addresses the problem of multichannel audio source separation in under-determined reverberant mixtures. We target a semi-blind scenario assuming that the mixing filters are known. The proposed method consists in working directly with the time-domain mixture signals. This approach makes it possible to accurately represent the convolutive...
To improve interaction between humans and Embodied Conversational Agents (ECAs), one of the major challenges in the field is to generate socially believable agents. In this article, we present a method, called SMART for Social Multimodal Association Rules with Timing, capable of automatically finding associations t...
In this paper, we study the usefulness of various matrix factorization methods for learning features to be used for the specific acoustic scene classification (ASC) problem. A common way of addressing ASC has been to engineer features capable of capturing the specificities of acoustic environments. Instead, we show that better representations of th...
The papers in this special section are devoted to the growing field of acoustic scene classification and acoustic event recognition. Machine listening systems still struggle to match the ability of human listeners in the analysis of realistic acoustic scenes. While sustained research efforts have been made for decades in speech recognition, s...
The typical application targeted by this work is improving the intelligibility of speech messages rendered in a noisy car environment (radio, message alerts, etc.). The main idea of this work is to transform the original speech into "Lombard" speech, or more precisely to simulate some of the strategies followed by humans to render their speech clea...
Stochastic Gradient Markov Chain Monte Carlo (SG-MCMC) methods have become popular in modern data analysis problems due to their computational efficiency. Even though they have proved useful for many statistical models, the application of SG-MCMC to non-negative matrix factorization (NMF) models has not yet been extensively explored. In this study...
In this paper, we focus on modeling multichannel audio signals in the short-time Fourier transform domain for the purpose of source separation. We propose a probabilistic model based on a class of heavy-tailed distributions, in which the observed mixtures and the latent sources are jointly modeled by using a certain class of multivariate alpha-stab...
A great number of methods for multichannel audio source separation are based on probabilistic approaches in which the sources are modeled as latent random variables in a time-frequency (TF) domain. For reverberant mixtures, most of the methods approximate the time-domain convolutive mixing process in the TF-domain, assuming short mixing filters. Th...
In this paper, we propose a supervised multilayer factorization method designed for harmonic/percussive source separation and drum extraction. Our method decomposes the audio signals into sparse orthogonal components which capture the harmonic content, while the drum is represented by an extension of non-negative matrix factorization which is able to...
In this paper we provide a formal justification of the use of time-frequency reassignment techniques on time-frequency transforms of discrete-time signals. State-of-the-art techniques indeed rely on formulae established in the continuous case which are applied, in a somewhat inaccurate manner, to discrete-time signals. Here, we formally derive a gen...
Stochastic Gradient Markov Chain Monte Carlo (SG-MCMC) algorithms have become increasingly popular for Bayesian inference in large-scale applications. Even though these methods have proved useful in several scenarios, their performance is often limited by their bias. In this study, we propose a novel sampling algorithm that aims to reduce the bias...
Incorporating prior knowledge about the sources and/or the mixture is a way to improve under-determined audio source separation performance. A great number of informed source separation techniques concentrate on taking priors on the sources into account, but fewer works have focused on constraining the mixing model. In this paper we address the pro...