Vincent Lostanlen’s research while affiliated with École Centrale de Nantes and other places


Publications (79)


Model-Based Deep Learning for Music Information Research: Leveraging diverse knowledge sources to enhance explainability, controllability, and resource efficiency [Special Issue On Model-Based and Data-Driven Audio Signal Processing]
  • Article

November 2024 · IEEE Signal Processing Magazine

Gaël Richard · Vincent Lostanlen · …

In this article, we investigate the notion of model-based deep learning in the realm of music information research (MIR). Loosely speaking, we use the term model-based deep learning for approaches that combine traditional knowledge-based methods with data-driven techniques, especially those based on deep learning, within a differentiable computing framework. In music, prior knowledge, for instance related to sound production, music perception, or music composition theory, can be incorporated into the design of neural networks and associated loss functions. We outline three specific scenarios to illustrate the application of model-based deep learning in MIR, demonstrating the implementation of such concepts and their potential.
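The general idea of backpropagating through a knowledge-based signal model can be illustrated with a short sketch. The following is a hypothetical example, not one of the article's three scenarios: a recurrent controller predicts the parameters of a DDSP-style differentiable harmonic synthesizer, and a spectral loss is propagated through the synthesizer into the network. All layer sizes, ranges, and constants are placeholders.

```python
# Hypothetical sketch of model-based deep learning in MIR (not the article's own
# systems): a recurrent "controller" network predicts the parameters of a
# differentiable harmonic synthesizer, and the spectral loss is backpropagated
# through the signal model into the network.
import torch
import torch.nn as nn

SR, N_HARMONICS, N_FRAMES, HOP = 16000, 16, 100, 160

class HarmonicController(nn.Module):
    """Maps an input feature sequence to frame-wise f0 and harmonic amplitudes."""
    def __init__(self, n_features=64):
        super().__init__()
        self.gru = nn.GRU(n_features, 128, batch_first=True)
        self.head = nn.Linear(128, 1 + N_HARMONICS)          # f0 + per-harmonic gains

    def forward(self, features):                              # (batch, frames, n_features)
        hidden, _ = self.gru(features)
        out = self.head(hidden)
        f0 = 40.0 + 400.0 * torch.sigmoid(out[..., :1])       # Hz, bounded prior on pitch
        amps = torch.softmax(out[..., 1:], dim=-1)            # normalized harmonic gains
        return f0, amps

def harmonic_synth(f0, amps):
    """Differentiable additive synthesis from frame-wise f0 and amplitudes."""
    f0 = f0.repeat_interleave(HOP, dim=1)                     # upsample to audio rate
    amps = amps.repeat_interleave(HOP, dim=1)
    phase = 2 * torch.pi * torch.cumsum(f0 / SR, dim=1)       # running phase of the fundamental
    k = torch.arange(1, N_HARMONICS + 1, device=f0.device)
    return (amps * torch.sin(phase * k)).sum(-1)              # (batch, samples)

controller = HarmonicController()
features = torch.randn(2, N_FRAMES, 64)                       # stand-in acoustic features
target = torch.randn(2, N_FRAMES * HOP)                       # stand-in target waveform
f0, amps = controller(features)
audio = harmonic_synth(f0, amps)

window = torch.hann_window(512)
spec = lambda s: torch.stft(s, 512, window=window, return_complex=True).abs()
loss = torch.nn.functional.l1_loss(spec(audio), spec(target))
loss.backward()   # gradients reach the controller through the synthesizer
```

Because the synthesizer is differentiable, the prior knowledge about sound production constrains what the network can output, which is the defining trait of model-based deep learning as discussed above.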


Figure 2: Condition numbers of the encoders of all models, plotted per epoch. The proposed stabilization scheme keeps the condition number at one throughout training, i.e., the encoder stays tight, whereas the condition number of non-stabilized encoders increases gradually.
Figure 3: Mean SNR values of the denoised signals on the training set (solid) and validation set (dashed), plotted per epoch, with markers for the models with tight encoders. Left: no noise in the encoder. Right: noise in the encoder. In both settings, the models with stabilized encoders yield higher SNRs for the denoised signals. In the noisy setting (right), the SNR progression of the model with the tight encoder is significantly more stable than that of the non-stabilized encoder.
Trainable signal encoders that are robust against noise
  • Article
  • Full-text available

October 2024 · INTER-NOISE and NOISE-CON Congress and Conference Proceedings

Within the deep learning paradigm, finite impulse response (FIR) filters are often used to encode audio signals, yielding flexible and adaptive feature representations. We show that stabilizing FIR filterbanks with fixed filter lengths (convolutional layers with 1-D filters) by means of their condition number leads to encoders that are optimally robust against noise and can be inverted with perfect reconstruction by their transposes. To maintain their flexibility as regular neural network layers, we implement the stabilization via a computationally efficient regularizing term in the objective function of the learning problem. In this way, the encoder keeps its expressive power and remains optimally stable and noise-robust throughout the learning procedure. In a denoising task where noise is additionally present in the encoder representation of the signals, we show that the proposed stabilization of the trainable filterbank encodings significantly increases the signal-to-noise ratio of the denoised signals compared with a model whose encoder is trained naively.
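As an illustration of the mechanism described above, here is a minimal sketch, not the paper's implementation, of how such a regularizing term could be computed for a stride-1 convolutional encoder: the frame bounds A and B are read off the summed squared magnitude responses of the filters, and the condition number B/A is driven towards one alongside the task loss. The filter count, FFT size, padding convention, and the weight 0.1 are assumptions.

```python
# Sketch of condition-number regularization for a 1-D convolutional encoder
# (stride 1 assumed). A and B are the min and max over frequency of the summed
# squared magnitude responses of the filters; kappa = B / A is driven towards 1.
import torch
import torch.nn as nn

class FIREncoder(nn.Module):
    def __init__(self, n_filters=64, filter_len=128, fft_size=1024):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_filters, filter_len) * 0.01)
        self.fft_size = fft_size

    def forward(self, x):                                   # x: (batch, time)
        w = self.weight.unsqueeze(1)                        # (n_filters, 1, filter_len)
        return torch.nn.functional.conv1d(x.unsqueeze(1), w, padding="same")

    def condition_number(self):
        response = torch.fft.rfft(self.weight, n=self.fft_size)  # per-filter frequency responses
        power = response.abs().pow(2).sum(dim=0)                 # sum_k |w_k(omega)|^2
        A, B = power.min(), power.max()                           # lower and upper frame bounds
        return B / A

encoder = FIREncoder()
x = torch.randn(4, 4096)
y = encoder(x)
task_loss = y.pow(2).mean()                                 # stand-in for the real objective
loss = task_loss + 0.1 * (encoder.condition_number() - 1.0)
loss.backward()
```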


Towards multisensory control of physical modeling synthesis

October 2024 · INTER-NOISE and NOISE-CON Congress and Conference Proceedings

Physical models of musical instruments offer an interesting tradeoff between computational efficiency and perceptual fidelity. Yet, they depend on a multidimensional space of user-defined parameters whose exploration by trial and error is impractical. Our article addresses this issue by combining two ideas: query by example and gestural control. Prior publications have presented these ideas separately but never in conjunction. On one hand, we train a deep neural network to identify the resonator parameters of a percussion synthesizer from a single audio example via an original method named perceptual-neural-physical sound matching (PNP). On the other hand, we map these parameters to knobs in a digital controller and configure a musical touchpad with MIDI polyphonic expression. Hence, we propose a multisensory interface between human and machine: it integrates haptic and sonic information and produces new sounds in real time as well as visual feedback on the percussive touchpad. We demonstrate the value of this new kind of multisensory control via a musical game in which participants collaborate with the machine in order to imitate the sound of an unknown percussive instrument as quickly as possible. Our findings show the challenge and promise of future research in musical "Human-AI partnerships".
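For a sense of what the controller-mapping half of such a system can look like in code, here is a minimal, hypothetical sketch (not the authors' implementation): normalized resonator parameters, as a sound-matching network might predict them, are scaled to 7-bit values and sent as MIDI control-change messages with the mido library. The port name, parameter names, and CC numbers are all placeholders.

```python
# Hypothetical mapping from normalized synthesizer parameters to MIDI CC knobs.
import mido

PARAM_TO_CC = {"pitch": 20, "decay": 21, "inharmonicity": 22, "brightness": 23}

def send_parameters(params: dict, port_name: str = "Percussion Synth") -> None:
    """Send normalized parameters in [0, 1] as MIDI control-change messages."""
    with mido.open_output(port_name) as port:
        for name, value in params.items():
            cc = PARAM_TO_CC[name]
            midi_value = max(0, min(127, round(value * 127)))   # clamp to 7-bit range
            port.send(mido.Message("control_change", control=cc, value=midi_value))

# Example: parameters estimated from a single audio query
send_parameters({"pitch": 0.42, "decay": 0.80, "inharmonicity": 0.05, "brightness": 0.33})
```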



Machine listening in a neonatal intensive care unit

September 2024

Oxygenators, alarm devices, and footsteps are some of the most common sound sources in a hospital. Detecting them has scientific value for environmental psychology but comes with challenges of its own: namely, privacy preservation and limited labeled data. In this paper, we address these two challenges via a combination of edge computing and cloud computing. For privacy preservation, we have designed an acoustic sensor which computes third-octave spectrograms on the fly instead of recording audio waveforms. For sample-efficient machine learning, we have repurposed a pretrained audio neural network (PANN) via spectral transcoding and label space adaptation. A small-scale study in a neonatal intensive care unit (NICU) confirms that the time series of detected events align with another modality of measurement, i.e., electronic badges for parents and healthcare professionals. Hence, this paper demonstrates the feasibility of polyphonic machine listening in a hospital ward while guaranteeing privacy by design.
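A rough sketch of what "spectral transcoding and label space adaptation" can look like in practice is given below; it is an illustration under stated assumptions, not the paper's system. A learned 1 × 1 convolution maps third-octave bands to the mel-like input of a frozen pretrained backbone, and a new linear head maps its embedding to ward-specific classes. The band counts, embedding size, class list, and stand-in backbone are all hypothetical.

```python
# Sketch of spectral transcoding + label space adaptation on top of a frozen
# pretrained audio network. Band counts, classes, and the backbone are placeholders.
import torch
import torch.nn as nn

N_THIRD_OCTAVE, N_MEL, EMBED_DIM = 29, 64, 2048
NICU_CLASSES = ["oxygenator", "alarm", "footsteps", "speech", "other"]

class TranscodedClassifier(nn.Module):
    def __init__(self, pann_backbone: nn.Module):
        super().__init__()
        # learned frequency-axis map: third-octave bands -> mel-like bands
        self.transcoder = nn.Conv1d(N_THIRD_OCTAVE, N_MEL, kernel_size=1)
        self.backbone = pann_backbone
        for p in self.backbone.parameters():        # keep the pretrained weights frozen
            p.requires_grad = False
        self.head = nn.Linear(EMBED_DIM, len(NICU_CLASSES))   # label space adaptation

    def forward(self, third_octave):                # (batch, bands, time frames)
        mel_like = self.transcoder(third_octave)    # (batch, N_MEL, time frames)
        embedding = self.backbone(mel_like)         # (batch, EMBED_DIM), assumed API
        return self.head(embedding)                 # per-class logits

# Usage with a stand-in backbone (a real PANN would be loaded from a checkpoint):
dummy_backbone = nn.Sequential(nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(N_MEL, EMBED_DIM))
model = TranscodedClassifier(dummy_backbone)
logits = model(torch.randn(8, N_THIRD_OCTAVE, 250))
```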



Figure 1: Log magnitude responses of three encoders for the same speech signal. Left to right: auditory filterbank, random filterbank, and hybrid filterbank obtained as the channel-wise composition of the previous two. While the random responses are hard to interpret, the hybrid responses are comparable to the fixed ones while remaining fine-tunable in a data-driven manner.
Hold Me Tight: Stable Encoder-Decoder Design for Speech Enhancement

August 2024

Convolutional layers with 1-D filters are often used as a frontend to encode audio signals. Unlike fixed time-frequency representations, they can adapt to the local characteristics of input data. However, 1-D filters on raw audio are hard to train and often suffer from instabilities. In this paper, we address these problems with hybrid solutions, i.e., combining theory-driven and data-driven approaches. First, we preprocess the audio signals via an auditory filterbank, guaranteeing good frequency localization for the learned encoder. Second, we use results from frame theory to define an unsupervised learning objective that encourages energy conservation and perfect reconstruction. Third, we adapt mixed compressed spectral norms as learning objectives applied to the encoder coefficients. Using these solutions in a low-complexity encoder-mask-decoder model significantly improves the perceptual evaluation of speech quality (PESQ) in speech enhancement.
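To make the hybrid idea from Figure 1 concrete, here is a minimal sketch under stated assumptions rather than the paper's exact design: a fixed auditory-style filterbank is composed channel-wise with short learnable FIR filters (a depthwise 1-D convolution), and an unsupervised term nudges the encoding toward energy conservation. The filter lengths, channel count, and stand-in fixed filters are placeholders.

```python
# Sketch of a hybrid encoder: fixed auditory filterbank followed channel-wise by
# short learnable FIR filters, with an energy-conservation term as an objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_CHANNELS, FIXED_LEN, LEARNED_LEN = 40, 256, 32

class HybridEncoder(nn.Module):
    def __init__(self, fixed_filters):                       # (N_CHANNELS, FIXED_LEN), precomputed
        super().__init__()
        self.register_buffer("fixed", fixed_filters.unsqueeze(1))   # frozen auditory filters
        self.learned = nn.Conv1d(N_CHANNELS, N_CHANNELS, LEARNED_LEN,
                                 padding="same", groups=N_CHANNELS)  # channel-wise FIRs

    def forward(self, x):                                    # x: (batch, time)
        subbands = F.conv1d(x.unsqueeze(1), self.fixed, padding="same")
        return self.learned(subbands)                        # fine-tuned subband signals

fixed = torch.randn(N_CHANNELS, FIXED_LEN)                   # stand-in for gammatone-like filters
encoder = HybridEncoder(fixed)
x = torch.randn(4, 16000)
coeffs = encoder(x)
# Unsupervised objective: keep the encoding approximately energy-preserving.
energy_loss = (coeffs.pow(2).sum(dim=(1, 2)) / x.pow(2).sum(dim=1) - 1.0).pow(2).mean()
```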



Figure 2. We modify the ChromaNet architecture of Figure 1 to accommodate structured prediction of key signature and mode. We apply batch normalization per mode m and a softmax over all coefficients, yielding a 12 × 2 matrix Y_θ(x). Summing Y_θ(x) over modes m yields a learned key signature profile λ_θ(x) of dimension 12; summing Y_θ(x) over chromas q yields a pitch-invariant 2-dimensional vector μ_θ(x).
Figure 4. Confusion matrices of STONE (left, 12 classes) and Semi-TONE (right, 24 classes) on FMAK, both using ω = 7. The axes correspond to model predictions and reference keys, respectively, with keys arranged by proximity in the CoF and by relative mode. Deeper colors indicate more frequent occurrences per reference key.
STONE: Self-supervised Tonality Estimator

July 2024

Although deep neural networks can estimate the key of a musical piece, their supervision incurs a massive annotation effort. Against this shortcoming, we present STONE, the first self-supervised tonality estimator. The architecture behind STONE, named ChromaNet, is a convnet with octave equivalence which outputs a key signature profile (KSP) of 12 structured logits. First, we train ChromaNet to regress artificial pitch transpositions between any two unlabeled musical excerpts from the same audio track, as measured by the cross-power spectral density (CPSD) within the circle of fifths (CoF). We observe that this self-supervised pretext task leads the KSP to correlate with the tonal key signature. Based on this observation, we extend STONE to output a structured KSP of 24 logits, and introduce supervision so as to disambiguate major versus minor keys sharing the same key signature. Applying different amounts of supervision yields semi-supervised and fully supervised tonality estimators, i.e., Semi-TONEs and Sup-TONEs. We evaluate these estimators on FMAK, a new dataset of 5489 real-world musical recordings with expert annotation of 24 major and minor keys. We find that Semi-TONE matches the classification accuracy of Sup-TONE with reduced supervision and outperforms it with equal supervision.
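The circle-of-fifths pretext task lends itself to a compact illustration. The sketch below captures the principle under assumptions of ours, not the paper's exact loss: each 12-bin key signature profile is projected onto the k = 7 discrete Fourier coefficient, for which a one-semitone transposition corresponds to one step along the circle of fifths, and the phase of the cross-power spectral density between two differently transposed excerpts is regressed against the known artificial transposition.

```python
# Sketch of a circle-of-fifths CPSD pretext loss between two 12-bin profiles.
import torch

def cof_coefficient(ksp):
    """Project a batch of 12-bin profiles onto the circle-of-fifths DFT bin (k = 7)."""
    q = torch.arange(12, dtype=torch.float32)
    basis = torch.exp(-2j * torch.pi * 7 * q / 12)            # (12,) complex exponential
    return (ksp.to(torch.complex64) * basis).sum(dim=-1)      # (batch,) complex

def pretext_loss(ksp_a, ksp_b, semitones):
    """Compare the CPSD phase to the artificial pitch transposition (in semitones)."""
    cpsd = cof_coefficient(ksp_a) * cof_coefficient(ksp_b).conj()
    target_angle = 2 * torch.pi * 7 * semitones / 12
    # distance on the unit circle between measured and expected phase
    return (1 - torch.cos(torch.angle(cpsd) - target_angle)).mean()

ksp_a = torch.softmax(torch.randn(8, 12), dim=-1)             # stand-in ChromaNet outputs
ksp_b = torch.softmax(torch.randn(8, 12), dim=-1)
loss = pretext_loss(ksp_a, ksp_b, semitones=torch.randint(0, 12, (8,)).float())
```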


Model-Based Deep Learning for Music Information Research

June 2024

In this article, we investigate the notion of model-based deep learning in the realm of music information research (MIR). Loosely speaking, we use the term model-based deep learning for approaches that combine traditional knowledge-based methods with data-driven techniques, especially those based on deep learning, within a differentiable computing framework. In music, prior knowledge, for instance related to sound production, music perception, or music composition theory, can be incorporated into the design of neural networks and associated loss functions. We outline three specific scenarios to illustrate the application of model-based deep learning in MIR, demonstrating the implementation of such concepts and their potential.


Citations (39)


... While techniques such as mixup [87] and adding background noise [52] are somewhat common for augmenting spectrograms, they also present challenges. Mixup, which interpolates between two samples, has shown promise [51], but can introduce temporal correlation artifacts or fail to maintain the original signal's integrity. Additionally, adding background noise can degrade the quality of the signal, especially if a large volume of clean samples with birds present is not available, potentially hindering the model's ability to distinguish between the target sound and the noise [33]. ...

Reference:

Generative AI-based data augmentation for improved bioacoustic classification in noisy environments
Mixture of Mixups for Multi-label Classification of Rare Anuran Sounds
  • Citing Conference Paper
  • August 2024
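As a point of reference for the mixup technique discussed in the excerpt above, here is a minimal sketch: two spectrograms and their label vectors are linearly interpolated with a Beta-distributed weight. The Beta(0.2, 0.2) parameters are a common default, not taken from the citing paper.

```python
# Sketch of mixup on spectrograms: inputs and labels are interpolated with the same weight.
import torch

def mixup(spec_a, spec_b, labels_a, labels_b, alpha=0.2):
    lam = torch.distributions.Beta(alpha, alpha).sample()
    mixed_spec = lam * spec_a + (1 - lam) * spec_b
    mixed_labels = lam * labels_a + (1 - lam) * labels_b
    return mixed_spec, mixed_labels

spec, labels = mixup(torch.randn(64, 500), torch.randn(64, 500),
                     torch.tensor([1., 0., 0.]), torch.tensor([0., 1., 0.]))
```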

... Multimedia content such as audio and video is of particular interest in the IoS. A very recent study discussed the rebound effect and its role in making digital music streaming less sustainable than the older physical media used for the distribution of musical records [45]. However, the current largest source of Internet traffic is the sharing of videos. ...

Rebound Effects Make Digital Audio Unsustainable
  • Citing Conference Paper
  • September 2024

... We now delve deeper into these aspects by considering an application scenario referred to as perceptual sound matching. Following [26], we examine a scenario where the objective is to match a given sound x with a synthesized version x̂ in a perceptually convincing fashion. ...

Learning to Solve Inverse Problems for Perceptual Sound Matching
  • Citing Article
  • January 2024

IEEE/ACM Transactions on Audio Speech and Language Processing

... This approach led to significantly better performance in understanding bowing techniques. Wang et al. (Wang, Lostanlen, & Lagrange, 2023) also studied convolutional neural networks in the time-frequency domain and investigated the classification of five comparable real-world playing techniques from 30 instruments spanning seven octaves. They found that regions around the modulation of the playing technique could be identified which are highly relevant to the technique. ...

Explainable audio Classification of Playing Techniques with Layer-wise Relevance Propagation
  • Citing Conference Paper
  • June 2023

... MSS addresses the inherent time-frequency resolution tradeoff of magnitude spectrograms by incorporating multiple STFTs with varying time-frequency resolutions into a unified loss function [22]. However, MSS suffers from instabilities when dealing with time shifts and nonstationary behavior in signals [35], which are typically assumed to be minimal when analyzing late reverberation. ...

Mesostructures: Beyond Spectrogram Loss in Differentiable Time–Frequency Analysis
  • Citing Article
  • September 2023

Journal of the Audio Engineering Society
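For readers unfamiliar with the loss mentioned in the excerpt above, here is a minimal sketch of a multi-scale spectral (MSS) loss: magnitude spectrograms at several FFT sizes are compared between a reference and an estimate and summed into a single objective. The FFT sizes and the L1 metric are common choices, not those of the cited works.

```python
# Sketch of a multi-scale spectral (MSS) loss over several FFT sizes.
import torch

def mss_loss(estimate, reference, fft_sizes=(256, 512, 1024, 2048)):
    loss = 0.0
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=estimate.device)
        spec_est = torch.stft(estimate, n_fft, hop_length=n_fft // 4,
                              window=window, return_complex=True).abs()
        spec_ref = torch.stft(reference, n_fft, hop_length=n_fft // 4,
                              window=window, return_complex=True).abs()
        loss = loss + torch.nn.functional.l1_loss(spec_est, spec_ref)
    return loss

loss = mss_loss(torch.randn(2, 16000), torch.randn(2, 16000))
```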

... Similarly, [31]–[33] handle individual audio event clips as binary classification tasks. In [34], multi-class audio segments are framed as binary classification, with the target class as positive and others as negative. Additionally, [35] adapts the MAML algorithm for sound localization in a new room environment. ...

Learning to detect an animal sound from five examples

Ecological Informatics

... • We show that the pre-trained model can be fine-tuned in a few-shot learning setting to obtain competitive beat-tracking results. Moreover, we show that our approach yields, in most cases, better performance than Zero-Note Samba (ZeroNS) [11], which is, to the best of our knowledge, the only alternative SSL approach to this problem to date. • Furthermore, we show that our model outperforms ZeroNS in a cross-dataset generalization setting. ...

Zero-Note Samba: Self-Supervised Beat Tracking
  • Citing Article
  • January 2023

IEEE/ACM Transactions on Audio Speech and Language Processing

... Given its increasingly prominent role in shaping our understanding of nocturnal bird migration, it is vital to identify and account for potential spatial and temporal biases in weather radar datasets. Cross-validation studies can be used to determine these biases and inform methods that account for or correct biases such as range, blockage, and clutter contamination (Buler & Diehl, 2009; Van Doren et al., 2023). Studies have found that time-referenced line-transect survey estimates of bird observations and visual bird migration counts (Schmidt et al., 2017) correspond well to radar reflectivity. ...

Automated acoustic monitoring captures timing and intensity of bird migration