Romain Hennequin’s scientific contributions

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (75)


Fig. 1. Structured prediction: summing y_{θ,A,c} over rows produces a pitch-equivariant component λ_{θ,A,c}; summing y_{θ,A,c} over columns produces a pitch-invariant component µ_{θ,A,c}. Rows and columns are reversed in the figure compared to the main text due to space limitations.
S-KEY: Self-supervised Learning of Major and Minor Keys from Audio
  • Preprint
  • File available

January 2025 · 3 Reads

Yuexuan Kong · Gabriel Meseguer-Brocal · Vincent Lostanlen · [...] · Romain Hennequin

STONE, the current method in self-supervised learning for tonality estimation in music signals, cannot distinguish relative keys, such as C major versus A minor. In this article, we extend the neural network architecture and learning objective of STONE to perform self-supervised learning of major and minor keys (S-KEY). Our main contribution is an auxiliary pretext task to STONE, formulated using transposition-invariant chroma features as a source of pseudo-labels. S-KEY matches the supervised state of the art in tonality estimation on FMAKv2 and GTZAN datasets while requiring no human annotation and having the same parameter budget as STONE. We build upon this result and expand the training set of S-KEY to a million songs, thus showing the potential of large-scale self-supervised learning in music information retrieval.
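A minimal sketch of the pseudo-label idea, assuming librosa for chroma extraction: transposing audio by k semitones circularly shifts its chroma vector, so taking the magnitude of the DFT along the chroma axis yields a transposition-invariant descriptor. The feature choice and invariance trick here are illustrative assumptions, not S-KEY's exact formulation.

import numpy as np
import librosa

def transposition_invariant_chroma(audio_path, sr=22050):
    # Average chroma over time, then discard circular shifts (transpositions)
    # by keeping only the magnitude of the DFT along the 12 chroma bins.
    y, _ = librosa.load(audio_path, sr=sr)
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr)   # shape (12, n_frames)
    mean_chroma = chroma.mean(axis=1)                 # shape (12,)
    return np.abs(np.fft.fft(mean_chroma))            # transposition-invariant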


AI-Generated Music Detection and its Challenges

January 2025 · 40 Reads

In the face of a new era of generative models, the detection of artificially generated content has become a matter of utmost importance. In particular, the ability to create credible minute-long synthetic music in a few seconds on user-friendly platforms poses a real threat of fraud on streaming services and unfair competition to human artists. This paper demonstrates the possibility (and surprising ease) of training classifiers on datasets comprising real audio and artificial reconstructions, achieving a convincing accuracy of 99.8%. To our knowledge, this marks the first publication of an AI-music detector, a tool that will help in the regulation of synthetic media. Nevertheless, informed by decades of literature on forgery detection in other fields, we stress that a good test score is not the end of the story. We expose and discuss several facets that could be problematic with such a deployed detector: robustness to audio manipulation and generalisation to unseen models. This second part serves as a position statement on future research in the field and a caveat to a flourishing market of artificial content checkers.
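As a hedged illustration of how simple such a detector can be, the sketch below trains a logistic regression on precomputed audio features labeled real vs. reconstructed; the feature files and feature choice are hypothetical, and the paper's actual classifiers and datasets may differ.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# X: (n_tracks, n_features) audio features, e.g. time-averaged mel spectrograms
# y: 1 for artificial reconstructions, 0 for real recordings (hypothetical files)
X, y = np.load("features.npy"), np.load("labels.npy")

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("test accuracy:", accuracy_score(y_te, clf.predict(X_te)))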


Harnessing High-Level Song Descriptors towards Natural Language-Based Music Recommendation

Recommender systems relying on Language Models (LMs) have gained popularity in assisting users to navigate large catalogs. LMs often exploit high-level item descriptors, i.e. categories or consumption contexts, from training data or user preferences. This has been proven effective in domains like movies or products. However, in the music domain, our understanding of how effectively LMs utilize song descriptors for natural language-based music recommendation is relatively limited. In this paper, we assess LMs' effectiveness in recommending songs based on user natural language descriptions and items with descriptors like genres, moods, and listening contexts. We formulate the recommendation task as a dense retrieval problem and assess LMs as they become increasingly familiar with data pertinent to the task and domain. Our findings reveal improved performance as LMs are fine-tuned for general language similarity, information retrieval, and mapping longer descriptions to shorter, high-level descriptors in music.
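A minimal sketch of the dense-retrieval formulation, assuming a sentence-transformers model (the model name below is an arbitrary choice, not necessarily one evaluated in the paper): the user description and each song's descriptor string are embedded in the same space, and songs are ranked by cosine similarity.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative model choice

user_query = "upbeat disco-funk for a summer road trip"
song_descriptors = [
    "genre: disco, funk | mood: upbeat, groovy | context: party",
    "genre: ambient | mood: calm | context: studying",
]

q = model.encode(user_query, convert_to_tensor=True)
d = model.encode(song_descriptors, convert_to_tensor=True)
scores = util.cos_sim(q, d)[0]                    # one similarity per candidate
ranking = scores.argsort(descending=True)         # recommended song indices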




Transformers Meet ACT-R: Repeat-Aware and Sequential Listening Session Recommendation

August 2024 · 12 Reads

Music streaming services often leverage sequential recommender systems to predict the best music to showcase to users based on past sequences of listening sessions. Nonetheless, most sequential recommendation methods ignore or insufficiently account for repetitive behaviors. This is a crucial limitation for music recommendation, as repeatedly listening to the same song over time is a common phenomenon that can even change the way users perceive this song. In this paper, we introduce PISA (Psychology-Informed Session embedding using ACT-R), a session-level sequential recommender system that overcomes this limitation. PISA employs a Transformer architecture learning embedding representations of listening sessions and users using attention mechanisms inspired by Anderson's ACT-R (Adaptive Control of Thought-Rational), a cognitive architecture modeling human information access and memory dynamics. This approach enables us to capture dynamic and repetitive patterns from user behaviors, allowing us to effectively predict the songs they will listen to in subsequent sessions, whether they are repeated or new ones. We demonstrate the empirical relevance of PISA using both publicly available listening data from Last.fm and proprietary data from Deezer, a global music streaming service, confirming the critical importance of repetition modeling for sequential listening session recommendation. Along with this paper, we publicly release our proprietary dataset to foster future research in this field, as well as the source code of PISA to facilitate its future use.
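For readers unfamiliar with ACT-R, the sketch below computes the base-level activation from its declarative memory module, A_i = ln(Σ_j t_j^(-d)), where t_j is the time elapsed since the j-th past exposure to item i and d is a decay rate. PISA draws on this kind of memory dynamics to weight attention over past sessions, but the code is an illustrative sketch, not PISA's implementation.

import numpy as np

def base_level_activation(elapsed_times, decay=0.5):
    # ACT-R base-level learning equation: recent and frequent exposures keep
    # an item more "active" in memory; d = 0.5 is the conventional default.
    t = np.asarray(elapsed_times, dtype=float)
    return np.log(np.sum(t ** (-decay)))

print(base_level_activation([1, 3, 10]))   # song heard 1, 3 and 10 days ago: ~0.64
print(base_level_activation([30]))         # song heard once, 30 days ago: ~-1.70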


Figure 4. Mean all-pairs cosine similarity between each of the closed set singers' test track embeddings and: in purple (test/other), the embeddings from a random track from another singer; in red (test/val), their validation track embeddings; in green (test/vocal), their test track's vocal stem embeddings; in orange (test/instru), their test track's instrumental stem embeddings; in blue (test/test), the other embeddings from the same track. All embeddings are generated on segments with vocals.
From Real to Cloned Singer Identification

July 2024 · 35 Reads

Cloned voices of popular singers sound increasingly realistic and have gained popularity over the past few years. They however pose a threat to the industry due to personality rights concerns. As such, methods to identify the original singer in synthetic voices are needed. In this paper, we investigate how singer identification methods could be used for such a task. We present three embedding models that are trained using a singer-level contrastive learning scheme, where positive pairs consist of segments with vocals from the same singers. These segments can be mixtures for the first model, vocals for the second, and both for the third. We demonstrate that all three models are highly capable of identifying real singers. However, their performance deteriorates when classifying cloned versions of singers in our evaluation set. This is especially true for models that use mixtures as an input. These findings highlight the need to understand the biases that exist within singer identification systems, and how they can influence the identification of voice deepfakes in music.
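A minimal sketch of the singer-level contrastive scheme described above, written as an NT-Xent-style loss in PyTorch (an assumption about the loss family; the paper's exact objective and batching may differ): rows of z_a and z_b are embeddings of two vocal segments from the same singer, and all other segments in the batch serve as negatives.

import torch
import torch.nn.functional as F

def singer_contrastive_loss(z_a, z_b, temperature=0.1):
    # z_a[i] and z_b[i] come from the same singer (positive pair);
    # every other pairing in the batch is treated as a negative.
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature        # (B, B) cosine similarities
    targets = torch.arange(z_a.size(0))         # positives lie on the diagonal
    return F.cross_entropy(logits, targets)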


Figure 2. We modify the ChromaNet architecture of Figure 1 to accommodate structured prediction of key signature and mode. We apply batch normalization per mode m and a softmax over all coefficients, yielding a 12 × 2 matrix Y_θ(x). Summing Y_θ(x) over modes m yields a learned key signature profile λ_θ(x) of dimension 12; summing Y_θ(x) over chromas q yields a pitch-invariant 2-dimensional vector µ_θ(x).
Figure 4. Confusion matrices of STONE (left, 12 classes) and Semi-TONE (right, 24 classes) on FMAK, both using ω = 7. The axes correspond to model predictions and reference keys respectively, with keys arranged by proximity in the CoF and by relative mode. Deeper colors indicate more frequent occurrences relative to the total occurrences per reference key.
STONE: Self-supervised Tonality Estimator

July 2024 · 18 Reads

Although deep neural networks can estimate the key of a musical piece, their supervision incurs a massive annotation effort. Against this shortcoming, we present STONE, the first self-supervised tonality estimator. The architecture behind STONE, named ChromaNet, is a convnet with octave equivalence which outputs a key signature profile (KSP) of 12 structured logits. First, we train ChromaNet to regress artificial pitch transpositions between any two unlabeled musical excerpts from the same audio track, as measured by the cross-power spectral density (CPSD) within the circle of fifths (CoF). We observe that this self-supervised pretext task leads the KSP to correlate with tonal key signature. Based on this observation, we extend STONE to output a structured KSP of 24 logits, and introduce supervision so as to disambiguate major versus minor keys sharing the same key signature. Applying different amounts of supervision yields semi-supervised and fully supervised tonality estimators: i.e., Semi-TONEs and Sup-TONEs. We evaluate these estimators on FMAK, a new dataset of 5489 real-world musical recordings with expert annotation of 24 major and minor keys. We find that Semi-TONE matches the classification accuracy of Sup-TONE with reduced supervision and outperforms it with equal supervision.
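The structured 24-logit output described above (and in Figure 2) can be sketched as follows, assuming a (batch, 12, 2) tensor of logits over key signatures and modes: summing the softmaxed matrix over modes recovers the 12-dimensional key signature profile, and summing over chromas recovers a 2-dimensional mode vector. This is a sketch of the marginalization only, not the released STONE code.

import torch

def structured_key_heads(logits):
    # logits: (batch, 12, 2) -- 12 key signatures x 2 modes.
    B = logits.shape[0]
    Y = torch.softmax(logits.reshape(B, -1), dim=1).reshape(B, 12, 2)
    key_signature_profile = Y.sum(dim=2)   # lambda_theta(x): (batch, 12)
    mode_vector = Y.sum(dim=1)             # mu_theta(x): (batch, 2)
    return key_signature_profile, mode_vector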


Figure 1: Memorization results with the LOSS (Perplexity) attack on PDNC and the unseen novel.
A Realistic Evaluation of LLMs for Quotation Attribution in Literary Texts: A Case Study of LLaMa3

June 2024 · 15 Reads

The zero-shot and few-shot performance of Large Language Models (LLMs) is subject to memorization and data contamination, complicating the assessment of their validity. In literary tasks, the performance of LLMs is often correlated with the degree of book memorization. In this work, we carry out a realistic evaluation of LLMs for quotation attribution in novels, taking the instruction fine-tuned version of Llama3 as an example. We design a task-specific memorization measure and use it to show that Llama3's ability to perform quotation attribution is positively correlated with the degree of memorization of the novel. However, Llama3 still performs impressively well on books it has neither memorized nor seen. Data and code will be made publicly available.
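As a hedged sketch of a LOSS/perplexity-style memorization signal (the paper's task-specific measure is different and more targeted), one can compare a model's per-token loss on passages from a candidate book: unusually low loss suggests the passage may have been memorized. The model name below is illustrative.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "meta-llama/Meta-Llama-3-8B-Instruct"   # illustrative checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")

def passage_loss(text):
    # Mean negative log-likelihood per token; lower values on held-out book
    # passages hint at memorization.
    ids = tok(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()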


Improving Quotation Attribution with Fictional Character Embeddings

June 2024 · 22 Reads

Humans naturally attribute utterances of direct speech to their speaker in literary works. When attributing quotes, we process contextual information but also access mental representations of characters that we build and revise throughout the narrative. Recent methods to automatically attribute such utterances have explored simulating human logic with deterministic rules or learning new implicit rules with neural networks when processing contextual information. However, these systems inherently lack character representations, which often leads to errors on more challenging examples of attribution: anaphoric and implicit quotes. In this work, we propose to augment a popular quotation attribution system, BookNLP, with character embeddings that encode global information about characters. To build these embeddings, we create DramaCV, a corpus of English drama plays from the 15th to the 20th century focused on Character Verification (CV), a task similar to Authorship Verification (AV) that aims at analyzing fictional characters. We train a model similar to the recently proposed AV model Universal Authorship Representation (UAR) on this dataset, showing that it outperforms concurrent methods of character embedding on the CV task and generalizes better to literary novels. Then, through an extensive evaluation on 22 novels, we show that combining BookNLP's contextual information with our proposed global character embeddings improves the identification of speakers for anaphoric and implicit quotes, reaching state-of-the-art performance. Code and data will be made publicly available.
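A hedged sketch of how global character embeddings might be fused with a contextual quote representation for speaker scoring, assuming both are plain vectors; the bilinear fusion below is an illustrative choice, not necessarily the one used in the paper's BookNLP extension.

import torch
import torch.nn as nn

class SpeakerScorer(nn.Module):
    # Scores each candidate character for a quote via a bilinear interaction
    # between the quote's contextual representation and the character embedding.
    def __init__(self, ctx_dim, char_dim):
        super().__init__()
        self.bilinear = nn.Bilinear(ctx_dim, char_dim, 1)

    def forward(self, quote_ctx, character_embs):
        # quote_ctx: (ctx_dim,), character_embs: (n_candidates, char_dim)
        ctx = quote_ctx.unsqueeze(0).repeat(character_embs.size(0), 1)
        return self.bilinear(ctx, character_embs).squeeze(-1)   # (n_candidates,)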


Citations (33)


... While we used first names as a proxy for estimating the percentage of oral references, several studies reported that women are more often addressed by their first name than men, suggesting this descriptor may underestimate oral references to men [32]. Large differences between speech and face indicators in entertainment, which contains large amounts of music and excludes singing voice from WSR, highlight the need for speaker gender prediction systems suited to the analysis of singing voices: a task known to be harder [33]. Lastly, our methodological framework required categorizing individuals into stereotypical binary genders. ...

Reference:

Gender Representation in TV and Radio: Automatic Information Extraction methods versus Manual Analyses
STraDa: A Singer Traits Dataset
  • Citing Conference Paper
  • September 2024

... To capture users' consumption diversity we construct an artist-based embedding space by applying SVD to a popularity-normalised artist-artist within-session co-occurrence matrix M. We choose to apply SVD over other more "state-of-the-art" embedding models since it is both more interpretable and reproducible [26]. Furthermore, it is worth highlighting that SVD has also been shown to be closely connected to the Skip-Gram Negative Sampling (SGNS) variant of the word2vec model [27]. ...

Of Spiky SVDs and Music Recommendation
  • Citing Conference Paper
  • September 2023
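The excerpt above describes building artist embeddings via an SVD of a popularity-normalised within-session co-occurrence matrix; the sketch below shows one way to do this with SciPy, where the symmetric square-root normalisation and embedding scaling are assumptions for illustration rather than the cited paper's exact choices.

import numpy as np
from scipy.sparse.linalg import svds

# M: artist-artist within-session co-occurrence counts (dense here for brevity;
# sparse in practice). Popularity-normalise, then take a low-rank SVD.
M = np.random.poisson(1.0, size=(500, 500)).astype(float)
popularity = M.sum(axis=1, keepdims=True) + 1e-9
M_norm = M / np.sqrt(popularity) / np.sqrt(popularity.T)

U, S, Vt = svds(M_norm, k=64)                 # rank-64 factorisation
artist_embeddings = U * np.sqrt(S)            # one 64-d vector per artist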

... Building upon prior studies on music recommendation [44,53,61], we consider the declarative module of the ACT-R framework [2] to model relistening behaviors. This module is responsible for modeling the dynamics of information activation and forgetting of the human memory. ...

Ex2Vec: Characterizing Users and Items from the Mere Exposure Effect
  • Citing Conference Paper
  • September 2023

... They are categorized into (1) objective metadata produced when a track is registered on the platform, (2) subjective similarity with music entities, (3) user & listening context, and (4) music content information (Table 3). Metadata is mainly associated with entity recognition [19], where Track refers to requests for a single music recording entity, Artist denotes user requests for tracks released by a specific artist, Year reflects the era/year in which a piece of music was released, and Popularity indicates the level of attention a piece of music has received. Culture is related to the national or regional style of the music, often linked to the artist's nationality. ...

A Human Subject Study of Named Entity Recognition in Conversational Music Recommendation Queries
  • Citing Conference Paper
  • January 2023

... In music however, our understanding of how efficiently LMs utilize item textual features and user preference description for music recommendation is comparatively limited. Previous research has emphasized the significance of natural language features such as tags in improving music retrieval (Doh et al., 2023b; Wu* et al., 2023) and captioning algorithms (Gabbolini et al., 2022; Doh et al., 2023a). These accessible descriptors help bridge the semantic gap between audio and more complex song descriptions provided by humans (Celma Herrada et al., 2006). ...

Data-Efficient Playlist Captioning With Musical and Linguistic Knowledge
  • Citing Conference Paper
  • January 2022

... The gist of SSL is to formulate a pretext task; i.e., one in which the correct answer may be inexpensively obtained from audio data. While some SSL systems have general-purpose pretext tasks and require supervised fine-tuning [5]–[8], others are tailored for specific downstream tasks: e.g., the estimation of pitch [9], [10], tempo [11], [12], beat [13], drumming patterns [14], and structure [15]. ...

Zero-Note Samba: Self-Supervised Beat Tracking
  • Citing Article
  • January 2023

IEEE/ACM Transactions on Audio Speech and Language Processing

... For a deeper analysis of the impact of explicit temporal information on multi-modal SR, we test five variants of the time embeddings fed to the Temporal MoE: besides (1) Interval Only, which inputs only the interval embeddings, and (2) Absolute Only, which inputs only the absolute timestamp embeddings, we also try three widely-used types of time embeddings [9,17,42], including position-wise sums of the two. Figure 5 shows the results of all variants. Though all types of time embeddings can facilitate user preference learning, our time encoding method achieves larger gains, which is attributed to the combination of interval information and absolute timestamps. ...

Attention Mixtures for Time-Aware Sequential Recommendation
  • Citing Conference Paper
  • July 2023

... As Liew et al. (2023) point out, music generated from culture provides relevant information when conducting cross-cultural studies. These studies reveal musical characteristics in relation to cultural psychological procedures. ...

Groovin’ to the Cultural Beat: Preferences for Danceable Music Represent Cultural Affordances for High-Arousal Negative Emotions

Psychology of Aesthetics Creativity and the Arts

... Repeated reports suggest that linguistic and geographical distance (Bello and Garcia, 2021; Jang et al. 2023; Terroso-Saenz et al. 2023; Way et al. 2020) affect cross-cultural relationships. These factors and patterns are observed across diverse platforms and content types (Liew et al. 2022; Taneja and Webster, 2016). Linguistic barriers may diminish with translation, and so the effectiveness of subtitling and dubbing strategies in making content accessible to international audiences has also been examined (Borell, 2000). ...

Network Analyses for Cross-Cultural Music Popularity