Emmanouil Benetos

Queen Mary, University of London · School of Electronic Engineering and Computer Science

PhD; MSc; BSc; BMus; FHEA

About

216 Publications · 72,293 Reads · 4,965 Citations
Introduction
I am currently Reader in Machine Listening at Queen Mary University of London and Turing Fellow at The Alan Turing Institute. Within Queen Mary, I am a member of the Centre for Digital Music, the Centre for Intelligent Sensing, and the Institute of Applied Data Science, and I co-lead the School's Machine Listening Lab. My main research topic is computational audio analysis, applied to music, urban, everyday and nature sounds. Website: http://www.eecs.qmul.ac.uk/~emmanouilb/
Additional affiliations
September 2015 - March 2020
Queen Mary, University of London
Position
  • Professor (Associate)
January 2013 - March 2015
City, University of London
Position
  • University Research Fellow
September 2009 - December 2012
Queen Mary, University of London
Education
September 2009 - October 2012
Queen Mary, University of London
Field of study
  • Electronic Engineering
September 2005 - July 2007
Aristotle University of Thessaloniki
Field of study
  • Informatics - Digital Media
September 2001 - July 2005
Aristotle University of Thessaloniki
Field of study
  • Informatics

Publications (216)
Preprint
Full-text available
Transformers have set new benchmarks in audio processing tasks, leveraging self-attention mechanisms to capture complex patterns and dependencies within audio data. However, their focus on pairwise interactions limits their ability to process the higher-order relations essential for identifying distinct audio objects. To address this limitation, th...
Article
Full-text available
We introduce Label-Combination Prototypical Networks (LC-Protonets) to address the problem of multi-label few-shot classification, where a model must generalize to new classes based on only a few available examples. Extending Prototypical Networks, LC-Protonets generate one prototype per label combination, derived from the power set of labels prese...
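
To make the label-combination idea concrete, here is a minimal sketch of how LC-Protonets' prototypes could be formed: every label combination drawn from the power set of the labels present in the support set gets one prototype, averaged over the support embeddings whose label sets contain it. All function and variable names are illustrative, not taken from the paper's code.

```python
from itertools import combinations
import numpy as np

def lc_prototypes(embeddings, label_sets):
    # embeddings: (n, d) support-example embeddings.
    # label_sets: list of n frozensets of (sortable) labels.
    # Enumerate every combination in the power set of each example's labels,
    # then average the embeddings of all examples containing that combination.
    combos = set()
    for ls in label_sets:
        for r in range(1, len(ls) + 1):
            combos.update(frozenset(c) for c in combinations(sorted(ls), r))
    return {c: embeddings[[c <= ls for ls in label_sets]].mean(axis=0)
            for c in combos}

def classify(query, protos):
    # A query inherits the label combination of its nearest prototype.
    return min(protos, key=lambda c: np.linalg.norm(query - protos[c]))
```
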
Preprint
Full-text available
Audio production style transfer is the task of processing an input to impart stylistic elements from a reference recording. Existing approaches often train a neural network to estimate control parameters for a set of audio effects. However, these approaches are limited in that they can only control a fixed set of effects, where the effects must be...
Preprint
Full-text available
This paper introduces GraFPrint, an audio identification framework that leverages the structural learning capabilities of Graph Neural Networks (GNNs) to create robust audio fingerprints. Our method constructs a k-nearest neighbor (k-NN) graph from time-frequency representations and applies max-relative graph convolutions to encode local and global...
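
As a rough illustration of the two ingredients named in the abstract — a k-NN graph over time-frequency patches and max-relative graph convolution — here is a numpy sketch. The aggregation follows the common max-relative formulation (each node concatenated with the element-wise max of neighbour differences); the names and shapes are assumptions, not GraFPrint's actual code.

```python
import numpy as np

def knn_graph(feats, k=8):
    # feats: (n, d) node features, e.g. embeddings of time-frequency patches.
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)               # exclude self-loops
    return np.argsort(d2, axis=1)[:, :k]       # (n, k) neighbour indices

def max_relative_conv(feats, nbrs, W):
    # Max-relative aggregation: concatenate each node's feature with the
    # element-wise max of (neighbour - node) differences, then apply a
    # learned projection W of shape (2d, d_out) followed by a ReLU.
    diff = feats[nbrs] - feats[:, None, :]     # (n, k, d)
    agg = diff.max(axis=1)                     # (n, d)
    return np.maximum(np.concatenate([feats, agg], axis=1) @ W, 0.0)
```
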
Preprint
Full-text available
Recent advancements in multimodal large language models (MLLMs) have aimed to integrate and interpret data across diverse modalities. However, the capacity of these models to concurrently process and reason about multiple modalities remains inadequately explored, partly due to the lack of comprehensive modality-wise benchmarks. We introduce OmniBen...
Preprint
Full-text available
We introduce Label-Combination Prototypical Networks (LC-Protonets) to address the problem of multi-label few-shot classification, where a model must generalize to new classes based on only a few available examples. Extending Prototypical Networks, LC-Protonets generate one prototype per label combination, derived from the power set of labels prese...
Preprint
Full-text available
Passive acoustic monitoring (PAM) is crucial for bioacoustic research, enabling non-invasive species tracking and biodiversity monitoring. Citizen science platforms like Xeno-Canto provide large annotated datasets from focal recordings, where the target species is intentionally recorded. However, PAM requires monitoring in passive soundscapes, crea...
Preprint
Full-text available
Acoustic identification of individual animals (AIID) is closely related to audio-based species classification but requires a finer level of detail to distinguish between individual animals within the same species. In this work, we frame AIID as a hierarchical multi-label classification task and propose the use of hierarchy-aware loss functions to l...
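
One simple way to make a multi-label loss hierarchy-aware, in the spirit described here, is to derive species-level targets from individual-level ones and penalise errors at both levels. This is a hedged sketch, not the paper's exact loss; `ind_to_sp` is an assumed 0/1 mapping matrix.

```python
import torch
import torch.nn.functional as F

def hierarchical_bce(logits_ind, logits_sp, y_ind, ind_to_sp, w_sp=0.5):
    # y_ind: (batch, n_individuals) multi-hot float targets.
    # ind_to_sp: (n_individuals, n_species) 0/1 matrix mapping each
    # individual to its species; any active individual implies its
    # species is active, so species targets follow from y_ind.
    y_sp = (y_ind @ ind_to_sp).clamp(max=1.0)
    loss_ind = F.binary_cross_entropy_with_logits(logits_ind, y_ind)
    loss_sp = F.binary_cross_entropy_with_logits(logits_sp, y_sp)
    return loss_ind + w_sp * loss_sp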
Preprint
Full-text available
In recent years, foundation models (FMs) such as large language models (LLMs) and latent diffusion models (LDMs) have profoundly impacted diverse sectors, including music. This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music, spanning from representation learning, generative learning and multi...
Preprint
Full-text available
Multimodal models that jointly process audio and language hold great promise in audio understanding and are increasingly being adopted in the music domain. By allowing users to query via text and obtain information about a given audio input, these models have the potential to enable a variety of music understanding tasks via language-based interfac...
Preprint
Full-text available
Symbolic Music, akin to language, can be encoded in discrete symbols. Recent research has extended the application of large language models (LLMs) such as GPT-4 and Llama2 to the symbolic music domain including understanding and generation. Yet scant research explores the details of how these LLMs perform on advanced music understanding and conditi...
Preprint
Full-text available
Multi-instrument music transcription aims to convert polyphonic music recordings into musical scores assigned to each instrument. This task is challenging for modeling as it requires simultaneously identifying multiple instruments and transcribing their pitch and precise timing, and the lack of fully annotated data adds to the training difficulties...
Preprint
Full-text available
The evolution of music, language, and cooperation have been debated since before Darwin. The social bonding hypothesis proposes that these phenomena may be interlinked: musicality may have facilitated the evolution of group cooperation beyond the possibilities of spoken language. Although dozens of experimental studies have shown that synchronised...
Preprint
Full-text available
Large Language Models (LLMs) have made great strides in recent years to achieve unprecedented performance across different tasks. However, due to commercial interest, the most competitive models like GPT, Gemini, and Claude have been gated behind proprietary interfaces without disclosing the training details. Recently, many institutions have open-s...
Article
Full-text available
Both music and language are found in all known human societies, yet no studies have compared similarities and differences between song, speech, and instrumental music on a global scale. In this Registered Report, we analyzed two global datasets: (i) 300 annotated audio recordings representing matched sets of traditional songs, recited lyrics, conve...
Article
Algorithms for automatic piano transcription have improved dramatically in recent years due to new datasets and modeling techniques. Recent developments have focused primarily on adapting new neural network architectures, such as the Transformer and Perceiver, in order to yield more accurate systems. In this work, we study transcription systems fro...
Article
Deep learning models such as CNNs and Transformers have achieved impressive performance for end-to-end audio tagging. Recent works have shown that despite stacking multiple layers, the receptive field of CNNs remains severely limited. Transformers on the other hand are able to map global context through self-attention, but treat the spectrogram as...
Preprint
Full-text available
This paper proposes an "LSTM (Long Short-Term Memory) Ensemble" technique for building a regression model to predict the Remaining Useful Life (RUL) of aircraft engines along with uncertainty quantification, utilising the well-known run-to-failure turbo engine degradation dataset. It addresses the overlooked yet crucial aspect of uncertainty es...
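
A minimal PyTorch sketch of the idea: several independently trained LSTM regressors whose spread at prediction time doubles as an uncertainty estimate. Architecture details and names are assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class RULRegressor(nn.Module):
    # One ensemble member: an LSTM over sensor windows -> scalar RUL estimate.
    def __init__(self, n_sensors, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_sensors, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                  # x: (batch, time, n_sensors)
        out, _ = self.lstm(x)
        return self.head(out[:, -1]).squeeze(-1)

def ensemble_predict(models, x):
    # Ensemble mean is the RUL estimate; the standard deviation across
    # members serves as a simple uncertainty estimate.
    with torch.no_grad():
        preds = torch.stack([m(x) for m in models])
    return preds.mean(0), preds.std(0)
```
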
Preprint
Full-text available
Self-supervised learning (SSL) has shown promising results in various speech and natural language processing applications. However, its efficacy in music information retrieval (MIR) still remains largely unexplored. While previous SSL models pre-trained on music recordings may have been mostly closed-sourced, recent speech models such as wav2vec2.0...
Preprint
Full-text available
We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic lyrics transcription method achieving state-of-the-art performance on various lyrics transcription datasets, even in challenging genres such as rock and metal. Our novel, training-free approach utilizes Whisper, a weakly supervised robust speech recognition model, and GPT-4, to...
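
The training-free pipeline can be approximated with the openly released Whisper package; the LLM post-processing stage is only indicated in a comment, since the paper's exact prompting is not reproduced here.

```python
import whisper  # pip install openai-whisper (also requires ffmpeg)

# Stage 1: weakly supervised ASR produces a candidate lyric transcription.
model = whisper.load_model("large")
result = model.transcribe("song.mp3")   # a vocals-heavy recording
raw_lyrics = result["text"]

# Stage 2 (omitted here): LyricWhiz passes outputs like raw_lyrics from
# multiple decoding runs to an LLM such as GPT-4, which selects and cleans
# the final transcription.
```
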
Preprint
Full-text available
Self-supervised learning (SSL) has recently emerged as a promising paradigm for training generalisable models on large-scale data in the fields of vision, text, and speech. Although SSL has been proven effective in speech and audio, its application to music audio has yet to be thoroughly explored. This is primarily due to the distinctive challenges...
Preprint
Full-text available
We presented the Treff adapter, a training-efficient adapter for CLAP, to boost zero-shot classification performance by making use of a small set of labelled data. Specifically, we designed CALM to retrieve the probability distribution of text-audio clips over classes using a set of audio-label pairs and combined it with CLAP's zero-shot classifica...
Poster
Full-text available
This work-in-progress presents a compact toolkit for reproducing the training of T5-based multi-instrument and multi-track transcription models on multiple MIR tasks. We plan to release our code in early 2023 at https://github.com/mimbres/YourMT3.
Preprint
Both music and language are found in all known human societies, yet no studies have compared similarities and differences between song, speech, and instrumental music on a global scale. In this Registered Report, we analyzed two global datasets: 1) 300 annotated audio recordings representing matched sets of traditional songs, recited lyrics, conver...
Preprint
Full-text available
Learning music representations that are general-purpose offers the flexibility to finetune several downstream tasks using smaller datasets. The wav2vec 2.0 speech representation model showed promising results in many downstream speech tasks, but has been less effective when adapted to music. In this paper, we evaluate whether pre-training wav2vec 2...
Preprint
Full-text available
As one of the most intuitive interfaces known to humans, natural language has the potential to mediate many tasks that involve human-computer interaction, especially in application-focused fields like Music Information Retrieval. In this work, we explore cross-modal learning in an attempt to bridge audio and language in the music domain. To this en...
Preprint
Full-text available
Loss-gradients are used to interpret the decision making process of deep learning models. In this work, we evaluate loss-gradient based attribution methods by occluding parts of the input and comparing the performance of the occluded input to the original input. We observe that the occluded input has better performance than the original across the...
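
The evaluation protocol described here can be sketched in a few lines of PyTorch: rank input elements by the magnitude of the loss gradient, occlude the most-attributed fraction, and compare the loss on the occluded input against the original. This is an illustrative reconstruction, not the authors' code.

```python
import torch
import torch.nn.functional as F

def occlusion_check(model, x, y, frac=0.2):
    # Rank input elements by |d loss / d input| and zero out the top
    # fraction; return the loss before and after occlusion.
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    attr = x.grad.abs().flatten(1)              # (batch, n_elements)
    k = int(frac * attr.shape[1])
    idx = attr.topk(k, dim=1).indices           # most-attributed positions
    x_occ = x.detach().clone().flatten(1)
    x_occ.scatter_(1, idx, 0.0)                 # occlude by zeroing
    x_occ = x_occ.view_as(x)
    with torch.no_grad():
        loss_occ = F.cross_entropy(model(x_occ), y)
    return loss.item(), loss_occ.item()
```
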
Preprint
Full-text available
Imitating musical instruments with the human voice is an efficient way of communicating ideas between music producers, from sketching melody lines to clarifying desired sonorities. For this reason, there is an increasing interest in building applications that allow artists to efficiently pick target samples from big sound libraries just by imitatin...
Preprint
Most recent research about automatic music transcription (AMT) uses convolutional neural networks and recurrent neural networks to model the mapping from music signals to symbolic notation. Based on a high-resolution piano transcription system, we explore the possibility of incorporating another powerful sequence transformation tool -- the Transfor...
Preprint
Full-text available
This paper introduces a comparison of deep learning-based techniques for the MOS prediction task of synthesised speech in the Interspeech VoiceMOS challenge. Using the data from the main track of the VoiceMOS challenge we explore both existing predictors and propose new ones. We evaluate two groups of models: NISQA-based models and techniques based...
Article
Full-text available
We propose a new measure of national valence based on the emotional content of a country’s most popular songs. We first trained a machine learning model using 191 different audio features embedded within music and used this model to construct a long-run valence index for the UK. This index correlates strongly and significantly with survey-based life...
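
A stripped-down sketch of the index construction under stated assumptions: a regressor maps the 191 audio features to valence, and the yearly index is the mean predicted valence over that year's most popular songs. The regressor choice, argument names, and data placeholders are all hypothetical.

```python
from sklearn.ensemble import GradientBoostingRegressor

def build_valence_index(X_train, y_train, features_by_year):
    # X_train: (songs, 191) audio features; y_train: valence labels.
    # features_by_year: {year: (top_songs, 191) array} of chart hits.
    model = GradientBoostingRegressor().fit(X_train, y_train)
    # The yearly index is the mean predicted valence of that year's hits.
    return {year: float(model.predict(feats).mean())
            for year, feats in features_by_year.items()}
```
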
Preprint
Full-text available
Audio representations for music information retrieval are typically learned via supervised learning in a task-specific fashion. Although effective at producing state-of-the-art results, this scheme lacks flexibility with respect to the range of applications a model can have and requires extensively annotated datasets. In this work, we pose the ques...
Article
Full-text available
Music transcription is a process of creating a notation of musical sounds. It has been used as a basis for the analysis of music from a wide variety of cultures. Recent decades have seen an increasing amount of engineering research within the field of Music Information Retrieval that aims at automatically obtaining music transcriptions in Western s...
Preprint
Full-text available
Sound scene geotagging is a new topic of research which has evolved from acoustic scene classification. It is motivated by the idea of audio surveillance: beyond merely describing the scene in a recording, a machine that can locate where the recording was captured would be of use to many. In this paper we explore a series of common audio dat...
Preprint
Full-text available
Animal vocalisations contain important information about health, emotional state, and behaviour, and thus can potentially be used for animal welfare monitoring. Motivated by the spectro-temporal patterns of chick calls in the time-frequency domain, in this paper we propose an automatic system for chick call recognition using the joint time-frequenc...
Preprint
Full-text available
Non-intrusive speech quality assessment is a crucial operation in multimedia applications. The scarcity of annotated data and the lack of a reference signal represent some of the main challenges for designing efficient quality assessment metrics. In this paper, we propose two multi-task models to tackle the problems above. In the first model, we fi...
Preprint
Full-text available
This paper proposes a deep convolutional neural network for performing note-level instrument assignment. Given a polyphonic multi-instrumental music signal along with its ground truth or predicted notes, the objective is to assign an instrumental source for each note. This problem is addressed as a pitch-informed classification task where each note...
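
As a hedged illustration of pitch-informed note classification, the sketch below crops a spectrogram patch around each note's onset and pitch and feeds it to a small CNN; layer sizes and names are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class NoteInstrumentCNN(nn.Module):
    # Classifies the instrument of a single note from a spectrogram patch
    # cropped around the note's onset time and pitch.
    def __init__(self, n_instruments):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, n_instruments),
        )

    def forward(self, patch):              # patch: (batch, 1, freq, time)
        return self.net(patch)

def crop_patch(spec, onset_frame, pitch_bin, f=48, t=32):
    # Pitch-informed crop: a window of the spectrogram centred on the
    # note's pitch bin, starting at its onset frame.
    f0 = max(0, pitch_bin - f // 2)
    return spec[f0:f0 + f, onset_frame:onset_frame + t]
```
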
Preprint
Cross-cultural musical analysis requires standardized symbolic representation of sounds such as score notation. However, transcription into notation is usually conducted manually by ear, which is time-consuming and subjective. Our aim is to evaluate the reliability of existing methods for transcribing songs from diverse societies. We had 3 experts...
Chapter
The field of Music Information Retrieval (MIR) focuses on creating methods and practices for making sense of music data from various modalities, including audio, video, images, scores and metadata [54]. Within MIR, a core problem which to this day remains open is Automatic Music Transcription (AMT), the process of automatically converting an acousti...
Conference Paper
Full-text available
Recent advances in automatic music transcription (AMT) have achieved highly accurate polyphonic piano transcription results by incorporating onset and offset detection. The existing literature, however, focuses mainly on leveraging deep and complex models to achieve state-of-the-art (SOTA) accuracy, without understanding model behaviour. In th...
Preprint
Full-text available
Content-based music information retrieval has seen rapid progress with the adoption of deep learning. Current approaches to high-level music description typically make use of classification models, such as in auto-tagging or genre and mood classification. In this work, we propose to address music description via audio captioning, defined as the tas...
Preprint
Full-text available
Recent advances in automatic music transcription (AMT) have achieved highly accurate polyphonic piano transcription results by incorporating onset and offset detection. The existing literature, however, focuses mainly on leveraging deep and complex models to achieve state-of-the-art (SOTA) accuracy, without understanding model behaviour. In th...
Preprint
Full-text available
This paper addresses the problem of domain adaptation for the task of music source separation. Using datasets from two different domains, we compare the performance of a deep learning-based harmonic-percussive source separation model under different training scenarios, including supervised joint training using data from both domains and pre-trainin...
Article
Full-text available
This paper addresses the problem of domain adaptation for the task of music source separation. Using datasets from two different domains, we compare the performance of a deep learning-based harmonic-percussive source separation model under different training scenarios, including supervised joint training using data from both domains and pre-trainin...
Conference Paper
Full-text available
Objective audio quality assessment is preferred to avoid time-consuming and costly listening tests. The development of objective quality metrics depends on the availability of datasets appropriate to the application under study. Currently, a suitable human-annotated dataset for developing quality metrics in archive audio is missing. Given the onlin...