Mark D. Plumbley

University of Surrey · Centre for Vision, Speech and Signal Processing (CVSSP)

About

432
Publications
97,097
Reads
10,298
Citations

Publications (432)
Preprint
Full-text available
This paper presents a low-complexity framework for acoustic scene classification (ASC). Most frameworks designed for ASC use convolutional neural networks (CNNs) due to their learning ability and improved performance compared to hand-engineered features. However, CNNs are resource-hungry due to their large size and high computational complex...
Preprint
Full-text available
Few-shot audio event detection is a task that detects the occurrence time of a novel sound class given a few examples. In this work, we propose a system based on segment-level metric learning for the DCASE 2022 challenge of few-shot bioacoustic event detection (task 5). We make better utilization of the negative data within each sound class to buil...
Preprint
Continuously learning new classes without catastrophic forgetting is a challenging problem for on-device environmental sound classification given the restrictions on computation resources (e.g., model size, running memory). To address this issue, we propose a simple and efficient continual learning method. Our method selects the historical data for...
Preprint
Full-text available
Few-shot bioacoustic event detection is a task that detects the occurrence time of a novel sound given a few examples. Previous methods employ metric learning to build a latent space with the labeled part of different sound classes, also known as positive events. In this study, we propose a segment-level few-shot learning framework that utilizes bo...
Preprint
Automated audio captioning is a cross-modal translation task that aims to generate natural language descriptions for given audio clips. This task has received increasing attention with the release of freely available datasets in recent years. The problem has been addressed predominantly with deep learning techniques. Numerous approaches have been p...
Preprint
We present a method to develop low-complexity convolutional neural networks (CNNs) for acoustic scene classification (ASC). The large size and high computational complexity of typical CNNs is a bottleneck for their deployment on resource-constrained devices. We propose a passive filter pruning framework, where a few convolutional filters from the C...
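The exact pruning criterion is not given in this snippet; as an illustration of the general idea of removing low-importance convolutional filters, the following is a minimal sketch assuming a generic magnitude (L1-norm) ranking, which is a common baseline for filter pruning:

```python
import numpy as np

def prune_filters(conv_weights, keep_ratio=0.5):
    """Rank convolutional filters by L1 norm and keep the top fraction.

    conv_weights: array of shape (num_filters, in_channels, kh, kw).
    Returns the sorted indices of the filters to keep.
    """
    norms = np.abs(conv_weights).reshape(conv_weights.shape[0], -1).sum(axis=1)
    num_keep = max(1, int(len(norms) * keep_ratio))
    keep = np.argsort(norms)[::-1][:num_keep]
    return np.sort(keep)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 3, 3, 3))   # toy conv layer with 8 filters
kept = prune_filters(w, keep_ratio=0.25)
print(kept)  # indices of the two highest-norm filters
```

After pruning, the remaining filters would typically be fine-tuned to recover accuracy; the keep ratio trades model size against performance.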
Preprint
Audio-text retrieval aims at retrieving a target audio clip or caption from a pool of candidates given a query in another modality. Solving such cross-modal retrieval task is challenging because it not only requires learning robust feature representations for both modalities, but also requires capturing the fine-grained alignment between these two...
Preprint
Full-text available
In this paper, we introduce the task of language-queried audio source separation (LASS), which aims to separate a target source from an audio mixture based on a natural language query of the target source (e.g., "a man tells a joke followed by people laughing"). A unique challenge in LASS is associated with the complexity of natural language descri...
Preprint
Polyphonic sound event localization and detection (SELD) aims at detecting types of sound events with corresponding temporal activities and spatial locations. In this paper, a track-wise ensemble event independent network with a novel data augmentation method is proposed. The proposed model is based on our previous proposed Event-Independent Networ...
Preprint
Acoustic scene classification (ASC) aims to classify an audio clip based on the characteristics of the recording environment. In this regard, deep learning based approaches have emerged as a useful tool for ASC problems. Conventional approaches to improving the classification accuracy include integrating auxiliary methods such as attention mechanism...
Preprint
Full-text available
Audio captioning aims at using natural language to describe the content of an audio clip. Existing audio captioning systems are generally based on an encoder-decoder architecture, in which acoustic information is extracted by an audio encoder and then a language decoder is used to generate the captions. Training an audio captioning system often enc...
Preprint
Audio captioning aims at generating natural language descriptions for audio clips automatically. Existing audio captioning models have shown promising improvement in recent years. However, these models are mostly trained via maximum likelihood estimation (MLE), which tends to make captions generic, simple and deterministic. As different people may d...
Preprint
Full-text available
The availability of audio data on sound sharing platforms such as Freesound gives users access to large amounts of annotated audio. Utilising such data for training is becoming increasingly popular, but the problem of label noise that is often prevalent in such datasets requires further investigation. This paper introduces ARCA23K, an Automatically...
Article
Imagine standing on a street corner in the city. With your eyes closed you can hear and recognize a succession of sounds: cars passing by, people speaking, their footsteps when they walk by, and the continuous falling of rain. The recognition of all these sounds and interpretation of the perceived scene as a city street soundscape comes naturally t...
Preprint
Automated audio captioning aims to use natural language to describe the content of audio data. This paper presents an audio captioning system with an encoder-decoder architecture, where the decoder predicts words based on audio features extracted by the encoder. To improve the proposed system, transfer learning from either an upstream audio-related...
Preprint
Full-text available
The goal of Unsupervised Anomaly Detection (UAD) is to detect anomalous signals under the condition that only non-anomalous (normal) data is available beforehand. In UAD under Domain-Shift Conditions (UAD-S), data is further exposed to contextual changes that are usually unknown beforehand. Motivated by the difficulties encountered in the UAD-S tas...
Preprint
Automated Audio captioning (AAC) is a cross-modal translation task that aims to use natural language to describe the content of an audio clip. As shown in the submissions received for Task 6 of the DCASE 2021 Challenges, this problem has received increasing interest in the community. The existing AAC systems are usually based on an encoder-decoder...
Preprint
Deep generative models have recently achieved impressive performance in speech synthesis and music generation. However, compared to the generation of those domain-specific sounds, the generation of general sounds (such as car horn, dog barking, and gun shot) has received less attention, despite their wide potential applications. In our previous wor...
Preprint
Audio captioning aims to automatically generate a natural language description of an audio clip. Most captioning models follow an encoder-decoder architecture, where the decoder predicts words based on the audio features extracted by the encoder. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are often used as the audio e...
Preprint
The goal of automatic sound event detection (SED) methods is to recognize what is happening in an audio signal and when it is happening. In practice, the goal is to recognize at what temporal instances different sounds are active within an audio signal. This paper gives a tutorial presentation of sound event detection, including its definition, sig...
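The "what and when" formulation above can be made concrete: a typical SED system produces frame-wise event probabilities, which are thresholded and converted to onset/offset times. This is a minimal sketch of that post-processing step (the threshold and hop size are illustrative assumptions, not values from the paper):

```python
import numpy as np

def probs_to_events(frame_probs, threshold=0.5, hop_seconds=0.02):
    """Convert frame-wise probabilities for one event class into
    a list of (onset, offset) times in seconds."""
    active = frame_probs >= threshold
    # Rising/falling edges of the binary activity curve mark onsets/offsets.
    edges = np.diff(active.astype(int), prepend=0, append=0)
    onsets = np.where(edges == 1)[0]
    offsets = np.where(edges == -1)[0]
    return [(on * hop_seconds, off * hop_seconds)
            for on, off in zip(onsets, offsets)]

probs = np.array([0.1, 0.2, 0.8, 0.9, 0.7, 0.1, 0.6, 0.6, 0.2])
events = probs_to_events(probs)
print(events)  # two active segments detected
```

Real systems usually add median filtering or minimum-duration rules on top of this to suppress spurious short activations.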
Preprint
Full-text available
Federated learning (FL) is a privacy-preserving machine learning method that has been proposed to allow training of models using data from many different clients, without these clients having to transfer all their data to a central server. There has as yet been relatively little consideration of FL or other privacy-preserving methods in audio. In t...
Article
Energy disaggregation, a.k.a. Non-Intrusive Load Monitoring, aims to separate the energy consumption of individual appliances from the readings of a mains power meter measuring the total energy consumption of, e.g., a whole house. Energy consumption of individual appliances can be useful in many applications, e.g., providing appliance-level feedbac...
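The core constraint in energy disaggregation is that the mains reading is (approximately) the sum of the individual appliance loads. As a toy illustration only, not the method of the article, here is a brute-force baseline that finds the on/off combination of hypothetical appliance ratings best explaining a mains reading:

```python
from itertools import product

# Hypothetical appliance power ratings in watts (illustrative only).
appliances = {"kettle": 2000, "fridge": 150, "tv": 100}

def disaggregate(mains_watts):
    """Brute-force the on/off state combination whose total power
    best matches the mains reading (toy NILM baseline)."""
    names = list(appliances)
    best_states, best_err = None, float("inf")
    for states in product([0, 1], repeat=len(names)):
        total = sum(s * appliances[n] for s, n in zip(states, names))
        err = abs(mains_watts - total)
        if err < best_err:
            best_states, best_err = dict(zip(names, states)), err
    return best_states

print(disaggregate(2150))  # -> {'kettle': 1, 'fridge': 1, 'tv': 0}
```

This exhaustive search is exponential in the number of appliances; practical systems instead learn appliance signatures from data, e.g. with neural networks.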
Preprint
This paper proposes a deep learning framework for classification of BBC television programmes using audio. The audio is first transformed into spectrograms, which are fed into a pre-trained convolutional neural network (CNN), obtaining predicted probabilities of sound events occurring in the audio recording. Statistics for the predicted probabili...
Preprint
Full-text available
Clipping is a common type of distortion in which the amplitude of a signal is truncated if it exceeds a certain threshold. Sparse representation has underpinned several algorithms developed recently for reconstruction of the original signal from clipped observations. However, these declipping algorithms are often built on a synthesis model, where t...
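The clipping model described above is simple to state in code. This sketch shows the distortion itself and the "reliability mask" that declipping algorithms typically build on: samples strictly below the clipping level are trusted, while clipped samples only constrain the reconstruction to lie beyond the threshold (values here are illustrative):

```python
import numpy as np

def hard_clip(x, threshold):
    """Truncate samples whose magnitude exceeds the threshold."""
    return np.clip(x, -threshold, threshold)

t = np.linspace(0, 1, 8, endpoint=False)
signal = np.sin(2 * np.pi * t)          # clean signal
clipped = hard_clip(signal, 0.7)        # distorted observation
reliable = np.abs(clipped) < 0.7        # samples a declipper can trust as-is
print(reliable)
```

A synthesis-model declipper then seeks a sparse representation whose reconstruction matches the reliable samples and exceeds the threshold on the clipped ones.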
Conference Paper
Deep neural networks with convolutional layers usually process the entire spectrogram of an audio signal with the same time-frequency resolutions, number of filters, and dimensionality reduction scale. According to the constant-Q transform, good features can be extracted from audio signals if the low frequency bands are processed with high frequenc...
Article
Clipping is a common type of distortion in which the amplitude of a signal is truncated if it exceeds a certain threshold. Sparse representation has underpinned several algorithms developed recently for reconstruction of the original signal from clipped measurements. However, these declipping algorithms are often built on a synthesis model, where t...
Article
Credit risk is the possibility of a loss resulting from a borrower’s failure to repay a loan or meet contractual obligations. With the growing number of customers and expansion of businesses, it is not possible, or at least not feasible, for banks to assess each customer individually in order to minimize this risk. Machine learning can leverage available...
Preprint
Acoustic Scene Classification (ASC) aims to classify the environment in which the audio signals are recorded. Recently, Convolutional Neural Networks (CNNs) have been successfully applied to ASC. However, the data distributions of the audio signals recorded with multiple devices are different. There has been little research on the training of robus...
Preprint
Depression is a large-scale mental health problem and a challenging area for machine learning researchers in terms of the detection of depression. Datasets such as the Distress Analysis Interview Corpus - Wizard of Oz have been created to aid research in this area. However, on top of the challenges inherent in accurately detecting depression, biase...
Preprint
Full-text available
Polyphonic sound event localization and detection (SELD), which jointly performs sound event detection (SED) and direction-of-arrival (DoA) estimation, has better real-world applicability than separate SED or DoA estimation. It detects the type and occurrence time of sound events as well as their corresponding DoA angles simultaneously. We study th...
Conference Paper
Full-text available
Situated in the domain of urban sound scene classification by humans and machines, this research is the first step towards mapping urban noise pollution experienced indoors and finding ways to reduce its negative impact in people's homes. We have recorded a sound dataset, called Open-Window, which contains recordings from three different locations...
Preprint
Full-text available
Polyphonic sound event localization and detection involves not only detecting what sound events are happening but also localizing the corresponding sound sources. This series of tasks was first introduced in DCASE 2019 Task 3. In 2020, the sound event localization and detection task introduces additional challenges in moving sound sources and overlapping-event ca...
Article
Sound event detection (SED) is a task to detect sound events in an audio recording. One challenge of the SED task is that many datasets such as the Detection and Classification of Acoustic Scenes and Events (DCASE) datasets are weakly labelled. That is, there are only audio tags for each audio clip without the onset and offset times of sound events...
Conference Paper
Full-text available
Cardiovascular disease is one of the leading causes of death in humans. In the past decade, heart sound classification has been increasingly studied for its feasibility to develop a non-invasive approach to monitor a subject's health status. Particularly, relevant studies have benefited from the fast development of wearable devices an...
Conference Paper
Credit risk is the possibility of a loss resulting from a borrower's failure to repay a loan or meet contractual obligations. With the growing number of customers and expansion of businesses, it is not possible, or at least not feasible, for banks to assess each customer individually in order to minimize this risk. Machine learning can leverage available...
Preprint
Full-text available
In supervised machine learning, the assumption that training data is labelled correctly is not always satisfied. In this paper, we investigate an instance of labelling error for classification tasks in which the dataset is corrupted with out-of-distribution (OOD) instances: data that does not belong to any of the target classes, but is labelled as...
Preprint
Source separation is the task of separating an audio recording into individual sound sources. Source separation is fundamental for computational auditory scene analysis. Previous work on source separation has focused on separating particular sound classes such as speech and music. Much of this previous work requires mixture and clean source pairs for train...
Article
Audio pattern recognition is an important research topic in the machine learning area, and includes several tasks such as audio tagging, acoustic scene classification, music classification, speech emotion classification and sound event detection. Recently, neural networks have been applied to tackle audio pattern recognition problems. However, prev...
Preprint
Full-text available
Audio pattern recognition is an important research topic in the machine learning area, and includes several tasks such as audio tagging, acoustic scene classification and sound event detection. Recently, neural networks have been applied to solve audio pattern recognition problems. However, previous systems focus on small datasets, which limits the...
Preprint
Sound event detection (SED) is a task to detect sound events in an audio recording. One challenge of the SED task is that many datasets such as the Detection and Classification of Acoustic Scenes and Events (DCASE) datasets are weakly labelled. That is, there are only audio tags for each audio clip without the onset and offset times of sound events...
Conference Paper
While deep learning is undoubtedly the predominant learning technique across speech processing, it is still not widely used in health-based applications. The corpora available for health-style recognition problems are often small, both concerning the total amount of data available and the number of individuals present. The Bipolar Disorder corpus,...
Technical Report
Full-text available
Sound event localization and detection (SELD) refers to the spatial and temporal localization of sound events in addition to classification. The Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 Task 3 introduces a strongly labelled dataset to address this problem. In this report, a two-stage polyphonic sound event detection a...
Article
Full-text available
Audio tagging is the task of predicting the presence or absence of sound classes within an audio clip. Previous work in audio tagging focused on relatively small datasets limited to recognising a small number of sound classes. We investigate audio tagging on AudioSet, which is a dataset consisting of over 2 million audio clips and 527 classes. Audi...
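Audio tagging as defined above is a multi-label problem: each class's presence is predicted independently, so one clip can carry several tags at once. A minimal sketch of the usual decision rule, sigmoid outputs thresholded per class (the class names and threshold here are illustrative, not from AudioSet's 527-class ontology):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tag_clip(logits, class_names, threshold=0.5):
    """Multi-label tagging: each class is scored independently,
    so several tags can be present in the same clip."""
    probs = sigmoid(np.asarray(logits, dtype=float))
    return [name for name, p in zip(class_names, probs) if p >= threshold]

classes = ["speech", "music", "dog_bark", "siren"]  # illustrative subset
tags = tag_clip([2.0, -1.5, 0.3, -3.0], classes)
print(tags)  # -> ['speech', 'dog_bark']
```

Training such a system uses a binary cross-entropy loss per class, rather than the softmax cross-entropy used for single-label classification.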
Article
Sparse coding and dictionary learning are popular techniques for linear inverse problems such as denoising or inpainting. However in many cases, the measurement process is nonlinear, for example for clipped, quantized or 1-bit measurements. These problems have often been addressed by solving constrained sparse coding problems, which can be difficul...
Preprint
Full-text available
This paper proposes sound event localization and detection methods from multichannel recording. The proposed system is based on two Convolutional Recurrent Neural Networks (CRNNs) to perform sound event detection (SED) and time difference of arrival (TDOA) estimation on each pair of microphones in a microphone array. In this paper, the system is ev...
Preprint
Deep neural networks with convolutional layers usually process the entire spectrogram of an audio signal with the same time-frequency resolutions, number of filters, and dimensionality reduction scale. According to the constant-Q transform, good features can be extracted from audio signals if the low frequency bands are processed with high frequenc...
Conference Paper
Full-text available
Single-channel signal separation and deconvolution aims to separate and deconvolve individual sources from a single-channel mixture. Single-channel signal separation and deconvolution is a challenging problem in which no prior knowledge of the mixing filters is available. Both individual sources and mixing filters need to be estimated. In addition,...
Preprint
The availability of large-scale household energy consumption datasets boosts the studies on energy disaggregation, a.k.a. non-intrusive load monitoring, that aims to separate the energy consumption of individual appliances from the readings of a mains power meter measuring the total consumption of multiple appliances, for example in a house. Various...
Preprint
Full-text available
Single-channel signal separation and deconvolution aims to separate and deconvolve individual sources from a single-channel mixture and is a challenging problem in which no prior knowledge of the mixing filters is available. Both individual sources and mixing filters need to be estimated. In addition, a mixture may contain non-stationary noise whic...
Conference Paper
Acoustic Event Detection (AED) is an important task of machine listening which, in recent years, has been addressed using common machine learning methods like Non-negative Matrix Factorization (NMF) or deep learning. However, most of these approaches do not take into consideration the way that the human auditory system detects salient sounds. In this w...
Preprint
Full-text available
Sound event detection (SED) and localization refer to recognizing sound events and estimating their spatial and temporal locations. Using neural networks has become the prevailing method for sound event detection. In the area of sound localization, which is usually performed by estimating the direction of arrival (DOA), learning-based methods have...
Preprint
Full-text available
Sound event detection (SED) methods typically rely on either strongly labelled data or weakly labelled data. As an alternative, sequentially labelled data (SLD) was proposed. In SLD, the events and the order of events in audio clips are known, without knowing the occurrence time of events. This paper proposes a connectionist temporal classification...
Conference Paper
Humans are able to identify a large number of environmental sounds and categorise them according to high-level semantic categories, e.g. urban sounds or music. They are also capable of generalising from past experience to new sounds when applying these categories. In this paper we report on the creation of a data set that is structured according to...
Preprint
Full-text available
The Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge focuses on audio tagging, sound event detection and spatial localisation. DCASE 2019 consists of five tasks: 1) acoustic scene classification, 2) audio tagging with noisy labels and minimal supervision, 3) sound event localisation and detection, 4) sound event det...
Preprint
The Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 challenge focuses on audio tagging, sound event detection and spatial localisation. DCASE 2019 consists of five tasks: 1) acoustic scene classification, 2) audio tagging with noisy labels and minimal supervision, 3) sound event localisation and detection, 4) sound event det...
Preprint
Audio tagging is the task of predicting the presence or absence of sound classes within an audio clip. Previous work in audio classification focused on relatively small datasets limited to recognising a small number of sound classes. We investigate audio tagging on AudioSet, which is a dataset consisting of over 2 million audio clips and 527 classe...
Conference Paper
Full-text available
Sound event detection (SED) methods typically rely on either strongly labelled data or weakly labelled data. As an alternative, sequentially labelled data (SLD) was proposed. In SLD, the events and the order of events in audio clips are known, without knowing the occurrence time of events. This paper proposes a connectionist temporal classificatio...
Preprint
We address the problem of recovering a sparse signal from clipped or quantized measurements. We show how these two problems can be formulated as minimizing the distance to a convex feasibility set, which provides a convex and differentiable cost function. We then propose a fast iterative shrinkage/thresholding algorithm that minimizes the proposed...
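The two ingredients named in this abstract, a shrinkage/thresholding step and a convex consistency set for clipped measurements, can both be sketched compactly. This is an illustrative reconstruction of those building blocks only (variable names and the demo values are assumptions), not the paper's full algorithm:

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal operator of the L1 norm (the 'shrinkage' step)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def project_clip_consistent(x, y, tau):
    """Project an estimate x onto the convex set of signals consistent
    with clipped observations y at level tau: unclipped samples must
    equal y; clipped samples must stay beyond the clipping level."""
    out = x.copy()
    unclipped = np.abs(y) < tau
    out[unclipped] = y[unclipped]
    out[y >= tau] = np.maximum(out[y >= tau], tau)
    out[y <= -tau] = np.minimum(out[y <= -tau], -tau)
    return out

true = np.array([0.2, 1.3, -0.9, 0.5])
y = np.clip(true, -0.8, 0.8)            # clipped observation, tau = 0.8
proj = project_clip_consistent(np.zeros(4), y, 0.8)
print(proj)  # a consistent estimate: [ 0.2  0.8 -0.8  0.5]
```

An ISTA-style solver would alternate a gradient step on the distance to this feasibility set with `soft_threshold` applied to the sparse coefficients.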
Conference Paper
Full-text available
We propose a convolutional neural network (CNN) model based on an attention pooling method to classify ten different acoustic scenes, participating in the acoustic scene classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2018), which includes data from one device (subtask A) and data fro...