Björn Schuller

Imperial College London | Imperial · Department of Computing

Prof. mult. Dr. habil.
Researching Digital Health, AI, Deep Learning, and Computer Audition.

About

1,240
Publications
341,936
Reads
40,020
Citations
Additional affiliations
September 2018 - present
Imperial College London
Position
  • Professor (Full)
Description
  • Professor of Artificial Intelligence & Head of GLAM - Group on Language Audio & Music
October 2017 - present
Universität Augsburg
Position
  • Professor (Full)
Description
  • Chair of Embedded Intelligence for Health Care and Wellbeing
September 2015 - August 2018
Imperial College London
Position
  • Professor (Full)
Description
  • Reader in Machine Learning & Head of GLAM - Group on Language Audio & Music

Publications

Preprint
Vocal bursts play an important role in communicating affect, making them valuable for improving speech emotion recognition. Here, we present our approach for classifying vocal bursts and predicting their emotional significance in the ACII Affective Vocal Burst Workshop & Challenge 2022 (A-VB). We use a large self-supervised audio model as shared fe...
Chapter
Full-text available
Computer audition-based methods have increasingly attracted attention in the digital health community. In particular, heart sound analysis can provide a non-invasive, real-time, and convenient (anywhere and anytime) solution for preliminary diagnosis and/or long-term monitoring of patients suffering from cardiovascular diseases. Neverth...
Article
Full-text available
The importance of detecting whether a person wears a face mask while speaking has tremendously increased since the outbreak of SARS-CoV-2 (COVID-19), as wearing a mask can help to reduce the spread of the virus and mitigate the public health crisis. Besides affecting human speech characteristics related to frequency, face masks cause temporal int...
Preprint
Full-text available
Emotional speech synthesis aims to synthesize human voices with various emotional effects. The current studies are mostly focused on imitating an averaged style belonging to a specific emotion type. In this paper, we seek to generate speech with a mixture of emotions at run-time. We propose a novel formulation that measures the relative difference...
Article
Full-text available
Drowsiness detection is a crucial step for safe driving. A plethora of efforts has been invested on using pervasive sensor data (e.g., video, physiology) empowered by machine learning to build an automatic drowsiness detection system. Nevertheless, most of the existing methods are based on complicated wearables (e.g., electroencephalogram) or com...
Article
Full-text available
Fragile X syndrome (FXS) and Rett syndrome (RTT) are developmental disorders currently not diagnosed before toddlerhood. Even though speech-language deficits are among the key symptoms of both conditions, little is known about infant vocalisation acoustics for an automatic earlier identification of affected individuals. To bridge this gap, we appli...
Article
Full-text available
Welcome to the fourth issue of IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS (TCSS) in 2022. First, we have some exciting news to share. In late June, Clarivate updated the Impact Factor of all journals which are indexed by Web of Science. According to the Journal Citation Reports, the 2021 Journal Impact Factor of IEEE TCSS was 4.727. Many tha...
Preprint
Full-text available
Chronic obstructive pulmonary disease (COPD) causes lung inflammation and airflow blockage leading to a variety of respiratory symptoms; it is also a leading cause of death and affects millions of individuals around the world. Patients often require treatment and hospitalisation, while no cure is currently available. As COPD predominantly affects t...
Article
Full-text available
Cough sounds have shown promise as a potential marker for distinguishing COVID individuals from non-COVID ones. In this paper, we propose an attention-based ensemble learning approach to learn complementary representations from cough samples. Unlike most traditional schemes such as mere maxing or averaging, the proposed approach fairly considers...
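The general idea of attention-based fusion, as opposed to plain max or average pooling, can be sketched in a few lines of numpy. This is a minimal illustration only: the function name `attention_fusion`, the shapes, and the single learned parameter vector `w` are assumptions for the sketch, not the paper's architecture.

```python
import numpy as np

def attention_fusion(reps, w):
    """Fuse ensemble representations of one sample with dot-product
    attention instead of mere max/average pooling (illustrative sketch).

    reps -- (n_members, dim) array: one representation per ensemble member
    w    -- (dim,) attention parameter (learned in practice; random here)
    """
    scores = reps @ w                      # relevance score per member
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()                 # softmax -> attention weights
    return alphas @ reps                   # weighted sum, shape (dim,)

rng = np.random.default_rng(0)
reps = rng.normal(size=(5, 16))            # 5 ensemble members, 16-d each
fused = attention_fusion(reps, rng.normal(size=16))
```

Unlike averaging, the softmax weighting lets more informative ensemble members dominate the fused representation.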
Preprint
This is the Proceedings of the ICML Expressive Vocalization (ExVo) Competition. The ExVo competition focuses on understanding and generating vocal bursts: laughs, gasps, cries, and other non-verbal vocalizations that are central to emotional expression and communication. ExVo 2022 included three competition tracks using a large-scale dataset of 59...
Preprint
Despite the recent progress in speech emotion recognition (SER), state-of-the-art systems lack generalisation across different conditions. A key underlying reason for poor generalisation is the scarcity of emotion datasets, which is a significant roadblock to designing robust machine learning (ML) models. Recent works in SER focus on utilising mult...
Conference Paper
Full-text available
Heart sound classification is one of the non-invasive methods for early detection of cardiovascular diseases (CVDs), the leading cause of death. In recent years, Computer Audition (CA) technology has become increasingly sophisticated, and auxiliary diagnosis of heart disease based on CA has become a popular research area. This paper pr...
Conference Paper
Full-text available
Cardiovascular diseases (CVDs) have been ranked as the leading cause of death. The early diagnosis of CVDs is a crucial task in medical practice. A plethora of efforts has been devoted to the automated auscultation of heart sounds, which leverages the power of computer audition to develop a cheap, non-invasive method that can be used at any time and...
Preprint
Full-text available
The ACII Affective Vocal Bursts Workshop & Competition is focused on understanding multiple affective dimensions of vocal bursts: laughs, gasps, cries, screams, and many other non-linguistic vocalizations central to the expression of emotion and to human communication more generally. This year's competition comprises four tracks using a large-scale...
Preprint
Recognising continuous emotions and action unit (AU) intensities from face videos requires a spatial and temporal understanding of expression dynamics. Existing works primarily rely on 2D face appearances to extract such dynamics. This work focuses on a promising alternative based on parametric 3D face shape alignment models, which disentangle diff...
Preprint
Full-text available
We propose a novel Dynamic Restrained Uncertainty Weighting Loss to experimentally handle the problem of balancing the contributions of multiple tasks on the ICML ExVo 2022 Challenge. The multitask setting aims to recognize expressed emotions and demographic traits from vocal bursts jointly. Our strategy combines the advantages of Uncertainty Weight and Dy...
Preprint
Full-text available
More than two years after its outbreak, the COVID-19 pandemic continues to plague medical systems around the world, putting a strain on scarce resources, and claiming human lives. From the very beginning, various AI-based COVID-19 detection and monitoring tools have been pursued in an attempt to stem the tide of infections through timely diagnosis....
Preprint
Full-text available
In this paper, we propose the Redundancy Reduction Twins Network (RRTN), a redundancy reduction training framework that minimizes redundancy by measuring the cross-correlation matrix between the outputs of the same network fed with distorted versions of a sample and bringing it as close to the identity matrix as possible. RRTN also applies a new lo...
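The cross-correlation objective described here mirrors the Barlow Twins formulation, and can be illustrated with a short numpy sketch: standardise two distorted views of the same batch, form their cross-correlation matrix, and penalise its distance from the identity. The function name and the off-diagonal weight `lam` are assumptions for illustration, not the paper's exact loss.

```python
import numpy as np

def cross_correlation_loss(za, zb, lam=5e-3):
    """Redundancy reduction objective (sketch, Barlow-Twins style).

    za, zb -- (batch, dim) embeddings of two distorted views of the
              same samples; lam weights the off-diagonal penalty.
    """
    # standardise each feature dimension over the batch
    za = (za - za.mean(0)) / (za.std(0) + 1e-9)
    zb = (zb - zb.mean(0)) / (zb.std(0) + 1e-9)
    c = za.T @ zb / za.shape[0]            # (dim, dim) cross-correlation
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()            # pull diagonal to 1
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()  # push rest to 0
    return on_diag + lam * off_diag
```

Driving the matrix towards the identity makes paired views agree per dimension (diagonal) while decorrelating, i.e. de-redundifying, the remaining dimensions (off-diagonal).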
Conference Paper
Emotion classification of speech and assessment of the emotion strength are required in applications such as emotional text-to-speech and voice conversion. The emotion attribute ranking function based on a Support Vector Machine (SVM) was proposed to predict emotion strength for an emotional speech corpus. However, the trained ranking function doesn't g...
Preprint
Full-text available
In this work, we explore a novel few-shot personalisation architecture for emotional vocalisation prediction. The core contribution is an `enrolment' encoder which utilises two unlabelled samples of the target speaker to adjust the output of the emotion encoder; the adjustment is based on dot-product attention, thus effectively functioning as a for...
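The enrolment mechanism can be pictured as scaled dot-product attention over a handful of unlabelled samples of the target speaker, whose summary then adjusts the emotion embedding. All names, shapes, and the residual addition below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def personalise(query, enrolment):
    """Adjust an emotion embedding via dot-product attention over
    unlabelled enrolment samples of the target speaker (sketch).

    query     -- (dim,) emotion-encoder output for the test utterance
    enrolment -- (k, dim) encoder outputs for k enrolment utterances
    """
    scores = enrolment @ query / np.sqrt(query.size)  # scaled dot products
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()                            # attention weights
    context = alphas @ enrolment                      # speaker-specific summary
    return query + context                            # residual adjustment (assumed)
```

Because the enrolment samples need no emotion labels, the adjustment effectively acts as a forward-pass-only form of personalisation.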
Preprint
Full-text available
Emotion classification of speech and assessment of the emotion strength are required in applications such as emotional text-to-speech and voice conversion. The emotion attribute ranking function based on Support Vector Machine (SVM) was proposed to predict emotion strength for emotional speech corpus. However, the trained ranking function doesn't g...
Preprint
Full-text available
Automatically recognising apparent emotions from face and voice is hard, in part because of various sources of uncertainty, including in the input data and the labels used in a machine learning framework. This paper introduces an uncertainty-aware audiovisual fusion approach that quantifies modality-wise uncertainty towards emotion prediction. To t...
Chapter
Sentiment analysis is an important area of natural language processing that can help inform business decisions by extracting sentiment information from documents. The purpose of this chapter is to introduce the reader to selected concepts and methods of deep learning and show how deep models can be used to increase performance in sentiment analysis...
Article
Full-text available
Objectives: The coronavirus disease 2019 (COVID-19) has caused a crisis worldwide. Considerable efforts have been made to prevent and control COVID-19’s transmission, from early screenings to vaccinations and treatments. Recently, due to the emergence of many automatic disease recognition applications based on machine listening techniques, it would be...
Article
Full-text available
In this article, human semen samples from the Visem dataset are automatically assessed with machine learning methods for their quality with respect to sperm motility. Several regression models are trained to automatically predict the percentage (0 to 100) of progressive, non-progressive, and immotile spermatozoa. The videos are adopted for unsuperv...
Conference Paper
Full-text available
Audio has been increasingly used as a novel digital phenotype that carries important information about the subject's health status. Tremendous efforts have been devoted to this young and promising field, i.e., computer audition for healthcare (CA4H), whereas its application scenarios have not been fully studied compared to its counterpart in medica...
Conference Paper
Full-text available
A plethora of great successes has been achieved by existing convolutional neural networks (CNNs) for respiratory sound classification. Nevertheless, simultaneously capturing both local and global features is no easy task due to the limitations of a CNN's structure. In this contribution, we propose a novel glance-and-gaze network to...
Article
Full-text available
In recent years, advancements in the field of artificial intelligence (AI) have impacted several areas of research and application. Besides more prominent examples like self-driving cars or media consumption algorithms, AI-based systems have further started to gain more and more popularity in the health care sector, however whilst being restrained...
Preprint
Full-text available
The ACM Multimedia 2022 Computational Paralinguistics Challenge addresses four different problems for the first time in a research competition under well-defined conditions: In the Vocalisations and Stuttering Sub-Challenges, a classification on human non-verbal vocalisations and speech has to be made; the Activity Sub-Challenge aims at beyond-audi...
Preprint
Stress is a major threat to well-being that manifests in a variety of physiological and mental symptoms. Utilising speech samples collected while the subject is undergoing an induced stress episode has recently shown promising results for the automatic characterisation of individual stress responses. In this work, we introduce new findings that she...
Preprint
Although running is a common leisure activity and a core training regimen for several athletes, between 29% and 79% of runners sustain an overuse injury each year. These injuries are linked to excessive fatigue, which alters how someone runs. In this work, we explore the feasibility of modelling the Borg received perception of exertion (RPE)...
Preprint
Full-text available
Digital health applications are becoming increasingly important for assessing and monitoring the wellbeing of people suffering from mental health conditions like depression. A common target of said applications is to predict the results of self-assessed Patient-Health-Questionnaires (PHQ), indicating current symptom severity of depressive individua...
Preprint
Video-to-speech synthesis (also known as lip-to-speech) refers to the translation of silent lip movements into the corresponding audio. This task has received an increasing amount of attention due to its self-supervised nature (i.e., can be trained without manual labelling) combined with the ever-growing collection of audio-visual data available on...
Preprint
Full-text available
The ICML Expressive Vocalization (ExVo) Competition is focused on understanding and generating vocal bursts: laughs, gasps, cries, and other non-verbal vocalizations that are central to emotional expression and communication. ExVo 2022 includes three competition tracks using a large-scale dataset of 59,201 vocalizations from 1,702 speakers. The fi...
Article
Full-text available
Cardiovascular diseases are the leading cause of death and severely threaten human health in daily life. There have been dramatically increasing demands from both the clinical practice and the smart home application for monitoring the heart status of individuals suffering from chronic cardiovascular diseases. However, experienced physicians who can...
Preprint
Despite the recent advancement in speech emotion recognition (SER) within a single corpus setting, the performance of these SER systems degrades significantly for cross-corpus and cross-language scenarios. The key reason is the lack of generalisation in SER systems towards unseen conditions, which causes them to perform poorly in cross-corpus and c...
Preprint
Full-text available
The Multimodal Sentiment Analysis Challenge 2022 (MuSe 2022) is dedicated to multimodal sentiment and emotion recognition. For this year's challenge, we feature three datasets: (i) the Passau Spontaneous Football Coach Humor (Passau-SFCH) dataset that contains audio-visual recordings of German football coaches, labelled for the presence of humour;...
Article
Full-text available
In the past decade, deep learning (DL) has achieved unprecedented success in numerous fields, such as computer vision and healthcare. Particularly, DL is experiencing an increasing development in advanced medical image analysis applications in terms of segmentation, classification, detection, and other tasks. On the one hand, tremendous needs that...
Preprint
Full-text available
Large, pre-trained neural networks consisting of self-attention layers (transformers) have recently achieved state-of-the-art results on several speech emotion recognition (SER) datasets. These models are typically pre-trained in a self-supervised manner with the goal of improving automatic speech recognition performance -- and thus, of understanding ling...
Preprint
Full-text available
Detecting COVID-19 from audio signals, such as breathing and coughing, can be used as a fast and efficient pre-testing method to reduce the virus transmission. Due to the promising results of deep learning networks in modelling time sequences, and since applications to rapidly identify COVID in-the-wild should require low computational effort, we p...
Preprint
Full-text available
Respiratory sound classification is an important tool for remote screening of respiratory-related diseases such as pneumonia, asthma, and COVID-19. To facilitate the interpretability of classification results, especially ones based on deep learning, many explanation methods have been proposed using prototypes. However, existing explanation techniqu...
Preprint
Full-text available
Emotional voice conversion (EVC) focuses on converting a speech utterance from a source to a target emotion; it can thus be a key enabling technology for human-computer interaction applications and beyond. However, EVC remains an unsolved research problem with several challenges. In particular, as speech rate and rhythm are two key factors of emoti...
Preprint
In this paper, we present our submission to the 3rd Affective Behavior Analysis in-the-wild (ABAW) challenge. Learning complex interactions among multimodal sequences is critical to recognise dimensional affect from in-the-wild audiovisual data. Recurrence and attention are the two widely used sequence modelling mechanisms in the literature. To clearly...
Article
Full-text available
Deep neural speech and audio processing systems have a large number of trainable parameters, a relatively complex architecture, and require a vast amount of training data and computational power. These constraints make it more challenging to integrate such systems into embedded devices and utilize them for real-time, real-world applications. We tac...
Preprint
Emotion and a broader range of affective driver states can be a life decisive factor on the road. While this aspect has been investigated repeatedly, the advent of autonomous automobiles puts a new perspective on the role of computer-based emotion recognition in the car -- the passenger's one. This includes amongst others the monitoring of wellbein...
Preprint
Full-text available
Recent advances in transformer-based architectures, which are pre-trained in a self-supervised manner, have shown great promise in several machine learning tasks. In the audio domain, such architectures have also been successfully utilised in the field of speech emotion recognition (SER). However, existing works have not evaluated the influence of mode...
Preprint
Purpose: The coronavirus disease 2019 (COVID-19) has caused a crisis worldwide. Considerable efforts have been made to prevent and control COVID-19’s transmission, from early screenings to vaccinations and treatments. Recently, due to the emergence of many automatic disease recognition applications based on machine listening techniques, it would be fa...
Preprint
Full-text available
Among the seventeen Sustainable Development Goals (SDGs) proposed within the 2030 Agenda and adopted by all the United Nations member states, the 13th SDG is a call for action to combat climate change for a better world. In this work, we provide an overview of areas in which audio intelligence -- a powerful but in this context so far hardly co...
Preprint
Full-text available
Due to the development of machine learning and speech processing, speech emotion recognition has been a popular research topic in recent years. However, the speech data cannot be protected when it is uploaded and processed on servers in the internet-of-things applications of speech emotion recognition. Furthermore, deep neural networks have proven...
Article
Full-text available
Rett syndrome (RTT) is a rare, late detected developmental disorder associated with severe deficits in the speech-language domain. Despite a few reports about atypicalities in the speech-language development of infants and toddlers with RTT, a detailed analysis of the pre-linguistic vocalisation repertoire of infants with RTT is yet missing. Based...
Conference Paper
Full-text available
Automatic classification of heart sounds has been studied for many years, because computer-aided auscultation of heart sounds can help doctors make a preliminary diagnosis. We propose a classification method for heart sounds that uses fractional Fourier transformation entropy (FRFE) as the features and a support vector machine (SVM) as the classifi...
Preprint
Full-text available
Inspired by humans' cognitive ability to generalise knowledge and skills, Self-Supervised Learning (SSL) aims at discovering general representations from large-scale data without requiring human annotations, which is an expensive and time-consuming task. Its successes in the fields of computer vision and natural language processing have prompt...
Article
Full-text available
The article ‘The perception of emotional cues by children in artificial background noise’, written by Emilia Parada-Cabaleiro, Anton Batliner, Alice Baird and Björn Schuller, was originally published Online First without Open Access. After publication in volume 23, issue 1, pages 169–182, it has been decided to make the article an Open Access publicat...
Article
Full-text available
The monitoring of an escalating negative interaction has several benefits, particularly in security, (mental) health, and group management. The speech signal is particularly suited to this, as aspects of escalation, including emotional arousal, are proven to easily be captured by the audio signal. A challenge of applying trained systems in real-lif...
Conference Paper
Full-text available
Learning English as a foreign language requires an extensive use of cognitive capacity, memory, and motor skills in order to orally express one's thoughts in a clear manner. Current speech recognition intelligence focuses on recognising learners' oral proficiency from the perspectives of fluency, prosody, pronunciation, and grammar. However, the capacit...
Preprint
Full-text available
The COVID-19 pandemic has caused massive humanitarian and economic damage. Teams of scientists from a broad range of disciplines have searched for methods to help governments and communities combat the disease. One avenue from the machine learning field which has been explored is the prospect of a digital mass test which can detect COVID-19 from in...
Article
Full-text available
Individuals with autism are known to face challenges with emotion regulation, and express their affective states in a variety of ways. With this in mind, an increasing amount of research on automatic affect recognition from speech and other modalities has recently been presented to assist and provide support, as well as to improve understanding of...
Preprint
Full-text available
Algorithms and Machine Learning (ML) are increasingly affecting everyday life and several decision-making processes, where ML has an advantage due to scalability or superior performance. Fairness in such applications is crucial, where models should not discriminate their results based on race, gender, or other protected groups. This is especially c...