Zixing Zhang

  • Technical University of Munich

About

113 Publications
36,551 Reads
5,283 Citations

Publications (113)
Preprint
Despite extensive research on textual and visual disambiguation, disambiguation through speech (DTS) remains underexplored. This is largely due to the lack of high-quality datasets that pair spoken sentences with richly ambiguous text. To address this gap, we present DEBATE, a unique public Chinese speech-text dataset designed to study how speech c...
Article
Conversational Emotion Recognition (CER) has recently been explored through conversational context modeling to learn the emotion distribution, i.e., the likelihood over emotion categories associated with each utterance. While these methods have shown promising results in emotion classification, they often focus on the interactions between utterance...
Article
Alzheimer's disease (AD), as the most prevalent form of dementia, necessitates early identification and treatment for the critical enhancement of patients' quality of life. Recent studies strive to explore advanced machine learning approaches with multiple information cues, such as speech and text, to automatically and precisely detect this disease...
Article
After its inception, emotion recognition or affective computing has increasingly become an active research topic due to its broad applications. The corresponding computational models have gradually migrated from statistically shallow models to neural-network-based deep models, which can significantly boost the performance of emotion recognition and...
Preprint
BACKGROUND The field of mental health technology presently has significant gaps that need addressing, particularly in the domain of daily monitoring and personalized assessments. Current non-invasive devices like wristbands and smartphones are capable of collecting a wide range of data, which has not yet been fully utilized for mental health monito...
Article
Full-text available
Background The field of mental health technology presently has significant gaps that need addressing, particularly in the domain of daily monitoring and personalized assessments. Current noninvasive devices such as wristbands and smartphones are capable of collecting a wide range of data, which has not yet been fully used for mental health monitori...
Preprint
Full-text available
We propose a novel Dynamic Restrained Uncertainty Weighting Loss to experimentally handle the problem of balancing the contributions of multiple tasks on the ICML ExVo 2022 Challenge. The multitask aims to recognize expressed emotions and demographic traits from vocal bursts jointly. Our strategy combines the advantages of Uncertainty Weight and Dy...
Article
Zero-shot speech emotion recognition (SER) endows machines with the ability to sense unseen emotional states in speech, in contrast to conventional SER endeavours on supervised cases. In addressing the zero-shot SER task, auditory affective descriptors (AADs) are typically employed to transfer affective knowledge from seen to unseen emotional stat...
Article
Full-text available
Computer audition (CA) has developed rapidly in the past decades by leveraging advanced signal processing and machine learning techniques. In particular, owing to its inherently non-invasive and ubiquitous character, CA-based applications in healthcare have attracted increasing attention in recent years. During the tough time of the global...
Article
Full-text available
Mental health plays a key role in everyone's day-to-day lives, impacting our thoughts, behaviors, and emotions. Also, over the past years, given their ubiquitous and affordable characteristics, the use of smartphones and wearable devices has grown rapidly and provided support within all aspects of mental health research and care - from screening an...
Article
Full-text available
Population ageing is increasingly prevalent in both developed and developing countries, which may raise a series of social challenges and economic burdens. In particular, more elderly people are now staying alone at home than living together with others who can take care of them. Therefore, assisted living and healthcare monitoring...
Article
Speech Emotion Recognition (SER) makes it possible for machines to perceive affective information. Our previous research differed from conventional SER endeavours in that it focused on recognising unseen emotions in speech autonomously through machine learning. Such a step would enable the automatic learning of unknown emerging emotional states. Thi...
Article
Full-text available
Acoustic Event Classification (AEC) has become a significant task for machines to perceive the surrounding auditory scene. However, extracting effective representations that capture the underlying characteristics of the acoustic events is still challenging. Previous methods mainly focused on designing the audio features in a 'hand-crafted' manner....
Article
Full-text available
In the past three decades, snoring (affecting more than 30% of adults in the UK) has been increasingly studied in the transdisciplinary research community involving medicine and engineering. Early work demonstrated that the snore sound can carry important information about the status of the upper airway, which facilitates the development o...
Article
Full-text available
Predicting emotions automatically is an active field of research in affective computing. Considering the property of the individual’s subjectivity, the label of an emotional instance is usually created based on opinions from multiple annotators. That is, the labelled instance is often accompanied with the corresponding inter-rater disagreement info...
Article
A challenging issue in the field of the automatic recognition of emotion from speech is the efficient modelling of long temporal contexts. Moreover, when incorporating long-term temporal dependencies between features, recurrent neural network (RNN) architectures are typically employed by default. In this work, we aim to present an efficient deep ne...
Article
Full-text available
Background A crucial element of human–machine interaction, the automatic detection of emotional states from human speech has long been regarded as a challenging task for machine learning models. One vital challenge in speech emotion recognition (SER) is how to learn robust and discriminative representations from speech. Meanwhile, although machine...
Conference Paper
Full-text available
The COVID-19 outbreak was announced as a global pandemic by the World Health Organisation in March 2020 and has affected a growing number of people in the past few weeks. In this context, advanced artificial intelligence techniques are brought to the fore in responding to fight against and reduce the impact of this global health crisis. In this stu...
Preprint
The COVID-19 outbreak was announced as a global pandemic by the World Health Organisation in March 2020 and has affected a growing number of people in the past few weeks. In this context, advanced artificial intelligence techniques are brought to the fore in responding to fight against and reduce the impact of this global health crisis. In this study, we...
Article
Full-text available
One of the frontier issues that severely hamper the development of automatic snore sound classification (ASSC) relates to the lack of sufficient supervised training data. To cope with this problem, we propose a novel data augmentation approach based on semi-supervised conditional Generative Adversarial Networks (scGANs), which aims to automatica...
Conference Paper
Full-text available
This study investigates the performance of wavelet as well as conventional temporal and spectral features for acoustic scene classification, testing the effectiveness of both feature sets when combined with neural networks on acoustic scene classification. The TUT Acoustic Scenes 2017 Database is used in the evaluation of the system. The model with...
Article
Full-text available
Early interventions in mental health conditions such as Major Depressive Disorder (MDD) are critical to improved health outcomes, as they can help reduce the burden of the disease. As the efficient diagnosis of depression severity is therefore highly desirable, the use of behavioural cues such as speech characteristics in diagnosis is attracting in...
Conference Paper
Using neural networks to classify infant vocalisations into important subclasses (such as crying versus speech) is an emergent task in speech technology. One of the biggest roadblocks standing in the way of progress lies in the datasets: The performance of a learning model is affected by the labelling quality and size of the dataset used, and infan...
Conference Paper
Full-text available
In recent years, machine learning has been increasingly applied to the area of mental health diagnosis, treatment, support, research, and clinical administration. In particular, using less-invasive wearables combined with artificial intelligence to monitor or diagnose mental diseases is in great demand in real practice. To this end, we p...
Preprint
Despite remarkable advances in emotion recognition, systems remain severely restrained by either the essentially limited property of the employed single modality or the synchronous presence of all involved multiple modalities. Motivated by this, we propose a novel crossmodal emotion embedding framework called EmoBed, which aims to leverage the knowled...
Conference Paper
Full-text available
A less-invasive method for the diagnosis of the major depressive disorder can be useful for both the psychiatrists and the patients. We propose a machine learning framework for automatically discriminating patients suffering from the major depressive disorder (n=14) and healthy subjects (n=17). To this end, spontaneous physical activity data were r...
Article
Despite remarkable advances in emotion recognition, systems remain severely restrained by either the essentially limited property of the employed single modality or the synchronous presence of all involved multiple modalities. Motivated by this, we propose a novel crossmodal emotion embedding framework called EmoBed, which aims to leverage the knowled...
Article
Full-text available
The automatic detection of an emotional state from human speech, which plays a crucial role in the area of human–machine interaction, has consistently been shown to be a difficult task for machine learning algorithms. Previous work on emotion recognition has mostly focused on the extraction of carefully hand-crafted and highly engineered features....
Article
Over the past few years, adversarial training has become an extremely active research topic and has been successfully applied to various Artificial Intelligence (AI) domains. As a potentially crucial technique for the development of the next generation of emotional AI systems, we herein provide a comprehensive overview of the application of adversa...
Preprint
One of the frontier issues that severely hamper the development of automatic snore sound classification (ASSC) relates to the lack of sufficient supervised training data. To cope with this problem, we propose a novel data augmentation approach based on semi-supervised conditional Generative Adversarial Networks (scGANs), which aims to automatica...
Article
Full-text available
Snore sound (SnS) classification can support a targeted surgical approach to sleep related breathing disorders. Using machine listening methods, we aim to find the location of obstruction and vibration within a subject’s upper airway. Wavelet features have been demonstrated to be efficient in the recognition of SnSs in previous studies. In this wor...
Article
Full-text available
This paper proposes a comprehensive study on machine listening for localisation of snore sound excitation. Here we investigate the effects of varied frame sizes, and overlap of the analysed audio chunk for extracting low-level descriptors. In addition, we explore the performance of each kind of feature when it is fed into varied classifier models,...
Preprint
Over the past few years, adversarial training has become an extremely active research topic and has been successfully applied to various Artificial Intelligence (AI) domains. As a potentially crucial technique for the development of the next generation of emotional AI systems, we herein provide a comprehensive overview of the application of adversa...
Conference Paper
Full-text available
Infant vocalisation analysis plays an important role in the study of the development of pre-speech capability of infants, while machine-based approaches nowadays emerge with an aim to advance such an analysis. However, conventional machine learning techniques require heavy feature-engineering and refined architecture designing. In this paper, we pr...
Article
Full-text available
Spectrogram representations of acoustic scenes have achieved competitive performance for acoustic scene classification. Yet, the spectrogram alone does not take into account a substantial amount of time-frequency information. In this study, we present an approach for exploring the benefits of deep scalogram representations, extracted in segments fr...
Article
Full-text available
One of the major obstacles that has to be faced when applying automatic emotion recognition to realistic human–machine interaction systems is the scarcity of labelled data for training a robust model. Motivated by this concern, this article seeks to maximally exploit unlabelled data that are pervasively available in the real world and easy to be colle...
Article
Full-text available
Objective: Snoring can be excited in different locations within the upper airways during sleep. It was hypothesised that the excitation locations are correlated with distinct acoustic characteristics of the snoring noise. To verify this hypothesis, a database of snore sounds is developed, labelled with the location of sound excitation. Methods:...
Chapter
The involvement of affect information in a spoken dialogue system can increase the user-friendliness and provide a more natural way for the interaction experience. This can be reached by speech emotion recognition, where the features are usually dominated by the spectral amplitude information while they ignore the use of the phase spectrum. In this...
Conference Paper
Full-text available
We investigate the effectiveness of wavelet features for acoustic scene classification as contribution to the subtask of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2017). On the back-end side, gated recurrent neural networks (GRNNs) are compared against traditional support vector machines (SVMs). We...
Conference Paper
Full-text available
For the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2017), we propose a novel method to classify 15 different acoustic scenes using deep sequential learning, based on features extracted from Short-Time Fourier Transform and scalogram of the audio scenes using Con...
Technical Report
Full-text available
For the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2017), we propose a novel method to classify 15 different acoustic scenes using deep sequential learning for the audio scenes. First, deep representations extracted from the spectrogram and two types of scalogra...
Conference Paper
Full-text available
Over the last decade, automatic emotion recognition has become well established. The gold standard target is thereby usually calculated based on multiple annotations from different raters. All related efforts assume that the emotional state of a human subject can be identified by a 'hard' category or a unique value. This assumption tries to ease th...
Article
Despite the widespread use of supervised learning methods for speech emotion recognition, they are severely restricted due to the lack of a sufficient amount of labelled speech data for training. Considering the wide availability of unlabelled speech data, therefore, this paper proposes semi-supervised autoencoders to improve speech emotion recog...
Article
Full-text available
In recent years, research fields including ecology, bioacoustics, signal processing, and machine learning, have made bird sound recognition a part of their focus. This has led to significant advancements within the field of ornithology, such as improved understanding of evolution, local biodiversity, mating rituals, and even the implications and re...
Article
Full-text available
Objective: Obstructive Sleep Apnea (OSA) is a serious chronic disease and a risk factor for cardiovascular diseases. Snoring is a typical symptom of OSA patients. Knowledge of the origin of obstruction and vibration within the upper airways is essential for a targeted surgical approach. The aim of this paper is to systematically compare different acous...
Article
With recent advances in machine-learning techniques for automatic speech analysis (ASA)-the computerized extraction of information from speech signals-there is a greater need for high-quality, diverse, and very large amounts of data. Such data could be game-changing in terms of ASA system accuracy and robustness, enabling the extraction of feature...
Article
Full-text available
Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that still remains an important challenge. Data-driven supervised approaches, including ones based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches...
Preprint
Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that still remains an important challenge. Data-driven supervised approaches, including ones based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches a...
Article
As a highly active topic in computational paralinguistics, Speech Emotion Recognition (SER) aims to explore ideal representations for emotional factors in speech. In order to improve the performance of SER, Multiple Kernel Learning (MKL) dimensionality reduction has been utilised to obtain effective information for recognising emotions. However, th...
Article
Full-text available
It has been shown that automatic bird sound recognition can be an extremely useful tool for ornithologists and ecologists, allowing for a deeper understanding of mating, evolution, local biodiversity, and even climate change. For a robust and efficient recognition model, a large amount of labelled data is needed, requiring a time-consuming and costly...
Article
Full-text available
Whispered speech, as an alternative speaking style to ‘normal phonated’ (non-whispered) speech, has received little attention in speech emotion recognition. Currently, speech emotion recognition systems are exclusively designed to process normal phonated speech, and can result in significantly degraded performance on whispered speech because o...
Article
One of the serious obstacles to the applications of speech emotion recognition systems in real-life settings is the lack of generalisation of the emotion classifiers. Many recognition systems often present a dramatic drop in performance when tested on speech data obtained from different speakers, acoustic environments, linguistic content, and domai...
Article
Full-text available
Automatic continuous affect recognition from audiovisual cues is arguably one of the most active research areas in machine learning. In addressing this regression problem, the advantages of the models, such as the global-optimisation capability of Support Vector Machine for Regression and the context-sensitive capability of memory-enhanced neural n...
Conference Paper
This paper presents the University of Passau’s approaches for the Multimodal Emotion Recognition Challenge 2016. For audio signals, we exploit Bag-of-Audio-Words techniques combining Extreme Learning Machines and Hierarchical Extreme Learning Machines. For video signals, we use not only the information from the cropped face of a video frame, but al...
Conference Paper
We propose a novel method to learn multiscale kernels with locally penalised discriminant analysis, namely Multiscale-Kernel Locally Penalised Discriminant Analysis (MS-KLPDA). As an exemplary use-case, we apply it to recognise emotions in speech. Specifically, we employ the term of locally penalised discriminant analysis by controlling the weights...
Conference Paper
Full-text available
Signal noise reduction can improve the performance of machine learning systems dealing with time signals such as audio. Real-life applicability of these recognition technologies requires the system to uphold its performance level in variable, challenging conditions such as noisy environments. In this contribution, we investigate audio signal denois...
Conference Paper
Full-text available
Location and form of the upper airway obstruction is essential for a targeted therapy of obstructive sleep apnea (OSA). Utilizing snore sounds (SnS) to reveal the pathological characters of OSA patients has been the subject of scientific research for several decades. Fewer studies exist on the evaluation of SnS to identify the corresponding obstruc...
Article
Full-text available
Features for speech emotion recognition are usually dominated by the spectral magnitude information while they ignore the use of the phase spectrum because of the difficulty of properly interpreting it. Motivated by recent successes of phase-based features for speech processing, this paper investigates the effectiveness of phase information for whi...
Conference Paper
Full-text available
Automatically classifying bird species by their sound signals is of crucial relevance for the research of ornithologists and ecologists. In this study, we present a novel framework for bird sounds classification from audio recordings. Firstly, the p-centre is used to detect the ‘syllables’ of bird songs, which are the units for the recognition task...
Conference Paper
In this contribution, we propose a novel method for Active Learning (AL) - Dynamic Active Learning (DAL) - which targets the reduction of the costly human labelling work necessary for modelling subjective tasks such as emotion recognition in spoken interactions. The method implements an adaptive query strategy that minimises the...
Article
The typical inherent mismatch between the test and training corpora, and thereby between 'target' and 'source' sets, usually leads to significant performance downgrades. To cope with this, this study presents a feature transfer learning method using Denoising Autoencoders (DAEs) to build high-order subspaces of the source and target corpora, where f...
Article
In this paper, we propose and evaluate a distributed system for multiple Computational Paralinguistics tasks in a client-server architecture. The client side deals with feature extraction, compression, and bit-stream formatting, while the server side performs the reverse process, plus model training, and classification. The proposed architecture fa...
Article
With the availability of speech data obtained from different devices and varied acquisition conditions, we are often faced with scenarios, where the intrinsic discrepancy between the training and the test data has an adverse impact on affective speech analysis. To address this issue, this letter introduces an Adaptive Denoising Autoencoder based on...
Article
In this article, the reverberation problem for hands-free voice controlled devices is addressed by employing Bidirectional Long Short-Term Memory (BLSTM) recurrent neural networks. Such networks use memory blocks in the hidden units, enabling them to exploit a self-learnt amount of temporal context. The main objective of this technique is to minimi...
Conference Paper
This study addresses a situation in practice where training and test samples come from different corpora - here in acoustic emotion recognition. In this situation, a model is trained on one database while tested on another disjoint one. The typical inherent mismatch between the corpora, and thereby between test and training sets, usually leads to sign...
Article
In this paper, we propose a novel method for highly efficient exploitation of unlabeled data-Cooperative Learning. Our approach consists of combining Active Learning and Semi-Supervised Learning techniques, with the aim of reducing the costly effects of human annotation. The core underlying idea of Cooperative Learning is to share the labeling work...
Conference Paper
Data sparsity is one of the major bottlenecks in the field of Computational Paralinguistics. Partially supervised learning approaches can help leverage this problem without the need of cost-intensive human labelling efforts. We thus investigate the feasibility of cotraining for exemplary paralinguistic speech analysis tasks spanning along the time-...
Conference Paper
In speech emotion recognition, training and test data used for system development usually tend to fit each other perfectly, but further 'similar' data may be available. Transfer learning helps to exploit such similar data for training despite the inherent dissimilarities in order to boost a recogniser's performance. In this context, this paper pres...
Conference Paper
Full-text available
Speech data is in principle available in large amounts for the training of acoustic emotion recognisers. However, emotional labelling is usually not given and the distribution is heavily unbalanced, as most data is 'rather neutral' than truly 'emotional'. In the 'hay stack' of speech data, Active Learning automatically identifies the 'needles', i.e...
