
Zixing Zhang - Technical University of Munich
About
Publications: 113
Reads: 36,551
Citations: 5,283
Current institution: Technical University of Munich
Publications (113)
Despite extensive research on textual and visual disambiguation, disambiguation through speech (DTS) remains underexplored. This is largely due to the lack of high-quality datasets that pair spoken sentences with richly ambiguous text. To address this gap, we present DEBATE, a unique public Chinese speech-text dataset designed to study how speech c...
Conversational Emotion Recognition (CER) has recently been explored through conversational context modeling to learn the emotion distribution, i.e., the likelihood over emotion categories associated with each utterance. While these methods have shown promising results in emotion classification, they often focus on the interactions between utterance...
Alzheimer's disease (AD), as the most prevalent form of dementia, necessitates early identification and treatment for the critical enhancement of patients' quality of life. Recent studies strive to explore advanced machine learning approaches with multiple information cues, such as speech and text, to automatically and precisely detect this disease...
Since its inception, emotion recognition, or affective computing, has become an increasingly active research topic owing to its broad applications. The corresponding computational models have gradually migrated from statistically shallow models to neural-network-based deep models, which can significantly boost the performance of emotion recognition and...
BACKGROUND
The field of mental health technology presently has significant gaps that need addressing, particularly in the domain of daily monitoring and personalized assessments. Current non-invasive devices like wristbands and smartphones are capable of collecting a wide range of data, which has not yet been fully utilized for mental health monito...
Background
The field of mental health technology presently has significant gaps that need addressing, particularly in the domain of daily monitoring and personalized assessments. Current noninvasive devices such as wristbands and smartphones are capable of collecting a wide range of data, which has not yet been fully used for mental health monitori...
We propose a novel Dynamic Restrained Uncertainty Weighting Loss to handle the problem of balancing the contributions of multiple tasks in the ICML ExVo 2022 Challenge. The multi-task setting aims to recognize expressed emotions and demographic traits jointly from vocal bursts. Our strategy combines the advantages of Uncertainty Weight and Dy...
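The uncertainty-weighting ingredient named above is typically built on learnable per-task log-variances. A minimal sketch of that ingredient, assuming a PyTorch setup and two illustrative tasks (not the challenge submission's actual code):

```python
# Minimal sketch of homoscedastic uncertainty weighting for multi-task
# learning, the building block a dynamic restrained variant starts from.
# All names and the two-task setup are illustrative.
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    def __init__(self, num_tasks: int):
        super().__init__()
        # One learnable log-variance per task, optimised jointly
        # with the network weights.
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        total = 0.0
        for i, loss in enumerate(task_losses):
            precision = torch.exp(-self.log_vars[i])
            # Low-uncertainty tasks get a higher weight; the log term
            # regularises against collapsing all weights to zero.
            total = total + precision * loss + self.log_vars[i]
        return total

# Usage: an emotion head and a demographic-trait head.
criterion = UncertaintyWeightedLoss(num_tasks=2)
emotion_loss = torch.tensor(1.3, requires_grad=True)
trait_loss = torch.tensor(0.4, requires_grad=True)
total = criterion([emotion_loss, trait_loss])
total.backward()
```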
Zero-shot speech emotion recognition (SER) endows machines with the ability to sense unseen emotional states in speech, in contrast to conventional supervised SER. In addressing the zero-shot SER task, auditory affective descriptors (AADs) are typically employed to transfer affective knowledge from seen to unseen emotional stat...
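A minimal sketch of descriptor-based zero-shot classification in the spirit of the AAD approach described above; the linear mapping, descriptor prototypes, and all dimensions are assumed for illustration:

```python
# Minimal sketch: project acoustic features into an attribute space
# learnt on seen emotions, then assign unseen emotions by nearest
# descriptor prototype. Everything here is illustrative.
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(16, 3))          # learnt feature->AAD mapping

# Hand-specified descriptor prototypes (e.g. arousal, valence,
# dominance) for two emotions never seen during training.
unseen_protos = {"awe": np.array([0.6, 0.7, 0.2]),
                 "triumph": np.array([0.9, 0.8, 0.9])}

x = rng.normal(size=16)               # acoustic features of a test clip
pred_aad = x @ W                      # project into descriptor space
label = min(unseen_protos,
            key=lambda k: np.linalg.norm(pred_aad - unseen_protos[k]))
```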
Computer audition (CA) has experienced fast development in the past decades by leveraging advanced signal processing and machine learning techniques. In particular, owing to its inherently non-invasive and ubiquitous character, CA-based applications in healthcare have attracted increasing attention in recent years. During the tough time of the global...
Mental health plays a key role in everyone's day-to-day lives, impacting our thoughts, behaviors, and emotions. Also, over the past years, given their ubiquitous and affordable characteristics, the use of smartphones and wearable devices has grown rapidly and provided support within all aspects of mental health research and care - from screening an...
Population ageing is increasingly prevalent in both developed and developing countries, which may give rise to a series of social challenges and economic burdens. In particular, more elderly people now stay alone at home than live with others who can take care of them. Therefore, assisted living and healthcare monitoring...
Speech Emotion Recognition (SER) makes it possible for machines to perceive affective information. Our previous research differed from conventional SER endeavours in that it focused on recognising unseen emotions in speech autonomously through machine learning. Such a step would enable the automatic learning of unknown emerging emotional states. Thi...
Acoustic Event Classification (AEC) has become a significant task for machines to perceive the surrounding auditory scene. However, extracting effective representations that capture the underlying characteristics of the acoustic events is still challenging. Previous methods mainly focused on designing the audio features in a 'hand-crafted' manner....
In the past three decades, snoring (affecting more than 30 % of the adult UK population) has been increasingly studied in the transdisciplinary research community involving medicine and engineering. Early work demonstrated that the snore sound can carry important information about the status of the upper airway, which facilitates the development o...
Predicting emotions automatically is an active field of research in affective computing. Given the subjectivity of individual perception, the label of an emotional instance is usually created based on opinions from multiple annotators. That is, the labelled instance is often accompanied by the corresponding inter-rater disagreement info...
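A minimal sketch of training against the annotator label distribution rather than a single hard label, which is one straightforward way to use the inter-rater disagreement mentioned above; the vote counts and class set are illustrative:

```python
# Minimal sketch of soft-label training: the model is fitted to the
# empirical distribution of annotator votes instead of the
# majority-vote class. Illustrative only.
import torch
import torch.nn.functional as F

# Three annotators voted [happy, happy, neutral] over four classes
# (happy, sad, angry, neutral) -> soft target [2/3, 0, 0, 1/3].
soft_target = torch.tensor([[2/3, 0.0, 0.0, 1/3]])

logits = torch.randn(1, 4, requires_grad=True)   # model output
log_probs = F.log_softmax(logits, dim=-1)

# KL(target || model) equals cross-entropy with soft labels up to a
# constant; F.kl_div expects log-probabilities as its input.
loss = F.kl_div(log_probs, soft_target, reduction="batchmean")
loss.backward()
```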
A challenging issue in the field of the automatic recognition of emotion from speech is the efficient modelling of long temporal contexts. Moreover, when incorporating long-term temporal dependencies between features, recurrent neural network (RNN) architectures are typically employed by default. In this work, we aim to present an efficient deep ne...
Background
A crucial element of human–machine interaction, the automatic detection of emotional states from human speech has long been regarded as a challenging task for machine learning models. One vital challenge in speech emotion recognition (SER) is how to learn robust and discriminative representations from speech. Meanwhile, although machine...
The COVID-19 outbreak was announced as a global pandemic by the World Health Organisation in March 2020 and has affected a growing number of people in the past few weeks. In this context, advanced artificial intelligence techniques are brought to the fore in the effort to fight against and reduce the impact of this global health crisis. In this stu...
One of the frontier issues that severely hampers the development of automatic snore sound classification (ASSC) is the lack of sufficient supervised training data. To cope with this problem, we propose a novel data augmentation approach based on semi-supervised conditional Generative Adversarial Networks (scGANs), which aims to automatica...
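A minimal sketch of the class-conditioned generator a conditional GAN uses to synthesise labelled acoustic feature vectors for augmentation; the architecture, the dimensions, and the omitted discriminator and training loop are assumptions, not the paper's scGAN:

```python
# Minimal sketch of a conditional generator that produces synthetic,
# class-conditioned acoustic feature vectors. Illustrative only.
import torch
import torch.nn as nn

NOISE_DIM, NUM_CLASSES, FEAT_DIM = 100, 4, 128   # assumed sizes

class ConditionalGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.label_emb = nn.Embedding(NUM_CLASSES, NUM_CLASSES)
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + NUM_CLASSES, 256),
            nn.ReLU(),
            nn.Linear(256, FEAT_DIM),
        )

    def forward(self, z, labels):
        # Condition the noise vector on the target snore class.
        x = torch.cat([z, self.label_emb(labels)], dim=1)
        return self.net(x)

gen = ConditionalGenerator()
z = torch.randn(8, NOISE_DIM)
labels = torch.randint(0, NUM_CLASSES, (8,))
fake_features = gen(z, labels)    # eight synthetic feature vectors
```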
This study investigates the performance of wavelet as well as conventional temporal and spectral features for acoustic scene classification, testing the effectiveness of both feature sets when combined with neural networks. The TUT Acoustic Scenes 2017 Database is used in the evaluation of the system. The model with...
Early interventions in mental health conditions such as Major Depressive Disorder (MDD) are critical to improved health outcomes, as they can help reduce the burden of the disease. As the efficient diagnosis of depression severity is therefore highly desirable, the use of behavioural cues such as speech characteristics in diagnosis is attracting in...
Using neural networks to classify infant vocalisations into important subclasses (such as crying versus speech) is an emergent task in speech technology. One of the biggest roadblocks standing in the way of progress lies in the datasets: The performance of a learning model is affected by the labelling quality and size of the dataset used, and infan...
In recent years, machine learning has been increasingly applied to the area of mental health diagnosis, treatment, support, research, and clinical administration. In particular, using less-invasive wearables combined with artificial intelligence to monitor or diagnose mental diseases is in great demand in real practice. To this end, we p...
Despite remarkable advances, emotion recognition systems are severely restrained either by the essentially limited properties of a single employed modality or by the required synchronous presence of all involved modalities. Motivated by this, we propose a novel crossmodal emotion embedding framework called EmoBed, which aims to leverage the knowled...
A less-invasive method for the diagnosis of the major depressive disorder can be useful for both the psychiatrists and the patients. We propose a machine learning framework for automatically discriminating patients suffering from the major depressive disorder (n=14) and healthy subjects (n=17). To this end, spontaneous physical activity data were r...
The automatic detection of an emotional state from human speech, which plays a crucial role in the area of human–machine interaction, has consistently been shown to be a difficult task for machine learning algorithms. Previous work on emotion recognition has mostly focused on the extraction of carefully hand-crafted and highly engineered features....
Over the past few years, adversarial training has become an extremely active research topic and has been successfully applied to various Artificial Intelligence (AI) domains. As a potentially crucial technique for the development of the next generation of emotional AI systems, we herein provide a comprehensive overview of the application of adversa...
Snore sound (SnS) classification can support a targeted surgical approach to sleep related breathing disorders. Using machine listening methods, we aim to find the location of obstruction and vibration within a subject’s upper airway. Wavelet features have been demonstrated to be efficient in the recognition of SnSs in previous studies. In this wor...
This paper proposes a comprehensive study on machine listening for localisation of snore sound excitation. Here we investigate the effects of varied frame sizes, and overlap of the analysed audio chunk for extracting low-level descriptors. In addition, we explore the performance of each kind of feature when it is fed into varied classifier models,...
Infant vocalisation analysis plays an important role in the study of the development of pre-speech capability of infants, while machine-based approaches nowadays emerge with an aim to advance such an analysis. However, conventional machine learning techniques require heavy feature engineering and refined architecture design. In this paper, we pr...
Spectrogram representations of acoustic scenes have achieved competitive performance for acoustic scene classification. Yet, the spectrogram alone does not take into account a substantial amount of time-frequency information. In this study, we present an approach for exploring the benefits of deep scalogram representations, extracted in segments fr...
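A minimal sketch of computing a scalogram (continuous wavelet transform magnitude) from an audio segment, the kind of representation described above; the wavelet choice and all parameters are illustrative:

```python
# Minimal sketch of scalogram extraction with the continuous wavelet
# transform, yielding a time-frequency image suitable as CNN input.
import numpy as np
import pywt

sr = 16000                                   # assumed sample rate
audio = np.random.randn(sr)                  # stand-in 1-second segment
scales = np.arange(1, 129)                   # 128 wavelet scales
coeffs, freqs = pywt.cwt(audio, scales, "morl",
                         sampling_period=1.0 / sr)
scalogram = np.log1p(np.abs(coeffs))         # shape (128, num_samples)
# Each column is one time step; segment and stack these as CNN input.
```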
One of the major obstacles that has to be faced when applying automatic emotion recognition to realistic human–machine interaction systems is the scarcity of labelled data for training a robust model. Motivated by this concern, this article seeks to exploit to the utmost the unlabelled data that are pervasively available in the real world and easy to colle...
Objective:
Snoring can be excited in different locations within the upper airways during sleep. It was hypothesised that the excitation locations are correlated with distinct acoustic characteristics of the snoring noise. To verify this hypothesis, a database of snore sounds is developed, labelled with the location of sound excitation.
Methods:...
The involvement of affect information in a spoken dialogue system can increase user-friendliness and provide a more natural interaction experience. This can be achieved by speech emotion recognition, where the features are usually dominated by spectral amplitude information while ignoring the phase spectrum. In this...
We investigate the effectiveness of wavelet features for acoustic scene classification as contribution to the subtask of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2017). On the back-end side, gated recurrent neural networks (GRNNs) are compared against traditional support vector machines (SVMs). We...
For the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2017), we propose a novel method to classify 15 different acoustic scenes using deep sequential learning, based on features extracted from Short-Time Fourier Transform and scalogram of the audio scenes using Con...
For the Acoustic Scene Classification task of the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE2017), we propose a novel method to classify 15 different acoustic scenes using deep sequential learning for the audio scenes. First, deep representations extracted from the spectrogram and two types of scalogra...
Over the last decade, automatic emotion recognition has become well established. The gold-standard target is usually calculated based on multiple annotations from different raters. All related efforts assume that the emotional state of a human subject can be identified by a 'hard' category or a unique value. This assumption tries to ease th...
Despite the widespread use of supervised learning methods for speech emotion recognition, they are severely restricted due to the lack of sufficient amount of labelled speech data for the training. Considering the wide availability of unlabelled speech data, therefore, this paper proposes semi-supervised autoencoders to improve speech emotion recog...
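A minimal sketch of a semi-supervised autoencoder in the spirit described above: unlabelled speech contributes a reconstruction loss, labelled speech adds a classification loss on the shared encoding. Dimensions and architecture are assumptions:

```python
# Minimal sketch of a semi-supervised autoencoder for SER.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT, HID, CLASSES = 384, 64, 4   # assumed dimensions

class SemiSupervisedAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(FEAT, HID), nn.ReLU())
        self.decoder = nn.Linear(HID, FEAT)
        self.classifier = nn.Linear(HID, CLASSES)

    def forward(self, x):
        h = self.encoder(x)
        return self.decoder(h), self.classifier(h)

model = SemiSupervisedAE()
x_unlab = torch.randn(32, FEAT)                       # unlabelled batch
x_lab, y_lab = torch.randn(8, FEAT), torch.randint(0, CLASSES, (8,))

recon_u, _ = model(x_unlab)
recon_l, logits = model(x_lab)
loss = (F.mse_loss(recon_u, x_unlab)                  # unlabelled term
        + F.mse_loss(recon_l, x_lab)
        + F.cross_entropy(logits, y_lab))             # labelled term
loss.backward()
```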
In recent years, research fields including ecology, bioacoustics, signal processing, and machine learning, have made bird sound recognition a part of their focus. This has led to significant advancements within the field of ornithology, such as improved understanding of evolution, local biodiversity, mating rituals, and even the implications and re...
Objective: Obstructive Sleep Apnea (OSA) is a serious chronic disease and a risk factor for cardiovascular diseases. Snoring is a typical symptom of OSA patients. Knowledge of the origin of obstruction and vibration within the upper airways is essential for a targeted surgical approach. The aim of this paper is to systematically compare different acous...
With recent advances in machine-learning techniques for automatic speech analysis (ASA), the computerized extraction of information from speech signals, there is a greater need for high-quality, diverse, and very large amounts of data. Such data could be game-changing in terms of ASA system accuracy and robustness, enabling the extraction of feature...
Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that still remains an important challenge. Data-driven supervised approaches, including ones based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches a...
As a highly active topic in computational paralinguistics, Speech Emotion Recognition (SER) aims to explore ideal representations for emotional factors in speech. In order to improve the performance of SER, Multiple Kernel Learning (MKL) dimensionality reduction has been utilised to obtain effective information for recognising emotions. However, th...
It has been shown that automatic bird sound recognition can be an extremely useful tool for ornithologists and ecologists, allowing for a deeper understanding of mating, evolution, local biodiversity, and even climate change. For a robust and efficient recognition model, a large amount of labelled data is needed, requiring a time-consuming and costly...
Whispered speech, as an alternative speaking style to 'normal phonated' (non-whispered) speech, has received little attention in speech emotion recognition. Currently, speech emotion recognition systems are exclusively designed to process normal phonated speech, and can result in significantly degraded performance on whispered speech because o...
One of the serious obstacles to the applications of speech emotion recognition systems in real-life settings is the lack of generalisation of the emotion classifiers. Many recognition systems often present a dramatic drop in performance when tested on speech data obtained from different speakers, acoustic environments, linguistic content, and domai...
Automatic continuous affect recognition from audiovisual cues is arguably one of the most active research areas in machine learning. In addressing this regression problem, the advantages of the models, such as the global-optimisation capability of Support Vector Machine for Regression and the context-sensitive capability of memory-enhanced neural n...
This paper presents the University of Passau’s approaches for the Multimodal Emotion Recognition Challenge 2016. For audio signals, we exploit Bag-of-Audio-Words techniques combining Extreme Learning Machines and Hierarchical Extreme Learning Machines. For video signals, we use not only the information from the cropped face of a video frame, but al...
We propose a novel method to learn multiscale kernels with locally penalised discriminant analysis, namely Multiscale-Kernel Locally Penalised Discriminant Analysis (MS-KLPDA). As an exemplary use-case, we apply it to recognise emotions in speech. Specifically, we employ the term of locally penalised discriminant analysis by controlling the weights...
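A minimal sketch of the multiscale-kernel ingredient: a weighted combination of RBF kernels at several bandwidths. In MS-KLPDA the weights are learnt; here they are fixed for illustration:

```python
# Minimal sketch of a multiscale kernel built from RBF kernels at
# several bandwidths. Bandwidths and weights are illustrative.
import numpy as np

def rbf(X, Y, gamma):
    # Squared Euclidean distances via broadcasting.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def multiscale_kernel(X, Y, gammas=(0.01, 0.1, 1.0),
                      weights=(0.2, 0.5, 0.3)):
    # In MS-KLPDA these weights would be optimised; fixed here.
    return sum(w * rbf(X, Y, g) for w, g in zip(weights, gammas))

X = np.random.randn(20, 6)            # stand-in emotion features
K = multiscale_kernel(X, X)           # (20, 20) Gram matrix
```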
Signal noise reduction can improve the performance of machine learning systems dealing with time signals such as audio. Real-life applicability of these recognition technologies requires the system to uphold its performance level in variable, challenging conditions such as noisy environments. In this contribution, we investigate audio signal denois...
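For orientation, a minimal sketch of classical spectral-subtraction denoising, one of the simplest baselines that learned denoisers are measured against; the noise estimate and all parameters are illustrative:

```python
# Minimal sketch of spectral subtraction: estimate the noise magnitude
# from assumed speech-free frames and subtract it from every frame.
import numpy as np
from scipy.signal import stft, istft

sr = 16000
noisy = np.random.randn(2 * sr)           # stand-in noisy signal
f, t, Z = stft(noisy, fs=sr, nperseg=512)

noise_mag = np.abs(Z[:, :10]).mean(axis=1, keepdims=True)
mag = np.maximum(np.abs(Z) - noise_mag, 0.05 * np.abs(Z))  # floor
Z_clean = mag * np.exp(1j * np.angle(Z))                   # keep phase
_, denoised = istft(Z_clean, fs=sr, nperseg=512)
```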
Knowledge of the location and form of the upper airway obstruction is essential for a targeted therapy of obstructive sleep apnea (OSA). Utilizing snore sounds (SnS) to reveal the pathological characteristics of OSA patients has been the subject of scientific research for several decades. Fewer studies exist on the evaluation of SnS to identify the corresponding obstruc...
Features for speech emotion recognition are usually dominated by the spectral magnitude information while they ignore the use of the phase spectrum because of the difficulty of properly interpreting it. Motivated by recent successes of phase-based features for speech processing, this paper investigates the effectiveness of phase information for whi...
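A minimal sketch of extracting frame-wise phase information alongside the usual magnitude spectrum, in the spirit of the phase-based features investigated above; window settings are illustrative:

```python
# Minimal sketch of magnitude-plus-phase feature extraction via STFT.
import numpy as np
from scipy.signal import stft

sr = 16000
speech = np.random.randn(sr)             # stand-in whispered speech
f, t, Z = stft(speech, fs=sr, nperseg=400, noverlap=240)

magnitude = np.abs(Z)
# Raw phase is wrapped to (-pi, pi]; unwrapping along frequency gives
# a smoother representation that is easier to model.
phase = np.unwrap(np.angle(Z), axis=0)
features = np.concatenate([magnitude, phase], axis=0)  # per frame
```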
Automatically classifying bird species by their sound signals is of crucial relevance for the research of ornithologists and ecologists. In this study, we present a novel framework for bird sounds classification from audio recordings. Firstly, the p-centre is used to detect the ‘syllables’ of bird songs, which are the units for the recognition task...
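A minimal sketch of segmenting a recording into syllable-like units before classification; the paper's p-centre detection is replaced here by generic energy-based onset detection for illustration:

```python
# Minimal sketch of syllable-like segmentation via onset detection.
import librosa

# Stand-in recording (downloads a bundled librosa example clip).
y, sr = librosa.load(librosa.ex("trumpet"))
onsets = librosa.onset.onset_detect(y=y, sr=sr, units="samples")
syllables = [y[s:e] for s, e in zip(onsets[:-1], onsets[1:])]
# Each chunk in `syllables` would then be classified independently,
# with the clip-level species decision taken by majority vote.
```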
In this contribution, we propose a novel method for Active Learning (AL) - Dynamic Active Learning (DAL) - which targets the reduction of the costly human labelling work necessary for modelling subjective tasks such as emotion recognition in spoken interactions. The method implements an adaptive query strategy that minimises the...
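A minimal sketch of the uncertainty-sampling core that active-learning schemes such as DAL build on: query human labels only for the instances the current model is least sure about. DAL's adaptive query strategy is reduced here to a fixed margin criterion; data and model are illustrative:

```python
# Minimal sketch of margin-based uncertainty sampling.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_lab = rng.normal(size=(40, 10)); y_lab = rng.integers(0, 2, 40)
X_pool = rng.normal(size=(500, 10))           # unlabelled pool

clf = LogisticRegression().fit(X_lab, y_lab)
proba = clf.predict_proba(X_pool)

# Margin between the two most likely classes: small margin = unsure.
sorted_p = np.sort(proba, axis=1)
margin = sorted_p[:, -1] - sorted_p[:, -2]
query_idx = np.argsort(margin)[:10]           # ask annotators for these
```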
The typical inherent mismatch between the test and training corpora, and thereby between 'target' and 'source' sets, usually leads to significant performance degradation. To cope with this, this study presents a feature transfer learning method using Denoising Autoencoders (DAEs) to build high-order subspaces of the source and target corpora, where f...
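A minimal sketch of a denoising autoencoder of the kind used for such feature transfer: corrupt the input, reconstruct the clean version, and reuse the hidden layer as a more domain-robust feature space. Corruption level and sizes are assumptions:

```python
# Minimal sketch of a denoising autoencoder (DAE) for feature transfer.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT, HID = 384, 128

class DAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(FEAT, HID), nn.Sigmoid())
        self.dec = nn.Linear(HID, FEAT)

    def forward(self, x):
        corrupted = x + 0.1 * torch.randn_like(x)   # input corruption
        h = self.enc(corrupted)
        return self.dec(h), h

dae = DAE()
x = torch.randn(16, FEAT)                 # source-corpus features
recon, hidden = dae(x)
loss = F.mse_loss(recon, x)               # reconstruct the clean input
loss.backward()
# After training on both corpora, `hidden` serves as the shared
# subspace in which source and target features are better aligned.
```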
In this paper, we propose and evaluate a distributed system for multiple Computational Paralinguistics tasks in a client-server architecture. The client side deals with feature extraction, compression, and bit-stream formatting, while the server side performs the reverse process, plus model training, and classification. The proposed architecture fa...
With the availability of speech data obtained from different devices and varied acquisition conditions, we are often faced with scenarios, where the intrinsic discrepancy between the training and the test data has an adverse impact on affective speech analysis. To address this issue, this letter introduces an Adaptive Denoising Autoencoder based on...
In this article, the reverberation problem for hands-free voice controlled devices is addressed by employing Bidirectional Long Short-Term Memory (BLSTM) recurrent neural networks. Such networks use memory blocks in the hidden units, enabling them to exploit a self-learnt amount of temporal context. The main objective of this technique is to minimi...
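A minimal sketch of a BLSTM feature-mapping dereverberator: the network maps reverberant log-spectral frames to clean targets, exploiting temporal context in both directions. Sizes and features are illustrative, not the article's configuration:

```python
# Minimal sketch of BLSTM-based dereverberation by feature mapping.
import torch
import torch.nn as nn

class BLSTMDereverb(nn.Module):
    def __init__(self, feat_dim=40, hidden=128):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, feat_dim)

    def forward(self, x):                  # x: (batch, frames, feat)
        h, _ = self.blstm(x)               # exploits past AND future
        return self.out(h)                 # enhanced frames

model = BLSTMDereverb()
reverberant = torch.randn(4, 200, 40)      # stand-in log-mel frames
clean = torch.randn(4, 200, 40)            # stand-in clean targets
loss = nn.functional.mse_loss(model(reverberant), clean)
loss.backward()
```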
This study addresses a situation in practice where training and test samples come from different corpora - here in acoustic emotion recognition. In this situation, a model is trained on one database while tested on another disjoint one. The typical inherent mismatch between the corpora, and thereby between test and training sets, usually leads to sign...
In this paper, we propose a novel method for highly efficient exploitation of unlabeled data: Cooperative Learning. Our approach consists of combining Active Learning and Semi-Supervised Learning techniques, with the aim of reducing the costly effects of human annotation. The core underlying idea of Cooperative Learning is to share the labeling work...
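A minimal sketch of the Cooperative Learning idea as summarised above: confident machine predictions are self-labelled (the semi-supervised part) while the least confident instances go to human annotators (the active part). Thresholds, data, and model are illustrative:

```python
# Minimal sketch of splitting the labelling work between machine
# (self-labelling) and human annotators (active queries).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_lab = rng.normal(size=(50, 10)); y_lab = rng.integers(0, 2, 50)
X_pool = rng.normal(size=(1000, 10))       # unlabelled pool

clf = LogisticRegression().fit(X_lab, y_lab)
conf = clf.predict_proba(X_pool).max(axis=1)

auto_idx = np.where(conf > 0.95)[0]        # machine labels these...
human_idx = np.argsort(conf)[:20]          # ...humans label these
X_auto, y_auto = X_pool[auto_idx], clf.predict(X_pool[auto_idx])
# y_human would come from annotators; the model is then retrained on
# the union of hand-labelled and self-labelled data.
```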
Data sparsity is one of the major bottlenecks in the field of Computational Paralinguistics. Partially supervised learning approaches can help alleviate this problem without the need for cost-intensive human labelling efforts. We thus investigate the feasibility of cotraining for exemplary paralinguistic speech analysis tasks spanning along the time-...
In speech emotion recognition, training and test data used for system development usually tend to fit each other perfectly, but further 'similar' data may be available. Transfer learning helps to exploit such similar data for training despite the inherent dissimilarities in order to boost a recogniser's performance. In this context, this paper pres...
Speech data is in principle available in large amounts for the training of acoustic emotion recognisers. However, emotional labelling is usually not given and the distribution is heavily unbalanced, as most data is 'rather neutral' than truly 'emotional'. In the 'hay stack' of speech data, Active Learning automatically identifies the 'needles', i.e...