About
548 Publications · 60,826 Reads · 5,108 Citations
Publications (548)
The rail transit switch machine ensures the safe turning and operation of trains on the track by switching point positions, locking switch rails, and reporting switch status in real time. However, when detecting rail transit switch machine parts in complex scenarios such as augmented reality and automatic inspection, existing algorithms have problems suc...
Graph anomaly detection aims to identify unusual patterns in graph-based data, with wide applications in fields such as web security and financial fraud detection. Existing methods typically rely on contrastive learning, assuming that a lower similarity between a node and its local subgraph indicates abnormality. However, these approaches overlook...
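The scoring rule these contrastive detectors share is simple enough to state in code: a node is ranked as anomalous in proportion to how poorly its embedding matches a readout of its local subgraph. A minimal sketch of that baseline rule, with placeholder embeddings (the paper's actual model and training procedure are not shown in this excerpt):

```python
import numpy as np

def anomaly_scores(node_emb: np.ndarray, subgraph_emb: np.ndarray) -> np.ndarray:
    """1 - cosine similarity between each node and its local-subgraph readout.

    Higher scores mean the node matches its neighborhood poorly, which
    contrastive detectors treat as evidence of anomaly.
    """
    node = node_emb / np.linalg.norm(node_emb, axis=1, keepdims=True)
    ctx = subgraph_emb / np.linalg.norm(subgraph_emb, axis=1, keepdims=True)
    return 1.0 - np.sum(node * ctx, axis=1)

rng = np.random.default_rng(0)
nodes = rng.normal(size=(5, 16))                    # placeholder node embeddings
contexts = nodes + 0.1 * rng.normal(size=(5, 16))   # placeholder subgraph readouts
print(anomaly_scores(nodes, contexts))
```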
Multimodal Sentiment Analysis (MSA) stands as a critical research frontier, seeking to comprehensively unravel human emotions by amalgamating text, audio, and visual data. Yet, discerning subtle emotional nuances within audio and video expressions poses a formidable challenge, particularly when emotional polarities across various segments appear si...
The challenges tied to unstructured graph data are manifold, primarily falling into node, edge, and graph-level problem categories. Graph Neural Networks (GNNs) serve as effective tools to tackle these issues. However, individual tasks often demand distinct model architectures, and training these models typically requires abundant labeled data, a l...
Recently, emotional speech generation and speaker cloning have garnered significant interest in text-to-speech (TTS). With the open-sourcing of codec language TTS models trained on massive datasets with large-scale parameters, adapting these general pre-trained TTS models to generate speech with specific emotional expressions and target speaker cha...
The information loss or distortion caused by single-channel speech enhancement (SE) harms the performance of automatic speech recognition (ASR). Observation addition (OA) is an effective post-processing method to improve ASR performance by balancing noisy and enhanced speech. Determining the OA coefficient is crucial. However, the currently supervi...
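Observation addition itself is just a convex combination of the noisy and enhanced waveforms; the research question is how to choose the coefficient. A minimal sketch with an illustrative coefficient value (the paper's coefficient estimation method is not shown):

```python
import numpy as np

def observation_add(noisy: np.ndarray, enhanced: np.ndarray, alpha: float) -> np.ndarray:
    """Blend enhanced speech back with the noisy observation.

    alpha = 1.0 keeps only the enhanced signal; alpha = 0.0 keeps only the
    noisy one. Intermediate values trade enhancement distortion against noise.
    """
    assert noisy.shape == enhanced.shape
    return alpha * enhanced + (1.0 - alpha) * noisy

noisy = np.random.randn(16000)      # placeholder audio, 1 s at 16 kHz
enhanced = 0.5 * noisy              # placeholder "enhanced" signal
blended = observation_add(noisy, enhanced, alpha=0.8)
```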
Existing learning resource recommendation systems suffer from data sparsity and missing data labels, leading to insufficient mining of the correlations between users and courses. To address these issues, we propose a learning resource recommendation method based on graph contrastive learning, which uses graph contrastive learning to construc...
Sarcasm thrives on popular social media platforms such as Twitter and Reddit, where users frequently employ it to convey emotions in an ironic or satirical manner. The ability to detect sarcasm plays a pivotal role in comprehending individuals' true sentiments. To achieve a comprehensive grasp of sentence semantics, it is crucial to integrate exter...
Deep learning has brought significant improvements to the field of cross-modal representation learning. For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired, emphasizing the semantic content of the text modality while de-emp...
Self-supervised learning (SSL) has garnered significant attention in speech processing, particularly excelling in linguistic tasks such as speech recognition. However, improving the performance of pre-trained models across various downstream tasks—each requiring distinct types of speech information—remains a significant challenge. To address this,...
In recent speech enhancement (SE) research, transformer and its variants have emerged as the predominant methodologies. However, the quadratic complexity of the self-attention mechanism imposes certain limitations on practical deployment. Mamba, as a novel state-space model (SSM), has gained widespread application in natural language processing and...
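The complexity argument is easy to see in code: self-attention forms a T×T score matrix, whereas a state-space model advances a fixed-size hidden state once per frame. Below is a toy linear time-invariant SSM scan illustrating the O(T) cost; Mamba's actual selective, input-dependent parameterization is more involved:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """y_t = C h_t with h_t = A h_{t-1} + B x_t.

    One fixed-cost state update per frame, so a length-T sequence costs O(T),
    versus the O(T^2) pairwise scores of self-attention.
    """
    steps, _ = x.shape
    h = np.zeros(A.shape[0])
    y = np.zeros((steps, C.shape[0]))
    for t in range(steps):
        h = A @ h + B @ x[t]
        y[t] = C @ h
    return y

rng = np.random.default_rng(0)
d_in, d_state, d_out = 4, 8, 4
A = 0.9 * np.eye(d_state)               # stable toy transition matrix
B = rng.normal(size=(d_state, d_in))
C = rng.normal(size=(d_out, d_state))
y = ssm_scan(rng.normal(size=(100, d_in)), A, B, C)
```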
To uncover the mechanism of phonetic category learning in both production and perception, this study conducted two linked training experiments using a real-time visual-feedback speech production training method for the [l]-[n] pair, with speakers of a Chinese dialect who can distinguish [l] and [n] in neither production nor perception. In experiment...
Multimodal Sentiment Analysis (MSA) stands as a critical research frontier, seeking to comprehensively unravel human emotions by amalgamating text, audio, and visual data. Yet, discerning subtle emotional nuances within audio and video expressions poses a formidable challenge, particularly when emotional polarities across various segments appear si...
Heterogeneous graph neural network (HGNN) is a popular technique for modeling and analyzing heterogeneous graphs. Most existing HGNN-based approaches are supervised or semi-supervised learning methods requiring graphs to be annotated, which is costly and time-consuming. Self-supervised contrastive learning has been proposed to address the problem o...
A model synthesizing average frequency components from select sentences in an electromagnetic articulography database has been crafted. This revealed the dual roles of the tongue: its dorsum acts like a carrier wave, and the tip acts as a modulation signal within the articulatory realm. This model illuminates anticipatory coarticulation's subtletie...
Self-supervised learning (SSL) has garnered significant attention in speech processing, excelling in linguistic tasks such as speech recognition. However, jointly improving the performance of pre-trained models on various downstream tasks, each requiring different speech information, poses significant challenges. To this purpose, we propose a progr...
Deep learning has brought significant improvements to the field of cross-modal representation learning. For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired, emphasizing the semantic content of the text modality while de-emp...
Ensuring trust, security, and privacy among all participating parties in the process of sharing supervision data is crucial for engineering quality and safety. However, the centralized architecture platforms commonly used for engineering supervision data suffer from problems such as poor data sharing and excessive centralization. A blockchain-b...
Accurately finding the wrong words in the automatic speech recognition (ASR) hypothesis and recovering them well-founded is the goal of speech error correction. In this paper, we propose a non-autoregressive speech error correction method. A Confidence Module measures the uncertainty of each word of the N-best ASR hypotheses as the reference to fin...
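The detection step can be pictured as thresholding per-word confidences and rewriting only the flagged positions. A generic sketch under that reading (the paper's Confidence Module is a learned component; the threshold and posteriors here are placeholders):

```python
from typing import List, Tuple

def flag_suspect_words(hypothesis: List[str],
                       confidences: List[float],
                       threshold: float = 0.7) -> List[Tuple[int, str]]:
    """Return (position, word) pairs whose confidence falls below threshold.

    In a non-autoregressive corrector, only these positions are rewritten,
    so words the recognizer is sure about pass through unchanged.
    """
    return [(i, w) for i, (w, c) in enumerate(zip(hypothesis, confidences))
            if c < threshold]

hyp = ["turn", "of", "the", "lights"]   # placeholder ASR hypothesis
conf = [0.95, 0.41, 0.97, 0.92]         # placeholder word posteriors
print(flag_suspect_words(hyp, conf))    # -> [(1, 'of')]
```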
Self-supervised learning (SSL) representations from massively multilingual models offer a promising solution for low-resource language speech tasks. Despite advancements, language adaptation in TTS systems remains an open problem. This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system proposed...
Object 6D pose estimation, a key technology in applications such as augmented reality (AR), virtual reality (VR), robotics, and autonomous driving, requires robustly predicting the 3D position and 3D orientation of objects from complex scene images. However, complex environmental factors such as occlusion, noise, weak texture, and lighting chang...
Accurate recognition of human intent is crucial for effective human-computer speech interaction. Numerous intent understanding studies have been based on speech-to-text transcription, which often overlooks the influence of paralinguistic cues (such as the speaker's emotion, attitude, etc.), leading to misunderstandings, especially when identical textual cont...
Emotion Recognition in Conversations (ERC) is a popular task in natural language processing, which aims to recognize the emotional state of the speaker in conversations. While current research primarily emphasizes contextual modeling, effective multimodal fusion methods remain underexplored. We propose a novel framework calle...
Stitched images can offer a broader field of view, but their boundaries can be irregular and unpleasant. To address this issue, current methods for rectangling images start by distorting local grids multiple times to obtain rectangular images with regular boundaries. However, these methods can result in content distortion and missing boundary infor...
With the increasing demand for humanization of human-computer interaction, dialogical speech emotion recognition (SER), which is more aligned with actual scenarios, has attracted attention from researchers. In this paper, we propose a dialogical SER approach that includes two modules: the first module is a pre-trained model with an adapter, the oth...
Ji Liu · Li Nan · Meng Ge · [...] · Jianwu Dang
Reverberation is known to severely affect speech recognition performance when speech is recorded in an enclosed space. Deep learning-based speech dereverberation has been remarkably successful in recent years, achieving superior recognition performance for far-field speech applications. However, the output from conventional dereverberation systems...
Deep learning has revolutionized voice activity detection (VAD) by offering promising solutions. However, directly applying traditional features, such as raw waveforms and Mel-frequency cepstral coefficients, to deep neural networks often leads to degraded VAD performance due to noise interference. In contrast, humans possess the remarkable ability...
In task-oriented dialogue systems, slot filling aims to identify the semantic slot type of each token in user utterances. Due to the lack of sufficient supervised data in many scenarios, it is necessary to transfer relevant knowledge via cross-domain slot filling. Previous studies rely on manually constructed additional meta-information to build the rela...
Neural text-to-speech (TTS) has achieved humanlike synthetic speech for single-speaker, single-language synthesis. Multilingual TTS systems are limited to resource-rich languages due to the lack of large paired text and studio-quality audio data. TTS systems are typically built using a single speaker's voice, but there is growing interest in develo...
Extended reality (XR) is a general term for virtual reality (VR), augmented reality (AR), and mixed reality (MR). By converting abstract digital expressions into intelligent feedback through figures, one can effectively compensate for the poor performance of traditional learning in deep cognitive processing and operational skills training. However,...
Dimensional emotion can better describe rich and fine-grained emotional states than categorical emotion. In the realm of human–robot interaction, the ability to continuously recognize dimensional emotions from speech empowers robots to capture the temporal dynamics of a speaker’s emotional state and adjust their interaction strategies in real-time....
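Work on continuous dimensional emotion recognition commonly trains and evaluates with the concordance correlation coefficient (CCC), which penalizes both decorrelation and mean/scale mismatch between predicted and gold valence/arousal traces; whether this particular paper uses CCC is not visible in the excerpt. A standard implementation:

```python
import numpy as np

def ccc(pred: np.ndarray, gold: np.ndarray) -> float:
    """Concordance correlation coefficient between two time series."""
    mp, mg = pred.mean(), gold.mean()
    vp, vg = pred.var(), gold.var()
    cov = np.mean((pred - mp) * (gold - mg))
    return 2 * cov / (vp + vg + (mp - mg) ** 2)

t = np.linspace(0, 10, 500)
gold = np.sin(t)                  # placeholder arousal trace
pred = 0.8 * np.sin(t) + 0.05     # correlated but biased prediction
print(f"CCC = {ccc(pred, gold):.3f}")
```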
The success of automatic speech recognition (ASR) benefits a great number of healthy people, but not people with disorders. Those with disordered speech may truly need support from technology, yet they actually gain little. The difficulties of disordered ASR arise from the limited availability of data and the abnormal nature of the speech, e.g., unclear, uns...
Against the background of rail transit as a major engineering field, guided by industry demand, and based on the cultivation of practical abilities for outstanding engineering and technology talents in rail-transit-specific majors, we are committed to creating a national strategic innovation and entrepreneurship practical education system for ra...
Introduction: Speech production involves neurological planning and articulatory execution. How speakers prepare for articulation is a significant aspect of speech production research. Previous studies have focused on isolated words or short phrases to explore speech planning mechanisms linked to articulatory behaviors, including investigating the ey...
For fine-grained generation and recognition tasks such as minimally-supervised text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), the intermediate representation extracted from speech should contain information that is between text coding and acoustic coding. The linguistic content is salient, while the paralinguis...
Speech emotion recognition (SER) plays an important role in human-computer interaction, which can provide better interactivity to enhance user experiences. Existing approaches tend to directly apply deep learning networks to distinguish emotions. Among them, the convolutional neural network (CNN) is the most commonly used method to learn emotional...
The speaker encoder is an important front-end module that explores discriminative speaker features for many speech applications requiring speaker information. Current speaker encoders aggregate multi-scale features from utterances using multi-branch network architectures. However, naively adding many branches through a fully convolutional operation...
Sarcasm is widely utilized on social media platforms such as Twitter and Reddit. Sarcasm detection is required for analyzing people's true feelings, since sarcasm is commonly used to portray a reversed emotion opposing the literal meaning. The syntactic structure is the key to making better use of commonsense when detecting sarcasm. However, it is ext...
Recently, there has been a growing interest in text-to-speech (TTS) methods that can be trained with minimal supervision by combining two types of discrete speech representations and using two sequence-to-sequence tasks to decouple TTS. To address the challenges associated with high dimensionality and waveform distortion in discrete representations...
Recently, progress has been made towards improving automatic sarcasm detection in computer science. Among existing models, manually constructing static graphs for texts and then using graph neural networks (GNNs) is one of the most effective approaches for drawing long-range incongruity patterns. However, the manually constructed graph structure mi...
In real life, either the subjective factors of speakers or the objective environment degrades the performance of automatic speech recognition (ASR). This study focuses on one of the subjective factors, accented speech, and attempts to build a multi-accent ASR system to mitigate the degradation caused by different accents, one of whose characteristic i...
The Audio-Visual Speaker Extraction (AVSE) algorithm employs parallel video recording to leverage two visual cues, namely speaker identity and synchronization, to enhance performance compared to audio-only algorithms. However, the visual front-end in AVSE is often derived from a pre-trained model or end-to-end trained, making it unclear which visua...
Speech emotion recognition is a critical component for achieving natural human–robot interaction. The modulation-filtered cochleagram is a feature based on auditory modulation perception, which contains multi-dimensional spectral–temporal modulation representation. In this study, we propose an emotion recognition framework that utilizes a multi-lev...
In recent years, the joint training of speech enhancement front-end and automatic speech recognition (ASR) back-end has been widely used to improve the robustness of ASR systems. Traditional joint training methods only use enhanced speech as input for the backend. However, it is difficult for speech enhancement systems to directly separate speech f...
Recently, stunning improvements in multi-channel speech separation have been achieved by neural beamformers when direction information is available. However, most of them neglect the speaker's 2-dimensional (2D) location cues contained in the mixture signal, which limits performance when two sources come from close directions. In this paper,...
Speech emotion recognition is a critical component for achieving natural human-robot interaction. The modulation-filtered cochleagram is a feature based on auditory modulation perception, which contains multi-dimensional spectral-temporal modulation representation. In this study, we propose an emotion recognition framework that utilizes a multi-lev...
Sentence oral reading requires not only a coordinated effort across the visual, articulatory, and cognitive processes but also presupposes a top-down influence of linguistic knowledge on visual-motor behavior. Despite gradual recognition of a predictive coding effect in this process, there is currently a lack of a comprehensive demonstration reg...
Short-time auditory attention detection (AAD) based on electroencephalography (EEG) can be utilized to help hearing-impaired people improve their perception abilities in multi-speaker environments. However, the large individual differences and very low signal-to-noise ratio (SNR) of EEG signals may prevent the AAD from working effectively across su...
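A common linear baseline in this area is stimulus reconstruction: regress the attended-speech envelope from EEG, then select the speaker whose envelope correlates best with the reconstruction. A sketch of that baseline decision rule (not the subject-robust method this paper proposes; all signals below are synthetic placeholders):

```python
import numpy as np
from sklearn.linear_model import Ridge

def decode_attention(eeg, env_a, env_b, decoder):
    """Pick the speaker whose envelope best matches the EEG reconstruction."""
    recon = decoder.predict(eeg)
    r_a = np.corrcoef(recon, env_a)[0, 1]
    r_b = np.corrcoef(recon, env_b)[0, 1]
    return "A" if r_a > r_b else "B"

rng = np.random.default_rng(0)
env_a = rng.normal(size=1000)                 # attended envelope (placeholder)
env_b = rng.normal(size=1000)                 # ignored envelope (placeholder)
# Synthetic 32-channel EEG that weakly carries the attended envelope.
eeg = np.outer(env_a, rng.normal(size=32)) + rng.normal(size=(1000, 32))
decoder = Ridge(alpha=1.0).fit(eeg, env_a)    # train on attended speech
print(decode_attention(eeg, env_a, env_b, decoder))
```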
Time-domain speech enhancement (SE) has recently been intensively investigated. Among recent works, DEMUCS introduces multi-resolution STFT loss to enhance performance. However, some resolutions used for STFT contain non-stationary signals, and it is challenging to learn multi-resolution frequency losses simultaneously with only one output. For bet...
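The multi-resolution STFT loss mentioned here is commonly implemented as a spectral-convergence term plus a log-magnitude L1 term, averaged over several FFT configurations; the sketch below follows that widely used formulation, and the exact resolutions and weighting in DEMUCS may differ:

```python
import torch
import torch.nn.functional as F

def stft_mag(x, n_fft, hop):
    window = torch.hann_window(n_fft, device=x.device)
    spec = torch.stft(x, n_fft, hop_length=hop, window=window,
                      return_complex=True)
    return spec.abs().clamp(min=1e-7)

def multi_res_stft_loss(pred, target,
                        resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Spectral convergence + log-magnitude L1, averaged over resolutions."""
    loss = 0.0
    for n_fft, hop in resolutions:
        mag_p = stft_mag(pred, n_fft, hop)
        mag_t = stft_mag(target, n_fft, hop)
        sc = torch.norm(mag_t - mag_p, p="fro") / torch.norm(mag_t, p="fro")
        log_l1 = F.l1_loss(torch.log(mag_p), torch.log(mag_t))
        loss = loss + sc + log_l1
    return loss / len(resolutions)

pred = torch.randn(16000, requires_grad=True)   # placeholder waveform, 1 s @ 16 kHz
target = torch.randn(16000)
multi_res_stft_loss(pred, target).backward()
```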
Visual speech (i.e., lip motion) is highly related to auditory speech due to the co-occurrence and synchronization in speech production. This paper investigates this correlation and proposes a cross-modal speech co-learning paradigm. The primary motivation of our cross-modal co-learning method is modeling one modality aided by exploiting knowledge...
Objective. Constructing an efficient human emotion recognition model based on electroencephalogram (EEG) signals is significant for realizing emotional brain–computer interaction and improving machine intelligence. Approach. In this paper, we present a spatial-temporal feature fused convolutional graph attention network (STFCGAT) model based on mul...
Automatic speaker verification (ASV) exhibits unsatisfactory performance under domain mismatch conditions owing to intrinsic and extrinsic factors, such as variations in speaking styles and recording devices encountered in real-world applications. To ensure robust performance under unseen conditions, domain generalization has been explored. However...
Segmenting long utterances is crucial in end-to-end (E2E) streaming automatic speech recognition (ASR). However, commonly used voice activity detection (VAD)-based and fixed-length segmentation methods may produce overly long or semantically incomplete segments, affecting the user experience and ASR performance. In this paper, we propose a speech segmen...
Speech emotion recognition (SER) promotes the development of intelligent devices, which enable natural and friendly human-computer interactions. However, the recognition performance of existing approaches is significantly reduced on unseen datasets, and the lack of sufficient training data limits the generalizability of deep learning models. In thi...
As an essential technology in human–computer interaction, automatic speech recognition (ASR) ensures a convenient life for healthy people; however, people with speech disorders, who truly need support from such a technology, have experienced difficulties using ASR. Disordered ASR is challenging because of the large variabilities in disorde...