Naijun Zheng’s research while affiliated with The University of Hong Kong and other places


Publications (8)


Integrated and Enhanced Pipeline System to Support Spoken Language Analytics for Screening Neurocognitive Disorders
  • Conference Paper

August 2023 · 48 Reads · 8 Citations

Helen Meng · Brian Mak · [...]



Fig. 1. The proposed framework. X, Z, H, and A are the acoustic features, hidden features, bottleneck features, and the output of the question-answer layer, respectively. f and g are the SENet feature extractor and the self-attention layer, corresponding to rows (1)-(8) and (9) in Table 1, respectively. QA and AF are the question-answering (fake span discovery) and anti-spoofing layers, each with its own loss calculation procedure.
Table: EERs using MSTFT features. w/ and w/o mean with and without, respectively; w/ re-synthesis corresponds to using audio re-synthesised by Griffin-Lim and WORLD.
Partially Fake Audio Detection by Self-attention-based Fake Span Discovery
  • Preprint
  • File available

February 2022 · 77 Reads

The past few years have witnessed significant advances in speech synthesis and voice conversion technologies. However, such technologies can undermine the robustness of widely deployed biometric identification models and can be harnessed by in-the-wild attackers for illegal uses. The ASVspoof challenge mainly focuses on audio synthesized by advanced speech synthesis and voice conversion models, as well as replay attacks. Recently, the first Audio Deep Synthesis Detection challenge (ADD 2022) extended the attack scenarios in several new directions; ADD 2022 is also the first challenge to propose the partially fake audio detection task. Such brand-new attacks are dangerous, and how to tackle them remains an open question. We therefore propose a novel framework that introduces a question-answering (fake span discovery) strategy with a self-attention mechanism to detect partially fake audio. The proposed fake span detection module tasks the anti-spoofing model with predicting the start and end positions of the fake clip within a partially fake audio, directs the model's attention toward discovering fake spans rather than other, less generalizable shortcuts, and ultimately equips the model with the capacity to discriminate between real and partially fake audio. Our submission ranked second in the partially fake audio detection track of ADD 2022.
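The QA-style span prediction described above can be sketched as follows. This is a minimal numpy illustration, not the authors' implementation: the random projection matrices stand in for learned parameters, the single-head attention replaces the paper's SENet-plus-self-attention stack, and all dimensions are arbitrary. It only shows the shape of the idea: contextualize frame features with self-attention, then emit per-frame start and end logits whose argmaxes bound the predicted fake span.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fake_span_logits(frames, d_model, rng):
    """Toy fake-span discovery head: single-head self-attention over
    frame features, then per-frame start/end logits (QA-style)."""
    T, d = frames.shape
    # random projections stand in for learned Q/K/V weights
    Wq, Wk, Wv = (rng.standard_normal((d, d_model)) for _ in range(3))
    Q, K, V = frames @ Wq, frames @ Wk, frames @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_model), axis=-1)  # (T, T) attention map
    H = attn @ V                                         # contextualized frames
    w_start, w_end = rng.standard_normal((2, d_model))   # start/end scoring vectors
    return H @ w_start, H @ w_end                        # two (T,) logit vectors

rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 16))   # 50 frames of 16-dim acoustic features
start_logits, end_logits = fake_span_logits(frames, 32, rng)
start, end = int(np.argmax(start_logits)), int(np.argmax(end_logits))
```

In a trained system the start/end logits would be supervised with cross-entropy against the ground-truth boundaries of the injected fake clip, which is what forces the model's attention onto the span itself.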


Fig. 1. Diagram of the FFM-TS-VAD network.
The CUHK-TENCENT speaker diarization system for the ICASSP 2022 multi-channel multi-party meeting transcription challenge

February 2022 · 28 Reads

This paper describes our speaker diarization system submitted to the Multi-channel Multi-party Meeting Transcription (M2MeT) challenge, where Mandarin meeting data were recorded in multi-channel format for diarization and automatic speech recognition (ASR) tasks. In these meeting scenarios, the uncertainty of the speaker number and the high ratio of overlapped speech present great challenges for diarization. Based on the assumption that there is valuable complementary information among acoustic, spatial-related, and speaker-related features, we propose a target-speaker voice activity detection system built on a multi-level feature fusion mechanism (FFM-TS-VAD) to improve on the conventional TS-VAD system. Furthermore, we propose a data augmentation method during training to improve robustness when the angular difference between two speakers is relatively small. We provide comparisons among the sub-systems we used in the M2MeT challenge. Our submission is a fusion of several sub-systems and ranked second in the diarization task.
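The feature-fusion idea behind FFM-TS-VAD can be illustrated with a small sketch. This is an assumption-laden simplification, not the paper's architecture: it fuses the three feature streams by plain per-frame concatenation, whereas the actual system fuses at multiple network levels, and the feature dimensions (40-dim filterbanks, 8-dim spatial features, 192-dim speaker embedding) are placeholders.

```python
import numpy as np

def fuse_features(acoustic, spatial, speaker_emb):
    """Hypothetical single-level fusion for a TS-VAD front end:
    concatenate frame-aligned acoustic and spatial features with the
    target speaker's embedding, tiled across all frames."""
    T = acoustic.shape[0]
    assert spatial.shape[0] == T, "spatial features must be frame-aligned"
    tiled = np.tile(speaker_emb, (T, 1))  # (T, E): one embedding copy per frame
    return np.concatenate([acoustic, spatial, tiled], axis=1)

T = 100
fused = fuse_features(
    acoustic=np.zeros((T, 40)),   # e.g. 40-dim log-mel filterbanks
    spatial=np.zeros((T, 8)),     # e.g. inter-channel / DOA-derived features
    speaker_emb=np.zeros(192),    # e.g. a 192-dim speaker embedding
)
```

A downstream TS-VAD network would then score each fused frame for the target speaker's activity; the complementary spatial stream is what helps when two speakers' voices are similar but their positions differ.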



Citations (5)


... Due to the irreversible progression of AD pathology [1], early detection and diagnosis play a pivotal role in facilitating timely intervention and management, conventionally relying on in-person clinical assessments [2,3]. With recent progress in spoken language technology, speech-based automatic AD detection has emerged as a promising area due to its potential for more cost-effective and scalable AD screening [4,5]. ...

Reference:

Not All Errors Are Equal: Investigation of Speech Recognition Errors in Alzheimer's Disease Detection
Integrated and Enhanced Pipeline System to Support Spoken Language Analytics for Screening Neurocognitive Disorders
  • Citing Conference Paper
  • August 2023

... In CHiME-7, the USTC team applied an iterative cACGMM-based diarization correction method, achieving progressively better results through a four-stage process [13]. Moreover, the top-performing system in M2MeT utilized a multi-channel TS-VAD model [14], while the CUHK-TENCENT team explored the use of DOA methods to estimate speaker locations and integrate this spatial information into Neural Speaker Diarization (NSD) models [15]. ...

The CUHK-Tencent Speaker Diarization System for the ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Challenge
  • Citing Conference Paper
  • May 2022

... In fact, Jung et al. (2022), Mary et al. (2022) and Verma and Berger (2021) used the Transformer-based method to get better results. Wu et al. (2022) tried to use self-attention as the critical function to get a better detection capacity for partially fake audio. ...

Partially Fake Audio Detection by Self-Attention-Based Fake Span Discovery
  • Citing Conference Paper
  • May 2022

... Speaker diarization (SD) systems that identify "who spoke when" are valuable in various applications involving multispeaker conversations [1]. The SD system comprises submodules for voice activity detection (VAD), segmentation, speaker embedding (SE) extraction, and clustering [2][3][4][5][6][7][8][9]. Alternatively, an end-to-end neural network (NN) approach can be used [10][11][12][13][14]; however, it has demonstrated challenges when dealing with meetings involving a large number of participants [11] and long-form audio (e.g., recordings exceeding 10 minutes) [15]. ...

Multi-Channel Speaker Diarization Using Spatial Features for Meetings
  • Citing Conference Paper
  • May 2022

... Notable recent works include using text-to-speech techniques to synthesize fake speakers [27,28], which helps make the speaker model more generalizable. Other approaches involve more sophisticated audio signal processing technologies in the front-end, such as beamforming methods [29], dereverberation techniques [30], and speech separation methods [31], all of which contribute to making speaker recognition systems more robust in varying acoustic conditions. ...

A Joint Training Framework of Multi-Look Separator and Speaker Embedding Extractor for Overlapped Speech
  • Citing Conference Paper
  • June 2021