Haizhou Li's research while affiliated with The Chinese University of Hong Kong and other places


Publications (70)


Efficient spiking neural network design via neural architecture search
  • Article

February 2024 · 22 Reads · Neural Networks

Jiaqi Yan · Malu Zhang · [...]



GrammarGPT: Exploring Open-Source LLMs for Native Chinese Grammatical Error Correction with Supervised Fine-Tuning

October 2023 · 25 Reads · 1 Citation

Grammatical error correction aims to correct ungrammatical sentences automatically. Recently, some work has demonstrated the excellent capabilities of closed-source Large Language Models (LLMs, e.g., ChatGPT) in grammatical error correction. However, the potential of open-source LLMs remains unexplored. In this paper, we introduce GrammarGPT, an open-source LLM, to preliminarily explore its potential for native Chinese grammatical error correction. The core recipe of GrammarGPT is to leverage a hybrid dataset of ChatGPT-generated and human-annotated data. For grammatical errors with clues, we propose a heuristic method that guides ChatGPT to generate ungrammatical sentences by providing those clues. For grammatical errors without clues, we collect ungrammatical sentences from publicly available websites and manually correct them. In addition, we employ an error-invariant augmentation method to enhance the model's ability to correct native Chinese grammatical errors. We ultimately construct about 1k parallel sentence pairs and use them to fine-tune open-source LLMs (e.g., Phoenix, released by The Chinese University of Hong Kong, Shenzhen) with instruction tuning. The experimental results show that GrammarGPT significantly outperforms the existing SOTA system. Although its model parameters are 20x larger than those of the SOTA baseline, the amount of data required for instruction tuning is 1200x smaller, illustrating the potential of open-source LLMs for native CGEC. Our GrammarGPT ranks 3rd on NLPCC2023 SharedTask1, demonstrating our approach's effectiveness. The code and data are available at https://github.com/FreedomIntelligence/GrammarGPT.
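The error-invariant augmentation mentioned in the abstract can be pictured with a toy sketch: substitute the same content word in both the ungrammatical sentence and its correction, so the error pattern itself stays untouched while the surface vocabulary varies. The function below is an illustrative assumption, not the paper's released code, and English toy strings stand in for Chinese sentences.

```python
def error_invariant_augment(src, tgt, slot_word, replacements):
    """Swap `slot_word` identically in an (ungrammatical, corrected) pair.

    Because the substitution is applied to both sides, the grammatical
    error itself is left invariant; only the content word changes.
    """
    return [(src.replace(slot_word, w), tgt.replace(slot_word, w))
            for w in replacements]

# Toy parallel pair: the missing article "a" is the error to preserve.
pair = ("Beijing is famous city", "Beijing is a famous city")
augmented = error_invariant_augment(*pair, "Beijing", ["Shanghai", "Shenzhen"])
# Every augmented pair still contains exactly the same missing-article error.
```

Applied to real CGEC data, the same idea multiplies a small set of annotated error pairs without ever altering the error phenomena the model must learn to fix.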



Figure 2: Accuracy across various clinical medicine fields at different career stages. The accuracies are the Zero-shot average values for TOP-5 models using direct response strategy.
Figure 8: The user interface for scoring an answer (in Chinese). Note that Figure 8 follows Figure 7 in the same webpage.
Pairwise Spearman correlations between results under different decoding temperatures. Original: results of greedy decoding (temperature 0). T-x: results of using nucleus sampling under temperature x.
CMB: A Comprehensive Medical Benchmark in Chinese
  • Preprint
  • File available

August 2023 · 35 Reads

Large Language Models (LLMs) open up the possibility of major breakthroughs in medicine. Establishing a standardized medical benchmark is a fundamental cornerstone for measuring progress. However, medical environments in different regions have their own local characteristics, e.g., the ubiquity and significance of traditional Chinese medicine within China. Therefore, merely translating an English-based medical evaluation may introduce contextual incongruities for a local region. To address this issue, we propose a localized medical benchmark called CMB, a Comprehensive Medical Benchmark in Chinese, designed and rooted entirely within the native Chinese linguistic and cultural framework. While traditional Chinese medicine is integral to this evaluation, it does not constitute its entirety. Using this benchmark, we evaluate several prominent LLMs, including ChatGPT, GPT-4, dedicated Chinese LLMs, and LLMs specialized in the medical domain. It is worth noting that our benchmark is devised not as a leaderboard competition but as an instrument for self-assessment of model advancements. We hope this benchmark can facilitate the widespread adoption and enhancement of medical LLMs within China. Details are available at https://cmedbenchmark.llmzoo.com/.


GrammarGPT: Exploring Open-Source LLMs for Native Chinese Grammatical Error Correction with Supervised Fine-Tuning

July 2023 · 65 Reads



Figure 1: The study design of our detection method
Is ChatGPT Involved in Texts? Measure the Polish Ratio to Detect ChatGPT-Generated Text

July 2023 · 58 Reads

The remarkable text-generation capabilities of large-scale language models such as ChatGPT have incited awe and spurred researchers to devise detectors to mitigate potential risks, including misinformation, phishing, and academic dishonesty. Despite this, most previous studies, including HC3, have been predominantly geared towards creating detectors that differentiate between purely ChatGPT-generated texts and human-authored texts. This approach, however, fails to discern texts generated through human-machine collaboration, such as ChatGPT-polished texts. Addressing this gap, we introduce a novel dataset termed HPPT (ChatGPT-polished academic abstracts), facilitating the construction of more robust detectors. It diverges from extant corpora by comprising pairs of human-written and ChatGPT-polished abstracts instead of purely ChatGPT-generated texts. Additionally, we propose the "Polish Ratio" method, an innovative measure of ChatGPT's involvement in text generation based on editing distance, which provides a mechanism to quantify the degree of human originality in the resulting text. Our experimental results show that our proposed model is more robust on the HPPT dataset and two existing datasets (HC3 and CDB). Furthermore, the Polish Ratio offers a more comprehensive explanation by quantifying the degree of ChatGPT involvement: a value greater than 0.2 signifies ChatGPT involvement, and a value exceeding 0.6 implies that ChatGPT generated most of the text.
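The abstract defines the Polish Ratio only as an editing-distance-based measure. A minimal sketch of one plausible formulation, normalizing word-level Levenshtein distance by the original length, might look like the following; the paper's exact tokenization and normalization may differ.

```python
def levenshtein(a, b):
    """Word-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def polish_ratio(original: str, polished: str) -> float:
    """Fraction of the original text that was edited (0 = untouched)."""
    orig_tokens = original.split()
    pol_tokens = polished.split()
    if not orig_tokens:
        return 0.0
    return levenshtein(orig_tokens, pol_tokens) / len(orig_tokens)

ratio = polish_ratio("the model performs good on tests",
                     "the model performs well on tests")
# one of six tokens substituted -> ratio = 1/6
```

Under this reading, the thresholds quoted above (0.2 and 0.6) would correspond to roughly one fifth and three fifths of the original tokens having been rewritten.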


Brain Topology Modeling With EEG-Graphs for Auditory Spatial Attention Detection

July 2023 · 142 Reads · 2 Citations

IEEE Transactions on Biomedical Engineering

Objective: Despite recent advances, the decoding of auditory attention from brain signals remains a challenge. A key solution is the extraction of discriminative features from high-dimensional data, such as multi-channel electroencephalography (EEG). However, to our knowledge, topological relationships between individual channels have not yet been considered in any study. In this work, we introduced a novel architecture that exploits the topology of the human brain to perform auditory spatial attention detection (ASAD) from EEG signals. Methods: We propose EEG-Graph Net, an EEG-graph convolutional network, which employs a neural attention mechanism. This mechanism models the topology of the human brain in terms of the spatial pattern of EEG signals as a graph. In the EEG-Graph, each EEG channel is represented by a node, while the relationship between two EEG channels is represented by an edge between the respective nodes. The convolutional network takes the multi-channel EEG signals as a time series of EEG-graphs and learns the node and edge weights from the contribution of the EEG signals to the ASAD task. The proposed architecture supports the interpretation of the experimental results by data visualization. Results: We conducted experiments on two publicly available databases. The experimental results showed that EEG-Graph Net significantly outperforms the state-of-the-art methods in terms of decoding performance. In addition, the analysis of the learned weight patterns provides insights into the processing of continuous speech in the brain and confirms findings from neuroscientific studies. Conclusion: We showed that modeling brain topology with EEG-graphs yields highly competitive results for auditory spatial attention detection. Significance: The proposed EEG-Graph Net is more lightweight and accurate than competing baselines and provides explanations for its results. The architecture can also be easily transferred to other brain-computer interface (BCI) tasks.
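The graph construction described in the Methods paragraph (channels as nodes, inter-channel relationships as weighted edges) can be illustrated with a generic graph-convolution step. This is a textbook GCN update over a toy channel graph, not the EEG-Graph Net architecture itself, and all shapes and the random adjacency are illustrative assumptions.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} with self-loops."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_layer(X, A, W):
    """One graph-convolution layer: ReLU(A_norm @ X @ W)."""
    return np.maximum(normalize_adjacency(A) @ X @ W, 0.0)

rng = np.random.default_rng(0)
n_channels, n_samples, n_hidden = 64, 128, 16     # e.g. a 64-channel EEG cap
X = rng.standard_normal((n_channels, n_samples))  # node features: raw signals
A = np.abs(rng.standard_normal((n_channels, n_channels)))
A = (A + A.T) / 2                                 # symmetric edge weights
W = rng.standard_normal((n_samples, n_hidden))
H = gcn_layer(X, A, W)                            # (64, 16) node embeddings
```

In the paper the edge weights are learned through a neural attention mechanism rather than fixed as here, which is what makes the learned weight patterns interpretable.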



Figure 1: A document consisting of 14 basic units (paragraphs) is divided into five sections (different colors) according to the topic, and each of them has a subheading as its topic.
Figure 2: The Chinese paragraph-level topic structure representation, which contains m paragraphs and n subheadings. We assume a paragraph only has one topic, and a topic contains one or more paragraphs.
Advancing Topic Segmentation and Outline Generation in Chinese Texts: The Paragraph-level Topic Representation, Corpus, and Benchmark

May 2023 · 64 Reads

Topic segmentation and outline generation strive to divide a document into coherent topic sections and generate corresponding subheadings. Such a process unveils the discourse topic structure of a document, which helps readers quickly grasp and understand its overall context at a higher level. However, research and applications in this field have been constrained by the lack of proper paragraph-level topic representations and of large-scale, high-quality corpora in Chinese, compared with the success achieved in English. Addressing these issues, we introduce a hierarchical paragraph-level topic structure representation with title, subheading, and paragraph that comprehensively models the discourse topic structure of a document. In addition, we ensure a more holistic representation of topic distribution within the document by using sentences instead of keywords to represent sub-topics. Following this representation, we construct the largest Chinese Paragraph-level Topic Structure corpus (CPTS), four times larger than the previously largest one. We also employ a two-stage human-machine collaborative annotation method to ensure the high quality of the corpus in both form and semantics. Finally, we validate the computability of CPTS on two fundamental tasks (topic segmentation and outline generation) with several strong baselines, and its efficacy has been preliminarily confirmed on the downstream task of discourse parsing. The representation, corpus, and benchmark we establish will provide a solid foundation for future studies.
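The title/subheading/paragraph hierarchy described above can be encoded as a small data structure, under the stated assumption that each paragraph belongs to exactly one topic and each topic spans one or more paragraphs. The field names below are illustrative, not the CPTS schema; topic-segmentation boundaries then fall out of the section lengths.

```python
from dataclasses import dataclass, field

@dataclass
class TopicSection:
    subheading: str                              # sentence-style sub-topic
    paragraphs: list = field(default_factory=list)

@dataclass
class Document:
    title: str
    sections: list = field(default_factory=list)

    def segmentation_boundaries(self):
        """Paragraph indices at which a new topic section starts."""
        boundaries, idx = [], 0
        for sec in self.sections:
            boundaries.append(idx)
            idx += len(sec.paragraphs)
        return boundaries

# Toy document: two topic sections covering three paragraphs in total.
doc = Document("Sample title", [
    TopicSection("The first topic as a full sentence.", ["p1", "p2"]),
    TopicSection("The second topic as a full sentence.", ["p3"]),
])
```

Topic segmentation predicts the boundary indices, while outline generation produces the `subheading` strings; both tasks read off the same structure.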


Citations (33)


... This insight led to exploring how adjusting the ratio of AI-generated to human-written news in training datasets could enhance test-set detection accuracy. Additionally, Jiang et al. (Jiang et al., 2023) offer an overview of the difficulties in identifying LLM-crafted disinformation, advocating for advanced prompting techniques, such as Chain of Thought (CoT) and contextual analysis, as viable strategies. Table 1 offers an overview of a set of significant datasets used in AI-generated text forensics, assessed across several crucial dimensions, including the AI generators used, the domains of writing, and performance metrics. ...

Reference:

A Survey of AI-generated Text Forensic Systems: Detection, Attribution, and Characterization
Is ChatGPT Involved in Texts? Measure the Polish Ratio to Detect ChatGPT-Generated Text
  • Citing Article
  • January 2024

APSIPA Transactions on Signal and Information Processing

... , as the extension users may not expect the data will be displayed on the website to the public. Notably, the ShareGPT dataset has become a popular dataset to fine-tune other models [10,20,25,31,50,76,78] in academic research and open-source community, while the risks of handling potentially sensitive personal information shared by users are not well discussed in the literature. ...

HuatuoGPT, Towards Taming Language Model to Be a Doctor
  • Citing Conference Paper
  • January 2023

... There are several already identified limitations of MTEB including: lacking long texts datasets (longest datasets in MTEB have a few hundred words), task imbalance (15 datasets on Retrieval task, 12 datasets on Classification task while only 1 dataset for Summarization task), limited multi-languages evaluation datasets and no programming language (code) datasets [59]. Understanding syntax thoroughly is essential for a text embedding model to accurately determine the relationships between words, which aids in achieving a level of language comprehension that mirrors human cognitive processes [118]. The capacity of text embedding models to generalize across various syntactic contexts is not sufficiently examined in the existing benchmark. ...

How Well Do Text Embedding Models Understand Syntax?
  • Citing Conference Paper
  • January 2023

... Similarly, many multi-modal detection approaches are proposed [29][30][31][32][33], which include methods based on the consistency among visual and audio modalities. The complementarity between the two modalities allows one modality to enhance the feature representation of the other. ...

Audio-Visual Temporal Forgery Detection Using Embedding-Level Fusion and Multi-Dimensional Contrastive Loss

IEEE Transactions on Circuits and Systems for Video Technology

... Zhao et al. [41] proposed learning synthetic datasets by matching gradients or network parameters between real and synthetic data, while training teacher (using D original ) and student (using D distill ) networks to avoid unrolling the recursive computation graph. Several methods have been subsequently proposed to improve dataset distillation performance via differentiable data augmentation [47], feature alignment [48,49], contrastive signaling [50], and trajectory matching [51,52]. ...

Minimizing the Accumulated Trajectory Error to Improve Dataset Distillation
  • Citing Conference Paper
  • June 2023

... Multi-task Learning (MTL) [3,27,29,60,64] is to learn multiple tasks simultaneously to improve the accuracy of each task compared with single-task learning (STL) [5,7,11,12,49]. Several existing works have proved the strong correlations between FR and AR tasks. Diniz et al. [15] illustrates that the FR model implicitly encodes latent attribute features in the representations, and the hidden layer of the FR model can be used to perform attribute prediction. ...

Generate, Discriminate and Contrast: A Semi-Supervised Sentence Representation Learning Framework
  • Citing Conference Paper
  • January 2022

... The current automatic evaluation methods for dialogue summarization mostly rely on n-gram comparisons and embedding distances between generated and reference summaries. Some studies proposed new metrics to evaluate faithfulness in dialogue summarization [Wang et al., 2022]. However, these metrics have been designed for factual correctness, and do not focus on evaluating the relevance of the emotional content, which yet plays a key role in many aspects of human interactions. ...

Analyzing and Evaluating Faithfulness in Dialogue Summarization
  • Citing Conference Paper
  • January 2022

... AR uses computer vision techniques as part of its foundation, and thus such attacks could apply. Such work includes software [6,19,20,26,58] and hardware [61,64,66] based attacks on machine learning models. While our work uses photographs, screens, manipulated images, and GPS to trick computer vision systems, attacks using additional hardware like lasers have also been explored [60]. ...

Dynamic Transformers Provide a False Sense of Efficiency

... The performance of these linear decoding approaches decreases significantly when operated at low latency settings. Cai et al. [56] introduced an EEG-graph net that exploits the topology of the human brain to perform auditory spatial attention detection from EEGs. ...

Brain Topology Modeling With EEG-Graphs for Auditory Spatial Attention Detection

IEEE Transactions on Biomedical Engineering