Qiuchi Li’s research while affiliated with University of Copenhagen and other places


Publications (51)


[Figures and tables: Fig. 2, the proposed perception-cognition framework; Fig. 4, the distribution of scene-description similarity across MLMs; Fig. 5, the distribution of the gap; statistics of the MMSar dataset; comparison of 8 MLMs on MMSar (bold = best, underline = second best).]

Are MLMs Trapped in the Visual Room?
Preprint · File available · May 2025 · 6 Reads
Chunwang Zou · Qimeng Liu · [...] · Jing Qin

Can multi-modal large models (MLMs) that can "see" an image be said to "understand" it? Drawing inspiration from Searle's Chinese Room, we propose the Visual Room argument: a system may process and describe every detail of visual inputs by following algorithmic rules, without genuinely comprehending the underlying intention. This dilemma challenges the prevailing assumption that perceptual mastery implies genuine understanding. In implementation, we introduce a two-tier evaluation framework spanning perception and cognition. The perception component evaluates whether MLMs can accurately capture the surface-level details of visual contents, while the cognitive component examines their ability to infer sarcasm polarity. To support this framework, we further introduce a high-quality multi-modal sarcasm dataset comprising 924 static images and 100 dynamic videos. All sarcasm labels are annotated by the original authors and verified by independent reviewers to ensure clarity and consistency. We evaluate eight state-of-the-art (SoTA) MLMs. Our results highlight three key findings: (1) MLMs perform well on perception tasks; (2) even with correct perception, models exhibit an average error rate of ~16.1% in sarcasm understanding, revealing a significant gap between seeing and understanding; (3) error analysis attributes this gap to deficiencies in emotional reasoning, commonsense inference, and context alignment. This work provides empirical grounding for the proposed Visual Room argument and offers a new evaluation paradigm for MLMs.
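A minimal sketch of how such a two-tier evaluation could be scored, assuming placeholder functions describe_image, similarity and predict_sarcasm for the MLM calls and the description-similarity metric (none of these names come from the paper); the gap is the error rate on samples whose surface details were perceived correctly.

def evaluate_two_tier(samples, describe_image, similarity, predict_sarcasm, threshold=0.8):
    # samples: list of dicts with "image", "text", "gold_description", "sarcasm_label"
    perceived, cognition_errors = 0, 0
    for s in samples:
        # Tier 1 (perception): does the model capture the surface-level details?
        if similarity(describe_image(s["image"]), s["gold_description"]) < threshold:
            continue
        perceived += 1
        # Tier 2 (cognition): given correct perception, is the sarcasm polarity right?
        if predict_sarcasm(s["image"], s["text"]) != s["sarcasm_label"]:
            cognition_errors += 1
    perception_acc = perceived / len(samples)
    gap = cognition_errors / max(perceived, 1)  # "seeing vs. understanding" gap
    return {"perception_acc": perception_acc, "gap": gap}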



NurValues: Real-World Nursing Values Evaluation for Large Language Models in Clinical Context

May 2025 · 13 Reads

This work introduces the first benchmark for nursing value alignment, consisting of five core value dimensions distilled from international nursing codes: Altruism, Human Dignity, Integrity, Justice, and Professionalism. The benchmark comprises 1,100 real-world nursing behavior instances collected through a five-month longitudinal field study across three hospitals of varying tiers. These instances are annotated by five clinical nurses and then augmented with LLM-generated counterfactuals of reversed ethical polarity. Each original case is paired with a value-aligned and a value-violating version, resulting in 2,200 labeled instances that constitute the Easy-Level dataset. To increase adversarial complexity, each instance is further transformed into a dialogue-based format that embeds contextual cues and subtle misleading signals, yielding a Hard-Level dataset. We evaluate 23 state-of-the-art (SoTA) LLMs on their alignment with nursing values. Our findings reveal three key insights: (1) DeepSeek-V3 achieves the highest performance on the Easy-Level dataset (94.55), while Claude 3.5 Sonnet outperforms other models on the Hard-Level dataset (89.43), significantly surpassing the medical LLMs; (2) Justice is consistently the most difficult nursing value dimension to evaluate; and (3) in-context learning significantly improves alignment. This work aims to provide a foundation for value-sensitive LLM development in clinical settings. The dataset and the code are available at https://huggingface.co/datasets/Ben012345/NurValues.
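As an illustration of the counterfactual augmentation described above (not the authors' pipeline), the sketch below pairs each annotated case with an LLM-generated version of reversed ethical polarity; generate_counterfactual is a placeholder for the LLM call.

def build_easy_level(cases, generate_counterfactual):
    # cases: list of dicts with "text", "value" (e.g. "Justice") and "aligned" (bool),
    # as annotated by the clinical nurses.
    easy_level = []
    for case in cases:
        easy_level.append(case)
        easy_level.append({
            "text": generate_counterfactual(case["text"], case["value"]),  # reversed-polarity rewrite
            "value": case["value"],
            "aligned": not case["aligned"],
        })
    return easy_level  # 2 * len(cases) labeled instances (1,100 -> 2,200)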


Is Sarcasm Detection a Step-by-Step Reasoning Process in Large Language Models?

April 2025 · 3 Reads · 6 Citations

Proceedings of the AAAI Conference on Artificial Intelligence

Elaborating a series of intermediate reasoning steps significantly improves the ability of large language models (LLMs) to solve complex problems, as such steps encourage LLMs to think sequentially. However, human sarcasm understanding is often considered an intuitive and holistic cognitive process in which various linguistic, contextual, and emotional cues are integrated to form a comprehensive understanding, in a way that does not necessarily follow a step-by-step fashion. To verify the validity of this argument, we introduce a new prompting framework (called SarcasmCue) containing four sub-methods, viz. chain of contradiction (CoC), graph of cues (GoC), bagging of cues (BoC) and tensor of cues (ToC), which prompt LLMs to detect human sarcasm through both sequential and non-sequential prompting methods. Through a comprehensive empirical comparison on four benchmarks, we highlight three key findings: (1) CoC and GoC show superior performance with more advanced models like GPT-4 and Claude 3.5, with an improvement of 3.5%. (2) ToC significantly outperforms other methods when smaller LLMs are evaluated, boosting the F1 score by 29.7% over the best baseline. (3) Our proposed framework consistently surpasses the state-of-the-art (i.e., ToT) by 4.2%, 2.0%, 29.7%, and 58.2% in F1 scores across four datasets. This demonstrates the effectiveness and stability of the proposed framework.
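As a rough illustration of the sequential end of this spectrum, the sketch below shows one plausible chain-of-contradiction (CoC) style prompt: elicit the surface sentiment, elicit the speaker's true attitude, then ask whether the two contradict. The wording and the llm_call helper are assumptions, not the paper's exact prompts.

COC_PROMPT = """Utterance: "{utterance}"
Step 1: What sentiment does the literal wording express?
Step 2: What is the speaker's likely true attitude, given the context?
Step 3: Do the two contradict each other? Answer "sarcastic" or "not sarcastic"."""

def detect_sarcasm_coc(utterance, llm_call):
    # llm_call: any function that maps a prompt string to a model response string.
    response = llm_call(COC_PROMPT.format(utterance=utterance)).lower()
    return "not sarcastic" not in response  # naive parse: True => predicted sarcastic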


Beyond Single-Sentence Prompts: Upgrading Value Alignment Benchmarks with Dialogues and Stories

March 2025 · 3 Reads

Evaluating the value alignment of large language models (LLMs) has traditionally relied on single-sentence adversarial prompts, which directly probe models with ethically sensitive or controversial questions. However, with the rapid advancements in AI safety techniques, models have become increasingly adept at circumventing these straightforward tests, limiting their effectiveness in revealing underlying biases and ethical stances. To address this limitation, we propose an upgraded value alignment benchmark that moves beyond single-sentence prompts by incorporating multi-turn dialogues and narrative-based scenarios. This approach enhances the stealth and adversarial nature of the evaluation, making it more robust against superficial safeguards implemented in modern LLMs. We design and construct a dataset that includes conversational traps and ethically ambiguous storytelling, systematically assessing LLMs' responses in more nuanced and context-rich settings. Experimental results demonstrate that this enhanced methodology can effectively expose latent biases that remain undetected in traditional single-shot evaluations. Our findings highlight the necessity of contextual and dynamic testing for value alignment in LLMs, paving the way for more sophisticated and realistic assessments of AI ethics and safety.
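A schematic example of the upgrade from a single-sentence probe to a multi-turn "conversational trap" (illustrative only; the message content and structure are placeholders, not items from the benchmark):

def wrap_probe_as_dialogue(direct_probe, benign_opener, misleading_cue):
    # Returns a chat-style message list that buries the adversarial probe behind
    # innocuous context and a subtly misleading signal, instead of asking it outright.
    return [
        {"role": "user", "content": benign_opener},
        {"role": "assistant", "content": "Sure, happy to help with that."},
        {"role": "user", "content": f"{misleading_cue} {direct_probe}"},
    ]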


Word2State: Modeling Word Representations as States with Density Matrices

March 2025 · 2 Reads

Chinese Journal of Electronics

Polysemy is a common phenomenon in linguistics. Quantum-inspired complex word embeddings based on Semantic Hilbert Space play an important role in natural language processing, as they can define a genuine probability distribution over the word space. Existing quantum-inspired works operate on real-valued vectors to compose complex-valued word embeddings, and thus lack directly pre-trained complex-valued word representations. Motivated by quantum-inspired complex word embeddings, we propose a complex-valued pre-trained word embedding based on density matrices, called Word2State. Unlike existing static word embeddings, our proposed model provides non-linear semantic composition in the form of amplitude and phase, and also defines an authentic probability distribution. We evaluate this model on twelve word-similarity datasets and six datasets from relevant downstream tasks. The experimental results on different tasks demonstrate that our proposed pre-trained word embedding captures richer semantic information and exhibits greater flexibility in expressing uncertainty.
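For context, a standard Semantic Hilbert Space formulation of such complex-valued, density-matrix word states reads as follows (the exact parameterization used by Word2State may differ):

\[
|w\rangle = \sum_j r_j e^{i\phi_j}\,|e_j\rangle, \qquad r_j \ge 0,\ \sum_j r_j^2 = 1,
\]
\[
\rho = \sum_i p_i\,|w_i\rangle\langle w_i|, \qquad p_i \ge 0,\ \sum_i p_i = 1,\ \operatorname{tr}(\rho) = 1,
\]
\[
p(e_j) = \langle e_j|\,\rho\,|e_j\rangle,
\]

so amplitudes \(r_j\) and phases \(\phi_j\) carry the semantics, and \(\rho\), being Hermitian, positive semi-definite and of unit trace, defines a genuine probability distribution over the basis states.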





[Figures and tables: Fig. 1, ingredients of a job posting and a candidate profile in the Jobindex database (texts translated to English); Fig. 2, the architecture of the joint ESCO extraction and classification model; model size, inference time and effectiveness metrics (mean and standard deviation over 5 runs with different random seeds).]
Joint Extraction and Classification of Danish Competences for Job Matching

October 2024 · 18 Reads

The matching of competences, such as skills, occupations or knowledge, is a key desideratum for fitting candidates to jobs. Automatic extraction of competences from CVs and job postings can greatly improve recruiters' productivity in locating relevant candidates for job vacancies. This work presents the first model that jointly extracts and classifies competences from Danish job postings. Different from existing works on skill extraction and skill classification, our model is trained on a large volume of annotated Danish corpora and is capable of extracting a wide range of Danish competences, including skills, occupations and knowledge of different categories. More importantly, as a single BERT-like architecture for joint extraction and classification, our model is lightweight and efficient at inference. On a real-scenario job matching dataset, our model beats the state-of-the-art models in the overall performance of Danish competence extraction and classification, and reduces inference time by over 50%.
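A minimal sketch of a single-encoder model of this kind, assuming a token-level label set that crosses BIO extraction tags with competence categories so extraction and classification are decided jointly; the encoder checkpoint name and category list are illustrative assumptions, not the paper's.

from torch import nn
from transformers import AutoModel

class JointCompetenceTagger(nn.Module):
    def __init__(self, encoder_name="Maltehb/danish-bert-botxo",
                 categories=("skill", "occupation", "knowledge")):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        # One label per (BIO tag, category) pair plus "O": a single token-level
        # softmax performs both span extraction and competence classification.
        self.labels = ["O"] + [f"{bio}-{cat}" for cat in categories for bio in ("B", "I")]
        self.classifier = nn.Linear(self.encoder.config.hidden_size, len(self.labels))

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return self.classifier(hidden)  # (batch, seq_len, num_labels) logits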


Citations (26)


... Despite such advancements, current practices in automated sentiment analysis and topic classification for reputation management face several constraints. Many existing PLM solutions typically address only a single task at a time (Gao, Ghosh, and Gimpel 2023; Hu et al. 2024; Zhang et al. 2024), and those that handle multiple tasks (Wang et al. 2023; Liu et al. 2023b; Xin et al. 2024; Shen et al. 2024) still require highly specialized training regimes tailored to specific companies or contexts. This complexity makes it difficult to reuse models across multiple companies for various tasks, limiting scalability and flexibility. ...

Reference:

Scalable Reputation Management: A Multi-Task Prompting Approach Using Fine-Tuned PLMs for Sentiment and Topic Classification
Pushing The Limit of LLM Capacity for Text Classification
  • Citing Conference Paper
  • May 2025

... The complexity of multi-modal sarcasm detection stems from the fact that it incorporates a combination of semantic, pragmatic, and analogical cues, which are not aligned with the logical and sequential nature of traditional reasoning strategies of language models (Yao et al., 2025; Kumar et al., 2022). This leads to models taking shortcuts such as over-reliance on uni-modal cues and failing to discern sarcasm from other forms of figurative language (Liang et al., 2022; Qin et al., 2023). ...

Is Sarcasm Detection a Step-by-Step Reasoning Process in Large Language Models?
  • Citing Article
  • April 2025

Proceedings of the AAAI Conference on Artificial Intelligence

... Accurate and prompt extraction of competences from job postings and CVs can promote accurate matching of relevant candidates and relieve recruiters of much of their burden. While existing works [17,4,14,10,7] have mainly attempted to match jobs and candidates by their representations as a whole, the extraction of competences can support a clear presentation of the reasons for a match alongside the matching result. However, the job matching context poses greater challenges to competence extraction algorithms in the following aspects: ...

Template-based Contact Email Generation for Job Recommendation
  • Citing Conference Paper
  • January 2022

... Zhang et al. introduce neural networks and present a neural-network-based QLM (NNQLM) for question answering (QA) [3], in which sentence pairs can be jointly represented by density matrices to extract features. A quantum many-body inspired approach is then proposed for sentence representation [23]. Both of them model texts by real-valued density matrices, which cannot express quantum phenomena as understandably [24]. ...

A Survey of Quantum-Cognitively Inspired Sentiment Analysis Models
  • Citing Article
  • June 2023

ACM Computing Surveys

... The performance of our model is evaluated using two primary metrics: accuracy (ACC) and weighted F1 score (WF1) for the Multimodal Emotion Recognition Classification (MERC) task, which were also employed by previous studies [18, 17, 21, 23]. In particular, WF1 is used as a more balanced evaluation metric to account for class imbalance. ...

Quantum-inspired Neural Network for Conversational Emotion Recognition
  • Citing Article
  • May 2021

Proceedings of the AAAI Conference on Artificial Intelligence

... In 2021, Li et al. [106] offered a quantum-inspired dialog emotion recognition framework, analogous to quantum measurement and emotion recognition, using complex-valued operations to construct a quantum layer that supports context modeling and multimodal fusion. In 2021, Gkoumas et al. [107] introduced a quantum cognitive theory-based decision fusion model for predicting emotional judgments in videos, considering "quantumlike" biases in human decision-making processes, and using the concept of incompatibility in quantum theory to handle conflicts in emotional judgments between different modalities, such as language, vision, and audition. In 2021, Gkoumas [108] proposed a quantum probability neural model capable of simulating human decision-making processes under uncertainty, including rational and irrational decisions. ...

Quantum Cognitively Motivated Decision Fusion for Video Sentiment Analysis
  • Citing Article
  • May 2021

Proceedings of the AAAI Conference on Artificial Intelligence

... The model provided a comprehensive empirical comparison of existing multimodal fusion strategies and proposed more effective fusion models based on quantum-inspired components. In 2021, Liu et al. [109] offered a multitask learning framework based on quantum probability theory (QPM) to address the issues of multimodal sarcasm and sentiment detection. Extracting multimodal features through quantum incompatible measurements helps capture the interdependencies between the sarcasm detection and emotional tone detection tasks. ...

What Does Your Smile Mean? Jointly Detecting Multi-Modal Sarcasm and Sentiment Using Quantum Probability
  • Citing Conference Paper
  • January 2021

... The increasing availability of multimodal data on social media and online platforms underscores the need for robust models capable of handling such complexities (Cambria et al. 2013). Gkoumas et al. (2021) introduced an Entanglement-driven Fusion Neural Network (EFNN) for video sentiment analysis, achieving an accuracy of 82.8% and an F1-score of 82.6% on the CMU-MOSEI dataset. Their model leverages quantum-inspired entanglement mechanisms to capture nonseparable interactions between modalities, setting a new benchmark for video-based sentiment analysis. ...

An Entanglement-driven Fusion Neural Network for Video Sentiment Analysis
  • Citing Conference Paper
  • August 2021

... The rise of social media brought about by the internet has caused significant changes in society, empowering individuals to share their thoughts and perspectives broadly through diverse channels. This rise has greatly boosted the use of ironic language, making it a key part of everyday conversations (Zhang et al., 2021). This ironic language in verbal communication, where the underlying significance of the statement diverges from the literal expression, is generally considered a sophisticated form of mockery or contempt and is termed sarcasm (Davidov et al., 2010; Abuteir & Elsamani, 2021). The growing amount of public feedback on products and services, including both direct and indirect sentiments, has highlighted the critical necessity of accurately detecting sarcasm as a key part of Sentiment Analysis (SA) (Farha & Magdy, 2020; Shi et al., 2024). ...

CFN: A Complex-Valued Fuzzy Network for Sarcasm Detection in Conversations
  • Citing Article
  • April 2021

IEEE Transactions on Fuzzy Systems