See-Kiong Ng’s research while affiliated with National University of Singapore and other places


Publications (294)


SemCORE: A Semantic-Enhanced Generative Cross-Modal Retrieval Framework with MLLMs
  • Preprint

April 2025

Haoxuan Li · Yi Bin · Yunshan Ma · [...] · Tat-Seng Chua

Cross-modal retrieval (CMR) is a fundamental task in multimedia research, focused on retrieving semantically relevant targets across different modalities. While traditional CMR methods match text and image via embedding-based similarity calculations, recent advancements in pre-trained generative models have established generative retrieval as a promising alternative. This paradigm assigns each target a unique identifier and leverages a generative model to directly predict identifiers corresponding to input queries without explicit indexing. Despite its great potential, current generative CMR approaches still suffer from insufficient semantic information in both the identifier construction and generation processes. To address these limitations, we propose a novel unified Semantic-enhanced generative Cross-mOdal REtrieval framework (SemCORE), designed to unleash semantic understanding capabilities in the generative cross-modal retrieval task. Specifically, we first construct a Structured natural language IDentifier (SID) that effectively aligns target identifiers with generative models optimized for natural language comprehension and generation. Furthermore, we introduce a Generative Semantic Verification (GSV) strategy enabling fine-grained target discrimination. Additionally, to the best of our knowledge, SemCORE is the first framework to simultaneously consider both text-to-image and image-to-text retrieval tasks within generative cross-modal retrieval. Extensive experiments demonstrate that our framework outperforms state-of-the-art generative cross-modal retrieval methods. Notably, SemCORE achieves substantial improvements across benchmark datasets, with an average increase of 8.65 points in Recall@1 for text-to-image retrieval.
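
The decoding mechanism behind this paradigm is easiest to see in code. Below is a minimal, illustrative sketch of identifier-constrained generative retrieval under assumed toy inputs (`score_next`, a two-target vocabulary, greedy search); SemCORE's actual SID construction and GSV verification step are not reproduced here.

```python
# Hypothetical sketch: constrained decoding over a trie of target identifiers.
def build_trie(identifiers):
    """Index tokenized target identifiers (tuples of tokens) as a prefix trie."""
    trie = {}
    for ident in identifiers:
        node = trie
        for tok in ident + ("<eos>",):
            node = node.setdefault(tok, {})
    return trie

def constrained_greedy_decode(score_next, trie):
    """Emit tokens one by one, restricted to tokens that keep the partial
    output a valid prefix of some known target identifier."""
    node, output, total = trie, [], 0.0
    while True:
        scores = score_next(tuple(output))  # {token: logprob} from the model
        tok = max(node, key=lambda t: scores.get(t, float("-inf")))
        if tok == "<eos>":                  # a complete identifier was generated
            return output, total
        total += scores.get(tok, float("-inf"))
        output.append(tok)
        node = node[tok]

# Toy usage: two image targets described by natural-language identifiers.
trie = build_trie([("red", "bus"), ("black", "cat")])
ident, score = constrained_greedy_decode(
    lambda prefix: {"red": -0.1, "bus": -0.2, "black": -1.5,
                    "cat": -0.3, "<eos>": -5.0},
    trie)
print(ident)  # ['red', 'bus']
```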


Mixture of Experts as Representation Learner for Deep Multi-View Clustering

April 2025

Proceedings of the AAAI Conference on Artificial Intelligence

Multi-view clustering (MVC) aims to integrate information from diverse data sources to facilitate the clustering process, which has achieved considerable success in various real-world applications. However, previous MVC methods typically employ one of two strategies: (1) designing separate feature extraction pipelines for each view, which restricts their ability to fully exploit collaborative potential; or (2) employing a single shared representation module, which hinders the capture of diverse, view-specific representations. To tackle these challenges, we introduce Deep Multi-View Clustering via Collaborative Experts (DMVC-CE), a novel MVC approach that employs the Mixture of Experts (MoE) framework. DMVC-CE incorporates a gating network that dynamically selects multiple experts for handling each data sample, capturing diverse and complementary information from different views. Additionally, to ensure balanced expert utilization and maintain their diversity, we introduce an equilibrium loss and a multi-expert distinctiveness enhancer. The equilibrium loss prevents excessive reliance on specific experts, while the distinctiveness enhancer encourages each expert to specialize in different aspects of the data, thereby promoting diversity in learned representations. Comprehensive experiments on various multi-view benchmark datasets demonstrate the superiority of DMVC-CE compared to state-of-the-art MVC baselines.
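
To make the gating-and-equilibrium idea concrete, here is a minimal PyTorch sketch of a top-k gated mixture-of-experts representation layer with a uniform-usage balancing loss. The layer sizes, k, and the exact loss form are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoERepresentation(nn.Module):
    """Sketch of an MoE representation learner with an equilibrium-style loss."""
    def __init__(self, in_dim, hid_dim, n_experts=4, k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU(),
                          nn.Linear(hid_dim, hid_dim))
            for _ in range(n_experts))
        self.gate = nn.Linear(in_dim, n_experts)
        self.k = k

    def forward(self, x):                                # x: (B, in_dim)
        probs = F.softmax(self.gate(x), dim=-1)          # (B, E) gate weights
        topv, topi = probs.topk(self.k, dim=-1)          # k experts per sample
        topv = topv / topv.sum(dim=-1, keepdim=True)     # renormalize the top-k
        out = x.new_zeros(x.size(0), self.experts[0][-1].out_features)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                routed = topi[:, slot] == e              # samples sent to expert e
                if routed.any():
                    out[routed] += topv[routed, slot, None] * expert(x[routed])
        # Equilibrium-style loss: keep mean gate usage near uniform (1/E) so
        # no expert dominates and all experts keep receiving gradient.
        usage = probs.mean(dim=0)
        balance_loss = ((usage - 1.0 / probs.size(-1)) ** 2).sum()
        return out, balance_loss

moe = MoERepresentation(in_dim=128, hid_dim=64)
z, balance = moe(torch.randn(32, 128))                  # fused representation + loss
```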


Aligning Large Language Models for Faithful Integrity Against Opposing Argument

April 2025 · 1 Citation

Proceedings of the AAAI Conference on Artificial Intelligence

Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks. However, they can be easily misled by unfaithful arguments during conversations, even when their original statements are correct. To this end, we investigate the problem of maintaining faithful integrity in LLMs. This involves ensuring that LLMs adhere to their faithful statements in the face of opposing arguments and are able to correct their incorrect statements when presented with faithful arguments. In this work, we propose a novel framework, named Alignment for Faithful Integrity with Confidence Estimation (AFICE), which aims to align LLM responses with faithful integrity. Specifically, AFICE first designs a Bilateral Confidence Estimation (BCE) approach for estimating the uncertainty of each response generated by the LLM in a given context, which simultaneously estimates the model's confidence in the question, based on internal states during decoding, and in the answer, based on cumulative probability ratios. With BCE, we construct a conversational preference dataset composed of context, original statement, and argument, which is used to align the LLM for faithful integrity via Direct Preference Optimization (DPO). Extensive experimental results on a wide range of benchmarks demonstrate significant improvements in the LLM's ability to maintain faithful responses when encountering opposing arguments, ensuring both the practical utility and trustworthiness of LLMs in complex interactive settings.
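
As a rough illustration of the answer-side half of such confidence estimation, the sketch below scores an answer by a cumulative probability ratio between the chosen tokens and their strongest alternatives. The input format and the exact ratio are assumptions; AFICE's bilateral estimator, including its use of internal decoding states, is not reproduced here.

```python
import math

def answer_confidence(step_logprobs):
    """step_logprobs: one (chosen_logprob, runner_up_logprob) pair per decoding
    step. Returns a confidence in [0, 1] that grows as the chosen tokens
    cumulatively dominate their runners-up."""
    chosen = sum(c for c, _ in step_logprobs)
    runner = sum(r for _, r in step_logprobs)
    # Cumulative ratio p(chosen) / (p(chosen) + p(runner-up)), in log space.
    return 1.0 / (1.0 + math.exp(runner - chosen))

# Example: three steps where the chosen token is consistently more likely.
print(answer_confidence([(-0.1, -2.5), (-0.3, -1.9), (-0.2, -3.0)]))  # ~0.999
```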


Motion-aware Contrastive Learning for Temporal Panoptic Scene Graph Generation

April 2025

Proceedings of the AAAI Conference on Artificial Intelligence

To equip artificial intelligence with a comprehensive understanding of the temporal world, video and 4D panoptic scene graph generation abstracts visual data into nodes that represent entities and edges that capture temporal relations. Existing methods encode entity masks tracked across the temporal dimension (mask tubes) and then predict their relations with a temporal pooling operation, which does not fully utilize the motion cues indicative of the entities' relations. To overcome this limitation, we introduce a contrastive representation learning framework that focuses on motion patterns for temporal scene graph generation. First, our framework encourages the model to learn close representations for mask tubes of similar subject-relation-object triplets. Second, we push mask tubes apart from their temporally shuffled versions. Moreover, we also learn distant representations for mask tubes belonging to the same video but different triplets. Extensive experiments show that our motion-aware contrastive framework significantly improves state-of-the-art methods on both video and 4D datasets.
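
The three training signals described above map naturally onto an InfoNCE-style objective. The sketch below is a minimal rendering under assumed pooled mask-tube embeddings; the video encoder, mask tracking, and any loss weighting are outside its scope.

```python
import torch
import torch.nn.functional as F

def motion_contrastive_loss(anchor, pos, shuffled, other, tau=0.07):
    """anchor, pos: (B, D) embeddings of mask tubes sharing a
    subject-relation-object triplet (the positive pair). shuffled: temporally
    shuffled anchors; other: same-video tubes of different triplets (negatives)."""
    z = F.normalize(anchor, dim=-1)
    sim = lambda a: (z * F.normalize(a, dim=-1)).sum(-1) / tau
    logits = torch.stack([sim(pos), sim(shuffled), sim(other)], dim=1)  # (B, 3)
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)  # index 0 is always the positive
```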


Multi-Scale Contrastive Learning for Video Temporal Grounding

April 2025


Proceedings of the AAAI Conference on Artificial Intelligence

Temporal grounding, which localizes video moments related to a natural language query, is a core problem of vision-language learning and video understanding. To encode video moments of varying lengths, recent methods employ a multi-level structure known as a feature pyramid. In this structure, lower levels concentrate on short-range video moments, while higher levels address long-range moments. Because higher levels are downsampled to accommodate increasing moment lengths, their capacity to capture information is reduced, which degrades the resulting moment representations. To resolve this problem, we propose a contrastive learning framework to capture salient semantics among video moments. Our key methodology is to leverage samples from the feature space of multiple stages of the video encoder itself, requiring neither data augmentation nor online memory banks to obtain positive and negative samples. To enable this, we introduce a sampling process that draws multiple video moments corresponding to a common query. Subsequently, by utilizing these moments' representations across video encoder layers, we instantiate a novel form of multi-scale and cross-scale contrastive learning that links local short-range video moments with global long-range video moments. Extensive experiments demonstrate the effectiveness of our framework for not only long-form but also short-form video grounding.
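
A minimal sketch of the cross-scale pairing this describes: moment features for the same query drawn from two pyramid levels are pulled together, while other queries in the batch serve as negatives. The shapes and the symmetric loss form are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cross_scale_nce(low, high, tau=0.1):
    """low, high: (B, D) features for the same B query moments, taken from a
    lower (short-range) and a higher (long-range) pyramid level."""
    low, high = F.normalize(low, dim=-1), F.normalize(high, dim=-1)
    logits = low @ high.t() / tau                       # (B, B) similarities
    labels = torch.arange(low.size(0), device=low.device)
    # Diagonal entries pair each query's short- and long-range views.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```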


Towards Verifiable Text Generation with Generative Agent

April 2025


Proceedings of the AAAI Conference on Artificial Intelligence

Text generation with citations makes it easy to verify the factuality of Large Language Models' (LLMs) generations. Existing one-step generation studies exhibit distinct shortcomings in answer refinement and in-context demonstration matching. In light of these challenges, we propose R2-MGA, a Retrieval and Reflection Memory-augmented Generative Agent. Specifically, it first retrieves the memory bank to obtain the best-matched memory snippet, then reflects on the retrieved snippet to derive a reasoning rationale, and finally combines the snippet and the rationale into the best-matched in-context demonstration. Additionally, it is capable of in-depth answer refinement with two specifically designed modules. We evaluate R2-MGA across five LLMs on the ALCE benchmark. The results reveal R2-MGA's exceptional capabilities in text generation with citations. In particular, compared to the selected baselines, it delivers up to +58.8% and +154.7% relative performance gains on answer correctness and citation quality, respectively. Extensive analyses strongly support the motivations of R2-MGA.


Figures 2–6: prompts for the genetic algorithm's initialization, evaluation, crossover, swap, and add steps.

Geneshift: Impact of different scenario shift on Jailbreaking LLM
  • Preprint
  • File available

April 2025

Jailbreak attacks, which aim to cause LLMs to perform unrestricted behaviors, have become a critical and challenging direction in AI safety. Despite achieving promising attack success rates under dictionary-based evaluation, existing jailbreak attack methods fail to output content detailed enough to satisfy the harmful request, leading to poor performance under GPT-based evaluation. To this end, we propose a black-box jailbreak attack termed GeneShift, which uses a genetic algorithm to optimize scenario shifts. First, we observe that malicious queries perform optimally under different scenario shifts. Based on this observation, we develop a genetic algorithm to evolve and select hybrids of scenario shifts. It guides our method to elicit detailed and actionable harmful responses while keeping a seemingly benign facade, improving stealthiness. Extensive experiments demonstrate the superiority of GeneShift. Notably, GeneShift increases the jailbreak success rate from 0% to 60% when direct prompting alone would fail.
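
For intuition, here is a generic genetic-algorithm loop over scenario-shift combinations that matches the abstract's outline. The shift pool, the fitness placeholder, and the operator details are illustrative assumptions; in the paper, the operators and scoring are driven by the LLM prompts shown in Figures 2–6.

```python
import random

SHIFTS = ["roleplay", "fiction", "translation", "past tense", "debate"]  # assumed pool

def fitness(individual):
    """Placeholder: in GeneShift this would score how detailed and on-topic
    the target model's response is when the query is wrapped in these shifts."""
    return random.random()

def crossover(a, b):
    cut = random.randint(1, min(len(a), len(b)))
    return a[:cut] + b[cut:]

def mutate(ind):
    ind = ind[:]
    if random.random() < 0.5 and len(ind) >= 2:   # swap two shifts
        i, j = random.sample(range(len(ind)), 2)
        ind[i], ind[j] = ind[j], ind[i]
    else:                                         # add a new shift
        ind.append(random.choice(SHIFTS))
    return ind

population = [[random.choice(SHIFTS)] for _ in range(8)]
for generation in range(5):
    ranked = sorted(population, key=fitness, reverse=True)
    elite = ranked[: len(ranked) // 2]            # selection
    children = [mutate(crossover(random.choice(elite), random.choice(elite)))
                for _ in range(len(population) - len(elite))]
    population = elite + children
best = max(population, key=fitness)
```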


Exploring the Impact of Personality Traits on Conversational Recommender Systems: A Simulation with Large Language Models

April 2025

Conversational Recommender Systems (CRSs) engage users in multi-turn interactions to deliver personalized recommendations. The emergence of large language models (LLMs) further enhances these systems by enabling more natural and dynamic user interactions. However, a key challenge remains in understanding how personality traits shape conversational recommendation outcomes. Psychological evidence highlights the influence of personality traits on user interaction behaviors. To address this, we introduce an LLM-based personality-aware user simulation for CRSs (PerCRS). The user agent is endowed with customizable personality traits and preferences, while the system agent possesses persuasion capabilities to simulate realistic interactions in CRSs. We incorporate multi-aspect evaluation to ensure robustness and conduct extensive analysis from both user and system perspectives. Experimental results demonstrate that state-of-the-art LLMs can effectively generate diverse user responses aligned with specified personality traits, thereby prompting CRSs to dynamically adjust their recommendation strategies. Our experimental analysis offers empirical insights into the impact of personality traits on the outcomes of conversational recommender systems.
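
A minimal sketch of the two-agent simulation loop described above, with a stubbed chat function standing in for any LLM API; the prompt wording and trait encoding are assumptions rather than PerCRS's actual prompts.

```python
def chat(system_prompt, history):
    """Stub for an LLM chat-completion call; swap in a real client here."""
    return f"[reply conditioned on: {system_prompt[:40]}...]"

def simulate(traits, preferences, turns=3):
    user_sys = (f"You are a user with personality traits {traits} and "
                f"preferences {preferences}; react to each recommendation "
                f"consistently with these traits.")
    crs_sys = "You are a conversational recommender; persuade politely if refused."
    history = []
    for _ in range(turns):
        history.append(("system", chat(crs_sys, history)))  # system recommends
        history.append(("user", chat(user_sys, history)))   # user reacts in persona
    return history

for role, msg in simulate(["high openness", "low agreeableness"], ["sci-fi films"]):
    print(role, ":", msg)
```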


Continual Multimodal Contrastive Learning

March 2025


Multimodal contrastive learning (MCL) aligns different modalities and generates multimodal representations in a joint space. By leveraging contrastive learning across diverse modalities, large-scale multimodal data enhances representational quality. However, a critical yet often overlooked challenge remains: multimodal data is rarely collected in a single process, and training from scratch is computationally expensive. Instead, emergent multimodal data can be used to optimize existing models gradually, i.e., models are trained on a sequence of modality-pair data. We define this problem as Continual Multimodal Contrastive Learning (CMCL), an underexplored yet crucial research direction at the intersection of multimodal and continual learning. In this paper, we formulate CMCL through two specialized principles of stability and plasticity. We theoretically derive a novel optimization-based method which projects updated gradients from dual sides onto subspaces where no gradient can interfere with previously learned knowledge. Two upper bounds provide theoretical insights on both stability and plasticity in our solution. Beyond these theoretical contributions, we conduct experiments on multiple datasets, comparing our method against advanced continual learning baselines. The empirical results further support our claims and demonstrate the efficacy of our method. The code will be publicly available.
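
The projection idea can be illustrated in a few lines: remove from the new gradient any component lying in a subspace spanned by directions important to earlier modality pairs, so the update cannot interfere with them. This shows a one-sided projection under an assumed orthonormal basis; the paper's dual-sided method and its theoretical bounds are not reproduced.

```python
import torch

def project_out(grad, basis):
    """grad: (P,) flattened new gradient. basis: (P, K) orthonormal columns
    spanning directions tied to previously learned modality pairs."""
    if basis.numel() == 0:
        return grad
    return grad - basis @ (basis.t() @ grad)  # strip components in the span

# Example: with the basis = e1, any update along e1 is removed.
basis = torch.eye(3)[:, :1]
print(project_out(torch.tensor([1.0, 2.0, 3.0]), basis))  # tensor([0., 2., 3.])
```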


PIED: Physics-Informed Experimental Design for Inverse Problems

March 2025


In many science and engineering settings, system dynamics are characterized by governing PDEs, and a major challenge is to solve inverse problems (IPs) where unknown PDE parameters are inferred from observational data gathered under a limited budget. Due to the high costs of setting up and running experiments, experimental design (ED) is often done with the help of PDE simulations to optimize for the most informative design parameters to solve such IPs, prior to actual data collection. This process of optimizing design parameters is especially critical when the budget and other practical constraints make it infeasible to adjust the design parameters between trials during the experiments. However, existing ED methods tend to require sequential and frequent design parameter adjustments between trials. Furthermore, they also have significant computational bottlenecks due to the need for complex numerical simulations of PDEs, and do not exploit the advantages provided by physics-informed neural networks (PINNs), such as their meshless solutions, differentiability, and amortized training. This work presents PIED, the first ED framework that makes use of PINNs in a fully differentiable architecture to perform continuous optimization of design parameters for IPs in one-shot deployments. PIED overcomes existing methods' computational bottlenecks through parallelized computation and meta-learning of PINN parameter initializations, and proposes novel methods to effectively take PINN training dynamics into account when optimizing the ED parameters. Through experiments on noisy simulated data and even real-world experimental data, we empirically show that, given a limited observation budget, PIED significantly outperforms existing ED methods in solving IPs, including challenging settings where the inverse parameters are unknown functions rather than just finite-dimensional quantities.
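
For readers unfamiliar with the PINN building block PIED relies on, the sketch below trains a network whose automatic-differentiation derivatives are penalized against a governing PDE while an unknown parameter is inferred jointly. The 1-D heat equation and all sizes here are illustrative choices; PIED's actual ED optimization over design parameters is not shown.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))  # u(x, t)
k = torch.tensor(0.5, requires_grad=True)   # unknown PDE parameter to infer

def pde_residual(xt):
    """Residual of u_t = k * u_xx at collocation points xt = (x, t)."""
    xt = xt.requires_grad_(True)
    u = net(xt)
    grads = torch.autograd.grad(u.sum(), xt, create_graph=True)[0]
    u_x, u_t = grads[:, :1], grads[:, 1:]
    u_xx = torch.autograd.grad(u_x.sum(), xt, create_graph=True)[0][:, :1]
    return u_t - k * u_xx                   # zero wherever the PDE holds

xt = torch.rand(64, 2)                      # random collocation points
loss = (pde_residual(xt) ** 2).mean()       # + a data-fit term at sensor locations
loss.backward()                             # differentiable in net params and k
```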


Citations (32)


... Despite the impressive advancements, MLLMs exhibit significant vulnerabilities when navigating complex conversational challenges, particularly those involving adversarial negation. This issue becomes evident when models struggle to critically analyze and resist unfaithful arguments, resulting in erroneous reversals of their initially correct answers [28,36]. As illustrated in Figure 1, GPT-4o initially provides correct answers for all examples. ...

Reference:

Calling a Spade a Heart: Gaslighting Multimodal Large Language Models via Negation
Aligning Large Language Models for Faithful Integrity Against Opposing Argument
  • Citing Article
  • April 2025

Proceedings of the AAAI Conference on Artificial Intelligence

... Task-oriented dialogue systems are more likely to use frame-based architecture. With this architecture, a "frame" acts as an official representation of user goals, and "slots" hold the primary variables elicited from user input (Yang et al., 2025). The core objective of these systems is to fill in any missing information within the frame and subsequently take the desired actions as per the user's intent. ...

OmniDialog: An Omnipotent Pre-training Model for Task-Oriented Dialogue System
  • Citing Article
  • January 2025

... Recent advances in Multimodal Large Language Models (MLLMs) [2,7,14,20] have demonstrated remarkable progress across a wide range of general domains. Leveraging their inherent reasoning capabilities, these models have also shown promising potential in multimodal mathematical reasoning [25,28]. However, their application to Geometry Problem Solving (GPS) remains challenging. ...

Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models
  • Citing Conference Paper
  • January 2024

... For instance, several studies explore LLMs' abilities in achieving rationality within game-theoretic frameworks, such as mixed-strategy NE games [69] and classic strategic games like the prisoner's dilemma [6,37]. Some works investigate the adaptability of LLMs to various prompts that influence cooperative and competitive behavior [64], and others evaluate their capacity to replicate human-like reasoning patterns in multi-agent settings [1,77]. While these studies provide an important starting point, many are limited to binary assessments of whether LLMs meet NE [47,26] without exploring the underlying mechanisms driving their strategic reasoning capabilities. ...

MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration
  • Citing Conference Paper
  • January 2024

... Natural language queries often contain ambiguity for a variety of reasons. Prior work has examined ambiguity in the context of semantics (Kuhn et al., 2023b), factual question-answering (Min et al., 2020), task-oriented dialogue intents (Budzianowski et al., 2018; Rastogi et al., 2020; Zhang et al., 2024b), personalized human preferences (Chen et al., 2024a; Handa et al., 2024; Li et al., 2023), and text-to-image generation (Hahn et al., 2024). Chandu et al. (2024) presents a visual question answering benchmark to identify epistemic and aleatory uncertainty, though the distinction between the two types of uncertainties can often be unclear. ...

Ask-before-Plan: Proactive Language Agents for Real-World Planning
  • Citing Conference Paper
  • January 2024

... According to Evans et al. (2021), while truthfulness requires a model to state what is objectively true, faithful integrity focuses on ensuring that models respond based on what they believe to be true. Previous research (Wen et al. 2024; Yang et al. 2023; Deng et al. 2024) on the faithful integrity of large language models (LLMs) primarily focused on encouraging LLMs to abstain from answering when uncertain about a question, typically responding with phrases like "I don't know." Wang, Yue, and Sun (2023) conducted experiments on large language models like ChatGPT and GPT-4, finding that although these models exhibit high accuracy and confidence when independently responding to direct questions, they struggle to maintain their assertions when faced with opposing arguments from users. ...

Don’t Just Say “I don’t know”! Self-aligning Large Language Models for Responding to Unknown Questions with Explanations
  • Citing Conference Paper
  • January 2024

... Federated learning enables collaborative modeling with imbalanced data from various sources, which shares model parameters instead of raw data between data sources and the server (Hu et al. 2024b; Liu et al. 2023; Cai et al. 2024b; Qi et al. 2022; Kairouz et al. 2021). This significantly improves the effective utilization of isolated data, enabling them to contribute to cooperative decision-making and learn a generalized model (Cai et al. 2024a; Wang et al. 2023a; Meng et al. 2024; Wang et al. 2024a). ...

Towards Effective Federated Graph Anomaly Detection via Self-boosted Knowledge Distillation
  • Citing Conference Paper
  • October 2024

... Federated learning (McMahan et al., 2017) is a paradigm that allows multiple distributed clients to collaboratively train a global model without sharing their local data (Balkus et al., 2022). This approach encompasses various scenarios such as parallel federated learning, sequential federated learning (Wang et al., 2024a), and federated ensemble learning. As depicted in Fig. 1, federated ensemble learning involves each participant independently training a model on their respective local datasets. ...

One-Shot Sequential Federated Learning for Non-IID Data by Enhancing Local Model Diversity
  • Citing Conference Paper
  • October 2024

... With the rise of LLMs in recent years, more sophisticated systems have demonstrated the ability to generate detailed, visually grounded formal analyses of artworks. GalleryGPT [9] employs a fine-tuned LLM to focus on elements such as composition, light, and color, tying critiques directly to visual content and avoiding reliance on external metadata or pre-existing textual knowledge. They also identified the problem of "LLM-biased visual hallucination," where models tend to recognize famous artworks and generate critiques based on memorized data rather than genuine visual analysis. ...

GalleryGPT: Analyzing Paintings with Large Multimodal Models
  • Citing Conference Paper
  • October 2024