Tat-Seng Chua’s research while affiliated with National University of Singapore and other places


Publications (312)


Reinforcing Video Reasoning with Focused Thinking
  • Preprint

May 2025

Jisheng Dang · Jingze Wu · Teng Wang · [...] · Tat-Seng Chua

Recent advancements in reinforcement learning, particularly through Group Relative Policy Optimization (GRPO), have significantly improved multimodal large language models for complex reasoning tasks. However, two critical limitations persist: 1) they often produce unfocused, verbose reasoning chains that obscure salient spatiotemporal cues, and 2) binary rewarding fails to account for partially correct answers, resulting in high reward variance and inefficient learning. In this paper, we propose TW-GRPO, a novel framework that enhances visual reasoning with focused thinking and dense reward granularity. Specifically, we employ a token weighting mechanism that prioritizes tokens with high informational density (estimated by intra-group variance), suppressing redundant tokens like generic reasoning prefixes. Furthermore, we reformulate RL training by shifting from single-choice to multi-choice QA tasks, where soft rewards enable finer-grained gradient estimation by distinguishing partial correctness. Additionally, we propose question-answer inversion, a data augmentation strategy to generate diverse multi-choice samples from existing benchmarks. Experiments demonstrate state-of-the-art performance on several video reasoning and general understanding benchmarks. Notably, TW-GRPO achieves 50.4% accuracy on CLEVRER (an 18.8% improvement over Video-R1) and 65.8% on MMVU. Our code is available at https://github.com/longmalongma/TW-GRPO.
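As a rough illustration of the two ideas named in the abstract, the sketch below weights tokens by the intra-group variance of their log-probabilities and scores multi-choice answers with a partial-credit reward. The function names, the Jaccard form of the soft reward, and the assumption that rollout token positions are aligned are all simplifications for illustration, not the released TW-GRPO code.

```python
# Toy sketch, not the authors' implementation.
import torch

def token_weights_from_group_variance(logprobs: torch.Tensor) -> torch.Tensor:
    """logprobs: [group_size, seq_len] per-position log-probs of a group of rollouts
    (assumed position-aligned for this toy example). Tokens whose log-probs vary a lot
    across the group are treated as informative; near-constant tokens, such as generic
    reasoning prefixes, are down-weighted."""
    var = logprobs.var(dim=0)                 # [seq_len] intra-group variance
    w = var / (var.sum() + 1e-8)              # normalize to a weight distribution
    return w * logprobs.shape[1]              # rescale so the mean weight is ~1

def soft_multi_choice_reward(pred: set, gold: set) -> float:
    """Partial-credit reward instead of binary 0/1: Jaccard overlap between the
    predicted and gold option sets (one plausible choice of soft reward)."""
    if not pred and not gold:
        return 1.0
    return len(pred & gold) / len(pred | gold)

# toy usage
lp = torch.randn(4, 6).log_softmax(-1)        # 4 rollouts, 6 positions (illustrative only)
print(token_weights_from_group_variance(lp))
print(soft_multi_choice_reward({"A", "C"}, {"A", "B", "C"}))  # 0.666...
```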



Figure 9. Principle-wise statistics of the PoA annotations across every art style in CompArt. The order of principles is fixed to facilitate easier comparison of bar chart profiles.
Compose Your Aesthetics: Empowering Text-to-Image Models with the Principles of Art
  • Preprint
  • File available

March 2025 · 14 Reads

Text-to-Image (T2I) diffusion models (DMs) have garnered widespread adoption due to their capability of generating high-fidelity outputs and their accessibility to anyone able to put imagination into words. However, DMs are often predisposed to generate unappealing outputs, much like the random images on the internet they were trained on. Existing approaches to address this are founded on the implicit premise that visual aesthetics is universal, which is limiting. Aesthetics in the T2I context should be about personalization, and we propose the novel task of aesthetics alignment, which seeks to align user-specified aesthetics with the T2I generation output. Inspired by how artworks provide an invaluable perspective to approach aesthetics, we codify visual aesthetics using the compositional framework artists employ, known as the Principles of Art (PoA). To facilitate this study, we introduce CompArt, a large-scale compositional art dataset built on top of WikiArt with PoA analysis annotated by a capable Multimodal LLM. Leveraging the expressive power of LLMs and training a lightweight and transferable adapter, we demonstrate that T2I DMs can effectively offer 10 compositional controls through user-specified PoA conditions. Additionally, we design an appropriate evaluation framework to assess the efficacy of our approach.
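For intuition only, here is a minimal sketch of a lightweight adapter that turns user-specified PoA conditions into extra conditioning tokens for a T2I diffusion model. The class name `PoAAdapter`, the dimensions, and the token-concatenation scheme are assumptions, not the authors' implementation; only the number of controls (10) comes from the abstract.

```python
# Hypothetical sketch of a PoA-conditioning adapter.
import torch
import torch.nn as nn

NUM_PRINCIPLES = 10  # the abstract mentions 10 compositional controls

class PoAAdapter(nn.Module):
    def __init__(self, poa_dim: int = 64, cond_dim: int = 768, n_tokens: int = 4):
        super().__init__()
        self.embed = nn.Embedding(NUM_PRINCIPLES, poa_dim)
        self.proj = nn.Sequential(
            nn.Linear(poa_dim, cond_dim), nn.GELU(),
            nn.Linear(cond_dim, cond_dim * n_tokens),
        )
        self.n_tokens, self.cond_dim = n_tokens, cond_dim

    def forward(self, principle_ids: torch.Tensor, strengths: torch.Tensor) -> torch.Tensor:
        """principle_ids: [B, P] indices of the selected principles;
        strengths: [B, P] user-specified intensities in [0, 1].
        Returns [B, P * n_tokens, cond_dim] tokens that could be concatenated with
        the text-encoder output conditioning the diffusion U-Net."""
        e = self.embed(principle_ids) * strengths.unsqueeze(-1)
        tok = self.proj(e)                              # [B, P, cond_dim * n_tokens]
        B, P, _ = tok.shape
        return tok.view(B, P * self.n_tokens, self.cond_dim)

# toy usage (the id-to-principle mapping here is hypothetical)
adapter = PoAAdapter()
ids = torch.tensor([[0, 3, 7]])
out = adapter(ids, torch.tensor([[1.0, 0.5, 0.8]]))
print(out.shape)  # torch.Size([1, 12, 768])
```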


Understanding Long Videos via LLM-Powered Entity Relation Graphs

January 2025 · 4 Reads

The analysis of extended video content poses unique challenges in artificial intelligence, particularly when dealing with the complexity of tracking and understanding visual elements across time. Current methodologies that process video frames sequentially struggle to maintain coherent tracking of objects, especially when these objects temporarily vanish and later reappear in the footage. A critical limitation of these approaches is their inability to effectively identify crucial moments in the video, largely due to their limited grasp of temporal relationships. To overcome these obstacles, we present GraphVideoAgent, a cutting-edge system that leverages the power of graph-based object tracking in conjunction with large language model capabilities. At its core, our framework employs a dynamic graph structure that maps and monitors the evolving relationships between visual entities throughout the video sequence. This innovative approach enables more nuanced understanding of how objects interact and transform over time, facilitating improved frame selection through comprehensive contextual awareness. Our approach demonstrates remarkable effectiveness when tested against industry benchmarks. In evaluations on the EgoSchema dataset, GraphVideoAgent achieved a 2.2 improvement over existing methods while requiring analysis of only 8.2 frames on average. Similarly, testing on the NExT-QA benchmark yielded a 2.0 performance increase with an average frame requirement of 8.1. These results underscore the efficiency of our graph-guided methodology in enhancing both accuracy and computational performance in long-form video understanding tasks.
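A toy sketch of the kind of dynamic entity-relation graph described above, in which a re-appearing entity extends its existing node (so identity persists across gaps) and frame selection uses entity co-occurrence. The data structure and method names are assumptions for illustration, not the paper's interface.

```python
# Illustrative sketch of a dynamic entity-relation graph for frame selection.
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class EntityNode:
    name: str
    frames: list = field(default_factory=list)   # frame indices where the entity is visible

class EntityRelationGraph:
    def __init__(self):
        self.nodes: dict[str, EntityNode] = {}
        self.edges = defaultdict(list)            # (a, b) -> list of (frame, relation)

    def observe(self, frame_idx: int, entities: list, relations: list):
        """Update the graph with one frame's detections. Re-appearing entities
        extend their existing node, so identity persists across temporal gaps."""
        for e in entities:
            self.nodes.setdefault(e, EntityNode(e)).frames.append(frame_idx)
        for a, rel, b in relations:
            self.edges[(a, b)].append((frame_idx, rel))

    def candidate_frames(self, query_entities: list) -> list:
        """Frames where all queried entities co-occur: a cheap heuristic for
        choosing which frames to pass to the LLM."""
        sets = [set(self.nodes[e].frames) for e in query_entities if e in self.nodes]
        return sorted(set.intersection(*sets)) if sets else []

g = EntityRelationGraph()
g.observe(0, ["person", "cup"], [("person", "holds", "cup")])
g.observe(5, ["person"], [])
g.observe(9, ["person", "cup"], [("person", "puts down", "cup")])
print(g.candidate_frames(["person", "cup"]))  # [0, 9]
```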


Proactive Conversational AI: A Comprehensive Survey of Advancements and Opportunities

January 2025 · 34 Reads · 5 Citations

ACM Transactions on Information Systems

Dialogue systems are designed to offer human users social support or functional services through natural language interactions. Traditional conversation research has put significant emphasis on a system's response-ability, including its capacity to understand dialogue context and generate appropriate responses. However, the key element of proactive behavior – a crucial aspect of intelligent conversations – is often overlooked in these studies. Proactivity empowers conversational agents to lead conversations towards achieving pre-defined targets or fulfilling specific goals on the system side. Proactive dialogue systems are equipped with advanced techniques to handle complex tasks requiring strategic and motivational interactions, thus representing a significant step towards artificial general intelligence. Motivated by the necessity and challenges of building proactive dialogue systems, we provide a comprehensive review of various prominent problems and advanced designs for implementing proactivity into different types of dialogue systems, including open-domain dialogues, task-oriented dialogues, and information-seeking dialogues. We also discuss real-world challenges that require further research attention to meet application needs in the future, such as proactivity in dialogue systems that are based on large language models, proactivity in hybrid dialogues, evaluation protocols, and ethical considerations for proactive dialogue systems. By providing quick access to, and an overall picture of, the proactive dialogue systems domain, we aim to inspire new research directions and stimulate further advancements towards achieving the next level of conversational AI capabilities, paving the way for more dynamic and intelligent interactions within various application domains.


PMHR: Path-Based Multi-Hop Reasoning Incorporating Rule-Enhanced Reinforcement Learning and KG Embeddings

December 2024 · 16 Reads

Multi-hop reasoning provides a means for inferring indirect relationships and missing information from knowledge graphs (KGs). Reinforcement learning (RL) was recently employed for multi-hop reasoning. Although RL-based methods provide explainability, they face challenges such as sparse rewards, spurious paths, large action spaces, and long training and running times. In this study, we present a novel approach that combines KG embeddings and RL strategies for multi-hop reasoning called path-based multi-hop reasoning (PMHR). We address the issues of sparse rewards and spurious paths by incorporating a well-designed reward function that combines soft rewards with rule-based rewards. The rewards are adjusted based on the target entity and the path to it. Furthermore, we perform action filtering and utilize the vectors of entities and relations acquired through KG embeddings to initialize the environment, thereby significantly reducing the runtime. Experiments involving a comprehensive performance evaluation, efficiency analysis, ablation studies, and a case study were performed. The experimental results on benchmark datasets demonstrate the effectiveness of PMHR in improving KG reasoning accuracy while preserving interpretability. Compared to existing state-of-the-art models, PMHR achieved Hit@1 improvements of 0.63%, 2.02%, and 3.17% on the UMLS, Kinship, and NELL-995 datasets, respectively. PMHR provides not only improved reasoning accuracy and explainability but also optimized computational efficiency, thereby offering a robust solution for multi-hop reasoning.
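A hedged sketch of a reward of the flavor described in the abstract: a hard reward when the agent hits the target entity, a soft reward from embedding similarity otherwise, and a rule-based bonus when the traversed relation sequence matches a known rule. The cosine form, the mixing weight, and the rule check are illustrative assumptions, not PMHR's exact reward.

```python
# Illustrative combined soft + rule-based reward for RL-based KG reasoning.
import numpy as np

def path_reward(path_relations, reached_entity_vec, target_entity_vec,
                known_rules, hit_target: bool, alpha: float = 0.7) -> float:
    # Hard reward if the agent lands exactly on the target entity.
    if hit_target:
        return 1.0
    # Soft reward: cosine similarity between the reached entity embedding and the
    # target embedding, so "close misses" still provide a learning signal.
    cos = float(np.dot(reached_entity_vec, target_entity_vec) /
                (np.linalg.norm(reached_entity_vec) * np.linalg.norm(target_entity_vec) + 1e-8))
    soft = (cos + 1.0) / 2.0                       # map from [-1, 1] to [0, 1]
    # Rule-based bonus: the traversed relation sequence matches a mined rule body.
    rule_bonus = 1.0 if tuple(path_relations) in known_rules else 0.0
    return alpha * soft + (1.0 - alpha) * rule_bonus

# toy usage with a made-up rule and 2-d embeddings
rules = {("treats", "interacts_with")}
print(path_reward(["treats", "interacts_with"], np.array([0.2, 0.9]),
                  np.array([0.3, 0.8]), rules, hit_target=False))
```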


Domain-aware Multimodal Dialog Systems with Distribution-based User Characteristic Modeling

November 2024 · 3 Reads · 1 Citation

ACM Transactions on Multimedia Computing, Communications and Applications

Textual response generation is a pivotal yet challenging task for multimodal task-oriented dialog systems, which aims to generate the appropriate textual response given the multimodal context. Although existing efforts have obtained remarkable advancements, they ignore the potential of domain information in revealing the key points of the user intention, and of the user's history dialogs in indicating the user's characteristics. To address this issue, in this work, we propose a novel domain-aware multimodal dialog system with distribution-based user characteristic modeling (named DMDU). In particular, DMDU contains three vital components: context-knowledge embedding extraction, domain-aware response generation, and distribution-based user characteristic injection. Specifically, the context-knowledge embedding extraction component aims to extract the embeddings of the multimodal context and related knowledge following existing studies. The domain-aware response generation component conducts domain-aware fine-grained intention modeling based on the context and knowledge embeddings, and thus fulfills the textual response generation. Moreover, the distribution-based user characteristic injection component first captures the user's characteristics and current intention with a Gaussian distribution, and then conducts sampling-based contrastive semantic regularization to promote context representation learning. Experimental results on the public dataset demonstrate the effectiveness of DMDU. We release our code to support further research.
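To make the distribution-based idea concrete, here is a minimal sketch that summarizes a user's history as a Gaussian, samples it by reparameterization, and applies an InfoNCE-style regularizer that pulls each dialogue context toward samples from its own user's distribution. The dimensions, module names, and the exact loss form are assumptions, not DMDU's design.

```python
# Hedged sketch of distribution-based user characteristic modeling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UserCharacteristic(nn.Module):
    def __init__(self, hist_dim: int = 256, z_dim: int = 64):
        super().__init__()
        self.mu = nn.Linear(hist_dim, z_dim)
        self.logvar = nn.Linear(hist_dim, z_dim)

    def forward(self, hist_emb: torch.Tensor, n_samples: int = 4) -> torch.Tensor:
        # hist_emb: [B, hist_dim] pooled embedding of each user's history dialogs
        mu, logvar = self.mu(hist_emb), self.logvar(hist_emb)
        std = (0.5 * logvar).exp()
        eps = torch.randn(n_samples, *mu.shape)
        return mu + eps * std                     # [n_samples, B, z_dim]

def contrastive_regularizer(context, user_samples, temperature: float = 0.1):
    """context: [B, z_dim]; user_samples: [n, B, z_dim]. Each dialogue context
    should be most similar to samples drawn from its own user's distribution."""
    proto = user_samples.mean(dim=0)              # [B, z_dim] per-user prototype
    logits = F.cosine_similarity(context.unsqueeze(1), proto.unsqueeze(0), dim=-1) / temperature
    return F.cross_entropy(logits, torch.arange(context.size(0)))

# toy usage
uc = UserCharacteristic()
hist = torch.randn(3, 256)                        # 3 users' history embeddings
samples = uc(hist)                                # [4, 3, 64]
print(contrastive_regularizer(torch.randn(3, 64), samples))
```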


Revisiting Conversation Discourse for Dialogue Disentanglement

October 2024 · 10 Reads · 5 Citations

ACM Transactions on Information Systems

Dialogue disentanglement aims to detach chronologically ordered utterances into several independent sessions. Conversation utterances are essentially organized and described by the underlying discourse, and thus dialogue disentanglement requires fully understanding and harnessing the intrinsic discourse attribute. In this paper, we propose enhancing dialogue disentanglement by taking full advantage of the dialogue discourse characteristics. First, in the feature encoding stage, we construct heterogeneous graph representations to model the various dialogue-specific discourse structural features, including the static speaker-role structures (i.e., the speaker-utterance and speaker-mentioning structures) and the dynamic contextual structures (i.e., the utterance-distance and partial-replying structures). We then develop a structure-aware framework to integrate the rich structural features for better modeling the conversational semantic context. Second, in the model learning stage, we perform optimization with a hierarchical ranking loss mechanism, which groups dialogue utterances into different discourse levels and carries out training covering pair-wise and session-wise levels hierarchically. Third, in the inference stage, we devise an easy-first decoding algorithm, which performs utterance pairing in an easy-to-hard manner with a global context, breaking the constraint of the traditional sequential decoding order. On two benchmark datasets, our overall system achieves new state-of-the-art performance on all evaluations. In-depth analyses further demonstrate the efficacy of each proposed idea and also reveal how our methods help advance the task. Our work has great potential to facilitate broader multi-party, multi-thread dialogue applications.
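The easy-first idea can be sketched as repeatedly committing the globally most confident parent link instead of decoding utterances strictly in order. The scorer interface, the self-link convention for starting a new session, and the toy scorer below are assumptions for illustration, not the paper's algorithm.

```python
# Toy sketch of easy-first pairing for dialogue disentanglement.
def easy_first_decode(num_utts: int, score) -> list:
    """score(i, j): model confidence that utterance i replies to earlier utterance j
    (j == i is taken to mean "i starts a new session"). Returns parent[i] for each i."""
    parent = [-1] * num_utts
    undecided = set(range(num_utts))
    while undecided:
        # commit the easiest remaining decision across the whole dialogue
        best = max(((score(i, j), i, j)
                    for i in undecided for j in range(i + 1)), key=lambda t: t[0])
        _, i, j = best
        parent[i] = j
        undecided.remove(i)
    return parent

# toy usage with a fake scorer: even-indexed utterances chain together, odd ones too
fake_score = lambda i, j: 1.0 if (i - j) % 2 == 0 and j < i else (0.5 if i == j else 0.0)
print(easy_first_decode(5, fake_score))  # [0, 1, 0, 1, 2]
```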


PSVMA+: Exploring Multi-Granularity Semantic-Visual Adaption for Generalized Zero-Shot Learning

September 2024 · 28 Reads · 2 Citations

IEEE Transactions on Pattern Analysis and Machine Intelligence

Generalized zero-shot learning (GZSL) endeavors to identify the unseen categories using knowledge from the seen domain, necessitating the intrinsic interactions between the visual features and attribute semantic features. However, GZSL suffers from insufficient visual-semantic correspondences due to the attribute diversity and instance diversity. Attribute diversity refers to varying semantic granularity in attribute descriptions, ranging from low-level (specific, directly observable) to high-level (abstract, highly generic) characteristics. This diversity makes it challenging to collect adequate visual cues for attributes at a single granularity. Additionally, diverse visual instances corresponding to the same shared attributes introduce semantic ambiguity, leading to vague visual patterns. To tackle these problems, we propose a multi-granularity progressive semantic-visual mutual adaption (PSVMA+) network, where sufficient visual elements across granularity levels can be gathered to remedy the granularity inconsistency. PSVMA+ explores semantic-visual interactions at different granularity levels, enabling awareness of multi-granularity in both visual and semantic elements. At each granularity level, the dual semantic-visual transformer module (DSVTM) recasts the shared attributes into instance-centric attributes and aggregates the semantic-related visual regions, thereby learning unambiguous visual features to accommodate various instances. Given the diverse contributions of different granularities, PSVMA+ employs selective cross-granularity learning to leverage knowledge from reliable granularities and adaptively fuses multi-granularity features for comprehensive representations. Experimental results demonstrate that PSVMA+ consistently outperforms state-of-the-art methods.
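As a much-simplified illustration of the adaptive multi-granularity fusion flavor described above (not the DSVTM or the paper's selective learning scheme), the sketch below learns an instance-dependent weighting over features extracted at several granularity levels.

```python
# Simplified sketch of instance-dependent multi-granularity feature fusion.
import torch
import torch.nn as nn

class GranularityFusion(nn.Module):
    def __init__(self, dim: int = 512, n_levels: int = 3):
        super().__init__()
        self.gate = nn.Linear(dim * n_levels, n_levels)  # per-instance level weights

    def forward(self, feats: list) -> torch.Tensor:
        # feats: list of [B, dim] visual-semantic features, one per granularity level
        stacked = torch.stack(feats, dim=1)                                 # [B, L, dim]
        weights = torch.softmax(self.gate(torch.cat(feats, dim=-1)), dim=-1)  # [B, L]
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)                 # [B, dim]

# toy usage
fusion = GranularityFusion()
levels = [torch.randn(2, 512) for _ in range(3)]
print(fusion(levels).shape)  # torch.Size([2, 512])
```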


Multimodal Emotion-Cause Pair Extraction with Holistic Interaction and Label Constraint

August 2024 · 39 Reads · 3 Citations

ACM Transactions on Multimedia Computing, Communications and Applications

The multimodal emotion-cause pair extraction (MECPE) task aims to detect the emotions, causes, and emotion-cause pairs from multimodal conversations. Existing methods for this task typically concatenate representations of each utterance from distinct modalities and then predict emotion-cause pairs directly. This approach struggles to effectively integrate multimodal features and capture the subtleties of emotion transitions, which are crucial for accurately identifying causes, thereby limiting overall performance. To address these challenges, we propose a novel model that captures holistic interaction and label constraint (HiLo) features for the MECPE task. HiLo facilitates cross-modality and cross-utterance feature interaction with various attention mechanisms, establishing a robust foundation for precise cause extraction. Notably, our model innovatively leverages emotion transition features as pivotal cues to enhance causal inference within conversations. The experimental results demonstrate the superior performance of HiLo, evidenced by an increase of more than 2% in the F1 score compared to existing benchmarks. Further analysis reveals that our approach adeptly utilizes multimodal and dialogue features, making a significant contribution to the field of emotion-cause analysis. Our code is publicly available at https://is.gd/MVdYmx.
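A toy sketch of using an emotion-transition feature as an extra cue when scoring candidate (emotion utterance, cause utterance) pairs, the general idea the abstract highlights. The difference-based transition feature and the linear scorer are assumptions, not the HiLo architecture.

```python
# Illustrative pair scorer with an emotion-transition cue.
import torch
import torch.nn as nn

class PairScorer(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # input = [emotion utterance; candidate cause utterance; transition feature]
        self.scorer = nn.Linear(dim * 3, 1)

    def forward(self, utt_emb: torch.Tensor, emo_idx: int, cause_idx: int) -> torch.Tensor:
        # utt_emb: [T, dim] fused multimodal utterance representations
        transition = utt_emb[emo_idx] - utt_emb[max(emo_idx - 1, 0)]  # change leading into the emotion
        pair = torch.cat([utt_emb[emo_idx], utt_emb[cause_idx], transition])
        return torch.sigmoid(self.scorer(pair))  # probability this is an emotion-cause pair

# toy usage
scorer = PairScorer()
utts = torch.randn(6, 256)
print(scorer(utts, emo_idx=4, cause_idx=2).item())
```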


Citations (92)


... We find these results particularly interesting because they are in apparent contradiction with scholarship in the field of Proactive Dialogue Systems. The proactive nature of these systems is characterized in numerous ways; among them, however, is the capacity to steer the topic of conversation and introduce novelty in order to keep the user engaged in the dialogue (Deng et al., 2025). ...

Reference:

Evaluations at Work: Measuring the Capabilities of GenAI in Use
Proactive Conversational AI: A Comprehensive Survey of Advancements and Opportunities
  • Citing Article
  • January 2025

ACM Transactions on Information Systems

... As illustrated in Figure 1(c), our approach recommends 'La La Land', which is the result of dynamically integrating focus information and background information (i.e., go watch a movie with his/her girlfriend). The advantage of this disentangled modeling approach lies in its ability to effectively distinguish the importance of different types of information, reducing interference from irrelevant details [14,20]. Also, it flexibly utilizes background information as supplementary input when necessary, thereby enhancing the accuracy of identifying the user's primary intent. ...

Revisiting Conversation Discourse for Dialogue Disentanglement
  • Citing Article
  • October 2024

ACM Transactions on Information Systems

... Massive data fine-tuned models undergo extensive training on large datasets for high-quality, versatile video editing [29]. InstructVid2Vid [29] enables complex edits via natural language instructions, while EffiVED [54] refines broad datasets into high-quality subsets for efficient editing. ...

InstructVid2Vid: Controllable Video Editing with Natural Language Instructions
  • Citing Conference Paper
  • July 2024

... In addition, we also apply a linear classifier Linear on image features v to perform the classification. The cross-entropy loss is adopted as the classification loss L classif ication (Liu et al. 2025). The final loss function is obtained by the weighted sum of the above loss functions. ...

PSVMA+: Exploring Multi-Granularity Semantic-Visual Adaption for Generalized Zero-Shot Learning
  • Citing Article
  • September 2024

IEEE Transactions on Pattern Analysis and Machine Intelligence

... Feature extraction is a vital step in audio signal analysis, aiming to convert raw data into numerical representations that can be effectively processed by machine learning models [50], [51], [52]. ...

Multimodal Emotion-Cause Pair Extraction with Holistic Interaction and Label Constraint
  • Citing Article
  • August 2024

ACM Transactions on Multimedia Computing, Communications and Applications

... Traditional image retrieval approaches, whether content-based [1,2] or text-based [3,4], often struggle to handle complex user queries involving visual and textual elements. Composed Image Retrieval (CIR) [5][6][7][8][9] addresses this limitation by allowing users to query using an example image together with a natural language modification. This combined query allows fine-grained control over retrieval results: the reference image anchors the query in a concrete visual example, while the text specifies how to transform or refine it to match the desired target. ...

Simple but Effective Raw-Data Level Multimodal Fusion for Composed Image Retrieval
  • Citing Conference Paper
  • July 2024

... For the IR task, we use different standard metrics following previous work [15,62]: We use nDCG@10 (Normalized Discounted Cumulative Gain [55]) for TREC web track 2009-2014. For TREC DL Hard, since we use each relevant document as user intent for simulation, then verify whether the target document is ranked higher through clarification, we use MRR@10 (Mean Reciprocal Rank [48]) as the evaluation metric. ...

ROGER: Ranking-oriented Generative Retrieval
  • Citing Article
  • June 2024

ACM Transactions on Information Systems
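The two metrics named in the excerpt above, nDCG@10 and MRR@10, follow their standard definitions; small reference implementations are sketched below (the input formats are assumptions for illustration).

```python
# Reference implementations of nDCG@k and MRR@k from their standard definitions.
import math

def ndcg_at_k(relevances, k=10):
    """relevances: graded relevance of documents in ranked order (top-ranked first)."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def mrr_at_k(is_relevant, k=10):
    """is_relevant: booleans for documents in ranked order; reciprocal rank of first hit."""
    for i, rel in enumerate(is_relevant[:k]):
        if rel:
            return 1.0 / (i + 1)
    return 0.0

print(ndcg_at_k([3, 2, 0, 1]))          # graded-label ranking
print(mrr_at_k([False, False, True]))   # 1/3
```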

... ConMask [17] is an open-world knowledge graph completion method that fully utilizes text descriptions and designs special content masking and fusion modules to effectively solve the problems of new entities and sparse connections. These methods have also been widely applied to downstream tasks such as information extraction [18] and pre-trained language models [19]. ...

Enhancing Video-Language Representations With Structural Spatio-Temporal Alignment
  • Citing Article
  • April 2024

IEEE Transactions on Pattern Analysis and Machine Intelligence

... However, current methods tend to emphasize the precise generation of multimodal trajectory predictions at the expense of behavioral intention prediction [8] (BIP). This narrow focus often overlooks the latent decision-making processes that drive observable maneuvers, resulting in models that struggle to interpret complex social interactions or anticipate nuanced changes in driving behavior. ...

Behavioral Intention Prediction in Driving Scenes: A Survey
  • Citing Article
  • August 2024

IEEE Transactions on Intelligent Transportation Systems

... Recently, multimodal meme understanding (Lin et al. 2024; Hee, Chong, and Lee 2023; Qu et al. 2023; Fang et al. 2023, 2024b, 2024c, 2025; Zheng et al. 2024a) has attracted increasing attention. Unlike general multimodal learning tasks (Ji et al. 2021, 2022; Li et al. 2022a, 2022b; Wu et al. 2023; Li et al. 2023a; Fei et al. 2024a; Luo et al. 2024), meme understanding relies more heavily on contextual and metaphorical information. Existing research has mainly focused on hateful memes. ...

Revisiting Disentanglement and Fusion on Modality and Context in Conversational Multimodal Emotion Recognition
  • Citing Conference Paper
  • October 2023