Dacheng Tao’s research while affiliated with Nanyang Technological University and other places


Publications (924)


Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning
  • Preprint

June 2025 · 3 Reads

Kongcheng Zhang · Qi Yao · [...] · Dacheng Tao

Recent advances in Reinforcement Learning (RL) have highlighted its potential in complex reasoning tasks, yet effective training often relies on external supervision, which limits its broader applicability. In this work, we propose a novel self-rewarding reinforcement learning framework to enhance Large Language Model (LLM) reasoning by leveraging the consistency of intermediate reasoning states across different reasoning trajectories. Our key insight is that correct responses often exhibit consistent trajectory patterns in terms of model likelihood: their intermediate reasoning states tend to converge toward their own final answers (high consistency) with minimal deviation toward other candidates (low volatility). Inspired by this observation, we introduce CoVo, an intrinsic reward mechanism that integrates Consistency and Volatility via a robust vector-space aggregation strategy, complemented by a curiosity bonus to promote diverse exploration. CoVo enables LLMs to perform RL in a self-rewarding manner, offering a scalable pathway for learning to reason without external supervision. Extensive experiments on diverse reasoning benchmarks show that CoVo achieves performance comparable to or even surpassing supervised RL. Our code is available at https://github.com/sastpg/CoVo.
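As a rough illustration of the consistency/volatility idea, the toy sketch below scores a trajectory from per-step answer likelihoods. The function name, the aggregation, and the curiosity term are simplifications invented here; the paper's actual vector-space mechanism is more involved.

```python
import numpy as np

def covo_reward(step_probs: np.ndarray, own_answer: int,
                curiosity_bonus: float = 0.0) -> float:
    """Toy intrinsic reward in the spirit of CoVo (illustrative only).

    step_probs: (T, K) array; row t is the model's likelihood over K
                candidate answers at intermediate reasoning step t.
    own_answer: index of the answer this trajectory finally produced.
    """
    # Consistency: how strongly intermediate states converge toward the
    # trajectory's own final answer.
    consistency = step_probs[:, own_answer].mean()

    # Volatility: drift of probability mass toward rival candidates,
    # measured here as the mean step-to-step change on rival answers.
    rivals = np.delete(step_probs, own_answer, axis=1)
    volatility = np.abs(np.diff(rivals, axis=0)).mean() if len(step_probs) > 1 else 0.0

    # Aggregate: reward trajectories that are consistent and stable.
    return float(consistency - volatility + curiosity_bonus)

# Example: 4 reasoning steps, 3 candidate answers, trajectory answered 0.
probs = np.array([[0.50, 0.30, 0.20],
                  [0.60, 0.25, 0.15],
                  [0.70, 0.20, 0.10],
                  [0.85, 0.10, 0.05]])
print(covo_reward(probs, own_answer=0))
```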


[Figures: ablation studies for the proposed regularization, with and without the adjacent-reward and generation-probability terms and with L1 reward alignment (40K Unified-Feedback samples, Gemma-2B-it base model), plus common hyper-parameters of the experiments.]
Intra-Trajectory Consistency for Reward Modeling
  • Preprint
  • File available

June 2025 · 7 Reads

Reward models are critical for improving large language models (LLMs), particularly in reinforcement learning from human feedback (RLHF) or inference-time verification. Current reward modeling typically relies on scores of overall responses to learn the outcome rewards for the responses. However, since response-level scores are coarse-grained supervision signals, the reward model struggles to identify the specific components within a response trajectory that truly correlate with the scores, leading to poor generalization on unseen responses. In this paper, we propose to leverage generation probabilities to establish reward consistency between processes in the response trajectory, which allows the response-level supervisory signal to propagate across processes, thereby providing additional fine-grained signals for reward learning. Building on an analysis under the Bayesian framework, we develop an intra-trajectory consistency regularization to enforce that adjacent processes with higher next-token generation probability maintain more consistent rewards. We apply the proposed regularization to an advanced outcome reward model, improving its performance on RewardBench. We also show that the reward model trained with the proposed regularization induces better DPO-aligned policies and achieves better best-of-N (BON) inference-time verification results. Our code is provided at https://github.com/chaoyang101/ICRM.
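A minimal sketch of what such a regularizer could look like, assuming per-step rewards and transition generation probabilities are available; the paper's Bayesian-derived weighting is richer than this toy version.

```python
import torch

def intra_trajectory_consistency_loss(step_rewards: torch.Tensor,
                                      gen_probs: torch.Tensor) -> torch.Tensor:
    """Illustrative regularizer: adjacent steps joined by a high
    next-token generation probability should receive similar rewards.

    step_rewards: (T,) per-step rewards from the reward model.
    gen_probs:    (T-1,) generation probability linking step t to t+1.
    """
    diffs = (step_rewards[1:] - step_rewards[:-1]) ** 2
    # Weight each squared reward gap by how confidently the policy
    # generated the transition: confident transitions are pulled together.
    return (gen_probs * diffs).mean()

rewards = torch.tensor([0.2, 0.8, 0.7, 0.9], requires_grad=True)
probs = torch.tensor([0.9, 0.2, 0.6])
loss = intra_trajectory_consistency_loss(rewards, probs)
loss.backward()  # gradients flow back into the reward model's outputs
```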


Improving Large Language Models with Concept-Aware Fine-Tuning

June 2025 · 1 Read

Large language models (LLMs) have become the cornerstone of modern AI. However, the existing paradigm of next-token prediction fundamentally limits their ability to form coherent, high-level concepts, making it a critical barrier to human-like understanding and reasoning. Take the phrase "ribonucleic acid" as an example: an LLM will first decompose it into tokens, i.e., artificial text fragments ("rib", "on", ...), then learn each token sequentially, rather than grasping the phrase as a unified, coherent semantic entity. This fragmented representation hinders deeper conceptual understanding and, ultimately, the development of truly intelligent systems. In response, we introduce Concept-Aware Fine-Tuning (CAFT), a novel multi-token training method that redefines how LLMs are fine-tuned. By enabling the learning of sequences that span multiple tokens, this method fosters stronger concept-aware learning. Our experiments demonstrate significant improvements compared to conventional next-token fine-tuning methods across diverse tasks, including traditional applications like text summarization and domain-specific ones like de novo protein design. Multi-token prediction was previously only possible in the prohibitively expensive pretraining phase; CAFT, to our knowledge, is the first to bring the multi-token setting to the post-training phase, thus effectively democratizing its benefits for the broader community of practitioners and researchers. Finally, the unexpected effectiveness of our proposed method suggests wider implications for the machine learning research community. All code and data are available at https://github.com/michaelchen-lab/caft-llm.
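A hedged sketch of a multi-token objective of the kind CAFT brings to fine-tuning: auxiliary heads supervise several future tokens at once so that span-level (concept-level) structure enters the loss. All names and the exact head design are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def multi_token_loss(hidden: torch.Tensor, heads: torch.nn.ModuleList,
                     targets: torch.Tensor, horizon: int = 3) -> torch.Tensor:
    """Sketch of a multi-token training objective (illustrative; CAFT's
    exact formulation may differ).

    hidden:  (B, T, D) transformer hidden states.
    heads:   one linear head per future offset, each mapping D -> vocab.
    targets: (B, T) token ids.
    """
    loss = 0.0
    for k in range(1, horizon + 1):
        logits = heads[k - 1](hidden[:, :-k])          # predict token t+k
        loss = loss + F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets[:, k:].reshape(-1),
        )
    return loss / horizon

# Toy usage with random tensors standing in for a real backbone.
B, T, D, V = 2, 8, 16, 100
hidden = torch.randn(B, T, D)
heads = torch.nn.ModuleList([torch.nn.Linear(D, V) for _ in range(3)])
targets = torch.randint(0, V, (B, T))
print(multi_token_loss(hidden, heads, targets).item())
```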


GeometryZero: Improving Geometry Solving for LLM with Group Contrastive Policy Optimization

Recent advances in large language models (LLMs) have demonstrated remarkable capabilities across diverse domains, particularly in mathematical reasoning, among which geometry problem solving remains a challenging area where auxiliary construction plays an essential role. Existing approaches either achieve suboptimal performance or rely on massive LLMs (e.g., GPT-4o), incurring substantial computational costs. We posit that reinforcement learning with verifiable reward (e.g., GRPO) offers a promising direction for training smaller models that effectively combine auxiliary construction with robust geometric reasoning. However, directly applying GRPO to geometric reasoning presents fundamental limitations due to its dependence on unconditional rewards, which leads to indiscriminate and counterproductive auxiliary constructions. To address these challenges, we propose Group Contrastive Policy Optimization (GCPO), a novel reinforcement learning framework featuring two key innovations: (1) Group Contrastive Masking, which adaptively provides positive or negative reward signals for auxiliary construction based on contextual utility, and (2) a length reward that promotes longer reasoning chains. Building on GCPO, we develop GeometryZero, a family of affordable-size geometric reasoning models that judiciously determine when to employ auxiliary construction. Our extensive empirical evaluation across popular geometric benchmarks (Geometry3K, MathVista) demonstrates that GeometryZero models consistently outperform baselines (e.g., GRPO), achieving an average improvement of 4.29% across all benchmarks.
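To make the group-contrastive idea concrete, here is a toy reward shaper: the sign of the auxiliary-construction bonus is decided by whether, within the sampled group, constructions correlate with correctness. Function names and coefficients are invented for illustration; GCPO's actual masking rule differs in detail.

```python
import numpy as np

def gcpo_rewards(correct: np.ndarray, used_aux: np.ndarray,
                 lengths: np.ndarray, len_coef: float = 0.001) -> np.ndarray:
    """Toy group-contrastive reward shaping (illustrative only).

    correct:  (G,) 1.0 if a rollout answered correctly, else 0.0.
    used_aux: (G,) 1.0 if a rollout drew an auxiliary construction.
    lengths:  (G,) response lengths in tokens.
    """
    # Contextual utility: within this group, did rollouts that used
    # auxiliary constructions do better than those that did not?
    acc_with = correct[used_aux == 1].mean() if used_aux.any() else 0.0
    acc_without = correct[used_aux == 0].mean() if (used_aux == 0).any() else 0.0
    aux_sign = np.sign(acc_with - acc_without)   # +1 helpful, -1 harmful

    # Base correctness reward, a signed bonus for auxiliary construction,
    # and a small length reward encouraging longer reasoning chains.
    return correct + aux_sign * used_aux * 0.5 + len_coef * lengths

group_correct = np.array([1.0, 1.0, 0.0, 0.0])
group_aux     = np.array([1.0, 1.0, 0.0, 1.0])
group_len     = np.array([220.0, 180.0, 90.0, 260.0])
print(gcpo_rewards(group_correct, group_aux, group_len))
```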


The Devil is in the Darkness: Diffusion-Based Nighttime Dehazing Anchored in Brightness Perception

June 2025 · 7 Reads

While nighttime image dehazing has been extensively studied, converting nighttime hazy images to daytime-equivalent brightness remains largely unaddressed. Existing methods face two critical limitations: (1) datasets overlook the brightness relationship between day and night, resulting in the brightness mapping being inconsistent with the real world during image synthesis; and (2) models do not explicitly incorporate daytime brightness knowledge, limiting their ability to reconstruct realistic lighting. To address these challenges, we introduce the Diffusion-Based Nighttime Dehazing (DiffND) framework, which excels in both data synthesis and lighting reconstruction. Our approach starts with a data synthesis pipeline that simulates severe distortions while enforcing brightness consistency between synthetic and real-world scenes, providing a strong foundation for learning night-to-day brightness mapping. Next, we propose a restoration model that integrates a pre-trained diffusion model guided by a brightness perception network. This design harnesses the diffusion model's generative ability while adapting it to nighttime dehazing through brightness-aware optimization. Experiments validate our dataset's utility and the model's superior performance in joint haze removal and brightness mapping.
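A toy rendering of the brightness-consistency constraint in the synthesis pipeline might look like the gain-matching step below; the real pipeline also simulates haze and other severe distortions, so this is only a sketch under simplifying assumptions.

```python
import numpy as np

def match_brightness(night_img: np.ndarray, day_mean: float) -> np.ndarray:
    """Toy brightness-consistency step for data synthesis (illustrative;
    DiffND's pipeline is considerably richer).

    night_img: float RGB array in [0, 1]; day_mean: target mean luminance
               taken from real daytime statistics.
    """
    # Rec. 601 luminance of the synthetic nighttime image.
    luma = (0.299 * night_img[..., 0] + 0.587 * night_img[..., 1]
            + 0.114 * night_img[..., 2])
    # Scale so the scene's mean brightness matches the daytime target.
    gain = day_mean / max(luma.mean(), 1e-6)
    return np.clip(night_img * gain, 0.0, 1.0)

rng = np.random.default_rng(0)
night = rng.uniform(0.0, 0.2, size=(8, 8, 3))   # a dim synthetic patch
day_like = match_brightness(night, day_mean=0.45)
print(night.mean(), day_like.mean())
```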


[Figures: ablation study on M3DT's three-stage training mechanism; DMControl tasks used in this paper; normalized scores on MW+DMC and on Ant+Cheetah.]
Mastering Massive Multi-Task Reinforcement Learning via Mixture-of-Expert Decision Transformer

May 2025 · 9 Reads

While recent advancements in offline multi-task reinforcement learning (MTRL) have harnessed the powerful capabilities of the Transformer architecture, most approaches focus on a limited number of tasks, and scaling to extremely massive task sets remains a formidable challenge. In this paper, we first revisit the key impact of task number on current MTRL methods, and further reveal that naively expanding the parameters proves insufficient to counteract the performance degradation as the number of tasks escalates. Building on these insights, we propose M3DT, a novel mixture-of-experts (MoE) framework that tackles task scalability by further unlocking the model's parameter scalability. Specifically, we enhance both the architecture and the optimization of the agent: we strengthen the Decision Transformer (DT) backbone with MoE to reduce the task load on parameter subsets, and introduce a three-stage training mechanism to facilitate efficient training with optimal performance. Experimental results show that, by increasing the number of experts, M3DT not only consistently improves as the model expands on a fixed number of tasks, but also exhibits remarkable task scalability, successfully extending to 160 tasks with superior performance.
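For intuition, a minimal mixture-of-experts feed-forward block of the sort that could replace a Decision Transformer's FFN is sketched below; M3DT's routing, expert count, and three-stage optimization schedule are more elaborate than this.

```python
import torch
import torch.nn as nn

class MoEFFN(nn.Module):
    """Minimal MoE feed-forward block (illustrative sketch, not M3DT's
    actual architecture): a router sends each token to its top-k experts."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int, top_k: int = 1):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, T, D)
        gates = self.router(x).softmax(-1)                 # (B, T, E)
        weights, idx = gates.topk(self.top_k, dim=-1)      # route each token
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                    # tokens sent to e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

layer = MoEFFN(d_model=64, d_ff=128, n_experts=4)
print(layer(torch.randn(2, 10, 64)).shape)   # torch.Size([2, 10, 64])
```

Because each token activates only its routed experts, adding experts grows capacity for many tasks without a proportional increase in per-token compute, which is the parameter-scalability lever the abstract describes.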


Revisiting Overthinking in Long Chain-of-Thought from the Perspective of Self-Doubt

May 2025 · 1 Read

Reasoning Large Language Models (RLLMs) have demonstrated impressive performance on complex tasks, largely due to the adoption of Long Chain-of-Thought (Long CoT) reasoning. However, they often exhibit overthinking: performing unnecessary reasoning steps even after arriving at the correct answer. Prior work has largely focused on qualitative analyses of overthinking through sample-based observations of long CoTs. In contrast, we present a quantitative analysis of overthinking from the perspective of self-doubt, characterized by excessive token usage devoted to re-verifying already-correct answers. We find that self-doubt significantly contributes to overthinking. In response, we introduce a simple and effective prompting method to reduce the model's over-reliance on the input question, thereby avoiding self-doubt. Specifically, we first prompt the model to question the validity of the input question, and then respond concisely based on the outcome of that evaluation. Experiments on three mathematical reasoning tasks and four datasets with missing premises demonstrate that our method substantially reduces answer length and yields significant improvements on nearly all datasets across four widely used RLLMs. Further analysis demonstrates that our method effectively minimizes the number of reasoning steps and reduces self-doubt.
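Since the method is prompt-based, a plausible rendering is shown below; the wording is our assumption, not the authors' exact prompt.

```python
def self_doubt_prompt(question: str) -> str:
    """Illustrative prompt in the spirit of the paper's method: first
    validate the question, then answer concisely without re-verifying."""
    return (
        "Before solving, briefly check whether the question itself is valid "
        "and contains all premises needed to answer it.\n"
        "- If a premise is missing or the question is invalid, say so and stop.\n"
        "- Otherwise, answer concisely and do not re-verify an answer you "
        "have already confirmed.\n\n"
        f"Question: {question}"
    )

print(self_doubt_prompt(
    "A train travels 60 km in 1 hour. How far does it travel in 2.5 hours?"
))
```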


On Geometry-Enhanced Parameter-Efficient Fine-Tuning for 3D Scene Segmentation

May 2025

The emergence of large-scale pre-trained point cloud models has significantly advanced 3D scene understanding, but adapting these models to specific downstream tasks typically demands full fine-tuning, incurring high computational and storage costs. Parameter-efficient fine-tuning (PEFT) techniques, successful in natural language processing and 2D vision tasks, underperform when naively applied to 3D point cloud models due to significant geometric and spatial distribution shifts. Existing PEFT methods commonly treat points as orderless tokens, neglecting important local spatial structures and global geometric contexts in 3D modeling. To bridge this gap, we introduce the Geometric Encoding Mixer (GEM), a novel geometry-aware PEFT module specifically designed for 3D point cloud transformers. GEM explicitly integrates fine-grained local positional encodings with a lightweight latent attention mechanism to capture comprehensive global context, thereby effectively addressing the spatial and geometric distribution mismatch. Extensive experiments demonstrate that GEM achieves performance comparable to or sometimes even exceeding full fine-tuning, while updating only 1.6% of the model's parameters, fewer than other PEFT methods. With significantly reduced training time and memory requirements, our approach thus sets a new benchmark for efficient, scalable, and geometry-aware fine-tuning of large-scale 3D point cloud models. Code will be released.
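The description suggests an adapter roughly like the sketch below: a coordinate-conditioned positional MLP for local structure plus a small latent cross-attention for global context, added residually on top of frozen backbone features. Module names and sizes are illustrative assumptions, not GEM's actual design.

```python
import torch
import torch.nn as nn

class GeometricEncodingMixer(nn.Module):
    """Illustrative geometry-aware PEFT adapter (GEM's real design may
    differ): local positional cues + latent attention for global context."""

    def __init__(self, d_model: int, n_latents: int = 16):
        super().__init__()
        self.pos_mlp = nn.Sequential(nn.Linear(3, d_model), nn.GELU(),
                                     nn.Linear(d_model, d_model))
        self.latents = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        self.read = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.write = nn.MultiheadAttention(d_model, 4, batch_first=True)

    def forward(self, tokens: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) frozen backbone features; xyz: (B, N, 3) coords.
        x = tokens + self.pos_mlp(xyz)                 # inject local geometry
        lat = self.latents.expand(x.size(0), -1, -1)   # (B, L, D) latent slots
        ctx, _ = self.read(lat, x, x)                  # latents read the scene
        glob, _ = self.write(x, ctx, ctx)              # tokens read global ctx
        return x + glob                                # residual adapter output

gem = GeometricEncodingMixer(d_model=64)
print(gem(torch.randn(2, 128, 64), torch.randn(2, 128, 3)).shape)
```

Only the adapter's parameters would be trained here; the backbone producing `tokens` stays frozen, which is what keeps the update footprint small.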


Resolving Knowledge Conflicts in Domain-specific Data Selection: A Case Study on Medical Instruction-tuning

May 2025 · 1 Read

Domain-specific instruction-tuning has become the de facto standard for improving the performance of large language models (LLMs) in specialized applications, e.g., medical question answering. Since instruction-tuning datasets may contain redundant or low-quality data, data selection (DS) is usually required to maximize data efficiency. Despite successes in the general domain, current DS methods often struggle to select the desired data for domain-specific instruction-tuning. One of the main reasons is that they neglect the impact of knowledge conflicts, i.e., the discrepancy between LLMs' pretrained knowledge and the context knowledge of instruction data, which can damage LLMs' prior abilities and lead to hallucination. To this end, we propose a simple-yet-effective Knowledge-aware Data Selection (KDS) framework to select the domain-specific instruction-tuning data that meets LLMs' actual needs. The core of KDS is to leverage two knowledge-aware metrics for quantitatively measuring knowledge conflicts from two aspects: context-memory knowledge alignment and intra-memory knowledge consistency. By filtering data with large knowledge conflicts and sampling high-quality and diverse data, KDS can effectively stimulate the LLMs' abilities and achieve better domain-specific performance. Taking the medical domain as the testbed, we conduct extensive experiments and empirically show that KDS surpasses the other baselines and brings significant and consistent performance gains across all LLMs. More encouragingly, KDS effectively improves model generalization and alleviates the hallucination problem.
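As a deliberately simplified reading of the two metrics, one can probe the model's parametric memory with closed-book samples and compare against the answer supported by the instruction's context. The paper defines these signals more carefully (over model probabilities rather than raw strings), so treat the helper below as a toy.

```python
from collections import Counter

def kds_scores(closed_book_answers: list[str], context_answer: str):
    """Toy versions of the two knowledge-conflict metrics (illustrative).

    closed_book_answers: answers sampled from the LLM *without* the
        instruction's context, probing its parametric memory.
    context_answer: the answer supported by the instruction's context.
    """
    counts = Counter(closed_book_answers)
    _, majority_n = counts.most_common(1)[0]
    # Context-memory alignment: does parametric memory agree with context?
    alignment = counts[context_answer] / len(closed_book_answers)
    # Intra-memory consistency: how stable is the memory itself?
    consistency = majority_n / len(closed_book_answers)
    return alignment, consistency

a, c = kds_scores(["aspirin", "aspirin", "ibuprofen", "aspirin"], "aspirin")
print(a, c)  # keep samples whose alignment and consistency are both high
```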


GoMatching++: Parameter- and Data-Efficient Arbitrary-Shaped Video Text Spotting and Benchmarking

May 2025

Video text spotting (VTS) extends image text spotting (ITS) by adding text tracking, significantly increasing task complexity. Despite progress in VTS, existing methods still fall short of the performance seen in ITS. This paper identifies a key limitation in current video text spotters: limited recognition capability, even after extensive end-to-end training. To address this, we propose GoMatching++, a parameter- and data-efficient method that transforms an off-the-shelf image text spotter into a video specialist. The core idea lies in freezing the image text spotter and introducing a lightweight, trainable tracker, which can be optimized efficiently with minimal training data. Our approach includes two key components: (1) a rescoring mechanism to bridge the domain gap between image and video data, and (2) the LST-Matcher, which enhances the frozen image text spotter's ability to handle video text. We explore various architectures for LST-Matcher to ensure efficiency in both parameters and training data. As a result, GoMatching++ sets new performance records on challenging benchmarks such as ICDAR15-video, DSText, and BOVText, while significantly reducing training costs. To address the lack of curved text datasets in VTS, we introduce ArTVideo, a new benchmark featuring over 30% curved text with detailed annotations. We also provide a comprehensive statistical analysis and experimental results for ArTVideo. We believe that GoMatching++ and the ArTVideo benchmark will drive future advancements in video text spotting. The source code, models and dataset are publicly available at https://github.com/Hxyz-123/GoMatching.
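The tracking side can be pictured with a bare-bones association routine; the greedy IoU matcher and linear rescoring below are only stand-ins for the learned LST-Matcher and GoMatching++'s actual rescoring mechanism.

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two axis-aligned boxes [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def associate(prev_tracks: dict[int, np.ndarray], dets: list[np.ndarray],
              thr: float = 0.5) -> dict[int, int]:
    """Greedy IoU association of current detections to track ids
    (illustrative stand-in for the learned LST-Matcher)."""
    assignment, used = {}, set()
    for tid, tbox in prev_tracks.items():
        best, best_iou = None, thr
        for j, dbox in enumerate(dets):
            if j not in used and iou(tbox, dbox) > best_iou:
                best, best_iou = j, iou(tbox, dbox)
        if best is not None:
            assignment[tid] = best
            used.add(best)
    return assignment

def rescore(spotter_conf: float, track_conf: float, alpha: float = 0.5) -> float:
    """Fuse the frozen image spotter's confidence with tracking evidence
    (toy stand-in for the paper's rescoring mechanism)."""
    return alpha * spotter_conf + (1 - alpha) * track_conf

tracks = {0: np.array([10.0, 10.0, 50.0, 30.0])}
dets = [np.array([12.0, 11.0, 52.0, 31.0]), np.array([100.0, 80.0, 140.0, 100.0])]
print(associate(tracks, dets))              # {0: 0}
print(rescore(spotter_conf=0.4, track_conf=0.9))
```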


Citations (12)


... Xue et al. [2] present OmniForce, a human-centered AutoML system that brings large-model capabilities into practical, open-environment scenarios. By uniting data curation, cloud-edge collaboration, and a crowd-sourced library of algorithms, their framework tackles the real-world complexities of industrial pipelines, from supply chain analytics to AI-generated content, while minimizing overhead for both training and inference. ...

Reference:

npj Artificial Intelligence—Editorial journal inauguration
Omniforce: on human-centered, large model empowered and cloud-edge collaborative AutoML system

... A representative technique is model merging, which means fusing model weights from different models [48,22,50]. Though showing impressive performance and wide applicability [51], model merging still suffers from significant performance degradation when the number or difficulty of target tasks increases [16,31,55,42,41,23]. Therefore, delta compression, which compresses the delta parameters or task vectors [18], i.e., the difference between the finetuned and the pre-trained model, has been recently proposed [29,32]. ...

Data-Adaptive Weight-Ensembling for Multi-task Model Fusion

International Journal of Computer Vision

... 6) MMMU [33]: The validation part of a new benchmark, which is designed to evaluate the performance of multimodal models on multidisciplinary tasks that require university-level subject knowledge and deliberate reasoning. 7) HR [27]: A high-resolution multimodal benchmark consisting of 4K and 8K images and corresponding questions. ...

Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models
  • Citing Article
  • April 2025

Proceedings of the AAAI Conference on Artificial Intelligence

... We compared the QUERY2TREE model with previously researched models such as QUERY2BOX, ConE, MLP, CQD-CO, SILR, GNN-QE, CKGR, LARK, and LACT [33,42,48,4,40–44]. The QUERY2BOX, CQD-Beam, and SILR models addressed query questions represented in the form of logical operations such as conjunction (∧), disjunction (∨), and existential quantification (∃). ...

Improving Complex Reasoning over Knowledge Graph with Logic-Aware Curriculum Tuning
  • Citing Article
  • April 2025

Proceedings of the AAAI Conference on Artificial Intelligence

... Other contributions, including the use of convolutional layers, generative adversarial networks (GANs), and surrogate model integration [50–52], may also offer benefits for real-time monitoring tasks, where PINNs can be trained to infer structural states from indirect measurements, reducing reliance on dense sensor networks. These architectures enable robust data-driven approximation of system behaviour under limited or noisy input, a condition commonly observed in railway bridge monitoring under operational loads. ...

Generative adversarial physics-informed neural networks for solving forward and inverse problem with small labeled samples
  • Citing Article
  • April 2025

Computers & Mathematics with Applications

... By mapping reference images into pseudo-text tokens, these adapters convert image-text composed queries into textual input, allowing the VLM to perform cross-modal retrieval and thus achieve CIR. Recent advances in large language models (LLMs) [24,25] and multimodal LLMs (MLLMs) [26–32] have demonstrated powerful reasoning and instruction-following capabilities, spurring new ZS-CIR research that leverages these strengths. For instance, some methods [33–35] directly utilize LLMs to build <image, caption, modification text, target text> quadruplets from image-caption pairs and train LLM-specific adapters, transforming images into pseudo-text tokens for input to the LLM, thereby enabling it to handle CIR tasks. ...

Large Language Model Agent: A Survey on Methodology, Applications and Challenges

... We conduct a comprehensive analysis of the proposed INP-Former++ on three widely used industrial AD datasets (MVTec-AD [4], VisA [5], and Real-IAD [70]), as well as a medical AD dataset, Uni-Medical [71]. ...

Exploring plain ViT features for multi-class unsupervised visual anomaly detection
  • Citing Article
  • February 2025

Computer Vision and Image Understanding

... Prior studies have explored various approaches to mitigate overthinking through response length control and computation routing. Existing methods mainly include: (1) Prompt-based approaches [15,36,23] that implicitly guide length through instructions, (2) Integrated training strategies that teach models to adaptively determine reasoning steps via SFT [22,20] or RL with length penalties [1,2,18], and (3) Router-based [3,9,8] architectures employing classifiers to allocate computation paths. While achieving partial progress, these methods either lack precise length control, require additional computational overhead, or fail to explicitly output optimal reasoning lengths [1,37]. ...

O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning

... This optimization-driven strategy not only improves task alignment but also reduces the reliance on large annotated datasets. This is especially valuable in practical domains such as healthcare, law, and finance, where labeled data is both scarce and expensive [6–8]. As such, fine-tuning techniques for few-shot learning face higher demands for practical effectiveness [9]. ...

Learning from models beyond fine-tuning

Nature Machine Intelligence

... Multi-agent debate refines LLM reasoning and decision-making based on LLMs' powerful conversational ability (Chu et al., 2024; Liang et al., 2024a) and agents' collaborative ability (Li et al., 2023; Tu et al., 2024), by exploring diverse reasoning paths and cross-verifying claims, thereby reducing hallucinations and errors (Du et al., 2024). It also serves as an evaluation mechanism for tasks like question answering and summarization (Chan et al., 2024). ...

SPAgent: Adaptive Task Decomposition and Model Selection for General Video Generation and Editing