Wayne Xin Zhao’s research while affiliated with China University of Petroleum, Beijing and other places


Publications (346)


Figure 1: Average number of problems solved per contest (typically 12 problems) by AI models compared to human ICPC medalists. Despite their strong reasoning capabilities, current top models are still unable to achieve medal-winning performance in ICPC competitions.
Figure 2: The complete pipeline for test case generation and validation, enabling efficient local evaluation.
Figure 3: Refine@K scales robustly with increasing output lengths across different models. The output length is measured in tokens.
Distribution of contest problems across algorithmic tags. Each problem may be associated with one or more tags. 'WFs' and 'CFs' denote World Finals and Continental Finals, respectively.
ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests
  • Preprint
  • File available

June 2025 · 4 Reads

Shiyi Xu · Yiwen Hu · Yingqian Min · [...]
With the significant progress of large reasoning models in complex coding and reasoning tasks, existing benchmarks, like LiveCodeBench and CodeElo, are insufficient to evaluate the coding capabilities of large language models (LLMs) in real competition environments. Moreover, current evaluation metrics such as Pass@K fail to capture the reflective abilities of reasoning models. To address these challenges, we propose ICPC-Eval, a top-level competitive coding benchmark designed to probe the frontiers of LLM reasoning. ICPC-Eval includes 118 carefully curated problems from 11 recent ICPC contests held in various regions of the world, offering three key contributions: 1) A challenging, realistic ICPC competition scenario, featuring a problem type and difficulty distribution consistent with actual contests. 2) A robust test case generation method and a corresponding local evaluation toolkit, enabling efficient and accurate local evaluation. 3) An effective test-time scaling evaluation metric, Refine@K, which allows iterative repair of solutions based on execution feedback. The results underscore the significant challenge of evaluating complex reasoning abilities: top-tier reasoning models like DeepSeek-R1 often rely on multi-turn code feedback to fully unlock their in-context reasoning potential when compared to non-reasoning counterparts. Furthermore, despite recent advancements in code generation, these models still lag behind top-performing human teams. We release the benchmark at: https://github.com/RUCAIBox/Slow_Thinking_with_LLMs
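
To make the Refine@K metric concrete, here is a minimal sketch of how such an execution-feedback repair loop could be scored; `generate_solution` and `run_tests` are hypothetical stand-ins for a model call and the local evaluation toolkit, not ICPC-Eval's actual interfaces.

```python
# Hypothetical sketch of a Refine@K-style evaluation loop: the model gets up to
# K attempts per problem, with feedback from failed executions fed back in.
# `generate_solution` and `run_tests` are placeholders, not ICPC-Eval's real API.
from typing import Callable, List, Tuple

def refine_at_k(
    problems: List[str],
    generate_solution: Callable[[str, List[str]], str],  # (problem, feedback so far) -> code
    run_tests: Callable[[str, str], Tuple[bool, str]],   # (problem, code) -> (passed, feedback)
    k: int,
) -> float:
    solved = 0
    for problem in problems:
        feedback_history: List[str] = []
        for _ in range(k):
            code = generate_solution(problem, feedback_history)
            passed, feedback = run_tests(problem, code)
            if passed:
                solved += 1
                break
            feedback_history.append(feedback)  # e.g. wrong answer on test 3, TLE, ...
    return solved / len(problems)  # Refine@K: fraction solved within K repair turns
```

Unlike Pass@K, a solved problem here may consume several turns, so the metric rewards models that can actually act on execution feedback rather than only resample.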


Towards Effective Code-Integrated Reasoning

May 2025

In this paper, we investigate code-integrated reasoning, where models generate code when necessary and integrate feedback by executing it through a code interpreter. To acquire this capability, models must learn when and how to use external code tools effectively, which is supported by tool-augmented reinforcement learning (RL) through interactive learning. Despite its benefits, tool-augmented RL can still suffer from potential instability in the learning dynamics. In light of this challenge, we present a systematic approach to improving the training effectiveness and stability of tool-augmented RL for code-integrated reasoning. Specifically, we develop enhanced training strategies that balance exploration and stability, progressively building tool-use capabilities while improving reasoning performance. Through extensive experiments on five mainstream mathematical reasoning benchmarks, our model demonstrates significant performance improvements over multiple competitive baselines. Furthermore, we conduct an in-depth analysis of the mechanism and effect of code-integrated reasoning, revealing several key insights, such as the extension of the model's capability boundaries and the simultaneous improvement of reasoning efficiency through code integration. All data and code for reproducing this work are available at: https://github.com/RUCAIBox/CIR.
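
The generate-execute-integrate loop the abstract describes can be pictured roughly as follows. This is a sketch under assumed conventions (the model marks tool code with `<code>` tags, and `llm_generate` and a sandboxed `execute` are placeholder callables), not the paper's implementation.

```python
# A minimal sketch of a code-integrated reasoning loop: the model reasons in
# text, occasionally emits code, and the interpreter's output is appended to
# the context so the model can continue with real execution feedback.
import re
from typing import Callable

CODE_TAG = re.compile(r"<code>(.*?)</code>", re.DOTALL)  # assumed tagging convention

def code_integrated_reasoning(prompt: str,
                              llm_generate: Callable[[str], str],
                              execute: Callable[[str], str],
                              max_turns: int = 8) -> str:
    context = prompt
    for _ in range(max_turns):
        chunk = llm_generate(context)     # model reasons, possibly emitting code
        context += chunk
        match = CODE_TAG.search(chunk)
        if match is None:                 # no tool call: the reasoning is finished
            break
        result = execute(match.group(1))  # run the emitted code in a sandbox
        context += f"\n<output>{result}</output>\n"  # feed execution feedback back
    return context
```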


Figure 1: Overall illustration of the proposed RL-based RioRAG framework.
Average results across ten domains on the RAGChecker benchmark. Fact-Rec refers to fact recall, Info-Den to information density, Cont-Util to context utilization, Rel-NS and Irrel-NS to relevant and irrelevant noise sensitivity, Hallu. to hallucination, Self-Know to self-knowledge, and Faith. to faithfulness.
Reinforced Informativeness Optimization for Long-Form Retrieval-Augmented Generation

May 2025 · 2 Reads

Long-form question answering (LFQA) presents unique challenges for large language models, requiring the synthesis of coherent, paragraph-length answers. While retrieval-augmented generation (RAG) systems have emerged as a promising solution, existing research struggles with key limitations: the scarcity of high-quality training data for long-form generation, the compounding risk of hallucination in extended outputs, and the absence of reliable evaluation metrics for factual completeness. In this paper, we propose RioRAG, a novel reinforcement learning (RL) framework that advances long-form RAG through reinforced informativeness optimization. Our approach introduces two fundamental innovations to address the core challenges. First, we develop an RL training paradigm of reinforced informativeness optimization that directly optimizes informativeness and effectively addresses the slow-thinking deficit in conventional RAG systems, bypassing the need for expensive supervised data. Second, we propose a nugget-centric hierarchical reward modeling approach that enables precise assessment of long-form answers through a three-stage process: extracting nuggets from every source webpage, constructing a nugget claim checklist, and computing rewards based on factual alignment. Extensive experiments on two LFQA benchmarks, LongFact and RAGChecker, demonstrate the effectiveness of the proposed method. Our code is available at https://github.com/RUCAIBox/RioRAG.
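
As an illustration of the nugget-centric reward, a toy version might look like the sketch below; `extract_nuggets` and `is_supported` are hypothetical stand-ins for what would be LLM-based extraction and factual-alignment checks in practice.

```python
# A rough sketch of a nugget-checklist reward in the spirit of RioRAG; the
# extraction and entailment helpers are hypothetical stand-ins for LLM calls.
from typing import Callable, List

def nugget_reward(answer: str,
                  webpages: List[str],
                  extract_nuggets: Callable[[str], List[str]],
                  is_supported: Callable[[str, str], bool]) -> float:
    # Stage 1: pull candidate factual nuggets out of every source page.
    checklist: List[str] = []
    for page in webpages:
        checklist.extend(extract_nuggets(page))
    # Stage 2: deduplicate into a claim checklist.
    checklist = list(dict.fromkeys(checklist))
    if not checklist:
        return 0.0
    # Stage 3: reward = fraction of checklist nuggets the answer actually covers.
    covered = sum(is_supported(answer, nugget) for nugget in checklist)
    return covered / len(checklist)
```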


Figure 1: A demonstration of off-target generation. The text with a blue background shows a French question, while the red text represents LLMs' English thinking and response, highlighting a language inconsistency.
Figure 9: Our QRT thinking intervention, which imitates LLMs' behavior of repeating the question before actually thinking about how to solve it.
Evaluation results of different models on our MMATH. AVG represents the average score across languages.
Evaluation results of different evaluation strategies. ATP means prompting models to answer in the target language. DIT introduces multilingual discourse markers to induce models' thinking language. QRT imitates models' behavior of repeating questions before thinking about how to solve them.
MMATH: A Multilingual Benchmark for Mathematical Reasoning

May 2025 · 1 Read

The advent of large reasoning models, such as OpenAI o1 and DeepSeek R1, has significantly advanced complex reasoning tasks. However, their capabilities in multilingual complex reasoning remain underexplored, with existing efforts largely focused on simpler tasks like MGSM. To address this gap, we introduce MMATH, a benchmark for multilingual complex reasoning spanning 374 high-quality math problems across 10 typologically diverse languages. Using MMATH, we observe that even advanced models like DeepSeek R1 exhibit substantial performance disparities across languages and suffer from a critical off-target issue: generating responses in unintended languages. To address this, we explore strategies including prompting and training, demonstrating that reasoning in English and answering in target languages can simultaneously enhance performance and preserve target-language consistency. Our findings offer new insights and practical strategies for advancing the multilingual reasoning capabilities of large language models. Our code and data can be found at https://github.com/RUCAIBox/MMATH.
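
The "reason in English, answer in the target language" strategy can be conveyed with a simple prompt template; the exact wording below is an assumption for illustration, not the prompt used in the paper.

```python
# An illustrative prompt template for the "think in English, answer in the
# target language" strategy the abstract describes; the phrasing is a guess.
def build_prompt(question: str, target_language: str) -> str:
    return (
        f"{question}\n\n"
        "Please reason step by step in English, "
        f"then give your final answer in {target_language} inside \\boxed{{}}."
    )

# Example: a French question, with English reasoning requested explicitly.
print(build_prompt("Combien vaut 3^4 ?", "French"))
```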



ManuSearch: Democratizing Deep Search in Large Language Models with a Transparent and Open Multi-Agent Framework

May 2025 · 2 Reads

Recent advances in web-augmented large language models (LLMs) have exhibited strong performance in complex reasoning tasks, yet these capabilities are mostly locked in proprietary systems with opaque architectures. In this work, we propose ManuSearch, a transparent and modular multi-agent framework designed to democratize deep search for LLMs. ManuSearch decomposes the search and reasoning process into three collaborative agents: (1) a solution planning agent that iteratively formulates sub-queries, (2) an Internet search agent that retrieves relevant documents via real-time web search, and (3) a structured webpage reading agent that extracts key evidence from raw web content. To rigorously evaluate deep reasoning abilities, we introduce ORION, a challenging benchmark focused on open-web reasoning over long-tail entities, covering both English and Chinese. Experimental results show that ManuSearch substantially outperforms prior open-source baselines and even surpasses leading closed-source systems. Our work paves the way for reproducible, extensible research in open deep search systems. We release the data and code at https://github.com/RUCAIBox/ManuSearch
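
A rough sketch of the three-agent loop might look like the following; all callables are hypothetical stand-ins for the planning, search, and reading agents, not ManuSearch's actual interfaces.

```python
# A high-level sketch of the three-agent decomposition described for ManuSearch:
# plan a sub-query, search the web for it, read the results for evidence, repeat.
from typing import Callable, List, Optional

def manu_search_loop(question: str,
                     plan_subquery: Callable[[str, List[str]], Optional[str]],
                     web_search: Callable[[str], List[str]],
                     read_webpage: Callable[[str, str], str],
                     max_steps: int = 5) -> List[str]:
    evidence: List[str] = []
    for _ in range(max_steps):
        subquery = plan_subquery(question, evidence)   # (1) solution planning agent
        if subquery is None:                           # planner decides it has enough
            break
        for page in web_search(subquery):              # (2) Internet search agent
            evidence.append(read_webpage(subquery, page))  # (3) webpage reading agent
    return evidence  # handed to a final answer-generation step
```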


DeepRec: Towards a Deep Dive Into the Item Space with Large Language Model Based Recommendation

May 2025 · 6 Reads

Recently, large language models (LLMs) have been introduced into recommender systems (RSs), either to enhance traditional recommendation models (TRMs) or serve as recommendation backbones. However, existing LLM-based RSs often do not fully exploit the complementary advantages of LLMs (e.g., world knowledge and reasoning) and TRMs (e.g., recommendation-specific knowledge and efficiency) to fully explore the item space. To address this, we propose DeepRec, a novel LLM-based RS that enables autonomous multi-turn interactions between LLMs and TRMs for deep exploration of the item space. In each interaction turn, LLMs reason over user preferences and interact with TRMs to retrieve candidate items. After multi-turn interactions, LLMs rank the retrieved items to generate the final recommendations. We adopt reinforcement learning (RL)-based optimization and propose novel designs from three aspects: recommendation-model-based data rollout, recommendation-oriented hierarchical rewards, and a two-stage RL training strategy. For data rollout, we introduce a preference-aware TRM, with which LLMs interact to construct trajectory data. For rewards, we design a hierarchical reward function that involves both process-level and outcome-level rewards to optimize the interaction process and recommendation performance, respectively. For RL training, we develop a two-stage training strategy, where the first stage aims to guide LLMs to interact with TRMs and the second stage focuses on performance improvement. Experiments on public datasets demonstrate that DeepRec significantly outperforms both traditional and LLM-based baselines, offering a new paradigm for deep exploration in recommendation systems.
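
The multi-turn LLM-TRM interaction can be sketched as below; the interfaces are illustrative assumptions, not DeepRec's actual code.

```python
# A simplified sketch of DeepRec-style interaction: the LLM repeatedly queries
# a traditional recommendation model (TRM) for candidates, then ranks the pool.
from typing import Callable, List

def deep_rec(user_history: List[str],
             llm_query: Callable[[List[str], List[str]], str],    # reason over prefs
             trm_retrieve: Callable[[str], List[str]],            # TRM candidate retrieval
             llm_rank: Callable[[List[str], List[str]], List[str]],
             turns: int = 3,
             top_k: int = 10) -> List[str]:
    candidates: List[str] = []
    for _ in range(turns):
        query = llm_query(user_history, candidates)  # LLM reasons, asks the TRM
        candidates.extend(trm_retrieve(query))       # TRM returns candidate items
    candidates = list(dict.fromkeys(candidates))     # deduplicate the pooled items
    return llm_rank(user_history, candidates)[:top_k]  # LLM produces the final list
```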


R1-Searcher++: Incentivizing the Dynamic Knowledge Acquisition of LLMs via Reinforcement Learning

May 2025 · 3 Reads

Large Language Models (LLMs) are powerful but prone to hallucinations due to static knowledge. Retrieval-Augmented Generation (RAG) helps by injecting external information, but current methods are often costly, generalize poorly, or ignore the internal knowledge of the model. In this paper, we introduce R1-Searcher++, a novel framework designed to train LLMs to adaptively leverage both internal and external knowledge sources. R1-Searcher++ employs a two-stage training strategy: an initial SFT cold-start phase for preliminary format learning, followed by RL for dynamic knowledge acquisition. The RL stage uses outcome supervision to encourage exploration, incorporates a reward mechanism for internal knowledge utilization, and integrates a memorization mechanism to continuously assimilate retrieved information, thereby enriching the model's internal knowledge. By leveraging internal knowledge and an external search engine, the model continuously improves its capabilities, enabling efficient retrieval-augmented reasoning. Our experiments demonstrate that R1-Searcher++ outperforms previous RAG and reasoning methods and achieves efficient retrieval. The code is available at https://github.com/RUCAIBox/R1-Searcher-plus.
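
One way to picture the reward design suggested by the abstract is a toy shaping function like the following; the weights and the internal-knowledge bonus are illustrative guesses, not the paper's values.

```python
# A toy sketch of the kind of reward shaping the abstract suggests: correct
# answers earn an outcome reward, with a bonus when the model answered from
# internal knowledge (no search calls). All constants are made up for illustration.
def r1pp_style_reward(is_correct: bool, num_search_calls: int,
                      format_ok: bool) -> float:
    if not format_ok:              # malformed trajectories are penalized outright
        return -1.0
    reward = 0.0
    if is_correct:
        reward += 1.0              # outcome supervision
        if num_search_calls == 0:
            reward += 0.5          # bonus for relying on internal knowledge
    return reward
```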


Incentivizing Dual Process Thinking for Efficient Large Language Model Reasoning

May 2025 · 1 Read · 1 Citation

Large reasoning models (LRMs) have demonstrated strong performance on complex reasoning tasks, but often suffer from overthinking, generating redundant content regardless of task difficulty. Inspired by the dual process theory in cognitive science, we propose Adaptive Cognition Policy Optimization (ACPO), a reinforcement learning framework that enables LRMs to achieve efficient reasoning through adaptive cognitive allocation and dynamic system switch. ACPO incorporates two key components: (1) introducing system-aware reasoning tokens to explicitly represent the thinking modes, thereby making the model's cognitive process transparent, and (2) integrating online difficulty estimation and a token length budget to guide adaptive system switch and reasoning during reinforcement learning. To this end, we propose a two-stage training strategy. The first stage begins with supervised fine-tuning to cold-start the model, enabling it to generate reasoning paths with explicit thinking modes. In the second stage, we apply ACPO to further enhance adaptive system switch for difficulty-aware reasoning. Experimental results demonstrate that ACPO effectively reduces redundant reasoning while adaptively adjusting cognitive allocation based on task complexity, achieving efficient hybrid reasoning.
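
A hedged sketch of what difficulty-aware shaping could look like in an ACPO-style reward follows; the budget interpolation and penalty weights are invented for illustration and are not the paper's formulation.

```python
# An illustrative difficulty-aware length-shaping rule in the spirit of ACPO:
# easy problems should prefer the fast mode and short outputs. The budgets,
# thresholds, and weights below are assumptions, not values from the paper.
def acpo_style_reward(is_correct: bool, used_slow_mode: bool,
                      num_tokens: int, difficulty: float,
                      budget_easy: int = 512, budget_hard: int = 4096) -> float:
    if not is_correct:
        return 0.0
    # Interpolate the token budget by estimated difficulty in [0, 1].
    budget = budget_easy + difficulty * (budget_hard - budget_easy)
    length_penalty = max(0.0, (num_tokens - budget) / budget)   # overspend penalty
    mode_penalty = 0.2 if (used_slow_mode and difficulty < 0.3) else 0.0
    return 1.0 - 0.5 * min(1.0, length_penalty) - mode_penalty
```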


LARES: Latent Reasoning for Sequential Recommendation

May 2025 · 3 Reads

Sequential recommender systems have become increasingly important in real-world applications that model user behavior sequences to predict their preferences. However, existing sequential recommendation methods predominantly rely on non-reasoning paradigms, which may limit the model's computational capacity and result in suboptimal recommendation performance. To address these limitations, we present LARES, a novel and scalable LAtent REasoning framework for Sequential recommendation that enhances the model's representation capabilities by increasing the computational density of its parameters through depth-recurrent latent reasoning. Our proposed approach employs a recurrent architecture that allows flexible expansion of reasoning depth without increasing parameter complexity, thereby effectively capturing dynamic and intricate user interest patterns. A key difference of LARES lies in refining all input tokens at each implicit reasoning step to improve computation utilization. To fully unlock the model's reasoning potential, we design a two-phase training strategy: (1) self-supervised pre-training (SPT) with dual alignment objectives; (2) reinforcement post-training (RPT). During the first phase, we introduce trajectory-level alignment and step-level alignment objectives, which enable the model to learn recommendation-oriented latent reasoning patterns without requiring supplementary annotated data. The subsequent phase utilizes reinforcement learning (RL) to harness the model's exploratory ability, further refining its reasoning capabilities. Comprehensive experiments on real-world benchmarks demonstrate our framework's superior performance. Notably, LARES exhibits seamless compatibility with existing advanced models, further improving their recommendation performance.
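
Depth-recurrent latent reasoning, in its generic form, reuses one block across reasoning steps so depth grows without adding parameters; a minimal PyTorch illustration of that general idea (not LARES's actual architecture) is below.

```python
# A minimal PyTorch sketch of depth-recurrent latent reasoning: one shared block
# is applied repeatedly to refine all token states at a fixed parameter count.
# This illustrates the general mechanism only, not LARES's exact design.
import torch
import torch.nn as nn

class DepthRecurrentEncoder(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4, steps: int = 4):
        super().__init__()
        # One block, reused at every implicit reasoning step.
        self.shared_block = nn.TransformerEncoderLayer(d_model, n_heads,
                                                       batch_first=True)
        self.steps = steps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) sequence of item-embedding states.
        for _ in range(self.steps):   # deeper "reasoning" without new parameters
            x = self.shared_block(x)  # every input token is refined at each step
        return x

# Example: refine a batch of 2 behavior sequences of length 10.
h = DepthRecurrentEncoder()(torch.randn(2, 10, 64))
```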


Citations (41)


... Heterogeneous computing requires some sort of meta-process to determine when to recruit additional cores or core types. Strategically reflective systems also need such a method to determine when to recruit the resources necessary for reflective inference about default, intuitive responses [12]. Ideally this meta-process would function automatically and adaptively. ...

Reference:

Strategic Reflectivism In Intelligent Systems
Incentivizing Dual Process Thinking for Efficient Large Language Model Reasoning
  • Citing Preprint
  • May 2025

... Second, re-ranking approaches in existing mRAG frameworks predominantly rely on straightforward relevance scoring mechanisms, assigning absolute scores based on query-candidate similarity (Mortaheb et al., 2025). Alternative ranking strategies, such as pairwise and listwise methods (Gangi Reddy et al., 2024; Qin et al., 2023; Ren et al., 2025; Zhuang et al., 2024), have remained underexplored in multimodal contexts. Lastly, current mRAG frameworks typically isolate the retrieval, re-ranking, and generation phases, resulting in suboptimal coordination between evidence selection and answer generation. ...

Self-Calibrated Listwise Reranking with Large Language Models
  • Citing Conference Paper
  • April 2025

... First, we enhance Mamba with Fourier Transform capabilities, enabling explicit modeling of periodic patterns in the frequency domain. This FFT-enhanced module [22], [23] decomposes interaction sequences into their frequency components, allowing the model to separately analyze and weight different temporal scales: rapidly identifying daily patterns while also capturing longer weekly or monthly cycles. Crucially, this frequency-domain analysis provides a natural mechanism for noise reduction by attenuating high-frequency components that typically correspond to random variations rather than genuine user preferences. ...

Frequency-Augmented Mixture-of-Heterogeneous-Experts Framework for Sequential Recommendation
  • Citing Conference Paper
  • April 2025
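
The frequency-domain filtering idea in the excerpt above can be demonstrated in a few lines of NumPy; the cutoff ratio here is an arbitrary assumption chosen only to show the mechanism.

```python
# A small illustration of frequency-domain smoothing like the excerpt describes:
# FFT the sequence, damp high-frequency components, inverse-FFT back. The 0.25
# keep-ratio is an assumed value for demonstration, not from the cited work.
import numpy as np

def lowpass_sequence(x: np.ndarray, keep_ratio: float = 0.25) -> np.ndarray:
    spec = np.fft.rfft(x)                        # decompose into frequency components
    cutoff = max(1, int(len(spec) * keep_ratio))
    spec[cutoff:] = 0                            # attenuate high-frequency "noise"
    return np.fft.irfft(spec, n=len(x))

# Example: a periodic interaction signal with additive noise, then smoothed.
daily = np.sin(np.linspace(0, 14 * np.pi, 256)) + 0.3 * np.random.randn(256)
smoothed = lowpass_sequence(daily)
```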

... Common strategies include few-shot prompting [4], code completion [5], and feedbackguided iterative refinement [18,39,40]. Within combinatorial optimization, LLMs are primarily used as solvers [37,20,42,32], generating and refining candidate solutions through code synthesis. They are often integrated with evolutionary techniques such as genetic algorithms to iteratively improve solution quality [19,18,39], demonstrating growing potential as hybrid learning and search agents. ...

Unleashing the Potential of Large Language Models as Prompt Optimizers: Analogical Analysis with Gradient-based Model Optimizers
  • Citing Article
  • April 2025

Proceedings of the AAAI Conference on Artificial Intelligence

... With the growing interest and parallel development in both pure LLM-based approaches and those augmented with non-LLM techniques, it is crucial to systematically understand the different aspects of both scenarios. Accordingly, unlike previous surveys (Xu et al., 2025a), we present a fresh categorization of methods based on whether they incorporate non-LLM techniques to augment LLMs in making final recommendation decisions. This perspective aligns with the emerging trend of integrating LLMs into recommendation systems. ...

Tapping the Potential of Large Language Models as Recommender Systems: A Comprehensive Framework and Empirical Analysis
  • Citing Article
  • March 2025

ACM Transactions on Knowledge Discovery from Data

... For each, we assess both in-domain and out-of-domain generalization. We use 1000 HotPotQA questions as the in-domain set, together with several out-of-domain sets: a college-level math set (500), GSM-Hard [7] (large-number arithmetic, 500), AIME [8] (Olympiad-level problems, 90), and OlymMath [9] (Olympiad-level problems, 200). To reduce evaluation cost, we limit each test set to 500 examples, following Wang et al. [66]. As a metric, we use exact match for math and LLM-as-a-judge [67] using gpt-4o-mini for factual reasoning. ...

Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models
  • Citing Preprint
  • March 2025

... Like heterogeneous computing systems, intelligent systems that can switch between default and more reflective inference can better optimize performance-efficiency tradeoffs than standard inference or always-reflective inference systems. For example, on standard benchmarking tasks, solution time decreased much faster than performance as models were given more leeway to switch between Monte Carlo tree search and default inference [13]. Sui and colleagues used a "meta-reasoner" to evaluate a summary of each thought (in a chain) before executing any strategy, which achieved better performance-cost tradeoffs than single reasoning models, chain-of-thought methods, tree-of-thought reasoning, multi-agent systems, and other approaches designed to enable iterative reflection on initial output [44] (Fig. 4). ...

Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking
  • Citing Preprint
  • January 2025

... Modern Information Retrieval (IR) and Recommender Systems (RS) are experiencing a profound change with the advent of Large Language Models (LLMs) [4,5]. Traditional algorithms often rely on static features or past user-item interactions, whereas language-agent systems dynamically integrate world knowledge, language understanding, reasoning, and planning abilities to improve and expand the capabilities of IR and RS in a tangible manner [7, 16-18]. ...

User Behavior Simulation with Large Language Model-based Agents for Recommender Systems
  • Citing Article
  • December 2024

ACM Transactions on Information Systems

... The integration of vision and language modalities in VLMs introduces new attack surfaces. Recent research [29,6,11,14,30,31,25,21] has focused on jailbreak attacks, which aim to circumvent safety mechanisms and elicit harmful or policy-violating outputs from these models. For instance, FigStep [11] uses typographic images to evade text-based filters, while VAJM [29] shows adversarial images can bypass the safety mechanisms of VLMs, forcing universal harmful outputs. ...

Images are Achilles’ Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
  • Citing Chapter
  • December 2024

... LLMs have been integrated into various components of recommender systems, including feature engineering, user and item embeddings, scoring, ranking, or even functioning as agents that guide the recommendation process itself [12]. For item embedding, TedRec [34] performs sequence-level semantic fusion of textual and ID features for sequential recommendation, while NoteLLM [35] combines note semantics with collaborative signals to produce note embeddings. In contrast, Chen et al. [4] proposed a hierarchical approach where an LLM extracts features from item descriptions and converts them into compact embeddings, reducing computational overhead. ...

Sequence-level Semantic Representation Fusion for Recommender Systems
  • Citing Conference Paper
  • October 2024