Shengyu Ye’s research while affiliated with Microsoft and other places


Publications (9)


Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs
  • Preprint

June 2025

Xumeng Wen · Zihan Liu · Shun Zheng · [...] · Mao Yang

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for advancing the reasoning capabilities of Large Language Models (LLMs). However, a critical paradox clouds its efficacy: RLVR-tuned models often underperform their base models on the Pass@K metric for solution-finding, leading to the hypothesis that RLVR merely re-weights existing reasoning paths at the cost of reasoning diversity. In this work, we resolve this contradiction by identifying the source of the problem: the Pass@K metric itself is a flawed measure of reasoning, as it credits correct final answers that probably arise from inaccurate or incomplete chains of thought (CoTs). To address this, we introduce a more precise evaluation metric, CoT-Pass@K, which mandates that both the reasoning path and the final answer be correct. We provide a new theoretical foundation that formalizes how RLVR, unlike traditional RL, is uniquely structured to incentivize logical integrity. Our empirical results are supportive: using CoT-Pass@K, we observe that RLVR can incentivize the generalization of correct reasoning for all values of K. Furthermore, by analyzing the training dynamics, we find that this enhanced reasoning capability emerges early in the training process and smoothly generalizes. Our work provides a clear perspective on the role of RLVR, offers a more reliable method for its evaluation, and confirms its potential to genuinely advance machine reasoning.
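
The gap between the two metrics is mechanical enough to sketch. Below is a minimal illustration (not the authors' evaluation code) of Pass@K versus the stricter CoT-Pass@K described above, assuming each sampled solution carries two boolean judgments: whether the final answer is correct and whether the chain of thought itself is valid. The standard unbiased Pass@K estimator is reused for both.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimator given n samples, c of which count as correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def evaluate(samples: list, k: int):
    n = len(samples)
    # Pass@K credits any correct final answer, even if the CoT is flawed.
    c_answer = sum(s["answer_correct"] for s in samples)
    # CoT-Pass@K credits a sample only if reasoning AND answer are correct.
    c_cot = sum(s["answer_correct"] and s["cot_correct"] for s in samples)
    return pass_at_k(n, c_answer, k), pass_at_k(n, c_cot, k)

# Hypothetical example: 8 samples, 4 correct answers, only 2 with sound reasoning.
samples = (
    [{"answer_correct": True, "cot_correct": True}] * 2
    + [{"answer_correct": True, "cot_correct": False}] * 2
    + [{"answer_correct": False, "cot_correct": False}] * 4
)
print(evaluate(samples, k=4))  # CoT-Pass@4 is strictly lower than Pass@4 here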


VERSE: Verification-based Self-Play for Code Instructions

April 2025


Proceedings of the AAAI Conference on Artificial Intelligence

Instruction-tuned Code Large Language Models (Code LLMs) have excelled in diverse code-related tasks, such as program synthesis, automatic program repair, and code explanation. To collect training datasets for instruction-tuning, a popular method involves having models autonomously generate instructions and corresponding responses. However, the direct generation of responses does not ensure functional correctness, a crucial requirement for generating responses to code instructions. To overcome this, we present Verification-Based Self-Play (VERSE), aiming to enhance model proficiency in generating correct responses. VERSE establishes a robust verification framework that covers various code instructions. Employing VERSE, Code LLMs engage in self-play to generate instructions and corresponding verifications. They evaluate execution results and self-consistency as verification outcomes, using them as scores to rank generated data for self-training. Experiments show that VERSE improves multiple base Code LLMs (average 7.6%) across various languages and tasks on many benchmarks, affirming its effectiveness.
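
As a rough illustration of the verification idea only (a sketch under assumptions, not the released VERSE pipeline), the snippet below executes model-generated tests against model-generated code and uses the pass rate as a score for ranking self-play data; the generation calls themselves are omitted and referenced only in comments.

import subprocess
import tempfile

def run_tests(code: str, tests: str, timeout: int = 10) -> bool:
    """Execute candidate code together with its generated verification tests."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + tests)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def score_sample(candidates: list, tests_per_candidate: list) -> float:
    """Execution success plus agreement across samples serves as the verification score."""
    passes = [run_tests(c, t) for c, t in zip(candidates, tests_per_candidate)]
    return sum(passes) / max(len(passes), 1)

# In a full pipeline, a Code LLM would first generate an instruction, several
# candidate responses, and verifications (tests); high-scoring pairs are then
# kept as self-training data and low-scoring ones are filtered out.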



Fig. 1: Comparison of area and power efficiency: LUT-based approximate computing vs. ALU (higher is better; 28 nm FD-SOI @ 300 MHz; 1k × 1k × 1k matrix multiplication; V = vector length, C = number of centroids, equivalent bitwidth = V / log2 C)
Fig. 2: VQ for approximating matrix multiplication
Fig. 3: LUT-DLA framework
Fig. 12: Comparison with PECAN and PQA
LUT-DLA: Lookup Table as Efficient Extreme Low-Bit Deep Learning Accelerator

January 2025


The emergence of neural network capabilities invariably leads to a significant surge in computational demands due to expanding model sizes and increased computational complexity. To reduce model size and lower inference costs, recent research has focused on simplifying models and designing hardware accelerators using low-bit quantization. However, due to numerical representation limits, scalar quantization cannot reduce bit width lower than 1-bit, diminishing its benefits. To break through these limitations, we introduce LUT-DLA, a Look-Up Table (LUT) Deep Learning Accelerator Framework that utilizes vector quantization to convert neural network models into LUTs, achieving extreme low-bit quantization. The LUT-DLA framework facilitates efficient and cost-effective hardware accelerator designs and supports the LUTBoost algorithm, which helps to transform various DNN models into LUT-based models via multistage training, drastically cutting both computational and hardware overhead. Additionally, through co-design space exploration, LUT-DLA assesses the impact of various model and hardware parameters to fine-tune hardware configurations for different application scenarios, optimizing performance and efficiency. Our comprehensive experiments show that LUT-DLA achieves improvements in power efficiency and area efficiency with gains of 1.4×–7.0× and 1.5×–146.1×, respectively, while maintaining only a modest accuracy drop. For CNNs, accuracy decreases by 0.1%–3.1% using the L2 distance similarity, 0.1%–3.4% with the L1 distance similarity, and 0.1%–3.8% when employing the Chebyshev distance similarity. For transformer-based models, the accuracy drop ranges from 1.4% to 3.0%.
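
A minimal NumPy sketch of the underlying LUT idea (assumptions only, not the LUT-DLA hardware design): each length-V slice of the input is replaced by the index of its nearest centroid, dot products between centroids and the corresponding weight slices are precomputed into a lookup table, and the matrix multiplication collapses to table lookups plus additions. All sizes and the random codebooks below are illustrative.

import numpy as np

rng = np.random.default_rng(0)
V, C = 4, 16          # vector length and number of centroids (notation as in the figure captions above)
D, N = 64, 32         # input dimension (a multiple of V) and output dimension
x = rng.standard_normal(D)
W = rng.standard_normal((D, N))
centroids = rng.standard_normal((D // V, C, V))   # per-subspace codebooks (learned offline in practice)

# Offline: LUT[g, c, :] = centroids[g, c] @ W[g*V:(g+1)*V, :]
lut = np.einsum("gcv,gvn->gcn", centroids, W.reshape(D // V, V, N))

# Online: quantize each input slice to its nearest centroid index ...
x_sub = x.reshape(D // V, V)
idx = ((x_sub[:, None, :] - centroids) ** 2).sum(-1).argmin(1)

# ... then the "matmul" is just gathering LUT rows and summing them.
y_approx = lut[np.arange(D // V), idx].sum(axis=0)
y_exact = x @ W
print(np.corrcoef(y_approx, y_exact)[0, 1])  # only rough agreement here, since the codebooks are random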


CursorCore: Assist Programming through Aligning Anything

October 2024


Large language models have been successfully applied to programming assistance tasks, such as code completion, code insertion, and instructional code editing. However, these applications remain insufficiently automated and struggle to effectively integrate various types of information during the programming process, including coding history, current code, and user instructions. In this work, we propose a new conversational framework that comprehensively integrates these information sources, collect data to train our models, and evaluate their performance. First, to thoroughly evaluate how well models align with different types of information and the quality of their outputs, we introduce a new benchmark, APEval (Assist Programming Eval), to comprehensively assess the performance of models in programming assistance tasks. Then, for data collection, we develop a data generation pipeline, Programming-Instruct, which synthesizes training data from diverse sources, such as GitHub and online judge platforms. This pipeline can automatically generate various types of messages throughout the programming process. Finally, using this pipeline, we generate 219K samples, fine-tune multiple models, and develop the CursorCore series. We show that CursorCore outperforms other models of comparable size. This framework unifies applications such as inline chat and automated editing, and contributes to the advancement of coding assistants. Code, models, and data are freely available at https://github.com/TechxGenus/CursorCore.
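
As a purely illustrative sketch of the kind of input alignment described above (the field names and roles are assumed; the actual CursorCore message format is defined in the linked repository), the snippet below bundles coding history, the current code, and a user instruction into a single chat-style request.

from dataclasses import dataclass, field

@dataclass
class AssistantRequest:
    history: list = field(default_factory=list)   # previous versions or edits of the code
    current_code: str = ""                        # the code as it stands now
    instruction: str = ""                         # optional user instruction

    def to_messages(self) -> list:
        """Flatten the three information sources into chat-style messages."""
        messages = [{"role": "history", "content": h} for h in self.history]
        messages.append({"role": "current", "content": self.current_code})
        if self.instruction:
            messages.append({"role": "user", "content": self.instruction})
        return messages

req = AssistantRequest(
    history=["def add(a, b):\n    pass"],
    current_code="def add(a, b):\n    return a",
    instruction="Finish the function so it returns the sum of a and b.",
)
print(req.to_messages())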


Figure 1: Vector quantization in weight quantization
LLM quantization algorithm comparison: VPTQ balances all dimensions and achieves SOTA.
LLaMA-2 2-bit quantization results
Mistral-7B-v0.1 WikiText-2 and C4 perplexity (context lengths 2048 and 8192) and zero-shot QA accuracy
Parameters for 2-bit quantization of Llama and Mistral models: v is the vector length, k the codebook size, k1 and k2 the two codebooks, and "group num" the number of groups into which PQ (Product Quantization) is divided.
VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

September 2024


Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low-bit (even down to 2 bits). It reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to numerical representation limitations, traditional scalar-based weight quantization struggles to achieve such extreme low-bit. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables. In this paper, we introduce Vector Post-Training Quantization (VPTQ) for extremely low-bit quantization of LLMs. We use Second-Order Optimization to formulate the LLM VQ problem and guide our quantization algorithm design by solving the optimization. We further refine the weights using Channel-Independent Second-Order Optimization for a granular VQ. In addition, by decomposing the optimization problem, we propose a brief and effective codebook initialization algorithm. We also extend VPTQ to support residual and outlier quantization, which enhances model accuracy and further compresses the model. Our experimental results show that VPTQ reduces model quantization perplexity by 0.01–0.34 on LLaMA-2, 0.38–0.68 on Mistral-7B, and 4.41–7.34 on LLaMA-3 over SOTA at 2-bit, with an average accuracy improvement of 0.79–1.5% on LLaMA-2, 1% on Mistral-7B, and 11–22% on LLaMA-3 on QA tasks. We only utilize 10.4–18.6% of the quantization algorithm execution time, resulting in a 1.6×–1.8× increase in inference throughput compared to SOTA.
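
To make the basic mechanism concrete, here is a minimal sketch of weight vector quantization using plain k-means as a stand-in (VPTQ itself uses Second-Order Optimization, Channel-Independent refinement, and residual/outlier codebooks, none of which are reproduced here): weight rows are split into length-v vectors, a k-entry codebook is fitted, and the matrix is stored as indices into that codebook.

import numpy as np

def fit_codebook(vectors, k, iters=20, seed=0):
    """Plain Lloyd's k-means as a stand-in for VPTQ's optimized codebook construction."""
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), k, replace=False)].copy()
    for _ in range(iters):
        dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for c in range(k):
            members = vectors[assign == c]
            if len(members):
                codebook[c] = members.mean(0)
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook, dists.argmin(1)

rng = np.random.default_rng(1)
W = rng.standard_normal((128, 128)).astype(np.float32)   # toy weight matrix
v, k = 8, 256                           # vector length v and codebook size k (notation as in the table above)
vectors = W.reshape(-1, v)              # split weights into length-v vectors
codebook, idx = fit_codebook(vectors, k)
W_hat = codebook[idx].reshape(W.shape)  # dequantization is a pure index lookup

bits_per_weight = np.log2(k) / v        # index bits amortized over v weights (codebook overhead ignored)
print(bits_per_weight, float(np.mean((W - W_hat) ** 2)))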




Citations (3)


... Scalar Quantization (SQ) [10,17,35,27,36], which converts each scalar weight in the model into a lower bit-width, is the most common PTQ approach. We utilize the classic K-Means [19] algorithm to perform VQ, which is also employed in VPTQ [19]. The LLaMA-2-7B model is utilized for representation. ...

Reference:

PCDVQ: Enhancing Vector Quantization for Large Language Models via Polar Coordinate Decoupling
VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models
  • Citing Conference Paper
  • January 2024

... Code retrieval is crucial for code generation by introducing relevant code snippets as references for large language model (LLM) generation, significantly advancing automated software development [1,2,3,4,5]. However, existing benchmarks primarily inherit classical criteria from natural language retrieval, focusing on semantic or functional relevance, i.e., whether the retrieval code addresses the intent of a query. ...

Optimizing Code Retrieval: High-Quality and Scalable Dataset Annotation through Large Language Models
  • Citing Conference Paper
  • January 2024

... 1) The student can be deployed in resource-constrained environments such as edge devices, smartphones, or limited hardware [37], [39]. Enabling new applications, this advantage has also been leveraged extensively in computer vision applications [40]–[42]. ...

ERDL: Efficient Retrieval Framework Based on Distillation from Large Language Models
  • Citing Conference Paper
  • July 2024