Yezhen Wang’s scientific contributions


Publications (2)


Figure 2: Left: Schematic illustration of the Subspace Scheme (SS, green) operating in a learned subspace, and the Original Scheme (OS, blue) operating in the original space. Right (top row, panels a-c): Fine-tuning loss curves of SS (green), OS (blue), and Adam (orange) on three tasks, showing that OS outperforms SS by a large margin while performing comparably to Adam. Right (bottom row, panels d-f): Comparison of OS (blue) and its approximate algorithm ProjFactor (purple), indicating that ProjFactor closely tracks the dynamics of OS. Specifically, columns (a) and (d) evaluate on the commonsense-reasoning task, columns (b) and (e) on MMLU, and columns (c) and (f) on GSM8K.
Figure 6: Loss curves for different projection granularities on the Commonsense170k dataset: (a) training LLaMA2-7B with ProjFactor; (b) training LLaMA2-7B with the Subspace Scheme (as described in Section 4). The two granularities are compared under the same memory budget of M = 256.
Figure 7: Left: Illustration of the computational numerical error for a single parameter matrix during an update step. The numerical error of the projection operator is defined as the absolute difference between $\tilde{G}_{\text{bf16}}\tilde{P}_{\text{bf16}}$ and $\tilde{G}_{\text{float64}}\tilde{P}_{\text{float64}}$, averaged across all parameter matrices to which low-rank gradient projection is applied. Right: Comparison of the projections' numerical errors for configurations under a constant memory budget M = 256. The y-axis denotes the training steps, divided into 7 stages, while the x-axis lists 7 configurations of different granularity.
Figure 8: Comparative analysis of projection-matrix generation methods in LLaMA2-7B training on the Commonsense170k dataset: (a) training loss curves across all training steps; (b) zoomed-in view of convergence behavior, highlighting loss variability among projection methods.
Figure 12: Left: Performance comparison of different methods on GSM8K with LLaMA3.2-3B. Right: Performance comparison among VLoRP configurations with M = 256. The x-axis orders configurations from fine to coarse (left to right).


Memory-Efficient LLM Training by Various-Grained Low-Rank Projection of Gradients
  • Preprint
  • File available

May 2025

Yezhen Wang · Zhouhao Yang · Brian K Chen · [...] · Kenji Kawaguchi

Building upon the success of low-rank adaptation (LoRA), low-rank gradient projection (LoRP) has emerged as a promising solution for memory-efficient fine-tuning. However, existing LoRP methods typically treat each row of the gradient matrix as the default projection unit, leaving the role of projection granularity underexplored. In this work, we propose a novel framework, VLoRP, that extends low-rank gradient projection by introducing an additional degree of freedom for controlling the trade-off between memory efficiency and performance, beyond the rank hyper-parameter. Through this framework, we systematically explore the impact of projection granularity, demonstrating that finer-grained projections lead to enhanced stability and efficiency even under a fixed memory budget. For the optimization of VLoRP, we present ProjFactor, an adaptive memory-efficient optimizer that significantly reduces the memory requirement while ensuring competitive performance, even in the presence of gradient accumulation. Additionally, we provide a theoretical analysis of VLoRP, demonstrating the descent and convergence of its optimization trajectory under both SGD and ProjFactor. Extensive experiments are conducted to validate our findings, covering tasks such as commonsense reasoning, MMLU, and GSM8K.
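The granularity idea can be illustrated with a short, self-contained sketch. The snippet below is a hypothetical illustration only, not the paper's released code: the function names, the random Gaussian projection, and the `chunk`/`rank` parameters are assumptions. It shows how a gradient matrix can be reshaped into finer or coarser sub-rows before being compressed into a low-rank subspace, so that `chunk` and `rank` jointly determine the memory budget.

```python
# Hypothetical sketch of variable-grained low-rank gradient projection.
# Not the authors' VLoRP/ProjFactor implementation; names and choices are illustrative.
import torch

def project_gradient(grad: torch.Tensor, chunk: int, rank: int, seed: int = 0):
    """Reshape the gradient into (m*n/chunk, chunk) sub-rows, then compress each
    sub-row with a shared random rank-`rank` projection."""
    m, n = grad.shape
    assert (m * n) % chunk == 0, "chunk must divide the number of gradient entries"
    g = grad.reshape(-1, chunk)                    # smaller chunk => finer granularity
    gen = torch.Generator().manual_seed(seed)
    p = torch.randn(chunk, rank, generator=gen) / rank ** 0.5  # random projector
    return g @ p, p                                # compressed gradient + projector

def restore_gradient(g_proj: torch.Tensor, p: torch.Tensor, shape):
    """Map the compressed gradient back to the original parameter space."""
    return (g_proj @ p.T).reshape(shape)

if __name__ == "__main__":
    G = torch.randn(64, 64)
    G_proj, P = project_gradient(G, chunk=16, rank=4)
    G_hat = restore_gradient(G_proj, P, G.shape)
    print(G_proj.shape, G_hat.shape)
```

Under this parameterization the compressed gradient has (m·n/chunk)·rank entries, so halving `chunk` doubles the number of sub-rows and `rank` must be halved to keep the budget fixed; this is the fine-versus-coarse trade-off, under a constant memory budget, that the abstract refers to.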


Figure B.1: CIFAR100, IPC=50: Inner Loss and gradient norm for Neumann
Figure E.1: Visualization of the 2D latent solutions for the Burgers, Allen-Cahn, and KdV equations. The observed data are sampled on an 8 × 8 grid, denoted by white points.
Memory-Efficient Gradient Unrolling for Large-Scale Bi-level Optimization

June 2024 · 32 Reads

Bi-level optimization (BO) has become a fundamental mathematical framework for addressing hierarchical machine learning problems. As deep learning models continue to grow in size, the demand for scalable bi-level optimization solutions has become increasingly critical. Traditional gradient-based bi-level optimization algorithms, due to their inherent characteristics, are ill-suited to meet the demands of large-scale applications. In this paper, we introduce Forward Gradient Unrolling with Forward Gradient, abbreviated as (FG)²U, which achieves an unbiased stochastic approximation of the meta gradient for bi-level optimization. (FG)²U circumvents the memory and approximation issues associated with classical bi-level optimization approaches, and delivers significantly more accurate gradient estimates than existing large-scale bi-level optimization approaches. Additionally, (FG)²U is inherently designed to support parallel computing, enabling it to effectively leverage large-scale distributed computing systems to achieve significant computational efficiency. In practice, (FG)²U and other methods can be strategically placed at different stages of the training process to achieve a more cost-effective two-phase paradigm. Further, (FG)²U is easy to implement within popular deep learning frameworks, and can be conveniently adapted to address more challenging zeroth-order bi-level optimization scenarios. We provide a thorough convergence analysis and a comprehensive practical discussion for (FG)²U, complemented by extensive empirical evaluations, showcasing its superior performance in diverse large-scale bi-level optimization tasks.
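For intuition on the forward-gradient component, the sketch below shows a minimal unbiased forward-mode gradient estimator. It is a simplification under stated assumptions: it uses `torch.func.jvp` with Gaussian probe directions and omits the unrolled inner optimization loop that (FG)²U wraps around this idea, so it should not be read as the paper's algorithm.

```python
# Minimal sketch of a forward-gradient estimator (illustrative only).
import torch
from torch.func import jvp

def forward_gradient(loss_fn, params: torch.Tensor, num_dirs: int = 8):
    """Unbiased gradient estimate: average of (dL/dp . v) * v over random
    directions v, computed with forward-mode JVPs (no backward graph stored)."""
    est = torch.zeros_like(params)
    for _ in range(num_dirs):
        v = torch.randn_like(params)           # random probe direction
        _, jv = jvp(loss_fn, (params,), (v,))  # directional derivative dL/dp . v
        est += jv * v                          # E[(g . v) v] = g for v ~ N(0, I)
    return est / num_dirs

if __name__ == "__main__":
    w = torch.randn(10)
    loss = lambda p: (p ** 2).sum()
    print(forward_gradient(loss, w, num_dirs=64))  # approximately 2 * w
```

Because only forward-mode products are used, no computation graph for the unrolled inner loop needs to be stored, which reflects the memory advantage the abstract emphasizes.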