Qianli Shen’s research while affiliated with National University of Singapore and other places


Publications (6)


Figure 1: The Mamba block architecture is built around the SSM formulation in equation (1), corresponding to the blue SSM block (Sequence Transformation), plus additional components that strengthen the model. The figure follows Figure 3 (right) of Gu and Dao [17]. Specifically, a Mamba block contains two branches. The left branch is SSM-related: it first applies a green linear projection that maps each time step of the input sequence to a larger number of feature channels, followed sequentially by a blue one-dimensional convolution block (Conv), a nonlinear activation σ, and finally the SSM block. The SSM block takes the input sequence and the model parameters A, B, C as inputs for its computation. The right branch is a skip-connection [20] branch, consisting of a linear projection followed by a nonlinear activation. The results of the two branches are multiplied and linearly transformed into the final output.
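The sequence transformation performed by the blue SSM block, referenced as equation (1) in the caption, is the standard linear state-space formulation; the following is a sketch of that generic form (notation assumed, not copied from the paper):

```latex
% Generic linear state-space model (continuous form) and its discretized recurrence,
% as used in S4/Mamba-style layers; A, B, C match the parameters named in the caption.
\begin{aligned}
h'(t) &= A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t) \\
h_k &= \bar{A}\,h_{k-1} + \bar{B}\,x_k, \qquad y_k = C\,h_k
\end{aligned}
```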
State-space models are accurate and efficient neural operators for dynamical systems
  • Preprint
  • File available

September 2024 · 456 Reads · Zheyuan Hu · Qianli Shen · [...] · George Em Karniadakis
Physics-informed machine learning (PIML) has emerged as a promising alternative to classical methods for predicting dynamical systems, offering faster and more generalizable solutions. However, existing models, including recurrent neural networks (RNNs), transformers, and neural operators, face challenges such as long-time integration, long-range dependencies, chaotic dynamics, and extrapolation, to name a few. To this end, this paper introduces state-space models implemented in Mamba for accurate and efficient dynamical system operator learning. Mamba addresses the limitations of existing architectures by dynamically capturing long-range dependencies and enhancing computational efficiency through reparameterization techniques. To test Mamba extensively and compare it against 11 other baselines, we introduce several strict extrapolation testbeds that go beyond the standard interpolation benchmarks. We demonstrate Mamba's superior performance in both interpolation and challenging extrapolation tasks. Mamba consistently ranks among the top models while maintaining the lowest computational cost and exceptional extrapolation capabilities. Moreover, we demonstrate Mamba's strong performance in a real-world application in quantitative systems pharmacology, assessing the efficacy of drugs against tumor growth under limited-data scenarios. Taken together, our findings highlight Mamba's potential as a powerful tool for advancing scientific machine learning in dynamical systems modeling. (The code will be available at https://github.com/zheyuanhu01/State_Space_Model_Neural_Operator upon acceptance.)
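To make the two-branch layout described in the Figure 1 caption concrete, here is a minimal PyTorch-style sketch (linear expansion, causal Conv1d, activation, then SSM on the left; a gated skip branch on the right). The simplified diagonal, non-selective SSM scan and all dimension choices are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SimplifiedMambaBlock(nn.Module):
    """Two-branch block sketch following the Figure 1 caption (illustrative only)."""
    def __init__(self, d_model: int, d_state: int = 16, expand: int = 2, conv_kernel: int = 4):
        super().__init__()
        d_inner = expand * d_model
        self.in_proj_ssm = nn.Linear(d_model, d_inner)    # green linear projection (left branch)
        self.in_proj_gate = nn.Linear(d_model, d_inner)   # linear projection of the skip branch
        self.conv = nn.Conv1d(d_inner, d_inner, conv_kernel,
                              padding=conv_kernel - 1, groups=d_inner)  # causal depthwise Conv
        self.act = nn.SiLU()                               # nonlinear activation sigma
        # Fixed diagonal SSM parameters A, B, C -- a simplification of Mamba's
        # input-dependent (selective) parameterization.
        self.A = nn.Parameter(-torch.rand(d_inner, d_state))
        self.B = nn.Parameter(torch.randn(d_inner, d_state) * 0.1)
        self.C = nn.Parameter(torch.randn(d_inner, d_state) * 0.1)
        self.out_proj = nn.Linear(d_inner, d_model)

    def ssm_scan(self, x):  # x: (batch, length, d_inner)
        A_bar = torch.exp(self.A)                          # crude discretization, illustrative only
        h = torch.zeros(x.size(0), *self.A.shape, device=x.device)
        ys = []
        for t in range(x.size(1)):
            h = A_bar * h + self.B * x[:, t].unsqueeze(-1)  # h_t = A_bar h_{t-1} + B x_t
            ys.append((h * self.C).sum(-1))                 # y_t = C h_t
        return torch.stack(ys, dim=1)

    def forward(self, x):  # x: (batch, length, d_model)
        left = self.in_proj_ssm(x).transpose(1, 2)                 # (batch, d_inner, length) for Conv1d
        left = self.conv(left)[..., : x.size(1)].transpose(1, 2)   # trim padding to keep causality
        left = self.ssm_scan(self.act(left))                       # Conv -> sigma -> SSM
        gate = self.act(self.in_proj_gate(x))                      # skip-connection branch
        return self.out_proj(left * gate)                          # multiply branches, final linear map
```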


Figure B.1: CIFAR100, IPC=50: Inner Loss and gradient norm for Neumann
Figure E.1: Visualization of the 2D latent solutions for the Burgers, Allen-Cahn, and KdV equations. The observed data are sampled on an 8 × 8 grid, denoted by white points.
Memory-Efficient Gradient Unrolling for Large-Scale Bi-level Optimization

June 2024 · 32 Reads

Bi-level optimization (BO) has become a fundamental mathematical framework for addressing hierarchical machine learning problems. As deep learning models continue to grow in size, the demand for scalable bi-level optimization solutions has become increasingly critical. Traditional gradient-based bi-level optimization algorithms, due to their inherent characteristics, are ill-suited to meet the demands of large-scale applications. In this paper, we introduce Forward Gradient Unrolling with Forward Gradient, abbreviated as (FG)²U, which achieves an unbiased stochastic approximation of the meta gradient for bi-level optimization. (FG)²U circumvents the memory and approximation issues associated with classical bi-level optimization approaches, and delivers significantly more accurate gradient estimates than existing large-scale bi-level optimization approaches. Additionally, (FG)²U is inherently designed to support parallel computing, enabling it to effectively leverage large-scale distributed computing systems to achieve significant computational efficiency. In practice, (FG)²U and other methods can be strategically placed at different stages of the training process to achieve a more cost-effective two-phase paradigm. Further, (FG)²U is easy to implement within popular deep learning frameworks, and can be conveniently adapted to address more challenging zeroth-order bi-level optimization scenarios. We provide a thorough convergence analysis and a comprehensive practical discussion for (FG)²U, complemented by extensive empirical evaluations, showcasing its superior performance in diverse large-scale bi-level optimization tasks.
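As an illustration of the core idea (not the paper's implementation): forward-mode differentiation through the unrolled inner optimization yields a directional derivative of the outer objective, and scaling a random direction by that directional derivative gives an unbiased estimate of the meta-gradient. The toy ridge-regression inner problem and all names below are assumptions for the sketch.

```python
import torch
from torch.func import jvp

def inner_grad(w, lam, X, y):
    # Analytic gradient of the inner objective ||Xw - y||^2 / n + lam * ||w||^2
    n = X.shape[0]
    return 2.0 * X.T @ (X @ w - y) / n + 2.0 * lam * w

def unrolled_outer_loss(lam, w0, X_tr, y_tr, X_val, y_val, steps=50, lr=0.1):
    # Unroll inner gradient descent, then evaluate the outer (validation) loss.
    w = w0
    for _ in range(steps):
        w = w - lr * inner_grad(w, lam, X_tr, y_tr)
    return ((X_val @ w - y_val) ** 2).mean()

def forward_gradient_meta_grad(lam, w0, X_tr, y_tr, X_val, y_val):
    # Sample a random tangent v, push it through the unroll with forward mode (JVP),
    # and return (grad . v) * v, an unbiased estimate of the meta-gradient w.r.t. lam.
    v = torch.randn_like(lam)
    f = lambda l: unrolled_outer_loss(l, w0, X_tr, y_tr, X_val, y_val)
    _, directional = jvp(f, (lam,), (v,))
    return directional * v

# Toy usage (shapes are illustrative)
X_tr, y_tr = torch.randn(64, 10), torch.randn(64)
X_val, y_val = torch.randn(32, 10), torch.randn(32)
lam, w0 = torch.tensor(0.1), torch.zeros(10)
print(forward_gradient_meta_grad(lam, w0, X_tr, y_tr, X_val, y_val))
```

Because only forward-mode products are propagated, nothing from the unrolled trajectory has to be stored for a backward pass, which is the memory advantage the abstract alludes to.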



Performance on Llama3-70B at 25% layer pruning ratio.
Performance on Mixtral-8x7B at 25% layer pruning ratio.
Performance on Mixtral-8x7B at 40% layer pruning ratio.
Performance on Llama2-13B at 40% layer pruning ratio. Both SliceGPT and LLM-Pruner are applied with a desired 25% sparsity ratio according to their implementation.
Generated examples from the pruned Llama3-70B using FinerCut. The underlined texts denote the input prompts.
FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models

May 2024 · 210 Reads

Overparametrized transformer networks are the state-of-the-art architecture for Large Language Models (LLMs). However, such models contain billions of parameters, making large-scale compute a necessity while raising environmental concerns. To address these issues, we propose FinerCut, a new form of fine-grained layer pruning, which, in contrast to prior work at the transformer block level, considers all self-attention and feed-forward network (FFN) layers within blocks as individual pruning candidates. FinerCut prunes layers whose removal causes minimal alteration to the model's output -- contributing to a new, lean, interpretable, and task-agnostic pruning method. Tested across 9 benchmarks, our approach retains 90% of Llama3-8B's performance with 25% of layers removed, and 95% of Llama3-70B's performance with 30% of layers removed, all without fine-tuning or post-pruning reconstruction. Strikingly, we observe intriguing results with FinerCut: 42% (34 out of 80) of the self-attention layers in Llama3-70B can be removed while preserving 99% of its performance -- without additional fine-tuning after removal. Moreover, FinerCut provides a tool to inspect the types and locations of pruned layers, allowing one to observe interesting pruning behaviors. For instance, we observe a preference for pruning self-attention layers, often in deeper consecutive decoder layers. We hope our insights inspire future efficient LLM architecture designs.
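The general idea, removing whichever sub-layers (attention-like or FFN-like) change the model's output the least, can be sketched on a toy residual stack. The scoring criterion and toy model below are illustrative assumptions, not the paper's exact objective or implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, L, n_layers = 32, 16, 6

# Toy residual stack standing in for alternating self-attention and FFN sub-layers.
sublayers = nn.ModuleList(
    nn.Sequential(nn.LayerNorm(d), nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
    for _ in range(n_layers)
)

def run(x, skip: set = frozenset()):
    # Residual forward pass; sub-layers whose index is in `skip` are bypassed.
    for i, layer in enumerate(sublayers):
        if i not in skip:
            x = x + layer(x)
    return x

x = torch.randn(4, L, d)
with torch.no_grad():
    baseline = run(x)
    # Score each sub-layer by how much removing it perturbs the final output
    # (1 - cosine similarity); lower scores mark better pruning candidates.
    scores = {
        i: 1.0 - torch.cosine_similarity(run(x, {i}).flatten(), baseline.flatten(), dim=0).item()
        for i in range(n_layers)
    }
print(sorted(scores.items(), key=lambda kv: kv[1]))  # cheapest-to-remove sub-layers first
```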



Robustness of models trained with different dropout methods to random image rotations at test time.
Out-Of-Distribution (OOD) detection task using uncertainty estimates from different methods.
GFlowOut: Dropout with Generative Flow Networks

October 2022 · 173 Reads

Bayesian inference offers principled tools to tackle many critical problems with modern neural networks, such as poor calibration, poor generalization, and data inefficiency. However, scaling Bayesian inference to large architectures is challenging and requires restrictive approximations. Monte Carlo Dropout has been widely used as a relatively cheap way to perform approximate inference and estimate uncertainty with deep neural networks. Traditionally, the dropout mask is sampled independently from a fixed distribution. Recent works show that the dropout mask can be viewed as a latent variable, which can be inferred with variational inference. These methods face two important challenges: (a) the posterior distribution over masks can be highly multi-modal, which can be difficult to approximate with standard variational inference, and (b) it is not trivial to fully utilize sample-dependent information and correlation among dropout masks to improve posterior estimation. In this work, we propose GFlowOut to address these issues. GFlowOut leverages the recently proposed probabilistic framework of Generative Flow Networks (GFlowNets) to learn the posterior distribution over dropout masks. We empirically demonstrate that GFlowOut results in predictive distributions that generalize better to out-of-distribution data, and provides uncertainty estimates which lead to better performance in downstream tasks.
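For context on the Monte Carlo Dropout baseline the abstract refers to (not GFlowOut itself): keeping dropout active at test time and averaging several stochastic forward passes gives a cheap predictive mean and uncertainty estimate. The toy model below is an assumption for illustration.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(p=0.2), nn.Linear(64, 1))

def mc_dropout_predict(model, x, n_samples: int = 50):
    # Keep dropout layers stochastic at test time by staying in train() mode,
    # then average multiple forward passes for a predictive mean and variance.
    model.train()
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.var(dim=0)

x = torch.randn(8, 16)
mean, var = mc_dropout_predict(model, x)
print(mean.shape, var.shape)  # per-example predictive mean and uncertainty
```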

Citations (2)


... Li et al., 2021), DeepONet (Goswami et al., 2022; Goswami et al., 2023; Lu et al., 2021), Latent Dynamic Networks (LDNets) (Regazzoni et al., 2024), Transformers (Hemmasian and Barati Farimani, 2023; Solera-Rico et al., 2024; Y. Wang et al., 2024; Zhao et al., 2024), and Mamba (Hu et al., 2024) (although only developed very recently for modeling of dynamical systems). ...

Reference:

Data-driven multifidelity surrogate models for rocket engines injector design
State-Space Models are Accurate and Efficient Neural Operators for Dynamical Systems
  • Citing Preprint
  • January 2024

... Besides, we obtained both the original captions from the Style and DiffusionDB datasets and the trigger prompts created by our methods. We used the normal version of these prompts, and the version modified by VA3 [41] to optimize replication, to trigger the base SD model without any fine-tuning. Fig. 9 shows that none of these operations managed to cause copyright infringement, verifying the necessity of attacking; ...

VA3: Virtually Assured Amplification Attack on Probabilistic Copyright Protection for Text-to-Image Generative Models
  • Citing Conference Paper
  • June 2024