Ruoyu Sun
Meta · AI Research

About

69 Publications · 9,568 Reads · 1,624 Citations
Additional affiliations
August 2009 - May 2015
University of Minnesota
Position: Research Assistant

Publications (69)
Preprint
Critiques are important for enhancing the performance of Large Language Models (LLMs), enabling both self-improvement and constructive feedback for others by identifying flaws and suggesting improvements. However, evaluating the critique capabilities of LLMs presents a significant challenge due to the open-ended nature of the task. In this work, we...
Preprint
Despite their remarkable performance, the development of Large Language Models (LLMs) faces a critical challenge in scalable oversight: providing effective feedback for tasks where human evaluation is difficult or where LLMs outperform humans. While there is growing interest in using LLMs for critique, current approaches still rely on human annotat...
Preprint
Full-text available
Large language models rely on Supervised Fine-Tuning (SFT) to specialize in downstream tasks. Cross Entropy (CE) loss is the de facto choice in SFT, but it often leads to overfitting and limited output diversity due to its aggressive updates to the data distribution. This paper aims to address these issues by introducing the maximum entropy principl...
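For context, a minimal sketch of the token-level cross-entropy objective that the abstract names as the de facto SFT loss, with an optional generic entropy bonus to illustrate the diversity issue it raises; this is not the paper's proposed objective, and the shapes and coefficient are illustrative:

import numpy as np

def sft_loss(logits, targets, ent_coef=0.0):
    """Token-level CE over a sequence; ent_coef > 0 adds a generic entropy bonus."""
    # logits: (T, V) per-token scores; targets: (T,) gold token ids
    logits = logits - logits.max(axis=1, keepdims=True)            # stabilize softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(targets)), targets].mean()       # standard SFT loss
    probs = np.exp(log_probs)
    entropy = -(probs * log_probs).sum(axis=1).mean()
    return ce - ent_coef * entropy       # higher entropy is rewarded when ent_coef > 0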
Preprint
Full-text available
Recently, large language models (LLMs) have demonstrated remarkable capabilities in a wide range of tasks. Typically, an LLM is pre-trained on large corpora and subsequently fine-tuned on task-specific datasets. However, during fine-tuning, LLMs may forget the knowledge acquired in the pre-training stage, leading to a decline in general capabilities....
Preprint
Full-text available
We propose Adam-mini, an optimizer that achieves on-par or better performance than AdamW with a 45% to 50% smaller memory footprint. Adam-mini reduces memory by cutting down the number of learning rates in Adam: instead of assigning an individual learning rate for each parameter using $1/\sqrt{v}$, Adam-mini uses the average of $v$ within a pre-defined...
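A minimal sketch of the block-wise second-moment idea described in the abstract: each pre-defined parameter block keeps a single $v$ scalar (the mean of the squared gradients in the block) instead of one per parameter. The block partition and hyperparameters below are illustrative, not the paper's exact recipe:

import numpy as np

def adam_mini_step(params, grads, m, v_block, blocks, t,
                   lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One update where every block shares one second-moment scalar."""
    for b, idx in enumerate(blocks):                  # idx: parameter indices of block b
        g = grads[idx]
        m[idx] = beta1 * m[idx] + (1 - beta1) * g     # per-parameter first moment
        # single scalar per block: mean of squared gradients within the block
        v_block[b] = beta2 * v_block[b] + (1 - beta2) * np.mean(g ** 2)
        m_hat = m[idx] / (1 - beta1 ** t)
        v_hat = v_block[b] / (1 - beta2 ** t)
        params[idx] -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v_block

The memory saving comes from v_block having one entry per block rather than one per parameter.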
Preprint
Training Deep Neural Networks (DNNs) with adversarial examples often results in poor generalization to test-time adversarial data. This paper investigates this issue, known as adversarially robust generalization, through the lens of Rademacher complexity. Building upon the studies by Khim and Loh (2018) and Yin et al. (2019), numerous works have been...
Preprint
Pruning neural networks before training has received increasing interest due to its potential to reduce training time and memory. One popular method is to prune the connections based on a certain metric, but it is not entirely clear what metric is the best choice. Recent advances in neural tangent kernel (NTK) theory suggest that the training dynam...
Preprint
Full-text available
The past decade has witnessed a drastic increase in modern deep neural networks (DNNs) size, especially for generative adversarial networks (GANs). Since GANs usually suffer from high computational complexity, researchers have shown an increased interest in applying pruning methods to reduce the training and inference costs of GANs. Among different...
Preprint
Full-text available
Deep neural networks are vulnerable to adversarial attacks. Ideally, a robust model shall perform well on both the perturbed training data and the unseen perturbed test data. It is found empirically that fitting perturbed training data is not hard, but generalizing to perturbed test data is quite difficult. To better understand adversarial generali...
Preprint
Full-text available
Generative adversarial nets (GANs) have been remarkably successful at learning to sample from distributions specified by a given dataset, particularly if the given dataset is reasonably large compared to its dimensionality. However, given limited data, classical GANs have struggled, and strategies like output-regularization, data-augmentation, use...
Preprint
Modern neural networks are often quite wide, causing large memory and computation costs. It is thus of great interest to train a narrower network. However, training narrow neural nets remains a challenging task. We ask two theoretical questions: Can narrow networks have as strong expressivity as wide ones? If so, does the loss function exhibit a be...
Preprint
Full-text available
In adversarial machine learning, deep neural networks can fit the adversarial examples on the training dataset but have poor generalization ability on the test set. This phenomenon is called robust overfitting, and it can be observed when adversarially training neural nets on common datasets, including SVHN, CIFAR-10, CIFAR-100, and ImageNet. In th...
Preprint
Ever since Reddi et al. 2018 pointed out the divergence issue of Adam, many new variants have been designed to obtain convergence. However, vanilla Adam remains exceptionally popular and it works well in practice. Why is there a gap between theory and practice? We point out there is a mismatch between the settings of theory and practice: Reddi et a...
Article
Sparse neural networks have received increasing interest due to their small size compared to dense networks. Nevertheless, most existing works on neural network theory have focused on dense neural networks, and the understanding of sparse networks is very limited. In this paper we study the loss landscape of one-hidden-layer sparse networks. First,...
Article
Does a large width eliminate all suboptimal local minima for neural nets? An affirmative answer was given by a classic result published in 1995 for one-hidden-layer wide neural nets with a sigmoid activation function, but this result has not been extended to the multilayer case. Recently, it was shown that, with piecewise linear activations, subopt...
Preprint
Model-agnostic meta-learning (MAML) and its variants have become popular approaches for few-shot learning. However, due to the non-convexity of deep neural nets (DNNs) and the bi-level formulation of MAML, the theoretical properties of MAML with DNNs remain largely unknown. In this paper, we first prove that MAML with over-parameterized DNNs is gua...
Preprint
Differential privacy (DP) is an essential technique for privacy preservation. It was found that a large model trained for privacy preservation performs worse than a smaller model (e.g. ResNet50 performs worse than ResNet18). To better understand this phenomenon, we study high dimensional DP learning from the viewpoint of generalization. Theoretically,...
Preprint
Full-text available
Many existing federated learning (FL) algorithms are designed for supervised learning tasks, assuming that the local data owned by the clients are well labeled. However, in many practical situations, it could be difficult and expensive to acquire complete data labels. Federated semi-supervised learning (Fed-SSL) is an attractive solution for fully...
Preprint
Full-text available
The momentum acceleration technique is widely adopted in many optimization algorithms. However, how momentum affects the generalization performance of these algorithms is still not well understood theoretically. In this paper, we answer this question by analyzing the implicit bias of momentum-based optimization. We prove...
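A minimal sketch of the heavy-ball momentum update, the kind of iteration whose implicit bias is analyzed here (step size, momentum coefficient, and the toy objective are illustrative):

import numpy as np

def momentum_gd(grad_fn, x0, lr=0.1, beta=0.9, steps=100):
    x, buf = x0.astype(float), np.zeros_like(x0, dtype=float)
    for _ in range(steps):
        buf = beta * buf + grad_fn(x)   # exponentially weighted sum of past gradients
        x = x - lr * buf                # step along the momentum buffer
    return x

# usage on the toy quadratic f(x) = 0.5 ||x||^2, whose gradient is x
print(momentum_gd(lambda x: x, np.array([1.0, -2.0])))   # approaches the minimizer 0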
Article
When deploying machine learning models in the real world, system designers may wish that models exhibit certain shape behavior, i.e., model outputs follow a particular shape with respect to input features. Trends such as monotonicity, convexity, diminishing or accelerating returns are some of the desired shapes. The presence of these shapes makes the m...
Preprint
Full-text available
Recent theoretical works on over-parameterized neural nets have focused on two aspects: optimization and generalization. Many existing works that study optimization and generalization together are based on neural tangent kernel and require a very large width. In this work, we are interested in the following question: for a binary classification pro...
Article
Full-text available
Short-echo-time (TE) proton magnetic resonance spectroscopic imaging (MRSI) allows for simultaneously mapping a number of molecules in the brain, and has been recognized as an important tool for studying in vivo biochemistry in various neuroscience and disease applications. However, separation of the metabolite and macromolecule (MM) signals presen...
Preprint
Full-text available
The Barzilai-Borwein (BB) method has demonstrated great empirical success in nonlinear optimization. However, the convergence speed of the BB method is not well understood, as its known convergence rate for quadratic problems is much worse than that of the steepest descent (SD) method. Therefore, there is a large discrepancy between theory and pra...
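A minimal sketch of gradient descent with the BB step size on a convex quadratic, the setting discussed in the abstract (problem data, initial step, and iteration count are illustrative):

import numpy as np

def bb_gradient_descent(A, b, x0, steps=50, alpha0=1e-3):
    """Minimize 0.5 x^T A x - b^T x with the Barzilai-Borwein (BB1) step size."""
    x = x0.copy()
    g = A @ x - b
    alpha = alpha0
    for _ in range(steps):
        x_new = x - alpha * g
        g_new = A @ x_new - b
        s, y = x_new - x, g_new - g
        denom = s @ y
        if abs(denom) < 1e-16:          # essentially converged
            return x_new
        alpha = (s @ s) / denom         # BB1 step size from the secant condition
        x, g = x_new, g_new
    return x

A = np.diag([1.0, 10.0, 100.0])         # ill-conditioned quadratic
b = np.ones(3)
print(bb_gradient_descent(A, b, np.zeros(3)))   # approaches A^{-1} b = (1, 0.1, 0.01)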
Preprint
Full-text available
Understanding of GAN training is still very limited. One major challenge is its non-convex-non-concave min-max objective, which may lead to sub-optimal local minima. In this work, we perform a global landscape analysis of the empirical loss of GANs. We prove that a class of separable-GAN, including the original JS-GAN, has exponentially many bad ba...
Preprint
The nonconvex-concave min-max problem arises in many machine learning applications, including minimizing a pointwise maximum of a set of nonconvex functions and robust adversarial training of neural networks. A popular approach to solving this problem is the gradient descent-ascent (GDA) algorithm, which unfortunately can exhibit oscillation in case of non...
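A minimal sketch of the gradient descent-ascent iteration on a toy convex-concave saddle problem (objective and step size are illustrative; the nonconvex-concave case studied in the abstract is where GDA can oscillate):

def gda(x0, y0, eta=0.05, steps=200):
    """GDA on f(x, y) = 0.5*x**2 + x*y - 0.5*y**2: descend in x, ascend in y."""
    x, y = x0, y0
    for _ in range(steps):
        gx = x + y                       # df/dx
        gy = x - y                       # df/dy
        x, y = x - eta * gx, y + eta * gy
    return x, y

print(gda(1.0, 1.0))   # converges toward the saddle point (0, 0) for this f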
Preprint
Network pruning, or sparse networks, has a long history and practical significance in modern applications. A major concern for neural network training is that the non-convexity of the associated loss functions may cause a bad landscape. We focus on analyzing sparse linear networks generated from a weight pruning strategy. With no unrealistic assumptions, we...
Article
One of the major concerns for neural network training is that the nonconvexity of the associated loss functions may cause a bad landscape. The recent success of neural networks suggests that their loss landscape is not too bad, but what specific results do we know about the landscape? In this article, we review recent findings and results on the gl...
Preprint
One of the major concerns for neural network training is that the non-convexity of the associated loss functions may cause a bad landscape. The recent success of neural networks suggests that their loss landscape is not too bad, but what specific results do we know about the landscape? In this article, we review recent findings and results on the glo...
Preprint
Gradient-based meta-learning (GBML) with deep neural nets (DNNs) has become a popular approach for few-shot learning. However, due to the non-convexity of DNNs and the complex bi-level optimization in GBML, the theoretical properties of GBML with DNNs remain largely unknown. In this paper, we first develop a novel theoretical analysis to answer the...
Preprint
In distributed optimization, a popular technique to reduce communication is quantization. In this paper, we provide a general analysis framework for inexact gradient descent that is applicable to quantization schemes. We also propose a quantization scheme Double Encoding and Error Diminishing (DEED). DEED can achieve small communication complexity...
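A minimal sketch of inexact gradient descent with a generic uniform quantizer, the kind of scheme such an analysis framework covers; this is not the DEED scheme itself, and the level count is illustrative:

import numpy as np

def quantize(g, levels=16):
    """Uniformly quantize a gradient vector to `levels` values per entry."""
    scale = np.abs(g).max() + 1e-12
    codes = np.round((g / scale) * (levels - 1))     # integers a worker would transmit
    return codes / (levels - 1) * scale              # de-quantized gradient

def distributed_gd_step(x, local_grads, lr=0.1):
    """Average the quantized worker gradients, then take an inexact GD step."""
    g_hat = np.mean([quantize(g) for g in local_grads], axis=0)
    return x - lr * g_hat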
Preprint
Traditional landscape analysis of deep neural networks aims to show that no sub-optimal local minima exist in some appropriate sense. From this, one may be tempted to conclude that descent algorithms which escape saddle points will reach a good local minimum. However, basic optimization theory tells us that it is also possible for a descent algorith...
Preprint
When and why can a neural network be successfully trained? This article provides an overview of optimization algorithms and theory for training neural networks. First, we discuss the issue of gradient explosion/vanishing and the more general issue of undesirable spectrum, and then discuss practical solutions including careful initialization and nor...
Article
Random permutation is observed to be powerful for optimization algorithms: for multiblock ADMM (alternating direction method of multipliers), whereas the classical cyclic version diverges, the randomly permuted version converges in practice; for BCD (block coordinate descent), the randomly permuted version is typically faster than other versions. I...
Preprint
Does over-parameterization eliminate sub-optimal local minima for neural network problems? On one hand, existing positive results do not prove the claim, but often weaker claims. On the other hand, existing negative results have strong assumptions on the activation functions and/or data samples, causing a large gap with positive results. It was unc...
Preprint
It was recently found that the standard version of multi-block cyclic ADMM diverges. Interestingly, Gaussian Back Substitution ADMM (GBS-ADMM) and symmetric Gauss-Seidel ADMM (sGS-ADMM) do not have the divergence issue. Therefore, it seems that symmetrization can improve the performance of the classical cyclic order. In another recent work, cyclic...
Preprint
Generative adversarial nets (GANs) and variational auto-encoders have significantly improved our distribution modeling capabilities, showing promise for dataset augmentation, image-to-image translation and feature learning. However, to model high-dimensional distributions, sequential training and stacked architectures are common, increasing the num...
Preprint
Wide networks are often believed to have a nice optimization landscape, but what rigorous results can we prove? To understand the benefit of width, it is important to identify the difference between wide and narrow networks. In this work, we prove that from narrow to wide networks, there is a phase transition from having sub-optimal basins to no su...
Preprint
Full-text available
This paper studies a class of adaptive gradient-based momentum algorithms that update the search directions and learning rates simultaneously using past gradients. This class, which we refer to as the "Adam-type", includes popular algorithms such as Adam [1], AMSGrad [2], and AdaGrad [3]. Despite their popularity in training deep neural netw...
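A minimal sketch of the generic "Adam-type" update, where the search direction and learning rate are both built from past gradients; the branches below differ only in how the second-moment term is accumulated, and the AMSGrad branch is simplified (the original tracks $v_t$ and its running maximum separately). Hyperparameters are illustrative:

import numpy as np

def adam_type_step(x, g, m, v, variant="adam",
                   lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g                     # momentum-style direction
    if variant == "adam":                               # EMA of squared gradients
        v = beta2 * v + (1 - beta2) * g ** 2
    elif variant == "adagrad":                          # running sum of squared gradients
        v = v + g ** 2
    elif variant == "amsgrad":                          # keep v non-decreasing (simplified)
        v = np.maximum(v, beta2 * v + (1 - beta2) * g ** 2)
    x = x - lr * m / (np.sqrt(v) + eps)                 # per-coordinate adaptive step
    return x, m, v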
Preprint
One of the main difficulties in analyzing neural networks is the non-convexity of the loss function which may have many bad local minima. In this paper, we study the landscape of neural networks for binary classification tasks. Under mild assumptions, we prove that after adding one special neuron with a skip connection to the output, or one special...
Article
Full-text available
It is widely conjectured that training algorithms for neural networks are successful because all local minima lead to similar performance; for example, see (LeCun et al., 2015; Choromanska et al., 2015; Dauphin et al., 2014). Performance is typically measured in terms of two metrics: training performance and generalization performan...
Article
While Truncated Back-Propagation through Time (BPTT) is the most popular approach to training Recurrent Neural Networks (RNNs), it suffers from being inherently sequential (making parallelization difficult) and from truncating gradient flow between distant time-steps. We investigate whether Target Propagation (TPROP) style approaches can address th...
Article
Matrix factorization is a popular approach for large-scale matrix completion. The optimization formulation based on matrix factorization, even at huge scale, can be solved very efficiently by standard optimization algorithms in practice. However, due to the non-convexity caused by the factorization model, there is a limited theoretical un...
Article
Full-text available
This paper concerns the worst-case complexity of Gauss-Seidel method for solving a positive semi-definite linear system; or equivalently, that of cyclic coordinate descent (C-CD) for minimizing a convex quadratic function. The known provable complexity of C-CD can be $O(n)$ times slower than gradient descent (GD) and $O(n^2)$ times slower than rand...
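A minimal sketch of the equivalence noted in the abstract: one Gauss-Seidel sweep on $Ax=b$ is exactly one pass of cyclic coordinate descent with exact coordinate minimization on the quadratic $0.5\,x^\top Ax - b^\top x$ (problem data are illustrative):

import numpy as np

def gauss_seidel(A, b, x0, sweeps=100):
    """Cyclic coordinate descent on 0.5 x^T A x - b^T x == Gauss-Seidel on A x = b."""
    x = x0.copy()
    for _ in range(sweeps):
        for i in range(len(x)):                         # one cyclic pass over coordinates
            x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
    return x

A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])  # positive definite
b = np.array([1.0, 2.0, 3.0])
print(gauss_seidel(A, b, np.zeros(3)))                  # approaches the solution of A x = b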
Chapter
Utilising both key mathematical tools and state-of-the-art research results, this text explores the principles underpinning large-scale information processing over networks and examines the crucial interaction between big data and its associated communication, social and biological networks. Written by experts in the diverse fields of machine learn...
Article
Full-text available
The joint base station (BS) association and beamforming problem has been studied extensively in recent years, yet the computational complexity for even the simplest SISO case has not been fully characterized. In this paper, we consider the problems for an uplink SISO/SIMO cellular network under the max-min fairness criterion. We first prove that th...
Article
Full-text available
The iteration complexity of the block-coordinate descent (BCD) type algorithm has been under extensive investigation. It was recently shown that for convex problems the classical cyclic BCGD (block coordinate gradient descent) achieves an $\mathcal{O}(1/r)$ complexity ($r$ is the number of passes of all blocks). However, such bounds are at least li...
Article
Full-text available
Strong tracking filtering (STF) is a popular adaptive estimation method to effectively deal with state estimation for linear and nonlinear dynamic systems with inaccurate models or sudden changes of state. The key to STF is to use a time-variant fading factor, which can be evaluated based on the current measurement innovation in real time, to fo...
Conference Paper
Designing iterative algorithms for interference alignment (IA) is very useful for both practical and theoretical purposes. However, the existing works on iterative IA algorithms have not reported significant gains in terms of the DoF (Degrees of Freedom) over simple orthogonalization schemes. In this paper, we aim to design an iterative IA algorith...
Article
Full-text available
The alternating direction method of multipliers (ADMM) is now widely used in many fields, and its convergence was proved when two blocks of variables are alternately updated. It is computationally beneficial to extend the ADMM directly to the case of a multi-block convex minimization problem. Unfortunately, such an extension fails to converge when...
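A minimal sketch of the direct (cyclic) three-block extension of ADMM in scaled dual form, applied to a toy feasibility problem (minimize $0$ subject to $a_1x_1 + a_2x_2 + a_3x_3 = 0$); the columns below follow the kind of instance used in the known divergence counterexample, and the iteration count is illustrative:

import numpy as np

def three_block_admm(a1, a2, a3, iters=100):
    """Direct cyclic extension of ADMM to three scalar blocks (scaled dual form)."""
    x1 = x2 = x3 = 1.0
    u = np.zeros(3)                                        # scaled dual variable
    r = a1 * x1 + a2 * x2 + a3 * x3
    for _ in range(iters):
        x1 = -a1 @ (a2 * x2 + a3 * x3 + u) / (a1 @ a1)     # block 1 minimization
        x2 = -a2 @ (a1 * x1 + a3 * x3 + u) / (a2 @ a2)     # block 2 minimization
        x3 = -a3 @ (a1 * x1 + a2 * x2 + u) / (a3 @ a3)     # block 3 minimization
        r = a1 * x1 + a2 * x2 + a3 * x3                    # primal residual
        u = u + r                                          # dual ascent step
    return np.linalg.norm(r)

a1 = np.array([1.0, 1.0, 1.0])
a2 = np.array([1.0, 1.0, 2.0])
a3 = np.array([1.0, 2.0, 2.0])
print(three_block_admm(a1, a2, a3))   # the residual need not shrink: the direct
                                      # three-block extension can fail to converge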
Article
Vector space interference alignment (IA) is known to achieve high degrees of freedom (DoFs) with infinite independent channel extensions, but its performance is largely unknown for a finite number of possibly dependent channel extensions. In this paper, we consider a K-user Mt x Mr MIMO interference channel (IC) with an arbitrary number of channel e...
Article
Full-text available
Matrix factorization is a popular approach for large-scale matrix completion and constitutes a basic component of many solutions for Netflix Prize competition. In this approach, the unknown low-rank matrix is expressed as the product of two much smaller matrices so that the low-rank property is automatically fulfilled. The resulting optimization pr...
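A minimal sketch of the factorization formulation described above: the unknown low-rank matrix is written as $UV^\top$ and the observed entries are fit by plain gradient descent (rank, step size, and problem sizes are illustrative):

import numpy as np

def mc_factorization(M_obs, mask, rank=2, lr=0.01, steps=2000, seed=0):
    """Fit U V^T to the observed entries of M (mask == 1 where observed)."""
    rng = np.random.default_rng(seed)
    m, n = M_obs.shape
    U = 0.1 * rng.standard_normal((m, rank))
    V = 0.1 * rng.standard_normal((n, rank))
    for _ in range(steps):
        R = mask * (U @ V.T - M_obs)                 # residual on observed entries only
        U, V = U - lr * R @ V, V - lr * R.T @ U      # gradient step on both factors
    return U @ V.T

# usage: fit a random rank-2 matrix given roughly half of its entries
rng = np.random.default_rng(1)
M_true = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 15))
mask = (rng.random(M_true.shape) < 0.5).astype(float)
M_hat = mc_factorization(M_true * mask, mask)
print(np.linalg.norm(mask * (M_hat - M_true)))       # fit error on the observed entries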
Article
To cope with the growing demand for wireless data and to extend service coverage, future fifth-generation (5G) networks will increasingly rely on the use of low-power nodes to support massive connectivity in a diverse set of applications and services. To this end, virtualized and mass-scale cloud architectures are proposed as promising technologies...
Article
Full-text available
In a heterogeneous network (HetNet) with a large number of low power base stations (BSs), proper user-BS association and power control is crucial to achieving desirable system performance. In this paper, we systematically study the joint BS association and power allocation problem for a downlink cellular network under the max-min fairness criterion...
Article
Full-text available
To cope with the growing demand for wireless data and to extend service coverage, future 5G networks will increasingly rely on the use of low-power nodes to support massive connectivity in a diverse set of applications and services [1]. To this end, virtualized and mass-scale cloud architectures are proposed as promising technologies for 5G in whic...
Conference Paper
In a heterogeneous network (HetNet) with a large number of low power base stations (BSs), proper user-BS association and power control is crucial to achieving desirable system performance. In this paper, we consider the joint BS association and power allocation problem for an uplink cellular network under the max-min fairness criterion. We first pr...
Article
Full-text available
Consider an interference channel with a finite diversity order L (i.e., each channel matrix is a generic linear combination of L fixed basis matrices) whereby each user sends d independent data streams to its intended receiver. In this paper, we study the effect of finite L and d on the achievable Degrees of Freedom (DoF) via vector space interfere...
Conference Paper
We consider the problem of long-term transmit point (TP) association in a heterogeneous network where the CoMP (Coordinated Multiple Point) transmission scheme is supported. More specifically, the TP association is designed for a relatively long time period where only channel statistics are available, and each user equipment (UE) could be associated with...
Conference Paper
In this work, we consider the problem of partial cooperative transmission in a heterogeneous network with densely deployed base stations. The objective is to design downlink transmit strategies in a way that optimizes the system spectrum efficiency while using small size base station clusters for partial cooperative joint transmission. We propose a...
Conference Paper
We consider the problem of maximizing the minimum rate by joint BS assignment and power allocation in a cellular network. First, we show that the max-min fairness problem with fixed power vector can be solved in polynomial time. Second, we show that the joint design problem with the constraints that the SINR of each user is at least 0 dB is polynom...
Article
We consider the interference management problem in a multicell MIMO heterogeneous network. Within each cell there are a large number of distributed micro/pico base stations (BSs) that can be potentially coordinated for joint transmission. To reduce coordination overhead, we consider user-centric BS clustering so that each user is served by only a sm...
Conference Paper
Full-text available
We consider the multiuser beamforming problem for a multi-input single-output downlink channel that takes into account the errors in the channel state information at the transmitter side (CSIT). By modeling the CSIT errors as elliptically bounded uncertainty regions, this problem can be formulated as minimizing the transmission power subject to the...
