## About

Publications: 258

Reads: 65,075

Citations: 25,579

Education

March 2003 - June 2006

August 2001 - February 2003

August 1997 - June 2001

## Publications

We study zeroth-order optimization for convex functions where we further assume that function evaluations are unavailable. Instead, one only has access to a comparison oracle, which, given two points x and y, returns a single bit of information indicating which point has the larger function value, f(x) or f(y). By treating the gradient as an unknown sign...
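As a rough illustration of the comparison-oracle setting described above, the sketch below wraps a function so that callers only ever see one bit per query. The names `make_comparison_oracle` and `sphere`, the `+1`/`-1` encoding, and the tie-breaking rule are assumptions made for this example; they are not taken from the paper.

```python
from typing import Callable, Sequence

Point = Sequence[float]


def make_comparison_oracle(f: Callable[[Point], float]) -> Callable[[Point, Point], int]:
    """Wrap f so that callers receive one bit per query and never a value of f."""
    def oracle(x: Point, y: Point) -> int:
        # +1 means f(x) > f(y); -1 means f(x) <= f(y).
        # Sending ties to -1 is an arbitrary choice for this sketch.
        return 1 if f(x) > f(y) else -1
    return oracle


def sphere(x: Point) -> float:
    # Hypothetical convex test function: squared Euclidean norm.
    return sum(v * v for v in x)


oracle = make_comparison_oracle(sphere)
bit = oracle([1.0, 2.0], [0.5, 0.5])  # compares f(x) = 5.0 with f(y) = 0.5
```

An optimizer in this setting would interact with `oracle` only, which is what separates it from ordinary zeroth-order methods that query function values.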

Mean-field games arise in various fields, including economics, engineering, and machine learning. They study strategic decision-making in large populations where the individuals interact via specific mean-field quantities. The games’ ground metrics and running costs are of essential importance but are often unknown or only partially known. This pap...

A promising trend in deep learning replaces traditional feedforward networks with implicit networks. Unlike traditional networks, implicit networks solve a fixed point equation to compute inferences. Solving for the fixed point varies in complexity, depending on provided data and an error tolerance. Importantly, implicit networks may be trained wit...
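To make the fixed-point view of inference concrete, here is a minimal scalar sketch. It assumes a toy layer `z = tanh(w*z + u*x)` with `|w| < 1`, so the update is a contraction; the function name, the weights, and the stopping rule are illustrative choices, not the architecture from the paper.

```python
import math


def implicit_layer(x: float, w: float = 0.5, u: float = 1.0,
                   tol: float = 1e-8, max_iter: int = 1000):
    """Iterate z <- tanh(w*z + u*x) until successive iterates differ by less than tol.

    Returns (z, iterations_used). With |w| < 1 the map is a contraction in z,
    so the iteration converges from any starting point.
    """
    z = 0.0
    for k in range(1, max_iter + 1):
        z_next = math.tanh(w * z + u * x)
        if abs(z_next - z) < tol:
            return z_next, k
        z = z_next
    return z, max_iter


z_tight, iters_tight = implicit_layer(0.3, tol=1e-8)
z_loose, iters_loose = implicit_layer(0.3, tol=1e-2)
```

The iteration count depends on both the input `x` and the tolerance `tol`, which mirrors the remark that the cost of solving for the fixed point varies with the data and the error tolerance.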

Recent advances in distributed optimization and learning have shown that communication compression is one of the most effective means of reducing communication. While there have been many results on convergence rates under communication compression, a theoretical lower bound is still missing. Analyses of algorithms with communication compression ha...

Our goal is to solve the large-scale linear programming (LP) formulation of Optimal Transport (OT) problems efficiently. Our key observations are: (i) the primal solutions of the LP problems are theoretically very sparse; (ii) the cost function is usually equipped with good geometric properties. The former motivates us to eliminate the majority of...

We show how to convert the problem of minimizing a convex function over the standard probability simplex to that of minimizing a nonconvex function over the unit sphere. We prove the landscape of this nonconvex problem is benign, i.e. every stationary point is either a strict saddle or a global minimizer. We exploit the Riemannian manifold structur...
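One standard way to pass between the unit sphere and the probability simplex is the elementwise square `x = z * z`. The abstract does not state which map is used, so treat this parametrization, and the helper names below, as assumptions. The sketch only checks that a unit vector is sent to a point on the simplex.

```python
import math
import random


def sphere_to_simplex(z):
    """Map a unit vector z to x = z*z (elementwise).

    Each entry is nonnegative, and the entries sum to ||z||^2 = 1.
    """
    return [v * v for v in z]


def random_unit_vector(n, rng):
    # Normalize a Gaussian sample to obtain a point on the unit sphere.
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


rng = random.Random(0)
z = random_unit_vector(5, rng)
x = sphere_to_simplex(z)
```

Under this assumed map, minimizing `f(x)` over the simplex becomes minimizing `f(z * z)` over the sphere, which is in general nonconvex in `z` even when `f` is convex.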

Since its invention in 2014, the Adam optimizer has received tremendous attention. On one hand, it has been widely used in deep learning and many variants have been proposed, while on the other hand their theoretical convergence properties remain a mystery. It is far from satisfactory in the sense that some studies require strong assumptions a...

Inverse problems consist of recovering a signal from a collection of noisy measurements. These problems can often be cast as feasibility problems; however, additional regularization is typically necessary to ensure accurate and stable recovery with respect to data perturbations. Hand-chosen analytic regularization can yield desirable theoretical gu...

A decentralized algorithm is a form of computation that achieves a global goal through local dynamics that rely on low-cost communication between directly connected agents. On large-scale optimization tasks involving distributed datasets, decentralized algorithms have shown strong, sometimes superior, performance over distributed algorithms with a...

Learned Iterative Shrinkage-Thresholding Algorithm (LISTA) introduces the concept of unrolling an iterative algorithm and training it like a neural network. It has had great success on sparse recovery. In this paper, we show that adding momentum to intermediate variables in the LISTA network achieves a better convergence rate and, in particular, th...

Decentralized SGD is an emerging training method for deep learning known for requiring much less (and thus faster) communication per iteration; it relaxes the averaging step in parallel SGD to inexact averaging. The less exact the averaging is, however, the more total iterations the training needs. Therefore, the key to making decentralized SG...

Robust principal component analysis (RPCA) is a critical tool in modern machine learning, which detects outliers in the task of low-rank matrix reconstruction. In this paper, we propose a scalable and learnable non-convex approach for high-dimensional RPCA problems, which we call Learned Robust PCA (LRPCA). LRPCA is highly efficient, and its free p...

Federated Learning (FL) is popular for communication-efficient learning from distributed data. To utilize data at different clients without moving them to the cloud, algorithms such as Federated Averaging (FedAvg) have adopted a computation-then-aggregation model, in which multiple local updates are performed using local data before aggregation...

This paper considers the problem of solving a special quartic–quadratic optimization problem with a single sphere constraint, namely, finding a global and local minimizer of \(\frac{1}{2}\mathbf {z}^{*}A\mathbf {z}+\frac{\beta }{2}\sum _{k=1}^{n}|z_{k}|^{4}\) such that \(\Vert \mathbf {z}\Vert _{2}=1\). This problem spans multiple domains including...

This paper targets developing algorithms for solving distributed machine learning problems in a communication-efficient fashion. A class of new stochastic gradient descent (SGD) approaches have been developed, which can be viewed as the stochastic generalization to the recently developed lazily aggregated gradient (LAG) method, justifying the name LAS...

Stochastic compositional optimization generalizes classic (non-compositional) stochastic optimization to the minimization of compositions of functions. Each composition may introduce an additional expectation. The series of expectations may be nested. Stochastic compositional optimization is gaining popularity in applications such as reinforcement...

Stochastic nested optimization, including stochastic compositional, min-max and bilevel optimization, is gaining popularity in many machine learning applications. While the three problems share the nested structure, existing works often treat them separately, and thus develop problem-specific algorithms and their analyses. Among various exciting de...

Many iterative methods in applied mathematics can be thought of as fixed-point iterations, and such algorithms are usually analyzed analytically, with inequalities. In this paper, we present a geometric approach to analyzing contractive and nonexpansive fixed point iterations with a new tool called the scaled relative graph. The SRG provides a corr...

Systems of interacting agents can often be modeled as contextual games, where the context encodes additional information, beyond the control of any agent (e.g. weather for traffic and fiscal policy for market economies). In such systems, the most likely outcome is given by a Nash equilibrium. In many practical settings, only game equilibria are obs...

Communication overhead hinders the scalability of large-scale distributed training. Gossip SGD, where each node averages only with its neighbors, is more communication-efficient than the prevalent parallel SGD. However, its convergence rate is inversely proportional to the quantity $1-\beta$, which measures the network connectivity. On large and sparse...
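The neighbor-only averaging that gossip methods rely on can be sketched on a ring of nodes. The ring topology, the equal weights of 1/3, and the function names are assumptions chosen for illustration; the gradient step of gossip SGD is left out so that only the communication pattern is visible.

```python
def gossip_step(values):
    """One gossip round on a ring: each node averages itself with its two neighbors."""
    n = len(values)
    return [
        (values[(i - 1) % n] + values[i] + values[(i + 1) % n]) / 3.0
        for i in range(n)
    ]


def spread(values):
    # Disagreement across nodes: gap between the largest and smallest local value.
    return max(values) - min(values)


x = [float(i) for i in range(8)]  # hypothetical local values on 8 nodes
mean_before = sum(x) / len(x)
spread_before = spread(x)
for _ in range(50):
    x = gossip_step(x)
```

Each round preserves the network-wide average while shrinking the disagreement between nodes. How quickly the disagreement shrinks depends on the connectivity of the graph, which is the role the abstract assigns to $1-\beta$: a sparser or larger ring needs more rounds to reach the same level of consensus.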

In this paper, we demonstrate the power of a widely used stochastic estimator based on moving average (SEMA) on a range of stochastic non-convex optimization problems, which only requires **a general unbiased stochastic oracle**. We analyze various stochastic methods (existing or newly proposed) based on the **variance recursion property** of S...

When applying a stochastic/incremental algorithm, one must choose the order in which to draw samples. Among the most popular approaches are cyclic sampling and random reshuffling, which are empirically faster and more cache-friendly than uniform i.i.d. sampling. Cyclic sampling draws the samples in a fixed, cyclic order, which is less robust than reshuffling t...
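The three sampling orders mentioned above can be written as small index generators. This is a minimal sketch; the function names and the `epochs` parameter are choices made for this example.

```python
import random


def cyclic_order(n, epochs):
    """Visit indices 0..n-1 in the same fixed order every epoch."""
    for _ in range(epochs):
        yield from range(n)


def random_reshuffling(n, epochs, rng):
    """Draw a fresh permutation each epoch; every index appears exactly once per epoch."""
    for _ in range(epochs):
        perm = list(range(n))
        rng.shuffle(perm)
        yield from perm


def uniform_iid(n, epochs, rng):
    """Draw each index independently and uniformly; repeats within an epoch are possible."""
    for _ in range(n * epochs):
        yield rng.randrange(n)


rng = random.Random(0)
cyc = list(cyclic_order(3, 2))
rr = list(random_reshuffling(4, 3, rng))
iid = list(uniform_iid(4, 3, rng))
```

Cyclic sampling and random reshuffling both touch every sample once per epoch, whereas uniform i.i.d. sampling may skip some samples and repeat others within the same number of draws.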

The scale of deep learning nowadays calls for efficient distributed training algorithms. Decentralized momentum SGD (DmSGD), in which each node averages only with its neighbors, is more communication-efficient than vanilla parallel momentum SGD, which incurs a global average across all computing nodes. On the other hand, large-batch training has be...

Learning to optimize (L2O) is an emerging approach that leverages machine learning to develop optimization methods, aiming at reducing the laborious iterations of hand engineering. It automates the design of an optimization method based on its performance on a set of training problems. This data-driven procedure generates methods that can efficient...

Policy optimization methods remain a powerful workhorse in empirical Reinforcement Learning (RL), with a focus on neural policies that can easily reason over complex and continuous state and/or action spaces. Theoretical understanding of strategic exploration in policy-based methods with non-linear function approximation, however, is largely missin...

We consider the zeroth-order optimization problem in the huge-scale setting, where the dimension of the problem is so large that performing even basic vector operations on the decision variables is infeasible. In this paper, we propose a novel algorithm, coined ZO-BCD, that exhibits favorable overall query complexity and has a much smaller per-iter...

Stochastic bilevel optimization generalizes the classic stochastic optimization from the minimization of a single objective to the minimization of an objective function that depends on the solution of another optimization problem. Recently, stochastic bilevel optimization is regaining popularity in emerging machine learning applications such as hyper-...

This paper develops algorithms for decentralized machine learning over a network, where data are distributed, computation is localized, and communication is restricted between neighbors. A line of recent research in this area focuses on improving both computation and communication complexities. The methods SSDA and MSDA (Scaman et al., 2017) hav...

Primal–dual hybrid gradient (PDHG) and alternating direction method of multipliers (ADMM) are popular first-order optimization methods. They are easy to implement and have diverse applications. As first-order methods, however, they are sensitive to problem conditions and can struggle to reach the desired accuracy. To improve their performance, rese...

The augmented Lagrangian method (ALM) is one of the most useful methods for constrained optimization. Its convergence has been well established under convexity assumptions, smoothness assumptions, or both. ALM may experience oscillations and divergence when facing nonconvexity and nonsmoothness simultaneously. In this paper, we consi...

Stochastic gradient descent (SGD) has taken the stage as the primary workhorse for large-scale machine learning. It is often used with its adaptive variants such as AdaGrad, Adam, and AMSGrad. This paper proposes an adaptive stochastic gradient descent method for distributed machine learning, which can be viewed as the communication-adaptive counte...

Federated learning (FL) is a recently proposed distributed machine learning paradigm dealing with distributed and private data sets. Based on the data partition pattern, FL is often categorized into horizontal, vertical, and hybrid settings. Despite the fact that many works have been developed for the first two approaches, the hybrid FL setting (wh...

In this paper (originally titled "Momentum SGD with Robust Weighting For Imbalanced Classification"), we present a simple yet effective method (ABSGD) for addressing the data imbalance issue in deep learning. Our method is a simple modification to momentum SGD where we leverage an attentional mechanism to assign an individual importanc...

We study derivative-free optimization for convex functions where we further assume that function evaluations are unavailable. Instead, one only has access to a comparison oracle, which, given two points $x$ and $y$, returns a single bit of information indicating which point has the larger function value, $f(x)$ or $f(y)$, with some probability of b...

In this paper, we study the communication and (sub)gradient computation costs in distributed optimization. We present two algorithms based on the framework of the accelerated penalty method with increasing penalty parameters. Our first algorithm is for smooth distributed optimization and it obtains the near-optimal $O(\sqrt{\frac{L}{\epsilon(1-\si...

We present a new framework, called adversarial projections, for solving inverse problems by learning to project onto manifolds. Our goal is to recover a signal from a collection of noisy measurements. Traditional methods for this task often minimize the sum of a regularization term and an expression that measures compliance with measurements (...

Mean-field games arise in various fields including economics, engineering, and machine learning. They study strategic decision making in large populations where the individuals interact via certain mean-field quantities. The ground metrics and running costs of the games are of essential importance but are often unknown or only partially known. In t...

SGD with momentum (SGDM) has been widely applied in many machine learning tasks, and it is often applied with dynamic stepsizes and momentum weights tuned in a stagewise manner. Despite its empirical advantage over SGD, the role of momentum is still unclear in general since previous analyses of SGDM either provide worse convergence bounds than t...

Horizontal federated learning (FL) handles multi-client data that share the same set of features, and vertical FL trains a better predictor that combines all the features from different clients. This paper targets solving vertical FL in an asynchronous fashion, and develops a simple FL method. The new method allows each client to run stochastic grad...

Federated Learning (FL) has become a popular paradigm for learning from distributed data. To effectively utilize data at different devices without moving them to the cloud, algorithms such as Federated Averaging (FedAvg) have adopted a "computation then aggregation" (CTA) model, in which multiple local updates are performed using local data, be...

Many iterative methods in optimization are fixed-point iterations with averaged operators. As such methods converge at an O(1/k) rate with the constant determined by the averagedness coefficient, establishing small averagedness coefficients for operators is of broad interest. In this paper, we show that the averagedness coefficients of the composit...

We consider the problem of minimizing a high-dimensional objective function, which may include a regularization term, using (possibly noisy) evaluations of the function. Such optimization is also called derivative-free, zeroth-order, or black-box optimization. We propose a new **Z**eroth-**O**rder **R**egularized **O**pt...

We study how to use unsupervised learning for efficient exploration in reinforcement learning with rich observations generated from a small number of latent states. We present a novel algorithmic framework that is built upon two components: an unsupervised learning algorithm and a no-regret reinforcement learning algorithm. We show that our algorit...

Primal-Dual Hybrid Gradient (PDHG) and Alternating Direction Method of Multipliers (ADMM) are two widely used first-order optimization methods. They reduce a difficult problem to simple subproblems, so they are easy to implement and have many applications. As first-order methods, however, they are sensitive to problem conditions and can struggle t...

Many applications require repeatedly solving a certain type of optimization problem, each time with new (but similar) data. Data-driven algorithms can "learn to optimize" (L2O) with far fewer iterations than general-purpose optimization algorithms and at a similar cost per iteration. L2O algorithms are often derived from general-purpose algorithms,...

This paper targets solving distributed machine learning problems such as federated learning in a communication-efficient fashion. A class of new stochastic gradient descent (SGD) approaches have been developed, which can be viewed as the stochastic generalization to the recently developed lazily aggregated gradient (LAG) method, justifying the n...

The method of block coordinate gradient descent (BCD) has been a powerful method for large-scale optimization. This paper considers the BCD method that successively updates a series of blocks selected according to a Markov chain. This kind of block selection is neither i.i.d. random nor cyclic. On the other hand, it is a natural choice for some app...

The Scaled Relative Graph (SRG) by Ryu, Hannah, and Yin (arXiv:1902.09788, 2019) is a geometric tool that maps the action of a multi-valued nonlinear operator onto the 2D plane, used to analyze the convergence of a wide range of iterative methods. As the SRG includes the spectrum for linear operators, we can view the SRG as a generalization of the...

One of the key approaches to save samples when learning a policy for a reinforcement learning problem is to use knowledge from an approximate model such as its simulator. However, does knowledge transfer from approximate models always help to learn a better policy? Despite numerous empirical studies of transfer reinforcement learning, an answer to...

Many iterative methods in optimization are fixed-point iterations with averaged operators. As such methods converge at an $\mathcal{O}(1/k)$ rate with the constant determined by the averagedness coefficient, establishing small averagedness coefficients for operators is of broad interest. In this paper, we show that the averagedness coefficients of...

Despite the vast literature on Douglas-Rachford splitting (DRS), there has been very little work analyzing its behavior under pathologies. Most analyses assume a primal solution exists, a dual solution exists, and strong duality holds. When these assumptions are not met, i.e., under pathologies, the theory often breaks down and the empirical performance may degrade significa...

We propose XPipe, an efficient asynchronous pipeline model parallelism approach for multi-GPU DNN training. XPipe is designed to make use of multiple GPUs to concurrently and continuously train different parts of a DNN model. To improve GPU utilization and achieve high throughput, it splits a mini-batch into a set of micro-batches and allows the ov...

Background: To review and evaluate approaches to convolutional neural network (CNN) reconstruction for accelerated cardiac MR imaging in the real clinical context. Methods: Two CNN architectures, Unet and residual network (Resnet), were evaluated using quantitative and qualitative assessment by a radiologist. Four different loss functions were also...

This paper considers the problem of solving a special quartic-quadratic optimization problem with a single sphere constraint, namely, finding a global and local minimizer of $\frac{1}{2}\mathbf{z}^{*}A\mathbf{z}+\frac{\beta}{2}\sum_{k=1}^{n}\lvert z_{k}\rvert^{4}$ such that $\lVert\mathbf{z}\rVert_{2}=1$. This problem spans multiple domains includi...

This paper considers the problem of solving a special quartic-quadratic optimization problem with a single sphere constraint, namely, finding a global and local minimizer of $\frac{1}{2}\mathbf{z}^{*}A\mathbf{z}+\frac{\beta}{2}\sum_{k=1}^{n}\lvert z_{k}\rvert^{4}$ such that $\lVert\mathbf{z}\rVert_{2}=1$. This problem spans multiple domains including quantum mechanics and chemistry sciences and we investigate the geometric prope...

We design fast numerical methods for Hamilton–Jacobi equations in density space (HJD), which arise in optimal transport and mean field games. We propose an algorithm using a generalized Hopf formula in density space. The formula helps transform a problem from an optimal control problem in density space, which are constrained minimizations supp...

Many optimization algorithms converge to stationary points. When the underlying problem is nonconvex, they may get trapped at local minimizers and occasionally stagnate near saddle points. We propose the Run-and-Inspect Method, which adds an "inspect" phase to existing algorithms that helps escape from non-global stationary points. The inspection s...

Despite remarkable empirical success, the training dynamics of generative adversarial networks (GANs), which involve solving a minimax game using stochastic gradients, are still poorly understood. In this work, we analyze last-iterate convergence of simultaneous gradient descent (simGD) and its variants under the assumption of convex-concavity, guid...

We propose a new method for computing Dynamic Mode Decomposition (DMD) evolution matrices, which we use to analyze dynamical systems. Unlike the majority of existing methods, our approach is based on a variational formulation consisting of data alignment penalty terms and constitutive orthogonality constraints. Our method does not make any assumpti...