Wotao Yin
University of California, Los Angeles | UCLA · Department of Mathematics

PhD

About

258
Publications
65,075
Reads
25,579
Citations
Education
March 2003 - June 2006
Columbia University
Field of study
  • Operations Research
August 2001 - February 2003
Columbia University
Field of study
  • Operations Research
August 1997 - June 2001
Nanjing University
Field of study
  • Mathematics and Applied Mathematics

Publications (258)
Article
We study zeroth-order optimization for convex functions where we further assume that function evaluations are unavailable. Instead, one only has access to a comparison oracle, which, given two points x and y, returns a single bit of information indicating which point has the larger function value, f(x) or f(y). By treating the gradient as an unknown sign...
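A minimal sketch of the comparison-oracle idea: vote over random directions using only the one-bit comparison to form an approximate descent direction. The function names and the voting scheme below are illustrative, not the paper's algorithm.

# Sketch only: estimating a descent direction from a comparison oracle
# that reveals only sign(f(x) - f(y)). Names are illustrative.
import numpy as np

def comparison_oracle(f, x, y):
    """Return +1 if f(x) > f(y), else -1 (one bit of information)."""
    return 1.0 if f(x) > f(y) else -1.0

def estimate_direction(f, x, num_queries=50, radius=1e-2):
    """If f(x + r*u) > f(x), then -u is (noisily) a descent direction;
    average the signed random directions."""
    d = len(x)
    g = np.zeros(d)
    for _ in range(num_queries):
        u = np.random.randn(d)
        u /= np.linalg.norm(u)
        bit = comparison_oracle(f, x + radius * u, x)
        g += bit * u            # aligns, on average, with the gradient direction
    return g / num_queries

# Toy usage: minimize a convex quadratic using only comparisons.
f = lambda z: 0.5 * np.dot(z, z)
x = np.ones(10)
for _ in range(200):
    x -= 0.5 * estimate_direction(f, x)
print(round(f(x), 4))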
Article
Full-text available
Mean-field games arise in various fields, including economics, engineering, and machine learning. They study strategic decision-making in large populations where the individuals interact via specific mean-field quantities. The games’ ground metrics and running costs are of essential importance but are often unknown or only partially known. This pap...
Article
A promising trend in deep learning replaces traditional feedforward networks with implicit networks. Unlike traditional networks, implicit networks solve a fixed point equation to compute inferences. Solving for the fixed point varies in complexity, depending on provided data and an error tolerance. Importantly, implicit networks may be trained wit...
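A minimal sketch of implicit-network inference, assuming a contractive layer z = tanh(Wz + Ux + b) iterated to a fixed point; the layer form and names are illustrative, not the paper's architecture.

import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 5
W = 0.4 * rng.standard_normal((d, d)) / np.sqrt(d)   # scaled so the map is a contraction
U = rng.standard_normal((d, n))
b = rng.standard_normal(d)

def layer(z, x):
    return np.tanh(W @ z + U @ x + b)

def infer(x, tol=1e-6, max_iter=500):
    """Iterate to a fixed point; the work adapts to the data and the tolerance,
    as the abstract describes."""
    z = np.zeros(d)
    for k in range(max_iter):
        z_new = layer(z, x)
        if np.linalg.norm(z_new - z) < tol:
            return z_new, k + 1
        z = z_new
    return z, max_iter

x = rng.standard_normal(n)
z_star, iters = infer(x)
print(iters, np.linalg.norm(z_star - layer(z_star, x)))  # small residual at the fixed point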
Preprint
Recent advances in distributed optimization and learning have shown that communication compression is one of the most effective means of reducing communication. While there have been many results on convergence rates under communication compression, a theoretical lower bound is still missing. Analyses of algorithms with communication compression ha...
Article
Full-text available
Our goal is to solve the large-scale linear programming (LP) formulation of Optimal Transport (OT) problems efficiently. Our key observations are: (i) the primal solutions of the LP problems are theoretically very sparse; (ii) the cost function is usually equipped with good geometric properties. The former motivates us to eliminate the majority of...
Preprint
Full-text available
We show how to convert the problem of minimizing a convex function over the standard probability simplex to that of minimizing a nonconvex function over the unit sphere. We prove the landscape of this nonconvex problem is benign, i.e. every stationary point is either a strict saddle or a global minimizer. We exploit the Riemannian manifold structur...
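One way such a reduction can look, assuming the squared-variable map x_i = y_i^2 (this specific change of variables is an assumption, not necessarily the paper's construction): if ||y||_2 = 1 then x lies on the probability simplex, so minimizing g(y) = f(y ∘ y) over the sphere handles the simplex constraint implicitly.

# Sketch of the assumed simplex-to-sphere reparametrization x_i = y_i^2.
import numpy as np

def f(x):                      # convex objective on the simplex (toy example)
    c = np.arange(1.0, 1.0 + len(x))
    return np.dot(c, x) + np.dot(x, x)

def g(y):                      # pulled-back objective on the unit sphere
    return f(y * y)

def project_sphere(y):
    return y / np.linalg.norm(y)

def num_grad(fun, y, eps=1e-6):
    g0, grad = fun(y), np.zeros_like(y)
    for i in range(len(y)):
        e = np.zeros_like(y); e[i] = eps
        grad[i] = (fun(y + e) - g0) / eps
    return grad

# Projected gradient descent on the sphere via numerical gradients of g.
y = project_sphere(np.ones(5))
for _ in range(500):
    y = project_sphere(y - 0.05 * num_grad(g, y))
x = y * y
print(x.round(3), x.sum().round(3))   # a point on the probability simplex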
Preprint
Full-text available
Since its invention in 2014, the Adam optimizer has received tremendous attention. On one hand, it has been widely used in deep learning and many variants have been proposed; on the other hand, its theoretical convergence properties remain a mystery. The situation is far from satisfactory in the sense that some studies require strong assumptions a...
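For reference, the standard Adam update (Kingma and Ba, 2014) whose convergence the paper analyzes; the toy usage below is only an illustration.

import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad**2       # second-moment estimate
    m_hat = m / (1 - beta1**t)                  # bias corrections
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage on f(theta) = ||theta||^2 / 2, whose gradient is theta.
theta = np.ones(3); m = np.zeros(3); v = np.zeros(3)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, theta, m, v, t, lr=1e-2)
print(theta.round(4))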
Article
Full-text available
Inverse problems consist of recovering a signal from a collection of noisy measurements. These problems can often be cast as feasibility problems; however, additional regularization is typically necessary to ensure accurate and stable recovery with respect to data perturbations. Hand-chosen analytic regularization can yield desirable theoretical gu...
Preprint
Full-text available
A decentralized algorithm is a form of computation that achieves a global goal through local dynamics relying on low-cost communication between directly connected agents. On large-scale optimization tasks involving distributed datasets, decentralized algorithms have shown strong, sometimes superior, performance over distributed algorithms with a...
Preprint
Full-text available
Learned Iterative Shrinkage-Thresholding Algorithm (LISTA) introduces the concept of unrolling an iterative algorithm and training it like a neural network. It has had great success on sparse recovery. In this paper, we show that adding momentum to intermediate variables in the LISTA network achieves a better convergence rate and, in particular, th...
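A sketch of an unrolled LISTA-style iteration with a heavy-ball-style momentum term on the intermediate variable; where exactly the paper places the momentum, and how the weights are trained, is not shown here, so the weights below are untrained ISTA-like initializations.

import numpy as np

def soft_threshold(z, theta):
    return np.sign(z) * np.maximum(np.abs(z) - theta, 0.0)

def lista_momentum(b, W1, W2, thetas, gammas):
    """x_{k+1} = soft_threshold(W1 @ b + W2 @ x_k + gamma_k * (x_k - x_{k-1}), theta_k).
    In LISTA, W1, W2, thetas (and here gammas) would be learned per layer."""
    x_prev = np.zeros(W2.shape[1])
    x = np.zeros(W2.shape[1])
    for theta, gamma in zip(thetas, gammas):
        x_next = soft_threshold(W1 @ b + W2 @ x + gamma * (x - x_prev), theta)
        x_prev, x = x, x_next
    return x

# Untrained usage with ISTA-like weights W1 = A^T / L, W2 = I - A^T A / L.
rng = np.random.default_rng(0)
m, n = 30, 60
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n); x_true[rng.choice(n, 5, replace=False)] = 1.0
b = A @ x_true
L = np.linalg.norm(A, 2) ** 2
W1, W2 = A.T / L, np.eye(n) - A.T @ A / L
K = 50
x_hat = lista_momentum(b, W1, W2, thetas=[0.1 / L] * K, gammas=[0.3] * K)
print(np.linalg.norm(x_hat - x_true))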
Preprint
Full-text available
Decentralized SGD is an emerging training method for deep learning known for requiring much less (and thus faster) communication per iteration; it relaxes the exact averaging step in parallel SGD to inexact averaging. The less exact the averaging is, however, the more total iterations the training needs to take. Therefore, the key to making decentralized SG...
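A sketch of one decentralized SGD step: each node takes a local stochastic gradient step, then averages inexactly with its neighbors through a doubly stochastic mixing matrix W. The names and the toy quadratic objective are illustrative.

import numpy as np

def decentralized_sgd_step(X, grads, W, lr):
    """X: (num_nodes, dim) local models; grads: matching local gradients;
    W: (num_nodes, num_nodes) mixing matrix, nonzero only on neighbors."""
    return W @ (X - lr * grads)   # local update followed by neighbor averaging

# Ring of 4 nodes, each averaging with its two neighbors.
W = np.array([[0.5, 0.25, 0.0, 0.25],
              [0.25, 0.5, 0.25, 0.0],
              [0.0, 0.25, 0.5, 0.25],
              [0.25, 0.0, 0.25, 0.5]])
X = np.random.randn(4, 3)
targets = np.random.randn(4, 3)          # each node's local quadratic target
for _ in range(100):
    grads = X - targets                  # gradient of 0.5*||x - target_i||^2
    X = decentralized_sgd_step(X, grads, W, lr=0.1)
print(X.std(axis=0).round(4))            # near-consensus: rows nearly identical
print(X.mean(axis=0).round(3), targets.mean(axis=0).round(3))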
Preprint
Full-text available
Robust principal component analysis (RPCA) is a critical tool in modern machine learning, which detects outliers in the task of low-rank matrix reconstruction. In this paper, we propose a scalable and learnable non-convex approach for high-dimensional RPCA problems, which we call Learned Robust PCA (LRPCA). LRPCA is highly efficient, and its free p...
Article
Federated Learning (FL) is popular for communication-efficient learning from distributed data. To utilize data at different clients without moving them to the cloud, algorithms such as the Federated Averaging (FedAvg) have adopted a computation then aggregation model, in which multiple local updates are performed using local data before aggregation...
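A sketch of FedAvg's "computation then aggregation" pattern: each client runs several local SGD steps on its own data, then the server averages the client models. The helper names and the least-squares toy problem are illustrative, not a particular FL framework's API.

import numpy as np

def local_update(w, data, lr=0.1, local_steps=5):
    for _ in range(local_steps):
        X, y = data
        grad = X.T @ (X @ w - y) / len(y)   # least-squares gradient on local data
        w = w - lr * grad
    return w

def fedavg_round(w_global, client_data):
    locals_ = [local_update(w_global.copy(), d) for d in client_data]  # computation
    return np.mean(locals_, axis=0)                                    # aggregation

rng = np.random.default_rng(0)
w_true = rng.standard_normal(4)
client_data = []
for _ in range(3):
    X = rng.standard_normal((40, 4))
    client_data.append((X, X @ w_true))
w = np.zeros(4)
for _ in range(50):
    w = fedavg_round(w, client_data)
print(np.linalg.norm(w - w_true).round(4))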
Article
This paper considers the problem of solving a special quartic–quadratic optimization problem with a single sphere constraint, namely, finding a global and local minimizer of \(\frac{1}{2}\mathbf {z}^{*}A\mathbf {z}+\frac{\beta }{2}\sum _{k=1}^{n}|z_{k}|^{4}\) such that \(\Vert \mathbf {z}\Vert _{2}=1\). This problem spans multiple domains including...
Article
This paper targets developing algorithms for solving distributed machine learning problems in a communication-efficient fashion. A class of new stochastic gradient descent (SGD) approaches have been developed, which can be viewed as the stochastic generalization to the recently developed lazily aggregated gradient (LAG) method, justifying the name LAS...
Article
Stochastic compositional optimization generalizes classic (non-compositional) stochastic optimization to the minimization of compositions of functions. Each composition may introduce an additional expectation. The series of expectations may be nested. Stochastic compositional optimization is gaining popularity in applications such as reinforcement...
Preprint
Full-text available
Stochastic nested optimization, including stochastic compositional, min-max and bilevel optimization, is gaining popularity in many machine learning applications. While the three problems share the nested structure, existing works often treat them separately, and thus develop problem-specific algorithms and their analyses. Among various exciting de...
Article
Full-text available
Many iterative methods in applied mathematics can be thought of as fixed-point iterations, and such algorithms are usually analyzed analytically, with inequalities. In this paper, we present a geometric approach to analyzing contractive and nonexpansive fixed point iterations with a new tool called the scaled relative graph. The SRG provides a corr...
Preprint
Full-text available
Systems of interacting agents can often be modeled as contextual games, where the context encodes additional information, beyond the control of any agent (e.g. weather for traffic and fiscal policy for market economies). In such systems, the most likely outcome is given by a Nash equilibrium. In many practical settings, only game equilibria are obs...
Preprint
Full-text available
Communication overhead hinders the scalability of large-scale distributed training. Gossip SGD, where each node averages only with its neighbors, is more communication-efficient than the prevalent parallel SGD. However, its convergence rate is inversely proportional to the quantity $1-\beta$, which measures the network connectivity. On large and sparse...
Preprint
In this paper, we demonstrate the power of a widely used stochastic estimator based on moving average (SEMA) on a range of stochastic non-convex optimization problems, which only requires {\bf a general unbiased stochastic oracle}. We analyze various stochastic methods (existing or newly proposed) based on the {\bf variance recursion property} of S...
Preprint
Full-text available
Inverse problems consist of recovering a signal from a collection of noisy measurements. These problems can often be cast as feasibility problems; however, additional regularization is typically necessary to ensure accurate and stable recovery with respect to data perturbations. Hand-chosen analytic regularization can yield desirable theoretical gu...
Preprint
Full-text available
When applying a stochastic/incremental algorithm, one must choose the order to draw samples. Among the most popular approaches are cyclic sampling and random reshuffling, which are empirically faster and more cache-friendly than uniform-iid-sampling. Cyclic sampling draws the samples in a fixed, cyclic order, which is less robust than reshuffling t...
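A small sketch contrasting the two sampling orders discussed above: cyclic sampling reuses one fixed permutation every epoch, while random reshuffling draws a fresh permutation per epoch. The toy SGD loop is illustrative only.

import numpy as np

def cyclic_order(n, num_epochs, rng):
    perm = rng.permutation(n)              # fixed once, reused every epoch
    return [perm for _ in range(num_epochs)]

def random_reshuffling_order(n, num_epochs, rng):
    return [rng.permutation(n) for _ in range(num_epochs)]  # fresh each epoch

def sgd_epochs(orders, grads_fn, w0, lr):
    w = w0.copy()
    for perm in orders:
        for i in perm:                     # one pass over the data
            w -= lr * grads_fn(i, w)
        lr *= 0.9                          # simple decay between epochs
    return w

rng = np.random.default_rng(0)
targets = rng.standard_normal((20, 3))     # f_i(w) = 0.5*||w - t_i||^2
grads_fn = lambda i, w: w - targets[i]
w0 = np.zeros(3)
for name, maker in [("cyclic", cyclic_order), ("reshuffle", random_reshuffling_order)]:
    w = sgd_epochs(maker(20, 30, np.random.default_rng(1)), grads_fn, w0, lr=0.05)
    print(name, np.linalg.norm(w - targets.mean(axis=0)).round(4))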
Preprint
Full-text available
The scale of deep learning nowadays calls for efficient distributed training algorithms. Decentralized momentum SGD (DmSGD), in which each node averages only with its neighbors, is more communication efficient than vanilla Parallel momentum SGD that incurs global average across all computing nodes. On the other hand, the large-batch training has be...
Preprint
Full-text available
Learning to optimize (L2O) is an emerging approach that leverages machine learning to develop optimization methods, aiming at reducing the laborious iterations of hand engineering. It automates the design of an optimization method based on its performance on a set of training problems. This data-driven procedure generates methods that can efficient...
Preprint
Full-text available
A promising trend in deep learning replaces traditional feedforward networks with implicit networks. Unlike traditional networks, implicit networks solve a fixed point equation to compute inferences. Solving for the fixed point varies in complexity, depending on provided data and an error tolerance. Importantly, implicit networks may be trained wit...
Preprint
Policy optimization methods remain a powerful workhorse in empirical Reinforcement Learning (RL), with a focus on neural policies that can easily reason over complex and continuous state and/or action spaces. Theoretical understanding of strategic exploration in policy-based methods with non-linear function approximation, however, is largely missin...
Preprint
We consider the zeroth-order optimization problem in the huge-scale setting, where the dimension of the problem is so large that performing even basic vector operations on the decision variables is infeasible. In this paper, we propose a novel algorithm, coined ZO-BCD, that exhibits favorable overall query complexity and has a much smaller per-iter...
Preprint
Full-text available
Stochastic bilevel optimization generalizes the classic stochastic optimization from the minimization of a single objective to the minimization of an objective function that depends on the solution of another optimization problem. Recently, stochastic bilevel optimization is regaining popularity in emerging machine learning applications such as hyper-...
Article
This paper develops algorithms for decentralized machine learning over a network, where data are distributed, computation is localized, and communication is restricted between neighbors. A line of recent research in this area focuses on improving both computation and communication complexities. The methods SSDA and MSDA \cite{scaman2017optimal} hav...
Article
Full-text available
Primal–dual hybrid gradient (PDHG) and alternating direction method of multipliers (ADMM) are popular first-order optimization methods. They are easy to implement and have diverse applications. As first-order methods, however, they are sensitive to problem conditions and can struggle to reach the desired accuracy. To improve their performance, rese...
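For context, a sketch of the standard PDHG (Chambolle-Pock) iteration for a problem of the form min_x f(x) + g(Kx), here with f(x) = lam*||x||_1 and g(z) = 0.5*||z - b||^2; this is the baseline method the paper seeks to make less sensitive to problem conditions, not the paper's improvement itself.

import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def pdhg(K, b, lam, num_iters=500):
    m, n = K.shape
    Lnorm = np.linalg.norm(K, 2)
    tau = sigma = 0.9 / Lnorm                  # step sizes with tau*sigma*||K||^2 < 1
    x, x_bar, y = np.zeros(n), np.zeros(n), np.zeros(m)
    for _ in range(num_iters):
        # dual step: prox of sigma*g*, where g(z) = 0.5*||z - b||^2
        y = (y + sigma * (K @ x_bar) - sigma * b) / (1.0 + sigma)
        # primal step: prox of tau*f, where f(x) = lam*||x||_1
        x_new = soft_threshold(x - tau * (K.T @ y), tau * lam)
        x_bar = 2 * x_new - x                  # extrapolation
        x = x_new
    return x

rng = np.random.default_rng(0)
K = rng.standard_normal((30, 50)) / np.sqrt(30)
x_true = np.zeros(50); x_true[:5] = 1.0
b = K @ x_true
print(np.linalg.norm(pdhg(K, b, lam=0.01) - x_true).round(3))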
Preprint
Full-text available
The augmented Lagrangian method (ALM) is one of the most useful methods for constrained optimization. Its convergence has been well established under convexity assumptions or smoothness assumptions, or under both assumptions. ALM may experience oscillations and divergence facing nonconvexity and nonsmoothness simultaneously. In this paper, we consi...
Preprint
Stochastic gradient descent (SGD) has taken the stage as the primary workhorse for large-scale machine learning. It is often used with its adaptive variants such as AdaGrad, Adam, and AMSGrad. This paper proposes an adaptive stochastic gradient descent method for distributed machine learning, which can be viewed as the communication-adaptive counte...
Preprint
Federated learning (FL) is a recently proposed distributed machine learning paradigm dealing with distributed and private data sets. Based on the data partition pattern, FL is often categorized into horizontal, vertical, and hybrid settings. Despite the fact that many works have been developed for the first two approaches, the hybrid FL setting (wh...
Preprint
In this paper~\footnote{The original title is "Momentum SGD with Robust Weighting For Imbalanced Classification"}, we present a simple yet effective method (ABSGD) for addressing the data imbalance issue in deep learning. Our method is a simple modification to momentum SGD where we leverage an attentional mechanism to assign an individual importanc...
Preprint
We study derivative-free optimization for convex functions where we further assume that function evaluations are unavailable. Instead, one only has access to a comparison oracle, which, given two points $x$ and $y$, returns a single bit of information indicating which point has the larger function value, $f(x)$ or $f(y)$, with some probability of b...
Preprint
Full-text available
Stochastic compositional optimization generalizes classic (non-compositional) stochastic optimization to the minimization of compositions of functions. Each composition may introduce an additional expectation. The series of expectations may be nested. Stochastic compositional optimization is gaining popularity in applications such as reinforcement...
Article
In this paper, we study the communication and (sub)gradient computation costs in distributed optimization. We present two algorithms based on the framework of the accelerated penalty method with increasing penalty parameters. Our first algorithm is for smooth distributed optimization and it obtains the near optimal $O(\sqrt{\frac{L}{\epsilon(1-\si...
Preprint
Full-text available
We present a new framework, called adversarial projections, for solving inverse problems by learning to project onto manifolds. Our goal is to recover a signal from a collection of noisy measurements. Traditional methods for this task often minimize the sum of a regularization term and an expression that measures compliance with measurements (...
Preprint
This paper develops algorithms for decentralized machine learning over a network, where data are distributed, computation is localized, and communication is restricted between neighbors. A line of recent research in this area focuses on improving both computation and communication complexities. The methods SSDA and MSDA \cite{scaman2017optimal} hav...
Article
Full-text available
Mean-field games arise in various fields including economics, engineering, and machine learning. They study strategic decision making in large populations where the individuals interact via certain mean-field quantities. The ground metrics and running costs of the games are of essential importance but are often unknown or only partially known. In t...
Preprint
Full-text available
Mean-field games arise in various fields including economics, engineering, and machine learning. They study strategic decision making in large populations where the individuals interact via certain mean-field quantities. The ground metrics and running costs of the games are of essential importance but are often unknown or only partially known. In t...
Preprint
SGD with momentum (SGDM) has been widely applied in many machine learning tasks, and it is often applied with dynamic stepsizes and momentum weights tuned in a stagewise manner. Despite its empirical advantage over SGD, the role of momentum is still unclear in general since previous analyses on SGDM either provide worse convergence bounds than t...
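For reference, the standard heavy-ball SGDM update the abstract refers to, here with a toy stagewise stepsize schedule; the schedule and problem are only an illustration.

import numpy as np

def sgdm_step(w, grad, buf, lr, momentum=0.9):
    buf = momentum * buf + grad        # momentum buffer
    return w - lr * buf, buf

rng = np.random.default_rng(0)
w, buf = np.ones(5), np.zeros(5)
lr = 0.1
for stage in range(3):                 # stagewise: shrink the stepsize each stage
    for _ in range(200):
        grad = w + 0.01 * rng.standard_normal(5)   # noisy gradient of 0.5*||w||^2
        w, buf = sgdm_step(w, grad, buf, lr)
    lr *= 0.1
print(np.linalg.norm(w).round(4))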
Preprint
Full-text available
Horizontal federated learning (FL) handles multi-client data that share the same set of features, while vertical FL trains a better predictor that combines all the features from different clients. This paper targets solving vertical FL in an asynchronous fashion and develops a simple FL method. The new method allows each client to run stochastic grad...
Preprint
Federated Learning (FL) has become a popular paradigm for learning from distributed data. To effectively utilize data at different devices without moving them to the cloud, algorithms such as the Federated Averaging (FedAvg) have adopted a "computation then aggregation" (CTA) model, in which multiple local updates are performed using local data, be...
Article
Full-text available
Many iterative methods in optimization are fixed-point iterations with averaged operators. As such methods converge at an O(1/k) rate with the constant determined by the averagedness coefficient, establishing small averagedness coefficients for operators is of broad interest. In this paper, we show that the averagedness coefficients of the composit...
Preprint
Full-text available
We consider the problem of minimizing a high-dimensional objective function, which may include a regularization term, using (possibly noisy) evaluations of the function. Such optimization is also called derivative-free, zeroth-order, or black-box optimization. We propose a new $\textbf{Z}$eroth-$\textbf{O}$rder $\textbf{R}$egularized $\textbf{O}$pt...
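A generic zeroth-order gradient estimator from (possibly noisy) function evaluations, via finite differences along random directions. This is a baseline sketch, not the sparsity-exploiting estimator proposed in the paper.

import numpy as np

def zo_gradient(f, x, num_dirs=20, delta=1e-3, rng=None):
    rng = rng or np.random.default_rng()
    d, g = len(x), np.zeros(len(x))
    fx = f(x)
    for _ in range(num_dirs):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        g += (f(x + delta * u) - fx) / delta * u    # directional-derivative estimate
    return g * d / num_dirs                          # rescale since E[u u^T] = I/d

f = lambda z: 0.5 * np.dot(z, z) + 1e-4 * np.random.randn()   # noisy evaluations
x = np.ones(10)
for _ in range(300):
    x -= 0.05 * zo_gradient(f, x)
print(round(0.5 * np.dot(x, x), 4))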
Preprint
We study how to use unsupervised learning for efficient exploration in reinforcement learning with rich observations generated from a small number of latent states. We present a novel algorithmic framework that is built upon two components: an unsupervised learning algorithm and a no-regret reinforcement learning algorithm. We show that our algorit...
Preprint
Primal-Dual Hybrid Gradient (PDHG) and Alternating Direction Method of Multipliers (ADMM) are two widely used first-order optimization methods. They reduce a difficult problem to simple subproblems, so they are easy to implement and have many applications. As first-order methods, however, they are sensitive to problem conditions and can struggle t...
Preprint
Full-text available
Many applications require repeatedly solving a certain type of optimization problem, each time with new (but similar) data. Data-driven algorithms can "learn to optimize" (L2O) with many fewer iterations and with a cost per iteration similar to that of general-purpose optimization algorithms. L2O algorithms are often derived from general-purpose algorithms,...
Preprint
Full-text available
This paper targets solving distributed machine learning problems such as federated learning in a communication-efficient fashion. A class of new stochastic gradient descent (SGD) approaches have been developed, which can be viewed as the stochastic generalization to the recently developed lazily aggregated gradient (LAG) method --- justifying the n...
Article
Full-text available
The method of block coordinate gradient descent (BCD) has been a powerful method for large-scale optimization. This paper considers the BCD method that successively updates a series of blocks selected according to a Markov chain. This kind of block selection is neither i.i.d. random nor cyclic. On the other hand, it is a natural choice for some app...
Preprint
Full-text available
The Scaled Relative Graph (SRG) by Ryu, Hannah, and Yin (arXiv:1902.09788, 2019) is a geometric tool that maps the action of a multi-valued nonlinear operator onto the 2D plane, used to analyze the convergence of a wide range of iterative methods. As the SRG includes the spectrum for linear operators, we can view the SRG as a generalization of the...
Preprint
One of the key approaches to save samples when learning a policy for a reinforcement learning problem is to use knowledge from an approximate model such as its simulator. However, does knowledge transfer from approximate models always help to learn a better policy? Despite numerous empirical studies of transfer reinforcement learning, an answer to...
Preprint
Many iterative methods in optimization are fixed-point iterations with averaged operators. As such methods converge at an $\mathcal{O}(1/k)$ rate with the constant determined by the averagedness coefficient, establishing small averagedness coefficients for operators is of broad interest. In this paper, we show that the averagedness coefficients of...
Article
Full-text available
Despite the vast literature on DRS, there has been very little work analyzing their behavior under pathologies. Most analyses assume a primal solution exists, a dual solution exists, and strong duality holds. When these assumptions are not met, i.e., under pathologies, the theory often breaks down and the empirical performance may degrade significa...
Preprint
We propose XPipe, an efficient asynchronous pipeline model parallelism approach for multi-GPU DNN training. XPipe is designed to make use of multiple GPUs to concurrently and continuously train different parts of a DNN model. To improve GPU utilization and achieve high throughput, it splits a mini-batch into a set of micro-batches and allows the ov...
Article
Full-text available
Background: To review and evaluate approaches to convolutional neural network (CNN) reconstruction for accelerated cardiac MR imaging in the real clinical context. Methods: Two CNN architectures, Unet and residual network (Resnet), were evaluated using quantitative and qualitative assessment by radiologists. Four different loss functions were also...
Preprint
Full-text available
This paper considers the problem of solving a special quartic-quadratic optimization problem with a single sphere constraint, namely, finding a global and local minimizer of $\frac{1}{2}\mathbf{z}^{*}A\mathbf{z}+\frac{\beta}{2}\sum_{k=1}^{n}\lvert z_{k}\rvert^{4}$ such that $\lVert\mathbf{z}\rVert_{2}=1$. This problem spans multiple domains includi...
Preprint
Full-text available
This paper considers the problem of solving a special quartic-quadratic optimization problem with a single sphere constraint, namely, finding a global and local minimizer of $\frac{1}{2}\mathbf{z}^{*}A\mathbf{z}+\frac{\beta}{2}\sum_{k=1}^{n}\lvert z_{k}\rvert^{4}$ such that $\lVert\mathbf{z}\rVert_{2}=1$. This problem spans multiple domains including quantum mechanics and chemistry sciences and we investigate the geometric prope...
Article
Full-text available
We design fast numerical methods for Hamilton–Jacobi equations in density space (HJD), which arises in optimal transport and mean field games. We propose an algorithm using a generalized Hopf formula in density space. The formula helps transform a problem from an optimal control problem in density space, which are constrained minimizations supp...
Article
Full-text available
Many optimization algorithms converge to stationary points. When the underlying problem is nonconvex, they may get trapped at local minimizers and occasionally stagnate near saddle points. We propose the Run-and-Inspect Method, which adds an "inspect" phase to existing algorithms that helps escape from non-global stationary points. The inspection s...
Preprint
Despite remarkable empirical success, the training dynamics of generative adversarial networks (GAN), which involves solving a minimax game using stochastic gradients, is still poorly understood. In this work, we analyze last-iterate convergence of simultaneous gradient descent (simGD) and its variants under the assumption of convex-concavity, guid...
Preprint
Full-text available
We propose a new method for computing Dynamic Mode Decomposition (DMD) evolution matrices, which we use to analyze dynamical systems. Unlike the majority of existing methods, our approach is based on a variational formulation consisting of data alignment penalty terms and constitutive orthogonality constraints. Our method does not make any assumpti...
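For context, the standard least-squares (exact) DMD evolution matrix, A ~ X2 @ pinv(X1), fit from consecutive snapshot pairs; this is the baseline that the paper's variational formulation (alignment penalties plus orthogonality constraints) replaces, not the paper's method.

import numpy as np

rng = np.random.default_rng(0)
A_true = np.array([[0.9, -0.2], [0.2, 0.9]])     # a slowly rotating, decaying system
x = rng.standard_normal(2)
snapshots = [x]
for _ in range(50):
    x = A_true @ x
    snapshots.append(x)
X = np.array(snapshots).T                         # columns are states over time
X1, X2 = X[:, :-1], X[:, 1:]                      # consecutive snapshot pairs

A_dmd = X2 @ np.linalg.pinv(X1)                   # least-squares evolution matrix
eigvals = np.linalg.eigvals(A_dmd)                # DMD eigenvalues (mode dynamics)
print(np.round(A_dmd, 3))
print(np.round(eigvals, 3))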