Mert Gürbüzbalaban

Rutgers, The State University of New Jersey · Rutgers Business School - Newark and New Brunswick

PhD (in applied math) from Courant Institute (NYU)

About

80 Publications · 6,732 Reads
870 Citations

Publications (80)
Preprint
Stochastic smooth nonconvex minimax problems are prevalent in machine learning, e.g., GAN training, fair classification, and distributionally robust learning. Stochastic gradient descent ascent (GDA)-type methods are popular in practice due to their simplicity and single-loop nature. However, there is a significant gap between the theory and practi...
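As a rough illustration of the single-loop GDA-type template mentioned above, here is a minimal stochastic gradient descent ascent sketch; the simultaneous-update choice, step sizes, and function names are assumptions for illustration, not the schemes analyzed in the paper.
```python
import numpy as np

def stochastic_gda(grad_x, grad_y, x0, y0, tau=0.01, sigma=0.01, iters=1000):
    """Single-loop stochastic gradient descent ascent for min_x max_y Phi(x, y):
    one descent step in x and one ascent step in y per iteration, each using a
    stochastic partial gradient. Bare-bones sketch; step sizes are placeholders."""
    x, y = x0.copy(), y0.copy()
    for _ in range(iters):
        gx, gy = grad_x(x, y), grad_y(x, y)   # stochastic partial gradients
        x = x - tau * gx                      # descent in the primal variable
        y = y + sigma * gy                    # ascent in the dual variable
    return x, y
```
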
Preprint
This paper considers the problem of understanding the behavior of a general class of accelerated gradient methods on smooth nonconvex functions. Motivated by some recent works that have proposed effective algorithms, based on Polyak's heavy ball method and the Nesterov accelerated gradient method, to achieve convergence to a local minimum of noncon...
Preprint
Full-text available
We consider strongly-convex-strongly-concave (SCSC) saddle point (SP) problems which frequently arise in many applications from distributionally robust learning to game theory and fairness in machine learning. We focus on the recently developed stochastic accelerated primal-dual algorithm (SAPD), which admits optimal complexity in several settings...
Preprint
Full-text available
Heavy-tail phenomena in stochastic gradient descent (SGD) have been reported in several empirical studies. Experimental evidence in previous works suggests a strong interplay between the heaviness of the tails and the generalization behavior of SGD. To address these empirical phenomena theoretically, several works have made strong topological and statis...
Preprint
Full-text available
We consider a distributionally robust stochastic optimization problem and formulate it as a stochastic two-level composition optimization problem with the use of the mean–semideviation risk measure. In this setting, we consider a single time-scale algorithm, involving two versions of the inner function value tracking: linearized tracking of a cont...
Preprint
We consider the constrained sampling problem where the goal is to sample from a distribution $\pi(x)\propto e^{-f(x)}$ and $x$ is constrained on a convex body $\mathcal{C}\subset \mathbb{R}^d$. Motivated by penalty methods from optimization, we propose penalized Langevin Dynamics (PLD) and penalized Hamiltonian Monte Carlo (PHMC) that convert the c...
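A minimal sketch of the penalty idea behind PLD, assuming a squared-distance penalty measured through a projection onto $\mathcal{C}$; the penalty choice, step size, and helper names are illustrative assumptions, not the paper's exact scheme.
```python
import numpy as np

def penalized_langevin_step(x, grad_f, proj_C, step=1e-3, penalty=10.0,
                            rng=np.random.default_rng()):
    """One unadjusted Langevin step on the penalized potential
    f(x) + (penalty/2) * dist(x, C)^2, with dist measured via a projection
    onto the convex body C. Hypothetical sketch of the penalty idea only."""
    grad_pen = penalty * (x - proj_C(x))   # gradient of (penalty/2)*dist(x, C)^2
    noise = np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
    return x - step * (grad_f(x) + grad_pen) + noise

# Example: sample from pi(x) ∝ exp(-||x||^2 / 2) restricted to the unit ball.
proj_ball = lambda x: x / max(1.0, np.linalg.norm(x))
x = np.zeros(3)
for _ in range(1000):
    x = penalized_langevin_step(x, grad_f=lambda z: z, proj_C=proj_ball)
```
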
Article
This paper considers the problem of understanding the exit time for trajectories of gradient-related first-order methods from saddle neighborhoods under some initial boundary conditions. Given the ‘flat’ geometry around saddle points, first-order methods can struggle to escape these regions in a fast manner due to the small magnitudes of gradients...
Article
Full-text available
We consider a distributionally robust formulation of stochastic optimization problems arising in statistical learning, where robustness is with respect to ambiguity in the underlying data distribution. Our formulation builds on risk-averse optimization techniques and the theory of coherent risk measures. It uses mean–semideviation risk for quantify...
Article
Nonconvex stochastic optimization problems arise in many machine learning problems, including deep learning. The stochastic gradient Hamiltonian Monte Carlo (SGHMC) is a variant of stochastic gradient with momentum in which a controlled and properly scaled Gaussian noise is added to the stochastic gradien...
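For orientation, a bare-bones SGHMC-style update (momentum SGD plus injected Gaussian noise) might look as follows; the parameterization and discretization here are assumptions for illustration, not the exact algorithm analyzed in the paper.
```python
import numpy as np

def sghmc(stoch_grad, x0, step=1e-3, friction=0.1, temperature=1.0,
          iters=1000, rng=np.random.default_rng(0)):
    """SGHMC-style sketch: a momentum (velocity) recursion with friction,
    driven by stochastic gradients plus properly scaled Gaussian noise."""
    x, v = x0.copy(), np.zeros_like(x0)
    for _ in range(iters):
        noise = np.sqrt(2.0 * friction * step * temperature) * rng.standard_normal(x.shape)
        v = (1.0 - friction * step) * v - step * stoch_grad(x) + noise
        x = x + v
    return x

# Toy usage on f(x) = ||x||^2 / 2 (gradient is x).
print(sghmc(lambda z: z, np.ones(2)))
```
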
Preprint
Full-text available
Recent studies have shown that heavy tails can emerge in stochastic optimization and that the heaviness of the tails has links to the generalization error. While these studies have shed light on interesting aspects of the generalization behavior in modern settings, they relied on strong topological and statistical regularity assumptions, which are...
Article
The effective resistance (ER) between a pair of nodes in a weighted undirected graph is defined as the potential difference induced when a unit current is injected at one node and extracted from the other, treating edge weights as the conductance values of edges. The ER is a key quantity of interest in many applications, e.g., solving linear system...
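For readers unfamiliar with the quantity, a dense-baseline computation of the ER via the Laplacian pseudoinverse (a textbook formula, not the scalable estimators developed in this line of work) can be sketched as follows.
```python
import numpy as np

def effective_resistance(W, u, v):
    """Effective resistance between nodes u and v of a weighted undirected graph
    with weight (conductance) matrix W, via the Moore-Penrose pseudoinverse of
    the Laplacian: R(u, v) = (e_u - e_v)^T L^+ (e_u - e_v)."""
    L = np.diag(W.sum(axis=1)) - W          # graph Laplacian
    Lp = np.linalg.pinv(L)
    e = np.zeros(W.shape[0])
    e[u], e[v] = 1.0, -1.0
    return float(e @ Lp @ e)

# Path graph 0-1-2 with unit weights: two unit resistors in series, R(0, 2) = 2.
W = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
print(effective_resistance(W, 0, 2))  # ~2.0
```
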
Preprint
We propose a new stochastic method SAPD+ for solving nonconvex-concave minimax problems of the form $\min\max \mathcal{L}(x,y) = f(x)+\Phi(x,y)-g(y)$, where $f,g$ are closed convex and $\Phi(x,y)$ is a smooth function that is weakly convex in $x$, (strongly) concave in $y$. For both strongly concave and merely concave settings, SAPD+ achieves the b...
Article
We propose a new stochastic method SAPD+ for solving nonconvex-concave minimax problems of the form $\min\max \mathcal{L}(x,y)=f(x)+\Phi(x,y)-g(y)$, where $f,g$ are closed convex and $\Phi(x,y)$ is a smooth function that is weakly convex in $x$, (strongly) concave in $y$. For both strongly concave and merely concave settings, SAPD+ achieves the best-known...
Preprint
Recent theoretical studies have shown that heavy-tails can emerge in stochastic optimization due to `multiplicative noise', even under surprisingly simple settings, such as linear regression with Gaussian data. While these studies have uncovered several interesting phenomena, they consider conventional stochastic optimization problems, which exclud...
Preprint
Full-text available
In the context of first-order algorithms subject to random gradient noise, we study the trade-offs between the convergence rate (which quantifies how fast the initial conditions are forgotten) and the "risk" of suboptimality, i.e. deviations from the expected suboptimality. We focus on a general class of momentum methods (GMM) which recover popular...
Preprint
Full-text available
In this work, we consider strongly convex strongly concave (SCSC) saddle point (SP) problems $\min_{x\in\mathbb{R}^{d_x}}\max_{y\in\mathbb{R}^{d_y}}f(x,y)$ where $f$ is $L$-smooth, $f(\cdot,y)$ is $\mu$-strongly convex for every $y$, and $f(x,\cdot)$ is $\mu$-strongly concave for every $x$. Such problems arise frequently in machine learning in the context...
Preprint
We consider strongly convex/strongly concave saddle point problems assuming we have access to unbiased stochastic estimates of the gradients. We propose a stochastic accelerated primal-dual (SAPD) algorithm and show that SAPD iterate sequence, generated using constant primal-dual step sizes, linearly converges to a neighborhood of the unique saddle...
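A hedged sketch of an accelerated primal-dual template of this kind, with momentum/extrapolation applied to the dual stochastic gradient; the exact SAPD recursion, its step-size rules, and its noise model are given in the paper and may differ from this illustration.
```python
import numpy as np

def stochastic_primal_dual(grad_x, grad_y, x0, y0, tau=0.01, sigma=0.01,
                           theta=0.9, iters=1000):
    """Generic stochastic primal-dual iteration for min_x max_y f(x, y):
    the dual ascent direction extrapolates the current and previous stochastic
    dual gradients, then a primal descent step is taken at the updated dual
    point. Illustrative template only, not the exact SAPD recursion."""
    x, y = x0.copy(), y0.copy()
    gy_prev = grad_y(x, y)
    for _ in range(iters):
        gy = grad_y(x, y)
        s = gy + theta * (gy - gy_prev)   # extrapolated dual direction
        y = y + sigma * s                 # dual ascent
        x = x - tau * grad_x(x, y)        # primal descent at the new dual point
        gy_prev = gy
    return x, y
```
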
Preprint
Full-text available
This work proposes a distributed algorithm for solving empirical risk minimization problems, called L-DQN, under the master/worker communication model. L-DQN is a distributed limited-memory quasi-Newton method that supports asynchronous computations among the worker nodes. Our method is efficient both in terms of storage and communication costs, i....
Preprint
Full-text available
Understanding generalization in deep learning has been one of the major challenges in statistical learning theory over the last decade. While recent work has illustrated that the dataset and the training algorithm must be taken into account in order to obtain meaningful generalization bounds, it is still theoretically not clear which properties of...
Preprint
Full-text available
This work proposes a time-efficient Natural Gradient Descent method, called TENGraD, with linear convergence guarantees. Computing the inverse of the neural network's Fisher information matrix is expensive in NGD because the Fisher matrix is large. Approximate NGD methods such as KFAC attempt to improve NGD's running time and practical application...
Preprint
Recent studies have provided both empirical and theoretical evidence illustrating that heavy tails can emerge in stochastic gradient descent (SGD) in various scenarios. Such heavy tails potentially result in iterates with diverging variance, which hinders the use of conventional convergence analysis techniques that rely on the existence of the seco...
Preprint
Full-text available
Gaussian noise injections (GNIs) are a family of simple and widely-used regularisation methods for training neural networks, where one injects additive or multiplicative Gaussian noise to the network activations at every iteration of the optimisation algorithm, which is typically chosen as stochastic gradient descent (SGD). In this paper we focus o...
Preprint
Full-text available
We present two classes of differentially private optimization algorithms derived from the well-known accelerated first-order methods. The first algorithm is inspired by Polyak's heavy ball method and employs a smoothing approach to decrease the accumulated noise on the gradient steps required for differential privacy. The second class of algorithms...
Preprint
Stochastic gradient Langevin dynamics (SGLD) and stochastic gradient Hamiltonian Monte Carlo (SGHMC) are two popular Markov Chain Monte Carlo (MCMC) algorithms for Bayesian inference that can scale to large datasets, allowing one to sample from the posterior distribution of a machine learning (ML) model based on the input data and the prior distributio...
Preprint
We introduce a framework for designing primal methods under the decentralized optimization setting where local functions are smooth and strongly convex. Our approach consists of approximately solving a sequence of sub-problems induced by the accelerated augmented Lagrangian method, thereby providing a systematic way for deriving several well-known...
Preprint
In recent years, various notions of capacity and complexity have been proposed for characterizing the generalization properties of stochastic gradient descent (SGD) in deep learning. Some of the popular notions that correlate well with the performance on unseen data are (i) the `flatness' of the local minimum found by SGD, which is related to the e...
Preprint
Full-text available
We consider a distributionally robust formulation of stochastic optimization problems arising in statistical learning, where robustness is with respect to uncertainty in the underlying data distribution. Our formulation builds on risk-averse optimization techniques and the theory of coherent risk measures. It uses semi-deviation risk for quantifyin...
Preprint
Full-text available
A common approach to initialization in deep neural networks is to sample the network weights from a Gaussian distribution to preserve the variance of preactivations. On the other hand, recent research shows that for a large number of deep neural networks, the training process can often lead to non-Gaussianity and heavy tails in the distribution of the...
Preprint
Full-text available
Stochastic gradient Langevin dynamics (SGLD) is a powerful algorithm for optimizing a non-convex objective, where a controlled and properly scaled Gaussian noise is added to the stochastic gradients to steer the iterates towards a global minimum. SGLD is based on the overdamped Langevin diffusion which is reversible in time. By adding an anti-symmet...
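A minimal sketch of a non-reversible SGLD-style iteration, where the stochastic gradient is premultiplied by I + J for an antisymmetric matrix J; the particular discretization and parameter names are illustrative assumptions, not the paper's exact scheme.
```python
import numpy as np

def nonreversible_sgld(stoch_grad, x0, J, step=1e-3, iters=1000,
                       rng=np.random.default_rng(0)):
    """SGLD-style iteration with a non-reversible drift: the gradient is
    premultiplied by (I + J) with J antisymmetric (J.T == -J), which breaks
    time-reversibility of the underlying diffusion while preserving its
    invariant distribution. Illustrative sketch only."""
    x = x0.copy()
    d = len(x0)
    M = np.eye(d) + J
    for _ in range(iters):
        noise = np.sqrt(2.0 * step) * rng.standard_normal(d)
        x = x - step * (M @ stoch_grad(x)) + noise
    return x

# Toy usage on a standard Gaussian potential (gradient is x).
J = np.array([[0., 1.], [-1., 0.]])
print(nonreversible_sgld(lambda z: z, np.ones(2), J))
```
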
Article
We study the trade-offs between convergence rate and robustness to gradient errors in designing a first-order algorithm. We focus on gradient descent and accelerated gradient (AG) methods for minimizing strongly convex functions when the gradient has random errors in the form of additive white no...
Preprint
Stochastic gradient descent with momentum (SGDm) is one of the most popular optimization algorithms in deep learning. While there is a rich theory of SGDm for convex problems, the theory is considerably less developed in the context of deep learning where the problem is non-convex and the gradient noise might exhibit a heavy-tailed behavior, as emp...
Article
We study the problem of minimizing a strongly convex, smooth function when we have noisy estimates of its gradient. We propose a novel multistage accelerated algorithm that is universally optimal in the sense that it achieves the optimal rate both in the deterministic and s...
Preprint
The gradient noise (GN) in the stochastic gradient descent (SGD) algorithm is often considered to be Gaussian in the large data regime by assuming that the \emph{classical} central limit theorem (CLT) kicks in. This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE)...
Preprint
Full-text available
We study distributed stochastic gradient (D-SG) method and its accelerated variant (D-ASG) for solving decentralized strongly convex stochastic optimization problems where the objective function is distributed over several computational units, lying on a fixed but arbitrary connected communication graph, subject to local communication constraints w...
Preprint
The effective resistance between a pair of nodes in a weighted undirected graph is defined as the potential difference induced between them when a unit current is injected at the first node and extracted at the second node, treating edge weights as the conductance values of edges. The effective resistance is a key quantity of interest in many appli...
Preprint
Stochastic gradient descent (SGD) has been widely used in machine learning due to its computational efficiency and favorable generalization properties. Recently, it has been empirically demonstrated that the gradient noise in several deep learning settings admits a non-Gaussian, heavy-tailed behavior. This suggests that the gradient noise can be mo...
Preprint
In this paper, we consider distributed algorithms for solving the empirical risk minimization problem under the master/worker communication model. We develop a distributed asynchronous quasi-Newton algorithm that can achieve superlinear convergence. To our knowledge, this is the first distributed asynchronous algorithm with superlinear convergence...
Preprint
Full-text available
We study the trade-offs between convergence rate and robustness to gradient errors in designing a first-order algorithm. We focus on gradient descent (GD) and accelerated gradient (AG) methods for minimizing strongly convex functions when the gradient has random errors in the form of additive white noise. With gradient errors, the function values...
Preprint
Full-text available
We study the problem of minimizing a strongly convex and smooth function when we have noisy estimates of its gradient. We propose a novel multistage accelerated algorithm that is universally optimal in the sense that it achieves the optimal rate both in the deterministic and stochastic case and operates without knowledge of noise characteristics. T...
Preprint
Full-text available
Momentum methods such as Polyak's heavy ball (HB) method, Nesterov's accelerated gradient (AG) as well as accelerated projected gradient (APG) method have been commonly used in machine learning practice, but their performance is quite sensitive to noise in the gradients. We study these methods under a first-order stochastic oracle model where noisy...
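For reference, the skeletons of the heavy-ball and Nesterov accelerated gradient iterations, driven here by a noisy first-order oracle; the step sizes and the toy oracle are assumptions for illustration, not the tuned parameters studied in the paper.
```python
import numpy as np

def heavy_ball(noisy_grad, x0, alpha=0.1, beta=0.9, iters=100):
    """Polyak heavy ball: x_{k+1} = x_k - alpha*g(x_k) + beta*(x_k - x_{k-1}),
    where g is a noisy first-order oracle."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        x, x_prev = x - alpha * noisy_grad(x) + beta * (x - x_prev), x
    return x

def nesterov_ag(noisy_grad, x0, alpha=0.1, beta=0.9, iters=100):
    """Nesterov accelerated gradient: the oracle is queried at the
    extrapolated point y_k = x_k + beta*(x_k - x_{k-1})."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        y = x + beta * (x - x_prev)
        x, x_prev = y - alpha * noisy_grad(y), x
    return x

# Noisy oracle for f(x) = 0.5*||x||^2 with additive white noise.
rng = np.random.default_rng(0)
g = lambda x: x + 0.1 * rng.standard_normal(x.shape)
print(heavy_ball(g, np.ones(5)), nesterov_ag(g, np.ones(5)))
```
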
Preprint
The gradient noise (GN) in the stochastic gradient descent (SGD) algorithm is often considered to be Gaussian in the large data regime by assuming that the classical central limit theorem (CLT) kicks in. This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE) driven...
Preprint
Full-text available
Langevin dynamics (LD) has proven to be a powerful technique for optimizing a non-convex objective: it is an efficient algorithm for finding local minima while eventually visiting a global minimum on longer time-scales. LD is based on the first-order Langevin diffusion which is reversible in time. We study two variants that are based on non-reversible...
Preprint
Full-text available
Stochastic gradient Hamiltonian Monte Carlo (SGHMC) is a variant of stochastic gradient with momentum where a controlled and properly scaled Gaussian noise is added to the stochastic gradients to steer the iterates towards a global minimum. Many works reported its empirical success in practice for solving stochastic non-convex optimization problems...
Conference Paper
Proximal Newton methods are iterative algorithms that solve $\ell_1$-regularized least squares problems. Distributed-memory implementations of these methods have become popular since they enable the analysis of large-scale machine learning problems. However, the scalability of these methods is limited by the communication overhead on modern distributed ar...
Preprint
Full-text available
We study the trade-off between rate of convergence and robustness to gradient errors in designing a first-order algorithm. In particular, we focus on gradient descent (GD) and Nesterov's accelerated gradient (AG) method for strongly convex quadratic objectives when the gradient has random errors in the form of additive white noise. To characterize...
Article
Full-text available
We consider coordinate descent (CD) methods with exact line search on convex quadratic problems. Our main focus is to study the performance of the CD method that uses random permutations in each epoch and to compare it to the performance of CD methods that use deterministic orders and random sampling with replacement. We focus on a class of convex...
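A small sketch of the setting: coordinate descent with exact line search on a convex quadratic, drawing a fresh random permutation of the coordinates each epoch; the toy problem and epoch count are illustrative assumptions.
```python
import numpy as np

def cd_random_permutation(A, b, x0, epochs=50, rng=np.random.default_rng(0)):
    """Coordinate descent with exact line search on f(x) = 0.5*x^T A x - b^T x
    (A symmetric positive definite), visiting coordinates in a fresh random
    permutation each epoch."""
    x = x0.copy()
    n = len(x)
    for _ in range(epochs):
        for i in rng.permutation(n):
            r = A[i] @ x - b[i]      # partial derivative along coordinate i
            x[i] -= r / A[i, i]      # exact minimization over coordinate i
    return x

A = np.array([[3., 1.], [1., 2.]])
b = np.array([1., 1.])
print(cd_random_permutation(A, b, np.zeros(2)), np.linalg.solve(A, b))
```
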
Article
The coordinate descent (CD) method is a classical optimization algorithm that has seen a revival of interest because of its competitive performance in machine learning applications. A number of recent papers provided convergence rate estimates for their deterministic (cyclic) and randomized variants that differ in the selection of update coordinate...
Article
Full-text available
The fast iterative soft thresholding algorithm (FISTA) is used to solve convex regularized optimization problems in machine learning. Distributed implementations of the algorithm have become popular since they enable the analysis of large datasets. However, existing formulations of FISTA communicate data at every iteration which reduces its perform...
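For context, the textbook single-machine FISTA iteration for the $\ell_1$-regularized least squares problem looks as follows; this is not the communication-avoiding distributed variant the paper develops.
```python
import numpy as np

def fista_lasso(A, b, lam, iters=200):
    """FISTA for min_x 0.5*||Ax - b||^2 + lam*||x||_1, with step size 1/L where
    L = ||A||_2^2 is the Lipschitz constant of the smooth part's gradient."""
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    y, t = x.copy(), 1.0
    soft = lambda z, thr: np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)
    for _ in range(iters):
        x_new = soft(y - A.T @ (A @ y - b) / L, lam / L)   # proximal gradient step
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0   # momentum coefficient
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)      # extrapolation
        x, t = x_new, t_new
    return x
```
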
Article
Full-text available
The effective resistance between a pair of nodes in a weighted undirected graph is defined as the potential difference induced between them when a unit current is injected at the first node and extracted at the second node, treating edge weights as the conductance values of edges. The effective resistance is a key quantity of interest in many appli...
Article
Full-text available
We propose a fast method to approximate the real stability radius of a linear dynamical system with output feedback, where the perturbations are restricted to be real valued and bounded with respect to the Frobenius norm. Our work builds on a number of scalable algorithms that have been proposed in recent years, ranging from methods that approximat...
Article
We analyze the proximal incremental aggregated gradient (PIAG) method for minimizing the sum of a large number of smooth component functions $f(x) = \sum_{i=1}^m f_i(x)$ and a convex function $r(x)$. Such composite optimization problems arise in a number of machine learning applications including regularized regression problems and cons...
Article
The root radius of a polynomial is the maximum of the moduli of its roots (zeros). We consider the following optimization problem: minimize the root radius over monic polynomials of degree n, with either real or complex coefficients, subject to k linearly independent affine constraints on the coefficients. We show that there always exists an optima...
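For concreteness, the problem described above can be written as follows (a formulation reconstructed from the abstract; the notation is illustrative):
\[
\min_{p \in \mathcal{M}_n} \ \rho(p), \qquad \rho(p) := \max\{\, |z| : p(z) = 0 \,\}, \qquad \text{subject to}\ \ \ell_j(p) = c_j,\ \ j = 1,\dots,k,
\]
where $\mathcal{M}_n$ is the set of monic polynomials of degree $n$ with real (or complex) coefficients and each $\ell_j$ is an affine function of the non-leading coefficients, the $k$ constraints being linearly independent.
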
Article
Full-text available
We study the convergence rate of the proximal incremental aggregated gradient (PIAG) method for minimizing the sum of a large number of smooth component functions (where the sum is strongly convex) and a non-smooth convex function. At each iteration, the PIAG method moves along an aggregated gradient formed by incrementally updating gradients of co...
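A rough sketch of the PIAG idea, with a quadratic regularizer standing in for the general non-smooth convex term so that the proximal step has a closed form; all parameter choices and names here are illustrative assumptions.
```python
import numpy as np

def piag(grads, x0, step=0.01, reg=0.1, iters=2000):
    """PIAG sketch for min_x sum_i f_i(x) + r(x): keep a table of the most
    recently computed gradient of each component, refresh one entry per
    iteration (cyclic order here), and take a proximal step along the
    aggregated (summed) gradient. Here r(x) = (reg/2)*||x||^2 for simplicity."""
    n = len(grads)
    x = x0.copy()
    table = [g(x) for g in grads]     # stored (possibly stale) gradients
    agg = np.sum(table, axis=0)       # aggregated gradient
    for k in range(iters):
        i = k % n
        g_new = grads[i](x)
        agg += g_new - table[i]
        table[i] = g_new
        x = (x - step * agg) / (1.0 + step * reg)   # prox of the quadratic r
    return x

# Example: f_i(x) = 0.5*(x - a_i)^2 for a_i in {1, 2, 3}.
grads = [lambda x, ai=ai: x - ai for ai in (1.0, 2.0, 3.0)]
print(piag(grads, np.array(0.0)))
```
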
Article
Full-text available
Recently, there has been growing interest in developing optimization methods for solving large-scale machine learning problems. Most of these problems boil down to the problem of minimizing an average of a finite set of smooth and strongly convex functions where the number of functions n is large. Gradient descent method (GD) is successful in minim...
Article
Full-text available
We focus on the problem of minimizing the sum of smooth component functions (where the sum is strongly convex) and a non-smooth convex function, which arises in regularized empirical risk minimization in machine learning and distributed constrained optimization in wireless sensor networks and smart grids. We consider solving this problem using the...
Article
Let b/a be a strictly proper reduced rational transfer function, with a monic. Consider the problem of designing a controller y/x, with deg(y) < deg(x) < deg(a) – 1 and x monic, subject to lower and upper bounds on the coefficients of y and x, so that the poles of the closed loop transfer function, that is the roots (zeros) of ax + by, are, if poss...
Article
Full-text available
We analyze the convergence rate of the random reshuffling (RR) method, which is a randomized first-order incremental algorithm for minimizing a finite sum of convex component functions. RR proceeds in cycles, picking a uniformly random order (permutation) and processing the component functions one at a time according to this order, i.e., at each cy...
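A minimal random reshuffling sketch; the diminishing step-size rule and the toy example are assumptions for illustration, not the schedules analyzed in the paper.
```python
import numpy as np

def random_reshuffling(grads, x0, step0=0.1, epochs=100, rng=np.random.default_rng(0)):
    """Random reshuffling: each epoch draws a fresh uniformly random permutation
    of the component functions and takes one gradient step per component in
    that order."""
    x = x0.copy()
    n = len(grads)
    for epoch in range(epochs):
        step = step0 / (epoch + 1)            # illustrative diminishing step size
        for i in rng.permutation(n):
            x = x - step * grads[i](x)
    return x

# Example: f_i(x) = 0.5*(x - a_i)^2; the minimizer of the sum is the mean of a_i.
a = [1.0, 2.0, 3.0]
grads = [lambda x, ai=ai: x - ai for ai in a]
print(random_reshuffling(grads, np.array(0.0)))  # approaches 2.0
```
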
Article
Full-text available
The incremental gradient method is a prominent algorithm for minimizing a finite sum of smooth convex functions, used in many contexts including large-scale data processing applications and distributed optimization over networks. It is a first-order method that processes the functions one at a time based on their gradient information. The increment...
Article
Full-text available
Motivated by applications to distributed optimization over networks and large-scale data processing in machine learning, we analyze the deterministic incremental aggregated gradient method for minimizing a finite sum of smooth functions where the sum is strongly convex. This method processes the functions one at a time in a deterministic order and...
Article
The root radius of a polynomial is the maximum of the moduli of its roots (zeros). We consider the following optimization problem: minimize the root radius over monic polynomials of degree n, with either real or complex coefficients, subject to k consistent affine constraints on the coefficients. We show that there always exists an optimal polynomi...
Article
Full-text available
Motivated by machine learning problems over large data sets and distributed optimization over networks, we develop and analyze a new method called incremental Newton method for minimizing the sum of a large number of strongly convex functions. We show that our method is globally convergent for a variable stepsize rule. We further show that under a...
Article
Full-text available
The H∞ norm of a transfer matrix function for a control system is the reciprocal of the largest value of ε such that the associated ε-spectral value set is contained in the stability region for the dynamical system (the left half-plane in the continuous-time case and the unit disk in the discrete-time case). After deriving some fundamental properti...
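In standard notation (continuous-time case shown; assumed here for illustration, the paper's notation may differ), the objects in question are
\[
G(s) = C(sI - A)^{-1}B + D, \qquad \|G\|_{\mathcal{H}_\infty} = \sup_{\omega \in \mathbb{R}} \sigma_{\max}\!\big(G(i\omega)\big),
\]
\[
\sigma_{\varepsilon}(A,B,C,D) = \bigcup_{\|\Delta\|_2 \le \varepsilon} \Lambda\!\big(A + B\Delta (I - D\Delta)^{-1} C\big),
\]
so that, as stated above, $\|G\|_{\mathcal{H}_\infty}$ is the reciprocal of the largest $\varepsilon$ for which the spectral value set $\sigma_{\varepsilon}(A,B,C,D)$ stays inside the stability region (the open left half-plane in continuous time, the open unit disk in discrete time).
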
Article
Full-text available
Given a family of real or complex monic polynomials of fixed degree with one affine constraint on their coefficients, consider the problem of minimizing the root radius (largest modulus of the roots) or root abscissa (largest real part of the roots). We give constructive methods for efficiently computing the globally optimal value as well as an opt...
Article
We discuss two nonsmooth functions on $\mathbb{R}^n$ introduced by Nesterov. We show that the first variant is partly smooth in the sense of Lewis and that its only stationary point is the global minimizer. In contrast, we show that the second variant has $2^{n-1}$ Clarke stationary points, none of them local minimizers except the global minimizer, but also tha...
Article
Full-text available
The ε-pseudospectral abscissa α_ε and radius ρ_ε of an n × n matrix are respectively the maximal real part and the maximal modulus of points in its ε-pseudospectrum, defined using the spectral norm. It was proved in [A.S. Lewis and C.H.J. Pang, Variational analysis of pseudospectra, SIAM Journal on Optimization, 19:1048-1072, 2008] that for fixed ε >...
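For reference, the standard spectral-norm definitions referred to above are
\[
\Lambda_{\varepsilon}(A) = \{\, z \in \mathbb{C} : \sigma_{\min}(zI - A) \le \varepsilon \,\}, \qquad
\alpha_{\varepsilon}(A) = \max\{\operatorname{Re}\, z : z \in \Lambda_{\varepsilon}(A)\}, \qquad
\rho_{\varepsilon}(A) = \max\{|z| : z \in \Lambda_{\varepsilon}(A)\}.
\]
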
Article
Full-text available
Given a family of real or complex monic polynomials of fixed degree with one affine constraint on their coefficients, consider the problem of minimizing the root radius (largest modulus of the roots) or root abscissa (largest real part of the roots). We give constructive methods for efficiently computing the globally optimal value as well as an opt...
Conference Paper
Full-text available
Given a family of real or complex monic polynomials of fixed degree with one fixed affine constraint on their coefficients, consider the problem of minimizing the root radius (largest modulus of the roots) or abscissa (largest real part of the roots). We give constructive methods for finding globally optimal solutions to these problems. In the real...
