Article

Adaptive Restart for Accelerated Gradient Schemes

Authors: Brendan O'Donoghue and Emmanuel J. Candès

Abstract

In this paper we demonstrate a simple heuristic adaptive restart technique that can dramatically improve the convergence rate of accelerated gradient schemes. The analysis of the technique relies on the observation that these schemes exhibit two modes of behavior depending on how much momentum is applied. In what we refer to as the 'high momentum' regime the iterates generated by an accelerated gradient scheme exhibit a periodic behavior, where the period is proportional to the square root of the local condition number of the objective function. This suggests a restart technique whereby we reset the momentum whenever we observe periodic behavior. We provide analysis to show that in many cases adaptively restarting allows us to recover the optimal rate of convergence with no prior knowledge of function parameters.
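To make the heuristic concrete, the sketch below wraps Nesterov's accelerated gradient method with the two restart tests discussed in the paper: a function scheme that resets the momentum when the objective increases, and a gradient scheme that resets it when the momentum step points against the negative gradient. It is a minimal illustration rather than the authors' reference implementation, and the quadratic test problem, step size, and iteration counts are assumptions.

```python
import numpy as np

def accelerated_gradient_with_restart(f, grad_f, x0, step, n_iter=500, scheme="gradient"):
    """Nesterov's accelerated gradient with heuristic adaptive restart.

    scheme="gradient": restart when grad_f(y)^T (x_new - x) > 0
    scheme="function": restart when f(x_new) > f(x)
    """
    x = y = np.asarray(x0, dtype=float)
    theta = 1.0
    values = []
    for _ in range(n_iter):
        g = grad_f(y)
        x_new = y - step * g                              # gradient step from the extrapolated point
        theta_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * theta ** 2))
        restart = (g @ (x_new - x) > 0) if scheme == "gradient" else (f(x_new) > f(x))
        if restart:
            theta_new = 1.0                               # reset momentum
            y = x_new
        else:
            y = x_new + ((theta - 1.0) / theta_new) * (x_new - x)
        x, theta = x_new, theta_new
        values.append(f(x))
    return x, values

# Illustrative strongly convex quadratic (assumed test problem, not from the paper).
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))
Q = A.T @ A + 1e-2 * np.eye(50)
b = rng.standard_normal(50)
f = lambda x: 0.5 * x @ Q @ x - b @ x
grad_f = lambda x: Q @ x - b
x_best, hist = accelerated_gradient_with_restart(f, grad_f, np.zeros(50), step=1.0 / np.linalg.norm(Q, 2))
```

Without the restart branch, the same loop reproduces the plain accelerated method, whose iterates can exhibit the periodic behavior described in the abstract.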


... for some λ > 0. To improve the convergence property of FISTA, gradient-based restart [72] was used. Furthermore, the parameter λ was optimized for each δ via exhaustive search to minimize the unnormalized square error in the last iteration. ...
... 16, and 1/σ² = 40 dB. Bayesian AMP is compared to OMP [71] and FISTA with backtracking [10], gradient-based restart [72], and optimized λ = √(0.8σ²M⁻¹ log N) [53]. 10⁵ independent trials were simulated for Bayesian AMP while 10⁴ independent trials were used for OMP and FISTA. ...
... Toward solving this challenging problem, a possible direction in future ... 16, and σ² = 0. Bayesian GAMP is compared to BIHT [22] and GLasso [23], [27], as well as the state evolution for Bayesian GAMP. GLasso was implemented by using FISTA with backtracking [10], gradient-based restart [72], and optimized λ in (89) for each δ. 20 iterations and 10⁴ independent trials were performed for all algorithms. ...
Article
Full-text available
This paper addresses the reconstruction of an unknown signal vector with sublinear sparsity from generalized linear measurements. Generalized approximate message-passing (GAMP) is proposed via state evolution in the sublinear sparsity limit, where the signal dimension N , measurement dimension M , and signal sparsity k satisfy log k / log N → γ ∈ [0, 1) and M /{ k log( N / k )} → δ as N and k tend to infinity. While the overall flow in state evolution is the same as that for linear sparsity, each proof step for inner denoising requires stronger assumptions than those for linear sparsity. The required new assumptions are proved for Bayesian inner denoising. When Bayesian outer and inner denoisers are used in GAMP, the obtained state evolution recursion is utilized to evaluate the prefactor δ in the sample complexity, called reconstruction threshold. If and only if δ is larger than the reconstruction threshold, Bayesian GAMP can achieve asymptotically exact signal reconstruction. In particular, the reconstruction threshold is finite for noisy linear measurements when the support of non-zero signal elements does not include a neighborhood of zero. As numerical examples, this paper considers linear measurements and 1-bit compressed sensing. Numerical simulations for both cases show that Bayesian GAMP outperforms existing algorithms for sublinear sparsity in terms of the sample complexity.
... • Function restart (FR) by O'Donoghue and Candès [41,Section 3.2]: ...
... • Gradient restart (GR) by O'Donoghue and Candès [41,Section 3.2]: ...
... The approximate Pareto front at iteration k = 25, with N = 100 initial sample points in [−2, 2]^n for the first example (41). ...
Preprint
Full-text available
In this work, based on the continuous-time approach, we propose an accelerated gradient method with adaptive residual restart for convex multiobjective optimization problems. First, we rigorously derive the continuous limit of the multiobjective accelerated proximal gradient method by Tanabe et al. [Comput. Optim. Appl., 2023]. It is a second-order ordinary differential equation (ODE) that involves a special projection operator and can be viewed as an extension of the ODE by Su et al. [J. Mach. Learn. Res., 2016] for Nesterov's accelerated gradient method. Then, we introduce a novel accelerated multiobjective gradient (AMG) flow with tailored time scaling that adapts automatically to the convex case and the strongly convex case, and the exponential decay rate of a merit function along the solution trajectory of the AMG flow is established via Lyapunov analysis. After that, we consider an implicit-explicit time discretization and obtain an accelerated multiobjective gradient method with a convex quadratic programming subproblem. Fast sublinear and linear rates are proved for convex and strongly convex problems, respectively. In addition, we present an efficient residual-based adaptive restart technique to overcome the oscillation issue and improve the convergence significantly. Numerical results are provided to validate the practical performance of the proposed method.
... Many scholars have made extensive attempts to address this problem. B. O'Donoghue and E. Candès [9] introduced a function restart strategy and a gradient restart strategy to enhance the convergence speed of Nesterov's method. Further, Nguyen et al. [10] proposed the accelerated residual method, which can be regarded as a finite-difference approximation of the second-order ODE system. ...
... In this subsection, we construct a class of second-order systems designed to address the unconstrained optimization problem (9), ensuring that the solution converges to the optimal outcome within the prescribed time under the influence of these second-order systems. ...
... By substituting (10) into (9), we obtain the optimization problem min_{y∈R^n} f(y), (11) which is equivalent to (9), where y = y(δ), f : R^n → R, and f is a µ-strongly convex, differentiable, smooth function. The optimal value of problem (11) is also f(x⋆). ...
Article
Full-text available
In machine learning, the processing of datasets is an unavoidable topic. One important approach to solving this problem is to design some corresponding algorithms so that they can eventually converge to the optimal solution of the optimization problem. Most existing acceleration algorithms exhibit asymptotic convergence. In order to ensure that the optimization problem converges to the optimal solution within the prescribed time, a novel prescribed-time convergence acceleration algorithm with time rescaling is presented in this paper. Two prescribed-time acceleration algorithms are constructed by introducing time rescaling, and the acceleration algorithms are used to solve unconstrained optimization problems and optimization problems containing equation constraints. Some important theorems are given, and the convergence of the acceleration algorithms is proven using the Lyapunov function method. Finally, we provide numerical simulations to verify the effectiveness and rationality of our theoretical results.
... Extensive research has been dedicated to developing a theoretical understanding of these methods [1,6,13,17,35,37], extending their scopes [7,9,14,19,20,22,24], and improving their practical performance [5,25,27,32,36]. Among many approaches to enhance the convergence of accelerated gradient methods in practice, in particular to suppress the oscillating behavior, the restart technique has shown remarkable improvement in the context of CSCO [2,3,4,8,10,27,29,34,35,36]. The most natural restart scheme is to restart the accelerated gradient method after a fixed number of iterations, and an optimal fixed restart scheme is presented in [29]. ...
... Various adaptive restart schemes have also been explored in the literature. Paper [34] proposes a function restart scheme (i.e., it restarts when the function value increases) and a gradient restart scheme (i.e., it restarts when the momentum term and the negative gradient make an obtuse angle). However, this paper only provides a heuristic discussion but no non-asymptotic convergence rate. ...
... which together with (34) implies that a_j ∇γ_j(x_j) + (A_{j+1} µ + 1)(x_{j+1} − x_j) = 0. ...
Preprint
Full-text available
This paper presents a novel restarted version of Nesterov's accelerated gradient method and establishes its optimal iteration-complexity for solving convex smooth composite optimization problems. The proposed restart accelerated gradient method is shown to be a specific instance of the accelerated inexact proximal point framework introduced in "An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods" by Monteiro and Svaiter, SIAM Journal on Optimization, 2013. Furthermore, this work examines the proximal bundle method within the inexact proximal point framework, demonstrating that it is an instance of the framework. Notably, this paper provides new insights into the underlying algorithmic principle that unifies two seemingly disparate optimization methods, namely, the restart accelerated gradient and the proximal bundle methods.
... In this paper, motivated by the extrapolation techniques for accelerating the proximal gradient algorithm in the convex settings, we consider a proximal difference-of-convex algorithm with extrapolation to possibly accelerate the proximal DCA. We show that any cluster point of the sequence generated by our algorithm is a stationary point of the DC optimization problem for a fairly general choice of extrapolation parameters: in particular, the parameters can be chosen as in FISTA with fixed restart [15]. In addition, by assuming the Kurdyka-Lojasiewicz property of the objective and the differentiability of the concave part, we establish global convergence of the sequence generated by our algorithm and analyze its convergence rate. ...
... When the proximal mapping of the proper closed convex function is easy to compute, the subproblems of the proximal DCA can be solved efficiently. However, this algorithm may take a lot of iterations: indeed, when the concave part of the objective is void, the proximal DCA reduces to the proximal gradient algorithm for convex optimization problems, which can be slow in practice [15,Section 5]. ...
... It is known that the function values generated by FISTA converge at a rate of O(1/k²), which is faster than the O(1/k) convergence rate of the proximal gradient algorithm. We refer the readers to [8,15] for more examples of such algorithms. In view of the success of extrapolation techniques in accelerating the proximal gradient algorithm for convex optimization problems, and noting that the proximal gradient algorithm and the proximal DCA are the same when applied to convex problems, in this paper we incorporate extrapolation techniques to possibly accelerate the proximal DCA in the general DC setting. ...
Preprint
We consider a class of difference-of-convex (DC) optimization problems whose objective is level-bounded and is the sum of a smooth convex function with Lipschitz gradient, a proper closed convex function and a continuous concave function. While this kind of problem can be solved by the classical difference-of-convex algorithm (DCA) [26], the difficulty of the subproblems of this algorithm depends heavily on the choice of DC decomposition. Simpler subproblems can be obtained by using a specific DC decomposition described in [27]. This decomposition has been proposed in numerous works such as [18], and we refer to the resulting DCA as the proximal DCA. Although the subproblems are simpler, the proximal DCA is the same as the proximal gradient algorithm when the concave part of the objective is void, and hence is potentially slow in practice. In this paper, motivated by the extrapolation techniques for accelerating the proximal gradient algorithm in the convex settings, we consider a proximal difference-of-convex algorithm with extrapolation to possibly accelerate the proximal DCA. We show that any cluster point of the sequence generated by our algorithm is a stationary point of the DC optimization problem for a fairly general choice of extrapolation parameters: in particular, the parameters can be chosen as in FISTA with fixed restart [15]. In addition, by assuming the Kurdyka-Łojasiewicz property of the objective and the differentiability of the concave part, we establish global convergence of the sequence generated by our algorithm and analyze its convergence rate. Our numerical experiments on two difference-of-convex regularized least squares models show that our algorithm usually outperforms the proximal DCA and the general iterative shrinkage and thresholding algorithm proposed in [17].
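For concreteness, the following is a minimal sketch of the extrapolation-parameter schedule referred to above: the FISTA weights β_k = (t_k − 1)/t_{k+1} with the sequence t_k reset to 1 every fixed number of iterations (fixed restart). The restart period and the function name are illustrative assumptions, not taken from the cited works.

```python
def fista_betas_fixed_restart(n_iter, restart_every=200):
    """Yield FISTA extrapolation weights beta_k = (t_k - 1) / t_{k+1},
    resetting t to 1 every `restart_every` iterations (fixed restart)."""
    t = 1.0
    for k in range(n_iter):
        if k > 0 and k % restart_every == 0:
            t = 1.0                                   # fixed restart: drop accumulated momentum
        t_next = 0.5 * (1.0 + (1.0 + 4.0 * t * t) ** 0.5)
        yield (t - 1.0) / t_next                      # used as x_k + beta_k * (x_k - x_{k-1})
        t = t_next

betas = list(fista_betas_fixed_restart(5, restart_every=3))  # approx [0.0, 0.28, 0.43, 0.0, 0.28]
```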
... The main step of both Algorithms 1 and 2 is to solve two subproblems at Step 3. For (40), these two problems can be solved explicitly as ...
... Note that we can write (40) ... Hence, we can apply the Chambolle-Pock primal-dual algorithm [11] to solve (40). We test the first 7 algorithms mentioned above on some synthetic data generated as follows. ...
... We also implement the restarting variants of both algorithms with a frequency of s = 20 iterations. For FISTA, we compute the proximal operator prox_{‖·‖_TV} using a primal-dual method as in [11], setting the number of iterations to 25 and 50, respectively, and also use a fixed restarting strategy after every 50 iterations [40]. ...
Preprint
We develop two new proximal alternating penalty algorithms to solve a wide class of constrained convex optimization problems. Our approach mainly relies on a novel combination of the classical quadratic penalty, alternating minimization, Nesterov's acceleration, and an adaptive strategy for parameters. The first algorithm is designed to solve generic and possibly nonsmooth constrained convex problems without requiring any Lipschitz gradient continuity or strong convexity, while achieving the best-known O(1/k) convergence rate in a non-ergodic sense, where k is the iteration counter. The second algorithm is also designed to solve non-strongly convex, but semi-strongly convex problems. This algorithm can achieve the best-known O(1/k²) convergence rate on the primal constrained problem. Such a rate is obtained in two cases: (i) averaging only on the iterate sequence of the strongly convex term, or (ii) using two proximal operators of this term without averaging. In both algorithms, we allow one to linearize the second subproblem to use the proximal operator of the corresponding objective term. Then, we customize our methods to solve different convex problems, leading to new variants. As a byproduct, these algorithms preserve the same convergence guarantees as in our main algorithms. We verify our theoretical development via different numerical examples and compare our methods with some existing state-of-the-art algorithms.
... All inner loop subproblems are solved using the fast iterative shrinkage/thresholding algorithm (FISTA) with constraints encoded as proximal operators (76). To improve the convergence rate, we implemented the FISTA solver in combination with an adaptive restart scheme as proposed by O'Donoghue and Candès (77). Our implementation of the sparse tensor decomposition model presented here is freely available as the Python package Barnacle. ...
... The model itself can be accessed via the "SparseCP" class, modeled after the standardized application program interface (API) of Tensorly's decomposition class. Our implementation of the FISTA algorithm with adaptive restart (76,77) for solving constrained inner loop least-squares problems is under the "fista" module of Barnacle. Code for constructing and manipulating the simulated data tensors used in model development and evaluation is available under the "tensors" module. ...
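The combination described in these excerpts, a proximal/projected FISTA step with Nesterov momentum and an adaptive restart, can be sketched generically as below. This is not Barnacle's "fista" module; the least-squares objective, the nonnegativity constraint used as the proximal operator, and the function-value restart test are illustrative assumptions.

```python
import numpy as np

def fista_adaptive_restart(A, b, prox, n_iter=300):
    """FISTA for min_x 0.5*||A x - b||^2 + g(x), where `prox` applies the
    proximal (or projection) operator of g, with a function-value adaptive restart."""
    L = np.linalg.norm(A, 2) ** 2                  # Lipschitz constant of the smooth part
    x = y = np.zeros(A.shape[1])
    t = 1.0
    obj = lambda z: 0.5 * np.sum((A @ z - b) ** 2)
    prev_obj = obj(x)
    for _ in range(n_iter):
        x_new = prox(y - (A.T @ (A @ y - b)) / L)  # forward (gradient) step, then backward (prox) step
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        if obj(x_new) > prev_obj:                  # adaptive restart: objective increased
            y, t_new = x_new, 1.0
        else:
            y = x_new + ((t - 1.0) / t_new) * (x_new - x)
        prev_obj = obj(x_new)
        x, t = x_new, t_new
    return x

# Example constraint handled through its projection operator (assumed, for illustration).
nonneg = lambda z: np.maximum(z, 0.0)
rng = np.random.default_rng(1)
A, b = rng.standard_normal((80, 40)), rng.standard_normal(80)
x_hat = fista_adaptive_restart(A, b, nonneg)
```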
Article
Full-text available
Microbes respond to changes in their environment by adapting their physiology through coordinated adjustments to the expression levels of functionally related genes. To detect these shifts in situ, we developed a sparse tensor decomposition method that derives gene co-expression patterns from inherently complex whole community RNA sequencing data. Application of the method to metatranscriptomes of the abundant marine cyanobacteria Prochlorococcus and Synechococcus identified responses to scarcity of two essential nutrients, nitrogen and iron, including increased transporter expression, restructured photosynthesis and carbon metabolism, and mitigation of oxidative stress. Further, expression profiles of the identified gene clusters suggest that both cyanobacteria populations experience simultaneous nitrogen and iron stresses in a transition zone between North Pacific oceanic gyres. The results demonstrate the power of our approach to infer organism responses to environmental pressures, hypothesize functions of uncharacterized genes, and extrapolate ramifications for biogeochemical cycles in a changing ecosystem.
... Recently, there has been significant interest in using restarts to accelerate the convergence of first-order methods [1,13,27,33,34,37,39,44,46,47,49,52,57,62,63,67,69,70,72]. A restart scheme involves repeatedly using the output of an optimization algorithm as the initial point for a new instance, or "restart". ...
... There is a large amount of recent work on adaptive first-order methods [33,34,37,39,52,63,72]. Adaptive methods seek to learn when to restart a first-order method by trying various values for the method's parameters and observing consequences over several iterations. ...
Article
Full-text available
Sharpness is an almost generic assumption in continuous optimization that bounds the distance from minima by objective function suboptimality. It facilitates the acceleration of first-order methods through restarts. However, sharpness involves problem-specific constants that are typically unknown, and restart schemes typically reduce convergence rates. Moreover, these schemes are challenging to apply in the presence of noise or with approximate model classes (e.g., in compressive imaging or learning problems), and they generally assume that the first-order method used produces feasible iterates. We consider the assumption of approximate sharpness, a generalization of sharpness that incorporates an unknown constant perturbation to the objective function error. This constant offers greater robustness (e.g., with respect to noise or relaxation of model classes) for finding approximate minimizers. By employing a new type of search over the unknown constants, we design a restart scheme that applies to general first-order methods and does not require the first-order method to produce feasible iterates. Our scheme maintains the same convergence rate as when the constants are known. The convergence rates we achieve for various first-order methods match the optimal rates or improve on previously established rates for a wide range of problems. We showcase our restart scheme in several examples and highlight potential future applications and developments of our framework and theory.
... In this part, we present the mathematical formulation of the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) with adaptive restart. We follow the approach proposed by Beck et al. [38] and enhanced by the adaptive restart scheme of O'Donoghue and Candès [39]. This optimization algorithm is particularly well-suited to solving our MSGL problem because it ensures accurate minimization of the objective function while maintaining both predictive accuracy and computational efficiency. ...
... To improve the performance of FISTA, we have implemented an adaptive restart strategy based on the recommendation made by O'Donoghue and Candès [39]. This technique can automatically restart the algorithm when the objective function shows a specific non-monotonic behavior. ...
Article
Full-text available
The rapid growth of machine learning (ML) across fields has intensified the challenge of selecting the right algorithm for specific tasks, known as the Algorithm Selection Problem (ASP). Traditional trial-and-error methods have become impractical due to their resource demands. Automated Machine Learning (AutoML) systems automate this process, but often neglect the group structures and sparsity in meta-features, leading to inefficiencies in algorithm recommendations for classification tasks. This paper proposes a meta-learning approach using Multivariate Sparse Group Lasso (MSGL) to address these limitations. Our method models both within-group and across-group sparsity among meta-features to manage high-dimensional data and reduce multicollinearity across eight meta-feature groups. The Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) with adaptive restart efficiently solves the non-smooth optimization problem. Empirical validation on 145 classification datasets with 17 classification algorithms shows that our meta-learning method outperforms four state-of-the-art approaches, achieving 77.18% classification accuracy, 86.07% recommendation accuracy and 88.83% normalized discounted cumulative gain.
... Important examples include Nesterov's accelerated gradient method [23], most famously applied in a modified form in FISTA [13], the Barzilai-Borwein step size technique [24] in, e.g., SpaRSA, GPSR and FPC [14], [15], [17], and Polyak's "heavy ball" momentum [25]. Another acceleration heuristic related to Nesterov's method is function or gradient resetting [26], [27]. Recently in the context of model-aided deep learning [28], the authors of [29] propose a variant of the learnable ISTA [30], where the gradient descent step is accompanied by a trajectory-based acceleration step similar to Polyak's momentum [25] with a learnable step size. ...
... We compare our proposed algorithms P-STELA, N-STELA and PN-STELA against the STELA [33]. In addition, we benchmark against the classical FISTA with backtracking and a backtracking step size of η = 1.2 (FISTA-BTRK) [13], and the FISTA with backtracking and the surprisingly effective function resetting heuristic (FISTA-BTRK-RST) [26], [27]. FISTA-BTRK does not require computing the largest eigenvalue of A^T A. We instead initialize G in (2) with the low-complexity lower bound max_i ∑_j [A]_{i,j}². ...
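As a brief illustration of the backtracking variant mentioned in this excerpt, the sketch below performs the standard FISTA backtracking test: the local Lipschitz estimate L is multiplied by a factor η until the quadratic upper bound at the extrapolated point holds. The soft-thresholding prox, the factor η = 1.2, and all problem data are assumptions for illustration; this is not the cited authors' code.

```python
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def fista_backtracking(A, b, lam, n_iter=200, L0=1.0, eta=1.2):
    """FISTA with backtracking for min_x 0.5*||A x - b||^2 + lam*||x||_1."""
    f = lambda z: 0.5 * np.sum((A @ z - b) ** 2)        # smooth part only
    grad = lambda z: A.T @ (A @ z - b)
    x = y = np.zeros(A.shape[1])
    t, L = 1.0, L0
    for _ in range(n_iter):
        g = grad(y)
        while True:                                     # backtracking on the Lipschitz estimate
            x_new = soft_threshold(y - g / L, lam / L)
            d = x_new - y
            if f(x_new) <= f(y) + g @ d + 0.5 * L * np.sum(d ** 2):
                break
            L *= eta
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x
```

An adaptive restart test, as shown in the earlier sketches, can be layered on top of this loop without changing the backtracking step.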
Article
Full-text available
We consider the minimization of ℓ1-regularized least-squares problems. A recent optimization approach uses successive convex approximations with an exact line search, which is highly competitive, especially in sparse problem instances. This work proposes an acceleration scheme for the successive convex approximation technique with a negligible additional computational cost. We demonstrate this scheme by devising three related accelerated algorithms with provable convergence. The first introduces an additional descent step along the past optimization trajectory in the variable update, that is inspired by Nesterov's accelerated gradient method and uses a closed-form step size. The second performs a simultaneous descent step along both the best response and the past trajectory, thereby finding a two-dimensional step size, also in closed-form. The third algorithm combines the previous two approaches. All algorithms are hyperparameter-free. Empirical results confirm that the acceleration approaches improve the convergence rate compared to benchmark algorithms, and that they retain the benefits of successive convex approximation also in non-sparse instances.
... Experiment 2: geometric programming. We consider a log-sum-exp minimization problem which commonly arises in geometric programming [46,51,52] and is used to test optimization algorithms [21,41]. In particular, it is formulated as ...
... In particular, when η → 0, the objective function converges to the point-wise maximum function max(Jx + b) and its Hessian vanishes. We follow the experimental setups in [21,41], which use m = 100, n = 20, and generate the entries of J and b randomly. We perform the experiments with small values of η to test the robustness of the methods. ...
... According to (31) and KL inequality (10), let k = 3t T , then it holds that ...
... The vector b is generated in the same way as in the former case. The extrapolation parameter in IRL1e is chosen in the same way as in FISTA with fixed restart [31]: ...
Article
Full-text available
In this paper, we propose a Bregman proximal iteratively reweighted algorithm with extrapolation based on block coordinate update, aimed at solving a class of optimization problems which is the sum of a smooth, possibly nonconvex loss function and a general nonconvex regularizer with a separable structure. The proposed algorithm can be used to solve the ℓ_p (0 < p < 1) regularization problem by employing an update strategy of the smoothing parameter in its smooth approximation model. When the extrapolation parameter satisfies certain conditions, the global convergence and local convergence rate are obtained by using the Kurdyka–Łojasiewicz (KL) property of the objective function. Numerical experiments are given to indicate the superiority of the proposed algorithm.
... In this case, Newton's method might converge to a different solution or fail to converge altogether and gradient descent may get stuck in local minima or saddle points. To mitigate some of the aforementioned issues, a body of variations and enhancements, such as stochastic gradient descent [10], momentum [11], adaptive learning rate methods [12][13][14], quasi-Newton method [15], inexact Newton method [16], and modified Newton method [17] have been developed. However, it still seems somewhat blind to tune parameters in diverse methods. ...
... According to (13), it is easy to get that ...
Article
In the paper, we investigate the optimization problem (OP) by applying the optimal control method. The optimization problem is reformulated as an optimal control problem (OCP) where the controller (iteration updating) is designed to minimize the sum of costs in the future time instant, which thus theoretically generates the “optimal algorithm” (fastest and most stable). By adopting the maximum principle and linearization with Taylor expansion, new algorithms are proposed. It is shown that the proposed algorithms have a superlinear convergence rate and thus converge more rapidly than the gradient descent; meanwhile, they are superior to Newton’s method because they are not divergent in general and can be applied in the case of a singular or indefinite Hessian matrix. More importantly, the OCP method contains the gradient descent and the Newton’s method as special cases, which discovers the theoretical basis of gradient descent and Newton’s method and reveals how far these algorithms are from the optimal algorithm. The merits of the proposed optimization algorithm are illustrated by numerical experiments.
... Not only does accelerated gradient descent converge considerably faster than traditional gradient descent, but it also performs a more robust local search of the parameter space by initially overshooting and then oscillating back as it settles into a final configuration, thereby selecting only local minimizers with a basin of attraction large enough to contain the initial overshoot. This behavior has made accelerated and stochastic gradient search methods particularly popular within the machine learning community [29,26,25,21,20,19,15,14,5,40]. So far, however, accelerated optimization methods have been restricted to searches over finite-dimensional parameter spaces. ...
... Momentum methods, including stochastic variants [15,19], have become very popular in machine learning in recent years [5,14,20,21,25,29,40,26]. Strategic dynamically changing weights on the momentum term can further boost the descent rate. ...
Preprint
Following the seminal work of Nesterov, accelerated optimization methods have been used to powerfully boost the performance of first-order, gradient-based parameter estimation in scenarios where second-order optimization strategies are either inapplicable or impractical. Not only does accelerated gradient descent converge considerably faster than traditional gradient descent, but it also performs a more robust local search of the parameter space by initially overshooting and then oscillating back as it settles into a final configuration, thereby selecting only local minimizers with a basin of attraction large enough to contain the initial overshoot. This behavior has made accelerated and stochastic gradient search methods particularly popular within the machine learning community. In their recent PNAS 2016 paper, Wibisono, Wilson, and Jordan demonstrate how a broad class of accelerated schemes can be cast in a variational framework formulated around the Bregman divergence, leading to continuum-limit ODEs. We show how their formulation may be further extended to infinite-dimensional manifolds (starting here with the geometric space of curves and surfaces) by substituting the Bregman divergence with inner products on the tangent space and explicitly introducing a distributed mass model which evolves in conjunction with the object of interest during the optimization process. The co-evolving mass model, which is introduced purely for the sake of endowing the optimization with helpful dynamics, also links the resulting class of accelerated PDE based optimization schemes to fluid dynamical formulations of optimal mass transport.
... F(x_k) − F⋆ ∈ O((1 − µ)^k) for a problem-dependent 0 < µ < 1 [NNG12, DL16], whereas one needs to know explicitly the strong convexity parameter in order to set accelerated gradient and accelerated coordinate descent methods to have a linear rate of convergence, see for instance [LS13, LMH15, LLX14, Nes12, Nes13]. Setting the algorithm with an incorrect parameter may result in a slower algorithm, sometimes even slower than if we had not tried to set an acceleration scheme [OC12]. This is a major drawback of the method because in general, the strong convexity parameter is difficult to estimate. ...
... Nesterov [Nes13] also showed that, instead of deriving a new method designed to work better for strongly convex functions, one can restart the accelerated gradient method and get a linear convergence rate. However, the restarting frequency he proposed still depends explicitly on the strong convexity of the function, and so O'Donoghue and Candès [OC12] introduced some heuristics to adaptively restart the algorithm and obtain good results in practice. ...
Preprint
By analyzing accelerated proximal gradient methods under a local quadratic growth condition, we show that restarting these algorithms at any frequency gives a globally linearly convergent algorithm. This result was previously known only for long enough frequencies. Then, as the rate of convergence depends on the match between the frequency and the quadratic error bound, we design a scheme to automatically adapt the frequency of restart from the observed decrease of the norm of the gradient mapping. Our algorithm has a better theoretical bound than previously proposed methods for the adaptation to the quadratic error bound of the objective. We illustrate the efficiency of the algorithm on a Lasso problem and on a regularized logistic regression problem.
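The adaptation idea sketched in this abstract, choosing the restart frequency from the observed decrease of the (gradient-mapping) norm, can be illustrated with a toy rule: run a restarted block of accelerated steps and, if the gradient norm has not contracted enough over the block, lengthen the period. The doubling rule, the contraction target, and the smooth unconstrained setting are assumptions for illustration, not the precise scheme of the cited preprint.

```python
import numpy as np

def restarted_nesterov_adaptive_period(grad_f, x0, step, K0=10, outer=20, target=0.5):
    """Accelerated gradient run in restarted blocks of length K; K is doubled whenever
    the gradient norm fails to contract by `target` over a block (toy adaptation rule)."""
    x = np.asarray(x0, dtype=float)
    K = K0
    for _ in range(outer):
        g0 = np.linalg.norm(grad_f(x))
        y, x_prev, t = x, x, 1.0
        for _ in range(K):                                # one restarted block of accelerated steps
            x_next = y - step * grad_f(y)
            t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
            y = x_next + ((t - 1.0) / t_next) * (x_next - x_prev)
            x_prev, t = x_next, t_next
        x = x_prev
        if np.linalg.norm(grad_f(x)) > target * g0:       # block too short to contract enough
            K *= 2
    return x
```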
... Two formulations are proposed based on single-bit sparse maximum-likelihood estimation (MLE) and single-bit compressed sensing. For MLE, a first-order proximal method that is optimal in terms of iteration complexity [30] is designed using adaptive restart to further speed up the convergence rate [31]. The proposed compressed sensing (CS) formulation can be, fortuitously, harnessed by invoking the recent single-bit CS literature. ...
... Algorithm 2 illustrates the proposed first-order ℓ1-regularization algorithm incorporating Nesterov's extrapolation method. In addition, an adaptive restart mechanism [31] is utilized in order to further speed up the convergence rate. Experimental evidence on our problems shows that it works remarkably well. ...
Preprint
Channel state information (CSI) at the base station (BS) is crucial to achieve beamforming and multiplexing gains in multiple-input multiple-output (MIMO) systems. State-of-the-art limited feedback schemes require feedback overhead that scales linearly with the number of BS antennas, which is prohibitive for 5G massive MIMO. This work proposes novel limited feedback algorithms that lift this burden by exploiting the inherent sparsity in double directional (DD) MIMO channel representation using overcomplete dictionaries. These dictionaries are associated with angle of arrival (AoA) and angle of departure (AoD) that specifically account for antenna directivity patterns at both ends of the link. The proposed algorithms achieve satisfactory channel estimation accuracy using a small number of feedback bits, even when the number of transmit antennas at the BS is large -- making them ideal for 5G massive MIMO. Judicious simulations reveal that they outperform a number of popular feedback schemes, and underscore the importance of using angle dictionaries matching the given antenna directivity patterns, as opposed to uniform dictionaries. The proposed algorithms are lightweight in terms of computation, especially on the user equipment side, making them ideal for actual deployment in 5G systems.
... Using the previous estimates (34) and (35), and since α > 1, we deduce that ... We now have all the ingredients to prove the weak convergence of trajectories. From Lemma 2.2, (16), we have ...
... The numerical experiments also suggest that, in accordance with [9], taking α large may improve the rate of convergence to the solution with minimum norm. Restarting may also be an efficient strategy, see [35], [41]. ...
Preprint
In a Hilbert space setting ℋ, we study the convergence properties as t → +∞ of the trajectories of the second-order differential equation (AVD)_{α,ε}: ẍ(t) + (α/t) ẋ(t) + ∇Φ(x(t)) + ε(t) x(t) = 0, where ∇Φ is the gradient of a convex continuously differentiable function Φ : ℋ → ℝ, α is a positive parameter, and ε(t) x(t) is a Tikhonov regularization term, with lim_{t→∞} ε(t) = 0. In this damped inertial system, the damping coefficient α/t vanishes asymptotically, but not too quickly, a key property to obtain rapid convergence of the values. In the case ε(·) ≡ 0, this dynamic has been highlighted recently by Su, Boyd, and Candès as a continuous version of the Nesterov accelerated method. Depending on the speed of convergence of ε(t) to zero, we analyze the convergence properties of the trajectories of (AVD)_{α,ε}. We obtain results ranging from the rapid convergence of Φ(x(t)) to min Φ when ε(t) decreases rapidly to zero, up to the strong ergodic convergence of the trajectories to the element of minimal norm of the set of minimizers of Φ, when ε(t) tends slowly to zero.
... It can be seen as an explicit-implicit discretization of a nonlinear second-order dynamical system (oscillator) with viscous damping that vanishes asymptotically in a moderate way [67,2]. While the stepsize κ is chosen in a similar way to ISTA (though with a smaller upper-bound), in our implementation, we tweak the original approach by using a Barzilai-Borwein step size, a standard line search, and restart [3], since this led to improved performance. ...
... The first two algorithms are standard "first-order" algorithms; the next group of algorithms use active-set strategies; and the final group of three algorithms use a diagonal ± rank-1 proximal mapping. Our implementation of FISTA used the Barzilai-Borwein stepsize [4] and line search, and restarted the momentum term every 1000 iterations [3]. L-BFGS-B and ASA use the reformulation of (6.1). ...
Preprint
We introduce a framework for quasi-Newton forward-backward splitting algorithms (proximal quasi-Newton methods) with a metric induced by diagonal ± rank-r symmetric positive definite matrices. This special type of metric allows for a highly efficient evaluation of the proximal mapping. The key to this efficiency is a general proximal calculus in the new metric. By using duality, formulas are derived that relate the proximal mapping in a rank-r modified metric to the original metric. We also describe efficient implementations of the proximity calculation for a large class of functions; the implementations exploit the piecewise linear nature of the dual problem. Then, we apply these results to acceleration of composite convex minimization problems, which leads to elegant quasi-Newton methods for which we prove convergence. The algorithm is tested on several numerical examples and compared to a comprehensive list of alternatives in the literature. Our quasi-Newton splitting algorithm with the prescribed metric compares favorably against state-of-the-art. The algorithm has extensive applications including signal processing, sparse recovery, machine learning and classification to name a few.
... Recently, accelerated, or "optimal" [14,15], first-order methods have received substantial attention, particularly for solving large-scale optimization problems; see, e.g., [3,4,13,16]. Advantages of most of these methods include ease of implementation, cheap computation per each iteration, and fast local convergence. ...
... The accelerated method in (12) and (13) is not guaranteed to be monotone in the dual objective value. Therefore, we follow O'Donoghue and Candès [16] in incorporating an adaptive restart technique of the acceleration scheme. Namely, we perform the restart procedure whenever the momentum term, r^{(k+1)} − r^{(k)}, and the gradient of the objective function, ∇ψ(r^{(k)}), make an obtuse angle, i.e., ∇ψ(r^{(k)})^⊤ (r^{(k+1)} − r^{(k)}) < 0. ...
Preprint
The Uzawa method is a method for solving constrained optimization problems, and is often used in computational contact mechanics. The simplicity of this method is an advantage, but its convergence is slow. This paper presents an accelerated variant of the Uzawa method. The proposed method can be viewed as application of an accelerated projected gradient method to the Lagrangian dual problem. Preliminary numerical experiments suggest that the convergence of the proposed method is much faster than the original Uzawa method.
... Therefore, setting α ∈ (0, 1) is anticipated to enhance the convergence rate, which will be experimentally verified in Section 4. This is in fact a method for addressing the vanishing gradient problem [18], one of the common challenges in generative model training, where the gradient diminishes when propagating backward through the neural network. Usually, the problem is alleviated by introducing extra mechanisms, e.g., implementing an instantaneous gradient decay rate detector [19], combining smooth and nonsmooth functions (usually involving regularization) [20], or restarting the algorithm after a fixed number of iterations [21]. However, the proposed α-GAN has an accelerated gradient design nested in its loss function already, where the acceleration can be varied by changing the hyperparameter α. ...
Preprint
Full-text available
This paper proposes α-GAN, a generative adversarial network using Rényi measures. The value function is formulated, by Rényi cross entropy, as an expected certainty measure incurred by the discriminator's soft decision as to where the sample is from, the true population or the generator. The discriminator tries to maximize the Rényi certainty about the sample source, while the generator wants to reduce it by injecting fake samples. This forms a min-max problem with the solution parameterized by the Rényi order α. This α-GAN reduces to the vanilla GAN at α = 1, where the value function is exactly the binary cross entropy. The optimization of α-GAN is over probability (vector) space. It is shown that the gradient is exponentially enlarged when the Rényi order is in the range α ∈ (0, 1). This makes convergence faster, which is verified by experimental results. A discussion shows that choosing α ∈ (0, 1) may be able to solve some common problems, e.g., the vanishing gradient. A following observation reveals that this range has not been fully explored in existing Rényi-version GANs.
... We will explain the preconditioning technique in Proposition 2 later. Within the widely employed gradient descent framework, [24] pioneered two distinct restart strategies for the extrapolation parameter: a gradient-based scheme [14,16,29] and a function-value-based scheme. The former, taking [29] for example, represents a computationally efficient implementation, while the latter, a function-value-based restart scheme, i.e., restarting whenever E(x_n) > E(x_{n−1}), which conceptually parallels the line search techniques proposed here, provides enhanced theoretical guarantees. ...
Preprint
Full-text available
This paper proposes a novel proximal difference-of-convex (DC) algorithm enhanced with extrapolation and aggressive non-monotone line search for solving non-convex optimization problems. We introduce an adaptive conservative update strategy of the extrapolation parameter determined by a computationally efficient non-monotone line search. The core of our algorithm is to unite the update of the extrapolation parameter with the step size of the non-monotone line search interactively. The global convergence of the two proposed algorithms is established through the Kurdyka-Łojasiewicz properties, ensuring convergence within a preconditioned framework for linear equations. Numerical experiments on two general non-convex problems: SCAD-penalized binary classification and graph-based Ginzburg-Landau image segmentation models, demonstrate the proposed method's high efficiency compared to existing DC algorithms both in convergence rate and solution accuracy.
... Momentum-based accelerated methods may achieve faster convergence rates, but at the expense of sacrificing the structural robustness properties [22]- [24]. For instance, the continuous-time Nesterov's accelerated dynamics may be unstable under forward Euler's approximation [25]. ...
Preprint
In this paper, we investigate distributed Nash equilibrium seeking for a class of two-subnetwork zero-sum games characterized by bilinear coupling. We present a distributed primal-dual accelerated mirror-descent algorithm that guarantees convergence. However, we demonstrate that this time-varying algorithm is not robust, as it fails to converge under even the slightest disturbances. To address this limitation, we introduce a distributed accelerated algorithm that employs a coordinated restarting mechanism. We model this new algorithm as a hybrid dynamical system and establish that it possesses structural robustness.
... The subproblems in these algorithms are solved according to the procedures described in the appendices of [40] and [41]. We use the same strategy of choosing β_k as in the FISTA with fixed and adaptive restart described in [18]. In more detail, we set the initial values ϑ_{−1} = ϑ_0 = 1 and define, for k ≥ 0, ...
Article
Full-text available
We revisit and adapt the extended sequential quadratic method (ESQM) in Auslender (J Optim Theory Appl 156:183–212, 2013) for solving a class of difference-of-convex optimization problems whose constraints are defined as the intersection of level sets of Lipschitz differentiable functions and a simple compact convex set. Particularly, for this class of problems, we develop a variant of ESQM, called ESQM with extrapolation (ESQM_e), which incorporates Nesterov's extrapolation techniques for empirical acceleration. Under standard constraint qualifications, we show that the sequence generated by ESQM_e clusters at a critical point if the extrapolation parameters are uniformly bounded above by a certain threshold. Convergence of the whole sequence and the convergence rate are established by assuming the Kurdyka-Łojasiewicz (KL) property of a suitable potential function and imposing additional differentiability assumptions on the objective and constraint functions. In addition, when the objective and constraint functions are all convex, we show that linear convergence can be established if a certain exact penalty function is known to be a KL function with exponent 1/2; we also discuss how the KL exponent of such an exact penalty function can be deduced from that of the original extended objective (i.e., the sum of the objective and the indicator function of the constraint set). Finally, we perform numerical experiments to demonstrate the empirical acceleration of ESQM_e over a basic version of ESQM, and illustrate its effectiveness by comparing with the natural competing algorithm SCP_ls from Yu et al. (SIAM J Optim 31:2024–2054, 2021).
... Alas, the vanilla FISTA algorithm is prone to oscillatory behavior, which results in a sub-linear convergence rate of O(1/T²) after T iterations. In the following, we further accelerate the empirical convergence performance of the FISTA algorithm by incorporating a simple restart strategy based on the function value, originally proposed in (O'Donoghue & Candès, 2015). ...
Preprint
This paper investigates the problem of certifying optimality for sparse generalized linear models (GLMs), where sparsity is enforced through an ℓ0 cardinality constraint. While branch-and-bound (BnB) frameworks can certify optimality by pruning nodes using dual bounds, existing methods for computing these bounds are either computationally intensive or exhibit slow convergence, limiting their scalability to large-scale problems. To address this challenge, we propose a first-order proximal gradient algorithm designed to solve the perspective relaxation of the problem within a BnB framework. Specifically, we formulate the relaxed problem as a composite optimization problem and demonstrate that the proximal operator of the non-smooth component can be computed exactly in log-linear time complexity, eliminating the need to solve a computationally expensive second-order cone program. Furthermore, we introduce a simple restart strategy that enhances convergence speed while maintaining low per-iteration complexity. Extensive experiments on synthetic and real-world datasets show that our approach significantly accelerates dual bound computations and is highly effective in providing optimality certificates for large-scale problems.
... In non-convex optimization, this scheme can guarantee convergence under some suitable choices of β_k, but does not guarantee acceleration in general (it does in some specific cases, e.g., for matrix factorization [47]). For simplicity, we resort to a heuristic extrapolation with restart, similar to what is proposed in [35]; see Algorithm 1. The restarting procedure guarantees that the objective value goes down at each iteration. ...
Preprint
Full-text available
In this paper, we consider linear time-invariant continuous control systems which are bounded real, also known as scattering passive. Our main theoretical contribution is to show the equivalence between such systems and port-Hamiltonian (PH) systems whose factors satisfy certain linear matrix inequalities. Based on this result, we propose a formulation for the problem of finding the nearest bounded-real system to a given system, and design an algorithm combining alternating optimization and Nesterov's fast gradient method. This formulation also allows us to check whether a given system is bounded real by solving a semidefinite program, and provide a PH parametrization for it. We illustrate our proposed algorithms on real and synthetic data sets.
... In other words, the gradient is calculated with regard to the approximated future position of the parameter instead of the current position. Different variations of the momentum method have been introduced to improve the performance of NAG [33][34][35][36]. A quasi-hyperbolic variant of momentum-based SGD was proposed in [11], with the aim of obtaining a cheaper and more intuitive version of NAG. ...
Article
Full-text available
Deep learning networks have been trained using first-order-based methods. These methods often converge more quickly when combined with an adaptive step size, but they tend to settle at suboptimal points, especially when learning occurs in a large output space. When first-order-based methods are used with a constant step size, they oscillate near the zero-gradient region, which leads to slow convergence. However, these issues are exacerbated under nonconvexity, which can significantly diminish the performance of first-order methods. In this work, we propose a novel Boltzmann Probability Weighted Sine with a Cosine distance-based Adaptive Gradient (BSCAGrad) method. The step size in this method is carefully designed to mitigate the issue of slow convergence. Furthermore, it facilitates escape from suboptimal points, enabling the optimization process to progress more efficiently toward local minima. This is achieved by combining a Boltzmann probability-weighted sine function and cosine distance to calculate the step size. The Boltzmann probability-weighted sine function acts when the gradient vanishes and the cooling parameter remains moderate, a condition typically observed near suboptimal points. Moreover, using the sine function on the exponential moving average of the weight parameters leverages geometric information from the data. The cosine distance prevents zero in the step size. Together, these components accelerate convergence, improve stability, and guide the algorithm toward a better optimal solution. A theoretical analysis of the convergence rate under both convexity and nonconvexity is provided to substantiate the findings. The experimental results from language modeling, object detection, machine translation, and image classification tasks on real-world benchmark datasets, including CIFAR10, CIFAR100, PennTreeBank, PASCALVOC and WMT2014, demonstrate that the proposed step size outperforms traditional baseline methods.
... An extension of FISTA for non-convex and convex (and possibly nonsmooth) problems is studied in [74]. Here, the method recovers the restarted FISTA iteration [75][76][77] for the convex case under a suitable choice of stepsize. The method assumes that the objective can be decomposed as the difference of two convex functions. ...
... Combining (15) with (10), and noting that ... √(l/L) [21]. When the condition number, and especially the strong convexity coefficient l, is difficult to obtain, some studies [19,25,26] make use of time-varying momentum parameters, as in (3). Such a parameter-setting strategy is useful for estimating the condition number and formulating appropriate restart conditions for algorithms, and can also avoid excessive momentum components in the iterative direction. ...
Article
Full-text available
With the dramatic increase in model complexity and problem scale in the machine learning area, research on first-order stochastic methods and their accelerated variants for non-convex problems has attracted wide interest. However, most works on the convergence analysis of accelerated methods focus on general convex or strongly convex objective functions. In this paper, we consider an accelerated scheme coming from dynamical systems and ordinary differential equations, which has a simpler and more direct form than the traditional scheme. We construct auxiliary sequences of iteration points as analysis tools, which can be interpreted as an extension of Nesterov's estimate sequence to the non-convex case. We analyze the convergence property in different cases when the momentum parameters are fixed or vary over iterations. For non-smooth and general convex objective functions, we give a relaxed step-size requirement to ensure convergence. For the non-convex policy search problem in classical reinforcement learning, we propose an accelerated stochastic policy gradient method with a restart technique and construct numerical experiments to verify its effectiveness.
... The estimator at step 1 is the maximum likelihood estimator. The optimization itself is carried out by means of an accelerated projected gradient (APG) algorithm with adaptive restart [30] (see [31] for a combination of APG with a conjugate gradient method). An initial guess for the APG routine is supplied by the estimator found on the previous iteration of the protocol (on the first iteration a completely mixed state is substituted). ...
Preprint
Full-text available
Adaptive measurements have recently been shown to significantly improve the performance of quantum state and process tomography. However, the existing methods either cannot be straightforwardly applied to high-dimensional systems or are prohibitively computationally expensive. Here we propose and experimentally implement a novel tomographic protocol specially designed for the reconstruction of high-dimensional quantum states. The protocol shows qualitative improvement in infidelity scaling with the number of measurements and is fast enough to allow for complete state tomography of states with dimensionality up to 36.
... In particular, we modify the fast iterative soft thresholding algorithm (FISTA) described in [4,Section 4] to solve our subproblem. This approach extends a variety of accelerated gradient descent methods, most notably those of Nesterov [33][34][35], to minimization of composite convex functions; for further details regarding the acceleration process and motivation for why such acceleration is possible, we direct the reader to the references [1,6,17,27,38,42,44]. ...
Preprint
Linear discriminant analysis (LDA) is a classical method for dimensionality reduction, where discriminant vectors are sought to project data to a lower dimensional space for optimal separability of classes. Several recent papers have outlined strategies for exploiting sparsity for using LDA with high-dimensional data. However, many lack scalable methods for solution of the underlying optimization problems. We propose three new numerical optimization schemes for solving the sparse optimal scoring formulation of LDA based on block coordinate descent, the proximal gradient method, and the alternating direction method of multipliers. We show that the per-iteration cost of these methods scales linearly in the dimension of the data provided restricted regularization terms are employed, and cubically in the dimension of the data in the worst case. Furthermore, we establish that if our block coordinate descent framework generates convergent subsequences of iterates, then these subsequences converge to the stationary points of the sparse optimal scoring problem. We demonstrate the effectiveness of our new methods with empirical results for classification of Gaussian data and data sets drawn from benchmarking repositories, including time-series and multispectral X-ray data, and provide Matlab and R implementations of our optimization schemes.
... The efficiency of such algorithms strongly depends on the computation of prox_g. In addition, since (44) is strongly convex, restart strategies, as in [14,39,46] for first-order methods, can lead to faster convergence rates in practice. When g is absent, (44) reduces to a positive definite linear system H_k d = −h_k, which can be efficiently solved by conjugate gradient schemes or Cholesky methods. ...
Preprint
We propose a new proximal, path-following framework for a class of constrained convex problems. We consider settings where the nonlinear (and possibly non-smooth) objective part is endowed with a proximity operator, and the constraint set is equipped with a self-concordant barrier. Our approach relies on the following two main ideas. First, we re-parameterize the optimality condition as an auxiliary problem, such that a good initial point is available; by doing so, a family of alternative paths towards the optimum is generated. Second, we combine the proximal operator with path-following ideas to design a single-phase, proximal, path-following algorithm. Our method has several advantages. First, it allows handling non-smooth objectives via proximal operators; this avoids lifting the problem dimension in order to accommodate non-smooth components in optimization. Second, it consists of only a single phase: while the overall convergence rate of classical path-following schemes for self-concordant objectives does not suffer from the initialization phase, proximal path-following schemes undergo slow convergence in order to obtain a good starting point [TranDinh2013e]. In this work, we show how to overcome this limitation in the proximal setting and prove that our scheme has the same O(√ν log(1/ε)) worst-case iteration-complexity as standard approaches [Nesterov2004, Nesterov1994] without requiring an initial phase, where ν is the barrier parameter and ε is a desired accuracy. Finally, our framework allows errors in the calculation of proximal-Newton directions, without sacrificing the worst-case iteration complexity. We demonstrate the merits of our algorithm via three numerical examples, where proximal operators play a key role.
... This procedure can be performed for general h 1 and h 2 , but proximal mappings are efficient to compute for many common regularizers. Thus, we describe the basic algorithm using gradient-based proximal methods (e.g., accelerated gradient [26], and quasi-Newton [27]), which require the ability to compute a gradient and the proximal mapping. ...
Preprint
A semi-parametric, non-linear regression model in the presence of latent variables is introduced. These latent variables can correspond to unmodeled phenomena or unmeasured agents in a complex networked system. This new formulation allows joint estimation of certain non-linearities in the system, the direct interactions between measured variables, and the effects of unmodeled elements on the observed system. The particular form of the model adopted is justified, and learning is posed as a regularized empirical risk minimization. This leads to classes of structured convex optimization problems with a "sparse plus low-rank" flavor. Relations between the proposed model and several common model paradigms, such as those of Robust Principal Component Analysis (PCA) and Vector Autoregression (VAR), are established. Particularly in the VAR setting, the low-rank contributions can come from broad trends exhibited in the time series. Details of the algorithm for learning the model are presented. Experiments demonstrate the performance of the model and the estimation algorithm on simulated and real data.
Article
The Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) is widely used to solve inverse problems arising from various image processing applications. Although FISTA may achieve an $O(1/k^{2})$ convergence rate in theory, it can be slow in actual implementation because the algorithm employs only first-order information of the objective function. This motivates us to propose a Diagonal-Newton Fast Iterative Shrinkage-Thresholding Algorithm (DFISTA) by incorporating partial second-order information, in the form of a diagonal matrix, into the standard FISTA. The incorporation of second-order information through a diagonal matrix avoids the implementation difficulty of storing a dense matrix, rendering the algorithm suitable for large-scale problems. The diagonal components are obtained by approximating the spectrum of eigenvalues of the Hessian matrix via the least-change updating technique under the log-determinant norm subject to the weak secant relation. Convergence properties of DFISTA are also established under standard assumptions. Four images are used to test the efficiency of DFISTA. The full-reference quality metrics (SSIM, PSNR, and RMSE) and no-reference quality metrics (NIQE and BRISQUE) of the recovered images are also calculated. Both quantitative and visual results indicate that DFISTA performs better than FISTA. In addition, DFISTA is tested on an industrial defect image and a medical image. Moreover, the computational time of DFISTA is shown to be comparable to that of FISTA, indicating that the computational cost of both algorithms is similar. The numerical results suggest that DFISTA can be used as an alternative algorithm for image restoration problems.
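The structural idea, namely that a diagonal metric keeps the proximal step separable, can be sketched as follows. This is not the DFISTA update from the paper: the scalar secant fit used for the curvature estimate is only a stand-in for the log-determinant/weak-secant construction, and `grad`, `lam`, and the iteration counts are assumed inputs.

```python
import numpy as np

def weighted_soft_threshold(z, lam, d):
    """Prox of lam*||.||_1 in the metric of diag(d):
    argmin_x 0.5*(x-z)' diag(d) (x-z) + lam*||x||_1, solved componentwise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam / d, 0.0)

def diagonally_scaled_fista(grad, x0, lam, n_iter=300, eps=1e-8):
    """FISTA-style loop whose gradient step is scaled by a cheap curvature
    estimate d (here a scalar secant fit broadcast to a diagonal); the key
    point is that the prox step stays separable, so no dense matrix is stored."""
    x = np.asarray(x0, dtype=float)
    y, t = x.copy(), 1.0
    d = np.ones_like(x)                        # initial curvature estimate
    y_prev, g_prev = y.copy(), grad(y)
    for _ in range(n_iter):
        g = grad(y)
        s, r = y - y_prev, g - g_prev
        if s @ s > eps:                        # weak-secant-style scalar fit
            d = np.full_like(x, max((s @ r) / (s @ s), eps))
        x_new = weighted_soft_threshold(y - g / d, lam, d)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y_new = x_new + ((t - 1.0) / t_new) * (x_new - x)
        y_prev, g_prev = y, g
        x, y, t = x_new, y_new, t_new
    return x
```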
Preprint
Full-text available
Sparsity regularization has garnered significant interest across multiple disciplines, including statistics, imaging, and signal processing. Standard techniques for addressing sparsity regularization include iterative soft-thresholding algorithms and their accelerated variants. However, these algorithms rely on Landweber iteration, which can be computationally intensive. Therefore, there is a pressing need to develop a more efficient algorithm for sparsity regularization. The Singular Value Decomposition (SVD) method serves as a regularization strategy that does not require Landweber iterations; however, it is confined to classical quadratic regularization. This paper introduces two inversion schemes tailored for situations where the operator K is diagonal within a specific orthogonal basis, focusing on $\ell^p$ regularization with p = 1 and p = 1/2. Furthermore, we demonstrate that for a general linear compact operator K, the SVD method serves as an effective regularization strategy. We conduct several numerical experiments to evaluate the proposed methods' effectiveness. The results indicate that our algorithms not only operate faster but also achieve a higher success rate than traditional iterative methods.
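When the forward operator is diagonal in an orthonormal basis and the sparsity penalty is placed on the coefficients in that same basis, the ℓ1 problem decouples into scalar soft-thresholding problems and needs no Landweber iterations. The sketch below illustrates this for a matrix operator via its SVD; the function name and the assumption that the penalty acts on the right-singular-vector coefficients are ours, not the paper's.

```python
import numpy as np

def l1_solution_via_svd(K, y, lam):
    """Closed-form minimizer of 0.5*||K x - y||^2 + lam*||w||_1, where
    x = V w and V holds the right singular vectors of K (the 'diagonal
    operator' setting). Each coefficient is a scalar soft-thresholding
    problem, so no iteration is needed (assumes nonzero singular values)."""
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    c = s * (U.T @ y)                                    # sigma_i * <u_i, y>
    coef = np.sign(c) * np.maximum(np.abs(c) - lam, 0.0) / (s ** 2)
    return Vt.T @ coef                                   # back to x = V w
```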
Article
Full-text available
Light-matter interaction is exploited in spectroscopic techniques to access information about molecular, atomic or nuclear constituents of a sample. While scattered light carries both amplitude and phase information of the electromagnetic field, the latter is lost in intensity measurements. However, often the phase information is paramount to reconstruct the desired information of the target, as it is well known from coherent x-ray imaging. Here we introduce a phase retrieval method which allows us to reconstruct the field phase information from two-dimensional time- and energy-resolved spectra. We apply this method to the case of x-ray scattering off Mössbauer nuclei at a synchrotron radiation source. Knowledge of the phase allows also for the reconstruction of energy spectra from two-dimensional experimental data sets with excellent precision, without theoretical modelling of the sample. Our approach provides an efficient and accurate data analysis tool which will benefit x-ray quantum optics and Mössbauer spectroscopy with synchrotron radiation alike.
Article
Full-text available
Stochastic optimisation algorithms are the de facto standard for machine learning with large amounts of data. Handling only a subset of available data in each optimisation step dramatically reduces the per-iteration computational costs, while still ensuring significant progress towards the solution. Driven by the need to solve large-scale optimisation problems as efficiently as possible, the last decade has witnessed an explosion of research in this area. Leveraging the parallels between machine learning and inverse problems has allowed harnessing the power of this research wave for solving inverse problems. In this survey, we provide a comprehensive account of the state-of-the-art in stochastic optimisation from the viewpoint of variational regularisation for inverse problems where the solution is modelled as minimising an objective function. We cover topics such as variance reduction, acceleration and higher-order methods, and compare theoretical results with practical behaviour. We focus on the potential and the challenges for stochastic optimisation that are unique to variational regularisation for inverse imaging problems and are not commonly encountered in machine learning. We conclude the survey with illustrative examples on linear inverse problems in imaging to examine the advantages and disadvantages that this new generation of algorithms brings to the field of inverse problems.
Article
The reconstruction from the measured visibilities to the signal in radio interferometry is an ill-posed inverse problem. The compressed sensing technology represented by the sparsity averaging reweighted analysis (SARA) has been successfully applied to radio-interferometric imaging. However, the traditional SARA algorithm solves the L1-norm minimization problem instead of the L0-norm one, which introduces a bias. In this paper, an Lq proximal gradient algorithm with 0 < q < 1 is proposed to ameliorate the bias problem and obtain an accurate solution in radio interferometry. The proposed method efficiently solves the Lq-norm minimization problem by using the proximal gradient algorithm, and adopts restart and lazy-start strategies to reduce oscillations and accelerate the convergence rate. Numerical experiment results and quantitative analyses verify the effectiveness of the proposed method.
Article
Scaling to arbitrarily large bundle adjustment problems requires data and compute to be distributed across multiple devices. Centralized methods in prior works are only able to solve small or medium size problems due to overhead in computation and communication. In this paper, we present a fully decentralized method that alleviates computation and communication bottlenecks to solve arbitrarily large bundle adjustment problems. We achieve this by reformulating the reprojection error and deriving a novel surrogate function that decouples optimization variables from different devices. This function makes it possible to use majorization minimization techniques and reduces bundle adjustment to independent optimization subproblems that can be solved in parallel. Moreover, an efficient closed-form warm start strategy has been presented that always improves bundle adjustment estimates. We further apply Nesterov’s acceleration and adaptive restart to improve convergence while maintaining its theoretical guarantees. Despite limited peer-to-peer communication, our method has provable convergence to first-order critical points under mild conditions. On extensive benchmarks with public datasets, our method converges much faster than decentralized baselines with similar memory usage and communication load. Compared to centralized baselines using a single device, our method, while being decentralized, yields more accurate solutions with significant speedups of up to 953.7x over [Formula: see text] and 174.6x over [Formula: see text]. Code: https://github.com/facebookresearch/DABA
Article
Full-text available
The analysis of the acceleration behavior of gradient-based eigensolvers with preconditioning presents a substantial theoretical challenge. In this work, we present a novel framework for preconditioning on Riemannian manifolds and introduce a metric, the leading angle, to evaluate preconditioners for symmetric eigenvalue problems. We extend the locally optimal Riemannian accelerated gradient method for Riemannian convex optimization to develop the Riemannian acceleration with preconditioning (RAP) for symmetric eigenvalue problems, thereby providing theoretical evidence to support its acceleration. Our analysis of the Schwarz preconditioner for elliptic eigenvalue problems demonstrates that RAP achieves a convergence rate of $1-C\kappa^{-1/2}$, which is an improvement over the preconditioned steepest descent method's rate of $1-C\kappa^{-1}$. The exponent in $\kappa^{-1/2}$ is sharp, and numerical experiments confirm our theoretical findings.
Preprint
Full-text available
Nesterov's accelerated gradient method (NAG) marks a pivotal advancement in gradient-based optimization, achieving faster convergence compared to the vanilla gradient descent method for convex functions. However, its algorithmic complexity when applied to strongly convex functions remains unknown, as noted in the comprehensive review by Chambolle and Pock [2016]. This issue, aside from the critical step size, was addressed by Li et al. [2024b], with the monotonic case further explored by Fu and Shi [2024]. In this paper, we introduce a family of controllable momentum coefficients for forward-backward accelerated methods, focusing on the critical step size $s=1/L$. Unlike traditional linear forms, the proposed momentum coefficients follow an $\alpha$-th power structure, where the parameter $r$ is adaptively tuned to $\alpha$. Using a Lyapunov function specifically designed for $\alpha$, we establish a controllable $O(1/k^{2\alpha})$ convergence rate for the NAG-$\alpha$ method, provided that $r > 2\alpha$. At the critical step size, NAG-$\alpha$ achieves an inverse polynomial convergence rate of arbitrary degree by adjusting $r$ according to $\alpha > 0$. We further simplify the Lyapunov function by expressing it in terms of the iterative sequences $x_k$ and $y_k$, eliminating the need for phase-space representations. This simplification enables us to extend the controllable $O(1/k^{2\alpha})$ rate to the monotonic variant, M-NAG-$\alpha$, thereby enhancing optimization efficiency. Finally, by leveraging the fundamental inequality for composite functions, we extend the controllable $O(1/k^{2\alpha})$ rate to proximal algorithms, including the fast iterative shrinkage-thresholding algorithm (FISTA-$\alpha$) and its monotonic counterpart (M-FISTA-$\alpha$).
Preprint
Large Language Models (LLMs) have demonstrated exceptional performance across diverse tasks, yet their training remains highly resource-intensive and susceptible to critical challenges such as training instability. A predominant source of this instability stems from gradient and loss spikes, which disrupt the learning process, often leading to costly interventions like checkpoint recovery and experiment restarts, further amplifying inefficiencies. This paper presents a comprehensive investigation into gradient spikes observed during LLM training, revealing their prevalence across multiple architectures and datasets. Our analysis shows that these spikes can be up to $1000\times$ larger than typical gradients, substantially deteriorating model performance. To address this issue, we propose Spike-Aware Adam with Momentum Reset (SPAM), a novel optimizer designed to counteract gradient spikes through momentum reset and spike-aware gradient clipping. Extensive experiments, including both pre-training and fine-tuning, demonstrate that SPAM consistently surpasses Adam and its variants across various tasks, including (1) LLM pre-training from 60M to 1B, (2) 4-bit LLM pre-training, (3) reinforcement learning, and (4) time series forecasting. Additionally, SPAM facilitates memory-efficient training by enabling sparse momentum, where only a subset of momentum terms are maintained and updated. When operating under memory constraints, SPAM outperforms state-of-the-art memory-efficient optimizers such as GaLore and Adam-Mini. Our work underscores the importance of mitigating gradient spikes in LLM training and introduces an effective optimization strategy that enhances both training stability and resource efficiency at scale. Code is available at https://github.com/TianjinYellow/SPAM-Optimizer.git
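The two mechanisms named in the abstract, momentum reset and spike-aware gradient clipping, can be illustrated with a small Adam-like update. This is an illustrative sketch only, not the SPAM optimizer itself; the clipping rule, the reset schedule, and all hyperparameter names are assumptions.

```python
import numpy as np

def spike_aware_step(params, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                     spike_factor=50.0, reset_interval=500, eps=1e-8):
    """One Adam-like update with two safeguards inspired by the abstract:
    (1) clip gradient entries whose square greatly exceeds the running second
    moment ('spike-aware' clipping), and (2) periodically reset the momentum."""
    m, v, t = state["m"], state["v"], state["t"] + 1
    spike = grad ** 2 > spike_factor * (v + eps)              # detect spiked entries
    g = np.where(spike, np.sign(grad) * np.sqrt(spike_factor * (v + eps)), grad)
    if t % reset_interval == 0:                               # momentum reset
        m = np.zeros_like(m)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)                              # bias correction
    v_hat = v / (1 - beta2 ** t)
    state.update(m=m, v=v, t=t)
    return params - lr * m_hat / (np.sqrt(v_hat) + eps)
```

The `state` dictionary is expected to be initialized as `{"m": np.zeros_like(p), "v": np.zeros_like(p), "t": 0}` for a parameter array `p`.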
Article
This manuscript presents a predictive model using supervised machine learning, highlighting the importance of advanced optimization algorithms. Our approach focuses on a non-smooth loss function known for its effectiveness in supervised machine learning. To ensure desirable properties such as second derivatives and convexity, and to handle outliers, we use a smoothing function to approximate the loss function. This enables the development of robust and stable algorithms for accurate predictions. We introduce a new surrogate smoothing function that is twice differentiable and convex, enhancing the effectiveness of our methodology. Using optimization techniques, especially stochastic gradient descent with Nesterov momentum, we optimize the predictive model. We validate our algorithm through a comprehensive convergence analysis and extensive comparisons with two other prediction models. Our experiments on real datasets from insurance companies demonstrate the practical significance of our approach in predicting auto insurance customer interest.
Preprint
Full-text available
In the realm of gradient-based optimization, Nesterov's accelerated gradient method (NAG) is a landmark advancement, achieving an accelerated convergence rate that outperforms the vanilla gradient descent method for convex functions. However, for strongly convex functions, whether NAG converges linearly remains an open question, as noted in the comprehensive review by Chambolle and Pock [2016]. This issue, aside from the critical step size, was addressed by Li et al. [2024a] using a high-resolution differential equation framework. Furthermore, Beck [2017, Section 10.7.4] introduced a monotonically convergent variant of NAG, referred to as M-NAG. Despite these developments, the Lyapunov analysis presented in [Li et al., 2024a] cannot be directly extended to M-NAG. In this paper, we propose a modification to the iterative relation by introducing a gradient term, leading to a new gradient-based iterative relation. This adjustment allows for the construction of a novel Lyapunov function that excludes kinetic energy. The linear convergence derived from this Lyapunov function is independent of both the parameters of the strongly convex functions and the step size, yielding a more general and robust result. Notably, we observe that the gradient iterative relation derived from M-NAG is equivalent to that from NAG when the position-velocity relation is applied. However, the Lyapunov analysis does not rely on the position-velocity relation, allowing us to extend the linear convergence to M-NAG. Finally, by utilizing two proximal inequalities, which serve as the proximal counterparts of strongly convex inequalities, we extend the linear convergence to both the fast iterative shrinkage-thresholding algorithm (FISTA) and its monotonic counterpart (M-FISTA).
Chapter
Full-text available
The main goal of these lecture notes is to survey a series of recent works (Gillis and Sharma, Automatica 85:113–121, 2017; SIAM J. Numer. Anal. 56(2):1022–1047, 2018; Linear Algebra Appl. 623:258–281, 2021; Gillis et al., Numerical Linear Algebra Appl. 25(5):e2153, 2018; Linear Algebra Appl. 573:37–53, 2019; Appl. Numer. Math. 148:131–139, 2020; Choudhary et al., Numerical Linear Algebra Appl. 27(3):e2282, 2020) that aim at solving several nearness problems for a given system. As we will see, these problems can be written as distance problems of matrices or matrix pencils. To solve them, this series of recent works rely on a two-step approach. The first step parametrizes the system using a Port-Hamiltonian representation where stability is guaranteed via convex constraints on the parameters. The second step uses standard non-linear optimization algorithms to optimize these parameters, minimizing the distance between the given system and the sought parametrized stable system. In these lecture notes, we will illustrate this strategy in order to find the nearest stable continuous-time and discrete-time systems, the nearest stable matrix pair, and the nearest positive-real system, as well as generalizations when the eigenvalues need to belong to some set Omega (which is referred to as Omega stability).
Article
Full-text available
This paper considers a special but broad class of convex programming (CP) problems whose feasible region is a simple compact convex set intersected with the inverse image of a closed convex cone under an affine transformation. We study two first-order penalty methods for solving the above class of problems, namely: the quadratic penalty method and the exact penalty method. In addition to one or two gradient evaluations, an iteration of these methods requires one or two projections onto the simple convex set. We establish the iteration-complexity bounds for these methods to obtain two types of near-optimal solutions, namely: near primal and near primal-dual optimal solutions. Finally, we present variants, with possibly better iteration-complexity bounds than the aforementioned methods, which consist of applying penalty-based methods to the perturbed problem obtained by adding a suitable perturbation term to the objective function of the original CP problem.
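A minimal sketch of the quadratic penalty idea for this problem class is given below: the feasible region is a simple set X intersected with the preimage of a cone K under a linear map A, and each outer stage minimizes the penalized objective over X by projected gradient. The projections `proj_X` and `proj_K`, the penalty schedule, and the step size are assumed inputs; this illustrates the iteration structure, not the paper's algorithm.

```python
import numpy as np

def quadratic_penalty(grad_f, A, proj_X, proj_K, x0, rho0=1.0, rho_mult=10.0,
                      outer=6, inner=200, step=1e-2):
    """Quadratic-penalty sketch for  min f(x)  s.t.  x in X,  A x in K.
    Each outer stage minimizes f(x) + 0.5*rho*dist(Ax, K)^2 over the simple
    set X by projected gradient, then rho is increased. proj_X and proj_K are
    Euclidean projections onto X and K (assumed cheap)."""
    x, rho = np.asarray(x0, dtype=float), rho0
    for _ in range(outer):
        for _ in range(inner):
            r = A @ x - proj_K(A @ x)            # residual to the cone K
            g = grad_f(x) + rho * (A.T @ r)      # gradient of the penalized objective
            x = proj_X(x - step * g)             # one projection onto X per iteration
        rho *= rho_mult
    return x
```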
Article
Full-text available
In this paper we consider the general cone programming problem, and propose primal-dual convex (smooth and/or nonsmooth) minimization reformulations for it. We then discuss first-order methods suitable for solving these reformulations, namely, Nesterov’s optimal method (Nesterov in Doklady AN SSSR 269:543–547, 1983; Math Program 103:127–152, 2005), Nesterov’s smooth approximation scheme (Nesterov in Math Program 103:127–152, 2005), and Nemirovski’s prox-method (Nemirovski in SIAM J Opt 15:229–251, 2005), and propose a variant of Nesterov’s optimal method which has outperformed the latter one in our computational experiments. We also derive iteration-complexity bounds for these first-order methods applied to the proposed primal-dual reformulations of the cone programming problem. The performance of these methods is then compared using a set of randomly generated linear programming and semidefinite programming instances. We also compare the approach based on the variant of Nesterov’s optimal method with the low-rank method proposed by Burer and Monteiro (Math Program Ser B 95:329–357, 2003; Math Program 103:427–444, 2005) for solving a set of randomly generated SDP instances. Keywords: Cone programming; Primal-dual first-order methods; Smooth optimal method; Nonsmooth method; Prox-method; Linear programming; Semidefinite programming
Conference Paper
Full-text available
The fused Lasso penalty enforces sparsity in both the coefficients and their successive differences, which is desirable for applications with features ordered in some meaningful way. The resulting problem is, however, challenging to solve, as the fused Lasso penalty is both non-smooth and non-separable. Existing algorithms have high computational complexity and do not scale to large-size problems. In this paper, we propose an Efficient Fused Lasso Algorithm (EFLA) for optimizing this class of problems. One key building block in the proposed EFLA is the Fused Lasso Signal Approximator (FLSA). To efficiently solve FLSA, we propose to reformulate it as the problem of finding an "appropriate" subgradient of the fused penalty at the minimizer, and develop a Subgradient Finding Algorithm (SFA). We further design a restart technique to accelerate the convergence of SFA, by exploiting the special "structures" of both the original and the reformulated FLSA problems. Our empirical evaluations show that, both SFA and EFLA significantly outperform existing solvers. We also demonstrate several applications of the fused Lasso.
Article
Full-text available
This paper develops a general framework for solving a variety of convex cone problems that frequently arise in signal processing, machine learning, statistics, and other fields. The approach works as follows: first, determine a conic formulation of the problem; second, determine its dual; third, apply smoothing; and fourth, solve using an optimal first-order method. A merit of this approach is its flexibility: for example, all compressed sensing problems can be solved via this approach. These include models with objective functionals such as the total-variation norm, ||Wx||_1 where W is arbitrary, or a combination thereof. In addition, the paper also introduces a number of technical contributions such as a novel continuation scheme, a novel approach for controlling the step size, and some new results showing that the smooth and unsmoothed problems are sometimes formally equivalent. Combined with our framework, these lead to novel, stable and computationally efficient algorithms. For instance, our general implementation is competitive with state-of-the-art methods for solving intensively studied problems such as the LASSO. Further, numerical experiments show that one can solve the Dantzig selector problem, for which no efficient large-scale solvers exist, in a few hundred iterations. Finally, the paper is accompanied with a software release. This software is not a single, monolithic solver; rather, it is a suite of programs and routines designed to serve as building blocks for constructing complete algorithms.
Article
Full-text available
This paper examines the relationship between wavelet-based image processing algorithms and variational problems. Algorithms are derived as exact or approximate minimizers of variational problems; in particular, we show that wavelet shrinkage can be considered the exact minimizer of the following problem: given an image $F$ defined on a square $I$, minimize over all $g$ in the Besov space $B^1_1(L_1(I))$ the functional $\|F-g\|^2_{L_2(I)} + \lambda \|g\|_{B^1_1(L_1(I))}$. We use the theory of nonlinear wavelet image compression in $L_2(I)$ to derive accurate error bounds for noise removal through wavelet shrinkage applied to images corrupted with i.i.d., mean zero, Gaussian noise. A new signal-to-noise ratio (SNR), which we claim more accurately reflects the visual perception of noise in images, arises in this derivation. We present extensive computations that support the hypothesis that near-optimal shrinkage parameters can be derived if one knows (or can estimate) only two parameters about an image $F$: the largest $\alpha$ for which $F \in B^\alpha_q(L_q(I))$, $1/q = \alpha/2 + 1/2$, and the norm $\|F\|_{B^\alpha_q(L_q(I))}$. Both theoretical and experimental results indicate that our choice of shrinkage parameters yields uniformly better results than Donoho and Johnstone's VisuShrink procedure; an example suggests, however, that Donoho and Johnstone's (1994, 1995, 1996) SureShrink method, which uses a different shrinkage parameter for each dyadic level, achieves a lower error than our procedure.
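In practice, wavelet shrinkage amounts to soft-thresholding the detail coefficients of a wavelet decomposition. The sketch below assumes the PyWavelets package (`pywt`) and a one-dimensional signal; the wavelet family, decomposition level, and threshold value are placeholders rather than the near-optimal parameters derived in the paper.

```python
import pywt  # PyWavelets, assumed available

def wavelet_shrinkage(signal, wavelet="db4", level=4, lam=0.1):
    """Denoise by soft-thresholding the wavelet detail coefficients; the
    approximation coefficients are left untouched. lam plays the role of the
    shrinkage parameter discussed in the abstract."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    shrunk = [coeffs[0]] + [pywt.threshold(c, lam, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(shrunk, wavelet)
```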
Article
Full-text available
Suppose $x$ is an unknown vector in $\mathbb{R}^m$ (a digital image or signal); we plan to measure $n$ general linear functionals of $x$ and then reconstruct. If $x$ is known to be compressible by transform coding with a known transform, and we reconstruct via the nonlinear procedure defined here, the number of measurements $n$ can be dramatically smaller than the size $m$. Thus, certain natural classes of images with $m$ pixels need only $n = O(m^{1/4}\log^{5/2}(m))$ nonadaptive nonpixel samples for faithful recovery, as opposed to the usual $m$ pixel samples. More specifically, suppose $x$ has a sparse representation in some orthonormal basis (e.g., wavelet, Fourier) or tight frame (e.g., curvelet, Gabor), so the coefficients belong to an $\ell_p$ ball for $0 < p \le 1$. The $N$ most important coefficients in that expansion allow reconstruction with $\ell_2$ error $O(N^{1/2 - 1/p})$. It is possible to design $n = O(N\log(m))$ nonadaptive measurements allowing reconstruction with accuracy comparable to that attainable with direct knowledge of the $N$ most important coefficients. Moreover, a good approximation to those $N$ important coefficients is extracted from the $n$ measurements by solving a linear program (Basis Pursuit in signal processing). The nonadaptive measurements have the character of "random" linear combinations of basis/frame elements. Our results use the notions of optimal recovery, of $n$-widths, and information-based complexity. We estimate the Gel'fand $n$-widths of $\ell_p$ balls in high-dimensional Euclidean space in the case $0 < p \le 1$, and give a criterion identifying near-optimal subspaces for Gel'fand $n$-widths. We show that "most" subspaces are near-optimal, and show that convex optimization (Basis Pursuit) is a near-optimal way to extract information derived from these near-optimal subspaces.
Article
We propose a new method for estimation in linear models. The ‘lasso’ minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant. Because of the nature of this constraint it tends to produce some coefficients that are exactly 0 and hence gives interpretable models. Our simulation studies suggest that the lasso enjoys some of the favourable properties of both subset selection and ridge regression. It produces interpretable models like subset selection and exhibits the stability of ridge regression. There is also an interesting relationship with recent work in adaptive function estimation by Donoho and Johnstone. The lasso idea is quite general and can be applied in a variety of statistical models: extensions to generalized regression models and tree‐based models are briefly described.
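A minimal usage example, assuming scikit-learn is available, shows the coefficient-zeroing behaviour described above on synthetic data; the problem sizes and the penalty level `alpha=0.1` are arbitrary illustration choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
beta = np.zeros(20)
beta[:3] = [2.0, -1.5, 1.0]                 # only three active coefficients
y = X @ beta + 0.1 * rng.standard_normal(100)

model = Lasso(alpha=0.1).fit(X, y)          # l1-constrained / penalized least squares
print(np.flatnonzero(model.coef_))          # most coefficients come out exactly zero
```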
Chapter
Convex optimization problems arise frequently in many different fields. A comprehensive introduction to the subject, this book shows in detail how such problems can be solved numerically with great efficiency. The focus is on recognizing convex optimization problems and then finding the most appropriate technique for solving them. The text contains many worked examples and homework exercises and will appeal to students, researchers and practitioners in fields such as engineering, computer science, mathematics, statistics, finance, and economics.
Article
In the paper I give a brief review of the basic idea and some history and then discuss some developments since the original paper on regression shrinkage and selection via the lasso.
Article
Interior gradient (subgradient) and proximal methods for convex constrained minimization have been much studied, in particular for optimization problems over the nonnegative orthant. These methods use non-Euclidean projections and proximal distance functions to exploit the geometry of the constraints. In this paper, we identify a simple mechanism that allows us to derive global convergence results for the produced iterates as well as improved global rates of convergence estimates for a wide class of such methods, and with more general convex constraints. Our results are illustrated with many applications and examples, including some new explicit and simple algorithms for conic optimization problems. In particular, we derive a class of interior gradient algorithms which exhibits an $O(k^{-2})$ global convergence rate estimate.
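A concrete instance of such a non-Euclidean interior scheme is mirror descent with the entropy proximal distance on the probability simplex, sketched below; the multiplicative update keeps the iterates strictly positive, which is the geometric point the abstract makes. The step size and iteration count are illustrative.

```python
import numpy as np

def entropy_mirror_descent(grad, x0, step=0.1, n_iter=500):
    """Interior (non-Euclidean) gradient sketch on the probability simplex:
    with the entropy proximal distance, the update is multiplicative, so the
    iterates stay strictly positive without an explicit Euclidean projection.
    x0 should be a strictly positive starting point, e.g. the uniform vector."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x * np.exp(-step * grad(x))      # exponentiated-gradient step
        x /= x.sum()                         # renormalize onto the simplex
    return x
```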
Article
We consider the class of iterative shrinkage-thresholding algorithms (ISTA) for solving linear inverse problems arising in signal/image processing. This class of methods, which can be viewed as an extension of the classical gradient algorithm, is attractive due to its simplicity and thus is adequate for solving large-scale problems even with dense matrix data. However, such methods are also known to converge quite slowly. In this paper we present a new fast iterative shrinkage-thresholding algorithm (FISTA) which preserves the computational simplicity of ISTA but with a global rate of convergence which is proven to be significantly better, both theoretically and practically. Initial promising numerical results for wavelet-based image deblurring demonstrate the capabilities of FISTA, which is shown to be faster than ISTA by several orders of magnitude.
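The contrast between ISTA and FISTA can be made concrete with a short sketch for the ℓ1-regularized least-squares problem. The code below is a textbook-style illustration with a fixed step 1/L, where L is the squared spectral norm of A, not the authors' implementation.

```python
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam, n_iter=500):
    """Iterative shrinkage-thresholding for 0.5*||Ax - b||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft(x - A.T @ (A @ x - b) / L, lam / L)
    return x

def fista(A, b, lam, n_iter=500):
    """Same problem, with Nesterov-style momentum added to each ISTA step."""
    L = np.linalg.norm(A, 2) ** 2
    x = y = np.zeros(A.shape[1])
    t = 1.0
    for _ in range(n_iter):
        x_new = soft(y - A.T @ (A @ y - b) / L, lam / L)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x
```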
Article
In this article, we propose an algorithm, NESTA-LASSO, for the LASSO problem, i.e., an underdetermined linear least-squares problem with a 1-norm constraint on the solution. We prove, under the assumption of the restricted isometry property (RIP) and a sparsity condition on the solution, that NESTA-LASSO is guaranteed to be almost always locally linearly convergent. As in the case of the algorithm NESTA proposed by Becker, Bobin, and Candes, we rely on Nesterov's accelerated proximal gradient method, which takes $O(\epsilon^{-1/2})$ iterations to come within $\epsilon > 0$ of the optimal value. We introduce a modification to Nesterov's method that regularly updates the prox-center in a provably optimal manner, and the aforementioned linear convergence is in part due to this modification. In the second part of this article, we attempt to solve the basis pursuit denoising (BPDN) problem (i.e., approximating the minimum 1-norm solution to an underdetermined least-squares problem) by using NESTA-LASSO in conjunction with the Pareto root-finding method employed by van den Berg and Friedlander in their SPGL1 solver. The resulting algorithm is called PARNES. We provide numerical evidence to show that it is comparable to currently available solvers.
Article
In this paper we analyze several new methods for solving optimization problems with the objective function formed as a sum of two convex terms: one is smooth and given by a black-box oracle, and another is general but simple and its structure is known. Despite the bad properties of the sum, such problems, both in convex and nonconvex cases, can be solved with efficiency typical for the good part of the objective. For convex problems of the above structure, we consider primal and dual variants of the gradient method (converging as $O(1/k)$), and an accelerated multistep version with convergence rate $O(1/k^{2})$, where $k$ is the iteration counter. For all methods, we suggest some efficient "line search" procedures and show that the additional computational work necessary for estimating the unknown problem class parameters can only multiply the complexity of each iteration by a small constant factor. We present also the results of preliminary computational experiments, which confirm the superiority of the accelerated scheme.
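The "line search" idea can be sketched as a backtracking rule on the proximal gradient step: shrink the step until the quadratic upper model of the smooth part is valid, so no Lipschitz constant needs to be known in advance. Below, `f` and `grad` refer to the smooth term only and `prox` to the simple term's proximal map; the shrink factor and initial step are assumptions, and the sketch omits the step-increase heuristics a practical implementation would add.

```python
import numpy as np

def prox_grad_backtracking(f, grad, prox, x0, step0=1.0, shrink=0.5, n_iter=200):
    """Proximal gradient with a backtracking step-size rule: halve the step
    until the quadratic upper model of the smooth part f at x is valid."""
    x, step = np.asarray(x0, dtype=float), step0
    for _ in range(n_iter):
        g = grad(x)
        while True:
            x_new = prox(x - step * g, step)
            d = x_new - x
            if f(x_new) <= f(x) + g @ d + (d @ d) / (2.0 * step):
                break                         # upper model holds, accept the step
            step *= shrink                    # otherwise backtrack
        x = x_new
    return x
```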
Article
Conventional approaches to sampling signals or images follow Shannon's theorem: the sampling rate must be at least twice the maximum frequency present in the signal (Nyquist rate). In the field of data conversion, standard analog-to-digital converter (ADC) technology implements the usual quantized Shannon representation - the signal is uniformly sampled at or above the Nyquist rate. This article surveys the theory of compressive sampling, also known as compressed sensing or CS, a novel sensing/sampling paradigm that goes against the common wisdom in data acquisition. CS theory asserts that one can recover certain signals and images from far fewer samples or measurements than traditional methods use.
Article
Suppose we wish to recover a vector $x_0 \in \mathbb{R}^m$ (e.g., a digital signal or image) from incomplete and contaminated observations $y = Ax_0 + e$; $A$ is an $n$ by $m$ matrix with far fewer rows than columns ($n \ll m$) and $e$ is an error term. Is it possible to recover $x_0$ accurately based on the data $y$? To recover $x_0$, we consider the solution $x^\#$ to the $\ell_1$-regularization problem $\min \|x\|_{\ell_1}$ subject to $\|Ax - y\|_{\ell_2} \le \epsilon$, where $\epsilon$ is the size of the error term $e$. We show that if $A$ obeys a uniform uncertainty principle (with unit-normed columns) and if the vector $x_0$ is sufficiently sparse, then the solution is within the noise level, $\|x^\# - x_0\|_{\ell_2} \le C\epsilon$. As a first example, suppose that $A$ is a Gaussian random matrix; then stable recovery occurs for almost all such $A$'s provided that the number of nonzeros of $x_0$ is of about the same order as the number of observations. As a second instance, suppose one observes few Fourier samples of $x_0$; then stable recovery occurs for almost any set of $n$ coefficients provided that the number of nonzeros is of the order of $n/[\log m]^6$. In the case where the error term vanishes, the recovery is of course exact, and this work actually provides novel insights into the exact recovery phenomenon discussed in earlier papers. The methodology also explains why one can also very nearly recover approximately sparse signals.
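The ℓ1-recovery problem stated above is a second-order cone program and can be handed to a generic convex solver. The sketch below assumes the CVXPY package is available; the dimensions, sparsity level, and noise scale are arbitrary illustration choices.

```python
import cvxpy as cp
import numpy as np

# Recover a sparse x0 from noisy compressed measurements y = A x0 + e.
rng = np.random.default_rng(1)
n, m, k = 80, 256, 8
A = rng.standard_normal((n, m)) / np.sqrt(n)
x0 = np.zeros(m)
x0[rng.choice(m, k, replace=False)] = rng.standard_normal(k)
e = 0.01 * rng.standard_normal(n)
y = A @ x0 + e
eps = 1.1 * np.linalg.norm(e)               # assumed bound on the noise size

x = cp.Variable(m)
prob = cp.Problem(cp.Minimize(cp.norm1(x)), [cp.norm(A @ x - y, 2) <= eps])
prob.solve()
print(np.linalg.norm(x.value - x0))         # reconstruction error, on the order of the noise
```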
Article
We consider linear inverse problems where the solution is assumed to have a sparse expansion on an arbitrary pre-assigned orthonormal basis. We prove that replacing the usual quadratic regularizing penalties by weighted $\ell^p$-penalties on the coefficients of such expansions, with $1 \le p \le 2$, still regularizes the problem. If $p < 2$, regularized solutions of such $\ell^p$-penalized problems will have sparser expansions with respect to the basis under consideration. To compute the corresponding regularized solutions we propose an iterative algorithm that amounts to a Landweber iteration with thresholding (or nonlinear shrinkage) applied at each iteration step. We prove that this algorithm converges in norm. We also review some potential applications of this method.
Introduction to Optimization
  • B Polyak
B. Polyak. Introduction to Optimization. Translations Series in Mathematics and Engineering. Optimization Software, Publications Division, New York, 1987.
An introduction to compressive sampling
  • E Candès
  • M Wakin
E. Candès and M. Wakin. An introduction to compressive sampling. Signal Processing Magazine, IEEE, 25(2):21-30, March 2008.
Iteration complexity of first-order penalty methods for convex programming. Manuscript, School of Industrial and Systems Engineering
  • G Lan
  • R Monteiro