Article

Constrained and composite optimization via adaptive sampling methods


Abstract

The motivation for this paper stems from the desire to develop an adaptive sampling method for solving constrained optimization problems, in which the objective function is stochastic and the constraints are deterministic. The method proposed in this paper is a proximal gradient method that can also be applied to the composite optimization problem min f(x) + h(x), where f is stochastic and h is convex (but not necessarily differentiable). Adaptive sampling methods employ a mechanism for gradually improving the quality of the gradient approximation so as to keep computational cost to a minimum. The mechanism commonly employed in unconstrained optimization is no longer reliable in the constrained or composite optimization settings, because it is based on pointwise decisions that cannot correctly predict the quality of the proximal gradient step. The method proposed in this paper measures the result of a complete step to determine if the gradient approximation is accurate enough; otherwise, a more accurate gradient is generated and a new step is computed. Convergence results are established for both strongly convex and general convex f. Numerical experiments are presented to illustrate the practical behavior of the method.
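The step-acceptance mechanism described in the abstract can be sketched as follows. This is a toy illustration, not the authors' algorithm: the ℓ1 test problem, the step size, the acceptance threshold theta, and the sample-size cap are all our own choices.

```python
import math
import random

# Rough sketch (not the authors' algorithm) of an adaptive-sampling proximal
# gradient method for min f(x) + h(x) with stochastic f and h(x) = |x|.
# Toy problem: f(x) = E[(x - Z)^2], Z ~ N(1, 1), so the true minimizer of
# f + h is x = 0.5. The acceptance test, theta, and the sample cap are
# illustrative choices only.

def prox_l1(v, t):
    """Proximal operator of t*|.| (soft-thresholding)."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def stoch_grad(x, n, rng):
    """Sample-average gradient of E[(x - Z)^2] with n samples."""
    return sum(2.0 * (x - rng.gauss(1.0, 1.0)) for _ in range(n)) / n

def adaptive_prox_gradient(x0=5.0, alpha=0.1, theta=0.5, iters=150, seed=0):
    rng = random.Random(seed)
    x, n = x0, 4                      # start with a small sample size
    for _ in range(iters):
        g = stoch_grad(x, n, rng)
        x_trial = prox_l1(x - alpha * g, alpha)   # complete proximal step
        # Judge the *completed step*: if an independent gradient re-estimate
        # would have produced a noticeably different step, the sample was too
        # small; enlarge it and recompute rather than accept the step.
        g_check = stoch_grad(x, n, rng)
        if abs(g - g_check) * alpha > theta * abs(x_trial - x) + 1e-12:
            n = min(2 * n, 2048)      # more accurate gradient, retry
            continue
        x = x_trial
    return x, n
```

With these choices the iterates settle near the minimizer x = 0.5, and the sample size grows only once the gradient noise starts to dominate the step, mirroring the "measure the result of a complete step" idea in the abstract.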


... The approximated problem can then be solved with standard nonlinear optimization algorithms that often converge very rapidly, but only to a minimizer z⋆_M ≠ z⋆. Between these two classes, there are attractive modern strategies such as adaptive sampling methods (see, e.g., [14,15,16,17,18]), which aim to improve the efficiency of SA methods by dynamically adjusting the sample size along the optimization. The driving idea is to use cheap, but noisy, gradient evaluations far from the optimum, while more accurate and expensive estimates are computed close to the minimizer. ...
... For convex constrained optimization problems, (7) is not appropriate since ∇F may not vanish at the optimum. Nevertheless, [17,18] proved that the alternative condition ...
... where the term ∂_t F_ε cancels due to the definition of h. From (18), it follows that the alternating procedure ...
Preprint
Adaptive sampling algorithms are modern and efficient methods that dynamically adjust the sample size throughout the optimization process. However, they may encounter difficulties in risk-averse settings, particularly due to the challenge of accurately sampling from the tails of the underlying distribution of random inputs. This often leads to a much faster growth of the sample size compared to risk-neutral problems. In this work, we propose a novel adaptive sampling algorithm that adapts both the sample size and the sampling distribution at each iteration. The biasing distributions are constructed on the fly, leveraging a reduced-order model of the objective function to be minimized, and are designed to oversample a so-called risk region. As a result, a reduction of the variance of the gradients is achieved, which permits the use of fewer samples per iteration compared to a standard algorithm, while still preserving the asymptotic convergence rate. Our focus is on the minimization of the Conditional Value-at-Risk (CVaR), and we establish the convergence of the proposed computational framework. Numerical experiments confirm the substantial computational savings achieved by our approach.
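For readers unfamiliar with the risk measure involved: CVaR_beta is, informally, the expected loss over the worst (1 − beta) fraction of outcomes. A minimal Monte Carlo estimator (our own sketch, not the paper's adaptive biasing scheme) is:

```python
# Minimal Monte Carlo estimator of the Conditional Value-at-Risk (our own
# sketch, not the paper's adaptive scheme): CVaR_beta is approximated by
# averaging the worst (1 - beta) fraction of the sampled losses.

def cvar(samples, beta=0.9):
    """Average of the worst (1 - beta) fraction of the samples."""
    tail = sorted(samples, reverse=True)                  # largest losses first
    k = max(1, int(round(len(samples) * (1.0 - beta))))   # tail size
    return sum(tail[:k]) / k
```

For example, with losses 1..10 and beta = 0.5 the estimator averages the five largest losses, giving 8.0. The abstract's point is that for beta close to 1 only a small tail of the distribution contributes, which is what makes naive sampling wasteful and motivates the biasing distributions.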
... All statistics in our workflow are estimated via the Monte Carlo method. Furthermore, we use a stochastic gradient descent method [59] for numerical optimization and a novel adaptive sampling strategy [8,10,12,61] to reduce the optimization cost. ...
... In this section, we describe the iterative, gradient-based, adaptive stochastic optimization algorithm we have used to solve Prob. 1 and Prob. 2. Our algorithm reduces the cost of optimization by adjusting the number of simulations N = N_k in each gradient estimate based on the accuracy of the current design iterate Z_k, similar to [8,10,12,61]. ...
... For greater robustness and efficiency, we adaptively select each batch size N_k = |S_k| based on an a posteriori estimate of the statistical error described in the next subsection [8,10,11,12,61]; see also [38,51]. ...
Preprint
Full-text available
Reducing the intensity of wind excitation via aerodynamic shape modification is a major strategy to mitigate the reaction forces on supertall buildings, reduce construction and maintenance costs, and improve the comfort of future occupants. To this end, computational fluid dynamics (CFD) combined with state-of-the-art stochastic optimization algorithms is more promising than the trial and error approach adopted by the industry. The present study proposes and investigates a novel approach to risk-averse shape optimization of tall building structures that incorporates site-specific uncertainties in the wind velocity, terrain conditions, and wind flow direction. A body-fitted finite element approximation is used for the CFD with different wind directions incorporated by re-meshing the fluid domain. The bending moment at the base of the building is minimized, resulting in a building with reduced cost, material, and hence, a reduced carbon footprint. Both risk-neutral and risk-averse optimization of the twist and tapering of a representative building are presented under uncertain inflow wind conditions that have been calibrated to fit freely-available site-specific data from Basel, Switzerland. The risk-averse strategy uses the conditional value-at-risk to optimize for the low-probability high-consequence events appearing in the worst 10% of loading conditions. Adaptive sampling is used to accelerate the gradient-based stochastic optimization pipeline. The adaptive method is easy to implement and particularly helpful for compute-intensive simulations because the number of gradient samples grows only as the optimal design algorithm converges. The performance of the final risk-averse building geometry is exceptionally favorable when compared to the risk-neutral optimized geometry, thus, demonstrating the effectiveness of the risk-averse design approach in computational wind engineering.
... To this end, we formulate and solve a stochastic optimization problem to find the model design parameters minimizing a specific misfit measure between the synthetic samples and the data. To avoid oversampling, a progressive batching strategy is applied, whereby an appropriate number of samples is estimated at each iteration and adaptively updated using a specific test; see [18,19,20,21,22]. Such efficient stochastic programming methods are important for large-scale decision-making problems, especially in engineering design, where oversampling is significantly costly. ...
... We calibrate the model design parameters by solving the stochastic optimization problem (6). To avoid oversampling, a progressive batching strategy is applied, whereby the appropriate number of samples (batch size |S_k|) is estimated at each iteration k and adaptively updated to satisfy specific conditions; see [18,19,20,21]. In our implementation, we followed the progressive batching LBFGS algorithm proposed in [56]. ...
... The method is proposed in [56]. We also refer the interested reader to [73,19,20,21]. ...
Preprint
Full-text available
Manufactured materials usually contain random imperfections due to the fabrication process, e.g., 3D-printing, casting, etc. In this work, we present a new flexible class of digital surrogate models which imitate the manufactured material while respecting the statistical features of random imperfections. The surrogate model is constructed as the level-set of a linear combination of the intensity field representing the topological shape and the Gaussian perturbation representing the imperfections. The mathematical design parameters of the model are related to physical ones and thus easy to comprehend. The calibration of the model parameters is performed using progressive batching sub-sampled quasi-Newton minimization, with a designed distance measure between synthetic samples and the data. Then, owing to a fast sampling algorithm, we have access to an arbitrary number of synthetic samples that can be used in Monte Carlo type methods for uncertainty quantification of the material response. In particular, we illustrate the method with a surrogate model for an imperfect octet-truss lattice cell, which plays an important role in additive manufacturing. We also discuss potential model extensions.
... • It is easy to verify that in problem (1.2), the conditional expectation satisfies Assumption 2(i). • In stochastic methods, such as [6,12,20,53], etc., it is necessary to assume that the noisy gradient variance is bounded, i.e., ...
... This means that in dynamic sampling, we need to pair an effective step-size selection mechanism with the dynamically changing number of samples. This is also one of the reasons why dynamic sampling algorithms are often paired with stochastic line search [3,6,12,20,53]. · Finally, Figs. 4, 13 and 20 demonstrate that the dynamic sampling strategy we propose is effective in avoiding rapid growth of N_k, reaching only around 0.01N on the tasks we specified, and it suppresses the growth of the number of samples better than Prox-LISA. ...
Article
Full-text available
In the field of machine learning, many large-scale optimization problems can be decomposed into the sum of two functions: a smooth function and a nonsmooth function with a simple proximal mapping. In light of this, our paper introduces a novel variant of the proximal stochastic quasi-Newton algorithm, grounded in three key components: (i) developing an adaptive sampling method that dynamically increases the sample size during the iteration process, thus preventing rapid growth in sample size and mitigating the noise introduced by the stochastic approximation method; (ii) the integration of stochastic line search to ensure a sufficient decrease in the expected value of the objective function; and (iii) a stable update scheme for the stochastic modified limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm. For a general objective function, it can be proven that the limit points of the generated sequence almost surely converge to stationary points. Furthermore, the convergence rate and the number of required gradient computations for this process have been analyzed. In the case of a strongly convex objective function, a global linear convergence rate can be achieved, and the number of required gradient computations is thoroughly examined. Finally, numerical experiments demonstrate the robustness of the proposed method across various hyperparameter settings, establishing its competitiveness compared to state-of-the-art methods.
... We calibrate the model design parameters by solving the stochastic optimization problem (6). To avoid oversampling, a progressive batching strategy is applied, whereby the appropriate number of samples (batch size |S_k|) is estimated at each iteration k and adaptively updated to satisfy specific conditions; see [14,15,60,61]. In our implementation, we follow the progressive batching LBFGS algorithm proposed in [62]. ...
... The method is proposed in [62]. We also refer the interested reader to [14,15,61,81]. ...
Article
Manufactured materials usually contain random imperfections due to the fabrication process, e.g., 3D-printing, casting, etc. These imperfections significantly affect the effective material properties and result in uncertainties in the mechanical response. Numerical analysis of the effects of the imperfections and the uncertainty quantification (UQ) can often be done by use of digital stochastic surrogate material models. In this work, we present a new flexible class of surrogate models depending on a small number of parameters, together with a calibration strategy ensuring that the constructed model fits the available observation data, with special focus on two-phase materials. The surrogate models are constructed as the level-set of a linear combination of an intensity field representing the topological shape and a Gaussian perturbation representing the imperfections, allowing for fast sampling strategies. The mathematical design parameters of the model are related to physical ones and thus easy to interpret. The calibration of the model parameters is performed using progressive batching sub-sampled quasi-Newton minimization, using a designed distance measure between the synthetic samples and the data. Then, employing a fast sampling algorithm, an arbitrary number of synthetic samples can be generated for use in Monte Carlo type methods for prediction of effective material properties. In particular, we illustrate the method in application to UQ of the elasto-plastic response of an imperfect octet-truss lattice which plays an important role in additive manufacturing. To this end, we study the effective material properties of the lattice unit cell under elasto-plastic deformations and investigate the sensitivity of the effective Young’s modulus to the imperfections.
... Deterministic version of this condition has been used in [2] for the analysis of a proximal inexact trust-region algorithm. A stochastic version imposed in expectation was used in [4] and an alternative, that is meant to be more practical, is suggested in [60]. Further variants for general constrained optimization are proposed in [8]. ...
Article
Full-text available
We develop and analyze stochastic variants of ISTA and a full backtracking FISTA algorithm (Beck and Teboulle in SIAM J Imag Sci 2(1):183–202, 2009; Scheinberg et al. in Found Comput Math 14(3):389–417, 2014) for composite optimization without the assumption that the stochastic gradient is an unbiased estimator. This work extends the analysis of inexact fixed-step ISTA/FISTA in Schmidt et al. (Convergence rates of inexact proximal-gradient methods for convex optimization, 2022. arXiv:1109.2415) to the case of stochastic gradient estimates and an adaptive step-size parameter chosen by backtracking. It also extends the framework for analyzing stochastic line-search methods in Cartis and Scheinberg (Math Program 169(2):337–375, 2018) to the proximal gradient framework as well as to accelerated first-order methods.
... Related studies have also been extended to some nonlinear problems involving delays, such as Cohen-Grossberg neural networks or actuator-saturated dynamics, and to some stochastic control problems involving delays [29,30,31]. The distribution of the impulses through time is not necessarily periodic, since aperiodicity, and eventual sampling adaptation to the signal profiles, allows for the improvement of the data acquisition performance [32]; improvement of the system dynamics in discrete-time systems [33]; better fit of certain optimization processes [34]; improvement of the signal filtering and attenuation of the noise effects [35]; or improvement of behavior prediction [36]. Some strategies of non-periodic sampling are described and discussed in [37,38,39,40]. ...
Article
Full-text available
This paper formalizes the analytic expressions and some properties of the evolution operator that generates the state-trajectory of dynamical systems combining delay-free dynamics with a set of discrete, or point, constant (and not necessarily commensurate) delays, where the parameterizations of both the delay-free and the delayed parts can undergo impulsive changes. Also, particular evolution operators are defined explicitly for the non-impulsive and impulsive time-varying delay-free case, and also for the case of impulsive delayed time-varying systems. In the impulsive cases, in general, the evolution operators are non-unique. The delays are assumed to be a finite number of constant delays that are not necessarily commensurate, that is, all of them being integer multiples of a minimum delay. On the other hand, the impulsive actions through time are assumed to be state-dependent and to take place at certain isolated time instants on the matrix functions that define the delay-free and the delayed dynamics. Some variants are also proposed for the cases when the impulsive actions are state-independent or state- and dynamics-independent. The intervals in-between consecutive impulses can be, in general, time-varying while subject to a minimum threshold. The boundedness of the state-trajectory solutions, which imply the system’s global stability, is investigated in the most general case for any given piecewise-continuous bounded function of initial conditions defined on the initial maximum delay interval. Such a solution boundedness property can be achieved, even if the delay-free dynamics is unstable, by an appropriate distribution of the impulsive actions. An illustrative first-order example is developed in detail to illustrate the impulsive stabilization results.
... Gradient estimation methods have also been integrated into quasi-Newton approaches [2,7,27]. Additionally, adaptive sampling approaches, well-established in stochastic optimization [1,8,9,11,12,14,32], have recently been applied in DFO settings [7,10,30]. [7] generalized the norm condition [14] and the practical inner-product condition [8] to standard finite-difference based gradient estimation methods, while [10] extended these conditions to other forward finite-difference based gradient estimation methods. ...
Preprint
This paper presents an algorithmic framework for solving unconstrained stochastic optimization problems using only stochastic function evaluations. We employ central finite-difference based gradient estimation methods to approximate the gradients and dynamically control the accuracy of these approximations by adjusting the sample sizes used in stochastic realizations. We analyze the theoretical properties of the proposed framework on nonconvex functions. Our analysis yields sublinear convergence results to a neighborhood of the solution, and establishes the optimal worst-case iteration complexity \mathcal{O}(\epsilon^{-1}) and sample complexity \mathcal{O}(\epsilon^{-2}) for each gradient estimation method to achieve an \epsilon-accurate solution. Finally, we demonstrate the performance of the proposed framework and the quality of the gradient estimation methods through numerical experiments on nonlinear least squares problems.
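The central finite-difference gradient estimate referred to above has the generic form g_i = (f(x + h e_i) − f(x − h e_i)) / (2h). A deterministic sketch follows; in the paper's setting each evaluation of f would itself be a sample average over an adaptively chosen number of noisy realizations.

```python
# Generic central finite-difference gradient estimator. In the stochastic
# setting of the abstract, each f evaluation would itself be averaged over
# an adaptively chosen number of noisy samples; here f is deterministic
# for clarity.

def central_difference_grad(f, x, h=1e-5):
    """Approximate the gradient of f at x, one coordinate at a time."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h                   # perturb coordinate i forward...
        xm[i] -= h                   # ...and backward
        g.append((f(xp) - f(xm)) / (2.0 * h))
    return g
```

For f(x) = x_0² + 3 x_1 this returns approximately (2 x_0, 3); central differences are exact for quadratics up to rounding, which is one reason they give better accuracy per pair of evaluations than forward differences.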
... We determined the number of samples with numerical experiments systematically testing the effect of the sample size on the objective and resulting designs; from 300 samples onwards, the designs and objective showed only very minor differences. More details may be found in Appendix B. We remark that more sophisticated, adaptive algorithms can address the approximation of the expectation operator during stochastic optimization, such as the ones presented in [5,111,112,113,114,115]. They begin with a small number of samples and adaptively increase it as the solution approaches the minimum. ...
... These adaptive methods achieve both optimal theoretical convergence results and first-order complexity results to achieve an ϵ-accurate solution for the expectation problem. These methods have also been adapted to other problem settings, including derivative-free optimization [14,15,17,79] and stochastic constrained optimization [5,8,14,85]. In this work, we will adapt these methods to the Hessian-averaging based subsampled Newton methods. ...
Preprint
We consider minimizing finite-sum and expectation objective functions via Hessian-averaging based subsampled Newton methods. These methods allow for gradient inexactness and have fixed per-iteration Hessian approximation costs. The recent work (Na et al. 2023) demonstrated that Hessian averaging can be utilized to achieve fast \mathcal{O}\left(\sqrt{\tfrac{\log k}{k}}\right) local superlinear convergence for strongly convex functions in high probability, while maintaining fixed per-iteration Hessian costs. These methods, however, require gradient exactness and strong convexity, which poses challenges for their practical implementation. To address this concern we consider Hessian-averaged methods that allow gradient inexactness via norm condition based adaptive-sampling strategies. For the finite-sum problem we utilize deterministic sampling techniques which lead to global linear and sublinear convergence rates for strongly convex and nonconvex functions respectively. In this setting we are able to derive an improved deterministic local superlinear convergence rate of \mathcal{O}\left(\tfrac{1}{k}\right). For the expectation problem we utilize stochastic sampling techniques, and derive global linear and sublinear rates for strongly convex and nonconvex functions, as well as a \mathcal{O}\left(\tfrac{1}{\sqrt{k}}\right) local superlinear convergence rate, all in expectation. We present novel analysis techniques that differ from the previous probabilistic results. Additionally, we propose scalable and efficient variations of these methods via diagonal approximations and derive the novel diagonally-averaged Newton (Dan) method for large-scale problems. Our numerical results demonstrate that the Hessian averaging not only helps with convergence, but can lead to state-of-the-art performance on difficult problems such as CIFAR100 classification with ResNets.
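The Hessian-averaging idea at the core of the abstract is simply a running average of noisy curvature estimates, which keeps the per-iteration cost fixed while the noise averages out. A scalar sketch (ours, not the paper's method):

```python
# Running average of noisy Hessian estimates (scalar toy): the averaged
# curvature H_k = ((k-1) * H_{k-1} + H_k^noisy) / k concentrates around the
# true Hessian while each iteration touches only one new estimate.

def hessian_average(noisy_hessians):
    """Return the running average of a sequence of (noisy) Hessian values."""
    h = 0.0
    for k, hk in enumerate(noisy_hessians, start=1):
        h += (hk - h) / k            # incremental mean update
    return h
```

Averaging shrinks the Hessian noise at roughly a 1/√k rate, which is what enables the fast local rates quoted above without growing the per-iteration sampling cost.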
Article
Full-text available
We introduce adaptive sampling methods for stochastic programs with deterministic constraints. First, we propose and analyze a variant of the stochastic projected gradient method, where the sample size used to approximate the reduced gradient is determined on-the-fly and updated adaptively. This method is applicable to a broad class of expectation-based risk measures, and leads to a significant reduction in the individual gradient evaluations used to estimate the objective function gradient. Numerical experiments with expected risk minimization and conditional value-at-risk minimization support this conclusion, and demonstrate practical performance and efficacy for both risk-neutral and risk-averse problems. Second, we propose an SQP-type method based on similar adaptive sampling principles. The benefits of this method are demonstrated in a simplified engineering design application, featuring risk-averse shape optimization of a steel shell structure subject to uncertain loading conditions and model uncertainty.
Article
Full-text available
We consider a class of structured nonsmooth stochastic convex programs. Traditional stochastic approximation schemes in nonsmooth regimes are hindered by a convergence rate of \mathcal{O}(1/\sqrt{k}), compared with linear and sublinear (specifically \mathcal{O}(1/k^2)) rates in deterministic strongly convex and convex regimes, respectively. One avenue for addressing the gaps in the rates is through the use of an increasing batch size of gradients at each step, as adopted in the seminal paper by Ghadimi and Lan, where the optimal rate of \mathcal{O}(1/k^2) and the optimal oracle complexity of \mathcal{O}(1/\epsilon^2) were established in the smooth convex regime. Inspired by the work by Ghadimi and Lan and by extending our prior works, we make several contributions in the present work. (I) Strongly convex f. Here, we develop a variable sample-size accelerated proximal method (VS-APM) where the number of proximal evaluations to obtain an \epsilon-solution is shown to be \mathcal{O}(\sqrt{\kappa}\log(1/\epsilon)) while the oracle complexity is \mathcal{O}(\sqrt{\kappa}/\epsilon), both of which are optimal, where \kappa denotes the condition number; (II) Convex and nonsmooth f. In this setting, we develop an iterative smoothing extension of (VS-APM) (referred to as (sVS-APM)), where the sample average of gradients of a smoothed function is employed at every step. By suitably choosing the smoothing, steplength, and batch-size sequences, we prove that the expected sub-optimality diminishes to zero at the rate of \mathcal{O}(1/k) and admits the optimal oracle complexity of \mathcal{O}(1/\epsilon^2). (III) Convex f. Finally, we show that (sVS-APM) and (VS-APM) produce sequences that converge almost surely to a solution of the original problem.
Conference Paper
Full-text available
Classical stochastic gradient methods for optimization rely on noisy gradient approximations that become progressively less accurate as iterates approach a solution. The large noise and small signal in the resulting gradients makes it difficult to use them for adaptive stepsize selection and automatic stopping. We propose alternative “big batch” SGD schemes that adaptively grow the batch size over time to maintain a nearly constant signal-to-noise ratio in the gradient approximation. The resulting methods have similar convergence rates to classical SGD, and do not require convexity of the objective. The high fidelity gradients enable automated learning rate selection and do not require stepsize decay. Big batch methods are thus easily automated and can run with little or no oversight.
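A minimal version of the "big batch" idea: before each step, grow the batch until the estimated gradient variance is below the squared gradient norm, so the signal-to-noise ratio stays roughly constant. The toy objective, learning rate, and batch cap here are our own illustrative choices, not the authors' implementation.

```python
import random

# Sketch of the "big batch" idea: before each step, grow the batch until the
# estimated gradient noise is below the estimated signal, keeping the
# signal-to-noise ratio roughly constant. Toy objective f(x) = E[(x - Z)^2]
# with Z ~ N(0, 1), whose minimizer is x = 0.

def big_batch_sgd(x0=10.0, lr=0.2, iters=60, seed=3):
    rng = random.Random(seed)
    x, batch = x0, 8
    for _ in range(iters):
        while True:
            samples = [2.0 * (x - rng.gauss(0.0, 1.0)) for _ in range(batch)]
            g = sum(samples) / batch
            var = sum((s - g) ** 2 for s in samples) / (batch - 1)
            if var / batch <= g * g or batch >= 2048:   # SNR test (with a cap)
                break
            batch *= 2                                  # too noisy: bigger batch
        x -= lr * g                                     # constant stepsize, no decay
    return x, batch
```

Because the batch only grows when the gradient noise would swamp the signal, a constant learning rate keeps working without a decay schedule, which is the automation benefit the abstract highlights.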
Article
Full-text available
This paper studies empirical risk minimization (ERM) problems for large-scale datasets and incorporates the idea of adaptive sample size methods to improve the guaranteed convergence bounds for first-order stochastic and deterministic methods. In contrast to traditional methods that attempt to solve the ERM problem corresponding to the full dataset directly, adaptive sample size schemes start with a small number of samples and solve the corresponding ERM problem to its statistical accuracy. The sample size is then grown geometrically – e.g., scaling by a factor of two – and use the solution of the previous ERM as a warm start for the new ERM. Theoretical analyses show that the use of adaptive sample size methods reduces the overall computational cost of achieving the statistical accuracy of the whole dataset for a broad range of deterministic and stochastic first-order methods. The gains are specific to the choice of method. When particularized to, e.g., accelerated gradient descent and stochastic variance reduce gradient, the computational cost advantage is a logarithm of the number of training samples. Numerical experiments on various datasets confirm theoretical claims and showcase the gains of using the proposed adaptive sample size scheme.
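The scheme described above can be sketched as follows: solve the ERM on a small subset only to its statistical accuracy, then double the subset and warm-start from the previous solution. The toy least-squares problem and the 1/n accuracy proxy are our own simplifications, not the paper's analysis.

```python
# Sketch of an adaptive sample size scheme: solve each subset's ERM roughly
# to its statistical accuracy (proxied here by 1/n), then double the subset
# and warm-start. Toy problem: f_i(x) = (x - a_i)^2 / 2, whose ERM solution
# on a subset is that subset's mean.

def adaptive_sample_size_erm(data, lr=0.5):
    x, n = 0.0, 2
    while True:
        subset = data[:n]
        target = sum(subset) / n              # exact ERM minimizer (the mean)
        stat_acc = 1.0 / n                    # statistical accuracy proxy
        while (x - target) ** 2 > stat_acc:   # solve only to that accuracy
            g = sum(x - a for a in subset) / n
            x -= lr * g
        if n >= len(data):
            return x
        n = min(2 * n, len(data))             # geometric growth, warm-started x
```

Because each warm start is already within the previous subset's statistical accuracy, only a few gradient steps are needed per stage, which is the source of the computational savings the abstract quantifies.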
Article
Full-text available
This study compared two alternative techniques for predicting forest cover types from cartographic variables. The study evaluated four wilderness areas in the Roosevelt National Forest, located in the Front Range of northern Colorado. Cover type data came from US Forest Service inventory information, while the cartographic variables used to predict cover type consisted of elevation, aspect, and other information derived from standard digital spatial data processed in a geographic information system (GIS). The results of the comparison indicated that a feedforward artificial neural network model more accurately predicted forest cover type than did a traditional statistical model based on Gaussian discriminant analysis.
Conference Paper
Full-text available
The NIPS 2003 workshops included a feature selection competition organized by the authors. We provided participants with five datasets from different application domains and called for classification results using a minimal number of features. The competition took place over a period of 13 weeks and attracted 78 research groups. Participants were asked to make on-line submissions on the validation and test sets, with performance on the validation set being presented immediately to the participant and performance on the test set presented to the participants at the workshop. In total 1863 entries were made on the validation sets during the development period and 135 entries on all test sets for the final competition. The winners used a combination of Bayesian neural networks with ARD priors and Dirichlet diffusion trees. Other top entries used a variety of methods for feature selection, which combined filters and/or wrapper or embedded methods using Random Forests, kernel methods, or neural networks as a classification engine. The results of the benchmark (including the predictions made by the participants and the features they selected) and the scoring software are publicly available. The benchmark is available at www.nipsfsc.ecs.soton.ac.uk for post-challenge submissions to stimulate further research.
Conference Paper
Full-text available
We present a new method for regularized convex optimization and analyze it under both online and stochastic optimization settings. In addition to unifying previously known firstorder algorithms, such as the projected gradient method, mirror descent, and forwardbackward splitting, our method yields new analysis and algorithms. We also derive specific instantiations of our method for commonly used regularization functions, such as ℓ1, mixed norm, and trace-norm. 1
Article
We consider the context of "simulation-based recursions," that is, recursions that involve quantities needing to be estimated using a stochastic simulation. Examples include stochastic adaptations of fixed-point and gradient descent recursions obtained by replacing function and derivative values appearing within the recursion by their Monte Carlo counterparts. The primary motivating settings are simulation optimization and stochastic root finding problems, where the low point and the zero of a function are sought, respectively, with only Monte Carlo estimates of the functions appearing within the problem. We ask how much Monte Carlo sampling needs to be performed within simulation-based recursions in order that the resulting iterates remain consistent and, more importantly, efficient, where "efficient" implies convergence at the fastest possible rate. Answering these questions involves trading off two types of error inherent in the iterates: the deterministic error due to recursion and the "stochastic" error due to sampling. As we demonstrate through a characterization of the relationship between sample sizing and convergence rates, efficiency and consistency are intimately coupled with the speed of the underlying recursion, with faster recursions yielding a wider regime of "optimal" sampling rates. The implications of our results for practical implementation are immediate since they provide specific guidance on optimal simulation expenditure within a variety of stochastic recursions.
Article
In this paper, we propose a stochastic optimization method that adaptively controls the sample size used in the computation of gradient approximations. Unlike other variance reduction techniques that either require additional storage or the regular computation of full gradients, the proposed method reduces variance by increasing the sample size as needed. The decision to increase the sample size is governed by an inner product test that ensures that search directions are descent directions with high probability. We show that the inner product test improves upon the well known norm test, and can be used as a basis for an algorithm that is globally convergent on nonconvex functions and enjoys a global linear rate of convergence on strongly convex functions. Numerical experiments on logistic regression problems illustrate the performance of the algorithm.
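The inner product test described above can be phrased as: the current sample S is large enough if the empirical variance of g_i · ḡ (each per-sample gradient projected onto the batch mean ḡ), divided by |S|, is at most θ²‖ḡ‖⁴. A small self-contained sketch, with an illustrative θ of our own choosing:

```python
# Sketch of the inner product test: the sample is deemed adequate when the
# empirical variance of g_i . gbar (per-sample gradient projected onto the
# batch mean gbar), divided by the sample size, is at most
# theta^2 * ||gbar||^4. The threshold theta is illustrative.

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def inner_product_test(sample_grads, theta=0.9):
    """True if the current sample size passes the inner product test."""
    n = len(sample_grads)
    d = len(sample_grads[0])
    gbar = [sum(g[i] for g in sample_grads) / n for i in range(d)]
    proj = [dot(g, gbar) for g in sample_grads]
    mean_proj = dot(gbar, gbar)               # mean of proj equals ||gbar||^2
    var_proj = sum((p - mean_proj) ** 2 for p in proj) / (n - 1)
    return var_proj / n <= theta ** 2 * mean_proj ** 2
```

When the test fails, the algorithm increases the sample size: for instance, four identical sample gradients pass trivially, while four wildly disagreeing gradients with a small mean fail, signaling that more samples are needed before the search direction can be trusted as a descent direction with high probability.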
Article
Many data-fitting applications require the solution of an optimization problem involving a sum of a large number of functions of a high-dimensional parameter. Here, we consider the problem of minimizing a sum of n functions over a convex constraint set \mathcal{X} \subseteq \mathbb{R}^{p} where both n and p are large. In such problems, sub-sampling as a way to reduce n can offer substantial computational savings while maintaining the original convergence properties. Within the context of second-order methods, we first give quantitative local convergence results for variants of Newton's method where the Hessian is uniformly sub-sampled. Using random matrix concentration inequalities, one can sub-sample in a way that preserves the curvature information. Using such a sub-sampling strategy, we establish locally Q-linear and Q-superlinear convergence rates. We also give additional convergence results for when the sub-sampled Hessian is regularized by modifying its spectrum or by Levenberg-type regularization. Finally, in addition to Hessian sub-sampling, we consider sub-sampling the gradient as a way to further reduce the computational complexity per iteration. We use approximate matrix multiplication results from randomized numerical linear algebra (RandNLA) to obtain the proper sampling strategy, and we establish locally R-linear convergence rates. In such a setting, we also show that a very aggressive sample size increase results in an R-superlinearly convergent algorithm. While the sample size depends on the condition number of the problem, our convergence rates are problem-independent, i.e., they do not depend on problem-specific quantities such as the condition number. Hence, our analysis here can be used to complement the results of our basic framework from the companion paper, [38], by exploring algorithmic trade-offs that are important in practice.
Article
Large scale optimization problems are ubiquitous in machine learning and data analysis and there is a plethora of algorithms for solving such problems. Many of these algorithms employ sub-sampling, as a way to either speed up the computations and/or to implicitly implement a form of statistical regularization. In this paper, we consider second-order iterative optimization algorithms and we provide bounds on the convergence of the variants of Newton's method that incorporate uniform sub-sampling as a means to estimate the gradient and/or Hessian. Our bounds are non-asymptotic and quantitative. Our algorithms are global and are guaranteed to converge from any initial iterate. Using random matrix concentration inequalities, one can sub-sample the Hessian to preserve the curvature information. Our first algorithm incorporates Hessian sub-sampling while using the full gradient. We also give additional convergence results for when the sub-sampled Hessian is regularized by modifying its spectrum or by ridge-type regularization. Next, in addition to Hessian sub-sampling, we also consider sub-sampling the gradient as a way to further reduce the computational complexity per iteration. We use approximate matrix multiplication results from randomized numerical linear algebra to obtain the proper sampling strategy. In all these algorithms, computing the update boils down to solving a large scale linear system, which can be computationally expensive. As a remedy, for all of our algorithms, we also give global convergence results for the case of inexact updates where such a linear system is solved only approximately. This paper has a more advanced companion paper, [42], in which we demonstrate that, by doing a finer-grained analysis, we can get problem-independent bounds for local convergence of these algorithms and explore trade-offs to improve upon the basic results of the present paper.
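An illustrative 1-D sketch of Newton's method with a uniformly sub-sampled Hessian on a finite sum f(x) = (1/n) Σ_i 0.5 a_i (x − c_i)², using the full gradient while estimating curvature from a random subset; the data, batch size, and seed below are hypothetical:

```python
import random

def subsampled_newton(a, c, x0=0.0, batch=8, iters=30, seed=0):
    """Minimize (1/n) * sum_i 0.5 * a[i] * (x - c[i])**2."""
    rng = random.Random(seed)
    n, x = len(a), x0
    for _ in range(iters):
        # Full gradient of the finite sum.
        grad = sum(ai * (x - ci) for ai, ci in zip(a, c)) / n
        # Curvature estimated from a uniformly sampled subset of terms.
        idx = [rng.randrange(n) for _ in range(batch)]
        hess = sum(a[i] for i in idx) / batch
        x -= grad / hess
    return x
```

Because every a_i is positive and the curvatures are clustered, each sub-sampled Newton step is a contraction toward the true minimizer sum(a_i c_i) / sum(a_i).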
Article
We present global convergence rates for a line-search method which is based on random first-order models and directions whose quality is ensured only with certain probability. We show that in terms of the order of the accuracy, the evaluation complexity of such a method is the same as its counterparts that use deterministic accurate models; the use of probabilistic models only increases the complexity by a constant, which depends on the probability of the models being good. We particularize and improve these results in the convex and strongly convex case. We also analyse a probabilistic cubic regularization variant that allows approximate probabilistic second-order models and show improved complexity bounds compared to probabilistic first-order methods; again, as a function of the accuracy, the probabilistic cubic regularization bounds are of the same (optimal) order as for the deterministic case.
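A minimal backtracking line-search sketch of the kind analyzed above, shown here with an exact gradient (the paper allows models that are accurate only with a certain probability); the function, constants, and starting point are illustrative:

```python
def armijo_step(f, grad, x, d, alpha0=1.0, c1=1e-4, shrink=0.5):
    """Shrink the step until the Armijo sufficient-decrease condition holds."""
    alpha, fx, slope = alpha0, f(x), grad(x) * d
    while f(x + alpha * d) > fx + c1 * alpha * slope:
        alpha *= shrink
    return alpha
```

On f(x) = (x − 2)² from x = 0 along the steepest-descent direction, the unit step overshoots and one halving is accepted.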
Article
This paper presents a methodology for using varying sample sizes in batch-type optimization methods for large-scale machine learning problems. The first part of the paper deals with the delicate issue of dynamic sample selection in the evaluation of the function and gradient. We propose a criterion for increasing the sample size based on variance estimates obtained during the computation of a batch gradient. We establish an O(1/ϵ) complexity bound on the total cost of a gradient method. The second part of the paper describes a practical Newton method that uses a smaller sample to compute Hessian-vector products than to evaluate the function and the gradient, and that also employs a dynamic sampling technique. The third part of the paper shifts focus to L1-regularized problems designed to produce sparse solutions. We propose a Newton-like method that consists of two phases: a (minimalistic) gradient projection phase that identifies zero variables, and a subspace phase that applies a subsampled Hessian Newton iteration in the free variables. Numerical tests on speech recognition problems illustrate the performance of the algorithms.
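A sketch of such a variance-based criterion in its simplest (scalar-gradient) form, assuming per-sample gradient components are given; the threshold theta is an illustrative choice:

```python
def norm_test(grads, theta=0.5):
    """Accept the current batch if the estimated variance of the batch
    gradient is at most theta^2 times the squared gradient estimate."""
    b = len(grads)
    g_bar = sum(grads) / b
    var = sum((g - g_bar) ** 2 for g in grads) / (b - 1)
    return var / b <= (theta ** 2) * g_bar ** 2
```

A tight batch passes the test; a batch whose per-sample gradients disagree wildly fails it, signalling that the sample size should be increased.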
Article
In this paper, we analyze different first-order methods of smooth convex optimization that employ inexact first-order information. We introduce the notion of an approximate first-order oracle; examples of such an oracle include the smoothing technique, Moreau-Yosida regularization, Modified Lagrangians, and many others. For different methods, we derive complexity estimates and study how the achievable accuracy in the objective function depends on the accuracy of the oracle. It appears that in the inexact case, the superiority of the fast gradient methods over the classical ones is no longer absolute. Contrary to the simple gradient schemes, fast gradient methods necessarily suffer from an accumulation of errors. Thus, the choice of method depends both on the desired accuracy and on the accuracy of the oracle. We present applications of our results to smooth convex-concave saddle point problems, to the analysis of Modified Lagrangians, to the prox-method, and to some others.
Article
We consider the problem of optimizing the sum of a smooth convex function and a non-smooth convex function using proximal-gradient methods, where an error is present in the calculation of the gradient of the smooth term or in the proximity operator with respect to the non-smooth term. We show that both the basic proximal-gradient method and the accelerated proximal-gradient method achieve the same convergence rate as in the error-free case, provided that the errors decrease at appropriate rates. Using these rates, these methods perform as well as or better than a carefully chosen fixed error level on a set of structured sparsity problems.
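A minimal sketch of an inexact proximal-gradient iteration on min 0.5(x − 3)² + λ|x|, with the gradient of the smooth term perturbed by an error decaying like 1/k² (a summable sequence, in the spirit of the rates discussed above); all constants are illustrative:

```python
def soft_threshold(z, t):
    """Proximity operator of t * |.| (soft-thresholding)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def inexact_prox_grad(x0=0.0, lam=1.0, step=1.0, iters=50):
    """Proximal-gradient on 0.5*(x - 3)**2 + lam*|x| with a decaying
    gradient error; converges to the exact minimizer x = 2."""
    x = x0
    for k in range(1, iters + 1):
        grad = (x - 3.0) + 1.0 / k ** 2   # exact gradient plus error e_k ~ 1/k^2
        x = soft_threshold(x - step * grad, step * lam)
    return x
```

Despite the persistent (but summable) gradient errors, the iterates approach the exact minimizer x = 2 of the error-free problem.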
Article
Many structured data-fitting applications require the solution of an optimization problem involving a sum over a potentially large number of measurements. Incremental gradient algorithms offer inexpensive iterations by sampling a subset of the terms in the sum. These methods can make great progress initially, but often slow down as they approach a solution. In contrast, full-gradient methods achieve steady convergence at the expense of evaluating the full objective and gradient on each iteration. We explore hybrid methods that exhibit the benefits of both approaches. Rate-of-convergence analysis shows that by controlling the sample size in an incremental gradient algorithm, it is possible to maintain the steady convergence rates of full-gradient methods. We detail a practical quasi-Newton implementation based on this approach. Numerical experiments illustrate its potential benefits.
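The sample-size control can be as simple as a geometric growth schedule, so that the sampling error decays at a linear rate matching the deterministic iteration; the growth factor, cap, and horizon below are illustrative assumptions:

```python
def batch_schedule(n_total, b0=2, growth=1.2, iters=40):
    """Geometrically increasing batch sizes, capped at the full data set."""
    sizes, b = [], float(b0)
    for _ in range(iters):
        sizes.append(min(n_total, int(b)))
        b *= growth
    return sizes
```

Early iterations use tiny, cheap batches; the schedule eventually saturates at the full data set, at which point the method behaves like a full-gradient method.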
Article
In this paper we analyze several new methods for solving optimization problems with the objective function formed as a sum of two convex terms: one is smooth and given by a black-box oracle, and another is general but simple and its structure is known. Despite the bad properties of the sum, such problems, both in convex and nonconvex cases, can be solved with efficiency typical of the good part of the objective. For convex problems of the above structure, we consider primal and dual variants of the gradient method (which converge as O(1/k)), and an accelerated multistep version with convergence rate O(1/k^2), where k is the iteration counter. For all methods, we suggest some efficient "line search" procedures and show that the additional computational work necessary for estimating the unknown problem class parameters can only multiply the complexity of each iteration by a small constant factor. We also present the results of preliminary computational experiments, which confirm the superiority of the accelerated scheme.
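A 1-D sketch of such an accelerated multistep scheme on the composite problem min 0.5(x − 4)² + |x|, using the standard t_k extrapolation sequence; the step size and data are illustrative assumptions, not the paper's setup:

```python
def soft_threshold(z, t):
    """Proximity operator of t * |.|."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def accelerated_composite(step=0.4, iters=100):
    """Accelerated proximal-gradient on 0.5*(x - 4)**2 + |x|."""
    x_prev, y, t = 0.0, 0.0, 1.0
    for _ in range(iters):
        x = soft_threshold(y - step * (y - 4.0), step)  # prox-gradient step at y
        t_next = 0.5 * (1.0 + (1.0 + 4.0 * t * t) ** 0.5)
        y = x + ((t - 1.0) / t_next) * (x - x_prev)     # momentum extrapolation
        x_prev, t = x, t_next
    return x_prev
```

The iterates converge to the minimizer x = 3 of the composite objective; removing the extrapolation recovers the plain O(1/k) gradient variant.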
A stochastic line search method with convergence rate analysis
  • Paquette
MNIST handwritten digit database
  • LeCun
Automated inference with adaptive batches
  • De