Article

A Noise-Tolerant Quasi-Newton Algorithm for Unconstrained Optimization

... We build on the ideas in [1,15,21] and consider the non-smooth problem (P) in a setting where function and derivative evaluations are only available as noisy observations. Like the authors of [1,15,21], we assume the following noise model: rather than being able to evaluate F and its derivative directly, we only have access to noisy observations ...
... What is more, we presume that the structure of ω is well understood in the sense that, for example, its Lipschitz constant is known, which is certainly the case for the penalty function in (1). In order to solve optimization problems of the form (P), we propose a trust-region algorithm based on the successive linear programming template proposed in [2] and a convergence analysis that builds on the ideas in [3,15,21]. Specifically, we use a stabilization of the iterate acceptance test in order to ensure that a neighborhood of a stationary point is visited infinitely often by the iterates produced by the algorithm. The polyhedral structure of ω is handled by first solving a linear program to determine a direction for a subsequent Cauchy point computation. ...
Article
Full-text available
Gradient-based methods have been highly successful for solving a variety of both unconstrained and constrained nonlinear optimization problems. In real-world applications, such as optimal control or machine learning, the necessary function and derivative information may be corrupted by noise, however. Sun and Nocedal have recently proposed a remedy for smooth unconstrained problems by means of a stabilization of the acceptance criterion for computed iterates, which leads to convergence of the iterates of a trust-region method to a region of criticality (Sun and Nocedal in Math Program 66:1–28, 2023. https://doi.org/10.1007/s10107-023-01941-9). We extend their analysis to the successive linear programming algorithm (Byrd et al. in Math Program 100(1):27–48, 2003. https://doi.org/10.1007/s10107-003-0485-4, SIAM J Optim 16(2):471–489, 2005. https://doi.org/10.1137/S1052623403426532) for unconstrained optimization problems with objectives that can be characterized as the composition of a polyhedral function with a smooth function, where the latter and its gradient may be corrupted by noise. This gives the flexibility to cover, for example, (sub)problems arising in image reconstruction or constrained optimization algorithms. We provide computational examples that illustrate the findings and point to possible strategies for practical determination of the stabilization parameter that balances the size of the critical region with a relaxation of the acceptance criterion (or descent property) of the algorithm.
... Noisy functions The combination of unconstrained optimization with noisy observations has recently been examined [18,19,24]. The authors consider the minimization of a smooth function φ : R^n → R while only having access to f(x) = φ(x) + ε(x) and g(x) = ∇φ(x) + e(x), where the only assumptions on ε and e are that |ε| and ‖e‖ are uniformly bounded. ...
... Contribution We build on the ideas in [18,19,24] and consider the nonsmooth problem (P) in a setting where function and derivative evaluations are only available as noisy observations. Like the authors of [18,19,24], we assume the following noise model: rather than being able to evaluate F and its derivative directly, we only have access to noisy observations ...
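The bounded-noise model recurring in these excerpts (noisy values f(x) = φ(x) + ε(x) and g(x) = ∇φ(x) + e(x), with |ε| and ‖e‖ uniformly bounded) is straightforward to emulate when experimenting with any of the methods discussed on this page. Below is a minimal sketch; the noise bounds eps_f and eps_g and the particular noise distributions are illustrative choices, not something prescribed by the cited works.

```python
import numpy as np

def make_noisy_oracle(phi, grad_phi, eps_f, eps_g, rng=None):
    """Wrap exact oracles so that every evaluation carries uniformly bounded noise:
    |f(x) - phi(x)| <= eps_f  and  ||g(x) - grad_phi(x)|| <= eps_g."""
    rng = np.random.default_rng() if rng is None else rng

    def f(x):
        return phi(x) + eps_f * rng.uniform(-1.0, 1.0)

    def g(x):
        e = rng.standard_normal(np.shape(x))
        e *= eps_g * rng.uniform() / max(np.linalg.norm(e), 1e-16)  # keep ||e|| <= eps_g
        return grad_phi(x) + e

    return f, g

# usage: a noisy strongly convex quadratic
f, g = make_noisy_oracle(lambda x: 0.5 * x @ x, lambda x: x, eps_f=1e-3, eps_g=1e-3)
```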
Preprint
Gradient-based methods have been highly successful for solving a variety of both unconstrained and constrained nonlinear optimization problems. In real-world applications, such as optimal control or machine learning, the necessary function and derivative information may be corrupted by noise, however. Sun and Nocedal have recently proposed a remedy for smooth unconstrained problems by means of a stabilization of the acceptance criterion for computed iterates, which leads to convergence of the iterates of a trust-region method to a region of criticality, Sun and Nocedal (2022). We extend their analysis to the successive linear programming algorithm, Byrd et al. (2023a,2023b), for unconstrained optimization problems with objectives that can be characterized as the composition of a polyhedral function with a smooth function, where the latter and its gradient may be corrupted by noise. This gives the flexibility to cover, for example, (sub)problems arising in image reconstruction or constrained optimization algorithms. We provide computational examples that illustrate the findings and point to possible strategies for practical determination of the stabilization parameter that balances the size of the critical region with a relaxation of the acceptance criterion (or descent property) of the algorithm.
... Apart from the analysis of the BFGS method with bounded errors in [45] and the noise-tolerant versions of BFGS and L-BFGS developed in [43], there is relatively little work on the behaviour of quasi-Newton methods in the presence of general bounded noise. This work is most similar to [43], which builds upon the results of [45] to develop an extension of BFGS designed for the situation where unconstrained minimization must be performed using only function and gradient measurements corrupted by bounded noise. Although this paper considers the same situation as [43], the approach developed in this paper to address the corrupting effects of noise is distinct and potentially complementary to the approach used in [43]. While the approach of [43] employs a lengthening procedure that spaces out the points at which gradient differences are collected, in this paper we develop an approach based on relaxing the secant condition that does not require a lengthening procedure. ...
Article
Full-text available
In this paper, we introduce a new variant of the BFGS method designed to perform well when gradient measurements are corrupted by noise. We show that treating the secant condition with a penalty method approach motivated by regularized least squares estimation generates a parametric family with the original BFGS update at one extreme and not updating the inverse Hessian approximation at the other extreme. Furthermore, we find the curvature condition is relaxed as the family moves towards not updating the inverse Hessian approximation, and disappears entirely at the extreme where the inverse Hessian approximation is not updated. These developments allow us to develop a method we refer to as Secant Penalized BFGS (SP-BFGS) that allows one to relax the secant condition based on the amount of noise in the gradient measurements. SP-BFGS provides a means of incrementally updating the new inverse Hessian approximation with a controlled amount of bias towards the previous inverse Hessian approximation, which allows one to replace the overwriting nature of the original BFGS update with an averaging nature that resists the destructive effects of noise and can cope with negative curvature measurements. We discuss the theoretical properties of SP-BFGS, including convergence when minimizing strongly convex functions in the presence of uniformly bounded noise. Finally, we present extensive numerical experiments using over 30 problems from the CUTEst test problem set that demonstrate the superior performance of SP-BFGS compared to BFGS in the presence of both noisy function and gradient evaluations.
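The interpolation behaviour described above, between a full BFGS update at one extreme and leaving the inverse Hessian approximation unchanged at the other, can be pictured with a simple blend. The sketch below is explicitly not the SP-BFGS formula from the paper (which derives its family from a penalized secant condition); it is only a cartoon of an update family whose endpoints match those two extremes, with `noise_weight` standing in for the relaxation parameter.

```python
import numpy as np

def blended_inverse_hessian_update(H, s, y, noise_weight):
    """Cartoon of an update family: noise_weight = 0 gives the standard BFGS
    inverse-Hessian update, noise_weight = 1 keeps H unchanged. NOT the
    SP-BFGS update itself, only an illustration of the interpolation idea."""
    sy = s @ y
    if sy <= 1e-12:            # skip on (possibly noise-induced) non-positive curvature
        return H
    rho = 1.0 / sy
    I = np.eye(H.shape[0])
    V = I - rho * np.outer(s, y)
    H_bfgs = V @ H @ V.T + rho * np.outer(s, s)   # classical BFGS update of H
    lam = np.clip(noise_weight, 0.0, 1.0)          # larger noise -> stay closer to old H
    return (1.0 - lam) * H_bfgs + lam * H
```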
... Unconstrained problems with noisy function and gradient evaluations are studied in [1,2,4,8,12,19,21,22,23,28]. They can be grouped into two categories: those that consider stochastic noise [2,4,8,12,19,21] and those that consider bounded and non-diminishing noise [1,22,23,28]. ...
Preprint
We analyze the convergence properties of a modified barrier method for solving bound-constrained optimization problems where evaluations of the objective function and its derivatives are affected by bounded and non-diminishing noise. The only modification compared to a standard barrier method is a relaxation of the Armijo line-search condition. We prove that the algorithm generates iterates at which the size of the barrier function gradient eventually falls below a threshold that converges to zero if the noise level converges to zero. Based on this result, we propose a practical stopping test that does not require estimates of unknown problem parameters and identifies iterations in which the theoretical threshold is reached. We also analyze the local convergence properties of the method when noisy second derivatives are used. Under a strict-complementarity assumption, we show that iterates stay in a neighborhood around the optimal solution once it is entered. The neighborhood is defined in a scaled norm that becomes narrower for variables with active bound constraints as the barrier parameter is decreased. As a consequence, we show that active bound constraints can be identified despite noise. Numerical results demonstrate the effectiveness of the stopping test and illustrate the active-set identification properties of the method.
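The single algorithmic change described above is a relaxed Armijo condition, so that bounded noise in the function values cannot reject steps that decrease the true objective. A minimal sketch of such a test follows; the allowance 2*eps_f (each of the two function evaluations can be off by at most eps_f) is a common choice in this literature, though the exact constant used in the cited method may differ.

```python
def relaxed_armijo(f_x, f_trial, alpha, dir_deriv, eps_f, c1=1e-4):
    """Armijo sufficient-decrease test relaxed by a noise allowance.
    f_x, f_trial : noisy function values at the current and trial points
    dir_deriv    : (noisy) directional derivative g(x)^T p, expected negative
    eps_f        : bound on the absolute noise in each function evaluation"""
    return f_trial <= f_x + c1 * alpha * dir_deriv + 2.0 * eps_f
```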
... The relative Jacobian errors documented above bound the relative errors that can occur in the calculation of cost function gradients, which additionally depend on the structure of the measurement residuals. The theoretical examination of the consequences of gradient errors in the L-BFGS-B method has only recently been investigated (Shi et al., 2021), due to the interest in developing stochastic variants for deep learning and other similar applications. The L-BFGS method approximates Hessian information from differences of gradients. ...
... The typical solution for gradients with errors is to perform a step-lengthening procedure (Shi et al., 2021). However, in systems that are highly ill-conditioned, even without noise, such as the optically thick clouds discussed here, a step-lengthening procedure may cause significant difficulty in the selection of an update vector satisfying the stabilizing Wolfe-Armijo line search conditions. ...
Article
Full-text available
Our global understanding of clouds and aerosols relies on the remote sensing of their optical, microphysical, and macrophysical properties using, in part, scattered solar radiation. These retrievals assume that clouds and aerosols form plane-parallel, homogeneous layers and utilize 1D radiative transfer (RT) models, limiting the detail that can be retrieved about the 3D variability in cloud and aerosol fields and inducing biases in the retrieved properties for highly heterogeneous structures such as cumulus clouds and smoke plumes. To overcome these limitations, we introduce and validate an algorithm for retrieving the 3D optical or microphysical properties of atmospheric particles using multi-angle, multi-pixel radiances and a 3D RT model. The retrieval software, which we have made publicly available, is called Atmospheric Tomography with 3D Radiative Transfer (AT3D). It uses an iterative, local optimization technique to solve a generalized least squares problem and thereby find a best-fitting atmospheric state. The iterative retrieval uses a fast, approximate Jacobian calculation, which we have extended from Levis et al. (2020) to accommodate open and periodic horizontal boundary conditions (BCs) and an improved treatment of non-black surfaces. We validated the accuracy of the approximate Jacobian calculation for derivatives with respect to both the 3D volume extinction coefficient and the parameters controlling the open horizontal boundary conditions across media with a range of optical depths and single-scattering properties and find that it is highly accurate for a majority of cloud and aerosol fields over oceanic surfaces. Relative root mean square errors in the approximate Jacobian for a 3D volume extinction coefficient in media with cloud-like single-scattering properties increase from 2 % to 12 % as the maximum optical depths (MODs) of the medium increase from 0.2 to 100.0 over surfaces with Lambertian albedos <0.2. Over surfaces with albedos of 0.7, these errors increase to 20 %. Errors in the approximate Jacobian for the optimization of open horizontal boundary conditions exceed 50 %, unless the plane-parallel media providing the boundary conditions are optically very thin (∼0.1). We use the theory of linear inverse RT to provide insight into the physical processes that control the cloud tomography problem and identify its limitations, supported by numerical experiments. We show that the Jacobian matrix becomes increasingly ill-posed as the optical size of the medium increases and the forward-scattering peak of the phase function decreases. This suggests that tomographic retrievals of clouds will become increasingly difficult as clouds become optically thicker. Retrievals of asymptotically thick clouds will likely require other sources of information to be successful. In Loveridge et al. (2023a; hereafter Part 2), we examine how the accuracy of the retrieved 3D volume extinction coefficient varies as the optical size of the target medium increases using synthetic data. We do this to explore how the increasing error in the approximate Jacobian and the increasingly ill-posed nature of the inversion in the optically thick limit affect the retrieval. We also assess the accuracy of retrieved optical depths and compare them to retrievals using 1D radiative transfer.
... Stochastic second-order methods make judicious use of curvature information and have proven to be effective for a variety of machine learning tasks [24]. Existing stochastic second-order methods can be classified as follows: stochastic Newton methods [25–30, 33, 52–55]; stochastic quasi-Newton methods (SQNM) [37, 52, 56–61]; generalized Gauss-Newton (GGN) methods [40, 41, 43, 62–64]; Kronecker-factored approximate curvature (K-FAC) methods [44–51, 65, 66]. ...
... However, in order to guarantee that the condition number of the Hessian approximations is bounded, they need to be aware of the strong convexity parameter, which is typically unknown. A new noise-tolerant quasi-Newton algorithm was proposed by Shi et al. [61], which has no need for exogenous function information. The noise-tolerant L-BFGS method is linearly convergent to a neighborhood of the solution determined by the noise level if the objective function F is μ-strongly convex and has L-Lipschitz continuous gradients. ...
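The lengthening idea behind the noise-tolerant (L-)BFGS variants mentioned here (and contrasted with a relaxed secant condition in the excerpts above) is to form curvature pairs only from points that are far enough apart for the true gradient change to dominate the gradient noise. The sketch below illustrates that idea; the threshold c * eps_g is an illustrative stand-in, not the precise lengthening rule of the cited method.

```python
import numpy as np

def curvature_pair_with_lengthening(grad, x, x_new, eps_g, c=2.0):
    """Return a curvature pair (s, y) for a quasi-Newton update, lengthening the
    displacement if it is too short relative to the gradient noise bound eps_g,
    so that the gradient difference y is not dominated by noise. (Sketch only.)"""
    s = x_new - x
    step_len = np.linalg.norm(s)
    min_len = c * eps_g                       # illustrative noise-dependent threshold
    if 0.0 < step_len < min_len:
        x_new = x + (min_len / step_len) * s  # evaluate the gradient further out
        s = x_new - x
    y = grad(x_new) - grad(x)
    return s, y
```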
Article
Full-text available
Numerous intriguing optimization problems arise as a result of the advancement of machine learning. The stochastic first-order method is the predominant choice for those problems due to its high efficiency. However, the negative effects of noisy gradient estimates and the high nonlinearity of the loss function result in a slow convergence rate. Second-order algorithms have inherent advantages in dealing with highly nonlinear and ill-conditioned problems. This paper provides a review of recent developments in stochastic variants of quasi-Newton methods, which construct the Hessian approximations using only gradient information. We concentrate on BFGS-based methods in stochastic settings and highlight the algorithmic improvements that enable the algorithm to work in various scenarios. Future research on stochastic quasi-Newton methods should focus on enhancing their applicability, lowering the computational and storage costs, and improving the convergence rate.
... L-BFGS is a common optimization algorithm used in quantum control as part of GRAPE [66] and has performed well in finding high-fidelity energy landscape controllers [55]. It has not been designed for noisy optimization, but there exist smoothing modifications that attempt to address this [67–69]. For individual controller comparisons, we use standard L-BFGS with an ordinary differential equation model to compute the fidelity without perturbations during optimization. ...
... We have explored stochastic gradient descent methods (e.g. ADAM [70]) and also tested a recently proposed noise-tolerant version of L-BFGS that modifies the line search and employs a lengthening procedure during the gradient update step [69], and found that our training noise scales were too large and washed away gradient information, rendering these algorithms unsuitable for our study. ...
Preprint
Full-text available
Robustness of quantum operations or controls is important to build reliable quantum devices. The robustness-infidelity measure (RIM_p) is introduced to statistically quantify the robustness and fidelity of a controller as the p-order Wasserstein distance between the fidelity distribution of the controller under any uncertainty and an ideal fidelity distribution. The RIM_p is the p-th root of the p-th raw moment of the infidelity distribution. Using a metrization argument, we justify why RIM_1 (the average infidelity) suffices as a practical robustness measure. Based on the RIM_p, an algorithmic robustness-infidelity measure (ARIM) is developed to quantify the expected robustness and fidelity of controllers found by a control algorithm. The utility of the RIM and ARIM is demonstrated by considering the problem of robust control of spin-1/2 networks using energy landscape shaping subject to Hamiltonian uncertainty. The robustness and fidelity of individual control solutions as well as the expected robustness and fidelity of controllers found by different popular quantum control algorithms are characterized. For algorithm comparisons, stochastic and non-stochastic optimization objectives are considered, with the goal of effective RIM optimization in the latter. Although high fidelity and robustness are often conflicting objectives, some high fidelity, robust controllers can usually be found, irrespective of the choice of the quantum control algorithm. However, for noisy optimization objectives, adaptive sequential decision making approaches such as reinforcement learning have a cost advantage compared to standard control algorithms and, in contrast, the infidelities obtained are more consistent with higher RIM values for low noise levels.
... A similar relaxed Armijo back-tracking line search technique is used in [18], which considered different oracles from [22] to allow biased estimates, and provided complexity bounds for different noise structures under probabilistic frameworks. Quasi-Newton methods were analyzed in [27], which described a noise tolerant modification of the BFGS method; [24] showed ways to make this method robust and efficient in practice. ...
... Comparison of the proposed and classical trust-region algorithms on the tridiagonal function with uniform noise (Problem 289, GUR-T1-323): we initiated both algorithms with small (Δ0 = 1e−6) and large (Δ0 = 1) trust-region radii and plotted the results in Figures 6 and 7. ...
Preprint
Full-text available
Classical trust region methods were designed to solve problems in which function and gradient information are exact. This paper considers the case when there are bounded errors (or noise) in the above computations and proposes a simple modification of the trust region method to cope with these errors. The new algorithm only requires information about the size of the errors in the function evaluations and incurs no additional computational expense. It is shown that, when applied to a smooth (but not necessarily convex) objective function, the iterates of the algorithm visit a neighborhood of stationarity infinitely often, and that the rest of the sequence cannot stray too far away, as measured by function values. Numerical results illustrate how the classical trust region algorithm may fail in the presence of noise, and how the proposed algorithm ensures steady progress towards stationarity in these cases.
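The modification summarized in this abstract only touches the step acceptance test: the observed reduction receives a noise allowance proportional to the known bound on the function-evaluation error, so that noise alone cannot force rejection of good steps. A schematic sketch follows; the constant r and the exact placement of the relaxation are illustrative, not necessarily identical to the cited algorithm.

```python
def relaxed_tr_ratio(f_old, f_new, predicted_reduction, eps_f, r=2.0):
    """Noise-tolerant trust-region acceptance ratio: observed reduction plus a
    noise allowance r*eps_f, divided by the model's predicted reduction."""
    return (f_old - f_new + r * eps_f) / max(predicted_reduction, 1e-16)

# schematic use inside a trust-region loop:
# rho = relaxed_tr_ratio(f(xk), f(xk + sk), m(0) - m(sk), eps_f)
# accept the step (and possibly enlarge the radius) if rho >= eta; otherwise shrink.
```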
... Byrd et al. [3] have proposed a stochastic quasi-Newton method in limited memory form through subsampled Hessian-vector products. Shi et al. [23] have proposed practical extensions of the BFGS and L-BFGS methods for nonlinear optimization that are capable of dealing with noise by employing a new linesearch technique. Xie et al. [24] have considered the convergence analysis of quasi-Newton methods when there are (bounded) errors in both function and gradient evaluations, and established conditions under which an Armijo-Wolfe linesearch on the noisy function yields sufficient decrease in the true objective function. ...
Article
Full-text available
We investigate quasi-Newton methods for minimizing a strongly convex quadratic function which is subject to errors in the evaluation of the gradients. In particular, we focus on computing search directions for quasi-Newton methods that all give identical behavior in exact arithmetic, generating minimizers of Krylov subspaces of increasing dimensions, thereby having finite termination. The BFGS quasi-Newton method may be seen as an ideal method in exact arithmetic and is empirically known to behave very well on a quadratic problem subject to small errors. We investigate large-error scenarios, in which the expected behavior is not so clear. We consider memoryless methods that are less expensive than the BFGS method, in that they generate low-rank quasi-Newton matrices that differ from the identity by a symmetric matrix of rank two. In addition, a more advanced model for generating the search directions is proposed, based on solving a chance-constrained optimization problem. Our numerical results indicate that for large errors, such a low-rank memoryless quasi-Newton method may perform better than a BFGS method. In addition, the results indicate a potential edge by including the chance-constrained model in the memoryless quasi-Newton method.
... A finite-difference quasi-Newton approach for noisy functions was developed in [5], where the differencing intervals were adjusted according to the level of noise. In [53], a noise tolerant version of the BFGS method was proposed and analyzed, followed by extensions in [44] to make the method more robust and efficient in practice. Another variant of BFGS can be found in [25], where the secant condition was treated with a penalty method instead of directly enforced. ...
Article
Full-text available
We analyze the convergence of a nonlocal gradient descent method for minimizing a class of high-dimensional non-convex functions, where a directional Gaussian smoothing (DGS) is proposed to define the nonlocal gradient (also referred to as the DGS gradient). The method was first proposed in [Zhang et al., Enabling long-range exploration in minimization of multimodal functions, UAI 2021], in which multiple numerical experiments showed that replacing the traditional local gradient with the DGS gradient can help the optimizers escape local minima more easily and significantly improve their performance. However, a rigorous theory for the efficiency of the method on nonconvex landscape is lacking. In this work, we investigate the scenario where the objective function is composed of a convex function, perturbed by deterministic oscillating noise. We provide a convergence theory under which the iterates exponentially converge to a tightened neighborhood of the solution, whose size is characterized by the noise wavelength. We also establish a correlation between the optimal values of the Gaussian smoothing radius and the noise wavelength, thus justifying the advantage of using moderate or large smoothing radii with the method. Furthermore, if the noise level decays to zero when approaching the global minimum, we prove that DGS-based optimization converges to the exact global minimum with linear rates, similarly to standard gradient-based methods in optimizing convex functions. Several numerical experiments are provided to confirm our theory and illustrate the superiority of the approach over those based on the local gradient.
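The DGS gradient described above replaces each partial derivative by the derivative of a one-dimensional Gaussian smoothing of the objective along that direction, evaluated with Gauss–Hermite quadrature. The sketch below uses coordinate directions, a fixed smoothing radius sigma, and the standard Gaussian-smoothing identity ∂_i f_σ(x) = E[v f(x + σ v e_i)]/σ; the cited paper's exact estimator, direction choices and parameter schedules may differ.

```python
import numpy as np

def dgs_gradient(f, x, sigma=0.5, n_quad=7):
    """Directional-Gaussian-smoothing gradient estimate: for each coordinate
    direction e_i, approximate E[v f(x + sigma*v*e_i)] / sigma, v ~ N(0,1),
    with Gauss-Hermite quadrature (nodes/weights for the weight exp(-t^2))."""
    t, w = np.polynomial.hermite.hermgauss(n_quad)
    d = x.size
    grad = np.zeros(d)
    for i in range(d):
        e_i = np.zeros(d); e_i[i] = 1.0
        vals = np.array([f(x + np.sqrt(2.0) * sigma * ti * e_i) for ti in t])
        # substitution v = sqrt(2)*t turns the Gaussian expectation into GH quadrature
        grad[i] = np.sum(w * np.sqrt(2.0) * t * vals) / (sigma * np.sqrt(np.pi))
    return grad

# sanity check on a smooth quadratic: the DGS gradient matches the exact gradient
x0 = np.array([1.0, -2.0])
print(dgs_gradient(lambda z: z @ z, x0, sigma=0.1))   # approximately [2, -4]
```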
... Levenberg-Marquardt [44,53] or BFGS [2,16], need to estimate the gradient of the cost function to determine the minimum value. Depending on noise in the cost function, the estimated gradient may be erroneous [72,86]. ...
Article
Full-text available
Pose estimation is an important component of many real-world computer vision systems. Most existing pose estimation algorithms need a large number of point correspondences to accurately determine the pose of an object. Since the number of point correspondences depends on the object’s appearance, lighting and other external conditions, detecting many points may not be feasible. In many real-world applications, the movement of objects is limited, e.g. due to gravity. Hence, detecting objects with only three degrees of freedom is usually sufficient. This allows us to improve the accuracy of pose estimation by changing the underlying equations of the perspective-n-point problem to three variables instead of six. By using the simplified equations, our algorithm is more robust against detection errors with limited point correspondences. In this article, we study three scenarios where such constraints apply. The first one is about parking a vehicle on a specific spot. Here, a stationary camera is detecting the vehicle to assist the driver. The second scenario describes the perspective of a moving camera detecting objects in its environment. This scenario is common for driver assistance systems, autonomous cars or mobile robots. Third, we describe a camera observing objects from a birds-eye view, which occurs in industrial applications. In all three scenarios, observed objects can only move in the ground plane and rotate around the vertical axis. Hence, three degrees of freedom are sufficient to estimate the pose. Experiments with synthetic data and real-world photographs have shown that our algorithm outperforms state-of-the-art pose estimation algorithms. Depending on the scenario, our algorithm is able to achieve 50% better accuracy, while being equally fast.
... In the available literature for L-BFGS-type and BFGS-type methods, global and linear convergence is only established under the assumption of strong convexity of the objective in the level set associated with the starting point; cf., for instance, [LN89, BN89, YSW+18, GG19, VL19, SXBN22]. This excludes many interesting objective functions and, in particular, is not appropriate for the highly non-convex nature of many inverse problems, including those arising in medical image registration. ...
Article
Full-text available
Many inverse problems are phrased as optimization problems in which the objective function is the sum of a data-fidelity term and a regularization. Often, the Hessian of the fidelity term is computationally unavailable while the Hessian of the regularizer allows for cheap matrix-vector products. In this paper, we study an L-BFGS method that takes advantage of this structure. We show that the method converges globally without convexity assumptions and that the convergence is linear under a Kurdyka–Łojasiewicz-type inequality. In addition, we prove linear convergence to cluster points near which the objective function is strongly convex. To the best of our knowledge, this is the first time that linear convergence of an L-BFGS method is established in a non-convex setting. The convergence analysis is carried out in infinite dimensional Hilbert space, which is appropriate for inverse problems but has not been done before. Numerical results show that the new method outperforms other structured L-BFGS methods and classical L-BFGS on non-convex real-life problems from medical image registration. It also compares favorably with classical L-BFGS on ill-conditioned quadratic model problems. An implementation of the method is freely available.
... This is because, in the presence of errors, only small step sizes will accurately predict the change in the cost function to a precision that satisfies the stability criteria in the line search of the optimization procedure (Byrd et al., 1995;Zhu et al., 1997). If errors in the gradient are large, then the optimization may terminate far from an apparent minimum in the cost function, even when the optimization problem is linear (Shi et al., 2021). ...
Article
Full-text available
Our global understanding of clouds and aerosols relies on the remote sensing of their optical, microphysical, and macrophysical properties using, in part, scattered solar radiation. Current retrievals assume clouds and aerosols form plane-parallel, homogeneous layers and utilize 1D radiative transfer (RT) models. These assumptions limit the detail that can be retrieved about the 3D variability in the cloud and aerosol fields and induce biases in the retrieved properties for highly heterogeneous structures such as cumulus clouds and smoke plumes. In Part 1 of this two-part study, we validated a tomographic method that utilizes multi-angle passive imagery to retrieve 3D distributions of species using 3D RT to overcome these issues. That validation characterized the uncertainty in the approximate Jacobian used in the tomographic retrieval over a wide range of atmospheric and surface conditions for several horizontal boundary conditions. Here, in Part 2, we test the algorithm's effectiveness on synthetic data to test whether the retrieval accuracy is limited by the use of the approximate Jacobian. We retrieve 3D distributions of a volume extinction coefficient (σ3D) at 40 m resolution from synthetic multi-angle, mono-spectral imagery at 35 m resolution derived from stochastically generated cumuliform-type clouds in (1 km)3 domains. The retrievals are idealized in that we neglect forward-modelling and instrumental errors, with the exception of radiometric noise; thus, reported retrieval errors are the lower bounds. σ3D is retrieved with, on average, a relative root mean square error (RRMSE) < 20 % and bias < 0.1 % for clouds with maximum optical depth (MOD) < 17, and the RRMSE of the radiances is < 0.5 %, indicating very high accuracy in shallow cumulus conditions. As the MOD of the clouds increases to 80, the RRMSE and biases in σ3D worsen to 60 % and -35 %, respectively, and the RRMSE of the radiances reaches 16 %, indicating incomplete convergence. This is expected from the increasing ill-conditioning of the inverse problem with the decreasing mean free path predicted by RT theory and discussed in detail in Part 1. We tested retrievals that use a forward model that is not only less ill-conditioned (in terms of condition number) but also less accurate, due to more aggressive delta-M scaling. This reduces the radiance RRMSE to 9 % and the bias in σ3D to -8 % in clouds with MOD ∼ 80, with no improvement in the RRMSE of σ3D. This illustrates a significant sensitivity of the retrieval to the numerical configuration of the RT model which, at least in our circumstances, improves the retrieval accuracy. All of these ensemble-averaged results are robust in response to the inclusion of radiometric noise during the retrieval. However, individual realizations can have large deviations of up to 18 % in the mean extinction in clouds with MOD ∼ 80, which indicates large uncertainties in the retrievals in the optically thick limit. Using less ill-conditioned forward model tomography can also accurately infer optical depths (ODs) in conditions spanning the majority of oceanic cumulus fields (MOD < 80), as the retrieval provides ODs with bias and RRMSE values better than -8 % and 36 %, respectively. This is a significant improvement over retrievals using 1D RT, which have OD biases between -30 % and -23 % and RRMSE between 29 % and 80 % for the clouds used here. 
Prior information or other sources of information will be required to improve the RRMSE of σ3D in the optically thick limit, where the RRMSE is shown to have a strong spatial structure that varies with the solar and viewing geometry.
... Moreover, some recent work has developed line search methods for single-level problems where the objective function value is computed with errors and the estimated gradient is inexact [48] and potentially stochastic [5]. However, in these methods, the error is bounded but irreducible and not controllable. ...
Preprint
Various tasks in data science are modeled utilizing the variational regularization approach, where manually selecting regularization parameters presents a challenge. The difficulty gets exacerbated when employing regularizers involving a large number of hyperparameters. To overcome this challenge, bilevel learning can be employed to learn such parameters from data. However, neither exact function values nor exact gradients with respect to the hyperparameters are attainable, necessitating methods that only rely on inexact evaluation of such quantities. State-of-the-art inexact gradient-based methods a priori select a sequence of the required accuracies and cannot identify an appropriate step size since the Lipschitz constant of the hypergradient is unknown. In this work, we propose an algorithm with backtracking line search that only relies on inexact function evaluations and hypergradients and show convergence to a stationary point. Furthermore, the proposed algorithm determines the required accuracy dynamically rather than manually selected before running it. Our numerical experiments demonstrate the efficiency and feasibility of our approach for hyperparameter estimation on a range of relevant problems in imaging and data science such as total variation and field of experts denoising and multinomial logistic regression. Particularly, the results show that the algorithm is robust to its own hyperparameters such as the initial accuracies and step size.
... In the limited memory version, the matrix H_k is defined at each iteration as the result of applying m BFGS updates to a multiple of the identity matrix using the set of m most recent curvature pairs {s_t, y_t} kept in storage. Thus, one does not need to store the dense inverse Hessian approximation; rather, one can store two m × d matrices and compute the matrix-vector product via the two-loop recursion [94]. After the step has been computed, the oldest pair (s_t, y_t) is discarded, and the new curvature pair is stored. ...
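The two-loop recursion referred to in this excerpt evaluates the product of the limited-memory inverse Hessian approximation with a vector directly from the m stored pairs, without forming any dense matrix. A standard sketch follows; the scaling gamma = s'y / y'y for the initial matrix is the usual textbook choice rather than anything specific to the work above.

```python
import numpy as np

def lbfgs_two_loop(grad, s_list, y_list):
    """Return H_k @ grad, where H_k is the L-BFGS inverse-Hessian approximation
    defined by the stored curvature pairs (s_i, y_i), via the two-loop recursion."""
    q = grad.copy()
    alphas, rhos = [], []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # newest to oldest
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q -= alpha * y
        alphas.append(alpha)
        rhos.append(rho)
    # initial matrix H_0 = gamma * I, scaled by the most recent pair
    gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1]) if s_list else 1.0
    r = gamma * q
    for (s, y), alpha, rho in zip(zip(s_list, y_list), reversed(alphas), reversed(rhos)):
        beta = rho * (y @ r)                                # oldest to newest
        r += (alpha - beta) * s
    return r   # the search direction is -r
```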
Article
Full-text available
The main goal of machine learning is the creation of self-learning algorithms in many areas of human activity. It allows artificial intelligence to replace humans in tasks aimed at expanding production. The theory of artificial neural networks, which have already replaced humans in many problems, remains the most well-utilized branch of machine learning. Thus, one must select appropriate neural network architectures, data processing, and advanced applied mathematics tools. A common challenge for these networks is achieving the highest accuracy in a short time. This problem is usually addressed by modifying networks and improving data pre-processing, where accuracy increases along with training time. By using optimization methods, one can improve the accuracy without increasing the time. In this review, we consider the existing optimization algorithms that are used in neural networks. We present modifications of optimization algorithms of the first, second, and information-geometric order, which are related to information geometry for Fisher–Rao and Bregman metrics. These optimizers have significantly influenced the development of neural networks through geometric and probabilistic tools. We present applications of all the given optimization algorithms, considering the types of neural networks. After that, we show ways to develop optimization algorithms in further research using modern neural networks. Fractional-order, bilevel, and gradient-free optimizers can replace classical gradient-based optimizers. Such approaches are used in graph, spiking, complex-valued, quantum, and wavelet neural networks. Besides pattern recognition, time series prediction, and object detection, there are many other applications in machine learning: quantum computations, partial differential and integrodifferential equations, and stochastic processes.
... Despite this, instead of relying on external inputs for step length selection, several algorithms use classical nonlinear optimization methods to dynamically determine step lengths during runtime. For instance, [6,18,21] have each examined stochastic line search algorithms that potentially involve multiple function evaluations on a given direction; [15] introduced a stochastic step search algorithm that modifies the search direction each time a step is refused under a relaxed Armijo condition; in terms of trust region approaches, [3,7,22] have each proposed remedies for adapting existing algorithms to stochastic settings. ...
Preprint
Full-text available
Many machine learning applications and tasks rely on the stochastic gradient descent (SGD) algorithm and its variants. Effective step length selection is crucial for the success of these algorithms, which has motivated the development of algorithms such as ADAM or AdaGrad. In this paper, we propose a novel algorithm for adaptive step length selection in the classical SGD framework, which can be readily adapted to other stochastic algorithms. Our proposed algorithm is inspired by traditional nonlinear optimization techniques and is supported by analytical findings. We show that under reasonable conditions, the algorithm produces step lengths in line with well-established theoretical requirements, and generates iterates that converge to a stationary neighborhood of a solution in expectation. We test the proposed algorithm on logistic regressions and deep neural networks and demonstrate that the algorithm can generate step lengths comparable to the best step length obtained from manual tuning.
... A finite-difference quasi-Newton approach for noisy functions was developed in [3], where the differencing intervals were adjusted according to the level of noise. In [41], a noise-tolerant version of the BFGS method was proposed and analyzed, followed by extensions in [36] to make the method more robust and efficient in practice. Trust-region methods for noisy optimization were developed and analyzed in [38]. ...
Preprint
We analyze the convergence of a nonlocal gradient descent method for minimizing a class of high-dimensional non-convex functions, where a directional Gaussian smoothing (DGS) is proposed to define the nonlocal gradient (also referred to as the DGS gradient). The method was first proposed in [42], in which multiple numerical experiments showed that replacing the traditional local gradient with the DGS gradient can help the optimizers escape local minima more easily and significantly improve their performance. However, a rigorous theory for the efficiency of the method on nonconvex landscapes is lacking. In this work, we investigate the scenario where the objective function is composed of a convex function, perturbed by an oscillating noise. We provide a convergence theory under which the iterates exponentially converge to a tightened neighborhood of the solution, whose size is characterized by the noise wavelength. We also establish a correlation between the optimal values of the Gaussian smoothing radius and the noise wavelength, thus justifying the advantage of using a moderate or large smoothing radius with the method. Furthermore, if the noise level decays to zero when approaching the global minimum, we prove that DGS-based optimization converges to the exact global minimum with linear rates, similarly to standard gradient-based methods in optimizing convex functions. Several numerical experiments are provided to confirm our theory and illustrate the superiority of the approach over those based on the local gradient.
... Although RGD (3.8) with the BB stepsize is efficient in practice, it needs to calculate π*_{U_t} exactly, which can be challenging (or might be unnecessary) to do. Motivated by the well-established inexact gradient methods for solving optimization with noise [10,13,5,38] and the expression (3.6), we consider using grad_U L(x_t) = Proj_{T_{U_t}S}(−2 V_{π_t} U_t), the gradient information of L(x_t) with x_t = (α_t, β_t, U_t), to approximate grad q(U_t), where π_t is an approximation of π*_{U_t}. Here, (α_t, β_t) approximately minimizes L(α, β, U_t) over α ∈ R^n, β ∈ R^n and satisfies e_2(x_t) ≤ θ_t, where θ_t ≥ 0 is a preselected tolerance. ...
Preprint
Full-text available
Projecting the distance measures onto a low-dimensional space is an efficient way of mitigating the curse of dimensionality in the classical Wasserstein distance using optimal transport. The obtained maximized distance is referred to as the projection robust Wasserstein (PRW) distance. In this paper, we equivalently reformulate the computation of the PRW distance as an optimization problem over the Cartesian product of the Stiefel manifold and the Euclidean space with additional nonlinear inequality constraints. We propose a Riemannian exponential augmented Lagrangian method (ReALM) with a global convergence guarantee to solve this problem. Compared with the existing approaches, ReALM can potentially avoid too small penalty parameters. Moreover, we propose a framework of inexact Riemannian gradient descent methods to solve the subproblems in ReALM efficiently. In particular, by using the special structure of the subproblem, we give a practical algorithm named the inexact Riemannian Barzilai-Borwein method with Sinkhorn iteration (iRBBS). The remarkable features of iRBBS lie in that it performs a flexible number of Sinkhorn iterations to compute an inexact gradient with respect to the projection matrix of the problem and adopts the Barzilai-Borwein stepsize based on the inexact gradient information to improve the performance. We show that iRBBS can return an ε-stationary point of the original PRW distance problem within O(ε^(−3)) iterations. Extensive numerical results on synthetic and real datasets demonstrate that our proposed ReALM as well as iRBBS outperform the state-of-the-art solvers for computing the PRW distance.
... Byrd et al. [BHNS16] have proposed a stochastic quasi-Newton method in limited memory form through subsampled Hessian-vector products. Shi et al. [SXBN20] have proposed practical extensions of the BFGS and L-BFGS methods for nonlinear optimization that are capable of dealing with noise by employing a new linesearch technique. More recently, Xie, Byrd and Nocedal [XBN20] have considered the convergence analysis of quasi-Newton methods when there are (bounded) errors in both function and gradient evaluations, and established conditions under which an Armijo-Wolfe linesearch on the noisy function yields sufficient decrease in the true objective function. ...
Preprint
Full-text available
We investigate quasi-Newton methods for minimizing a strictly convex quadratic function which is subject to errors in the evaluation of the gradients. The methods all give identical behavior in exact arithmetic, generating minimizers of Krylov subspaces of increasing dimensions, thereby having finite termination. A BFGS quasi-Newton method is empirically known to behave very well on a quadratic problem subject to small errors. We also investigate large-error scenarios, in which the expected behavior is not so clear. In particular, we are interested in the behavior of quasi-Newton matrices that differ from the identity by a low-rank matrix, such as a memoryless BFGS method. Our numerical results indicate that for large errors, a memory-less quasi-Newton method often outperforms a BFGS method. We also consider a more advanced model for generating search directions, based on solving a chance-constrained optimization problem. Our results indicate that such a model often gives a slight advantage in final accuracy, although the computational cost is significantly higher.
Article
This work concerns evolutionary approaches to black-box noisy optimization where the problem is accessed via noisy function evaluations. An evolution strategy (ES) algorithm is proposed, which uses a Gaussian distribution to guide the search and requires only the comparisons among solutions. The new method achieves a similar convergence rate to finite-difference based gradient methods on non-convex landscapes, while being adaptive in the sense that the convergence is ensured with any initial step-size. We further improve the method with a variance adaptation mechanism that alleviates the need for hyper-parameter tuning and an asynchronous parallelization implementation that enables linear speedup. The persistent monitoring task is chosen as an application to investigate the effectiveness of the proposed method. The task is to schedule a team of agents to minimize the uncertainties of some targets in a changing environment defined on a grid map. We show in the single-agent case that the task can be cast into a sequence of pathfinding subproblems whose sequential order can be modeled as a Markov decision process and solved by the proposed ES method. In the multi-agent case, we show that solving the single-agent subproblems using the ES method in a round-robin way can provide collision-free solutions. Numerical studies demonstrate the reduction in convergence time and the robustness against complicated environments relative to several existing evolutionary algorithms and zeroth-order gradient methods.
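The evolution strategy summarized in this abstract guides the search with a Gaussian distribution and uses only comparisons among sampled solutions, which makes it naturally tolerant of noisy evaluations. The sketch below is a generic rank-based Gaussian ES step with a crude step-size schedule, not the paper's specific variance-adaptation or parallelization scheme.

```python
import numpy as np

def es_step(f, x, sigma, n_samples=20, elite_frac=0.25, rng=None):
    """One comparison-based ES iteration: sample Gaussian candidates around x,
    rank them by their (possibly noisy) objective values, and move x to the
    mean of the best-ranked candidates. No gradients are used."""
    rng = np.random.default_rng() if rng is None else rng
    cand = x + sigma * rng.standard_normal((n_samples, x.size))
    vals = np.array([f(c) for c in cand])
    elite = cand[np.argsort(vals)[: max(1, int(elite_frac * n_samples))]]
    return elite.mean(axis=0)

# usage on a noisy quadratic, shrinking sigma slowly as a crude adaptation
x, sigma = np.array([2.0, -3.0]), 1.0
for _ in range(100):
    x = es_step(lambda z: z @ z + 1e-2 * np.random.uniform(-1, 1), x, sigma)
    sigma *= 0.97
```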
Preprint
Full-text available
Our global understanding of clouds and aerosols relies on the remote sensing of their optical, microphysical, and macrophysical properties using, in part, scattered solar radiation. Current retrievals assume clouds and aerosols form plane-parallel, homogeneous layers and utilize 1D radiative transfer (RT) models. These assumptions limit the detail that can be retrieved about the 3D variability of cloud and aerosol fields and induce biases in the retrieved properties for highly heterogeneous structures such as cumulus clouds and smoke plumes. In Part 1 of this two-part study, we validated a tomographic method that utilizes multi-angle passive imagery to retrieve 3D distributions of species using 3D RT to overcome these issues. That validation characterized the uncertainty in the approximate Jacobian used in the tomographic retrieval over a wide range of atmospheric and surface conditions for several horizontal boundary conditions. Here in Part 2, we test the algorithm’s effectiveness on synthetic data to test whether retrieval accuracy is limited by the use of the approximate Jacobian. We retrieve 3D distributions of volume extinction coefficient (σ3D) at 40 m resolution from synthetic multi-angle, mono-spectral imagery at 35 m resolution derived from stochastically-generated ‘cumuliform’ clouds in (1 km)3 domains. The retrievals are idealized in that we neglect forward modelling and instrumental errors with the exception of radiometric noise; thus reported retrieval errors are lower bounds. σ3D is retrieved with, on average, a Relative Root Mean Square Error (RRMSE) < 20 % and bias < 0.1 % for clouds with Maximum Optical Depth (MOD) < 17, and the RRMSE of the radiances is < 0.5 %, indicating very high accuracy in shallow cumulus conditions. As the MOD of the clouds increases to 80, the RRMSE and biases in σ3D worsen to 60 % and −35 %, respectively, and the RRMSE of the radiances reaches 16 %, indicating incomplete convergence. This is expected from the increasing ill-conditioning of the inverse problem with decreasing mean-free-path predicted by RT theory and discussed in detail in Part 1. We tested retrievals that use a forward model that is better conditioned but less accurate due to more aggressive delta-M scaling. This reduces the radiance RRMSE to 9 % and the bias in σ3D to −8 % in clouds with MOD ~80, with no improvement in the RRMSE of σ3D. This illustrates a significant sensitivity of the retrieval to the numerical configuration of the RT model which, at least in our circumstances, improves the retrieval accuracy. All of these ensemble-averaged results are robust to the inclusion of radiometric noise during the retrieval. However, individual realizations can have large deviations of up to 18 % in the mean extinction in clouds with MOD ~80, which indicates large uncertainties in the retrievals in the optically thick limit. Using the better conditioned forward model tomography can also accurately infer optical depths (OD) in conditions spanning the majority of oceanic, cumulus fields (MOD < 80) as the retrieval provides OD with bias and RRMSE better than −8 % and 36 %, respectively. This is a significant improvement over retrievals using 1D RT, which have OD biases between −30 % and −23 % and RRMSE between 29 % and 80 % for the clouds used here. Prior information or other sources of information will be required to improve the RRMSE of σ3D in the optically thick limit, where the RRMSE is shown to have strong spatial structure that varies with the solar and viewing geometry.
Article
The goal of this paper is to investigate an approach for derivative-free optimization that has not received sufficient attention in the literature and is yet one of the simplest to implement and parallelize. In its simplest form, it consists of employing derivative-based methods for unconstrained or constrained optimization and replacing the gradient of the objective (and constraints) by finite-difference approximations. This approach is applicable to problems with or without noise in the functions. The differencing interval is determined by a bound on the second (or third) derivative and by the noise level, which is assumed to be known or to be accessible through difference tables or sampling. The use of finite-difference gradient approximations has been largely dismissed in the derivative-free optimization literature as too expensive in terms of function evaluations or as impractical in the presence of noise. However, the test results presented in this paper suggest that it has much to be recommended. The experiments compare NEWUOA, DFO-LS and COBYLA against finite-difference versions of L-BFGS, LMDER and KNITRO on three classes of problems: general unconstrained problems, nonlinear least squares problems and nonlinear programs with inequality constraints.
Article
Full-text available
We describe the most recent evolution of our constrained and unconstrained testing environment and its accompanying SIF decoder. Code-named SIFDecode and CUTEst, these updated versions feature dynamic memory allocation, a modern thread-safe Fortran modular design, a new Matlab interface and a revised installation procedure integrated with GALAHAD.
Article
Full-text available
We employ recent work on computational noise to obtain near-optimal difference estimates of the derivative of a noisy function. Our analysis relies on a stochastic model of the noise without assuming a specific form of distribution. We use this model to derive theoretical bounds for the errors in the difference estimates and obtain an easily computable difference parameter that is provably near-optimal. Numerical results closely resemble the theory and show that we obtain accurate derivative estimates even when the noisy function is deterministic.
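The near-optimal difference parameter discussed above balances truncation error against the noise level of the function. The sketch below uses the familiar forward-difference trade-off h ≈ 2·sqrt(eps_f / mu2), where eps_f is the noise level and mu2 a bound on the second derivative; the paper derives a slightly different near-optimal constant, so treat this as the textbook rule of thumb rather than the paper's formula.

```python
import numpy as np

def fd_gradient_noisy(f, x, eps_f, mu2=1.0):
    """Forward-difference gradient of a noisy function. The interval
    h = 2*sqrt(eps_f/mu2) roughly balances the truncation error (~ mu2*h/2)
    against the noise error (~ 2*eps_f/h)."""
    h = 2.0 * np.sqrt(eps_f / mu2)
    f0 = f(x)
    g = np.zeros(x.size)
    for i in range(x.size):
        xh = np.array(x, dtype=float)
        xh[i] += h
        g[i] = (f(xh) - f0) / h
    return g
```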
Article
Full-text available
This remark describes an improvement and a correction to Algorithm 778. It is shown that the performance of the algorithm can be improved significantly by making a relatively simple modification to the subspace minimization phase. The correction concerns an error caused by the use of routine dpmeps to estimate machine precision.
Article
Full-text available
The initial release of CUTE, a widely used testing environment for optimization software was described in Bongartz, Conn, Gould and Toint (1995). The latest version, now known as CUTEr is presented. New features include reorganisation of the environment to allow simultaneous multi-platform installation, new tools for, and interfaces to, optimization packages, and a considerably simplified and entirely automated installation procedure for UNIX systems. The SIF decoder, which used to be a part of CUTE, has become a separate tool, easily callable by various packages. It features simple extensions to the SIF test problem format and the generation of files suited to automatic differentiation packages.
Article
Full-text available
Computational noise in deterministic simulations is as ill-defined a concept as can be found in scientific computing. When coupled with adaptive strategies, the effects of finite precision destroy smoothness of the simulation output and complicate subsequent analysis. Following the work of Hamming on roundoff errors, we present a new algorithm, ECnoise, for quantifying the noise level of a computed function. Our theoretical framework is based on stochastic noise but does not assume a specific distribution for the noise. For the deterministic simulations considered, ECnoise produces reliable results in a few function evaluations and offers new insights into building blocks of large scale simulations.
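The noise-level estimate described above works from a Hamming-style table of higher-order differences of function values along a line: differences of the smooth trend shrink rapidly with the order, while differences of the noise keep a predictable scale. The sketch below captures that idea with the standard scaling (k!)^2/(2k)! for k-th order differences; the published ECnoise algorithm adds sampling-interval selection and consistency checks that are omitted here.

```python
import math
import numpy as np

def noise_level_estimate(f, x, direction, h=1e-6, m=8, k=4):
    """Estimate the standard deviation of the noise in f near x from k-th order
    differences of m+1 equally spaced samples along `direction`. For small h
    the smooth part contributes O(h^k), so the differences are dominated by
    noise, whose k-th differences satisfy E[(Delta^k noise)^2] = binom(2k,k)*sigma^2."""
    t = np.arange(m + 1) - m / 2.0
    vals = np.array([f(x + h * ti * direction) for ti in t])
    diffs = vals
    for _ in range(k):
        diffs = np.diff(diffs)
    gamma_k = math.factorial(k) ** 2 / math.factorial(2 * k)   # = 1 / binom(2k, k)
    return float(np.sqrt(gamma_k * np.mean(diffs ** 2)))
```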
Article
This paper presents a finite difference quasi-Newton method for the minimization of noisy functions. The method takes advantage of the scalability and power of BFGS updating, and employs an adaptive procedure for choosing the differencing interval h based on the noise estimation techniques of Hamming (2012) and Moré and Wild (2011). This noise estimation procedure and the selection of h are inexpensive but not always accurate, and to prevent failures the algorithm incorporates a recovery mechanism that takes appropriate action in the case when the line search procedure is unable to produce an acceptable point. A novel convergence analysis is presented that considers the effect of a noisy line search procedure. Numerical experiments comparing the method to a model based trust region method are presented.
Article
In this paper, we prove new complexity bounds for methods of convex optimization based only on computation of the function value. The search directions of our schemes are normally distributed random Gaussian vectors. It appears that such methods usually need at most n times more iterations than the standard gradient methods, where n is the dimension of the space of variables. This conclusion is true for both nonsmooth and smooth problems. For the latter class, we present also an accelerated scheme with the expected rate of convergence (Formula presented.), where k is the iteration counter. For stochastic optimization, we propose a zero-order scheme and justify its expected rate of convergence (Formula presented.). We give also some bounds for the rate of convergence of the random gradient-free methods to stationary points of nonconvex functions, for both smooth and nonsmooth cases. Our theoretical results are supported by preliminary computational experiments.
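The random gradient-free schemes analyzed above build their search direction from a single Gaussian direction and a difference quotient of function values. A minimal sketch of one such step is shown below; the smoothing parameter mu and the step size are illustrative choices, not the step-size rules from the paper.

```python
import numpy as np

def gaussian_gradient_free_step(f, x, mu=1e-4, step=1e-2, rng=None):
    """One iteration of a random gradient-free scheme: sample u ~ N(0, I),
    form the forward-difference estimate g = (f(x + mu*u) - f(x)) / mu * u,
    and take a gradient-type step along -g."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(x.size)
    g = (f(x + mu * u) - f(x)) / mu * u
    return x - step * g

# usage: a few hundred steps on a simple convex quadratic
x = np.array([3.0, -1.0])
for _ in range(200):
    x = gaussian_gradient_free_step(lambda z: z @ z, x)
```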
Article
The question of how to incorporate curvature information in stochastic approximation methods is challenging. The direct application of classical quasi-Newton updating techniques for deterministic optimization leads to noisy curvature estimates that have harmful effects on the robustness of the iteration. In this paper, we propose a stochastic quasi-Newton method that is efficient, robust and scalable. It employs the classical BFGS update formula in its limited memory form, and is based on the observation that it is beneficial to collect curvature information pointwise, and at regular intervals, through (sub-sampled) Hessian-vector products. This technique differs from the classical approach that would compute differences of gradients, and where controlling the quality of the curvature estimates can be difficult. We present numerical results on problems arising in machine learning that suggest that the proposed method shows much promise.
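The key design choice in the stochastic quasi-Newton method summarized above is that curvature pairs are collected pointwise and at regular intervals via subsampled Hessian-vector products, instead of differencing noisy stochastic gradients. The sketch below shows that curvature-collection step; the function names and the finite-difference fallback for the Hessian-vector product are illustrative assumptions.

```python
def collect_curvature_pair(hess_vec_subsampled, x_avg_prev, x_avg_curr):
    """Curvature pair for a stochastic L-BFGS scheme: s is the difference of
    averaged iterates and y is a subsampled Hessian-vector product with s,
    rather than a difference of noisy stochastic gradients."""
    s = x_avg_curr - x_avg_prev
    y = hess_vec_subsampled(x_avg_curr, s)
    return s, y

def hess_vec_fd(grad_minibatch, x, v, r=1e-4):
    """A subsampled Hessian-vector product can itself be approximated by a
    directional finite difference of the minibatch gradient."""
    return (grad_minibatch(x + r * v) - grad_minibatch(x)) / r
```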
Article
The BFGS update formula is shown to have an important property that is independent of the algorithmic context of the update, and that is relevant to both constrained and unconstrained optimization. The BFGS method for unconstrained optimization, using a variety of line searches, including backtracking, is shown to be globally and superlinearly convergent on uniformly convex problems. The analysis is particularly simple due to the use of some new tools introduced in this paper.
Article
When minimizing a smooth nonlinear function whose derivatives are not available, a popular approach is to use a gradient method with a finite-difference approximation substituted for the exact gradient. In order for such a method to be effective, it must be possible to compute “good” derivative approximations without requiring a large number of function evaluations. Certain “standard” choices for the finite-difference interval may lead to poor derivative approximations for badly scaled problems. We present an algorithm for computing a set of intervals to be used in a forward-difference approximation of the gradient.
Article
This paper presents experimental comparisons of several methods for approximating derivatives by finite differences. In particular, the method for choosing the forward difference interval is discussed. The focus is on the optimization of outputs of computer models (e.g. circuit simulation, structural analysis), and so optimization on functions with as few as two digits of accuracy is considered. A new dynamic interval size adjustment is presented. The primary findings of the study were: (a) choosing a good initial interval size was more important than using the best optimization algorithm, and (b) a simple rule for dynamic readjustment of interval size was effective in improving convergence, particularly for function representations with low accuracy.
Article
Monte Carlo is one of the most versatile and widely used numerical methods. Its convergence rate, O(N^(−1/2)), is independent of dimension, which shows Monte Carlo to be very robust but also slow. This article presents an introduction to Monte Carlo methods for integration problems, including convergence theory, sampling methods and variance reduction techniques. Accelerated convergence for Monte Carlo quadrature is attained using quasi-random (also called low-discrepancy) sequences, which are a deterministic alternative to random or pseudo-random sequences. The points in a quasi-random sequence are correlated to provide greater uniformity. The resulting quadrature method, called quasi-Monte Carlo, has a convergence rate of approximately O((log N)^k N^(−1)). For quasi-Monte Carlo, both theoretical error estimates and practical limitations are presented. Although the emphasis in this article is on integration, Monte Carlo simulation of rarefied gas dynamics is also discussed. In the limit of small mean free path (that is, the fluid dynamic limit), Monte Carlo loses its effectiveness because the collisional distance is much less than the fluid dynamic length scale. Computational examples are presented throughout the text to illustrate the theory. A number of open problems are described.
Article
We study the numerical performance of a limited memory quasi-Newton method for large scale optimization, which we call the L-BFGS method. We compare its performance with that of the method developed by Buckley and LeNir (1985), which combines cycles of BFGS steps and conjugate direction steps. Our numerical tests indicate that the L-BFGS method is faster than the method of Buckley and LeNir, and is better able to use additional storage to accelerate convergence. We show that the L-BFGS method can be greatly accelerated by means of a simple scaling. We then compare the L-BFGS method with the partitioned quasi-Newton method of Griewank and Toint (1982a). The results show that, for some problems, the partitioned quasi-Newton method is clearly superior to the L-BFGS method. However we find that for other problems the L-BFGS method is very competitive due to its low iteration cost. We also study the convergence properties of the L-BFGS method, and prove global convergence on uniformly convex problems.
Article
In the presence of rounding errors the sequence of iterates generated by a Newton-like method implemented on a computer differs from the generated sequence produced in theory. We give conditions for the convergence of the generated sequence to an isolated solution of the equation F(x) = 0 and show that these conditions are violated in a neighbourhood of the solution. The relative accuracy to which one can expect to estimate the solution is shown to depend largely on the accuracy to which the mapping F is evaluated.
Article
In this note we show how the implicit filtering algorithm can be coupled with the BFGS quasi-Newton update to obtain a superlinearly convergent iteration if the noise in the objective function decays sufficiently rapidly as the optimal point is approached. We show how known theory for the noise-free case can be extended and thereby provide a partial explanation for the good performance of quasi-Newton methods when coupled with implicit filtering.