Conference Paper

A method of solving a convex programming problem with convergence rate $O(1/k^2)$

Authors: Yurii Nesterov

... Stochastic gradient descent (SGD) [1] is the update rule that enables iterative learning based on the current loss and model parameters. DNNs typically converge after many steps or iterations, and model quality is heavily influenced by hyperparameters such as batch size, number of epochs, learning rate, type of optimizer (e.g., Nesterov [2], Adam [3], etc.), activation function, weight decay, and regularization. Distributed Data-Parallel (DDP) training expedites convergence by launching model replicas over many workers concurrently and periodically synchronizing their updates. However, distributed training can suffer from high communication costs due to large model sizes, large numbers of workers, and limited available bandwidth. ...
... The relative gradient change between consecutive steps can be estimated from the L2-norm of the gradient $\nabla F$ at steps $(i-1)$ and $i$, as shown in Eqn. (2): ...
... This function calculates and stores gradient variance at every iteration, applies EWMA smoothing, and computes the relative gradient change as shown in Eqn. (2). The local gradients are then used to update model parameters locally (line 9). ...
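The mechanism described in these excerpts can be illustrated with a short sketch. The Python snippet below is a minimal, hypothetical illustration of estimating the relative gradient change from L2-norms at consecutive steps and smoothing it with an EWMA; the function names, the smoothing factor, and the threshold-based synchronization decision are assumptions for illustration, not the authors' SelSync implementation.

```python
import numpy as np

def grad_l2_norm(grads):
    """Overall L2-norm of a list of per-parameter gradient arrays."""
    return float(np.sqrt(sum(np.sum(g * g) for g in grads)))

def relative_change(prev_norm, curr_norm, eps=1e-12):
    """Relative gradient change between steps (i-1) and i, cf. Eqn. (2)."""
    return abs(curr_norm - prev_norm) / (prev_norm + eps)

def ewma(smoothed, value, alpha=0.9):
    """Exponentially weighted moving average used to smooth the signal."""
    return alpha * smoothed + (1.0 - alpha) * value

# Illustrative usage: synchronize only when the smoothed signal is large.
prev_norm, smoothed, delta = 1.0, 0.0, 0.1        # assumed values
grads = [np.random.randn(3, 3), np.random.randn(3)]
curr_norm = grad_l2_norm(grads)
smoothed = ewma(smoothed, relative_change(prev_norm, curr_norm))
synchronize = smoothed > delta                    # otherwise apply only the local update
```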
Preprint
Full-text available
In distributed training, deep neural networks (DNNs) are launched over multiple workers concurrently and aggregate their local updates on each step in bulk-synchronous parallel (BSP) training. However, BSP does not scale out linearly due to the high communication cost of aggregation. To mitigate this overhead, alternatives like Federated Averaging (FedAvg) and Stale-Synchronous Parallel (SSP) either reduce synchronization frequency or eliminate it altogether, usually at the cost of lower final accuracy. In this paper, we present \texttt{SelSync}, a practical, low-overhead method for DNN training that dynamically chooses to incur or avoid communication at each step, either by calling the aggregation op or by applying local updates based on their significance. We propose various optimizations as part of \texttt{SelSync} to improve convergence in the context of \textit{semi-synchronous} training. Our system converges to the same or better accuracy than BSP while reducing training time by up to 14$\times$.
... In optimization problems like those presented above, mathematicians frequently employ a technique known as inertial-type extrapolation [14,15] to accelerate the convergence of the iterative equations. This approach involves utilizing a term $\theta_n(x_n - x_{n-1})$, where $\theta_n$ denotes an inertial parameter, to govern the momentum $x_n - x_{n-1}$. ...
... One such algorithm that has enjoyed immense popularity was developed by Nesterov [14]. He used an inertial or extrapolation technique to solve convex optimization problems of the form of (2), where $F := \psi_1 + \psi_2$ is a convex, smooth function. ...
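As a rough illustration of this inertial extrapolation, the sketch below performs one gradient step from the extrapolated point $y_n = x_n + \theta_n(x_n - x_{n-1})$. The objective gradient, step size, and constant inertial parameter are placeholders, not the specific algorithm of the cited paper.

```python
import numpy as np

def inertial_gradient_step(x_curr, x_prev, grad_F, lam=0.1, theta=0.9):
    """One inertial (Nesterov-type) step:
    y     = x_n + theta * (x_n - x_{n-1})   # extrapolate along the momentum
    x_new = y - lam * grad_F(y)             # gradient step at the extrapolated point
    """
    y = x_curr + theta * (x_curr - x_prev)
    return y - lam * grad_F(y)

# Example on the simple quadratic F(x) = 0.5 * ||x||^2, so grad_F(x) = x.
x_prev = np.ones(3)
x_curr = np.ones(3)
for _ in range(50):
    x_curr, x_prev = inertial_gradient_step(x_curr, x_prev, lambda x: x), x_curr
```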
Article
Full-text available
This paper introduces a novel two-step inertial algorithm for locating a common fixed point of a countable family of nonexpansive mappings. We establish strong convergence properties of the proposed method under mild conditions and employ it to solve convex bilevel optimization problems. The method is further applied to the image recovery problem. Our numerical experiments show that the proposed method achieves faster convergence than other related methods in the literature.
... Stochastic gradient descent (SGD) [20] has been extensively applied over the past few years to training large-scale machine learning models, because each update round uses only a mini-batch of data or even a single sample. In order to reduce the influence of noise on the stochastic gradient, the momentum technique [21,22] was introduced to correct the gradient direction. The literature [23] trained deep recurrent neural networks with momentum and strong initialization. ...
... where the first inequality follows from Assumptions 1 and 3 and $\left\| \sum_{i=1}^{n} z_i \right\|^2 \le n \sum_{i=1}^{n} \| z_i \|^2$, and the last one follows from Lemma 3 in [17]. Substituting (22) and (23) into (21), we obtain (19), which finishes the proof. ...
Article
Full-text available
As a new machine learning technique, federated learning has received more attention in recent years, which enables decentralized model training across data silos or edge intelligent devices in the Internet of Things without exchanging local raw data. All kinds of algorithms are proposed to solve the challenges in federated learning. However, most of these methods are based on stochastic gradient descent, which undergoes slow convergence and unstable performance during the training stage. In this paper, we propose a differential adaptive federated optimization method, which incorporates an adaptive learning rate and the gradient difference into the iteration rule of the global model. We further adopt the first-order moment estimation to compute the approximate value of the differential term so as to avoid amplifying the random noise from the input data sample. The theoretical convergence guarantee is established for our proposed method in a stochastic non-convex setting under full client participation and partial client participation cases. Experiments for the image classification task are performed on two standard datasets by training a neural network model, and experiment results on different baselines demonstrate the effectiveness of our proposed method.
... This paper presents the numerical calculations of classical Taylor rod-on-anvil impact tests for cylindrical OFHC copper samples. A method for selecting the JC model constants using an optimization algorithm based on the Nesterov gradient-descent method [42] was proposed. To optimize the selection of the JC model constants, a solution quality function was proposed, which can estimate the deviation of the calculations from the experimental data and determine the optimal JC model parameters. ...
... Therefore, the number of calculation steps should be minimized. For this purpose, the Nesterov gradient-descent method was chosen to determine the minimum of $Q_f$ [42]. ...
Article
Full-text available
Numerical simulation of impact and shock-wave interactions of deformable solids is an urgent problem. The key to the adequacy and accuracy of simulation is the material model that links the yield strength with accumulated plastic strain, strain rate, and temperature. A material model often used in engineering applications is the empirical Johnson–Cook (JC) model. However, an increase in the impact velocity complicates the choice of the model constants to reach agreement between numerical and experimental data. This paper presents a method for the selection of the JC model constants using an optimization algorithm based on the Nesterov gradient-descent method. A solution quality function is proposed to estimate the deviation of calculations from experimental data and to determine the optimum JC model parameters. Numerical calculations of the Taylor rod-on-anvil impact test were performed for cylindrical copper specimens. The numerical simulation performed with the optimized JC model parameters was in good agreement with the experimental data received by the authors of this paper and with the literature data. The accuracy of simulation depends on the experimental data used. For all considered experiments, the calculation accuracy (solution quality) increased by 10%. This method, developed for selecting optimized material model constants, may be useful for other models, regardless of the numerical code used for high-velocity impact simulations.
... Optimizer. Stochastic gradient descent with Nesterov momentum was employed as the optimizer for training the model [16,17]. This function updates the parameters in the negative direction of a gradient estimate and incorporates an additional momentum value of 0.95 to stabilize the gradient direction by accumulating current and previous gradients. ...
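For reference, this is how such an optimizer is commonly configured in PyTorch. The model and learning rate below are placeholders (the excerpt specifies only the momentum of 0.95), so this is an illustrative sketch rather than the authors' training setup.

```python
import torch

model = torch.nn.Linear(10, 2)                  # hypothetical model
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,                                    # assumed learning rate
    momentum=0.95,                              # momentum value quoted in the excerpt
    nesterov=True,                              # Nesterov (look-ahead) momentum
)

x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()                                # parameters move opposite the gradient estimate
optimizer.zero_grad()
```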
Article
Full-text available
An increasing and aging patient population poses a growing burden on healthcare professionals. Automation of medical imaging diagnostics holds promise for enhancing patient care and reducing manpower required to accommodate an increasing patient-population. Deep learning, a subset of machine learning, has the potential to facilitate automated diagnostics, but commonly requires large-scaled labeled datasets. In medical domains, data is often abundant but labeling is a laborious and costly task. Active learning provides a method to optimize the selection of unlabeled samples that are most suitable for improvement of the model and incorporate them into the model training process. This approach proves beneficial when only a small number of labeled samples are available. Various selection methods currently exist, but most of them employ fixed querying schedules. There is limited research on how the timing of a query can impact performance in relation to the number of queried samples. This paper proposes a novel approach called dynamic querying, which aims to optimize the timing of queries to enhance model development while utilizing as few labeled images as possible. The performance of the proposed model is compared to a model trained utilizing a fully-supervised training method, and its effectiveness is assessed based on dataset size requirements and loss rates. Dynamic querying demonstrates a considerably faster learning curve in relation to the number of labeled samples used, achieving an accuracy of 70% using only 24 samples, compared to 82% for a fully-supervised model trained on the complete training dataset of 1017 images.
... The proposed method uses Nesterov's momentum [10], which makes (non-stochastic) gradient descent converge faster. If $J$ denotes a cost function to be minimised, $\theta_t$ a parameter at iteration $t$, $v_t$ its momentum, and $\nabla_\theta J(\theta)$ the gradient of $J$ with respect to $\theta$, Nesterov's momentum updates the parameter $\theta_t$ following the pair of rules: ...
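The pair of update rules is elided in the excerpt; a commonly used formulation of Nesterov's momentum (which may differ in detail from the one in the cited work) is sketched below, with the learning rate and momentum coefficient chosen arbitrarily.

```python
import numpy as np

def nesterov_momentum_step(theta, v, grad_J, lr=0.01, mu=0.9):
    """Common look-ahead form of Nesterov's momentum:
        v_{t+1}     = mu * v_t - lr * grad_J(theta_t + mu * v_t)
        theta_{t+1} = theta_t + v_{t+1}
    """
    v_next = mu * v - lr * grad_J(theta + mu * v)
    return theta + v_next, v_next

# Example: minimize J(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta, v = np.ones(4), np.zeros(4)
for _ in range(100):
    theta, v = nesterov_momentum_step(theta, v, lambda t: t)
```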
... One of the techniques to speed up the convergence of the algorithms is the inertial technique, which Polyak first introduced [17] in 1964. Polyak's algorithm was called the heavy ball method, and it was improved by Nesterov [18]. Later on, it has been widely used to solve a wide variety of problems in the optimization field, as seen in [9, 19–22]. ...
Article
Full-text available
To detect breast cancer in mammography screening practice, we modify the inertial relaxed CQ algorithm with Mann's iteration for solving split feasibility problems in real Hilbert spaces and apply it in an extreme learning machine as an optimizer. Weak convergence of the proposed algorithm is proved under certain mild conditions. Moreover, we present the advantage of our algorithm by comparing it with existing machine learning methods. The highest performance values of 85.03% accuracy, 82.56% precision, 87.65% recall, and 85.03% F1-score show that our algorithm performs better than the other machine learning models.
... For this reason, accelerated gradient methods were developed, able to achieve the best worst-case complexity bounds [6]. Among them, two of the most well known are Polyak's heavy ball method [14] and Nesterov's accelerated gradient method [13]. ...
Preprint
Full-text available
In this work we propose a method to perform optimization on manifolds. We assume to have an objective function $f$ defined on a manifold and think of it as the potential energy of a mechanical system. By adding a momentum-dependent kinetic energy we define its Hamiltonian function, which allows us to write the corresponding Hamiltonian system. We make it conformal by introducing a dissipation term: the result is the continuous model of our scheme. We solve it via splitting methods (Lie-Trotter and leapfrog): we combine the RATTLE scheme, approximating the conserved flow, with the exact dissipated flow. The result is a conformal symplectic method for constant stepsizes. We also propose an adaptive stepsize version of it. We test it on an example, the minimization of a function defined on a sphere, and compare it with the usual gradient descent method.
... for which the rate of convergence of $f(x(t)) - \min f$ is of order $O(1/t)$ as $t \to +\infty$. The dynamics (2) can be seen as a continuous version of the celebrated Nesterov accelerated gradient algorithm with momentum [16]. ...
... Typical optimization methods assume the objective function domain Θ to be Euclidean, and they vary primarily in how the search directions are specified, from direct use of gradients to various conjugate gradient variants (see, for example, Nesterov (1983), Bhaya and Kaszkurewicz (2004), or Shanno (1978)), and in how updates of those directions in combination with gradients are specified (Shanno, 1978). The scientific literature covers such optimization methods in great detail, with several theoretical results and practical efficiency covered in Shanno (1978) and Polak (1997). ...
Preprint
Full-text available
We consider the fundamental task of optimizing a real-valued function defined in a potentially high-dimensional Euclidean space, such as the loss function in many machine-learning tasks or the logarithm of the probability distribution in statistical inference. We use warped Riemannian geometry notions to redefine the optimisation problem of a function on Euclidean space as a problem on a Riemannian manifold with a warped metric, and then find the function's optimum along this manifold. The warped metric chosen for the search domain induces a computationally friendly metric tensor for which the optimal search directions, associated with geodesic curves on the manifold, become easier to compute. Performing optimization along geodesics is known to be generally infeasible, yet we show that in this specific manifold we can analytically derive Taylor approximations up to third order. In general these approximations to the geodesic curve will not lie on the manifold; however, we construct suitable retraction maps to pull them back onto the manifold. Therefore, we can efficiently optimize along the approximate geodesic curves. We cover the related theory, describe a practical optimization algorithm and empirically evaluate it on a collection of challenging optimisation benchmarks. Our proposed algorithm, using a third-order approximation of geodesics, outperforms standard Euclidean gradient-based counterparts in terms of the number of iterations until convergence, as well as an alternative method for Hessian-based optimisation routines.
... We implement the networks within the Pennylane framework [54], and train them using the Nesterov momentum optimiser [55]. The results are shown in figure 4. Our results show that the reflection equivariant model consistently outperforms the generic model, despite its lower expressibility and the generic model containing 50% more trainable parameters. ...
Article
Full-text available
Machine learning is among the most widely anticipated use cases for near-term quantum computers, however there remain significant theoretical and implementation challenges impeding its scale up. In particular, there is an emerging body of work which suggests that generic, data agnostic quantum machine learning (QML) architectures may suffer from severe trainability issues, with the gradient of typical variational parameters vanishing exponentially in the number of qubits. Additionally, the high expressibility of QML models can lead to overfitting on training data and poor generalisation performance. A promising strategy to combat both of these difficulties is to construct models which explicitly respect the symmetries inherent in their data, so-called geometric quantum machine learning (GQML). In this work, we utilise the techniques of GQML for the task of image classification, building new QML models which are equivariant with respect to reflections of the images. We find that these networks are capable of consistently and significantly outperforming generic ansatze on complicated real-world image datasets, bringing high-resolution image classification via quantum computers closer to reality. Our work highlights a potential pathway for the future development and implementation of powerful QML models which directly exploit the symmetries of data.
... A common method for optimizers to speed up the training of a model is to incorporate a momentum variable of sorts along with the gradient of the error. This concept is used by many optimizers such as Nesterov Accelerated Gradient [13], AdaGrad [14], RMSProp, and Adam [15]. The Adam optimizer (adaptive moment estimation) and RMSProp use an average of the previous gradients to compute the steps towards a better-performing model. ...
Article
This paper proposes a deep learning-based model to segment gastrointestinal tract (GI) magnetic resonance images (MRI). The application of this model will be useful in potentially accelerating treatment times and possibly improving the quality of the treatments for patients who must undergo radiation treatments in cancer centers. The proposed model employs the U-net architecture, which provides outstanding overall performance in medical image segmentation tasks. The model that was developed through this project has a score of 81.86% using a combination of the dice coefficient and the Hausdorff distance measures, rendering it highly accurate in segmenting and contouring organs in the gastrointestinal system.
... In [23], Hager and Zhang proposed a first-order method that combines the sparse reconstruction by separable approximation (SpaRSA) [54] with the dual active set algorithm (DASA) [21]; the Q-linear convergence result is achieved. Furthermore, relying on the work of Nesterov [36], accelerated proximal gradient methods [4], such as FISTA [53] and MFISTA [4], have also been proposed. ...
Article
Full-text available
The polyhedral projection problem arises in various applications. To efficiently solve the dual problem, one of crucial issues is to safely identify zero-elements as well as the signs of nonzero elements at the optimal solution. In this paper, relying on its nonsmooth dual problem and active set techniques, we first propose a Duality-Gap-Active-Set strategy (DGASS) to effectively identify the indices of zero-elements and the signs of nonzero entries of the optimal solution. Serving as an efficient acceleration strategy, DGASS can be embedded into certain iterative methods. In particular, by applying DGASS to both the proximal gradient algorithm (PGA) and the proximal semismooth Newton algorithm (PSNA), we propose the methods of PGA-DGASS and PSNA-DGASS, respectively. Global convergence and local quadratic convergence rate are discussed. We report on numerical results over both synthetic and real data sets to demonstrate the high efficiency of the two DGASS-accelerated methods.
... and which have an analogous structure to the accelerated gradient method of Nesterov [40,41]. Specifically, we consider the following discretization of the dynamic ...
Article
Full-text available
In a Hilbert space setting, we consider new first order optimization algorithms which are obtained by temporal discretization of a damped inertial autonomous dynamic involving dry friction. The function f to be minimized is assumed to be differentiable (not necessarily convex). The dry friction potential function $\varphi$, which has a sharp minimum at the origin, enters the algorithm via its proximal mapping, which acts as a soft thresholding operator on the sum of the velocity and the gradient terms. After a finite number of steps, the structure of the algorithm changes, losing its inertial character to become the steepest descent method. The geometric damping driven by the Hessian of f makes it possible to control and attenuate the oscillations. The algorithm generates convergent sequences when f is convex, and in the nonconvex case when f satisfies the Kurdyka–Lojasiewicz property. The convergence results are robust with respect to numerical errors, and perturbations. The study is then extended to the case of a nonsmooth convex function f, in which case the algorithm involves the proximal operators of f and $\varphi$ separately. Applications are given to the Lasso problem and nonsmooth d.c. programming.
... Problem (4) is a quadratic optimization problem which is efficiently solvable. In particular, we solve this subproblem using Nesterov's accelerated gradient descent method (Nesterov, 1983). Let f (W;X, ) be the objective function of problem (4). ...
Article
Full-text available
We develop a new model for tensor completion which incorporates noisy side information available on the rows and columns of a 3-dimensional tensor. This method learns a low rank representation of the data along with regression coefficients for the observed noisy features. Given this model, we propose an efficient alternating minimization algorithm to find high-quality solutions that scales to large data sets. Through extensive computational experiments, we demonstrate that this method leads to significant gains in out-of-sample accuracy filling in missing values in both simulated and real-world data. We consider the problem of imputing drug response in three large-scale anti-cancer drug screening data sets: the Genomics of Drug Sensitivity in Cancer (GDSC), the Cancer Cell Line Encyclopedia (CCLE), and the Genentech Cell Line Screening Initiative (GCSI). On imputation tasks with 20% to 80% missing data, we show that the proposed method TensorGenomic matches or outperforms state-of-the-art methods including the original tensor model and a multilevel mixed effects model. With 80% missing data, TensorGenomic improves the $R^2$ from 0.404 to 0.552 in the GDSC data set, 0.407 to 0.524 in the CCLE data set, and 0.331 to 0.453 in the GCSI data set compared to the tensor model which does not take into account genomic side information.
... Many algorithms have been proposed with the aim of improving the adaptation/learning transients provided by "gradient rule" based algorithms; see [6], [15], [16], [4], [1], [5], [13]. In [9] it was shown that all these algorithms can be cast in a unified general form, and the concept of a "dynamic adaptation gain/learning rate" (DAG) has been ...
... Nesterov's Accelerated Gradient (NAG) acts as a special adaptive momentum algorithm that replaces the hyperparameter $\beta$ with a designed scheme. This algorithm can attain optimal global convergence for smooth convex functions [27,28]. Paper [17] uses an algorithm for first-order gradient-based optimization of stochastic objective functions based on adaptive estimates of lower-order moments, which requires only a little tuning of its hyperparameters and has intuitive interpretations. ...
Conference Paper
Full-text available
The heavy ball momentum technique is widely used in accelerating the machine learning training process, and it has demonstrated significant practical success in optimization tasks. However, most heavy ball methods require a preset hyperparameter that results in excessive tuning, and a calibrated fixed hyperparameter may not lead to optimal performance. In this paper, we propose an adaptive criterion for the choice of the normalized momentum-related hyperparameter, motivated by the quadratic optimization training problem, to eliminate the effort of tuning the hyperparameter and thus allow for a computationally efficient optimizer. We theoretically prove that our proposed adaptive method promises convergence for L-Lipschitz functions. In addition, we verify its practical efficiency on existing extensive machine learning benchmarks for image classification tasks. The numerical results show that, besides the speed improvement, our proposed methods enjoy advantages including greater robustness to large learning rates and better generalization.
... The inertial extrapolation-type algorithms have been studied by several authors; see refs. [13–15]. ...
Article
Full-text available
In this work, an algorithm is introduced for the problem of finding a common fixed point of a finite family of G-nonexpansive mappings in a real Hilbert space endowed with a directed graph G. This algorithm is a modified parallel algorithm inspired by the inertial method and the Mann iteration process. Moreover, both weak and strong convergence theorems are provided for the algorithm. Furthermore, an application of the algorithm to a signal recovery problem with multiple blurring filters is presented. Consequently, the numerical experiment shows better results compared with the previous algorithm.
... Let $\{T_n\}$ be a family of nonexpansive mappings of $C$ onto itself satisfying the condition (Z) such that $\Gamma := \bigcap_{n=1}^{\infty} \mathrm{Fix}(T_n) \neq \emptyset$. Many mathematicians often use inertial-type extrapolation [38,39] in optimization problems to speed up the convergence of iterative methods by using the technical term $\theta_n(x_n - x_{n-1})$. The momentum $x_n - x_{n-1}$ is controlled by the parameter $\theta_n$, also known as an inertial parameter. ...
Article
Full-text available
Fixed-point theory plays many important roles in real-world problems, such as image processing, classification problems, etc. This paper introduces and analyzes a new, accelerated common-fixed-point algorithm using the viscosity approximation method and then employs it to solve convex bilevel optimization problems. The proposed method was applied to data classification with the Diabetes, Heart Disease UCI and Iris datasets. According to the data classification experiment results, the proposed algorithm outperformed the others in the literature.
... which by Proposition 3.6 implies that assumption (3.1) holds. Note that this example ($\theta_n = 1 - \bar{\theta}/n$, $\bar{\theta} > 1$) falls within the setting of Nesterov's extrapolation methods, and many practical choices for $\theta_n$ satisfy assumption (3.1) (see Attouch and Cabot 2020, Beck and Teboulle 2009, Chambolle and Dossal 2015, Nesterov 1983). ...
Article
Full-text available
We first establish weak convergence results regarding an inertial Krasnosel’skiĭ-Mann iterative method for approximating common fixed points of countable families of nonexpansive mappings in real Hilbert spaces with no extra assumptions on the considered countable families of nonexpansive mappings. The method of proof and the imposed conditions on the iterative parameters are different from those already available in the literature. We then present some applications to the Douglas–Rachford splitting method and image restoration problems, and compare the performance of our method with that of other popular inertial Krasnosel’skiĭ-Mann methods which can be found in the literature.
... Momentum is a heuristic but powerful way to accelerate the convergence of SGD. Motivated by the heavy ball method [24] and Nesterov's accelerated gradient method [25], a momentum term is usually added to the current update of the descent direction as a weighted sum of previous information to improve the convergence of SGD [26]. Sutskever et al. [27] successfully combined SGD with a careful use of the momentum method in the training of deep neural networks. ...
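To make the distinction between the two momentum schemes mentioned here concrete, the sketch below contrasts a heavy-ball step (gradient at the current iterate plus a weighted previous displacement) with a Nesterov step (gradient at the extrapolated point); the step size and momentum weight are illustrative choices.

```python
def heavy_ball_step(x, x_prev, grad_f, lr=0.01, beta=0.9):
    """Polyak heavy ball: momentum is a weighted sum of previous information,
    but the gradient is evaluated at the current iterate x."""
    return x - lr * grad_f(x) + beta * (x - x_prev)

def nesterov_step(x, x_prev, grad_f, lr=0.01, beta=0.9):
    """Nesterov: the gradient is evaluated at the look-ahead point
    y = x + beta * (x - x_prev), which typically improves convergence."""
    y = x + beta * (x - x_prev)
    return y - lr * grad_f(y)
```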
Article
Full-text available
Federated learning serves as a novel distributed training framework that enables multiple clients of the Internet of Things to collaboratively train a global model while the data remains local. However, the implementation of federated learning faces many problems in practice, such as the large number of training rounds needed for convergence due to the size of the model, and the lack of adaptivity of the stochastic gradient-based update at the client side. Meanwhile, it is sensitive to noise during the optimization process, which can affect the performance of the final model. For these reasons, we propose Federated Adaptive learning based on Derivative Term, called FedADT in this paper, which incorporates an adaptive step size and the difference of gradients in the update of the local model. To further reduce the influence of noise on the derivative term that is estimated by the difference of gradients, we use moving average decay on the derivative term. Moreover, we analyze the convergence performance of the proposed algorithm for non-convex objective functions, i.e., a convergence rate of 1/nT can be achieved by choosing appropriate hyper-parameters, where n is the number of clients and T is the number of iterations, respectively. Finally, various experiments for the image classification task are conducted by training a widely used convolutional neural network on the MNIST and Fashion MNIST datasets to verify the effectiveness of FedADT. In addition, the receiver operating characteristic curve is used to display the result of the proposed algorithm by predicting the categories of clothing on the Fashion MNIST dataset.
Preprint
Full-text available
Phylogenetic inference can be influenced by both underlying biological processes and methodological factors. While biological processes can be modeled, these models frequently make the assumption that methodological factors do not significantly influence the outcome of phylogenomic analyses. Depending on their severity, methodological factors can introduce inconsistency and uncertainty into the inference process. Although search protocols have been proposed to mitigate these issues, many solutions tend to treat factors independently or assume a linear relationship among them. In this study, we capitalize on the increasing size of phylogenetic datasets, using them to train machine learning models. This approach transcends the linearity assumption, accommodating complex non-linear relationships among features. We examined two phylogenomic datasets for teleost fishes: a newly generated dataset for protacanthopterygians (salmonids, galaxiids, marine smelts, and allies), and a reanalysis of a dataset for carangarians (flatfishes and allies). Upon testing five supervised machine learning models, we found that all outperformed the linear model (p < 0.05), with the deep neural network showing the best fit for both empirical datasets tested. Feature importance analyses indicated that influential factors were specific to individual datasets. The insights obtained have the potential to significantly enhance decision-making in phylogenetic analyses, assisting, for example, in the choice of suitable DNA sequence models and data transformation methods. This study can serve as a baseline for future endeavors aiming to capture non-linear interactions of features in phylogenomic datasets using machine learning and complement existing tools for phylogenetic analyses.
Chapter
Self-supervised representation learning (SSRL) methods have shown great success in computer vision. In recent studies, augmentation-based contrastive learning methods have been proposed for learning representations that are invariant or equivariant to pre-defined data augmentation operations. However, invariant or equivariant features favor only specific downstream tasks depending on the augmentations chosen. They may result in poor performance when the learned representation does not match task requirements. Here, we consider an active observer that can manipulate views of an object and has knowledge of the action(s) that generated each view. We introduce Contrastive Invariant and Predictive Equivariant Representation learning (CIPER). CIPER comprises both invariant and equivariant learning objectives using one shared encoder and two different output heads on top of the encoder. One output head is a projection head with a state-of-the-art contrastive objective to encourage invariance to augmentations. The other is a prediction head estimating the augmentation parameters, capturing equivariant features. Both heads are discarded after training and only the encoder is used for downstream tasks. We evaluate our method on static image tasks and time-augmented image datasets. Our results show that CIPER outperforms a baseline contrastive method on various tasks. Interestingly, CIPER encourages the formation of hierarchically structured representations where different views of an object become systematically organized in the latent representation space.
Article
Stochastic (sub)gradient methods require step size schedule tuning to perform well in practice. Classical tuning strategies decay the step size polynomially and lead to optimal sublinear rates on (strongly) convex problems. An alternative schedule, popular in nonconvex optimization, is called geometric step decay and proceeds by halving the step size after every few epochs. In recent work, geometric step decay was shown to improve exponentially upon classical sublinear rates for the class of sharp convex functions. In this work, we ask whether geometric step decay similarly improves stochastic algorithms for the class of sharp weakly convex problems. Such losses feature in modern statistical recovery problems and lead to a new challenge not present in the convex setting: the region of convergence is local, so one must bound the probability of escape. Our main result shows that for a large class of stochastic, sharp, nonsmooth, and nonconvex problems a geometric step decay schedule endows well-known algorithms with a local linear (or nearly linear) rate of convergence to global minimizers. This guarantee applies to the stochastic projected subgradient, proximal point, and prox-linear algorithms. As an application of our main result, we analyze two statistical recovery tasks—phase retrieval and blind deconvolution—and match the best known guarantees under Gaussian measurement models and establish new guarantees under heavy-tailed distributions.
Article
Full-text available
Recent research in computational imaging largely focuses on developing machine learning (ML) techniques for image reconstruction, which requires large-scale training datasets consisting of measurement data and ground-truth images. However, suitable experimental datasets for X-ray Computed Tomography (CT) are scarce, and methods are often developed and evaluated only on simulated data. We fill this gap by providing the community with a versatile, open 2D fan-beam CT dataset suitable for developing ML techniques for a range of image reconstruction tasks. To acquire it, we designed a sophisticated, semi-automatic scan procedure that utilizes a highly-flexible laboratory X-ray CT setup. A diverse mix of samples with high natural variability in shape and density was scanned slice-by-slice (5,000 slices in total) with high angular and spatial resolution and three different beam characteristics: A high-fidelity, a low-dose and a beam-hardening-inflicted mode. In addition, 750 out-of-distribution slices were scanned with sample and beam variations to accommodate robustness and segmentation tasks. We provide raw projection data, reference reconstructions and segmentations based on an open-source data processing pipeline.
Conference Paper
The fast iterative shrinkage thresholding algorithm (FISTA) has been shown to be an efficient method for solving least-squares problems with l1-norm regularization and has been applied to optical molecular tomography. It adopts a linear increase scheme to provide the Lipschitz constant, which determines the step size of the internal gradient step. The Lipschitz constant, however, will not change once the proximal gradient condition is satisfied after the linear increase, which further restricts the convergence speed of FISTA. In this work, a non-linear search scheme, which incorporates gradient information, is proposed to obtain a suitable Lipschitz constant. It provides a variable step size in each iteration, which can accelerate the convergence of the standard FISTA; we call it VFISTA. Phantom and in vivo experiments have been performed to show that VFISTA can effectively speed up the reconstruction process for the inverse problem of FMT compared to FISTA.
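For context, the standard FISTA baseline that VFISTA modifies can be sketched as below for an l1-regularized least-squares problem. This minimal NumPy version uses a simple backtracking search to grow the Lipschitz estimate L (the authors' non-linear, gradient-informed search is not reproduced here), and all parameter values are illustrative.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of the l1-norm (shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fista_backtracking(A, b, lam, L0=1.0, eta=2.0, n_iter=100):
    """FISTA for min 0.5*||Ax - b||^2 + lam*||x||_1 with backtracking on L."""
    n = A.shape[1]
    x_prev = np.zeros(n)
    y = x_prev.copy()
    t = 1.0
    L = L0
    for _ in range(n_iter):
        grad = A.T @ (A @ y - b)
        f_y = 0.5 * np.linalg.norm(A @ y - b) ** 2
        # Backtracking: grow L until the quadratic upper bound holds at the prox point.
        while True:
            x = soft_threshold(y - grad / L, lam / L)
            diff = x - y
            f_x = 0.5 * np.linalg.norm(A @ x - b) ** 2
            if f_x <= f_y + grad @ diff + 0.5 * L * (diff @ diff):
                break
            L *= eta
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x + ((t - 1.0) / t_next) * (x - x_prev)   # Nesterov extrapolation
        x_prev, t = x, t_next
    return x_prev
```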
Chapter
We present the “Amenable Sparse Network Investigator" (ASNI) algorithm, which utilizes a novel pruning strategy based on a sigmoid function that induces a sparsity level globally over the course of one single round of training. The ASNI algorithm fulfills both tasks that current state-of-the-art strategies can only do one of. The ASNI algorithm has two subalgorithms: 1) ASNI-I, 2) ASNI-II. ASNI-I learns an accurate sparse off-the-shelf network in only one single round of training. ASNI-II learns a sparse network and an initialization that is quantized, compressed, and from which the sparse network is trainable. The learned initialization is quantized since only two numbers are learned for the initialization of the nonzero parameters in each of the L layers. Thus, the number of quantization levels for the initialization of the entire network is 2L. Also, the learned initialization is compressed because it is a set consisting of 2L numbers. The special sparse network that can be trained from such a quantized and compressed initialization is called amenable. For example, in order to initialize more than 25 million parameters of an amenable ResNet-50, only 2 × 54 numbers are needed. To the best of our knowledge, there is no other algorithm that can learn a quantized and compressed initialization from which the network is still trainable and that is able to solve both pruning tasks. Our numerical experiments show that there is a quantized and compressed initialization from which the learned sparse network can be trained and reach an accuracy on a par with the dense version. This is one step towards learning an ideal network that is sparse and quantized in very few levels of quantization. We experimentally show that these 2L levels of quantization are concentration points of the parameters in each layer of the sparse network learned by ASNI-I. In other words, we show experimentally that for each layer of a deep neural network (DNN) there are two distinct normal-like distributions whose means can be used for the initialization of an amenable network. To corroborate the above, we have performed a series of experiments utilizing networks such as ResNets, VGG-style, small convolutional, and fully connected ones on the ImageNet, CIFAR10, and MNIST datasets. Keywords: Pruning, Initialization, Nonconvex Sparse Optimization
Article
Driven by large-scale optimization problems arising from machine learning, the development of stochastic optimization methods has witnessed huge growth. Numerous types of methods have been developed based on the vanilla stochastic gradient descent method. However, for most algorithms, the convergence rate in the stochastic setting cannot simply match that in the deterministic setting. Better understanding the gap between deterministic and stochastic optimization is the main goal of this paper. Specifically, we are interested in Nesterov acceleration of gradient-based approaches. In our study, we focus on acceleration of the stochastic mirror descent method with an implicit regularization property. Assuming that the problem objective is smooth and convex or strongly convex, our analysis prescribes the method parameters which ensure fast convergence of the estimation error and satisfactory numerical performance.
Article
Modern electron tomography has progressed to higher resolution at lower doses by leveraging compressed sensing (CS) methods that minimize total variation (TV). However, these sparsity-emphasized reconstruction algorithms introduce tunable parameters that greatly influence the reconstruction quality. Here, Pareto front analysis shows that high-quality tomograms are reproducibly achieved when TV minimization is heavily weighted. However, in excess, CS tomography creates overly smoothed three-dimensional (3D) reconstructions. Adding momentum to the gradient descent during reconstruction reduces the risk of over-smoothing and better ensures that CS is well behaved. For simulated data, the tedious process of tomography parameter selection is efficiently solved using Bayesian optimization with Gaussian processes. In combination, Bayesian optimization with momentum-based CS greatly reduces the required compute time; an 80% reduction was observed for the 3D reconstruction of SrTiO3 nanocubes. Automated parameter selection is necessary for large-scale tomographic simulations that enable the 3D characterization of a wider range of inorganic and biological materials.
Chapter
Deep learning (DL) has made a major impact on data science in the last decade. This chapter introduces the basic concepts of this field. It includes both the basic structures used to design deep neural networks and a brief survey of some of its popular use-cases.
Article
Quantization approximates a deep network model with floating-point numbers by the model with low bit width numbers, thereby accelerating inference and reducing computation. Zero-shot quantization, which aims to quantize a model without access to the original data, can be achieved by fitting the real data distribution through data synthesis. However, it has been observed that zero-shot quantization leads to inferior performance compared to post-training quantization with real data for two primary reasons: 1) a normal generator has difficulty obtaining a high diversity of synthetic data since it lacks long-range information to allocate attention to global features, and 2) synthetic images aim to simulate the statistics of real data, which leads to weak intra-class heterogeneity and limited feature richness. To overcome these problems, we propose a novel deep network quantizer called long-range zero-shot generative deep network quantization (LRQ). Technically, we propose a long-range generator (LRG) to learn long-range information instead of simple local features. To incorporate more global features into the synthetic data, we use long-range attention with large-kernel convolution in the generator. In addition, we also present an adversarial margin add (AMA) module to force intra-class angular enlargement between the feature vector and class center. The AMA module forms an adversarial process that increases the convergence difficulty of the loss function, which is opposite to the training objective of the original loss function. Furthermore, to transfer knowledge from the full-precision network, we also utilize decoupled knowledge distillation. Extensive experiments demonstrate that LRQ obtains better performance than other competitors.
Chapter
EfficientNet is a convolutional neural network architecture that was created by performing a neural architecture search with the AutoML MNAS framework, which optimized both accuracy and efficiency. It is based on MobileNetV2's inverted bottleneck residual blocks, as well as squeeze-and-excite blocks. With far lower parameter and computation burdens, EfficientNet can compete with the best models on the ImageNet challenge. This paper provides a mobile version of EfficientNet that achieves similar accuracy on the ImageNet dataset but runs nearly twice as fast.