Conference Paper

Lazy paired hyper-parameter tuning

Authors: Alice X. Zheng, Mikhail Bilenko

Abstract

In virtually all machine learning applications, hyper-parameter tuning is required to maximize predictive accuracy. Such tuning is computationally expensive, and the cost is further exacerbated by the need for multiple evaluations (via cross-validation or bootstrap) at each configuration setting to guarantee statistically significant results. This paper presents a simple, general technique for improving the efficiency of hyper-parameter tuning by minimizing the number of resampled evaluations at each configuration. We exploit the fact that train-test samples can easily be matched across candidate hyper-parameter configurations. This permits the use of paired hypothesis tests and power analysis that allow for statistically sound early elimination of suboptimal candidates to minimize the number of evaluations. Results on synthetic and real-world datasets demonstrate that our method improves over competitors for discrete parameter settings, and enhances state-of-the-art techniques for continuous parameter settings.
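
The core mechanism sketched in the abstract, matching train-test splits across candidate configurations so that a paired test can retire clearly inferior candidates after only a few evaluations, can be illustrated roughly as follows. This is a minimal sketch, not the authors' exact procedure: the paired t-test, the fixed per-round split schedule, the `evaluate` helper and the `alpha` / `min_evals` settings are all illustrative assumptions, and the paper's power-analysis step for bounding the number of evaluations is omitted.

```python
# Minimal sketch of paired early elimination of hyper-parameter candidates.
# Assumptions (not from the paper): evaluate(config, split) -> accuracy on one
# shared train/test split; a plain paired t-test; fixed alpha and min_evals.
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
splits = list(ShuffleSplit(n_splits=30, test_size=0.3, random_state=0).split(X, y))
configs = [{"C": c} for c in (0.01, 0.1, 1.0, 10.0)]

def evaluate(config, split):
    train, test = split
    model = SVC(C=config["C"]).fit(X[train], y[train])
    return model.score(X[test], y[test])

alpha = 0.05
min_evals = 5                        # evaluate a few splits before testing
scores = {i: [] for i in range(len(configs))}
alive = set(scores)

for t, split in enumerate(splits):
    for i in list(alive):
        scores[i].append(evaluate(configs[i], split))
    if t + 1 < min_evals:
        continue
    best = max(alive, key=lambda i: np.mean(scores[i]))
    for i in list(alive - {best}):
        # Paired test on per-split differences against the current leader.
        diffs = np.array(scores[best]) - np.array(scores[i])
        if np.allclose(diffs, 0):
            continue
        _, p = stats.ttest_rel(scores[best], scores[i])
        if p < alpha and diffs.mean() > 0:
            alive.discard(i)         # statistically inferior: stop evaluating it

print("surviving configurations:", [configs[i] for i in sorted(alive)])
```

Because both candidates see exactly the same splits, the per-split differences have much lower variance than unpaired score comparisons, which is what makes early elimination after few evaluations feasible.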

... The study in [9] improved accuracy by designing an architecture with up to sixteen layers. The authors highlighted that increasing the number of layers is critical for achieving better performance, while the study in [10] showed in practice that making the architecture deeper can harm performance; it is also very difficult to optimize a deeper CNN architecture [10]. The performance of CNNs is strongly dependent on architectural design. ...
Article
Full-text available
Plant disease classification based on digital pictures is challenging. Machine learning approaches and plant image categorization technologies such as deep learning have been utilized to recognize, identify, and diagnose plant diseases in the previous decade. Increasing the yield quantity and quality of rice farming is important for paddy-producing countries, yet several diseases that hold back improvements in paddy production remain a serious threat. Convolutional Neural Networks (CNNs) have shown remarkable performance in the early detection of paddy leaf diseases from leaf images in the fast-growing era of science and technology. Nevertheless, constructing an effective CNN architecture depends on expertise in neural networks and domain knowledge; this approach is time-consuming and demands substantial computational resources. In this research, we propose a novel method based on a Mutant Particle Swarm Optimization (MUT-PSO) algorithm to search for an optimum CNN architecture for paddy leaf disease classification. Experimental results show that the Mutant Particle Swarm Optimization Convolutional Neural Network (MUTPSO-CNN) can find an optimum CNN architecture that offers better performance than existing hand-crafted CNN architectures in terms of accuracy, precision/recall, and execution time.
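
As a rough illustration of the particle-swarm search described in this abstract (not the paper's mutant PSO, and tuning a cheap SVM rather than a CNN architecture so the sketch stays small and runnable), a plain global-best PSO over two hyper-parameters might look like this; the swarm size, inertia and acceleration constants are assumed values.

```python
# Rough global-best PSO sketch over two SVM hyper-parameters (log10 C, log10 gamma).
# Generic PSO, not the paper's MUT-PSO; an SVM replaces the CNN to keep it runnable.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]                                      # subsample to keep it quick
low, high = np.array([-2.0, -5.0]), np.array([3.0, 0.0])     # search box (log10 scale)

def fitness(pos):
    C, gamma = 10.0 ** pos
    return cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=3).mean()

n_particles, n_iters = 6, 8
w, c1, c2 = 0.7, 1.5, 1.5                                    # inertia / acceleration (assumed)
pos = rng.uniform(low, high, size=(n_particles, 2))
vel = np.zeros_like(pos)
pbest, pbest_val = pos.copy(), np.array([fitness(p) for p in pos])
gbest = pbest[pbest_val.argmax()].copy()

for _ in range(n_iters):
    r1, r2 = rng.random((n_particles, 1)), rng.random((n_particles, 1))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, low, high)
    vals = np.array([fitness(p) for p in pos])
    improved = vals > pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmax()].copy()

print("best log10(C), log10(gamma):", gbest, " cv accuracy:", pbest_val.max())
```
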
... The disadvantage of CVST is that it initially operates on smaller subsets, thus risking the early elimination of good-performing models when the original dataset is already small. In comparison to the statistical tests used in Zheng and Bilenko (2013) and Krueger et al. (2015), the bootstrap is a general test, applicable to any type of learning task and measure of performance, and is suitable even for relatively small sample sizes. Finally, BBCD-CV requires that only the value of the significance threshold α is pre-specified while the methods in Zheng and Bilenko (2013) and Krueger et al. (2015) have a number of hyperparameters to be specified in advance. ...
Article
Full-text available
Cross-Validation (CV), and out-of-sample performance-estimation protocols in general, are often employed both for (a) selecting the optimal combination of algorithms and values of hyper-parameters (called a configuration) for producing the final predictive model, and (b) estimating the predictive performance of the final model. However, the cross-validated performance of the best configuration is optimistically biased. We present an efficient bootstrap method that corrects for the bias, called Bootstrap Bias Corrected CV (BBC-CV). BBC-CV's main idea is to bootstrap the whole process of selecting the best-performing configuration on the out-of-sample predictions of each configuration, without additional training of models. In comparison to the alternatives, namely nested cross-validation and a method by Tibshirani and Tibshirani, BBC-CV is computationally more efficient, has smaller variance and bias, and is applicable to any metric of performance (accuracy, AUC, concordance index, mean squared error). Subsequently, we again employ the idea of bootstrapping the out-of-sample predictions to speed up the CV process. Specifically, using a bootstrap-based hypothesis test, we stop training models on new folds for configurations that are statistically significantly inferior. We name the method Bootstrap Corrected with Early Dropping CV (BCED-CV); it is both efficient and provides accurate performance estimates.
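
The first idea in this abstract, bootstrapping the configuration-selection step over the pooled out-of-sample predictions, can be sketched as below. The per-instance error matrix is simulated, the metric is plain accuracy, and the in-bag/out-of-bag split is one reasonable reading of the description above rather than the authors' exact protocol.

```python
# Minimal sketch of bootstrapping the configuration-selection step (the BBC-CV idea):
# given pooled out-of-sample per-instance errors for each configuration, repeatedly
# re-select the "best" configuration on a bootstrap sample and score it on the
# out-of-bag instances. Details (metric, tie handling) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_configs = 500, 20
# 0/1 per-instance errors of each configuration on its held-out fold (simulated here).
errors = rng.random((n_samples, n_configs)) < rng.uniform(0.20, 0.30, size=n_configs)

naive_estimate = 1.0 - errors.mean(axis=0).min()      # optimistically biased accuracy

def bbc_estimate(errors, n_boot=1000):
    n = errors.shape[0]
    out = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)              # in-bag indices
        oob = np.setdiff1d(np.arange(n), idx)
        if oob.size == 0:
            continue
        best = errors[idx].mean(axis=0).argmin()      # select on in-bag data
        out.append(1.0 - errors[oob, best].mean())    # score on out-of-bag data
    return float(np.mean(out))

print("naive CV estimate:", naive_estimate)
print("bootstrap-corrected estimate:", bbc_estimate(errors))
```
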
... Three primary hyperparameter tuning approaches have been proposed in the literature: random search, grid search, and metaheuristic-based search (smart search) [32,33], although there are also other, less popular approaches [34,35]. Contrary to the "dumb" alternatives of grid search and random search, metaheuristic-based hyperparameter tuning, including PSO, is much less parallelizable. ...
Article
This study proposes four machine learning methods, Gaussian process regression (GPR), support vector regression (SVR), decision trees (DT), and K-nearest neighbors (KNN), to predict the disc cutter life of a tunnel boring machine (TBM). 200 datasets monitored during the Alborz service tunnel construction in Iran, including TBM operational parameters, geometry, and geological conditions, were used in the models. The 5-fold cross-validation method was used to investigate the prediction performance of the models. The GPR model, with R2 = 0.8866 and RMSE = 107.3554, was the most accurate model for predicting TBM disc cutter life, while the KNN model, with R2 = 0.1753 and RMSE = 288.9277, produced the lowest accuracy. To assess each parameter's contribution to the prediction problem, the backward selection method was used. The results showed that the TF, RPM, PR, and Qc parameters contribute significantly to TBM disc cutter life, with RPM being more significant and PR less significant than the other parameters.
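
A skeleton of the 5-fold cross-validated comparison described here is easy to set up with scikit-learn; the synthetic regression data below merely stands in for the (non-public) Alborz tunnel dataset, and the default hyper-parameters are assumptions.

```python
# Sketch of a 5-fold CV comparison of GPR, SVR, decision-tree and KNN regressors.
# Synthetic data stands in for the tunnelling dataset, which is not publicly available.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)
models = {
    "GPR": GaussianProcessRegressor(normalize_y=True),
    "SVR": SVR(),
    "DT": DecisionTreeRegressor(random_state=0),
    "KNN": KNeighborsRegressor(),
}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2")
    rmse = -cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    print(f"{name}: R2={r2.mean():.3f}  RMSE={rmse.mean():.1f}")
```
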
... Our method, in general, is not specific to holdout, cross-validation or successive halving and could generalize to any other method assessing the performance of a model [9], [22], [26] or allocating resources to the evaluation of a model [15], [60], [61], [62], [63], [64], [65]. While these are important areas of research, we focus here on the most commonly used methods and leave studying these extensions for future work. ...
Preprint
Full-text available
Automated Machine Learning, which supports practitioners and researchers with the tedious task of manually designing machine learning pipelines, has recently achieved substantial success. In this paper we introduce new Automated Machine Learning (AutoML) techniques motivated by our winning submission to the second ChaLearn AutoML challenge, PoSH Auto-sklearn. For this, we extend Auto-sklearn with a new, simpler meta-learning technique, improve its way of handling iterative algorithms and enhance it with a successful bandit strategy for budget allocation. Furthermore, we go one step further and study the design space of AutoML itself, proposing a solution towards truly hands-free AutoML. Together, these changes give rise to the next generation of our AutoML system, Auto-sklearn (2.0). We verify the improvements from these additions in a large experimental study on 39 AutoML benchmark datasets and conclude the paper by comparing to Auto-sklearn (1.0), reducing the regret by up to a factor of five.
... Our design space is thus quite large, rendering the classic grid-search approach unfeasible. Therefore, we optimized our RNN models via random search [12], which has a high probability of finding the most suitable configuration without having to explore all possible combinations [67]. The best configuration is determined via 3-fold cross-validation, i.e. a given hyperparameter combination is tested up to three times over the validation data partition and the final result is averaged. ...
Preprint
Tracking mouse cursor movements can be used to predict user attention on heterogeneous page layouts like SERPs. So far, previous work has relied heavily on handcrafted features, which is a time-consuming approach that often requires domain expertise. We investigate different representations of mouse cursor movements, including time series, heatmaps, and trajectory-based images, to build and contrast both recurrent and convolutional neural networks that can predict user attention to direct displays, such as SERP advertisements. Our models are trained over raw mouse cursor data and achieve competitive performance. We conclude that neural network models should be adopted for downstream tasks involving mouse cursor movements, since they can provide an invaluable implicit feedback signal for re-ranking and evaluation.
... The performance of learning algorithms, such as CNNs, is critically sensitive to the architecture design. Determining the proper architecture design is a challenge because it differs for each dataset and therefore requires adjustments for each one [19,20]. Many structural hyperparameters are involved in these decisions, such as depth (which includes the number of convolutional and fully-connected layers), the number of filters, stride (step-size that the filter must be moved), pooling locations and sizes, and the number of units in fully-connected layers. ...
Article
Full-text available
Recent advances in Convolutional Neural Networks (CNNs) have obtained promising results in difficult deep learning tasks. However, the success of a CNN depends on finding an architecture that fits a given problem. Hand-crafting an architecture is a challenging, time-consuming process that requires expert knowledge and effort, due to the large number of architectural design choices. In this article, we present an efficient framework that automatically designs a high-performing CNN architecture for a given problem. In this framework, we introduce a new optimization objective function that combines the error rate and the information learnt by a set of feature maps using deconvolutional networks (deconvnet). The new objective function allows the hyperparameters of the CNN architecture to be optimized in a way that enhances performance by guiding the CNN through better visualization of learnt features via deconvnet. The actual optimization of the objective function is carried out via the Nelder-Mead Method (NMM). Further, our new objective function results in much faster convergence towards a better architecture. The proposed framework explores a CNN architecture's numerous design choices efficiently and also allows effective, distributed execution and synchronization via web services. Empirically, we demonstrate that the CNN architecture designed with our approach outperforms several existing approaches in terms of error rate. Our results are also competitive with state-of-the-art results on the MNIST dataset and perform reasonably against the state-of-the-art results on the CIFAR-10 and CIFAR-100 datasets. Our approach favors increasing the depth, reducing stride sizes, and leaving some convolutional layers not followed by pooling layers in order to find a CNN architecture that produces high recognition performance.
... Hyper-parameter optimization has been receiving an increasing amount of attention in the NLP and machine learning communities (Thornton et al., 2013; Komer et al., 2014; Bergstra et al., 2011; Bardenet et al., 2013; Zheng et al., 2013). The performance of learning algorithms depends on the correct instantiation of their hyperparameters, ranging from algorithms such as logistic regression and support vector machines to more complex model families such as boosted regression trees and neural networks. ...
Article
Multinomial logistic loss and L2 regularization are often conflicting objectives as more robust regularization leads to restrained multinomial parameters. For many practical problems, leveraging the best of both worlds would be invaluable for better decision-making processes. This research proposes a novel framework to obtain representative and diverse L2-regularized multinomial models, based on valuable trade-offs between prediction error and model complexity. The framework relies upon the Non-Inferior Set Estimation (NISE) method — a deterministic multiobjective solver. NISE automatically implements hyperparameter tuning in a multiobjective context. Given the diverse set of efficient learning models, model selection and aggregation of the multiple models in an ensemble framework promote high performance in multiclass classification. Additionally, NISE uses the weighted sum method as scalarization, thus being able to deal with the learning formulation directly. Its deterministic nature and the convexity of the learning problem confer scalability to the proposal. The experiments show competitive performance in various setups, taking a broad set of multiclass classification methods as contenders.
Article
The quality of models built by machine learning algorithms mostly depends on the careful tuning of hyper-parameters and feature weights. This paper introduces a novel scheme to optimize hyper-parameters and features by using the Ensemble Kalman Filter (EnKF), an iterative optimization algorithm used for high-dimensional nonlinear systems. We build a framework for applying the EnKF method to parameter optimization problems. We propose ensemble evolution to converge to the global optimum. We also optimize the EnKF calculation for large datasets by using the computationally efficient UR decomposition. To demonstrate the performance of our proposed design, we apply our approach to the tuning problem of Support Vector Machines. Experimental results show that better global optima can be identified by our approach with acceptable computation cost compared to three state-of-the-art Bayesian optimization methods (SMAC, TPE and SPEARMINT).
Article
Full-text available
One of the fundamental difficulties in engineering design is the multiplicity of local solutions. This has triggered much effort in the development of global search algorithms. Globality, however, often has a prohibitively high numerical cost for real problems. A fixed cost local search, which sequentially becomes global, is developed in this work. Globalization is achieved by probabilistic restarts. A spatial probability of starting a local search is built based on past searches. An improved Nelder–Mead algorithm is the local optimizer. It accounts for variable bounds and nonlinear inequality constraints. It is additionally made more robust by reinitializing degenerated simplexes. The resulting method, called the Globalized Bounded Nelder–Mead (GBNM) algorithm, is particularly adapted to tackling multimodal, discontinuous, constrained optimization problems, for which it is uncertain that a global optimization can be afforded. Numerical experiments are given on two analytical test functions and two composite laminate design problems. The GBNM method compares favorably with an evolutionary algorithm, both in terms of numerical cost and accuracy.
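
A heavily simplified version of the restarted-local-search idea can be sketched with SciPy's bounded Nelder-Mead: repeated local searches from random start points, keeping the best result. GBNM's adaptive spatial restart probability, constraint handling and simplex reinitialization are not reproduced; the test function and restart budget are arbitrary choices.

```python
# Much-simplified restarted Nelder-Mead sketch: bounded local searches from random
# start points, keeping the best result. Not GBNM's adaptive restart scheme.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
bounds = np.array([[-5.0, 5.0], [-5.0, 5.0]])

def objective(x):                      # multimodal test function (Himmelblau)
    return (x[0] ** 2 + x[1] - 11) ** 2 + (x[0] + x[1] ** 2 - 7) ** 2

best = None
for _ in range(10):                    # fixed restart budget
    x0 = rng.uniform(bounds[:, 0], bounds[:, 1])
    res = minimize(objective, x0, method="Nelder-Mead",
                   bounds=list(map(tuple, bounds)),
                   options={"xatol": 1e-6, "fatol": 1e-6})
    if best is None or res.fun < best.fun:
        best = res

print("best point:", best.x, "value:", best.fun)
```
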
Article
Full-text available
With the increasing size of today's data sets, finding the right parameter configuration in model selection via cross-validation can be an extremely time-consuming task. In this paper we propose an improved cross-validation procedure which uses nonparametric testing coupled with sequential analysis to determine the best parameter set on linearly increasing subsets of the data. By eliminating underperforming candidates quickly and keeping promising candidates as long as possible, the method speeds up the computation while preserving the capability of the full cross-validation. Theoretical considerations underline the statistical power of our procedure. The experimental evaluation shows that our method reduces the computation time by a factor of up to 120 compared to a full cross-validation with a negligible impact on the accuracy.
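
The strategy described here, evaluating candidates on linearly growing data subsets and dropping underperformers early, can be illustrated in outline as follows. The k-NN candidates, the single validation set and the crude "drop if clearly behind the leader" margin are illustrative simplifications of the paper's nonparametric and sequential testing machinery.

```python
# Illustration of elimination on linearly growing data subsets: plain k-NN candidates,
# a single validation set, and a simple margin-based drop rule (an assumed stand-in
# for the paper's nonparametric sequential tests).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

candidates = {k: KNeighborsClassifier(n_neighbors=k) for k in (1, 5, 15, 45, 135)}
alive = set(candidates)
subset_sizes = np.linspace(300, len(X_tr), num=8, dtype=int)   # linearly increasing
margin = 0.02                                                  # assumed drop threshold

for n in subset_sizes:
    scores = {k: candidates[k].fit(X_tr[:n], y_tr[:n]).score(X_val, y_val)
              for k in alive}
    best = max(scores.values())
    alive = {k for k in alive if scores[k] >= best - margin}   # drop clear losers
    print(f"n={n:5d}  surviving k: {sorted(alive)}")
```
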
Conference Paper
Full-text available
We study a partial-information online-learning problem where actions are restricted to noisy comparisons between pairs of strategies (also known as bandits). In contrast to conventional approaches that require the absolute reward of the chosen strategy to be quantifiable and observable, our setting assumes only that (noisy) binary feedback about the relative reward of two chosen strategies is available. This type of relative feedback is particularly appropriate in applications where absolute rewards have no natural scale or are difficult to measure (e.g., user-perceived quality of a set of retrieval results, taste of food, product attractiveness), but where pairwise comparisons are easy to make. We propose a novel regret formulation in this setting, as well as present an algorithm that achieves information-theoretically optimal regret bounds (up to a constant factor).
Conference Paper
Full-text available
In the past decade large scale recommendation datasets were published and extensively studied. In this work we describe a detailed analysis of a sparse, large scale dataset, specifically designed to push the envelope of recommender system models. The Yahoo! Music dataset consists of more than a million users, 600 thousand musical items and more than 250 million ratings, collected over a decade. It is characterized by three unique features: First, rated items are multi-typed, including tracks, albums, artists and genres; Second, items are arranged within a four level taxonomy, proving itself effective in coping with a severe sparsity problem that originates from the unusually large number of items (compared to, e.g., movie ratings datasets). Finally, fine resolution timestamps associated with the ratings enable a comprehensive temporal and session analysis. We further present a matrix factorization model exploiting the special characteristics of this dataset. In particular, the model incorporates a rich bias model with terms that capture information from the taxonomy of items and different temporal dynamics of music ratings. To gain additional insights of its properties, we organized the KddCup-2011 competition about this dataset. As the competition drew thousands of participants, we expect the dataset to attract considerable research activity in the future.
Conference Paper
Full-text available
A new formulation for coordinate system independent adaptation of arbitrary normal mutation distributions with zero mean is presented. This enables the evolution strategy (ES) to adapt the correct scaling of a given problem and also ensures invariance with respect to any rotation of the fitness function (or the coordinate system). Rotation invariance in particular, here resulting directly from the coordinate system independent adaptation of the mutation distribution, is an essential feature of the ES with regard to its general applicability to complex fitness functions. Compared to previous work on this subject, the introduced formulation facilitates an interpretation of the resulting mutation distribution, making sensible manipulation by the user possible (if desired). Furthermore, it enables a more effective control of the overall mutation variance (expected step length).
Article
Full-text available
Many learning systems search through a space of possible performance elements, seeking an element with high expected utility. As the task of finding the globally optimal element is usually intractable, many practical learning systems use hill-climbing to find a local optimum. Unfortunately, even this is difficult, as it depends on the distribution of problems, which is typically unknown. This paper addresses the task of approximating this hill-climbing search when the utility function can only be estimated by sampling. We present an algorithm that returns an element that is, with provably high probability, essentially a local optimum. We then demonstrate the generality of this algorithm by sketching three meaningful applications, that respectively find an element whose efficiency, accuracy or completeness is nearly optimal. These results suggest approaches to solving the utility problem from explanation-based learning, the multiple extension problem from nonmonotonic reasoning and the ...
Article
Function estimation/approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space. A connection is made between stagewise additive expansions and steepest-descent minimization. A general gradient descent "boosting" paradigm is developed for additive expansions based on any fitting criterion. Specific algorithms are presented for least-squares, least absolute deviation, and Huber-M loss functions for regression, and multiclass logistic likelihood for classification. Special enhancements are derived for the particular case where the individual additive components are regression trees, and tools for interpreting such "TreeBoost" models are presented. Gradient boosting of regression trees produces competitive, highly robust, interpretable procedures for both regression and classification, especially appropriate for mining less than clean data. Connections between this approach and the boosting methods of Freund and Schapire and of Friedman, Hastie and Tibshirani are discussed.
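
A bare-bones version of the least-squares case described here, each stage fitting a small regression tree to the current residuals (the negative gradient) and adding it with shrinkage, might look like the following; the tree depth, learning rate and number of stages are assumed values, and the other loss functions and TreeBoost refinements are omitted.

```python
# Minimal gradient-boosting sketch for squared-error regression: each stage fits a
# small regression tree to the current residuals (the negative gradient of the loss)
# and adds it with a shrinkage factor.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=5.0, random_state=0)
n_stages, learning_rate = 100, 0.1

F = np.full_like(y, y.mean(), dtype=float)    # initial constant model
trees = []
for _ in range(n_stages):
    residuals = y - F                         # negative gradient of 0.5 * (y - F)^2
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residuals)
    F += learning_rate * tree.predict(X)
    trees.append(tree)

def predict(X_new):
    return y.mean() + learning_rate * sum(t.predict(X_new) for t in trees)

print("training RMSE:", np.sqrt(np.mean((y - F) ** 2)))
```
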
Article
Grid search and manual search are the most widely used strategies for hyper-parameter optimization. This paper shows empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid. Empirical evidence comes from a comparison with a large previous study that used grid search and manual search to configure neural networks and deep belief networks. Compared with neural networks configured by a pure grid search, we find that random search over the same domain is able to find models that are as good or better within a small fraction of the computation time. Granting random search the same computational budget, random search finds better models by effectively searching a larger, less promising configuration space. Compared with deep belief networks configured by a thoughtful combination of manual search and grid search, purely random search over the same 32-dimensional configuration space found statistically equal performance on four of seven data sets, and superior performance on one of seven. A Gaussian process analysis of the function from hyper-parameters to validation set performance reveals that for most data sets only a few of the hyper-parameters really matter, but that different hyper-parameters are important on different data sets. This phenomenon makes grid search a poor choice for configuring algorithms for new data sets. Our analysis casts some light on why recent "High Throughput" methods achieve surprising success--they appear to search through a large number of hyper-parameters because most hyper-parameters do not matter much. We anticipate that growing interest in large hierarchical models will place an increasing burden on techniques for hyper-parameter optimization; this work shows that random search is a natural baseline against which to judge progress in the development of adaptive (sequential) hyper-parameter optimization algorithms.
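
A minimal side-by-side comparison of the two strategies, with equal budgets of 16 configurations each, can be run with scikit-learn; the grid, the log-uniform sampling ranges and the data subsample are illustrative choices, not the paper's experimental setup.

```python
# Side-by-side sketch of grid search versus random search over the same two SVM
# hyper-parameters, with equal budgets (16 configurations each).
from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X, y = X[:800], y[:800]                       # subsample to keep the sketch quick

grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10, 100],
                            "gamma": [1e-4, 1e-3, 1e-2, 1e-1]},
                    cv=3).fit(X, y)
rand = RandomizedSearchCV(SVC(), {"C": loguniform(1e-1, 1e2),
                                  "gamma": loguniform(1e-4, 1e-1)},
                          n_iter=16, cv=3, random_state=0).fit(X, y)

print("grid search  :", grid.best_params_, grid.best_score_)
print("random search:", rand.best_params_, rand.best_score_)
```
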
Article
In the multiarmed bandit problem, a gambler must decide which arm of K non-identical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the trade-off between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff). Past solutions for the bandit problem have almost always relied on assumptions about the statistics of the slot machines. In this work, we make no statistical assumptions whatsoever about the nature of the process generating the payoffs of the slot machines. We give a solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs. In a sequence of T plays, we prove that the per-round payoff of our algorithm approaches that of the best arm at the rate O(T^{-1/2}). We show by a matching lower bound that this is the best possible. We also prove that our algorithm approaches the per-round payoff of any set of strategies at a similar rate: if the best strategy is chosen from a pool of N strategies, then our algorithm approaches the per-round payoff of the strategy at the rate O((log N)^{1/2} T^{-1/2}). Finally, we apply our results to the problem of playing an unknown repeated matrix game. We show that our algorithm approaches the minimax payoff of the unknown game at the rate O(T^{-1/2}).
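
For concreteness, a standard exponential-weighting (Exp3-style) bandit algorithm of the kind analyzed here can be sketched as follows, run against simulated Bernoulli arms; the exploration rate gamma is left as a plain tunable constant rather than set to the theoretically optimal value.

```python
# Exp3-style exponential weighting for the bandit setting, run against simulated
# Bernoulli arms; gamma is an assumed constant, not tuned to the horizon.
import numpy as np

rng = np.random.default_rng(0)
K, T, gamma = 5, 10_000, 0.05
true_means = np.array([0.2, 0.35, 0.5, 0.45, 0.6])

weights = np.ones(K)
total_reward = 0.0
for _ in range(T):
    probs = (1 - gamma) * weights / weights.sum() + gamma / K
    arm = rng.choice(K, p=probs)
    reward = float(rng.random() < true_means[arm])       # payoff in [0, 1]
    total_reward += reward
    # Importance-weighted reward estimate, applied only to the chosen arm.
    weights[arm] *= np.exp(gamma * reward / (probs[arm] * K))
    weights /= weights.max()    # rescale for numerical stability (probs unchanged)

print("average reward:", total_reward / T, "(best arm mean:", true_means.max(), ")")
```
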
Article
Direct policy search is a practical way to solve reinforcement learning (RL) problems involving continuous state and action spaces. The goal becomes finding policy parameters that maximize a noisy objective function. The Pegasus method converts this stochastic optimization problem into a deterministic one, by using fixed start states and fixed random number sequences for comparing policies (Ng and Jordan, 2000). We evaluate Pegasus, and new paired comparison methods, using the mountain car problem, and a difficult pursuer-evader problem. We conclude that: (i) paired tests can improve performance of optimization procedures; (ii) several methods are available to reduce the 'overfitting' effect found with Pegasus; (iii) adapting the number of trials used for each comparison yields faster learning; (iv) pairing also helps search methods such as differential evolution.
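
The pairing idea, comparing two noisy policies on the same fixed random seeds (common random numbers) so that shared noise cancels in the per-seed differences, can be demonstrated on a toy objective; the `noisy_return` function below is a stand-in, not one of the paper's reinforcement-learning benchmarks.

```python
# Paired comparison of two noisy "policies" using shared random seeds: the large
# shared noise term cancels in the per-seed differences, so the paired test is far
# more sensitive than the unpaired one. Toy objective, not the paper's benchmarks.
import numpy as np
from scipy import stats

def noisy_return(theta, seed):
    rng = np.random.default_rng(seed)
    start = rng.normal()                 # shared start state for this episode
    noise = rng.normal(scale=1.0)        # shared process noise, cancels under pairing
    return -(theta + 0.2 * start - 1.0) ** 2 + noise

seeds = range(30)                        # fixed start states / random sequences
a = np.array([noisy_return(0.9, s) for s in seeds])
b = np.array([noisy_return(1.5, s) for s in seeds])

_, p_paired = stats.ttest_rel(a, b)      # paired: shared noise cancels
_, p_indep = stats.ttest_ind(a, b)       # unpaired comparison, for contrast
print("paired   p-value:", p_paired)
print("unpaired p-value:", p_indep)
```
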
Book
This chapter focuses on the general principles used to calculate Gittins indices. This is done in turn for reward processes based on a normal distribution with known variance; a normal distribution with unknown variance; a {0, 1} distribution and an exponential distribution; for undiscounted target processes based on an exponential distribution; and on an exponential distribution together with an atom of probability at the origin. The results of these calculations are also described for each case except the Bernoulli/exponential target process, for which some results are given by Jones (1975).
Article
We present a tutorial on Bayesian optimization, a method of finding the maximum of expensive cost functions. Bayesian optimization employs the Bayesian technique of setting a prior over the objective function and combining it with evidence to get a posterior function. This permits a utility-based selection of the next observation to make on the objective function, which must take into account both exploration (sampling from areas of high uncertainty) and exploitation (sampling areas likely to offer improvement over the current best observation). We also present two detailed extensions of Bayesian optimization, with experiments in active user modelling with preferences and hierarchical reinforcement learning, and a discussion of the pros and cons of Bayesian optimization based on our experiences.
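
A bare-bones version of the loop described in this tutorial, fit a Gaussian-process surrogate, maximize an expected-improvement acquisition, evaluate and repeat, is sketched below for a one-dimensional toy function; the Matern kernel, grid-based acquisition maximization and noise settings are simplifying assumptions.

```python
# Bare-bones Bayesian-optimization loop: fit a GP to the observations, pick the next
# point by maximizing expected improvement on a dense grid (fine in 1-D), evaluate,
# repeat. Kernel choice, acquisition maximization and noise handling are simplified.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):                                    # expensive function (toy stand-in)
    return -np.sin(3 * x) - x ** 2 + 0.7 * x

rng = np.random.default_rng(0)
grid = np.linspace(-2.0, 2.0, 400).reshape(-1, 1)
X_obs = rng.uniform(-2.0, 2.0, size=(3, 1))
y_obs = objective(X_obs).ravel()

for _ in range(15):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True,
                                  alpha=1e-6).fit(X_obs, y_obs)
    mu, sigma = gp.predict(grid, return_std=True)
    best = y_obs.max()
    z = (mu - best) / np.maximum(sigma, 1e-12)
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)   # expected improvement
    x_next = grid[ei.argmax()].reshape(1, 1)
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, objective(x_next).ravel())

print("best x:", X_obs[y_obs.argmax()].item(), "best value:", y_obs.max())
```

In practice the acquisition would be maximized with a proper optimizer and observation noise handled explicitly; an exhaustive grid is only viable in very low dimension.
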
Article
This is an account of the life of the author's book Testing Statistical Hypotheses, its genesis, philosophy, reception and publishing history. There is also some discussion of the position of hypothesis testing and the Neyman-Pearson theory in the wider context of statistical methodology and theory.
Article
Selecting a good model of a set of input points by cross validation is a computationally intensive process, especially if the number of possible models or the number of training points is high. Techniques such as gradient descent are helpful in searching through the space of models, but problems such as local minima, and more importantly the lack of a distance metric between various models, reduce the applicability of these search methods. Hoeffding Races is a technique for finding a good model for the data by quickly discarding bad models and concentrating the computational effort on differentiating between the better ones. This paper focuses on the special case of leave-one-out cross validation applied to memory-based learning algorithms, but we also argue that it is applicable to any class of model selection problems.
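
A toy version of the racing idea, maintaining a Hoeffding confidence interval on each model's mean error and eliminating a model as soon as its lower bound exceeds another model's upper bound, is sketched below; the per-point errors are simulated Bernoulli draws rather than actual leave-one-out evaluations, and delta and the evaluation budget are assumed values.

```python
# Toy Hoeffding-race sketch: keep a Hoeffding confidence interval on each model's mean
# error as test points are processed, and eliminate a model once its lower bound
# exceeds some other model's upper bound. Errors are simulated here; in the real
# procedure they would come from leave-one-out evaluations.
import numpy as np

rng = np.random.default_rng(0)
true_error = np.array([0.35, 0.30, 0.20, 0.40, 0.22])    # per-model error rates (simulated)
n_models, B, delta = len(true_error), 1.0, 0.05           # errors bounded in [0, B]

sums = np.zeros(n_models)
alive = set(range(n_models))
for t in range(1, 5001):
    draws = (rng.random(n_models) < true_error).astype(float)   # 0/1 error per point
    sums += draws
    means = sums / t
    # Hoeffding bound: with prob >= 1 - delta, |empirical mean - true mean| <= eps.
    eps = B * np.sqrt(np.log(2.0 / delta) / (2.0 * t))
    best_upper = min(means[i] + eps for i in alive)
    alive = {i for i in alive if means[i] - eps <= best_upper}  # drop clearly worse models
    if len(alive) == 1:
        break

print(f"after {t} points, surviving models: {sorted(alive)}")
```

Models whose true errors are very close may never separate within the budget, which is exactly the regime where racing saves the least; clearly inferior models, by contrast, are dropped after comparatively few points.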