Speeding up Common Hyperparameter
Optimization Methods by a Two-Phase-Search
Alexander Wendt
Christian Doppler Laboratory for
Embedded Machine Learning
ICT, TU Vienna, Vienna, Austria
alexander.wendt@tuwien.ac.at
Marco Wuschnig
Christian Doppler Laboratory for
Embedded Machine Learning
ICT, TU Vienna, Vienna, Austria
marco.wuschnig@tuwien.ac.at
Martin Lechner
Christian Doppler Laboratory for
Embedded Machine Learning
ICT, TU Vienna, Vienna, Austria
martin.lechner@tuwien.ac.at
Abstract— Hyperparameter search concerns everybody who
works with machine learning. We compare publicly available
hyperparameter searches on four datasets. We develop metrics
to measure the performance of hyperparameter searches across
datasets of different sizes as well as machine learning
algorithms. Further, we propose a method of speeding up the
search by using subsets of data. Results show that random
search performs well compared to Bayesian methods and that a
combined search can speed up the search by a factor of 7.
Keywords—hyperparameter, machine learning, support vector
machine, random forest, Bayesian optimization, optimization
I. INTRODUCTION
Hyperparameter search concerns everybody who works with machine learning. Considerable effort is put into finding hyperparameters that get the most out of the algorithms. There are numerous methods to optimize hyperparameters. Some are quite simple, like grid and random search, while others, such as Bayesian methods [1], [2], use a more complex model to speed up the search. Although grid and random search are not the most efficient methods [3], many practitioners still stick with them because of their simplicity.
In this work, we compare frequently used, publicly available hyperparameter searches with implementations of “classic” machine learning algorithms in the widely used Python machine learning package Scikit-learn (https://scikit-learn.org/stable/). We use its implementations of the Support Vector Regressor (SVR) and the Random Forest Regressor (RFR) on regression problems on four datasets. The research problem is to find a metric that compares hyperparameter optimization methods across datasets and machine learning algorithms and that determines which method produces reliable results the fastest. Further, we want to find out whether an extensive hyperparameter search on less data, combined with a narrow parameter search on the full dataset, is faster than a method that uses the whole dataset.
We propose the following methodology: First, we apply grid search with 800 parameter combinations as a baseline for all algorithms on all datasets. We choose R-squared as the metric; it provides a measure of how well our model fits the data. Its range is r² ∈ (−∞, 1], where a value of 1.0 corresponds to a perfect fit that explains all of the variance in the data.
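As a minimal illustration (our own example, not code from the paper), this metric corresponds to Scikit-learn's r2_score:

```python
# Minimal illustration of the r^2 metric via Scikit-learn's r2_score
# (illustrative example, not part of the original experiments).
from sklearn.metrics import r2_score

y_true = [3.0, -0.5, 2.0, 7.0]   # ground-truth regression targets
y_pred = [2.5,  0.0, 2.0, 8.0]   # model predictions

print(r2_score(y_true, y_pred))  # ~0.95; 1.0 would be a perfect fit
```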
Second, we apply each hyperparameter optimization
method on all datasets and algorithms. As all methods except
grid search are stochastic, we perform five runs per
optimization. Then, we compare the hyperparameter
optimization algorithms regarding optimization performance
as well as variance and speed across all datasets and machine
learning algorithms. The challenge is to find metrics that cover
duration as well as reliability.
In step three, we want to determine how large a share of a dataset is representative, so that we can limit the parameter search space. With that share, we do a wide search to determine the categorical parameters and to restrict the search space of the continuous values. Then, we do a narrow search of the continuous values on the full dataset. We test different combinations of hyperparameter optimization methods to find the optimum between search duration and reliability. Our contributions to research are the following:
• Definition of metrics to measure the performance of hyperparameter search across datasets of different sizes as well as machine learning algorithms
• Review of common Bayesian optimization methods
• Speedup of common methods on large datasets through a two-phase-search algorithm
II. HYPERPARAMETER OPTIMIZATION METHODS
In the past, manual search and grid search were the way to go
when running hyperparameter optimization. To increase
efficiency, people often used semi-automated multi-staged
grid searches. In [3], the authors combined a logarithmic grid
with a fine-grained linear grid. However, significant
disadvantages are that grid search evaluates a lot of non-useful
combinations and that it lacks early stopping. In [4], a random
search is found to be superior to grid search in both runtime
and quality of the results.
In recent years, Bayesian optimization [1], [2] has gained much attention for hyperparameter tuning. These methods are designed for objective functions with long evaluation periods, so Bayesian optimization fits the needs of modern machine learning (ML) hyperparameter tuning very well. Depending on the implementation, Gaussian processes (GP) [5] or Tree-structured Parzen Estimators (TPE) [6] are used to approximate the target function based on the historical data. Hyperopt (https://github.com/hyperopt/hyperopt) [7] is a hyperparameter optimization library for Python implementing random search, TPE, and an improved version of TPE called adaptive TPE. Optuna (https://github.com/optuna/optuna) [8] is another, more recent optimization library. Compared to Hyperopt, Optuna supports dynamically created parameter search spaces, and the authors claim that their implementation is efficient in terms of searching and pruning while being versatile enough to be used for many different optimization problems. It implements grid search, random search, and TPE.
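To give a flavour of how such a TPE-based search is set up, the following sketch runs Hyperopt on an SVR hyperparameter space. It is an illustration under several assumptions (synthetic data, placeholder parameter ranges, a 5-fold cross-validated r² objective), not the exact configuration used in the experiments of this paper:

```python
# Illustrative TPE search with Hyperopt (placeholder data and ranges).
import numpy as np
from hyperopt import fmin, tpe, hp, Trials
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

space = {
    "C": hp.loguniform("C", np.log(1e-3), np.log(1e3)),
    "gamma": hp.loguniform("gamma", np.log(1e-3), np.log(1e3)),
}

def objective(params):
    # fmin minimizes, so return the negative mean r^2 over 5 folds.
    score = cross_val_score(SVR(kernel="rbf", **params), X, y,
                            scoring="r2", cv=5).mean()
    return -score

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=40, trials=trials)
print(best)  # best C and gamma found after 40 evaluations
```

Optuna follows the same pattern with a define-by-run objective that samples parameters via trial.suggest_float and a study created with optuna.create_study.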
According to [9], both implementations of TPE outperform random search, but their results differ, which can be explained by different internal parameter settings (sometimes referred to as hyper-hyperparameters). FABOLAS [10] is specially designed for large datasets and uses Gaussian processes for the approximation. Skopt (https://github.com/scikit-optimize/scikit-optimize) is another widely used Python library implementing Bayesian optimization as well as grid and random search.
In contrast to Bayesian optimization-based approaches,
Hyperband [11] focuses on improving random search by using
adaptive resource allocation, i.e., by allocating more resources
to promising hyperparameter settings, and principled early
stopping strategies. The authors claim that Hyperband is 5 to
30 times faster than state-of-the-art Bayesian-based methods.
All Bayesian methods share the disadvantage of a slow start,
i.e., they need some time to find suitable hyperparameters, but
they eventually outperform bandit approaches like
Hyperband. In contrast, Hyperband is much faster and gives
superior results for small time-budgets. BOHB [12] combines Bayesian optimization and Hyperband and tries to unite their advantages: a fast start and good performance for large budgets.
III. TWO-PHASE OPTIMIZATION MODEL
In [10], the authors showed that a small subset of a dataset can be representative enough to characterize the hyperparameter space. The grid search time for SVR scales at least quadratically with the number of samples n, which makes the search costly for larger datasets. Our method should therefore provide an advantage, as it works on subsets of the dataset. Less data leads to a higher loss, while the best hyperparameters remain largely the same; with more data, the loss shrinks until it saturates. We want to use this observation to speed up the search with a two-phase approach: Phase 1, a wide search on a small share of the data with low granularity, and then Phase 2, a narrow search on the complete data with high granularity.
We want to find out whether the search duration over the whole space can be significantly lowered by focusing the search only on the relevant areas of the search space. Further, we assume that less data is enough to determine the categorical parameters, and that the continuous parameters are then subject to fine-tuning. Therefore, the targets of the high-granularity search are all continuous parameters for a selected set of categorical parameters. For instance, the SVR has the categorical parameter kernel, which can be either linear or a radial basis function (rbf). For rbf, it uses C and γ as continuous parameters. However, this method is not limited to selecting kernels. In the model pipeline, categorical parameters could also be the selection of the scaler, sampler, or imputer for missing values, as well as of feature subsets.
In Fig. 1, we present an activity chart for the search space limitation. The idea is to first apply a hyperparameter optimization, such as a grid search or a Bayesian method, to a wide range of parameters. For SVR, the parameter space could span several orders of magnitude of C to cover the whole space. The search gets cheaper by only using a small subset s ∈ S of the data, e.g., 10%. Similar to [12], the size of the dataset is our limited resource. Here, the subset size and the number of parameter combinations or iterations can be configured.
Fig. 1. Activity chart for the limitation of the search space
Based on the results of the wide search, we select the subset t with the highest results, i.e., the top 20%, to keep only the most promising candidates. Because the categorical parameters will be fixed after this phase, we avoid overfitting by calculating the median result per categorical parameter value and selecting the value with the highest median. Using the median lowers the risk of picking parameters for which only a few results are particularly good and the rest only moderate. Furthermore, we are not interested in perfect hyperparameters; instead, we look for a range in which most hyperparameter combinations give decent results. We then analyze the ranges of the continuous parameters for the selected set of categorical parameters to get their minimal and maximal values. In case there is only one or no value for the selected set, we retrain on the same dataset with fixed categorical parameters and wide-range continuous parameters and, as in the previous steps, select the subset t with the highest results.
After the search space has been limited, we apply it to the whole dataset with a hyperparameter optimization method in Phase 2. As the selection of the hyperparameter optimization method is arbitrary, we try combinations of methods, e.g., grid search followed by Bayesian and Bayesian followed by Bayesian.
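The following sketch condenses this two-phase procedure into code under several simplifying assumptions: random search (RandomizedSearchCV) is used in both phases, the SVR kernel is the only categorical parameter, the top share t is fixed to 20%, and the fallback branch of Fig. 1 (retraining when no combination survives) is omitted. The function name two_phase_search, the parameter bounds, and all defaults are illustrative rather than taken from the original implementation.

```python
# Sketch of the two-phase search (simplified; see assumptions above).
import pandas as pd
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import SVR

def two_phase_search(X, y, subset_share=0.1, top_share=0.2,
                     n_iter_wide=100, n_iter_narrow=100):
    # Phase 1: wide search with low granularity on a small subset of the data.
    X_sub, _, y_sub, _ = train_test_split(X, y, train_size=subset_share,
                                          random_state=0)
    wide_space = {"kernel": ["rbf", "linear"],
                  "C": loguniform(1e-3, 1e3),       # placeholder bounds
                  "gamma": loguniform(1e-3, 1e3)}   # placeholder bounds
    wide = RandomizedSearchCV(SVR(), wide_space, n_iter=n_iter_wide,
                              scoring="r2", cv=5,
                              random_state=0).fit(X_sub, y_sub)

    # Select the top results and fix the categorical parameter (the kernel)
    # to the value with the best median score.
    results = pd.DataFrame(wide.cv_results_)
    n_top = max(1, int(top_share * len(results)))
    top = results.nlargest(n_top, "mean_test_score")
    best_kernel = (top.groupby("param_kernel")["mean_test_score"]
                      .median().idxmax())
    top = top[top["param_kernel"] == best_kernel]

    # Phase 2: narrow search of the continuous parameters on the full
    # dataset, restricted to the ranges spanned by the top Phase-1 candidates.
    c_vals = top["param_C"].astype(float)
    g_vals = top["param_gamma"].astype(float)
    narrow_space = {"kernel": [best_kernel],
                    "C": loguniform(c_vals.min(), c_vals.max()),
                    "gamma": loguniform(g_vals.min(), g_vals.max())}
    narrow = RandomizedSearchCV(SVR(), narrow_space, n_iter=n_iter_narrow,
                                scoring="r2", cv=5,
                                random_state=0).fit(X, y)
    return narrow.best_estimator_, narrow.best_score_
```

In our experiments, the optimizer used in each phase is interchangeable, which is why the evaluation later lists combinations such as Grid800+Hyper20 or Random100+Random100.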
IV. TEST AND COMPARISON METHOD
The goal of the method comparison is to find which algorithms reach almost the same r² score as the grid search, but faster, and to see whether there is a global winner across multiple datasets and machine learning methods.
A. Test Setup
We use four different datasets in the experiments: two small datasets, Fishcatch and AutoMPG, and two large datasets, Amsterdam AirBnb and Bikeshare. Their properties are shown in Tab. I. We train SVR and RFR on them. The performance of the regression is measured by the r² metric described in the introduction.
TABLE I. DATASET CHARACTERISTICS

Dataset Name | Samples | Features | Prediction Goal
Fishcatch | 158 | 7 | Fish weight
AutoMPG | 199 | 9 | Miles per gallon
Airbnb | 10498 | 16 | Price of a hotel room
Bikeshare | 8690 | 13 | Used bikes per day

Dataset sources: Fishcatch: https://www.kaggle.com/aungpyaeap/fish-market; AutoMPG: https://www.kaggle.com/fadikamal/autompgdatasetzip; Airbnb: https://www.kaggle.com/adityadeshpande23/amsterdam-airbnb/; Bikeshare: http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset/
For SVR, the following parameters were used: kernel ∈ {rbf, linear}, with C and γ drawn from wide logarithmic ranges spanning several orders of magnitude (a different range for the small and for the large datasets). For RFR, there were two categorical parameters: bootstrap ∈ {True, False} and max_features ∈ {auto, sqrt}. The continuous parameters were max_depth ∈ [10, 100], min_samples_leaf ∈ [1, 4], min_samples_split ∈ [2, 11], and n_estimators ∈ [200, 2000].
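Written in Scikit-learn style, these search spaces correspond roughly to the following sketch. The logarithmic bounds for C and γ are placeholders, since different ranges are used for the small and the large datasets:

```python
# Search spaces of the experiments in Scikit-learn style (sketch).
# The bounds of C and gamma are placeholders; the study uses different
# logarithmic ranges for the small and the large datasets.
from scipy.stats import loguniform, randint

svr_space = {
    "kernel": ["rbf", "linear"],
    "C": loguniform(1e-3, 1e3),       # placeholder bounds
    "gamma": loguniform(1e-3, 1e3),   # placeholder bounds
}

rfr_space = {
    "bootstrap": [True, False],
    "max_features": ["auto", "sqrt"],
    "max_depth": randint(10, 101),        # 10 ... 100
    "min_samples_leaf": randint(1, 5),    # 1 ... 4
    "min_samples_split": randint(2, 12),  # 2 ... 11
    "n_estimators": randint(200, 2001),   # 200 ... 2000
}
```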
We compare the following hyperparameter optimization methods: random search, Hyperopt [7], Optuna [8], Skopt GP [5], and Skopt TPE [6]. We tried to install and test the Gaussian process method FABOLAS [10] but failed; the code in the repository does not seem to be maintained. An implementation is offered in [15], but it did not run stably on our datasets and was excluded from the comparison. Each machine learning algorithm was cross-validated with five folds. Due to the stochastic nature of the optimization methods, we run each test five times to also obtain the variance of the results.
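A sketch of this evaluation protocol, using a generic search factory (an illustration; the actual notebooks may be structured differently):

```python
# Sketch of the evaluation protocol: every stochastic search is repeated
# five times, and we record the best cross-validated r^2 score and the
# wall-clock duration of each repetition.
import time
import numpy as np

def repeat_search(make_search, X, y, n_repeats=5):
    """make_search(seed) must return an unfitted search object with a
    fit(X, y) method and a best_score_ attribute, e.g. a
    RandomizedSearchCV(..., scoring="r2", cv=5, random_state=seed)."""
    scores, durations = [], []
    for seed in range(n_repeats):
        search = make_search(seed)
        start = time.time()
        search.fit(X, y)
        durations.append(time.time() - start)
        scores.append(search.best_score_)
    return np.asarray(scores), np.asarray(durations)
```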
We execute all tests on a virtual server with an Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60 GHz with four cores and 12 GB RAM running Windows 10. All code was implemented in Jupyter Notebooks with Sklearn 0.19.2 on Python 3.7.
B. Representative Subsets of Datasets
We execute 800 iterations of the grid and random search and
40 iterations of the Bayesian optimization methods. To see the
effect of using only subsets, we execute each method five
times for 10% to 100% of the data. Like in [10], we measure
how representative subsets of data are, i.e., if 10% of the
samples provides us with similar hyperparameters as 100% of
the data. Then, we analyze if there is one region of high
𝑟. It is visualized in Fig. 2 by plotting all available
measurements per subset for selected hyperparameters. On the
top, we show the large AirBnb dataset with SVR with 10%
samples to the left and with 100% samples to the right. As
AutoMPG with SVR in the middle is a small dataset, 10% of
the data was not enough to get representative results.
Therefore, 20% or at least 40 samples have to be used. At the
bottom, we plot the Bikeshare dataset with RFR.
For all SVR measurements, there was one distinct area of high scores. As we compare against the optimum on all data, this also applies to the categorical parameters. For RFR, no distinguishable area could be found among the parameters, which suggests that RFR is not very sensitive to the selection of hyperparameters within our ranges. We conclude that our assumption of using less data for narrowing the search space is valid for SVR, but not for RFR.
Fig. 2. Top and middle: hyperparameter space of C and γ for the datasets Airbnb and AutoMPG with the SVR rbf kernel. Bottom: the Bikeshare dataset with RFR with the hyperparameters number of estimators and max depth
C. Metrics for Comparison Across Datasets and Algorithms
We want to find a metric that allows us to compare performance across datasets, subsets of the datasets, machine learning algorithms, and hyperparameter optimization methods. We are interested in the speed as well as the reliability of reaching the grid search score.
To be able to compare the algorithms across multiple datasets, we use the only common denominator as a reference: the grid search. Unlike the other algorithms, grid search is deterministic and always provides the same score, and its time varies little as it always tests the same hyperparameters. Therefore, we compare everything to the grid search median performance on a dataset. Fig. 3 shows subsets of the Bikeshare dataset ranging from 10% to 100% on SVR, in absolute values to the left and normalized to grid search to the right. In the following, we use the data from Hyperopt as a representative of all Bayesian methods to demonstrate how we compare performance; in the evaluation section, we compare the Bayesian methods against each other. As the grid search score is almost the best possible value, we subtract a tolerance by lowering the reference score by 2%, i.e., the relative score is

s_rel = s / (0.98 · s_grid),    (1)

where s is the score of the evaluated method and s_grid the grid search score on the same dataset and subset. Time is measured as time per iteration normalized to grid search, i.e., the relative time per iteration is

t_rel,iter = (t / n_iter) / (t_grid / n_iter,grid),    (2)

where t and n_iter are the duration and the number of iterations of the evaluated method, and t_grid and n_iter,grid those of the grid search.
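In code, the two normalizations amount to simple ratios (a sketch; the helper and variable names are illustrative):

```python
# Relative score (Eq. 1) and relative time per iteration (Eq. 2),
# both normalized to the grid search reference (sketch, illustrative names).
def relative_score(score, grid_score, tolerance=0.02):
    # Values >= 1.0 reach the grid search score within the 2% tolerance.
    return score / ((1.0 - tolerance) * grid_score)

def relative_time_per_iteration(duration, n_iterations,
                                grid_duration, grid_iterations):
    # Time per iteration of the method divided by that of grid search.
    return (duration / n_iterations) / (grid_duration / grid_iterations)
```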
As we only performed five measurements per
hyperparameter optimization method and subset from 10% to
100% of a dataset, we investigate whether it is possible to
merge subsets to get more measurements.
Fig. 3. Grid search, random search, and Hyperopt score as well as relative
time per iteration to the left and the same results normalized to grid search
results to the right for the Bikeshare dataset
On a relative scale to grid search, as demonstrated in Fig. 3, the medians of random search and Hyperopt are almost trendless, i.e., they scale almost like grid search. Fig. 4 shows the same situation for normalized results of subsets of the small AutoMPG dataset on RFR. Based on our observations, we conclude that we can merge subset results for the score and the time per iteration. For small datasets, we merge from 20% upwards, and for large datasets from 10% upwards.
Fig. 4. Relative score and relative time per iteration for subsets of the
AutoMPG dataset on an RFR with Bayesian optimization methods
In Fig. 5, we plot the score s_rel and the time per iteration t_rel,iter for each dataset, grouped by machine learning algorithm, for Hyperopt and random search. We notice that the relative score for both RFR and SVR is similar across the datasets, but not across machine learning algorithms. While the distributions are similar for random search, the RFR score distribution for Hyperopt is much higher than its SVR score distribution. Hyperopt on SVR shows several outliers with a very low score; in these cases, the compared Bayesian methods got stuck in a local maximum, optimizing only a linear kernel. While the relative time per iteration could be merged into one distribution for random search, it differs much more for Hyperopt. We conclude that we can merge the relative score s_rel and the time per iteration t_rel,iter across all datasets, but not across machine learning algorithms. However, for Bayesian methods, we expect a high variance in t_rel,iter.
In our comparison, it is interesting to see after how many iterations we can expect a method to reach the grid search score, which we call the iterations to grid search score, n_grid. This information can be used to reduce the number of iterations. Our initial assumption was that random search needs to run 800 iterations, like grid search, while 40 iterations would be enough for the Bayesian methods.
Fig. 5. Top, the relative score and bottom, the time per iteration for
Hyperopt and random search for each machine learning algorithm on all
datasets
Some runs never reach the grid search score, due to stochastic variance or because the method gets stuck in a local optimum. In Fig. 6, we see the number of iterations n_grid for Hyperopt and random search. For RFR, usually fewer than ten iterations are necessary for both Hyperopt and random search, which confirms that almost any hyperparameters are good enough. For SVR, we see that 40 iterations are too few for the Bayesian methods, as many runs do not reach the grid search score at all and are capped at the value 40. Based on the RFR iterations and on random search for SVR, we conclude that we can merge the number of iterations n_grid across datasets to get comparable distributions. We can also see whether we tested with too few iterations and how many iterations are actually needed.
Fig. 6. To the left, the number of iterations to grid search score for RFR; to
the right, SVR where the iterations around 40 were highlighted on the y-axis
We want to compare the Bayesian methods, which only ran 40 iterations, to a random search that ran 800 iterations. Therefore, we have to include the durations as well. The relative duration T_rel estimates the duration needed to reach the grid search score, measured in “grid search units”:

T_rel = (n_grid / n_iter,grid) · t_rel,iter,    (3)

where only runs that make it to the grid search score are considered. Fig. 7 shows T_rel for all tested methods on all datasets. To compress the results into one metric per hyperparameter optimization method, we select the 3rd quartile relative duration, i.e., the time within which ¾ of the searches that reach the grid search score have done so.
As mentioned before, not all runs reach the grid search score within the limited time budget. Therefore, we use an additional metric that tells us how confident we can be that a tested method reaches the grid search score.
TABLE II. MEASUREMENT RESULTS

Hyperparameter Optimization Method | Min. Rel. Score | Median Rel. Score | Max. Rel. Score | Reliability | 3rd Quartile Iterations to Grid | Min. Rel. Duration | Median Rel. Duration | 3rd Quartile Rel. Duration | Max. Rel. Duration
Hyperopt | 0.54 | 1.00 | 1.08 | 0.52 | 32 | 0.0007 | 0.055 | 0.085 | 0.372
Optuna | 0.54 | 0.97 | 1.22 | 0.40 | 27 | 0.0004 | 0.160 | 0.361 | 1.330
Skopt GP | 0.54 | 0.99 | 1.08 | 0.44 | 24 | 0.0356 | 0.694 | 1.520 | 5.750
Skopt TPE | 0.54 | 0.97 | 1.08 | 0.37 | 24 | 0.0033 | 0.238 | 0.573 | 2.970
Random | 0.86 | 1.02 | 1.08 | 0.91 | 170 | 0.0009 | 0.067 | 0.178 | 0.854
Grid800+Hyper20 | 0.94 | 1.01 | 1.03 | 0.68 | 820** | 0.0139* | 0.024* | 0.034* | 0.064*
Hyper20+Hyper20 | 0.54 | 0.91 | 1.02 | 0.11 | 40** | 0.0011* | 0.019* | 0.055* | 0.222*
Hyper40+Hyper20 | 0.54 | 0.98 | 1.03 | 0.41 | 60** | 0.0017* | 0.011* | 0.026* | 0.059*
Random100+Hyper20 | 0.54 | 1.01 | 1.03 | 0.65 | 120** | 0.0020* | 0.009* | 0.015* | 0.106*
Random100+Random100 | 0.86 | 1.01 | 1.03 | 0.70 | 200** | 0.0013* | 0.007* | 0.025* | 0.236*

*Only data from the two large datasets AirBnb and BikeShare are used, as the models are not suited for small datasets. **Total number of iterations. In the original table, red text marks the best Bayesian method (Hyperopt) and the best mixed method (Random100+Random100).
Fig. 7. To the left, the estimated time to reach the grid search score for all measured methods; to the right, the same plot with the y-axis limited to 1.0
It is measured as the reliability, defined as the number of scores that reach the grid search score (including the 2% tolerance) divided by the total number of runs N:

r_rel = |{ s_rel ≥ 1 }| / N.    (4)
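Equations (3) and (4) translate into short helper functions (again a sketch with illustrative naming):

```python
# Relative duration (Eq. 3) in "grid search units" and reliability (Eq. 4).
# Sketch with illustrative naming; only runs that reach the grid search
# score should be passed to relative_duration.
import numpy as np

def relative_duration(iters_to_grid_score, grid_iterations, rel_time_per_iter):
    return (iters_to_grid_score / grid_iterations) * rel_time_per_iter

def reliability(relative_scores):
    # Share of runs whose relative score (Eq. 1) is at least 1.0, i.e.
    # reaches the grid search score minus the 2% tolerance.
    rel = np.asarray(relative_scores, dtype=float)
    return float(np.mean(rel >= 1.0))
```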
V. EVALUATION
After defining the metrics for relative duration and reliability, we first compare the Bayesian methods with each other. Then, we perform our two-phase optimization model with grid search, random search, and the best overall performing Bayesian method. In the following, the 3rd quartile relative duration is used to measure time, i.e., the time within which ¾ of the successful runs reach the grid search score.
A. Comparison of Common Optimization Methods
In Fig. 7 and Tab. II, we see the comparison of the Bayesian
methods with our metrics. There is a huge difference in terms
of the relative duration of the different methods. Hyperopt is
doubtlessly the fastest one for all machine learning methods
with about 𝑇, 0.085 for SVR and similar for
RFR. All outliers are < 1.0, which is 7x the median value.
Skopt TPE and especially Skopt GP suffers from extreme
outliers of 𝑇 , 6, which is 12x the median value.
Optuna performs somewhere between Skopt GP and Skopt
TPE. Also, in terms of reliability, Hyperopt performs best with
𝑟 0.51 compared to values 𝑟, 0.4 .
An interesting notice is that many runs that did not make it to
grid search score had a duration of up to 𝑇 , 12 for
Skopt GP, i.e., the algorithm almost gets stuck and will
probably never reach the grid search score. Due to too few
runs as visible in Fig. 6, reliability was low. However,
Hyperopt is our selection as the best Bayesian method.
Random search performs similarly to grid search, with a reliability of r_rel = 0.91 after the same number of runs as grid search. T_rel = 0.178 suggests that only 170 runs would be necessary for ¾ of the runs to reach the grid search level. Compared to Hyperopt, it runs slower, but the reliability and the confidence of reaching the grid search level are much higher. While random search can test (also useless) parameter settings in parallel, Hyperopt has to determine the next hyperparameters sequentially. A higher reliability for Hyperopt would require more iterations, which would increase the relative duration. To conclude, Hyperopt brings ¾ of its successful runs to the grid search score faster than random search, but random search is more reliable than Hyperopt.
B. Comparison of the Two-Phase Optimization Models
We performed the two-phase optimization model with different combinations of hyperparameter algorithms and iteration counts. In Tab. II, we show the tested combinations as [method on 10%][iterations] + [method on 100%][iterations] together with the results. In all tests, we used 10% of the large datasets and 20% of the small datasets for the first phase. Because RFR performance was independent of the choice of hyperparameters, we ran the two-phase optimization model only on SVR.
We present the results in Tab. II as well as in Fig. 8. Hyperopt20+Hyperopt20 has a reliability of only r_rel = 0.11, which is very low; it reaches the grid search score only with its outliers. Hyperopt40+Hyperopt20 has a reliability and score comparable to the common Bayesian methods. For the small datasets, no duration gain is made.
Fig. 8. Relative duration for combinations within the Two-Phase
Optimization Models
For the large datasets, however, it is much faster than plain Hyperopt, with a relative duration of T_rel = 0.026 compared to 0.085. In that case, Hyperopt40+Hyperopt20 would be an improvement over using only Hyperopt on 100% of the data.
Random100+Random100 is comparable to a random search with 170 iterations, which is the 3rd quartile of iterations to grid search results. Of the tested combinations, it has the best reliability, r_rel = 0.70. For the small datasets, we did not measure any significant duration gain (cf. Tab. II). For the large datasets, however, T_rel = 0.025, i.e., a factor of 7.1x faster than random search.
Finally, Grid800+Hyper20 and Random100+Hyper20 behave similarly, with reliabilities of 0.68 and 0.65. Random100+Hyper20 has more outliers; its worst relative score was 0.53, which is the same value at which the other Bayesian methods tend to get stuck. Both methods are at least 5.2x faster and have far fewer outliers than a random search on large datasets. We conclude that a reduced parameter space is a proper way of reducing the search time for large datasets at the cost of slightly decreased reliability. Our algorithm of choice is Random100+Random100.
VI. CONCLUSION
We developed metrics to compare hyperparameter searches by their duration and by the variance of their score. The compared methods are random search, Skopt, Optuna, and Hyperopt. By relating everything to the deterministic grid search, we created a basis for comparison. Our metric estimates the expected duration to reach the grid search score as a multiple of the grid search duration. Additionally, we use reliability as a measure of confidence in that duration. We show that it is possible to merge results from various datasets.
In our tests, Hyperopt was the fastest method. Some publications claim that Bayesian methods are faster than random search and would be the better choice. We come to another conclusion: while Hyperopt reaches the grid search score slightly faster than random search, one can be much more confident that random search reaches it by the end of the test.
We introduced a two-phase search that first searches a wide hyperparameter space on less data and then searches within a narrow range on the full dataset. A combination of two random searches provides almost as good results but is 7.1x faster than an ordinary random search on large datasets. This approach is suitable for support vector machines, but not for random forest methods.
For future work, we intend to use the information from the metrics to better estimate how many iterations are necessary to get decent results. We applied several “hyper-hyperparameters” in our metrics and the two-phase search; they should be refined for more general usage. Finally, it would be interesting to evaluate this method on more complex models such as neural networks with more hyperparameters.
ACKNOWLEDGMENT
The financial support by the Austrian Federal Ministry for
Digital and Economic Affairs and the National Foundation for
Research, Technology, and Development is gratefully
acknowledged.
REFERENCES
[1] J. Mockus, “On bayesian methods for seeking the extremum,” in
Optimization Techniques IFIP Technical Conference Novosibirsk,
July 1–7, 1974, 1975, pp. 400–404.
[2] P. I. Frazier, “A Tutorial on Bayesian Optimization,” ArXiv, vol.
abs/1807.02811, 2018.
[3] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio,
“An Empirical Evaluation of Deep Architectures on Problems with
Many Factors of Variation,” in Proceedings of the 24th International
Conference on Machine Learning, 2007, pp. 473–480.
[4] J. Bergstra and Y. Bengio, “Random Search for Hyper-Parameter
Optimization,” Journal of Machine Learning Research, vol. 13, no.
10, pp. 281–305, 2012.
[5] C. E. Rasmussen, “Gaussian Processes in Machine Learning,” in
Advanced Lectures on Machine Learning: ML Summer Schools
2003, Canberra, Australia, February 2 - 14, 2003, Tübingen,
Germany, August 4 - 16, 2003, Revised Lectures, O. Bousquet, U.
von Luxburg, and G. Rätsch, Eds. Berlin, Heidelberg: Springer
Berlin Heidelberg, 2004, pp. 63–71.
[6] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, “Algorithms for
Hyper-Parameter Optimization,” in Proceedings of the 24th
International Conference on Neural Information Processing
Systems, 2012, pp. 2546–2554.
[7] J. Bergstra, B. Komer, C. Eliasmith, D. Yamins, and D. D. Cox,
“Hyperopt: a Python library for model selection and hyperparameter
optimization,” Computational Science & Discovery, vol. 8, no. 1, p.
014008, Jul. 2015.
[8] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, “Optuna: A Next-generation Hyperparameter Optimization Framework,” CoRR, vol. abs/1907.10902, Jul. 2019.
[9] J. Czakon, “Optuna vs Hyperopt: Which Hyperparameter
Optimization Library Should You Choose?” Nov-2019.
[10] A. Klein, S. Falkner, S. Bartels, P. Hennig, and F. Hutter, “Fast
Bayesian Optimization of Machine Learning Hyperparameters on
Large Datasets,” CoRR, vol. abs/1605.07079, May 2016.
[11] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A.
Talwalkar, “Hyperband: A Novel Bandit-Based Approach to
Hyperparameter Optimization,” J. Mach. Learn. Res., vol. 18, no. 1,
pp. 6765–6816, Jan. 2017.
[12] S. Falkner, A. Klein, and F. Hutter, “BOHB: Robust and Efficient
Hyperparameter Optimization at Scale,” in Proceedings of the 35th
International Conference on Machine Learning (ICML 2018), 2018,
pp. 1436–1445.
[13] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A.
Talwalkar, “Hyperband: A Novel Bandit-Based Approach to
Hyperparameter Optimization,” J. Mach. Learn. Res., vol. 18, no. 1,
pp. 6765–6816, Jan. 2017.
[14] J. Bergstra, B. Komer, C. Eliasmith, D. Yamins, and D. D. Cox,
“Hyperopt: a Python library for model selection and hyperparameter
optimization,” Computational Science & Discovery, vol. 8, no. 1, p.
014008, Jul. 2015.
[15] A. Paleyes, M. Pullin, M. Mahsereci, and J. González, “Emulation of physical processes with Emukit,” NeurIPS Machine Learning and the Physical Sciences Workshop, 2019.