
Speeding up Common Hyperparameter Optimization Methods by a Two-Phase-Search

Alexander Wendt, Christian Doppler Laboratory for Embedded Machine Learning, ICT, TU Vienna, Vienna, Austria, alexander.wendt@tuwien.ac.at

Marco Wuschnig, Christian Doppler Laboratory for Embedded Machine Learning, ICT, TU Vienna, Vienna, Austria, marco.wuschnig@tuwien.ac.at

Martin Lechner, Christian Doppler Laboratory for Embedded Machine Learning, ICT, TU Vienna, Vienna, Austria, martin.lechner@tuwien.ac.at

Abstract— Hyperparameter search concerns everybody who works with machine learning. We compare publicly available hyperparameter searches on four datasets. We develop metrics to measure the performance of hyperparameter searches across datasets of different sizes as well as machine learning algorithms. Further, we propose a method of speeding up the search by using subsets of data. Results show that random search performs well compared to Bayesian methods and that a combined search can speed up the search by a factor of 7.

Keywords—hyperparameter, machine learning, support vector machine, random forest, Bayesian optimization, optimization

I. INTRODUCTION

Hyperparameter search concerns everybody who works with machine learning. A high effort is put into finding hyperparameters to get the most out of the algorithms. There are numerous methods to optimize hyperparameters. Some are quite simple, like grid and random search, while others use a more complex model to speed up the search, like Bayesian methods [1], [2]. Although grid and random searches are not the most efficient searches [3], many practitioners still stick with them because of their simplicity.

In this work, we compare frequently used, publicly available hyperparameter searches with implementations of "classic" machine learning algorithms in the widely used Python machine learning package Scikit-learn1. We use their implementations of the Support Vector Regressor (SVR) and the Random Forest Regressor (RFR) on regression problems on four datasets. The research problem is to find a metric that compares methods across datasets and machine learning algorithms and to determine which hyperparameter optimization method produces reliable results fastest. Further, we want to find out if an extensive hyperparameter search on less data, combined with a narrow parameter search on the full dataset, is faster than a method that uses the whole dataset.

We propose the following methodology: First, we apply grid search as a baseline with all algorithms on all datasets, using 800 parameter combinations. We choose R-squared as a metric. It provides us with a measure of how well our model fits the data. The range is $r^2 \in (-\infty, 1]$, where the value 1.0 would be a perfect fit for the data without any variance.
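For reference, the standard definition of R-squared (not restated in the paper, but the score that Scikit-learn's r2_score computes) is

$r^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},$

where $\hat{y}_i$ are the model predictions and $\bar{y}$ is the mean of the observed targets. A model that always predicts the mean scores 0, and arbitrarily poor models become negative, which is why the range is open towards $-\infty$.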

Second, we apply each hyperparameter optimization method on all datasets and algorithms. As all methods except grid search are stochastic, we perform five iterations per optimization. Then, we compare the hyperparameter optimization algorithms regarding optimization performance as well as variance and speed across all datasets and machine learning algorithms. The challenge is to find metrics that cover duration as well as reliability.

1 https://scikit-learn.org/stable/
2 https://github.com/hyperopt/hyperopt

In step three, we want to determine how large a share of the datasets is representative, to be able to limit the parameter search space. With that share, we do a wide search to determine categorical parameters and to restrict the search space for continuous values. Then, we do a narrow search of continuous values on the full dataset. We test different combinations of hyperparameter optimization methods to find the optimum between search duration and reliability. Our contributions to research are the following:

• Definition of metrics to measure the performance of hyperparameter search across datasets of different sizes as well as machine learning algorithms
• Review of common Bayesian optimization methods
• Speedup of common methods on large datasets through a two-phase-search algorithm

II. HYPERPARAMETER OPTIMIZATION METHODS

In the past, manual search and grid search were the way to go when running hyperparameter optimization. To increase efficiency, people often used semi-automated, multi-staged grid searches. In [3], the authors combined a logarithmic grid with a fine-grained linear grid. However, significant disadvantages are that grid search evaluates a lot of non-useful combinations and that it lacks early stopping. In [4], a random search is found to be superior to grid search in both runtime and quality of the results.

In recent years, Bayesian optimization [1], [2] has gained much attention for hyperparameter tuning. These methods are designed for objective functions with long evaluation periods; thus, Bayesian optimization fits the needs of modern machine learning (ML) hyperparameter tuning well. Depending on the implementation, Gaussian processes (GP) [5] or Tree-structured Parzen estimators (TPE) [6] are used to approximate the target function based on the historical data. Hyperopt2 [7] is a hyperparameter optimization library for Python implementing random search, TPE, and an improved version of TPE called adaptive TPE. Optuna3 [8] is another, more recent optimization library. Compared to Hyperopt, Optuna supports dynamically created parameter search spaces, and the authors claim that their implementation is efficient in terms of searching and pruning while being versatile enough to be used for many different optimization problems. They implement grid search, random search, and TPE.

3 https://github.com/optuna/optuna
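As an illustration of how such a library-based search looks in practice, the following is a minimal sketch of a TPE search with Hyperopt for an rbf-kernel SVR. It is not the authors' code: the dataset, the logarithmic search bounds, and the evaluation budget are assumptions made for the example.

```python
import numpy as np
from hyperopt import fmin, tpe, hp, Trials
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in for one of the regression datasets.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Wide logarithmic search space for the continuous SVR parameters (bounds assumed).
space = {
    "C": hp.loguniform("C", np.log(1e-3), np.log(1e3)),
    "gamma": hp.loguniform("gamma", np.log(1e-3), np.log(1e3)),
}

def objective(params):
    model = SVR(kernel="rbf", C=params["C"], gamma=params["gamma"])
    # fmin minimizes, so return the negative cross-validated r^2 score.
    return -cross_val_score(model, X, y, cv=5, scoring="r2").mean()

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=40, trials=trials)
print(best)  # best C and gamma found within the 40 evaluations
```

Optuna and Skopt provide equivalent loops through their own study and optimizer objects.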


According to [9], both implementations of TPE outperform random search, but their results differ, which can be explained by different internal parameter settings (sometimes referred to as hyper-hyperparameters). FABOLAS [10] is specially designed for large datasets and uses Gaussian processes for approximation. Skopt4 is another widely used Python library implementing Bayesian optimization as well as grid and random search.

In contrast to Bayesian optimization-based approaches, Hyperband [11] focuses on improving random search by using adaptive resource allocation, i.e., by allocating more resources to promising hyperparameter settings, and principled early stopping strategies. The authors claim that Hyperband is 5 to 30 times faster than state-of-the-art Bayesian-based methods. All Bayesian methods share the disadvantage of a slow start, i.e., they need some time to find suitable hyperparameters, but they eventually outperform bandit approaches like Hyperband. In contrast, Hyperband is much faster and gives superior results for small time budgets. BOHB [12] is a combination of Bayesian optimization and Hyperband that tries to combine their advantages: a fast start with good performance for large budgets.

III. TWO-PHASE OPTIMIZATION MODEL

In [10], the authors showed that a small subset of a dataset is representative enough to provide a representative hyperparameter space. The grid search time for SVR scales at least quadratically with the number of samples $n$, which makes the search costly for larger datasets. Our method should therefore provide an advantage, as only subsets of the dataset are used. Less data generates a higher loss while the hyperparameters remain the same. With more data, the loss gets smaller until saturation. We want to use this observation to speed up the search with a two-phase approach: Phase 1, a wide search on a small share of the data with low granularity, and then, Phase 2, a narrow search on the complete data with high granularity.

We want to find out if the search duration over the whole space can be significantly lowered when focusing the search only on the relevant areas of the search space. Further, we assume that less data is enough to determine categorical parameters and that continuous parameters are subject to fine-tuning. Therefore, the targets for the high-granularity search are all continuous parameters for a selected set of categorical parameters. For instance, the SVR has the categorical parameter kernel, which can be either linear or a radial basis function (rbf). For rbf, it uses $C$ and $\gamma$ as continuous parameters. However, this method is not limited to selecting kernels. In the model pipeline, categorical parameters could also be the selection of the scaler, sampler, and imputer for missing values, as well as feature subsets.

In Fig. 1, we present an activity graph for the search space limitation. The idea is to first apply a hyperparameter optimization like a grid search or a Bayesian method on a wide range of parameters. For SVR, the parameter space for $C$ could span several orders of magnitude on a logarithmic scale to cover the whole space. The search gets cheaper by only using a small subset $s \in S$ of the data, e.g., 10%. Similar to [12], the size of the dataset is our limited resource. Here, the subset size and the number of parameter combinations or iterations can be configured.

4 https://github.com/scikit-optimize/scikit-optimize

Fig. 1. Activity chart for the limitation of the search space

Based on the results of the wide search, we select the subset $t$ with the highest results, i.e., the top 20%, to get only the most promising candidates. To avoid overfitting in this phase, in which the categorical parameters get fixed, we calculate the median result per categorical parameter value and select the categorical parameter value with the highest median. Using the median lowers the risk of picking parameters where only a few results are particularly good and the rest only moderate. Furthermore, we are not interested in perfect hyperparameters. Instead, we look for a range where most hyperparameter combinations bring decent results. We analyze the ranges of the continuous parameters for the selected set of categorical parameters to get their minimal and maximal values. In case there is only one value or no value for the selected set, we retrain on the same dataset with fixed categorical parameters and wide-range continuous parameters. Then, as in the previous steps, we select the subset $t$ with the highest results.

After the search space has been limited, we apply it to the whole dataset with a hyperparameter optimization method in Phase 2. As the selection of the hyperparameter optimization method is arbitrary, we try combinations of them, e.g., grid search with Bayesian and Bayesian with Bayesian.
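The Phase-1 post-processing described above can be sketched as follows. This is an illustration rather than the authors' implementation; it assumes the wide-search results are collected in a pandas DataFrame with hypothetical columns score, kernel, C, and gamma, and it uses a top share of 20%.

```python
import pandas as pd

def limit_search_space(results: pd.DataFrame, top_share: float = 0.2) -> dict:
    """Derive a narrow Phase-2 search space from wide Phase-1 results."""
    # Keep only the top subset t, e.g., the best 20% of runs by score.
    top = results.nlargest(max(1, int(len(results) * top_share)), "score")
    # Fix the categorical parameter via the best median score, not the best single run.
    best_kernel = top.groupby("kernel")["score"].median().idxmax()
    selected = top[top["kernel"] == best_kernel]
    # Narrow the continuous ranges to the observed min/max for that categorical choice.
    return {
        "kernel": best_kernel,
        "C": (selected["C"].min(), selected["C"].max()),
        "gamma": (selected["gamma"].min(), selected["gamma"].max()),
    }
```

The retraining branch of Fig. 1 (fixed categorical parameters, wide continuous ranges) would be triggered around this function when too few rows remain for the selected categorical value.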

IV. TEST AND COMPARISON METHOD

The goal of the method comparison is to find which methods reach almost the same $r^2$ score as the grid search, but faster, and to see if there is a global winner across multiple datasets and machine learning methods.

A. Test Setup

We use four different datasets in the experiments: two small datasets, Fishcatch and AutoMPG, and two large datasets, Amsterdam AirBnb and Bikeshare. Their properties are shown in Tab. I. We train SVR and RFR on them. The performance of the regression is measured by the $r^2$ metric described in the introduction.



TABLE I. DATASET CHARACTERISTICS

Dataset Name   Samples   Features   Prediction Goal
Fishcatch5     158       7          Fish weight
AutoMPG6       199       9          Miles per gallon
Airbnb7        10498     16         Price of a hotel room
Bikeshare8     8690      13         Used bikes per day

For SVR, the following parameters were used: kernel ∈ {rbf, linear}, with $C$ and $\gamma$ drawn from logarithmic ranges spanning several orders of magnitude, using a wider range for the large datasets than for the small ones. For RFR, there were two categorical parameters: bootstrap ∈ {True, False} and max_features ∈ {auto, sqrt}. Continuous parameters were max_depth ∈ [10, 100], min_samples_leaf ∈ [1, 4], min_samples_split ∈ [2, 11], and n_estimators ∈ [200, 2000].

We compare the following hyperparameter optimization methods: random search, Hyperopt [13], Optuna [8], Skopt GP [5], and Skopt TPE [6]. We tried to install and test the Gaussian process method FABOLAS [10] but failed; the code from the repository does not seem to be maintained. In [14], an implementation is offered. However, it did not run stably on our datasets and was excluded from the comparison. Each machine learning algorithm was cross-validated with five folds. Due to the stochastics of the optimization methods, we run each test five times to also capture the variance of the results.
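A minimal sketch of this test protocol for the RFR case, using Scikit-learn's RandomizedSearchCV, is given below. It is illustrative only; the data and the exact discretization of the ranges are assumptions, not the authors' code.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for one of the regression datasets.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# RFR search space following the ranges listed above.
param_distributions = {
    "bootstrap": [True, False],
    "max_features": ["auto", "sqrt"],  # 'auto' exists in Sklearn 0.19.2; newer versions use 1.0 instead
    "max_depth": randint(10, 101),
    "min_samples_leaf": randint(1, 5),
    "min_samples_split": randint(2, 12),
    "n_estimators": randint(200, 2001),
}

best_scores = []
for repetition in range(5):  # five repetitions to capture the stochastic variance
    search = RandomizedSearchCV(
        RandomForestRegressor(random_state=repetition),
        param_distributions,
        n_iter=800,          # same budget as the 800-combination grid search
        cv=5,                # five-fold cross-validation
        scoring="r2",
        random_state=repetition,
        n_jobs=-1,
    )
    search.fit(X, y)
    best_scores.append(search.best_score_)
```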

We execute all tests on a virtual server with Intel(R)

Xeon(R) CPU E5-2630 v2 @ 2.60GHz, 2600 MHz with four

cores, 12 GB RAM on Windows 10. All code was

implemented in Jupyter Notebooks with Sklearn 0.19.2 on

Python 3.7.

B. Representative Subsets of Datasets

We execute 800 iterations of the grid and random search and 40 iterations of the Bayesian optimization methods. To see the effect of using only subsets, we execute each method five times for 10% to 100% of the data. As in [10], we measure how representative subsets of the data are, i.e., whether 10% of the samples provides us with similar hyperparameters as 100% of the data. Then, we analyze if there is one region of high $r^2$. This is visualized in Fig. 2 by plotting all available measurements per subset for selected hyperparameters. At the top, we show the large AirBnb dataset with SVR, with 10% of the samples to the left and 100% of the samples to the right. As AutoMPG with SVR in the middle is a small dataset, 10% of the data was not enough to get representative results. Therefore, 20%, or at least 40 samples, have to be used. At the bottom, we plot the Bikeshare dataset with RFR.

For all SVR measurements, there was one distinct area of high scores. As we compare with the optimum for all data, this also applies to the categorical parameters. For RFR, no distinguishable area could be found among the parameters. This suggests that RFR is not very sensitive to the selection of hyperparameters in our ranges. We conclude that our assumption of using less data for narrowing the search space is valid for SVR, but not for RFR.
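A possible way to set up such a subset study is sketched below; the subsampling routine, the fixed candidate hyperparameters, and the data are assumptions for illustration, not the authors' code.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR
from sklearn.utils import resample

# Synthetic stand-in for one of the regression datasets.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

results = {}
for share in np.arange(0.1, 1.01, 0.1):
    n = int(len(X) * share)
    # Draw a random subset of the dataset without replacement.
    X_sub, y_sub = resample(X, y, n_samples=n, replace=False, random_state=0)
    # Score one candidate hyperparameter setting on the subset (repeated over the whole grid).
    model = SVR(kernel="rbf", C=1.0, gamma=0.1)
    results[round(share, 1)] = cross_val_score(model, X_sub, y_sub, cv=5, scoring="r2").mean()
```

Repeating the inner scoring step over the whole hyperparameter grid yields per-subset score landscapes analogous to those plotted in Fig. 2.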

5 https://www.kaggle.com/aungpyaeap/fish-market

6 https://www.kaggle.com/fadikamal/autompgdatasetzip

Fig. 2. Top and middle: hyperparameter space $C$ and $\gamma$ for the datasets Airbnb and AutoMPG with SVR rbf kernel. Bottom: the Bikeshare dataset with RFR with the hyperparameters number of estimators and max depth

C. Metrics for Comparison Across Datasets and Algorithms

We want to find a metric that allows us to compare performance across datasets, subsets of the datasets, ML algorithms, and hyperparameter optimization methods. We are interested in the speed as well as the reliability of reaching the grid search score.

To be able to compare the algorithms across multiple datasets, we use the only common denominator as a reference: the grid search. Different from the other algorithms, grid search is deterministic and always provides the same score. Its time varies less as it tests the same hyperparameters. Therefore, we compare everything to the grid search median performance on a dataset. Fig. 3 shows subsets of the Bikeshare dataset ranging from 10% to 100% on SVR, in absolute values to the left and values normalized to grid search to the right. In the following, we use the data from Hyperopt as a representative of all Bayesian methods to demonstrate how we compare performance; in the evaluation section, we compare the Bayesian methods against each other. As grid search is almost the best possible value, we subtract a tolerance by lowering the reference score by 2%, i.e., the relative score is

$s_{rel} = \frac{s_{method}}{0.98 \cdot s_{grid}}$.   (1)

Time is measured in time per iteration normalized to grid search, i.e., the relative time per iteration is

$t_{rel} = \frac{t_{method}}{t_{grid}}$.   (2)

As we only performed five measurements per

hyperparameter optimization method and subset from 10% to

100% of a dataset, we investigate whether it is possible to

merge subsets to get more measurements.

7 https://www.kaggle.com/adityadeshpande23/amsterdam-airbnb/

8 http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset/


Fig. 3. Grid search, random search, and Hyperopt score as well as time per iteration for the Bikeshare dataset; absolute values to the left and the same results normalized to grid search to the right

On a relative scale to grid search, as demonstrated in Fig. 3, the medians of random search and Hyperopt are almost trendless, i.e., they scale almost like grid search. Fig. 4 shows the same situation in normalized results of subsets of the small AutoMPG dataset on RFR. Based on our observations, we conclude that we can merge subset results for the score and time per iteration. For small datasets, we merge from 20% and for large sets from 10%.

Fig. 4. Relative score and relative time per iteration for subsets of the

AutoMPG dataset on an RFR with Bayesian optimization methods

In Fig. 5, we plot the score $s_{rel}$ and the time per iteration $t_{rel}$ for each dataset, grouped by machine learning algorithm, for Hyperopt and random search. We notice that the relative score for both RFR and SVR is similar across the datasets, but not across the machine learning algorithms. While the distributions are similar for random search, the RFR score distribution for Hyperopt is much higher than the SVR score distribution. Hyperopt on SVR shows several outliers with a very low score. In these cases, the compared Bayesian methods get stuck in a local maximum, optimizing only a linear kernel. While the relative time per iteration could be merged into one distribution for random search, it differs much for Hyperopt. We conclude that we can merge the relative score $s_{rel}$ and the time per iteration $t_{rel}$ for all datasets, but not across machine learning algorithms. However, for Bayesian methods, we expect a high variance in $t_{rel}$.

In our comparison, it is interesting to see after how many iterations we can expect a method to reach the grid search score, called iterations to grid search score, $n_{gs}$. This information can be used to reduce the number of iterations. Our initial assumption was that random search should run 800 iterations like the grid search, while 40 iterations would be enough for the Bayesian methods.

Fig. 5. Top, the relative score and bottom, the time per iteration for

Hyperopt and random search for each machine learning algorithm on all

datasets

Some runs never reach the grid search score, due to stochastic variance or because the method gets stuck in a local optimum. In Fig. 6, we see the number of iterations $n_{gs}$ for Hyperopt and random search. We see that for RFR, usually fewer than ten iterations are necessary for both Hyperopt and random search, which confirms that almost any hyperparameters are good enough. For SVR, we see that 40 iterations for the Bayesian methods are too few, as many runs do not reach the grid search score at all and are therefore capped at the value 40. Based on the RFR iterations and random search on SVR, we conclude that we can merge the number of iterations $n_{gs}$ across datasets to get comparable distributions. We can also see if we tested with too few iterations and how many iterations are needed.

Fig. 6. To the left, the number of iterations to grid search score for RFR; to

the right, SVR where the iterations around 40 were highlighted on the y-axis

We want to compare the Bayesian methods, which only ran 40 times, to a random search that ran 800 times. Therefore, we have to include the durations as well. To get the relative duration $T_{rel}$, we estimate the duration needed to reach the grid search result, measured in "grid search units", as

$T_{rel} = \frac{n_{gs}}{N_{grid}} \cdot t_{rel}$,   (3)

where $N_{grid} = 800$ is the number of grid search iterations and only runs are considered that make it to the grid search score. Fig. 7 shows $T_{rel}$ for all tested methods on all datasets. To compress the results into one metric per hyperparameter optimization method, we select the 3rd quartile relative duration, i.e., the duration within which ¾ of the searches that reached the grid search score did so.
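Assuming equations (2) and (3) as reconstructed above, the relative duration and its 3rd quartile can be computed as in the following sketch (variable names and the example records are illustrative):

```python
import numpy as np

N_GRID = 800  # number of grid search parameter combinations

def relative_duration(n_gs, t_iter, t_grid_iter):
    """Eq. (3): estimated duration to reach the grid search score in 'grid search units'."""
    t_rel = t_iter / t_grid_iter          # Eq. (2): relative time per iteration
    return (n_gs / N_GRID) * t_rel

# Hypothetical per-run records for runs that reached the grid search score.
runs = [
    {"n_gs": 25, "t_iter": 2.1, "t_grid_iter": 1.0},
    {"n_gs": 32, "t_iter": 2.0, "t_grid_iter": 1.0},
    {"n_gs": 18, "t_iter": 2.3, "t_grid_iter": 1.0},
]
durations = [relative_duration(r["n_gs"], r["t_iter"], r["t_grid_iter"]) for r in runs]
third_quartile = np.quantile(durations, 0.75)  # 3rd quartile relative duration
```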

As mentioned before, not all of the runs reach the grid search score within the limited time boundaries. Therefore, we want an additional metric that tells us the confidence that a tested method reaches the grid search score.


TABLE II. MEASUREMENT RESULTS

Hyperparameter           Min. Rel.  Median      Max. Rel.               3rd Quart.     Min. Rel.  Median Rel.  3rd Quart. Rel.  Max. Rel.
Optimization Method      Score      Rel. Score  Score       Reliability Iter. to Grid  Duration   Duration     Duration         Duration
Hyperopt                 0.54       1.00        1.08        0.52        32             0.0007     0.055        0.085            0.372
Optuna                   0.54       0.97        1.22        0.40        27             0.0004     0.160        0.361            1.330
Skopt GP                 0.54       0.99        1.08        0.44        24             0.0356     0.694        1.520            5.750
Skopt TPE                0.54       0.97        1.08        0.37        24             0.0033     0.238        0.573            2.970
Random                   0.86       1.02        1.08        0.91        170            0.0009     0.067        0.178            0.854
Grid800+Hyper20          0.94       1.01        1.03        0.68        820**          0.0139*    0.024*       0.034*           0.064*
Hyper20+Hyper20          0.54       0.91        1.02        0.11        40**           0.0011*    0.019*       0.055*           0.222*
Hyper40+Hyper20          0.54       0.98        1.03        0.41        60**           0.0017*    0.011*       0.026*           0.059*
Random100+Hyper20        0.54       1.01        1.03        0.65        120**          0.0020*    0.009*       0.015*           0.106*
Random100+Random100      0.86       1.01        1.03        0.70        200**          0.0013*    0.007*       0.025*           0.236*

*Only data from the two large datasets AirBnb and BikeShare are used, as the models are not suited for small datasets; **Total number of iterations; red text color in the original: best Bayesian method and best mixed method

Fig. 7. To the left, estimated time to reach grid search score for all measured

methods; to the right, y-axis limit at 1.0

It is measured as the reliability $r_{rel}$, defined as the number of runs whose score reaches the grid search score including the 2% tolerance, divided by the number of runs $N$:

$r_{rel} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left(s_{rel,i} \geq 1\right)$.   (4)
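A corresponding sketch for the reliability metric, assuming relative scores per equation (1) are already computed (values are illustrative):

```python
import numpy as np

def reliability(relative_scores):
    """Eq. (4): share of runs whose relative score reaches the tolerance-adjusted grid score."""
    scores = np.asarray(relative_scores)
    return float(np.mean(scores >= 1.0))

# Example: five runs, three of which reach the grid search level within tolerance.
print(reliability([1.02, 0.97, 1.00, 0.54, 1.01]))  # -> 0.6
```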

V. EVALUATION

After the metrics for relative duration and reliability have been defined, we first compare the Bayesian methods with each other. Then, we perform our two-phase optimization model with a grid search, a random search, and the best overall performing Bayesian method. In the following, the 3rd quartile relative duration is used to measure time, i.e., the duration within which ¾ of the successful runs reach the grid search score.

A. Comparison of Common Optimization Methods

In Fig. 7 and Tab. II, we see the comparison of the Bayesian methods with our metrics. There is a huge difference in the relative duration of the different methods. Hyperopt is doubtlessly the fastest one for all machine learning methods with about $T_{rel} \approx 0.085$ for SVR and similar values for RFR. All its outliers are < 1.0, which is about 7x the median value. Skopt TPE and especially Skopt GP suffer from extreme outliers of up to $T_{rel} \approx 6$, which is about 12x the median value. Optuna performs somewhere between Skopt GP and Skopt TPE. Also, in terms of reliability, Hyperopt performs best with $r_{rel} = 0.51$ compared to values around 0.4 for the other Bayesian methods. An interesting observation is that many runs that did not make it to the grid search score had a duration of up to $T_{rel} \approx 12$ for Skopt GP, i.e., the algorithm almost gets stuck and will probably never reach the grid search score. Due to too few runs, as visible in Fig. 6, reliability was low. Nevertheless, Hyperopt is our selection as the best Bayesian method.

Random search performs similarly to grid search with $r_{rel} = 0.91$ after the same number of runs as grid search. $T_{rel} = 0.178$ suggests that only 170 runs would be necessary for ¾ of the runs to reach the grid search level. Compared to Hyperopt, it runs slower, but the reliability and the confidence to reach the grid search level are much higher. While random search can test even useless parameter settings in parallel, Hyperopt needs to determine the next hyperparameters serially. A higher reliability of Hyperopt would require more iterations, which would increase the relative duration. To conclude, Hyperopt is faster than random search in getting ¾ of the successful runs to the grid search score, but random search is more reliable than Hyperopt.

B. Comparison of the Two-Phase Optimization Models

We performed the two-phase optimization model with different combinations of hyperparameter algorithms and iteration counts. In Tab. II, we show the tested combinations as [method on 10%][iterations] + [method on 100%][iterations] together with the results. In all tests, we used 10% of the large datasets and 20% of the small datasets for the first run. Because RFR performance was independent of the choice of hyperparameters, we applied the two-phase optimization model only to SVR.

We present the results in Tab. II as well as in Fig. 8. Hyperopt20+Hyperopt20 has $r_{rel} = 0.11$, which is a very low reliability; it only reaches the grid search score with its outliers. Hyperopt40+Hyperopt20 has a reliability and score that are comparable to the common Bayesian methods. For small datasets, no duration gain is made.


Fig. 8. Relative duration for combinations within the Two-Phase

Optimization Models

For the large datasets, however, it is much faster than normal Hyperopt, with a relative duration of $T_{rel} = 0.026$ compared to 0.085. In that case, Hyperopt40+Hyperopt20 would be an improvement compared to the use of Hyperopt alone on 100% of the data.

Random100+Random100 is comparable to a random search with 170 iterations, which is the 3rd quartile of iterations to the grid search result. Of the tested combinations, it has the best reliability, $r_{rel} = 0.70$. In the small datasets, as well as in Tab. II, we did not measure any significant increase in relative duration. In the large datasets, however, $T_{rel} = 0.025$, i.e., a factor of 7.1x faster than random search.

Finally, Grid800+Hyper20 and Random100+Hyper20 behave similarly, with reliabilities of 0.68 and 0.65. Random100+Hyper20 has more outliers, where the worst relative score was 0.53, which is about the same value at which the other Bayesian methods get stuck. Both methods are at least 5.2x faster than a random search on large datasets and have much fewer outliers. We conclude that a reduced parameter space is a proper way of reducing search time for large datasets at the cost of slightly decreased reliability. Our algorithm of choice is Random100+Random100.

VI. CONCLUSION

We developed metrics to compare hyperparameter searches by duration and by the variance of the score; the compared methods are random search, Skopt, Optuna, and Hyperopt. By relating everything to the deterministic grid search, we created a basis for comparison. Our metric calculates the expected duration to reach the grid search score as a factor of the grid search duration. Additionally, we use reliability as a measure of confidence for the duration. We show that it is possible to merge results from various datasets.

In our tests, Hyperopt was the fastest method. Some publications claim that Bayesian methods are faster than random search and would be the better choice. We come to another conclusion. While Hyperopt reaches the grid search score slightly faster than random search, one can be much more confident that random search reaches it by the end of the test.

We introduced a two-phase-search that first searches a wide hyperparameter space on less data and then searches within a narrow range on the full dataset. A combination of two random searches provides almost as good results but is 7.1x faster than an ordinary random search on large datasets. This approach is suitable for support vector machines, but not for random forest methods.

For future work, we intend to use information from the metrics to better estimate how many iterations are necessary to get decent results. We applied several "hyper-hyperparameters" in our metrics and the two-phase search. They should be refined for more general usage. Finally, it would be interesting to evaluate this method on more complex models like a neural network with more hyperparameters.

ACKNOWLEDGMENT

The financial support by the Austrian Federal Ministry for

Digital and Economic Affairs and the National Foundation for

Research, Technology, and Development is gratefully

acknowledged.

REFERENCES

[1] J. Mockus, “On bayesian methods for seeking the extremum,” in

Optimization Techniques IFIP Technical Conference Novosibirsk,

July 1–7, 1974, 1975, pp. 400–404.

[2] P. I. Frazier, “A Tutorial on Bayesian Optimization,” ArXiv, vol.

abs/1807.02811, 2018.

[3] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio,

“An Empirical Evaluation of Deep Architectures on Problems with

Many Factors of Variation,” in Proceedings of the 24th International

Conference on Machine Learning, 2007, pp. 473–480.

[4] J. Bergstra and Y. Bengio, “Random Search for Hyper-Parameter

Optimization,” Journal of Machine Learning Research, vol. 13, no.

10, pp. 281–305, 2012.

[5] C. E. Rasmussen, “Gaussian Processes in Machine Learning,” in

Advanced Lectures on Machine Learning: ML Summer Schools

2003, Canberra, Australia, February 2 - 14, 2003, Tübingen,

Germany, August 4 - 16, 2003, Revised Lectures, O. Bousquet, U.

von Luxburg, and G. Rätsch, Eds. Berlin, Heidelberg: Springer

Berlin Heidelberg, 2004, pp. 63–71.

[6] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, “Algorithms for

Hyper-Parameter Optimization,” in Proceedings of the 24th

International Conference on Neural Information Processing

Systems, 2012, pp. 2546–2554.

[7] J. Bergstra, B. Komer, C. Eliasmith, D. Yamins, and D. D. Cox,

“Hyperopt: a Python library for model selection and hyperparameter

optimization,” Computational Science & Discovery, vol. 8, no. 1, p.

014008, Jul. 2015.

[8] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, “Optuna:

A Next-generation Hyperparameter Optimization Framework,”

CoRR, vol. abs/1907.10902, Jul. 2019.

[9] J. Czakon, “Optuna vs Hyperopt: Which Hyperparameter

Optimization Library Should You Choose?” Nov-2019.

[10] A. Klein, S. Falkner, S. Bartels, P. Hennig, and F. Hutter, “Fast

Bayesian Optimization of Machine Learning Hyperparameters on

Large Datasets,” CoRR, vol. abs/1605.07079, May 2016.

[11] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A.

Talwalkar, “Hyperband: A Novel Bandit-Based Approach to

Hyperparameter Optimization,” J. Mach. Learn. Res., vol. 18, no. 1,

pp. 6765–6816, Jan. 2017.

[12] S. Falkner, A. Klein, and F. Hutter, “BOHB: Robust and Efficient

Hyperparameter Optimization at Scale,” in Proceedings of the 35th

International Conference on Machine Learning (ICML 2018), 2018,

pp. 1436–1445.

[13] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A.

Talwalkar, “Hyperband: A Novel Bandit-Based Approach to

Hyperparameter Optimization,” J. Mach. Learn. Res., vol. 18, no. 1,

pp. 6765–6816, Jan. 2017.

[14] J. Bergstra, B. Komer, C. Eliasmith, D. Yamins, and D. D. Cox,

“Hyperopt: a Python library for model selection and hyperparameter

optimization,” Computational Science & Discovery, vol. 8, no. 1, p.

014008, Jul. 2015.

[15] M. Mahsereci, J. Gonzales, A. Paleyes et al., "Emulation of physical processes with Emukit," NeurIPS Machine Learning and Physical Sciences Workshop 2019, 2019.

© 2020 IEEE, Preprint