arXiv:0807.1494v1 [cs.AI] 9 Jul 2008
Algorithm Selection as a Bandit Problem with Unbounded Losses
Matteo Gagliolo
Jürgen Schmidhuber
Technical Report No. IDSIA-07-08
July 9, 2008
Istituto Dalle Molle di studi sull’intelligenza artificiale
Galleria 2, 6928 Manno, Switzerland
IDSIA was founded by the Fondazione Dalle Molle per la Qualità della Vita and is affiliated with both the Università della Svizzera italiana (USI)
and the Scuola universitaria professionale della Svizzera italiana (SUPSI).
Both authors are also affiliated with the University of Lugano, Faculty of Informatics (Via Buffi 13, 6904 Lugano, Switzerland). J. Schmidhuber is
also affiliated with TU Munich (Boltzmannstr. 3, 85748 Garching, München, Germany).
This work was supported by the Hasler foundation with grant n. 2244.
Abstract

Algorithm selection is typically based on models of algorithm performance, learned during a separate offline training sequence, which can be prohibitively expensive. In recent work, we adopted an online approach, in which a performance model is iteratively updated and used to guide selection on a sequence of problem instances. The resulting exploration-exploitation trade-off was represented as a bandit problem with expert advice, using an existing solver for this game, but this required setting an arbitrary bound on algorithm runtimes, thus invalidating the optimal regret of the solver. In this paper, we propose a simpler framework for representing algorithm selection as a bandit problem, with partial information and an unknown bound on losses. We adapt an existing solver to this game, proving a bound on its expected regret, which also holds for the resulting algorithm selection technique. We present preliminary experiments with a set of SAT solvers on a mixed SAT-UNSAT benchmark.
1 Introduction
Decades of research in the fields of Machine Learning and Artificial Intelligence have brought us a variety
of alternative algorithms for solving many kinds of problems. Algorithms often display variability in
performance quality and computational cost, depending on the particular problem instance being solved:
in other words, there is no single "best" algorithm. While a "trial and error" approach is still the most
popular, attempts to automate algorithm selection are not new [33], and have grown to form a substantial
and active field of research in the area of Meta-Learning [37]. Many selection methods follow an offline
learning scheme, in which the availability of a large training set of performance data for the different
algorithms is assumed. This data is used to learn a model that maps (problem,algorithm) pairs to expected
performance, or to some probability distribution on performance. The model is later used to select and
run, for each new problem instance, only the algorithm that is expected to give the best results. While this
approach might sound reasonable, it actually ignores the computational cost of the initial training phase:
collecting a representative sample of performance data has to be done via solving a set of training problem
instances, and each instance is solved repeatedly, at least once for each of the available algorithms, or more
if the algorithms are randomized. Furthermore, these training instances are assumed to be representative of
future ones, as the model is not updated after training.
In other words, there is an obvious trade-off between the exploration of algorithm performances on
different problem instances, aimed at learning the model, and the exploitation of the best algorithm/problem
combinations, based on the model’s predictions. This trade-off is typically ignored in offline algorithm
selection, and the size of the training set is chosen heuristically. In our previous work [13, 14, 15], we have
kept an online view of algorithm selection, in which the only input available to the meta-learner is a set of
algorithms, of unknown performance, and a sequence of problem instances that have to be solved. Rather
than artificially subdividing the problem set into a training and a test set, we iteratively update the model
each time an instance is solved, and use it to guide algorithm selection on the next instance.
Bandit problems [3] offer a solid theoretical framework for dealing with the exploration-exploitation
trade-off in an online setting. One important obstacle to the straightforward application of a bandit prob-
lem solver to algorithm selection is that most existing solvers assume a bound on losses to be available
beforehand. In [16, 15] we dealt with this issue heuristically, fixing the bound in advance. In this paper,
we introduce a modification of an existing bandit problem solver [7], which allows it to deal with an un-
known bound on losses, while retaining a bound on the expected regret. This allows us to propose a simpler
version of the algorithm selection framework GAMBLETA, originally introduced in [15]. The result is a
parameterless online algorithm selection method, the first, to our knowledge, with a provable upper bound
on regret.
The rest of the paper is organized as follows. Section 2 describes a tentative taxonomy of algorithm
selection methods, along with a few examples from literature. Section 3 presents our framework for repre-
senting algorithm selection as a bandit problem, discussing the introduction of a higher level of selection
among different algorithm selection techniques (time allocators). Section 4 introduces the modified bandit
problem solver for unbounded loss games, along with its bound on regret. Section 5 describes experiments
with SAT solvers. Section 6 concludes the paper.
2 Related work
In general terms, algorithm selection can be defined as the process of allocating computational resources
to a set of alternative algorithms, in order to improve some measure of performance on a set of problem
instances. Note that this definition includes parameter selection: the algorithm set can contain multiple
copies of the same algorithm, differing in their parameter settings; or even identical randomized algorithms
differing only in their random seeds. Algorithm selection techniques can be further described according to
different orthogonal features:
Decision vs. optimisation problems. A first distinction needs to be made among decision problems,
where a binary criterion for recognizing a solution is available; and optimisation problems, where different
levels of solution quality can be attained, measured by an objective function [22]. Literature on algorithm
selection is often focused on one of these two classes of problems. The selection is normally aimed at min-
imizing solution time for decision problems; and at maximizing performance quality, or improving some
speed-quality trade-off, for optimisation problems.
Per set vs. per instance selection. The selection among different algorithms can be performed once for
an entire set of problem instances (per set selection, following [24]); or repeated for each instance (per
instance selection).
Static vs. dynamic selection. A further independent distinction [31] can be made among static algorithm
selection, in which allocation of resources precedes algorithm execution; and dynamic, or reactive, algo-
rithm selection, in which the allocation can be adapted during algorithm execution.
Oblivious vs. non-oblivious selection. In oblivious techniques, algorithm selection is performed from
scratch for each problem instance; in non-oblivious techniques, there is some knowledge transfer across
subsequent problem instances, usually in the form of a model of algorithm performance.
Off-line vs. online learning. Non-oblivious techniques can be further distinguished as offline or batch
learning techniques, where a separate training phase is performed, after which the selection criteria are
kept fixed; and online techniques, where the criteria can be updated every time an instance is solved.
A seminal paper in the field of algorithm selection is [33], in which offline, per instance selection is
first proposed, for both decision and optimisation problems. More recently, similar concepts have been
proposed, with different terminology (algorithm recommendation, ranking, model selection), in the Meta-
Learning community [12, 37, 18]. Research in this field usually deals with optimisation problems, and
is focused on maximizing solution quality, without taking into account the computational aspect. Work
on Empirical Hardness Models [27, 30] is instead applied to decision problems, and focuses on obtaining
accurate models of runtime performance, conditioned on numerous features of the problem instances, as
well as on parameters of the solvers [24]. The models are used to perform algorithm selection on a per
instance basis, and are learned offline: online selection is advocated in [24]. Literature on algorithm port-
folios [23, 19, 32] is usually focused on choice criteria for building the set of candidate solvers, such that
their areas of good performance do not overlap, and optimal static allocation of computational resources
among elements of the portfolio.
A number of interesting dynamic exceptions to the static selection paradigm have been proposed re-
cently. In [25], algorithm performance modeling is based on the behavior of the candidate algorithms dur-
ing a predefined amount of time, called the observational horizon, and dynamic context-sensitive restart
policies for SAT solvers are presented. In both cases, the model is learned offline. In a Reinforcement
Learning [36] setting, algorithm selection can be formulated as a Markov Decision Process: in [26], the
algorithm set includes sequences of recursive algorithms, formed dynamically at run-time solving a sequen-
tial decision problem, and a variation of Q-learning is used to find a dynamic algorithm selection policy;
the resulting technique is per instance, dynamic and online. In [31], a set of deterministic algorithms is
considered, and, under some limitations, static and dynamic schedules are obtained, based on dynamic
programming. In both cases, the method presented is per set, offline.
An approach based on runtime distributions can be found in [10, 11], for parallel independent processes
and shared resources respectively. The runtime distributions are assumed to be known, and the expected
value of a cost function, accounting for both wall-clock time and resources usage, is minimized. A dy-
namic schedule is evaluated offline, using a branch-and-bound algorithm to find the optimal one in a tree
of possible schedules. Examples of allocation to two processes are presented with artificially generated
runtimes, and a real Latin square solver. Unfortunately, the computational complexity of the tree search
grows exponentially in the number of processes.
“Low-knowledge” oblivious approaches can be found in [4, 5], in which various simple indicators of
current solution improvement are used for algorithm selection, in order to achieve the best solution quality
within a given time contract. In [5], the selection process is dynamic: machine time shares are based on
a recency-weighted average of performance improvements. We adopted a similar approach in [13], where
we considered algorithms with a scalar state, that had to reach a target value. The time to solution was
estimated based on a shifting-window linear extrapolation of the learning curves.
For optimisation problems, if selection is only aimed at maximizing solution quality, the same problem
instance can be solved multiple times, keeping only the best solution. In this case, algorithm selection can
be represented as a Max K-armed bandit problem, a variant of the game in which the reward attributed to
each arm is the maximum payoff on a set of rounds. Solvers for this game are used in [9, 35] to implement
oblivious per instance selection from a set of multi-start optimisation techniques: each problem is treated
independently, and multiple runs of the available solvers are allocated, to maximize performance quality.
Further references can be found in [15].
3 Algorithm selection as a bandit problem
In its most basic form [34], the multi-armed bandit problem is faced by a gambler, playing a sequence
of trials against an N-armed slot machine. At each trial, the gambler chooses one of the available arms,
whose losses are randomly generated from different stationary distributions. The gambler incurs the
corresponding loss, and, in the full information game, she can observe the losses that would have been paid
pulling any of the other arms. A more optimistic formulation can be made in terms of positive rewards.
The aim of the game is to minimize the regret, defined as the difference between the cumulative loss of the
gambler, and the one of the best arm. A bandit problem solver (BPS) can be described as a mapping from
the history of the observed losses l_j for each arm j, to a probability distribution p = (p_1, ..., p_N), from
which the choice for the successive trial will be picked.
More recently, the original restricting assumptions have been progressively relaxed, allowing for non-
stationary loss distributions, partial information (only the loss for the pulled arm is observed), and adver-
sarial bandits that can set their losses in order to deceive the player. In [2, 3], a reward game is considered,
and no statistical assumptions are made about the process generating the rewards, which are allowed to be
an arbitrary function of the entire history of the game (non-oblivious adversarial setting). Based on these
pessimistic hypotheses, the authors describe probabilistic gambling strategies for the full and the partial
information games.
Let us now see how to represent algorithm selection for decision problems as a bandit problem, with
the aim of minimizing solution time. Consider a sequence B = {b_1, ..., b_M} of M instances of a decision
problem, for which we want to minimize solution time, and a set of K algorithms A = {a_1, ..., a_K},
such that each b_m can be solved by each a_k. It is straightforward to describe static algorithm selection
in a multi-armed bandit setting, where "pick arm k" means "run algorithm a_k on the next problem instance".
Runtimes t_k can be viewed as losses, generated by a rather complex mechanism, i.e., the algorithms a_k
themselves, running on the current problem. The information is partial, as the runtime for the other algorithms
is not available, unless we decide to solve the same problem instance again. In a worst case scenario one
can receive a "deceptive" problem sequence, starting with problem instances on which the performance of
the algorithms is misleading, so this bandit problem should be considered adversarial. As a BPS typically
minimizes the regret with respect to a single arm, this approach would allow us to implement per set selection
of the overall best algorithm. An example can be found in [16], where we presented an online method for
learning a per set estimate of an optimal restart strategy.
Unfortunately, per set selection is only profitable if one of the algorithms dominates the others on all
problem instances. This is usually not the case: it is often observed in practice that different algorithms
perform better on different problem instances. In this situation, a per instance selection scheme, which can
take a different decision for each problem instance, can have a great advantage.
One possible way of exploiting the nice theoretical properties of a BPS in the context of algorithm
selection, while allowing for the improvement in performance of per instance selection, is to use the BPS
at an upper level, to select among alternative algorithm selection techniques. Consider again the algorithm
selection problem represented by B and A. Introduce a set of N time allocators (TA_j) [13, 15]. Each TA_j
can be an arbitrary function, mapping the current history of collected performance data for each a_k, to a
share s^(j) ∈ [0, 1]^K, with Σ_{k=1}^K s_k = 1. A TA is used to solve a given problem instance executing all
algorithms in A in parallel, on a single machine, whose computational resources are allocated to each a_k
proportionally to the corresponding s_k, such that for any portion of time t spent, s_k t is used by a_k, as in a
static algorithm portfolio [23]. The runtime before a solution is found is then min_k {t_k / s_k}, t_k being the
runtime of algorithm a_k.
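The portfolio runtime above can be sketched in a few lines. This is our own illustration, not code from the paper: given standalone runtimes t_k and a share s, the portfolio finishes as soon as the first algorithm does, at time min_k t_k / s_k.

```python
# Sketch (not the paper's code): wall-clock runtime of a static algorithm
# portfolio. Algorithm k gets a constant CPU fraction s[k], so it finishes
# at time t[k] / s[k]; the portfolio stops at the first solution found.

def portfolio_runtime(t, s):
    """t[k]: standalone runtime of algorithm k (float('inf') if it never
    solves the instance); s[k]: CPU share of algorithm k, sum(s) == 1."""
    assert abs(sum(s) - 1.0) < 1e-9
    return min(tk / sk for tk, sk in zip(t, s) if sk > 0)

# Two algorithms: a_1 solves in 10s, a_2 never terminates (e.g. an
# incomplete solver on an unsatisfiable instance).
print(portfolio_runtime([10.0, float('inf')], [0.5, 0.5]))  # -> 20.0
```

Note how a uniform share pays at most a factor K over the fastest algorithm, even when one algorithm never terminates.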
A trivial example of a TA is the uniform time allocator, assigning a constant s = (1/K, ..., 1/K). Single
algorithm selection can be represented in this framework by setting a single s_k to 1. Dynamic allocators
will produce a time-varying share s(t). In previous work, we presented examples of heuristic oblivious [13]
and non-oblivious [14] allocators; more sound TAs are proposed in [15], based on non-parametric models
of the runtime distributions of the algorithms, which are used to minimize the expected value of solution
time, or a quantile of this quantity, or to maximize solution probability within a given time contract.
At this higher level, one can use a BPS to select among different time allocators TA_1, TA_2, ..., TA_N, working
on the same algorithm set A. In this case, "pick arm j" means "use time allocator TA_j on A to solve the next
problem instance". In the long term, the BPS would allow us to select, on a per set basis, the TA_j that is best
at allocating time to algorithms in A on a per instance basis. The resulting "Gambling" Time Allocator
(GAMBLETA) is described in Alg. 1.
If the BPS allows for non-stationary arms, it can also deal with time allocators that are learning to allocate
time. This is actually the original motivation for adopting this two-level selection scheme, as it allows
us to combine in a principled way the exploration of algorithm behavior, which can be represented by the
uniform time allocator, and the exploitation of this information by a model-based allocator, whose model is
being learned online, based on results on the sequence of problems met so far. If more time allocators are
available, they can be made to compete, using the BPS to explore their performances. Another interesting
feature of this selection scheme is that the initial requirement that each algorithm should be capable of
Algorithm 1 GAMBLETA(A, T, BPS): Gambling Time Allocator.
Algorithm set A with K algorithms;
a set T of N time allocators TA_j;
a bandit problem solver BPS;
M problem instances.
initialize BPS(N, M)
for each problem b_i, i = 1, ..., M do
  pick time allocator I(i) = j with probability p_j(i) from BPS
  solve problem b_i using TA_I on A
  incur loss l_I(i) = min_k {t_k(i) / s_k^(I)}
  update BPS
end for
solving each problem can be relaxed, requiring instead that at least one of the a_k can solve a given b_m, and
that each TA_j can solve each b_m: this can be ensured in practice by imposing s_k > 0 for all a_k. This
allows us to use interesting combinations of complete and incomplete solvers in A (see Sect. 5). Note that any
bound on the regret of the BPS will determine a bound on the regret of GAMBLETA with respect to the
best time allocator. Nothing can be said about the performance w.r.t. the best algorithm. In a worst-case
setting, if none of the time allocators is effective, a bound can still be obtained by including the uniform
share in the set of TAs. In practice, though, per-instance selection can be much more efficient than uniform
allocation, and the literature is full of examples of time allocators which eventually converge to a good performance.
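The loop of Alg. 1 can be sketched as follows. All interfaces here are our own illustration, not the paper's code: `bps` exposes `distribution()`/`update()` (EXP3LIGHT-A would fill this role), each time allocator maps a problem instance to a share over the K algorithms, and `run_portfolio` returns the portfolio runtime min_k t_k / s_k.

```python
import random

class UniformBPS:
    """Placeholder bandit solver: uniform arm choice, ignores losses.
    A real BPS (e.g. EXP3LIGHT-A) would update loss estimates in update()."""
    def __init__(self, n_arms):
        self.n = n_arms
    def distribution(self):
        return [1.0 / self.n] * self.n
    def update(self, arm, loss):
        pass

def gambleta(problems, allocators, bps, run_portfolio):
    total = 0.0
    for x in problems:
        p = bps.distribution()                              # p_j(i) over the N allocators
        j = random.choices(range(len(allocators)), weights=p)[0]
        s = allocators[j](x)                                # share over the K algorithms
        loss = run_portfolio(x, s)                          # observed runtime = loss
        bps.update(j, loss)                                 # partial information: arm j only
        total += loss
    return total

# Toy usage: one uniform allocator over two algorithms whose standalone
# runtimes are given directly as the "instance".
run = lambda t, s: min(tk / sk for tk, sk in zip(t, s) if sk > 0)
print(gambleta([(2.0, 4.0)], [lambda x: (0.5, 0.5)], UniformBPS(1), run))  # -> 4.0
```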
The original version of GAMBLETA (GAMBLETA4 in the following) [15] was based on a more complex
alternative, inspired by the bandit problem with expert advice, as described in [2, 3]. In that setting,
two games are going on in parallel: at a lower level, a partial information game is played, based on the
probability distribution obtained mixing the advice of different experts, represented as probability distributions
on the K arms. The experts can be arbitrary functions of the history of observed rewards, and
give different advice for each trial. At a higher level, a full information game is played, with the N
experts playing the roles of the different arms. The probability distribution p at this level is not used to
pick a single expert, but to mix their advice, in order to generate the distribution for the lower level arms.
In GAMBLETA4, the time allocators play the role of the experts, each suggesting a different s, on a per
instance basis; and the arms of the lower level game are the K algorithms, to be run in parallel with the
mixture share. EXP4 [2, 3] is used as the BPS. Unfortunately, the bounds for EXP4 cannot be extended
to GAMBLETA4 in a straightforward manner, as the loss function itself is not convex; moreover, EXP4
cannot deal with unbounded losses, so we had to adopt a heuristic reward attribution instead of using the
plain runtimes.
A common issue of the above approaches is the difficulty of setting reasonable upper bounds on the
time required by the algorithms. This renders a straightforward application of most BPS problematic, as a
known bound on losses is usually assumed, and used to tune parameters of the solver. Underestimating this
bound can invalidate the bounds on regret, while overestimating it can produce an excessively "cautious"
algorithm, with a poor performance. Setting a good bound in advance is particularly difficult when dealing
with algorithm runtimes, which can easily exhibit variations of several orders of magnitude among different
problem instances, or even among different runs on the same instance [20].
Some interesting results regarding games with unbounded losses have recently been obtained. In [7, 8],
the authors consider a full information game, and provide two algorithms which can adapt to unknown
bounds on signed rewards. Based on this work, [1] provides a Hannan consistent algorithm for losses whose
bound grows in the number of trials i with a known rate i^ν, ν < 1/2. This latter hypothesis does not fit
our situation well, as we would like to avoid any restriction on the sequence of problems: a very hard instance
can be met first, followed by an easy one. In this sense, the hypothesis of a constant, but unknown, bound is
more suited. In [7], Cesa-Bianchi et al. also introduce an algorithm for loss games with partial information
(EXP3LIGHT), which requires losses to be bounded, and is particularly effective when the cumulative loss of
the best arm is small. In the next section we introduce a variation of this algorithm that allows it to deal
with an unknown bound on losses.
4 An algorithm for games with an unknown bound on losses
Here and in the following, we consider a partial information game with N arms and M trials; an index (i)
indicates the value of a quantity used or observed at trial i ∈ {1, ..., M}; an index j indicates quantities related to
the j-th arm, j ∈ {1, ..., N}; the index E refers to the loss incurred by the bandit problem solver, and I(i)
indicates the arm chosen at trial (i), so it is a discrete random variable with values in {1, ..., N}; r and u
represent quantities related to an epoch of the game, which consists of a sequence of 0 or more consecutive
trials; log with no index is the natural logarithm.
EXP3LIGHT [7, Sec. 4] is a solver for the bandit loss game with partial information. It is a modified
version of the weighted majority algorithm [29], in which the cumulative losses for each arm are obtained
through an unbiased estimate¹. The game consists of a sequence of epochs r = 0, 1, ...: in each epoch,
the probability distribution over the arms is updated, proportional to exp(−η_r L̃_j), L̃_j being the current
unbiased estimate of the cumulative loss of arm j. Assuming an upper bound 4^r on the smallest loss estimate, η_r is
set as:

η_r = √( 2 (log N + N log M) / (N 4^r) )    (1)

When this bound is first exceeded, a new epoch starts and r and η_r are updated accordingly.
The original algorithm assumes losses in [0, 1]. We first consider a game with a known finite bound
L on losses, and introduce a slightly modified version of EXP3LIGHT (Algorithm 2), obtained simply by
dividing all losses by L. Based on Theorem 5 from [7], it is easy to prove the following

Theorem 1. If L*(M) is the loss of the best arm after M trials, and L_E(M) = Σ_{i=1}^M l_I(i)(i) is the loss
of EXP3LIGHT(N, M, L), the expected value of its regret is bounded as:

E{L_E(M)} − L*(M) ≤ 2 √( 6 L (log N + N log M) N L*(M) )
  + L [ 2 √( 2 L (log N + N log M) N ) + (2N + 1)(1 + log_4(3M + 1)) ]    (2)

The proof is trivial, and is given in the appendix.
We now introduce a simple variation of Algorithm 2 which does not require knowledge of the
bound L on losses, and uses Algorithm 2 as a subroutine. EXP3LIGHT-A (Algorithm 3) is inspired by
the doubling trick used in [7] for a full information game with an unknown bound on losses. The game is
again organized in a sequence of epochs u = 0, 1, ...: in each epoch, Algorithm 2 is restarted using a
bound L_u = 2^u; a new epoch is started with the appropriate u whenever a loss larger than the current L_u
is observed.
¹For a given round, and a given arm with loss l and pull probability p, the estimated loss l̃ is l/p if the arm is pulled, 0 otherwise.
This estimate is unbiased in the sense that its expected value, with respect to the process extracting the arm to be pulled, equals the
actual value of the loss: E{l̃} = p (l/p) + (1 − p) 0 = l.
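The unbiasedness in the footnote is easy to check numerically. The following Monte Carlo check is our own illustration: with pull probability p and true loss l, the importance-weighted estimate averages out to l.

```python
import random

# Monte Carlo check of the footnote: the importance-weighted estimate is
# l/p when the arm is pulled (probability p) and 0 otherwise, so its
# expectation is p*(l/p) + (1-p)*0 = l.

random.seed(0)
p, l, n = 0.3, 5.0, 200_000
estimate = sum((l / p) if random.random() < p else 0.0 for _ in range(n)) / n
print(estimate)  # close to the true loss l = 5.0
```

The price of unbiasedness is variance: the estimate is l/p with small p, which is exactly why EXP3LIGHT needs its epoch scheme to control the estimated cumulative losses.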
Algorithm 2 EXP3LIGHT(N, M, L): A solver for bandit problems with partial information and a known
bound L on losses.
N arms, M trials
losses l_j(i) ∈ [0, L], i = 1, ..., M, j = 1, ..., N
initialize epoch r = 0, L_E = 0, L̃_j(0) = 0
initialize η_r according to (1)
for each trial i = 1, ..., M do
  set p_j(i) ∝ exp(−η_r L̃_j(i − 1)), with Σ_{j=1}^N p_j(i) = 1
  pick arm I(i) = j with probability p_j(i)
  incur loss l_E(i) = l_I(i)(i)
  evaluate unbiased loss estimates:
    l̃_I(i)(i) = l_I(i)(i) / p_I(i)(i), l̃_j(i) = 0 for j ≠ I(i)
  update cumulative losses:
    L_E(i) = L_E(i − 1) + l_E(i)
    L̃_j(i) = L̃_j(i − 1) + l̃_j(i), for j = 1, ..., N
    L̃*(i) = min_j L̃_j(i)
  if L̃*(i) > 4^r then
    start next epoch r = ⌈log_4(L̃*(i))⌉
    update η_r according to (1)
  end if
end for
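A compact sketch of Algorithm 2 follows. This is our own illustration, not the paper's code, and the exact constant in the learning rate (1) is an assumption here; what matters is the mechanism: exponential weights over unbiased cumulative loss estimates, with the rate recomputed whenever the smallest estimate exceeds 4^r.

```python
import math

class Exp3Light:
    """Sketch of EXP3LIGHT (Alg. 2); the learning-rate constant is assumed."""
    def __init__(self, n_arms, horizon, bound):
        self.n, self.m, self.bound = n_arms, horizon, bound
        self.est = [0.0] * n_arms      # unbiased cumulative loss estimates (normalized by bound)
        self.r = 0                     # epoch index
        self._set_eta()

    def _set_eta(self):
        c = math.log(self.n) + self.n * math.log(self.m)
        self.eta = math.sqrt(2.0 * c / (self.n * 4 ** self.r))

    def distribution(self):
        lo = min(self.est)             # shift by the minimum for numerical stability
        w = [math.exp(-self.eta * (e - lo)) for e in self.est]
        z = sum(w)
        return [x / z for x in w]

    def update(self, arm, loss):
        p = self.distribution()[arm]
        self.est[arm] += (loss / self.bound) / p   # importance-weighted estimate
        star = min(self.est)
        if star > 4 ** self.r:                     # bound trespassed: new epoch
            self.r = math.ceil(math.log(star, 4))
            self._set_eta()

solver = Exp3Light(n_arms=2, horizon=100, bound=1.0)
for _ in range(3):
    solver.update(0, 0.1)              # arm 0 keeps losing
print(solver.distribution())           # probability mass shifts towards arm 1
```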
Algorithm 3 EXP3LIGHT-A(N, M): A solver for bandit problems with partial information and an unknown
(but finite) bound on losses.
N arms, M trials
losses l_j(i) ∈ [0, L_∞], i = 1, ..., M, j = 1, ..., N
unknown L_∞ < ∞
initialize epoch u = 0, EXP3LIGHT(N, M, 2^u)
for each trial i = 1, ..., M do
  pick arm I(i) = j with probability p_j(i) from EXP3LIGHT
  incur loss l_E(i) = l_I(i)(i)
  if l_I(i)(i) > 2^u then
    start next epoch u = ⌈log_2 l_I(i)(i)⌉
    restart EXP3LIGHT(N, M − i, 2^u)
  end if
end for
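The doubling scheme of Algorithm 3 can be isolated in a one-line rule (our sketch): keep a guessed bound 2^u, and when a loss exceeds it, jump straight to the smallest power of two covering the observed loss, at which point the inner EXP3LIGHT instance is restarted with the new bound and the remaining horizon.

```python
import math

# Epoch update of EXP3LIGHT-A: unchanged while the guess 2^u covers the
# observed loss, otherwise the smallest u' with 2^u' >= loss.

def next_epoch(u, loss):
    if loss <= 2 ** u:
        return u
    return math.ceil(math.log2(loss))

print(next_epoch(0, 0.7))   # -> 0  (loss within the initial bound 2^0 = 1)
print(next_epoch(0, 40.0))  # -> 6  (2^6 = 64 is the first bound >= 40)
```

Note that the epoch index can jump by several units at once, so at most ⌈log_2 L_∞⌉ restarts occur over the whole game, which is where the logarithmic factor in Theorem 2 comes from.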
Theorem 2. If L*(M) is the loss of the best arm after M trials, and L_∞ < ∞ is the unknown bound on
losses, the expected value of the regret of EXP3LIGHT-A(N, M) is bounded as:

E{L_E(M)} − L*(M) ≤ 4 √( 3 ⌈log_2 L_∞⌉ L_∞ (log N + N log M) N L*(M) )
  + 2 ⌈log_2 L_∞⌉ L_∞ [ √( 4 L_∞ (log N + N log M) N )
  + (2N + 1)(1 + log_4(3M + 1)) + 2 ]    (3)
The proof is given in the appendix. The regret obtained by EXP3LIGHT-A is O(√(L_∞ N log M L*(M))),
which can be useful in a situation in which L_∞ is high but L*(M) is relatively small, as we expect in our time
allocation setting if the algorithms exhibit huge variations in runtime, but at least one of the TAs eventually
converges to a good performance. We can then use EXP3LIGHT-A as a BPS for selecting among different
time allocators in GAMBLETA (Algorithm 1).
5 Experiments
The set of time allocators used in the following experiments is the same as in [15], and includes the uniform
allocator, along with nine other dynamic allocators, optimizing different quantiles of runtime, based on
a nonparametric model of the runtime distribution that is updated after each problem is solved. We first
briefly describe these time allocators, inviting the reader to refer to [15] for further details and a deeper
discussion. A separate model F_k(t|x), conditioned on features x of the problem instance, is used for each
algorithm a_k. Based on these models, the runtime distribution for the whole algorithm portfolio A can be
evaluated for an arbitrary share s ∈ [0, 1]^K, with Σ_{k=1}^K s_k = 1, as

F_{A,s}(t) = 1 − Π_{k=1}^K [1 − F_k(s_k t)].    (4)

Eq. (4) can be used to evaluate a quantile t_{A,s}(α) = F_{A,s}^{−1}(α) for a given solution probability α. Fixing
this value, time is allocated using the share that minimizes the quantile

s = arg min_s t_{A,s}(α).

Compared to minimizing expected runtime, this time allocator has the advantage of being applicable even
when the runtime distributions are improper, i.e., F(∞) < 1, as in the case of incomplete solvers. A
dynamic version of this time allocator is obtained updating the share value periodically, conditioning each
F_k on the time spent so far by the corresponding a_k.

Rather than fixing an arbitrary α, we used nine different instances of this time allocator, with α ranging
from 0.1 to 0.9, in addition to the uniform allocator, and let the BPS select the best one.
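The quantile allocator can be illustrated on a toy portfolio. This is our own sketch: exponential runtime distributions F_k(t) = 1 − exp(−rate_k t) stand in for the paper's nonparametric conditional models, and each candidate share s is scored by the α-quantile of the portfolio runtime, obtained by inverting Eq. (4) numerically.

```python
import math

def portfolio_cdf(t, s, rates):
    # Eq. (4): the portfolio survival is the product of the single-algorithm
    # survivals evaluated at s_k * t (exponential case: exp(-rate_k * s_k * t)).
    surv = 1.0
    for sk, rk in zip(s, rates):
        surv *= math.exp(-rk * sk * t)
    return 1.0 - surv

def quantile(alpha, s, rates, hi=1e6):
    lo = 0.0                             # bisection: the CDF is monotone in t
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if portfolio_cdf(mid, s, rates) < alpha:
            lo = mid
        else:
            hi = mid
    return hi

# Score a few candidate shares and keep the one with the smallest median.
shares = [(0.1, 0.9), (0.5, 0.5), (0.9, 0.1)]
best = min(shares, key=lambda s: quantile(0.5, s, rates=(2.0, 0.1)))
print(best)  # -> (0.9, 0.1): most CPU to the fast solver (rate 2.0)
```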
We present experiments for the algorithm selection scenario from [15], in which a local search and a
complete SAT solver (respectively, G2-WSAT [28] and Satz-Rand [20]) are combined to solve a sequence
of random satisfiable and unsatisfiable problems (benchmarks uf-*, uu-* from [21], 1899 instances in
total). As the clauses-to-variables ratio is fixed in this benchmark, only the number of variables, ranging
from 20 to 250, was used as a problem feature x. Local search algorithms are more efficient on satisfiable
instances, but cannot prove unsatisfiability, so they are doomed to run forever on unsatisfiable instances;
complete solvers are instead guaranteed to terminate their execution on all instances, as they can also prove
unsatisfiability.
For the whole problem sequence, the overhead of GAMBLETA3 (Algorithm 1, using EXP3LIGHT-A
as the BPS) over an ideal "oracle", which can predict and run only the fastest algorithm, is 22%. GAMBLETA4 (from [15], based on EXP4) seems to profit from the mixing of time allocation shares, obtaining
[Figure 1: panel (a) cumulative time [cycles] and panel (b) cumulative overhead, both plotted against the task sequence.]
Figure 1: (a): Cumulative time spent by GAMBLETA3 and GAMBLETA4 [15] on the SAT-UNSAT problem set
(1091 min.). Upper 95% confidence bounds on 20 runs, with random reordering of the problems. ORACLE is the
lower bound on performance. UNIFORM is the (0.5, 0.5) share. SATZ-RAND is the per-set best algorithm. (b): The
evolution of cumulative overhead, defined as (Σ_j t_G(j) − Σ_j t_O(j)) / Σ_j t_O(j), where t_G is the performance of
GAMBLETA and t_O is the performance of the oracle. Dotted lines represent 95% confidence bounds.
a better 14%. Satz-Rand alone can solve all the problems, but with an overhead of about 40% w.r.t. the
oracle, due to its poor performance on satisfiable instances. Fig. 1 plots the evolution of cumulative time,
and cumulative overhead, along the problem sequence.
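The overhead figures quoted above follow the definition in the caption of Fig. 1. A minimal sketch, with toy numbers of our own for illustration only:

```python
# Cumulative overhead, as defined in the caption of Fig. 1:
# (sum_j t_G(j) - sum_j t_O(j)) / sum_j t_O(j).

def overhead(t_gambleta, t_oracle):
    g, o = sum(t_gambleta), sum(t_oracle)
    return (g - o) / o

print(overhead([12.0, 6.0, 12.4], [10.0, 5.0, 10.0]))  # ~0.216, i.e. ~22%
```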
6 Conclusions
We presented a bandit problem solver for loss games with partial information and an unknown bound on
losses. The solver represents an ideal plug-in for our algorithm selection method GAMBLETA, avoiding
the need to set any additional parameter. The choice of the algorithm set and time allocators to use is still
left to the user. Any existing selection technique, including oblivious ones, can be included in the set of N
allocators, with an impact O(N)on the regret: the overall performance of GAMBLETA will converge to
the one of the best time allocator. Preliminary experiments showed a degradation in performance compared
to the heuristic version presented in [15], which requires to set in advance a maximum runtime, and cannot
be provided of a bound on regret.
According to [6], a bound for the original EXP3LIGHT can be proved for an adaptive η_r (1), in which
the total number of trials M is replaced by the current trial i. This should allow for a potentially more
efficient variation of EXP3LIGHT-A, in which EXP3LIGHT is not restarted at each epoch, and can retain
the information on past losses.
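The restart-per-epoch mechanism discussed above can be sketched in a few lines. The following Python code is a schematic illustration under stated simplifying assumptions (an Exp3-style exponential-weights update, a single illustrative learning rate per epoch, and a full restart whenever a loss exceeds the current guess 2^u of the bound); it is not the exact EXP3LIGHT-A solver analyzed in this paper:

```python
import math
import random

class Exp3LightEpochs:
    """Schematic sketch (NOT the paper's exact EXP3LIGHT-A): an Exp3-style
    bandit that guesses a loss bound 2**u and restarts with a doubled bound
    whenever an observed loss exceeds it. The learning rate below is an
    illustrative assumption, not the tuned value from the paper."""

    def __init__(self, n_arms, horizon):
        self.n, self.M = n_arms, horizon
        self.u = 0                      # current epoch: losses assumed <= 2**u
        self._restart()

    def _restart(self):
        self.weights = [1.0] * self.n   # past losses are forgotten on restart
        self.eta = math.sqrt(math.log(self.n) / (self.n * self.M))

    def choose(self):
        """Sample an arm proportionally to its weight."""
        total = sum(self.weights)
        r, acc = random.random() * total, 0.0
        for i, w in enumerate(self.weights):
            acc += w
            if r <= acc:
                return i
        return self.n - 1

    def update(self, arm, loss):
        bound = 2 ** self.u
        if loss > bound:
            # Epoch ends: raise the bound past the observed loss and restart;
            # the last trial of the epoch is discarded, as in bound (11).
            while loss > 2 ** self.u:
                self.u += 1
            self._restart()
            return
        prob = self.weights[arm] / sum(self.weights)
        est = (loss / bound) / prob     # importance-weighted normalized loss
        self.weights[arm] *= math.exp(-self.eta * est)
```

On each new epoch the weights are reset, which is exactly the loss of information that an adaptive η_r would avoid.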
One potential advantage of offline selection methods is that the initial training phase can be easily
parallelized, distributing the workload on a cluster of machines. Ongoing research aims at extending
GAMBLETA to allocate multiple CPUs in parallel, in order to obtain a fully distributed algorithm selection
framework [17].
Acknowledgments. We would like to thank Nicolò Cesa-Bianchi for contributing the proofs for
EXP3LIGHT and useful remarks on his work, and Faustino Gomez for his comments on a draft of this
paper. This work was supported by the Hasler foundation with grant n. 2244.
References
[1] Chamy Allenberg, Peter Auer, László Györfi, and György Ottucsák. Hannan consistency in on-line
learning in case of unbounded losses under partial monitoring. In José L. Balcázar et al., editors, ALT,
volume 4264 of Lecture Notes in Computer Science, pages 229–243. Springer, 2006.
[2] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. Gambling in a rigged casino:
the adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations
of Computer Science, pages 322–331. IEEE Computer Society Press, Los Alamitos, CA, 1995.
[3] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic
multi-armed bandit problem. SIAM J. Comput., 32(1):48–77, 2003.
[4] Christopher J. Beck and Eugene C. Freuder. Simple rules for low-knowledge algorithm selection. In
CPAIOR, pages 50–64, 2004.
[5] T. Carchrae and J. C. Beck. Applying machine learning to low knowledge control of optimization
algorithms. Computational Intelligence, 21(4):373–387, 2005.
[6] Nicolò Cesa-Bianchi. Personal communication, 2008.
[7] Nicolò Cesa-Bianchi, Yishay Mansour, and Gilles Stoltz. Improved second-order bounds for prediction
with expert advice. In Peter Auer and Ron Meir, editors, COLT, volume 3559 of Lecture Notes in
Computer Science, pages 217–232. Springer, 2005.
[8] Nicolò Cesa-Bianchi, Yishay Mansour, and Gilles Stoltz. Improved second-order bounds for prediction
with expert advice. Machine Learning, 66(2-3):321–352, March 2007.
[9] Vincent A. Cicirello and Stephen F. Smith. The max k-armed bandit: A new model of exploration
applied to search heuristic selection. In Twentieth National Conference on Artificial Intelligence,
pages 1355–1361. AAAI Press, 2005.
[10] Lev Finkelstein, Shaul Markovitch, and Ehud Rivlin. Optimal schedules for parallelizing anytime
algorithms: the case of independent processes. In Eighteenth National Conference on Artificial
Intelligence, pages 719–724, Menlo Park, CA, USA, 2002. AAAI Press.
[11] Lev Finkelstein, Shaul Markovitch, and Ehud Rivlin. Optimal schedules for parallelizing anytime
algorithms: The case of shared resources. Journal of Artificial Intelligence Research, 19:73–138, 2003.
[12] Johannes Fürnkranz. On-line bibliography on meta-learning, 2001. EU ESPRIT METAL Project
(26.357): A Meta-Learning Assistant for Providing User Support in Machine Learning and Data Mining.
[13] M. Gagliolo, V. Zhumatiy, and J. Schmidhuber. Adaptive online time allocation to search algorithms.
In J.F. Boulicaut et al., editors, Machine Learning: ECML 2004. Proceedings of the 15th European
Conference on Machine Learning, Pisa, Italy, September 20-24, 2004, pages 134–143. Springer, 2004.
[14] Matteo Gagliolo and Jürgen Schmidhuber. A neural network model for inter-problem adaptive online
time allocation. In Włodzisław Duch et al., editors, Artificial Neural Networks: Formal Models and
Their Applications - ICANN 2005 Proceedings, Part 2, pages 7–12. Springer, September 2005.
[15] Matteo Gagliolo and Jürgen Schmidhuber. Learning dynamic algorithm portfolios. Annals of
Mathematics and Artificial Intelligence, 47(3–4):295–328, August 2006. AI&MATH 2006 Special Issue.
[16] Matteo Gagliolo and Jürgen Schmidhuber. Learning restart strategies. In Manuela M. Veloso, editor,
IJCAI 2007 — Twentieth International Joint Conference on Artificial Intelligence, vol. 1, pages 792–
797. AAAI Press, January 2007.
[17] Matteo Gagliolo and Jürgen Schmidhuber. Towards distributed algorithm portfolios. In DCAI 2008
— International Symposium on Distributed Computing and Artificial Intelligence, Advances in Soft
Computing. Springer, 2008. To appear.
[18] Christophe Giraud-Carrier, Ricardo Vilalta, and Pavel Brazdil. Introduction to the special issue on
meta-learning. Machine Learning, 54(3):187–193, 2004.
[19] Carla P. Gomes and Bart Selman. Algorithm portfolios. Artificial Intelligence, 126(1–2):43–62, 2001.
[20] Carla P. Gomes, Bart Selman, Nuno Crato, and Henry Kautz. Heavy-tailed phenomena in satisfiability
and constraint satisfaction problems. J. Autom. Reason., 24(1-2):67–100, 2000.
[21] H. H. Hoos and T. Stützle. SATLIB: An Online Resource for Research on SAT. In I. P. Gent et al.,
editors, SAT 2000, pages 283–292, 2000.
[22] Holger H. Hoos and Thomas Stützle. Local search algorithms for SAT: An empirical evaluation.
Journal of Automated Reasoning, 24(4):421–481, 2000.
[23] B. A. Huberman, R. M. Lukose, and T. Hogg. An economic approach to hard computational problems.
Science, 275:51–54, 1997.
[24] Frank Hutter and Youssef Hamadi. Parameter adjustment based on performance prediction: Towards
an instance-aware problem solver. Technical Report MSR-TR-2005-125, Microsoft Research,
Cambridge, UK, December 2005.
[25] Henry A. Kautz, Eric Horvitz, Yongshao Ruan, Carla P. Gomes, and Bart Selman. Dynamic restart
policies. In AAAI/IAAI, pages 674–681, 2002.
[26] Michail G. Lagoudakis and Michael L. Littman. Algorithm selection using reinforcement learning.
In Proc. 17th ICML, pages 511–518. Morgan Kaufmann, 2000.
[27] Kevin Leyton-Brown, Eugene Nudelman, and Yoav Shoham. Learning the empirical hardness of
optimization problems: The case of combinatorial auctions. In ICCP: International Conference on
Constraint Programming (CP), LNCS, 2002.
[28] Chu Min Li and Wenqi Huang. Diversification and determinism in local search for satisfiability. In
SAT2005, pages 158–172. Springer, 2005.
[29] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Inf. Comput.,
108(2):212–261, 1994.
[30] Eugene Nudelman, Kevin Leyton-Brown, Holger H. Hoos, Alex Devkar, and Yoav Shoham.
Understanding random SAT: Beyond the clauses-to-variables ratio. In CP, pages 438–452, 2004.
[31] Marek Petrik. Statistically optimal combination of algorithms. Presented at SOFSEM 2005 - 31st
Annual Conference on Current Trends in Theory and Practice of Informatics, 2005.
[32] Marek Petrik and Shlomo Zilberstein. Learning static parallel portfolios of algorithms. In Ninth
International Symposium on Artificial Intelligence and Mathematics, 2006.
[33] J. R. Rice. The algorithm selection problem. In Morris Rubinoff and Marshall C. Yovits, editors,
Advances in computers, volume 15, pages 65–118. Academic Press, New York, 1976.
[34] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the AMS, 58:527–535, 1952.
[35] Matthew J. Streeter and Stephen F. Smith. An asymptotically optimal algorithm for the max k-armed
bandit problem. In Twenty-First National Conference on Artificial Intelligence. AAAI Press, 2006.
[36] R. Sutton and A. Barto. Reinforcement learning: An introduction. Cambridge, MA, MIT Press, 1998.
[37] Ricardo Vilalta and Youssef Drissi. A perspective view and survey of meta-learning. Artif. Intell.
Rev., 18(2):77–95, 2002.
A.1 Proof of Theorem 1

The proof is trivially based on the regret for the original EXP3LIGHT, with L = 1, which according to [7,
Theorem 5] (proof obtained from [6]) can be evaluated using the optimal values (1) for η_r:

E\{L_E(M)\} - L^*(M) \le 2\sqrt{2(\log N + N\log M)\,N\,(1 + 3L^*(M))} + (2N+1)(1 + \log_4(3M+1)).   (6)

As we are playing the same game with all losses normalized by L, the following will hold for Alg. 2:

(E\{L_E(M)\} - L^*(M))/L \le 2\sqrt{2(\log N + N\log M)\,N\,(1 + 3L^*(M)/L)} + (2N+1)(1 + \log_4(3M+1)).   (8, 9)

Multiplying both sides by L and rearranging produces (2).
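For completeness, the rescaling step can be written out: applying the bound to the normalized losses l_j(i)/L ∈ [0, 1] and multiplying back by L gives

```latex
% Rescaling argument, written out: apply the L=1 bound to the normalized game
% with losses l'_j(i) = l_j(i)/L \in [0,1], then multiply both sides by L.
\begin{align*}
\frac{1}{L}\Big( E\{L_E(M)\} - L^*(M) \Big)
  &\le 2\sqrt{2(\log N + N\log M)\,N\,\big(1 + 3L^*(M)/L\big)}
     + (2N+1)\big(1+\log_4(3M+1)\big), \\
E\{L_E(M)\} - L^*(M)
  &\le 2\sqrt{2(\log N + N\log M)\,N\,\big(L^2 + 3L\,L^*(M)\big)}
     + L\,(2N+1)\big(1+\log_4(3M+1)\big),
\end{align*}
```

which is then rearranged into the form of (2).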
A.2 Proof of Theorem 2

This follows the proof technique employed in [7, Theorem 4]. Let i_u be the last trial of epoch u, i.e., the first
trial at which a loss l_{I(i)}(i) > 2^u is observed. Write the cumulative loss during an epoch u, excluding its
last trial i_u, as L^{(u)} = \sum_{i=i_{u-1}+1}^{i_u-1} l(i), and let L^{*(u)} = \min_j \sum_{i=i_{u-1}+1}^{i_u-1} l_j(i) indicate the optimal loss for
this subset of trials. Let U = u(M) be the a priori unknown epoch at the last trial. In each epoch u, the bound
(2) holds with L_u = 2^u for all trials except the last one i_u, so, noting that \log(M_i) \le \log(M), we can write

E\{L_E^{(u)}\} - L^{*(u)} \le 2\sqrt{6 L_u (\log N + N\log M)\,N\,L^{*(u)}} + L_u \big[2\sqrt{2 L_u (\log N + N\log M)\,N} + (2N+1)(1 + \log_4(3M+1))\big].   (10)

The loss for trial i_u can only be bounded by the next value of L_u, evaluated a posteriori:

E\{l_E(i_u)\} - l^*(i_u) \le L_{u+1},   (11)

where l^*(i) = \min_j l_j(i) indicates the optimal loss at trial i.

Combining (10, 11), and writing i_{-1} = 0, i_U = M, we obtain the regret for the whole game:²

E\{L_E(M)\} - L^*(M) \le \sum_{u=0}^{U} \Big\{2\sqrt{6 L_u (\log N + N\log M)\,N\,L^{*(u)}} + L_u \big[2\sqrt{2 L_u (\log N + N\log M)\,N} + (2N+1)(1 + \log_4(3M+1))\big]\Big\} + \sum_{u=0}^{U} L_{u+1}.   (12)

The first term on the right hand side of (12) can be bounded using Jensen's inequality,
\sum_{u=0}^{U} \sqrt{a_u} \le \sqrt{(U+1)\sum_{u=0}^{U} a_u}, with

a_u = 24 L_u (\log N + N\log M)\,N\,L^{*(u)} \le 24 L_{U+1} (\log N + N\log M)\,N\,L^{*(u)}.   (13)

The other terms do not depend on the optimal losses L^{*(u)}, and can also be bounded noting that L_u \le L_{U+1}.

We now have to bound the number of epochs U. This can be done noting that the maximum observed
loss cannot be larger than the unknown, but finite, bound L, and that

U + 1 = \lceil \log_2 \max_i l_{I(i)}(i) \rceil \le \lceil \log_2 L \rceil,   (14)

which implies

L_{U+1} = 2^{U+1} \le 2L.   (15)

In this way we can bound the sum

\sum_{u=0}^{U} L_{u+1} = \sum_{u=0}^{U} 2^{u+1} \le 2^{1+\lceil \log_2 L \rceil} \le 4L.   (16)

We conclude by noting that

L^*(M) = \min_j L_j(M) \ge \sum_{u=0}^{U} L^{*(u)}.   (17)

Inequality (12) then becomes:

E\{L_E(M)\} - L^*(M) \le 2\sqrt{6(U+1) L_{U+1} (\log N + N\log M)\,N\,L^*(M)} + (U+1) L_{U+1} \big[2\sqrt{2 L_{U+1} (\log N + N\log M)\,N} + (2N+1)(1 + \log_4(3M+1))\big] + 4L.

Plugging in (14, 15) and rearranging we obtain (3).

² Note that all cumulative losses are counted from trial i_{u-1}+1 to trial i_u - 1. If an epoch ends on its first trial, (10) is zero, and
(11) holds. Writing i_U = M implies the worst-case hypothesis that the bound L_U is exceeded on the last trial. Epoch numbers u are
increasing, but not necessarily consecutive: in this case the terms related to the missing epochs are 0.
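The counting argument in (14)–(16) is easy to check numerically. The following Python sketch (illustrative only; the loss values are arbitrary) verifies that, for any maximal observed loss below the bound L, the epoch count and the sum of epoch bounds behave as stated:

```python
import math

def epoch_bounds(max_observed_loss, L):
    """Check the counting argument of (14)-(16): with epochs ending when a loss
    exceeds 2**u, U+1 = ceil(log2(max loss)) <= ceil(log2(L)), and the sum of
    the epoch bounds L_{u+1} = 2**(u+1) is at most 4L."""
    assert max_observed_loss <= L
    U_plus_1 = math.ceil(math.log2(max_observed_loss))        # (14)
    assert U_plus_1 <= math.ceil(math.log2(L))
    total = sum(2 ** (u + 1) for u in range(U_plus_1))        # sum of L_{u+1}
    assert total <= 4 * L                                     # (16)
    return U_plus_1, total

print(epoch_bounds(13.0, 20.0))  # epochs 0..3, bound sum 2+4+8+16 = 30 <= 80
```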
... Finally, as we take the online algorithm selection problem as a running example for our setting, it is worth mentioning that bandit-based approaches have been already considered for this problem (Gagliolo and Schmidhuber 2007;Gagliolo and Schmidhuber 2010;Degroote 2017;Degroote et al. 2018;Tornede et al. 2022). However, these focus on certain algorithmic problem classes, such as the boolean satisfiability problem (SAT) or the quantified boolean formula problem (QBF). ...
Full-text available
We consider a resource-aware variant of the classical multi-armed bandit problem: In each round, the learner selects an arm and determines a resource limit. It then observes a corresponding (random) reward, provided the (random) amount of consumed resources remains below the limit. Otherwise, the observation is censored, i.e., no reward is obtained. For this problem setting, we introduce a measure of regret, which incorporates both the actual amount of consumed resources of each learning round and the optimality of realizable rewards as well as the risk of exceeding the allocated resource limit. Thus, to minimize regret, the learner needs to set a resource limit and choose an arm in such a way that the chance to realize a high reward within the predefined resource limit is high, while the resource limit itself should be kept as low as possible. We propose a UCB-inspired online learning algorithm, which we analyze theoretically in terms of its regret upper bound. In a simulation study, we show that our learning algorithm outperforms straightforward extensions of standard multi-armed bandit algorithms.
... In Gagliolo and Schmidhuber [2010], the authors propose GAMBLETA, a bandit method to select an optimal algorithm from a portfolio of SAT solvers. The specificity of this method is to leverage contextual information [Auer et al. 2002b] for the bandit algorithm to transfer knowledge across a set of SAT problems. ...
This thesis proposes three main contributions to advance the state-of-the-art of AutoML approaches. They are divided into two research directions: optimization (first contribution) and meta-learning (second and third contributions). The first contribution is a hybrid optimization algorithm, dubbed Mosaic, leveraging Monte-Carlo Tree Search and Bayesian Optimization to address the selection of algorithms and the tuning of hyper-parameters, respectively. The empirical assessment of the proposed approach shows its merits compared to Auto-sklearn and TPOT AutoML systems on OpenML 100. The second contribution introduces a novel neural network architecture, termed Dida, to learn a good representation of datasets (i.e., metafeatures) from scratch while enforcing invariances w.r.t features and rows permutations. Two proofof-concept tasks (patch classification and performance prediction tasks) are considered. The proposed approach yields superior empirical performance compared to Dataset2Vec and DSS on both tasks. The third contribution addresses the limitation of Dida on handling standard dataset benchmarks. The proposed approach, called Metabu, relies on hand-crafted meta-features. The novelty of Metabu is two-fold: i) defining an "oracle" topology of datasets based on top-performing hyper-parameters; ii) leveraging Optimal Transport approach to align a mapping of the handcrafted meta-features with the oracle topology. The empirical results suggest that Metabu metafeature outperforms the baseline hand-cr afted meta-features on three different tasks (assessing meta-features based topology, recommending hyper-parameters w.r.t topology, and warmstarting optimization algorithms).
... Multi-Armed Bandits (MAB) problem, as a classical RL algorithm, has also been considered in the context of optimization [30]. The algorithm selection problem can be viewed as an MAB problem [17,31] where the goal is to let the agent find the best algorithm via interactive trials based on the feedback from the optimization and its learning curve [32]. MABs have been used to search for the operators in the Evolutionary Algorithms (EA) [33,34] where different types of problems and reward definitions are analyzed to select the best operators [35]. ...
Metaheuristic algorithms are derivative-free optimizers designed to estimate the global optima for optimization problems. Keeping balance between exploitation and exploration and the performance complementarity between the algorithms have led to the introduction of quite a few metaheuristic methods. In this work, we propose a framework based on Multi-Armed Bandits (MAB) problem, which is a classical Reinforcement Learning (RL) method, to intelligently select a suitable optimizer for each optimization problem during the optimization process. This online algorithm selection technique leverages on the convergence behavior of the algorithms to find the right balance of exploration-exploitation by choosing the update rule of the algorithm with the most estimated improvement in the solution. By performing experiments with three armed-bandits being Harris Hawks Optimizer (HHO), Differential Evolution (DE), and Whale Optimization Algorithm (WOA), we show that the MAB Optimizer Selection (named as MAB-OS) framework has the best overall performance on different types of fitness landscapes in terms of both convergence rate and the final solution. The data and codes used for this work are available at:
... In machine learning, the MAB formulation can be used to find a set of hyperparameters to increase the performance of the learning process [4]. This has been extended further by applying MAB to algorithm selection, where a learner searches for a high-performing algorithm to use for training [5]. For dynamic pricing, when selling a set of products, the price needs to be set according to the current demand in order to maximize profit. ...
Full-text available
The stochastic multi-armed bandit has provided a framework for studying decision-making in unknown environments. We propose a variant of the stochastic multi-armed bandit where the rewards are sampled from a stochastic linear dynamical system. The proposed strategy for this stochastic multi-armed bandit variant is to learn a model of the dynamical system while choosing the optimal action based on the learned model. Motivated by mathematical finance areas such as Intertemporal Capital Asset Pricing Model proposed by Merton and Stochastic Portfolio Theory proposed by Fernholz that both model asset returns with stochastic differential equations, this strategy is applied to quantitative finance as a high-frequency trading strategy, where the goal is to maximize returns within a time period.
... The application presented in Chapter 6 on page 91 is different from most of the proposed techniques as it offers the possibility to perform selection on continuously incoming data streams, which currently only a few works consider [Rij+14; Rij+17; Ker+19]. In addition, Chapters 6 and 7 on page 91 and on page 109 provide an application for online algorithm selection [Arm+06;GS10]. Both areas have been identified as specific research challenges by prior works [Ker+19]. ...
Full-text available
One consequence of the recent coronavirus pandemic is increased demand and use of online services around the globe. At the same time, performance requirements for modern technologies are becoming more stringent as users become accustomed to higher standards. These increased performance and availability requirements, coupled with the unpredictable usage growth, are driving an increasing proportion of applications to run on public cloud platforms as they promise better scalability and reliability. With data centers already responsible for about one percent of the world's power consumption, optimizing resource usage is of paramount importance. Simultaneously, meeting the increasing and changing resource and performance requirements is only possible by optimizing resource management without introducing additional overhead. This requires the research and development of new modeling approaches to understand the behavior of running applications with minimal information. However, the emergence of modern software paradigms makes it increasingly difficult to derive such models and renders previous performance modeling techniques infeasible. Modern cloud applications are often deployed as a collection of fine-grained and interconnected components called microservices. Microservice architectures offer massive benefits but also have broad implications for the performance characteristics of the respective systems. In addition, the microservices paradigm is typically paired with a DevOps culture, resulting in frequent application and deployment changes. Such applications are often referred to as cloud-native applications. In summary, the increasing use of ever-changing cloud-hosted microservice applications introduces a number of unique challenges for modeling the performance of modern applications. These include the amount, type, and structure of monitoring data, frequent behavioral changes, or infrastructure variabilities. 
This violates common assumptions of the state of the art and opens a research gap for our work. In this thesis, we present five techniques for automated learning of performance models for cloud-native software systems. We achieve this by combining machine learning with traditional performance modeling techniques. Unlike previous work, our focus is on cloud-hosted and continuously evolving microservice architectures, so-called cloud-native applications. Therefore, our contributions aim to solve the above challenges to deliver automated performance models with minimal computational overhead and no manual intervention. Depending on the cloud computing model, privacy agreements, or monitoring capabilities of each platform, we identify different scenarios where performance modeling, prediction, and optimization techniques can provide great benefits. Specifically, the contributions of this thesis are as follows: Monitorless: Application-agnostic prediction of performance degradations. To manage application performance with only platform-level monitoring, we propose Monitorless, the first truly application-independent approach to detecting performance degradation. We use machine learning to bridge the gap between platform-level monitoring and application-specific measurements, eliminating the need for application-level monitoring. Monitorless creates a single and holistic resource saturation model that can be used for heterogeneous and untrained applications. Results show that Monitorless infers resource-based performance degradation with 97% accuracy. Moreover, it can achieve similar performance to typical autoscaling solutions, despite using less monitoring information. SuanMing: Predicting performance degradation using tracing. We introduce SuanMing to mitigate performance issues before they impact the user experience. This contribution is applied in scenarios where tracing tools enable application-level monitoring. 
SuanMing predicts explainable causes of expected performance degradations and prevents performance degradations before they occur. Evaluation results show that SuanMing can predict and pinpoint future performance degradations with an accuracy of over 90%. SARDE: Continuous and autonomous estimation of resource demands. We present SARDE to learn application models for highly variable application deployments. This contribution focuses on the continuous estimation of application resource demands, a key parameter of performance models. SARDE represents an autonomous ensemble estimation technique. It dynamically and continuously optimizes, selects, and executes an ensemble of approaches to estimate resource demands in response to changes in the application or its environment. Through continuous online adaptation, SARDE efficiently achieves an average resource demand estimation error of 15.96% in our evaluation. DepIC: Learning parametric dependencies from monitoring data. DepIC utilizes feature selection techniques in combination with an ensemble regression approach to automatically identify and characterize parametric dependencies. Although parametric dependencies can massively improve the accuracy of performance models, DepIC is the first approach to automatically learn such parametric dependencies from passive monitoring data streams. Our evaluation shows that DepIC achieves 91.7% precision in identifying dependencies and reduces the characterization prediction error by 30% compared to the best individual approach. Baloo: Modeling the configuration space of databases. To study the impact of different configurations within distributed DBMSs, we introduce Baloo. Our last contribution models the configuration space of databases considering measurement variabilities in the cloud. More specifically, Baloo dynamically estimates the required benchmarking measurements and automatically builds a configuration space model of a given DBMS. 
Our evaluation of Baloo on a dataset consisting of 900 configuration points shows that the framework achieves a prediction error of less than 11% while saving up to 80% of the measurement effort. Although the contributions themselves are orthogonally aligned, taken together they provide a holistic approach to performance management of modern cloud-native microservice applications. Our contributions are a significant step forward as they specifically target novel and cloud-native software development and operation paradigms, surpassing the capabilities and limitations of previous approaches. In addition, the research presented in this paper also has a significant impact on the industry, as the contributions were developed in collaboration with research teams from Nokia Bell Labs, Huawei, and Google. Overall, our solutions open up new possibilities for managing and optimizing cloud applications and improve cost and energy efficiency.
... The algorithms considered in this study are a memetic algorithm (MA) [24], and DMAB+MA [25], a technique that employs a dynamic multi-armed bandit [26] as a hyperheuristic approach to adaptive operator selection (AOS) [59] within a MA. DMAB+MA can be considered also as an adaptive memetic algorithm. ...
Full-text available
Search trajectory networks (STNs) were proposed as a tool to analyze the behavior of metaheuristics in relation to their exploration ability and the search space regions they traverse. The technique derives from the study of fitness landscapes using local optima networks (LONs). STNs are related to LONs in that both are built as graphs, modelling the transitions among solutions or group of solutions in the search space. The key difference is that STN nodes can represent solutions or groups of solutions that are not necessarily locally optimal. This work presents an STN-based study for a particular combinatorial optimization problem, the cyclic bandwidth sum minimization. STNs were employed to analyze the two leading algorithms for this problem: a memetic algorithm and a hyperheuristic memetic algorithm. We also propose a novel grouping method for STNs that can be generally applied to both continuous and combinatorial spaces.
... algorithm control (Biedenkapp et al. 2019)), where the goal is to predict a schedule of algorithms or dynamically control the algorithm during the solution process of an instance instead of predicting a single algorithm as in our case. Gagliolo and Schmidhuber (2006), Gagliolo and Legrand (2010), Gagliolo and Schmidhuber (2010), Pimpalkhare et al. (2021), and Cicirello and Smith (2005) essentially consider an online algorithm scheduling problem, where both an ordering of algorithms and their corresponding resource allocation (or simply the allocation) has to be computed. Thus, the prediction target is not a single algorithm as in our problem, but rather a very specific composition of algorithms, which can be updated during the solution process. ...
In online algorithm selection (OAS), instances of an algorithmic problem class are presented to an agent one after another, and the agent has to quickly select a presumably best algorithm from a fixed set of candidate algorithms. For decision problems such as satisfiability (SAT), quality typically refers to the algorithm's runtime. As the latter is known to exhibit a heavy-tail distribution, an algorithm is normally stopped when exceeding a predefined upper time limit. As a consequence, machine learning methods used to optimize an algorithm selection strategy in a data-driven manner need to deal with right-censored samples, a problem that has received little attention in the literature so far. In this work, we revisit multi-armed bandit algorithms for OAS and discuss their capability of dealing with the problem. Moreover, we adapt them towards runtime-oriented losses, allowing for partially censored data while keeping a space- and time-complexity independent of the time horizon. In an extensive experimental evaluation on an adapted version of the ASlib benchmark, we demonstrate that theoretically well-founded methods based on Thompson sampling perform specifically strong and improve in comparison to existing methods.
This thesis integrates machine learning techniques into meta-heuristics for solving combinatorial optimization problems. This integration aims to guide the meta-heuristics toward making better decisions and consequently make meta-heuristics more efficient and improve their performance in terms of solution quality and convergence rate. This thesis, first, provides a comprehensive yet technical review of the literature and proposes a unified taxonomy on different ways of the integration. For each type of integration, a complete analysis and discussion is provided on technical details, including challenges, advantages, disadvantages, and perspectives. From a technical aspect, we then focus on a particular integration and address the problem of adaptive operator selection in meta-heuristics using reinforcement learning techniques. More precisely, we propose a general framework that integrates the Q-learning algorithm, as a reinforcement learning algorithm, into the iterated local search algorithm to adaptively and dynamically select the most appropriate search operators at each step of the search process based on their history of performance. The proposed framework is finally applied on two combinatorial optimization problems, traveling salesman problem and permutation flowshop scheduling problem. In both applications, the framework better performance in terms of solution quality and convergence rate compared to a random selection of operators, especially for large size instances of the problems. Moreover, we observe that the proposed framework shows the state-of-the-art behavior when solving the permutation flowshop scheduling problem.
Full-text available
Local search algorithms are among the standard methods for solving hard combinatorial problems from various areas of artificial intelligence and operations research. For SAT, some of the most successful and powerful algorithms are based on stochastic local search, and in the past 10 years a large number of such algorithms have been proposed and investigated. In this article, we focus on two particularly well-known families of local search algorithms for SAT, the GSAT and WalkSAT architectures. We present a detailed comparative analysis of these algorithms" performance using a benchmark set that contains instances from randomized distributions as well as SAT-encoded problems from various domains. We also investigate the robustness of the observed performance characteristics as algorithm-dependent and problem-dependent parameters are changed. Our empirical analysis gives a very detailed picture of the algorithms" performance for various domains of SAT problems; it also reveals a fundamental weakness in some of the best-performing algorithms and shows how this can be overcome.
Full-text available
A wide range of combinatorial optimization algorithms have been developed for complex reasoning tasks. Frequently, no single algorithm outperforms all the others. This has raised interest in leveraging the performance of a collection of algorithms to improve performance. We show how to accomplish this using a Parallel Portfolio of Algorithms(PPA). A PPA is a collection of diverse algorithms for solving a single problem, all running concurrently on a single processor until a solution is produced. The performance of the portfolio may be controlled by assigning different shares of processor time to each algorithm. We present an effective method for finding a PPA in which the share of processor time allocated to each algorithm is fixed. Finding the optimal static schedule is shown to be an NP-complete problem for a general class of utility functions. We present bounds on the performance of the PPA over random instances and evaluate the performance empirically on a collection of 23 state-of-the-art SAT algorithms. The results show significant performance gains over the fastest individual algorithm in the collection.
This work studies external regret in sequential prediction games with both positive and negative payoffs. External regret measures the difference between the payoff obtained by the forecasting strategy and the payoff of the best action. In this setting, we derive new and sharper regret bounds for the well-known exponentially weighted average forecaster and for a second forecaster with a different multiplicative update rule. Our analysis has two main advantages: first, no preliminary knowledge about the payoff sequence is needed, not even its range; second, our bounds are expressed in terms of sums of squared payoffs, replacing larger first-order quantities appearing in previous bounds. In addition, our most refined bounds have the natural and desirable property of being stable under rescalings and general translations of the payoff sequence.
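The exponentially weighted average forecaster analyzed above can be sketched in a few lines: maintain one weight per action, play the normalized weights as a distribution, and multiply each weight by exp(eta * payoff) after every round. This minimal version uses a fixed learning rate eta; the paper's refined forecasters tune it without prior knowledge of the payoff range, which this sketch does not attempt.

```python
import math

def weighted_average_forecast(payoff_rows, eta=0.5):
    """Exponentially weighted average forecaster, full information:
    every action's payoff is observed each round.

    payoff_rows[t][i]: payoff of action i at round t (any sign).
    Returns (forecaster's expected cumulative payoff,
             cumulative payoff of the best single action).
    """
    n = len(payoff_rows[0])
    weights = [1.0] * n
    forecaster_payoff = 0.0
    for row in payoff_rows:
        w_sum = sum(weights)
        probs = [w / w_sum for w in weights]
        # expected payoff of playing the weight distribution
        forecaster_payoff += sum(p * x for p, x in zip(probs, row))
        # multiplicative update on every action (full information)
        weights = [w * math.exp(eta * x) for w, x in zip(weights, row)]
    best = max(sum(row[i] for row in payoff_rows) for i in range(n))
    return forecaster_payoff, best
```

The external regret of the run is simply `best - forecaster_payoff`; the bounds discussed above control this quantity in terms of sums of squared payoffs rather than the horizon.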
In the multiarmed bandit problem, a gambler must decide which arm of K non-identical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the trade-off between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff). Past solutions for the bandit problem have almost always relied on assumptions about the statistics of the slot machines. In this work, we make no statistical assumptions whatsoever about the nature of the process generating the payoffs of the slot machines. We give a solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs. In a sequence of T plays, we prove that the per-round payoff of our algorithm approaches that of the best arm at the rate O(T^{-1/2}). We show by a matching lower bound that this is the best possible. We also prove that our algorithm approaches the per-round payoff of any set of strategies at a similar rate: if the best strategy is chosen from a pool of N strategies, then our algorithm approaches the per-round payoff of that strategy at the rate O((log N)^{1/2} T^{-1/2}). Finally, we apply our results to the problem of playing an unknown repeated matrix game. We show that our algorithm approaches the minimax payoff of the unknown game at the rate O(T^{-1/2}).
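The adversarial bandit solver described above differs from the full-information forecaster in one essential way: only the played arm's payoff is observed, so the update uses an importance-weighted estimate. A minimal Exp3-style sketch, assuming rewards in [0, 1]; `reward_fn` and the exploration rate `gamma` are illustrative choices, not the paper's exact tuning:

```python
import math
import random

def exp3(n_arms, horizon, reward_fn, gamma=0.1):
    """Exp3-style adversarial bandit sketch.

    reward_fn(t, arm) returns the (possibly adversarial) reward in
    [0, 1] of pulling `arm` at round t. Mixes the weight distribution
    with uniform exploration at rate gamma. Returns total reward.
    """
    weights = [1.0] * n_arms
    total = 0.0
    for t in range(horizon):
        w_sum = sum(weights)
        probs = [(1 - gamma) * w / w_sum + gamma / n_arms for w in weights]
        arm = random.choices(range(n_arms), weights=probs)[0]
        reward = reward_fn(t, arm)
        total += reward
        # importance-weighted estimate: only the played arm is
        # observed, so divide by its selection probability
        est = reward / probs[arm]
        weights[arm] *= math.exp(gamma * est / n_arms)
    return total
```

Dividing by the selection probability makes the reward estimate unbiased for every arm, which is what allows the O(T^{-1/2}) per-round regret guarantee despite the partial feedback.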
This work studies external regret in sequential prediction games with arbitrary payoffs (nonnegative or non-positive). External regret measures the difference between the payoff obtained by the forecasting strategy and the payoff of the best action. We focus on two important parameters: M, the largest absolute value of any payoff, and Q*, the sum of squared payoffs of the best action. Given these parameters we derive first a simple and new forecasting strategy with regret at most of order √(Q* ln N) + M ln N, where N is the number of actions. We extend the results to the case where the parameters are unknown and derive similar bounds. We then devise a refined analysis of the weighted majority forecaster, which yields bounds of the same flavour. The proof techniques we develop are finally applied to the adversarial multi-armed bandit setting, and we prove bounds on the performance of an online algorithm in the case where there is no lower bound on the probability of each action.
Automatic specialization of algorithms to a limited domain is an interesting and industrially applicable problem. We calculate the optimal assignment of computational resources to several different solvers that solve the same problem. Optimality is considered with regard to the expected solution time on a set of problem instances from the domain of interest. We present two approaches, a static and a dynamic one. The static approach leads to a simple analytically calculable solution. The dynamic approach results in formulation of the problem as a Markov Decision Process. Our tests on the SAT Problem show that the presented methods are quite effective. Therefore, both methods are attractive for applications and future research.
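The static approach above amounts to choosing one share vector that minimizes expected solution time over a training set of instances. A brute-force sketch over a coarse grid of share vectors (the discretization and names are hypothetical; the paper derives the optimum analytically rather than by search):

```python
from itertools import product

def best_static_shares(runtime_table, grid=5):
    """Pick the static processor-share vector minimizing expected
    portfolio solution time over a set of instances.

    runtime_table[j][i]: standalone runtime of solver i on instance j.
    Shares are searched on a coarse integer grid and normalized.
    """
    n_solvers = len(runtime_table[0])
    candidates = list(product(range(1, grid + 1), repeat=n_solvers))

    def expected_time(shares):
        total = sum(shares)
        fracs = [s / total for s in shares]
        # portfolio time per instance: first solver to finish wins
        return sum(min(t / f for t, f in zip(row, fracs))
                   for row in runtime_table) / len(runtime_table)

    return min(candidates, key=expected_time)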
SATLIB is an online resource for SAT-related research established in June 1998. Its core components, a benchmark suite of SAT instances and a collection of SAT solvers, aim to facilitate empirical research on SAT by providing a uniform test-bed for SAT solvers along with freely available implementations of high-performing SAT algorithms. In this article, we give an overview of SATLIB; in particular, we describe its current set of benchmark problems. Currently, the main SATLIB web site and its North American mirror site are being accessed frequently by a growing number of researchers, resulting in access rates of about 250 hits per month. To further increase the usefulness of SATLIB as a resource, we encourage all members of the community to utilise it for their SAT-related research and to improve it by submitting new benchmark problems, SAT solvers, and bibliography entries.
Tuning an algorithm's parameters for robust and high performance is a tedious and time-consuming task that often requires knowledge about both the domain and the algorithm of interest. Furthermore, the optimal parameter configuration to use may differ considerably across problem instances. In this report, we define and tackle the algorithm configuration problem, which is to automatically choose the optimal parameter configuration for a given algorithm on a per-instance basis. We employ an indirect approach that predicts algorithm runtime for the problem instance at hand and each (continuous) parameter configuration, and then simply chooses the configuration that minimizes the prediction. This approach is based on similar work by Leyton-Brown et al. [LBNS02, NLBD+04], who tackle the algorithm selection problem [Ric76] (given a problem instance, choose the best algorithm to solve it). While all previous studies of runtime prediction focussed on tree search algorithms, we demonstrate that it is possible to fairly accurately predict the runtime of SAPS [HTH02], one of the best-performing stochastic local search algorithms for SAT. We also show that our approach automatically picks parameter configurations that speed up SAPS by an average factor of more than two when compared to its default parameter configuration. Finally, we introduce sequential Bayesian learning to the problem of runtime prediction, enabling an incremental learning approach and yielding very informative estimates of predictive uncertainty.
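Stripped of the learned model, the selection rule above is a one-liner: score every candidate configuration with the runtime predictor and take the minimizer. The `predict` callable below stands in for the learned runtime model (a hypothetical interface, not the paper's actual regression machinery):

```python
def pick_configuration(features, configs, predict):
    """Per-instance configuration by runtime prediction.

    features: feature vector of the problem instance at hand.
    configs:  candidate parameter configurations.
    predict(features, config): learned model's predicted runtime
    (hypothetical interface standing in for the trained regressor).
    Returns the configuration with minimal predicted runtime.
    """
    return min(configs, key=lambda c: predict(features, c))
```

Sequential Bayesian learning fits naturally into this interface: after each solved instance, `predict` is refit incrementally, and its predictive uncertainty can be used alongside the point estimate.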