The Famine of Forte:
Few Search Problems Greatly Favor Your Algorithm
George D. Montañez
Machine Learning Department
Carnegie Mellon University
Pittsburgh, Pennsylvania, USA
gmontane@cs.cmu.edu
Abstract—Casting machine learning as a type of search, we
demonstrate that the proportion of problems that are favorable
for a fixed algorithm is strictly bounded, such that no single
algorithm can perform well over a large fraction of them.
Our results explain why we must either continue to develop
new learning methods year after year or move towards highly
parameterized models that are both flexible and sensitive to
their hyperparameters. We further give an upper bound on the
expected performance for a search algorithm as a function of
the mutual information between the target and the information
resource (e.g., training dataset), proving the importance of certain
types of dependence for machine learning. Lastly, we show that
the expected per-query probability of success for an algorithm is
mathematically equivalent to a single-query probability of success
under a distribution (called a search strategy), and prove that the
proportion of favorable strategies is also strictly bounded. Thus,
whether one holds fixed the search algorithm and considers all
possible problems or one fixes the search problem and looks at all
possible search strategies, favorable matches are exceedingly rare.
The forte (strength) of any algorithm is quantifiably restricted.
I. INTRODUCTION
BIGCITY, United States. In a fictional world not unlike
our own, a sophisticated criminal organization plots to attack
an unspecified landmark within the city. Due to the complexity
of the attack and methods of infiltration, the group is forced to
construct a plan relying on the coordinated actions of several
interdependent agents, of which the failure of any one would
cause the collapse of the entire plot. As a member of the city’s
security team, you must allocate finite resources to protect
the many important locations within the city. Although you
know the attack is imminent, your sources have not indicated
which location, of the hundreds possible, will be hit; your lack
of manpower forces you to make assumptions about target
likelihoods. You know you can foil the plot if you stop even
one enemy agent. Because of this, you seek to maximize the
odds of capturing an agent by placing vigilant security forces
at the strategic locations throughout the city. Allocating more
security to a given location increases surveillance there, raising
the probability a conspirator will be found if operating nearby.
Unsure of your decisions, you allocate based on your best
information, but continue to second-guess yourself.
With this picture in mind, we can analyze the scenario
through the lens of algorithmic search. Algorithmic search
methods, whether employed in fictional security situations or
by researchers in the lab, all share common elements. They
have a search space of elements (possible locations) which
contains items of high value at unknown locations within that
space. These high-value elements are targets of the search,
which are sought by a process of inspecting elements of the
search space to see if they contain any elements of the target
set. (We refer to the inspection of search space elements as
sampling from the search space, and each sampling action as a
query.) In our fictional scenario, the enemy activity locations
constitute the target set, and surveillance at city locations
corresponds to a search process (i.e., attempting to locate and
arrest enemy agents). The search process can be deterministic
or contain some degree of randomization, such as choosing
which elements to inspect using a weighted distribution over
the search space. The search process is a success if an element
of the target set is located during the course of inspecting
elements of the space. The history of the search refers to the
collection of elements already sampled and any accompanying
information (e.g., the intelligence data gathered thus far by the
security forces).
There are many ways to search such a space. Research on
algorithmic search is replete with proposed search methods
that attempt to increase the expected likelihood of search
success, across many different problem domains. However, a
direct result of the No Free Lunch theorems is that no search
method is universally optimal across all search problems [1],
[2], [3], [4], [5], [6], [7], [8], [9], since all algorithmic search
methods have performance equivalent to uniform random
sampling on the search space when uniformly averaged across
any closed-under-permutation set of problem functions [4].
Thus, search (and learning) algorithms must trade weaker
performance on some problems for improved performance on
others [1], [10]. Given the fact that there exists no universally
superior search method, we typically seek to develop methods
that perform well over a large range of important problems.
An important question naturally arises:
Is there a limit to the number of problems for which
a search algorithm can perform well?
For search problems in general, the answer is yes, as the
proportion of search problems for which an algorithm out-
performs uniform random sampling is strictly bounded in
relation to the degree of performance improvement. Previous
results have relied on the fact that most search problems have
many target elements, and thus are not difficult for random
sampling [11], [12]. Since the typical search problem is not
difficult for random sampling, it becomes hard to find a
problem for which an algorithm can significantly and reliably
outperform random sampling. While theoretically interesting,
a more relevant situation occurs when we consider difficult
search problems, which are those having relatively small target
sets. We denote such problems as target-sparse problems
or, following [11], target-sparse functions. Note, this use of
sparseness differs from that typical in the machine learning
literature, since it refers to sparseness of the target space,
not sparseness in the feature space. One focus of this paper
is to prove results that hold even for target-sparse functions.
As mentioned, Montañez [11] and English [12] have shown
that uniform sampling does well on the majority of functions
since they are not target-sparse, thus making problems for
which an algorithm greatly outperforms random chance rare.
If we restrict ourselves to only target-sparse functions, perhaps
it would become easier to find relatively favorable problems
since we would already know uniform sampling does poorly.
The bar would already be low, so we would have to do less to
surpass it, perhaps leading to a greater proportion of favorable
problems within that set. We show why this intuition is incorrect,
along with proving several other key results.
II. CONTRIBUTIONS
First, we demonstrate that favorable search problems must
necessarily be rare. Our work departs from No Free Lunch
results (namely, that the mean performance across sets of prob-
lems is fixed for all algorithms) to show that the proportion
of favorable problems is strictly bounded in relation to the
inherent problem difficulty and the degree of improvement
sought (i.e., not just the mean performance is bounded). Our
results continue to hold for sets of objective functions that
are not closed-under-permutation, in contrast to traditional No
Free Lunch theorems. Furthermore, the negative results pre-
sented here do not depend on any distributional assumptions
on the space of possible problems, such that the proportion of
favorable problems is small regardless of which distribution
holds over them in the real world. This directly answers
critiques aimed at No Free Lunch results arguing against a
uniform distribution on problems in the real world (cf. [13]),
since given any distribution over possible problems, there are
still relatively few favorable problems within the set one is
taking the distribution over.
As a corollary, we prove that the information cost of finding
any favorable search problem is bounded below by the number
of bits of “advantage” gained by the algorithm on such problems.
We do this by using an active information transform to
measure performance improvement in bits [14], proving a
conservation of information result [6], [14], [11] that shows
the amount of information required to locate a search problem
giving b bits of expected information is at least b bits. Thus, to
get an expected net gain of information, the true distribution
over search problems must be biased towards favorable prob-
lems for a given algorithm. This places a floor on the minimal
information costs for finding favorable problems, somewhat
reminiscent of the entertainingly satirical work on “data set
selection” [15].
Another major contribution of this paper is to bound the
expected per-query probability of success based on informa-
tion resources. Namely, we relate the degree of dependence
(measured in mutual information) between target sets and
external information resources, such as objective functions,
noisy measurements¹ or sets of training data, to the maximum
improvement in search performance. We prove that for a fixed
target sparseness and given an algorithm with induced single-query probability of success q,
$$q \leq \frac{I(T;F) + D(P_T\,\|\,U_T) + 1}{I_\Omega}, \tag{1}$$
where $I(T;F)$ is the mutual information between target set T
(as a random variable) and external information resource F,
$D(P_T\,\|\,U_T)$ is the Kullback-Leibler divergence between the
marginal distribution on T and the uniform distribution on
target sets, and $I_\Omega$ is the baseline information cost for the
search problem due to sparseness. This simple equation takes
into account degree of dependence, target sparseness, target
function uncertainty, and the contribution of random luck. It is
surprising that such well-known quantities appear in the course
of simply trying to upper-bound the probability of success.
We then establish the equivalence between the expected
per-query probability of success for an algorithm and the
probability of a successful single-query search under some
distribution, which we call a strategy. Each algorithm maps
to a strategy, and we prove an upper-bound on the proportion
of favorable strategies for a fixed problem. Thus, matching a
search problem to a fixed algorithm or a search algorithm to
a fixed problem are both provably difficult tasks, and the set
of favorable items remains vanishingly small in both cases.
Lastly, we apply the results to several problem domains,
some toy and some actual, showing how these results lead to
new insights in different research areas.
III. SEARCH FRAMEWORK
We begin by formalizing our problem setting, search method
abstraction and other necessary concepts.
A. The Search Problem
We abstract the search problem as follows. The search
space, denoted Ω, contains the elements to be examined. We
limit ourselves to finite, discrete search spaces, which entails
no loss of generality when considering search spaces fully
representable on physical computer hardware within a finite
time. Let the target set T be a nonempty subset of the
search space. The set T can be represented using a binary
vector the size of Ω, where an indexed position evaluates to 1
whenever the corresponding element is in T and 0 otherwise.
Thus, each T corresponds to exactly one binary vector of
length |Ω|, and vice versa. We refer to this one-to-one mapped
¹Noisy in the sense of inaccurate or biased, not in the sense of indeterministic.
Fig. 1. Two example target sets. In one case (left), knowing the location of the
black target elements fully determines the location of the red target element.
In the second case (right), the locations of the black target elements reveal
nothing concerning the location of the red element. Problems are represented
as a binary vector (top row), and two-dimensional search space (bottom row).
binary vector as a target function and use the terms target
function and target set interchangeably, depending on context.
These target sets/functions will help us define our space of
possible search problems, as we will see shortly.
Figure 1 shows two example target sets, in binary vector
and generic target set form. The set on the left has strong
(potentially exploitable) structure governing the placement of
target elements, whereas the example on the right is more
or less random. Thus, knowing the location of some target
elements may or may not be able to help one find additional
elements, and in general there may be any degree of correlation
between the location of target elements already uncovered and
those yet to be discovered.
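To make the representation concrete, here is a minimal Python sketch (our illustration, not code from the paper) of a target set encoded as a k-hot binary vector, as in Figure 1:

```python
import numpy as np

# Toy illustration (ours): a 16-element search space with a 3-element
# target set, represented as a k-hot binary vector.
rng = np.random.default_rng(0)
omega_size, k = 16, 3
target = np.zeros(omega_size, dtype=int)
target[rng.choice(omega_size, size=k, replace=False)] = 1
print(target)             # a length-16 vector with exactly three 1s
assert target.sum() == k  # each target set <-> one k-hot vector
```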
Typically, elements from the search space are evaluated
according to some external information resource, such as
an objective function f. We abstract this resource as simply
a finite length bit string, which could represent an objective
function, a set of training data (in supervised learning settings),
or anything else. The resource can exist in coded form, and
we make no assumption about the shape of this resource or
its encoding. Our only requirement is that it can be used
as an oracle, given an element (possibly the null element)
from Ω. In other words, we require the existence of two
methods, one for initialization (given a null element) and one
for evaluating queried points in Ω. We assume both methods
are fully specified by the problem domain and external information
resource. With a slight abuse of notation, we define an
information resource evaluation as F(ω) := g(F, ω), where
g is an extraction function applied to the information resource
and ω ∈ Ω ∪ {∅}. Therefore, F(∅) represents the method
used to extract initial information for the search algorithm
(absent of any query), and F(ω) represents the evaluation
of point ω under resource F. The size of the information
resource becomes important for datasets in machine learning,
since it determines the maximum amount of mutual information
available between T and F.
A search problem is defined as a 3-tuple, (Ω, T, F),
consisting of a search space, a target subset of the space and an
Fig. 2. Black-box search algorithm. At time i the algorithm computes a
probability distribution P_i over the search space Ω, using information from
the history, and a new point is drawn according to P_i. The point is evaluated
using external information resource F. The tuple (ω, F(ω)) is then added to
the history at position i. Note, indices on ω elements do not correspond to
time step in this diagram, but to sampled locations.
external information resource F, respectively. Since the target
locations are hidden, any information gained by the search
concerning the target locations is mediated through the exter-
nal information resource Falone. Thus, the space of possible
search problems includes many deceptive search problems,
where the external resource provides misleading information
about target locations, similar to when security forces are given
false intelligence data, and many noisy problems, similar to
when imprecise intelligence gathering techniques are used. In
the fully general case, there can be any relationship between T
and F. Because we consider any and all degrees of dependence
between external information resources and target locations,
this effectively creates independence when considering the set
as a whole, allowing the first main result of this paper to follow
as a consequence.
However, in many natural settings, target locations and
external information resources are tightly coupled. For ex-
ample, we typically threshold objective function values to
designate the target elements as those that meet or exceed
some minimum. Doing so enforces dependence between target
locations and objective functions, where the former is fully
determined by the latter, once the threshold is known. This
dependence causes direct correlation between the objective
function (which is observable) and the target locations (which
are not directly observable). We will demonstrate such correla-
tion is exploitable, affecting the upper bound on the expected
probability of success.
B. The Search Algorithm
An algorithmic search is any process that chooses elements
of a search space to examine, given some history of elements
already examined and information resource evaluations. The
history consists of two parts: a query trace and a resource
evaluation trace. A query trace is a record of points queried,
indexed by time. A resource evaluation trace is a record
of partial information extracted from F, also indexed by
time. The information resource evaluations can be used to
build elaborate predictive models (as is the case of Bayesian
optimization methods), or ignored completely (as is done with
uniform random sampling). The algorithm’s internal state is
updated at each time step (according to the rules of the algo-
rithm), as new query points are evaluated against the external
information resource. The search is thus an iterative process,
with the history h at time i represented as h_i = (ω_i, F(ω_i)).
The problem domain defines how much initial information
from F is given by F(∅), as well as how much (and what)
information is extracted from F at each query.
We allow for both deterministic and randomized algorithms,
since deterministic algorithms are equivalent to randomized
algorithms with degenerate probability functions (i.e., they
place all probability mass on a single point). Furthermore,
any population-based method can also be represented in this
framework, by holding P fixed when selecting the m elements
in the population, then considering the history (possibly
limiting the horizon of the history to only m steps, creating
a Markovian dependence on the previous population) and
updating the probability distribution over the search space for
the next m steps.
Abstracting away the details of how such a distribution
P is chosen, we can treat the search algorithm as a black
box that somehow chooses elements of a search space to
evaluate. A search is successful if an element in the target
set is located during the search. Algorithm 1 outlines the
steps followed by the black-box search algorithm and Figure 2
visually demonstrates the process.
Algorithm 1 Black-box Search Algorithm
1: Initialize h_0 ← (∅, F(∅)).
2: for all i = 1, . . . , i_max do
3:    Using history h_{0:i−1}, compute P_i, the distribution over Ω.
4:    Sample element ω_i according to P_i.
5:    Set h_i ← (ω_i, F(ω_i)).
6: end for
7: if an element of T is contained in any tuple of h then
8:    Return success.
9: else
10:   Return failure.
11: end if
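The following Python sketch mirrors Algorithm 1; the `compute_P` argument stands in for the algorithm-specific rule mapping a history to a distribution P_i, and is our assumption rather than anything specified by the framework:

```python
import numpy as np

def black_box_search(omega_size, target, F, i_max, compute_P, rng):
    # h_0 <- (null element, F(null)); None stands in for the null query.
    history = [(None, F(None))]
    for _ in range(i_max):
        P_i = compute_P(history, omega_size)      # distribution over Omega
        w_i = int(rng.choice(omega_size, p=P_i))  # sample omega_i ~ P_i
        history.append((w_i, F(w_i)))             # h_i <- (omega_i, F(omega_i))
    return any(w in target for w, _ in history[1:])  # success if any omega_i in T

# Uniform random sampling ignores the history entirely.
uniform_strategy = lambda history, n: np.full(n, 1.0 / n)

rng = np.random.default_rng(0)
print(black_box_search(32, {5, 9}, lambda w: None, 10, uniform_strategy, rng))
```

Passing `uniform_strategy` recovers the uniform random sampling baseline used throughout the paper.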
Our framework is sufficiently general to apply to super-
vised learning, genetic algorithms, and sequential model-based
hyperparameter optimization, yet is specific enough to allow
for precise quantitative statements. Although faced with an
inherent balancing act whenever formulating a new framework,
we emphasize greater generality to allow for the largest
number of domain specific applications.
IV. MEASURING PERFORMANCE
Since general search algorithms may vary the total number
of sampling steps performed, we measure performance using
the expected per-query probability of success,
$$q(T,F) = \mathbb{E}_{\tilde{P},H}\!\left[\left.\frac{1}{|\tilde{P}|}\sum_{i=1}^{|\tilde{P}|} P_i(\omega \in T)\,\right|\,F\right], \tag{2}$$
where $\tilde{P}$ is the sequence of probability distributions for sampling
elements (with one distribution P_i for each time step i),
H is the search history, and the T and F make the dependence
on the search problem explicit. Because this is an expectation
over sequences of history tuples and probability distributions,
past information forms the basis for q(T,F). $|\tilde{P}|$ denotes the
length of the sequence $\tilde{P}$, which equals the number of queries
taken. The expectation is taken over all sources of randomness,
which includes randomness over possible search histories and
any randomness in constructing the various P_i from h_{0:i−1} (if
such a construction is not entirely deterministic). Taking the
expectation over all sources of randomness is equivalent to the
probability of success for samples drawn from an appropriately
averaged distribution P(· | F) (see Lemma 1). Because we
are sampling from a fixed (after conditioning and expectation)
probability distribution, the expected per-query probability of
success for the algorithm is equivalent to the induced amount
of probability mass allocated to target elements. Revisiting
our fictional security scenario, the probability of capturing an
enemy agent is proportional to the amount of security at his
location. In the same way, the expected per-query probability
of success is equal to the amount of induced probability mass
at locations where target elements are placed. Thus, each
P(· | F) demarcates an equivalence class of search algorithms
mapping to the same averaged distribution; we refer to these
equivalence classes as search strategies.
We use uniform random sampling with replacement as our
baseline search algorithm, which is a simple, always-available
strategy, and we define p(T,F) as the per-query probability of
success for that method. In a natural way, p(T,F) is a measure
of the intrinsic difficulty of a search problem [14], absent
of any side-information. The ratio p(T,F)/q(T,F) quantifies
the improvement achieved by a search algorithm over simple
random sampling. Like an error quantity, when this ratio is
small (i.e., less than one) the performance of the algorithm
is better than uniform random sampling, but when it is larger
than one, performance is worse. We will often write p(T,F)
simply as p, when the target set and information resource are
clear from the context.
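As a quick sanity check (our illustration, with arbitrary numbers), one can estimate the per-query probability of success for the uniform baseline by simulation and compare it to the closed form p = |T|/|Ω|:

```python
import numpy as np

# Monte Carlo estimate of per-query success for uniform random sampling
# with replacement, versus the closed form p = |T|/|Omega|.
rng = np.random.default_rng(1)
n, target = 64, {3, 17, 42}
trials, queries, hits = 2000, 5, 0
for _ in range(trials):
    for _ in range(queries):
        hits += int(rng.integers(n)) in target
print(hits / (trials * queries), len(target) / n)  # both close to 3/64
```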
V. MAIN RESULTS
We state our main results here with brief discussion, and
with proofs given in the Appendix.
A. Famine of Forte
Theorem 1. (Famine of Forte) Define
$$\tau_k = \{T \mid T \subseteq \Omega,\ |T| = k \in \mathbb{N}\}$$
and let $B_m$ denote any set of binary strings, such that the
strings are of length m or less. Let
$$R = \{(T,F) \mid T \in \tau_k,\ F \in B_m\},\ \text{and}$$
$$R_{q_{\min}} = \{(T,F) \mid T \in \tau_k,\ F \in B_m,\ q(T,F) \geq q_{\min}\},$$
where q(T,F) is the expected per-query probability of success
for algorithm A on problem (Ω, T, F). Then for any $m \in \mathbb{N}$,
$$\frac{|R_{q_{\min}}|}{|R|} \leq \frac{p}{q_{\min}}$$
and
$$\lim_{m\to\infty} \frac{|R_{q_{\min}}|}{|R|} \leq \frac{p}{q_{\min}},$$
where $p = k/|\Omega|$.
We see that for small p (problems with sparse target sets),
favorable search problems are rare if we desire a strong probability
of success. The larger q_min, the smaller the proportion.
In many real-world settings, we are given a difficult search
problem (with minuscule p) and we hope that our algorithm
has a reasonable chance of achieving success within a limited
number of queries. According to this result, the proportion of
problems fulfilling such criteria is also minuscule. Only if we
greatly relax the minimum performance demanded, so that q_min
approaches the scale of p, can we hope to easily stumble upon
such accommodating search problems. If not, insight and/or
auxiliary information (e.g., convexity constraints, smoothness
assumptions, domain knowledge, etc.) are required to find
problems for which an algorithm excels.
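To give these quantities concrete scale, here is a hypothetical worked example (the numbers are ours, chosen only for illustration):

```python
# Hypothetical numbers: a single-element target (k = 1) in a search
# space of |Omega| = 2**30, demanding q_min = 0.5 per-query success.
k, omega_size, q_min = 1, 2**30, 0.5
p = k / omega_size
print(p / q_min)   # ~1.9e-09: at most two problems in a billion qualify
```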
This result has many applications, as will be shown in
Section VI.
B. Conservation of Information
Corollary 1. (Conservation of Active Information of Expectations)
Define $I_{q(T,F)} = -\log p/q(T,F)$ and let
$$R = \{(T,F) \mid T \in \tau_k,\ F \in B_m\},\ \text{and}$$
$$R_b = \{(T,F) \mid T \in \tau_k,\ F \in B_m,\ I_{q(T,F)} \geq b\}.$$
Then for any $m \in \mathbb{N}$,
$$\frac{|R_b|}{|R|} \leq 2^{-b}.$$
The active information transform quantifies improvement in
a search over a baseline search method [14], measuring gains
in search performance in terms of information (bits). This
allows one to precisely quantify the proportion of favorable
functions in relation to the number of bits b of improvement desired.
Active information provides a nice geometric interpretation
of improved search performance, namely, that the improved
search method is equivalent to a uniform random sampler on
a reduced search space. Similarly, a degraded-performance
search is equivalent to a uniform random sampler (with
replacement) on a larger search space.
Applying this transform, we see that finding a search
problem for which an algorithm effectively reduces the search
space by b bits requires at least b bits, so information is
conserved in this context. Assuming you have no domain
knowledge to guide the process of finding a search problem
for which your search algorithm excels, you are unlikely to
stumble upon one by chance; indeed, they are exponentially
rare in the amount of improvement sought.
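A small numerical illustration with hypothetical values of p and q makes the exponential rarity explicit:

```python
from math import log2

# Hypothetical improvement: an algorithm lifts per-query success from
# p = 2**-30 to q = 2**-10 on some problem, an advantage of b bits.
p, q = 2**-30, 2**-10
b = -log2(p / q)
print(b, 2**-b)    # b = 20 bits; at most 2**-20 of problems offer it
```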
C. Famine of Favorable Strategies
Theorem 2. (Famine of Favorable Strategies) For any fixed
search problem (Ω, t, f), set of probability mass functions
$\mathcal{P} = \{P : P \in \mathbb{R}^{|\Omega|},\ \sum_j P_j = 1\}$, and a fixed threshold
$q_{\min} \in [0,1]$,
$$\frac{\mu(G_{t,q_{\min}})}{\mu(\mathcal{P})} \leq \frac{p}{q_{\min}},$$
where $G_{t,q_{\min}} = \{P : P \in \mathcal{P},\ t^\top P \geq q_{\min}\}$ and μ is
Lebesgue measure. Furthermore, the proportion of possible
search strategies giving at least b bits of active information of
expectations is no greater than $2^{-b}$.
Thus, not only are favorable problems rare, so are favorable
strategies. Whether you hold fixed the algorithm and try to
match a problem to it, or hold fixed the problem and try to
match a strategy to it, both are provably difficult. Because
matching problems to algorithms is hard, seemingly serendip-
itous agreement between the two calls for further explanation,
serving as evidence against blind matching by independent
mechanisms (especially for very sparse targets embedded
in very large spaces). More importantly, this result places
hard quantitative constraints (similar to minimax bounds in
statistical learning theory) on information costs for automated
machine learning, which attempts to match learning algorithms
to learning problems [16], [17].
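Theorem 2 can be checked empirically for small cases. The sketch below (ours, using the fact that Dirichlet(1) draws are uniform on the simplex) estimates the proportion of strategies that are q_min-favorable for a fixed sparse target:

```python
import numpy as np

# Empirical illustration of Theorem 2: strategies drawn uniformly from
# the simplex rarely place q_min mass on a fixed sparse target t.
rng = np.random.default_rng(2)
n, k, q_min = 50, 2, 0.25
t = np.zeros(n)
t[:k] = 1                                    # fixed 2-element target
P = rng.dirichlet(np.ones(n), size=100_000)  # 100k random strategies
print(np.mean(P @ t >= q_min))               # tiny, well under the bound
print((k / n) / q_min)                       # Theorem 2 bound: 0.16
```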
D. Success Under Dependence
Theorem 3. (Success Under Dependence) Define
$$q = \mathbb{E}_{T,F}[q(T,F)]$$
and note that
$$q = \mathbb{E}_{T,F}\left[P(\omega \in T \mid F)\right] = \Pr(\omega \in T; A).$$
Then,
$$q \leq \frac{I(T;F) + D(P_T\,\|\,U_T) + 1}{I_\Omega},$$
where $I_\Omega = -\log k/|\Omega|$, $D(P_T\,\|\,U_T)$ is the Kullback-Leibler
divergence between the marginal distribution on T and the
uniform distribution on T, and I(T;F) is the mutual information.
Alternatively, we can write
$$q \leq \frac{H(U_T) - H(T \mid F) + 1}{I_\Omega},$$
where $H(U_T) = \log \binom{|\Omega|}{k}$.
Thus the bound on the expected probability of success improves
monotonically with the amount of dependence between target
sets and information resources. We quantify the dependence
using mutual information under any fixed joint distribution
on τ_k and B_m. We see that $I_\Omega$ measures the relative target
sparseness of the search problem, and can be interpreted as the
information cost of locating a target element in the absence
of side-information. $D(P_T\,\|\,U_T)$ is naturally interpreted as
the predictability of the target sets, since large values imply
the probable occurrence of only a small number of possible
target sets. The mutual information I(T;F) is the amount of
exploitable information the external resource contains regarding
T; lowering the mutual information lowers the maximum
expected probability of success for the algorithm. Lastly, the
1 in the numerator upper bounds the contribution of pure randomness,
as explained in the Appendix. Thus, this expression
constrains the relative contributions of predictability, problem
difficulty, side-information, and randomness for a successful
search, providing a satisfyingly simple and interpretable upper
bound on the probability of successful search.
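Plugging assumed values into the bound illustrates its behavior at the uninformative extreme:

```python
import numpy as np

# Assumed values for Theorem 3: with F independent of T (I(T;F) = 0)
# and a uniform marginal on targets (D(P_T||U_T) = 0), only the "+1"
# luck term remains.
k, n = 1, 1024
I_Omega = -np.log2(k / n)        # 10 bits of sparseness cost
print((0 + 0 + 1) / I_Omega)     # q <= 0.1 without side-information
```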
VI. EXAMPLES
A. Binary Classification
It has been suggested that machine learning represents
a type of search through parameter or concept space [18].
Supporting this view, we can represent binary classification
problems within our framework as follows:
A - a classification algorithm, such as an SVM.
Ω - the space of possible concepts over an instance space.
T - the set of all hypotheses with less than 10% classification
error on a test set, for example.
F - a set of training examples.
F(∅) - the full set of training data.
F(c) - the loss on the training data for concept c.
(Ω, T, F) - a binary classification learning task.
The space of possible binary concepts is Ω, with the true
concept being an element in that space. In our example, let
|Ω| = 2^100. The target set consists of the set of all concepts in
that space that (1) are consistent with the training data (which
we will assume all are), and (2) differ from the truth in at most
10% of positions on the generalization held-out dataset. Thus,
$|T| = \sum_{i=0}^{10} \binom{100}{i}$. Let us assume the marginal distribution on
T is uniform, which isn't necessary but simplifies the calculation.
The external information resource F is the set of training
examples. The algorithm uses the training examples (given by
F(∅)) to produce a distribution over the space of concepts;
for deterministic algorithms, this is a degenerate distribution
on exactly one element. A single query is then taken (i.e., a
concept is output), and we assess the probability of success
for the single query. By Theorem 3, the expected chance of
outputting a concept with at least 90% generalization accuracy
is thus no greater than
$$\frac{I(T;F) + 1}{I_\Omega} \approx \frac{I(T;F)}{I_\Omega} \approx \frac{I(T;F)}{55.9}.$$
The denominator is the information cost of specifying at least
one element of the target set and the numerator represents
the information resources available for doing so. When the
mutual information meets (or exceeds) that cost, success can
be ensured for any algorithm perfectly mining the available
mutual information. When noise reduces the mutual information
below the information cost, the expected probability of
success becomes strictly bounded in proportion to that ratio.
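The sparseness term $I_\Omega$ above can be checked directly under the example's stated assumptions:

```python
from math import comb, log2

# Sparseness term for the binary classification example: |Omega| = 2**100
# and |T| = sum_{i<=10} C(100, i), per the assumptions stated above.
k = sum(comb(100, i) for i in range(11))
I_Omega = 100 - log2(k)          # = -log2(k / 2**100)
print(round(I_Omega, 1))         # ~55.9 bits, the denominator above
```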
B. General Learning Problems
Vapnik presents a generalization of learning that applies to
classification, regression, and density estimation [19], which
we can translate into our framework. Following Vapnik, let
P(z) be defined on space Z, and consider the parameterized
set of functions $Q_\alpha(z)$, $\alpha \in \Lambda$. The goal is to minimize
$R(\alpha) = \int Q_\alpha(z)\,dP(z)$ over $\alpha \in \Lambda$, when P(z) is unknown
but an i.i.d. sample $z_1, \ldots, z_\ell$ is given. Let
$R_{\mathrm{emp}}(\alpha) = \frac{1}{\ell}\sum_{i=1}^{\ell} Q_\alpha(z_i)$ be the empirical risk.
To reduce this general problem to a search problem within
our framework, assume Λ is finite, choose $\epsilon \geq 0$, and let
Ω = Λ;
$T = \{\alpha : R(\alpha) - \min_{\alpha' \in \Lambda} R(\alpha') < \epsilon\}$;
$F = \{z_1, \ldots, z_\ell\}$;
$F(\emptyset) = \{z_1, \ldots, z_\ell\}$; and
$F(\alpha) = R_{\mathrm{emp}}(\alpha)$.
Thus, any finite problem representable in Vapnik’s statistical
learning framework is also directly representable within our
search framework.
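A minimal instantiation of this reduction, with an assumed distribution P(z) = N(0.3, 1) and squared loss chosen purely for illustration, might look as follows:

```python
import numpy as np

# Minimal sketch of the reduction (our choices: P(z) = N(0.3, 1), squared
# loss), so the true risk R(alpha) = 1 + (0.3 - alpha)^2 is known here.
rng = np.random.default_rng(3)
Lambda = np.linspace(-1, 1, 21)             # Omega = finite Lambda
z = rng.normal(0.3, 1.0, size=100)          # i.i.d. sample = F
R_true = 1.0 + (0.3 - Lambda) ** 2          # closed-form R(alpha)
R_emp = np.array([np.mean((z - a) ** 2) for a in Lambda])  # F(alpha)
eps = 0.05
T = Lambda[R_true < R_true.min() + eps]     # eps-optimal parameters
print(T)   # the search algorithm sees only R_emp, never R_true
```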
C. Hyperparameter Optimization
Given that sequential hyperparameter optimization is a lit-
eral search through a space of hyperparameter configurations,
our results are directly applicable. The search space consists
of all the possible hyperparameter configurations (appropri-
ately discretized in the case of numerical hyperparameters).
The target set Tis determined by the particular learning
algorithm the configurations are applied to, the performance
metric used, and the level of performance desired. Let X
denote a set of points sampled from the space, and let the
information gained from the sample become the external
information resource f. Given that resource, we have the
following theorem:
Theorem 4. Given a search algorithm A, a finite discrete
hyperparameter configuration space Ω, a set X of points
sampled from that search space, and an information resource f
that is a function of X, let $\Omega' := \Omega \setminus X$,
$\tau_k = \{T \mid T \subseteq \Omega',\ |T| = k \in \mathbb{N}\}$, and
$\tau_{k,q_{\min}} = \{T \mid T \in \tau_k,\ q(T,f) \geq q_{\min}\}$,
where q(T,f) is the expected per-query probability of success
for algorithm A under T and f. Then,
$$\frac{|\tau_{k,q_{\min}}|}{|\tau_k|} \leq \frac{p'}{q_{\min}},$$
where $p' = k/|\Omega'|$.
The proof follows directly from Theorem 1.
The proportion of possible hyperparameter target sets giving
an expected probability of success q_min or more is minuscule
when k ≪ |Ω′|. If we have no additional information beyond
that gained from the points X, we have no justifiable basis
for expecting a successful search. Thus, we must make some
assumptions concerning the relationship of the points sampled
to the remaining points in Ω′. We can do so by either assuming
structure on the search space, such that spatial information
becomes informative, or by making an assumption on the
process by which X was sampled, so that the sample is representative
of the space in quantifiable ways. These assumptions
allow f to become informative of the target set T, leading to
exploitable dependence under Theorem 3. Thus we see the
need for inductive bias in hyperparameter optimization (to
expand a term used by Mitchell [20] for classification), which
hints at a strategy for creating more effective hyperparameter
optimization algorithms (i.e., through exploitation of spatial
structure).
D. One-Size-Fits-All Fitness Functions
Theorem 1 gives us an upper bound on the proportion of
favorable search problems, but what happens when we have
a single, fixed information resource, such as a single fitness
function? A natural question to ask is for how many target
locations can such a fitness function be useful. More precisely,
for a given search algorithm, for what proportion of search
space locations can the fitness function raise the expected
probability of success, assuming a target element happens to
be located at one of those spots?
Applying Theorem 1 with |T| = 1, we find that a fitness
function can significantly raise the probability of locating target
elements placed on, at most, $1/q_{\min}$ search space elements.
We see this as follows. Since |T| = 1 and the fitness function
is fixed, each search problem maps to exactly one element of
Ω, giving |Ω| possible search problems. The number of $q_{\min}$-favorable
search problems is upper-bounded by
$$|\Omega|\frac{p}{q_{\min}} = |\Omega|\frac{|T|/|\Omega|}{q_{\min}} = |\Omega|\frac{1/|\Omega|}{q_{\min}} = \frac{1}{q_{\min}}.$$
Because this expression is independent of the size of the search
space, the number of elements for which a fitness function can
strongly raise the probability of success remains fixed even as
the size of the search space increases. Thus, for very large
search spaces the proportion of favored locations effectively
vanishes. There can exist no single fitness function that is
strongly favorable for many elements simultaneously, and thus
no “one-size-fits-all” fitness function.
E. Proliferation of Learning Algorithms
Our results also help make sense of recent trends in machine
learning. They explain why new algorithms continue to be
developed, year after year, conference after conference, despite
decades of research in machine learning and optimization.
Given that algorithms can only perform well on a narrow
subset of problems, we must either continue to devise novel
algorithms for new problem domains or else move towards
flexible algorithms that can modify their behavior solely
though parameterization. The latter effectively behave like
new strategies for new hyperparameter configurations. The
explosive rise of flexible, hyperparameter sensitive algorithms
like deep learning methods and vision architectures shows a
definite trend towards the latter, with their hyperparameter
sensitivity being well known [21], [22]. Furthermore, because
flexible algorithms are highly sensitive to their hyperparam-
eters (by definition), this explains the concurrent rise in
automated hyperparameter optimization methods [23], [24],
[16], [25].
F. Landmark Security
Returning to our initial toy example, we can now fill in a
few details. Our external information resource is the pertinent
intelligence data, mined through surveillance. We begin with
background knowledge (represented in F(∅)), used to make
the primary security force placements. Team members on the
ground communicate updates back to central headquarters,
such as suspicious activity, which are the F(ω) evaluations
used to update the internal information state. Each resource
allocated is a query, and manpower constraints limit the
number of queries available. Doing more with fewer officers
is better, so the hope is to maximize the per-officer probability
of stopping the attack.
Our results tell us a few things. First, a fixed strategy can
only work well in a limited number of situations. There is little
or no hope of a problem being a good match for your strategy
if the problem arises independently of it (Theorems 1 and
2). So reliable intelligence becomes key. The better correlated
the intelligence reports are with the actual plot, the better a
strategy can perform (Theorem 3). However, even for a fixed
search problem with reliable external information resource
there is no guarantee of success, if the strategy is chosen
poorly; the proportion of good strategies for a fixed problem
is no better than the proportion of good problems for a fixed
algorithm (Theorem 2). Thus, domain knowledge is crucial in
choosing either. Without side-information to guide the match
between search strategy and search problem, the expected
probability of success is dismal in target-sparse situations.
VII. CONCLUSION
A colleague once remarked that the results presented here
bring to mind Lake Wobegon, “where all the children are
above average.” In a world where every published algorithm
is “above average,” conservation of information reminds us
that this cannot be. Improved performance over any subset of
problems necessarily implies degraded performance over the
remaining problems [10], and all methods have performance
equivalent to random sampling when uniformly averaged over
any closed-under-permutation set of problems [1], [2], [4].
But there are many ways to achieve an average. An algo-
rithm might perform well over a large number of problems,
only to perform poorly on a small set of remaining problems.
A different algorithm might perform close to average on all
problems, with slight variations around the mean. How large
can the subset of problems with improved performance be?
We show that the maximum proportion of search problems
for which an algorithm can attain qmin-favorable performance
is bounded from above by p/qmin. Thus, an algorithm can
only perform well over a narrow subset of possible problems.
Not only is there no free lunch, but there is also a famine of
favorable problems.
If finding a good search problem for a fixed algorithm is
hard, then so is finding a good search algorithm for a fixed
problem (Theorem 2). Thus, the matching of problems to
algorithms is provably difficult, regardless of which is fixed
and which varies.
Our results paint a more optimistic picture once we restrict
ourselves to those search problems for which the external
information resource is strongly informative of the target set,
as is often assumed to be the case. For those problems, the
expected per-query probability of success is upper bounded
by a function involving the mutual information between target
sets and external information resources (like training datasets
and objective functions). The lower the mutual information,
the lower the chance of success, but the bound improves as the
dependence is strengthened.
The search framework we propose is general enough to
find application in many problem areas, such as machine
learning, evolutionary search, and hyperparameter optimiza-
tion. The results are not just of theoretical importance, but
help explain real-world phenomena, such as the need for
exploitable dependence in machine learning and the empirical
difficulty of automated learning [17]. Our results help us
understand the growing popularity of deep learning methods
and unavoidable interest in automated hyperparameter tuning
methods. Extending the framework to continuous settings and
other problem areas (such as active learning) is the focus of
ongoing research.
ACKNOWLEDGEMENT
I would like to thank Akshay Krishnamurthy, Junier Oliva,
Ben Cowley and Willie Neiswanger for their discussions
regarding Lemma 2. I am indebted to Geoff Gordon for
help proving Lemma 2, and to Cosma Shalizi for providing
many good insights, challenges and ideas concerning this
manuscript.
REFERENCES
[1] D. Wolpert and W. Macready, “No free lunch theorems for optimization,”
IEEE Transactions on Evolutionary Computation, vol. 1, no. 1, pp. 67–82,
April 1997.
[2] J. Culberson, “On the futility of blind search: An algorithmic view of
‘no free lunch’,” Evolutionary Computation, vol. 6, no. 2, pp. 109–127,
1998.
[3] T. English, “Evaluation of evolutionary and genetic optimizers: No free
lunch,” in Evolutionary Programming V: Proceedings of the Fifth Annual
Conference on Evolutionary Programming, 1996, pp. 163–169.
[4] C. Schumacher, M. Vose, and L. Whitley, “The no free lunch and prob-
lem description length,” in Proceedings of the Genetic and Evolutionary
Computation Conference (GECCO-2001), 2001, pp. 565–570.
[5] D. Whitley, “Functions as permutations: regarding no free lunch, walsh
analysis and summary statistics,” in Parallel Problem Solving from
Nature PPSN VI. Springer, 2000, pp. 169–178.
[6] T. English, “No more lunch: Analysis of sequential search,” in Evo-
lutionary Computation, 2004. CEC2004. Congress on, vol. 1. IEEE,
2004, pp. 227–234.
[7] S. Droste, T. Jansen, and I. Wegener, “Optimization with randomized
search heuristics–the (a)nfl theorem, realistic scenarios, and difficult
functions,” Theoretical Computer Science, vol. 287, no. 1, pp. 131–144,
2002.
[8] W. Dembski and R. Marks II, “The search for a search: Measuring
the information cost of higher level search,” Journal of Advanced
Computational Intelligence and Intelligent Informatics, vol. 14, no. 5,
pp. 475–486, 2010.
[9] W. Dembski, W. Ewert, and R. Marks II, “A general theory of infor-
mation cost incurred by successful search,” in Biological Information.
World Scientific, 2013, ch. 3, pp. 26–63.
[10] C. Schaffer, “A conservation law for generalization performance,” in
Proceedings of the Eleventh International Machine Learning Confer-
ence, W. W. Cohen and H. Hirsch, Eds. Rutgers University, New
Brunswick, NJ, 1994, pp. 259–265.
[11] G. D. Montañez, “Bounding the number of favorable functions in
stochastic search,” in Evolutionary Computation (CEC), 2013 IEEE
Congress on, June 2013, pp. 3019–3026.
[12] T. English, “Optimization is easy and learning is hard in the typical
function,” in Evolutionary Computation, 2000. Proceedings of the 2000
Congress on, vol. 2. IEEE, 2000, pp. 924–931.
[13] R. B. Rao, D. Gordon, and W. Spears, “For every generalization
action, is there really an equal and opposite reaction? Analysis of the
conservation law for generalization performance,” Urbana, vol. 51, p.
61801.
[14] W. Dembski and R. Marks II, “Conservation of information in search:
Measuring the cost of success,” IEEE Transactions on Systems, Man and
Cybernetics, Part A: Systems and Humans, vol. 39, no. 5, pp. 1051–1061,
Sept. 2009.
[15] D. LaLoudouana and M. B. Tarare, “Data set selection,” Journal of
Machine Learning Gossip, vol. 1, pp. 11–19, 2003.
[16] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Auto-WEKA:
Combined selection and hyperparameter optimization of classification
algorithms,” in Proceedings of the 19th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining. ACM, 2013, pp.
847–855.
[17] I. Guyon, K. Bennett, G. Cawley, H. J. Escalante, S. Escalera, T. K. Ho,
N. Macià, B. Ray, M. Saeed, A. Statnikov, and E. Viegas, “Design of the
2015 ChaLearn AutoML challenge,” in 2015 International Joint Conference
on Neural Networks (IJCNN), July 2015, pp. 1–8.
[18] T. M. Mitchell, “Generalization as search,” Artificial Intelligence, vol. 18,
no. 2, pp. 203–226, 1982.
[19] V. N. Vapnik, “An overview of statistical learning theory,” IEEE Transactions
on Neural Networks, vol. 10, no. 5, pp. 988–999, 1999.
[20] T. M. Mitchell, “The need for biases in learning generalizations,” Rutgers
University, Tech. Rep., 1980.
[21] J. Snoek, H. Larochelle, and R. P. Adams, “Practical bayesian optimiza-
tion of machine learning algorithms,” in Advances in neural information
processing systems, 2012, pp. 2951–2959.
[22] J. Bergstra, D. Yamins, and D. Cox, “Making a science of model
search: Hyperparameter optimization in hundreds of dimensions for
vision architectures,” in Proc. 30th International Conference on Machine
Learning (ICML-13), 2013.
[23] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas,
“Taking the human out of the loop: A review of bayesian optimization,”
Proceedings of the IEEE, vol. 104, no. 1, pp. 148–175, 2016.
[24] J. Bergstra and Y. Bengio, “Random search for hyper-parameter opti-
mization,” The Journal of Machine Learning Research, vol. 13, no. 1,
pp. 281–305, 2012.
[25] Z. Wang, F. Hutter, M. Zoghi, D. Matheson, and N. de Freitas, “Bayesian
optimization in a billion dimensions via random embeddings,” Journal
of Artificial Intelligence Research (JAIR), vol. 55, pp. 361–387, February
2016.
[26] R. Fano and D. Hawkins, “Transmission of information: A statistical
theory of communications,” American Journal of Physics, vol. 29,
no. 11, pp. 793–794, 1961.
VIII. APPENDIX: PROOFS
Lemma 1. (Expected Per-Query Performance From Expected
Distribution) Let t be a target set, q(t,f) the expected per-query
probability of success for an algorithm, and ν the
conditional joint measure induced by that algorithm over finite
sequences of probability distributions and search histories,
conditioned on external information resource f. Denote a
probability distribution sequence by $\tilde{P}$ and a search history by
h. Let $U(\tilde{P})$ denote a uniform distribution on elements of $\tilde{P}$
and define $P(x \mid f) = \int \mathbb{E}_{P\sim U(\tilde{P})}[P(x)]\, d\nu(\tilde{P}, h \mid f)$. Then,
$$q(t,f) = P(X \in t \mid f),$$
where $P(X \mid f)$ is a probability distribution on the search
space.
Proof. Begin by expanding the definition of $\mathbb{E}_{P\sim U(\tilde{P})}[P(x)]$,
being the average probability mass on element x under sequence $\tilde{P}$:
$$\mathbb{E}_{P\sim U(\tilde{P})}[P(x)] = \frac{1}{|\tilde{P}|}\sum_{i=1}^{|\tilde{P}|} P_i(x).$$
We note that $P(x \mid f)$ is a proper probability distribution, since
1) $P(x \mid f) \geq 0$, being the integral of a nonnegative function;
2) $P(x \mid f) \leq 1$, as
$$P(x \mid f) \leq \int \frac{1}{|\tilde{P}|}\sum_{i=1}^{|\tilde{P}|} 1 \; d\nu(\tilde{P}, h \mid f) = 1;$$
3) $P(x \mid f)$ sums to one, because
$$\sum_x P(x \mid f) = \sum_x \int \frac{1}{|\tilde{P}|}\sum_{i=1}^{|\tilde{P}|} P_i(x)\, d\nu(\tilde{P}, h \mid f)
= \int \frac{1}{|\tilde{P}|}\sum_{i=1}^{|\tilde{P}|}\left[\sum_x P_i(x)\right] d\nu(\tilde{P}, h \mid f)
= \int \frac{|\tilde{P}|}{|\tilde{P}|}\, d\nu(\tilde{P}, h \mid f) = 1.$$
Finally,
$$P(X \in t \mid f) = \sum_x \mathbb{1}_{x\in t}\, P(x \mid f)
= \sum_x \mathbb{1}_{x\in t} \int \mathbb{E}_{P\sim U(\tilde{P})}[P(x)]\, d\nu(\tilde{P}, h \mid f)
= \sum_x \mathbb{1}_{x\in t} \int \frac{1}{|\tilde{P}|}\sum_{i=1}^{|\tilde{P}|} P_i(x)\, d\nu(\tilde{P}, h \mid f)$$
$$= \int \frac{1}{|\tilde{P}|}\sum_{i=1}^{|\tilde{P}|}\left[\sum_x \mathbb{1}_{x\in t}\, P_i(x)\right] d\nu(\tilde{P}, h \mid f)
= \mathbb{E}_{\tilde{P},H}\!\left[\left.\frac{1}{|\tilde{P}|}\sum_{i=1}^{|\tilde{P}|} P_i(X \in t)\,\right|\, f\right]
= q(t,f). \qquad\blacksquare$$
Lemma 2. (Maximum Number of Satisfying Vectors) Given
an integer $1 \leq k \leq n$, a set $\mathcal{S} = \{s : s \in \{0,1\}^n,\ \|s\| = k\}$
of all n-length k-hot binary vectors, a set $\mathcal{P} = \{P : P \in \mathbb{R}^n,\ \sum_j P_j = 1\}$
of discrete n-dimensional simplex vectors,
and a fixed scalar threshold $\epsilon \in [0,1]$, then for any fixed
$P \in \mathcal{P}$,
$$\sum_{s\in\mathcal{S}} \mathbb{1}_{s^\top P \geq \epsilon} \leq \frac{1}{\epsilon}\binom{n-1}{k-1},$$
where $s^\top P$ denotes the vector dot product between s and P.
Proof. For ε = 0, the bound holds trivially. For ε > 0, let S
be a random quantity that takes values s uniformly in the set
$\mathcal{S}$. Then, for any fixed $P \in \mathcal{P}$,
$$\sum_{s\in\mathcal{S}} \mathbb{1}_{s^\top P \geq \epsilon} = \binom{n}{k}\,\mathbb{E}\!\left[\mathbb{1}_{S^\top P \geq \epsilon}\right] = \binom{n}{k}\Pr\!\left(S^\top P \geq \epsilon\right).$$
Let $\mathbf{1}$ denote the all-ones vector. Under a uniform distribution
on the random quantity S, and because P does not change with
respect to s, we have
$$\mathbb{E}\!\left[S^\top P\right] = \binom{n}{k}^{-1}\sum_{s\in\mathcal{S}} s^\top P
= P^\top \binom{n}{k}^{-1}\sum_{s\in\mathcal{S}} s
= P^\top \mathbf{1}\,\binom{n-1}{k-1}\binom{n}{k}^{-1}
= P^\top \mathbf{1}\,\frac{\binom{n-1}{k-1}}{\frac{n}{k}\binom{n-1}{k-1}}
= \frac{k}{n}\,P^\top \mathbf{1}
= \frac{k}{n},$$
since P must sum to 1.
Noting that $S^\top P \geq 0$, we use Markov's inequality to get
$$\sum_{s\in\mathcal{S}} \mathbb{1}_{s^\top P \geq \epsilon} = \binom{n}{k}\Pr\!\left(S^\top P \geq \epsilon\right)
\leq \binom{n}{k}\frac{1}{\epsilon}\,\mathbb{E}\!\left[S^\top P\right]
= \binom{n}{k}\frac{1}{\epsilon}\frac{k}{n}
= \frac{1}{\epsilon}\binom{n-1}{k-1}. \qquad\blacksquare$$
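For small n the lemma can be verified exhaustively; the following sketch (ours, a sanity check rather than part of the proof) enumerates all k-hot vectors against one randomly drawn simplex vector:

```python
from itertools import combinations
from math import comb
import numpy as np

# Exhaustive sanity check of Lemma 2 for small n: count k-hot vectors s
# with s^T P >= eps for one random simplex vector P.
rng = np.random.default_rng(4)
n, k, eps = 10, 3, 0.5
P = rng.dirichlet(np.ones(n))
count = sum(P[list(idx)].sum() >= eps
            for idx in combinations(range(n), k))
print(count, comb(n - 1, k - 1) / eps)   # count <= (1/eps) C(n-1, k-1) = 72
```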
Theorem 1. (Famine of Forte) Define
$$\tau_k = \{T \mid T \subseteq \Omega,\ |T| = k \in \mathbb{N}\}$$
and let $B_m$ denote any set of binary strings, such that the
strings are of length m or less. Let
$$R = \{(T,F) \mid T \in \tau_k,\ F \in B_m\},\ \text{and}$$
$$R_{q_{\min}} = \{(T,F) \mid T \in \tau_k,\ F \in B_m,\ q(T,F) \geq q_{\min}\},$$
where q(T,F) is the expected per-query probability of success
for algorithm A on problem (Ω, T, F). Then for any $m \in \mathbb{N}$,
$$\frac{|R_{q_{\min}}|}{|R|} \leq \frac{p}{q_{\min}}$$
and
$$\lim_{m\to\infty} \frac{|R_{q_{\min}}|}{|R|} \leq \frac{p}{q_{\min}},$$
where $p = k/|\Omega|$.
Proof. We begin by defining a set $\mathcal{S}$ of all |Ω|-length target
functions with exactly k ones, namely, $\mathcal{S} = \{s : s \in \{0,1\}^{|\Omega|},\ \|s\| = k\}$.
For each of these, we have $|B_m|$
external information resources. The total number of search
problems is therefore
$$\binom{|\Omega|}{k}|B_m|. \tag{3}$$
We seek to bound the proportion of possible search problems
for which $q(s,f) \geq q_{\min}$ for any threshold $q_{\min} \in (0,1]$. Thus,
$$\frac{|R_{q_{\min}}|}{|R|} \leq \frac{|B_m|\,\sup_f \sum_{s\in\mathcal{S}} \mathbb{1}_{q(s,f)\geq q_{\min}}}{|B_m|\binom{|\Omega|}{k}} \tag{4}$$
$$= \binom{|\Omega|}{k}^{-1}\sum_{s\in\mathcal{S}} \mathbb{1}_{q(s,f^*)\geq q_{\min}}, \tag{5}$$
where $f^* \in B_m$ denotes the arg sup of the expression.
Therefore,
$$\frac{|R_{q_{\min}}|}{|R|} \leq \binom{|\Omega|}{k}^{-1}\sum_{s\in\mathcal{S}} \mathbb{1}_{q(s,f^*)\geq q_{\min}}
= \binom{|\Omega|}{k}^{-1}\sum_{s\in\mathcal{S}} \mathbb{1}_{P(\omega\in s \mid f^*)\geq q_{\min}}
= \binom{|\Omega|}{k}^{-1}\sum_{s\in\mathcal{S}} \mathbb{1}_{s^\top P_{f^*}\geq q_{\min}},$$
where the first equality follows from Lemma 1, ω ∈ s means
the target function s evaluated at ω is one, and $P_{f^*}$ represents
the |Ω|-length probability vector defined by $P(\cdot \mid f^*)$. By
Lemma 2, we have
$$\binom{|\Omega|}{k}^{-1}\sum_{s\in\mathcal{S}} \mathbb{1}_{s^\top P_{f^*}\geq q_{\min}}
\leq \binom{|\Omega|}{k}^{-1}\frac{1}{q_{\min}}\binom{|\Omega|-1}{k-1}
= \frac{k}{|\Omega|}\cdot\frac{1}{q_{\min}}
= p/q_{\min}, \tag{6}$$
proving the result for finite external information resources.
To extend to infinite external information resources, let
$A_m = \{f : f \in \{0,1\}^\ell,\ \ell\in\mathbb{N},\ \ell\leq m\}$ and define
$$a_m := \frac{|A_m|\,\sup_{f\in A_m}\sum_{s\in\mathcal{S}}\mathbb{1}_{q(s,f)\geq q_{\min}}}{|A_m|\binom{|\Omega|}{k}}, \tag{7}$$
$$b_m := \frac{|B_m|\,\sup_{f\in B_m}\sum_{s\in\mathcal{S}}\mathbb{1}_{q(s,f)\geq q_{\min}}}{|B_m|\binom{|\Omega|}{k}}. \tag{8}$$
We have shown that $a_m \leq p/q_{\min}$ for each $m\in\mathbb{N}$. Thus,
$$\limsup_{m\to\infty} \frac{|A_m|\,\sup_{f\in A_m}\sum_{s\in\mathcal{S}}\mathbb{1}_{q(s,f)\geq q_{\min}}}{|A_m|\binom{|\Omega|}{k}} = \limsup_{m\to\infty} a_m \leq \sup_m a_m \leq p/q_{\min}.$$
Next, we use the monotone convergence theorem to show the
limit exists. First,
$$\lim_{m\to\infty} a_m = \lim_{m\to\infty} \frac{\sup_{f\in A_m}\sum_s \mathbb{1}_{q(s,f)\geq q_{\min}}}{\binom{|\Omega|}{k}}. \tag{9}$$
By construction, the successive $A_m$ are nested with increasing
m, so the sequence of suprema (and numerators) is increasing,
though not necessarily strictly increasing. The denominator
does not depend on m, so $\{a_m\}$ is an increasing sequence.
Because it is also bounded above by $p/q_{\min}$, the limit exists
by monotone convergence. Thus,
$$\lim_{m\to\infty} a_m = \limsup_{m\to\infty} a_m \leq p/q_{\min}.$$
Lastly,
$$\lim_{m\to\infty} b_m = \lim_{m\to\infty} \frac{|B_m|\,\sup_{f\in B_m}\sum_{s\in\mathcal{S}}\mathbb{1}_{q(s,f)\geq q_{\min}}}{|B_m|\binom{|\Omega|}{k}}
= \lim_{m\to\infty} \frac{\sup_{f\in B_m}\sum_{s\in\mathcal{S}}\mathbb{1}_{q(s,f)\geq q_{\min}}}{\binom{|\Omega|}{k}}$$
$$\leq \lim_{m\to\infty} \frac{\sup_{f\in A_m}\sum_{s\in\mathcal{S}}\mathbb{1}_{q(s,f)\geq q_{\min}}}{\binom{|\Omega|}{k}}
= \lim_{m\to\infty} a_m \leq p/q_{\min}. \qquad\blacksquare$$
Corollary 1. (Conservation of Active Information) Define active
information of expectations as $I_{q(T,F)} = -\log p/q(T,F)$,
where p is the per-query probability of success for uniform
random sampling and q(T,F) is the expected per-query probability
of success for an alternative search algorithm. Define
$$\tau_k = \{T \mid T \subseteq \Omega,\ |T| = k \in \mathbb{N}\}$$
and let $B_m$ denote any set of binary strings, such that the
strings are of length m or less. Let
$$R = \{(T,F) \mid T \in \tau_k,\ F \in B_m\},\ \text{and}$$
$$R_b = \{(T,F) \mid T \in \tau_k,\ F \in B_m,\ I_{q(T,F)} \geq b\}.$$
Then for any $m \in \mathbb{N}$,
$$\frac{|R_b|}{|R|} \leq 2^{-b}.$$
Proof. The proof follows from the definition of active information
of expectations and Theorem 1. Note,
$$b \leq -\log\frac{p}{q(T,F)} \tag{10}$$
implies
$$q(T,F) \geq p\,2^{b}. \tag{11}$$
Since $I_{q(T,F)} \geq b$ implies $q(T,F) \geq p2^b$, the set of problems
for which $I_{q(T,F)} \geq b$ can be no bigger than the set for which
$q(T,F) \geq p2^b$. By Theorem 1, the proportion of problems for
which $q(T,F)$ is at least $p2^b$ is no greater than $p/(p2^b)$. Thus,
$$\frac{|R_b|}{|R|} \leq \frac{1}{2^b}. \tag{12} \qquad\blacksquare$$
Corollary 2. (Conservation of Expected $I_{\tilde{q}}$) Define
$$I_{\tilde{q}} = -\log p/\tilde{q},$$
where p is the per-query probability of success for uniform
random sampling and $\tilde{q}$ is the per-query probability of success
for an alternative search algorithm. Define
$$\tau_k = \{T \mid T \subseteq \Omega,\ |T| = k \in \mathbb{N}\}$$
and let $B_m$ denote any set of binary strings, such that the
strings are of length m or less. Let
$$R = \{(T,F) \mid T \in \tau_k,\ F \in B_m\},\ \text{and}$$
$$R_b = \{(T,F) \mid T \in \tau_k,\ F \in B_m,\ \mathbb{E}[I_{\tilde{q}}] \geq b\}.$$
Then for any $m \in \mathbb{N}$,
$$\frac{|R_b|}{|R|} \leq 2^{-b}.$$
Proof. By Jensen's inequality and the concavity of $\log(\tilde{q}/p)$
in $\tilde{q}$, we have
$$b \leq \mathbb{E}\left[-\log\frac{p}{\tilde{q}}\right] \leq -\log\frac{p}{\mathbb{E}[\tilde{q}]} = -\log\frac{p}{q(T,F)} = I_{q(T,F)}.$$
The result follows by invoking Corollary 1. $\blacksquare$
Lemma 3. (Maximum Proportion of Satisfying Strategies)
Given an integer $1 \leq k \leq n$, a set $\mathcal{S} = \{s : s \in \{0,1\}^n,\ \|s\| = k\}$
of all n-length k-hot binary vectors, a
set $\mathcal{P} = \{P : P \in \mathbb{R}^n,\ \sum_j P_j = 1\}$ of discrete n-dimensional
simplex vectors, and a fixed scalar threshold $\epsilon \in [0,1]$, then
$$\max_{s\in\mathcal{S}} \frac{\mu(G_{s,\epsilon})}{\mu(\mathcal{P})} \leq \frac{1}{\epsilon}\cdot\frac{k}{n},$$
where $G_{s,\epsilon} = \{P : P \in \mathcal{P},\ s^\top P \geq \epsilon\}$ and μ is Lebesgue
measure.
Proof. Similar results have been proved by others with regard
to No Free Lunch theorems [1], [14], [4], [9], [3]. Our
result concerns the maximum proportion of sufficiently good
strategies (not the mean performance of strategies over all
problems, as in the NFL case) and is a simplification over
previous search-for-a-search results.
For ε = 0, the bound holds trivially. For ε > 0, we first
notice that the $\mu(\mathcal{P})^{-1}$ term can be viewed as a uniform
density over the region of the simplex $\mathcal{P}$, so that the integral
becomes an expectation with respect to this distribution, where
P is drawn uniformly from $\mathcal{P}$. Thus, for any $s \in \mathcal{S}$,
$$\frac{\mu(G_{s,\epsilon})}{\mu(\mathcal{P})} = \int_{\mathcal{P}} \frac{1}{\mu(\mathcal{P})}\mathbb{1}_{s^\top P \geq \epsilon}\, dP
= \mathbb{E}_{P\sim U(\mathcal{P})}\!\left[\mathbb{1}_{s^\top P \geq \epsilon}\right]
= \Pr(s^\top P \geq \epsilon)
\leq \frac{1}{\epsilon}\,\mathbb{E}_{P\sim U(\mathcal{P})}\!\left[s^\top P\right],$$
where the final line follows from Markov's inequality. Since
the symmetric Dirichlet distribution in n dimensions with
parameter α = 1 gives the uniform distribution over the
simplex, we get
$$\mathbb{E}_{P\sim U(\mathcal{P})}[P] = \mathbb{E}_{P\sim\mathrm{Dir}(\alpha=1)}[P] = \frac{\alpha}{\sum_{i=1}^n \alpha}\mathbf{1} = \frac{1}{n}\mathbf{1},$$
where $\mathbf{1}$ denotes the all-ones vector. We have
$$\frac{1}{\epsilon}\,\mathbb{E}_{P\sim U(\mathcal{P})}\!\left[s^\top P\right]
= \frac{1}{\epsilon}\, s^\top \mathbb{E}_{P\sim U(\mathcal{P})}[P]
= \frac{1}{\epsilon}\, s^\top \frac{\mathbf{1}}{n}
= \frac{1}{\epsilon}\cdot\frac{k}{n}. \qquad\blacksquare$$
Theorem 2. (Famine of Favorable Strategies) For any fixed
search problem (Ω, t, f), set of probability mass functions
$\mathcal{P} = \{P : P \in \mathbb{R}^{|\Omega|},\ \sum_j P_j = 1\}$, and a fixed threshold
$q_{\min} \in [0,1]$,
$$\frac{\mu(G_{t,q_{\min}})}{\mu(\mathcal{P})} \leq \frac{p}{q_{\min}},$$
where $G_{t,q_{\min}} = \{P : P \in \mathcal{P},\ t^\top P \geq q_{\min}\}$ and μ is
Lebesgue measure. Furthermore, the proportion of possible
search strategies giving at least b bits of active information of
expectations is no greater than $2^{-b}$.
Proof. Applying Lemma 3, with s = t, ε = q_min, k = |t|,
n = |Ω|, and $p = |t|/|\Omega|$, yields the first result, while following
the same steps as Corollary 1 gives the second (noting that
by Lemma 1 each strategy is equivalent to a corresponding
q(t,f)). $\blacksquare$
Theorem 3. (Probability of Success Under Dependence) Define
$\tau_k = \{T \mid T \subseteq \Omega,\ |T| = k \in \mathbb{N}\}$ and let $B_m$ denote any
set of binary strings, such that the strings are of length m or
less. Define q as the expected per-query probability of success
under the joint distribution on $T \in \tau_k$ and $F \in B_m$ for any
fixed algorithm A, so that $q := \mathbb{E}_{T,F}[q(T,F)]$, namely,
$$q = \mathbb{E}_{T,F}\left[P(\omega \in T \mid F)\right] = \Pr(\omega \in T; A).$$
Then,
$$q \leq \frac{I(T;F) + D(P_T\,\|\,U_T) + 1}{I_\Omega},$$
where $I_\Omega = -\log k/|\Omega|$, $D(P_T\,\|\,U_T)$ is the Kullback-Leibler
divergence between the marginal distribution on T and the
uniform distribution on T, and I(T;F) is the mutual information.
Alternatively, we can write
$$q \leq \frac{H(U_T) - H(T \mid F) + 1}{I_\Omega},$$
where $H(U_T) = \log\binom{|\Omega|}{k}$.
Proof. This proof loosely follows that of Fano's Inequality [26],
being a reversed generalization of it, so in keeping
with the traditional notation we let X := ω for the remainder
of this proof. Let $Z = \mathbb{1}(X \in T)$. Using the chain rule for
entropy to expand $H(Z,T \mid X)$ in two different ways, we get
$$H(Z,T \mid X) = H(Z \mid T,X) + H(T \mid X) \tag{13}$$
$$= H(T \mid Z,X) + H(Z \mid X). \tag{14}$$
By definition, $H(Z \mid T,X) = 0$, and by the data processing
inequality $H(T \mid F) \leq H(T \mid X)$. Thus,
$$H(T \mid F) \leq H(T \mid Z,X) + H(Z \mid X). \tag{15}$$
Define $P_g = \Pr(X \in T; A) = \Pr(Z = 1)$. Then,
$$H(T \mid Z,X) = (1-P_g)H(T \mid Z=0,X) + P_g H(T \mid Z=1,X) \tag{16}$$
$$\leq (1-P_g)\log\binom{|\Omega|}{k} + P_g\log\binom{|\Omega|-1}{k-1} \tag{17}$$
$$= \log\binom{|\Omega|}{k} - P_g\log\frac{|\Omega|}{k}. \tag{18}$$
We let $H(U_T) = \log\binom{|\Omega|}{k}$, being the entropy of the uniform
distribution over k-sparse target sets in Ω. Therefore,
$$H(T \mid F) \leq H(U_T) - P_g\log\frac{|\Omega|}{k} + H(Z \mid X). \tag{19}$$
Using the definitions of conditional entropy and $I_\Omega$, we get
$$H(T) - I(T;F) \leq H(U_T) - P_g I_\Omega + H(Z \mid X), \tag{20}$$
which implies
$$P_g I_\Omega \leq I(T;F) + H(U_T) - H(T) + H(Z \mid X) \tag{21}$$
$$= I(T;F) + D(P_T\,\|\,U_T) + H(Z \mid X). \tag{22}$$
Examining $H(Z \mid X)$, we see it captures how much entropy of
Z is due to the randomness of T. To see this, imagine Ω is
a roulette wheel and we place our bet on X. Target elements
are “chosen” as balls land on random slots, according to the
distribution on T. When a ball lands on X as often as not
(roughly half the time), this quantity is maximized. Thus, this
entropy captures the contribution of dumb luck, being averaged
over all X. (When balls move towards always landing on X,
something other than luck is at work.) We upper bound this by
its maximum value of 1 and obtain
$$\Pr(X \in T; A) \leq \frac{I(T;F) + D(P_T\,\|\,U_T) + 1}{I_\Omega}, \tag{23}$$
and substitute q for $\Pr(X \in T; A)$ to obtain the first result,
noting that $q = \mathbb{E}_{T,F}\left[P(X \in T \mid F)\right]$ specifies a proper
probability distribution by the linearity and boundedness of
the expectation. To obtain the second form, use the definitions
$I(T;F) = H(T) - H(T \mid F)$ and $D(P_T\,\|\,U_T) = H(U_T) - H(T)$. $\blacksquare$
Article
Bayesian optimization techniques have been successfully applied to robotics, planning, sensor placement, recommendation, advertising, intelligent user interfaces and automatic algorithm configuration. Despite these successes, the approach is restricted to problems of moderate dimension, and several workshops on Bayesian optimization have identified its scaling to high-dimensions as one of the holy grails of the field. In this paper, we introduce a novel random embedding idea to attack this problem. The resulting Random EMbedding Bayesian Optimization (REMBO) algorithm is very simple, has important invariance properties, and applies to domains with both categorical and continuous variables. We present a thorough theoretical analysis of REMBO. Empirical results confirm that REMBO can efiectively solve problems with billions of dimensions, provided the intrinsic dimensionality is low. They also show that REMBO achieves state-of-the-art performance in optimizing the 47 discrete parameters of a popular mixed integer linear programming solver.
Conference Paper
This paper provides a general framework for understanding targeted search. It begins by defining the search matrix, which makes explicit the sources of information that can affect search progress. The search matrix enables a search to be represented as a probability measure on the original search space. This representation facilitates tracking the information cost incurred by successful search (success being defined as finding the target). To categorize such costs, various information and efficiency measures are defined, notably, active information. Conservation of information characterizes these costs and is precisely formulated via two theorems, one restricted (proved in previous work of ours), the other general (proved for the first time here). The restricted version assumes a uniform probability search baseline, the general, an arbitrary probability search baseline. When a search with probability q of success displaces a baseline search with probability p of success where q > p, conservation of information states that raising the probability of successful search by a factor of q/p(>1) incurs an information cost of at least log (q/p). Conservation of information shows that information, like money, obeys strict accounting principles.
Article
Grid search and manual search are the most widely used strategies for hyper-parameter optimization. This paper shows empirically and theoretically that randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid. Empirical evidence comes from a comparison with a large previous study that used grid search and manual search to configure neural networks and deep belief networks. Compared with neural networks configured by a pure grid search, we find that random search over the same domain is able to find models that are as good or better within a small fraction of the computation time. Granting random search the same computational budget, random search finds better models by effectively searching a larger, less promising configuration space. Compared with deep belief networks configured by a thoughtful combination of manual search and grid search, purely random search over the same 32-dimensional configuration space found statistically equal performance on four of seven data sets, and superior performance on one of seven. A Gaussian process analysis of the function from hyper-parameters to validation set performance reveals that for most data sets only a few of the hyper-parameters really matter, but that different hyper-parameters are important on different data sets. This phenomenon makes grid search a poor choice for configuring algorithms for new data sets. Our analysis casts some light on why recent "High Throughput" methods achieve surprising success--they appear to search through a large number of hyper-parameters because most hyper-parameters do not matter much. We anticipate that growing interest in large hierarchical models will place an increasing burden on techniques for hyper-parameter optimization; this work shows that random search is a natural baseline against which to judge progress in the development of adaptive (sequential) hyper-parameter optimization algorithms.
Conference Paper
According to the No Free Lunch theorems for search, when uniformly averaged over all possible search functions, every search algorithm has identical search performance for a wide variety of common performance metrics [1], [2], [3], [4]. Differences in performance can arise, however, between two algorithms when performance is measured over non-closed under permutation sets of functions, such as sets consisting of a single function. Using uniform random sampling with replacement as a baseline, we ask how many functions exist such that a search algorithm has better expected performance than random sampling. We define favorable functions as those that allow an algorithm to locate a search target with higher probability than uniform random sampling with replacement, and we bound the proportion of favorable functions for stochastic search methods, including genetic algorithms. Using active information [5] as our divergence measure, we demonstrate that no more than 2-b of all functions are favorable by b or more bits, for b ≥ 2 and reasonably sized search spaces (n ≥ 19). Thus, the proportion of functions for which an algorithm performs relatively well by a moderate degree is strictly bounded. Our results can be viewed as statement of information conservation [6], [7], [1], [8], [5], since identifying a favorable function of b or more bits requires at least b bits of information, under the conditions given.