OPERATIONS RESEARCH
Vol. 00, No. 0, Xxxxx 0000, pp. 000–000
issn 0030-364X | eissn 1526-5463 | 00 | 0000 | 0001
INFORMS
doi 10.1287/xxxx.0000.0000
©0000 INFORMS
Optimal Decision Tree and Submodular Ranking
with Noisy Outcomes
Su Jia, Fatemeh Navidi, Viswanath Nagarajan, R. Ravi
A fundamental task in active learning involves performing a sequence of tests to identify an unknown
hypothesis that is drawn from a known distribution. This problem, known as optimal decision tree induction,
has been widely studied for decades and the asymptotically best-possible approximation algorithm has
been devised for it. We study a generalization where certain test outcomes are noisy, even in the more
general case when the noise is persistent, i.e., repeating a test gives the same noisy output. We design
new approximation algorithms for both the non-adaptive setting, where the test sequence must be fixed
a priori, and the adaptive setting, where the test sequence depends on the outcomes of prior tests. Previous
work in the area assumed at most a logarithmic number of noisy outcomes per hypothesis and provided
approximation ratios that depended on parameters such as the minimum probability of a hypothesis. Our
new approximation algorithms provide guarantees that are nearly best-possible and work for the general case
of a large number of noisy outcomes per test or per hypothesis where the performance degrades smoothly
with this number. In fact, our results hold in a significantly more general setting, where the goal is to cover
stochastic submodular functions. We evaluate the performance of our algorithms on two natural applications
with noise: toxic chemical identification and active learning of linear classifiers. Despite our theoretical
logarithmic approximation guarantees, our methods give solutions with cost very close to the information-theoretic
minimum, demonstrating their effectiveness in practice.
Key words: Approximation Algorithms, Optimal Decision Tree, Submodular Functions, Active Learning
1. Introduction
The classic Optimal Decision Tree (ODT) problem involves identifying an initially unknown hypothesis h that is drawn from a known probability distribution over a set of hypotheses. We can perform tests in order to distinguish between these hypotheses. Each test produces a binary outcome (positive or negative), and the precise outcome of each test-hypothesis pair is known beforehand; thus an instance of ODT can be viewed as a ±1-valued matrix M with the tests as rows and hypotheses as columns. The goal is to identify the true hypothesis h using the fewest tests.
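To make the matrix view concrete, here is a tiny hypothetical instance together with the classic "most balanced split" greedy heuristic (a standard baseline for ODT, not the approximation algorithm developed in this paper):

```python
# Hypothetical tiny ODT instance: rows are tests, columns are hypotheses;
# M[t][h] is the deterministic ±1 outcome of test t on hypothesis h.
M = [
    [+1, +1, -1, -1],  # test 0 separates {0,1} from {2,3}
    [+1, -1, +1, -1],  # test 1 separates {0,2} from {1,3}
]

def identify(M, true_h):
    """Repeatedly perform the unused test that splits the currently
    compatible hypotheses most evenly; return (hypothesis, #tests)."""
    alive = set(range(len(M[0])))
    used = []
    while len(alive) > 1:
        # pick the unused test whose +1/-1 split of `alive` is most balanced
        t = min((t for t in range(len(M)) if t not in used),
                key=lambda t: abs(sum(M[t][h] for h in alive)))
        used.append(t)
        out = M[t][true_h]  # observe the (noiseless) outcome
        alive = {h for h in alive if M[t][h] == out}
    return alive.pop(), len(used)
```

On this instance every hypothesis is identified after exactly two tests, matching the information-theoretic bound log2(4) = 2.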
As a motivating application, consider the following task in medical diagnosis detailed in Loveland
(1985). A doctor needs to diagnose a patient’s disease by performing tests. Given an a priori
probability distribution over possible diseases, what sequence of tests should the doctor perform to
identify the disease as quickly as possible? Another application is in active learning (e.g. Dasgupta
(2005)). Given a set of data points, one wants to learn a classifier that labels the points correctly
as positive or negative. There is a set of m possible classifiers, which is assumed to contain the
true classifier. In the Bayesian setting, which we consider, the true classifier is drawn from some
known probability distribution. The goal is to identify the true classifier by querying labels at the
minimum number of points in expectation (over the prior distribution). Other applications include
entity identification in databases (Chakaravarthy et al. (2011)) and experimental design to choose
the most accurate theory among competing candidates (Golovin et al. (2010)).
Despite the considerable literature on the classic ODT problem, an important issue that is not
considered is that of unknown or noisy outcomes. In fact, our research was motivated by a dataset involving toxic chemical identification where the outcomes of many hypothesis-test pairs are
stated as unknown (see Section 6 for details). While prior work incorporating noise in ODT, for
example Golovin et al. (2010), was restricted to settings with very few noisy outcomes, in this
paper, we design approximation algorithms for the noisy optimal decision tree problem in full
generality.
Specifically, we generalize the ODT problem to allow unknown/noisy entries (denoted by "∗") in the test-hypothesis matrix M, to obtain the Optimal Decision Tree with Noise (ODTN) problem, in which the outcome of each noisy entry in the test-hypothesis matrix M is a random ±1 value, independent of other noisy entries. More precisely, if the entry M_{t,h} = ∗ (for hypothesis h and test t) and the realized hypothesis is h, then the outcome of t will be a random ±1 value. We will assume for simplicity that each noisy outcome is ±1 with uniform probability, though our results extend directly to the case where each noisy outcome has a different probability. We consider the standard persistent noise model, where repeating the same test always produces the same outcome. Note that this model is more general than non-persistent noise (where repeating a noisy test leads to "fresh" independent ±1 outcomes), since one may create copies of tests and hypotheses to reduce to the persistent noise model.
We consider both non-adaptive policies, where the test sequence is fixed upfront, and adaptive
policies, where the test sequence is built incrementally and depends on observed test outcomes.
Evidently, adaptive policies perform at least as well as non-adaptive ones. Indeed, there exist
instances where the relative gap between the best adaptive and non-adaptive policies is very large
(see for example, Dasgupta (2005)). However, non-adaptive policies are very simple to implement,
requiring minimal incremental computation, and may be preferred in time-sensitive applications.
In fact, our results hold in a significantly more general setting, where the goal is to cover stochastic
submodular functions. In the absence of noisy outcomes, the non-adaptive and adaptive versions
of this problem were studied by Azar and Gamzu (2011) and Navidi et al. (2020). Other than
the ODT problem, this submodular setting captures a number of applications such as multiple-
intent search ranking, decision region determination and correlated knapsack cover: see Navidi
et al. (2020) for details. Our work is the first to handle noisy outcomes in all these applications.
1.1. Contributions
We derive most of our results for the ODTN problem as corollaries of a more general problem, Submodular Function Ranking with Noisy Outcomes, which is a natural extension of the Submodular Function Ranking problem introduced by Azar and Gamzu (2011). We first state our results before formally defining this problem in Section 2.3.
First, we obtain an O(log(1/ε))-approximation algorithm (see Theorem 3) for Non-Adaptive Submodular Function Ranking with noisy outcomes (SFRN), where ε is a separability parameter of the underlying submodular functions. As a special case, for the ODTN problem (both adaptive and non-adaptive), we consider submodular functions with separability ε = 1/m, so the above result immediately implies an O(log m)-approximation for non-adaptive ODTN. This bound is the best possible (up to constant factors) even in the noiseless case, assuming P ≠ NP.
As our second contribution, we obtain an O(min{c log |Ω|, r} + log(m/ε))-approximation algorithm (Theorem 7) for Adaptive Submodular Ranking with noisy outcomes (ASRN), where Ω is the set of random outcomes we may observe when selecting elements; this implies an O(min{c, r} + log m) bound for ODTN by setting ε = 1/m. The term min{c log |Ω|, r} corresponds to the "noise sparsity" of the instance (see Section 2 for formal definitions). For the ODTN problem, c (resp. r) is the maximum number of noisy outcomes in each column (resp. row) of the test-hypothesis matrix M. In the noiseless case, c = r = 0 and our result matches the best approximation ratio for the ODT and Adaptive Submodular Ranking problems (Navidi et al. (2020)). In the noisy case, our performance guarantee degrades smoothly with the noise sparsity. For example, we obtain a logarithmic approximation ratio (which is the best possible) as long as the number of noisy outcomes in each row or column is at most logarithmic. For ODTN, Golovin et al. (2010) obtained an O(log²(1/p_min))-approximation algorithm which is polynomial-time only when c = O(log m); here p_min ≥ 1/m is the minimum probability of any hypothesis. Our result improves on this in that (i) the running
time is polynomial irrespective of the number of noisy outcomes and (ii) the approximation ratio
is better by at least one logarithmic factor.
While the above algorithm admits a nice approximation ratio when there are few noisy entries
in each row or column of M, as our third contribution, we consider the other extreme, when each
test has only a few deterministic entries (or equivalently, a large number of noisy outcomes). Here,
we focus on the special case of ODTN. At first sight, higher noise seems to only render the problem
more challenging, but somewhat surprisingly, we obtain a much better approximation ratio in this
regime. Specifically, if the number of noisy outcomes in each test is at least m − O(√m), we obtain
an approximation algorithm whose cost is O(log m) times the optimum and returns the target
hypothesis with high probability. We establish this result by relating the cost to a Stochastic Set
Cover instance, whose cost lower-bounds that of the ODTN instance.
Finally, we tested our algorithms on synthetic datasets as well as a real dataset (arising in toxic chemical identification). We compared the empirical performance of our algorithms to an
information-theoretic lower bound. The cost of the solution returned by our non-adaptive algorithm
is typically within 50% of this lower bound, and typically within 20% for the adaptive algorithm,
demonstrating the effective practical performance of our algorithms.
As a final remark, although in this work we consider the uniform distribution for noisy outcomes, our results extend directly to the case where each noisy outcome has a different probability of being ±1. Suppose that the probability of every noisy outcome is between δ and 1 − δ. Then our results on ASRN continue to hold, irrespective of δ, and the result for the many-unknowns version holds with a slightly worse O((1/δ) log m) approximation ratio.
1.2. Related Work
The optimal decision tree problem (without noise) has been extensively studied for several decades:
see Garey and Graham (1974), Hyafil and Rivest (1976/77), Loveland (1985), Arkin et al. (1998),
Kosaraju et al. (1999), Adler and Heeringa (2008), Chakaravarthy et al. (2009), Gupta et al. (2017).
The state-of-the-art result, due to Gupta et al. (2017), is an O(log m)-approximation for instances with
arbitrary probability distributions and costs. Chakaravarthy et al. (2011) also showed that ODT
cannot be approximated to a factor better than Ω(log m), unless P = NP.
The application of ODT to Bayesian active learning was formalized in Dasgupta (2005). There
are also several results on the statistical complexity of active learning, e.g., Balcan et al. (2006),
Hanneke (2007), Nowak (2009), where the focus is on proving bounds for structured hypothesis
classes. In contrast, we consider arbitrary hypothesis classes and obtain computationally efficient
policies with provable approximation bounds relative to the optimal (instance specific) policy. This
approach is similar to that in Dasgupta (2005), Guillory and Bilmes (2009), Golovin and Krause
(2011), Golovin et al. (2010), Cicalese et al. (2014), Javdani et al. (2014).
The noisy ODT problem was studied previously in Golovin et al. (2010). Using a connection to adaptive submodularity, Golovin and Krause (2011) obtained an O(log²(1/p_min))-approximation algorithm for noisy ODT in the presence of very few noisy outcomes, where p_min ≥ 1/m is the minimum probability of any hypothesis. In particular, the running time of the algorithm in Golovin et al. (2010) is exponential in the number of noisy outcomes per hypothesis, which is polynomial only if this number is at most logarithmic in the number of hypotheses/tests. As noted earlier, our result improves both the running time (it is now polynomial for any number of noisy outcomes) and the approximation ratio. We note that an O(log m) approximation ratio (still only for very sparse noise) follows from work on the "equivalence class determination" problem by Cicalese et al. (2014). For this setting, our result is also an O(log m) approximation, but our algorithm is simpler. More importantly, ours is the first result that can handle any number of noisy outcomes.
(Footnote: The paper Golovin et al. (2010) states the approximation ratio as O(log(1/p_min)) because it relied on an erroneous claim in Golovin and Krause (2011). The correct approximation ratio, based on Nan and Saligrama (2017) and Golovin and Krause (2017), is O(log²(1/p_min)).)
Other variants of noisy ODT have also been considered, e.g. Naghshvar et al. (2012), Bellala et al.
(2011), Chen et al. (2017), where the goal is to identify the correct hypothesis with at least some
target probability. The theoretical results in Chen et al. (2017) provide “bicriteria” approximation
bounds where the algorithm has a larger error probability than the optimal policy. Our setting is
different because we enforce zero probability of error.
Many algorithms for ODT (including ours) rely on some underlying submodularity properties.
We briefly survey some background results. In the basic Submodular Cover problem, we are given
a set of elements and a submodular function f. The goal is to use the minimum number of elements to make the value of f reach a certain threshold. Wolsey (1982) first considered this problem and proved that the natural greedy algorithm is a (1 + ln(1/ε))-approximation algorithm, where ε is the minimal positive marginal increment of the function. As a natural generalization, in the Submodular Function Ranking problem we are given multiple submodular functions, and need to sequentially select elements so as to minimize the total cover time of these functions. Azar and Gamzu (2011) obtained an O(log(1/ε))-approximation algorithm for this problem, and Im et al. (2016) extended this result to also handle costs. More recently, Navidi et al. (2020) studied an adaptive version of the submodular ranking problem.
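Wolsey's greedy rule described above can be sketched as follows; the set system is a made-up coverage instance, not one from the paper:

```python
def greedy_cover(elements, f):
    """Wolsey's greedy rule for Submodular Cover: repeatedly add the element
    with the largest marginal gain until f reaches its maximum value 1."""
    S, order = set(), []
    while f(S) < 1.0:
        e = max((x for x in elements if x not in S),
                key=lambda x: f(S | {x}) - f(S))
        S.add(e)
        order.append(e)
    return order

# Toy coverage function (monotone submodular): the fraction of {0,...,4}
# hit by the chosen sets; the set system here is made up for illustration.
sets = {"a": {0, 1, 2}, "b": {2, 3}, "c": {3, 4}, "d": {0}}
f = lambda S: len(set().union(*(sets[x] for x in S))) / 5 if S else 0.0
```

Wolsey's guarantee says this greedy order uses at most (1 + ln(1/ε)) times the optimal number of elements; for this toy f the separability parameter is ε = 1/5.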
Finally, we note that there is also work on minimizing the worst-case (instead of average case)
cost in ODT and active learning; see e.g., Moshkov (2010), Saettler et al. (2017), Guillory and
Bilmes (2010, 2011). These results are incomparable to ours because we are interested in the average
case, i.e. minimizing expected cost.
2. Preliminaries
2.1. Optimal Decision Tree with Noise
In the Optimal Decision Tree with Noise (ODTN) problem, we are given a set of m possible hypotheses with a prior probability distribution {π_i}_{i=1}^m, from which an unknown hypothesis ī is drawn. There is also a set T of n binary tests, each test T ∈ T associated with a 3-way partition (T+, T−, T∗) of [m], where the outcome of test T is:
• positive if ī ∈ T+,
• negative if ī ∈ T−, and
• positive or negative with probability 1/2 each if ī ∈ T∗ (noisy outcomes).
We assume that, conditioned on ī, the noisy outcomes are independent. The outcomes for all test-hypothesis pairs can be summarized in a {+1, −1, ∗}-valued n×m matrix M.
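The three-way outcome model can be sketched directly; the partition below is a made-up toy example, not taken from the paper's data:

```python
import random

# Hypothetical test on hypotheses {0,...,4}, given by its 3-way partition:
# hypotheses in T_plus always answer +1, those in T_minus always -1,
# and those in T_star are noisy.
T_plus, T_minus, T_star = {0, 1}, {2}, {3, 4}

def outcome(h, rng):
    """Outcome of this test when the realized hypothesis is h.  A noisy
    hypothesis (h in T_star) gets a fair ±1 coin flip; under the persistent
    noise model this flip would be drawn once and then reused on repeats."""
    if h in T_plus:
        return +1
    if h in T_minus:
        return -1
    return rng.choice([+1, -1])

rng = random.Random(0)
```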
While we know the 3-way partition (T+, T−, T∗) for each test T ∈ T upfront, we are not aware of the actual outcomes for the noisy test-hypothesis pairs. It is assumed that the realized hypothesis ī can be uniquely identified by performing all tests, regardless of the outcomes of ∗-tests. This means that for every pair i, j ∈ [m] of hypotheses, there is some test T ∈ T with i ∈ T+ and j ∈ T−, or vice-versa. We show how to relax this "identifiability" assumption in Appendix E. The goal is to perform a sequence of tests to identify hypothesis ī using the minimum expected number of tests, which will be formally defined soon. Note that the expectation is taken over both the prior distribution of ī and the random outcomes of noisy tests for ī.
Types of Policies. A non-adaptive policy is specified by a permutation of tests denoting the order in which they will be tried until identification of the underlying hypothesis. The policy performs tests in this sequence and eliminates incompatible hypotheses until there is a unique compatible hypothesis (which is ī). Note that the number of tests performed under such a policy is still random, as it depends on ī and the outcomes of noisy tests.
An adaptive policy chooses tests incrementally, depending on prior test outcomes. The state of a policy is a tuple (E, d) where E ⊆ T is a subset of tests and d ∈ {±1}^E denotes the observed outcomes of the tests in E. An adaptive policy is specified by a mapping Φ : 2^T × {±1}^T → T from states to tests, where Φ(E, d) is the next test to perform at state (E, d). Define the (random) cost Cost(Φ) of a policy Φ to be the number of tests performed until ī is uniquely identified, i.e., all other hypotheses have been eliminated. The goal is to find a policy Φ with minimum E[Cost(Φ)]. Again, the expectation is over the prior distribution of ī as well as the outcomes of noisy tests.
Equivalently, we can view a policy as a decision tree with nodes corresponding to states, labels at
nodes representing the test performed at that state and branches corresponding to the ±1 outcome
at the current test. In particular, a non-adaptive policy is simply a decision tree where all nodes
on each level are labelled with the same test.
As the number of states can be exponential, we cannot hope to specify arbitrary adaptive policies. Instead, we want implicit policies Φ, where given any state (E, d), the test Φ(E, d) can be computed efficiently. This would imply that the total time taken on any decision path is polynomial. We note that an optimal policy Φ∗ can be very complex and the map Φ∗(E, d) may not be efficiently computable. We will still compare the performance of our (efficient) policy to Φ∗.
Noise Model. In this paper, we consider the persistent noise model. That is, repeating a test T with ī ∈ T∗ always produces the same outcome. An alternative model is non-persistent noise, where each run of test T with ī ∈ T∗ produces an independent random outcome. The persistent noise model is more appropriate for handling missing data. It also contains the non-persistent noise model as a special case (by introducing multiple tests with identical partitions). The persistent-noise model is also more challenging from an algorithmic point of view.
In fact, our results hold in a substantially more general setting (than ODT), that of covering
arbitrary submodular functions. In Section 2.2 we first describe this setting in the noiseless case,
which is well-understood (prior to our work). Then, in Section 2.3 we describe the setting with
noisy outcomes, which is the focus of our paper.
2.2. Adaptive Submodular Ranking (Noiseless Case)
We now review the (non-adaptive and adaptive) Submodular Ranking problems introduced by
Azar and Gamzu (2011) and Navidi et al. (2020) respectively.
Submodular Function Ranking. An instance of Submodular Function Ranking (SFR) consists of a ground set of elements [n] := {1, ..., n} and a collection of monotone submodular functions {f_1, ..., f_m}, f_i : 2^[n] → [0, 1], with f_i(∅) = 0 and f_i([n]) = 1 for all i ∈ [m]. Each i ∈ [m] is called a scenario. An unknown target scenario ī is drawn from a known distribution {π_i} over [m].
A solution to SFR is a permutation σ = (σ(1), ..., σ(n)) of the elements. Given any such permutation, the cover time of scenario i is C(i, σ) := min{t | f_i(σ^t) = 1}, where σ^t = (σ(1), ..., σ(t)) is the t-prefix of permutation σ. In words, the cover time is the earliest time at which the value of f_i reaches the unit threshold. The goal is to find a permutation σ of [n] with minimal expected cover time E_ī[C(ī, σ)] = Σ_{i∈[m]} π_i · C(i, σ).
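The cover-time definition translates directly into code; the coverage function below is a made-up modular (hence submodular) example:

```python
def cover_time(perm, f):
    """Earliest t with f(t-prefix of perm) = 1 -- the cover time C(i, sigma)
    defined in the text; returns None if the function is never covered."""
    for t in range(1, len(perm) + 1):
        if f(set(perm[:t])) >= 1.0:
            return t
    return None

# Toy scenario on ground set {0,1,2,3}: f is the covered fraction of a
# hypothetical target set A; it is modular, hence monotone submodular.
A = {1, 3}
f = lambda S: len(S & A) / len(A)
```

The SFR objective is then just the π-weighted average of `cover_time` over each scenario's function.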
The separability parameter ε > 0 is defined as the minimum positive marginal increment of any function, i.e., ε := min{ f_i(S ∪ {e}) − f_i(S) : S ⊆ [n], i ∈ [m], e ∈ [n], f_i(S ∪ {e}) − f_i(S) > 0 }. We will use the following result.
Theorem 1 (Azar and Gamzu (2011)). There is an O(log(1/ε))-approximation algorithm for SFR.
Adaptive Submodular Ranking. In the Adaptive Submodular Ranking (ASR) problem, in addition to the above input to SFR, for each scenario i ∈ [m] we are given a response function r_i : [n] → Ω, where Ω is a finite set of outcomes (or responses; we use the two terms interchangeably). A solution to ASR is an adaptive sequence of elements: the sequence is adaptive because it can depend on the outcomes from previous elements. When the policy selects an element e ∈ [n], it receives an outcome ō = r_ī(e) ∈ Ω, whereby any scenario i with r_i(e) ≠ ō can be ruled out.
The state of an adaptive policy is a tuple (E, d) where E ⊆ [n] is the subset of previously selected elements and d ∈ Ω^E denotes the observed responses on E. An adaptive policy is then specified by a mapping Φ : 2^[n] × Ω^[n] → [n] from states to elements, where Φ(E, d) is the next element to select at state (E, d). Note that any adaptive policy Φ induces, for each scenario i, a unique sequence σ_i of elements that will be selected if the target scenario ī = i. The cover time of i is defined as C(i, Φ) := min{t | f_i(σ_i^t) = 1}. The goal is to find a policy Φ with minimal expected cover time Σ_{i∈[m]} π_i · C(i, Φ). We will use the following result in Section 4.
Theorem 2 (Navidi et al. (2020)). There is an O(log(m/ε))-approximation algorithm for ASR.
As discussed in Navidi et al. (2020), the optimal decision tree problem (without noise) is a special
case of ASR. We show later that even the noisy version ODTN can be reduced to a noisy variant
of ASR (which we define next).
2.3. Adaptive Submodular Ranking with Noise
In this paper, we introduce a new variant of ASR by incorporating noisy outcomes, which generalizes the ODTN problem.
ASR with Noise. An instance of the Adaptive Submodular Ranking with Noise (ASRN) problem consists of a ground set of elements [n], a finite set Ω of outcomes, and a collection of monotone submodular functions {f_1, ..., f_m}, where each f_i : 2^([n]×Ω) → [0, 1] satisfies f_i(∅) = 0 and f_i([n]×Ω) = 1. Note that the ground set of each function f_i is [n]×Ω, i.e., all element-outcome pairs. As before, each i ∈ [m] is called a scenario and an unknown target scenario ī is drawn from a given distribution {π_i}_{i=1}^m. For each scenario i ∈ [m], we are given a response function r_i : [n] → Ω ∪ {∗}. When an element e is selected, its outcome is:
• r_i(e) if r_i(e) ∈ Ω, and
• a uniformly random response from Ω if r_i(e) = ∗ (noisy outcome).
The responses can be summarized in an n×m matrix M with entries from Ω ∪ {∗}. Conditioned on ī, we assume that all noisy outcomes are independent. Our results extend to arbitrary distributions for noisy outcomes, but we will work with the uniform case for simplicity.
As in the noiseless case, the state of a policy is a tuple (E, d) where E ⊆ [n] denotes the previously selected elements and d ∈ Ω^E denotes their observed responses. A non-adaptive policy is simply given by a permutation of all elements and involves selecting elements in this (static) sequence. An adaptive policy is a mapping Φ : 2^[n] × Ω^[n] → [n], where Φ(E, d) is the next element to select at state (E, d). Scenario i is said to be covered in state (E, d) if f_i({(e, d_e) : e ∈ E}) = 1, i.e., function f_i is covered by the element-response pairs observed so far. The goal is to cover the target scenario ī using the minimum expected number of elements.
Unlike the noiseless case, in ASRN each scenario i may trace multiple paths in the decision tree corresponding to policy Φ. However, if we condition on the responses ω ∈ Ω^n from all elements, each scenario i traces a unique path, corresponding to a sequence σ_{i,ω} of element-response pairs. The cover time of scenario i under ω is defined as C(i, Φ | ω) := min{t | f_i(σ_{i,ω}^t) = 1}, where σ_{i,ω}^t consists of the first t element-response pairs in σ_{i,ω}. The expected cover time of scenario i is ECT(i, Φ) := Σ_{ω∈Ω^n} Pr(ω | i) · C(i, Φ | ω), where Pr(ω | i) is the probability of observing responses ω conditioned on ī = i. Finally, the expected cost of policy Φ is Σ_{i∈[m]} π_i · ECT(i, Φ).
For each scenario i, we assume that the function f_i can always be covered irrespective of the noisy outcomes (when ī = i). In other words, for any i ∈ [m] and ω ∈ Ω^n that is consistent with scenario i (i.e., ω_e = r_i(e) for each e with r_i(e) ≠ ∗), we must have f_i({(e, ω_e) : e ∈ [n]}) = 1. In the absence of this assumption, the optimal value (as defined above) would be unbounded.
Connection to ODTN. The ODTN problem can be cast as a special case of the ASRN problem, where the n tests T in ODTN correspond to the elements [n] in ASRN, and the m hypotheses in ODTN correspond to the scenarios in ASRN, with the same prior distribution. The outcome set is Ω = {±1}. Define the response function for each test T ∈ T as follows. Let (T+, T−, T∗) be the 3-way partition of [m] for test T. For any hypothesis (scenario) i ∈ [m], define r_i(T) = o if i ∈ T^o, for each o ∈ {+1, −1, ∗}. For any i ∈ [m], define the submodular function
f_i(S) = (1/(m−1)) · | ( ∪_{T : (T,+1)∈S} T− ∪ ∪_{T : (T,−1)∈S} T+ ) \ {i} |,  for S ⊆ T × {+1, −1}.
Note that the element-outcome pairs here form the ground set U = T × {+1, −1}. It is easy to see that each function f_i : 2^U → [0, 1] is monotone and submodular. Also, these functions f_i happen to be uniform for all i. Moreover, the separability parameter is ε = 1/(m−1). Crucially, f_i(S) equals the fraction of hypotheses (other than i) that are incompatible with at least one outcome in S: for example, if S has a positive outcome (T, +1), then the hypotheses in T− are incompatible (similarly for negative outcomes). So f_i has value one exactly when i is identified as the only compatible hypothesis. By the assumption that the target hypothesis can be uniquely identified, the function f_i can be covered (i.e., reaches value one) irrespective of the noisy outcomes.
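The reduction can be sketched as follows; the instance (tests and their partitions) is hypothetical, and `f` implements the fraction-of-eliminated-hypotheses function f_i described in the text:

```python
# Hypothetical ODTN instance with m = 3 hypotheses {0,1,2} and two tests,
# each given by its 3-way partition (T_plus, T_minus, T_star).
tests = {
    "T1": ({0}, {1}, {2}),
    "T2": ({1}, {2}, {0}),
}
m = 3

def f(i, S):
    """f_i(S) from the reduction: the fraction of the other m-1 hypotheses
    ruled out by the element-outcome pairs in S, where S is a set of
    (test, outcome) pairs with outcome in {+1, -1}."""
    eliminated = set()
    for name, out in S:
        T_plus, T_minus, T_star = tests[name]
        # a +1 outcome eliminates T_minus; a -1 outcome eliminates T_plus
        eliminated |= T_minus if out == +1 else T_plus
    eliminated.discard(i)  # f_i only counts hypotheses other than i
    return len(eliminated) / (m - 1)
```

Here f(i, S) reaches 1 exactly when both other hypotheses have been eliminated, i.e., when i is the only compatible hypothesis left.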
2.4. Expanded Scenario Set
In our analysis of both the non-adaptive and adaptive ASRN problems, we will consider an equivalent noiseless ASR instance. Let I be a given ASRN instance with scenarios [m]. The ASR instance J considers an expanded set of scenarios. For any scenario i ∈ [m], define
Ω(i) := {ω ∈ Ω^n : ω_e = r_i(e) for all e ∈ [n] with r_i(e) ≠ ∗},
denoting all outcome vectors that are consistent with scenario i. For any ω ∈ Ω(i), the expanded scenario (i, ω) corresponds to the original scenario i ∈ [m] when the outcome of each element e is ω_e. Note that an expanded scenario also fixes all noisy outcomes. We write H_i := {(i, ω) : ω ∈ Ω(i)} and H = ∪_{i=1}^m H_i for the set of all expanded scenarios.
To define the prior distribution in the ASR instance, let c_i = |{e ∈ [n] : r_i(e) = ∗}| be the number of noisy outcomes for i ∈ [m]. Since the outcome of any ∗-element for i is uniformly drawn from Ω, each of the |Ω|^{c_i} possible expanded scenarios for i occurs with the same probability π_{i,ω} = π_i / |Ω|^{c_i}. To complete the reduction, for each (i, ω) ∈ H, we define the response function
r_{i,ω} : [n] → Ω,  r_{i,ω}(e) = ω_e,  ∀ e ∈ [n],
and the submodular coverage function
f_{i,ω} : 2^[n] → [0, 1],  f_{i,ω}(S) = f_i({(e, ω_e) : e ∈ S}),  ∀ S ⊆ [n].
By this definition, since f_i is monotone and submodular on [n]×Ω, the function f_{i,ω} is also monotone and submodular on [n]. We will work with the ASR (noiseless) instance on the expanded scenarios, with response functions r_{i,ω} and submodular functions f_{i,ω}. In Appendix A, we formally establish the following reduction.
Proposition 1. The ASRN instance I is equivalent to the ASR instance J.
Crucially, the number of expanded scenarios |H| is exponentially large, as |H| = Σ_{i∈[m]} |Ω|^{c_i}. So we cannot merely apply existing algorithms for the noiseless ASR problem. In §3 and §4 we will show different ways of managing the expanded scenarios and obtaining polynomial-time algorithms.
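The expanded-scenario construction can be sketched directly; the response vector and prior below are made up for illustration:

```python
import itertools

# Enumerating the expanded scenarios H_i of one hypothetical ASRN scenario:
# Omega = {+1, -1}, and "*" marks the noisy elements in the response vector.
Omega = (+1, -1)
r = (+1, "*", "*")  # responses r_i over elements 0,1,2; c_i = 2 noisy entries
pi_i = 0.4          # prior probability of scenario i (made up)

def expanded(r, pi_i):
    """Yield every outcome vector omega consistent with r, together with
    its probability pi_{i,omega} = pi_i / |Omega|^{c_i} from the reduction."""
    noisy = [e for e, o in enumerate(r) if o == "*"]
    for fills in itertools.product(Omega, repeat=len(noisy)):
        omega = list(r)
        for e, o in zip(noisy, fills):
            omega[e] = o
        yield tuple(omega), pi_i / len(Omega) ** len(noisy)

H_i = list(expanded(r, pi_i))
```

With c_i = 2 noisy entries this yields |Ω|^2 = 4 expanded scenarios, each with probability 0.4/4 = 0.1, illustrating why |H| blows up exponentially in c_i.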
3. Nonadaptive Algorithm
The main result in this section is an O(log(1/ε))-approximation for Non-Adaptive Submodular Function Ranking with noisy outcomes (SFRN), where ε > 0 is the separability parameter of the submodular functions. By Proposition 1, the SFRN problem is equivalent to the SFR problem on the expanded scenarios. However, as noted above, we cannot use Theorem 1 directly, as the SFR instance has an exponential number of scenarios. Nevertheless, we can obtain the following result.
Theorem 3. There is a $\mathrm{poly}(\frac{1}{\varepsilon}, n, m)$-time $O(\log\frac{1}{\varepsilon})$-approximation for the SFRN problem.

Observe that for ODTN, $\varepsilon = \frac{1}{m-1}$; thus we obtain the following result for ODTN.

Corollary 1. There is an $O(\log m)$-approximation for non-adaptive ODTN.
High Level Ideas. The algorithm of Azar and Gamzu (2011) for SFR is a greedy-style algorithm that in each iteration, having already chosen elements $E$, assigns to each $e \in [n]\setminus E$ a score that measures the coverage gain when it is selected, defined as
$$G_E(e) := \sum_{(i,\omega)\in H:\, f_{i,\omega}(E) < 1} \pi_{i,\omega}\,\frac{f_{i,\omega}(\{e\}\cup E) - f_{i,\omega}(E)}{1 - f_{i,\omega}(E)} \;=\; \sum_{(i,\omega)\in H} \pi_{i,\omega}\cdot\Delta_E(i,\omega; e), \qquad (1)$$
where
$$\Delta_E(i,\omega; e) = \begin{cases} \dfrac{f_{i,\omega}(\{e\}\cup E) - f_{i,\omega}(E)}{1 - f_{i,\omega}(E)}, & \text{if } f_{i,\omega}(E) < 1;\\[2pt] 0, & \text{otherwise}. \end{cases} \qquad (2)$$
The algorithm then selects the element with the maximum score.
Since this summation involves exponentially many terms, we do not know how to compute the exact value of (1) in polynomial time. However, using the fact that $G_E(e)$ is the expectation of $\Delta_E(i,\omega; e)$ over the expanded scenarios $(i,\omega) \in H$, we will show how to obtain a randomized constant-factor approximate maximizer by sampling from $H$. Moreover, we use the following extension of Theorem 1, which follows directly from the analysis in Im et al. (2016).

Theorem 4 (Azar and Gamzu (2011), Im et al. (2016)). Consider the SFR algorithm that selects, at each step, an element $e$ with $G_E(e) \ge \Omega(1)\cdot\max_{e'\in U} G_E(e')$. This is an $O(\log\frac{1}{\varepsilon})$-approximation algorithm.

Consequently, if we always find an approximate maximizer of $G_E(e)$ by sampling, then Theorem 3 would follow from Theorem 4. However, this sampling approach alone is not sufficient, because it can fail when the value $G_E(e)$ is very small. In order to deal with this, a key observation is that when the
Algorithm 1 Non-adaptive SFRN algorithm.
1: Initialize $E \leftarrow \emptyset$ and sequence $\sigma = \emptyset$.
2: while $E \neq [n]$ do  ▷ Phase 1 begins
3:   For each $e \in [n]$, compute an estimate $\widetilde{G}_E(e)$ of the score $G_E(e)$ by sampling from $H$ independently $N = m^3 n^4 \varepsilon^{-1}$ times.
4:   Let $e^*$ denote the element $e \in [n]\setminus E$ that maximizes $\widetilde{G}_E(e)$.
5:   if $\widetilde{G}_E(e^*) \ge \frac{1}{4m^2n^4\varepsilon}$ then
6:     Update $E \leftarrow E\cup\{e^*\}$ and append $e^*$ to sequence $\sigma$.
7:   else
8:     Exit the while loop.  ▷ Phase 1 ends
9: Append the elements in $[n]\setminus E$ to sequence $\sigma$ in arbitrary order.  ▷ Phase 2
10: Output non-adaptive sequence $\sigma$.
score $G_E(e)$ is small for all elements $e$, then it must be that (with high probability) the already-selected elements $E$ have covered the target scenario $\bar{i}$, so any future elements would not affect the expected cover time. The formal analysis is given in Appendix B.
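The sampling idea behind Phase 1 can be sketched as follows, on a toy identification-style instance with an illustrative coverage function and a much smaller sample count than the $N = m^3n^4\varepsilon^{-1}$ used in the analysis.

```python
import random

# Monte Carlo estimate of G_E(e): sample scenarios instead of enumerating
# the (exponentially many) expanded scenarios. Instance is illustrative.
random.seed(0)
scenarios = list(range(5))
prior = [1 / 5] * 5

def coverage(i, E):
    # Toy identification-style function: scenario i is covered once i is in E.
    return 1.0 if i in E else 0.0

def estimate_score(e, E, num_samples=2000):
    """Monte Carlo estimate of G_E(e) = E_i[ Delta_E(i; e) ]."""
    total = 0.0
    for _ in range(num_samples):
        i = random.choices(scenarios, weights=prior)[0]
        cov = coverage(i, E)
        if cov < 1.0:
            total += (coverage(i, E | {e}) - cov) / (1.0 - cov)
    return total / num_samples

E = {0, 1}
# For e outside E, only scenario i = e gains, so G_E(e) = prior[e] = 0.2.
est = estimate_score(2, E)
assert abs(est - 0.2) < 0.05
```

The estimate concentrates around the true score, which is why a sampled approximate maximizer suffices above the Phase-1 threshold.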
4. Adaptive Algorithms
In this section we present the $O\big(\log\frac{m}{\varepsilon} + \min\{c\log|\Omega|, r\}\big)$-approximation for ASRN, where we recall that $c, r$ are the maximum numbers of noisy entries ("stars") per column and per row of the outcome matrix $M$, and $\varepsilon$ is the separability parameter of the submodular functions. We propose two algorithms, achieving $O\big(r + \log\frac{m}{\varepsilon}\big)$ and $O\big(c\log|\Omega| + \log\frac{m}{\varepsilon}\big)$ approximations respectively, which combined imply our main result.

In both algorithms, we maintain the posterior probability of each scenario based on the previous element responses, and use these probabilities to calculate a score for each element, which comprises (i) a term that prioritizes splitting the candidate scenarios in a balanced manner and (ii) terms
corresponding to the expected number of scenarios eliminated. Unlike in the noiseless setting, in ASRN (and ODTN) each scenario may trace multiple paths in the decision tree due to outcome randomness. In fact, each scenario may trace an exponential number of paths in the tree, so a naive generalization of the analysis in Navidi et al. (2020) incurs an extra exponential factor in the approximation ratio.
We circumvent this challenge by reducing to an ASR instance $\mathcal{J}$ (as defined in Proposition 1) using the expanded scenarios. In this way, the noise is removed, since we recall that the outcome of each element is deterministic conditional on any expanded scenario $(i,\omega)$. Our first result, an $O\big(c\log|\Omega| + \log\frac{m}{\varepsilon}\big)$-approximation, then follows from Navidi et al. (2020).

However, as $\mathcal{J}$ involves exponentially many scenarios, a naive implementation of the algorithm in Navidi et al. (2020) leads to exponential running time. To improve the computational efficiency, in Section 4.1 we exploit the special structure of $\mathcal{J}$ and devise a polynomial-time algorithm. Then, in Section 4.2, we propose a slightly different algorithm from that of Navidi et al. (2020), and show an $O\big(r + \log\frac{m}{\varepsilon}\big)$ approximation ratio.
4.1. An $O\big(c\log|\Omega| + \log\frac{m}{\varepsilon}\big)$-Approximation Algorithm
Our first adaptive algorithm is based on the $O\big(\log\frac{m}{\varepsilon}\big)$-approximation algorithm for ASR from Navidi et al. (2020), formally stated as Algorithm 2. Applying this result to the instance $\mathcal{J}$ and recalling that $|H| \le |\Omega|^c\cdot m$, we immediately obtain the desired guarantee. Their algorithm, rephrased in our notation, maintains the set $H' \subseteq H$ of all expanded scenarios that are consistent with all the observed outcomes, and iteratively selects the element with maximum score, as defined in (3). As the heart of the algorithm, this score strikes a balance between covering the submodular functions of the consistent scenarios and shrinking $H'$, hence reducing the uncertainty in the target scenario. (We use the subscript $c$ to distinguish from the score function $\mathrm{Score}_r$ considered in Section 4.2; for ease of notation, we suppress the subscript in this subsection.) The second term in $\mathrm{Score}_c$, similar to the score in our non-adaptive algorithm (Algorithm 1), involves the sum of the incremental coverage (for selecting $e$) over all uncovered expanded scenarios, weighted by their current coverage, with higher weights on the expanded scenarios closer to being covered.
To interpret the first term in $\mathrm{Score}_c$, let us for simplicity assume $\Omega = \{\pm 1\}$ and that $\pi_{i,\omega}$ is uniform over $H$. Upon selecting an element, $H'$ is split into two subsets, among which $L_e(H')$ is the lighter in cardinality, or equivalently (since we just assumed $\pi_{i,\omega}$ to be uniform) in total prior probability. Thus, this term is simply the number of expanded scenarios eliminated in the worst case (over the outcomes in $\Omega$). This is reminiscent of the greedy algorithm for the ODT problem (e.g., Kosaraju et al. (1999)), which iteratively selects a test that maximizes the number of scenarios ruled out, in the worst case over all test outcomes. Evidently, the higher this term, the more progress is made towards identifying the target (expanded) scenario.
Algorithm 2 Algorithm for ASR instance $\mathcal{J}$, based on Navidi et al. (2020).
1: Initialize $E \leftarrow \emptyset$, $H' \leftarrow H$.
2: while $H' \neq \emptyset$ do
3:   For any element $e \in [n]$, let $B_e(H')$ be the largest-cardinality set among
$$\{(i,\omega)\in H' : r_{i,\omega}(e) = o\}, \quad \forall o \in \Omega$$
4:   Define $L_e(H') = H'\setminus B_e(H')$
5:   Select the element $e \in [n]\setminus E$ maximizing
$$\mathrm{Score}_c(e, E, H') = \pi\big(L_e(H')\big) + \sum_{(i,\omega)\in H':\, f_{i,\omega}(E)<1} \pi_{i,\omega}\cdot\frac{f_{i,\omega}(\{e\}\cup E) - f_{i,\omega}(E)}{1 - f_{i,\omega}(E)} \qquad (3)$$
6:   Observe response $o$ and update $H' \leftarrow \{(i,\omega)\in H' : \omega_e = o \text{ and } f_{i,\omega}(E\cup\{e\}) < 1\}$
7:   $E \leftarrow E\cup\{e\}$
As noted earlier, a key issue is the exponential size of the expanded scenario set $H$. The naive implementation, which computes the summation in $\mathrm{Score}_c$ by evaluating each term in $H'$, requires exponential time. Nonetheless, as the main focus of this subsection, we explain how to utilize the structure of the instance $\mathcal{J}$ (arising from the ASRN reduction) to reformulate each of the two terms in $\mathrm{Score}_c$ in a manageable form, hence enabling a polynomial-time implementation.
Computing the First Term in $\mathrm{Score}_c$. Recall that $H_i$ is the set of all expanded scenarios for $i$. Since each $(i,\omega) \in H_i$ has an equal share $\pi_{i,\omega} = |\Omega|^{-c_i}\pi_i$ of the prior probability mass of the (original) scenario $i \in [m]$, computing the first term in $\mathrm{Score}_c$ reduces to maintaining the number $n_i = |H_i\cap H'|$ of consistent copies of $i$. We observe that $n_i$ can be easily updated in each iteration. In fact, suppose outcome $o \in \Omega$ is observed upon selecting element $e$. We consider how $H'\cap H_i$ changes after selecting $e$ in the following three cases:
1. if $r_i(e) \notin \{*, o\}$, then none of $i$'s expanded scenarios remains in $H'$, so $n_i$ becomes 0;
2. if $r_i(e) = o$, then all of $i$'s expanded scenarios remain in $H'$, so $n_i$ remains the same;
3. if $r_i(e) = *$, then only those $(i,\omega)$ with $\omega_e = o$ remain, and so $n_i$ shrinks by an $|\Omega|$ factor.
As the $n_i$'s can be easily updated, we are also able to compute the first term in $\mathrm{Score}_c$ efficiently. Indeed, for any element $e$ (that is not yet selected), we can implicitly describe the set $L_e(H')$ as follows. Note that for any outcome $o \in \Omega$,
$$|\{(i,\omega)\in H' : r_{i,\omega}(e) = o\}| \;=\; \sum_{i\in[m]:\, r_i(e)=o} n_i \;+\; \frac{1}{|\Omega|}\sum_{i\in[m]:\, r_i(e)=*} n_i,$$
so the largest-cardinality set $B_e(H')$ can then be easily determined using the $n_i$'s. In fact, let $b$ be the outcome corresponding to $B_e(H')$. Then,
$$\pi\big(L_e(H')\big) \;=\; \sum_{i\in[m]:\, r_i(e)\notin\{b,\,*\}} \frac{\pi_i}{|\Omega|^{c_i}}\cdot n_i \;+\; \frac{|\Omega|-1}{|\Omega|}\sum_{i\in[m]:\, r_i(e)=*} \frac{\pi_i}{|\Omega|^{c_i}}\cdot n_i.$$
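A minimal sketch of this bookkeeping, on an illustrative instance (the matrix, priors, and outcome set are made up), is:

```python
# Maintain n_i = |H_i ∩ H'| via the three update cases, and evaluate
# pi(L_e(H')) with the closed form above.
OMEGA = (+1, -1)
responses = {'h1': [+1, '*'], 'h2': ['*', -1], 'h3': [-1, +1]}
prior = {'h1': 0.3, 'h2': 0.3, 'h3': 0.4}
c = {i: row.count('*') for i, row in responses.items()}   # stars per scenario
n = {i: len(OMEGA) ** c[i] for i in responses}            # all copies survive

def update(n, e, o):
    """Update the counts n_i after observing outcome o on element e."""
    for i, row in responses.items():
        if row[e] == '*':
            n[i] //= len(OMEGA)        # only a 1/|Omega| fraction stays
        elif row[e] != o:
            n[i] = 0                   # scenario i is ruled out
    return n                           # (if row[e] == o, n_i is unchanged)

def pi_light(n, e):
    """pi(L_e(H')): mass outside the largest-cardinality outcome class b."""
    count = {o: 0.0 for o in OMEGA}
    for i, row in responses.items():
        if row[e] == '*':
            for o in OMEGA:
                count[o] += n[i] / len(OMEGA)
        else:
            count[row[e]] += n[i]
    b = max(OMEGA, key=lambda o: count[o])
    k = len(OMEGA)
    total = 0.0
    for i, row in responses.items():
        p = prior[i] * n[i] / k ** c[i]
        if row[e] == '*':
            total += p * (k - 1) / k
        elif row[e] != b:
            total += p
    return total

pi0 = pi_light(n, e=1)
n = update(n, e=0, o=+1)
# h1 agrees (+1), h2 had a star (half its copies stay), h3 is ruled out.
assert abs(pi0 - 0.55) < 1e-12
assert (n['h1'], n['h2'], n['h3']) == (2, 1, 0)
```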
Computing the Second Term in $\mathrm{Score}_c$. The second term in $\mathrm{Score}_c$ involves summing over exponentially many terms, so a naive implementation is inefficient. Instead, we will rewrite this summation as an expectation that can be calculated in polynomial time.
We introduce some notation before formally stating this equivalence. Suppose the algorithm has selected a subset $E$ of elements, and observed outcomes $\{\nu_e\}_{e\in E}$. We overload notation slightly and use $f(\nu_E) := f\big(\{(e, \nu_e) : e \in E\}\big)$ for any function $f$ defined on $2^{[n]\times\Omega}$. For each scenario $i \in [m]$, let $p_i = n_i\cdot\frac{\pi_i}{|\Omega|^{c_i}}$ be the total probability mass of the surviving expanded scenarios for $i$. Finally, for any element $e$ and scenario $i$, let $\mathbb{E}_{i,\nu_e}$ denote the expectation over the outcome $\nu_e$ of element $e$, conditional on $i$ being the realized scenario. We can then rewrite the second term in $\mathrm{Score}_c$ as follows.

Lemma 1. For each $e \notin E$,
$$\sum_{(i,\omega)\in H'} \pi_{i,\omega}\cdot\frac{f_{i,\omega}(\{e\}\cup E) - f_{i,\omega}(E)}{1 - f_{i,\omega}(E)} \;=\; \sum_{i\in[m]} p_i\cdot\frac{\mathbb{E}_{i,\nu_e}\big[f_i(\nu_E\cup\{\nu_e\}) - f_i(\nu_E)\big]}{1 - f_i(\nu_E)}. \qquad (4)$$

This lemma suggests the following efficient implementation of Algorithm 2. For each $i$, compute and maintain $p_i$ using $n_i$. To find the expectation in the numerator, note that if $r_i(e) \neq *$, then $\nu_e$ is deterministic, and hence it is straightforward to find this expectation. In the other case, $r_i(e) = *$, recalling that the noisy outcome is uniformly distributed over $\Omega$, we may simply evaluate $f_i(\nu_E\cup\{(e,o)\}) - f_i(\nu_E)$ for each $o \in \Omega$ and take the average.
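The identity in Lemma 1 can be checked numerically on a toy instance: the brute-force left-hand side enumerates expanded scenarios, while the right-hand side uses only one term per original scenario. The matrix, priors, and coverage function below are illustrative.

```python
from itertools import product

OMEGA = (+1, -1)
responses = {'h1': ['*', '*', +1], 'h2': [+1, '*', '*']}
prior = {'h1': 0.6, 'h2': 0.4}

def f(i, pairs):
    """Toy monotone submodular coverage: half a unit per +1 outcome seen."""
    return min(1.0, sum(1 for _, o in pairs if o == +1) / 2)

def expanded(i):
    row = responses[i]
    stars = [e for e, o in enumerate(row) if o == '*']
    for fill in product(OMEGA, repeat=len(stars)):
        w = list(row)
        for e, o in zip(stars, fill):
            w[e] = o
        yield tuple(w), prior[i] / len(OMEGA) ** len(stars)

def lhs(E, nu, e):
    """Brute force over all surviving expanded scenarios (i, w)."""
    total = 0.0
    for i in responses:
        for w, p in expanded(i):
            if any(w[x] != nu[x] for x in E):
                continue               # inconsistent with observed outcomes
            pairs = [(x, w[x]) for x in E]
            cov = f(i, pairs)
            if cov < 1.0:
                total += p * (f(i, pairs + [(e, w[e])]) - cov) / (1.0 - cov)
    return total

def rhs(E, nu, e):
    """Lemma 1 form: p_i times an expectation over the single outcome nu_e."""
    total = 0.0
    for i, row in responses.items():
        if any(row[x] not in ('*', nu[x]) for x in E):
            continue                   # no surviving copies of scenario i
        stars_seen = sum(1 for x in E if row[x] == '*')
        p_i = prior[i] / len(OMEGA) ** stars_seen   # surviving mass of i
        pairs = [(x, nu[x]) for x in E]
        cov = f(i, pairs)
        if cov >= 1.0:
            continue
        outs = OMEGA if row[e] == '*' else (row[e],)
        gain = sum(f(i, pairs + [(e, o)]) - cov for o in outs) / len(outs)
        total += p_i * gain / (1.0 - cov)
    return total

E, nu = [0], {0: +1}
assert abs(lhs(E, nu, 2) - rhs(E, nu, 2)) < 1e-12
```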
Now we are ready to formally state and prove the main result of this subsection.

Theorem 5. Algorithm 2 is an $O\big(c\log|\Omega| + \log m + \log\frac{1}{\varepsilon}\big)$-approximation algorithm for ASRN, where $c$ is the maximum number of noisy outcomes in each column of the response matrix $M$.

Proof. Consider the ASR instance $\mathcal{J}$ and Algorithm 2. As discussed above, this algorithm can be implemented in polynomial time. By Theorem 2, this algorithm has an $O\big(\log(|\Omega|^c m) + \log\frac{m}{\varepsilon}\big) = O\big(c\log|\Omega| + \log\frac{m}{\varepsilon}\big)$ approximation ratio, since $|H| \le |\Omega|^c\cdot m$. □
Algorithm 3 Modified algorithm for ASR instance $\mathcal{J}$.
1: Initialize $E \leftarrow \emptyset$, $H' \leftarrow H$
2: while $H' \neq \emptyset$ do
3:   $S \leftarrow \{i \in [m] : H_i\cap H' \neq \emptyset\}$  ▷ Consistent original scenarios
4:   For $e \in [n]$, let $U_e(S) = \{i \in S : r_i(e) = *\}$ and let $C_e(S)$ be the largest-cardinality set among $\{i \in S : r_i(e) = o\}$, $\forall o \in \Omega$; let $o_e(S) \in \Omega$ be the outcome corresponding to $C_e(S)$.
5:   For each $e \in [n]$, let
$$R_e(H') = \{(i,\omega)\in H' : i \in C_e(S)\} \;\cup\; \{(j,\omega)\in H' : j \in U_e(S),\ \omega_e = o_e(S)\}$$
be those expanded scenarios that have outcome $o_e(S)$ for element $e$, and let $\bar{R}_e(H') := H'\setminus R_e(H')$.
6:   Select the element $e \in [n]\setminus E$ that maximizes
$$\mathrm{Score}_r(e, E, H') = \pi\big(\bar{R}_e(H')\big) + \sum_{(i,\omega)\in H':\, f_{i,\omega}(E)<1} \pi_{i,\omega}\cdot\frac{f_{i,\omega}(\{e\}\cup E) - f_{i,\omega}(E)}{1 - f_{i,\omega}(E)} \qquad (5)$$
7:   Observe outcome $o$
8:   $H' \leftarrow \{(i,\omega)\in H' : r_{i,\omega}(e) = o \text{ and } f_{i,\omega}(E\cup\{e\}) < 1\}$  ▷ Update the (expanded) scenarios
9:   $E \leftarrow E\cup\{e\}$
4.2. An $O\big(r + \log\frac{m}{\varepsilon}\big)$-Approximation Algorithm
In this section, we consider a slightly different score function, $\mathrm{Score}_r$, and obtain an $O\big(r + \log\frac{m}{\varepsilon}\big)$-approximation. Unlike the previous section, where the approximation factor follows as an immediate corollary of Theorem 2, here we also need to modify the analysis.

(One may easily verify via Bayes' rule that $p_i/p([m])$ is indeed the posterior probability of scenario $i \in [m]$, given the previously observed outcomes.)
The only difference from Algorithm 2 is in the first term of the score function. Recall that in $\mathrm{Score}_c$, upon selecting an element, the surviving expanded scenarios are partitioned into $|\Omega|$ subsets, among which $L_e(H')$ is defined to be the lightest in cardinality. Its counterpart in $\mathrm{Score}_r$, however, is defined more indirectly, by first considering the original scenarios. The element $e$ partitions the original scenarios with deterministic outcomes into $|\Omega|$ subsets, the largest (in cardinality) being $C_e(S) \subseteq [m]$. The set $\bar{R}_e(H') \subseteq H'$ is then defined to be the consistent expanded scenarios that have a different outcome than $C_e(S)$.
Computational Complexity. By definition, $S$ can be computed directly from the $n_i$'s, which can be updated in polynomial time as explained in Section 4.1. As in Algorithm 2, the second term here also involves summing over exponentially many terms, but by following the same recipe as in Section 4.1, one may also implement it in polynomial time.
The main result of this section, stated below, is proved by adapting the proof technique from Navidi et al. (2020). The proof appears in Appendix C.3.

Theorem 6. Algorithm 3 is a polynomial-time $O\big(r + \log\frac{m}{\varepsilon}\big)$-approximation algorithm for ASRN, where $r$ is the maximum number of noisy outcomes in any row of the response matrix $M$.

Combining the above result with Theorem 5 and selecting, between Algorithm 2 and Algorithm 3, the one with the lower approximation ratio, we immediately obtain the following.

Theorem 7. There is an adaptive $O\big(\min\{c\log|\Omega|, r\} + \log\frac{m}{\varepsilon}\big)$-approximation algorithm for the ASRN problem.

When applied to the ODTN problem, this implies an $O\big(\min\{c, r\} + \log\frac{m}{\varepsilon}\big)$-approximation algorithm. In Appendix C.2, we also provide closed-form expressions for the scores used in Algorithms 2 and 3 in the special case of ODTN; these are also used in our computational results.
5. ODTN with Many Unknowns
Our adaptive algorithm in Section 4 has a performance guarantee that grows with the noise sparsity $\min(r, c\log|\Omega|)$. In this section, we consider the special case of ODTN (which is our primary application) and focus on instances with a large number of noisy outcomes. We show that an $O(\log m)$-approximation can be achieved even in this regime.

An ODTN instance is called $\alpha$-sparse ($0 \le \alpha \le 1$) if $\max\{|T^+|, |T^-|\} \le m^\alpha$ for all tests $T \in \mathcal{T}$. In particular, when $\alpha < 1$, this means that the vast majority of entries are noisy in every test. Our main result is the following.

Theorem 8. There is a polynomial-time adaptive algorithm whose cost is $O(\log m)$ times the optimum for ODTN on any $\alpha$-sparse instance with $\alpha \le \frac{1}{2}$, and which returns the true hypothesis with probability $1 - m^{-1}$. Moreover, by repeating the algorithm $c \ge 1$ times, the error probability decreases to $m^{-c}$.
5.1. Main Idea and the Stochastic Set Cover Problem
Stochastic Set Cover. The design and analysis of our algorithm are both closely related to those of the Stochastic Set Cover (SSC) problem (Liu et al. (2008), Im et al. (2016)). An instance of SSC consists of a ground set $[m]$ of items and a collection of random subsets $S_1, \ldots, S_n$ of $[m]$, where the distribution of each $S_i$ is known to the algorithm. The instantiation of each set only becomes known after it is selected. The goal is to find an adaptive policy that minimizes the expected number of sets needed to cover all items in the ground set.

The following natural adaptive greedy algorithm is known to be an $O(\log m)$-approximation (Liu et al. (2008), Im et al. (2016)). Suppose at some iteration, $A \subseteq [m]$ is the set of uncovered items. A random set $S$ is said to be $\beta$-greedy if its expected coverage of the uncovered items is at least $\frac{1}{\beta}$ times the maximum, i.e.,
$$\mathbb{E}\,|S\cap A| \;\ge\; \frac{1}{\beta}\,\max_{j\in[n]} \mathbb{E}\,|S_j\cap A|.$$
An SSC algorithm is $(\beta,\rho)$-greedy if, for every $t \ge 1$, the algorithm picks a $\beta$-greedy set in no fewer than $t/\rho$ of the first $t$ iterations. By slightly modifying the analysis in Im et al. (2016), one may obtain the following guarantee, which will serve as the cornerstone of our analysis.

Theorem 9 (Im et al. (2016)). For any stochastic set cover instance, a $(\beta,\rho)$-greedy policy costs at most $O(\beta\rho\log m)$ times the optimum.
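For intuition, the adaptive greedy policy analyzed by Theorem 9 can be sketched on a toy SSC instance; here the policy picks the remaining set of maximum expected coverage at every step (so $\beta = 1$), and the instance is illustrative.

```python
import random

random.seed(1)
# Each random set is a distribution: a list of (subset, probability) pairs.
sets = [
    [({0, 1}, 0.5), ({2}, 0.5)],
    [({2, 3}, 1.0)],
    [({0}, 1.0)],
    [({1}, 1.0)],
    [({3}, 0.5), ({0, 3}, 0.5)],
]
ground = {0, 1, 2, 3}

def expected_coverage(dist, uncovered):
    return sum(p * len(s & uncovered) for s, p in dist)

def greedy_ssc(sets, ground):
    uncovered, cost, pool = set(ground), 0, list(range(len(sets)))
    while uncovered:
        j = max(pool, key=lambda k: expected_coverage(sets[k], uncovered))
        pool.remove(j)
        subsets, probs = zip(*sets[j])
        realized = random.choices(subsets, weights=probs)[0]
        uncovered -= realized          # instantiation revealed after selection
        cost += 1
    return cost

cost = greedy_ssc(sets, ground)
assert 2 <= cost <= len(sets)          # one set covers at most two items here
```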
Relating the ODTN Optimum and SSC: A Lower Bound. We now derive a lower bound on the ODTN optimum, in terms of the optima of SSC instances constructed as follows. For any hypothesis $i \in [m]$, let $\mathrm{SSC}(i)$ denote the stochastic set cover instance with ground set $[m]\setminus\{i\}$ and $n$ random sets, given by
$$S_T(i) = \begin{cases} T^+ & \text{with probability 1, if } i \in T^-,\\ T^- & \text{with probability 1, if } i \in T^+,\\ T^- \text{ or } T^+ & \text{with probability } \frac{1}{2} \text{ each, if } i \in T^*, \end{cases} \qquad \forall T \in [n].$$
To see the connection between SSC and ODTN, observe that when $i$ is the target hypothesis in the ODTN instance, any feasible algorithm must identify $i$ by eliminating all other hypotheses, which, in the language of SSC, translates to covering all items in $[m]\setminus\{i\}$. This leads to the following key lower bound that our algorithm exploits.

Lemma 2. $\mathrm{OPT} \ge \sum_{i\in[m]} \pi_i\cdot\mathrm{OPT}_{\mathrm{SSC}(i)}$.
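The random sets $S_T(i)$ can be built directly from the ODTN outcome matrix, as the following sketch shows; the small matrix is illustrative.

```python
import random

# M[T][j] in {+1, -1, '*'}: the outcome of test T on hypothesis j.
M = [
    [+1, -1, '*'],
    ['*', +1, +1],
    [-1, '*', +1],
]
m = 3

def sample_S(T, i, rng):
    """One draw of S_T(i): the hypotheses eliminated when the target is i."""
    row = M[T]
    plus = {j for j in range(m) if row[j] == +1}
    minus = {j for j in range(m) if row[j] == -1}
    if row[i] == -1:
        return plus                    # outcome '-' eliminates all of T+
    if row[i] == +1:
        return minus                   # outcome '+' eliminates all of T-
    return plus if rng.random() < 0.5 else minus   # noisy: fair coin

rng = random.Random(7)
# Target i = 0 on test T = 0: the outcome is surely '+', eliminating T- = {1}.
assert sample_S(0, 0, rng) == {1}
# On test T = 1, hypothesis 0 has a '*': either T+ = {1, 2} or T- = {} goes.
assert sample_S(1, 0, rng) in ({1, 2}, set())
```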
We now explain why “good” progress made in $\mathrm{SSC}(i)$ also leads to “good” progress in ODTN. Consider a hypothesis $i$ and a test $T$ with $i \in T^*$, and let $A$ be the set of consistent hypotheses. When test $T$ is selected, the expected coverage of the corresponding (random) set $S_T(i)$ in $\mathrm{SSC}(i)$ is $\frac{1}{2}\big(|T^+\cap A| + |T^-\cap A|\big)$. The following result shows that if $T$ maximizes $\frac{1}{2}\big(|T^+\cap A| + |T^-\cap A|\big)$, then it is 2-greedy for $\mathrm{SSC}(i)$.

Lemma 3. Let $T$ be a test that maximizes $\frac{1}{2}\big(|T^+\cap A| + |T^-\cap A|\big)$. Then for any $i \in T^*$,
$$\frac{1}{2}\big(|T^+\cap A| + |T^-\cap A|\big) \;=\; \mathbb{E}\big[|S_T(i)\cap(A\setminus i)|\big] \;\ge\; \frac{1}{2}\cdot\max_{T'\in[n]} \mathbb{E}\big[|S_{T'}(i)\cap(A\setminus i)|\big].$$
Hence, by our sparsity assumption, since the vast majority of hypotheses are in $T^*$, such a test $T$ is 2-greedy for most SSC instances. This motivates the following greedy algorithm: when $A$ is the set of consistent hypotheses, pick the test $T$ that maximizes $\frac{1}{2}|T^+\cap A| + \frac{1}{2}|T^-\cap A|$. Suppose the following ideal condition holds: at each iteration $t$ (when $t$ tests have been selected), for every hypothesis $i$, the algorithm has selected at least $t/\rho$ tests that are $*$-tests for $i$. Then the sequence of tests selected is $(2,\rho)$-greedy for every $i$, hence making nearly optimal progress in every instance $\mathrm{SSC}(i)$. Therefore, by Theorem 9, the expected cost of this algorithm when the target is $i$ is $O(\rho\log m)\cdot\mathrm{OPT}_{\mathrm{SSC}(i)}$. Taking the expectation over the target hypothesis $i$ and combining with Lemma 2, it then follows that this algorithm is an $O(\rho\log m)$-approximation for ODTN.
However, in general, the ideal condition assumed above may not hold. In other words, beyond some point, the sequence of tests selected may no longer be $(2,\rho)$-greedy for some hypothesis $i$. To handle this issue, we modify the above greedy algorithm at all power-of-two iterations as follows (see Section 5.3). At each $t = 2^k$, where $k = 1, 2, \ldots, \log m$, we consider the set $Z$ of $O(m^\alpha)$ hypotheses with the fewest $*$-tests selected thus far. Then, we invoke a membership oracle, $\mathrm{Member}(Z)$, to check whether the target hypothesis $\bar{i} \in Z$ (see Section 5.2). If so, the algorithm halts and returns $\bar{i}$. Otherwise, it continues with the greedy algorithm until the next power-of-two iteration. We will show that the membership oracle only incurs cost $O(m^\alpha)$ per call, which can be bounded using the following lower bound.

Lemma 4. The optimal value $\mathrm{OPT} = \Omega(m^{1-\alpha})$ for any $\alpha$-sparse instance.

In particular, when $\alpha \le \frac{1}{2}$, the above implies that the cost $O(m^\alpha)$ for each call of the membership oracle is $O(\mathrm{OPT})$, and hence the total cost incurred at power-of-two steps is $O(\log m\cdot\mathrm{OPT})$.
5.2. Overview of the Membership Oracle
The membership oracle $\mathrm{Member}(Z)$ takes a (small) subset $Z \subseteq [m]$ as input, and decides whether the target hypothesis $\bar{i} \in Z$. At a high level, $\mathrm{Member}(Z)$ works as follows. Whenever $|Z| \ge 2$, we
pick an arbitrary pair $(j, k)$ of hypotheses in $Z$ and let them “duel” (i.e., choose a test $T$ with $M_{T,j} \neq M_{T,k}$) until there is only a unique survivor $i$.

Let $i \in [m]$ be an arbitrary hypothesis. We show that if $\bar{i} \neq i$, then with high probability we can rule out $i$ using very few tests. In fact, we first select an arbitrary set $W$ of $4\log m$ deterministic tests for $i$, and let $Y$ be the set of consistent hypotheses after performing these tests. Without loss of generality, we assume $i \in T^+$ for all $T \in W$. There are three cases:

• Trivial Case: if $\bar{i} \in T^-$ for some $T \in W$, then we rule out $i$ when any such test $T$ is performed.

• Good Case: if $\bar{i} \in T^*$ for more than half of the tests $T$ in $W$, then by a Chernoff bound, with high probability we observe at least one “$-$” outcome, hence ruling out $i$.

• Bad Case: if $\bar{i} \in T^*$ for at most half of the tests $T$ in $W$, then concentration bounds cannot ensure a high enough probability of ruling out $i$. In this case, we let each hypothesis in $Y$ duel with $i$ until either $i$ loses a duel or wins all the duels. This takes at most $|Z| - 1$ iterations.

We formalize the above ideas in Algorithm 5 (Appendix D.1), and bound the cost of $\mathrm{Member}(Z)$ as follows.

Lemma 5. If $\bar{i} \in Z$, then $\mathrm{Member}(Z)$ declares $\bar{i} = i$ with probability one; otherwise, it declares $\bar{i} \notin Z$ with probability at least $1 - m^{-2}$. Moreover, the expected cost of $\mathrm{Member}(Z)$ is $O(|Z| + \log m)$.
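A simplified sketch of the dueling step (omitting the set $W$ of deterministic tests, the case analysis, and the final verification of the survivor) might look as follows; the matrix and the `perform_test` stub are illustrative.

```python
# Tournament of duels: repeatedly pick a test on which two candidates have
# different deterministic outcomes and discard the inconsistent one.
M = {
    'h1': [+1, -1, +1],
    'h2': [-1, -1, +1],
    'h3': [+1, +1, -1],
}
TRUE = 'h2'                            # the hidden target hypothesis

def perform_test(T):
    return M[TRUE][T]                  # TRUE has no '*' here, so exact

def duel(j, k):
    """Return the survivor of a duel between hypotheses j and k."""
    for T in range(3):
        if M[j][T] != M[k][T] and '*' not in (M[j][T], M[k][T]):
            o = perform_test(T)
            return j if M[j][T] == o else k
    raise ValueError('no distinguishing test')   # j, k not identifiable

def member(Z):
    """Duel candidates pairwise; the unique survivor is the candidate target."""
    Z = list(Z)
    while len(Z) >= 2:
        winner = duel(Z[0], Z[1])
        Z = [winner] + Z[2:]
    return Z[0]

assert member(['h1', 'h2', 'h3']) == 'h2'
```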
5.3. The Main Algorithm
The overall algorithm is given in Algorithm 4. The algorithm maintains a subset of consistent hypotheses, and iteratively computes the greediest test, as formally specified in Step 7. At each $t = 2^k$, where $k = 1, 2, \ldots, \log m$, we invoke the membership oracle.

Truncated Decision Tree. Let $\mathcal{T}$ denote the decision tree corresponding to our algorithm. We only consider tests that correspond to Step 7. Recall that $H$ is the set of expanded hypotheses and that any expanded hypothesis traces a unique path in $\mathcal{T}$. For any $(i,\omega) \in H$, let $P_{i,\omega}$ denote this traced path; so $|P_{i,\omega}|$ is the number of tests performed in Step 7 under $(i,\omega)$. We will work with a truncated decision tree $\mathcal{T}'$, defined below.
Algorithm 4 Main algorithm for a large number of noisy outcomes
1: Initialization: consistent hypotheses $A \leftarrow [m]$, weights $w_i \leftarrow 0$ for $i \in [m]$, iteration index $t \leftarrow 0$
2: while $|A| > 1$ do
3:   if $t$ is a power of 2 then
4:     Let $Z \subseteq A$ be the subset of $2m^\alpha$ hypotheses with lowest $w_i$
5:     Invoke $\mathrm{Member}(Z)$
6:     If a hypothesis is identified in $Z$, then Break
7:   Select a test $T \in \mathcal{T}$ maximizing $\frac{1}{2}\big(|T^+\cap A| + |T^-\cap A|\big)$ and observe outcome $o_T$
8:   Set $R \leftarrow \{i \in [m] : M_{T,i} = -o_T\}$ and $A \leftarrow A\setminus R$  ▷ Remove incompatible hypotheses
9:   Set $w_i \leftarrow w_i + 1$ for each $i \in T^*$  ▷ Update the weights of the hypotheses in $T^*$
10: $t \leftarrow t + 1$
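Putting the pieces together, here is a compact sketch of the main loop on a toy binary instance; the instance, the outcome stub, and the stand-in membership oracle are illustrative.

```python
import random

random.seed(3)
M = [
    [+1, -1, '*', '*'],
    ['*', +1, -1, '*'],
    ['*', '*', +1, -1],
]
m, TRUE = 4, 2                          # TRUE is the hidden target hypothesis

def outcome(T):
    row = M[T]
    return row[TRUE] if row[TRUE] != '*' else random.choice((+1, -1))

def member(Z):
    return TRUE if TRUE in Z else None  # stand-in for the real Member(Z)

def solve():
    A, w, t = set(range(m)), [0] * m, 1
    while len(A) > 1:
        if t & (t - 1) == 0:            # t is a power of two
            Z = sorted(A, key=lambda i: (w[i], i))[:2]   # fewest '*'-tests
            found = member(Z)
            if found is not None:
                return found
        # Greedy: maximize |T+ ∩ A| + |T- ∩ A|, i.e. deterministic outcomes in A.
        T = max(range(len(M)),
                key=lambda T: sum(1 for i in A if M[T][i] != '*'))
        o = outcome(T)
        A -= {i for i in A if M[T][i] == -o}   # remove incompatible
        for i in range(m):
            if M[T][i] == '*':
                w[i] += 1
        t += 1
    return A.pop()

assert solve() == 2
```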
Fix any expanded hypothesis $(i,\omega) \in H$. For any $t \ge 1$, let $\theta_{i,\omega}(t)$ denote the fraction of the first $t$ tests in $P_{i,\omega}$ that are $*$-tests for hypothesis $i$. Recall that $P_{i,\omega}$ only contains tests from Step 7. Let $\rho = 4$ and define
$$t_{i,\omega} = \max\Big\{t \in \{2^0, 2^1, \ldots, 2^{\log m}\} : \theta_{i,\omega}(t') \ge \tfrac{1}{\rho} \text{ for all } t' \le t\Big\}. \qquad (6)$$
If $t_{i,\omega} > |P_{i,\omega}|$, then we simply set $t_{i,\omega} = |P_{i,\omega}|$. (Unless stated otherwise, we denote $\log := \log_2$.)

Now we define the truncated decision tree $\mathcal{T}'$. By abuse of notation, we will use $\theta_i(t)$ and $t_i$ as random variables, with randomness over $\omega$. Observe that for any $(i,\omega)$, at the next power-of-two step $2^{\lceil\log t_i\rceil}$, which we call the truncation time, the membership oracle will be invoked; moreover, $2^{\lceil\log t_i\rceil} \le 2t_i$. This motivates us to define $\mathcal{T}'$ as the subtree of $\mathcal{T}$ consisting of the first $2^{\lceil\log t_{i,\omega}\rceil}$ tests along the path $P_{i,\omega}$, for each $(i,\omega) \in H$. Under this definition, the cost of Algorithm 4 clearly equals the sum of the cost of the truncated tree and the cost of invoking the membership oracles.

Our proof proceeds by bounding the cost of Algorithm 4 at power-of-two steps and at all other steps. In other words, we will decompose the cost into the cost incurred by invoking the membership oracle and that of selecting the greedy tests. We start with the easier task of bounding the cost of the membership oracle. The oracle Member is always invoked on $|Z| = O(m^\alpha)$ hypotheses. Using Lemma 5, the expected total number of tests due to Step 4 is $O(m^\alpha\log m)$. By Lemma 4, when $\alpha \le \frac{1}{2}$, this cost is $O(\log m\cdot\mathrm{OPT})$.

The remaining part of this subsection focuses on bounding the cost of the truncated tree by $O(\log m)\cdot\mathrm{OPT}$. With this inequality, we obtain an expected cost of
$$O(\log m)\cdot(m^{\alpha} + \mathrm{OPT}) \;\overset{(\alpha \le \frac{1}{2})}{\le}\; O(\log m)\cdot(m^{1-\alpha} + \mathrm{OPT}) \;\overset{\text{Lemma 4}}{\le}\; O(\log m)\cdot\mathrm{OPT},$$
and Theorem 8 follows. At a high level, for a fixed hypothesis $i \in [m]$, we will bound the cost of the truncated tree as follows:

$i$ has a low fraction of $*$-tests at $t_i$
$\Longrightarrow$ (Lemma 6) $i$ is among the top $O(m^\alpha)$ hypotheses at $t_i$
$\Longrightarrow$ (Lemma 5) $i$ is identified w.h.p. by $\mathrm{Member}(Z)$ at $2^{\lceil\log t_i\rceil} \le 2t_i$; hence the truncated path is $(2,2)$-greedy
$\Longrightarrow$ (Theorem 9) the expected cost conditional on $i$ is $O(\log m)\cdot\mathrm{OPT}_{\mathrm{SSC}(i)}$

and finally, by summing over $i \in [m]$, it follows from Lemma 2 that the cost of the truncated tree is $O(\log m)\cdot\mathrm{OPT}$. We formalize each step below.
Consider the first step: formally, we show that if $\theta_i(t) < \frac{1}{4}$, then there are $O(m^\alpha)$ hypotheses with fewer $*$-tests than $i$. Suppose $i$ is the target hypothesis and $\theta_i(t)$ drops below $\frac{1}{4}$ at time $t$; that is, fewer than a quarter of the tests selected are 2-greedy for $\mathrm{SSC}(i)$. Recall that if $i \in T^*$, where $T$ maximizes $\frac{1}{2}\big(|A\cap T^+| + |A\cap T^-|\big)$, then $S_T(i)$ is a 2-greedy set for $\mathrm{SSC}(i)$. So we deduce that fewer than $\frac{t}{4}$ of the tests selected are $*$-tests for $i$; in other words, at least $\frac{3t}{4}$ of the tests selected thus far are deterministic for $i$. We next utilize the sparsity assumption to show that there can be at most $O(m^\alpha)$ such hypotheses.
Lemma 6. Consider any $W \subseteq \mathcal{T}$ and $I \subseteq [m]$. For $i \in I$, let $D(i) = |\{T \in W : M_{T,i} \neq *\}|$ denote the number of tests in $W$ for which $i$ has a deterministic (i.e., $\pm 1$) outcome. For each $\kappa \ge 1$, define $I' = \{i \in I : D(i) > |W|/\kappa\}$. Then $|I'| \le \kappa m^\alpha$.
Proof. By the definition of $I'$ and $\alpha$-sparsity, it holds that
$$|I'|\cdot\frac{|W|}{\kappa} \;<\; \sum_{i\in I} D(i) \;=\; \sum_{T\in W} |\{i\in I : M_{T,i}\neq *\}| \;\le\; |W|\cdot m^{\alpha},$$
where the last step follows since $|T| \le m^\alpha$ for each test $T$. The proof follows immediately by rearranging. □
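The counting argument can be checked numerically: if every test has at most $s$ deterministic entries, then fewer than $\kappa s$ hypotheses can have more than $|W|/\kappa$ deterministic outcomes. The random instance below is illustrative, with $s$ playing the role of $m^\alpha$.

```python
import random

random.seed(5)
m, num_tests, s, kappa = 200, 40, 14, 3

W = []
for _ in range(num_tests):
    det = random.sample(range(m), s)       # at most s deterministic entries
    row = {j: random.choice((+1, -1)) for j in det}
    W.append(row)                          # absent keys mean '*'

D = [sum(1 for row in W if j in row) for j in range(m)]
I_prime = [j for j in range(m) if D[j] > len(W) / kappa]
assert len(I_prime) < kappa * s            # the Lemma 6 bound
```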
We now complete the analysis using the relation to SSC. Fix any hypothesis $i \in [m]$ and consider the decision tree $\mathcal{T}'_i$ obtained by conditioning $\mathcal{T}'$ on $\bar{i} = i$. Lemma 3 and the definition of truncation together imply that $\mathcal{T}'_i$ is $(2,4)$-greedy for $\mathrm{SSC}(i)$, so by Theorem 9, the expected cost of $\mathcal{T}'_i$ is $O(\log m)\cdot\mathrm{OPT}_{\mathrm{SSC}(i)}$. Now, taking expectations over $i \in [m]$, the expected cost of $\mathcal{T}'$ is $O(\log m)\sum_{i=1}^{m}\pi_i\cdot\mathrm{OPT}_{\mathrm{SSC}(i)}$. Recall from Lemma 2 that
$$\mathrm{OPT} \;\ge\; \sum_{i\in[m]}\pi_i\cdot\mathrm{OPT}_{\mathrm{SSC}(i)},$$
and therefore the cost of $\mathcal{T}'$ is $O(\log m)\cdot\mathrm{OPT}$.
Correctness. We finally show that our algorithm identifies the target hypothesis $\bar{i}$ with high probability. By the definition of $t_{\bar{i}}$, where the path is truncated, $\bar{i}$ has less than a $\frac{1}{4}$ fraction of $*$-tests. Thus, at iteration $2^{\lceil\log t_{\bar{i}}\rceil}$, i.e., the first time the membership oracle is invoked after $t_{\bar{i}}$, $\bar{i}$ has less than a $\frac{1}{2}$ fraction of $*$-tests. Hence, by Lemma 6, $\bar{i}$ is among the $O(m^\alpha)$ hypotheses with the fewest $*$-tests. Finally, it follows from Lemma 5 that $\bar{i}$ is identified correctly with probability at least $1 - \frac{1}{m}$.
6. Experiments
We implemented our algorithms on real-world and synthetic data sets. We compared our algorithms' cost (expected number of tests) with an information-theoretic lower bound on the optimal cost and show that the difference is negligible. Thus, despite our logarithmic approximation ratios, the practical performance is much better.
Chemicals with Unknown Test Outcomes. We considered a data set called WISER (https://wiser.nlm.nih.gov), which includes 414 chemicals (hypotheses) and 78 binary tests. Every chemical has either a positive, negative, or unknown result on each test. The original instance (called WISER-ORG) is not identifiable, so our result does not apply directly. In Appendix E we show how our result can be extended to such “non-identifiable” ODTN instances (this requires a more relaxed stopping criterion, defined on the “similarity graph”). In addition, we also generated a modified dataset by removing chemicals that are not identifiable from each other, to obtain a perfectly identifiable dataset (called WISER-ID). In generating the WISER-ID instance, we used a greedy rule that iteratively drops the highest-degree hypothesis in the similarity graph until all remaining hypotheses are uniquely identifiable. WISER-ID has 255 chemicals.
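This greedy pruning rule can be sketched as follows; the tiny response matrix is illustrative, and `distinguishable` is a hypothetical helper encoding that two hypotheses are separable only by a test with two different deterministic outcomes.

```python
M = {
    'a': [+1, '*', -1],
    'b': [+1, '*', '*'],   # never separable from 'a' by deterministic tests
    'c': [-1, +1, -1],
}

def distinguishable(i, j):
    return any(oi != oj and '*' not in (oi, oj)
               for oi, oj in zip(M[i], M[j]))

def prune(M):
    """Drop the highest-degree node of the similarity graph until edge-free."""
    alive = set(M)
    while True:
        deg = {i: 0 for i in alive}
        for i in alive:
            for j in alive:
                if i < j and not distinguishable(i, j):
                    deg[i] += 1
                    deg[j] += 1
        if max(deg.values()) == 0:
            return alive
        alive.remove(max(alive, key=lambda i: (deg[i], i)))

remaining = prune(M)
assert remaining == {'a', 'c'}     # 'b' is dropped; the rest is identifiable
```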
Random Binary Classifiers with Margin Error. We construct a dataset containing 100 two-dimensional points, picking each of their attributes uniformly in $[-1000, 1000]$. We also choose 2000 random triples $(a, b, c)$ to form linear classifiers $\frac{ax + by}{\sqrt{a^2 + b^2}} + c \ge 0$, where $a, b \sim N(0,1)$ and $c \sim U(-1000, 1000)$. The point labels are binary, and we introduce noisy outcomes based on the distance of each point to a classifier. Specifically, for each threshold $d \in \{0, 5, 10, 20, 30\}$, we define a dataset CL-$d$ that has a noisy outcome for any classifier-point pair where the distance of the point to the boundary of the classifier is smaller than $d$. In order to ensure that the instances are perfectly identifiable, we remove “equivalent” classifiers, after which we are left with 234 classifiers.
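The construction of the CL-$d$ instances can be sketched as follows (a smaller illustrative version; the seed and sizes are not those used in our experiments):

```python
import math
import random

random.seed(11)
points = [(random.uniform(-1000, 1000), random.uniform(-1000, 1000))
          for _ in range(100)]
classifiers = [(random.gauss(0, 1), random.gauss(0, 1),
                random.uniform(-1000, 1000)) for _ in range(50)]

def response(clf, pt, d):
    a, b, c = clf
    x, y = pt
    signed = (a * x + b * y) / math.hypot(a, b) + c   # signed margin
    if abs(signed) < d:
        return '*'                                    # within margin: noisy
    return +1 if signed >= 0 else -1

M0 = [[response(clf, p, 0) for p in points] for clf in classifiers]
M20 = [[response(clf, p, 20) for p in points] for clf in classifiers]

def stars(M):
    return sum(row.count('*') for row in M)

assert stars(M0) == 0                  # d = 0: no noisy outcomes at all
assert stars(M20) >= stars(M0)         # larger margins only add stars
```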
Distributions. For the distribution over the hypotheses, we considered permutations of the power-law distribution ($\Pr[X = x; \alpha] = \beta x^{-\alpha}$) for $\alpha = 0, 0.5$, and $1$. Note that $\alpha = 0$ corresponds to the uniform distribution. To be able to compare the results across the different classifier datasets meaningfully, we considered the same permutation in each distribution.
Algorithms. We implement the following algorithms: the adaptive $O(r + \log m + \log\frac{1}{\varepsilon})$-approximation (which we denote ODTNr), the adaptive $O(c\log|\Omega| + \log m + \log\frac{1}{\varepsilon})$-approximation (ODTNh), the non-adaptive $O(\log m)$-approximation (Non-Adap), and a slightly adaptive version of Non-Adap (Low-Adap). Algorithm Low-Adap considers the same sequence of tests as Non-Adap while (adaptively) skipping non-informative tests based on the observed outcomes. For the non-identifiable instance (WISER-ORG), we used the $O(d + \min(c, r) + \log m + \log\frac{1}{\varepsilon})$-approximation algorithms with both the neighborhood and clique stopping criteria (see Appendix E). The implementations of the adaptive and non-adaptive algorithms are available online.§
Algorithm    WISER-ID   Cl-0    Cl-5    Cl-10   Cl-20   Cl-30
Low-BND      7.994      7.870   7.870   7.870   7.870   7.870
ODTNr        8.357      7.910   7.927   7.915   7.962   8.000
ODTNh        9.707      7.910   7.979   8.211   8.671   8.729
Non-Adap     11.568     9.731   9.831   9.941   9.996   10.204
Low-Adap     9.152      8.619   8.517   8.777   8.692   8.803
Table 1: Cost of Different Algorithms for α = 0 (Uniform Distribution).
Algorithm    WISER-ID   Cl-0    Cl-5    Cl-10   Cl-20   Cl-30
Low-BND      7.702      7.582   7.582   7.582   7.582   7.582
ODTNr        8.177      7.757   7.780   7.789   7.831   7.900
ODTNh        9.306      7.757   7.829   8.076   8.497   8.452
Non-Adap     11.998     9.504   9.500   9.694   9.826   9.934
Low-Adap     8.096      7.837   7.565   7.674   8.072   8.310
Table 2: Cost of Different Algorithms for α = 0.5.
Results. Tables 1, 2, and 3 show the expected costs of the different algorithms on all uniquely identifiable data sets when the parameter $\alpha$ in the distribution over hypotheses is 0, 0.5, and 1, respectively. These tables also report the values of an information-theoretic lower bound (the entropy) on the optimal cost (Low-BND). As the approximation ratios of our algorithms depend on the maximum number $c$ of unknowns per hypothesis and the maximum number $r$ of

§ https://github.com/FatemehNavidi/ODTN ; https://github.com/sjia1/ODT-with-noisy-outcomes
Algorithm    WISER-ID   Cl-0    Cl-5    Cl-10   Cl-20   Cl-30
Low-BND      6.218      6.136   6.136   6.136   6.136   6.136
ODTNr        7.367      6.998   7.121   7.150   7.299   7.357
ODTNh        8.566      6.998   7.134   7.313   7.637   7.915
Non-Adap     11.976     9.598   9.672   9.824   10.159  10.277
Low-Adap     9.072      8.453   8.344   8.609   8.683   8.541
Table 3: Cost of Different Algorithms for α = 1.
Parameters
Data WISER-ORG WISER-ID Cl-0 Cl-5 Cl-10 Cl-20 Cl-30
r 388 245 0 5 7 12 13
Avg-r 50.46 30.690 0 1.12 2.21 4.43 6.54
h 61 45 0 3 6 8 8
Avg-h 9.51 9.39 0 0.48 0.94 1.89 2.79
Table 4 Maximum and Average Number of Stars per Hypothesis and per Test in Different Datasets.
Algorithm Neighborhood Stopping Clique Stopping
ODTNr11.163 11.817
ODTNh11.908 12.506
Non-Adap 16.995 21.281
Low-Adap 16.983 20.559
Table 5 Algorithms on WISER-ORG dataset with Neighborhood and Clique Stopping for Uniform Distribution.
unknowns per test, we also have included these parameters as well as their average values in Table 4.
Table 5 summarizes the results on WISER-ORG with clique and neighborhood stopping criteria.
We can see that ODTNrconsistently outperforms the other algorithms and is very close to the
information-theoretic lower bound.
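The entropy lower bound reported as Low-BND can be computed directly from the prior over hypotheses. A minimal sketch follows; the uniform prior below is an illustrative stand-in for the α-parameterized priors used in the experiments:

```python
import math

def entropy_lower_bound(prior):
    """Shannon entropy (in bits) of the hypothesis prior: with binary test
    outcomes, any identification strategy needs at least this many tests
    in expectation."""
    return -sum(p * math.log2(p) for p in prior if p > 0)

# A uniform prior over m hypotheses gives exactly log2(m) bits.
m = 256
print(entropy_lower_bound([1.0 / m] * m))  # prints 8.0
```

Skewed priors (α > 0) have lower entropy, which is consistent with the smaller Low-BND values in Tables 2 and 3.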
Acknowledgements
A preliminary version of this paper appeared as Jia et al. (2019) in the proceedings of Neural
Information Processing Systems (NeurIPS) 2019.
References
Micah Adler and Brent Heeringa. Approximating optimal binary decision trees. In Approximation, Ran-
domization and Combinatorial Optimization. Algorithms and Techniques, pages 1–9. Springer, 2008.
Esther M Arkin, Henk Meijer, Joseph SB Mitchell, David Rappaport, and Steven S Skiena. Decision trees for
geometric models. International Journal of Computational Geometry & Applications, 8(03):343–363,
1998.
Yossi Azar and Iftah Gamzu. Ranking with submodular valuations. In Proceedings of the twenty-second
annual ACM-SIAM symposium on Discrete Algorithms, pages 1070–1079. SIAM, 2011.
Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. In Machine Learning,
Proceedings of the Twenty-Third International Conference (ICML 2006), Pittsburgh, Pennsylvania,
USA, June 25-29, 2006, pages 65–72, 2006.
Gowtham Bellala, Suresh K. Bhavnani, and Clayton Scott. Active diagnosis under persistent noise with
unknown noise distribution: A rank-based approach. In Proceedings of the Fourteenth International
Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13,
2011, pages 155–163, 2011.
Venkatesan T Chakaravarthy, Vinayaka Pandit, Sambuddha Roy, and Yogish Sabharwal. Approximating
decision trees with multiway branches. In International Colloquium on Automata, Languages, and
Programming, pages 210–221. Springer, 2009.
Venkatesan T. Chakaravarthy, Vinayaka Pandit, Sambuddha Roy, Pranjal Awasthi, and Mukesh K. Mohania.
Decision trees for entity identification: Approximation algorithms and hardness results. ACM Trans.
Algorithms, 7(2):15:1–15:22, 2011.
Yuxin Chen, Seyed Hamed Hassani, and Andreas Krause. Near-optimal bayesian active learning with cor-
related and noisy tests. In Proceedings of the 20th International Conference on Artificial Intelligence
and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA, pages 223–231, 2017.
Ferdinando Cicalese, Eduardo Sany Laber, and Aline Medeiros Saettler. Diagnosis determination: decision
trees optimizing simultaneously worst and expected testing cost. In Proceedings of the 31th International
Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 414–422, 2014.
Sanjoy Dasgupta. Analysis of a greedy active learning strategy. In Advances in neural information processing
systems, pages 337–344, 2005.
M.R. Garey and R.L. Graham. Performance bounds on the splitting algorithm for binary testing. Acta
Informatica, 3:347–355, 1974.
Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in active learning
and stochastic optimization. J. Artif. Intell. Res., 42:427–486, 2011. doi: 10.1613/jair.3278. URL
https://doi.org/10.1613/jair.3278.
Daniel Golovin and Andreas Krause. Adaptive submodularity: A new approach to active learning and
stochastic optimization. CoRR, abs/1003.3967, 2017. URL http://arxiv.org/abs/1003.3967.
Daniel Golovin, Andreas Krause, and Debajyoti Ray. Near-optimal bayesian active learning with noisy
observations. In Advances in Neural Information Processing Systems 23: 24th Annual Conference
on Neural Information Processing Systems 2010. Proceedings of a meeting held 6-9 December 2010,
Vancouver, British Columbia, Canada., pages 766–774, 2010.
Andrew Guillory and Jeff A. Bilmes. Average-case active learning with costs. In Algorithmic Learning
Theory, 20th International Conference, ALT 2009, Porto, Portugal, October 3-5, 2009. Proceedings,
pages 141–155, 2009.
Andrew Guillory and Jeff A. Bilmes. Interactive submodular set cover. In Proceedings of the 27th Interna-
tional Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pages 415–422,
2010.
Andrew Guillory and Jeff A. Bilmes. Simultaneous learning and covering with adversarial noise. In Proceed-
ings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington,
USA, June 28 - July 2, 2011, pages 369–376, 2011.
Anupam Gupta, Viswanath Nagarajan, and R Ravi. Approximation algorithms for optimal decision trees
and adaptive tsp problems. Mathematics of Operations Research, 42(3):876–896, 2017.
Steve Hanneke. A bound on the label complexity of agnostic active learning. In Machine Learning, Pro-
ceedings of the Twenty-Fourth International Conference (ICML 2007), Corvallis, Oregon, USA, June
20-24, 2007, pages 353–360, 2007.
Laurent Hyafil and Ronald L. Rivest. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1):15–17, 1976/77.
Sungjin Im, Viswanath Nagarajan, and Ruben Van Der Zwaan. Minimum latency submodular cover. ACM
Transactions on Algorithms (TALG), 13(1):13, 2016.
Shervin Javdani, Yuxin Chen, Amin Karbasi, Andreas Krause, Drew Bagnell, and Siddhartha S. Srinivasa.
Near optimal bayesian active learning for decision making. In Proceedings of the Seventeenth Inter-
national Conference on Artificial Intelligence and Statistics, AISTATS 2014, Reykjavik, Iceland, April
22-25, 2014, pages 430–438, 2014.
Su Jia, Viswanath Nagarajan, Fatemeh Navidi, and R. Ravi. Optimal decision tree with noisy outcomes.
In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alch´e-Buc, Emily B. Fox, and
Roman Garnett, editors, Annual Conference on Neural Information Processing Systems (NeurIPS),
pages 3298–3308, 2019.
S Rao Kosaraju, Teresa M Przytycka, and Ryan Borgstrom. On an optimal split tree problem. In Workshop
on Algorithms and Data Structures, pages 157–168. Springer, 1999.
Zhen Liu, Srinivasan Parthasarathy, Anand Ranganathan, and Hao Yang. Near-optimal algorithms for shared
filter evaluation in data stream systems. In Proceedings of the ACM SIGMOD International Conference
on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, pages 133–146,
2008.
D. W. Loveland. Performance bounds for binary testing with arbitrary weights. Acta Inform., 22(1):101–114,
1985.
Mikhail Ju. Moshkov. Greedy algorithm with weights for decision tree construction. Fundam. Inform., 104
(3):285–292, 2010.
Mohammad Naghshvar, Tara Javidi, and Kamalika Chaudhuri. Noisy bayesian active learning. In 50th
Annual Allerton Conference on Communication, Control, and Computing, Allerton 2012, Allerton Park
& Retreat Center, Monticello, IL, USA, October 1-5, 2012, pages 1626–1633, 2012.
Feng Nan and Venkatesh Saligrama. Comments on the proof of adaptive stochastic set cover based on adaptive submodularity and its implications for the group identification problem in "group-based active query selection for rapid diagnosis in time-critical situations". IEEE Trans. Information Theory, 63(11):7612–7614, 2017.
Fatemeh Navidi, Prabhanjan Kambadur, and Viswanath Nagarajan. Adaptive submodular ranking and
routing. Oper. Res., 68(3):856–877, 2020.
Robert D. Nowak. Noisy generalized binary search. In Advances in Neural Information Processing Systems
22: 23rd Annual Conference on Neural Information Processing Systems 2009. Proceedings of a meeting
held 7-10 December 2009, Vancouver, British Columbia, Canada., pages 1366–1374, 2009.
Aline Medeiros Saettler, Eduardo Sany Laber, and Ferdinando Cicalese. Trading off worst and expected cost
in decision tree problems. Algorithmica, 79(3):886–908, 2017.
Laurence A Wolsey. An analysis of the greedy algorithm for the submodular set covering problem. Combi-
natorica, 2(4):385–393, 1982.
Appendix A: Proof of Proposition 1.
Recall that an adaptive algorithm for ASRN or ASR can be viewed as a decision tree. We will show that any feasible decision tree for the ASR instance J is also feasible for the ASRN instance I with the same objective, and vice versa.

In one direction, consider a feasible decision tree T for the ASR instance J. For any expanded scenario (i, ω) ∈ H, let P_{i,ω} be the unique path traced in T, and S_{i,ω} the elements selected along P_{i,ω}. Note that, by the definition of a feasible decision tree, at the last node ("leaf") of path P_{i,ω} it holds that f_{i,ω}(S_{i,ω}) = 1, which, in the notation of the original ASRN instance, translates to f_i({(e, ω_e) : e ∈ S_{i,ω}}) = 1.

In the other direction, let T′ be any decision tree for the ASRN instance I. Suppose the target scenario is i ∈ [m] and the element-outcomes on the ∗-elements for i are given by ω, which is unknown to the algorithm. Then a unique path P′_{i,ω} is traced in T′. Let S′_{i,ω} denote the elements on this path. Since i is covered at the end of P′_{i,ω}, we have f_i({(e, ω_e) : e ∈ S′_{i,ω}}) = 1. Now consider T′ as a decision tree for the ASR instance J. Under scenario (i, ω), it is clear that path P′_{i,ω} is traced, and so the elements S′_{i,ω} are selected. It follows that f_{i,ω}(S′_{i,ω}) = f_i({(e, ω_e) : e ∈ S′_{i,ω}}) = 1, which means that scenario (i, ω) is covered at the end of P′_{i,ω}. Therefore T′ is also a feasible decision tree for J. Taking expectations, the cost for J is at most that for instance I.
Appendix B: Details in Section 3
The non-adaptive SFRN algorithm (Algorithm 1) involves two phases. In the first phase, we run the SFR algorithm using sampling to obtain estimates Ḡ_E(e) of the scores G_E(e). If at some step the maximum sampled score is "too low", we move to the second phase, where we perform all the remaining elements in an arbitrary order. The number of samples used to obtain each estimate is polynomial in m, n, and ε^{−1}, so the overall runtime is polynomial.
Pre-processing. We first show that, by losing an O(1) factor in the approximation ratio, we may assume that π_i ≥ n^{−2} for all i ∈ [m]. Let A = {i ∈ [m] : π_i ≤ n^{−2}}; then Σ_{i∈A} π_i ≤ n^{−2}·n = n^{−1}. Replace all the scenarios in A with a single dummy scenario "0" with π_0 = Σ_{i∈A} π_i, and define f_0 to be any f_i with i ∈ A. By our assumption that each f_i must be covered irrespective of the noisy outcomes, it holds that f_{i,ω}([n]) = 1 for each ω ∈ Ω(i), and hence the cover time is at most n. Thus, for any permutation σ, the expected cover times of the old and new instances differ by at most O(n^{−1}·n) = O(1). Therefore, the cover time of any sequence of elements differs by only O(1) between this new instance (in which we removed the scenarios with tiny prior probabilities) and the original instance.
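A minimal sketch of this pre-processing step; the concrete prior vector and threshold below are illustrative (in the paper the threshold is n^{−2}):

```python
def merge_tiny_scenarios(priors, threshold):
    """Pre-processing sketch: scenarios whose prior mass is at most
    `threshold` are merged into a single dummy scenario; the merged mass
    bounds the extra expected cover time incurred by the merge."""
    kept = {i: p for i, p in enumerate(priors) if p > threshold}
    dummy_mass = sum(priors) - sum(kept.values())
    return kept, dummy_mass

priors = [0.5, 0.3, 0.1, 0.05, 0.03, 0.02]
kept, dummy = merge_tiny_scenarios(priors, threshold=0.04)
# Scenarios 4 and 5 are merged: dummy carries mass 0.03 + 0.02 = 0.05.
```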
We now present the formal proof of Theorem 3, with the proofs of the lemmas deferred to Appendices B.1–B.3. To analyze our randomized algorithm, we need the following sampling lemma, which follows from the standard Chernoff bound.

Lemma 7. Let X be a [0,1]-bounded random variable with E[X] ≥ m^{−2}n^{−4}ε. Let X̄ denote the average of m^3·n^4·ε^{−1} many independent samples of X. Then Pr[ X̄ ∉ [(1/2)·E[X], 2·E[X]] ] ≤ e^{−Ω(m)}.

The next lemma shows that sampling does find an approximate maximizer unless the score is very small, and also bounds the failure probability.

Lemma 8. Consider any step of the algorithm, with S = max_{e∈[n]} G_E(e) and S̄ = max_{e∈[n]} Ḡ_E(e), and let e* be the element with Ḡ_E(e*) = S̄. Call this step a failure if (i) S̄ < (1/4)m^{−2}n^{−4}ε and S ≥ (1/2)m^{−2}n^{−4}ε, or (ii) S̄ ≥ (1/4)m^{−2}n^{−4}ε and G_E(e*) < S/4. Then the probability of failure is at most e^{−Ω(m)}.

Based on Lemma 8, in the remaining analysis we condition on the event that our algorithm never encounters a failure, which occurs with probability 1 − e^{−Ω(m)}. To conclude the proof, we need the following key lemma, which essentially states that if the score of the greediest element is low, then the elements selected so far suffice to cover all scenarios with high probability, and hence the ordering of the remaining elements does not matter much.

Lemma 9. Assume that there are no failures. Consider the end of phase 1 in our algorithm, i.e., the first step with Ḡ_E(e*) < (1/4)m^{−2}n^{−4}ε. Then the probability that the realized scenario is not covered is at most m^{−2}.
The above is essentially a consequence of the submodularity of the target functions. Suppose for contradiction that there is a scenario i that, with probability at least m^{−2} over the random outcomes, remains uncovered by the currently selected elements. Recall that, by our feasibility assumption, if all elements were selected then f_i would be covered with probability 1. Thus, by submodularity, there exists an individual element ẽ whose inclusion brings at least as much coverage as the average over all elements in [n], and hence ẽ has a "high" score.
Proof of Theorem 3. Assume that there are no failures. We bound the expected costs (numbers of elements) of phases 1 and 2 separately. By Lemma 8, the element chosen in each step of phase 1 is a 4-approximate maximizer (see the case (ii) failure) of the score used in the SFR algorithm. Thus, by Theorem 4, the expected cost of phase 1 is O(log m) times the optimum. On the other hand, by Lemma 9, the probability that the realized scenario is still uncovered when phase 2 begins is at most m^{−2}. As there are at most n elements in phase 2, its expected cost is only O(1). Therefore, Algorithm 1 is an O(log m)-approximation algorithm for SFRN.
B.1. Proof of Lemma 7.
Let X_1, ..., X_N be i.i.d. samples of the random variable X, where N = m^3·n^4·ε^{−1} is the number of samples. Letting Y = Σ_{i∈[N]} X_i, the usual Chernoff bound implies that for any δ ∈ (0,1),

Pr[ Y ∉ [(1 − δ)·E[Y], (1 + δ)·E[Y]] ] ≤ exp( −(δ²/2)·E[Y] ).

The lemma follows by setting δ = 1/2 and using the assumption that E[Y] = N·E[X_1] = Ω(m).
B.2. Proof of Lemma 8
We consider the two types of failure separately. For the first type, suppose S ≥ (1/2)m^{−2}n^{−4}ε. Using Lemma 7 on the element e ∈ [n] with G_E(e) = S, we obtain

Pr[ S̄ < (1/4)m^{−2}n^{−4}ε ] ≤ Pr[ Ḡ_E(e) < (1/4)m^{−2}n^{−4}ε ] ≤ e^{−Ω(m)}.

So the probability of the first type of failure is at most e^{−Ω(m)}.
For the second type of failure, we consider two further cases:
• S < (1/8)m^{−2}n^{−4}ε. For any e ∈ [n] we have G_E(e) ≤ S < (1/8)m^{−2}n^{−4}ε. Note that Ḡ_E(e) is the average of N independent samples, each with mean G_E(e). We now upper bound the probability of the event B_e that Ḡ_E(e) ≥ (1/4)m^{−2}n^{−4}ε. We first artificially increase each sample mean to (1/8)m^{−2}n^{−4}ε: note that this only increases the probability of event B_e. Now, using Lemma 7, we obtain Pr[B_e] ≤ e^{−Ω(m)}. By a union bound, it follows that Pr[ S̄ ≥ (1/4)m^{−2}n^{−4}ε ] ≤ Σ_{e∈[n]} Pr[B_e] ≤ e^{−Ω(m)}.
• S ≥ (1/8)m^{−2}n^{−4}ε. Consider any e ∈ U with G_E(e) < S/4. By Lemma 7 (artificially increasing G_E(e) to S/4 if needed), it follows that Pr[ Ḡ_E(e) > S/2 ] ≤ e^{−Ω(m)}. Now consider the element e′ with G_E(e′) = S. Again by Lemma 7, it follows that Pr[ Ḡ_E(e′) ≤ S/2 ] ≤ e^{−Ω(m)}. This means that, with probability 1 − e^{−Ω(m)}, the maximizer e* of Ḡ_E satisfies Ḡ_E(e*) ≥ Ḡ_E(e′) > S/2 and hence G_E(e*) ≥ S/4. In other words, assuming S ≥ (1/8)m^{−2}n^{−4}ε, the probability that G_E(e*) < S/4 is at most e^{−Ω(m)}.

Adding up the probabilities over all possible failures, the lemma follows.
B.3. Proof of Lemma 9
Let E denote the elements chosen so far and p the probability that E does not cover the realized scenario-copy of H. That is,

p = Pr_{(i,ω)∼H}[ f_{i,ω}(E) < 1 ] = Σ_{i=1}^m π_i · Pr_{ω∼Ω(i)}[ f_{i,ω}(E) < 1 ].

It follows that there is some i with Pr_{ω∼Ω(i)}[ f_{i,ω}(E) < 1 ] ≥ p. By the definition of separability, if f_{i,ω}(E) < 1 then f_{i,ω}(E) ≤ 1 − ε. Thus,

Σ_{ω∈Ω(i)} π_{i,ω}·f_{i,ω}(E) ≤ Σ_{ω : f_{i,ω}(E)=1} π_{i,ω}·1 + Σ_{ω : f_{i,ω}(E)<1} π_{i,ω}·f_{i,ω}(E) ≤ (1 − εp)·π_i.
On the other hand, taking all the elements, we have f_{i,ω}([n]) = 1 for all ω ∈ Ω(i). Thus,

Σ_{ω∈Ω(i)} π_{i,ω}·f_{i,ω}([n]) = Σ_{ω∈Ω(i)} π_{i,ω} = π_i.

Taking the difference of the above two relations, we have

Σ_{ω∈Ω(i)} π_{i,ω}·( f_{i,ω}([n]) − f_{i,ω}(E) ) ≥ π_i·εp.

Consider the function g(S) := Σ_{ω∈Ω(i)} π_{i,ω}·( f_{i,ω}(S ∪ E) − f_{i,ω}(E) ) for S ⊆ [n], which is also submodular. From the above, we have g([n]) ≥ π_i·εp. Using the submodularity of g,

max_{e∈[n]} g({e}) ≥ εpπ_i/n,  i.e., there exists ẽ ∈ [n] with Σ_{ω∈Ω(i)} π_{i,ω}·( f_{i,ω}(E ∪ {ẽ}) − f_{i,ω}(E) ) ≥ εpπ_i/n.

It follows that G_E(ẽ) ≥ εpπ_i/n ≥ n^{−3}εp, where we used that min_i π_i ≥ n^{−2}. Now, suppose for a contradiction that p ≥ m^{−2}. Then S ≥ G_E(ẽ) ≥ n^{−3}m^{−2}ε ≥ (1/2)n^{−4}m^{−2}ε, so since there is no failure (case (i) of Lemma 8), we deduce that S̄ ≥ (1/4)m^{−2}n^{−4}ε, which contradicts the assumption that phase 1 has ended.
Appendix C: Details in Section 4
C.1. Proof of Lemma 1.
By decomposing the summation on the left-hand side of (3) as H′ = ∪_i (H′ ∩ H_i), and noticing that f_{i,ω}(E) = f_i(ν_E), the problem reduces to showing that, for each i ∈ [m],

Σ_{(i,ω)∈H′∩H_i} π_{i,ω}·( f_{i,ω}(e ∪ E) − f_{i,ω}(E) ) = p_i · E_{i,ν_e}[ f_i(ν_E ∪ {ν_e}) − f_i(ν_E) ].

Recalling that p_i = n_i·π_i/|Ω|^{c_i} and π_{(i,ω)} = π_i/|Ω|^{c_i}, the above simplifies to

(1/n_i)·Σ_{(i,ω)∈H′∩H_i} ( f_{i,ω}(e ∪ E) − f_{i,ω}(E) ) = E_{i,ν_e}[ f_i(ν_E ∪ {ν_e}) − f_i(ν_E) ].

Note that n_i = |H′ ∩ H_i|, so the above is equivalent to

(1/n_i)·Σ_{(i,ω)∈H′∩H_i} f_{i,ω}(e ∪ E) = E_{i,ν_e}[ f_i(ν_E ∪ {ν_e}) ].   (7)

It is straightforward to verify (7) by considering the following two cases.
• If r_i(e) ∈ Ω \ {∗}, then the outcome ν_e is deterministic conditional on scenario i, and so is f_i(ν_E ∪ {ν_e}), the value of f_i after selecting e. On the left-hand side, for every ω ∈ H_i it holds that ν_e = ω_e by the definition of H_i, and hence f_{i,ω}(e ∪ E) = f_i(ν_E ∪ {ν_e}) for every (i, ω) ∈ H_i. Therefore all the terms in the summation are equal to f_i(ν_E ∪ {ν_e}), and hence (7) holds.
• If r_i(e) = ∗, then each outcome o ∈ Ω occurs with equal probability, so we may rewrite the right-hand side as

E_{i,ν_e}[ f_i(ν_E ∪ {ν_e}) ] = Σ_o P_i[ν_e = o]·f_i(ν_E ∪ {(e, o)}) = (1/|Ω|)·Σ_o f_i(ν_E ∪ {(e, o)}).

To analyze the other side, note that by the definitions of H_i and H′, there are equally many expanded scenarios (i, ω) in H′ ∩ H_i with ω_e = o for each outcome o ∈ Ω. Thus, we can rewrite the left-hand side as

(1/n_i)·Σ_{(i,ω)∈H′∩H_i} f_{i,ω}(e ∪ E) = (1/n_i)·Σ_o Σ_{(i,ω)∈H′∩H_i : ω_e=o} f_{i,ω}(e ∪ E) = (1/n_i)·Σ_o (n_i/|Ω|)·f_i(ν_E ∪ {(e, o)}) = (1/|Ω|)·Σ_o f_i(ν_E ∪ {(e, o)}),

which matches the right-hand side of (7) and completes the proof.
C.2. Application of Algorithm 2 and Algorithm 3 to ODTN.
For concreteness, we provide closed-form formulas for Score_c and Score_r in the ODTN problem using Lemma 1; these were used in our ODTN experiments. In §2.3, we formulated ODTN as an ASRN instance. Recall that the outcome set is Ω = {+1, −1}, and the submodular function f (associated with each hypothesis i) measures the proportion of hypotheses eliminated after observing the outcomes of a subset of tests.

As in §4, at any point in Algorithm 2 or 3, after selecting a set E of tests, let ν_E : E → ±1 denote their outcomes. For each hypothesis i ∈ [m], let n_i denote the number of surviving expanded scenarios of i, and let p_i denote the total probability mass of these surviving expanded scenarios. For any S ⊆ [m], we use the shorthand p(S) = Σ_{i∈S} p_i. Finally, let A ⊆ [m] denote the hypotheses compatible with the observed outcomes ν_E (these are exactly the hypotheses i with n_i > 0). Then f(ν_E) = (m − |A|)/(m − 1). Moreover, for any new test T,

f(ν_E ∪ {ν_T}) = (m − |A| + |A ∩ T_−|)/(m − 1)  if ν_T = +1,  and  (m − |A| + |A ∩ T_+|)/(m − 1)  if ν_T = −1.

Recall that T_+, T_−, and T_∗ denote the sets of hypotheses with +1, −1, and ∗ outcomes for test T. So,

( f(ν_E ∪ {ν_T}) − f(ν_E) ) / ( 1 − f(ν_E) ) = |A ∩ T_−|/(|A| − 1)  if ν_T = +1,  and  |A ∩ T_+|/(|A| − 1)  if ν_T = −1.

It is then straightforward to verify the following.

Proposition 2. Consider implementing Algorithm 2 on an ODTN instance. Suppose that, after selecting tests E, the compatible expanded scenarios are H′ (and the compatible original scenarios are A), with the parameters described above. For any test T, if b_T ∈ {+1, −1} is the outcome corresponding to B_T(H′), then the second term in Score_c(T; E, H′) and Score_r(T; E, H′) is:

( |A ∩ T_−|/(|A| − 1) + |A ∩ T_+|/(|A| − 1) ) · p(A ∩ T_∗)/2 + ( |A ∩ T_−|/(|A| − 1) )·p(A ∩ T_+) + ( |A ∩ T_+|/(|A| − 1) )·p(A ∩ T_−).

This expression has a natural interpretation for ODTN: conditioned on the outcomes ν_E so far, it is the expected number of hypotheses newly eliminated by test T (normalized by |A| − 1). The first term of the score, π(L_T(H′)) or π(R_T(H′)), is calculated as for the general ASRN problem. Finally, observe that for the submodular functions used for ODTN, the separation parameter is ε = 1/(m − 1). So, by Theorem 7, we immediately obtain a polynomial-time O(min(r, c) + log m)-approximation algorithm for ODTN.
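The expression in Proposition 2 is easy to compute from the partition (T_+, T_−, T_∗). The helper below is our own illustration (the hypothesis indices and masses are made up), not code from the released implementation:

```python
def odtn_score_term(A, T_plus, T_minus, T_star, p):
    """Second score term from Proposition 2: the expected fraction of
    hypotheses (normalized by |A| - 1) newly eliminated by test T, given
    the compatible set A and per-hypothesis surviving masses p."""
    a = len(A)
    if a <= 1:
        return 0.0
    frac_minus = len(A & T_minus) / (a - 1)  # eliminated when the outcome is +1
    frac_plus = len(A & T_plus) / (a - 1)    # eliminated when the outcome is -1

    def mass(S):
        return sum(p[i] for i in A & S)

    return ((frac_minus + frac_plus) * mass(T_star) / 2
            + frac_minus * mass(T_plus)
            + frac_plus * mass(T_minus))

# Toy instance: 4 compatible hypotheses with uniform surviving mass.
A = {0, 1, 2, 3}
p = {i: 0.25 for i in A}
score = odtn_score_term(A, T_plus={0, 1}, T_minus={2}, T_star={3}, p=p)
# score = (1/3 + 2/3) * 0.25 / 2 + (1/3) * 0.5 + (2/3) * 0.25 = 11/24
```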
C.3. Proof of Theorem 6
The proof is similar to the analysis in Navidi et al. (2020).
With some foresight, set α := 15(r + log m). Write ALG for Algorithm 3 and let OPT be the optimal adaptive policy. It will be convenient to view ALG and OPT as decision trees in which each node represents the "state" of the policy. Nodes of the decision tree are labelled by elements (those selected at the corresponding state), and the branches out of each node are labelled by the outcomes observed at that point. At any state, we use E to denote the previously selected elements and H′ ⊆ M to denote the expanded scenarios that are (i) compatible with the outcomes observed so far and (ii) uncovered. Suppose that at some iteration the elements E have been selected and the outcomes ν_E observed; then a scenario i is said to be covered if f_i(ν_E) = 1, and uncovered otherwise.
For ease of presentation, we use the phrase "at time t" to mean "after selecting t elements". Note that the cost incurred until time t is exactly t. The key step is to show that

a_k ≤ 0.2·a_{k−1} + 3·y_k,  for all k ≥ 1,   (8)

where
• A_k ⊆ M is the set of uncovered expanded scenarios in ALG at time α·2^k, and a_k = p(A_k) is their total probability;
• Y_k is the set of uncovered scenarios in OPT at time 2^{k−1}, and y_k = p(Y_k) is the total probability of these scenarios.
As shown in Section 2 of Navidi et al. (2020), (8) implies that Algorithm 3 is an O(α)-approximation, and hence Theorem 6 follows. To prove (8), we consider the total score collected by ALG between iterations α·2^{k−1} and α·2^k, formally given by

Z := Σ_{α·2^{k−1} < t ≤ α·2^k} Σ_{(E,H′)∈V(t)} max_{e∈[n]\E} ( Σ_{(i,ω)∈R_e(H′)} π_{i,ω} + Σ_{(i,ω)∈H′} π_{i,ω}·( f_{i,ω}(e ∪ E) − f_{i,ω}(E) )/( 1 − f_{i,ω}(E) ) ),   (9)

where V(t) denotes the set of states (E, H′) that occur at time t in the decision tree ALG. We note that all the expanded scenarios seen in states of V(t) are contained in A_{k−1}.

Consider any state (E, H′) at time t in the algorithm. Recall that H′ are the expanded scenarios, and let S ⊆ [m] denote the original scenarios in H′. Let T_{H′}(k) denote the subtree of OPT that corresponds to the paths traced by expanded scenarios in H′ up to time 2^{k−1}. Note that each node (labeled by some element e ∈ [n]) in T_{H′}(k) has at most |Ω| outgoing branches, and one of them corresponds to the outcome o_e(S) defined in Algorithm 3. We define Stem_k(H′) to be the path in T_{H′}(k) that, at each node (labeled e), follows the o_e(S) branch. We also use Stem_k(H′) ⊆ [n] × Ω to denote the observed element-outcome pairs on this path.

Definition 1. Each state (E, H′) is of exactly one of the following types:
• bad, if the probability of uncovered scenarios in H′ at the end of Stem_k(H′) is at least Pr(H′)/3;
• okay, if it is not bad and Pr( ∪_{e∈Stem_k(H′)} R_e(H′) ) is at least Pr(H′)/3;
• good, if it is neither bad nor okay and the probability of scenarios in H′ that get covered by Stem_k(H′) is at least Pr(H′)/3.

Crucially, this categorization of states is well defined. Indeed, each expanded scenario in H′ is (i) uncovered at the end of Stem_k(H′), or (ii) in R_e(H′) for some e ∈ Stem_k(H′), or (iii) covered by some prefix of Stem_k(H′), i.e., its function value reaches 1 on Stem_k(H′). So the total probability of the scenarios in one of these three categories must be at least Pr(H′)/3.

In the next two lemmas, we show a lower bound (Lemma 10) and an upper bound (Lemma 11) on Z in terms of a_k and y_k, which together imply (8) and complete the proof.

Lemma 10. For any k ≥ 1, it holds that Z ≥ α·(a_k − 3y_k)/3.

Proof. The proof of this lower bound is identical to that of Lemma 3 in Navidi et al. (2020) for noiseless ASR. The only difference is that we use the scenario subset R_e(H′) ⊆ H′ instead of the subset "L_e(H) ⊆ H" in the analysis of Navidi et al. (2020).

Lemma 11. For any k ≥ 1, Z ≤ a_{k−1}·( 1 + ln(1/ε) + r + log₂ m ).
Proof. This proof is analogous to that of Lemma 4 in Navidi et al. (2020) but requires new ideas, as detailed below. It has two steps. We first rewrite Z by interchanging its double summation: the outer sum is now over A_{k−1} (instead of over the times between α·2^{k−1} and α·2^k, as in the original definition of Z). Then, for each fixed (i, ω) ∈ A_{k−1}, we upper bound the inner summation, using the assumption that for each element e there are at most r original scenarios with r_i(e) = ∗.

Step 1: Rewriting Z. For any (i, ω) ∈ A_{k−1} that is uncovered in the decision tree ALG at time α·2^{k−1}, let P_{i,ω} be the path traced by (i, ω) in ALG, starting at time α·2^{k−1} and ending at time α·2^k or when (i, ω) is covered, whichever comes first.

Recall that, in the definition of Z, for each time t between α·2^{k−1} and α·2^k we sum over all states (E, H′) at time t. Since t ≥ α·2^{k−1}, and the subset of uncovered scenarios only shrinks as t increases, for any (E, H′) ∈ V(t) we have H′ ⊆ A_{k−1}. So only the expanded scenarios in A_{k−1} contribute to Z. Thus we may rewrite (9) as

Z = Σ_{(i,ω)∈A_{k−1}} π_{i,ω} · Σ_{(e;E,H′)∈P_{i,ω}} ( ( f_{i,ω}(e ∪ E) − f_{i,ω}(E) )/( 1 − f_{i,ω}(E) ) + 1[(i, ω) ∈ R_e(H′)] )
  ≤ Σ_{(i,ω)∈A_{k−1}} π_{i,ω} · ( Σ_{(e;E,H′)∈P_{i,ω}} ( f_{i,ω}(e ∪ E) − f_{i,ω}(E) )/( 1 − f_{i,ω}(E) ) + Σ_{(e;E,H′)∈P_{i,ω}} 1[(i, ω) ∈ R_e(H′)] ).   (10)

Step 2: Bounding the inner summation. The rest of the proof upper bounds each of the two terms in the summation over e ∈ P_{i,ω}, for any fixed (i, ω) ∈ A_{k−1}. To bound the first term, we need the following standard result on submodular functions.

Lemma 12 (Azar and Gamzu (2011)). Let f : 2^U → [0, 1] be any monotone function with f(∅) = 0, and let ε = min{ f(S ∪ {e}) − f(S) : e ∈ U, S ⊆ U, f(S ∪ {e}) − f(S) > 0 } be the separability parameter. Then, for any nested sequence of subsets ∅ = S_0 ⊆ S_1 ⊆ ··· ⊆ S_k ⊆ U, it holds that

Σ_{t=1}^k ( f(S_t) − f(S_{t−1}) )/( 1 − f(S_{t−1}) ) ≤ 1 + ln(1/ε).

It follows immediately that

Σ_{(e;E,H′)∈P_{i,ω}} ( f_{i,ω}(e ∪ E) − f_{i,ω}(E) )/( 1 − f_{i,ω}(E) ) ≤ 1 + ln(1/ε).   (11)
Next, we consider the second term Σ_{(e;E,H′)∈P_{i,ω}} 1[(i, ω) ∈ R_e(H′)]. Recall that S ⊆ [m] is the subset of original scenarios with at least one expanded scenario in H′. Consider the partition of the scenarios S into |Ω| + 1 parts, based on their response entries (from Ω ∪ {∗}) for element e. From Algorithm 3, recall that U_e(S) denotes the part with response ∗, and C_e(S) denotes the largest-cardinality part among the non-∗ responses. Also, o_e(S) ∈ Ω is the outcome corresponding to part C_e(S). Moreover, R_e(H′) ⊆ H′ consists of all the expanded scenarios that do not have outcome o_e(S) on element e.

Suppose that (i, ω) ∈ R_e(H′). Then the observed outcome on e is not o_e(S). Let S′ ⊆ S denote the subset of original scenarios that are also compatible with the observed outcome on e. We claim that |S′| ≤ (|S| + r)/2. To see this, let D_e(S) ⊆ S denote the part with the second-largest cardinality among the non-∗ responses for e. As the observed outcome is not o_e(S) (which corresponds to the largest part), we have

|S′| ≤ |U_e(S)| + |D_e(S)| ≤ |U_e(S)| + ( |S| − |U_e(S)| )/2 = ( |S| + |U_e(S)| )/2 ≤ ( |S| + r )/2.
The first inequality above uses the fact that S′ consists of U_e(S) (the scenarios with response ∗) and some part (other than C_e(S)) with a non-∗ response. The second inequality uses |D_e(S)| ≤ ( |D_e(S)| + |C_e(S)| )/2 ≤ ( |S| − |U_e(S)| )/2. The last inequality uses the upper bound r on the number of ∗ responses per element. It follows that, each time (i, ω) ∈ R_e(H′), the number of compatible (original) scenarios on path P_{i,ω} changes as |S′| ≤ (|S| + r)/2. Hence, after log₂ m such events, the number of compatible scenarios on path P_{i,ω} is at most r. Finally, we use the fact that the number of compatible scenarios decreases by at least one whenever (i, ω) ∈ R_e(H′) to obtain

Σ_{(e;E,H′)∈P_{i,ω}} 1[(i, ω) ∈ R_e(H′)] ≤ r + log₂ m.   (12)

Combining (10), (11) and (12), we obtain the lemma.
Appendix D: Details in Section 5
D.1. A Low-Cost Membership Oracle
Note that Steps 3, 9 and 18 are well defined because the ODTN instance is assumed to be identifiable. If there is no new test in Step 3 with T_+ ∩ Z′ ≠ ∅ and T_− ∩ Z′ ≠ ∅, then we must have |Z′| = 1. If there is no new test in Step 9 with i ∉ T_∗, then we must have identified i uniquely, i.e., Y = ∅. Finally, in Step 18, we use the fact that there are tests that deterministically separate every pair of hypotheses.

Proof. If ī ∈ Z, then it is clear that i = ī in Step 6 and Member(Z) declares "ī = i". Now consider the case ī ∉ Z. Recall that i ∈ Z denotes the unique hypothesis that is still compatible in Step 6, and that Y denotes the set of compatible hypotheses among [m] \ {i}, so Y always contains ī. Hence Y ≠ ∅ in Step 14, which implies that k = 4·log m. Also recall the definitions of the sets W and J from (13).
Algorithm 5 Member(Z): an oracle that checks whether ī ∈ Z.
1: Initialize Z′ ← Z.
2: while |Z′| ≥ 2 do   % While-loop 1: finding a suspect, i.e., reducing |Z′| to 1.
3:   Choose any new test T ∈ T with T_+ ∩ Z′ ≠ ∅ and T_− ∩ Z′ ≠ ∅, and observe the outcome ω_T ∈ {±1}.
4:   Let R be the set of hypotheses ruled out, i.e., R = {j ∈ [m] : M_{T,j} = −ω_T}.
5:   Z′ ← Z′ \ R.
6: Let i be the unique hypothesis in Z′ when the while-loop ends.   % Identified a "suspect".
7: Initialize k ← 0 and Y ← H.
8: while Y ≠ ∅ and k ≤ 4·log m do   % While-loop 2: choose deterministic tests for i.
9:   Choose any new test T with M_{T,i} ≠ ∗ and observe the outcome ω_T ∈ {±1}.
10:  if ω_T ≠ M_{T,i} then   % i is ruled out.
11:    Declare "ī ∉ Z" and stop.
12:  else
13:    Let R be the set of hypotheses ruled out; set Y ← Y \ R and k ← k + 1.
14: if Y = ∅ then
15:   Declare "ī = i" and terminate.
16: else   % Now consider the "bad" case.
17:   Let W ⊆ T denote the tests performed in Step 9, and let

J = { j ∈ Y : M_{T,j} = M_{T,i} for at least 2·log m tests T ∈ W }
  = { j ∈ Y : M_{T,j} = ∗ for at most 2·log m tests T ∈ W }.   (13)

18:   For each j ∈ J, choose a test T = T(j) ∈ T with M_{T,j}, M_{T,i} ≠ ∗ and M_{T,j} ≠ M_{T,i}, and observe its outcome.
19:   Let W′ ⊆ T denote the set of these tests.
20:   if no test in W ∪ W′ rules out i then   % Let i "duel" with the hypotheses in J.
21:     Declare "ī = i".
22:   else
23:     Declare "ī ∉ Z".
Case 1. If ī ∈ J, then we will correctly identify that ī ≠ i in Step 20, as one of the tests in W′ (Step 18) separates ī and i deterministically. So in this case we always declare "ī ∉ Z".
Case 2. If ī ∉ J, then by the definition of J we have ī ∈ T_∗ for at least 2·log m tests T ∈ W. As i has a deterministic outcome for each test in W, the probability that all the outcomes in W are consistent with i is at most m^{−2}. So, with probability at least 1 − m^{−2}, some test in W has an outcome (under ī) that is inconsistent with i, and based on Step 20 we would declare "ī ∉ Z".
To bound the cost, note that the number of tests performed is at most |Z| in Step 3, 4·log m in Step 9, and |J| ≤ |Z| in Step 18, and the proof follows.
Proof. For the first statement, fix any x = (i, ω) ∈ H. Recall that P_x only contains tests from Step 7. We only need to consider the case t_x < |P_x|/2. Let t′_x = 2·t_x, which is a power of 2. By (6), we know that there is some k with t_x < k ≤ t′_x and θ_x(k) < 1/ρ. Hence θ_x(t′_x) < 2/ρ < 1/2.

Consider the point in the algorithm just after performing the first t′_x tests (call them S) on P_x. Because t′_x is a power of two, the algorithm calls the membership oracle in this iteration. Let X ⊆ [m] be the compatible hypotheses after the t′_x-th test on P_x. Because θ_x(t′_x) < 1/2, at most |S|/2 of the tests in S are ∗-tests for hypothesis i; in other words, the weight w_x ≤ |S|/2 at this point in the algorithm. Let

X′ = { y ∈ X : S has at most |S|/2 ∗-tests for y } = { y ∈ X : w_y ≤ |S|/2 }.

Using Lemma 6 with S and X, it follows that |X′| ≤ 2C·m^α. Hence the number of hypotheses y ∈ X with w_y ≤ |S|/2 is at most 2C·m^α; since w_x ≤ |S|/2, it follows that i ∈ Z (recall that Z consists of the 2C·m^α hypotheses with the lowest weight). This means that, after Step 4, we would have correctly identified ī = i, and so P_x ends. Hence |P_x| ≤ t′_x = 2·t_x, which proves the first statement.

The second statement of the lemma follows by taking expectations over all x ∈ H.
D.2. Proof of Lemma 2.
Consider any feasible decision tree T for the ODTN instance and any hypothesis i ∈ [m]. If we condition on ī = i, then T corresponds to a feasible adaptive policy for SSC(i). This is because:
• for any expanded hypothesis (ω, i) ∈ Ω(i), the tests performed in T must rule out all the hypotheses [m] \ {i}, and
• the set of hypotheses ruled out by any test T (conditioned on ī = i) is a random subset that has the same distribution as S_T(i).
Formally, let P_{i,ω} denote the path traced in T under test outcomes ω, and |P_{i,ω}| the number of tests performed along this path. Recall that u_i is the number of unknown tests for i, and that the probability of observing outcomes ω when ī = i is 2^{−u_i}, so this policy for SSC(i) has cost Σ_{(i,ω)∈Ω(i)} 2^{−u_i}·|P_{i,ω}|. Thus, OPT_{SSC}(i) ≤ Σ_{(i,ω)∈Ω(i)} 2^{−u_i}·|P_{i,ω}|. Taking expectations over i ∈ [m], the lemma follows.
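The cost expression in this proof averages path lengths over the 2^{u_i} equally likely noisy-outcome vectors ω. A minimal sketch, with a hypothetical path_length callback standing in for |P_{i,ω}|:

```python
from itertools import product


def expected_path_length(path_length, u_i):
    """Expected number of tests under hypothesis i, averaging over the
    2**u_i equally likely noisy-outcome vectors omega.

    path_length(omega) is a hypothetical callback returning |P_{i,omega}|
    for one outcome vector of the u_i unknown tests.
    """
    total = 0.0
    for omega in product((-1, 1), repeat=u_i):
        total += path_length(omega) * 2.0 ** (-u_i)  # each omega has prob 2^{-u_i}
    return total
```

For example, with u_i = 2 and a path that takes 3 tests when the first unknown outcome is +1 and 5 tests otherwise, the expectation is (3 + 3 + 5 + 5)/4 = 4.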
D.3. Proof of Lemma 3.
For simplicity, write (T′)_+ as T′_+ (and similarly define T′_−, T′_∗). Note that E[|S_T(i) ∩ (A \ {i})|] = (1/2)(|T_+ ∩ A| + |T_− ∩ A|) because i ∈ T_∗. We consider two cases for a test T′ ∈ T.
• If M_{T′,i} = ∗, then

E[|S_{T′}(i) ∩ (A \ {i})|] = (1/2)(|T′_+ ∩ A| + |T′_− ∩ A|) ≤ (1/2)(|T_+ ∩ A| + |T_− ∩ A|),

by the "greedy choice" of T in step 7.
• If i ∈ T′_+ ∪ T′_−, then

E[|S_{T′}(i) ∩ (A \ {i})|] ≤ max{|T′_+ ∩ A|, |T′_− ∩ A|} ≤ |T′_+ ∩ A| + |T′_− ∩ A|,

which is at most |T_+ ∩ A| + |T_− ∩ A| by the choice of T.
In either case the claim holds, and the lemma follows.
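The case analysis above reduces to comparing expected "coverage" values of tests. A small sketch computing E[|S_T(i) ∩ (A \ {i})|] under the three possible positions of i; the parameter names T_plus/T_minus/T_star are my labels for the sets T_+, T_−, T_∗:

```python
def expected_coverage(T_plus, T_minus, T_star, i, A):
    """E[|S_T(i) ∩ (A \\ {i})|]: expected number of hypotheses of A ruled
    out by test T when hypothesis i is realized.  T_plus / T_minus / T_star
    partition the hypotheses by T's outcome (+1 / -1 / unknown)."""
    A = A - {i}
    if i in T_star:
        # outcome is +1 or -1 with probability 1/2 each, ruling out
        # T_minus or T_plus respectively
        return 0.5 * (len(T_plus & A) + len(T_minus & A))
    if i in T_plus:
        # outcome is +1 surely; exactly T_minus is ruled out
        return float(len(T_minus & A))
    return float(len(T_plus & A))  # i in T_minus: outcome -1 surely
```

The greedy choice in step 7 then amounts to picking the test maximizing |T_+ ∩ A| + |T_− ∩ A| over the current set A.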
D.4. Proof of Lemma 4
By definition of α-sparse instances, the maximum number of candidate hypotheses that can be eliminated by performing a single test is m^α. As we need to eliminate m − 1 hypotheses irrespective of the realized hypothesis ī, we must perform at least (m − 1)/m^α = Ω(m^{1−α}) tests under every ī, and the proof follows.
Appendix E: Extension to Non-identifiable ODT Instances
Previous work on the ODT problem usually imposes the following identifiability assumption (e.g., Kosaraju et al. (1999)): for every pair of hypotheses, there is a test that distinguishes them deterministically. However, in many real-world applications, such an assumption does not hold. Thus far, we have also made this identifiability assumption for ODTN (see §2.1). In this section, we show how our results can also be extended to non-identifiable ODTN instances.
To this end, we introduce a slightly different stopping criterion for non-identifiable instances. (Note that it is no longer possible to stop with a unique compatible hypothesis.) Define a similarity graph G on m nodes, each corresponding to a hypothesis, with an edge (i, j) if there is no test separating i and j deterministically. Our algorithms' performance guarantees will now also depend on the maximum degree d of G; note that d = 0 in the perfectly identifiable case. For each hypothesis i ∈ [m], let D_i ⊆ [m] denote the set containing i and all its neighbors in G. We now define two stopping criteria as follows:
• The neighborhood stopping criterion involves stopping when the set K of compatible hypotheses is contained in some D_i, where i might or might not be the true hypothesis ī.
• The clique stopping criterion involves stopping when K is contained in some clique of G.
Note that clique stopping is a stronger notion of identification than neighborhood stopping: if the clique-stopping criterion is satisfied, then so is the neighborhood-stopping criterion. We now obtain an adaptive algorithm with approximation ratio O(d + min(c, r) + log m) for clique-stopping as well as neighborhood-stopping.
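The similarity graph and the neighborhood-stopping check can be sketched directly from the definitions. This is a hypothetical helper, assuming outcomes are stored in a matrix M with 0 denoting the unknown outcome ∗:

```python
STAR = 0  # unknown (*) outcome; +1 / -1 are deterministic


def similarity_graph(M):
    """Build the similarity graph G: edge (i, j) iff no test separates
    i and j deterministically.  Returns the closed neighborhoods
    D_i = {i} ∪ N(i) and the maximum degree d of G."""
    m = len(M[0])
    D = [{i} for i in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            separated = any(M[T][i] != STAR and M[T][j] != STAR
                            and M[T][i] != M[T][j] for T in range(len(M)))
            if not separated:
                D[i].add(j)
                D[j].add(i)
    d = max(len(Di) - 1 for Di in D)
    return D, d


def neighborhood_stop(K, D):
    """Neighborhood stopping: the compatible set K lies inside some D_i."""
    return any(K <= Di for Di in D)
```

Note d = 0 exactly when every pair of hypotheses is deterministically separable, recovering the identifiable case.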
Consider the following two-phase algorithm. In the first phase, we will identify some subset N ⊆ [m] containing the realized hypothesis ī with |N| ≤ d + 1. Given an ODTN instance with m hypotheses and tests T (as in §2.1), we construct the following ASRN instance with hypotheses as scenarios and tests as elements (this is similar to the construction in §2.3). The responses are the same as in ODTN: so the outcomes Ω = {+1, −1}. Let U = T × {+1, −1} be the element-outcome pairs. For each hypothesis i ∈ [m], we define a submodular function:

f̃_i(S) = min{ 1/(m − d − 1) · | ⋃_{T : (T,+1) ∈ S} T_− ∪ ⋃_{T : (T,−1) ∈ S} T_+ | , 1 },   ∀ S ⊆ U.
It is easy to see that each function f̃_i : 2^U → [0, 1] is monotone and submodular, with separability parameter ε = 1/(m − d − 1). Moreover, f̃_i(S) = 1 if and only if at least m − d − 1 hypotheses are incompatible with at least one outcome in S. Equivalently, f̃_i(S) = 1 iff there are at most d + 1 hypotheses compatible with S. By definition of the graph G and its maximum degree d, it follows that the function f̃_i can be covered (i.e., reaches value one) irrespective of the noisy outcomes. Therefore, by Theorem 7 we obtain an O(min(r, c) + log m)-approximation algorithm for this ASRN instance. Finally, note that any feasible policy for ODTN with clique/neighborhood stopping is also feasible for this ASRN instance. So, the expected cost in the first phase is O(min(r, c) + log m)·OPT.
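Evaluating f̃_i from its definition is straightforward. The sketch below uses hypothetical names T_plus/T_minus for the per-test hypothesis sets T_+, T_−, and returns min{|ruled out|/(m − d − 1), 1}:

```python
def f_tilde(S, T_plus, T_minus, m, d):
    """Evaluate the coverage function
        f̃(S) = min{ |union of ruled-out hypotheses| / (m - d - 1), 1 }
    for a set S of (test, outcome) pairs.  T_plus[T] / T_minus[T] are the
    hypothesis sets with deterministic outcome +1 / -1 on test T."""
    ruled_out = set()
    for (T, outcome) in S:
        # outcome +1 rules out the hypotheses in T_minus[T], and vice versa
        ruled_out |= T_minus[T] if outcome == +1 else T_plus[T]
    return min(len(ruled_out) / (m - d - 1), 1.0)
```

The truncation at 1 gives a monotone submodular function whose coverage (value 1) leaves at most d + 1 compatible hypotheses, as in the text.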
Then, in the second phase, we run a simple splitting algorithm that iteratively selects any test T that splits the current set K of consistent hypotheses (i.e., T_+ ∩ K ≠ ∅ and T_− ∩ K ≠ ∅). The second phase continues until K is contained in (i) some clique (for clique-stopping) or (ii) some subset D_i (for neighborhood-stopping). Since the number of consistent hypotheses is |K| ≤ d + 1 at the start of the second phase, there are at most d tests in this phase. So, the expected cost is at most d ≤ d·OPT. Combining both phases, we obtain the following.
Theorem 10. There is an adaptive O(d + min(c, r) + log m)-approximation algorithm for ODTN with the clique-stopping or neighborhood-stopping criterion.
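The second-phase splitting loop can be sketched as follows. This hypothetical version assumes the realized hypothesis has a deterministic outcome on each chosen test, which suffices to illustrate why at most d tests are needed once |K| ≤ d + 1:

```python
def splitting_phase(K, tests, true_i, stop):
    """Second-phase sketch: repeatedly pick any test that splits the current
    compatible set K (both T_+ ∩ K and T_- ∩ K nonempty), observe its
    outcome under the realized hypothesis true_i, and shrink K, until the
    stopping predicate holds.  `tests` is a list of (T_plus, T_minus)
    hypothesis sets.  Returns the final K and the number of tests used."""
    used = 0
    while not stop(K):
        for (T_plus, T_minus) in tests:
            if K & T_plus and K & T_minus:
                # each splitting test removes at least one hypothesis from K
                K = K - (T_minus if true_i in T_plus else T_plus)
                used += 1
                break
        else:
            break  # no splitting test exists; K cannot shrink further
    return K, used
```

Since every splitting test removes at least one hypothesis and |K| ≤ d + 1 at the start, the loop runs at most d times under this assumption.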