
OPERATIONS RESEARCH

Vol. 00, No. 0, Xxxxx 0000, pp. 000–000

issn 0030-364X |eissn 1526-5463 |00 |0000 |0001

INFORMS

doi 10.1287/xxxx.0000.0000

©0000 INFORMS


Optimal Decision Tree and Submodular Ranking

with Noisy Outcomes

Su Jia, Fatemeh Navidi, Viswanath Nagarajan, R. Ravi

A fundamental task in active learning involves performing a sequence of tests to identify an unknown

hypothesis that is drawn from a known distribution. This problem, known as optimal decision tree induction,

has been widely studied for decades and the asymptotically best-possible approximation algorithm has

been devised for it. We study a generalization where certain test outcomes are noisy, even in the more

general case when the noise is persistent, i.e., repeating a test gives the same noisy output. We design

new approximation algorithms for both the non-adaptive setting, where the test sequence must be fixed a priori, and the adaptive setting, where the test sequence depends on the outcomes of prior tests. Previous

work in the area assumed at most a logarithmic number of noisy outcomes per hypothesis and provided

approximation ratios that depended on parameters such as the minimum probability of a hypothesis. Our

new approximation algorithms provide guarantees that are nearly best-possible and work for the general case of a large number of noisy outcomes per test or per hypothesis, where the performance degrades smoothly with this number. In fact, our results hold in a significantly more general setting, where the goal is to cover

stochastic submodular functions. We evaluate the performance of our algorithms on two natural applications with noise: toxic chemical identification and active learning of linear classifiers. Despite our theoretical logarithmic approximation guarantees, our methods give solutions with cost very close to the information-theoretic minimum, demonstrating their effectiveness in practice.

Key words: Approximation Algorithms, Optimal Decision Tree, Submodular Functions, Active Learning


1. Introduction

The classic Optimal Decision Tree (ODT) problem involves identifying an initially unknown hypothesis h that is drawn from a known probability distribution over a set of hypotheses. We can perform tests in order to distinguish between these hypotheses. Each test produces a binary outcome (positive or negative), and the precise outcome of each test-hypothesis pair is known beforehand; thus an instance of ODT can be viewed as a ±1-valued matrix M with the tests as rows and hypotheses as columns. The goal is to identify the true hypothesis h using the fewest tests.
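To make the matrix view concrete, here is a small illustrative check (a toy instance of our own, not from the paper). In the noiseless problem, the true hypothesis can be pinned down by performing all tests exactly when all columns of M are distinct, i.e., every pair of hypotheses disagrees on some test.

```python
# Toy noiseless ODT instance as a +/-1 matrix: tests are rows,
# hypotheses are columns (this specific matrix is our own example).

# 3 tests (rows) x 4 hypotheses (columns)
M = [
    [+1, +1, -1, -1],  # test 0
    [+1, -1, +1, -1],  # test 1
    [+1, +1, +1, -1],  # test 2
]

def identifiable(M):
    """True iff every pair of hypotheses is separated by some test,
    i.e., all columns of M are distinct."""
    m = len(M[0])
    cols = [tuple(row[h] for row in M) for h in range(m)]
    return len(set(cols)) == m

print(identifiable(M))  # True: all 4 columns are distinct
```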

As a motivating application, consider the following task in medical diagnosis detailed in Loveland

(1985). A doctor needs to diagnose a patient’s disease by performing tests. Given an a priori

probability distribution over possible diseases, what sequence of tests should the doctor perform to

identify the disease as quickly as possible? Another application is in active learning (e.g. Dasgupta

(2005)). Given a set of data points, one wants to learn a classifier that labels the points correctly as positive or negative. There is a set of m possible classifiers, which is assumed to contain the

true classiﬁer. In the Bayesian setting, which we consider, the true classiﬁer is drawn from some

known probability distribution. The goal is to identify the true classiﬁer by querying labels at the

minimum number of points in expectation (over the prior distribution). Other applications include

entity identiﬁcation in databases (Chakaravarthy et al. (2011)) and experimental design to choose

the most accurate theory among competing candidates (Golovin et al. (2010)).

Despite the considerable literature on the classic ODT problem, an important issue that is not

considered is that of unknown or noisy outcomes. In fact, our research was motivated by a dataset involving toxic chemical identification where the outcomes of many hypothesis-test pairs are

stated as unknown (see Section 6 for details). While prior work incorporating noise in ODT, for

example Golovin et al. (2010), was restricted to settings with very few noisy outcomes, in this

paper, we design approximation algorithms for the noisy optimal decision tree problem in full

generality.


Specifically, we generalize the ODT problem to allow unknown/noisy entries (denoted by "∗") in the test-hypothesis matrix M, obtaining the Optimal Decision Tree with Noise (ODTN) problem, in which the outcome of each noisy entry in M is a random ±1 value, independent of the other noisy entries. More precisely, if the entry M_{t,h} = ∗ (for hypothesis h and test t) and the realized hypothesis is h, then the outcome of t will be a random ±1 value. We will assume for simplicity that each noisy outcome is ±1 with uniform probability, though our results extend directly to the case where each noisy outcome has a different probability. We consider the

standard persistent noise model, where repeating the same test always produces the same outcome.

Note that this model is more general than non-persistent noise (where repeating a noisy test leads to "fresh" independent ±1 outcomes), since one may create copies of tests and hypotheses to reduce the non-persistent model to the persistent one.

We consider both non-adaptive policies, where the test sequence is ﬁxed upfront, and adaptive

policies, where the test sequence is built incrementally and depends on observed test outcomes.

Evidently, adaptive policies perform at least as well as non-adaptive ones. Indeed, there exist instances where the relative gap between the best adaptive and non-adaptive policies is very large

(see for example, Dasgupta (2005)). However, non-adaptive policies are very simple to implement,

requiring minimal incremental computation, and may be preferred in time-sensitive applications.

In fact, our results hold in a signiﬁcantly more general setting, where the goal is to cover stochastic

submodular functions. In the absence of noisy outcomes, the non-adaptive and adaptive versions

of this problem were studied by Azar and Gamzu (2011) and Navidi et al. (2020), respectively. Other than

the ODT problem, this submodular setting captures a number of applications such as multiple-

intent search ranking, decision region determination and correlated knapsack cover: see Navidi

et al. (2020) for details. Our work is the ﬁrst to handle noisy outcomes in all these applications.


1.1. Contributions

We derive most of our results for the ODTN problem as corollaries of a more general problem, Submodular Function Ranking with Noisy Outcomes, which is a natural extension of the Submodular Function Ranking problem introduced by Azar and Gamzu (2011). We first state our results before formally defining this problem in Section 2.3.

First, we obtain an O(log(1/ε))-approximation algorithm (see Theorem 3) for Non-Adaptive Submodular Function Ranking with noisy outcomes (SFRN), where ε is a separability parameter of the underlying submodular functions. As a special case, for the ODTN problem (both adaptive and non-adaptive) we consider submodular functions with separability ε = 1/m, so the above result immediately implies an O(log m)-approximation for non-adaptive ODTN. This bound is the best possible (up to constant factors) even in the noiseless case, assuming P ≠ NP.

As our second contribution, we obtain an O(min{c log |Ω|, r} + log(m/ε))-approximation algorithm (Theorem 7) for Adaptive Submodular Ranking with noisy outcomes (ASRN), which implies an O(min{c, r} + log m) bound for ODTN by setting ε = 1/m, where Ω is the set of random outcomes we may observe when selecting elements. The term min{c log |Ω|, r} corresponds to the "noise sparsity" of the instance (see Section 2 for formal definitions). For the ODTN problem, c (resp. r) is the maximum number of noisy outcomes in each column (resp. row) of the test-hypothesis matrix M. In the noiseless case, c = r = 0 and our result matches the best approximation ratio for the ODT and Adaptive Submodular Ranking problems (Navidi et al. (2020)). In the noisy case, our performance guarantee degrades smoothly with the noise sparsity. For example, we obtain a logarithmic approximation ratio (which is the best possible) as long as the number of noisy outcomes in each row or column is at most logarithmic. For ODTN, Golovin et al. (2010) obtained an O(log²(1/p_min))-approximation algorithm which runs in polynomial time only when c = O(log m); here p_min ≤ 1/m is the minimum probability of any hypothesis. Our result improves on this in that (i) the running


time is polynomial irrespective of the number of noisy outcomes and (ii) the approximation ratio

is better by at least one logarithmic factor.

While the above algorithm admits a nice approximation ratio when there are few noisy entries

in each row or column of M, as our third contribution, we consider the other extreme, when each

test has only a few deterministic entries (or equivalently, a large number of noisy outcomes). Here,

we focus on the special case of ODTN. At ﬁrst sight, higher noise seems to only render the problem

more challenging, but somewhat surprisingly, we obtain a much better approximation ratio in this

regime. Specifically, if the number of noisy outcomes in each test is at least m − O(√m), we obtain an algorithm whose cost is O(log m) times the optimum and which returns the target hypothesis with high probability. We establish this result by relating the cost to a Stochastic Set Cover instance, whose cost lower-bounds that of the ODTN instance.

Finally, we tested our algorithms on synthetic datasets as well as a real dataset (arising in toxic chemical identification). We compared the empirical performance of our algorithms to an information-theoretic lower bound. The cost of the solution returned by our non-adaptive algorithm is typically within 50% of this lower bound, and within 20% for the adaptive algorithm, demonstrating the effective practical performance of our algorithms.

As a final remark, although in this work we consider the uniform distribution for noisy outcomes, our results extend directly to the case where each noisy outcome has a different probability of being ±1. Suppose that the probability of every noisy outcome is between δ and 1 − δ. Then our results on ASRN continue to hold, irrespective of δ, and the result for the many-unknowns version holds with a slightly worse O((1/δ) log m) approximation ratio.

1.2. Related Work

The optimal decision tree problem (without noise) has been extensively studied for several decades:

see Garey and Graham (1974), Hyaﬁl and Rivest (1976/77), Loveland (1985), Arkin et al. (1998),


Kosaraju et al. (1999), Adler and Heeringa (2008), Chakaravarthy et al. (2009), Gupta et al. (2017).

The state-of-the-art result, by Gupta et al. (2017), is an O(log m)-approximation for instances with arbitrary probability distributions and costs. Chakaravarthy et al. (2011) also showed that ODT cannot be approximated to a factor better than Ω(log m), unless P = NP.

The application of ODT to Bayesian active learning was formalized in Dasgupta (2005). There

are also several results on the statistical complexity of active learning, e.g., Balcan et al. (2006), Hanneke (2007), Nowak (2009), where the focus is on proving bounds for structured hypothesis

classes. In contrast, we consider arbitrary hypothesis classes and obtain computationally eﬃcient

policies with provable approximation bounds relative to the optimal (instance speciﬁc) policy. This

approach is similar to that in Dasgupta (2005), Guillory and Bilmes (2009), Golovin and Krause

(2011), Golovin et al. (2010), Cicalese et al. (2014), Javdani et al. (2014).

The noisy ODT problem was studied previously in Golovin et al. (2010). Using a connection to adaptive submodularity, Golovin and Krause (2011) obtained an O(log²(1/p_min))-approximation algorithm for noisy ODT in the presence of very few noisy outcomes, where p_min ≤ 1/m is the minimum probability of any hypothesis.∗ In particular, the running time of the algorithm in Golovin et al. (2010) is exponential in the number of noisy outcomes per hypothesis, which is polynomial only if this number is at most logarithmic in the number of hypotheses/tests. As noted earlier, our result improves both the running time (it is now polynomial for any number of noisy outcomes) and the approximation ratio. We note that an O(log m) approximation ratio (still only for very sparse noise) follows from work on the "equivalence class determination" problem by Cicalese et al. (2014). For this setting, our result is also an O(log m) approximation, but our algorithm is simpler. More importantly, ours is the first result that can handle any number of noisy outcomes.

∗ The paper Golovin et al. (2010) states the approximation ratio as O(log(1/p_min)) because it relied on an erroneous claim in Golovin and Krause (2011). The correct approximation ratio, based on Nan and Saligrama (2017) and Golovin and Krause (2017), is O(log²(1/p_min)).


Other variants of noisy ODT have also been considered, e.g. Naghshvar et al. (2012), Bellala et al.

(2011), Chen et al. (2017), where the goal is to identify the correct hypothesis with at least some

target probability. The theoretical results in Chen et al. (2017) provide “bicriteria” approximation

bounds where the algorithm has a larger error probability than the optimal policy. Our setting is

diﬀerent because we enforce zero probability of error.

Many algorithms for ODT (including ours) rely on some underlying submodularity properties.

We briefly survey some background results. In the basic Submodular Cover problem, we are given a set of elements and a submodular function f. The goal is to use the minimum number of elements to make the value of f reach a certain threshold. Wolsey (1982) first considered this problem and proved that the natural greedy algorithm is a (1 + ln(1/ε))-approximation, where ε is the minimal positive marginal increment of the function. As a natural generalization, in the Submodular Function Ranking problem we are given multiple submodular functions, and need to sequentially select elements so as to minimize the total cover time of those functions. Azar and Gamzu (2011) obtained an O(log(1/ε))-approximation algorithm for this problem, and Im et al. (2016) extended this result to also handle costs. More recently, Navidi et al. (2020) studied an adaptive version of the submodular ranking problem.
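As an illustration of Wolsey's greedy rule (a sketch of our own with an assumed coverage function, not code from any of the cited papers), the algorithm repeatedly adds the element with the largest marginal increase of f until the unit threshold is reached:

```python
# Greedy Submodular Cover sketch.  Here f is the coverage function
# f(S) = |union of the sets chosen in S| / |universe|, which is monotone
# submodular with separability eps = 1/|universe|.  The instance below
# (sets, universe) is an assumed toy example.

def greedy_cover(sets, universe):
    f = lambda S: len(set().union(*[sets[e] for e in S])) / len(universe)
    chosen = []
    while f(chosen) < 1.0:
        # marginal gain of each remaining element
        gains = {e: f(chosen + [e]) - f(chosen) for e in sets if e not in chosen}
        best = max(gains, key=gains.get)
        if gains[best] == 0:   # no element makes progress; stop
            break
        chosen.append(best)
    return chosen

sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1, 6}}
print(greedy_cover(sets, {1, 2, 3, 4, 5, 6}))  # covers the universe, e.g. ['a', 'c']
```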

Finally, we note that there is also work on minimizing the worst-case (instead of average case)

cost in ODT and active learning; see e.g., Moshkov (2010), Saettler et al. (2017), Guillory and

Bilmes (2010, 2011). These results are incomparable to ours because we are interested in the average

case, i.e. minimizing expected cost.

2. Preliminaries

2.1. Optimal Decision Tree with Noise

In the Optimal Decision Tree with Noise (ODTN) problem, we are given a set of m possible hypotheses with a prior probability distribution {π_i}_{i=1}^m, from which an unknown hypothesis ī is


drawn. There is also a set T of n binary tests, each test T ∈ T associated with a 3-way partition (T+, T−, T∗) of [m], where the outcome of test T is

• positive if ī ∈ T+,
• negative if ī ∈ T−, and
• positive or negative with probability 1/2 each if ī ∈ T∗ (noisy outcomes).

We assume that, conditioned on ī, the noisy outcomes are independent. The outcomes for all test-hypothesis pairs can be summarized in a {+1, −1, ∗}-valued n × m matrix M.

While we know the 3-way partition (T+, T−, T∗) for each test T ∈ T upfront, we are not aware of the actual outcomes of the noisy test-hypothesis pairs. It is assumed that the realized hypothesis ī can be uniquely identified by performing all tests, regardless of the outcomes of the ∗-tests. This means that for every pair i, j ∈ [m] of hypotheses, there is some test T ∈ T with i ∈ T+ and j ∈ T−, or vice-versa. We show how to relax this "identifiability" assumption in Appendix E. The goal is to perform a sequence of tests to identify hypothesis ī using the minimum expected number of tests, which will be formally defined soon. Note that the expectation is taken over both the prior distribution of ī and the random outcomes of the noisy tests for ī.

Types of Policies. A non-adaptive policy is specified by a permutation of tests, denoting the order in which they will be tried until identification of the underlying hypothesis. The policy performs tests in this sequence and eliminates incompatible hypotheses until there is a unique compatible hypothesis (which is ī). Note that the number of tests performed under such a policy is still random, as it depends on ī and the outcomes of the noisy tests.

An adaptive policy chooses tests incrementally, depending on prior test outcomes. The state of a policy is a tuple (E, d), where E ⊆ T is a subset of tests and d ∈ {±1}^E denotes the observed outcomes of the tests in E. An adaptive policy is specified by a mapping Φ from states to tests, where Φ(E, d) is the next test to perform at state (E, d). Define the (random) cost


Cost(Φ) of a policy Φ to be the number of tests performed until ī is uniquely identified, i.e., all other hypotheses have been eliminated. The goal is to find a policy Φ with minimum E[Cost(Φ)]. Again, the expectation is over the prior distribution of ī as well as the outcomes of the noisy tests.

Equivalently, we can view a policy as a decision tree with nodes corresponding to states, labels at

nodes representing the test performed at that state and branches corresponding to the ±1 outcome

at the current test. In particular, a non-adaptive policy is simply a decision tree where all nodes

on each level are labelled with the same test.
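The elimination process for a non-adaptive policy can be sketched as follows (a toy instance of our own, not from the paper; '*' marks noisy entries, realized once up front to model persistent noise):

```python
import random

# Run a non-adaptive policy (a fixed permutation of tests) on an ODTN
# instance.  Entries of M are +1, -1 or '*'.  Noise is persistent: each
# '*' outcome for the true hypothesis is drawn once and never re-drawn.

def run_nonadaptive(M, order, true_h, rng):
    m = len(M[0])
    # persistent realization of all outcomes for the true hypothesis
    realized = [row[true_h] if row[true_h] != '*' else rng.choice([+1, -1])
                for row in M]
    alive = set(range(m))
    for num_tests, t in enumerate(order, start=1):
        outcome = realized[t]
        # keep only hypotheses compatible with the observed outcome
        alive = {h for h in alive if M[t][h] in (outcome, '*')}
        if len(alive) == 1:
            return num_tests   # unique compatible hypothesis remains
    return len(order)

M = [
    [+1, -1, '*'],   # test 0
    [+1, '*', -1],   # test 1
    [-1, +1, +1],    # test 2
]
rng = random.Random(0)
print(run_nonadaptive(M, order=[0, 1, 2], true_h=0, rng=rng))  # → 2
```

Note that the returned count is random in general (it depends on the realized hypothesis and on the noisy outcomes); here it is deterministic because hypothesis 0 has no '*' entries.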

As the number of states can be exponential, we cannot hope to specify arbitrary adaptive policies.

Instead, we want implicit policies Φ, where given any state (E, d), the test Φ(E , d) can be computed

eﬃciently. This would imply that the total time taken on any decision path is polynomial. We

note that an optimal policy Φ∗can be very complex and the map Φ∗(E , d) may not be eﬃciently

computable. We will still compare the performance of our (eﬃcient) policy to Φ∗.

Noise Model. In this paper, we consider the persistent noise model. That is, repeating a test T with ī ∈ T∗ always produces the same outcome. An alternative model is non-persistent noise, where each run of test T with ī ∈ T∗ produces an independent random outcome. The persistent noise model is more appropriate for handling missing data. It also contains the non-persistent noise model as a special case (by introducing multiple tests with identical partitions). The persistent-noise model is also more challenging from an algorithmic point of view.

In fact, our results hold in a substantially more general setting (than ODT), that of covering

arbitrary submodular functions. In Section 2.2 we ﬁrst describe this setting in the noiseless case,

which is well-understood (prior to our work). Then, in Section 2.3 we describe the setting with

noisy outcomes, which is the focus of our paper.


2.2. Adaptive Submodular Ranking (Noiseless Case)

We now review the (non-adaptive and adaptive) Submodular Ranking problems introduced by

Azar and Gamzu (2011) and Navidi et al. (2020) respectively.

Submodular Function Ranking. An instance of Submodular Function Ranking (SFR) consists of a ground set of elements [n] := {1, ..., n} and a collection of monotone submodular functions {f_1, ..., f_m}, f_i : 2^[n] → [0, 1], with f_i(∅) = 0 and f_i([n]) = 1 for all i ∈ [m]. Each i ∈ [m] is called a scenario. An unknown target scenario ī is drawn from a known distribution {π_i} over [m].

A solution to SFR is a permutation σ = (σ(1), ..., σ(n)) of the elements. Given any such permutation, the cover time of scenario i is C(i, σ) := min{t | f_i(σ^t) = 1}, where σ^t = (σ(1), ..., σ(t)) is the t-prefix of permutation σ. In words, the cover time is the earliest time at which the value of f_i reaches the unit threshold. The goal is to find a permutation σ of [n] with minimum expected cover time E_ī[C(ī, σ)] = Σ_{i∈[m]} π_i · C(i, σ).

The separability parameter ε > 0 is defined as the minimum positive marginal increment of any function, i.e., ε := min{f_i(S ∪ {e}) − f_i(S) > 0 | S ⊆ [n], i ∈ [m], e ∈ [n]}. We will use the following.

Theorem 1 (Azar and Gamzu (2011)). There is an O(log(1/ε))-approximation algorithm for SFR.
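A small worked example (ours, with assumed coverage-style functions) of the cover time C(i, σ) and the separability parameter ε:

```python
# Assumed SFR instance: ground set [n] = {0, 1, 2}; scenario i is covered
# once its "demand" set D_i lies in the selected prefix, via the monotone
# submodular function f_i(S) = |S ∩ D_i| / |D_i|.

demands = {0: {0}, 1: {0, 2}, 2: {1, 2}}   # D_i per scenario
prior   = {0: 0.5, 1: 0.25, 2: 0.25}       # pi_i

def cover_time(i, sigma):
    f = lambda S: len(set(S) & demands[i]) / len(demands[i])
    for t in range(1, len(sigma) + 1):
        if f(sigma[:t]) == 1.0:
            return t
    return len(sigma)

sigma = [0, 2, 1]
expected = sum(prior[i] * cover_time(i, sigma) for i in demands)
print(expected)   # 0.5*1 + 0.25*2 + 0.25*3 = 1.75

# separability: smallest positive marginal increment = 1/max|D_i| = 0.5
```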

Adaptive Submodular Ranking. In the Adaptive Submodular Ranking (ASR) problem, in addition to the above input to SFR, for each scenario i ∈ [m] we are given a response function r_i : [n] → Ω, where Ω is a finite set of outcomes (or responses; we use the two terms interchangeably). A solution to ASR is an adaptive sequence of elements: the sequence is adaptive because it can depend on the outcomes of previous elements. When the policy selects an element e ∈ [n], it receives the outcome ō = r_ī(e) ∈ Ω, whereby any scenario i with r_i(e) ≠ ō can be ruled out.

The state of an adaptive policy is a tuple (E, d), where E ⊆ [n] is the subset of previously selected elements and d ∈ Ω^E denotes the observed responses on E. An adaptive policy is then specified


by a mapping Φ from states to elements, where Φ(E, d) is the next element to select at state (E, d). Note that any adaptive policy Φ induces, for each scenario i, a unique sequence σ_i of elements that will be selected if the target scenario ī = i. The cover time of i is defined as C(i, Φ) := min{t | f_i(σ_i^t) = 1}. The goal is to find a policy Φ with minimum expected cover time Σ_{i∈[m]} π_i · C(i, Φ). We will use the following result in Section 4.

Theorem 2 (Navidi et al. (2020)). There is an O(log(m/ε))-approximation algorithm for ASR.

As discussed in Navidi et al. (2020), the optimal decision tree problem (without noise) is a special

case of ASR. We show later that even the noisy version ODTN can be reduced to a noisy variant

of ASR (which we deﬁne next).

2.3. Adaptive Submodular Ranking with Noise

In this paper, we introduce a new variant of ASR by incorporating noisy outcomes, which gener-

alizes the ODTN problem.

ASR with Noise. An instance of the Adaptive Submodular Ranking with Noise (ASRN) problem consists of a ground set of elements [n], a finite set Ω of outcomes, and a collection of monotone submodular functions {f_1, ..., f_m}, where each f_i : 2^([n]×Ω) → [0, 1] satisfies f_i(∅) = 0 and f_i([n]×Ω) = 1. Note that the ground set of each function f_i is [n] × Ω, i.e., all element-outcome pairs. As before, each i ∈ [m] is called a scenario and an unknown target scenario ī is drawn from a given distribution {π_i}_{i=1}^m. For each scenario i ∈ [m], we are given a response function r_i : [n] → Ω ∪ {∗}. When an element e is selected, its outcome is:

• r_i(e) if r_i(e) ∈ Ω, and
• a uniformly random response from Ω if r_i(e) = ∗ (noisy outcome).


The responses can be summarized in an n × m matrix M with entries from Ω ∪ {∗}. Conditioned on ī, we assume that all noisy outcomes are independent. Our results extend to arbitrary distributions for noisy outcomes, but we work with the uniform case for simplicity.

As in the noiseless case, the state of a policy is a tuple (E, d), where E ⊆ [n] denotes the previously selected elements and d ∈ Ω^E denotes their observed responses. A non-adaptive policy is simply given by a permutation of all elements and involves selecting elements in this (static) sequence. An adaptive policy is a mapping Φ from states to elements, where Φ(E, d) is the next element to select at state (E, d). Scenario i is said to be covered in state (E, d) if f_i({(e, d_e) : e ∈ E}) = 1, i.e., function f_i is covered by the element-response pairs observed so far. The goal is to cover the target scenario ī using the minimum expected number of elements.

Unlike the noiseless case, in ASRN each scenario i may trace multiple paths in the decision tree corresponding to policy Φ. However, if we condition on the responses ω ∈ Ω^n of all elements, each scenario i traces a unique path, corresponding to a sequence σ_{i,ω} of element-response pairs. The cover time of scenario i under ω is defined as C(i, Φ | ω) := min{t | f_i(σ^t_{i,ω}) = 1}, where σ^t_{i,ω} consists of the first t element-response pairs in σ_{i,ω}. The expected cover time of scenario i is ECT(i, Φ) := Σ_{ω∈Ω^n} Pr(ω | i) · C(i, Φ | ω), where Pr(ω | i) is the probability of observing responses ω conditioned on ī = i. Finally, the expected cost of policy Φ is Σ_{i∈[m]} π_i · ECT(i, Φ).

For each scenario i, we assume that the function f_i can always be covered irrespective of the noisy outcomes (when ī = i). In other words, for any i ∈ [m] and ω ∈ Ω^n that is consistent with scenario i (i.e., ω_e = r_i(e) for each e with r_i(e) ≠ ∗), we must have f_i({(e, ω_e) : e ∈ [n]}) = 1. In the absence of this assumption, the optimal value (as defined above) would be unbounded.

Connection to ODTN. The ODTN problem can be cast as a special case of the ASRN problem, where the n tests T in ODTN correspond to the elements [n] in ASRN, and the m hypotheses in ODTN correspond to the scenarios in ASRN, with the same prior distribution. The outcome set is


Ω = {±1}. Define the response function for each test T ∈ T as follows. Let (T+, T−, T∗) be the 3-way partition of [m] for test T. For any hypothesis (scenario) i ∈ [m], define r_i(T) = o if i ∈ T^o, for each o ∈ Ω ∪ {∗}. For any i ∈ [m], define the submodular function

f_i(S) = (1/(m−1)) · | ⋃_{T : (T,+1)∈S} T− ∪ ⋃_{T : (T,−1)∈S} T+ |,  ∀ S ⊆ T × {+1, −1}.

Note that the element-outcome pairs here are U = T × {+1, −1}. It is easy to see that each function f_i : 2^U → [0, 1] is monotone and submodular. Also, these functions f_i happen to be uniform over all i. Moreover, the separability parameter is ε = 1/(m−1). Crucially, f_i(S) corresponds to the fraction of hypotheses (other than i) that are incompatible with at least one outcome in S: for example, if S contains a positive outcome (T, +1) then the hypotheses in T− are incompatible (similarly for negative outcomes). So f_i has value one exactly when i is identified as the only compatible hypothesis. By the assumption that the target hypothesis can be uniquely identified, the function f_i can be covered (i.e., reaches value one) irrespective of the noisy outcomes.
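The coverage function in this reduction can be sketched as follows (the partitions below are toy data of our own choosing); note how f_i(S) reaches value one exactly when every other hypothesis has been eliminated:

```python
# ODTN coverage function from the reduction: f_i(S) is the fraction of the
# other m-1 hypotheses eliminated by the test-outcome pairs in S.  A +1
# outcome on test T eliminates T_minus; a -1 outcome eliminates T_plus.

m = 4  # hypotheses 0..3
# assumed 3-way partitions (T_plus, T_minus, T_star) for two tests
tests = {
    "T1": ({0, 1}, {2, 3}, set()),
    "T2": ({0}, {1, 2}, {3}),
}

def f(i, S):
    """S is a set of (test, outcome) pairs with outcome in {+1, -1}."""
    eliminated = set()
    for T, o in S:
        plus, minus, _star = tests[T]
        eliminated |= minus if o == +1 else plus
    eliminated.discard(i)   # count only hypotheses other than i
    return len(eliminated) / (m - 1)

print(f(0, {("T1", +1)}))               # eliminates {2, 3}: value 2/3
print(f(0, {("T1", +1), ("T2", +1)}))   # eliminates {1, 2, 3}: value 1.0
```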

2.4. Expanded Scenario Set

In our analysis for both the non-adaptive and adaptive ASRN problems, we will consider an equivalent noiseless ASR instance. Let I be a given ASRN instance with scenarios [m]. The ASR instance J considers an expanded set of scenarios. For any scenario i ∈ [m], define

Ω(i) := {ω ∈ Ω^n : ω_e = r_i(e) for all e ∈ [n] with r_i(e) ≠ ∗},

denoting all outcome vectors that are consistent with scenario i. For any ω ∈ Ω(i), the expanded scenario (i, ω) corresponds to the original scenario i ∈ [m] when the outcome of each element e is ω_e. Note that an expanded scenario also fixes all noisy outcomes. We write H_i := {(i, ω) : ω ∈ Ω(i)} and H = ∪_{i=1}^m H_i for the set of all expanded scenarios.


To define the prior distribution in the ASR instance, let c_i = |{e ∈ [n] : r_i(e) = ∗}| be the number of noisy outcomes for i ∈ [m]. Since the outcome of any ∗-element for i is uniformly drawn from Ω, each of the |Ω|^{c_i} possible expanded scenarios for i occurs with the same probability π_{i,ω} = π_i / |Ω|^{c_i}. To complete the reduction, for each (i, ω) ∈ H, we define the response function

r_{i,ω} : [n] → Ω,  r_{i,ω}(e) = ω_e,  ∀ e ∈ [n],

and the submodular coverage function

f_{i,ω} : 2^[n] → [0, 1],  f_{i,ω}(S) = f_i({(e, ω_e) : e ∈ S}),  ∀ S ⊆ [n].

By this definition, since f_i is monotone and submodular on [n] × Ω, the function f_{i,ω} is also monotone and submodular on [n]. We will thus work with the ASR (noiseless) instance on the expanded scenarios, with response functions r_{i,ω} and submodular functions f_{i,ω}. In Appendix A, we formally establish the following reduction.

Proposition 1. The ASRN instance I is equivalent to the ASR instance J.

Crucially, the number of expanded scenarios |H| may be exponentially large, as |H| ≤ Σ_{i∈[m]} |Ω|^{c_i}. So we cannot merely apply existing algorithms for the noiseless ASR problem. In §3 and §4 we show different ways of managing the expanded scenarios and obtaining polynomial-time algorithms.
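The expanded-scenario construction can be illustrated as follows (a tiny assumed instance of our own): each way of fixing the '∗' outcomes of scenario i yields one expanded scenario (i, ω) with probability π_i / |Ω|^{c_i}.

```python
from itertools import product

# Assumed ASRN toy instance with Omega = {+1, -1} and n = 3 elements.
Omega = (+1, -1)
responses = {                 # r_i(e) for elements e = 0, 1, 2
    "i1": [+1, '*', -1],      # c_{i1} = 1 noisy outcome
    "i2": ['*', '*', +1],     # c_{i2} = 2 noisy outcomes
}
prior = {"i1": 0.5, "i2": 0.5}

def expand(i):
    """Yield every expanded scenario (omega, probability) for scenario i."""
    stars = [e for e, r in enumerate(responses[i]) if r == '*']
    for fill in product(Omega, repeat=len(stars)):
        omega = list(responses[i])
        for e, o in zip(stars, fill):
            omega[e] = o
        yield tuple(omega), prior[i] / len(Omega) ** len(stars)

H = {(i, w): p for i in responses for (w, p) in expand(i)}
print(len(H))            # 2 + 4 = 6 expanded scenarios
print(sum(H.values()))   # total probability 1.0
```

The exponential blow-up discussed above is visible here: scenario i2 alone contributes |Ω|^2 = 4 expanded scenarios.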

3. Non-Adaptive Algorithm

The main result in this section is an O(log(1/ε))-approximation for Non-Adaptive Submodular Function Ranking with noisy outcomes (SFRN), where ε > 0 is the separability parameter of the submodular functions. By Proposition 1, the SFRN problem is equivalent to the SFR problem on the expanded scenarios. However, as noted above, we cannot use Theorem 1 directly, as the SFR instance has an exponential number of scenarios. Nevertheless, we can obtain the following result.


Theorem 3. There is a poly(1/ε, n, m)-time O(log(1/ε))-approximation algorithm for the SFRN problem.

Observe that for ODTN, ε = 1/(m−1); thus we obtain the following result for ODTN.

Corollary 1. There is an O(log m)-approximation for non-adaptive ODTN.

High-Level Ideas. The algorithm of Azar and Gamzu (2011) for SFR is a greedy-style algorithm that at any iteration, having already chosen elements E, assigns to each e ∈ [n] \ E a score that measures the coverage gain when it is selected, defined as

G_E(e) := Σ_{(i,ω)∈H : f_{i,ω}(E) < 1} π_{i,ω} · (f_{i,ω}({e} ∪ E) − f_{i,ω}(E)) / (1 − f_{i,ω}(E)) = Σ_{(i,ω)∈H} π_{i,ω} · Δ_E(i, ω; e),  (1)

where

Δ_E(i, ω; e) = (f_{i,ω}({e} ∪ E) − f_{i,ω}(E)) / (1 − f_{i,ω}(E)) if f_{i,ω}(E) < 1, and Δ_E(i, ω; e) = 0 otherwise.  (2)

The algorithm then selects the element with the maximum score.

Since this summation involves exponentially many terms, we do not know how to compute the exact value of (1) in polynomial time. However, using the fact that G_E(e) is the expectation of ∆_E(i,ω;e) over the expanded scenarios (i,ω) ∈ H, we will show how to obtain a randomized constant-approximate maximizer by sampling from H. Moreover, we use the following extension of Theorem 1, which follows directly from the analysis in Im et al. (2016).

Theorem 4 (Azar and Gamzu (2011), Im et al. (2016)). Consider the SFR algorithm that selects, at each step, an element e with G_E(e) ≥ Ω(1) · max_{e′∈U} G_E(e′). This is an O(log(1/ε))-approximation algorithm.

Consequently, if we always find an approximate maximizer of G_E(e) by sampling, then Theorem 3 would follow from Theorem 4. However, this sampling approach alone is not sufficient, because it can fail when the value G_E(e) is very small. To deal with this, a key observation is that when the score G_E(e) is small for all elements e, then it must be that (with high probability) the already-selected elements E have covered the realized scenario, so any future elements would not affect the expected cover time. The formal analysis is given in Appendix B.

Algorithm 1 Non-adaptive SFRN algorithm.
1: Initialize E ← ∅ and sequence σ = ∅.
2: while E ≠ [n] do    ▷ Phase 1 begins
3:   For each e ∈ [n], compute an estimate Ĝ_E(e) of the score G_E(e) by sampling from H independently N = m^3 n^4/ε times.
4:   Let e∗ denote the element e ∈ [n]\E that maximizes Ĝ_E(e).
5:   if Ĝ_E(e∗) ≥ ε/(4 m^2 n^4) then
6:     Update E ← E ∪ {e∗} and append e∗ to sequence σ.
7:   else
8:     Exit the while loop.    ▷ Phase 1 ends
9: Append the elements in [n]\E to sequence σ in arbitrary order.    ▷ Phase 2
10: Output non-adaptive sequence σ.
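The two-phase structure can be sketched as follows. This is an illustrative implementation, not the paper's code: the sample size and threshold are small placeholders rather than the N = m^3 n^4/ε and ε/(4 m^2 n^4) values used in the analysis, and the coverage function below is a toy assumption.

```python
import random

def nonadaptive_sequence(elements, sample_scenario, f, n_samples=200,
                         threshold=1e-3, rng=random.Random(0)):
    E, sigma = set(), []
    while len(E) < len(elements):
        samples = [sample_scenario(rng) for _ in range(n_samples)]
        def est(e):  # empirical average of Delta_E(i, w; e) over the samples
            vals = []
            for s in samples:
                fE = f(s, E)
                vals.append(0.0 if fE >= 1 else (f(s, E | {e}) - fE) / (1 - fE))
            return sum(vals) / len(vals)
        e_star = max((e for e in elements if e not in E), key=est)
        if est(e_star) < threshold:
            break                                   # Phase 1 ends
        E.add(e_star)
        sigma.append(e_star)
    sigma += [e for e in elements if e not in E]    # Phase 2: arbitrary order
    return sigma

# Toy instance: coverage of a scenario = fraction of its relevant elements chosen.
relevant = {"s1": {1, 2}, "s2": {2, 3}}
cover = lambda s, E: len(relevant[s] & E) / len(relevant[s])
seq = nonadaptive_sequence([1, 2, 3], lambda r: r.choice(["s1", "s2"]), cover)
# Element 2 helps both scenarios, so it is selected first.
```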

4. Adaptive Algorithms

In this section we present the O(log(m/ε) + min{c log|Ω|, r})-approximation for ASRN, where we recall that c and r are the maximum number of noisy entries ("stars") per column and per row in the outcome matrix M, and ε is the separability parameter of the submodular functions. We propose two algorithms, achieving O(r + log(m/ε)) and O(c log|Ω| + log(m/ε)) approximations respectively, which combined imply our main result.

In both algorithms, we maintain the posterior probability of each scenario based on the previous

element responses, and use these probabilities to calculate a score for each element, which comprises

(i) a term that prioritizes splitting the candidate scenarios in a balanced manner and (ii) terms


corresponding to the expected number of scenarios eliminated. Unlike the noiseless setting, in ASRN (and ODTN) each scenario may trace multiple paths in the decision tree due to outcome randomness. In fact, each scenario may trace an exponential number of paths in the tree, so a naive generalization of the analysis in Navidi et al. (2020) incurs an extra exponential factor in the approximation ratio.

We circumvent this challenge by reducing to an ASR instance J (as defined in Proposition 1) using the expanded scenarios. In this way, the noise is removed, since we recall that the outcome of each element is deterministic conditional on any expanded scenario (i,ω). Our first result, an O(c log|Ω| + log(m/ε))-approximation, then follows from Navidi et al. (2020).

However, as J involves exponentially many scenarios, a naive implementation of the algorithm in Navidi et al. (2020) leads to exponential running time. To improve the computational efficiency, in Section 4.1 we exploit the special structure of J and devise a polynomial-time algorithm. Then, in Section 4.2, we propose a slightly different algorithm from that of Navidi et al. (2020) and show an O(r + log(m/ε)) approximation ratio.

4.1. An O(c log|Ω| + log(m/ε))-Approximation Algorithm

Our first adaptive algorithm is based on the O(log(m/ε))-approximation algorithm for ASR from Navidi et al. (2020), formally stated as Algorithm 2. Applying this result to the instance J and recalling |H| ≤ |Ω|^c · m, we immediately obtain the desired guarantee. Their algorithm, rephrased in our notation, maintains the set H′ ⊆ H of all expanded scenarios that are consistent with all the observed outcomes, and iteratively selects the element with maximum score, as defined in (3).‡ As the heart of the algorithm, this score strikes a balance between covering the submodular functions of the consistent scenarios and shrinking H′, hence reducing the uncertainty in the target scenario. The second term in Score_c, similar to the score in our non-adaptive algorithm (Algorithm 1), involves the sum of the incremental coverage (for selecting e) over all uncovered expanded scenarios, weighted by their current coverage, with higher weights on the expanded scenarios closer to being covered.

‡ We use the subscript c to distinguish from the score function Score_r considered in Section 4.2; for ease of notation, we suppress the subscript in this subsection.

To interpret the first term in Score_c, let us for simplicity assume Ω = {±1} and that π_{i,ω} is uniform over H. Upon selecting an element, H′ is split into two subsets, among which L_e(H′) is the lighter in cardinality, or equivalently, since we just assumed π_{i,ω} to be uniform, in total prior probability. Thus, this term is simply the number of expanded scenarios eliminated in the worst case (over the outcomes in Ω). This is reminiscent of the greedy algorithm for the ODT problem (e.g. Kosaraju et al. (1999)), which iteratively selects a test that maximizes the number of scenarios ruled out, in the worst case over all test outcomes. Evidently, the higher this term, the more progress is made towards identifying the target (expanded) scenario.

Algorithm 2 Algorithm for ASR instance J, based on Navidi et al. (2020).
1: Initialize E ← ∅, H′ ← H.
2: while H′ ≠ ∅ do
3:   For each element e ∈ [n], let B_e(H′) be the largest-cardinality set among {(i,ω) ∈ H′ : r_{i,ω}(e) = o}, for o ∈ Ω.
4:   Define L_e(H′) = H′ \ B_e(H′).
5:   Select the element e ∈ [n]\E maximizing
     Score_c(e, E, H′) = π(L_e(H′)) + Σ_{(i,ω)∈H′ : f_{i,ω}(E)<1} π_{i,ω} · [f_{i,ω}({e}∪E) − f_{i,ω}(E)] / [1 − f_{i,ω}(E)]   (3)
6:   Observe response o and update H′ ← {(i,ω) ∈ H′ : ω_e = o and f_{i,ω}(E ∪ {e}) < 1}.
7:   E ← E ∪ {e}.


As noted earlier, a key issue is the exponential size of the expanded scenario set H. The naive implementation, which computes the summation in Score_c by evaluating each term in H′, requires exponential time. Nonetheless, as the main focus of this subsection, we explain how to utilize the structure of the ASRN instance J to reformulate each of the two terms in Score_c in a manageable form, hence enabling a polynomial-time implementation.

Computing the First Term in Score_c. Recall that H_i is the set of all expanded scenarios for i. Since each (i,ω) ∈ H_i has an equal share π_{i,ω} = |Ω|^{−c_i} π_i of the prior probability mass of the (original) scenario i ∈ [m], computing the first term in Score_c reduces to maintaining the number n_i = |H_i ∩ H′| of consistent copies of i. We observe that n_i can be easily updated in each iteration. Indeed, suppose outcome o ∈ Ω is observed upon selecting element e. We consider how H′ ∩ H_i changes in the following three cases:
1. if r_i(e) ∉ {∗, o}, then none of i's expanded scenarios remains in H′, so n_i becomes 0;
2. if r_i(e) = o, then all of i's expanded scenarios remain in H′, so n_i stays the same;
3. if r_i(e) = ∗, then only those (i,ω) with ω(e) = o remain, so n_i shrinks by a factor of |Ω|.
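The three-case update can be sketched directly. The dictionary representation and names below are our own; exact rational counts are used so that repeated division by |Ω| stays exact.

```python
from fractions import Fraction

def update_counts(n, response, e, o, num_outcomes):
    """Update n_i = |H_i ∩ H'| after selecting element e and observing outcome o."""
    for i in list(n):
        r = response[i][e]
        if r == "*":
            n[i] = n[i] / num_outcomes   # case 3: only copies with w(e) = o survive
        elif r != o:
            n[i] = Fraction(0)           # case 1: scenario i is eliminated
        # case 2: r == o, so n[i] is unchanged
    return n

# Toy run: three scenarios with responses +, -, * on test "t"; outcome "+" observed.
n = {"A": Fraction(4), "B": Fraction(4), "C": Fraction(4)}
response = {"A": {"t": "+"}, "B": {"t": "-"}, "C": {"t": "*"}}
update_counts(n, response, "t", "+", num_outcomes=2)
```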

As the n_i's can be easily updated, we are also able to compute the first term in Score_c efficiently. Indeed, for any element e (that is not yet selected), we can implicitly describe the set L_e(H′) as follows. Note that for any outcome o ∈ Ω,

|{(i,ω) ∈ H′ : r_{i,ω}(e) = o}| = Σ_{i∈[m] : r_i(e)=o} n_i + (1/|Ω|) Σ_{i∈[m] : r_i(e)=∗} n_i,

so the largest-cardinality set B_e(H′) can then be easily determined using the n_i's. Indeed, let b be the outcome corresponding to B_e(H′). Then,

π(L_e(H′)) = Σ_{i∈[m] : r_i(e)∉{b,∗}} (π_i/|Ω|^{c_i}) · n_i + [(|Ω|−1)/|Ω|] Σ_{i∈[m] : r_i(e)=∗} (π_i/|Ω|^{c_i}) · n_i.
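Given the surviving masses p_i = n_i · π_i/|Ω|^{c_i}, the first term follows from the displayed formula. The sketch below (our notation, not the paper's code) selects B_e by probability mass rather than cardinality; under the uniform-prior simplification discussed earlier the two rules coincide.

```python
def first_term(p, response, e, outcomes):
    """pi(L_e(H')): total surviving mass outside the heaviest outcome class."""
    mass = {}
    for o in outcomes:
        det = sum(p[i] for i in p if response[i][e] == o)
        star = sum(p[i] for i in p if response[i][e] == "*")
        mass[o] = det + star / len(outcomes)   # star scenarios split uniformly
    b = max(outcomes, key=lambda o: mass[o])   # outcome defining B_e(H')
    return sum(mass[o] for o in outcomes if o != b)

# Toy check: scenario C is a star, so half its mass answers each outcome.
p = {"A": 0.4, "B": 0.2, "C": 0.4}
response = {"A": {"t": "+"}, "B": {"t": "-"}, "C": {"t": "*"}}
val = first_term(p, response, "t", ["+", "-"])
```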

Computing the Second Term in Score_c. The second term in Score_c involves summing over exponentially many terms, so a naive implementation is inefficient. Instead, we will rewrite this summation as an expectation that can be calculated in polynomial time.


We introduce some notation before formally stating this equivalence. Suppose the algorithm has selected a subset E of elements and observed outcomes {ν_e}_{e∈E}. We slightly overload notation and write f(ν_E) := f({(e, ν_e) : e ∈ E}) for any function f defined on 2^{[n]×Ω}. For each scenario i ∈ [m], let p_i = n_i · π_i/|Ω|^{c_i} be the total probability mass of the surviving expanded scenarios for i.† Finally, for any element e and scenario i, let E_{i,ν_e} denote the expectation over the outcome ν_e of element e, conditional on i being the realized scenario. We can then rewrite the second term in Score_c as follows.

Lemma 1. For each e ∉ E,

Σ_{(i,ω)∈H′} π_{i,ω} · [f_{i,ω}({e}∪E) − f_{i,ω}(E)] / [1 − f_{i,ω}(E)] = Σ_{i∈[m]} p_i · E_{i,ν_e}[f_i(ν_E ∪ {ν_e}) − f_i(ν_E)] / [1 − f_i(ν_E)].   (4)

This lemma suggests the following efficient implementation of Algorithm 2. For each i, compute and maintain p_i using n_i. To find the expectation in the numerator, note that if r_i(e) ≠ ∗, then ν_e is deterministic, and hence it is straightforward to find this expectation. In the other case, if r_i(e) = ∗, then since the noisy outcome is uniformly distributed over Ω, we may simply evaluate f_i(ν_E ∪ {(e, o)}) − f_i(ν_E) for each o ∈ Ω and take the average.
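Lemma 1 turns the exponential sum into m per-scenario expectations, each over at most |Ω| outcomes. A hedged sketch, with an illustrative coverage function f_i and our own names:

```python
def second_term(p, response, e, history, outcomes, f_i):
    """Right-hand side of Lemma 1: one expectation over nu_e per original scenario."""
    total = 0.0
    for i in p:
        base = f_i(i, history)
        if p[i] == 0 or base >= 1.0:
            continue                          # covered scenarios contribute nothing
        r = response[i][e]
        cand = outcomes if r == "*" else [r]  # noisy outcome: uniform over Omega
        gain = sum(f_i(i, history | {(e, o)}) - base for o in cand) / len(cand)
        total += p[i] * gain / (1.0 - base)
    return total

# Toy check: one scenario whose function jumps to 0.5 once ("t", "+") is observed.
f_i = lambda i, h: 0.5 if ("t", "+") in h else 0.0
val = second_term({"A": 1.0}, {"A": {"t": "*"}}, "t", frozenset(), ["+", "-"], f_i)
# Expected gain is the average over the two equally likely outcomes.
```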

Now we are ready to formally state and prove the main result of this subsection.

Theorem 5. Algorithm 2 is an O(c log|Ω| + log m + log(1/ε))-approximation algorithm for ASRN, where c is the maximum number of noisy outcomes in each column of the response matrix M.

Proof. Consider the ASR instance J and Algorithm 2. As discussed above, this algorithm can be implemented in polynomial time. By Theorem 2, this algorithm has an O(log(|Ω|^c m) + log(m/ε)) = O(c log|Ω| + log(m/ε)) approximation ratio, since |H| ≤ |Ω|^c · m.


Algorithm 3 Modified algorithm for ASR instance J.
1: Initialize E ← ∅, H′ ← H.
2: while H′ ≠ ∅ do
3:   S ← {i ∈ [m] : H_i ∩ H′ ≠ ∅}    ▷ consistent original scenarios
4:   For e ∈ [n], let U_e(S) = {i ∈ S : r_i(e) = ∗} and let C_e(S) be the largest-cardinality set among {i ∈ S : r_i(e) = o}, for o ∈ Ω; let o_e(S) ∈ Ω be the outcome corresponding to C_e(S).
5:   For each e ∈ [n], let
     R_e(H′) = {(i,ω) ∈ H′ : i ∈ C_e(S)} ∪ {(j,ω) ∈ H′ : j ∈ U_e(S), ω_e = o_e(S)}
   be the expanded scenarios that have outcome o_e(S) for element e, and let R̄_e(H′) := H′ \ R_e(H′).
6:   Select the element e ∈ [n]\E that maximizes
     Score_r(e, E, H′) = π(R̄_e(H′)) + Σ_{(i,ω)∈H′ : f_{i,ω}(E)<1} π_{i,ω} · [f_{i,ω}({e}∪E) − f_{i,ω}(E)] / [1 − f_{i,ω}(E)]   (5)
7:   Observe outcome o.
8:   H′ ← {(i,ω) ∈ H′ : r_{i,ω}(e) = o and f_{i,ω}(E ∪ {e}) < 1}    ▷ update the (expanded) scenarios
9:   E ← E ∪ {e}.

4.2. An O(r + log(m/ε))-Approximation Algorithm

In this section, we consider a slightly different score function, Score_r, and obtain an O(r + log(m/ε))-approximation. Unlike the previous section, where the approximation factor follows as an immediate corollary of Theorem 2, to prove this result we also need to modify the analysis.

† One may easily verify via Bayes' rule that p_i/p([m]) is indeed the posterior probability of scenario i ∈ [m], given the previously observed outcomes.


The only difference from Algorithm 2 is in the first term of the score function. Recall that in Score_c, upon selecting an element, the surviving expanded scenarios are partitioned into |Ω| subsets, among which L_e(H′) is defined to be the lightest in cardinality. Its counterpart in Score_r, however, is defined more indirectly, by first considering the original scenarios. The element e partitions the original scenarios with deterministic outcomes into |Ω| subsets, with the largest (in cardinality) being C_e(S) ⊆ [m]. The set R̄_e(H′) ⊆ H′ is then defined as the consistent expanded scenarios that have a different outcome than C_e(S).

Computational Complexity. By definition, S can be directly computed using the n_i's, which can be updated in polynomial time as explained in Section 4.1. Similar to Algorithm 2, the second term here also involves summing over exponentially many terms, but by following the same recipe as in Section 4.1, one may implement it in polynomial time.

The main result of this section, stated below, is proved by adapting the proof technique from Navidi et al. (2020). The proof appears in Appendix C.3.

Theorem 6. Algorithm 3 is a polynomial-time O(r + log(m/ε))-approximation algorithm for ASRN, where r is the maximum number of noisy outcomes in any row of the response matrix M.

Combining the above result with Theorem 5, and selecting between Algorithm 2 and Algorithm 3 the one with the lower approximation ratio, we immediately obtain the following.

Theorem 7. There is an adaptive O(min{c log|Ω|, r} + log(m/ε))-approximation algorithm for the ASRN problem.

When applied to the ODTN problem, this implies an O(min{c, r} + log(m/ε))-approximation algorithm. In Appendix C.2, we also provide closed-form expressions for the scores used in Algorithms 2 and 3 in the special case of ODTN; these are also used in our computational results.


5. ODTN with Many Unknowns

Our adaptive algorithm in Section 4 has a performance guarantee that grows with the noise sparsity min{r, c log|Ω|}. In this section, we consider the special case of ODTN (which is our primary application) and focus on instances with a large number of noisy outcomes. We show that an O(log m)-approximation can be achieved even in this regime.

An ODTN instance is called α-sparse (0 ≤ α ≤ 1) if max{|T+|, |T−|} ≤ m^α for all tests T ∈ 𝒯. In particular, when α < 1, this means that the vast majority of entries are noisy in every test. Our main result is the following.

Theorem 8. There is a polynomial-time adaptive algorithm whose cost is O(log m) times the optimum for ODTN on any α-sparse instance with α ≤ 1/2, and which returns the true hypothesis with probability 1 − m^{−1}.

Moreover, by repeating the algorithm c ≥ 1 times, the error probability decreases to m^{−c}.

5.1. Main Idea and the Stochastic Set Cover Problem

Stochastic Set Cover. The design and analysis of our algorithm are both closely related to the Stochastic Set Cover (SSC) problem (Liu et al. (2008), Im et al. (2016)). An instance of SSC consists of a ground set [m] of items and a collection of random subsets S_1, ..., S_n of [m], where the distribution of each S_i is known to the algorithm. The instantiation of each set is only revealed after it is selected. The goal is to find an adaptive policy that minimizes the expected number of sets needed to cover all items in the ground set.

The following natural adaptive greedy algorithm is known to be an O(log m)-approximation (Liu et al. (2008), Im et al. (2016)). Suppose at some iteration A ⊆ [m] is the set of uncovered items. A random set S is said to be β-greedy if its expected coverage of the uncovered items is at least a 1/β fraction of the maximum, i.e.,

E|S ∩ A| ≥ (1/β) · max_{j∈[n]} E|S_j ∩ A|.


An SSC algorithm is (β, ρ)-greedy if, for every t ≥ 1, it picks a β-greedy set in at least t/ρ of the first t iterations. By slightly modifying the analysis in Im et al. (2016), one may obtain the following guarantee, which serves as the cornerstone of our analysis.

Theorem 9 (Im et al. (2016)). For any stochastic set cover instance, a (β, ρ)-greedy policy costs at most O(βρ log m) times the optimum.
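The greedy policy behind Theorem 9 can be sketched as follows; the explicit (probability, realization) representation of each random set is our own simplification for small supports, not the paper's implementation.

```python
import random

def greedy_ssc(ground, dists, rng=random.Random(0)):
    """Adaptive greedy for SSC: always pick the set with the largest expected
    coverage of the uncovered items (1-greedy), then observe its realization."""
    uncovered, picks = set(ground), []
    while uncovered:
        def exp_cov(j):
            return sum(pr * len(S & uncovered) for pr, S in dists[j])
        j = max(dists, key=exp_cov)
        if exp_cov(j) == 0:
            break                            # no set can make further progress
        probs, realizations = zip(*dists[j])
        S = rng.choices(realizations, weights=probs)[0]   # observe instantiation
        uncovered -= S
        picks.append(j)
    return picks

# Deterministic toy instance: one set covers everything, so greedy stops after it.
dists = {"S1": [(1.0, {1, 2, 3})], "S2": [(1.0, {1})]}
picks = greedy_ssc({1, 2, 3}, dists)
```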

Relating the ODTN Optimum and SSC: A Lower Bound. We now derive a lower bound on the ODTN optimum in terms of the optima of SSC instances constructed as follows. For any hypothesis i ∈ [m], let SSC(i) denote the stochastic set cover instance with ground set [m]\{i} and n random sets, given by

S_T(i) = T+ with probability 1, if i ∈ T−;
         T− with probability 1, if i ∈ T+;
         T− or T+ with probability 1/2 each, if i ∈ T∗;
for all T ∈ [n].
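The random sets S_T(i) are straightforward to simulate. The sketch below follows the case analysis above, with the three parts of a test passed explicitly; the names are ours.

```python
import random

def sample_S_T(i, T_plus, T_minus, T_star, rng=random.Random(0)):
    """One draw of the random set S_T(i) from the reduction above."""
    if i in T_minus:
        return set(T_plus)    # outcome "-" occurs, eliminating all of T+
    if i in T_plus:
        return set(T_minus)   # outcome "+" occurs, eliminating all of T-
    assert i in T_star        # noisy entry: fair coin over the two outcomes
    return set(T_plus) if rng.random() < 0.5 else set(T_minus)
```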

To see the connection between SSC and ODTN, observe that when i is the target hypothesis in the ODTN instance, any feasible algorithm must identify i by eliminating all other hypotheses, which, in the language of SSC, translates to covering all items in [m]\{i}. This leads to the following key lower bound that our algorithm exploits.

Lemma 2. OPT ≥ Σ_{i∈[m]} π_i · OPT_{SSC(i)}.

We now explain why "good" progress in SSC(i) also translates to "good" progress in ODTN. Consider a hypothesis i and a test T with i ∈ T∗, and let A be the set of consistent hypotheses. When test T is selected, the expected coverage of the corresponding (random) set S_T(i) in SSC(i) is (1/2)(|T+ ∩ A| + |T− ∩ A|). The following result shows that if T maximizes (1/2)(|T+ ∩ A| + |T− ∩ A|), then it is 2-greedy for SSC(i).

Lemma 3. Let T be a test that maximizes (1/2)(|T+ ∩ A| + |T− ∩ A|). Then for any i ∈ T∗,

(1/2)(|T+ ∩ A| + |T− ∩ A|) = E[|S_T(i) ∩ (A\{i})|] ≥ (1/2) · max_{T′∈[n]} E[|S_{T′}(i) ∩ (A\{i})|].


Hence, by our sparsity assumption, since the vast majority of hypotheses lie in T∗, such a test T is 2-greedy for most SSC instances. This motivates the following greedy algorithm: when A is the set of consistent hypotheses, pick the test T that maximizes (1/2)|T+ ∩ A| + (1/2)|T− ∩ A|. Suppose the following ideal condition holds: at each iteration t (when t tests have been selected), for every hypothesis i, the algorithm has selected at least t/ρ tests that are ∗-tests for i. Then the sequence of tests selected is (2, ρ)-greedy for every i, hence making nearly-optimal progress in every instance SSC(i). Therefore, by Theorem 9, the expected cost of this algorithm under i is O(ρ log m) · OPT_{SSC(i)}. Taking expectations over the target hypothesis i and combining with Lemma 2, it follows that this algorithm is an O(ρ log m)-approximation to ODTN.

However, in general, the ideal condition assumed above may not hold. In other words, up until some point, the sequence of tests selected may fail to be (2, ρ)-greedy for some hypothesis i. To handle this issue, we modify the above greedy algorithm at all power-of-two iterations as follows (see Section 5.3). At each t = 2^k, where k = 1, 2, ..., log m, we consider the set Z of O(m^α) hypotheses with the fewest ∗-tests selected thus far. We then invoke a membership oracle, Member(Z), to check whether the target hypothesis ī ∈ Z (see Section 5.2). If so, the algorithm halts and returns ī. Otherwise, it continues with the greedy algorithm until the next power-of-two iteration. We will show that the membership oracle incurs cost only O(m^α), which can be bounded using the following lower bound.

Lemma 4. The optimal value satisfies OPT ≥ Ω(m^{1−α}) for any α-sparse instance.

In particular, when α < 1/2, the above implies that the cost O(m^α) for each call of the membership oracle is lower than OPT, and hence the total cost incurred at power-of-two steps is O(log m · OPT).

5.2. Overview of the Membership Oracle

The membership oracle Member(Z) takes a (small) subset Z ⊆ [m] as input and decides whether the target hypothesis ī ∈ Z. At a high level, Member(Z) works as follows. Whenever |Z| ≥ 2, we pick an arbitrary pair (j, k) of hypotheses in Z and let them "duel" (i.e., choose a test T with M_{T,j} = −M_{T,k}) until there is only a unique survivor i.

Let i ∈ [m] be an arbitrary hypothesis. We show that if ī ≠ i, then with high probability we can rule out i using very few tests. Specifically, we first select an arbitrary set W of 4 log m deterministic tests for i, and let Y be the set of consistent hypotheses after performing these tests. Without loss of generality, we assume i ∈ T+ for all T ∈ W. There are three cases:
• Trivial Case: if ī ∈ T− for some T ∈ W, then we rule out i when that test T is performed.
• Good Case: if ī ∈ T∗ for more than half of the tests T in W, then by a Chernoff bound, with high probability we observe at least one "−" outcome, hence ruling out i.
• Bad Case: otherwise, ī ∈ T+ for at least half of the tests T in W, and concentration bounds cannot ensure a high enough probability of ruling out i. In this case, we let each hypothesis in Y duel with i until either i loses a duel or wins all the duels. This takes |Z| − 1 iterations.

We formalize the above ideas in Algorithm 5 (Appendix D.1), and bound the cost of Member(Z) as follows.

Lemma 5. If ī ∈ Z, then Member(Z) identifies ī with probability one; otherwise, it declares ī ∉ Z with probability at least 1 − m^{−2}. Moreover, the expected cost of Member(Z) is O(|Z| + log m).

5.3. The Main Algorithm

The overall algorithm is given in Algorithm 4. The algorithm maintains a subset of consistent hypotheses and iteratively selects the greediest test, as formally specified in Step 7. At each t = 2^k, where k = 1, 2, ..., log m, we invoke the membership oracle.

Truncated Decision Tree. Let T denote the decision tree corresponding to our algorithm. We only consider tests that correspond to Step 7. Recall that H is the set of expanded hypotheses and that any expanded hypothesis traces a unique path in T. For any (i,ω) ∈ H, let P_{i,ω} denote this path; thus |P_{i,ω}| is the number of tests performed in Step 7 under (i,ω). We will work with a truncated decision tree T̄, defined below.


Algorithm 4 Main algorithm for a large number of noisy outcomes.
1: Initialization: consistent hypotheses A ← [m]; weights w_i ← 0 for i ∈ [m]; iteration index t ← 0.
2: while |A| > 1 do
3:   if t is a power of 2 then
4:     Let Z ⊆ A be the subset of 2m^α hypotheses with lowest w_i.
5:     Invoke Member(Z).
6:     If a hypothesis is identified in Z, then break.
7:   Select a test T ∈ 𝒯 maximizing (1/2)(|T+ ∩ A| + |T− ∩ A|) and observe outcome o_T.
8:   Set R ← {i ∈ [m] : M_{T,i} = −o_T} and A ← A \ R.    ▷ remove incompatible hypotheses
9:   Set w_i ← w_i + 1 for each i ∈ T∗.    ▷ update the weights of the hypotheses in T∗
10:  t ← t + 1.
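A compact sketch of this main loop, with the membership oracle stubbed out and a toy-sized Z. The test representation tests[T] = (T_plus, T_minus), with all other hypotheses treated as ∗-entries, is our own; this is an illustration of the control flow, not the paper's implementation.

```python
def identify(hypotheses, tests, answer, member=lambda Z: None):
    """`answer(T)` returns the observed "+"/"-" outcome; `member` stubs the oracle."""
    A = set(hypotheses)
    w = {i: 0 for i in hypotheses}       # w_i = number of *-tests for i so far
    remaining, t = set(tests), 0
    while len(A) > 1 and remaining:
        if t > 0 and t & (t - 1) == 0:   # t is a power of two
            Z = sorted(A, key=lambda i: w[i])[:2]   # toy stand-in for 2*m^alpha
            found = member(Z)
            if found is not None:
                return found
        T = max(remaining,
                key=lambda T: len(tests[T][0] & A) + len(tests[T][1] & A))
        remaining.discard(T)
        plus, minus = tests[T]
        o = answer(T)
        A -= minus if o == "+" else plus            # drop incompatible hypotheses
        for i in A:
            if i not in plus and i not in minus:
                w[i] += 1                           # T was a *-test for i
        t += 1
    return next(iter(A))

# Toy run with target hypothesis 1: the first greedy test already isolates it.
tests = {"T1": ({1}, {2, 3}), "T2": ({2}, {3})}
answer = lambda T: "+" if 1 in tests[T][0] else "-"
target = identify({1, 2, 3}, tests, answer)
```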

Fix any expanded hypothesis (i,ω) ∈ H. For any t ≥ 1, let θ_{i,ω}(t) denote the fraction of the first t tests in P_{i,ω} that are ∗-tests for hypothesis i. Recall that P_{i,ω} only contains tests from Step 7. Let ρ = 4 and define

t_{i,ω} = max{ t ∈ {2^0, 2^1, ..., 2^{log m}} : θ_{i,ω}(t′) ≥ 1/ρ for all t′ ≤ t }.   (6)

If t_{i,ω} > |P_{i,ω}|, then we simply set t_{i,ω} = |P_{i,ω}|.

Now we define the truncated decision tree T̄. By abuse of notation, we will use θ_i(t) and t_i as random variables, with the randomness over ω. Observe that for any (i,ω), at the next power-of-two step† 2^⌈log t_i⌉, which we call the truncation time, the membership oracle will be invoked. Moreover, 2^⌈log t_i⌉ ≤ 2t_i. This motivates us to define T̄ as the subtree of T consisting of the first 2^⌈log t_{i,ω}⌉ tests along path P_{i,ω}, for each (i,ω) ∈ H. Under this definition, the cost of Algorithm 4 clearly equals the sum of the cost of the truncated tree and the cost of invoking the membership oracles.

† Unless stated otherwise, we denote log := log_2.

Our proof proceeds by bounding the cost of Algorithm 4 at power-of-two steps and at the other steps. In other words, we decompose the cost into the cost incurred by invoking the membership oracle and that incurred by selecting the greedy tests. We start with the easier task of bounding the cost of the membership oracle. The oracle Member is always invoked on |Z| = O(m^α) hypotheses. Using Lemma 5, the expected total number of tests due to the membership-oracle calls is O(m^α log m). By Lemma 4, when α ≤ 1/2, this cost is O(log m · OPT).

The remaining part of this subsection focuses on bounding the cost of the truncated tree by O(log m) · OPT. With this inequality, we obtain an expected cost of

O(log m)·(m^α + OPT) ≤ O(log m)·(m^{1−α} + OPT) ≤ O(log m)·OPT,

where the first inequality holds since α ≤ 1/2 and the second follows from Lemma 4; Theorem 8 then follows. At a high level, for a fixed hypothesis i ∈ [m], we will bound the cost of the truncated tree as follows:

i has a low fraction of ∗-tests at t_i
  ⟹ (Lemma 6)  i is among the top O(m^α) hypotheses at t_i
  ⟹ (Lemma 5)  i is identified w.h.p. by Member(Z) at 2^⌈log t_i⌉ ≤ 2t_i; hence the truncated path is (2,2)-greedy
  ⟹ (Theorem 9)  the expected cost conditional on i is O(log m) · OPT_{SSC(i)},

and finally, by summing over i ∈ [m], it follows from Lemma 2 that the cost of the truncated tree is O(log m) · OPT. We formalize each step below.

Consider the first step: formally, we show that if θ_i(t) < 1/4, then there are only O(m^α) hypotheses with fewer ∗-tests than i. Suppose i is the target hypothesis and θ_i(t) drops below 1/4 at time t; that is, fewer than a quarter of the tests selected are 2-greedy for SSC(i). Recall that if i ∈ T∗, where T maximizes (1/2)(|A ∩ T+| + |A ∩ T−|), then S_T(i) is a 2-greedy set for SSC(i). We thus deduce that fewer than t/4 of the tests selected are ∗-tests for i, or equivalently, at least 3t/4 of the tests selected thus far are deterministic for i. We next utilize the sparsity assumption to show that there can be at most O(m^α) such hypotheses.

Lemma 6. Consider any W ⊆ 𝒯 and I ⊆ [m]. For i ∈ I, let D(i) = |{T ∈ W : M_{T,i} ≠ ∗}| denote the number of tests in W for which i has a deterministic (i.e., ±1) outcome. For each κ ≥ 1, define I′ = {i ∈ I : D(i) > |W|/κ}. Then |I′| ≤ κ m^α.

Proof. By the definition of I′ and α-sparsity, it holds that

|I′| · |W|/κ < Σ_{i∈I′} D(i) ≤ Σ_{i∈I} D(i) = Σ_{T∈W} |{i ∈ I : M_{T,i} ≠ ∗}| ≤ |W| · m^α,

where the last step follows from α-sparsity, i.e., the bound on the number of deterministic entries per test. The proof follows immediately by rearranging.

We now complete the analysis using the relation to SSC. Fix any hypothesis i ∈ [m] and consider the decision tree T̄_i obtained by conditioning T̄ on ī = i. Lemma 3 and the definition of truncation together imply that T̄_i is (2, 4)-greedy for SSC(i), so by Theorem 9, the expected cost of T̄_i is O(log m) · OPT_{SSC(i)}. Now, taking expectations over i ∈ [m], the expected cost of T̄ is O(log m) Σ_{i=1}^{m} π_i · OPT_{SSC(i)}. Recall from Lemma 2 that

OPT ≥ Σ_{i∈[m]} π_i · OPT_{SSC(i)},

and therefore the cost of T̄ is O(log m) · OPT.

Correctness. We finally show that our algorithm identifies the target hypothesis ī with high probability. By the definition of t_ī, where the path is truncated, ī has less than a 1/4 fraction of ∗-tests. Thus, at iteration 2^⌈log t_ī⌉, i.e., the first time the membership oracle is invoked after t_ī, ī has less than a 1/2 fraction of ∗-tests. Hence, by Lemma 6, ī is among the O(m^α) hypotheses with the fewest ∗-tests. Finally, it follows from Lemma 5 that ī is identified correctly with probability at least 1 − 1/m.

6. Experiments

We implemented our algorithms on real-world and synthetic data sets. We compared our algorithms' cost (expected number of tests) with an information-theoretic lower bound on the optimal cost and show that the difference is negligible. Thus, despite our logarithmic approximation ratios, the practical performance is much better.

Chemicals with Unknown Test Outcomes. We considered a data set called WISER‡, which includes 414 chemicals (hypotheses) and 78 binary tests. Every chemical has either a positive, negative, or unknown result on each test. The original instance (called WISER-ORG) is not identifiable, so our result does not apply directly. In Appendix E we show how our result can be extended to such "non-identifiable" ODTN instances (this requires a more relaxed stopping criterion defined on the "similarity graph"). In addition, we also generated a modified dataset by removing chemicals that are not identifiable from each other, obtaining a perfectly identifiable dataset (called WISER-ID). In generating the WISER-ID instance, we used a greedy rule that iteratively drops the highest-degree hypothesis in the similarity graph until all remaining hypotheses are uniquely identifiable. WISER-ID has 255 chemicals.

‡ https://wiser.nlm.nih.gov

Random Binary Classifiers with Margin Error. We construct a dataset containing 100 two-dimensional points by picking each of their attributes uniformly in [−1000, 1000]. We also choose 2000 random triples (a, b, c) to form linear classifiers (ax + by)/√(a^2 + b^2) + c ≤ 0, where a, b ∼ N(0, 1) and c ∼ U(−1000, 1000). The point labels are binary, and we introduce noisy outcomes based on the distance of each point to a classifier. Specifically, for each threshold d ∈ {0, 5, 10, 20, 30}, we define a dataset CL-d that has a noisy outcome for any classifier-point pair where the distance of the point to the boundary of the classifier is smaller than d. In order to ensure that the instances are perfectly identifiable, we remove "equivalent" classifiers, leaving 234 classifiers.
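The construction can be sketched as follows (scaled down from 100 points and 2000 classifiers for brevity; the assignment of the "+" label to the non-positive side of the boundary is our convention, not stated in the text):

```python
import math
import random

rng = random.Random(0)
points = [(rng.uniform(-1000, 1000), rng.uniform(-1000, 1000)) for _ in range(20)]
classifiers = [(rng.gauss(0, 1), rng.gauss(0, 1), rng.uniform(-1000, 1000))
               for _ in range(50)]

def outcome(point, clf, d):
    (x, y), (a, b, c) = point, clf
    margin = (a * x + b * y) / math.hypot(a, b) + c   # signed distance to boundary
    if abs(margin) < d:
        return "*"               # within distance d of the boundary: noisy
    return "+" if margin <= 0 else "-"
```

Increasing the threshold d converts outcomes near the boundary into "∗" entries, which is exactly how the CL-d instances become noisier.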

Distributions. For the distribution over the hypotheses, we considered permutations of the power-law distribution (Pr[X = x; α] = β x^{−α}) for α = 0, 0.5, and 1. Note that α = 0 corresponds to the uniform distribution. To compare the results across the different classifier datasets meaningfully, we used the same permutation in each distribution.
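The power-law prior itself is a one-liner (β is just the normalizing constant, and α = 0 recovers the uniform distribution); this is a generic sketch, not taken from the paper's code.

```python
def power_law(m, alpha):
    """Pr[X = x] = beta * x**(-alpha) on x = 1..m, with beta the normalizer."""
    weights = [x ** (-float(alpha)) for x in range(1, m + 1)]
    beta = 1.0 / sum(weights)
    return [beta * w for w in weights]

uniform = power_law(4, 0)   # alpha = 0 gives the uniform distribution
```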

Algorithms. We implement the following algorithms: the adaptive O(r + log m + log(1/ε))-approximation (which we denote ODTNr), the adaptive O(c log|Ω| + log m + log(1/ε))-approximation (ODTNc), the non-adaptive O(log m)-approximation (Non-Adap), and a slightly adaptive version of Non-Adap (Low-Adap). Algorithm Low-Adap considers the same sequence of tests as Non-Adap while (adaptively) skipping non-informative tests based on observed outcomes. For the non-identifiable instance (WISER-ORG), we used the O(d + min{c, r} + log m + log(1/ε))-approximation algorithms with both neighborhood and clique stopping criteria (see Appendix E). The implementations of the adaptive and non-adaptive algorithms are available online.§

Table 1   Cost of Different Algorithms for α = 0 (Uniform Distribution).

Algorithm   WISER-ID   Cl-0    Cl-5    Cl-10   Cl-20   Cl-30
Low-BND     7.994      7.870   7.870   7.870   7.870   7.870
ODTNr       8.357      7.910   7.927   7.915   7.962   8.000
ODTNh       9.707      7.910   7.979   8.211   8.671   8.729
Non-Adap    11.568     9.731   9.831   9.941   9.996   10.204
Low-Adap    9.152      8.619   8.517   8.777   8.692   8.803

Table 2   Cost of Different Algorithms for α = 0.5.

Algorithm   WISER-ID   Cl-0    Cl-5    Cl-10   Cl-20   Cl-30
Low-BND     7.702      7.582   7.582   7.582   7.582   7.582
ODTNr       8.177      7.757   7.780   7.789   7.831   7.900
ODTNh       9.306      7.757   7.829   8.076   8.497   8.452
Non-Adap    11.998     9.504   9.500   9.694   9.826   9.934
Low-Adap    8.096      7.837   7.565   7.674   8.072   8.310

Results. Tables 1, Tables 2 and Tables 3 show the expected costs of diﬀerent algorithms on all

uniquely identiﬁable data sets when the parameter αin the distribution over hypothesis is 0,0.5

and 1 correspondingly. These tables also report values of an information theoretic lower bound

(the entropy) on the optimal cost (Low-BND). As the approximation ratio of our algorithms

are dependent on maximum number cof unknowns per hypothesis and maximum number rof

§https://github.com/FatemehNavidi/ODTN ; https://github.com/sjia1/ODT-with-noisy-outcomes

Jia et al.: Optimal Decision Tree with Noisy Outcomes

32 Operations Research 00(0), pp. 000–000, ©0000 INFORMS

Algorithm

Data WISER-ID Cl-0 Cl-5 Cl-10 Cl-20 Cl-30

Low-BND 6.218 6.136 6.136 6.136 6.136 6.136

ODTNr7.367 6.998 7.121 7.150 7.299 7.357

ODTNh8.566 6.998 7.134 7.313 7.637 7.915

Non-Adap 11.976 9.598 9.672 9.824 10.159 10.277

Low-Adap 9.072 8.453 8.344 8.609 8.683 8.541

Table 3 Cost of Diﬀerent Algorithms for α= 1.

Parameters

Data WISER-ORG WISER-ID Cl-0 Cl-5 Cl-10 Cl-20 Cl-30

r 388 245 0 5 7 12 13

Avg-r 50.46 30.690 0 1.12 2.21 4.43 6.54

h 61 45 0 3 6 8 8

Avg-h 9.51 9.39 0 0.48 0.94 1.89 2.79

Table 4 Maximum and Average Number of Stars per Hypothesis and per Test in Diﬀerent Datasets.

Algorithm   Neighborhood Stopping   Clique Stopping
ODTN-r             11.163               11.817
ODTN-h             11.908               12.506
Non-Adap           16.995               21.281
Low-Adap           16.983               20.559

Table 5: Algorithms on the WISER-ORG dataset with Neighborhood and Clique Stopping for the Uniform Distribution.

unknowns per test, we have also included these parameters, together with their average values, in Table 4. Table 5 summarizes the results on WISER-ORG with the clique and neighborhood stopping criteria. We can see that ODTN-r consistently outperforms the other algorithms and is very close to the information-theoretic lower bound.


Acknowledgements

A preliminary version of this paper appeared as Jia et al. (2019) in the proceedings of Neural

Information Processing Systems (NeurIPS) 2019.

References

Micah Adler and Brent Heeringa. Approximating optimal binary decision trees. In Approximation, Randomization and Combinatorial Optimization. Algorithms and Techniques, pages 1–9. Springer, 2008.

Esther M. Arkin, Henk Meijer, Joseph S. B. Mitchell, David Rappaport, and Steven S. Skiena. Decision trees for geometric models. International Journal of Computational Geometry & Applications, 8(3):343–363, 1998.

Yossi Azar and Iftah Gamzu. Ranking with submodular valuations. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1070–1079. SIAM, 2011.

Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. In Proceedings of the Twenty-Third International Conference on Machine Learning (ICML 2006), Pittsburgh, Pennsylvania, USA, June 25-29, 2006, pages 65–72, 2006.

Gowtham Bellala, Suresh K. Bhavnani, and Clayton Scott. Active diagnosis under persistent noise with unknown noise distribution: A rank-based approach. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011), Fort Lauderdale, USA, April 11-13, 2011, pages 155–163, 2011.

Venkatesan T. Chakaravarthy, Vinayaka Pandit, Sambuddha Roy, and Yogish Sabharwal. Approximating decision trees with multiway branches. In International Colloquium on Automata, Languages, and Programming, pages 210–221. Springer, 2009.

Venkatesan T. Chakaravarthy, Vinayaka Pandit, Sambuddha Roy, Pranjal Awasthi, and Mukesh K. Mohania. Decision trees for entity identification: Approximation algorithms and hardness results. ACM Transactions on Algorithms, 7(2):15:1–15:22, 2011.

Yuxin Chen, Seyed Hamed Hassani, and Andreas Krause. Near-optimal Bayesian active learning with correlated and noisy tests. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017), Fort Lauderdale, FL, USA, April 20-22, 2017, pages 223–231, 2017.

Ferdinando Cicalese, Eduardo Sany Laber, and Aline Medeiros Saettler. Diagnosis determination: decision trees optimizing simultaneously worst and expected testing cost. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), Beijing, China, June 21-26, 2014, pages 414–422, 2014.

Sanjoy Dasgupta. Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems, pages 337–344, 2005.

M. R. Garey and R. L. Graham. Performance bounds on the splitting algorithm for binary testing. Acta Informatica, 3:347–355, 1974.

Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research, 42:427–486, 2011.

Daniel Golovin and Andreas Krause. Adaptive submodularity: A new approach to active learning and stochastic optimization. CoRR, abs/1003.3967, 2017.

Daniel Golovin, Andreas Krause, and Debajyoti Ray. Near-optimal Bayesian active learning with noisy observations. In Advances in Neural Information Processing Systems 23, Vancouver, British Columbia, Canada, December 6-9, 2010, pages 766–774, 2010.

Andrew Guillory and Jeff A. Bilmes. Average-case active learning with costs. In Algorithmic Learning Theory, 20th International Conference (ALT 2009), Porto, Portugal, October 3-5, 2009, pages 141–155, 2009.

Andrew Guillory and Jeff A. Bilmes. Interactive submodular set cover. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), Haifa, Israel, June 21-24, 2010, pages 415–422, 2010.

Andrew Guillory and Jeff A. Bilmes. Simultaneous learning and covering with adversarial noise. In Proceedings of the 28th International Conference on Machine Learning (ICML 2011), Bellevue, Washington, USA, June 28 - July 2, 2011, pages 369–376, 2011.

Anupam Gupta, Viswanath Nagarajan, and R. Ravi. Approximation algorithms for optimal decision trees and adaptive TSP problems. Mathematics of Operations Research, 42(3):876–896, 2017.

Steve Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML 2007), Corvallis, Oregon, USA, June 20-24, 2007, pages 353–360, 2007.

Laurent Hyafil and Ronald L. Rivest. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1):15–17, 1976/77.

Sungjin Im, Viswanath Nagarajan, and Ruben Van Der Zwaan. Minimum latency submodular cover. ACM Transactions on Algorithms, 13(1):13, 2016.

Shervin Javdani, Yuxin Chen, Amin Karbasi, Andreas Krause, Drew Bagnell, and Siddhartha S. Srinivasa. Near optimal Bayesian active learning for decision making. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics (AISTATS 2014), Reykjavik, Iceland, April 22-25, 2014, pages 430–438, 2014.

Su Jia, Viswanath Nagarajan, Fatemeh Navidi, and R. Ravi. Optimal decision tree with noisy outcomes. In Annual Conference on Neural Information Processing Systems (NeurIPS), pages 3298–3308, 2019.

S. Rao Kosaraju, Teresa M. Przytycka, and Ryan Borgstrom. On an optimal split tree problem. In Workshop on Algorithms and Data Structures, pages 157–168. Springer, 1999.

Zhen Liu, Srinivasan Parthasarathy, Anand Ranganathan, and Hao Yang. Near-optimal algorithms for shared filter evaluation in data stream systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 2008), Vancouver, BC, Canada, June 10-12, 2008, pages 133–146, 2008.

D. W. Loveland. Performance bounds for binary testing with arbitrary weights. Acta Informatica, 22(1):101–114, 1985.

Mikhail Ju. Moshkov. Greedy algorithm with weights for decision tree construction. Fundamenta Informaticae, 104(3):285–292, 2010.

Mohammad Naghshvar, Tara Javidi, and Kamalika Chaudhuri. Noisy Bayesian active learning. In 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton 2012), Monticello, IL, USA, October 1-5, 2012, pages 1626–1633, 2012.

Feng Nan and Venkatesh Saligrama. Comments on the proof of adaptive stochastic set cover based on adaptive submodularity and its implications for the group identification problem in "Group-based active query selection for rapid diagnosis in time-critical situations". IEEE Transactions on Information Theory, 63(11):7612–7614, 2017.

Fatemeh Navidi, Prabhanjan Kambadur, and Viswanath Nagarajan. Adaptive submodular ranking and routing. Operations Research, 68(3):856–877, 2020.

Robert D. Nowak. Noisy generalized binary search. In Advances in Neural Information Processing Systems 22, Vancouver, British Columbia, Canada, December 7-10, 2009, pages 1366–1374, 2009.

Aline Medeiros Saettler, Eduardo Sany Laber, and Ferdinando Cicalese. Trading off worst and expected cost in decision tree problems. Algorithmica, 79(3):886–908, 2017.

Laurence A. Wolsey. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 2(4):385–393, 1982.


Appendix A: Proof of Proposition 1.

Recall that an adaptive algorithm for ASRN or ASR can be viewed as a decision tree. We will show that any feasible decision tree for the ASR instance $J$ is also feasible for the ASRN instance $I$ with the same objective, and vice versa.

In one direction, consider a feasible decision tree $T$ for the ASR instance $J$. For any expanded scenario $(i,\omega)\in H$, let $P_{i,\omega}$ be the unique path traced in $T$, and $S_{i,\omega}$ the elements selected along $P_{i,\omega}$. By definition of a feasible decision tree, at the last node ("leaf") of path $P_{i,\omega}$ it holds that $f_{i,\omega}(S_{i,\omega}) = 1$, which, in the notation of the original ASRN instance, translates to $f_i(\{(e,\omega_e) : e\in S_{i,\omega}\}) = 1$.

In the other direction, let $T'$ be any decision tree for the ASRN instance $I$. Suppose the target scenario is $i\in[m]$ and the element-outcomes on the $*$-elements for $i$ are given by $\omega$, which is unknown to the algorithm. Then a unique path $P'_{i,\omega}$ is traced in $T'$. Let $S'_{i,\omega}$ denote the elements on this path. Since $i$ is covered at the end of $P'_{i,\omega}$, we have $f_i(\{(e,\omega_e) : e\in S'_{i,\omega}\}) = 1$. Now consider $T'$ as a decision tree for the ASR instance $J$. Under scenario $(i,\omega)$, it is clear that path $P'_{i,\omega}$ is traced and elements $S'_{i,\omega}$ are selected. It follows that $f_{i,\omega}(S'_{i,\omega}) = f_i(\{(e,\omega_e) : e\in S'_{i,\omega}\}) = 1$, which means that scenario $(i,\omega)$ is covered at the end of $P'_{i,\omega}$. Therefore $T'$ is also a feasible decision tree for $J$. Taking expectations, the cost for $J$ is at most that for instance $I$.

Appendix B: Details in Section 3

The non-adaptive SFRN algorithm (Algorithm 1) involves two phases. In the first phase, we run the SFR algorithm, using sampling to obtain estimates $\overline{GE}(e)$ of the scores $GE(e)$. If at some step the maximum sampled score is "too low", we move to the second phase, where we perform all remaining elements in an arbitrary order. The number of samples used to obtain each estimate is polynomial in $m, n, \varepsilon^{-1}$, so the overall runtime is polynomial.

Pre-processing. We first show that, by losing an $O(1)$ factor in the approximation ratio, we may assume that $\pi_i \ge n^{-2}$ for all $i\in[m]$. Let $A = \{i\in[m] : \pi_i \le n^{-2}\}$; then $\sum_{i\in A}\pi_i \le n^{-2}\cdot n \le n^{-1}$. Replace all scenarios in $A$ with a single dummy scenario "0" with $\pi_0 = \sum_{i\in A}\pi_i$, and define $f_0$ to be any $f_i$ with $i\in A$. By our assumption that each $f_i$ must be covered irrespective of the noisy outcomes, it holds that $f_{i,\omega}([n]) = 1$ for each $\omega\in\Omega(i)$, and hence the cover time is at most $n$. Thus, for any permutation $\sigma$, the expected cover times of the old and new instances differ by at most $O(n^{-1}\cdot n) = O(1)$. Therefore, the cover time of any sequence of elements differs by only $O(1)$ between this new instance (where the scenarios with tiny prior probabilities are removed) and the original instance.
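This pre-processing step is mechanical enough to sketch in code. Below is a minimal illustration under made-up data structures (an instance here is just a list of priors and their cover functions; this is not the released experiment code):

```python
def merge_tiny_scenarios(priors, cover_fns, n):
    """Merge all scenarios with prior pi_i <= n^-2 into one dummy scenario.

    priors: list of prior probabilities pi_i (summing to 1).
    cover_fns: list of the corresponding submodular functions f_i.
    Returns new (priors, cover_fns); the dummy scenario "0" carries the
    total merged mass and reuses an arbitrary merged f_i.
    """
    threshold = n ** -2
    tiny = [i for i, p in enumerate(priors) if p <= threshold]
    keep = [i for i, p in enumerate(priors) if p > threshold]
    new_priors = [priors[i] for i in keep]
    new_fns = [cover_fns[i] for i in keep]
    if tiny:
        new_priors.append(sum(priors[i] for i in tiny))
        new_fns.append(cover_fns[tiny[0]])  # any f_i with i in A
    return new_priors, new_fns
```

Since the merged mass is at most $n^{-1}$ and every scenario is covered within $n$ steps, any ordering pays at most $O(1)$ extra on the merged instance.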


We now present the formal proof of Theorem 3, with the proofs of the lemmas deferred to Appendices B.1–B.3. To analyze our randomized algorithm, we need the following sampling lemma, which follows from the standard Chernoff bound.

Lemma 7. Let $X$ be a $[0,1]$-bounded random variable with $\mathbb{E}X \ge m^{-2}n^{-4}\varepsilon$. Let $\bar X$ denote the average of $m^3 n^4 \varepsilon^{-1}$ many independent samples of $X$. Then $\Pr\big[\bar X \notin [\tfrac12\mathbb{E}X,\, 2\,\mathbb{E}X]\big] \le e^{-\Omega(m)}$.

The next lemma shows that sampling does find an approximate maximizer unless the score is very small, and it also bounds the failure probability.

Lemma 8. Consider any step in the algorithm with $S = \max_{e\in[n]} GE(e)$ and $\bar S = \max_{e\in[n]} \overline{GE}(e)$, with $\overline{GE}(e^*) = \bar S$. Call this step a failure if (i) $\bar S < \frac14 m^{-2}n^{-4}\varepsilon$ and $S \ge \frac12 m^{-2}n^{-4}\varepsilon$, or (ii) $\bar S \ge \frac14 m^{-2}n^{-4}\varepsilon$ and $GE(e^*) < \frac{S}{4}$. Then the probability of failure is at most $e^{-\Omega(m)}$.

Based on Lemma 8, in the remaining analysis we condition on the event that our algorithm never encounters a failure, which occurs with probability $1 - e^{-\Omega(m)}$. To conclude the proof, we need the following key lemma, which essentially states that if the score of the greediest element is low, then the elements selected so far suffice to cover all scenarios with high probability, and hence the ordering of the remaining elements does not matter much.

Lemma 9. Assume that there are no failures. Consider the end of phase 1 in our algorithm, i.e., the first step with $GE(e^*) < \frac14 m^{-2}n^{-4}\varepsilon$. Then the probability that the realized scenario is not covered is at most $m^{-2}$.

The above is essentially a consequence of the submodularity of the target functions. Suppose for contradiction that there is a scenario $i$ that, with probability at least $m^{-2}$ over the random outcomes, remains uncovered by the currently selected elements. Recall that, by our feasibility assumption, if all elements were selected then $f_i$ would be covered with probability 1. Thus, by submodularity, there exists an individual element $\tilde e$ whose inclusion brings more coverage than the average coverage over all elements in $[n]$, and hence $\tilde e$ has a "high" score.
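The averaging step in this argument, that some single element attains at least a $1/n$ fraction of the total residual coverage of a monotone submodular function, can be checked on a toy example. The coverage sets below are made up for illustration only:

```python
def max_singleton_gain(ground, g):
    """Best single-element value of a set function g with g(set()) == 0.
    For monotone submodular g this is at least g(ground) / |ground|,
    which is the averaging step used in the proof of Lemma 9."""
    return max(g({e}) for e in ground)

# Toy monotone submodular coverage function: g(S) = |union of covered items|.
cover = {1: {'a', 'b'}, 2: {'b', 'c'}, 3: {'c'}}
g = lambda S: len(set().union(*[cover[e] for e in S]))
ground = set(cover)
assert max_singleton_gain(ground, g) >= g(ground) / len(ground)
```

Here $g(\text{ground}) = 3$ with three elements, and indeed element 1 alone already covers 2 items.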

Proof of Theorem 3. Assume that there are no failures. We proceed by bounding the expected costs (number of elements) from phases 1 and 2 separately. By Lemma 8, the element chosen in each step of phase 1 is a 4-approximate maximizer (see the case (ii) failure) of the score used in the SFR algorithm. Thus, by Theorem 4, the expected cost in phase 1 is $O(\log m)$ times the optimum. On the other hand, by Lemma 9 the probability of performing phase 2 is at most $e^{-\Omega(m)}$. As there are at most $n$ elements in phase 2, the expected cost is only $O(1)$. Therefore, Algorithm 1 is an $O(\log m)$-approximation algorithm for SFRN.


B.1. Proof of Lemma 7.

Let $X_1,\dots,X_N$ be i.i.d. samples of the random variable, where $N = m^3 n^4 \varepsilon^{-1}$ is the number of samples. Letting $Y = \sum_{i\in[N]} X_i$, the usual Chernoff bound implies for any $\delta\in(0,1)$,
$$\Pr\big[Y \notin [(1-\delta)\mathbb{E}Y,\, (1+\delta)\mathbb{E}Y]\big] \le \exp\Big(-\frac{\delta^2}{2}\cdot \mathbb{E}Y\Big).$$
The lemma follows by setting $\delta = \frac12$ and using the assumption $\mathbb{E}Y = N\cdot\mathbb{E}X_1 = \Omega(m)$.
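As a sanity check on this concentration behavior, one can empirically estimate how often the sample average of a bounded variable leaves the window $[\frac12\mathbb{E}X,\, 2\,\mathbb{E}X]$. The Monte Carlo sketch below uses toy parameters, not the paper's $m^3n^4\varepsilon^{-1}$ regime:

```python
import random

def escape_frequency(mean, n_samples, trials=2000, seed=0):
    """Fraction of trials in which the average of n_samples Bernoulli(mean)
    draws falls outside the window [mean/2, 2*mean]."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        avg = sum(rng.random() < mean for _ in range(n_samples)) / n_samples
        if not (mean / 2 <= avg <= 2 * mean):
            bad += 1
    return bad / trials
```

With `n_samples` large relative to `1/mean` (so that the expected sum is large, mirroring $\mathbb{E}Y = \Omega(m)$ in the proof), escapes become rare.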

B.2. Proof of Lemma 8

We will consider the two types of failures separately. For the first type, suppose $S \ge \frac12 m^{-2}n^{-4}\varepsilon$. Using Lemma 7 on the element $e\in[n]$ with $GE(e) = S$, we obtain
$$\Pr\Big[\bar S < \tfrac14 m^{-2}n^{-4}\varepsilon\Big] \le \Pr\Big[\overline{GE}(e) < \tfrac14 m^{-2}n^{-4}\varepsilon\Big] \le e^{-\Omega(m)}.$$
So the probability of the first type of failure is at most $e^{-\Omega(m)}$.

For the second type of failure, we consider two further cases:

• $S < \frac18 m^{-2}n^{-4}\varepsilon$. For any $e\in[n]$ we have $GE(e) \le S < \frac18 m^{-2}n^{-4}\varepsilon$. Note that $\overline{GE}(e)$ is the average of $N$ independent samples, each with mean $GE(e)$. We now upper bound the probability of the event $B_e$ that $\overline{GE}(e) \ge \frac14 m^{-2}n^{-4}\varepsilon$. We first artificially increase each sample mean to $\frac18 m^{-2}n^{-4}\varepsilon$: note that this only increases the probability of event $B_e$. Now, using Lemma 7, we obtain $\Pr[B_e] \le e^{-\Omega(m)}$. By a union bound, it follows that $\Pr[\bar S \ge \frac14 m^{-2}n^{-4}\varepsilon] \le \sum_{e\in[n]}\Pr[B_e] \le e^{-\Omega(m)}$.

• $S \ge \frac18 m^{-2}n^{-4}\varepsilon$. Consider now any $e\in U$ with $GE(e) < S/4$. By Lemma 7 (artificially increasing $GE(e)$ to $S/4$ if needed), it follows that $\Pr[\overline{GE}(e) > S/2] \le e^{-\Omega(m)}$. Now consider the element $e'$ with $GE(e') = S$. Again, by Lemma 7, it follows that $\Pr[\overline{GE}(e') \le S/2] \le e^{-\Omega(m)}$. This means that element $e^*$ has $\overline{GE}(e^*) \ge \overline{GE}(e') > S/2$ and $GE(e^*) \ge S/4$ with probability $1 - e^{-\Omega(m)}$. In other words, assuming $S \ge \frac18 m^{-2}n^{-4}\varepsilon$, the probability that $GE(e^*) < S/4$ is at most $e^{-\Omega(m)}$.

Adding the probabilities over all possibilities for failures, the lemma follows.

B.3. Proof of Lemma 9

Let $E$ denote the elements chosen so far and $p$ the probability that $E$ does not cover the realized scenario-copy of $H$. That is,
$$p = \Pr_{(i,\omega)\in H}\big(f_{i,\omega}(E) < 1\big) = \sum_{i=1}^{m} \pi_i \cdot \Pr_{\omega\in\Omega(i)}\big(f_{i,\omega}(E) < 1\big).$$
It follows that there is some $i$ with $\Pr_{\omega\in\Omega(i)}(f_{i,\omega}(E) < 1) \ge p$. By definition of separability, if $f_{i,\omega}(E) < 1$ then $f_{i,\omega}(E) \le 1-\varepsilon$. Thus,
$$\sum_{\omega\in\Omega(i)} \pi_{i,\omega}\, f_{i,\omega}(E) \;\le\; \sum_{\omega:\, f_{i,\omega}(E)=1} \pi_{i,\omega}\cdot 1 \;+\; \sum_{\omega:\, f_{i,\omega}(E)<1} \pi_{i,\omega}\cdot f_{i,\omega}(E) \;\le\; (1-\varepsilon p)\,\pi_i.$$
On the other hand, taking all the elements, we have $f_{i,\omega}([n]) = 1$ for all $\omega\in\Omega(i)$. Thus,
$$\sum_{\omega\in\Omega(i)} \pi_{i,\omega}\, f_{i,\omega}([n]) = \sum_{\omega\in\Omega(i)} \pi_{i,\omega} = \pi_i.$$
Taking the difference of the above two inequalities, we have
$$\sum_{\omega\in\Omega(i)} \pi_{i,\omega}\cdot\big(f_{i,\omega}([n]) - f_{i,\omega}(E)\big) \ge \pi_i\cdot\varepsilon p.$$
Consider the function $g(S) := \sum_{\omega\in\Omega(i)} \pi_{i,\omega}\cdot\big(f_{i,\omega}(S\cup E) - f_{i,\omega}(E)\big)$ for $S\subseteq[n]$, which is also submodular. From the above, we have $g([n]) \ge \pi_i\cdot\varepsilon p$. Using submodularity of $g$,
$$\max_{e\in[n]} g(\{e\}) \ge \frac{\varepsilon p\,\pi_i}{n} \;\Longrightarrow\; \exists\,\tilde e\in[n]:\; \sum_{\omega\in\Omega(i)} \pi_{i,\omega}\cdot\big(f_{i,\omega}(E\cup\{\tilde e\}) - f_{i,\omega}(E)\big) \ge \frac{\varepsilon p\,\pi_i}{n}.$$
It follows that $GE(\tilde e) \ge \frac{\varepsilon p\,\pi_i}{n} \ge n^{-3}\varepsilon p$, where we used that $\min_i \pi_i \ge n^{-2}$. Now, suppose for a contradiction that $p \ge m^{-2}$. Since there is no failure and $GE(\tilde e) \ge n^{-3}m^{-2}\varepsilon \ge \frac14 n^{-4}m^{-2}\varepsilon$, by case (ii) of Lemma 8 we deduce that $GE(e^*) \ge \frac14 m^{-2}n^{-4}\varepsilon$, which is a contradiction.

Appendix C: Details in Section 4

C.1. Proof of Lemma 1.

By decomposing the summation in the left-hand side of (3) as $H' = \cup_i (H'\cap H_i)$, and noticing that $f_{i,\omega}(E) = f_i(\nu_E)$, the problem reduces to showing that for each $i\in[m]$,
$$\sum_{(i,\omega)\in H'\cap H_i} \pi_{i,\omega}\cdot\big(f_{i,\omega}(e\cup E) - f_{i,\omega}(E)\big) = p_i\cdot \mathbb{E}_{i,\nu_e}\big[f_i(\nu_E\cup\{\nu_e\}) - f_i(\nu_E)\big].$$
Recalling that $p_i = \frac{n_i\cdot\pi_i}{|\Omega|^{c_i}}$ and $\pi_{(i,\omega)} = \frac{\pi_i}{|\Omega|^{c_i}}$, the above simplifies to
$$\frac{1}{n_i}\sum_{(i,\omega)\in H'\cap H_i} \big(f_{i,\omega}(e\cup E) - f_{i,\omega}(E)\big) = \mathbb{E}_{i,\nu_e}\big[f_i(\nu_E\cup\{\nu_e\}) - f_i(\nu_E)\big].$$
Note that $n_i = |H'\cap H_i|$, so the above is equivalent to
$$\frac{1}{n_i}\sum_{(i,\omega)\in H'\cap H_i} f_{i,\omega}(e\cup E) = \mathbb{E}_{i,\nu_e}\big[f_i(\nu_E\cup\{\nu_e\})\big]. \qquad (7)$$
It is straightforward to verify the above by considering the following two cases.

• If $r_i(e) = \nu_e\in\Omega\setminus\{*\}$, then the outcome $\nu_e$ is deterministic conditional on scenario $i$, and so is $f_i(\nu_E\cup\{\nu_e\})$, the value of $f_i$ after selecting $e$. On the left-hand side, for every $\omega\in H_i$, by definition of $H_i$ it holds that $\nu_e = \omega_e$, and hence $f_{i,\omega}(e\cup E) = f_i(\nu_E\cup\{\nu_e\})$ for every $(i,\omega)\in H_i$. Therefore, all terms in the summation are equal to $f_i(\nu_E\cup\{\nu_e\})$ and hence (7) holds.

• If $r_i(e) = *$, then each outcome $o\in\Omega$ occurs with equal probability; thus we may rewrite the right-hand side as
$$\mathbb{E}_{i,\nu_e}\big[f_i(\nu_E\cup\{\nu_e\})\big] = \sum_{o\in\Omega}\Pr_i[\nu_e = o]\cdot f_i(\nu_E\cup\{\nu_e\}) = \frac{1}{|\Omega|}\sum_{o\in\Omega} f_i\big(\nu_E\cup\{(e,o)\}\big).$$
To analyze the other side, note that by the definition of $H_i$ and $H'$, there are equally many expanded scenarios $(i,\omega)$ in $H'\cap H_i$ with $\omega_e = o$ for each outcome $o\in\Omega$. Thus, we can rewrite the left-hand side as
$$\frac{1}{n_i}\sum_{(i,\omega)\in H'\cap H_i} f_{i,\omega}(e\cup E) = \frac{1}{n_i}\sum_{o\in\Omega}\;\sum_{\substack{(i,\omega)\in H'\cap H_i,\\ \omega_e = o}} f_{i,\omega}(e\cup E) = \frac{1}{n_i}\sum_{o\in\Omega}\frac{n_i}{|\Omega|}\, f_{i,\omega}(e\cup E) = \frac{1}{|\Omega|}\sum_{o\in\Omega} f_i\big(\nu_E\cup\{(e,o)\}\big),$$
which matches the right-hand side of (7) and completes the proof.
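The $r_i(e)=*$ case boils down to a counting identity: among the expanded scenarios, each outcome $o\in\Omega$ appears equally often, so the uniform average over expanded scenarios equals the expectation over a uniformly random outcome of the test. A toy numerical check with a hypothetical two-outcome setup and three star tests:

```python
from itertools import product

# One hypothesis with 3 noisy (star) tests: the expanded scenarios are all
# outcome vectors in {+1, -1}^3, each equally likely given the hypothesis.
omegas = list(product([+1, -1], repeat=3))

def f(first_outcome):
    """Any function of the outcome of the first (star) test e."""
    return 0.75 if first_outcome == +1 else 0.25

# Left side of (7): uniform average over all expanded scenarios.
lhs = sum(f(w[0]) for w in omegas) / len(omegas)
# Right side of (7): expectation over a uniformly random outcome of test e.
rhs = (f(+1) + f(-1)) / 2
assert abs(lhs - rhs) < 1e-12
```

Each of the two outcomes of the first test appears in exactly half of the 8 expanded scenarios, which is exactly why the two sides agree.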

C.2. Application of Algorithm 2 and Algorithm 3 to ODTN.

For concreteness, we provide closed-form formulas for Score$_c$ and Score$_r$ in the ODTN problem using Lemma 1; these were used in our experiments for ODTN. In §2.3, we formulated ODTN as an ASRN instance. Recall that the outcome set is $\Omega = \{+1,-1\}$, and the submodular function $f$ (associated with each hypothesis $i$) measures the proportion of hypotheses eliminated after observing the outcomes of a subset of tests.

As in §4, at any point in Algorithm 2 or 3, after selecting a set $E$ of tests, let $\nu_E : E\to\pm1$ denote their outcomes. For each hypothesis $i\in[m]$, let $n_i$ denote the number of surviving expanded-scenarios of $i$. Also, for each hypothesis $i$, let $p_i$ denote the total probability mass of the surviving expanded-scenarios of $i$. For any $S\subseteq[m]$, we use the shorthand $p(S) = \sum_{i\in S} p_i$. Finally, let $A\subseteq[m]$ denote the compatible hypotheses based on the observed outcomes $\nu_E$ (these are all the hypotheses $i$ with $n_i > 0$). Then $f(\nu_E) = \frac{m-|A|}{m-1}$. Moreover, for any new test/element $T$,
$$f(\nu_E\cup\{\nu_T\}) = \begin{cases} \frac{m-|A|+|A\cap T^-|}{m-1} & \text{if } \nu_T = +1,\\[2pt] \frac{m-|A|+|A\cap T^+|}{m-1} & \text{if } \nu_T = -1.\end{cases}$$
Recall that $T^+$, $T^-$ and $T^*$ denote the hypotheses with $+1$, $-1$ and $*$ outcomes for test $T$. So,
$$\frac{f(\nu_E\cup\{\nu_T\}) - f(\nu_E)}{1 - f(\nu_E)} = \begin{cases} \frac{|A\cap T^-|}{|A|-1} & \text{if } \nu_T = +1,\\[2pt] \frac{|A\cap T^+|}{|A|-1} & \text{if } \nu_T = -1.\end{cases}$$
It is then straightforward to verify the following.

Proposition 2. Consider implementing Algorithm 2 on an ODTN instance. Suppose that, after selecting tests $E$, the expanded-scenarios $H'$ (and original scenarios $A$) are compatible, with the parameters described above. For any test $T$, if $b_T\in\{+1,-1\}$ is the outcome corresponding to $B_T(H')$, then the second term in Score$_c(T;E,H')$ and Score$_r(T;E,H')$ is:
$$\left(\frac{|A\cap T^-|}{|A|-1} + \frac{|A\cap T^+|}{|A|-1}\right)\cdot\frac{p(A\cap T^*)}{2} \;+\; \frac{|A\cap T^-|}{|A|-1}\cdot p(A\cap T^+) \;+\; \frac{|A\cap T^+|}{|A|-1}\cdot p(A\cap T^-).$$

The above expression has a natural interpretation for ODTN: conditioned on the outcomes $\nu_E$ so far, it is the expected number of newly eliminated hypotheses due to test $T$ (normalized by $|A|-1$). The first term of the score, $\pi(L_T(H'))$ or $\pi(R_T(H'))$, is calculated as for the general ASRN problem. Finally, observe that for the submodular functions used for ODTN, the separation parameter is $\varepsilon = \frac{1}{m-1}$. So, by Theorem 7 we immediately obtain a polynomial-time $O(\min(r,c) + \log m)$-approximation for ODTN.
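The expression in Proposition 2 is easy to compute directly. A sketch on a toy instance (the sets `A`, `T_plus`, `T_minus`, `T_star` and the masses `p` are hypothetical inputs, not the released experiment code):

```python
def odtn_second_term(A, T_plus, T_minus, T_star, p):
    """Second term of Score_c / Score_r for a test T in ODTN: the expected
    number of newly eliminated hypotheses, normalized by |A| - 1.

    A: set of currently compatible hypotheses.
    T_plus / T_minus / T_star: hypotheses with +1 / -1 / * outcome on T.
    p: dict mapping hypothesis -> surviving probability mass p_i.
    """
    mass = lambda S: sum(p[i] for i in S)
    a = len(A) - 1
    elim_if_plus = len(A & T_minus) / a    # fraction eliminated when outcome is +1
    elim_if_minus = len(A & T_plus) / a    # fraction eliminated when outcome is -1
    # Star hypotheses split their mass evenly between the two outcomes.
    return ((elim_if_plus + elim_if_minus) * mass(A & T_star) / 2
            + elim_if_plus * mass(A & T_plus)
            + elim_if_minus * mass(A & T_minus))
```

For example, with $A=\{1,2,3\}$, $T^+=\{1\}$, $T^-=\{2\}$, $T^*=\{3\}$ and uniform masses $p_i = 1/3$, each of the three terms contributes $1/6$, giving a score of $0.5$.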

C.3. Proof of Theorem 6

The proof is similar to the analysis in Navidi et al. (2020).

With some foresight, set $\alpha := 15(r + \log m)$. Write Algorithm 3 as ALG and let OPT be the optimal adaptive policy. It will be convenient to view ALG and OPT as decision trees where each node represents the "state" of the policy. Nodes in the decision tree are labelled by elements (that are selected at the corresponding state), and branches out of each node are labelled by the outcome observed at that point. At any state, we use $E$ to denote the previously selected elements and $H'\subseteq M$ to denote the expanded-scenarios that are (i) compatible with the outcomes observed so far and (ii) uncovered. Suppose at some iteration elements $E$ are selected and outcomes $\nu_E$ are observed; then a scenario $i$ is said to be covered if $f_i(E\cup\nu_E) = 1$, and uncovered otherwise.

For ease of presentation, we use the phrase "at time $t$" to mean "after selecting $t$ elements". Note that the cost incurred until time $t$ is exactly $t$. The key step is to show
$$a_k \le 0.2\,a_{k-1} + 3 y_k, \quad \text{for all } k\ge 1, \qquad (8)$$
where

• $A_k\subseteq M$ is the set of uncovered expanded scenarios in ALG at time $\alpha\cdot 2^k$, and $a_k = p(A_k)$ is their total probability;

• $Y_k$ is the set of uncovered scenarios in OPT at time $2^{k-1}$, and $y_k = p(Y_k)$ is the total probability of these scenarios.


As shown in Section 2 of Navidi et al. (2020), (8) implies that Algorithm 3 is an $O(\alpha)$-approximation, and hence Theorem 6 follows. To prove (8), we consider the total score collected by ALG between iterations $\alpha 2^{k-1}$ and $\alpha 2^k$, formally given by
$$Z := \sum_{t=\alpha 2^{k-1}+1}^{\alpha 2^k}\;\sum_{(E,H')\in V(t)}\;\max_{e\in[n]\setminus E}\left(\sum_{(i,\omega)\in R_e(H')}\pi_{i,\omega} \;+\; \sum_{(i,\omega)\in H'}\pi_{i,\omega}\cdot\frac{f_{i,\omega}(e\cup E) - f_{i,\omega}(E)}{1 - f_{i,\omega}(E)}\right) \qquad (9)$$
where $V(t)$ denotes the set of states $(E,H')$ that occur at time $t$ in the decision tree ALG. We note that all the expanded-scenarios seen in states of $V(t)$ are contained in $A_{k-1}$.

Consider any state $(E,H')$ at time $t$ in the algorithm. Recall that $H'$ are the expanded-scenarios, and let $S\subseteq[m]$ denote the original scenarios in $H'$. Let $T_{H'}(k)$ denote the subtree of OPT that corresponds to paths traced by expanded-scenarios in $H'$ up to time $2^{k-1}$. Note that each node (labeled by any element $e\in[n]$) in $T_{H'}(k)$ has at most $|\Omega|$ outgoing branches, and one of them corresponds to the outcome $o_e(S)$ defined in Algorithm 3. We define Stem$_k(H')$ to be the path in $T_{H'}(k)$ that at each node (labeled $e$) follows the $o_e(S)$ branch. We also use Stem$_k(H')\subseteq[n]\times\Omega$ to denote the observed element-outcome pairs on this path.

Definition 1. Each state $(E,H')$ is exactly one of the following types:

• bad, if the probability of uncovered scenarios in $H'$ at the end of Stem$_k(H')$ is at least $\frac{\Pr(H')}{3}$;

• okay, if it is not bad and $\Pr(\cup_{e\in \mathrm{Stem}_k(H')} R_e(H'))$ is at least $\frac{\Pr(H')}{3}$;

• good, if it is neither bad nor okay and the probability of scenarios in $H'$ that get covered by Stem$_k(H')$ is at least $\frac{\Pr(H')}{3}$.

Crucially, this categorization of states is well defined. Indeed, each expanded-scenario in $H'$ is (i) uncovered at the end of Stem$_k(H')$, or (ii) in $R_e(H')$ for some $e\in$ Stem$_k(H')$, or (iii) covered by some prefix of Stem$_k(H')$, i.e., the function value reaches 1 on Stem$_k(H')$. So the total probability of the scenarios in one of these 3 categories must be at least $\frac{\Pr(H')}{3}$.

In the next two lemmas, we will show a lower bound (Lemma 10) and an upper bound (Lemma 11) for $Z$ in terms of $a_k$ and $y_k$, which together imply (8) and complete the proof.

Lemma 10. For any $k\ge1$, it holds that $Z \ge \alpha\cdot(a_k - 3y_k)/3$.

Proof. The proof of this lower bound is identical to that of Lemma 3 in Navidi et al. (2020) for noiseless ASR. The only difference is that we use the scenario-subset $R_e(H')\subseteq H'$ instead of the subset "$L_e(H)\subseteq H$" in the analysis of Navidi et al. (2020).

Lemma 11. For any $k\ge1$, $Z \le a_{k-1}\cdot\big(1 + \ln\frac{1}{\varepsilon} + r + \log m\big)$.


Proof. This proof is analogous to that of Lemma 4 in Navidi et al. (2020) but requires new ideas, as detailed below. Our proof splits into two steps. We first rewrite $Z$ by interchanging its double summation: the outer layer is now over $A_{k-1}$ (instead of over the times between $\alpha 2^{k-1}$ and $\alpha 2^k$ as in the original definition of $Z$). Then, for each fixed $(i,\omega)\in A_{k-1}$, we will upper bound the inner summation using the assumption that there are at most $r$ original scenarios with $r_i(e) = *$ for each element $e$.

Step 1: Rewriting $Z$. For any uncovered $(i,\omega)\in A_{k-1}$ in the decision tree ALG at time $\alpha 2^{k-1}$, let $P_{i,\omega}$ be the path traced by $(i,\omega)$ in ALG, starting from time $\alpha 2^{k-1}$ and ending at time $\alpha 2^k$ or when $(i,\omega)$ is covered.

Recall that in the definition of $Z$, for each time $t$ between $\alpha 2^{k-1}$ and $\alpha 2^k$, we sum over all states $(E,H')$ at time $t$. Since $t \ge \alpha 2^{k-1}$, and the subset of uncovered scenarios only shrinks as $t$ increases, for any $(E,H')\in V(t)$ we have $H'\subseteq A_{k-1}$. So only the expanded scenarios in $A_{k-1}$ contribute to $Z$. Thus we may rewrite (9) as
$$Z = \sum_{(i,\omega)\in A_{k-1}}\pi_{i,\omega}\cdot\sum_{(e;E,H')\in P_{i,\omega}}\left(\frac{f_{i,\omega}(e\cup E) - f_{i,\omega}(E)}{1 - f_{i,\omega}(E)} + \mathbb{1}\big[(i,\omega)\in R_e(H')\big]\right)$$
$$\le \sum_{(i,\omega)\in A_{k-1}}\pi_{i,\omega}\cdot\left(\sum_{(e;E,H')\in P_{i,\omega}}\frac{f_{i,\omega}(e\cup E) - f_{i,\omega}(E)}{1 - f_{i,\omega}(E)} + \sum_{(e;E,H')\in P_{i,\omega}}\mathbb{1}\big[(i,\omega)\in R_e(H')\big]\right). \qquad (10)$$

Step 2: Bounding the Inner Summation. The rest of our proof involves upper bounding each of the two terms in the summation over $e\in P_{i,\omega}$ for any fixed $(i,\omega)\in A_{k-1}$. To bound the first term, we need the following standard result on submodular functions.

Lemma 12 (Azar and Gamzu (2011)). Let $f : 2^U \to [0,1]$ be any monotone function with $f(\emptyset) = 0$, and let $\varepsilon = \min\{f(S\cup\{e\}) - f(S) : e\in U,\, S\subseteq U,\, f(S\cup\{e\}) - f(S) > 0\}$ be the separability parameter. Then for any nested sequence of subsets $\emptyset = S_0 \subseteq S_1 \subseteq \cdots \subseteq S_k \subseteq U$, it holds that
$$\sum_{t=1}^{k}\frac{f(S_t) - f(S_{t-1})}{1 - f(S_{t-1})} \le 1 + \ln\frac{1}{\varepsilon}.$$
It follows immediately that
$$\sum_{(e;E,H')\in P_{i,\omega}}\frac{f_{i,\omega}(e\cup E) - f_{i,\omega}(E)}{1 - f_{i,\omega}(E)} \le 1 + \ln\frac{1}{\varepsilon}. \qquad (11)$$
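Lemma 12 can be verified numerically on a concrete monotone function. The sketch below uses a small coverage function (made up for illustration, not from the paper) and checks the $1+\ln(1/\varepsilon)$ bound along one nested chain:

```python
import math

# Monotone coverage function on items {a,b,c,d}: f(S) = |covered(S)| / 4.
cover = {1: {'a'}, 2: {'a', 'b'}, 3: {'c'}, 4: {'d'}}
f = lambda S: len(set().union(*[cover[e] for e in S])) / 4

chain = [set(), {1}, {1, 2}, {1, 2, 3}, {1, 2, 3, 4}]  # nested subsets
eps = 1 / 4  # smallest positive marginal gain of f

# Telescoping sum from Lemma 12 (skip steps with zero gain).
total = sum((f(b) - f(a)) / (1 - f(a))
            for a, b in zip(chain, chain[1:]) if f(b) > f(a))
assert total <= 1 + math.log(1 / eps)
```

Here the sum evaluates to $\frac14 + \frac13 + \frac12 + 1 = \frac{25}{12} \approx 2.08$, below the bound $1 + \ln 4 \approx 2.39$.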

Next we consider the second term $\sum_{(e;E,H')\in P_{i,\omega}}\mathbb{1}[(i,\omega)\in R_e(H')]$. Recall that $S\subseteq[m]$ is the subset of original scenarios with at least one expanded scenario in $H'$. Consider the partition of the scenarios $S$ into $|\Omega|+1$ parts based on the response entries (from $\Omega\cup\{*\}$) for element $e$. From Algorithm 3, recall that $U_e(S)$ denotes the part with response $*$ and $C_e(S)$ denotes the largest-cardinality part among the non-$*$ responses. Also, $o_e(S)\in\Omega$ is the outcome corresponding to part $C_e(S)$. Moreover, $R_e(H')\subseteq H'$ consists of all expanded-scenarios that do not have outcome $o_e(S)$ on element $e$.

Suppose that $(i,\omega)\in R_e(H')$. Then it must be that the observed outcome on $e$ is not $o_e(S)$. Let $S'\subseteq S$ denote the subset of original scenarios that are also compatible with the observed outcome on $e$. We now claim that $|S'| \le \frac{|S|+r}{2}$. To see this, let $D_e(S)\subseteq S$ denote the part having the second-largest cardinality among the non-$*$ responses for $e$. As the observed outcome is not $o_e(S)$ (which corresponds to the largest part), we have
$$|S'| \le |U_e(S)| + |D_e(S)| \le |U_e(S)| + \frac{|S| - |U_e(S)|}{2} = \frac{|S| + |U_e(S)|}{2} \le \frac{|S| + r}{2}.$$
The first inequality above uses the fact that $S'$ consists of $U_e(S)$ (the scenarios with $*$ response) and some part (other than $C_e(S)$) with a non-$*$ response. The second inequality uses $|D_e(S)| \le \frac{|D_e(S)| + |C_e(S)|}{2} \le \frac{|S| - |U_e(S)|}{2}$. The last inequality uses the upper bound $r$ on the number of $*$ responses per element. It follows that each time $(i,\omega)\in R_e(H')$, the number of compatible (original) scenarios on path $P_{i,\omega}$ changes as $|S'| \le \frac{|S|+r}{2}$. Hence, after $\log_2 m$ such events, the number of compatible scenarios on path $P_{i,\omega}$ is at most $r$. Finally, we use the fact that the number of compatible scenarios reduces by at least one whenever $(i,\omega)\in R_e(H')$, to obtain
$$\sum_{(e;E,H')\in P_{i,\omega}}\mathbb{1}\big[(i,\omega)\in R_e(H')\big] \le r + \log_2 m. \qquad (12)$$
Combining (10), (11) and (12), we obtain the lemma.
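The recursion $|S'| \le (|S|+r)/2$ is what drives the $r + \log_2 m$ term in (12): the part of $|S|$ in excess of $r$ halves with every "wrong branch" event. A quick numeric check of that consequence (illustrative only, iterating the worst case of the recursion):

```python
import math

def events_until_r(m, r):
    """Iterate the worst-case recursion |S'| = (|S| + r) / 2 starting from
    m scenarios, counting events until at most r + 1 scenarios remain."""
    s, events = m, 0
    while s > r + 1:
        s = (s + r) / 2
        events += 1
    return events

# The excess above r halves each event, so about log2(m) events suffice.
for m, r in [(1024, 5), (10**6, 0), (50, 10)]:
    assert events_until_r(m, r) <= math.ceil(math.log2(m)) + 1
```

After that, each further event still removes at least one scenario, giving the additive $r$ in (12).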

Appendix D: Details in Section 5

D.1. A Low-Cost Membership Oracle

Note that Steps 3, 9 and 18 are well-defined because the ODTN instance is assumed to be identifiable. If there is no new test in Step 3 with $T^+\cap Z'\ne\emptyset$ and $T^-\cap Z'\ne\emptyset$, then we must have $|Z'| = 1$. If there is no new test in Step 9 with $i\notin T^*$, then we must have identified $i$ uniquely, i.e., $Y = \emptyset$. Finally, in Step 18, we use the fact that there are tests that deterministically separate every pair of hypotheses.

Proof. If $\bar i\in Z$, then it is clear that $i = \bar i$ in step 6 and Member($Z$) declares "$\bar i = i$". Now consider the case $\bar i\notin Z$. Recall that $i\in Z$ denotes the unique hypothesis that is still compatible in step 6, and that $Y$ denotes the set of compatible hypotheses among $[m]\setminus\{i\}$, so it always contains $\bar i$. Hence $Y\ne\emptyset$ in step 14, which implies that $k = 4\log m$. Also recall the definition of the sets $W$ and $J$ from (13).


Algorithm 5 Member($Z$): oracle that checks whether $\bar i\in Z$.

1: Initialize: $Z'\leftarrow Z$.
2: while $|Z'|\ge 2$ do  % While-loop 1: finding a suspect, reducing $|Z'|$ to 1.
3:   Choose any new test $T\in\mathcal{T}$ with $T^+\cap Z'\ne\emptyset$ and $T^-\cap Z'\ne\emptyset$; observe outcome $\omega_T\in\{\pm1\}$.
4:   Let $R$ be the set of hypotheses ruled out, i.e., $R = \{j\in[m] : M_{T,j} = -\omega_T\}$.
5:   Let $Z'\leftarrow Z'\setminus R$.
6: Let $i$ be the unique hypothesis remaining when the while-loop ends.  % Identified a "suspect".
7: Initialize $k\leftarrow 0$ and $Y = H$.
8: while $Y\ne\emptyset$ and $k\le 4\log m$ do  % While-loop 2: choose deterministic tests for $i$.
9:   Choose any new test $T$ with $M_{T,i}\ne *$ and observe outcome $\omega_T\in\{\pm1\}$.
10:  if $\omega_T = -M_{T,i}$ then  % $i$ is ruled out.
11:    Declare "$\bar i\notin Z$" and stop.
12:  else
13:    Let $R$ be the set of hypotheses ruled out; $Y\leftarrow Y\setminus R$ and $k\leftarrow k+1$.
14: if $Y = \emptyset$ then
15:   Declare "$\bar i = i$" and terminate.
16: else  % Now consider the "bad" case.
17:   Let $W\subseteq\mathcal{T}$ denote the tests performed in step 9, and let
$$J = \{j\in Y : M_{T,j} = M_{T,i} \text{ for at least } 2\log m \text{ tests } T\in W\} = \{j\in Y : M_{T,j} = * \text{ for at most } 2\log m \text{ tests } T\in W\}. \qquad (13)$$
18:   For each $j\in J$, choose a test $T = T(j)\in\mathcal{T}$ with $M_{T,j}, M_{T,i}\ne *$ and $M_{T,j} = -M_{T,i}$;
19:   let $W'\subseteq\mathcal{T}$ denote the set of these tests.
20:   if no test in $W\cup W'$ rules out $i$ then  % Let $i$ duel with the hypotheses in $J$.
21:     Declare "$\bar i = i$".
22:   else
23:     Declare "$\bar i\notin Z$".


• Case 1. If $\bar i\in J$, then we will correctly identify that $\bar i\ne i$ in step 20, as one of the tests in $W'$ (step 18) separates $\bar i$ and $i$ deterministically. So in this case we will always declare "$\bar i\notin Z$".

• Case 2. If $\bar i\notin J$, then by the definition of $J$, we have $\bar i\in T^*$ for at least $2\log m$ tests $T\in W$. As $i$ has a deterministic outcome for each test in $W$, the probability that all outcomes in $W$ are consistent with $i$ is at most $m^{-2}$. So with probability at least $1 - m^{-2}$, some test in $W$ must have an outcome (under $\bar i$) inconsistent with $i$, and based on step 20, we would declare "$\bar i\notin Z$".

To bound the cost, note that the numbers of tests performed are at most $|Z|$ in step 3, $4\log m$ in step 9, and $|J|\le|Z|$ in step 18, and the proof follows.
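While-loop 1 of Algorithm 5 is plain hypothesis elimination against the test-outcome matrix $M$. A runnable sketch of just that loop, under a hypothetical encoding (`M[T][j]` in `{+1, -1, 0}` with `0` standing for $*$; `truth` plays the hidden hypothesis $\bar i$):

```python
def find_suspect(M, Z, truth, rng_outcome=None):
    """While-loop 1 of Member(Z): repeatedly pick a test that splits the
    candidate set Z', observe its outcome, and drop every hypothesis whose
    deterministic entry disagrees. Returns (suspect, tests_used).
    Assumes identifiability, i.e., a splitting test always exists."""
    Zp, used = set(Z), []
    while len(Zp) >= 2:
        # Any unused test with both a +1 and a -1 entry inside Z'.
        T = next(t for t, row in M.items()
                 if t not in used
                 and any(row[j] == +1 for j in Zp)
                 and any(row[j] == -1 for j in Zp))
        used.append(T)
        # Outcome: deterministic under truth; a coin flip on a * entry.
        w = M[T][truth] if M[T][truth] != 0 else (rng_outcome or (lambda: +1))()
        Zp -= {j for j in M[T] if M[T][j] == -w}  # rule out disagreements
    return Zp.pop(), used
```

Each iteration rules out at least one candidate, matching the $|Z|$ bound on the number of tests used in step 3.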

Proof. For the first statement, fix any x = (i, ω) ∈ Ω. Recall that P_x only contains tests from step 7. We only need to consider the case that t_x < |P_x|/2. Let t′_x = 2·t_x, which is a power of 2. By (6) we know that there is some k with t_x < k ≤ t′_x and θ_x(k) < 1/ρ. Hence θ_x(t′_x) < 2/ρ < 1/2.

Consider the point in the algorithm after performing the first t′_x tests (call them S) on P_x. Because t′_x is a power of two, the algorithm calls the member oracle in this iteration. Let X ⊆ [m] be the compatible hypotheses after the t′_x-th test on P_x. Because θ_x(t′_x) < 1/2, at most |S|/2 tests in S are ∗-tests for hypothesis i; in other words, the weight w_x ≤ |S|/2 at this point in the algorithm. Let

X′ = {y ∈ X : S has at most |S|/2 ∗-tests for y} = {y ∈ X : w_y ≤ |S|/2}.

Using Lemma 6 with S and X, it follows that |X′| ≤ 2Cm^α. Hence the number of hypotheses y ∈ X with w_y ≤ |S|/2 is at most 2Cm^α; since w_x ≤ |S|/2, we have i ∈ Z (recall that Z consists of the 2Cm^α hypotheses with the lowest weight). This means that after step 4, we would have correctly identified ī = i, and so P_x ends. Hence |P_x| ≤ t′_x. The first part of the lemma now follows from the fact that t′_x = 2·t_x.

The second statement in the lemma follows by taking expectations over all x ∈ H.

D.2. Proof of Lemma 2.

Consider any feasible decision tree T for the ODTN instance and any hypothesis i ∈ [m]. If we condition on ī = i, then T corresponds to a feasible adaptive policy for SSC(i). This is because:

• for any expanded hypothesis (ω, i) ∈ Ω(i), the tests performed in T must rule out all the hypotheses [m] \ {i}, and

• the set of hypotheses ruled out by any test T (conditioned on ī = i) is a random subset with the same distribution as S_T(i).

Formally, let P_{i,ω} denote the path traced in T under test outcomes ω, and |P_{i,ω}| the number of tests performed along this path. Recall that u_i is the number of unknown tests for i, and that the probability of observing outcomes ω when ī = i is 2^{−u_i}; so this policy for SSC(i) has cost Σ_{(i,ω)∈Ω(i)} 2^{−u_i}·|P_{i,ω}|. Thus OPT_{SSC}(i) ≤ Σ_{(i,ω)∈Ω(i)} 2^{−u_i}·|P_{i,ω}|. Taking expectations over i ∈ [m], the lemma follows.


D.3. Proof of Lemma 3.

For simplicity write (T′)_+ as T′_+ (and similarly define T′_−, T′_∗). Note that E[|S_T(i) ∩ (A \ {i})|] = ½(|T_+ ∩ A| + |T_− ∩ A|) because i ∈ T_∗. We consider two cases for a test T′ ∈ 𝒯:

• If M_{T′,i} = ∗, then

E[|S_{T′}(i) ∩ (A \ {i})|] = ½(|T′_+ ∩ A| + |T′_− ∩ A|) ≤ ½(|T_+ ∩ A| + |T_− ∩ A|),

by the "greedy choice" of T in step 7.

• If i ∈ T′_+ ∪ T′_−, then

E[|S_{T′}(i) ∩ (A \ {i})|] ≤ max{|T′_+ ∩ A|, |T′_− ∩ A|} ≤ |T′_+ ∩ A| + |T′_− ∩ A|,

which is at most |T_+ ∩ A| + |T_− ∩ A| by the choice of T.

In either case the claim holds, and the lemma follows.
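The case analysis above compares the expected number of hypotheses that a test eliminates from the active set A. A small numerical sketch (the set representation and function names are ours; the greedy rule maximizing |T_+ ∩ A| + |T_− ∩ A| is our reading of the "greedy choice" in step 7):

```python
def expected_eliminated(T_plus, T_minus, A, i):
    """E[|S_T(i) ∩ (A \\ {i})|] for a test T with i in T_*: the outcome is
    uniform over {+1,-1}; outcome +1 rules out T_-, outcome -1 rules out T_+."""
    A = A - {i}
    return 0.5 * (len(T_plus & A) + len(T_minus & A))

def greedy_test(tests, A, i):
    """Greedy choice: among the given tests (pairs (T_plus, T_minus)),
    pick one maximizing |T_+ ∩ A| + |T_- ∩ A|."""
    return max(tests, key=lambda t: len(t[0] & A) + len(t[1] & A))
```

For any other test T′, both bullets above bound E[|S_{T′}(i) ∩ (A \ {i})|] by |T_+ ∩ A| + |T_− ∩ A|, i.e., by twice the greedy test's expectation.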

D.4. Proof of Lemma 4.

By definition of α-sparse instances, the maximum number of candidate hypotheses that can be eliminated by performing a single test is m^α. As we need to eliminate m − 1 hypotheses irrespective of the realized hypothesis ī, we need to perform at least (m−1)/m^α = Ω(m^{1−α}) tests under every ī, and the proof follows.

Appendix E: Extension to Non-identifiable ODT Instances

Previous work on the ODT problem usually imposes the following identifiability assumption (e.g., Kosaraju et al. (1999)): for every pair of hypotheses, there is a test that distinguishes them deterministically. However, in many real-world applications, such an assumption does not hold. Thus far, we have also made this identifiability assumption for ODTN (see §2.1). In this section, we show how our results can be extended to non-identifiable ODTN instances.

To this end, we introduce a slightly different stopping criterion for non-identifiable instances. (Note that it is no longer possible to stop with a unique compatible hypothesis.) Define a similarity graph G on m nodes, one per hypothesis, with an edge (i, j) if there is no test separating i and j deterministically. Our algorithms' performance guarantees will now also depend on the maximum degree d of G; note that d = 0 in the perfectly identifiable case. For each hypothesis i ∈ [m], let D_i ⊆ [m] denote the set containing i and all its neighbors in G. We now define two stopping criteria as follows:

• The neighborhood stopping criterion involves stopping when the set K of compatible hypotheses is contained in some D_i, where i may or may not be the true hypothesis ī.


• The clique stopping criterion involves stopping when K is contained in some clique of G.

Note that clique stopping is clearly a stronger notion of identification than neighborhood stopping: if the clique-stopping criterion is satisfied then so is the neighborhood-stopping criterion. We now obtain an adaptive algorithm with approximation ratio O(d + min(c, r) + log m) for clique-stopping as well as neighborhood-stopping.

Consider the following two-phase algorithm. In the first phase, we will identify some subset N ⊆ [m] containing the realized hypothesis ī with |N| ≤ d + 1. Given an ODTN instance with m hypotheses and tests 𝒯 (as in §2.1), we construct the following ASRN instance with hypotheses as scenarios and tests as elements (this is similar to the construction in §2.3). The responses are the same as in ODTN, so the outcomes are Ω = {+1, −1}. Let U = 𝒯 × {+1, −1} be the set of element-outcome pairs. For each hypothesis i ∈ [m], we define a submodular function

f̃_i(S) = min{ (1/(m−d−1)) · | ⋃_{T : (T,+1) ∈ S} T_− ∪ ⋃_{T : (T,−1) ∈ S} T_+ | , 1 },  for all S ⊆ U.

It is easy to see that each function f̃_i : 2^U → [0, 1] is monotone and submodular, with separability parameter ε = 1/(m−d−1). Moreover, f̃_i(S) = 1 if and only if at least m−d−1 hypotheses are incompatible with at least one outcome in S; equivalently, f̃_i(S) = 1 iff there are at most d + 1 hypotheses compatible with S. By definition of the graph G and its maximum degree d, it follows that the function f̃_i can be covered (i.e., reaches value one) irrespective of the noisy outcomes. Therefore, by Theorem 7 we obtain an O(min(r, c) + log m)-approximation algorithm for this ASRN instance. Finally, note that any feasible policy for ODTN with clique/neighborhood stopping is also feasible for this ASRN instance. So the expected cost in the first phase is O(min(r, c) + log m)·OPT.
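The coverage function above admits a direct computation; the following sketch uses our own data layout (each test T stored with its deterministic sides (T_+, T_−), and S given as a set of (test-index, outcome) pairs). Note that, as written in the definition above, the value does not depend on the index i, so the sketch drops it.

```python
def f_tilde(S, tests, m, d):
    """Coverage function of the ASRN instance: counts the hypotheses ruled
    out by the outcomes in S, scaled so that the value saturates at 1
    exactly when at most d+1 hypotheses remain compatible."""
    ruled_out = set()
    for T, outcome in S:
        T_plus, T_minus = tests[T]
        # outcome +1 contradicts hypotheses in T_-; outcome -1, those in T_+
        ruled_out |= T_minus if outcome == +1 else T_plus
    return min(len(ruled_out) / (m - d - 1), 1.0)
```

With m = 5 and d = 1, the function reaches value one as soon as m − d − 1 = 3 hypotheses have been ruled out, leaving at most d + 1 = 2 compatible.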

Then, in the second phase, we run a simple splitting algorithm that iteratively selects any test T that splits the current set K of consistent hypotheses (i.e., T_+ ∩ K ≠ ∅ and T_− ∩ K ≠ ∅). The second phase continues until K is contained in (i) some clique (for clique-stopping) or (ii) some subset D_i (for neighborhood-stopping). Since the number of consistent hypotheses satisfies |K| ≤ d + 1 at the start of the second phase, there are at most d tests in this phase. So the expected cost of the second phase is at most d ≤ d·OPT. Combining both phases, we obtain the following.

Theorem 10. There is an adaptive O(d + min(c, r) + log m)-approximation algorithm for ODTN with the clique-stopping or neighborhood-stopping criterion.
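The second-phase splitting loop can be sketched as follows. The encoding is ours: each test is given by its deterministic sides (T_+, T_−), D maps each hypothesis i to its neighborhood D_i, and the hypothetical `observe` oracle returns the realized outcome; the sketch stops under the neighborhood criterion.

```python
def splitting_phase(K, tests, D, observe):
    """Repeatedly pick a test splitting the compatible set K and discard
    the side contradicted by the observed outcome, until K is contained
    in some neighborhood D[i]."""
    K = set(K)
    while not any(K <= D[i] for i in D):
        T_plus, T_minus = next((tp, tm) for (tp, tm) in tests
                               if K & tp and K & tm)
        if observe((T_plus, T_minus)) == +1:
            K -= T_minus  # outcome +1 contradicts hypotheses in T_-
        else:
            K -= T_plus
    return K
```

Since |K| ≤ d + 1 at the start of this phase and every selected test removes at least one hypothesis, the loop performs at most d tests, matching the cost bound above.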