Decision Trees for Entity Identification: Approximation Algorithms
and Hardness Results
VENKATESAN T. CHAKARAVARTHY, VINAYAKA PANDIT, and
SAMBUDDHA ROY, IBM India Research Lab
PRANJAL AWASTHI, Carnegie Mellon University
MUKESH K. MOHANIA, IBM India Research Lab
We consider the problem of constructing decision trees for entity identification from a given relational
table. The input is a table containing information about a set of entities over a fixed set of attributes and
a probability distribution over the set of entities that specifies the likelihood of the occurrence of each
entity. The goal is to construct a decision tree that identifies each entity unambiguously by testing the
attribute values such that the average number of tests is minimized. This classical problem finds such
diverse applications as efficient fault detection, species identification in biology, and efficient diagnosis in
the field of medicine. Prior work mainly deals with the special case where the input table is binary and the
probability distribution over the set of entities is uniform. We study the general problem involving arbitrary
input tables and arbitrary probability distributions over the set of entities. We consider a natural greedy
algorithm and prove an approximation guarantee of O(r_K · log N), where N is the number of entities and K is the maximum number of distinct values of an attribute. The value r_K is a suitably defined Ramsey number, which is at most log K. We show that it is NP-hard to approximate the problem within a factor of Ω(log N), even for binary tables (i.e., K = 2). Thus, for the case of binary tables, our approximation algorithm is optimal up to constant factors (since r_2 = 2). In addition, our analysis indicates a possible way of resolving a Ramsey-theoretic conjecture by Erdős.
Categories and Subject Descriptors: F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems
General Terms: Algorithms, Theory
Additional Key Words and Phrases: Approximation algorithms, decision tree, Ramsey numbers
ACM Reference Format:
Chakaravarthy, V. T., Pandit, V., Roy, S., Awasthi, P., and Mohania, M. K. 2011. Decision trees for entity
identification: Approximation algorithms and hardness results. ACM Trans. Algor. 7, 2, Article 15 (March
2011), 22 pages.
DOI = 10.1145/1921659.1921661 http://doi.acm.org/10.1145/1921659.1921661
A preliminary version of the article was presented at the ACM Symposium on Principles of Database Systems,
[Chakaravarthy et al. 2007].
This work was done while P. Awasthi was at IBM India Research Lab, New Delhi.
Authors' addresses: V. T. Chakaravarthy, V. Pandit, and S. Roy, IBM India Research Lab, 4 Block C, Institutional Area, Vasanth Kunj, New Delhi – 110070, India; email: {vechakra, pvinayak, sambuddha}@in.ibm.com; P. Awasthi, Computer Science Department, Wean Hall 1313, Carnegie Mellon University, Pittsburgh, PA 15213; email: pawasthi@cs.cmu.edu; M. K. Mohania, IBM India Research Lab, 4 Block C, Institutional Area, Vasanth Kunj, New Delhi – 110070, India; email: mkmohania@in.ibm.com.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that
copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for
components of this work owned by others than ACM must be honored. Abstracting with credit is permitted.
To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this
work in other works requires prior specific permission and/or a fee. Permissions may be requested from
Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212)
869-0481, or permissions@acm.org.
© 2011 ACM 1549-6325/2011/03-ART15 $10.00
DOI 10.1145/1921659.1921661 http://doi.acm.org/10.1145/1921659.1921661
1. INTRODUCTION
Decision trees for the purposes of identification and diagnosis have been studied for a long time [Moret 1982]. Consider a typical medical diagnosis application. A hospital maintains a table containing information about diseases: each row corresponds to a disease, each column to a medical test, and the corresponding entry specifies the outcome of the test for a person suffering from the given disease. Some of the medical tests are costly (e.g., MRI scans) and some require a few days for the result to be known (e.g., blood cultures). When the hospital receives a new patient whose disease has not been identified, it would like to determine the shortest sequence of tests that can unambiguously determine the patient's disease. Such a capability would reduce patient expenditure and allow treatment to start early. Motivated by such applications, we consider the problem of constructing decision trees for entity identification from given data.
Decision Trees for Entity Identification—Problem Statement. The input is a table D having N rows and m columns. Each row is called an entity and the columns are the attributes of these entities. Additionally, we are also given a probability distribution P over the set of entities. For each entity e, P specifies p(e), the likelihood of the occurrence of e. A solution is a decision tree in which each internal node is labeled by an attribute and its branches are labeled by the values that the attribute can take. The entities are the leaves of the tree. The main requirement is that the tree should identify each entity correctly. The cost of the tree is the expected distance of an entity from the root, i.e., Σ_e p(e) · d(e), where d(e) is the distance of the entity e from the root. The goal is to construct a decision tree with the minimum cost. We call this the WDT problem (here, W stands for "weight" and stresses the fact that the entities are associated with probabilities/weights).
Example 1.1. Figure 1 shows an example table and two decision trees for it. In this example, the probability distribution over the entities is uniform, that is, p(e_i) = 1/6 for each entity e_i. In the first decision tree, the distance d(e_1) is 2 and d(e_4) is 3. The cost of the first decision tree is 14/6 and that of the second decision tree is 8/6. The second decision tree happens to be an optimal tree for this instance.

Fig. 1. Example decision trees.
For a given table, the maximum number of distinct values that any attribute takes is
called its branching factor. In the preceding example, the branching factor of the given
table is 5, because every attribute takes at most 5 distinct values and the attribute B
attains the maximum 5. Interesting special cases of the WDT problem can be obtained
in two ways:
—the case in which every input instance is required to have a branching factor of at
most K, for some constant K; we call this the K-WDT problem. Of particular interest
is the 2-WDT problem, where the tables are binary.
—the case in which the probability distribution over the set of entities is known to be
uniform; we call this the UDT problem (here, U stresses the fact that the probabilities/weights are uniform).
The special case in which both of these restrictions apply is called the K-UDT problem.
Prior Results and Our Results. Much of the previous literature deals with the re-
stricted 2-UDT problem. Hyafil and Rivest [1976] showed that the 2-UDT problem is
NP-hard. Garey [1970, 1972] presented a dynamic programming-based algorithm for
the 2-UDT problem that finds the optimal solution, but the algorithm runs in exponen-
tial time in the worst case.
Kosaraju et al. [1999] presented a greedy algorithm for the 2-WDT problem with
an approximation ratio of O(log N); the approximation ratio remains the same for the
special case of the 2-UDT problem. Independently, Dasgupta [2005] showed that the same greedy heuristic has an approximation ratio of 4 log N for the 2-UDT problem. Recently, Heeringa and Adler [2005] gave an alternative analysis of the same greedy algorithm and obtained a slightly improved approximation ratio of (1 + ln N) for the 2-UDT problem (see also Heeringa [2006] and Adler and Heeringa [2008]). They also showed that it is NP-hard to approximate the 2-UDT problem within a ratio of (1 + ε), for some ε > 0. We study the problem in its whole generality, namely the WDT problem, where the attributes can take multiple values and the input probability distribution can be arbitrary. This occurs commonly, for example, in medical diagnosis applications (e.g., blood group can take multiple values; some diseases are more prevalent than others).
We present two approximation algorithms for the UDT problem. The first one is a simple algorithm that uses any given α-approximation algorithm for the 2-UDT problem as a black box and provides an α⌈log K⌉ approximation for the K-UDT problem. In particular, using the algorithm of Heeringa and Adler [2005] as the black box, we obtain an algorithm with an approximation ratio of ⌈log K⌉(1 + ln N). Our second algorithm for the UDT problem uses a greedy heuristic and has an approximation ratio of r_K(1 + ln N), where r_K is a suitably defined Ramsey number which is at most (2 + 0.64 log K). Our analysis builds on that of Heeringa and Adler [2005] and uses additional combinatorial arguments. The highlight of our analysis is that it establishes connections to Ramsey numbers and a conjecture by Erdős (see what follows for more details). Furthermore, notice that the second algorithm offers a constant factor improvement over the first algorithm.
Remark 1.2. We note that, subsequent to our work, Chakaravarthy et al. [2009] considered a slightly different greedy heuristic for the UDT problem and showed an approximation ratio of 4 log N.
Next we consider the general WDT problem. We first observe that by combining the black-box approach with the algorithm of Kosaraju et al. [1999], we get an O(log K · log N) approximation ratio for the WDT problem. We also show how to extend our analysis for the UDT problem to handle weights and obtain an algorithm with an approximation ratio of O(r_K log N). This provides an alternative way of getting the result obtained via the black-box approach.
We next focus on the hardness of approximating various versions of the problem. We show that it is NP-hard to approximate the 2-WDT problem within a ratio of Ω(log N). This implies that the O(log N)-approximation algorithm of Kosaraju et al. [1999] for the 2-WDT problem is optimal up to constant factors. We also improve the hardness of approximation for the unweighted version of the problem. We show that it is NP-hard to approximate the UDT and the 2-UDT problems within ratios of (4 − ε) and (2 − ε), respectively, for any ε > 0. The results are summarized in Figure 2.

Fig. 2. Summary of results.
Ramsey Numbers and Connections to Erdős's Conjecture. Our analysis of the approximation algorithms has interesting connections with Ramsey theory and an unresolved conjecture by Erdős. Ramsey theory, treated at length in the book by Graham et al. [1990], deals with coloring the edges of complete graphs (or hypergraphs) with a specified number of colors satisfying certain constraints. For our purposes, we need the following specific type of Ramsey numbers.
For n > 0, let G_n denote the complete graph on n vertices. A k-coloring of G_n is a coloring of the edges of G_n using k colors. For k > 0, R_k is defined to be the smallest number n such that any k-coloring of G_n contains a monochromatic triangle.¹ The inverses of the Ramsey numbers are more convenient for our purposes. For n > 0, we define r_n to be the smallest number k such that we can color the edges of G_n using only k colors without inducing any monochromatic triangle.

The exact values of the Ramsey numbers for k > 3 are not known. However, it is known that for any k, (3^k + 1)/2 ≤ R_k ≤ 1 + ⌊k! · e⌋ (see West [2001], Nesetril and Rosenfeld [2001], and Schur [1916]). Erdős made the conjecture that for some constant α, for all k, R_k ≤ α^k.

In terms of the inverse Ramsey numbers, the previous bounds translate as follows: (i) for any n, r_n ≤ 2 + 0.64 log n = O(log n); (ii) r_n = Ω(log n / log log n). The Erdős conjecture now reads r_n = Ω(log n).
Our results provide interesting approaches to address the conjecture. One approach is to exhibit a constant c > 0 and show that for all K ≥ 2, it is NP-hard to approximate the K-WDT problem within a factor of c · log K · log N. Notice that this would prove the conjecture under the assumption that NP ≠ P. However, we note that if the recent O(log N)-approximation algorithm for UDT by Chakaravarthy et al. [2009] can be extended to the weighted case, the preceding approach would be ruled out. Another way of proving the conjecture would be to construct a family of bad instances for our algorithm (which is a simple greedy heuristic). We discuss the details later in the article.
Applications and Related Work. Decision trees for entity identification (as defined
in this article) have been used for medical diagnosis (as described earlier), species
identification in biology, fault detection, etc. [Moret 1982]. Taxonomists release field
guides to help identify species based on their characteristics. These guides are often
presented in the form of decision trees labeled by species characteristics. Typically, a
field biologist identifies the species of a specimen at hand by referring to such guides
(hopefully with as few look-ups as possible). Taxonomists refer to such decision trees as
“identification keys” and an article on identification keys can be found in Wikipedia.2
¹A monochromatic triangle is a triplet of vertices such that all three edges between them have the same color. In Ramsey theory, R_k is denoted R(3, 3, ..., 3), where "3" is repeated k times. For example, it is known that R_1 = 3, R_2 = 6, R_3 = 17 [Radziszowski 1994].
²http://en.wikipedia.org/wiki/Dichotomous_key.
Computer programs and algorithms for identification and diagnosis applications have
been developed for nearly four decades (e.g., Pankhurst [1970], Reynolds et al. [2003],
and Wijtzes et al. [1997]).
Murthy [1998] and Moret [1982] present excellent surveys on the use of decision trees
in such diverse fields as machine learning, pattern recognition, taxonomy, switching
theory, and boolean logic.
2. PRELIMINARIES
In this section, we define the WDT problem and its special cases. We also develop some
notation used in the article.
Let D be a relational table having N tuples and m attributes. We call each tuple an entity. Let E and A denote the set of entities and attributes, respectively. For x ∈ E and a ∈ A, x.a denotes the value of the entity x on the attribute a. For a ∈ A, V_a denotes the set of distinct values taken by a in D. Let K = max_{a∈A} |V_a|. Notice that K ≤ N. We call K the branching factor of D.
A decision tree T for the table D is a rooted tree satisfying the following properties. Each internal node u is labeled by an attribute a and has at most K children. Every branch (edge) out of u is labeled by a distinct value from the set V_a. The entities are the leaves of the tree and thus the tree has exactly N leaves. The main requirement is that the tree should identify every entity correctly. In other words, for any entity x, the following traversal process should correctly lead to x. The process starts at the root node. Let u be the current node and a be the attribute label of u. Take the branch out of u labeled by x.a and move to the corresponding child of u. The requirement is that this traversal process should reach the entity x.

Observe that the values of the attributes are used only for taking the correct branch in the traversal process. So, we can map each value of an attribute to a distinct number from 1 to K and assume that V_a is a subset of {1, 2, ..., K}. In the rest of the article, we assume that for any x ∈ E and a ∈ A, x.a ∈ {1, 2, ..., K}.
For a tree T, we use "u ∈ T" to mean that u is an internal node in T. We denote by ⟨x, y⟩ an unordered pair of distinct entities.

Let T be a decision tree for D. For an entity x ∈ E, the path length of x is defined to be the number of internal nodes in the path from the root to x; it is denoted ℓ_T(x). The sum of all path lengths is called the total path length and is denoted |T|, that is, |T| = Σ_{x∈E} ℓ_T(x). Let w(·) be a weight function that assigns a real number w(x) > 0 to each x ∈ E. We define the cost of T with respect to w(·) as follows:

cost(T, w) = Σ_{x∈E} w(x) · ℓ_T(x).

We will denote cost(T, w) as w(T).
As mentioned in the Introduction, the input to the WDT problem includes a probability distribution P over E specifying the likelihood of the occurrence of each entity, and the goal is to construct a tree having the minimum expected path length. We view probabilities as weights and assume that the distribution is specified as a weight function p(·) that associates a weight p(x) > 0 with each entity x. Notice that when an entity is chosen at random according to this distribution, the expected path length is given by p(T) = cost(T, p). We assume that the probabilities p(x) are given as rational numbers. We can write these numbers in such a way that for any entity x, p(x) = w(x)/L, where w(x) ≥ 1 is an integer and L is an integer giving the common denominator. So, without loss of generality, we assume the probability distribution is given as an integer weight function w(·) over the set of entities, that is, for all x ∈ E, w(x) ≥ 1 is an integer. Notice that p(T) = w(T)/L and hence finding an optimal T under p(·) and under w(·) are equivalent problems.
WDT Problem. The input is a relational table D and a probability distribution P represented as an integer weight function w(·). The goal is to construct a decision tree T having the minimum cost w(T).

For a positive integer K, the K-WDT problem is the special case of the WDT problem where the input table is required to have a branching factor of at most K. Notice that in the K-WDT problem, the input is a table whose entries are drawn from the set {1, 2, ..., K}.

Of particular interest is the special case called UDT in which the probability distribution is uniform. In this problem, the weight function is given by w(x) = 1 for all x ∈ E. Note that the cost of a tree T is then w(T) = |T|. For an integer K ≥ 2, the special case of UDT where the input table is required to have a branching factor of at most K is called the K-UDT problem.
3. APPROXIMATION ALGORITHMS AND ANALYSIS
In this section, we present an algorithm for the WDT problem and prove an approximation ratio of O(r_K log N), where K refers to the branching factor of the input table. As mentioned in the Introduction, our analysis builds on that of Heeringa and Adler [2005] for the 2-UDT problem. In order to achieve our result, we have to extend their ideas to deal with two issues: firstly, the attributes can be multivalued as opposed to binary; secondly, the entities can have arbitrary weights. For ease of exposition, we first show how to address the issue of attributes being multivalued, and then we deal with the case of arbitrary weights. Specifically, Section 3.1 presents an algorithm and analysis for the UDT problem. These ideas are generalized in Section 3.2 to obtain an algorithm for the WDT problem.
3.1. The Unweighted Case: UDT Problem
This section deals with the UDT problem. Here, the probability distribution is uniform and so the weights of all the entities are 1. The goal is to find a tree T with the minimum cost |T|.

We present two approximation algorithms for UDT. The first one uses any given α-approximation algorithm for 2-UDT as a black box and provides an α⌈log K⌉ approximation for the K-UDT problem. In particular, using the algorithm of Heeringa and Adler [2005] as the black box, we obtain an algorithm with an approximation ratio of ⌈log K⌉(1 + ln N). Our second algorithm for the UDT problem uses a greedy heuristic and has an approximation ratio of r_K(1 + ln N). Recall that r_K ≤ 2 + 0.64 log K. Thus, the second algorithm offers a constant factor improvement over the first algorithm. The first approach has the advantage that any improvement in the approximation ratio for the 2-UDT problem automatically yields an improvement for the K-UDT problem. On the other hand, the second approach has the advantage that any improvement in the upper bound for r_K improves the approximation ratio.
3.1.1. The Black-Box Algorithm.
Let A be an α-approximation algorithm for the 2-UDT problem. We show how to get an (α⌈log K⌉)-approximation algorithm for the K-UDT problem. The idea is to encode the given UDT instance as a 2-UDT instance and then invoke the algorithm A on the encoded instance.

Given an N × m table D having a branching factor of K, we construct an N × m⌈log K⌉ binary table D_2 as follows. Each attribute in D is represented by ⌈log K⌉ attributes in D_2. The former attribute is called the original attribute and the latter attributes are called its derived attributes. The values appearing in an original attribute are represented in binary in the corresponding derived attributes. Invoke the algorithm A on the binary table D_2 and let T_2 be the decision tree returned by the algorithm. We obtain a decision tree T for D from T_2 by replacing the attributes in its internal nodes with their original attributes in D and labeling appropriately. Notice that |T| ≤ |T_2|.

Conversely, given any tree T′ for D, we can construct a tree T′_2 for D_2 such that |T′_2| ≤ ⌈log K⌉ · |T′|. The main task is to take the correct branches of the internal nodes of T′ using the binary derived attributes. We achieve this by replacing each internal node with a complete binary tree of depth ⌈log K⌉ built from the derived attributes of the original attribute of the internal node. Clearly, |T′_2| ≤ ⌈log K⌉ · |T′|. This shows that |T*_2| ≤ ⌈log K⌉ · |T*|, where T* and T*_2 are the optimal decision trees for D and D_2, respectively. Since |T_2| ≤ α|T*_2|, the solution T returned by the black-box algorithm satisfies |T| ≤ α⌈log K⌉ · |T*|.
THEOREM 3.1. Given an α-approximation algorithm for the 2-UDT problem, the black-box algorithm has an approximation ratio of α⌈log K⌉ for the UDT problem, where K is the branching factor of the input table.
In particular, we obtain an approximation ratio of ⌈log K⌉(1 + ln N) by using the Heeringa-Adler algorithm as a black box.
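To make the encoding concrete, here is a minimal sketch (ours; the article gives no code) of the construction of D_2, assuming the table is given as a list of rows with attribute values in {0, ..., K − 1}:

```python
import math

def binarize_table(table, K):
    """Encode an N x m table with values in {0, ..., K-1} as an
    N x (m * ceil(log2 K)) binary table: each original attribute
    becomes ceil(log2 K) derived bit attributes."""
    bits = max(1, math.ceil(math.log2(K)))
    encoded = []
    for row in table:
        new_row = []
        for value in row:
            # Write `value` as a fixed-width big-endian bit string.
            new_row.extend((value >> (bits - 1 - i)) & 1 for i in range(bits))
        encoded.append(new_row)
    return encoded

# Example: a 3 x 2 table with branching factor K = 4.
print(binarize_table([[0, 3], [1, 2], [2, 0]], K=4))
# [[0, 0, 1, 1], [0, 1, 1, 0], [1, 0, 0, 0]]
```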
3.1.2. The Greedy Algorithm.
In this section, we present a greedy algorithm for the UDT problem. The algorithm is similar in spirit to that of Heeringa and Adler [2005] for the 2-UDT problem. We build on their analysis and develop further combinatorial arguments to obtain our approximation ratio.

Given as input an N × m table D having branching factor at most K, the greedy algorithm produces a decision tree T as described in the following. Let E and A denote the set of entities and attributes of D, respectively. The intuition is that any decision tree must distinguish every pair of distinct entities. So, a natural idea is to make the attribute that distinguishes the maximum number of pairs the root of T, where an attribute a is said to distinguish a pair ⟨x, y⟩ if x.a ≠ y.a. Choosing such an attribute a can easily be done in time O(mN²). Picking the attribute a as the label for the root node partitions the set E into disjoint sets E_1, E_2, ..., E_K, where E_i = {x | x.a = i}. We recursively apply the same greedy procedure on each of these sets to obtain K decision trees and make these the subtrees of the root node. The greedy procedure is formally specified in Figure 3 and sketched in code below. We get the output tree T by calling T = Greedy(E).

Fig. 3. The greedy algorithm.
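The following is a minimal sketch in Python of the greedy procedure of Figure 3, under the assumption that the table is a dict mapping each entity to a tuple of attribute values; the names `greedy` and `distinguished_pairs` are ours, not the article's.

```python
from itertools import combinations

def distinguished_pairs(entities, table, attr):
    """Number of pairs <x, y> with x.a != y.a for attribute attr."""
    return sum(1 for x, y in combinations(entities, 2)
               if table[x][attr] != table[y][attr])

def greedy(entities, table, attrs):
    """Return a decision tree as (attr, {value: subtree}) or a leaf entity.
    Assumes all rows of the table are distinct."""
    entities = list(entities)
    if len(entities) == 1:
        return entities[0]                      # leaf: entity identified
    # Pick the attribute distinguishing the maximum number of pairs.
    best = max(attrs, key=lambda a: distinguished_pairs(entities, table, a))
    # Partition the entities by their value on the chosen attribute.
    parts = {}
    for x in entities:
        parts.setdefault(table[x][best], []).append(x)
    # Recurse on each part to obtain the subtrees.
    return (best, {v: greedy(part, table, attrs) for v, part in parts.items()})

# Tiny example: three entities over two attributes.
table = {'e1': (0, 1), 'e2': (0, 2), 'e3': (1, 1)}
print(greedy(table.keys(), table, attrs=[0, 1]))
```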
THEOREM 3.2. The greedy algorithm has an approximation ratio of r_K(1 + ln N) for the UDT problem, where K is the branching factor of the input table.

We now analyze the greedy algorithm and prove Theorem 3.2. The analysis is divided into two parts. In the first part, we introduce certain combinatorial objects called tabular partitions and analyze the performance of the greedy algorithm using these objects. In the second part, we relate these objects to Ramsey colorings and complete the proof of Theorem 3.2.
3.1.3. Analysis Involving Tabular Partitions.
Let T and T* be the greedy and the optimal decision trees, respectively. In this section, we prove a relationship between |T| and |T*| involving tabular partitions, defined in the following.

Definition 3.3 (Tabular Partitions). For any positive integer n ≥ 1, a tabular partition P of n is a sequence P_1, P_2, ..., P_n such that P_i is a partition of the set {1, 2, ..., n} − {i}. We require that for any distinct 1 ≤ i, j ≤ n, if A is the set in P_i containing j and B is the set in P_j containing i, then A ∩ B = ∅. Let the length of a partition P_i denote the number of sets in it. We define the compactness of P as comp(P) = max_i (length of P_i), for 1 ≤ i ≤ n. We define C_n to be the smallest number such that there exists a tabular partition of n having compactness C_n.
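Since the disjointness condition is easy to misread, here is a small checker (our illustration) that verifies it for a candidate tabular partition, shown on a tabular partition of n = 3 with compactness 2:

```python
def is_tabular_partition(P):
    """P[i] is a list of sets partitioning {1..n} - {i+1} (elements 1-indexed).
    Check the condition: for distinct i, j, the set in P_i containing j and
    the set in P_j containing i must be disjoint."""
    n = len(P)
    def block(i, j):  # the set in P_i that contains j
        return next(s for s in P[i - 1] if j in s)
    return all(not (block(i, j) & block(j, i))
               for i in range(1, n + 1) for j in range(1, n + 1) if i != j)

# A tabular partition of n = 3 with compactness 2:
P = [[{2}, {3}],   # P_1 partitions {2, 3}
     [{1}, {3}],   # P_2 partitions {1, 3}
     [{1}, {2}]]   # P_3 partitions {1, 2}
print(is_tabular_partition(P))  # True
```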
THEOREM 3.4. |T| ≤ C_K(1 + ln N) · |T*|.
We next focus on proving this result. In Section 3.1.4, we shall show that C_K ≤ r_K and obtain Theorem 3.2 by combining the two results. We start with some notation and observations. Let T be any decision tree for D and u be an internal node of T. We define E_T(u) ⊆ E to be the set of entities in the subtree of T under u.

PROPOSITION 3.5. For any decision tree T of D, we have |T| = Σ_{u∈T} |E_T(u)|.

PROOF. Each entity x contributes a cost equal to its distance from the root. Let us distribute this cost uniformly among the internal nodes on the path from x to the root. Observe that the total cost accumulated at an internal node u is equal to |E_T(u)|. Thus |T| = Σ_{u∈T} |E_T(u)|.
Consider a decision tree T and a pair ⟨x, y⟩ of entities. We say that a node u ∈ T separates the pair ⟨x, y⟩ if the traversal for both x and y passes through u, but x and y take different branches from u. Formally, u is said to separate ⟨x, y⟩ if x, y ∈ E_T(u) and x.a ≠ y.a, where a is the attribute label of u. (We note that the separator of ⟨x, y⟩ is nothing but the least common ancestor of x and y.) For any pair ⟨x, y⟩ of entities, there exists a unique separator in T that separates x and y. We define SEP(u) to be the set of all pairs separated by u. The separators with respect to the greedy tree T will be important in our analysis. For each pair ⟨x, y⟩, we denote by s_{x,y} the separator of ⟨x, y⟩ in T and let S_{x,y} denote E_T(s_{x,y}).

From Proposition 3.5, we see that each node u ∈ T contributes a cost of |E_T(u)| towards the total cost |T| and separates the pairs in SEP(u). We distribute the cost |E_T(u)| equally among the pairs in SEP(u). For each pair ⟨x, y⟩ ∈ SEP(u), we define the cost c_{x,y} = |E_T(u)| / |SEP(u)|. Since each pair has a unique separator, the costs c_{x,y} are well defined.

It is easy to see that |E_T(u)| = Σ_{⟨x,y⟩∈SEP(u)} c_{x,y} and, by Proposition 3.5, we have |T| = Σ_{⟨x,y⟩} c_{x,y}, where the summation is taken over all (unordered) pairs of distinct entities. Notice that each pair ⟨x, y⟩ also has a unique separator in T*. So, we rewrite the
preceding summation by partitioning the set of all pairs according to their separators in T* and obtain the following equation:

|T| = Σ_{z∈T*} Σ_{⟨x,y⟩∈SEP(z)} c_{x,y}.   (1)

For each z ∈ T*, we define α(z) to be the term corresponding to z in the summation given in Eq. (1). Clearly, α(z) = Σ_{⟨x,y⟩∈SEP(z)} c_{x,y}. The following lemma gives an upper bound on α(z).

LEMMA 3.6. For any z ∈ T*, α(z) ≤ C_K(1 + ln |Z|) · |Z|, where Z = E_{T*}(z).

Assuming the correctness of Lemma 3.6, we first prove Theorem 3.4. The lemma is proved later in the section.

PROOF OF THEOREM 3.4. Replacing the inner summation in Eq. (1) by α(z), we have

|T| ≤ C_K(1 + ln N) Σ_{z∈T*} |E_{T*}(z)| = C_K(1 + ln N) · |T*|.

The first step is obtained by invoking Lemma 3.6 and the fact that |Z| ≤ N. Proposition 3.5 gives us the second step.
We now proceed to prove Lemma 3.6. Fix any z ∈ T*. Let us denote Z = E_{T*}(z), and let a_z be the attribute label of z. The node z partitions the set Z into K sets Z_1, Z_2, ..., Z_K, where Z_i = {x ∈ Z | x.a_z = i}. We extend the preceding notation to sets of values: for any A ⊆ {1, 2, ..., K}, define Z_A = ∪_{i∈A} Z_i. We prove the following upper bound on c_{x,y}.

LEMMA 3.7. Let ⟨x, y⟩ ∈ SEP(z). Consider disjoint sets A, B ⊆ {1, 2, ..., K} satisfying y ∈ Z_A and x ∈ Z_B. Then,

c_{x,y} ≤ 1/|S_{x,y} ∩ Z_A| + 1/|S_{x,y} ∩ Z_B|.
PROOF. We are given a pair ⟨x, y⟩ ∈ SEP(z). Let s = s_{x,y} be the separator of ⟨x, y⟩ in T and let the attribute label of s be a_s. The cost c_{x,y} is given by |S_{x,y}| / |SEP(s)|, where S_{x,y} = E_T(s). The greedy algorithm chose the attribute a_s for the node s. Hypothetically, consider choosing the attribute a_z instead. Let us denote the pairs separated by such a choice as X, that is, define X = {⟨x′, y′⟩ | x′, y′ ∈ S_{x,y} and x′.a_z ≠ y′.a_z}. Notice that the greedy algorithm chose the attribute a_s, instead of a_z, because a_s distinguishes at least as many pairs as a_z, meaning |SEP(s)| ≥ |X|. It follows that c_{x,y} ≤ |S_{x,y}| / |X|. Partition S_{x,y} into S_1, S_2, ..., S_K, where S_i = {x′ ∈ S_{x,y} | x′.a_z = i}. Then,

|X| = Σ_{1≤i<j≤K} |S_i| · |S_j|.

Now we claim that

c_{x,y} ≤ 1/(Σ_{i∈A} |S_i|) + 1/(Σ_{j∈B} |S_j|).   (2)

The claim can be proved as follows. Let A′ = A ∪ ({1, 2, ..., K} − A − B), so that A′ ∪ B = {1, 2, ..., K} and A′ ∩ B = ∅. Recall that |S_{x,y}| = |S_1| + |S_2| + ··· + |S_K|. It follows that

1/(Σ_{i∈A} |S_i|) + 1/(Σ_{j∈B} |S_j|) ≥ 1/(Σ_{i∈A′} |S_i|) + 1/(Σ_{j∈B} |S_j|)
= |S_{x,y}| / ((Σ_{i∈A′} |S_i|) · (Σ_{j∈B} |S_j|))
≥ |S_{x,y}| / |X|
≥ c_{x,y}.

This proves the claim in Eq. (2).

Observe that for any 1 ≤ i ≤ K, S_{x,y} ∩ Z_i ⊆ S_i and hence |S_{x,y} ∩ Z_i| ≤ |S_i|. Therefore,

c_{x,y} ≤ 1/(Σ_{i∈A} |S_{x,y} ∩ Z_i|) + 1/(Σ_{j∈B} |S_{x,y} ∩ Z_j|).

Finally, since the sets Z_i and Z_j are disjoint for any distinct 1 ≤ i, j ≤ K, the first term equals 1/|S_{x,y} ∩ Z_A| and the second term equals 1/|S_{x,y} ∩ Z_B|. The lemma is proved.
For each pair ⟨x, y⟩, we shall choose a suitable pair of disjoint sets A and B and obtain an upper bound on c_{x,y} by invoking Lemma 3.7. We make use of tabular partitions for choosing these sets; the motivation for doing so will become clear in the proof of Lemma 3.10. Let P be an optimal tabular partition of K having compactness C_K, given by the sequence P_1, P_2, ..., P_K. Consider any pair ⟨x, y⟩ ∈ SEP(z). Let i = x.a_z and j = y.a_z, so that x ∈ Z_i and y ∈ Z_j. Let Â be the set in the partition P_i that contains j and B̂ be the set in the partition P_j that contains i. Notice that, by the definition of tabular partitions, the sets Â and B̂ are disjoint. We invoke Lemma 3.7 with Â and B̂ as the required disjoint sets. (Observe that for any i and j, all the pairs in Z_i × Z_j will make use of the same disjoint sets while invoking the lemma. Thus the sets chosen depend only on the values x.a_z and y.a_z.) Therefore,

c_{x,y} ≤ 1/|S_{x,y} ∩ Z_Â| + 1/|S_{x,y} ∩ Z_B̂|.

We split the preceding cost into two parts and attribute the first term to x and the second term to y. Define

c^x_{x,y} = 1/|S_{x,y} ∩ Z_Â|  and  c^y_{x,y} = 1/|S_{x,y} ∩ Z_B̂|.

It follows that c_{x,y} ≤ c^x_{x,y} + c^y_{x,y}. For any x ∈ Z, we imagine that x pays a cost c^x_{x,y} to get separated from an entity y ∈ Z. We denote the accumulated cost by Acc_z(x) and define it as

Acc_z(x) = Σ_{y : ⟨x,y⟩∈SEP(z)} c^x_{x,y}.

Now the lemma given next follows easily.
LEMMA 3.8. For any z, α(z) ≤ Σ_{x∈Z} Acc_z(x).

Our next task is to obtain an upper bound on Acc_z(x), so that we get a bound on α(z). The following lemma is useful for this purpose.
LEMMA 3.9. Let x ∈ E be any entity and Q ⊆ E be any set of entities such that x ∉ Q. Then,

Σ_{y∈Q} 1/|S_{x,y} ∩ Q| ≤ (1 + ln |Q|).

PROOF. Let t = |Q|. We shall prove the following claim:

Σ_{y∈Q} 1/|S_{x,y} ∩ Q| ≤ Σ_{i=1}^{t} 1/i.

The claim implies the lemma, since it is well known that Σ_{i=1}^{t} (1/i) ≤ 1 + ln t, for all t. We prove the claim by applying induction on |Q|. For the base case of |Q| = 1, let Q = {y}, where y ≠ x. Clearly, y ∈ S_{x,y} and so |S_{x,y} ∩ Q| = 1, and the claim follows. Assuming that the claim is true for all sets of size at most t − 1, we prove it for any set Q of size t. Let ŷ be an entity in Q such that for all y ∈ Q, s_{x,y} is a descendent of s_{x,ŷ} (a node is considered to be a descendent of itself). If more than one such element exists, pick one arbitrarily. Intuitively, ŷ is one among the first batch of entities in Q to get separated from x. The main observation is that Q ⊆ S_{x,ŷ} and so S_{x,ŷ} ∩ Q = Q. Thus 1/|S_{x,ŷ} ∩ Q| = 1/|Q| = 1/t. We apply the induction hypothesis on the set of remaining entities Q′ = Q − {ŷ} and infer that

Σ_{y∈Q′} 1/|S_{x,y} ∩ Q′| ≤ Σ_{i=1}^{t−1} 1/i.

Clearly, Q′ ⊆ Q and hence |S_{x,y} ∩ Q′| ≤ |S_{x,y} ∩ Q|; so, in the previous summation, if we replace the term |S_{x,y} ∩ Q′| by |S_{x,y} ∩ Q|, the resulting inequality is also true. We conclude that

Σ_{y∈Q} 1/|S_{x,y} ∩ Q| = 1/|S_{x,ŷ} ∩ Q| + Σ_{y∈Q′} 1/|S_{x,y} ∩ Q| ≤ 1/t + Σ_{i=1}^{t−1} 1/i = Σ_{i=1}^{t} 1/i.
LEMMA 3.10. For any x ∈ Z, Acc_z(x) ≤ C_K(1 + ln |Z|).

PROOF. Let r = x.a_z, so that x ∈ Z_r. Let Z′ = Z − Z_r be the rest of the entities in Z. Notice that Acc_z(x) = Σ_{y∈Z′} c^x_{x,y}. We perform the preceding summation by partitioning Z′ according to P_r, the rth member of the optimal tabular partition P = P_1, P_2, ..., P_K. Let P_r = s_1, s_2, ..., s_ℓ, where ℓ ≤ C_K. For 1 ≤ i ≤ ℓ, define Q_i = {y ∈ Z′ | y.a_z ∈ s_i}. Thus Z′ = Q_1 ∪ Q_2 ∪ ... ∪ Q_ℓ and hence,

Acc_z(x) = Σ_{1≤i≤ℓ} Σ_{y∈Q_i} c^x_{x,y}.   (3)

We derive an upper bound for each term in the outer sum using Lemma 3.9. Fix any 1 ≤ i ≤ ℓ. Notice that for any y ∈ Q_i, we have c^x_{x,y} = 1/|S_{x,y} ∩ Q_i|, by definition. Moreover,
x∈ Qi. Thus, by applying Lemma 3.9 on Qi, we get
yQi
cx
x,y(1 +ln |Qi|)(1 +ln |Z|).(4)
We get the lemma by combining Eqs. (3) and (4), and the fact that CK.
PROOF OF LEMMA 3.6. The result is proved by combining Lemma 3.8 and Lemma 3.10:

α(z) ≤ Σ_{x∈Z} Acc_z(x) ≤ Σ_{x∈Z} C_K(1 + ln |Z|) = C_K(1 + ln |Z|) · |Z|.
3.1.4. Tabular Partitions and Ramsey Colorings.
In this section, we introduce the notion of directed Ramsey colorings and show that they are equivalent to tabular partitions. Throughout the discussion, for n > 0, let G_n and G⃗_n denote the complete undirected and the complete directed graph on n vertices, respectively.

Definition 3.11. Let n > 0 be an integer. A directed Ramsey coloring of G⃗_n is a coloring τ of the edges such that for any triplet of distinct vertices x, y, and z, if τ(x, y) = τ(x, z) then τ(y, x) ≠ τ(y, z) (and by symmetry, τ(z, x) ≠ τ(z, y)).

We define R⃗_k to be the smallest number n such that G⃗_n cannot be directed Ramsey colored using k colors (such a number exists, as shown in Theorem 5.1). The inverses of these numbers will be useful: define r⃗_n to be the minimum number of colors required to produce a directed Ramsey coloring of G⃗_n.
We claim that for any n, there exists a tabular partition P of n with compactness k if and only if there exists a directed Ramsey coloring τ of G⃗_n that uses only k colors. A proof sketch follows. Let P = P_1, P_2, ..., P_n. Fix 1 ≤ x ≤ n. Arrange the sets in the partition P_x in an arbitrary manner, say P_x = s_{x,1}, s_{x,2}, ..., s_{x,ℓ}, where ℓ ≤ k. The n − 1 edges outgoing from the vertex x are colored according to the partition P_x; meaning, for 1 ≤ c ≤ ℓ, for y ∈ s_{x,c}, we set τ(x, y) = c. For any y and z, if τ(x, y) = τ(x, z), then y and z belong to the same set in the partition P_x. By the property of tabular partitions, it must be the case that x and z belong to different sets in the partition P_y, implying that τ(y, x) ≠ τ(y, z). We conclude that τ is a directed Ramsey coloring and that τ uses only k colors. The converse is proved using a similar argument. The claim implies the following proposition.
THEOREM 3.12. For any n, C_n = r⃗_n.
Let us call an edge-coloring of G_n a Ramsey coloring if it does not induce any monochromatic triangle. For any n, a Ramsey coloring τ of G_n readily yields a directed Ramsey coloring τ′ of G⃗_n: for each pair of vertices x and y, we set τ′(x, y) = τ′(y, x) = τ(x, y). It can easily be verified that τ′ is indeed a directed Ramsey coloring of G⃗_n. The number of colors used in τ′ is the same as that of τ. Therefore, we have the following proposition.

PROPOSITION 3.13. For any n, r⃗_n ≤ r_n.
PROOF OF THEOREM 3.2. The result follows from Theorems 3.4 and 3.12, and
Proposition 3.13.
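To illustrate the equivalence chain C_n = r⃗_n ≤ r_n, the following sketch (ours, not the article's) converts a symmetric triangle-free edge coloring of G_n into a tabular partition by grouping, for each vertex x, the remaining vertices by the color of the edge out of x:

```python
def coloring_to_tabular_partition(n, color):
    """color(x, y) is a symmetric edge coloring of the complete graph on
    vertices 1..n with no monochromatic triangle.  Grouping, for each vertex x,
    the other vertices by the color of the edge (x, y) yields a tabular
    partition whose compactness is at most the number of colors."""
    partition = []
    for x in range(1, n + 1):
        classes = {}
        for y in range(1, n + 1):
            if y != x:
                classes.setdefault(color(x, y), set()).add(y)
        partition.append(list(classes.values()))
    return partition

# A 2-coloring of G_5 with no monochromatic triangle: color an edge by the
# cyclic distance (1 or 2) of its endpoints; both color classes are 5-cycles.
color = lambda x, y: min((x - y) % 5, (y - x) % 5)
P = coloring_to_tabular_partition(5, color)
print(P)                              # a tabular partition of 5
print(max(len(Px) for Px in P))       # compactness 2, so C_5 <= 2
```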
3.2. The Weighted Case: WDT Problem
In this section, we show how to deal with the weighted case, namely the WDT problem. Let D be the input N × m table over a set of entities E and a set of attributes A, having a branching factor of K. Let w(·) be the input weight function that assigns an integer weight w(x) ≥ 1 to each entity x ∈ E. The problem is to construct an optimal decision tree T having the minimum cost with respect to w(·). We present an algorithm which generalizes the greedy algorithm for the UDT problem.

Weighted Greedy Algorithm. Refer to the greedy algorithm given in Figure 3. The main step in that algorithm is choosing an attribute that distinguishes the maximum number of pairs. We modify this step so that the weights are taken into account; namely, we choose the attribute

a = argmax_{a′∈A} Σ_{⟨x,y⟩∈S(a′)} w(x) · w(y),

where S(a′) = {⟨x, y⟩ | x, y ∈ E and x.a′ ≠ y.a′} is the set of pairs distinguished by the attribute a′. We call the preceding procedure the weighted greedy algorithm.
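A minimal sketch (ours, reusing the conventions of the earlier greedy sketch) of the modified selection step, where the score of an attribute is the total weight w(x) · w(y) over the pairs it distinguishes:

```python
from itertools import combinations

def weighted_score(entities, table, weights, attr):
    """Sum of w(x) * w(y) over pairs <x, y> distinguished by attr."""
    return sum(weights[x] * weights[y]
               for x, y in combinations(entities, 2)
               if table[x][attr] != table[y][attr])

# The weighted greedy root choice on a tiny instance (values are ours):
table = {'e1': (0, 1), 'e2': (0, 2), 'e3': (1, 1)}
weights = {'e1': 1, 'e2': 1, 'e3': 5}
best = max([0, 1], key=lambda a: weighted_score(table.keys(), table, weights, a))
print(best)  # attribute 0: it separates the heavy entity e3 from both others
```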
Let W = Σ_{x∈E} w(x) denote the total weight of the entities. Let T and T* denote the weighted greedy and the optimal trees under the weight function w(·).

THEOREM 3.14. w(T) ≤ C_K(1 + ln W) · w(T*), where W is the sum of the weights of all the entities.
We prove this theorem by adapting the proof of Theorem 3.2. Due to space constraints, we provide an outline of the proof.

Intuitively, we imagine that each entity x is replicated w(x) times and modify the proof of Theorem 3.2 accordingly. We reuse notation from the previous proof. Let SEP(u) be the set of all pairs separated by u. For each pair ⟨x, y⟩, we denote by s_{x,y} the separator of ⟨x, y⟩ in T and let S_{x,y} denote E_T(s_{x,y}). Additional notation is introduced next.

For a set of entities X ⊆ E, let w(X) denote the total weight of the entities in X, that is, w(X) = Σ_{x∈X} w(x). We also define weights on any set of pairs of entities: for a set of pairs X ⊆ E × E, define w(X) = Σ_{⟨x,y⟩∈X} w(x) · w(y).
Proposition 3.5 generalizes to the weighted case as follows.

PROPOSITION 3.15. For a decision tree T of D, w(T) = Σ_{u∈T} w(E_T(u)).

For each pair of entities ⟨x, y⟩, define a cost c_{x,y} as follows:

c_{x,y} = w(x) · w(y) · w(S_{x,y}) / w(SEP(s_{x,y})).

By Proposition 3.15, we get the following equation, which is similar to Eq. (1):

w(T) = Σ_{z∈T*} Σ_{⟨x,y⟩∈SEP(z)} c_{x,y}.   (5)

For each z ∈ T*, the inner summation in Eq. (5) is defined as the cost α(z) = Σ_{⟨x,y⟩∈SEP(z)} c_{x,y}. Our goal is to derive an upper bound on α(z).

Fix any z ∈ T*. Let us denote Z = E_{T*}(z) and let a_z be the attribute label of z. The node z partitions the set Z into K sets Z_1, Z_2, ..., Z_K, where Z_i = {x ∈ Z | x.a_z = i}. We extend the preceding notation to sets of values: for any A ⊆ {1, 2, ..., K}, define Z_A = ∪_{i∈A} Z_i. The following lemma generalizes Lemma 3.7 to the weighted case.
LEMMA 3.16. Let ⟨x, y⟩ ∈ SEP(z). Consider disjoint sets A, B ⊆ {1, 2, ..., K} satisfying y ∈ Z_A and x ∈ Z_B. Then,

c_{x,y} ≤ w(x) · w(y) · (1/w(S_{x,y} ∩ Z_A) + 1/w(S_{x,y} ∩ Z_B)).

Consider any ⟨x, y⟩ ∈ SEP(z). Let P be an optimal tabular partition of K having compactness C_K, given by the sequence P_1, P_2, ..., P_K. Let i = x.a_z and j = y.a_z, so that x ∈ Z_i and y ∈ Z_j. Let Â be the set in the partition P_i that contains j and B̂ be the set in the partition P_j that contains i. Define

c^x_{x,y} = w(x) · w(y) / w(S_{x,y} ∩ Z_Â)  and  c^y_{x,y} = w(x) · w(y) / w(S_{x,y} ∩ Z_B̂).

By Lemma 3.16, we have that c_{x,y} ≤ c^x_{x,y} + c^y_{x,y}. For each entity x ∈ E_{T*}(z), define Acc_z(x) as

Acc_z(x) = Σ_{y : ⟨x,y⟩∈SEP(z)} c^x_{x,y}.
We wish to derive an upper bound on Acc_z(x). The following lemma, which generalizes Lemma 3.9, is useful for this purpose.

LEMMA 3.17. Let x ∈ E be any entity and Q ⊆ E be any set of entities such that x ∉ Q. Then,

Σ_{y∈Q} w(y) / w(S_{x,y} ∩ Q) ≤ (1 + ln w(Q)).

The following is obtained by generalizing Lemma 3.10.

LEMMA 3.18. For any x ∈ Z, Acc_z(x) ≤ w(x) · C_K(1 + ln w(Z)).

PROOF OF THEOREM 3.14. Consider any z ∈ T* and let Z = E_{T*}(z). Then, α(z) ≤ Σ_{x∈Z} Acc_z(x). Applying Lemma 3.18, we get that

α(z) ≤ C_K(1 + ln w(Z)) · w(Z).   (6)

Replacing the inner summation in Eq. (5) by α(z), we have

w(T) ≤ C_K(1 + ln W) Σ_{z∈T*} w(E_{T*}(z)) = C_K(1 + ln W) · w(T*).   (7)

The first step is obtained by invoking Eq. (6) and the fact that w(Z) ≤ w(E) = W. Proposition 3.15 gives us the second step.
Theorem 3.14 shows that the approximation ratio of the weighted greedy algorithm is logarithmic in N when the total weight W is polynomially bounded in N. Unfortunately, when the weights are arbitrarily large, the ratio could be worse. We overcome this issue by using the following rounding technique.

Rounded Greedy Algorithm. Let D be an input table having a branching factor of K and let w_in be the input integer weight function. Let w_in^max = max_x w_in(x) denote the maximum weight. Define a new weight function w(·) as follows: for any entity x ∈ E, define

w(x) = ⌈w_in(x) · N² / w_in^max⌉.
Run the weighted greedy algorithm with w(·) as the input weight function and obtain
a tree T. Return the tree T.
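A minimal sketch (ours) of the rounding step: it maps arbitrary integer weights into the range {1, ..., N²}, so the total weight W is at most N³ and Theorem 3.14 yields a logarithmic ratio.

```python
import math

def round_weights(w_in):
    """Scale integer weights into {1, ..., N^2}: w(x) = ceil(w_in(x) * N^2 / w_max).
    The total weight W is then at most N^3, so 1 + ln W = O(log N)."""
    n = len(w_in)
    w_max = max(w_in.values())
    return {x: math.ceil(w * n * n / w_max) for x, w in w_in.items()}

# Example with wildly different weights (hypothetical values):
print(round_weights({'e1': 1, 'e2': 7, 'e3': 10**12}))
# {'e1': 1, 'e2': 1, 'e3': 9}
```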
Let T* and T*_in be the optimal decision trees under the weight functions w(·) and w_in(·), respectively. From Theorem 3.14, we have a good bound for w(T) with respect to w(T*). But, of course, we need to compare w_in(T) and w_in(T*_in). We do this next.

THEOREM 3.19. w_in(T) ≤ 2C_K(1 + 3 ln N) · w_in(T*_in).

PROOF. Let x ∈ E be any entity and consider the path from the root to x in the tree T*_in. Notice that each internal node along this path separates at least one entity from x. (Otherwise, T*_in contains a "dummy" node that does not separate any pairs and hence can be deleted to obtain a tree of lesser cost.) So, the length of the path is at most N and hence the following claim is true.

Claim 1: |T*_in| ≤ N².
We next compare w_in(T*_in) and w(T*_in). We have

w(T*_in) = Σ_{x∈E} w(x) · ℓ_{T*_in}(x)
≤ Σ_{x∈E} (w_in(x) · N²/w_in^max + 1) · ℓ_{T*_in}(x)
= w_in(T*_in) · N²/w_in^max + |T*_in|
≤ w_in(T*_in) · N²/w_in^max + N²
≤ 2 · w_in(T*_in) · N²/w_in^max.   (8)

The second step is from the definition of w(·) and the fourth step is obtained from Claim 1. The last inequality is obtained by observing that w_in(T*_in) ≥ w_in^max.

Notice that for any entity x ∈ E, 1 ≤ w(x) ≤ N², and so the total weight W under the function w(·) satisfies W ≤ N³. So, Theorem 3.14 implies the following claim.

Claim 2: w(T) ≤ C_K(1 + 3 ln N) · w(T*).

We can now compare w_in(T) and w_in(T*_in). Note that T* is the optimal tree under the function w(·) and hence w(T*) ≤ w(T*_in). Moreover, by the definition of w(·), we have w_in(x) ≤ w(x) · w_in^max / N² for every x, and so w_in(T) ≤ w(T) · w_in^max / N². We obtain the theorem by combining these observations with Eq. (8) and Claim 2.
By combining Theorem 3.19, Theorem 3.12, and Proposition 3.13, we get the following result.

THEOREM 3.20. The approximation ratio of the rounded greedy algorithm is at most 2r_K(1 + 3 ln N) = O(r_K log N).
4. HARDNESS OF APPROXIMATION
In this section, we study the hardness of approximating the WDT and the UDT problems. We show that it is NP-hard to approximate the 2-WDT problem within a ratio of Ω(log N). Therefore, our approximation algorithm for the 2-WDT problem is optimal up to constant factors. We also improve the previous hardness results for the UDT problem.
4.1. Hardness of Approximating the 2-WDT Problem
THEOREM 4.1. It is NP-hard to approximate the 2-WDT problem within a factor of Ω(log N), where N is the number of entities in the input.

PROOF. We prove the result via a reduction from the set cover problem. It is known that approximating set cover within a factor of Ω(log n) is NP-hard [Raz and Safra 1997].

Let (U, S) be the input set cover instance, where U = {x_1, x_2, ..., x_n} is a universe of items and S is a collection of sets {S_1, S_2, ..., S_m} such that S_i ⊆ U, for each i. Without loss of generality, we can assume that for any pair of distinct items x_i and x_j, there exists a set S_k ∈ S containing exactly one of these two items. (If not, one of these items can be removed from the system.) Construct an instance of the 2-WDT problem having N = n + 1 entities and m attributes. The set of entities is E = {x_1, x_2, ..., x_n} ∪ {x*}, where each entity x_i corresponds to the item x_i and x* is a special entity. The set of attributes is A = {S_1, S_2, ..., S_m}, so that each attribute S_i corresponds to the set S_i. The N × m table D is given as follows. For each entity x_i and attribute S_j, set x_i.S_j = 1 if x_i ∈ S_j, and otherwise set x_i.S_j = 0. For the special entity x*, set x*.S_j = 0 for all attributes S_j. For each entity x_i, set the weight w(x_i) = 1. As for the special entity x*, set its weight as w(x*) = N³. This completes the construction.

Let T be a decision tree for D. Let C be the set of attributes found along the path from the root to the entity x*. Recall that the length of this path is denoted ℓ_T(x*). Observe that C is a cover for (U, S). We have |C| = ℓ_T(x*) ≤ w(T)/N³. On the other hand, given a cover C, we can construct a decision tree T satisfying the following two properties: (i) the set of attributes along the path from the root to x* is exactly the set C, so that ℓ_T(x*) = |C|; (ii) for every other entity x_i, ℓ_T(x_i) ≤ N. (The second property is based on the fact that for any table containing N entities, it suffices to test at most N attributes in order to distinguish any entity from the rest.) Thus w(T) ≤ |C| · N³ + N². In particular, w(T*) ≤ |C*| · N³ + N², where T* and C* are the optimal decision tree and optimal cover, respectively.

Based on the previous observations, we can prove the following claim: if there exists an α(N)-approximation algorithm for the 2-WDT problem, then for any ε > 0, we can design a (1 + ε)α(n)-approximation algorithm for the set cover problem. Therefore, the hardness of the set cover problem implies the claimed hardness result for the 2-WDT problem.
4.2. Hardness for the UDT and the 2-UDT Problems
In this section, we present improved hardness-of-approximation results for the UDT and the 2-UDT problems. For the 2-UDT problem, Heeringa and Adler [2005] showed a hardness of approximation of (1 + ε), for some ε > 0. We show that for any ε > 0, it is NP-hard to approximate the UDT and the 2-UDT problems within factors of (4 − ε) and (2 − ε), respectively. Our reductions are from the Minimum Sum Set Cover (MSSC) problem.

The input to the MSSC problem is a set system: a collection of sets S = {S_1, S_2, ..., S_m} over a universe U = {x_1, x_2, ..., x_N} of items, where each S_i ⊆ U. A solution is an ordering π on the sets in S, with an associated cost defined as follows. Let π be S′_1, S′_2, ..., S′_m. Each item in S′_1 pays a cost of 1, each item in S′_2 − S′_1 pays a cost of 2, and so on. The cost of π is the sum of the costs of all items. Formally, define the costs c^π_x = min{i | x ∈ S′_i}, for x ∈ U, and cost(π) = Σ_{x∈U} c^π_x. The MSSC problem is to find an ordering with the minimum cost. For a constant d, the d-MSSC problem is the special case of MSSC in which every set in the set system has at most d elements.
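For concreteness, a tiny sketch (ours) of the MSSC cost function:

```python
def mssc_cost(ordering, universe):
    """cost(pi) = sum over items of the 1-based index of the first set
    in the ordering that contains the item."""
    return sum(next(i for i, S in enumerate(ordering, start=1) if x in S)
               for x in universe)

# Universe {1, 2, 3}: item 2 is covered at step 1, items 1 and 3 at step 2.
print(mssc_cost([{2}, {1, 3}], {1, 2, 3}))  # 1 + 2 + 2 = 5
```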
Feige et al. [2004] proved the following hardness results for these problems.
THEOREM 4.2 [FEIGE ET AL. 2004].
(1) For any ε > 0, it is NP-hard to approximate the MSSC problem within a ratio of (4 − ε).
(2) For any ε > 0, there exists a constant d such that it is NP-hard to approximate d-MSSC within a ratio of (2 − ε).

Our hardness results for UDT and 2-UDT are obtained via approximation-preserving reductions from the MSSC and the d-MSSC problems, respectively. The reduction from MSSC to UDT is easier and we present it first (in Section 4.2.1). The reduction from d-MSSC to 2-UDT is similar, but involves more technical details; it is presented in Section 4.2.2.
4.2.1. Hardness for the UDT Problem.
Here, we prove the hardness result for the UDT problem by exhibiting a reduction from MSSC.

THEOREM 4.3. For any ε > 0, it is NP-hard to approximate the UDT problem within a ratio of (4 − ε).

PROOF. Given an MSSC instance S = {S_1, S_2, ..., S_m} over a universe U = {x_1, x_2, ..., x_N}, construct an N × m table D as follows. Each item x_i corresponds to an entity and each set S_i corresponds to an attribute a_i. For 1 ≤ j ≤ m and 1 ≤ i ≤ N, set the entry x_i.a_j as follows: if x_i ∈ S_j then set x_i.a_j = i, else set x_i.a_j = 0. Observe that any decision tree for D is left-deep: for any internal node u, except for the branch labeled 0, every other branch out of u leads to a leaf node.

We claim that given an ordering π of S, we can construct a decision tree T such that |T| = cost(π), and vice versa. Let π = S′_1, S′_2, ..., S′_m and let a′_1, a′_2, ..., a′_m be the corresponding sequence of attributes. Construct a left-deep tree T in which the root node is labeled a′_1, its 0-child is labeled a′_2, and so on; in general, label the internal node at the ith level with a′_i. It can be seen that T is indeed a decision tree for D and that |T| = cost(π). The converse is shown via a similar construction. Given a decision tree T, traverse the tree starting at the root node, always taking the branches labeled 0. Write down the sequence of sets corresponding to the internal nodes seen in this traversal and let π denote the sequence. Notice that the sets appearing in this sequence cover all elements of U and that cost(π) = |T|. (Some sets in S may not appear in this sequence; to be formally compliant with the definition of solutions, we append the missing sets in an arbitrary order.) The claim, in conjunction with Theorem 4.2 (part 1), implies the theorem.
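A sketch (ours) of the table built in this proof: entry x_i.a_j is the item's index i if x_i ∈ S_j and 0 otherwise, which makes every decision tree left-deep:

```python
def mssc_to_udt(sets, N):
    """Item i gets value i on attribute j if i is in S_j, else 0."""
    return {i: [i if i in S else 0 for S in sets] for i in range(1, N + 1)}

# Universe {1, 2, 3} with sets {1, 2} and {2, 3}:
print(mssc_to_udt([{1, 2}, {2, 3}], N=3))
# {1: [1, 0], 2: [2, 2], 3: [0, 3]}
```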
4.2.2. Hardness for the 2-UDT Problem.
In this section, we shall prove the hardness result for the 2-UDT problem.

THEOREM 4.4. For any ε > 0, it is NP-hard to approximate the 2-UDT problem within a ratio of (2 − ε).

The proof is similar to that of Theorem 4.3. The reduction is from d-MSSC instances, for a suitable constant d. Observe that the entries in the table can now only be 0 or 1, as opposed to the index of the elements in the previous construction. The required reduction is obtained by using ⌈log d⌉ auxiliary columns to identify the elements of each set. The rest of the section is devoted to a formal proof.

We first present a general construction, using which we shall derive the theorem. Fix any integer d > 0; we shall show a reduction from d-MSSC to 2-UDT. Given a d-MSSC instance S = {S_1, S_2, ..., S_m} over a universe of items U = {x_1, x_2, ..., x_N}, construct a binary table D having N entities and m(1 + ⌈log d⌉) attributes, as follows. Each item x ∈ U corresponds to an entity x in D. Each set S_i corresponds to 1 + ⌈log d⌉ attributes: a main attribute named a_i and ⌈log d⌉ auxiliary attributes named a_{i,1}, a_{i,2}, ..., a_{i,⌈log d⌉}. For filling the table D, consider each set S_j. Order the items in S_j arbitrarily and let the ordering be S_j = x′_1, x′_2, ..., x′_ℓ, where ℓ ≤ d. For each entity x′_i ∈ S_j, set x′_i.a_j = 1; write i as a ⌈log d⌉-bit string and fill the entries x′_i.a_{j,1}, x′_i.a_{j,2}, ..., x′_i.a_{j,⌈log d⌉} with these bits. For any entity x ∉ S_j, set the value on all these 1 + ⌈log d⌉ attributes to be 0. This completes the construction of D. We make two claims connecting the solutions of the MSSC instance and the decision trees of D.
Any decision tree T for D is a binary tree in which each internal node has two branches labeled 0 and 1; we call these the 0-branch and the 1-branch, and the corresponding children the 0-child and the 1-child, respectively. Let π = S′_1, S′_2, ..., S′_m be an ordering of S. We say that a set S′_i covers an entity x if x ∈ S′_i and x ∉ S′_j for all j < i.

LEMMA 4.5. Given an ordering π of S, we can construct a decision tree T of D such that |T| ≤ cost(π) + N⌈log d⌉. In particular, if π* and T* are the respective optimal solutions, then |T*| ≤ cost(π*) + N⌈log d⌉.
PROOF. Let π = S′_1, S′_2, ..., S′_m and let a′_1, a′_2, ..., a′_m be the sequence of main attributes corresponding to these sets. We construct a tree T that is "almost left-deep." We start the construction by making a′_1 the label of the root node. Notice that all the entities in S′_1 will follow the 1-branch and all the other entities will follow the 0-branch. For the former entities, the auxiliary attributes corresponding to S′_1 contain the index of the entities within S′_1. So, using these attributes we can identify each entity within S′_1, paying a cost of at most ⌈log d⌉. (Formally, construct a complete binary tree of depth ⌈log d⌉ with the auxiliary attributes as labels, assign the entities within S′_1 appropriately, and attach this tree to the 1-branch of the root node.) We make a′_2 the label of the 0-child of the root node. The discussion for this node is similar to that of the root node: all entities that are covered by S′_2 will follow the 1-branch and are identified using the auxiliary attributes corresponding to S′_2; the remaining entities follow the 0-branch. In general, if we follow the 0-branches from the root node and reach a level i, we will have a node labeled by a′_i. The entities covered by S′_i will follow the 1-branch of this node and they will be identified using a complete binary tree of depth at most ⌈log d⌉. For any entity x, if S′_i is the set covering x, then starting at the root node, x will follow the 0-branch until it reaches the node labeled a′_i, where it will follow the 1-branch and then get identified by the complete binary tree on the 1-branch. The entity incurs a cost of i for the former process and a cost of at most ⌈log d⌉ for the latter process. Thus |T| ≤ cost(π) + N⌈log d⌉.
LEMMA 4.6. Given a decision tree T for D, we can construct an ordering π for S such that cost(π) ≤ |T| + N.

PROOF. Starting from the root node, traverse the tree always taking the 0-branch until an entity x̂ (a leaf node) is reached. Let b′_1, b′_2, ..., b′_r be the sequence of attributes seen in this traversal. Construct an ordering π by writing down the sets corresponding to these attributes (each attribute b′_i can either be a main attribute or an auxiliary attribute; in either case, we write down the corresponding set). The sequence may not include all sets. However, except for x̂, all the other entities are covered by the sets in the sequence π. We deal with x̂ by appending to π any set that includes x̂ (and, to be formally compliant with the definition of MSSC solutions, the sets not listed in π are appended in an arbitrary order). Notice that cost(π) ≤ |T| + N (the extra N is included for handling the cost of covering x̂).
For convenience, we switch over to average costs instead of absolute costs. If πis
an ordering of an MSSC instance having Nitems, we define costa(π)=cost(π)/N.
Similarly, if Tis a decision tree of a table Dhaving Nentities, we define |T|a=|T|/N.
An important property of d-MSSC is that every set can cover at most d(a constant)
number of items and so, in any solution, the average cost is at least N/2d. The following
proposition formalizes the claim.
PROPOSITION 4.7. For any d and any d-MSSC instance having N items, the optimal ordering π* satisfies cost_a(π*) ≥ N/(2d).
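The claim admits a short justification, sketched here for completeness since the text does not spell it out. An item covered by the set in position i pays cost i, and each set covers at most d items, so at most d·i items can pay cost at most i. Hence the j-th cheapest item pays at least ⌈j/d⌉, and for any ordering π,

cost(π) ≥ Σ_{j=1}^{N} ⌈j/d⌉ ≥ Σ_{j=1}^{N} j/d = N(N + 1)/(2d),

so cost_a(π) ≥ (N + 1)/(2d) ≥ N/(2d).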
LEMMA 4.8. Suppose that for some constant α, there exists an α-approximation for the 2-UDT problem. Let δ > 0 and d be any constants. Then, there exist an algorithm for the d-MSSC problem and a constant N_0 such that the algorithm achieves an approximation factor of (1 + δ)α on all instances whose universe contains at least N_0 items.
PROOF. Given a d-MSSC instance M over a universe containing N elements, construct a binary table D using the (reduction) procedure described before. Use the α-approximation algorithm to obtain a solution T for D. Apply Lemma 4.6 to transform T into a solution π for M. Let T* and π* denote the optimal solutions for D and M, respectively. By Lemmas 4.5 and 4.6, we can relate cost_a(π) and cost_a(π*) as follows:

cost_a(π) ≤ |T|_a + 1 ≤ α|T*|_a + 1 ≤ α·cost_a(π*) + α⌈log d⌉ + 1.

We shall choose a suitable N_0 such that (α⌈log d⌉ + 1) ≤ αδ·cost_a(π*), or equivalently, cost_a(π*) ≥ (α⌈log d⌉ + 1)/(αδ). This would imply that the algorithm achieves an approximation factor of α(1 + δ) on all instances having at least N_0 items. The task is accomplished by applying Proposition 4.7, which says that cost_a(π*) ≥ N/(2d). So, we fix

N_0 = 2d(α⌈log d⌉ + 1)/(αδ).
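For a concrete instantiation of the constants (illustrative numbers, not from the paper): with α = 1.5, d = 4 (so ⌈log d⌉ = 2), and δ = 0.1, we get N_0 = 2·4·(1.5·2 + 1)/(1.5·0.1) ≈ 214. Indeed, any instance with N ≥ 214 items has cost_a(π*) ≥ N/(2d) ≥ 214/8 > (α⌈log d⌉ + 1)/(αδ) ≈ 26.7, as required.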
We now prove Theorem 4.4. Suppose there exists an α-approximation algorithm for the 2-UDT problem for some constant α < 2. Choose δ > 0 such that (1 + δ)α < 2 and let β = (1 + δ)α. Invoke Theorem 4.2 to obtain a constant d such that it is NP-hard to approximate d-MSSC within a factor of β. Now, by Lemma 4.8, there exists an algorithm for the d-MSSC problem that has an approximation ratio of β on all instances over a universe of size at least N_0. For instances having a smaller universe, we can perform an exhaustive search in polynomial time, since N_0 is a constant. This means that NP = P. We have proved the theorem.
5. RAMSEY NUMBERS AND ERDŐS'S CONJECTURE
In this section, we take a closer look at our approximation ratio and discuss its connection to a Ramsey-theoretic conjecture by Erdős. We presented an algorithm for the WDT problem having an approximation ratio of O(r_K · log N). Let us now focus on bounds for the inverse Ramsey numbers r_n, for n ≥ 1.

Recall that for any k, R_k ≥ (3^k + 1)/2 [Nesetril and Rosenfeld 2001; Schur 1916]. From this we get that for any n, r_n ≤ 2 + 0.64 log n. Notice that any improvement in the upper bound on r_n would automatically improve our approximation ratio. Better upper bounds are known for r_n (see Nesetril and Rosenfeld [2001], Exoo [1994], and Chung and Grinstead [1983]), but they improve the preceding bound only by constant factors. We observe that the upper bound on r_n cannot be improved significantly, because of the following result: R_k ≤ 1 + k!e [West 2001], which implies r_n = Ω(log n / log log n).
Observe that our approximation ratio actually involves r⃗_n, rather than r_n. Therefore, one can try to derive a better upper bound on r⃗_n. Unfortunately, we show that r⃗_n = Ω(log n / log log n). The claim is implied by the following theorem, which can be proved by an argument similar to the one used to obtain the same bound for R_k.
THEOREM 5.1. For any k, R⃗_k ≤ 1 + k!e.
PROOF. Let n = R⃗_k − 1. By the definition of R⃗_k, the graph G⃗_n can be directed Ramsey colored using only k colors. Let τ be such a coloring. Pick any vertex of G⃗_n, say the vertex u. We first claim that τ can be transformed to be symmetric with respect to u, meaning that we can modify τ in such a way that, for any other vertex x, the edge (x, u) gets the same color as (u, x). This is accomplished by considering each vertex x and (locally) relabeling the colors assigned to its outgoing edges such that (x, u) gets the same color as (u, x). This does not increase the number of colors. From now on, it is assumed that we have modified τ in the preceding manner.

There are n − 1 outgoing edges from u, which are colored using k colors. So, there must exist a color class having at least (n − 1)/k edges; that is, there must exist a color c such that the set of vertices V = {x | τ(u, x) = c} satisfies |V| ≥ (n − 1)/k. The main observation is that for any x, y ∈ V, the edge (x, y) cannot have c as its color. The observation can be seen as follows. We have τ(u, x) = τ(u, y) = c, and so τ(x, y) should be different from τ(x, u), by the definition of directed Ramsey colorings. On the other hand, τ(x, u) = c, because of the transformation that we performed. Therefore, τ(x, y) ≠ c. To summarize, we have argued that the color c is not assigned to any edge in the subgraph induced by V. Therefore, only k − 1 colors are used for the edges of this subgraph. It follows that |V| ≤ R⃗_{k−1} − 1. Putting together the lower and upper bounds on |V|, we get that (n − 1)/k ≤ |V| ≤ R⃗_{k−1} − 1. Hence, n ≤ k(R⃗_{k−1} − 1) + 1. Since n = R⃗_k − 1, we have established the following recurrence relation on the directed Ramsey numbers:

R⃗_k ≤ 2 + k(R⃗_{k−1} − 1),

with the boundary condition R⃗_1 = 3. By solving the recurrence relation, we get that for any k ≥ 1,

R⃗_k ≤ 2 + k! Σ_{i=0}^{k−1} 1/i!.

The RHS is at most 1 + k!e. The theorem is proved.
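For completeness, the closed form can be verified by induction (a routine check we include since the solving step is compressed): the base case is R⃗_1 = 3 = 2 + 1!·(1/0!), and if R⃗_{k−1} ≤ 2 + (k−1)! Σ_{i=0}^{k−2} 1/i!, then the recurrence gives

R⃗_k ≤ 2 + k(R⃗_{k−1} − 1) ≤ 2 + k + k! Σ_{i=0}^{k−2} 1/i! = 2 + k! Σ_{i=0}^{k−1} 1/i!,

using k = k!/(k − 1)!. Finally, k!e − k! Σ_{i=0}^{k−1} 1/i! = k! Σ_{i≥k} 1/i! = 1 + 1/(k+1) + ··· > 1, which yields the bound 1 + k!e stated in the theorem.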
Notice that there is a gap between the upper and lower bounds for R_k. Erdős conjectured that for some constant α, for all k, R_k ≤ α^k. This is equivalent to r_n = Θ(log n).
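To spell out the equivalence: if R_k ≤ α^k for all k, then K_n can be Ramsey colored with r_n colors only when n < R_{r_n} ≤ α^{r_n}, so r_n ≥ log_α n = Ω(log n); combined with the upper bound r_n ≤ 2 + 0.64 log n noted above, this gives r_n = Θ(log n). Conversely, if r_n ≥ c·log n for all n, then taking n = R_k − 1 (which can be colored with k colors) yields c·log(R_k − 1) ≤ k, that is, R_k ≤ 1 + 2^{k/c}.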
We discuss the implication of our results for possibly proving the conjecture. The idea is to show that, in terms of worst-case performance, the rounded greedy algorithm performs poorly! We observe that a lower bound of Ω(log K · log N) on the approximation ratio of the rounded greedy algorithm would imply the conjecture: since we proved an upper bound of O(r_K · log N) on this ratio, such a lower bound would force r_K = Ω(log K), which is equivalent to the conjecture. More explicitly, we note that the following hypothesis implies the conjecture.

Hypothesis. There exists a constant β > 0 such that for any K, there exists a K-WDT table D and a weight function w(·) on which the tree T produced by the rounded greedy algorithm satisfies w(T) ≥ (β log K log N)·w(T*), where T* is the optimal solution.
A result by Garey and Graham [1974] could be a starting point for constructing such instances. They analyzed the worst-case performance of the greedy procedure for the 2-UDT problem and, by constructing counterexamples, obtained a lower bound of Ω(log N / log log N) on the approximation ratio of the procedure.
One can also attempt to prove the conjecture under the assumption NP ≠ P, by showing that it is NP-hard to approximate K-WDT within a factor of Ω(log K · log N). More precisely, exhibit a constant c > 0 and show that for all K ≥ 2, it is NP-hard to approximate the K-WDT problem within a factor of c·log K·log N. However, as mentioned in the Introduction, extending the O(log N)-approximation algorithm for the UDT problem by Chakaravarthy et al. [2009] to the weighted case would rule out this approach.
6. CONCLUSION AND OPEN PROBLEMS
We studied the problem of constructing good decision trees for entity identification, in
the general setup where attributes are multivalued and the entities are associated with
probabilities. We designed an algorithm and proved an approximation ratio involving
Ramsey numbers, and also presented hardness results.
There are several interesting open questions. An obvious avenue is to bridge the gap between the approximation ratio and the hardness factor for 2-UDT, K-UDT, and WDT.
The directed Ramsey numbers r⃗_n introduced in this article pose challenging open problems: Is r⃗_n = r_n for all n? Is r⃗_n = O(log n / log log n)? Proving the second statement in the affirmative would improve our approximation ratios. If both statements are shown to be true, then the conjecture by Erdős would be disproved! Finally, it would be interesting if the conjecture could be proved using the approach suggested.
ACKNOWLEDGMENTS
We thank A. Guillory for useful discussions and the anonymous referees for helpful comments.
REFERENCES
ADLER, M. AND HEERINGA, B. 2008. Approximating optimal binary decision trees. In Proceedings of the 11th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems. Lecture Notes in Computer Science, vol. 5171. Springer, Berlin, 1–9.
CHAKARAVARTHY, V., PANDIT, V., ROY, S., AWASTHI, P., AND MOHANIA, M. 2007. Decision trees for entity identification: Approximation algorithms and hardness results. In Proceedings of the 26th ACM Symposium on Principles of Database Systems. ACM, New York, 53–62.
CHAKARAVARTHY, V., PANDIT, V., ROY, S., AND SABHARWAL, Y. 2009. Approximating decision trees with multiway branches. In Proceedings of the 36th International Colloquium on Automata, Languages and Programming. Lecture Notes in Computer Science, vol. 5555. Springer, Berlin.
CHUNG, F. AND GRINSTEAD, C. 1983. A survey of bounds for classical Ramsey numbers. J. Graph Theory 7, 25–37.
DASGUPTA, S. 2005. Analysis of a greedy active learning strategy. In Proceedings of the 17th Annual Conference on Neural Information Processing Systems. MIT Press, Cambridge, MA, 337–344.
EXOO, G. 1994. A lower bound for Schur numbers and multicolor Ramsey numbers. Electron. J. Combin. 1, R8.
FEIGE, U., LOVÁSZ, L., AND TETALI, P. 2004. Approximating min sum set cover. Algorithmica 40, 4, 219–234.
GAREY, M. 1970. Optimal binary decision trees for diagnostic identification problems. Ph.D. thesis, University of Wisconsin, Madison.
GAREY, M. 1972. Optimal binary identification procedures. SIAM J. Appl. Math. 23, 2, 173–186.
GAREY, M. AND GRAHAM, R. 1974. Performance bounds on the splitting algorithm for binary testing. Acta Inf. 3, 347–355.
GRAHAM, R., ROTHSCHILD, B., AND SPENCER, J. 1990. Ramsey Theory. John Wiley & Sons, New York.
HEERINGA, B. 2006. Improving access to organized information. Ph.D. thesis, University of Massachusetts, Amherst.
HEERINGA, B. AND ADLER, M. 2005. Approximating optimal decision trees. Tech. rep. TR 05-25, University of Massachusetts, Amherst.
HYAFIL, L. AND RIVEST, R. 1976. Constructing optimal binary decision trees is NP-complete. Inf. Process. Lett. 5, 1, 15–17.
KOSARAJU, S., PRZYTYCKA, M., AND BORGSTROM, R. 1999. On an optimal split tree problem. In Proceedings of the 5th International Workshop on Algorithms and Data Structures. Lecture Notes in Computer Science, vol. 1272. Springer, Berlin, 69–92.
MORET, B. 1982. Decision trees and diagrams. ACM Comput. Surv. 14, 4, 593–623.
MURTHY, S. 1998. Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining Knowl. Discov. 2, 4, 345–389.
NESETRIL, J. AND ROSENFELD, M. 2001. I. Schur, C. E. Shannon and Ramsey numbers, a short story. Discr. Math. 229, 1-3, 185–195.
PANKHURST, R. 1970. A computer program for generating diagnostic keys. Comput. J. 13, 2, 145–151.
RADZISZOWSKI, S. 1994. Small Ramsey numbers. Electron. J. Combin. 1, 7.
RAZ, R. AND SAFRA, S. 1997. A sub-constant error-probability low-degree test, and a sub-constant error-probability PCP characterization of NP. In Proceedings of the 29th ACM Symposium on Theory of Computing. ACM, New York.
REYNOLDS, A., DICKS, J., ROBERTS, I., WESSELINK, J., IGLESIA, B., ROBERT, V., BOEKHOUT, T., AND RAYWARD-SMITH, V. 2003. Algorithms for identification key generation and optimization with application to yeast identification. In Proceedings of EvoWorkshops. Lecture Notes in Computer Science, vol. 2611. Springer, Berlin, 107–118.
SCHUR, I. 1916. Über die Kongruenz x^m + y^m ≡ z^m (mod p). Jber. Deutsch. Math. Verein 25, 114–117.
WEST, D. 2001. Introduction to Graph Theory. Prentice Hall.
WIJTZES, T., BRUGGEMAN, M., NOUT, M., AND ZWIETERING, M. 1997. A computer system for identification of lactic acid bacteria. Int. J. Food Microbiol. 38, 1, 65–70.
Received February 2008; revised December 2008; accepted April 2009