# Decision Trees for Entity Identification: Approximation Algorithms and Hardness Results

VENKATESAN T. CHAKARAVARTHY, VINAYAKA PANDIT, and
SAMBUDDHA ROY, IBM India Research Lab
PRANJAL AWASTHI, Carnegie Mellon University
MUKESH K. MOHANIA, IBM India Research Lab
We consider the problem of constructing decision trees for entity identification from a given relational table. The input is a table containing information about a set of entities over a fixed set of attributes and a probability distribution over the set of entities that specifies the likelihood of the occurrence of each entity. The goal is to construct a decision tree that identifies each entity unambiguously by testing the attribute values such that the average number of tests is minimized. This classical problem finds such diverse applications as efficient fault detection, species identification in biology, and efficient diagnosis in the field of medicine. Prior work mainly deals with the special case where the input table is binary and the probability distribution over the set of entities is uniform. We study the general problem involving arbitrary input tables and arbitrary probability distributions over the set of entities. We consider a natural greedy algorithm and prove an approximation guarantee of O(r_K · log N), where N is the number of entities and K is the maximum number of distinct values of an attribute. The value r_K is a suitably defined Ramsey number, which is at most log K. We show that it is NP-hard to approximate the problem within a factor of Ω(log N), even for binary tables (i.e., K = 2). Thus, for the case of binary tables, our approximation algorithm is optimal up to constant factors (since r_2 = 2). In addition, our analysis indicates a possible way of resolving a Ramsey-theoretic conjecture by Erdős.
Categories and Subject Descriptors: F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems
General Terms: Algorithms, Theory
Additional Key Words and Phrases: Approximation algorithms, decision tree, Ramsey numbers
ACM Reference Format:
Chakaravarthy, V. T., Pandit, V., Roy, S., Awasthi, P., and Mohania, M. K. 2011. Decision trees for entity
identiﬁcation: Approximation algorithms and hardness results. ACM Trans. Algor. 7, 2, Article 15 (March
2011), 22 pages.
DOI =10.1145/1921659.1921661 http://doi.acm.org/10.1145/1921659.1921661
A preliminary version of the article was presented at the ACM Symposium on Principles of Database Systems,
[Chakaravarthy et al. 2007].
This work was done while P. Awasthi was at IBM India Research Lab, New Delhi.
Authors’ addresses: V. T. Chakaravarthy, V. Pandit, and S. Roy, IBM India Research Lab, 4 Block C,
Institutional Area, Vasanth Kunj, New Delhi – 110070, India; email: {vechakra, pvinayak, sambuddha}@in.ibm.com; P. Awasthi, Computer Science Department, Wean Hall 1313, Carnegie Mellon University,
Pittsburgh, PA 15213; email: pawasthi@cs.cmu.edu; M. K. Mohania, IBM India Research Lab, 4 Block C,
Institutional Area, Vasanth Kunj, New Delhi – 110070, India; email: mkmohania@in.ibm.com.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for proﬁt or commercial advantage and that
copies show this notice on the ﬁrst page or initial screen of a display along with the full citation. Copyrights for
components of this work owned by others than ACM must be honored. Abstracting with credit is permitted.
To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this
work in other works requires prior speciﬁc permission and/or a fee. Permissions may be requested from
Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212)
869-0481, or permissions@acm.org.
© 2011 ACM 1549-6325/2011/03-ART15 $10.00 DOI 10.1145/1921659.1921661 http://doi.acm.org/10.1145/1921659.1921661
ACM Transactions on Algorithms, Vol. 7, No. 2, Article 15, Publication date: March 2011.

1. INTRODUCTION

Decision trees for the purposes of identification and diagnosis have been studied for a long time now [Moret 1982]. Consider a typical medical diagnosis application. A hospital maintains a table containing information about diseases. Each row in the table is a disease, each column is a medical test, and the corresponding entry specifies the outcome of the test for a person suffering from the given disease. Some of the medical tests are costly (e.g., MRI scans) and some require a few days for the result to be known (e.g., blood cultures). When the hospital receives a new patient whose disease has not been identified, it would like to determine the shortest sequence of tests that can unambiguously determine the disease of the patient. Such a capability would enable it to achieve objectives such as reducing patients' expenditure and determining the disease quickly so that treatment can start early. Motivated by such applications, we consider the problem of constructing decision trees for entity identification from the given data.

Decision Trees for Entity Identification—Problem Statement. The input is a table D having N rows and m columns. Each row is called an entity and the columns are the attributes of these entities. Additionally, we are also given a probability distribution P over the set of entities. For each entity e, P specifies p(e), the likelihood of the occurrence of e. A solution is a decision tree in which each internal node is labeled by an attribute and its branches are labeled by the values that the attribute can take. The entities are the leaves of the tree. The main requirement is that the tree should identify each entity correctly.
The cost of the tree is the expected distance of an entity from the root, that is, Σ_e p(e)·d(e), where d(e) is the distance of the entity e from the root. The goal is to construct a decision tree with the minimum cost. We call this the WDT problem (here, W stands for "weight" and stresses the fact that the entities are associated with probabilities/weights).

Example 1.1. Figure 1 shows an example table and two decision trees for it. In this example, the probability distribution over the entities is uniform, that is, p(e_i) = 1/6 for each entity e_i. In the first decision tree, the distance d(e_1) is 2 and d(e_4) is 3. The cost of the first decision tree is 14/6 and that of the second decision tree is 8/6. The second decision tree happens to be an optimal tree for this instance.

For a given table, the maximum number of distinct values that any attribute takes is called its branching factor. In the preceding example, the branching factor of the given table is 5, because every attribute takes at most 5 distinct values and the attribute B attains the maximum of 5. Interesting special cases of the WDT problem can be obtained in two ways:

—the case in which every input instance is required to have a branching factor of at most K, for some constant K; we call this the K-WDT problem. Of particular interest is the 2-WDT problem, where the tables are binary.
—the case in which the probability distribution over the set of entities is known to be uniform; we call this the UDT problem (here, U stresses the fact that the probabilities/weights are uniform).

The special case in which both of these restrictions apply is called the K-UDT problem.

Prior Results and Our Results. Much of the previous literature deals with the restricted 2-UDT problem. Hyafil and Rivest [1976] showed that the 2-UDT problem is NP-hard. Garey [1970, 1972] presented a dynamic-programming-based algorithm for the 2-UDT problem that finds the optimal solution, but the algorithm runs in exponential time in the worst case.
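For concreteness, the cost computation in Example 1.1 can be sketched as follows (a small illustration, not from the article; the leaf depths below are hypothetical stand-ins, since Figure 1 is not reproduced here — only the totals 14/6 and 8/6 come from the text):

```python
from fractions import Fraction

def tree_cost(depths, probs):
    """Expected path length of a decision tree: sum over entities of p(e) * d(e)."""
    return sum(p * d for p, d in zip(probs, depths))

# Uniform distribution over six entities, as in Example 1.1.
uniform = [Fraction(1, 6)] * 6

# Hypothetical leaf depths for two trees; the first is consistent with the
# text's d(e1) = 2, d(e4) = 3 and cost 14/6, the second with cost 8/6.
print(tree_cost([2, 2, 3, 3, 2, 2], uniform))  # 7/3  (= 14/6)
print(tree_cost([2, 1, 1, 2, 1, 1], uniform))  # 4/3  (= 8/6)
```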
Fig. 1. Example decision trees.

Kosaraju et al. [1999] presented a greedy algorithm for the 2-WDT problem with an approximation ratio of O(log N); the approximation ratio remains the same for the special case of the 2-UDT problem. Independently, Dasgupta [2005] showed that the same greedy heuristic has an approximation ratio of 4 log N for the 2-UDT problem. Recently, Heeringa and Adler [2005] gave an alternative analysis of the same greedy algorithm and obtained a slightly improved approximation ratio of (1 + ln N) for the 2-UDT problem (see also Heeringa [2006] and Adler and Heeringa [2008]). They also showed that it is NP-hard to approximate the 2-UDT problem within a ratio of (1 + ε), for some ε > 0.

We study the problem in its whole generality, namely the WDT problem, where the attributes can take multiple values and the input probability distribution can be arbitrary. This occurs commonly, for example, in medical diagnosis applications (e.g., blood group can take multiple values; some diseases are more prevalent than others). We present two approximation algorithms for the UDT problem. The first one is a simple algorithm that uses any given α-approximation algorithm for the 2-UDT problem as a black box and provides an α·log K approximation for the K-UDT problem. In particular, using the algorithm of Heeringa and Adler [2005] as the black box, we obtain an algorithm with an approximation ratio of log K(1 + ln N). Our second algorithm for the UDT problem uses a greedy heuristic and has an approximation ratio of r_K(1 + ln N), where r_K is a suitably defined Ramsey number which is at most (2 + 0.64 log K). Our analysis builds on that of Heeringa and Adler [2005] and uses additional combinatorial arguments.
The highlight of our analysis is that it establishes connections to Ramsey numbers and a conjecture by Erdős (see what follows for more details). Furthermore, notice that the second algorithm offers a constant-factor improvement over the first algorithm.

Remark 1.2. We note that subsequent to our work, Chakaravarthy et al. [2009] considered a slightly different greedy heuristic for the UDT problem and showed an approximation ratio of 4 log N.

Next we consider the general WDT problem. We first observe that by combining the black-box approach with the algorithm of Kosaraju et al. [1999], we get an O(log K · log N) approximation ratio for the WDT problem. We also show how to extend our analysis for the UDT problem to handle weights and obtain an algorithm with an approximation ratio of O(r_K log N). This provides an alternative way of getting the result obtained via the black-box approach.

We next focus on the hardness of approximating various versions of the problem. We show that it is NP-hard to approximate the 2-WDT problem within a ratio of Ω(log N). This implies that the O(log N)-approximation algorithm of Kosaraju et al. [1999] for the 2-WDT problem is optimal up to constant factors. We also improve the hardness of approximation for the unweighted version of the problem. We show that it is NP-hard to approximate the UDT and the 2-UDT problems within ratios of (4 − ε) and (2 − ε), respectively, for any ε > 0. The results are summarized in Figure 2.

Fig. 2. Summary of results.

Ramsey Numbers and Connections to Erdős's Conjecture. Our analysis of the approximation algorithms has interesting connections with Ramsey theory and an unresolved conjecture by Erdős. Ramsey theory, treated at length in the book by Graham et al. [1990], deals with coloring the edges of complete graphs (or hypergraphs) with a specified number of colors satisfying certain constraints.
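As a concrete aside (an illustration, not from the article): with two colors, the smallest n that forces a single-colored triangle in the complete graph on n vertices is n = 6, while the complete graph on 5 vertices still admits a 2-coloring in which no triangle is single-colored. Both facts can be verified by brute force:

```python
from itertools import combinations, product

def mono_triangle(n, color):
    """True if some triangle of the complete graph on n vertices is monochromatic."""
    return any(
        color(x, y) == color(x, z) == color(y, z)
        for x, y, z in combinations(range(n), 3)
    )

# On 5 vertices: color an edge 0 if it lies on the 5-cycle 0-1-2-3-4-0, else 1.
# Both color classes are 5-cycles, so neither contains any triangle at all.
c5 = lambda i, j: 0 if (i - j) % 5 in (1, 4) else 1
print(mono_triangle(5, c5))   # False

# On 6 vertices: every 2-coloring of the 15 edges has a monochromatic triangle.
edges = list(combinations(range(6), 2))
forced = all(
    mono_triangle(6, lambda i, j, c=dict(zip(edges, bits)): c[(i, j)])
    for bits in product((0, 1), repeat=15)
)
print(forced)                 # True
```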
For our purposes, we need the following specific type of Ramsey numbers. For n > 0, let G_n denote the complete graph on n vertices. A k-coloring of G_n is a coloring of the edges of G_n using k colors. For k > 0, R_k is defined to be the smallest number n such that any k-coloring of G_n contains a monochromatic triangle.¹ The inverses of the Ramsey numbers are more convenient for our purposes. For n > 0, we define r_n to be the smallest number k such that we can color the edges of G_n using only k colors without inducing any monochromatic triangle. The exact values of the Ramsey numbers for k > 3 are not known. However, it is known that for any k, (3^k + 1)/2 ≤ R_k ≤ 1 + k!·e (see West [2001], Nesetril and Rosenfeld [2001], and Schur [1916]). Erdős made the conjecture that for some constant α, for all k, R_k ≥ α^k. In terms of the inverse Ramsey numbers, the previous bounds translate as follows: (i) for any n, r_n ≤ 2 + 0.64 log n = O(log n); (ii) r_n = Ω(log n / log log n). The Erdős conjecture now reads r_n = Ω(log n).

Our results provide interesting approaches to address the conjecture. Exhibit a constant c > 0 and show that for all K ≥ 2, it is NP-hard to approximate the K-WDT problem within a factor of c·log K·log N. Notice that this would prove the conjecture under the assumption that NP ≠ P. However, we note that if the recent O(log N)-approximation algorithm for UDT by Chakaravarthy et al. [2009] can be extended to the weighted case, the preceding approach will be ruled out. Another way of proving the conjecture would be to construct a family of bad instances for our algorithm (which is a simple greedy heuristic). We discuss the details later in the article.

Applications and Related Work. Decision trees for entity identification (as defined in this article) have been used for medical diagnosis (as described earlier), species identification in biology, fault detection, etc. [Moret 1982]. Taxonomists release field guides to help identify species based on their characteristics.
These guides are often presented in the form of decision trees labeled by species characteristics. Typically, a field biologist identifies the species of a specimen at hand by referring to such guides (hopefully with as few look-ups as possible). Taxonomists refer to such decision trees as "identification keys," and an article on identification keys can be found in Wikipedia.²

¹A monochromatic triangle is a triplet of vertices such that all three edges between them have the same color. In Ramsey theory, R_k is denoted R(3,3,...,3), where "3" is repeated k times. For example, it is known that R_1 = 3, R_2 = 6, R_3 = 17 [Radziszowski 1994].
²http://en.wikipedia.org/wiki/Dichotomous_key.

Computer programs and algorithms for identification and diagnosis applications have been developed for nearly four decades (e.g., Pankhurst [1970], Reynolds et al. [2003], and Wijtzes et al. [1997]). Murthy [1998] and Moret [1982] present excellent surveys on the use of decision trees in such diverse fields as machine learning, pattern recognition, taxonomy, switching theory, and boolean logic.

2. PRELIMINARIES

In this section, we define the WDT problem and its special cases. We also develop some notation used in the article.

Let D be a relational table having N tuples and m attributes. We call each tuple an entity. Let E and A denote the set of entities and attributes, respectively. For x ∈ E and a ∈ A, x.a denotes the value of the entity x on the attribute a. For a ∈ A, V_a denotes the set of distinct values taken by a in D. Let K = max_{a∈A} |V_a|. Notice that K ≤ N. We call K the branching factor of D.

A decision tree T for the table D is a rooted tree satisfying the following properties. Each internal node u is labeled by an attribute a and has at most K children. Every branch (edge) out of u is labeled by a distinct value from the set V_a.
The entities are the leaves of the tree, and thus the tree has exactly N leaves. The main requirement is that the tree should identify every entity correctly. In other words, for any entity x, the following traversal process should correctly lead to x. The process starts at the root node. Let u be the current node and a be the attribute label of u. Take the branch out of u labeled by x.a and move to the corresponding child of u. The requirement is that this traversal process should reach the entity x.

Observe that the values of the attributes are used only for taking the correct branch in the traversal process. So, we can map each value of an attribute to a distinct number from 1 to K and assume that V_a is a subset of {1, 2, ..., K}. In the rest of the article, we assume that for any x ∈ E and a ∈ A, x.a ∈ {1, 2, ..., K}.

For a tree T, we use "u ∈ T" to mean that u is an internal node in T. We denote by ⟨x, y⟩ an unordered pair of distinct entities. Let T be a decision tree for D. For an entity x ∈ E, the path length of x is defined to be the number of internal nodes on the path from the root to x; it is denoted ℓ_T(x). The sum of all path lengths is called the total path length and is denoted |T|, that is, |T| = Σ_{x∈E} ℓ_T(x).

Let w(·) be a weight function that assigns a real number w(x) > 0 to each x ∈ E. We define the cost of T with respect to w(·) as follows:

cost(T, w) = Σ_{x∈E} w(x)·ℓ_T(x)

We will denote cost(T, w) as w(T). As mentioned in the Introduction, the input to the WDT problem includes a probability distribution P over E specifying the likelihood of the occurrence of each entity, and the goal is to construct a tree having the minimum expected path length. We view probabilities as weights and assume that the distribution is specified as a weight function p(·) that associates a weight p(x) > 0 with each entity x. Notice that when an entity is chosen at random according to the previous distribution, the expected path length is given by p(T) = cost(T, p). We assume that the probabilities p(x) are given as rational numbers.
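The reduction from rational probabilities to integer weights described next can be sketched with a small helper (a sketch of ours, not from the article; it simply clears denominators via a common denominator L):

```python
from fractions import Fraction
from math import lcm

def to_integer_weights(probs):
    """Rewrite rational probabilities p(x) as w(x)/L with integer weights w(x) >= 1."""
    probs = [Fraction(p) for p in probs]
    L = lcm(*(p.denominator for p in probs))  # common denominator
    return [int(p * L) for p in probs], L

weights, L = to_integer_weights([Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)])
print(weights, L)  # [3, 2, 1] 6
```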
We can easily write these numbers in such a way that for any entity x, p(x) = w(x)/L, where w(x) ≥ 1 is an integer and L is an integer giving the common denominator. And so, without loss of generality, we assume the probability distribution will be given as an integer weight function w(·) over the set of entities, that is, for all x ∈ E, w(x) ≥ 1 is an integer. Notice that p(T) = w(T)/L and hence, finding an optimal T under p(·) and w(·) are equivalent.

WDT Problem. The input is a relational table D and a probability distribution P represented as an integer weight function w(·). The goal is to construct a decision tree T having the minimum cost w(T).

For a positive integer K, the K-WDT problem is a special case of the WDT problem where the input table is required to have a branching factor of at most K. Notice that in the K-WDT problem, the input is a table whose entries are drawn from the set {1, 2, ..., K}.

Of particular interest is the special case called UDT in which the probability distribution is uniform. In this problem, the weight function is given by w(x) = 1, for all x ∈ E. Note that the cost of a tree T is w(T) = |T|. For an integer K ≥ 2, the special case of UDT where the input table is required to have a branching factor of at most K is called the K-UDT problem.

3. APPROXIMATION ALGORITHMS AND ANALYSIS

In this section, we present an algorithm for the WDT problem and prove an approximation ratio of O(r_K log N), where K refers to the branching factor of the input table. As mentioned in the Introduction, our analysis builds on that of Heeringa and Adler [2005] for the 2-UDT problem. In order to achieve our result, we have to extend their ideas to deal with two issues: firstly, the attributes can be multivalued as opposed to binary; secondly, the entities can have arbitrary weights. For ease of exposition, we first show how to address the issue of attributes being multivalued.
Then, we deal with the case of arbitrary weights. Specifically, Section 3.1 presents an algorithm and analysis for the UDT problem. These ideas are generalized in Section 3.2 to obtain an algorithm for the WDT problem.

3.1. The Unweighted Case: UDT Problem

This section deals with the UDT problem. Here, the probability distribution is uniform and so, the weights of all the entities are 1. The goal is to find a tree T with the minimum cost |T|.

We present two approximation algorithms for UDT. The first one uses any given α-approximation algorithm for 2-UDT as a black box and provides an α log K approximation for the K-UDT problem. In particular, using the algorithm of Heeringa and Adler [2005] as the black box, we obtain an algorithm with an approximation ratio of log K(1 + ln N). Our second algorithm for the UDT problem uses a greedy heuristic and has an approximation ratio of r_K(1 + ln N). Recall that r_K ≤ 2 + 0.64 log K. Thus, the second algorithm offers a constant-factor improvement over the first algorithm. The first approach has the advantage that any improvement in the approximation ratio for the 2-UDT problem automatically yields an improvement for the K-UDT problem. On the other hand, the second approach has the advantage that any improvement in the upper bound for r_K improves the approximation ratio.

3.1.1. The Black-Box Algorithm. Let A be an α-approximation algorithm for the 2-UDT problem. We show how to get an (α log K)-approximation algorithm for the K-UDT problem. The idea is to encode the given UDT instance as a 2-UDT instance and then invoke the algorithm A on the encoded instance. Given an N × m table D having a branching factor of K, we construct an N × (m log K) binary table D_2 as follows. Each attribute in D is represented by log K attributes in D_2. The former attribute is called the original attribute and the latter attributes are called its derived attributes.
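A minimal sketch of this derived-attribute construction (our own rendering, not from the article; it assumes attribute values lie in {1, ..., K} as arranged earlier, and stores the bits of v − 1 in ⌈log₂ K⌉ binary attributes):

```python
from math import ceil, log2

def encode_binary(table, K):
    """Replace each K-ary attribute by ceil(log2 K) derived binary attributes.

    Each value v in {1, ..., K} is stored as the bits of v - 1, so distinct
    rows of the original table remain distinct in the encoded table."""
    b = max(1, ceil(log2(K)))  # bits per original attribute
    return [
        [(v - 1) >> i & 1 for v in row for i in range(b - 1, -1, -1)]
        for row in table
    ]

# A 3-entity table with branching factor K = 5 -> 3 derived bits per attribute.
tbl = [[1, 5], [2, 3], [5, 1]]
print(encode_binary(tbl, 5))
# [[0, 0, 0, 1, 0, 0], [0, 0, 1, 0, 1, 0], [1, 0, 0, 0, 0, 0]]
```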
The values appearing in an original attribute are represented in binary in the corresponding derived attributes. Invoke the algorithm A on the binary table D_2 and let T_2 be the decision tree returned by the algorithm. We obtain a decision tree T for D from T_2 by replacing the attributes in its internal nodes with their original attributes in D and labeling appropriately. Notice that |T| ≤ |T_2|.

Given a tree T for D, we can construct a tree T_2 for D_2 such that |T_2| ≤ log K·|T|. In constructing a decision tree T_2 for the encoded instance D_2, the main task is to take the correct branches of the internal nodes of T using the binary derived attributes. We achieve this by replacing each internal node with a complete binary tree of depth log K using the derived attributes of the original attribute of the internal node. Clearly, |T_2| ≤ log K·|T|. This shows that |T*_2| ≤ log K·|T*|, where T* and T*_2 are the optimal decision trees for D and D_2, respectively. Since |T_2| ≤ α|T*_2|, the solution T returned by the black-box algorithm satisfies |T| ≤ α log K·|T*|.

THEOREM 3.1. Given an α-approximation algorithm for the 2-UDT problem, the black-box algorithm has an approximation ratio of α log K for the UDT problem, where K is the branching factor of the input table. In particular, we obtain an approximation ratio of log K(1 + ln N) by using the Heeringa-Adler algorithm as a black box.

Fig. 3. The greedy algorithm.

3.1.2. The Greedy Algorithm. In this section, we present a greedy algorithm for the UDT problem. The algorithm is similar in spirit to that of Heeringa and Adler [2005] for the 2-UDT problem. We build on their analysis and develop further combinatorial arguments to obtain our approximation ratio. Given as input an N × m table D having branching factor at most K, the greedy algorithm produces a decision tree T as described in the following.
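For concreteness, the procedure can be rendered compactly in Python (a sketch, not from the article — the article's formal specification is Figure 3; the sketch assumes entities have pairwise-distinct attribute tuples):

```python
def greedy(entities):
    """Greedy decision tree: label the root with the attribute that
    distinguishes the most pairs, then recurse on each branch.

    `entities` maps an entity name to its tuple of attribute values; a node is
    either a leaf (an entity name) or (attribute_index, {value: subtree})."""
    if len(entities) == 1:
        return next(iter(entities))          # leaf: the lone entity's name
    m = len(next(iter(entities.values())))   # number of attributes

    def pairs_separated(a):
        # Count unordered pairs {x, y} with x.a != y.a.
        counts = {}
        for vals in entities.values():
            counts[vals[a]] = counts.get(vals[a], 0) + 1
        n = len(entities)
        return (n * n - sum(c * c for c in counts.values())) // 2

    a = max(range(m), key=pairs_separated)
    parts = {}
    for name, vals in entities.items():
        parts.setdefault(vals[a], {})[name] = vals
    return (a, {v: greedy(sub) for v, sub in parts.items()})

def depth(tree, vals):
    """Path length of an entity: internal nodes on its root-to-leaf path."""
    if isinstance(tree, str):
        return 0
    a, children = tree
    return 1 + depth(children[vals[a]], vals)

table = {"e1": (1, 1), "e2": (1, 2), "e3": (2, 1), "e4": (2, 2)}
tree = greedy(table)
print(sum(depth(tree, v) for v in table.values()))  # total path length |T| = 8
```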
Let E and A denote the set of entities and attributes of D, respectively. The intuition is that any decision tree should distinguish every pair of distinct entities. So, a natural idea is to make the attribute that distinguishes the maximum number of pairs the root of T, where an attribute a is said to distinguish a pair ⟨x, y⟩ if x.a ≠ y.a. Choosing such an attribute a can easily be done in time O(mN²). Picking the attribute a as the label for the root node partitions the set E into disjoint sets E_1, E_2, ..., E_K, where E_i = {x | x.a = i}. We recursively apply the same greedy procedure on each of these sets to obtain K decision trees and make these the subtrees of the root node. The greedy procedure is formally specified in Figure 3. We get the output tree T by calling T = Greedy(E).

THEOREM 3.2. The greedy algorithm has an approximation ratio of r_K(1 + ln N) for the UDT problem, where K is the branching factor of the input table.

We now analyze the greedy algorithm and prove Theorem 3.2. The analysis is divided into two parts. In the first part, we introduce certain combinatorial objects called tabular partitions and analyze the performance of the greedy algorithm using these objects. In the second part, we relate these objects to Ramsey colorings and complete the proof of Theorem 3.2.

3.1.3. Analysis Involving Tabular Partitions. Let T and T* be the greedy and the optimal decision trees, respectively. In this section, we prove a relationship between |T| and |T*| involving tabular partitions, defined in the following.

Definition 3.3 (Tabular Partitions). For any positive integer n ≥ 1, a tabular partition P of n is a sequence P_1, P_2, ..., P_n such that P_i is a partition of the set {1, 2, ..., n} − {i}. We require that for any distinct 1 ≤ i, j ≤ n, if A is the set in P_i containing j and B is the set in P_j containing i, then A ∩ B = ∅. Let the length of a partition P_i denote the number of sets in it.
We define the compactness of P as comp(P) = max_{1≤i≤n} (length of P_i). We define C_n to be the smallest number such that there exists a tabular partition of n having compactness C_n.

THEOREM 3.4. |T| ≤ C_K(1 + ln N)·|T*|.

We next focus on proving the previous result. In Section 3.1.4, we shall show that C_K ≤ r_K and obtain Theorem 3.2 by combining the two results. We start with some notations and observations. Let T be any decision tree for D and u be an internal node of T. We define E_T(u) ⊆ E to be the set of entities in the subtree of T under u.

PROPOSITION 3.5. For any decision tree T of D, we have |T| = Σ_{u∈T} |E_T(u)|.

PROOF. Each entity x contributes a cost equal to its distance from the root. Let us distribute this cost uniformly among the internal nodes on the path from x to the root. Observe that the total cost accumulated at an internal node u is equal to |E_T(u)|. Thus |T| = Σ_{u∈T} |E_T(u)|.

Consider a decision tree T and a pair ⟨x, y⟩ of entities. We say that a node u ∈ T separates the pair ⟨x, y⟩ if the traversal for both x and y passes through u, but x and y take different branches from u. Formally, u is said to separate³ ⟨x, y⟩ if x, y ∈ E_T(u) and x.a ≠ y.a, where a is the attribute label of u. For any pair ⟨x, y⟩ of entities, there exists a unique separator in T that separates x and y. We define SEP(u) to be the set of all pairs separated by u.

The separators with respect to the greedy tree T will be important in our analysis. For each pair ⟨x, y⟩, we denote by s_{x,y} the separator of ⟨x, y⟩ in T and let S_{x,y} denote E_T(s_{x,y}). From Proposition 3.5, we see that each node u ∈ T contributes a cost of |E_T(u)| towards the total cost |T| and separates the pairs in SEP(u). We distribute the cost |E_T(u)| equally among the pairs in SEP(u). For each pair ⟨x, y⟩ ∈ SEP(u), we define the cost c_{x,y} = |E_T(u)|/|SEP(u)|. Since each pair has a unique separator, the costs c_{x,y} are well defined. It is easy to see that |E_T(u)| = Σ_{⟨x,y⟩∈SEP(u)} c_{x,y}, and by Proposition 3.5, we have |T| = Σ_{⟨x,y⟩} c_{x,y}, where the summation is taken over all (unordered) pairs of distinct entities.
Notice that each pair ⟨x, y⟩ also has a unique separator in T*. So, we rewrite the preceding summation by partitioning the set of all pairs according to their separators in T* and obtain the following equation:

|T| = Σ_{z∈T*} Σ_{⟨x,y⟩∈SEP(z)} c_{x,y}.    (1)

For each z ∈ T*, we define α(z) to be the term corresponding to z in the summation given in Eq. (1). Clearly, α(z) = Σ_{⟨x,y⟩∈SEP(z)} c_{x,y}. The following lemma gives an upper bound on α(z).

LEMMA 3.6. For any z ∈ T*, α(z) ≤ C_K(1 + ln |Z|)·|Z|, where Z = E_{T*}(z).

Assuming the correctness of Lemma 3.6, we first prove Theorem 3.4. The lemma is proved later in the section.

PROOF OF THEOREM 3.4. Replacing the inner summation in Eq. (1) by α(z), we have

|T| ≤ C_K(1 + ln N) Σ_{z∈T*} |E_{T*}(z)| = C_K(1 + ln N)·|T*|.

The first step is obtained by invoking Lemma 3.6 and the fact that |Z| ≤ N. Proposition 3.5 gives us the second step.

We now proceed to prove Lemma 3.6. Fix any z ∈ T*. Let us denote Z = E_{T*}(z). Let a_z be the attribute label of z. The node z partitions the set Z into K sets Z_1, Z_2, ..., Z_K, where Z_i = {x ∈ Z | x.a_z = i}. We extend the preceding notation to sets of values. For any A ⊆ {1, 2, ..., K}, define Z_A = ∪_{i∈A} Z_i. We prove the following upper bound on c_{x,y}.

LEMMA 3.7. Let ⟨x, y⟩ ∈ SEP(z). Consider disjoint sets A, B ⊆ {1, 2, ..., K} satisfying y ∈ Z_A and x ∈ Z_B. Then,

c_{x,y} ≤ 1/|S_{x,y} ∩ Z_A| + 1/|S_{x,y} ∩ Z_B|.

PROOF. We are given a pair ⟨x, y⟩ ∈ SEP(z). Let s = s_{x,y} be the separator of ⟨x, y⟩ in T and let the attribute label of s be a_s. The cost c_{x,y} is given by |S_{x,y}|/|SEP(s)|, where S_{x,y} = E_T(s). The greedy algorithm chose the attribute a_s for the node s. Hypothetically, consider choosing the attribute a_z instead. Let us denote the pairs separated by such a choice as X, that is, define X = {⟨u, v⟩ | u, v ∈ S_{x,y} and u.a_z ≠ v.a_z}.

³We note that the separator of ⟨x, y⟩ is nothing but the least common ancestor of x and y.
Notice that the greedy algorithm chose the attribute a_s, instead of a_z, because a_s distinguishes more pairs compared to a_z, meaning |SEP(s)| ≥ |X|. It follows that c_{x,y} ≤ |S_{x,y}|/|X|. Partition S_{x,y} into S_1, S_2, ..., S_K, where S_i = {u ∈ S_{x,y} | u.a_z = i}. Then,

|X| = Σ_{1≤i<j≤K} |S_i|·|S_j|.

Now we claim that

c_{x,y} ≤ 1/(Σ_{i∈A} |S_i|) + 1/(Σ_{j∈B} |S_j|).    (2)

The claim can be proved as follows. Let A' = A ∪ ({1, 2, ..., K} − A − B), so that A' ∪ B = {1, 2, ..., K} and A' ∩ B = ∅. Recall that |S_{x,y}| = |S_1| + |S_2| + ··· + |S_K|. It follows that

1/(Σ_{i∈A} |S_i|) + 1/(Σ_{j∈B} |S_j|) ≥ 1/(Σ_{i∈A'} |S_i|) + 1/(Σ_{j∈B} |S_j|) = |S_{x,y}| / ((Σ_{i∈A'} |S_i|)·(Σ_{j∈B} |S_j|)) ≥ |S_{x,y}|/|X| ≥ c_{x,y}.

This proves the claim in Eq. (2). Observe that for any 1 ≤ i ≤ K, S_{x,y} ∩ Z_i ⊆ S_i and hence |S_{x,y} ∩ Z_i| ≤ |S_i|. Therefore,

c_{x,y} ≤ 1/(Σ_{i∈A} |S_{x,y} ∩ Z_i|) + 1/(Σ_{j∈B} |S_{x,y} ∩ Z_j|).

Finally, since the sets Z_i and Z_j are disjoint for any distinct 1 ≤ i, j ≤ K, it follows that the first term equals 1/|S_{x,y} ∩ Z_A| and the second term equals 1/|S_{x,y} ∩ Z_B|. The lemma is proved.

For each pair ⟨x, y⟩, we shall choose a suitable pair of disjoint sets A and B and obtain an upper bound on c_{x,y} by invoking Lemma 3.7. We make use of tabular partitions for choosing these sets; the motivation for doing so will become clear in the proof of Lemma 3.10. Let P be an optimal tabular partition of K having compactness C_K, given by the sequence P_1, P_2, ..., P_K. Consider any pair ⟨x, y⟩ ∈ SEP(z). Let i = x.a_z and j = y.a_z, so that x ∈ Z_i and y ∈ Z_j. Let A be the set in the partition P_i that contains j and B be the set in the partition P_j that contains i. Notice that, by the definition of tabular partitions, the sets A and B are disjoint. We invoke Lemma 3.7 with A and B as the required disjoint sets. (Observe that for any i and j, all the pairs in Z_i × Z_j will make use of the same disjoint sets while invoking the lemma. Thus the sets chosen depend only on the values x.a_z and y.a_z.) Therefore,

c_{x,y} ≤ 1/|S_{x,y} ∩ Z_A| + 1/|S_{x,y} ∩ Z_B|.

We split the preceding cost into two parts and attribute the first term to x and the second term to y.
Define

$$c^x_{x,y} = \frac{1}{|S_{x,y} \cap Z_{\widehat{A}}|} \quad\text{and}\quad c^y_{x,y} = \frac{1}{|S_{x,y} \cap Z_{\widehat{B}}|}.$$

It follows that $c_{x,y} \le c^x_{x,y} + c^y_{x,y}$. For any $x \in Z$, we imagine that $x$ pays a cost $c^x_{x,y}$ to get separated from an entity $y \in Z$. We denote the accumulated cost as $\mathrm{Acc}_z(x)$ and define it as

$$\mathrm{Acc}_z(x) = \sum_{y \,:\, \langle x, y \rangle \in \mathrm{SEP}(z)} c^x_{x,y}.$$

Now the lemma given next follows easily.

LEMMA 3.8. For any $z$, $\alpha(z) \le \sum_{x \in Z} \mathrm{Acc}_z(x)$.

Our next task is to obtain an upper bound on $\mathrm{Acc}_z(x)$, so that we get a bound on $\alpha(z)$. The following lemma is useful for this purpose.

LEMMA 3.9. Let $x \in E$ be any entity and $Q \subseteq E$ be any set of entities such that $x \notin Q$. Then,

$$\sum_{y \in Q} \frac{1}{|S_{x,y} \cap Q|} \le 1 + \ln |Q|.$$

PROOF. Let $t = |Q|$. We shall prove the following claim:

$$\sum_{y \in Q} \frac{1}{|S_{x,y} \cap Q|} \le \sum_{i=1}^{t} \frac{1}{i}.$$

The claim implies the lemma, since it is well known that $\sum_{i=1}^{t} (1/i) \le 1 + \ln t$ for all $t$. We prove the claim by applying induction on $|Q|$. For the base case of $|Q| = 1$, let $Q = \{y\}$, where $y \ne x$. Clearly, $y \in S_{x,y}$ and so $|S_{x,y} \cap Q| = 1$, and the claim follows. Assuming that the claim is true for all sets of size at most $t - 1$, we prove it for any set $Q$ of size $t$. Let $\widehat{y}$ be any entity in $Q$ such that for all $y \in Q$, $s_{x,y}$ is a descendant of $s_{x,\widehat{y}}$ (a node is considered to be a descendant of itself). If more than one such element exists, pick one arbitrarily. Intuitively, $\widehat{y}$ is one among the first batch of entities in $Q$ to get separated from $x$. The main observation is that $Q \subseteq S_{x,\widehat{y}}$ and so $S_{x,\widehat{y}} \cap Q = Q$. Thus $1/|S_{x,\widehat{y}} \cap Q| = 1/|Q| = 1/t$. We apply the induction hypothesis on the set of remaining entities $Q' = Q - \{\widehat{y}\}$ and infer that

$$\sum_{y \in Q'} \frac{1}{|S_{x,y} \cap Q'|} \le \sum_{i=1}^{t-1} \frac{1}{i}.$$

Clearly, $Q' \subseteq Q$ and hence $|S_{x,y} \cap Q'| \le |S_{x,y} \cap Q|$; so, in the previous summation, if we replace the term $|S_{x,y} \cap Q'|$ by $|S_{x,y} \cap Q|$, the resulting inequality is also true. We conclude that

$$\sum_{y \in Q} \frac{1}{|S_{x,y} \cap Q|} = \frac{1}{|S_{x,\widehat{y}} \cap Q|} + \sum_{y \in Q'} \frac{1}{|S_{x,y} \cap Q|} \le \frac{1}{t} + \sum_{i=1}^{t-1} \frac{1}{i} = \sum_{i=1}^{t} \frac{1}{i}.$$

LEMMA 3.10. For any $x \in Z$, $\mathrm{Acc}_z(x) \le C_K (1 + \ln |Z|)$.

PROOF. Let $r = x.a_z$, so that $x \in Z_r$. Let $Z' = Z - Z_r$ be the rest of the entities in $Z$.
Notice that $\mathrm{Acc}_z(x) = \sum_{y \in Z'} c^x_{x,y}$. We perform the preceding summation by partitioning $Z'$ according to $P_r$, the $r$th member of the optimal tabular partition $\mathcal{P} = P_1, P_2, \ldots, P_K$. Let $P_r = s_1, s_2, \ldots, s_\ell$, where $\ell \le C_K$. For $1 \le i \le \ell$, define $Q_i = \{y \in Z' \mid y.a_z \in s_i\}$. Thus, $Z' = Q_1 \cup Q_2 \cup \cdots \cup Q_\ell$ and hence,

$$\mathrm{Acc}_z(x) = \sum_{1 \le i \le \ell} \; \sum_{y \in Q_i} c^x_{x,y}. \qquad (3)$$

We derive an upper bound for each term in the outer sum using Lemma 3.9. Fix any $1 \le i \le \ell$. Notice that for any $y \in Q_i$, we have $c^x_{x,y} = 1/|S_{x,y} \cap Q_i|$, by definition. Moreover, $x \notin Q_i$. Thus, by applying Lemma 3.9 on $Q_i$, we get

$$\sum_{y \in Q_i} c^x_{x,y} \le (1 + \ln |Q_i|) \le (1 + \ln |Z|). \qquad (4)$$

We get the lemma by combining Eqs. (3) and (4), and the fact that $\ell \le C_K$.

PROOF OF LEMMA 3.6. The result is proved by combining Lemma 3.8 and Lemma 3.10:

$$\alpha(z) \le \sum_{x \in Z} \mathrm{Acc}_z(x) \le \sum_{x \in Z} C_K (1 + \ln |Z|) = C_K (1 + \ln |Z|)\,|Z|.$$

3.1.4. Tabular Partitions and Ramsey Colorings. In this section, we introduce the notion of directed Ramsey colorings and show that they are equivalent to tabular partitions. Throughout the discussion, for $n > 0$, let $G_n$ and $\vec{G}_n$ denote the complete undirected and the complete directed graph on $n$ vertices, respectively.

Definition 3.11. Let $n > 0$ be an integer. A directed Ramsey coloring of $\vec{G}_n$ is a coloring $\tau$ of the edges such that for any triplet of distinct vertices $x$, $y$, and $z$, if $\tau(x,y) = \tau(x,z)$ then $\tau(y,x) \ne \tau(y,z)$ (and by symmetry, $\tau(z,x) \ne \tau(z,y)$).

We define $\vec{R}_k$ to be the smallest number $n$ such that $\vec{G}_n$ cannot be directed Ramsey colored using $k$ colors. (Such a number exists, as shown in Theorem 5.1.) The inverse of these numbers will be useful: define $\vec{r}_n$ to be the minimum number of colors required to do a directed Ramsey coloring of $\vec{G}_n$.

We claim that for any $n$, there exists a tabular partition $\mathcal{P}$ of compactness $k$ if and only if there exists a directed Ramsey coloring $\tau$ of $\vec{G}_n$ that uses only $k$ colors. A proof sketch follows. Let $\mathcal{P} = P_1, P_2, \ldots, P_n$. Fix $1 \le x \le n$. Arrange the sets in the partition $P_x$ in an arbitrary manner, say $P_x = s_{x,1}, s_{x,2}, \ldots, s_{x,\ell}$, where $\ell \le k$.
The $n - 1$ edges outgoing from the vertex $x$ are colored according to the partition $P_x$; that is, for $1 \le c \le \ell$ and $y \in s_{x,c}$, we set $\tau(x,y) = c$. For any $y$ and $z$, if $\tau(x,y) = \tau(x,z)$, then $y$ and $z$ belong to the same set in the partition $P_x$. By the property of tabular partitions, it must be the case that $x$ and $z$ belong to different sets in the partition $P_y$, implying that $\tau(y,x) \ne \tau(y,z)$. We conclude that $\tau$ is a directed Ramsey coloring and that $\tau$ uses only $k$ colors. The converse is proved using a similar argument. The claim implies the following theorem.

THEOREM 3.12. For any $n$, $C_n = \vec{r}_n$.

Let us call an edge-coloring of $G_n$ a Ramsey coloring if it does not induce any monochromatic triangle. For any $n$, a Ramsey coloring $\tau$ of $G_n$ readily yields a directed Ramsey coloring $\vec{\tau}$ of $\vec{G}_n$: for each pair of vertices $x$ and $y$, we set $\vec{\tau}(x,y) = \vec{\tau}(y,x) = \tau(x,y)$. It can easily be verified that $\vec{\tau}$ is indeed a directed Ramsey coloring of $\vec{G}_n$, and the number of colors used in $\vec{\tau}$ is the same as that of $\tau$. Therefore, we have the following proposition.

PROPOSITION 3.13. For any $n$, $\vec{r}_n \le r_n$.

PROOF OF THEOREM 3.2. The result follows from Theorems 3.4 and 3.12, and Proposition 3.13.

3.2. The Weighted Case: WDT Problem

In this section, we show how to deal with the weighted case, namely the WDT problem. Let $D$ be the input $N \times m$ table over a set of entities $E$ and a set of attributes $A$, having a branching factor of $K$. Let $w(\cdot)$ be the input weight function that assigns an integer weight $w(x) \ge 1$ to each entity $x \in E$. The problem is to construct an optimal decision tree $T$ having the minimum cost with respect to $w(\cdot)$. We present an algorithm which generalizes the greedy algorithm for the UDT problem.

Weighted Greedy Algorithm. Refer to the greedy algorithm given in Figure 3.
The main step in that algorithm is choosing an attribute that distinguishes the maximum number of pairs. We modify this step so that the weights are taken into account: namely, we choose the attribute

$$\widehat{a} = \arg\max_{a \in A} \sum_{\langle x, y \rangle \in S(a)} w(x)\,w(y),$$

where $S(a) = \{\langle x, y \rangle \mid x, y \in E \text{ and } x.a \ne y.a\}$ is the set of pairs distinguished by the attribute $a$. We call the preceding procedure the weighted greedy algorithm. Let $W = \sum_{x \in E} w(x)$ denote the total weight of the entities. Let $T$ and $T^*$ denote the weighted greedy and the optimal trees, under the weight function $w(\cdot)$.

THEOREM 3.14. $w(T) \le C_K (1 + \ln W)\,w(T^*)$, where $W$ is the sum of the weights of all the entities.

We prove this theorem by adapting the proof of Theorem 3.2. Due to space constraints, we provide an outline of the proof. Intuitively, we imagine that each entity $x$ is replicated $w(x)$ times and modify the proof of Theorem 3.2 accordingly. We reuse notation from the previous proof. Let $\mathrm{SEP}(u)$ be the set of all pairs separated by $u$. For each pair $\langle x, y \rangle$, we denote by $s_{x,y}$ the separator of $\langle x, y \rangle$ in $T$ and let $S_{x,y}$ denote $E_T(s_{x,y})$. Additional notation is introduced next. For a set of entities $X \subseteq E$, let $w(X)$ denote the total weight of the entities in $X$, that is, $w(X) = \sum_{x \in X} w(x)$. We also define weights on any set of pairs of entities: for a set of pairs $X \subseteq E \times E$, define $w(X) = \sum_{\langle x, y \rangle \in X} w(x)\,w(y)$. Proposition 3.5 generalizes to the weighted case as follows.

PROPOSITION 3.15. For a decision tree $T$ of $D$, $w(T) = \sum_{u \in T} w(E_T(u))$.

For each pair of entities $\langle x, y \rangle$, define a cost $c_{x,y}$ as follows:

$$c_{x,y} = \frac{w(x)\,w(y)\,w(S_{x,y})}{w(\mathrm{SEP}(s_{x,y}))}.$$

By Proposition 3.15, we get the following equation, which is similar to Eq. (1):

$$w(T) = \sum_{z \in T^*} \; \sum_{\langle x, y \rangle \in \mathrm{SEP}(z)} c_{x,y}. \qquad (5)$$

For each $z \in T^*$, the inner summation in Eq. (5) is defined as the cost $\alpha(z) = \sum_{\langle x, y \rangle \in \mathrm{SEP}(z)} c_{x,y}$. Our goal is to derive an upper bound on $\alpha(z)$. Fix any $z \in T^*$ and denote $Z = E_{T^*}(z)$. Let $a_z$ be the attribute label of $z$. The node $z$ partitions the set $Z$ into $K$ sets $Z_1, Z_2, \ldots, Z_K$, where $Z_i = \{x \in Z \mid x.a_z = i\}$. We extend the preceding notation to sets of values: for any $A \subseteq \{1, 2, \ldots, K\}$, define $Z_A = \cup_{i \in A} Z_i$.
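As an illustration, the weighted greedy selection rule described earlier (pick the attribute maximizing the total weight of distinguished pairs) can be sketched as follows. This is a minimal sketch under our own assumptions, not the authors' implementation: the table representation, the function names, and the brute-force enumeration of all pairs are illustrative choices.

```python
from itertools import combinations

def weighted_greedy_attribute(entities, attributes, value, weight):
    # Score of attribute a: sum of w(x) * w(y) over the pairs <x, y>
    # that a distinguishes, i.e., those with value(x, a) != value(y, a).
    def score(a):
        return sum(weight[x] * weight[y]
                   for x, y in combinations(entities, 2)
                   if value(x, a) != value(y, a))
    # Return an attribute maximizing the weighted count of
    # distinguished pairs (ties broken arbitrarily by max()).
    return max(attributes, key=score)
```

For example, with three entities of weights 5, 1, 1, an attribute that separates the heavy entity from both light ones scores $5 \cdot 1 + 5 \cdot 1 = 10$ and is preferred over one scoring $5 + 1 = 6$. The full algorithm would then recurse on each child induced by the chosen attribute's values.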
The following lemma generalizes Lemma 3.7 to the weighted case.

LEMMA 3.16. Let $\langle x, y \rangle \in \mathrm{SEP}(z)$. Consider disjoint sets $A, B \subseteq \{1, 2, \ldots, K\}$ satisfying $y \in Z_A$ and $x \in Z_B$. Then,

$$c_{x,y} \le w(x)\,w(y)\left(\frac{1}{w(S_{x,y} \cap Z_A)} + \frac{1}{w(S_{x,y} \cap Z_B)}\right).$$

Consider any $\langle x, y \rangle \in \mathrm{SEP}(z)$. Let $\mathcal{P}$ be an optimal tabular partition of $K$ having compactness $C_K$, given by the sequence $P_1, P_2, \ldots, P_K$. Let $i = x.a_z$ and $j = y.a_z$, so that $x \in Z_i$ and $y \in Z_j$. Let $\widehat{A}$ be the set in the partition $P_i$ that contains $j$ and $\widehat{B}$ be the set in the partition $P_j$ that contains $i$. Define

$$c^x_{x,y} = \frac{w(x)\,w(y)}{w(S_{x,y} \cap Z_{\widehat{A}})} \quad\text{and}\quad c^y_{x,y} = \frac{w(x)\,w(y)}{w(S_{x,y} \cap Z_{\widehat{B}})}.$$

By Lemma 3.16, we have that $c_{x,y} \le c^x_{x,y} + c^y_{x,y}$. For each entity $x \in E_{T^*}(z)$, define $\mathrm{Acc}_z(x)$ as

$$\mathrm{Acc}_z(x) = \sum_{y \,:\, \langle x, y \rangle \in \mathrm{SEP}(z)} c^x_{x,y}.$$

We wish to derive an upper bound on $\mathrm{Acc}_z(x)$. The following lemma, which generalizes Lemma 3.9, is useful for this purpose.

LEMMA 3.17. Let $x \in E$ be any entity and $Q \subseteq E$ be any set of entities such that $x \notin Q$. Then,

$$\sum_{y \in Q} \frac{w(y)}{w(S_{x,y} \cap Q)} \le 1 + \ln w(Q).$$

The following is obtained by generalizing Lemma 3.10.

LEMMA 3.18. For any $x \in Z$, $\mathrm{Acc}_z(x) \le w(x)\,C_K (1 + \ln w(Z))$.

PROOF OF THEOREM 3.14. Consider any $z \in T^*$ and let $Z = E_{T^*}(z)$. Then, $\alpha(z) \le \sum_{x \in Z} \mathrm{Acc}_z(x)$. Applying Lemma 3.18 and Proposition 3.15, we get that

$$\alpha(z) \le C_K (1 + \ln w(Z))\,w(Z). \qquad (6)$$

Replacing the inner summation in Eq. (5) by $\alpha(z)$, we have

$$w(T) \le C_K (1 + \ln W) \sum_{z \in T^*} w(E_{T^*}(z)) = C_K (1 + \ln W)\,w(T^*). \qquad (7)$$

The first step is obtained by invoking Eq. (6) and the fact that $w(Z) \le w(E) = W$. Proposition 3.15 gives us the second step.

Theorem 3.14 shows that the approximation ratio of the weighted greedy algorithm is logarithmic in $N$ when the total weight $W$ is polynomially bounded in $N$. Unfortunately, when the weights are arbitrarily large, the ratio could be worse. We overcome this issue by using the following rounding technique.

Rounded Greedy Algorithm. Let $D$ be an input table having a branching factor of $K$ and let $w_{\mathrm{in}}$ be the input integer weight function. Let $w^{\max}_{\mathrm{in}} = \max_x w_{\mathrm{in}}(x)$ denote the maximum weight.
Define a new weight function $w(\cdot)$ as follows: for any entity $x \in E$, define

$$w(x) = \left\lceil \frac{w_{\mathrm{in}}(x)\,N^2}{w^{\max}_{\mathrm{in}}} \right\rceil.$$

Run the weighted greedy algorithm with $w(\cdot)$ as the input weight function and obtain a tree $T$. Return the tree $T$.

Let $T^*$ and $T^*_{\mathrm{in}}$ be the optimal decision trees under the weight functions $w(\cdot)$ and $w_{\mathrm{in}}(\cdot)$, respectively. From Theorem 3.14, we have a good bound for $w(T)$ with respect to $w(T^*)$. But, of course, we need to compare $w_{\mathrm{in}}(T)$ and $w_{\mathrm{in}}(T^*_{\mathrm{in}})$. We do this next.

THEOREM 3.19. $w_{\mathrm{in}}(T) \le 2 C_K (1 + 3 \ln N)\,w_{\mathrm{in}}(T^*_{\mathrm{in}})$.

PROOF. Let $x \in E$ be any entity and consider the path from the root to $x$ in the tree $T^*_{\mathrm{in}}$. Notice that each internal node along this path separates at least one entity from $x$. (Otherwise, $T^*_{\mathrm{in}}$ contains a "dummy" node that does not separate any pair and hence can be deleted to obtain a tree of lesser cost.) So, the length of the path is at most $N$, and hence the following claim is true.

Claim 1: $|T^*_{\mathrm{in}}| \le N^2$.

We next compare $w_{\mathrm{in}}(T^*_{\mathrm{in}})$ and $w(T^*_{\mathrm{in}})$. We have

$$w(T^*_{\mathrm{in}}) = \sum_{x \in E} w(x)\,T^*_{\mathrm{in}}(x) \le \sum_{x \in E} \left(\frac{w_{\mathrm{in}}(x)\,N^2}{w^{\max}_{\mathrm{in}}} + 1\right) T^*_{\mathrm{in}}(x) = \frac{w_{\mathrm{in}}(T^*_{\mathrm{in}})\,N^2}{w^{\max}_{\mathrm{in}}} + |T^*_{\mathrm{in}}| \le \frac{w_{\mathrm{in}}(T^*_{\mathrm{in}})\,N^2}{w^{\max}_{\mathrm{in}}} + N^2 \le \frac{2\,w_{\mathrm{in}}(T^*_{\mathrm{in}})\,N^2}{w^{\max}_{\mathrm{in}}}. \qquad (8)$$

The second step is from the definition of $w(\cdot)$ and the fourth step is obtained from Claim 1. The last inequality is obtained by observing that $w_{\mathrm{in}}(T^*_{\mathrm{in}}) \ge w^{\max}_{\mathrm{in}}$.

Notice that for any entity $x \in E$, $1 \le w(x) \le N^2$, and so the total weight $W$ under the function $w(\cdot)$ satisfies $W \le N^3$. So, Theorem 3.14 implies the following claim.

Claim 2: $w(T) \le C_K (1 + 3 \ln N)\,w(T^*)$.

We can now compare $w_{\mathrm{in}}(T)$ and $w_{\mathrm{in}}(T^*_{\mathrm{in}})$. Note that $T^*$ is the optimal tree under the function $w(\cdot)$ and hence $w(T^*) \le w(T^*_{\mathrm{in}})$. We obtain the theorem by combining this observation with Eq. (8) and Claim 2.

By combining Theorem 3.19, Theorem 3.12, and Proposition 3.13, we get the following result.

THEOREM 3.20. The approximation ratio of the rounded greedy algorithm is at most $2 r_K (1 + 3 \ln N) = O(r_K \log N)$.
4. HARDNESS OF APPROXIMATION

In this section, we study the hardness of approximating the WDT and the UDT problems. We show that it is NP-hard to approximate the 2-WDT problem within a ratio of $\Omega(\log N)$. Therefore, our approximation algorithm for the 2-WDT problem is optimal up to constant factors. We also improve the previous hardness results for the UDT problem.

4.1. Hardness of Approximating the 2-WDT Problem

THEOREM 4.1. It is NP-hard to approximate the 2-WDT problem within a factor of $\Omega(\log N)$, where $N$ is the number of entities in the input.

PROOF. We prove the result via a reduction from the set cover problem. It is known that approximating set cover within a factor of $\Omega(\log n)$ is NP-hard [Raz and Safra 1997]. Let $(U, \mathcal{S})$ be the input set cover instance, where $U = \{x_1, x_2, \ldots, x_n\}$ is a universe of items and $\mathcal{S}$ is a collection of sets $\{S_1, S_2, \ldots, S_m\}$ such that $S_i \subseteq U$ for each $i$. Without loss of generality, we can assume that for any pair of distinct items $x_i$ and $x_j$, there exists a set $S_k \in \mathcal{S}$ containing exactly one of these two items. (If not, one of these items can be removed from the system.) Construct an instance of the 2-WDT problem having $N = n + 1$ entities and $m$ attributes. The set of entities is $E = \{x_1, x_2, \ldots, x_n\} \cup \{x^*\}$, where each entity $x_i$ corresponds to the item $x_i$ and $x^*$ is a special entity. The set of attributes is $A = \{S_1, S_2, \ldots, S_m\}$, so that each attribute $S_i$ corresponds to the set $S_i$. The $N \times m$ table $D$ is given as follows. For each entity $x_i$ and attribute $S_j$, set $x_i.S_j = 1$ if $x_i \in S_j$, and otherwise set $x_i.S_j = 0$. For the special entity $x^*$, set $x^*.S_j = 0$ for all attributes $S_j$. For each entity $x_i$, set the weight $w(x_i) = 1$. As for the special entity $x^*$, set its weight as $w(x^*) = N^3$. This completes the construction.

Let $T$ be a decision tree for $D$. Let $C$ be the set of attributes found along the path from the root to the entity $x^*$. Recall that the length of the preceding path is denoted as $T(x^*)$.
Observe that $C$ is a cover for $(U, \mathcal{S})$. We have $|C| = T(x^*) \le w(T)/N^3$. On the other hand, given a cover $C$, we can construct a decision tree $T$ satisfying the following two properties: (i) the set of attributes along the path from the root to $x^*$ is exactly the set $C$, so that $T(x^*) = |C|$; (ii) for every other entity $x_i$, $T(x_i) \le N$. (The second property is based on the fact that for any table containing $N$ entities, it suffices to test at most $N$ attributes in order to distinguish any entity from the rest.) Thus $w(T) \le |C|\,N^3 + N^2$. In particular, $w(T^*) \le |C^*|\,N^3 + N^2$, where $T^*$ and $C^*$ are the optimal decision tree and the optimal cover, respectively. Based on the previous observations, we can prove the following claim: if there exists an $\alpha(N)$-approximation algorithm for the 2-WDT problem, then for any $\epsilon > 0$, we can design a $(1 + \epsilon)\alpha(n)$-approximation algorithm for the set cover problem. Therefore, the hardness of the set cover problem implies the claimed hardness result for the 2-WDT problem.

4.2. Hardness for the UDT and the 2-UDT Problems

In this section, we present improved hardness of approximation results for the UDT and the 2-UDT problems. For the 2-UDT problem, Heeringa and Adler [2005] showed a hardness of approximation of $(1 + \epsilon)$, for some $\epsilon > 0$. We show that for any $\epsilon > 0$, it is NP-hard to approximate the UDT and the 2-UDT problems within factors of $(4 - \epsilon)$ and $(2 - \epsilon)$, respectively. Our reductions are from the Minimum Sum Set Cover (MSSC) problem. The input to the MSSC problem is a set system: a collection of sets $\mathcal{S} = \{S_1, S_2, \ldots, S_m\}$ over a universe $U = \{x_1, x_2, \ldots, x_N\}$ of items, where each $S_i \subseteq U$. A solution is an ordering $\pi$ on the sets in $\mathcal{S}$, with an associated cost defined as follows. Let $\pi$ be $S'_1, S'_2, \ldots, S'_m$. Each item in $S'_1$ pays a cost of 1, each item in $S'_2 - S'_1$ pays a cost of 2, and so on. The cost of $\pi$ is the sum of the costs of all items. Formally, define the costs $c^{\pi}_x = \min\{i \mid x \in S'_i\}$ for $x \in U$, and $\mathrm{cost}(\pi) = \sum_{x \in U} c^{\pi}_x$. The MSSC problem is to find an ordering with the minimum cost.
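The cost of a given MSSC ordering is straightforward to compute; the following sketch makes the definition concrete. It is our illustration only — the list-of-sets representation and the function name are assumptions, not notation from the paper.

```python
def mssc_cost(ordering, universe):
    # cost(pi) = sum over items x of c_x, where c_x is the 1-based
    # position of the first set in the ordering that contains x.
    # The ordering is assumed to cover every item of the universe.
    return sum(next(i for i, s in enumerate(ordering, start=1) if x in s)
               for x in universe)
```

For instance, for the ordering $S'_1 = \{1, 2\}$, $S'_2 = \{2, 3\}$, $S'_3 = \{4\}$ over the universe $\{1, 2, 3, 4\}$, items 1 and 2 pay 1, item 3 pays 2, and item 4 pays 3, for a total cost of 7.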
For a constant $d$, the $d$-MSSC problem is the special case of MSSC in which every set in the set system has at most $d$ elements. Feige et al. [2004] proved the following hardness results for these problems.

THEOREM 4.2 [FEIGE ET AL. 2004].
(1) For any $\epsilon > 0$, it is NP-hard to approximate the MSSC problem within a ratio of $(4 - \epsilon)$.
(2) For any $\epsilon > 0$, there exists a constant $d$ such that it is NP-hard to approximate $d$-MSSC within a ratio of $(2 - \epsilon)$.

Our hardness results for UDT and 2-UDT are obtained via approximation-preserving reductions from the MSSC and the $d$-MSSC problems, respectively. The reduction from MSSC to UDT is easier and we present it first (in Section 4.2.1). The reduction from $d$-MSSC to 2-UDT is similar, but involves more technical details; it is presented in Section 4.2.2.

4.2.1. Hardness for the UDT Problem. Here, we prove the hardness result for the UDT problem by exhibiting a reduction from MSSC.

THEOREM 4.3. For any $\epsilon > 0$, it is NP-hard to approximate the UDT problem within a ratio of $(4 - \epsilon)$.

PROOF. Given an MSSC instance $\mathcal{S} = \{S_1, S_2, \ldots, S_m\}$ over a universe $U = \{x_1, x_2, \ldots, x_N\}$, construct an $N \times m$ table $D$ as follows. Each item $x_i$ corresponds to an entity and each set $S_j$ corresponds to an attribute $a_j$. For $1 \le j \le m$ and $1 \le i \le N$, set the entry $x_i.a_j$ as follows: if $x_i \in S_j$ then set $x_i.a_j = i$, else set $x_i.a_j = 0$. Observe that any decision tree for $D$ is left-deep: for any internal node $u$, except the branch labeled 0, every other branch out of $u$ leads to a leaf node. We claim that given an ordering $\pi$ of $\mathcal{S}$, we can construct a decision tree $T$ such that $|T| = \mathrm{cost}(\pi)$, and vice versa. Let $\pi = S'_1, S'_2, \ldots, S'_m$ and let $a'_1, a'_2, \ldots, a'_m$ be the corresponding sequence of attributes. Construct a left-deep tree $T$ in which the root node is labeled $a'_1$, its 0-child is labeled $a'_2$, and so on; in general, label the internal node at the $i$th level with $a'_i$.
It can be seen that $T$ is indeed a decision tree for $D$ and that $|T| = \mathrm{cost}(\pi)$. The converse is shown via a similar construction. Given a decision tree $T$, traverse the tree starting with the root node and always taking the branches labeled 0. Write down the sequence of sets corresponding to the internal nodes seen in this traversal and let $\pi$ denote the sequence. Notice that the sets appearing in this sequence cover all elements of $U$ and that $\mathrm{cost}(\pi) = |T|$. (Some sets in $\mathcal{S}$ may not appear in this sequence. To be formally compliant with the definition of solutions, we append the missing sets in an arbitrary order.) The claim, in conjunction with Theorem 4.2 (part 1), implies the theorem.

4.2.2. Hardness for the 2-UDT Problem. In this section, we shall prove the hardness result for the 2-UDT problem.

THEOREM 4.4. For any $\epsilon > 0$, it is NP-hard to approximate the 2-UDT problem within a ratio of $(2 - \epsilon)$.

The proof is similar to that of Theorem 4.3. The reduction is from $d$-MSSC instances, for a suitable constant $d$. Observe that the entries in the table can only be 0 or 1, as opposed to the index of the elements in the previous construction. The required reduction is obtained by using $\lceil \log d \rceil$ auxiliary columns to identify the elements of each set. The rest of the section is devoted to a formal proof.

We first present a general construction, using which we shall derive the theorem. Fix any integer $d > 0$; we shall show a reduction from $d$-MSSC to 2-UDT. Given a $d$-MSSC instance $\mathcal{S} = \{S_1, S_2, \ldots, S_m\}$ over a universe of items $U = \{x_1, x_2, \ldots, x_N\}$, construct a binary table $D$ having $N$ entities and $m(1 + \lceil \log d \rceil)$ attributes, as follows. Each item $x \in U$ corresponds to an entity $x$ in $D$. Each set $S_i$ corresponds to $1 + \lceil \log d \rceil$ attributes: a main attribute named $a_i$ and $\lceil \log d \rceil$ auxiliary attributes named $a_{i,1}, a_{i,2}, \ldots, a_{i,\lceil \log d \rceil}$. For filling the table $D$, consider each set $S_j$.
Order the items in $S_j$ arbitrarily and let the ordering be $S_j = x'_1, x'_2, \ldots, x'_\ell$, where $\ell \le d$. For each entity $x'_i \in S_j$, set $x'_i.a_j = 1$; write $i$ as a $\lceil \log d \rceil$-bit string and fill the entries $x'_i.a_{j,1}, x'_i.a_{j,2}, \ldots, x'_i.a_{j,\lceil \log d \rceil}$ with these bits. For any entity $x \notin S_j$, set the value on all these $1 + \lceil \log d \rceil$ attributes to be 0. This completes the construction of $D$.

We make two claims connecting the solutions of the MSSC instance and the decision trees of $D$. Any decision tree $T$ for $D$ is a binary tree in which each internal node has two branches labeled 0 and 1; we call these the 0-branch and the 1-branch, and the corresponding children the 0-child and the 1-child, respectively. Let $\pi = S'_1, S'_2, \ldots, S'_m$ be an ordering of $\mathcal{S}$. We say that a set $S'_i$ covers an entity $x$ if $x \in S'_i$ and $x \notin S'_j$ for all $j < i$.

LEMMA 4.5. Given an ordering $\pi$ of $\mathcal{S}$, we can construct a decision tree $T$ of $D$ such that $|T| \le \mathrm{cost}(\pi) + N \lceil \log d \rceil$. In particular, if $\pi^*$ and $T^*$ are the respective optimal solutions, then $|T^*| \le \mathrm{cost}(\pi^*) + N \lceil \log d \rceil$.

PROOF. Let $\pi = S'_1, S'_2, \ldots, S'_m$ and let $a'_1, a'_2, \ldots, a'_m$ be the sequence of main attributes corresponding to these sets. We construct a tree $T$ that is "almost left-deep." We start the construction by making $a'_1$ the label of the root node. Notice that all the entities in $S'_1$ will follow the 1-branch and all the other entities will follow the 0-branch. For the former entities, the auxiliary attributes corresponding to $S'_1$ contain the index of the entities within $S'_1$. So, using these attributes we can identify each entity within $S'_1$, paying a cost of at most $\lceil \log d \rceil$. (Formally, construct a complete binary tree of depth $\lceil \log d \rceil$ with the auxiliary attributes as labels, assign the entities within $S'_1$ appropriately, and attach this tree to the 1-branch of the root node.)

We make $a'_2$ the label of the 0-child of the root node. The discussion for this node is similar to that of the root node. All entities that are covered by $S'_2$ will follow the 1-branch and are identified using the auxiliary attributes corresponding to $S'_2$.
The remaining entities follow the 0-branch. In general, if we follow the 0-branches from the root node and reach level $i$, we will find a node labeled $a'_i$. The entities covered by $S'_i$ will follow the 1-branch of this node and they will be identified using a complete binary tree of depth at most $\lceil \log d \rceil$. For any entity $x$, if $S'_i$ is the set covering $x$, then starting at the root node, $x$ will follow the 0-branch until it reaches the node labeled $a'_i$, where it will follow the 1-branch and then get identified by the complete binary tree on the 1-branch. The entity incurs a cost of $i$ for the former process and a cost of at most $\lceil \log d \rceil$ for the latter process. Thus $|T| \le \mathrm{cost}(\pi) + N \lceil \log d \rceil$.

LEMMA 4.6. Given a decision tree $T$ for $D$, we can construct an ordering $\pi$ for $\mathcal{S}$ such that $\mathrm{cost}(\pi) \le |T| + N$.

PROOF. Starting from the root node, traverse the tree always taking the 0-branch until an entity $\widehat{x}$ (a leaf node) is reached. Let $b'_1, b'_2, \ldots, b'_r$ be the sequence of attributes seen in this traversal. Construct an ordering $\pi$ by writing down the sets corresponding to these attributes (each attribute $b'_i$ can either be a main attribute or an auxiliary attribute; in either case, we write down the corresponding set). The sequence may not include all sets. However, except for $\widehat{x}$, all the other entities are covered by the sets in the sequence $\pi$. We deal with $\widehat{x}$ by appending to $\pi$ any set that includes $\widehat{x}$ (and, to be formally compliant with the definition of MSSC solutions, the sets not listed in $\pi$ are appended in any arbitrary order). Notice that $\mathrm{cost}(\pi) \le |T| + N$ (the extra $N$ is included for handling the cost of covering $\widehat{x}$).

For convenience, we switch over to average costs instead of absolute costs. If $\pi$ is an ordering of an MSSC instance having $N$ items, we define $\mathrm{cost}_a(\pi) = \mathrm{cost}(\pi)/N$.
Similarly, if $T$ is a decision tree of a table $D$ having $N$ entities, we define $|T|_a = |T|/N$. An important property of $d$-MSSC is that every set can cover at most $d$ (a constant) items and so, in any solution, the average cost is at least $N/2d$. The following proposition formalizes the claim.

PROPOSITION 4.7. For any $d$ and any $d$-MSSC instance having $N$ items, the optimal ordering $\pi^*$ satisfies $\mathrm{cost}_a(\pi^*) \ge \frac{N}{2d}$.

LEMMA 4.8. Suppose for some constant $\alpha$, there exists an $\alpha$-approximation for the 2-UDT problem. Let $\delta > 0$ and $d$ be any constants. Then, there exists an algorithm for the $d$-MSSC problem and some constant $N_0$ such that the algorithm achieves an approximation factor of $(1 + \delta)\alpha$ on all instances whose universe contains at least $N_0$ items.

PROOF. Given a $d$-MSSC instance $M$ over a universe containing $N$ elements, construct a binary table $D$ using the (reduction) procedure described before. Use the $\alpha$-approximation algorithm to obtain a solution $T$ for $D$. Apply Lemma 4.6 to transform $T$ into a solution $\pi$ for $M$. Let $T^*$ and $\pi^*$ denote the optimal solutions for $D$ and $M$, respectively. By Lemmas 4.5 and 4.6, we can relate $\mathrm{cost}(\pi)$ and $\mathrm{cost}(\pi^*)$ as follows:

$$\mathrm{cost}_a(\pi) \le |T|_a + 1 \le \alpha\,|T^*|_a + 1 \le \alpha \cdot \mathrm{cost}_a(\pi^*) + \alpha \lceil \log d \rceil + 1.$$

We shall choose a suitable $N_0$ such that $(\alpha \lceil \log d \rceil + 1) \le \alpha\delta\,\mathrm{cost}_a(\pi^*)$, or equivalently, $\mathrm{cost}_a(\pi^*) \ge (\alpha \lceil \log d \rceil + 1)/(\alpha\delta)$. This would imply that the algorithm achieves an approximation factor of $\alpha(1 + \delta)$ on all instances having at least $N_0$ items. This task is accomplished by applying Proposition 4.7, which says that $\mathrm{cost}_a(\pi^*) \ge N/2d$. So, we fix

$$N_0 = \left\lceil \frac{2d\,(\alpha \lceil \log d \rceil + 1)}{\alpha\delta} \right\rceil.$$

We now prove Theorem 4.4. Suppose there exists an $\alpha$-approximation algorithm for the 2-UDT problem for some constant $\alpha < 2$. Choose $\delta > 0$ such that $(1 + \delta)\alpha < 2$ and let $\beta = (1 + \delta)\alpha$. Invoke Theorem 4.2 to obtain a constant $d$ such that it is NP-hard to approximate $d$-MSSC within a factor of $\beta$. Now by Lemma 4.8, there exists an algorithm for the $d$-MSSC problem that has an approximation ratio of $\beta$ on all instances over a universe of size at least $N_0$.
For instances having a smaller universe, we can perform an exhaustive search in polynomial time, since $N_0$ is a constant. This means that NP = P. We have proved the theorem.

5. RAMSEY NUMBERS AND ERDŐS'S CONJECTURE

In this section, we take a closer look at our approximation ratio and discuss its connection to a Ramsey-theoretic conjecture by Erdős. We presented an algorithm for the WDT problem having an approximation ratio of $O(r_K \log N)$. Let us now focus on bounds for the inverse Ramsey numbers $r_n$, for $n \ge 1$. Recall that for any $k$, $R_k \ge (3^k + 1)/2$ [Nesetril and Rosenfeld 2001; Schur 1916]. From this we get that for any $n$, $r_n \le 2 + 0.64 \log n$. Notice that any improvement in the upper bound of $r_n$ would automatically improve our approximation ratio. Better upper bounds are known for $r_n$ (see Nesetril and Rosenfeld [2001], Exoo [1994], and Chung and Grinstead [1983]), but they improve the preceding bound only by constant factors. We observe that the upper bound for $r_n$ cannot be improved significantly because of the following result: $R_k \le 1 + \lfloor k!\,e \rfloor$ [West 2001], which implies $r_n = \Omega(\log n / \log\log n)$.

Observe that our approximation ratio actually involves $\vec{r}_n$, rather than $r_n$. Therefore, one can try to derive a better upper bound on $\vec{r}_n$. Unfortunately, we show that $\vec{r}_n = \Omega(\log n / \log\log n)$ as well. The claim is implied by the following theorem, which can be proved by an argument similar to the one used to obtain the same bound for $R_k$.

THEOREM 5.1. For any $k$, $\vec{R}_k \le 1 + \lfloor k!\,e \rfloor$.

PROOF. Let $n = \vec{R}_k - 1$. By the definition of $\vec{R}_k$, $\vec{G}_n$ can be directed Ramsey colored using only $k$ colors. Let $\tau$ be such a coloring. Pick any vertex of $\vec{G}_n$, say the vertex $u$. We first claim that $\tau$ can be transformed to be symmetric with respect to $u$, meaning that we can modify $\tau$ in such a way that for any other vertex $x$, the edge $(x,u)$ gets the same color as $(u,x)$.
This is accomplished by considering each vertex $x$ and (locally) relabeling the colors assigned to its outgoing edges so that $(x,u)$ gets the same color as $(u,x)$. This does not increase the number of colors. From now on, we assume that $\tau$ has been modified in this manner.

There are $n - 1$ outgoing edges from $u$, which are colored using $k$ colors. So, there must exist a color class having at least $(n-1)/k$ edges; that is, there exists a color $c$ such that the set of vertices $V = \{x \mid \tau(u,x) = c\}$ satisfies $|V| \ge (n-1)/k$. The main observation is that for any $x, y \in V$, the edge $(x,y)$ cannot have $c$ as its color. The observation can be seen as follows. We have $\tau(u,x) = \tau(u,y) = c$, and so $\tau(x,y)$ should be different from $\tau(x,u)$, by the definition of directed Ramsey colorings. On the other hand, $\tau(x,u) = c$, because of the transformation that we performed. Therefore $\tau(x,y) \ne c$. To summarize, we have argued that the color $c$ is not assigned to any edge in the subgraph induced by $V$. Therefore, only $k - 1$ colors are used for the edges of this subgraph. It follows that $|V| \le \vec{R}_{k-1} - 1$. Putting together the lower and upper bounds on $|V|$, we get that $(n-1)/k \le |V| \le \vec{R}_{k-1} - 1$. Hence, $n \le k(\vec{R}_{k-1} - 1) + 1$. Since $n = \vec{R}_k - 1$, we have established the following recurrence relation on the directed Ramsey numbers: $\vec{R}_k \le 2 + k(\vec{R}_{k-1} - 1)$, with the boundary condition $\vec{R}_1 = 3$. By solving the recurrence relation, we get that for any $k \ge 1$,

$$\vec{R}_k \le 2 + k! \sum_{i=0}^{k-1} \frac{1}{i!}.$$

The right-hand side is at most $1 + \lfloor k!\,e \rfloor$. The theorem is proved.

Notice that there is a gap between the upper and lower bounds for $\vec{R}_k$. Erdős conjectured that for some constant $\alpha$, for all $k$, $R_k \le \alpha^k$. This is equivalent to $r_n = \Theta(\log n)$. We discuss the implication of our results in possibly proving the conjecture. The idea is to show that, in terms of worst-case performance factors, the rounded greedy algorithm performs poorly! We observe that a lower bound of $\Omega(\log K \log N)$ on the approximation ratio of the rounded greedy algorithm would imply the conjecture.
More explicitly, we note that the following hypothesis implies the conjecture.

Hypothesis. There exists a constant $\beta > 0$ such that for any $K$, there exist a $K$-WDT table $D$ and a weight function $w(\cdot)$ on which the tree $T$ produced by the rounded greedy algorithm satisfies $w(T) \ge (\beta \log K \log N)\,w(T^*)$, where $T^*$ is the optimal solution.

A result by Garey and Graham [1974] could be a starting point for constructing such instances. They analyzed the worst-case performance of the greedy procedure for the 2-UDT problem and, by constructing counterexamples, obtained a lower bound of $\Omega(\log N / \log\log N)$ on the approximation ratio of the procedure.

One can also attempt to prove the conjecture under the assumption NP $\ne$ P, by showing that it is NP-hard to approximate the $K$-WDT problem within a factor of $\Omega(\log K \log N)$: more precisely, by exhibiting a constant $c > 0$ and showing that for all $K \ge 2$, it is NP-hard to approximate the $K$-WDT problem within a factor of $c \log K \log N$. However, as mentioned in the Introduction, extending the $O(\log N)$-approximation algorithm for the UDT problem by Chakaravarthy et al. [2009] to the weighted case would rule out this approach.

6. CONCLUSION AND OPEN PROBLEMS

We studied the problem of constructing good decision trees for entity identification, in the general setup where attributes are multivalued and the entities are associated with probabilities. We designed an algorithm and proved an approximation ratio involving Ramsey numbers, and also presented hardness results. There are several interesting open questions. An obvious avenue is to bridge the gap between the approximation ratio and the hardness factor for 2-UDT, $K$-UDT, and WDT. The directed Ramsey numbers $\vec{r}_n$ introduced in this article pose challenging open problems: Is $\vec{r}_n = r_n$ for all $n$? Is $\vec{r}_n = O(\log n / \log\log n)$?
Proving the second statement in the affirmative would improve our approximation ratios. If both statements are shown to be true, then the conjecture by Erdős would be disproved! Finally, it would be interesting if the conjecture could be proved using the approach suggested.

ACKNOWLEDGMENTS

We thank A. Guillory for useful discussions and the anonymous referees for helpful comments.

REFERENCES

ADLER, M. AND HEERINGA, B. 2008. Approximating optimal binary decision trees. In Proceedings of the 11th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems. Lecture Notes in Computer Science, vol. 5171. Springer, Berlin, 1–9.
CHAKARAVARTHY, V., PANDIT, V., ROY, S., AWASTHI, P., AND MOHANIA, M. 2007. Decision trees for entity identification: Approximation algorithms and hardness results. In Proceedings of the 26th ACM Symposium on Principles of Database Systems. ACM, New York, 53–62.
CHAKARAVARTHY, V., PANDIT, V., ROY, S., AND SABHARWAL, Y. 2009. Approximating decision trees with multiway branches. In Proceedings of the 36th International Colloquium on Automata, Languages and Programming. Lecture Notes in Computer Science, vol. 5555. Springer, Berlin.
CHUNG, F. AND GRINSTEAD, C. 1983. A survey of bounds for classical Ramsey numbers. J. Graph Theory 7, 25–37.
DASGUPTA, S. 2005. Analysis of a greedy active learning strategy. In Proceedings of the 17th Annual Conference on Neural Information Processing Systems. MIT Press, Cambridge, MA, 337–344.
EXOO, G. 1994. A lower bound for Schur numbers and multicolor Ramsey numbers. Electron. J. Combin. 1, R8.
FEIGE, U., LOVÁSZ, L., AND TETALI, P. 2004. Approximating min sum set cover. Algorithmica 40, 4, 219–234.
GAREY, M. 1970. Optimal binary decision trees for diagnostic identification problems. Ph.D. thesis, University of Wisconsin, Madison.
GAREY, M. 1972. Optimal binary identification procedures. SIAM J. Appl. Math. 23, 2, 173–186.
GAREY, M. AND GRAHAM, R. 1974.
Performance bounds on the splitting algorithm for binary testing. Acta Inf. 3, 347–355. GRAHAM,R.,ROTHSCHILD,B.,AND SPENCER, J. 1990. Ramsey Theory. John Wiley & Sons, New York. HEERINGA, B. 2006. Improving access to organized information. Ph.D. thesis, University of Massachusetts, Amherst. HEERINGA,B.AND ADLER, M. 2005. Approximating optimal decision trees. Tech. rep. TR 05-25, University of Massachusetts, Amherst. ACM Transactions on Algorithms, Vol. 7, No. 2, Article 15, Publication date: March 2011. 15:22 V. T. Chakaravarthy et al. HYAF IL ,L.AND RIVEST, R. 1976. Constructing optimal binary decision trees is NP-complete. Inf. Process. Lett. 5, 1, 15–17. KOSARAJU,S.,PRZYTYCKA,M.,AND BORGSTROM, R. 1999. On an optimal split tree problem. In Proceedings of the 5th International Workshop on Algorithms and Data Structures. Lecture Notes in Computer Science, vol. 1272. Springer, Berlin, 69–92. MORET, B. 1982. Decision trees and diagrams. ACM Comput. Surv. 14, 4, 593–623. MURTHY, S. 1998. Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining Knowl. Discov. 2, 4, 345–389. NESETRIL,J.AND ROSENFELD, M. 2001. I. Schur, C.E. Shannon and Ramsey numbers, a short story. Discr. Math. 229, 1-3, 185–195. PANKHURST, R. 1970. A computer program for generating diagnostic keys. Comput. J. 13, 2, 145–151. RADZISZOWSKI, S. 1994. Small Ramsey numbers. Electron. J. Combin. 1, 7. RAZ,R.AND SAFRA, S. 1997. A sub-constant error-probability low-degree test, and a sub-constant error- probability PCP characterization of NP. In Proceedings of the 29th ACM Symposium on Theory of Computing. ACM, New York. REYNOLDS,A.,DICKS,J.,ROBERTS,I.,WESSELINK,J.,IGLESIA,B.,ROBERT,V.,BOEKHOUT,T.,AND RAYWARD-SMITH,V. 2003. Algorithms for identiﬁcation key generation and optimization with application to yeast identiﬁca- tion. In Proceedings of EvoWorkshops. Lecture Notes in Computer Science, vol. 2611. Springer, Berlin, 107–118. SCHUR, I. 1916. 
Über die Kongruenz x^m + y^m ≡ z^m (mod p). Jber. Deutsch. Math.-Verein. 25, 114–117.
WEST, D. 2001. Introduction to Graph Theory. Prentice Hall.
WIJTZES, T., BRUGGEMAN, M., NOUT, M., AND ZWIETERING, M. 1997. A computer system for identification of lactic acid bacteria. Int. J. Food Microbiol. 38, 1, 65–70.

Received February 2008; revised December 2008; accepted April 2009

## Supplementary resource (1)

... In what follows, we show that the greedy policy can achieve an approximation guarantee of O(log n) for AIGS, which matches the best achievable ratio. A key step of our analysis is to map AIGS to the binary decision tree problem [5, 18]. ...

... There has been extensive research [5, 25] showing that the minimum (probability) weight of nodes has a significant impact on the approximation ratio for the decision tree problem. Fortunately, a rounding technique [5] can be used to tackle the negative impact of the minimum weight. In particular, for each node u, its weight is rounded from p(u) to w(u) as follows, ...

Preprint
Full-text available
Interactive graph search (IGS) uses human intelligence to locate the target node in a hierarchy, and can be applied to image classification, product categorization, and searching a database. Specifically, IGS aims to categorize an object from a given category hierarchy via several rounds of interactive queries. In each round of query, the search algorithm picks a category and receives a boolean answer on whether the object is under the chosen category.
The main efficiency goal asks for the minimum number of queries to identify the correct hierarchical category for the object. In this paper, we study the average-case interactive graph search (AIGS) problem, which aims to minimize the expected number of queries when the objects follow a probability distribution. We propose a greedy search policy that splits the candidate categories as evenly as possible with respect to the probability weights, which offers an approximation guarantee of $O(\log n)$ for AIGS when the category hierarchy is a directed acyclic graph (DAG), where $n$ is the total number of categories. Meanwhile, if the input hierarchy is a tree, we show that a constant approximation factor of $(1+\sqrt{5})/2$ can be achieved. Furthermore, we present efficient implementations of the greedy policy, namely GreedyTree and GreedyDAG, that can quickly categorize the object in practice. Extensive experiments in real-world scenarios are carried out to demonstrate the superiority of our proposed methods.

... In this chapter, we study an abstract stochastic optimization problem in the setting described above which unifies and generalizes many previously-studied problems such as optimal decision trees studied in [53], [63], [27], [20], [49] and [26], equivalence class determination (see [40] and [12]), decision region determination studied in [58], and submodular ranking studied in [6] and [55]. We obtain an algorithm with the best-possible approximation guarantee in all these special cases. ...

... Such a simple algorithm was previously unknown even in the special case of optimal decision tree, despite a large number of papers on this topic, including [53], [63], [27], [1], [20], [44], [49], [38] and [26]. ...

... The first O(log m)-approximation algorithm for optimal decision tree was obtained in [49], which is known to be best-possible from [20]. This result was extended to the equivalence class determination problem in [26]. ...
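The even-split greedy rule described for AIGS can be illustrated with a minimal sketch. This is a hypothetical simplification (a flat list of candidate masses rather than a real hierarchy), not the GreedyTree/GreedyDAG implementations from the preprint; the function name is illustrative.

```python
def pick_even_split(masses, total):
    """Return the index of the candidate category whose probability mass
    is closest to half of the remaining mass (the even-split rule)."""
    return min(range(len(masses)), key=lambda i: abs(masses[i] - total / 2.0))

# Toy example: subtree probability masses of the root's child categories.
masses = [0.1, 0.45, 0.25, 0.2]
choice = pick_even_split(masses, sum(masses))  # index 1 (mass 0.45 is nearest 0.5)
```

Querying the category closest to half the remaining probability mass halves the expected search space as fast as a single boolean answer allows.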
Thesis
This dissertation aims to consider different problems in the area of stochastic optimization, where we are provided with more information about the instantiation of the stochastic parameters over time. With uncertainty being an inseparable part of every industry, several applications can be modeled as discussed. In this dissertation we focus on three main areas of application: 1) ranking problems, which can be helpful for modeling product ranking, designing recommender systems, etc., 2) routing problems, which can cover applications in delivery, transportation, and networking, and 3) classification problems, with possible applications in medical diagnosis and chemical identification. We consider three types of solutions for these problems based on how we want to deal with the observed information: static, adaptive, and a priori solutions. In Chapter II, we study two general stochastic submodular optimization problems that we call Adaptive Submodular Ranking and Adaptive Submodular Routing. In the ranking version, we want to provide an adaptive sequence of weighted elements to cover a random submodular function with minimum expected cost. In the routing version, we want to provide an adaptive path of vertices to cover a random scenario with minimum expected length. We provide (poly)logarithmic approximation algorithms for these problems that (nearly) match or improve the best-known results for various special cases. We also implemented different variations of the ranking algorithm and observed that it outperforms other practical algorithms on real-world and synthetic data sets. In Chapter III, we consider the Optimal Decision Tree problem: an identification task that is widely used in active learning. We study this problem in the presence of noise, where we want to perform a sequence of tests with possibly noisy outcomes to identify a random hypothesis.
We give different static (non-adaptive) and adaptive algorithms for this task with almost logarithmic approximation ratios. We also implemented our algorithms on real-world and synthetic data sets and compared our results with an information-theoretic lower bound to show that in practice, our algorithms' value is very close to this lower bound. In Chapter IV, we focus on a stochastic vehicle routing problem called a priori traveling repairman, where we are given a metric and probabilities of each vertex being active. We want to provide an a priori master tour originating from the root that is shortcut later over the observed active vertices. Our objective is to minimize the expected total wait time of active vertices, where the wait time of a vertex is defined as the length of the path from the root to this vertex. We consider two benchmarks to evaluate the performance of an algorithm for this problem: the optimal a priori solution and the re-optimization solution. We provide two algorithms to compare with each of these benchmarks, with constant and logarithmic approximation ratios respectively.

... The goal is to identify the true classifier by querying labels at the minimum number of points in expectation (over the prior distribution). Other applications include entity identification in databases (Chakaravarthy et al. (2011)) and experimental design to choose the most accurate theory among competing candidates (Golovin et al. (2010)). ...

... The state-of-the-art result, due to Gupta et al. (2017), is an O(log m)-approximation for instances with arbitrary probability distributions and costs. Chakaravarthy et al. (2011) also showed that ODT ...

Preprint
Full-text available
A fundamental task in active learning involves performing a sequence of tests to identify an unknown hypothesis that is drawn from a known distribution.
This problem, known as optimal decision tree induction, has been widely studied for decades, and the asymptotically best-possible approximation algorithm has been devised for it. We study a generalization where certain test outcomes are noisy, even in the more general case when the noise is persistent, i.e., repeating a test gives the same noisy output. We design new approximation algorithms for both the non-adaptive setting, where the test sequence must be fixed a priori, and the adaptive setting, where the test sequence depends on the outcomes of prior tests. Previous work in the area assumed at most a logarithmic number of noisy outcomes per hypothesis and provided approximation ratios that depended on parameters such as the minimum probability of a hypothesis. Our new approximation algorithms provide guarantees that are nearly best-possible and work for the general case of a large number of noisy outcomes per test or per hypothesis, where the performance degrades smoothly with this number. In fact, our results hold in a significantly more general setting, where the goal is to cover stochastic submodular functions. We evaluate the performance of our algorithms on two natural applications with noise: toxic chemical identification and active learning of linear classifiers. Despite our theoretical logarithmic approximation guarantees, our methods give solutions with cost very close to the information-theoretic minimum, demonstrating the effectiveness of our methods.

Article
Full-text available
Decision trees are popular classification models, providing high accuracy and intuitive explanations. However, as the tree size grows, the model's interpretability deteriorates. Traditional tree-induction algorithms, such as C4.5 and CART, rely on impurity-reduction functions that promote the discriminative power of each split. Thus, although these traditional methods are accurate in practice, there has been no theoretical guarantee that they will produce small trees.
In this paper, we justify the use of a general family of impurity functions, including the popular entropy and Gini-index functions, in scenarios where small trees are desirable, by showing that a simple enhancement can equip them with complexity guarantees. We consider a general setting, where objects to be classified are drawn from an arbitrary probability distribution, classification can be binary or multi-class, and splitting tests are associated with non-uniform costs. As a measure of tree complexity, we adopt the expected cost to classify an object drawn from the input distribution, which, in the uniform-cost case, is the expected number of tests. We propose a tree-induction algorithm that gives a logarithmic approximation guarantee on the tree complexity. This approximation factor is tight up to a constant factor under mild assumptions. The algorithm recursively selects a test that maximizes a greedy criterion defined as a weighted sum of three components. The first two components encourage the selection of tests that improve the balance and the cost-efficiency of the tree, respectively, while the third, impurity-reduction component encourages the selection of more discriminative tests. As shown in our empirical evaluation, compared to the original heuristics, the enhanced algorithms strike an excellent balance between predictive accuracy and tree complexity.

Conference Paper

Article
Full-text available
Lazy graph search algorithms are efficient at solving motion planning problems where edge evaluation is the computational bottleneck. These algorithms work by lazily computing the shortest potentially feasible path, evaluating edges along that path, and repeating until a feasible path is found. The order in which edges are selected is critical to minimizing the total number of edge evaluations: a good edge selector chooses edges that are not only likely to be invalid, but also eliminate future paths from consideration.
We wish to learn such a selector by leveraging prior experience. We formulate this problem as a Markov Decision Process (MDP) on the state of the search problem. While solving this large MDP is generally intractable, we show that we can compute oracular selectors that solve the MDP during training. With access to such oracles, we use imitation learning to find effective policies. If new search problems are sufficiently similar to problems solved during training, the learned policy will choose a good edge evaluation ordering and solve the motion planning problem quickly. We evaluate our algorithms on a wide range of 2D and 7D problems and show that the learned selector outperforms baseline commonly used heuristics. We further provide a novel theoretical analysis of lazy search in a Bayesian framework as well as regret guarantees on our imitation-learning-based approach to motion planning.

Preprint
Full-text available
In the stochastic submodular cover problem, the goal is to select a subset of stochastic items of minimum expected cost to cover a submodular function. Solutions in this setting correspond to sequential decision processes that select items one by one "adaptively" (depending on prior observations). While such adaptive solutions achieve the best objective, their inherently sequential nature makes them undesirable in many applications. We ask: how well can solutions with only a few adaptive rounds approximate fully-adaptive solutions? We give nearly tight answers for both independent and correlated settings, proving smooth tradeoffs between the number of adaptive rounds and the solution quality, relative to fully adaptive solutions. Experiments on synthetic and real datasets show qualitative improvements in the solutions as we allow more rounds of adaptivity; in practice, solutions with a few rounds of adaptivity are nearly as good as fully adaptive solutions.
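The submodular cover abstract above concerns adaptivity, but the base greedy rule it builds on is the classical one for covering problems. Below is a minimal non-adaptive sketch for the deterministic set-cover special case; it is an illustrative toy under simplifying assumptions, not the adaptive-rounds algorithm of the preprint.

```python
def greedy_cover_order(universe, sets):
    """Classical greedy for (submodular) set cover: repeatedly pick the set
    with the largest marginal coverage until the universe is covered.
    Returns the indices of the chosen sets, in order."""
    covered, order = set(), []
    while covered != universe:
        i = max(range(len(sets)), key=lambda j: len(sets[j] - covered))
        gain = sets[i] - covered
        if not gain:
            raise ValueError("the given sets cannot cover the universe")
        covered |= gain
        order.append(i)
    return order

order = greedy_cover_order({1, 2, 3, 4, 5}, [{1, 2, 3}, {3, 4}, {4, 5}, {1}])
# picks set 0 (gains 1,2,3), then set 2 (gains 4,5) -> [0, 2]
```

In the stochastic/adaptive setting the same rule is applied round by round, with marginal gains recomputed against the observations made so far.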
Preprint
Full-text available
In the problem of active sequential hypothesis testing (ASHT), a learner seeks to identify the true hypothesis $h^*$ from among a set of hypotheses $H$. The learner is given a set of actions and knows the outcome distribution of any action under any true hypothesis. While repeatedly playing the entire set of actions suffices to identify $h^*$, a cost is incurred with each action. Thus, given a target error $\delta > 0$, the goal is to find the minimal-cost policy for sequentially selecting actions that identify $h^*$ with probability at least $1 - \delta$. This paper provides the first approximation algorithms for ASHT, under two types of adaptivity. First, a policy is partially adaptive if it fixes a sequence of actions in advance and adaptively decides when to terminate and what hypothesis to return. Under partial adaptivity, we provide an $O\big(s^{-1}(1+\log_{1/\delta}|H|)\log (s^{-1}|H| \log |H|)\big)$-approximation algorithm, where $s$ is a natural separation parameter between the hypotheses. Second, a policy is fully adaptive if action selection is allowed to depend on previous outcomes. Under full adaptivity, we provide an $O(s^{-1}\log (|H|/\delta)\log |H|)$-approximation algorithm. We numerically investigate the performance of our algorithms using both synthetic and real-world data, showing that our algorithms outperform a previously proposed heuristic policy.
Chapter
Full-text available
We give a (ln n + 1)-approximation for the decision tree (DT) problem. An instance of DT is a set of m binary tests T = (T_1, ..., T_m) and a set of n items X = (X_1, ..., X_n). The goal is to output a binary tree where each internal node is a test, each leaf is an item, and the total external path length of the tree is minimized. Total external path length is the sum of the depths of all the leaves in the tree. DT has a long history in computer science, with applications ranging from medical diagnosis to experiment design. It also generalizes the problem of finding optimal average-case search strategies in partially ordered sets, which includes several alphabetic tree problems. Our work decreases the previous upper bound on the approximation ratio by a constant factor. We provide a new analysis of the greedy algorithm that uses a simple accounting scheme to spread the cost of a tree among pairs of items split at a particular node. We conclude by showing that our upper bound also holds for the DT problem with weighted tests.
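A greedy construction of the kind analyzed above can be sketched as follows. The even-split selection rule used here is an illustrative stand-in for the exact greedy criterion of the chapter, and the function name is hypothetical; the block only demonstrates how a greedy DT and its total external path length are computed.

```python
def build_greedy_dt(items, tests):
    """Greedy decision tree: at each node pick the binary test that splits the
    remaining items most evenly, recurse on both sides, and return the total
    external path length (the sum of all leaf depths)."""
    def rec(remaining, avail, depth):
        if len(remaining) <= 1 or not avail:
            return depth * len(remaining)
        best = min(avail, key=lambda t: abs(
            sum(1 for x in remaining if t(x)) - len(remaining) / 2.0))
        yes = [x for x in remaining if best(x)]
        no = [x for x in remaining if not best(x)]
        if not yes or not no:  # even the most balanced test fails to split
            return depth * len(remaining)
        rest = [t for t in avail if t is not best]
        return rec(yes, rest, depth + 1) + rec(no, rest, depth + 1)
    return rec(items, tests, 0)

# Four items identified by two bit tests: every leaf ends at depth 2.
cost = build_greedy_dt(list(range(4)), [lambda x: x & 1, lambda x: x & 2])  # 8
```

With four items and two perfectly balanced tests, every leaf sits at depth 2, so the total external path length is 4 × 2 = 8.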
Article
Full-text available
The input to the min sum set cover problem is a collection of n sets that jointly cover m elements. The output is a linear order on the sets, namely, in every time step from 1 to n exactly one set is chosen. For every element, this induces a first time step by which it is covered. The objective is to find a linear arrangement of the sets that minimizes the sum of these first time steps over all elements. We show that a greedy algorithm approximates min sum set cover within a ratio of 4. This result was implicit in work of Bar-Noy, Bellare, Halldorsson, Shachnai, and Tamir (1998) on chromatic sums, but we present a simpler proof. We also show that for every ε > 0, achieving an approximation ratio of 4 – ε is NP-hard. For the min sum vertex cover version of the problem (which comes up as a heuristic for speeding up solvers of semidefinite programs) we show that it can be approximated within a ratio of 2, and is NP-hard to approximate within some constant ρ > 1.
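The greedy algorithm analyzed in the abstract above picks, at each time step, the set covering the most still-uncovered elements. A minimal sketch of this rule and of the min-sum objective (function name illustrative):

```python
def min_sum_cover_cost(sets, universe):
    """Greedy for min sum set cover: at each time step pick the set covering
    the most still-uncovered elements; return the sum, over all elements, of
    the time step at which each element is first covered."""
    covered, cost, step = set(), 0, 0
    while covered != universe:
        step += 1
        best = max(sets, key=lambda s: len(s - covered))
        newly = best - covered
        if not newly:
            raise ValueError("the given sets cannot cover the universe")
        cost += step * len(newly)
        covered |= newly
    return cost

# {1,2,3} covered at step 1, {4} at step 2, {5} at step 3.
cost = min_sum_cover_cost([{1, 2, 3}, {3, 4}, {5}], {1, 2, 3, 4, 5})  # 3*1 + 1*2 + 1*3 = 8
```

The objective charges each element the time step of its first coverage, so front-loading large marginal gains is exactly what the 4-approximation analysis rewards.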
Conference Paper
Full-text available
We consider the problem of constructing decision trees for entity identification from a given table. The input is a table containing information about a set of entities over a fixed set of attributes. The goal is to construct a decision tree that identifies each entity unambiguously by testing the attribute values such that the average number of tests is minimized. The previously best known approximation ratio for this problem was O(log² N). In this paper, we present a new greedy heuristic that yields an improved approximation ratio of O(log N).
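A greedy heuristic for entity identification can be sketched as follows. The balanced-split criterion used here (test the attribute minimizing the largest resulting group) is an illustrative stand-in, not necessarily the exact greedy criterion of the cited paper, and the function names are hypothetical.

```python
from collections import defaultdict

def greedy_entity_tree_cost(entities, attrs):
    """Greedy multiway identification tree: at each node, test the attribute
    whose values split the remaining entities into the most balanced groups
    (minimizing the largest group). Returns the total number of tests summed
    over all entities, i.e., N times the average under a uniform distribution."""
    def split(remaining, attr):
        groups = defaultdict(list)
        for e in remaining:
            groups[e[attr]].append(e)
        return list(groups.values())

    def rec(remaining, avail, depth):
        if len(remaining) <= 1 or not avail:
            return depth * len(remaining)
        best = min(avail, key=lambda a: max(len(g) for g in split(remaining, a)))
        groups = split(remaining, best)
        if len(groups) == 1:  # no remaining attribute distinguishes these entities
            return depth * len(remaining)
        rest = [a for a in avail if a != best]
        return sum(rec(g, rest, depth + 1) for g in groups)

    return rec(entities, attrs, 0)

entities = [{"color": "red", "shape": "a"}, {"color": "red", "shape": "b"},
            {"color": "green", "shape": "a"}, {"color": "green", "shape": "b"}]
total_tests = greedy_entity_tree_cost(entities, ["color", "shape"])  # 8: each of 4 entities needs 2 tests
```

Dividing the returned total by the number of entities gives the average number of tests that the problem statement asks to minimize.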
Article
Article
Binary identification problems model a variety of actual problems, all containing the requirement that one construct a testing procedure for identifying a single unknown object belonging to a known finite set of possibilities. They arise in connection with machine fault-location, medical diagnosis, species identification, and computer programming. The author describes the basic model for binary identification problems and presents a number of general results, including a dynamic programming algorithm for constructing optimal identification procedures. The main results of the paper concern identification problems in which the object possibilities are naturally partitioned into similarity classes, with available tests of two types: general tests, which only distinguish between similarity classes, and specific tests, which each test for a single one of the possibilities. This structure is utilized to obtain considerable improvement over the general dynamic programming algorithm.
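The dynamic programming idea mentioned in the abstract can be sketched for the unweighted, uniform-probability case: recurse over subsets of still-possible objects, and at each node charge one test to every remaining object. This is an exponential-time illustration of the recurrence under those simplifying assumptions, not the paper's refined procedure; names are illustrative.

```python
from functools import lru_cache

def optimal_dt_cost(n_items, tests):
    """Exact dynamic program over subsets: minimum total external path length
    of a tree identifying one of n_items equally likely items, where each
    test is given as the frozenset of items answering 'yes'."""
    @lru_cache(maxsize=None)
    def best(remaining):
        if len(remaining) <= 1:
            return 0
        options = []
        for t in tests:
            yes, no = remaining & t, remaining - t
            if yes and no:  # a useful test must actually split the remaining items
                # every remaining item descends one more level: charge len(remaining)
                options.append(len(remaining) + best(yes) + best(no))
        if not options:
            raise ValueError("remaining items cannot be distinguished")
        return min(options)

    return best(frozenset(range(n_items)))

cost = optimal_dt_cost(4, [frozenset({1, 3}), frozenset({2, 3})])  # 8
```

Memoizing on the subset of remaining possibilities is what makes this dynamic programming rather than plain enumeration, but the number of reachable subsets can still be exponential, which is what the similarity-class structure in the paper is exploited to avoid.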