Decision Trees for Entity Identiﬁcation: Approximation Algorithms
and Hardness Results
VENKATESAN T. CHAKARAVARTHY, VINAYAKA PANDIT, and
SAMBUDDHA ROY, IBM India Research Lab
PRANJAL AWASTHI, Carnegie Mellon University
MUKESH K. MOHANIA, IBM India Research Lab
We consider the problem of constructing decision trees for entity identification from a given relational table. The input is a table containing information about a set of entities over a fixed set of attributes, and a probability distribution over the set of entities that specifies the likelihood of the occurrence of each entity. The goal is to construct a decision tree that identifies each entity unambiguously by testing the attribute values, such that the average number of tests is minimized. This classical problem finds such diverse applications as efficient fault detection, species identification in biology, and efficient diagnosis in the field of medicine. Prior work mainly deals with the special case where the input table is binary and the probability distribution over the set of entities is uniform. We study the general problem involving arbitrary input tables and arbitrary probability distributions over the set of entities. We consider a natural greedy algorithm and prove an approximation guarantee of O(r_K · log N), where N is the number of entities and K is the maximum number of distinct values of an attribute. The value r_K is a suitably defined Ramsey number, which is at most log K. We show that it is NP-hard to approximate the problem within a factor of Ω(log N), even for binary tables (i.e., K = 2). Thus, for the case of binary tables, our approximation algorithm is optimal up to constant factors (since r_2 = 2). In addition, our analysis indicates a possible way of resolving a Ramsey-theoretic conjecture by Erdős.
Categories and Subject Descriptors: F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems
General Terms: Algorithms, Theory
Additional Key Words and Phrases: Approximation algorithms, decision tree, Ramsey numbers
ACM Reference Format:
Chakaravarthy, V. T., Pandit, V., Roy, S., Awasthi, P., and Mohania, M. K. 2011. Decision trees for entity
identiﬁcation: Approximation algorithms and hardness results. ACM Trans. Algor. 7, 2, Article 15 (March
2011), 22 pages.
DOI =10.1145/1921659.1921661 http://doi.acm.org/10.1145/1921659.1921661
A preliminary version of the article was presented at the ACM Symposium on Principles of Database Systems [Chakaravarthy et al. 2007].
This work was done while P. Awasthi was at IBM India Research Lab, New Delhi.
Authors' addresses: V. T. Chakaravarthy, V. Pandit, and S. Roy, IBM India Research Lab, 4 Block C, Institutional Area, Vasanth Kunj, New Delhi – 110070, India; email: {vechakra, pvinayak, sambuddha}@in.ibm.com; P. Awasthi, Computer Science Department, Wean Hall 1313, Carnegie Mellon University, Pittsburgh, PA 15213; email: pawasthi@cs.cmu.edu; M. K. Mohania, IBM India Research Lab, 4 Block C, Institutional Area, Vasanth Kunj, New Delhi – 110070, India; email: mkmohania@in.ibm.com.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for proﬁt or commercial advantage and that
copies show this notice on the ﬁrst page or initial screen of a display along with the full citation. Copyrights for
components of this work owned by others than ACM must be honored. Abstracting with credit is permitted.
To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this
work in other works requires prior speciﬁc permission and/or a fee. Permissions may be requested from
Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.
© 2011 ACM 1549-6325/2011/03-ART15 $10.00
DOI 10.1145/1921659.1921661 http://doi.acm.org/10.1145/1921659.1921661
ACM Transactions on Algorithms, Vol. 7, No. 2, Article 15, Publication date: March 2011.
1. INTRODUCTION
Decision trees for the purposes of identification and diagnosis have been studied for a long time [Moret 1982]. Consider a typical medical diagnosis application. A hospital maintains a table containing information about diseases. Each row in the table is a disease, each column is a medical test, and the corresponding entry specifies the outcome of the test for a person suffering from the given disease. Some of the medical tests are costly (e.g., MRI scans) and some require a few days for the result to be known (e.g., blood cultures). When the hospital receives a new patient whose disease has not been identified, it would like to determine the shortest sequence of tests that can unambiguously determine the disease of the patient. Such a capability would help reduce the patients' expenses and allow treatment to begin early. Motivated by such applications, we consider
the problem of constructing decision trees for entity identiﬁcation from the given data.
Decision Trees for Entity Identification—Problem Statement. The input is a table D having N rows and m columns. Each row is called an entity and the columns are the attributes of these entities. Additionally, we are also given a probability distribution P over the set of entities. For each entity e, P specifies p(e), the likelihood of the occurrence of e. A solution is a decision tree in which each internal node is labeled by an attribute and its branches are labeled by the values that the attribute can take. The entities are the leaves of the tree. The main requirement is that the tree should identify each entity correctly. The cost of the tree is the expected distance of an entity from the root, that is, Σ_e p(e)·d(e), where d(e) is the distance of the entity e from the root. The goal is to construct a decision tree with the minimum cost. We call this the WDT problem (here, W stands for "weight" and stresses the fact that the entities are associated with probabilities/weights).
Example 1.1. Figure 1 shows an example table and two decision trees for it. In this example, the probability distribution over the entities is uniform, that is, p(e_i) = 1/6 for each entity e_i. In the first decision tree, the distance d(e_1) is 2 and d(e_4) is 3. The cost of the first decision tree is 14/6 and that of the second decision tree is 8/6. The second decision tree happens to be an optimal tree for this instance.
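The cost computation in the example is easy to sketch in code. The snippet below is our illustration, not part of the original article: the depths are hypothetical values consistent with the stated facts (d(e_1) = 2, d(e_4) = 3, total cost 14/6); the actual tree shapes are in Figure 1.

```python
def tree_cost(depths, probs):
    # cost of a decision tree: sum over entities of p(e) * d(e),
    # where d(e) is the entity's distance from the root
    return sum(probs[e] * depths[e] for e in probs)

# uniform distribution over the six entities of Example 1.1
probs = {f"e{i}": 1 / 6 for i in range(1, 7)}
# hypothetical depths for the first tree: the example fixes d(e1) = 2
# and d(e4) = 3; the remaining depths are chosen so the cost is 14/6
depths = {"e1": 2, "e2": 2, "e3": 2, "e4": 3, "e5": 3, "e6": 2}
print(tree_cost(depths, probs))  # 14/6, roughly 2.333
```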
For a given table, the maximum number of distinct values that any attribute takes is called its branching factor. In the preceding example, the branching factor of the given table is 5, because every attribute takes at most 5 distinct values and the attribute B attains the maximum of 5. Interesting special cases of the WDT problem can be obtained in two ways:
—the case in which every input instance is required to have a branching factor of at most K, for some constant K; we call this the K-WDT problem. Of particular interest is the 2-WDT problem, where the tables are binary.
—the case in which the probability distribution over the set of entities is known to be uniform; we call this the UDT problem (here, U stresses the fact that the probabilities/weights are uniform).
The special case in which both of these restrictions apply is called the K-UDT problem.
Prior Results and Our Results. Much of the previous literature deals with the restricted 2-UDT problem. Hyafil and Rivest [1976] showed that the 2-UDT problem is NP-hard. Garey [1970, 1972] presented a dynamic-programming-based algorithm for the 2-UDT problem that finds the optimal solution, but the algorithm runs in exponential time in the worst case.
Fig. 1. Example decision trees.
Kosaraju et al. [1999] presented a greedy algorithm for the 2-WDT problem with an approximation ratio of O(log N); the approximation ratio remains the same for the special case of the 2-UDT problem. Independently, Dasgupta [2005] showed that the same greedy heuristic has an approximation ratio of 4 log N for the 2-UDT problem. Recently, Heeringa and Adler [2005] gave an alternative analysis of the same greedy algorithm and obtained a slightly improved approximation ratio of (1 + ln N) for the 2-UDT problem (see also Heeringa [2006] and Adler and Heeringa [2008]). They also showed that it is NP-hard to approximate the 2-UDT problem within a ratio of (1 + ε), for some ε > 0. We study the problem in its full generality, namely the WDT problem, where the attributes can take multiple values and the input probability distribution can be arbitrary. This occurs commonly, for example, in medical diagnosis applications (e.g., blood group can take multiple values; some diseases are more prevalent than others).
We present two approximation algorithms for the UDT problem. The first one is a simple algorithm that uses any given α-approximation algorithm for the 2-UDT problem as a black box and provides an α·log K approximation for the K-UDT problem. In particular, using the algorithm of Heeringa and Adler [2005] as the black box, we obtain an algorithm with an approximation ratio of log K(1 + ln N). Our second algorithm for the UDT problem uses a greedy heuristic and has an approximation ratio of r_K(1 + ln N), where r_K is a suitably defined Ramsey number which is at most 2 + 0.64 log K. Our analysis builds on that of Heeringa and Adler [2005] and uses additional combinatorial arguments. The highlight of our analysis is that it establishes connections to Ramsey numbers and a conjecture by Erdős (see what follows for more details). Furthermore, notice that the second algorithm offers a constant factor improvement over the first algorithm.
Remark 1.2. We note that, subsequent to our work, Chakaravarthy et al. [2009] considered a slightly different greedy heuristic for the UDT problem and showed an approximation ratio of 4 log N.
Next we consider the general WDT problem. We first observe that by combining the black-box approach with the algorithm of Kosaraju et al. [1999], we get an O(log K · log N) approximation ratio for the WDT problem. We also show how to extend our analysis for the UDT problem to handle weights and obtain an algorithm with an approximation ratio of O(r_K log N). This provides an alternative way of getting the result obtained via the black-box approach.
We next focus on the hardness of approximating various versions of the problem. We show that it is NP-hard to approximate the 2-WDT problem within a ratio of Ω(log N). This implies that the O(log N)-approximation algorithm of Kosaraju et al. [1999] for the 2-WDT problem is optimal up to constant factors. We also improve the hardness of approximation for the unweighted version of the problem. We show that it is NP-hard to approximate the UDT and the 2-UDT problems within ratios of (4 − ε) and (2 − ε), respectively, for any ε > 0. The results are summarized in Figure 2.
Fig. 2. Summary of results.
Ramsey Numbers and Connections to Erdős's Conjecture. Our analysis of the approximation algorithms has interesting connections with Ramsey theory and an unresolved conjecture by Erdős. Ramsey theory, treated at length in the book by Graham et al. [1990], deals with coloring the edges of complete graphs (or hypergraphs) with a specified number of colors satisfying certain constraints. For our purposes, we need the following specific type of Ramsey numbers.
For n > 0, let G_n denote the complete graph on n vertices. A k-coloring of G_n is a coloring of the edges of G_n using k colors. For k > 0, R_k is defined to be the smallest number n such that any k-coloring of G_n contains a monochromatic triangle.¹ The inverses of the Ramsey numbers are more convenient for our purposes. For n > 0, we define r_n to be the smallest number k such that we can color the edges of G_n using only k colors without inducing any monochromatic triangle.
The exact values of the Ramsey numbers for k > 3 are not known. However, it is known that for any k, (3^k + 1)/2 ≤ R_k ≤ 1 + ⌈k!e⌉ (see West [2001], Nesetril and Rosenfeld [2001], and Schur [1916]). Erdős made the conjecture that for some constant α, for all k, R_k ≤ α^k.
In terms of the inverse Ramsey numbers, the previous bounds translate as follows: (i) for any n, r_n ≤ 2 + 0.64 log n = O(log n); (ii) r_n = Ω(log n / log log n). The Erdős conjecture now reads r_n = Θ(log n).
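These small Ramsey values can be checked mechanically. The sketch below (our illustration; function names are ours) verifies by exhaustive search that the edges of G_5 admit a 2-coloring with no monochromatic triangle, while every 2-coloring of G_6 contains one, that is, R_2 = 6 and hence r_6 ≥ 3.

```python
from itertools import combinations, product

def has_mono_triangle(n, coloring):
    # coloring maps each edge (i, j) with i < j to a color
    return any(
        coloring[(a, b)] == coloring[(a, c)] == coloring[(b, c)]
        for a, b, c in combinations(range(n), 3)
    )

def two_colorable_without_mono_triangle(n):
    # brute force over all 2-colorings of the edges of G_n
    edges = list(combinations(range(n), 2))
    return any(
        not has_mono_triangle(n, dict(zip(edges, colors)))
        for colors in product((0, 1), repeat=len(edges))
    )

print(two_colorable_without_mono_triangle(5))  # True: r_5 <= 2
print(two_colorable_without_mono_triangle(6))  # False: R_2 = 6
```

The search is feasible here only because G_6 has just 15 edges (2^15 colorings); exact values for larger k remain open, as noted above.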
Our results provide interesting approaches to address the conjecture. One approach is to exhibit a constant c > 0 and show that for all K ≥ 2, it is NP-hard to approximate the K-WDT problem within a factor of c·log K·log N. Notice that this would prove the conjecture under the assumption that NP ≠ P. However, we note that if the recent O(log N)-approximation algorithm for UDT by Chakaravarthy et al. [2009] can be extended to the weighted case, the preceding approach will be ruled out. Another way of proving the conjecture would be to construct a family of bad instances for our algorithm (which is a simple greedy heuristic). We discuss the details later in the article.
Applications and Related Work. Decision trees for entity identiﬁcation (as deﬁned
in this article) have been used for medical diagnosis (as described earlier), species
identiﬁcation in biology, fault detection, etc. [Moret 1982]. Taxonomists release ﬁeld
guides to help identify species based on their characteristics. These guides are often
presented in the form of decision trees labeled by species characteristics. Typically, a
ﬁeld biologist identiﬁes the species of a specimen at hand by referring to such guides
(hopefully with as few lookups as possible). Taxonomists refer to such decision trees as
“identiﬁcation keys” and an article on identiﬁcation keys can be found in Wikipedia.2
¹A monochromatic triangle is a triplet of vertices such that all three edges between them have the same color. In Ramsey theory, R_k is denoted R(3,3,...,3), where "3" is repeated k times. For example, it is known that R_1 = 3, R_2 = 6, R_3 = 17 [Radziszowski 1994].
²http://en.wikipedia.org/wiki/Dichotomous_key.
Computer programs and algorithms for identiﬁcation and diagnosis applications have
been developed for nearly four decades (e.g., Pankhurst [1970], Reynolds et al. [2003],
and Wijtzes et al. [1997]).
Murthy [1998] and Moret [1982] present excellent surveys on the use of decision trees in such diverse fields as machine learning, pattern recognition, taxonomy, switching theory, and Boolean logic.
2. PRELIMINARIES
In this section, we deﬁne the WDT problem and its special cases. We also develop some
notation used in the article.
Let D be a relational table having N tuples and m attributes. We call each tuple an entity. Let E and A denote the set of entities and the set of attributes, respectively. For x ∈ E and a ∈ A, x.a denotes the value of the entity x on the attribute a. For a ∈ A, V_a denotes the set of distinct values taken by a in D. Let K = max_{a∈A} |V_a|. Notice that K ≤ N. We call K the branching factor of D.
A decision tree T for the table D is a rooted tree satisfying the following properties. Each internal node u is labeled by an attribute a and has at most K children. Every branch (edge) out of u is labeled by a distinct value from the set V_a. The entities are the leaves of the tree, and thus the tree has exactly N leaves. The main requirement is that the tree should identify every entity correctly. In other words, for any entity x, the following traversal process should correctly lead to x. The process starts at the root node. Let u be the current node and a be the attribute label of u. Take the branch out of u labeled by x.a and move to the corresponding child of u. The requirement is that this traversal process should reach the entity x.
Observe that the values of the attributes are used only for taking the correct branch in the traversal process. So, we can map each value of an attribute to a distinct number from 1 to K and assume that V_a is a subset of {1, 2, ..., K}. In the rest of the article, we assume that for any x ∈ E and a ∈ A, x.a ∈ {1, 2, ..., K}.
For a tree T, we use "u ∈ T" to mean that u is an internal node in T. We denote by ⟨x, y⟩ an unordered pair of distinct entities.
Let T be a decision tree for D. For an entity x ∈ E, the path length of x is defined to be the number of internal nodes in the path from the root to x; it is denoted ℓ_T(x). The sum of all path lengths is called the total path length and is denoted ‖T‖, that is, ‖T‖ = Σ_{x∈E} ℓ_T(x).
Let w(·) be a weight function that assigns a real number w(x) > 0 to each x ∈ E. We define the cost of T with respect to w(·) as follows:
cost(T, w) = Σ_{x∈E} w(x)·ℓ_T(x).
We denote cost(T, w) by w(T).
As mentioned in the Introduction, the input to the WDT problem includes a probability distribution P over E specifying the likelihood of the occurrence of each entity, and the goal is to construct a tree having the minimum expected path length. We view probabilities as weights and assume that the distribution is specified as a weight function p(·) that associates a weight p(x) > 0 with each entity x. Notice that when an entity is chosen at random according to this distribution, the expected path length is given by p(T) = cost(T, p). We assume that the probabilities p(x) are given as rational numbers. We can write these numbers in such a way that for any entity x, p(x) = w(x)/L, where w(x) ≥ 1 is an integer and L is an integer giving the common denominator. So, without loss of generality, we assume the probability distribution is given as an integer weight function w(·) over the set of entities, that is, for all x ∈ E, w(x) ≥ 1 is an integer. Notice that p(T) = w(T)/L and hence finding an optimal T under p(·) and under w(·) are equivalent.
WDT Problem. The input is a relational table D and a probability distribution P represented as an integer weight function w(·). The goal is to construct a decision tree T having the minimum cost w(T).
For a positive integer K, the K-WDT problem is the special case of the WDT problem where the input table is required to have a branching factor of at most K. Notice that in the K-WDT problem, the input is a table whose entries are drawn from the set {1, 2, ..., K}.
Of particular interest is the special case called UDT, in which the probability distribution is uniform. In this problem, the weight function is given by w(x) = 1 for all x ∈ E. Note that the cost of a tree T is w(T) = ‖T‖. For an integer K ≥ 2, the special case of UDT where the input table is required to have a branching factor of at most K is called the K-UDT problem.
3. APPROXIMATION ALGORITHMS AND ANALYSIS
In this section, we present an algorithm for the WDT problem and prove an approximation ratio of O(r_K log N), where K refers to the branching factor of the input table. As mentioned in the Introduction, our analysis builds on that of Heeringa and Adler [2005] for the 2-UDT problem. In order to achieve our result, we have to extend their ideas to deal with two issues: first, the attributes can be multivalued as opposed to binary; second, the entities can have arbitrary weights. For ease of exposition, we first show how to address the issue of multivalued attributes. Then, we deal with the case of arbitrary weights. Specifically, Section 3.1 presents an algorithm and analysis for the UDT problem. These ideas are generalized in Section 3.2 to obtain an algorithm for the WDT problem.
3.1. The Unweighted Case: UDT Problem
This section deals with the UDT problem. Here, the probability distribution is uniform and so the weights of all the entities are 1. The goal is to find a tree T with the minimum cost ‖T‖.
We present two approximation algorithms for UDT. The first one uses any given α-approximation algorithm for 2-UDT as a black box and provides an α·log K approximation for the K-UDT problem. In particular, using the algorithm of Heeringa and Adler [2005] as the black box, we obtain an algorithm with an approximation ratio of log K(1 + ln N). Our second algorithm for the UDT problem uses a greedy heuristic and has an approximation ratio of r_K(1 + ln N). Recall that r_K ≤ 2 + 0.64 log K. Thus, the second algorithm offers a constant factor improvement over the first algorithm. The first approach has the advantage that any improvement in the approximation ratio for the 2-UDT problem automatically yields an improvement for the K-UDT problem. On the other hand, the second approach has the advantage that any improvement in the upper bound for r_K improves the approximation ratio.
3.1.1. The Black-Box Algorithm. Let A be an α-approximation algorithm for the 2-UDT problem. We show how to get an (α·⌈log K⌉)-approximation algorithm for the K-UDT problem. The idea is to encode the given UDT instance as a 2-UDT instance and then invoke the algorithm A on the encoded instance.
Given an N × m table D having a branching factor of K, we construct an N × m⌈log K⌉ binary table D_2 as follows. Each attribute in D is represented by ⌈log K⌉ attributes in D_2. The former attribute is called the original attribute and the latter attributes are called its derived attributes. The values appearing in an original attribute are represented in binary in the corresponding derived attributes. Invoke the algorithm A on the binary table D_2 and let T_2 be the decision tree returned by the algorithm. We obtain a decision tree T for D from T_2 by replacing the attributes in its internal nodes with their original attributes in D and labeling appropriately. Notice that ‖T‖ ≤ ‖T_2‖.
Given a tree T for D, we can construct a tree T_2 for D_2 such that ‖T_2‖ ≤ ⌈log K⌉·‖T‖. In constructing a decision tree T_2 for the encoded instance D_2, the main task is to take the correct branches of the internal nodes of T using the binary derived attributes. We achieve this by replacing each internal node with a complete binary tree of depth ⌈log K⌉ using the derived attributes of the original attribute of the internal node. Clearly, ‖T_2‖ ≤ ⌈log K⌉·‖T‖. This shows that ‖T*_2‖ ≤ ⌈log K⌉·‖T*‖, where T* and T*_2 are the optimal decision trees for D and D_2, respectively. Since ‖T_2‖ ≤ α‖T*_2‖, the solution T returned by the black-box algorithm satisfies ‖T‖ ≤ α⌈log K⌉·‖T*‖.
Fig. 3. The greedy algorithm.
THEOREM 3.1. Given an α-approximation algorithm for the 2-UDT problem, the black-box algorithm has an approximation ratio of α⌈log K⌉ for the UDT problem, where K is the branching factor of the input table.
In particular, we obtain an approximation ratio of ⌈log K⌉(1 + ln N) by using the Heeringa-Adler algorithm as a black box.
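The encoding step can be sketched as follows; this is our minimal illustration, not the authors' code. Each value v in {1, ..., K} is written as the ⌈log₂ K⌉ bits of v − 1, one bit per derived attribute, so distinct rows of D map to distinct rows of D_2.

```python
import math

def binarize(table, K):
    """Encode a table with entries in {1, ..., K} as a binary table:
    each original attribute becomes ceil(log2 K) derived attributes."""
    bits = max(1, math.ceil(math.log2(K)))
    return [
        [(value - 1) >> b & 1 for value in row for b in range(bits)]
        for row in table
    ]

# a 3-entity table with branching factor 5 becomes a binary table
# with 2 * ceil(log2 5) = 6 columns; distinct rows stay distinct
D = [[1, 2], [3, 2], [1, 5]]
D2 = binarize(D, 5)
```

Since the map from a value to its bit pattern is injective, any decision tree for D_2 still distinguishes every pair of entities of D.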
3.1.2. The Greedy Algorithm. In this section, we present a greedy algorithm for the UDT problem. The algorithm is similar in spirit to that of Heeringa and Adler [2005] for the 2-UDT problem. We build on their analysis and develop further combinatorial arguments to obtain our approximation ratio.
Given as input an N × m table D having branching factor at most K, the greedy algorithm produces a decision tree T as described in the following. Let E and A denote the set of entities and attributes of D, respectively. The intuition is that any decision tree has to distinguish every pair of distinct entities. So, a natural idea is to make the attribute that distinguishes the maximum number of pairs the root of T, where an attribute a is said to distinguish a pair ⟨x, y⟩ if x.a ≠ y.a. Choosing such an attribute a can easily be done in time O(mN²). Picking the attribute a as the label for the root node partitions the set E into disjoint sets E_1, E_2, ..., E_K, where E_i = {x : x.a = i}. We recursively apply the same greedy procedure on each of these sets to obtain K decision trees and make these the subtrees of the root node. The greedy procedure is formally specified in Figure 3. We get the output tree T by calling T = Greedy(E).
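The recursive procedure above can be sketched in Python; this is our rendering of the greedy idea (the formal version is Figure 3), with hypothetical names, where an entity is a dict from attribute to value and entities are assumed pairwise distinct.

```python
from itertools import combinations

def greedy(entities, attributes):
    """Return a decision tree: a leaf is a single entity (a dict);
    an internal node is (attribute, {value: subtree})."""
    if len(entities) == 1:
        return entities[0]
    # pick the attribute that distinguishes the most pairs of entities
    def distinguished(a):
        return sum(1 for x, y in combinations(entities, 2) if x[a] != y[a])
    best = max(attributes, key=distinguished)
    # partition the entities by their value on the chosen attribute
    parts = {}
    for e in entities:
        parts.setdefault(e[best], []).append(e)
    # recurse on each part to build the subtrees
    return (best, {v: greedy(sub, attributes) for v, sub in parts.items()})

def identify(tree, entity):
    """Traverse the tree using the entity's attribute values."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[entity[attr]]
    return tree
```

Since distinct entities differ on some attribute, the chosen attribute always splits the current set into at least two nonempty parts, so the recursion terminates.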
THEOREM 3.2. The greedy algorithm has an approximation ratio of r_K(1 + ln N) for the UDT problem, where K is the branching factor of the input table.
We now analyze the greedy algorithm and prove Theorem 3.2. The analysis is divided into two parts. In the first part, we introduce certain combinatorial objects called tabular partitions and analyze the performance of the greedy algorithm using these objects. In the second part, we relate these objects to Ramsey colorings and complete the proof of Theorem 3.2.
3.1.3. Analysis Involving Tabular Partitions. Let T and T* be the greedy and the optimal decision trees, respectively. In this section, we prove a relationship between ‖T‖ and ‖T*‖ involving tabular partitions, defined in the following.
Definition 3.3 (Tabular Partitions). For any positive integer n ≥ 1, a tabular partition P of n is a sequence P_1, P_2, ..., P_n such that P_i is a partition of the set {1, 2, ..., n} − {i}. We require that for any distinct 1 ≤ i, j ≤ n, if A is the set in P_i containing j and B is the set in P_j containing i, then A ∩ B = ∅. Let the length of a partition P_i denote the number of sets in it. We define the compactness of P as comp(P) = max_i (length of P_i), for 1 ≤ i ≤ n. We define C_n to be the smallest number such that there exists a tabular partition of n having compactness C_n.
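The defining condition is easy to check mechanically. The sketch below (our names, 1-indexed as in the definition) validates a candidate tabular partition and computes its compactness; for n = 3, the all-singletons sequence is tabular with compactness 2, while merging each P_i into one block violates the disjointness condition.

```python
def is_tabular(P):
    """P maps i to a list of disjoint sets partitioning {1..n} - {i}."""
    n = len(P)
    for i in range(1, n + 1):
        # each P[i] must cover exactly {1..n} - {i}
        if set().union(*P[i]) != set(range(1, n + 1)) - {i}:
            return False
        for j in range(1, n + 1):
            if i == j:
                continue
            A = next(s for s in P[i] if j in s)  # block of P_i containing j
            B = next(s for s in P[j] if i in s)  # block of P_j containing i
            if A & B:  # the definition requires A and B to be disjoint
                return False
    return True

def compactness(P):
    return max(len(blocks) for blocks in P.values())

singletons = {1: [{2}, {3}], 2: [{1}, {3}], 3: [{1}, {2}]}
merged = {1: [{2, 3}], 2: [{1, 3}], 3: [{1, 2}]}
print(is_tabular(singletons), compactness(singletons))  # True 2
print(is_tabular(merged))                               # False
```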
THEOREM 3.4. ‖T‖ ≤ C_K(1 + ln N)·‖T*‖.
We next focus on proving this result. In Section 3.1.4, we shall show that C_K ≤ r_K and obtain Theorem 3.2 by combining the two results. We start with some notation and observations. Let T be any decision tree for D and u be an internal node of T. We define E_T(u) ⊆ E to be the set of entities in the subtree of T under u.
PROPOSITION 3.5. For any decision tree T of D, we have ‖T‖ = Σ_{u∈T} |E_T(u)|.
PROOF. Each entity x contributes a cost equal to its distance from the root. Let us distribute this cost uniformly among the internal nodes on the path from x to the root. Observe that the total cost accumulated at an internal node u is equal to |E_T(u)|. Thus ‖T‖ = Σ_{u∈T} |E_T(u)|.
Consider a decision tree T and a pair ⟨x, y⟩ of entities. We say that a node u ∈ T separates the pair ⟨x, y⟩ if the traversal for both x and y passes through u, but x and y take different branches from u. Formally, u is said to separate³ ⟨x, y⟩ if x, y ∈ E_T(u) and x.a ≠ y.a, where a is the attribute label of u. For any pair ⟨x, y⟩ of entities, there exists a unique node in T that separates x and y. We define SEP(u) to be the set of all pairs separated by u. The separators with respect to the greedy tree T will be important in our analysis. For each pair ⟨x, y⟩, we denote by s_{x,y} the separator of ⟨x, y⟩ in T and let S_{x,y} denote E_T(s_{x,y}).
From Proposition 3.5, we see that each node u ∈ T contributes a cost of |E_T(u)| towards the total cost ‖T‖ and separates the pairs in SEP(u). We distribute the cost |E_T(u)| equally among the pairs in SEP(u). For each pair ⟨x, y⟩ ∈ SEP(u), we define the cost c_{x,y} = |E_T(u)|/|SEP(u)|. Since each pair has a unique separator, the costs c_{x,y} are well defined.
It is easy to see that |E_T(u)| = Σ_{⟨x,y⟩∈SEP(u)} c_{x,y} and, by Proposition 3.5, we have ‖T‖ = Σ_{⟨x,y⟩} c_{x,y}, where the summation is taken over all (unordered) pairs of distinct entities. Notice that each pair ⟨x, y⟩ also has a unique separator in T*. So, we rewrite the
³We note that the separator of ⟨x, y⟩ is nothing but the least common ancestor of x and y.
preceding summation by partitioning the set of all pairs according to their separators in T* and obtain the following equation:
‖T‖ = Σ_{z∈T*} Σ_{⟨x,y⟩∈SEP(z)} c_{x,y}.   (1)
For each z ∈ T*, we define α(z) to be the term corresponding to z in the summation given in Eq. (1). Clearly, α(z) = Σ_{⟨x,y⟩∈SEP(z)} c_{x,y}. The following lemma gives an upper bound on α(z).
LEMMA 3.6. For any z ∈ T*, α(z) ≤ C_K(1 + ln |Z|)·|Z|, where Z = E_{T*}(z).
Assuming the correctness of Lemma 3.6, we first prove Theorem 3.4. The lemma is proved later in the section.
PROOF OF THEOREM 3.4. Replacing the inner summation in Eq. (1) by α(z), we have
‖T‖ ≤ C_K(1 + ln N) Σ_{z∈T*} |E_{T*}(z)| = C_K(1 + ln N)·‖T*‖.
The first step is obtained by invoking Lemma 3.6 and the fact that |Z| ≤ N. Proposition 3.5 gives us the second step.
We now proceed to prove Lemma 3.6. Fix any z ∈ T*. Let us denote Z = E_{T*}(z). Let a_z be the attribute label of z. The node z partitions the set Z into K sets Z_1, Z_2, ..., Z_K, where Z_i = {x ∈ Z : x.a_z = i}. We extend the preceding notation to sets of values: for any A ⊆ {1, 2, ..., K}, define Z_A = ∪_{i∈A} Z_i. We prove the following upper bound on c_{x,y}.
LEMMA 3.7. Let ⟨x, y⟩ ∈ SEP(z). Consider disjoint sets A, B ⊆ {1, 2, ..., K} satisfying y ∈ Z_A and x ∈ Z_B. Then,
c_{x,y} ≤ 1/|S_{x,y} ∩ Z_A| + 1/|S_{x,y} ∩ Z_B|.
PROOF. We are given a pair ⟨x, y⟩ ∈ SEP(z). Let s = s_{x,y} be the separator of ⟨x, y⟩ in T and let the attribute label of s be a_s. The cost c_{x,y} is given by |S_{x,y}|/|SEP(s)|, where S_{x,y} = E_T(s). The greedy algorithm chose the attribute a_s for the node s. Hypothetically, consider choosing the attribute a_z instead. Let us denote the set of pairs separated by such a choice as X, that is, define X = {⟨u, v⟩ : u, v ∈ S_{x,y} and u.a_z ≠ v.a_z}. Notice that the greedy algorithm chose the attribute a_s, instead of a_z, because a_s distinguishes more pairs than a_z, meaning |SEP(s)| ≥ |X|. It follows that c_{x,y} ≤ |S_{x,y}|/|X|. Partition S_{x,y} into S_1, S_2, ..., S_K, where S_i = {u ∈ S_{x,y} : u.a_z = i}. Then,
|X| = Σ_{1≤i<j≤K} |S_i|·|S_j|.
Now we claim that
c_{x,y} ≤ 1/(Σ_{i∈A} |S_i|) + 1/(Σ_{j∈B} |S_j|).   (2)
The claim can be proved as follows. Let A′ = A ∪ ({1, 2, ..., K} − A − B), so that A′ ∪ B = {1, 2, ..., K} and A′ ∩ B = ∅. Recall that |S_{x,y}| = |S_1| + |S_2| + ··· + |S_K|. It follows that
1/(Σ_{i∈A} |S_i|) + 1/(Σ_{j∈B} |S_j|) ≥ 1/(Σ_{i∈A′} |S_i|) + 1/(Σ_{j∈B} |S_j|)
= |S_{x,y}| / ((Σ_{i∈A′} |S_i|)·(Σ_{j∈B} |S_j|))
≥ |S_{x,y}| / |X|
≥ c_{x,y}.
This proves the claim in Eq. (2).
Observe that for any 1 ≤ i ≤ K, S_{x,y} ∩ Z_i ⊆ S_i and hence |S_{x,y} ∩ Z_i| ≤ |S_i|. Therefore,
c_{x,y} ≤ 1/(Σ_{i∈A} |S_{x,y} ∩ Z_i|) + 1/(Σ_{j∈B} |S_{x,y} ∩ Z_j|).
Finally, since the sets Z_i and Z_j are disjoint for any distinct 1 ≤ i, j ≤ K, it follows that the first term equals 1/|S_{x,y} ∩ Z_A| and the second term equals 1/|S_{x,y} ∩ Z_B|. The lemma is proved.
For each ⟨x, y⟩, we shall choose a suitable pair of disjoint sets A and B and obtain an upper bound on c_{x,y} by invoking Lemma 3.7. We make use of tabular partitions for choosing these sets; the motivation for doing so will become clear in the proof of Lemma 3.10. Let P* be an optimal tabular partition of K having compactness C_K, given by the sequence P_1, P_2, ..., P_K. Consider any pair ⟨x, y⟩ ∈ SEP(z). Let i = x.a_z and j = y.a_z, so that x ∈ Z_i and y ∈ Z_j. Let Ā be the set in the partition P_i that contains j and B̄ be the set in the partition P_j that contains i. Notice that, by the definition of tabular partitions, the sets Ā and B̄ are disjoint. We invoke Lemma 3.7 with Ā and B̄ as the required disjoint sets. (Observe that for any i and j, all the pairs in Z_i × Z_j will make use of the same disjoint sets while invoking the lemma. Thus the sets chosen depend only on the values x.a_z and y.a_z.) Therefore,
c_{x,y} ≤ 1/|S_{x,y} ∩ Z_Ā| + 1/|S_{x,y} ∩ Z_B̄|.
We split the preceding cost into two parts and attribute the first term to x and the second term to y. Define
c^x_{x,y} = 1/|S_{x,y} ∩ Z_Ā|   and   c^y_{x,y} = 1/|S_{x,y} ∩ Z_B̄|.
It follows that c_{x,y} ≤ c^x_{x,y} + c^y_{x,y}. For any x ∈ Z, we imagine that x pays a cost c^x_{x,y} to get separated from an entity y ∈ Z. We denote the accumulated cost by Acc_z(x) and define it as
Acc_z(x) = Σ_{y : ⟨x,y⟩∈SEP(z)} c^x_{x,y}.
Now the following lemma follows easily.
LEMMA 3.8. For any z, α(z) ≤ Σ_{x∈Z} Acc_z(x).
Our next task is to obtain an upper bound on Acc_z(x), so that we get a bound on α(z). The following lemma is useful for this purpose.
LEMMA 3.9. Let x ∈ E be any entity and let Q ⊆ E be any set of entities such that x ∉ Q. Then,
Σ_{y∈Q} 1/|S_{x,y} ∩ Q| ≤ 1 + ln |Q|.
PROOF. Let $t = |Q|$. We shall prove the following claim:
$$\sum_{y \in Q} \frac{1}{|S_{x,y} \cap Q|} \le \sum_{i=1}^{t} \frac{1}{i}.$$
The claim implies the lemma, since it is well known that $\sum_{i=1}^{t} (1/i) \le 1 + \ln t$, for all $t$. We prove the claim by applying induction on $|Q|$. For the base case of $|Q| = 1$, let $Q = \{y\}$, where $y \ne x$. Clearly, $y \in S_{x,y}$ and so $|S_{x,y} \cap Q| = 1$, and the claim follows.
Assuming that the claim is true for all sets of size at most $t-1$, we prove it for any set $Q$ of size $t$. Let $y^*$ be any entity in $Q$ such that for all $y \in Q$, $s_{x,y}$ is a descendant of $s_{x,y^*}$ (a node is considered to be a descendant of itself). If more than one such element exists, pick one arbitrarily. Intuitively, $y^*$ is one among the first batch of entities in $Q$ to get separated from $x$. The main observation is that $Q \subseteq S_{x,y^*}$ and so $S_{x,y^*} \cap Q = Q$. Thus $1/|S_{x,y^*} \cap Q| = 1/|Q| = 1/t$. We apply the induction hypothesis on the set of remaining entities $Q' = Q - \{y^*\}$ and infer that
$$\sum_{y \in Q'} \frac{1}{|S_{x,y} \cap Q'|} \le \sum_{i=1}^{t-1} \frac{1}{i}.$$
Clearly, $Q' \subseteq Q$ and hence $|S_{x,y} \cap Q'| \le |S_{x,y} \cap Q|$; so, in the previous summation, if we replace the term $|S_{x,y} \cap Q'|$ by $|S_{x,y} \cap Q|$, then the resulting inequality is also true. We conclude that
$$\sum_{y \in Q} \frac{1}{|S_{x,y} \cap Q|} = \frac{1}{|S_{x,y^*} \cap Q|} + \sum_{y \in Q'} \frac{1}{|S_{x,y} \cap Q|} \le \frac{1}{t} + \sum_{i=1}^{t-1} \frac{1}{i} = \sum_{i=1}^{t} \frac{1}{i}.$$
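As an aside, the harmonic-sum bound invoked in the proof above is easy to check numerically. The following sketch (ours, not part of the paper) verifies that the harmonic number $H_t = \sum_{i=1}^{t} 1/i$ stays below $1 + \ln t$:

```python
import math

def harmonic(t):
    """t-th harmonic number H_t = 1 + 1/2 + ... + 1/t."""
    return sum(1.0 / i for i in range(1, t + 1))

# The bound H_t <= 1 + ln t used in the proof of Lemma 3.9.
for t in range(1, 5001):
    assert harmonic(t) <= 1 + math.log(t)
```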
LEMMA 3.10. For any $x \in Z$, $\mathrm{Acc}_z(x) \le C_K (1 + \ln |Z|)$.
PROOF. Let $r = x.a_z$, so that $x \in Z_r$. Let $Z' = Z - Z_r$ be the rest of the entities in $Z$. Notice that $\mathrm{Acc}_z(x) = \sum_{y \in Z'} c^x_{x,y}$. We perform the preceding summation by partitioning $Z'$ according to $P_r$, the $r$th member of the optimal tabular partition $P^* = P_1, P_2, \ldots, P_K$. Let $P_r = \{s_1, s_2, \ldots, s_\ell\}$, where $\ell \le C_K$. For $1 \le i \le \ell$, define $Q_i = \{y \in Z' : y.a_z \in s_i\}$. Thus, $Z' = Q_1 \cup Q_2 \cup \cdots \cup Q_\ell$ and hence,
$$\mathrm{Acc}_z(x) = \sum_{1 \le i \le \ell} \sum_{y \in Q_i} c^x_{x,y}. \qquad (3)$$
We derive an upper bound for each term in the outer sum using Lemma 3.9. Fix any $1 \le i \le \ell$. Notice that for any $y \in Q_i$, we have $c^x_{x,y} = 1/|S_{x,y} \cap Q_i|$, by definition. Moreover, $x \notin Q_i$. Thus, by applying Lemma 3.9 on $Q_i$, we get
$$\sum_{y \in Q_i} c^x_{x,y} \le (1 + \ln |Q_i|) \le (1 + \ln |Z|). \qquad (4)$$
We get the lemma by combining Eqs. (3) and (4), and the fact that $\ell \le C_K$.
PROOF OF LEMMA 3.6. The result is proved by combining Lemma 3.8 and Lemma 3.10:
$$\alpha(z) \le \sum_{x \in Z} \mathrm{Acc}_z(x) \le \sum_{x \in Z} C_K (1 + \ln |Z|) = C_K (1 + \ln |Z|)\,|Z|.$$
3.1.4. Tabular Partitions and Ramsey Colorings. In this section, we introduce the notion of directed Ramsey colorings and show that they are equivalent to tabular partitions. Throughout the discussion, for $n > 0$, let $G_n$ and $\vec{G}_n$ denote the complete undirected and the complete directed graph on $n$ vertices, respectively.

Definition 3.11. Let $n > 0$ be an integer. A directed Ramsey coloring of $\vec{G}_n$ is a coloring $\tau$ of the edges such that for any triplet of distinct vertices $x$, $y$ and $z$, if $\tau(x,y) = \tau(x,z)$ then $\tau(y,x) \ne \tau(y,z)$ (and by symmetry, $\tau(z,x) \ne \tau(z,y)$).
We define $\vec{R}_k$ to be the smallest number $n$ such that $\vec{G}_n$ cannot be directed Ramsey colored using $k$ colors.⁴ The inverses of these numbers will be useful: define $\vec{r}_n$ to be the minimum number of colors required to do a directed Ramsey coloring of $\vec{G}_n$.
We claim that for any $n$, there exists a tabular partition $P$ of compactness $k$ if and only if there exists a directed Ramsey coloring $\tau$ of $\vec{G}_n$ that uses only $k$ colors. A proof sketch follows. Let $P = P_1, P_2, \ldots, P_n$. Fix $1 \le x \le n$. Arrange the sets in the partition $P_x$ in an arbitrary manner, say $P_x = s_{x,1}, s_{x,2}, \ldots, s_{x,\ell}$, where $\ell \le k$. The $n-1$ edges outgoing from the vertex $x$ are colored according to the partition $P_x$; meaning, for $1 \le c \le \ell$ and $y \in s_{x,c}$, we set $\tau(x,y) = c$. For any $y$ and $z$, if $\tau(x,y) = \tau(x,z)$, then it means that $y$ and $z$ belong to the same set in the partition $P_x$. By the property of tabular partitions, it must be the case that $x$ and $z$ belong to different sets in the partition $P_y$, implying that $\tau(y,x) \ne \tau(y,z)$. We conclude that $\tau$ is a directed Ramsey coloring and that $\tau$ uses only $k$ colors. The converse is proved using a similar argument. The claim implies the following result.

THEOREM 3.12. For any $n$, $C_n = \vec{r}_n$.
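To make the equivalence concrete, here is a small sketch (our own illustration, not from the paper) that turns a tabular partition into an edge coloring exactly as in the proof sketch above and checks the condition of Definition 3.11. The 4-vertex partition used is a hypothetical example of compactness 2:

```python
from itertools import permutations

def coloring_from_partition(P):
    """Color edge (x, y) by the index of the block of P[x] containing y.
    P maps each vertex to a list of blocks (sets of the other vertices)."""
    tau = {}
    for x, blocks in P.items():
        for c, block in enumerate(blocks):
            for y in block:
                tau[(x, y)] = c
    return tau

def is_directed_ramsey(tau, vertices):
    """Definition 3.11: tau(x,y) == tau(x,z) must force tau(y,x) != tau(y,z)."""
    for x, y, z in permutations(vertices, 3):
        if tau[(x, y)] == tau[(x, z)] and tau[(y, x)] == tau[(y, z)]:
            return False
    return True

# A toy tabular partition on 4 vertices (hypothetical, compactness 2).
P = {
    1: [{2, 3}, {4}],
    2: [{1, 4}, {3}],
    3: [{1, 4}, {2}],
    4: [{2, 3}, {1}],
}
tau = coloring_from_partition(P)
assert is_directed_ramsey(tau, [1, 2, 3, 4])
```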
Let us call an edge-coloring of $G_n$ a Ramsey coloring if it does not induce any monochromatic triangles. For any $n$, a Ramsey coloring $\tau$ of $G_n$ readily yields a directed Ramsey coloring $\vec{\tau}$ of $\vec{G}_n$: for each pair of vertices $x$ and $y$, we set $\vec{\tau}(x,y) = \vec{\tau}(y,x) = \tau(x,y)$. It can easily be verified that $\vec{\tau}$ is indeed a directed Ramsey coloring of $\vec{G}_n$. The number of colors used in $\vec{\tau}$ is the same as that of $\tau$. Therefore, we have the following proposition.

PROPOSITION 3.13. For any $n$, $\vec{r}_n \le r_n$.
PROOF OF THEOREM 3.2. The result follows from Theorems 3.4 and 3.12, and
Proposition 3.13.
⁴ Such a number exists, as shown in Theorem 5.1.
3.2. The Weighted Case: WDT Problem

In this section, we show how to deal with the weighted case, namely the WDT problem. Let $D$ be the input $N \times m$ table over a set of entities $E$ and a set of attributes $A$, having a branching factor of $K$. Let $w(\cdot)$ be the input weight function that assigns an integer weight $w(x) \ge 1$ to each entity $x \in E$. The problem is to construct an optimal decision tree $T^*$ having the minimum cost with respect to $w(\cdot)$. We present an algorithm that generalizes the greedy algorithm for the UDT problem.
Weighted Greedy Algorithm. Refer to the greedy algorithm given in Figure 3. The main step in that algorithm is choosing an attribute that distinguishes the maximum number of pairs. We modify this step so that the weights are taken into account. Namely, we choose the attribute
$$a = \arg\max_{a \in A} \sum_{(x,y) \in S(a)} w(x)\,w(y),$$
where $S(a) = \{(x,y) : x, y \in E \text{ and } x.a \ne y.a\}$ is the set of pairs distinguished by the attribute $a$. We call the preceding procedure the weighted greedy algorithm.
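As an illustration (ours, not the paper's code; the table encoding and names are assumptions), the weighted selection rule can be sketched as follows:

```python
from itertools import combinations

def weighted_greedy_attribute(table, weights):
    """Pick the attribute maximizing the total weight w(x)*w(y) over
    the entity pairs it distinguishes (the weighted selection rule).
    `table` maps entity -> {attribute: value}; `weights` maps
    entity -> weight."""
    entities = list(table)
    attributes = list(table[entities[0]])
    best, best_score = None, -1
    for a in attributes:
        score = sum(weights[x] * weights[y]
                    for x, y in combinations(entities, 2)
                    if table[x][a] != table[y][a])
        if score > best_score:
            best, best_score = a, score
    return best

# Hypothetical 3-entity table: under the weights below, attribute 'a2'
# separates the two heavy entities e1, e2 and wins; with uniform
# weights the two attributes would separate two pairs each.
table = {'e1': {'a1': 0, 'a2': 0},
         'e2': {'a1': 0, 'a2': 1},
         'e3': {'a1': 1, 'a2': 1}}
weights = {'e1': 10, 'e2': 10, 'e3': 1}
assert weighted_greedy_attribute(table, weights) == 'a2'
```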
Let $W = \sum_{x \in E} w(x)$ denote the total weight of the entities. Let $T$ and $T^*$ denote the weighted greedy and the optimal trees, under the weight function $w(\cdot)$.

THEOREM 3.14. $w(T) \le C_K (1 + \ln W)\, w(T^*)$, where $W$ is the sum of the weights of all the entities.
We prove this theorem by adapting the proof of Theorem 3.2. Due to space constraints, we provide an outline of the proof.
Intuitively, we imagine that each entity $x$ is replicated $w(x)$ times and modify the proof of Theorem 3.2 accordingly. We reuse notation from the previous proof. Let $\mathrm{SEP}(u)$ be the set of all pairs separated by $u$. For each pair $(x,y)$, we denote by $s_{x,y}$ the separator of $(x,y)$ in $T$, and let $S_{x,y}$ denote $E_T(s_{x,y})$. Additional notation is introduced next. For a set of entities $X \subseteq E$, let $w(X)$ denote the total weight of the entities in $X$, that is, $w(X) = \sum_{x \in X} w(x)$. We also define weights on any set of pairs of entities: for a set of pairs $X \subseteq E \times E$, define $w(X) = \sum_{(x,y) \in X} w(x)\,w(y)$.
Proposition 3.5 generalizes to the weighted case as follows.

PROPOSITION 3.15. For a decision tree $T$ of $D$, $w(T) = \sum_{u \in T} w(E_T(u))$.

For each pair of entities $(x,y)$, define a cost $c_{x,y}$ as follows:
$$c_{x,y} = \frac{w(x)\,w(y)\,w(S_{x,y})}{w(\mathrm{SEP}(s_{x,y}))}.$$
By Proposition 3.15, we get the following equation, which is similar to Eq. (1):
$$w(T) = \sum_{z \in T^*} \sum_{(x,y) \in \mathrm{SEP}(z)} c_{x,y}. \qquad (5)$$
For each $z \in T^*$, the inner summation in Eq. (5) is defined as the cost $\alpha(z) = \sum_{(x,y) \in \mathrm{SEP}(z)} c_{x,y}$. Our goal is to derive an upper bound on $\alpha(z)$.

Fix any $z \in T^*$. Let us denote $Z = E_{T^*}(z)$. Let $a_z$ be the attribute label of $z$. The node $z$ partitions the set $Z$ into $K$ sets $Z_1, Z_2, \ldots, Z_K$, where $Z_i = \{x \in Z : x.a_z = i\}$. We extend the preceding notation to sets of values: for any $A \subseteq \{1, 2, \ldots, K\}$, define $Z_A = \cup_{i \in A} Z_i$.
The following lemma generalizes Lemma 3.7 to the weighted case.
LEMMA 3.16. Let $(x,y) \in \mathrm{SEP}(z)$. Consider disjoint sets $A, B \subseteq \{1, 2, \ldots, K\}$ satisfying $y \in Z_A$ and $x \in Z_B$. Then,
$$c_{x,y} \le w(x)\,w(y) \left( \frac{1}{w(S_{x,y} \cap Z_A)} + \frac{1}{w(S_{x,y} \cap Z_B)} \right).$$

Consider any $(x,y) \in \mathrm{SEP}(z)$. Let $P^*$ be an optimal tabular partition of $K$ having compactness $C_K$, given by the sequence $P_1, P_2, \ldots, P_K$. Let $i = x.a_z$ and $j = y.a_z$, so that $x \in Z_i$ and $y \in Z_j$. Let $A$ be the set in the partition $P_i$ that contains $j$, and let $B$ be the set in the partition $P_j$ that contains $i$. Define
$$c^x_{x,y} = \frac{w(x)\,w(y)}{w(S_{x,y} \cap Z_A)} \quad \text{and} \quad c^y_{x,y} = \frac{w(x)\,w(y)}{w(S_{x,y} \cap Z_B)}.$$
By Lemma 3.16, we have that $c_{x,y} \le c^x_{x,y} + c^y_{x,y}$. For each entity $x \in E_{T^*}(z)$, define $\mathrm{Acc}_z(x)$ as
$$\mathrm{Acc}_z(x) = \sum_{y \,:\, (x,y) \in \mathrm{SEP}(z)} c^x_{x,y}.$$
We wish to derive an upper bound on $\mathrm{Acc}_z(x)$. The following lemma, which generalizes Lemma 3.9, is useful for this purpose.

LEMMA 3.17. Let $x \in E$ be any entity and $Q \subseteq E$ be any set of entities such that $x \notin Q$. Then,
$$\sum_{y \in Q} \frac{w(y)}{w(S_{x,y} \cap Q)} \le 1 + \ln w(Q).$$
The following is obtained by generalizing Lemma 3.10.

LEMMA 3.18. For any $x \in Z$, $\mathrm{Acc}_z(x) \le w(x)\,C_K (1 + \ln w(Z))$.
PROOF OF THEOREM 3.14. Consider any $z \in T^*$ and let $Z = E_{T^*}(z)$. Then, $\alpha(z) \le \sum_{x \in Z} \mathrm{Acc}_z(x)$. Applying Lemma 3.18, we get that
$$\alpha(z) \le C_K (1 + \ln w(Z))\, w(Z). \qquad (6)$$
Replacing the inner summation in Eq. (5) by $\alpha(z)$, we have
$$w(T) \le C_K (1 + \ln W) \sum_{z \in T^*} w(E_{T^*}(z)) = C_K (1 + \ln W)\, w(T^*). \qquad (7)$$
The first step is obtained by invoking Eq. (6) and the fact that $w(Z) \le w(E) = W$. Proposition 3.15 gives us the second step.
Theorem 3.14 shows that the approximation ratio of the weighted greedy algorithm is logarithmic in $N$ when the total weight $W$ is polynomially bounded in $N$. Unfortunately, when the weights are arbitrarily large, the ratio could be worse. We overcome this issue by using the following rounding technique.
Rounded Greedy Algorithm. Let $D$ be an input table having a branching factor of $K$ and let $w_{in}$ be the input integer weight function. Let $w^{\max}_{in} = \max_x w_{in}(x)$ denote the maximum weight. Define a new weight function $w(\cdot)$ as follows: for any entity $x \in E$, define
$$w(x) = \left\lceil \frac{w_{in}(x)\,N^2}{w^{\max}_{in}} \right\rceil.$$
ACM Transactions on Algorithms, Vol. 7, No. 2, Article 15, Publication date: March 2011.
Decision Trees for Entity Identiﬁcation: Approximation Algorithms and Hardness 15:15
Run the weighted greedy algorithm with w(·) as the input weight function and obtain
a tree T. Return the tree T.
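The rounding step can be sketched as follows (a sketch of ours; it assumes the scaled weight is rounded up, so that every $w(x) \ge 1$, which matches the bounds $1 \le w(x) \le N^2$ used below):

```python
import math

def round_weights(w_in):
    """Rescale input weights as in the rounded greedy algorithm:
    w(x) = ceil(w_in(x) * N^2 / w_max), so each weight lies in
    [1, N^2] and the total weight is at most N^3."""
    n = len(w_in)
    w_max = max(w_in.values())
    return {x: math.ceil(v * n * n / w_max) for x, v in w_in.items()}

# Hugely skewed input weights are compressed into the range [1, N^2].
w = round_weights({'a': 1, 'b': 7, 'c': 10**12})
assert all(1 <= v <= 3 ** 2 for v in w.values())
assert sum(w.values()) <= 3 ** 3
```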
Let $T^*$ and $T^*_{in}$ be the optimal decision trees under the weight functions $w(\cdot)$ and $w_{in}(\cdot)$, respectively. From Theorem 3.14, we have a good bound for $w(T)$ with respect to $w(T^*)$. But, of course, we need to compare $w_{in}(T)$ and $w_{in}(T^*_{in})$. We do this next.

THEOREM 3.19. $w_{in}(T) \le 2 C_K (1 + 3 \ln N)\, w_{in}(T^*_{in})$.
PROOF. Let $x \in E$ be any entity and consider the path from the root to $x$ in the tree $T^*_{in}$. Notice that each internal node along this path separates at least one entity from $x$. (Otherwise, $T^*_{in}$ contains a "dummy" node that does not separate any pairs and hence can be deleted to obtain a tree of lesser cost.) So, the length of the path is at most $N$ and hence the following claim is true.

Claim 1: $|T^*_{in}| \le N^2$.
We next compare $w_{in}(T^*_{in})$ and $w(T^*_{in})$. We have
$$w(T^*_{in}) = \sum_{x \in E} w(x)\, T^*_{in}(x) \le \sum_{x \in E} \left( \frac{w_{in}(x)\,N^2}{w^{\max}_{in}} + 1 \right) T^*_{in}(x) = \frac{w_{in}(T^*_{in})\,N^2}{w^{\max}_{in}} + |T^*_{in}| \le \frac{w_{in}(T^*_{in})\,N^2}{w^{\max}_{in}} + N^2 \le \frac{2\,w_{in}(T^*_{in})\,N^2}{w^{\max}_{in}}. \qquad (8)$$
The second step is from the definition of $w(\cdot)$ and the fourth step is obtained from Claim 1. The last inequality is obtained by observing the fact that $w_{in}(T^*_{in}) \ge w^{\max}_{in}$.
Notice that for any entity $x \in E$, $1 \le w(x) \le N^2$, and so the total weight $W$ under the function $w(\cdot)$ satisfies $W \le N^3$. So, Theorem 3.14 implies the following claim.

Claim 2: $w(T) \le C_K (1 + 3 \ln N)\, w(T^*)$.
We can now compare $w_{in}(T)$ and $w_{in}(T^*_{in})$. Note that $T^*$ is the optimal tree under the function $w(\cdot)$ and hence $w(T^*) \le w(T^*_{in})$. We obtain the theorem by combining this observation with Eq. (8) and Claim 2.
By combining Theorem 3.19, Theorem 3.12, and Proposition 3.13, we get the following result.

THEOREM 3.20. The approximation ratio of the rounded greedy algorithm is at most $2 r_K (1 + 3 \ln N) = O(r_K \log N)$.
4. HARDNESS OF APPROXIMATION

In this section, we study the hardness of approximating the WDT and the UDT problems. We show that it is NP-hard to approximate the 2-WDT problem within a ratio of $\Omega(\log N)$. Therefore, our approximation algorithm for the 2-WDT problem is optimal up to constant factors. We also improve the previous hardness results for the UDT problem.
4.1. Hardness of Approximating the 2-WDT Problem

THEOREM 4.1. It is NP-hard to approximate the 2-WDT problem within a factor of $\Omega(\log N)$, where $N$ is the number of entities in the input.

PROOF. We prove the result via a reduction from the set cover problem. It is known that approximating set cover within a factor of $\Omega(\log n)$ is NP-hard [Raz and Safra 1997].
Let $(U, S)$ be the input set cover instance, where $U = \{x_1, x_2, \ldots, x_n\}$ is a universe of items and $S$ is a collection of sets $\{S_1, S_2, \ldots, S_m\}$ such that $S_i \subseteq U$, for each $i$. Without loss of generality, we can assume that for any pair of distinct items $x_i$ and $x_j$, there exists a set $S_k \in S$ containing exactly one of these two items. (If not, one of these items can be removed from the system.) Construct an instance of the 2-WDT problem having $N = n+1$ entities and $m$ attributes. The set of entities is $E = \{x_1, x_2, \ldots, x_n\} \cup \{\hat{x}\}$, where each entity $x_i$ corresponds to the item $x_i$ and $\hat{x}$ is a special entity. The set of attributes is $A = \{S_1, S_2, \ldots, S_m\}$, so that each attribute $S_i$ corresponds to the set $S_i$. The $N \times m$ table $D$ is given as follows. For each entity $x_i$ and attribute $S_j$, set $x_i.S_j = 1$ if $x_i \in S_j$, and otherwise set $x_i.S_j = 0$. For the special entity $\hat{x}$, set $\hat{x}.S_j = 0$ for all attributes $S_j$. For each entity $x_i$, set the weight $w(x_i) = 1$. As for the special entity $\hat{x}$, set its weight as $w(\hat{x}) = N^3$. This completes the construction.
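The construction can be sketched as follows (our illustration; the entity name for the special all-zero entity is ours):

```python
def set_cover_to_2wdt(universe, sets):
    """Build the 2-WDT instance from a set cover instance (U, S):
    one binary attribute per set, one unit-weight entity per item,
    plus a special all-zero entity of weight N^3, where N = |U| + 1."""
    n_entities = len(universe) + 1  # N = n + 1
    table = {item: {j: int(item in s) for j, s in enumerate(sets)}
             for item in universe}
    table['x_hat'] = {j: 0 for j in range(len(sets))}
    weights = {item: 1 for item in universe}
    weights['x_hat'] = n_entities ** 3
    return table, weights

universe = ['x1', 'x2', 'x3']
sets = [{'x1', 'x2'}, {'x2', 'x3'}, {'x3'}]
table, weights = set_cover_to_2wdt(universe, sets)
assert table['x1'] == {0: 1, 1: 0, 2: 0}
assert weights['x_hat'] == 4 ** 3
```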
Let $T$ be a decision tree for $D$. Let $C$ be the set of attributes found along the path from the root to the entity $\hat{x}$. Recall that the length of the preceding path is denoted as $T(\hat{x})$. Observe that $C$ is a cover for $(U, S)$. We have $|C| = T(\hat{x}) \le w(T)/N^3$. On the other hand, given a cover $C$, we can construct a decision tree $T$ satisfying the following two properties: (i) the set of attributes along the path from the root to $\hat{x}$ is exactly the set $C$, so that $T(\hat{x}) = |C|$; (ii) for every other entity $x_i$, $T(x_i) \le N$. (The second property is based on the fact that for any table containing $N$ entities, it suffices to test at most $N$ attributes in order to distinguish any entity from the rest.) Thus $w(T) \le |C|\,N^3 + N^2$. In particular, $w(T^*) \le |C^*|\,N^3 + N^2$, where $T^*$ and $C^*$ are the optimal decision tree and optimal cover, respectively.
Based on the previous observations, we can prove the following claim: if there exists an $\alpha(N)$-approximation algorithm for the 2-WDT problem, then for any $\epsilon > 0$, we can design a $(1+\epsilon)\alpha(n)$-approximation algorithm for the set cover problem. Therefore, the hardness of the set cover problem implies the claimed hardness result for the 2-WDT problem.
4.2. Hardness for the UDT and the 2-UDT Problems

In this section, we present improved hardness of approximation results for the UDT and the 2-UDT problems. For the 2-UDT problem, Heeringa and Adler [2005] showed a hardness of approximation of $(1+\epsilon)$, for some $\epsilon > 0$. We show that for any $\epsilon > 0$, it is NP-hard to approximate the UDT and the 2-UDT problems within factors of $(4-\epsilon)$ and $(2-\epsilon)$, respectively. Our reductions are from the Minimum Sum Set Cover (MSSC) problem.
The input to the MSSC problem is a set system: a collection of sets $S = \{S_1, S_2, \ldots, S_m\}$ over a universe $U = \{x_1, x_2, \ldots, x_N\}$ of items, where each $S_i \subseteq U$. A solution is an ordering $\pi$ on the sets in $S$, with an associated cost defined as follows. Let $\pi$ be $S'_1, S'_2, \ldots, S'_m$. Each item in $S'_1$ pays a cost of 1, each item in $S'_2 - S'_1$ pays a cost of 2, and so on. The cost of $\pi$ is the sum of the costs of all items. Formally, define the costs $c^\pi_x = \min\{i : x \in S'_i\}$, for $x \in U$, and $\mathrm{cost}(\pi) = \sum_{x \in U} c^\pi_x$. The MSSC problem is to find an ordering with the minimum cost. For a constant $d$, the $d$-MSSC problem is the special case of MSSC in which every set in the set system has at most $d$ elements.
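The cost of an ordering can be computed directly from the definition; the following sketch (ours) illustrates it on a toy instance:

```python
def mssc_cost(ordering, universe):
    """cost(pi): each item pays the 1-based position of the first set
    in the ordering that contains it (assumes every item is covered)."""
    total = 0
    for x in universe:
        total += min(i for i, s in enumerate(ordering, start=1) if x in s)
    return total

# Toy instance: items 1 and 2 are covered at step 1, items 3 and 4 at
# step 2, so the cost is 1 + 1 + 2 + 2 = 6.
U = {1, 2, 3, 4}
pi = [{1, 2}, {3, 4}, {2, 3}]
assert mssc_cost(pi, U) == 6
```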
Feige et al. [2004] proved the following hardness results for these problems.
THEOREM 4.2 [FEIGE ET AL. 2004].
(1) For any $\epsilon > 0$, it is NP-hard to approximate the MSSC problem within a ratio of $(4-\epsilon)$.
(2) For any $\epsilon > 0$, there exists a constant $d$ such that it is NP-hard to approximate $d$-MSSC within a ratio of $(2-\epsilon)$.
Our hardness results for UDT and 2-UDT are obtained via approximation-preserving reductions from the MSSC and the $d$-MSSC problems, respectively. The reduction from MSSC to UDT is easier and we present it first (in Section 4.2.1). The reduction from $d$-MSSC to 2-UDT is similar, but involves more technical details; it is presented in Section 4.2.2.
4.2.1. Hardness for the UDT Problem. Here, we prove the hardness result for the UDT problem by exhibiting a reduction from MSSC.

THEOREM 4.3. For any $\epsilon > 0$, it is NP-hard to approximate the UDT problem within a ratio of $(4-\epsilon)$.
PROOF. Given an MSSC instance $S = \{S_1, S_2, \ldots, S_m\}$ over a universe $U = \{x_1, x_2, \ldots, x_N\}$, construct an $N \times m$ table $D$ as follows. Each item $x_i$ corresponds to an entity and each set $S_i$ corresponds to an attribute $a_i$. For $1 \le j \le m$ and $1 \le i \le N$, set the entry $x_i.a_j$ as follows: if $x_i \in S_j$ then set $x_i.a_j = i$; else set $x_i.a_j = 0$. Observe that any decision tree for $D$ is left-deep: for any internal node $u$, except the branch labeled 0, every other branch out of $u$ leads to a leaf node.

We claim that given an ordering $\pi$ of $S$, we can construct a decision tree $T$ such that $|T| = \mathrm{cost}(\pi)$, and vice versa. Let $\pi = S'_1, S'_2, \ldots, S'_m$ and let $a'_1, a'_2, \ldots, a'_m$ be the corresponding sequence of attributes. Construct a left-deep tree $T$, in which the root node is labeled $a'_1$, its 0th child is labeled $a'_2$, and so on; in general, label the internal node in the $i$th level with $a'_i$. It can be seen that $T$ is indeed a decision tree for $D$ and that $|T| = \mathrm{cost}(\pi)$. The converse is shown via a similar construction. Given a decision tree $T$, traverse the tree starting with the root node and always taking the branches labeled 0. Write down the sequence of sets corresponding to the internal nodes seen in this traversal and let $\pi$ denote the sequence. Notice that the sets appearing in this sequence cover all elements of $U$ and that $\mathrm{cost}(\pi) = |T|$. (Some sets in $S$ may not appear in this sequence. To be formally compliant with the definition of solutions, we append the missing sets in an arbitrary order.) The claim, in conjunction with Theorem 4.2 (part 1), implies the theorem.
4.2.2. Hardness for the 2-UDT Problem. In this section, we shall prove the hardness result for the 2-UDT problem.

THEOREM 4.4. For any $\epsilon > 0$, it is NP-hard to approximate the 2-UDT problem within a ratio of $(2-\epsilon)$.
The proof is similar to that of Theorem 4.3. The reduction is from $d$-MSSC instances, for a suitable constant $d$. Observe that the entries in the table can only be 0 or 1, as opposed to the indices of the elements used in the previous construction. The required reduction is obtained by using $\lceil \log d \rceil$ auxiliary columns to identify the elements of each set. The rest of the section is devoted to a formal proof.
We first present a general construction, using which we shall derive the theorem. Fix any integer $d > 0$; we shall show a reduction from $d$-MSSC to 2-UDT. Given a $d$-MSSC instance $S = \{S_1, S_2, \ldots, S_m\}$ over a universe of items $U = \{x_1, x_2, \ldots, x_N\}$, construct a binary table $D$ having $N$ entities and $m(1 + \lceil \log d \rceil)$ attributes, as follows. Each item $x \in U$ corresponds to an entity $x$ in $D$. Each set $S_i$ corresponds to $1 + \lceil \log d \rceil$ attributes: a main attribute named $a_i$ and $\lceil \log d \rceil$ auxiliary attributes named $a_{i,1}, a_{i,2}, \ldots, a_{i,\lceil \log d \rceil}$. For filling the table $D$, consider each set $S_j$. Order the items in $S_j$ arbitrarily and let the ordering be $S_j = x'_1, x'_2, \ldots, x'_\ell$, where $\ell \le d$. For each entity $x'_i \in S_j$, set $x'_i.a_j = 1$; write $i$ as a $\lceil \log d \rceil$-bit string and fill the entries $x'_i.a_{j,1}, x'_i.a_{j,2}, \ldots, x'_i.a_{j,\lceil \log d \rceil}$ with these bits. For any entity $x \notin S_j$, set the value on all these $1 + \lceil \log d \rceil$ attributes to be 0. This completes the construction of $D$. We make two claims connecting the solutions of the MSSC and the decision trees of $D$.
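The table construction can be sketched as follows (our illustration; to keep every index within $\lceil \log_2 d \rceil$ bits, the sketch numbers the members of each set from 0 rather than 1, which distinguishes them equally well):

```python
import math

def dmssc_to_binary_table(universe, sets, d):
    """Binary 2-UDT table: for each set S_j, one main attribute plus
    ceil(log2 d) auxiliary attributes holding the index of each
    member of S_j as a bit string (all zero for non-members)."""
    bits = max(1, math.ceil(math.log2(d)))
    table = {x: [] for x in universe}
    for s in sets:
        members = sorted(s)  # an arbitrary but fixed ordering of S_j
        assert len(members) <= d
        for x in universe:
            if x in s:
                i = members.index(x)  # 0-based index within S_j
                row = [1] + [(i >> b) & 1 for b in range(bits - 1, -1, -1)]
            else:
                row = [0] * (1 + bits)
            table[x].extend(row)
    return table

U = ['p', 'q', 'r']
S = [{'p', 'q'}, {'q', 'r'}]
T = dmssc_to_binary_table(U, S, d=2)
assert all(len(row) == 2 * (1 + 1) for row in T.values())
assert len(set(map(tuple, T.values()))) == len(U)  # rows are distinct
```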
Any decision tree $T$ for $D$ is a binary tree in which each internal node has two branches labeled 0 and 1; we call these the 0-branch and the 1-branch, and the corresponding children the 0-child and the 1-child, respectively. Let $\pi = S'_1, S'_2, \ldots, S'_m$ be an ordering of $S$. We say that a set $S'_i$ covers an entity $x$ if $x \in S'_i$ and $x \notin S'_j$, for all $j < i$.
LEMMA 4.5. Given an ordering $\pi$ of $S$, we can construct a decision tree $T$ of $D$ such that $|T| \le \mathrm{cost}(\pi) + N \lceil \log d \rceil$. In particular, if $\pi^*$ and $T^*$ are the respective optimal solutions, then $|T^*| \le \mathrm{cost}(\pi^*) + N \lceil \log d \rceil$.
PROOF. Let $\pi = S'_1, S'_2, \ldots, S'_m$ and let $a'_1, a'_2, \ldots, a'_m$ be the sequence of main attributes corresponding to these sets. We construct a tree $T$ that is "almost left-deep." We start the construction by making $a'_1$ the label of the root node. Notice that all the entities in $S'_1$ will follow the 1-branch and all the other entities will follow the 0-branch. For the former entities, the auxiliary attributes corresponding to $S'_1$ contain the index of the entities within $S'_1$. So, using these attributes, we can identify each entity within $S'_1$, paying a cost of at most $\lceil \log d \rceil$. (Formally, construct a complete binary tree of depth $\lceil \log d \rceil$ with the auxiliary attributes as labels, assign the entities within $S'_1$ appropriately, and attach this tree to the 1-branch of the root node.) We make $a'_2$ the label of the 0-child of the root node. The discussion for this node is similar to that of the root node. All entities that are covered by $S'_2$ will follow the 1-branch and are identified using the auxiliary attributes corresponding to $S'_2$. The remaining entities follow the 0-branch. In general, if we follow the 0-branches from the root node and reach a level $i$, we will have a node labeled by $a'_i$. The entities covered by $S'_i$ will follow the 1-branch of this node and they will be identified using a complete binary tree of depth at most $\lceil \log d \rceil$. For any entity $x$, if $S'_i$ is the set covering $x$, then starting at the root node, $x$ will follow the 0-branch until it reaches the node labeled $a'_i$, where it will follow the 1-branch and then get identified by the complete binary tree on the 1-branch. The entity incurs a cost of $i$ for the former process and a cost of at most $\lceil \log d \rceil$ for the latter process. Thus $|T| \le \mathrm{cost}(\pi) + N \lceil \log d \rceil$.
LEMMA 4.6. Given a decision tree $T$ for $D$, we can construct an ordering $\pi$ for $S$ such that $\mathrm{cost}(\pi) \le |T| + N$.

PROOF. Starting from the root node, traverse the tree always taking the 0-branch until an entity $x^*$ (a leaf node) is reached. Let $b'_1, b'_2, \ldots, b'_r$ be the sequence of attributes seen in this traversal. Construct an ordering $\pi$ by writing down the sets corresponding to these attributes (each attribute $b'_i$ can either be a main attribute or an auxiliary attribute; in either case, we write down the corresponding set). The sequence may not include all sets. However, except for $x^*$, all the other entities are covered by the sets in the sequence $\pi$. We deal with $x^*$ by appending to $\pi$ any set that includes $x^*$ (and to be formally compliant with the definition of MSSC solutions, the sets not listed in $\pi$ are appended in an arbitrary order). Notice that $\mathrm{cost}(\pi) \le |T| + N$ (the extra $N$ is included for handling the cost of covering $x^*$).
For convenience, we switch over to average costs instead of absolute costs. If $\pi$ is an ordering of an MSSC instance having $N$ items, we define $\mathrm{cost}_a(\pi) = \mathrm{cost}(\pi)/N$. Similarly, if $T$ is a decision tree of a table $D$ having $N$ entities, we define $|T|_a = |T|/N$. An important property of $d$-MSSC is that every set can cover at most $d$ (a constant) items and so, in any solution, the average cost is at least $N/2d$. The following proposition formalizes the claim.

PROPOSITION 4.7. For any $d$, for any $d$-MSSC instance having $N$ items, the optimal ordering $\pi^*$ satisfies $\mathrm{cost}_a(\pi^*) \ge N/2d$.
LEMMA 4.8. Suppose that, for some constant $\alpha$, there exists an $\alpha$-approximation for the 2-UDT problem. Let $\delta > 0$ and $d$ be any constants. Then, there exist an algorithm for the $d$-MSSC problem and a constant $N_0$ such that the algorithm achieves an approximation factor of $(1+\delta)\alpha$ on all instances whose universe contains at least $N_0$ items.
PROOF. Given a $d$-MSSC instance $M$ over a universe containing $N$ elements, construct a binary table $D$ using the (reduction) procedure described before. Use the $\alpha$-approximation algorithm to obtain a solution $T$ for $D$. Apply Lemma 4.6 to transform $T$ into a solution $\pi$ for $M$. Let $T^*$ and $\pi^*$ denote the optimal solutions for $D$ and $M$, respectively. By Lemmas 4.5 and 4.6, we can relate $\mathrm{cost}(\pi)$ and $\mathrm{cost}(\pi^*)$ as follows:
$$\mathrm{cost}_a(\pi) \le |T|_a + 1 \le \alpha |T^*|_a + 1 \le \alpha\,\mathrm{cost}_a(\pi^*) + \alpha \lceil \log d \rceil + 1.$$
We shall choose a suitable $N_0$ such that $(\alpha \lceil \log d \rceil + 1) \le \alpha\delta\,\mathrm{cost}_a(\pi^*)$ or, equivalently, $\mathrm{cost}_a(\pi^*) \ge (\alpha \lceil \log d \rceil + 1)/(\alpha\delta)$. This would imply that the algorithm achieves an approximation factor of $\alpha(1+\delta)$ on all instances having at least $N_0$ items. This task is accomplished by applying Proposition 4.7, which says that $\mathrm{cost}_a(\pi^*) \ge N/2d$. So, we fix
$$N_0 = \frac{2d\,(\alpha \lceil \log d \rceil + 1)}{\alpha\delta}.$$
We now prove Theorem 4.4. Suppose there exists an $\alpha$-approximation algorithm for the 2-UDT problem for some constant $\alpha < 2$. Choose $\delta > 0$ such that $(1+\delta)\alpha < 2$ and let $\beta = (1+\delta)\alpha$. Invoke Theorem 4.2 to obtain a constant $d$ such that it is NP-hard to approximate $d$-MSSC within a factor of $\beta$. Now, by Lemma 4.8, there exists an algorithm for the $d$-MSSC problem that has an approximation ratio of $\beta$ on all instances over a universe of size at least $N_0$. For instances having a smaller universe, we can perform an exhaustive search in polynomial time, since $N_0$ is a constant. This means that NP = P. We have proved the theorem.
5. RAMSEY NUMBERS AND ERDŐS'S CONJECTURE

In this section, we take a closer look at our approximation ratio and discuss its connection to a Ramsey-theoretic conjecture by Erdős. We presented an algorithm for the WDT problem having an approximation ratio of $O(r_K \log N)$. Let us now focus on bounds for the inverse Ramsey numbers $r_n$, for $n \ge 1$.
Recall that for any $k$, $R_k \ge (3^k + 1)/2$ [Nesetril and Rosenfeld 2001; Schur 1916]. From this we get that for any $n$, $r_n \le 2 + 0.64 \log n$. Notice that any improvement in the upper bound on $r_n$ would automatically improve our approximation ratio. Better upper bounds are known for $r_n$ (see Nesetril and Rosenfeld [2001], Exoo [1994], and Chung and Grinstead [1983]); but they improve the preceding bound only by constant factors. We observe that the upper bound for $r_n$ cannot be improved significantly because of the following result: $R_k \le 1 + k!\,e$ [West 2001], which implies $r_n = \Omega(\log n / \log \log n)$.
Observe that our approximation ratio actually involves $\vec{r}_n$, rather than $r_n$. Therefore, one can try to derive a better upper bound on $\vec{r}_n$. Unfortunately, we show that $\vec{r}_n = \Omega(\log n / \log \log n)$ as well. The claim is implied by the following theorem, which can be proved based on an argument similar to the one used to obtain the same bound for $R_k$.

THEOREM 5.1. For any $k$, $\vec{R}_k \le 1 + k!\,e$.
PROOF. Let $n = \vec{R}_k - 1$. By the definition of $\vec{R}_k$, $\vec{G}_n$ can be directed Ramsey colored using only $k$ colors. Let $\tau$ be such a coloring. Pick any vertex of $\vec{G}_n$, say the vertex $u$. We first claim that $\tau$ can be transformed to be symmetric with respect to $u$, meaning we can modify $\tau$ in such a way that for any other vertex $x$, the edge $(x,u)$ gets the same color as $(u,x)$. This is accomplished by considering each vertex $x$ and (locally) relabeling the colors assigned to its outgoing edges such that $(x,u)$ gets the same color as $(u,x)$. This does not increase the number of colors. From now on, it is assumed that we have modified $\tau$ in the preceding manner.

There are $n-1$ outgoing edges from $u$, which are colored using $k$ colors. So, there must exist a color class having at least $(n-1)/k$ edges; that is, there should exist a color $c$ such that the set of vertices $V = \{x : \tau(u,x) = c\}$ satisfies the inequality $|V| \ge (n-1)/k$. The main observation is that for any $x, y \in V$, the edge $(x,y)$ cannot have $c$ as its color. The observation can be seen as follows. We have $\tau(u,x) = \tau(u,y) = c$, and so $\tau(x,y)$ should be different from $\tau(x,u)$, by the definition of directed Ramsey colorings. On the other hand, $\tau(x,u) = c$, because of the transformation that we performed. Therefore $\tau(x,y) \ne c$. To summarize, we have argued that the color $c$ is not assigned to any edge in the subgraph induced by $V$. Therefore, only $k-1$ colors are used for the edges of this subgraph. It follows that $|V| \le \vec{R}_{k-1} - 1$. Putting together the lower and upper bounds on $|V|$, we get that $(n-1)/k \le |V| \le \vec{R}_{k-1} - 1$. Hence, $n \le k(\vec{R}_{k-1} - 1) + 1$. Since $n = \vec{R}_k - 1$, we have established the following recurrence relation on the directed Ramsey numbers:
$$\vec{R}_k \le 2 + k(\vec{R}_{k-1} - 1),$$
with the boundary condition being $\vec{R}_1 = 3$. By solving the recurrence relation, we get that for any $k \ge 1$,
$$\vec{R}_k \le 2 + k! \sum_{i=0}^{k-1} \frac{1}{i!}.$$
The RHS is at most $(1 + k!\,e)$. The theorem is proved.
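The unrolling of the recurrence can be checked mechanically; the following sketch (ours, using exact rational arithmetic) confirms that the closed form matches the recurrence and stays below $1 + k!\,e$ for small $k$:

```python
import math
from fractions import Fraction

def rec_bound(k):
    """Unroll R_k <= 2 + k * (R_{k-1} - 1) with R_1 = 3, as equalities."""
    r = 3
    for j in range(2, k + 1):
        r = 2 + j * (r - 1)
    return r

def closed_form(k):
    """Closed form 2 + k! * sum_{i=0}^{k-1} 1/i!, computed exactly."""
    return 2 + math.factorial(k) * sum(Fraction(1, math.factorial(i))
                                       for i in range(k))

for k in range(1, 13):
    assert rec_bound(k) == closed_form(k)
    assert closed_form(k) <= 1 + math.factorial(k) * math.e
```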
Notice that there is a gap between the upper and lower bounds for $R_k$. Erdős conjectured that for some constant $\alpha$, for all $k$, $R_k \le \alpha^k$. This is equivalent to $r_n = \Omega(\log n)$.
We discuss the implication of our results in possibly proving the conjecture. The idea is to show that, in terms of worst-case performance factors, the rounded greedy algorithm performs poorly! We observe that a lower bound of $\Omega(\log K \log N)$ on the approximation ratio of the rounded greedy algorithm would imply the conjecture. More explicitly, we note that the following hypothesis implies the conjecture.

Hypothesis. There exists a constant $\beta > 0$ such that for any $K$, there exist a $K$-WDT table $D$ and a weight function $w(\cdot)$ on which the tree $T$ produced by the rounded greedy algorithm is such that $w(T) \ge (\beta \log K \log N)\,w(T^*)$, where $T^*$ is the optimal solution.
A result by Garey and Graham [1974] could be a starting point for constructing such instances. They analyzed the worst-case performance of the greedy procedure for the 2-UDT problem and, by constructing counterexamples, obtained a lower bound of $\Omega(\log N / \log \log N)$ on the approximation ratio of the procedure.
One can also attempt to prove the conjecture under the assumption NP $\ne$ P, by showing that it is NP-hard to approximate $K$-WDT within a factor of $\Omega(\log K \log N)$. More precisely, exhibit a constant $c > 0$ and show that for all $K \ge 2$, it is NP-hard to approximate the $K$-WDT problem within a factor of $c \log K \log N$. However, as mentioned in the Introduction, extending the $O(\log N)$-approximation algorithm for the UDT problem by Chakaravarthy et al. [2009] to the weighted case would rule out this approach.
6. CONCLUSION AND OPEN PROBLEMS

We studied the problem of constructing good decision trees for entity identification, in the general setup where attributes are multivalued and the entities are associated with probabilities. We designed an algorithm and proved an approximation ratio involving Ramsey numbers, and also presented hardness results.

There are several interesting open questions. An obvious avenue is to bridge the gap between the approximation ratio and the hardness factor for 2-UDT, $K$-UDT, and WDT.
The directed Ramsey numbers $\vec{r}_n$ introduced in this article pose challenging open problems: Is $\vec{r}_n = r_n$, for all $n$? Is $\vec{r}_n = O(\log n / \log \log n)$? Proving the second statement in the affirmative would improve our approximation ratios. If both statements are shown true, then the conjecture by Erdős would be disproved! Finally, it would be interesting if the conjecture could be proved using the approach suggested.
ACKNOWLEDGMENTS
We thank A. Guillory for useful discussions and the anonymous referees for helpful comments.
ACM Transactions on Algorithms, Vol. 7, No. 2, Article 15, Publication date: March 2011.
Received February 2008; revised December 2008; accepted April 2009