Reducing Metadata Complexity for Faster Table
Summarization∗
K. Selçuk Candan
Arizona State University
Tempe, AZ 85283, USA
candan@asu.edu
Mario Cataldi
Università di Torino
Torino, Italy
cataldi@di.unito.it
Maria Luisa Sapino
Università di Torino
Torino, Italy
mlsapino@di.unito.it
ABSTRACT
Since the visualization real estate puts stringent constraints
on how much data can be presented to the users at once,
table summarization is an essential tool in helping users
quickly explore large data sets. An effective summary needs
to minimize the information loss due to the reduction in de-
tails. Summarization algorithms leverage the redundancy
in the data to identify value and tuple clustering strate-
gies that represent the (almost) same amount of information
with a smaller number of data representatives. It has been
shown that, when available, metadata, such as value hier-
archies associated to the attributes of the tables, can help
greatly reduce the resulting information loss. However, table
summarization, whether carried out through data analysis
performed on the table from scratch or supported through
already available metadata, is an expensive operation. We
note that the table summarization process can be signifi-
cantly sped up when the metadata used for supporting the
summarization itself is pre-processed to reduce the unnec-
essary details. The pre-processing of the metadata, how-
ever, needs to be performed carefully to ensure that it does
not add significant amounts of additional loss to the table
summarization process. In this paper, we propose a tRedux
algorithm for value hierarchy pre-processing and reduction.
Experimental evaluations show that, depending on the ta-
ble and taxonomy complexity, metadata summarization can
provide gains in table summarization time that can range
(in absolute values) from seconds to 10s-of-1000s of seconds.
Consequently, while resulting in only an extra ∼ 20% re-
duction in table quality, tRedux can provide ∼ 2× speedups
in table summarization time. Experiments also show that
tRedux has a better performance than alternative metadata
reduction strategies in supporting table summarization; and,
as the taxonomy complexity increases, the absolute gains of
tRedux also increase.
∗Partially supported by NSF Grant “Archaeological Data
Integration for the Study of Long-Term Human and Social
Dynamics” (0624341)
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
EDBT 2010, March 22–26, 2010, Lausanne, Switzerland.
Copyright 2010 ACM 978-1-60558-945-9/10/0003 ...$10.00
[Figure 1: Value hierarchy for attribute Age (a) and Location (b);
directed edges denote the clustering/summarization direction (taken
from [1]). Panel (a), Hierarchy for Age, nodes: *, 1*, 2*, 12, 19,
22, 27. Panel (b), Hierarchy for Location, nodes: U.S., Southwest,
Arizona, California, Maryland, Phoenix, Los Angeles, San Diego,
Baltimore, Frederick.]
Categories and Subject Descriptors
H.2.4 [Information Systems]: Database Management—
systems, database applications; H.3.3 [Information Sys-
tems]: Information Storage and Retrieval—information search
and retrieval
General Terms
Algorithms, Experimentation
Keywords
Table Summarization, Metadata Complexity, Taxonomy re-
duction
1. INTRODUCTION
Table summarization is an important tool in helping users
quickly explore large data sets. An effective summary needs
to minimize the information loss due to the reduction in
details; in particular each tuple in the original table needs
to be represented in the summary with a sufficiently sim-
ilar tuple. Moreover, each tuple in the summary must be
sufficiently different from other summary tuples to ensure
that the summary real-estate is not wasted. Summariza-
tion algorithms leverage the underlying redundancy (such
as approximate functional dependencies and other patterns)
in the data to identify value and tuple clustering strategies
that represent the (almost) same information with a smaller
number of data representatives.
When available, metadata, such as value hierarchies (like
the ones shown in Figure 1) can help greatly reduce the re-
sulting information loss. Value hierarchies have been com-
monly used for user-driven data analysis (e.g. OLAP [2])
and exploration within large data sets. In [1], we have shown
that value hierarchies associated with the attributes of the
tables can also be used to support table summarization.
Consider, for example, Table 1(a), which shows a data table
consisting of 6 rows. If the user is interested in summarizing
this table based on the attribute pair ⟨Age, Location⟩
in such a way that the summarized table can be visualized
in a space that can hold at most 2 tuples, the hierarchies in
Figure 1 can be used to obtain the summary in Table 1(b).

(a) Data table
Name     Age   Location
John     12    Phoenix
Sharon   19    Los Angeles
Mary     19    San Diego
Peter    22    Baltimore
James    22    Frederick
Alice    27    Baltimore

(b) Summarized table
Name     Age   Location
-        1*    Southwest
-        2*    Maryland

Table 1: (a) A database and (b) a summary on the
⟨Age, Location⟩ pair using hierarchies in Figure 1 (also
taken from [1])

[Figure 2: Two possible reductions of the location hierarchy in
Figure 1. (a) Alternative #1, nodes: U.S., Southwest, Maryland,
Phoenix, Los Angeles, San Diego, Baltimore, Frederick. (b)
Alternative #2, nodes: U.S., Arizona, California, Phoenix,
Los Angeles, San Diego, Baltimore, Frederick.]
In this paper, we note that table summarization, whether
carried out through data analysis performed on the table
from scratch or supported through already available meta-
data, is an expensive operation. For example, the computa-
tional cost of the metadata supported table summarization
process is exponential in the depth of the hierarchy (i.e., the
number of alternative value clustering strategies) [1, 3].
The key observation driving the work in this paper is that
the speed of the table summarization process can be sig-
nificantly improved when the metadata used for supporting
summarization itself is pre-processed to reduce its unproduc-
tive details. Metadata are often provided by domain experts
and their primary usage is in data organization and interpre-
tation. As such, they can be overly detailed and all of these
details may not be relevant in the summary of the data.
The pre-processing of the metadata to eliminate details not
relevant for obtaining a table summary, however, needs to
be performed carefully to ensure that it does not add signif-
icant amounts of additional loss to the table summarization
process. For example, while the reduced hierarchy in
Figure 2(a) would still give the table summary in Table 1(b),
the alternative in Figure 2(b) would cause further loss in the
table summary. Thus, intuitively, in this context a “good”
reduction of the given metadata is the one that leads to
high-quality table summaries.
Based on these observations, we propose a novel hierar-
chy/taxonomy reduction method, tRedux, which reduces the
complexities (defined in terms of the hierarchy height and
density) of the given value hierarchies while preserving their
effectiveness in table summarization. In the next section,
we provide an overview of the relevant literature. Section 3
formalizes the table summarization problem and introduces
the quality measures for the table summaries. In Sections 4
and 5, we first introduce our tRedux algorithm for taxonomy
reduction and, then, describe the use of tRedux within the
table summarization process.
Evaluations presented in Section 6 show that, depending
on the table and taxonomy complexity, metadata summa-
rization can provide gains in table summarization time that
can range (in absolute values) from seconds to 10s-of-1000s
of seconds. Consequently, while resulting in only an extra
∼ 20% reduction in table quality, tRedux can provide ∼ 2×
speedups in table summarization time. Experiments also
show that tRedux has a better performance than alternative
metadata reduction strategies in supporting table summa-
rization; and, as the taxonomy complexity increases, the
absolute gains of tRedux also increase.
2. RELATED WORK
2.1 Table Summarization
[4, 5] present a table summarization system, SaintEtiQ,
which computes and incrementally maintains a hierarchi-
cally arranged set of summaries of the input table. SaintE-
tiQ uses background knowledge (i.e., metadata) to support
these summaries. [6] also performs data summarization, but
it relies on frequent patterns in the relational dataset. Tab-
Sum [7] creates and maintains table summaries through row
and column reductions. To reduce the number of rows, the
algorithm first partitions the original table into groups based
on one or more attribute values of the table, and then col-
lapses each group of rows into a single row relying on the
available metadata, such as the concept hierarchy. For col-
umn reduction, it simplifies the value representation and/or
merges multiple columns.
Data compression techniques, like Huffman or Lempel-
Ziv, can also be used to reduce the size of the table. For ex-
ample, [8] presents a database compression technique based
on vector quantization. Buchsbaum et al. [9] develop al-
gorithms to compress massive tables through a partition-
training paradigm, but the compressed tables are not human
readable. Histograms can also be exploited to summarize in-
formation into a compressed structure. Following this idea,
Buccafurri et al. [10] introduce a quad-tree based partition
schema for summarizing two-dimensional data. Leveraging
the quad-tree structure, [11] proposes approaches to pro-
cessing OLAP operations over the summary. In fact, any
multidimensional clustering algorithm can be used to sum-
marize a table. Such methods, however, do not take into
account specific domain knowledge (e.g. “what are accept-
able summarizations, how do they rank?”) that hierarchies
would provide.
The concept of imprecision in OLAP dimensions is dis-
cussed in [12]. In that framework, a fact (e.g., a tuple in
the table) with imprecise data is associated with dimen-
sion values of coarser granularities, resulting in the dimen-
sional imprecision. In [13], we supported OLAP operations
over imperfectly integrated taxonomies. We proposed a re-
classification strategy which eliminates conflicts by introduc-
ing minimal imprecision. This approach is complementary
to the work presented here: the obtained navigable taxon-
omy can be taken as the input for summarization.
The table summarization task is also related to the
k-anonymization problem, introduced as a technique against
linkage attacks on private data [3]. The k-anonymization
approach eliminates the possibility of such attacks by
ensuring that, in the disseminated table, each value
combination of attributes is matched to at least k−1 others.
To achieve this, k-anonymization techniques rely on a-priori
knowledge about acceptable value generalizations. Cell
generalization schemes [14] treat each cell in the data table
independently. Thus, different cells for the same attribute may
be generalized in a different way. This provides significant
flexibility in anonymization, while the problem remains
extremely hard (NP-hard [15]) and only approximation algorithms
are applicable under realistic usage scenarios [14]. Attribute
generalization schemes [3, 16] treat all values for a given
attribute collectively; i.e., all values are generalized using
the same unique domain generalization strategy. While the
problem remains NP-hard (in the number of attributes), this
approach saves a significant amount of time in processing and
may eliminate the need for using approximation solutions,
since it does not need to consider the individual values. Most
of these schemes, such as Samarati’s original algorithm [3],
rely on the fact that, for a given attribute, applicable
generalizations are in total order and that each generalization
step in this total order has the same cost. [3] leverages this to
develop a binary search scheme to achieve savings in time. [16]
relies on the same observation to develop an algorithm which
achieves attribute-based k-anonymization one attribute at a
time, while pruning unproductive generalization strategies.
2.2 Metadata Reduction
Ontology summarization has been used to support various
reasoning tasks. Fokouel et al. [17] focus on the problem of
summarizing OWL ontologies to reduce the cost of reasoning
with ontologies. Often, metadata (such as RDF collections)
can be seen as graphs. [18] analyzes the metadata
graph to measure centrality of the nodes (e.g. concepts)
in the graph. In this approach, the centrality of a given
node reflects its degree of salience; thus the most salient
nodes are maintained in the summary. In contrast, clustering
based approaches discover groups of nodes such that
the intra-group similarity is maximized and the inter-group
similarity is minimized [19, 20]. [21] presented a novel
hierarchical clustering algorithm, merging two clusters only if
the inter-connectivity and closeness between two clusters are
high relative to the internal inter-connectivity of the clusters
and closeness of items within the clusters. [22] uses an
agglomerative clustering algorithm that merges the two clusters
with the greatest number of common neighbors. More
recently, [23] and [24] use random walk based cluster
analysis techniques.
Since many metadata types, such as taxonomies, are
hierarchical, researchers also experimented with tree
summarization algorithms [25, 26]. Davood et al. [26] observed that
summaries of XML trees did a much better job in document
clustering tasks than edit distance values, e.g. [27], computed
on the original trees. DataGuides [25] was one of the first
approaches which attempted to construct structural summaries
of hierarchical structures to support efficient query
processing. Though this and similar methods work fine for
tree based structures, the constructed summaries are not
trees, but graphs. Various other summarization algorithms,
such as [26, 28], focus on creating summaries suitable for
efficient similarity-search in tree-structured data. Since, once
again, the goal of these algorithms is not to obtain a smaller
tree representing the larger one provided as input, but to
find a representation that will speed up query processing,
the resulting summaries are in the form of strings, hash
sequences, and concept/label vectors. Our goal in this paper,
however, is to reduce the size of the input taxonomy tree
to support the table summarization process; therefore, these
and similar algorithms are not applicable.
3. TABLE SUMMARIZATION
Summarization of large data tables is required in many
scenarios where it is hard to display complete data sets.
The summarization process takes as input a database table
and returns a reduced version of it. The result provides
tuples with less precision (knowledge relaxation) than the
original, but still informative of the content of the database.
This reduced form can then be presented to the user for
exploration or be used as input for advanced data mining
processes.
3.1 Value Clustering Hierarchies
A value clustering hierarchy¹ is a tree H(V,E) where V
encodes values and clustering-identifiers (e.g., high-level
concepts or cluster labels, such as “1*” in Figure 1(a)), and E
contains acceptable value clustering relationships.
Definition 3.1 (Value Hierarchy). A value clustering
hierarchy, H, is a tree H(V,E):
• v=(id:value)∈V where v.id is the node id in the tree
and v.value is either a value in the database or a value
clustering encoded by the node.
• e = vi → vj ∈ E is a directed edge denoting that the
value encoded by the node vj can be clustered under
the value encoded by the node vi.
Those nodes in V which correspond to the attribute values
that are originally in the database do not have any outgoing
edges in E; i.e., they form the leaves of the hierarchy. Given
an attribute value in the data table T and a value hierarchy
corresponding to that attribute, we can define alternative
clusterings as paths on the corresponding hierarchy.
Definition 3.2 (Value Clustering). Given a value
clustering hierarchy H, a tree node vi is a clustering of a
tree node vj, denoted by vj ⪯ vi, if ∃ path p = vi ⇝ vj in H.
We also say that vi covers vj.
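Definitions 3.1 and 3.2 can be made concrete with a small sketch. The class below is only an illustration under simplifying assumptions: nodes are identified by their values rather than by (id:value) pairs, the names `ValueHierarchy`, `ancestors`, `covers`, and `leaves` are hypothetical, and a node is not considered to cover itself (only paths with at least one edge count).

```python
class ValueHierarchy:
    """Value clustering hierarchy H(V, E), stored as child -> parent links."""

    def __init__(self, edges):
        # Each pair (vi, vj) is an edge vi -> vj: vj can be clustered under vi.
        self.parent = {vj: vi for vi, vj in edges}

    def ancestors(self, v):
        """Yield the parent of v, then its grandparent, ... up to the root."""
        while v in self.parent:
            v = self.parent[v]
            yield v

    def covers(self, vi, vj):
        """True if a path vi ~> vj exists, i.e. vi is a clustering of vj."""
        return vi in self.ancestors(vj)

    def leaves(self):
        """Nodes without outgoing edges: the values stored in the database."""
        return set(self.parent) - set(self.parent.values())


# Age hierarchy of Figure 1(a).
age = ValueHierarchy([
    ("*", "1*"), ("*", "2*"),
    ("1*", "12"), ("1*", "19"),
    ("2*", "22"), ("2*", "27"),
])

print(age.covers("1*", "19"))  # 19 can be clustered under 1*
print(age.covers("2*", "19"))  # but not under 2*
print(sorted(age.leaves()))
```

Storing only child-to-parent links is sufficient here because every clustering question in this section walks from a value toward the root.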
3.2 Tuple-Clustering and Table Summary
Let us consider a data table, T, and a set, SA, of
summarization attributes. Roughly speaking, our purpose is to
find another relation T′ which clusters the values in T such
that T′ summarizes T with respect to the
summarization-attributes. Based on the above, in the following, we
formalize the concept of tuple summarization.
Definition 3.3 (Tuple-Clustering). Let t be a tuple
on attributes SA = {Q1,··· ,Qq}. t′ is said to be a clustering
of the tuple, t, (on attributes SA) iff ∀i∈[1,q]
• t′[Qi] = t[Qi], or
• ∃ path pi = t′[Qi] ⇝ t[Qi] in the corresponding value
hierarchy Hi.
In this paper, we use t ⪯ t′ as shorthand.

¹ We use the terms “value hierarchy” and “value clustering
hierarchy” interchangeably.
Given this definition of tuple-clustering, we can define the
summary of a table as a one-to-one and onto mapping which
clusters the tuples of the original table.
Definition 3.4 (Table Summary). Given two data
tables T and T′ (with the same schema), and the
summarization-attribute set SA, T′ is said to be a summary
of T on attributes in SA (T[SA] ⪯ T′[SA] for short) iff
there is a one-to-one and onto mapping, μ, from the tuples
in T[SA] to T′[SA], such that
• ∀t ∈ T[SA], t ⪯ μ(t)
Here, T[SA] and T′[SA] are projections of the data tables T
and T′ on summarization-attributes.
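The two definitions above can be checked mechanically. The sketch below is a hedged illustration, not the authors' implementation. It assumes that tables are bags of rows, so the summary holds one (possibly repeated) generalized row per original row and μ is given as a list of row indices. The function names are hypothetical, and the edges of the Location hierarchy are our reading of Figure 1(b).

```python
def covers(parent, vi, vj):
    """True if vi is a proper ancestor of vj in the child -> parent map."""
    while vj in parent:
        vj = parent[vj]
        if vj == vi:
            return True
    return False


def clusters_tuple(t, t_sum, hierarchies):
    """Definition 3.3: t_sum equals or covers t on every attribute."""
    return all(
        t_sum[q] == t[q] or covers(parent, t_sum[q], t[q])
        for q, parent in hierarchies.items()
    )


def is_summary(table, summary, mu, hierarchies):
    """Definition 3.4; mu[i] is the index in `summary` of the image of table[i]."""
    bijective = (len(mu) == len(table)
                 and sorted(mu) == list(range(len(summary))))
    return bijective and all(
        clusters_tuple(t, summary[mu[i]], hierarchies)
        for i, t in enumerate(table)
    )


AGE = {"12": "1*", "19": "1*", "22": "2*", "27": "2*", "1*": "*", "2*": "*"}
# Assumed edges for Figure 1(b).
LOCATION = {
    "Phoenix": "Arizona", "Los Angeles": "California",
    "San Diego": "California", "Arizona": "Southwest",
    "California": "Southwest", "Baltimore": "Maryland",
    "Frederick": "Maryland", "Southwest": "U.S.", "Maryland": "U.S.",
}
hierarchies = {"Age": AGE, "Location": LOCATION}

rows = [("12", "Phoenix"), ("19", "Los Angeles"), ("19", "San Diego"),
        ("22", "Baltimore"), ("22", "Frederick"), ("27", "Baltimore")]
table = [{"Age": a, "Location": l} for a, l in rows]
summary = ([{"Age": "1*", "Location": "Southwest"}] * 3
           + [{"Age": "2*", "Location": "Maryland"}] * 3)
identity = list(range(len(table)))

print(is_summary(table, summary, identity, hierarchies))
```

Under the bag reading, Table 1(b) lists only the two distinct rows of this six-row summary.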
3.3 Table Summarization Process
As described in Section 2.1, there are various metadata
supported table summarization algorithms [1, 3, 16]. Without
loss of generality, in this work, we use Samarati’s
k-anonymization algorithm [3] as the back-end table
summarizer. In [3], each unique tuple gets clustered with at least
k−1 other similar tuples to ensure that no single tuple can be
uniquely identified. The algorithm uses attribute value hier-
archies to ensure that the amount of loss (i.e., value general-
izations using the value hierarchies) is minimized. For each
attribute, the algorithm takes a value clustering hierarchy (a
taxonomy) which describes the generalization/specialization
relationship between the possible values. For example,
consider a table with a “Location” attribute. The hierarchy
represents all the relevant values in the corresponding domain as
the leaves of a tree (Figure 1(b)). The internal nodes in the
value hierarchy will correspond to appropriate (geographic
or political) clusterings of countries. Thus, in a summary,
an internal node of the hierarchy can be used to cluster all
the leaves below it using a more general label. If in the
summary a leaf value is used, this gives zero generalization
(g = 0); if, on the other hand, a leaf at depth d is replaced
with an internal node at depth d′, this causes g = d − d′
steps of generalization; of course, by picking clusters closer
to the root, the algorithm will be able to summarize more
easily. On the other hand, more general cluster labels also
cause higher degree of knowledge relaxation. In Section 3.4,
we will refer to the knowledge relaxation due to the use of
generalizing clusters as dilution.
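As a minimal sketch of the generalization count g = d − d′ described above, the following code climbs the Age hierarchy of Figure 1(a). The helper names `depth` and `generalize` are hypothetical, and the hierarchy is stored as a child-to-parent map purely for illustration.

```python
from collections import Counter

# Age hierarchy of Figure 1(a) as a child -> parent map.
AGE_PARENT = {"12": "1*", "19": "1*", "22": "2*", "27": "2*",
              "1*": "*", "2*": "*"}


def depth(parent, v):
    """Number of edges between v and the root of its hierarchy."""
    d = 0
    while v in parent:
        v = parent[v]
        d += 1
    return d


def generalize(parent, v, g):
    """Replace v by the node reached after g generalization steps."""
    for _ in range(g):
        v = parent.get(v, v)  # the root generalizes to itself
    return v


ages = ["12", "19", "19", "22", "22", "27"]  # Age column of Table 1(a)

# Replacing leaf 19 (depth 2) by 1* (depth 1) costs g = 2 - 1 = 1 step.
g = depth(AGE_PARENT, "19") - depth(AGE_PARENT, "1*")

# One step of generalization on every value yields the Age column of Table 1(b).
clusters = Counter(generalize(AGE_PARENT, a, g) for a in ages)
print(g, dict(clusters))
```

Picking a cluster label closer to the root (larger g) merges more values per label, which is exactly the trade-off between easier summarization and higher knowledge relaxation noted above.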
Among all possible clusterings that put each tuple with
k − 1 other similar ones, [3] aims to find those that require
minimal generalizations; i.e., the amount of distortion in the
data needed to achieve the clustering is as small as possible.
Intuitively, if there is a generalization at depth d that
puts all tuples into clusters of size k, then there will be
generalizations of level d′ ≤ d that also cluster all tuples into
clusters of size at least k, but will have more loss; conversely,
if one can establish that there is no generalization at level
d that is a k-clustering, then it follows that there is no
other clustering of level d′ > d that can cluster all tuples
into clusters of size at least k. Relying on the fact that for
a given attribute, applicable domain generalizations are in
total order and that each generalization step in this total
order has the same cost, [3] develops a binary search scheme to
achieve savings in time². It starts evaluating generalization
levels from the middle-level to see if there is a corresponding
k-clustering solution:
• if there is, then the algorithm tries to find another
solution with less generalization by jumping to the central
point of the half path with lower generalization;
• if there is none, on the other hand, the algorithm tries
to find a solution by jumping to the central point
of the half path with higher generalization.
The process continues in this binary search until a
generalization level such that no solution with a lower
generalization exists is found. Unfortunately, this and other
attribute-generalization based algorithms, including [16] and [29], are
all exponential in the number of attributes that need
summarization³. When there is a single attribute to summarize,
for a hierarchy, H, of depth, depth(H), the algorithm
considers log(depth(H)) alternative clustering strategies. When
there are m attributes to consider, however, the maximum
degree of generalization is $\sum_{1 \le i \le m} depth(H_i)$, where all
attributes are generalized to the max, causing the greatest
amount of loss. In this case, the algorithm considers
$\log(\sum_{1 \le i \le m} depth(H_i))$ alternative generalization levels on
the average; moreover, for each generalization level, g, the
algorithm has to consider all combinations of attribute
generalizations such that $(\sum_{1 \le i \le m} g_i) = g$. Since in general,
there can be exponentially many such combinations, the
worst case time cost of the algorithm is exponential in the
number of attributes.
One problem with using Samarati’s algorithm [3] as the
back-end table summarizer is that it assumes balanced
hierarchies as input. In order to perform table summarization
using unbalanced input hierarchies, we first balance the
input hierarchies by introducing ghost nodes that fill the empty
spots in the hierarchy. As shown in Figure 3, these ghost
nodes act as surrogates of the closest ancestors of the leaf
nodes that are out of balance: in this example, in order to
put N at the same level as the other leaves, we introduce
a ghost node between N and its parent A (Figure 3(b)).
The ghost node is also labeled A, like the parent. Note that,
whether N is generalized to its original parent or the new
(ghost) node, it will be replaced in the summarized table
with A; therefore, this transformation does not result in
additional loss.

[Figure 3: Ghost nodes in unbalanced hierarchy. Panels (a) and (b);
node labels: V2, A, N.]

² [16] relies on the same observation to develop an algorithm
which achieves attribute-based k-anonymization one
attribute at a time, while pruning unproductive attribute
generalization strategies. [29] further assumes an attribute
order and attribute-value order to develop a top-down
framework with significant pruning opportunities.
³ In fact the problem is NP-hard [1]

3.4 Quality of a Table Summary
Information-based measures of quality leverage statistical
knowledge, for example the knowledge about data
frequencies, to measure information loss. One advantage of
the use of value hierarchies for table summarization is that
the degree of loss resulting from the summarization process
can be quantified and explicitly minimized using the
available value hierarchies. Unlike purely numeric informa-
tion loss measures, such as mean squared error, and statis-
tical measures, such as entropy, classification, discernibility,
and certainty [29, 30, 31, 32], knowledge about value hier-
archies provides a mechanism to judge the significance of
the distortion within the given application domain [3, 33,
34]. For example, a commonly used technique for measur-
ing the amount of loss during the summarization process
is to count the number of generalization steps required to
obtain the summary [3, 29, 16]: given a generalization hier-
archy, each step followed to achieve the value clustering is
considered one unit of loss.
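To connect this step-counting loss with the binary search over generalization levels outlined in Section 3.3, the sketch below handles a single attribute with a balanced hierarchy. It is a simplified illustration and not Samarati's full multi-attribute algorithm [3]. The function names are hypothetical, and the search relies on the assumption that, in a balanced hierarchy, any level above a valid k-clustering level is also valid.

```python
from collections import Counter

# Age hierarchy of Figure 1(a) as a child -> parent map (height 2).
AGE_PARENT = {"12": "1*", "19": "1*", "22": "2*", "27": "2*",
              "1*": "*", "2*": "*"}


def generalize(parent, v, g):
    """Climb g edges toward the root (the root generalizes to itself)."""
    for _ in range(g):
        v = parent.get(v, v)
    return v


def is_k_clustering(values, parent, g, k):
    """True if, after g steps, every cluster label holds at least k values."""
    counts = Counter(generalize(parent, v, g) for v in values)
    return min(counts.values()) >= k


def minimal_level(values, parent, height, k):
    """Binary search for the smallest level g that gives a k-clustering."""
    if not is_k_clustering(values, parent, height, k):
        return None  # even full generalization cannot reach clusters of size k
    lo, hi = 0, height
    while lo < hi:
        mid = (lo + hi) // 2
        if is_k_clustering(values, parent, mid, k):
            hi = mid       # a solution exists: look for less generalization
        else:
            lo = mid + 1   # no solution: more generalization is required
    return lo


ages = ["12", "19", "19", "22", "22", "27"]
g = minimal_level(ages, AGE_PARENT, height=2, k=2)
loss = g * len(ages)  # each step taken by each value is one unit of loss
print(g, loss)
```

With m attributes, the same idea would have to be applied to the total level g and to every split of g across the attributes, which is where the exponential cost discussed in Section 3.3 arises.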
Definition 3.5 (Penalty of a Tuple Clustering).
Let t and t′ be two tuples on attributes SA = {Q1,··· ,Qq},
such that t ⪯ t′. Then the cost of the corresponding
clustering strategy is defined through a monotonic combination
function, ⊕, of the penalty of the clustering along each
individual summarization-attribute:
\[ \Delta(t \preceq t') = \bigoplus_{1 \le i \le q} \Delta_i, \]
where
• Δi = 0, if t′[Qi] = t[Qi]
• Δi = Δ(t[Qi],t′[Qi]) (i.e., the minimal number of
edges that separates t[Qi] from t′[Qi] in the corresponding
original hierarchy) otherwise.
Let us consider a data table T, and a set SA of
summarization attributes. Let T′ be a summary of T on attributes
in SA (i.e., T[SA] ⪯ T′[SA]). In this paper, we use the
following quality measures to evaluate table summaries:
• dilution (dl):
\[ dl(T, T', SA) = \frac{1}{|T|} \sum_{t \in T} \Delta(t[SA], \mu(t[SA])). \]
The smaller the degree of dilution, the smaller is the
amount of loss and the higher is the quality of the
summary.
• diversity (div):
\[ div(T', SA) = \frac{2}{|T'|(|T'| - 1)} \sum_{t_1, t_2 \in T' \, (t_1 \neq t_2)} \Delta(t_1[SA], t_2[SA]). \]
The greater the diversity, the higher is the quality of
the summary.
Therefore, given a table T and the set of summarization
attributes, SA, the goal of the summarization algorithm is
to find a summary such that the degree of dilution is mini-
mized, yet the diversity is maximized.
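The two measures can be sketched on the running example as follows. Several choices here are assumptions made for illustration: the combination function of Definition 3.5 is taken to be a plain sum, the diversity sum ranges over unordered pairs of distinct summary tuples, the edges of the Location hierarchy are our reading of Figure 1(b), and all function names are hypothetical.

```python
from itertools import combinations

AGE = {"12": "1*", "19": "1*", "22": "2*", "27": "2*", "1*": "*", "2*": "*"}
# Assumed edges for Figure 1(b).
LOCATION = {
    "Phoenix": "Arizona", "Los Angeles": "California",
    "San Diego": "California", "Arizona": "Southwest",
    "California": "Southwest", "Baltimore": "Maryland",
    "Frederick": "Maryland", "Southwest": "U.S.", "Maryland": "U.S.",
}
HIERARCHIES = [AGE, LOCATION]  # one child -> parent map per attribute


def path_to_root(parent, v):
    path = [v]
    while v in parent:
        v = parent[v]
        path.append(v)
    return path


def delta_value(parent, a, b):
    """Minimal number of hierarchy edges separating values a and b."""
    steps_from_a = {node: i for i, node in enumerate(path_to_root(parent, a))}
    for j, node in enumerate(path_to_root(parent, b)):
        if node in steps_from_a:  # first common ancestor
            return steps_from_a[node] + j
    raise ValueError("values do not belong to the same hierarchy")


def delta_tuple(t1, t2):
    """Per-attribute penalties combined by summation (an assumption)."""
    return sum(delta_value(h, a, b) for h, a, b in zip(HIERARCHIES, t1, t2))


def dilution(table, images):
    """images[i] is the summary tuple mu(table[i])."""
    return sum(delta_tuple(t, s) for t, s in zip(table, images)) / len(table)


def diversity(summary):
    n = len(summary)
    if n < 2:
        return 0.0
    total = sum(delta_tuple(s1, s2) for s1, s2 in combinations(summary, 2))
    return 2 * total / (n * (n - 1))


table = [("12", "Phoenix"), ("19", "Los Angeles"), ("19", "San Diego"),
         ("22", "Baltimore"), ("22", "Frederick"), ("27", "Baltimore")]
summary = [("1*", "Southwest"), ("2*", "Maryland")]  # Table 1(b)
images = [summary[0]] * 3 + [summary[1]] * 3

print(dilution(table, images), diversity(summary))
```

Under these assumptions the first three rows each cost 1 + 2 steps and the last three cost 1 + 1 steps, so the dilution is 15/6 = 2.5, while the two summary tuples are 2 + 2 = 4 edges apart.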
4. TAXONOMY REDUCTION
As we have seen in Section 3.3, the table summarization
process can be prohibitively costly, especially when the num-
ber of relevant attributes is large. Our key observation in
this paper is that it might be possible to reduce the cost of
the table summarization algorithm significantly by reducing
the sizes of the input hierarchies. The hierarchy reduction,
however, needs to be performed in such a way that it does
not add significant amounts of additional loss to the table
summarization process. In other words, the hierarchy re-
duction process should eliminate the details in the metadata
that are not likely to be used in the table summary anyhow
(remember the example in Figure 2).
In this section, we propose a tRedux algorithm for value
hierarchy reduction. Our proposed approach to taxonomy
preprocessing and reduction consists of three steps:
I: create a graph representing the structural distances
between the nodes in the taxonomy as well as the
distribution of node labels in the database;
II: partition the resulting graph into disjoint sub-graphs
based on connectivity analysis; and
III: finally, select a representative label for each partition
and reconstruct a taxonomy tree.
In Section 6, we show that taxonomies reduced using this
algorithm help preserve qualities of table summaries, while
providing significant gains in table summarization time. In
the rest of this section, we describe each of the above steps.
4.1 Step I: Constructing the Node Structural-Similarity/Occurrence Graph
Naturally, the most effective way to ensure the quality
of the table summarization process is to cluster those tax-
onomy nodes whose labels would be judged to be similar
by human users of the system. Simultaneously, the cluster
should also represent the joint-distribution of the node labels
in the database. Therefore, the first step of the process is to
create a graph that represents both the structural similarities
of the nodes and label co-occurrence in the database.
More formally, let us consider a data table T and a set SA
of summarization attributes. Let Hi(Vi,Ei) be a value
hierarchy corresponding to the attribute Qi ∈ SA. In this
step, the algorithm constructs a weighted directed graph,
Gi(Vi,E′i,w), where the set of vertices, Vi = {v1,...,vn},
corresponds to the concepts in the input taxonomy. The
weights (w : E′i → R+) associated to the edges in E′i represent
both
• similarities between pairs of concepts in the taxonomy,
and
• occurrences of data values in the database.
In structure-based methods, the similarity between two
nodes in a taxonomy is measured by the distance between
them [35] or the sum of the edge weights along the shortest
path connecting the words [36]. CP/CV, presented in [37],
first associates a node vector to each node in the taxonomy
(the vector represents the relationship of this node with the
rest of the nodes in the taxonomy) and, then, compares the
vectors to quantify how structurally similar the two nodes are.
Comparisons against other approaches on available human-
generated benchmark data [38, 39] showed that CP/CV im-
proves the correlation of the resulting similarity judgments
to human common sense. Thus, without loss of generality,
we use the node similarity measure presented in [37] to quan-
tify the structural similarities among the taxonomy nodes.
More specifically, given the data table T, for each value hi-
erarchy, Hi(Vi,Ei), we construct a complete directed graph,
244