Reducing Metadata Complexity for Faster Table Summarization∗
K. Selçuk Candan
Arizona State University
Tempe, AZ 85283, USA
candan@asu.edu
Mario Cataldi
Università di Torino
Torino, Italy
cataldi@di.unito.it
Maria Luisa Sapino
Università di Torino
Torino, Italy
mlsapino@di.unito.it
ABSTRACT
Since the visualization real estate puts stringent constraints on how much data can be presented to the users at once, table summarization is an essential tool in helping users quickly explore large data sets. An effective summary needs to minimize the information loss due to the reduction in details. Summarization algorithms leverage the redundancy in the data to identify value and tuple clustering strategies that represent the (almost) same amount of information with a smaller number of data representatives. It has been shown that, when available, metadata, such as value hierarchies associated to the attributes of the tables, can help greatly reduce the resulting information loss. However, table summarization, whether carried out through data analysis performed on the table from scratch or supported through already available metadata, is an expensive operation. We note that the table summarization process can be significantly sped up when the metadata used for supporting the summarization itself is preprocessed to reduce the unnecessary details. The preprocessing of the metadata, however, needs to be performed carefully to ensure that it does not add significant amounts of additional loss to the table summarization process. In this paper, we propose a tRedux algorithm for value hierarchy preprocessing and reduction. Experimental evaluations show that, depending on the table and taxonomy complexity, metadata summarization can provide gains in table summarization time that can range (in absolute values) from seconds to 10s of 1000s of seconds. Consequently, while resulting in only an extra ∼20% reduction in table quality, tRedux can provide ∼2× speedups in table summarization time. Experiments also show that tRedux has a better performance than alternative metadata reduction strategies in supporting table summarization; and, as the taxonomy complexity increases, the absolute gains of tRedux also increase.
∗Partially supported by NSF Grant “Archaeological Data Integration for the Study of Long-Term Human and Social Dynamics” (0624341)
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
EDBT 2010, March 22–26, 2010, Lausanne, Switzerland.
Copyright 2010 ACM 978-1-60558-945-9/10/0003 ...$10.00
Figure 1: Value hierarchy for attribute Age (a) and Location (b); directed edges denote the clustering/summarization direction (taken from [1])
Categories and Subject Descriptors
H.2.4 [Information Systems]: Database Management—systems, database applications; H.3.3 [Information Systems]: Information Storage and Retrieval—information search and retrieval

General Terms
Algorithms, Experimentation

Keywords
Table Summarization, Metadata Complexity, Taxonomy Reduction
1. INTRODUCTION
Table summarization is an important tool in helping users quickly explore large data sets. An effective summary needs to minimize the information loss due to the reduction in details; in particular, each tuple in the original table needs to be represented in the summary with a sufficiently similar tuple. Moreover, each tuple in the summary must be sufficiently different from other summary tuples to ensure that the summary real estate is not wasted. Summarization algorithms leverage the underlying redundancy (such as approximate functional dependencies and other patterns) in the data to identify value and tuple clustering strategies that represent the (almost) same information with a smaller number of data representatives.
When available, metadata, such as value hierarchies (like the ones shown in Figure 1), can help greatly reduce the resulting information loss. Value hierarchies have been commonly used for user-driven data analysis (e.g., OLAP [2]) and exploration within large data sets. In [1], we have shown that value hierarchies associated to the attributes of the tables can also be used to support table summarization. Consider, for example, Table 1(a), which shows a data table consisting of 6 rows. If the user is interested in summarizing this table based on the attribute pair ⟨Age, Location⟩ in such a way that the summarized table can be visualized in a space that can hold at most 2 tuples, the hierarchies in Figure 1 can be used to obtain the summary in Table 1(b).

(a) Data table:

Name   | Age | Location
-------|-----|------------
John   | 12  | Phoenix
Sharon | 19  | Los Angeles
Mary   | 19  | San Diego
Peter  | 22  | Baltimore
James  | 22  | Frederick
Alice  | 27  | Baltimore

(b) Summarized table:

Name | Age | Location
-----|-----|----------
-    | 1*  | Southwest
-    | 2*  | Maryland

Table 1: (a) A database and (b) a summary on the ⟨Age, Location⟩ pair using the hierarchies in Figure 1 (also taken from [1])

Figure 2: Two possible reductions of the location hierarchy in Figure 1: (a) Alternative #1 and (b) Alternative #2
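To make the example concrete, the following sketch (the dictionary encoding of the Figure 1 hierarchies and the helper name are our illustration, not part of [1]) reproduces the summary of Table 1(b) by replacing each leaf value with its ancestor directly below the hierarchy root:

```python
# Value hierarchies from Figure 1, stored as child -> parent links (our encoding).
AGE_PARENT = {"12": "1*", "19": "1*", "22": "2*", "27": "2*", "1*": "*", "2*": "*"}
LOC_PARENT = {"Phoenix": "Arizona", "Los Angeles": "California",
              "San Diego": "California", "Arizona": "Southwest",
              "California": "Southwest", "Baltimore": "Maryland",
              "Frederick": "Maryland", "Southwest": "U.S.", "Maryland": "U.S."}

def cluster_to_top(value, parent, root):
    """Replace a value by its ancestor directly below the hierarchy root."""
    while parent.get(value, root) != root:
        value = parent[value]
    return value

table = [("John", "12", "Phoenix"), ("Sharon", "19", "Los Angeles"),
         ("Mary", "19", "San Diego"), ("Peter", "22", "Baltimore"),
         ("James", "22", "Frederick"), ("Alice", "27", "Baltimore")]

# Generalize both attributes and deduplicate the resulting tuples:
summary = sorted({(cluster_to_top(a, AGE_PARENT, "*"),
                   cluster_to_top(l, LOC_PARENT, "U.S."))
                  for _, a, l in table})
print(summary)  # [('1*', 'Southwest'), ('2*', 'Maryland')]
```

Note that the unbalanced location hierarchy generalizes Baltimore by one step but Phoenix by two, yet both land one level below the root, yielding the two summary tuples of Table 1(b).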
In this paper, we note that table summarization, whether carried out through data analysis performed on the table from scratch or supported through already available metadata, is an expensive operation. For example, the computational cost of the metadata supported table summarization process is exponential in the depth of the hierarchy (i.e., the number of alternative value clustering strategies) [1, 3].
The key observation driving the work in this paper is that the speed of the table summarization process can be significantly improved when the metadata used for supporting summarization itself is preprocessed to reduce its unproductive details. Metadata are often provided by domain experts, and their primary usage is in data organization and interpretation. As such, they can be overly detailed, and all of these details may not be relevant in the summary of the data. The preprocessing of the metadata to eliminate details not relevant for obtaining a table summary, however, needs to be performed carefully to ensure that it does not add significant amounts of additional loss to the table summarization process. For example, while the reduced hierarchy in Figure 2(a) would still give the table summary in Table 1(b), the alternative in Figure 2(b) would cause further loss in the table summary. Thus, intuitively, in this context a “good” reduction of the given metadata is one that leads to high-quality table summaries.
Based on these observations, we propose a novel hierarchy/taxonomy reduction method, tRedux, which reduces the complexities (defined in terms of the hierarchy height and density) of the given value hierarchies while preserving their effectiveness in table summarization. In the next section, we provide an overview of the relevant literature. Section 3 formalizes the table summarization problem and introduces the quality measures for the table summaries. In Sections 4 and 5, we first introduce our tRedux algorithm for taxonomy reduction and, then, describe the use of tRedux within the table summarization process.

Evaluations presented in Section 6 show that, depending on the table and taxonomy complexity, metadata summarization can provide gains in table summarization time that can range (in absolute values) from seconds to 10s of 1000s of seconds. Consequently, while resulting in only an extra ∼20% reduction in table quality, tRedux can provide ∼2× speedups in table summarization time. Experiments also show that tRedux has a better performance than alternative metadata reduction strategies in supporting table summarization; and, as the taxonomy complexity increases, the absolute gains of tRedux also increase.
2. RELATED WORK

2.1 Table Summarization
[4, 5] present a table summarization system, SaintEtiQ, which computes and incrementally maintains a hierarchically arranged set of summaries of the input table. SaintEtiQ uses background knowledge (i.e., metadata) to support these summaries. [6] also performs data summarization, but it relies on frequent patterns in the relational dataset. TabSum [7] creates and maintains table summaries through row and column reductions. To reduce the number of rows, the algorithm first partitions the original table into groups based on one or more attribute values of the table, and then collapses each group of rows into a single row relying on the available metadata, such as the concept hierarchy. For column reduction, it simplifies the value representation and/or merges multiple columns.
Data compression techniques, like Huffman or Lempel-Ziv, can also be used to reduce the size of the table. For example, [8] presents a database compression technique based on vector quantization. Buchsbaum et al. [9] develop algorithms to compress massive tables through a partition-training paradigm, but the compressed tables are not human readable. Histograms can also be exploited to summarize information into a compressed structure. Following this idea, Buccafurri et al. [10] introduce a quad-tree based partition schema for summarizing two-dimensional data. Leveraging the quad-tree structure, [11] proposes approaches to processing OLAP operations over the summary. In fact, any multidimensional clustering algorithm can be used to summarize a table. Such methods, however, do not take into account specific domain knowledge (e.g., “what are acceptable summarizations, how do they rank?”) that hierarchies would provide.
The concept of imprecision in OLAP dimensions is discussed in [12]. In that framework, a fact (e.g., a tuple in the table) with imprecise data is associated with dimension values of coarser granularities, resulting in the dimensional imprecision. In [13], we supported OLAP operations over imperfectly integrated taxonomies. We proposed a reclassification strategy which eliminates conflicts by introducing minimal imprecision. This approach is complementary to the work presented here: the obtained navigable taxonomy can be taken as the input for summarization.

The table summarization task is also related to the k-anonymization problem, introduced as a technique against linkage attacks on private data [3]. The k-anonymization
approach eliminates the possibility of such attacks by ensuring that, in the disseminated table, each value combination of attributes is matched to k others. To achieve this, k-anonymization techniques rely on a priori knowledge about acceptable value generalizations. Cell generalization schemes [14] treat each cell in the data table independently. Thus, different cells for the same attribute may be generalized in a different way. This provides significant flexibility in anonymization, while the problem remains extremely hard (NP-hard [15]) and only approximation algorithms are applicable under realistic usage scenarios [14]. Attribute generalization schemes [3, 16] treat all values for a given attribute collectively; i.e., all values are generalized using the same unique domain generalization strategy. While the problem remains NP-hard (in the number of attributes), this approach saves a significant amount of time in processing and may eliminate the need for using approximation solutions, since it does not need to consider the individual values. Most of these schemes, such as Samarati’s original algorithm [3], rely on the fact that, for a given attribute, applicable generalizations are in total order and that each generalization step in this total order has the same cost. [3] leverages this to develop a binary search scheme to achieve savings in time. [16] relies on the same observation to develop an algorithm which achieves attribute-based k-anonymization one attribute at a time, while pruning unproductive generalization strategies.
2.2 Metadata Reduction
Ontology summarization has been used to support various reasoning tasks. Fokoue et al. [17] focus on the problem of summarizing OWL ontologies to reduce the cost of reasoning with ontologies. Often, metadata (such as RDF collections) can be seen as graphs. [18] analyzes the metadata graph to measure the centrality of the nodes (e.g., concepts) in the graph. In this approach, the centrality of a given node reflects its degree of salience; thus the most salient nodes are maintained in the summary. In contrast, clustering based approaches discover groups of nodes such that the intra-group similarity is maximized and the inter-cluster similarity is minimized [19, 20]. [21] presented a novel hierarchical clustering algorithm, merging two clusters only if the interconnectivity and closeness between the two clusters are high relative to the internal interconnectivity of the clusters and the closeness of items within the clusters. [22] uses an agglomerative clustering algorithm that merges the two clusters with the greatest number of common neighbors. More recently, [23] and [24] use random walk based cluster analysis techniques.
Since many metadata types, such as taxonomies, are hierarchical, researchers have also experimented with tree summarization algorithms [25, 26]. Davood et al. [26] observed that summaries of XML trees did a much better job in document clustering tasks than using edit distance values, e.g. [27], on the original trees. DataGuides [25] was one of the first approaches which attempted to construct structural summaries of hierarchical structures to support efficient query processing. Though this and similar methods work fine for tree based structures, the constructed summaries are not trees, but graphs. Various other summarization algorithms, such as [26, 28], focus on creating summaries suitable for efficient similarity search in tree-structured data. Since, once again, the goal of these algorithms is not to obtain a smaller tree representing the larger one provided as input, but to find a representation that will speed up query processing, the resulting summaries are in the form of strings, hash sequences, and concept/label vectors. Our goal in this paper, however, is to reduce the size of the input taxonomy tree to support the table summarization process; therefore these and similar algorithms are not applicable.
3. TABLE SUMMARIZATION
Summarization of large data tables is required in many scenarios where it is hard to display complete data sets. The summarization process takes as input a database table and returns a reduced version of it. The result provides tuples with less precision (knowledge relaxation) than the original, but still informative of the content of the database. This reduced form can then be presented to the user for exploration or be used as input for advanced data mining processes.

3.1 Value Clustering Hierarchies
A value clustering hierarchy¹ is a tree H(V, E), where V encodes values and clustering identifiers (e.g., high-level concepts or cluster labels, such as “1*” in Figure 1(a)), and E contains acceptable value clustering relationships.

Definition 3.1 (Value Hierarchy). A value clustering hierarchy, H, is a tree H(V, E):

• v = (id : value) ∈ V, where v.id is the node id in the tree and v.value is either a value in the database or a value clustering encoded by the node.

• e = vi → vj ∈ E is a directed edge denoting that the value encoded by the node vj can be clustered under the value encoded by the node vi.

Those nodes in V which correspond to the attribute values that are originally in the database do not have any outgoing edges in E; i.e., they form the leaves of the hierarchy. Given an attribute value in the data table T and a value hierarchy corresponding to that attribute, we can define alternative clusterings as paths on the corresponding hierarchy.

Definition 3.2 (Value Clustering). Given a value clustering hierarchy H, a tree node vi is a clustering of a tree node vj, denoted by vj ⇒ vi, if ∃ a path p = vi ⇝ vj in H. We also say that vi covers vj.
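Definitions 3.1 and 3.2 can be captured in a few lines of code; the following sketch (an illustrative encoding of ours, using the Location hierarchy of Figure 1(b)) stores child-to-parent links and tests the covers relation by walking the ancestor chain:

```python
class ValueHierarchy:
    """A value clustering hierarchy H(V, E), stored as child -> parent links."""

    def __init__(self, edges):
        # edges: (vi, vj) pairs meaning vj can be clustered under vi (Def. 3.1)
        self.parent = {vj: vi for vi, vj in edges}

    def covers(self, vi, vj):
        """True iff vi is a clustering of vj, i.e., a path vi ~> vj exists (Def. 3.2).
        Here a node is also considered to cover itself (the zero-length path)."""
        node = vj
        while node is not None:
            if node == vi:
                return True
            node = self.parent.get(node)
        return False

# The Location hierarchy of Figure 1(b):
H_loc = ValueHierarchy([("U.S.", "Southwest"), ("U.S.", "Maryland"),
                        ("Southwest", "Arizona"), ("Southwest", "California"),
                        ("Arizona", "Phoenix"), ("California", "Los Angeles"),
                        ("California", "San Diego"), ("Maryland", "Baltimore"),
                        ("Maryland", "Frederick")])
print(H_loc.covers("Southwest", "Phoenix"))  # True
print(H_loc.covers("Maryland", "Phoenix"))   # False
```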
3.2 Tuple Clustering and Table Summary
Let us consider a data table, T, and a set, SA, of summarization attributes. Roughly speaking, our purpose is to find another relation T′ which clusters the values in T such that T′ summarizes T with respect to the summarization attributes. Based on the above, in the following, we formalize the concept of tuple summarization.

Definition 3.3 (Tuple Clustering). Let t be a tuple on attributes SA = {Q1, · · · , Qq}. t′ is said to be a clustering of the tuple, t, (on attributes SA) iff ∀i ∈ [1, q]

• t′[Qi] = t[Qi], or

• ∃ a path pi = t′[Qi] ⇝ t[Qi] in the corresponding value hierarchy Hi.

¹We use the terms “value hierarchy” and “value clustering hierarchy” interchangeably.
In this paper, we use t ⇒ t′ as shorthand.
Given this definition of tuple clustering, we can define the summary of a table as a one-to-one and onto mapping which clusters the tuples of the original table.

Definition 3.4 (Table Summary). Given two data tables T and T′ (with the same schema), and the summarization attribute set SA, T′ is said to be a summary of T on attributes in SA (T[SA] ⇒ T′[SA] for short) iff there is a one-to-one and onto mapping, μ, from the tuples in T[SA] to T′[SA], such that

• ∀t ∈ T[SA], t ⇒ μ(t)

Here, T[SA] and T′[SA] are projections of the data tables T and T′ on the summarization attributes.
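The conditions of Definitions 3.3 and 3.4 reduce to a per-attribute ancestor test; the sketch below (helper names and the dictionary encoding are ours) checks whether a tuple t′ is a clustering of a tuple t:

```python
def is_ancestor_or_equal(parent, hi, low):
    """True iff `hi` equals `low` or lies on `low`'s ancestor chain."""
    while low is not None:
        if low == hi:
            return True
        low = parent.get(low)
    return False

def clusters_tuple(t, t2, hierarchies):
    """Def. 3.3: t2 is a clustering of t iff, for every attribute Qi,
    t2[Qi] = t[Qi] or a path t2[Qi] ~> t[Qi] exists in hierarchy Hi."""
    return all(is_ancestor_or_equal(hierarchies[q], t2[q], t[q])
               for q in hierarchies)

# Fragments of the Figure 1 hierarchies (child -> parent):
AGE = {"12": "1*", "19": "1*", "22": "2*", "27": "2*", "1*": "*", "2*": "*"}
LOC = {"Phoenix": "Arizona", "Arizona": "Southwest", "Southwest": "U.S."}

t = {"Age": "12", "Location": "Phoenix"}
t2 = {"Age": "1*", "Location": "Southwest"}
print(clusters_tuple(t, t2, {"Age": AGE, "Location": LOC}))  # True
```

A table summary check (Definition 3.4) would simply apply `clusters_tuple` to every pair (t, μ(t)) of a candidate one-to-one mapping μ.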
3.3 Table Summarization Process
As described in Section 2.2, there are various metadata supported table summarization algorithms [1, 3, 16]. Without loss of generality, in this work, we use Samarati’s k-anonymization algorithm [3] as the backend table summarizer. In [3], each unique tuple gets clustered with at least k−1 other similar tuples to ensure that no single tuple can be uniquely identified. The algorithm uses attribute value hierarchies to ensure that the amount of loss (i.e., value generalizations using the value hierarchies) is minimized. For each attribute, the algorithm takes a value clustering hierarchy (a taxonomy) which describes the generalization/specialization relationship between the possible values. For example, consider a table with a “Location” attribute. The hierarchy represents all the relevant values in the corresponding domain as the leaves of a tree (Figure 1(b)). The internal nodes in the value hierarchy will correspond to appropriate (geographic or political) clusterings of countries. Thus, in a summary, an internal node of the hierarchy can be used to cluster all the leaves below it using a more general label. If in the summary a leaf value is used, this gives zero generalization (g = 0); if, on the other hand, a leaf at depth d is replaced with an internal node at depth d′, this causes g = d − d′ steps of generalization; of course, by picking clusters closer to the root, the algorithm will be able to summarize more easily. On the other hand, more general cluster labels also cause a higher degree of knowledge relaxation. In Section 3.4, we will refer to the knowledge relaxation due to the use of generalizing clusters as dilution.
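As a concrete instance of the cost model above (depths counted from the root; the depth table is our own encoding of Figure 1(b)):

```python
# Generalization cost g = d - d' when a leaf at depth d is replaced
# by an internal node at depth d' (illustrative depth table for Figure 1(b)).
DEPTH = {"U.S.": 0, "Southwest": 1, "Maryland": 1, "Arizona": 2, "California": 2,
         "Baltimore": 2, "Frederick": 2, "Phoenix": 3, "Los Angeles": 3,
         "San Diego": 3}

def generalization_steps(leaf, ancestor):
    """Number of generalization steps when `leaf` is summarized as `ancestor`."""
    return DEPTH[leaf] - DEPTH[ancestor]

print(generalization_steps("Phoenix", "Southwest"))  # 2
print(generalization_steps("Phoenix", "Phoenix"))    # 0 (leaf kept: g = 0)
```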
Among all possible clusterings that put each tuple with k − 1 other similar ones, [3] aims to find those that require minimal generalizations; i.e., the amount of distortion in the data needed to achieve the clustering is as small as possible. Intuitively, if there is a generalization at depth d that puts all tuples into clusters of size k, then there will be generalizations of level d′ ≤ d that also cluster all tuples into clusters of size at least k, but will have more loss; conversely, if one can establish that there is no generalization at level d that is a k-clustering, then it follows that there is no other clustering of level d′ > d that can cluster all tuples into clusters of size at least k. Relying on the fact that, for a given attribute, applicable domain generalizations are in total order and that each generalization step in this total order has the same cost, [3] develops a binary search scheme to achieve savings in time². It starts evaluating generalization levels from the middle level to see if there is a corresponding k-clustering solution:

²[16] relies on the same observation to develop an algorithm which achieves attribute-based k-anonymization one attribute at a time, while pruning unproductive attribute generalization strategies. [29] further assumes an attribute order and attribute-value order to develop a top-down framework with significant pruning opportunities.
Figure 3: Ghost nodes in an unbalanced hierarchy: (a) before and (b) after introducing a ghost node
• if there is, then the algorithm tries to find another solution with less generalization by jumping to the central point of the half path with lower generalization;

• if there is none, on the other hand, the algorithm tries to find a solution by jumping to the central point of the half path with higher generalization.

The process continues in this binary search until a generalization level such that no solution with a lower generalization exists is found. Unfortunately, this and other attribute generalization based algorithms, including [16] and [29], are all exponential in the number of attributes that need summarization³. When there is a single attribute to summarize, for a hierarchy, H, of depth depth(H), the algorithm considers log(depth(H)) alternative clustering strategies. When there are m attributes to consider, however, the maximum degree of generalization is Σ_{1≤i≤m} depth(H_i), where all attributes are generalized to the max, causing the greatest amount of loss. In this case, the algorithm considers log(Σ_{1≤i≤m} depth(H_i)) alternative generalization levels on the average; moreover, for each generalization level, g, the algorithm has to consider all combinations of attribute generalizations such that Σ_{1≤i≤m} g_i = g. Since, in general, there can be exponentially many such combinations, the worst case time cost of the algorithm is exponential in the number of attributes.
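For a single attribute, the binary search over generalization levels can be sketched as follows; `is_k_clustered` and the toy `generalize` function are illustrative stand-ins of ours, and the search relies on the monotonicity argued above (if a level yields a k-clustering, so does every more general level):

```python
from collections import Counter

def is_k_clustered(values, generalize, level, k):
    """Check whether generalizing every value to `level` yields groups of size >= k."""
    counts = Counter(generalize(v, level) for v in values)
    return all(c >= k for c in counts.values())

def min_generalization_level(values, generalize, max_level, k):
    """Binary search (in the spirit of [3]) for the least-generalizing level
    whose clustering is a k-clustering."""
    lo, hi, best = 0, max_level, None
    while lo <= hi:
        mid = (lo + hi) // 2
        if is_k_clustered(values, generalize, mid, k):
            best, hi = mid, mid - 1   # works: try less generalization
        else:
            lo = mid + 1              # fails: need more generalization
    return best

# Toy age attribute with levels 0 (exact), 1 (decade), 2 (fully generalized):
ages = [12, 19, 19, 22, 22, 27]
gen = lambda v, lvl: v if lvl == 0 else (v // 10 if lvl == 1 else "*")
print(min_generalization_level(ages, gen, 2, k=2))  # 1
```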
One problem with using Samarati’s algorithm [3] as the backend table summarizer is that it assumes balanced hierarchies as input. In order to perform table summarization using unbalanced input hierarchies, we first balance the input hierarchies by introducing ghost nodes that fill the empty spots in the hierarchy. As shown in Figure 3, these ghost nodes act as surrogates of the closest ancestors of the leaf nodes that are out of balance: in this example, in order to put N at the same level as the other leaves, we introduce a ghost node between N and its parent A (Figure 3(b)). The ghost node is also labeled A, as the parent. Note that, whether N is generalized to its original parent or the new (ghost) node, it will be replaced in the summarized table with A; therefore, this transformation does not result in additional loss.
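The ghost-node balancing step can be sketched as follows (our own encoding; the ghost labels are arbitrary surrogates of the parent they copy):

```python
def depth(parent, node):
    """Number of edges from `node` up to the root."""
    d = 0
    while node in parent:
        node, d = parent[node], d + 1
    return d

def balance_with_ghosts(parent, leaves):
    """Insert ghost nodes (surrogates of the closest ancestor) above shallow
    leaves so that every leaf ends up at the maximum leaf depth."""
    parent = dict(parent)  # do not mutate the input hierarchy
    target = max(depth(parent, leaf) for leaf in leaves)
    for leaf in leaves:
        while depth(parent, leaf) < target:
            # Splice a ghost copy of the leaf's parent between the leaf and its parent.
            p = parent[leaf]
            ghost = f"{p}^ghost{depth(parent, leaf)}"  # label acts as surrogate of p
            parent[ghost] = p
            parent[leaf] = ghost
    return parent

# Figure 1(b): Maryland's cities are one level shallower than the Southwest's.
parent = {"Southwest": "U.S.", "Maryland": "U.S.", "Arizona": "Southwest",
          "California": "Southwest", "Phoenix": "Arizona",
          "Los Angeles": "California", "San Diego": "California",
          "Baltimore": "Maryland", "Frederick": "Maryland"}
balanced = balance_with_ghosts(parent, ["Phoenix", "Los Angeles", "San Diego",
                                        "Baltimore", "Frederick"])
print(depth(balanced, "Baltimore") == depth(balanced, "Phoenix"))  # True
```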
³In fact, the problem is NP-hard [1].

3.4 Quality of a Table Summary
Information-based measures of quality leverage statistical knowledge, for example the knowledge about data frequencies, to measure information loss. One advantage of the use of value hierarchies for table summarization is that
the degree of loss resulting from the summarization process can be quantified and explicitly minimized using the available value hierarchies. Unlike purely numeric information loss measures, such as mean squared error, and statistical measures, such as entropy, classification, discernibility, and certainty [29, 30, 31, 32], knowledge about value hierarchies provides a mechanism to judge the significance of the distortion within the given application domain [3, 33, 34]. For example, a commonly used technique for measuring the amount of loss during the summarization process is to count the number of generalization steps required to obtain the summary [3, 29, 16]: given a generalization hierarchy, each step followed to achieve the value clustering is considered one unit of loss.
Definition 3.5 (Penalty of a Tuple Clustering). Let t and t′ be two tuples on attributes SA = {Q1, · · · , Qq}, such that t ⇒ t′. Then the cost of the corresponding clustering strategy is defined through a monotonic combination function, ⊕, of the penalty of the clustering along each individual summarization attribute:

Δ(t ⇒ t′) = ⊕_{1≤i≤q} Δ_i,

where

• Δ_i = 0, if t′[Qi] = t[Qi]

• Δ_i = Δ(t[Qi], t′[Qi]) (i.e., the minimal number of edges that separate t[Qi] from t′[Qi] in the corresponding original hierarchy) otherwise.
Let us consider a data table T, and a set SA of summarization attributes. Let T′ be a summary of T on attributes in SA (i.e., T[SA] ⇒ T′[SA]). In this paper, we use the following quality measures to evaluate table summaries:

• dilution (dl):

dl(T, T′, SA) = (1/|T|) Σ_{t∈T} Δ(t[SA], μ(t[SA])).

The smaller the degree of dilution, the smaller is the amount of loss and the higher is the quality of the summary.

• diversity (div):

div(T′, SA) = (2/(|T′|(|T′|−1))) Σ_{t1,t2∈T′, t1≠t2} Δ(t1[SA], t2[SA]).

The greater the diversity, the higher is the quality of the summary.

Therefore, given a table T and the set of summarization attributes, SA, the goal of the summarization algorithm is to find a summary such that the degree of dilution is minimized, yet the diversity is maximized.
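Under simple summation as the combination function ⊕, the penalty and dilution measures translate directly into code; the sketch below (our own encoding, on the Table 1 example) computes dl for the summary of Table 1(b):

```python
# Child -> parent encodings of the Figure 1 hierarchies (our illustration).
AGE = {"12": "1*", "19": "1*", "22": "2*", "27": "2*", "1*": "*", "2*": "*"}
LOC = {"Phoenix": "Arizona", "Los Angeles": "California", "San Diego": "California",
       "Arizona": "Southwest", "California": "Southwest", "Baltimore": "Maryland",
       "Frederick": "Maryland", "Southwest": "U.S.", "Maryland": "U.S."}
H = {"Age": AGE, "Location": LOC}

def steps(parent, low, hi):
    """Delta_i of Def. 3.5: edges separating a value from its generalization
    (assumes `hi` is an ancestor of `low`)."""
    d = 0
    while low != hi:
        low, d = parent[low], d + 1
    return d

def delta(t, t2):
    """Tuple-clustering penalty, using plain summation as the combination function."""
    return sum(steps(H[q], t[q], t2[q]) for q in H)

def dilution(T, mu):
    """dl(T, T', SA): average penalty between each tuple and its image mu(t)."""
    return sum(delta(t, mu(t)) for t in T) / len(T)

T = [{"Age": a, "Location": l} for a, l in
     [("12", "Phoenix"), ("19", "Los Angeles"), ("19", "San Diego"),
      ("22", "Baltimore"), ("22", "Frederick"), ("27", "Baltimore")]]
mu = lambda t: ({"Age": "1*", "Location": "Southwest"} if t["Age"] in ("12", "19")
                else {"Age": "2*", "Location": "Maryland"})
print(dilution(T, mu))  # 2.5
```

Diversity is computed analogously, averaging Δ over the distinct pairs of summary tuples.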
4. TAXONOMY REDUCTION
As we have seen in Section 3.3, the table summarization process can be prohibitively costly, especially when the number of relevant attributes is large. Our key observation in this paper is that it might be possible to reduce the cost of the table summarization algorithm significantly by reducing the sizes of the input hierarchies. The hierarchy reduction, however, needs to be performed in such a way that it does not add significant amounts of additional loss to the table summarization process. In other words, the hierarchy reduction process should eliminate the details in the metadata that are not likely to be used in the table summary anyhow (remember the example in Figure 2).

In this section, we propose a tRedux algorithm for value hierarchy reduction. Our proposed approach to taxonomy preprocessing and reduction consists of three steps:

I: create a graph representing the structural distances between the nodes in the taxonomy as well as the distribution of node labels in the database;

II: partition the resulting graph into disjoint subgraphs based on connectivity analysis; and

III: finally, select a representative label for each partition and reconstruct a taxonomy tree.

In Section 6, we show that taxonomies reduced using this algorithm help preserve the quality of table summaries, while providing significant gains in table summarization time. In the rest of this section, we describe each of the above steps.
4.1 Step I: Constructing the Node Structural Similarity/Occurrence Graph
Naturally, the most effective way to ensure the quality of the table summarization process is to cluster those taxonomy nodes whose labels would be judged to be similar by human users of the system. Simultaneously, the clusters should also represent the joint distribution of the node labels in the database. Therefore, the first step of the process is to create a graph that represents both the structural similarities of the nodes and the label co-occurrences in the database.

More formally, let us consider a data table T and a set SA of summarization attributes. Let Hi(Vi, Ei) be a value hierarchy corresponding to the attribute Qi ∈ SA. In the next step, the algorithm constructs a weighted directed graph, Gi(Vi, E′i, w), where the set of vertices, Vi = {v1, . . . , vn}, corresponds to the concepts in the input taxonomy. The weights (w : E′i → R+) associated with the edges in E′i represent both

• similarities between pairs of concepts in the taxonomy, and

• occurrences of data values in the database.
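As a toy illustration of this step (the mixing weight `alpha`, the normalization, and the placeholder structural similarity are our assumptions; the actual structural component is the CP/CV measure discussed next), the weighted graph can be assembled as:

```python
from itertools import permutations

def build_occurrence_counts(column, ancestors):
    """Count, for each hierarchy node, how often it covers a value in the column."""
    counts = {}
    for v in column:
        for node in ancestors(v):
            counts[node] = counts.get(node, 0) + 1
    return counts

def build_graph(nodes, structural_sim, counts, alpha=0.5):
    """Step I sketch: complete weighted digraph over taxonomy nodes, mixing a
    structural-similarity term with a (normalized) occurrence term."""
    total = sum(counts.values()) or 1
    w = {}
    for vi, vj in permutations(nodes, 2):
        w[(vi, vj)] = (alpha * structural_sim(vi, vj)
                       + (1 - alpha) * counts.get(vj, 0) / total)
    return w

# Toy usage on a two-level hierarchy fragment:
parent = {"Maryland": "U.S.", "Baltimore": "Maryland"}
def ancestors(v):
    chain = [v]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain

counts = build_occurrence_counts(["Baltimore", "Baltimore"], ancestors)
# Placeholder structural similarity based on depth difference (ours, not CP/CV):
sim = lambda a, b: 1.0 / (1 + abs(len(ancestors(a)) - len(ancestors(b))))
g = build_graph(["U.S.", "Maryland", "Baltimore"], sim, counts)
```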
In structure-based methods, the similarity between two nodes in a taxonomy is measured by the distance between them [35] or the sum of the edge weights along the shortest path connecting the words [36]. CP/CV, presented in [37], first associates a node vector to each node in the taxonomy (the vector represents the relationship of this node with the rest of the nodes in the taxonomy) and, then, compares the vectors to quantify how structurally similar the two nodes are. Comparisons against other approaches on available human-generated benchmark data [38, 39] showed that CP/CV improves the correlation of the resulting similarity judgments with human common sense. Thus, without loss of generality, we use the node similarity measure presented in [37] to quantify the structural similarities among the taxonomy nodes. More specifically, given the data table T, for each value hierarchy, Hi(Vi, Ei), we construct a complete directed graph,