Reducing Metadata Complexity for Faster Table Summarization∗
K. Selçuk Candan
Arizona State University
Tempe, AZ 85283, USA
candan@asu.edu
Mario Cataldi
Università di Torino
Torino, Italy
cataldi@di.unito.it
Maria Luisa Sapino
Università di Torino
Torino, Italy
mlsapino@di.unito.it
ABSTRACT
Since the visualization real estate puts stringent constraints on how much data can be presented to the users at once, table summarization is an essential tool in helping users quickly explore large data sets. An effective summary needs to minimize the information loss due to the reduction in details. Summarization algorithms leverage the redundancy in the data to identify value and tuple clustering strategies that represent the (almost) same amount of information with a smaller number of data representatives. It has been shown that, when available, metadata, such as value hierarchies associated with the attributes of the tables, can help greatly reduce the resulting information loss. However, table summarization, whether carried out through data analysis performed on the table from scratch or supported through already available metadata, is an expensive operation. We note that the table summarization process can be significantly sped up when the metadata used for supporting the summarization itself is preprocessed to reduce its unnecessary details. The preprocessing of the metadata, however, needs to be performed carefully to ensure that it does not add significant amounts of additional loss to the table summarization process. In this paper, we propose the tRedux algorithm for value hierarchy preprocessing and reduction. Experimental evaluations show that, depending on the table and taxonomy complexity, metadata summarization can provide gains in table summarization time that range (in absolute values) from seconds to tens of thousands of seconds. Consequently, while resulting in only an extra ∼20% reduction in table quality, tRedux can provide ∼2× speedups in table summarization time. Experiments also show that tRedux performs better than alternative metadata reduction strategies in supporting table summarization; and, as the taxonomy complexity increases, the absolute gains of tRedux also increase.
∗Partially supported by NSF Grant "Archaeological Data Integration for the Study of Long-Term Human and Social Dynamics" (0624341).
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
EDBT 2010, March 22–26, 2010, Lausanne, Switzerland.
Copyright 2010 ACM 978-1-60558-945-9/10/0003 ...$10.00
Figure 1: Value hierarchy for attribute Age (a) and Location (b); directed edges denote the clustering/summarization direction (taken from [1]). In (a), the leaf ages 12, 19, 22, and 27 cluster under the labels 1* and 2*, which in turn cluster under *. In (b), the leaf cities Phoenix, Los Angeles, San Diego, Baltimore, and Frederick cluster under Arizona, California, and Maryland; Arizona and California cluster under Southwest; Southwest and Maryland cluster under U.S.
Categories and Subject Descriptors
H.2.4 [Information Systems]: Database Management—systems, database applications; H.3.3 [Information Systems]: Information Storage and Retrieval—information search and retrieval
General Terms
Algorithms, Experimentation
Keywords
Table Summarization, Metadata Complexity, Taxonomy Reduction
1. INTRODUCTION
Table summarization is an important tool in helping users quickly explore large data sets. An effective summary needs to minimize the information loss due to the reduction in details; in particular, each tuple in the original table needs to be represented in the summary with a sufficiently similar tuple. Moreover, each tuple in the summary must be sufficiently different from other summary tuples to ensure that the summary real estate is not wasted. Summarization algorithms leverage the underlying redundancy (such as approximate functional dependencies and other patterns) in the data to identify value and tuple clustering strategies that represent the (almost) same information with a smaller number of data representatives.
When available, metadata, such as value hierarchies (like the ones shown in Figure 1), can help greatly reduce the resulting information loss. Value hierarchies have been commonly used for user-driven data analysis (e.g., OLAP [2]) and exploration within large data sets. In [1], we have shown that value hierarchies associated with the attributes of the tables can also be used to support table summarization. Consider, for example, Table 1(a), which shows a data table consisting of 6 rows. If the user is interested in summarizing this table based on the attribute pair ⟨Age, Location⟩ in such a way that the summarized table can be visualized in a space that can hold at most 2 tuples, the hierarchies in Figure 1 can be used to obtain the summary in Table 1(b).

(a) Data table
Name    Age  Location
John    12   Phoenix
Sharon  19   Los Angeles
Mary    19   San Diego
Peter   22   Baltimore
James   22   Frederick
Alice   27   Baltimore

(b) Summarized table
Name  Age  Location
-     1*   Southwest
-     2*   Maryland

Table 1: (a) A database and (b) a summary on the ⟨Age, Location⟩ pair using hierarchies in Figure 1 (also taken from [1])

Figure 2: Two possible reductions of the Location hierarchy in Figure 1: (a) Alternative #1 keeps U.S., Southwest, and Maryland as internal nodes; (b) Alternative #2 keeps U.S., Arizona, and California.
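To make the example concrete, the following sketch (illustrative only: the data structures and the uniform level-by-level search are ours, not the paper's algorithm) encodes the two hierarchies of Figure 1 as child-to-parent maps, treats every leaf as if it sat at the maximum leaf depth (anticipating the ghost-node balancing of Section 3.3), and tries generalization levels in order of increasing total generalization until at most two distinct tuples remain:

```python
# Illustrative sketch (ours, not the paper's algorithm): summarizing
# Table 1(a) down to 2 tuples using the value hierarchies of Figure 1.

AGE_PARENT = {"12": "1*", "19": "1*", "22": "2*", "27": "2*",
              "1*": "*", "2*": "*"}
LOC_PARENT = {"Phoenix": "Arizona", "Los Angeles": "California",
              "San Diego": "California", "Arizona": "Southwest",
              "California": "Southwest", "Southwest": "U.S.",
              "Baltimore": "Maryland", "Frederick": "Maryland",
              "Maryland": "U.S."}
ROWS = [("12", "Phoenix"), ("19", "Los Angeles"), ("19", "San Diego"),
        ("22", "Baltimore"), ("22", "Frederick"), ("27", "Baltimore")]

def chain(value, parent):
    """Path from a value up to the root, e.g. Phoenix -> ... -> U.S."""
    path = [value]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def generalize(value, parent, steps, leaf_depth):
    """Climb `steps` edges, treating every leaf as if at `leaf_depth`
    (the ghost-node convention of Section 3.3 for unbalanced trees)."""
    path = chain(value, parent)
    target = max(leaf_depth - steps, 0)       # depth to land on (root = 0)
    for i, v in enumerate(path):
        if len(path) - 1 - i <= target:
            return v
    return path[-1]

def summarize(rows, max_tuples):
    """Try (age, location) generalization levels in order of total cost."""
    for g in range(6):                        # Age depth 2 + Location depth 3
        for ga in range(min(g, 2) + 1):
            gl = g - ga
            if gl > 3:
                continue
            summary = {(generalize(a, AGE_PARENT, ga, 2),
                        generalize(l, LOC_PARENT, gl, 3))
                       for a, l in rows}
            if len(summary) <= max_tuples:
                return sorted(summary)

print(summarize(ROWS, 2))   # [('1*', 'Southwest'), ('2*', 'Maryland')]
```

Note that the minimal solution generalizes the two attributes unevenly (one step on Age, two on Location), which is exactly why the search must consider all combinations of per-attribute generalization levels.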
In this paper, we note that table summarization, whether carried out through data analysis performed on the table from scratch or supported through already available metadata, is an expensive operation. For example, the computational cost of the metadata supported table summarization process is exponential in the depth of the hierarchy (i.e., the number of alternative value clustering strategies) [1, 3].
The key observation driving the work in this paper is that the speed of the table summarization process can be significantly improved when the metadata used for supporting summarization itself is preprocessed to reduce its unproductive details. Metadata are often provided by domain experts, and their primary usage is in data organization and interpretation. As such, they can be overly detailed, and not all of these details may be relevant in the summary of the data. The preprocessing of the metadata to eliminate details not relevant for obtaining a table summary, however, needs to be performed carefully to ensure that it does not add significant amounts of additional loss to the table summarization process. For example, while the reduced hierarchy in Figure 2(a) would still give the table summary in Table 1(b), the alternative in Figure 2(b) would cause further loss in the table summary. Thus, intuitively, in this context a "good" reduction of the given metadata is one that leads to high-quality table summaries.
Based on these observations, we propose a novel hierarchy/taxonomy reduction method, tRedux, which reduces the complexities (defined in terms of the hierarchy height and density) of the given value hierarchies while preserving their effectiveness in table summarization. In the next section, we provide an overview of the relevant literature. Section 3 formalizes the table summarization problem and introduces the quality measures for table summaries. In Sections 4 and 5, we first introduce our tRedux algorithm for taxonomy reduction and, then, describe the use of tRedux within the table summarization process.
Evaluations presented in Section 6 show that, depending on the table and taxonomy complexity, metadata summarization can provide gains in table summarization time that range (in absolute values) from seconds to tens of thousands of seconds. Consequently, while resulting in only an extra ∼20% reduction in table quality, tRedux can provide ∼2× speedups in table summarization time. Experiments also show that tRedux performs better than alternative metadata reduction strategies in supporting table summarization; and, as the taxonomy complexity increases, the absolute gains of tRedux also increase.
2. RELATED WORK
2.1 Table Summarization
[4, 5] present a table summarization system, SaintEtiQ, which computes and incrementally maintains a hierarchically arranged set of summaries of the input table. SaintEtiQ uses background knowledge (i.e., metadata) to support these summaries. [6] also performs data summarization, but it relies on frequent patterns in the relational dataset. TabSum [7] creates and maintains table summaries through row and column reductions. To reduce the number of rows, the algorithm first partitions the original table into groups based on one or more attribute values of the table, and then collapses each group of rows into a single row relying on the available metadata, such as the concept hierarchy. For column reduction, it simplifies the value representation and/or merges multiple columns.
Data compression techniques, like Huffman or Lempel-Ziv, can also be used to reduce the size of the table. For example, [8] presents a database compression technique based on vector quantization. Buchsbaum et al. [9] develop algorithms to compress massive tables through a partition-training paradigm, but the compressed tables are not human-readable. Histograms can also be exploited to summarize information into a compressed structure. Following this idea, Buccafurri et al. [10] introduce a quad-tree based partition schema for summarizing two-dimensional data. Leveraging the quad-tree structure, [11] proposes approaches to processing OLAP operations over the summary. In fact, any multidimensional clustering algorithm can be used to summarize a table. Such methods, however, do not take into account the specific domain knowledge (e.g., "what are acceptable summarizations, and how do they rank?") that hierarchies would provide.
The concept of imprecision in OLAP dimensions is discussed in [12]. In that framework, a fact (e.g., a tuple in the table) with imprecise data is associated with dimension values of coarser granularities, resulting in dimensional imprecision. In [13], we supported OLAP operations over imperfectly integrated taxonomies. We proposed a re-classification strategy which eliminates conflicts by introducing minimal imprecision. This approach is complementary to the work presented here: the obtained navigable taxonomy can be taken as the input for summarization.
The table summarization task is also related to the k-anonymization problem, introduced as a technique against linkage attacks on private data [3]. The k-anonymization approach eliminates the possibility of such attacks by ensuring that, in the disseminated table, each value combination of attributes is matched to k others. To achieve this, k-anonymization techniques rely on a priori knowledge about acceptable value generalizations. Cell generalization schemes [14] treat each cell in the data table independently. Thus, different cells for the same attribute may be generalized in a different way. This provides significant flexibility in anonymization, but the problem remains extremely hard (NP-hard [15]) and only approximation algorithms are applicable under realistic usage scenarios [14]. Attribute generalization schemes [3, 16] treat all values for a given attribute collectively; i.e., all values are generalized using the same unique domain generalization strategy. While the problem remains NP-hard (in the number of attributes), this approach saves a significant amount of time in processing and may eliminate the need for approximation solutions, since it does not need to consider the individual values. Most of these schemes, such as Samarati's original algorithm [3], rely on the fact that, for a given attribute, applicable generalizations are in total order and that each generalization step in this total order has the same cost. [3] leverages this to develop a binary search scheme to achieve savings in time. [16] relies on the same observation to develop an algorithm which achieves attribute-based k-anonymization one attribute at a time, while pruning unproductive generalization strategies.
2.2 Metadata Reduction
Ontology summarization has been used to support various reasoning tasks. Fokouel et al. [17] focus on the problem of summarizing OWL ontologies to reduce the cost of reasoning with ontologies. Often, metadata (such as RDF collections) can be seen as graphs. [18] analyzes the metadata graph to measure the centrality of the nodes (e.g., concepts) in the graph. In this approach, the centrality of a given node reflects its degree of salience; thus the most salient nodes are maintained in the summary. In contrast, clustering based approaches discover groups of nodes such that the intra-group similarity is maximized and the inter-cluster similarity is minimized [19, 20]. [21] presented a novel hierarchical clustering algorithm, merging two clusters only if the interconnectivity and closeness between the two clusters are high relative to the internal interconnectivity of the clusters and the closeness of items within the clusters. [22] uses an agglomerative clustering algorithm that merges the two clusters with the greatest number of common neighbors. More recently, [23] and [24] use random walk based cluster analysis techniques.
Since many metadata types, such as taxonomies, are hierarchical, researchers have also experimented with tree summarization algorithms [25, 26]. Davood et al. [26] observed that summaries of XML trees did a much better job in document clustering tasks than edit distance values, e.g. [27], computed on the original trees. DataGuides [25] was one of the first approaches which attempted to construct structural summaries of hierarchical structures to support efficient query processing. Though this and similar methods work fine for tree based structures, the constructed summaries are not trees, but graphs. Various other summarization algorithms, such as [26, 28], focus on creating summaries suitable for efficient similarity search in tree-structured data. Since, once again, the goal of these algorithms is not to obtain a smaller tree representing the larger one provided as input, but to find a representation that will speed up query processing, the resulting summaries are in the form of strings, hash sequences, and concept/label vectors. Our goal in this paper, however, is to reduce the size of the input taxonomy tree to support the table summarization process; therefore these and similar algorithms are not applicable.
3. TABLE SUMMARIZATION
Summarization of large data tables is required in many scenarios where it is hard to display complete data sets. The summarization process takes as input a database table and returns a reduced version of it. The result provides tuples with less precision (knowledge relaxation) than the original, but still informative of the content of the database. This reduced form can then be presented to the user for exploration or be used as input for advanced data mining processes.
3.1 Value Clustering Hierarchies
A value clustering hierarchy¹ is a tree H(V,E) where V encodes values and clustering identifiers (e.g., high-level concepts or cluster labels, such as "1*" in Figure 1(a)), and E contains acceptable value clustering relationships.
Definition 3.1 (Value Hierarchy). A value clustering hierarchy, H, is a tree H(V,E):
• v = (id:value) ∈ V, where v.id is the node id in the tree and v.value is either a value in the database or a value clustering encoded by the node.
• e = vi → vj ∈ E is a directed edge denoting that the value encoded by the node vj can be clustered under the value encoded by the node vi.
Those nodes in V which correspond to the attribute values that are originally in the database do not have any outgoing edges in E; i.e., they form the leaves of the hierarchy. Given an attribute value in the data table T and a value hierarchy corresponding to that attribute, we can define alternative clusterings as paths on the corresponding hierarchy.
Definition 3.2 (Value Clustering). Given a value clustering hierarchy H, a tree node vi is a clustering of a tree node vj, denoted by vj ⪯ vi, if there exists a path p = vi → vj in H. We also say that vi covers vj.
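As a concrete reading of Definitions 3.1 and 3.2, a value hierarchy can be represented minimally as a child-to-parent map, with covering checked by walking toward the root (an illustrative sketch; the representation and names are ours, not the paper's):

```python
# Illustrative sketch (ours): the Location hierarchy of Figure 1 as a
# child -> parent map; covers(vi, vj) holds iff there is a path from vi
# down to vj in the tree, i.e., vj can be clustered under vi.

LOC_PARENT = {"Phoenix": "Arizona", "Los Angeles": "California",
              "San Diego": "California", "Arizona": "Southwest",
              "California": "Southwest", "Southwest": "U.S.",
              "Baltimore": "Maryland", "Frederick": "Maryland",
              "Maryland": "U.S."}

def covers(vi, vj, parent=LOC_PARENT):
    """True iff vi is a proper ancestor of vj in the hierarchy."""
    while vj in parent:            # climb toward the root
        vj = parent[vj]
        if vj == vi:
            return True
    return False

assert covers("Southwest", "Phoenix")      # Phoenix ⪯ Southwest
assert covers("U.S.", "Frederick")
assert not covers("Maryland", "Phoenix")   # different subtrees
```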
3.2 Tuple Clustering and Table Summary
Let us consider a data table, T, and a set, SA, of summarization attributes. Roughly speaking, our purpose is to find another relation T′ which clusters the values in T such that T′ summarizes T with respect to the summarization attributes. Based on the above, in the following, we formalize the concept of tuple summarization.
Definition 3.3 (Tuple Clustering). Let t be a tuple on attributes SA = {Q1,··· ,Qq}. t′ is said to be a clustering of the tuple t (on attributes SA) iff, ∀i ∈ [1,q],
• t′[Qi] = t[Qi], or
• there exists a path pi = t′[Qi] → t[Qi] in the corresponding value hierarchy Hi.
¹We use the terms "value hierarchy" and "value clustering hierarchy" interchangeably.
In this paper, we use t ⪯ t′ as shorthand. Given this definition of tuple clustering, we can define the summary of a table as a one-to-one and onto mapping which clusters the tuples of the original table.
Definition 3.4 (Table Summary). Given two data tables T and T′ (with the same schema), and the summarization-attribute set SA, T′ is said to be a summary of T on attributes in SA (T[SA] ⪯ T′[SA] for short) iff there is a one-to-one and onto mapping, μ, from the tuples in T[SA] to T′[SA], such that
• ∀t ∈ T[SA], t ⪯ μ(t).
Here, T[SA] and T′[SA] are projections of the data tables T and T′ on the summarization attributes.
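Definitions 3.3 and 3.4 can likewise be sketched as executable checks (illustrative only; in particular, we realize μ simply by assigning each tuple its first covering summary tuple, which suffices for this example but is not a general search for a valid mapping):

```python
# Illustrative check (ours) of Definitions 3.3 and 3.4: t2 clusters t iff,
# per summarization attribute, the values are equal or t2[Q] lies on the
# path from t[Q] to the root; T2 summarizes T iff every tuple of T maps
# to a covering summary tuple and every summary tuple is used (onto).

HIER = {
    "Age": {"12": "1*", "19": "1*", "22": "2*", "27": "2*",
            "1*": "*", "2*": "*"},
    "Location": {"Phoenix": "Arizona", "Los Angeles": "California",
                 "San Diego": "California", "Arizona": "Southwest",
                 "California": "Southwest", "Southwest": "U.S.",
                 "Baltimore": "Maryland", "Frederick": "Maryland",
                 "Maryland": "U.S."},
}
SA = ("Age", "Location")

def is_ancestor(parent, anc, val):
    while val in parent:
        val = parent[val]
        if val == anc:
            return True
    return False

def clusters(t, t2):
    """Definition 3.3: t2 is a clustering of t on SA (t ⪯ t2)."""
    return all(t2[i] == t[i] or is_ancestor(HIER[q], t2[i], t[i])
               for i, q in enumerate(SA))

def is_summary(T, T2):
    """Definition 3.4 (simplified): a covering mapping mu exists, onto T2."""
    mu = {}
    for t in T:
        images = [t2 for t2 in T2 if clusters(t, t2)]
        if not images:
            return False
        mu[t] = images[0]          # naive choice of mu, fine for this data
    return set(mu.values()) == set(T2)

ROWS = [("12", "Phoenix"), ("19", "Los Angeles"), ("19", "San Diego"),
        ("22", "Baltimore"), ("22", "Frederick"), ("27", "Baltimore")]
SUMMARY = [("1*", "Southwest"), ("2*", "Maryland")]
print(is_summary(ROWS, SUMMARY))   # True
```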
3.3 Table Summarization Process
As described in Section 2.2, there are various metadata supported table summarization algorithms [1, 3, 16]. Without loss of generality, in this work, we use Samarati's k-anonymization algorithm [3] as the backend table summarizer. In [3], each unique tuple gets clustered with at least k−1 other similar tuples to ensure that no single tuple can be uniquely identified. The algorithm uses attribute value hierarchies to ensure that the amount of loss (i.e., value generalizations using the value hierarchies) is minimized. For each attribute, the algorithm takes a value clustering hierarchy (a taxonomy) which describes the generalization/specialization relationship between the possible values. For example, consider a table with a "Location" attribute. The hierarchy represents all the relevant values in the corresponding domain as the leaves of a tree (Figure 1(b)). The internal nodes in the value hierarchy correspond to appropriate (geographic or political) clusterings of locations. Thus, in a summary, an internal node of the hierarchy can be used to cluster all the leaves below it using a more general label. If in the summary a leaf value is used, this gives zero generalization (g = 0); if, on the other hand, a leaf at depth d is replaced with an internal node at depth d′, this causes g = d − d′ steps of generalization; of course, by picking clusters closer to the root, the algorithm will be able to summarize more easily. On the other hand, more general cluster labels also cause a higher degree of knowledge relaxation. In Section 3.4, we will refer to the knowledge relaxation due to the use of generalizing clusters as dilution.
Among all possible clusterings that put each tuple with k − 1 other similar ones, [3] aims to find those that require minimal generalizations; i.e., the amount of distortion in the data needed to achieve the clustering is as small as possible. Intuitively, if there is a generalization at depth d that puts all tuples into clusters of size k, then there will be generalizations of level d′ ≤ d that also cluster all tuples into clusters of size at least k, but will have more loss; conversely, if one can establish that there is no generalization at level d that is a k-clustering, then it follows that there is no other clustering of level d′ > d that can cluster all tuples into clusters of size at least k. Relying on the fact that, for a given attribute, applicable domain generalizations are in total order and that each generalization step in this total order has the same cost, [3] develops a binary search scheme to achieve savings in time². It starts evaluating generalization levels from the middle level to see if there is a corresponding k-clustering solution:
• if there is, then the algorithm tries to find another solution with less generalization by jumping to the central point of the half path with lower generalization;
• if there is none, on the other hand, the algorithm tries to find a solution by jumping to the central point of the half path with higher generalization.
The process continues in this binary search fashion until it finds a generalization level such that no solution with a lower generalization exists. Unfortunately, this and other attribute generalization based algorithms, including [16] and [29], are all exponential in the number of attributes that need summarization³. When there is a single attribute to summarize, for a hierarchy, H, of depth, depth(H), the algorithm considers log(depth(H)) alternative clustering strategies. When there are m attributes to consider, however, the maximum degree of generalization is Σ1≤i≤m depth(Hi), where all attributes are generalized to the maximum, causing the greatest amount of loss. In this case, the algorithm considers log(Σ1≤i≤m depth(Hi)) alternative generalization levels on average; moreover, for each generalization level, g, the algorithm has to consider all combinations of attribute generalizations such that Σ1≤i≤m gi = g. Since, in general, there can be exponentially many such combinations, the worst case time cost of the algorithm is exponential in the number of attributes.
One problem with using Samarati's algorithm [3] as the backend table summarizer is that it assumes balanced hierarchies as input. In order to perform table summarization using unbalanced input hierarchies, we first balance the input hierarchies by introducing ghost nodes that fill the empty spots in the hierarchy. As shown in Figure 3, these ghost nodes act as surrogates of the closest ancestors of the leaf nodes that are out of balance: in this example, in order to put N at the same level as the other leaves, we introduce a ghost node between N and its parent A (Figure 3(b)). The ghost node is also labeled A, like the parent. Note that, whether N is generalized to its original parent or to the new (ghost) node, it will be replaced in the summarized table with A; therefore, this transformation does not result in additional loss.
Figure 3: Ghost nodes in an unbalanced hierarchy: (a) the original hierarchy, in which leaf N is shallower than the other leaves; (b) the balanced hierarchy, with a ghost node (labeled A) inserted between N and its parent A.
²[16] relies on the same observation to develop an algorithm which achieves attribute-based k-anonymization one attribute at a time, while pruning unproductive attribute generalization strategies. [29] further assumes an attribute order and attribute-value order to develop a top-down framework with significant pruning opportunities.
³In fact, the problem is NP-hard [1].
3.4 Quality of a Table Summary
Information-based measures of quality leverage statistical knowledge, for example knowledge about data frequencies, to measure information loss. One advantage of the use of value hierarchies for table summarization is that the degree of loss resulting from the summarization process can be quantified and explicitly minimized using the available value hierarchies. Unlike purely numeric information loss measures, such as mean squared error, and statistical measures, such as entropy, classification, discernibility, and certainty [29, 30, 31, 32], knowledge about value hierarchies provides a mechanism to judge the significance of the distortion within the given application domain [3, 33, 34]. For example, a commonly used technique for measuring the amount of loss during the summarization process is to count the number of generalization steps required to obtain the summary [3, 29, 16]: given a generalization hierarchy, each step followed to achieve the value clustering is considered one unit of loss.
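Returning to the balancing step of Section 3.3, the ghost-node construction can be sketched as follows (an illustrative sketch under our own representation, in which ghost nodes are tagged tuples carrying the parent's label):

```python
# Illustrative sketch (ours) of ghost-node balancing (Section 3.3):
# shallow leaves get a chain of surrogate ("ghost") nodes, each carrying
# the original parent's label, so that all leaves reach the same depth.

LOC_PARENT = {"Phoenix": "Arizona", "Los Angeles": "California",
              "San Diego": "California", "Arizona": "Southwest",
              "California": "Southwest", "Southwest": "U.S.",
              "Baltimore": "Maryland", "Frederick": "Maryland",
              "Maryland": "U.S."}   # Baltimore/Frederick are 1 level shallower

def depth(tree, v):
    d = 0
    while v in tree:
        v, d = tree[v], d + 1
    return d

def balance(parent):
    """Return a child -> parent map in which every leaf has equal depth.
    Generalizing a leaf to a ghost yields the same label as its real
    parent, so balancing adds no extra loss to the summaries."""
    leaves = [v for v in parent if v not in set(parent.values())]
    max_d = max(depth(parent, v) for v in leaves)
    out = dict(parent)
    for leaf in leaves:
        missing = max_d - depth(parent, leaf)
        if missing:
            label = parent[leaf]
            ghosts = [(label, "ghost", k) for k in range(missing)]
            link = [leaf] + ghosts + [label]
            for child, par in zip(link, link[1:]):
                out[child] = par
    return out

BALANCED = balance(LOC_PARENT)
print(depth(BALANCED, "Baltimore"), depth(BALANCED, "Phoenix"))   # 3 3
```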
Definition 3.5 (Penalty of a Tuple Clustering). Let t and t′ be two tuples on attributes SA = {Q1,··· ,Qq}, such that t ⪯ t′. Then the cost of the corresponding clustering strategy is defined through a monotonic combination function, ⊕, of the penalty of the clustering along each individual summarization attribute:
Δ(t ⪯ t′) = ⊕1≤i≤q Δi,
where
• Δi = 0, if t′[Qi] = t[Qi]
• Δi = Δ(t[Qi], t′[Qi]) (i.e., the minimal number of edges that separates t[Qi] from t′[Qi] in the corresponding original hierarchy), otherwise.
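A minimal instantiation of Definition 3.5 follows, taking the combination function ⊕ to be plain summation (an assumption of ours; the definition only requires a monotonic combination):

```python
# Illustrative instantiation (ours) of Definition 3.5: with ⊕ taken as
# summation, the penalty of clustering t as t2 is the total number of
# hierarchy edges climbed across the summarization attributes.

HIER = {
    "Age": {"12": "1*", "19": "1*", "22": "2*", "27": "2*",
            "1*": "*", "2*": "*"},
    "Location": {"Phoenix": "Arizona", "Los Angeles": "California",
                 "San Diego": "California", "Arizona": "Southwest",
                 "California": "Southwest", "Southwest": "U.S.",
                 "Baltimore": "Maryland", "Frederick": "Maryland",
                 "Maryland": "U.S."},
}
SA = ("Age", "Location")

def edge_steps(parent, val, anc):
    """Minimal number of edges from val up to anc; 0 if they are equal."""
    d = 0
    while val != anc:
        if val not in parent:
            raise ValueError(f"{anc!r} does not cover {val!r}")
        val, d = parent[val], d + 1
    return d

def penalty(t, t2):
    """Δ(t ⪯ t2) = Σ_i Δ_i over the summarization attributes."""
    return sum(edge_steps(HIER[q], t[i], t2[i]) for i, q in enumerate(SA))

print(penalty(("12", "Phoenix"), ("1*", "Southwest")))   # 1 + 2 = 3
```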
Let us consider a data table T and a set SA of summarization attributes. Let T′ be a summary of T on attributes in SA (i.e., T[SA] ⪯ T′[SA]). In this paper, we use the following quality measures to evaluate table summaries:
• dilution (dl):
dl(T, T′, SA) = (1/|T|) Σt∈T Δ(t[SA], μ(t[SA])).
The smaller the degree of dilution, the smaller is the amount of loss and the higher is the quality of the summary.
• diversity (div):
div(T′, SA) = (2/(|T′|(|T′|−1))) Σt1,t2∈T′, t1≠t2 Δ(t1[SA], t2[SA]).
The greater the diversity, the higher is the quality of the summary.
Therefore, given a table T and the set of summarization attributes, SA, the goal of the summarization algorithm is to find a summary such that the degree of dilution is minimized, yet the diversity is maximized.
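The two measures can be sketched as follows (illustrative; for diversity we compute Δ between two arbitrary summary values as their tree distance through the lowest common ancestor, our assumption, which coincides with the edge count of Definition 3.5 whenever one value covers the other):

```python
# Illustrative sketch (ours) of dilution and diversity: dilution averages
# the clustering penalty over the source tuples; diversity averages the
# pairwise distance between distinct summary tuples.

HIER = {
    "Age": {"12": "1*", "19": "1*", "22": "2*", "27": "2*",
            "1*": "*", "2*": "*"},
    "Location": {"Phoenix": "Arizona", "Los Angeles": "California",
                 "San Diego": "California", "Arizona": "Southwest",
                 "California": "Southwest", "Southwest": "U.S.",
                 "Baltimore": "Maryland", "Frederick": "Maryland",
                 "Maryland": "U.S."},
}
SA = ("Age", "Location")

def chain(parent, v):
    path = [v]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def tree_dist(parent, a, b):
    """Edges on the path a -> lowest common ancestor -> b."""
    ca, cb = chain(parent, a), chain(parent, b)
    for i, v in enumerate(ca):
        if v in cb:
            return i + cb.index(v)

def delta(t1, t2):
    return sum(tree_dist(HIER[q], t1[i], t2[i]) for i, q in enumerate(SA))

def dilution(T, mu):
    """dl: average penalty between each tuple and its summary image mu(t)."""
    return sum(delta(t, mu[t]) for t in T) / len(T)

def diversity(T2):
    """div: average pairwise Δ between distinct summary tuples."""
    n = len(T2)
    total = sum(delta(T2[i], T2[j]) for i in range(n) for j in range(i + 1, n))
    return 2 * total / (n * (n - 1))

ROWS = [("12", "Phoenix"), ("19", "Los Angeles"), ("19", "San Diego"),
        ("22", "Baltimore"), ("22", "Frederick"), ("27", "Baltimore")]
SUMMARY = [("1*", "Southwest"), ("2*", "Maryland")]
MU = {t: SUMMARY[0] if t[0] in ("12", "19") else SUMMARY[1] for t in ROWS}
print(dilution(ROWS, MU), diversity(SUMMARY))   # 2.5 4.0
```

On the running example, the summary of Table 1(b) costs 2.5 units of dilution per tuple while its two summary tuples sit 4 hierarchy edges apart, illustrating the trade-off the summarizer must balance.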
4. TAXONOMY REDUCTION
As we have seen in Section 3.3, the table summarization process can be prohibitively costly, especially when the number of relevant attributes is large. Our key observation in this paper is that it might be possible to reduce the cost of the table summarization algorithm significantly by reducing the sizes of the input hierarchies. The hierarchy reduction, however, needs to be performed in such a way that it does not add significant amounts of additional loss to the table summarization process. In other words, the hierarchy reduction process should eliminate the details in the metadata that are not likely to be used in the table summary anyhow (remember the example in Figure 2).
In this section, we propose the tRedux algorithm for value hierarchy reduction. Our proposed approach to taxonomy preprocessing and reduction consists of three steps:
I: create a graph representing the structural distances between the nodes in the taxonomy as well as the distribution of node labels in the database;
II: partition the resulting graph into disjoint subgraphs based on connectivity analysis; and
III: finally, select a representative label for each partition and reconstruct a taxonomy tree.
In Section 6, we show that taxonomies reduced using this algorithm help preserve the quality of table summaries, while providing significant gains in table summarization time. In the rest of this section, we describe each of the above steps.
4.1 Step I: Constructing the Node Structural Similarity/Occurrence Graph
Naturally, the most effective way to ensure the quality of the table summarization process is to cluster those taxonomy nodes whose labels would be judged to be similar by human users of the system. Simultaneously, the clusters should also represent the joint distribution of the node labels in the database. Therefore, the first step of the process is to create a graph that represents both the structural similarities of the nodes and label co-occurrence in the database.
More formally, let us consider a data table T and a set SA of summarization attributes. Let Hi(Vi, Ei) be a value hierarchy corresponding to the attribute Qi ∈ SA. In the next step, the algorithm constructs a weighted directed graph, Gi(Vi, E′i, w), where the set of vertices, Vi = {v1,...,vn}, corresponds to the concepts in the input taxonomy. The weights (w : E′i → R+) associated to the edges in E′i represent both

• similarities between pairs of concepts in the taxonomy, and
• occurrences of data values in the database.
In structure-based methods, the similarity between two nodes in a taxonomy is measured by the distance between them [35] or the sum of the edge weights along the shortest path connecting the words [36]. CP/CV, presented in [37], first associates a node vector to each node in the taxonomy (the vector represents the relationship of this node with the rest of the nodes in the taxonomy) and, then, compares the vectors to quantify how structurally similar the two nodes are. Comparisons against other approaches on available human-generated benchmark data [38, 39] showed that CP/CV improves the correlation of the resulting similarity judgments with human common sense. Thus, without loss of generality, we use the node similarity measure presented in [37] to quantify the structural similarities among the taxonomy nodes.
More specifically, given the data table T, for each value hierarchy, Hi(Vi, Ei), we construct a complete directed graph, Gi(Vi, E′i, w): for each pair va, vb of nodes in the taxonomy Hi, the edge between the corresponding nodes in Gi has the following weight:

w(⟨va, vb⟩) = Σ_{t∈T} cv(va)[t.Qi] × cv(vb)[t.Qi],

where t.Qi is the value of tuple t for attribute Qi and cv(x)[y] gives the CP/CV value for node x along the vector dimension corresponding to the node y. Intuitively, the weight w(⟨va, vb⟩) measures the aggregate similarity between the taxonomy nodes va and vb in the value hierarchy for all the values in the corresponding attribute in the database. Thus, the resulting graph Gi(Vi, E′i, w) represents the structural relationships in Hi(Vi, Ei) as well as the distribution of the data in the corresponding summarization attribute, Qi, in the database: the weight of an edge is high if the concepts are structurally related in the value hierarchy, Hi, and there are plenty of tuples in the corresponding attribute that are highly related to these concepts.

Lastly, this graph Gi(Vi, E′i, w) is thinned by applying a locally adaptive edge thinning algorithm. For each va in Vi, we consider the set, out(va), of all outgoing edges:
1. we first sort the edges in out(va) in decreasing order
of weights;
2. next, we compute the maximum drop in consecu
tive weights; and identify the corresponding maxdrop
point in the sorted list of edges;
3. we, then, compute the average drop (between consecu
tive entities) for all those edges that are ranked before
the identified maxdrop drop point.
4. the first weight drop which is higher than the com
puted average drop is referred to as the criticaldrop.
All the edges in out(va) beyond this criticaldrop point
are eliminated from E?
i.
This final thinning process ensures that only those edges that represent the strongest relationships are maintained (note that, since the graph is directed and the thinning process is asymmetric, it is possible that E′i will contain a link from va to vb, but not vice versa).
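The four thinning steps above can be sketched, for a single vertex, as follows. This is a hedged illustration; representing out(va) as a dictionary of edge weights is our own assumption.

```python
def thin_edges(out_weights):
    """Locally adaptive thinning of one vertex's outgoing edges:
    keep only the edges ranked before the critical drop."""
    edges = sorted(out_weights.items(), key=lambda kv: -kv[1])   # step 1
    if len(edges) < 2:
        return dict(edges)
    drops = [edges[i][1] - edges[i + 1][1] for i in range(len(edges) - 1)]
    max_i = drops.index(max(drops))                 # step 2: max-drop point
    pre = drops[:max_i] or [0.0]
    avg_drop = sum(pre) / len(pre)                  # step 3: avg drop before it
    for i, d in enumerate(drops):                   # step 4: critical drop
        if d > avg_drop:
            return dict(edges[: i + 1])             # eliminate edges beyond it
    return dict(edges)
```

For instance, with outgoing weights 0.9, 0.85, 0.3, 0.25, the large drop from 0.85 to 0.3 is the critical drop, so only the first two edges survive.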
4.2 Step II: Balanced Taxonomy Partitioning

In the next step, the resulting weighted graph Gi(Vi, E′i, w) is partitioned based on its connectivity and the weights. In theory, any existing graph partitioning algorithm (e.g., [19, 20, 40, 24, 41]) can be used in this stage. Many of these (including METIS [41], which we evaluate in the experiments section), however, require advance knowledge of the number of clusters. Thus, in practice, since the user is not likely to have a target taxonomy size for table summarization, a summarization algorithm which can partition the input graph based on its inherent structure, without requiring an input number of clusters, may be more suitable.

Consequently, without loss of generality, we rely on a random walk-based graph partitioning algorithm that does not require advance knowledge of the number of resulting clusters. Intuitively, two vertices in the same cluster should be more connected to each other than two vertices in different clusters (and two vertices in the same cluster should be quickly reachable from each other through a random walk) [42]. In particular, given the graph Gi(Vi, E′i, w) constructed in the previous step, we construct a random walk graph by associating the following transition probability to the edges: let e be an edge from vertex va to vertex vb; then, the corresponding probability of transition is

p_{a,b} = w(⟨va, vb⟩) / Σ_{ek∈out(va)} w(ek).

The resulting node-to-node transition probability matrix is then used, as in [42], to partition the nodes based on their proximities.
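The transition probabilities can be sketched as follows (the nested-dictionary graph layout is an illustrative assumption):

```python
def transition_matrix(graph):
    """p_{a,b} = w(<va,vb>) / sum of the weights of out(va), per row.
    graph: {va: {vb: weight, ...}, ...} after edge thinning."""
    return {va: {vb: w / sum(out.values()) for vb, w in out.items()}
            for va, out in graph.items()}
```

Each row of the resulting matrix sums to 1, as required of a random-walk transition matrix.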
Existing random-walk based clustering algorithms, such as [42], consider only the connectivity and the weights and do not seek to return partitions balanced in terms of the number of vertices. In other words, the number and the sizes of the clusters are strictly dependent on the weights obtained in Step I. For data summarization, however, we may not want a summarized taxonomy where some summary nodes are precise (and represent only a few nodes in the original value hierarchy), whereas others are vague (and represent large numbers of nodes). An equally distributed set of clusters, on the other hand, would permit generating a more informative and representative summary of the initial taxonomy, as each new entry in the reduced taxonomy represents (approximately) the same number of original nodes.
Therefore, we follow the initial partitioning step with a rebalancing step. Let Hi(Vi, Ei) be a value hierarchy and Pi = {Pi,1,...,Pi,m} be the set of partitions obtained through the random walk process. In order to promote balance in partitions, we introduce a tolerance value, τ = θ|Vi|/m, that sets the maximum number of concepts that can be represented by any partition. If a cluster, Pi,j, contains too large a number of concepts, then a set, Xi, of extra vertices is picked and moved to other partitions. This set of vertices is selected in such a way that the cost, cost(Xi), of displacement of the set of extra vertices among partitions is minimized, where

cost(Xi) = Σ_{va∈Xi} ( Σ_{ej∈(edges(va)∩Pi,j)} w(ej) − Σ_{ej∈(edges(va)∩dest(va))} w(ej) ),

where edges(va) is the set of all incoming and outgoing edges of va and dest(va) is the partition, other than Pi,j, with the highest weighted connectivity to va. The vertices in Xi and their destinations are selected through a K-means-like iterative improvement process. In Section 6.2, we study the use of θ in detail.
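One greedy pass of such a rebalancing step might look as follows. This is a simplified sketch, not the paper's full K-means-like iteration; the caller-supplied `connectivity(v, P)` function, which sums the weights of v's edges into partition P, is our own assumption.

```python
def rebalance_once(partitions, connectivity, theta):
    """Greedy single pass: while a partition exceeds the tolerance
    tau = theta * |V| / m, move out its cheapest-to-displace vertex."""
    n = sum(len(p) for p in partitions)
    tau = theta * n / len(partitions)
    for j, part in enumerate(partitions):
        while len(part) > tau:
            def cost(v):
                # displacement cost: internal connectivity lost minus
                # connectivity gained at the best destination partition
                best = max((k for k in range(len(partitions)) if k != j),
                           key=lambda k: connectivity(v, partitions[k]))
                return (connectivity(v, part)
                        - connectivity(v, partitions[best]), best)
            v = min(part, key=lambda u: cost(u)[0])
            part.remove(v)
            partitions[cost(v)[1]].append(v)
    return partitions
```

A vertex with weak internal ties but strong ties to another partition has a low (possibly negative) displacement cost, so it is moved first.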
4.3 Step III: Taxonomy Reconstruction

In order to construct the reduced taxonomy, we need to reattach the partitions, obtained in the previous step, in the form of a tree structure. Furthermore, for each partition, we need to pick a label describing the concepts in the partition.
4.3.1 Partition Labeling

Considering a hierarchy node, its label is important because it is what will be presented to the user in the resulting table summary. Thus, considering our partitions, we have to carefully select labels that are sufficiently representative of the clusters. Let Pi,j be a partition in Pi. In order to pick a label for Pi,j, we consider the relationships of the vertices in Pi,j in the original hierarchy Hi. If there is a vertex, va ∈ Pi,j, that dominates all the other
[Figure 4: A sample taxonomy and its partitioning: (a) a sample taxonomy; (b) a partitioning example, distinguishing edges in the original hierarchy H from edges in the similarity/occurrences graph G that are not in H.]
vertices in the partition (i.e., ∀vb ∈ Pi,j, vb ⪯ va), then va is selected as the label. If there is no such single vertex, then the minimal set, Dj, of vertices covering the partition Pi,j (based on Hi) is found and the set, Dj, is used as the partition label.
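This labeling rule can be sketched as follows. The child-to-parent encoding is our own assumption, and the minimal cover Dj is realized here by keeping the maximal vertices of the partition (those not dominated by another member).

```python
def ancestors(parent, v):
    """All proper ancestors of v in a child -> parent map."""
    seen = set()
    while v in parent:
        v = parent[v]
        seen.add(v)
    return seen

def partition_label(parent, part):
    for va in part:                     # a single dominating vertex, if any
        if all(vb == va or va in ancestors(parent, vb) for vb in part):
            return {va}
    # otherwise: the maximal vertices of the partition cover all members
    return {v for v in part if not (ancestors(parent, v) & set(part))}
```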
4.3.2 Partition Linking

The reduced taxonomy, H′i, should preserve the original structure of Hi as much as possible:

• The root of H′i is the partition Pi,j which contains the root vertex of Hi.
• Let us consider a pair, Pi,j and Pi,k, of partitions in Pi. Let Ej,k be the set of edges in Hi that go from the vertices in Pi,j to vertices in Pi,k. Similarly, let Ek,j be the set of edges in Hi that go from the vertices in Pi,k to vertices in Pi,j.

  If, in H′i, Pi,j is an ancestor of Pi,k, then the broken set of edges in Ek,j will result in structural constraints that are violated. If Pi,k is an ancestor of Pi,j, then the broken edges in Ej,k will result in structural constraints that are violated. If neither is an ancestor of the other, on the other hand, the edges in Ej,k ∪ Ek,j will determine the constraints that are violated.
Let e = ⟨va, vb⟩ be an edge from partition Pi,j to Pi,k. If e is broken, then its cost (cost(e)) is the number of descendants of vb in the original hierarchy H also contained in Pi,k plus one (for vb). For example, in Figure 4, if the edge between V1 and K is broken, then the cost of this edge is equal to 1 + |{O, N, L, M}| = 5.

Thus, the taxonomy H′i, minimizing the errors due to structural constraint violations, can be constructed by

1. creating a complete weighted directed graph, GP(VP, EP, wP), of partitions, where
   – VP = Pi,
   – EP is the set of edges between all pairs of partitions, and
   – wP(⟨Pi,j, Pi,k⟩) = Σ_{e∈Ek,j} cost(e); and

2. finding a maximum spanning tree of GP rooted at the partition Pi,j which contains the root of Hi.
At this point, the original taxonomy has been partitioned
and a reduced taxonomy, which can be used in the table
summarization process, has been reconstructed.
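The reconstruction step can be sketched with a Prim-style maximum spanning tree over the partition graph (the weight-map layout is an illustrative assumption; ties are broken arbitrarily):

```python
def max_spanning_tree(weights, nodes, root):
    """Prim-style maximum spanning tree over the partition graph,
    grown from the partition that holds the original root."""
    in_tree, edges = {root}, []
    while len(in_tree) < len(nodes):
        # always attach the highest-weight edge crossing the tree boundary
        best = max(((p, c) for p in in_tree for c in nodes - in_tree),
                   key=lambda e: weights.get(e, 0.0))
        edges.append(best)
        in_tree.add(best[1])
    return edges
```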
5. TABLE SUMMARIZATION USING REDUCED VALUE HIERARCHIES
Let us consider a data table T and a set SA = {Q1,··· ,Qq} of summarization attributes. Let H1,...,Hq be the corresponding value hierarchies. In Section 3.2, Definition 3.4 introduces the summary, T′, of T based on the given value hierarchies. Let H′1,...,H′q be a set of reduced value hierarchies, corresponding to the summarization attributes.

Remember from Section 3.4, Definition 3.5, that if t and t′ are two tuples in T on attributes SA = {Q1,··· ,Qq}, such that t ⪯ t′, then the penalty of clustering is defined as

Δ(t→t′) = ⊕_{1≤i≤q} Δi,

where
• Δi = 0, if t′[Qi] = t[Qi]
• Δi = Δ(t[Qi], t′[Qi]) (i.e., the penalty of replacing t[Qi] with t′[Qi]) otherwise.
In the original definition, both t[Qi] and t′[Qi] are from the same hierarchy, Hi; when using reduced hierarchies for table summarization, however, while t[Qi] comes from Hi, t′[Qi] comes from the reduced hierarchy H′i. Therefore, the definition of Δ(t[Qi], t′[Qi]) needs to be extended. Based on the partition labeling strategy for reduced taxonomies, discussed in Section 4.3.1, there are two cases to consider:

• Case I: t′[Qi] ∈ H′i is also in Hi. In this case, the task is simpler and Δ(t[Qi], t′[Qi]) can simply be computed on the original hierarchy.

• Case II: t′[Qi] ∈ H′i is not in Hi. In this case, t′[Qi] must correspond to a set of values in Hi. Therefore, Δ(t[Qi], t′[Qi]) needs to be computed using a monotonic combination function, ⊗, of the replacement costs of the values in the original hierarchy:

  Δ(t[Qi], t′[Qi]) = ⊗_{v∈t′[Qi]} Δ(t[Qi], v).

  This monotonic combination function can be optimistic (min), pessimistic (max or sum), or agnostic (avg).
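The extended penalty, with its two cases and a pluggable combination function, can be sketched as follows (the stand-in `d` distance and the set encoding of multi-valued labels are illustrative assumptions):

```python
def extended_delta(delta, t_val, label, combine=min):
    """Per-attribute penalty against a reduced-hierarchy label."""
    if not isinstance(label, (set, frozenset)):
        return delta(t_val, label)          # Case I: label is also in Hi
    # Case II: label stands for a set of original values; combine the
    # replacement costs with a monotonic function (min/max/sum/avg)
    return combine([delta(t_val, v) for v in label])

avg = lambda costs: sum(costs) / len(costs)
d = lambda a, b: abs(a - b)                  # stand-in hierarchy distance
```

Swapping `combine` between min, max, sum, and avg gives the optimistic, pessimistic, and agnostic variants described above.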
6. EVALUATION OF METADATA REDUCTION BASED TABLE SUMMARIZATION
Metadata supported table summarization needs three in
puts: (a) a table T to summarize with q attributes; (b) a set
of domain hierarchies Di (for each attribute which we want
to summarize), and (c) a parameter k that determines the
minimum size of tuple clusters in the summary. We exper
imented with different data sets, taxonomies, and k values.
We considered two different datasets:
• Real dataset: we used the Census Income dataset (also
known as Adult dataset), extracted from the 1994 Cen
sus database [43]. This data set contains ∼30K tuples
and includes 16 attributes.
• Synthetic dataset: we constructed subsets of tuples
with different properties in order to evaluate tRedux
under different conditions (see Section 6.2).
For both data sets, we varied the number of tuples in the
data set. For all experiments described in Section 6.1, we
considered 7 different subsets of tuples (from ∼ 100 to ∼
800) for the Adult dataset and 6 different subsets (from ∼
100 to ∼ 1000) for the synthetic dataset.
We also varied the tuple count variance (tvar), which is
defined as the variance in the number of occurrences (in the
input table) of the leaf values of the hierarchy; this value
was varied between 0 (i.e., uniform distribution) and ∼ 11
for Adult dataset and between 0 and ∼ 15 for Synthetic
dataset.
We also experimented with different numbers (1, 2, and 3)
of attributes in the summary. For each case, we considered
different summarization requirements, varying k in the set
{5,10,20,30}. In addition to the real and synthetic data,
we also experimented with real and synthetic domain hier
archies. The synthetic domain hierarchies we used for the
experiments also varied in structure (size and height). We
provide more details in Section 6.2.
Finally, we have also experimented with different parti
tion balance tolerance values when creating the reduced tax
onomies (see Section 4.3.2). We varied the tolerance value,
θ, in the set {1, 1.5, 2, 3, 4} (θ < 1 is not meaningful, θ = 1 means perfect balance, and θ > 1 is increasingly lax in terms of the balance requirement; as we will see in Section 6.2, diversity and dilution are more or less constant for θ ≥ 2, therefore this range is sufficient for observing the impact of θ). Unless explicitly stated, the default tolerance value, θ = 2, is used. For all the experiments, we used an Intel Core 2 CPU @ 2.16 GHz with 1 GB of RAM.
6.1 Loss in Diversity and Dilution due to Reduced Metadata
Before we analyze the behavior of tRedux-based table summarization under different system parameters, we first compare the dilution and diversity behaviors of various alternative metadata-driven table summarization approaches. In particular, we compare the following alternative schemes:
• table summarization using the original hierarchies; in
this scheme the input hierarchies are not reduced.
• table summarization using hierarchies reduced by ap
plying tRedux.
• table summarization using hierarchies reduced by applying (instead of tRedux) kMETIS clustering [41] over the concept similarity graph described in Section 4.1: the kMETIS algorithm guarantees that all partitions will be approximately equally distributed, but requires the number of partitions as input. In these experiments, we vary the number of partitions as 20%, 30%, 40%, and 50% of the number of nodes in the input hierarchy (METIS-0.2, 0.3, 0.4, and 0.5 in the charts).

[Figure 5: Comparisons among the tRedux, original, and kMETIS approaches: (a) diversity vs. time, (b) diversity vs. number of partitions, (c) dilution vs. time, (d) dilution vs. number of partitions]
The experiments reported in this subsection are high-level averages of all experiments carried out with varying system parameters. As we mentioned above, we varied the values of k, the number of tuples, and the hierarchy size. Then, for each
alternative algorithm, we computed average diversity, aver
age dilution, and average execution time and plotted them
against each other to observe the general, highlevel trends
without focusing on the impacts of the specific system pa
rameters. In Figure 5, the first scheme, “original”, does not
use taxonomy reduction, while the other schemes, “tRedux”
and “kMETIS”, are both instances of the taxonomy reduction based table summarization approach proposed in this paper.
Diversity vs. Time. Figure 5(a) shows the amount of
diversity maintained by alternative schemes against the
amount of time required by the table summarization algo
rithm. As can be seen in this figure, table summarization
using the original hierarchies provides the highest diversity;
but also takes the greatest amount of time. METIS algo
rithms with 40% and 50% hierarchy nodes cause drops in
the diversity, without any significant temporal gain. METIS
with 20% and 30% nodes result in some gains in time; but
the highest gain in time occurs when using tRedux for sum
maries. Most importantly though, the diversityvstime be
havior (highlighted by the slopes of the line segments that
connect the point corresponding to original summaries with
the points corresponding to the algorithms), is the best for
tRedux. Overall, tRedux provides a ∼ 50% gain in execution
time, with only a ∼ 15% reduction in diversity.
Diversity vs. # of Nodes in the Reduced Hierarchy. Figure 5(b) shows the diversity maintained by alternative schemes against the number of nodes (partitions) in the reduced hierarchy. As expected, there is a correlation between the number of nodes in the hierarchy and the overall diversity. However, as a comparison of the METIS results with 20% of the nodes and the tRedux results shows, tRedux is able to maintain a similar amount of diversity with a smaller number of nodes in the hierarchy.
Dilution vs. Time. Figure 5(c) shows dilution⁴ against table summarization time: the highest absolute and relative (to dilution) time gains are achieved by tRedux.
Dilution vs. # of Nodes in the Reduced Hierarchy. Figure 5(d) shows the dilution⁵ caused by alternative schemes against the number of nodes in the reduced hierarchy. As can be seen here, as expected, the smaller the number of nodes in the hierarchy, the higher the resulting dilution. On the other hand, among the different metadata reduction schemes, tRedux has the best relative dilution behavior: a 66% drop in the number of nodes in the hierarchy results in less than a 20% increase in dilution.
Summary. The results in this section show that tRedux is able to reduce the taxonomy (based on its inherent structure, without requiring the size of the output taxonomy as an input) in a way that provides the best diversity-time and dilution-time tradeoffs. Algorithms like kMETIS can be used as the base graph partitioner if the user would like to reduce the sizes of the input taxonomies beyond what is structurally recommendable (albeit at the cost of further information loss).
6.2 Dissecting tRedux
In the previous subsection, we looked at the high-level behavior of the various algorithms and saw that metadata reduction based table summarization can provide significant time gains, while resulting in a relatively small increase in dilution and drop in diversity. We also saw that, among alternative approaches to taxonomy reduction, tRedux has the best dilution-time and diversity-time behaviors. In this subsection, we look at the tRedux algorithm in greater detail and study how different problem parameters affect the dilution, diversity, and time behaviors of tRedux. In particular, we vary
⁴ In this setting, we are using the agnostic avg combination function to compute dilution (see Section 5). In these experiments, the effect of the dilution definition on the results was extremely minute; thus, charts considering other functions (min, max, sum) are omitted for the sake of space.
⁵ Again, using the agnostic avg combination function.
(a) the imbalance tolerance value, θ, (b) the number of tu
ples in the input table, (c) the value distributions in the
data, (d) the sizes of the hierarchies, and (e) the heights
of the hierarchies, and compare the tRedux-supported summaries with summaries using the original hierarchies. We mostly experiment with synthetic data, where we can freely change various parameters and observe the behavior of tRedux, but we also include results with the Adult data set.
[Figure 6: Table summarization results with and without tRedux (original hierarchies vs. with tRedux): (a) time (log-log scale; trend line y = 0.5241x^1.0029), (b) dilution (trend line y = 1.1241x), (c) diversity (trend line y = 0.7888x)]
Overview. First, Figures 6(a), (b), and (c) bring together all experiment instances (independent of their parameters) into three plots which chart the performance measures (time, dilution, and diversity) for table summarization with the original hierarchy against table summarization with tRedux. As the trend line in Figure 6(a) shows, on average the summarization time with tRedux is just ∼ 50% of the summarization time needed with the original hierarchy (i.e., summarization is 2× as fast when using tRedux), and this behavior is highly consistent. Moreover, the average loss in terms of dilution (Figure 6(b)) is only ∼ 12% when using a summarized taxonomy, while the average loss in terms of diversity is ∼ 21% (Figure 6(c)).
Next we consider the impact of the individual parameters
on these three measures.
Impact of the Partition Imbalance (θ Parameter). As introduced in Section 4.2, depending on the need, the user
Figure 7: Dilution, diversity, partition size variance, and time for different imbalance tolerance values.
can balance the resulting partitions by using the parameter
θ. In fact, when reducing input hierarchies, creating parti
tions with widely varying sizes might be undesirable: some
partitions in the reduced hierarchy will be more precise (cor
responding to only a few entries in the original hierarchy),
while some others will be very vague (corresponding to a
large number of values). On the other hand, requiring perfectly balanced partitions might also be counterproductive, since this may result in nodes in the reconstructed hierarchy that consist of poorly related (non-homogeneous) concepts in the original hierarchy.
As shown in Figures 7(a) and (b), requiring strictly balanced partitions (θ = 1) results in slightly higher dilution and lower diversity. Any θ ≥ 2, however, provides the same performance;
this is because for such large θ values there is no need to
rebalance the partitions. Thus, while we foresee that in
most cases tRedux will be used with θ = 2, we also recognize
that some applications may require balanced partitions and
(as shown in Figure 7(c)), in these cases, θ can be used to
control the balance of the partitions. Note that, since the
number of partitions stays the same, θ does not affect the
table summarization time (Figure 7(d)). Notice that, since diversity and dilution stay constant for θ ≥ 2, there is no need to consider very high values of θ.
Impact of the Value Distribution in the Data Table
We also experimented with different value distributions in
the data table. Results for this setup are for a single at
tribute (with a balanced hierarchy with 127 nodes and 64
leaves) and 256 tuples in the table. In this experiment, we
varied the tuple count variance (tvar) of the table between
0 and ∼ 25 (in the case with tvar = 24.80, we have al
most all tuples distributed on only one leaf of the domain
hierarchy and the other leaves are only represented by one
tuple each). For each tvar, we analyzed 3 different ran
domly generated sets of tuples. Each presented result is the
average of these three cases. As can be seen in Figures 8(a)
and (b), large variances in the tuple distributions negatively
impact the dilution and diversity for summarization. As ex
pected, the original hierarchy provides better diversity and
dilution than tRedux, but is much slower (Figure 8(d)). As
tvar increases, the dilution, diversity, and execution time
behaviors of tRedux and the original scheme approach each
other. This is because an increase in the count variation
also causes an increase in the partition size variation (Fig
ure 8(c)) and when the partition variances are higher, Sama
rati’s algorithm tends to pick nodes closer to the root instead
of analyzing combinations of the internal nodes.
Impact of the Table Size. To observe the effect of the table
size, we considered a summarization scenario with a single
attribute (having a hierarchy with 31 nodes and 16 leaves).
We varied the database size from ∼ 50 to ∼ 5000000, with
×10 increments. For these experiments, we set tvar to 0.
As these experiments show, as long as the tuples in the table are selected using the same value distribution, the dilution and diversity stay the same, independent of the size of the table. Figures 9(a), (b), (c), and (d) show the obtained results. Note that the time cost of the original scheme increases faster than the time cost of tRedux-supported summarization as the table size increases.
Effects of the Number of Nodes in the Input Hierarchies. In order to study the impact of the number of nodes in the input hierarchy, we selected 6 different hierarchies with different numbers of nodes (57, 115, 230, 460, 921, and 1843 nodes), but having the same height (13 levels). For each of the 6 considered cases, we analyzed 3 different randomly generated hierarchies, and the presented results are the averages of these. For these experiments, we maintain tvar at 0. The
results in Figure 10(a) and (b) show that the dilution and
diversity behaviors of tRedux are not affected by the number
of nodes. As shown in Figure 10(c), partition size variance
pvar is also not affected by the changes in the number of
nodes. As shown in Figure 10(d), the number of nodes affects the process in terms of execution time (because the algorithm needs to consider more nodes as candidates for the summary); the benefits of the tRedux scheme are more apparent for larger hierarchies.
Effects of the Heights of the Input Hierarchies. For these experiments, we used 8 different hierarchies with the same number of nodes (460 nodes, of which 256 are leaves), but with different numbers of levels (8 through 17). For each of these cases, we experimented with 3 different randomly generated hierarchies; the presented results are averages. The table has 1024 tuples and a tvar of 0. As Figure 11(a) shows, the dilution of tRedux is not affected by the height of the
hierarchy. This is largely because the partition size variance pvar is not affected by the changes in the
hierarchy height (Figure 11(c)). The diversity of the sum
maries, however, increases with the height of the hierarchy
(Figure 11(b)): since diversity is measured by the distances
of the nodes in the hierarchy, when the height increases, the
diversity also increases. The height of the hierarchy does not affect the time cost of the summarization process for either the tRedux or the original alternative (Figure 11(d)).
Effect of the Number of Attributes in the Summary. Figure 12 shows the effect of the number of attributes. Each of the plots includes results from many experiments (varying, as described in Section 6, the two data sets, the number of tuples in the table, tvar, and the number, sizes, and heights of the hierarchies) in a single chart; these experiments are clustered in terms of the number of attributes in the data being summarized, and trend lines are drawn to help observe the
general trends. Figure 12(a) shows that, for both Adult (red
line) and synthetic (dashed blue line) data sets, the amount
of dilution increases with the number of attributes, but the
[Figure 8: (a) Dilution, (b) diversity, (c) partition size variance, and (d) time as a function of tuple value count variance (Synthetic, θ = 2)]
[Figure 9: (a) Dilution, (b) diversity, (c) partition size variance, and (d) time as a function of the number of tuples (Synthetic, θ = 2)]
loss due to tRedux stays more or less constant, at ≤ 20%. On the synthetic data, diversity shows a ∼ 10% increase in loss when the number of attributes increases from 1 to 3; on the Adult data set, however, the impact of the number of attributes is rather negligible. It is also interesting to note that, on the Adult data set, the loss in diversity due to the use of tRedux is very close to 0. As Figure 12(c) shows, the execution time gain due to the use of tRedux increases with the number of attributes; for the Adult data set, the gain increases from around 30% (i.e., ∼ 1.5× as fast as the original scheme) for the case with a single attribute to more than 40% (i.e., almost 2× as fast as the original scheme) for the case with three attributes.
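The relative measures plotted in Figure 12 are straightforward ratios; the sketch below (with hypothetical numbers, not the paper's measurements) shows how the relative dilution increase, (tRedux − Original)/Original, and the relative time drop, (Original − tRedux)/Original, are computed, and how a time drop maps to a speedup factor.

```python
def relative_increase(original, tredux):
    """(tRedux - Original) / Original: extra quality loss introduced by tRedux."""
    return (tredux - original) / original

def relative_drop(original, tredux):
    """(Original - tRedux) / Original: fractional savings due to tRedux."""
    return (original - tredux) / original

def speedup(original, tredux):
    """Speedup factor implied by the time drop; e.g., a 50% drop is a 2x speedup."""
    return original / tredux

# Hypothetical single run: summarization times in seconds, dilution values.
t_orig, t_trx = 4000.0, 2000.0
d_orig, d_trx = 2.0, 2.5

print(relative_drop(t_orig, t_trx))      # 0.5, i.e., a 50% drop in time
print(speedup(t_orig, t_trx))            # 2.0, i.e., 2x as fast
print(relative_increase(d_orig, d_trx))  # 0.25, i.e., 25% extra dilution
```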
7. CONCLUSIONS
In this paper, we have shown that, by preprocessing input value hierarchies before they are used in the metadata-supported table summarization process, we can significantly reduce the summarization cost without causing significant quality degradations in the resulting table summaries. We have also introduced a novel hierarchy summarization approach, tRedux, tailored to this task, and have shown that this approach provides the best time gain vs. quality tradeoff against alternative schemes.
[Plots for Figure 12 (Adult & Synthetic), vs. the number of attributes: (a) relative increase in dilution, (tRedux − Original)/Original; (b) relative loss in diversity, (Original − tRedux)/Original; (c) relative drop in execution time, (Original − tRedux)/Original.]
Figure 12: Dilution, diversity and relative execution
time with varying number of attributes
[Plots for Figure 10 (Synthetic, Theta=2); panels (a)-(d) compare the Original and tRedux series.]
Figure 10: (a) Dilution, (b) diversity, (c) partition size variance, and (d) time as a function of hierarchy size
[Plots for Figure 11 (Synthetic, Theta=2); panels (a)-(d) compare the Original and tRedux series.]
Figure 11: (a) Dilution, (b) diversity, (c) partition size variance, and (d) time as a function of hierarchy
height