Size-l Object Summaries for Relational Keyword Search
Georgios J. Fakas∗†, Zhi Cai, Nikos Mamoulis
Department of Computing and Mathematics, Manchester Metropolitan University, UK
Department of Computer Science, The University of Hong Kong, Hong Kong
{g.fakas, z.cai}@mmu.ac.uk, nikos@cs.hku.hk
ABSTRACT
A previously proposed keyword search paradigm produces, as a
query result, a ranked list of Object Summaries (OSs). An OS is
a tree structure of related tuples that summarizes all data held in a
relational database about a particular Data Subject (DS). However,
some of these OSs are very large in size and therefore unfriendly
to users that initially prefer synoptic information before proceeding
to more comprehensive information about a particular DS. In this
paper, we investigate the effective and efficient retrieval of concise
and informative OSs. We argue that a good size-l OS should be a
stand-alone and meaningful synopsis of the most important
information about the particular DS. More precisely, we define a
size-l OS as a partial OS composed of l important tuples. We propose
three algorithms for the efficient generation of size-l OSs (in
addition to the optimal approach, which requires exponential time).
Experimental evaluation on DBLP and TPC-H databases verifies
the effectiveness and efficiency of our approach.
1. INTRODUCTION
Web Keyword Search (W-KwS) has been very successful be-
cause it allows users to extract effectively and efficiently useful
information from the web using only a set of keywords. For in-
stance, Example 1 illustrates the partial result of a W-KwS (e.g.
Google) for Q1: “Faloutsos”: a ranked set (with the first three re-
sults shown only) of links to web pages containing the keyword(s).
We observe that each result is accompanied with a snippet [21],
i.e. a short summary that sometimes even includes the complete
answer to the query (if, for example, the user is only interested in
whether Christos Faloutsos is a Professor or whether his brothers
are academics).
The success of the W-KwS paradigm has encouraged the emer-
gence of the keyword search paradigm in relational databases (R-
KwS) [2, 4, 13]. The R-KwS paradigm is used to find tuples that
contain the keywords and their relationships through foreign-key
links, e.g. query Q2: “Faloutsos”+“Agrawal” returns Authors Fal-
Partially supported by the “Hosting of Experienced Researchers
from Abroad” programme (ΠPOΣEΛKYΣH/ΠPOEM/0308)
funded by the Research Promotion Foundation, Cyprus.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. Articles from this volume were invited to present
their results at The 38th International Conference on Very Large Data Bases,
August 27th - 31st 2012, Istanbul, Turkey.
Proceedings of the VLDB Endowment, Vol. 5, No. 3
Copyright 2011 VLDB Endowment 2150-8097/11/11... $10.00.
EXAMPLE 1. Q1 “Faloutsos” using a W-KwS (Google)
Christos Faloutsos
SCS CSD Professor’s affiliatons, research, projects, publications and
teaching.
www.cs.cmu.edu/christos/ - 9k
Michalis Faloutsos
The Homepage of Michalis Faloutsos ... Interesting and Miscallaneous
Links ·Fun pictures ·Other Faloutsos on the web; The Teach-To-Learn
Initiative:
www.cs.ucr.edu/michalis/ - 5k
Petros Faloutsos
Courses ·Press Coverage ·Publications ·Research Highlights ·Awards ·
MAGIX Lab ·Curriculum Vitae ·Family ·Other Faloutsos on Web.
www.cs.ucla.edu/pfal/ - 4k
...
EXAMPLE 2. Q2 using an R-KwS (searching DBLP database)
Author: Christos Faloutsos,Paper: Efficient similarity search in se-
quence databases, Author: Rakesh Agrawal.
Author: Christos Faloutsos,Paper: Method for high-dimensionality
indexing in a multi-media database, Author: Rakesh Agrawal.
Author: Christos Faloutsos,Paper: Quest: A project on database mining,
Author: Rakesh Agrawal.
EXAMPLE 3. Q1 using an R-KwS (searching DBLP database)
Author: Christos Faloutsos
Author: Michalis Faloutsos
Author: Petros Faloutsos
outsos and Agrawal and their associations through co-authored pa-
pers. Example 2 illustrates the result of a traditional R-KwS for Q2
on the DBLP database. On the other hand, the R-KwS paradigm
may not be very effective when trying to extract information about
a particular data subject (DS), e.g. “Faloutsos” in Q1. Example 3
illustrates the R-KwS result for Q1, namely a ranked set of Author
tuples containing the Faloutsos keyword, which are the Author tu-
ples corresponding to the three brothers. Evidently, this result fails
to provide comprehensive information to users about the Faloutsos
brothers, e.g. a complete list of their publications and other
corresponding details. (Certainly, the R-KwS paradigm remains very
useful when trying to combine keywords.)
In [8], the concept of object summary (OS) is introduced; an
OS summarizes all data held in a database about a particular DS.
More precisely, an OS is a tree with the tuple tDS containing the
keyword (e.g. Author tuple Christos Faloutsos) as the root node and
its neighboring tuples, containing additional information (e.g. his
papers, co-authors etc.), as child nodes. The result for Q1 is in fact a
set of OSs: one per DS that includes all data held in the database for
each Faloutsos brother. Example 4 illustrates the OS for Christos
(the complete set of papers and the OSs of the other two brothers
were omitted due to lack of space). This result evidently provides
a more complete set of information per brother.
arXiv:1111.7169v1 [cs.DB] 30 Nov 2011
EXAMPLE 4. The OS for Christos Faloutsos
Author: Christos Faloutsos
.Paper: On Power-law Relationships of the Internet Topology.
....Conference:SIGCOMM. Year:1999.
....Co-Author(s):Michalis Faloutsos, Petros Faloutsos.
.Paper: An Efficient Pictorial Database System for PSQL.
....Conference:IEEE Trans. Software Eng. Year:1988.
....Co-Author(s):N. Roussopoulos, T. Sellis.
...
...
.Paper: Declustering Using Fractals.
....Conference:PDIS. Year:1993. Co-Author(s):Pravin Bhagwat.
.Paper: Declustering Using Error Correcting Codes.
....Conference:PODS. Year:1989. Co-Author(s):Dimitris N. Metaxas.
(Total 1,309 tuples)
EXAMPLE 5. The size-l OSs for Q1 and l=15
Author: Christos Faloutsos
..Paper: On Power-law Relationships of the Internet Topology.
.....Conference:SIGCOMM. Year:1999.
.....Co-Author(s):Michalis Faloutsos, Petros Faloutsos.
..Paper: The QBIC Project: Querying Images by Content, Using Color,
............. Texture and Shape.
.....Conference:SPIE. Year:1993.
.....Co-Author(s):Carlton W. Niblack, Dragutin Petkovic, Peter Yanker.
..Paper: Efficient and Effective Querying by Image Content.
.....Conference:J. Intell. Inf. Syst. Year:1994.
.....Co-Author(s):N. Roussopoulos, T. Sellis.
...
Author: Michalis Faloutsos
..Paper: On Power-law Relationships of the Internet Topology.
.....Conference:SIGCOMM. Year:1999.
.....Co-Author(s):Christos Faloutsos, Petros Faloutsos.
..Paper: QoSMIC: Quality of Service Sensitive Multicast Internet Protocol.
.....Conference:SIGCOMM. Year:1998.
.....Co-Author(s):Anindo Banerjea, Rajesh Pankaj.
..Paper: Aggregated Multicast with Inter-Group Tree Sharing.
.....Conference:Networked Group Communication. Year:2001.
.....Co-Author(s):Aiguo Fei.
...
Author: Petros Faloutsos
..Paper: On Power-law Relationships of the Internet Topology.
.....Conference:SIGCOMM. Year:1999.
.....Co-Author(s):Christos Faloutsos, Michalis Faloutsos.
..Paper: Composable controllers for physics-based character animation.
.....Conference:SIGGRAPH. Year:2001.
.....Co-Author(s):Michiel van de Panne, Demetri Terzopoulos.
..Paper: The virtual stuntman: dynamic characters with a repertoire of
............. autonomous motor skills.
.....Conference:Computers & Graphics 25. Year:2001.
.....Co-Author(s):Michiel van de Panne, Demetri Terzopoulos.
From Example 4, we can observe that some of the OSs may be
very large in size; e.g. Christos Faloutsos has co-authored many
papers and his OS consists of 1,309 tuples. This is not only un-
friendly to users that prefer a quick glance first before deciding
which Faloutsos they are really interested in, but also expensive
to produce. Therefore, a partial OS of size l, composed of only l
representative and important tuples, may be more appropriate.
In this paper, we investigate in detail the effective and efficient
generation of size-l OSs. Example 5 illustrates Q1 with l=15 on
the DBLP database; namely, a set of size-15 OSs composed of only
15 important tuples for each DS. From the user’s perspective, the
semantics of this paradigm resemble a W-KwS more than an
R-KwS. For instance, the complete OS of Example 4 resembles a
web page (as both include comprehensive information about a
DS), whereas the size-l OSs of Example 5 resemble the snippets of
Example 1. Therefore, users with W-KwS experience will
potentially find it friendlier and also closer to their expectations.
OSs and size-l OSs can have many applications. For example,
OSs can automate responses to data protection act (DPA) subject
access requests (e.g. the US Privacy Act of 1974, UK DPA of 1984
and 1998 [1] etc.). Under such DPA access requests, DSs have
the right to request access from any organization to personal
information about them. Thus, data controllers of organizations must
extract data for a given DS from their databases and present it in
an intelligible form [10]. Another application is for intelligence
services searching for information about suspects in various databases.
Hence, size-l OSs can also be very useful, as they enhance the
usability of OSs. In general, a size-l OS is a concise summary of
the context around any pivot database tuple, finding application in
(interactive) data exploration, schema extraction, etc.
We should effectively generate a stand-alone size-l OS, composed
of l important tuples only, so that the user can comprehend it
without any additional information. A stand-alone size-l OS should
preserve meaningful and self-descriptive semantics about the DS.
As we explain in Section 3, for this reason, the l tuples should form
a connected graph that includes the root of the OS (i.e. tDS). To
distinguish the importance of individual tuples ti to be included in
the size-l OS, a local importance score (denoted as Im(OS, ti))
is defined by combining the tuple’s global importance score in the
database (denoted as Im(ti)) and its affinity [8] in the OS (denoted
as Af(ti)). Based on the local importance scores of the tuples of
an OS, we can find the partial OS of size l with the maximum
importance score, which includes tuples that are connected with tDS.
The efficient generation of size-l OSs is a challenging problem.
A brute-force approach, which considers all candidate size-l OSs
before finding the one with the maximum importance, requires
exponential time. We propose an optimal algorithm based on dynamic
programming, which is efficient for small problems; however, it
does not scale well with the OS size and l. In view of this, we
design three practical greedy algorithms.
We provide an extensive experimental study on DBLP and TPC-
H databases, which includes comparisons of our algorithms and
verifies their efficiency. To verify the effectiveness of our
framework, we collected user feedback, e.g. by asking several DBLP
authors (i.e. the DSs themselves) to assess their own computed
size-l OSs on the DBLP database. The users indicated that the
results produced by our method are very close to their expectations.
The rest of the paper is structured as follows. Section 2 describes
background and related work. Section 3 describes the semantics
of size-l OS keyword queries and formulates the problem of their
generation. Sections 4 and 5 introduce the optimal and greedy
algorithms, respectively. Section 6 presents experimental results and
Section 7 provides concluding remarks.
2. BACKGROUND AND RELATED WORK
In this section, we first describe the concept of object summaries
(OSs), which we build upon in this paper. We then present and
compare other related work in R-KwS, ranking and summarization.
To the best of our knowledge there is no previous work that focuses
on the computation of size-l OSs.
2.1 Object Summaries
In the context of OS search in relational databases [8, 7], a query
is a set of keywords (e.g. “Christos Faloutsos”) and the result is a
set of OSs. An OS is generated for each tuple (tDS) found in the
database that contains the keyword(s) as part of an attribute’s value
(e.g. tuple “Christos Faloutsos” of relation Author in the DBLP
database). An OS is a tree structure composed of tuples, having tDS
as root and tDS’s neighboring tuples (i.e. those associated through
foreign keys) as its children/descendants.
In order to construct OSs, this approach combines the use of
graphs and SQL. The rationale is that there are relations, denoted
as RDS (e.g. the Author relation), which hold information about the
queried Data Subjects (DSs) and the relations linked around RDSs
contain additional information about the particular DS. For each
RDS, a Data Subject Schema Graph (GDS ) can be generated; this is
Figure 1: The DBLP Database Schema (relations Author, Paper, Conference, Year)
Figure 2: The DBLP Author GDS, annotated per relation with
(affinity), max(Ri) and mmax(Ri): Author (1) 1.049, 7.381;
Paper (0.92) 8.818, 7.381; Co-author (0.82) 0.86, 0;
Conference (0.78) 0.216, 0; Year (0.83) 0.841, 0.216;
PaperCites (0.77) 7.38, 0; PaperCitedBy (0.77) 7.381, 0.
a directed labeled tree that captures a subset of the database schema
with RDS as a root. (Figures 1 and 11 illustrate the schemata of
the DBLP and TPC-H databases whereas Figures 2 and 12 illus-
trate exemplary GDSs.) Each relation in GDS is also annotated with
useful information that we describe later, such as affinity and im-
portance. GDS is a “treealization” of the schema, i.e. RDS becomes
the root, its neighboring relations become child nodes and also any
looped or many-to-many relationships are replicated. Examples of
such replications are relations PaperCitedBy, PaperCites and Co-
Author on Author GDS and relations Partsupp, Lineitem, Parts etc.
on Customer GDS (see GDSs in Figures 2 and 12). (User evaluation
in [8] verified that the tree format, achieved via such replications,
significantly increases the friendliness and ease of use of OSs.)
The challenge now is the selection of the relations from GDS which
have the highest affinity with RDS; these need to be accessed and
joined in order to create a good OS. To facilitate this task, affinity
measures of relations (denoted as Af(Ri)) in GDS are investigated,
quantified and annotated on the GDS. The affinity of a relation Ri
to RDS can be calculated using the following formula:
    Af(Ri) = (Σj mj · wj) · Af(RParent),    (1)
where j ranges over a set of metrics (m1, m2, ..., mn) with
corresponding weights (w1, w2, ..., wn), and Af(RParent) is the affinity
of Ri’s parent to RDS. Affinity metrics between Ri and RDS
include (1) their distance and (2) their connectivity properties on both
the database schema and the data-graph (see [8] for more details).
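Equation 1 can be evaluated top-down over the GDS tree, since each relation's affinity only depends on its parent's. The following is a minimal sketch; the schema, metric values and weights below are hypothetical illustrations, not the actual values used in [8]:

```python
def affinity(gds_children, metrics, weights, root):
    """Propagate Eq. 1 down the GDS tree: Af(Ri) = (sum_j mj*wj) * Af(parent).
    metrics[R] holds the metric values m1..mn of relation R (hypothetical)."""
    af = {root: 1.0}                      # the root RDS has affinity 1
    stack = [root]
    while stack:
        r = stack.pop()
        for c in gds_children.get(r, []):
            score = sum(m * w for m, w in zip(metrics[c], weights))
            af[c] = score * af[r]         # Eq. 1
            stack.append(c)
    return af

# Hypothetical DBLP-like GDS: Author -> Paper -> Year
gds = {"Author": ["Paper"], "Paper": ["Year"]}
weights = [0.5, 0.5]                      # w1, w2 (e.g. distance, connectivity)
metrics = {"Paper": [0.90, 0.94], "Year": [0.85, 0.95]}
af = affinity(gds, metrics, weights, "Author")
# Paper: (0.90*0.5 + 0.94*0.5) * 1.0 = 0.92; Year: 0.90 * 0.92 = 0.828
```

Note how a relation deeper in the tree (Year) is discounted by its parent's affinity, matching the intuition that relations further from RDS are less relevant.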
Given an affinity threshold θ, a subset of GDS can be produced,
denoted as GDS(θ). Finally, by traversing GDS(θ) (e.g. by joining
the corresponding relations) we can generate the OSs (either
by using the precomputed data-graph or directly from the database
using Algorithm 5). More precisely, a breadth-first traversal of the
corresponding GDS(θ) is applied, with the tDS tuple as the initial root
entry of the OS tree. For instance, for keyword query Q1, the
Author GDS of Figure 2 and θ=0.7, the report presented in Example
4 is generated automatically. Note that Author GDS(0.7)
includes all relations, whilst Customer GDS(0.7) includes only Customer,
Nation, Region, Order, Lineitem and Partsupp relations (since
all these relations have affinity greater than 0.7). Similarly, the
set of attributes Aj from each relation Ri that are included in a GDS
is selected by employing an attribute affinity and a threshold (i.e.
θ). For example, in a Customer OS, Comment is excluded from the
Partsupp relation as it is not relevant to Customer DSs.
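The breadth-first OS generation described above can be sketched as follows. This is a toy illustration, not the paper's Algorithm 5: the schema tree stands in for GDS(θ), and the `neighbors` function stands in for the foreign-key joins (all names and data are made up):

```python
from collections import deque

def generate_os(schema_children, neighbors, root_rel, t_ds):
    """Breadth-first traversal of GDS(theta): start from the DS tuple and
    repeatedly join each frontier tuple with its child relations."""
    root = (root_rel, t_ds)
    os_tree = {root: []}                  # node -> list of child nodes
    queue = deque([root])
    while queue:
        rel, t = queue.popleft()
        for child_rel in schema_children.get(rel, []):
            for u in neighbors(rel, t, child_rel):   # foreign-key join
                node = (child_rel, u)
                os_tree[(rel, t)].append(node)
                os_tree.setdefault(node, [])
                queue.append(node)
    return os_tree

# Toy DBLP-like instance: an Author with two Papers, each with a Year
# (tuple ids are assumed unique in this toy).
schema = {"Author": ["Paper"], "Paper": ["Year"]}
data = {
    ("Author", "Christos Faloutsos", "Paper"): ["p1", "p2"],
    ("Paper", "p1", "Year"): ["1999"],
    ("Paper", "p2", "Year"): ["1988"],
}

def neighbors(rel, t, child_rel):
    return data.get((rel, t, child_rel), [])

os_tree = generate_os(schema, neighbors, "Author", "Christos Faloutsos")
# the OS contains 5 tuple nodes: the author, 2 papers and their 2 years
```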
2.2 R-KwS and Ranking
R-KwS techniques facilitate the discovery of joining tuples (i.e.
Minimal Total Join Networks of Tuples (MTJNTs) [13]) that col-
lectively contain all query keywords and are associated through
their keys; for this purpose the concept of candidate networks is in-
troduced; see, for example, DISCOVER [13], BANKS [2, 4]. The
OSs paradigm differs from other R-KwS techniques semantically,
since it does not focus on finding and ranking candidate networks
that connect the given keywords, but searches for OSs, which are
trees centered around the data subject described by the keywords.
Précis Queries [15, 19] resemble size-l OSs as they append
additional information to the nodes containing the keywords, by
considering neighboring relations that are implicitly related to the
keywords. More precisely, a précis query result is a logical subset of
the original database (i.e. a subset of relations and a subset of
tuples). For instance, the précis of Q1 is a subset of the database
that includes the tuples of the three Faloutsos Authors and a subset
of their (common) Papers, Co-Authors, Conferences, etc. In
contrast, our result is a set of three separate size-l OSs (Example 5). A
thorough comparison between OSs and précis appears in our earlier
work [8].
R-KwS techniques also investigate the ranking of their results.
Such ranking paradigms consider:
1) IR-Style techniques, which weight the number of times
keywords (terms) appear in MTJNTs [12, 16, 17, 23]. However, such
techniques miss tuples that are related to the keywords but do
not contain them [3]; e.g. for Q1, tuples in relation Paper also have
importance, although they do not include the Faloutsos keyword.
2) Tuples’ Importance, which weights the authority flow through
relationships, e.g. ObjectRank [3], [22], ValueRank [9], PageRank
[5], BANKS (PageRank inspired) [2], [4], XRANK [11] etc. In this
paper we use tuples’ importance to model global importance scores
and more precisely global ObjectRank (for DBLP) and ValueRank
(for TPC-H). (Note that our algorithms are orthogonal to how tuple
importance is defined, and other methods could also be
investigated.) ObjectRank [3] is an extension of PageRank on databases and
introduces the concept of Authority Transfer Rates between the
tuples of each relation of the database (Authority Transfer Rates are
annotated on the so-called Authority Transfer Schema Graph,
denoted as GA, e.g. Figure 13). They are based on the observation
that simply mapping a relational database to a graph (as in the case
of the web) is not accurate, and a GA is required to control the flow
of authority between neighboring tuples. For instance, a well-cited
paper should have higher importance than a paper that merely cites
many other papers, and a well-cited paper should rank higher than
one with fewer citations. ValueRank is an extension of ObjectRank
which also considers the tuples’ values and can thus be applied
to any database (e.g. TPC-H), in contrast to ObjectRank, which is
mainly effective on authoritative-flow data such as bibliographic
data (e.g. DBLP). For instance, in trading databases, a customer with
five orders of value $10 each may get lower importance than another
customer with three orders of value $100 each.
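To illustrate the authority-transfer idea behind ObjectRank, the sketch below runs a PageRank-style power iteration where each edge type carries a fixed transfer rate. This is a hedged illustration, not the authors' or ObjectRank's actual implementation; the toy graph, the rate `alpha` and the damping factor `d` are made up:

```python
from collections import Counter

def authority_rank(edges, alpha, nodes, d=0.85, iters=100):
    """PageRank-style iteration where each edge type transfers a fixed
    authority rate alpha[type], split among the source's out-edges of
    that type (the ObjectRank-style authority-transfer idea)."""
    outdeg = Counter((u, t) for u, _, t in edges)
    r = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - d) / len(nodes) for n in nodes}
        for u, v, t in edges:
            nxt[v] += d * alpha[t] * r[u] / outdeg[(u, t)]
        r = nxt
    return r

# Toy citation graph: papers B and C both cite paper A.
nodes = ["A", "B", "C"]
edges = [("B", "A", "cites"), ("C", "A", "cites")]
alpha = {"cites": 0.7}                    # hypothetical transfer rate
r = authority_rank(edges, alpha, nodes)
# the well-cited paper A ends up with the highest score
```

The transfer rate lets citations confer authority on the cited paper without the citing papers gaining anything in return, which is exactly the asymmetry a plain graph mapping would miss.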
2.3 Other Related Work
Document summarization techniques have attracted significant
research interest [20, 21]. In general, these techniques are IR-style
inspired. Web snippets [21] are examples of document summaries
that accompany search results of W-KwSs in order to facilitate their
quick preview (e.g. see Example 1). They can be either static (e.g.
composed of the first words of the document or description
metadata) or query-biased (e.g. composed of sentences containing the
keywords many times) [20]. Still, the direct application of such
techniques to databases in general and OSs in particular is ineffective;
e.g. they disregard the relational associations and semantics of the
displayed tuples. For example, consider Q1 and Example 4: papers
authored by Faloutsos (although they do not include the Faloutsos
keyword) have importance analogous to their citations and authors;
this is ignored by document summarization techniques.
XML keyword search techniques, similarly to R-KwS, facilitate
the discovery of XML sub-trees that contain all query keywords
(e.g. “Faloutsos”+“Agrawal”). Analogously, XML snippets
[14] are sub-trees of the complete XML result, with a given size,
that contain all keywords. An apparent difference between size-l
OSs and XML snippets is their semantics, which is analogous to
the semantic difference between complete OSs and XML results.
Therefore, their generation is a completely different problem. An
interesting similarity is that both size-l OSs and XML snippets are
sub-trees of the corresponding complete results, hence composed
of connected nodes. This common property exists for the same reason,
i.e. to preserve self-descriptiveness.
3. SIZE-l OSs
A size-l OS keyword query consists of (1) a set of keywords and
(2) a value for l (e.g. Q1 and l=15), and the result comprises a
set of size-l OSs. A good size-l OS should be a stand-alone and
meaningful synopsis of the most important information about the
particular DS.

DEFINITION 1. Given an OS and an integer size l, a candidate
size-l OS is any subset of the OS composed of l tuples, such that all
l tuples are connected with tDS (i.e. the root of the OS tree).
Definition 1 guarantees that the size-l OS remains stand-alone
(so users can understand it as it is, without any additional tuples);
i.e. by including connecting tuples we also include the semantics
of their connection to the DS. (Recall that this criterion was also
used in [14] for the same reasons.) Consider the example of Figure
3, which is a fraction of the Faloutsos OS (in the DBLP database).
Even if the Paper “Efficient and Effective Querying by Image
Content” has less local importance (e.g. 20) than the Co-Author(s)
Sellis (e.g. 43) and Roussopoulos (e.g. 34), we cannot exclude the
Paper and include only the Co-Authors. The rationale is that by
excluding the Paper tuple we also exclude the semantic association
between the Author and Co-Author(s), which in this case is their
common paper. Also note that a size-l OS will not necessarily
include the l tuples with the largest importance scores. For example,
the Co-Author Roussopoulos, although of larger importance than
the particular Paper, may have to be excluded from a size-l OS (e.g.
from a size-3 OS which will consist of (1) Author “Faloutsos”, (2)
Paper “Efficient . . . ” and (3) Co-Author “Sellis”).
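The connectivity requirement of Definition 1 can be checked directly: a set of tuples is a candidate iff it contains tDS and every selected tuple's parent is also selected. A minimal sketch (the tiny parent map mirrors the Figure 3 fragment; node names are illustrative):

```python
def is_candidate_size_l(parent, root, selection, l):
    """A set of tuples is a candidate size-l OS iff it has exactly l
    tuples, contains the root tDS, and every non-root tuple's parent
    in the OS tree is also selected (so the set is connected to tDS)."""
    if len(selection) != l or root not in selection:
        return False
    return all(n == root or parent[n] in selection for n in selection)

# Fragment of the Faloutsos OS (as in Figure 3)
parent = {"paper_eeq": "faloutsos",
          "sellis": "paper_eeq", "roussopoulos": "paper_eeq"}

# valid size-3 OS: Author, Paper, Co-Author Sellis
ok = is_candidate_size_l(parent, "faloutsos",
                         {"faloutsos", "paper_eeq", "sellis"}, 3)
# invalid: the connecting Paper tuple is missing
bad = is_candidate_size_l(parent, "faloutsos",
                          {"faloutsos", "sellis"}, 2)
```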
Given an OS, we can extract exponentially many size-l OSs that
satisfy Definition 1. In the next section we define a measure for the
importance (i.e., quality) of a candidate size-l OS. Our goal then
would be to retrieve a size-l OS of the maximum possible quality.
3.1 Importance of a Size-l OS
The (global) importance of any candidate size-l OS S, denoted
as Im(S), is defined as:

    Im(S) = Σ_{ti∈S} Im(OS, ti),    (2)

where Im(OS, ti) is the local importance of tuple ti (to be
defined in Section 3.2 below). We say that a candidate size-l OS
Figure 3: A Fraction of the Faloutsos OS (Annotated with Local
Importance): Author: Christos Faloutsos (58); Paper: “Efficient
and Effective Querying by Image Content” (20), with Co-Author:
T. Sellis (43), Co-Author: N. Roussopoulos (34), Conference:
IEEE Trans. Software Eng. (14), Year: 1988 (18).
is an optimal size-l OS if it has the maximum Im(S) (denoted
as max(Im(S))) over all candidate size-l OSs for the given OS.
Whenever an optimal size-l OS is hard to find, we target the retrieval
of a sub-optimal size-l OS of the highest possible importance.
3.2 Local Importance of a Tuple (Im(OS, ti))

The local importance Im(OS, ti) of each tuple ti in an OS
can be calculated by:

    Im(OS, ti) = Im(ti) · Af(ti),    (3)

where Im(ti) is the global importance of ti in the database. We use
global ObjectRank and ValueRank to calculate global importance,
as discussed in Section 2.2. Af(ti) is the affinity of ti to the tDS;
namely, the affinity Af(Ri) of the corresponding relation Ri, to which
ti belongs, to RDS. This can be calculated from GDS using Equation
1, as discussed in Section 2.1 (alternatively, a domain expert can set
the Af(Ri)s manually). For example, if tuple ti is the paper “Efficient ..”
with Im(ti)=21.74 and Af(ti)=Af(RPaper)=0.92 (see the affinity
on the Author GDS in Figure 2), then Im(OS, ti) = 21.74·0.92 ≈ 20.
Multiplying global importance Im(ti) with affinity Af(ti)
reduces the importance of tuples that are not closely related to the
DS. For instance, although paper “Efficient ..” and year “1988”
have almost equal global importance scores (21.74 and 21.64,
respectively), their local importance scores become 20 (=21.74·0.92)
and 18 (=21.64·0.83), respectively. The use of importance and affinity
metrics is inspired by other earlier work; e.g. XRANK and
précis employ variations of importance and affinity [11, 15]. For
defining affinity in [11, 15], only distance is considered; however,
as shown in [8], distance is only one among the possible affinity
metrics (e.g. cardinality, reverse cardinality etc.).
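Equations 2 and 3 reduce to a couple of lines of code. The sketch below simply reproduces the worked numbers from this section (global scores 21.74 and 21.64, affinities 0.92 and 0.83):

```python
def local_importance(global_im, affinity):
    """Eq. 3: Im(OS, ti) = Im(ti) * Af(ti)."""
    return global_im * affinity

def importance(local_scores):
    """Eq. 2: Im(S) is the sum of local importances of the selected tuples."""
    return sum(local_scores)

paper = local_importance(21.74, 0.92)   # paper "Efficient ..": about 20
year = local_importance(21.64, 0.83)    # year "1988": about 18
total = importance([paper, year])       # contribution of these two tuples
```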
3.3 Problem Definition

The generation of a complete OS is straightforward: we only
have to traverse the corresponding GDS (see Algorithm 5 in the
Appendix). The generation of a size-l OS is a more challenging
task, because we need to select l tuples that are connected to the tDS
of the tree and at the same time result in the largest Im(S). Hence,
the problem we study in this paper can be defined as follows:

PROBLEM 1 (FIND AN OPTIMAL SIZE-l OS). Given a tDS, the
corresponding GDS and l, find a size-l OS S of maximum Im(S).

A direct approach for solving this problem is to first generate the
complete OS (i.e. Algorithm 5)¹ and then determine the optimal
¹In fact, any tuples or subtrees at distance at least l from
the root tDS are excluded from the OS, as these cannot be part of a
connected size-l OS rooted at tDS.
size-l OS from it. In Section 4, we propose a dynamic
programming (DP) algorithm for this purpose. If the complete OS is too
large, solving the problem exactly using DP can be too expensive.
In view of this, in Section 5, we propose greedy algorithms that
find a sub-optimal synopsis. In order to further reduce the cost of
finding a sub-optimal solution, in Section 5.3, we also propose an
economical approach which, instead of the complete OS, initially
generates a preliminary partial OS, denoted as prelim-l OS. The
rationale of a prelim-l OS is to avoid the extraction, and consequently
the further processing, of fruitless tuples that are not promising to
make it into the size-l OS. DP and the greedy algorithms can be
applied on the prelim-l OS to find a good sub-optimal size-l OS.
4. THE DP ALGORITHM

This section describes a dynamic programming (DP) algorithm
which, given an OS, determines the optimal size-l OS in it. The OS
is a tree, as discussed in Section 2. Every node v of the OS tree is
a tuple ti, and carries a weight w(v), which is the local importance
Im(OS, ti) of the corresponding tuple ti. Given the tree OS, our
objective is to find a subtree Sopt, such that (i) Sopt includes the
root node tDS of the OS, (ii) the tree has l nodes, and (iii) its nodes
have the maximum sum of weights over all trees that satisfy (i) and
(ii). In the third condition, the sum of node weights corresponds
to Im(Sopt), according to Equation 2. Since this is the maximum
among all qualifying subtrees, Sopt is the optimal size-l OS.
Assume that the root tDS in Sopt has a child v and that the subtree
S^v_opt rooted at v has i nodes. Then, S^v_opt should be the optimal
size-i OS rooted at v. DP operates based on exactly this assertion: for
each candidate node v to be included in the optimal synopsis and
for each number of nodes i in the subtree of v that can be included,
we compute the corresponding optimal size-i synopsis and the
corresponding sum of weights. The optimal size-i synopsis rooted at v
is computed recursively from precomputed size-j synopses (j < i)
rooted at v’s children; to find it, we should consider all synopses
formed by v and all size-(i−1) combinations of its children and the
subtrees rooted at them.
Specifically, let d(v) be the depth of a node v in the OS (the root
tDS has depth 0). The subtree rooted at v can contribute at most
l−d(v) nodes to the optimal solution, because in every solution that
includes v, the complete path from the root to v must be included
(due to the fact that tDS must be included and the solution must
be connected). The core of the DP algorithm is to compute, for
each node v of the OS, Sv,i: the optimal size-i OS, for all i ∈ [1, l−d(v)],
in the subtree rooted at v. In addition to Sv,i, the algorithm
should track W(Sv,i), the sum of weights of all nodes in Sv,i.
DP (Algorithm 1) proceeds in a bottom-up fashion; it starts from
nodes in the OS at depth l−1; these nodes can only contribute
themselves to the optimal solution (nodes at depth at least l cannot
participate in a size-l OS). For each such node v, trivially Sv,1={v} and
W(Sv,1)=w(v). Now consider a node v at depth k < l−1. Upon
reaching v, for all children u of v, the quantities Su,i and W(Su,i)
have been computed for all i ∈ [1, l−d(v)−1]. Let us now see
how we can compute Sv,i for each i ∈ [1, l−d(v)]. First, each
Sv,i should include v itself. Then, we examine all possible
combinations of v’s children and numbers of nodes to be selected from
their subtrees, such that the total number of selected nodes is i−1.
We do not have to check the subtrees of v’s children, since for each
number of nodes j to be selected from a subtree rooted at child u,
we already have the optimal set Su,j and the corresponding sum of
weights W(Su,j). Note that when we reach the OS root r=tDS,
we only have to compute Sr,l: the optimal size-l OS (i.e., there is
no need to compute Sr,i for i ∈ [1, l−1]).
Algorithm 1 The Optimal Size-l OS (DP) Algorithm

DP(l, tDS, GDS)
1: OS := Generation(tDS, GDS)  ▷ generates the complete OS and
   annotates each node with its local importance
2: for each node v at depth l−1 do set Sv,1 = {v}
3: for each depth k = l−2 down to 0 do
4:   for each node v at depth k do
5:     for i = 1 to l−d(v) do
6:       Sv,i = {v} ∪ the best combination of v’s children and nodes
         from their subtrees such that the total number of nodes is i−1
7: return Sr,l
[Figure 4 (top): an example OS tree; node (weight): root 1 (30),
with nodes 2 (20), 3 (11), 4 (31), 5 (80), 6 (35), 7 (10), 8 (15),
9 (5), 10 (13), 11 (30), 12 (12), 13 (60), 14 (40).]

Depth  Computed Sets
3      S13,1={13}, S14,1={14}
2      S7,1={7}, S8,1={8}, S9,1={9}, S10,1={10}, S11,1={11},
       S11,2={11,13}, S12,1={12}, S12,2={12,14}
1      S2,1={2}, S3,1={3}, S3,2={3,8}, S3,3={3,7,8}, S4,1={4},
       S4,2={4,11}, S4,3={4,11,13}, S5,1={5}, S6,1={6},
       S6,2={6,12}, S6,3={6,12,14}
0      S1,4={1,4,5,6}

Figure 4: Example: Steps of DP
As an example, consider the OS shown in Figure 4 (top) and assume that we want to compute the optimal size-4 OS from it. The table shows the steps of DP in computing the optimal sets S_{v,i} in a bottom-up fashion, starting from nodes 13 and 14, which are at depth 3 (i.e., l−1). For example, to compute S_{4,3}={4,11,13}, we compare the two possible cases S_{4,3}={4} ∪ S_{10,1} ∪ S_{11,1} and S_{4,3}={4} ∪ S_{11,2}, since S_{10,1} ∪ S_{11,1} and S_{11,2} are the only combinations from node 4's children that total 2 nodes (i−1=2). S_{10,1} ∪ S_{11,1}={10,11} has total weight 43 and S_{11,2}={11,13} has total weight 90. Thus, S_{4,3}={4} ∪ S_{11,2}={4,11,13}. Note that for nodes that do not have enough children, the number of sets that are computed can be smaller than indicated by the pseudocode. For example, for node 2 we only have S_{2,1}; i.e., S_{2,2} and S_{2,3} do not exist, although the node is at depth 1, because node 2 has no children. In addition, for the root node, DP only has to compute S_{1,4}, since we only care about the optimal size-l summary (there are no nodes above the root that could use smaller summaries).
In terms of complexity, for each node v in the OS up to depth l−1 we need to compute up to l−d(v) sets. For each set we need to find the optimal combination of children and numbers of nodes to choose from them. The cost of choosing the best combination increases exponentially with i, which is O(l). Thus, the overall cost of DP is O(n^l) for an input OS of size n, as can be verified in our experiments. This is essentially the complexity of the problem, as DP explores all possible summaries systematically and, in the general case, there is no way to prune the search space. For large values of l, DP becomes impractical and we resort to the greedy heuristics described in the next section. Finally, the following lemma proves the optimality of DP.
LEMMA 1. Algorithm 1 computes the optimal size-l OS.
PROOF. The optimal size-l OS S_opt includes the root t_DS of the OS and a set of subtrees rooted at some of t_DS's children. DP tests all possible combinations of children and numbers of nodes from the corresponding subtrees, therefore the combination that corresponds to S_opt will be considered. For this specific combination, for each child v and number of nodes i, the optimal subtree rooted at v with i nodes (i.e., S_{v,i}) has already been found during the bottom-up computation process of DP. Therefore, DP will select and output the optimal combination (which has the largest importance among all tested ones).
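To make the recurrence concrete, the following Python sketch runs Algorithm 1 on the example OS of Figure 4 (node IDs mapped to their local importance). It is not the authors' implementation: it uses memoized recursion instead of an explicit bottom-up pass, and the combination step over the children is implemented as a small knapsack, which selects the same optimal combination that the pseudocode enumerates.

```python
# Sketch of the size-l DP on the example OS of Figure 4.
from functools import lru_cache

children = {1: [2, 3, 4, 5, 6], 3: [7, 8, 9], 4: [10, 11],
            6: [12], 11: [13], 12: [14]}
weight = {1: 30, 2: 20, 3: 11, 4: 31, 5: 80, 6: 35, 7: 10,
          8: 15, 9: 5, 10: 13, 11: 30, 12: 12, 13: 60, 14: 40}

@lru_cache(maxsize=None)
def best(v, i):
    """Optimal (total weight, node set) of a connected subtree rooted at v
    with at most i nodes (v itself always included)."""
    if i == 0:
        return (0, frozenset())
    # Distribute the remaining i-1 slots among v's children with a knapsack,
    # equivalent to the pseudocode's enumeration of child/node-count combos.
    combos = {0: (0, frozenset())}          # slots used -> (weight, node set)
    for u in children.get(v, []):
        nxt = dict(combos)
        for used, (w, s) in combos.items():
            for j in range(1, i - used):    # take j nodes from u's subtree
                wu, su = best(u, j)
                cand = (w + wu, s | su)
                if used + j not in nxt or cand[0] > nxt[used + j][0]:
                    nxt[used + j] = cand
        combos = nxt
    w, s = max(combos.values(), key=lambda ws: ws[0])
    return (w + weight[v], s | {v})

w, nodes = best(1, 4)
print(sorted(nodes), w)
```

Running it reports the optimal size-4 OS {1, 4, 5, 6} with total importance 176, matching the last row of the table in Figure 4.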
5. GREEDY ALGORITHMS
Since the DP algorithm does not scale well, in this section we investigate greedy heuristics that aim at producing a high-quality size-l OS, which is not necessarily the optimal one. A property that the algorithms exploit is that the local importance of tuples in the OS (i.e., Im(OS, t_i)) usually decreases with the node's depth from the root t_DS of the OS. Recall that Im(OS, t_i) is the product Im(t_i)·Af(t_i), where Im(t_i) is the global importance of tuple t_i and Af(t_i) is the affinity of the relation that t_i belongs to. Af(t_i) monotonically decreases with the depth of the tuple, since the affinity of a relation R_i is the product of its parent's affinity and a factor that is at most 1 (cf. Equation 1). On the other hand, the global importance of a particular tuple is to some extent unpredictable. Therefore, even though the local importance is not monotonically decreasing with the depth of the tuple in the OS tree, it has a higher probability to decrease than to increase with depth. Hence, tuples higher in the OS tree are more likely to have greater local importance than lower tuples. Moreover, note that due to the non-monotonicity of OSs, existing top-k techniques such as [6, 12, 17] cannot be applied.
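A toy calculation (the numbers below are made up for illustration) shows why affinity is monotone with depth while local importance need not be:

```python
def local_importance(global_importance, path_affinities):
    """Im(OS, t) = Im(t) * Af(t); Af(t) is the product of the relation
    affinities along the path from the root, each factor <= 1."""
    af = 1.0
    for a in path_affinities:
        af *= a                      # affinity can only shrink with depth
    return global_importance * af

# Affinity decreases with depth, yet a deep tuple with a high global
# importance can still outrank a shallow one:
shallow = local_importance(0.2, [0.9])        # depth-1, low global score
deep = local_importance(0.9, [0.9, 0.8])      # depth-2, high global score
print(shallow, deep)
```

Here the depth-2 tuple scores about 0.648 against 0.18 for the depth-1 tuple, which is exactly the non-monotonicity the greedy algorithms must cope with.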
5.1 Bottom-Up Pruning Size-l Algorithm
This algorithm, given an initial OS (either a complete or a prelim-l OS), iteratively prunes from the bottom of the tree the n−l leaf nodes with the smallest Im(OS, t_i), where n is the number of nodes in the complete OS. The rationale is that since tuples need to be connected with the root and lower tuples in the tree are expected to have lower importance, we can start pruning from the bottom. A priority queue (PQ) organizes the current leaf nodes according to their local importance. Algorithm 2 shows a pseudocode of the algorithm and Figure 5 illustrates the steps.
More precisely, this algorithm first generates the initial OS (line 1; e.g., the complete OS using Algorithm 5). The OS Generation algorithm generates the initial size-l OS and also the initial PQ (initially holding all leaves of the given OS). Then, the algorithm iteratively prunes the leaves with the smallest Im(OS, t_i). Whenever a new leaf is created (e.g., after pruning node 9 in Figure 5, node 3 becomes a leaf), it is added to PQ. The algorithm terminates when only l nodes remain in the tree. The tree is then returned as the size-l OS. In terms of time complexity, the algorithm performs O(n) delete operations (each in constant time), each potentially followed by an update to the PQ. Since there are O(n) elements in PQ, the cost of each update operation is O(log n). Thus, the overall cost of the algorithm is O(n log n). This is much lower than the complexity of the DP algorithm, which gives the optimal solution.
On the other hand, this method will not always return the optimal solution; e.g., the optimal size-5 OS should include nodes 1, 5, 6, 12 and 14 instead of 1, 5, 6, 11 and 13 (Fig. 5(d)). In practice, it is very accurate (see our experimental results in Section 6.2), due to the aforementioned property of Im(OS, t_i), which gives nodes closer to the root a higher probability of having a high local importance. Lemma 2 proves an optimality condition for this algorithm.
Algorithm 2 The Bottom-Up Pruning Size-l Algorithm
Bottom-Up Pruning Size-l (l, t_DS, G_DS)
1: OS Generation(t_DS, G_DS)  ▷ generates the initial size-l (i.e., complete or prelim-l) OS and the initial PQ
2: while (|size-l OS| > l) do
3:   t_tem = deQueue(PQ)  ▷ the smallest value from PQ
4:   if !(hasSiblings(size-l OS, t_tem)) then
5:     enQueue(PQ, parent(size-l OS, t_tem))  ▷ after pruning t_tem, its parent becomes a leaf node
6:   prune t_tem from the size-l OS
7: return size-l OS
[Figure 5 is omitted in this text version. Its four panels show (a) the initial OS, (b) the OS after the first leaf is pruned, (c) the size-10 OS and (d) the size-5 OS, each together with the corresponding PQ contents.]
Figure 5: The Bottom-Up Pruning Size-l Algorithm: Size-l OSs and their Corresponding PQs (annotated with tuple ID and local importance)
(Paper OSs in the DBLP database are an example of this condition; to be discussed in Section 6.2.)
LEMMA 2. When the nodes of an OS have local importance scores that monotonically decrease with their distance from the root (i.e., the score of each parent is not smaller than that of its children), then the Bottom-Up Pruning Size-l Algorithm returns the optimal size-l OS.
PROOF. PQ.top always holds the node with the current smallest score in the OS. This is because PQ.top is by definition the smallest among the leaf nodes, and leaf nodes always have smaller scores than their ancestors. Therefore, by removing the n−l current smallest values (iteratively taken from PQ.top) from an OS, we obtain the optimal size-l OS.
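A minimal Python sketch of Algorithm 2 follows, run on the OS of Figure 5 (the tree structure and local importance scores are reconstructed from the figure's annotations); the priority queue is a binary heap, giving the O(n log n) behavior discussed above:

```python
# Bottom-Up Pruning sketch on the OS of Figure 5 (id -> local importance).
import heapq

children = {1: [2, 3, 4, 5, 6], 2: [7, 8], 3: [9], 4: [10],
            5: [11], 6: [12], 11: [13], 12: [14]}
score = {1: 30, 2: 20, 3: 11, 4: 31, 5: 80, 6: 35, 7: 10,
         8: 15, 9: 5, 10: 13, 11: 30, 12: 55, 13: 60, 14: 40}

def bottom_up_pruning(l):
    parent = {c: p for p, cs in children.items() for c in cs}
    live_children = {v: len(children.get(v, [])) for v in score}
    alive = set(score)
    pq = [(score[v], v) for v in alive if live_children[v] == 0]  # leaves
    heapq.heapify(pq)
    while len(alive) > l:
        _, v = heapq.heappop(pq)      # leaf with the smallest local importance
        alive.discard(v)
        p = parent[v]
        live_children[p] -= 1
        if live_children[p] == 0:     # p just became a leaf: add it to the PQ
            heapq.heappush(pq, (score[p], p))
    return alive

print(sorted(bottom_up_pruning(5)))
```

For l=5 it returns nodes {1, 5, 6, 11, 13}, i.e., the suboptimal result of Figure 5(d); for l=10 it returns the size-10 OS of Figure 5(c).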
5.2 Update Top-Path-l Algorithm
We now explore a second greedy heuristic. This algorithm iteratively selects the path p_i of tuples with the largest average importance per tuple (denoted as AI(p_i)), adds p_i to the size-l OS, removes the nodes of p_i from the OS, and updates AI(p_i) for the remaining paths accordingly. The rationale of selecting the path of tuples (instead of the single tuple) with the current largest importance is
Algorithm 3 The Update Top-Path-l Algorithm
Update Top-Path-l (l, t_DS, G_DS)
1: OS Generation(t_DS, G_DS)  ▷ generates the initial size-l (i.e., complete or prelim-l) OS and annotates tuples with AI(p_i)
2: while (|size-l OS| < l) do
3:   p_i = path with max AI(p_i)
4:   add the first l − |size-l OS| nodes of p_i to the size-l OS
5:   if (|size-l OS| < l) then
6:     remove the selected path p_i from the tree
7:     for each child v of nodes in p_i do
8:       update AI(p_j) for each node t_j in the subtree rooted at v
9: return size-l OS
[Figure 6 is omitted in this text version. Its four panels show (a) the initial OS, (b) the first update, (c) the second update and (d) the final update (the size-5 OS).]
Figure 6: The Update Top-Path-l Algorithm: The size-5 OS (annotated with tuple ID, local importance and AI(p_i); selected nodes are shaded)
that since all nodes need to be connected and monotonicity may not hold, we facilitate the selection of nodes of large importance even though their ancestors may have lower importance. Algorithm 3 is a pseudocode of the heuristic and Figure 6 illustrates an example.
More precisely, this algorithm (like the Bottom-Up Pruning Algorithm) first generates the complete (or alternatively the prelim-l) OS. During the OS generation, for each tuple t_i, we also calculate the average importance per tuple AI(p_i) for the corresponding path p_i from the root to t_i. We then select the node with the largest AI(p_i) and add the corresponding path to the size-l OS. By removing the nodes of p_i from the OS, the tree becomes a forest; each child of a node in p_i is the root of a tree. Accordingly, AI(p_i) for each node t_i is updated to disregard the removed nodes of the path selected at the previous step. The process of selecting the path with the highest AI(p_i) and adding it to the size-l OS is repeated as long as fewer than l nodes have been selected so far. If fewer than |p_i| nodes are needed to complete the size-l OS, then only the top nodes of the path are added to the size-l OS (because only these nodes are connected to the current size-l OS).
Consider the example shown in Figure 6. Node 5 has AI(p_i)=55, because its path includes nodes 1 and 5, whose average Im(OS, t_i) is (30+80)/2=55. Assuming l=5, at the first loop the algorithm selects nodes 1 and 5, whose path has the largest AI(p_i), i.e., 55. Then, the nodes along the path (nodes 1 and 5) are added to the size-5 OS. For the remaining nodes, AI(p_i) is updated to disregard the removed nodes (see the top-right tree in Figure 6). For example, the new AI(p_i) for node 10 is 22, because its path now includes only nodes 4 and 10, whose average Im(OS, t_i) is 22. The next path to be selected is the one ending at node 13, which adds two more nodes to the snippet. Finally, node 6 is added to complete the size-5 OS.
The complexity of the algorithm can be as high as O(n·l), where n is the size of the complete OS, as at each step the algorithm may choose only one node, which causes the update of O(n) paths. The algorithm can be optimized if we precompute, for each node v of the tree, the node s(v) with the highest AI(p_i) in the subtree rooted at v. Regardless of any change at any ancestor of v, s(v) remains the node with the highest AI(p_i) in the subtree (because the change affects all nodes in the subtree in the same way). Thus, only a small number of comparisons is needed after each path selection to find the next path to be selected. Specifically, for each child v of nodes in the currently selected path p_i, we need to update AI(p_i) for s(v) and then compare all the s(v)'s to pick the one with the largest AI(p_i). In terms of approximation quality, this algorithm does not always return the optimal solution; e.g., the size-3 OS will have nodes 1, 5 and 11 instead of 1, 5 and 6. However, empirically, this method gives better results than Bottom-Up Pruning.
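The following Python sketch is the unoptimized O(n·l) version of Algorithm 3, run on the OS of Figure 6 (tree structure and local importances reconstructed from the figure's annotations); AI(p_i) is recomputed from scratch after every path selection rather than maintained via the s(v) pointers described above:

```python
# Update Top-Path-l sketch on the OS of Figure 6 (id -> local importance).
from statistics import mean

children = {1: [2, 3, 4, 5, 6], 2: [7, 8], 3: [9], 4: [10],
            5: [11], 6: [12], 11: [13], 12: [14]}
score = {1: 30, 2: 20, 3: 11, 4: 31, 5: 80, 6: 35, 7: 10,
         8: 15, 9: 5, 10: 13, 11: 30, 12: 12, 13: 60, 14: 40}
parent = {c: p for p, cs in children.items() for c in cs}

def update_top_path(l, root=1):
    selected = []                       # the growing size-l OS
    while len(selected) < l:
        best_ai, best_path = None, None
        for v in score:
            if v in selected:
                continue
            # path of not-yet-selected nodes from v up to its forest root
            path = [v]
            while path[-1] != root and parent[path[-1]] not in selected:
                path.append(parent[path[-1]])
            path.reverse()              # top-down order
            ai = mean(score[t] for t in path)   # AI(p_i), recomputed naively
            if best_ai is None or ai > best_ai:
                best_ai, best_path = ai, path
        # add only as many top nodes of the path as still fit
        selected.extend(best_path[:l - len(selected)])
    return selected

print(update_top_path(5))
```

For l=5 it selects the path {1, 5}, then {11, 13}, then node 6, exactly as in the example; for l=3 it returns {1, 5, 11}, the suboptimal result mentioned above.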
5.3 Top-l Prelim-l OS Preprocessing
Instead of operating on the complete OS, which may be expensive to generate and search, we propose to work on a smaller OS, which hopefully includes a good size-l OS. We denote such a preliminary partial OS as a prelim-l OS (with size j, where l ≤ j ≤ |OS|). On the prelim-l OS, we can apply any of the algorithms proposed so far (of course, DP is not guaranteed to return the optimal result, unless the prelim-l OS is guaranteed to include it). The rationale of the prelim-l OS is to avoid the extraction and processing of tuples that are not promising to make it into the optimal size-l OS. Algorithm 4 is a pseudocode for computing the prelim-l OS, Table 1 summarizes symbols and definitions, and Figure 7 illustrates an example.
Determining a prelim-l OS that is guaranteed to include the optimal size-l OS can be very expensive; therefore, we propose a heuristic which produces a prelim-l OS that includes at least the l nodes of the complete OS with the largest local importance (denoted as the top-l set). Figure 7(a) illustrates such a prelim-l OS. Using avoidance conditions and simple statistics that summarize the range of local importance of the tuples in each relation (e.g., max(R_i)), we can infer upper bounds on the local importance of tuples and thus safely predict whether a candidate path can potentially produce useful tuples.
DEFINITION 2. Given an OS and an integer l, a top-l prelim-l OS (or simply prelim-l OS) is a subset of the complete OS that includes the l tuples of the OS with the largest local importance.
We annotate each relation R_i on the G_DS graph with the statistics max(R_i) and mmax(R_i) (see Figure 2). (Recall from Section 2.1 that we generate G_DS graphs for every relation that may contain information about DSs.) max(R_i) is the maximum local importance over all tuples in R_i, which can be derived from the maximum global importance in R_i (a global statistic that is computed/updated independently of the queries) and the affinity Af(R_i). mmax(R_i) is the maximum local importance over all tuples that belong to R_i's descendant relation nodes in G_DS (i.e., max_j{max(R_j)}, where j ranges over all such relations), or 0 if R_i has no descendants (leaf node).
The algorithm for generating the prelim-l OS is an extension of the complete OS generation algorithm (e.g., Algorithm 5). The extension incorporates pruning conditions in order to avoid adding fruitless tuples and their subtrees to the prelim-l OS. More precisely,
Table 1: Symbols and Definitions (Top-l Prelim-l OSs)
top-l: the l nodes with the largest local importance in the OS.
top-l PQ: an l-sized priority queue holding the current largest local importance values of the extracted tuples.
largest-l: the tuple with the l-th largest local importance retrieved so far (i.e., the smallest value in top-l PQ), or 0 if |top-l PQ| < l.
li(t_i): the local importance of tuple t_i (i.e., Im(OS, t_i)).
R(t_i): the relation on G_DS that tuple t_i belongs to.
R_i(t_j): the subset of R_i that joins with tuple t_j.
max(R_i): the maximum local importance over the tuples of R_i.
mmax(R_i): the maximum value of max(R_j) over all of R_i's descendant nodes on G_DS, or 0 if R_i has no descendants (leaf node).
fruitless tuple: a tuple not in the top-l.
fruitless G_DS relation/sub-tree: a G_DS sub-tree starting from relation R_i is considered fruitless for a given largest-l if no tuples from R_i or its descendants can be fruitful for the top-l (i.e., when largest-l ≥ max(R_i) AND largest-l ≥ mmax(R_i)).
fruitful-l relation: a relation R_i is considered fruitful-l for a given largest-l if only up to l nodes from the corresponding R_i(t_j) can be fruitful for the top-l (i.e., when largest-l ≥ mmax(R_i)).
we traverse the G_DS graph in a breadth-first order. Every extracted tuple is appended to the prelim-l OS (lines 2 and 14) and to queue Q (to facilitate the breadth-first traversal of the G_DS; see lines 3 and 15). Let largest-l be the tuple with the l-th largest local importance retrieved so far. If the local importance of the current tuple t_i is greater than that of largest-l, t_i is added to the l-sized priority queue top-l PQ as well (in order to update the top-l set; lines 4 and 17). largest-l is set to the current smallest value of top-l PQ, or to 0 if top-l PQ does not yet contain l values (lines 20-23). We traverse the G_DS as follows. For each tuple de-queued from the queue Q (line 6), we extract all its child nodes from each corresponding child relation (lines 7-12) and we employ the following avoidance conditions:
Avoidance Condition 1 (Avoiding fruitless G_DS sub-trees): If the top-l PQ already contains l tuples and largest-l is greater than or equal to the local importance of all tuples of the current relation R_i and all its descendants (i.e., largest-l ≥ max(R_i) AND largest-l ≥ mmax(R_i)), then there is no need to traverse the sub-tree starting at R_i (line 8). In such cases, we say that the sub-tree starting from R_i is fruitless. For instance, consider the example of Figure 7; while retrieving tuple y8, largest-l=0.37 and the current child relation R_i is Conference, with max(R_i)=0.22 and mmax(R_i)=0. Thus, we can safely infer that Conference has no fruitful tuples for the particular prelim-l OS. This avoidance condition does not require any I/O operations, as all the information required can be obtained cheaply from the annotated G_DS.
Avoidance Condition 2 (Limiting tuple extractions from fruitful-l relations to l): Assume that we are about to traverse R_i in order to extract R_i(t_j): the tuples in R_i which join with the parent tuple t_j. We can limit the number of tuples returned by this join to l, if we can safely predict that none of their descendants (if any) can be fruitful for the top-l. We say that a relation R_i on the G_DS is fruitful-l for a given largest-l if we can safely predict that only up to l tuples from R_i can be fruitful for the top-l, and none of their descendants (if any); this is the case when largest-l ≥ mmax(R_i) but largest-l < max(R_i). In other words, we can safely extract only up to l tuples greater than largest-l from a fruitful-l relation; i.e., there is no need to compute the complete join. For instance, consider the example of Figure 7, where we are about to traverse the fruitful-l relation PaperCitedBy (a leaf node on the G_DS, thus a fruitful-l relation) in order to extract the joins with Paper tuple p2. Then, we can extract from the database only up to l
Algorithm 4 The Prelim-l OS Generation Algorithm
Prelim-l OS Generation (l, t_DS, G_DS)
1: largest-l = 0
2: add t_DS as the root of the prelim-l
3: enQueue(Q, t_DS)
4: enQueue(top-l PQ, t_DS)
5: while !(IsEmptyQueue(Q)) do
6:   t_j = deQueue(Q)
7:   for each child relation R_i of R(t_j) in G_DS do
8:     if !(largest-l ≥ max(R_i) AND largest-l ≥ mmax(R_i)) then  ▷ Av. Cond. 1
9:       if (largest-l ≥ mmax(R_i)) then
10:        R_i(t_j) = "SELECT TOP l * FROM R_i WHERE (t_j.ID = R_i.ID AND R_i.li > largest-l)"  ▷ Av. Cond. 2; t_j.ID and R_i.ID are the keys on which t_j and R_i join, and R_i.li is the local importance attribute of R_i
11:       else
12:        R_i(t_j) = "SELECT * FROM R_i WHERE (t_j.ID = R_i.ID)"
13:      for each tuple t_i of R_i(t_j) do
14:        add t_i to the prelim-l as a child of t_j
15:        enQueue(Q, t_i)
16:        if (li(t_i) > largest-l) then
17:          enQueue(top-l PQ, t_i)
18:          if (|top-l PQ| > l) then
19:            deQueue(top-l PQ)
20:          if (|top-l PQ| < l) then
21:            largest-l = 0
22:          else
23:            largest-l = Smallest(top-l PQ)
tuples with local importance greater than largest-l (which is 0, since |top-l PQ| < l). Similarly, when traversing the fruitful-l relation PaperCites with largest-l=0.12, we extract up to l tuples larger than largest-l. Note that the Paper relation is not fruitful-l, since largest-l=0 and mmax(R_Paper)=7.38, thus largest-l < mmax(R_Paper). As a consequence, we cannot apply this avoidance condition and hence we need to extract all tuples for Paper. Note that this condition has no impact on M:1 relationships, since the maximum cardinality of R_i(t_j) is 1 anyway.
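Both conditions reduce to simple comparisons against the annotated statistics. A sketch follows (the function names are ours, not from the paper), checked against the Figure 7 numbers; for the fruitful-l case, max(PaperCites) is taken as 0.37, the score of its tuple pc6:

```python
def is_fruitless(largest_l, max_ri, mmax_ri):
    # Avoidance Condition 1: neither R_i itself nor any descendant can
    # contribute a tuple to the top-l, so skip the whole G_DS sub-tree.
    return largest_l >= max_ri and largest_l >= mmax_ri

def is_fruitful_l(largest_l, max_ri, mmax_ri):
    # Avoidance Condition 2: R_i itself may still contribute (up to l
    # tuples), but none of its descendants can, so a TOP-l query suffices.
    return largest_l >= mmax_ri and largest_l < max_ri

# Figure 7: while retrieving y8, largest-l = 0.37 and Conference has
# max = 0.22, mmax = 0, so its sub-tree is fruitless.
print(is_fruitless(0.37, 0.22, 0.0))
# PaperCites (a leaf, so mmax = 0) with largest-l = 0.12 is fruitful-l.
print(is_fruitful_l(0.12, 0.37, 0.0))
```

Both calls print True, matching the two cases walked through in the text.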
In terms of cost, in the worst case we need up to n I/O accesses (if operating directly on the database), where n is the number of nodes in the complete OS, even if we extract only j tuples (recall that Avoidance Condition 2 still requires an I/O access even when it returns no results). In practice, however, there can be significant savings if the top-l tuples are found early and large subtrees of the complete OS are pruned. The prelim-l OS created according to Definition 2 does not necessarily contain the optimal size-l OS; e.g., the prelim-5 OS of our example does not contain the node ca16, which belongs to the optimal size-5 OS. In practice, we found that in most cases the prelim-l OS did contain the optimal solution. This means that all size-l OS computation algorithms may give the same results when applied either to the prelim-l or to the complete OS. The following lemma proves that if monotonicity holds, then the prelim-l OS certainly includes the optimal size-l OS.
LEMMA 3. When the nodes of an OS have local importance scores that monotonically decrease with their distance from the root, then the prelim-l OS contains the optimal size-l OS.
PROOF. When monotonicity holds, the optimal size-l OS is the top-l set (as shown by Lemma 2). Therefore, the prelim-l OS produced by this algorithm contains the top-l set and hence the optimal size-l OS.
Finally, we note that we have also investigated a variant of the prelim-l OS which includes the largest top-path-l nodes (rather than the top-l), namely the l tuples with the largest AI(p_i). However, this approach did not result in better time or approximation quality, so we do not discuss it further.
[Figure 7 is omitted in this text version. Panel (a) shows the complete OS, the prelim-l OS and the top-l set: nodes with low transparency are pruned tuples (e.g., pc7, ca10), shaded nodes are the top-l set (e.g., a1, pc6) and the rest are the remaining tuples of the prelim-l OS (e.g., p2, p3). Panel (b) tabulates the values of t_j, R_i, R_i(t_j), Q, top-l PQ and largest-l during the prelim-l OS generation.]
Figure 7: The Prelim-l OS Generation Algorithm (l=5, t_DS=a1, and G_DS=G_Author)
6. EXPERIMENTAL EVALUATION
In this section, we experimentally evaluate the proposed size-l OS concept and algorithms. We evaluate our algorithms using both complete and prelim-l OSs. First, the effectiveness of the proposed size-l OSs is thoroughly investigated with the help of human evaluators. Then, the quality of the size-l OSs produced by the greedy heuristics is compared to that of the corresponding optimal OSs. Finally, the efficiency of the algorithms is comparatively investigated. We used two databases: DBLP (http://www.informatik.uni-trier.de/ley/db/) and TPC-H (http://www.tpc.org/tpch/); we used scale factor 1 in generating the TPC-H dataset. The two databases have 2,959,511 and 8,661,245 tuples, occupying 319.4MB and 1.1GB on disk, respectively.
We use ObjectRank (global) [3] and ValueRank [9] to calculate the global importance of the tuples of the DBLP and TPC-H databases, respectively. For a more thorough evaluation, we investigate scores produced by various settings that have been studied in [3], namely two GAs: (1) GA1 (default), presented in Figure 13, and (2) GA2, which for DBLP has common transfer rates (0.3) for all edges and for TPC-H neglects values (i.e., becomes an ObjectRank GA); and three values of d: d1=0.85 (default), d2=0.10 and d3=0.99. We use Equation 1 to calculate affinity (alternatively, an expert can define G_DS graphs and affinity manually, i.e., select which relations to include in each G_DS and their affinity). For the experiments, we used Java, MySQL, cold caches and a PC with an AMD Phenom 9650 2.3GHz (Quad-Core) processor and 4GB of memory.
6.1 Effectiveness
We used human evaluators to measure effectiveness. First, we familiarized them with the concepts of OSs in general and size-l OSs in particular. Specifically, we explained that a good size-l OS should be a stand-alone and meaningful synopsis of the most important information about the particular DS. Then, we provided them with OSs and asked them to size-l them for l = 5, 10, 15, 20, 25, 30. None of our evaluators were involved in this paper. Figure 8 measures the effectiveness of our approach as the average percentage of tuples that exist both in the evaluators' size-l OSs and in the size-l OSs computed by our methods. This measure corresponds to recall and precision at the same time, as the compared OSs have a common size.
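Since the evaluator's summary and the computed summary have the same size l, recall and precision coincide and the metric is simply the fraction of shared tuples; a trivial sketch (the tuple IDs below are hypothetical):

```python
def effectiveness(user_os, computed_os):
    # |A ∩ B| / l ; recall == precision because |A| == |B| == l
    assert len(user_os) == len(computed_os)
    return len(set(user_os) & set(computed_os)) / len(user_os)

# e.g., if 4 of the 5 tuples agree, effectiveness is 80%
print(effectiveness({"p1", "p2", "p3", "p4", "p5"},
                    {"p1", "p2", "p3", "p4", "p9"}))
```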
DBLP. Since the DBLP database includes data about real people and their papers, we asked the DSs themselves (i.e., eleven authors listed in DBLP) to suggest their own Author and Paper size-l OSs. The rationale of this evaluation is that the DSs themselves have the best knowledge of their work and can therefore provide accurate summaries. Figures 8(a) and (b) plot the recall of the optimal size-l OS for various ObjectRank settings. In general, ObjectRank scores produced with GA1-d1 and GA1-d3 are good options for Author and Paper size-l OS generation (as these settings produce similar ObjectRank scores) and always dominate for larger values of l. More precisely, for GA1-d1, effectiveness ranges from 75% to 90% for l=10 to 30, and from 40% to 60% for l=5. These results are very encouraging. The user evaluation also revealed that the inter-relational ranking properties (e.g., whether paper p1 is more important than author a1) crucially affect the quality of the size-l OSs. For instance, on Author OSs, evaluators first selected important Paper tuples to include in the size-l OS and then additional tuples such as co-authors, years and conferences (these were usually included in summaries of larger sizes, i.e., l ≥ 10). The bias towards selecting Papers (i.e., 1st-level neighbors) is favored by setting GA1-d2, although overall this setting was not very effective; e.g., in Figure 8(a), this setting achieves 73.3% (in comparison to 60% for GA1-d1) for l=5.
The impact of the approximate size-l OSs produced by our greedy algorithms on effectiveness is very minor. For instance, using scores produced by the default setting (i.e., GA1 and d1=0.85) on the Author G_DS, the Update Top-Path-l algorithm generates summaries of the same effectiveness as the optimal ones, whereas Bottom-Up has a very minor additional loss, ranging from 2% to 10%. On the Paper G_DS, all approaches give the same effectiveness, as they all return the optimal size-l OSs. The use of prelim-l OSs had no impact on effectiveness. As we show later, prelim-l OSs have a very minor impact on approximation quality, which did not affect effectiveness.
TPC-H. We presented 16 random OSs to eight evaluators and asked them to size-l them. The evaluators were professors and researchers from Manchester and Hong Kong universities. In addition, for each OS and tuple, a set of descriptive details and statistics was also provided. For instance, for a customer, the total number, size and value of orders and the corresponding minimum, median and maximum values over all customers were provided (similarly to the evaluation in [9]). The provision of such details gave the evaluators better knowledge of the database.
In summary, GA1 (for any d) is a safe option, as it produces good size-l OSs for both Customer and Supplier OSs (Figures 8(c) and 8(d)); e.g., effectiveness results for GA1-d1 range from 60% to 78%. On the other hand, GA2, which is the ObjectRank version of GA1, did not satisfy the evaluators as much on Supplier OSs.
[Figure 8 is omitted in this text version. Its four panels plot effectiveness (%) against l ∈ {5, 10, 15, 20, 25, 30} for the settings GA1-d1, GA1-d2, GA1-d3 and GA2-d1: (a) DBLP Author (Optimal size-l OS), (b) DBLP Paper (Optimal size-l OS), (c) TPC-H Customer (Optimal size-l OS), (d) TPC-H Supplier (Optimal size-l OS).]
Figure 8: Effectiveness (i.e. Recall=Precision)
Interestingly, we observe that the effectiveness results for size 5 were very good on both OSs due to good inter-relational ranking.
Comparative Evaluation. We compared our results with Google Desktop (a text document search engine). We store each OS as an HTML file and then issue the corresponding query using Google Desktop in order to obtain its snippet. Google snippets contain a small number of words from the beginning of the file, combining static text such as "Search for Christos Faloutsos in the DBLP Database" and the first few tuples (up to three) from the OS (note that the order of nodes in an OS is random). We make a less austere comparison by counting the selected tuples that belong to the corresponding size-5 OS proposed by our evaluators (since Google snippets contain only up to three tuples). As expected, in all cases Google snippets found zero or, exceptionally, one tuple from the corresponding size-5 OS. Detailed results are not shown due to space constraints.
6.2 Approximation Quality
We now compare the importance of the size-l OSs produced by the greedy methods against that of the optimal ones. More precisely, the results of Figure 9 represent the approximation quality, namely the ratio of the achieved size-l OS importance to the optimal importance. We present the average results for 10 random OSs per G_DS. The average size (i.e., the number of tuples) of the OSs is also indicated (denoted as Aver(|OS|)).
Figures 9(a)-(e) show the approximation quality produced by the default settings (i.e., GA1 and d1=0.85). The results show that Update Top-Path-l is always better than the Bottom-Up Pruning algorithm. In general, the superiority of Update Top-Path-l over Bottom-Up Pruning is up to 10% (excluding Paper OSs, where all methods achieved 100%). The evaluation also reveals that top-l prelim-l OSs incur very low approximation quality loss. They have no impact on the Bottom-Up algorithm and only up to 4% on the Update Top-Path-l algorithm. Another observation is that the contents of the G_DS and the values of the local importance scores also have a significant impact. For instance, for Paper OSs all methods achieved 100% quality. This is because the monotonicity property
[Figure 9 is omitted in this text version. Its panels plot approximation quality (%) against l ∈ {5, ..., 50} for Bottom-Up and Update Top-Path-l, each run on the complete OS and on the prelim-l OS: (a) DBLP Author (Aver|OS|=1116), (b) DBLP Paper (Aver|OS|=367), (c) TPC-H Customer (Aver|OS|=176), (d) TPC-H Supplier (Aver|OS|=1341), (e) DBLP Author (|OS|=67), (f) DBLP Author (Aver|OS|=1116) under the various settings that produced the global importance.]
Figure 9: Approximation Quality
holds (Lemma 2); the Paper GDS is Paper (Author, PaperCitedBy, PaperCites, Year (Conference)) and the local importance of Conferences is always smaller than that of the corresponding Years. In general, the inter-relational and intra-relational ranking of tuples has an impact as well. For instance, Figure 9(f) summarizes the average approximation quality for Author OSs with global importance scores produced by the various settings (where inter- and intra-relational scores vary). The experimental results also reveal that the smaller the OS is in comparison to l, the more accurate our algorithms are. For example, the particular Author OS of Figure 9(e) with |OS|=67 yields 100% approximation quality from all algorithms by l=25.
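The approximation quality reported above is the ratio between the total importance of the tuples a greedy algorithm selects and that of the optimal size-l OS. A minimal sketch of this metric (the function names and the example scores are our own illustration, not taken from the paper's implementation):

```python
def total_importance(snippet):
    """Sum the global importance scores Im(ti) of the tuples in a size-l snippet."""
    return sum(score for _, score in snippet)

def approximation_quality(greedy_snippet, optimal_snippet):
    """Quality of a greedy size-l OS relative to the optimal one, as a percentage."""
    return 100.0 * total_importance(greedy_snippet) / total_importance(optimal_snippet)

# Hypothetical size-3 snippets: (tuple id, global importance) pairs.
optimal = [("t1", 0.9), ("t2", 0.8), ("t3", 0.7)]
greedy  = [("t1", 0.9), ("t2", 0.8), ("t4", 0.6)]
print(round(approximation_quality(greedy, optimal), 1))  # 95.8
```

A quality of 100% means the greedy snippet matches the optimal importance total, as happens for all Paper OSs above.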
6.3 Efficiency
We compare the run-time performance of our algorithms in Figure 10. We used the same OSs as in Section 6.2 (i.e. the same 10 OSs per GDS) and the default settings for generating the global importance of tuples (alternative settings do not have any impact on the performance). Figures 10(a)-(e) show the costs of our algorithms for computing size-l OSs from OSs of various sizes and different l values, excluding the time required to generate the OS on which each algorithm operates. Figures 10(a)-(d) show the costs of OSs from various GDSs, while Figure 10(e) shows scalability for Author OSs of different sizes and common l=10 (analogous results were obtained from all GDSs, but we omit them due to space limitations). Note that the y-axes (time) in all graphs are split into two parts, one linear (bottom) and one exponential (top), in order to show how the expensive DP scales and at the same time keep the differences between the other methods visible.
Figure 10: Efficiency. Panels (a) DBLP Author (Aver|OS|=1116), (b) DBLP Paper (Aver|OS|=376), (c) TPC-H Customer (Aver|OS|=176) and (d) TPC-H Supplier (Aver|OS|=1341) plot time (s) against l (5-50); panel (e) DBLP Author (size-10 OS) plots time against |OS| (67, 202, 606, 922, 1309). Each compares Bottom-Up, Update Top-path-l and Optimal on both the complete OS and the prelim-l OS. Panel (f) TPC-H Supplier (Aver|OS|=1341) breaks down the cost for l=10 (|Prelim-l OS|=134) and l=50 (|Prelim-l OS|=259); the Optimal method on the complete OS exceeds 30 min.
As expected, the OS size and l significantly affect the cost (the bigger the OS or l is, the more time is required). The cost of DP becomes unbearable for moderate to large OSs and values of l (we had to stop the algorithm after 30 min. of running). Bottom-Up Pruning is consistently faster than Update Top-Path-l, as it requires fewer operations. An interesting observation is that Bottom-Up Pruning on the complete OS becomes faster as l increases, because n-l drops and fewer de-heaping operations are needed.
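The de-heaping observation can be illustrated with a plain top-l selection over tuple scores: a bounded min-heap of size l evicts at most n-l elements, so larger l means fewer evictions. This standalone sketch is not the paper's Bottom-Up Pruning algorithm itself, which must additionally preserve the tree structure of the OS:

```python
import heapq

def top_l_scores(scores, l):
    """Keep the l largest scores using a min-heap of size l. The heap root is
    always the smallest retained score; each heapreplace evicts one of the
    n-l losers, so evictions happen at most n-l times."""
    heap = []
    for s in scores:
        if len(heap) < l:
            heapq.heappush(heap, s)
        elif s > heap[0]:
            heapq.heapreplace(heap, s)
    return sorted(heap, reverse=True)

print(top_l_scores([0.3, 0.9, 0.1, 0.7, 0.5], 3))  # [0.9, 0.7, 0.5]
```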
Figure 10(f) breaks down the cost into OS generation (bottom of the bar) and size-l computation (top of the bar) for each method. We investigated two approaches for generating the OSs; the first employs an in-memory data-graph and the second computes the OS directly from the database. The OSs are generated much faster using the data-graph; thus, we present only the data-graph based results in Figure 10(f). For example, generating the Supplier OSs (which have the largest sizes among all tested OSs) requires only 0.2 sec. using the data-graph, compared to 12.9 sec. directly from the database. The DBLP and TPC-H data-graphs take only 17 sec. and 128 sec. to generate and occupy 150MB and 500MB, respectively. More precisely, our data-graph nodes correspond to the database tuples and its edges to tuple relationships (through their primary and foreign keys). Note that the data-graph is only an index and does not contain actual data, as nodes capture only keys and global importance. Figure 10(f) also shows the average sizes of the complete OSs (1,341) and the prelim-l OSs (134 and 259 for l = 10 and l = 50, respectively). The prelim-l OS generation is always faster than that of the complete OS; for instance, the prelim-5 OS's size is approximately 10% of the size of the complete OS and its generation can be up to 2.5 times faster (the savings are not proportional, because there can be many accesses to fruitless relations during the prelim-l OS generation; i.e. Avoidance Condition 2 still requires access to relations even when it returns no results); thus, prelim-l OSs further reduce the time required by our algorithms. Bottom-Up Pruning becomes on average up to 5.7 times faster, whereas Update Top-Path-l becomes up to 4.1 times faster. Note that the size of the database does not impact the OS generation time, because hash-maps are used to look up the required nodes of an OS; we omit experimental results due to space constraints.
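The data-graph index described above can be sketched as follows; node payloads hold only a tuple's key and its global importance, and adjacency lists record primary/foreign-key links. The class and method names are our own illustration, not the authors' implementation:

```python
from collections import defaultdict

class DataGraph:
    """In-memory index over the database: nodes are keyed by (relation, pk)
    and store only the global importance score; adjacency lists capture
    primary/foreign-key relationships between tuples."""
    def __init__(self):
        self.importance = {}               # (relation, pk) -> global importance
        self.children = defaultdict(list)  # (relation, pk) -> linked (relation, pk)

    def add_tuple(self, relation, pk, importance):
        self.importance[(relation, pk)] = importance

    def add_link(self, parent, child):
        self.children[parent].append(child)

# Hypothetical DBLP fragment: one Author node linked to two Paper nodes.
g = DataGraph()
g.add_tuple("Author", 1, 0.92)
g.add_tuple("Paper", 10, 0.40)
g.add_tuple("Paper", 11, 0.35)
g.add_link(("Author", 1), ("Paper", 10))
g.add_link(("Author", 1), ("Paper", 11))
print(len(g.children[("Author", 1)]))  # 2
```

Because lookups are hash-map based, the cost of expanding an OS node depends on its degree rather than on the overall database size, matching the observation above.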
Discussion. In summary, the DP algorithm is not practical on large OSs and l's, whereas our greedy algorithms are very fast and, as we showed in Section 6.2, their results are of high approximation quality. Note that, in this paper, our main focus has been on optimizing the size-l OS generation, not the OS generation cost (which we leave for further investigation as future work). In addition, the use of prelim-l OSs is consistently a better choice over the complete OSs, since they are always faster with a very minor quality loss. If we need to find the size-l OS at high speed, then Bottom-Up Pruning is a good choice, since it is consistently the fastest method (e.g. using the prelim-50 for Supplier costs 0.12+0.12=0.24 sec.). If the OS had to be generated from the database, then the Update Top-Path-l algorithm is preferable, as it gives better quality and is only marginally more expensive (e.g. 8.08+0.32 sec.).
7. CONCLUSION AND FUTURE WORK
We investigated the effective and efficient generation of size-l OSs. First, we gave a formal definition of the size-l OS, which targets the synoptic and stand-alone presentation of a large OS. We proposed a dynamic programming algorithm and two efficient greedy heuristics for producing size-l OSs. In addition, we proposed a preprocessing strategy that avoids generating the complete OS before producing size-l OSs. A systematic experimental evaluation conducted on the DBLP and TPC-H databases verifies the effectiveness, approximation quality and efficiency of our techniques. A direction of future work concerns the further exploration of algorithms using hashing and reachability indexing techniques [18]. Another challenging problem is the combined size-l and top-k ranking of OSs. In addition, the selection of an appropriate value for l is an interesting problem; a natural approach is to select l based on the number of attributes or words it will yield, e.g. 20 attributes or 50 words. However, this approach results in a reformulation of the problem, which we plan to investigate. Finally, it is observed that, in the general case, optimal size-l OSs for different l could be very different. This prevents the incremental computation of a size-l OS from the optimal size-(l-1) OS, limiting pre-computation or caching approaches that could accelerate computation. In the future, we plan to experimentally analyze the space of optimal size-l OSs and identify potential similarities among them that could assist their pre-computation and compression.
8. ACKNOWLEDGEMENTS
We would like to thank Vagelis Hristidis for providing us with his ObjectRank code and DBLP database and our evaluators for
their generous help and comments (in particular, Dimitris Papadias, George Samaras and Christos Schizas). Finally, we thank the anonymous reviewers for their constructive comments.
9. REFERENCES
[1] Data Protection Act, 1998. http://en.wikipedia.org/wiki/Data_Protection_Act_1998.
[2] B. Aditya, G. Bhalotia, S. Chakrabarti, A. Hulgeri, C. Nakhe, Parag, and S. Sudarshan. BANKS: Browsing and keyword searching in relational databases. In VLDB, pages 1083–1086, 2002.
[3] A. Balmin, V. Hristidis, and Y. Papakonstantinou.
Objectrank: Authority-based keyword search in databases. In
VLDB, pages 564–575, 2004.
[4] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and
S. Sudarshan. Keyword searching and browsing in databases
using BANKS. In ICDE, pages 431–440, 2002.
[5] S. Brin and L. Page. The anatomy of a large-scale
hypertextual web search engine. In WWW Conference, pages
107–117, 1998.
[6] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation
algorithms for middleware. In PODS, pages 102–113, 2001.
[7] G. J. Fakas. Automated generation of object summaries from
relational databases: A novel keyword searching paradigm.
In DBRank’08, ICDE, pages 564–567, 2008.
[8] G. J. Fakas. A novel keyword search paradigm in relational
databases: Object summaries. Data Knowl. Eng.,
70(2):208–229, 2011.
[9] G. J. Fakas and Z. Cai. Ranking of object summaries. In
DBRank’09, ICDE, pages 1580–1583, 2009.
[10] G. J. Fakas, B. Cawley, and Z. Cai. Automated generation of
personal data reports from relational databases. JIKM,
10(2):193–208, 2011.
[11] L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram.
XRANK: Ranked keyword search over XML documents. In
SIGMOD, pages 16–27, 2003.
[12] V. Hristidis, L. Gravano, and Y. Papakonstantinou. Efficient
ir-style keyword search over relational databases. In VLDB,
pages 850–861, 2003.
[13] V. Hristidis and Y. Papakonstantinou. Discover: Keyword
search in relational databases. In VLDB, pages 670–681,
2002.
[14] Y. Huang, Z. Liu, and Y. Chen. Query biased snippet generation in XML search. In SIGMOD, pages 315–326, 2008.
[15] G. Koutrika, A. Simitsis, and Y. Ioannidis. Précis: The essence of a query answer. In ICDE, pages 69–79, 2006.
[16] F. Liu, C. Yu, W. Meng, and A. Chowdhury. Effective
keyword search in relational databases. In SIGMOD, pages
563–574, 2006.
[17] Y. Luo, X. Lin, W. Wang, and X. Zhou. SPARK: Top-k
keyword query in relational databases. In SIGMOD, pages
115–126, 2007.
[18] A. Markowetz, Y. Yang, and D. Papadias. Reachability
indexes for relational keyword search. In ICDE, pages
1163–1166, 2009.
[19] A. Simitsis, G. Koutrika, and Y. Ioannidis. Précis: From unstructured keywords as queries to structured databases as answers. The VLDB Journal, 17(1):117–149, 2008.
[20] A. Tombros and M. Sanderson. Advantages of query biased
summaries in information retrieval. In SIGIR, pages 2–10,
1998.
[21] A. Turpin, Y. Tsegay, D. Hawking, and H. E. Williams. Fast
generation of result snippets in web search. In SIGIR, pages
127–134, 2007.
[22] R. Varadarajan, V. Hristidis, and L. Raschid. Explaining and
reformulating authority flow queries. In ICDE, pages
883–892, 2008.
[23] B. Yu, G. Li, K. Sollins, and A. K. H. Tung. Effective
keyword-based selection of relational databases. In
SIGMOD, pages 139–150, 2007.
APPENDIX
Algorithm 5 The OS Generation Algorithm
OS-Generation(tDS, GDS)
1:  add tDS as the root of the OS
2:  enQueue(Q, tDS)    // Queue Q facilitates breadth-first traversal
3:  while !(isEmptyQueue(Q)) do
4:    tj = deQueue(Q)
5:    for each child relation Ri of R(tj) in GDS do
6:      Ri(tj) = "SELECT * FROM Ri WHERE (tj.ID = Ri.ID)"
7:      for each tuple ti of Ri(tj) do
8:        add ti to the OS as a child of tj
9:        enQueue(Q, ti)
10: return OS
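Algorithm 5 can be paraphrased in Python as a plain BFS; here the SQL join of line 6 is stood in for by a `fetch_tuples` callback over an in-memory dict, and the names below are our own illustration:

```python
from collections import deque

def generate_os(t_ds, gds_children, fetch_tuples):
    """Breadth-first OS generation mirroring Algorithm 5.
    t_ds: the root data-subject tuple; gds_children maps a relation name to
    its child relations in the GDS; fetch_tuples(Ri, tj) stands in for the
    SQL query selecting the tuples of Ri that join with tj."""
    os_tree = {t_ds: []}       # parent tuple -> list of child tuples
    queue = deque([t_ds])
    while queue:
        tj = queue.popleft()
        for ri in gds_children.get(tj[0], []):   # child relations of R(tj)
            for ti in fetch_tuples(ri, tj):      # tuples of Ri joining tj
                os_tree[tj].append(ti)
                os_tree.setdefault(ti, [])
                queue.append(ti)
    return os_tree

# Toy GDS fragment: Author -> Paper -> Year; tuples are (relation, id) pairs.
gds = {"Author": ["Paper"], "Paper": ["Year"]}
data = {("Paper", ("Author", 1)): [("Paper", 10)],
        ("Year", ("Paper", 10)): [("Year", 2003)]}
tree = generate_os(("Author", 1), gds, lambda r, t: data.get((r, t), []))
print(len(tree))  # 3
```

The prelim-l variant discussed in Section 6 would prune this traversal early using the avoidance conditions, rather than expanding every child relation.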
Figure 11: The TPC-H Database Schema (relations Region, Nation, Supplier, Customer, Orders, Lineitem, Partsupp and Parts)
Figure 12: The TPC-H Customers GDS (Annotated with (Affinity), max(Ri) and mmax(Ri)). Node annotations: Customer (1) 1.65, 5.39; Nation (0.97) 3.12, 1.82; Region (0.91) 1.82, 0; Order (0.95) 2.43, 5.39; Lineitem (0.87) 1.19, 5.39; Partsupp (0.77) 5.39, 0; the remaining Supplier (0.52, 0.65), Partsupp (0.43), Lineitem (0.34) and Parts (0.36, 0.65) nodes have scores 0, 0.
Figure 13: The GAs for the DBLP and TPC-H Databases. (a) The DBLP GA: edges among Paper, Author, Year and Conference annotated with authority transfer rates (e.g. Paper cites 0.7, cited 0; Paper-Author 0.3, 0.1; Year 0.2, 0.2; Conference 0.3, 0.3). (b) The TPC-H GA: edges among the TPC-H relations annotated with transfer rates (mostly 0.1-0.3) and value-based components such as Si=0.2*f(SupplyCost) and 0.5*f(SupplyCost) for Partsupp, Si=0.5*f(TotalPrice) for Orders, Si=0.1*f(ExtendedPrice) for Lineitem and Si=0.1*f(RetailPrice) for Parts.
240
... The efficient generation of DSize-l or PSize-l OSs is a challenging problem since information about the repetitions and frequencies of nodes is required and incremental computation is not possible (as opposed to the original size-l OS computation problem 7 ). ...
... ; R i ðn jR i j Þ after processing n j . At the beginning, UBFr(n j ; R i ) can be very loose, so we compare it with mFrðn DS Þ, to keep the minimum of the two (lines [6][7][8][9][10][11]. ...
... Let t be the current smallest value of the top w l set (or 0 if top w l does not contain l values yet). If the current tuple n i is greater than t (line 3, function AddNode) and if n i is diverse to all top w l nodes then is added to the PPrelim-l and the l-sized priority queue W l which manages the top w l set (AddNode(), lines [4][5][6][7][8][9][10][11]. For instance, in Fig. 6 the shaded nodes comprise the final top w l set for the given OS. ...
Article
The abundance and ubiquity of graphs (e.g., semantic knowledge graphs, such as Google’s knowledge graph, DBpedia; online social networks such as Google+, Facebook; bibliographic graphs such as DBLP, etc.) necessitates the effective and efficient search over them. Thus, we propose a novel keyword search paradigm, where the result of a search is an Object Summary (OS). More precisely, given a set of keywords that can identify a Data Subject (DS), our paradigm produces a set of OSs as results. An OS is a tree structure rooted at the DS node (i.e., a node containing the keywords) with surrounding nodes that summarize all data held on the graph about the DS. An OS can potentially be very large in size and therefore unfriendly for users who wish to view synoptic information about the data subject. Thus, we investigate the effective and efficient retrieval of concise and informative OS snippets (denoted as size-l OSs). A size-l OS is a partial OS containing l nodes such that the summation of their importance scores results in the maximum possible total score. However, the set of nodes that maximize the total importance score may result in an uninformative size-l OSs, as very important nodes may be repeated in it, dominating other representative information. In view of this limitation, we investigate the effective and efficient generation of two novel types of OS snippets, i.e., diverse and proportional size-l OSs, denoted as DSize-l and PSize-l OSs. Namely, besides the importance of each node, we also consider its pairwise relevance (similarity) to the other nodes in the OS and the snippet. We conduct an extensive evaluation on two real graphs (DBLP and Google+). We verify effectiveness by collecting user feedback, e.g., by asking DBLP authors (i.e., the DSs themselves) to evaluate our results. In addition, we verify the efficiency of our algorithms and evaluate quality of the snippets that they produce.
... In view of this limitation, keyword search paradigms facilitate retrieval using only keywords [16][17][18][19][20][21][22][23]26,40,46,50]. Given a query that consists of a set of keywords, an answer is a subgraph of the RDF graph. ...
Article
Full-text available
The abundance and ubiquity of RDF data (such as DBpedia and YAGO2) necessitate their effective and efficient retrieval. For this purpose, keyword search paradigms liberate users from understanding the RDF schema and the SPARQL query language. Popular RDF knowledge bases (e.g., YAGO2) also include spatial semantics that enable location-based search. In an earlier location-based keyword search paradigm, the user inputs a set of keywords, a query location, and a number of RDF spatial entities to be retrieved. The output entities should be geographically close to the query location and relevant to the query keywords. However, the results can be similar to each other, compromising query effectiveness. In view of this limitation, we integrate textual and spatial diversification into RDF spatial keyword search, facilitating the retrieval of entities with diverse characteristics and directions with respect to the query location. Since finding the optimal set of query results is NP-hard, we propose two approximate algorithms with guaranteed quality. Extensive empirical studies on two real datasets show that the algorithms only add insignificant overhead compared to non-diversified search, while returning results of high quality in practice (which is verified by a user evaluation study we conducted).
... The link analysis algorithms have been widely used in keyword search (i.e. [2], [8], [9], [12]). As one of the most classical link analysis algorithms, the P ageRank algorithm calculates the P ageRank value of each webpage and ranks This work is licensed under a Creative Commons Attribution 4.0 License. ...
Article
Full-text available
The optimal path planning is one of the hot spots in the research of intelligence transportation and geographic information systems. There are many productions and applications in path planning and navigation, however due to the complexity of urban road networks, the difficulty of the traffic prediction increases. The optimal path means not only the shortest distance in geography, but also the shortest time, the lowest cost, the maximum road capacity, etc. In fast-paced modern cities, people tend to reach the destination with the shortest time. The corresponding paths are considered as the optimal paths. However, due to the high data sensing speed of GPS devices, it is different to collect or describe real traffic flows. To address this problem, we propose an innovative path planning method in this paper. Specially, we first introduce a crossroad link analysis algorithm to calculate the real-time traffic conditions of crossroads (i.e. the CrossRank values). Then, we adopt a CrossRank value based A-Star for the path planning by considering the real-time traffic conditions. To avoid the high volume update of CrossRank values, a R-Tree structure is proposed to dynamically update local CrossRank values from the multi-level subareas. In the optimization process, to achieve desired navigation results, we establish the traffic congestion coefficient to reflect different traffic congestion conditions. To verify the effectiveness of the proposed method, we use the actual traffic data of Beijing. The experimental results show that our method is able to generate the appropriate path plan in the peak and low dynamic traffic conditions as compared to online applications.
... But it only focuses on the influence of the users in the social networks. Fakas et al. in [6,7,8,9] study the size-l object summaries to answer users' keyword queries in relational databases. But these works didn't consider the structure information in evaluating the importance of keyword search results. ...
Conference Paper
Full-text available
The rapid growth of information networks provides a significant opportunity for people to learn the world and find useful information for decision making. To find influential topics in a given context, instead of searching widely over the whole information network, normally it is wise to find the related communities first and then identify the influential topics in those communities. In this demonstration, we present a novel framework to compute the correlated sub-networks from a large information network such as CiteSeerX based on a user's keyword query, and to extract the influential topics from each correlated network. To help users understand the influential topics as a whole, we utilize a word cloud to represent the discovered topics for each correlated network. As such, multiple word clouds can be generated for different correlated networks, by which users can easily pick up their interested ones by reading the visualized topic descriptions over word clouds. To determine the sizes of different terms in a word cloud, we introduce a scoring scheme for assessing the influence of these terms in the corresponding networks. We demonstrate the functionality of our influential topic system, called iTopic, using the CiteSeerX information network data.
Article
Triple-structured open data creates value in many ways. However, the reuse of datasets is still challenging. Users feel difficult to assess the usefulness of a large dataset containing thousands or millions of triples. To satisfy the needs, existing abstractive methods produce a concise high-level abstraction of data. Complementary to that, we adopt the extractive strategy and aim to select the optimum small subset of data from a dataset as a snippet to compactly illustrate the content of the dataset. This has been formulated as a combinatorial optimization problem in our previous work. In this article, we design a new algorithm for the problem, which is an order of magnitude faster than the previous one but has the same approximation ratio. We also develop an anytime algorithm that can generate empirically better solutions using additional time. To suit datasets that are partially accessible via online query services (e.g., SPARQL endpoints for RDF data), we adapt our algorithms to trade off quality of snippet for feasibility and efficiency in the Web environment. We carry out extensive experiments based on real RDF datasets and SPARQL endpoints for evaluating quality and running time. The results demonstrate the effectiveness and practicality of our proposed algorithms.
Article
Full-text available
Recently, there has been a significant growth in keyword search due to its wide range of use-cases in people's everyday life. While keyword search has been applied to different kinds of data, ambiguity always exists no matter what data the query is being asked from. Generally, when users submit a query they need an exact answer that perfectly meets their needs, rather than wondering about different possible answers that are retrieved by the system. To achieve this, search systems need a Disambiguation functionality that can efficiently filter out and rank all possible answers to a query, before showing them to user. In this paper, we are going to describe how we are improving state of the art in various stages of a keyword-search pipeline in order to retrieve the answers that best match the user's intent.
Conference Paper
We study the problem of Query Reverse Engineering (QRE), where given a database and an output table, the task is to find a simple project-join SQL query that generates that table when applied on the database. This problem is known for its efficiency challenge due to mainly two reasons. First, the problem has a very large search space and its various variants are known to be NP-hard. Second, executing even a single candidate SQL query can be very computationally expensive. In this work we propose a novel approach for solving the QRE problem efficiently. Our solution outperforms the existing state of the art by 2-3 orders of magnitude for complex queries, resolving those queries in seconds rather than days, thus making our approach more practical in real-life settings.
Conference Paper
Entity linking connects the Web of documents with knowledge bases. It is the task of linking an entity mention in text to its corresponding entity in a knowledge base. Whereas a large body of work has been devoted to automatically generating candidate entities, or ranking and choosing from them, manual efforts are still needed, e.g., for defining gold-standard links for evaluating automatic approaches, and for improving the quality of links in crowdsourcing approaches. However, structured descriptions of entities in knowledge bases are sometimes very long. To avoid overloading human users with too much information and help them more efficiently choose an entity from candidates, we aim to substitute entire entity descriptions with compact, equally effective structured summaries that are automatically generated. To achieve it, our approach analyzes entity descriptions in the knowledge base and the context of entity mention from multiple perspectives, including characterizing and differentiating power, information overlap, and relevance to context. Extrinsic evaluation (where human users carry out entity linking tasks) and intrinsic evaluation (where human users rate summaries) demonstrate that summaries generated by our approach help human users carry out entity linking tasks more efficiently (22-23% faster), without significantly affecting the quality of links obtained; and our approach outperforms existing approaches to summarizing entity descriptions.
Conference Paper
Full-text available
A previously proposed keyword search paradigm produces, as a query result, a ranked list of object summaries (OSs); each OS summarizes all data held in a relational database about a particular data subject (DS). This paper further investigates the ranking of OSs and their tuples as to facilitate (1) the top-k ranking of OSs and also (2) the generation of partial size-l OSs (i.e. comprised of the l most important tuples). Therefore, a global Importance score for each tuple of the database (denoted as Im(t<sub>i</sub>)) is investigated and quantified. For this purpose, ValueRank (an extension of ObjectRank) is introduced which facilitates the estimation of scores for arbitrary databases (in contrast to PageRank-style techniques that are only effective on bibliographic databases). In addition, a variation of Combined functions are investigated for assigning an Importance score to an OS (denoted as Im(OS)) and a local Importance score of their tuples (denoted as Im(OS, t<sub>i</sub>)). Preliminary experimental evaluation on DBLP and Northwind Databases is presented.
Conference Paper
Full-text available
www.dcs.gla.ac.uk/-tombrosa / www-ciir.cs.umass.edu/-sanderso/ Abstract This paper presents an investigation into the utility of document summarisation in the context of information retrieval, more specifically in the application of so called query biased (or user directed) summaries: summaries customised to reflect the information need expressed in a query. Employed in the retrieved document list displayed after a retrieval took place, the summaries ’ utility was evaluated in a task-based environment by measuring users ’ speed and accuracy in identifying relevant documents. This was compared to the performance achieved when users were presented with the more typical output of an IR system: a static predefined summary composed of the title and first few sentences of retrieved documents. The results from the evaluation indicate that the use of query biased summaries significantly improves both the accuracy and speed of user relevance judgements. 1
Conference Paper
Full-text available
The presentation of query biased document snippets as part of results pages presented by search engines has become an expectation of search engine users. In this paper we explore the algorithms and data structures required as part of a search engine to allow efficient generation of query biased snippets. We begin by proposing and analysing a document compression method that reduces snippet generation time by 58% over a baseline using the zlib compression library. These experiments reveal that finding documents on secondary storage dominates the total cost of generating snippets, and so caching documents in RAM is essential for a fast snippet generation process. Using simulation, we examine snippet generation performance for different size RAM caches. Finally we propose and analyse document reordering and compaction, revealing a scheme that increases the number of document cache hits with only a marginal affect on snippet quality. This scheme effectively doubles the number of documents that can fit in a fixed size cache.
Conference Paper
Full-text available
The wide popularity of free-and-easy keyword based searches over World Wide Web has fueled the demand for incorporat- ing keyword-based search over structured databases. How- ever, most of the current research work focuses on keyword- based searching over a single structured data source. With the growing interest in distributed databases and service ori- ented architecture over the Internet, it is important to ex- tend such a capability over multiple structured data sources. One of the most important problems for enabling such a query facility is to be able to select the most useful data sources relevant to the keyword query. Traditional database summary techniques used for selecting unstructured data sources developed in IR literature are inadequate for our problem, as they do not capture the structure of the data sources. In this paper, we study the database selection prob- lem for relational data sources, and propose a method that eectively summarizes the relationships between keywords in a relational database based on its structure. We develop eective ranking methods based on the keyword relationship summaries in order to select the most useful databases for a given keyword query. We have implemented our system on PlanetLab. In that environment we use extensive experi- ments with real datasets to demonstrate the eectiveness of our proposed summarization method.
Conference Paper
With the growth of the Web, there has been a rapid increase in the number of users who need to access online databases without having a detailed knowledge of the schema or of query languages; even relatively simple query languages designed for non-experts are too complicated for them. We describe BANKS, a system which enables keyword-based search on relational databases, together with data and schema browsing. BANKS enables users to extract information in a simple manner without any knowledge of the schema or any need for writing complex queries. A user can get information by typing a few keywords, following hyperlinks, and interacting with controls on the displayed results.BANKS models tuples as nodes in a graph, connected by links induced by foreign key and other relationships. Answers to a query are modeled as rooted trees connecting tuples that match individual keywords in the query. Answers are ranked using a notion of proximity coupled with a notion of prestige of nodes based on inlinks, similar to techniques developed for Web search. We present an efficient heuristic algorithm for finding and ranking query results.
Article
This paper introduces a novel keyword search paradigm in relational databases, where the result of a search is an Object Summary (OS). An OS summarizes all data held about a particular Data Subject (DS) in a database. More precisely, it is a tree with a tuple containing the keyword(s) as a root and neighboring tuples as children. In contrast to traditional relational keyword search, an OS comprises a more complete and therefore semantically meaningful set of information about the enquired DS.The proposed paradigm introduces the concept of Affinity in order to automatically generate OSs. More precisely, it investigates and quantifies the Affinity of relations (i.e. Affinity) and their attributes (i.e. Attribute Affinity) in order to decide which tuples and attributes to include in the OS. Experimental evaluation on the TPC-H and Northwind databases verifies the searching quality of the proposed paradigm on both large and small databases; precision, recall, f-score, CPU and space measures are presented.
Article
Assume that each object in a database has m grades, or scores, one for each of m attributes. For example, an object can have a color grade, that tells how red it is, and a shape grade, that tells how round it is. For each attribute, there is a sorted list, which lists each object and its grade under that attribute, sorted by grade (highest grade first). Each object is assigned an overall grade, that is obtained by combining the attribute grades using a fixed monotone aggregation function, or combining rule, such as min or average. To determine the top k objects, that is, the k objects with the highest overall grades, the naive algorithm must access every object in the database, to find its grade under each attribute. Fagin has given an algorithm (“Fagin's Algorithm”, or FA) that is much more efficient. For some monotone aggregation functions, FA is optimal with high probability in the worst case. We analyze an elegant and remarkably simple algorithm (“the threshold algorithm”, or TA) that is optimal in a much stronger sense than FA. We show that TA is essentially optimal, not just for some monotone aggregation functions, but for all of them, and not just in a high-probability worst-case sense, but over every database. Unlike FA, which requires large buffers (whose size may grow unboundedly as the database size grows), TA requires only a small, constant-size buffer. TA allows early stopping, which yields, in a precise sense, an approximate version of the top k answers. We distinguish two types of access: sorted access (where the middleware system obtains the grade of an object in some sorted list by proceeding through the list sequentially from the top), and random access (where the middleware system requests the grade of an object in a list, and obtains it in one step). We consider the scenarios where random access is either impossible, or expensive relative to sorted access, and provide algorithms that are essentially optimal for these cases as well.
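The threshold algorithm described above can be sketched directly from the abstract: do sorted access in parallel down the m lists, resolve each newly seen object's remaining grades by random access, and stop once the k-th best overall grade is at least the threshold obtained by aggregating the last grades seen under sorted access. The data layout below (lists of `(object_id, grade)` pairs and a `random_access` callback) is an assumption for illustration.

```python
import heapq

def threshold_algorithm(sorted_lists, random_access, agg, k):
    """Threshold Algorithm (TA) sketch.
    sorted_lists: m lists of (object_id, grade), each sorted by grade desc.
    random_access: function (object_id, attr_index) -> grade.
    agg: monotone aggregation function over a list of m grades.
    Returns up to k (overall_grade, object_id) pairs, best first."""
    m = len(sorted_lists)
    seen = {}   # object_id -> overall grade
    top = []    # min-heap holding the current top-k candidates
    depth = 0
    while True:
        last = []  # last grade seen under sorted access in each list
        for i, lst in enumerate(sorted_lists):
            if depth >= len(lst):
                return sorted(top, reverse=True)  # lists exhausted
            obj, grade = lst[depth]
            last.append(grade)
            if obj not in seen:
                # Random access to fetch the object's other grades.
                grades = [random_access(obj, j) for j in range(m)]
                seen[obj] = agg(grades)
                heapq.heappush(top, (seen[obj], obj))
                if len(top) > k:
                    heapq.heappop(top)
        depth += 1
        threshold = agg(last)
        # Halt: no unseen object can beat the current k-th best grade.
        if len(top) == k and top[0][0] >= threshold:
            return sorted(top, reverse=True)
```

The constant-size buffer claimed in the abstract corresponds to `top` here: TA only ever retains the k best candidates, whereas FA may need to buffer an unbounded number of partially seen objects.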
Conference Paper
The BANKS system enables keyword-based search on databases, together with data and schema browsing. BANKS enables users to extract information in a simple manner without any knowledge of the schema or any need for writing complex queries. A user can get information by typing a few keywords, following hyperlinks, and interacting with controls on the displayed results. Extensive support for answer ranking forms a critical part of the BANKS system.