Content uploaded by Georgios Fakas

Author content

All content in this area was uploaded by Georgios Fakas on Nov 13, 2016

Content may be subject to copyright.


Size-l Object Summaries for Relational Keyword Search

Georgios J. Fakas∗†, Zhi Cai†, Nikos Mamoulis‡

†Department of Computing and Mathematics, Manchester Metropolitan University, UK
‡Department of Computer Science, The University of Hong Kong, Hong Kong

{g.fakas, z.cai}@mmu.ac.uk, nikos@cs.hku.hk

ABSTRACT

A previously proposed keyword search paradigm produces, as a query result, a ranked list of Object Summaries (OSs). An OS is a tree structure of related tuples that summarizes all data held in a relational database about a particular Data Subject (DS). However, some of these OSs are very large in size and therefore unfriendly to users that initially prefer synoptic information before proceeding to more comprehensive information about a particular DS. In this paper, we investigate the effective and efficient retrieval of concise and informative OSs. We argue that a good size-l OS should be a stand-alone and meaningful synopsis of the most important information about the particular DS. More precisely, we define a size-l OS as a partial OS composed of l important tuples. We propose three algorithms for the efficient generation of size-l OSs (in addition to the optimal approach, which requires exponential time). Experimental evaluation on the DBLP and TPC-H databases verifies the effectiveness and efficiency of our approach.

1. INTRODUCTION

Web Keyword Search (W-KwS) has been very successful because it allows users to extract useful information from the web effectively and efficiently using only a set of keywords. For instance, Example 1 illustrates the partial result of a W-KwS (e.g. Google) for Q1: “Faloutsos”: a ranked set (with only the first three results shown) of links to web pages containing the keyword(s). We observe that each result is accompanied by a snippet [21], i.e. a short summary that sometimes even includes the complete answer to the query (if, for example, the user is only interested in whether Christos Faloutsos is a Professor or whether his brothers are academics).

The success of the W-KwS paradigm has encouraged the emergence of the keyword search paradigm in relational databases (R-KwS) [2, 4, 13]. The R-KwS paradigm is used to find tuples that contain the keywords and their relationships through foreign-key links, e.g. query Q2: “Faloutsos”+“Agrawal” returns Authors Faloutsos and Agrawal and their associations through co-authored papers. Example 2 illustrates the result of a traditional R-KwS for Q2 on the DBLP database. On the other hand, the R-KwS paradigm may not be very effective when trying to extract information about a particular data subject (DS), e.g. “Faloutsos” in Q1. Example 3 illustrates the R-KwS result for Q1, namely a ranked set of Author tuples containing the Faloutsos keyword, which are the Author tuples corresponding to the three brothers. Evidently, this result fails to provide comprehensive information to users about the Faloutsos brothers, e.g. a complete list of their publications and other corresponding details. (Certainly, the R-KwS paradigm remains very useful when trying to combine keywords.)

∗Partially supported by the “Hosting of Experienced Researchers from Abroad” programme (ΠPOΣEΛKYΣH/ΠPOEM/0308) funded by the Research Promotion Foundation, Cyprus.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 38th International Conference on Very Large Data Bases, August 27th - 31st 2012, Istanbul, Turkey.
Proceedings of the VLDB Endowment, Vol. 5, No. 3
Copyright 2011 VLDB Endowment 2150-8097/11/11... $10.00.

EXAMPLE 1. Q1 “Faloutsos” using a W-KwS (Google)

Christos Faloutsos
SCS CSD Professor's affiliations, research, projects, publications and teaching.
www.cs.cmu.edu/∼christos/ - 9k

Michalis Faloutsos
The Homepage of Michalis Faloutsos ... Interesting and Miscellaneous Links · Fun pictures · Other Faloutsos on the web; The Teach-To-Learn Initiative:
www.cs.ucr.edu/∼michalis/ - 5k

Petros Faloutsos
Courses · Press Coverage · Publications · Research Highlights · Awards · MAGIX Lab · Curriculum Vitae · Family · Other Faloutsos on Web.
www.cs.ucla.edu/∼pfal/ - 4k

...

EXAMPLE 2. Q2 using an R-KwS (searching the DBLP database)

Author: Christos Faloutsos, Paper: Efficient similarity search in sequence databases, Author: Rakesh Agrawal.
Author: Christos Faloutsos, Paper: Method for high-dimensionality indexing in a multi-media database, Author: Rakesh Agrawal.
Author: Christos Faloutsos, Paper: Quest: A project on database mining, Author: Rakesh Agrawal.

EXAMPLE 3. Q1 using an R-KwS (searching the DBLP database)

Author: Christos Faloutsos
Author: Michalis Faloutsos
Author: Petros Faloutsos

In [8], the concept of an object summary (OS) is introduced; an OS summarizes all data held in a database about a particular DS. More precisely, an OS is a tree with the tuple tDS containing the keyword (e.g. Author tuple Christos Faloutsos) as the root node and its neighboring tuples, containing additional information (e.g. his papers, co-authors etc.), as child nodes. The result for Q1 is in fact a set of OSs: one per DS, which includes all data held in the database for each Faloutsos brother. Example 4 illustrates the OS for Christos (the complete set of papers and the OSs of the other two brothers are omitted due to lack of space). This result evidently provides a more complete set of information per brother.


arXiv:1111.7169v1 [cs.DB] 30 Nov 2011

EXAMPLE 4. The OS for Christos Faloutsos

Author: Christos Faloutsos
.Paper: On Power-law Relationships of the Internet Topology.
....Conference: SIGCOMM. Year: 1999.
....Co-Author(s): Michalis Faloutsos, Petros Faloutsos.
.Paper: An Efficient Pictorial Database System for PSQL.
....Conference: IEEE Trans. Software Eng. Year: 1988.
....Co-Author(s): N. Roussopoulos, T. Sellis.
...
.Paper: Declustering Using Fractals.
....Conference: PDIS. Year: 1993. Co-Author(s): Pravin Bhagwat.
.Paper: Declustering Using Error Correcting Codes.
....Conference: PODS. Year: 1989. Co-Author(s): Dimitris N. Metaxas.
(Total 1,309 tuples)

EXAMPLE 5. The size-l OSs for Q1 and l=15

Author: Christos Faloutsos
..Paper: On Power-law Relationships of the Internet Topology.
.....Conference: SIGCOMM. Year: 1999.
.....Co-Author(s): Michalis Faloutsos, Petros Faloutsos.
..Paper: The QBIC Project: Querying Images by Content, Using Color, Texture and Shape.
.....Conference: SPIE. Year: 1993.
.....Co-Author(s): Carlton W. Niblack, Dragutin Petkovic, Peter Yanker.
..Paper: Efficient and Effective Querying by Image Content.
.....Conference: J. Intell. Inf. Syst. Year: 1994.
.....Co-Author(s): N. Roussopoulos, T. Sellis.
...

Author: Michalis Faloutsos
..Paper: On Power-law Relationships of the Internet Topology.
.....Conference: SIGCOMM. Year: 1999.
.....Co-Author(s): Christos Faloutsos, Petros Faloutsos.
..Paper: QoSMIC: Quality of Service Sensitive Multicast Internet Protocol.
.....Conference: SIGCOMM. Year: 1998.
.....Co-Author(s): Anindo Banerjea, Rajesh Pankaj.
..Paper: Aggregated Multicast with Inter-Group Tree Sharing.
.....Conference: Networked Group Communication. Year: 2001.
.....Co-Author(s): Aiguo Fei.
...

Author: Petros Faloutsos
..Paper: On Power-law Relationships of the Internet Topology.
.....Conference: SIGCOMM. Year: 1999.
.....Co-Author(s): Christos Faloutsos, Michalis Faloutsos.
..Paper: Composable controllers for physics-based character animation.
.....Conference: SIGGRAPH. Year: 2001.
.....Co-Author(s): Michiel van de Panne, Demetri Terzopoulos.
..Paper: The virtual stuntman: dynamic characters with a repertoire of autonomous motor skills.
.....Conference: Computers & Graphics 25. Year: 2001.
.....Co-Author(s): Michiel van de Panne, Demetri Terzopoulos.

From Example 4, we can observe that some of the OSs may be very large in size; e.g. Christos Faloutsos has co-authored many papers and his OS consists of 1,309 tuples. This is not only unfriendly to users that prefer a quick glance first before deciding which Faloutsos they are really interested in, but also expensive to produce. Therefore, a partial OS of size l, composed of only l representative and important tuples, may be more appropriate.

In this paper, we investigate in detail the effective and efficient generation of size-l OSs. Example 5 illustrates Q1 with l=15 on the DBLP database, namely a set of size-15 OSs composed of only 15 important tuples for each DS. From the user's perspective, the semantics of this paradigm resemble a W-KwS more than an R-KwS. For instance, the complete OS of Example 4 resembles a web page (as they both include comprehensive information about a DS), whereas the size-l OSs of Example 5 resemble the snippets of Example 1. Therefore, users with W-KwS experience will potentially find it friendlier and also closer to their expectations.

OSs and size-l OSs can have many applications. For example, OSs can automate responses to data protection act (DPA) subject access requests (e.g. the US Privacy Act of 1974, the UK DPA of 1984 and 1998 [1], etc.). Under DPA access requests, DSs have the right to request access from any organization to personal information about them. Thus, data controllers of organizations must extract data for a given DS from their databases and present it in an intelligible form [10]. Another application is for intelligence services searching for information about suspects in various databases. Hence, size-l OSs can also be very useful as they enhance the usability of OSs. In general, a size-l OS is a concise summary of the context around any pivot database tuple, finding application in (interactive) data exploration, schema extraction, etc.

Our goal is to effectively generate a stand-alone size-l OS, composed of l important tuples only, so that the user can comprehend it without any additional information. A stand-alone size-l OS should preserve meaningful and self-descriptive semantics about the DS. As we explain in Section 3, for this reason the l tuples should form a connected graph that includes the root of the OS (i.e. tDS). To distinguish the importance of individual tuples ti to be included in the size-l OS, a local importance score (denoted as Im(OS, ti)) is defined by combining the tuple's global importance score in the database (denoted as Im(ti)) and its affinity [8] in the OS (denoted as Af(ti)). Based on the local importance scores of the tuples of an OS, we can find the partial OS of size l with the maximum importance score, which includes tuples that are connected with tDS.

The efficient generation of size-l OSs is a challenging problem. A brute-force approach, which considers all candidate size-l OSs before finding the one with the maximum importance, requires exponential time. We propose an optimal algorithm based on dynamic programming, which is efficient for small problems; however, it does not scale well with the OS size and l. In view of this, we design three practical greedy algorithms.

We provide an extensive experimental study on the DBLP and TPC-H databases, which includes comparisons of our algorithms and verifies their efficiency. To verify the effectiveness of our framework, we collected user feedback, e.g. by asking several DBLP authors (i.e. the DSs themselves) to assess their own computed size-l OSs on the DBLP database. The users suggested that the results produced by our method are very close to their expectations.

The rest of the paper is structured as follows. Section 2 describes background and related work. Section 3 describes the semantics of size-l OS keyword queries and formulates the problem of their generation. Sections 4 and 5 introduce the optimal and greedy algorithms, respectively. Section 6 presents experimental results and Section 7 provides concluding remarks.

2. BACKGROUND AND RELATED WORK

In this section, we first describe the concept of object summaries (OSs), which we build upon in this paper. We then present and compare other related work in R-KwS, ranking and summarization. To the best of our knowledge, there is no previous work that focuses on the computation of size-l OSs.

2.1 Object Summaries

In the context of OS search in relational databases [8, 7], a query is a set of keywords (e.g. “Christos Faloutsos”) and the result is a set of OSs. An OS is generated for each tuple (tDS) found in the database that contains the keyword(s) as part of an attribute's value (e.g. tuple “Christos Faloutsos” of relation Author in the DBLP database). An OS is a tree structure composed of tuples, having tDS as root and tDS's neighboring tuples (i.e. those associated through foreign keys) as its children/descendants.

In order to construct OSs, this approach combines the use of graphs and SQL. The rationale is that there are relations, denoted as RDS (e.g. the Author relation), which hold information about the queried Data Subjects (DSs), and the relations linked around RDS contain additional information about the particular DS. For each RDS, a Data Subject Schema Graph (GDS) can be generated; this is a directed labeled tree that captures a subset of the database schema with RDS as a root. (Figures 1 and 11 illustrate the schemata of the DBLP and TPC-H databases, whereas Figures 2 and 12 illustrate exemplary GDSs.) Each relation in GDS is also annotated with useful information that we describe later, such as affinity and importance. GDS is a “treealization” of the schema, i.e. RDS becomes the root, its neighboring relations become child nodes, and any looped or many-to-many relationships are replicated. Examples of such replications are relations PaperCitedBy, PaperCites and Co-Author on the Author GDS, and relations Partsupp, Lineitem, Parts etc. on the Customer GDS (see the GDSs in Figures 2 and 12). (User evaluation in [8] verified that the tree format, achieved via such replications, significantly increases the friendliness and ease of use of OSs.)

[Figure 1: The DBLP Database Schema (relations Author, Paper, Conference, Year)]

[Figure 2: The DBLP Author GDS, annotated with (Affinity), max(Ri) and mmax(Ri): Author (1) 1.049, 7.381; Paper (0.92) 8.818, 7.381; Co-author (0.82) 0.86, 0; Year (0.83) 0.841, 0.216; Conference (0.78) 0.216, 0; PaperCites (0.77) 7.38, 0; PaperCitedBy (0.77) 7.381, 0]

The challenge now is the selection of the relations from GDS which have the highest affinity with RDS; these need to be accessed and joined in order to create a good OS. To facilitate this task, affinity measures of relations (denoted as Af(Ri)) in GDS are investigated, quantified and annotated on the GDS. The affinity of a relation Ri to RDS can be calculated using the following formula:

Af(Ri) = Σj mj wj · Af(RParent),    (1)

where j ranges over a set of metrics (m1, m2, ..., mn) with corresponding weights (w1, w2, ..., wn), and Af(RParent) is the affinity of Ri's parent to RDS. Affinity metrics between Ri and RDS include (1) their distance and (2) their connectivity properties on both the database schema and the data-graph (see [8] for more details). Given an affinity threshold θ, a subset of GDS can be produced, denoted as GDS(θ). Finally, by traversing GDS(θ) (e.g. by joining the corresponding relations) we can generate the OSs (either by using the precomputed data-graph or directly from the database using Algorithm 5). More precisely, a breadth-first traversal of the corresponding GDS(θ) is applied, with the tDS tuple as the initial root entry of the OS tree. For instance, for keyword query Q1, the Author GDS of Figure 2 and θ=0.7, the report presented in Example 4 is automatically generated. Note that Author GDS(0.7) includes all relations, whilst Customer GDS(0.7) includes only the Customer, Nation, Region, Order, Lineitem and Partsupp relations (since all these relations have affinity greater than 0.7). Similarly, the set of attributes Aj from each relation Ri that are included in a GDS is selected by employing an attribute affinity and a threshold (i.e. θ′). For example, in a Customer OS, Comment is excluded from the Partsupp relation as it is not relevant to Customer DSs.
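As a concrete (if simplified) illustration of Equation 1 and the GDS(θ) thresholding, the sketch below propagates affinity down a GDS tree. The tree encoding, metric values and weights are hypothetical stand-ins chosen so that the resulting scores roughly match the annotations of Figure 2; they are not the paper's actual metrics:

```python
def affinity(gds, metrics, weights, root):
    """Compute Af(Ri) for every relation in a GDS tree (Equation 1):
    Af(Ri) = sum_j m_j * w_j * Af(R_parent), with Af(R_DS) = 1."""
    af = {root: 1.0}

    def visit(rel):
        for child in gds.get(rel, []):
            # each metric value m_j (e.g. distance, connectivity) is weighted by w_j
            af[child] = sum(m * w for m, w in zip(metrics[child], weights)) * af[rel]
            visit(child)

    visit(root)
    return af

def gds_theta(af, theta):
    """GDS(theta): keep only relations with affinity at least theta."""
    return {rel for rel, a in af.items() if a >= theta}

# Hypothetical Author-rooted GDS with two made-up metrics per relation
gds = {"Author": ["Paper"], "Paper": ["Conference", "Year", "CoAuthor"]}
metrics = {"Paper": [0.50, 0.42], "Conference": [0.40, 0.45],
           "Year": [0.50, 0.40], "CoAuthor": [0.49, 0.40]}
af = affinity(gds, metrics, [1.0, 1.0], "Author")
```

With these illustrative numbers, `af["Paper"]` evaluates to 0.92 and `gds_theta(af, 0.8)` drops Conference, mirroring how a low-affinity relation falls out of GDS(θ).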

2.2 R-KwS and Ranking

R-KwS techniques facilitate the discovery of joining tuples (i.e. Minimal Total Join Networks of Tuples (MTJNTs) [13]) that collectively contain all query keywords and are associated through their keys; for this purpose the concept of candidate networks is introduced; see, for example, DISCOVER [13] and BANKS [2, 4]. The OS paradigm differs from other R-KwS techniques semantically, since it does not focus on finding and ranking candidate networks that connect the given keywords, but searches for OSs, which are trees centered around the data subject described by the keywords.

Précis queries [15, 19] resemble size-l OSs as they append additional information to the nodes containing the keywords, by considering neighboring relations that are implicitly related to the keywords. More precisely, a précis query result is a logical subset of the original database (i.e. a subset of relations and a subset of tuples). For instance, the précis of Q1 is a subset of the database that includes the tuples of the three Faloutsos Authors and a subset of their (common) Papers, Co-Authors, Conferences, etc. In contrast, our result is a set of three separate size-l OSs (Example 5). A thorough evaluation between OSs and précis appears in our earlier work [8].

R-KwS techniques also investigate the ranking of their results. Such ranking paradigms consider:

1) IR-style techniques, which weight the number of times keywords (terms) appear in MTJNTs [12, 16, 17, 23]. However, such techniques miss tuples that are related to the keywords but do not contain them [3]; e.g. for Q1, tuples in relation Paper also have importance although they do not include the Faloutsos keyword.

2) Tuples' importance, which weights the authority flow through relationships, e.g. ObjectRank [3], [22], ValueRank [9], PageRank [5], BANKS (PageRank-inspired) [2], [4], XRANK [11] etc. In this paper we use tuples' importance to model global importance scores, and more precisely global ObjectRank (for DBLP) and ValueRank (for TPC-H). (Note that our algorithms are orthogonal to how tuple importance is defined, and other methods could also be investigated.) ObjectRank [3] is an extension of PageRank to databases and introduces the concept of Authority Transfer Rates between the tuples of each relation of the database (Authority Transfer Rates are annotated on the so-called Authority Transfer Schema Graph, denoted as GA, e.g. Figure 13). It is based on the observation that merely mapping a relational database to a graph (as in the case of the web) is not accurate, and a GA is required to control the flow of authority between neighboring tuples. For instance, well-cited papers should have higher importance than papers citing many other papers, and a well-cited paper should have better ranking than another one with fewer citations. ValueRank is an extension of ObjectRank which also considers the tuples' values and thus can be applied on any database (e.g. TPC-H), in contrast to ObjectRank, which is mainly effective on authoritative-flow data such as bibliographic data (e.g. DBLP). For instance, in trading databases, a customer with five orders of value $10 may get lower importance than another customer with three orders of value $100.
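The authority-flow intuition can be illustrated with a small power-iteration sketch over a toy citation graph. This is a generic PageRank-style computation with made-up transfer rates, not ObjectRank's or ValueRank's actual formulation:

```python
def authority_flow(edges, rates, d=0.85, iters=50):
    """Generic authority-flow scores: each edge (u, v) transfers the fraction
    rates[(u, v)] of u's score to v, damped PageRank-style."""
    nodes = {n for e in edges for n in e}
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - d) / len(nodes) for n in nodes}
        for (u, v) in edges:
            nxt[v] += d * rates[(u, v)] * score[u]
        score = nxt
    return score

# p1 is cited twice, p3 only cites others: a well-cited paper ends up
# more important than a paper citing many other papers.
edges = [("p2", "p1"), ("p3", "p1"), ("p3", "p2")]
rates = {("p2", "p1"): 0.7, ("p3", "p1"): 0.35, ("p3", "p2"): 0.35}
scores = authority_flow(edges, rates)
```

The per-edge rates play the role of the Authority Transfer Rates annotated on GA: they let citations carry more authority than, say, mere co-occurrence links would.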

2.3 Other Related Work

Document summarization techniques have attracted significant research interest [20, 21]. In general, these techniques are IR-style inspired. Web snippets [21] are examples of document summaries that accompany search results of W-KwSs in order to facilitate their quick preview (e.g. see Example 1). They can be either static (e.g. composed of the first words of the document or of description metadata) or query-biased (e.g. composed of sentences containing the keywords many times) [20]. Still, the direct application of such techniques on databases in general, and on OSs in particular, is ineffective; e.g. they disregard the relational associations and semantics of the displayed tuples. For example, consider Q1 and Example 4: papers authored by Faloutsos (although they do not include the Faloutsos keyword) have importance analogous to their citations and authors; this is ignored by document summarization techniques.

XML keyword search techniques, similarly to R-KwS, facilitate the discovery of XML sub-trees that contain all query keywords (e.g. “Faloutsos”+“Agrawal”). Analogously, XML snippets [14] are sub-trees of the complete XML result, with a given size, that contain all keywords. An apparent difference between size-l OSs and XML snippets is their semantics, which is analogous to the semantic difference between complete OSs and XML results. Therefore, their generation is a completely different problem. An interesting similarity is that both size-l OSs and XML snippets are sub-trees of the corresponding complete results, hence composed of connected nodes. This common property exists for the same reason, i.e. to preserve self-descriptiveness.

3. SIZE-l OSs

A size-l OS keyword query consists of (1) a set of keywords and (2) a value for l (e.g. Q1 and l=15), and the result comprises a set of size-l OSs. A good size-l OS should be a stand-alone and meaningful synopsis of the most important information about the particular DS.

DEFINITION 1. Given an OS and an integer size l, a candidate size-l OS is any subset of the OS composed of l tuples, such that all l tuples are connected with tDS (i.e. the root of the OS tree).

Definition 1 guarantees that the size-l OS remains stand-alone (so users can understand it as it is, without any additional tuples); i.e. by including connecting tuples we also include the semantics of their connection to the DS. (Recall that this criterion was also used in [14] for the same reasons.) Consider the example of Figure 3, which is a fraction of the Faloutsos OS (in the DBLP database). Even if the Paper “Efficient and Effective Querying by Image Content” has less local importance (e.g. 20) than the Co-Author(s) Sellis (e.g. 43) and Roussopoulos (e.g. 34), we cannot exclude the Paper and include only the Co-Authors. The rationale is that by excluding the Paper tuple we also exclude the semantic association between the Author and Co-Author(s), which in this case is their common paper. Also note that a size-l OS will not necessarily include the l tuples with the largest importance scores. For example, the Co-Author Roussopoulos, although of larger importance than the particular Paper, may have to be excluded from a size-l OS (e.g. from a size-3 OS, which will consist of (1) Author “Faloutsos”, (2) Paper “Efficient . . . ” and (3) Co-Author “Sellis”).

Given an OS, we can extract exponentially many size-l OSs that satisfy Definition 1. In the next section we define a measure for the importance (i.e., quality) of a candidate size-l OS. Our goal then is to retrieve a size-l OS of the maximum possible quality.
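The connectivity requirement of Definition 1 can be checked mechanically. A minimal sketch, assuming a hypothetical parent-pointer encoding of the OS tree:

```python
def is_candidate_size_l(parent, t_ds, chosen, l):
    """A candidate size-l OS must contain exactly l tuples, include the root
    t_DS, and every chosen tuple must reach t_DS through chosen tuples only."""
    if len(chosen) != l or t_ds not in chosen:
        return False
    for t in chosen:
        while t != t_ds:
            t = parent[t]          # walk up towards the root
            if t not in chosen:    # a gap breaks the semantics of the association
                return False
    return True

# The Figure 3 fragment: excluding the Paper disconnects its Co-Authors
parent = {"Paper": "Author", "Sellis": "Paper", "Roussopoulos": "Paper"}
ok = is_candidate_size_l(parent, "Author", {"Author", "Paper", "Sellis"}, 3)
bad = is_candidate_size_l(parent, "Author", {"Author", "Sellis", "Roussopoulos"}, 3)
```

The second call fails precisely because the two Co-Authors are kept while their connecting Paper tuple is dropped, which is the situation Definition 1 rules out.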

3.1 Importance of a Size-l OS

The (global) importance of any candidate size-l OS S, denoted as Im(S), is defined as:

Im(S) = Σti∈S Im(OS, ti),    (2)

where Im(OS, ti) is the local importance of tuple ti (to be defined in Section 3.2 below). We say that a candidate size-l OS is an optimal size-l OS if it has the maximum Im(S) (denoted as max(Im(S))) over all candidate size-l OSs for the given OS. Wherever an optimal size-l OS is hard to find, we target the retrieval of a sub-optimal size-l OS of the highest possible importance.

[Figure 3: A Fraction of the Faloutsos OS, annotated with local importance: Author Christos Faloutsos (58); Paper “Efficient and Effective Querying by Image Content” (20); Co-Author T. Sellis (43); Co-Author N. Roussopoulos (34); Conference IEEE Trans. Software Eng. (14); Year 1988 (18)]

3.2 Local Importance of a Tuple (Im(OS, ti))

The local importance Im(OS, ti) of each tuple ti in an OS can be calculated by:

Im(OS, ti) = Im(ti) · Af(ti),    (3)

where Im(ti) is the global importance of ti in the database. We use global ObjectRank and ValueRank to calculate global importance, as discussed in Section 2.2. Af(ti) is the affinity of ti to the tDS, namely the affinity Af(Ri), to RDS, of the corresponding relation Ri where ti belongs. This can be calculated from GDS using Equation 1, as discussed in Section 2.1 (alternatively, a domain expert can set the Af(Ri)s manually). For example, if tuple ti is paper “Efficient..” with Im(ti)=21.74 and Af(ti)=Af(RPaper)=0.92 (see the affinity on the Author GDS in Figure 2), then Im(OS, ti) = 21.74*0.92 = 20.

Multiplying global importance Im(ti) with affinity Af(ti) reduces the importance of tuples that are not closely related to the DS. For instance, although paper “Efficient ..” and year “1988” have almost equal global importance scores (21.74 and 21.64, respectively), their local importance scores become 20 (=21.74*0.92) and 18 (=21.64*0.83), respectively. The use of importance and affinity metrics is inspired by earlier work; e.g. XRANK and précis employ variations of importance and affinity [11, 15]. For defining affinity, [11, 15] consider only distance; however, as shown in [8], distance is only one among the possible affinity metrics (e.g. cardinality, reverse cardinality etc.).
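Equation 3, and the aggregation of Equation 2, amount to a few lines of code. The numbers below are taken from the paper's own DBLP example; the function names are ours:

```python
def local_importance(im_global, af):
    """Im(OS, ti) = Im(ti) * Af(ti)  (Equation 3)."""
    return im_global * af

def importance(tuples):
    """Im(S): sum of local importances over the tuples of a
    candidate size-l OS (Equation 2)."""
    return sum(local_importance(im, af) for im, af in tuples)

# Paper "Efficient .." and year "1988" from Section 3.2:
# near-equal global scores, but affinity separates them.
paper = local_importance(21.74, 0.92)  # about 20
year = local_importance(21.64, 0.83)   # about 18
```

The affinity factor is what demotes the Year tuple relative to the Paper tuple despite their almost identical global importance.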

3.3 Problem Definition

The generation of a complete OS is straightforward: we only have to traverse the corresponding GDS (see Algorithm 5 in the Appendix). The generation of a size-l OS is a more challenging task, because we need to select l tuples that are connected to the tDS of the tree and at the same time result in the largest Im(S). Hence, the problem we study in this paper can be defined as follows:

PROBLEM 1 (FIND AN OPTIMAL SIZE-l OS). Given a tDS, the corresponding GDS and l, find a size-l OS S of maximum Im(S).

A direct approach for solving this problem is to first generate the complete OS (i.e. Algorithm 5)¹ and then determine the optimal size-l OS from it. In Section 4, we propose a dynamic programming (DP) algorithm for this purpose. If the complete OS is too large, solving the problem exactly using DP can be too expensive. In view of this, in Section 5, we propose greedy algorithms that find a sub-optimal synopsis. In order to further reduce the cost of finding a sub-optimal solution, in Section 5.3, we also propose an economical approach which, instead of the complete OS, initially generates a preliminary partial OS, denoted as prelim-l OS. The rationale of a prelim-l OS is to avoid the extraction, and the consequent further processing, of fruitless tuples that are not promising enough to make it into the size-l OS. DP and the greedy algorithms can be applied on the prelim-l OS to find a good sub-optimal size-l OS.

¹In fact, any tuples or subtrees which have distance at least l from the root tDS are excluded from the OS, as these cannot be part of a connected size-l OS rooted at tDS.

4. THE DP ALGORITHM

This section describes a dynamic programming (DP) algorithm which, given an OS, determines the optimal size-l OS in it. The OS is a tree, as discussed in Section 2. Every node v of the OS tree is a tuple ti, and carries a weight w(v), which is the local importance Im(OS, ti) of the corresponding tuple ti. Given the tree OS, our objective is to find a subtree Sopt, such that (i) Sopt includes the root node tDS of the OS, (ii) the tree has l nodes, and (iii) its nodes have the maximum sum of weights over all trees that satisfy (i) and (ii). In the third condition, the sum of node weights corresponds to Im(Sopt), according to Equation 2. Since this is the maximum among all qualifying subtrees, Sopt is the optimal size-l OS.

Assume that the root tDS in Sopt has a child v and the subtree Sopt^v rooted at v has i nodes. Then, Sopt^v should be the optimal size-i OS rooted at v. DP operates based on exactly this assertion; for each candidate node v to be included in the optimal synopsis, and for each number of nodes i in the subtree of v that can be included, we compute the corresponding optimal size-i synopsis and the corresponding sum of weights. The optimal size-i synopsis rooted at v is computed recursively from precomputed size-j synopses (j < i) rooted at v's children; to find it, we should consider all synopses formed by v and all size-(i−1) combinations of its children and the subtrees rooted at them.

Specifically, let d(v) be the depth of a node v in the OS (the root tDS has depth 0). The subtree rooted at v can contribute at most l−d(v) nodes to the optimal solution, because in every solution that includes v, the complete path from the root to v must be included (due to the fact that tDS should be included and the solution must be connected). The objective of the DP algorithm is to compute, for each node v of the OS, Sv,i: the optimal size-i OS in the subtree rooted at v, for all i ∈ [1, l−d(v)]. In addition to Sv,i, the algorithm should track W(Sv,i), the sum of weights of all nodes in Sv,i.

DP (Algorithm 1) proceeds in a bottom-up fashion; it starts from nodes in the OS at depth l−1; these nodes can only contribute themselves to the optimal solution (nodes at depth at least l cannot participate in a size-l OS). For each such node v, trivially Sv,1=v and W(Sv,1)=w(v). Now consider a node v at depth k < l−1. Upon reaching v, for all children u of v, the quantities Su,i and W(Su,i) have been computed for all i ∈ [1, l−d(v)−1]. Let us now see how we can compute Sv,i for each i ∈ [1, l−d(v)]. First, each Sv,i should include v itself. Then, we examine all possible combinations of v's children and numbers of nodes to be selected from their subtrees, such that the total number of selected nodes is i−1. We do not have to check the subtrees of v's children, since for each number of nodes j to be selected from a subtree rooted at child u, we already have the optimal set Su,j and the corresponding sum of weights W(Su,j). Note that when we reach the OS root r=tDS, we only have to compute Sr,l: the optimal size-l OS (i.e., there is no need to compute Sr,i for i ∈ [1, l−1]).

Algorithm 1 The Optimal Size-l OS (DP) Algorithm

DP(l, tDS, GDS)
1: OS Generation(tDS, GDS) ⊲ generates the complete OS and annotates each node with its local importance
2: for each node v at depth l−1 do set Sv,1 = v
3: for each depth k = l−2 to 0 do
4:   for each node v at depth k do
5:     for i = 1 to l−d(v) do
6:       Sv,i = {v} ∪ the best combination of v's children and nodes from their subtrees, such that the total number of nodes is i−1
7: return Sr,l

[Figure 4 (top): an example OS tree; each node is annotated with its weight. Node 1 (weight 30) is the root; its children are nodes 2 (20), 3 (11), 4 (31), 5 (80) and 6 (35); node 3's children are 7 (10), 8 (15) and 9 (5); node 4's children are 10 (13) and 11 (30); node 6's child is 12 (12); node 11's child is 13 (60); node 12's child is 14 (40).]

Depth | Computed Sets
3 | S13,1=13, S14,1=14
2 | S7,1=7, S8,1=8, S9,1=9, S10,1=10, S11,1=11, S11,2={11,13}, S12,1=12, S12,2={12,14}
1 | S2,1=2, S3,1=3, S3,2={3,8}, S3,3={3,7,8}, S4,1=4, S4,2={4,11}, S4,3={4,11,13}, S5,1=5, S6,1=6, S6,2={6,12}, S6,3={6,12,14}
0 | S1,4={1,4,5,6}

Figure 4: Example: Steps of DP

As an example, consider the OS shown in Figure 4 (top) and assume that we want to compute the optimal size-4 OS from it. The table shows the steps of DP in computing the optimal sets S_{v,i} in a bottom-up fashion, starting from nodes 13 and 14, which are at depth 3 (i.e., l−1). For example, to compute S_{4,3}={4,11,13}, we compare the two possible cases S_{4,3}={4} ∪ S_{10,1} ∪ S_{11,1} and S_{4,3}={4} ∪ S_{11,2}, since S_{10,1} ∪ S_{11,1} and S_{11,2} are the only combinations of sets from node 4's children that total 2 nodes (i−1=2). S_{10,1} ∪ S_{11,1}={10,11} has total weight 43 and S_{11,2}={11,13} has total weight 90. Thus, S_{4,3}={4} ∪ S_{11,2}={4,11,13}. Note that for nodes that do not have enough children, the number of sets that are computed can be smaller than indicated in the pseudocode. For example, for node 2 we only have S_{2,1}; S_{2,2} and S_{2,3} do not exist even though the node is at depth 1, because node 2 has no children. In addition, for the root node, DP only has to compute S_{1,4}, since we only care about the optimal size-l summary (there are no nodes above the root that could use smaller summaries).
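The bottom-up recursion can be sketched as follows. This is a minimal Python illustration of Algorithm 1, not code from the paper: the OS is assumed to be an in-memory tree given as a `children` adjacency dict and a `w` dict of local importance scores, and the combination step over a node's children is implemented as a small knapsack (all names are our own).

```python
def optimal_size_l(root, children, w, l):
    """Optimal size-l summary via the bottom-up DP. S[v][i] holds
    (total weight, node list) of the best i-node subtree rooted at v;
    only nodes at depth < l are considered."""
    S = {}

    def solve(v, depth):
        if depth >= l:                       # too deep to participate
            return
        budget = l - depth                   # max summary size rooted at v
        for u in children.get(v, []):
            solve(u, depth + 1)
        # knapsack over v's children: best[j] = best j nodes drawn from
        # the already-solved child subtrees (each child used at most once)
        best = {0: (0.0, [])}
        for u in children.get(v, []):
            if u not in S:
                continue
            merged = dict(best)
            for j, (wu, nu) in S[u].items():
                for k, (wk, nk) in best.items():
                    if j + k <= budget - 1:
                        cand = (wu + wk, nu + nk)
                        if j + k not in merged or cand[0] > merged[j + k][0]:
                            merged[j + k] = cand
            best = merged
        # every S_{v,i} includes v itself, so shift the knapsack by one
        S[v] = {j + 1: (wj + w[v], nj + [v]) for j, (wj, nj) in best.items()}

    solve(root, 0)
    k = min(l, max(S[root]))
    weight, nodes = S[root][k]
    return sorted(nodes), weight
```

On our reading of the Figure 4 tree (node 1 with children 2-6, node 3 with children 7-9, node 4 with children 10 and 11, etc.), this sketch reproduces S_{1,4}={1,4,5,6} with total weight 176.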

In terms of complexity, for each node v in the OS up to depth l−1 we need to compute up to l−d(v) sets. For each set we need to find the optimal combination of children and numbers of nodes to choose from them. The cost of choosing the best combination increases exponentially with i, which is O(l). Thus, the overall cost of DP is O(n^l) for an input OS of size n, as can be verified in our experiments. This is essentially the complexity of the problem, as DP explores all possible summaries systematically and, in the general case, there is no way to prune the search space. For large values of l, DP becomes impractical and we resort to the greedy heuristics described in the next section. Finally, the following lemma proves the optimality of DP.

LEMMA 1. Algorithm 1 computes the optimal size-l OS.


PROOF. The optimal size-l OS S_opt includes the root t_DS of the OS and a set of subtrees rooted at some of t_DS's children. DP tests all possible combinations of children and numbers of nodes from the corresponding subtrees; therefore, the combination that corresponds to S_opt will be considered. For this specific combination, for each child v and number of nodes i, the optimal subtree rooted at v with i nodes (i.e., S_{v,i}) has already been found during the bottom-up computation process of DP. Therefore, DP will select and output the optimal combination (which has the largest importance among all tested ones).

5. GREEDY ALGORITHMS

Since the DP algorithm does not scale well, in this section we investigate greedy heuristics that aim at producing a high-quality size-l OS, not necessarily the optimal one. A property that the algorithms exploit is that the local importance of tuples in the OS (i.e., Im(OS, t_i)) usually decreases with the node's depth from the root t_DS of the OS. Recall that Im(OS, t_i) is the product Im(t_i)·Af(t_i), where Im(t_i) is the global importance of tuple t_i and Af(t_i) is the affinity of the relation that t_i belongs to. Af(t_i) monotonically decreases with the depth of the tuple, since Af(R_i) is the product of its parent's affinity and a factor that is at most 1 (cf. Equation 1). On the other hand, the global importance of a particular tuple is to some extent unpredictable. Therefore, even though the local importance is not monotonically decreasing with the depth of the tuple in the OS tree, it has a higher probability to decrease than to increase with depth. Hence, it is more probable for tuples higher in the OS to have greater local importance than lower tuples. Moreover, note that due to the non-monotonicity of OSs, existing top-k techniques such as [6, 12, 17] cannot be applied.

5.1 Bottom-Up Pruning Size-l Algorithm

This algorithm, given an initial OS (either a complete or a prelim-l OS), iteratively prunes from the bottom of the tree the n−l leaf nodes with the smallest Im(OS, t_i), where n is the number of nodes in the complete OS. The rationale is that, since tuples need to be connected with the root and lower tuples in the tree are expected to have lower importance, we can start pruning from the bottom. A priority queue (PQ) organizes the current leaf nodes according to their local importance. Algorithm 2 shows a pseudocode of the algorithm and Figure 5 illustrates its steps.

More precisely, this algorithm firstly generates the initial OS (line 1; e.g., the complete OS using Algorithm 5). The OS Generation algorithm generates the initial size-l OS and also the initial PQ (initially holding all leaves of the given OS). Then, the algorithm iteratively prunes the leaves with the smallest Im(OS, t_i). Whenever a new leaf is created (e.g., after pruning node 9 in Figure 5, node 3 becomes a leaf), it is added to PQ. The algorithm terminates when only l nodes remain in the tree. The tree is then returned as the size-l OS. In terms of time complexity, the algorithm performs O(n) delete operations in constant time, each potentially followed by an update to PQ. Since there are O(n) elements in PQ, the cost of each update operation is O(log n). Thus, the overall cost of the algorithm is O(n log n). This is much lower than the complexity of the DP algorithm, which gives the optimal solution.

On the other hand, this method will not always return the optimal solution; e.g., the optimal size-5 OS should include nodes 1, 5, 6, 12 and 14 instead of 1, 5, 6, 11 and 13 (Figure 5(d)). In practice, it is very accurate (see our experimental results in Section 6.2), due to the aforementioned property of Im(OS, t_i), which gives nodes closer to the root a higher probability of having high local importance. Lemma 2 proves an optimality condition for this algorithm

Algorithm 2 The Bottom-Up Pruning Size-l Algorithm

Bottom-Up Pruning Size-l(l, t_DS, G_DS)
1: OS-Generation(t_DS, G_DS) ⊲ generates initial size-l (i.e., complete or prelim-l) OS and initial PQ
2: while (|size-l OS| > l) do
3:   t_tem = deQueue(PQ) ⊲ the smallest value from PQ
4:   if !(hasSiblings(size-l OS, t_tem)) then
5:     enQueue(PQ, parent(size-l OS, t_tem)) ⊲ check whether, after pruning t_tem, its parent becomes a leaf node
6:   prune t_tem from size-l OS
7: return size-l OS

[Figure 5 shows four snapshots of the algorithm on an example OS, together with the contents of PQ at each step: (a) the initial OS, (b) the OS after the first leaf is pruned, (c) the size-10 OS and (d) the size-5 OS.]

Figure 5: The Bottom-Up Pruning Size-l Algorithm: Size-l OSs and their Corresponding PQs (annotated with tuple ID and local importance)

(Paper OSs in the DBLP database are an example of this condition; to be discussed in Section 6.2).

LEMMA 2. When the nodes of an OS have local importance scores that monotonically decrease with their distance from the root (i.e., the score of each parent is not smaller than that of its children), then the Bottom-Up Pruning Size-l Algorithm returns the optimal size-l OS.

PROOF. PQ.top always holds the node with the currently smallest score in the OS. This is because PQ.top is by definition the smallest among the leaf nodes, and leaf nodes always have smaller scores than their ancestors. Therefore, by removing the n−l currently smallest values (iteratively stored in PQ.top) from the OS, we get the optimal size-l OS.
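Algorithm 2 can be sketched compactly with a binary min-heap in place of PQ. This is an illustrative Python rendering under our own data layout (the OS as a `children` adjacency dict with local importance in `w`), not the paper's implementation:

```python
import heapq

def bottom_up_prune(root, children, w, l):
    """Prune the n-l least important leaves; returns the surviving node IDs."""
    parent, nchild, nodes = {}, {}, [root]
    for v in nodes:                      # BFS collecting all nodes
        nchild[v] = len(children.get(v, []))
        for u in children.get(v, []):
            parent[u] = v
            nodes.append(u)
    alive = set(nodes)
    pq = [(w[v], v) for v in nodes if nchild[v] == 0]   # initial leaves
    heapq.heapify(pq)
    while len(alive) > l:
        _, v = heapq.heappop(pq)         # leaf with smallest local importance
        alive.discard(v)
        p = parent[v]
        nchild[p] -= 1
        if nchild[p] == 0 and p != root: # parent just became a leaf
            heapq.heappush(pq, (w[p], p))
    return sorted(alive)
```

On the Figure 4 toy tree (as we read it), pruning down to l=5 keeps the connected set {1, 4, 5, 11, 13}, illustrating that the result is connected but not necessarily optimal.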

5.2 Update Top-Path-l Algorithm

We now explore a second greedy heuristic. This algorithm iteratively selects the path p_i of tuples with the largest average importance per tuple (denoted as AI(p_i)), adds p_i to the size-l OS, removes the nodes of p_i from the OS and updates AI(p_i) for the remaining paths accordingly. The rationale of selecting the path of tuples (instead of the single tuple) with the currently largest importance is


Algorithm 3 The Update Top-Path-l Algorithm

Update Top-Path-l(l, t_DS, G_DS)
1: OS-Generation(t_DS, G_DS) ⊲ generates initial size-l (i.e., complete or prelim-l) OS, annotates tuples with AI(p_i)
2: while (|size-l OS| < l) do
3:   p_i = path with max AI(p_i)
4:   add first l−|size-l OS| nodes of p_i to size-l OS
5:   if (|size-l OS| < l) then
6:     remove selected path p_i from the tree
7:     for each child v of nodes in p_i do
8:       update AI(p_j) for each node t_j in the subtree rooted at v
9: return size-l OS

[Figure 6 shows four snapshots of the algorithm on the example OS: (a) the initial OS, (b) the tree after the first update, (c) the tree after the second update and (d) the final update, yielding the size-5 OS.]

Figure 6: The Update Top-Path-l Algorithm: The size-5 OS (annotated with tuple ID, local importance and AI(p_i); selected nodes are shaded)

that, since all nodes need to be connected and monotonicity may not hold, we facilitate the selection of nodes of large importance even though their ancestors may have lower importance. Algorithm 3 is a pseudocode of the heuristic and Figure 6 illustrates an example.

More precisely, this algorithm (like the Bottom-Up Pruning Algorithm) firstly generates the complete (or alternatively the prelim-l) OS. During the OS generation, for each tuple t_i, we also calculate the average importance per tuple AI(p_i) of the corresponding path p_i from the root to t_i. We then select the node with the largest AI(p_i) and add the corresponding path to the size-l OS. By removing the nodes of p_i from the OS, the tree becomes a forest; each child of a node in p_i is the root of a tree. Accordingly, AI(p_i) for each remaining node t_i is updated to disregard the nodes removed in the path selected at the previous step. The process of selecting the path with the highest AI(p_i) and adding it to the size-l OS is repeated as long as fewer than l nodes have been selected so far. If fewer than |p_i| nodes are needed to complete the size-l OS, then only the top nodes of the path are added to the size-l OS (because only these nodes are connected to the current size-l OS).

Consider the example shown in Figure 6. Node 5 has AI(p_i)=55, because its path includes nodes 1 and 5, whose average Im(OS, t_i) is (30+80)/2=55. Assuming l=5, at the first loop the algorithm selects the path of nodes 1 and 5, which has the largest AI(p_i), i.e., 55. The nodes along this path (nodes 1 and 5) are added to the size-5 OS. For the remaining nodes, AI(p_i) is updated to disregard the removed nodes (see the top-right tree in Figure 6). For example, the new AI(p_i) for node 10 is 22, because its path now includes only nodes 4 and 10, whose average Im(OS, t_i) is 22. The next path to be selected is the one ending at node 13, which adds two more nodes to the snippet. Finally, node 6 is added to complete the size-5 OS.

The complexity of the algorithm can be as high as O(nl), where n is the size of the complete OS, as at each step the algorithm may choose only one node, which causes the update of O(n) paths. The algorithm can be optimized if we precompute, for each node v of the tree, the node s(v) with the highest AI(p_i) in the subtree rooted at v. Regardless of any change at any ancestor of v, s(v) remains the node with the highest AI(p_i) in the subtree (because the change affects all nodes in the subtree in the same way). Thus, only a small number of comparisons is needed after each path selection to find the next path to be selected. Specifically, for each child v of the nodes in the currently selected path p_i, we need to update AI(p_i) for s(v) and then compare all s(v)'s to pick the one with the largest AI(p_i). In terms of approximation quality, this algorithm does not always return the optimal solution; e.g., the size-3 OS will have nodes 1, 5 and 11 instead of 1, 5 and 6. However, empirically, this method gives better results than Bottom-Up Pruning.
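The path-selection loop can be sketched as follows. This is a minimal, unoptimized Python illustration of Algorithm 3 (it recomputes AI(p_i) from scratch on every round rather than using the s(v) precomputation described above); the tree layout (`children`, `w`) and all function names are our own:

```python
def update_top_path(root, children, w, l):
    """Greedy Update Top-Path-l sketch: repeatedly pick the path with the
    largest average importance among the not-yet-selected nodes."""
    parent, order = {}, [root]
    for v in order:                          # BFS to record parent pointers
        for u in children.get(v, []):
            parent[u] = v
            order.append(u)
    selected = set()

    def path(v):
        """Nodes from v up to (excluding) the nearest selected ancestor,
        returned top-first: p_i restricted to the current forest."""
        p = []
        while v is not None and v not in selected:
            p.append(v)
            v = parent.get(v)
        return p[::-1]

    def ai(v):                               # average importance along path(v)
        p = path(v)
        return sum(w[x] for x in p) / len(p)

    while len(selected) < l:
        best = max((v for v in order if v not in selected), key=ai)
        # if the path does not fully fit, keep only its top (connected) nodes
        selected.update(path(best)[:l - len(selected)])
    return sorted(selected)
```

On the Figure 4 toy tree (as we read it) with l=5, the sketch first selects the path {1, 5} (AI=55) and then the path {4, 11, 13}, yielding {1, 4, 5, 11, 13}.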

5.3 Top-l Prelim-l OS Preprocessing

Instead of operating on the complete OS, which may be expensive to generate and search, we propose to work on a smaller OS, which hopefully includes a good size-l OS. We denote such a preliminary partial OS as a prelim-l OS (with size j, where l ≤ j ≤ |OS|). On the prelim-l OS, we can apply any of the algorithms proposed so far (of course, DP is not expected to return the optimal result, unless the prelim-l OS is guaranteed to include it). The rationale of the prelim-l OS is to avoid the extraction and processing of tuples that are not promising candidates for the optimal size-l OS. Algorithm 4 is a pseudocode for computing the prelim-l OS, Table 1 summarizes symbols and definitions and Figure 7 illustrates an example.

Determining a prelim-l OS that includes the optimal size-l OS can be very expensive; therefore, we propose a heuristic which produces a prelim-l OS that includes at least the l nodes of the complete OS with the largest local importance (denoted as the top-l set). Figure 7(a) illustrates such a prelim-l OS. Using avoidance conditions and simple statistics that summarize the range of local importance of the tuples in each relation (e.g., max(R_i)), we can infer upper bounds for the local importance of tuples and thus safely predict whether a candidate path can potentially produce useful tuples.

DEFINITION 2. Given an OS and an integer l, a top-l prelim-l OS (or simply prelim-l OS) is a subset of the complete OS that includes the l tuples of the OS with the largest local importance.

We annotate each relation R_i in the G_DS graph with the statistics max(R_i) and mmax(R_i) (see Figure 2). (Recall from Section 2.1 that we generate G_DS graphs for every relation that may contain information about DSs.) max(R_i) is the maximum local importance over all tuples in R_i, which can be derived from the maximum global importance in R_i (a global statistic that is computed/updated independently of the queries) and the affinity Af(R_i). mmax(R_i) is the maximum local importance over all tuples that belong to R_i's descendant relation nodes in G_DS (i.e., max_j{max(R_j)}, where j ranges over all such relations), or 0 if R_i has no descendants (leaf node).

The algorithm for generating the prelim-l OS is an extension of the complete OS generation algorithm (e.g., Algorithm 5). The extension incorporates pruning conditions in order to avoid adding fruitless tuples and their subtrees to the prelim-l OS. More precisely,


Table 1: Symbols and Definitions (Top-l Prelim-l OSs)

Symbol | Definition
top-l | The l nodes with the largest local importance in the OS
top-l PQ | An l-sized priority queue holding the currently largest local importance values of extracted tuples
largest-l | The tuple with the l-th largest local importance retrieved so far (i.e., the smallest value in top-l PQ), or 0 if |top-l PQ| < l
li(t_i) | The local importance of tuple t_i (i.e., Im(OS, t_i))
R(t_i) | The relation in G_DS that tuple t_i belongs to
R_i(t_j) | The subset of R_i that joins with tuple t_j
max(R_i) | The maximum local importance over the tuples of R_i
mmax(R_i) | The maximum value of max(R_j) over all of R_i's descendant nodes in G_DS, or 0 if R_i has no descendants (leaf node)
fruitless tuple | A tuple not in the top-l
fruitless G_DS relation/sub-tree | A G_DS sub-tree starting from relation R_i is considered fruitless for a given largest-l if no tuples from R_i or its descendants can be fruitful for the top-l (i.e., when largest-l ≥ max(R_i) AND largest-l ≥ mmax(R_i))
fruitful-l relation | A relation R_i is considered fruitful-l for a given largest-l if only up to l nodes from the corresponding R_i(t_j) can be fruitful for the top-l (i.e., when largest-l ≥ mmax(R_i))

we traverse the G_DS graph in a breadth-first order. Every extracted tuple is appended to the prelim-l OS (lines 2 and 14) and to queue Q (which facilitates the breadth-first traversal of the G_DS; see lines 3 and 15). Let largest-l be the tuple with the l-th largest local importance retrieved so far. If the local importance of the current tuple t_i is greater than largest-l, t_i is also added to the l-sized priority queue top-l PQ (in order to update the top-l set; lines 4 and 17). largest-l is set to the current smallest value in top-l PQ, or to 0 if top-l PQ does not yet contain l values (lines 20-23). We traverse the G_DS as follows. For each tuple de-queued from the queue Q (line 6), we extract all its child nodes from each corresponding child relation (lines 7-12), employing the following avoidance conditions:

Avoidance Condition 1 (Avoiding fruitless G_DS sub-trees): If the top-l PQ already contains l tuples and largest-l is greater than or equal to the local importance of all tuples of the current relation R_i and of all its descendants (i.e., largest-l ≥ max(R_i) AND largest-l ≥ mmax(R_i)), then there is no need to traverse the sub-tree starting at R_i (line 8). In such cases, we say that the sub-tree starting from R_i is fruitless. For instance, consider the example of Figure 7; while retrieving tuple y8, largest-l=0.37 and the current child relation R_i is Conference with max(R_i)=0.22 and mmax(R_i)=0. Thus, we can safely infer that Conference has no fruitful tuples for the particular prelim-l OS. This avoidance condition does not require any I/O operations, as all the required information can be cheaply obtained from the annotated G_DS.

Avoidance Condition 2 (Limiting tuple extractions from fruitful-l relations to at most l): Assume that we are about to traverse R_i in order to extract R_i(t_j): the tuples in R_i which join with the parent tuple t_j. We can limit the number of tuples returned by this join to at most l if we can safely predict that none of their descendants (if any) can be fruitful for the top-l. We say a relation R_i in the G_DS is considered fruitful-l for a given largest-l if we can safely predict that only up to l tuples from R_i can be fruitful for the top-l and none of their descendants (if any); this is the case when largest-l ≥ mmax(R_i) but largest-l < max(R_i). In other words, from a fruitful-l relation we can safely extract only up to l tuples greater than largest-l; i.e., there is no need to compute the complete join. For instance, consider the example of Figure 7, where we are about to traverse the fruitful-l relation PaperCitedBy (a leaf node in the G_DS, and thus a fruitful-l relation) in order to extract the joins with Paper tuple p2. Then, we can extract from the database only up to l

Algorithm 4 The Prelim-l OS Generation Algorithm

Prelim-l OS Generation(l, t_DS, G_DS)
1: largest-l = 0
2: add t_DS as the root of the prelim-l
3: enQueue(Q, t_DS)
4: enQueue(top-l PQ, t_DS)
5: while !(IsEmptyQueue(Q)) do
6:   t_j = deQueue(Q)
7:   for each child relation R_i of R(t_j) in G_DS do
8:     if !(largest-l ≥ max(R_i) AND largest-l ≥ mmax(R_i)) then ⊲ Av. Cond. 1
9:       if (largest-l ≥ mmax(R_i)) then
10:        R_i(t_j) = "SELECT TOP l * FROM R_i WHERE (t_j.ID = R_i.ID AND R_i.li > largest-l)" ⊲ Av. Cond. 2; t_j.ID and R_i.ID represent the keys on which t_j and R_i join, and R_i.li the local importance attribute of R_i
11:      else
12:        R_i(t_j) = "SELECT * FROM R_i WHERE (t_j.ID = R_i.ID)"
13:      for each tuple t_i of R_i(t_j) do
14:        add t_i to the prelim-l as a child of t_j
15:        enQueue(Q, t_i)
16:        if (li(t_i) > largest-l) then
17:          enQueue(top-l PQ, t_i)
18:          if (|top-l PQ| > l) then
19:            deQueue(top-l PQ)
20:        if (|top-l PQ| < l) then
21:          largest-l = 0
22:        else
23:          largest-l = Smallest(top-l PQ)

tuples with local importance greater than largest-l (which is 0, since |top-l PQ| < l). Similarly, when traversing the fruitful-l relation PaperCites with largest-l=0.12, we extract up to l tuples larger than largest-l. Note that the Paper relation is not fruitful-l, since largest-l=0 and mmax(R_Paper)=7.38, thus largest-l < mmax(R_Paper). As a consequence, we cannot apply this avoidance condition and hence we need to extract all tuples of Paper. Note that this condition has no impact on M:1 relationships, since the maximum cardinality of R_i(t_j) is 1 anyway.
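The decision the two avoidance conditions make for each child relation can be summarized in a few lines. This is our own illustrative sketch (the function name and return labels are not from the paper), taking the current largest-l threshold and the precomputed statistics max(R_i) and mmax(R_i):

```python
def traversal_action(largest_l, max_ri, mmax_ri):
    """Decide how to handle a child relation Ri during prelim-l generation."""
    if largest_l >= max_ri and largest_l >= mmax_ri:
        return "skip-subtree"        # Av. Cond. 1: Ri and its descendants are fruitless
    if largest_l >= mmax_ri:
        return "fetch-top-l-only"    # Av. Cond. 2: at most l tuples of Ri can matter
    return "fetch-all"               # descendants may still be fruitful: full join
```

With the running example's numbers, the Conference relation (max=0.22, mmax=0) is skipped once largest-l=0.37, while a leaf relation such as PaperCitedBy (mmax=0) triggers the limited, top-l-only extraction.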

In terms of cost, in the worst case we need up to n I/O accesses (if operating directly on the database), where n is the number of nodes in the complete OS, even if we extract only j tuples (recall that Avoidance Condition 2 still requires an I/O access even when it returns no results). In practice, however, there can be significant savings if the top-l tuples are found early and large subtrees of the complete OS are pruned. The prelim-l OS created according to Definition 2 does not necessarily contain the optimal size-l OS; e.g., the prelim-5 OS of our example does not contain the ca16 node, which belongs to the optimal size-5 OS. In practice, we found that in most cases the prelim-l OS did contain the optimal solution. This means that all size-l OS computation algorithms may give the same results whether applied on the prelim-l or on the complete OS. The following lemma proves that if monotonicity holds, then the prelim-l OS will certainly include the optimal size-l OS.

LEMMA 3. When the nodes of an OS have local importance scores that monotonically decrease with their distance from the root, then the prelim-l OS contains the optimal size-l OS.

PROOF. When monotonicity holds, the optimal size-l OS is the top-l set (as shown by Lemma 2). Therefore, the prelim-l OS produced by this algorithm, which contains the top-l set, contains the optimal size-l OS.

Finally, we note that we have also investigated a variant of the prelim-l OS which includes the largest top-path-l nodes (rather than the top-l), namely the l tuples with the largest AI(p_i). However, this approach did not result in better time or approximation quality, so we do not discuss it further.
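Putting the traversal, the l-sized priority queue and the two avoidance conditions together, the prelim-l generation loop can be sketched as below. This is an in-memory approximation under our own assumptions: `join(ri, tj)` stands in for the SQL join of Algorithm 4's lines 10/12, tuples carry their local importance in `li`, and `rel_of`/`child_rels` encode the G_DS schema; none of these names come from the paper.

```python
import heapq
from collections import deque

def prelim_l_os(l, t_ds, rel_of, child_rels, join, max_r, mmax_r, li):
    """Breadth-first prelim-l OS generation with both avoidance conditions."""
    os_nodes, q = [t_ds], deque([t_ds])
    top_pq = [li[t_ds]]                      # min-heap of the l best importances
    while q:
        tj = q.popleft()
        largest_l = top_pq[0] if len(top_pq) == l else 0.0
        for ri in child_rels.get(rel_of[tj], []):
            if largest_l >= max_r[ri] and largest_l >= mmax_r[ri]:
                continue                     # Av. Cond. 1: skip fruitless sub-tree
            if largest_l >= mmax_r[ri]:      # Av. Cond. 2: fruitful-l relation
                cand = [t for t in join(ri, tj) if li[t] > largest_l]
                cand = sorted(cand, key=li.get, reverse=True)[:l]
            else:
                cand = join(ri, tj)          # descendants may matter: fetch all
            for ti in cand:
                os_nodes.append(ti)
                q.append(ti)
                largest_l = top_pq[0] if len(top_pq) == l else 0.0
                if li[ti] > largest_l:
                    heapq.heappush(top_pq, li[ti])
                    if len(top_pq) > l:      # keep the heap l-sized
                        heapq.heappop(top_pq)
    return os_nodes
```

On a toy Author → Paper → CitedBy schema, low-importance citation tuples below an already-full top-l PQ are never materialized, mirroring the savings described above.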


[Figure 7(a) depicts the complete OS, the prelim-l OS and the top-l set; tuples are annotated with IDs (e.g., a1, p2, y8) and local importance values (e.g., .40, .22, .25). Nodes with low transparency are pruned tuples (e.g., pc7, ca10, etc.), shaded nodes are the top-l set (e.g., a1, pc6, etc.) and the rest are the remaining tuples of the prelim-l OS (e.g., p2, p3, etc.). Figure 7(b) tabulates the values of t_j, R_i, R_i(t_j), Q, top-l PQ and largest-l at each step of the prelim-l OS generation, marking where Avoidance Conditions 1 and 2 fire.]

Figure 7: The Prelim-l OS Generation Algorithm (l=5, t_DS=a1, and G_DS=G_Author)

6. EXPERIMENTAL EVALUATION

In this section, we experimentally evaluate the proposed size-l OS concept and algorithms. We evaluate our algorithms using both complete and prelim-l OSs. First, the effectiveness of the proposed size-l OSs is thoroughly investigated with the help of human evaluators. Then, the quality of the size-l OSs produced by the greedy heuristics is compared to that of the corresponding optimal OSs. Finally, the efficiency of the algorithms is comparatively investigated.

We used two databases: DBLP² and TPC-H³ (we used scale factor 1 in generating the TPC-H dataset). The two databases have 2,959,511 and 8,661,245 tuples, occupying 319.4MB and 1.1GB on disk, respectively.

We use ObjectRank (global) [3] and ValueRank [9] to calculate the global importance of the tuples of the DBLP and TPC-H databases, respectively. For a more thorough evaluation, we investigate scores produced by various settings that have been studied in [3], namely two GAs: (1) GA1 (default), presented in Figure 13, and (2) GA2, which for DBLP has common transfer rates (0.3) for all edges and for TPC-H neglects values (i.e., becomes an ObjectRank GA); and three values of d: d1=0.85 (default), d2=0.10 and d3=0.99. We use Equation 1 to calculate affinity (alternatively, an expert can define G_DSs and affinity manually, i.e., select which relations to include in each G_DS and their affinity). For the experiments, we used Java, MySQL, cold cache and a PC with an AMD Phenom 9650 2.3GHz (Quad-Core) processor and 4GB of memory.

²http://www.informatik.uni-trier.de/~ley/db/
³http://www.tpc.org/tpch/

6.1 Effectiveness

We used human evaluators to measure effectiveness. First, we familiarized them with the concepts of OSs in general and size-l OSs in particular. Specifically, we explained that a good size-l OS should be a stand-alone and meaningful synopsis of the most important information about the particular DS. Then, we provided them with OSs and asked them to size-l them for l = 5, 10, 15, 20, 25, 30. None of our evaluators were involved in this paper. Figure 8 measures the effectiveness of our approach as the average percentage of tuples that exist both in the evaluators' size-l OSs and in the size-l OSs computed by our methods. This measure corresponds to recall and precision at the same time, as the compared OSs have a common size.

DBLP. Since the DBLP database includes data about real people and their papers, we asked the DSs themselves (i.e., eleven authors listed in DBLP) to suggest their own Author and Paper size-l OSs. The rationale of this evaluation is that the DSs themselves have the best knowledge of their work and can therefore provide accurate summaries. Figures 8(a) and (b) plot the recall of the optimal size-l OS for various ObjectRank settings. In general, ObjectRank scores produced with GA1-d1 and GA1-d3 are good options for Author and Paper size-l OS generation (as these settings produce similar ObjectRank scores) and always dominate for larger values of l. More precisely, for GA1-d1, effectiveness ranges from 75% to 90% for l=10 to 30, and from 40% to 60% for l=5. These results are very encouraging. User evaluation also revealed that the inter-relational ranking properties (e.g., whether paper p1 is more important than author a1) crucially affect the quality of the size-l OSs. For instance, on Author OSs, evaluators first selected important Paper tuples to include in the size-l OS and then additional tuples such as co-authors, years and conferences (these were usually included in summaries of larger sizes, i.e., l ≥ 10). The bias toward selecting Papers (i.e., 1st-level neighbors) is favored by setting GA1-d2, although overall this setting was not very effective; e.g., in Figure 8(a), this setting achieves 73.3% (in comparison to 60% for GA1-d1) for l=5.

The impact on effectiveness of the approximate size-l OSs produced by our greedy algorithms is very minor. For instance, using scores produced by the default setting (i.e., GA1 and d1=0.85) on the Author G_DS, the Update Top-Path-l algorithm generates summaries of the same effectiveness as the optimal ones, whereas Bottom-Up has a very minor additional loss ranging from 2% to 10%. On the Paper G_DS, all approaches give the same effectiveness, as they all return the optimal size-l OSs. The use of prelim-l OSs had no impact on effectiveness. As we show later, prelim-l OSs have a very minor impact on approximation quality, which did not affect effectiveness.

TPC-H. We presented 16 random OSs to eight evaluators and asked them to size-l them. The evaluators were professors and researchers from Manchester and Hong Kong universities. In addition, for each OS and tuple, a set of descriptive details and statistics was also provided. For instance, for a customer, the total number, size and value of orders and the corresponding minimum, median and maximum values over all customers were provided (similarly to the evaluation in [9]). The provision of such details gave the evaluators a better knowledge of the database.

In summary, GA1 (for any d) is a safe option, as it produces good size-l OSs for both Customer and Supplier OSs (Figures 8(c) and 8(d)); e.g., effectiveness results for GA1-d1 range from 60% to 78%. On the other hand, GA2, which is the ObjectRank version of GA1, did not satisfy the evaluators as much on Supplier OSs.


[Figure 8 plots effectiveness (%) against l = 5, 10, 15, 20, 25, 30 for the settings GA1-d1, GA1-d2, GA1-d3 and GA2-d1 on four panels: (a) DBLP Author (Optimal size-l OS), (b) DBLP Paper (Optimal size-l OS), (c) TPC-H Customer (Optimal size-l OS) and (d) TPC-H Supplier (Optimal size-l OS).]

Figure 8: Effectiveness (i.e., Recall = Precision)

Interestingly, we observe that the effectiveness results for size 5 were very good on both OSs, due to good inter-relational ranking.

Comparative Evaluation. We compared our results with Google Desktop (a text document search engine). We stored each OS as an HTML file and then issued the corresponding query using Google Desktop in order to obtain its snippet. Google snippets contain a small number of words from the beginning of the file, combining static text such as "Search for Christos Faloutsos in the DBLP Database" and the first few tuples (up to three) from the OS (note that the order of nodes in an OS is random). We make a less austere comparison by counting the selected tuples that belong to the corresponding size-5 OS proposed by our evaluators (since Google snippets contain only up to three tuples). As expected, in all cases Google snippets found zero, and exceptionally one, tuple from the corresponding size-5 OS. Detailed results are not shown due to space constraints.

6.2 Approximation Quality

We now compare the importance of the size-l OSs produced by the greedy methods against that of the optimal ones. More precisely, the results of Figure 9 report the approximation quality, namely the ratio of the achieved size-l OS importance to the optimal importance. We present the average results for 10 random OSs per G_DS. The average size (i.e., the number of tuples) of the OSs is also indicated (denoted as Aver(|OS|)).

Figures 9(a)-(e) show the approximation quality produced by the default settings (i.e., GA1 and d1=0.85). The results show that Update Top-Path-l is always better than the Bottom-Up Pruning algorithm. In general, the superiority of Update Top-Path-l over Bottom-Up Pruning is up to 10% (excluding Paper OSs, where all methods achieved 100%). The evaluation also reveals that top-l prelim-l OSs incur very low approximation quality loss. They have no impact on the Bottom-Up algorithm and only up to 4% on the Update Top-Path-l algorithm. Another observation is that the contents of the G_DS and the values of the local importance scores also have a significant impact. For instance, for Paper OSs all methods achieved 100% quality. This is because the monotonicity property

[Figure 9 plots approximation quality (%) against l = 5 to 50 for Bottom-Up and Update Top-Path-l, each on both complete and prelim-l OSs, over six panels: (a) DBLP Author (Aver|OS|=1116), (b) DBLP Paper (Aver|OS|=367), (c) TPC-H Customer (Aver|OS|=176), (d) TPC-H Supplier (Aver|OS|=1341), (e) DBLP Author (|OS|=67) and (f) DBLP Author (Aver|OS|=1116) under the various settings that produced global importance.]

Figure 9: Approximation Quality

holds (Lemma 2); the Paper GDS is Paper →(Author, PaperCit-

edBy, PaperCites, Year →(Conference)) and the local importance

of Conferences is always smaller than those of the corresponding

Years. In general, inter-relational and intra-relational ranking of tu-

ples have an impact as well. For instance, Figure 9(f) summarizes

the average approximation quality for Author OSs with global im-

portance scores produced by the various settings (where inter and

intra relational scores vary). The experimental results also reveal

that the smaller the OS is in comparison to lthe more accurate our

algorithms are. For example, the particular Author OS of Figure

9(e) with |OS|=67 yields 100% approximation quality from all al-

gorithms, by l=25.

6.3 Efﬁciency

We compare the run-time performance of our algorithms in Figure 10. We used the same OSs as in Section 6.2 (i.e. the same 10 OSs per GDS) and used the default settings for generating the global importance of tuples (alternative settings do not have any impact on the performance). Figures 10(a)-(e) show the costs of our algorithms for computing size-l OSs from OSs of various sizes and different l values, excluding the time required to generate the OS on which each algorithm operates. Figures 10(a)-(d) show the costs of OSs from various GDSs, while Figure 10(e) shows scalability for Author OSs of different sizes and common l=10 (analogous results were obtained from all GDSs but we omit them due to space limitations). Note that the y-axes (time) in all graphs are split into two parts, one linear (bottom) and one exponential (top), in order to show how the expensive DP scales and at the same time keep the differences between the other methods visible.

[Figure 10: Efficiency. Panels: (a) DBLP Author (Aver|OS|=1116), (b) DBLP Paper (Aver|OS|=376), (c) TPC-H Customer (Aver|OS|=176), (d) TPC-H Supplier (Aver|OS|=1341), (e) DBLP Author (size-10 OS, |OS| from 67 to 1309), (f) TPC-H Supplier (Aver|OS|=1341; cost breakdown into OS generation and size-l computation, on the complete OS and the prelim-l OS, for l=10 (|Prelim-l OS|=134) and l=50 (|Prelim-l OS|=259)). Panels (a)-(e) plot the time (s) of Bottom-Up, Update Top-Path-l and Optimal, each on the complete OS and on the prelim-l OS.]

As expected, the OS size and l affect the cost significantly (the bigger the OS or l is, the more time is required). The cost of DP becomes unbearable for moderate to large OSs and values of l (we had to stop the algorithm after 30 min. of running). Bottom-Up Pruning is consistently faster than Update Top-Path-l, as it requires fewer operations. An interesting observation is that Bottom-Up Pruning on the complete OS becomes faster as l increases, because n−l drops and fewer de-heaping operations are needed.
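The n−l behaviour can be seen in a minimal sketch of one plausible reading of the bottom-up idea: keep the current leaves in a min-heap and repeatedly de-heap and discard the least important leaf until l tuples remain, so exactly n−l removals are performed. The tree encoding and importance scores below are assumptions for illustration, not the paper's exact data structures.

```python
import heapq

def bottom_up_prune(parent, importance, l):
    """Greedily prune the least-important leaves of an OS tree until l
    tuples remain (n - l de-heap/removal steps). parent maps each node to
    its parent (the root maps to None); importance maps nodes to scores."""
    children = {n: 0 for n in parent}
    for n, p in parent.items():
        if p is not None:
            children[p] += 1
    kept = set(parent)
    # Heap of current leaves keyed by importance; the root is never pruned.
    heap = [(importance[n], n)
            for n in kept if children[n] == 0 and parent[n] is not None]
    heapq.heapify(heap)
    while len(kept) > l and heap:
        _, n = heapq.heappop(heap)   # de-heap the least important leaf
        kept.discard(n)
        p = parent[n]
        children[p] -= 1
        if children[p] == 0 and parent[p] is not None:
            heapq.heappush(heap, (importance[p], p))  # p became a leaf
    return kept

# Toy OS: root r with children a (0.9) and b (0.2); a has child c (0.5).
par = {"r": None, "a": "r", "b": "r", "c": "a"}
imp = {"r": 1.0, "a": 0.9, "b": 0.2, "c": 0.5}
print(sorted(bottom_up_prune(par, imp, 3)))  # ['a', 'c', 'r']
```

With n=4 and l=3 only one de-heap is needed; a larger l means fewer removals, matching the observed speed-up.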

Figure 10(f) breaks down the cost into OS generation (bottom of the bar) and size-l computation (top of the bar) for each method. We investigated two approaches for generating the OSs; the first employs an in-memory data-graph and the second computes the OS directly from the database. The OSs are generated much faster using the data-graph; thus, we present only the data-graph based results in Figure 10(f). For example, to generate the Supplier OSs (which have the largest sizes among all tested OSs) only 0.2 sec. are required using the data-graph, compared to 12.9 sec. directly from the database. The DBLP and TPC-H data-graphs take only 17 sec. and 128 sec. to generate and occupy 150MB and 500MB, respectively. More precisely, our data-graph nodes correspond to the database tuples and edges to tuple relationships (through their primary and foreign keys). Note that the data-graph is only an index and does not contain actual data, as nodes capture only keys and global importance. Figure 10(f) also shows the average sizes of the complete OSs (1,341) and the prelim-l OSs (134 and 259 for l=10 and l=50, respectively). The prelim-l OS generation is always faster than that of the complete OS; for instance, the prelim-5 OS's size is approximately 10% of the size of the complete OS and its generation can be done up to 2.5 times faster (the savings are not proportional, because there can be many accesses to fruitless relations during the prelim-l OS generation; i.e. Avoidance Condition 2 still requires access to relations even when it returns no results); thus, prelim-l OSs further reduce the time required by our algorithms. Bottom-Up Pruning becomes on average up to 5.7 times faster, whereas Update Top-Path-l is up to 4.1 times faster. Note that the size of the database does not impact the OS generation time, because hash maps are used to look up the required nodes of an OS; we omit experimental results due to space constraints.
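The data-graph index described above can be sketched as hash-map-backed adjacency lists whose nodes hold only a tuple key and its global importance; the class shape, key format and scores below are illustrative assumptions, not the paper's implementation.

```python
class DataGraph:
    """In-memory index over a database: nodes store only a tuple key and a
    global importance score; edges follow primary/foreign-key links."""
    def __init__(self):
        self.importance = {}   # tuple key -> global importance score
        self.adj = {}          # tuple key -> neighbouring tuple keys

    def add_node(self, key, score):
        self.importance[key] = score
        self.adj.setdefault(key, [])

    def add_edge(self, k1, k2):
        # An undirected PK/FK relationship between two tuples.
        self.adj[k1].append(k2)
        self.adj[k2].append(k1)

g = DataGraph()
g.add_node("Supplier:17", 0.8)
g.add_node("Lineitem:42", 0.3)
g.add_edge("Supplier:17", "Lineitem:42")
print(g.adj["Supplier:17"])  # ['Lineitem:42']
```

Because look-ups are constant-time hash-map accesses, traversing an OS's nodes depends on the OS size rather than the database size, consistent with the observation above.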

Discussion. In summary, the DP algorithm is not practical for large OSs and l's, whereas our greedy algorithms are very fast and, as we showed in Section 6.2, their results are of high approximation quality. Note that, in this paper, our main focus has been on optimizing the size-l OS generation, not the OS generation cost (which we leave for further investigation as future work). In addition, the use of prelim-l OSs is consistently a better choice over the complete OSs, since they are always faster with a very minor quality loss. If we need to find the size-l OS at high speed, then Bottom-Up Pruning is a good choice, since it is consistently the fastest method (e.g. using the prelim-50 for Supplier costs 0.12+0.12=0.24 sec.). If the OS has to be generated from the database, then the Update Top-Path-l algorithm is preferable, as it gives better quality and is only insignificantly more expensive (e.g. 8.08+0.32 sec.).

7. CONCLUSION AND FUTURE WORK

We investigated the effective and efficient generation of size-l OSs. First, we gave a formal definition of the size-l OS, which targets the synoptic and stand-alone presentation of a large OS. We proposed a dynamic programming algorithm and two efficient greedy heuristics for producing size-l OSs. In addition, we proposed a preprocessing strategy that avoids generating the complete OS before producing size-l OSs. A systematic experimental evaluation conducted on the DBLP and TPC-H databases verifies the effectiveness, approximation quality and efficiency of our techniques.

A direction of future work concerns the further exploration of algorithms using hashing and reachability indexing techniques [18]. Another challenging problem is the combined size-l and top-k ranking of OSs. In addition, the selection of an appropriate value for l is an interesting problem; a natural approach is to select l based on the number of attributes or words it will produce, e.g. 20 attributes or 50 words. However, this approach results in a reformulation of the problem, which we plan to investigate. Finally, we observe that, in the general case, the optimal size-l OSs for different l can be very different. This prevents the incremental computation of a size-l OS from the optimal size-(l−1) OS, limiting pre-computation or caching approaches that could accelerate computation. In the future, we plan to experimentally analyze the space of optimal size-l OSs and identify potential similarities among them that could assist their pre-computation and compression.

8. ACKNOWLEDGEMENTS

We would like to thank Vagelis Hristidis for providing us with his ObjectRank code and DBLP database and our evaluators for their generous help and comments (in particular, Dimitris Papadias, George Samaras and Christos Schizas). Finally, we thank the anonymous reviewers for their constructive comments.

9. REFERENCES

[1] Data Protection Act, 1998. http://en.wikipedia.org/wiki/Data_Protection_Act_1998.

[2] B. Aditya, G. Bhalotia, S. Chakrabarti, A. Hulgeri, C. Nakhe,

and P. S. Sudarshan. BANKS: Browsing and keyword

searching in relational databases. In VLDB, pages

1083–1086, 2002.

[3] A. Balmin, V. Hristidis, and Y. Papakonstantinou.

Objectrank: Authority-based keyword search in databases. In

VLDB, pages 564–575, 2004.

[4] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and

S. Sudarshan. Keyword searching and browsing in databases

using BANKS. In ICDE, pages 431–440, 2002.

[5] S. Brin and L. Page. The anatomy of a large-scale

hypertextual web search engine. In WWW Conference, pages

107–117, 1998.

[6] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation

algorithms for middleware. In PODS, pages 102–113, 2001.

[7] G. J. Fakas. Automated generation of object summaries from

relational databases: A novel keyword searching paradigm.

In DBRank’08, ICDE, pages 564–567, 2008.

[8] G. J. Fakas. A novel keyword search paradigm in relational

databases: Object summaries. Data Knowl. Eng.,

70(2):208–229, 2011.

[9] G. J. Fakas and Z. Cai. Ranking of object summaries. In

DBRank’09, ICDE, pages 1580–1583, 2009.

[10] G. J. Fakas, B. Cawley, and Z. Cai. Automated generation of

personal data reports from relational databases. JIKM,

10(2):193–208, 2011.

[11] L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram.

XRANK: Ranked keyword search over XML documents. In

SIGMOD, pages 16–27, 2003.

[12] V. Hristidis, L. Gravano, and Y. Papakonstantinou. Efﬁcient

ir-style keyword search over relational databases. In VLDB,

pages 850–861, 2003.

[13] V. Hristidis and Y. Papakonstantinou. Discover: Keyword

search in relational databases. In VLDB, pages 670–681,

2002.

[14] Y. Huang, Z. Liu, and Y. Chen. Query biased snippet generation in XML search. In SIGMOD, pages 315–326, 2008.

[15] G. Koutrika, A. Simitsis, and Y. Ioannidis. Précis: The essence of a query answer. In ICDE, pages 69–79, 2006.

[16] F. Liu, C. Yu, W. Meng, and A. Chowdhury. Effective

keyword search in relational databases. In SIGMOD, pages

563–574, 2006.

[17] Y. Luo, X. Lin, W. Wang, and X. Zhou. SPARK: Top-k

keyword query in relational databases. In SIGMOD, pages

115–126, 2007.

[18] A. Markowetz, Y. Yang, and D. Papadias. Reachability

indexes for relational keyword search. In ICDE, pages

1163–1166, 2009.

[19] A. Simitsis, G. Koutrika, and Y. Ioannidis. Précis: From unstructured keywords as queries to structured databases as answers. The VLDB Journal, 17(1):117–149, 2008.

[20] A. Tombros and M. Sanderson. Advantages of query biased

summaries in information retrieval. In SIGIR, pages 2–10,

1998.

[21] A. Turpin, Y. Tsegay, D. Hawking, and H. E. Williams. Fast

generation of result snippets in web search. In SIGIR, pages

127–134, 2007.

[22] R. Varadarajan, V. Hristidis, and L. Raschid. Explaining and

reformulating authority ﬂow queries. In ICDE, pages

883–892, 2008.

[23] B. Yu, G. Li, K. Sollins, and A. K. H. Tung. Effective

keyword-based selection of relational databases. In

SIGMOD, pages 139–150, 2007.

APPENDIX

Algorithm 5 The OS Generation Algorithm

OS-Generation(tDS, GDS)
1: add tDS as the root of the OS
2: enQueue(Q, tDS) ⊲ Queue Q facilitates breadth-first traversal
3: while !(isEmptyQueue(Q)) do
4:   tj = deQueue(Q)
5:   for each child relation Ri of R(tj) in GDS do
6:     Ri(tj) = "SELECT * FROM Ri WHERE (tj.ID = Ri.ID)"
7:     for each tuple ti of Ri(tj) do
8:       add ti on the OS as a child of tj
9:       enQueue(Q, ti)
10: return OS
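Algorithm 5 can be sketched in runnable form as follows. The helpers `child_relations` (the child relations of a tuple's relation in the GDS) and `fetch_children` (standing in for the "SELECT * FROM Ri WHERE tj.ID = Ri.ID" query) abstract the schema and database access and are assumptions for illustration.

```python
from collections import deque

def generate_os(t_ds, child_relations, fetch_children):
    """Breadth-first OS generation following Algorithm 5.
    child_relations(t) lists the child relations of t's relation in the GDS;
    fetch_children(r_i, t) returns the tuples of r_i joining with t."""
    os_tree = {t_ds: []}          # node -> list of its children in the OS
    queue = deque([t_ds])         # queue facilitates breadth-first traversal
    while queue:
        t_j = queue.popleft()
        for r_i in child_relations(t_j):
            for t_i in fetch_children(r_i, t_j):
                os_tree.setdefault(t_i, [])
                os_tree[t_j].append(t_i)   # add t_i as a child of t_j
                queue.append(t_i)
    return os_tree

# Toy example: an Author DS with one child relation, Paper.
def child_relations(t):
    return ["Paper"] if t == "author1" else []

def fetch_children(r, t):
    return ["paper1", "paper2"] if (r, t) == ("Paper", "author1") else []

tree = generate_os("author1", child_relations, fetch_children)
print(tree["author1"])  # ['paper1', 'paper2']
```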

[Figure 11: The TPC-H Database Schema. Relations: Region, Nation, Customer, Supplier, Orders, Lineitem, Parts, Partsupp.]

[Figure 12: The TPC-H Customers GDS, annotated with (Affinity), max(Ri) and mmax(Ri). Nodes: Customer (1) 1.65, 5.39; Nation (0.97) 3.12, 1.82; Region (0.91) 1.82, 0; Supplier (0.52) 0, 0; Order (0.95) 2.43, 5.39; Lineitem (0.87) 1.19, 5.39; Partsupp (0.77) 5.39, 0; Partsupp (0.43) 0, 0; Lineitem (0.34) 0, 0; Parts (0.36) 0, 0; Parts (0.65) 0, 0; Supplier (0.65) 0, 0.]

[Figure 13: The GAs for the DBLP and TPC-H Databases. (a) The DBLP GA: relations Paper, Author, Year, Conference, with edge weights including 0.3 and 0.1 (Paper–Author), 0.7 (cites), 0 (cited), 0.2/0.2 (Year) and 0.3/0.3 (Conference). (b) The TPC-H GA: relations Region, Nation, Customer, Supplier, Orders, Lineitem, Parts, Partsupp, with edge weights between 0.1 and 0.5 and data-dependent weights: Partsupp Si=0.2*f(SupplyCost) and 0.5*f(SupplyCost), Parts Si=0.1*f(RetailPrice), Orders Si=0.5*f(TotalPrice), Lineitem Si=0.1*f(ExtendedPrice).]
