On the Complexity of Multi-Query Optimization in Stream Grids
Saikat Mukherjee, Srinath Srinivasa
International Institute of Information Technology
Bangalore, India
{saikat.mukherjee,sri}@iiitb.ac.in
Krithi Ramamritham
Indian Institute of Technology
Bombay, India
krithi@cse.iitb.ac.in
Abstract
Stream grids are wide-area grid computing environments
that are fed by a set of stream data sources. Such grids
are becoming more widespread due to the large-scale
deployment of sensor networks for a wide range of
applications, from monitoring geophysical activities to
supply chain management, coupled with applications like
network monitoring. Queries external to the system arrive on
any node in the grid seeking data from one or more data
streams. The kinds of queries considered in this work are
(1) lifetime queries and (2) long-running queries where new
query arrivals and query revocations are infrequent. From
the system perspective, computing the optimal query plan
for the set of queries incident on the grid would ensure
minimal system-wide resource usage, thereby maximizing the
number of concurrent queries that can be supported. The
key challenge in such a system is multi-query optimization.
In this work, we analyze the complexity of multi-query
optimization for select, project and join queries in isolation
and propose algorithms for computing optimal query plans
where polynomial-time algorithms exist.
1 Introduction
Stream grids are grid computing environments that are fed
with streaming data sources from instrumentation devices
like cameras, RFID (radio-frequency identification)
sensors or other applications. Queries by users or applications
seek to tap into one or more such streams. From the
system perspective, the important optimization goal is reduced
bandwidth consumption, which can be achieved by efficient
routing of data streams.
Queries in such grids may originate on any node and
seek data from any stream or a set of streams. Such queries
are typically long lived, but not necessarily infinitely long
lived. Traditionally, query optimization has been addressed
for two classes of queries: “one-shot” queries and infinite
or “standing” queries [6]. One-shot queries are transient
in nature and have very short life spans. In such
environments, the speed of query processing takes precedence over
computing the execution plan with optimal data stream
routing. On the other hand, for standing queries whose
lifetimes are practically infinitely long and systems where new
query arrivals and query revocations are very infrequent, it
is desirable to invest time and resources to obtain optimal
execution plans.
15th International Conference on Management of Data,
COMAD 2009, Mysore, India, December 9–12, 2009.
© Computer Society of India, 2009
The problem of generating globally optimal query plans
has been considered earlier in the context of databases [26,
10, 15], data warehousing [21] and more recently in the
context of streaming data sources like sensor networks [19,
32]. While the primary challenge in computing optimal
query plans in databases has been joins, in the data
warehousing context it has been the reuse of materialized
views, and in the streaming data context, aggregate queries.
The key results from research into the complexity of
query optimization can be summarized as follows: (a) the
optimal join ordering problem in distributed databases is
NP-hard [31], (b) the optimal materialized view selection
problem using selection granularities in data warehousing is
NP-hard [21], and (c) the problem of minimizing communication
cost is NP-hard for max and min queries [30]. While in
the context of databases it makes sense to consider selects,
projects and joins as part of a single query, for streaming
data the possibility of data sharing introduces use cases
where project, select and join queries can be required in
isolation.
In this work, we consider systems which may require
project-only, select-only, and join-only queries, thereby
necessitating a fresh look at the complexity of multi-query
optimization for such individual query types. We show that for
project queries, polynomial-time algorithms exist for
computing globally optimal query plans. We also show that data
sharing coupled with infinite data sets allows projection
results to be composed from a number of sources, with each
source having a subset of the data required to answer a query.
Finally, by using a variation of the traditional two-step
optimization process [9] involving (a) data source selection and
access path computation at runtime and (b) operator site
selection at compile time, we show that for a particular class
of queries, it is possible to achieve minimum communication
costs.
2 Related Work
Query optimization in databases is a very well studied area
in data management. The original query optimizers search the
plan space using dynamic programming [26], applying a
number of heuristics to reduce the number of options and
keep optimization tractable. One of the key observations
made in [26] was the notion of pushing down projections
and sometimes selections in the query tree to reduce
the data transfer between operators.
The focus of subsequent research was primarily on
optimizing joins [17, 10, 16, 25] and plan enumeration with
other operators in Starburst [15] and Volcano [13]. Cost
models for optimization using resource consumption as a
metric were described in [18]. The inability of resource
consumption models to incorporate operator parallelism
led to the development of response time models as
described in [10].
A framework for determining the complexity of a general
class of distributed query processing problems is discussed
in [31], while the NP-hard nature of optimal materialized
view selection based on selection granularities is discussed
in [21].
The concepts of data, query and hybrid shipping were
introduced in SHORE [5] and have evolved into the operator
placement problem in network-aware query processing
[1, 22]. The two-step optimization process, where query
plans are generated in parts at compile time and runtime, is
discussed in [4, 9].
Multi-query optimization in [27, 24, 29] provides
heuristics for computing the query plan, while [14, 7, 12]
discuss issues with scheduling, pipelining and caching
techniques in multi-query optimization. The NP-hard nature
of computing multi-query optimization in databases is
discussed in [28].
Sensor networks [19] and in-network query processing
[32] primarily focus on aggregate query optimization
to maximize the energy efficiency of sensors. The
complexity of multi-query optimization for aggregate queries
in sensor networks is evaluated in [30].
Network-aware query processing techniques described
in [1, 22] focus on the correct placement of operators in the
network. A spring relaxation technique to place operators
in the network, integrating the two-step optimization process
into a single-step optimization process, is also introduced
in [22].
3 The Grid Model
Mathematically, the stream grid is modeled as G = (X, d),
where X represents all the grid nodes. A subset of the grid
nodes, S ⊆ X, are also stream sources. d : X × X → ℝ+
is a distance function encapsulating latency between nodes.
The distance function is assumed to have the following
characteristics:
• ∀x ∈ X, d(x, x) = 0, and
• Triangle inequality:
∀x, y, z ∈ X, x ≠ y ≠ z, d(x, z) ≤ d(x, y) + d(y, z)
It is important to note that the distance function d
represents the latency incurred by the best path between pairs
of nodes. In this sense, even though the triangle inequality
doesn’t hold for packet routing on the Internet [2], it still
holds for the distance function. The space described is a
logical space, which need not directly correspond to any
geography and/or network topology.
Stream data is considered to be in the form of tuples,
with each tuple representing a row in an infinitely long
table.
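The claim that a best-path distance obeys the triangle inequality even when raw hop latencies do not can be checked on a small example. The following is a minimal sketch, assuming direct hop latencies are given as a dictionary; the names `best_path_latency`, `satisfies_model` and the input format are illustrative and not part of the paper's model.

```python
from itertools import product

INF = float("inf")

def best_path_latency(nodes, direct):
    """Return d[x][y], the latency of the best path between x and y.

    `direct` maps an unordered node pair to the latency of a direct hop.
    It is a hypothetical input format, not something the paper defines.
    """
    d = {x: {y: 0.0 if x == y else
             direct.get((x, y), direct.get((y, x), INF))
             for y in nodes} for x in nodes}
    # Floyd-Warshall: k (the intermediate node) is the outermost loop,
    # because product() varies its first position slowest.
    for k, i, j in product(nodes, repeat=3):
        if d[i][k] + d[k][j] < d[i][j]:
            d[i][j] = d[i][k] + d[k][j]
    return d

def satisfies_model(nodes, d):
    """Check d(x, x) = 0 and the triangle inequality for all triples."""
    zero_diag = all(d[x][x] == 0.0 for x in nodes)
    triangle = all(d[x][z] <= d[x][y] + d[y][z]
                   for x, y, z in product(nodes, repeat=3))
    return zero_diag and triangle

# The direct A-C hop (50) violates the triangle inequality, but the
# best-path distance routes A-C through B.
nodes = ["A", "B", "C"]
direct = {("A", "B"): 10.0, ("B", "C"): 10.0, ("A", "C"): 50.0}
d = best_path_latency(nodes, direct)
print(d["A"]["C"], satisfies_model(nodes, d))
```

Taking the all-pairs shortest-path closure is one way to obtain such a d; the paper only assumes that d has these properties and does not prescribe how it is computed.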
4 Query Types
Queries may arrive on any node in the grid requesting
one or more streams. Queries are represented as relational
algebra expressions over the data streams. For the
purposes of grid-level optimization, we consider three basic
relational operations: projections, selections and joins.
1. Projection query: q = π_{s_i1, ..., s_ik}(S_i)
2. Selection query: q = σ_{(condition)}(S_i)
3. Join query: q = S_i ⋈ S_j
At any given grid node x ∈ X, a subset of one or more
streams may be available as part of the current query execution
plan. These streams can be reused to serve other queries in
the vicinity without them having to go all the way to the
required stream sources. A grid node which is the source of a
data stream is termed a “primary source” and a grid node
providing data which is derived or processed from another
grid node is termed a “secondary source.”
5 Optimization Objective
From the user perspective, the key optimization goal is to
ensure reduced response times or latency, while from the
system perspective, the key objective is to reduce the
required bandwidth. To ensure minimal response times, each
query would need to be satisfied by connecting to the
nearest node(s) having the required data (minimum d). It should
be noted here that minimizing network latency may result
in increased bandwidth usage if a geographically distant
node provides low latency. However, once the data source
node(s) are identified, the bandwidth required can be
reduced by optimal operator placement. A bandwidth-delay
product combines both requirements and is termed
network usage in this work. The optimization objective is to
minimize network usage.
The other parameter which influences response time is
load on a node. Nodes with heavy loads would be a
bottleneck, increasing the overall response time of queries. The
load on a node is a combination of communication and
computational load. In the work presented in this paper,
we do not evaluate the complexity of query optimization
considering the load on a node.
Network Usage: At any node x ∈ X, a given query q
is ultimately answered by returning a set of stream links
L(q) = {l_1, l_2, ..., l_n}. For instance, in Figure 1, a
query on node CN1 requesting s1 ⋈ s2 ⋈ s3 can be
answered by forming the stream link set {l_1, l_2, l_3}.
A link is a directed edge, represented as an ordered pair
l_p = (x_p, y_p). Data flows from data source y_p to destination
x_p to satisfy, in part or completely, a query at x_p. In
Figure 1, the link l_1 would be represented as (CN1, SN1).
[Figure 1: Query Result Generation using Stream Links. The figure shows grid nodes SN1, SN2 and SN3 (sources of streams s1, s2 and s3) and nodes CN1 and CN2, the queries q(s1 ⋈ s2 ⋈ s3, t1), q(s1, t2) and q(s1 ⋈ s3, t3), and the data stream links l1 to l6, including self loops.]
For a link l_i = (x_i, y_i), its network usage is given as

\[ u(l_i) = \mathrm{Bandwidth}(l_i) \cdot d(l_i) \tag{1} \]

where Bandwidth(l_i) is the data rate of the stream l_i, and
d(l_i) = d(x_i, y_i), as described earlier, is the latency of the
data stream.
Let Q be the set of all queries incident on the grid G at
any instant of time. Let L denote the set of all links that
have been returned in response to queries in Q. We refer to
L as the “Estuary graph” or the “link graph” of the grid G
for the present time. The Estuary graph is formally defined
as:

\[ L = \bigcup_{q \in Q} L(q) \tag{2} \]

The global optimization objective on network usage is to
obtain an Estuary graph such that the overall network usage
is minimized. This is stated as:

\[ \arg\min_{L} \sum_{l_i \in L} u(l_i) \tag{3} \]

6 Components of Network Usage Optimization
From Equation 1 it is evident that there are two parts to
the optimization process: (a) correct data source selection
leading to minimal d, and (b) correct operator placement
reducing the bandwidth required to transfer a data stream
from one node to another.
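The three definitions above (per-link usage, the Estuary graph as a union of per-query link sets, and the summed objective) can be made concrete with a small sketch. The link names, bandwidths and latencies below are hypothetical values chosen for illustration; only the formulas come from the text.

```python
def estuary_graph(link_sets):
    """Union of the link sets L(q) over all queries q (the Estuary graph)."""
    links = set()
    for query_links in link_sets.values():
        links.update(query_links)
    return links

def network_usage(links, bandwidth, latency):
    """Sum of u(l) = Bandwidth(l) * d(l) over every link in the graph."""
    return sum(bandwidth[l] * latency[l] for l in links)

# A link is an ordered pair (destination, source), as in the text.
l1, l2, l3 = ("CN1", "SN1"), ("CN1", "SN2"), ("CN1", "SN3")

# Hypothetical query set: q1 needs all three streams at CN1, q2 reuses l1.
link_sets = {"q1": [l1, l2, l3], "q2": [l1]}
bandwidth = {l1: 2.0, l2: 3.0, l3: 1.0}   # assumed data rates
latency = {l1: 5.0, l2: 4.0, l3: 10.0}    # assumed d(x, y) values

L = estuary_graph(link_sets)
total = network_usage(L, bandwidth, latency)
print(len(L), total)
```

Because L is a set union, a link shared by several queries contributes to the total only once, which is what makes stream reuse attractive under this objective.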
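[Figure 2: Non Composable Projection. A query q = π_{s_i1,s_i2}(S_i) at node A; nodes B and C hold the covering streams π_{s_i1,s_i2,s_i4,s_i5}(S_i) and π_{s_i1,s_i2,s_i3}(S_i).]
[Figure 3: Projection Composition. A query q = π_{s_i1,s_i2}(S_i) at node A; node B holds π_{s_i1}(S_i), node C holds π_{s_i2}(S_i), and node D is the source with π_{s_i1,s_i2,s_i3}(S_i).]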
6.1 Data Source Selection
Correct data source selection depends on the type of query
plans supported in the system. The two types of plans we
consider in this work are composable and non-composable
plans.
Non-Composable Query Plans
Consider a grid node x ∈ X processing a query of the form
q_x = π_{s_i1,s_i2}(S_i). This query can be answered by any node
x′ ∈ X that contains a stream covering the requirements
of q_x, in other words, a stream that contains either the same
set or a superset of the attributes of S_i that are required by
q_x.
This is shown in Figure 2, where a query on node A is
answered by either node B or node C. A query plan that
always taps into such covering sources is termed a
“non-composable” plan.
Composable Query Plans
Alternatively, q_x can also be answered by sourcing streams
from two or more grid nodes even when none of them
individually covers q_x, as long as the union of all the attributes
sourced from the streams covers q_x. Such a query
plan is called a “composable” query plan.
This is shown in Figure 3, where node D is the source
of the data S_i with attributes s_i1, s_i2 and s_i3. Nodes B and
C have π_{s_i1}(S_i) and π_{s_i2}(S_i) respectively, and a query q
arrives on node A for π_{s_i1,s_i2}(S_i). The query on node A is
satisfied using the subset data available at both node B and node C.
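The difference between the two plan types reduces to two set tests on attributes: a single source must be a superset of the request, whereas a composition only needs the union of the sources to be a superset. A minimal sketch, with hypothetical function names and the node contents taken from the Figure 3 description:

```python
def covering_sources(required, hosted):
    """Nodes whose hosted attribute set alone covers the request
    (candidates for a non-composable plan)."""
    return sorted(n for n, attrs in hosted.items() if required <= attrs)

def composable(required, hosted):
    """True if the union of all hosted attribute sets covers the request
    (a composable plan exists among these nodes)."""
    available = set().union(*hosted.values())
    return required <= available

# Query at node A: attributes si1 and si2 of stream Si.
required = {"si1", "si2"}

# Secondary sources B and C only, as in the Figure 3 scenario.
secondary = {"B": {"si1"}, "C": {"si2"}}
# The same nodes plus the primary source D.
with_primary = dict(secondary, D={"si1", "si2", "si3"})

print(covering_sources(required, secondary))     # no single node covers q
print(composable(required, secondary))           # but B and C together do
print(covering_sources(required, with_primary))  # D covers q on its own
```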
D(S_i)     B = π_{s_i1}(S_i)   C = π_{s_i2}(S_i)   B ∪ C = π_{s_i1,s_i2}(S_i)
..         ..                  ..                  ..
(1,1,1)    (1)                 (1)                 (1,1)
(1,1,2)    (1)                 (1)                 (1,1)
(1,2,3)    (1)                 (2)                 (1,2)
(2,3,4)    (2)                 (3)                 (2,3)
..         ..                  ..                  ..
Table 1: Continuous Data Projection Composition Possibility - No Duplicate Removal
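Table 1 can be reproduced with a few lines of code. The sketch below assumes that the two projected streams stay aligned by arrival order, so that the k-th tuple of each stream comes from the same source tuple; that alignment assumption and the helper names are illustrative and are not spelled out in the text.

```python
def project(stream, indices):
    """Project each tuple onto the given attribute positions, keeping
    duplicates (bag semantics, as needed for stream composition)."""
    return [tuple(t[i] for i in indices) for t in stream]

def compose(*parts):
    """Positionally combine aligned projected streams into wider tuples."""
    return [sum(tuples, ()) for tuples in zip(*parts)]

def dedup(stream):
    """Remove duplicates while keeping first-arrival order."""
    return list(dict.fromkeys(stream))

# The four tuples of D(Si) shown in Table 1.
D = [(1, 1, 1), (1, 1, 2), (1, 2, 3), (2, 3, 4)]
B = project(D, [0])      # stream held at node B
C = project(D, [1])      # stream held at node C

composed = compose(B, C)
direct = project(D, [0, 1])
print(composed == direct)

# With duplicate removal the streams lose their alignment and the
# composition no longer matches the direct projection.
broken = compose(dedup(B), dedup(C))
print(broken == direct)
```

The second comparison illustrates why composition is only claimed for streams without duplicate removal: once duplicates are dropped, the two streams have different lengths and can no longer be paired tuple by tuple.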
It should be noted here that composable query plans are
not possible for project operators in traditional relational
algebra working on finite data sets. However, it is
possible to compose query results for projection operations in
streaming data, if duplicates are not removed from the data
stream.
Duplicate removal in stream data itself has been
addressed using various techniques like buffering [11],
Bloom filters [20, 8] and windowing techniques [3].
If such duplicate removal techniques are used, query
result composition would not be possible. For many practical
applications involving aggregations of data elements (like
counting or computing averages over the query result),
duplicate removal is not advisable. These kinds of queries are
amenable to being answered by composable query plans.
Composability is possible with selects as well. A given
select operator σ can be answered by two or more streams
σ_1 ... σ_k that each have a smaller selectivity than σ, as long
as the combined selectivity of σ_1 ∪ ··· ∪ σ_k covers the
selectivity of σ. Composing query results based on selections is
similar to computing a query over a larger table from two or
more smaller materialized views [21]. In contrast, a query
representing a join between two or more streams always has
to be composed from the different streams.
Hence, given a query q comprising a single operation
(either select, project or join), a query plan can compute
the result in three different ways:
1. Fetch the data from the relevant node hosting the data
stream, i.e., the primary source.
2. Fetch the data from a secondary source which can
satisfy the query. A secondary source is a node which
shares data fetched from another node. A secondary
source satisfies q if the streams it hosts cover the
selectivity or attribute requirements of q.
3. Compose the query result using two or more sources.
While composability allows for query plans resulting
in lower network usage than non-composable operations,
it also adds an extra layer of complexity. When compositions
are allowed, determining the optimal query plan involves not
only identifying single sources which can satisfy the
query, but also considering all possibilities where a
combination of sources could satisfy the query.
6.2 Operator Placement
The notion of operator placement has been discussed with
respect to query tree optimization in databases [R*], data
and query shipping and network-aware query processing.
For the SPJ queries considered in this work, we consider
the correct placement of operators resulting in minimal
network usage for individual project, select and join queries
for the composable and non-composable query plans discussed
in the previous section. To decide on operator placement,
for each operator O a selectivity(O) parameter is used, which is
defined as

\[ \mathrm{selectivity}(O) = \frac{b_O(\mathrm{result})}{\sum_{i \in I} b(DS_i)} \tag{4} \]

where I is the set of input streams to the operator O,
b(DS_i) is the bandwidth required to transmit the i-th input
stream in I, and b_O(result) is the bandwidth required to
transmit the resultant data stream.
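As a quick numeric illustration of Equation 4, the following sketch computes the selectivity of an operator from the bandwidth of its result and of its inputs. The bandwidth figures are made-up values; only the ratio itself comes from the definition above.

```python
def selectivity(result_bandwidth, input_bandwidths):
    """Equation 4: result bandwidth divided by the summed input bandwidth."""
    total_in = sum(input_bandwidths)
    if total_in <= 0:
        raise ValueError("input bandwidth must be positive")
    return result_bandwidth / total_in

# Hypothetical projection keeping half of the bytes of one input stream.
sel_project = selectivity(4.0, [8.0])

# Hypothetical union of two streams: everything that comes in goes out.
sel_union = selectivity(8.0, [3.0, 5.0])

print(sel_project, sel_union)
```

A selectivity below one means the operator shrinks the data, so running it close to the data source reduces the bandwidth carried by the link; a selectivity of exactly one leaves nothing to gain from moving the operator.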
The complexity of query optimization is the combined
complexity of source selection coupled with operator
placement. If either of the two parts of the plan results in
non-polynomial-time algorithms, heuristics can be used.
7 Complexity of Network Usage Optimization
For the purpose of computing the complexity of computing
a globally optimal plan, leading to minimal network usage,
at any instant, we assume that the system is frozen on
any query arrival or revocation until the plan is computed.
In other words, query plan computation is a system-wide
atomic step.
We now consider each of the operations (projects,
selects and joins) separately for complexity calculation.
7.1 Complexity of Projections
The complexity of projection queries for composable and
non-composable query plans is as follows.
7.1.1 Composable Projects
Data Source Selection: To find the optimal query plan
for a given query set, we rewrite all project queries
requesting multiple attributes of a stream as single-attribute
queries (SAQ) of the same stream. For instance,
in Figure 3, the SAQ for query q_A at node A is given as
SAQ(q_A) = {π_{s_i1}(S_i), π_{s_i2}(S_i)}.
In a graph-theoretic sense, all nodes requiring a given
stream data attribute collectively form a directed acyclic
graph (DAG) with at least one node connected to the
original data source. For instance, the minimum spanning tree
(MST) overlays for the example in Figure 3 are given in
Figure 4, where an MST is the tree incurring minimal
network usage that spans all nodes in the DAG.
A query at a given node is satisfied by combining the
required attributes from individual data streams. For
instance, in Figure 3, the query at node A is satisfied by
taking a “union” of the streams π_{s_i1}(S_i) and π_{s_i2}(S_i) arriving
at node A, as shown in Table 1.
To find the DAG with the minimal network usage, we
use the following rationale:
1. Ignore the direction of DAG edges and consider all
stream connections between all pairs of nodes in the
grid.
2. Compute an MST for the grid based on the stream
connections.
3. The overlay with the minimum network usage
between a given source and destination is the path
between them that lies on the MST.
Algorithm 1 Minimum Spanning Tree Overlay Algorithm
Require: Grid G and project query set Q incident on G
Ensure: Overlay of MSTs M with minimum network usage U_min
 1: S ← {}, N ← {}, M ← {}, U_min ← 0
 2: for all q ∈ Q do
 3:   S = S ∪ SAQ(q)
 4: end for
 5: for all SAQ_i ∈ S do
 6:   N_i = {x : x ∈ X ∧ ∃q_x : SAQ(q_x) ⊇ SAQ_i}
 7:   N = N ∪ N_i
 8: end for
 9: for all N_i ∈ N do
10:   M_i = MST(N_i, G)
11:   M = M ∪ M_i
12: end for
13: for all M_i ∈ M do
14:   U_min = U_min + U(M_i)
15: end for
The overall algorithm, shown in Algorithm 1, is
explained as follows. Each query q ∈ Q incident on the grid
is decomposed into the individual SAQs required to satisfy q
using the SAQ(q) operator. These individual SAQs are
then added to a set S which contains all the unique,
individual SAQs required to satisfy all the queries Q incident
on the grid. For each SAQ_i ∈ S, the set of grid nodes
N_i ⊆ X which require SAQ_i to satisfy some query incident
on them is identified. N is the set of all N_i corresponding
to each SAQ_i. All nodes in N_i and the source of SAQ_i
are connected together to create an overlay of edges M_i
using a minimum spanning tree algorithm MST(N_i, G). The
network usage for the overlay M_i is given by U(M_i), and
summing the network usage over the set of all overlays M gives
the minimum network usage U_min.
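The steps of Algorithm 1 can be sketched in a few dozen lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: queries are modeled as (stream, attribute set) pairs, each single-attribute stream has a fixed bandwidth, the distances are made-up values that satisfy the triangle inequality, and Prim's algorithm stands in for the unspecified MST(N_i, G) routine.

```python
def saq(query):
    """Decompose a projection query into single-attribute queries (SAQs)."""
    stream, attrs = query
    return {(stream, a) for a in attrs}

def prim_mst(nodes, d):
    """Prim's algorithm on the complete graph over `nodes` with weights d."""
    nodes = list(nodes)
    in_tree, edges = {nodes[0]}, []
    while len(in_tree) < len(nodes):
        u, v = min(((u, v) for u in in_tree for v in nodes
                    if v not in in_tree),
                   key=lambda e: d[e[0]][e[1]])
        edges.append((u, v))
        in_tree.add(v)
    return edges

def mst_overlay(queries, source_of, bandwidth, d):
    """Build one MST overlay per SAQ and return (overlays, total usage)."""
    needed = {}                          # SAQ -> nodes requiring it (N_i)
    for node, q in queries.items():
        for s in saq(q):
            needed.setdefault(s, set()).add(node)
    overlays, total = {}, 0.0
    for s, nodes in needed.items():
        members = sorted(nodes | {source_of[s[0]]})   # N_i plus the source
        edges = prim_mst(members, d)                  # M_i
        overlays[s] = edges
        total += sum(bandwidth[s] * d[u][v] for u, v in edges)  # U(M_i)
    return overlays, total

# Scenario of Figure 3 with hypothetical symmetric distances.
pairs = {("D", "B"): 1.0, ("B", "A"): 1.0, ("D", "C"): 1.0,
         ("C", "A"): 1.0, ("D", "A"): 2.0, ("B", "C"): 2.0}
d = {x: {y: 0.0 for y in "ABCD"} for x in "ABCD"}
for (x, y), w in pairs.items():
    d[x][y] = d[y][x] = w

queries = {"A": ("Si", {"si1", "si2"}), "B": ("Si", {"si1"}),
           "C": ("Si", {"si2"})}
source_of = {"Si": "D"}
bandwidth = {("Si", "si1"): 1.0, ("Si", "si2"): 1.0}

overlays, u_min = mst_overlay(queries, source_of, bandwidth, d)
print(overlays[("Si", "si1")], overlays[("Si", "si2")], u_min)
```

With these distances the two overlays are the chains D-B-A for s_i1 and D-C-A for s_i2, which matches the shape of the overlays described for Figure 4.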
Theorem 1 An overlay of minimum spanning trees M as
computed by Algorithm 1 gives the minimum network usage
query plan U_min for a set of stream projection queries Q if
compositions are allowed.
[Figure 4: Minimum Spanning Tree Overlays. (a) s_i1 overlay over nodes D, B and A; (b) s_i2 overlay over nodes D, C and A.]
Proof We prove the above theorem by refutation. Consider
one of the MST overlays M_i, requiring minimal access
paths over the set of nodes N_i. Suppose there exists another
topology M′_i to connect the nodes in N_i with a network usage
U(M′_i) such that U(M′_i) < U(M_i). This would mean that
if we replace the overlay path in M_i with M′_i we would get
a spanning tree of smaller weight. This is a contradiction
since M_i is the minimum spanning tree.
Thus, if compositions are allowed, the optimal
query plan can be computed in polynomial time, with the optimal
query plan being an overlay of minimum spanning trees, as
the complexity of computing each minimum spanning tree
is polynomial [23].
Operator Placement: The project operator is not used at
all for composable projects. Instead, for each query q at
a node, the required data streams are fetched from other
nodes and a union operator is used to combine the individual
streams to get the required result. Since the selectivity of
the union operator is unity, there is no possibility of
reducing the bandwidth required to transmit the data by “better”
operator placement.
7.1.2 Non-Composable Projects
Data Source Selection: If compositions are not allowed,
the only way to satisfy a query request is to get it from
either the primary source or a secondary source which has
a superset of the required data.
To determine the set of possible sources to answer a
query, we check for the covering property using the SAQ
concept introduced earlier. A query q_x on node x can be
covered or satisfied by either the source of the data, or
another node y which answers a query q_y where SAQ(q_x) ⊆
SAQ(q_y).
The set of all sources and queries can be represented as
a poset of covering hierarchy based on the streams that they
possess or require. The covering poset for the example in
Figure 3 is shown in Figure 5. Node D, being the source,
can satisfy any query and is therefore at the top of the poset,
at level 0. The query at node A, requiring both π_{s_i1}(S_i)
and π_{s_i2}(S_i), is next and can be satisfied only by source D.
Queries at nodes B and C can be satisfied by both node A
and source node D and are hence at the highest level of the
poset.
[Figure 5: Satisfiability Poset for Figure 3. Level 0: (D, {s_i1(S_i), s_i2(S_i), s_i3(S_i)}); Level 1: (A, {s_i1(S_i), s_i2(S_i)}); Level 2: (B, {s_i1(S_i)}) and (C, {s_i2(S_i)}). Levels increase downward; satisfiability increases upward.]
Each poset element is represented as e_i, where i
uniquely identifies the poset element. A function L(e_i)
is defined to denote the level of poset element e_i, and
SAQ(e_i) ⊂ SAQ(e_j) indicates the satisfiability of poset
element e_i by another poset element e_j.
The hierarchy of the poset ensures that L(e_j) ≤ L(e_i)
whenever SAQ(e_i) ⊂ SAQ(e_j). A source node s ∈ S can
answer a query for any attribute related to the source, and
hence poset elements corresponding to sources are placed
at the topmost level. All poset elements representing
queries are hierarchically organized below the source
elements. Algorithm 2 explains the poset hierarchy formation
process.
Algorithm 2 Poset Hierarchy Formation Algorithm
Require: Grid G and project query set Q incident on G
Ensure: Hierarchically ordered poset P
 1: for all e_i ∈ P do
 2:   if e_i ∈ S then
 3:     L(e_i) = 0
 4:   else
 5:     L(e_i) = 1
 6:   end if
 7: end for
 8: repeat
 9:   finish ← true
10:   for all e_i ∈ P do
11:     for all e_j ∈ P do
12:       if SAQ(e_i) ⊂ SAQ(e_j) and L(e_i) < L(e_j) + 1 then
13:         L(e_i) = max[L(e_i), (L(e_j) + 1)]
14:         finish ← false
15:       end if
16:     end for
17:   end for
18: until finish
Lemma 2 At the end of the poset hierarchy formation
process, the i-th poset element e_i at level k = L(e_i) can
only be satisfied by
1. a poset element e_j at level k if and only if SAQ(e_i) =
SAQ(e_j), or
2. a poset element e_j at level l = L(e_j), where l < k, if
and only if SAQ(e_i) ⊂ SAQ(e_j).
Proof We prove this by refutation. Assume there exists
a poset element e_j at level l > k which can satisfy e_i.
Then either (a) SAQ(e_i) ⊂ SAQ(e_j), or (b) SAQ(e_i) =
SAQ(e_j).
• Refutation for (a): If there exists some e_j such that
SAQ(e_i) ⊂ SAQ(e_j), then from line 13 of
Algorithm 2, k ≥ l + 1. Hence if k < l, such an e_j
cannot exist.
• Refutation for (b): If e_j is at level l, there must be some
e_q at level l − 1 such that SAQ(e_j) ⊂ SAQ(e_q). If
SAQ(e_j) = SAQ(e_i), then k = l, as SAQ(e_i) ⊂
SAQ(e_q). Hence if k < l, such an e_j cannot
exist.
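The level assignment of Algorithm 2 can be tried out on the Figure 3 scenario. The sketch below assumes each poset element is identified by its node name and carries its SAQ set as a frozenset; it raises a level only when the level actually changes, which guarantees that the outer loop terminates. Both the data layout and that termination guard are choices made for this illustration.

```python
def poset_levels(elements, sources):
    """Assign a level to every poset element.

    elements: dict mapping element name -> frozenset of required or
              hosted single-attribute queries (its SAQ set).
    sources:  names of the elements that are primary sources (level 0).
    """
    level = {e: 0 if e in sources else 1 for e in elements}
    changed = True
    while changed:
        changed = False
        for ei, saq_i in elements.items():
            for ej, saq_j in elements.items():
                # `<` on frozensets is the proper-subset test.
                if saq_i < saq_j and level[ei] < level[ej] + 1:
                    level[ei] = level[ej] + 1
                    changed = True
    return level

# Elements of the Figure 3 scenario: D is the source of Si.
elements = {
    "D": frozenset({"si1", "si2", "si3"}),
    "A": frozenset({"si1", "si2"}),
    "B": frozenset({"si1"}),
    "C": frozenset({"si2"}),
}
levels = poset_levels(elements, sources={"D"})
print(levels)
```

On this input the levels come out as 0 for D, 1 for A and 2 for B and C, the hierarchy described for Figure 5.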
Once the poset P is ordered according to satisfiability,
we create a “minimum network usage graph”
MinGraph. To create the MinGraph, each poset
element is considered as a node in the graph and the set of
edges is determined ensuring minimum network usage. From
Lemma 2, all poset elements requiring the same data are
at the same level l; they are grouped into a set X and
referred to as the destination nodes. The poset elements or
nodes which can satisfy the poset elements in set X are
in lower levels and are grouped together into set Y, the
source nodes. To determine the edges resulting in the
minimum network usage, Prim’s minimum spanning tree
algorithm [23] is used, where Y is considered to be the set
of nodes which are already in the tree and X is the set of
nodes still to be connected.
Algorithm 3 determines the set of edges resulting in
minimum network usage.
Algorithm 3 Minimum Cost Network Usage Graph
Require: Grid G and ordered poset P
Ensure: Edges of MinGraph
 1: Edges ← {}
 2: for all e_i ∈ P do
 3:   X ← {} {X is the set of destination nodes}
 4:   Y ← {} {Y is the set of source nodes}
 5:   for all e_j ∈ P do
 6:     if (L(e_j) = L(e_i) ∧ SAQ(e_i) = SAQ(e_j)) then
 7:       X = X ∪ e_j
 8:     end if
 9:     if (L(e_j) < L(e_i) ∧ SAQ(e_i) ⊂ SAQ(e_j)) then
10:       Y = Y ∪ e_j
11:     end if
12:   end for
13:   Edges = Edges ∪ PrimsMST(X, Y, G)
14: end for
Theorem 3 The set Edges determined using Algorithm 3
results in minimum network usage.
Proof We prove the above theorem by refutation.
Incorrectness can arise from:
• Incorrect selection of source set: Incorrect source
selection can occur if a possible source is not
considered while choosing the best source.
Given line 9 of Algorithm 3, if there is such a source,
it must be represented by a poset element
with a level greater than that of the poset element concerned.
However, this is not possible because of Lemma 2.
• Incorrect selection of destination set: Incorrect
destination selection can occur if a destination requiring
the same data is not considered. Line 6 of
Algorithm 3 ensures that all elements requiring the same
data are considered.
• Incorrect selection of edge: If an incorrect
edge is selected, then there exists another edge with
lower weight than the selected edge. This is not
possible because of the use of Prim’s algorithm, which
selects the minimum-cost edge.
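Algorithm 3 can be sketched as Prim's algorithm seeded with the source set Y and grown until every destination in X is attached. The version below is a hedged illustration: the levels are taken as given (they correspond to the Figure 5 hierarchy), the distances are invented values that satisfy the triangle inequality and avoid ties, and the routine `prims_mst` is a stand-in for the PrimsMST(X, Y, G) call, whose exact definition the text does not give.

```python
def prims_mst(X, Y, d):
    """Attach every destination in X to a tree that already contains Y,
    always adding the cheapest edge from the tree to a pending node."""
    in_tree, pending, edges = set(Y), set(X), []
    while pending:
        src, dst = min(((u, v) for u in in_tree for v in pending),
                       key=lambda e: d[e[0]][e[1]])
        edges.append((dst, src))      # link = (destination, source)
        in_tree.add(dst)
        pending.remove(dst)
    return edges

def min_graph(elements, level, d):
    """Collect the edges chosen for every poset element."""
    edges = set()
    for ei, saq_i in elements.items():
        X = {ej for ej, s in elements.items()
             if level[ej] == level[ei] and s == saq_i}
        Y = {ej for ej, s in elements.items()
             if level[ej] < level[ei] and saq_i < s}
        if Y:                          # primary sources have no supplier
            edges |= set(prims_mst(X, Y, d))
    return edges

elements = {"D": frozenset({"si1", "si2", "si3"}),
            "A": frozenset({"si1", "si2"}),
            "B": frozenset({"si1"}),
            "C": frozenset({"si2"})}
level = {"D": 0, "A": 1, "B": 2, "C": 2}

# Hypothetical symmetric distances.
pairs = {("D", "A"): 2.0, ("D", "B"): 1.0, ("D", "C"): 3.0,
         ("A", "B"): 2.0, ("A", "C"): 1.0, ("B", "C"): 3.0}
d = {x: {y: 0.0 for y in "ABCD"} for x in "ABCD"}
for (x, y), w in pairs.items():
    d[x][y] = d[y][x] = w

edges = min_graph(elements, level, d)
print(sorted(edges))
```

Here node B is fed directly by the source D while node C is fed by the secondary source A, because those are the cheapest eligible suppliers under the assumed distances.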
Operator Placement: The project operator is always
placed on the data source as selectivity(π) ≤ 1.
7.2 Complexity of Selections
Like project queries, we consider the complexity of
composable and non-composable selection query plans.
7.2.1 Composable Selects
Data Source Selection: The main issue in selection
queries involving compositions is identifying the set of data
sources which would lead to the minimal network usage.
The central notion which determines if a source can
serve a query is the selection granularity available at the
source and the selection granularity required by the query.
For instance, a query q_1 = σ_{(b1=1 & b2=5)}(S_i) can be answered
by composing the result from two secondary data sources
having data σ_{(b1<3)}(S_i) and σ_{(b2>4)}(S_i).
Using selection granularities to determine reuse of data
is well studied in the area of materialized view selection
techniques in data warehouses [21].
In [21], for a given query Q, there is a set of candidate
materialized views V(Q) to satisfy Q and a cost function
cost(MV_i, QR_i) which provides the cost for a materialized
view MV_i ∈ V(Q) with a query region of QR_i. The
optimal MV set problem is to find an optimal set S of pairs
(MV_i, QR_i) which can answer query Q, minimizing the
cost of S, or

\[ \arg\min_{S} \sum_{(MV_i, QR_i) \in S} \mathrm{cost}(MV_i, QR_i) \tag{5} \]

[21] shows that the minimum set cover decision
problem, which is NP-complete, can be transformed in
polynomial time to this decision problem, thereby rendering the
optimization version NP-hard.
We map the optimal MV set problem to the problem of
identifying the correct set of sources to satisfy a query q
incident on the grid. The candidate primary and secondary
sources S(q) for answering the query can be considered to
be the set of candidate MVs V(Q). The cost for a
materialized view can be considered to be the network usage
U(s_i, r_i) for fetching data from the node s_i ∈ S(q) with
selection granularity r_i. The optimal network usage
problem is to find the optimal set S of pairs (s_i, r_i) to minimize
the network usage of S, or

\[ \arg\min_{S} \sum_{(s_i, r_i) \in S} U(s_i, r_i) \tag{6} \]

Hence the optimal network usage problem is NP-hard as
well.
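To make the objective in Equation 6 tangible, the sketch below solves a tiny instance by exhaustive search. It is an illustration only: selection granularities are modeled as sets of discrete cells, each candidate source has a single fixed usage value, and all names and numbers are hypothetical. The search visits every subset of candidates, so its running time grows exponentially with the number of sources, in line with the NP-hardness result.

```python
from itertools import combinations

def optimal_source_set(required, candidates):
    """Exhaustively find the cheapest set of sources whose combined
    selection regions cover the region required by the query.

    candidates: dict name -> (region, usage), where region is a set of
    cells and usage is the network usage of fetching from that source.
    """
    best, best_cost = None, float("inf")
    names = sorted(candidates)
    for r in range(1, len(names) + 1):
        for subset in combinations(names, r):
            covered = set().union(*(candidates[n][0] for n in subset))
            cost = sum(candidates[n][1] for n in subset)
            if required <= covered and cost < best_cost:
                best, best_cost = set(subset), cost
    return best, best_cost

required = {1, 2, 3, 4}
candidates = {
    "P":  (frozenset({1, 2, 3, 4, 5, 6}), 10.0),  # distant primary source
    "S1": (frozenset({1, 2}), 3.0),               # nearby secondary sources
    "S2": (frozenset({3, 4}), 4.0),
    "S3": (frozenset({2, 3}), 1.0),
}
chosen, usage = optimal_source_set(required, candidates)
print(sorted(chosen), usage)
```

In this instance composing the result from the two nearby secondary sources S1 and S2 is cheaper than fetching everything from the primary source, while S3 alone or combined with one other source leaves part of the region uncovered.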
7.2.2 Non Composable Selects
Data Source Selection: If compositions are not allowed,
the problem becomes very similar to the projection
without compositions problem. Since queries can be answered
only from sources with higher selection granularities, a
single query stream is sufficient to answer the query. In
such a scenario we need to create a hierarchical poset for a
given query set using selection granularities to set the
levels. The rest of the algorithm will be the same as in
projection queries.
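As an illustration of the hierarchical poset only, the sketch below assigns levels to a set of queries by containment of their selection ranges. It assumes each selection is a closed interval on a single attribute and that a stream can serve a query when the query's interval lies inside the stream's interval. Both assumptions, and the names `contains` and `poset_levels`, are hypothetical and not taken from the paper.

```python
def contains(outer, inner):
    """True if interval `outer` can serve `inner` (inner is a sub-range)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def poset_levels(ranges):
    """Assign each query a level in the containment poset.

    ranges: dict mapping query name -> (low, high) selection range.
    Level 0 holds ranges strictly contained in no other range; a query's
    level is one more than the deepest query whose range strictly contains it.
    """
    levels = {}

    def level(q):
        if q not in levels:
            # Strict containers only, so the recursion cannot loop.
            parents = [p for p in ranges
                       if p != q and ranges[p] != ranges[q]
                       and contains(ranges[p], ranges[q])]
            levels[q] = 1 + max((level(p) for p in parents), default=-1)
        return levels[q]

    for q in ranges:
        level(q)
    return levels

# Hypothetical query set over one attribute.
ranges = {"q1": (0, 100), "q2": (10, 50), "q3": (20, 30), "q4": (60, 90)}
levels = poset_levels(ranges)
print(levels)  # {'q1': 0, 'q2': 1, 'q3': 2, 'q4': 1}
```

Under these assumptions a query at level k can be fed from a single stream at any lower level whose range contains it, which is the single-stream property the paragraph above relies on.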
Operator Placement: The select operator is always placed
on the data source as selectivity(σ) ≤ 1.
7.3 Complexity of Joins
Operator Placement: In a distributed environment with
data being available at different sites, a join query with n
relations is formulated as a graph problem [31]. A directed
graph with n+1 nodes is constructed where one node
corresponds to the final destination site D and the remaining
n nodes have a one-to-one association with a relation. An
edge (Ri, Rj) indicates relation Ri being sent to the node with
relation Rj to perform a join. An edge (Ri, D) indicates
relation Ri being sent to the destination site directly. The
objective is to find an inversely directed spanning tree
toward D with the minimal transmission cost. Finding the
optimal join sequence to minimize the transmission cost is
NP-hard [31]. By replacing the transmission cost with the
network usage associated with shipping relations between
nodes, our problem also becomes NP-hard.
Trivially, joins cannot be performed without compositions,
as no single source will have the joined data.
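The graph formulation above can be sketched as a brute-force search over parent assignments, where each relation is shipped either to another relation's node or directly to D. This is a toy cost model and not the formulation of [31]: it assumes the data shipped from a node is the join of all relations in its subtree, with size equal to the product of their cardinalities and pairwise selectivities, multiplied by a per-unit link cost. The names `join_tree_cost` and `best_join_tree` and all numbers are hypothetical.

```python
from itertools import product
from math import prod

def join_tree_cost(parent, sizes, selectivity, link_cost):
    """Cost of one shipping plan, or None if it is not a tree toward D."""
    # Every relation must reach D by following parent pointers (no cycles).
    for r in parent:
        seen, node = set(), r
        while node != "D":
            if node in seen:
                return None
            seen.add(node)
            node = parent[node]

    children = {}
    for r, p in parent.items():
        children.setdefault(p, []).append(r)

    def subtree(r):
        members = [r]
        for c in children.get(r, []):
            members.extend(subtree(c))
        return members

    total = 0.0
    for r, p in parent.items():
        members = subtree(r)
        # Assumed size of the intermediate join result held at node r.
        size = prod(sizes[m] for m in members)
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                size *= selectivity.get(frozenset((a, b)), 1.0)
        total += size * link_cost[(r, p)]
    return total

def best_join_tree(sizes, selectivity, link_cost):
    """Enumerate all parent assignments and keep the cheapest valid tree."""
    relations = sorted(sizes)
    options = [[p for p in relations + ["D"] if p != r] for r in relations]
    best = (None, None)
    for choice in product(*options):
        parent = dict(zip(relations, choice))
        cost = join_tree_cost(parent, sizes, selectivity, link_cost)
        if cost is not None and (best[0] is None or cost < best[0]):
            best = (cost, parent)
    return best

sizes = {"R1": 8, "R2": 4}
selectivity = {frozenset(("R1", "R2")): 0.25}
link_cost = {("R1", "R2"): 1, ("R2", "R1"): 1, ("R1", "D"): 4, ("R2", "D"): 4}
cost, plan = best_join_tree(sizes, selectivity, link_cost)
print(cost, plan)  # 36.0 {'R1': 'D', 'R2': 'R1'}
```

With n relations the enumeration considers n^n parent assignments, so it is only usable for very small n. Because the amount shipped on each edge depends on the whole subtree below it, the edge weights are not fixed in advance, and this dependence is what a plain minimum spanning tree computation would miss.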
8 Conclusion and Future Work
This work shows that when composable query plans are
not allowed, polynomial time algorithms exist for
computing a globally optimal non-composable plan for selects
and projects. However, the resultant network usage is not
the minimum possible value. Better query plans involving
lower network usage are possible when selects and projects
are composed from streams with lower resolution. This,
however, makes the optimization problem intractable
except for project-only queries.
Conditions   Is query result possible   Complexity of globally     Complexity of globally
             without composition?       optimal composable plan    optimal non-composable plan
Projects     Yes                        P                          P
Selection    Yes                        NP-hard                    P
Joins        No                         NP-hard                    -
Table 2: Summary of Query Processing Complexities for
Network Usage Optimization
However, it must be noted that the key assumption for
computing the globally optimal query plan is that the
system is frozen on any query arrival or revocation for query
plan computation. While this may be feasible for a system
where query arrivals and revocations occur infrequently, it
would not be practically feasible in a system where query
arrivals and revocations occur frequently.
References
[1] Y. Ahmad, U. Cetintemel, J. Jannotti, A. Zgolinski,
and S. Zdonik. Network awareness in Internet-scale
stream processing. In IEEE Data Engineering
Bulletin, 2005.
[2] D. Anderson. Resilient overlay networks. In MS
Thesis, Department of Electrical Engineering and
Computer Science, MIT, 2001.
[3] B. H. Bloom. Space/time tradeoffs in hash coding
with allowable errors. In Communications of the
ACM, 1970.
[4] M. Carey and H. Lu. Load balancing in a locally
distributed database system. In Proceedings of the
ACM SIGMOD Conference on Management of Data,
pages 108–119, 1986.
[5] M. J. Carey, D. J. DeWitt, M. J. Franklin, N. E. Hall,
M. L. McAuliffe, J. F. Naughton, D. T. Schuh, M. H.
Solomon, C. K. Tan, O. G. Tsatalos, S. J. White, and
M. J. Zwilling. Shoring up persistent applications. In
ACM SIGMOD International Conference on
Management of Data, pages 383–394, 1994.
[6] G. Cormode and M. Garofalakis. Streaming in a
connected world: querying and tracking distributed data
streams. In International Conf. in Data Engineering,
2007.
[7] N. Dalvi, S. Sanghai, P. Roy, and S. Sudarshan.
Pipelining in multi-query optimization. In Journal of
Computer and System Sciences (JCSS), pages 728–762,
2003.
[8] F. Deng and D. Rafiei. Approximately detecting
duplicates from streaming data using bloom filters. In
SIGMOD, 2006.
[9] S. Ganguly, A. Goel, and A. Silberschatz. Efficient
and accurate cost models for parallel query
optimization. In Proceedings of the ACM SIGMOD/SIGACT
Conference on Principles of Database Systems, pages
172–181, 1996.
[10] S. Ganguly, W. Hasan, and R. Krishnamurthy. Query
optimization for parallel execution. In The ACM
SIGMOD Conference on Management of Data, pages 9–18,
1992.
[11] H. Garcia-Molina, J. D. Ullman, and J. Widom.
Database system implementation. In Prentice Hall, 2000.
[12] K. Gorman, D. Agarwal, and A. E. Abbadi. Multiple
query optimization by cache-aware middleware using
query teamwork. In The 18th Intl Conf. on Data
Engineering, 2002.
[13] G. Graefe and W. J. McKenna. The volcano
optimizer generator: Extensibility and efficient search. In
Proceedings of the Ninth International Conference on
Data Engineering, pages 209–218, April 1993.
[14] A. Gupta, S. Sudarshan, and S. Viswanathan. Query
scheduling in multi query optimization. In The Intl.
Symposium on Database Engineering and
Applications, pages 11–19, 2001.
[15] L. M. Haas, J. C. Freytag, G. M. Lohman, and
H. Pirahesh. Extensible query processing in starburst. In
Proceedings of the 1989 ACM SIGMOD international
conference on Management of data, pages 377–388,
1989.
[16] Y. Ioannidis and Y. Kang. Left-deep vs. bushy trees:
an analysis of strategy spaces and its implications for
query optimization. In The ACM SIGMOD
Conference on Management of Data, pages 168–177, 1991.
[17] D. Kossmann and K. Stocker. Iterative dynamic
programming: A new class of query optimization
algorithms. In ACM Transactions on Database Systems,
March 2000.
[18] L. Mackert and G. Lohman. R* optimizer validation
and performance evaluation for distributed queries. In
Proceedings of the Conference on Very Large Data
Bases, pages 149–159, 1986.
[19] S. Madden, M. J. Franklin, J. M. Hellerstein, and
W. Hong. The design of an acquisitional query
processor for sensor networks. In SIGMOD, 2003.
[20] A. Metwally, D. Agrawal, and A. E. Abbadi.
Duplicate detection in click streams. In Proc. of WWW,
2005.
[21] C. S. Park, M. H. Kim, and Y. J. Lee. Finding an
efficient rewriting of OLAP queries using materialized
views in data warehouses. In Decision Support
Systems, Vol 32, Issue 4, 2002.
[22] P. Pietzuch, J. Ledlie, J. Shneidman,
M. Roussopoulos, M. Welsh, and M. Seltzer. Network-aware
operator placement for stream-processing systems. In
International Conference on Data Engineering, 2006.
[23] R. Prim. Shortest connection networks and some
generalizations. In Bell System Technical Journal, 36,
pages 1389–1401, 1957.
[24] P. Roy, A. Seshadri, A. Sudarshan, and S. Bhobhe.
Efficient and extensible algorithms for multi query
optimization. In ACM SIGMOD Conf. on Management
of Data, pages 249–260, 2000.
[25] D. Schneider and D. Dewitt. Tradeoffs in processing
complex join queries via hashing in multiprocessor
database machines. In The Conference on Very Large
Data Bases (VLDB), pages 469–480, 1990.
[26] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin,
R. A. Lorie, and T. G. Price. Access path selection in a
relational database management system. In The 1979
ACM SIGMOD International Conference on
Management of Data, 1979.
[27] T. Sellis. Multiple-query optimization. In ACM Trans.
on Database Systems, pages 23–52, 1988.
[28] T. Sellis and S. Ghosh. On the multiple-query
optimization problem. In IEEE Transactions on
Knowledge and Data Engineering, pages 262–266, 1990.
[29] K. Shim, T. Sellis, and D. Nau. Improvements on
a heuristic algorithm for multiple-query optimization.
In Data and Knowledge Engineering, pages 197–222,
1994.
[30] N. Trigoni, Y. Yao, A. Demers, J. Gehrke, and
R. Rajaraman. Multi-query optimization for sensor
networks. In IEEE International Conference on
Distributed Computing in Sensor Systems, 2005.
[31] C. Wang and M. S. Chen. On the complexity of
distributed query optimization. In IEEE Transactions on
Knowledge and Data Engineering, pages 650–662,
1996.
[32] Y. Yao and J. Gehrke. The cougar approach to
in-network query processing in sensor networks. In
ACM SIGMOD Record 31(3), pages 9–18, 2002.