Multi-Scale Dissemination of Time Series Data
Qingsong Guo, Yongluan Zhou, Li Su
Department of Mathematics and Computer Science, University of Southern Denmark, Denmark
{qguo, zhou, lsu}@imada.sdu.dk
ABSTRACT
In this paper, we consider the problem of continuous dissemination of time series data, such as sensor measurements, to a large number of subscribers. These subscribers fall into multiple subscription levels, where each subscription level is specified by the bandwidth constraint of a subscriber, which is an abstract indicator of both the physical limits and the amount of data that the subscriber would like to handle. To handle this problem, we propose a system framework for multi-scale time series data dissemination that employs a typical tree-based dissemination network and existing time-series compression models. Due to the bandwidth limits and the potentially sheer speed of the data, it is inevitable to compress and re-compress data along the dissemination paths according to the subscription level of each node. Compression causes a loss of data accuracy, so we devise several algorithms to optimize the average accuracy of the data received by all subscribers within the dissemination network. Finally, we have conducted extensive experiments to study the performance of the algorithms.
Categories and Subject Descriptors
H.2.4 [Database]: Scientific Data Management
General Terms
Multi-Scale Dissemination
Keywords
Time Series Data, Data Dissemination, Piecewise Linear Approximation
1. INTRODUCTION
In many cases, scientific data generated from sensors and various scientific equipment take the form of time series, i.e. streams of data points measured at successive time instants with uniform time intervals. In the context of Data-as-a-Service (DaaS) or large-scale collaborative research projects, there can be a large number of parties interested in the data. For instance, Microsoft's SensorMap project [23] establishes a portal to enable the collection and sharing of sensor network data from various deployments of public institutes, companies as well as individuals. Data collected by SensorMap
should be disseminated to a large number of receivers who have subscribed to it. This framework can be organized as a publish/subscribe system, where each subscriber is an interested party.
With the development of technology, the rates of data generated by many advanced devices have become prohibitively high. For example, there are around 600 million collisions per second within the LHC (Large Hadron Collider) [2], which are observed by 4 detectors and produce a data rate of approximately 300 GB/s; the data rate is still over 300 MB/s even after pre-filtering. As another example, consider an underwater sensor network [33] for monitoring the health of river and marine environments, where the overall data rate scales with the size and the density of the sensor deployment. Note that scientists often tend to obtain measurements at the largest scale and finest grain allowed by the technology. Assume we monitor an area of 100 km × 100 km that is divided into 1 m × 1 m grid cells, each deployed with a sensor. Suppose each node produces one measurement every second and each measurement has 4 bytes of data. This gives a total data rate of around 40 GB/s.
Due to the sheer speed of such data, it is challenging to disseminate them to a large number of subscribers. Furthermore, different organizations or individuals have different physical capacities and demands on the data and hence require different granularities or accuracies of the data. Sending the raw data to all the subscribers is not only resource consuming but also undesirable. Supporting different subscription levels with different data granularities or accuracies would be a favorable feature for both the data providers and the subscribers.
In this paper, we consider the problem of disseminating time-series data to a large number of subscribers with multiple subscription levels. These nodes are organized as a publish/subscribe system, or dissemination network, where one node is designated as the primitive data provider and each subscriber is under a bandwidth constraint. The bandwidth constraint is introduced to describe the subscription level of a node; it is an abstract parameter that indicates not only the volume of data that a subscriber would prefer to consume but also some physical capacities, such as the bandwidth of its communication link. This problem suffers from the challenges of the sheer speed of the data as well as the bandwidth constraints in the system. The basic idea is to reduce the volume of data and to optimize the quality of service by compression, as done in other publish/subscribe systems [32].
Similar to the architecture of End System Multicast [13], where all end hosts are organized into an overlay tree, our dissemination network is also constructed as a tree at the application layer. An end-to-end communication typically goes through other nodes, which means that intermediate nodes would have to carry extra packets if all data were transmitted directly from the source node; this makes the challenge even worse given that bandwidth is limited relative to the data rate. We therefore have to optimize the dissemination network. The basic idea is that the data for each subscriber is derived from the data of its parent rather than from the source. Therefore, in our system, each node plays two roles: a subscriber that requests a particular scale of compressed data and a broker that provides multi-scale compressed data to its child nodes.
In our context, data for subscribers are compressed into multiple
scales according to their bandwidth constraints. Various approaches
have been proposed to approximate and compress time-series data,
such as discrete Fourier Transform [8, 26], discrete Wavelet Trans-
form [10], Singular Value Decomposition [14], Piecewise Aggre-
gate Approximation [18], and Piecewise Linear Approximation
(PLA) [16, 20], etc. In this paper, we adopt PLA as the compression model, i.e. series of line segments are generated and disseminated instead of the raw data. Apart from its simplicity and wide applicability, PLA modelling can be efficiently performed in an online fashion. In our approach, each node within the dissemination tree receives PLA models from its parent and then forwards them to its children, where a re-compression is invoked when the bandwidth constraint of a child is smaller than the size of the models it received. The whole procedure is referred to as model-based dissemination. In summary, we make the following contributions in this paper:
- We formulate the problem of model-based time-series data dissemination with multiple levels of bandwidth constraints as an optimization problem that maximizes the average accuracy of the data received by all the nodes in the dissemination network.
- Due to the expensiveness of data compression operations and the need of the optimization algorithms to estimate the data accuracies under many different compression ratios, we propose an approach to estimate the expected data accuracy without actually performing the compressions. Experiments show that the approach is effective and robust.
- We propose three algorithms to generate an optimized dissemination plan. The greedy algorithm tries to maximize the total volume of models that are disseminated to all the subscribers. However, this approach may incur a lot of re-compressions over the network and hence can incur high data inaccuracies. We further propose two algorithms to address this problem, namely the Randomized algorithm and the Most Frequent Bandwidth First (MFBF) algorithm.
- We conducted an extensive experimental study on the performance of the proposed algorithms. We conducted our experiments on 19 real datasets, among which the representative results of 5 datasets are presented in this paper. The experimental results show that our cost model can accurately estimate the model accuracies. Furthermore, the results suggest that none of the three proposed algorithms outperforms the others in all situations. Our extensive experiments can help derive guidelines for real deployments.
The rest of this paper is organized as follows. Section 2 reviews related work. In Section 3 we formulate the problem studied in this paper. Section 4 describes the model-based dissemination procedure. In Section 5 we propose a cost model to estimate the accuracies under arbitrary compression ratios. In Section 6 we design three approximate algorithms to optimize the problem, and we experimentally evaluate the proposed methods in Section 7. Finally, we conclude the paper in Section 8.
2. RELATED WORK
Approximate representation models: There exist many mathematical and statistical models to approximate and compress time-series data. They are designed for different purposes and hence may suit different applications. For example, the Discrete Fourier Transform [8] maps time-series data to the frequency domain and is often used for feature extraction; [30] presents a framework (Cypress) that compresses FFT-represented time-series data into multiple scales for efficient archival and querying; the Discrete Wavelet Transform (DWT) [10] is mainly used for efficient similarity search over time-series data; Singular Value Decomposition [14] is usually used for dimensionality reduction in similarity search; Piecewise Linear Approximation (PLA) [16, 20, 17, 22, 15, 19] is a very popular approximate model and has been used for a variety of purposes, e.g. fast exact similarity search, fuzzy query, dynamic time warping and relevance feedback. Due to PLA's simplicity and wide applicability, a variety of algorithms have been proposed to perform PLA modelling. Most segmentation algorithms can be divided into three categories: Sliding Windows, Top-Down, and Bottom-Up [16, 17], which are batch-based algorithms. In [16], the authors proposed an online algorithm, called SWAB (Sliding Window and Bottom-up), for representing time series data by combining the sliding window and bottom-up algorithms.
The multi-resolution feature of DWT also enables it to achieve multi-scale compression. Unfortunately, the compression scales for DWT are fixed, which is inefficient for compression at arbitrary scales. Given a sequence of length $n$, only $\log_2 n$ scales can be obtained by the Haar transform, i.e. $\{\frac{1}{n}, \frac{2}{n}, \ldots, \frac{2^i}{n}, \ldots, 1\}$ ($i = 1, 2, \ldots, \log_2 n$). PLA has no such limitation on compression scales, as we can see in the rest of the paper.
Data Dissemination and Publish/Subscribe Systems. There are many previous works on data dissemination and publish/subscribe systems, e.g. [9, 31, 35, 34]. In such systems, data are continuously disseminated from the sources to distributed subscribers. To enhance the dissemination efficiency, these subscribers or brokers are typically structured into dissemination trees. In [31, 34], the authors studied the problem of optimizing dissemination trees to reduce the latency of data dissemination and hence minimize the average loss of fidelity. Their techniques of tree construction can be adapted to construct the dissemination tree in our system.
In [35], the authors proposed a general framework to disseminate
statistical models rather than raw data to the data receivers. How-
ever, their focus is on minimizing the volumes of models that are
transferred over the network and hence they proposed some model
routing algorithms to achieve this objective. Due to the radical dif-
ference in the application scenarios and problem formulations, their
results are not applicable to this paper.
Data as a Service: In recent years, data has increasingly become an important commodity. For example, Xignite [7] sells various financial data; Gnip [1] provides data from social media (e.g. Twitter and Facebook); PatientsLikeMe [6] provides self-reported statistical data from more than 150,000 patients; AggData [4] sells various location data, such as the locations of all the Walmart stores. Many cloud services that support and facilitate the setup of online data markets have emerged, such as Windows Azure Marketplace [3] and Infochimps [5]. Furthermore, pricing of data has become an interesting research problem, e.g. Paraschos et al. have proposed a query-based data pricing model in [21]. In some sense, we also propose a pricing model over time-series data in this paper, i.e. pricing based on the compression scale.
3. PROBLEM FORMULATION
3.1 System Model
In our system, a data source $N_0$ generates an infinite sequence of numerical values, $(\ldots, v_{i-1}, v_i, v_{i+1}, \ldots)$, which will be disseminated to a large number of data receivers, $\{N_1, N_2, \ldots, N_n\}$. Periodically, $N_0$ will collect a subsequence of the data and disseminate it to the receivers.
In this paper, we consider a tree-based dissemination architecture, which is a widely adopted approach [35, 9, 31, 34]. In this architecture, all nodes are organized into a dissemination tree with the root $N_0$ acting as the source. Data is disseminated from $N_0$ and relayed by the internal nodes to all the nodes in the network.
Example 1: Figure 1 shows an example of a dissemination tree $T$, which involves 9 nodes, $\{N_0, N_1, \ldots, N_8\}$. Each node other than $N_0$ is under a bandwidth constraint, i.e. $\{c_1, c_2, \ldots, c_8\}$. An instance of bandwidth constraints is also given, specified by the integer attached to each node.
DEFINITION 1. The dissemination path to $N_i$, denoted as $P(N_i)$, is defined as a sequence of nodes $\langle p_k(N_i), \ldots, p_2(N_i), p_1(N_i), N_i \rangle$, where $p_j(N_i)$ ($1 \le j \le k$) denotes the $j$-th preceding node of $N_i$ on $P(N_i)$; in particular, $p_1(N_i)$ is $N_i$'s parent node and $p_k(N_i)$ is the root node. All these preceding nodes form a set $\varphi(N_i)$.
For example, as shown in Figure 1, the dissemination path to $N_8$, denoted as $P(N_8)$, is the sequence of nodes $N_0, N_1, N_3, N_8$.
As addressed earlier, each node in the dissemination network is associated with some constraints, such as the physical bandwidth of its communication link, demands on the granularity or accuracy of the data, etc. Here, we use an abstract parameter, the bandwidth constraint of $N_i$, to represent the constraint on the volume of data that can be transmitted to $N_i$ per time unit. For example, as shown in Figure 1, the bandwidth constraint of $N_1$ is $c_1$, so the volume of data sent to $N_1$ from $N_0$ per time unit should not exceed $c_1$.
In practice, a child node may have a higher bandwidth constraint than its parent node, e.g. $c_3 > c_1$. However, the volume of data it receives is no more than that of its parent node, as its data originates from the data received by its parent. This means that the volume of data transmitted along any dissemination path is non-strictly monotonically decreasing. We refer to this fact as the volume decreasing property.
3.2 Model-based Dissemination
Piecewise Linear Models. In PLA, the data sequence is partitioned into a number of non-overlapping and consecutive subsequences and each subsequence is represented with a linear function [16]. Assume a sequence of time-series data $D = (v_1, v_2, \ldots, v_k)$ has been divided into $\iota$ subsequences $S = (s_1, s_2, \ldots, s_\iota)$, where $s_j$ ($j \le \iota$) consists of a subsequence of data points with consecutive indices $(v_{t_j}, v_{t_j+1}, \ldots, v_{t_j+\tau})$ and $\tau$ is the size of $s_j$. Each subsequence $s_j$ is approximated with a linear function $\ell_j$. All these linear functions $M = (\ell_1, \ell_2, \ldots, \ell_\iota)$ compose a piecewise linear approximate representation of the original data sequence $D$.

Figure 1: Dissemination Tree T
The segmentation of data sequences is an optimization problem that has been extensively studied in previous work, e.g. [16, 11], and is out of the scope of this paper. In our implementation, we adopt the SWAB algorithm proposed in [16], which is an online algorithm and suitable for our scenario.
DEFINITION 2. The volume (or size) of a sequence of models $M = (\ell_1, \ell_2, \ldots, \ell_\iota)$ is defined as the number of linear functions that it contains, denoted as $|M|$.
Example 2. Figure 2 shows an example of using different numbers of piecewise linear models to approximate a sequence of dynamic data, where (a) is a sequence of raw Electrocardiogram (ECG) data from the dataset in [11], and in (b), (c) and (d) we use 23, 13 and 3 linear segments, respectively, to represent the raw data.
Figure 2: Piecewise Linear Approximation
Models' Accuracy. As the above example shows, we can use various sequences of models to approximate the raw data. So we need some criteria to evaluate the quality of a sequence of models, i.e. error and accuracy.
Suppose a sequence of models $M$ is an approximate representation of the original data $D$, and $D' = (v'_1, v'_2, \ldots, v'_k)$ is the corresponding time-series data of $M$, where $v'_j$ ($1 \le j \le k$) is the vertical value of $M$ at the time instant of $v_j$. In some of the literature [17, 25], the error of using $M$ to represent $D$ is defined as the sum of the vertical differences between $D$ and $D'$ at all data points, i.e. $\sum_{j=1}^{k} |v_j - v'_j|$. In our work, we modify this definition into a relative error, which normalizes the error into a value in the range $[0, 1]$.
DEFINITION 3. Given a sequence of dynamic data $D = (v_1, v_2, \ldots, v_k)$, a sequence of models $M$ and its corresponding data $D' = (v'_1, v'_2, \ldots, v'_k)$, the error of using $M$ to approximately represent $D$ is defined as
$$\varepsilon = \frac{\sum_{j=1}^{k} |v_j - v'_j|}{\sum_{j=1}^{k} |v_j|}.$$
$\varepsilon$ is the relative deviation between $M$ and $D$, which also reflects the quality of using $M$ to represent $D$. For convenience of discussion, we define the accuracy of a sequence of models.
DEFINITION 4. The accuracy of the models $M$ received by node $N_i$, $f(N_i)$, is defined as
$$f(M) = \begin{cases} 1 - \frac{\varepsilon}{\delta}, & \text{if } 0 \le \varepsilon < \delta \\ 0, & \text{if } \varepsilon \ge \delta \end{cases}$$
where $f(M) \in [0, 1]$, $\varepsilon$ is the error of $M$, and $\delta$ is a user-defined constant.
As we can see in Figure 2, various models can be used to represent the same source data $D$. However, some models may seriously deviate from $D$ and are thus not suitable for representing $D$. Those models can be excluded from the user's choice by adjusting the factor $\delta$. With a lower $\delta$ value, more models' accuracies become 0 and hence more models tend to be excluded from the solution. $\delta$ can be set according to the application's requirements and should be set to a relatively small value when users need high-quality models.
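As a concrete illustration, the following is a minimal sketch (in Python, assuming the model sequence has already been evaluated at every original time instant to obtain the corresponding data $D'$) of the error of Definition 3 and the accuracy of Definition 4:

```python
def relative_error(original, reconstructed):
    """Relative error of Definition 3: the sum of vertical deviations,
    normalized by the total absolute magnitude of the original data."""
    num = sum(abs(v - w) for v, w in zip(original, reconstructed))
    den = sum(abs(v) for v in original)
    return num / den

def accuracy(original, reconstructed, delta):
    """Accuracy of Definition 4: 1 - eps/delta if eps < delta, otherwise 0."""
    eps = relative_error(original, reconstructed)
    return 1.0 - eps / delta if eps < delta else 0.0
```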
3.3 Dissemination Plan
Dissemination Plan. A dissemination plan $X = (r_{N_0}, r_{N_1}, r_{N_2}, \ldots, r_{N_n})$ is a vector of compression ratios, which defines how many models will be generated and transmitted to each node per time unit.
Assume the models received by node $N_i$ are $M_i$, and $M_0$ are the initial models. Then the volume of $M_i$ is $|M_i| = r_{N_i} \cdot |M_0|$.
Take Figure 1 as an example and assume $(c_0, c_1, c_2, c_3, c_4, c_5, c_6, c_7, c_8) = (10, 9, 6, 6, 7, 6, 5, 4, 5)$. Then $X_1 = (1, 0.9, 0.6, 0.6, 0.7, 0.6, 0.5, 0.4, 0.5)$ is a dissemination plan. Note that $X_1$ is feasible as it satisfies both the bandwidth constraints and the volume decreasing property. However, plan $X_2 = (1, 0.9, 0.6, 0.6, 0.7, 0.6, 0.5, 0.4, 0.6)$ is infeasible, as it violates the bandwidth constraints. Furthermore, $X_3 = (1, 0.9, 0.6, 0.4, 0.7, 0.6, 0.5, 0.4, 0.5)$ is also infeasible, because it does not satisfy the volume decreasing property: the volume of models received by $N_8$ would be larger than $N_3$'s, which is impossible as $N_3$ is $N_8$'s preceding node on the path $P(N_8) = \langle N_0, N_1, N_3, N_8 \rangle$.
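For concreteness, a small sketch of checking the two feasibility conditions; the representation (a parent array indexed by node id) and the function name are our own illustration, not part of the paper's system:

```python
def is_feasible(plan, constraints, parent, m0):
    """Check a dissemination plan X = (r_N0, ..., r_Nn) against
    (1) the bandwidth constraints: r_Ni * |M0| <= c_Ni, and
    (2) the volume decreasing property: r_Ni <= r_parent(Ni)."""
    for i, r in enumerate(plan):
        if r * m0 > constraints[i]:
            return False                       # bandwidth constraint violated
        if parent[i] is not None and r > plan[parent[i]]:
            return False                       # volume decreasing property violated
    return True

# In the example above, X2 fails because r_N8 * |M0| = 6 exceeds c_8 = 5,
# and X3 fails because r_N8 = 0.5 > r_N3 = 0.4 on the path P(N8).
```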
Remodelling. Due to the bandwidth constraints in the dissemination network, it is inevitable that a dissemination plan $X$ may state that a child node should receive less data than its parent node. Suppose $N_i$ is an internal node, $N_j$ is its child node and $r_{N_j} < r_{N_i}$. Then node $N_i$ is responsible for compressing its models into a new sequence of models with size $r_{N_j} \cdot |M_0|$. This process is called remodelling. For each internal node, only a single remodelling is required regardless of the number of its children. Assume $N_i$ has more than one child with diverse bandwidth constraints, and $N_j$ is the one with the smallest constraint; with PLA, the models at every required scale can be generated naturally while the data is compressed down to the ratio $r_{N_j}$.
Note that the accuracy of the models $M_i$ transmitted to $N_i$ is determined by the specific remodelling processes along the dissemination path $P(N_i)$ rather than solely by $M_i$'s volume. For example, in the model-based dissemination network of Figure 1, $f(M_2)$ is different from $f(M_3)$, though the volumes of models received by both nodes are equal (i.e. 6). Models $M_2$ have higher accuracy as they are derived directly from the original data (10 → 6), but the models for $N_3$ are generated from models with lower accuracy (9 → 6).
3.4 Optimization Problem
Our goal is to find the optimal solution among all feasible dissemination plans. The objective function for this problem is defined as the average accuracy of the models received by all nodes within the dissemination network. More formally, given the set of models $M = (M_0, M_1, M_2, \ldots, M_n)$ to be received by $N = (N_0, N_1, \ldots, N_n)$, the average accuracy $F(M)$ is defined as
$$F(M) = \frac{1}{n} \sum_{i=0}^{n} f(N_i)$$
The formal problem statement is as follows.
MAXIMUM-MODEL-DISSEMINATION (MMD): Given a dissemination tree $T$ composed of a set of nodes $N$ and the bandwidth constraints $C$ of all the nodes in $N$, choose a dissemination plan $X$ such that $F(M)$ is maximized subject to the following conditions:
$$r_{N_i} \le \frac{c_{N_i}}{c_{N_0}}, \qquad r_{N_i} \le r_{N_j} \text{ where } N_j \in \varphi(N_i)$$
3.5 Challenges
In this subsection, we will discuss the challenges of solving the
aforementioned optimization problem.
THE OR EM 1. The Maximum-Model-Dissemination problem is
NP-hard.
Due to the page limit, we omit the proof.
Additional Challenges. To solve the MMD problem, we have to design algorithms that search the solution space and try to find the plan with the maximum $F(M)$. Besides the NP-hardness of the problem, there are some additional challenges to be addressed.
In the optimization algorithms, we have to generate many different dissemination plans and compare their costs. A straightforward way to estimate the cost of a dissemination plan would work as follows. First, we can use some historical data to generate the
models to be disseminated to each node. Then we can calculate
the accuracy of these models according to Definition 4. However,
the modelling process is usually very expensive and it has to be
done for every possible plan that is considered in the optimization
algorithms. Hence, this approach is computationally prohibitive.
One may think that the models' accuracy is related to the number of models that are sent to each node and hence an alternative approach is to transform the problem into maximizing the number of models that are transmitted to each node. Unfortunately, the models' accuracy depends not only on the volume of models but also, more importantly, on the actual remodelling processes within the dissemination network. For example, in Figure 1, $(10, 9, 6, 6, 7, 6, 5, 4, 5)$ is a feasible plan with the largest total volume of models, but it is not an optimal plan. If we change $N_1$'s volume from 9 to 6, then $f(N_1)$ would be decreased while $f(N_3)$, $f(N_4)$ and $f(N_8)$ would be increased. Therefore it is possible that the latter plan would have a higher average accuracy. This is also validated by our experiments presented later in the paper.
In summary, we need to find a method to calculate the models’ ac-
curacy that can take the remodelling processes into consideration
but need not actually generate the models. The basic idea is to ob-
tain a function of the models’ accuracy that can produce the accu-
racy estimation without the actual models. Such a function would
be multi-dimensional with the parameters being the compression
ratios of the nodes on the dissemination path. Finding an efficient
way to generate such a function is challenging.
4. MODEL-BASED DISSEMINATION
In this section, we highlight the overall dissemination procedure in our framework. Given a feasible plan $X$, the model-based dissemination works as follows. It starts at the root of $T$ and ends when all leaves have received their models. At first, the initial models $M_0$ are generated from the raw data $D$ at the root node. With PLA, these initial models are simply line segments that connect pairs of adjacent points and hence have the same volume as the raw data.
For each internal node $N$, new sequences of models are generated based on the models received by $N$ according to $X$ and then transmitted to the corresponding children. The detailed operation executed at each node is illustrated in Algorithm 1. The remodel() operation (line 6) re-compresses the input model sequence to a specified size.
In our current implementation, we use the Bottom-Up algorithm proposed in [16]. The Bottom-Up algorithm initially creates m/2 segments to approximately represent the raw data. Then the cost of merging each pair of adjacent segments is calculated, and the pair with the lowest cost is merged iteratively. The algorithm stops when some user-specified criterion, such as the target compression ratio, is met. Of course, after each merge operation, the costs of merging the new, longer segment with its left and right neighbors have to be re-calculated.
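A minimal sketch of the merging loop described above, assuming segments are fitted with least-squares lines and the merge cost is the residual error of the fitted line (the exact cost function and the incremental cost updates are left to [16]):

```python
import numpy as np

def fit_cost(points):
    """Fit a least-squares line to a segment; return (cost, (slope, intercept)),
    where the cost is the sum of squared residuals of the fit."""
    x = np.arange(len(points))
    coeffs, res, *_ = np.polyfit(x, points, 1, full=True)
    return (float(res[0]) if len(res) else 0.0), tuple(coeffs)

def bottom_up(data, target_segments):
    """Repeatedly merge the pair of adjacent segments with the lowest merge
    cost until only `target_segments` segments remain (simplified: the real
    algorithm only re-computes the costs of the merged segment's neighbors)."""
    segs = [list(data[i:i + 2]) for i in range(0, len(data) - 1, 2)]  # ~m/2 segments
    if len(data) % 2:
        segs[-1].append(data[-1])                # fold a trailing point into the last segment
    while len(segs) > target_segments:
        costs = [fit_cost(segs[i] + segs[i + 1])[0] for i in range(len(segs) - 1)]
        i = int(np.argmin(costs))                # cheapest adjacent pair
        segs[i:i + 2] = [segs[i] + segs[i + 1]]  # merge it
    return [fit_cost(s)[1] for s in segs]        # one (slope, intercept) per segment
```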
The remodelling operations at an internal node can be further optimized according to the actual compression method being used. For instance, with the Bottom-Up approach, the new sequences of models for all the children can be generated by running one pass of the compression from the models $M(N_i)$ down to $M(nodes[k])$, where $nodes[k]$ is the child with the smallest compression ratio.
Algorithm 1: Model-based Dissemination
Input: the current node N_i, the models M(N_i) received by this node, the plan X
1  nodes[0] ← N_i;
2  append all child nodes of N_i to nodes;
3  sort nodes in descending order of their compression ratios;
4  for j = 1 to sizeof(nodes) do
5      if r_{nodes[j]} < r_{nodes[j-1]} then
6          M(nodes[j]) ← remodel(M(nodes[0]), r_{nodes[j]});
7      else
8          M(nodes[j]) ← M(nodes[j-1]);
9      transmit M(nodes[j]) to nodes[j];
5. COST MODEL
As mentioned earlier, it is very costly to estimate the models’ ac-
curacy. Our general approach is to try to find a proper form of the
accuracy function and use historical data and curve fitting methods
to find out the appropriate coefficients of the function.
As the accuracy of the models sent to $N_i$ is determined by the remodelling processes on its dissemination path, we can define a general form of the accuracy function, $f(N_i)$, as follows.
DEFINITION 5. $f(N_i)$, the accuracy of node $N_i$, is a multidimensional function of the compression ratios of all the nodes within $P(N_i)$:
$$f(N_i) = h(r_{N_0}, r_{p_{k-1}(N_i)}, \ldots, r_{p_1(N_i)}, r_{N_i}) \qquad (1)$$
where $p_{k-1}(N_i) \in \varphi(N_i)$, $r_{p_{k-1}(N_i)}$ is its compression ratio, and $r_{N_0} = 1$.
One can assume a particular form of $f(N_i)$, e.g. a multidimensional quadratic function, and then find the coefficients by curve fitting. However, curve fitting for multidimensional functions is usually very expensive and less accurate. In the following subsections, we try to simplify such a general function to a two-dimensional function based on several realistic observations and assumptions.
5.1 Observations and Assumptions
In this subsection, we present a few observations and assumptions
that will be used to simplify the accuracy function.
ASSUMPTION 1. For two distinct arbitrary nodes $N_i$ and $N_j$ in $T$, if $r_{N_i} = r_{N_j}$ and $f(N_i) = f(N_j)$, then the sequences of models received by $N_i$ and $N_j$, i.e. $M_i$ and $M_j$, are equivalent.
Two equivalent sequences of models $M_i$ and $M_j$ can be replaced by each other.
ASSUMPTION 2. For an arbitrary node $N_i$, $0 \le f(N_i) \le 1$ holds, and $f(N_i) = 1$ if and only if $r_{N_i} = 1$.
Assumption 2 states that a node's accuracy is a value in $[0, 1]$ and is equal to 1 if and only if the data it receives has not undergone any remodelling, i.e. it is the original data.
ASSUMPTION 3. For an arbitrary node $N_i$, $r_{N_i} \le r_{p_1(N_i)}$ and $f(N_i) \le f(p_1(N_i))$ hold, and $f(N_i) = f(p_1(N_i))$ if and only if $r_{p_1(N_i)} = r_{N_i}$.
Assumption 3 means that the accuracy of an arbitrary node is no more than the accuracy of its ancestor nodes and that the compression ratios fulfill the volume decreasing property on a dissemination path. The equality holds if and only if no remodelling occurs while the models are transmitted from $p_j(N_i)$ to $N_i$ along the path $P(N_i)$.
The next two assumptions concern two distinct arbitrary nodes $N_i$ and $N_j$ in $T$. Figure 3 shows their dissemination paths.
ASSUMPTION 4. If $r_{p_1(N_i)} = r_{p_1(N_j)}$, $r_{N_i} = r_{N_j}$, and $f(p_1(N_i)) > f(p_1(N_j))$, then $f(N_i) > f(N_j)$ holds.
Assumption 4 states that, if $N_i$ and $N_j$ are equal in their compression ratios and so are their parents, their accuracies are determined, respectively, by their parents' accuracies.
Figure 3: Two Dissemination Paths
ASSUMPTION 5. If $r_{p_1(N_i)} = r_{p_1(N_j)}$, $f(p_1(N_i)) = f(p_1(N_j))$ and $r_{N_i} > r_{N_j}$, then $f(N_i) > f(N_j)$.
According to Assumption 1, the model sequences received by $p_1(N_i)$ and $p_1(N_j)$ are equivalent under the conditions of Assumption 5. Simply put, Assumption 5 states that the accuracies of the new models generated from two equivalent sequences of models depend only on their compression ratios.
ASSUMPTION 6. The accuracy difference between nodes $p_1(N_i)$ and $N_i$, i.e. $f(p_1(N_i)) - f(N_i)$, is a binary quadratic function of $r_{p_1(N_i)}$ and $r_{N_i}$:
$$f(p_1(N_i)) - f(N_i) = \alpha_1 r_{N_i}^2 + \beta_1 r_{N_i} + \alpha_2 r_{p_1(N_i)}^2 + \beta_2 r_{p_1(N_i)} + \gamma\, r_{N_i} r_{p_1(N_i)} + \omega \qquad (2)$$
where $\alpha_1$, $\alpha_2$, $\beta_1$, $\beta_2$, $\gamma$ and $\omega$ are constant coefficients.
Assumption 6 is to simplify the accuracy function from a multi-
dimensional one to a two-dimensional quadratic function. One can
of course assume a more complicated function. But, as we will see
later in the experiment results, this function performs quite well in
practice and hence we will use it due to its simplicity.
5.2 Accuracy Function
We can further reduce the coefficients in Eqn. (2) by applying the
equivalent condition in Assumption 3. More specifically, according
to Assumption 3, if rp1(Ni)=rNi, then f(Ni) = f(p1(Ni)). We
can derive that
0 = α1r2
Ni+α2r2
Ni+γr2
Ni+β1rNi+β2rNi+ω
Therefore, β1=β2,α1+α2+γ= 0 and ω= 0.
By substitution, we can simplify Eqn. (2) as follows:
f(p1(Ni)) f(Ni) = α1r2
Ni+α2r2
p1(Ni)+β1rNi
β1rp1(Ni)(α1+α2)rirp1(Ni)(3)
LEMMA 1. The accuracy $f(N_i)$ defined by Eqn. (3) satisfies Assumptions 2 to 5.
PROOF. It is easy to prove that Eqn. (3) satisfies Assumptions 2 and 3. According to Assumption 6, we know that
$$f(N_0) - f(p_{k-1}(N_i)) \ge 0, \quad \ldots, \quad f(p_1(N_i)) - f(N_i) \ge 0, \quad f(N_i) \ge 0.$$
Thus, $1 = f(N_0) \ge \cdots \ge f(p_1(N_i)) \ge f(N_i) \ge 0$. Obviously, $f(p_1(N_i)) = f(N_i)$ holds if and only if $r_{p_1(N_i)} = r_{N_i}$. In addition, $r_{N_i} \le r_{p_1(N_i)}$ is satisfied by the volume decreasing property. Therefore, (3) satisfies Assumptions 2 and 3.
Consider the path $P(N_j)$ in Figure 3; according to function (3), we have:
$$f(p_1(N_j)) - f(N_j) = \alpha_1 r_{N_j}^2 + \alpha_2 r_{p_1(N_j)}^2 + \beta_1 r_{N_j} - \beta_1 r_{p_1(N_j)} - (\alpha_1 + \alpha_2)\, r_{N_j} r_{p_1(N_j)} \qquad (4)$$
For Assumption 4, let $r_{p_1(N_i)} = r_{p_1(N_j)}$ and $r_{N_i} = r_{N_j}$; then $(3) - (4)$ gives $f(N_i) - f(N_j) = f(p_1(N_i)) - f(p_1(N_j))$. Thus, if $f(p_1(N_i)) \ge f(p_1(N_j))$, obviously $f(N_i) \ge f(N_j)$ also holds.
For Assumption 5, as $r_{p_1(N_i)} = r_{p_1(N_j)}$ and $f(p_1(N_i)) = f(p_1(N_j))$, we can derive $(4) - (3)$ as
$$f(N_i) - f(N_j) = \alpha_1 (r_{N_j}^2 - r_{N_i}^2) + \beta_1 (r_{N_j} - r_{N_i}) - (\alpha_1 + \alpha_2)\, r_{p_1(N_i)} (r_{N_j} - r_{N_i}) = (r_{N_j} - r_{N_i}) \left[ \alpha_1 (r_{N_i} + r_{N_j}) + \beta_1 - (\alpha_1 + \alpha_2)\, r_{p_1(N_i)} \right] \qquad (5)$$
It is easy to make $\alpha_1 (r_{N_i} + r_{N_j}) + \beta_1 - (\alpha_1 + \alpha_2)\, r_{p_1(N_i)} \le 0$ hold by adjusting $\alpha_1$, $\alpha_2$ and $\beta_1$. In addition, $r_{N_i} \ge r_{N_j}$; thus $f(N_i) \ge f(N_j)$.
According to the above analysis, it is clear that the function satisfies all the assumptions.
If we set $r_{p_1(N_i)} = 1$ (i.e. $f(p_1(N_i)) = 1$) in Eqn. (3), then
$$f(N_i) = -\alpha_1 r_{N_i}^2 + (\alpha_1 + \alpha_2 - \beta_1)\, r_{N_i} + (1 + \beta_1 - \alpha_2) \qquad (6)$$
This means that if the model sequence $M_i$ is obtained by direct modelling over the original data, then its accuracy is a quadratic function of its compression ratio. This is also validated by our experiments in Section 7.
LEMMA 2. For a dissemination path $P(N_i)$ to $N_i$, $f(N_i)$ is a quadratic function of $r_{N_i}$ if all the compression ratios of $N_i$'s preceding nodes on $P(N_i)$ are given.
Lemma 2 shows a way to obtain the specific form of the accuracy function. If we fix all the compression ratios of the preceding nodes of $N_i$ on $P(N_i)$, then we can use a quadratic function to fit $N_i$'s actual accuracy using the historical data. Through the experiments, we found that a quadratic polynomial function fits all the tested datasets quite well.
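For instance, the quadratic fit of Lemma 2 can be obtained with an off-the-shelf least-squares routine; a sketch (the function and variable names are ours, and the example comment merely restates a value reported in Figure 5):

```python
import numpy as np

def fit_accuracy_curve(ratios, accuracies):
    """Fit f(r) = a*r^2 + b*r + c to (compression ratio, measured accuracy)
    pairs collected on historical data for one fixed remodelling prefix."""
    a, b, c = np.polyfit(ratios, accuracies, 2)
    return lambda r: a * r ** 2 + b * r + c

# e.g. for the remodelling path <1, r_i, ..., 0.1> on the ECG data, the fitted
# curve reported in Figure 5(b) is roughly f(x) = -0.61x^2 + 1.18x + 0.42.
```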
5.3 Computing Model Accuracy
With the cost model, model accuracies can be calculated without the source data. This simplifies the calculation of model accuracy and can also enormously improve its efficiency, which is crucial for low-latency dissemination.
As stated earlier, the accuracy of a sequence of models $M_i$ is determined by the specific remodelling processes on the dissemination path $P(N_i)$. It is impossible to enumerate all possible remodelling processes of a dissemination network, and hence we cannot pre-compute all possible models' accuracies. Alternatively, the compression ratios (or volumes) of models are classified into several scales according to the bandwidth constraints (subscription levels) of the receivers. All the possible remodelling paths form a Remodelling Tree (RT), as illustrated in Figure 4, where the compression ratios are classified into 10 scales. A complete remodelling path in the RT, from the root to a leaf, is a sequence of strictly monotonically decreasing compression ratio values, which starts at a compression ratio of 1 and ends at 0.1. The models' accuracies on all possible remodelling paths are pre-computed with respect to these scales.
With an RT, models' accuracies can be approximately calculated regardless of the dissemination path. In order to estimate the models' accuracies on an actual dissemination path $P(N_i)$, we find the closest remodelling path for $P(N_i)$ in the RT. For example, given a dissemination path $\langle 1, 0.9, 0.9, 0.75, 0.43, 0.1 \rangle$, the closest path in the RT is $\langle 1, 0.9, 0.8, 0.4, 0.1 \rangle$. The accuracies for each node on $P(N_i)$ are then approximated by the accuracies on the remodelling path. Due to the space limit, we will not discuss the details of the algorithm, which is quite straightforward.
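The mapping onto the closest remodelling path can be sketched as follows, assuming ten scales {1.0, 0.9, ..., 0.1}; the snapping rule used here is our own reading of "closest", as the paper omits the details:

```python
SCALES = [round(0.1 * k, 1) for k in range(10, 0, -1)]   # 1.0, 0.9, ..., 0.1

def closest_rt_path(dissemination_path):
    """Snap each compression ratio on the dissemination path to its nearest
    RT scale, then drop repeats: equal consecutive ratios mean no remodelling."""
    snapped = [min(SCALES, key=lambda s: abs(s - r)) for r in dissemination_path]
    rt_path = [snapped[0]]
    for r in snapped[1:]:
        if r < rt_path[-1]:
            rt_path.append(r)
    return rt_path

# closest_rt_path([1, 0.9, 0.9, 0.75, 0.43, 0.1]) -> [1.0, 0.9, 0.8, 0.4, 0.1]
```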
Figure 4: Remodelling Tree RT
6. OPTIMIZATION ALGORITHMS
With the cost model developed above, we will, in this section, de-
sign algorithms to optimize the dissemination plan X. As stated,
MMD is NP-hard. Therefore, we will design several heuristic-
based algorithms to solve this problem.
6.1 Greedy Algorithm
As analyzed earlier, the accuracy of a sequence of models is related to its size. Therefore, in the first algorithm, we try to maximize the number of models that are sent out from each internal node. In other words, we set the compression ratio of each node as large as possible. This algorithm is presented in Algorithm 2 and works as follows. It starts from the root node $N_0$ and sets its compression ratio to 1. Then it traverses the tree in a breadth-first manner. For each internal node $N_i$ and its child node $N_j$, according to the volume decreasing property and the bandwidth constraint, $r_{N_j}$ is set to $\min(c_{N_j}/|M_0|, r_{N_i})$. The algorithm stops when all the nodes' compression ratios are set. The greedy algorithm needs just one traversal of $T$ and hence its complexity is $O(n)$.
6.2 Randomized Algorithm
By a closer investigation, we identify two potential problems with the above greedy algorithm. First, the greedy algorithm tries to maximize the number of models that are sent from each parent to its children.
Algorithm 2: Greedy Algorithm
Input: dissemination tree T, bandwidth constraints C, initial models M_0
Output: dissemination plan X
1  let NodeQueue be a queue for preserving nodes and t be the current visiting node;
2  initialization: r_{N_0} ← 1;
3  enqueue(NodeQueue, N_0);
4  while NodeQueue is not empty do
5      t ← dequeue(NodeQueue);
6      {t_1, ..., t_k} ← all children of t in T;
7      for i = 1 to k do
8          enqueue(NodeQueue, t_i);
9          r_{t_i} ← min(r_t, c_{t_i}/|M_0|);
However, this may not provide the optimal solution. For example, in the feasible plan depicted in Figure 1, if we decrease the volume of models received by $N_1$ from 9 to 6, then we do not need to perform remodelling at $N_1$ and hence the accuracies of the models sent to $N_1$'s descendants would be increased. This may actually end up with a better plan. Generally speaking, if we decrease $r_{N_i}$, it is possible that some of its children's accuracies would be increased by avoiding the remodelling process. Second, the greedy algorithm only searches a limited part of the solution space and hence can easily produce a sub-optimal solution.
Based on the above observations, we try to design a new algorithm that addresses the two potential problems. In this algorithm, for an internal node, its compression ratio is chosen within the range $[minC/|M_0|, maxC/|M_0|]$, where $minC$ and $maxC$ are the minimum and maximum bandwidth constraints among its child nodes. This addresses the first problem stated above.
Furthermore, instead of using a greedy algorithm, we employ a randomized local search strategy. More specifically, we use the stochastic hill climbing approach, which selects a neighbor at random. It runs in many iterations, and within each iteration the compression ratio of a node is altered to a value within the range $[minC/|M_0|, maxC/|M_0|]$. This procedure is repeated a number of times until a local optimum is reached, and the corresponding plan is chosen. The initial plan is randomly generated by Algorithm 3.
In Algorithm 3, each node (except $N_0$) is visited twice within each iteration: once for determining the compression ratio of its parent and once for itself. Besides, finding $minC$ and $maxC$ for a non-leaf node needs one scan of its children. Therefore the cost of running Algorithm 3 is $2n + \sum_{i=0}^{n_0} \theta_i$, where $\theta_i$ is the number of child nodes of node $N_i$ and $N_0, \ldots, N_{n_0}$ are the non-leaf nodes.
In addition, each non-leaf node $N_i$ corresponds to an iteration of the whole algorithm, and at $N_i$ it searches at most $\theta_i$ plans. Suppose the cost of calculating the accuracy of a plan is $k$ times the cardinality of the plan, i.e. $k \cdot n$. Thus,
$$\vartheta(n) = \Big(2n + \sum_{i=0}^{n_0} \theta_i\Big) + \sum_{i=0}^{n_0} \theta_i \cdot (k \cdot n) \le 3n + k \cdot n^2$$
The complexity of the whole randomized algorithm is $O(\vartheta(n)) = O(n^2)$.
Algorithm 3: Randomized Algorithm
Input: dissemination tree T, node set N = {N_0, ..., N_n}, bandwidth constraints C, initial models M_0
Output: initial plan X_0
1   let t be the current visiting node;
2   initialization: r_{N_0} ← 1;
3   for i = 1 to n do
4       t ← N_i;
5       if t is not a leaf node then
6           children ← all children of t;
7           minC ← minimum bandwidth constraint among children;
8           maxC ← maximum bandwidth constraint among children;
9           rand ← random integer in [minC, maxC];
10      else
11          rand ← c_t;
12      r_t ← min(r_{p_1(t)}, rand/|M_0|);
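The hill-climbing loop built on top of this initial plan can be sketched as follows; the neighbour-generation function and the stopping rule are our illustrative choices, and F stands for the average-accuracy estimate obtained from the cost model:

```python
import random

def hill_climb(initial_plan, neighbours, F, max_no_improve=100):
    """Stochastic hill climbing: repeatedly pick a random neighbour of the
    current plan (a plan with one node's ratio altered within its feasible
    range) and move to it if it improves F; stop after a run of failures."""
    current, best = initial_plan, F(initial_plan)
    failures = 0
    while failures < max_no_improve:
        candidate = random.choice(neighbours(current))
        score = F(candidate)
        if score > best:
            current, best, failures = candidate, score, 0
        else:
            failures += 1
    return current
```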
Algorithm 4: MFBF Algorithm
Input: dissemination tree T, node set N = {N_0, ..., N_n}, bandwidth constraints C, initial models M_0
Output: dissemination plan X
1   let U be the set of unvisited nodes and V the set of visited nodes of T;
2   initialization: r_{N_0} ← 1, U ← N \ {N_0}, V ← {N_0};
3   while U ≠ ∅ do
4       b ← the most frequent bandwidth of the nodes in U;
5       {t_1, ..., t_k} ← all nodes with bandwidth b in U;
6       for i = 1 to k do
7           {p_1, ..., p_m} ← all preceding nodes of t_i in U;
8           p ← the lowest preceding node of t_i in V;
9           r_{t_i} ← min(r_p, b/|M_0|);
10          for j = 1 to m do
11              r_{p_j} ← r_{t_i};
12              move p_j from U to V;
6.3 MFBF Algorithm
The above randomized algorithm needs many iterations to obtain a good plan. So even though the algorithm has a low asymptotic complexity, its constant coefficient may still be very large. In this subsection, we design a more efficient algorithm that can still produce a good plan.
The algorithm is called Most Frequent Bandwidth First (MFBF). The basic idea of this algorithm is to minimize the number of remodellings that occur in the dissemination tree. Therefore, we give higher priority to maximizing the accuracies of the nodes with more frequent bandwidth constraints. The details of MFBF are described in Algorithm 4.
As the first step, the algorithm performs a traversal of the dissemination tree, and if the bandwidth constraint of a node is greater than the maximum constraint maxP on its dissemination path, it is reset to maxP.
The algorithm runs in multiple iterations. Within each iteration, it follows these steps:
- Find the most frequent bandwidth $\kappa$ among the unvisited nodes and put the unvisited nodes with bandwidth $\kappa$ into a queue $Q$.
- Iterate over $Q$ and, for each node $t_i$ in $Q$, set the compression ratio of $t_i$ and of all its unvisited ancestor nodes to a value $r$, where $r$ is the minimum of $c_{t_i}/|M_0|$ and the compression ratio of $t_i$'s lowest visited preceding node. After this step, $t_i$ and all its unvisited ancestor nodes are considered visited.
The cost of the MFBF algorithm mainly consists of three parts: 1) the cost of finding the most frequent bandwidth among the unvisited nodes $U$, denoted as $h_1$; 2) the cost of finding $N_i$'s unvisited ancestor nodes in $U$, denoted as $h_2$; 3) the cost of finding the lowest visited ancestor node, denoted as $h_3$. For 1), each execution of finding the most frequent bandwidth just needs one pass over $U$ and hence has complexity $O(n)$. Suppose that there are $a$ distinct bandwidth constraints in $T$; then $h_1 \le a \cdot n$. For 2) and 3), an arbitrary node appears as an unvisited node or a visited preceding node at most twice. In our implementation, we maintain an index pointing to the parent of each node, which means we can find a preceding node of $N_i$ in constant time, denoted as $\epsilon$. So $h_2 + h_3 \le \epsilon \cdot 2n$ and
$$h(n) = h_1 + h_2 + h_3 \le a \cdot n + \epsilon \cdot 2n = n \cdot (a + 2\epsilon)$$
Thus, $O(h(n)) = O(n)$.
7. EXPERIMENTS AND EVALUATION
In this section, we present our experimental results. We implemented the prototype system in Java. Since our objective function is the accuracy of the data received by the subscribers, there is no difference between running the system in a distributed setting and simulating it on a single machine. Therefore, we simulate the whole system on one machine, which lets us easily run tests with many different system parameters.
Data. We used a publicly available dataset [11], named the UCR time series data. UCR is a collection of time series data from the mining/machine learning community. The data set we used is data set 1 in [11], which is made up of 19 types of time series data from diverse applications. We carried out our experiments on all of these data and present the results on five datasets; the results on the other data are similar to the reported ones. The specific time series that we have used are: 50 Words [29], OSU Leaf [12], Gun Point [28, 27], ECG and Wafer [24]. Each data set is composed of hundreds of sequences, and there are hundreds of points in each sequence.
The bandwidth constraint and data rate of the source node were set to 1 MByte/s, which would of course differ in practice. For any other node, the data rate is defined by its compression ratio (or bandwidth constraint), and its unit is omitted in the following. The experiments can be divided into two parts:
- Model Accuracy: In this part, we show how to use a curve-fitting approach to produce the cost model for estimating the model accuracies. Furthermore, we validate the cost model proposed in Section 5.
- Optimization Algorithms: We construct a number of dissemination trees with multiple scales of bandwidth constraints and examine the performance of the optimization algorithms proposed in this paper.

Figure 5: Model Accuracy and Corresponding Cost Model. (a) the longest path; (b) 1 to x; (c) 0.9 to x. Each panel plots the accuracy of models against the compression ratio for the ECG, Wafer, OSU Leaf, 50 Words and Gun Point datasets. Fitted cost models in (b): ECG: f(x) = -0.61x^2 + 1.18x + 0.42; Wafer: f(x) = -0.19x^2 + 0.33x + 0.86; OSU Leaf: f(x) = -0.54x^2 + 1.36x + 0.19; 50 Words: f(x) = -0.80x^2 + 1.34x + 0.44; Gun Point: f(x) = -0.37x^2 + 0.68x + 0.69. Fitted cost models in (c): ECG: f(x) = -0.66x^2 + 1.24x + 0.39; Wafer: f(x) = -0.43x^2 + 0.74x + 0.68; OSU Leaf: f(x) = -0.23x^2 + 0.35x + 0.85; 50 Words: f(x) = -0.85x^2 + 1.40x + 0.40; Gun Point: f(x) = -0.48x^2 + 1.29x + 0.20.
7.1 Model Accuracy
We implemented the Remodelling Tree (RT), as shown in Figure 4.
In our experiments, we set the bandwidth constraints to ten scales,
i.e. {1,0.9,0.8,...,0.1}. We have also tested twenty scales, which
produce similar results.
Suppose $P = \langle 1, \ldots, r_j, r_i, \ldots, 0.1 \rangle$ is a remodelling path of the RT, where $r_j > r_i$. According to Lemma 2, we construct a remodelling tree path for the remodelling processes on a dissemination path of $T$; we can then get the cost model for each remodelling path by curve fitting. With the cost model, we can easily calculate the accuracy for each node.
$P_1 = \langle 1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1 \rangle$ is the longest path in the RT. The models generated through $P_1$ have the lowest accuracy among the models with the same compression ratio within the RT. The results are illustrated in Figure 5 (a).
According to Lemma 2, if all compression ratios of $N_i$'s preceding nodes on path $P(N_i)$ are provided, then $f(N_i)$ is a quadratic function of $r_{N_i}$. That means that if we fix all the compression ratios before $r_i$ in the remodelling path $P$, with $r_i$ within $[0.1, r_j]$, then we can get the cost model $f(r_i)$ by curve fitting with a quadratic function. Here, we choose two paths from the RT to present our observation: $P_2 = \langle 1, r_i, \ldots, 0.1 \rangle$, where $r_i$ varies within $\{1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1\}$, and $P_3 = \langle 1, 0.9, r_i, \ldots, 0.1 \rangle$, where $r_i$ varies within $\{0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1\}$. As shown in Figure 5 (b) and (c), a quadratic function fits the results quite well, which validates our claims in Lemmas 1 and 2. The other possible remodelling paths yield similar results.
7.2 Performances of Optimization Strategies
We conduct an intensive study on the performance of the three optimization algorithms described in Section 6. The performance is measured by the average accuracy $F(M)$ over all nodes within the dissemination network $T$.
We construct different dissemination trees $T$ randomly with two tunable parameters: the maximum fan-out $\theta$ and the total number of nodes $n$. In our experiments, the settings of the two parameters are as follows:
- $\theta$: varying in 3 scales, $\{2, 5, 10\}$;
- $n$: varying in $\{100, 200, 500, 1000, 2000, 5000, 10000\}$.
$\theta$ and $n$ affect the structure of $T$, e.g. its height, and the structure of $T$ determines the possible remodelling operations. $T$ is constructed with the following properties.
Each node in $T$ has $num$ children, where $num$ is a random integer within $[0, \theta]$. In addition, each node is associated with a bandwidth constraint whose scale falls within the ten scales $\{1, 0.9, \ldots, 0.1\}$. For a node $N_i$ with a bandwidth constraint scale of $i$, its bandwidth constraint is equal to $i \cdot |M_0|$, where $|M_0|$ is the expected volume of the time series to be disseminated within each epoch. Furthermore, we simply set the bandwidth of the root to the volume of the initial models, i.e. $c_{N_0} = |M_0|$. In our experiments, we generate the bandwidth constraints of the nodes in $T$ based on five different distributions.
- Level: nodes at the same level within $T$ have the same bandwidth constraints.
- Uniform: the bandwidth constraint of each node obeys a uniform distribution and is independent of its location within $T$.
- Gaussian: the bandwidth constraints obey a Gaussian distribution with variance $\sigma^2 = 1$ and mean $\mu$ equal to the median value of all the bandwidth constraints.
- MaxZipf: the bandwidth constraints are subject to a Zipfian distribution $P(k) \propto \frac{1}{k^a}$, where $P(k)$ is the frequency of the $k$-th bandwidth constraint and $a = 1$, and the larger bandwidth constraints have smaller $k$ values (i.e. are more frequent).
- MinZipf: similar to MaxZipf except that the smaller bandwidth constraints have smaller $k$ values (i.e. are more frequent).
Furthermore, for each of the above bandwidth distributions, we consider two schemes to assign these bandwidths to the nodes. In reality, this is determined by the organization of the dissemination tree, which is another complicated optimization problem and is out of the scope of this paper.
- Ordered: nodes at higher levels of the tree have larger bandwidth constraints.
- Unordered: no restriction on the bandwidth allocation within $T$; we just randomly assign a bandwidth to each node $N_i$ according to the specific distribution.

Figure 6: Horizontal Comparison. (a) Greedy Algorithm; (b) Approximate Algorithm; (c) Randomized Algorithm. Each panel plots the average accuracy against the number of nodes for fan-outs 2, 5 and 10.
Figure 7: Vertical Comparison. (a) Fan-out 2; (b) Fan-out 5; (c) Fan-out 10. Each panel plots the average accuracy against the number of nodes for the greedy, MFBF and randomized algorithms.
Note that we only consider the "Ordered" scheme for the "Level" distribution. Thus, there are 9 cases in total, as shown in Table 1. We evaluate the performance of the algorithms under all the different distributions with the number of nodes varying from 100 to 10000. We repeat each group of experiments 100 times and take the mean of the average accuracies $F(M)$ as the final result.
Table 1: Possible Combinations
Distribution  Ordered     Unordered
Uniform       uniform-1   uniform-0
Gaussian      gaussian-1  gaussian-0
MaxZipf       maxzipf-1   maxzipf-0
MinZipf       minzipf-1   minzipf-0
Level         level       -
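For reference, a sketch of how the bandwidth-constraint scales could be sampled under these distributions; the exact sampling procedure is not spelled out in the paper, so the details below (e.g. treating the Gaussian sigma in units of scale steps) are our assumptions, and the Level scheme, which assigns scales per tree level, is omitted:

```python
import random

SCALES = [round(0.1 * k, 1) for k in range(1, 11)]        # 0.1, 0.2, ..., 1.0

def sample_scale(distribution):
    """Draw one bandwidth-constraint scale under the named distribution."""
    if distribution == "uniform":
        return random.choice(SCALES)
    if distribution == "gaussian":                         # mean near the median scale
        idx = round(random.gauss(len(SCALES) / 2, 1))
        return SCALES[min(max(idx, 0), len(SCALES) - 1)]
    if distribution in ("maxzipf", "minzipf"):             # P(k) ~ 1/k with a = 1
        weights = [1.0 / k for k in range(1, len(SCALES) + 1)]
        ordered = sorted(SCALES, reverse=(distribution == "maxzipf"))
        return random.choices(ordered, weights)[0]         # maxzipf: large scales most frequent
    raise ValueError(distribution)
```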
7.3 Performances Analysis
The randomized search algorithm applies stochastic hill climbing to optimize the dissemination, which is a costly procedure. For example, it takes 2 hours to complete the experiments with 10,000 nodes, while it takes less than 1 minute for the other two algorithms. Due to the long running time of the randomized algorithm, we only present its results for the Level distribution, as shown in Figure 6 and Figure 7. For the other distributions, we only executed this algorithm on small network scales, and the results suggest the same conclusions as for the Level distribution. Therefore, for the other 8 cases, we only present and discuss the greedy and MFBF algorithms.
The results are illustrated in Figures 6 to 15. We compare the algorithms both horizontally and vertically. The horizontal comparison compares the performances of the different algorithms, as shown in Figure 6, while the vertical comparison compares the same algorithm under varying $\theta$, as shown in Figure 7. To illustrate, we plot the results for the Level distribution in 6 graphs. In the other cases, we put all the results into one graph per case to save space.
Horizontal comparisons of the Algorithms. As shown in Figures 6 and 14, MFBF performs better than the greedy algorithm in the level and minzipf-0 cases. Furthermore, MFBF performs especially well in the level case. This is because, in the level distribution, the bandwidth monotonically decreases along each dissemination path of the dissemination tree. Therefore, the greedy algorithm, which attempts to maximize the bandwidth usage on each edge, will incur excessive remodelling on all the dissemination paths. On the contrary, MFBF can significantly decrease the total number of remodelling processes by taking the bandwidth frequencies into consideration. This is also true in the minzipf-0 distribution, but to a lesser extent. Moreover, in the level distribution, the randomized algorithm performs very close to the MFBF algorithm, as it is also effective in reducing the remodelling processes. However, as mentioned earlier, it takes a much longer time to complete compared with the other two algorithms.
In Figures 12 and 13, MFBF performs worse than the greedy algorithm. This is because the main idea of the MFBF algorithm is to reduce the remodelling processes by altering the bandwidth usage of the ancestor nodes of the nodes with more frequent bandwidths, which wastes bandwidth, especially when most of the bandwidths are very large; this is unfortunately the case for the maxzipf-0 and maxzipf-1 distributions.
Figures 8-15 plot the average accuracy against the number of nodes for greedy-2/5/10 and MFBF-2/5/10 under each bandwidth distribution: Figure 8: uniform-0; Figure 9: uniform-1; Figure 10: gaussian-0; Figure 11: gaussian-1; Figure 12: maxzipf-0; Figure 13: maxzipf-1; Figure 14: minzipf-0; Figure 15: minzipf-1.
As shown in Figure 8 and Figure 9, the greedy algorithm outperforms MFBF slightly under the uniform distributions, i.e. uniform-1 and uniform-0. In such cases, there is no significant difference in the bandwidth frequencies and hence there are limited opportunities for the MFBF algorithm to reduce the remodelling processes.
Furthermore, as shown in Figure 10 and Figure 11, the performances of the greedy algorithm and MFBF are almost identical for the two Gaussian distributions. In these cases, most of the nodes have bandwidth constraints close to the median value. Therefore, the gain of accuracy from reducing the remodelling processes is almost equivalent to the loss of accuracy caused by the wasted bandwidth.
According to the above analysis, we can conclude that MFBF performs very well when the dissemination tree is "well" organized, as in our level and minzipf-0 distributions. In such cases, there are more nodes with smaller bandwidth constraints and the nodes with higher bandwidth constraints are located higher in the dissemination tree. We conjecture that this is true in many real situations. If it is not, then the greedy algorithm performs better.
The general conclusion of the results. The average accuracy $F(M)$ decreases progressively as we increase the total number of nodes in $T$ from 100 to 10000. The reason is that remodelling is rare when the number of nodes is small, but it becomes more frequent as the number of nodes increases. Furthermore, when the network becomes larger, the average accuracies become steady in all the cases.
The sensitivity to θ. As shown in all the previous figures, F(M) is positively correlated with the maximum fan-out θ of T. The major reason is that, for a fixed n, the height of T is determined by θ, and the height of T determines the maximum possible number of remodelling processes on a dissemination path. In other words, when θ increases, the number of remodellings within T is reduced, and consequently the average accuracy increases.
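To make this dependence concrete, a back-of-the-envelope bound (assuming a roughly complete tree, not the exact cost model of this paper) is

h(T) \approx \lceil \log_{\theta} n \rceil, \qquad \text{remodellings on a root-to-leaf path} \le h(T) - 1 .

For n = 10000 this gives h(T) ≈ 14 with θ = 2 but only h(T) ≈ 4 with θ = 10, which is consistent with the higher accuracies observed for larger fan-outs.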
The influence of assignment schemes. In terms of the assignment schemes, all the algorithms perform better under the ordered scheme than under the unordered scheme with the same distribution. Furthermore, the assignment schemes have a greater influence on the greedy algorithm, while MFBF performs much more consistently.
8. CONCLUSIONS
In this paper, we have discussed the problem of multi-scale time series data dissemination over a predefined tree-based network. We used bandwidth constraints as an abstract parameter to indicate both the subscription levels of the subscribers in the dissemination network and the volumes of data that the subscribers are able to consume. We proposed a framework of model-based dissemination and, based on it, formulated an optimization problem that maximizes the average accuracy of the data received by all the nodes in the dissemination network. One major challenge in solving this problem is the complexity of estimating the accuracy of a particular dissemination plan. To address this challenge, we made a thorough study of estimating the accuracy of using PLA to represent time series data. In addition, we proposed three algorithms to optimize the dissemination plan and carried out extensive experiments to evaluate their performance under various bandwidth distributions. The experimental results can serve as guidelines for choosing the optimization strategy according to the actual application environment.
There are several interesting directions for future work. First, optimizing the structure of the dissemination tree is an interesting problem that would have a significant effect on the overall data accuracy. As our analysis and experimental results show, the number of remodellings executed in the network is a major factor in the resulting data accuracy. On the one hand, organizing the dissemination network to reduce the need for remodelling can increase the data accuracy. On the other hand, the network organization should take into consideration the actual physical network properties, such as transfer latency and locality. For example, connecting a node Ni as a child of another node Nj with the same subscription level avoids remodelling, but if Nj is far away from Ni and incurs a long network latency, this connection is undesirable. Hence, a good trade-off has to be found. Second, it is interesting to study other models (e.g. DWT) for compressing time series data. In general, our framework can accommodate different compression models as long as remodelling methods are provided; however, the cost models and optimization algorithms would have to be rethought and re-examined.
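One possible way to make the trade-off above concrete is a parent-selection heuristic that scores each candidate parent by combining the estimated accuracy loss from remodelling with the measured network latency. The function names, the linear scoring, and the weight alpha below are illustrative assumptions, not part of this paper.

def choose_parent(node, candidates, latency, remodel_loss, alpha=0.5):
    # candidates: iterable of potential parent nodes for `node`
    # latency(p, node): measured (normalized) network latency between p and node
    # remodel_loss(p, node): estimated (normalized) accuracy loss if p must remodel for node
    # alpha: hypothetical weight balancing accuracy loss against latency
    def score(p):
        return alpha * remodel_loss(p, node) + (1 - alpha) * latency(p, node)
    return min(candidates, key=score)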