
Multi-Scale Dissemination of Time Series Data

Qingsong Guo, Yongluan Zhou, Li Su

Department of Mathematics and Computer Science, University of Southern Denmark, Denmark

{qguo, zhou, lsu}@imada.sdu.dk

ABSTRACT

In this paper, we consider the problem of continuous dissemination of time series data, such as sensor measurements, to a large number of subscribers. These subscribers fall into multiple subscription levels, where each subscription level is specified by the bandwidth constraint of a subscriber, an abstract indicator of both the physical limits and the amount of data that the subscriber is willing to handle. To address this problem, we propose a system framework for multi-scale time series data dissemination that employs a typical tree-based dissemination network and existing time-series compression models. Because of the bandwidth limits relative to the potentially sheer speed of the data, it is inevitable to compress and re-compress data along the dissemination paths according to the subscription level of each node. Since compression causes a loss of accuracy, we devise several algorithms to optimize the average accuracy of the data received by all subscribers within the dissemination network. Finally, we have conducted extensive experiments to study the performance of the algorithms.

Categories and Subject Descriptors

H.2.4 [Database]: Scientific Data Management

General Terms

Multi-Scale Dissemination

Keywords

Time Series Data, Data Dissemination, Piecewise Linear Approximation

1. INTRODUCTION

In many cases, scientific data generated from sensors and various scientific instruments take the form of time series, i.e. streams of data points measured at successive time instants with uniform time intervals. In the context of Data-as-a-Service (DaaS) or large-scale collaborative research projects, there may be a large number of parties interested in the data. For instance, Microsoft's SensorMap project [23] establishes a portal to enable collection and sharing of sensor network data from various deployments of public institutes, companies as well as individuals. Data collected by the SensorMap

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
SSDBM '13, July 29 - 31 2013, Baltimore, MD, USA
Copyright 2013 ACM 978-1-4503-1921-8/13/07 $15.00

should be disseminated to a great number of receivers who have subscribed to them. This framework can be organized as a publish/subscribe system, where each subscriber is an interested party.

With the development of technology, the rates of data generated by many advanced devices are prohibitively high. For example, there are around 600 million collisions per second within the LHC (Large Hadron Collider) [2], which are observed by 4 detectors and produce a data rate of approximately 300 GB/s; the rate is still over 300 MB/s even after pre-filtering. As another example, consider an underwater sensor network [33] for monitoring the health of river and marine environments, where the overall data rate scales with the size and density of the sensor deployment. Note that scientists often tend to obtain measurements at the largest scale and finest grain allowed by the technology. Assume we monitor an area of 100 km × 100 km that is divided into 1 m × 1 m grid cells, each deployed with a sensor. Suppose each node produces one measurement every second and each measurement has 4 bytes of data. This gives a total data rate of around 40 GB/s.

Due to the sheer speed of such data, it is challenging to disseminate them to a large number of subscribers. Furthermore, different organizations or individuals have different physical capacities and demands on the data and hence require different granularities or accuracies. Sending the raw data to all the subscribers is not only resource consuming but also undesirable. Supporting different subscription levels with different data granularities or accuracies would be a favorable feature for both the data providers and the subscribers.

In this paper, we consider the problem of disseminating time-series data to a large number of subscribers with multiple subscription levels. These nodes are organized as a publish/subscribe system, or dissemination network, where one node is designated as the primitive data provider and each subscriber is under a bandwidth constraint. The bandwidth constraint is introduced to describe the subscription level of a node; it is an abstract parameter that indicates not only the volume of data that a subscriber would prefer to consume but also physical capacities, such as the bandwidth of its communication link. This problem suffers from the challenges of the sheer speed of the data as well as the bandwidth constraints in the system. The basic idea is to reduce the volume of data and to optimize the quality of service by compression, as done in other publish/subscribe systems [32].

Similar to the architecture of End System Multicast [13], where all end hosts are organized into an overlay tree, our dissemination network is also constructed as a tree on the application layer. As an end-to-end communication typically passes through other nodes, extra traffic at these intermediate nodes is inevitable if all data were transmitted directly from the source node, which makes the challenge even worse given that bandwidth is limited relative to the data rate. We therefore have to optimize the dissemination network. The basic idea is that the data for each subscriber originates from the data of its parent rather than from the source. Therefore, in our system, each node plays two roles: a subscriber that requests a particular scale of compressed data and a broker that provides multi-scale compressed data to its child nodes.

In our context, data for subscribers is compressed into multiple scales according to their bandwidth constraints. Various approaches have been proposed to approximate and compress time-series data, such as the discrete Fourier Transform [8, 26], discrete Wavelet Transform [10], Singular Value Decomposition [14], Piecewise Aggregate Approximation [18], and Piecewise Linear Approximation (PLA) [16, 20], etc. In this paper, we adopt PLA as the compression model, i.e. series of line segments are generated and disseminated instead of the raw data. Apart from its simplicity and wide applicability, PLA modelling can be efficiently performed in an online fashion. In our approach, each node within the dissemination tree receives PLA models from its parent and then forwards them to its children, where re-compression is invoked whenever the bandwidth constraint of a child is smaller than the size of the models it received. The whole procedure is referred to as model-based dissemination. In summary, we have made the following contributions in this paper:

• We formulate the problem of model-based time-series data dissemination with multiple levels of bandwidth constraints as an optimization problem that maximizes the average accuracy of the data received by all the nodes in the dissemination network.

• Because data compression operations are expensive and the optimization algorithms need to estimate data accuracies under many different compression ratios, we propose an approach to estimate the expected data accuracy without actually performing the compressions. Experiments show that the approach is effective and robust.

• We propose three algorithms to generate an optimized dissemination plan. The greedy algorithm tries to maximize the total volume of models disseminated to all subscribers. However, this approach may incur many re-compressions over the network and hence high data inaccuracies. We further propose two algorithms to address this problem, namely the Randomized algorithm and the Most Frequent Bandwidth First (MFBF) algorithm.

• We conducted an extensive experimental study of the performance of the proposed algorithms on 19 real datasets, among which the representative results of 5 datasets are presented in this paper. The experimental results show that our cost model can accurately estimate the model accuracies. Furthermore, the results suggest that none of the three proposed algorithms outperforms the others in all situations. Our extensive experiments can help establish guidelines for real deployments.

The rest of this paper is organized as follows. In Section 2 we review related work. In Section 3 we formulate the problem studied in this paper. In Section 4 we describe the overall model-based dissemination procedure. In Section 5 we propose a cost model to estimate the accuracies under arbitrary compression ratios. We then design three approximate algorithms to optimize the problem, experimentally evaluate the proposed methods, and conclude the paper.

2. RELATED WORK

Approximate representation models: There exist many mathematical and statistical models to approximate and compress time-series data. They are designed for different purposes and hence suit different applications. For example, the Discrete Fourier Transform [8] maps time-series data to the frequency domain and is often used for feature extraction; [30] presents a framework (Cypress) that compresses FFT-represented time-series data to multiple scales for efficient archival and querying; the Discrete Wavelet Transform (DWT) [10] is mainly used for efficient similarity search over time-series data; Singular Value Decomposition [14] is usually used for dimensionality reduction in similarity search; Piecewise Linear Approximation (PLA) [16, 20, 17, 22, 15, 19] is a very popular approximate model and has been used for a variety of purposes, e.g. fast exact similarity search, fuzzy queries, dynamic time warping and relevance feedback. Due to PLA's simplicity and wide applicability, a variety of algorithms have been proposed to perform PLA modelling. Most segmentation algorithms fall into three categories: Sliding Windows, Top-Down, and Bottom-Up [16, 17], which are batch-based algorithms. In [16], the authors proposed an online algorithm, called SWAB (Sliding Window and Bottom-up), for representing time series data by combining the sliding window and bottom-up algorithms.

The multi-resolution feature of DWT also enables multi-scale compression. Unfortunately, the compression scales for DWT are fixed, which makes compression at arbitrary scales inefficient. Given a sequence of length n, only ⌈log2 n⌉ scales can be obtained by the Haar Transform, i.e. {1/n, 2/n, ..., 2^i/n, ..., 1} (i = 1, 2, ..., ⌈log2 n⌉). PLA has no such limitation on compression scales, as we can see from the rest of the paper.
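As a concrete illustration of this limitation, the fixed Haar scales can be enumerated directly (a minimal sketch; the choice of n = 1024 is illustrative):

```python
import math

def haar_scales(n):
    """Compression ratios reachable by the Haar DWT for a series of
    length n: keep 2^i coefficients out of n, i = 0 .. ceil(log2 n)."""
    k = math.ceil(math.log2(n))
    return [min(2 ** i / n, 1.0) for i in range(k + 1)]

# For n = 1024, only 11 fixed ratios exist: 1/1024, 2/1024, ..., 1.
# PLA, by contrast, can target any ratio in (0, 1].
print(haar_scales(1024))
```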

Data Dissemination and Publish/Subscribe Systems. There are

many previous works on data dissemination and publish/subscribe

systems, e.g. [9, 31, 35, 34]. In such system, data are continu-

ously disseminated from the sources to distributed subscribers. To

enhance the dissemination efﬁciency, these subscribers or brokers

are typically structured into dissemination trees. In [31, 34], the

authors studied the problem of optimizing dissemination trees to

reduce the latency of data dissemination and hence minimize the

average loss of ﬁdelity. Their techniques of tree constructions can

be adapted to construct the dissemination tree in our system.

In [35], the authors proposed a general framework to disseminate

statistical models rather than raw data to the data receivers. How-

ever, their focus is on minimizing the volumes of models that are

transferred over the network and hence they proposed some model

routing algorithms to achieve this objective. Due to the radical difference in the application scenarios and problem formulations, their results are not applicable to our problem.

Data as a Service: In recent years, data has increasingly become an important commodity. For example, Xignite [7] sells various financial data; Gnip [1] provides data from social media (e.g. Twitter and Facebook); PatientsLikeMe [6] provides self-reported statistical data from more than 150,000 patients; AggData [4] sells various location data, such as the locations of all the Walmart stores. Many cloud services that support and facilitate the setup of online data marketplaces have emerged, such as Windows Azure Marketplace [3] and Infochimps [5]. Furthermore, the pricing of data has become an interesting research problem; e.g. Paraschos et al. have proposed a query-based data pricing model in [21]. In some sense, we also propose a pricing model over time-series data in this paper, i.e. pricing based on the compression scale.

3. PROBLEM FORMULATION

3.1 System Model

In our system, a data source N0 generates an infinite sequence of numerical values, (..., v_{i-1}, v_i, v_{i+1}, ...), which will be disseminated to a large number of data receivers, {N1, N2, ..., Nn}. Periodically, N0 collects a subsequence of the data and disseminates it to the receivers.

In this paper, we consider a tree-based dissemination architecture, which is a widely adopted approach [35, 9, 31, 34]. In this architecture, all nodes are organized into a dissemination tree with the root N0 acting as the source. Data is disseminated from N0 and relayed by the internal nodes to all the nodes in the network.

Example 1: Figure 1 shows an example of a dissemination tree T, which involves 9 nodes, {N0, N1, ..., N8}. Each node other than N0 is under a bandwidth constraint, i.e. {c1, c2, ..., c8}. An instance of bandwidth constraints is also given, specified by the integer attached to each node.

DEFINITION 1. The dissemination path to Ni, denoted as P(Ni), is defined as a sequence of nodes ⟨pk(Ni), ..., p2(Ni), p1(Ni), Ni⟩, where pj(Ni) (1 ≤ j ≤ k) denotes the j-th preceding node of Ni on P(Ni); in particular, p1(Ni) is Ni's parent node and pk(Ni) is the root node. All these preceding nodes form a set φ(Ni).

For example, as shown in Figure 1, the dissemination path to N8, denoted as P(N8), is the sequence of nodes ⟨N0, N1, N3, N8⟩.

As we noted earlier, each node in the dissemination network is associated with some constraints, such as the physical bandwidth of its communication link and demands on the granularity or accuracy of the data. Here, we use an abstract parameter, the bandwidth constraint of Ni, to represent the constraint on the volume of data that can be transmitted to Ni per time unit. For example, as shown in Figure 1, the bandwidth constraint of N1 is c1, so the data volume sent to N1 from N0 per time unit should not exceed c1.

In practice, a child node may have a higher bandwidth constraint than its parent node, e.g. c3 > c1. However, the volume of data it receives is no more than that of its parent node, as its data originates from the data received by its parent. This means that the volume of data transmitted along any dissemination path is non-strictly monotonically decreasing. We refer to this fact as the volume decreasing property.

3.2 Model-based Dissemination

Piecewise Linear Models. In PLA, the data sequence is partitioned into a number of non-overlapping and consecutive subsequences and each subsequence is represented by a linear function [16]. Assume a sequence of time-series data D = (v1, v2, ..., vk) has been divided into ι subsequences S = (s1, s2, ..., sι), where sj (j ≤ ι) consists of a subsequence of data points with consecutive indices (v_{tj}, v_{tj+1}, ..., v_{tj+τ}) and τ is the size of sj. Each subsequence sj is approximated by a linear function ℓj. All these linear functions M = (ℓ1, ℓ2, ..., ℓι) compose a piecewise linear approximate representation of the original data sequence D.

[Figure 1: Dissemination Tree T. Each node Ni is annotated with its bandwidth constraint ci; the root N0 is the source.]

The segmentation of data sequences is an optimization problem that has been extensively studied in previous work, e.g. [16, 11], and is out of the scope of this paper. In our implementation, we adopt the SWAB algorithm proposed in [16], which is an online algorithm and suitable for our scenario.

DEFINITION 2. The volume (or size) of a sequence of models M = (ℓ1, ℓ2, ..., ℓι) is defined as the number of linear functions that it contains, denoted as |M|.

Example 2. Figure 2 shows an example of using different numbers of piecewise linear models to approximate a sequence of dynamic data, where a) is a sequence of raw Electrocardiogram (ECG) data originating from the dataset in [11], and in b), c) and d) we use 23, 13 and 3 linear segments, respectively, to represent the raw data.

[Figure 2: Piecewise Linear Approximation. Panels a)-d) show the raw ECG data and its approximations with 23, 13 and 3 segments.]

Models' Accuracy. As the example above shows, we can use various sequences of models to approximate the raw data. So we need criteria to evaluate the quality of a sequence of models, namely error and accuracy.

Suppose a sequence of models M is an approximate representation of the original data D, and D′ = (v′1, v′2, ..., v′k) is the time-series data corresponding to M, where v′j (1 ≤ j ≤ k) is the vertical value of M at the time instant of vj. In some of the literature [17, 25], the error of using M to represent D is defined as the sum of the vertical differences between D′ and D over all data points, i.e. Σ_{j=1}^{k} |vj − v′j|. In our work, we modify this definition into a relative error, which normalizes the error into a value in the range [0, 1].

DEFINITION 3. Given a sequence of dynamic data D = (v1, v2, ..., vk), a sequence of models M and its corresponding data D′ = (v′1, v′2, ..., v′k), the error of using M to approximately represent D is defined as

ε = (Σ_{j=1}^{k} |vj − v′j|) / (Σ_{j=1}^{k} |vj|).

ε is the relative deviation between M and D, which also reflects the quality of using M to represent D. For convenience of discussion, we define the accuracy of a sequence of models.

DEFINITION 4. The accuracy of the models M received by node Ni, f(Ni), is defined as

f(M) = 1 − ε/δ, if 0 ≤ ε < δ;    f(M) = 0, if ε ≥ δ

where f(M) ∈ [0, 1], ε is the error of M, and δ is a user-defined constant.
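Definitions 3 and 4 translate directly into code; the following sketch computes ε and f(M) from a raw series and its PLA reconstruction (the sample δ = 0.2 is an illustrative choice, not a value from the paper):

```python
def pla_error(raw, approx):
    """Relative error of Definition 3: summed vertical deviation
    normalized by the total magnitude of the raw data."""
    assert len(raw) == len(approx)
    return sum(abs(v - w) for v, w in zip(raw, approx)) / \
           sum(abs(v) for v in raw)

def accuracy(raw, approx, delta=0.2):
    """Accuracy of Definition 4: linear in the error below the
    user-defined cutoff delta, and zero at or beyond it."""
    eps = pla_error(raw, approx)
    return 1.0 - eps / delta if eps < delta else 0.0
```

A perfect reconstruction yields accuracy 1, while any model whose relative error reaches δ is scored 0 and thus excluded, matching the role of δ discussed below.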

As we can see in Figure 2, various models can be used to represent the same source data D. However, some models may deviate seriously from D and are thus unsuitable for representing it. Those models can be excluded from the user's choice by adjusting the factor δ: with a lower δ value, more models' accuracies become 0 and hence more models tend to be excluded from the solution. δ can be set according to the application's requirements, and should be set to a relatively small value when users need high-quality models.

3.3 Dissemination Plan

Dissemination Plan. A dissemination plan X= (rN0, rN1, rN2,

..., rNn)is a vector of compression ratios, which deﬁned how many

models will be generated and transmitted to each node per time

unit.

Assume models received by node Niis Mi, and M0is the initial

models. Then, the volume of Miis |Mi|=rNi· |M0|.

Take Figure 1 as an example. Assume (c0, c1, c2, c3, c4, c5, c6, c7, c8) = (10, 9, 6, 6, 7, 6, 5, 4, 5). Then, X1 = (1, 0.9, 0.6, 0.6, 0.7, 0.6, 0.5, 0.4, 0.5) is a dissemination plan. Note that X1 is feasible as it satisfies both the bandwidth constraints and the volume decreasing property. However, plan X2 = (1, 0.9, 0.6, 0.6, 0.7, 0.6, 0.5, 0.4, 0.6) is infeasible, as it violates the bandwidth constraints. Furthermore, X3 = (1, 0.9, 0.6, 0.4, 0.7, 0.6, 0.5, 0.4, 0.5) is also infeasible, because it does not satisfy the volume decreasing property: the volume of models received by N8 is larger than N3's, which is impossible as N3 is N8's preceding node on the path P(N8) = ⟨N0, N1, N3, N8⟩.
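The two feasibility conditions can be checked mechanically. The sketch below encodes the tree of Figure 1 as a parent vector (the exact edges beyond the path ⟨N0, N1, N3, N8⟩ are our assumption, reconstructed from the examples in the text) and tests the three plans above:

```python
def is_feasible(plan, constraints, parent):
    """A plan X is feasible iff r_i <= c_i / c_0 (bandwidth constraint)
    and r_i <= r_parent(i) (volume decreasing property)."""
    c0 = constraints[0]
    for i, r in enumerate(plan):
        if r > constraints[i] / c0:
            return False                      # bandwidth violated
        if parent[i] is not None and r > plan[parent[i]]:
            return False                      # volume decreasing violated
    return True

# Assumed shape of Figure 1: N0 -> {N1, N2}, N1 -> {N3, N4},
# N2 -> {N5, N6, N7}, N3 -> {N8}.
parent = [None, 0, 0, 1, 1, 2, 2, 2, 3]
c = [10, 9, 6, 6, 7, 6, 5, 4, 5]
print(is_feasible([1, 0.9, 0.6, 0.6, 0.7, 0.6, 0.5, 0.4, 0.5], c, parent))  # X1: True
print(is_feasible([1, 0.9, 0.6, 0.6, 0.7, 0.6, 0.5, 0.4, 0.6], c, parent))  # X2: False
print(is_feasible([1, 0.9, 0.6, 0.4, 0.7, 0.6, 0.5, 0.4, 0.5], c, parent))  # X3: False
```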

Remodelling. Due to the bandwidth constraints in the dissemination network, it is inevitable that a dissemination plan X may state that a child node should receive less data than its parent node. Suppose Ni is an internal node, Nj is one of its children, and r_{Nj} < r_{Ni}. Then, node Ni is responsible for compressing its models into a new sequence of models of size r_{Nj} · |M0|. This process is called remodelling. For each internal node, only a single remodelling pass is required regardless of the number of its children. Assume Ni has more than one child with diverse bandwidth constraints, and Nj is the one with the smallest ratio. By using PLA, the models at each required scale are generated naturally while the data is being compressed down to the ratio r_{Nj}.

Note that the accuracy of the models Mi transmitted to Ni is determined by the specific remodelling processes along the dissemination path P(Ni) rather than solely by Mi's volume. For example, in the model-based dissemination network of Figure 1, f(M2) differs from f(M3), even though the volumes of the models received by both are equal (i.e. 6). The models M2 have higher accuracy as they are derived directly from the original data (10 → 6), while the models for N3 are generated from models that already have lower accuracy (9 → 6).

3.4 Optimization Problem

Our goal is to find the optimal solution among all feasible dissemination plans. The objective function for this problem is defined as the average accuracy of the models received by all nodes within the dissemination network. More formally, given the set of models M = (M0, M1, M2, ..., Mn) to be received by N = (N0, N1, ..., Nn), the average accuracy F(M) is defined as

F(M) = (1/n) Σ_{i=0}^{n} f(Ni)

The formal problem statement is as follows.

MAXIMUM-MODEL-DISSEMINATION (MMD): Given a dissemination tree T composed of a set of nodes N and the bandwidth constraints C of all the nodes in N, choose a dissemination plan X such that F(M) is maximized subject to the following conditions:

r_{Ni} ≤ c_{Ni} / c_{N0}  ∧  r_{Ni} ≤ r_{Nj}, where Nj ∈ φ(Ni)

3.5 Challenges

In this subsection, we will discuss the challenges of solving the

aforementioned optimization problem.

THEOREM 1. The Maximum-Model-Dissemination problem is NP-hard.

Due to the page limit, we omit the proof.

Additional Challenges. To solve the MMD problem, we have to design algorithms that search the solution space and try to find the optimal plan with the maximum F(M). Besides the NP-hardness of the problem, there are some additional challenges to be addressed.

In the optimization algorithms, we have to generate many different dissemination plans and compare their costs. A straightforward way to estimate the cost of a dissemination plan would work as follows. First, we use some historical data to generate the models to be disseminated to each node. Then we calculate the accuracy of these models according to Definition 4. However, the modelling process is usually very expensive and it has to be done for every possible plan considered by the optimization algorithms. Hence, this approach is computationally prohibitive.

One may think that the models' accuracy is related to the number of models sent to each node, and hence an alternative approach is to transform the problem into maximizing the number of models transmitted to each node. Unfortunately, the models' accuracy depends not only on the volume of models but also, more importantly, on the actual remodelling processes within the dissemination network. For example, in Figure 1, (10, 9, 6, 6, 7, 6, 5, 4, 5) is a feasible plan with the largest total volume of models, but it is not an optimal plan. If we change the volume for N1 from 9 to 6, then f(N1) would decrease while f(N3), f(N4) and f(N8) would increase. Therefore it is possible that the latter plan has a higher average accuracy. This is also validated by our experiments presented later in the paper.

In summary, we need a method to calculate the models' accuracy that takes the remodelling processes into consideration but does not require actually generating the models. The basic idea is to obtain a function of the models' accuracy that can produce an accuracy estimate without the actual models. Such a function would be multidimensional, with its parameters being the compression ratios of the nodes on the dissemination path. Finding an efficient way to generate such a function is challenging.

4. MODEL-BASED DISSEMINATION

In this section, we highlight the overall dissemination procedure in our framework. Given a feasible plan X, model-based dissemination works as follows. It starts at the root of T and ends when all leaves have received their models. At first, the initial models M0 are generated from the raw data D at the root node. With the PLA model, these initial models are simply line segments connecting pairs of adjacent points, and hence they have the same volume as the raw data.

For each internal node N, new sequences of models are generated from the models received by N according to X and are then transmitted to the corresponding children. The detailed operation executed at each node is illustrated in Algorithm 1. The remodel() call (line 6) re-compresses the input model sequence down to a specified size.

In our current implementation, we use the Bottom-Up algorithm proposed in [16]. The Bottom-Up algorithm initially creates m/2 segments to approximate the raw data. Then, the cost of merging each pair of adjacent segments is calculated, and the pair with the lowest cost is merged iteratively. The algorithm stops when some user-specified criterion is met, such as the compression ratio. Of course, after each merge operation, the costs of merging the new, longer segment with its left neighbor and right neighbor need to be re-calculated.
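The merge loop just described can be sketched as follows. This is a simplified illustration, not the implementation from [16]: the merge cost here is the least-squares residual of a single line over the merged span, and costs are recomputed naively rather than cached and updated only for the neighbors of a merge:

```python
def fit_cost(xs, ys):
    """Sum of squared residuals of the best-fit line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx else 0.0
    b = my - a * mx
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

def bottom_up(values, target_segments):
    """Start from m/2 two-point segments, then repeatedly merge the
    cheapest adjacent pair until target_segments segments remain."""
    segs = [list(range(i, min(i + 2, len(values))))
            for i in range(0, len(values), 2)]
    while len(segs) > target_segments:
        costs = [fit_cost(segs[i] + segs[i + 1],
                          [values[j] for j in segs[i] + segs[i + 1]])
                 for i in range(len(segs) - 1)]
        i = costs.index(min(costs))
        segs[i:i + 2] = [segs[i] + segs[i + 1]]  # merge the cheapest pair
    return segs
```

On a series with two clean linear pieces, the algorithm recovers exactly those pieces when asked for two segments.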

These remodelling operations at an internal node can be further optimized according to the actual compression method being used. For instance, with the Bottom-Up approach, the new sequences of models for all the children can be generated by a single pass of compression from models M(Ni) to M(nodes[k]), where nodes[k] is the child with the smallest compression ratio.

Algorithm 1: Model-based Dissemination
Input: The current node Ni, models M(Ni) received by this node, plan X
1  nodes[0] ← Ni;
2  append all child nodes of Ni to nodes;
3  sort nodes in descending order of their compression ratios;
4  for j = 1 to sizeof(nodes) do
5      if r_nodes[j] < r_nodes[j−1] then
6          M(nodes[j]) ← remodel(M(nodes[0]), r_nodes[j]);
7      else
8          M(nodes[j]) ← M(nodes[j−1]);
9      transmit M(nodes[j]) to nodes[j];
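A Python rendering of the per-node step may make the reuse logic clearer; `remodel` and `transmit` are stand-ins for the PLA re-compression and the network send (this is a sketch of Algorithm 1, not code from the paper):

```python
def disseminate(node, models, plan, children, remodel, transmit):
    """Serve children in descending order of compression ratio, so a
    child whose ratio equals the previous child's reuses the sequence
    just produced instead of triggering another remodelling."""
    order = sorted(children, key=lambda c: plan[c], reverse=True)
    prev, prev_models = node, models
    for child in order:
        if plan[child] < plan[prev]:
            child_models = remodel(models, plan[child])  # compress node's models
        else:
            child_models = prev_models                   # same ratio: reuse
        transmit(child_models, child)
        prev, prev_models = child, child_models
```

With child ratios (0.6, 0.6, 0.4), only two remodelling calls are made: the second child reuses the first child's sequence.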

5. COST MODEL

As mentioned earlier, it is very costly to estimate the models' accuracy. Our general approach is to find a proper form of the accuracy function and to use historical data and curve-fitting methods to determine the appropriate coefficients of the function.

As the accuracy of the models sent to Ni is determined by the remodelling processes on its dissemination path, we can define a general form of the accuracy function, f(Ni), as follows.

DEFINITION 5. f(Ni), the accuracy of node Ni, is a multidimensional function of the compression ratios of all the nodes on P(Ni):

f(Ni) = h(r_{N0}, r_{p_{k−1}(Ni)}, ..., r_{p1(Ni)}, r_{Ni})    (1)

where p_{k−1}(Ni) ∈ φ(Ni), r_{p_{k−1}(Ni)} is its compression ratio, and r_{N0} = 1.

One can assume a particular form of f(Ni), e.g. a multidimensional quadratic function, and then find the coefficients by curve fitting. However, curve fitting for multidimensional functions is usually very expensive and less accurate. In the following subsections, we simplify this general function into a two-dimensional function based on several realistic observations and assumptions.

5.1 Observations and Assumptions

In this subsection, we present a few observations and assumptions

that will be used to simplify the accuracy function.

ASSUMPTION 1. For two distinct arbitrary nodes Ni and Nj in T, if r_{Ni} = r_{Nj} and f(Ni) = f(Nj), then the sequences of models received by Ni and Nj, i.e. Mi and Mj, are equivalent.

Two equivalent sequences of models Mi and Mj can be substituted for each other.

ASSUMPTION 2. For an arbitrary node Ni, 0 ≤ f(Ni) ≤ 1 holds, and f(Ni) = 1 if and only if r_{Ni} = 1.

Assumption 2 states that a node's accuracy is a value in [0, 1] and is equal to 1 if and only if the data it receives has not undergone any remodelling, i.e. it is the original data.

ASSUMPTION 3. For an arbitrary node Ni, r_{Ni} ≤ r_{p1(Ni)} and f(Ni) ≤ f(p1(Ni)) hold. f(Ni) = f(p1(Ni)) if and only if r_{p1(Ni)} = r_{Ni}.

Assumption 3 means that the accuracy of an arbitrary node is no greater than the accuracy of its ancestor nodes and that the compression ratios fulfil the volume decreasing property on a dissemination path. Equality holds if and only if no remodelling occurs while the models are transmitted from p1(Ni) to Ni along the path P(Ni).

The next two assumptions concern two distinct arbitrary nodes Ni and Nj in T. Figure 3 shows their dissemination paths.

ASSUMPTION 4. If r_{p1(Ni)} = r_{p1(Nj)}, r_{Ni} = r_{Nj}, and f(p1(Ni)) > f(p1(Nj)), then f(Ni) > f(Nj) holds.

Assumption 4 states that, if Ni and Nj have equal compression ratios and so do their parents, their accuracies are determined, respectively, by their parents' accuracies.

[Figure 3: Two Dissemination Paths, showing parents p1(Ni) and p1(Nj) with ratios r_{p1(Ni)} and r_{p1(Nj)}, and their children Ni and Nj with ratios r_{Ni} and r_{Nj}.]

ASSUMPTION 5. If r_{p1(Ni)} = r_{p1(Nj)}, f(p1(Ni)) = f(p1(Nj)) and r_{Ni} > r_{Nj}, then f(Ni) > f(Nj).

According to Assumption 1, the model sequences received by p1(Ni) and p1(Nj) are equivalent under the conditions of Assumption 5. Simply put, Assumption 5 states that the accuracy of new models generated from two equivalent sequences of models depends only on the compression ratio.

ASSUMPTION 6. The accuracy difference between nodes p1(Ni) and Ni, i.e. f(p1(Ni)) − f(Ni), is a bivariate quadratic function of r_{p1(Ni)} and r_{Ni}:

f(p1(Ni)) − f(Ni) = α1 r_{Ni}^2 + β1 r_{Ni} + α2 r_{p1(Ni)}^2 + β2 r_{p1(Ni)} + γ r_{Ni} r_{p1(Ni)} + ω    (2)

where α1, α2, β1, β2, γ and ω are constant coefficients.

Assumption 6 simplifies the accuracy function from a multidimensional one to a two-dimensional quadratic function. One can of course assume a more complicated function; but, as we will see in the experimental results, this function performs quite well in practice, and hence we use it for its simplicity.

5.2 Accuracy Function

We can further reduce the coefficients in Eqn. (2) by applying the equality condition in Assumption 3. More specifically, according to Assumption 3, if r_{p1(Ni)} = r_{Ni}, then f(Ni) = f(p1(Ni)). We can derive that

0 = α1 r_{Ni}^2 + α2 r_{Ni}^2 + γ r_{Ni}^2 + β1 r_{Ni} + β2 r_{Ni} + ω

Therefore, β1 = −β2, α1 + α2 + γ = 0 and ω = 0.

By substitution, we can simplify Eqn. (2) as follows:

f(p1(Ni)) − f(Ni) = α1 r_{Ni}^2 + α2 r_{p1(Ni)}^2 + β1 r_{Ni} − β1 r_{p1(Ni)} − (α1 + α2) r_{Ni} r_{p1(Ni)}    (3)

LEM MA 1. The accuracy f(Ni)deﬁned by Eqn.(3) satisﬁes

Assumption 2 to 5.

PROO F. It is easy to prove Eqn. (3) satisﬁes Assumption 2 and

3. According to Assumption 6, we known that

f(N0)−f(pk−1(Ni)) ≥0,

. . . ,

f(p1(Ni)) −f(Ni)≥0,

f(Ni)≥0.

Thus, 1 = f(N0)≥ · · · ≥ f(p1(Ni)) ≥f(Ni)≥0.Obviously,

f(p1(Ni)) = f(Ni)holds if and only if rp1(Ni)=rNi. In addi-

tion, rNi≤rp1(Ni)is satisﬁed by the volume decreasing property.

Therefore, (3) satisﬁes Assumption 2 and 3.

Consider the path P(Nj)in Figure 3, according to function (3), we

have:

f(p1(Nj)) −f(Nj) = α1r2

Nj+α2r2

p1(Nj)+β1rNj−

β1rp1(Nj)−(α1+α2)rNjrp1(Nj)(4)

For Assumption 4, let rp1(Ni) = rp1(Nj) and rNi = rNj; then (3) − (4) ⇒ f(Ni) − f(Nj) = f(p1(Ni)) − f(p1(Nj)). Thus, if f(p1(Ni)) ≥ f(p1(Nj)), obviously f(Ni) ≥ f(Nj) also holds.

For Assumption 5, as rp1(Ni) = rp1(Nj) and f(p1(Ni)) = f(p1(Nj)), we can derive (4) − (3) as

f(Ni) − f(Nj) = α1·(rNj² − rNi²) + β1·(rNj − rNi) − (α1 + α2)·rp1(Ni)·(rNj − rNi)
             = (rNj − rNi)·[α1·(rNi + rNj) + β1 − (α1 + α2)·rp1(Ni)]    (5)

It is easy to make α1·(rNi + rNj) + β1 − (α1 + α2)·rp1(Ni) ≤ 0 hold by adjusting α1, α2 and β1. In addition, rNi ≥ rNj; thus f(Ni) ≤ f(Nj).

According to the above analysis, it is clear that the function satisfies all the assumptions.

If we set rp1(Ni) = 1 (i.e. f(p1(Ni)) = 1) in Eqn. (3), then

f(Ni) = −α1·rNi² + (α1 + α2 − β1)·rNi + (1 + β1 − α2)    (6)

This means that if the model sequence Mi is obtained by directly modelling the original data, then its accuracy is a quadratic function of its compression ratio. This is also validated in our experiments in Section 7.
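To make the shape of Eqn. (6) concrete, the following sketch evaluates a quadratic accuracy model of this form. The default coefficients are the ECG curve fit reported later in Figure 5(b); the function name is ours.

```python
def accuracy(r, a=-0.61, b=1.18, c=0.42):
    """Quadratic accuracy model f(r) = a*r^2 + b*r + c, the shape of
    Eqn. (6). Defaults are the ECG fit from Figure 5(b)."""
    return a * r * r + b * r + c

# Sanity check: with no compression (r = 1) the accuracy is close to 1.
print(accuracy(1.0))  # -0.61 + 1.18 + 0.42 = 0.99
```

As expected for a concave quadratic with these coefficients, accuracy grows with the compression ratio over (0, 1].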

LEMMA 2. For a dissemination path P(Ni) to Ni, f(Ni) is a quadratic function of rNi if all the compression ratios of Ni's preceding nodes on P(Ni) are given.

Lemma 2 shows a way to obtain the specific form of the accuracy function. If we fix all the compression ratios of the preceding nodes of Ni on P(Ni), then we can fit Ni's actual accuracy with a quadratic function using historical data. Through the experiments, we found that a quadratic polynomial fits all the tested datasets quite well.

5.3 Computing Model Accuracy

With the cost model, model accuracy can be calculated without the source data. This simplifies the calculation of model accuracy and also greatly improves its efficiency, which is crucial for low-latency dissemination.

As stated earlier, the accuracy of a sequence of models Mi is determined by the specific remodelling processes on the dissemination path P(Ni). It is impossible to enumerate all possible remodelling processes of a dissemination network, and hence we cannot precompute all possible models' accuracies. Alternatively, the compression ratios (or volumes) of models are classified into several scales according to the bandwidth constraints (subscription levels) of the receivers. All the possible remodelling paths then form a Remodelling Tree (RT), illustrated in Figure 4, where the compression ratios are classified into 10 scales. A complete remodelling path in the RT, from the root to a leaf, is a sequence of strictly monotonically decreasing compression ratios, starting at a compression ratio of 1 and ending at 0.1. Furthermore, the models' accuracies on all possible remodelling paths are precomputed with respect to these scales.

With an RT, a model's accuracy can be approximately calculated regardless of its dissemination path. To estimate the models' accuracies on an actual dissemination path P(Ni), we find the closest remodelling path to P(Ni) in the RT. For example, given a dissemination path ⟨1, 0.9, 0.9, 0.75, 0.43, 0.1⟩, the closest path in the RT is ⟨1, 0.9, 0.8, 0.4, 0.1⟩. Then, the accuracy of each node on P(Ni) is approximated by the accuracies on the remodelling path. Due to the space limit, we will not discuss the details of the algorithm, which is quite straightforward.
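The example above can be reproduced with a small sketch: snap each ratio on the dissemination path to the nearest scale and drop repeats, so that the result is a strictly decreasing RT path. This is one plausible reading of the matching step, which the text leaves unspecified; the function name is ours.

```python
def closest_rt_path(path, scales=10):
    """Map an actual dissemination path onto its closest remodelling-tree
    path: round each compression ratio to the nearest of the `scales`
    levels, then keep only strictly decreasing values (RT paths are
    strictly monotonically decreasing)."""
    snapped = [round(r * scales) / scales for r in path]
    rt = []
    for r in snapped:
        if not rt or r < rt[-1]:
            rt.append(r)
    return rt

print(closest_rt_path([1, 0.9, 0.9, 0.75, 0.43, 0.1]))
# [1.0, 0.9, 0.8, 0.4, 0.1] -- the closest path from the text
```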

[Figure 4 here: the remodelling tree over the 10 compression-ratio scales 1, 0.9, 0.8, ..., 0.1; each root-to-leaf path is strictly decreasing from 1 to 0.1.]

Figure 4: Remodelling Tree RT

6. OPTIMIZATION ALGORITHMS

With the cost model developed above, we will, in this section, design algorithms to optimize the dissemination plan X. As stated, MMD is NP-hard; therefore, we design several heuristic algorithms to solve this problem.

6.1 Greedy Algorithm

As analyzed earlier, the accuracy of a sequence of models is related to its size. Therefore, in the first algorithm, we try to maximize the number of models that are sent out from each internal node. In other words, we set the compression ratio of each node as large as possible. This algorithm is presented in Algorithm 2 and works as follows. It starts from the root node N0 and sets its compression ratio to 1. Then it traverses the tree in a breadth-first manner. For each internal node Ni and each of its child nodes Nj, according to the volume-decreasing property and the bandwidth constraint, rNj is set to min(cNj/|M0|, rNi). The algorithm stops when every node's compression ratio is set. The greedy algorithm needs just one traversal of T and hence its complexity is O(n).
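A minimal sketch of this greedy traversal, assuming the tree is given as a child map and bandwidths are in the same units as |M0| (all names are ours):

```python
from collections import deque

def greedy_plan(children, bw, m0, root=0):
    """Algorithm 2 in brief: BFS from the root, giving every node the
    largest compression ratio that its own bandwidth and its parent's
    ratio allow (volume-decreasing property)."""
    r = {root: 1.0}
    queue = deque([root])
    while queue:
        t = queue.popleft()
        for child in children.get(t, []):
            r[child] = min(r[t], bw[child] / m0)
            queue.append(child)
    return r
```

For example, with children = {0: [1, 2], 1: [3]}, bw = {1: 9, 2: 5, 3: 8} and m0 = 10, the plan is {0: 1.0, 1: 0.9, 2: 0.5, 3: 0.8}.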

6.2 Randomized Algorithm

By a closer investigation, we identify two potential problems with the above greedy algorithm. First, the greedy algorithm tries to maximize the number of models that are sent from each parent to its

Algorithm 2: Greedy Algorithm
Input: dissemination tree T, bandwidth constraints C, initial models M0
Output: dissemination plan X
1  Let NodeQueue be a queue for preserving nodes and t be the current visiting node;
2  Initialization: rN0 ← 1;
3  enqueue(NodeQueue, N0);
4  while NodeQueue is not empty do
5      t ← dequeue(NodeQueue);
6      {t1, ..., tk} ← all children of t in T;
7      for i = 1 to k do
8          enqueue(NodeQueue, ti);
9          rti ← min(rt, cti/|M0|);

children. However, this may not yield the optimal solution. For example, in the feasible plan depicted in Figure 1, if we decrease the volume of models received by N1 from 9 to 6, then we do not need to perform remodelling at N1, and hence the accuracies of the models sent to N1's descendants would increase. This may actually end up as a better plan. Generally speaking, if we decrease rNi, it is possible that some of its children's accuracies would increase by avoiding the remodelling process. Second, the greedy algorithm searches only a limited part of the solution space and hence can easily produce a sub-optimal solution.

Based on the above observations, we design a new algorithm to address these two potential problems. In this algorithm, the compression ratio of an internal node is chosen within the range [minC/|M0|, maxC/|M0|], where minC and maxC are the minimum and maximum bandwidth constraints among its child nodes. This addresses the first problem stated above.

Furthermore, instead of a greedy strategy, we employ a randomized local search. More specifically, we use the stochastic hill-climbing approach, which selects a neighbor at random. It runs in many iterations, and within each iteration the compression ratio of a node is altered to a value within the range [minC/|M0|, maxC/|M0|]. This procedure is repeated a number of times until a local optimum is reached, and the corresponding plan is chosen. The initial plan is randomly generated by Algorithm 3.

In Algorithm 3, each node (except N0) is visited twice within each iteration: once to determine the compression ratio of its parent and once for itself. Besides, finding minC and maxC for a non-leaf node needs one scan of its children. Therefore, the cost of running Algorithm 3 is 2n + Σ_{i=0}^{n0} θi, where θi is the number of child nodes of node Ni.

In addition, each non-leaf node Ni implies an iteration of the whole algorithm, and at Ni the algorithm searches at most θi plans. Suppose the cost of calculating the accuracy of a plan is k times the cardinality of the plan, i.e. k·n. Thus,

ϑ(n) = (2n + Σ_{i=0}^{n0} θi) + Σ_{i=0}^{n0} θi·(k·n) ≤ 3n + k·n²

The complexity of the whole randomized algorithm is O(ϑ(n)) = O(n²).
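A sketch of Algorithm 3's initial-plan generation (the hill-climbing loop then perturbs one node's ratio per step). We assume nodes are indexed so that every parent precedes its children; all names are ours.

```python
import random

def random_initial_plan(nodes, parent, children, bw, m0, seed=0):
    """Each internal node draws a candidate bandwidth uniformly from the
    range spanned by its children's constraints; a leaf uses its own.
    The volume-decreasing property is enforced via the parent's ratio."""
    rng = random.Random(seed)
    r = {nodes[0]: 1.0}
    for t in nodes[1:]:
        kids = children.get(t, [])
        if kids:
            lo = min(bw[k] for k in kids)
            hi = max(bw[k] for k in kids)
            cand = rng.uniform(lo, hi)
        else:
            cand = bw[t]
        r[t] = min(r[parent[t]], cand / m0)
    return r
```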

Algorithm 3: Randomized Algorithm
Input: dissemination tree T, node set N = {N0, ..., Nn}, bandwidth constraints C, initial models M0
Output: initial plan X0
1  Let t be the current visiting node;
2  Initialization: rN0 ← 1;
3  for i = 1 to n do
4      t ← Ni;
5      if t is not a leaf node then
6          T ← all children of t;
7          minC ← minimum bandwidth in T;
8          maxC ← maximum bandwidth in T;
9          rand ← random integer in [minC, maxC];
10     else
11         rand ← ct;
12     rt ← min(rp1(t), rand/|M0|);

Algorithm 4: MFBF Algorithm
Input: dissemination tree T, node set N = {N0, ..., Nn}, bandwidth constraints C, initial models M0
Output: dissemination plan X
1  Let U be the unvisited node set and V be the visited node set of T;
2  Initialization: rN0 ← 1, U ← N − {N0}, V ← {N0};
3  while U ≠ ∅ do
4      b ← the most frequent bandwidth among nodes in U;
5      {t1, ..., tk} ← all nodes with bandwidth b in U;
6      for i = 1 to k do
7          {p1, ..., pm} ← all preceding nodes of ti in U;
8          p ← the lowest preceding node of ti in V;
9          rti ← min(rp, b/|M0|);
10         for j = 1 to m do
11             rpj ← rti;
12             move pj from U to V;
13         move ti from U to V;

6.3 MFBF Algorithm

The above randomized algorithm needs many iterations to obtain a good plan, so even though its asymptotic complexity is quadratic, the constant factor may still be very large. In this subsection, we design a more efficient algorithm that can still produce a good plan.

The algorithm is called Most Frequent Bandwidth First (MFBF). Its basic idea is to minimize the number of remodellings occurring in the dissemination tree. Therefore, we give higher priority to maximizing the accuracies of the nodes with the most frequent bandwidth constraints. The details of MFBF are described in Algorithm 4.

As the first step, this algorithm performs a traversal of the dissemination tree, and if the bandwidth constraint of a node is greater than the maximum constraint maxP on its dissemination path, it is reset to maxP.

The algorithm runs in multiple iterations. Within each iteration, it follows these procedures:

•Find the most frequent bandwidth κ among the unvisited nodes and put the unvisited nodes with bandwidth κ into a queue T.

•Iterate over T and, for each node ti in T, set the compression ratio of ti and all its unvisited ancestor nodes to a value r. r is calculated as the minimum of cti/|M0| and the compression ratio of ti's lowest visited preceding node. After this step, ti and all its unvisited ancestor nodes are considered visited.

The cost of the MFBF algorithm mainly consists of three parts: 1) the cost of finding the most frequent bandwidth among the unvisited nodes U, denoted h1; 2) the cost of finding Ni's unvisited ancestor nodes in U, denoted h2; 3) the cost of finding the lowest visited ancestor node, denoted h3. For 1), each search for the most frequent bandwidth needs one scan of U and hence costs O(n); supposing there are a distinct bandwidth constraints in T, then h1 ≤ a·n. For 2) and 3), any node appears as an unvisited node or a visited preceding node at most twice. In our implementation, we maintain for each node an index pointing to its parent, which means we can find a preceding node of Ni in constant time ϵ. So h2 + h3 ≤ 2ϵ·n, and

h(n) = h1 + h2 + h3 ≤ a·n + 2ϵ·n = n·(a + 2ϵ)

Thus, O(h(n)) = O(n).
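A compact sketch of MFBF, assuming a parent map is available; the preliminary pass that caps each node's bandwidth by the maximum on its path is omitted, and all names are ours.

```python
from collections import Counter

def mfbf_plan(nodes, parent, bw, m0):
    """Algorithm 4 in brief: repeatedly take the most frequent bandwidth b
    among unvisited nodes and give each such node, together with its
    unvisited ancestors, one shared ratio, so that no remodelling is
    needed between them."""
    r = {nodes[0]: 1.0}
    unvisited = set(nodes) - {nodes[0]}
    while unvisited:
        b = Counter(bw[t] for t in unvisited).most_common(1)[0][0]
        for t in sorted(t for t in unvisited if bw[t] == b):
            if t not in unvisited:
                continue  # already fixed as an ancestor of an earlier node
            chain, p = [], parent[t]
            while p in unvisited:  # climb to the lowest visited ancestor
                chain.append(p)
                p = parent[p]
            rt = min(r[p], b / m0)
            for node in [t] + chain:
                r[node] = rt
                unvisited.discard(node)
    return r
```

On a chain 0→1→2→3 with bw = {1: 9, 2: 5, 3: 5} and m0 = 10, the most frequent bandwidth is 5, so nodes 1 to 3 all get ratio 0.5: unlike the greedy plan, node 1 gives up bandwidth (0.9 down to 0.5) to spare its descendants a remodelling.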

7. EXPERIMENTS AND EVALUATION

In this section, we present our experimental results. We implemented the prototype system in Java. Since our objective function is the accuracy of the data received by the subscribers, there is no difference between running the system in a distributed setting and simulating it on a single machine. Therefore, we simulate the whole system on one machine, which makes it easy to test many different system parameters.

Data. We used a publicly available dataset [11], the UCR time series data. UCR is a collection of time series data from the mining/machine learning community. The data set we used is dataset 1 in [11], which comprises 19 types of time series data from diverse applications. We carried out our experiments on all of these data and present the results on five datasets; the results on the other data are similar to the reported ones. The specific time series we used are: 50 Words [29], OSU Leaf [12], Gun Point [28, 27], ECG, and Wafer [24]. Each data set is composed of hundreds of sequences, and each sequence contains hundreds of points. The bandwidth constraint and data rate of the source node were set to 1 MByte/s, which may differ in practice. For any other node, the data rate is defined by its compression ratio (or bandwidth constraint), and its unit is omitted in the following. The experiments can be divided into two parts:

•Model Accuracy: In this part, we show how to use a curve-fitting approach to produce the cost model that estimates the model accuracies, and we validate the cost model proposed in Section 5.

[Figure 5 here: accuracy of models vs. compression ratio for ECG, Wafer, OSU Leaf, 50 Words, and Gun Point. Panel (a): the longest path. Panel (b): "1 to x", with fitted curves ECG: f(x) = −0.61x² + 1.18x + 0.42, Wafer: f(x) = −0.19x² + 0.33x + 0.86, OSU Leaf: f(x) = −0.54x² + 1.36x + 0.19, 50 Words: f(x) = −0.80x² + 1.34x + 0.44, Gun Point: f(x) = −0.37x² + 0.68x + 0.69. Panel (c): "0.9 to x", with fitted curves ECG: f(x) = −0.66x² + 1.24x + 0.39, Wafer: f(x) = −0.43x² + 0.74x + 0.68, OSU Leaf: f(x) = −0.23x² + 0.35x + 0.85, 50 Words: f(x) = −0.85x² + 1.40x + 0.40, Gun Point: f(x) = −0.48x² + 1.29x + 0.20.]

Figure 5: Model Accuracy and Corresponding Cost Model

•Optimization Algorithms: We construct a number of dissemination trees with multiple scales of bandwidth constraints and examine the performance of the optimization algorithms proposed in this paper.

7.1 Model Accuracy

We implemented the Remodelling Tree (RT) shown in Figure 4. In our experiments, we set the bandwidth constraints to ten scales, i.e. {1, 0.9, 0.8, ..., 0.1}. We have also tested twenty scales, which produces similar results.

Suppose P = ⟨1, ..., rj, ri, ..., 0.1⟩ is a remodelling path of the RT, where rj > ri. According to Lemma 2, we construct a remodelling-tree path for the remodelling processes on a dissemination path of T; then we can obtain the cost model for each remodelling path by curve fitting. With the cost model, we can easily calculate the accuracy for each node.

P1 = ⟨1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1⟩ is the longest path in the RT. The models generated through P1 have the lowest accuracy among the models with the same compression ratio within the RT. The results are illustrated in Figure 5(a).

According to Lemma 2, if all the compression ratios of Ni's preceding nodes on path P(Ni) are provided, then f(Ni) is a quadratic function of rNi. That means if we fix all the compression ratios before ri in the remodelling path P, with ri within [0.1, rj], then we can obtain the cost model f(ri) by curve fitting with a quadratic function. Here, we choose two paths from the RT to present our observations: P2 = ⟨1, ri, ..., 0.1⟩, where ri varies within {1, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1}, and P3 = ⟨1, 0.9, ri, ..., 0.1⟩, where ri varies within {0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1}. As shown in Figures 5(b) and (c), a quadratic function fits the results quite well, which validates our claims in Lemmas 1 and 2. The other possible remodelling paths show similar results.
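The curve fitting used here can be sketched in a few lines: a pure-Python least-squares fit of f(x) = a·x² + b·x + c via the 3×3 normal equations (numpy.polyfit(xs, ys, 2) would do the same); all names are ours.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of a*x^2 + b*x + c to (xs, ys) by solving the
    normal equations with Gauss-Jordan elimination; returns (a, b, c)."""
    n = len(xs)
    s = lambda p: sum(x ** p for x in xs)  # power sums of the x values
    A = [[s(4), s(3), s(2)],
         [s(3), s(2), s(1)],
         [s(2), s(1), n]]
    rhs = [sum(y * x * x for x, y in zip(xs, ys)),
           sum(y * x for x, y in zip(xs, ys)),
           sum(ys)]
    for i in range(3):
        piv = A[i][i]
        A[i] = [v / piv for v in A[i]]
        rhs[i] /= piv
        for j in range(3):
            if j != i:
                f = A[j][i]
                A[j] = [vj - f * vi for vj, vi in zip(A[j], A[i])]
                rhs[j] -= f * rhs[i]
    return tuple(rhs)
```

Fitting points sampled from the reported ECG curve −0.61x² + 1.18x + 0.42 recovers the same coefficients.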

7.2 Performance of Optimization Strategies

We conduct an intensive study of the performance of the three optimization algorithms described in Section 6. The performance is measured by the average accuracy F(M) of all nodes within the dissemination network T. We construct dissemination trees T randomly with two tunable parameters: the maximum fan-out θ and the total number of nodes n. In our experiments, the two parameters are set as follows:

•θ: varying in 3 scales {2,5,10};

•n: varying in {100,200,500,1000,2000,5000,10000}.

θ and n affect the structure of T, e.g. its height, and the structure of T determines the possible remodelling operations. T is constructed with the following properties. Each node in T has num children, where num is a random integer in [0, θ]. In addition, each node is associated with a bandwidth constraint whose scale falls within the ten scales {1, 0.9, ..., 0.1}. For a node Ni with a bandwidth-constraint scale of i, its bandwidth constraint equals i·|M0|, where |M0| is the expected volume of the time series to be disseminated within each epoch. Furthermore, we simply set the bandwidth of the root to the volume of the initial models, i.e. cN0 = |M0|. In our experiments, we generate the bandwidth constraints of the nodes in T based on five different distributions.

•Level: nodes at the same level within T have the same bandwidth constraint.

•Uniform: the bandwidth constraint of each node obeys a uniform distribution and is independent of its location within T.

•Gaussian: the bandwidth constraints obey a Gaussian distribution with variance σ² = 1 and mean µ equal to the median value of all the bandwidth constraints.

•MaxZipf: the bandwidth constraints follow a Zipfian distribution P(k) ∼ 1/kᵃ, where P(k) is the frequency of the kth bandwidth constraint and a = 1, and the larger bandwidth constraints have smaller k values.

•MinZipf: similar to MaxZipf, except that the smaller bandwidth constraints have smaller k values.
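The experimental trees can be sketched as follows; the generator details and the Level scheme shown here are our reading of the properties listed above, and all names are ours.

```python
import random

def random_tree(n, theta, seed=42):
    """Grow a tree of at most n nodes: node 0 is the root and each node
    gets a random number of children in [0, theta]. Returns a parent map
    (the root maps to None)."""
    rng = random.Random(seed)
    parent, frontier, next_id = {0: None}, [0], 1
    while next_id < n and frontier:
        t = frontier.pop(0)
        for _ in range(rng.randint(0, theta)):
            if next_id >= n:
                break
            parent[next_id] = t
            frontier.append(next_id)
            next_id += 1
    return parent

def level_bandwidths(parent, m0, scales=10):
    """'Level' scheme: nodes at the same depth share a bandwidth scale,
    dropping one scale per level down to the minimum of 0.1."""
    def depth(t):
        return 0 if parent[t] is None else 1 + depth(parent[t])
    return {t: max(scales - depth(t), 1) * m0 / scales for t in parent}
```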

Furthermore, for each of the above bandwidth distributions, we consider two schemes for assigning the bandwidths to the nodes. In reality, this is determined by the organization of the dissemination tree, which is another complicated optimization problem and out of the scope of this paper.

•Ordered: nodes at higher levels of the tree have larger band-

width constraints.

[Figure 6 here: average accuracy vs. number of nodes for (a) the greedy algorithm (greedy-2, greedy-5, greedy-10), (b) the approximate algorithm (approx-2, approx-5, approx-10), and (c) the randomized algorithm (random-2, random-5, random-10).]

Figure 6: Horizontal Comparison

[Figure 7 here: average accuracy vs. number of nodes comparing greedy, MFBF, and random for (a) fan-out 2, (b) fan-out 5, and (c) fan-out 10.]

Figure 7: Vertical Comparison

•Unordered: no restriction on the bandwidth allocation within T; we just randomly assign a bandwidth to each node Ni according to the specific distribution.

Note that we only consider the “Ordered” scheme for the “Level” distribution; thus, there are 9 cases in total, as shown in Table 1. We evaluate the performance of the algorithms with all the different distributions and the number of nodes varying from 100 to 10,000. We repeat each group of experiments 100 times and take the mean of the average accuracies F(M) as the final result.

Table 1: Possible Combinations

Distribution   Ordered      Unordered
Uniform        uniform-1    uniform-0
Gaussian       gaussian-1   gaussian-0
MaxZipf        maxzipf-1    maxzipf-0
MinZipf        minzipf-1    minzipf-0
Level          level        —

7.3 Performance Analysis

The randomized search algorithm applies stochastic hill climbing to optimize the dissemination, which is a costly procedure. For example, it takes 2 hours to complete the experiments with 10,000 nodes, while the other two algorithms take less than 1 minute. Due to its long running time, we present the results of the randomized algorithm only for the Level distribution, as shown in Figures 6 and 7. For the other distributions, we execute this algorithm only at a small network scale, and the results suggest the same conclusions as for the Level distribution. Therefore, for the other 8 distributions, we only present and discuss the greedy and MFBF algorithms.

The results are illustrated in Figures 6 to 15. We compare the algorithms both horizontally and vertically: the horizontal comparison compares the performance of the different algorithms, as shown in Figure 6, while the vertical comparison compares the same algorithm with varying θ, as shown in Figure 7. To illustrate, we plot the results for the Level distribution in 6 graphs; in the other cases, we put all the results into one graph each to save space.

Horizontal comparison of the algorithms. As shown in Figures 6 and 14, MFBF performs better than the greedy algorithm on level and minzipf-0, and especially well on level. This is because, in the level distribution, the bandwidth monotonically decreases along each dissemination path of the dissemination tree. Therefore, the greedy algorithm, which attempts to maximize the bandwidth usage on each edge, incurs excessive remodelling on all the dissemination paths. In contrast, MFBF significantly decreases the total number of remodelling processes by taking bandwidth frequency into consideration. This is also true for the minzipf-0 distribution, but to a lesser extent. Moreover, in the level distribution, the randomized algorithm performs very close to the MFBF algorithm, as it is also effective in reducing the remodelling processes. However, as mentioned earlier, it takes much longer to complete than the other two algorithms.

In Figures 12 and 13, MFBF performs worse than the greedy algorithm. This is because the main idea of MFBF is to reduce the remodelling processes by altering the bandwidth usage of the ancestors of the nodes with the most frequent bandwidths, which wastes bandwidth, especially when most of the bandwidths are very large. This is unfortunately the case for the distributions maxzipf-0 and maxzipf-1.

[Figures 8–15 here: average accuracy vs. number of nodes for greedy-2/5/10 and MFBF-2/5/10 under each bandwidth distribution.]

Figure 8: uniform-0. Figure 9: uniform-1. Figure 10: gaussian-0. Figure 11: gaussian-1. Figure 12: maxzipf-0. Figure 13: maxzipf-1. Figure 14: minzipf-0. Figure 15: minzipf-1.

As shown in Figures 8 and 9, the greedy algorithm slightly outperforms MFBF under the uniform distributions, i.e. uniform-1 and uniform-0. In these cases, there is no significant difference in the bandwidth frequencies, and hence there are limited opportunities for the MFBF algorithm to reduce the remodelling processes.

Furthermore, as shown in Figures 10 and 11, the performance of the greedy algorithm and MFBF is almost identical for the two Gaussian distributions. In these cases, most of the nodes have bandwidth constraints close to the median value. Therefore, the accuracy gained by reducing the remodelling processes is almost equivalent to the accuracy lost through wasted bandwidth.

According to the above analysis, we conclude that MFBF performs very well when the dissemination tree is “well” organized, as in our level and minzipf-0 distributions. In such cases, there are more nodes with smaller bandwidth constraints, and the nodes with higher bandwidth constraints are located higher in the dissemination tree. We conjecture that this is true in many real situations. If it is not, the greedy algorithm performs better.

The general conclusion of the results. The average accuracy F(M) decreases progressively as we increase the total number of nodes in T from 100 to 10,000. The reason is that remodelling is rare when the number of nodes is small, but becomes more frequent as the number of nodes increases. Furthermore, as the network grows larger, the average accuracies become steady in all cases.

The sensitivity to θ. As shown in all the previous figures, F(M) is positively correlated with the maximum fan-out θ of T. The major reason is that the height of T is determined by θ when n is fixed, and the height of T determines the maximum possible number of remodelling processes on a dissemination path. In other words, when θ increases, the number of remodellings within T is reduced, and consequently the average accuracies increase.

The influence of assignment schemes. In terms of the assignment schemes, all the algorithms perform better under the ordered scheme than under the unordered scheme with the same distribution. Furthermore, the assignment schemes have a higher influence on the greedy algorithm, while MFBF performs much more consistently.

8. CONCLUSIONS

In this paper, we have discussed the problem of multi-scale time series data dissemination over a predefined tree-based network. We used bandwidth constraints as an abstract parameter to indicate both the subscription levels of the subscribers in the dissemination network and the volumes of data the subscribers are able to consume. A framework of model-based dissemination is proposed. Based on this, we formulate an optimization problem to maximize the average accuracy of the data received by all the nodes in the dissemination network. One major challenge in solving this problem is the complexity of estimating the accuracy of a particular dissemination plan. To address this challenge, we have made a thorough study of estimating the accuracy of using PLA to represent time series data. In addition, we proposed three algorithms to optimize the dissemination plan, and extensive experiments were carried out to evaluate their performance under various bandwidth distributions. The experimental results can serve as guidelines for choosing an optimization strategy according to the actual application environment.

There are some interesting directions for future study. First, the optimization of the structure of the dissemination tree is an interesting problem, which would have significant effects on the overall data accuracy. As we can see from our analysis and experimental results, the number of remodellings executed in the network is a major factor in the resulting data accuracy. On one hand, organizing the dissemination network to reduce the need for remodelling can help increase data accuracy. On the other hand, the network organization should take the actual physical network properties into consideration, such as transfer latency, locality, etc. For example, connecting a node Ni as a child of another node Nj with the same subscription level avoids remodelling, but if Nj is far from Ni and has a long network latency, this is undesirable. Hence, a good trade-off should be found. Second, it is interesting to study other models (e.g. DWT) for compressing time series data. In general, our framework can accommodate different compression models as long as remodelling methods are provided; the cost models and optimization algorithms, however, should be rethought and re-examined.

9. REFERENCES

[1] http://gnip.com/.

[2] http://lhc.web.cern.ch/lhc/.

[3] https://datamarket.azure.com/.

[4] http://www.aggdata.com/.

[5] http://www.infochimps.com/.

[6] http://www.patientslikeme.com/.

[7] http://www.xignite.com/.

[8] R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. In Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms (FODO '93).

[9] A. Carzaniga, D. S. Rosenblum, and A. L. Wolf. Design and

evaluation of a wide-area event notiﬁcation service. ACM

Transactions on Computer Systems (TOCS).

[10] F. K.-P. Chan, A. W.-C. Fu, and C. Yu. Haar wavelets for efficient similarity search of time-series: With and without time warping. IEEE Transactions on Knowledge and Data Engineering.

[11] E. Keogh, Q. Zhu, B. Hu, Y. Hao, X. Xi, L. Wei, and C. A. Ratanamahatana. The UCR time series classification/clustering homepage: www.cs.ucr.edu/~eamonn/time_series_data/, 2011.

[12] A. Gandhi. Content-based image retrieval: Plant species

identiﬁcation. Master thesis, Oregon State University,

September 2002.

[13] Y.-H. Chu, S. G. Rao, S. Seshan, and H. Zhang. A case for end system multicast. In Proceedings of ACM SIGMETRICS.

[14] K. Kanth, D. Agrawal, and A. Singh. Dimensionality

reduction for similarity searching in dynamic databases.

Proceedings of the 1998 ACM SIGMOD international

conference on Management of Data.

[15] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra.

Dimensionality reduction for fast similarity search in large

time series databases. Knowledge and Information Systems,

3(3):263–286, Aug 2000.

[16] E. Keogh, S. Chu, D. Hart, and M. Pazzani. An online

algorithm for segmenting time series. In ICDM, 2001.

[17] E. Keogh, S. Chu, D. Hart, and M. Pazzani. Segmenting time

series: A survey and novel approach. Data Mining in Time

Series Databases, 57:1–22, 2004.

[18] E. Keogh and M. Pazzani. Scaling up dynamic time warping to massive datasets. In Proceedings of the Third European Conference on Principles of Data Mining and Knowledge Discovery (PKDD '99).

[19] E. J. Keogh and M. J. Pazzani. Relevance feedback retrieval of time series data. In International Conference on Research and Development in Information Retrieval (SIGIR), 3(3):183–190, Aug 1999.

[20] A. Koski, M. Juhola, and M. Meriste. Syntactic recognition of ECG signals by attributed finite automata. Pattern Recognition.

[21] P. Koutris, P. Upadhyaya, M. Balazinska, B. Howe, and

D. Suciu. Query-based data pricing. Proceedings of the 31st

symposium on Principles of Database Systems.

[22] X. Liu, Z. Lin, and H. Wang. Novel online methods for time

series segmentation. IEEE Transactions on Knowledge and

Data Engineering (TKDE), 20(12):1616–1626, December

2008.

[23] S. Nath, J. Liu, and F. Zhao. Sensormap for wide-area sensor

webs. IEEE Computer.

[24] R. T. Olszewski. Generalized feature extraction for structural pattern recognition in time-series data. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, 2001.

[25] T. Palpanas, M. Vlachos, and E. Keogh. Online amnesic

approximation of streaming time series. International

Conference on Data Engineering (ICDE).

[26] D. Rafiei and A. Mendelzon. Efficient retrieval of similar time sequences using DFT.

[27] C. A. Ratanamahatana and E. Keogh. Everything you know

about dynamic time warping is wrong. SIAM International

Conference on Data Mining, 2004.

[28] C. A. Ratanamahatana and E. Keogh. Making time-series

classiﬁcation more accurate using learned constraints. SIAM

International Conference on Data Mining, 2004.

[29] T. M. Rath and R. Manmatha. Word image matching using

dynamic time warping. CVPR, 2003.

[30] G. Reeves, J. Liu, S. Nath, and F. Zhao. Managing massive

time series streams with multi-scale compressed trickles.

Proceedings of 35th Conference on Very Large Data Bases.

[31] S. Shah, K. Ramamritham, and P. J. Shenoy. Resilient and

coherence preserving dissemination of dynamic data using

cooperating peers. IEEE Transactions on Knowledge and

Data Egnineering (TKDE).

[32] E. Skjervold, K. Lund, T. H. Bloebaum, and F. T. Johnsen. Bandwidth optimizations for standards-based publish/subscribe in disadvantaged grids. In Military Communications Conference (MILCOM 2012).

[33] I. Vasilescu, K. Kotay, D. Rus, M. Dunbabin, and P. Corke.

Data collection, storage, and retrieval with an underwater

sensor network. Proceedings of the 3rd international

conference on Embedded networked sensor systems (SenSys

05).

[34] Y. Zhou, B. C. Ooi, and K.-L. Tan. Disseminating streaming data in a dynamic environment: an adaptive and cost-based approach. The International Journal on Very Large Data Bases (VLDB J.).

[35] Y. Zhou, Z. Vagena, and J. Haustad. Dissemination of

models over time-varying data. Proceedings of the VLDB

Endowment (PVLDB), 4, 2011.