In this paper, we discuss various techniques for the efficient organization of a temporal coherency preserving dynamic data dissemination network. The network consists of sources of dynamically changing data, repositories that replicate this data, and clients. Given the temporal coherency properties of the data available at various repositories, we suggest methods to intelligently choose a repository to serve a new client request. The goal is to support as many clients as possible from the given network. Second, we propose strategies to decide what data should reside on the repositories, given the data coherency needs of the clients.

We model the problem of selecting a repository to serve each client as a linear optimization problem, and derive its objective function and constraints. In view of the complexity and infeasibility of using this solution in practical scenarios, we also suggest a heuristic solution. Experimental evaluation, using real-world data, demonstrates that the fidelity achieved by clients using the heuristic algorithm is close to that achieved using linear optimization. To improve fidelity further through better load sharing between repositories, we propose an adaptive algorithm that adjusts the resource provisions of repositories according to their recent response times.

It is often advantageous to reorganize the data at the repositories according to the needs of clients. To this end, we propose two strategies based on reducing the communication and computational overheads. We evaluate and compare the two strategies analytically, using the expected response time for an update at the repositories, and by simulation, using the loss of fidelity at clients as our performance measure. The results suggest that a considerable improvement in fidelity can be achieved by judicious reorganization.
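To make the repository-selection formulation concrete, the following is a minimal sketch under simplified assumptions, not the paper's exact model: binary variables assign each client to at most one repository whose data coherency is tight enough for it, each repository has a fixed client capacity, and the objective maximizes the number of clients served. All data, names, and the choice of the PuLP/CBC solver are illustrative.

```python
# Hypothetical repository-selection LP (illustrative sketch only).
import pulp

clients = ["c1", "c2", "c3"]                       # hypothetical client ids
repos = ["r1", "r2"]                               # hypothetical repositories
capacity = {"r1": 2, "r2": 1}                      # max clients per repository
# eligible[c]: repositories serving c's data at a coherency at least as tight
# as c requires (precomputed from the repositories' coherency properties).
eligible = {"c1": ["r1"], "c2": ["r1", "r2"], "c3": ["r2"]}

prob = pulp.LpProblem("repository_selection", pulp.LpMaximize)
x = {(c, r): pulp.LpVariable(f"x_{c}_{r}", cat="Binary")
     for c in clients for r in eligible[c]}

# Objective: maximize the number of clients that get served.
prob += pulp.lpSum(x.values())

# Each client is assigned to at most one eligible repository.
for c in clients:
    prob += pulp.lpSum(x[(c, r)] for r in eligible[c]) <= 1

# A repository cannot serve more clients than its capacity allows.
for r in repos:
    prob += pulp.lpSum(v for (c, rep), v in x.items() if rep == r) <= capacity[r]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
assignment = {c: r for (c, r), v in x.items() if v.value() == 1}
print(assignment)   # e.g. {'c1': 'r1', 'c2': 'r1', 'c3': 'r2'}
```

As the abstract notes, a heuristic would stand in for the solver when formulations of this kind become too expensive at scale; the greedy variant hinted at there is not shown here.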
... , v_n⟩. This central correlation has three major scalability issues: (1) it results in a vast number of parallel network connections at a central system, (2) it misses edge computing opportunities, and (3) it relies on costly stream joins. Moreover, applications assume that joint sensor data tuples represent a concise snapshot of all values (i.e., measurements) taken at time t. ...
... The latency is low, because sensors transmit their values directly to the central server. However, there are two major disadvantages in a central join topology: (1) The join cannot provide any guarantee for the time coherence of result tuples because it relies on the correctness of the timestamps transmitted from ... (Footnote 1: Network jitter is the variance of transmission times in a network.) In a sensing pipeline, one node initiates a request and passes it to the succeeding node in the pipeline. ...
... Deolasee et al. propose an adaptive push-pull method to maintain the coherence of such cached data copies efficiently [12]. Agrawal et al. discuss techniques to smartly select up-to-date (temporally coherent) cached data copies for serving client requests [1]. These works consider the coherence between data sources and several copies of these sources. ...
Data analysis in the Internet of Things (IoT) requires us to combine event streams from a huge number of sensors. This combination (join) of events is usually based on the timestamps associated with the events. We address two challenges in environments that acquire and join events in the IoT. First, due to the growing number of sensors, we are facing the performance limits of central joins with respect to throughput, latency, and network utilization. Second, in the IoT, diverse sensor nodes are operated by different organizations and use different time synchronization techniques. Thus, events with the same timestamps are not necessarily recorded at the exact same time, and joined data tuples have an unknown time incoherence. This can cause undetected failures, such as false correlations and wrong predictions. We present SENSE, a system for scalable data acquisition from distributed sensors. SENSE introduces time coherence measures as a fundamental data characteristic in addition to common time synchronization techniques. The time coherence of a data tuple is the time span in which all values contained in the tuple have been read from sensors. We explore concepts and algorithms to quantify and optimize time coherence and show that SENSE scales to thousands of sensors, operates efficiently under latency and coherence constraints, and adapts to changing network conditions.
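The time-coherence measure defined in this abstract, the time span in which all values of a joined tuple were read, follows directly from per-value read timestamps. The snippet below is a small illustrative helper with made-up names; it is not code from SENSE.

```python
# Illustrative sketch of the time-coherence measure (assumed names/types).
from dataclasses import dataclass
from typing import List

@dataclass
class Reading:
    sensor_id: str
    value: float
    read_at: float   # time (seconds) at which the sensor value was read

def time_coherence(joined_tuple: List[Reading]) -> float:
    """Span between the earliest and latest read time in a joined tuple.
    Smaller is better: 0 means all values were read at the same instant."""
    times = [r.read_at for r in joined_tuple]
    return max(times) - min(times)

# Example: a tuple whose three values were read within 30 ms of each other.
readings = [Reading("s1", 21.5, 1000.000),
            Reading("s2", 21.7, 1000.030),
            Reading("s3", 21.6, 1000.010)]
print(time_coherence(readings))   # 0.03
```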
... In [22], the authors present a technique for reorganizing a data dissemination network when client requirements change. Instead, we try to answer the client query using the existing network. ...
Continuous queries are used to monitor changes to time-varying data and to provide results useful for online decision making. Typically a user desires to obtain the value of some aggregation function over distributed data items, for example, to know (a) the average of temperatures sensed by a set of sensors or (b) the value of an index of mid-cap stocks. In these queries a client specifies a coherency requirement as part of the query. In this paper we present a low-cost, scalable technique to answer continuous aggregation queries using a content distribution network of dynamic data items. In such a network of data aggregators, each data aggregator serves a set of data items at specific coherencies. Just as various fragments of a dynamic web page are served by one or more nodes of a content distribution network, our technique involves decomposing a client query into sub-queries and executing the sub-queries on judiciously chosen data aggregators with their individual sub-query incoherency bounds. We provide a technique for obtaining the optimal query plan (i.e., the set of sub-queries and their chosen data aggregators) which satisfies the client query's coherency requirement with the least cost, measured in terms of the number of refresh messages sent from aggregators to the client. For estimating query execution cost, we build a continuous query cost model which can be used to estimate the number of messages required to satisfy the client-specified incoherency bound. Performance results using real-world traces show that our cost-based query planning leads to queries being executed using less than one third the number of messages required by existing schemes.
Continuous queries are used to monitor changes to time-varying data and to provide results useful for online decision making. Typically a user desires to obtain the value of some aggregation function over distributed data items, for example, to know the value of a client's portfolio, or the average (AVG) of temperatures sensed by a set of sensors. In these queries a client specifies a coherency requirement as part of the query. We present a low-cost, scalable technique to answer continuous aggregation queries using a network of aggregators of dynamic data items. In such a network of data aggregators, each data aggregator serves a set of data items at specific coherencies. Our technique involves decomposing a client query into sub-queries and executing the sub-queries on judiciously chosen data aggregators. We provide a technique for obtaining the optimal set of sub-queries, with their incoherency bounds, which satisfies the client query's coherency requirement with the least number of refresh messages sent from aggregators to the client. Performance results using real-world traces show that our cost-based query planning leads to queries being executed using less than one third the number of messages required by existing schemes.
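As a rough illustration of the decomposition described in the two abstracts above (and not the papers' cost-based planner), the sketch below greedily assigns the data items of a SUM-style query to aggregators that already serve them and splits the client's incoherency bound equally among the resulting sub-queries; for an additive aggregate, sub-query bounds that sum to at most the client's bound are sufficient. All identifiers are hypothetical.

```python
# Hypothetical greedy sub-query planner (illustrative sketch only).
from typing import Dict, List, Set, Tuple

def plan_subqueries(query_items: Set[str],
                    aggregator_items: Dict[str, Set[str]],
                    client_bound: float) -> Dict[str, Tuple[List[str], float]]:
    remaining = set(query_items)
    plan: Dict[str, List[str]] = {}
    # Greedily pick the aggregator covering the most still-unassigned items.
    while remaining:
        best = max(aggregator_items,
                   key=lambda a: len(aggregator_items[a] & remaining))
        covered = aggregator_items[best] & remaining
        if not covered:
            raise ValueError("some items are not served by any aggregator")
        plan[best] = sorted(covered)
        remaining -= covered
    # Divide the client's incoherency bound equally among the sub-queries.
    bound_per_subquery = client_bound / len(plan)
    return {a: (items, bound_per_subquery) for a, items in plan.items()}

# Example: three items served by two aggregators, client incoherency bound 0.9.
print(plan_subqueries({"x", "y", "z"},
                      {"A": {"x", "y"}, "B": {"y", "z"}},
                      0.9))
# {'A': (['x', 'y'], 0.45), 'B': (['z'], 0.45)}
```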
This paper explores ways to provide improved consistency for Internet applications that scale to millions of clients. We make four contributions. First, we identify how workloads affect the scalability of cache consistency algorithms. Second, we define two primitive mechanisms, split and join, for growing and shrinking consistency hierarchies, and we present a simple mechanism for implementing them. Third, we describe and evaluate policies for using split and join to address the fault tolerance and performance challenges of consistency hierarchies. Fourth, using synthetic workload and trace-based simulation, we compare various algorithms for maintaining strong consistency in a range of hierarchy configurations. Our results indicate that a promising configuration for providing strong consistency in a WAN is a two-level consistency hierarchy where servers and proxies work to maintain consistency for data cached at clients. Specifically, by adapting to clients' access patterns, two-level hierarchies reduce the read latency for demanding workloads without introducing excessive overhead for nondemanding workloads. Also, they can improve scalability by orders of magnitude. Furthermore, this configuration is easy to deploy by augmenting proxies, and it allows invalidation messages to traverse firewalls.
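Purely as an illustration of the split and join primitives named above, and not the paper's mechanism, the sketch below grows a consistency hierarchy by moving some of a node's clients under a new child and shrinks it by folding a child back into its parent; the structure and names are assumptions.

```python
# Illustrative split/join on a toy consistency hierarchy (assumed structure).
from typing import List, Optional

class ConsistencyNode:
    def __init__(self, name: str, parent: "Optional[ConsistencyNode]" = None):
        self.name = name
        self.parent = parent
        self.children: List["ConsistencyNode"] = []
        self.clients: List[str] = []

    def split(self, new_name: str, moved: List[str]) -> "ConsistencyNode":
        """Grow the hierarchy: move some clients under a new child node."""
        child = ConsistencyNode(new_name, parent=self)
        self.children.append(child)
        for c in moved:
            self.clients.remove(c)
            child.clients.append(c)
        return child

    def join(self, child: "ConsistencyNode") -> None:
        """Shrink the hierarchy: absorb a child's clients back into this node."""
        self.clients.extend(child.clients)
        self.children.remove(child)

# Example: a proxy splits off a child under load, then joins it back later.
root = ConsistencyNode("proxy0")
root.clients = ["c1", "c2", "c3", "c4"]
child = root.split("proxy1", ["c3", "c4"])
root.join(child)
print(root.clients)   # ['c1', 'c2', 'c3', 'c4']
```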
Many researchers have shown that server-driven consistency protocols can potentially reduce read latency. Server-driven consistency protocols are particularly attractive for large-scale dynamic web workloads because dynamically generated data can change rapidly and unpredictably. However, there have been no reports on engineering server-driven consistency for such a workload. This paper reports our experience in engineering server-driven consistency for a Sporting and Event web site hosted by...
This paper discusses the design and performance of a hierarchical proxy-cache designed to make Internet information systems scale better. The design was motivated by our earlier trace-driven simulation study of Internet traffic. We challenge the conventional wisdom that the benefits of hierarchical file caching do not merit the costs, and believe the issue merits reconsideration in the Internet environment. The cache implementation supports a highly concurrent stream of requests. We present performance measurements that show that our cache outperforms other popular Internet cache implementations by an order of magnitude under concurrent load. These measurements indicate that hierarchy does not measurably increase access latency. Our software can also be configured as a Web-server accelerator; we present data showing that our httpd-accelerator is ten times faster than Netscape's Netsite and NCSA 1.4 servers. Finally, we relate our experience fitting the cache into the increasingly complex and operational world of Internet information systems, including issues related to security, transparency to cache-unaware clients, and the role of file systems in support of ubiquitous wide-area information systems.
This is the second edition (revised to remove many errors), now available as a cheaper paperback; the first edition was published by Prentice-Hall in 1982. The first edition is also available as an Indian edition from Prentice-Hall India (a very inexpensive paperback). The second edition is also available as an Asian edition from John Wiley, Singapore (an inexpensive paperback). These cheaper Asian/Indian editions are for use in developing countries only. Since 2016, a Chinese translation of the book has also been available. In the paperback, I removed nearly a hundred (minor) errors, so although it is not a new edition, it should be nearly error-free. Recall that a complete solution manual and PowerPoint slides of all chapters can be obtained from John Wiley or from me for those teaching with the book. Note also that the SHARPE software package may make specification and solution of the problems in the book easier. The package may be requested from me after completing and signing an agreement (for university students/professors).
The tradeoffs between consistency, performance, and availability are well understood. Traditionally, however, designers of replicated systems have been forced to choose from either strong consistency guarantees or none at all. This paper explores the semantic space between traditional strong and optimistic consistency models for replicated services. We argue that an important class of applications can tolerate relaxed consistency, but benefit from bounding the maximum rate of inconsistent access in an application-specific manner. Thus, we develop a set of metrics, Numerical Error, Order Error, and Staleness, to capture the consistency spectrum. We then present the design and implementation of TACT, a middleware layer that enforces arbitrary consistency bounds among replicas using these metrics. Finally, we show that three replicated applications demonstrate significant semantic and performance benefits from using our framework.
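The snippet below is an illustrative sketch of how the three metrics named above could be tracked as per-replica bounds, with synchronization forced once any bound is exceeded. It is not TACT's API; all names and the exact semantics are assumptions.

```python
# Illustrative consistency-bound check (assumed names, not TACT's interface).
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class ConsistencyBounds:
    max_numerical_error: float   # bound on the weight of unseen remote updates
    max_order_error: int         # bound on locally applied, tentative writes
    max_staleness: float         # bound on the age (s) of unseen remote writes

@dataclass
class ReplicaState:
    numerical_error: float = 0.0
    order_error: int = 0
    oldest_unseen_write: Optional[float] = None   # timestamp, or None if none

def must_synchronize(state: ReplicaState, bounds: ConsistencyBounds,
                     now: Optional[float] = None) -> bool:
    """Return True when any of the three bounds would be violated."""
    now = time.time() if now is None else now
    staleness = 0.0 if state.oldest_unseen_write is None \
        else now - state.oldest_unseen_write
    return (state.numerical_error > bounds.max_numerical_error
            or state.order_error > bounds.max_order_error
            or staleness > bounds.max_staleness)

# Example: 3 tentative local writes against an order-error bound of 2.
print(must_synchronize(ReplicaState(order_error=3),
                       ConsistencyBounds(10.0, 2, 5.0), now=0.0))   # True
```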
We describe the design and implementation of an integrated architecture for cache systems that scale to hundreds or thousands of caches with thousands to millions of users. Rather than simply try to maximize hit rates, we take an end-to-end approach to improving response time by also considering hit times and miss times. We begin by studying several Internet caches and workloads, and we derive three core design principles for large scale distributed caches: minimize the number of hops to locate and access data on both hits and misses; share data among many users and scale to many caches; and cache data close to clients. Our strategies for addressing these issues are built around a scalable, high-performance data-location service that tracks where objects are replicated. We describe how to construct such a service and how to use this service to provide direct access to remote data and push-based data replication. We evaluate our system through trace-driven simulation and find that these strategies together provide response time speedups of 1.27 to 2.43 compared to a traditional three-level cache hierarchy for a range of trace workloads and simulated environments.
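A minimal sketch of the data-location idea described above, with assumed names rather than the paper's implementation: a directory records which caches hold each object, so a local miss can be redirected to the nearest replica instead of climbing a cache hierarchy.

```python
# Illustrative data-location directory (assumed names and interface).
from typing import Dict, Set, Optional

class LocationService:
    def __init__(self):
        self._holders: Dict[str, Set[str]] = {}   # object URL -> cache ids

    def register(self, url: str, cache_id: str) -> None:
        """Record that cache_id now holds a replica of url."""
        self._holders.setdefault(url, set()).add(cache_id)

    def unregister(self, url: str, cache_id: str) -> None:
        """Forget a replica, e.g. after eviction at that cache."""
        self._holders.get(url, set()).discard(cache_id)

    def closest_holder(self, url: str,
                       hops: Dict[str, int]) -> Optional[str]:
        """Pick the holder with the fewest network hops from the requester."""
        holders = self._holders.get(url, set())
        return min(holders, key=lambda c: hops.get(c, float("inf")),
                   default=None)

# Example: two caches hold the object; the one 1 hop away is chosen.
svc = LocationService()
svc.register("/img/logo.png", "cacheA")
svc.register("/img/logo.png", "cacheB")
print(svc.closest_holder("/img/logo.png", {"cacheA": 3, "cacheB": 1}))  # cacheB
```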
As the Web continues to explode in size, caching becomes increasingly important. With caching comes the problem of cache consistency. Conventional wisdom holds that strong cache consistency is too expensive for the Web, and that weak consistency methods such as Time-To-Live (TTL) are most appropriate. The article compares three consistency approaches: adaptive TTL, polling-every-time, and invalidation, using a prototype implementation and trace replay in a simulated environment. Our results show that invalidation generates a comparable or smaller amount of network traffic and server workload than adaptive TTL and has a slightly lower average client response time, while polling-every-time generates more network traffic and longer client response times. We show that, contrary to popular belief, strong cache consistency can be maintained for the Web with little or no extra cost over the current weak consistency approaches, and that it should be maintained using an invalidation-based protocol.
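For reference, one common adaptive-TTL heuristic assigns a cache lifetime proportional to an object's current age, so rarely changing objects get long TTLs and recently modified objects get short ones. The sketch below shows that rule; it is not necessarily the exact variant evaluated in the article, and the parameter values are assumptions.

```python
# One common adaptive-TTL rule (illustrative; parameters are assumed).
def adaptive_ttl(now: float, last_modified: float,
                 factor: float = 0.2, min_ttl: float = 30.0,
                 max_ttl: float = 86_400.0) -> float:
    """TTL is a fraction of the object's age, clamped to [min_ttl, max_ttl]."""
    age = max(0.0, now - last_modified)
    return min(max_ttl, max(min_ttl, factor * age))

# An object last modified 10 hours ago is cached for 2 hours.
print(adaptive_ttl(now=36_000.0, last_modified=0.0))   # 7200.0
```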
We describe a prototype scalable and highly available web server, built on an IBM SP-2 system, and analyze its scalability. The system architecture consists of a set of logical front-end or network nodes and a set of back-end or data nodes connected by a switch, and a load-balancing component. A combination of TCP routing and Domain Name Server (DNS) techniques is used to balance the load across the front-end nodes that run the Web (httpd) daemons. The scalability achieved is quantified and compared with that of the known DNS technique. The load on the back-end nodes is balanced by striping the data objects across the back-end nodes and disks. High availability is provided by detecting node or daemon failures and reconfiguring the system appropriately. The scalable and highly available web server is combined with parallel databases and other back-end servers to provide integrated scalable and highly available solutions.