Proceedings of the IEEE Symposium on Reliable Distributed Systems

Description

  • ISSN
    1060-9857

Publications in this journal

  • Conference Proceeding: Active Replication at (Almost) No Cost
    ABSTRACT: MapReduce has become a popular programming paradigm in the domain of batch processing systems. Its simplicity allows applications to be highly scalable and to be easily deployed on large clusters. More recently, the MapReduce approach has also been applied to Event Stream Processing (ESP) systems. This approach, which we call StreamMapReduce, has enabled many novel applications that require both scalability and low latency. Another recent trend is to move distributed applications to public clouds such as Amazon EC2 rather than running and maintaining private data centers. Most cloud providers charge their customers on an hourly basis rather than on CPU cycles consumed. However, many applications, especially those that process online data, need to limit their CPU utilization to conservative levels (often as low as 50%) to be able to accommodate natural and sudden load variations without causing unacceptable deterioration in responsiveness. In this paper, we present a new fault tolerance approach based on active replication for StreamMapReduce systems. This approach is cost effective for cloud consumers as well as cloud providers. Cost effectiveness is achieved by fully utilizing the acquired computational resources without performance degradation and by reducing the need for additional nodes dedicated to fault tolerance.
    Reliable Distributed Systems (SRDS), 2011 30th IEEE Symposium on; 11/2011
  • Conference Proceeding: Secure, Dependable, and High Performance Cloud Storage
    ABSTRACT: Several prior works consider protocols for accessing partitioned data. Most of these works assume a local, cluster-based environment, and their designs target atomic semantics. However, in widely distributed cloud storage systems, these existing protocols may not scale well. In this paper, we analyze the requirements of access protocols for storage systems based on data partitioning schemes in widely distributed cloud environments. We consider regular semantics instead of atomic semantics to improve access efficiency. We then develop an access protocol that follows these requirements to achieve correct and efficient data accesses. Various protocols are compared experimentally, and the results show that our protocol yields much better performance than the existing ones.
    Reliable Distributed Systems, 2010 29th IEEE Symposium on; 12/2010
  • Conference Proceeding: On Optimizing Traffic Signal Phase Ordering in Road Networks
    ABSTRACT: Traffic signals are an elementary component of all urban road networks and play a critical role in controlling the flow of vehicles. However, current road transportation systems and traffic signal implementations are very inefficient. The objective of this research is to evaluate optimal phase ordering within a signal cycle to minimize the average waiting delay and thus, in turn, to minimize fuel consumption and greenhouse gas (GHG) emissions. Through extensive simulation analysis, we show that by choosing optimal phase ordering, the stopped delay can be reduced by 40% per car at each signal, resulting in a saving of up to 100 gallons of fuel per traffic signal each day.
    Reliable Distributed Systems, 2010 29th IEEE Symposium on; 12/2010
  • Conference Proceeding: Fixed Cost Maintenance for Information Dissemination in Wireless Sensor Networks
    ABSTRACT: Because of transient wireless link failures, incremental node deployment, and node mobility, existing information dissemination protocols used in wireless ad-hoc and sensor networks cause nodes to periodically broadcast an "advertisement" containing the version of their current data item, even in the "steady state" when no dissemination is being done. This is to ensure that all nodes in the network are up-to-date. It causes a continuous energy expenditure during the steady state, which is by far the dominant part of a network's lifetime. In this paper, we present a protocol called Varuna which incurs a constant energy cost, independent of the duration of the steady state. In Varuna, nodes monitor the traffic pattern of the neighboring nodes to decide when an advertisement is necessary. Using testbed experiments and simulations, we show that Varuna achieves several orders of magnitude energy savings compared to Trickle, the existing standard for dissemination in sensor networks, at the expense of a reasonable amount of memory for state maintenance.
    Reliable Distributed Systems, 2010 29th IEEE Symposium on; 12/2010
  • Conference Proceeding: A Multi-step Simulation Approach toward Secure Fault Tolerant System Evaluation
    ABSTRACT: As new techniques of fault tolerance and security emerge, so does the need for suitable tools to evaluate them. Generally, the security of a system can be estimated and verified via logical test cases, but the performance overhead of security algorithms on a system needs to be numerically analyzed. The diversity in security methods and in the design of fault tolerant systems makes it impossible for researchers to come up with a standard, affordable and openly available simulation tool, evaluation framework or experimental test-bed. Therefore, researchers choose from a wide range of available modeling-based, implementation-based or simulation-based approaches in order to evaluate their designs. All of these approaches have certain merits and several drawbacks. For instance, development of a system prototype provides a more accurate system analysis but, unlike simulation, it is not highly scalable. This paper presents a multi-step, simulation-based performance evaluation methodology for secure fault tolerant systems. We use a divide-and-conquer approach to model the entire secure system in a way that allows the use of different analytical tools at different levels of granularity. This evaluation procedure tries to strike a balance between the efficiency, effort, cost and accuracy of a system's performance analysis. We demonstrate this approach in a step-by-step manner by analyzing the performance of a secure and fault tolerant system using a Java implementation in conjunction with the ARENA simulation.
    Reliable Distributed Systems, 2010 29th IEEE Symposium on; 12/2010
  • Conference Proceeding: Towards Mobile Data Streaming in Service Oriented Architecture
    ABSTRACT: Service Oriented Architecture (SOA) is an architectural pattern providing agility to align technical solutions to modular business services that are decoupled from service consumers. Service capabilities such as interface options, quality of service (QoS), throughput, security and other constraints are described in the Service Level Agreement (SLA) that would typically be published in the service registry (UDDI) for use by consumers and/or mediation mechanisms. For mobile data streaming applications, problems arise when a service provider's SLA attributes cannot be mapped one-to-one to the service consumers (e.g., a 150 MB/sec video stream service provider and a 5 MB/sec data consumer). In this paper we present a generic framework prototype for managing and disseminating streaming data within an SOA environment as an alternative to custom service implementations based upon specific consumers or data types. Based on this framework, we implemented a set of services (Stream Discovery Service, Stream Multiplexor/Demultiplexor (routing) Service, Stream Brokering Service, Stream Repository Service and Stream Filtering Service) to demonstrate the flexibility of such a streaming data framework within an SOA environment.
    Reliable Distributed Systems, 2010 29th IEEE Symposium on; 12/2010
  • Conference Proceeding: A Study on Latent Vulnerabilities
    ABSTRACT: Software code reuse has long been touted as a reliable and efficient software development paradigm. Whilst this practice has numerous benefits, it is inherently susceptible to latent vulnerabilities. Source code which is re-used without being patched for various reasons may result in vulnerable binaries, despite the vulnerabilities being made publicly known. To aggravate matters, crackers have access to information on these vulnerabilities as well. Defenders need to ensure all loopholes are patched, while attackers need just one such loophole. In this work, we define latent vulnerabilities, and study the prevalence of the problem. This provides us the motivation, and an insight into the future work to be done in solving the problem. Our results show that unpatched source files which are more than one year old are commonly used in the latest operating systems. In fact, several of these files are more than ten years old. We explore the premises of using symbols in identifying binaries and conclude that they are insufficient in solving the problem. Additionally, we discuss two possible approaches to solve the problem.
    Reliable Distributed Systems, 2010 29th IEEE Symposium on; 12/2010
  • Conference Proceeding: Uncertainty Propagation in Analytic Availability Models
    ABSTRACT: In this paper, we discuss a Monte Carlo sampling based method for propagating the epistemic uncertainty in model parameters through the system availability model. We also outline methods to compute the number of samples needed to obtain a desired confidence interval for various scenarios. We illustrate this method with a real system example and discuss the results obtained. While our example discusses the confidence interval for system availability, this method can be directly applied to compute uncertainty for other dependability, performance and performability measures computed by solving stochastic analytic models. We also emphasize that no simulation is carried out in our method; instead, repeated sampling is performed over the parameter space, followed by the execution of the analytic model, with the final phase being the statistical analysis of the output vector.
    Reliable Distributed Systems, 2010 29th IEEE Symposium on; 12/2010
  • Conference Proceeding: Optimization Based Topology Control for Wireless Ad Hoc Networks to Meet QoS Requirements
    ABSTRACT: This paper proposes a technique for topology control (TC) of wireless nodes to meet Quality of Service (QoS) requirements between source and destination node pairs. The nodes are assumed to use a TDMA (Time Division Multiple Access) based MAC (Medium Access Control) layer. Given a set of QoS requirements, a set of wireless nodes and their initial positions, the goal is to find a topology of the nodes by adjusting the transmitting power, which will meet the QoS requirements under the presence of interference and at the same time minimize the energy consumed. The problem of TC is treated like an optimization problem and techniques of Linear Programming (LP) and Genetic Algorithms (GA) are used to solve it. The solution obtained after solving the optimization problem is in the form of optimal routes to be followed between each source, destination node pair. This information is used to construct the optimal topology.
    Reliable Distributed Systems, 2010 29th IEEE Symposium on; 12/2010
  • Conference Proceeding: Securing Mobile Unattended WSNs against a Mobile Adversary
    ABSTRACT: One important factor complicating security in Wireless Sensor Networks (WSNs) is lack of inexpensive tamper-resistant hardware in commodity sensors. Once an adversary compromises a sensor, all memory and forms of storage become exposed, along with all secrets. Thereafter, any cryptographic remedy ceases to be effective. Regaining sensor security after compromise (i.e., intrusion-resilience) is a formidable challenge. Prior approaches rely on either (1) the presence of an on-line trusted third party (sink), or (2) the availability of a True Random Number Generator (TRNG) on each sensor. Neither assumption is realistic in large-scale Unattended Wireless Sensor Networks (UWSNs) composed of low-cost commodity sensors. periodic visits by the sink. Previous work has demonstrated that sensor collaboration is an effective, yet expensive, means of attaining intrusion-resilience in UWSNs. In this paper, we explore intrusion resilience in Mobile UWSNs in the presence of a powerful mobile adversary. We show how the choice of the sensor mobility model influences intrusion resilience with respect to this adversary. We also explore self healing protocols that require only local communication. Results indicate that sensor density and neighborhood variability are the two key parameters affecting intrusion resilience. Our findings are supported by extensive analyses and simulations.
    Reliable Distributed Systems, 2010 29th IEEE Symposium on; 12/2010
  • Conference Proceeding: VMDriver: A Driver-Based Monitoring Mechanism for Virtualization
    ABSTRACT: Monitoring virtual machines (VMs) is an essential function for virtualized platforms. Existing solutions are either coarse-grained - monitoring at VM-level granularity, or not general - supporting only specific monitoring functions for a particular guest operating system (OS). Thus they do not satisfy the monitoring requirements of large-scale server clusters such as data centers and public cloud platforms, where each physical platform runs hundreds of VMs with different guest OSes. In this paper, we propose VMDriver, a general and fine-grained approach for virtualization monitoring. The novel design of VMDriver is the separation of the event interception point at the VMM level from the rich guest OS semantic reconstruction in the management domain. With this design, different monitoring drivers in the management domain can mask the differences between guest OSes. We implement VMDriver on Xen, and our experimental study shows that it introduces very small performance overhead. We demonstrate its generality by inspecting four aspects of information about target virtual machines with different guest OSes. The unified interface of VMDriver makes it convenient to develop complex monitoring tools for distributed virtualization environments.
    Reliable Distributed Systems, 2010 29th IEEE Symposium on; 12/2010
  • Conference Proceeding: Data-Mining-Based Link Failure Detection for Wireless Mesh Networks
    ABSTRACT: Mobile robot applications operating in wireless environments require fast detection of link failures in order to enable fast repair. In previous work, we have shown that cross-layer failure detection can reduce failure detection latency significantly. In particular, we monitor the behavior of the WLAN MAC layer to predict failures on the link layer. In this paper, we investigate data mining techniques to determine which parameters, i.e., the events, or the combination and timing of events, occurring on the MAC layer most probably lead to link failures. Our results show that the parameters revealed by the data mining approach produce similar or even more accurate failure predictions than achieved so far.
    Reliable Distributed Systems, 2010 29th IEEE Symposium on; 12/2010
  • Conference Proceeding: GAUL: Gestalt Analysis of Unstructured Logs for Diagnosing Recurring Problems in Large Enterprise Storage Systems
    ABSTRACT: We present GAUL, a system to automate the whole log comparison between a new problem and the ones diagnosed in the past to identify recurring problems. GAUL uses a fuzzy match algorithm based on the contextual overlap between log lines and efficiently implements this using scalable index/search. The accuracy and efficiency of the comparison are further improved by leveraging problem set information and noise tolerance techniques. We evaluate GAUL using 4339 customer problems that occurred in all field deployments of an enterprise storage system over the course of a year. Our results show that with human-filtered logs, GAUL can identify the correct problem set 66% of the time among the top 10 matches, which is 15% more accurate than the VSM system that uses cosine similarity and 19% more accurate than the ERRCMP system that uses error codes for log comparison. With unfiltered logs, the top 10 match accuracy of GAUL is 40%, which is 22% more accurate than VSM and 26% more accurate than ERRCMP.
    Reliable Distributed Systems, 2010 29th IEEE Symposium on; 12/2010
  • Conference Proceeding: Shedding Light on Enterprise Network Failures Using Spotlight
    ABSTRACT: Fault localization in enterprise networks is extremely challenging. A recent approach called Sherlock makes some headway into this problem by using an inference algorithm over a multi-tier probabilistic dependency graph that relates fault symptoms with possible root causes (e.g., routers, servers). A key limitation of Sherlock is its scalability because of the use of complicated inference algorithms based on Bayesian networks. We present a fault localization system called Spotlight that essentially uses two basic ideas. First, it compresses a multi-tier dependency graph into a bipartite graph with direct probabilistic edges between root causes and symptoms. Second, it runs a novel weighted greedy minimum set cover algorithm to provide fast inference. Through extensive simulations with real service dependency graphs and enterprise network topologies reported previously in literature, we show that Spotlight is about 100× faster than Sherlock in typical settings, with comparable accuracy in diagnosis.
    Reliable Distributed Systems, 2010 29th IEEE Symposium on; 12/2010
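
The Spotlight abstract above describes inference as a weighted greedy minimum set cover over a bipartite graph with probabilistic edges between root causes and symptoms. The following is a minimal sketch of a generic weighted greedy cover, not the authors' algorithm: the abstract does not specify how Spotlight weights or scores candidates, so the scoring rule here (sum of edge probabilities over still-unexplained symptoms) is an assumption, as are the function name `greedy_weighted_cover` and the example graph.

```python
from typing import Dict, List, Set

# Hypothetical bipartite graph: root cause -> {symptom: edge probability}.
Graph = Dict[str, Dict[str, float]]


def greedy_weighted_cover(graph: Graph, observed: Set[str]) -> List[str]:
    """Greedily pick root causes until every observed symptom is explained.

    Assumed scoring rule: a cause's score is the sum of its edge
    probabilities to symptoms that are observed and not yet covered.
    """
    uncovered = set(observed)
    chosen: List[str] = []
    while uncovered:
        best_cause, best_score = None, 0.0
        for cause in sorted(graph):  # sorted for deterministic tie-breaking
            if cause in chosen:
                continue
            score = sum(p for s, p in graph[cause].items() if s in uncovered)
            if score > best_score:
                best_cause, best_score = cause, score
        if best_cause is None:
            break  # remaining symptoms cannot be explained by any cause
        chosen.append(best_cause)
        uncovered -= set(graph[best_cause])
    return chosen


# Illustrative data only; names and probabilities are made up.
example_graph: Graph = {
    "router_r1": {"web_slow": 0.9, "mail_down": 0.8},
    "server_s1": {"web_slow": 0.6},
    "server_s2": {"mail_down": 0.7, "dns_timeout": 0.5},
}
diagnosis = greedy_weighted_cover(
    example_graph, {"web_slow", "mail_down", "dns_timeout"}
)
print(diagnosis)
```

Each greedy step scans every candidate cause once, so the cost grows with the number of causes times the number of selected causes, which is one plausible reason a cover-based approach could be cheaper than full Bayesian inference; the abstract itself reports only the measured speedup.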
