ABSTRACT: Distributed computing applications are increasingly utilizing distributed data sources. However, the unpredictable cost of data access in large-scale computing infrastructures can lead to severe performance bottlenecks. Providing predictability in data access is thus essential to accommodate the large set of newly emerging large-scale, data-intensive computing applications. In this regard, accurate estimation of network performance is crucial to meeting the performance goals of such applications. Passive estimation based on past measurements is attractive for its relatively small overhead compared to explicit probing. In this paper, we take a passive approach to network performance estimation. Our approach differs from existing passive techniques, which rely either on past direct measurements between pairs of nodes or on topological similarities; instead, we exploit secondhand measurements collected by other nodes without any topological restrictions. We present Overlay Passive Estimation of Network performance (OPEN), a scalable framework providing end-to-end network performance estimation based on secondhand measurements, and discuss how OPEN achieves cost-effective estimation in a large-scale infrastructure. Our extensive experimental results show that OPEN estimation is applicable to the replica and resource selection tasks commonly used in distributed computing.
IEEE Transactions on Parallel and Distributed Systems 09/2011; · 1.80 Impact Factor
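The OPEN abstract above turns on one idea: a node estimates its performance to a target from measurements that other nodes collected to that same target, with no direct probing. A minimal sketch of that secondhand scheme follows; the report format and the median aggregation rule are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of passive, secondhand estimation in the spirit of
# OPEN: the estimate for a target comes entirely from measurements that
# other observers gossiped through the overlay. The median aggregation
# rule is an assumption for illustration.
import statistics

def estimate(target, reports):
    """reports: iterable of (observer, target, measured_value) tuples
    collected secondhand. Returns an estimate for `target`, or None."""
    samples = [value for _observer, tgt, value in reports if tgt == target]
    if not samples:
        return None  # no secondhand evidence for this target
    return statistics.median(samples)
```

A node would feed this with whatever reports its neighbors have shared; with no report for a target, the caller must fall back to probing or a default.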
ABSTRACT: MapReduce is a distributed computing paradigm widely used for building large-scale data processing applications. When used in cloud environments, MapReduce clusters are dynamically created using virtual machines (VMs) and managed by the cloud provider. In this paper, we study the energy efficiency problem for such MapReduce clusters in private cloud environments, which are characterized by repeated, batch execution of jobs. We describe a unique spatio-temporal tradeoff that includes efficient spatial fitting of VMs on servers to achieve high utilization of machine resources, as well as balanced temporal fitting of servers with VMs having similar runtimes to ensure a server runs at high utilization throughout its uptime. We propose VM placement algorithms that explicitly incorporate these tradeoffs. Our algorithms achieve energy savings over existing placement techniques, and an additional optimization technique achieves further savings while simultaneously improving job performance.
Cloud Computing (CLOUD), 2011 IEEE International Conference on; 08/2011
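The spatio-temporal tradeoff described above can be made concrete with a small placement sketch: sorting VMs by estimated runtime groups similar-runtime VMs onto the same server (temporal balance), and first-fit by resource demand packs each server tightly (spatial fitting), so a server can be powered down as soon as its whole batch finishes. This is an assumed illustration, not the paper's actual algorithm.

```python
# Illustrative sketch (assumed, not the paper's algorithm) of
# spatio-temporal VM placement: sort by runtime so co-located VMs finish
# together, then first-fit by demand so servers stay highly utilized.

def place_vms(vms, server_capacity):
    """vms: list of (vm_id, resource_demand, est_runtime).
    Returns a list of servers: {"free": remaining_capacity, "vms": [ids]}."""
    servers = []
    for vm_id, demand, _runtime in sorted(vms, key=lambda v: v[2]):
        for server in servers:
            if server["free"] >= demand:      # spatial fit on an open server
                server["free"] -= demand
                server["vms"].append(vm_id)
                break
        else:                                  # no fit: power on a new server
            servers.append({"free": server_capacity - demand, "vms": [vm_id]})
    return servers
```

Because VMs arrive in runtime order, each server tends to hold VMs with similar runtimes, which is the temporal half of the tradeoff.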
ABSTRACT: Current cloud infrastructures are valued for their ease of use and performance. However, they suffer from several shortcomings, chief among them inefficient data mobility due to the centralization of cloud resources. We believe such clouds are highly unsuited for dispersed-data-intensive applications, where the data may be spread across multiple geographical locations (e.g., distributed user blogs). Instead, we propose a new cloud model called Nebula: a dispersed, context-aware, and cost-effective cloud. We provide experimental evidence for the need for Nebulas using a distributed blog analysis application, followed by the system architecture and components of our system.
ABSTRACT: MapReduce has gained in popularity as a distributed data analysis paradigm, particularly in the cloud, where MapReduce jobs are run on virtual clusters. The provisioning of MapReduce jobs in the cloud is an important problem for optimizing several user- as well as provider-side metrics, such as runtime, cost, throughput, energy, and load. In this paper, we present an intelligent provisioning framework called STEAMEngine that consists of provisioning algorithms to optimize these metrics through a set of common building blocks. These building blocks enable spatio-temporal tradeoffs unique to MapReduce provisioning: along with a job's resource requirements (spatial component), its runtime (temporal component) is a critical element for any provisioning algorithm. We also describe two novel provisioning algorithms, a user-driven performance optimization and a provider-driven energy optimization, that leverage these building blocks. Our experimental results based on an Amazon EC2 cluster and a local Xen/Hadoop cluster show the benefits of STEAMEngine through improvements in performance and energy via the use of these algorithms and building blocks.
18th International Conference on High Performance Computing, HiPC 2011, Bengaluru, India, December 18-21, 2011; 01/2011
ABSTRACT: MapReduce is a highly popular paradigm for high-performance computing over large data sets in large-scale platforms. However, when the source data is widely distributed and the computing platform is also distributed, e.g., data is collected in separate data center locations, the most efficient architecture for running Hadoop jobs over the entire data set becomes non-trivial. In this paper, we show that the traditional single-cluster MapReduce setup may not be suitable when data and compute resources are widely distributed. Further, we provide recommendations for alternative (and even hierarchical) distributed MapReduce setup configurations, depending on the workload and data set.
ABSTRACT: Resource discovery is an important process for finding suitable nodes that satisfy application requirements in large loosely coupled distributed systems. Besides internode heterogeneity, many of these systems also show a high degree of intranode dynamism, so that selecting nodes based only on their recently observed resource capacities can lead to poor deployment decisions resulting in application failures or migration overheads. However, most existing resource discovery mechanisms rely mainly on recent observations to achieve scalability in large systems. In this paper, we propose the notion of a resource bundle, a representative resource usage distribution for a group of nodes with similar resource usage patterns, that employs two complementary techniques to overcome the limitations of existing techniques: resource usage histograms to provide statistical guarantees for resource capacities, and clustering-based resource aggregation to achieve scalability. Using trace-driven simulations and data analysis of a month-long PlanetLab trace, we show that resource bundles are able to provide high accuracy for statistical resource discovery, while achieving high scalability. We also show that resource bundles are ideally suited for identifying group-level characteristics (e.g., hot spots, total group capacity). To automatically parameterize the bundling algorithm, we present an adaptive algorithm that can detect online fluctuations in resource heterogeneity.
IEEE Transactions on Parallel and Distributed Systems 09/2010; · 1.80 Impact Factor
ABSTRACT: We examine whether traditional disk I/O scheduling still provides benefits in a layered system consisting of virtualized operating systems and an underlying virtual machine monitor. We demonstrate that choosing the appropriate scheduling algorithm in guest operating systems provides performance benefits, while scheduling in the virtual machine monitor has no measurable advantage. We propose future areas for investigation, including schedulers optimized for running in a virtual machine, for running in a virtual machine monitor, and layered schedulers optimizing both application-level access and the underlying storage technology.
ABSTRACT: Virtualization is being widely used in large-scale computing environments, such as clouds, data centers, and grids, to provide application portability and facilitate resource multiplexing while retaining application isolation. In many existing virtualized platforms, it has been found that the network bandwidth often becomes the bottleneck resource due to the hierarchical topology of the underlying network, causing both high network contention and reduced performance for communication- and data-intensive applications. In this paper, we present a decentralized affinity-aware migration technique that incorporates heterogeneity and dynamism in network topology and job communication patterns to allocate virtual machines on the available physical resources. Our technique monitors network affinity between pairs of VMs and uses a distributed bartering algorithm, coupled with migration, to dynamically adjust VM placement such that communication overhead is minimized. Our experimental results running the Intel MPI benchmark and a scientific application on an 8-node Xen cluster show that we can get up to 42% improvement in the runtime of the application over a no-migration technique, while achieving up to 85% reduction in network communication cost. In addition, our technique is able to adjust to dynamic variations in communication patterns and provides both good performance and low network contention with minimal overhead.
39th International Conference on Parallel Processing, ICPP 2010, San Diego, California, USA, 13-16 September 2010; 01/2010
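The affinity-aware migration abstract above minimizes cross-host traffic by repositioning VMs that communicate heavily. A centralized toy version of that objective is sketched below: greedily swap VM pairs whenever the swap lowers total cross-host traffic. This approximates the *effect* of the paper's decentralized bartering; it is not the paper's algorithm, and all names are illustrative.

```python
# Illustrative sketch (not the paper's decentralized bartering): given a
# VM-to-VM traffic matrix and a VM-to-host placement, greedily accept any
# pairwise swap of hosts that reduces total cross-host communication.

def cross_host_traffic(traffic, placement):
    """Sum of traffic between VM pairs placed on different hosts."""
    total = 0
    vms = list(placement)
    for i, u in enumerate(vms):
        for v in vms[i + 1:]:
            if placement[u] != placement[v]:
                total += traffic.get((u, v), 0) + traffic.get((v, u), 0)
    return total

def greedy_swaps(traffic, placement):
    """Repeatedly swap VM pairs while any swap lowers the cost."""
    placement = dict(placement)
    cost = cross_host_traffic(traffic, placement)
    improved = True
    while improved:
        improved = False
        vms = list(placement)
        for i, u in enumerate(vms):
            for v in vms[i + 1:]:
                if placement[u] == placement[v]:
                    continue  # same host: swapping changes nothing
                placement[u], placement[v] = placement[v], placement[u]
                new_cost = cross_host_traffic(traffic, placement)
                if new_cost < cost:
                    cost = new_cost
                    improved = True
                else:  # swap did not help: undo it
                    placement[u], placement[v] = placement[v], placement[u]
    return placement, cost
```

In the decentralized setting of the paper, hosts would instead negotiate such swaps pairwise from locally monitored affinity, without a global traffic matrix.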
ABSTRACT: Resource discovery enables applications deployed in heterogeneous large-scale distributed systems to find resources that meet QoS requirements. In particular, most applications need resource requirements to be satisfied simultaneously for multiple resources (such as CPU, memory and network bandwidth). Due to dynamism in many large-scale systems, providing statistical guarantees on such requirements is important to avoid application failures and overheads. However, existing techniques either provide guarantees only for individual resources, or take a static or memoryless approach along multiple dimensions. We present HiDRA, a scalable resource discovery technique providing statistical guarantees for resource requirements spanning multiple dimensions simultaneously. Through trace analysis and a 307-node PlanetLab implementation, we show that HiDRA, while using over 1,400 times less data, performs nearly as well as a fully-informed algorithm, showing better precision and having recall within 3%. We demonstrate that HiDRA is a feasible, low-overhead approach to statistical resource discovery in a distributed system.
Quality of Service, 2009. IWQoS. 17th International Workshop on; 08/2009
ABSTRACT: Large-scale distributed systems provide an attractive scalable infrastructure for network applications. However, the loosely coupled nature of this environment can make data access unpredictable, and in the limit, unavailable. We introduce the notion of accessibility to capture both availability and performance. An increasing number of data-intensive applications require not only considerations of node computation power but also accessibility for adequate job allocations. For instance, selecting a node with intolerably slow connections can offset any benefit of running on a fast node. In this paper, we present accessibility-aware resource selection techniques by which it is possible to choose nodes that will have efficient data access to remote data sources. We show that the local data access observations collected from a node's neighbors are sufficient to characterize accessibility for that node. By conducting trace-based, synthetic experiments on PlanetLab, we show that the resource selection heuristics guided by this principle significantly outperform conventional techniques such as latency-based or random allocations. The suggested techniques are also shown to be stable even under churn despite the loss of prior observations.
IEEE Transactions on Parallel and Distributed Systems 01/2009; 20:788-801. · 1.80 Impact Factor
ABSTRACT: Supercomputers are prone to frequent faults that adversely affect their performance, reliability, and functionality. System logs collected on these systems are a valuable source of information about their operational status and health. However, their massive size, complexity, and lack of standard format make it difficult to automatically extract information that can be used to improve system management. In this work, we propose a novel method to succinctly represent the contents of supercomputing logs, using textual clustering to automatically find the syntactic structures of log messages. This information is used to automatically classify messages into semantic groups via an online clustering algorithm. Further, we describe a methodology for using the temporal proximity between groups of log messages to identify correlated events in the system. We apply our proposed methods to two large, publicly available supercomputing logs and show that our technique achieves nearly perfect accuracy for online log classification and extracts meaningful structural and temporal message patterns that can be used to improve the accuracy of other log analysis techniques.
16th International Conference on High Performance Computing, HiPC 2009, December 16-19, 2009, Kochi, India, Proceedings; 01/2009
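The log-analysis abstract above groups messages by their syntactic structure. A common way to illustrate this (assumed here for illustration; the paper's actual clustering is more sophisticated and online) is to mask the variable tokens of each message into a template and group messages sharing a template:

```python
# Illustrative sketch (assumed, not the paper's method) of syntactic log
# clustering: reduce each message to a template by masking variable
# tokens (decimal and hex numbers); messages sharing a template form
# one semantic group.
import re

def template(message):
    """Replace numeric/hex tokens with '*' to expose the message skeleton."""
    return re.sub(r"\b(0x[0-9a-f]+|\d+)\b", "*", message)

def cluster(messages):
    """Group messages by their masked template."""
    groups = {}
    for m in messages:
        groups.setdefault(template(m), []).append(m)
    return groups
```

Correlated-event detection would then operate on the time series of these groups rather than on raw messages.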
ABSTRACT: Resource discovery is an important process for finding suitable nodes that satisfy application requirements in large loosely-coupled distributed systems. Besides inter-node heterogeneity, many of these systems also show a high degree of intra-node dynamism, so that selecting nodes based only on their recently observed resource capacities (for scalability reasons) can lead to poor deployment decisions resulting in application failures or migration overheads. In this paper, we propose the notion of a resource bundle, a representative resource usage distribution for a group of nodes with similar resource usage patterns, that employs two complementary techniques to overcome the limitations of existing techniques: resource usage histograms to provide statistical guarantees for resource capacities, and clustering-based resource aggregation to achieve scalability. Using trace-driven simulations and data analysis of a month-long PlanetLab trace, we show that resource bundles are able to provide high accuracy for statistical resource discovery (up to 56% better precision than using only recent values), while achieving high scalability (up to 55% fewer messages than a non-aggregation algorithm). We also show that resource bundles are ideally suited for identifying group-level characteristics such as finding load hot spots and estimating total group capacity (within 8% of actual values).
Distributed Computing Systems, 2008. ICDCS '08. The 28th International Conference on; 07/2008
ABSTRACT: Large-scale distributed systems provide an attractive scalable infrastructure for network applications. However, the loosely-coupled nature of this environment can make data access unpredictable, and in the limit, unavailable. We introduce the notion of accessibility to capture both availability and performance. An increasing number of data-intensive applications require not only considerations of node computation power but also accessibility for adequate job allocations. For instance, selecting a node with intolerably slow connections can offset any benefit of running on a fast node. In this paper, we present accessibility-aware resource selection techniques by which it is possible to choose nodes that will have efficient data access to remote data sources. We show that the local data access observations collected from a node's neighbors are sufficient to characterize accessibility for that node. We then present resource selection heuristics guided by this principle, and show that they significantly outperform standard techniques. The suggested techniques are also shown to be stable even under churn despite the loss of prior observations.
Distributed Computing Systems, 2008. ICDCS '08. The 28th International Conference on; 07/2008
ABSTRACT: Hierarchical scheduling has been proposed as a technique to achieve aggregate resource partitioning among related groups of threads and applications in uniprocessor and packet scheduling environments. Existing hierarchical schedulers are not easily extensible to multiprocessor environments because 1) they do not incorporate the inherent parallelism of a multiprocessor system while partitioning resources, and 2) they can result in unbounded unfairness or starvation if applied to a multiprocessor system in a naive manner. In this paper, we present hierarchical multiprocessor scheduling (H-SMP), a novel hierarchical CPU scheduling algorithm designed for a symmetric multiprocessor (SMP) platform. The novelty of this algorithm lies in its combination of space and time multiplexing to achieve the desired bandwidth partition among the nodes of the hierarchical scheduling tree. This algorithm is also characterized by its ability to incorporate existing proportional-share algorithms as auxiliary schedulers to achieve efficient hierarchical CPU partitioning. In addition, we present a generalized weight feasibility constraint that specifies the limit on the achievable CPU bandwidth partitioning in a multiprocessor hierarchical framework, and propose a hierarchical weight readjustment algorithm designed to transparently satisfy this feasibility constraint. We evaluate the properties of H-SMP using hierarchical surplus fair scheduling (H-SFS), an instantiation of H-SMP that employs surplus fair scheduling (SFS) as an auxiliary algorithm. This evaluation is carried out through a simulation study showing that H-SFS provides better fairness properties in multiprocessor environments as compared to existing algorithms and their naive extensions.
IEEE Transactions on Parallel and Distributed Systems 04/2008; · 1.80 Impact Factor
ABSTRACT: The layered design of the Linux operating system hides the liveness of file system data from the underlying block layers. This lack of liveness information prevents the storage system from discarding blocks deleted by the file system, often resulting in poor utilization, security problems, inefficient caching, and migration overheads. In this paper, we define a generic "purge" operation that can be used by a file system to pass liveness information to the block layer with minimal changes in the layer interfaces, allowing the storage system to discard deleted data. We present three approaches for implementing such a purge operation: direct call, zero blocks, and flagged writes, each of which differs in its architectural complexity and potential performance overhead. We evaluate the feasibility of these techniques through a reference implementation of a dynamically resizable copy-on-write (COW) data store in User Mode Linux (UML). Performance results obtained from this reference implementation show that all these techniques can achieve significant storage savings with a reasonable execution time overhead. At the same time, our results indicate that while the direct call approach has the best performance, the zero block approach provides the best compromise in terms of performance overhead and its semantic and architectural simplicity. Overall, our results demonstrate that passing liveness information across the file system-block layer interface with minimal changes is not only feasible but practical.
ABSTRACT: Large-scale distributed systems provide the backbone for numerous distributed applications and online services. These systems span a multitude of computing nodes located at different geographical locations, connected together via wide-area networks and overlays. A major concern with such systems is their susceptibility to failures, leading to downtime of services and hence high monetary/business costs. In this paper, we argue that to understand failures in such a system, we need to co-design the monitoring system with the failure analysis system. Unlike existing monitoring systems, which are not designed specifically for failure analysis, we advocate a new way to design a monitoring system with the goal of uncovering the causes of failures. Similarly, the failure analysis techniques themselves need to go beyond simple statistical analysis of failure events in isolation to serve as an effective tool. Towards this end, we provide a discussion of some guiding principles for the co-design of monitoring and failure analysis systems for planetary-scale systems.
ABSTRACT: The scalability and computing power of large-scale computational platforms has made them attractive for hosting compute-intensive, time-critical applications. Many of these applications are composed of computational tasks that require specific deadlines to be met for successful completion. In this paper, we show that combining redundant scheduling with deadline-based scheduling in these systems leads to a fundamental tradeoff between throughput and fairness. We propose a new scheduling algorithm called Limited Resource Earliest Deadline (LRED) that couples redundant scheduling with deadline-driven scheduling in a flexible way, using a simple tunable parameter to exploit this tradeoff. Our evaluation shows that LRED provides a powerful mechanism to achieve the desired throughput or fairness under high loads and in low-timeliness environments.
Proceedings of the 2008 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 2008, Annapolis, MD, USA, June 2-6, 2008; 01/2008
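The LRED abstract above hinges on one tunable knob: how much redundancy each deadline-ordered task receives. A toy sketch of that throughput/fairness dial follows; the function shape and names are illustrative assumptions, not the paper's pseudocode.

```python
# Hypothetical sketch of the LRED idea: serve tasks in earliest-deadline
# order, and let a single parameter `max_replicas` bound per-task
# redundancy. Low values favor throughput (more tasks get scheduled);
# high values favor reliability of each scheduled task.

def lred_schedule(tasks, num_nodes, max_replicas):
    """tasks: list of (task_id, deadline). Returns {task_id: replica_count}
    for the tasks that fit on `num_nodes` nodes."""
    assignment = {}
    free = num_nodes
    for task_id, _deadline in sorted(tasks, key=lambda t: t[1]):
        if free == 0:
            break  # platform saturated; remaining tasks are not scheduled
        replicas = min(max_replicas, free)
        assignment[task_id] = replicas
        free -= replicas
    return assignment
```

With `max_replicas=1` every task gets one chance (maximum throughput); raising it shifts capacity toward redundant copies of the earliest-deadline tasks, which is the tradeoff the abstract describes.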