Quorums are a basic construct in solving many fundamental distributed computing problems. One of the known ways of making quorums scalable and efficient is by weakening their intersection guarantee to being probabilistic. This paper explores several access strategies for implementing probabilistic quorums in ad hoc networks. In particular, we present the first detailed study of asymmetric probabilistic bi-quorum systems and show its advantages in ad hoc networks. The paper includes both a formal analysis of these approaches backed by a simulation based study. In particular, we show that one of the strategies, based on Random Walks, exhibits the smallest communication overhead.
COCA is a fault-tolerant and secure online certification authority that has been built and deployed both in a local area network and in the Internet. Extremely weak assumptions characterize environments in which COCA's protocols execute correctly: no assumption is made about execution speed and message delivery delays; channels are expected to exhibit only intermittent reliability; and with 3t + 1 COCA servers up to t may be faulty or compromised. COCA is the first system to integrate a Byzantine quorum system (used to achieve availability) with proactive recovery (used to defend against mobile adversaries which attack, compromise, and control one replica for a limited period of time before moving on to another). In addition to tackling problems associated with combining fault-tolerance and security, new proactive recovery protocols had to be developed. Experimental results give a quantitative evaluation for the cost and effectiveness of the protocols.
We favor an approach to building secure systems that includes an application-based security model. An instance of such a model and its formalization have been presented. Important aspects of the model are: (1) because it is framed in terms of operations and data objects that the user sees, the model captures the system's security requirements in a way that is understandable to users; (2) the model defines a hierarchy of entities and references; access to an entity can be controlled based on the path used to refer to it; (3) because the model avoids specifying implementation strategies, software developers are free to choose the most effective implementation; (4) the model and its formalization provide a basis for certifiers to assess the security of the system as a whole. Simplicity and clarity in the model's statement have been primary goals. The model's statement does not, however, disguise the complexity that is inherent in the application. In this respect, we have striven for a model that is as simple as possible but stops short of distorting the user's view of the system. The work reported demonstrates the feasibility of defining an application-based security model informally and subsequently formalizing it.
In 2001, Chou et al. published a study of faults found by applying a static analyzer to Linux versions 1.0 through 2.4.1. A major result of their work was that the drivers directory contained up to 7 times more of certain kinds of faults than other directories. This result inspired a number of development and research efforts on improving the reliability of driver code. Today Linux is used in a much wider range of environments, provides a much wider range of services, and has adopted a new development and release model. What has been the impact of these changes on code quality? Are drivers still a major problem?
To answer these questions, we have transported the experiments of Chou et al. to Linux versions 2.6.0 to 2.6.33, released between late 2003 and early 2010. We find that Linux has more than doubled in size during this period, but that the number of faults per line of code has been decreasing. And, even though drivers still accounts for a large part of the kernel code and contains the most faults, its fault rate is now below that of other directories, such as arch (HAL) and fs (file systems). These results can guide further development and research efforts. To enable others to continually update these results as Linux evolves, we define our experimental protocol and make our checkers and results available in a public archive.
The design and correctness of a communication facility for a distributed computer system are reported on. The facility provides support for fault-tolerant process groups in the form of a family of reliable multicast protocols that can be used in both local- and wide-area networks. These protocols attain high levels of concurrency, while respecting application-specific delivery ordering constraints, and have varying cost and performance that depend on the degree of ordering desired. In particular, a protocol that enforces causal delivery orderings is introduced and shown to be a valuable alternative to conventional asynchronous communication protocols. The facility also ensures that the processes belonging to a fault-tolerant process group will observe consistant orderings of events affecting the group as a whole, including process failures, recoveries, migration, and dynamic changes to group properties like member rankings. A review of several uses for the protocols is the ISIS system, which supports fault-tolerant resilient objects and bulletin boards, illustrates the significant simplification of higher level algorithms made possible by our approach.
One aspect of fault tolerance in process control programs is the ability to tolerate sensor failure. A methodology for transforming a process control program that cannot tolerate sensor failures onto one that can is presented. Issues addressed include modifying specifications in order to accommodate uncertainty in sensor values and averaging sensor values in a fault tolerant manner. In addition, a hierarchy of sensor failure models is identified, and both the attainable accuracy and the run-time complexity of sensor averaging with respect to this hierarchy is discussed.
Many distributed systems replicate data for fault tolerance or availability. In such systems, a logical update on a data item results in a physical update on a number of copies. The synchronization and communication required to keep the copies of replicated data consistent introduce a delay when operations are performed. A technique is described that relaxes the usual degree of synchronization, permitting replicated data items to be updated concurrently with other operations, while at the same time ensuring that correctness is not violated. The additional concurrency thus obtained results in better response time when performing operations on replicated data. How this technique performs in conjunction with a roll-back and a roll-forward failure recovery mechanism is also discussed.
One of the overwhelming problems that software producers must contend with, is the unauthorized use and distribution of their
products. Copyright laws concerning software are rarely enforced, thereby causing major losses to the software companies.
Technical means of protecting software from illegal duplication are required, but the available means are imperfect. We present
protocols that enables software protection, without causing overhead in distribution and maintenance. The protocols may be
implemented by a conventional cryptosystem, such as the DES, or by a public key cryptosystem, such as the RSA. Both implementations
are proved to satisfy required security criterions.
High-level hardware modeling via simulation is an essential step in hardware systems design and research. Despite the importance of simulation, current model creation methods are error prone and are unnecessarily time consuming. To address these problems, we have publicly released the Liberty Simulation Environment (LSE), Version 1.0, consisting of a simulator builder and automatic visualizer based on a shared hardware description language. LSE's design was motivated by a careful analysis of the strengths and weaknesses of existing systems. This has resulted in a system in which models are easier to understand, faster to develop, and have performance on par with other systems. LSE is capable of modeling any synchronous hardware system. To date, LSE has been used to simulate and convey ideas about a diverse set of complex systems including a chip multiprocessor out-of-order IA64 machine and a multiprocessor system with detailed device models.
This paper analyzes and quantifies the locality characteristics of numerical loop nests in order to suggest future directions for architecture and software cache optimizations. Since most programs spend the majority of their time in nests, the vast majority of cache optimization techniques target loop nests. In contrast, the locality characteristics that drive these optimizations are usually collected across the entire application rather than the nest level. Researchers have studied numerical codes for so long that a number of commonly held assertions have emerged on their locality characteristics. In light of these assertions, we use the SPEC'95 and Perfect Benchmarks to take a new look at measuring locality on numerical codes based on references, loop nests, and program locality properties. Our results show that several popular assertions are at best overstatements. For example, although most reuse is within a loop nest, in line with popular assertions, most misses are inter-nest capacity misses, and correspond to potential reuse between nearby loop nests. In addition, we find that temporal and spatial reuse have balanced roles within a loop nest and most reuse across nests and the entire program is temporal. These results are consistent with high hit rates (80% or more hits), but go against the commonly held assumption that spatial reuse dominates. Our locality measurements reveal important differences between loop nests and programs; refute some popular assertions; and provide new insights for the compiler writer and the architect.
ion of Shared Accesses Peter J. Keleher firstname.lastname@example.org Department of Computer Science University of Maryland College Park, MD 20742 Key words: shared memory, DSM, programming libraries, update protocols We describe the design and use of the tape mechanism, a new high-level abstraction of accesses to shared data for software DSMs. Tapes consolidate and generalize a number of recent protocol optimizations, including update-based locks and record-replay barriers. Tapes are usually created by "recording" shared accesses. The resulting recordings can be used to anticipate future accesses by tailoring data movement to application semantics. Tapes-based mechanisms are layered on top of existing shared memory protocols, and are largely independent of the underlying memory model. Tapes can also be used to emulate the data-movement semantics of several update-based protocol implementations, without altering the underlying protocol implementation. We have used tapes to create the Tape...
this paper, we present an access control mechanism for extensible systems to address this problem. Our access control mechanism decomposes access control into a policy-neutral enforcement manager and a security policy manager, and it is transparent to extensions in the absence of security violations. It structures the system into protection domains, enforces protection domains through access control checks, and performs auditing of system operations. The access control mechanism works by inspecting extensions for their types and operations to determine which abstractions require protection and by redirecting procedure or method invocations to inject access control operations into the system. We describe the design of this access control mechanism, present an implementation within the SPIN extensible operating system, and provide a qualitative as well as quantitative evaluation of the mechanism.
: In scalable parallel machines, processors can make local memory accesses much faster than they can make remote memory accesses. In addition, when a number of remote accesses must be made, it is usually more efficient to use block transfers of data rather than to use many small messages. To run well on such machines, software must exploit these features. We believe it is too onerous for a programmer to do this by hand, so we have been exploring the use of restructuring compiler technology for this purpose. In this paper, we start with a language like HPF-FORTRAN with user-specified data distributionand develop a systematic loop transformation strategy called access normalization that restructures loop nests to exploit locality and block transfers. We demonstrate the power of our techniques using routines from the BLAS (Basic Linear Algebra Subprograms) library. An important feature of our approach is that we model loop transformations using invertible matrices and integer lattice theo...
In large multiprocessor systems, fast synchronization is crucial for high performance. However, synchronization traffic tends to create “hot-spots” in shared memory and cause network congestion. Multistage shuffle-exchange networks have been proposed and built to handle synchronization traffic. Software combining schemes have also been proposed to relieve network congestion caused by hot-spots. However, multistage combining networks could be very expensive and software combining could be very slow.
In this paper, we propose a single-stage combining network to handle synchronization traffic, which is separated from the regular memory traffic. A single-stage combining network has several advantages: (1) it is attractive from an implementation perspective because only one stage is needed(instead of log N stages); (2) Only one network is needed to handle both forward and returning requests; (3) combined requests are distributed evenly through the network—the wait buffer size is reduced; and (4) fast-finishing algorithms  can be used to shorten the network delay.
Because of all these advantages, we show that a single-stage combining network gives good performance at a lower cost than a multistage combining network.
this article, we describe the Vesta Parallel File System, first introduced by Corbett et al. [1993a]. Vesta introduces a new abstraction of parallel files, by which application programmers can express the required partitioning of file data among the processes of a parallel application. This reduces the need for synchronization and concurrency control and allows for a more streamlined implementation. Also, Vesta provides explicit control over the way data are distributed across the I/O nodes and allows the distribution to be tailored for the expected access patterns
A fail-stop processor halts instead of performing an erroneous state transformation that might be visible to other processors, can detect whether another fail- stop processor has halted (due to a failure), and has a predefined portion of its storage that is unaffected by failures and accessible to any other fail-stop processor.
Threads are the vehicle for concurrency in many approaches to parallel programming. Threads can be supported either by the operating system kernel or by user-level library code in the application address space, but neither approach has been fully satisfactory.
This paper addresses this dilemma. First, we argue that the performance of kernel threads is inherently worse than that of user-level threads, rather than this being an artifact of existing implementations; managing parallelism at the user level is essential to high-performance parallel computing. Next, we argue that the problems encountered in integrating user-level threads with other system services is a consequence of the lack of kernel support for user-level threads provided by contemporary multiprocessor operating systems; kernel threads are the wrong abstraction on which to support user-level management of parallelism. Finally, we describe the design, implementation, and performance of a new kernel interface and user-level thread package that together provide the same functionality as kernel threads without compromising the performance and flexibility advantages of user-level management of parallelism.
This article explores memory sharing and protection support in Opal, a single-address-space operating system designed for wide-address (64-bit) architectures. Opal threads execute within protection domains in a single shared virtual address space. Sharing is simplified, because addresses are context independent. There is no loss of protection, because addressability and access are independent; the right to access a segment is determined by the protection domain in which a thread executes. This model enables beneficial code-and data-sharing patterns that are currently prohibitive, due in part to the inherent restrictions of multiple address spaces, and in part to Unix programming style.
We have designed and implemented an Opal prototype using the Mach 3.0 microkernel as a base. Our implementation demonstrates how a single-address-space structure can be supported alongside of other environments on a modern microkernel operating system, using modern wide-address architectures. This article justifies the Opal model and its goals for sharing and protection, presents the system and its abstractions, describes the prototype implementation, and reports experience with integrated applications.
Through analysis and experiments, this paper investigates two-phase waiting algorithms to minimize the cost of waiting for synchronization in large-scale multiprocessors. In a two-phase algorithm, a thread first waits by polling a synchronization variable. If the cost of polling reaches a limit L poll and further waiting is necessary, the thread is blocked, incurring an additional fixed cost, B . The choice of L poll is a critical determinant of the performance of two-phase algorithms. We focus on methods for statically determining L poll because the run-time overhead of dynamically determining L poll can be comparable to the cost of blocking in large-scale multiprocessor systems with lightweight threads.
Our experiments show that always-block ( L poll = 0) is a good waiting algorithm with performance that is usually close to the best of the algorithms compared. We show that even better performance can be achieved with a static choice of L poll based on knowledge of likely wait-time distributions. Motivated by the observation that different synchronization types exhibit different wait-time distributions, we prove that a static choice of L poll can yield close to optimal on-line performance against an adversary that is restricted to choosing wait times from a fixed family of probability distributions. This result allows us to make an optimal static choice of L poll based on synchronization type. For exponentially distributed wait times, we prove that setting L poll = 1n(e-1) B results in a waiting cost that is no more than e/(e-1) times the cost of an optimal off-line algorithm. For uniformly distributed wait times, we prove that setting L poll =1/2(square root of 5 -1) B results in a waiting cost that is no more than (square root of 5 + 1)/2 (the golden ratio) times the cost of an optimal off-line algorithm. Experimental measurements of several parallel applications on the Alewife multiprocessor simulator corroborate our theoretical findings.
Busy-wait techniques are heavily used for mutual exclusion and barrier synchronization in shared-memory parallel programs. Unfortunately, typical implementations of busy-waiting tend to produce large amounts of memory and interconnect contention, introducing performance bottlenecks that become markedly more pronounced as applications scale. We argue that this problem is not fundamental, and that one can in fact construct busy-wait synchronization algorithms that induce no memory or interconnect contention. The key to these algorithms is for every processor to spin on separate locally-accessible flag variables, and for some other processor to terminate the spin with a single remote write operation at an appropriate time. Flag variables may be locally-accessible as a result of coherent caching, or by virtue of allocation in the local portion of physically distributed shared memory.
We present a new scalable algorithm for spin locks that generates 0(1) remote references per lock acquisition, independent of the number of processors attempting to acquire the lock. Our algorithm provides reasonable latency in the absence of contention, requires only a constant amount of space per lock, and requires no hardware support other than a swap-with-memory instruction. We also present a new scalable barrier algorithm that generates 0(1) remote references per processor reaching the barrier, and observe that two previously-known barriers can likewise be cast in a form that spins only on locally-accessible flag variables. None of these barrier algorithms requires hardware support beyond the usual atomicity of memory reads and writes.
We compare the performance of our scalable algorithms with other software approaches to busy-wait synchronization on both a Sequent Symmetry and a BBN Butterfly. Our principal conclusion is that contention due to synchronization need not be a problem in large-scale shared-memory multiprocessors. The existence of scalable algorithms greatly weakens the case for costly special-purpose hardware support for synchronization, and provides a case against so-called “dance hall” architectures, in which shared memory locations are equally far from all processors. — From the Authors' Abstract
Heap allocation with copying garbage collection is a general storage management technique for modern programming languages. It is believed to have poor memory subsystem performance. To investigate this, we conducted an in-depth study of the memory subsystem performance of heap allocation for memory subsystems found on many machines. We studied the performance of mostly-functional Standard ML programs which made heavy use of heap allocation. We found that most machines support heap allocation poorly. However, with the appropriate memory subsystem organization, heap allocation can have good performance. The memory subsystem property crucial for achieving good performance was the ability to allocate and initialize a new object into the cache without a penalty. This can be achieved by having subblock placement with a subblock size of one word with a write allocate policy, along with fast page-mode writes or a write buffer. For caches with subblock placement, the data cache overhead was und...
This article introduces Smart Packets and describes the Smart Packets architecture, the packet formats, the language and its design goals, and security considerations. Smart Packets is an Active Networks project focusing on applying active networks technology to network management and monitoring. Messages in active networks are programs that are executed at nodes on the path to one or more target hosts. Smart Packets programs are written in a tightly encoded, safe language specifically designed to support network management and avoid dangerous constructs and accesses. Smart Packets improves the management of large complex networks by (1) moving management decision points closer to the node being managed, (2) targeting specific aspects of the node for information rather than exhaustive collection via polling, and (3) abstracting the management concepts to language constructs, allowing nimble network control.
This paper describes a new way to organizing network software that differs from conventional architectures in all three of these properties. In our approach, the protocol graph is complex, individual protocols encapsulate a single function, and the topology of the graph is dynamic. The main contribution of this paper is to describe the ideas behind our new architecture, illustrate the advantages of using the architecture, and demonstrate that the architecture results in efficient network software.
Distributed-memory systems have traditionally had great difficulty performing network I/O at rates proportional to their computational power. The problem is that the network interface has to support network I/O for a supercomputer, using computational and memory bandwidth resources similar to those of a workstation. As a result, the network interface becomes a bottleneck. In this paper we present an I/O architecture that addresses these problems and supports high-speed network I/O on distributed-memory systems. The key to good performance is to partition the work appropriately between the system and the network interface. Some communication tasks are performed on the distributed-memory parallel system since it is more powerful, and less likely to become a bottleneck than the network interface. Tasks that do not parallelize well are performed on the network interface and hardware support is provided for the most time-critical operations. This architecture has been implemented for the iWarp distributedmemory system and has been used by a number of applications. We describe this implementation, present performance results, and use application examples to validate the main features of the I/O architecture. To appear in ACM Transactions on Computer Systems. This research was supported by the Advanced Research Projects Agency/CSTO monitored by the Space and Naval Warfare Systems Command under contract N00039-93-C-0152. Peter Steenkiste can be reached at email@example.com. His mailing address is School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213. 1 1
this paper, we describe a new information management service called Astrolabe. Astrolabe monitors the dynamically changing state of a collection of distributed resources, reporting summaries of this information to its users. Like DNS, Astrolabe organizes the resources into a hierarchy of domains, which we call zones to avoid confusion, and associates attributes with each zone. Unlike DNS, zones are not bound to specific servers, the attributes may be highly dynamic, and updates propagate quickly; typically, in tens of seconds
We describe a design for security in a distributed system and its implementation. In our design, applications gain access to security services through a narrow interface. This interface provides a notion of identity that includes simple principals, groups, roles, and delegations. A new operating system component manages principals, credentials, and secure channels. It checks credentials according to the formal rules of a logic of authentication. Our implementation is efficient enough to support a substantial user community.
Enterprise-scale storage systems, which can contain hundreds of host computers and storage devices and up to tens of thousands of disks and logical volumes, are difficult to design. The volume of choices that need to be made is massive, and many choices have unforeseen interactions. Storage system design is tedious and complicated to do by hand, usually leading to solutions that are grossly over-provisioned, substantially under-performing or, in the worst case, both. To solve the configuration nightmare, we present MIN- ERVA: a suite of tools for designing storage systems automatically. MINERVA uses declarative specifications of application requirements and device capabilities; constraintbased formulations of the various sub-problems; and optimization techniques to explore the search space of possible solutions. This paper also explores and evaluates the design decisions that went into MINERVA, using specialized microand macro-benchmarks. We show that MINERVA can successfully handle a workload with substantial complexity (a decision-support database benchmark). MIN- ERVA created a 16-disk design in only a few minutes that achieved the same performance as a 30-disk system manually designed by human experts. Of equal importance, MINERVA was able to predict the resulting system's performance before it was built. Contact author, firstname.lastname@example.org. This work was done while all authors were working for Hewlett-Packard Laboratories, 1U13, 1501 Page Mill Rd., Palo Alto, CA 94304, USA 1
This article presents dynamic feedback, a technique that enables computations to adapt dynamically to different execution environments. A compiler that uses dynamic feedback produces several different versions of the same source code; each version uses a different optimization policy. The generated code alternately performs sampling phases and production phases. Each sampling phase measures the overhead of each version in the current environment. Each production phase uses the version with the least overhead in the previous sampling phase. The computation periodically resamples to adjust dynamically to changes in the environment.
As raw system and network performance continues to improve at exponential rates, the utility of many services is increasingly limited by availability rather than performance. A key approach to improving availability involves replicating the service across multiple, wide-area sites. However, replication introduces well-known tradeoffs between service consistency and availability. Thus, this paper explores the benefits of dynamically trading consistency for availability using a continuous consistency model. In this model, applications specify a maximum deviation from strong consistency on a per-replica basis. In this paper, we: i) evaluate availability of a prototype replication system running across the Internet as a function of consistency level, consistency protocol, and failure characteristics, ii) demonstrate that simple optimizations to existing consistency protocols result in significant availability improvements (more than an order of magnitude in some scenarios), iii) use our experience with these optimizations to prove tight upper bounds on the availability of services, and iv) show that maximizing availability typically entails remaining as close to strong consistency as possible during times of good connectivity, resulting in a communication versus availability tradeoff.
To provide high availability for services such as mail or bulletin boards, data must be replicated. One way to guarantee consistency of replicated data is to force service operations to occur in the same order at all sites, but this approach is expensive. For some applications a weaker causal operation order can preserve consistency while providing better performance. This paper describes a new way of implementing causal operations. Our technique also supports two other kinds of operations: operations that are totally ordered with respect to one another, and operations that are totally ordered with respect to all other operations. The method performs well in terms of response time, operation processing capacity, amount of stored state, and number and size of messages; it does better than replication methods based on reliable multicast techniques. This research was supported in part by the National Science Foundation under Grant CCR-8822158 and in part by the Advanced Research Projects ...
Wireless transmission of a bit can require over 1000 times more energy than a single 32-bit computation. It would therefore seem desirable to perform significant computation to reduce the number of bits transmitted. If the energy required to compress data is less than the energy required to send it, there is a net energy savings and consequently, a longer battery life for portable computers. This paper reports on the energy of lossless data compressors as measured on a StrongARM SA-110 system. We show that with several typical compression tools, there is a net energy increase when compression is applied before transmission. Reasons for this increase are explained, and hardwareaware programming optimizations are demonstrated. When applied to Unix compress, these optimizations improve energy efficiency by 51%. We also explore the fact that, for many usage models, compression and decompression need not be performed by the same algorithm. By choosing the lowest-energy compressor and decompressor on the test platform, rather than using default levels of compression, overall energy to send compressible web data can be reduced 31%. Energy to send harder-to-compress English text can be reduced 57%. Compared with a system using a single optimized application for both compression and decompression, the asymmetric scheme saves 11% or 12% of the total energy depending on the dataset.
Extensive caching is a key feature of the Echo distributed file system. Echo client machines maintain coherent caches of file and directory data and properties, with write-behind (delayed write-back) of all cached information. Echo specifies ordering constraints on this write-behind, enabling applications to store and maintain consistent data structures in the file system even when crashes or network faults prevent some writes from being completed. In this paper we describe the Echo cache's coherence and ordering semantics, show how they can improve the performance and consistency of applications, and explain how they are implemented. We also discuss the general problem of reliably notifying applications and users when write-behind is lost; we addressed this problem as part of the Echo design but did not find a fully satisfactory solution. Contents 1 Introduction 1 2 Design motivation 2 3 Coherence and ordering semantics 4 3.1 Ordering constraints : : : : : : : : : : : : : : : : : : ...
The Hurricane File System (HFS) is designed for (potentially large-scale) shared-memory multiprocessors. Its architecture is based on the principle that, in order to maximize performance for applications with diverse requirements, a file system must support a wide variety of file structures, file system policies, and I/O interfaces. Files in HFS are implemented using simple building blocks composed in potentially complex ways. This approach yields great flexibility, allowing an application to customize the structure and policies of a file to exactly meet its requirements. As an extreme example, HFS allows a file's structure to be optimized for concurrent random-access write-only operations by 10 threads, something no other file system can do. Similarly, the prefetching, locking, and file cache management policies can all be chosen to match an application's access pattern. In contrast, most parallel file systems support a single file structure and a small set of policies. We have implemented HFS as part of the Hurricane operating system running on the Hector shared-memory multiprocessor. We demonstrate that the flexibility of HFS comes with little processing or I/O overhead. We also show that for a number of file access patterns, HFS is able to deliver to the applications the full I/O bandwidth of the disks on our system. Categories and Subject Descriptors: D.4.3 [Operating Systems]: File Systems Management-access methods; file organization; D.4.8 [Operating Systems]: Performance-measurements; E.5 [Data]: Files-optimization; organization/structure.
This article presents a new and highly accurate method for branch prediction. The key idea is to use one of the simplest possible neural methods, the perceptron, as an alternative to the commonly used two-bit counters. The source of our predictor's accuracy is its ability to use long history lengths, because the hardware resources for our method scale linearly, rather than exponentially, with the history length. We describe two versions of perceptron predictors, and we evaluate these predictors with respect to five well-known predictors. We show that for a 4 KB hardware budget, a simple version of our method that uses a global history achieves a misprediction rate of 4.6% on the SPEC 2000 integer benchmarks, an improvement of 26% over gshare. We also introduce a global/local version of our predictor that is 14% more accurate than the McFarling-style hybrid predictor of the Alpha 21264. We show that for hardware budgets of up to 256 KB, this global/local perceptron predictor is more accurate than Evers' multicomponent predictor, so we conclude that ours is the most accurate dynamic predictor currently available. To explore the feasibility of our ideas, we provide a circuit-level design of the perceptron predictor and describe techniques that allow our complex predictor to operate quickly. Finally, we show how the relatively complex perceptron predictor can be used in modern CPUs by having it override a simpler, quicker Smith predictor, providing IPC improvements of 15.8% over gshare and 5.7% over the McFarling hybrid predictor.
This article presents the design, implementation, and evaluation of IO-Lite, a unified I/O buffering and caching system for general-purpose operating systems. IO-Lite unifies all buffering and caching in the system, to the extent permitted by the hardware. In particular, it allows applications, the interprocess communication system, the file system, the file cache, and the network subsystem to safely and concurrently share a single physical copy of the data. Protection and security are maintained through a combination of access control and read-only sharing. IO-Lite eliminates all copying and multiple buffering of I/O data, and enables various cross-subsystem optimizations. Experiments with a Web server show performance improvements between 40 and 80% on real workloads as a result of IO-Lite.
Compiler-directed cache prefetching has the potential to hide much of the high memory latency seen by current and future high-performance processors. However, prefetching is not without costs, particularly on a multiprocessor. Prefetching can negatively affect bus utilization, overall cache miss rates, memory latencies and data sharing. We simulate the effects of a compiler-directed prefetching algorithm, running on a range of bus-based multiprocessors. We show that, despite a high memory latency, this architecture does not necessarily support prefetching well, in some cases actually causing performance degradations. We pinpoint several problems with prefetching on a shared memory architecture (additional conflict misses, no reduction in the data sharing traffic and associated latencies, a multiprocessor's greater sensitivity to memory utilization and the sensitivity of the cache hit rate to prefetch distance) and measure their effect on performance. We then solve those problems throug...
This paper examines methods for reducing leakage power within the cache memories of the CPU. Because caches comprise much of a CPU chip's area and transistor counts, they are reasonable targets for attacking leakage. We discuss policies and implementations for reducing cache leakage by invalidating and urning o" cache lines when they hold data not likely to be reused. In particular, our approach is targeted at the generational nature of cache line usage. That is, cache lines typically have a urry of frequent use when rst brought into the cache, and then have a period of dead time" before they are evicted. By devising eective, low-power ways of deducing dead time, our results show that in many cases we can reduce L1 cache leakage energy by 4x in SPEC2000 applications without impacting performance. Because our decay-based techniques have notions of competitive on-line algorithms at their roots, their energy usage can be theoretically bounded at within a factor of two of the optimal oracle-based policy. We also examine adaptive decay-based policies that make energy-minimizing policy choices on a per-application basis by choosing appropriate decay intervals individually for each cache line. Our proposed adaptive policies eectively reduce L1 cache leakage energy by 5x for the SPEC2000 with only negligible degradations in performance
As the performance gap between disks and microprocessors continues to increase, effective utilization of the file cache becomes increasingly important. Application-controlled file caching and prefetching can apply application specific knowledge to improve file cache management. However, supporting application-controlled file caching and prefetching is nontrivial because caching and prefetching need to be integrated carefully, and the kernel needs to allocate cache blocks among processes appropriately. This paper presents the design, implementation and performance of a file system that integrates application-controlled caching, prefetching and disk scheduling. We use a two-level cache management strategy. The kernel uses the LRU-SP (Least-Recently-Used with Swapping and Placeholders) policy to allocate blocks to processes, and each process integrates application-specific caching and prefetching based on the controlledaggressive policy, an algorithm previously shown in a theoretical sen...
This paper presents the design, implementation, and measurement of a hint-based cooperative caching file system. Hints allow a decentralized approach to cooperative caching that provides performance comparable to that of existing tightly-coordinated algorithms such as N-chance and GMS, but incurs less overhead. Simulations show that the block access times of our system are as good as those of the existing algorithms, while reducing manager load by more than a factor of 15, block lookup traffic by nearly a factor of two-thirds, and replacement traffic by more than a factor of 5. We also implemented a hint-based cooperative caching file system and measured its performance with real users over the period of one week. Hint-based cooperative caching reduced the average block access time to almost half that of NFS. Moreover, our system exhibited reduced overhead and high hint accuracy as predicted by the simulations. 1 Introduction Caching is an essential element of distributed file systems...
: This paper presents the design and implementation of the recovery scheme in Calypso. Calypso is a cluster-optimized, distributed file system for UNIX clusters. As in Sprite and AFS, Calypso servers are stateful and scale well to a large number of clients. The recovery scheme in Calypso is non-disruptive, meaning that open files remain open, client modified data is saved, and in-flight operations are properly handled across server recovery. The scheme uses distributed state among the clients to reconstruct the server state on a backup node if disks are multi-ported or on the rebooted server node. It guarantees data consistency during recovery and provides congestion control. Measurements show that the state reconstruction can be quite fast: for example, in a 32-node cluster, when an average node contains state for about 420 files, the reconstruction time is about 3.3 seconds. However, the time to update a file system after a failure can be a major factor in the overall recovery time, ...
This paper presents Click, a flexible, modular software architecture for creating routers. Click routers are built from fine-grained components; this supports fine-grained extensions throughout the forwarding path. The components are packet processing modules called elements. The basic element interface is narrow, consisting mostly of functions for initialization and packet handoff, but elements can extend it to support other functions (such as reporting queue lengths). To build a router configuration, the user chooses a collection of elements and connects them into a directed graph. The graph's edges, which are called connections, represent possible paths for packet handoff. To extend a configuration, the user can write new elements or compose existing elements in new ways, much as UNIX allows one to build complex applications directly or by composing simpler ones using pipes
This article presents our observations demonstrating that operations on "narrow-width" quantities are common not only in multimedia codes, but also in more general workloads. In fact, across the SPECint95 benchmarks, over half the integer operation executions require 16 bits or less. Based on this data, we propose two hardware mechanisms that dynamically recognize and capitalize on these narrow-width operations. The first, power-oriented optimization reduces processor power consumption by using operand-value-based clock gating to turn off portions of arithmetic units that will be unused by narrow-width operations. This optimization results in a 45%-60% reduction in the integer unit's power consumption for the SPECint95 and MediaBench benchmark suites. Applying this optimization to SPECfp95 benchmarks results in slightly smaller power reductions, but still seems warranted. These reductions in integer unit power consumption equate to a 5%--10% full-chip power savings. Our second, performance-oriented optimization improves processor performance by packing together narrow-width operations so that they share a single arithmetic unit. Conceptually similar to a dynamic form of MMX, this optimization offers speedups of 4.3%--6.2% for SPECint95 and 8.0%--10.4% for MediaBench. Overall, these optimizations highlight an increasing opportunity for value-based optimizations to improve both power and performance in current microprocessors
As an exercise in synchronization without mutual exclusion, algorithms are developed to implement both a monotonic and a cyclic multiple-word clock that is updated by one process and read by one or more other processes. Capsule Review It is convenient for an operating system to maintain the system clock in shared memory, so it can be read directly by user processes, without a system call. But doing this is tricky if the clock has more than one word of precision, because the system may update the clock while a user process is partway through reading it. This paper presents a simple algorithm for maintaining the clock in shared memory that requires no locking or retries. Theorists will find the algorithm and its correctness proof interesting, while practitioners will find the algorithm useful and easy to implement. Tim Mann v Contents 1 Introduction 1 2 Notation and Theorems 2 3 A Monotonic Clock 3 4 A Cyclic Clock 6 References 7 vi 1 Introduction In an asynchronous multiprocess s...
Despite decades of research in extensible operating system technology, extensions such as device drivers remain a significant cause of system failures. In Windows XP, for example, drivers account for 85 % of recently reported failures. This paper describes Nooks, a reliability subsystem that seeks to greatly enhance OS reliability by isolating the OS from driver failures. The Nooks approach is practical: rather than guaranteeing complete fault tolerance through a new (and incompatible) OS or driver architecture, our goal is to prevent the vast majority of driver-caused crashes with little or no change to existing driver and system code. To achieve this, Nooks isolates drivers within lightweight protection domains inside the kernel address space, where hardware and software prevent them from corrupting the kernel. Nooks
ing with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works, requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept, ACM Inc., 1515 Broadway, New York, NY 10036 USA, fax +1 (212) 869-0481, or email@example.com. 2 Delta E. Bugnion, S. Devine, K. Govil, and M. Rosenblum 1. INTRODUCTION Scalable computers have moved from the research lab to the marketplace. Multiple vendors are now shipping scalable systems with configurations in the tens or even hundreds of processors. Unfortunately, the system software for these machines has often trailed hardware in reaching the functionality and reliability expected by modern computer users. Operating systems developers shoulder much of the blame for the inability to deliver on the promises of these machines. Extensive modifications to the operating system are required to efficiently support scalable ...
Communication-oriented abstractions such as atomic multicast,group RPC, and protocols for locationindependent mobile computing can simplify the development of complex applications built on distributed systems. This paper describes Coyote, a system that supports the construction of highly modular and configurable versions of such abstractions. Coyote extends the notion of protocol objects and hierarchical composition found in existing systems with support for finer-grain objects called micro-protocols that implement individual semantic properties of the target service. A customized service is constructed by selecting micro-protocols based on their semantic guarantees and configuring them together with a standard runtime system to form a composite protocol implementing the service. Micro-protocols within a composite protocol can share data and are executed using an event-driven paradigm that enhances configurability. The overall approach is described and illustrated with examples of serv...
This paper introduces and analyzes the effect of Seance Communication in a multiprocessing environment. Seance communication is an unnecessary coherency related activity that is associated with dead cache datum. Dead information may reside in the cache for various reasons: task migration, context switches or working-set changes. First, we present an analytical model to evaluate the overhead of this phenomenon and show that it may severely reduce overall system performance, since the system is unable to detect and flush cache information as soon as it is not needed (dead information). We show that the seance related overhead may affect the performance of both write-update and write-invalidate protocols. Second, we present various simulation results and compare the impact of this phenomenon on update-based and invalidate-based cache coherency protocols. These results indicate that update-based protocols are more affected by seance communication than invalidate-based protocols. The model ...
ion Psync is based on a conversation abstraction that provides a shared message space through which a collection of processes exchange messages. The general form of this message space is defined by a directed acyclic graph that preserves the partial order of the exchanged messages. For the purpose of this section, we view a conversation as an abstract data type that is implemented in shared memory; Section 3 gives an algorithm for implementing a conversation in an unreliable network. A conversation behaves much like any connection-oriented IPC abstraction: A well-defined set of processes---called participants---explicitly open a conversation, exchange messages through it, and close the conversation. Only processes that have been identified as participants may exchange message through the conversation, and this set is fixed for the duration of the conversation. Processes begin a conversation with the operations: conv = active open(participant set) conv = passive open(pid) The first...
this paper, provides safe and efficient communication between address spaces on the same machine without kernel mediation. URPC isolates from one other the three components of interprocess communication: processor reallocation, thread management, and data transfer. Control transfer between address spaces, which is the communication abstraction presented to the programmer, is implemented through a combination of thread management and processor reallocation. Only processor reallocation requires kernel volvement; thread management and data transfer do not. Thread management and interprocess communication are done by application~level libraries, rather than by the kernel
Distributed shared memory (DSM) is an abstraction of shared memory on a distributed memory machine. Hardware DSM systems support this abstraction at the architecture level; software DSM systems support the abstraction within the runtime system. One of the key problems in building an efficient software DSM system is to reduce the amount of communication needed to keep the distributed memories consistent. In this paper we present four techniques for doing so: (1) software release consistency; (2) multiple consistency protocols; (3) write-shared protocols; and (4) an update-with-timeout mechanism. These techniques have been implemented in the Munin DSM system. We compare the performance of seven Munin application programs, first to their performance when implemented using message passing, and then to their performance when running on a conventional software DSM system that does not embody the above techniques. On a 16-processor cluster of workstations, Munin's performance is wi...