Conference Paper

Characterizing parallel workloads to reduce multiple writer overhead in shared virtual memory systems

Departamento de Informatica de Sistemas y Computadores, Univ. Politecnica de Valencia
DOI: 10.1109/EMPDP.2002.994285 Conference: Parallel, Distributed and Network-based Processing, 2002. Proceedings. 10th Euromicro Workshop on
Source: IEEE Xplore


Shared virtual memory (SVM) systems, because of their software
implementation, enable shared-memory programming at a low design and
maintenance cost. Nevertheless, although implementations are becoming
faster, their performance still falls far short of that achieved by
hardware distributed shared memory (DSM) systems. Current SVM systems
use relaxed memory consistency models and multiple-writer protocols to
reduce latencies and false sharing, respectively. However, these
techniques induce additional overhead that degrades system performance.
We performed a study of workload behaviour aimed at improving the design
of SVM protocols, focusing on identifying the shared-data access
patterns that appear in sections protected by semaphores. Most coherence
actions in SVM systems result from the write operations executed inside
critical sections, so we pay special attention to the writes performed
when multiple writers are allowed. As these writes may exhibit spatial
locality, we also study the write patterns on shared pages with similar
behaviour. Different software filters are applied to the instrumented
parallel workloads to capture and classify the most common sharing
patterns. This enables the recognition of patterns for which coherence
overhead can be reduced by modifying the coherence actions performed by
the protocol. Although the performance evaluation of new coherence
solutions is not our main goal, the ideas presented to improve the
behaviour of SVM systems can be implemented at a reasonable
hardware/software cost.
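The multiple-writer protocols discussed in the abstract commonly rely on a twin/diff mechanism: before the first local write to a shared page, the protocol saves a copy (the "twin"), and at a synchronization point it encodes only the modified words as a diff to be exchanged with other writers. The sketch below is our own minimal illustration of that general technique (as popularized by TreadMarks-style systems), not the protocol studied in this paper; all names and the tiny page size are assumptions for clarity.

```python
# Illustrative twin/diff sketch for a multiple-writer SVM protocol.
# Not the paper's implementation: a generic example of the technique.

PAGE_WORDS = 8  # tiny page for illustration; real pages are e.g. 4 KiB

def make_twin(page):
    """Copy the page before the first local write (the 'twin')."""
    return list(page)

def make_diff(twin, page):
    """Encode modified words as (offset, new_value) pairs."""
    return [(i, page[i]) for i in range(len(page)) if page[i] != twin[i]]

def apply_diff(page, diff):
    """Merge another writer's diff into a local copy of the page."""
    for offset, value in diff:
        page[offset] = value

# Two nodes write disjoint words of the same shared page (false sharing):
page_node_a = [0] * PAGE_WORDS
page_node_b = [0] * PAGE_WORDS
twin_a = make_twin(page_node_a)
twin_b = make_twin(page_node_b)

page_node_a[1] = 11   # node A writes word 1 inside its critical section
page_node_b[6] = 66   # node B writes word 6 inside its critical section

diff_a = make_diff(twin_a, page_node_a)   # [(1, 11)]
diff_b = make_diff(twin_b, page_node_b)   # [(6, 66)]

# At the release/acquire, each node merges the other's diff:
apply_diff(page_node_a, diff_b)
apply_diff(page_node_b, diff_a)
assert page_node_a == page_node_b == [0, 11, 0, 0, 0, 0, 66, 0]
```

Creating twins and computing diffs is exactly the overhead the paper targets: if the protocol can recognize a sharing pattern in advance (e.g. migratory data touched by one writer at a time), some of these per-page actions can be skipped.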


Available from: Julio Sahuquillo
  • Source
    • "The data vortex topology is designed to be implemented in high-speed optics, and even with the possibility of a cluster of processors at each node, a 100% workload for the network is very unlikely. Common parallel computing algorithms, including benchmarks like SPLASH-2, generate infrequent shared memory accesses [14], [15], and the data vortex can handle a vast amount of traffic due to its virtual buffering provided by angles and deflection routing, so a 20% load should be more than enough to realistically exercise the system for study. Because of the immense data capacity of the network, however, for many comparisons (especially with single-angle injection), an 80% load or greater is chosen to sufficiently stress the systems being studied (to illustrate the best design under a near worst case scenario). "
    ABSTRACT: Reducing communication latency in multiprocessor interconnection networks can increase system performance on a broad range of applications. The data vortex photonic network reduces message latency by utilizing all-optical end-to-end transparent links and deflection routing. Cylinders replace node storage for buffering messages. The cylinder circumference (measured as number of angles) has a significant impact on the message acceptance rate and average message latency. A new symmetric mode of usage for the data vortex is discussed in which a fraction of the angles is used for input/output (I/O), and the remainder is used for "virtual buffering" of messages. For single-angle injection, six total angles provide the best performance. Likewise, the same ratio of 5:1 purely routing nodes versus I/O nodes is shown to produce greater than 99% acceptance, under normal loading conditions for all other network sizes studied. It is shown that for a given network I/O size, a shorter height and wider circumference data vortex organization provides acceptable latency with fewer total nodes than a taller but narrower data vortex. The performance versus system cost is discussed and evaluated, and the 5:1 noninjection-to-injection angle ratio is shown to be cost effective when constructing a system in current optical technology.
    Full-text · Article · Oct 2006 · Journal of Lightwave Technology
  • Source
    ABSTRACT: Today's supercomputers employ the fastest processors incorporating the latest VLSI technology. Unfortunately, usable system performance is often limited by excessive interprocessor latency. To overcome this bottleneck, this thesis explores the use of all-optical path interconnection networks using a new topology defined by Coke Reed [31]. This work overcomes limitations of previous optical networks through a novel use of deflection routing to minimize latency and allow more processors to collaborate on the same application and dataset. In this thesis research, the data vortex is formally characterized and tested for performance. Extra angles serve as "virtual buffers" to provide required system performance, even under asymmetric mode operation. The data vortex is compared to two well-known interconnection networks (omega and butterfly) using metrics of average latency and message acceptance rate. The data vortex is shown to outperform the comparison networks, with a 20-50% higher acceptance rate and comparable average latency. The impact of angle size is also studied, and a new, synchronous mode of operation is proposed where additional angles are added to increase the virtual buffering of the network. The tradeoff between virtual buffering and angle resolution backpressure is explored, and an optimal point is found at the 1:6 I/O to non-I/O (virtual buffering) angle ratio. The new mode and optimal angle count are used to form data vortex networks that perform as well as larger networks with fewer total nodes. Finally, hierarchical layering with data vortex clusters is proposed and compared to a single-level data vortex. In today's technology, similar performance is attained at high network communication locality loads (> 2/3), and a 19% latency reduction is obtained at the highest locality loads (> 95%) for current optical switching technology.
For projected future technology, the clustered system is shown to yield up to a 55% reduction in latency for applications with 2/3 or better locality. Dr. Henry L. Owen III, Committee Member; Dr. David Keezer, Committee Member; Dr. D. Scott Wills, Committee Chair. Thesis (Ph.D.)--Electrical and Computer Engineering, Georgia Institute of Technology, 2007.
    Full-text · Article ·
  • Source
    ABSTRACT: The definition of the data vortex architecture leaves broad room for decisions regarding the exact design point required for achieving a desired performance level. A detailed simulation-based study of various parameters that affect a data vortex interconnection network's performance is reported. Three implementations are compared by acceptance rate, latency, and cost.
    Full-text · Article · Apr 2007 · Journal of Optical Networking
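The deflection routing that the abstracts above rely on can be sketched in a few lines: a bufferless node must forward every incoming message each cycle, so when two messages contend for the same preferred output, the loser is "deflected" onto a remaining free port rather than stored. The sketch below is our own simplified single-node illustration (the port names and tie-breaking rule are assumptions), not the data vortex topology itself.

```python
# Minimal single-node deflection routing sketch (our illustration,
# not the data vortex): contention losers take a free port instead
# of being buffered; messages left over would be rejected, which is
# what the acceptance-rate metric in the abstracts measures.

def route(messages, outputs):
    """messages: list of (msg_id, preferred_output).
    outputs: list of output port names.
    Returns {port: msg_id}; losers of port contention are deflected
    to any remaining free port, in arrival order."""
    assignment = {}
    deflected = []
    for msg_id, preferred in messages:
        if preferred not in assignment:
            assignment[preferred] = msg_id   # got its preferred port
        else:
            deflected.append(msg_id)         # port taken: must deflect
    free_ports = [p for p in outputs if p not in assignment]
    for msg_id, port in zip(deflected, free_ports):
        assignment[port] = msg_id            # deflection path
    return assignment

# Two messages both prefer "east"; the second is deflected to "south".
print(route([("m1", "east"), ("m2", "east")], ["east", "south"]))
# {'east': 'm1', 'south': 'm2'}
```

A deflected message takes a longer path instead of waiting, which is why the data vortex papers trade extra "virtual buffering" angles (more alternative paths) against latency and acceptance rate.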