ABSTRACT: Improvements in parallel computing hardware usually involve increases in the resources available to a given application,
such as the number of computing cores and the amount of memory. In the case of shared-memory computers, the increase in computing
resources and available memory is usually constrained by the coherency protocol, whose overhead rises with system size, limiting
the scalability of the final system. In this paper we propose an efficient and cost-effective way to increase the memory available
for a given application by leveraging free memory in other computers in the cluster.
Our proposal is based on the observation that many applications benefit from having more memory resources but do not require
more computing cores, thus reducing cache-coherency requirements and allowing a simpler implementation with better scalability.
Simulation results show that, when additional mechanisms intended to hide remote memory latency are used, the execution time of
applications that use our proposal is similar to the time required to execute them in a computer populated with enough local
memory, thus validating the feasibility of our proposal. We are currently building a prototype that implements our ideas.
The first results from real executions in this prototype demonstrate not only that our proposal works but also that it can
efficiently execute applications that make use of remote memory resources.
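The abstract does not detail which latency-hiding mechanisms are used. As a generic illustration of one such technique, the C sketch below shows software prefetching over a long-latency memory region so that fetches overlap with computation; it uses the GCC/Clang builtin __builtin_prefetch and is not taken from the paper.

```c
/* Illustrative only: software prefetching as one generic way to hide the
 * latency of a long-latency (e.g. remote) memory region. Not the paper's
 * mechanism; __builtin_prefetch is a GCC/Clang builtin. */
#include <stddef.h>

double sum_region(const double *region, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)                                /* stay in bounds     */
            __builtin_prefetch(&region[i + 16], 0, 0); /* request data early */
        acc += region[i];                              /* compute meanwhile  */
    }
    return acc;
}
```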
ABSTRACT: Evolution in high performance computing (HPC) leads to increasing demands on bandwidth, connectivity and flexibility. Active optical cables (AOC) are of special interest, combining the benefits of electrical connectors and optical transmission. Optimization and development of AOC solutions requires overcoming several technology barriers. The area and volume occupied by connectors are of special interest within HPC networks. This led to the development of a 12x AOC for the mini-HT connector, creating the densest AOC available. In order to integrate electrical-optical conversion into a module no higher than 3 mm, a new concept for coupling fibers to VCSELs or photodiodes had to be developed. This unique concept is based on a direct replication process of an integrated fiber coupler consisting of a 90° light-deflecting and focusing mirror, a fiber guiding structure, and a fiber funnel. The integrated fiber coupler is replicated directly on top of the active components, reducing the distance between active components and fibers to a minimum and thus providing highly efficient light coupling. As an AOC prototype, multi-chip modules (MCM) including the complete electrical-to-optical conversion for the send and receive paths, connected by two 12x fiber ribbons, have been developed. The paper presents the integrated fiber coupling technique as well as design and measurement data of the prototype.
ABSTRACT: We have developed a new memory architecture for clusters that allows automatic access from any processor to any memory module in the cluster entirely in hardware. Thus, with a single assembly instruction a processor can retrieve (or update) a memory location in a remote node. The efficiency of this new paradigm makes it possible to speed up the execution of shared-memory applications with very large memory footprints by running them across the entire cluster, thus providing them a true shared-memory environment (contrary to the emulation typically carried out by software-based distributed shared memory). This new memory architecture, referred to as MEMSCALE, opens up a new frontier for memory-hungry applications. In this paper we focus on in-memory databases and show how this target application can be boosted by our memory architecture, which can provide virtually unlimited memory resources to it. In the demo presented in this paper we show the advantages of our architecture by means of a prototype cluster. We configure two cluster sizes, 16 and 32 nodes, to analyze throughput scalability and latency degradation, to extrapolate these metrics to larger clusters, and to show the benefits of our technology compared to alternatives such as SSD-based databases. Moreover, we also show the ease of use of our architecture by explaining how we ported MySQL Server to our prototype cluster. Finally, the possibility of executing queries on any processor of the cluster during the live demo will show the audience how our system combines the advantages of the scale-out and scale-up approaches to database server growth.
Proceedings of the 20th ACM Conference on Information and Knowledge Management, CIKM 2011, Glasgow, United Kingdom, October 24-28, 2011; 01/2011
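To picture the programming model this abstract describes, where a single load or store instruction reaches remote memory, the C sketch below assumes a hypothetical mapping call; memscale_map is a placeholder, not the real MEMSCALE API, which the abstract does not name.

```c
/* Sketch of the access model only: once remote memory is mapped into the
 * local virtual address space by the hardware, ordinary loads and stores
 * reach the remote node. memscale_map() is a hypothetical placeholder. */
#include <stdint.h>
#include <stddef.h>

extern void *memscale_map(int remote_node, size_t bytes); /* hypothetical */

uint64_t touch_remote(int node)
{
    volatile uint64_t *mem = memscale_map(node, 1u << 20);
    mem[0] = 42;   /* compiles to a plain store; updates remote memory */
    return mem[1]; /* compiles to a plain load; reads remote memory    */
}
```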
ABSTRACT: Although large scale high performance computing today typically relies on message passing, shared memory can offer significant advantages, as the overhead associated with MPI is completely avoided. To this end, we have developed an FPGA-based Shared Memory Engine that forwards memory transactions, such as loads and stores, to remote memory locations in large clusters, thus providing a single memory address space. As coherency protocols do not scale with system size, we completely avoid global coherency across the cluster. However, we maintain local coherency domains, thus keeping the cores within one node coherent. In this paper, we show the suitability of our approach by analyzing the performance of barriers, a very common synchronization primitive in parallel programs. Experiments on a real cluster prototype show that our approach allows synchronization among 1024 cores spread over 64 nodes in less than 15 µs, several times faster than other highly optimized barriers. We demonstrate the feasibility of this approach by executing a shared-memory implementation of FFT. Finally, note that this barrier can also be leveraged by MPI applications running on our shared-memory architecture for clusters, ensuring the usefulness of this work for applications already written.
18th International Conference on High Performance Computing, HiPC 2011, Bengaluru, India, December 18-21, 2011; 01/2011
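For context on the synchronization primitive being measured, the C11 sketch below shows a classic sense-reversing barrier over shared memory. This is the kind of software baseline such hardware-assisted barriers are compared against, not the paper's FPGA-assisted implementation.

```c
/* A classic sense-reversing barrier over shared memory (C11 atomics).
 * Shown as the software baseline; NOT the paper's FPGA-assisted barrier. */
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int  count;  /* threads still to arrive in this episode */
    atomic_bool sense;  /* flips once per completed barrier        */
    int         total;  /* number of participants                  */
} barrier_t;

void barrier_wait(barrier_t *b, bool *local_sense)
{
    *local_sense = !*local_sense;              /* this thread's new sense  */
    if (atomic_fetch_sub(&b->count, 1) == 1) { /* last arrival             */
        atomic_store(&b->count, b->total);     /* reset for the next round */
        atomic_store(&b->sense, *local_sense); /* release all waiters      */
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;                                  /* spin until released      */
    }
}
```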
ABSTRACT: Improvements in hardware for parallel shared-memory computing usually involve increments in the number of computing cores and in the amount of memory available to a given application. However, many shared-memory applications do not require more computing cores than available on current motherboards because their scalability is limited to a few tens of parallel threads. Nevertheless, they may still benefit from having more memory resources. Additionally, the performance of extended systems involving more cores is typically constrained by the gluing coherency protocol, whose overhead lowers the performance of the final system. In this paper we present a 32-node prototype of a new non-coherent distributed-memory architecture for clusters, aimed at providing applications with additional memory borrowed from other nodes without giving them more cores, thus avoiding the penalty of maintaining coherency among the nodes of the cluster. Results from the execution of real applications on this prototype demonstrate that our proposal works, and its performance is assessed.
13th IEEE International Conference on High Performance Computing & Communication, HPCC 2011, Banff, Alberta, Canada, September 2-4, 2011; 01/2011
ABSTRACT: The use of Field Programmable Gate Arrays (FPGAs) in the area of High Performance Computing (HPC) to accelerate computations is well known. We present here a case where FPGAs can be used to speed up communication instead of computation. Current interconnects for HPC in particular lack support for fine-grain communication, which is increasingly found in various applications. In order to overcome this situation we developed a novel custom network. Built solely from FPGAs, it can easily be reconfigured to custom needs. The main drawback of FPGAs is their limited performance, which is about one to two orders of magnitude lower than that of commercial (specialized) solutions. However, an architecture optimized for small packet sizes results in a performance superior even to commercial high performance solutions. This excellent communication performance is verified by results from several popular benchmarks. In summary, we present a case where FPGAs can be used to accelerate communication and outperform commercial interconnection networks for HPC.
Ninth International Conference on Networks (ICN), 2010; 05/2010
ABSTRACT: Current commercial solutions intended to provide additional resources to an application being executed in a cluster usually aggregate processors and memory from different nodes. In this paper we present a 16-node prototype of a shared-memory cluster architecture that follows a different approach by decoupling the amount of memory available to an application from the processing resources assigned to it. In this way, we provide a new degree of freedom so that the memory granted to a process can be expanded with memory from other nodes in the cluster without increasing the number of processors used by the program. This feature is especially suitable for memory-hungry applications that demand large amounts of memory but present a parallelization level that prevents them from using more cores than available in a single node. The main advantage of this approach is that an application can use more memory from other nodes without involving the processors, and caches, of those nodes. As a result, using more memory no longer implies increasing the coherence protocol overhead, because the number of caches involved in the coherent domain has become independent of the amount of available memory. The prototype we present in this paper leverages this idea by sharing 128 GB of memory across the cluster. Real executions show the feasibility of our prototype and its scalability.
Proceedings of the 2010 IEEE International Conference on Cluster Computing, Heraklion, Crete, Greece, 20-24 September, 2010; 01/2010
ABSTRACT: This paper presents a novel stateless, virtualized communication engine for sub-microsecond latency. Using a Field-Programmable Gate Array (FPGA) based prototype we show a latency of 970 ns between two machines with our Virtualized Engine for Low Overhead (VELO). The FPGA device is directly connected to the CPUs by a HyperTransport link. The described hardware architecture is optimized for small messages and avoids the overhead typically found with Direct Memory Access (DMA) controlled transfers. The stateless approach allows the hardware unit to be used directly by many threads and processes simultaneously. It provides secure user-level communication with an extremely optimized start-up phase. Microbenchmark results are reported for both a proprietary API and OpenMPI.
37th International Conference on Parallel Processing (ICPP '08), 2008; 10/2008
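The abstract's key point, avoiding DMA descriptor overhead for small messages, can be illustrated with a programmed-I/O send into a memory-mapped device window. All names in the C sketch below (send_window, the header layout) are hypothetical, not the VELO API.

```c
/* Illustration of programmed-I/O sends for small messages: the payload is
 * written directly into a memory-mapped device window instead of handing a
 * descriptor to a DMA engine. All names are hypothetical, not the VELO API. */
#include <stdint.h>
#include <string.h>

extern volatile uint64_t *send_window; /* hypothetical mapped device window */

int pio_send(uint16_t dest, const void *payload, size_t len)
{
    uint64_t words[8] = {0};                 /* up to 64 bytes of payload */
    if (len > sizeof words)
        return -1;                           /* too large for the PIO path */
    memcpy(words, payload, len);
    send_window[0] = ((uint64_t)dest << 16) | len; /* header: route + size */
    for (size_t i = 0; i < (len + 7) / 8; i++)
        send_window[1 + i] = words[i];       /* store payload to the NIC */
    return 0;
}
```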
ABSTRACT: SWORDFISH (Simple Wormhole Routing and Fault Injection on Simulated Hardware) is a simulator to explore the design space of high-performance networks. The simulator features high modularity and is configurable by plug-ins. The simulated network is scalable to a large number of nodes, and its modules are parameterizable to model timing, delays and buffer sizes. A simple and universal method to generate communication patterns is based on the single program multiple data (SPMD) programming model. Various topologies, from distributed to centralized switches, with an extensible variety of routing and arbitration functions can be generated easily. Extensive possibilities to collect both performance and statistical data give the network designer a large set of data to validate design decisions. Accuracy is proven by comparing the simulation results to real performance measurements of an existing network.
International Conference on Parallel and Distributed Computing Systems, PDCS 2005, November 14-16, 2005, Phoenix, AZ, USA; 01/2005
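The SPMD-based pattern generation mentioned in the abstract can be pictured as every simulated node running the same routine, parameterized by its rank. In the C sketch below, sim_send is a hypothetical stand-in for the simulator's plug-in interface, which the abstract does not specify.

```c
/* Picture of SPMD traffic generation: every simulated node runs the same
 * routine, parameterized by its rank. sim_send() is a hypothetical stand-in
 * for the simulator's plug-in interface. */
extern void sim_send(int src, int dst, int bytes); /* hypothetical hook */

void ring_pattern(int rank, int nodes, int bytes)
{
    sim_send(rank, (rank + 1) % nodes, bytes);         /* right neighbor */
    sim_send(rank, (rank + nodes - 1) % nodes, bytes); /* left neighbor  */
}
```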
ABSTRACT: The ATOLL System Area Network (SAN) is a low latency interconnect that integrates all required components of a communication system for cluster computing into a single chip. The chip integrates not only four communication devices but also a high performance crossbar. The low latency and the scalability of the ATOLL chip implementation make it an interesting technology in the area of cluster computing. The following paper presents performance results of basic point-to-point benchmarks together with an evaluation of its use as a cluster interconnect.
Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Networks, part of the 23rd Multi-Conference on Applied Informatics, Innsbruck, Austria, February 15-17, 2005; 01/2005
ABSTRACT: Flexible, high performance networks-on-chip (NoC) are of paramount importance for the upcoming multi-core era relying on extremely parallel computing. The NoC frameworks that currently exist target Systems-on-Chip (SoC) applications and are not perfectly suited for Application Specific Integrated Circuit (ASIC) and Field Programmable Gate Array (FPGA) implementations. Such implementations, which include e.g. multicore graphics processing units or coprocessors, show different characteristics compared to SoCs and demand a novel solution. They are much more latency sensitive and require a tight integration of the switch into the processing pipeline while maintaining flexibility and generality. Furthermore, the requirements regarding topology and protocol differ significantly from application to application. The High Throughput Advanced X-Bar (HTAX) is a novel NoC framework that particularly targets integrated ASIC and FPGA implementations. This paper presents a modular and protocol-agnostic framework and a switch development environment which enables the one-click generation of arbitrary switches. The developed tool is used to generate a wide range of switch implementations, which are evaluated regarding performance and resource consumption.
ABSTRACT: FPGAs as reconfigurable devices play an important role in both rapid prototyping and high performance reconfigurable computing. Usually, FPGA vendors help users with pre-designed cores, for instance for various communication protocols. However, this is only true for widely used protocols. In the use case described here, the target application may benefit from a tight integration of the FPGA into a computing system. Typical commodity protocols like PCI Express may not fulfill these demands. HyperTransport (HT), on the other hand, allows a direct connection to a processor interface without intermediate bridges or protocol conversion. As a result, communication costs between the FPGA unit and both the processor and main memory are minimal. In this paper we present an HT3 interface for Stratix IV based FPGAs, which allows for minimal latencies and high bandwidths between processor and device and between main memory and device. Designs targeting an HT connection can now be prototyped in real-world systems. Furthermore, this design can be leveraged for acceleration tasks, with the minimal communication costs allowing fine-grain work deployment and the use of cost-efficient main memory instead of size-limited and costly on-device memory.
ABSTRACT: Cluster computing is still the most cost-effective solution to meet the increasing demand for computing power. Clusters are typically based on commodity computing hardware with specialized interconnection networks (IN). These cluster interconnects differ from commodity networks in their higher bandwidth, lower latency, lower CPU utilization and improved scalability. But even with these sophisticated INs, the latency of a message transfer between two nodes is still orders of magnitude higher than a local memory access. Especially for fine-grain communication, the latency of a message transfer is crucial. An analysis of the latency shows that its main component originates from the I/O system. The goal of this paper is to present a new mechanism called Ultra Low Latency Message Transfer (ULTRA), which allows message passing with the lowest latencies possible. Besides the use of well-known techniques like user-level communication, this work focuses on improving the network interface through an optimized and highly efficient use of the I/O system. The ULTRA mechanism and architecture presented here represent a thoroughly optimized approach to low latencies, limited only by the standard I/O system used. With it, a much closer coupling of the cluster nodes is possible, and fine-grain communication schemes become more suitable for cluster computing.
ABSTRACT: HyperTransport technology as a chip-to-chip and board-to-board interconnect is a well established standard used in various computing systems. New is the standardization of an expansion slot with a direct HyperTransport connection, called HTX. The opportunity offered by such a direct I/O interface led to the development of a rapid prototyping station with an HTX connector, pluggable into any HTX-equipped system. The architecture and physical design of this HTX device are presented. Besides the HTX connection, the most remarkable parts of the architecture are an FPGA as the main component and an array of six high speed serial transceivers. The transceivers are intended for building custom direct interconnection networks. Besides its use for rapid prototyping and the development of custom interconnects, possible applications are co-processing and CPU offloading. The physical board design is presented along with its most challenging problems, such as a cost-efficient stack-up, the power distribution system, and signal integrity for high speed signals. The HyperTransport Consortium has already adopted this development as a Reference Design in its portfolio. To our knowledge, no other comparable devices are available to date.