[Show abstract][Hide abstract] ABSTRACT: In this paper, we present DetLock, a runtime system to ensure deterministic execution of multithreaded programs running on multicore systems. DetLock does not rely on any hardware support or kernel modification to ensure determinism. For tracking the progress of the threads, logical clocks are used. Unlike previous approaches, which rely on non-portable hardware to update the logical clocks, DetLock employs a compiler pass to insert code for updating these clocks, thus increasing portability. For 4 cores, the average overhead of these clocks on tested benchmarks is brought down from 16 to 2 % by applying several optimizations. Moreover, the average overall overhead, including deterministic execution, is 14 %.
[Show abstract][Hide abstract] ABSTRACT: This paper describes a software based fault tolerance approach for multithreaded programs running on multicore processors. Redundant multithreaded processes are used to detect soft errors and recover from them. Our scheme makes sure that the execution of the redundant processes is identical even in the presence of non-determinism due to shared memory accesses. This is done by making sure that the redundant processes acquire the locks for accessing the shared memory in the same order. Instead of using record/replay technique to do that, our scheme is based on deterministic multithreading, meaning that for the same input, a multithreaded program always have the same lock interleaving. Unlike record/replay systems, this eliminates the requirement for communication between the redundant processes. Moreover, our scheme is implemented totally in software, requiring no special hardware, making it very portable. Furthermore, our scheme is totally implemented at user-level, requiring no modification of the kernel. For selected benchmarks, our scheme adds an average overhead of 49% for 4 threads.
[Show abstract][Hide abstract] ABSTRACT: Parallel systems were for a long time confined to high-performance computing. However, with the increasing popularity of multicore processors, parallelization has also become important for other computing domains, such as desktops and embedded systems. Mission-critical embedded software, like that used in avionics and automotive industry, also needs to guarantee real time behavior. For that purpose, tools are needed to calculate the worst-case execution time (WCET) of tasks running on a processor, so that the real time system can make sure that real time guarantees are met. However, due to the shared resources present in a multicore system, this task is made much more difficult as compared to finding WCET for a single core processor. In this paper, we will discuss how recent research has tried to solve this problem and what the open research problems are.
[Show abstract][Hide abstract] ABSTRACT: In this paper, we present an overview of interconnect solutions for hardware accelerator systems. A number of solutions are presented: bus-based, DMA, crossbar, NoC, as well as combinations of these. The paper proposes analytical models to predict the performance of these solutions and implements them in practice. The jpeg decoder application is implemented as our case study in different scenarios using the presented interconnect solutions. We profile the application to extract the input data for our analytical model. Measurement results show that the NoC solution combined with a bus-based system provides the best performance as predicted by the analytical models. The NoC solution achieves a speed-up of up to 2.4× compared to the bus-based system, while consuming the least amount of energy. However, the NoC has the highest resource usage of up to 20.7% overhead.
Adaptive Hardware and Systems (AHS), 2013 NASA/ESA Conference on; 01/2013
[Show abstract][Hide abstract] ABSTRACT: This paper proposes a heterogeneous hardware accelerator architecture to support streaming image processing. Each image in a data-set is pre-processed on a host processor and sent to hardware kernels. The host processor and the hardware kernels process a stream of images in parallel. The Convey hybrid computing system is used to develop our proposed architecture. We use the Canny edge detection algorithm as our case study. The data-set used for our experiment contains 7200 images. Experimental results show that the system with the proposed architecture achieved a speed-up of the kernels by 2.13× and of the whole application by 2.40× with respect to a software implementation running on the host processor. Moreover, our proposed system achieves 55% energy reduction compared to a hardware accelerator system without streaming support.
Advanced Technologies for Communications (ATC), 2013 International Conference on; 01/2013
[Show abstract][Hide abstract] ABSTRACT: The communication infrastructure is one of the important components of a multicore system along with the computing cores and memories. A good interconnect design plays a key role in improving the performance of such systems. In this paper, we introduce a hybrid communication infrastructure using both the standard bus and our area-efficient and delay-optimized network on chip for heterogeneous multicore systems, especially hardware accelerator systems. An adaptive data communication-based mapping for reconfigurable hardware accelerators is proposed to obtain a low overhead and latency interconnect. Experimental results show that the proposed communication infrastructure and the adaptive data communication-based mapping achieves a speed-up of 2.4× with respect to a similar system using only a bus as interconnect. Moreover, our proposed system achieves a reduction of energy consumption of 56% compared to the original system.
Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013; 01/2013
[Show abstract][Hide abstract] ABSTRACT: This paper describes a low overhead software-based fault tolerance approach for shared memory multicore systems. The scheme is implemented at user-space level and requires almost no changes to the original application. Redundant multithreaded processes are used to detect soft errors and recover from them. Our scheme makes sure that the execution of the redundant processes is identical even in the presence of non-determinism due to shared memory accesses. It provides a very low overhead mechanism to achieve this. Moreover it implements a fast error detection and recovery mechanism. The overhead incurred by our approach ranges from 0% to 18% for selected benchmarks. This is lower than comparable systems published in literature.
Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013; 01/2013
[Show abstract][Hide abstract] ABSTRACT: Multi-core processing technology is one of the best way of achieving high performance computing without driving up heat and power consumption. In addition, reconfigurable systems are gaining popularity due to the fact that they combine performance and flexibility. These systems allow us to have software tasks running on a General Purpose Processor (GPP) along with hardware task running on a reconfigurable fabric, such as FPGA. An important part of parallel processing in multi-core reconfigurable systems is to allocate tasks to processors to achieve the best perfor-mance. The objectives of task scheduling algorithms are to maximize system throughput by as-signing a task to a proper processor, maximize resource utilization, and minimize execution time. Task execution on such platforms is managed by a scheduler that can assign tasks either to the GPPs or to the reconfigurable fabric. In this paper, we compare and evaluate different scheduling policies which have been classified into descriptive categories. The various task scheduling algo-rithms are discussed from different aspects, such as task dependency, static or dynamic policies, and heterogeneity of processors.
European Network of Excellence on High Performance and Embedded Architecture and Compilation; 07/2012
[Show abstract][Hide abstract] ABSTRACT: The ever decreasing transistor size has made it possible to integrate multiple cores on a single die. On the downside, this has introduced reliability concerns as smaller transistors are more prone to both transient and permanent faults. However, the abundant extra processing resources of a multicore system can be exploited to provide fault tolerance by using redundant execution. We have designed a library for multicore processing, that can make a multithreaded user-level application fault tolerant by simple modifications to the code. It uses the abundant cores found in the system to perform redundant execution for error detection. Besides that, it also allows recovery through checkpoint/rollback. Our library is portable since it does not depend on any special hardware. Furthermore, the overhead (up to 46% for 4 threads), our library adds to the original application, is less than other existing approaches, such as Respec.
[Show abstract][Hide abstract] ABSTRACT: Multicore systems are not only hard to program but also hard to test, debug and maintain. This is because the traditional way of accessing shared memory in multithreaded applications is to use lock-based synchronization, which is inherently non-deterministic and can cause a multithreaded application to have many different possible execution paths for the same input. This problem can be avoided however by forcing a multithreaded application to have the same lock acquisition order for the same input. In this paper, we present DetLock, which is able to run multithreaded programs deterministically without relying on any hardware support or kernel modification. The logical clocks used for performing deterministic execution are inserted by the compiler. For 4 cores, the average overhead of these clocks on tested benchmarks is brought down from 20% to 8% by applying several optimizations. Moreover, the overall overhead, including deterministic execution, is comparable to state of the art systems such as Kendo, even surpassing it for some applications, while providing more portability.
High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion:; 01/2012
[Show abstract][Hide abstract] ABSTRACT: Multicore architectures, especially hardware accelerator systems with heterogeneous processing elements, are being increasingly used due to the increasing processing demand of modern digital systems. However, data communication in multicore architectures is one of the main performance bottle-necks. Therefore, reducing data communication overhead is an important method to improve the speed-up of such systems. In this paper, we propose a heuristic-based approach to address the data communication bottleneck. The proposed approach uses a detailed quantitative data communication profiling to generate interconnect designs automatically that are relatively simple, low overhead and low area solutions. Experimental results show that we can gain speed-up of 3.05× for the whole application and up to 7.8× speed-up for accelerator functions in comparison with software.
Field-Programmable Technology (FPT), 2012 International Conference on; 01/2012
[Show abstract][Hide abstract] ABSTRACT: Multicore processing, especially heterogeneous multicore, is being increasingly used for data intensive processing in embedded systems. An important challenge in multicore processing is, efficiently, to get the data to the computing core that needs it. In order to have an efficient interconnect design for multicore architectures, a detailed profiling of data communication patterns is necessary. In this work, we propose a heuristic-based approach to design an application-specific custom interconnect using quantitative data communication profiling information. The ultimate goal is, automatically, to have the most optimized custom interconnect design taking runtime communication pattern into account. Experimental results show that the hardware accelerators speed-up achieved in comparison with software is up to 7.8×, which is 2.98× in comparison with the system without using our interconnect approach.
Reconfigurable Computing and FPGAs (ReConFig), 2012 International Conference on; 01/2012
[Show abstract][Hide abstract] ABSTRACT: Task scheduling algorithms in distributed and parallel sys-tems play a vital role to provide better performance platforms for mul-tiprocessor networks. A large number of policies, which can determine best structures of task scheduling algorithms, have been explored so far. These policies have significant value for optimizing system efficiency. The objective of all these approaches are maximizing system throughput with assigning a task to a suitable processor, maximizing resource utilization, and minimizing execution time. In this essay, there are various types of different algorithms for parallel and distributed systems that have been classified by reviewing former surveys. Then, various task scheduling al-gorithms are discussed from different points of view such as dependency among tasks, static vs. dynamic approaches, and heterogeneity of proces-sors. Precedence orders like list heuristics have been studied. Duplication based algorithms, clustering heuristics and scheduling methods inspired by nature's laws like GA (Genetic Algorithm) are other kind of algorithm approaches of this study.
22th Annual Workshop on Circuits, Systems and Signal Processing (ProRISC 2011); 11/2011
[Show abstract][Hide abstract] ABSTRACT: Smith-Waterman (S-W) algorithm is an optimal sequence alignment method and is widely used for genetic databases. This paper presents a Graphics Processing Units (CPUs) accelerated S-W implementation for protein sequence alignment. The paper proposes a new sequence database organization and several optimizations to reduce the number of memory accesses. The new implementation achieves a performance of 21.4 GCUPS, which is 1.13 times better than the state-of-the-art implementation on an NVIDIA GTX 275 graphics card.
Conference proceedings: ... Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Conference 08/2011; 2011:2442-6.
[Show abstract][Hide abstract] ABSTRACT: Smith-Waterman (S-W) algorithm is an optimal sequence alignment method for biological databases, but its computational complexity makes it too slow for practical purposes. Heuristics based approximate methods like FASTA and BLAST provide faster solutions but at the cost of reduced accuracy. Also, the expanding volume and varying lengths of sequences necessitate performance efficient restructuring of these databases. Thus to come up with an accurate and fast solution, it is highly desired to speed up the S-W algorithm.
This paper presents a high performance protein sequence alignment implementation for Graphics Processing Units (GPUs). The new implementation improves performance by optimizing the database organization and reducing the number of memory accesses to eliminate bandwidth bottlenecks. The implementation is called Database Optimized Protein Alignment (DOPA) and it achieves a performance of 21.4 Giga Cell Updates Per Second (GCUPS), which is 1.13 times better than the fastest GPU implementation to date.
In the new GPU-based implementation for protein sequence alignment (DOPA), the database is organized in equal length sequence sets. This equally distributes the workload among all the threads on the GPU's multiprocessors. The result is an improved performance which is better than the fastest available GPU implementation.
[Show abstract][Hide abstract] ABSTRACT: The presence of parasitic node capacitance on a defective resistive node can induce dynamic changes in the electrical behavior of the circuit in SRAM devices, which may be referred to as the parasitic memory effect. This effect can cause dynamic faults in SRAMs. This paper presents an analysis of the parasitic memory effect in SRAMs on the defective resistive node. The paper demonstrates that the faulty behavior in SRAMs is exacerbated in the presence of parasitic node capacitance, something that reduces the fault coverage of current memory tests, and increases the defect-per-million rates.
Design and Test Workshop (IDT), 2010 5th International; 01/2011
[Show abstract][Hide abstract] ABSTRACT: Sequence alignment is an essential, but compute-intensive application in Bioinformatics. Hardware implementation speeds up this application by exploiting its inherent parallelism, where the performance of the hardware depends on its capability to align long sequences. In hardware terms, the length of a biological query sequence that can be aligned against a database sequence depends on the number of Processing Elements (PEs) available, which in turn depends on the amount of available hardware resources. In addition, the amount of available bandwidth to transfer the data processed by these PEs plays a significant role in defining the maximum performance. In this paper, we carry out a detailed performance and bandwidth analysis for biological sequence alignment and formulate theoretical performance boundaries for various cases. Further, we optimize the performance gain and memory bandwidth requirements and develop generalized equations for this optimization.
Design and Test Workshop (IDT), 2010 5th International; 01/2011
[Show abstract][Hide abstract] ABSTRACT: With the advent of modern nano-scale technology, it has become possible to implement multiple processing cores on a single die. The shrinking transistor sizes however have made reliability a concern for such systems as smaller transistors are more prone to permanent as well as transient faults. To reduce the probability of failures of such systems, online fault tolerance techniques can be applied. These techniques need to be efficient as they execute concurrently with applications running on such systems. This paper discusses the challenges involved in online fault tolerance and existing work which tackles these challenges. We classify fault tolerance into four different steps which are proactive fault management, error detection, fault diagnosis and recovery and discuss related work for each step, with focus on techniques for shared memory multicore/multiprocessor systems. We also highlight the additional difficulties in tolerating faults for parallel execution on shared memory multicore/multiprocessor systems.
[Show abstract][Hide abstract] ABSTRACT: I. INTRODUCTION Memory test optimization can significantly reduce test complexity, while retaining the quality of the test. In the presence of parasitic BL coupling, faults may only be detected by writing all possible coupling backgrounds (CBs) in the neighboring cells of the victim , . However, using all possible CBs while testing for every fault consumes enormous test time, which can be significantly reduced, for the same fault coverage, if only limited required CBs are identified for each functional fault model (FFM). So far, no systematic approach has been proposed that identifies such limited required CBs, nor corresponding optimized memory tests generated that apply limited CBs . Therefore, this paper presents a systematic approach to identify such limited CBs, and thereafter presents an optimized test, March BLC, which detects all static memory faults in the presence of BL coupling using only required CBs.
16th European Test Symposium (ETS 2011), May 23-27, 2011, Trondheim, Norway; 01/2011