ABSTRACT: We present an RNA deep sequencing (RNAseq) analysis comparing the transcriptome responses of zebrafish larvae to infection with Staphylococcus epidermidis and Mycobacterium marinum bacteria. We show how our GeneTiles software can improve RNAseq analysis approaches by more confidently identifying a large set of markers upon infection with these bacteria. Software programs such as Bowtie2 and Samtools are currently indispensable for the analysis of RNAseq data. However, these programs are designed for a Linux environment, require some dedicated programming skills, and have no options for visualisation of the resulting mapped sequence reads. Especially with large data sets, this makes the analysis time consuming and difficult for non-expert users. We have applied the GeneTiles software to the analysis of previously published and newly obtained RNAseq datasets of our zebrafish infection model, and we have shown the applicability of this approach to published RNAseq datasets of other organisms by comparing our data with a published mammalian infection study. In addition, we have implemented the DEXSeq module in the GeneTiles software to identify genes, such as glucagon A, that are differentially spliced under infection conditions. In the analysis of our RNAseq data, this has made it possible to increase the size of the data sets that can be compared efficiently without using problem-dedicated programs, leading to quick identification of marker sets. Therefore, this approach will also be highly useful for transcriptome analyses of other organisms for which well-characterised genomes are available.
Electronic supplementary material
The online version of this article (doi:10.1007/s00251-014-0820-3) contains supplementary material, which is available to authorized users.
ABSTRACT: In this paper, we present DetLock, a runtime system to ensure deterministic execution of multithreaded programs running on multicore systems. DetLock does not rely on any hardware support or kernel modification to ensure determinism. Logical clocks are used to track the progress of the threads. Unlike previous approaches, which rely on non-portable hardware to update the logical clocks, DetLock employs a compiler pass to insert code for updating these clocks, thus increasing portability. For 4 cores, the average overhead of these clocks on tested benchmarks is brought down from 16 % to 2 % by applying several optimizations. Moreover, the average overall overhead, including deterministic execution, is 14 %.
ABSTRACT: In this paper, we introduce an automated interconnect design strategy that creates an efficient custom interconnect for kernels in an FPGA-based accelerator system in order to accelerate their communication. Our custom interconnect includes an NoC, a shared local memory solution, or both. Based on quantitative communication profiling of the application, the interconnect is built using our proposed custom interconnect design algorithm and adaptive mapping function. Experimental results show that our system achieves an overall application speed-up of 3.72× compared to software and of 2.87× compared to the baseline system, a conventional FPGA bus-based accelerator system. Moreover, our proposed system achieves a 66.5% energy reduction due to the reduced execution time.
ABSTRACT: This paper surveys state-of-the-art low-power techniques for both single-core and multicore systems. Based on our proposed power management model for multicore systems, we present a classification of total power reduction techniques covering both leakage and active power. According to this classification, three main classes are discussed: power optimization techniques within the cores, techniques for the interconnect, and techniques applicable to the whole multicore system. This paper describes several techniques from these classes along with a comparison. For the whole multicore system, we focus on adaptive voltage scaling and propose a comprehensive taxonomy of adaptive voltage scaling techniques that takes process variations into account.
ABSTRACT: Parallel systems were for a long time confined to high-performance computing. However, with the increasing popularity of multicore processors, parallelization has also become important for other computing domains, such as desktops and embedded systems. Mission-critical embedded software, like that used in the avionics and automotive industries, also needs to guarantee real-time behavior. For that purpose, tools are needed to calculate the worst-case execution time (WCET) of tasks running on a processor, so that the real-time system can make sure that real-time guarantees are met. However, due to the shared resources present in a multicore system, this task is much more difficult than finding the WCET for a single-core processor. In this paper, we discuss how recent research has tried to solve this problem and what the open research problems are.
ABSTRACT: This paper describes a software-based fault tolerance approach for multithreaded programs running on multicore processors. Redundant multithreaded processes are used to detect soft errors and recover from them. Our scheme makes sure that the execution of the redundant processes is identical even in the presence of non-determinism due to shared memory accesses. This is done by making sure that the redundant processes acquire the locks for accessing the shared memory in the same order. Instead of using a record/replay technique to do that, our scheme is based on deterministic multithreading, meaning that for the same input, a multithreaded program always has the same lock interleaving. Unlike record/replay systems, this eliminates the requirement for communication between the redundant processes. Moreover, our scheme is implemented entirely in software and requires no special hardware, making it very portable. Furthermore, our scheme is implemented entirely at user level and requires no modification of the kernel. For selected benchmarks, our scheme adds an average overhead of 49% for 4 threads.
ABSTRACT: This paper proposes a heterogeneous hardware accelerator architecture to support streaming image processing. Each image in a data set is pre-processed on a host processor and sent to hardware kernels. The host processor and the hardware kernels process a stream of images in parallel. The Convey hybrid computing system is used to develop our proposed architecture. We use the Canny edge detection algorithm as our case study. The data set used for our experiment contains 7200 images. Experimental results show that the system with the proposed architecture achieves a speed-up of 2.13× for the kernels and of 2.40× for the whole application with respect to a software implementation running on the host processor. Moreover, our proposed system achieves a 55% energy reduction compared to a hardware accelerator system without streaming support.
ABSTRACT: The communication infrastructure is one of the important components of a multicore system, along with the computing cores and memories. A good interconnect design plays a key role in improving the performance of such systems. In this paper, we introduce a hybrid communication infrastructure that uses both the standard bus and our area-efficient and delay-optimized network on chip for heterogeneous multicore systems, especially hardware accelerator systems. An adaptive data communication-based mapping for reconfigurable hardware accelerators is proposed to obtain a low-overhead, low-latency interconnect. Experimental results show that the proposed communication infrastructure and the adaptive data communication-based mapping achieve a speed-up of 2.4× with respect to a similar system using only a bus as interconnect. Moreover, our proposed system reduces energy consumption by 56% compared to the original system.
ABSTRACT: This paper describes a low-overhead software-based fault tolerance approach for shared memory multicore systems. The scheme is implemented at user-space level and requires almost no changes to the original application. Redundant multithreaded processes are used to detect soft errors and recover from them. Our scheme makes sure that the execution of the redundant processes is identical even in the presence of non-determinism due to shared memory accesses, and it provides a very low-overhead mechanism to achieve this. Moreover, it implements a fast error detection and recovery mechanism. The overhead incurred by our approach ranges from 0% to 18% for selected benchmarks. This is lower than that of comparable systems published in the literature.
ABSTRACT: In this paper, we present an overview of interconnect solutions for hardware accelerator systems. A number of solutions are presented: bus-based, DMA, crossbar, NoC, as well as combinations of these. The paper proposes analytical models to predict the performance of these solutions and implements the solutions in practice. A JPEG decoder application is implemented as our case study in different scenarios using the presented interconnect solutions. We profile the application to extract the input data for our analytical model. Measurement results show that the NoC solution combined with a bus-based system provides the best performance, as predicted by the analytical models. The NoC solution achieves a speed-up of up to 2.4× compared to the bus-based system while consuming the least amount of energy. However, the NoC has the highest resource usage, with an overhead of up to 20.7%.
ABSTRACT: Multicore processing, especially heterogeneous multicore, is being increasingly used for data-intensive processing in embedded systems. An important challenge in multicore processing is to get the data efficiently to the computing core that needs it. In order to have an efficient interconnect design for multicore architectures, a detailed profiling of data communication patterns is necessary. In this work, we propose a heuristic-based approach to design an application-specific custom interconnect using quantitative data communication profiling information. The ultimate goal is to obtain, automatically, the most optimized custom interconnect design, taking the runtime communication pattern into account. Experimental results show that the speed-up of the hardware accelerators in comparison with software is up to 7.8×, and 2.98× in comparison with the system without our interconnect approach.
ABSTRACT: Multicore architectures, especially hardware accelerator systems with heterogeneous processing elements, are being increasingly used due to the growing processing demand of modern digital systems. However, data communication in multicore architectures is one of the main performance bottlenecks. Therefore, reducing data communication overhead is an important method to improve the speed-up of such systems. In this paper, we propose a heuristic-based approach to address the data communication bottleneck. The proposed approach uses detailed quantitative data communication profiling to automatically generate interconnect designs that are relatively simple, low-overhead and low-area solutions. Experimental results show that we can gain a speed-up of 3.05× for the whole application and up to 7.8× for accelerator functions in comparison with software.
ABSTRACT: Multi-core processing technology is one of the best ways of achieving high performance computing without driving up heat and power consumption. In addition, reconfigurable systems are gaining popularity because they combine performance and flexibility. These systems allow us to have software tasks running on a General Purpose Processor (GPP) along with hardware tasks running on a reconfigurable fabric, such as an FPGA. An important part of parallel processing in multi-core reconfigurable systems is to allocate tasks to processors to achieve the best performance. The objectives of task scheduling algorithms are to maximize system throughput by assigning a task to a proper processor, maximize resource utilization, and minimize execution time. Task execution on such platforms is managed by a scheduler that can assign tasks either to the GPPs or to the reconfigurable fabric. In this paper, we compare and evaluate different scheduling policies, which have been classified into descriptive categories. The various task scheduling algorithms are discussed from different aspects, such as task dependency, static or dynamic policies, and heterogeneity of processors.
ABSTRACT: The ever-decreasing transistor size has made it possible to integrate multiple cores on a single die. On the downside, this has introduced reliability concerns, as smaller transistors are more prone to both transient and permanent faults. However, the abundant extra processing resources of a multicore system can be exploited to provide fault tolerance by using redundant execution. We have designed a library for multicore processing that can make a multithreaded user-level application fault tolerant with simple modifications to the code. It uses the abundant cores found in the system to perform redundant execution for error detection. In addition, it allows recovery through checkpoint/rollback. Our library is portable since it does not depend on any special hardware. Furthermore, the overhead that our library adds to the original application (up to 46% for 4 threads) is less than that of other existing approaches, such as Respec.
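The combination of redundant execution for detection and checkpoint/rollback for recovery described in the abstract above can be illustrated with a small sketch. This is not the library's API, which the abstract does not describe. The names `run_with_redundancy` and `accumulate`, the `fault_at` parameter, and the choice to checkpoint after every step are all assumptions made for illustration. The two copies run sequentially here, whereas the abstract describes using spare cores.

```python
import copy


def run_with_redundancy(step, state, n_steps, fault_at=None):
    """Run every step twice and compare the results (hypothetical helper).

    A mismatch is treated as a detected soft error: both results are
    discarded and the step is re-executed from the last checkpoint.
    """
    checkpoint = copy.deepcopy(state)  # last state known to be good
    rollbacks = 0
    i = 0
    while i < n_steps:
        primary = step(copy.deepcopy(checkpoint), i)
        replica = step(copy.deepcopy(checkpoint), i)
        if i == fault_at:
            # Emulate a single bit flip in the replica (demo only; assumes
            # the state is a dict with an integer "acc" entry).
            replica["acc"] ^= 1
            fault_at = None
        if primary != replica:
            rollbacks += 1  # roll back: retry step i from the checkpoint
            continue
        checkpoint = primary  # results agree, so commit a new checkpoint
        i += 1
    return checkpoint, rollbacks


def accumulate(state, i):
    """Toy workload: add the step index to a running total."""
    state["acc"] += i
    return state


final_state, rollbacks = run_with_redundancy(accumulate, {"acc": 0}, 5, fault_at=2)
print(final_state, rollbacks)
```

Checkpointing after every step keeps the sketch short; a real system would checkpoint less often to limit overhead.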
ABSTRACT: Multicore systems are not only hard to program but also hard to test, debug and maintain. This is because the traditional way of accessing shared memory in multithreaded applications is to use lock-based synchronization, which is inherently non-deterministic and can cause a multithreaded application to have many different possible execution paths for the same input. However, this problem can be avoided by forcing a multithreaded application to have the same lock acquisition order for the same input. In this paper, we present DetLock, which is able to run multithreaded programs deterministically without relying on any hardware support or kernel modification. The code that updates the logical clocks used for deterministic execution is inserted by the compiler. For 4 cores, the average overhead of these clocks on tested benchmarks is brought down from 20% to 8% by applying several optimizations. Moreover, the overall overhead, including deterministic execution, is comparable to that of state-of-the-art systems such as Kendo, even surpassing it for some applications, while providing more portability.
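To illustrate how logical clocks can fix the lock acquisition order, the following sketch simulates one common rule: a waiting thread may take the lock only when its logical clock is the smallest, with ties broken by thread ID. The abstract does not spell out DetLock's exact rule, so this rule, the function name `deterministic_lock_order`, and the input format are assumptions. The clock increments stand in for the updates that compiler-inserted code would perform.

```python
def deterministic_lock_order(work_per_thread):
    """Simulate the lock acquisition order under a minimum-clock rule.

    work_per_thread[t] lists the logical-clock increments (for example,
    instruction counts) that thread t accumulates before each of its lock
    acquisitions. The order depends only on these counts, never on
    wall-clock timing, so it is the same on every run.
    """
    n = len(work_per_thread)
    clocks = [0] * n
    next_idx = [0] * n
    waiting = set()
    for t, work in enumerate(work_per_thread):
        if work:
            clocks[t] += work[0]  # advance to the first lock request
            waiting.add(t)

    order = []
    while waiting:
        # The thread with the smallest (clock, id) pair gets the lock.
        t = min(waiting, key=lambda tid: (clocks[tid], tid))
        order.append(t)
        clocks[t] += 1  # acquiring the lock also advances the clock
        next_idx[t] += 1
        if next_idx[t] < len(work_per_thread[t]):
            clocks[t] += work_per_thread[t][next_idx[t]]
        else:
            waiting.discard(t)  # no further lock requests from this thread
    return order


print(deterministic_lock_order([[5, 5], [3, 10], [5]]))
```

In this example, thread 1 reaches its first lock request with the smallest clock and goes first, and the tie between threads 0 and 2 at clock 5 is resolved in favour of the lower ID.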
ABSTRACT: Task scheduling algorithms in distributed and parallel systems play a vital role in providing better-performing platforms for multiprocessor networks. A large number of policies that determine the structure of task scheduling algorithms have been explored so far. These policies have significant value for optimizing system efficiency. The objectives of all these approaches are maximizing system throughput by assigning each task to a suitable processor, maximizing resource utilization, and minimizing execution time. In this essay, different types of algorithms for parallel and distributed systems are classified by reviewing earlier surveys. Then, various task scheduling algorithms are discussed from different points of view, such as dependency among tasks, static vs. dynamic approaches, and heterogeneity of processors. Precedence-order approaches such as list heuristics are studied. Duplication-based algorithms, clustering heuristics and nature-inspired scheduling methods such as GA (Genetic Algorithm) are the other kinds of algorithms covered in this study.
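As a concrete instance of the list heuristics mentioned in the abstract above, the sketch below orders tasks by their bottom level (the longest path to an exit task) and places each task on the processor where it can start earliest. It is one textbook variant rather than a specific algorithm from the survey. It assumes identical processors, positive task durations and zero communication cost, and the names `list_schedule`, `durations` and `deps` are hypothetical.

```python
from functools import lru_cache


def list_schedule(durations, deps, n_procs):
    """Static list scheduling of a task graph on identical processors.

    durations: {task: positive duration}
    deps:      {task: list of predecessor tasks}
    Returns ({task: (processor, start_time)}, makespan).
    """
    succs = {t: [] for t in durations}
    for task, preds in deps.items():
        for p in preds:
            succs[p].append(task)

    @lru_cache(maxsize=None)
    def bottom_level(t):
        # Length of the longest path from t to an exit task, including t.
        return durations[t] + max((bottom_level(s) for s in succs[t]), default=0)

    # With positive durations, decreasing bottom level is a topological order.
    order = sorted(durations, key=lambda t: (-bottom_level(t), t))

    proc_free = [0] * n_procs
    finish = {}
    placement = {}
    for t in order:
        ready = max((finish[p] for p in deps.get(t, ())), default=0)
        # Choose the processor that allows the earliest start time.
        proc = min(range(n_procs), key=lambda p: (max(proc_free[p], ready), p))
        start = max(proc_free[proc], ready)
        finish[t] = start + durations[t]
        proc_free[proc] = finish[t]
        placement[t] = (proc, start)
    return placement, max(finish.values())


durations = {"a": 2, "b": 3, "c": 1, "d": 2}
deps = {"c": ["a"], "d": ["a", "b"]}
print(list_schedule(durations, deps, n_procs=2))
```

Because the priority list is built before any task is placed, this is a static policy in the abstract's terms; a dynamic policy would re-evaluate priorities as tasks complete.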
ABSTRACT: Due to rapid and continuous technology scaling, faults in semiconductor memories (and ICs in general) are becoming pervasive and weak rather than strong. A weak fault is a fault that escapes the test program because it does not cause an error or system failure. However, multiple weak faults may cause an error during the application. Components with weak faults that fail at board and system level are sent to suppliers, only to be returned as No Trouble Found (NTF). This is because the conventional memory test approach assumes the presence of a single defect at a time causing a strong fault (hence an error), and is therefore unable to deal with weak faults. This paper presents a new memory test approach that is able to detect weak faults; it is based on assuming the presence of multiple weak faults at a time in a memory system rather than a single strong fault at a time. Being able to detect weak faults reduces the number of escapes, and hence also the number of NTFs. The experimental analysis, done using SPICE simulation for a case study, shows that when two simultaneous weak faults are assumed, the missing (defect) coverage can be reduced by 10% compared with the conventional approach.
ABSTRACT: The Smith-Waterman (S-W) algorithm is an optimal sequence alignment method and is widely used for genetic databases. This paper presents a Graphics Processing Unit (GPU) accelerated S-W implementation for protein sequence alignment. The paper proposes a new sequence database organization and several optimizations to reduce the number of memory accesses. The new implementation achieves a performance of 21.4 GCUPS on an NVIDIA GTX 275 graphics card, which is 1.13 times better than the state-of-the-art implementation.
Article · Aug 2011 · Conference proceedings: ... Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Conference
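For readers unfamiliar with the Smith-Waterman algorithm named in the abstract above, the following is a minimal CPU sketch of its scoring recurrence. It is not the GPU implementation the paper presents. The simple match/mismatch scores and the linear gap penalty are assumptions chosen for brevity, since protein alignment normally uses a substitution matrix and affine gap penalties. The helper `gcups` shows how a cell-updates-per-second figure is derived, and the numbers passed to it are hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment can restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def gcups(query_len, db_residues, seconds):
    """Giga cell updates per second: one cell per query/database residue pair."""
    return query_len * db_residues / seconds / 1e9


print(smith_waterman_score("ACGT", "TTACGTTT"))
print(gcups(query_len=500, db_residues=2_000_000, seconds=0.5))
```

Every cell depends only on its left, upper and upper-left neighbours, so alignments of one query against different database sequences are independent of each other, which is what makes the problem well suited to GPUs.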
ABSTRACT: The Smith-Waterman (S-W) algorithm is an optimal sequence alignment method for biological databases, but its computational complexity makes it too slow for practical purposes. Heuristics-based approximate methods like FASTA and BLAST provide faster solutions, but at the cost of reduced accuracy. Also, the expanding volume and varying lengths of sequences necessitate a performance-efficient restructuring of these databases. Thus, to obtain a solution that is both accurate and fast, it is highly desirable to speed up the S-W algorithm.
This paper presents a high-performance protein sequence alignment implementation for Graphics Processing Units (GPUs). The new implementation improves performance by optimizing the database organization and reducing the number of memory accesses to eliminate bandwidth bottlenecks. The implementation is called Database Optimized Protein Alignment (DOPA) and it achieves a performance of 21.4 Giga Cell Updates Per Second (GCUPS), which is 1.13 times better than the fastest GPU implementation to date.
In the new GPU-based implementation for protein sequence alignment (DOPA), the database is organized in sets of equal-length sequences. This distributes the workload equally among all the threads on the GPU's multiprocessors. The result is a performance that is better than that of the fastest available GPU implementation.
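The effect of grouping sequences of similar length can be illustrated with a small sketch. The abstract does not say how DOPA builds its sets, so sorting by length and cutting the sorted list into fixed-size batches is an assumption, as are the names `chunk`, `batch_by_length` and `idle_cells`. The model assumes one thread per sequence and that every thread in a batch waits for the longest sequence in that batch.

```python
def chunk(seqs, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [seqs[i:i + batch_size] for i in range(0, len(seqs), batch_size)]


def batch_by_length(seqs, batch_size):
    """Sort by length first so each batch holds sequences of similar length."""
    return chunk(sorted(seqs, key=len), batch_size)


def idle_cells(batches):
    """Residue slots wasted while threads wait for the longest batch member."""
    return sum(max(map(len, b)) * len(b) - sum(map(len, b)) for b in batches)


# Toy database: six sequences with lengths 10, 2, 9, 3, 8 and 1.
database = ["A" * n for n in (10, 2, 9, 3, 8, 1)]
print(idle_cells(chunk(database, 2)))            # original order
print(idle_cells(batch_by_length(database, 2)))  # length-sorted batches
```

When all sequences in a set have exactly the same length, as the abstract describes, the idle count for that set is zero.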