ABSTRACT: Accurate simulation is essential for the proper design and evaluation of any computing platform. With the current move toward the CPU-GPU heterogeneous computing era, researchers need a simulation framework that can model both kinds of computing devices and their interaction. In this paper, we present Multi2Sim, an open-source, modular, and fully configurable toolset that enables ISA-level simulation of an x86 CPU and an AMD Evergreen GPU. Focusing on a model of the AMD Radeon 5870 GPU, we address program emulation correctness, as well as architectural simulation accuracy, using AMD's OpenCL benchmark suite. Simulation capabilities are demonstrated with a preliminary architectural exploration study and workload characterization examples. The project source code, benchmark packages, and a detailed user's guide are publicly available at www.multi2sim.org.
ABSTRACT: The introduction of General-Purpose computation on GPUs (GPGPUs) has changed the landscape for the future of parallel computing. At the core of this phenomenon are massively multithreaded, data-parallel architectures possessing impressive acceleration ratings, offering low-cost supercomputing together with attractive power budgets. Even given the numerous benefits provided by GPGPUs, there remain a number of barriers that delay wider adoption of these architectures. One major issue is the heterogeneous and distributed nature of the memory subsystem commonly found on data-parallel architectures. Application acceleration is highly dependent on being able to utilize the memory subsystem effectively so that all execution units remain busy. In this paper, we present techniques for enhancing the memory efficiency of applications on data-parallel architectures, based on the analysis and characterization of memory access patterns in loop bodies; we target vectorization via data transformation to benefit vector-based architectures (e.g., AMD GPUs) and algorithmic memory selection for scalar-based architectures (e.g., NVIDIA GPUs). We demonstrate the effectiveness of our proposed methods with kernels from a wide range of benchmark suites. For the benchmark kernels studied, we achieve consistent and significant performance improvements (up to 11.4× and 13.5× over baseline GPU implementations on each platform, respectively) by applying our proposed methodology.
IEEE Transactions on Parallel and Distributed Systems, 02/2011. Impact Factor: 1.80
ABSTRACT: Digital Breast Tomosynthesis (DBT) is a technology that mitigates many of the shortcomings associated with traditional mammography. Using multiple low-dose x-ray projections with an iterative maximum likelihood estimation method, DBT is able to create a high-quality, three-dimensional reconstruction of the breast. However, the tenability of DBT depends largely on the potential for decreasing the execution time to be acceptable within a clinical setting. In this work we accelerate our DBT algorithm on the latest generation of NVIDIA's CUDA-enabled GPUs, reducing the execution time to under 20 seconds for eight iterations (the amount usually required to obtain a clean reconstruction). Moreover, with the execution time substantially decreased, a large number of additional benefits can be achieved, such as using redundant computations to prevent inaccuracies or artifacts that can be introduced from transient faults or other memory errors during execution. We also supply the high-level algorithms and thread-mapping strategy (for both the CPU and GPUs) for creating a multiple-GPU version of the algorithm, and discuss how the choices play to the strengths of the GPU architecture.
ABSTRACT: Loop vectorization, a key feature exploited to obtain high performance on Single Instruction Multiple Data (SIMD) vector architectures, is significantly hindered by irregular memory access patterns in the data stream. This paper describes data transformations that allow us to vectorize loops targeting massively multithreaded data parallel architectures. We present a mathematical model that captures loop-based memory access patterns and computes the most appropriate data transformations in order to enable vectorization. Our experimental results show that the proposed data transformations can significantly increase the number of loops that can be vectorized and enhance the data-level parallelism of applications. Our results also show that the overhead associated with our data transformations can be easily amortized as the size of the input data set increases. For the set of high performance benchmark kernels studied, we achieve consistent and significant performance improvements (up to 11.4X) by applying vectorization using our data transformation approach.
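The core idea of this line of work — transforming the data layout so that each loop iteration touches contiguous memory — can be illustrated with an array-of-structures to structure-of-arrays conversion. The sketch below is illustrative only; the paper's mathematical model and its GPU code generation are not reproduced:

```python
import numpy as np

def aos_to_soa(aos, num_fields):
    """Convert an interleaved array-of-structures layout [x0,y0,z0,x1,y1,z1,...]
    into a structure-of-arrays layout [[x0,x1,...],[y0,y1,...],[z0,z1,...]].
    Per-field arrays are unit-stride, so a SIMD unit can load one vector of
    x-values at a time; the interleaved layout forces strided, hard-to-vectorize
    accesses instead."""
    return aos.reshape(-1, num_fields).T.copy()

# Interleaved 3-field records (e.g., x, y, z coordinates of 4 particles).
aos = np.array([0, 1, 2, 10, 11, 12, 20, 21, 22, 30, 31, 32], dtype=np.float32)
soa = aos_to_soa(aos, 3)

# After the transformation, a loop over particles that reads only x
# walks a contiguous array and vectorizes directly.
xs, ys, zs = soa
print(xs)  # [ 0. 10. 20. 30.]
```

The one-time cost of the transposition is the overhead the abstract refers to; it is amortized because every subsequent vectorized pass over the data benefits from the contiguous layout.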
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2010, Bangalore, India, January 9-14, 2010; 05/2010
ABSTRACT: The Local Outlier Factor (LOF) is a powerful anomaly detection method used in machine learning and classification. The algorithm defines the notion of a local outlier, in which the degree to which an object is outlying depends on the density of its local neighborhood; each object can be assigned an LOF value representing the likelihood of that object being an outlier. Although the concept of a local outlier is useful, computing LOF values for every data object requires a large number of k-nearest neighbor queries, and this computational overhead can limit the use of LOF. Given the growing popularity of Graphics Processing Units (GPUs) in general-purpose computing domains, equipped with high-level programming languages designed specifically for general-purpose applications (e.g., CUDA), we apply this parallel computing approach to accelerate LOF. In this paper we explore how to utilize a CUDA-based GPU implementation of the k-nearest neighbor algorithm to accelerate LOF classification. We achieve more than a 100X speedup over a multi-threaded dual-core CPU implementation. We also consider the impact of the input data set size, the neighborhood size (i.e., the value of k), and the feature space dimension on execution time.
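The LOF computation described above can be summarized in a few lines. The sketch below is a brute-force CPU implementation of the standard LOF definition, not the paper's CUDA kernel; it makes the abstract's point concrete: the dominant cost is the all-pairs k-nearest-neighbor search, which is exactly what the GPU offload targets.

```python
import numpy as np

def lof(points, k):
    """Local Outlier Factor for each row of `points` (brute-force sketch).
    Scores well above 1 mark points whose local density is much lower than
    that of their k nearest neighbours."""
    n = len(points)
    # All-pairs distances; set the diagonal to inf so a point is never
    # its own neighbour. This O(n^2) step dominates the runtime.
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    knn = np.argsort(d, axis=1)[:, :k]            # k nearest neighbours per point
    k_dist = d[np.arange(n), knn[:, -1]]          # k-distance of each point
    # Reachability distance: reach(p, o) = max(k_dist[o], d(p, o))
    reach = np.maximum(k_dist[knn], d[np.arange(n)[:, None], knn])
    lrd = 1.0 / reach.mean(axis=1)                # local reachability density
    return lrd[knn].mean(axis=1) / lrd            # LOF score per point

# A tight cluster plus one distant point: the outlier scores far above 1,
# while the cluster members score close to 1.
pts = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]], dtype=float)
scores = lof(pts, k=2)
```

Each point's score is independent of the others once the neighbor lists are known, which is why the k-NN phase parallelizes so naturally across GPU threads.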
Proceedings of 3rd Workshop on General Purpose Processing on Graphics Processing Units, GPGPU 2010, Pittsburgh, Pennsylvania, USA, March 14, 2010; 01/2010
ABSTRACT: As general purpose computing on Graphics Processing Units (GPGPU) matures, more complicated scientific applications are being targeted to utilize the data-level parallelism available on a GPU. Implementing physically-based simulation on data-parallel hardware requires preprocessing overhead which affects application performance. We discuss our implementation of physics-based data structures that provide significant performance improvements when used on data-parallel hardware. These data structures allow us to maintain a physics-based abstraction of the underlying data, reduce programmer effort, and obtain 6x-8x speedup over previously implemented GPU kernels.
High Performance Computing for Computational Science - VECPAR 2010 - 9th International Conference, Berkeley, CA, USA, June 22-25, 2010, Revised Selected Papers; 01/2010
ABSTRACT: Optimizing program execution targeted for Graphics Processing Units (GPUs) can be very challenging. Efficiently mapping serial code to a GPU or stream processing platform is a time-consuming task, and is greatly hampered by a lack of detail about the underlying hardware. Programmers are left to attempt trial and error to produce optimized codes. Recent publication of the underlying instruction set architecture (ISA) of the AMD/ATI GPU has allowed researchers to begin to propose aggressive optimizations. In this work, we present an optimization methodology that utilizes this information to accelerate programs on AMD/ATI GPUs. We start by defining optimization spaces that guide our work. We begin with disassembled machine code and collect program statistics provided by the AMD Graphics Shader Analyzer (GSA) profiling toolset. We explore optimizations targeting three different
Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, GPGPU 2009, Washington, DC, USA, March 8, 2009; 01/2009
ABSTRACT: Given the rapid growth in computational requirements for medical image analysis, Graphics Processing Units (GPUs) have begun to be utilized to address these demands. But even though GPUs are well-suited to the underlying processing associated with medical image reconstruction, extracting the full benefits of moving to GPU platforms requires significant programming effort, and presents a fundamental barrier for more general adoption of GPU acceleration in a wider range of medical imaging applications. In this paper we describe our experience in accelerating a number of challenging medical imaging applications, and discuss how we utilize profile-guided analysis to reap the full benefits available on GPU platforms. Our work considers different GPU architectures, as well as how to fully exploit the benefits of using multiple GPUs.
Proceedings of the 2009 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, Boston, MA, USA, June 28 - July 1, 2009; 01/2009
ABSTRACT: Although iterative reconstruction techniques (IRTs) have been shown to produce images of superior quality over conventional filtered back projection (FBP) based algorithms, the use of IRT in a clinical setting has been hampered by the significant computational demands of these algorithms. In this paper we present results of our efforts to overcome this hurdle by exploiting the combined computational power of multiple graphics processing units (GPUs). We have implemented the forward and backward projection steps of reconstruction on NVIDIA Tesla S870 hardware using CUDA. We have been able to accelerate forward projection by 71x and backward projection by 137x. We generate these results with no perceptible difference in image quality between the GPU and serial CPU implementations. This work illustrates the power of using commercial off-the-shelf, relatively low-cost GPUs, potentially allowing IRT tomographic image reconstruction to run in near real time, lowering the barrier to entry of IRT, and enabling deployment in the clinic.
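The forward/backward projection pair that the abstract accelerates is the inner loop of maximum-likelihood iterative reconstruction. The toy MLEM-style sketch below uses a small dense system matrix to show where those two steps sit; the paper's CUDA projectors and the real (sparse, enormous) system model are not reproduced here:

```python
import numpy as np

def mlem(A, y, iters=200):
    """Toy MLEM iterative reconstruction. y = A @ x_true are the measured
    projections and A is the system matrix. The forward projection (A @ x)
    and backward projection (A.T @ r) dominate the runtime and are the two
    steps mapped onto the GPU in the paper."""
    x = np.ones(A.shape[1])            # non-negative initial image estimate
    norm = A.T @ np.ones(A.shape[0])   # sensitivity term A^T 1
    for _ in range(iters):
        ratio = y / (A @ x + 1e-12)    # forward projection, then data ratio
        x *= (A.T @ ratio) / norm      # backward projection, multiplicative update
    return x

# Tiny 2-pixel "image" observed through 3 projection rays.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
x_true = np.array([2.0, 5.0])
x_rec = mlem(A, A @ x_true)
print(x_rec.round(3))  # converges toward [2. 5.]
```

Both hot steps are matrix-vector products over independent rays/pixels, which is why they map so well onto thousands of GPU threads.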
Proceedings of the 2009 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, Boston, MA, USA, June 28 - July 1, 2009; 01/2009
ABSTRACT: This paper proposes the control of monomer concentration as a novel improvement of the kinetic Tile Assembly Model (kTAM) to reduce the error rate in DNA self-assembly. Tolerance to errors in this process is very important for manufacturing scaffolds for highly dense ICs; the proposed technique significantly decreases error rates (i.e., it increases error tolerance) by controlling the concentration of the monomers (tiles) for a specific pattern to be assembled. Profiling shows this feature to be applicable to different tile sets. A stochastic analysis based on a new state model is presented and extended to the cases of single, double, and triple bondings. The kinetic trap model is modified to account for the different monomer concentrations. Different scenarios (such as dynamic and adaptive) for monomer control are proposed: in the dynamic (adaptive) control case, the concentration of each tile is assessed based on the current (average) demand during growth, as found by profiling the pattern. Significant error rate reductions are found by evaluating the proposed schemes against a scheme with constant concentration. A significant advantage of the proposed schemes is that they do not entail overheads such as an increase in size or slow growth, while still achieving a significant reduction in error rate. Simulation results are provided.
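To see why per-tile concentration control reduces errors, recall that in the kTAM a tile's attachment rate is proportional to its free concentration, while its detachment rate decays exponentially with the number of correct sticky-end bonds. The sketch below is a crude back-of-the-envelope estimate under these two rules only — the paper's state model, Markov analysis, and profiling-derived concentrations are not reproduced, and the parameter values are illustrative:

```python
import math

def ktam_rates(conc, matches, k_f=1.0e6, g_se=8.0):
    """kTAM rates for one attachment site (sketch).
    conc    -- free monomer concentration of this tile type (mol/L, illustrative)
    matches -- number of correct sticky-end bonds (1, 2, or 3)
    g_se    -- strength of one sticky-end bond, in units of kT (illustrative)"""
    r_attach = k_f * conc                        # attachment ~ concentration
    r_detach = k_f * math.exp(-matches * g_se)   # detachment ~ e^(-b * G_se)
    return r_attach, r_detach

def mismatch_odds(conc_correct, conc_wrong, g_se=8.0):
    """Crude kinetic-trap estimate: relative odds that the next locked-in tile
    is a 1-bond mismatch rather than the 2-bond correct tile, comparing each
    candidate's attachment rate weighted by its persistence (1 / detach rate).
    Lowering conc_wrong -- the effect of per-tile concentration control --
    lowers these odds proportionally."""
    a_ok, d_ok = ktam_rates(conc_correct, matches=2, g_se=g_se)
    a_bad, d_bad = ktam_rates(conc_wrong, matches=1, g_se=g_se)
    return (a_bad / d_bad) / (a_ok / d_ok)

# With equal concentrations the odds reduce to e^(-g_se); halving the
# concentration of the mismatching tile halves the odds.
base = mismatch_odds(1e-6, 1e-6)
ctrl = mismatch_odds(1e-6, 0.5e-6)
print(ctrl / base)  # -> 0.5
```

This is the intuition behind the dynamic and adaptive schemes: tiles that a profiled pattern rarely demands can be kept at lower concentration, cutting their mismatch contribution without slowing correct growth.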
Journal of Electronic Testing 01/2008; 24(1):271-284. Impact Factor: 0.45
ABSTRACT: This paper proposes a novel technique based on profiling the monomers for reducing the error rate in DNA self-assembly. This technique utilizes the average concentration of the monomers (tiles) for a specific pattern, as found by profiling its growth. The validity of profiling and the large difference in the concentrations of the monomers are shown to be applicable to different tile sets. To evaluate the error rate, new Markov-based models are proposed to account for the different types of bonding (i.e., single, double, and triple) in the monomers, as modifications to the commonly assumed kinetic trap model. A significant error rate reduction is accomplished compared to a scheme with constant concentration, as commonly utilized under the kinetic trap model. Simulation results are provided.
2007 Design, Automation and Test in Europe Conference and Exposition (DATE 2007), April 16-20, 2007, Nice, France; 01/2007
ABSTRACT: This paper proposes the control of monomer concentration as a novel improvement of the kinetic Tile Assembly Model (kTAM) to reduce the error rate in DNA self-assembly. Tolerance to errors in this process is very important for manufacturing highly dense ICs; the proposed technique significantly decreases error rates (i.e., it increases error tolerance) by controlling the concentration of monomers. A stochastic analysis based on a new state model is presented. Error rate reductions of at least 10% are found by evaluating the proposed scheme against a scheme with constant concentration. A significant advantage of the proposed scheme is that it does not entail overheads such as an increase in size or slow growth, while still achieving a significant reduction in error rate.
21st IEEE International Symposium on Defect and Fault-Tolerance in VLSI Systems (DFT 2006), 4-6 October 2006, Arlington, Virginia, USA; 01/2006
ABSTRACT: In this paper, a new architecture of distributed embedded memory cores for SoC is proposed, and an effective memory repair method using the proposed spare line borrowing (software-driven reconfiguration) technique is investigated. It is known that faulty cells in memory cores show spatial locality, also known as fault clustering. This physical phenomenon tends to occur more often as deep submicron technology advances, due to defects that span multiple circuit elements and sophisticated circuit design. The combination of the new architecture and repair method proposed in this paper ensures fault tolerance enhancement in SoC, especially in the case of fault clustering. This fault tolerance enhancement is obtained through optimal redundancy utilization: spare redundancy in a fault-resistant memory core is used to fix faults in a fault-prone memory core. The effect of the spare line borrowing technique on the reliability of distributed memory cores is analyzed through modeling and extensive parametric simulation.
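The benefit of spare line borrowing under fault clustering can be shown with a simple counting sketch. This is a simplification that ignores any routing or borrowing-distance constraints the actual architecture would impose; the fault and spare counts are hypothetical:

```python
def repair_with_borrowing(faults, spares):
    """Spare-line-borrowing sketch for distributed memory cores.
    faults[i] -- number of faulty lines in core i (fault clustering makes
                 these counts very uneven across cores)
    spares[i] -- number of spare lines physically local to core i
    Each core first repairs with its own spares; remaining faults are then
    mapped, via software-driven reconfiguration, onto spares left over in
    fault-resistant cores. Returns the number of unrepairable faults."""
    leftover_faults = leftover_spares = 0
    for f, s in zip(faults, spares):
        leftover_faults += max(0, f - s)   # faults local spares cannot cover
        leftover_spares += max(0, s - f)   # spares the local core does not need
    return max(0, leftover_faults - leftover_spares)

# Clustered faults: core 0 has 5 faults but only 2 local spares. With local
# repair only it would fail; borrowing lets the idle spares in cores 1-3
# absorb the excess, fully repairing the device.
print(repair_with_borrowing(faults=[5, 0, 1, 0], spares=[2, 2, 2, 2]))  # -> 0
```

The same total redundancy, pooled instead of statically partitioned, is what the abstract calls "optimal redundancy utilization."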
Instrumentation and Measurement Technology Conference, 2005. IMTC 2005. Proceedings of the IEEE; 06/2005
ABSTRACT: In this paper, the modeling and evaluation of a multi-bank SRAM design with dynamic threshold and supply voltage control is presented to reduce leakage power. An SRAM bank, the unit of control, is switched from active mode (low threshold voltage and high supply voltage) to sleep mode (high threshold voltage and low supply voltage) whenever it is not frequently used. The change of modes is based on the temporal and spatial locality characteristics of memory accesses. The simulation results show that significant leakage reduction can be achieved by exploiting both spatial and temporal locality, while minimizing the re-synchronization penalties when the superbank size is optimized for the characteristics of the application program.
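The mode-switching policy described above can be sketched as a simple idle-threshold simulation. This is a hypothetical model for illustration — the leakage values, wake penalty, and threshold are made-up units, not the paper's circuit-level figures:

```python
def leakage_with_sleep(access_trace, num_banks, idle_threshold,
                       active_leak=1.0, sleep_leak=0.1, wake_penalty=5.0):
    """Sketch of the bank sleep policy: a bank idle for more than
    `idle_threshold` cycles is demoted to sleep mode (high-Vt / low-Vdd,
    leaking `sleep_leak` per cycle instead of `active_leak`); an access to
    a sleeping bank pays a one-off re-synchronization penalty.
    access_trace -- sequence of (cycle, bank) pairs, cycles increasing.
    Returns (total leakage energy, wake-up count) in arbitrary units."""
    last_access = [0] * num_banks
    asleep = [False] * num_banks
    energy, wakeups, now = 0.0, 0, 0
    for cycle, bank in access_trace:
        for _ in range(now, cycle):          # advance time cycle by cycle
            now += 1
            for b in range(num_banks):
                if not asleep[b] and now - last_access[b] > idle_threshold:
                    asleep[b] = True         # demote long-idle bank to sleep
                energy += sleep_leak if asleep[b] else active_leak
        if asleep[bank]:                     # access hits a sleeping bank
            energy += wake_penalty           # re-synchronization cost
            wakeups += 1
            asleep[bank] = False
        last_access[bank] = cycle
    return energy, wakeups

# Both banks go quiet during the long gap, sleep, and each pays one
# wake-up penalty on its next access.
trace = [(1, 0), (2, 0), (3, 1), (20, 0), (21, 1)]
energy, wakeups = leakage_with_sleep(trace, num_banks=2, idle_threshold=4)
```

Tuning `idle_threshold` (and, in the paper, the superbank size) trades leakage savings during long idle periods against re-synchronization penalties on wake-up — exactly the balance the abstract describes.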