A design case study: CPU vs. GPGPU vs. FPGA
ABSTRACT: This paper describes our winning submission for the Absolute Performance category of the MEMOCODE 2009 Design Contest. We show that our GPGPU-based design achieves performance within a factor of four of the theoretical maximum performance for the implemented algorithm. This result was reached after a short design cycle of 2 man-days, which indicates that the NVIDIA CUDA platform allows for rapid development and optimization of applications that make substantial use of all available GPGPU computing resources. We also analyze the maximum theoretical performance of alternative computing systems that could have been used to implement the algorithm.
Source available from: Michael Lang
ABSTRACT: In this work we present an initial performance evaluation of Intel's latest, second-generation quad-core processor, Nehalem, and provide a comparison to first-generation AMD and Intel quad-core processors Barcelona and Tigerton. Nehalem is the first Intel processor to implement a NUMA architecture incorporating QuickPath Interconnect for interconnecting processors within a node, and the first to incorporate an integrated memory controller. We evaluate the suitability of these processors in quad-socket compute nodes as building blocks for large-scale scientific computing clusters. Our analysis of intra-processor and intra-node scalability of microbenchmarks, and a range of large-scale scientific applications, indicates that quad-core processors can deliver an improvement in performance of up to 4x over a single core depending on the workload being processed. However, scalability can be less when considering a full node. We show that Nehalem outperforms Barcelona on memory-intensive codes by a factor of two for a Nehalem node with 8 cores and a Barcelona node containing 16 cores. Further optimizations are possible with Nehalem, including the use of Simultaneous Multithreading, which improves the performance of some applications by up to 50%.
Parallel Processing Letters. 01/2008; 18:453-469.
Conference Proceeding: Benchmarking GPUs to tune dense linear algebra
ABSTRACT: We present performance results for dense linear algebra using recent NVIDIA GPUs. Our matrix-matrix multiply routine (GEMM) runs up to 60% faster than the vendor's implementation and approaches the peak of hardware capabilities. Our LU, QR and Cholesky factorizations achieve up to 80-90% of the peak GEMM rate. Our parallel LU running on two GPUs achieves up to ~540 Gflop/s. These results are accomplished by challenging the accepted view of the GPU architecture and programming guidelines. We argue that modern GPUs should be viewed as multithreaded multicore vector units. We exploit blocking similarly to vector computers and heterogeneity of the system by computing both on GPU and CPU. This study includes detailed benchmarking of the GPU memory system that reveals sizes and latencies of caches and TLB. We present a couple of algorithmic optimizations aimed at increasing parallelism and regularity in the problem that provide us with slightly higher performance.
High Performance Computing, Networking, Storage and Analysis, 2008. SC 2008. International Conference for; 12/2008
Conference Proceeding: 2009 MEMOCODE Co-Design Contest.
ABSTRACT: The 2009 MEMOCODE Co-Design Contest is the third in the series of annual design contests organized by the MEMOCODE Conference. Contestants have one month to create the best performing design solution to a posted design challenge. The contest is open to all interested participants, and the contest rules are designed to not exclude or favor any one design methodology or platform. The goal of the contest is to invite developers of tools and platforms to showcase their technology in a leveled competition and to encourage hands-on design activities in the fields of interest of the MEMOCODE Conference. Please see http://www.memocode-conference.com for current information about this contest.
7th ACM/IEEE International Conference on Formal Methods and Models for Codesign (MEMOCODE 2009), July 13-15, 2009, Cambridge, Massachusetts, USA; 01/2009