-
Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2012, New Orleans, LA, USA, February 25-29, 2012; 01/2012
-
13th IEEE International Conference on High Performance Computing & Communication, HPCC 2011, Banff, Alberta, Canada, September 2-4, 2011; 01/2011
-
22nd IEEE International Conference on Application-specific Systems, Architectures and Processors, ASAP 2011, Santa Monica, CA, USA, Sept. 11-14, 2011; 01/2011
-
IEEE/IFIP 9th International Conference on Embedded and Ubiquitous Computing, EUC 2011, Melbourne, Australia, October 24-26, 2011; 01/2011
-
22nd IEEE International Conference on Application-specific Systems, Architectures and Processors, ASAP 2011, Santa Monica, CA, USA, Sept. 11-14, 2011; 01/2011
-
ABSTRACT: Many compute-bound applications have seen order-of-magnitude speedups using special-purpose accelerators. FPGAs in particular are good at implementing recurrence equations realized as arrays. Existing high-level synthesis approaches for recurrence equations produce an array that is latency-space optimal. We target applications that operate on a large collection of small inputs, e.g. a database of biological sequences, where overall throughput is the most important measure of performance. In this work, we introduce a new design-space exploration procedure within the polyhedral framework to optimize throughput of a systolic array subject to area and bandwidth constraints of an FPGA device. Our approach is to exploit additional parallelism by pipelining multiple inputs on an array and multiple iteration vectors in a processing element. We prove that the throughput of an array is given by the inverse of the maximum number of iteration vectors executed by any processor in the array, which is determined solely by the array's projection vector. We have applied this observation to discover novel arrays for Nussinov RNA folding. Our throughput-optimized array is 2× faster than the standard latency-space optimal array, yet it uses 15% fewer LUT resources. We achieve a further 2× speedup by processor pipelining, with only a 37% increase in resources. Our tool suggests additional arrays that trade area for throughput and are 4–5× faster than the currently used latency-optimized array. These novel arrays are 70–172× faster than a software baseline.
Application-specific Systems Architectures and Processors (ASAP), 2010 21st IEEE International Conference on; 08/2010
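The throughput claim in the abstract above can be illustrated with a small counting exercise. The sketch below is not the authors' tool: it assumes a simplified two-dimensional triangular iteration domain `0 <= i <= j < n` (the paper's domains and constraints are richer), and the function names `max_points_per_processor` and `relative_throughput` are hypothetical. Two iteration points land on the same processor when they differ by an integer multiple of the projection vector `u`, so the busiest processor can be found by grouping points along lines parallel to `u`.

```python
from collections import Counter
from math import gcd


def max_points_per_processor(n, u):
    """Count iteration points on the busiest processor for projection vector u.

    Assumption: the iteration domain is the triangle 0 <= i <= j < n.
    """
    a, b = u
    if a == 0 and b == 0:
        raise ValueError("projection vector must be nonzero")
    # Reduce u to a primitive vector so that points sharing a line
    # differ by an integer multiple of u.
    g = gcd(a, b)
    a, b = a // g, b // g
    counts = Counter()
    for i in range(n):
        for j in range(i, n):
            # a*j - b*i is constant along lines parallel to (a, b),
            # so it serves as a processor identifier.
            counts[a * j - b * i] += 1
    return max(counts.values())


def relative_throughput(n, u):
    """Inverse of the busiest processor's load, as stated in the abstract."""
    return 1.0 / max_points_per_processor(n, u)


if __name__ == "__main__":
    for u in [(1, 0), (0, 1), (1, 1), (1, -1)]:
        print(u, max_points_per_processor(8, u), relative_throughput(8, u))
```

In this toy domain, different projection vectors give different worst-case loads, which is the quantity a throughput-oriented search would try to minimize. Dependence and scheduling constraints, which decide whether a projection is legal at all, are deliberately left out.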
-
ABSTRACT: TimeTrial is a performance monitoring tool designed to enhance the understanding of how and why a streaming data application is performing when deployed on an architecturally diverse computer. A challenge that exists in architecturally diverse systems is that different computing platforms (e.g., processor core, FPGA) have different clocks, and the notion of time as measured on one platform does not directly compare to that measured on another platform. Here, we describe the global time model employed in the TimeTrial performance monitor, and demonstrate measurements made with TimeTrial that require timestamp comparisons across the disparate time domains (called timezones).
08/2010;
-
Proceedings of the 2010 International Conference on Engineering of Reconfigurable Systems & Algorithms, ERSA 2010, July 12-15, 2010, Las Vegas Nevada, USA; 01/2010
-
Proceedings of the ACM/SIGDA 18th International Symposium on Field Programmable Gate Arrays, FPGA 2010, Monterey, California, USA, February 21-23, 2010; 01/2010
-
18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2010, Charlotte, North Carolina, USA, 2-4 May 2010; 01/2010
-
19th International Conference on Field Programmable Logic and Applications, FPL 2009, August 31 - September 2, 2009, Prague, Czech Republic; 01/2009
-
Annual IEEE International SoC Conference, SoCC 2009, September 9-11, 2009, Belfast, Northern Ireland, UK, Proceedings; 01/2009
-
22nd IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2008, Miami, Florida USA, April 14-18, 2008; 01/2008
-
22nd IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2008, Miami, Florida USA, April 14-18, 2008; 01/2008
-
21st International Parallel and Distributed Processing Symposium (IPDPS 2007), Proceedings, 26-30 March 2007, Long Beach, California, USA; 01/2007
-
International Journal of Parallel Programming. 01/2005; 33:115-136.
-
2005 International Conference on Microelectronics Systems Education (MSE 2005), 12-13 June 2005, Anaheim, CA, USA; 01/2005
-
Proceedings 35th Annual Simulation Symposium (ANSS-35 2002), San Diego, California, USA, 14-18 April 2002; 01/2002
-
ABSTRACT: Many compute-bound software kernels have seen order-of-magnitude speedups on special-purpose accelerators built on specialized architectures such as field-programmable gate arrays (FPGAs). These architectures are particularly good at implementing dynamic programming algorithms that can be expressed as systems of recurrence equations, which in turn can be realized as systolic array designs. To efficiently find good realizations of an algorithm for a given hardware platform, we pursue software tools that can search the space of possible parallel array designs to optimize various design criteria. Most existing design tools in this area produce a design that is latency-space optimal. However, we instead wish to target applications that operate on a large collection of small inputs, e.g. a database of biological sequences. For such applications, overall throughput rather than latency per input is the most important measure of performance. In this work, we introduce a new procedure to optimize throughput of a systolic array subject to resource constraints, in this case the area and bandwidth constraints of an FPGA device. We show that the throughput of an array is dependent on the maximum number of lattice points executed by any processor in the array, which to a close approximation is determined solely by the array's projection vector. We describe a bounded search process to find throughput-optimal projection vectors and a tool to perform automated design space exploration, discovering a range of array designs that are optimal for inputs of different sizes. We apply our techniques to the Nussinov RNA folding algorithm to generate multiple mappings of this algorithm into systolic arrays. By combining our library of designs with run-time reconfiguration of an FPGA device to dynamically switch among them, we predict significant speedup over a single, latency-space optimal array.
Type of Report: Other
-
ABSTRACT: Large-scale DNA sequence comparison, as implemented by BLAST and related algorithms, is one of the pillars of modern genomic analysis. One way to accelerate these computations is with a streaming architecture, in which processors are arranged in a pipeline that replicates the multistage structure of the algorithm. To achieve high performance, the processor hardware implementing the critical seed matching and ungapped extension stages of BLAST should be specialized to execute these stages as quickly as possible. However, accelerating these stages requires solving two key problems: first, the seed matching stage is not of a form which has traditionally been amenable to hardware acceleration; and second, the accelerated implementation of BLAST should retain sensitivity at least comparable to that of the original software. We describe Mercury BLASTN, an FPGA-based implementation of BLAST for DNA. Mercury BLASTN combines a Bloom filtering approach to seed matching with a modified ungapped extension algorithm. On a previous-generation FPGA hardware platform, Mercury BLASTN runs 5 to 11 times faster than NCBI BLASTN on current-generation general-purpose CPUs, with the prospect of a further eight-fold speedup on current-generation FPGAs. Moreover, its sensitivity to significant DNA sequence alignments is 99% of that observed with software NCBI BLASTN.
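The Bloom filtering idea mentioned in the abstract above can be sketched in software. This is only an illustration of the general technique, not the Mercury BLASTN hardware design: the class `BloomFilter`, the function `candidate_seed_positions`, the word length `w`, the filter size, and the SHA-256-based hashing are all assumptions chosen for readability. The filter stores every length-`w` word of the query; each length-`w` window of the database is then tested against it. A Bloom filter never misses a stored word but can report false positives, so surviving positions are candidates that a later stage would have to verify.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over strings (illustrative, not tuned)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for k in range(self.num_hashes):
            digest = hashlib.sha256(f"{k}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


def candidate_seed_positions(query, database, w):
    """Return database offsets whose w-mer may occur in the query."""
    bloom = BloomFilter()
    for i in range(len(query) - w + 1):
        bloom.add(query[i:i + w])
    return [pos for pos in range(len(database) - w + 1)
            if database[pos:pos + w] in bloom]


if __name__ == "__main__":
    hits = candidate_seed_positions("ACGTACGT", "TTACGTAAGG", 4)
    print(hits)  # contains every true match; may include false positives
```

The appeal of this structure for streaming hardware is that membership tests need only a few independent bit lookups per database position, at the cost of a verification step for the occasional false positive.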