Roger D. Chamberlain

Washington University in St. Louis, Saint Louis, MO, USA

Publications (22)

  • Conference Proceeding: Efficient deadlock avoidance for streaming computation with filtering.
    Jeremy D. Buhler, Kunal Agrawal, Peng Li, Roger D. Chamberlain
    Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2012, New Orleans, LA, USA, February 25-29, 2012; 01/2012
  • Conference Proceeding: Asking for Performance: Exploiting Developer Intuition to Guide Instrumentation with TimeTrial.
    13th IEEE International Conference on High Performance Computing & Communication, HPCC 2011, Banff, Alberta, Canada, September 2-4, 2011; 01/2011
  • Conference Proceeding: TimeTrial: A low-impact performance profiler for streaming data applications.
    22nd IEEE International Conference on Application-specific Systems, Architectures and Processors, ASAP 2011, Santa Monica, CA, USA, Sept. 11-14, 2011; 01/2011
  • Conference Proceeding: Crossing Boundaries in TimeTrial: Monitoring Communications across Architecturally Diverse Computing Platforms.
    IEEE/IFIP 9th International Conference on Embedded and Ubiquitous Computing, EUC 2011, Melbourne, Australia, October 24-26, 2011; 01/2011
  • Conference Proceeding: Optimal design-space exploration of streaming applications.
    Shobana Padmanabhan, Yixin Chen, Roger D. Chamberlain
    22nd IEEE International Conference on Application-specific Systems, Architectures and Processors, ASAP 2011, Santa Monica, CA, USA, Sept. 11-14, 2011; 01/2011
  • Conference Proceeding: Design of throughput-optimized arrays from recurrence abstractions
    Arpith C. Jacob, Jeremy D. Buhler, Roger D. Chamberlain
    ABSTRACT: Many compute-bound applications have seen order-of-magnitude speedups using special-purpose accelerators. FPGAs in particular are good at implementing recurrence equations realized as arrays. Existing high-level synthesis approaches for recurrence equations produce an array that is latency-space optimal. We target applications that operate on a large collection of small inputs, e.g. a database of biological sequences, where overall throughput is the most important measure of performance. In this work, we introduce a new design-space exploration procedure within the polyhedral framework to optimize throughput of a systolic array subject to area and bandwidth constraints of an FPGA device. Our approach is to exploit additional parallelism by pipelining multiple inputs on an array and multiple iteration vectors in a processing element. We prove that the throughput of an array is given by the inverse of the maximum number of iteration vectors executed by any processor in the array, which is determined solely by the array's projection vector. We have applied this observation to discover novel arrays for Nussinov RNA folding. Our throughput-optimized array is 2× faster than the standard latency-space optimal array, yet it uses 15% fewer LUT resources. We achieve a further 2× speedup by processor pipelining, with only a 37% increase in resources. Our tool suggests additional arrays that trade area for throughput and are 4–5× faster than the currently used latency-optimized array. These novel arrays are 70–172× faster than a software baseline.
    21st IEEE International Conference on Application-specific Systems, Architectures and Processors, ASAP 2010; 08/2010
  • Article: Crossing Timezones in the TimeTrial Performance Monitor
    Joseph M Lancaster, Roger D Chamberlain
    ABSTRACT: TimeTrial is a performance monitoring tool designed to enhance the understanding of how and why a streaming data application is performing when deployed on an architecturally diverse computer. A challenge that exists in architecturally diverse systems is that different computing platforms (e.g., processor core, FPGA) have different clocks, and the notion of time as measured on one platform does not directly compare to that measured on another platform. Here, we describe the global time model employed in the TimeTrial performance monitor, and demonstrate measurements made with TimeTrial that require timestamp comparisons across the disparate time domains (called timezones).
    08/2010
  • Conference Proceeding: Better Languages for More Effective Designing.
    Roger D. Chamberlain, Joseph M. Lancaster
    Proceedings of the 2010 International Conference on Engineering of Reconfigurable Systems & Algorithms, ERSA 2010, July 12-15, 2010, Las Vegas, Nevada, USA; 01/2010
  • Conference Proceeding: Design space exploration of throughput-optimized arrays from recurrence abstractions (abstract only).
    Arpith C. Jacob, Jeremy D. Buhler, Roger D. Chamberlain
    Proceedings of the ACM/SIGDA 18th International Symposium on Field Programmable Gate Arrays, FPGA 2010, Monterey, California, USA, February 21-23, 2010; 01/2010
  • Conference Proceeding: Rapid RNA Folding: Analysis and Acceleration of the Zuker Recurrence.
    Arpith C. Jacob, Jeremy D. Buhler, Roger D. Chamberlain
    18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2010, Charlotte, North Carolina, USA, 2-4 May 2010; 01/2010
  • Conference Proceeding: Optimal runtime reconfiguration strategies for systolic arrays.
    Arpith C. Jacob, Jeremy D. Buhler, Roger D. Chamberlain
    19th International Conference on Field Programmable Logic and Applications, FPL 2009, August 31 - September 2, 2009, Prague, Czech Republic; 01/2009
  • Conference Proceeding: Efficient runtime performance monitoring of FPGA-based applications.
    Joseph M. Lancaster, Jeremy D. Buhler, Roger D. Chamberlain
    Annual IEEE International SoC Conference, SoCC 2009, September 9-11, 2009, Belfast, Northern Ireland, UK, Proceedings; 01/2009
  • Conference Proceeding: Understanding the performance of streaming applications deployed on hybrid systems.
    Joseph M. Lancaster, Ron Cytron, Roger D. Chamberlain
    22nd IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2008, Miami, Florida USA, April 14-18, 2008; 01/2008
  • Conference Proceeding: Analytic performance models for bounded queueing systems.
    Praveen Krishnamurthy, Roger D. Chamberlain
    22nd IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2008, Miami, Florida USA, April 14-18, 2008; 01/2008
  • Conference Proceeding: Preliminary results in accelerating profile HMM search on FPGAs.
    21st International Parallel and Distributed Processing Symposium (IPDPS 2007), Proceedings, 26-30 March 2007, Long Beach, California, USA; 01/2007
  • Article: Extracting and Improving Microarchitecture Performance on Reconfigurable Architectures.
    International Journal of Parallel Programming. 01/2005; 33:115-136.
  • Conference Proceeding: Use of a Soft-Core Processor in a Hardware/Software Codesign Laboratory.
    2005 International Conference on Microelectronics Systems Education (MSE 2005), 12-13 June 2005, Anaheim, CA, USA; 01/2005
  • Conference Proceeding: Breaking the Memory Bottleneck with an Optical Data Path.
    Jason E. Fritts, Roger D. Chamberlain
    Proceedings 35th Annual Simulation Symposium (ANSS-35 2002), San Diego, California, USA, 14-18 April 2002; 01/2002
  • Article: Throughput-optimal systolic arrays from recurrence equations
    Arpith C Jacob, Jeremy D Buhler, Roger D Chamberlain
    ABSTRACT: Many compute-bound software kernels have seen order-of-magnitude speedups on special-purpose accelerators built on specialized architectures such as field-programmable gate arrays (FPGAs). These architectures are particularly good at implementing dynamic programming algorithms that can be expressed as systems of recurrence equations, which in turn can be realized as systolic array designs. To efficiently find good realizations of an algorithm for a given hardware platform, we pursue software tools that can search the space of possible parallel array designs to optimize various design criteria. Most existing design tools in this area produce a design that is latency-space optimal. However, we instead wish to target applications that operate on a large collection of small inputs, e.g. a database of biological sequences. For such applications, overall throughput rather than latency per input is the most important measure of performance. In this work, we introduce a new procedure to optimize throughput of a systolic array subject to resource constraints, in this case the area and bandwidth constraints of an FPGA device. We show that the throughput of an array is dependent on the maximum number of lattice points executed by any processor in the array, which to a close approximation is determined solely by the array's projection vector. We describe a bounded search process to find throughput-optimal projection vectors and a tool to perform automated design space exploration, discovering a range of array designs that are optimal for inputs of different sizes. We apply our techniques to the Nussinov RNA folding algorithm to generate multiple mappings of this algorithm into systolic arrays. By combining our library of designs with run-time reconfiguration of an FPGA device to dynamically switch among them, we predict significant speedup over a single, latency-space optimal array.
  • Article: Mercury BLASTN: Faster DNA sequence comparison using a streaming hardware architecture
    ABSTRACT: Large-scale DNA sequence comparison, as implemented by BLAST and related algorithms, is one of the pillars of modern genomic analysis. One way to accelerate these computations is with a streaming architecture, in which processors are arranged in a pipeline that replicates the multistage structure of the algorithm. To achieve high performance, the processor hardware implementing the critical seed matching and ungapped extension stages of BLAST should be specialized to execute these stages as quickly as possible. However, accelerating these stages requires solving two key problems: first, the seed matching stage is not of a form which has traditionally been amenable to hardware acceleration; and second, the accelerated implementation of BLAST should retain sensitivity at least comparable to that of the original software. We describe Mercury BLASTN, an FPGA-based implementation of BLAST for DNA. Mercury BLASTN combines a Bloom filtering approach to seed matching with a modified ungapped extension algorithm. On a previous-generation FPGA hardware platform, Mercury BLASTN runs 5 to 11 times faster than NCBI BLASTN on current-generation general-purpose CPUs, with the prospect of a further eight-fold speedup on current-generation FPGAs. Moreover, its sensitivity to significant DNA sequence alignments is 99% of that observed with software NCBI BLASTN.
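
The Mercury BLASTN abstract mentions a Bloom filtering approach to seed matching. As a rough software illustration of that general idea only, the sketch below stores every length-w word of a query in a Bloom filter and then streams a database sequence past it, keeping positions whose word passes the filter and an exact check. This is not the hardware design from the paper: the class and function names, the filter size, the number of hash functions, the use of SHA-256, and the default word length are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; sizes and hashing scheme are illustrative assumptions."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one cryptographic hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def build_query_filter(query, w):
    """Insert every length-w word of the query into a Bloom filter."""
    bf = BloomFilter()
    for i in range(len(query) - w + 1):
        bf.add(query[i:i + w])
    return bf


def candidate_seeds(database, query, w=11):
    """Return database offsets whose length-w word also occurs in the query.

    The Bloom filter is a cheap first test; the exact set lookup afterwards
    discards any false positives (hypothetical design choice for this sketch).
    """
    bf = build_query_filter(query, w)
    query_words = {query[i:i + w] for i in range(len(query) - w + 1)}
    hits = []
    for j in range(len(database) - w + 1):
        word = database[j:j + w]
        if word in bf and word in query_words:
            hits.append(j)
    return hits


if __name__ == "__main__":
    print(candidate_seeds("CCGATTACACC", "GATTACA", w=7))
```

Because a Bloom filter answers membership with a fixed number of bit probes, most non-matching database words can be rejected without consulting the exact table, which is the property that makes the structure attractive as the first stage of a streaming pipeline.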