Herman Lam

University of Florida, Gainesville, Florida, United States

Publications (101) · 21.72 Total Impact Points

  • ABSTRACT: Reconfigurable computing (RC) devices such as field-programmable gate arrays (FPGAs) offer significant advantages over fixed-logic, many-core CPU and GPU architectures, including increased performance for many computationally challenging applications, superior power efficiency, and reconfigurability. The difficulties of using FPGAs, however, have limited their acceptance in high-performance computing (HPC) and high-performance embedded computing (HPEC) applications. These difficulties stem from a lack of standards between FPGA platforms and the complexities of hardware design, and they lead to higher costs and time to market than competing technologies. Differences in FPGA platform resources, such as the type and number of FPGAs, memories, and interconnects, as well as vendor-specific procedural APIs and hardware interfaces, inhibit application portability and code reusability. Despite efforts to reduce FPGA application design complexity through technologies such as high-level synthesis (HLS) tools, platform support and portability remain limited and are typically left as a challenge for application developers. In this paper, we present a novel RC Middleware (RCMW), an extensible framework that enables FPGA application portability and enhances developer productivity by providing an application-centric development environment. Developers focus specifically on the optimal resources and interfaces required by their application, and RCMW handles the mapping and translation of those resources onto a target platform. We demonstrate that RCMW enables application portability over three heterogeneous platforms from two vendors, using both Xilinx and Altera FPGAs, with less than 10% performance and area overhead for several application kernels and microbenchmarks in the common case. We present the productivity benefits of RCMW, showing that RCMW reduces the required number of hardware and software driver lines of code and the total development time with respect to native platform deployment methods for several application kernels.
    Application-Specific Systems, Architectures and Processors (ASAP), 2013 IEEE 24th International Conference on; 01/2013
  • ABSTRACT: The mean-shift algorithm provides a unique non-parametric and unsupervised clustering solution to image segmentation and has a proven record of very good performance for a wide variety of input images. It is essential to image processing because it provides the initial and vital steps to numerous object recognition and tracking applications. However, image segmentation using mean-shift clustering is widely recognized as one of the most compute-intensive tasks in image processing, and suffers from poor scalability with respect to the image size (N pixels) and number of iterations (k): O(kN²). Our novel approach focuses on creating a scalable hardware architecture fine-tuned to the computational requirements of the mean-shift clustering algorithm. By efficiently parallelizing and mapping the algorithm to reconfigurable hardware, we can effectively cluster hundreds of pixels independently. Each pixel can benefit from its own dedicated pipeline and can move independently of all other pixels towards its respective cluster. By using our mean-shift FPGA architecture, we achieve a speedup of three orders of magnitude with respect to our software baseline.
    Application-Specific Systems, Architectures and Processors (ASAP), 2013 IEEE 24th International Conference on; 01/2013
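    The abstract above describes per-pixel mean-shift pipelines; the following is a minimal, hypothetical C sketch of the per-pixel iteration being parallelized (flat kernel, grayscale feature space for brevity). Names and bandwidth parameters are illustrative, not details taken from the paper's hardware design.

      #include <math.h>

      /* One pixel's mean-shift trajectory over a grayscale image (flat kernel).
       * All names (mean_shift_pixel, hs/hr bandwidths) are illustrative. */
      float mean_shift_pixel(const float *img, int w, int h,
                             int px, int py, float hs, float hr, int max_iter)
      {
          float x = (float)px, y = (float)py, v = img[py * w + px];
          for (int it = 0; it < max_iter; ++it) {
              float sx = 0.0f, sy = 0.0f, sv = 0.0f, cnt = 0.0f;
              for (int j = 0; j < h; ++j) {
                  for (int i = 0; i < w; ++i) {
                      float dx = i - x, dy = j - y, dv = img[j * w + i] - v;
                      /* Include neighbors within the spatial and range bandwidths. */
                      if (dx * dx + dy * dy <= hs * hs && dv * dv <= hr * hr) {
                          sx += i; sy += j; sv += img[j * w + i]; cnt += 1.0f;
                      }
                  }
              }
              if (cnt == 0.0f) break;              /* empty window: stop */
              float nx = sx / cnt, ny = sy / cnt, nv = sv / cnt;
              float shift = fabsf(nx - x) + fabsf(ny - y) + fabsf(nv - v);
              x = nx; y = ny; v = nv;
              if (shift < 1e-3f) break;            /* converged to a mode */
          }
          return v;   /* mode (cluster) intensity this pixel converges to */
      }

    Running this for all N pixels over k iterations gives the O(kN²) cost noted above, and each pixel's loop is independent of all others, which is the property the dedicated per-pixel pipelines exploit.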
  • ABSTRACT: With many-core processor architectures emerging, concerns arise regarding the productivity of the numerous parallel programming tools, models, and languages as developers from a broad spectrum of science domains struggle to maximize performance and maintain correctness of their applications. Fortunately, the partitioned global address space (PGAS) programming model has demonstrated realizable performance and productivity potential for large parallel computing systems with distributed-memory architectures. One such PGAS approach is SHMEM, a lightweight, shared-memory programming library. Renewed interest in SHMEM has developed around OpenSHMEM, a recent community-led effort to produce a standardized specification for the SHMEM library amidst incompatible commercial implementations. This paper presents and evaluates the design of TSHMEM (short for TileSHMEM), a new OpenSHMEM library for the Tilera TILE-Gx8036 and TILEPro64 many-core processors. TSHMEM is built atop Tilera-provided libraries with key emphasis upon realizable performance with those libraries, demonstrated through microbenchmarking. Furthermore, SHMEM application portability is illustrated with two case studies. TSHMEM successfully delivers high performance with ease of programmability and portability for SHMEM applications on TILE-Gx and TILEPro architectures.
    Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International; 01/2013
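    For readers unfamiliar with the SHMEM model, below is a minimal OpenSHMEM-style program in C showing the one-sided put and barrier operations such libraries provide. API names follow later OpenSHMEM specifications; a 2013-era library such as TSHMEM may use start_pes-style initialization instead, so treat the exact entry points as assumptions.

      #include <stdio.h>
      #include <shmem.h>

      int main(void)
      {
          shmem_init();
          int me   = shmem_my_pe();     /* this processing element's rank */
          int npes = shmem_n_pes();     /* total number of PEs            */

          /* Symmetric allocation: every PE gets a remotely accessible slot. */
          long *dst = (long *)shmem_malloc(sizeof(long));
          *dst = -1;
          shmem_barrier_all();

          /* One-sided put: each PE writes its rank into its right neighbor. */
          long val = me;
          shmem_long_put(dst, &val, 1, (me + 1) % npes);
          shmem_barrier_all();

          printf("PE %d of %d received %ld\n", me, npes, *dst);

          shmem_free(dst);
          shmem_finalize();
          return 0;
      }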
  • ABSTRACT: Commercial SRAM-based, field-programmable gate arrays (FPGAs) have the potential to provide space applications with the necessary performance to meet next-generation mission requirements. However, mitigating an FPGA’s susceptibility to single-event upset (SEU) radiation is challenging. Triple-modular redundancy (TMR) techniques are traditionally used to mitigate radiation effects, but TMR incurs substantial overheads such as increased area and power requirements. In order to reduce these overheads while still providing sufficient radiation mitigation, we propose a reconfigurable fault tolerance (RFT) framework that enables system designers to dynamically adjust a system’s level of redundancy and fault mitigation based on the varying radiation incurred at different orbital positions. This framework includes an adaptive hardware architecture that leverages FPGA reconfiguration techniques to enable significant processing to be performed efficiently and reliably when environmental factors permit. To accurately estimate upset rates, we propose an upset rate modeling tool that captures time-varying radiation effects for arbitrary satellite orbits using a collection of existing, publicly available tools and models. We perform fault-injection testing on a prototype RFT platform to validate the RFT architecture and RFT performability models. We combine our RFT hardware architecture and the modeled upset rates using phased-mission Markov modeling to estimate the performability gains achievable using our framework for two case-study orbits.
    ACM Transactions on Reconfigurable Technology and Systems (TRETS). 12/2012; 5(4).
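    As a rough illustration of the kind of performability reasoning described above (a standard textbook form, not the authors' exact model), the reliability of a TMR-protected module over a mission phase with per-module upset rate \lambda can be written as

      R_m(t) = e^{-\lambda t}, \qquad R_{TMR}(t) = 3\,R_m(t)^2 - 2\,R_m(t)^3

    and a phased-mission analysis then weights the performance P_i of each redundancy mode by the probability \pi_i of being in that mode, E[P] = \sum_i \pi_i P_i, with \lambda varying by orbital position.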
  • ABSTRACT: With the rising number of application accelerators, developers are looking for ways to evaluate new and competing platforms quickly, fairly, and early in the development cycle. As high-performance computing (HPC) applications increase their demands on application acceleration platforms, graphics processing units (GPUs) provide a potential solution for many developers looking for increased performance. Device performance metrics, such as Computational Density (CD), provide a useful but limited starting point for device comparison. The authors developed the Realizable Utilization (RU) metric and methodology to quantify the discrepancy between the theoretical device performance shown by CD and the performance developers can actually achieve. As the RU score increases, the application achieves a larger percentage of the computational power the device can provide. The authors survey technical publications about GPUs and use this data to analyze the RU scores for several arithmetic application kernels that are frequently accelerated on GPUs. The RU concepts presented in this paper are a first step towards a formalized comparison framework for diverse devices such as CPUs, FPGAs, GPUs, and other novel architectures. GPU kernels for matrix multiplication, matrix decomposition, and N-body simulations show RU scores ranging from nearly 0% to almost 99% depending on the application, but all kernel areas show a significant decrease in RU as computational capacities increase. Additionally, the RU scores show the higher realized performance of the GeForce 8 Series GPUs versus newer GPU architectures. This paper shows that applications running on GPUs with higher computational density report significantly lower RU scores than those running on more mature GPUs with lower computational density. This trend implies that while the raw performance available is still increasing with newer GPUs, the achieved performance is not keeping pace with the theoretical capacities of the devices.
    Application Accelerators in High Performance Computing (SAAHPC), 2012 Symposium on; 01/2012
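    Consistent with the description above, an RU-style score can be expressed (in a simplified, assumed form rather than the paper's exact definition) as the fraction of a device's theoretical computational density that an application actually sustains:

      RU = \frac{P_{achieved}}{P_{theoretical}} \times 100\%

    so, for example, a kernel sustaining 150 GFLOPS on a device whose CD metric indicates 500 GFLOPS would score an RU of 30%.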
  • ABSTRACT: Multi-asset barrier contracts are path-dependent exotic options consisting of two or more underlying assets. As the dimensions of an option increase, so does the mathematical complexity of a closed-form solution. Monte Carlo (MC) methods offer an attractive solution under such conditions. MC methods have an O(n^(-1/2)) convergence rate irrespective of the dimension of the integral. However, such methods using conventional computing with CPUs are not scalable enough to enable banks to realize the potential that these exotic options promise. This paper presents an FPGA-based accelerated system architecture to price multi-asset barrier contracts. The architecture consists of a parallel set of Monte Carlo cores, each capable of simulating multiple Monte Carlo paths. Each MC core is designed to be customizable so that the core for the model (i.e., the "model" core) can be easily replaced. In our current design, a Heston core based on the full truncation Euler discretization method is used as the model core. Similarly, we can use different payoff calculator kernels to compute various payoffs such as vanilla portfolios, barriers, lookbacks, etc. The design leverages an early-termination condition of "out" barrier options to efficiently schedule MC paths across multiple cores in a single FPGA and across multiple FPGAs. The target platform for our design is Novo-G, a reconfigurable supercomputer housed at the NSF Center for High-Performance Reconfigurable Computing (CHREC), University of Florida. Our design is validated for the single-asset configuration by comparing our output to option prices calculated analytically, and it achieves average speedups ranging from 123 to 350 on one FPGA as we vary the number of underlying assets from 32 down to 4. For a configuration with 16 underlying assets, the speedup achieved is 7134 when scaled to 48 FPGAs, as compared to a single-threaded version of an SSE2-optimized C program running on a single Intel Sandy Bridge E5-2687 core at 3.1 GHz with hyper-threading turned on. Finally, the techniques described in this paper can be applied to other exotic multi-asset option classes, such as lookbacks, rainbows, and Asian-style options.
    Application Accelerators in High Performance Computing (SAAHPC), 2012 Symposium on; 01/2012
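    A hedged C sketch of one simulation path under the scheme named above: full truncation Euler discretization of the Heston model for a single-asset up-and-out call, with the early-termination check on the "out" barrier. Parameter names, the Box-Muller generator, and the call payoff are illustrative choices, not details taken from the paper.

      #include <math.h>
      #include <stdlib.h>

      static double gauss(void)                        /* Box-Muller, illustrative RNG */
      {
          double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
          double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
          return sqrt(-2.0 * log(u1)) * cos(6.283185307179586 * u2);
      }

      /* Discounted payoff of one path; 0 if the "out" barrier B is breached. */
      double heston_barrier_path(double S0, double v0, double K, double B,
                                 double r, double kappa, double theta,
                                 double xi, double rho, double T, int steps)
      {
          double dt = T / steps, S = S0, v = v0;
          for (int k = 0; k < steps; ++k) {
              double z1 = gauss();
              double z2 = rho * z1 + sqrt(1.0 - rho * rho) * gauss();
              double vp = v > 0.0 ? v : 0.0;             /* full truncation: v^+       */
              S *= exp((r - 0.5 * vp) * dt + sqrt(vp * dt) * z1);
              v += kappa * (theta - vp) * dt + xi * sqrt(vp * dt) * z2;
              if (S >= B)                                /* knock-out: terminate early */
                  return 0.0;
          }
          double payoff = S > K ? S - K : 0.0;           /* vanilla call payoff        */
          return exp(-r * T) * payoff;
      }

    Averaging this function over many paths is what each MC core repeats; the early return on a barrier breach is what allows knocked-out paths to be retired immediately and new paths scheduled across cores and FPGAs.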
  • ABSTRACT: Over the past 20 to 30 years, the analysis of tandem mass spectrometry data generated from protein fragments has become the dominant method for the identification and classification of unknown protein samples. With wide-ranging application in numerous scientific disciplines such as pharmaceutical research, cancer diagnostics, and bacterial identification, the need for accurate protein identification remains important, and the ability to produce more accurate identifications at faster rates would be of great benefit to society as a whole. As a key step towards improving the speed, and thus achievable accuracy, of protein identification algorithms, this paper presents an FPGA-based solution that considerably accelerates the Isotope Pattern Calculator (IPC), a computationally intense subroutine common in de novo protein identification. Although previous work shows incremental progress in the acceleration of software-based IPC (mainly by sacrificing accuracy for speed), to the best of our knowledge this is the first work to consider IPC on FPGAs. In this paper, we describe the design and implementation of an efficient and configurable IPC kernel. The described design provides 23 customization parameters, allowing for general use within many protein identification algorithms. We discuss several parameter tradeoffs and demonstrate experimentally their effect on performance. When comparing execution of optimized IPC software with various configurations of our hardware IPC solution, we demonstrate speedups between 72 and 566 on a single Stratix IV E530 FPGA. Finally, a favorable IPC configuration is scaled to multiple FPGAs, where a best-case speedup of 3340 on 16 FPGAs is observed when experimentally evaluated on a single node of Novo-G, the reconfigurable supercomputer in the NSF CHREC Center at the University of Florida.
    Application Accelerators in High Performance Computing (SAAHPC), 2012 Symposium on; 01/2012
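    The core arithmetic of an isotope pattern calculation is typically a repeated convolution of per-element isotope-abundance distributions; the hypothetical C sketch below shows that idea for carbon only, indexed by extra-neutron count. It is meant only to convey the computation, not the paper's configurable kernel, which also tracks exact masses and prunes low-abundance terms.

      #include <string.h>

      #define MAX_SHIFT 64   /* max tracked mass shift, in extra neutrons */

      /* out = out (*) atom, where atom[i] is the abundance of the isotope with
       * i extra neutrons (e.g., carbon: atom[0] = 0.9893, atom[1] = 0.0107). */
      static void convolve_in_place(double out[MAX_SHIFT], const double atom[MAX_SHIFT])
      {
          double tmp[MAX_SHIFT] = {0.0};
          for (int i = 0; i < MAX_SHIFT; ++i)
              for (int j = 0; j + i < MAX_SHIFT; ++j)
                  tmp[i + j] += out[i] * atom[j];
          memcpy(out, tmp, sizeof(tmp));
      }

      /* Pattern for C_n: start from a delta distribution and fold in n carbons. */
      void carbon_pattern(double pattern[MAX_SHIFT], int n_carbons)
      {
          double carbon[MAX_SHIFT] = {0.9893, 0.0107};   /* 12C, 13C abundances */
          memset(pattern, 0, MAX_SHIFT * sizeof(double));
          pattern[0] = 1.0;
          for (int k = 0; k < n_carbons; ++k)
              convolve_in_place(pattern, carbon);
      }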
  • Source
    ABSTRACT: Numerous studies have shown significant performance and power benefits of field-programmable gate arrays (FPGAs). Despite these benefits, FPGA usage has been limited by application design complexity caused largely by the lack of code and tool portability across different FPGA platforms, which prevents design reuse. This paper addresses the portability challenge by introducing a framework of architecture and middleware for virtualization of FPGA platforms, collectively named VirtualRC. Experiments show modest overhead of 5-6% in performance and 1% in area, while enabling portability of 11 applications and two high-level synthesis tools across three physical platforms.
    Proceedings of the ACM/SIGDA 20th International Symposium on Field Programmable Gate Arrays, FPGA 2012, Monterey, California, USA, February 22-24, 2012; 01/2012
  • ABSTRACT: Methods for decoding movements from neural spike counts using adaptive filters often rely on minimizing the mean-squared error. However, for non-Gaussian distributions of errors, this approach is not optimal for performance. Therefore, rather than using probabilistic modeling, we propose an alternate non-parametric approach. In order to extract more structure from the input signal (neuronal spike counts), we propose using minimum error entropy (MEE), an information-theoretic approach that minimizes the error entropy as part of an iterative cost function. However, the disadvantage of using MEE as the cost function for adaptive filters is the increase in computational complexity. In this paper we present a comparison between the decoding performance of the analytic Wiener filter and a linear filter trained with MEE, which is then mapped to a parallel architecture in reconfigurable hardware tailored to the computational needs of the MEE filter. We observe considerable speedup from the hardware design. The adaptation of filter weights for the multiple-input, multiple-output linear filters necessary in motor decoding is a highly parallelizable algorithm. It can be decomposed into many independent computational blocks with a parallel architecture readily mapped to a field-programmable gate array (FPGA) and scales to large numbers of neurons. By pipelining and parallelizing independent computations in the algorithm, the proposed parallel architecture has sublinear increases in execution time with respect to both window size and filter order.
    Conference proceedings: ... Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Conference 08/2011; 2011:4621-4.
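    As a rough illustration of the MEE computation whose parallel structure is exploited above, here is a C sketch using the Gaussian-kernel information-potential gradient common in the MEE literature; the names, the fixed window bound, and the single-output form are illustrative assumptions, not the authors' implementation.

      #include <math.h>

      #define MAX_WIN 256                          /* assumed window bound for the sketch */

      /* One MEE weight update for a linear filter over a window of N samples.
       * x: N x order inputs (row-major), d: N desired outputs, w: order weights. */
      void mee_update(int N, int order, const double *x, const double *d,
                      double *w, double sigma, double mu)
      {
          double e[MAX_WIN];
          for (int i = 0; i < N; ++i) {            /* filter errors over the window */
              double y = 0.0;
              for (int k = 0; k < order; ++k) y += w[k] * x[i * order + k];
              e[i] = d[i] - y;
          }
          for (int k = 0; k < order; ++k) {        /* information-potential gradient */
              double g = 0.0;
              for (int i = 0; i < N; ++i)
                  for (int j = 0; j < N; ++j) {
                      double de = e[i] - e[j];
                      double kern = exp(-de * de / (2.0 * sigma * sigma));
                      g += kern * de * (x[i * order + k] - x[j * order + k]);
                  }
              g /= (double)N * N * sigma * sigma;
              w[k] += mu * g;                      /* ascend the information potential */
          }
      }

    The N² kernel evaluations in the pair loop are mutually independent, which is why the computation pipelines well in hardware and why window size and filter order drive the cost.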
  • Source
    ABSTRACT: Reconfigurable Computing (RC) systems based on FPGAs are becoming an increasingly attractive solution for building the parallel systems of the future. Applications targeting such systems have demonstrated superior performance and reduced energy consumption versus their traditional counterparts based on microprocessors. However, most such work has been limited to small system sizes. Unlike traditional HPC systems, the lack of integrated, system-wide, parallel-programming models and languages presents a significant design challenge for creating applications targeting scalable, reconfigurable HPC systems. In this article, we extend the traditional Partitioned Global Address Space (PGAS) model to provide a multilevel integration of memory, which simplifies development of parallel applications for such systems and improves developer productivity. The new multilevel-PGAS programming model captures the unique characteristics of reconfigurable HPC systems, such as the existence of multiple levels of memory hierarchy and heterogeneous computation resources. Based on this model, we extend and adapt the SHMEM communication library to become what we call SHMEM+, the first known SHMEM library enabling coordination between FPGAs and CPUs in a reconfigurable, heterogeneous HPC system. Applications designed with SHMEM+ yield improved developer productivity compared to current methods of multidevice RC design and exhibit a high degree of portability. In addition, our design of the SHMEM+ library itself is portable and provides peak communication bandwidth comparable to vendor-proprietary versions of SHMEM. Application case studies are presented to illustrate the advantages of SHMEM+.
    TRETS. 01/2011; 4:26.
  • Source
    ABSTRACT: Editor's note: As part of their ongoing work with the National Science Foundation (NSF) Center for High-Performance Reconfigurable Computing (CHREC), the authors are developing a complete tool chain for FPGA-based acceleration of scientific computing, from early-stage assessment of applications down to rapid routing. This article provides an overview of this tool chain.—George A. Constantinides (Imperial College London) and Nicola Nicolici (McMaster University)
    IEEE Design and Test of Computers 01/2011; 28:68-77. · 1.62 Impact Factor
  • Source
    ABSTRACT: Power limitations in semiconductors have made explicitly parallel device architectures such as Field-Programmable Gate Arrays (FPGAs) increasingly attractive for use in scalable systems. However, mitigating the significant cost of FPGA development requires efficient design-space exploration to plan and evaluate a range of potential algorithm and platform choices prior to implementation. The authors propose the RC Amenability Test for Scalable Systems (RATSS), an analytical model which enables straightforward, fast, and reasonably accurate performance prediction prior to implementation by extending current modeling concepts to multi-FPGA designs. RATSS provides a comprehensive strategic model to evaluate applications based on the computation and communication requirements of the algorithm and capabilities of the FPGA platform. The RATSS model targets data-parallel applications on current scalable FPGA systems. Three case studies with RATSS demonstrate nearly 90% prediction accuracy as compared to corresponding implementations.
    TRETS. 01/2011; 4:27.
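    An analytical test of this style typically estimates execution time from a handful of algorithm and platform parameters before any implementation exists; a simplified form (an assumption for illustration, not RATSS's exact equations) is

      t_{comp} \approx \frac{N_{ops}}{k \cdot f}, \qquad t_{comm} \approx \frac{N_{bytes}}{B}, \qquad t_{RC} \approx t_{comm} + t_{comp}, \qquad speedup \approx \frac{t_{sw}}{t_{RC}}

    where N_{ops} is the operation count, k the achievable hardware parallelism at clock rate f, and B the sustained interconnect bandwidth; overlapping communication with computation replaces the sum with a max.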
  • Source
    ABSTRACT: Machine-learning algorithms are employed in a wide variety of applications to extract useful information from data sets, and many are known to suffer from super-linear increases in computational time with increasing data size and number of signals being processed (data dimension). Certain principal machine-learning algorithms are commonly found embedded in larger detection, estimation, or classification operations. Three such principal algorithms are the Parzen window-based, non-parametric estimation of Probability Density Functions (PDFs), K-means clustering, and correlation. Because they form an integral part of numerous machine-learning applications, fast and efficient execution of these algorithms is extremely desirable. FPGA-based reconfigurable computing (RC) has been successfully used to accelerate computationally intensive problems in a wide variety of scientific domains to achieve speedup over traditional software implementations. However, this potential benefit is quite often not fully realized because creating efficient FPGA designs is generally carried out in a laborious, case-specific manner requiring a great amount of redundant time and effort. In this paper, an approach using pattern-based decomposition for algorithm acceleration on FPGAs is proposed that offers significant increases in productivity via design reusability. Using this approach, we design, analyze, and implement a multi-dimensional PDF estimation algorithm using Gaussian kernels on FPGAs. First, the algorithm's amenability to a hardware paradigm and expected speedups are predicted. After implementation, actual speedup and performance metrics are compared to the predictions, showing speedup on the order of 20× over a 3.2 GHz processor. Multi-core architectures are developed to further improve performance by scaling the design. Portability of the hardware design across multiple FPGA platforms is also analyzed. After implementing the PDF algorithm, the value of pattern-based decomposition to support reuse is demonstrated by rapid development of the K-means and correlation algorithms. Keywords: FPGA, design patterns, machine learning, pattern recognition, hardware acceleration, performance prediction
    Journal of Signal Processing Systems 01/2011; 62(1):43-63. · 0.55 Impact Factor
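    For reference, the Parzen-window PDF estimate with Gaussian kernels named above is, in its simplest 1-D form, an average of Gaussian bumps centered on the samples; the C sketch below is illustrative only (the paper's design is multi-dimensional and hardware-pipelined).

      #include <math.h>

      /* 1-D Parzen-window density estimate with a Gaussian kernel of width sigma. */
      double parzen_pdf(double x, const double *samples, int n, double sigma)
      {
          const double SQRT_2PI = 2.5066282746310002;
          double sum = 0.0;
          for (int i = 0; i < n; ++i) {
              double u = (x - samples[i]) / sigma;
              sum += exp(-0.5 * u * u) / (SQRT_2PI * sigma);   /* kernel contribution */
          }
          return sum / n;                                      /* average over samples */
      }

    Each sample's contribution is independent, and it is this per-sample structure that the pattern-based decomposition reuses across the PDF, K-means, and correlation designs.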
  • Source
    Alan D. George, Herman Lam, Greg Stitt
    ABSTRACT: The Novo-G supercomputer's architecture can adapt to match each application's unique needs and thereby attain more performance with less energy than conventional machines. Reconfigurable computing can provide solutions for domain scientists at a fraction of the time and cost of traditional servers or supercomputers. As we describe here, the Novo-G machine, applications, research forum, and preliminary results are helping to pave the way for scalable reconfigurable computing.
    Computing in Science and Engineering 01/2011; 13:82-86. · 1.73 Impact Factor
  • Source
    ABSTRACT: Information-theoretic cost functions such as minimization of the error entropy (MEE) can extract more structure from the error signal, yielding better results in many realistic problems. However, adaptive filters (AFs) using MEE methods are more computationally intensive when compared to the conventional, mean-squared error (MSE) methods employed in the well-known least mean squares (LMS) algorithm. This paper presents a novel, parallel hardware architecture for MEE adaptive filtering. The design has been implemented and evaluated in real time on one of the servers of the Novo-G machine in the NSF CHREC Center at the University of Florida, believed to be the most powerful reconfigurable supercomputer in academia. By pipelining the design and parallelizing independent computations within the algorithm, our proposed hardware architecture successfully achieves a speedup of 5800 on one FPGA, 23200 on one quad-FPGA board, and 46400 on two quad-FPGA boards, as compared to the same algorithm running in software (an optimized C program) on a single CPU core. Just as important, our results show that this reconfigurable design does not lose precision while converging to the optimum solution in the same number of steps as the software version. As a result, our approach makes it possible for AFs using the MEE cost function to adapt in real time for signals that require a sampling rate in excess of 400 kHz and thus can target a much wider range of applications.
    High-Performance Reconfigurable Computing Technology and Applications ( HPRCTA), 2010 Fourth International Workshop on; 12/2010
  • Source
    ABSTRACT: The computing market constantly experiences the introduction of new devices, architectures, and enhancements to existing ones. Due to the number and diversity of processor and accelerator devices available, it is important to be able to objectively compare them based upon their capabilities regarding computation, I/O, power, and memory interfacing. This paper presents an extension to our existing suite of metrics to quantify additional characteristics of devices and highlight tradeoffs that exist between architectures and specific products. These metrics are applied to a large group of modern devices to evaluate their computational density, power consumption, I/O bandwidth, internal memory bandwidth, and external memory bandwidth.
    High-Performance Reconfigurable Computing Technology and Applications ( HPRCTA), 2010 Fourth International Workshop on; 12/2010
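    As one common formulation of such metrics (an assumption for illustration, not necessarily the authors' exact definitions), computational density and its power-normalized variant can be written as

      CD = f \times \sum_i N_i \ \ \text{(operations per second)}, \qquad \frac{CD}{W} = \frac{CD}{P}

    where f is the achievable clock frequency, N_i the number of parallel operations of type i sustainable per cycle, and P the device's power consumption; internal memory bandwidth is, analogously, the product of clock rate, number of accessible ports, and port width.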
  • Source
    Proceedings of the 2010 International Conference on Engineering of Reconfigurable Systems & Algorithms, ERSA 2010, July 12-15, 2010, Las Vegas, Nevada, USA; 01/2010
  • ABSTRACT: As on-chip transistor counts increase, the computing landscape has shifted to multi- and many-core devices. Computational accelerators have adopted this trend by incorporating both fixed and reconfigurable many-core and multi-core devices. As more disparate devices enter the market, there is an increasing need for concepts, terminology, and classification techniques to understand the device tradeoffs. Additionally, computational performance, memory performance, and power metrics are needed to objectively compare devices. These metrics will assist application scientists in selecting the appropriate device early in the development cycle. This article presents a hierarchical taxonomy of computing devices, concepts and terminology describing reconfigurability, and computational density and internal memory bandwidth metrics to compare devices.
    ACM Transactions on Reconfigurable Technology and Systems 01/2010; 3(4):19. · 0.52 Impact Factor
  • Source
    ABSTRACT: The field of high-performance computing (HPC) is currently undergoing a major transformation brought about by a variety of new processor device technologies. Accelerator devices (e.g., FPGAs, GPUs) are becoming increasingly popular as coprocessors in HPC, embedded, and other systems, improving application performance while in some cases also reducing energy consumption. The presence of such devices introduces additional levels of communication and memory hierarchy in the system, which warrants an expansion of conventional parallel-programming practices to address these differences. Programming models and libraries for heterogeneous, parallel, and reconfigurable computing such as SHMEM+ have been developed to support communication and coordination involving a diverse mix of processor devices. However, to evaluate the impact of communication on application performance and obtain optimal performance, a concrete understanding of the underlying communication infrastructure is often imperative. In this paper, we introduce a new multilevel communication model for representing various data transfers encountered in these systems and for predicting performance. Three use cases are presented and evaluated. First, the model enables application developers to perform early design-space exploration of communication patterns in their applications before undertaking the laborious and expensive process of implementation, yielding improved performance and productivity. Second, the model enables system developers to quickly optimize the performance of data-transfer routines within tools such as SHMEM+ when being ported to a new platform. Third, the model augments tools such as SHMEM+ to automatically improve the performance of data transfers by self-tuning internal parameters to match platform capabilities. Results from experiments with these use cases suggest marked improvement in performance, productivity, and portability.
    01/2010;
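    A transfer in such systems crosses several memory levels (e.g., host memory, interconnect, FPGA on-board memory), so an additive model of the general kind described above (assumed form, for illustration only) predicts the time for an m-byte transfer as

      t(m) \approx \sum_{i=1}^{L} \left( \alpha_i + \frac{m}{\beta_i} \right)

    where \alpha_i and \beta_i are the latency and sustained bandwidth of level i; calibrating these per platform is what enables the early design-space exploration and self-tuning uses described above.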

Publication Stats

881 Citations
21.72 Total Impact Points

Institutions

  • 1987–2012
    • University of Florida
      • • Department of Electrical and Computer Engineering
      • • Department of Computer and Information Science and Engineering
      • • Database Systems Research and Development Center
      Gainesville, Florida, United States
  • 2006
    • Union University
      • Computer Science
      Alabama, United States
  • 1992
    • Texas Instruments Inc.
      Dallas, Texas, United States
  • 1991
    • Bull HN Information Systems Inc.
      Chelmsford, Massachusetts, United States