26
Publications
5,214
Reads
611
Citations
Publications (26)
Probabilistic Sentential Decision Diagrams (PSDDs) provide efficient methods for modeling and reasoning with probability distributions in the presence of massive logical constraints. PSDDs can also be synthesized from graphical models such as Bayesian networks (BNs), therefore offering a new set of tools for performing inference on these models (in...
Sparse-Matrix Dense-Matrix multiplication (SpMM) is the key operator for a wide range of applications including scientific computing, graph processing, and deep learning. Architecting accelerators for SpMM is faced with three challenges - (1) the random memory accessing and unbalanced load in processing because of random distribution of elements in...
Streaming applications have become one of the key application domains for high-level synthesis (HLS) tools. For a streaming application, there is a potential to simplify the control logic by regulating each task with a stream of input and output data. This is called free-running optimization. But it is difficult to understand when such optimization...
Sparse-Matrix Dense-Matrix multiplication (SpMM) is the key operator for a wide range of applications, including scientific computing, graph processing, and deep learning. Architecting accelerators for SpMM is faced with three challenges - (1) the random memory accessing and unbalanced load in processing because of random distribution of elements i...
C/C++/OpenCL-based high-level synthesis (HLS) has become increasingly popular for field-programmable gate array (FPGA) accelerators in many application domains in recent years, thanks to its competitive quality of results (QoR) and short development cycles compared with the traditional register-transfer level design approach. Yet, limited by the sequ...
With the recent release of High Bandwidth Memory (HBM) based FPGA boards, developers can now exploit unprecedented external memory bandwidth. This allows more memory-bounded applications to benefit from FPGA acceleration. However, fully utilizing the available bandwidth may not be an easy task. If an application requires multiple processing element...
With the recent release of High Bandwidth Memory (HBM) based FPGA boards, developers can now exploit unprecedented external memory bandwidth. This allows more memory-bounded applications to benefit from FPGA acceleration. However, we found that it is not easy to fully utilize the available bandwidth when developing some applications with high-level...
C/C++/OpenCL-based high-level synthesis (HLS) has become increasingly popular for field-programmable gate array (FPGA) accelerators in many application domains in recent years, thanks to its competitive quality of result (QoR) and short development cycle compared with the traditional register-transfer level (RTL) design approach. Yet, limited by the...
A large semantic gap between a high-level synthesis (HLS) design and a low-level RTL simulation environment often creates a barrier for those who are not FPGA experts. Moreover, such a low-level simulation takes a long time to complete. Software HLS simulators can help bridge this gap and accelerate the simulation process; but their shortcoming is...
A large semantic gap between the high-level synthesis (HLS) design and the low-level (on-board or RTL) simulation environment often creates a barrier for those who are not FPGA experts. Moreover, such low-level simulation takes a long time to complete. Software-based HLS simulators can help bridge this gap and accelerate the simulation process; how...
Conventional homogeneous multicore processors are not able to provide the continued performance and energy improvement that we have expected from past endeavors. Heterogeneous architectures that feature specialized hardware accelerators are widely considered a promising paradigm for resolving this issue. Among different heterogeneous devices, FPGAs...
A large semantic gap between the high-level synthesis (HLS) design and the low-level (on-board or RTL) simulation environment often creates a barrier for those who are not FPGA experts. Moreover, such low-level simulation takes a long time to complete. Software-based HLS simulators can help bridge this gap and accelerate the simulation process; how...
In order to further increase the productivity of field-programmable gate array (FPGA) programmers, several design space exploration (DSE) frameworks for high-level synthesis (HLS) tools have been recently proposed to automatically determine the FPGA design parameters. However, one of the common limitations found in these tools is that they cannot f...
CPU-FPGA heterogeneous acceleration platforms have shown great potential for continued performance and energy efficiency improvement for modern data centers, and have captured great attention from both academia and industry. However, it is nontrivial for users to choose the right platform among various PCIe and QPI based CPU-FPGA platforms from dif...
Reducing radiation doses is one of the key concerns in computed tomography (CT) based 3D reconstruction. Although iterative methods such as the expectation maximization (EM) algorithm can be used to address this issue, applying this algorithm to practice is difficult due to the long execution time. Our goal is to decrease this long execution time t...
Although graph cuts (GC) is widely used in many computer vision problems, its slow execution time due to high complexity hinders wider usage. A manycore solution using a Graphics Processing Unit (GPU) may solve this problem. However, the conventional GC implementation does not fully exploit the GPU's computing power. To address this issue, a new GC algorithm...
Belief propagation (BP) is a commonly used global energy minimization algorithm for solving the stereo matching problem in 3D reconstruction. However, it requires large memory bandwidth and data size. In this paper, we propose a novel memory-efficient BP algorithm for stereo matching on Graphics Processing Units (GPUs). The data size and transfer...
We have developed a memory access reduced VLSI chip for 5,000 word speaker-independent continuous speech recognition. This chip employs a context-dependent HMM (hidden Markov model) based speech recognition algorithm, and contains parallel and pipelined hardware units for emission probability computation and Viterbi beam search. To maximize the per...
Measuring distance to obstacles is an important process for intelligent vehicles (IV). With accurate measurement, an IV can make appropriate maneuvers to avoid such obstacles. To obtain highly accurate results, we used a Markov random field model-based global energy minimization algorithm called belief propagation (BP). However, BP has high computationa...
A real-time hardware-based large vocabulary speech recognizer requires high memory bandwidth. We have developed a field-programmable-gate-array (FPGA)-based 20 000-word speech recognizer utilizing efficient dynamic random access memory (DRAM) access. This system contains all the functional blocks for hidden-Markov-model-based speaker-independent co...
We have developed a VLSI chip for 5,000 word speaker-independent continuous speech recognition. This chip employs a context-dependent HMM (hidden Markov model) based speech recognition algorithm, and contains emission probability and Viterbi beam search pipelined hardware units. The feature vector for speech recognition is computed using a hos...
We have developed a hidden Markov model based 5000-word speaker independent continuous speech recognizer using a Field-Programmable Gate Array (FPGA). The feature extraction is conducted in software on a soft-core based CPU, while the emission probability computation and the Viterbi beam search are implemented using parallel and pipelined hardware...