About
Publications: 105
Reads: 30,400
Citations: 2,421
Introduction
Dr. Zhenman Fang has been a tenure-track Assistant Professor in the School of Engineering Science (Computer Engineering Option) at Simon Fraser University since April 2019.
From September 2017 to March 2019, Zhenman worked in the Xilinx SDx group in San Jose as a Staff Software Engineer, where he worked on accelerator-rich architectures and systems, which was also the major focus of his postdoctoral research at UCLA. From July 2014 to September 2017, Zhenman was a postdoc in the Department of Computer Science at UCLA, under the supervision of Prof. Jason Cong and Prof. Glenn Reinman. While at UCLA, he was also a member of the NSF/Intel-funded multi-university Center for Domain-Specific Computing (CDSC) and the SRC/DARPA-funded multi-university Center for Future Architectures Research (C-FAR).
Current institution
Additional affiliations
August 2019 - present
April 2019 - present
September 2017 - April 2019
Publications (105)
Learned image compression (LIC) methods have recently outperformed traditional codecs such as VVC in rate-distortion performance. However, their large models and high computational costs have limited their practical adoption. In this paper, we first construct a high-capacity teacher model by integrating Swin-Transformer V2-based attention modules,...
Neural network training is a memory- and compute-intensive task. Quantization, which enables low-bitwidth formats in training, can significantly mitigate the workload. To reduce quantization error, recent methods have developed new data formats and additional pre-processing operations on quantizers. However, it remains quite challenging to achieve...
Deep learning-based image compression has made great progress recently. However, some leading schemes use a serial context-adaptive entropy model to improve the rate-distortion (R-D) performance, which is very slow. In addition, the complexities of the encoding and decoding networks are quite high and not suitable for many practical applications. I...
In recent years, the adoption of FPGAs in datacenters has increased, with a growing number of users choosing High-Level Synthesis (HLS) as their preferred programming method. While HLS simplifies FPGA programming, one notable challenge arises when scaling up designs for modern datacenter FPGAs that comprise multiple dies. The extra delays introduce...
Vision transformers (ViTs) have demonstrated their superior accuracy for computer vision tasks compared to convolutional neural networks (CNNs). However, ViT models are often computation-intensive for efficient deployment on resource-limited edge devices. This work proposes Quasar-ViT, a hardware-oriented quantization-aware architecture search fram...
Recently, learned image compression (LIC) has achieved great progress and even outperformed the traditional approach using DCT or discrete wavelet transform (DWT). However, LIC mainly reduces spatial redundancy in the autoencoder networks and entropy coding, but has not fully removed the frequency-domain correlation explicitly as in DCT or DWT. To l...
Today’s big data query engines are constantly under pressure to keep up with the rapidly increasing demand for faster processing of more complex workloads. In the past few years, FPGA-based database acceleration efforts have demonstrated promising performance improvement with good energy efficiency. However, few studies target the programming and d...
In this paper, we propose TAPA, an end-to-end framework that compiles a C++ task-parallel dataflow program into a high-frequency FPGA accelerator. Compared to existing solutions, TAPA has two major advantages. First, TAPA provides a set of convenient APIs that allow users to easily express flexible and complex inter-task communication structures. S...
Binary neural network (BNN), where both the weight and the activation values are represented with one bit, provides an attractive alternative to deploy highly efficient deep learning inference on resource-constrained edge devices. However, our investigation reveals that, to achieve satisfactory accuracy gains, state-of-the-art (SOTA) BNNs, such as...
The k-nearest neighbors (KNN) algorithm is an essential algorithm in many applications, such as similarity search, image classification, and database query. With the rapid growth in the dataset size and the feature dimension of each data point, processing KNN becomes more compute and memory hungry. Most prior studies focus on accelerating the compu...
Many aerospace and automotive applications use FPGAs in their designs due to their low power and reconfigurability requirements. Meanwhile, such applications also pose a high standard on system reliability, which makes the early-stage reliability analysis for FPGA-based designs very critical. In this paper, we present a framework that enables fast...
Stencil computation is one of the fundamental computing patterns in many application domains such as scientific computing and image processing. While there are promising studies that accelerate stencils on FPGAs, there is still a lack of an automated acceleration framework to systematically explore both spatial and temporal parallelisms for iterative stencils t...
Accurate and timely detection of multiscale small objects that contain tens of pixels from remote sensing images (RSI) remains challenging. Most of the existing solutions primarily design complex deep neural networks to learn strong feature representations for objects separated from the background, which often results in a heavy computation burden....
While vision transformers (ViTs) have continuously achieved new milestones in the field of computer vision, their sophisticated network architectures with high computation and memory costs have impeded their deployment on resource-limited edge devices. In this paper, we propose a hardware-efficient image-adaptive token pruning framework called Heat...
Stochastic rounding is a critical technique used in low-precision deep neural network (DNN) training to ensure good model accuracy. However, it requires a large number of random numbers generated on the fly. This is not a trivial task on hardware platforms such as FPGAs and ASICs. The widely used solution is to introduce random number generator...
In today’s FPGA-based soft-processors, one of the slowest instructions is integer division. Compared to the low single-digit latency of other arithmetic operations, the fixed 32-cycle latency of radix-2 division is substantially longer. Given that today’s soft-processors typically only implement radix-2 division—if they support hardware division at...
The most advanced ASIC-based approximate adders focus on gate-level or transistor-level approximating structures. However, due to architectural differences between ASICs and FPGAs, comparable performance gains for FPGA-based approximate adders cannot be obtained using ASIC-based approximate adder designs. In this paper, we propose a method for designing a l...
In this paper, we propose an accurate yet fast small object detection method for RSI, named SuperYOLO, which fuses multimodal data and performs high resolution (HR) object detection on multiscale objects by utilizing the assisted super resolution (SR) learning and considering both the detection accuracy and computation cost. First, we construct a c...
Vision transformers (ViTs) are emerging with significantly improved accuracy in computer vision tasks. However, their complex architecture and enormous computation/storage demand impose urgent needs for new hardware accelerator design methodology. This work proposes an FPGA-aware automatic ViT acceleration framework based on the proposed mixed-sche...
The emergence of high-bandwidth memory (HBM) brings new opportunities to boost the performance of sorting acceleration on FPGAs, which was conventionally bounded by the available off-chip memory bandwidth. However, it is nontrivial for designers to fully utilize this immense bandwidth. First, the existing sorter designs cannot be directly scaled at...
Both modern datacenter and embedded FPGAs provide great opportunities for high-performance and high energy-efficiency computing. With the growing public availability of FPGAs from major cloud service providers such as AWS, Alibaba, and Nimbix, as well as uniform hardware accelerator development tools (such as Xilinx Vitis and Intel oneAPI) for soft...
Compared to conventional artificial neural networks, Spiking Neural Networks (SNNs) are more biologically plausible and require less computation due to their event-driven nature of spiking neurons. However, the default asynchronous execution of SNNs also poses great challenges to accelerate their performance on FPGAs. In this work, we present a nov...
Recently, the convolutional neural network (CNN)-based approach for on-satellite ship detection in synthetic aperture radar (SAR) images has received increasing attention since it does not rely on predefined imagery features and distributions that are required in conventional detection methods. To achieve high detection accuracy, most of the existi...
FPGA-based accelerators are increasingly popular across a broad range of applications, because they offer massive parallelism, high energy efficiency, and great flexibility for customizations. However, difficulties in programming and integrating FPGAs have hindered their widespread adoption. Since the mid 2000s, there has been extensive research an...
Deep neural networks (DNNs) are increasingly being deployed in safety-critical systems such as personal healthcare devices and self-driving cars. In such DNN-based systems, error resilience is a top priority since faults in DNN inference could lead to mispredictions and safety hazards. For latency-critical DNN inference on resource-constrained edge...
Recently, deep neural networks (DNNs) have been deployed in safety-critical systems such as autonomous vehicles and medical devices. Shortly after that, the vulnerability of DNNs was revealed by stealthy adversarial examples where crafted inputs -- by adding tiny perturbations to original inputs -- can lead a DNN to generate misclassification outp...
Nowadays nanoscale combinational circuits are facing significant reliability challenges including soft errors and process variations. This paper presents novel process variation-aware placement strategies that include two algorithms to increase the reliability of combinational circuits against both Single Event Transients (SETs) and Multiple Event...
Adversarial bit-flip attack (BFA) on Neural Network weights can result in catastrophic accuracy degradation by flipping a very small number of bits. A major drawback of prior bit flip attack techniques is their reliance on test data. This is frequently not possible for applications that contain sensitive or proprietary data. In this paper, we propo...
With the public availability of FPGAs from major cloud service providers like AWS, Alibaba, and Nimbix, hardware and software developers can now easily access FPGA platforms. However, it is nontrivial to develop efficient FPGA accelerators, especially for software programmers who use high-level synthesis (HLS). The major goal of this paper is to fi...
With CMOS technology scaling, transistor aging has become a major issue affecting circuit reliability and lifetime. There are two major classes of existing studies that model the aging effects in the circuit delay. One is at the transistor level, which is highly accurate but very slow. The other is at the gate level, which is faster but less accurate...
Genome sequencing is one of the key applications in healthcare and has a great potential to realize precision medicine and personalized healthcare. However, its computing process is very time consuming. Even pre-processing the raw sequence data of a whole genome for a single person to the analysis ready data can take several days on a single-core C...
The power and utilization walls in today's processors have led to a recent focus on accelerator-rich architectures (ARAs), which include a sea of customized accelerators with orders-of-magnitude performance and energy gains. Meanwhile, some researchers wonder how the reported large gains are achieved, considering that ARAs use a similar memory hier...
Conventional homogeneous multicore processors are not able to provide the continued performance and energy improvement that we have expected from past endeavors. Heterogeneous architectures that feature specialized hardware accelerators are widely considered a promising paradigm for resolving this issue. Among different heterogeneous devices, FPGAs...
Since its establishment in 2009, the Center for Domain-Specific Computing (CDSC) has focused on customizable computing. We believe that future computing systems will be customizable with extensive use of accelerators, as custom-designed accelerators often provide 10-100X performance/energy efficiency over the general-purpose processors. Such an acc...
With the recent advancement of multilayer convolutional neural networks (CNN) and fully connected networks (FCN), deep learning has achieved amazing success in many areas, especially in visual content understanding and classification. To improve the performance and energy efficiency of the computation-demanding CNN, the FPGA-based acceleration emerg...
FPGA-based heterogeneous architectures provide programmers with the ability to customize their hardware accelerators for flexible acceleration of many workloads. Nonetheless, such advantages come at the cost of sacrificing programmability. FPGA vendors and researchers attempt to improve the programmability through high-level synthesis (HLS) technol...
Data compression techniques have been widely used to reduce the data storage and movement overhead, especially in the big data era. While FPGAs are well suited to accelerate the computation-intensive lossless compression algorithms, big data compression with parallel requests in nature poses two challenges to the overall system throughput. First, s...
In conventional Hadoop MapReduce applications, I/O used to play a heavy role in the overall system performance. More recently, a study from the Apache Spark community—state-of-the-art in-memory cluster computing framework—reports that I/O is no longer the bottleneck and has a marginal performance impact on applications like SQL processing. However,...
With the slowing down of Moore's law, major cloud service providers (such as Amazon Web Services, Microsoft Azure, and Alibaba Cloud) all started deploying FPGAs in their cloud platforms to improve the performance and energy-efficiency. From the perspective of performance per unit cost in the cloud, it is essential to efficiently utilize all avai...
The notorious power wall has significantly limited the scaling for general-purpose processors. To address this issue, various accelerators, such as GPUs and FPGAs, emerged to achieve better performance and energy-efficiency. Between these two programmable accelerators, a natural question arises: which applications are better suited for FPGAs, which...
Computational genomics plays an important role in health care, but is computationally challenging as most genomics applications use large data sets and are both computation-intensive and memory-intensive. Recent approaches with on-chip hardware accelerators can boost computing capability and energy efficiency, but are limited by the memory requirem...
Due to the limited scaling of general-purpose CPUs, FPGAs have emerged as an attractive alternative to accelerate big data applications due to their low power, high performance and energy efficiency. In this paper we aim to answer one key question: How should the multicore CPU and FPGA coordinate together to optimize the performance of big data app...
To efficiently process a tremendous amount of data, today's big data applications tend to distribute the datasets into multiple partitions, such that each partition can be fit into memory and be processed by a separate core/server in parallel. Meanwhile, due to the limited scaling of general-purpose CPUs, FPGAs have emerged as an attractive alterna...
While emerging accelerator-centric architectures offer orders-of-magnitude performance and energy improvements, use cases and adoption can be limited by their rigid programming model. A unified virtual address space between the host CPU cores and customized accelerators can largely improve the programmability, which necessitates hardware support fo...
Molecular dynamics (MD) simulation is one of the past decade's most important tools for enabling biologists and researchers to explore human health and diseases. However, due to the computational complexity of the MD algorithm, it takes weeks or even months to simulate a comparatively simple biological entity on conventional multicore processors...
With the recent advancement of multilayer convolutional neural networks (CNN), deep learning has achieved amazing success in many areas, especially in visual content understanding and classification. To improve the performance and energy-efficiency of the computation-demanding CNN, the FPGA-based acceleration emerges as one of the most attractive a...
Compared to conventional general-purpose processors, accelerator-rich architectures (ARAs) can provide orders-of-magnitude performance and energy gains and are emerging as one of the most promising solutions in the age of dark silicon. However, many design issues related to the complex interaction between general-purpose cores, accelerators, custom...
With the end of CPU core scaling due to dark silicon limitations, customized accelerators on FPGAs have gained increased attention in modern datacenters due to their lower power, high performance and energy efficiency. Evidenced by Microsoft's FPGA deployment in its Bing search engine and Intel's $16.7 billion acquisition of Altera, integrating FPG...
FPGA-enabled datacenters have shown great potential for providing performance and energy efficiency improvement. In this paper we aim to answer one key question: how can we efficiently integrate FPGAs into state-of-the-art big-data computing frameworks like Spark? To provide a generalized methodology and insights for efficient integration, we condu...
CPU-FPGA heterogeneous acceleration platforms have shown great potential for continued performance and energy efficiency improvement for modern data centers, and have captured great attention from both academia and industry. However, it is nontrivial for users to choose the right platform among various PCIe and QPI based CPU-FPGA platforms from dif...
Customized pipeline designs that minimize the pipeline initiation interval (II) maximize the throughput of FPGA accelerators designed with high-level synthesis (HLS). What is the impact of minimizing II on energy efficiency? Using a matrix-multiply accelerator, we show that matrix multiplies with II>1 can sometimes reduce dynamic energy below II=1...
Compared to conventional general-purpose processors, accelerator-rich architectures (ARAs) can provide orders-of-magnitude performance and energy gains. In this paper we design and implement the ARAPrototyper to enable rapid design space explorations for ARAs in real silicon and reduce the tedious prototyping efforts. First, ARAPrototyper provides...
The power wall and utilization wall in today’s processors have led to a focus on accelerator-rich architecture, which will include a sea of accelerators that can achieve orders-of-magnitude performance and energy gains. The emerging accelerator-rich architecture is still in its early stage, and many design issues, such as the efficient accelerator...
As multicore and many-core architectures evolve, their memory systems are becoming increasingly more complex. To bridge the latency and bandwidth gap between the processor and memory, they often use a mix of multilevel private/shared caches that are either blocking or nonblocking and are connected by high-speed network-on-chip. Moreover, they also...
Data prefetching is an important technique for hiding memory latency. The latest microarchitectures provide support for both hardware and software prefetching. However, the architectural features supporting each are different. In addition, these features can vary from one architecture to another. As a result, the choice of the right prefetching strat...