Hardware Accelerator architecture. The CPU 'offloads' its task to the FPGA, which performs the action and returns the result to the CPU in an expedited manner.

Source publication
Article
Full-text available
The major requirements of a good tomographic reconstruction algorithm are reduction in radiation dosage, accurate reconstruction, detail enhancement and rapid reconstruction time. Some of these factors are covered by many algorithms, but are not collectively addressed in one. While the Maximum Likelihood Expectation Maximization (MLEM) algorithm fa...
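For background on the MLEM update named in the abstract: each iteration forward projects the current estimate, compares it with the measured data, and back projects the measured-to-estimated ratio as a multiplicative correction. A minimal C sketch of one iteration, with the dense system-matrix layout and buffer names as illustrative assumptions (a practical implementation would use a sparse or on-the-fly projector):

#include <stddef.h>

/* One MLEM iteration:
 *   x[j] <- x[j]/s[j] * sum_i a[i][j] * p[i]/q[i]
 * where q[i] = sum_j a[i][j]*x[j] (forward projection) and
 * s[j] = sum_i a[i][j] (sensitivity). Layout is illustrative. */
void mlem_iteration(const double *a,  /* system matrix, n_rays x n_pix, row-major */
                    const double *p,  /* measured projections, length n_rays */
                    double *x,        /* image estimate, length n_pix, updated in place */
                    double *q,        /* scratch: forward projection, length n_rays */
                    double *r,        /* scratch: back projected ratios, length n_pix */
                    size_t n_rays, size_t n_pix)
{
    for (size_t i = 0; i < n_rays; ++i) {          /* forward projection q = A x */
        q[i] = 0.0;
        for (size_t j = 0; j < n_pix; ++j)
            q[i] += a[i * n_pix + j] * x[j];
    }
    for (size_t j = 0; j < n_pix; ++j)
        r[j] = 0.0;
    for (size_t i = 0; i < n_rays; ++i) {          /* back project p ./ q */
        double ratio = (q[i] > 0.0) ? p[i] / q[i] : 0.0;
        for (size_t j = 0; j < n_pix; ++j)
            r[j] += a[i * n_pix + j] * ratio;
    }
    for (size_t j = 0; j < n_pix; ++j) {           /* multiplicative update */
        double s = 0.0;
        for (size_t i = 0; i < n_rays; ++i)
            s += a[i * n_pix + j];
        if (s > 0.0)
            x[j] *= r[j] / s;
    }
}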

Context in source publication

Context 1
... EST - Estimate; BP - Back Projection; Proj - Projection. The primary aim of hardware acceleration is to increase computational speed by using custom hardware specially designed to implement a particular routine or algorithm. The hardware accelerator architecture is shown in Fig. 4. The advantage of this is that the CPU is free to process other data while the necessary computation for the accelerated routine is offloaded to the hardware co-processor. The processor is involved only in setting up the co-processor to begin its calculation and in receiving the results, as the sketch below illustrates. To realize the ...
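To make the offload sequence concrete, here is a minimal host-side sketch in C. The register map, addresses, and bit definitions are entirely hypothetical, since the figure does not specify a control interface; a real co-processor would define its own, and would ideally signal completion with an interrupt rather than polling.

#include <stdint.h>

/* Hypothetical memory-mapped register map for an FPGA co-processor;
 * real addresses depend on the platform's address map. */
#define COPROC_BASE   0x43C00000u
#define REG_CTRL      (*(volatile uint32_t *)(COPROC_BASE + 0x00))
#define REG_STATUS    (*(volatile uint32_t *)(COPROC_BASE + 0x04))
#define REG_SRC_ADDR  (*(volatile uint32_t *)(COPROC_BASE + 0x08))
#define REG_DST_ADDR  (*(volatile uint32_t *)(COPROC_BASE + 0x0C))
#define CTRL_START    0x1u
#define STATUS_DONE   0x1u

/* CPU-side offload: set up the co-processor, start it, and collect
 * the result. The CPU is free to process other data between the
 * start and the completion check. */
void offload_task(uint32_t src, uint32_t dst)
{
    REG_SRC_ADDR = src;          /* tell the accelerator where the input lives */
    REG_DST_ADDR = dst;          /* ...and where to write the result           */
    REG_CTRL     = CTRL_START;   /* start the computation                      */

    /* do_other_cpu_work(); -- the CPU can proceed with other tasks here */

    while ((REG_STATUS & STATUS_DONE) == 0)
        ;                        /* wait for completion (an IRQ would avoid polling) */
}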

Similar publications

Preprint
Full-text available
A methodology for designing application-specific processors in an HDL (Verilog, VHDL) for implementation on an FPGA or an SoC.

Citations

... In this figure, the central processing unit (CPU) 'offloads' its task to any of the accelerators, which executes the assignment faster and delivers the result to the CPU [7]. Differential equations are often referred to as nature's language because they provide a way to describe and understand many natural phenomena in a precise and systematic way. ...
Preprint
Full-text available
Technology progress has changed the way research and development tasks are done. The capability to perform intricate system modeling and simulation is now crucial in engineering, physics, chemistry, biology, and other industries relying on scientific computing. As these scientific computing workloads require quick and efficient execution, there is an increasing need for disruptive technology. This is where hardware accelerators enter the picture. These accelerators, capable of managing intricate system modeling and simulation, hold promise for improving precision and dependability in research and development results, thereby saving time and resources. In this survey, we define, summarize, and analyze the accelerators required in different scientific computing domains. We also propose a taxonomy based on these aspects: implementation methods; types of implementations; host coupling; cost factors; and applications, grouped into five macro-categories. We then observe and categorize the intersection of micro-architectures with differential equations in three areas of the scientific computing domain: acceleration for higher-order nonlinear systems, acceleration based on differential equations using off-the-shelf accelerators, and acceleration using customized co-processors. Lastly, we close the survey with a brief summary. We believe these methods will be useful for researchers aiming to build new hardware accelerators in the scientific computing area.
... This approach helps to minimize CPU overload and memory space consumption, meeting the evolving demands of computational tasks in various domains of scientific computing [20][21][22][23][24]. Figure 2 presents an overview of a hardware accelerator architecture [20], showing how the central processing unit (CPU) 'offloads' its task to the accelerator, which then executes the task faster than the CPU and delivers the results directly back to it. ...
Thesis
Full-text available
Along with the advancement of technology, the role of hardware accelerators is growing consistently, delivering advances in scientific simulations and data analysis in scientific computing, signal processing tasks in communication systems, matrix operations, and neural network computations in artificial intelligence and machine learning models. On the other hand, several high-speed computer applications in this era of high-performance computing often depend on ordinary differential equations (ODEs); however, their nonlinear nature can make analytic solutions difficult to obtain. Consequently, numerical approaches prove effective in delivering approximate solutions to these equations. This research discusses the implementation of a customized hardware accelerator for solving an ordinary differential equation (ODE) using numerical approaches, while evaluating several performance metrics, including on-chip power consumption, FPGA hardware resources, and timing summary. The Xilinx AXI4-Stream single-precision floating-point IP (a third-party vendor core) was used to develop the accelerator, which determines the iterative approximation of the ODE solution with those methods. The entire work uses the VHDL hardware description language and the Xilinx Vivado Design Suite and has been deployed on the Zynq-ZC702 FPGA Evaluation Board, along with a design space exploration.
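The thesis does not list its exact numerical methods in this abstract, so forward Euler stands in below as a representative scheme; the right-hand side f and the initial condition are illustrative assumptions. Single precision mirrors the single-precision floating-point IP used on the FPGA.

#include <stdio.h>

/* Hypothetical right-hand side f(t, y) of the ODE dy/dt = f(t, y);
 * a nonlinear example: dy/dt = -y^2 + t. */
static float f(float t, float y) { return -y * y + t; }

int main(void)
{
    /* Forward Euler: y_{n+1} = y_n + h * f(t_n, y_n). */
    float t = 0.0f, y = 1.0f;   /* initial condition y(0) = 1 (illustrative) */
    const float h = 0.01f;      /* step size */
    for (int n = 0; n < 100; ++n) {
        y += h * f(t, y);
        t += h;
    }
    printf("y(%.2f) ~= %f\n", t, y);
    return 0;
}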
... Slack = DataRequiredTime - DataArrivalTime. (5) By changing the time instant for the operations of the main processing unit as explained in the second strategy (see Figure 10), the critical data path is now between the first DFF and the Sub-Byte IP, as shown in Figure 16. The value of the slack time has been reduced from -1.63 ns to -1.05 ns using the second timing trigger strategy, as shown in Figure 17. ...
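Reading Equation (5) with illustrative numbers (only the resulting slack values appear in the text; the required and arrival times below are assumptions):

Slack = DataRequiredTime - DataArrivalTime
      = 10.00 ns - 11.63 ns = -1.63 ns   (before the second strategy)
      = 10.00 ns - 11.05 ns = -1.05 ns   (after the second trigger strategy)

A negative slack means the data arrives after it is required, so the retimed trigger recovers 0.58 ns of timing margin.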
Article
Full-text available
For real-time video processing, the analysis time is a big challenge for researchers. Since digital images from cameras or any image source can be quite large, it is common practice for researchers to divide these large images into smaller sub-images. The present study proposes a subsystem module to read and display the region of interest (ROI) of real-time video signals for static camera applications, in preparation for background subtraction (BGS) algorithm operation. The proposed subsystem was developed using the Verilog hardware description language (HDL), synthesized, and implemented on the ZYBO Z7-10 platform. An ROI background image of 360×360 resolution was selected to test the operation of the module in real time. The subsystem consists of five basic modules. Timing analysis was used to determine the real-time performance of the proposed subsystem. Multiple clock-domain frequencies are used to manage the module operations: 445.5 MHz, 222.75 MHz, 148.5 MHz, and 74.25 MHz, which are six, three, two, and one times the pixel clock frequency, respectively. These frequencies are chosen to perform the five basic processing operations in real time within the pixel period. Two strategies are selected to explain the effect of the chosen trigger instant of the clock signals on system performance. The operation revealed that the latency of the proposed ROI reading subsystem was 13.468 ns (one pixel period), which matches the requirements of real-time applications.
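The clock relationships in this abstract can be checked with simple arithmetic; a small C sketch using only the figures quoted above:

#include <stdio.h>

int main(void)
{
    /* Pixel clock and its multiples, as listed in the abstract (MHz). */
    const double pixel_clk_mhz = 74.25;
    const int multiples[] = { 6, 3, 2, 1 };  /* -> 445.5, 222.75, 148.5, 74.25 MHz */

    for (int i = 0; i < 4; ++i)
        printf("%dx pixel clock = %.2f MHz\n",
               multiples[i], multiples[i] * pixel_clk_mhz);

    /* One pixel period = 1 / 74.25 MHz ~= 13.468 ns, matching the
     * reported latency of the ROI reading subsystem. */
    printf("pixel period = %.3f ns\n", 1e3 / pixel_clk_mhz);
    return 0;
}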
... In addition, these methods often fail to deal with noise and complex structures in images, which limits their practicality in clinical application. With the rapid development of deep learning technology, especially the rise of models such as Convolutional Neural Networks (CNN), Generative Adversarial Networks (GAN), and Autoencoders, the field of medical image reconstruction and enhancement has ushered in new opportunities and challenges [7][8][9]. Through end-to-end learning, deep learning models can learn richer and more abstract feature representations from the original data and thereby achieve accurate reconstruction and enhancement of medical images. Compared with traditional methods, deep learning models have higher accuracy and adaptability and can better handle medical image data of different types and quality. ...
Article
Full-text available
In recent years, deep learning technology has made remarkable progress in medical image reconstruction and enhancement and has become one of the research hotspots in the field of medical image processing. This paper discusses the latest research progress and applications of deep learning in medical image reconstruction and enhancement. First, the importance of medical image reconstruction and enhancement and the limitations of traditional methods are introduced. Then, the application of deep learning models, including Convolutional Neural Networks (CNN), Generative Adversarial Networks (GAN), and Autoencoders, in medical image processing is discussed in detail. Specifically, the image reconstruction ability of CNN models, the image enhancement effect of GAN models, and the image denoising and reconstruction of Autoencoder models are analyzed and compared. The advantages and challenges of deep learning models in medical image processing are then discussed, along with future development directions. Finally, the research results of this paper are summarized and prospects for future research are put forward. This research provides enlightenment and reference for researchers and practitioners in the field of medical image processing, helping to promote the continuous innovation and progress of medical image processing technology.
... Images are used as inputs in a variety of systems [1,2,3], and techniques for extracting information from them are the foundation of many applications such as simultaneous localization and mapping (SLAM) [4], autonomous driving safety [5], and human-robot interaction [6]. However, achieving high-speed image pro- ...
Article
Full-text available
Computer vision plays a critical role in many applications, particularly in the domain of autonomous vehicles. To achieve high-level image processing tasks such as image classification and object tracking, it is essential to extract low-level features from the image data. However, in order to integrate these compute-intensive tasks into a control loop, they must be completed as quickly as possible. This paper presents a novel FPGA-based system for fast and accurate image feature extraction, specifically designed to meet the constraints of data fusion in autonomous vehicles. The system computes a set of generic statistical image features, including contrast, homogeneity, and entropy, and is implemented on two Xilinx FPGA platforms - an Alveo U200 Data Center Accelerator Card and a Zynq UltraScale+ MPSoC ZCU104 Evaluation Kit. Experimental results show that the proposed system achieves high-speed image feature extraction with low latency, making it well-suited for use in autonomous vehicle systems that require real-time image processing. The presented system can also be easily extended to extract additional features for various image and data fusion applications.
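The features named in this abstract (contrast, homogeneity, entropy) are commonly defined over a grey-level co-occurrence matrix (GLCM); whether this particular system uses a GLCM is not stated, so the following C sketch is only a generic software reference for those definitions, with the GLCM construction and normalization assumed to have happened already:

#include <math.h>
#include <stdlib.h>

#define LEVELS 8   /* number of grey levels in the (hypothetical) quantized image */

/* Compute contrast, homogeneity, and entropy from a normalized grey-level
 * co-occurrence matrix p[LEVELS][LEVELS] (entries sum to 1). The GLCM
 * construction itself (pixel pairing, offset, normalization) is assumed. */
void glcm_features(const double p[LEVELS][LEVELS],
                   double *contrast, double *homogeneity, double *entropy)
{
    *contrast = *homogeneity = *entropy = 0.0;
    for (int i = 0; i < LEVELS; ++i) {
        for (int j = 0; j < LEVELS; ++j) {
            double v = p[i][j];
            *contrast    += (double)((i - j) * (i - j)) * v;
            *homogeneity += v / (1.0 + abs(i - j));
            if (v > 0.0)
                *entropy -= v * log2(v);
        }
    }
}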
... But there is currently no research into FPGA implementations of JPEG XS entropy codecs. FPGAs are becoming particularly popular as hardware accelerators and are known for their programmability, configurability, and massive parallelism through large numbers of configurable logic blocks (CLBs) [17]. ...
Preprint
Full-text available
JPEG XS is the latest international standard for shallow compression released by the International Organization for Standardization (ISO); the coding standard was officially published in 2019. JPEG XS streams can be encoded and decoded on different devices, but there has been no research on implementing the JPEG XS entropy codec on FPGAs. This paper briefly introduces JPEG XS encoding, proposes a modular FPGA design scheme for the entropy encoder and decoder, and parallelizes the algorithm in the JPEG XS coding standard according to the characteristics of FPGA parallel processing, mainly through low-latency optimization and storage-space optimization. As a result, the encoding speed reaches 4 coefficients/clock and the decoding speed reaches 2 coefficients/clock, reducing the encoding and decoding time by 75%. The maximum clock frequency of the entropy encoder is about 222.6 MHz, and the maximum clock frequency of the entropy decoder is about 127 MHz. The design and implementation of the FPGA-based JPEG XS entropy encoding and decoding algorithm is of great significance and provides ideas for the subsequent implementation and optimization of the entire JPEG XS standard on FPGAs. This work is the first to propose a design and implementation of the JPEG XS entropy encoding and decoding process on FPGA, creating the possibility for effective application of the JPEG XS standard in more media.
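The reported rates imply the following peak throughputs (simple arithmetic on the abstract's numbers):

encoder: 4 coefficients/clock x 222.6 MHz ~ 890.4 Mcoefficients/s
decoder: 2 coefficients/clock x 127.0 MHz = 254.0 Mcoefficients/s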
... SIRT, though, is very slow to reconstruct an image because many iterations are needed to achieve a sufficiently high-precision image. In addition, SIRT creates a distorted smoothing effect [62]. ...
... (SART). SART, which combines the ART and SIRT algorithms, has been proposed as an upgrade to both [62]. ART converges quickly, whereas SIRT produces a high-quality image, so SART is expected to inherit useful characteristics of both. ...
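For reference, one simultaneous SART sweep applies an additive, relaxed correction, in contrast to MLEM's multiplicative update. A minimal C sketch, with the dense matrix layout, scratch buffer, and relaxation factor as illustrative assumptions:

#include <stddef.h>

/* One SART sweep over all rays. Matrix layout and lambda are illustrative. */
void sart_iteration(const double *a,  /* system matrix, n_rays x n_pix, row-major */
                    const double *p,  /* measured projections, length n_rays */
                    double *x,        /* image estimate, length n_pix, updated in place */
                    double *resid,    /* scratch: normalized residual per ray */
                    size_t n_rays, size_t n_pix, double lambda)
{
    /* Normalized residual for each ray: (p_i - sum_k a_ik x_k) / sum_k a_ik */
    for (size_t i = 0; i < n_rays; ++i) {
        double q = 0.0, row = 0.0;
        for (size_t k = 0; k < n_pix; ++k) {
            q   += a[i * n_pix + k] * x[k];
            row += a[i * n_pix + k];
        }
        resid[i] = (row > 0.0) ? (p[i] - q) / row : 0.0;
    }
    /* Back project the residuals and apply a relaxed additive update. */
    for (size_t j = 0; j < n_pix; ++j) {
        double num = 0.0, den = 0.0;
        for (size_t i = 0; i < n_rays; ++i) {
            num += a[i * n_pix + j] * resid[i];
            den += a[i * n_pix + j];
        }
        if (den > 0.0)
            x[j] += lambda * num / den;
    }
}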
Article
Full-text available
Image reconstruction in magnetic resonance imaging (MRI) and computed tomography (CT) is a mathematical process that generates images from data acquired at many different angles around the patient. Image reconstruction has a fundamental impact on image quality. In recent years, the literature has focused on deep learning and its applications in medical imaging, particularly image reconstruction. Owing to the performance of deep learning models in a wide variety of vision applications, a considerable amount of work has recently been carried out on image reconstruction in medical images. MRI and CT stand out as the most scientifically appropriate imaging modalities for identifying and diagnosing different diseases in this age of rapid technological advancement. This study demonstrates a number of deep learning image reconstruction approaches and gives a comprehensive review of the most widely used databases. We also present the challenges and promising future directions for medical image reconstruction.
... The development of new circuit design methods and the improvement of existing ones are actively being pursued to solve this problem. The design of specialized hardware circuits such as Field-Programmable Gate Arrays (FPGA) [6], Application-Specific Integrated Circuits (ASIC) [7], and Systems-on-a-Chip (SoC) [8] is one of the main approaches to improving digital devices' efficiency. ...
... By contrast, the main challenge of interventional CT is to display the reconstructed images in real time with an acceptable image quality necessary for the smooth functioning of interventional procedures. To overcome the constraints induced by image quality, X-ray dose reduction, and real-time capability, the development of efficient algorithms and their implementation utilizing task and/or data parallelism in hardware accelerators such as graphics processing units (GPU), digital signal processors (DSP) and field programmable gate arrays (FPGA) is an active research area [3][4][5][6][7]. Alcaín et al. [7] published a survey about the different usage of various hardware accelerators in real-time medical imaging. ...
... For exploring new custom co-processors, FPGAs are well-suitable platforms. In contrast to CPUs, GPUs, and DSPs that have a fixed instruction-set architecture (ISA) and data representations, FPGAs allow designers to define custom hardware architectures and to explore custom data representations [6]. Therefore, they can be used for exploring the design space, where different custom and standard data formats are defined and selected. ...
... Therefore, as shown in Formula (7), the I0-correction can be performed directly on raw sensor data with basic operations provided by most math co-processors. This mathematical optimization reduces resource utilization and execution time compared to the implementation of Formula (6). ...
Article
Full-text available
In computed tomography imaging, the computationally intensive tasks are the pre-processing of 2D detector data to generate total attenuation (line integral) projections and the reconstruction of the 3D volume from the projections. This paper proposes optimizing the X-ray pre-processing to compute total attenuation projections while avoiding the intermediate step of converting detector data to intensity images. In addition, to fulfill the real-time requirements, we design a configurable hardware architecture for data acquisition systems on FPGAs, with the goal of 'on-the-fly' pre-processing of 2D projections. Finally, this architecture was configured for exploring and analyzing different arithmetic representations, such as floating-point and fixed-point data formats. This design space exploration has allowed us to find the representation and data format that minimize execution time and hardware costs without affecting image quality. Furthermore, the proposed architecture was integrated in an open-interface computed tomography device, used for evaluating the image quality of the pre-processed 2D projections and the reconstructed 3D volume. Compared with the state-of-the-art pre-processing algorithm that makes use of intensity images, the latency was decreased by 4.125× and the resource utilization by ∼6.5×, with a mean square error on the order of 10⁻¹⁵ for all the selected phantom experiments. Finally, by using the fixed-point representation at different data precisions, the latency and resource utilization were further decreased, and a mean square error on the order of 10⁻¹ was reached.
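The paper's Formulas (6) and (7) are not reproduced in this listing, so the following C sketch only illustrates the standard log-domain shortcut the abstract alludes to: total attenuation p = -ln(I/I0) can be computed directly from offset-corrected raw data as ln(I0 - dark) - ln(I - dark), skipping the intermediate intensity image. The dark-offset model and the clamping are assumptions:

#include <math.h>
#include <stddef.h>

/* Line-integral (total attenuation) pre-processing on raw detector data:
 *   p = -ln(I / I0) = ln(I0 - dark) - ln(I - dark)
 * Computing it this way skips the intermediate intensity image entirely. */
void preprocess_projection(const float *raw,  /* raw detector samples       */
                           const float *i0,   /* flat-field (air) reference */
                           const float *dark, /* dark-offset per pixel      */
                           float *p,          /* output: line integrals     */
                           size_t n)
{
    for (size_t k = 0; k < n; ++k) {
        float num = i0[k]  - dark[k];
        float den = raw[k] - dark[k];
        /* clamp to avoid taking the log of non-positive values on dead pixels */
        if (num < 1e-6f) num = 1e-6f;
        if (den < 1e-6f) den = 1e-6f;
        p[k] = logf(num) - logf(den);
    }
}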
... The development of new methods of digital data processing and the improvement of existing ones are actively being pursued to solve this problem. One of the main approaches to improving the performance of digital devices is the use of specialized hardware accelerators such as Field-Programmable Gate Arrays (FPGA) [6], Application-Specific Integrated Circuits (ASIC) [7], and Systems-on-a-Chip (SoC) [8]. ...
Conference Paper
Discrete wavelet transform (DWT) is widely used in modern science and technology to solve a wide range of problems in signal and image processing and digital communications. The high growth rates of the quantitative and qualitative characteristics of digital information lead to the need to improve information processing methods and increase the efficiency of their implementation. Specialized hardware circuits are used to solve this problem since they can significantly enhance the characteristics of DWT implementation devices. Calculations for DWT are organized using multiple approaches that differ in the priorities of resource consumption. This paper proposes a comparative analysis of state-of-the-art approaches to circuit design for DWT with the Cohen-Daubechies-Feauveau 9/7 wavelet. The evaluation results using the "unit-gate" model showed that the optimized direct implementation of DWT exceeds the lifting scheme by 4.59 times in computational speed and requires 9.55% fewer hardware costs when implemented on modern digital devices such as Field-Programmable Gate Arrays.
Keywords: Discrete wavelet transform; Circuits design; CDF 9/7; Unit-gate model; Lifting scheme
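For context on the lifting scheme compared above, the CDF 9/7 lifting factorization applies two predict and two update passes plus a final scaling. A minimal single-level 1D sketch in C, using the widely published Daubechies-Sweldens coefficients; the paper's optimized direct implementation is not reproduced here, and boundary handling is simplified to symmetric extension:

/* Single-level forward CDF 9/7 transform by lifting, in place on an
 * even-length signal x[0..n-1]; even indices end up as lowpass, odd
 * as highpass samples. */
static double at(const double *x, long i, long n)
{
    if (i < 0)  i = -i;                 /* symmetric extension at the edges */
    if (i >= n) i = 2 * n - 2 - i;
    return x[i];
}

void cdf97_forward(double *x, long n)
{
    const double alpha = -1.586134342;
    const double beta  = -0.05298011854;
    const double gamma =  0.8829110762;
    const double delta =  0.4435068522;
    const double zeta  =  1.149604398;  /* scaling; normalization conventions vary */

    for (long i = 1; i < n; i += 2)     /* predict 1 */
        x[i] += alpha * (at(x, i - 1, n) + at(x, i + 1, n));
    for (long i = 0; i < n; i += 2)     /* update 1 */
        x[i] += beta  * (at(x, i - 1, n) + at(x, i + 1, n));
    for (long i = 1; i < n; i += 2)     /* predict 2 */
        x[i] += gamma * (at(x, i - 1, n) + at(x, i + 1, n));
    for (long i = 0; i < n; i += 2)     /* update 2 */
        x[i] += delta * (at(x, i - 1, n) + at(x, i + 1, n));
    for (long i = 0; i < n; ++i)        /* scale lowpass up, highpass down */
        x[i] = (i % 2) ? x[i] / zeta : x[i] * zeta;
}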