FIGURE 1 - uploaded by Ahmad Shawahna
Source publication
Due to recent advances in digital technologies and the availability of credible data, an area of artificial intelligence, deep learning, has emerged and has demonstrated its ability and effectiveness in solving complex learning problems that were not possible before. In particular, convolutional neural networks (CNNs) have demonstrated their effectiveness in ima...
Contexts in source publication
Context 1
... have achieved even better accuracy in classification and various computer vision tasks. The classification accuracy in ILSVRC improved to 88.8% [48], 93.3% [31], and 96.4% [49] in the 2013, 2014, and 2015 competitions, respectively. Fig. 1 shows the accuracy loss for the winners of ImageNet competitions before and after the emergence of deep learning ...
Context 2
... improved version of the CNP architectures given in [127], [142] was presented in [143] and referred to as neuFlow. In particular, neuFlow replaced the 2D grid of ALUs with a 2D grid of processing tiles (PTs). The proposed architecture contains a 2D grid of PTs, a control unit, and a smart DMA module, as shown in Fig. 10. Each PT consists of local operators and a routing multiplexer (MUX). The top three PTs are implemented to perform the MAC operation; thus, they can be used to perform 2D convolution, simple dot products, and spatial pooling. General-purpose operations, such as division and squaring, are implemented in the middle three PTs. ...
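As a rough illustration of how the MAC-capable PTs cover 2D convolution, the following plain C++ sketch expresses a single-channel convolution purely as multiply-accumulate steps. It is not the neuFlow RTL; the kernel size and map dimensions are arbitrary assumptions.

```cpp
// Illustrative sketch (not the neuFlow RTL): a 2D convolution expressed purely as
// the multiply-accumulate steps that the MAC-capable PTs provide.
#include <vector>

std::vector<std::vector<float>> conv2d_mac(
    const std::vector<std::vector<float>>& in,
    const std::vector<std::vector<float>>& kernel) {
    const int H = static_cast<int>(in.size());
    const int W = static_cast<int>(in[0].size());
    const int K = static_cast<int>(kernel.size());
    std::vector<std::vector<float>> out(H - K + 1, std::vector<float>(W - K + 1, 0.0f));
    for (int y = 0; y < H - K + 1; ++y)
        for (int x = 0; x < W - K + 1; ++x)
            for (int ky = 0; ky < K; ++ky)
                for (int kx = 0; kx < K; ++kx)
                    out[y][x] += in[y + ky][x + kx] * kernel[ky][kx];  // one MAC per step
    return out;
}
```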
Context 3
... proposed accelerator composed of a computational engine and memory sub-system is depicted in Fig. 11. The computation engine is designed as Tm duplicated tree-shaped poly structures with Tn inputs from the input FMs, Tn inputs from the weights, and one input from the bias. On the other hand, the memory sub-system is implemented as four sets of on-chip buffers; two sets to store the input FMs and weights, each with Tn buffer ...
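To make the tree-shaped structure concrete, here is a minimal C++ sketch of one such unit under assumed semantics (Tn = 8 is an arbitrary unroll factor, not taken from the cited design): Tn parallel multiplications feed an adder tree, and the bias is added at the root. Tm copies of this unit would run in parallel, one per output FM being computed.

```cpp
// Minimal sketch (assumed semantics, not the cited RTL): one of the Tm duplicated
// tree-shaped compute units with Tn FM inputs, Tn weight inputs, and one bias.
#include <array>

constexpr int Tn = 8;  // illustrative unroll factor (assumed power of two)

float tree_unit(const std::array<float, Tn>& fm,
                const std::array<float, Tn>& w,
                float bias) {
    std::array<float, Tn> prod;
    for (int i = 0; i < Tn; ++i)
        prod[i] = fm[i] * w[i];                // Tn parallel multipliers
    for (int stride = Tn / 2; stride > 0; stride /= 2)
        for (int i = 0; i < stride; ++i)
            prod[i] += prod[i + stride];       // log2(Tn)-deep adder tree
    return prod[0] + bias;                     // bias added at the root
}
```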
Context 4
... top-level architecture of the proposed CNN accelerator is shown in Fig. 12. A multi-banked input buffer and a kernel weight buffer are used to provide an efficient buffering scheme for FMs and weights, respectively. To minimize the off-chip memory traffic, a specialized network-on-chip has been designed to re-distribute the output FMs on the multi-banked input buffer instead of transferring them to the external ...
Context 5
... proposed architecture consists of a processing system (CPU) and programmable logic (FPGA). CNN computations are performed through a special design of processing element modules in the FPGA. The main modules in the processing element are the convolver complex, max-pooling, non-linearity, data shift, bias shift, and adder tree, as shown in Fig. 13. The convolver complex is designed as a classical line buffer [154], as shown in Fig. 14, to perform convolution operations as well as the matrix-vector multiplication of FC layers. The pooling layer implemented in the max-pooling module outputs the maximum value in the input data stream with a window of size 2. The ...
Context 6
... logic (FPGA). CNN computations are performed through a special design of processing element modules in the FPGA. The main modules in the processing element are the convolver complex, max-pooling, non-linearity, data shift, bias shift, and adder tree, as shown in Fig. 13. The convolver complex is designed as a classical line buffer [154], as shown in Fig. 14, to perform convolution operations as well as the matrix-vector multiplication of FC layers. The pooling layer implemented in the max-pooling module outputs the maximum value in the input data stream with a window of size 2. The activation function of the CNN is applied to the input data stream using the non-linearity ...
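The line-buffer idea mentioned in both excerpts is what makes streaming convolution possible: keep the last K rows on chip so a full KxK window exists for every arriving pixel. Below is a minimal C++ sketch under assumed parameters (K = 3, row-major streaming, valid padding); it is only an illustration of the buffering pattern, not the cited convolver complex. The max-pooling module would analogously emit the maximum over each 2-wide window of the stream.

```cpp
// Minimal sketch (assumptions: K = 3 kernel, row-major streamed input, valid
// padding) of the classical line-buffer pattern behind the convolver complex.
#include <utility>
#include <vector>

constexpr int K = 3;  // illustrative kernel size

struct LineBufferConv {
    int width;                               // feature-map width
    std::vector<std::vector<float>> rows;    // circular buffer of the last K rows
    std::vector<std::vector<float>> kernel;  // KxK weights

    LineBufferConv(int w, std::vector<std::vector<float>> k)
        : width(w), rows(K, std::vector<float>(w, 0.0f)), kernel(std::move(k)) {}

    // Push the pixel at (row y, column x); returns true and writes 'out' once a
    // full KxK window is available.
    bool push(int y, int x, float v, float& out) {
        rows[y % K][x] = v;
        if (y < K - 1 || x < K - 1) return false;
        out = 0.0f;
        for (int ky = 0; ky < K; ++ky)
            for (int kx = 0; kx < K; ++kx)
                out += rows[(y - (K - 1) + ky) % K][x - (K - 1) + kx] * kernel[ky][kx];
        return true;
    }
};
```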
Context 7
... explores the design space of the topology matrix components while considering the resource constraints. In doing so, fpgaConvNet performs graph partitioning, coarse-grained folding, and fine-grained folding. The graph partitioning splits the original SDFG into subgraphs and each subgraph is then mapped to a distinct bitstream, as shown in Fig. 15. Note that the proposed multi-bitstream architecture might have multiple CONV layer processors (CLPs), as in the provided example. This way, on-chip RAM is used for intermediate results and data reuse within the subgraph, while off-chip memory access is minimized and limited to the input and output streams of the subgraph. However, ...
Context 8
... overall modules of the proposed CNN accelerator are shown in Fig. 16. The controller is responsible for directing and ensuring in-order computation of CNN modules for each layer. The data routers oversee the selection of data read and data write of two adjacent modules as well as the assignment of buffer outputs to shared or pool multipliers of the multiplier bank. The feature buffers hold the FMs using ...
Context 9
... unrolling in performing convolutions, and this can be achieved either through intra-output or inter-output parallelism. Finally, operator-level parallelism is achieved by parallelizing the k × k MAC operations needed for the convolution operation in convolutional layers or the n MACs needed for inner-product computation in fully connected layers. Fig. 17 shows the parallel framework exploiting these four levels of ...
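As a rough illustration of the last two levels (plain C++ with arbitrary sizes K, IN_CH, and OUT_CH; not the cited design), the k × k MACs for one output pixel form fully unrollable inner loops (operator-level parallelism), while the loop over output FMs exposes inter-output parallelism:

```cpp
// Illustrative loop structure only: the innermost K x K loops can be unrolled into
// parallel MACs; the loop over output FMs exposes inter-output parallelism.
constexpr int K = 3, IN_CH = 4, OUT_CH = 8;  // illustrative dimensions

float mac_kxk(const float window[IN_CH][K][K],
              const float weights[OUT_CH][IN_CH][K][K],
              int of) {
    float acc = 0.0f;
    for (int ic = 0; ic < IN_CH; ++ic)
        for (int ky = 0; ky < K; ++ky)       // in hardware these two loops are
            for (int kx = 0; kx < K; ++kx)   // unrolled into K x K parallel MACs
                acc += window[ic][ky][kx] * weights[of][ic][ky][kx];
    return acc;
}

void conv_point(const float window[IN_CH][K][K],
                const float weights[OUT_CH][IN_CH][K][K],
                float out[OUT_CH]) {
    for (int of = 0; of < OUT_CH; ++of)          // all OUT_CH accumulations are
        out[of] = mac_kxk(window, weights, of);  // independent and can run concurrently
}
```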
Context 10
... buffer. The BNN weight matrix is distributed across the PEs and stored locally in on-chip memory. Subsequently, the input images are streamed through the MVTU and multiplied with the weight matrix. In particular, each PE computes the dot product between an input vector and a row of the weight matrix, each S bits wide, using an XNOR gate, as shown in Fig. 18. Then, it compares the number of set bits to a threshold and produces a 1-bit output value as previously ...
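The XNOR-and-popcount trick works because bipolar values (±1) agree exactly when their bit encodings match. A minimal C++20 sketch of one such dot product and thresholding step, under assumed bit packing and S = 64 (not the cited RTL), is:

```cpp
// Minimal C++20 sketch (assumed bit packing, S = 64): XNOR the packed input bits
// with the packed weight bits, count the matching positions, threshold to one bit.
#include <bit>      // std::popcount (C++20)
#include <cstdint>

constexpr int S = 64;  // illustrative SIMD width: one 64-bit word per slice

// bit i encodes +1 when set and -1 when clear
int binary_dot(uint64_t input_bits, uint64_t weight_bits) {
    uint64_t agree = ~(input_bits ^ weight_bits);  // XNOR: 1 where the bits match
    int ones = std::popcount(agree);               // number of matching positions
    return 2 * ones - S;                           // bipolar dot product in [-S, S]
}

// PE output: 1 if the number of set (matching) bits reaches the threshold, else 0
int pe_output(uint64_t input_bits, uint64_t weight_bits, int threshold_set_bits) {
    uint64_t agree = ~(input_bits ^ weight_bits);
    return std::popcount(agree) >= threshold_set_bits ? 1 : 0;
}
```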
Context 11
... layer using a sliding window unit (SWU) and an MVTU, where the convolutional operation is transformed into a matrix multiplication of an image matrix and a filter matrix. The SWU generates the image matrix for the MVTU by moving the sliding window over the input FMs, while the filter matrix is generated by packing the weights from the convolution filters, as shown in Fig. 19. In order to meet the user throughput requirement, the MVTU is folded (time-multiplexed) by controlling the values of P and S. The folding of the MVM decides the partitioning of the matrix across PEs. Every row of a matrix tile is mapped to a distinct PE and every column of the PE buffer is mapped to a distinct SIMD lane. In this way, the required number ...
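The image-matrix construction described above is the standard im2col lowering. A minimal single-channel C++ sketch (assumed to mirror what the SWU produces; stride 1, valid padding) follows; the resulting matrix would then be tiled across P PEs with S SIMD lanes each, which is the folding knob the excerpt refers to.

```cpp
// Minimal single-channel im2col sketch (assumed SWU behaviour, stride 1, valid
// padding): each KxK input patch becomes one row of the image matrix, so that
// convolution turns into a matrix-matrix multiplication with the filter matrix.
#include <vector>

std::vector<std::vector<float>> im2col(const std::vector<std::vector<float>>& in, int K) {
    const int H = static_cast<int>(in.size());
    const int W = static_cast<int>(in[0].size());
    std::vector<std::vector<float>> image_matrix;
    for (int y = 0; y + K <= H; ++y)
        for (int x = 0; x + K <= W; ++x) {
            std::vector<float> row;
            row.reserve(K * K);
            for (int ky = 0; ky < K; ++ky)
                for (int kx = 0; kx < K; ++kx)
                    row.push_back(in[y + ky][x + kx]);
            image_matrix.push_back(row);   // one row per output position
        }
    return image_matrix;
}
```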
Context 12
... a single flexible architecture, named the reference architecture and derived using pattern matching, to execute the workloads of all subgraphs by transitioning between different modes. Upon the execution of a new subgraph, the subgraph's weights are read into the on-chip memory and the multiplexers are configured to form the appropriate datapath. Fig. 21 ...
Context 13
... the architecture in [168], where an individual CONV module is assigned to each CONV layer, the scalable RTL computing module proposed in this work is reused by all CNN layers of the same type across different CNNs, as shown in Fig. 31. Note that it is not necessary to have all these modules in the architecture. For instance, the RTL compiler will not compile or synthesize the Eltwise and combined batch normalization with scale (Bnorm) modules for the VGG-16 model, which greatly saves the hardware ...
Citations
... Neural network-based control demands higher memory usage, increased computational complexity, and greater FPGA resource consumption compared to traditional feedback controllers. To make ML-based control viable in resource-constrained environments, optimizations are essential in areas such as weight storage, inference latency, and numerical precision. This aligns with the principles of TinyML, where neural networks are tailored for efficient execution on low-power hardware, enabling real-time performance without excessive resource overhead. ...
Maintaining structural integrity of electronic components in dynamic environments requires effective vibration control strategies, particularly in aerospace, automotive, and industrial applications where transient impacts and sustained oscillations can degrade performance and reliability. Traditional passive damping techniques provide limited flexibility to varying external forces, necessitating active control approaches for enhanced vibration suppression. This study evaluates the computational feasibility and performance trade-offs of Proportional Derivative (PD), Linear Quadratic Gaussian (LQG), and Multi-Layer Perceptron (MLP) controllers for active vibration mitigation in a simulated cantilever beam system. A finite element model is used to simulate structural response under impact loading, while real-time implementation is conducted on a Field-Programmable Gate Array (FPGA)-based control system. The PD and LQG controllers follow structured control frameworks, with the LQG approach incorporating optimal control and state estimation for improved stabilization. The MLP controller, trained on LQG-derived synthetic data, learns a control strategy without explicit system modeling. Performance metrics, including peak displacement, settling time, RMS displacement, control effort, and damping efficiency, provide insight into each controller's trade-offs. Additionally, FPGA resource utilization is analyzed to assess the computational overhead of deploying structured and learning-based controllers in embedded systems. Results highlight that while structured controllers maintain efficiency and predictable performance, the MLP controller demonstrates the potential for adaptive vibration suppression while using 22.5% fewer FPGA resources than the LQG controller in terms of total slices. This is achieved while maintaining performance comparable to the LQG controller, with a settling time of 1.36 s compared to 1.01 s for LQG, and significantly outperforming the PD controller, which has a settling time of 1.82 s. These findings emphasize that the choice of control strategy depends on application constraints, balancing computational efficiency with flexibility.
... For instance, FPGAs offer customizable hardware solutions that can be tailored for specific AI applications, providing a balance between performance and flexibility [SSEM19], [Bha21]. In certain high-performance or high-efficiency use cases, the co-design of hardware and software can encompass the creation of dedicated hardware accelerators (Application-Specific Integrated Circuits, ASICs) for the particular AI model. By tailoring software algorithms to leverage specific hardware features, and vice versa, this technique achieves efficient execution of AI tasks. ...
This document aims to provide an overview and synopsis of frugal AI, with a particular focus on its role in promoting cost-effective and sustainable innovation in the context of limited resources. It discusses the environmental impact of AI technologies and the importance of optimising AI systems for efficiency and accessibility. It explains the interface between AI, sustainability and innovation. In fourteen sections, it also makes interested readers aware of various research topics related to frugal AI, raises open questions for further exploration, and provides pointers and references.
... Many of the advances that we are seeing in the field of deep learning (DL) are due to collaborative research efforts made in both hardware (HW) and software (SW) designs [1][2][3][4][5][6]. In the past years, DL frameworks have facilitated the exploration of new DL architectures. ...
This paper presents the development and evaluation of a distributed system employing low-latency embedded field-programmable gate arrays (FPGAs) to optimize scheduling for deep learning (DL) workloads and to configure multiple deep learning accelerator (DLA) architectures. Aimed at advancing FPGA applications in real-time edge computing, this study focuses on achieving optimal latency for a distributed computing system. A novel methodology was adopted, using configurable hardware to examine clusters of DLAs, varying in architecture and scheduling techniques. The system demonstrated its capability to parallel-process diverse neural network (NN) models, manage compute graphs in a pipelined sequence, and allocate computational resources efficiently to intensive NN layers. We examined five configurable DLAs—Versatile Tensor Accelerator (VTA), Nvidia DLA (NVDLA), Xilinx Deep Processing Unit (DPU), Tensil Compute Unit (CU), and Pipelined Convolutional Neural Network (PipeCNN)—across two FPGA cluster types consisting of Zynq-7000 and Zynq UltraScale+ System-on-Chip (SoC) processors, respectively. Four deep neural network (DNN) workloads were tested: Scatter-Gather, AI Core Assignment, Pipeline Scheduling, and Fused Scheduling. These methods revealed an exponential decay in processing time up to 90% speedup, although deviations were noted depending on the workload and cluster configuration. This research substantiates FPGAs’ utility in adaptable, efficient DL deployment, setting a precedent for future experimental configurations and performance benchmarks.
... To date, the exponential growth in DNN model parameters and data quantities [10] stretches the capacity limits of conventional digital computing architectures, primarily due to the "Von Neumann" bottleneck in data movement [11]. Though specialized computing paradigms like the Google tensor processing unit (TPU) [12,13], field-programmable gate arrays (FPGAs) [14,15], graphics processing units (GPUs) [16,17], and memristor architectures [18,19] that merge memory and matrix multiplication computations together have been developed to improve computational throughput, reduce latency, and lower energy consumption, they are still fundamentally limited in both bandwidth and efficiency because of electronic Joule heating, capacitance, and electromagnetic crosstalk [20]. ...
... This principle could also be used to calibrate the analog time integration: for the time integrator that integrates over l time steps, the maximum photodetector output power corresponds to the analog value 4·l·Analog_max and the minimum output power corresponds to the analog value 4·l·Analog_min. Any output power between the maximum and minimum output power can be mapped to its corresponding analog value through Analog_out = (P_out − P_out,min) × (Analog_max − Analog_min) / (P_out,max − P_out,min) + Analog_min (Eq. 14). Using these above-mentioned encoding and decoding methods, we perform the analog value multiplication using the optical hardware, and the optical computing experimental results match well with the theoretical values, showing high computing accuracy over 8 bits. ...
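A direct implementation of that decoding relation (a sketch assuming a simple linear calibration; symbol names follow the quoted equation, not the authors' code) is:

```cpp
// Minimal sketch of the quoted decoding relation (straight-line calibration).
float decode_analog(float p_out, float p_out_min, float p_out_max,
                    float analog_min, float analog_max) {
    // Analog_out = (P_out - P_out_min) * (Analog_max - Analog_min)
    //              / (P_out_max - P_out_min) + Analog_min
    return (p_out - p_out_min) * (analog_max - analog_min)
           / (p_out_max - p_out_min) + analog_min;
}
```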
The ever-increasing data demand craves advancements in high-speed and energy-efficient computing hardware. Analog optical neural network (ONN) processors have emerged as a promising solution, offering benefits in bandwidth and energy consumption. However, existing ONN processors exhibit limited computational parallelism, and while certain architectures achieve high parallelism, they encounter serious scaling roadblocks for large-scale implementation. This restricts the throughput, latency, and energy efficiency advantages of ONN processors. Here, we introduce a spatial-wavelength-temporal hyper-multiplexed ONN processor that supports high data dimensionality, high computing parallelism and is feasible for large-scale implementation, and in a single time step, a three-dimensional matrix-matrix multiplication (MMM) optical tensor processor is demonstrated. Our hardware accelerates convolutional neural networks (CNNs) and deep neural networks (DNNs) through parallel matrix multiplication. We demonstrate benchmark image recognition using a CNN and a subsequently fully connected DNN in the optical domain. The network works with 292,616 weight parameters under ultra-low optical energy of 20 attojoules (aJ) per multiply and accumulate (MAC) at 96.4% classification accuracy. The system supports broad spectral and spatial bandwidths and is capable for large-scale demonstration, paving the way for highly efficient large-scale optical computing for next-generation deep learning.
... Finally, as low-bitwidth convolutions can be implemented efficiently on standard CPUs or GPUs, but also on tightly constrained field-programmable gate arrays (FPGAs) or even on Application-Specific Integrated Circuits (ASICs), binarized networks are exploited on each kind of specialized computing hardware. This leads in [1] to an energy-efficient and scalable CNN accelerator on ASICs, whereas [34] proposes a recent survey on hardware accelerators that use FPGAs. ...
We present a new architecture to learn a light neural network using an asynchronous layerwise Bayesian optimization process deployed on low-power devices. The procedure is based on a sequence of five modules. In each module, an accept-reject algorithm allows updating real-valued, or binary, weights without any backpropagation of gradients. The learning process is tested in two different environments, and the electricity consumption is evaluated over several epochs using a homemade open-source library based on standard software and performance counters, and compared with a physical power meter. It shows that the decentralized version deployed on several low-power devices is more energy-efficient than the standard GP-GPU version on a dedicated server.
... Deep learning networks are very similar to traditional artificial neural networks, but there are some key differences. While traditional artificial neural networks usually have only one or two layers with a relatively small number of parameters and twin structures, deep learning networks can contain tens of layers or more with a very large number of parameters and structural complexity [1][2]. This architecture allows deep learning networks to perform better on the task of handling large-scale, high-dimensional data, which can be used in the recognition of gymnastic movements. ...
The study adopts the OpenPose algorithm in deep learning to extract and recognize gymnastics movements, and it initially constructs the OpenPose gymnastics movement recognition model. The MobileNet-V3 network is introduced to replace VGG-19, which was the feature extraction network in the original model, in order to optimize the accuracy of OpenPose in recognizing gymnastics actions and to construct an OpenPose-MobileNet-V3 gymnastics action recognition model. The original model is compared with the optimized OpenPose-MobileNet-V3 model for comparison experiments in action recognition, and then the OpenPose-MobileNet-V3 model is compared with other recognition models to examine its effectiveness in action recognition. Finally, the parameter sensitivities of MobileNet-V3 and cosine annealing strategies are compared to explore the optimization effect of the two strategies on the OpenPose model. The OpenPose-MobileNet-V3 algorithm improves its recognition accuracy by 6.857% over the pre-optimization OpenPose algorithm. The recognition accuracy of the OpenPose-MobileNet-V3 is improved by 6.857% on the two datasets, which have accuracies of 95.786% and 94.572%, respectively, which are significantly better than other recognition models. The cosine annealing strategy-trained model is 2.143 percentage points less accurate than the OpenPose-MobileNet-V3 model at recognizing gymnastics movements, and MobileNet-V3 is better optimized.
... Compared to Application-Specific Integrated Circuits (ASICs), FPGAs offer cost efficiency, enhanced performance, and reduced power consumption [20]. Additionally, advancements in FPGA-based deep learning accelerators have demonstrated substantial performance improvements in computational efficiency [21]. ...
Nowadays, induction motors are an essential part of industrial development. Faults due to short-circuit turns within induction motors are “incipient faults”. This type of failure affects engine operation through undesirable vibrations. Such vibrations negatively affect the operation of the system or the products with which said motor is in contact. Early fault detection prevents sudden downtime in the industry that can result in heavy economic losses. The incipient failures these motors can present have been a vast research topic worldwide. Existing methodologies for detecting incipient faults in alternating current motors have the problem that they are implemented at the simulation level, or are invasive, or do not allow in situ measurements, or their digital implementation is complex. This article presents the design and development of a purpose-specific system capable of detecting short-circuit faults in the turns of the induction motor winding without interrupting the motor’s working conditions, allowing online measurements. This system is standalone, portable and allows non-invasive and in situ measurements to obtain phase currents. These data form classified descriptors using a multilayer perceptron neural network. This type of neural network enables agile and efficient digital processing. The developed neural network could classify current faults with an accuracy rate of 93.18%. The neural network was successfully implemented on a low-cost and low-range purpose-specific Field Programmable Gate Array board for online processing, taking advantage of its computing power and real time processing features. The measurement of phase current and the class of fault detected is displayed on a liquid-crystal display screen, allowing the user to take necessary actions before major faults occur.
... The pixel chip captures the signals and then processes them through the ISP (Image Signal Processor); AI processing is performed in the processing step on the logic chip and the information is output as metadata, which reduces the size and amount of processed data. Users can specify the image output format according to the application requirements, including ISP (YUV/RGB) output images and ROI (region of interest) extraction images, as shown in Figure 4. Field-programmable gate arrays (FPGAs) further enhance CNN-based designs, supporting multi-core embedded systems and mobile-compatible Ag-IoT crop detectors [166,167]. ...
... Supercomputers improve CNN model training by accelerating data augmentation and reducing computational time, enabling precise crop monitoring and classification. Field-programmable gate arrays (FPGAs) further enhance CNN-based designs, supporting multi-core embedded systems and mobile-compatible Ag-IoT crop detectors [166,167]. IoT devices pose security risks, addressed through application-specific integrated circuits (ASICs) for edge gateways, essential for secure Ag-IoT systems [168,169]. ML-optimized gateways improve task efficiency under resource constraints, reducing latency and enhancing privacy in Ag5.0 edge computing systems. ...
Agriculture 5.0 (Ag5.0) represents a groundbreaking shift in agricultural practices, addressing the global food security challenge by integrating cutting-edge technologies such as artificial intelligence (AI), machine learning (ML), robotics, and big data analytics. To adopt the transition to Ag5.0, this paper comprehensively reviews the role of AI, machine learning (ML) and other emerging technologies to overcome current and future crop management challenges. Crop management has progressed significantly from early agricultural methods to the advanced capabilities of Ag5.0, marking a notable leap in precision agriculture. Emerging technologies such as collaborative robots, 6G, digital twins, the Internet of Things (IoT), blockchain, cloud computing, and quantum technologies are central to this evolution. The paper also highlights how machine learning and modern agricultural tools are improving the way we perceive, analyze, and manage crop growth. Additionally, it explores real-world case studies showcasing the application of machine learning and deep learning in crop monitoring. Innovations in smart sensors, AI-based robotics, and advanced communication systems are driving the next phase of agricultural digitalization and decision-making. The paper addresses the opportunities and challenges that come with adopting Ag5.0, emphasizing the transformative potential of these technologies in improving agricultural productivity and tackling global food security issues. Finally, as Agriculture 5.0 is the future of agriculture, we highlight future trends and research needs such as multidisciplinary approaches, regional adaptation, and advancements in AI and robotics. Ag5.0 represents a paradigm shift towards precision crop management, fostering sustainable, data-driven farming systems that optimize productivity while minimizing environmental impact.
... Therefore, when executing a typical DL approach, satisfying the requirement of accuracy is of paramount significance. The distinctive feature of DL is its mechanism of learning the attributes and weights of the networks from large datasets [8]. ...
Recently, convolutional neural networks (CNNs) have received a massive amount of interest due to their ability to achieve high accuracy in various artificial intelligence tasks. With the development of complex CNN models, a significant drawback is their high computational burden and memory requirements. The performance of a typical CNN model can be enhanced by the improvement of hardware accelerators. Practical implementations on field-programmable gate arrays (FPGAs) have the potential to reduce resource utilization while maintaining low power consumption. Nevertheless, when implementing complex CNN models on FPGAs, these may require further computational and memory capacities, exceeding the available capacity provided by many current FPGAs. An effective solution to this issue is to use quantized neural network (QNN) models to remove the burden of full-precision weights and activations. This article proposes an accelerator design framework for FPGAs, called FPGA-QNN, with a particular value in reducing the high computational burden and memory requirements when implementing CNNs. To approach this goal, FPGA-QNN exploits the basics of quantized neural network (QNN) models by converting the high burden of full-precision weights and activations into integer operations. The FPGA-QNN framework comes up with 12 accelerators based on multi-layer perceptron (MLP) and LeNet CNN models, each of which is associated with a specific combination of quantization and folding. The outputs from the performance evaluations on a Xilinx PYNQ Z1 development board proved the superiority of FPGA-QNN in terms of resource utilization and energy efficiency in comparison to several recent approaches. The proposed MLP model classified the FashionMNIST dataset at a speed of 953 kFPS with 1019 GOPs while consuming 2.05 W.
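As a generic illustration of the weight-quantization step such frameworks rely on (uniform symmetric int8 quantization here; the FPGA-QNN paper's own scheme and bit widths are not specified in this excerpt), a minimal C++ sketch could look like:

```cpp
// Generic illustration (NOT the FPGA-QNN scheme): full-precision weights become
// integers plus a scale factor, so the FPGA's MACs operate on integers only.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantizedWeights {
    std::vector<int8_t> q;  // integer weights
    float scale;            // original weight ~= q * scale
};

QuantizedWeights quantize_int8(const std::vector<float>& w) {
    float max_abs = 0.0f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
    const float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    QuantizedWeights out{{}, scale};
    out.q.reserve(w.size());
    for (float v : w)
        out.q.push_back(static_cast<int8_t>(std::lround(v / scale)));
    return out;
}
```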
... However, deploying these computationally intensive models on edge devices for real-time applications poses unique challenges. Limited computational resources, memory, and energy efficiency restrict the direct use of conventional DL models on edge platforms like autonomous systems, the Internet of Things (IoT), and mobile applications [1][2][3]. ...
Deep learning (DL) has revolutionized image classification, yet deploying convolutional neural networks (CNNs) on edge devices for real-time applications remains a significant challenge due to constraints in computation, memory, and power efficiency. This work presents an optimized implementation of VGG16 and VGG19, two widely used CNN architectures, for classifying the CIFAR-10 dataset using transfer learning on field-programmable gate arrays (FPGAs). Utilizing the Xilinx Vitis-AI and TensorFlow2 frameworks, we adapt VGG16 and VGG19 for FPGA deployment through quantization, compression, and hardware-specific optimizations. Our implementation achieves high classification accuracy, with Top-1 accuracy of 89.54% and 87.47% for VGG16 and VGG19, respectively, while delivering significant reductions in inference latency (7.29× and 6.6× compared to CPU-based alternatives). These results highlight the suitability of our approach for resource-efficient, real-time edge applications. Key contributions include a detailed methodology for combining transfer learning with FPGA acceleration, an analysis of hardware resource utilization, and performance benchmarks. This work underscores the potential of FPGA-based solutions to enable scalable, low-latency DL deployments in domains such as autonomous systems, IoT, and mobile devices.