Article

Abstract

Convolutional neural networks (CNNs) have been widely adopted for computer vision applications. CNNs require many multiplications, making their use expensive in terms of both computational complexity and hardware. An effective way to reduce the number of required multiplications is the Winograd algorithm. Previous Winograd-based CNN implementations use the 2-D algorithm F(2 × 2, 3 × 3), which reduces computational complexity by a factor of 2.25 over regular convolution. However, these implementations only apply when using a stride (the shift displacement of a kernel over an input) of 1. In this article, we present a novel method to apply the Winograd algorithm to a stride of 2. This method is valid for one, two, or three dimensions. We also introduce new Winograd versions compatible with kernels of size 3, 5, and 7. The algorithms were successfully implemented on an NVIDIA K20c GPU. Compared to regular convolutions, the stride-2 implementations are 1.44× faster for a 3 × 3 kernel, 2.04× faster for a 5 × 5 kernel, 2.42× faster for a 7 × 7 kernel, and 1.73× faster for a 3 × 3 × 3 kernel. Additionally, a CNN accelerator using a novel processing element (PE), which performs either two 2-D Winograd stride-1 operations or one 2-D Winograd stride-2 operation per clock cycle, was implemented on an Intel Arria-10 field-programmable gate array (FPGA). We accelerated the original and our proposed modified VGG-16 architectures and achieved digital signal processor (DSP) efficiencies of 1.22 giga operations per second (GOPS)/DSP and 1.33 GOPS/DSP, respectively.
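The 2.25× figure quoted above comes from Winograd's minimal filtering. As an illustration only (not the authors' implementation), the following NumPy sketch shows the 1-D case F(2, 3) with the standard transform matrices from Lavin and Gray; nesting it over rows and columns yields the 2-D F(2 × 2, 3 × 3) variant, where 16 elementwise products replace 36 direct multiplications:

```python
import numpy as np

# Standard F(2,3) transform matrices (Lavin & Gray, "Fast Algorithms
# for Convolutional Neural Networks", 2016).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """Two stride-1 outputs of a 3-tap filter over a 4-element tile,
    using 4 elementwise multiplications instead of 6 (1.5x fewer)."""
    return AT @ ((G @ g) * (BT @ d))

d = np.array([1.0, 2.0, 3.0, 4.0])   # input tile
g = np.array([1.0, 0.5, -1.0])       # kernel
direct = np.array([np.dot(d[i:i + 3], g) for i in range(2)])
assert np.allclose(winograd_f23(d, g), direct)
```

The 2-D saving (36/16 = 2.25) follows from applying the same transforms along both axes of a 4 × 4 input tile.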


... Paper [34] proposes the decomposable Winograd method (DWM), which expands the usage range of Winograd to other types of kernels. Paper [30] proposes a similar method for the stride of 2 kernel, and validates two types of kernels on FPGA. In addition, paper [30] considers how to save Look-Up Table (LUT) resources when supporting two types of kernels. ...
... Paper [30] proposes a similar method for the stride of 2 kernel, and validates two types of kernels on FPGA. In addition, paper [30] considers how to save Look-Up Table (LUT) resources when supporting two types of kernels. Paper [35] proposes a more efficient approach for the kernel size of 3 × 3 and stride of 2 and eventually saves 49.7% of LUT resources compared with paper [30]. ...
... In addition, paper [30] considers how to save Look-Up Table (LUT) resources when supporting two types of kernels. Paper [35] proposes a more efficient approach for the kernel size of 3 × 3 and stride of 2 and eventually saves 49.7% of LUT resources compared with paper [30]. But papers [30,35] do not consider how to save LUT resources when more than two types of kernels are supported. ...
Article
Full-text available
The programmability of FPGAs suits the constantly changing convolutional neural network (CNN). However, several challenges arise when previous FPGA-based accelerators update a CNN. Firstly, although the RepVGG model can balance accuracy and speed, it solely supports two types of kernels. Meanwhile, the 8-bit integer-only quantization of PyTorch, which can support various CNNs, is seldom successfully supported by FPGA-based accelerators. In addition, Winograd F(4 × 4, 3 × 3) uses less multiplication, but its transformation matrix contains irregular decimals, which could lead to accuracy problems. To tackle these issues, this paper proposes the High-accuracy Branch-fused CNN Accelerator (HBCA): a toolchain and corresponding FPGA-based accelerator. The toolchain proposes an inception-based branch-fusion technique, which can support more types of kernels. Meanwhile, the accelerator proposes Winograd-quantization dual decimal-fuse techniques to balance accuracy and speed. In addition, this accelerator supports multiple types of kernels and proposes Winograd decomposed-part reuse, multi-mode BRAM & DSP, and data reuse to increase power efficiency. Experiments show that HBCA is capable of supporting seven CNNs with different types of kernels and more branches. The accuracy loss is within 0.1% when compared to the quantized model. Furthermore, the power efficiency (GOPS/W) of Inception, ResNet and VGG is up to 226.6, 188.1 and 197.7, respectively, which is better than other FPGA-based CNN accelerators.
... CNN is the most extensively used image classification method, and it is employed in a lot of applications. CNNs require a lot of computing and memory [3]. Convolution (CONV) layers and FC layers are two of them that are memory-bound and compute-bound, respectively [9]. ...
... In contrast to traditional convolution, the Winograd algorithm is used in this study. It is derived from the Chinese Remainder Theorem [3], since traditional convolution is more expensive and requires many multiplications. ...
... The accuracies reported by the mentioned boards were 86.66, 84.66, and 91.4. With efficiencies of 0.18, 1.32, and 1.33 (GOPs/DSP), the performances were 137, 3044, and 1788 (GOPs) respectively [3]. ...
Article
Full-text available
The convolutional neural network (CNN) is the most widely used machine learning technique within the fields of image and video processing. It is primarily used to categorize images using vast datasets, which requires a lot of calculations. A field-programmable gate array (FPGA) used as a hardware accelerator for CNNs gives excellent performance at low power budgets. Employing the Winograd algorithm can reduce the number of processing stages in a CNN. 2-D convolution is employed for the bulk of calculations in CNNs. Winograd minimal filtering is the handiest tactic for computing convolution with smaller filter sizes. The comparison of computational complexity in terms of multiplications is performed using Matlab. The architecture of the Winograd-based processing element, RTL coding in Verilog HDL, and the test bench are designed to examine the performance. The Xilinx Vivado / Cadence tool is used to implement the processing element of the convolution unit on an FPGA or ASIC.
... Various convolution accelerators based on the Winograd algorithm have been proposed [28][29][30][31]. Moreover, [32,33] proposed a method of optimizing 2-stride convolution with the Winograd algorithm, which reduces the design complexity of the Winograd-based CNN accelerator and significantly enhances the computation efficiency. However, the Winograd algorithm is not the best optimization scheme for the upsampling in U-Net because of the sparse feature map after expansion. ...
... Such large transformation matrices result in complex pre-computation and more latency. Yang et al. [32] and Yepez and Ko [33] further applied the Winograd algorithm to 2-stride convolution by decomposing the input feature map tiles and kernels. ...
... Refs. [32,33] proposed a strategy of decomposing 2-D 2-stride convolution, that is, decomposing and recombining the input feature tile and convolution kernel according to the location of elements. Then, each decomposed feature sub-tile is only convolved with the corresponding decomposed sub-kernel. ...
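The decomposition idea in refs. [32,33] can be illustrated in one dimension (a hedged sketch, not the papers' exact formulation): a stride-2 convolution splits by element parity into stride-1 sub-convolutions, since even-indexed input samples only ever meet kernel taps k0 and k2, and odd-indexed samples only meet k1. Each sub-convolution can then be handled by a standard stride-1 Winograd kernel:

```python
import numpy as np

def conv1d(x, k, stride=1):
    """Direct 1-D valid convolution (correlation) with a given stride."""
    n = (len(x) - len(k)) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + len(k)], k)
                     for i in range(n)])

def stride2_by_decomposition(x, k):
    """Stride-2 conv with a 3-tap kernel, rewritten as stride-1 pieces:
    even-indexed samples meet taps {k0, k2}, odd samples meet k1."""
    x_even, x_odd = x[0::2], x[1::2]
    n = (len(x) - 3) // 2 + 1
    y_even = conv1d(x_even, np.array([k[0], k[2]]))[:n]  # stride-1 sub-conv
    y_odd = x_odd[:n] * k[1]                             # 1-tap sub-conv
    return y_even + y_odd

x = np.arange(10, dtype=float)
k = np.array([1.0, -2.0, 3.0])
assert np.allclose(stride2_by_decomposition(x, k), conv1d(x, k, stride=2))
```

The same parity split applies independently along rows and columns in 2-D, which is how the cited works recombine sub-tiles with sub-kernels.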
Article
Full-text available
Real-time object detection is a challenging but crucial task for autonomous underwater vehicles because of the complex underwater imaging environment. Owing to suspended-particle scattering and wavelength-dependent light attenuation, underwater images are always hazy and color-distorted. To overcome the difficulties these problems cause for underwater object detection, an end-to-end CNN network combining U-Net and MobileNetV3-SSDLite is proposed. Furthermore, the FPGA implementation of the various convolutions in the proposed network is optimized based on the Winograd algorithm. An efficient upsampling engine is presented, and the FPGA implementation of the squeeze-and-excitation module in MobileNetV3 is optimized. The accelerator is implemented on a Zynq XC7Z045 device running at 150 MHz and achieves 23.68 frames per second (fps) and 33.14 fps when using MobileNetV3-Large and MobileNetV3-Small as the feature extractor, respectively. Compared to a CPU, our accelerator achieves 7.5×–8.7× speedup and 52×–60× energy efficiency.
... The authors of [8] presented a software implementation of Winograd method and applied it in a convolutional layer of a neural network with calculations on a graphical processor. In [9], the authors developed a hardware accelerator on Field-Programmable Gate Array (FPGA) based on the Winograd method for the convolutional layer of the neural network. ...
... A comparison is made of the proposed filter architecture with computations in RNS and the known filter architecture with computations in PNS [9]. The parameters of filters with a finite impulse response based on multiply-accumulate (MAC) blocks were also calculated; we denote them as FIR MAC. Then, for a 2 × 2 filter mask, the delay and area parameters are calculated as follows [23]: ...
... Theoretical analysis of the proposed device parameters based on the "unit-gate" model showed that using RNS reduces the device delay by 24.79%–66.77% and the device area by 17.59%–53.67%, compared with the known implementation based on Winograd filtering in PNS [9]. In addition, the proposed device architecture has 13.47%–42.04% less delay and 2.20%–18.03% less area, except for the 8-bit device, which has a 47.38% larger area than the known MAC-based filter architecture [23]. ...
Article
Full-text available
Improving the technical characteristics of digital signal processing devices is an important problem in many practical tasks. Based on the Winograd method, the paper proposes the architecture of a device for two-dimensional filtering in a residue number system (RNS) with moduli of a special type. A theoretical analysis of the technical parameters of the proposed filter architecture was carried out for different RNS moduli sets using the "unit-gate" model. In addition, the proposed architecture is compared with known digital filter implementations. The theoretical analysis showed that the proposed filter architecture makes it possible to increase the signal processing speed by 1.33–6.90 times compared with the known device implementations. Hardware simulation of the proposed filter architecture on an FPGA showed that the performance of the proposed device is 1.31–4.12 times higher than that of known digital filter architectures. The research results can be used in digital signal processing systems to increase their performance and reduce hardware costs. In addition, the developed architectures can be applied in the development of hardware accelerators for complex digital signal analysis systems.
... These results highlight the efficacy and superiority of 3D CNNs over conventional methods. [Flattened excerpt of Table 1, which classifies the surveyed works by key parameters:] Computing platform: [11,17,18], single-FPGA [1,3,5,19,20], multi-FPGA [1], CPU [21,22], Xeon Phi [12,21], GPU [22–24], CPU–GPU [22], DSP [14], resistive RAM [25]. CONV style: direct CONV [3,4,11,13,21–23,26], matrix-multiplication based [20], FFT-based [13,22], Winograd-based [1,12,14,18,19,23]. 3D CNN evaluated: C3D [3,4,11,12,17,19–21,27,28], Base3D [28], I3D [11,17], 3D ResNet-50 [11,17], 3D U-Net [12], S3D [27], V3D [23], E3DNet [5], R(2+1)D [27,29]. Dataset: 3D MNIST [25], UCF101 [3,5,17,18,27], Sports-1M [4], LUNA-16 [1], Kinetics [11,17], TRECVID [18]. Framework/library used: Intel Thread Building Block [22], MKL [5,12,13,22,26], FFTW [13,22], cuDNN [12,22], cuFFT [22], cuBLAS [23]. Compared with: Caffe [13,24], Theano [13], PyTorch [26], TensorFlow [26], cuDNN [23,24,26], LIBXSMM [12], MNN [27], MXNet [5]. 3 COMPUTING PLATFORMS FOR 3D CNNS: Table 1 gives the classification of various research works based on key parameters. It shows their computing platforms, the strategy for realizing CONV, and the 3D CNN workload used by them. ...
Article
Full-text available
3D convolution neural networks (CNNs) have shown excellent predictive performance on tasks such as action recognition from videos. Since 3D CNNs have unique characteristics and extremely high compute/memory-overheads, executing them on accelerators designed for 2D CNNs provides sub-optimal performance. To overcome these challenges, researchers have recently proposed architectures for 3D CNNs. In this paper, we present a survey of hardware accelerators and hardware-aware algorithmic optimizations for 3D CNNs. We include only those CNNs that perform 3D convolution and not those that perform only 2D convolution on 2D or 3D data. We highlight their key ideas and underscore their similarities and differences. We believe that this survey will spark a great deal of research towards the design of ultra-efficient 3D CNN accelerators of tomorrow.
... As not all convolutional layers can be implemented efficiently with the Winograd algorithm, we only replace those with a 3×3 kernel size and unitary stride, whereas all the others, like the 1×1 pointwise convolutions, are processed using a standard algorithm. Although strided convolution can be implemented with the Winograd algorithm [65], [67], the control and compute overhead dominates the potential MACs reduction (i.e., stride-2 F4 leads only to a 1.8× MACs reduction). ...
... VI. RELATED WORK Winograd Algorithm. Several works have been proposed to extend the original Winograd algorithm [62] to work on general 2D convolution [29], [65], [67], and to improve its performance by combining it with the Strassen algorithm [69], or its numerical accuracy by using higher-order polynomials [4] and better polynomial root points for m > 4 [1], [3]. Li et al. [34] combined the Winograd algorithm with AdderNet, which uses the ℓ1 instead of the ℓ2 norm for feature extraction, thereby replacing all MAC operations with additions. ...
Preprint
Most of today's computer vision pipelines are built around deep neural networks, where convolution operations require most of the generally high compute effort. The Winograd convolution algorithm computes convolutions with fewer MACs compared to the standard algorithm, reducing the operation count by a factor of 2.25x for 3x3 convolutions when using the version with 2x2-sized tiles $F_2$. Even though the gain is significant, the Winograd algorithm with larger tile sizes, i.e., $F_4$, offers even more potential in improving throughput and energy efficiency, as it reduces the required MACs by 4x. Unfortunately, the Winograd algorithm with larger tile sizes introduces numerical issues that prevent its use on integer domain-specific accelerators and higher computational overhead to transform input and output data between spatial and Winograd domains. To unlock the full potential of Winograd $F_4$, we propose a novel tap-wise quantization method that overcomes the numerical issues of using larger tiles, enabling integer-only inference. Moreover, we present custom hardware units that process the Winograd transformations in a power- and area-efficient way, and we show how to integrate such custom modules in an industrial-grade, programmable DSA. An extensive experimental evaluation on a large set of state-of-the-art computer vision benchmarks reveals that the tap-wise quantization algorithm makes the quantized Winograd $F_4$ network almost as accurate as the FP32 baseline. The Winograd-enhanced DSA achieves up to 1.85x gain in energy efficiency and up to 1.83x end-to-end speed-up for state-of-the-art segmentation and detection networks.
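The 2.25× and 4× figures above follow directly from counting elementwise multiplications per output tile: an m × m output tile with an r × r kernel costs (m + r − 1)² products in the Winograd domain versus m²r² for the direct algorithm (transform additions ignored). A tiny sketch of that arithmetic:

```python
def wino_mac_reduction(m, r):
    """MAC reduction of Winograd F(m x m, r x r) over direct convolution:
    direct needs m^2 * r^2 multiplications per m x m output tile,
    Winograd needs (m + r - 1)^2 elementwise products per tile."""
    direct = (m * m) * (r * r)
    wino = (m + r - 1) ** 2
    return direct / wino

assert wino_mac_reduction(2, 3) == 2.25  # F(2x2,3x3): 36 vs 16 products
assert wino_mac_reduction(4, 3) == 4.0   # F(4x4,3x3): 144 vs 36 products
```

This also makes the trade-off visible: larger tiles reduce MACs further, at the cost of larger transforms and the numerical issues the abstract discusses.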
... For higher accuracy, a stride method, a technique frequently used in convolutional neural networks (CNNs) [24], was introduced to our learning group. Because learning is driven by the number of times a pattern appears at a specific location, a pattern that has a similar shape but a different location within the image may be recognized as a completely different pattern. ...
Article
Full-text available
In this paper, we present a digital processing in memory (DPIM) configured as a stride edge-detection search frequency neural network (SE-SFNN) which is trained through a spike-location-dependent-plasticity (SLDP), a learning mechanism reminiscent of spike-timing-dependent plasticity. This mechanism allows for rapid online learning as well as a simple memory-based implementation. In particular, we employ a ternary data scheme to take advantage of a ternary content addressable memory (TCAM). The scheme utilizes a ternary representation of the image pixels and the TCAMs are used in a two-layer format to significantly reduce the computation time. The first layer applies several filtering kernels followed by the second layer that reorders pattern dictionaries of TCAMs to place the most frequent patterns at the top of each supervised TCAM dictionary. Numerous TCAM blocks in both layers operate in a massively parallel fashion using digital ternary values. There are no complicated multiply operations performed and learning is performed in a feedforward scheme. This allows rapid robust learning as a trade-off with the parallel memory block size. Furthermore, we propose a method to reduce the TCAM memory size using a two-tiered minor to major promotion (M2MP) of frequently occurring patterns. This reduction scheme is performed concurrently during the learning operation without incurring a preconditioning overhead. We show that with a minimal circuit overhead, the required memory size is reduced by 84.4% and the total clock cycles required for learning also decrease by 97.31% while the accuracy decreases only by 1.12%. We classified images with 94.58% accuracy on the MNIST dataset. Using a 100 MHz clock, our simulation results show that the MNIST training takes about 6.3 ms dissipating less than 4 mW of average power. In terms of the inference speed, the trained hardware is capable of processing 5,882,352 images per second.
... Stride is a parameter or constraint that can determine the amount of filter shift. If the value of a stride is 1, then the convolutional filter will shift by 1 pixel moving horizontally and vertically (Yepez & Ko, 2020). The smaller the value of a stride, the more detailed the value of the information we will get on input, but this stride process has a large computation time (Awangga & Putro, 2020). ...
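The stride's effect on output size described above follows the usual window-count formula; as a small illustrative helper (not taken from the cited papers):

```python
def conv_output_size(n, k, stride, pad=0):
    """Number of positions a k-wide filter visits over n samples,
    moving `stride` samples at a time with `pad` zeros on each side."""
    return (n + 2 * pad - k) // stride + 1

assert conv_output_size(224, 3, stride=1, pad=1) == 224  # stride 1 keeps size
assert conv_output_size(224, 3, stride=2, pad=1) == 112  # stride 2 halves it
```

A smaller stride retains more spatial detail but visits more windows, which is exactly the computation-time trade-off the snippet mentions.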
Article
Full-text available
In the coastal area of Likupang, many types of saltwater fish can be consumed, such as tuna and skipjack. Yet, there are also types of saltwater fish that cannot be consumed or are protected by the government, such as Napoleon fish and sea kingfish. Thus, this research aimed to build a desktop application that can automatically classify consumable and non-consumable saltwater fish species more accurately and promptly using a suitable image recognition method like the Convolutional Neural Network (CNN). CNN has the ability to distinguish images by recognizing pixels in a two-dimensional image and RGB (Red, Green, Blue) colors, which are then converted into a matrix with various values, making it easier for the system to recognize the two-dimensional image. By using 40% test data (143 images) and 60% training data (213 images), this study obtained test accuracy in identifying and classifying images of consumable fish, non-consumable fish, and non-fish images with percentages of 94%, 98%, and 95%, respectively.
... Therefore, we adapt two 8x8 multiplications into one DSP48E to reduce the total DSP usage (Figure 5). For the DWC Accelerator shown in Figure 6, many mature and efficient methods for a standard 3 × 3 convolution have been proposed, such as Winograd [26,27]. Because the DWC part needs to match the PWC output and one kernel is only used in one channel, we keep the data unchanged and feed it into a line buffer unit directly to complete the DWC. ...
Article
Full-text available
Convolutional neural networks (CNNs) have been widely applied in the field of medical tasks because they can achieve high accuracy in many domains using a large number of parameters and operations. However, many applications designed for auxiliary checks or assistance need to be deployed on portable devices, where the huge number of operations and parameters of a standard CNN can become an obstruction. MobileNet adopts a depthwise separable convolution to replace the standard convolution, which can greatly reduce the number of operations and parameters while maintaining a relatively high accuracy. Such highly structured models are very suitable for FPGA implementation in order to further reduce resource requirements and improve efficiency. Many other implementations focus on performance more than on resource requirements, because MobileNets has already reduced both parameters and operations and obtained significant results. However, because many small devices only have limited resources, they cannot run MobileNet-like efficient networks in the usual way, and there are still many auxiliary medical applications that require a high-performance network running in real time to meet their requirements. Hence, we need a specific accelerator structure to further reduce the memory and other resource requirements while running MobileNet-like efficient networks. In this paper, a MobileNet accelerator is proposed to minimize the on-chip memory capacity and the amount of data transferred between on-chip and off-chip memory. We propose two configurable computing modules, the Pointwise Convolution Accelerator and the Depthwise Convolution Accelerator, to parallelize the network and reduce the memory requirement with a specific dataflow model. At the same time, a new cache usage method is also proposed to further reduce the use of on-chip memory.
We implemented the accelerator on Xilinx XC7Z020, deployed MobileNetV2 on it, and achieved 70.94 FPS with 524.25 KB on-chip memory usage under 150 MHz.
... In [9], the proposed Winograd-based digital filtering algorithms for the convolutional layers of neural networks outperformed the fast Fourier transform in the operating speed of a deep neural network processing large volumes of visual data. This approach was extended and generalized to the processing of one-, two-, and three-dimensional signals by a convolutional neural network [10]. Building on these studies, various architectures [11] and hardware accelerators [12 -14] have been developed for high-performance implementation of Winograd-based neural network image processing algorithms. ...
Article
Full-text available
The fast increase of the amount of quantitative and qualitative characteristics of digital visual data calls for the improvement of the performance of modern image processing devices. This article proposes new algorithms for 2D digital image processing based on the Winograd method in a general form. An analysis of the obtained results showed that the use of the Winograd method reduces the computational complexity of image processing by up to 84% compared to the traditional direct digital filtering method depending on the filter parameters and image fragments, while not affecting the quality of image processing. The resulting Winograd method transformation matrices and the algorithms developed can be used in image processing systems to improve the performance of the modern microelectronic devices that carry out image denoising, compression, and pattern recognition. Research directions that show promise for further research include hardware implementation on a field-programmable gate array and application-specific integrated circuit, development of algorithms for digital image processing based on the Winograd method in a general form for a 1D wavelet filter bank and for stride convolution used in convolutional neural networks.
... Besides, we compared our proposed model on COVID-19 diagnosis tasks with previous SOTA COVID-19 screening methods, which included Shi [47], Wang [48], and Xu [47], in Table 3. Shi [47] presented an infection region-specific segmentation technique based on a random forest model to distinguish COVID-19 from other forms of pneumonia using CT exams [47]. This study reported 83.30% accuracy [47]. ...
Article
Full-text available
In this paper, a two-dimensional Winograd CNN (Convolutional Neural Network) chip for COVID-19 and pneumonia detection is proposed. The COVID-19 pandemic has led to a dramatic increase in studies of the effects of the virus on the lungs. Some studies have also pointed out that the clinical application of deep learning in the medical field is increasing, that the radiation impact of CT exposure is more serious than that of X-ray films, and that CT exposure is not suitable for viral pneumonia. This study analyzes the results of X-rays trained using a CNN architecture with Winograd-based convolution. A popular model architecture is also set up to realize four kinds of grayscale image prediction and verify the actual prediction effect on this data. The experimental data mainly consist of chest X-rays of four different grayscale types as input material. The research method of this experiment is to design the basic CNN operation structure of the chip and apply the Winograd method to the convolution operation. Finally, the chip was fabricated in the TSMC 0.18 μm process, and each step was verified to ensure the correctness of the circuit. The experimental results show that the accuracy of our proposed method reaches 87.87% and the precision reaches 88.48%, proving that our proposed method has an excellent recognition rate.
... Moreover, the nature of the Winograd transformation is only applicable to convolutions with stride s = 1. Making the Winograd transformation work for stride s > 1 is an open research problem, with multiple solutions having been proposed in the recent past (Pan and Chen 2021; Huang et al. 2021; Yepez and Ko 2020). Using Winograd convolution for r = 7 causes a lack of numerical precision, and hence we avoid it. ...
Preprint
Full-text available
ML-as-a-service continues to grow, and so does the need for very strong privacy guarantees. Secure inference has emerged as a potential solution, wherein cryptographic primitives allow inference without revealing users' inputs to a model provider or model's weights to a user. For instance, the model provider could be a diagnostics company that has trained a state-of-the-art DenseNet-121 model for interpreting a chest X-ray and the user could be a patient at a hospital. While secure inference is in principle feasible for this setting, there are no existing techniques that make it practical at scale. The CrypTFlow2 framework provides a potential solution with its ability to automatically and correctly translate clear-text inference to secure inference for arbitrary models. However, the resultant secure inference from CrypTFlow2 is impractically expensive: Almost 3TB of communication is required to interpret a single X-ray on DenseNet-121. In this paper, we address this outstanding challenge of inefficiency of secure inference with three contributions. First, we show that the primary bottlenecks in secure inference are large linear layers which can be optimized with the choice of network backbone and the use of operators developed for efficient clear-text inference. This finding and emphasis deviates from many recent works which focus on optimizing non-linear activation layers when performing secure inference of smaller networks. Second, based on analysis of a bottle-necked convolution layer, we design a X-operator which is a more efficient drop-in replacement. Third, we show that the fast Winograd convolution algorithm further improves efficiency of secure inference. In combination, these three optimizations prove to be highly effective for the problem of X-ray interpretation trained on the CheXpert dataset.
... WHD [16] exploited the fusion of Winograd units, but only one type of unit can be used per layer. Reference [22] proposed a Winograd processing element for convolutions that is only compatible with filter size 3, but with both stride 1 and stride 2. References [17,19] introduced a universal approach to deal with large strides and large filter sizes. Previous design space exploration schemes have been applicable only to 2D CNN accelerators, making them unsuitable for architectures of different dimensions. ...
Article
Convolutional neural networks (CNNs) have proven to be promising in various applications such as audio recognition, image classification, and video understanding. The Winograd algorithm helps to reduce the computational complexity of a convolution but suffers from poor compatibility across different convolution shapes. This work introduces a dynamic dimension-level fusion architecture based on Winograd for accelerating CNNs of different dimensions. We explore this Winograd architecture by designing Dimension Fusion, a dimension-level processing engine that dynamically fuses to match the convolution shape of individual CNN layers. The proposed architecture is the first Winograd-based work to be compatible with all convolution shapes (dimension, stride, and filter size), and it achieves PE efficiency up to 1.55× and energy efficiency up to 3.3× compared with state-of-the-art accelerators.
... Down-sampling often uses a strided convolution operator to reduce the size of the feature map through non-unit-stride convolution. [19] extended the algorithm to three dimensions while achieving a stride of 2. ...
Preprint
Convolutional Neural Networks (CNNs) have been widely used in various fields and play an important role. Convolution operators are the fundamental components of convolutional neural networks and are also the most time-consuming part of network training and inference. In recent years, researchers have proposed several fast convolution algorithms, including FFT and Winograd. Among them, Winograd convolution significantly reduces the number of multiplications in a convolution, and it also takes up less memory than FFT convolution. Therefore, Winograd convolution has quickly become the first choice for fast convolution implementation within a few years. At present, there is no systematic summary of the convolution algorithm; this article aims to fill that gap and provide a detailed reference for follow-up researchers. It summarizes the development of Winograd convolution in terms of algorithm expansion, algorithm optimization, implementation, and application, and finally offers a brief outlook on possible future directions.
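As a concrete illustration of the multiplication savings the survey describes, here is the classic 1-D Winograd algorithm F(2,3) in NumPy (a minimal sketch using the standard Lavin-Gray transform matrices; it produces 2 outputs of a 3-tap filter with 4 multiplications instead of 6 — the 2-D version F(2x2,3x3) reaches the 2.25x factor):

```python
import numpy as np

# Winograd F(2,3): 2 outputs of a 3-tap 1-D filter (stride 1) with
# 4 element-wise multiplications instead of the 6 of direct convolution.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """d: input tile of 4 samples, g: 3-tap filter -> 2 outputs."""
    U = G @ g          # transform the filter (4 values)
    V = BT @ d         # transform the input tile (4 values)
    M = U * V          # the 4 multiplications
    return AT @ M      # inverse transform -> 2 outputs

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([1.0, 0.5, -1.0])
ref = np.array([np.dot(d[0:3], g), np.dot(d[1:4], g)])  # direct sliding dot product
assert np.allclose(winograd_f23(d, g), ref)
```

Note that, as in Lavin and Gray's formulation, this computes the sliding dot product (correlation) form of convolution used by CNN frameworks.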
... The first layer of our customized backbone contains a convolution with stride 2; this downsamples the input without a max-pooling layer and requires less computation than stride 1 [50]. ...
Article
Full-text available
The Commercial Vehicle Safety Alliance (CVSA) aims to achieve uniformity, compatibility and reciprocity of commercial motor vehicle inspections and enforcement by certified inspectors dedicated to driver and vehicle safety. Commercial vehicles that pass a CVSA inspection are eligible for a decal representing a commitment to safety. In this work, we propose a two-step automatic CVSA decal recognition system using deep convolutional neural network architectures. The first step localizes a vehicle's windshield and the CVSA decal within it, and classifies the decal colour. The CVSA decal is cropped and used as input to the second stage, which localizes and classifies a digit and the corner-cut of the CVSA decal. With the corner-cut, colour, and digit, the system can determine the decal's date of issue. We use the MobileDet architecture as our baseline, customizing the backbone to our tasks. Our first custom architecture is larger than the baseline because it needs more representational power to detect small decals within an image. The second architecture is much smaller because digit and corner-cut recognition is a simpler task. Our custom architectures reduce processing time and exceed the accuracy of pre-trained architectures. We implemented our models on different edge hardware accelerators (e.g. the Google Coral, Nvidia Jetsons, and Intel NCS) and compared their performance when processing a real-time video stream. Our system can predict frames at 173.31 FPS on an NVIDIA Jetson AGX Xavier with 98.5% mean average precision @ 0.5 IoU. © 2021 The Authors. IET Intelligent Transport Systems published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.
... Two types of strides are used: stride 1 (s1) and stride 2 (s2). The stride is the element-wise shift displacement of a kernel over an input along a particular axis [11]. With stride 1 the kernel moves one element at a time, and with stride 2 it moves two elements at a time. ...
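The stride semantics described in the snippet above can be made concrete with a short NumPy sketch (illustrative code, not from the cited work): with stride 1 the kernel advances one element per output, with stride 2 it advances two, roughly halving the output length.

```python
import numpy as np

def conv1d_valid(x, k, stride=1):
    """Direct 1-D sliding dot product with a given stride (valid padding)."""
    n, r = len(x), len(k)
    return np.array([np.dot(x[i:i + r], k)
                     for i in range(0, n - r + 1, stride)])

x = np.arange(8, dtype=float)      # 8 input samples
k = np.array([1.0, 1.0, 1.0])      # 3-tap kernel
y1 = conv1d_valid(x, k, stride=1)  # kernel advances 1 sample per output
y2 = conv1d_valid(x, k, stride=2)  # kernel advances 2 samples per output
assert len(y1) == 6                # floor((8 - 3) / 1) + 1
assert len(y2) == 3                # floor((8 - 3) / 2) + 1
assert np.allclose(y2, y1[::2])    # stride 2 = every other stride-1 output
```

The last assertion is the relationship that stride-2 fast algorithms exploit: the stride-2 output is a subsampling of the stride-1 output.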
Article
Full-text available
Machine learning models based on Convolutional Neural Networks (CNNs) can be effectively used for the detection and recognition of objects, such as Corona Virus Disease 19 (COVID-19). In particular, MobileNet and the Single Shot multi-box Detector (SSD) have recently been proposed as machine learning models for object detection. However, there are still challenges in deploying such architectures on embedded devices, due to their limited computational power. Another problem is that the accuracy of the associated machine learning model may decrease, depending on the number of parameters and layers involved. This paper proposes a light-weight MobileNet (LMN) architecture that can be used to improve the accuracy of the machine learning model, with a small number of layers and lower computation time compared to existing models. By experimentation, we show that the proposed LMN model can be effectively used for the detection of the COVID-19 virus. The proposed LMN can achieve an accuracy of 98% with a file size of 27.8 Mbits by replacing the standard CNN layers with separable convolutional layers.
Chapter
Deep learning algorithms are playing a vital role in a wide range of Artificial Intelligence-based applications. Since CPU/GPU-based solutions are not suitable for low-power applications like IoT due to their high power requirements, dedicated hardware called a hardware accelerator is needed for a given AI-based processing task. Further, hardware accelerators are essential in reducing inference-time latency. FPGA/ASIC-based CNN accelerator implementations are in research focus nowadays owing to their high performance per watt in a variety of domains such as vision, voice, and text. In this paper, the k-means-based technique for CNN model compression and the Winograd-based techniques for reducing the number of multiplications reported in the literature are discussed as case studies for model compression and hardware optimization, respectively. Keywords: Convolutional neural network; Deep learning neural network; Field programmable gate array; Hardware accelerator
Conference Paper
With the astonishing achievements of Convolutional Neural Network (CNN) accelerators in real-time applications, the deployment of CNNs on hardware has become an attractive topic. Pooling layers in CNNs are employed to reduce the computation of convolutional layers. Nevertheless, their hardware implementation strategy can impact the accuracy and performance of accelerators. This paper presents a novel parallel Stochastic Computing (SC) based architecture for hardware pooling modules in stochastic CNN accelerators. With this approach, the SC-based average pooling is reconfigurable with 1.28 times lower power consumption, and the max-pooling layer achieves an area reduction by a ratio of 4.36. Extending the application of stochastic CNN accelerators to different classification problems is also achieved by implementing AAD pooling with the proposed method. Finally, the reliability of the proposed method is confirmed by testing our pooling layers in the VGG-16 structure with the CIFAR-10 dataset.
Article
Convolutional neural networks (CNNs) have had great success when applied to computer vision technology, and many application-specific integrated circuit (ASIC) and field-programmable gate array (FPGA) CNN accelerators have been proposed. These accelerators primarily focus on the acceleration of a single input, and they are not particularly optimized for video applications. In this article, we focus on the similarities between continuous inputs in video, and we propose a YOLOv3-tiny CNN FPGA accelerator using incremental operation. The accelerator can skip the convolution operation of similar data between continuous inputs. We also use the Winograd algorithm to optimize the conv $3\times3$ operator in the YOLOv3-tiny network to further improve the accelerator's efficiency. Experimental results show that our accelerator achieved 74.2 frames/s on ImageNet ILSVRC2015. Compared to the original network without the Winograd algorithm and incremental operation, our design provides a 4.10$\times$ speedup. When compared with other YOLO network FPGA accelerators applied to video applications, our design provides a 3.13$\times$–18.34$\times$ normalized digital signal processor (DSP) efficiency and 1.10$\times$–14.2$\times$ energy efficiency.
Article
There has been a dramatic proliferation of research concerned with Convolutional Neural Networks (CNNs) over the past decade. In the field of smart surveillance, multi-channel frames need to be processed simultaneously in real time, and deep CNNs with tens of layers make this computation intensive. To relieve this tremendous computational pressure, many CNN accelerators have been proposed. Moreover, many works focus on the design of the multiply-and-accumulate (MAC) unit, since tens of billions of MAC computations induce enormous energy consumption for logical operations in CNNs. Although DynCNN (Tsai et al., 2020) improves the structure of conventional CNNs by exploiting the high similarity of consecutive frames, it generates redundant cells and leads to redundant operations. Therefore, this paper proposes an energy-efficient reconfigurable MAC and a high-performance CNN accelerator based on the Winograd Minimal Filtering Algorithm (WMFA). The proposed design not only dramatically enhances throughput but also achieves better energy efficiency with limited resources. Experimental results showed that the proposed Winograd processing element (PE) gives 8% to 12% power improvement, and DSP efficiency was improved in the range of 1.48× to 6.82× compared with previous works.
Article
Depthwise convolutions are widely used in convolutional neural networks (CNNs) targeting mobile and embedded systems. Depthwise convolution layers reduce the computational load and the number of parameters compared to conventional convolution layers. Many deep neural network (DNN) accelerators adopt an architecture that exploits the high data-reuse factor of DNN computations, such as a systolic array. However, depthwise convolutions have a low data-reuse factor and under-utilize the processing elements (PEs) in systolic arrays. In this paper, we present a DNN accelerator design called RiSA, which provides a novel mechanism that boosts PE utilization for depthwise convolutions on a systolic array with minimal overhead. In addition, the PEs in systolic arrays can be used efficiently only if the data items (tensors) are arranged in the desired layout. Typical DNN accelerators provide various types of PE interconnects or additional modules to flexibly rearrange the data items and manage data movements during DNN computations. RiSA provides a lightweight set of tensor management tasks within the PE array itself that eliminates the need for an additional tensor-reshaping module. Using this embedded tensor reshaping, RiSA supports various DNN models, including convolutional neural networks and natural language processing models, while maintaining a high area efficiency. Compared to Eyeriss v2, RiSA improves the area and energy efficiency for MobileNet-V1 inference by 1.91× and 1.31×, respectively.
Article
Convolutional Neural Networks (CNNs) have been widely adopted in many kinds of artificial intelligence applications. Most of the computational overhead of CNNs is spent on convolutions. An effective approach to reducing this overhead is transforming convolutions in the time domain into multiplications in the frequency domain by means of Fast Fourier Transform (FFT) algorithms, known as FFT-based fast algorithms for convolution. However, current FFT-based fast implementations only work for unit-strided convolutions (stride 1) and cannot be directly applied to strided convolutions with a stride greater than 1, which are usually used as the first layer of CNNs and as an effective alternative to pooling layers for downsampling. In this paper, we first introduce rearrangement- and sampling-based methods for applying FFT-based fast algorithms to strided convolutions, and compare the arithmetic complexities of these two methods and the direct method in detail. Then, highly optimized parallel implementations of the two methods on an ARMv8-based many-core CPU are presented. Lastly, we benchmark these implementations against two GEMM-based implementations on the same ARMv8 CPU. Our experimental results with convolutions of different kernels, feature maps, and batch sizes show that the rearrangement-based method generally exceeds the sampling-based one under the same optimizations in most cases, and that both methods achieve much better performance than GEMM-based ones when the kernels, feature maps, and batch sizes are large. The experimental results on the convolutional layers of popular CNNs further support these conclusions.
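The sampling-based method from this abstract can be sketched in 1-D with NumPy (an illustrative reduction of the 2-D case; the function name and shapes are ours, not from the paper): compute the unit-stride FFT convolution over the whole input, then keep every s-th valid output. The rearrangement-based method would instead split the input and kernel into s interleaved sub-sequences before transforming.

```python
import numpy as np

def fft_conv_strided(x, k, s):
    """Stride-s 'valid' convolution via the sampling-based idea:
    full unit-stride FFT convolution, then keep every s-th output."""
    n = len(x) + len(k) - 1                 # linear-convolution length
    y = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)
    valid = y[len(k) - 1 : len(x)]          # drop the border outputs
    return valid[::s]                       # downsample to stride s

x = np.array([1., 3., 2., 5., 4., 7., 6., 9., 8., 11., 10., 13.])
k = np.array([1., -2., 1.])                 # second-difference kernel
# Direct strided (true) convolution for reference: the kernel is flipped.
direct = np.array([np.dot(x[i:i + 3], k[::-1]) for i in range(0, 10, 2)])
assert np.allclose(fft_conv_strided(x, k, 2), direct)
```

The sampling approach wastes the multiplications spent on the discarded outputs, which is exactly the inefficiency the rearrangement-based method is designed to avoid.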
Preprint
Full-text available
The combination of Winograd's algorithm and the systolic array architecture has demonstrated the capability of improving DSP efficiency in accelerating convolutional neural networks (CNNs) on FPGA platforms. However, handling arbitrary convolution kernel sizes in FPGA-based Winograd processing elements and supporting efficient data access remain underexplored. In this work, we are the first to propose an optimized Winograd processing element (WinoPE), which can naturally support multiple convolution kernel sizes with the same amount of computing resources and maintains high runtime DSP efficiency. Using the proposed WinoPE, we construct a highly efficient systolic array accelerator, termed WinoCNN. We also propose a dedicated memory subsystem to optimize data access. Based on the accelerator architecture, we build accurate resource and performance models to explore optimal accelerator configurations under different resource constraints. We implement our proposed accelerator on multiple FPGAs, and it outperforms state-of-the-art designs in terms of both throughput and DSP efficiency. Our implementation achieves a DSP efficiency of up to 1.33 GOPS/DSP and a throughput of up to 3.1 TOPS on the Xilinx ZCU102 FPGA. These are 29.1% and 20.0% better than the best previously reported solutions, respectively.
Article
Nowadays, with the increasing shortage of traditional medical resources, existing portable health-monitoring devices are no longer satisfactory; wearable healthcare devices with diagnostic capability are becoming much more desirable. However, the design of wearable healthcare devices faces the challenges of limited hardware resources and the need for high diagnostic accuracy. In this paper, an efficient hardware architecture is proposed to implement a 1-D CNN with global average pooling (GAP) specifically for embedded electrocardiogram (ECG) classification. GAP is implemented by replacing division with a shift operation, consuming no extra computing resources, and it largely reduces the number of network parameters. A fully pipelined processing unit (PU) array is designed to increase computing efficiency. A sign-bit-based dynamic activation strategy is developed to remove redundant multiplications and the resource consumption of ReLU. The proposed efficient hardware architecture is implemented on a Xilinx Zynq ZC706 board and achieves an average performance of 25.7 GOP/s at 200 MHz with a resource consumption of 1538 LUTs, improving resource efficiency by more than $3\times$ compared with the non-optimized case. The average classification accuracy over five ECG beat classes is 99.10%. In brief, the proposed efficient hardware design is promising for wearable healthcare devices, especially in the ECG classification area.
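The shift-based GAP trick mentioned in the abstract can be sketched in a few lines of integer Python (an illustrative sketch under the assumption that the pooling window length is a power of two; the function and variable names are ours, not from the paper):

```python
# Global average pooling over a window whose length T is a power of two:
# the division by T = 2^k reduces to an arithmetic right shift of the
# integer accumulator, so no divider circuit is needed in hardware.
def gap_shift(samples):
    T = len(samples)
    k = T.bit_length() - 1
    assert 1 << k == T, "window length must be a power of two"
    acc = sum(samples)        # integer accumulator (adder tree in hardware)
    return acc >> k           # divide by T via right shift

feature_map = [3, 5, 7, 9, 2, 4, 6, 8]      # 8 = 2^3 activations
assert gap_shift(feature_map) == sum(feature_map) // 8
```

The shift gives the floor of the average, which is the usual fixed-point behavior; when the window length is not a power of two, hardware designs typically pad or fold the window to the nearest power of two before applying this trick.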
Article
Rib fractures are injuries commonly assessed in trauma wards. Deep learning has demonstrated state-of-the-art accuracy for a variety of tasks, including image classification. This paper assesses the speed-accuracy trade-offs and general suitability of four popular convolutional neural networks for classifying rib fractures from axial computed tomography imagery. We transfer-learned InceptionV3, ResNet50, MobileNetV2, and VGG16 models, additionally training “decomposed” models comprising only the first n blocks of each architecture. Given that acute (new) fractures are generally the most important to detect, we trained two types of models: a classful model with the classes acute, old (healed), and normal (non-fractured); and a binary model with acute vs. the other classes. We found a model based on the first 7 blocks of InceptionV3 to achieve the best results and general speed-accuracy trade-off. The classful model achieved a 5-fold cross-validation average accuracy and macro recall of 96.00% and 94.0%, respectively. The binary model achieved a 5-fold cross-validation average accuracy, macro recall, and area under the receiver operating characteristic curve of 97.76%, 94.6%, and 94.7%, respectively. On a Windows 10 PC with 32 GB RAM and an Nvidia 1080ti GPU, the model's average CPU and GPU per-crop inference times were 13.6 and 12.2 ms, respectively. Compared to the InceptionV3 Block 7 classful model, a radiologist with 9 years of experience was less accurate but more sensitive to acute fractures; meanwhile, the deep learning model had fewer false positive diagnoses and better sensitivity to old fractures and normal ribs. The Cohen's Kappa between the two was 0.813.
Article
Full-text available
Convolutional neural networks (CNNs) are widely used in many computer vision applications. Previous FPGA implementations of CNNs are mainly based on the conventional convolution algorithm. However, the high arithmetic complexity of the conventional convolution algorithm restricts the performance of accelerators and significantly increases design challenges. It has been shown that the Winograd algorithm can effectively reduce the computational complexity of CNNs. Although a few FPGA approaches based on the Winograd algorithm have been implemented, they lack an evaluation of performance across different tile sizes of the Winograd algorithm. In this work, we explore the possibility of using the Winograd algorithm to accelerate CNNs on FPGA. First, we propose an accelerator architecture that applies to both convolutional layers and fully connected layers. Second, we use a high-level synthesis tool to implement our design conveniently. Finally, we evaluate our accelerator with different tile sizes in terms of resource utilization, performance, and efficiency. On the VUS440 platform, we achieve an average of 943 GOPS for the overall VGG16 under low resource utilization, reaching higher efficiency than state-of-the-art works on FPGAs.
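The tile-size evaluation matters because the arithmetic saving of 2-D Winograd grows with the tile: F(m x m, r x r) needs (m + r - 1)^2 multiplications per m x m output tile versus m^2 r^2 for direct convolution. A quick sketch of that ratio (standard complexity counts, not numbers from this paper):

```python
def wino_reduction(m, r):
    """Multiplication reduction of 2-D Winograd F(m x m, r x r)
    over direct convolution: m^2 r^2 / (m + r - 1)^2."""
    return (m * m * r * r) / (m + r - 1) ** 2

assert abs(wino_reduction(2, 3) - 2.25) < 1e-12  # F(2x2,3x3): the classic 2.25x
assert abs(wino_reduction(4, 3) - 4.0) < 1e-12   # F(4x4,3x3): bigger tile, bigger saving
```

Larger tiles save more multiplications but require larger transform matrices and are more prone to numerical error, which is what makes a per-tile-size evaluation of resource utilization and performance worthwhile.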
Conference Paper
Full-text available
Three-dimensional convolutional neural networks (3D CNNs) are used efficiently in many computer vision applications. Most previous work in this area has concentrated only on designing and optimizing accelerators for 2D CNNs, with few attempts made to accelerate 3D CNNs on FPGA. We find accelerating 3D CNNs on FPGA challenging due to their high computational complexity and storage demands. More importantly, although the computation patterns of 2D and 3D CNNs are analogous, the conventional approaches adopted for accelerating 2D CNNs may be unfit for 3D CNN acceleration. In this paper, in order to accelerate 2D and 3D CNNs within a uniform framework, we propose a uniform template-based architecture that uses templates based on the Winograd algorithm to ensure fast development of 2D and 3D CNN accelerators. Furthermore, we also develop a uniform analytical model to facilitate efficient design space exploration of 2D and 3D CNN accelerators based on our architecture. Finally, we demonstrate the effectiveness of the template-based architecture by implementing accelerators for real-life 2D and 3D CNNs (VGG16 and C3D) on multiple FPGA platforms. On the S2C VUS440, we achieve up to 1.13 TOPS and 1.11 TOPS under low resource utilization for VGG16 and C3D, respectively. End-to-end comparisons with CPU and GPU solutions demonstrate that our implementation of C3D achieves gains of up to 13x and 60x in performance and energy relative to a CPU solution, and a 6.4x energy efficiency gain over a GPU solution.
Article
Full-text available
This paper presents a method for detecting salient objects in videos where temporal information in addition to spatial information is fully taken into account. Following recent reports on the advantage of deep features over conventional hand-crafted features, we propose the SpatioTemporal Deep (STD) feature that utilizes local and global contexts over frames. We also propose the SpatioTemporal Conditional Random Field (STCRF) to compute saliency from STD features. STCRF is our extension of CRF toward the temporal domain and formulates the relationship between neighboring regions both in a frame and over frames. STCRF leads to temporally consistent saliency maps over frames, contributing to the accurate detection of the boundaries of salient objects and the reduction of noise in detection. Our proposed method first segments an input video into multiple scales and then computes a saliency map at each scale level using STD features with STCRF. The final saliency map is computed by fusing saliency maps at different scale levels. Our intensive experiments using publicly available benchmark datasets confirm that the proposed method significantly outperforms state-of-the-art methods. We also applied our saliency computation to the video object segmentation task, showing that our method outperforms existing video object segmentation methods.
Conference Paper
Full-text available
Convolutional neural networks (CNNs) have been widely applied in many deep learning applications. In recent years, the FPGA implementation of CNNs has attracted much attention because of its high performance and energy efficiency. However, existing implementations have difficulty fully leveraging the computation power of the latest FPGAs. In this paper we implement CNNs on an FPGA using a systolic array architecture, which can achieve a high clock frequency under high resource utilization. We provide an analytical model for performance and resource utilization and develop an automatic design space exploration framework, as well as a source-to-source code transformation from a C program to a CNN implementation using a systolic array. The experimental results show that our framework is able to generate accelerators for real-life CNN models, achieving up to 461 GFlops for the floating-point data type and 1.2 Tops for 8-16 bit fixed point.
Article
Full-text available
We present a class of efficient models called MobileNets for mobile and embedded vision applications. MobileNets are based on a streamlined architecture that uses depth-wise separable convolutions to build light weight deep neural networks. We introduce two simple global hyper-parameters that efficiently trade off between latency and accuracy. These hyper-parameters allow the model builder to choose the right sized model for their application based on the constraints of the problem. We present extensive experiments on resource and accuracy tradeoffs and show strong performance compared to other popular models on ImageNet classification. We then demonstrate the effectiveness of MobileNets across a wide range of applications and use cases including object detection, finegrain classification, face attributes and large scale geo-localization.
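The parameter saving behind the depthwise-separable factorization can be sketched with simple counts (standard formulas for a k x k depthwise stage followed by a 1x1 pointwise stage; the layer sizes below are arbitrary examples, not from the paper):

```python
def std_params(cin, cout, k):
    """Weights of a standard k x k convolution layer (bias ignored)."""
    return cin * cout * k * k

def dw_sep_params(cin, cout, k):
    """Weights of a depthwise-separable layer:
    k x k depthwise (one filter per input channel) + 1x1 pointwise."""
    return cin * k * k + cin * cout

cin, cout, k = 64, 128, 3
std = std_params(cin, cout, k)    # 73,728 weights
sep = dw_sep_params(cin, cout, k) #  8,768 weights
assert std == 73728 and sep == 8768
# The ratio is roughly 1/cout + 1/k^2, i.e. about 8.4x fewer weights here.
```

The same factorization cuts multiply-accumulate counts by essentially the same ratio, which is why depthwise-separable layers dominate mobile-oriented backbones.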
Conference Paper
Full-text available
In recent years, convolutional neural network (CNN) based methods have achieved great success in a large number of applications and have been among the most powerful and widely used techniques in computer vision. However, CNN-based methods are computation-intensive and resource-consuming, and are thus hard to integrate into embedded systems such as smart phones, smart glasses, and robots. FPGA is one of the most promising platforms for accelerating CNNs, but its limited bandwidth and on-chip memory size limit the performance of FPGA accelerators for CNNs. In this paper, we go deeper with the embedded FPGA platform on accelerating CNNs and propose a CNN accelerator design on embedded FPGA for ImageNet large-scale image classification. We first present an in-depth analysis of state-of-the-art CNN models and show that convolutional layers are computation-centric and fully-connected layers are memory-centric. Then a dynamic-precision data quantization method and a convolver design that is efficient for all layer types in a CNN are proposed to improve bandwidth and resource utilization. Results show that only 0.4% accuracy loss is introduced by our data quantization flow for the very deep VGG16 model when 8/4-bit quantization is used. A data arrangement method is proposed to further ensure high utilization of the external memory bandwidth. Finally, a state-of-the-art CNN, VGG16-SVD, is implemented on an embedded FPGA platform as a case study. VGG16-SVD is the largest and most accurate network that has been implemented on FPGA end-to-end so far. The system on a Xilinx Zynq ZC706 board achieves a frame rate of 4.45 fps with a top-5 accuracy of 86.66% using 16-bit quantization. The average performance of the convolutional layers and the full CNN is 187.8 GOP/s and 137.0 GOP/s under a 150 MHz working frequency, which significantly outperforms previous approaches.
Conference Paper
Full-text available
We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. For \(300 \times 300\) input, SSD achieves 74.3 % mAP on VOC2007 test at 59 FPS on a Nvidia Titan X and for \(512 \times 512\) input, SSD achieves 76.9 % mAP, outperforming a comparable state-of-the-art Faster R-CNN model. Compared to other single stage methods, SSD has much better accuracy even with a smaller input image size. Code is available at https://github.com/weiliu89/caffe/tree/ssd.
Article
Full-text available
Recent research on deep neural networks has focused primarily on improving accuracy. For a given accuracy level, it is typically possible to identify multiple DNN architectures that achieve that accuracy level. With equivalent accuracy, smaller DNN architectures offer at least three advantages: (1) Smaller DNNs require less communication across servers during distributed training. (2) Smaller DNNs require less bandwidth to export a new model from the cloud to an autonomous car. (3) Smaller DNNs are more feasible to deploy on FPGAs and other hardware with limited memory. To provide all of these advantages, we propose a small DNN architecture called SqueezeNet. SqueezeNet achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters. Additionally, with model compression techniques we are able to compress SqueezeNet to less than 0.5MB (510x smaller than AlexNet). The SqueezeNet architecture is available for download here: https://github.com/DeepScale/SqueezeNet
Article
Full-text available
This paper introduces the use of single layer and deep convolutional networks for remote sensing data analysis. Direct application to multi- and hyper-spectral imagery of supervised (shallow or deep) convolutional networks is very challenging given the high input data dimensionality and the relatively small amount of available labeled data. Therefore, we propose the use of greedy layer-wise unsupervised pre-training coupled with a highly efficient algorithm for unsupervised learning of sparse features. The algorithm is rooted on sparse representations and enforces both population and lifetime sparsity of the extracted features, simultaneously. We successfully illustrate the expressive power of the extracted representations in several scenarios: classification of aerial scenes, as well as land-use classification in very high resolution (VHR), or land-cover classification from multi- and hyper-spectral images. The proposed algorithm clearly outperforms standard Principal Component Analysis (PCA) and its kernel counterpart (kPCA), as well as current state-of-the-art algorithms of aerial classification, while being extremely computationally efficient at learning representations of data. Results show that single layer convolutional networks can extract powerful discriminative features only when the receptive field accounts for neighboring pixels, and are preferred when the classification requires high resolution and detailed results. However, deep architectures significantly outperform single layers variants, capturing increasing levels of abstraction and complexity throughout the feature hierarchy.
Article
In recent years, Convolutional Neural Networks (CNNs) have become widely adopted for computer vision tasks. FPGAs have been thoroughly explored as promising hardware accelerators for CNNs due to their high performance, energy efficiency, and reconfigurability. However, prior FPGA solutions based on the conventional convolution algorithm are often bounded by the computational capability of FPGAs (e.g., the number of DSPs). To address this problem, the feature maps are transformed to a special domain using fast algorithms to reduce the arithmetic complexity. Winograd and the Fast Fourier Transform (FFT), as representative fast algorithms, first transform the input data and filter to the Winograd or frequency domain, then perform element-wise multiplication, and apply an inverse transformation to get the final output. In this paper, we propose a novel architecture for implementing fast algorithms on FPGAs. Our design employs a line-buffer structure to effectively reuse feature-map data among different tiles. We also effectively pipeline the Winograd/FFT PE engine and instantiate multiple PEs through parallelization. Meanwhile, there exists a complex design space to explore, so we propose an analytical model to predict resource usage and performance and use it to guide a fast design space exploration. Experiments using state-of-the-art CNNs demonstrate the best performance and energy efficiency on FPGAs. We achieve 854.6 GOP/s and 2479.6 GOP/s for AlexNet and VGG16 on the Xilinx ZCU102 platform using Winograd. We achieve 130.4 GOP/s for ResNet using Winograd and 201.1 GOP/s for YOLO using FFT on the Xilinx ZC706 platform.
Article
Polarimetric synthetic aperture radar (PolSAR) image classification is an important application. Advanced deep learning techniques represented by deep convolutional neural network (CNN) have been utilized to enhance the classification performance. One current challenge is how to adapt deep CNN classifier for PolSAR classification with limited training samples, while keeping good generalization performance. This letter attempts to contribute to this problem. The core idea is to incorporate expert knowledge of target scattering mechanism interpretation and polarimetric feature mining to assist deep CNN classifier training and improve the final classification performance. A polarimetric-feature-driven deep CNN classification scheme is established. Both classical roll-invariant polarimetric features and hidden polarimetric features in the rotation domain are used to drive the proposed deep CNN model. Comparison studies validate the efficiency and superiority of the proposal. For the benchmark AIRSAR data, the proposed method achieves the state-of-the-art classification accuracy. Meanwhile, the convergence speed from the proposed polarimetric-feature-driven CNN approach is about 2.3 times faster than the normal CNN method. For multitemporal UAVSAR data sets, the proposed scheme achieves comparably high classification accuracy as the normal CNN method for train-used temporal data, while for train-not-used data it obtains an average of 4.86% higher overall accuracy than the normal CNN method. Furthermore, the proposed strategy can also produce very promising classification accuracy even with very limited training samples. IEEE
Article
Facial expression recognition is a challenging task that involves detection and interpretation of complex and subtle changes in facial muscles. Recent advances in feed-forward deep neural networks (DNNs) have offered improved object recognition performance. Sparse feature learning in feed-forward DNN models offers further improvement in performance when compared to the earlier handcrafted techniques. However, the depth of the feed-forward DNNs and the computational complexity of the models increase in proportion to the challenges posed by the facial expression recognition problem. The feed-forward DNN architectures do not exploit another important learning paradigm, known as recurrency, which is ubiquitous in the human visual system. Consequently, this paper proposes a novel biologically relevant sparse-deep simultaneous recurrent network (S-DSRN) for robust facial expression recognition. The feature sparsity is obtained by adopting dropout learning in the proposed DSRN, as opposed to the usual handcrafting of additional penalty terms for the sparse representation of data. Theoretical analysis of S-DSRN shows that dropout learning offers desirable properties such as sparsity and prevents the model from overfitting. Experimental results also suggest that the proposed method yields better accuracy, requires a reduced number of parameters, and offers lower computational complexity than the previously reported state-of-the-art feed-forward DNNs on two of the most widely used publicly available facial expression data sets. Furthermore, we show that combining the proposed neural architecture with a state-of-the-art metric learning technique significantly improves the overall recognition performance. Finally, a graphical processing unit (GPU)-based implementation of S-DSRN is obtained for real-time applications.
Article
We propose an integrated end-to-end automatic speech recognition (ASR) paradigm by joint learning of the front-end speech signal processing and back-end acoustic modeling. We believe that “only good signal processing can lead to top ASR performance” in challenging acoustic environments. This notion leads to a unified deep neural network (DNN) framework for distant speech processing that can achieve both high-quality enhanced speech and high-accuracy ASR simultaneously. Our goal is accomplished by two techniques, namely: (i) a reverberation-time-aware DNN based speech dereverberation architecture that can handle a wide range of reverberation times to enhance speech quality of reverberant and noisy speech, followed by (ii) DNN-based multi-condition training that takes both clean-condition and multi-condition speech into consideration, leveraging upon an exploitation of the data acquired and processed with multi-channel microphone arrays, to improve ASR performance. The final end-to-end system is established by a joint optimization of the speech enhancement and recognition DNNs.
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
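The dropout regularization mentioned in the abstract randomly zeroes activations during training so that units cannot co-adapt. A minimal sketch of the common inverted-dropout formulation (the keep probability and vector size here are illustrative, not AlexNet's):

```python
import numpy as np

def dropout(x, p_drop, rng, train=True):
    """Randomly zero activations during training; rescale so the
    expected activation is unchanged (inverted dropout)."""
    if not train:
        return x  # inference uses the full network, no scaling needed
    mask = rng.random(x.shape) >= p_drop
    return x * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
x = np.ones(10_000)
y = dropout(x, p_drop=0.5, rng=rng)
print(abs(y.mean() - 1.0) < 0.05)  # True: expectation is preserved
print((y == 0).any())              # True: roughly half the units are dropped
```

With the rescaling inside the training path, inference is simply the identity, which matches how dropout is typically deployed at test time.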
Conference Paper
Convolutional neural networks (CNNs) find applications in a variety of computer vision tasks ranging from object recognition and detection to scene understanding owing to their exceptional accuracy. There exist different algorithms for CNN computation. In this paper, we compare the conventional convolution algorithm with a faster algorithm based on Winograd's minimal filtering theory for efficient FPGA implementation. Distinct from the conventional convolution algorithm, the Winograd algorithm uses fewer computing resources but puts more pressure on the memory bandwidth. We first propose a fusion architecture that can naturally fuse multiple layers in CNNs, reusing the intermediate data. Based on this fusion architecture, we explore heterogeneous algorithms to maximize the throughput of a CNN. We design an optimal algorithm to determine the fusion and algorithm strategy for each layer. We also develop an automated toolchain to ease the mapping from a Caffe model to an FPGA bitstream using Vivado HLS. Experiments using the widely used VGG and AlexNet demonstrate that our design achieves up to 1.99X performance speedup compared to the prior fusion-based FPGA accelerator for CNNs.
Article
Objective: Accurate estimation of spatial gait characteristics is critical to assess motor impairments resulting from neurological or musculoskeletal disease. Currently, however, methodological constraints limit clinical applicability of state-of-the-art double integration approaches to gait patterns with a clear zero-velocity phase. Methods: We describe a novel approach to stride length estimation that uses deep convolutional neural networks to map stride-specific inertial sensor data to the resulting stride length. The model is trained on a publicly available and clinically relevant benchmark dataset consisting of 1220 strides from 101 geriatric patients. Evaluation is done in a tenfold cross validation and for three different stride definitions. Results: Even though the best results are achieved with strides defined from midstance to midstance, performance does not strongly depend on stride definition. The achieved precision outperforms state-of-the-art methods evaluated on the same benchmark dataset. Conclusion: Due to the independence of stride definition, the proposed method is not subject to the methodological constraints that limit applicability of state-of-the-art double integration methods. Furthermore, it was possible to improve precision on the benchmark dataset. Significance: With more precise mobile stride length estimation, new insights into the progression of neurological disease or early indications might be gained. Due to the independence of stride definition, diseases previously uncharted in terms of mobile gait analysis can now be investigated by retraining and applying the proposed method.
Conference Paper
Convolutional neural nets (CNNs) have become a practical means to perform vision tasks, particularly in the area of image classification. FPGAs are well known to be able to perform convolutions efficiently, however, most recent efforts to run CNNs on FPGAs have shown limited advantages over other devices such as GPUs. Previous approaches on FPGAs have often been memory bound due to the limited external memory bandwidth on the FPGA device. We show a novel architecture written in OpenCL(TM), which we refer to as a Deep Learning Accelerator (DLA), that maximizes data reuse and minimizes external memory bandwidth. Furthermore, we show how we can use the Winograd transform to significantly boost the performance of the FPGA. As a result, when running our DLA on Intel's Arria 10 device we can achieve a performance of 1020 img/s, or 23 img/s/W when running the AlexNet CNN benchmark. This comes to 1382 GFLOPs and is 10x faster with 8.4x more GFLOPS and 5.8x better efficiency than the state-of-the-art on FPGAs. Additionally, 23 img/s/W is competitive against the best publicly known implementation of AlexNet on nVidia's TitanX GPU.
Article
Inspired by the popular deep learning architecture, the Deep Stacking Network (DSN), a specific deep model for polarimetric synthetic aperture radar (POLSAR) image classification, named the Wishart Deep Stacking Network (W-DSN), is proposed in this paper. First, a fast implementation of the Wishart distance is achieved by a special linear transformation, which speeds up the classification of POLSAR images and makes it possible to use this polarimetric information in the following neural network (NN). Then a single-hidden-layer neural network based on the fast Wishart distance, named the Wishart Network (WN), is defined for POLSAR image classification and improves the classification accuracy. Finally, a multi-layer neural network is formed by stacking WNs, which is in fact the proposed deep learning architecture W-DSN for POLSAR image classification and improves the classification accuracy further. In addition, the structure of a WN can be expanded in a straightforward way by adding hidden units if necessary, as can the structure of the W-DSN. As a preliminary exploration of formulating a specific deep learning architecture for POLSAR image classification, the proposed methods may establish a simple but clever connection between POLSAR image interpretation and deep learning. The experiment results on a real POLSAR image show that the fast implementation of the Wishart distance is very efficient (a POLSAR image with 768000 pixels can be classified in 0.53 s), and both the single-hidden-layer architecture WN and the deep learning architecture W-DSN perform well and work efficiently.
Article
Deep learning algorithms are widely used for various pattern recognition applications such as text recognition, object recognition, and action recognition because of their best-in-class recognition accuracy compared to hand-crafted and shallow-learning-based algorithms. The long learning time caused by their complex structure, however, has so far limited their use to high-cost servers or many-core GPU platforms. On the other hand, the demand for customized pattern recognition within personal devices will grow gradually as more deep learning applications are developed. This paper presents an SoC implementation that enables deep learning applications to run on low-cost platforms such as mobile or portable devices. Different from conventional works that have adopted massively-parallel architectures, this work adopts a task-flexible architecture and exploits multiple levels of parallelism to cover the complex functions of the convolutional deep belief network, one of the popular deep learning/inference algorithms. In this paper, we implement the most energy-efficient deep learning and inference processor for wearable systems. The implemented 2.5 mm × 4.0 mm deep learning/inference processor is fabricated using 65 nm 8-metal CMOS technology for a battery-powered platform with real-time deep inference and deep learning operation. It consumes 185 mW average power, and 213.1 mW peak power at 200 MHz operating frequency and 1.2 V supply voltage. It achieves 411.3 GOPS peak performance and 1.93 TOPS/W energy efficiency, which is 2.07× higher than the state-of-the-art.
Article
Convolutional Neural Networks (CNNs) have been successfully used for many computer vision applications. It would be beneficial to these applications if the computational workload of CNNs could be reduced. In this work we analyze the linear algebraic properties of CNNs and propose an algorithmic modification to reduce their computational workload. An up to a 47% reduction can be achieved without any change in the image recognition results or the addition of any hardware accelerators.
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
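The residual reformulation described above can be sketched in a few lines: a block learns a residual function F(x) and an identity shortcut adds x back, so the block outputs F(x) + x. The two-layer F below is illustrative, not the paper's 152-layer design.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = relu(F(x) + x), with residual function F(x) = W2 @ relu(W1 @ x)."""
    return relu(W2 @ relu(W1 @ x) + x)  # identity shortcut: "+ x"

rng = np.random.default_rng(0)
x = relu(rng.standard_normal(4))  # non-negative activations from a prior ReLU

# If both weight layers are zero, F(x) = 0 and the block reduces exactly to
# the identity, which is one intuition for why very deep residual nets
# remain easy to optimize: layers only need to learn deviations from identity.
W_zero = np.zeros((4, 4))
print(np.allclose(residual_block(x, W_zero, W_zero), x))  # True
```

In the plain (non-residual) formulation, zero weights would instead collapse the output to zero, so each layer must relearn the identity from scratch.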
Article
This letter presents a rotation-invariant method for detecting geospatial objects from high-resolution satellite images. First, a superpixel segmentation strategy is proposed to generate meaningful and nonredundant patches. Second, a multilayer deep feature generation model is developed to generate high-level feature representations of patches using deep learning techniques. Third, a set of multiscale Hough forests with embedded patch orientations is constructed to cast rotation-invariant votes for estimating object centroids. Quantitative evaluations on the images collected from Google Earth service show that an average completeness, correctness, quality, and F1- measure values of 0.958, 0.969, 0.929, and 0.963, respectively, are obtained. Comparative studies with three existing methods demonstrate the superior performance of the proposed method in accurately and correctly detecting objects that are arbitrarily oriented and of varying sizes.
Article
We present YOLO, a unified pipeline for object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. Our unified architecture is also extremely fast; YOLO processes images in real-time at 45 frames per second, hundreds to thousands of times faster than existing detection systems. Our system uses global image context to detect and localize objects, making it less prone to background errors than top detection systems like R-CNN. By itself, YOLO detects objects at unprecedented speeds with moderate accuracy. When combined with state-of-the-art detectors, YOLO boosts performance by 2-3% points mAP.
Article
The hybrid deep neural network (DNN) and hidden Markov model (HMM) has recently achieved dramatic performance gains in automatic speech recognition (ASR). The DNN-based acoustic model is very powerful but its learning process is extremely time-consuming. In this paper, we propose a novel DNN-based acoustic modeling framework for speech recognition, where the posterior probabilities of HMM states are computed from multiple DNNs (mDNN), instead of a single large DNN, for the purpose of parallel training towards faster turnaround. In the proposed mDNN method all tied HMM states are first grouped into several disjoint clusters based on data-driven methods. Next, several hierarchically structured DNNs are trained separately in parallel for these clusters using multiple computing units (e.g. GPUs). In decoding, the posterior probabilities of HMM states can be calculated by combining outputs from multiple DNNs. In this work, we have shown that the training procedure of the mDNN under popular criteria, including both frame-level cross-entropy and sequence-level discriminative training, can be parallelized efficiently to yield significant speedup. The training speedup is mainly attributed to the fact that multiple DNNs are parallelized over multiple GPUs and each DNN is smaller in size and trained by only a subset of training data. We have evaluated the proposed mDNN method on a 64-hour Mandarin transcription task and the 320-hour Switchboard task. Compared to the conventional DNN, a 4-cluster mDNN model with similar size can yield comparable recognition performance in Switchboard (only about 2% performance degradation) with a greater than 7 times speed improvement in CE training and a 2.9 times improvement in sequence training, when 4 GPUs are used.
Article
Detection of salient objects from images is gaining increasing research interest in recent years as it can substantially facilitate a wide range of content-based multimedia applications. Based on the assumption that foreground salient regions are distinctive within a certain context, most conventional approaches rely on a number of hand-designed features, and their distinctiveness is measured using local or global contrast. Although these approaches have been shown to be effective in dealing with simple images, their limited capability may cause difficulties when dealing with more complicated images. This paper proposes a novel framework for saliency detection by first modeling the background and then separating salient objects from the background. We develop stacked denoising autoencoders with deep learning architectures to model the background, where latent patterns are explored and more powerful representations of data are learned in an unsupervised and bottom-up manner. Afterward, we formulate the separation of salient objects from the background as a problem of measuring reconstruction residuals of deep autoencoders. Comprehensive evaluations on three benchmark datasets and comparisons with nine state-of-the-art algorithms demonstrate the superiority of the proposed approach.
Article
Most modern convolutional neural networks (CNNs) used for object recognition are built using the same principles: Alternating convolution and max-pooling layers followed by a small number of fully connected layers. We re-evaluate the state of the art for object recognition from small images with convolutional networks, questioning the necessity of different components in the pipeline. We find that max-pooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy on several image recognition benchmarks. Following this finding -- and building on other recent work for finding simple network structures -- we propose a new architecture that consists solely of convolutional layers and yields competitive or state of the art performance on several object recognition datasets (CIFAR-10, CIFAR-100, ImageNet). To analyze the network we introduce a new variant of the "deconvolution approach" for visualizing features learned by CNNs, which can be applied to a broader range of network structures than existing approaches.
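The paper's central observation is that a strided convolution can take over the downsampling role of max-pooling. A minimal shape-level sketch (random-free, with a hand-rolled 2-D correlation; the 6x6 input and 2x2 windows are illustrative):

```python
import numpy as np

def conv2d(x, k, stride):
    """'Valid' 2-D correlation of x (H, W) with kernel k (kh, kw) at a given stride."""
    H, W = x.shape
    kh, kw = k.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    y = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i*stride:i*stride+kh, j*stride:j*stride+kw]
            y[i, j] = np.sum(patch * k)
    return y

x = np.arange(36, dtype=float).reshape(6, 6)
pooled = x.reshape(3, 2, 3, 2).max(axis=(1, 3))     # 2x2 max-pool, stride 2
strided = conv2d(x, np.ones((2, 2)) / 4, stride=2)  # 2x2 conv, stride 2 (here: averaging)
print(pooled.shape, strided.shape)  # (3, 3) (3, 3) -- identical downsampling
```

Both paths halve each spatial dimension; the difference is that the pooling window applies a fixed max, while the strided convolution's window is a learnable (here fixed, averaging) kernel.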
Searching for MobileNetV3
A. Howard et al., "Searching for MobileNetV3," 2019, arXiv:1905.02244. [Online]. Available: https://arxiv.org/abs/1905.02244
M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," 2018, arXiv:1801.04381. [Online]. Available: https://arxiv.org/abs/1801.04381