Article

# ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices

Authors:
• Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, Jian Sun (Megvii Technology Inc.)

## Abstract

We introduce an extremely computation-efficient CNN architecture named ShuffleNet, designed specially for mobile devices with very limited computing power (e.g., 10-150 MFLOPs). The new architecture utilizes two proposed operations, pointwise group convolution and channel shuffle, to greatly reduce computation cost while maintaining accuracy. Experiments on ImageNet classification and MS COCO object detection demonstrate the superior performance of ShuffleNet over other structures, e.g. lower top-1 error (absolute 6.7%) than the recent MobileNet system on ImageNet classification under the computation budget of 40 MFLOPs. On an ARM-based mobile device, ShuffleNet achieves ~13× actual speedup over AlexNet while maintaining comparable accuracy.
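The channel shuffle operation named in the abstract can be sketched as a reshape, transpose, and flatten over the channel axis; below is a minimal NumPy illustration (the function name and tensor layout are illustrative assumptions, not the paper's code):

```python
import numpy as np

def channel_shuffle(x, groups):
    """Shuffle channels of an (N, C, H, W) tensor across `groups` groups.

    Implemented as reshape -> transpose -> flatten on the channel axis,
    so the outputs of one group convolution feed every group of the next.
    """
    n, c, h, w = x.shape
    assert c % groups == 0
    x = x.reshape(n, groups, c // groups, h, w)
    x = x.transpose(0, 2, 1, 3, 4)  # swap the group and per-group axes
    return x.reshape(n, c, h, w)

# 1 sample, 6 channels split into 3 groups of 2
x = np.arange(6).reshape(1, 6, 1, 1)
y = channel_shuffle(x, groups=3)
print(y.flatten())  # [0 2 4 1 3 5]
```

After the shuffle, each consecutive block of channels draws from every original group, which is what lets stacked group convolutions exchange information at negligible cost.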

... Decoupling the tasks preserves the representation power of the network and reduces the number of parameters and FLOPs. Several recent methods [18,28,35,46] have successfully built lightweight architectures by (1) constructing CNNs with DSConv and (2) training the model from scratch. ...
... Recent works [17,18,28,35,46] have constructed lightweight architectures by designing modules based on depthwise separable convolution. Our proposed formulation GEP is crucial for understanding modules based on depthwise separable convolution. ...
... GDGConv, which is used in ShuffleNet [46], consists of four sublayers, group pointwise convolution, channel shuffle, depthwise convolution, and another group pointwise convolution, as well as a shortcut. We examine the relation between standard convolution and the last two convolutions of GDGConv using GEP as follows. ...
Article
Full-text available
How can we efficiently compress a convolutional neural network (CNN) using depthwise separable convolution while retaining its accuracy on classification tasks? Depthwise separable convolution, which replaces a standard convolution with a depthwise convolution and a pointwise convolution, has been used for building lightweight architectures. However, previous works based on depthwise separable convolution are limited when compressing a trained CNN model since (1) they are mostly heuristic approaches without a precise understanding of their relation to standard convolution, and (2) their accuracies do not match that of the standard convolution. In this paper, we propose Falcon, an accurate and lightweight method to compress CNNs based on depthwise separable convolution. Falcon uses generalized elementwise product (GEP), our proposed mathematical formulation to approximate the standard convolution kernel, to interpret existing convolution methods based on depthwise separable convolution. By exploiting the knowledge of a trained standard model and carefully determining the order of depthwise separable convolution via GEP, Falcon achieves accuracy close to that of the trained standard model. Furthermore, this interpretation leads to a generalized version, rank-k Falcon, which performs k independent Falcon operations and sums up the results. Experiments show that Falcon (1) provides higher accuracy than existing methods based on depthwise separable convolution and tensor decomposition, and (2) reduces the number of parameters and FLOPs of standard convolution by up to a factor of 8 while ensuring similar accuracy. We also demonstrate that rank-k Falcon further improves the accuracy while sacrificing a bit of the compression and computation reduction rates.
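The "factor of 8" parameter reduction mentioned in the abstract follows directly from counting kernel weights in a depthwise separable convolution versus a standard one; a quick back-of-the-envelope check (the layer sizes are illustrative assumptions):

```python
def conv_params(c_in, c_out, k):
    """Parameters of a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out

def dsconv_params(c_in, c_out, k):
    """Depthwise (one k x k kernel per input channel) + pointwise (1 x 1)."""
    return k * k * c_in + c_in * c_out

c_in, c_out, k = 256, 256, 3
std = conv_params(c_in, c_out, k)   # 589,824 weights
ds = dsconv_params(c_in, c_out, k)  # 2,304 + 65,536 = 67,840 weights
print(f"standard: {std}, separable: {ds}, ratio: {std / ds:.1f}x")
```

For a 3×3 layer with equal input and output channels the ratio approaches k² = 9, which matches the "up to a factor of 8" claim once the pointwise term is accounted for.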
... In spite of the great achievements of sign language recognition, real-time recognition on the mobile terminal still has not lived up to expectations. Huge and complicated network architectures achieve recognition accuracy but lose computing efficiency. ...
... However, the detection speed of this network is slow when applied to the mobile terminal, because the GPU computing speed of mobile terminals is much lower than that of PC terminals. In order to meet the needs of mobile devices, some lightweight CNN networks such as MobileNet [23] and ShuffleNet [24] have been proposed, which achieve a good balance between speed and accuracy. The proposed ShuffleNetV2 [25] uses channel shuffling to improve the exchange of information flow between channels, and further considers the actual speed of the hardware. ...
Article
Full-text available
To better meet the communication needs of hearing-impaired people and the public, it is of great significance to recognize sign language more quickly and accurately on embedded platforms and mobile terminals. YOLOv3, proposed by Joseph Redmon and Ali Farhadi in 2018, achieved a great improvement in detection speed with considerable accuracy by optimizing YOLO. However, YOLOv3 is still too bloated for use on mobile terminals. A static sign language recognition method based on the ShuffleNetv2-YOLOv3 lightweight model is proposed. The model is made lightweight by using ShuffleNetv2 as the backbone network of YOLOv3, which improves recognition speed substantially. Combined with the CIoU loss function, ShuffleNetv2-YOLOv3 maintains recognition accuracy while improving recognition speed. The recognition effectiveness of the ShuffleNetv2-YOLOv3 lightweight model on self-made sign language images and a public database was evaluated by F1 score and mAP value, and its performance was compared with that of the YOLOv3-tiny, SSD, Faster-RCNN, and YOLOv4-tiny models, respectively. The experimental results show that the proposed ShuffleNetv2-YOLOv3 model achieves a good balance between the accuracy and speed of gesture detection under the premise of a lightweight model. The F1 score and mAP value of the ShuffleNetv2-YOLOv3 model were 99.1% and 98.4%, respectively. The gesture detection speed on the GPU reaches 54 frames per second, which is better than the other models. The mobile terminal application of the proposed lightweight model was also evaluated. The minimal inference time for single-frame images on the CPU and GPU is 0.14 and 0.025 s per image, respectively, only 1/6.5 and 1/8.5 of the running time of the original YOLOv3 model. The ShuffleNetv2-YOLOv3 lightweight model is conducive to quick, real-time static sign language gesture recognition, laying a good foundation for real-time gesture recognition on embedded platforms and mobile terminals.
... With the increasing demand for mobile applications limited by storage and computing resources in recent years, lightweight models with few parameters and low FLOPs have attracted significant attention from developers and researchers. The earliest attempts at designing efficient models date back to the Inceptionv3 [47] era, which used asymmetric convolutions to replace standard convolution; later, MobileNet [18] proposed depth-wise separable convolution to significantly decrease the amount of computation and parameters, and it is viewed as a fundamental CNN-based component for subsequent works [14,36,41,68]. Remarkably, MobileNetv2 [44] proposes an efficient Inverted Residual Block (IRB) based on Depth-Wise Convolution (DW-Conv) that has become the standard efficient module. ...
... SqueezeNet [21] replaces 3×3 filters with 1×1 filters and decreases channel numbers to reduce model parameters, while Inceptionv3 [47] factorizes the standard convolution into asymmetric convolutions. Later, MobileNet [18] introduced depth-wise separable convolution to alleviate the large amount of computation and parameters, followed by subsequent lightweight models [14,36,41,44,68]. Besides the above handcrafted methods, researchers exploit automatic architecture design in pre-defined search spaces [3,17,30,49,50] and obtain considerable results. ...
Preprint
This paper focuses on designing efficient models with low parameters and FLOPs for dense predictions. Even though CNN-based lightweight methods have achieved stunning results after years of research, trading off model accuracy against constrained resources still needs further improvement. This work rethinks the essential unity of the efficient Inverted Residual Block in MobileNetv2 and the effective Transformer in ViT, inductively abstracting a general concept of Meta-Mobile Block, and we argue that the specific instantiation is very important to model performance even though the same framework is shared. Motivated by this phenomenon, we deduce a simple yet efficient modern Inverted Residual Mobile Block (iRMB) for mobile applications, which absorbs CNN-like efficiency to model short-distance dependency and Transformer-like dynamic modeling capability to learn long-distance interactions. Furthermore, we design a ResNet-like 4-phase Efficient MOdel (EMO) based only on a series of iRMBs for dense applications. Massive experiments on the ImageNet-1K, COCO2017, and ADE20K benchmarks demonstrate the superiority of our EMO over state-of-the-art methods, e.g., our EMO-1M/2M/5M achieve 71.5, 75.1, and 78.4 Top-1 accuracy, surpassing SoTA CNN-/Transformer-based models while trading off model accuracy and efficiency well.
... Lo et al. [11] and Szegedy et al. [43] used the strategy of decomposing the standard 3 × 3 convolution into 3×1 and 1×3 convolutions to reduce the number of parameters and computational costs at the expense of slight performance degradation. ShuffleNet [44] divides the convolutions into multiple groups in a similar way to [12], which leads to a significant reduction in FLOPs with a rather small decrease in accuracy. By combining asymmetric convolution and dilated convolution, researchers [45] [46] further designed a depthwise asymmetric dilated convolution to reduce the number of parameters of models. ...
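The 3×3 → 3×1 + 1×3 factorization mentioned in the excerpt cuts the per-layer weight count by one third when input and output channels are equal; a small sketch of the parameter arithmetic (the channel count is an illustrative assumption):

```python
def standard_3x3(c_in, c_out):
    """Weights of a dense 3x3 convolution (bias omitted)."""
    return 3 * 3 * c_in * c_out

def asymmetric_3x3(c_in, c_out):
    """3x1 convolution producing c_out channels, followed by a 1x3 convolution."""
    return 3 * 1 * c_in * c_out + 1 * 3 * c_out * c_out

c = 128
std = standard_3x3(c, c)     # 147,456 weights
asym = asymmetric_3x3(c, c)  # 49,152 + 49,152 = 98,304 weights
print(f"saving: {1 - asym / std:.0%}")  # 33% fewer parameters
```

The same 2/(k) vs k trade-off generalizes to any k×k kernel, which is why the factorization pays off more for larger kernels.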
Article
Full-text available
Convolutional neural networks (CNNs) have achieved significant success in medical image segmentation. However, they also suffer from requiring a large number of parameters, making it difficult to deploy CNNs on low-resource hardware, e.g., embedded systems and mobile devices. Although some compact or memory-efficient models have been reported, most of them cause degradation in segmentation accuracy. To address this issue, we propose a shape-guided ultralight network (SGU-Net) with extremely low computational costs. The proposed SGU-Net includes two main contributions: it first presents an ultralight convolution that implements two separable convolutions simultaneously, i.e., asymmetric convolution and depthwise separable convolution. The proposed ultralight convolution not only effectively reduces the number of parameters but also enhances the robustness of SGU-Net. Secondly, our SGU-Net employs an additional adversarial shape constraint to let the network learn shape representations of targets, which can significantly improve the segmentation accuracy for abdominal medical images using self-supervision. SGU-Net is extensively tested on four public benchmark datasets: LiTS, CHAOS, NIH-TCIA and 3Dircbdb. Experimental results show that SGU-Net achieves higher segmentation accuracy using lower memory costs and outperforms state-of-the-art networks. Moreover, we apply our ultralight convolution to a 3D volume segmentation network, which obtains comparable performance with fewer parameters and less memory usage. The code of SGU-Net is released at https://github.com/SUST-reynole/SGUNet.
... (i) For the basic performance evaluation and snapshot recovery from training crash experiments, model training is performed on the PyTorch deep learning framework [51] with typical DNNs: VGG-16 [9], ResNet-18 [52], GoogLeNet [2], MobileNet [53], and ShuffleNet [54]. We train each DNN on the CIFAR-10 dataset [55] for 200 epochs. ...
... We conducted experiments on different supervised classification models (GoogLeNet [20], ShuffleNet [25], VggNet [17], ResNet [10] and HRNet [19]) and competitive SSL models (Cross Pseudo Supervision (CPS) [7], Cross Consistency Training (CCT) [15], Entropy Minimization (EM) [23], Deep Co-Training (DCT) [16] and Mean Teacher (MT) [21]) to select the most efficient backbone. For classification, we split our database into a training set of 18624 tumor patch images (Table 2). The HRNet backbone performed best in both the classification and SSL segmentation tasks. ...
... Theoretically, the amount of computation can still be reduced. ShuffleNet [52] used group convolution and channel shuffling to effectively reduce the amount of computation of pointwise convolution, achieving better performance. With the advancement of mobile devices and the diversified development of application scenarios, lightweight networks show higher engineering value. ...
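The reduction that group convolution brings to the pointwise (1×1) layers described above is a straight division by the number of groups g, since each output channel only sees c_in/g input channels; a minimal multiply-accumulate estimate (the feature-map and channel sizes are illustrative assumptions):

```python
def pointwise_flops(c_in, c_out, h, w, groups=1):
    """Multiply-accumulate count of a 1x1 (group) convolution.

    Each output position reads c_in / groups input channels, so the cost
    scales as 1/groups relative to a dense pointwise convolution.
    """
    return h * w * (c_in // groups) * c_out

h = w = 28
dense = pointwise_flops(240, 240, h, w)              # g = 1
grouped = pointwise_flops(240, 240, h, w, groups=3)  # g = 3
print(dense // grouped)  # 3: grouped pointwise conv is g times cheaper
```

This is exactly why channel shuffle matters: without it, each group's 1/g slice of channels would never mix with the others across stacked grouped layers.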
Article
Full-text available
Traffic sign detection is an essential part of traffic security and unmanned driving systems. Because changes in the traffic environment are complex, how to intelligently and efficiently detect traffic signs in real scenes is of great significance. The traffic sign detection task is characterized by many small targets and complex environmental interference, and the detection scene also requires the detection model to be lightweight and efficient. This paper proposes a lightweight model, Ghost-YOLO, in which a lightweight module, C3Ghost, is designed to replace the feature extraction module in YOLOv5. C3Ghost modules extract features in a lightweight way, which effectively speeds up inference. At the same time, a new multi-scale feature extraction is designed to enhance the focus on small targets. Experimental results show that the mAP of Ghost-YOLO is 92.71%, and the number of parameters and computations are reduced to 91.4% and 50.29% of the original, respectively. Compared with multiple lightweight models, the speed and accuracy of this method are competitive.
... There are many pre-trained CNN models available which are used for image classification. GoogLeNet (Zhang et al. 2017) and SqueezeNet (Iandola et al. 2016) are just some of the many pre-trained CNN models available. The model selected for the current research is Inceptionv3. ...
Conference Paper
Full-text available
Millions of dollars are spent annually to detect damage in demanding infrastructures, including bridges, roads, and buildings. Natural disasters such as floods and earthquakes result in catastrophic damage to urban infrastructure. In practice, automatic crack identification from images of diverse settings is valuable and challenging. Cracks in civil constructions are crucial indicators of structural deterioration and may signal the onset of a massive failure. Image-based crack detection has been attempted in research communities, and these techniques can replace human-based inspection. A Convolutional Neural Network (CNN) is a powerful tool for classifying images based on features. The current research is based on the performance of Inception V3, a CNN model. Images are classified into four categories: vertical crack, horizontal crack, diagonal crack, and uncracked. The Inception V3 model is trained on three train-test splits to evaluate the impact of the number of images on model accuracy. Results illustrate that models trained with more images have better accuracy for classifying uncracked, diagonal, horizontal, and vertical cracks.
... Finally, the CQT-CNN network with two multi-scale blocks and 32 filter channels of size 3 × 3 is taken into account for competing with other existing state-of-the-art models, including EfficientNet [22], SqueezeNet [23], GoogLeNet [24], MobileNet-V3 [25], and ShuffleNet [26] in the same task of underwater acoustic signal classification. ...
Article
Full-text available
This article proposes a multi-scale deep learning network to classify different underwater acoustic signal sources. The proposed network is designed with multiple branches, creating a multi-scale block which allows learning various spatial features of Constant-Q Transform spectrograms. The network is trained and tested on the ShipsEar dataset, which is augmented by an overlap segmentation technique to ensure balanced data across ship label classes. The experimental results show that our network achieves an average classification accuracy of up to 99.93% and an execution time of 2.2 ms when configured with two multi-scale blocks and 32 filter channels. In comparison, our network remarkably outperforms other existing networks in terms of accuracy and execution time.
... ShuffleFaceNet [179] is a compact FR model based on ShuffleNet-V2 [173]. ShuffleNetV2 utilizes a channel shuffle operation proposed by ShuffleNetV1 [291], achieving an acceptable trade-off between accuracy and computational efficiency. The channel shuffle operation enables information flow between different groups of channels by shuffling a group of g channels of the convolution output (i.e. ...
Thesis
The growing need for reliable and accurate recognition solutions along with the recent innovations in deep learning methodologies has reshaped the research landscape of biometric recognition. Developing efficient biometric solutions is essential to minimize the required computational costs, especially when deployed on embedded and low-end devices. This drives the main contributions of this work, aiming at enabling a wide application range for biometric technologies. Towards enabling wider implementation of face recognition in use cases that are extremely limited by computational complexity constraints, this thesis presents a set of efficient models for accurate face verification, namely MixFaceNets. With a focus on automated network architecture design, this thesis is the first to utilize neural architecture search to successfully develop a family of lightweight face-specific architectures, namely PocketNets. Additionally, this thesis proposes a novel training paradigm based on knowledge distillation (KD), the multi-step KD, to enhance the verification performance of compact models. Towards enhancing face recognition accuracy, this thesis presents a novel margin-penalty softmax loss, ElasticFace, that relaxes the restriction of having a single fixed penalty margin. Faces occluded by facial masks during the recent COVID-19 pandemic present an emerging challenge for face recognition. This thesis presents a solution that mitigates the effects of wearing a mask and improves masked face recognition performance. This solution operates on top of existing face recognition models and thus avoids the high cost of retraining existing face recognition models or deploying a separate solution for masked face recognition. Aiming at introducing biometric recognition to novel embedded domains, this thesis is the first to propose leveraging the existing hardware of head-mounted displays for identity verification of the users of virtual and augmented reality applications.
This is additionally supported by proposing a compact ocular segmentation solution as a part of an iris and periocular recognition pipeline. Furthermore, an identity-preserving synthetic ocular image generation approach is designed to mitigate potential privacy concerns related to the accessibility to real biometric data and facilitate the further development of biometric recognition in new domains.
Chapter
As an excellent algorithm, YOLOv5 is an object detection model with the advantages of high flexibility and fast speed, but it suffers from many network parameters, a complex model structure, and low boundary regression accuracy for the target. To address these problems, this study improves on the YOLOv5s algorithm and proposes a new model, YOLO-GC, with lower hardware requirements, fewer network parameters, and higher boundary regression accuracy. First, the upsampling module of YOLOv5s is replaced by the CARAFE upsampling module to increase the receptive field and semantic features; the Ghost module is used to reduce the number of parameters and computation; and the CBAM attention mechanism combines the spatial and channel attention maps so that the model pays more attention to key areas, improving model accuracy. Test results on the PASCAL VOC object detection benchmark dataset show that, compared with YOLOv5s, the number of parameters is reduced by 44.2%, the model size is reduced by 42.8%, the mAP@0.5 is increased by 1.2%, and the mAP@0.5:0.95 is increased by 5.5%.
Article
Full-text available
In complex traffic scenarios, it is crucial to develop a rapid and precise real-time detection system for non-motorized vehicles to ensure safe driving. D-YOLO is a lightweight real-time detection technique for non-motorized vehicles based on an enhanced version of YOLOv4-tiny. Because the computing capabilities of mobile devices are typically constrained, we begin by reducing the number of model parameters. We then add dilated convolution and depthwise separable convolution into the network's Cross Stage Partial Connection (CSPNet) to produce the DCSPNet with improved performance. Coordinate Attention (CA) is implemented to enhance the network's capability to extract effective features. In the neck of the model, a spatial pyramid pooling (SPP) module is introduced to enhance the feature representation of non-motorized vehicles in the feature layer. Finally, we test the proposed model on our dataset; the experimental results show that D-YOLO has a model size of only 6.7 MB, which is 16.5 MB smaller than YOLOv4-tiny. The detection speed of D-YOLO is about 25% faster than that of YOLOv4-tiny, D-YOLO has approximately 58% fewer model parameters than YOLOv4-tiny, and D-YOLO has a mAP of 70.36%, which is 2.01% higher than YOLOv4-tiny. D-YOLO thus ensures both accuracy and real-time performance, satisfying the demand for real-time detection of non-motorized vehicles in intelligent traffic scenarios.
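The dilated convolutions used in abstracts like the one above enlarge the receptive field without adding parameters; the effective kernel size follows k_eff = k + (k - 1)(d - 1) for kernel size k and dilation d. A short check of that formula (the kernel and dilation values are illustrative assumptions):

```python
def effective_kernel(k, d):
    """Effective kernel size of a k x k convolution with dilation rate d.

    A dilated kernel inserts (d - 1) gaps between taps, so it spans
    k + (k - 1) * (d - 1) input positions per spatial axis.
    """
    return k + (k - 1) * (d - 1)

for d in (1, 2, 4):
    print(d, effective_kernel(3, d))  # a 3x3 kernel spans 3, 5, then 9 positions
```

Stacking a few dilated 3×3 layers therefore covers a large context at the cost of plain 3×3 layers, which is the appeal for lightweight detectors and segmenters.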
Chapter
Full-text available
The low resolution of infrared images makes it more difficult to detect objects, and the quality of the detection results obtained by CNN-based object detection models is worse for few-shot problems. The Two-stage Fine-tune Approach (TFA) is effective in improving detection precision for few-shot problems. Because of the category imbalance of the training samples, TFA suffers from misclassification. To solve this problem, TFA with similarity contrast (SC-TFA) is proposed. VOVNetv2 is used as the backbone network to improve detection accuracy. A similarity contrast detection head is added to the detection module to improve classification performance, and both cosine similarity and Euclidean distance are used as similarity measures in the contrastive loss function. The effectiveness of the improved TFA for the few-shot problem is verified on the VOC dataset and an infrared ship dataset. The average precision on novel categories (nAP) of SC-TFA on the VOC dataset and the infrared ship dataset reaches 54.92% and 41.1% respectively, which is 4.7% and 3.4% higher than TFA.
Article
Full-text available
Recently, the application of convolutional neural networks (CNNs) to single image super-resolution (SISR) has been developing steadily. Although many CNN-based methods have achieved splendid performance, oversized model complexity hinders their application in real life. In response to this problem, lightweight and efficient designs are becoming the development tendency of SR models. The residual feature distillation network (RFDN) is one of the state-of-the-art lightweight SR networks. However, the shallow residual block (SRB) in RFDN still uses ordinary convolution to extract features, leaving considerable room for reducing network parameters. In this paper, we propose the Group-convolutional Feature Enhanced Distillation Network (GFEDNet), constructed by stacking feature distillation and aggregation blocks (FDAB). Benefiting from the residual learning of the residual feature aggregation (RFA) framework and the feature distillation strategy of RFDN, the FDAB can obtain more diverse and detailed feature representations, thereby improving SR capability. Furthermore, we propose the multi-scale group convolution block (MGCB) to replace the SRB. Thanks to group convolution and a multi-branch parallel structure, the MGCB reduces the parameters substantially while maintaining SR performance. Extensive experiments show the strong performance of our proposed GFEDNet against other state-of-the-art methods.
Article
Nowadays many semantic segmentation algorithms have achieved satisfactory accuracy on von Neumann platforms (e.g., GPUs), but their speed and energy consumption have not met the high requirements of certain edge applications like autonomous driving. To tackle this issue, it is necessary to design an efficient lightweight semantic segmentation algorithm and then implement it on emerging hardware platforms with high speed and energy efficiency. Here, we first propose an extremely factorized network (EFNet) which can learn multi-scale context information while preserving rich spatial information with reduced model complexity. Experimental results on the Cityscapes dataset show that EFNet achieves an accuracy of 68.0% mean intersection over union (mIoU) with only 0.18M parameters, at a speed of 99 frames per second (FPS) on a single RTX 3090 GPU. Then, to further improve speed and energy efficiency, we design a memristor-based computing-in-memory (CIM) accelerator for the hardware implementation of EFNet. Simulation in DNN+NeuroSim V2.0 shows that the memristor-based CIM accelerator is ∼63× (∼4.6×) smaller in area, at most ∼9.2× (∼1000×) faster, and ∼470× (∼2400×) more energy-efficient than the RTX 3090 GPU (the Jetson Nano embedded development board), although its accuracy decreases slightly by 1.7% mIoU. Therefore, the memristor-based CIM accelerator has great potential to be deployed at the edge to implement lightweight semantic segmentation models like EFNet. This study showcases an algorithm-hardware co-design to realize real-time and low-power semantic segmentation at the edge.
Article
Coronavirus Disease-2019 (COVID-19) is caused by Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2) and has opened several research challenges concerning diagnosis and treatment. Chest X-rays and computed tomography (CT) scans are effective and fast ways to detect and assess the damage that COVID-19 causes to the lungs at different stages of the disease. Although the CT scan is an accurate exam, the chest X-ray is still helpful because it is cheaper, faster, involves lower radiation exposure, and is available in low-income countries. Computer-aided diagnostic systems based on Artificial Intelligence (AI) and computer vision are an alternative for extracting features from X-ray images, providing an accurate COVID-19 diagnosis. However, specialized and expensive computational resources remain a challenge, and it needs to be better understood how low-cost devices and smartphones can host AI models to predict diseases in a timely manner. Even when using deep learning to support image-based medical diagnosis, challenges remain because known techniques use centralized intelligence on high-performance servers, making it difficult to embed these models in low-cost devices. This paper sheds light on these questions by proposing the Artificial Intelligence as a Service Architecture (AIaaS), a hybrid AI support operation, both centralized and distributed, with the purpose of enabling the embedding of already-trained models on low-cost devices or smartphones. We demonstrate the suitability of our architecture through a case study of COVID-19 diagnosis using a low-cost device. Among the main findings of this paper, we point out the performance evaluation of low-cost devices in handling COVID-19 prediction tasks in a timely and accurate manner, and the quantitative performance evaluation of embedding CNN models on low-cost devices.
Article
The identification of defects plays a key role in the semiconductor industry as it can reduce production risks, minimize the effects of unexpected downtimes, and optimize the production process. A literature review protocol is implemented and the latest advances in defect detection on wafer maps towards quality control are reported. In particular, the most recent works are outlined to demonstrate the use of AI technologies in wafer map defect detection. The popularity of these technologies is then presented in visual graphs, enabling the identification of the most popular and most prominent ML methods that can be exploited for the purposes of Industry 4.0.
Chapter
Aiming at improving the performance of face morphing detection, a novel method is proposed using a two-stream network with channel attention and residuals of multiple color spaces. The method first obtains six color-channel images (H, S, V, Y, Cb, Cr), then applies a bilateral filter to each channel to obtain the corresponding residual noise images, and then feeds the combined six-channel image and the residual noise image into the designed two-stream network for training, so as to detect morphed face images. In addition, an efficient channel attention module is proposed to improve the expressiveness of the model. Experiments are conducted on standard databases, and the performance of the proposed method is compared with that of 9 state-of-the-art morphing attack detection methods. The results show that the proposed method achieves better detection performance than existing works. Keywords: Face recognition, Face morphing detection, Deep learning, Residual noise
Article
Deep neural networks have greatly facilitated applications of semantic segmentation. However, most existing neural networks require massive computation with many model parameters to achieve higher precision, which is unaffordable for resource-constrained edge devices. To achieve an appropriate trade-off between computing efficiency and segmentation accuracy, we propose in this work an effective lightweight attention-guided network (ELANet) for real-time semantic segmentation based on an asymmetrical encoder–decoder framework. In the encoding phase, we combine atrous convolution and depth-wise convolution to design two types of effective context guidance blocks to learn contextual semantic information. A refined feature fusion module with a dual attention-guided fusion (DAF) unit is developed in the decoder to exploit different levels of features. Without any pretraining, we evaluate the performance of the multi-attention ELANet with extensive experiments on the Cityscapes dataset at an input resolution of 512×1024, achieving 75.4% mIoU and 83 FPS inference speed with only 0.76 M parameters and 10.34 GFLOPs on a single 3090 GPU. The code is publicly available at https://github.com/DGS666/ELANet.
Article
Full-text available
Human pose estimation is the process of detecting the body keypoints of a person and can be used to classify different poses. Many researchers have proposed various ways to build accurate 2D as well as 3D human pose estimators that could be applied to various types of applications. This paper is a review of the state-of-the-art architectures for human pose estimation; the papers reviewed cover computer vision and machine learning approaches such as feed-forward neural networks, convolutional neural networks (CNNs), OpenPose, MediaPipe, and many more. These different approaches are compared on various parameters, like the type of dataset used, the evaluation metric, etc. Different human pose datasets, such as the COCO and MPII activity datasets with keypoints, as well as specific application-based datasets, are reviewed in this survey paper. Researchers may use these architectures and datasets in a range of domains, which are also discussed. The paper analyzes several approaches and architectures that can serve as a guide for other researchers, assisting them in developing better techniques to achieve high accuracy.
Chapter
To improve the efficiency of coal and gangue separation in coal production, minerals are imaged by dual-ray transmission and processed using image processing together with a lightweight neural network model. A lightweight neural network model called SAB-MobileNet (Speed and Accuracy Balance MobileNet) is proposed to efficiently identify and classify coal and gangue samples. Built on the MobileNetV2 network, SAB-MobileNet improves the sampling scheme, integrates an attention mechanism, and modifies the cascade network structure; it maintains high accuracy on the coal and gangue classification task while greatly reducing the computation and parameter count of the network model, improving the efficiency of the overall identification and classification. The SAB-MobileNet model achieved 98.4% accuracy in coal and gangue classification with greatly reduced parameters and computation, indicating the feasibility and practicability of the model for this task. Keywords: X-ray sorting coal; Gangue classification; Light neural networks; Detachable convolution; Attention mechanisms
Chapter
Full-text available
A Planar Inverted-F Antenna (PIFA) design is presented at 5.03-5.09 GHz and 5.725-5.825 GHz for wireless local area network (WLAN) applications. Miniaturization of the patch antenna and reasonable bandwidth are achieved in this work. Two complementary single split-ring resonators (CSSRR) have been etched in the ground plane to achieve dual-band operation. The proposed antenna provides a compact structure because of the PIFA configuration, while capacitive matching has been achieved by gap coupling. The proposed antenna is simulated using the CST V.12 simulator and a particle swarm optimization algorithm technique.
Article
Full-text available
Image-based fruit classification offers many useful applications in industrial production and daily life, such as self-checkout in the supermarket, automatic fruit sorting and dietary guidance. However, fruit classification task will have different data distributions due to different application scenarios. One feasible solution to solve this problem is to use domain adaptation that adapts knowledge from the original training data (source domain) to the new testing data (target domain). In this paper, we propose a novel deep learning-based unsupervised domain adaptation method for cross-domain fruit classification. A hybrid attention module is proposed and added to MobileNet V3 to construct the HAM-MobileNet that can suppress the impact of complex backgrounds and extract more discriminative features. A hybrid loss function combining subdomain alignment and implicit distribution metrics is used to reduce domain discrepancy during model training and improve model classification performance. Two fruit classification datasets covering several domains are established to simulate common industrial and daily life application scenarios. We validate the proposed method on our constructed grape classification dataset and general fruit classification dataset. The experimental results show that the proposed method achieves an average accuracy of 95.0% and 93.2% on the two datasets, respectively. The classification model after domain adaptation can well overcome the domain discrepancy brought by different fruit classification scenarios. Meanwhile, the proposed datasets and method can serve as a benchmark for future cross-domain fruit classification research.
Article
In recent years, deep convolutional neural networks (DCNNs) have been widely used in the task of ship target detection in synthetic aperture radar (SAR) imagery. However, the vast storage and computational cost of DCNNs limits their application to spaceborne or airborne onboard devices with limited resources. In this paper, a set of lightweight detection networks for SAR ship target detection is proposed. To obtain these lightweight networks, this paper designs a network structure optimization algorithm based on the multi-objective firefly algorithm (termed NOFA). In our design, the NOFA algorithm encodes the filters of a well-performing ship target detection network into a list of probabilities, which determine whether the lightweight network will inherit the corresponding filter structure and parameters. After that, the multi-objective firefly optimization algorithm (MFA) continuously optimizes the probability list and finally outputs a set of lightweight network encodings that can meet the different needs of the trade-off between detection network precision and size. Finally, network pruning techniques transform the encoding that meets the task requirements into a lightweight ship target detection network. Experiments on the SSDD and SDCD datasets show that the method proposed in this paper can provide more flexible and lighter detection networks than traditional detection networks.
Article
Full-text available
We present a class of efficient models called MobileNets for mobile and embedded vision applications. MobileNets are based on a streamlined architecture that uses depth-wise separable convolutions to build lightweight deep neural networks. We introduce two simple global hyper-parameters that efficiently trade off between latency and accuracy. These hyper-parameters allow the model builder to choose the right-sized model for their application based on the constraints of the problem. We present extensive experiments on resource and accuracy tradeoffs and show strong performance compared to other popular models on ImageNet classification. We then demonstrate the effectiveness of MobileNets across a wide range of applications and use cases including object detection, fine-grained classification, face attributes and large-scale geo-localization.
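The cost saving from depth-wise separable convolution can be made concrete with a back-of-the-envelope count of multiply-adds; a minimal sketch (the layer sizes below are illustrative, not taken from the paper):

```python
import numpy as np

def conv_cost(h, w, cin, cout, k):
    # Multiply-adds for a standard k x k convolution (stride 1, 'same' padding).
    return h * w * cin * cout * k * k

def dsconv_cost(h, w, cin, cout, k):
    # Depthwise (one k x k filter per input channel) followed by a
    # pointwise (1 x 1) convolution that mixes channels.
    return h * w * cin * k * k + h * w * cin * cout

# Example: a 3x3 layer with 256 input and 256 output channels on a 32x32 map.
std = conv_cost(32, 32, 256, 256, 3)
ds = dsconv_cost(32, 32, 256, 256, 3)
print(round(std / ds, 1))  # ~8.7, approaching k^2 = 9 for wide layers
```

The reduction factor is 1 / (1/cout + 1/k²), which is why the saving is close to k² whenever the output channel count is large.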
Article
Full-text available
This paper presents incremental network quantization (INQ), a novel method targeting efficient conversion of any pre-trained full-precision convolutional neural network (CNN) model into a low-precision version whose weights are constrained to be either powers of two or zero. Unlike existing methods, which struggle with noticeable accuracy loss, INQ has the potential to resolve this issue, benefiting from two innovations. On one hand, we introduce three interdependent operations, namely weight partition, group-wise quantization and re-training. A well-proven measure is employed to divide the weights in each layer of a pre-trained CNN model into two disjoint groups. The weights in the first group are responsible for forming a low-precision base, thus they are quantized by a variable-length encoding method. The weights in the other group are responsible for compensating for the accuracy loss from the quantization, thus they are the ones to be re-trained. On the other hand, these three operations are repeated on the latest re-trained group in an iterative manner until all the weights are converted into low-precision ones, acting as an incremental network quantization and accuracy enhancement procedure. Extensive experiments on the ImageNet classification task using almost all known deep CNN architectures including AlexNet, VGG-16, GoogleNet and ResNets well testify the efficacy of the proposed method. Specifically, at 5-bit quantization, our models achieve higher accuracy than the 32-bit floating-point references. Taking ResNet-18 as an example, we further show that our quantized models with 4-bit, 3-bit and 2-bit ternary weights achieve improved or very similar accuracy compared with the 32-bit floating-point baseline. Besides, impressive results with the combination of network pruning and INQ are also reported. The code will be made publicly available.
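The low-precision value set INQ targets can be sketched as a nearest-power-of-two quantizer. This is a simplified stand-in for the paper's weight-partition and re-training procedure; the exponent bounds `n1` and `n2` are free parameters here, not the paper's derived values:

```python
import numpy as np

def quantize_pow2(w, n1, n2):
    # Map each weight to the nearest value in {0, +/-2^n2, ..., +/-2^n1},
    # with n2 <= n1 (nearest in log scale), the low-precision set INQ uses.
    out = np.zeros_like(w)
    for idx, v in np.ndenumerate(w):
        if v == 0:
            continue
        e = np.clip(np.round(np.log2(abs(v))), n2, n1)
        q = np.sign(v) * 2.0 ** e
        if abs(v) < 2.0 ** n2 * 2.0 / 3.0:  # too small to represent: snap to zero
            q = 0.0
        out[idx] = q
    return out

w = np.array([0.9, -0.26, 0.05, 0.0007])
print(quantize_pow2(w, n1=0, n2=-4))  # [ 1.  -0.25  0.0625  0. ]
```

Because every nonzero value is a signed power of two, multiplication by a quantized weight reduces to a shift, which is where the hardware efficiency comes from.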
Article
Full-text available
This paper presents how we can achieve state-of-the-art accuracy in the multi-category object detection task while minimizing the computational cost by adapting and combining recent technical innovations. Following the common pipeline of "CNN feature extraction + region proposal + RoI classification", we mainly redesign the feature extraction part, since the region proposal part is not computationally expensive and the classification part can be efficiently compressed with common techniques like truncated SVD. Our design principle is "less channels with more layers", with the adoption of some building blocks including concatenated ReLU, Inception, and HyperNet. The designed network is deep and thin and is trained with the help of batch normalization, residual connections, and learning rate scheduling based on plateau detection. We obtained solid results on well-known object detection benchmarks: 81.8% mAP (mean average precision) on VOC2007 and 82.5% mAP on VOC2012 (2nd place), while taking only 750ms/image on an Intel i7-6700K CPU with a single core and 46ms/image on an NVIDIA Titan X GPU. Theoretically, our network requires only 12.3% of the computational cost of ResNet-101, the winner on VOC2012.
Article
Full-text available
Recent research on deep neural networks has focused primarily on improving accuracy. For a given accuracy level, it is typically possible to identify multiple DNN architectures that achieve that accuracy level. With equivalent accuracy, smaller DNN architectures offer at least three advantages: (1) Smaller DNNs require less communication across servers during distributed training. (2) Smaller DNNs require less bandwidth to export a new model from the cloud to an autonomous car. (3) Smaller DNNs are more feasible to deploy on FPGAs and other hardware with limited memory. To provide all of these advantages, we propose a small DNN architecture called SqueezeNet. SqueezeNet achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters. Additionally, with model compression techniques we are able to compress SqueezeNet to less than 0.5MB (510x smaller than AlexNet). The SqueezeNet architecture is available for download here: https://github.com/DeepScale/SqueezeNet
Conference Paper
Full-text available
Multilayer Neural Networks (MNNs) are commonly trained using gradient descent-based methods, such as BackPropagation (BP). Inference in probabilistic graphical models is often done using variational Bayes methods, such as Expectation Propagation (EP). We show how an EP-based approach can also be used to train deterministic MNNs. Specifically, we approximate the posterior of the weights given the data using a "mean-field" factorized distribution, in an online setting. Using online EP and the central limit theorem we find an analytical approximation to the Bayes update of this posterior, as well as the resulting Bayes estimates of the weights and outputs. Despite a different origin, the resulting algorithm, Expectation BackPropagation (EBP), is very similar to BP in form and efficiency. However, it has several additional advantages: (1) Training is parameter-free, given initial conditions (prior) and the MNN architecture. This is useful for large-scale problems, where parameter tuning is a major challenge. (2) The weights can be restricted to have discrete values. This is especially useful for implementing trained MNNs in precision-limited hardware chips, thus improving their speed and energy efficiency by several orders of magnitude. We test the EBP algorithm numerically in eight binary text classification tasks. In all tasks, EBP outperforms: (1) standard BP with the optimal constant learning rate, and (2) the previously reported state of the art. Interestingly, EBP-trained MNNs with binary weights usually perform better than MNNs with continuous (real) weights, if we average the MNN output using the inferred posterior.
Article
Full-text available
We present flattened convolutional neural networks that are designed for fast feedforward execution. The redundancy of the parameters, especially the weights of the convolutional filters in convolutional neural networks, has been extensively studied, and different heuristics have been proposed to construct a low-rank basis of the filters after training. In this work, we train flattened networks that consist of a consecutive sequence of one-dimensional filters across all directions in 3D space to obtain performance comparable to conventional convolutional networks. We tested the flattened model on different datasets and found that the flattened layer can effectively substitute for the 3D filters without loss of accuracy. The flattened convolution pipelines provide around two times speed-up during the feedforward pass compared to the baseline model due to the significant reduction of learning parameters. Furthermore, the proposed method does not require efforts in manual tuning or post-processing once the model is trained.
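The parameter saving from flattening can be sketched by factoring one rank-1 3D filter into its three 1-D components along the channel, vertical, and horizontal directions (the sizes below are illustrative):

```python
import numpy as np

# A rank-1 3D filter expressed as the outer product of three 1-D filters,
# the decomposition used by flattened convolutional networks.
rng = np.random.default_rng(0)
c, kh, kw = 16, 3, 3
fc = rng.normal(size=c)    # channel direction
fy = rng.normal(size=kh)   # vertical direction
fx = rng.normal(size=kw)   # horizontal direction

full = np.einsum('c,y,x->cyx', fc, fy, fx)  # reconstructed 3D filter
params_full = c * kh * kw                   # parameters of the 3D filter
params_flat = c + kh + kw                   # parameters of the three 1-D filters
print(params_full, params_flat)             # 144 vs 22
```

Applying the three 1-D filters sequentially produces the same response as one pass of the rank-1 3D filter, which is where the feedforward speed-up comes from.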
Article
Full-text available
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to the present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.
Conference Paper
Full-text available
The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called "ImageNet", a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
Conference Paper
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
Conference Paper
Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, we introduce a three-stage pipeline (pruning, quantization and Huffman encoding) that works together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy. Our method first prunes the network by learning only the important connections. Next, we quantize the weights to enforce weight sharing; finally, we apply Huffman encoding. After the first two steps we retrain the network to fine-tune the remaining connections and the quantized centroids. Pruning reduces the number of connections by 9x to 13x; quantization then reduces the number of bits that represent each connection from 32 to 5. On the ImageNet dataset, our method reduced the storage required by AlexNet by 35x, from 240MB to 6.9MB, without loss of accuracy. Our method reduced the size of VGG16 by 49x, from 552MB to 11.3MB, again with no loss of accuracy. This allows fitting the model into on-chip SRAM cache rather than off-chip DRAM memory, which has 180x less access energy.
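The first two pipeline stages can be sketched with magnitude pruning followed by a simple 1-D k-means for weight sharing. This is a toy stand-in for the paper's method; the threshold and cluster count below are arbitrary choices, and the retraining step is omitted:

```python
import numpy as np

def prune(w, threshold):
    # Stage 1: remove small-magnitude connections.
    return np.where(np.abs(w) >= threshold, w, 0.0)

def share_weights(w, n_clusters):
    # Stage 2 (sketch): cluster surviving weights so each stores only a small
    # codebook index; a plain 1-D Lloyd iteration stands in for the real method.
    nz = w[w != 0]
    centroids = np.linspace(nz.min(), nz.max(), n_clusters)
    for _ in range(20):
        idx = np.argmin(np.abs(nz[:, None] - centroids[None, :]), axis=1)
        for k in range(n_clusters):
            if np.any(idx == k):
                centroids[k] = nz[idx == k].mean()
    out = w.copy()
    idx = np.argmin(np.abs(nz[:, None] - centroids[None, :]), axis=1)
    out[w != 0] = centroids[idx]
    return out

rng = np.random.default_rng(0)
w = rng.normal(size=100)
wp = share_weights(prune(w, 0.5), n_clusters=4)
print(len(np.unique(wp)))  # at most 5 distinct values (4 centroids + zero)
```

After this step each surviving weight needs only a 2-bit codebook index plus the shared centroid table, which is the source of the storage reduction the abstract describes.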
Article
Porting state-of-the-art deep learning algorithms to resource-constrained compute platforms (e.g. VR, AR, wearables) is extremely challenging. We propose a fast, compact, and accurate model for convolutional neural networks that enables efficient learning and inference. We introduce LCNN, a lookup-based convolutional neural network that encodes convolutions by a few lookups to a dictionary that is trained to cover the space of weights in CNNs. Training LCNN involves jointly learning a dictionary and a small set of linear combinations. The size of the dictionary naturally traces a spectrum of trade-offs between efficiency and accuracy. Our experimental results on the ImageNet challenge show that LCNN can offer a 3.2x speedup while achieving 55.1% top-1 accuracy using the AlexNet architecture. Our fastest LCNN offers a 37.6x speedup over AlexNet while maintaining 44.3% top-1 accuracy. LCNN not only offers dramatic speedups at inference, but it also enables efficient training. In this paper, we show the benefits of LCNN in few-shot learning and few-iteration learning, two crucial aspects of on-device training of deep learning models.
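The dictionary idea can be sketched as reconstructing every filter from a few indexed dictionary rows. This is only an illustration of the representation, not LCNN's actual training procedure; the dictionary size and per-filter sparsity `s` are made-up numbers:

```python
import numpy as np

# Each filter is a sparse linear combination of rows from a shared dictionary,
# so convolving with a filter reduces to a few lookups and scalings.
rng = np.random.default_rng(0)
dict_size, filt_dim, n_filters, s = 16, 27, 64, 3  # s dictionary rows per filter
D = rng.normal(size=(dict_size, filt_dim))          # shared dictionary
idx = rng.integers(0, dict_size, size=(n_filters, s))   # which rows each filter uses
coef = rng.normal(size=(n_filters, s))                  # mixing coefficients

filters = np.einsum('fs,fsd->fd', coef, D[idx])     # reconstructed filter bank
print(filters.shape)  # (64, 27)
```

Storage drops from n_filters × filt_dim weights to one shared dictionary plus s indices and coefficients per filter, which is the efficiency/accuracy dial the abstract refers to.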
Article
We present a simple, highly modularized network architecture for image classification. Our network is constructed by repeating a building block that aggregates a set of transformations with the same topology. Our simple design results in a homogeneous, multi-branch architecture that has only a few hyper-parameters to set. This strategy exposes a new dimension, which we call "cardinality" (the size of the set of transformations), as an essential factor in addition to the dimensions of depth and width. On the ImageNet-1K dataset, we empirically show that even under the restricted condition of maintaining complexity, increasing cardinality is able to improve classification accuracy. Moreover, increasing cardinality is more effective than going deeper or wider when we increase the capacity. Our models, codenamed ResNeXt, are the foundations of our entry to the ILSVRC 2016 classification task in which we secured 2nd place. We further investigate ResNeXt on an ImageNet-5K set and the COCO detection set, also showing better results than its ResNet counterpart.
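The aggregated-transformations idea behind cardinality can be sketched as a sum over C parallel bottleneck paths plus an identity shortcut. This is a linear toy version: real ResNeXt blocks use convolutions with batch normalization and ReLU, and the sizes below are illustrative:

```python
import numpy as np

def aggregated_block(x, weights_in, weights_mid, weights_out):
    # Split-transform-merge: C low-dimensional transformations with the same
    # topology are summed, then added to the identity shortcut.
    y = np.zeros_like(x)
    for wi, wm, wo in zip(weights_in, weights_mid, weights_out):
        y += wo @ (wm @ (wi @ x))
    return x + y

rng = np.random.default_rng(1)
d, bottleneck, cardinality = 8, 2, 4
paths = [(rng.normal(size=(bottleneck, d)),
          rng.normal(size=(bottleneck, bottleneck)),
          rng.normal(size=(d, bottleneck))) for _ in range(cardinality)]
x = rng.normal(size=d)
out = aggregated_block(x, *zip(*paths))
print(out.shape)  # (8,)
```

Increasing `cardinality` adds more parallel paths of the same shape, the new dimension the paper argues is more effective than simply going deeper or wider.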
Conference Paper
Deep residual networks have emerged as a family of extremely deep architectures showing compelling accuracy and nice convergence behaviors. In this paper, we analyze the propagation formulations behind the residual building blocks, which suggest that the forward and backward signals can be directly propagated from one block to any other block, when using identity mappings as the skip connections and after-addition activation. A series of ablation experiments support the importance of these identity mappings. This motivates us to propose a new residual unit, which makes training easier and improves generalization. We report improved results using a 1001-layer ResNet on CIFAR-10 (4.62% error) and CIFAR-100, and a 200-layer ResNet on ImageNet. Code is available at: https://github.com/KaimingHe/resnet-1k-layers.
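The pre-activation residual unit can be sketched as follows: activation comes before the weight layers, and the skip path carries the input through untouched, so stacked units compose additively. A minimal sketch (batch normalization omitted; identity weights are used only to make the arithmetic checkable):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def pre_activation_unit(x, w1, w2):
    # "Identity mappings" unit: x_{l+1} = x_l + F(x_l), with activations placed
    # before the weight layers and nothing applied after the addition.
    return x + w2 @ relu(w1 @ relu(x))

x = np.array([1.0, -2.0, 3.0])
w = np.eye(3)
y = pre_activation_unit(x, w, w)
print(y)  # identity weights: x + relu(relu(x)) = [2., -2., 6.]
```

Because nothing is applied after the addition, unrolling L units gives x_L = x_0 + sum of residuals, which is the direct signal propagation the paper analyzes.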
Conference Paper
We propose two efficient approximations to standard convolutional neural networks: Binary-Weight-Networks and XNOR-Networks. In Binary-Weight-Networks, the filters are approximated with binary values, resulting in 32× memory savings. In XNOR-Networks, both the filters and the input to convolutional layers are binary. XNOR-Networks approximate convolutions using primarily binary operations. This results in 58× faster convolutional operations (in terms of the number of high-precision operations) and 32× memory savings. XNOR-Nets offer the possibility of running state-of-the-art networks on CPUs (rather than GPUs) in real time. Our binary networks are simple, accurate, efficient, and work on challenging visual tasks. We evaluate our approach on the ImageNet classification task. The classification accuracy with a Binary-Weight-Network version of AlexNet is the same as the full-precision AlexNet. We compare our method with recent network binarization methods, BinaryConnect and BinaryNets, and outperform these methods by large margins on ImageNet, more than 16% in top-1 accuracy. Our code is available at: http://allenai.org/plato/xnornet.
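The binary-weight approximation has a simple closed form, W ≈ αB with B = sign(W) and α the mean absolute weight; a minimal sketch:

```python
import numpy as np

def binarize(w):
    # Binary-Weight-Network approximation: W ~= alpha * B, where B = sign(W)
    # and the scale alpha = mean(|W|) minimizes ||W - alpha * B||^2.
    alpha = np.mean(np.abs(w))
    return alpha, np.sign(w)

w = np.array([[0.5, -1.5], [2.0, -1.0]])
alpha, b = binarize(w)
print(alpha, b.tolist())  # 1.25 [[1.0, -1.0], [1.0, -1.0]]
```

Since B holds only ±1, each filter stores one float plus one bit per weight (hence the 32× saving), and the convolution itself reduces to additions and subtractions scaled once by α.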
Conference Paper
Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014 very deep convolutional networks started to become mainstream, yielding substantial gains in various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and low parameter count are still enabling factors for various use cases such as mobile vision and big-data scenarios. Here we explore ways to scale up networks that aim at utilizing the added computation as efficiently as possible through suitably factorized convolutions and aggressive regularization. We benchmark our methods on the ILSVRC 2012 classification challenge validation set and demonstrate substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single-frame evaluation using a network with a computational cost of 5 billion multiply-adds per inference and using less than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error and 17.3% top-1 error.
Conference Paper
High demand for computation resources severely hinders deployment of large-scale Deep Neural Networks (DNNs) in resource-constrained devices. In this work, we propose a Structured Sparsity Learning (SSL) method to regularize the structures (i.e., filters, channels, filter shapes, and layer depth) of DNNs. SSL can: (1) learn a compact structure from a bigger DNN to reduce computation cost; (2) obtain a hardware-friendly structured sparsity of DNNs to efficiently accelerate their evaluation, where experimental results show that SSL achieves on average 5.1x and 3.1x speedups of convolutional layer computation of AlexNet against CPU and GPU, respectively, with off-the-shelf libraries, about twice the speedups of non-structured sparsity; and (3) regularize the DNN structure to improve classification accuracy. The results show that for CIFAR-10, regularization on layer depth can reduce a 20-layer Deep Residual Network (ResNet) to 18 layers while improving the accuracy from 91.25% to 92.60%, which is still slightly higher than that of the original ResNet with 32 layers. For AlexNet, structure regularization by SSL also reduces the error by around 1%.
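The filter-wise structured-sparsity term can be sketched as a group lasso: an L2 norm over each whole filter, summed across filters (a simplified, single-group version of SSL's regularizer):

```python
import numpy as np

def group_lasso_penalty(conv_filters):
    # Sum of per-filter L2 norms. Unlike plain L1, the gradient of this term
    # pushes entire filters toward zero together, so whole filters (rather
    # than scattered weights) can be removed, which hardware can exploit.
    return sum(np.linalg.norm(f) for f in conv_filters)

filters = [np.zeros((3, 3)), np.full((3, 3), 2.0)]
print(group_lasso_penalty(filters))  # 0 + sqrt(9 * 4) = 6.0
```

Analogous groupings over channels, filter shapes, or layer depth give the other structured-sparsity variants the abstract lists.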
Article
Very deep convolutional networks have been central to the largest advances in image recognition performance in recent years. One example is the Inception architecture, which has been shown to achieve very good performance at relatively low computational cost. Recently, the introduction of residual connections in conjunction with a more traditional architecture has yielded state-of-the-art performance in the 2015 ILSVRC challenge; its performance was similar to the latest-generation Inception-v3 network. This raises the question of whether there is any benefit in combining the Inception architecture with residual connections. Here we give clear empirical evidence that training with residual connections accelerates the training of Inception networks significantly. There is also some evidence of residual Inception networks outperforming similarly expensive Inception networks without residual connections by a thin margin. We also present several new streamlined architectures for both residual and non-residual Inception networks. These variations improve the single-frame recognition performance on the ILSVRC 2012 classification task significantly. We further demonstrate how proper activation scaling stabilizes the training of very wide residual Inception networks. With an ensemble of three residual networks and one Inception-v4, we achieve 3.08% top-5 error on the test set of the ImageNet classification (CLS) challenge.
Article
Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems. Also, conventional networks fix the architecture before training starts; as a result, training cannot improve the architecture. To address these limitations, we describe a method to reduce the storage and computation required by neural networks by an order of magnitude without affecting their accuracy, by learning only the important connections. Our method prunes redundant connections using a three-step method. First, we train the network to learn which connections are important. Next, we prune the unimportant connections. Finally, we retrain the network to fine tune the weights of the remaining connections. On the ImageNet dataset, our method reduced the number of parameters of AlexNet by a factor of 9x, from 61 million to 6.7 million, without incurring accuracy loss. Similar experiments with VGG16 found that the network as a whole can be reduced 6.8x just by pruning the fully-connected layers, again with no loss of accuracy.
Article
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
Article
This paper aims to accelerate the test-time computation of convolutional neural networks (CNNs), especially very deep CNNs that have substantially impacted the computer vision community. Unlike existing methods that are designed for approximating linear filters or linear responses, our method takes the nonlinear units into account. We develop an effective solution to the resulting nonlinear optimization problem without the need of stochastic gradient descent (SGD). More importantly, while current methods mainly focus on optimizing one or two layers, our nonlinear method enables an asymmetric reconstruction that reduces the rapidly accumulated error when multiple (e.g., >=10) layers are approximated. For the widely used very deep VGG-16 model, our method achieves a whole-model speedup of 4x with merely a 0.3% increase of top-5 error in ImageNet classification. Our 4x accelerated VGG-16 model also shows a graceful accuracy degradation for object detection when plugged into the latest Fast R-CNN detector.
Article
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
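The core mechanism of this distillation approach, softening the teacher's output with a temperature, can be sketched directly (the logits and temperature below are illustrative):

```python
import numpy as np

def softmax(z, t=1.0):
    # Temperature-scaled softmax: t > 1 flattens the distribution, exposing
    # the relative probabilities of the wrong classes for the student to match.
    z = np.asarray(z, dtype=float) / t
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

logits = [10.0, 5.0, 1.0]
print(np.round(softmax(logits, t=1.0), 3))  # nearly one-hot
print(np.round(softmax(logits, t=5.0), 3))  # ~[0.652 0.24  0.108], much softer
```

The student is then trained against these soft targets (typically alongside the true labels), so the near-zero probabilities that a hard one-hot label discards still carry signal.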
Article
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
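The per-feature transform can be sketched in a few lines (training-mode statistics only; the running averages used at inference time are omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the mini-batch, then scale and shift with
    # the learned parameters gamma and beta.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # batch of 3, 2 features
y = batch_norm(x, gamma=1.0, beta=0.0)
print(np.round(y.mean(axis=0), 6), np.round(y.std(axis=0), 3))
```

With gamma = 1 and beta = 0 each feature comes out with zero mean and unit variance over the batch; the learned gamma and beta let the network recover any other scale and shift it needs.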
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Article
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Conference Paper
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in understanding an object's precise 2D location. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old along with per-instance segmentation masks. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
Article
The focus of this paper is speeding up the evaluation of convolutional neural networks. While delivering impressive results across a range of computer vision and machine learning tasks, these networks are computationally demanding, limiting their deployability. Convolutional layers generally consume the bulk of the processing time, and so in this work we present two simple schemes for drastically speeding up these layers. This is achieved by exploiting cross-channel or filter redundancy to construct a low rank basis of filters that are rank-1 in the spatial domain. Our methods are architecture agnostic, and can be easily applied to existing CPU and GPU convolutional frameworks for tuneable speedup performance. We demonstrate this with a real world network designed for scene text character recognition, showing a possible 2.5x speedup with no loss in accuracy, and 4.5x speedup with less than 1% drop in accuracy, still achieving state-of-the-art on standard benchmarks.
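The core spatial-redundancy idea above — replacing a d×d filter with filters that are rank-1 in the spatial domain — can be illustrated with an SVD factorization of a single 2-D kernel. This is only the separable-filter intuition, not the paper's full cross-channel scheme (names are illustrative):

```python
import numpy as np

def rank1_factorize(kernel):
    """Approximate a 2-D filter by its best rank-1 (separable) factors
    via SVD, so one d x d convolution becomes a d x 1 vertical pass
    followed by a 1 x d horizontal pass: O(d^2) -> O(2d) per pixel."""
    u, s, vt = np.linalg.svd(kernel)
    col = u[:, 0] * np.sqrt(s[0])   # d x 1 vertical filter
    row = vt[0] * np.sqrt(s[0])     # 1 x d horizontal filter
    return col, row

# A Sobel-like kernel is exactly separable, so rank-1 is lossless here
k = np.array([[1., 0., -1.],
              [2., 0., -2.],
              [1., 0., -1.]])
col, row = rank1_factorize(k)
print(np.allclose(np.outer(col, row), k))  # True for separable kernels
```

Learned CNN filters are rarely exactly rank-1, which is why the paper builds a low-rank *basis* of such filters and trades a tunable amount of accuracy for speed rather than factorizing each filter independently.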
Article
Convolutional networks are one of the most widely employed architectures in computer vision and machine learning. In order to leverage their ability to learn complex functions, large amounts of data are required for training. Training a large convolutional network to produce state-of-the-art results can take weeks, even when using modern GPUs. Producing labels using a trained network can also be costly when dealing with web-scale datasets. In this work, we present a simple algorithm which accelerates training and inference by a significant factor, and can yield improvements of over an order of magnitude compared to existing state-of-the-art implementations. This is done by computing convolutions as pointwise products in the Fourier domain while reusing the same transformed feature map many times. The algorithm is implemented on a GPU architecture and addresses a number of related challenges.
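The "pointwise products in the Fourier domain" above refer to the convolution theorem. A minimal NumPy sketch of circular 2-D convolution via FFT (the paper's actual speedup comes from reusing each transformed feature map across many filters on the GPU, which this single-pair example does not show):

```python
import numpy as np

def fft_conv2d(image, kernel):
    """Circular 2-D convolution via the convolution theorem:
    transform both operands, multiply pointwise, transform back."""
    H, W = image.shape
    Fi = np.fft.rfft2(image)
    Fk = np.fft.rfft2(kernel, s=(H, W))   # zero-pad kernel to image size
    return np.fft.irfft2(Fi * Fk, s=(H, W))

img = np.random.default_rng(0).standard_normal((8, 8))
# Convolving with a delta kernel at the origin must return the image
delta = np.zeros((3, 3))
delta[0, 0] = 1.0
out = fft_conv2d(img, delta)
```

For an H×W image, each transform costs O(HW log HW) versus O(HW·d²) for direct convolution with a d×d kernel, so the Fourier route wins once kernels or reuse counts are large enough to amortize the transforms.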
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675-678. ACM, 2014.
Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014.
Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. Fast convolutional nets with fbfft: A GPU performance evaluation. arXiv preprint arXiv:1412.7580, 2014.
Min Wang, Baoyuan Liu, and Hassan Foroosh. Design of efficient convolutional layers using single intra-channel convolution, topological subdivisioning and spatial "bottleneck" structure. arXiv preprint arXiv:1608.04337, 2016.
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91-99, 2015.