Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

Authors:
Karen Simonyan, Andrew Zisserman

Abstract

In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
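For reference, the 16-weight-layer configuration the abstract refers to (configuration D: thirteen 3 × 3 convolutions interleaved with five max-pooling stages, followed by three fully connected layers) can be sketched in a few lines of PyTorch. This is an illustrative reimplementation of the published layer sizes, not the authors' original code.

    import torch
    import torch.nn as nn

    # Configuration D: numbers are output channels of 3x3 convolutions,
    # 'M' marks a 2x2 max-pooling layer.
    VGG16_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
                 512, 512, 512, 'M', 512, 512, 512, 'M']

    def make_vgg16(num_classes=1000):
        layers, in_ch = [], 3
        for v in VGG16_CFG:
            if v == 'M':
                layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
            else:
                layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1),
                           nn.ReLU(inplace=True)]
                in_ch = v
        # Three fully connected layers complete the 16 weight layers.
        head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(),
            nn.Linear(4096, num_classes))
        return nn.Sequential(*layers, head)

    model = make_vgg16()
    print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])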

... Based on the findings in the paper, it also presents a shallow CNN with good object classification scores. From 2013 until the present day, the most prominent and high-scoring CNNs for object classification in images have been the VGG, ResNet, and Inception models [30], [18], [32], [31]. ...
... These blocks can be put together, trained, and tested. The optimal architecture is problem-dependent, and several articles exploring different configurations have been published [34], [30]. The general approach, however, is a repetition of CONV-ReLU-Pool-CONV..., decreasing the spatial size of the activations throughout the network with the pooling layer. ...
... The second model we explore is the deeper Visual Geometry Group (VGG) 16-layer model [30]. This model was proposed simply to increase accuracy compared to pre-existing models. ...
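A short, hypothetical PyTorch sketch of the repeated CONV-ReLU-Pool pattern quoted above: each stage preserves the spatial size through the 3 × 3 convolution and then halves it with 2 × 2 pooling (the channel counts are arbitrary illustrations).

    import torch
    import torch.nn as nn

    def conv_block(in_ch, out_ch):
        # One CONV-ReLU-Pool stage: 3x3 convolution keeps the spatial size,
        # 2x2 max-pooling then halves it.
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2))

    net = nn.Sequential(conv_block(3, 32), conv_block(32, 64), conv_block(64, 128))
    x = torch.randn(1, 3, 64, 64)
    for stage in net:
        x = stage(x)
        print(tuple(x.shape))  # spatial size shrinks: 32 -> 16 -> 8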
Thesis
Full-text available
This thesis investigates how a tailored CNN can aid autonomous surface vehicles (ASVs) in detecting and classifying maritime traffic for collision avoidance. Several state-of-the-art CNN models are presented and trained on data sets relevant to the above-mentioned objective. Data collected from different sources are used for training these CNN models in pursuit of a well-performing detector. The main data sets are large, general-purpose image sets of ships and boats. A smaller image set is also developed in this thesis. This custom data set is constructed from images taken along a predefined path at sea with a video camera, including images along docks and of ships in transit at sea. This data set is then split into training and testing images that are closely related to each other. Through experiments, variations of the general-purpose data sets are used to train both a 5-layer-deep and a 16-layer-deep CNN model to detect ships in an image.
... This study employed supervised learning and CNN methods for model training, in particular networks such as VGG [22] and ResNet [23], which have maintained their significance within the realm of deep learning for image classification. These architectures emphasize depth through the recurrent use of small receptive field convolutions and pooling layers, simplifying implementation and comprehension. ...
... The architectures emphasize depth through the recurrent use of small receptive field convolutions and pooling layers, simplifying implementation and comprehension. ResNet [23] introduced shortcut connections known as residual blocks; however, VGG's [22] adaptability remains a significant advantage. By incorporating pre-trained weights, both models can be seamlessly integrated into diverse networks and deep learning tasks, emphasizing their versatility. ...
Preprint
Full-text available
Use of fingerprints found at a crime scene is a common practice for identifying suspects in criminal investigations. Over the past two decades, attempts have been made to obtain additional information from fingerprints beyond locating suspects as part of an investigation, including gender, age, and nationality. Researchers have demonstrated 75%-90% accuracy in gender classification based on fingerprint images. Nonetheless, despite promising results, these studies have several significant shortcomings with respect to their practical feasibility. First, they ignore the low quality and quantity of fingerprints collected from the crime scene: typically only one fingerprint is collected, and it might be partial or of poor quality. Second, as most results are based on a single database, public or private, it is difficult to generalize the most suitable method. Third, studies miss the untapped potential of Data-Centric AI (DCAI) approaches for improving results. The aim of this study was to compare, for the first time, gender classification from a fingerprint using several datasets and varying fingerprint image quality. The results from four databases are compared: three public and one internal private database. In addition, we utilize the latest Data-Centric AI (DCAI) approaches for improving classification results. The results demonstrate that a conventional Convolutional Neural Network (CNN) such as VGG is sufficient for this task. Classification accuracy ranges from 80% to 95% depending on the quality of the fingerprint, with DCAI approaches adding a 1%-4% improvement. For partial or low-quality fingerprint images, the periphery of a fingerprint is the most significant area for determining gender. The source code is also provided here for practical application.
... where P_ij is the probability of the outcome of the i-th unit on the j-th base learner, O_ij is the output of the i-th unit of the j-th base learner, and K is the number of classes. This approach is appropriate when individual performance is proportional [28]. On the other hand, it is not appropriate when individual classifier performance is grossly disproportionate. ...
... The previous approach is appropriate when the performances of the individual learners are proportional [28]. On the other hand, it is not appropriate when the individual learners' performances are grossly disproportionate. ...
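A minimal sketch of the averaging rule both excerpts discuss, assuming the per-learner probabilities P_ij come from a softmax over the outputs O_ij; averaging them (soft voting) is reasonable only when the base learners perform comparably, as the excerpts note. All names and numbers here are illustrative.

    import numpy as np

    def softmax(o):
        e = np.exp(o - o.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    # Outputs O_ij of J = 3 base learners for one sample over K = 4 classes.
    outputs = np.array([[2.0, 0.5, 0.1, 0.3],
                        [1.8, 0.7, 0.2, 0.1],
                        [0.2, 2.5, 0.4, 0.6]])  # the third learner disagrees

    probs = softmax(outputs)   # P_ij for each base learner
    avg = probs.mean(axis=0)   # average over the J base learners
    print(avg.argmax())        # class predicted by the ensemble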
Article
Full-text available
Chronic kidney disease (CKD) is one of today’s most serious illnesses. Because this disease usually does not manifest itself until the kidney is severely damaged, early detection saves many people’s lives. Therefore, the contribution of the current paper is proposing three predictive models to predict possible CKD occurrence 6 or 12 months before disease onset, namely: a convolutional neural network (CNN), a long short-term memory (LSTM) model, and a deep ensemble model. The deep ensemble model fuses three base deep learning classifiers (CNN, LSTM, and LSTM-BLSTM) using a majority voting technique. To evaluate the performance of the proposed models, several experiments were conducted on two different public datasets. Among the predictive models, based on the obtained results, the deep ensemble model is superior to all the other models, with an accuracy of 0.993 and 0.992 for the 6-month and 12-month data predictions, respectively.
... After large-scale datasets and computing resources were made available, convolutional neural networks were adopted as the standard for visual recognition. Numerous deep and efficient neural network architectures, including VGG [33], ResNet [16], ResNeXt [34], etc., have been presented. Semantic segmentation is a classification task performed at the pixel level, for which FCNs first applied fully convolutional networks to the entire image. ...
Article
Full-text available
Multi-scale representation provides an effective answer to the scale variation of objects and entities in semantic segmentation. The ability to capture long-range pixel dependency also facilitates semantic segmentation. In addition, semantic segmentation necessitates the effective use of pixel-to-pixel similarity in the channel direction to enhance pixel areas. By reviewing the characteristics of earlier successful segmentation models, we identify a number of crucial elements that enhance segmentation model performance, including a robust encoder structure, multi-scale interactions, attention mechanisms, and a robust decoder structure. The attention mechanism of the asymmetric non-local neural network (ANNet) is merged with multi-scale pyramidal modules to accelerate model segmentation while maintaining high accuracy. However, ANNet does not account for the similarity between pixels in the feature map channel direction, making its segmentation accuracy unsatisfactory. As a result, we propose EMSNet, a straightforward convolutional network architecture for semantic segmentation that consists of an integration of enhanced regional module (IERM) and a multi-scale convolution module (MSCM). The IERM module generates weights using the four- or five-stage feature maps, then fuses the input features with the weights, at the cost of more computation. The similarity of the channel-direction feature graphs is also calculated using ANNet’s auxiliary loss function. The MSCM module can more accurately describe the interactions between various channels, capture the interdependencies between feature pixels, and capture the multi-scale context. Experiments show that the model performs well on benchmark datasets: on the Cityscapes test data, we achieve 82.2% segmentation accuracy, and the mIoU on the ADE20K and Pascal VOC datasets is 45.58% and 85.46%, respectively.
... We use the Letters dataset, which contains 26 letter classes; the size of each image is 28 × 28. In all subsequent experiments, the model we trained was VGG16 (Simonyan and Zisserman 2014), and we used the Adam optimizer with a batch size of 50 on each client. In order to fit the model to the EMNIST dataset, we transform each image so that the original size of 28 × 28 becomes 32 × 32. ...
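A sketch of that preprocessing step, assuming torchvision: EMNIST Letters images are resized from 28 × 28 to the 32 × 32 input used there, and batches of 50 are drawn per client. The normalization constants are placeholders.

    import torch
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    preprocess = transforms.Compose([
        transforms.Resize((32, 32)),           # 28x28 -> 32x32
        transforms.ToTensor(),
        transforms.Normalize((0.5,), (0.5,)),  # illustrative constants
    ])

    train_set = datasets.EMNIST('data', split='letters', train=True,
                                download=True, transform=preprocess)
    loader = DataLoader(train_set, batch_size=50, shuffle=True)
    # optimizer = torch.optim.Adam(model.parameters())  # model defined elsewhere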
Article
Full-text available
Federated Learning (FL) suffers from the Non-IID problem in practice, which poses a challenge for efficient and accurate model training. To address this challenge, prior research has introduced clustered FL (CFL), which involves clustering clients and training them separately. Despite its potential benefits, CFL can be expensive in both computation and communication when the data distribution is unknown beforehand, because CFL uses the entire neural networks of the involved clients to compute the clusters during training, which becomes increasingly time-consuming with large models. To tackle this issue, this paper proposes an efficient CFL approach called LayerCFL that employs a layer-wise clustering technique. In LayerCFL, clients are clustered based on a limited number of neural network layers that are pre-selected using statistical and experimental methods. Our experimental results demonstrate the effectiveness of LayerCFL in mitigating the impact of Non-IID data, improving the accuracy of clustering, and enhancing computational efficiency.
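The core step of LayerCFL, as described, is clustering clients on a small pre-selected subset of layers instead of their whole networks. A hypothetical sketch of that step (the layer choice, the distance metric, and the clustering backend are illustrative assumptions, not the paper's exact procedure):

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    def cluster_clients(client_weights, selected_layers, n_clusters=3):
        # Represent each client by the weights of a few pre-selected layers
        # only, so clustering cost does not scale with the full model size.
        feats = [np.concatenate([w[name].ravel() for name in selected_layers])
                 for w in client_weights]  # one dict of layer arrays per client
        Z = linkage(np.stack(feats), method='ward')
        return fcluster(Z, t=n_clusters, criterion='maxclust')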
... Following this, other CNN models with various depths have been designed. Classification accuracy was further improved in 2014 with the newly developed VGG [14] and GoogLeNet [15] models. Understanding the advancement of the CNN model and considering the necessity of monitoring cracks in concrete structures, including railway sleepers, researchers nowadays are more willing to use CNN models for detecting cracks. ...
Preprint
Full-text available
Crack inspection in railway sleepers is crucial for ensuring rail safety and avoiding deadly accidents. Traditional methods for detecting cracks on railway sleepers are very time-consuming and lack efficiency. Therefore, nowadays, researchers are paying attention to vision-based algorithms, especially Deep Learning algorithms. In this work, we adopted the U-net for the first time for detecting cracks on a railway sleeper and proposed a modified U-net architecture named Dense U-net for segmenting the cracks. In the Dense U-net structure, we established several short connections between the encoder and decoder blocks, which enabled the architecture to obtain better pixel information flow. Thus, the model extracted the necessary information in more detail to predict the cracks. We collected images from railway sleepers, processed them into a dataset, and finally trained the model with the images. The model achieved an overall F1-score, precision, recall, and IoU of 86.5%, 88.53%, 84.63%, and 76.31%, respectively. We compared our suggested model with the original U-net, and the results demonstrate that our model outperformed the U-net in both quantitative and qualitative results. Moreover, we considered the necessity of crack severity analysis and measured a few parameters of the cracks (e.g., length, maximum width, area, density). Engineers must know the severity of the cracks to have an idea about the most severe locations and take the necessary steps to repair the badly affected sleepers.
... Therefore, using a CNN with only a few filters per layer to process the heatmap should be sufficient to fully analyze it. On the contrary, when designing a CNN to process the cat image, it is often necessary to use many filters in each convolutional layer, because the number of details and patterns in the cat image is large [53], [54]. ...
Preprint
Full-text available
Due to the involvement of human subjects or animals in this study, all ethical and experimental procedures and protocols were approved by the Institutional Review Board (IRB) at Princess Sumaya University for Technology under research protocol number (2023-0071).
... Some of the works that scored high in citation counts and centrality were on Deep Learning fundamentals, not its applications in NDT&E [15], [16], [17], [18], [19], [20], and [21]. These articles and books were resources for NDT&E researchers to comprehend the Deep Learning basics, apply its merits, and introduce them to the NDT&E community. ...
Article
This study aims to present a scientometric analysis of Artificial Intelligence (AI) applications in Non-Destructive Testing and Evaluation (NDT&E) and Structural Integrity Monitoring. The data are collected from the research papers contained in the Web of Science Core Collection, covering studies on AI applications in various NDT methods employed across different industry sectors. The collected dataset of 654 research articles published between April 1989 and January 2023 was processed using the CiteSpace software, which classified the dataset into twelve clusters. Moreover, a subset of the data pertaining to the relevant research in the nuclear industry was considered for analysis. Various metrics were used to analyse the consistency of the clusters and the publications' influence and quality. The results, which quantify the dynamics and quality of research in AI for NDT&E, are given as visualised mappings and tabulated data, and the content of the major clusters is discussed to reveal the AI and NDT&E methods used and the sectors or industries where they have been applied.
... The computational volume of depthwise separable convolution (DSC) is about 1/9 that of ordinary convolution [4], and the computational efficiency of DSC is much better than that of ordinary convolution, so it is a very common practice nowadays to use DSC to build lightweight models. A large number of DSC convolutions are used in LERM models to reduce the model size and improve computational efficiency [5][6][7][8][9][10]. ...
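The quoted factor of roughly 1/9 follows from the standard cost comparison: per output position, an ordinary convolution needs K·K·M·N multiply-accumulates, while a depthwise separable convolution needs K·K·M (depthwise) plus M·N (1 × 1 pointwise), a ratio of 1/N + 1/K², which approaches 1/9 for 3 × 3 kernels and large N. A quick check:

    K, M, N = 3, 128, 128      # kernel size, input channels, output channels

    standard = K * K * M * N   # ordinary convolution, per output position
    dsc = K * K * M + M * N    # depthwise + 1x1 pointwise convolution
    print(dsc / standard)      # ~0.119, i.e. 1/N + 1/K**2, close to 1/9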
Chapter
Full-text available
Modern site management has made significant progress, and much advanced equipment and technology has been introduced into the site management process. During power transmission and transformation construction, safety accidents often occur due to fatigued construction work or negative emotions. Embedded devices plus AI are an advanced solution for monitoring and identifying worker sentiment in real time. The LERM model optimized in this paper can run well on embedded devices with reliable accuracy. This model can be applied to surveillance camera equipment at low cost, with fast response and high recognition accuracy. The application of cameras with embedded LERM models on power transmission and transformation sites can identify staff emotions in real time and alert managers. As a result, safety hazards caused by fatigued construction work or negative emotions among power transmission and transformation construction personnel can be avoided.
... With the development of deep learning, expression recognition algorithms can classify emotions well. At present, mainstream expression recognition models include VGGNet [4], GoogLeNet [5], and AlexNet [6]. However, as the network layers deepen, the phenomenon of gradient explosion becomes more and more serious. ...
Chapter
Full-text available
In view of the lack of facial expression data sets for the classroom environment, a classroom expression data set was constructed, including the acquisition and preprocessing of students' face pictures, the selection of students' emotional categories in the classroom environment, and the labeling of pictures. Based on the ResNet50 network model, a network structure with an attention module is proposed so that it can focus on the feature parts that clearly represent the target emotion in facial images, thereby enhancing the accuracy of facial emotion recognition. In order to verify the effect of the model presented in this paper, training tests were carried out on the public expression data set FER2013 and the classroom data set constructed in this article. The results show that the proposed model has a better recognition effect and can effectively enhance the accuracy of expression recognition.
... Since there are variable modules in the Mobileunet-CC model, comparison experiments are carried out to ensure that all the modules are effective. VGG (Simonyan and Zisserman, 2014) has smaller filters and deeper networks. With these unique feature extraction structures, both models have good segmentation performance. ...
Article
Full-text available
The immature winter flush affects the flower bud differentiation, flowering, and fruiting of litchi, and thus seriously reduces litchi yield. However, at present, the area estimation and growth process monitoring of winter flush still rely on manual judgment and operation, so it is impossible to control flush accurately and effectively. An efficient approach is proposed in this paper to detect litchi flush in unmanned aerial vehicle (UAV) remote sensing images of the litchi crown and to track the winter flush growth of litchi trees. The proposed model is constructed based on the U-Net network, of which the encoder is replaced by the MobileNetV3 backbone network to reduce model parameters and computation. Moreover, the Convolutional Block Attention Module (CBAM) is integrated and a convolutional layer is added to enhance the feature extraction ability, and transfer learning is adopted to solve the problem of small data volume. As a result, the Mean Pixel Accuracy (MPA) and Mean Intersection over Union (MIoU) on the flush dataset are increased from 90.95% and 83.3% to 93.4% and 85%, respectively. Moreover, the size of the proposed model is reduced by 15% from the original model. In addition, the segmentation model is applied to the tracking of winter flushes on the canopy of litchi trees and to investigating the two growth processes of litchi flushes (late-autumn shoots growing into flushes and flushes growing into mature leaves). It is revealed that the growth processes of flushes in a particular branch region can be quantitatively analysed based on the UAV images and the proposed semantic segmentation model. The results also demonstrate that a sudden drop in temperature can promote the rapid transformation of late-autumn shoots into flushes. The method proposed in this paper provides a new technique for the accurate management of litchi flush and a possibility for the area estimation and growth process monitoring of winter flush, which can assist in the control operations and yield prediction of litchi orchards.
... We apply the approach described in [9] to extract the RGB features. In contrast to [9], our Pyramid Net (as shown in Fig 6) uses simpler VGG [31] frameworks and convolutional upsampling blocks to accommodate the high similarity and poor resolution, reducing the number of parameters and increasing inference speed. ...
Preprint
Full-text available
6D object pose estimation is a key task in robotic manipulation and grasping scenes. Many prior two-stage works with slower inference speeds require extra refinement to deal with challenging problems such as variations in lighting, sensor noise, object occlusion, and truncation. To address these problems, this work proposes a decoupled one-stage network (DON6D) for 6D pose estimation, which improves inference speed while maintaining accuracy. Specifically, since RGB images are aligned with RGB-D images, DON6D first utilizes a 2D detection network to locate the objects of interest in RGB-D images. Second, DON6D includes a feature extraction and fusion module to fully extract color and geometric features; furthermore, dual data augmentation is implemented to enhance the generalizability of the model. Third, DON6D fuses these features and applies an attention residual encoder-decoder, which can boost pose estimation performance, to obtain an accurate 6D pose. DON6D is a real-time method that is evaluated on the LINEMOD and YCB-Video datasets. The results demonstrate that DON6D is superior to several state-of-the-art methods under the ADD(-S) metric and the ADD(-S) AUC metric.
... The FCN includes an encoder in the form of convolutional layers for image downsampling. The encoder usually uses the VGG16 network [30], which is designed for image recognition tasks and has 16 weight layers (13 convolutional and three fully connected) [28], plus an additional five max-pooling layers. Unlike U-Net, the FCN's decoder path is not symmetrical to the encoder. ...
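Those layer counts can be checked directly against the torchvision implementation (a sketch; pass weights='DEFAULT' for the pretrained variant): the 13 convolutional and 5 max-pooling layers live in vgg16().features, which is the part FCN-style encoders typically reuse, while the three fully connected layers sit in the classifier.

    import torch.nn as nn
    from torchvision.models import vgg16

    model = vgg16()  # untrained; weights='DEFAULT' loads ImageNet weights
    convs = [m for m in model.features if isinstance(m, nn.Conv2d)]
    pools = [m for m in model.features if isinstance(m, nn.MaxPool2d)]
    fcs = [m for m in model.classifier if isinstance(m, nn.Linear)]
    print(len(convs), len(pools), len(fcs))  # 13 5 3

    encoder = model.features  # the downsampling path an FCN reuses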
Article
Automated seed sorting is widely used in the agricultural industry. Deep learning is a new field of study in agricultural seed sorting applications. In this study, classification of buckwheat seeds and foreign materials, such as sticks, chaff, and stones, was performed using deep learning. The main purpose of the study was to show the effect of scaling the images on the classification results while creating a dataset. An industrial experimental setup was used to generate the datasets of buckwheat seeds and foreign materials to be sorted by deep learning. The images in the created dataset were rescaled with two different techniques, precision scaling and direct scaling, which were labelled as the Type1 dataset and Type2 dataset, respectively. To classify buckwheat seeds and foreign materials, the AlexNet architecture was used. The classification accuracy was calculated as 98.57% for the Type1 dataset and 97.34% for the Type2 dataset. As a result, it was concluded that the Type1 dataset had higher accuracy and that precision scaling can be used to improve classification results in industrial applications.
Article
Image desnowing is a challenging task in computer vision, as it requires the removal of snow from images while preserving the underlying scene structure and content. In order to achieve high performance, desnowing methods need to be able to effectively capture both local and global information in the image. The proposed method addresses this challenge by introducing a novel Context-aware Feature Aggregation (CFA) module. The CFA module is designed to capture both local and global information by aggregating features of the network in latent space. This allows the method to better understand the contextual relationships in the image, which is essential for accurate snow removal. In addition to the CFA module, the proposed method also introduces a Selective Refinement Head (SRH). The SRH is designed to adaptively fuse coarse features from the encoder and decoder of the network. This allows the method to refine the output by incorporating relevant information from both low-level and high-level representations. Finally, the proposed method leverages the capabilities of contrastive learning to better align the desnow images and ground-truth images in perceptual space. This leads to improved image quality and desnowing performance. Extensive experiments on both synthetic and real-world datasets show that the proposed method achieves state-of-the-art results on image desnowing task.
Article
The novel coronavirus 2019 (COVID-19) has rapidly spread, evolving into a global epidemic. Existing pharmaceutical techniques and diagnostic tests, such as reverse transcription–polymerase chain reaction (RT-PCR) and serology tests, are time-consuming, expensive, and require well-equipped laboratories for analysis. This restricts their accessibility to a broader population. The need for a simple and accurate screening method is imperative to identify infected individuals and curtail the virus’s propagation. In this paper, we introduce a novel COVID-19 classification and detection approach (LSAE, latent space autoencoder) based on chest X-ray image scans. Initially, the high dimensionality of input data is compressed into a reduced representation (latent space), preserving crucial features while discarding noise. This latent space subsequently serves as the input to build an efficient SVM classifier for COVID-19 detection. Experimental outcomes using the COVID-19 dataset are promising as they confirm the rapidity and detection capability of the proposed LSAE.
Chapter
Developing a new Salient Object Detection (SOD) model involves selecting an ImageNet pre-trained backbone and creating novel feature refinement modules to use backbone features. However, adding new components to a pre-trained backbone needs retraining the whole network on the ImageNet dataset, which requires significant time. Hence, we explore developing a neural network from scratch directly trained on SOD without ImageNet pre-training. Such a formulation offers full autonomy to design task-specific components. To that end, we propose SODAWideNet, an encoder-decoder-style network for Salient Object Detection. We deviate from the commonly practiced paradigm of narrow and deep convolutional models to a wide and shallow architecture, resulting in a parameter-efficient deep neural network. To achieve a shallower network, we increase the receptive field from the beginning of the network using a combination of dilated convolutions and self-attention. Therefore, we propose Multi Receptive Field Feature Aggregation Module (MRFFAM) that efficiently obtains discriminative features from farther regions at higher resolutions using dilated convolutions. Next, we propose Multi-Scale Attention (MSA), which creates a feature pyramid and efficiently computes attention across multiple resolutions to extract global features from larger feature maps. Finally, we propose two variants, SODAWideNet-S (3.03M) and SODAWideNet (9.03M), that achieve competitive performance against state-of-the-art models on five datasets. We provide the code and pre-computed saliency maps here.
Article
Scene classification and recognition have always been among the most challenging tasks of scene understanding due to the inherent ambiguity in visual scenes. The core of scene classification and recognition tasks is scene representation. Deep learning advances in computer vision, especially deep CNNs, have significantly improved scene representation in the last decade. Deep convolutional features extracted from deep CNNs provide discriminative representations of the images and are widely used in various computer vision tasks, such as scene classification. Deep convolutional features capture the appearance characteristics of the image and the spatial information about different image regions. Meanwhile, the semantic and context information obtained from high-level concepts about scene images, such as objects and their relationships, can significantly contribute to identifying scene images. Therefore, in this paper, we divide visual scenes into two categories, object-based and layout-based. Object-based scenes are scenes that have scene-specific objects and, based on those objects, can be described and identified. In contrast, layout-based scenes do not have scene-specific objects and are described and identified based on the appearance and layout of the image. This paper proposes a new neural network model for representing and classifying visual scenes, which we call G-CNN (GNN-CNN). The proposed model includes two modules, feature extraction and feature fusion, and the feature extraction module consists of visual and semantic branches. The visual branch is responsible for extracting deep CNN features from the image, and the semantic branch is responsible for extracting semantic GNN features from the scene graph corresponding to the image. The feature fusion module is a novel two-stream neural network that fuses the CNN and GNN feature vectors to produce a comprehensive representation of the scene image. Finally, a fully-connected classifier classifies the obtained comprehensive feature vector into one of the pre-defined categories. The proposed model has been evaluated on three benchmark scene datasets, UIUC Sports, MIT67, and SUN397, obtaining classification accuracies of 99.91%, 96.01%, and 85.32%, respectively. In addition, a new dataset named Scene40, which was introduced in our previous paper, is also used for further evaluation of the proposed method. The comparison results based on classification accuracy show that the proposed model can outperform the best previous methods on the three benchmark scene datasets.
Article
Loop tiling and fusion are two essential transformations in optimizing compilers to enhance the data locality of programs. Existing heuristics either perform loop tiling and fusion in a particular order, missing some of their profitable compositions, or execute ad-hoc implementations for domain-specific applications, calling for a generalized and systematic solution in optimizing compilers. In this paper, we present a so-called basteln (an abbreviation for backward slicing of tiled loop nests) strategy in polyhedral compilation to better model the interplay between loop tiling and fusion. The basteln strategy first groups loop nests by preserving their parallelism/tilability and next performs rectangular/parallelogram tiling to the output groups that produce data consumed outside the considered program fragment. The memory footprints required by each tile are then computed, from which the upwards exposed data are extracted to determine the tile shapes of the remaining fusion groups. Such a tiling mechanism can construct complex tile shapes imposed by the dependences between these groups, which are further merged by a post-tiling fusion algorithm for enhancing data locality without losing the parallelism/tilability of the output groups. The basteln strategy also takes into account the amount of redundant computations and the fusion of independent groups, exhibiting a general applicability. We integrate the basteln strategy into two optimizing compilers, with one a general-purpose optimizer and the other a domain-specific compiler for deploying deep learning models. The experiments are conducted on CPU, GPU, and a deep learning accelerator to demonstrate the effectiveness of the approach for a wide class of application domains, including deep learning, image processing, sparse matrix computation, and linear algebra. In particular, the basteln strategy achieves a mean speedup of 1.8 × over cuBLAS/cuDNN and 1.1 × over TVM on GPU when used to optimize deep learning models; it also outperforms PPCG and TVM by 11% and 20%, respectively, when generating code for the deep learning accelerator.
Article
The image-to-image translation aims to learn a mapping between the source and target domains. For improving visual quality, the majority of previous works adopt multi-stage techniques to refine coarse results in a progressive manner. In this work, we present a novel approach for generating plausible details by only introducing a group of intermediate supervisions without cascading multiple stages. Specifically, we propose a Laplacian Pyramid Transformation Generative Adversarial Network (LapTransGAN) to simultaneously transform components in different frequencies from the source domain to the target domain within only one stage. Hierarchical perceptual and gradient penalization are utilized for learning consistent semantic structures and details at each pyramid level. The proposed model is evaluated based on various metrics, including the similarity in feature maps, reconstruction quality, segmentation accuracy, similarity in details, and qualitative appearances. Our experiments show that LapTransGAN can achieve a much better quantitative performance than both the supervised pix2pix model and the unsupervised CycleGAN model. Comprehensive ablation experiments are conducted to study the contribution of each component.
Article
Binary Neural Networks (BNN) have binarized neuron and connection values so that their accelerators can be realized by extremely efficient hardware. However, there is a significant accuracy gap between BNNs and networks with wider bit-width. Conventional BNNs binarize feature maps by static globally-unified thresholds, which makes the produced bipolar image lose local details. This paper proposes a multi-input activation function to enable adaptive thresholding for binarizing feature maps: (a) At the algorithm level, instead of operating each input pixel independently, adaptive thresholding dynamically changes the threshold according to surrounding pixels of the target pixel. When optimizing weights, adaptive thresholding is equivalent to an accompanied depth-wise convolution between normal convolution and binarization. Accompanied weights in the depth-wise filters are ternarized and optimized end-to-end. (b) At the hardware level, adaptive thresholding is realized through a multi-input activation function, which is compatible with common accelerator architectures. Compact activation hardware with only one extra accumulator is devised. By equipping the proposed method on FPGA, 4.1% accuracy improvement is achieved on the original BNN with only 1.1% extra LUT resource. Compared with State-of-the-art methods, the proposed idea further increases network accuracy by 0.8% on the Cifar-10 dataset and 0.4% on the ImageNet dataset.
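A hypothetical PyTorch sketch of the adaptive-thresholding idea as the abstract describes it: rather than comparing each pixel with a static global threshold, a depth-wise convolution over the surrounding pixels produces a per-pixel threshold before binarization (the ternarization of the accompanying weights and the hardware mapping are omitted here).

    import torch
    import torch.nn as nn

    class AdaptiveBinarize(nn.Module):
        # Binarize a feature map against a learned, locally varying threshold.
        def __init__(self, channels, kernel_size=3):
            super().__init__()
            # Depth-wise convolution: threshold depends on neighbouring pixels.
            self.thresh = nn.Conv2d(channels, channels, kernel_size,
                                    padding=kernel_size // 2, groups=channels,
                                    bias=False)

        def forward(self, x):
            return torch.sign(x - self.thresh(x))  # bipolar +1/-1 output

    x = torch.randn(1, 16, 8, 8)
    print(AdaptiveBinarize(16)(x).unique())  # tensor([-1., 1.])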
Article
Full-text available
The past decade has witnessed many great successes of machine learning (ML) and deep learning (DL) applications in agricultural systems, including weed control, plant disease diagnosis, agricultural robotics, and precision livestock management. However, a notable limitation of these ML/DL models lies in their reliance on large-scale labeled datasets for training, with their performance closely tied to the quantity and quality of available labeled data. The process of collecting, processing, and labeling such datasets is both expensive and time-consuming, primarily due to escalating labor costs. This challenge has sparked substantial interest among researchers and practitioners in the development of label-efficient ML/DL methods tailored for agricultural applications. In fact, there are more than 50 papers on developing and applying deep-learning-based label-efficient techniques to address various agricultural problems since 2016, which motivates the authors to provide a timely and comprehensive review of recent label-efficient ML/DL methods in agricultural applications. To this end, a principled taxonomy is first developed to organize these methods according to the degree of supervision, including weak supervision (i.e., active learning and semi-/weakly- supervised learning), and no supervision (i.e., un-/self- supervised learning), supplemented by representative state-of-the-art label-efficient ML/DL methods. In addition, a systematic review of various agricultural applications exploiting these label-efficient algorithms, such as precision agriculture, plant phenotyping, and postharvest quality assessment, is presented. Finally, the current problems and challenges are discussed, as well as future research directions. A well-classified paper list that will be actively updated can be accessed at https://github.com/DongChen06/Label-efficient-in-Agriculture.
Article
Full-text available
Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS. Downpour SGD and Sandblaster L-BFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 30x larger than previously reported in the literature and achieve state-of-the-art performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly sized deep network for a commercial speech recognition service. Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradient-based machine learning algorithm.
Article
Full-text available
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to the present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.
Article
Full-text available
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (approx 2 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.
Article
Full-text available
We present an integrated framework for using Convolutional Networks for classification, localization and detection. We show how a multiscale and sliding window approach can be efficiently implemented within a ConvNet. We also introduce a novel deep learning approach to localization by learning to predict object boundaries. Bounding boxes are then accumulated rather than suppressed in order to increase detection confidence. We show that different tasks can be learnt simultaneously using a single shared network. This integrated framework is the winner of the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC2013), and produced near-state-of-the-art results for the detection and classification tasks. Finally, we release a feature extractor from our best model, called OverFeat.
Article
Full-text available
Recognizing arbitrary multi-character text in unconstrained natural photographs is a hard problem. In this paper, we address an equally hard sub-problem in this domain, viz. recognizing arbitrary multi-digit numbers from Street View imagery. Traditional approaches to solving this problem typically separate the localization, segmentation, and recognition steps. In this paper we propose a unified approach that integrates these three steps via the use of a deep convolutional neural network that operates directly off the image pixels. This model is configured with 11 hidden layers, all with feedforward connections. We employ the DistBelief implementation of deep neural networks to scale our computations over this network. We have evaluated this approach on the publicly available SVHN dataset and achieve over 96% accuracy in recognizing street numbers. We show that on a per-digit recognition task, we improve upon the state of the art and achieve 97.84% accuracy. We also evaluated this approach on an even more challenging dataset generated from Street View imagery containing several tens of millions of street number annotations, and achieve over 90% accuracy. Our evaluations further indicate that at specific operating thresholds, the performance of the proposed system is comparable to that of human operators and has to date helped us extract close to 100 million street numbers from Street View imagery worldwide.
Article
Full-text available
We investigate multiple techniques to improve upon the current state-of-the-art deep convolutional neural network based image classification pipeline. The techniques include adding more image transformations to the training data, adding more transformations to generate additional predictions at test time, and using complementary models applied to higher-resolution images. This paper summarizes our entry in the ImageNet Large Scale Visual Recognition Challenge 2013. Our system achieved a top-5 classification error rate of 13.55% using no external data, which is over a 20% relative improvement on the previous year's winner.
Conference Paper
Full-text available
The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called "ImageNet", a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
Conference Paper
Full-text available
The Fisher kernel (FK) is a generic framework which combines the benefits of generative and discriminative approaches. In the context of image classification the FK was shown to extend the popular bag-of-visual-words (BOV) by going beyond count statistics. However, in practice, this enriched representation has not yet shown its superiority over the BOV. In the first part we show that with several well-motivated modifications over the original framework we can boost the accuracy of the FK. On PASCAL VOC 2007 we increase the Average Precision (AP) from 47.9% to 58.3%. Similarly, we demonstrate state-of-the-art accuracy on CalTech 256. A major advantage is that these results are obtained using only SIFT descriptors and costless linear classifiers. Equipped with this representation, we can now explore image classification on a larger scale. In the second part, as an application, we compare two abundant resources of labeled images to learn classifiers: ImageNet and Flickr groups. In an evaluation involving hundreds of thousands of training images we show that classifiers learned on Flickr groups perform surprisingly well (although they were not intended for this purpose) and that they can complement classifiers learned on more carefully annotated datasets.
Conference Paper
Traditional methods of computer vision and machine learning cannot match human performance on tasks such as the recognition of handwritten digits or traffic signs. Our biologically plausible, wide and deep artificial neural network architectures can. Small (often minimal) receptive fields of convolutional winner-take-all neurons yield large network depth, resulting in roughly as many sparsely connected neural layers as found in mammals between retina and visual cortex. Only winner neurons are trained. Several deep neural columns become experts on inputs preprocessed in different ways; their predictions are averaged. Graphics cards allow for fast training. On the very competitive MNIST handwriting benchmark, our method is the first to achieve near-human performance. On a traffic sign recognition benchmark it outperforms humans by a factor of two. We also improve the state-of-the-art on a plethora of common image classification benchmarks.
Conference Paper
Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. We attempt to bridge the gap between the theory and practice of deep learning by systematically analyzing learning dynamics for the restricted case of deep linear neural networks. Despite the linearity of their input-output map, such networks have nonlinear gradient descent dynamics on weights that change with the addition of each new hidden layer. We show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions. We provide an analytical description of these phenomena by finding new exact solutions to the nonlinear dynamics of deep learning. Our theoretical analysis also reveals the surprising finding that as the depth of a network approaches infinity, learning speed can nevertheless remain finite: for a special class of initial conditions on the weights, very deep networks incur only a finite, depth independent, delay in learning speed relative to shallow networks. We show that, under certain conditions on the training data, unsupervised pretraining can find this special class of initial conditions, while scaled random Gaussian initializations cannot. We further exhibit a new class of random orthogonal initial conditions on weights that, like unsupervised pre-training, enjoys depth independent learning times. We further show that these initial conditions also lead to faithful propagation of gradients even in deep nonlinear networks, as long as they operate in a special regime known as the edge of chaos.
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Article
Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g. 224×224) input image. This requirement is “artificial” and may hurt the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with a more principled pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. By removing the fixed-size limitation, we can improve all CNN-based image classification methods in general. Our SPP-net achieves state-of-the-art accuracy on the datasets of ImageNet 2012, Pascal VOC 2007, and Caltech101. The power of SPP-net is more significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method computes convolutional features 30-170× faster than the recent leading method R-CNN (and 24-64× faster overall), while achieving better or comparable accuracy on Pascal VOC 2007.
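The fixed-length property is easy to see with adaptive pooling: pool the final convolutional feature map into a fixed pyramid of bins and concatenate. A minimal sketch (bin sizes follow a common 4-2-1 pyramid; this is an illustration, not the paper's code):

    import torch
    import torch.nn.functional as F

    def spatial_pyramid_pool(x, bins=(4, 2, 1)):
        # Pool a (B, C, H, W) map into a fixed-length vector for any H, W.
        feats = [F.adaptive_max_pool2d(x, b).flatten(1) for b in bins]
        return torch.cat(feats, dim=1)  # length C * (16 + 4 + 1)

    for h, w in [(7, 7), (5, 9)]:       # conv feature maps of different sizes
        x = torch.randn(1, 256, h, w)
        print(spatial_pyramid_pool(x).shape)  # torch.Size([1, 5376]) both times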
We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to incorporate into the network design aspects of the best performing hand-crafted features. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multi-task learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it matches the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.
Article
I present a new way to parallelize the training of convolutional neural networks across multiple GPUs. The method scales significantly better than all alternatives when applied to modern convolutional neural networks.
Article
This paper addresses the visualisation of image classification models, learnt using deep Convolutional Networks (ConvNets). We consider two visualisation techniques, based on computing the gradient of the class score with respect to the input image. The first one generates an image, which maximises the class score [Erhan et al., 2009], thus visualising the notion of the class, captured by a ConvNet. The second technique computes a class saliency map, specific to a given image and class. We show that such maps can be employed for weakly supervised object segmentation using classification ConvNets. Finally, we establish the connection between the gradient-based ConvNet visualisation methods and deconvolutional networks [Zeiler et al., 2013].
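The second technique reduces to a few lines: backpropagate the (unnormalised) class score to the input pixels and take the gradient magnitude, maximised over colour channels. A sketch using an arbitrary classification ConvNet; the model, class index, and input are placeholders.

    import torch
    from torchvision.models import vgg16

    model = vgg16().eval()                    # any classification ConvNet works
    x = torch.randn(1, 3, 224, 224, requires_grad=True)  # preprocessed image

    score = model(x)[0, 281]                  # unnormalised score of one class
    score.backward()                          # d(score)/d(pixels)
    saliency = x.grad.abs().max(dim=1)[0]     # max over RGB -> (1, 224, 224)
    print(saliency.shape)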
Article
The ability of learning networks to generalize can be greatly enhanced by providing constraints from the task domain. This paper demonstrates how such constraints can be integrated into a backpropagation network through the architecture of the network. This approach has been successfully applied to the recognition of handwritten zip code digits provided by the U.S. Postal Service. A single network learns the entire recognition operation, going from the normalized image of the character to the final classification.