Conference Paper

Deep Residual Learning for Image Recognition

Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun
... where W contains eigenvectors of the covariance matrix of X. • Deep learning-based feature extraction [20]: CNNs reduce dimensionality by extracting hierarchical features. For example, ResNet [21] uses residual connections, as follows: ...
... where f(x) represents the learned transformation within a residual block [21]. After feature extraction, a classification model, such as an SVM, random forest, or a DL network like a CNN, is trained on labeled samples. ...
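As a concrete illustration of the residual mapping y = x + f(x) described in these excerpts, the following is a minimal sketch of an identity-shortcut residual block in PyTorch; the two-convolution form and the channel count are illustrative assumptions rather than the exact configuration used in [21].

    import torch
    import torch.nn as nn

    class BasicResidualBlock(nn.Module):
        """Computes y = x + f(x), where f is two 3x3 convolutions (illustrative sizes)."""
        def __init__(self, channels: int):
            super().__init__()
            self.f = nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
            )
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            # The identity shortcut lets the block learn only the residual f(x)
            # and gives gradients a direct path around the convolutions.
            return self.relu(x + self.f(x))

    block = BasicResidualBlock(64)
    out = block(torch.randn(1, 64, 56, 56))  # shape is preserved: (1, 64, 56, 56)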
Article
Full-text available
Few-shot classification of polarimetric synthetic aperture radar (PolSAR) images is a challenging task due to the scarcity of labeled data and the complex scattering properties of PolSAR data. Traditional deep learning models often suffer from overfitting and catastrophic forgetting in such settings. Recent advancements have explored innovative approaches, including data augmentation, transfer learning, meta-learning, and multimodal fusion, to address these limitations. Data augmentation methods enhance the diversity of training samples, with advanced techniques like generative adversarial networks (GANs) generating realistic synthetic data that reflect PolSAR’s polarimetric characteristics. Transfer learning leverages pre-trained models and domain adaptation techniques to improve classification across diverse conditions with minimal labeled samples. Meta-learning enhances model adaptability by learning generalizable representations from limited data. Multimodal methods integrate complementary data sources, such as optical imagery, to enrich feature representation. This survey provides a comprehensive review of these strategies, focusing on their advantages, limitations, and potential applications in PolSAR classification. We also identify key trends, such as the increasing role of hybrid models combining multiple paradigms and the growing emphasis on explainability and domain-specific customization. By synthesizing SOTA approaches, this survey offers insights into future directions for advancing few-shot PolSAR classification.
... The framework of the proposed architecture for FER is illustrated in Fig. 1. It includes a modified non-local attention block integrated into the backbone network of ResNet-18 [25], followed by a multilayer perceptron layer for expression prediction. In this section, we elaborate on the backbone network, the non-local operation, and the modified non-local attention mechanism for the FER task. ...
... We utilize the well-known CNN architecture, ResNet-18 [25] as our foundational model. To prevent overfitting and maintain a low computational complexity, we employ the initial 4 encoder blocks from ResNet-18, which have been pre-trained on the CASIA-WebFace [26] dataset. ...
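For readers who want to reproduce this kind of truncated backbone, the sketch below keeps only the early stages of a torchvision ResNet-18 as a feature extractor. Note two assumptions: torchvision ships ImageNet weights rather than the CASIA-WebFace weights used in the cited work (the snippet simply builds a randomly initialised network), and which stages count as the "initial 4 encoder blocks" is our reading.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet18

    full = resnet18()  # randomly initialised; the cited work instead loads CASIA-WebFace weights

    # Keep the stem plus the first three residual stages (one possible reading of
    # "initial 4 encoder blocks"); drop layer4, global pooling, and the classifier.
    backbone = nn.Sequential(
        full.conv1, full.bn1, full.relu, full.maxpool,
        full.layer1, full.layer2, full.layer3,
    )

    features = backbone(torch.randn(1, 3, 224, 224))
    print(features.shape)  # torch.Size([1, 256, 14, 14]) for a 224x224 input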
... First, we select an optimal backbone CNN architecture from a pre-defined search space. In this study, we consider ResNet [28] and UNet [29], both of which have been effectively applied in genomics. ResNet, with its deep residual connections, is well-suited for classification tasks that require capturing hierarchical features from sequential data. ...
Preprint
Full-text available
Pre-trained language models have transformed the field of natural language processing (NLP), and their success has inspired efforts in genomics to develop domain-specific foundation models (FMs). However, creating high-quality genomic FMs from scratch is resource-intensive, requiring significant computational power and high-quality pre-training data. The success of large language models (LLMs) in NLP has largely been driven by industrial-scale efforts leveraging vast, diverse corpora and massive computing infrastructure. In this work, we aim to bypass the data and computational bottlenecks of creating genomic FMs from scratch and instead propose repurposing existing LLMs for genomics tasks. Inspired by the recently observed 'cross-modal transfer' phenomenon - where transformers pre-trained on natural language can generalize to other modalities - we introduce L2G, which adapts a pre-trained LLM architecture for genomics using neural architecture search (NAS) and a novel three-stage training procedure. Remarkably, without requiring extensive pre-training on DNA sequence data, L2G achieves superior performance to fine-tuned genomic FMs and task-specific models on more than half of tasks across multiple genomics benchmarks. In an enhancer activity prediction task, L2G further demonstrates its capacity to identify significant transcription factor motifs. Our work not only highlights the generalizability and efficacy of language models in out-of-domain tasks such as genomics, but also opens new avenues for more efficient and less resource-intensive methodologies in genomic research.
... The Deep Residual Network (ResNet) model proposed by Kaiming He [22] introduces residual blocks to establish cross-layer data pathways, using 1 × 1 convolutions during down-sampling and up-sampling to combine information across channels and add nonlinearity. This allows gradients to propagate directly across layers, better handling complex image features and improving the accuracy of corrosion detection [23]. ...
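The 1 × 1 projection shortcut mentioned in this excerpt can be sketched as follows: when a block changes the spatial resolution or channel count, a strided 1 × 1 convolution on the shortcut path keeps the addition shape-compatible. The sizes below are illustrative, not the cited configuration.

    import torch
    import torch.nn as nn

    class DownsampleResidualBlock(nn.Module):
        """Residual block whose shortcut uses a strided 1x1 conv to match shapes."""
        def __init__(self, in_ch: int, out_ch: int, stride: int = 2):
            super().__init__()
            self.f = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
                nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm2d(out_ch),
            )
            # The 1x1 convolution mixes information across channels and halves the
            # resolution so the shortcut can be added to f(x).
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            return self.relu(self.f(x) + self.shortcut(x))

    block = DownsampleResidualBlock(64, 128)
    print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 128, 28, 28])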
Article
Full-text available
As the core component of railways, the switch sliding baseplate operates in a harsh environment, and its surface is prone to corrosion. Existing methods, including traditional inspection, ultrasonic detection, and image processing, have difficulty extracting corrosion features and are hard to apply in practice. To solve the above problems, the Residual Neural Network 50 (ResNet50) model, a deep learning model, is introduced in this paper. To address the problems of gradient explosion and weak corrosion features in the model, a new fusion model, VGG-ResNet50-corrosion (VGGRES50_Corrosion), is proposed. First, because no public dataset exists, this study conducts a neutral salt spray corrosion test and collects the image features and corrosion depth parameters of sliding baseplate corrosion over different time periods as the dataset for evaluating the model. Then, corrosion thickness is introduced as a modifying variable in the ResNet50 network, and the new network, VGGRES50_Corrosion, is obtained by blending the improved model with the Visual Geometry Group-16 (VGG16) network through a model fusion strategy. Finally, a model test and an ultrasonic comparison test are designed to verify the performance of the model. In the model test, the recognition accuracy of the fusion model is 98.98%, higher than that of the other models, which effectively addresses gradient explosion and the weak generalization ability of models trained on small samples. In the ultrasonic comparison experiment, the mean relative errors of this method and the ultrasonic detection method are 4.08% and 46.14%, respectively, and the mean square errors are 1.86 h and 15.01 h, respectively. The prediction result of deep learning is better than that of ultrasonic piecewise linear fitting. These results show that VGGRES50_Corrosion can identify the degree of corrosion of switch sliding baseplates more effectively and can substantially improve the efficiency of their corrosion detection.
... (1) ResNet: ResNet was proposed in 2015 by He et al. [40]. ResNet substantially alleviates the degradation problem of deep networks through its residual connections while significantly reducing the number of parameters. ...
Article
Full-text available
Rolling bearings are critical rotating components in machinery and equipment; they are essential for the normal operation of such systems. Consequently, there is a pressing need for a highly efficient, applicable, and reliable method for bearing fault diagnosis. Currently, fault diagnosis methods driven by one-dimensional data represent a mainstream approach in this field. However, these methods exhibit weak diagnostic capabilities in noisy environments and when confronted with insufficient sample sizes. To address these limitations, a new fault diagnosis method for rolling bearings is proposed, which combines the ConvNeXt network and an improved DenseBlock into a parallel network with a feature fusion function. The network fully extracts and integrates the global and detail features of the signal, showing good diagnostic ability in strongly noisy environments. Additionally, the Dy-ReLU function is introduced into the network, which enhances its generalization ability and improves the convergence speed. Comparative experiments show that this method retains strong fault diagnosis capability under noise pollution and insufficient training samples.
... Unlike the PSNet paper [4], we opt for ResNet101 [29] for image encoding instead of the default vision Transformer, ViT-B/32 [30]. The selected image encoder is fine-tuned throughout training to extract maximum performance from the network. ...
... ResNet (Residual Neural Network) [45] is a widely used transfer learning backbone for Mask R-CNN. ResNet consists of a 7 × 7 convolution followed by a max pooling layer and a stack of residual stages whose convolution configurations depend on the number of layers in the ResNet variant. ...
Article
Full-text available
Arecaceae (palms) play a crucial role for native communities and wildlife in the Amazon region. This study presents a first-of-its-kind regional-scale spatial cataloging of palms using remotely sensed data for the country of Guyana. Using very high-resolution satellite images from the GeoEye-1 and WorldView-2 sensor platforms, which collectively cover an area of 985 km2, a total of 472,753 individual palm crowns are detected with F1 scores of 0.76 and 0.79, respectively, using a convolutional neural network (CNN) instance segmentation model. An example of CNN model transference between images is presented, emphasizing the limitation and practical application of this approach. A method is presented to optimize precision and recall using the confidence of the detection features; this results in a decrease of 45% and 31% in false positive detections, with a moderate increase in false negative detections. The sensitivity of the CNN model to the size of the training set is evaluated, showing that comparable metrics could be achieved with approximately 50% of the samples used in this study. Finally, the diameter of the palm crown is calculated based on the polygon identified by mask detection, resulting in an average of 7.83 m, a standard deviation of 1.05 m, and a range of {4.62, 13.90} m for the GeoEye-1 image. Similarly, for the WorldView-2 image, the average diameter is 8.08 m, with a standard deviation of 0.70 m and a range of {4.82, 15.80} m.
... To compensate for this shortcoming, we introduce a detail capture module to provide important fine-grained local detail information for SAM-VIT. The network is modified from the classic ResNet18 [28]: we removed the max pooling operation and the last residual stage (layer 4). Specifically, the detail capture module processes the bitemporal images T1 and T2 to obtain three multi-scale local features, downsampled by 1/2, 1/4, and 1/8, which are passed to the feature aggregator. Considering the limitations [29] of SAM for remote sensing images, complex background interference and objects with unclear profiles pose a significant challenge to the segmentation capability of SAM, and the performance of applying SAM directly to remote sensing image segmentation depends largely on the type, location, and quantity of prompts. ...
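A rough reconstruction of such a detail-capture branch is sketched below. It is our own approximation, assuming a standard torchvision ResNet-18 with its max pooling and fourth residual stage removed; with max pooling gone, the three remaining stages emit features at 1/2, 1/4, and 1/8 of the input resolution, as described above.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet18

    class DetailCaptureModule(nn.Module):
        """ResNet-18 stem plus stages 1-3, with max pooling and layer 4 removed."""
        def __init__(self):
            super().__init__()
            r = resnet18()
            self.stem = nn.Sequential(r.conv1, r.bn1, r.relu)  # stride 2, no max pool
            self.layer1, self.layer2, self.layer3 = r.layer1, r.layer2, r.layer3

        def forward(self, x):
            f1 = self.layer1(self.stem(x))  # 1/2 resolution
            f2 = self.layer2(f1)            # 1/4 resolution
            f3 = self.layer3(f2)            # 1/8 resolution
            return f1, f2, f3

    m = DetailCaptureModule()
    f1, f2, f3 = m(torch.randn(1, 3, 256, 256))
    print(f1.shape, f2.shape, f3.shape)  # (1, 64, 128, 128) (1, 128, 64, 64) (1, 256, 32, 32)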
Article
Full-text available
In recent years, change detection has been a hot research topic in remote sensing. Previous research has focused on binary change detection (BCD), limiting its practical applications. Therefore, semantic change detection (SCD), which can detect multiple change classes, is gradually becoming a more mainstream task. Most existing SCD methods use convolutional neural networks (CNNs) as the backbone to extract multi-scale features and use relatively simple decoder structures, leading to unsatisfactory detection accuracy. We propose a multi-task network for SCD. In the encoder, given the great success of the Segment Anything Model (SAM) and the vision transformer (ViT) in general-purpose segmentation, we introduce SAM-VIT into the backbone to enhance the encoder's ability to capture long-range contextual semantic relationships. We propose a transformer-based decoder structure for the semantic segmentation (SS) branch to extract local and global features effectively, and a convolutional attention-based change extractor for the BCD branch to enhance temporal information fusion. We also analyze in detail the semantic inconsistency that affects the performance of SCD. First, we introduce a contrastive loss to establish the correlation between the output features of the BCD branch and the segmentation branch. Second, we design a bi-temporal graph semantic interaction module to maintain semantic consistency between the output features of the two segmentation branches; the module assigns pixels with different land cover types to the corresponding graph nodes based on clustering techniques and then uses cross-attention to model the correlation between bitemporal semantic features in the graph space. Finally, a self-learning training scheme based on pseudo-labels further mitigates the problem of semantic inconsistency. SCDVit achieves state-of-the-art performance on two popular high-resolution datasets. Meanwhile, adequate quantitative and qualitative analyses highlight the potential of SAM-VIT for change detection and the effectiveness of the modules designed around semantic consistency.
... In the context of Alzheimer's disease detection, ResNet50 can capture subtle, deep patterns in MRI or CT scans that may be missed by shallower networks. By combining ResNet50 with other models, we leverage its ability to extract deep, discriminative patterns that are crucial for accurate diagnosis, giving our approach a more thorough feature extraction process [38]. ...
Article
Full-text available
Background: Alzheimer’s disease (AD) is a progressive neurological disorder that significantly affects middle-aged and elderly adults, leading to cognitive deterioration and hindering daily activities. Notwithstanding progress, conventional diagnostic techniques continue to be susceptible to inaccuracies and inefficiencies. Timely and precise diagnosis is essential for early intervention. Methods: We present an enhanced hybrid deep learning framework that amalgamates the EfficientNetV2B3 and Inception-ResNetV2 models. The models were integrated using an adaptive weight selection process informed by the Cuckoo Search optimization algorithm. The procedure commences with the pre-processing of neuroimaging data to guarantee quality and uniformity. Features are subsequently retrieved from the neuroimaging data by utilizing the EfficientNetV2B3 and Inception-ResNetV2 models. The Cuckoo Search algorithm allocates weights to various models dynamically, contingent upon their efficacy in particular diagnostic tasks. The framework achieves balanced usage of the distinct characteristics of both models through the iterative optimization of the weight configuration. This method improves classification accuracy, especially for early-stage Alzheimer’s disease. A thorough assessment was conducted on extensive neuroimaging datasets to verify the framework’s efficacy. Results: The framework attained a Scott’s Pi agreement score of 0.9907, indicating exceptional diagnostic accuracy and dependability, especially in identifying the early stages of Alzheimer’s disease. The results show its superiority over current state-of-the-art techniques. Conclusions: The results indicate the substantial potential of the proposed framework as a reliable and scalable instrument for the identification of Alzheimer’s disease. This method effectively mitigates the shortcomings of conventional diagnostic techniques and current deep learning algorithms by utilizing the complementary capabilities of EfficientNetV2B3 and Inception-ResNetV2 through an optimized weight selection mechanism. The adaptive characteristics of the Cuckoo Search optimization facilitate its application across many diagnostic circumstances, hence extending its utility to a wider array of neuroimaging datasets. The capacity to accurately identify early-stage Alzheimer’s disease is essential for facilitating prompt therapies, which are crucial for decelerating disease development and enhancing patient outcomes.
... In 2014, the University of Oxford and DeepMind proposed VGG, a deep convolutional network with a depth of 16-19 layers. ResNet [29] introduces residual blocks so that the depth of deep networks can be increased efficiently. The network depth reached an unprecedented level, and ResNet provided excellent classification performance. ...
Article
Full-text available
The representation and utilization of environmental information by service robots have become increasingly challenging. To address the problems that service robot platforms face, such as strict timeliness requirements for indoor environment recognition tasks and the small scale of indoor scene data, a method and model for rapid classification of household environment domain knowledge is proposed, which can achieve high recognition accuracy using a small-scale indoor scene and tool dataset. This paper uses a knowledge graph to associate data for home service robots. The application requirements of knowledge graphs for home service robots are analyzed to establish a rule base for the system. A domain ontology of the home environment is constructed for use in the knowledge graph system, and the interior functional areas and functional tools are classified. This designed knowledge graph contributes to the state of the art by improving the accuracy and efficiency of service decision making. The lightweight network MobileNetV3 is used to pre-train the model, and a lightweight convolution method with good feature extraction performance is selected. This proposal adopts a combination of MobileNetV3 and transfer learning, integrating large-scale pre-training with fine-tuning for the home environment to address the challenge of limited data for home robots. The results show that the proposed model achieves higher recognition accuracy and recognition speed than other common methods, meeting the work requirements of service robots. With the Scene15 dataset, the proposed scheme has the highest recognition accuracy of 0.8815 and the fastest recognition speed of 63.11 microseconds per image.
Article
Full-text available
As the expected lifespans of structures and road approaches, as well as the importance of road maintenance, increase globally, safety inspections have emerged as a crucial task. Nonetheless, existing crack detection models suffer from multi-scale feature loss and performance degradation when learning various types of cracks. We propose the Multi-Scale Parallel Attention U-Net (MSP U-Net), a network designed for low-resolution images that considers the irregular characteristics of cracks. MSP U-Net applies a large receptive field block to an attention U-Net, minimizing feature loss across multiple scales. Using the Crack500 dataset, our network achieved a mean intersection over union (mIoU) of 0.7752, outperforming the existing methods on low-resolution images.
Article
In this paper, we propose an efficient and high-performance method for partially relevant video retrieval. The method aims to retrieve long videos that contain at least one moment relevant to the input text query. The challenge lies in encoding dense frames using visual backbones. This requires models to handle the increased frames, resulting in significant computation costs for long videos. To mitigate the costs, previous studies use lightweight visual backbones, yielding sub-optimal retrieval performance due to their limited capabilities. However, it is undesirable to simply replace the backbones with high-performance large vision-and-language models (VLMs) due to their low efficiency. To address this dilemma, instead of dense frames, we focus on super images, which are created by rearranging the video frames in an N×N grid layout. This reduces the number of visual encodings to 1/N² and mitigates the low efficiency of large VLMs. Based on this idea, we make two contributions. First, we explore whether VLMs generalize to super images in a zero-shot setting. To this end, we propose a method called query-attentive super image retrieval (QASIR), which attends to partial moments relevant to the input query. The zero-shot QASIR yields two discoveries: (1) it enables VLMs to generalize to super images and (2) the grid size N, image resolution, and VLM size are key trade-off parameters between performance and computation costs. Second, we introduce fine-tuning and hybrid QASIR that combines high- and low-efficiency models to strike a balance between performance and computation costs. This reveals two findings: (1) the fine-tuning QASIR enhances VLMs to learn super images effectively, and (2) the hybrid QASIR minimizes the performance drop of large VLMs while reducing the computation costs.
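To make the super-image idea concrete, here is a small sketch (our own illustration, not code from the paper) that tiles N² sampled frames into an N×N grid so that a single visual encoding covers N² frames.

    import torch

    def make_super_image(frames: torch.Tensor, n: int) -> torch.Tensor:
        """Arrange n*n frames of shape (C, H, W) into one (C, n*H, n*W) super image."""
        assert frames.shape[0] == n * n, "expects exactly n*n frames (sample or pad first)"
        c, h, w = frames.shape[1:]
        grid = frames.reshape(n, n, c, h, w)   # rows x cols of frames
        grid = grid.permute(2, 0, 3, 1, 4)     # (C, n, H, n, W)
        return grid.reshape(c, n * h, n * w)

    # Example: 9 RGB frames tiled into a 3x3 grid become one image to encode,
    # cutting the number of visual encodings by a factor of n^2 = 9.
    super_img = make_super_image(torch.randn(9, 3, 224, 224), n=3)
    print(super_img.shape)  # torch.Size([3, 672, 672])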
Article
Full-text available
In the rapidly developing field of wireless communications, the precise classification of modulated signals is essential for optimizing spectrum utilization and improving communication quality. However, existing networks face challenges in robustness to phase-shift-keying signals and in computational efficiency. This paper introduces TCN-GRU, a lightweight model that combines the advantages of multiscale feature extraction of the temporal convolutional network (TCN) and global sequence modeling of the gated recurrent unit (GRU). Compared to the state-of-the-art MCLDNN, TCN-GRU reduces parameters by 37.6%, achieving an accuracy of 0.6156 and 0.6466 on RadioML2016.10a and RadioML2016.10b, respectively (versus MCLDNN’s 0.6101 and 0.6462). Furthermore, TCN-GRU demonstrates superior ability in distinguishing challenging modulations such as QAM16 and QAM64, and it improves classification accuracy by about 10.5% compared to MCLDNN. These results suggest that TCN-GRU is a robust and efficient solution for enhancing AMC in complex and noisy environments.
Article
Full-text available
While depth tends to improve network performances, it also makes gradient-based training more difficult since deeper networks tend to be more non-linear. The recently proposed knowledge distillation approach is aimed at obtaining small and fast-to-execute models, and it has shown that a student network could imitate the soft output of a larger teacher network or ensemble of networks. In this paper, we extend this idea to allow the training of a student that is deeper and thinner than the teacher, using not only the outputs but also the intermediate representations learned by the teacher as hints to improve the training process and final performance of the student. Because the student intermediate hidden layer will generally be smaller than the teacher's intermediate hidden layer, additional parameters are introduced to map the student hidden layer to the prediction of the teacher hidden layer. This allows one to train deeper students that can generalize better or run faster, a trade-off that is controlled by the chosen student capacity. For example, on CIFAR-10, a deep student network with almost 10.4 times less parameters outperforms a larger, state-of-the-art teacher network.
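The "hint" mechanism summarized in this abstract can be sketched as a small learned regressor plus an L2 loss between intermediate representations; the layer type and feature sizes below are illustrative assumptions, not the paper's exact setup.

    import torch
    import torch.nn as nn

    # The student's hidden layer is narrower than the teacher's, so a learned
    # regressor maps student features into the teacher's feature space before
    # they are compared (a FitNets-style "hint" loss).
    student_dim, teacher_dim = 64, 256
    regressor = nn.Conv2d(student_dim, teacher_dim, kernel_size=1)

    def hint_loss(student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
        return nn.functional.mse_loss(regressor(student_feat), teacher_feat)

    loss = hint_loss(torch.randn(8, 64, 16, 16), torch.randn(8, 256, 16, 16))
    loss.backward()  # gradients flow into the regressor (and, in practice, the student)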
Article
Full-text available
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Article
Full-text available
We propose a deep convolutional neural network architecture codenamed "Inception", which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC 2014). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC 2014 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
Article
Full-text available
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.
Article
Full-text available
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (approx 2 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.
Article
Full-text available
Preface to the Second Edition Twelve years have passed since the publication of the first edition of A Multigrid Tutorial. During those years, the field of multigrid and multilevel methods has expanded at a tremendous rate, reflecting progress in the development and analysis of algorithms and in the evolution of computing environments. Because of these changes, the first edition of the book has become increasingly outdated and the need for a new edition has become quite apparent. With the overwhelming growth in the subject, an area in which I have never done serious research, I felt remarkably unqualified to attempt a new edition. Realizing that I needed some help, I recruited two experts to assist with the project. Steve McCormick (Department of Applied Mathematics, University of Colorado at Boulder) is one of the original researchers in the field of multigrid methods and the real instigator of the first edition. There could be no better collaborator on the subject. Van Emden Henson (Center for Applied Scientific Computing, Lawrence Livermore National Laboratory) has specialized in applications of multigrid methods, with a particular emphasis on algebraic multigrid methods. Our collaboration on a previous SIAM monograph made him an obvious choice as a co-author. With the team in place, we began deliberating on the content of the new edition. It was agreed that the first edition should remain largely intact with little more than some necessary updating. Our aim was to add a roughly equal amount of new material that reflects important core developments in the field. A topic that probably should have been in the first edition comprises Chapter 6: FAS (Full Approximation Scheme), which is used for nonlinear problems. Chapter 7 is a collection of methods for four special situations that arise frequently in solving boundary value problems: Neumann boundary conditions, anisotropic problems, variable-mesh problems, and variable-coefficient problems. One of the chief motivations for writing a second edition was the recent surge of interest in algebraic multigrid methods, which is the subject of Chapter 8. In Chapter 9, we attempt to explain the complex subject of adaptive grid methods, as it appears in the FAC (Fast Adaptive Composite) Grid Method. Finally, in Chapter 10, we depart from the predominantly finite difference approach of the book and show how finite element formulations arise. This chapter provides a natural closing because it ties a knot in the thread of variational principles that runs through much of the book. There is no question that the new material in the second half of this edition is more advanced than that presented in the first edition. However, we have tried to create a safe passage between the two halves, to present many motivating examples, and to maintain a tutorial spirit in much of the discourse. While the first half of the book remains highly sequential, the order of topics in the second half is largely arbitrary. The FAC examples in Chapter 9 were developed by Bobby Philip and Dan Quinlan, of the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory, using AMR++ within the Overture framework. Overture is a parallel object-oriented framework for the solution of PDEs in complex and moving geometries. More information on Overture can be found at http://www.llnl.gov/casc/ Overture. We thank Irad Yavneh for a thorough reading of the book, for his technical insight, and for his suggestion that we enlarge Chapter 4. 
We are also grateful to John Ruge who gave Chapter 8 a careful reading in light of his considerable knowledge of AMG. Their suggestions led to many improvements in the book. Deborah Poulson, Lisa Briggeman, Donna Witzleben, Mary Rose Muccie, Kelly Thomas, Lois Sellers, and Vickie Kearn of the editorial staff at SIAM deserve thanks for coaxing us to write a second edition and for supporting the project from beginning to end. Finally, I am grateful for the willingness of my co-authors to collaborate on this book. They should be credited with improvements in the book and held responsible for none of its shortcomings. Bill Briggs November 15, 1999 Boulder, Colorado
Article
Full-text available
We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.
Article
Full-text available
We present an integrated framework for using Convolutional Networks for classification, localization and detection. We show how a multiscale and sliding window approach can be efficiently implemented within a ConvNet. We also introduce a novel deep learning approach to localization by learning to predict object boundaries. Bounding boxes are then accumulated rather than suppressed in order to increase detection confidence. We show that different tasks can be learnt simultaneously using a single shared network. This integrated framework is the winner of the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC2013), and produced near state of the art results for the detection and classifications tasks. Finally, we release a feature extractor from our best model called OverFeat.
Article
Full-text available
Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. We attempt to bridge the gap between the theory and practice of deep learning by systematically analyzing learning dynamics for the restricted case of deep linear neural networks. Despite the linearity of their input-output map, such networks have nonlinear gradient descent dynamics on weights that change with the addition of each new hidden layer. We show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions. We provide an analytical description of these phenomena by finding new exact solutions to the nonlinear dynamics of deep learning. Our theoretical analysis also reveals the surprising finding that as the depth of a network approaches infinity, learning speed remains finite: for a special class of initial conditions on the weights, very deep networks incur only a finite delay in learning speed relative to shallow networks. We further show that, under certain conditions on the training data, unsupervised pretraining can find this special class of initial conditions, thereby providing analytical insight into the success of unsupervised pretraining in deep supervised learning tasks.
Conference Paper
Full-text available
Recently, we proposed to transform the outputs of each hidden neuron in a multi-layer perceptron network to have zero output and zero slope on average, and use separate shortcut connections to model the linear dependencies instead. We continue the work by firstly introducing a third transformation to normalize the scale of the outputs of each hidden neuron, and secondly by analyzing the connections to second order optimization methods. We show that the transformations make a simple stochastic gradient behave closer to second-order optimization methods and thus speed up learning. This is shown both in theory and with experiments. The experiments on the third transformation show that while it further increases the speed of learning, it can also hurt performance by converging to a worse local optimum, where both the inputs and outputs of many hidden neurons are close to zero.
Article
Full-text available
When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This "overfitting" is greatly reduced by randomly omitting half of the feature detectors on each training case. This prevents complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors. Instead, each neuron learns to detect a feature that is generally helpful for producing the correct answer given the combinatorially large variety of internal contexts in which it must operate. Random "dropout" gives big improvements on many benchmark tasks and sets new records for speech and object recognition.
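The random omission described here is simple to sketch. At training time each unit is kept with some probability; in the common "inverted dropout" formulation (an implementation choice on our part, not necessarily the one used in the paper) the surviving activations are rescaled so that no change is needed at test time.

    import torch

    def dropout_train(x: torch.Tensor, p_drop: float = 0.5) -> torch.Tensor:
        """Inverted dropout: zero units with probability p_drop and rescale the rest."""
        keep = 1.0 - p_drop
        mask = (torch.rand_like(x) < keep).float()
        return x * mask / keep  # rescaling keeps the expected activation unchanged

    activations = torch.randn(4, 10)
    print(dropout_train(activations))  # roughly half of the entries are zeroed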
Article
Full-text available
The Pascal Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset has become accepted as the benchmark for object detection. This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three year history of the challenge, and proposes directions for future improvement and extension.
Article
Full-text available
Whereas before 2006 it appears that deep multilayer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures. All these experimental results were obtained with new initialization or training mechanisms. Our objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future. We first observe the influence of the non-linear activations functions. We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, and explaining the plateaus sometimes seen when training neural networks. We find that a new non-linearity that saturates less can often be beneficial. Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1. Based on these considerations, we propose a new initialization scheme that brings substantially faster convergence.
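The initialization scheme this abstract refers to is commonly known as "Xavier" or "Glorot" initialization: weights are drawn with a variance that depends on both the fan-in and fan-out of the layer. A uniform-distribution sketch:

    import math
    import torch

    def glorot_uniform(fan_in: int, fan_out: int) -> torch.Tensor:
        """W ~ U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
        a = math.sqrt(6.0 / (fan_in + fan_out))
        return torch.empty(fan_out, fan_in).uniform_(-a, a)

    W = glorot_uniform(784, 256)  # e.g. a 784 -> 256 fully connected layer
    print(W.std())  # close to sqrt(2 / (fan_in + fan_out)) by construction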
Conference Paper
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
Conference Paper
Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk. Second, we derive a robust initialization method that particularly considers the rectifier nonlinearities. This method enables us to train extremely deep rectified models directly from scratch and to investigate deeper or wider network architectures. Based on our PReLU networks (PReLU-nets), we achieve 4.94% top-5 test error on the ImageNet 2012 classification dataset. This is a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66%). To our knowledge, our result is the first to surpass human-level performance (5.1%, Russakovsky et al.) on this visual recognition challenge.
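The PReLU activation described in this abstract is f(y) = max(0, y) + a · min(0, y), with the slope a learned (here per channel). A minimal sketch that makes the formula explicit; PyTorch also ships an equivalent nn.PReLU module.

    import torch
    import torch.nn as nn

    class SimplePReLU(nn.Module):
        """f(y) = max(0, y) + a * min(0, y), with one learnable slope per channel."""
        def __init__(self, num_channels: int, init: float = 0.25):
            super().__init__()
            self.a = nn.Parameter(torch.full((1, num_channels, 1, 1), init))

        def forward(self, y):
            return torch.clamp(y, min=0) + self.a * torch.clamp(y, max=0)

    act = SimplePReLU(64)
    out = act(torch.randn(2, 64, 8, 8))  # positives pass through; negatives are scaled by a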
Conference Paper
Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. We attempt to bridge the gap between the theory and practice of deep learning by systematically analyzing learning dynamics for the restricted case of deep linear neural networks. Despite the linearity of their input-output map, such networks have nonlinear gradient descent dynamics on weights that change with the addition of each new hidden layer. We show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions. We provide an analytical description of these phenomena by finding new exact solutions to the nonlinear dynamics of deep learning. Our theoretical analysis also reveals the surprising finding that as the depth of a network approaches infinity, learning speed can nevertheless remain finite: for a special class of initial conditions on the weights, very deep networks incur only a finite, depth independent, delay in learning speed relative to shallow networks. We show that, under certain conditions on the training data, unsupervised pretraining can find this special class of initial conditions, while scaled random Gaussian initializations cannot. We further exhibit a new class of random orthogonal initial conditions on weights that, like unsupervised pre-training, enjoys depth independent learning times. We further show that these initial conditions also lead to faithful propagation of gradients even in deep nonlinear networks, as long as they operate in a special regime known as the edge of chaos.
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Article
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build "fully convolutional" networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet, the VGG net, and GoogLeNet) into fully convolutional networks and transfer their learned representations by fine-tuning to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes one third of a second for a typical image.
Chapter
It has long been known that neural networks can learn faster when their input and hidden unit activities are centered about zero; recently we have extended this approach to also encompass the centering of error signals [15]. Here we generalize this notion to all factors involved in the network's gradient, leading us to propose centering the slope of hidden unit activation functions as well. Slope centering removes the linear component of backpropagated error; this improves credit assignment in networks with shortcut connections. Benchmark results show that this can speed up learning significantly without adversely affecting the trained network's generalization ability.
Article
Theoretical and empirical evidence indicates that the depth of neural networks is crucial for their success. However, training becomes more difficult as depth increases, and training of very deep networks remains an open problem. Here we introduce a new architecture designed to overcome this. Our so-called highway networks allow unimpeded information flow across many layers on information highways. They are inspired by Long Short-Term Memory recurrent networks and use adaptive gating units to regulate the information flow. Even with hundreds of layers, highway networks can be trained directly through simple gradient descent. This enables the study of extremely deep and efficient architectures.
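As a concrete sketch of the gating idea (our illustration, using fully connected layers), a highway layer computes y = H(x) · T(x) + x · (1 − T(x)), where T is a learned transform gate that decides how much of the input is carried through unchanged.

    import torch
    import torch.nn as nn

    class HighwayLayer(nn.Module):
        """y = H(x) * T(x) + x * (1 - T(x)); when T is near 0 the layer simply carries x."""
        def __init__(self, dim: int):
            super().__init__()
            self.H = nn.Linear(dim, dim)  # candidate transformation
            self.T = nn.Linear(dim, dim)  # transform gate
            # A negative gate bias makes layers start close to the identity,
            # a commonly suggested choice for very deep stacks.
            nn.init.constant_(self.T.bias, -2.0)

        def forward(self, x):
            t = torch.sigmoid(self.T(x))
            return torch.relu(self.H(x)) * t + x * (1.0 - t)

    layer = HighwayLayer(128)
    out = layer(torch.randn(4, 128))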
Article
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
Article
This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, is 213x faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3x faster, tests 10x faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License at https://github.com/rbgirshick/fast-rcnn.
Article
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch}. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
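In code form, for each mini-batch the method normalizes every feature using the batch statistics and then applies a learned scale and shift. The sketch below shows the training-time computation only (no running statistics for inference), for a 2-D activation matrix.

    import torch

    def batch_norm_train(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor,
                         eps: float = 1e-5) -> torch.Tensor:
        """Normalize a (batch, features) tensor with mini-batch statistics."""
        mean = x.mean(dim=0, keepdim=True)
        var = x.var(dim=0, unbiased=False, keepdim=True)
        x_hat = (x - mean) / torch.sqrt(var + eps)
        return gamma * x_hat + beta  # learned scale and shift restore expressiveness

    x = torch.randn(32, 100)
    gamma, beta = torch.ones(100), torch.zeros(100)
    y = batch_norm_train(x, gamma, beta)  # per-feature zero mean, unit variance, then affine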
Article
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
Article
Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g. 224×224) input image. This requirement is “artificial” and may hurt the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with a more principled pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. By removing the fixed-size limitation, we can improve all CNN-based image classification methods in general. Our SPP-net achieves state-of-the-art accuracy on the datasets of ImageNet 2012, Pascal VOC 2007, and Caltech101. The power of SPP-net is more significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method computes convolutional features 30-170× faster than the recent leading method R-CNN (and 24-64× faster overall), while achieving better or comparable accuracy on Pascal VOC 2007.
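The fixed-length pooling described in this abstract can be sketched with adaptive pooling at several pyramid levels; whatever the spatial size of the input feature map, the concatenated output has the same dimensionality. The pyramid levels below (1, 2, 4) are a common choice and an assumption on our part.

    import torch
    import torch.nn.functional as F

    def spatial_pyramid_pool(feat: torch.Tensor, levels=(1, 2, 4)) -> torch.Tensor:
        """Pool a (B, C, H, W) map into a fixed-length (B, C * sum(l*l)) vector."""
        b, c = feat.shape[:2]
        pooled = [F.adaptive_max_pool2d(feat, l).reshape(b, -1) for l in levels]
        return torch.cat(pooled, dim=1)

    # The output length is independent of the input spatial size:
    print(spatial_pyramid_pool(torch.randn(1, 256, 13, 13)).shape)  # torch.Size([1, 5376])
    print(spatial_pyramid_pool(torch.randn(1, 256, 21, 30)).shape)  # torch.Size([1, 5376])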
Conference Paper
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in understanding an object's precise 2D location. Our dataset contains photos of 91 objects types that would be easily recognizable by a 4 year old along with per-instance segmentation masks. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
Book
Ripley brings together two crucial ideas in pattern recognition: statistical methods and machine learning via neural networks. He brings unifying principles to the fore, and reviews the state of the subject. Ripley also includes many examples to illustrate real problems in pattern recognition and how to overcome them.
Conference Paper
Large Convolutional Neural Network models have recently demonstrated impressive classification performance on the ImageNet benchmark (Krizhevsky et al.). However there is no clear understanding of why they perform so well, or how they might be improved. In this paper we address both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. We also perform an ablation study to discover the performance contribution from different model layers. This enables us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.
Conference Paper
Within the field of pattern classification, the Fisher kernel is a powerful framework which combines the strengths of generative and discriminative approaches. The idea is to characterize a signal with a gradient vector derived from a generative probability model and to subsequently feed this representation to a discriminative classifier. We propose to apply this framework to image categorization where the input signals are images and where the underlying generative model is a visual vocabulary: a Gaussian mixture model which approximates the distribution of low-level features in images. We show that Fisher kernels can actually be understood as an extension of the popular bag-of-visterms. Our approach demonstrates excellent performance on two challenging databases: an in-house database of 19 object/scene categories and the recently released VOC 2006 database. It is also very practical: it has low computational needs both at training and test time and vocabularies trained on one set of categories can be applied to another set without any significant loss in performance.
Conference Paper
VLFeat is an open and portable library of computer vision algorithms. It aims at facilitating fast prototyping and reproducible research for computer vision scientists and students. It includes rigorous implementations of common building blocks such as feature detectors, feature extractors, (hierarchical) k-means clustering, randomized kd-tree matching, and super-pixelization. The source code and interfaces are fully documented. The library integrates directly with MATLAB, a popular language for computer vision research.
Conference Paper
Restricted Boltzmann machines were developed using binary stochastic hidden units. These can be generalized by replacing each binary unit by an infinite number of copies that all have the same weights but have progressively more negative biases. The learning and inference rules for these “Stepped Sigmoid Units” are unchanged. They can be approximated efficiently by noisy, rectified linear units. Compared with binary units, these units learn features that are better for object recognition on the NORB dataset and face verification on the Labeled Faces in the Wild dataset. Unlike binary units, rectified linear units preserve information about relative intensities as information travels through multiple layers of feature detectors.
Article
This paper develops locally adapted hierarchical basis functions for effectively preconditioning large optimization problems that arise in computer graphics applications such as tone mapping, gradient-domain blending, colorization, and scattered data interpolation. By looking at the local structure of the coefficient matrix and performing a recursive set of variable eliminations, combined with a simplification of the resulting coarse level problems, we obtain bases better suited for problems with inhomogeneous (spatially varying) data, smoothness, and boundary constraints. Our approach removes the need to heuristically adjust the optimal number of preconditioning levels, significantly outperforms previously proposed approaches, and also maps cleanly onto data-parallel architectures such as modern GPUs.
Article
The ability of learning networks to generalize can be greatly enhanced by providing constraints from the task domain. This paper demonstrates how such constraints can be integrated into a backpropagation network through the architecture of the network. This approach has been successfully applied to the recognition of handwritten zip code digits provided by the U.S. Postal Service. A single network learns the entire recognition operation, going from the normalized image of the character to the final classification.