Article

Very Deep Convolutional Networks for Large-Scale Image Recognition


Abstract

In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.
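The "16-19 weight layers" figure counts only layers with learnable weights. A minimal sketch of that count follows, assuming the channel plans of the widely used 16- and 19-layer configurations (they are not stated in the abstract above); the names `VGG_CONV_PLANS` and `count_weight_layers` are illustrative.

```python
# Channel plans for the 16- and 19-weight-layer configurations.
# "M" marks a max-pooling layer, which has no learnable weights.
# These plans are an assumption; the abstract above does not list them.
VGG_CONV_PLANS = {
    "VGG16": [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"],
    "VGG19": [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
              512, 512, 512, 512, "M", 512, 512, 512, 512, "M"],
}
NUM_FC_LAYERS = 3  # three fully connected layers follow the conv stack


def count_weight_layers(name: str) -> int:
    """Count layers with learnable weights: conv layers plus FC layers."""
    conv_layers = sum(1 for entry in VGG_CONV_PLANS[name] if entry != "M")
    return conv_layers + NUM_FC_LAYERS


for name in VGG_CONV_PLANS:
    print(name, count_weight_layers(name))
```

Pooling layers are skipped in the count, which is why 13 or 16 convolutional layers plus 3 fully connected layers give depths of 16 and 19.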


... #Para denotes the number of learnable parameters; #Para and FLOPs are computed using the ptflops tool [67], and "-" means "not available" or "not to be reported". ... in comparison with several popular large models, i.e., GoogLeNet [24], InceptionV4 [55], and VGG16 [56]. ...
... ResNet18 [23] has the same performance, but it requires about 3 times as many learnable parameters, i.e., 11.36M versus our 3.78M (see Table 5). Table 5 also indicates the comparative performance of CGDF-Net against several popular large models, where VGG16 [56] obtained a slightly higher rate on Places365 (i.e., 55.24% versus 54.67%) but needs 135.76M learnable parameters. ...
Article
Addressing grouped dilation features (GDFs) improved the learning ability of MobileNetV1 in image representation. However, the computational complexity remains high, while the performance is modest. This cost is principally caused by the backbone of MobileNetV1 using deep feature maps in its last several layers. To mitigate these issues, we propose a light-weight network (called CGDF-Net) with an adaptive architecture to effectively extract grouped dilation features. CGDF-Net rests on two main contributions: (i) its backbone is improved by simply replacing the last several layers of MobileNetV1 with a pointwise convolutional layer to reduce the computational complexity; (ii) an attention mechanism is embedded into the GDF block to form a completed GDF perceptron (CGDF) that directs the learning process toward the significant properties of objects in images instead of the trivial ones. Experimental results on benchmark datasets for image recognition have validated that the proposed CGDF-Net obtains good performance at a small computational cost in comparison with MobileNets and other light-weight models. For instance, CGDF-Net obtained 60.86% with 3.53M learnable parameters on Stanford Dogs, up to 6% better than MobileNetV1-GDF (54.9%, 3.39M) and 9% better than MobileNetV1 (51.6%, 3.33M). Meanwhile, the performance of CGDF-Net on ImageNet-100 is 85.22%, about 6% to 8% higher than MobileNetV1-GDF's (79.14%) and MobileNetV1's (77.01%), respectively. The code of CGDF-Net is available at https://github.com/nttbdrk25/CGDFNet.
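The 135.76M figure quoted for VGG16 on Places365 in the excerpt above can be checked by hand. The sketch below is a back-of-the-envelope count, not the ptflops computation itself. It assumes the standard VGG16 layout (3x3 convolutions with biases, a 224x224 input reduced to 7x7 by five poolings, two 4096-unit fully connected layers, no batch normalization); the function name `vgg16_param_count` is illustrative.

```python
# Assumed standard VGG16 channel plan; "M" marks max pooling (no parameters).
VGG16_CONV_PLAN = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                   512, 512, 512, "M", 512, 512, 512, "M"]


def vgg16_param_count(num_classes: int, in_channels: int = 3) -> int:
    """Count learnable parameters (weights and biases) of a plain VGG16."""
    total = 0
    channels = in_channels
    for entry in VGG16_CONV_PLAN:
        if entry == "M":
            continue
        # 3x3 kernel weights plus one bias per output channel
        total += channels * entry * 3 * 3 + entry
        channels = entry
    # Assumption: 224x224 input, so the final feature map is 512 x 7 x 7.
    fc_sizes = [channels * 7 * 7, 4096, 4096, num_classes]
    for fan_in, fan_out in zip(fc_sizes, fc_sizes[1:]):
        total += fan_in * fan_out + fan_out
    return total


print(f"{vgg16_param_count(365) / 1e6:.2f}M")   # Places365: 365 classes
print(f"{vgg16_param_count(1000) / 1e6:.2f}M")  # ImageNet: 1000 classes
```

Under these assumptions the 365-class count comes to 135,755,949, which rounds to the 135.76M reported in the excerpt; most of it sits in the first fully connected layer.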
... The recently proposed class-conditional masked generative model, MaskBit [62], introduced several techniques to enhance VQGAN [22] training, two of which we incorporate in the training of our proposed TA-TiTok. First, MaskBit demonstrated that using ResNet50 [30] for perceptual loss [34] yields richer features than the VGG network [54] used in LPIPS [71], thereby improving tokenizer training. Second, we strengthen the PatchGAN [22] discriminator by replacing traditional average pooling with blur kernels [70] and adding LeCAM regularization [59] during training. ...
Preprint
Image tokenizers form the foundation of modern text-to-image generative models but are notoriously difficult to train. Furthermore, most existing text-to-image models rely on large-scale, high-quality private datasets, making them challenging to replicate. In this work, we introduce Text-Aware Transformer-based 1-Dimensional Tokenizer (TA-TiTok), an efficient and powerful image tokenizer that can utilize either discrete or continuous 1-dimensional tokens. TA-TiTok uniquely integrates textual information during the tokenizer decoding stage (i.e., de-tokenization), accelerating convergence and enhancing performance. TA-TiTok also benefits from a simplified, yet effective, one-stage training process, eliminating the need for the complex two-stage distillation used in previous 1-dimensional tokenizers. This design allows for seamless scalability to large datasets. Building on this, we introduce a family of text-to-image Masked Generative Models (MaskGen), trained exclusively on open data while achieving comparable performance to models trained on private data. We aim to release both the efficient, strong TA-TiTok tokenizers and the open-data, open-weight MaskGen models to promote broader access and democratize the field of text-to-image masked generative models.
... For the classification Knowledge Distillation (KD) method, we utilize different networks for the teacher and student networks. The teacher network primarily consists of ResNet [42] architectures (such as ResNet32×4 and ResNet56), WideResNet [43] (WRN) structures (e.g., WRN-40-2), and VGG [44] architectures (like VGG13). In contrast, the student network is designed to be a compressed version corresponding to the teacher network, which comprises models like resnet20, resnet8×4, WRN-16-2, WRN-40-1, VGG8, and the lightweight MobileNetV2 [45], ShuffleNetV1 [46], and ShuffleNetV2 [47]. ...
Preprint
Knowledge distillation has been widely adopted in computer vision tasks, since it can effectively enhance the performance of lightweight student networks by leveraging the knowledge transferred from cumbersome teacher networks. Most existing knowledge distillation methods utilize Kullback-Leibler divergence to mimic the logit output probabilities between the teacher network and the student network. Nonetheless, these methods may neglect the negative parts of the teacher's "dark knowledge" because the divergence calculations may ignore the effect of the minute probabilities from the teacher's logit output. This deficiency may lead to suboptimal performance in logit mimicry during the distillation process and result in an imbalance of information acquired by the student network. In this paper, we investigate the impact of this imbalance and propose a novel method, named Balance Divergence Distillation. By introducing a compensatory operation using reverse Kullback-Leibler divergence, our method can improve the modeling of the extremely small values in the negative parts from the teacher and preserve the learning capacity for the positive parts. Furthermore, we test the impact of different temperature coefficient adjustments, which may further balance the knowledge transfer. We evaluate the proposed method on several computer vision tasks, including image classification and semantic segmentation. The evaluation results show that our method achieves an accuracy improvement of 1%~3% for lightweight students on both the CIFAR-100 and ImageNet datasets, and a 4.55% improvement in mIoU for PSP-ResNet18 on the Cityscapes dataset. The experiments show that our method is a simple yet highly effective solution that can be smoothly applied to different knowledge distillation methods.
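The idea of compensating forward Kullback-Leibler divergence with its reverse can be sketched as follows. This is only an illustration, not the paper's loss: the plain weighted sum, the `reverse_weight` parameter, and the temperature-squared scaling (a common distillation convention) are all assumptions, as are the function names.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def balanced_kd_loss(teacher_logits, student_logits,
                     temperature=4.0, reverse_weight=1.0):
    """Forward KL plus a weighted reverse KL (hypothetical combination)."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    forward = kl_divergence(t, s)  # dominated by classes the teacher rates high
    reverse = kl_divergence(s, t)  # penalizes student mass where teacher is tiny
    return (temperature ** 2) * (forward + reverse_weight * reverse)


print(balanced_kd_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0]))
```

Setting `reverse_weight=0.0` recovers ordinary forward-KL distillation, which makes the effect of the reverse term easy to compare.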
... Figure 1 shows the structure of the SSD network. The backbone network is modified on the basis of VGG16 [10] by replacing the last two fully-connected layers FC6 and FC7 with convolutional layers Conv6 and Conv7, and then adding four groups of convolutional layers: Conv8, Conv9, Conv10, and Conv11. Then, the feature maps of Conv4_3 and Conv7 are combined with those of Conv8_2, Conv9_2, Conv10_2, and Conv11_2 to form a multi-scale feature extraction network. ...
Article
To address the time-consuming and inefficient nature of manual defect detection, this paper proposes a bonding defect detection algorithm based on an improved Single Shot MultiBox Detector (SSD). DenseNet is used to replace the VGG backbone of the SSD algorithm to improve the detection of bonding defects. A novel feature fusion network is designed, in which dilated convolution is used to reduce the size of the low-level feature map, which is then fused with the high-level feature map, and the Convolutional Block Attention Module (CBAM) attention mechanism is used to increase the ability to extract features. Focal loss is used to control the ratio of positive and negative samples for training and to suppress easily separable samples, so that the samples involved in training have a better distribution and the model has better detection performance. Then, the defect data set is constructed and a comparison experiment is carried out. The results show that the mAP, Precision, and Recall of the improved SSD network are increased to 75.9%, 77.3%, and 75.6%, respectively, so the network can better identify bonding defects.
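The role focal loss plays here, balancing positive against negative samples and down-weighting easy ones, can be illustrated with its standard binary form. This is a minimal sketch, not the paper's implementation; the defaults `alpha=0.25` and `gamma=2.0` are the values commonly used with focal loss and are not taken from the abstract above.

```python
import math


def binary_focal_loss(prob, label, alpha=0.25, gamma=2.0, eps=1e-12):
    """Focal loss for one sample.

    prob  : predicted probability of the positive class
    label : 1 for positive, 0 for negative
    alpha : weight that balances positives against negatives
    gamma : focusing parameter; larger values suppress easy samples more
    """
    p_t = prob if label == 1 else 1.0 - prob
    alpha_t = alpha if label == 1 else 1.0 - alpha
    # (1 - p_t)^gamma shrinks toward zero for well-classified samples
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))


easy = binary_focal_loss(0.9, 1)  # confident, correct prediction
hard = binary_focal_loss(0.1, 1)  # confident, wrong prediction
print(easy, hard)
```

With `gamma=0` the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy, so the two parameters can be tuned independently.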
... In medical image analysis, CNN is a popular deep learning architecture that is used for image classification [8], segmentation [9], detection [10], and other tasks. Starting with AlexNet [11], various end-to-end models have been developed with deeper and deeper networks and greater representation compactness for image classification, such as VGG [12], ResNet [13], and DenseNet [14]. Deep learning is being utilized extensively in medical image processing because of the exceptional outcomes these models have delivered. ...
Article
Convolutional neural networks have been frequently utilized in computer-aided diagnosis (CAD). Breast cancer image classification is one of the vital applications of CAD. The purpose of this study was to explore the role of attention mechanism in breast cancer image classification. Specifically, this study introduces attention mechanism into the classical image classification deep learning network and constructs a new breast cancer image classification model. Test results indicate that convolutional neural networks perform better in classification when an attention mechanism is added, and they also perform better in terms of training loss and accuracy.
... To optimize resource utilization, only images that are significantly different are annotated (segmentation, landmarks) for each heart. This is based on a cosine similarity score of a pretrained VGG16 embedding vector [4,17]. This leads to a significantly varying count of images for each heart, as illustrated in Table 1. ...
Conference Paper
A conditional Denoising Diffusion Probabilistic Model (cDDPM) is trained to generate realistic images of aortic valves that could be used as an advanced data augmentation technique. RGB images of porcine aortic valves and three conditional masks (cusp segmentation, visible and occluded landmarks) serve as training data for the model. The dataset comprises seven porcine hearts and contains 414 training images and 37 test images. Given a Gaussian noise image as input, the model is able to generate RGB images of aortic valves that align with the given conditional masks. To enhance realism, the RePaint algorithm is integrated into the image synthesis process, enabling the generation of primary image components alongside small segments from the original image. The synthetic images are evaluated against the test dataset using three common metrics: the Multi-Scale Structural Similarity Index (MS-SSIM) reaches values of up to 0.29, the Kernel Inception Distance (KID) values up to 0.041 and the Fréchet Inception Distance (FID) values up to 113.5. The generated images closely align with the specified conditional segmentation masks. From a human observer's perspective, the alignment of the conditional landmarks appears to be less precise compared to the segmentation mask. Conditional masks allow explicit specification of cusp number and geometry, enabling generation of rare aortic valves, including unicuspid and bicuspid types, even though these are not in the training data. cDDPM has demonstrated the capacity to generate realistic aortic valve images. However, the limited data availability and model restrictions resulting from the design process represent limitations and constraints to the model.
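The excerpt for this entry mentions selecting significantly different images by a cosine similarity score on VGG16 embedding vectors. One way such a filter might look is sketched below. The greedy keep-or-skip rule and the `threshold` value are assumptions, not details from the paper, and the embeddings are taken as already computed, non-zero vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_distinct(embeddings, threshold=0.9):
    """Greedily keep indices whose embedding is not too similar to any kept one.

    threshold is a hypothetical cut-off: an image is skipped when its
    similarity to an already kept image reaches or exceeds it.
    """
    kept = []
    for idx, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[k]) < threshold for k in kept):
            kept.append(idx)
    return kept


# Toy 2-D "embeddings"; real VGG16 features would have far more dimensions.
toy = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
print(select_distinct(toy))
```

A filter of this kind keeps a different number of images per heart depending on how varied the footage is, which is consistent with the varying image counts the excerpt mentions.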
... To evaluate our methodology on the aforementioned databases, we use the same deep learning architecture on the three data sets. We use the VGG architecture [44], which consists of stacking double convolutional layers (two stacked layers of 2-dimensional convolutions with max-pooling) to process the image before ... Figure 6a shows that the SEL covariate is considered, by far, to be the most important covariate in the model. The order of the other covariates remains the same when we train the model without our SEL variable (i.e. ...
Preprint
Feature engineering is of critical importance in the field of Data Science. While any data scientist knows the importance of rigorously preparing data to obtain good performing models, only scarce literature formalizes its benefits. In this work, we present the method of Statistically Enhanced Learning (SEL), a formalization framework of existing feature engineering and extraction tasks in Machine Learning (ML). Contrary to existing approaches, predictors are not directly observed but obtained as statistical estimators. Our goal is to study SEL, aiming to establish a formalized framework and illustrate its improved performance by means of simulations as well as applications on practical use cases.
... It classifies object proposals and refines their spatial locations directly from shared feature maps, significantly improving training and testing speed. Fast R-CNN trains the VGG16 network [23] nine times faster and tests 213 ...
Article
This review paper presents an in-depth analysis of deep learning (DL) models applied to traffic scene understanding, a key aspect of modern intelligent transportation systems. It examines fundamental techniques such as classification, object detection, and segmentation, and extends to more advanced applications like action recognition, object tracking, path prediction, scene generation and retrieval, anomaly detection, Image-to-Image Translation (I2IT), and person re-identification (Person Re-ID). The paper synthesizes insights from a broad range of studies, tracing the evolution from traditional image processing methods to sophisticated DL techniques, such as Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs). The review also explores three primary categories of domain adaptation (DA) methods: clustering-based, discrepancy-based, and adversarial-based, highlighting their significance in traffic scene understanding. The significance of Hyperparameter Optimization (HPO) is also discussed, emphasizing its critical role in enhancing model performance and efficiency, particularly in adapting DL models for practical, real-world use. Special focus is given to the integration of these models in real-world applications, including autonomous driving, traffic management, and pedestrian safety. The review also addresses key challenges in traffic scene understanding, such as occlusions, the dynamic nature of urban traffic, and environmental complexities like varying weather and lighting conditions. By critically analyzing current technologies, the paper identifies limitations in existing research and proposes areas for future exploration. It underscores the need for improved interpretability, real-time processing, and the integration of multi-modal data. This review serves as a valuable resource for researchers and practitioners aiming to apply or advance DL techniques in traffic scene understanding.
Article
Deep convolutional neural networks with high performance are hard to deploy in many real-world applications, since the computing resources of edge devices such as smartphones or embedded GPUs are limited. To alleviate this hardware limitation, the compression of deep neural networks from the model side becomes important. As one of the most popular methods, channel pruning of the deep convolutional model can effectively remove redundant convolutional channels from the CNN (convolutional neural network) without markedly affecting the network's performance. Existing methods focus on pruning design, evaluating the importance of different convolutional filters in the CNN model. A fast and effective fine-tuning method to restore accuracy is urgently needed. In this paper, we propose a fine-tuning method KDFT (Knowledge Distillation Based Fine-Tuning), which improves the accuracy of fine-tuned models with almost negligible training overhead by introducing knowledge distillation. Extensive experimental results on benchmark datasets with representative CNN models show that up to 4.86% accuracy improvement and 79% time saving can be obtained.
Article
Recently, scientists have widely utilized Artificial Intelligence (AI) approaches in intelligent agriculture to increase the productivity of the agriculture sector and overcome a wide range of problems. Detection and classification of plant diseases is a challenging problem due to the vast numbers of plants worldwide and the numerous diseases that negatively affect the production of different crops. Early detection and accurate classification of plant diseases is the goal of any AI-based system. This paper proposes a hybrid framework to significantly improve classification accuracy for plant leaf diseases. The proposed model leverages the strengths of Convolutional Neural Networks (CNNs) and Vision Transformers (ViT), where an ensemble model, which consists of the well-known CNN architectures VGG16, Inception-V3, and DenseNet20, is used to extract robust global features. Then, a ViT model is used to extract local features to detect plant diseases precisely. The performance of the proposed model is evaluated using two publicly available datasets (Apple and Corn). Each dataset consists of four classes. The proposed hybrid model successfully detects and classifies multi-class plant leaf diseases and outperforms similar recently published methods, achieving accuracy rates of 99.24% and 98% for the apple and corn datasets, respectively.
Article
Blink detection is an active research direction in the field of computer vision, which plays a key role in various application scenarios such as human-computer interaction, fatigue detection and emotion perception. In recent years, with the rapid development of deep learning, the application of deep learning techniques for precise blink detection has emerged as a significant area of interest among researchers. Compared with traditional methods, the blink detection method based on deep learning offers superior feature learning ability and higher detection accuracy. However, the current research on blink detection based on deep learning lacks systematic summarization and comparison. Therefore, the aim of this article is to comprehensively review the research progress in deep learning-based blink detection methods and help researchers gain a clear understanding of the various approaches in this field. This article analyzes the progress made by several classical deep learning models in practical applications of eye blink detection while highlighting their respective strengths and weaknesses. Furthermore, it provides a comprehensive summary of commonly used datasets and evaluation metrics for blink detection. Finally, it discusses the challenges and future directions of deep learning for blink detection applications. Our analysis reveals that deep learning-based blink detection methods demonstrate strong detection performance. However, they encounter several challenges, including training data imbalance, complex environment interference, real-time processing issues and application device limitations. By overcoming the challenges identified in this study, the application prospects of deep learning-based blink detection algorithms will be significantly enhanced.
Article
INTRODUCTION: Diagnostic performance of optical coherence tomography (OCT) to detect Alzheimer's disease (AD) and mild cognitive impairment (MCI) remains limited. We aimed to develop a deep-learning algorithm using OCT to detect AD and MCI. METHODS: We performed a cross-sectional study involving 228 Asian participants (173 cases/55 controls) for model development and testing on 68 Asian (52 cases/16 controls) and 85 White (39 cases/46 controls) participants. Features from OCT were used to develop an ensemble trilateral deep-learning model. RESULTS: The trilateral model significantly outperformed single non-deep learning models in Asian (area under the curve [AUC] = 0.91 vs. 0.71–0.72, p = 0.022–0.032) and White (AUC = 0.84 vs. 0.58–0.75, p = 0.056 to < 0.001) populations. However, its performance was comparable to that of the trilateral statistical model (AUCs similar, p > 0.05). DISCUSSION: Both multimodal approaches, using deep learning or traditional statistical models, show promise for AD and MCI detection. The choice between these models may depend on computational resources, interpretability preferences, and clinical needs. HIGHLIGHTS: A deep-learning algorithm was developed to detect Alzheimer's disease (AD) and mild cognitive impairment (MCI) using OCT images. The combined model outperformed single OCT parameters in both Asian and White cohorts. The study demonstrates the potential of OCT-based deep-learning algorithms for AD and MCI detection.
Article
Efficient waste management is crucial for sustainable urban living. However, challenges such as improper segregation, low recycling rates, and reliance on manual systems hinder progress toward environmental goals. This paper introduces EcoSort AI, an AI-driven waste management solution that combines computer vision, IoT technologies, and machine learning to automate and optimize waste segregation. Leveraging convolutional neural networks (CNNs), the system identifies, classifies, and sorts waste materials into appropriate categories, ensuring improved recycling rates and reduced landfill burden. EcoSort AI features IoT-enabled smart bins for real-time classification and integrates seamlessly into existing urban infrastructures. Experimental results demonstrate significant improvements in sorting accuracy, efficiency, and public engagement. Index Terms- Waste segregation, artificial intelligence, IoT-enabled bins, CNNs, smart cities, recycling optimization, sustainable development.
Article
We introduce Label-Combination Prototypical Networks (LC-Protonets) to address the problem of multi-label few-shot classification, where a model must generalize to new classes based on only a few available examples. Extending Prototypical Networks, LC-Protonets generate one prototype per label combination, derived from the power set of labels present in the limited training items, rather than one prototype per label. Our method is applied to automatic audio tagging across diverse music datasets, covering various cultures and including both modern and traditional music, and is evaluated against existing approaches in the literature. The results demonstrate a significant performance improvement in almost all domains and training setups when using LC-Protonets for multi-label classification. In addition to training a few-shot learning model from scratch, we explore the use of a pre-trained model, obtained via supervised learning, to embed items in the feature space. Fine-tuning improves the generalization ability of all methods, yet LC-Protonets achieve high-level performance even without fine-tuning, in contrast to the comparative approaches. We finally analyze the scalability of the proposed method, providing detailed quantitative metrics from our experiments. The implementation and experimental setup are made publicly available, offering a benchmark for future research.
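The step of generating one prototype per label combination can be sketched as below. This is one plausible reading of the abstract, not the authors' implementation: each support item is assumed to contribute to every non-empty subset of its own label set, and a prototype is taken as the mean embedding of its contributors. The function names and toy data are hypothetical.

```python
from itertools import combinations


def label_combinations(labels):
    """All non-empty subsets of one item's label set, as frozensets."""
    ordered = sorted(labels)
    return [frozenset(combo)
            for size in range(1, len(ordered) + 1)
            for combo in combinations(ordered, size)]


def build_prototypes(support_items):
    """Map each label combination to the mean embedding of its support items.

    support_items: list of (embedding, set_of_labels) pairs, where every
    embedding is a list of floats of the same length.
    """
    members = {}
    for embedding, labels in support_items:
        for combo in label_combinations(labels):
            members.setdefault(combo, []).append(embedding)
    prototypes = {}
    for combo, embeddings in members.items():
        dim = len(embeddings[0])
        prototypes[combo] = [sum(e[i] for e in embeddings) / len(embeddings)
                             for i in range(dim)]
    return prototypes


# Toy support set: two items with 2-D embeddings and hypothetical tags.
support = [([0.0, 0.0], {"drums", "guitar"}), ([2.0, 2.0], {"drums"})]
for combo, proto in build_prototypes(support).items():
    print(sorted(combo), proto)
```

Only combinations that actually occur in the support items receive a prototype, so the number of prototypes is bounded by the items at hand, not by the full power set of all labels.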
Article
Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS. Downpour SGD and Sandblaster L-BFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 30x larger than previously reported in the literature, and achieved state-of-the-art performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly-sized deep network for a commercial speech recognition service. Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradient-based machine learning algorithm.
Article
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.
Article
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (approx 2 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.
Article
We present an integrated framework for using Convolutional Networks for classification, localization and detection. We show how a multiscale and sliding window approach can be efficiently implemented within a ConvNet. We also introduce a novel deep learning approach to localization by learning to predict object boundaries. Bounding boxes are then accumulated rather than suppressed in order to increase detection confidence. We show that different tasks can be learnt simultaneously using a single shared network. This integrated framework is the winner of the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC2013), and produced near state-of-the-art results for the detection and classification tasks. Finally, we release a feature extractor from our best model called OverFeat.
Article
Recognizing arbitrary multi-character text in unconstrained natural photographs is a hard problem. In this paper, we address an equally hard sub-problem in this domain viz. recognizing arbitrary multi-digit numbers from Street View imagery. Traditional approaches to solve this problem typically separate out the localization, segmentation, and recognition steps. In this paper we propose a unified approach that integrates these three steps via the use of a deep convolutional neural-network that operates directly off of the image pixels. This model is configured with 11 hidden layers all with feedforward connections. We employ the DistBelief implementation of deep neural networks to scale our computations over this network. We have evaluated this approach on the publicly available SVHN dataset and achieve over 96% accuracy in recognizing street numbers. We show that on a per-digit recognition task, we improve upon the state-of-the-art and achieve 97.84% accuracy. We also evaluated this approach on an even more challenging dataset generated from Street View imagery containing several 10s of millions of street number annotations and achieve over 90% accuracy. Our evaluations further indicate that at specific operating thresholds, the performance of the proposed system is comparable to that of human operators and has to date helped us extract close to 100 million street numbers from Street View imagery worldwide.
Article
We investigate multiple techniques to improve upon the current state-of-the-art deep convolutional neural network based image classification pipeline. The techniques include adding more image transformations to training data, adding more transformations to generate additional predictions at test time and using complementary models applied to higher resolution images. This paper summarizes our entry in the ImageNet Large Scale Visual Recognition Challenge 2013. Our system achieved a top 5 classification error rate of 13.55% using no external data, which is over a 20% relative improvement on the previous year's winner.
Conference Paper
The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called "ImageNet", a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
Conference Paper
The Fisher kernel (FK) is a generic framework which combines the benefits of generative and discriminative approaches. In the context of image classification the FK was shown to extend the popular bag-of-visual-words (BOV) by going beyond count statistics. However, in practice, this enriched representation has not yet shown its superiority over the BOV. In the first part we show that with several well-motivated modifications over the original framework we can boost the accuracy of the FK. On PASCAL VOC 2007 we increase the Average Precision (AP) from 47.9% to 58.3%. Similarly, we demonstrate state-of-the-art accuracy on CalTech 256. A major advantage is that these results are obtained using only SIFT descriptors and costless linear classifiers. Equipped with this representation, we can now explore image classification on a larger scale. In the second part, as an application, we compare two abundant resources of labeled images to learn classifiers: ImageNet and Flickr groups. In an evaluation involving hundreds of thousands of training images we show that classifiers learned on Flickr groups perform surprisingly well (although they were not intended for this purpose) and that they can complement classifiers learned on more carefully annotated datasets.
Article
The ability of learning networks to generalize can be greatly enhanced by providing constraints from the task domain. This paper demonstrates how such constraints can be integrated into a backpropagation network through the architecture of the network. This approach has been successfully applied to the recognition of handwritten zip code digits provided by the U.S. Postal Service. A single network learns the entire recognition operation, going from the normalized image of the character to the final classification.
Conference Paper
Traditional methods of computer vision and machine learning cannot match human performance on tasks such as the recognition of handwritten digits or traffic signs. Our biologically plausible, wide and deep artificial neural network architectures can. Small (often minimal) receptive fields of convolutional winner-take-all neurons yield large network depth, resulting in roughly as many sparsely connected neural layers as found in mammals between retina and visual cortex. Only winner neurons are trained. Several deep neural columns become experts on inputs preprocessed in different ways; their predictions are averaged. Graphics cards allow for fast training. On the very competitive MNIST handwriting benchmark, our method is the first to achieve near-human performance. On a traffic sign recognition benchmark it outperforms humans by a factor of two. We also improve the state-of-the-art on a plethora of common image classification benchmarks.
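The averaging of predictions from several columns mentioned in this abstract can be illustrated with a minimal sketch. It assumes each column outputs a vector of class probabilities for the same input; the function names and the toy probabilities are hypothetical, and this is not the authors' implementation.

```python
def average_predictions(column_outputs):
    """Element-wise mean of class-probability vectors from several models."""
    n_models = len(column_outputs)
    n_classes = len(column_outputs[0])
    return [
        sum(probs[c] for probs in column_outputs) / n_models
        for c in range(n_classes)
    ]


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical outputs of three columns for one image with three classes.
columns = [
    [0.6, 0.3, 0.1],  # this column alone would predict class 0
    [0.2, 0.7, 0.1],
    [0.3, 0.6, 0.1],
]
averaged = average_predictions(columns)
predicted_class = argmax(averaged)
print(averaged, predicted_class)
```

In this toy case one column disagrees with the other two, and the averaged vector follows the majority while still summing to one.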
Conference Paper
Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. We attempt to bridge the gap between the theory and practice of deep learning by systematically analyzing learning dynamics for the restricted case of deep linear neural networks. Despite the linearity of their input-output map, such networks have nonlinear gradient descent dynamics on weights that change with the addition of each new hidden layer. We show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions. We provide an analytical description of these phenomena by finding new exact solutions to the nonlinear dynamics of deep learning. Our theoretical analysis also reveals the surprising finding that as the depth of a network approaches infinity, learning speed can nevertheless remain finite: for a special class of initial conditions on the weights, very deep networks incur only a finite, depth independent, delay in learning speed relative to shallow networks. We show that, under certain conditions on the training data, unsupervised pretraining can find this special class of initial conditions, while scaled random Gaussian initializations cannot. We further exhibit a new class of random orthogonal initial conditions on weights that, like unsupervised pre-training, enjoys depth independent learning times. We further show that these initial conditions also lead to faithful propagation of gradients even in deep nonlinear networks, as long as they operate in a special regime known as the edge of chaos.
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
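The top-1 and top-5 error rates quoted in this abstract count an example as correct when its true label is among the k highest-scoring classes. The following is a minimal sketch of that metric in plain Python; the function name `top_k_error` and the toy scores are illustrative assumptions, not taken from the paper.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is not among the k top-scoring classes."""
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score, truncated to k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(labels)


# Hypothetical scores for 3 examples over 4 classes.
scores = [
    [0.10, 0.60, 0.20, 0.10],
    [0.50, 0.10, 0.30, 0.10],
    [0.30, 0.15, 0.05, 0.50],
]
labels = [1, 2, 0]

top1 = top_k_error(scores, labels, k=1)
top2 = top_k_error(scores, labels, k=2)
print(top1, top2)
```

Only the first example is right under top-1, while all three true labels fall within the two best guesses, which is why a top-k error with larger k is never higher than the top-1 error on the same predictions.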
Article
Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g. 224×224) input image. This requirement is “artificial” and may hurt the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with a more principled pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. By removing the fixed-size limitation, we can improve all CNN-based image classification methods in general. Our SPP-net achieves state-of-the-art accuracy on the datasets of ImageNet 2012, Pascal VOC 2007, and Caltech101. The power of SPP-net is more significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method computes convolutional features 30-170× faster than the recent leading method R-CNN (and 24-64× faster overall), while achieving better or comparable accuracy on Pascal VOC 2007.
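The idea of producing a fixed-length representation from a feature map of arbitrary size can be sketched as follows. This is a simplified illustration in plain Python for a single-channel map; the bin boundaries (floor and ceiling of equal divisions), the pyramid levels and the function name are assumptions made for this example and may differ from the SPP-net implementation.

```python
import math


def spatial_pyramid_pool(feature_map, levels=(1, 2)):
    """Max-pool a 2-D feature map into a vector of fixed length.

    For each pyramid level n the map is split into an n x n grid of bins and
    the maximum of each bin is kept, so the output always has
    sum(n * n for n in levels) entries, whatever the input size.
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Floor/ceil edges guarantee every bin holds at least one cell.
            row_start = math.floor(i * height / n)
            row_end = math.ceil((i + 1) * height / n)
            for j in range(n):
                col_start = math.floor(j * width / n)
                col_end = math.ceil((j + 1) * width / n)
                pooled.append(max(
                    feature_map[r][c]
                    for r in range(row_start, row_end)
                    for c in range(col_start, col_end)
                ))
    return pooled


small = [[1, 2],
         [3, 4]]
large = [[0, 1, 2, 3, 4],
         [5, 6, 7, 8, 9],
         [10, 11, 12, 13, 14]]

small_vec = spatial_pyramid_pool(small)
large_vec = spatial_pyramid_pool(large)
print(small_vec)  # [4, 1, 2, 3, 4]
print(large_vec)  # [14, 7, 9, 12, 14]
```

Both inputs yield five values (one bin at level 1 plus four bins at level 2) even though their sizes differ, which is the property that lets a fully-connected layer follow the pooling step.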
Article
We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to incorporate into the network design aspects of the best performing hand-crafted features. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multi-task learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it matches the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.
Article
I present a new way to parallelize the training of convolutional neural networks across multiple GPUs. The method scales significantly better than all alternatives when applied to modern convolutional neural networks.
Article
This paper addresses the visualisation of image classification models, learnt using deep Convolutional Networks (ConvNets). We consider two visualisation techniques, based on computing the gradient of the class score with respect to the input image. The first one generates an image, which maximises the class score [Erhan et al., 2009], thus visualising the notion of the class, captured by a ConvNet. The second technique computes a class saliency map, specific to a given image and class. We show that such maps can be employed for weakly supervised object segmentation using classification ConvNets. Finally, we establish the connection between the gradient-based ConvNet visualisation methods and deconvolutional networks [Zeiler et al., 2013].
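A class saliency map of the kind described here ranks pixels by how strongly the class score responds to them. The sketch below approximates the gradient of a score function with central finite differences on a single-channel image; the paper computes the gradient through the ConvNet, so the finite-difference scheme, the linear stand-in score and all names here are assumptions made purely for illustration.

```python
def saliency_map(score_fn, image, eps=1e-4):
    """Approximate |d score / d pixel| for each pixel by central differences."""
    saliency = []
    for r, row in enumerate(image):
        saliency_row = []
        for c in range(len(row)):
            # Perturb one pixel up and down on copies of the image.
            plus = [list(x) for x in image]
            minus = [list(x) for x in image]
            plus[r][c] += eps
            minus[r][c] -= eps
            grad = (score_fn(plus) - score_fn(minus)) / (2 * eps)
            saliency_row.append(abs(grad))
        saliency.append(saliency_row)
    return saliency


# Hypothetical class score: a linear model, standing in for a ConvNet.
weights = [[0.0, 2.0],
           [-1.0, 0.5]]


def linear_score(image):
    return sum(
        w * p
        for w_row, p_row in zip(weights, image)
        for w, p in zip(w_row, p_row)
    )


image = [[0.2, 0.7],
         [0.5, 0.1]]
saliency = saliency_map(linear_score, image)
print(saliency)
```

For a linear score the gradient equals the weights, so the map reduces to their absolute values; with a real network the gradient would instead depend on the particular input image.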