Article

# ImageNet Classification with Deep Convolutional Neural Networks

## Abstract

We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
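As a rough check of the parameter count quoted in the abstract, the sketch below tallies weights and biases layer by layer. The layer shapes are assumptions taken from the commonly cited AlexNet configuration (including the two-GPU split that halves the input channels of the second, fourth, and fifth convolutions); they are not stated in the abstract itself.

```python
# Layer shapes are ASSUMPTIONS (commonly cited AlexNet configuration),
# not values given in the abstract above.

def conv_params(kernel_size, in_channels, out_channels):
    """Weights plus one bias per output channel for a square convolution."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


def fc_params(in_features, out_features):
    """Weights plus one bias per output unit for a fully-connected layer."""
    return in_features * out_features + out_features


# (kernel size, input channels seen by each filter, output channels)
CONV_LAYERS = [
    (11, 3, 96),
    (5, 48, 256),    # input channels halved by the assumed two-GPU split
    (3, 256, 384),
    (3, 192, 384),   # halved again
    (3, 192, 256),   # halved again
]

# (input features, output features); 6 * 6 * 256 is the assumed size of the
# flattened output of the last pooling layer
FC_LAYERS = [
    (6 * 6 * 256, 4096),
    (4096, 4096),
    (4096, 1000),    # feeds the final 1000-way softmax
]

conv_total = sum(conv_params(*shape) for shape in CONV_LAYERS)
fc_total = sum(fc_params(*shape) for shape in FC_LAYERS)

print("convolutional parameters:  ", conv_total)
print("fully-connected parameters:", fc_total)
print("total parameters:          ", conv_total + fc_total)
```

Under these assumed shapes the total comes to roughly 61 million, in line with the "60 million parameters" in the abstract, and the fully-connected layers hold the large majority of them, which fits the abstract's choice to apply dropout there.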

## No full-text available

... It is a quick and simple method for labelling a few hundred images for object detection. The labeling process with LabelImg is shown in Figure 4. Annotations were stored in the PASCAL VOC format as XML files, which ImageNet uses [30]. It also supports the YOLO and Create ML formats. ...
... The annotation file for YOLO in .txt format is shown in Figure 5. Annotations were stored in the PASCAL VOC format as XML files, which ImageNet uses [30]. It also supports the YOLO and Create ML formats. ...
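The two annotation formats named in these excerpts differ mainly in how a box is written: PASCAL VOC XML stores absolute pixel corners, while a YOLO .txt line stores a class index followed by the box centre, width, and height normalized by the image size. The sketch below shows one way to convert between them; the sample XML, the class list, and the function name are hypothetical and only serve as an illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in PASCAL VOC layout (not taken from any dataset).
SAMPLE_VOC_XML = """
<annotation>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>cavity</name>
    <bndbox>
      <xmin>160</xmin><ymin>120</ymin><xmax>480</xmax><ymax>360</ymax>
    </bndbox>
  </object>
</annotation>
"""

# Hypothetical class list; YOLO files refer to classes by index.
CLASS_NAMES = ["cavity", "crown"]


def voc_to_yolo(xml_text, class_names):
    """Convert one VOC XML annotation into a list of YOLO .txt lines."""
    root = ET.fromstring(xml_text.strip())
    width = float(root.find("size/width").text)
    height = float(root.find("size/height").text)

    lines = []
    for obj in root.findall("object"):
        class_id = class_names.index(obj.find("name").text)
        box = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (
            float(box.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        # YOLO stores the box centre and size as fractions of the image size.
        x_center = (xmin + xmax) / 2 / width
        y_center = (ymin + ymax) / 2 / height
        box_w = (xmax - xmin) / width
        box_h = (ymax - ymin) / height
        lines.append(
            f"{class_id} {x_center:.6f} {y_center:.6f} {box_w:.6f} {box_h:.6f}"
        )
    return lines


yolo_lines = voc_to_yolo(SAMPLE_VOC_XML, CLASS_NAMES)
print("\n".join(yolo_lines))
```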
Article
Full-text available
The teeth are the most challenging material to work with in the human body. Existing methods for detecting teeth problems are characterised by low efficiency, complex and experience-dependent operation, and a high level of user intervention. Older oral disease detection approaches were manual, time-consuming, and required a dentist to examine and evaluate the disease. To address these concerns, we propose a novel deep learning approach for detecting and classifying the four most common teeth problems: cavities, root canals, dental crowns, and broken-down root canals. In this study, we apply the YOLOv3 deep learning model to develop an automated tool capable of diagnosing and classifying dental abnormalities in dental panoramic X-ray images (OPG). Due to the lack of dental disease datasets, we created the Dental X-rays dataset to detect and classify these diseases. After augmentation, the dataset contained 1200 images. It comprises dental panoramic images with dental disorders such as cavities, root canals, BDR, and dental crowns. The dataset was divided into 70% training and 30% testing images. After training, the YOLOv3 model was evaluated on the test images. The experiments demonstrated that the proposed model achieved 99.33% accuracy and outperformed existing state-of-the-art models in accuracy and universality when those models were applied to our dataset.
... Then, we use transfer learning to combine self-supervised and remote sensing image classification tasks by fine-tuning for a small number of epochs on the labeled dataset. During pre-training, we use the ImageNet-1k [46] dataset without its labels. Additionally, a large fraction (e.g., 75%) of the image patches is randomly masked out. ...
... Table 5. Classification accuracy on the AID dataset compared to CNN-based models (OA ± STD).

| Method | OA ± STD | OA ± STD |
| --- | --- | --- |
| AlexNet [46] | 76.69 ± 0.19 | 76.85 ± 0.18 |
| VGGNet [60] | 76.47 ± 0.18 | 79.79 ± 0.65 |
| GoogleNet [58] | 76.19 ± 0.38 | 78.48 ± 0.26 |
| SPPNet [58] | 82.13 ± 0.30 | 84.64 ± 0.23 |
| D-CNN with AlexNet [58] | 85.56 ± 0.20 | 87.24 ± 0.12 |
| D-CNN with VGGNet-16 [58] | 89.22 ± 0.50 | 91.89 ± 0.22 |
| DenseNet-121 [61] | 88.31 ± 0.35 | 90.47 ± 0.33 |
| EfficientNet-B0-aux [62] | 89.96 ± 0.27 | - |
| EfficientNet-B3-aux [62] | 91.08 ± 0.14 | - |
| Contourlet CNN [63] | 85.93 ± 0.51 | 89.57 ± 0.45 |
| VGG-MS2AP [64] | 92.27 ± 0.21 | 93.91 ± 0.15 |
| Inception-v3-CapsNet [65] | 89.03 ± 0.21 | 92.60 ± 0.11 |
| ACNet [66] | 91.09 ± 0.13 | 92.42 ± 0.16 |
| Xu's method [67] | 91 ... | |

...
Article
Full-text available
This paper provides insights into the interpretation beyond simply combining self-supervised learning (SSL) with remote sensing (RS). Inspired by the improved representation ability brought by SSL in natural image understanding, we aim to explore and analyze the compatibility of SSL with remote sensing. In particular, we propose a self-supervised pre-training framework for the first time by applying the masked image modeling (MIM) method to RS image research in order to enhance its efficacy. The completion proxy task used by MIM encourages the model to reconstruct the masked patches, and thus correlate the unseen parts with the seen parts in semantics. Second, in order to figure out how pretext tasks affect downstream performance, we find the attribution consensus of the pre-trained model and downstream tasks toward the proxy and classification targets, which is quite different from that in natural image understanding. Moreover, this transferable consensus is persistent in cross-dataset full or partial fine-tuning, which means that SSL could boost general model-free representation beyond domain bias and task bias (e.g., classification, segmentation, and detection). Finally, on three publicly accessible RS scene classification datasets, our method outperforms the majority of fully supervised state-of-the-art (SOTA) methods with higher accuracy scores on unlabeled datasets.
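The masking step behind masked image modeling can be sketched in a few lines. The excerpt above mentions masking a large fraction of patches (e.g., 75%); the patch count of 196 (a 14 × 14 grid), the function name, and the fixed seed below are assumptions for illustration, not details of the authors' framework.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Return a list of booleans, True where a patch is masked out."""
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    # Sample the masked positions without replacement.
    masked = set(rng.sample(range(num_patches), num_masked))
    return [index in masked for index in range(num_patches)]


# Assumed example: a 14 x 14 grid of patches, 75% of them hidden.
mask = random_patch_mask(num_patches=196, mask_ratio=0.75, seed=0)
visible_indices = [index for index, hidden in enumerate(mask) if not hidden]

print("masked patches: ", sum(mask))
print("visible patches:", len(visible_indices))
```

In a setup of this kind, only the visible patches would be fed to the encoder, and the model would be trained to reconstruct the hidden ones.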
... Although extracting features from the Taylor expansion is a versatile approach in natural science, N-Jet still has two problems as a system of SRF components. The first problem is that SRF on N-Jet performs worse than the conventional convolution when training on sufficiently large-scale datasets such as the ImageNet dataset [7]. This problem, which is not treated in this paper, shows that there is room to investigate another component system for SRF. ...
... In this section, we analyze convolution filter weights in a well-trained CNN and show our OtX design policy. The CNN model used for analysis is a VGG16 classifier [19] trained on the ImageNet dataset [7]. We use the weights data from a model zoo, PyTorch Image Models [20]. ...
Article
Full-text available
Convolutions in neural networks are still essential for various vision tasks. To develop neural convolutions, this study focuses on the Structured Receptive Field (SRF), which represents a convolution filter as a linear combination of widely acting designed components. Although SRF can represent convolution filters with fewer components than the number of filter bins, N-Jet, the sole component system implementation, requires ten trainable parameters per filter to improve accuracy even for 3 × 3 convolutions. Hence, we aim to formulate a new component system for SRF that can represent valid filters with fewer components. Our component system, named “OtX”, is based on the Principal Component Analysis of well-trained filter weights, because the extracted components will also be principal for neural convolution filters. In addition to proposing the component system, we develop a component scaling method to defuse massive scale differences among the coefficients in a linear combination of OtX components. In the experimental section, we train image classification models on the CIFAR-100 dataset under the hyperparameters tuned for the original models with standard convolutions. For the NFNet-F0 classifier, OtX with six components performs 0.5% better than the standard convolution, 3.1% better than N-Jet with six components, and only 0.1% worse than N-Jet with ten components. Besides, OtX with nine components provides more stable training than N-Jet, performing 0.5% better than the standard convolution for NFNet-F0. OtX is a suitable replacement for standard convolutions because it performs at least comparably to N-Jet while offering better parameter efficiency and training stability.
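The core idea of a structured receptive field, a filter written as a linear combination of fixed components with only the coefficients trained, can be illustrated as follows. The three 3 × 3 components below are hypothetical stand-ins chosen for readability; they are neither the N-Jet nor the OtX components.

```python
def combine_components(coefficients, components):
    """Build a filter as sum_i coefficients[i] * components[i]."""
    rows, cols = len(components[0]), len(components[0][0])
    kernel = [[0.0] * cols for _ in range(rows)]
    for coefficient, component in zip(coefficients, components):
        for r in range(rows):
            for c in range(cols):
                kernel[r][c] += coefficient * component[r][c]
    return kernel


# Hypothetical fixed components (NOT the N-Jet or OtX bases).
COMPONENTS = [
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],        # constant
    [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]],     # horizontal gradient
    [[-1, -1, -1], [0, 0, 0], [1, 1, 1]],     # vertical gradient
]

# Only these three coefficients would be trainable, instead of nine filter bins.
coefficients = [0.5, 1.0, -1.0]
kernel = combine_components(coefficients, COMPONENTS)

for row in kernel:
    print(row)
```

With three components the filter has three trainable numbers rather than nine, which is the kind of parameter saving the abstract refers to when it compares component counts.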
... ML techniques developed for the classification, recognition, and diagnosis of diseases provide promising results [14][15][16][17]. Deep Learning (DL), a special area of machine learning, provides automatic recognition by performing end-to-end learning and using pre-trained weights [18][19][20][21]. Compared to ML techniques, DL provides higher accuracy rates, along with less human intervention, a lower workload, and a shorter development time [22][23][24][25][26]. ...
... After the pooling layer, different activation functions (e.g., ReLU, Leaky ReLU, sigmoid, linear, tanh) are exploited to reduce the numerical characteristics of the data. AlexNet [19], GoogLeNet [41], DenseNet201 [42], ResNet [43], InceptionNet [44], VggNet [45], and ShuffleNet [46] are the best-known models with successful recognition results. However, the success of these models varies depending on the data types and on the number and quality of the datasets. ...
Article
Using deep learning techniques on radiological lung images to detect COVID-19 is a promising way to shorten the diagnosis time. In this study, we propose a hybrid deep learning model that detects COVID-19 and pneumonia using chest X-ray images. The proposed model, named SpiCoNET, first runs multiple well-known deep learning models combined with a Spiking Neural Network (SNN) in order to identify the models with higher accuracy rates. Then, SpiCoNET combines the features of the two models with the highest accuracy rates among the well-known models and hands the combined features over to a different SNN layer as input. Finally, the features are classified using the SEFRON learning algorithm. The proposed hybrid deep learning model takes advantage of the features of the well-known models that, combined with the SNN, provide the highest accuracy rates. Moreover, the proposed model makes use of the SEFRON learning algorithm to provide better classification. The proposed model provides an accuracy rate of 97.09% for classifying images into the COVID-19, Pneumonia, and Normal classes, which outperforms AlexNet (91.27%) and DenseNet201 (90.40%). The results reveal that deep learning based systems for the identification of COVID-19 and pneumonia can help healthcare professionals control the COVID-19 pandemic effectively.
... It is thus expected that deep CNNs perform better than their shallower counterparts in increasingly complex problems. Indeed, from AlexNet [1] to VGG [2] and ResNet [3], deeper networks have corresponded to a clear performance increase. ...
... Through the development of new network architectures, CNNs are becoming increasingly more effective for image classification tasks [1] [6] [2] [3] [7]. In general, the deeper a network is, i.e. the more consecutive convolutional layers it has, the better it is expected to perform. ...
Preprint
Deep Convolutional Neural Networks (CNNs) for image classification successively alternate convolutions and downsampling operations, such as pooling layers or strided convolutions, resulting in lower resolution features the deeper the network gets. These downsampling operations save computational resources and provide some translational invariance as well as a bigger receptive field at the next layers. However, an inherent side-effect of this is that high-level features, produced at the deep end of the network, are always captured in low resolution feature maps. The inverse is also true, as shallow layers always contain small scale features. In biomedical image analysis engineers are often tasked with classifying very small image patches which carry only a limited amount of information. By their nature, these patches may not even contain objects, with the classification depending instead on the detection of subtle underlying patterns with an unknown scale in the image's texture. In these cases every bit of information is valuable; thus, it is important to extract the maximum number of informative features possible. Driven by these considerations, we introduce a new CNN architecture which preserves multi-scale features from deep, intermediate, and shallow layers by utilizing skip connections along with consecutive contractions and expansions of the feature maps. Using a dataset of very low resolution patches from Pancreatic Ductal Adenocarcinoma (PDAC) CT scans we demonstrate that our network can outperform current state of the art models.
... The AlexNet architecture was developed in 2012 by Krizhevsky et al. It won the ImageNet competition held in 2012, drawing the attention of the scientific world to deep learning [29]. The ResNet50 architecture, which was developed by He et al. in 2015, became the winner of the ImageNet competition with an error rate of 3.6%. ...
Article
Full-text available
Otitis media with effusion (OME) is a middle ear disease in which fluid accumulates in the posterior part of the eardrum, usually without any symptoms. When OME is not treated, negative consequences arise that deeply affect the education and the social and cultural life of the patient. OME is difficult for specialists to diagnose. In this article, autoendoscopic images of the eardrum are classified using deep learning methods to help specialists in the diagnosis of OME. In this study, a hybrid deep model based on artificial intelligence is proposed. In the proposed hybrid model, feature maps were obtained using the Efficientnetb0 and Densenet201 architectures from both the original dataset and the dataset improved using the Gaussian method. Then, these feature maps were merged. Unnecessary features were eliminated by applying NCA dimension reduction to the combined feature map. The most valuable features obtained at the end of the optimization process were classified with different machine learning classifiers. The proposed model reached a very competitive accuracy of 98.20% with the SVM classifier.
... For example, a convolutional neural network based on an attention mechanism has been proposed to solve the problem of image classification in complex scenes. This model is proficient in fine-grained classification and has better robustness [11][12][13]. The process of convolution is the process of feature extraction. ...
Article
Full-text available
With the progress of science and technology and the arrival of the big data era, people increasingly rely on computers to handle daily life and related affairs. In recent years, machine learning has become increasingly popular and has achieved good results in some fields, which has led to its wide use. In particular, visual neural network technology can analyze the emotional expression of oil paintings more intelligently; this is a current research hotspot involving machine vision, pattern recognition, image processing, artificial intelligence, and other fields. However, in the art field, oil paintings still differ greatly from other images. At present, no deep learning algorithm has been applied to emotional expression analysis in oil painting theme creation. Starting from a neural network algorithm and combining it with big data recognition technology, this paper analyzes the emotional expression of oil painting subjects in the public environment and establishes an emotional expression analysis model of oil painting creation based on big data and neural networks. The experiments show that the images synthesized by this model have high resolution and good definition, but the model runs slowly: it takes about one hour to complete a round of image optimization.
... Neural Networks (NN) and other Machine Learning (ML) techniques are commonly used in various tasks where it is hard or even not possible to specify sets of procedural rules (Serban et al., 2020). Deep convolutional neural networks (DCNN) became a very popular method for image classification (Krizhevsky et al., 2017). It is based on low/mid/high-level features integration where those levels can be enriched by the number of stacked layers (depth) (He et al., 2016). ...
... With the vast usage of neural networks (NNs) for detection and classification tasks, NN-based algorithms are the most appropriate kind of algorithm for object detection. A. Krizhevsky et al. [14] were the first to utilize a GPU to train a deep NN, called AlexNet, for a classification task, and achieved state-of-the-art performance that was far better than the second-best entry in the ImageNet Large-Scale Visual Recognition Challenge (LSVRC)-2010 contest [15]. Since then, a variety of deep NNs have been invented, and they have replaced classic feature extractors as the most used algorithms in various computer vision tasks, as well as in speech recognition, semantic recognition, and other applications. ...
Article
Full-text available
This paper proposes a deep learning based object detection method to locate a distant region in an image in real time. It concentrates on distant objects from a vehicular front camcorder perspective, trying to solve one of the common problems in Advanced Driver Assistance Systems (ADAS) applications: detecting smaller, faraway objects with the same confidence as bigger, closer ones. This paper presents an efficient multi-scale object detection network, termed ConcentrateNet, that detects a vanishing point and concentrates on the near-distant region. Initially, the first inference of the object detection model produces detection results at a larger receptive-field scale and predicts a potential vanishing point location, that is, the farthest location in the frame. Then, the image is cropped near the vanishing point location and processed by the object detection model in a second inference to obtain distant object detection results. Finally, the results of the two inferences are merged with a specific Non-Maximum Suppression (NMS) method. The proposed architecture can be employed in most object detection models; to check feasibility, it is implemented in several state-of-the-art object detection models. Compared with the original models using a higher-resolution input size, ConcentrateNet models use a lower-resolution input size and have less model complexity, while achieving significant precision and recall improvements. Moreover, the proposed ConcentrateNet model is successfully ported onto a low-powered embedded system, NVIDIA Jetson AGX Xavier, making it suitable for real-time autonomous machines.
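The merging step can be pictured with a small sketch. The abstract refers to a specific NMS method that it does not describe, so the code below swaps in plain greedy NMS as a stand-in; the function names, the box layout `(x1, y1, x2, y2)`, and the assumption that the crop is not rescaled are all illustrative choices, not the authors' design.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(detections, iou_threshold=0.5):
    """Standard greedy NMS over a list of (box, score) pairs."""
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(box, kept_box) < iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


def merge_two_passes(full_frame, cropped, crop_offset, iou_threshold=0.5):
    """Shift crop-space boxes into frame coordinates, then suppress duplicates.

    Assumes the crop was taken at the original scale (hypothetical setup).
    """
    dx, dy = crop_offset
    shifted = [
        ((x1 + dx, y1 + dy, x2 + dx, y2 + dy), score)
        for (x1, y1, x2, y2), score in cropped
    ]
    return greedy_nms(full_frame + shifted, iou_threshold)


# Hypothetical detections: the distant object is found with low confidence in
# the first pass and with higher confidence in the cropped second pass.
first_pass = [((100, 100, 200, 200), 0.9), ((300, 300, 320, 320), 0.4)]
second_pass = [((10, 10, 30, 30), 0.8)]
merged = merge_two_passes(first_pass, second_pass, crop_offset=(290, 290))
print(merged)
```

In this toy case the low-confidence duplicate of the distant object is suppressed in favour of the detection from the cropped pass.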
... Accordingly, the quality of Ψ now depends on the selection of Θ, g, and the distance function: Θ should efficiently separate the image features from the degeneration agents, while g and the distance function should maintain the data fidelity for successful reconstruction results. Θ, g, and the distance function can be either learned by using neural networks [37] or manually chosen according to the statistical features of the captured images [38]. Considering that both m  and n Ω do not contain edge information as they originate from illumination problems (spatially slow varying), we choose the spatial gradient operator to extract the edge feature of the image for the reconstruction of Ψ. ...
Preprint
Full-text available
We present a simple but efficient and robust reconstruction algorithm for Fourier ptychographic microscopy (FPM), termed error-laxity Fourier ptychographic iterative engine (Elfpie), that is simultaneously robust to (1) noise (including Gaussian, Poisson, and salt & pepper noise), (2) problematic background illumination, (3) vignetting effects, and (4) misalignment of LED positions, without the need to calibrate or recover these system errors. In Elfpie, we embed the inverse problem of FPM in the framework of feature extraction and recovery and propose a new data fidelity cost function regularized by global second-order total-variation regularization (Hessian regularization). The closed-form complex gradient of the cost function is derived and back-propagated using the AdaBelief optimizer with an adaptive learning rate to update the entire Fourier spectrum of the sample and the system pupil function. Elfpie is tested on both simulated and experimental data and is compared against state-of-the-art (SOTA) algorithms. Results show the superiority of Elfpie over other SOTA methods in both reconstruction quality under different degeneration issues and implementation efficiency. In general, compared with SOTA methods, Elfpie is robust to Gaussian noise with 100 times larger noise strength, salt & pepper noise with 1000 times larger noise strength, and Poisson noise with 10 times larger noise strength. Elfpie is able to reconstruct a high-fidelity sample field under large LED position misalignments of up to 2 mm. It can also bypass the vignetting effect, under which all SOTA methods fail to reconstruct the sample pattern. With parallel computation, Elfpie can be K times faster than traditional FPM, where K is the number of LEDs used.
... This study employed three pre-trained DL architectures, namely AlexNet [24], GoogLeNet, and VGGNet, to retrieve the most informative features from the pre-processed images. The AlexNet model was trained on about one million images spanning 1000 class labels, including pen, keyboard, pencil, mug, coffee, and many animals. ...
Article
Full-text available
This study proposed an AlexNet-based crowd anomaly detection model in the video (image frames). The proposed model was comprised of four convolution layers (CLs) and three Fully Connected layers (FC). The Rectified Linear Unit (ReLU) was used as an activation function, and weights were adjusted through the backpropagation process. The first two CLs are followed by max-pool layer and batch normalization. The CLs produced features that are utilized to detect the anomaly in the image frame. The proposed model was evaluated using two parameters—Area Under the Curve (AUC) using Receiver Operator Characteristic (ROC) curve and overall accuracy. Three benchmark datasets comprised of numerous video frames with various abnormal and normal actions were used to evaluate the performance. Experimental results revealed that the proposed model outperformed other baseline studies on all three datasets and achieved 98% AUC using the ROC curve. Moreover, the proposed model achieved 95.6%, 98%, and 97% AUC on the CUHK Avenue, UCSD Ped-1, and UCSD Ped-2 datasets, respectively.
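The AUC metric used in this evaluation can be computed without plotting the ROC curve, because it equals the probability that a randomly chosen positive (anomalous) frame receives a higher score than a randomly chosen negative (normal) one. The sketch below uses that pairwise definition; the labels and scores are made-up toy values, not results from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for positive (e.g. anomalous), 0 for negative (e.g. normal).
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example (hypothetical scores): three of the four positive/negative
# pairs are ranked correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_labels, toy_scores))
```

The quadratic loop is fine for small examples; for long videos a rank-based formulation would be the usual choice.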
... One superior classifier is created by combining three CNN models for excellent prediction performance. The CNN models are initially trained using a sizable dataset of annotated natural images (ImageNet) [24]. Then, these models are fine-tuned using an annotated semiconductor wafer defect dataset. ...
Article
Full-text available
Semiconductor wafer defects severely affect product development. In order to reduce the occurrence of defects, it is necessary to identify why they occur, and it can be inferred by analyzing the patterns of defects. Automatic defect classification (ADC) is used to analyze large amounts of samples. ADC can reduce human resource requirements for defect inspection and improve inspection quality. Although several ADC systems have been developed to identify and classify wafer surfaces, the conventional ML-based ADC methods use numerous image recognition features for defect classification and tend to be costly, inefficient, and time-consuming. Here, an ADC technique based on a deep ensemble feature framework (DEFF) is proposed that classifies different kinds of wafer surface damage automatically. DEFF has an ensemble feature network and the final decision network layer. The feature network learns features using multiple pre-trained convolutional neural network (CNN) models representing wafer defects and the ensemble features are computed by concatenating these features. The decision network layer decides the classification labels using the ensemble features. The classification performance is further enhanced by using a voting-based ensemble learning strategy in combination with the deep ensemble features. We show the efficacy of the proposed strategy using the real-world data from SK Hynix.
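Two of the building blocks named in this abstract, concatenating features from several networks and voting over several predictions, are simple enough to sketch directly. The feature values, the class labels, and the use of plain majority voting are assumptions for illustration; the abstract does not specify the voting rule used in DEFF.

```python
from collections import Counter


def concatenate_features(feature_vectors):
    """Join per-model feature vectors into one ensemble feature vector."""
    combined = []
    for vector in feature_vectors:
        combined.extend(vector)
    return combined


def majority_vote(predictions):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical features from three pre-trained CNNs for one wafer image.
ensemble_features = concatenate_features([[0.1, 0.7], [0.3], [0.9, 0.2]])

# Hypothetical labels predicted by three classifiers for the same image.
final_label = majority_vote(["scratch", "particle", "scratch"])

print(ensemble_features)
print(final_label)
```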
... Finally, the S augmented training subsets are input into the S classification CNN models for training. It is worth noting that the initial parameters of the S classification CNN models here are all the same, and they are all copies of the model obtained after pre-training on the ImageNet dataset (Krizhevsky, Sutskever & Hinton, 2017). After training on distributed data, we obtained S different trained CNN models. ...
Article
Full-text available
With the popularity of wine culture and the development of artificial intelligence (AI) technology, wine label image retrieval is becoming more and more important. Taking a wine label image as input, the goal of this task is to return the wine information that the user hopes to know, such as the main brand and sub-brand of the wine. The main challenge in the wine label image retrieval task is that there are a large number of wine brands with imbalanced numbers of sample images, which strongly affects the training of a retrieval system based on deep learning. To solve this problem, this article adopts a distributed strategy and proposes two distributed retrieval frameworks. The experimental results on the large-scale wine label dataset and the Oxford flowers dataset demonstrate that both proposed distributed retrieval frameworks are effective and even greatly outperform the previous state-of-the-art retrieval models.
... DNNs have achieved remarkable performance on a wide range of applications [1][2][3][4][5][6], yet they are vulnerable to adversarial examples (malicious inputs that can fool DNNs by adding human-imperceptible perturbations [7][8][9][10]). To address this problem, plenty of adversarial defense techniques have been proposed [11,12]. ...
Preprint
Adversarial training has been demonstrated to be one of the most effective remedies for defending against adversarial examples, yet it often suffers from a huge robustness generalization gap on unseen testing adversaries, known as the adversarially robust generalization problem. Despite preliminary understanding of adversarially robust generalization, little is known from the architectural perspective. Thus, this paper tries to bridge the gap by systematically examining the most representative architectures (e.g., Vision Transformers and CNNs). In particular, we first comprehensively evaluated 20 adversarially trained architectures on the ImageNette and CIFAR-10 datasets against several adversaries (multiple $\ell_p$-norm adversarial attacks) and found that Vision Transformers (e.g., PVT, CoAtNet) often yield better adversarially robust generalization. To further understand what architectural ingredients favor adversarially robust generalization, we delve into several key building blocks and reveal, via the lens of Rademacher complexity, that higher weight sparsity contributes significantly to the better adversarially robust generalization of Vision Transformers, which can often be achieved by attention layers. Our extensive studies discovered the close relationship between architectural design and adversarially robust generalization and yielded several important insights. We hope our findings help to better understand the mechanisms for designing robust deep learning architectures.
... The trainable parameters Θ (weights w, bias b) in the neural network provide a large degree of freedom for the approximation capability of the network; thus, DL can extract features from large amounts of data. Nowadays, DL has been effectively deployed in a variety of fields, such as image recognition [34,35], target detection [36], image generation [37], machine translation [38], speech recognition [39], natural language processing [40], etc. Although deep learning has achieved remarkable results in such fields, only recently has it been gradually introduced into the field of scientific computing [41]. ...
Preprint
In this paper, we first extend the Fourier neural operator (FNO) to discover the soliton mapping between two function spaces, where one is the fractional-order index space $\{\epsilon|\epsilon\in (0, 1)\}$ in the fractional integrable nonlinear wave equations and the other is the solitonic solution function space. Specifically, the recently proposed fractional nonlinear Schrödinger (fNLS), fractional Korteweg-de Vries (fKdV), fractional modified Korteweg-de Vries (fmKdV), and fractional sine-Gordon (fsineG) equations are studied in this paper. We present the training and evaluation progress by recording the training and test losses. To illustrate the accuracy, the data-driven solitons are also compared to the exact solutions. Moreover, we consider the influence of several critical factors (e.g., the activation functions ReLU$(x)$, Sigmoid$(x)$, Swish$(x)$, and $x\tanh(x)$, and the depth of the fully connected layers) on the performance of the FNO algorithm. We also use a new activation function, namely $x\tanh(x)$, which has not been used in the field of deep learning. The results obtained in this paper may be useful to further understand neural networks in fractional integrable nonlinear wave systems and the mappings between two spaces.
... This strategy is often employed because, experimentally, it reaches the result in fewer iterations than the stochastic version. In return, the gradient must be computed for all selected individuals simultaneously, which is not a problem for modern parallel computing architectures [79]. ...
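For reference, the four activation functions compared in this abstract can be written as scalar functions. This is a minimal sketch using the standard definitions, with Swish taken in its common form $x \cdot \mathrm{sigmoid}(x)$; whether the authors use a scaled variant is not stated in the abstract.

```python
import math


def relu(x):
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)


def sigmoid(x):
    """Sigmoid(x) = 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    exp_x = math.exp(x)
    return exp_x / (1.0 + exp_x)


def swish(x):
    """Swish(x) = x * sigmoid(x) (common form; an assumption here)."""
    return x * sigmoid(x)


def x_tanh(x):
    """The x * tanh(x) activation named in the abstract."""
    return x * math.tanh(x)


for value in (-2.0, 0.0, 2.0):
    print(value, relu(value), sigmoid(value), swish(value), x_tanh(value))
```

Unlike the other three, $x\tanh(x)$ is an even function, so it returns the same value for $x$ and $-x$.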
Thesis
With the advent and rapid development of digital technologies, data has become both a valuable and a highly abundant asset. However, such profusion raises questions about the quality and labeling of this data. Indeed, because the volume of available data keeps growing while the cost of labeling by human experts remains very high, it is increasingly necessary to strengthen semi-supervised learning by exploiting unlabeled data. This problem is all the more pronounced in multi-label learning, and in particular for regression, where each statistical unit is guided by several different targets that take the form of numerical scores. This thesis is set within this fundamental framework. First, we propose a learning method for semi-supervised regression, which we put to the test through a detailed experimental study. Building on this new method, we present a second contribution that is better suited to the multi-label context. We also demonstrate its effectiveness through a comparative study on datasets from the literature. Furthermore, the dimensionality of the problem remains a persistent difficulty in machine learning, and reducing it attracts the interest of many researchers in the community. One of the major tasks addressing this issue is feature selection, which we propose to study here in a complex setting: semi-supervised, multi-label, and for regression.
... Consequently, it can be challenging to obtain new knowledge not in agreement with the one already established. Benchmark databases and competitions have been a cornerstone of the tremendous success in reaching the extremely fast performance gains in machine learning, for example, of object recognition (Krizhevsky et al., 2017). There already exists an online platform hosting challenging time series datasets for causal discovery (http://www.causeme.net; ...
Article
Full-text available
Teleconnections that link climate processes at widely separated spatial locations form a key component of the climate system. Their analysis has traditionally been based on means, climatologies, correlations, or spectral properties, which cannot always reveal the dynamical mechanisms between different climatological processes. More recently, causal discovery methods based either on time series at grid locations or on modes of variability, estimated through dimension-reduction methods, have been introduced. A major challenge in the development of such analysis methods is a lack of ground truth benchmark datasets that have facilitated improvements in many parts of machine learning. Here, we present a simplified stochastic climate model that outputs gridded data and represents climate modes and their teleconnections through a spatially aggregated vector-autoregressive model. The model is used to construct benchmarks and evaluate a range of analysis methods. The results highlight that the model can be successfully used to benchmark different causal discovery methods for spatiotemporal data and show their strengths and weaknesses. Furthermore, we introduce a novel causal discovery method at the grid level and demonstrate that it has orders of magnitude better performance than the current approaches. Improved causal analysis tools for spatiotemporal climate data are pivotal to advance process-based understanding and climate model evaluation.
... This system is built on convolutional neural networks. For classification, several existing architectures have been developed, such as AlexNet [25], GoogleNet [26], VGG16 [27,28], VGG19 [28] and ResNet50 [28,29]. The problem is analyzed using these existing architectures and the proposed model. ...
Article
A public health emergency is unfolding throughout the world due to the novel coronavirus 2019 (nCoV-2019), also named Severe Acute Respiratory Syndrome-CoronaVirus-2 (SARS-CoV-2). COVID-19 is the disease caused by this virus. The virus originates in bats and is transmitted to humans by some unidentified intermediate animals. It emerged around December 2019 in Wuhan, China, and then turned into a pandemic. Even though there is no efficient vaccination, the entire world is fighting COVID-19. This article presents an overview of the situation in the world as well as in India. Some of the leading countries in the world are also badly affected by this virus. India, which has the second-largest population, is taking the necessary precautions to protect itself. With the Government of India's decisions, along with effective social distancing and hygienic measures, India is in a better position. However, future COVID-19 cases in India remain unpredictable. We designed an algorithm based on a Convolutional Neural Network (CNN), which helps to automatically classify COVID-19-positive and COVID-19-negative persons from chest X-ray images in the shortest time. The proposed method discovered that employing CT scan medical images produced more accurate results than X-ray images.
... The former involves applying spatial changes to an object. For example, it includes flipping, rotating, and cropping [10][11][12][13]. In contrast, the latter involves pixel-level image transformation and includes contrast, which adjusts the ratio of contrast in an image, and the addition of random noise to increase the adaptability of the data under various environments [14,15]. ...
Article
Full-text available
Owing to the continuous increase in the damage to farms due to wild animals’ destruction of crops in South Korea, various methods have been proposed to resolve these issues, such as installing electric fences and using warning lamps or ultrasonic waves. Recently, new methods have been attempted by applying deep learning-based object-detection techniques to a robot. However, for effective training of a deep learning-based object-detection model, overfitting or biased training should be avoided; furthermore, a huge number of datasets are required. In particular, establishing a training dataset for specific wild animals requires considerable time and labor. Therefore, this study proposes an Extract–Append data augmentation method where specific objects are extracted from a limited number of images via semantic segmentation and corresponding objects are appended to numerous arbitrary background images. Thus, the study aimed to improve the model’s detection performance by generating a rich dataset on wild animals with various background images, particularly images of water deer and wild boar, which are currently causing the most problematic social issues. The comparison between the object detector trained using the proposed Extract–Append technique and that trained using the existing data augmentation techniques showed that the mean Average Precision (mAP) improved by ≥2.2%. Moreover, further improvement in detection performance of the deep learning-based object-detection model can be expected as the proposed technique can solve the issue of the lack of specific data that are difficult to obtain.
... The early attempts to develop machine learning tools to interpret crystallisation images failed to come close to human scoring accuracy for (at least) two reasons: firstly, by not having access to a sufficiently large, diverse and well-classified training set and secondly, many of the attempts predated the breakthrough applications of Convolutional Neural Networks (CNNs) in 2012 [41]. The MARCO consortium brought together images from five different institutes, creating a larger and more diverse dataset than had been used before for training crystallisation image classification models, and also tapped into the established expertise in CNNs. ...
Preprint
Full-text available
The use of imaging systems in protein crystallisation means that the experimental setups no longer require manual inspection to determine the outcome of the trials. However, it leads to the problem of how best to find images which contain useful information about the crystallisation experiments. The adoption of a deep-learning approach in 2018 enabled a four-class machine classification system of the images to exceed human accuracy for the first time. Underpinning this was the creation of a labelled training set which came from a consortium of several different laboratories. The MARCO classification model does not have the same accuracy on local data as it does on images from the original test set; this can be somewhat mitigated by retraining the ML model and including local images. We have characterised the image data used in the original MARCO model, and performed extensive experiments to identify the training settings most likely to enhance the local performance of a MARCO-dataset-based ML classification model.
... DL can automatically detect and extract representations (features) with strong discriminatory power from input data. After DL's success in the ImageNet challenge in 2012, it only took 5 years for the first DL algorithm for medical imaging (Krizhevsky et al., 2012). A deep neural network (DNN) (Misman et al., 2019) has one or more hidden layers between the input and output layers. ...
Article
Full-text available
Autism spectrum disorder (ASD) is a complex neurodevelopmental disorder that affects approximately 1% of the population and causes significant burdens. ASD’s pathogenesis remains elusive; hence, diagnosis is based on a constellation of behaviors. Structural magnetic resonance imaging (sMRI) studies have shown several abnormalities in volumetric and geometric features of the autistic brain. However, inconsistent findings prevented most contributions from being translated into clinical practice. Establishing reliable biomarkers for ASD using sMRI is crucial for the correct diagnosis and treatment. In recent years, machine learning (ML) and specifically deep learning (DL) have quickly extended to almost every sector, notably in disease diagnosis. Thus, this has led to a shift and improvement in ASD diagnostic methods, fulfilling most clinical diagnostic requirements. However, ASD discovery remains difficult. This review examines the ML-based ASD diagnosis literature over the past 5 years. A literature-based taxonomy of the research landscape has been mapped, and the major aspects of this topic have been covered. First, we provide an overview of ML’s general classification pipeline and the features of sMRI. Next, representative studies are highlighted and discussed in detail with respect to methods, and biomarkers. Finally, we highlight many common challenges and make recommendations for future directions. In short, the limited sample size was the main obstacle; Thus, comprehensive data sets and rigorous methods are necessary to check the generalizability of the results. ML technologies are expected to advance significantly in the coming years, contributing to the diagnosis of ASD and helping clinicians soon.
... The AlexNet [47] architecture consists of 8 layers, including 5 convolution layers (with 2 Convolution 2D layers, 3 Grouped Convolution 2D layers, 5 Rectified Linear Unit (ReLU) layers, 2 Cross-Channel Normalization layers, 3 Max-Pooling 2D layers) and 3 fully connected layers (with 3 Fully Connected layers, 2 ReLU layers, 2 Dropout layers for regularization, a Softmax layer using a normalized exponential function). The network's originality is the effective implementation of the ReLU activation function, as well as the usage of the Dropout mechanism and data augmentation technique to prevent overfitting. ...
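The layer breakdown above can be checked against the "60 million parameters" figure in the abstract with a little arithmetic. The sketch below is illustrative only: the kernel sizes and channel counts are taken from the original AlexNet paper rather than from the snippet above, and the names `CONV_LAYERS`, `FC_LAYERS` and `count_parameters` are hypothetical.

```python
# Assumption: layer shapes follow the original AlexNet paper (two-GPU variant);
# they are not stated in the snippet above.
# Each conv entry: (name, kernel_h, kernel_w, input_channels_seen_by_a_filter, filters)
CONV_LAYERS = [
    ("conv1", 11, 11, 3, 96),
    ("conv2", 5, 5, 48, 256),   # grouped: each filter sees 48 of the 96 channels
    ("conv3", 3, 3, 256, 384),
    ("conv4", 3, 3, 192, 384),  # grouped
    ("conv5", 3, 3, 192, 256),  # grouped
]
# Each fully connected entry: (name, inputs, outputs)
FC_LAYERS = [
    ("fc6", 6 * 6 * 256, 4096),
    ("fc7", 4096, 4096),
    ("fc8", 4096, 1000),        # feeds the 1000-way softmax
]


def count_parameters():
    """Return a dict mapping layer name to its weight + bias count."""
    counts = {}
    for name, kh, kw, cin, cout in CONV_LAYERS:
        counts[name] = kh * kw * cin * cout + cout
    for name, n_in, n_out in FC_LAYERS:
        counts[name] = n_in * n_out + n_out
    return counts


counts = count_parameters()
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name}: {n:,}")
print(f"total: {total:,}")
```

Under these assumptions the three fully connected layers hold the large majority of the weights, which is consistent with dropout being applied there.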
Article
Full-text available
Background: Although polysomnography (PSG) is a gold standard tool for diagnosing sleep apnea (SA), it can reduce the patient’s sleep quality by the placement of several disturbing sensors and can only be interpreted by a highly trained sleep technician or scientist. In recent years, electrocardiogram (ECG)-derived respiration (EDR) and heart rate variability (HRV) have been used to automatically diagnose SA and reduce the drawbacks of PSG. Up to now, most of the proposed approaches focus on machine-learning (ML) algorithms and feature engineering, which require prior expert knowledge and experience. The present study proposes an SA detection algorithm to differentiate a normal and apnea event using a deep-learning (DL) framework based on 1D and 2D deep CNN with empirical mode decomposition (EMD) of a preprocessed ECG signal. The EMD is ideally suited to extract essential components which are characteristic of the underlying biological or physiological processes. In addition, the simple and compact architecture of 1D deep CNN, which only performs 1D convolutions, and pretrained 2D deep CNNs, are suitable for real-time and low-cost hardware implementation. Method: This study was validated using 7 h to nearly 10 h overnight ECG recordings from 33 subjects with an average apnea-hypopnea index (AHI) of 30.23/h originated from PhysioNet Apnea-ECG database (PAED). In preprocessing, the raw ECG signal was normalized and filtered using the FIR band pass filter. The preprocessed ECG signal was then decomposed using the empirical mode decomposition (EMD) technique to generate several features. Several important generated features were selected using neighborhood component analysis (NCA). Finally, deep learning algorithm based on 1D and 2D deep CNN were used to perform the classification of normal and apnea event. The synthetic minority oversampling technique (SMOTE) was also applied to evaluate the influence of the imbalanced data problem. 
Results: The segment-level classification performance had 93.8% accuracy with 94.9% sensitivity and 92.7% specificity based on 5-fold cross-validation (5fold-CV), meanwhile, the subject-level classification performance had 83.5% accuracy with 75.9% sensitivity and 88.7% specificity based on leave-one-subject-out cross-validation (LOSO-CV). Conclusion: A novel and robust SA detection algorithm based on the ECG decomposed signal using EMD and deep CNN was successfully developed in this study.
... Although FM and its variants [4][5][6] make it possible to learn 2nd-order interactions automatically, learning higher-order interactions remains impractical due to its high computational complexity. Inspired by the triumph of deep learning in computer vision [7,8], speech recognition [9,10] and natural language processing [11,12], many researchers resort to deep neural networks (DNNs) to learn high-order feature interactions and obtain decent results [13,14]. Nevertheless, some studies [15][16][17] point out that DNNs model feature interactions in an implicit way and can hardly learn interactions higher than the 2nd order. ...
Article
Full-text available
CTR (Click-Through Rate) prediction has attracted more and more attention from academia and industry for its significant contribution to revenue. In the last decade, learning feature interactions have become a mainstream research direction, and dozens of feature interaction-based models have been proposed for the CTR prediction task. The most common approach for existing models is to enumerate all possible feature interactions or to learn higher-order feature interactions by designing complex models. However, a simple enumeration will introduce meaningless and harmful interactions, and a complex model structure will bring a higher complexity. In this work, we propose a lightweight, yet effective model called the Gated Adaptive feature Interaction Network (GAIN). We devise a novel cross module to drop meaningless feature interactions and preserve informative ones. Our cross module consists of multiple gated units, each of which can independently learn an arbitrary-order feature interaction. We combine the cross module with a deep module into GAIN and conduct comparative experiments with state-of-the-art models on two public datasets to verify its validity. Our experimental results show that GAIN can achieve a comparable or even better performance compared to its competitors. Furthermore, in order to verify the effectiveness of the feature interactions learned by GAIN, we transfer learned interactions to other models, such as Logistic Regression (LR) and Factorization Machines (FM), and find out that their performance can be significantly improved.
... In recent years, artificial intelligence has developed rapidly and achieved great success in many fields, such as image classification [7], image segmentation [8], super-resolution [9], and image denoising [10]. In CT applications, deep learning is attracting more and more attention. ...
... AlexNet. AlexNet is a CNN proposed by Krizhevsky et al. [55]. It consists of 25 layers, eight of which are learnable (5 convolutional layers followed by 3 fully connected layers). ...
Article
Full-text available
... More recently, the advent of deep neural networks, and particularly Convolutional Neural Networks (CNNs) specifically designed for gridded data, has opened new opportunities for segmentation tasks and in particular for land cover mapping (Ndikumana et al., 2018;Gao et al., 2019;Liu et al., 2019). A CNN is a widely adopted deep learning architecture and a very effective model for analyzing images or image-like data for pattern recognition (Krizhevsky et al., 2012). A CNN is structured in layers: an input layer connected to the data, an output layer connected to the quantities to estimate, and multiple hidden layers in between. ...
Article
Full-text available
Land cover mapping is of great interest in the Alps region for monitoring the surface occupation changes (e.g. forestation, urbanization, etc). In this pilot study, we investigate how time series of radar satellite imaging (C-band single-polarized SENTINEL-1 Synthetic Aperture Radar, SAR), also acquired through clouds, could be an alternative to optical imaging for land cover segmentation. Concretely, we compute for every location (using SAR pixels over 45 × 45 m) the temporal coherence matrix of the Interferometric SAR (InSAR) phase over 1 year. This normalized matrix of size 60 × 60 (60 acquisition dates over 1 year) summarizes the reflectivity changes of the land. Two machine learning models, a Support Vector Machine (SVM) and a Convolutional Neural Network (CNN) have been developed to estimate land cover classification performances of 6 main land cover classes (such as forests, urban areas, water bodies, or pastures). The training database was created by projecting to the radar geometry the reference labeled CORINE Land Cover (CLC) map on the mountainous area of Grenoble, France. Upon evaluation, both models demonstrated good performances with an overall accuracy of 78% (SVM) and of 81% (CNN) over Chambéry area (France). We show how, even with a spatially coarse training database, our model is able to generalize well, as a large part of the misclassifications are due to a low precision of the ground truth map. Although some less computationally expensive approaches (using optical data) could be available, this land cover mapping based on very different information, i.e., patterns of land changes over a year, could be complementary and thus beneficial; especially in mountainous regions where optical imaging is not always available due to clouds. Moreover, we demonstrated that the InSAR temporal coherence matrix is very informative, which could lead in the future to other applications such as automatic detection of abrupt changes as snow fall or landslides.
... Since the introduction of AlexNet [1] in the early 2010s, Convolutional Neural Networks (CNNs) have become the central pillar of computer vision. Over the next decade, the field gradually shifted from engineering features to designing CNN architectures. ...
Article
Full-text available
The field of Neural Style Transfer (NST) has led to interesting applications that enable us to transform reality as human beings perceive it. Particularly, NST for material translation aims to transform the material of an object into that of a target material from a reference image. Since the target material (style) usually comes from a different object, the quality of the synthesized result totally depends on the reference image. In this paper, we propose a material translation method based on NST with automatic style image retrieval. The proposed CNN-feature-based image retrieval aims to find the ideal reference image that best translates the material of an object. An ideal reference image must share semantic information with the original object while containing distinctive characteristics of the desired material (style). Thus, we refine the search by selecting the most-discriminative images from the target material, while focusing on object semantics by removing its style information. To translate materials to object regions, we combine a real-time material segmentation method with NST. In this way, the material of the retrieved style image is transferred to the segmented areas only. We evaluate our proposal with different state-of-the-art NST methods, including conventional and recently proposed approaches. Furthermore, with a human perceptual study applied to 100 participants, we demonstrate that synthesized images of stone, wood, and metal can be perceived as real and even chosen over legitimate photographs of such materials.
Article
Plant diseases are one of the dominant factors that threaten sustainable agriculture, leading to economic losses. Developing an accurate mobile-based plant disease detection methodology is important for enabling rapid identification of emerging diseases directly from the farms. The deep learning methods have limited usage in mobile-based applications as they require larger memory and processing power to operate directly on smartphones or internet connectivity when used with a client–server computing model. To address this challenge, we propose a mobile-based lightweight deep learning-based model, which requires only a small footprint and processing power while maintaining higher detection accuracy. With around 0.088 billion multiply–accumulation operations, 0.26 million parameters, and 1 MB storage space, this framework achieved 97%, 97.1% and 96.4% accuracies on apple, citrus and tomato leaves datasets, respectively. One of our tiny models achieved 93.33% accuracy on a custom sourced in-the-wild apple leaves images dataset, which affirms the in-field applicability of the proposed framework. The superiority of the proposed model is further demonstrated through a comparative study with equivalent lightweight models.
Article
Chiral metasurfaces have been widely used in sensing, imaging and other fields because they can manipulate light through the efficient circular dichroism (CD). However, its on-demand design is still a very challenging task. In this work, we propose an on-demand multiple reverse design based on deep learning, named target-driven conditional generative network (TCGN). It can reverse design the metasurface structure that meets the required CD, and its mean square error (MAE) is 0.0089. We use this method to inversely design multiple sets of metasurfaces with different structures, and all their CD values can exceed 0.36. Both simulations and experiments prove the feasibility and effectiveness of using deep learning to reverse design metasurfaces. In addition, the designed metasurface can realize chiral wavefront control under dual frequency. This design method based on deep learning can rapidly and efficiently design the chiral metasurfaces, which provides a new way for the design of metasurfaces.
Thesis
Mapping the structural connectivity of the human brain is a major scientific challenge. Describing the trajectory and connections made by the hundred billion neurons that make up the brain is a titanic and multi-scale task. The major fiber bundles have been described by classical anatomical approaches since the 20th century. These studies also revealed the existence of shorter bundles, called superficial bundles, that ensure the connectivity between neighboring anatomical regions. The small size and complex shape of these bundles set a serious challenge to their visualization, so that their description remains under discussion to this day. The first research axis of this thesis aims at pushing the limits of diffusion MRI and proposing a new ex-vivo dataset of the whole human brain, called Chenonceau, dedicated to the characterization of the fine connectivity of the brain. The dataset consists of two T2-weighted anatomical acquisitions at 100 and 150 micron resolution, as well as 175 dMRI datasets at 200 micron resolution with diffusion weighting reaching 8000 s/mm². More than 4500 hours of acquisition, distributed across two and a half years were necessary to acquire this data. Chenonceau takes advantage of the Bruker 11.7T preclinical MRI system, equipped with both a high magnetic field and a powerful gradient tunnel (780mT/m) allowing to reach the mesoscopic resolution and a very high diffusion weighting. To reconcile the large size of the human brain with the preclinical system, a new acquisition protocol is proposed. It is based on the separation of the brain into smaller samples, which are imaged individually, then reassembled in post-processing to reconstitute the full volume. The whole process is presented, including the protocol for the cutting and the storage of the anatomical samples, the details of the MRI sequences and the description of the image processing pipeline.
Special attention is dedicated to the definition of the registration step which recomposes the whole volume from the individual acquisitions. The first inferences of anatomical connectivity from this new dataset are also presented. Tractography associated with clustering techniques allow the extraction of the long and superficial bundles of Chenonceau. The second part of the thesis focused on the development of a new method for fiber tracking, based on the use of the spin glass model. The latter expresses the tractography problem as a set of fiber fragments, called spins, distributed in the sample and whose position and orientation, as well as the connections they establish, are associated with an amount of energy. The construction of the tracts results from the displacement and connection of the spins, with the aim of reaching the global minimum of energy. This thesis proposes to replace the Metropolis-Hastings method used for optimization by an agent trained in a reinforcement learning framework. This new formulation aims at improving the choice of actions, which would no longer be randomly drawn, but dictated by a strategy learned by the agent, fruit of its past interactions with similar environments. The anticipation and projection capacities of such an agent appear particularly adequate to propose the most relevant trajectory in regions where the diffusion information is ambiguous. Moreover, the possibility for the algorithm to learn through interactions allows to circumvent the difficulty of establishing datasets of ground-truth bundles.
Chapter
Image segmentation can play a significant role in helping visually impaired people to walk freely. We propose image segmentation on our custom dataset of tactile paving surfaces, or blind sidewalks. The underlying model for the image segmentation is U-Net. We used intersection over union (IoU) as the metric to assess how our model performs. We achieved an IoU score of 0.9391.
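For readers unfamiliar with the metric named above, here is a minimal sketch of intersection over union for two binary segmentation masks. It is not the authors' evaluation code; the function name `iou` and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
def iou(pred_mask, true_mask):
    """Intersection over union of two binary masks given as nested lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks agree perfectly.
    return intersection / union if union else 1.0


predicted = [[1, 1, 0],
             [0, 1, 0]]
reference = [[1, 0, 0],
             [0, 1, 1]]
# 2 pixels overlap, 4 pixels are set in at least one mask.
print(iou(predicted, reference))
```

A score of 1.0 means the predicted region matches the reference exactly, and 0.0 means the two regions do not overlap at all.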
Article
In this paper it is shown that operational regimes in flotation plants can be identified by machine learning models, based on the availability of just a few tens of froth images of each regime. This is accomplished through the generation of synthetic images that can be used to train machine learning models used in froth pattern identification. The study was based on a small set of images from a platinum group metals plant. Synthetic images were generated with two convolutional neural networks. Features extracted from synthetic and real images with local binary patterns and AlexNet were indistinguishable from each other. In addition, features from a small set of 30 synthetic images that were used as predictors in a random forest model performed similarly to the same features extracted from real images, even from a considerably larger data set. This suggests that the use of synthetic froth images generated by deep learning models can serve as the basis for few-shot learning models.
Article
Full-text available
Parcel-level cropland maps are an essential data source for crop yield estimation, precision agriculture, and many other agronomy applications. Here, we proposed a rice field mapping approach that combines agricultural field boundary extraction with fine-resolution satellite images and pixel-wise cropland classification with Sentinel-1 time series SAR (Synthetic Aperture Radar) imagery. The agricultural field boundaries were delineated by image segmentation using U-net-based fully convolutional network (FCN) models. Meanwhile, a simple decision-tree classifier was developed based on rice phenology traits to extract rice pixels with time series SAR imagery. Agricultural fields were then classified as rice or non-rice by majority voting from pixel-wise classification results. The evaluation indicated that SeresNet34, as the backbone of the U-net model, had the best performance in agricultural field extraction with an IoU (Intersection over Union) of 0.801 compared to the simple U-net and ResNet-based U-net. The combination of agricultural field maps with the rice pixel detection model showed promising improvement in the accuracy and resolution of rice mapping. The produced rice field map had an IoU score of 0.953, while the User's Accuracy and Producer's Accuracy of pixel-wise rice field mapping were only 0.824 and 0.816, respectively. The proposed model combination scheme merely requires a simple pixel-wise cropland classification model that incorporates the agricultural field mapping results to produce high-accuracy and high-resolution cropland maps.
Article
Battery-powered, ultra-low-power embedded devices are often limited by the size and maintenance costs of batteries, giving rise to battery-less devices and the emergence of energy harvesting systems. Energy harvesters obtain enough energy from the environment to sustain program execution. However, differences in the harvesting source and in the size of the energy storage mean that the program cannot execute continuously, because it is frequently interrupted by power failures. Frequent power failures cause loss of volatile state, inconsistent data, and non-termination, so the energy harvesting system has to preserve volatile state, maintain data consistency, and avoid non-termination. In this paper, we survey transient computing techniques for energy harvesting systems. We hope that this research will provide researchers with insights into transient computing and help them address the remaining challenges.
Conference Paper
Full-text available
We are interested in large-scale image classification and especially in the setting where images corresponding to new or existing classes are continuously added to the training set. Our goal is to devise classifiers which can incorporate such images and classes on-the-fly at (near) zero cost. We cast this problem into one of learning a metric which is shared across all classes and explore k-nearest neighbor (k-NN) and nearest class mean (NCM) classifiers. We learn metrics on the ImageNet 2010 challenge data set, which contains more than 1.2M training images of 1K classes. Surprisingly, the NCM classifier compares favorably to the more flexible k-NN classifier, and has comparable performance to linear SVMs. We also study the generalization performance, among others by using the learned metric on the ImageNet-10K dataset, and we obtain competitive performance. Finally, we explore zero-shot classification, and show how the zero-shot model can be combined very effectively with small training datasets.
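The nearest class mean (NCM) idea described above can be sketched in a few lines. This is a simplified illustration, not the paper's method: it uses plain Euclidean distance where the abstract describes a learned metric shared across classes, and the function names `class_means` and `predict` are hypothetical.

```python
import math


def class_means(features, labels):
    """Compute the mean feature vector of each class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(means, x):
    """Assign x to the class whose mean is closest.

    Assumption: Euclidean distance stands in for the learned metric.
    """
    return min(means, key=lambda y: math.dist(means[y], x))


means = class_means([[0, 0], [0, 2], [10, 10]], ["cat", "cat", "dog"])
print(predict(means, [1, 1]))

# Adding a class only requires computing one more mean; existing means are untouched.
means.update(class_means([[-10, -10]], ["bird"]))
print(predict(means, [-9, -9]))
```

The last two lines illustrate why this family of classifiers suits the "on-the-fly" setting: a new class costs one averaging pass over its own images.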
Article
Full-text available
When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This "overfitting" is greatly reduced by randomly omitting half of the feature detectors on each training case. This prevents complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors. Instead, each neuron learns to detect a feature that is generally helpful for producing the correct answer given the combinatorially large variety of internal contexts in which it must operate. Random "dropout" gives big improvements on many benchmark tasks and sets new records for speech and object recognition.
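The procedure of "randomly omitting half of the feature detectors on each training case" can be sketched as follows. This is a minimal illustration under stated assumptions: it operates on a plain list of hidden activations, and at test time it scales every activation by the keep probability, which is one common way to approximate averaging over the many thinned networks. The names `dropout_train` and `dropout_test` are hypothetical.

```python
import random


def dropout_train(activations, p_drop=0.5, rng=random):
    """Zero each hidden activation independently with probability p_drop."""
    return [0.0 if rng.random() < p_drop else a for a in activations]


def dropout_test(activations, p_drop=0.5):
    """Keep every unit at test time, scaled by the keep probability.

    Assumption: scaling the activations is used here in place of scaling
    the outgoing weights; the two are equivalent for a linear next layer.
    """
    keep = 1.0 - p_drop
    return [a * keep for a in activations]


hidden = [1.0, 2.0, 3.0, 4.0]
rng = random.Random(0)             # seeded only to make the demo repeatable
print(dropout_train(hidden, rng=rng))
print(dropout_test(hidden))
```

Because a different random subset of units is dropped for every training case, no unit can rely on specific other units being present, which is the co-adaptation argument made in the abstract.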
Conference Paper
Full-text available
Intelligent tasks, such as visual perception, auditory perception, and language understanding require the construction of good internal representations of the world (or "features"), which must be invariant to irrelevant variations of the input while preserving relevant information. A major question for Machine Learning is how to learn such good features automatically. Convolutional Networks (ConvNets) are a biologically-inspired trainable architecture that can learn invariant features. Each stage in a ConvNet is composed of a filter bank, some nonlinearities, and feature pooling layers. With multiple stages, a ConvNet can learn multi-level hierarchies of features. While ConvNets have been successfully deployed in many commercial applications from OCR to video surveillance, they require large amounts of labeled training samples. We describe new unsupervised learning algorithms, and new non-linear stages that allow ConvNets to be trained with very few labeled samples. Applications to visual object recognition and vision navigation for off-road mobile robots are described.
Conference Paper
Full-text available
The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called "ImageNet", a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
Conference Paper
Full-text available
There has been much interest in unsupervised learning of hierarchical generative models such as deep belief networks. Scaling such models to full-sized, high-dimensional images remains a difficult problem. To address this problem, we present the convolutional deep belief network, a hierarchical generative model which scales to realistic image sizes. This model is translation-invariant and supports efficient bottom-up and top-down probabilistic inference. Key to our approach is probabilistic max-pooling, a novel technique which shrinks the representations of higher layers in a probabilistically sound way. Our experiments show that the algorithm learns useful high-level visual features, such as object parts, from unlabeled images of objects and natural scenes. We demonstrate excellent performance on several visual recognition tasks and show that our model can perform hierarchical (bottom-up and top-down) inference over full-sized images.
Conference Paper
Full-text available
Neural networks are a powerful technology for classification of visual inputs arising from documents. However, there is a confusing plethora of different neural network methods that are used in the literature and in industry. This paper describes a set of concrete best practices that document analysis researchers can use to get good results with neural networks. The most important practice is getting a training set as large as possible: we expand the training set by adding a new form of distorted data. The next most important practice is that convolutional neural networks are better suited for visual document tasks than fully connected networks. We propose that a simple "do-it-yourself" implementation of convolution with a flexible architecture is suitable for many visual document problems. This simple convolutional neural network does not require complex methods, such as momentum, weight decay, structure-dependent learning rates, averaging layers, tangent prop, or even finely-tuning the architecture. The end result is a very simple yet general architecture which can yield state-of-the-art performance for document analysis. We illustrate our claims on the MNIST set of English digit images.
Conference Paper
Full-text available
We address image classification on a large scale, i.e. when a large number of images and classes are involved. First, we study classification accuracy as a function of the image signature dimensionality and the training set size. We show experimentally that the larger the training set, the higher the impact of the dimensionality on the accuracy. In other words, high-dimensional signatures are important to obtain state-of-the-art results on large datasets. Second, we tackle the problem of data compression on very large signatures (on the order of 10^5 dimensions) using two lossy compression strategies: a dimensionality reduction technique known as the hash kernel and an encoding technique based on product quantizers. We explain how the gain in storage can be traded against a loss in accuracy and/or an increase in CPU cost. We report results on two large databases, ImageNet and a dataset of 1M Flickr images, showing that we can reduce the storage of our signatures by a factor of 64 to 128 with little loss in accuracy. Integrating the decompression in the classifier learning yields an efficient and scalable training algorithm. On ILSVRC2010 we report a 74.3% accuracy at top-5, which corresponds to a 2.5% absolute improvement with respect to the state-of-the-art. On a subset of 10K classes of ImageNet we report a top-1 accuracy of 16.7%, a relative improvement of 160% with respect to the state-of-the-art.
Article
Full-text available
We present a fast, fully parameterizable GPU implementation of Convolutional Neural Network variants. Our feature extractors are neither carefully designed nor pre-wired, but rather learned in a supervised way. Our deep hierarchical architectures achieve the best published results on benchmarks for object classification (NORB, CIFAR10) and handwritten digit recognition (MNIST), with error rates of 2.53%, 19.51%, 0.35%, respectively. Deep nets trained by simple back-propagation perform better than more shallow ones. Learning is surprisingly rapid. NORB is completely trained within five epochs. Test error rates on MNIST drop to 2.42%, 0.97% and 0.48% after 1, 3 and 17 epochs, respectively.
Article
Full-text available
While many models of biological object recognition share a common set of "broad-stroke" properties, the performance of any one model depends strongly on the choice of parameters in a particular instantiation of that model--e.g., the number of units per layer, the size of pooling kernels, exponents in normalization operations, etc. Since the number of such parameters (explicit or implicit) is typically large and the computational cost of evaluating one particular parameter set is high, the space of possible model instantiations goes largely unexplored. Thus, when a model fails to approach the abilities of biological visual systems, we are left uncertain whether this failure is because we are missing a fundamental idea or because the correct "parts" have not been tuned correctly, assembled at sufficient scale, or provided with enough training. Here, we present a high-throughput approach to the exploration of such parameter sets, leveraging recent advances in stream processing hardware (high-end NVIDIA graphic cards and the PlayStation 3's IBM Cell Processor). In analogy to high-throughput screening approaches in molecular biology and genetics, we explored thousands of potential network architectures and parameter instantiations, screening those that show promising object recognition performance for further analysis. We show that this approach can yield significant, reproducible gains in performance across an array of basic object recognition tasks, consistently outperforming a variety of state-of-the-art purpose-built vision systems from the literature. As the scale of available computational power continues to expand, we argue that this approach has the potential to greatly accelerate progress in both artificial vision and our understanding of the computational underpinning of biological vision.
Article
Full-text available
Many image segmentation algorithms first generate an affinity graph and then partition it. We present a machine learning approach to computing an affinity graph using a convolutional network (CN) trained using ground truth provided by human experts. The CN affinity graph can be paired with any standard partitioning algorithm and improves segmentation accuracy significantly compared to standard hand-designed affinity functions. We apply our algorithm to the challenging 3D segmentation problem of reconstructing neuronal processes from volumetric electron microscopy (EM) and show that we are able to learn a good affinity graph directly from the raw EM images. Further, we show that our affinity graph improves the segmentation accuracy of both simple and sophisticated graph partitioning algorithms. In contrast to previous work, we do not rely on prior knowledge in the form of hand-designed image features or image preprocessing. Thus, we expect our algorithm to generalize effectively to arbitrary image types.
Article
Full-text available
Thesis (Ph. D.)--Harvard University, 1975. Includes bibliographical references.
Article
Full-text available
We introduce a challenging set of 256 object categories containing a total of 30607 images. The original Caltech-101 [1] was collected by choosing a set of object categories, downloading examples from Google Images and then manually screening out all images that did not fit the category. Caltech-256 is collected in a similar manner with several improvements: a) the number of categories is more than doubled, b) the minimum number of images in any category is increased from 31 to 80, c) artifacts due to image rotation are avoided and d) a new and larger clutter category is introduced for testing background rejection. We suggest several testing paradigms to measure classification performance, then benchmark the dataset using two simple metrics as well as a state-of-the-art spatial pyramid matching [2] algorithm. Finally we use the clutter category to train an interest detector which rejects uninformative background regions.
Article
Full-text available
Progress in understanding the brain mechanisms underlying vision requires the construction of computational models that not only emulate the brain's anatomy and physiology, but ultimately match its performance on visual tasks. In recent years, "natural" images have become popular in the study of vision and have been used to show apparently impressive progress in building such models. Here, we challenge the use of uncontrolled "natural" images in guiding that progress. In particular, we show that a simple V1-like model--a neuroscientist's "null" model, which should perform poorly at real-world visual object recognition tasks--outperforms state-of-the-art object recognition systems (biologically inspired and otherwise) on a standard, ostensibly natural image recognition test. As a counterpoint, we designed a "simpler" recognition test to better span the real-world variation in object pose, position, and scale, and we show that this test correctly exposes the inadequacy of the V1-like model. Taken together, these results demonstrate that tests based on uncontrolled natural images can be seriously misleading, potentially guiding progress in the wrong direction. Instead, we reexamine what it means for images to be natural and argue for a renewed focus on the core problem of object recognition--real-world image variation.
Conference Paper
Full-text available
We assess the applicability of several popular learning methods for the problem of recognizing generic visual categories with invariance to pose, lighting, and surrounding clutter. A large dataset comprising stereo image pairs of 50 uniform-colored toys under 36 azimuths, 9 elevations, and 6 lighting conditions was collected (for a total of 194,400 individual images). The objects were 10 instances of 5 generic categories: four-legged animals, human figures, airplanes, trucks, and cars. Five instances of each category were used for training, and the other five for testing. Low-resolution grayscale images of the objects with various amounts of variability and surrounding clutter were used for training and testing. Nearest neighbor methods, support vector machines, and convolutional networks, operating on raw pixels or on PCA-derived features were tested. Test error rates for unseen object instances placed on uniform backgrounds were around 13% for SVM and 7% for convolutional nets. On a segmentation/recognition task with highly cluttered images, SVM proved impractical, while convolutional nets yielded 16/7% error. A real-time version of the system was implemented that can detect and classify objects in natural scenes at around 10 frames per second.
Technical Report
Traditional methods of computer vision and machine learning cannot match human performance on tasks such as the recognition of handwritten digits or traffic signs. Our biologically plausible deep artificial neural network architectures can. Small (often minimal) receptive fields of convolutional winner-take-all neurons yield large network depth, resulting in roughly as many sparsely connected neural layers as found in mammals between retina and visual cortex. Only winner neurons are trained. Several deep neural columns become experts on inputs preprocessed in different ways; their predictions are averaged. Graphics cards allow for fast training. On the very competitive MNIST handwriting benchmark, our method is the first to achieve near-human performance. On a traffic sign recognition benchmark it outperforms humans by a factor of two. We also improve the state-of-the-art on a plethora of common image classification benchmarks.
Article
Article
We describe how to train a two-layer convolutional Deep Belief Network (DBN) on the 1.6 million tiny images dataset. When training a convolutional DBN, one must decide what to do with the edge pixels of the images. As the pixels near the edge of an image contribute to the fewest convolutional filter outputs, the model may
Article
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International conference, ***, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.
Article
The article describes analytic and algorithmic methods for determining the coefficients of the Taylor expansion of an accumulated rounding error with respect to the local rounding errors, and hence determining the influence of the local errors on the accumulated error. Second and higher order coefficients are also discussed, and some possible methods of reducing the extensive storage requirements are analyzed.
Article
Current computational approaches to learning visual object categories require thousands of training images, are slow, cannot learn in an incremental manner and cannot incorporate prior information into the learning process. In addition, no algorithm presented in the literature has been tested on more than a handful of object categories. We present a method for learning object categories from just a few training images. It is quick and it uses prior information in a principled way. We test it on a dataset composed of images of objects belonging to 101 widely varied categories. Our proposed method is based on making use of prior information, assembled from (unrelated) object categories which were previously learnt. A generative probabilistic model is used, which represents the shape and appearance of a constellation of features belonging to the object. The parameters of the model are learnt incrementally in a Bayesian manner. Our incremental algorithm is compared experimentally to an earlier batch Bayesian algorithm, as well as to one based on maximum likelihood. The incremental and batch versions have comparable classification performance on small training sets, but incremental learning is significantly faster, making real-time learning feasible. Both Bayesian methods outperform maximum likelihood on small training sets.
Conference Paper
Restricted Boltzmann machines were developed using binary stochastic hidden units. These can be generalized by replacing each binary unit by an infinite number of copies that all have the same weights but have progressively more negative biases. The learning and inference rules for these “Stepped Sigmoid Units” are unchanged. They can be approximated efficiently by noisy, rectified linear units. Compared with binary units, these units learn features that are better for object recognition on the NORB dataset and face verification on the Labeled Faces in the Wild dataset. Unlike binary units, rectified linear units preserve information about relative intensities as information travels through multiple layers of feature detectors.
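The approximation mentioned in this abstract can be checked numerically. The sketch below sums the firing probabilities of many sigmoid copies with progressively more negative biases and compares the total with the smooth softplus function log(1 + e^x). The bias offsets of 0.5, 1.5, 2.5, ... and the noise variance of sigmoid(x) in `noisy_relu` are assumptions that the abstract does not state, and all function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    # Numerically stable form so that large negative inputs do not overflow.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def stepped_sigmoid_sum(x, copies=100):
    """Expected total activity of `copies` binary units with biases shifted by 0.5, 1.5, ..."""
    return sum(sigmoid(x - i + 0.5) for i in range(1, copies + 1))


def softplus(x):
    return math.log1p(math.exp(x))


def noisy_relu(x, rng):
    """Sample max(0, x + noise) with Gaussian noise of variance sigmoid(x) (assumed)."""
    return max(0.0, x + rng.gauss(0.0, math.sqrt(sigmoid(x))))


rng = random.Random(0)
for x in (-2.0, 0.0, 2.0, 5.0):
    print(x, stepped_sigmoid_sum(x), softplus(x), noisy_relu(x, rng))
```

For the inputs tried here the two sums agree to within a few hundredths, which is what makes a single rectified unit a cheap stand-in for the whole stack of binary copies.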
Conference Paper
We show how to learn many layers of features on color images and we use these features to initialize deep autoencoders. We then use the autoencoders to map images to short binary codes. Using semantic hashing [6], 28-bit codes can be used to retrieve images that are similar to a query image in a time that is independent of the size of the database. This extremely fast retrieval makes it possible to search using multiple different transformations of the query image. 256-bit binary codes allow much more accurate matching and can be used to prune the set of images found using the 28-bit codes.
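The retrieval scheme in this abstract treats a short binary code as a memory address, so a lookup costs the same however many images are stored. The following is a minimal sketch under stated assumptions: the codes are supplied as Python integers rather than produced by an autoencoder, and `CodeIndex`, `hamming`, and `rerank` are hypothetical names.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")


class CodeIndex:
    """Hash table keyed by short binary codes."""

    def __init__(self, n_bits):
        self.n_bits = n_bits
        self.buckets = defaultdict(list)  # short code -> item ids

    def add(self, code, item):
        self.buckets[code].append(item)

    def query(self, code, radius=1):
        """Return items whose codes lie within `radius` bit flips of `code`."""
        hits = []
        for r in range(radius + 1):
            for bits in combinations(range(self.n_bits), r):
                probe = code
                for b in bits:
                    probe ^= 1 << b
                hits.extend(self.buckets.get(probe, []))
        return hits


def rerank(candidates, long_codes, query_long_code):
    """Order candidates by Hamming distance between longer codes."""
    return sorted(candidates, key=lambda item: hamming(long_codes[item], query_long_code))


index = CodeIndex(n_bits=8)
index.add(0b00001111, "a")
index.add(0b00001110, "b")
index.add(0b11110000, "c")
print(index.query(0b00001111, radius=1))  # items "a" and "b"
```

The number of probes depends only on the code length and the search radius, not on how many items the index holds; the longer codes are then used only on the short candidate list.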
Article
This article outlines the overall strategy and summarizes a few key innovations of the team that won the first Netflix progress prize.
Conference Paper
In many recent object recognition systems, feature extraction stages are generally composed of a filter bank, a non-linear transformation, and some sort of feature pooling layer. Most systems use only one stage of feature extraction in which the filters are hard-wired, or two stages where the filters in one or both stages are learned in supervised or unsupervised mode. This paper addresses three questions: 1. How do the non-linearities that follow the filter banks influence the recognition accuracy? 2. Does learning the filter banks in an unsupervised or supervised manner improve the performance over random filters or hardwired filters? 3. Is there any advantage to using an architecture with two stages of feature extraction, rather than one? We show that using non-linearities that include rectification and local contrast normalization is the single most important ingredient for good accuracy on object recognition benchmarks. We show that two stages of feature extraction yield better accuracy than one. Most surprisingly, we show that a two-stage system with random filters can yield almost 63% recognition rate on Caltech-101, provided that the proper non-linearities and pooling layers are used. Finally, we show that with supervised refinement, the system achieves state-of-the-art performance on the NORB dataset (5.6%), and unsupervised pre-training followed by supervised refinement produces good accuracy on Caltech-101 (> 65%) and the lowest known error rate on the undistorted, unprocessed MNIST dataset (0.53%).
Article
Research in object detection and recognition in cluttered scenes requires large image collections with ground truth labels. The labels should provide information about the object classes present in each image, as well as their shape and locations, and possibly other attributes such as pose. Such data is useful for testing, as well as for supervised learning. This project provides a web-based annotation tool that makes it easy to annotate images, and to instantly share such annotations with the community. This tool, plus an initial set of 10,000 images (3000 of which have been labeled), can be found at http://www.csail.mit.edu/~brussell/research/LabelMe/intro.html
Article
A neural network model for a mechanism of visual pattern recognition is proposed in this paper. The network is self-organized by "learning without a teacher", and acquires an ability to recognize stimulus patterns based on the geometrical similarity (Gestalt) of their shapes without being affected by their positions. This network is given the nickname "neocognitron". After completion of self-organization, the network has a structure similar to the hierarchy model of the visual nervous system proposed by Hubel and Wiesel. The network consists of an input layer (photoreceptor array) followed by a cascade connection of a number of modular structures, each of which is composed of two layers of cells connected in a cascade. The first layer of each module consists of "S-cells", which show characteristics similar to simple cells or lower order hypercomplex cells, and the second layer consists of "C-cells" similar to complex cells or higher order hypercomplex cells. The afferent synapses to each S-cell have plasticity and are modifiable. The network has an ability of unsupervised learning: we do not need any "teacher" during the process of self-organization, and it is only necessary to present a set of stimulus patterns repeatedly to the input layer of the network. The network has been simulated on a digital computer. After repetitive presentation of a set of stimulus patterns, each stimulus pattern comes to elicit an output from only one of the C-cells of the last layer, and conversely, this C-cell becomes selectively responsive only to that stimulus pattern. That is, none of the C-cells of the last layer responds to more than one stimulus pattern. The response of the C-cells of the last layer is not affected by the pattern's position at all. Nor is it affected by a small change in the shape or size of the stimulus pattern.
We present an application of back-propagation networks to handwritten digit recognition. Minimal preprocessing of the data was required, but architecture of the network was highly constrained and specifically designed for the task. The input of the network consists of normalized images of isolated digits. The method has 1% error rate and about a 9% reject rate on zipcode digits provided by the U.S. Postal Service.
1 INTRODUCTION
The main point of this paper is to show that large back-propagation (BP) networks can be applied to real image-recognition problems without a large, complex preprocessing stage requiring detailed engineering. Unlike most previous work on the subject (Denker et al., 1989), the learning network is directly fed with images, rather than feature vectors, thus demonstrating the ability of BP networks to deal with large amounts of low level information. Previous work performed on simple digit images (Le Cun, 1989) showed that the architecture of the network s...
A. Berg, J. Deng, and L. Fei-Fei. Large scale visual recognition challenge 2010. www.image-net.org/challenges, 2010.
D.C. Cireşan, U. Meier, J. Masci, L.M. Gambardella, and J. Schmidhuber. High-performance neural networks for visual object classification. Arxiv preprint arXiv:1102.0183, 2011.
G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R.R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.