Conference Paper · PDF available

# ImageNet: a Large-Scale Hierarchical Image Database

Authors:
• Jia Deng • Wei Dong • Richard Socher • Li-Jia Li • Kai Li • Li Fei-Fei

## Abstract and Figures

The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called "ImageNet", a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
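The organizing idea described above — images attached to WordNet synsets, with a query at any node aggregating its entire subtree of hyponyms — can be sketched in a few lines. The hierarchy and image counts below are a hypothetical toy fragment, not real WordNet or ImageNet data:

```python
# Toy sketch of ImageNet's organization: each synset holds its own images,
# and a query at any node aggregates the counts of its whole subtree.
# Hierarchy and counts here are illustrative assumptions, not real data.
hierarchy = {
    "animal": ["dog", "cat"],
    "dog": ["husky", "beagle"],
    "cat": [],
    "husky": [],
    "beagle": [],
}
images_per_synset = {"animal": 0, "dog": 900, "cat": 850, "husky": 720, "beagle": 610}

def subtree_image_count(synset):
    """Total images at this synset plus all of its hyponyms (recursively)."""
    return images_per_synset.get(synset, 0) + sum(
        subtree_image_count(child) for child in hierarchy.get(synset, [])
    )

print(subtree_image_count("dog"))     # "dog" itself plus both breeds: 2230
print(subtree_image_count("animal"))  # entire toy subtree: 3080
```

In the real database the hierarchy is WordNet's hyponym graph and each leaf carries hundreds of human-verified images, but the aggregation principle is the same.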
... The best four results are highlighted in red, blue, green, and cyan, respectively. Table 5. Performance comparison of different GCP methods on ImageNet [12] based on ResNet-18 [18]. The failure count denotes the total number of times the SVD solver failed to converge during one training process. ...
... This demonstrates that these treatments are complementary and can benefit each other. Table 5 presents the total number of SVD solver failures in one training process and the validation accuracy on ImageNet [12] based on ResNet-18 [18]. The results are very coherent with our experiment on decorrelated BN. ...
... We use ResNet-18 [18] for the GCP experiment and train it from scratch on ImageNet [12]. Fig. 7 displays the overview of a GCP model. ...
Preprint
Full-text available
Inserting an SVD meta-layer into neural networks is prone to make the covariance ill-conditioned, which can harm the model's training stability and generalization ability. In this paper, we systematically study how to improve the covariance conditioning by enforcing orthogonality on the Pre-SVD layer. Existing orthogonal treatments of the weights are investigated first. However, although these techniques improve the conditioning, they hurt the performance. To avoid this side effect, we propose the Nearest Orthogonal Gradient (NOG) and Optimal Learning Rate (OLR). The effectiveness of our methods is validated in two applications: decorrelated Batch Normalization (BN) and Global Covariance Pooling (GCP). Extensive experiments on visual recognition demonstrate that our methods can simultaneously improve the covariance conditioning and generalization. Moreover, combining them with orthogonal weights further boosts performance.
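The "nearest orthogonal" construction underlying the NOG idea has a standard closed form: the orthogonal matrix closest to a given matrix in Frobenius norm is its polar factor U Vᵀ, obtained from the SVD. The sketch below shows that general construction only; it is not the authors' exact training procedure:

```python
import numpy as np

def nearest_orthogonal(mat):
    """Closest orthogonal matrix to `mat` in Frobenius norm: the polar
    factor U @ Vt from the SVD  mat = U @ diag(s) @ Vt."""
    u, _, vt = np.linalg.svd(mat, full_matrices=False)
    return u @ vt

rng = np.random.default_rng(0)
g = rng.standard_normal((4, 4))   # stand-in for a gradient matrix
q = nearest_orthogonal(g)

# q is orthogonal: q.T @ q is (numerically) the identity
print(np.allclose(q.T @ q, np.eye(4), atol=1e-8))
```

Replacing a raw gradient by its polar factor in this way yields a perfectly conditioned update direction, which is the intuition behind improving the covariance conditioning of the Pre-SVD layer.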
... We evaluate our defence across three different tasks. For image classification, we report the robust performance on the ImageNet [10] and RESISC-45 [6] datasets. For object detection, we test on the PASCAL VOC [12] dataset. ...
... We adopt the PSPNet [50] with the ResNet-50 [20] backbone as the patch detector of PatchZero. We initialize the PSPNet with weights pre-trained on the ImageNet [10] and follow the loss function for image segmentation. We train our PSPNet patch detector through the two-stage adversarial training introduced in Section 3.3. ...
Preprint
... Deep neural networks (DNNs) have achieved great success in various visual recognition tasks owing to open-source datasets that contain a large number of samples, e.g., ImageNet [8] and MS COCO [22]. While the classes of images in these recognition datasets have an approximately uniform distribution, large-scale real-world datasets tend to follow a long-tailed distribution: a few classes occupy most of the data, whereas most classes have few samples. ...
... Long-tailed ImageNet: The original ImageNet [8] is one of the largest image recognition datasets, which contains 1280K training images and 50K test images with 1,000 categories. Following [23], we built a long-tailed version of the ImageNet dataset, which contains 115.8K training images. ...
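Long-tailed versions of balanced datasets like the one described are commonly built by subsampling each class along an exponentially decaying profile. A minimal sketch of that per-class schedule follows; the class count, head size, and imbalance factor are illustrative values, not the ones used in [23]:

```python
def long_tailed_counts(num_classes, n_max, imbalance_factor):
    """Per-class sample counts decaying exponentially from n_max (head)
    down to n_max / imbalance_factor (tail) across the class index."""
    counts = []
    for c in range(num_classes):
        frac = c / (num_classes - 1)          # 0.0 at the head, 1.0 at the tail
        counts.append(int(n_max * (1.0 / imbalance_factor) ** frac))
    return counts

counts = long_tailed_counts(num_classes=10, n_max=1000, imbalance_factor=100)
print(counts)  # head class keeps 1000 samples; tail class keeps only 10
```

Subsampling each class of a balanced dataset to these counts produces the "few classes hold most of the data" shape that long-tailed recognition methods are evaluated on.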
Preprint
There is growing interest in the challenging visual perception task of learning from long-tailed class distributions. The extreme class imbalance in the training dataset biases the model to prefer recognizing majority-class data over minority-class data. Recently, the dual branch network (DBN) framework has been proposed, in which two branch networks, a conventional branch and a re-balancing branch, are employed to improve the accuracy of long-tailed visual recognition. The re-balancing branch uses a reversed sampler to generate class-balanced training samples to mitigate bias due to class imbalance. Although this strategy has been quite successful in handling bias, using a reversed sampler for training can degrade the representation learning performance. To alleviate this issue, the conventional method used a carefully designed cumulative learning strategy, in which the influence of the re-balancing branch gradually increases throughout the training phase. In this study, we aim to develop a simple yet effective method to improve the performance of DBN without cumulative learning, which is difficult to optimize. We devise a simple data augmentation method, termed bilateral mixup augmentation, which combines one sample from the uniform sampler with another sample from the reversed sampler to produce a training sample. Furthermore, we present class-conditional temperature scaling that mitigates bias toward the majority class for the proposed DBN architecture. Our experiments on widely used long-tailed visual recognition datasets show that bilateral mixup augmentation is quite effective at improving the representation learning performance of DBNs, and that the proposed method achieves state-of-the-art performance in some categories.
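The bilateral mixup step described above — convexly combining a uniform-sampled example with a reverse-sampled one — can be sketched as follows. Details such as the Beta(α, α) mixing distribution and the plain-list features are assumptions for illustration, not the paper's exact configuration:

```python
import random

def bilateral_mixup(x_uniform, y_uniform, x_reversed, y_reversed, alpha=1.0):
    """Mix one example from the uniform sampler with one from the reversed
    sampler. Features and one-hot labels are plain lists for illustration;
    lam is drawn from Beta(alpha, alpha), as in standard mixup."""
    lam = random.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x_uniform, x_reversed)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_uniform, y_reversed)]
    return x, y, lam

# One majority-class sample (uniform) mixed with one minority-class sample (reversed)
x, y, lam = bilateral_mixup([1.0, 0.0], [1, 0], [0.0, 1.0], [0, 1])
```

Because one parent always comes from the reversed sampler, minority classes appear in every mixed sample's soft label, which is how the augmentation counteracts the majority-class bias without a separate cumulative learning schedule.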
... where λ and β are empirically set to 1. We use FCOS [21] with ResNet-50 [4] backbone pre-trained on ImageNet [3] and FPN as our base model. All experiments are conducted based on Pytorch [12]. ...
Preprint
Most of the existing object detection works are based on the bounding box annotation: each object has a precise annotated box. However, for rib fractures, the bounding box annotation is very labor-intensive and time-consuming because radiologists need to investigate and annotate the rib fractures on a slice-by-slice basis. Although a few studies have proposed weakly-supervised methods or semi-supervised methods, they could not handle different forms of supervision simultaneously. In this paper, we proposed a novel omni-supervised object detection network, which can exploit multiple different forms of annotated data to further improve the detection performance. Specifically, the proposed network contains an omni-supervised detection head, in which each form of annotation data corresponds to a unique classification branch. Furthermore, we proposed a dynamic label assignment strategy for different annotated forms of data to facilitate better learning for each branch. Moreover, we also design a confidence-aware classification loss to emphasize the samples with high confidence and further improve the model's performance. Extensive experiments conducted on the testing dataset show our proposed method outperforms other state-of-the-art approaches consistently, demonstrating the efficacy of deep omni-supervised learning on improving rib fracture detection performance.
... In the era of big data, image data is being generated at an exponentially exploding rate, and it is a very important issue to organize and manage these large-scale data effectively and to explore their potential value for image-related applications [1][2][3][4], such as the classification and labelling of images [5,6]. Features occupy a very important position in image classification and directly affect its efficiency; likewise, research on quantum image feature extraction methods is critical. ...
Article
Full-text available
In digital image processing, feature extraction occupies a very important position and directly affects the quality of image classification or recognition. At present, effective quantum feature extraction methods are relatively lacking, and current methods are mainly devoted to extracting basic image features, failing to comprehensively consider the global features of both classical and quantum images. In this paper, we propose a dual quantum image feature extraction method named PSQIFE, which focuses on the global energy representation of images by constructing dual quantum image global features. The representation of the global features of the dual quantum image is obtained by quantum superposition of two parts of quantum-state features. Quantum image reconstruction and quantum image fidelity tests are performed on the global features extracted from 9 classes of classical images, and the overall fidelity is above 95%. In addition, the effectiveness of the PSQIFE dual quantum image feature extraction method is verified by comparing it with a convolutional feature extraction method in an image classification test on the MNIST dataset. The method offers a useful reference for research on quantum image feature extraction and classification.
... An example of AIThaiGen programming is shown in Figure 1. In this figure, the programming interface uses MobileNet (Sandler et al., 2018), which has been trained to recognize 1000 classes of objects in the ImageNet database (Deng et al., 2009). It is used as a feature extractor for transfer learning (Tan et al., 2018), and the deep feature vectors extracted from the user-created datasets are then used to construct a K-nearest neighbour classifier. ...
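The classification stage of the pipeline just described — deep features fed to a K-nearest neighbour vote — is simple enough to sketch directly. Loading a real pretrained network is out of scope here, so the 2-D "deep features" below are hypothetical stand-ins for MobileNet outputs:

```python
import math
from collections import Counter

def knn_predict(train_feats, train_labels, query, k=3):
    """Classify a feature vector by majority vote among its k nearest
    training features, using Euclidean distance."""
    dists = sorted(
        (math.dist(feat, query), label)
        for feat, label in zip(train_feats, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D "deep features"; a real pipeline would use the
# high-dimensional vectors produced by the pretrained MobileNet.
feats = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
labels = ["cat", "cat", "dog", "dog"]
print(knn_predict(feats, labels, (0.85, 0.85)))
```

Because the heavy lifting is done once by the frozen feature extractor, the KNN stage needs no training at all, which suits an educational platform where students build small datasets interactively.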
Article
Full-text available
Background Artificial intelligence (AI) has gained increasing popularity in human society, and it is important to educate people about this emerging technology. Many countries have adopted school curricula to incorporate AI into their classrooms. However, developing tools for discovering AI concepts remains challenging. There are few studies on AI education tools, particularly in Thailand. Objectives This study designs AIThaiGen, a web‐based learning platform for junior high school students that introduces AI concepts. It can communicate with remote hardware stations, allowing students to test their AI models in real‐world scenarios. Methods A total of 106 students in 7th and 8th grade in Thailand participated, and a single‐group pre‐test‐post‐test research design was employed in this study. Pre‐post‐tests on the basic concepts of AI and the students' attitude questionnaire on AIThaiGen were used to collect and analyse data. Results and Conclusions The results show that there is a significant improvement (p < 0.001) in the pre‐post‐tests on the basic concepts of AI, and the overall result of the students' attitude questionnaire on AIThaiGen is X̄ = 3.88, indicating positive outlooks. Furthermore, notable student projects are showcased, highlighting their ability to initiate new ideas for solving real problems after studying with AIThaiGen.
Article
Cracks are widespread in infrastructure that are closely related to human activity. It is very popular to use artificial intelligence to detect cracks intelligently, which is known as crack detection. The noise in the background of crack images, discontinuity of cracks and other problems make the crack detection task a huge challenge. Although many approaches have been proposed, there are still two challenges: (1) cracks are long and complex in shape, making it difficult to capture long-range continuity; (2) most of the images in the crack dataset have noise, and it is difficult to detect only the cracks and ignore the noise. In this paper, we propose a novel method called Transformer-based Multi-scale Fusion Model (TransMF) for crack detection, including an Encoder Module (EM), Decoder Module (DM) and Fusion Module (FM). The Encoder Module uses a hybrid of convolution blocks and Swin Transformer block to model the long-range dependencies of different parts in a crack image from a local and global perspective. The Decoder Module is designed with symmetrical structure to the Encoder Module. In the Fusion Module, the output in each layer with unique scales of Encoder Module and Decoder Module are fused in the form of convolution, which can release the effect of background noise and strengthen the correlations between relevant context in order to enhance the crack detection. Finally, the output of each layer of the Fusion Module is concatenated to achieve the purpose of crack detection. Extensive experiments on three benchmark datasets (CrackLS315, CRKWH100 and DeepCrack) demonstrate that the proposed TransMF in this paper exceeds the best performance of present baselines.
Article
Building segmentation for Unmanned Aerial Vehicle (UAV) imagery usually requires pixel-level labels, which are time-consuming and expensive to collect. Weakly supervised semantic segmentation methods for image-level labeling have recently achieved promising performance in natural scenes, but there have been few studies on UAV remote sensing imagery. In this paper, we propose a reliable label-supervised pixel attention mechanism for building segmentation in UAV imagery. Our method is based on the class activation map. However, classification networks tend to capture discriminative parts of the object and are insensitive to over-activation; therefore, class activation maps cannot directly guide segmentation network training. To overcome these challenges, we first design a Pixel Attention Module that captures rich contextual relationships, which can further mine more discriminative regions, in order to obtain a modified class activation map. Then, we use the initial seeds generated by the classification network to synthesize reliable labels. Finally, we design a reliable label loss, which is defined as the sum of the pixel-level differences between the reliable labels and the modified class activation map. Notably, the reliable label loss can handle over-activation. The preceding steps can significantly improve the quality of the pseudo-labels. Experiments on our home-made UAV data set indicate that our method can achieve 88.8% mIoU on the test set, outperforming previous state-of-the-art weakly supervised methods.
Article
Objectives Over the course of their treatment, patients often switch hospitals, requiring staff at the new hospital to import external imaging studies to their local database. In this study, the authors present MOdality Mapping and Orchestration (MOMO), a Deep Learning–based approach to automate this mapping process by combining metadata analysis and a neural network ensemble. Methods A set of 11,934 imaging series with existing anatomical labels was retrieved from the PACS database of the local hospital to train an ensemble of neural networks (DenseNet-161 and ResNet-152), which process radiological images and predict the type of study they belong to. We developed an algorithm that automatically extracts relevant metadata from imaging studies, regardless of their structure, and combines it with the neural network ensemble, forming a powerful classifier. A set of 843 anonymized external studies from 321 hospitals was hand-labeled to assess performance. We tested several variations of this algorithm. Results MOMO achieves 92.71% accuracy and 2.63% minor errors (at 99.29% predictive power) on the external study classification task, outperforming both a commercial product (82.86% accuracy, 1.36% minor errors, 96.20% predictive power) and a pure neural network ensemble (72.69% accuracy, 10.3% minor errors, 99.05% predictive power) performing the same task. We find that the highest performance is achieved by an algorithm that combines all information into one vote-based classifier. Conclusion Deep Learning combined with metadata matching is a promising and flexible approach for the automated classification of external DICOM studies for PACS archiving. Key Points • The algorithm can successfully identify 76 medical study types across seven modalities (CT, X-ray angiography, radiographs, MRI, PET (+CT/MRI), ultrasound, and mammograms). • The algorithm outperforms a commercial product performing the same task by a significant margin (> 9% accuracy gain). 
• The performance of the algorithm increases through the application of Deep Learning techniques.
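The MOMO abstract notes that its best variant "combines all information into one vote-based classifier". A minimal sketch of such a combiner follows; the labels, vote sources, and weights are assumptions for illustration, not MOMO's actual configuration:

```python
from collections import Counter

def vote_classify(predictions):
    """Combine weighted (label, weight) votes, e.g. from metadata-matching
    rules and individual neural-network ensemble members, into one decision."""
    tally = Counter()
    for label, weight in predictions:
        tally[label] += weight
    return tally.most_common(1)[0][0]

votes = [
    ("CT", 2.0),   # hypothetical metadata match, weighted higher
    ("CT", 1.0),   # hypothetical DenseNet-161 prediction
    ("MRI", 1.0),  # hypothetical ResNet-152 prediction
]
print(vote_classify(votes))
```

Weighting the metadata vote above any single network vote lets reliable DICOM header information dominate when present, while the ensemble still decides for studies whose metadata is missing or unstructured.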
Article
Full-text available
This report presents the results of the 2006 PASCAL Visual Object Classes Challenge (VOC2006). Details of the challenge, data, and evaluation are presented. Participants in the challenge submitted descriptions of their methods, and these have been included verbatim. This document should be considered preliminary, and subject to change.
Article
The Face Recognition Technology (FERET) program database is a large database of facial images, divided into development and sequestered portions. The development portion is made available to researchers, and the sequestered portion is reserved for testing face-recognition algorithms. The FERET evaluation procedure is an independently administered test of face-recognition algorithms. The test was designed to: (1) allow a direct comparison between different algorithms, (2) identify the most promising approaches, (3) assess the state of the art in face recognition, (4) identify future directions of research, and (5) advance the state of the art in face recognition.
Article
Current approaches to object category recognition require datasets of training images to be manually prepared, with varying degrees of supervision. We present an approach that can learn an object category from just its name, by utilizing the raw output of image search engines available on the Internet. We develop a new model, TSI-pLSA, which extends pLSA (as applied to visual words) to include spatial information in a translation and scale invariant manner. Our approach can handle the high intra-class variability and large proportion of unrelated images returned by search engines. We evaluate the models on standard test sets, showing performance competitive with existing methods trained on hand prepared datasets.