Figure 1
Architecture of RGNN. Layers such as pooling, ReLU, and softmax are omitted from the illustration for clarity. Example region proposals are depicted on the input image.
Contexts in source publication
Context 1
... gate units and the classifier are integrated in one deep neural network, so all the parameters can be learned together. Figure 1 illustrates the architecture of the proposed RGNN. To the best of our knowledge, we are the first to impose a gate unit on a CNN framework for contextual region selection and multi-label image classification, with multi-task optimization in an end-to-end manner. ...
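To make the gating idea concrete, here is a minimal PyTorch sketch of a region gating unit: each region feature produced by a shared CNN is scaled by a learned sigmoid gate before the regions are pooled into an image-level representation for multi-label classification. The module name, dimensions, and max-pooling aggregation are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class RegionGate(nn.Module):
    """Sketch of a region gating unit: each region feature is scaled by a
    learned scalar gate in [0, 1] before aggregation, so uninformative
    regions contribute little to the image-level prediction."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (num_regions, feat_dim) from a shared CNN backbone
        g = self.gate(region_feats)                    # (num_regions, 1) gates
        pooled = (g * region_feats).max(dim=0).values  # pool gated regions
        return self.classifier(pooled)                 # multi-label logits
```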
Similar publications
Image classification is one of the major data mining tasks in smart city applications. However, deploying classification models with good generalization accuracy is crucial for reliable decision-making in such applications. One way to achieve good generalization accuracy is through the use of multiple classifiers and the fusion...
Retinal swelling due to the accumulation of fluid is associated with the most vision-threatening retinal diseases. Optical coherence tomography (OCT) is the current standard of care in assessing the presence and quantity of retinal fluid and image-guided treatment management. Deep learning methods have made their impact across medical imaging and m...
Recently, Convolutional Neural Networks (CNNs) have achieved remarkable success in computer vision and image processing tasks. Traditional CNN models work directly with spatial-domain images. On the other hand, images obtained with the Fast Fourier Transform (FFT) represent the frequency domain and provide an advantage in computationa...
Convolutional networks are at the core of state-of-the-art computer vision applications for a wide variety of tasks. Since 2014, a substantial amount of work has gone into designing better convolutional architectures, yielding considerable gains on various benchmarks. Although increased model size and computational cost tend to translate into immediate quali...
Citations
... To date, many multi-label learning approaches for image classification have been proposed [6], [9], [10], [11], [12], [13], [14], [15]. Simply put, multi-label image classification can be achieved by casting the task into several binary classification subproblems, where each subproblem predicts whether the image is relevant to the corresponding label. ...
In this paper, we present a novel deep metric learning method to tackle the multi-label image classification problem. In order to better learn the correlations among image features, as well as labels, we attempt to explore a latent space, where images and labels are embedded via two unique deep neural networks, respectively. To capture the relationships between image features and labels, we aim to learn a two-way deep distance metric over the embedding space from two different views, i.e., the distance between one image and its labels is not only smaller than the distances between the image and its labels' nearest neighbors, but also smaller than the distances between the labels and other images corresponding to the labels' nearest neighbors. Moreover, a reconstruction module for recovering correct labels is incorporated into the whole framework as a regularization term, such that the label embedding space is more representative. Our model can be trained in an end-to-end manner. Experimental results on publicly available image datasets corroborate the efficacy of our method compared with state-of-the-art methods.
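As a rough illustration of the two-way constraint (not the paper's exact formulation), the two views can be written as a pair of triplet-style margin losses over the learned embeddings; the function name, margin, and the choice of negatives are assumptions.

```python
import torch.nn.functional as F

def two_way_margin_loss(img, pos_lab, neg_lab, neg_img, margin=1.0):
    """Illustrative two-way triplet loss: the image-to-true-label distance
    should be smaller than (a) the distance from the image to a neighboring
    wrong label and (b) the distance from the true label to a mismatched
    image. All inputs are (batch, dim) embedding tensors."""
    d_pos = F.pairwise_distance(img, pos_lab)
    view1 = F.relu(d_pos - F.pairwise_distance(img, neg_lab) + margin)      # image-side view
    view2 = F.relu(d_pos - F.pairwise_distance(pos_lab, neg_img) + margin)  # label-side view
    return (view1 + view2).mean()
```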
... The aim of this approach is to highlight the important feature dimensions and prune feature responses to preserve only the activations relevant to the specific task. Another approach is inspired by the gating mechanism used in long short-term memory (LSTM) networks [304,305]. That is, the gate unit in KD is elaborately designed to remember features across different image regions and to control the passage of each region feature as a whole according to its contribution to the task (e.g., classification), weighted by importance. ...
Deep neural models in recent years have been successful in almost every field, including extremely complex problem statements. However, these models are huge in size, with millions (and even billions) of parameters, thus demanding heavy computation power and failing to be deployed on edge devices. Besides, the performance boost is highly dependent on redundant labeled data. To achieve faster speeds and to handle the problems caused by the lack of data, knowledge distillation (KD) has been proposed to transfer information learned from one model to another. KD is often characterized by the so-called 'Student-Teacher' (S-T) learning framework and has been broadly applied in model compression and knowledge transfer. This paper is about KD and S-T learning, which have been actively studied in recent years. First, we aim to provide explanations of what KD is and how/why it works. Then, we provide a comprehensive survey on the recent progress of KD methods together with S-T frameworks, typically for vision tasks. In general, we consider some fundamental questions that have been driving this research area and thoroughly generalize the research progress and technical details. Additionally, we systematically analyze the research status of KD in vision applications. Finally, we discuss the potentials and open challenges of existing methods and prospect the future directions of KD and S-T learning.
... Similarly, [10] automatically selected relevant image regions from global image labels using weakly supervised learning. Zhao et al. [67] reduced irrelevant and noisy regions with the help of a region gating module. These region-proposal-based methods usually suffer from redundant computation and sub-optimal performance. ...
Intraclass compactness and interclass separability are crucial indicators to measure the effectiveness of a model to produce discriminative features, where intraclass compactness indicates how close the features with the same label are to each other and interclass separability indicates how far away the features with different labels are. In this paper, we investigate intraclass compactness and interclass separability of features learned by convolutional networks and propose a Gaussian-based softmax (G-softmax) function that can effectively improve intraclass compactness and interclass separability. The proposed function is simple to implement and can easily replace the softmax function. We evaluate the proposed G-softmax function on classification data sets (i.e., CIFAR-10, CIFAR-100, and Tiny ImageNet) and on multilabel classification data sets (i.e., MS COCO and NUS-WIDE). The experimental results show that the proposed G-softmax function improves the state-of-the-art models across all evaluated data sets. In addition, the analysis of the intraclass compactness and interclass separability demonstrates the advantages of the proposed function over the softmax function, which is consistent with the performance improvement. More importantly, we observe that high intraclass compactness and interclass separability are linearly correlated with average precision on MS COCO and NUS-WIDE. This implies that the improvement of intraclass compactness and interclass separability would lead to the improvement of average precision.
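For intuition only, here is a hedged sketch of what a Gaussian-based softmax layer could look like: each class logit passes through a per-class Gaussian CDF with learnable mean and variance before the usual softmax normalization. This is one plausible reading of the abstract, not the paper's verified definition of G-softmax.

```python
import torch
import torch.nn as nn

class GaussianSoftmax(nn.Module):
    """Hypothetical Gaussian-based softmax: squash each logit through a
    per-class Gaussian CDF (learnable mean/variance), then normalize."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(num_classes))
        self.log_sigma = nn.Parameter(torch.zeros(num_classes))

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        sigma = self.log_sigma.exp()
        # Gaussian CDF via the error function, applied per class
        cdf = 0.5 * (1.0 + torch.erf((logits - self.mu) / (sigma * 2 ** 0.5)))
        return torch.softmax(cdf, dim=-1)
```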
... Then it clustered the regions into groups and kept only one representative region in each group, which can also be regarded as a variant of the max pooling scheme. Zhao et al. [46] adopted a multi-scale max pooling scheme on differently sized objects selected by gating networks. Wang et al. [36] used an RNN coupled with CNN features to exploit label interactions for multi-label classification. ...
... For example, commonly used object proposal algorithms such as selective search [35], edge boxes [47], and RPN [29] can generate several thousand box proposals per image. In this work, we adopted the top 1000 ranked proposals returned by the edge boxes algorithm, which has been widely used in recent related work [38,46]. Given the proposal bounding boxes, we can further denote the collection of the cropped region image data as ...
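For reference, a hedged sketch of this proposal pipeline using OpenCV's ximgproc implementation of Edge Boxes is shown below. The model and image paths are placeholders, and the (boxes, scores) return value of getBoundingBoxes applies to recent OpenCV versions (older versions return boxes only).

```python
import cv2
import numpy as np

# Edge Boxes needs a pretrained structured-edge-detection model;
# "sed_model.yml.gz" is a placeholder path.
sed = cv2.ximgproc.createStructuredEdgeDetection("sed_model.yml.gz")
img = cv2.imread("image.jpg")
rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
edges = sed.detectEdges(rgb)
orientation = sed.computeOrientation(edges)
edges = sed.edgesNms(edges, orientation)

eb = cv2.ximgproc.createEdgeBoxes()
eb.setMaxBoxes(1000)  # keep the top 1000 ranked proposals
boxes, scores = eb.getBoundingBoxes(edges, orientation)  # each box is (x, y, w, h)
crops = [img[y:y + h, x:x + w] for x, y, w, h in boxes]  # cropped region image data
```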
... The multi-task learning strategy has shown its advantages in existing object-feature-based image classification tasks [46]. We design the grid labeling loss as an extra clue in the network to help the training of the FCN and LSTM classifiers, in addition to the original image classification loss. ...
The spatial relationships among objects provide rich clues to object context for visual recognition. In this paper, we propose to learn a Semantic Feature Map (SFM) with deep neural networks to model spatial object context for better understanding of image and video content. Specifically, we first extract high-level semantic object features on the input image with convolutional neural networks for every object proposal, and organize them into the designed SFM so that spatial information among objects is preserved. To fully exploit the spatial relationships among objects, we employ either Fully Convolutional Networks (FCN) or Long Short-Term Memory (LSTM) on top of the SFM for final recognition. For better training, we also introduce a multi-task learning framework to train the model in an end-to-end manner. It is composed of an overall image classification loss as well as a grid labeling loss, which predicts the object label at each SFM grid cell. Extensive experiments are conducted to verify the effectiveness of the proposed approach. For image classification, very promising results are obtained on the Pascal VOC 2007/2012 and MS-COCO benchmarks. We also directly transfer the SFM learned in the image domain to the video classification task. The results on the CCV benchmark demonstrate the robustness and generalization capability of the proposed approach.
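A minimal sketch of how per-proposal features might be scattered onto such a grid, assuming box-center assignment and elementwise max pooling for collisions (neither is specified in the abstract):

```python
import torch

def build_sfm(feats: torch.Tensor, boxes: torch.Tensor,
              img_w: int, img_h: int, grid: int = 8) -> torch.Tensor:
    """Place each proposal's feature vector into the grid cell containing
    its box center so the spatial layout of objects is preserved.
    feats: (N, D) proposal features; boxes: (N, 4) as (x, y, w, h)."""
    sfm = torch.zeros(feats.size(1), grid, grid)
    cx = (boxes[:, 0] + boxes[:, 2] / 2) / img_w  # normalized box centers
    cy = (boxes[:, 1] + boxes[:, 3] / 2) / img_h
    for i in range(feats.size(0)):
        gx = min(int(cx[i] * grid), grid - 1)
        gy = min(int(cy[i] * grid), grid - 1)
        sfm[:, gy, gx] = torch.max(sfm[:, gy, gx], feats[i])  # max-pool collisions
    return sfm  # feed to an FCN or LSTM head for recognition
```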
... Combined with CNNs, the proposed framework learns a joint image-label embedding to characterize both semantic label dependency and image label relevance. Zhao et al. [37] developed a regional gating neural network framework. Candidate image regions are fed to a shared CNN to produce regional representation. ...
Traditional flat classification methods (e.g., binary or multi-class classification) neglect the structural information between different classes. In contrast, Hierarchical Multi-label Classification (HMC) considers the structural information embedded in the class hierarchy and uses it to improve classification performance. In this paper, we propose a local hierarchical ensemble framework for HMC, Fully Associative Ensemble Learning (FAEL). We model the relationship between each class node's global prediction and the local predictions of all the class nodes as a multi-variable regression problem with Frobenius norm or l1 norm regularization. It can be extended using the kernel trick, which explores the complex correlation between global and local predictions. In addition, we introduce a binary constraint model to restrict the optimal weight matrix learning. The proposed models have been applied to image annotation and gene function prediction datasets with a tree-structured class hierarchy, and to a large-scale visual recognition dataset with a Directed Acyclic Graph (DAG) structured class hierarchy. The experimental results indicate that our models achieve better performance when compared with other baseline methods.
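As a hedged sketch of the Frobenius-norm-regularized variant, the weight matrix admits a ridge-regression-style closed form; the matrix names and shapes below are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def fit_fael_weights(P_local: np.ndarray, Y_global: np.ndarray,
                     lam: float = 1.0) -> np.ndarray:
    """Solve min_W ||P_local @ W - Y_global||_F^2 + lam * ||W||_F^2.
    P_local: (n_samples, n_classes) stacked local node predictions;
    Y_global: (n_samples, n_classes) target global predictions."""
    d = P_local.shape[1]
    return np.linalg.solve(P_local.T @ P_local + lam * np.eye(d),
                           P_local.T @ Y_global)
```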
... Since IF-DLDL does not need region proposals or bounding box information, it may be effectively and efficiently implemented for practical multi-label applications such as multi-label image retrieval [47]. It is also possible that, by adopting new techniques (such as the region proposal method using a gated unit in [48], which has higher accuracy than ours on VOC tasks), the accuracy of our DLDL methods can be further improved. ...
Convolutional Neural Networks (ConvNets) have achieved excellent recognition performance in various visual recognition tasks. A large labeled training set is one of the most important factors for its success. However, it is difficult to collect sufficient training images with precise labels in some domains such as apparent age estimation, head pose estimation, multi-label classification and semantic segmentation. Fortunately, there is ambiguous information among labels, which makes these tasks different from traditional classification. Based on this observation, we convert the label of each image into a discrete label distribution, and learn the label distribution by minimizing a Kullback-Leibler divergence between the predicted and ground-truth label distributions using deep ConvNets. The proposed DLDL (Deep Label Distribution Learning) method effectively utilizes the label ambiguity in both feature learning and classifier learning, which prevents the network from over-fitting even when the training set is small. Experimental results show that the proposed approach produces significantly better results than state-of-the-art methods for age estimation and head pose estimation. At the same time, it also improves recognition performance for multi-label classification and semantic segmentation tasks.
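A minimal sketch of the DLDL objective as described: the KL divergence between the predicted label distribution and a ground-truth label distribution (how the target distribution is built, e.g. a discretized Gaussian around the true age, is task-specific).

```python
import torch
import torch.nn.functional as F

def dldl_loss(logits: torch.Tensor, target_dist: torch.Tensor) -> torch.Tensor:
    """KL divergence between the predicted distribution (softmax over
    logits) and the ground-truth label distribution.
    target_dist rows must sum to 1."""
    log_pred = F.log_softmax(logits, dim=-1)
    return F.kl_div(log_pred, target_dist, reduction="batchmean")
```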
... While the earlier works ignore explicit search for object location [80], [81], or require bounding box annotation [82]- [84], recent results indicate that effective classifiers for images with multiple objects in cluttered scenes can be trained from weak image-level annotation by explicitly searching over multiple scales and locations [85]- [89]. Our multilabel setup follows closely the pipeline of [87] with a few exceptions detailed in § 6.3. ...
... At test time, we obtain a single ranking of class labels per image by max pooling the scores for each class. We follow this basic setup, but note that a 1–2% improvement is possible with a more sophisticated aggregation of information from the different image regions [84], [87]. ...
... Finally, we note that the current state-of-the-art classification results on VOC 2007 are reported in [84], [87], [89]. Our 91.8% mAP with SVM ML exactly matches the result of LSSVM-Max in [87], which operates in the setting closest to ours in terms of image representation and learning architecture. ...
Top-k error is currently a popular performance measure on large scale image classification benchmarks such as ImageNet and Places. Despite its wide acceptance, our understanding of this metric is limited as most of the previous research is focused on its special case, the top-1 error. In this work, we explore two directions that shed more light on the top-k error. First, we provide an in-depth analysis of established and recently proposed single-label multiclass methods along with a detailed account of efficient optimization algorithms for them. Our results indicate that the softmax loss and the smooth multiclass SVM are surprisingly competitive in top-k error uniformly across all k, which can be explained by our analysis of multiclass top-k calibration. Further improvements for a specific k are possible with a number of proposed top-k loss functions. Second, we use the top-k methods to explore the transition from multiclass to multilabel learning. In particular, we find that it is possible to obtain effective multilabel classifiers on Pascal VOC using a single label per image for training, while the gap between multiclass and multilabel methods on MS COCO is more significant. Finally, our contribution of efficient algorithms for training with the considered top-k and multilabel loss functions is of independent interest.
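For concreteness, the top-k error itself takes only a few lines to compute; this sketch is a straightforward reading of the metric, not code from the paper.

```python
import torch

def topk_error(logits: torch.Tensor, labels: torch.Tensor, k: int = 5) -> float:
    """Fraction of samples whose true label is not among the k
    highest-scoring classes. logits: (N, C); labels: (N,)."""
    topk = logits.topk(k, dim=1).indices           # (N, k) predicted class ids
    hit = (topk == labels.unsqueeze(1)).any(dim=1)
    return 1.0 - hit.float().mean().item()
```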
... In this way, it works like a gate that can prevent noisy regions of the feature representation from misleading the classifiers. A similar strategy has been found effective on clean images for multi-label image classification [51]. ...
Large-scale datasets have driven the rapid development of deep neural networks for visual recognition. However, annotating a massive dataset is expensive and time-consuming. Web images and their labels are, in comparison, much easier to obtain, but direct training on such automatically harvested images can lead to unsatisfactory performance, because the noisy labels of Web images adversely affect the learned recognition models. To address this drawback we propose an end-to-end weakly-supervised deep learning framework which is robust to the label noise in Web images. The proposed framework relies on two unified strategies -- random grouping and attention -- to effectively reduce the negative impact of noisy web image annotations. Specifically, random grouping stacks multiple images into a single training instance and thus increases the labeling accuracy at the instance level. Attention, on the other hand, suppresses the noisy signals from both incorrectly labeled images and less discriminative image regions. By conducting intensive experiments on two challenging datasets, including a newly collected fine-grained dataset with Web images of different car models, the superior performance of the proposed methods over competitive baselines is clearly demonstrated.
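A minimal sketch of the random grouping strategy as described, assuming multi-hot label vectors and an elementwise-max label union; the group size and the stacking layout are assumptions beyond the abstract.

```python
import random
import torch

def random_group(images, labels, group_size: int = 2):
    """Stack several randomly chosen images into one training instance and
    take the union of their multi-hot labels, so at least one image in the
    group is likely to match each retained label.
    images: list of (C, H, W) tensors; labels: list of multi-hot vectors."""
    idx = random.sample(range(len(images)), group_size)
    grouped_image = torch.stack([images[i] for i in idx])                # (G, C, H, W)
    grouped_label = torch.stack([labels[i] for i in idx]).max(0).values  # label union
    return grouped_image, grouped_label
```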