To read the full-text of this research, you can request a copy directly from the authors.


Deep neural networks (DNNs) have been very successful for supervised learning. However, their high generalization performance often comes with the high cost of annotating data manually. Collecting low-quality labeled dataset is relatively cheap, e.g., using web search engines, while DNNs tend to overfit to corrupted labels easily. In this paper, we propose a collaborative learning (co-learning) approach to improve the robustness and generalization performance of DNNs on datasets with corrupted labels. This is achieved by designing a deep network with two separate branches, coupled with a relabelling mechanism. Co-learning could safely recover the true labels of most mislabeled samples, not only preventing the model from overfitting the noise, but also exploiting useful information from all the samples. Although being very simple, the proposed algorithm is able to achieve high generalization performance even a large portion of the labels are corrupted. Experiments show that co-learning consistently outperforms existing state-of-the-art methods on three widely used benchmark datasets.

No full-text available

Request Full-text Paper PDF

To read the full-text of this research,
you can request a copy directly from the authors.

... When the learned models disagree on predictions for the label of a sample, this is considered as a sign that the label of this sample may be noisy. When the models used are diverse enough, these methods are often found to be quite efficient [17,46]. However these algorithms suffer from learning their own biases and diversity needs to be introduced in the learning procedure. ...
... Using algorithms from different classes of models and different origins can increase the diversity among them by introducing more source of biases [28]. Alternating between learning from the data and from the other models is another way to combat the reinforcement of the models' biases [46]. These algorithms rely on carefully made heuristics to be efficient. ...
... -CoLearning (CoL) [46] is a good representative of the family of collaborative learning algorithms. It uses disagreements criteria to detect noisy labels and is tailored for end-to-end deep learning where the two models are branches of a larger neural networks. ...
Full-text available
This paper has been accepted at the IAL@ECML Workshop 2021 ( -------- "In this paper we show that the combination of a Contrastive representation with a label noise-robust classification head requires fine-tuning the representation in order to achieve state-of-the-art performances. Since fine-tuned representations are shown to outperform frozen ones, one can conclude that noise-robust classification heads are indeed able to promote meaningful representations if provided with a suitable starting point. Experiments are conducted to draw a comprehensive picture of performances by featuring six methods and nine noise instances of three different kinds (none, symmetric, and asymmetric). In presence of noise the experiments show that fine tuning of Contrastive representation allows the six methods to achieve better results than end-to-end learning and represent a new reference compare to the recent state of art. Results are also remarkable stable versus the noise level."
... Secondly, the state-of-the-art deep learning techniques failed to reliably classify the thermogram images into different classes, which were graded based on the TCI score. If a database is publicly available, it is easy to re-evaluate the labels in datasets if it is found that the labels are questionable [46]. Aradillas et al. [47] mentioned scenarios where they found errors in the labeling of the training samples in databases and proposed cross-validation techniques to remove them. ...
Full-text available
Diabetes mellitus (DM) is one of the most prevalent diseases in the world, and is correlated to a high index of mortality. One of its major complications is diabetic foot, leading to plantar ulcers, amputation, and death. Several studies report that a thermogram helps to detect changes in the plantar temperature of the foot, which may lead to a higher risk of ulceration. However, in diabetic patients, the distribution of plantar temperature does not follow a standard pattern, thereby making it difficult to quantify the changes. The abnormal temperature distribution in infrared (IR) foot thermogram images can be used for the early detection of diabetic foot before ulceration to avoid complications. There is no machine learning-based technique reported in the literature to classify these thermograms based on the severity of diabetic foot complications. This paper uses an available labeled diabetic thermogram dataset and uses the k-mean clustering technique to cluster the severity risk of diabetic foot ulcers using an unsupervised approach. Using the plantar foot temperature, the new clustered dataset is verified by expert medical doctors in terms of risk for the development of foot ulcers. The newly labeled dataset is then investigated in terms of robustness to be classified by any machine learning network. Classical machine learning algorithms with feature engineering and a convolutional neural network (CNN) with image-enhancement techniques are investigated to provide the best-performing network in classifying thermograms based on severity. It is found that the popular VGG 19 CNN model shows an accuracy, precision, sensitivity, F1-score, and specificity of 95.08%, 95.08%, 95.09%, 95.08%, and 97.2%, respectively, in the stratification of severity. A stacking classifier is proposed using extracted features of the thermogram, which is created using the trained gradient boost classifier, XGBoost classifier, and random forest classifier. This provides a comparable performance of 94.47%, 94.45%, 94.47%, 94.43%, and 93.25% for accuracy, precision, sensitivity, F1-score, and specificity, respectively.
Conference Paper
Full-text available
Supervised learning depends on annotated examples, which are taken to be the ground truth. But these labels often come from noisy crowdsourcing platforms, like Amazon Mechanical Turk. Practitioners typically collect multiple labels per example and aggregate the results to mitigate noise (the classic crowdsourcing problem). Given a fixed annotation budget and unlimited unlabeled data, redundant annotation comes at the expense of fewer labeled examples. This raises two fundamental questions: (1) How can we best learn from noisy workers? (2) How should we allocate our labeling budget to maximize the performance of a classifier? We propose a new algorithm for jointly modeling labels and worker quality from noisy crowd-sourced data. The alternating minimization proceeds in rounds, estimating worker quality from disagreement with the current model and then updating the model by optimizing a loss function that accounts for the current estimate of worker quality. Unlike previous approaches, even with only one annotation per example, our algorithm can estimate worker quality. We establish a generalization error bound for models learned with our algorithm and establish theoretically that it's better to label many examples once (vs less multiply) when worker quality exceeds a threshold. Experiments conducted on both ImageNet (with simulated noisy workers) and MS-COCO (using the real crowdsourced labels) confirm our algorithm's benefits.
Full-text available
Brain–computer interfaces (BCIs), which control external equipment using cerebral activity, have received considerable attention recently. Translating brain activities measured by electroencephalography (EEG) into correct control commands is a critical problem in this field. Most existing EEG decoding methods separate feature extraction from classification and thus are not robust across different BCI users. In this paper, we propose to learn subject-specific features jointly with the classification rule. We develop a deep convolutional network (ConvNet) to decode EEG signals end-to-end by stacking time–frequency transformation, spatial filtering, and classification together. Our proposed ConvNet implements a joint space–time–frequency feature extraction scheme for EEG decoding. Morlet wavelet-like kernels used in our network significantly reduce the number of parameters compared with classical convolutional kernels and endow the features learned at the corresponding layer with a clear interpretation, i.e. spectral amplitude. We further utilize subject-to-subject weight transfer, which uses parameters of the networks trained for existing subjects to initialize the network for a new subject, to solve the dilemma between a large number of demanded data for training deep ConvNets and small labeled data collected in BCIs. The proposed approach is evaluated on three public data sets, obtaining superior classification performance compared with the state-of-the-art methods.
Full-text available
In recent years, deep learning has been a revolution in the field of machine learning, for computer vision in particular. In this approach, a deep (multilayer) artificial neural network (ANN) is trained in a supervised manner using backpropagation. Huge amounts of labeled examples are required, but the resulting classification accuracy is truly impressive, sometimes outperforming humans. Neurons in an ANN are characterized by a single, static, continuous-valued activation. Yet biological neurons use discrete spikes to compute and transmit information, and the spike times, in addition to the spike rates, matter. Spiking neural networks (SNNs) are thus more biologically realistic than ANNs, and arguably the only viable option if one wants to understand how the brain computes. SNNs are also more hardware friendly and energy-efficient than ANNs, and are thus appealing for technology, especially for portable devices. However, training deep SNNs remains a challenge. Spiking neurons' transfer function is usually non-differentiable, which prevents using backpropagation. Here we review recent supervised and unsupervised methods to train deep SNNs, and compare them in terms of accuracy, but also computational cost and hardware friendliness. The emerging picture is that SNNs still lag behind ANNs in terms of accuracy, but the gap is decreasing, and can even vanish on some tasks, while the SNNs typically require much fewer operations.
Full-text available
The growing importance of massive datasets with the advent of deep learning makes robustness to label noise a critical property for classifiers to have. Sources of label noise include automatic labeling for large datasets, non-expert labeling, and label corruption by data poisoning adversaries. In the latter case, corruptions may be arbitrarily bad, even so bad that a classifier predicts the wrong labels with high confidence. To protect against such sources of noise, we leverage the fact that a small set of clean labels is often easy to procure. We demonstrate that robustness to label noise up to severe strengths can be achieved by using a set of trusted data with clean labels, and propose a loss correction that utilizes trusted examples in a data-efficient manner to mitigate the effects of label noise on deep neural network classifiers. Across vision and natural language processing tasks, we experiment with various label noises at several strengths, and show that our method significantly outperforms existing methods.
Full-text available
This paper addresses the limitations of previous training methods that emphasize either easy examples like self-paced learning or difficult examples like hard example mining. Inspired by active learning, we propose two alternatives to re-weight training samples based on lightweight estimates of sample uncertainty in stochastic gradient descent (SGD): the variance in predicted probability of the correct class across iterations of mini-batch SGD, and the proximity of the correct class probability to the decision threshold (or threshold closeness). Extensive experimental results on multiple datasets show that our methods reliably improve accuracy in various network architectures, including providing additional gains on top of other popular training tools, such as ADAM, dropout, and distillation.
Conference Paper
Full-text available
Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections - one between each layer and its subsequent layer - our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less memory and computation to achieve high performance. Code and models are available at
Full-text available
The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.
Full-text available
In this paper, we study a classification problem in which sample labels are randomly corrupted. In this scenario, there is an unobservable sample with noise-free labels. However, before being observed, the true labels are independently flipped with a probability $\rho\in[0,0.5)$, and the random label noise can be class-conditional. Here, we address two fundamental problems raised by this scenario. The first is how to best use the abundant surrogate loss functions designed for the traditional classification problem when there is label noise. We prove that any surrogate loss function can be used for classification with noisy labels by using importance reweighting, with consistency assurance that the label noise does not ultimately hinder the search for the optimal classifier of the noise-free sample. The other is the open problem of how to obtain the noise rate $\rho$. We show that the rate is upper bounded by the conditional probability $P(y|x)$ of the noisy sample. Consequently, the rate can be estimated, because the upper bound can be easily reached in classification problems. Experimental results on synthetic and real datasets confirm the efficiency of our methods.
Full-text available
Label noise is an important issue in classification, with many potential negative consequences. For example, the accuracy of predictions may decrease, whereas the complexity of inferred models and the number of necessary training samples may increase. Many works in the literature have been devoted to the study of label noise and the development of techniques to deal with label noise. However, the field lacks a comprehensive survey on the different types of label noise, their consequences and the algorithms that consider label noise. This paper proposes to fill this gap. First, the definitions and sources of label noise are considered and a taxonomy of the types of label noise is proposed. Second, the potential consequences of label noise are discussed. Third, label noise-robust, label noise cleansing, and label noise-tolerant algorithms are reviewed. For each category of approaches, a short discussion is proposed to help the practitioner to choose the most suitable technique in its own particular field of application. Eventually, the design of experiments is also discussed, what may interest the researchers who would like to test their own algorithms. In this paper, label noise consists of mislabeled instances: no additional information is assumed to be available like e.g., confidences on labels.
Conference Paper
Full-text available
The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called ldquoImageNetrdquo, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
Conference Paper
Full-text available
Humans and animals learn much better when the examples are not randomly presented but organized in a meaningful order which illustrates gradually more concepts, and gradually more complex ones. Here, we formalize such training strategies in the context of machine learning, and call them "curriculum learning". In the context of recent research studying the difficulty of training in the presence of non-convex training criteria (for deep deterministic and stochastic neural networks), we explore curriculum learning in various set-ups. The experiments show that significant improvements in generalization can be achieved. We hypothesize that curriculum learning has both an effect on the speed of convergence of the training process to a minimum and, in the case of non-convex criteria, on the quality of the local minima obtained: curriculum learning can be seen as a particular form of continuation method (a general strategy for global optimization of non-convex functions).
Learning an effective visual classifier from few labeled samples is a challenging problem, which has motivated the multi-source adaptation scheme in machine learning. While the advantages of multi-source adaptation have been widely recognized, there still exit three major limitations in extant methods. Firstly, how to effectively select the discriminative sources is yet an unresolved issue. Secondly, multiple different visual features on hand cannot be effectively exploited to represent a target object for boosting the adaptation performance. Last but not least, they mainly focus on either visual understanding or feature learning independently, which may lead to the so-called semantic gap between the low-level features and the high-level semantics. To overcome these defects, we propose a novel Multi-source Adaptation Multi-Feature (MAMF) co-regression framework by jointly exploring multi-feature co-regression, multiple latent spaces learning, and discriminative sources selection. Concretely, MAMF conducts the multi-feature representation co-regression with feature learning by simultaneously uncovering multiple optimal latent spaces and taking into account correlations among multiple feature representations. Moreover, to discriminatively leverage multi-source knowledge for each target feature representation, MAMF automatically selects the discriminative source models trained on source datasets by formulating it as a row-sparsity pursuit problem. Different from the state-of-the-arts, our method is able to adapt knowledge from multiple sources even if the features of each source and the target are partially different but overlapping. Experimental results on three challenging visual domain adaptation tasks consistently demonstrate the superiority of our method in comparison with the related state-of-the-arts.
Conference Paper
Recent deep networks are capable of memorizing the entire data even when the labels are completely random. To overcome the overfitting on corrupted labels, we propose a novel technique of learning another neural network, called MentorNet, to supervise the training of the base deep networks, namely, StudentNet. During training, MentorNet provides a curriculum (sample weighting scheme) for StudentNet to focus on the sample the label of which is probably correct. Unlike the existing curriculum that is usually predefined by human experts, MentorNet learns a data-driven curriculum dynamically with StudentNet. Experimental results demonstrate that our approach can significantly improve the generalization performance of deep networks trained on corrupted training data. Notably, to the best of our knowledge, we achieve the best-published result on WebVision, a large benchmark containing 2.2 million images of real-world noisy labels.
Conference Paper
Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small difference between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model fam- ily, or to the regularization techniques used during training. Through extensive systematic experiments, we show how these traditional ap- proaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a ran- dom labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if we replace the true images by com- pletely unstructured random noise. We corroborate these experimental findings with a theoretical construction showing that simple depth two neural networks al- ready have perfect finite sample expressivity as soon as the number of parameters exceeds the number of data points as it usually does in practice. We interpret our experimental findings by comparison with traditional models.
Conference Paper
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [7] and Fast R-CNN [5] have reduced the running time of these detection networks, exposing region pro-posal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolu-tional features. For the very deep VGG-16 model [18], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
Collecting large training datasets, annotated with high quality labels, is a costly process. This paper proposes a novel framework for training deep convolutional neural networks from noisy labeled datasets. The problem is formulated using an undirected graphical model that represents the relationship between noisy and clean labels, trained in a semi-supervised setting. In the proposed structure, the inference over latent clean labels is tractable and is regularized during training using auxiliary sources of information. The proposed model is applied to the image labeling problem and is shown to be effective in labeling unseen images as well as reducing label noise in training on CIFAR-10 and MS COCO datasets.
Conference Paper
Convolutional networks are at the core of most stateof-the-art computer vision solutions for a wide variety of tasks. Since 2014 very deep convolutional networks started to become mainstream, yielding substantial gains in various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and low parameter count are still enabling factors for various use cases such as mobile vision and big-data scenarios. Here we are exploring ways to scale up networks in ways that aim at utilizing the added computation as efficiently as possible by suitably factorized convolutions and aggressive regularization. We benchmark our methods on the ILSVRC 2012 classification challenge validation set demonstrate substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single frame evaluation using a network with a computational cost of 5 billion multiply-adds per inference and with using less than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error and 17.3% top-1 error.
Detecting and reading text from natural images is a hard computer vision task that is central to a variety of emerging applications. Related problems like document character recognition have been widely studied by computer vision and machine learning researchers and are virtually solved for practical applications like reading handwritten digits. Reliably recognizing characters in more complex scenes like photographs, however, is far more difficult: the best existing methods lag well behind human performance on the same tasks. In this paper we attack the prob-lem of recognizing digits in a real application using unsupervised feature learning methods: reading house numbers from street level photos. To this end, we intro-duce a new benchmark dataset for research use containing over 600,000 labeled digits cropped from Street View images. We then demonstrate the difficulty of recognizing these digits when the problem is approached with hand-designed fea-tures. Finally, we employ variants of two recently proposed unsupervised feature learning methods and find that they are convincingly superior on our benchmarks.
In recent years, deep neural networks (including recurrent ones) have won numerous contests in pattern recognition and machine learning. This historical survey compactly summarises relevant work, much of it from the previous millennium. Shallow and deep learners are distinguished by the depth of their credit assignment paths, which are chains of possibly learnable, causal links between actions and effects. I review deep supervised learning (also recapitulating the history of backpropagation), unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short programs encoding deep and large networks.
The accuracy of active learning is critically influenced by the existence of noisy labels given by a noisy oracle. In this paper, we propose a novel pool-based active learning framework through robust measures based on density power divergence. By minimizing density power divergence, such as β-divergence and γ-divergence, one can estimate the model accurately even under the existence of noisy labels within data. Accordingly, we develop query selecting measures for pool-based active learning using these divergences. In addition, we propose an evaluation scheme for these measures based on asymptotic statistical analyses, which enables us to perform active learning by evaluating an estimation error directly. Experiments with benchmark datasets and real-world image datasets show that our active learning scheme performs better than several baseline methods.
In this paper, we first present a self-training semi-supervised support vector machine (SVM) algorithm and its corresponding model selection method, which are designed to train a classifier with small training data. Next, we prove the convergence of this algorithm. Two examples are presented to demonstrate the validity of our algorithm with model selection. Finally, we apply our algorithm to a data set collected from a P300-based brain computer interface (BCI) speller. This algorithm is shown to be able to significantly reduce training effort of the P300-based BCI speller.
Conference Paper
Latent variable models are a powerful tool for addressing several tasks in machine learning. However, the algorithms for learning the parameters of latent variable models are prone to getting stuck in a bad local optimum. To alleviate this problem, we build on the intuition that, rather than considering all samples simultaneously, the algorithm should be presented with the training data in a meaningful order that facilitates learning. The order of the samples is determined by how easy they are. The main challenge is that typically we are not provided with a readily computable measure of the easiness of samples. We address this issue by proposing a novel, iterative self-paced learning algorithm where each iteration simultaneously selects easy samples and learns a new parameter vector. The number of samples selected is governed by a weight that is annealed until the entire training data has been considered. We empirically demonstrate that the self-paced learning algorithm outperforms the state of the art method for learning a latent structural SVM on four applications: object localization, noun phrase coreference, motif finding and handwritten digit recognition. 1
Conference Paper
The construction of appearance-based object detection systems is time-consuming and difficult because a large number of training examples must be collected and man- ually labeled in order to capture variations in object ap- pearance. Semi-supervised training is a means for reduc- ing the effort needed to prepare the training set by train- ing the model with a small number of fully labeled examples and an additional set of unlabeled or weakly labeled exam- ples. In this work we present a semi-supervised approach to training object detection systems based on self-training. We implement our approach as a wrapper around the train- ing process of an existing object detector and present em- pirical results. The key contributions of this empirical study is to demonstrate that a model trained in this manner can achieve results comparable to a model trained in the tradi- tional manner using a much larger set of fully labeled data, and that a training data selection metric that is defined in- dependently of the detector greatly outperforms a selection metric based on the detection confidence generated by the detector.
We review the literature on semi-supervised learning, which is an area in machine learning and more generally, artificial intelligence. There has been a whole spectrum of interesting ideas on how to learn from both labeled and unlabeled data, i.e. semi-supervised learning. This document is a chapter excerpt from the author’s doctoral thesis (Zhu, 2005). However the author plans to update the online version frequently to incorporate the latest development in the field. Please obtain the latest version at
A closer look at memorization in deep networks
  • D Arpit
  • S Jastrzebski
  • N Ballas
  • D Krueger
  • E Bengio
  • M S Kanwal
Arpit, D., Jastrzebski, S., Ballas, N., Krueger, D., Bengio, E., Kanwal, M. S., et al. (2017). A closer look at memorization in deep networks. In Proceedings of the 34th international conference on machine learning (pp. 233-242).
Robust loss functions under label noise for deep neural networks
  • A Ghosh
  • H Kumar
  • P S Sastry
Ghosh, A., Kumar, H., & Sastry, P. S. (2017). Robust loss functions under label noise for deep neural networks. In Proceedings of the 31th conference on artificial intelligence, February 4-9, 2017, San Francisco, California, USA (pp. 1919-1925).
Training deep neural-networks using a noise adaptation layer
  • J Goldberger
  • E Ben-Reuven
Goldberger, J., & Ben-Reuven, E. (2017). Training deep neural-networks using a noise adaptation layer. In International conference on learning representations.
Co-teaching: robust training of deep neural networks with extremely noisy labels
  • B Han
  • Q Yao
  • X Yu
  • G Niu
  • M Xu
  • W Hu
Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., et al. (2018). Co-teaching: robust training of deep neural networks with extremely noisy labels. In Advances in neural information processing systems (pp. 8527-8537).
Knowledge distillation by on-the-fly native ensemble
  • X Lan
  • X Zhu
  • S Gong
Lan, X., Zhu, X., & Gong, S. (2018). Knowledge distillation by on-the-fly native ensemble. In Advances in neural information processing systems (pp. 7517-7527).
  • Coco Microsoft
Microsoft COCO: Common objects in context. arXiv e-prints, arXiv:1405.0312.
Learning with confident examples: Rank pruning for robust classification with noisy labels
  • C G Northcutt
  • T Wu
  • I L Chuang
Northcutt, C. G., Wu, T., & Chuang, I. L. (2017). Learning with confident examples: Rank pruning for robust classification with noisy labels. arXiv e-prints, arXiv:1705.01936.
  • S Reed
  • H Lee
  • D Anguelov
  • C Szegedy
  • D Erhan
  • A Rabinovich
Reed, S., Lee, H., Anguelov, D., Szegedy, C., Erhan, D., & Rabinovich, A. (2014). Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596.
Learning to reweight examples for robust deep learning
  • M Ren
  • W Zeng
  • B Yang
  • R Urtasun
Ren, M., Zeng, W., Yang, B., & Urtasun, R. (2018). Learning to reweight examples for robust deep learning. In Proceedings of the 35th international conference on machine learning.
Learning dictionaries for information extraction by multi-level bootstrapping
  • E Riloff
  • R Jones
Riloff, E., & Jones, R. (1999). Learning dictionaries for information extraction by multi-level bootstrapping. In AAAI '99/IAAI '99, Proceedings of the sixteenth national conference on artificial intelligence and the eleventh innovative applications of artificial intelligence conference innovative applications of artificial intelligence (pp. 474-479). Menlo Park, CA, USA: American Association for Artificial Intelligence, ISBN: 0-262-51106-1, URL cfm?id=315149.315364.
The bayesian bootstrap. The Annals of Statistics
  • D B Rubin
Rubin, D. B. (1981). The bayesian bootstrap. The Annals of Statistics, 130-134.
Very deep convolutional networks for large-scale image recognition
  • K Simonyan
  • A Zisserman
Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In International conference on learning representations.
Training convolutional networks with noisy labels
  • S Sukhbaatar
  • J Bruna
  • M Paluri
  • L Bourdev
  • R Fergus
Sukhbaatar, S., Bruna, J., Paluri, M., Bourdev, L., & Fergus, R. (2018). Training convolutional networks with noisy labels. In International conference on learning representations.
On the importance of initialization and momentum in deep learning
  • I Sutskever
  • J Martens
  • G Dahl
  • G Hinton
Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In S. Dasgupta, & D. McAllester (Eds.), Proceedings of machine learning research: Vol. 28, Proceedings of the 30th international conference on machine learning (3), (pp. 1139-1147). Atlanta, Georgia, USA: PMLR.
Generalized cross entropy loss for training deep neural networks with noisy labels
  • Z Zhang
  • M Sabuncu
Zhang, Z., & Sabuncu, M. (2018). Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in neural information processing systems (pp. 8792-8802).