Article

Review and Arrange: Curriculum Learning for Natural Language Understanding


Abstract

With the notable success of pretrained language models, the pretraining-fine-tuning paradigm has become a dominant solution for natural language understanding (NLU) tasks. Typically, the training instances of a target NLU task are introduced in a completely random order and treated equally at the fine-tuning stage. However, these instances can vary greatly in difficulty, and similar to human learning procedures, language models can benefit from an easy-to-difficult curriculum. Based on this concept, we propose a curriculum learning (CL) framework. Our framework consists of two stages, Review and Arrange, targeting the two main challenges in curriculum learning, i.e., how to define the difficulty of instances and how to arrange a curriculum based on the difficulty, respectively. In the first stage, we devise a cross-review (CR) method to train several teacher models first and then review the training set in a crossed manner to distinguish easy instances from difficult instances. In the second stage, two sampling algorithms, a coarse-grained arrangement (CGA) and a fine-grained arrangement (FGA), are proposed to arrange a curriculum for language models in which the learning materials start from the easiest instances, and more difficult instances are gradually added into the training procedure. Compared to previous heuristic CL methods, our framework can avoid the errors caused by a gap in difficulty between humans and machines and has strong generalization ability. We conduct comprehensive experiments, and the results show that our curriculum learning framework, without any manual model architecture design or use of external data, obtains significant and universal performance improvements on a wide range of NLU tasks in different languages.
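To make the two stages concrete, the sketch below shows one way the cross-review scoring and the coarse-grained arrangement could be implemented. The number of teachers, the shard assignment, the scoring metric, and the bucket schedule are illustrative assumptions rather than details taken from the abstract.

```python
# Hedged sketch of the Review (cross-review) and Arrange (coarse-grained) stages.
import random
from typing import Callable, List, Sequence

def cross_review(dataset: Sequence, n_teachers: int,
                 train_fn: Callable[[Sequence], Callable],
                 score_fn: Callable[[Callable, object], float]) -> List[float]:
    """Split the data into shards, train one teacher per shard, then let every
    teacher score the instances it was NOT trained on. A higher aggregate score
    is interpreted here as an easier instance."""
    shards = [dataset[i::n_teachers] for i in range(n_teachers)]
    teachers = [train_fn(shard) for shard in shards]
    scores = []
    for i, example in enumerate(dataset):
        own_shard = i % n_teachers                       # shard the example belongs to
        cross = [score_fn(t, example)                    # e.g. probability of the gold label
                 for j, t in enumerate(teachers) if j != own_shard]
        scores.append(sum(cross) / len(cross))
    return scores

def coarse_grained_arrangement(dataset: Sequence, scores: List[float], n_buckets: int = 4):
    """Yield training stages: stage k contains the k easiest buckets, shuffled."""
    order = sorted(range(len(dataset)), key=lambda i: -scores[i])   # easiest first
    bucket_size = (len(order) + n_buckets - 1) // n_buckets
    for k in range(1, n_buckets + 1):
        visible = [dataset[i] for i in order[:k * bucket_size]]
        random.shuffle(visible)
        yield visible
```

A fine-grained arrangement would presumably grow the visible set at a finer granularity, for example through per-instance sampling weights rather than whole buckets, but that detail is not specified in the abstract.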


... Recently, various studies [3,29,31,41,52] have been conducted to achieve efficient data learning. Most real-world datasets are large and contain data of varying difficulty. ...
Article
Full-text available
Since most real-world network data include nodes and edges that evolve gradually, an embedding model for dynamic heterogeneous networks is crucial for network analysis. Transformer models have achieved remarkable success in natural language processing but are rarely applied to learning representations of dynamic heterogeneous networks. In this study, we propose a new transformer model (DHG-BERT) that (i) constructs a dataset based on a network curriculum and (ii) includes pre/post-learning through self-supervised learning. Our proposed model learns complex relationships by first building an easier understanding of relationships through data reconstruction. Additionally, we use self-supervised learning to learn network structural features and temporal changes in structure, and then fine-tune the proposed model to focus on specific meta-paths according to domain characteristics or target tasks. We evaluated the quality of the vector representations produced by the proposed transformer model using real bibliographic networks. Our model achieved an average accuracy of 0.94 in predicting research collaboration between researchers, outperforming existing models by a minimum of 0.13 and a maximum of 0.35. As a result, we confirmed that DHG-BERT is an effective transformer model tailored to dynamic heterogeneous network embeddings. Our study highlights the model's ability to understand complex network relationships and appropriately capture the structural nuances and temporal changes inherent in networks. This study provides future research directions for applying transformer models to real-world network data and a new approach to analyzing dynamic heterogeneous networks.
... It addresses the problem of catastrophic forgetting, enhances the convergence rate of the training process, and improves model performance on some tasks [10]. The concept of CL for machine learning was introduced in [11] and has since gained interest across several fields, such as federated face recognition [12] and natural language understanding [13]. Prior research, as summarized in [14], has shown that the benefits of CL cannot be generalized across all fields. ...
Conference Paper
Full-text available
Anomaly detection in multivariate time series (MVTS) is a significant research domain across several industries, including cybersecurity and industrial systems. Several deep learning-based methods have been proposed within this research domain. The development of high-capacity frameworks has received the majority of attention in recent years due to the availability of large open-source datasets and improvements in computer processing power. However, little attention has been paid to how the data are presented to these techniques during training. Curriculum learning (CL), a technique based on ordered learning, was proposed for machine learning. In CL, the model first learns from easy data and is progressively trained with increasingly difficult data. In this paper, we propose data-based CL for MVTS anomaly detection. We further introduce the CL concept to the learner (model), in which we first train a simple model and then utilize a complex model in the final training round. To the best of our knowledge, we are the first to investigate these approaches in MVTS anomaly detection. We evaluate the proposed designs on the SWaT dataset using the F1 score, and the results show an improvement in performance.
... There are many CL-based unsupervised or semi-supervised works that focus on pseudo-labeling [51], [52]. Most of them are in NLP [53], [54], classification [55] and detection [56] tasks, where the pseudo-labels are relatively simple [50]. However, there are few CL works that focus on more complex cross-modal tasks (e.g., VQA, VLN, VG) due to the difficulty in evaluating data and models with diverse modalities and task targets [49], [50]. ...
Article
Visual Grounding (VG) is a crucial topic in the field of vision and language, which involves locating a specific region described by expressions within an image. To reduce the reliance on manually labeled data, unsupervised methods have been developed to locate regions using pseudo-labels. However, the performance of existing unsupervised methods is highly dependent on the quality of pseudo-labels, and these methods always encounter issues with limited diversity. In order to utilize vision and language pre-trained models to address the grounding problem, and reasonably take advantage of pseudo-labels, we propose CLIP-VG, a novel method that can conduct self-paced curriculum adapting of CLIP with pseudo-language labels. We propose a simple yet efficient end-to-end network architecture to realize the transfer of CLIP to visual grounding. Based on the CLIP-based architecture, we further propose single-source and multi-source curriculum adapting algorithms, which can progressively find more reliable pseudo-labels to learn an optimal model, thereby achieving a balance between reliability and diversity for the pseudo-language labels. Our method outperforms the current state-of-the-art unsupervised method by a significant margin on RefCOCO/+/g datasets in both single-source and multi-source scenarios, with improvements ranging from 6.78% to 10.67% and 11.39% to 14.87%, respectively. Furthermore, our approach even outperforms existing weakly supervised methods. The code and models are available at https://github.com/linhuixiao/CLIP-VG .
... With the development of deep learning and pre-trained language models, text classification has gained great success [1][2][3][4][5]. Generally, training deep-learning text classification models require a large amount of labeled data to achieve competitive performance. ...
Article
Full-text available
Training a deep-learning text classification model usually requires a large amount of labeled data, yet labeling data is usually labor-intensive and time-consuming. Few-shot text classification focuses on predicting unknown samples using only a few labeled samples. Recently, metric-based meta-learning methods have achieved promising results in few-shot text classification. They use episodic training on labeled samples to enhance the model's generalization ability. However, existing models only focus on learning from a few labeled samples but neglect to learn from a large number of unlabeled samples. In this paper, we exploit the knowledge learned by the model from unlabeled samples to improve the generalization performance of the meta-network. Specifically, we introduce a novel knowledge distillation method that expands and enriches the meta-learning representation with self-supervised information. Meanwhile, we design a graph aggregation method that efficiently combines the query set information with the support set information in each task and outputs a more discriminative representation. We conducted experiments on three public few-shot text classification datasets. The experimental results show that our model performs better than the state-of-the-art models in 5-way 1-shot and 5-way 5-shot cases.
Preprint
Full-text available
Visual Grounding (VG) refers to locating a region described by expressions in a specific image, which is a critical topic in vision-language fields. To alleviate the dependence on labeled data, existing unsupervised methods try to locate regions using task-unrelated pseudo-labels. However, a large proportion of the pseudo-labels are noisy, and their language taxonomy lacks diversity. Inspired by the advances in V-L pretraining, we consider utilizing VLP models to realize unsupervised transfer learning in the downstream grounding task. Thus, we propose CLIP-VG, a novel method that can conduct self-paced curriculum adapting of CLIP via exploiting pseudo-language labels to solve the VG problem. By elaborating an efficient model structure, we first propose single-source and multi-source curriculum adapting methods for unsupervised VG to progressively sample more reliable cross-modal pseudo-labels to obtain the optimal model, thus achieving implicit knowledge exploiting and denoising. Our method outperforms the existing state-of-the-art unsupervised VG method Pseudo-Q by a large margin in both single-source and multi-source scenarios, i.e., 6.78%~10.67% and 11.39%~24.87% on the RefCOCO/+/g datasets, and even outperforms existing weakly supervised methods. The code and models will be released at \url{https://github.com/linhuixiao/CLIP-VG}.
Article
Sequential prediction has great value for resource allocation due to its capability in analyzing intents for the next prediction. A fundamental challenge arises from real-world interaction dynamics, where similar sequences involving multiple intents may exhibit different next items. More importantly, the large volume of candidate items in sequential prediction may amplify such dynamics, making it hard for deep networks to capture comprehensive intents. This paper presents a sequential prediction framework with Decoupled Progressive Distillation (DePoD), drawing on the progressive nature of human cognition. We redefine target and non-target item distillation according to their different effects in the decoupled formulation. This is achieved through two aspects: (1) Regarding how to learn, our target item distillation with progressive difficulty increases the contribution of low-confidence samples in the later training phase while keeping high-confidence samples in the earlier phase, and the non-target item distillation starts from a small subset of non-target items whose size increases according to item frequency. (2) Regarding whom to learn from, a difference evaluator is utilized to progressively select an expert that provides informative knowledge among items from the cohort of peers. Extensive experiments on four public datasets show that DePoD outperforms state-of-the-art methods in terms of accuracy-based metrics.
Article
Since deep learning is the dominant paradigm in the multi-turn dialogue generation task, large-scale training data is the key factor affecting model performance. To make full use of the training data, existing work directly applied curriculum learning to the multi-turn dialogue generation task, training the model in an "easy-to-hard" manner. However, the design of the current methodology does not consider dialogue-specific features. To close this gap, we propose a Multi-Level Curriculum Learning (MLCL) method for multi-turn dialogue generation that considers the word-level linguistic features and utterance-level semantic relations in a dialogue. The motivation is that word-level knowledge is beneficial to understanding the complex utterance-level dependencies of a dialogue. Thus, we design two difficulty measurements and a self-adaptive curriculum scheduler, making the model gradually shift its learning focus from word-level to utterance-level information during the training process. We also verify the independence and complementarity of the two measurements at different levels. We evaluate the performance on two widely used multi-turn dialogue datasets, and the results demonstrate that our proposed method outperforms strong baselines and existing CL methods in terms of automated metrics and human evaluation. We will release the code files upon acceptance.
Article
Few-shot relation classification (FSRC) focuses on recognizing novel relations by learning from merely a handful of annotated instances. Meta-learning has been widely adopted for such a task, training on randomly generated few-shot tasks to learn generic data representations. Despite the impressive results achieved, existing models still perform suboptimally when handling hard FSRC tasks with similar categories that the model struggles to distinguish. We argue this is largely due to two reasons: 1) ignoring pivotal and discriminative information that is crucial to distinguishing confusing classes, and 2) training indiscriminately on randomly sampled tasks of varying difficulty. In this paper, we introduce a novel prototypical network approach with contrastive learning that learns more informative and discriminative representations by exploiting relation label information. We further design two strategies that increase the difficulty of training tasks and allow the model to adaptively learn to focus on hard tasks. By doing so, our model can better represent subtle inter-relation variance and progress through tasks of increasing difficulty. Extensive experiments on three standard benchmarks demonstrate the effectiveness of our method.
Article
Unsupervised cross-lingual transfer has shown great potential for dependency parsing of low-resource languages when no annotated treebank is available. Recently, the self-training method has received increasing interest because of its state-of-the-art performance in this scenario. In this work, we advance the method further by coupling it with curriculum learning, which guides the self-training in an easy-to-hard manner. Concretely, we present a novel metric to measure the instance difficulty for a dependency parser that is trained mainly on a treebank from a resource-rich source language. Using this metric, we divide a low-resource target language into several fine-grained sub-languages by their difficulty, and then apply iterative self-training progressively on these sub-languages. To fully exploit the auto-parsed training corpus from the sub-languages, we use an improved parameter generation network to model the sub-languages for better representation learning. Experimental results show that our final curriculum-style self-training outperforms a range of strong baselines, leading to new state-of-the-art results on unsupervised cross-lingual dependency parsing. We also conduct detailed experimental analyses to examine the proposed approach in depth for a comprehensive understanding.
Article
Deep learning has demonstrated great potential in the field of gas recognition using an electronic nose. However, owing to the limited resources of embedded devices, they can only meet the computing and storage requirements of lightweight deep neural network (DNN) models, which reduces the classification accuracy. This study proposes three curriculum learning (CL) approaches to improve the performance of lightweight DNN models for end-to-end gas recognition, including a domain knowledge-based approach and two approaches that automatically arrange the curriculum. Among the two automated methods, the first uses the correlation strength between sample responses to measure the difficulty of the training samples, and the second is based on the classification difficulty and disagreements among multiple teacher models. Furthermore, our approaches are combined with a mixup-based data augmentation method to apply CL to small datasets. The superiority of the proposed approaches was verified using 14 electronic nose datasets. The experimental results demonstrate that these approaches can be used on different types of datasets while boosting the gas recognition accuracy and F1 score of various DNN models by 9.9% and 10.9% on average, respectively, which can guide the future design of high-accuracy gas sensing embedded systems.
Article
Full-text available
Curriculum learning (CL) is a training strategy that trains a machine learning model from easier data to harder data, which imitates the meaningful learning order in human curricula. As an easy-to-use plug-in, the CL strategy has demonstrated its power in improving the generalization capacity and convergence rate of various models in a wide range of scenarios such as computer vision and natural language processing. In this survey article, we comprehensively review CL from various aspects including motivations, definitions, theories, and applications. We discuss works on curriculum learning within a general CL framework, elaborating on how to design a manually predefined curriculum or an automatic curriculum. In particular, we summarize existing CL designs based on the general framework of Difficulty Measurer + Training Scheduler and further categorize the methodologies for automatic CL into four groups, i.e., Self-paced Learning, Transfer Teacher, RL Teacher, and Other Automatic CL. We also analyze principles to select different CL designs that may benefit practical applications. Finally, we present our insights on the relationships connecting CL and other machine learning concepts, including transfer learning, meta-learning, continual learning, and active learning, and then point out challenges in CL as well as potential future research directions deserving further investigation.
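As a minimal illustration of the Difficulty Measurer + Training Scheduler framing summarized above, the sketch below pairs a placeholder difficulty measure with a simple linear pacing scheduler; both choices are assumptions made for illustration, not designs from the survey.

```python
# Minimal sketch: a placeholder Difficulty Measurer plus a linear Training Scheduler.
def difficulty_measurer(example) -> float:
    # Placeholder measure; could be sentence length, loss of a pretrained model, etc.
    return float(len(example))

def training_scheduler(dataset, measurer, epochs: int, start_frac: float = 0.25):
    """Release an easy-to-hard prefix of the data that grows linearly per epoch."""
    ranked = sorted(dataset, key=measurer)                          # easy -> hard
    for epoch in range(epochs):
        frac = min(1.0, start_frac + (1 - start_frac) * epoch / max(1, epochs - 1))
        yield ranked[: max(1, int(frac * len(ranked)))]

# Toy usage: the visible subset grows from the shortest strings to the full set.
for subset in training_scheduler(["a", "bbb", "cc", "dddd"], difficulty_measurer, epochs=3):
    print(subset)
```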
Conference Paper
Full-text available
Emotion-controllable response generation is an attractive and valuable task that aims to make open-domain conversations more empathetic and engaging. Existing methods mainly enhance the emotion expression by adding regularization terms to standard cross-entropy loss and thus influence the training process. However, due to the lack of further consideration of content consistency, the common problem of response generation tasks, safe response, is intensified. Besides, query emotions that can help model the relationship between query and response are simply ignored in previous models, which would further hurt the coherence. To alleviate these problems, we propose a novel framework named Curriculum Dual Learning (CDL) which extends the emotion-controllable response generation to a dual task to generate emotional responses and emotional queries alternatively. CDL utilizes two rewards focusing on emotion and content to improve the duality. Additionally, it applies curriculum learning to gradually generate high-quality responses based on the difficulties of expressing various emotions. Experimental results show that CDL significantly outperforms the baselines in terms of coherence, diversity, and relation to emotion factors.
Article
Full-text available
Machine reading comprehension tasks require a machine reader to answer questions relevant to the given document. In this paper, we present the first free-form multiple-Choice Chinese machine reading Comprehension dataset (C³), containing 13,369 documents (dialogues or more formally written mixed-genre texts) and their associated 19,577 multiple-choice free-form questions collected from Chinese-as-a-second-language examinations. We present a comprehensive analysis of the prior knowledge (i.e., linguistic, domain-specific, and general world knowledge) needed for these real-world problems. We implement rule-based and popular neural methods and find that there is still a significant performance gap between the best performing model (68.5%) and human readers (96.0%), especially on problems that require prior knowledge. We further study the effects of distractor plausibility and data augmentation based on translated relevant datasets for English on model performance. We expect C³ to present great challenges to existing systems as answering 86.8% of questions requires both knowledge within and beyond the accompanying document, and we hope that C³ can serve as a platform to study how to leverage various kinds of prior knowledge to better understand a given written or orally oriented text. C³ is available at https://dataset.org/c3/.
Conference Paper
Full-text available
Human attribute analysis is a challenging task in the field of computer vision. One of the significant difficulties is brought by highly imbalanced data. Conventional techniques such as re-sampling and cost-sensitive learning require prior knowledge to train the system. To address this problem, we propose a unified framework called Dynamic Curriculum Learning (DCL) to adaptively adjust the sampling strategy and loss weight in each batch, which results in better generalization and discrimination ability. Inspired by curriculum learning, DCL consists of two-level curriculum schedulers: (1) a sampling scheduler which manages the data distribution not only from imbalanced to balanced but also from easy to hard; (2) a loss scheduler which controls the learning importance between the classification and metric learning losses. With these two schedulers, we achieve state-of-the-art performance on the widely used face attribute dataset CelebA and the pedestrian attribute dataset RAP.
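One plausible instantiation of the two schedulers described above is sketched below; the exact scheduling functions used by DCL are not given here, so both interpolation schemes are assumptions.

```python
# Assumed interpolation schemes for the two schedulers; `progress` runs from 0 to 1.
import numpy as np

def sampling_scheduler(class_counts, progress: float) -> np.ndarray:
    """Move per-class sampling weights from the imbalanced empirical distribution
    (progress=0) toward a uniform, balanced distribution (progress=1)."""
    counts = np.asarray(class_counts, dtype=float)
    empirical = counts / counts.sum()
    uniform = np.full_like(empirical, 1.0 / len(empirical))
    return (1.0 - progress) * empirical + progress * uniform

def loss_scheduler(progress: float):
    """Shift learning importance from the classification loss toward the
    metric-learning loss as training proceeds."""
    w_cls = 1.0 - 0.5 * progress
    return w_cls, 1.0 - w_cls        # (classification weight, metric-loss weight)

print(sampling_scheduler([900, 90, 10], progress=0.5))   # halfway between skewed and uniform
```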
Article
Full-text available
Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations, longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT (Devlin et al., 2019). Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large.
Article
Full-text available
This paper investigates the ability of artificial neural networks to judge the grammatical acceptability of a sentence, with the goal of testing their linguistic competence. We introduce the Corpus of Linguistic Acceptability (CoLA), a set of 10,657 English sentences labeled as grammatical or ungrammatical from published linguistics literature. As baselines, we train several recurrent neural network models on acceptability classification, and find that our models outperform unsupervised models by Lau et al. (2016) on CoLA. Error-analysis on specific grammatical phenomena reveals that both Lau et al.’s models and ours learn systematic generalizations like subject-verb-object order. However, all models we test perform far below human level on a wide range of grammatical constructions.
Conference Paper
Full-text available
Semantic Textual Similarity (STS) measures the meaning similarity of sentences. Applications include machine translation (MT), summarization, generation, question answering (QA), short answer grading, semantic search, dialog and conversational systems. The STS shared task is a venue for assessing the current state-of-the-art. The 2017 task focuses on multilingual and cross-lingual pairs with one sub-track exploring MT quality estimation (MTQE) data. The task obtained strong participation from 31 teams, with 17 participating in all language tracks. We summarize performance and review a selection of well performing methods. Analysis highlights common errors, providing insight into the limitations of existing models. To support ongoing work on semantic representations, the STS Benchmark is introduced as a new shared training and evaluation set carefully selected from the corpus of English STS shared task data (2012-2017).
Article
Full-text available
We present NewsQA, a challenging machine comprehension dataset of over 100,000 question-answer pairs. Crowdworkers supply questions and answers based on a set of over 10,000 news articles from CNN, with answers consisting of spans of text from the corresponding articles. We collect this dataset through a four-stage process designed to solicit exploratory questions that require reasoning. A thorough analysis confirms that NewsQA demands abilities beyond simple word matching and recognizing entailment. We measure human performance on the dataset and compare it to several strong neural models. The performance gap between humans and machines (25.3% F1) indicates that significant progress can be made on NewsQA through future research. The dataset is freely available at datasets.maluuba.com/NewsQA.
Chapter
In this paper, we present a 'pre-training' + 'post-training' + 'fine-tuning' three-stage paradigm, which is a supplementary framework to the standard 'pre-training' + 'fine-tuning' language model approach. Furthermore, based on the three-stage paradigm, we present a language model named PPBERT. Compared with the original BERT architecture, which is based on the standard two-stage paradigm, we do not fine-tune the pre-trained model directly but rather post-train it on a domain- or task-related dataset first, which helps to better incorporate task-aware and domain-aware knowledge into the pre-trained model and reduces bias from the training dataset. Extensive experimental results indicate that the proposed model improves the performance of the baselines on 24 NLP tasks, including eight GLUE benchmarks, eight SuperGLUE benchmarks, and six extractive question answering benchmarks. More remarkably, our proposed model is a more flexible and pluggable model, where the post-training approach can be plugged into other BERT-based PLMs. Extensive ablations further validate its effectiveness and state-of-the-art (SOTA) performance. The open-source code, pre-trained models, and post-trained models are publicly available.
Article
Current state-of-the-art neural dialogue systems are mainly data-driven and are trained on human-generated responses. However, due to the subjectivity and open-ended nature of human conversations, the complexity of training dialogues varies greatly. The noise and uneven complexity of query-response pairs impede the learning efficiency and effects of the neural dialogue generation models. What is more, so far, there are no unified dialogue complexity measurements, and the dialogue complexity embodies multiple aspects of attributes—specificity, repetitiveness, relevance, etc. Inspired by human behaviors of learning to converse, where children learn from easy dialogues to complex ones and dynamically adjust their learning progress, in this paper, we first analyze five dialogue attributes to measure the dialogue complexity in multiple perspectives on three publicly available corpora. Then, we propose an adaptive multi-curricula learning framework to schedule a committee of the organized curricula. The framework is established upon the reinforcement learning paradigm, which automatically chooses different curricula at the evolving learning process according to the learning status of the neural dialogue generation model. Extensive experiments conducted on five state-of-the-art models demonstrate its learning efficiency and effectiveness with respect to 13 automatic evaluation metrics and human judgments.
Article
Despite the great success of two-stage detectors, the single-stage detector is still a more elegant and efficient approach, yet it suffers from two well-known disharmonies during training, i.e., the huge difference in quantity between positive and negative examples as well as between easy and hard examples. In this work, we first point out that the essential effect of the two disharmonies can be summarized in terms of the gradient. Further, we propose a novel gradient harmonizing mechanism (GHM) to serve as a hedge against the disharmonies. The philosophy behind GHM can be easily embedded into both classification loss functions like cross-entropy (CE) and regression loss functions like the smooth-L1 (SL1) loss. To this end, two novel loss functions called GHM-C and GHM-R are designed to balance the gradient flow for anchor classification and bounding box refinement, respectively. Ablation studies on MS COCO demonstrate that without laborious hyper-parameter tuning, both GHM-C and GHM-R can bring substantial improvement for single-stage detectors. Without bells and whistles, the proposed model achieves 41.6 mAP on the COCO test-dev set, surpassing the state-of-the-art method, Focal Loss (FL) + SL1, by 0.8. The code is released to facilitate future research.
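A hedged sketch of the gradient-density reweighting idea behind GHM-C is shown below, assuming binary cross-entropy and simple equal-width binning; the bin count and other details are illustrative rather than the paper's settings.

```python
# Sketch of GHM-C-style gradient-density reweighting for binary cross-entropy.
# Targets are float tensors in {0, 1}; the bin count is an illustrative choice.
import torch
import torch.nn.functional as F

def ghm_c_loss(logits: torch.Tensor, targets: torch.Tensor, bins: int = 10) -> torch.Tensor:
    probs = torch.sigmoid(logits)
    g = (probs - targets).abs().detach()                 # gradient norm of BCE w.r.t. logits
    edges = torch.linspace(0.0, 1.0, bins + 1, device=logits.device)
    weights = torch.zeros_like(g)
    n = g.numel()
    for i in range(bins):
        upper = edges[i + 1] + (1e-6 if i == bins - 1 else 0.0)
        in_bin = (g >= edges[i]) & (g < upper)
        count = int(in_bin.sum())
        if count > 0:
            weights[in_bin] = n / count                  # down-weight densely populated bins
    loss = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (weights * loss).sum() / n
```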
Conference Paper
Recent deep networks are capable of memorizing the entire data even when the labels are completely random. To overcome overfitting on corrupted labels, we propose a novel technique of learning another neural network, called MentorNet, to supervise the training of the base deep network, namely, StudentNet. During training, MentorNet provides a curriculum (sample weighting scheme) for StudentNet to focus on samples whose labels are probably correct. Unlike existing curricula that are usually predefined by human experts, MentorNet learns a data-driven curriculum dynamically with StudentNet. Experimental results demonstrate that our approach can significantly improve the generalization performance of deep networks trained on corrupted training data. Notably, to the best of our knowledge, we achieve the best-published result on WebVision, a large benchmark containing 2.2 million images with real-world noisy labels. https://github.com/google/mentornet
Chapter
We propose a novel deep network called Mancs that solves the person re-identification problem from the following aspects: fully utilizing the attention mechanism for the person misalignment problem and properly sampling for the ranking loss to obtain more stable person representation. Technically, we contribute a novel fully attentional block which is deeply supervised and can be plugged into any CNN, and a novel curriculum sampling method which is effective for training ranking losses. The learning tasks are integrated into a unified framework and jointly optimized. Experiments have been carried out on Market1501, CUHK03 and DukeMTMC. All the results show that Mancs can significantly outperform the previous state-of-the-arts. In addition, the effectiveness of the newly proposed ideas has been confirmed by extensive ablation studies.
Chapter
We present a simple yet efficient approach capable of training deep neural networks on large-scale weakly-supervised web images, which are crawled raw from the Internet using text queries, without any human annotation. We develop a principled learning strategy by leveraging curriculum learning, with the goal of handling a massive amount of noisy labels and data imbalance effectively. We design a new learning curriculum by measuring the complexity of data using its distribution density in a feature space, and rank the complexity in an unsupervised manner. This allows for an efficient implementation of curriculum learning on large-scale web images, resulting in a high-performance CNN model in which the negative impact of noisy labels is reduced substantially. Importantly, we show by experiments that images with highly noisy labels can surprisingly improve the generalization capability of the model by serving as a form of regularization. Our approach obtains state-of-the-art performance on four benchmarks: WebVision, ImageNet, Clothing-1M and Food-101. With an ensemble of multiple models, we achieved a top-5 error rate of 5.2% on the WebVision challenge [18] for 1000-category classification. This result was the top performance by a wide margin, outperforming the second-place entry by nearly 50% in relative error rate. Code and models are available at: https://github.com/MalongTech/CurriculumNet.
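A rough sketch of the unsupervised, density-based complexity ranking described above; using the mean distance to the k nearest neighbours in feature space as an (inverse) density estimate is an assumed choice.

```python
# Sketch: rank samples by local feature-space density (dense = assumed easy/clean).
import numpy as np

def density_rank(features: np.ndarray, k: int = 10) -> np.ndarray:
    """Return sample indices ordered from densest to sparsest regions."""
    dists = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    knn = np.sort(dists, axis=1)[:, 1:k + 1]             # k nearest neighbours, excluding self
    density = -knn.mean(axis=1)                          # closer neighbours => denser region
    return np.argsort(-density)
```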
Article
Our first contribution in this paper is a theoretical investigation of curriculum learning in the context of stochastic gradient descent when optimizing the least squares loss function. We prove that the rate of convergence of an ideal curriculum learning method is monotonically increasing with the difficulty of the examples, and that this increase in convergence rate is monotonically decreasing as training proceeds. In our second contribution, we analyze curriculum learning in the context of training a CNN for image classification. Here one crucial problem is the means to achieve a curriculum. We describe a method which infers the curriculum by way of transfer learning from another network, pre-trained on a different task. While this approach can only approximate the ideal curriculum, we empirically observe behavior similar to that predicted by the theory, namely, a significant boost in convergence speed at the beginning of training. When the task is made more difficult, improvement in generalization performance is observed. Finally, curriculum learning exhibits robustness against unfavorable conditions such as strong regularization.
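A minimal sketch of the transfer-learning curriculum idea described above: score each training pair with a network pre-trained on a different task, then present the data easiest-first. The `pretrained_confidence` scoring function is hypothetical.

```python
# Sketch: order training data easiest-first by a pre-trained teacher's confidence.
def transfer_curriculum(inputs, labels, pretrained_confidence):
    """`pretrained_confidence(x, y)` is a hypothetical function returning the
    teacher's confidence in the gold label y for input x."""
    order = sorted(range(len(inputs)),
                   key=lambda i: pretrained_confidence(inputs[i], labels[i]),
                   reverse=True)                          # most confident (easy) first
    return [inputs[i] for i in order], [labels[i] for i in order]
```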
Article
Visual attributes, from simple objects (e.g., backpacks, hats) to soft-biometrics (e.g., gender, height, clothing) have proven to be a powerful representational approach for many applications such as image description and human identification. In this paper, we introduce a novel method to combine the advantages of both multi-task and curriculum learning in a visual attribute classification framework. Individual tasks are grouped after performing hierarchical clustering based on their correlation. The clusters of tasks are learned in a curriculum learning setup by transferring knowledge between clusters. The learning process within each cluster is performed in a multi-task classification setup. By leveraging the acquired knowledge, we speed-up the process and improve performance. We demonstrate the effectiveness of our method via ablation studies and a detailed analysis of the covariates, on a variety of publicly available datasets of humans standing with their full-body visible. Extensive experimentation has proven that the proposed approach boosts the performance by 4% to 10%.
Article
Visual attributes, from simple objects (e.g., backpacks, hats) to soft-biometrics (e.g., gender, height, clothing) have proven to be a powerful representational approach for many applications such as image description and human identification. In this paper, we introduce a novel method to combine the advantages of both multi-task and curriculum learning in a visual attribute classification framework. Individual tasks are grouped based on their correlation so that two groups of strongly and weakly correlated tasks are formed. The two groups of tasks are learned in a curriculum learning setup by transferring the acquired knowledge from the strongly to the weakly correlated. The learning process within each group though, is performed in a multi-task classification setup. The proposed method learns better and converges faster than learning all the tasks in a typical multi-task learning paradigm. We demonstrate the effectiveness of our approach on the publicly available, SoBiR, VIPeR and PETA datasets and report state-of-the-art results across the board.
Article
We propose Teacher-Student Curriculum Learning (TSCL), a framework for automatic curriculum learning, where the Student tries to learn a complex task and the Teacher automatically chooses subtasks from a given set for the Student to train on. We describe a family of Teacher algorithms that rely on the intuition that the Student should practice more on those tasks on which it makes the fastest progress, i.e., where the slope of the learning curve is highest. In addition, the Teacher algorithms address the problem of forgetting by also choosing tasks where the Student's performance is getting worse. We demonstrate that TSCL matches or surpasses the results of carefully hand-crafted curricula in two tasks: addition of decimal numbers with an LSTM and navigation in Minecraft. Using our automatically generated curriculum enabled solving a Minecraft maze that could not be solved at all when training directly on the maze, and learning was an order of magnitude faster than uniform sampling of subtasks.
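A minimal sketch of a TSCL-style Teacher that prefers the subtask whose recent learning-curve slope has the largest magnitude, covering both fast progress and forgetting; the window size and the epsilon-greedy exploration are assumed details.

```python
# Sketch of a progress-based Teacher: favour tasks with steep recent learning curves.
import random
from collections import defaultdict, deque

class ProgressTeacher:
    def __init__(self, n_tasks: int, window: int = 10, epsilon: float = 0.1):
        self.scores = defaultdict(lambda: deque(maxlen=window))
        self.n_tasks, self.epsilon = n_tasks, epsilon

    def observe(self, task: int, score: float) -> None:
        self.scores[task].append(score)                   # e.g. validation score on the subtask

    def _slope(self, task: int) -> float:
        h = self.scores[task]
        if len(h) < 2:
            return float("inf")                           # force exploration of unseen tasks
        return (h[-1] - h[0]) / (len(h) - 1)              # crude estimate of learning progress

    def choose(self) -> int:
        if random.random() < self.epsilon:                # occasional random exploration
            return random.randrange(self.n_tasks)
        return max(range(self.n_tasks),                   # largest |slope|: fast progress or forgetting
                   key=lambda t: abs(self._slope(t)))
```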
Article
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
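For reference, the scaled dot-product attention at the core of the Transformer, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, written as a small NumPy sketch:

```python
# Scaled dot-product attention, the building block the abstract refers to.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (batch, len_q, d_k), K: (batch, len_k, d_k), V: (batch, len_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)      # (batch, len_q, len_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over the key positions
    return weights @ V                                     # (batch, len_q, d_v)
```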
Article
We introduce a method for automatically selecting the path, or syllabus, that a neural network follows through a curriculum so as to maximise learning efficiency. A measure of the amount that the network learns from each data sample is provided as a reward signal to a nonstationary multi-armed bandit algorithm, which then determines a stochastic syllabus. We consider a range of signals derived from two distinct indicators of learning progress: rate of increase in prediction accuracy, and rate of increase in network complexity. Experimental results for LSTM networks on three curricula demonstrate that our approach can significantly accelerate learning, in some cases halving the time required to attain a satisfactory performance level.
Article
In this paper, we present an alternative to the Turing Test that has some conceptual and practical advantages. A Winograd schema is a pair of sentences that differ only in one or two words and that contain a referential ambiguity that is resolved in opposite directions in the two sentences. We have compiled a collection of Winograd schemas, designed so that the correct answer is obvious to the human reader but cannot easily be found using selectional restrictions or statistical techniques over text corpora. A contestant in the Winograd Schema Challenge is presented with a collection of one sentence from each pair and is required to achieve human-level accuracy in choosing the correct disambiguation.
Article
Semantic word spaces have been very useful but cannot express the meaning of longer phrases in a principled way. Further progress towards understanding compositionality in tasks such as sentiment detection requires richer supervised training and evaluation resources and more powerful models of composition. To remedy this, we introduce a Sentiment Treebank. It includes fine-grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and presents new challenges for sentiment compositionality. To address them, we introduce the Recursive Neural Tensor Network. When trained on the new treebank, this model outperforms all previous methods on several metrics. It pushes the state of the art in single-sentence positive/negative classification from 80% up to 85.4%. The accuracy of predicting fine-grained sentiment labels for all phrases reaches 80.7%, an improvement of 9.7% over bag-of-features baselines. Lastly, it is the only model that can accurately capture the effects of negation and its scope at various tree levels for both positive and negative phrases.