Conference Paper

Label-Efficient Online Continual Object Detection in Streaming Video

... However, Faster R-CNN is prone to catastrophic forgetting when learning new objects in an incremental manner [29], [30]. The adaptation of this architecture to the OOD context remains poorly explored [25], [26]. In addition, no prior work has identified which components of the architecture are subject to forgetting. ...
... Our work focuses on OOD, where training data is presented as a stream, and the model has access to only a single pass for training. Three benchmarks are commonly used to evaluate OOD models [24], [26], [27]: ...
... The two benchmarks on EgoObjects were originally introduced for continual object detection [55]. To respect the online constraint specific to the OOD field of study, some works [24], [26], [27] have followed the original order in which data appears in the benchmark but constrained training to a single epoch. In addition, they built a test set similar to OAK's by holding out one of every 16 labeled frames. ...
Article
Full-text available
Online Object Detection (OOD) requires learning new object categories from a stream of images, similar to an agent exploring new environments. In this context, the widely used Faster R-CNN (Region-based Convolutional Neural Network) architecture faces catastrophic forgetting: the acquisition of new knowledge leads to the loss of previously learned information. In this paper, we investigate the learning and forgetting mechanisms of the Faster R-CNN in OOD through three main contributions. First, we observe that the forgetting curves of the Faster R-CNN exhibit patterns similar to those described in human memory studies by Hermann Ebbinghaus: knowledge is lost exponentially over time, and recall improves knowledge retention. Second, we present a new methodology for analysing the Faster R-CNN architecture and quantifying forgetting across its components. We show that forgetting is mainly localised in the softmax classification layer. Finally, we propose a new training strategy for OOD called Configurable Recall (CR). CR performs recalls on old data using images stored in a memory buffer, with configurable recall frequency and recall length, to ensure efficient learning. CR also masks the logits of old objects in the softmax classification layer to mitigate forgetting. We evaluate our strategy against state-of-the-art methods on three OOD benchmarks, analyse the effectiveness of different recall types in mitigating forgetting, and show that CR outperforms existing methods.
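The logit-masking component of CR invites a small illustration. The sketch below is ours, not the authors' implementation: when training on a batch (new data or a recall from the buffer), class logits outside the batch's label set are suppressed before the softmax cross-entropy, so that classes absent from the batch are shielded from interference. The function name and the -1e9 masking constant are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def masked_classification_loss(logits, targets, active_classes):
    """Cross-entropy restricted to a subset of class logits.

    logits:         (N, C) raw class scores from the detection head
    targets:        (N,) ground-truth class indices, all in active_classes
    active_classes: indices of classes present in the current batch; all
                    other logits are suppressed so absent classes receive
                    no gradient pressure from this batch.
    """
    mask = torch.full_like(logits, -1e9)   # large negative instead of -inf
    mask[:, active_classes] = 0.0
    return F.cross_entropy(logits + mask, targets)
```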
... For OAK evaluations, we use Faster R-CNN [46], a popular two-stage object detector. We initialize the ResNet-50 [27] backbone with the backbone of the final checkpoint of the streaming SSL model, and fine-tune the entire model on OAK with IID training for 10 epochs, following the training configurations of [63]. ...
Preprint
Full-text available
Self-supervised learning holds the promise to learn good representations from real-world continuous uncurated data streams. However, most existing works in visual self-supervised learning focus on static images or artificial data streams. Towards exploring a more realistic learning substrate, we investigate streaming self-supervised learning from long-form real-world egocentric video streams. Inspired by the event segmentation mechanism in human perception and memory, we propose "Memory Storyboard" that groups recent past frames into temporal segments for more effective summarization of the past visual streams for memory replay. To accommodate efficient temporal segmentation, we propose a two-tier memory hierarchy: the recent past is stored in a short-term memory, and the storyboard temporal segments are then transferred to a long-term memory. Experiments on real-world egocentric video datasets including SAYCam and KrishnaCam show that contrastive learning objectives on top of storyboard frames result in semantically meaningful representations which outperform those produced by state-of-the-art unsupervised continual learning methods.
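The two-tier memory hierarchy can be sketched in a few lines. This is a simplified illustration under our own assumptions: segments here are fixed-length windows, whereas Memory Storyboard derives boundaries from temporal event segmentation, and representative-frame selection is random rather than learned.

```python
from collections import deque
import random

class TwoTierMemory:
    """Illustrative two-tier replay memory: recent frames enter a short-term
    FIFO; when it fills, contiguous frames are grouped into temporal segments
    and a few representatives per segment move to long-term storage."""

    def __init__(self, short_capacity=256, segment_len=32,
                 keep_per_segment=4, long_capacity=4096):
        self.short = deque(maxlen=short_capacity)
        self.long = []
        self.segment_len = segment_len
        self.keep = keep_per_segment
        self.long_capacity = long_capacity

    def add(self, frame):
        self.short.append(frame)
        if len(self.short) == self.short.maxlen:
            self._consolidate()

    def _consolidate(self):
        frames = list(self.short)
        self.short.clear()
        for i in range(0, len(frames), self.segment_len):
            segment = frames[i:i + self.segment_len]
            self.long.extend(random.sample(segment, min(self.keep, len(segment))))
        # Bound long-term memory by uniform subsampling.
        if len(self.long) > self.long_capacity:
            self.long = random.sample(self.long, self.long_capacity)

    def sample(self, k):
        pool = list(self.short) + self.long
        return random.sample(pool, min(k, len(pool)))
```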
... This paradigm aims to emulate the human ability to continually learn and adapt to new information without losing prior learning, thus enabling models to remain applicable and effective in dynamically changing environments. Continual learning is used in different research areas such as anomaly detection [33], security [34], and object detection [35]. Continual learning methods are categorized into regularization-based, replay-based, and parameter-isolation-based methods. ...
Article
Full-text available
The creation and dissemination of deepfake videos have become increasingly prevalent nowadays, facilitated by advanced technological tools. These synthetic videos pose significant security challenges, as they can spread misinformation and manipulation, thereby undermining trust in digital media. Owing to the continuous generation of novel synthetic data, deepfake detection models must be regularly updated to enhance their generalization capabilities. In this research article, we propose a deepfake video detection system with a self-attention mechanism and continual learning. A self-attention residual module is specifically introduced to extract detailed facial features. We equip the deepfake detection process with continual learning to improve detection capability and generalization. The framework uses weight regularization and a dynamic sample set to continuously learn and adapt to new synthetic data. We demonstrate our proposed approach on an Xception-Net backbone with benchmark datasets such as Celeb-DF and FaceForensics++. Experimental results show AUC values of 99.26% on Celeb-DF and 99.67%, 93.57%, 99.78%, and 90.00% on the FaceForensics++ categories Deepfakes, Face2Face, FaceSwap, and Neural Textures, respectively.
... Traditionally, creating such datasets requires manually tagging images, a process that is both labor-intensive and prone to human error. As object detection models continue to evolve, so do the strategies for efficiently generating these annotated datasets, which has become a key research focus [21,8]. ...
Preprint
Automated object detection has become increasingly valuable across diverse applications, yet efficient, high-quality annotation remains a persistent challenge. In this paper, we present the development and evaluation of a platform designed to interactively improve object detection models. The platform allows uploading and annotating images as well as fine-tuning object detection models. Users can then manually review and refine annotations, further creating improved snapshots that are used for automatic object detection on subsequent image uploads, a process we refer to as semi-automatic annotation that yields a significant gain in annotation efficiency. Whereas iterative refinement of model results to speed up annotation has become common practice, we are the first to quantitatively evaluate its benefits with respect to time, effort, and interaction savings. Our experimental results show clear evidence of a significant time reduction of up to 53% for semi-automatic compared to manual annotation. Importantly, these efficiency gains did not compromise annotation quality, while matching or occasionally even exceeding the accuracy of manual annotations. These findings demonstrate the potential of our lightweight annotation platform for creating high-quality object detection datasets and provide best practices to guide future development of annotation platforms. The platform is open-source, with the frontend and backend repositories available on GitHub.
... Both methods showed they outperform conventional video CIL methods across the same test bed. Efficient-CLS [44] is another state-of-the-art video online continual learning (OCL) model proposed for low-labeled scenarios such as learning from online streaming videos. ...
Preprint
Video language continual learning involves continuously adapting to information from video and text inputs, enhancing a model's ability to handle new tasks while retaining prior knowledge. This is a relatively under-explored area, and establishing appropriate datasets is crucial for facilitating communication and research in the field. In this study, we present the first dedicated benchmark, ViLCo-Bench, designed to evaluate continual learning models across a range of video-text tasks. The dataset comprises ten-minute-long videos and corresponding language queries collected from publicly available datasets. Additionally, we introduce a novel memory-efficient framework that incorporates self-supervised learning and mimics long-term and short-term memory effects. This framework addresses challenges including memory complexity from long video clips, natural language complexity from open queries, and text-video misalignment. We posit that ViLCo-Bench, with greater complexity compared to existing continual learning benchmarks, would serve as a critical tool for exploring the video-language domain, extending beyond conventional class-incremental tasks, and addressing complex and limited annotation issues. The curated data, evaluations, and our novel method are available at https://github.com/cruiseresearchgroup/ViLCo.
... This is a major roadblock for developing more advanced pretrained models, e.g., continually learning and updating a pretrained model as new data streams arrive [80] in the ever-changing metaverse. To overcome this challenge, it is important to optimize data communication between storage and compute nodes, and to exploit parallelism in deep learning models. ...
Preprint
In the Metaverse, the physical space and the virtual space co-exist, and interact simultaneously. While the physical space is virtually enhanced with information, the virtual space is continuously refreshed with real-time, real-world information. To allow users to process and manipulate information seamlessly between the real and digital spaces, novel technologies must be developed. These include smart interfaces, new augmented realities, efficient storage and data management and dissemination techniques. In this paper, we first discuss some promising co-space applications. These applications offer experiences and opportunities that neither of the spaces can realize on its own. We then argue that the database community has much to offer to this field. Finally, we present several challenges that we, as a community, can contribute towards managing the Metaverse.
Article
Online defect detection for real photovoltaic data streams is challenging: the data streams from multiple production lines are unlabeled and unevenly distributed, and existing static defect detectors cannot adapt to such unsupervised, non-stationary photovoltaic data streams. In this paper, we propose a novel Semantic Mining and Domain Correction network (SMDC), which helps a static defect detector adapt, in an online unsupervised manner, to the non-stationary photovoltaic data streams encountered in deployment environments. In SMDC, we propose a Consistency-based Semantic-knowledge Mining (CSM) module, which selects reliable low-confidence teacher predictions as pseudo labels for potential defective instances during the self-training process based on the teacher-student architecture, so as to better mine the potential semantic knowledge of unlabeled samples in the non-stationary data stream. Then, we introduce a Domain Correction Contrastive Learning (DCCL) module, which performs contrastive learning to measure and narrow the distance between defect-free features of test samples and pre-preserved background prototypes in the feature space, thereby helping the detector learn robust domain-invariant background feature representations. Finally, extensive experiments on the public photovoltaic dataset demonstrate that the proposed SMDC achieves state-of-the-art performance on the online defect detection task for non-stationary photovoltaic data streams, which is crucial for the continuous quality monitoring of PV products.
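As a rough illustration of consistency-gated pseudo-labelling in the spirit of the CSM module (not the authors' code), one can keep low-confidence teacher detections only when their boxes agree across two augmented views; the thresholds and the torchvision detection format below are our assumptions.

```python
import torch
from torchvision.ops import box_iou

@torch.no_grad()
def select_pseudo_labels(view_a, view_b, low=0.3, high=0.7, min_iou=0.8):
    """Keep high-confidence teacher detections, plus low-confidence ones whose
    boxes are stable across two augmented views of the same image (a stand-in
    for the paper's consistency test). Each view is a dict with 'boxes',
    'scores', 'labels' in torchvision detection format."""
    keep = view_a["scores"] >= high                         # confident detections
    uncertain = (view_a["scores"] >= low) & (view_a["scores"] < high)
    if uncertain.any() and len(view_b["boxes"]) > 0:
        ious = box_iou(view_a["boxes"][uncertain], view_b["boxes"])
        stable = ious.max(dim=1).values >= min_iou          # box agrees across views
        keep[uncertain.nonzero(as_tuple=True)[0]] |= stable
    return {k: v[keep] for k, v in view_a.items()}
```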
Article
Full-text available
What will the future be? We wonder! In this survey, we explore the gap between current research in egocentric vision and the ever-anticipated future, where wearable computing, with outward facing cameras and digital overlays, is expected to be integrated into our everyday lives. To understand this gap, the article starts by envisaging the future through character-based stories, showcasing through examples the limitations of current technology. We then provide a mapping between this future and previously defined research tasks. For each task, we survey its seminal works, current state-of-the-art methodologies and available datasets, then reflect on shortcomings that limit its applicability to future research. Note that this survey focuses on software models for egocentric vision, independent of any specific hardware. The paper concludes with recommendations for areas of immediate exploration so as to unlock our path to future always-on, personalised and life-enhancing egocentric vision.
Article
Full-text available
In the metaverse, the physical space and the virtual space co-exist and interact simultaneously. While the physical space is virtually enhanced with information, the virtual space is continuously refreshed with real-time, real-world information. To allow users to process and manipulate information seamlessly between the real and digital spaces, novel technologies must be developed. These include smart interfaces, new augmented realities, and efficient data storage, management, and dissemination techniques. In this paper, we first discuss some promising co-space applications. These applications offer opportunities that neither of the spaces can realize on its own. Then, we further discuss several emerging technologies that empower the construction of the metaverse. After that, we discuss comprehensively the data-centric challenges. Finally, we discuss and envision what is likely to be required from the database and system perspectives.
Article
Full-text available
As image-based deep learning becomes pervasive on every device, from cell phones to smart watches, there is a growing need to develop methods that continually learn from data while minimizing memory footprint and power consumption. While memory replay techniques have shown exceptional promise for this task of continual learning, the best method for selecting which buffered images to replay is still an open question. In this paper, we specifically focus on the online class-incremental setting where a model needs to learn new classes continually from an online data stream. To this end, we contribute a novel Adversarial Shapley value scoring method that scores memory data samples according to their ability to preserve latent decision boundaries for previously observed classes (to maintain learning stability and avoid forgetting) while interfering with latent decision boundaries of current classes being learned (to encourage plasticity and optimal learning of new class boundaries). Overall, we observe that our proposed ASER method provides competitive or improved performance compared to state-of-the-art replay-based continual learning methods on a variety of datasets.
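For context, the standard replay machinery that scoring methods such as ASER build on is a reservoir-sampled buffer with, by default, uniform retrieval; ASER replaces the uniform retrieval step with its adversarial Shapley value scores. The sketch below shows only the baseline buffer, not the Shapley computation.

```python
import random

class ReservoirBuffer:
    """Reservoir sampling keeps a uniform random sample of the stream; scoring
    methods replace the uniform `retrieve` step with a computed score."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.seen = 0

    def add(self, example):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:
            # Replace a stored item with probability capacity / seen.
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = example

    def retrieve(self, k):
        return random.sample(self.data, min(k, len(self.data)))
```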
Conference Paper
Full-text available
Despite huge success, deep networks are unable to learn effectively in sequential multitask learning settings as they forget the past learned tasks after learning new tasks. Inspired from complementary learning systems theory, we address this challenge by learning a generative model that couples the current task to the past learned tasks through a discriminative embedding space. We learn an abstract generative distribution in the embedding that allows generation of data points to represent past experience. We sample from this distribution and utilize experience replay to avoid forgetting and simultaneously accumulate new knowledge to the abstract distribution in order to couple the current task with past experience. We demonstrate theoretically and empirically that our framework learns a distribution in the embedding, which is shared across all tasks, and as a result tackles catastrophic forgetting.
Article
Full-text available
Despite advances in deep learning, artificial neural networks do not learn the same way as humans do. Today, neural networks can learn multiple tasks when trained on them jointly, but cannot maintain performance on learnt tasks when tasks are presented one at a time -- this phenomenon called catastrophic forgetting is a fundamental challenge to overcome before neural networks can learn continually from incoming data. In this work, we derive inspiration from human memory to develop an architecture capable of learning continuously from sequentially incoming tasks, while averting catastrophic forgetting. Specifically, our model consists of a dual memory architecture to emulate the complementary learning systems (hippocampus and the neocortex) in the human brain, and maintains a consolidated long-term memory via generative replay of past experiences. We (i) substantiate our claim that replay should be generative, (ii) show the benefits of generative replay and dual memory via experiments, and (iii) demonstrate improved performance retention even for small models with low capacity. Our architecture displays many important characteristics of the human memory and provides insights on the connection between sleep and learning in humans.
Article
Full-text available
Significance: Deep neural networks are currently the most successful machine-learning technique for solving a variety of tasks, including language translation, image classification, and image generation. One weakness of such models is that, unlike humans, they are unable to learn multiple tasks sequentially. In this work we propose a practical solution to train such models sequentially by protecting the weights important for previous tasks. This approach, inspired by synaptic consolidation in neuroscience, enables state-of-the-art results on multiple reinforcement learning problems experienced sequentially.
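The weight-protection idea (elastic weight consolidation) reduces to a quadratic penalty on parameter drift, weighted by a diagonal Fisher information estimate. A minimal PyTorch sketch, with the `lam` hyperparameter and dictionary layout as our own conventions:

```python
import torch

def ewc_penalty(model, fisher, old_params, lam=1.0):
    """EWC-style penalty: quadratic cost for moving parameters away from their
    values after the previous task, weighted by a diagonal Fisher estimate.
    `fisher` and `old_params` map parameter names to tensors saved when the
    previous task finished."""
    loss = 0.0
    for name, p in model.named_parameters():
        if name in fisher:
            loss = loss + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return lam / 2.0 * loss
```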
Article
Full-text available
A major open problem on the road to artificial intelligence is the development of incrementally learning systems that learn about more and more concepts over time from a stream of data. In this work, we introduce a new training strategy, iCaRL, that allows learning in such a class-incremental way: only the training data for a small number of classes has to be present at the same time and new classes can be added progressively. iCaRL learns strong classifiers and a data representation simultaneously. This distinguishes it from earlier works that were fundamentally limited to fixed data representations and therefore incompatible with deep learning architectures. We show by experiments on the CIFAR-100 and ImageNet ILSVRC 2012 datasets that iCaRL can learn many classes incrementally over a long period of time where other strategies quickly fail.
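iCaRL's classification rule, nearest-mean-of-exemplars, is compact enough to sketch; the helper below assumes a `feature_fn` mapping images to embedding vectors and one exemplar tensor per class, both illustrative conventions:

```python
import torch

@torch.no_grad()
def nearest_mean_classify(features, exemplar_sets, feature_fn):
    """Assign each query to the class whose mean (L2-normalised) exemplar
    feature is closest, as in iCaRL's nearest-mean-of-exemplars rule."""
    means = []
    for exemplars in exemplar_sets:  # one tensor of images per class
        feats = torch.nn.functional.normalize(feature_fn(exemplars), dim=1)
        means.append(torch.nn.functional.normalize(feats.mean(0), dim=0))
    means = torch.stack(means)                              # (num_classes, D)
    feats = torch.nn.functional.normalize(features, dim=1)  # (N, D)
    return torch.cdist(feats, means).argmin(dim=1)          # predicted class ids
```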
Article
Full-text available
Recent progress in artificial intelligence (AI) has renewed interest in building systems that learn and think like people. Many advances have come from using deep neural networks trained end-to-end in tasks such as object recognition, video games, and board games, achieving performance that equals or even beats humans in some respects. Despite their biological inspiration and performance achievements, these systems differ from human intelligence in crucial ways. We review progress in cognitive science suggesting that truly human-like learning and thinking machines will have to reach beyond current engineering trends in both what they learn, and how they learn it. Specifically, we argue that these machines should (a) build causal models of the world that support explanation and understanding, rather than merely solving pattern recognition problems; (b) ground learning in intuitive theories of physics and psychology, to support and enrich the knowledge that is learned; and (c) harness compositionality and learning-to-learn to rapidly acquire and generalize knowledge to new tasks and situations. We suggest concrete challenges and promising routes towards these goals that can combine the strengths of recent neural network advances with more structured cognitive models.
Article
Full-text available
Sleep replay of awake experience in the cortex and hippocampus has been proposed to be involved in memory consolidation. However, whether temporally structured replay occurs in the cortex and whether the replay events in the two areas are related are unknown. Here we studied multicell spiking patterns in both the visual cortex and hippocampus during slow-wave sleep in rats. We found that spiking patterns not only in the cortex but also in the hippocampus were organized into frames, defined as periods of stepwise increase in neuronal population activity. The multicell firing sequences evoked by awake experience were replayed during these frames in both regions. Furthermore, replay events in the sensory cortex and hippocampus were coordinated to reflect the same experience. These results imply simultaneous reactivation of coherent memory traces in the cortex and hippocampus during sleep that may contribute to or reflect the result of the memory consolidation process.
Article
According to the Complementary Learning Systems (CLS) theory [1] in neuroscience, humans perform effective continual learning through two complementary systems: a fast learning system centered on the hippocampus for rapid learning of the specifics of individual experiences, and a slow learning system located in the neocortex for the gradual acquisition of structured knowledge about the environment. Motivated by this theory, we propose DualNets (for Dual Networks), a general continual learning framework comprising a fast learning system for supervised learning of pattern-separated representations from specific tasks and a slow learning system for learning task-agnostic general representations via Self-Supervised Learning (SSL). DualNets can seamlessly incorporate both representation types into a holistic framework to facilitate better continual learning in deep neural networks. Via extensive experiments, we demonstrate the promising results of DualNets on a wide range of continual learning protocols, ranging from the standard offline, task-aware setting to the challenging online, task-free scenario. Notably, on the CTrL [2] benchmark, which has unrelated tasks with vastly different visual images, DualNets achieves competitive performance with existing state-of-the-art dynamic architecture strategies [3]. Furthermore, we conduct comprehensive ablation studies to validate DualNets' efficacy, robustness, and scalability. Code will be made available at https://github.com/phquang/DualNet.
Article
Deep semi-supervised learning is a fast-growing field with a range of practical applications. This paper provides a comprehensive survey on both fundamentals and recent advances in deep semi-supervised learning methods from perspectives of model design and unsupervised loss functions. We first present a taxonomy for deep semi-supervised learning that categorizes existing methods, including deep generative methods, consistency regularization methods, graph-based methods, pseudo-labeling methods, and hybrid methods. Then we provide a comprehensive review of 60 representative methods and offer a detailed comparison of these methods in terms of the type of losses, architecture differences, and test performance results. In addition to the progress in the past few years, we further discuss some shortcomings of existing methods and provide some tentative heuristic solutions for solving these open problems.
Article
In a real-world setting, object instances from new classes can be continuously encountered by object detectors. When existing object detectors are applied to such scenarios, their performance on old classes deteriorates significantly. A few efforts have been reported to address this limitation, all of which apply variants of knowledge distillation to avoid catastrophic forgetting. We note that although distillation helps to retain previous learning, it obstructs fast adaptability to new tasks, which is a critical requirement for incremental learning. In this pursuit, we propose a meta-learning approach that learns to reshape model gradients, such that information across incremental tasks is optimally shared. This ensures a seamless information transfer via a meta-learned gradient preconditioning that minimizes forgetting and maximizes knowledge transfer. In comparison to existing meta-learning methods, our approach is task-agnostic, allows incremental addition of new-classes and scales to high-capacity models for object detection. We evaluate our approach on a variety of incremental learning settings defined on PASCAL-VOC and MS COCO datasets, where our approach performs favourably well against state-of-the-art methods. Code and trained models: https://github.com/JosephKJ/iOD .
Chapter
Much research on object detection focuses on building better model architectures and detection algorithms. Changing the model architecture, however, comes at the cost of adding more complexity to inference, making models slower. Data augmentation, on the other hand, doesn't add any inference complexity, but is insufficiently studied in object detection for two reasons. First, it is more difficult to design plausible augmentation strategies for object detection than for classification, because one must handle the complexity of bounding boxes if geometric transformations are applied. Second, data augmentation attracts less research attention, perhaps because it is believed to add less value and to transfer poorly compared to advances in network architectures.
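The bounding-box bookkeeping the chapter alludes to is easy to see on the simplest geometric transform, a horizontal flip; the NumPy sketch below uses (x1, y1, x2, y2) pixel coordinates as an assumed convention:

```python
import numpy as np

def hflip_with_boxes(image, boxes):
    """Horizontal flip for detection data: mirror the image and reflect each
    box's x-range. `image` is an HxWxC array, `boxes` an (N, 4) array of
    (x1, y1, x2, y2) pixel coordinates."""
    h, w = image.shape[:2]
    flipped = image[:, ::-1].copy()
    new_boxes = boxes.copy()
    new_boxes[:, 0] = w - boxes[:, 2]   # new x1 = w - old x2
    new_boxes[:, 2] = w - boxes[:, 0]   # new x2 = w - old x1
    return flipped, new_boxes
```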
Chapter
We discuss a general formulation for the Continual Learning (CL) problem for classification: a learning task where a stream provides samples to a learner, and the goal of the learner, depending on the samples it receives, is to continually upgrade its knowledge about the old classes and learn new ones. Our formulation takes inspiration from the open-set recognition problem, where test scenarios do not necessarily belong to the training distribution. We also discuss various quirks and assumptions encoded in recently proposed approaches for CL. We argue that some oversimplify the problem to an extent that leaves it with very little practical importance, and makes it extremely easy to perform well on. To validate this, we propose GDumb that (1) greedily stores samples in memory as they come; and (2) at test time, trains a model from scratch using samples only from the memory. We show that even though GDumb is not specifically designed for CL problems, it obtains state-of-the-art accuracies (often with large margins) in almost all the experiments when compared to a multitude of recently proposed algorithms. Surprisingly, it outperforms approaches in CL formulations for which they were specifically designed. This, we believe, raises concerns regarding our progress in CL for classification. Overall, we hope our formulation, characterizations and discussions will help in designing realistically useful CL algorithms, and GDumb will serve as a strong contender for the same.
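GDumb's storage rule, greedy class balancing, fits in a few lines; the eviction tie-breaking below is one reasonable reading, not necessarily the paper's exact variant:

```python
from collections import defaultdict

class GDumbBuffer:
    """Greedy balanced sampler in the style of GDumb: store incoming examples
    so that per-class counts stay as even as possible; at test time a model is
    trained from scratch on the buffer contents alone."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.per_class = defaultdict(list)
        self.size = 0

    def add(self, example, label):
        if self.size < self.capacity:
            self.per_class[label].append(example)
            self.size += 1
        else:
            # Evict from the currently largest class if the incoming class
            # is under-represented relative to it.
            largest = max(self.per_class, key=lambda c: len(self.per_class[c]))
            if len(self.per_class[label]) < len(self.per_class[largest]):
                self.per_class[largest].pop()
                self.per_class[label].append(example)
```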
Chapter
Continual Learning (CL) aims to develop agents emulating the human ability to sequentially learn new tasks while being able to retain knowledge obtained from past experiences. In this paper, we introduce the novel problem of Memory-Constrained Online Continual Learning (MC-OCL) which imposes strict constraints on the memory overhead that a possible algorithm can use to avoid catastrophic forgetting. As most, if not all, previous CL methods violate these constraints, we propose an algorithmic solution to MC-OCL: Batch-level Distillation (BLD), a regularization-based CL approach, which effectively balances stability and plasticity in order to learn from data streams, while preserving the ability to solve old tasks through distillation. Our extensive experimental evaluation, conducted on three publicly available benchmarks, empirically demonstrates that our approach successfully addresses the MC-OCL problem and achieves comparable accuracy to prior distillation methods requiring higher memory overhead (Code available at https://github.com/DonkeyShot21/batch-level-distillation).
Article
Machine learning has been highly successful in data-intensive applications but is often hampered when the data set is small. Recently, Few-shot Learning (FSL) is proposed to tackle this problem. Using prior knowledge, FSL can rapidly generalize to new tasks containing only a few samples with supervised information. In this article, we conduct a thorough survey to fully understand FSL. Starting from a formal definition of FSL, we distinguish FSL from several relevant machine learning problems. We then point out that the core issue in FSL is that the empirical risk minimizer is unreliable. Based on how prior knowledge can be used to handle this core issue, we categorize FSL methods from three perspectives: (i) data, which uses prior knowledge to augment the supervised experience; (ii) model, which uses prior knowledge to reduce the size of the hypothesis space; and (iii) algorithm, which uses prior knowledge to alter the search for the best hypothesis in the given hypothesis space. With this taxonomy, we review and discuss the pros and cons of each category. Promising directions, in the aspects of the FSL problem setups, techniques, applications, and theories, are also proposed to provide insights for future research.
Conference Paper
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [7] and Fast R-CNN [5] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model [18], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
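For readers who want to experiment with the architecture, a Faster R-CNN with a ResNet-50 FPN backbone ships with torchvision; the snippet below shows plain inference (the `weights="DEFAULT"` argument applies to torchvision 0.13+, older releases use `pretrained=True`):

```python
import torch
import torchvision

# Off-the-shelf Faster R-CNN with a ResNet-50 FPN backbone.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

images = [torch.rand(3, 480, 640)]   # list of CHW tensors in [0, 1]
with torch.no_grad():
    detections = model(images)       # list of {'boxes', 'labels', 'scores'}
print(detections[0]["boxes"].shape)
```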
Article
One major obstacle towards artificial intelligence is the poor ability of models to quickly solve new problems without forgetting previously acquired knowledge. To better understand this issue, we study the problem of learning over a continuum of data, where the model observes, once and one by one, examples concerning an ordered sequence of tasks. First, we propose a set of metrics to evaluate models learning over a continuum of data. These metrics characterize models not only by their test accuracy, but also in terms of their ability to transfer knowledge across tasks. Second, we propose a model to learn over continuums of data, called Gradient Episodic Memory (GEM), which alleviates forgetting while allowing beneficial transfer of knowledge to previous tasks. Our experiments on variants of the MNIST and CIFAR-100 datasets demonstrate the strong performance of GEM when compared to the state-of-the-art.
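GEM solves a quadratic program over per-task memory gradients; its later single-constraint simplification, A-GEM (also cited below), reduces to one projection and conveys the core idea:

```python
import torch

def agem_project(grad, grad_ref):
    """Single-constraint gradient projection (the A-GEM simplification of
    GEM): if the proposed update conflicts with the gradient computed on
    replayed memory (negative dot product), remove the conflicting component."""
    dot = torch.dot(grad, grad_ref)
    if dot < 0:
        grad = grad - (dot / torch.dot(grad_ref, grad_ref)) * grad_ref
    return grad
```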
Article
The recently proposed temporal ensembling has achieved state-of-the-art results in several semi-supervised learning benchmarks. It maintains an exponential moving average of label predictions on each training example, and penalizes predictions that are inconsistent with this target. However, because the targets change only once per epoch, temporal ensembling becomes unwieldy when using large datasets. To overcome this problem, we propose a method that averages model weights instead of label predictions. As an additional benefit, the method improves test accuracy and enables training with fewer labels than earlier methods. We report state-of-the-art results on semi-supervised SVHN, reducing the error rate from 5.12% to 4.41% with 500 labels, and achieving 5.39% error rate with 250 labels. By using extra unlabeled data, we reduce the error rate to 2.76% on 500-label SVHN.
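The weight-averaging step described here (the Mean Teacher update) is a one-liner per parameter; `decay=0.99` below is an illustrative value:

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, decay=0.99):
    """Mean-Teacher-style update: the teacher's weights are an exponential
    moving average of the student's, refreshed every step rather than once
    per epoch as in temporal ensembling."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1.0 - decay)
```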
Article
We update complementary learning systems (CLS) theory, which holds that intelligent agents must possess two learning systems, instantiated in mammals in the neocortex and hippocampus. The first gradually acquires structured knowledge representations while the second quickly learns the specifics of individual experiences. We broaden the role of replay of hippocampal memories in the theory, noting that replay allows goal-dependent weighting of experience statistics. We also address recent challenges to the theory and extend it by showing that recurrent activation of hippocampal traces can support some forms of generalization and that neocortical learning can be rapid for information that is consistent with known structure. Finally, we note the relevance of the theory to the design of artificial intelligent agents, highlighting connections between neuroscience and machine learning.
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
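The residual reformulation amounts to adding the block input back to the block output, so the stacked layers only need to model the deviation from identity. A minimal basic block (channel counts and layer sizes are illustrative, not the paper's exact configuration):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: the stacked layers learn F(x) and the output is
    F(x) + x, making the identity mapping the easy-to-represent default."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # skip connection
```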
Article
The Pascal Visual Object Classes (VOC) challenge consists of two components: (i) a publicly available dataset of images together with ground truth annotation and standardised evaluation software; and (ii) an annual competition and workshop. There are five challenges: classification, detection, segmentation, action classification, and person layout. In this paper we provide a review of the challenge from 2008–2012. The paper is intended for two audiences: algorithm designers, researchers who want to see what the state of the art is, as measured by performance on the VOC datasets, along with the limitations and weak points of the current generation of algorithms; and, challenge designers, who want to see what we as organisers have learnt from the process and our recommendations for the organisation of future challenges. To analyse the performance of submitted algorithms on the VOC datasets we introduce a number of novel evaluation methods: a bootstrapping method for determining whether differences in the performance of two algorithms are significant or not; a normalised average precision so that performance can be compared across classes with different proportions of positive instances; a clustering method for visualising the performance across multiple algorithms so that the hard and easy images can be identified; and the use of a joint classifier over the submitted algorithms in order to measure their complementarity and combined performance. We also analyse the community’s progress through time using the methods of Hoiem et al. (Proceedings of European Conference on Computer Vision, 2012) to identify the types of occurring errors. We conclude the paper with an appraisal of the aspects of the challenge that worked well, and those that could be improved in future challenges.
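Of the evaluation tools mentioned, (mean) average precision is the workhorse; a compact NumPy rendition of VOC-2010-style AP from ranked detections, with our own function signature, is:

```python
import numpy as np

def average_precision(scores, is_positive, num_gt):
    """VOC-style AP: rank detections by score, accumulate precision/recall,
    and integrate the monotone precision envelope (VOC 2010+ style).
    `is_positive` marks detections matched to a ground-truth box."""
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_positive, dtype=float)[order]
    fp = 1.0 - tp
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / max(num_gt, 1)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-12)
    envelope = np.maximum.accumulate(precision[::-1])[::-1]  # max precision to the right
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall, envelope):
        ap += (r - prev_r) * p
        prev_r = r
    return ap
```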
Gradient based sample selection for online continual learning
  • Rahaf Aljundi
  • Min Lin
  • Baptiste Goujaud
  • Yoshua Bengio
Learning fast, learning slow: A general continual learning method based on complementary learning system
  • Elahe Arani
  • Fahad Sarfraz
  • Bahram Zonooz
Mixmatch: A holistic approach to semi-supervised learning
  • David Berthelot
  • Nicholas Carlini
  • Ian Goodfellow
  • Nicolas Papernot
  • Avital Oliver
  • Colin A Raffel
Dark experience for general continual learning: a strong, simple baseline
  • Pietro Buzzega
  • Matteo Boschini
  • Angelo Porrello
  • Davide Abati
  • Simone Calderara
Faster R-CNN: Towards real-time object detection with region proposal networks
  • Shaoqing Ren
  • Kaiming He
  • Ross Girshick
  • Jian Sun
A simple semi-supervised learning framework for object detection
  • Kihyuk Sohn
  • Zizhao Zhang
  • Chun-Liang Li
  • Han Zhang
  • Chen-Yu Lee
  • Tomas Pfister
Online continual learning with maximal interfered retrieval
  • Rahaf Aljundi
  • Eugene Belilovsky
  • Tinne Tuytelaars
  • Laurent Charlin
  • Massimo Caccia
  • Min Lin
Mitigating forgetting in online continual learning via instance-aware parameterization
  • Hung-Jen Chen
  • An-Chieh Cheng
  • Da-Cheng Juan
  • Wei Wei
  • Min Sun
Consistency-based semi-supervised learning for object detection
  • Jisoo Jeong
  • Seungeui Lee
  • Jeesoo Kim
  • Nojun Kwak
Contextual transformation networks for online continual learning
  • Quang Pham
  • Chenghao Liu
  • Doyen Sahoo
  • Steven Hoi
Unbiased teacher for semi-supervised object detection
  • Yen-Cheng Liu
  • Chih-Yao Ma
  • Zijian He
  • Chia-Wen Kuo
  • Kan Chen
  • Peizhao Zhang
New insights on reducing abrupt representation change in online continual learning
  • Caccia
Efficient lifelong learning with A-GEM
  • Chaudhry
3rd continual learning workshop challenge on egocentric category and instance level object understanding
  • Pellegrini