Conference Paper

DANet: Multi-scale UAV Target Detection with Dynamic Feature Perception and Scale-aware Knowledge Distillation


Abstract

Multi-scale infrared unmanned aerial vehicle (UAV) target (IRUT) detection under dynamic scenarios remains a challenging task due to weak target features, varying shapes and poses, and complex background interference. Current detection methods struggle to address these issues accurately and efficiently. In this paper, we design a dynamic attentive network (DANet) incorporating a scale-adaptive feature enhancement mechanism (SaFEM) and an attention-guided cross-weighting feature aggregator (ACFA). The SaFEM adaptively adjusts the network's receptive fields at hierarchical network levels by leveraging separable deformable convolution (SDC), which enhances the network's multi-scale IRUT awareness. The ACFA, modulated by two crossing attention mechanisms, strengthens structural and semantic properties on neighboring levels for the accurate representation of multi-scale IRUT features from different levels. A plug-and-play anti-distractor contrastive regularization (ADCR) is also imposed on our DANet, which enforces dissimilarity between features of targets and distractors from a new uncompressed feature projector (UFP) to increase the network's anti-distractor ability in complex backgrounds. To further improve the multi-scale UAV detection performance of DANet while maintaining its efficiency advantage, we propose a novel scale-specific knowledge distiller (SSKD) based on a divide-and-conquer strategy. In the "divide" stage, we intentionally construct three task-oriented teachers to learn tailored knowledge for small-, medium-, and large-scale IRUTs. In the "conquer" stage, we propose a novel element-wise attentive distillation module (EADM), where we employ a pixel-wise attention mechanism to highlight teacher and student IRUT features, and incorporate IRUT-associated prior knowledge for the collaborative transfer of refined multi-scale IRUT features to our DANet.
Extensive experiments on real infrared UAV datasets demonstrate that our DANet is able to detect multi-scale UAVs with a satisfactory balance between accuracy and efficiency.
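The abstract describes the EADM as re-weighting the feature-matching loss with a pixel-wise attention map before transfer. A minimal numpy sketch of that idea follows; the function names, the choice of per-pixel feature magnitude as the attention source, and the squared-error matching term are all illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def pixel_attention(feat):
    """Pixel-wise attention from a feature map of shape (C, H, W):
    per-pixel L2 magnitude across channels, normalized to sum to 1.
    (Assumed form; the paper's attention mechanism may differ.)"""
    mag = np.sqrt((feat ** 2).sum(axis=0))           # (H, W)
    return mag / (mag.sum() + 1e-12)

def attentive_distill_loss(teacher, student):
    """Element-wise attentive distillation sketch: squared error between
    teacher and student features, re-weighted so pixels the teacher
    finds salient dominate the transferred signal."""
    att = pixel_attention(teacher)                   # (H, W)
    se = ((teacher - student) ** 2).mean(axis=0)     # (H, W)
    return float((att * se).sum())

# Toy check: a single bright "target" pixel in the teacher feature map
t = np.zeros((4, 8, 8)); t[:, 4, 4] = 5.0
s = np.zeros((4, 8, 8))
loss = attentive_distill_loss(t, s)
```

Because the attention map concentrates on the bright pixel, the whole loss budget is spent on matching the target region rather than the empty background.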


Article
Unmanned aerial vehicle (UAV) detection based on thermal infrared imaging has been one of the most important sensing technologies in the anti-UAV system. However, the technical limitations and long-range detection of thermal sensors often lead to acquiring low-resolution (LR) infrared images, thereby bringing great challenges for the subsequent target detection task. In this paper, we propose a novel spatial and contrast interactive super-resolution network (SCINet) for assisting infrared UAV target detection. The network consists of two main subnetworks: a spatial enhancement branch (SEB) and a contrast enhancement branch (CEB). The SEB embeds the lightweight convolution module and attention mechanism to highlight the spatial structure detail features of infrared UAV targets. The proposed CEB incorporates the center-oriented contrast-aware module and multi-branch collapsible module, which can provide local contrast priors to reconstruct the super-resolved UAV target image. The spatial features of the intermediate layers from the SEB are integrated into the CEB as a rich gradient prior. In return, the output features of the CEB are aggregated into those of the SEB to further supplement the contrast of the spatial features. Dual-branch feature interaction of the SEB and CEB can further enhance the spatial details and saliency of the targets. Additionally, we also introduce an infrared UAV detection network via a new dual-dimensional feature calibration module (DFCM) for boosting the detection performance. Extensive experiments demonstrate that the SCINet outperforms the state-of-the-art SR methods on real infrared UAV sequences and improves the detection performance of infrared small UAV targets.
Conference Paper
Universal object detectors aim to detect any object in any scene without human annotation, exhibiting superior generalization. However, the current universal object detectors show degraded performance in harsh weather, and their insufficient real-time capabilities limit their application. In this paper, we present Uni-YOLO, a universal detector designed for complex scenes with real-time performance. Uni-YOLO is a one-stage object detector that uses general object confidence to distinguish between objects and backgrounds, and employs a grid cell regression method for real-time detection. To improve its robustness in harsh weather conditions, the input of Uni-YOLO is adaptively enhanced with a physical model-based enhancement module. During training and inference, Uni-YOLO is guided by the extensive knowledge of the vision-language model CLIP. An object augmentation method is proposed to improve generalization in training by utilizing multiple source datasets with heterogeneous annotations. Furthermore, an online self-enhancement method is proposed to allow Uni-YOLO to further focus on specific objects through self-supervised fine-tuning in a given scene. Extensive experiments on public benchmarks and a UAV deployment are conducted to validate its superiority and practical value.
Article
Full-text available
Bayesian neural networks (BNNs) have been demonstrated to be effective in accurate retrieval of sea ice concentration (SIC) from multi-source data, while providing estimates of uncertainty, which are essential for downstream services. However, uncertainties obtained by BNNs are intrinsically uncalibrated, meaning they may not correlate well with model error. To address this issue, we investigate a new approach that combines an auxiliary prediction interval (PI) estimator with the BNN-based SIC mean estimator to develop a well-calibrated SIC retrieval model that is both accurate and reliable. We adopt a training strategy called "uncertainty matching" to train the model, which ensures that the estimated uncertainties match the estimated PIs. We use a subset of AMSR2 brightness temperature data and ERA5 atmospheric data collected from 2014 to 2015 in the Baffin Bay area as input features of the model. Comparison between model inference and SIC labels obtained from the enhanced NASA Team (NT2) algorithm shows that the proposed approach is able to produce well-calibrated uncertainty with more accurate predictions in marginal ice zones.
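Calibration here means the estimated uncertainty should match the observed error distribution, which is typically checked via empirical coverage of the prediction intervals. A small sketch of that check, assuming Gaussian intervals of the form mu ± z·sigma (the function name and the Gaussian form are illustrative, not taken from the paper):

```python
import numpy as np

def pi_coverage(y_true, mu, sigma, z=1.96):
    """Empirical coverage of the Gaussian prediction interval mu +/- z*sigma:
    the fraction of targets falling inside it. For well-calibrated
    uncertainty, coverage should be close to the nominal level
    (95% for z = 1.96)."""
    inside = np.abs(y_true - mu) <= z * sigma
    return float(inside.mean())

# Synthetic sanity check: errors truly distributed as N(0, sigma^2)
rng = np.random.default_rng(0)
mu = np.zeros(100_000)
sigma = np.ones(100_000)
y = rng.normal(mu, sigma)
cov = pi_coverage(y, mu, sigma)   # close to 0.95 when calibrated
```

If a model's coverage falls well below the nominal level, its uncertainties are overconfident; well above, underconfident.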
Article
Full-text available
Single-frame infrared small target (SIRST) detection has been a challenging task due to a lack of inherent characteristics, imprecise bounding box regression, a scarcity of real-world datasets, and sensitive localization evaluation. In this article, we propose a comprehensive solution to these challenges. First, we find that the existing anchor-free label assignment method is prone to mislabeling small targets as background, leading to their omission by detectors. To overcome this issue, we propose an all-scale pseudobox-based label assignment scheme that relaxes the constraints on the scale and decouples the spatial assignment from the size of the ground-truth target. Second, motivated by the structured prior of feature pyramids, we introduce the one-stage cascade refinement network (OSCAR), which uses the high-level head as a soft proposal for the low-level refinement head. This allows OSCAR to process the same target in a cascade coarse-to-fine manner. Finally, we present a new research benchmark for infrared small target detection, consisting of the SIRST-V2 dataset of real-world, high-resolution single-frame targets, the normalized contrast evaluation metric, and the DeepInfrared toolkit for detection. We conduct extensive ablation studies to evaluate the components of OSCAR and compare its performance to state-of-the-art model- and data-driven methods on the SIRST-V2 benchmark. Our results demonstrate that a top-down cascade refinement framework can improve the accuracy of infrared small target detection without sacrificing efficiency. The DeepInfrared toolkit, dataset, and trained models are available at https://github.com/YimianDai/open-deepinfrared.
Article
Full-text available
Single-frame infrared small target (SIRST) detection aims at separating small targets from clutter backgrounds. With the advances of deep learning, CNN-based methods have yielded promising results in generic object detection due to their powerful modeling capability. However, existing CNN-based methods cannot be directly applied to infrared small targets since pooling layers in their networks could lead to the loss of targets in deep layers. To handle this problem, we propose a dense nested attention network (DNA-Net) in this paper. Specifically, we design a dense nested interactive module (DNIM) to achieve progressive interaction among high-level and low-level features. With the repetitive interaction in DNIM, the information of infrared small targets in deep layers can be maintained. Based on DNIM, we further propose a cascaded channel and spatial attention module (CSAM) to adaptively enhance multi-level features. With our DNA-Net, contextual information of small targets can be well incorporated and fully exploited by repetitive fusion and enhancement. Moreover, we develop an infrared small target dataset (namely, NUDT-SIRST) and propose a set of evaluation metrics to conduct comprehensive performance evaluation. Experiments on both public and our self-developed datasets demonstrate the effectiveness of our method. Compared to other state-of-the-art methods, our method achieves better performance in terms of probability of detection (Pd), false-alarm rate (Fa), and intersection over union (IoU).
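Of the three metrics named above, IoU is the simplest to state concretely for the segmentation-style outputs these detectors produce. A minimal sketch for binary masks follows (Pd and Fa additionally require per-target matching between predictions and ground truth, which is omitted here; the function name is an assumption):

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection over union of two binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

# Two 4x4 boxes overlapping in a 2x2 corner: IoU = 4 / (16 + 16 - 4)
pred = np.zeros((8, 8), int); pred[2:6, 2:6] = 1
gt   = np.zeros((8, 8), int); gt[4:8, 4:8] = 1
iou = mask_iou(pred, gt)
```

For tiny targets, pixel-level IoU is a much stricter score than box-level overlap, which is why it is reported alongside Pd and Fa.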
Article
Full-text available
Infrared small target detection plays an important role in the infrared search and tracking applications. In recent years, deep learning techniques have been introduced to this task and achieved noteworthy effects. Following general object segmentation methods, existing deep learning methods usually process the image from the global view. However, the locality of small targets and extreme class-imbalance between the target and background pixels are not well-considered by these deep learning methods, which causes low training efficiency and heavy dependence on large amounts of data. A local patch network (LPNet) with global attention is proposed in this article to detect small targets by jointly considering the global and local properties of infrared small target images. From the global view, a supervised attention module trained by the small target spread map is proposed to suppress most background pixels irrelevant with small target features. From the local view, local patches are split from global features and share the same convolution weights with each other in an LPNet. By leveraging both the global and local properties, the data-driven framework proposed in this article has the ability of fusing multiscale features for small target detection. Extensive experiments on synthetic and real datasets show that the proposed method achieves the state-of-the-art performance in comparison with both traditional and deep learning methods.
Article
Full-text available
The use of small and remotely controlled unmanned aerial vehicles (UAVs), referred to as drones, has increased dramatically in recent years, both for professional and recreational purposes. This goes in parallel with (intentional or unintentional) misuse episodes, with an evident threat to the safety of people or facilities [1]. As a result, the detection of UAV has also emerged as a research topic [2]. Most of the existing studies on drone detection fail to specify the type of acquisition device, the drone type, the detection range, or the employed dataset. The lack of proper UAV detection studies employing thermal infrared cameras is also acknowledged as an issue, despite its success in detecting other types of targets [2]. Besides, we have not found any previous study that addresses the detection task as a function of distance to the target. Sensor fusion is indicated as an open research issue as well to achieve better detection results in comparison to a single sensor, although research in this direction is scarce too [3], [4], [5], [6]. To help in counteracting the mentioned issues and allow fundamental studies with a common public benchmark, we contribute with an annotated multi-sensor database for drone detection that includes infrared and visible videos and audio files. The database includes three different drones, a small-sized model (Hubsan H107D+), a medium-sized drone (DJI Flame Wheel in quadcopter configuration), and a performance-grade model (DJI Phantom 4 Pro). It also includes other flying objects that can be mistakenly detected as drones, such as birds, airplanes or helicopters. In addition to using several different sensors, the number of classes is higher than in previous studies [4]. The video part contains 650 infrared and visible videos (365 IR and 285 visible) of drones, birds, airplanes and helicopters. Each clip is ten seconds long, resulting in a total of 203328 annotated frames.
The database is complemented with 90 audio files of the classes drones, helicopters and background noise. To allow studies as a function of the sensor-to-target distance, the dataset is divided into three categories (Close, Medium, Distant) according to the industry-standard Detect, Recognize and Identify (DRI) requirements [7], built on the Johnson criteria [8]. Given that the drones must be flown within visual range due to regulations, the largest sensor-to-target distance for a drone in the dataset is 200 m, and acquisitions are made in daylight. The data has been obtained at three airports in Sweden: Halmstad Airport (IATA code: HAD/ICAO code: ESMT), Gothenburg City Airport (GSE/ESGP) and Malmö Airport (MMX/ESMS). The acquisition sensors are mounted on a pan-tilt platform that steers the cameras to the objects of interest. All sensors and the platform are controlled with a standard laptop via a USB hub.
Article
Full-text available
To mitigate the issue of minimal intrinsic features for pure data-driven methods, in this article, we propose a novel model-driven deep network for infrared small target detection, which combines discriminative networks and conventional model-driven methods to make use of both labeled data and the domain knowledge. By designing a feature map cyclic shift scheme, we modularize a conventional local contrast measure method as a depthwise parameterless nonlinear feature refinement layer in an end-to-end network, which encodes relatively long-range contextual interactions with clear physical interpretability. To highlight and preserve the small target features, we also exploit a bottom-up attentional modulation integrating the smaller scale subtle details of low-level features into high-level features of deeper layers. We conduct detailed ablation studies with varying network depths to empirically verify the effectiveness and efficiency of the design of each component in our network architecture. We also compare the performance of our network against other model-driven methods and deep networks on the open SIRST data set as well. The results suggest that our network yields a performance boost over its competitors.
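The cyclic-shift scheme above turns a local contrast measure into a parameter-free layer: shifted copies of the feature map stand in for sliding windows. A minimal numpy sketch of that idea, assuming a single-scale 8-neighbor contrast with a minimum-pooling rule (the function name, offset pattern, and aggregation rule are illustrative simplifications of the paper's design):

```python
import numpy as np

def cyclic_shift_contrast(feat, r=1):
    """Parameter-free local-contrast refinement via cyclic shifts:
    subtract the feature map shifted toward each of its 8 neighbors at
    offset r and keep the minimum difference, so only pixels brighter
    than ALL surrounding positions retain a positive response."""
    shifts = [(-r, -r), (-r, 0), (-r, r), (0, -r),
              (0, r), (r, -r), (r, 0), (r, r)]
    diffs = [feat - np.roll(feat, s, axis=(0, 1)) for s in shifts]
    return np.minimum.reduce(diffs)

# A lone bright pixel survives; its neighbors are pushed negative
img = np.zeros((9, 9)); img[4, 4] = 1.0
out = cyclic_shift_contrast(img, r=1)
```

Because the operation is only rolls and subtractions, it is differentiable and can sit inside an end-to-end network, which is the point the article makes about physical interpretability without extra parameters.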
Article
Full-text available
In this letter, a weighted strengthened local contrast measure (WSLCM) algorithm for infrared (IR) small target detection is proposed. It consists of two modules: the strengthened local contrast measure (SLCM) and a weighting function. In the SLCM calculation, the ideas of matched filter and background estimation are adopted to enhance the true target and suppress complex background; then both ratio and difference operations are used to calculate the SLCM. In the weighting function definition, three components are considered: the characteristics of the target, the characteristics of the background, and the difference between them. In particular, an improved regional intensity level (IRIL) algorithm is proposed to evaluate the complexity of a cell, so that random noise can be suppressed better. Experiments on some real IR images show that the proposed WSLCM can achieve a better detection performance under complex background.
Article
Full-text available
Excellent performance, real time and strong robustness are three vital requirements for infrared small target detection. Unfortunately, many current state-of-the-art methods merely achieve one of the expectations when coping with highly complex scenes. In fact, a common problem is that real-time processing and great detection ability are difficult to coordinate. Therefore, to address this issue, a robust infrared patch-tensor model for detecting an infrared small target is proposed in this paper. On the basis of infrared patch-tensor (IPT) model, a novel nonconvex low-rank constraint named partial sum of tensor nuclear norm (PSTNN) joint weighted l1 norm was employed to efficiently suppress the background and preserve the target. To remedy the deficiency of RIPT, which tends to over-shrink the target and may even cause it to disappear, an improved local prior map simultaneously encoded with target-related and background-related information was introduced into the model. With the help of a reweighted scheme for enhancing the sparsity and a high-efficiency version of tensor singular value decomposition (t-SVD), the total algorithm complexity and computation time can be reduced dramatically. Then, the decomposition of the target and background is transformed into a tensor robust principal component analysis (TRPCA) problem, which can be efficiently solved by the alternating direction method of multipliers (ADMM). A series of experiments substantiate the superiority of the proposed method beyond state-of-the-art baselines.
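The core of the PSTNN constraint is the partial sum of singular values: the nuclear norm with the largest N singular values excluded, so minimization shrinks only the small singular values that carry target/noise energy while preserving the dominant low-rank background. A matrix-level sketch follows (PSTNN applies the same idea slice-wise to a tensor via t-SVD; the function name is an assumption):

```python
import numpy as np

def partial_sum_nuclear(X, N=1):
    """Partial sum of singular values: the nuclear norm minus the
    largest N singular values. Minimizing this penalty leaves the
    dominant (low-rank background) components untouched and shrinks
    only the remainder."""
    s = np.linalg.svd(X, compute_uv=False)
    return float(s[N:].sum())

# For a rank-1 matrix, everything beyond the first singular value is ~0
u = np.arange(1, 5, dtype=float).reshape(-1, 1)
X = u @ u.T                      # singular values: 30, 0, 0, 0
```

With N = 0 the penalty reduces to the ordinary nuclear norm, which over-penalizes the background's leading components; that gap is exactly what motivates the partial-sum variant.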
Article
This study developed an efficient 3-D inversion algorithm for the controlled-source electromagnetic method (CSEM). The spectral element method based on high-order Gauss-Lobatto-Legendre (GLL) basis functions with infinite element boundary conditions is used to quickly solve forward problems and adjoint forward problems for inversion, which can significantly reduce the computational cost while guaranteeing accuracy. Because forward modeling with high-order basis functions is used, we can give priority to the demands of inversion when designing the grid, and then choose the appropriate order for forward modeling based on the size of the grid, which makes the grid design suitable for both forward modeling and inversion. The L-BFGS optimization algorithm, which avoids explicit computation and storage of the Hessian matrix, is used to solve the objective function minimization problem. Furthermore, a new preconditioner is introduced to update the initial approximate Hessian matrix in L-BFGS, which improves the convergence of the algorithm. To further improve the efficiency of the algorithm, we implemented parallelization based on the Message Passing Interface (MPI). The synthetic examples show the efficiency of our algorithm compared with the conventional inversion method, and the field example demonstrates the adaptability and utility of the algorithm.
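The reason L-BFGS needs no explicit Hessian is the standard two-loop recursion: the inverse-Hessian action on the gradient is reconstructed from a short history of (s, y) = (step, gradient-change) pairs. A generic sketch of that recursion (this is the textbook algorithm, not the paper's preconditioned variant; the function name is an assumption):

```python
import numpy as np

def lbfgs_direction(grad, s_hist, y_hist):
    """L-BFGS two-loop recursion: approximate -H^{-1} grad from stored
    (s, y) pairs without ever forming the Hessian. With no history it
    reduces to steepest descent."""
    q = grad.copy()
    alphas = []
    rhos = [1.0 / (y @ s) for s, y in zip(s_hist, y_hist)]
    for s, y, rho in reversed(list(zip(s_hist, y_hist, rhos))):
        a = rho * (s @ q)
        q -= a * y
        alphas.append(a)
    if s_hist:  # initial scaling gamma = s^T y / y^T y from the newest pair
        s, y = s_hist[-1], y_hist[-1]
        q *= (s @ y) / (y @ y)
    for (s, y, rho), a in zip(zip(s_hist, y_hist, rhos), reversed(alphas)):
        b = rho * (y @ q)
        q += s * (a - b)
    return -q

# With one stored pair the update satisfies the secant condition:
# applying the approximate inverse Hessian to y recovers s.
A = np.diag([2.0, 10.0])
s = np.array([1.0, 2.0])
y = A @ s
d_secant = lbfgs_direction(y, [s], [y])   # equals -s
g = np.array([3.0, -1.0])
d0 = lbfgs_direction(g, [], [])           # no history: -g
```

The paper's preconditioner would replace the simple gamma scaling of the initial matrix with a problem-specific approximation, which the recursion accommodates without structural change.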
Article
The traditional CNN-based methods usually employ the spatial information in the amplitude of complex Synthetic Aperture Radar (SAR) images. Several studies have started to concentrate on merging the unique physical properties of SAR images, such as DSN-v1, extracting the backscattering characteristic from the frequency domain. Although DSN-v1 has obtained impressive classification ability, there is some room for improvement. In this letter, DSN-v2 is proposed to boost the classification ability of man-made and natural objects in SAR images. The improvement is reflected in two aspects. First, a multi-scale sub-band feature extraction (MSFE) component is designed for natural objects. Since we observe that their multi-scale sub-band spectra differ significantly, multiple encoders are used to extract effective features. Second, the additive angular margin (AAM) loss is introduced to distinguish man-made objects more clearly by manually adding a margin to the decision boundary. The experimental results on the Sentinel-1 (S1) dataset show DSN-v2 achieves superior classification performance and model training speed compared with DSN-v1.
Article
In modern radar applications, data-driven sea-land segmentation of complex environments based directly on radar returns, without the help of electronic charts and other information, has attracted researchers' attention because exogenous information is static and cannot be updated in real time. However, sea-land segmentation based on radar echoes still faces two problems. First, a complex clutter environment with mixed ground and sea clutter leads to a large dynamic range of clutter power, which makes power-based segmentation unreliable. Second, stable segmentation features are difficult to obtain when echoes in a single scan period are limited. In the case of multi-frame radar echoes, this paper first unwraps the phases of two adjacent echo sequences and extracts a new similarity measure for sea-land segmentation based on the covariance of phase difference sequences between sea clutter and ground clutter. Then, inspired by the idea of iteration, this paper proposes a method to iterate the covariance matrix by iterating the clutter map of the characteristic differences of multi-frame echoes to distinguish between sea and ground clutter more effectively. Experimental results based on measured data show that the proposed sea-land segmentation method based on multi-frame echoes can effectively separate sea and land areas and ensure the quality of the segmentation results. After several consecutive scan periods, the proposed sea-land segmentation method is more accurate and robust than other sea-land segmentation methods.
Article
In the realm of infrared small-target detection, the weighted local contrast approaches, which seek to improve targets by the defined weighted factors, have garnered a lot of interest. However, these methods suffer from several problems. 1) The vast number of local contrast sliding sub-windows restricts the time efficiency. 2) The dim targets in the complicated backgrounds are incorrectly eliminated by the background suppression procedure. 3) The background noise in the complicated environment cannot be effectively muted. A simplified dual-weighted three-layer window local contrast method (SDWTLLCM) is suggested in this work as a solution to these issues. In order to extract tiny targets and suppress complicated backgrounds, a hierarchical convolution filtering window is first created. Then, even without sub-window division, a simple three-layer sliding window is created for time efficiency enhancement. The dual-weighted local contrast approach is also intended to minimize the background and further highlight tiny objects. Eventually, the tiny targets may be extracted more effectively using the adaptive threshold segmentation procedure. Extensive experimental results show that the proposed method is both effective and efficient.
Article
Realizing robust infrared small target detection in complex backgrounds is essential for infrared search and tracking (IRST) applications. However, the high-intensity structures in background regions, such as the sharp edges, make it a challenging task, especially when the target has a low signal-to-clutter ratio (SCR). To address this issue, we propose an infrared small target detection method using local contrast-weighted multidirectional derivative (LCWMD). It is a robust detector that comprehensively considers the target property, background information, and the relation between them. First, we take into account the approximate isotropy of the infrared small target and present a new multidirectional derivative with penalty factors based on the Facet model to develop the target salience in the local region. Second, a dual local contrast fusion model with the tri-layer design is introduced to amplify the difference between the target and the background, so as to further suppress the high-intensity structural clutters. Finally, the LCWMD map is obtained by weighting the above two filtered maps, after which an adaptive segmentation operation is applied to accomplish the target detection. The results of comparative experiments implemented on real infrared images demonstrate that our method outperforms other state-of-the-art detectors by several times in terms of signal-to-clutter ratio gain (SCRG) and background suppression factor (BSF).
Article
Intelligent unmanned aerial vehicle (UAV) surveillance based on infrared imaging has wide applications in the anti-UAV system for protecting urban security and aerial safety. However, weak target features and complex background distraction pose great challenges for the accurate detection of UAVs. To address this issue, we propose a novel differentiated attention guided network to adaptively strengthen the discriminative features between UAV targets and complex background. First, a novel spatial-aware channel attention (SCA) is introduced into deep layers via preserving critical spatial features and leveraging channel interdependencies to focus on the large-scale targets. The channel-modulated deformable spatial attention is introduced into shallow layers via refining channel context and dynamically perceiving the spatial features for focusing on the small-scale targets. A combination of the above two attention mechanisms is employed in intermediate layers of the network for concentrating on the medium-scale targets. Then, we embed a feature aggregator at the detection branches to guide the information exchange of high-level feature maps and low-level feature maps with a bottom-up context modulation, and integrate an SCA at the end to further boost the distinctive feature representation for task-awareness. The above design can adaptively enhance multi-scale UAV target features and suppress complex background interference, leading to better detection performance, especially for small targets. Extensive experiments on real infrared UAV datasets reveal that the proposed method outperforms the baseline object detectors by a large margin, validating its feasibility in real-world infrared UAV detection. The source code can be found at https://github.com/KALEIDOSCOPEIP/DAGNet
Article
Efficient visual detection is a crucial component in self-driving perception and lays the foundation for later planning and control stages. Deep-networks-based visual systems achieve state-of-the-art performance, but they are usually cumbersome and computationally infeasible for embedded devices (e.g., dash cams). Knowledge distillation is an effective way to derive more efficient models. However, most existing works target classification tasks and treat all instances equally. In this paper, we first present our Adaptive Instance Distillation (AID) method for self-driving visual detection. It can selectively impart the teacher's knowledge to the student by re-weighting each instance and each scale for distillation based on the teacher's loss. In addition, to enable the student to effectively digest knowledge from multiple sources, we also propose a Multi-Teacher Adaptive Instance Distillation (M-AID) method. Our M-AID helps the student to learn the best knowledge from each teacher w.r.t. certain instances and scales. Unlike previous KD methods, our M-AID adjusts the distillation weights in an instance, scale, and teacher adaptive manner. Experiments on the KITTI, COCO-Traffic, and SODA10M datasets show that our methods improve the performance of a wide variety of state-of-the-art KD methods on different detectors in self-driving scenarios. Compared to the baseline, our AID leads to an average of 2.28% and 2.98% mAP increases for single-stage and two-stage detectors, respectively. By strategically integrating knowledge from multiple teachers, our M-AID method achieves an average of 2.92% mAP improvement.
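The abstract says instances are re-weighted based on the teacher's loss but does not give the exact rule. One plausible minimal sketch is below: it down-weights instances on which the teacher's own loss is high, on the assumption that an unreliable teacher prediction should transfer less. The function name, the softmax form, and the sign convention are all assumptions, not AID's actual formulation.

```python
import numpy as np

def adaptive_instance_weights(teacher_losses, tau=1.0):
    """Per-instance distillation weights from the teacher's own loss:
    a softmax over negative (scaled) losses, so instances the teacher
    handles well contribute more to the distillation objective.
    (Assumed rule; the paper's weighting may differ in direction/form.)"""
    z = -np.asarray(teacher_losses, float) / tau
    z -= z.max()                 # numerical stability
    w = np.exp(z)
    return w / w.sum()

# Two easy instances and one the teacher struggles with
w = adaptive_instance_weights([0.1, 0.1, 2.0])
```

The weighted distillation loss is then a dot product of these weights with the per-instance KD terms; the temperature tau controls how sharply the weighting concentrates.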
Article
Knowledge distillation (KD), as an efficient and effective model compression technique, has received considerable attention in deep learning. The key to its success lies in transferring knowledge from a large teacher network to a small student network. However, most existing KD methods consider only one type of knowledge learned from either instance features or relations via a specific distillation strategy, failing to explore the idea of transferring different types of knowledge with different distillation strategies. Moreover, the widely used offline distillation also suffers from a limited learning capacity due to the fixed large-to-small teacher-student architecture. In this article, we devise a collaborative KD via multiknowledge transfer (CKD-MKT) that prompts both self-learning and collaborative learning in a unified framework. Specifically, CKD-MKT utilizes a multiple knowledge transfer framework that assembles self and online distillation strategies to effectively: 1) fuse different kinds of knowledge, which allows multiple students to learn knowledge from both individual instances and instance relations, and 2) guide each other by learning from themselves using collaborative and self-learning. Experiments and ablation studies on six image datasets demonstrate that the proposed CKD-MKT significantly outperforms recent state-of-the-art methods for KD.
Article
Thermal infrared imaging possesses the ability to monitor unmanned aerial vehicles (UAVs) in both day and night conditions. However, long-range detection of the infrared UAVs often suffers from small/dim targets, heavy clutter, and noise in the complex background. The conventional local prior-based and the nonlocal prior-based methods commonly have a high false alarm rate and low detection accuracy. In this letter, we propose a model that converts small UAV detection into a problem of predicting the residual image (i.e., background, clutter, and noise). Such novel reformulation allows us to directly learn a mapping from the input infrared image to the residual image. The constructed image-to-image network integrates the global and the local dilated residual convolution blocks into the U-Net, which can capture local and contextual structure information well and fuse the features at different scales both for image reconstruction. Additionally, subpixel convolution is utilized to upscale the image and avoid image distortion during upsampling. Finally, the small UAV target image is obtained by subtracting the residual image from the input infrared image. The comparative experiments demonstrate that the proposed method outperforms state-of-the-art ones in detecting real-world infrared images with heavy clutter and dim targets.
Article
In computer vision, object detection is one of the most important tasks, which underpins a few instance-level recognition tasks and many downstream applications. Recently one-stage methods have gained much attention over two-stage approaches due to their simpler design and competitive performance. Here we propose a fully convolutional one-stage object detector (FCOS) to solve object detection in a per-pixel prediction fashion, analogue to other dense prediction problems such as semantic segmentation. Almost all state-of-the-art object detectors such as RetinaNet, SSD, YOLOv3, and Faster R-CNN rely on pre-defined anchor boxes. In contrast, our proposed detector FCOS is anchor box free, as well as proposal free. By eliminating the pre-defined set of anchor boxes, FCOS completely avoids the complicated computation related to anchor boxes such as calculating the intersection over union (IoU) scores during training. More importantly, we also avoid all hyper-parameters related to anchor boxes, which are often sensitive to the final detection performance. With the only post-processing non-maximum suppression (NMS), we demonstrate a much simpler and flexible detection framework achieving improved detection accuracy. We hope that the proposed FCOS framework can serve as a simple and strong alternative for many other instance-level tasks. Code is available at: git.io/AdelaiDet
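The per-pixel formulation above replaces anchor matching with direct regression targets and a center-ness score, both defined in the FCOS paper. A minimal numpy sketch (the helper function names are illustrative; the (l, t, r, b) targets and the center-ness formula follow the paper):

```python
import numpy as np

def fcos_targets(x, y, box):
    """Per-pixel FCOS regression target (l, t, r, b): distances from
    location (x, y) to the four sides of a ground-truth box
    (x0, y0, x1, y1). The location is a positive sample only if it
    falls inside the box (all four distances positive)."""
    x0, y0, x1, y1 = box
    return x - x0, y - y0, x1 - x, y1 - y

def centerness(l, t, r, b):
    """FCOS center-ness: sqrt(min(l,r)/max(l,r) * min(t,b)/max(t,b)),
    equal to 1 at the box center and decaying toward the borders; used
    to down-weight low-quality predictions far from the center."""
    return np.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

box = (0.0, 0.0, 10.0, 10.0)
c_center = centerness(*fcos_targets(5.0, 5.0, box))   # 1.0 at the center
c_edge = centerness(*fcos_targets(1.0, 5.0, box))     # smaller near a side
```

At inference the predicted center-ness multiplies the classification score, so NMS preferentially keeps detections regressed from near-center locations.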
Article
Knowledge distillation (KD) is an effective learning paradigm for improving the performance of lightweight student networks by utilizing additional supervision knowledge distilled from teacher networks. Most pioneering studies either learn from only a single teacher, neglecting the potential for a student to learn from multiple teachers simultaneously, or simply treat each teacher as equally important, unable to reveal the different importance of teachers for specific examples. To bridge this gap, we propose a novel adaptive multi-teacher multi-level knowledge distillation learning framework (AMTML-KD), which consists of two novel insights: (i) associating each teacher with a latent representation to adaptively learn instance-level teacher importance weights, which are leveraged for acquiring integrated soft targets (high-level knowledge), and (ii) enabling intermediate-level hints (intermediate-level knowledge) to be gathered from multiple teachers via the proposed multi-group hint strategy. As such, a student model can learn multi-level knowledge from multiple teachers through AMTML-KD. Extensive results on publicly available datasets demonstrate that the proposed learning framework enables the student to achieve better performance than strong competitors.
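The adaptive teacher weighting of insight (i) can be sketched as below. This is a simplified illustration under assumed shapes, not AMTML-KD's actual latent-representation mechanism: `importance_scores` stands in for the scores the framework would learn per instance and per teacher.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_soft_targets(teacher_logits, importance_scores):
    """Combine per-teacher soft targets with instance-level weights.
    teacher_logits: (T, N, C); importance_scores: (N, T).
    Returns integrated soft targets of shape (N, C)."""
    weights = softmax(importance_scores)    # per-instance teacher weights
    soft_targets = softmax(teacher_logits)  # each teacher's class probs
    # weighted sum over teachers for every instance
    return np.einsum('nt,tnc->nc', weights, soft_targets)

# Usage: two teachers disagreeing on one 3-class instance, weighted equally.
logits = np.zeros((2, 1, 3))
logits[0, 0] = [2.0, 0.0, 0.0]   # teacher 0 favors class 0
logits[1, 0] = [0.0, 2.0, 0.0]   # teacher 1 favors class 1
scores = np.array([[0.0, 0.0]])  # equal instance-level importance
fused = fuse_soft_targets(logits, scores)
```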
Article
Effective and efficient infrared (IR) small target detection is essential for IR search and tracking (IRST) systems. Current methods have limitations in background suppression or in detecting targets close to each other. In this letter, a double-neighborhood gradient method (DNGM) is proposed. First, a tri-layer sliding window is designed to measure the double-neighborhood gradient. Then, the DNGM is obtained by multiplying the two double-neighborhood gradients. In this way, even if target sizes vary from 2 x 1 to 9 x 9 pixels, targets can be better highlighted under a fixed scale and background interference can be suppressed. Finally, the target is segmented from the DNGM saliency map by an adaptive threshold. Experiments illustrate that the proposed method avoids the "expansion effect" of the traditional multiscale human vision system (HVS) method and can accurately detect multiple targets close to each other. Moreover, the proposed method is more robust and more efficient in real time than existing methods.
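The multiplicative double-neighborhood idea can be sketched as follows. This is a heavily simplified illustration, not the letter's tri-layer formulation: the inner and outer ring radii `r1`, `r2` and the center-vs-mean gradient are assumptions, and only the key property is kept, namely that a pixel must exceed both neighborhoods for the product of gradients to be large.

```python
import numpy as np

def dngm_map(img, r1=1, r2=2):
    """Simplified double-neighborhood gradient map: compare each pixel
    with the means of an inner and an outer neighborhood, and multiply
    the two (clipped) gradients so that only pixels brighter than BOTH
    neighborhoods score highly, suppressing edge-like clutter."""
    h, w = img.shape
    out = np.zeros_like(img, dtype=float)
    for y in range(r2, h - r2):
        for x in range(r2, w - r2):
            c = img[y, x]
            inner = img[y - r1:y + r1 + 1, x - r1:x + r1 + 1]
            outer = img[y - r2:y + r2 + 1, x - r2:x + r2 + 1]
            g_in = c - (inner.sum() - c) / (inner.size - 1)
            g_out = c - (outer.sum() - c) / (outer.size - 1)
            out[y, x] = max(g_in, 0.0) * max(g_out, 0.0)
    return out

# Usage: a single bright pixel on a flat background peaks in the map.
img = np.zeros((9, 9))
img[4, 4] = 1.0
sal = dngm_map(img)
```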
Article
In object detection, the intersection over union (IoU) threshold is frequently used to define positives/negatives. The threshold used to train a detector defines its quality. While the commonly used threshold of 0.5 leads to noisy (low-quality) detections, detection performance frequently degrades for larger thresholds. This paradox of high-quality detection has two causes: 1) overfitting, due to vanishing positive samples for large thresholds, and 2) inference-time quality mismatch between detector and test hypotheses. A multi-stage object detection architecture, the Cascade R-CNN, composed of a sequence of detectors trained with increasing IoU thresholds, is proposed to address these problems. The detectors are trained sequentially, using the output of one detector as the training set for the next. This resampling progressively improves hypothesis quality, guaranteeing a positive training set of equivalent size for all detectors and minimizing overfitting. The same cascade is applied at inference to eliminate quality mismatches between hypotheses and detectors. An implementation of the Cascade R-CNN without bells or whistles achieves state-of-the-art performance on the COCO dataset and significantly improves high-quality detection on generic and specific object datasets, including VOC, KITTI, CityPerson, and WiderFace. Finally, the Cascade R-CNN is generalized to instance segmentation, with nontrivial improvements over the Mask R-CNN.
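The increasing-IoU-threshold assignment can be sketched as below. This is a toy illustration, not the Cascade R-CNN implementation: `refine` is a hypothetical per-stage box regressor (omitted here), and the thresholds 0.5/0.6/0.7 follow the paper's cascade design.

```python
def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def cascade_positives(proposals, gt, thresholds=(0.5, 0.6, 0.7), refine=None):
    """Collect each stage's positive set under its (increasing) IoU
    threshold; each stage may first refine surviving hypotheses so
    their quality rises to match the next, stricter assignment."""
    stage_positives = []
    boxes = proposals
    for thr in thresholds:
        boxes = [refine(b) for b in boxes] if refine else boxes
        stage_positives.append([b for b in boxes if iou(b, gt) >= thr])
    return stage_positives

# Usage: a perfect proposal (IoU 1.0) and a loose one (IoU 0.6);
# the loose proposal survives stages 1-2 but not the 0.7 stage.
gt = [0, 0, 10, 10]
proposals = [[0, 0, 10, 10], [0, 0, 10, 6]]
stages = cascade_positives(proposals, gt)
```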