Article · PDF available

PTB-TIR: A Thermal Infrared Pedestrian Tracking Benchmark


Abstract and Figures

Thermal infrared (TIR) pedestrian tracking is an important component of many computer vision applications and has a major advantage: it can track pedestrians in total darkness. The ability to evaluate TIR pedestrian trackers fairly on a benchmark dataset is significant for the development of this field. However, no such benchmark dataset exists. In this paper, we develop a TIR pedestrian tracking dataset for evaluating TIR pedestrian trackers. The dataset includes 60 thermal sequences with manual annotations, and each sequence carries nine attribute labels for attribute-based evaluation. In addition to the dataset, we carried out large-scale evaluation experiments on our benchmark using nine publicly available trackers. The experimental results help us understand the strengths and weaknesses of these trackers. Furthermore, to gain more insight into TIR pedestrian trackers, we divided a tracker's function into three components: the feature extractor, the motion model, and the observation model. We then conducted three comparison experiments on our benchmark dataset to validate how each component affects the tracker's performance. The findings of these experiments provide some guidelines for future research.
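The precision and success metrics used throughout such benchmark evaluations can be sketched in a few lines of NumPy. This is a minimal illustration of the standard one-pass evaluation scores, not code from the PTB-TIR toolkit; the function names and the 20-pixel precision threshold are the conventional defaults:

```python
import numpy as np

def center_error(pred, gt):
    """Euclidean distance between box centers; boxes are rows of [x, y, w, h]."""
    pc = pred[:, :2] + pred[:, 2:] / 2.0
    gc = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(pc - gc, axis=1)

def overlap_ratio(pred, gt):
    """Intersection-over-union of axis-aligned boxes [x, y, w, h]."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-12)

def precision_score(pred, gt, threshold=20):
    """Fraction of frames whose center error is within `threshold` pixels."""
    return np.mean(center_error(pred, gt) <= threshold)

def success_score(pred, gt):
    """Area under the success curve: fraction of frames whose IoU exceeds
    each overlap threshold, averaged over a sweep of thresholds."""
    iou = overlap_ratio(pred, gt)
    thresholds = np.linspace(0, 1, 21)
    return np.mean([np.mean(iou > t) for t in thresholds])
```

A tracker's per-sequence result is just two arrays of boxes (predicted and ground-truth, one row per frame) fed to these two scores.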
... Two popular TIR-PT datasets are used in our experiments: LSOTB-TIR [25] and PTB-TIR [26]. LSOTB-TIR includes 1,280 TIR video sequences for evaluation and 120 for training, totaling more than 600K image frames. ...
... We construct a TIR-PT model that utilizes the final searched and retrained network architecture and compare it with the state-of-the-art methods, including DBF [29], DiMP [14], ECO-stir [31], ECO-deep [30], HSSNet [19], MCFTS [18], MDNet [32], MLSSNet [20], MMNet [5], SiamFC [33], SiamTri [34], and TADT [35], on the LSOTB-TIR [25] and PTB-TIR [26] benchmark datasets. Comparison results are shown in Table 2; our method achieves competitive performance. ...
... Compared with MMNet [5], which learns dual-level deep representations for TIR tracking, our method improves all metrics on LSOTB-TIR by more than 18%. The TIR-PT model searched and retrained on the LSOTB-TIR dataset is indeed transferable to the PTB-TIR [26] dataset, achieving the best success score of 0.641 and a comparable precision score of 0.776, and outperforming DiMP with relative gains of 2.7% and 2.3%, respectively. ECO-STIR [31] uses ResNet-50 as its base network and trains on synthetic TIR data generated from RGB data, which obtains the best precision score of 0.830, a relative gain of 5.4% over our method. ...
Article
Full-text available
Manually designed network architectures for thermal infrared pedestrian tracking (TIR-PT) require substantial effort from human experts. AlexNet and ResNet are widely used as backbone networks in TIR-PT applications. However, these architectures were originally designed for image classification and object detection, tasks that are less complex than the challenges presented by TIR-PT. This paper makes an early attempt to automatically search for an optimal network architecture for TIR-PT, employing single-bottom and dual-bottom cells as basic search units and incorporating eight operation candidates within the search space. To expedite the search process, a random channel selection strategy is employed prior to assessing operation candidates. Classification, batch-hard triplet, and center losses are jointly used to retrain the searched architecture. The outcome is a high-performance network architecture that is both parameter- and computation-efficient. Extensive experiments demonstrated the effectiveness of the automated method.
... To give the trained convolutional feature extraction network a strong target representation capability, an unsupervised learning method is adopted to generate pseudo-labels for the unlabeled training samples in the source RGB domain for training the feature extraction network. • Extensive experimental results show the competitiveness of the proposed tracking method compared with other tracking methods on the PTB-TIR [35] and the LSOTB-TIR [36] benchmarks. ...
... To verify the performance of the proposed unsupervised cross-domain tracking method (UCDT), we conducted comparative experiments between UCDT and several state-of-the-art tracking methods on the PTB-TIR [35] and LSOTB-TIR [36] benchmark datasets. Following the PTB-TIR [35] benchmark, we used the precision and success scores as the evaluation metrics for the proposed UCDT tracker. More details of the testing benchmarks and evaluation metrics can be found in [35,36]. ...
Article
Full-text available
The limited availability of thermal infrared (TIR) training samples leads to suboptimal target representation by convolutional feature extraction networks, which adversely impacts the accuracy of TIR target tracking methods. To address this issue, we propose an unsupervised cross-domain model (UCDT) for TIR tracking. Our approach leverages labeled training samples from the RGB domain (source domain) to train a general feature extraction network. We then employ a cross-domain model to adapt this network for effective target feature extraction in the TIR domain (target domain). This cross-domain strategy addresses the challenge of limited TIR training samples effectively. Additionally, we utilize an unsupervised learning technique to generate pseudo-labels for unlabeled training samples in the source domain, which helps overcome the limitations imposed by the scarcity of annotated training data. Extensive experiments demonstrate that our UCDT tracking method outperforms existing tracking approaches on the PTB-TIR and LSOTB-TIR benchmarks.
... Weight decay of 0.0005 and momentum of 0.9 are used. The evaluation method used in this article was OPE, and the evaluation metrics were precision and success rate [45]. ...
... The testing datasets we used were LSOTB-TIR [44] and PTB-TIR [45]. The LSOTB-TIR dataset consists of a tracking evaluation dataset and a general training dataset. ...
... B. Quantitative Analysis 1) Experiments on PTB-TIR Benchmark [45]: Fig. 6 shows the experimental results of our Siamese network with a hierarchical attention mechanism (SiamHAN) tracker and other state-of-the-art tracking methods, including MDNet [57], SiamRPN++ [16], AMFT [53], SRDCF [51], DSiam [55], TADT [47], MMNet [35], VITAL [58], MCCT [49], GFSDCF [54], UDT [56], CREST [52], MLSSNet [48], MCFTS [32], SiamFC [14], HSSNet [18], SiamTri [34], CFNet [36], and STAMT [46], on the PTB-TIR [45] benchmark. Compared to these deep DCF-based tracking methods [46], [52], [54], our SiamHAN tracking method uses a hierarchical attention Siamese network for feature extraction, which improves tracking accuracy. ...
Article
Full-text available
Thermal infrared (TIR) target tracking is an important topic in the computer vision area. TIR images are not affected by ambient light and have strong environmental adaptability, making them widely used in battlefield perception, video surveillance, assisted driving, etc. However, TIR target tracking faces problems such as relatively insufficient information and a lack of target texture, which significantly affects the accuracy of TIR tracking methods. To solve these problems, we propose a TIR target tracking method based on a Siamese network with a hierarchical attention mechanism (SiamHAN). Specifically, the CIoU loss is introduced to make full use of the regression-box information and compute the loss function more accurately. The GCNet attention mechanism is introduced to reconstruct the feature extraction structure so as to capture the fine-grained information of thermal infrared images. Meanwhile, the ECANet attention mechanism performs hierarchical feature fusion across the layers of the Siamese backbone, so that the tracker can fully utilize multi-layer features to represent the target. On LSOTB-TIR, the hierarchical attention Siamese network achieved a 2.9% increase in success rate and a 4.3% increase in precision relative to the baseline tracker. Experiments show that the proposed SiamHAN method achieves competitive tracking results on the thermal infrared testing datasets.
... Most existing tracking datasets that include thermal images are annotated for single-object tracking [30,32,33]. For the task of MOT, the City-Scene dataset [1,14] contains 15 sequences for a total of 1,997 annotated frames for both a FLIR thermal camera and an RGB camera. ...
Preprint
Full-text available
Multiple Object Tracking (MOT) in thermal imaging presents unique challenges due to the lack of visual features and the complexity of motion patterns. This paper introduces an innovative approach to improve MOT in the thermal domain by developing a novel box association method that utilizes both thermal object identity and motion similarity. Our method merges thermal feature sparsity and dynamic object tracking, enabling more accurate and robust MOT performance. Additionally, we present a new dataset comprised of a large-scale collection of thermal and RGB images captured in diverse urban environments, serving as both a benchmark for our method and a new resource for thermal imaging. We conduct extensive experiments to demonstrate the superiority of our approach over existing methods, showing significant improvements in tracking accuracy and robustness under various conditions. Our findings suggest that incorporating thermal identity with motion data enhances MOT performance. The newly collected dataset and source code are available at https://github.com/wassimea/thermalMOT
... Among these, the datasets from OSU [62], LITIV [63], ASL-TID [64], and BUTIV [65] are out of date and impractical for certain applications, like short-term tracking of a single target. The VOT-TIR15 [66], VOT-TIR16 [67], VOT-TIR17 [68], PTB-TIR [69], and LSOTB-TIR [70] datasets, on the other hand, are widely recognized and frequently used to assess the effectiveness of TIR trackers. These datasets are useful reference points for evaluating the precision and efficacy of TIR tracking techniques. ...
Preprint
Full-text available
Unmanned Aerial Vehicles (UAVs) are becoming more popular in various sectors, offering many benefits, yet introducing significant challenges to privacy and safety. This paper investigates state-of-the-art solutions for detecting and tracking quadrotor UAVs to address these concerns. Cutting-edge deep learning models, specifically the YOLOv5 and YOLOv8 series, are evaluated for their performance in identifying UAVs accurately and quickly. Additionally, robust tracking systems, BoT-SORT and Byte Track, are integrated to ensure reliable monitoring even under challenging conditions. Our tests on the DUT dataset reveal that while YOLOv5 models generally outperform YOLOv8 in detection accuracy, the YOLOv8 models excel in recognizing less distinct objects, demonstrating their adaptability and advanced capabilities. Furthermore, BoT-SORT demonstrated superior performance over Byte Track, achieving higher IoU and lower center error in most cases, indicating more accurate and stable tracking. Code: https://github.com/zmanaa/UAV_detection_and_tracking Tracking demo: https://drive.google.com/file/d/1pe6HC5kQrgTbA2QrjvMN-yjaZyWeAvDT/view?usp=sharing
Article
Full-text available
Discriminative methods have been widely applied to construct the appearance model for visual tracking. Most existing methods incorporate an online updating strategy to adapt to the appearance variations of targets. The focus of online updating for discriminative methods is to select the positive samples that emerged in past frames to represent the appearances. However, the appearances of positive samples might be very dissimilar to each other; traditional online updating strategies easily overfit to some appearances and neglect the others. To address this problem, we propose an effective method to learn a discriminative template that maintains the multiple-appearance information of targets over long-term variations. Our method is based on the observation that the target appearance varies very little across a certain number of successive video frames, so a few instances can represent the appearances within such a span. We propose an exclusive group sparsity model to describe this observation and provide a novel algorithm, called coefficients-constrained exclusive group LASSO, to solve it in a single objective function. The experimental results on the CVPR 2013 benchmark dataset demonstrate that our approach achieves promising performance.
Article
Full-text available
Linear discriminant analysis (LDA) is a very popular supervised feature extraction method and has been extended to different variants. However, classical LDA has the following problems: 1) the obtained discriminant projection does not have good interpretability for features; 2) LDA is sensitive to noise; and 3) LDA is sensitive to the number of projection directions selected. In this paper, a novel feature extraction method called robust sparse linear discriminant analysis (RSLDA) is proposed to solve the above problems. Specifically, RSLDA adaptively selects the most discriminative features for discriminant analysis by introducing the ℓ2,1 norm. An orthogonal matrix and a sparse matrix are also simultaneously introduced to guarantee that the extracted features hold the main energy of the original data and to enhance robustness to noise; thus, RSLDA has the potential to perform better than other discriminant methods. Extensive experiments on six databases demonstrate that the proposed method achieves competitive performance compared with other state-of-the-art feature extraction methods. Moreover, the proposed method is robust to noisy data.
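The ℓ2,1 norm used for feature selection in methods like the one above sums the Euclidean norms of a matrix's rows, so minimizing it drives entire rows of the projection toward zero and thereby discards whole features. A small illustrative NumPy sketch (not the authors' code):

```python
import numpy as np

def l21_norm(W):
    """l2,1 norm: the sum of the l2 norms of the rows of W.
    When W is a projection matrix with one row per input feature,
    penalizing this norm zeroes out entire rows, i.e. drops features."""
    return float(np.sum(np.linalg.norm(W, axis=1)))
```

For example, a matrix with rows [3, 4], [0, 0], and [0, 5] has ℓ2,1 norm 5 + 0 + 5 = 10; the all-zero second row contributes nothing, which is exactly the row-sparsity the penalty rewards.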
Article
Full-text available
Unlike the visual object tracking, thermal infrared object tracking can track a target object in total darkness. Therefore, it has broad applications, such as in rescue and video surveillance at night. However, there are few studies in this field mainly because thermal infrared images have several unwanted attributes, which make it difficult to obtain the discriminative features of the target. Considering the powerful representational ability of convolutional neural networks and their successful application in visual tracking, we transfer the pre-trained convolutional neural networks based on visible images to thermal infrared tracking. We observe that the features from the fully-connected layer are not suitable for thermal infrared tracking due to the lack of spatial information of the target, while the features from the convolution layers are. Besides, the features from a single convolution layer are not robust to various challenges. Based on this observation, we propose a correlation filter based ensemble tracker with multi-layer convolutional features for thermal infrared tracking (MCFTS). Firstly, we use pre-trained convolutional neural networks to extract the features of the multiple convolution layers of the thermal infrared target. Then, a correlation filter is used to construct multiple weak trackers with the corresponding convolution layer features. These weak trackers give the response maps of the target’s location. Finally, we propose an ensemble method that coalesces these response maps to get a stronger one. Furthermore, a simple but effective scale estimation strategy is exploited to boost the tracking accuracy. To evaluate the performance of the proposed tracker, we carry out experiments on two thermal infrared tracking benchmarks: VOT-TIR 2015 and VOT-TIR 2016. The experimental results demonstrate that our tracker is effective and achieves promising performance.
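The ensemble step described above, coalescing the response maps of several correlation-filter weak trackers into a stronger one, can be sketched as a weighted sum of per-layer response maps followed by a peak search. This is a simplified illustration rather than the MCFTS implementation; the plain normalized-average weighting here stands in for whatever reliability weighting the paper uses:

```python
import numpy as np

def fuse_response_maps(responses, weights=None):
    """Combine per-layer correlation responses into one map.
    responses: list of HxW response maps, one per convolutional layer.
    weights: per-layer reliability weights; uniform if omitted."""
    responses = np.stack(responses)                    # (L, H, W)
    if weights is None:
        weights = np.ones(len(responses))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                  # normalize to sum to 1
    return np.tensordot(weights, responses, axes=1)    # weighted sum over layers

def locate_target(fused):
    """Estimated target position = peak of the fused response map."""
    return np.unravel_index(np.argmax(fused), fused.shape)
```

A layer whose weak tracker is more reliable simply gets a larger weight, shifting the fused peak toward its response.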
Article
The performance of the tracking task directly depends on target object appearance features. Therefore, a robust method for constructing appearance features is crucial for avoiding tracking failure. The tracking methods based on Convolution Neural Network (CNN) have exhibited excellent performance in the past years. However, the features from each original convolutional layer are not robust to the size change of target object. Once the size of the target object has significant changes, the tracker drifts away from the target object. In this paper, we present a novel tracker based on multi-scale feature, spatiotemporal features and deep residual network to accurately estimate the size of the target object. Our tracker can successfully locate the target object in the consecutive video frames. To solve the multi-scale change issue in visual object tracking, we sample each input image with 67 different size templates and resize the samples to a fixed size. And then these samples are used to offline train deep residual network model with multi-scale feature that we have built up. After that spatial feature and temporal feature are fused into the deep residual network model with multi-scale feature, so that we can get deep multi-scale spatiotemporal features model, which is named MSST-ResNet feature model. Finally, MSST-ResNet feature model is transferred into the tracking tasks and combined with three different Kernelized Correlation Filters (KCFs) to accurately locate target object in the consecutive video frames. Unlike the previous trackers, we directly learn various change of the target appearance by building up a MSST-ResNet feature model. The experimental results demonstrate that the proposed tracking method outperforms the state-of-the-art tracking methods.
Article
Most thermal infrared (TIR) tracking methods are discriminative, which treat the tracking problem as a classification task. However, the objective of the classifier (label prediction) is not coupled to the objective of the tracker (location estimation). The classification task focuses on the between-class difference of the arbitrary objects, while the tracking task mainly deals with the within-class difference of the same objects. In this paper, we cast the TIR tracking problem as a similarity verification task, which is well coupled to the objective of tracking task. We propose a TIR tracker via a hierarchical Siamese convolutional neural network (CNN), named HSNet. To obtain both spatial and semantic features of the TIR object, we design a Siamese CNN coalescing the multiple hierarchical convolutional layers. Then, we train this network end to end on a large visible video detection dataset to learn the similarity between paired objects before we transfer the network into the TIR domain. Next, this pre-trained Siamese network is used to evaluate the similarity between the target template and target candidates. Finally, we locate the most similar one as the tracked target. Extensive experimental results on the benchmarks: VOT-TIR 2015 and VOT-TIR 2016, show that our proposed method achieves favorable performance against the state-of-the-art methods.
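The similarity-verification step described above, scoring every placement of the target template inside the search region and picking the most similar one, reduces to a cross-correlation between two feature maps. A minimal single-channel NumPy sketch (illustrative only; HSNet performs this on deep hierarchical features, not raw pixels):

```python
import numpy as np

def cross_correlate(template, search):
    """Score every valid placement of `template` inside `search` by the
    inner product of the template with the overlapped window; the peak
    of the resulting map is the most similar candidate location."""
    th, tw = template.shape
    sh, sw = search.shape
    out = np.empty((sh - th + 1, sw - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(template * search[i:i + th, j:j + tw])
    return out
```

In a Siamese tracker both inputs pass through the same feature network before this correlation, so "similar" means similar in the learned feature space rather than in raw intensity.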
Conference Paper
We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
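The core of the descriptor studied above, fine-scale gradients binned by orientation over coarse spatial cells, can be sketched directly in NumPy. This bare-bones version (hypothetical helper, not the paper's code) omits the overlapping-block contrast normalization that the paper also finds important:

```python
import numpy as np

def hog_cells(image, cell=8, bins=9):
    """Per-cell histograms of oriented gradients (unsigned, 0-180 degrees).
    Each pixel votes its gradient magnitude into the orientation bin of
    the cell containing it. Block normalization is omitted for brevity."""
    gy, gx = np.gradient(image.astype(float))          # fine-scale gradients
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0       # unsigned orientation
    h, w = image.shape
    ch, cw = h // cell, w // cell
    hist = np.zeros((ch, cw, bins))
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    for i in range(ch * cell):
        for j in range(cw * cell):
            hist[i // cell, j // cell, bin_idx[i, j]] += mag[i, j]
    return hist
```

A vertical edge, for instance, produces purely horizontal gradients, so all of its magnitude lands in the 0-degree orientation bin of the cells it crosses.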
Article
Real-time object tracking has wide applications in time-critical multimedia processing areas such as motion analysis and human-computer interaction. It remains a hard problem to balance accuracy and speed. In this paper, we present a fast real-time context-based visual tracking algorithm with a new saliency prior context (SPC) model. Based on the probability formulation, the tracking problem is solved by sequentially maximizing the computed confidence map of the target location in each video frame. To handle the various cases of feature distributions generated from different targets and their contexts, we exploit low-level features as well as fast spectral analysis for saliency to build a new prior context model. Based on this model and a spatial context model learned online, a confidence map is computed and the target location is estimated. In addition, under this framework, the tracking procedure can be accelerated by the fast Fourier transform (FFT), so the new method generally achieves real-time running speed. Extensive experiments show that our tracking algorithm based on the proposed SPC model achieves real-time computational efficiency with the overall best performance compared with other state-of-the-art methods.
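The FFT acceleration mentioned above rests on correlation in the spatial domain becoming element-wise multiplication in the frequency domain, turning an O(N²) sliding-window evaluation into an O(N log N) transform. A small NumPy sketch of this equivalence for circular cross-correlation (illustrative machinery only; the SPC model adds its saliency and spatial-context priors on top of it):

```python
import numpy as np

def circular_correlation_fft(a, b):
    """Circular cross-correlation of two equally sized 2-D arrays via FFT:
    corr(a, b) = IFFT( conj(FFT(a)) * FFT(b) ).
    Entry [dy, dx] equals sum over all pixels of a[y, x] * b[y+dy, x+dx],
    with indices wrapping around (circular boundary)."""
    return np.real(np.fft.ifft2(np.conj(np.fft.fft2(a)) * np.fft.fft2(b)))
```

The wrap-around boundary is the price of the speedup, which is why FFT-based trackers typically window or pad their search regions.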
Article
This paper studies the problem of object tracking in challenging scenarios by leveraging multimodal visual data. We propose a grayscale-thermal object tracking method in Bayesian filtering framework based on multitask Laplacian sparse representation. Given one bounding box, we extract a set of overlapping local patches within it, and pursue the multitask joint sparse representation for grayscale and thermal modalities. Then, the representation coefficients of the two modalities are concatenated into a vector to represent the feature of the bounding box. Moreover, the similarity between each patch pair is deployed to refine their representation coefficients in the sparse representation, which can be formulated as the Laplacian sparse representation. We also incorporate the modal reliability into the Laplacian sparse representation to achieve an adaptive fusion of different source data. Experiments on two grayscale-thermal datasets suggest that the proposed approach outperforms both grayscale and grayscale-thermal tracking approaches.