Article

PTB-TIR: A Thermal Infrared Pedestrian Tracking Benchmark

Abstract

Thermal infrared (TIR) pedestrian tracking is an important component of numerous computer vision applications, and it has a major advantage: it can track pedestrians in total darkness. The ability to evaluate TIR pedestrian trackers fairly on a benchmark dataset is significant for the development of this field. However, no such benchmark dataset exists. In this paper, we develop a TIR pedestrian tracking dataset for the evaluation of TIR pedestrian trackers. The dataset includes 60 thermal sequences with manual annotations, and each sequence has nine attribute labels for attribute-based evaluation. In addition to the dataset, we carried out large-scale evaluation experiments on our benchmark using nine publicly available trackers. The experimental results help us understand the strengths and weaknesses of these trackers. Furthermore, to gain more insight into TIR pedestrian trackers, we divided a tracker's functions into three components: feature extractor, motion model, and observation model. We then conducted three comparison experiments on our benchmark dataset to validate how each component affects the tracker's performance. The findings of these experiments provide some guidelines for future research.
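To make the three-component decomposition mentioned in the abstract concrete, the sketch below shows one way a simple tracker could be split into a feature extractor, a motion model, and an observation model. The class names, the raw-intensity feature, and the local random-search motion model are illustrative assumptions, not the benchmark's own code.

```python
import numpy as np

# Minimal sketch of the three-component decomposition discussed in the paper:
# feature extractor, motion model, observation model. All names are illustrative.

class FeatureExtractor:
    def extract(self, frame, box):
        x, y, w, h = box
        patch = frame[y:y + h, x:x + w].astype(np.float32)
        return patch.ravel() / (np.linalg.norm(patch) + 1e-8)  # raw-intensity feature

class MotionModel:
    def propose(self, box, n=50, radius=8, rng=None):
        rng = rng or np.random.default_rng(0)
        x, y, w, h = box
        offsets = rng.integers(-radius, radius + 1, size=(n, 2))  # local random search
        return [(x + dx, y + dy, w, h) for dx, dy in offsets]

class ObservationModel:
    def score(self, feat, template):
        return float(feat @ template)  # similarity to the target template

def track_frame(frame, prev_box, template, fe, mm, om):
    candidates = mm.propose(prev_box)
    scores = [om.score(fe.extract(frame, b), template) for b in candidates]
    return candidates[int(np.argmax(scores))]
```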
... In addition, our variant, ECOHG_LS, also obtains improvements by merging HOG and gray features to represent the object. The favorable performance against others on three thermal infrared tracking benchmarks, PTB-TIR [11], VOT-TIR2016 [12], and VOT-TIR2015 [13] demonstrates the effectiveness of our approach. ...
... The precision plot presents the percentage of frames whose CLE is less than a given threshold. The success plot shows the proportion of frames whose OR is higher than a given threshold [11]. ...
... Datasets: PTB-TIR is a dataset published for TIR pedestrian tracker evaluation, which includes 60 sequences acquired from video websites and existing commonly used TIR datasets [11]. ...
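As a reading aid for the evaluation protocol quoted above, here is a minimal sketch of how precision and success curves can be computed from per-frame center location error (CLE) and overlap ratio (OR). The threshold ranges follow common conventions and are assumptions; the benchmark's exact settings should be taken from the paper.

```python
import numpy as np

def precision_curve(cle, thresholds=np.arange(0, 51)):
    """Fraction of frames whose center location error is below each threshold."""
    cle = np.asarray(cle, dtype=float)
    return np.array([(cle <= t).mean() for t in thresholds])

def success_curve(overlap, thresholds=np.linspace(0, 1, 21)):
    """Fraction of frames whose overlap ratio exceeds each threshold."""
    overlap = np.asarray(overlap, dtype=float)
    return np.array([(overlap > t).mean() for t in thresholds])

# Example with made-up per-frame errors and overlaps:
cle = [3.0, 7.5, 12.0, 40.0, 5.0]
iou = [0.82, 0.65, 0.41, 0.05, 0.77]
print(precision_curve(cle)[20])     # precision at the 20-pixel threshold
print(success_curve(iou).mean())    # area under the success curve (AUC)
```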
Article
Full-text available
Existing thermal infrared (TIR) trackers based on correlation filters cannot adapt to the abrupt scale variation of nonrigid objects. This deficiency could even lead to tracking failure. To address this issue, we propose a TIR tracker, called ECO_LS, which improves the performance of efficient convolution operators (ECO) via the level set method. We first utilize the level set to segment the local region estimated by the ECO tracker to gain a more accurate size of the bounding box when the object changes its scale suddenly. Then, to accelerate the convergence speed of the level set contour, we leverage its historical information and continuously encode it to effectively decrease the number of iterations. In addition, our variant, ECOHG_LS, also achieves better performance via concatenating histogram of oriented gradient (HOG) and gray features to represent the object. Furthermore, experimental results on three infrared object tracking benchmarks show that the proposed approach performs better than other competing trackers. ECO_LS improves the EAO by 20.97% and 30.59% over the baseline ECO on VOT-TIR2016 and VOT-TIR2015, respectively.
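The level set evolution itself is too involved for a short snippet, but the final step of the ECO_LS idea, deriving a refined bounding box from a segmentation of the local search region, can be sketched as follows. The function name, inputs, and fallback behavior are assumptions for illustration only.

```python
import numpy as np

def refine_box_from_mask(mask, region_offset):
    """Turn a binary segmentation of the local search region into a tight box.

    Only the last step of the ECO_LS idea is shown; the level set segmentation
    itself is not reproduced. `mask` is an HxW boolean array and `region_offset`
    is the (x, y) of the region inside the full frame.
    """
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:          # segmentation failed: keep the tracker's own estimate
        return None
    x0, y0 = region_offset
    return (x0 + xs.min(), y0 + ys.min(),
            xs.max() - xs.min() + 1, ys.max() - ys.min() + 1)
```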
... • Extensive comparative evaluations demonstrate the superiority of the proposed tracker compared with state-of-the-art trackers on PTB-TIR [28] and LSOTB-TIR [29] datasets. ...
... To verify the tracking performance of the proposed STAMT tracker, we compared it with several state-of-the-art tracking methods on two widely used thermal infrared tracking benchmark datasets: PTB-TIR [28] and LSOTB-TIR [29]. The PTB-TIR [28] dataset includes 60 thermal infrared sequences, and these sequences have 9 challenge attribute labels, including occlusion (OCC), scale variation (SV), background clutter (BC), low resolution (LR), fast motion (FM), motion blur (MB), out-of-view (OV), intensity variation (IV), and thermal crossover (TC). The LSOTB-TIR [29] dataset includes 120 thermal infrared testing sequences. ...
Article
Thermal InfraRed (TIR) target trackers are easily disturbed by similar objects and are susceptible to target occlusion. To solve these problems, we propose a structural target-aware model (STAMT) for thermal infrared target tracking tasks. Specifically, the proposed STAMT tracker learns a target-aware model, which pays more attention to the target area in order to accurately identify the target among similar objects. In addition, considering that the target may be partially occluded during tracking, a structural weight model is proposed to locate the target through the unoccluded, reliable target parts. Ablation studies show the effectiveness of each component of the proposed tracker. Without bells and whistles, the experimental results demonstrate that our STAMT tracker performs favorably against state-of-the-art trackers on the PTB-TIR and LSOTB-TIR datasets.
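A toy illustration of the part-based idea behind the structural weight model: unoccluded parts match confidently and dominate the vote for the target location, while occluded parts are down-weighted. The weighting scheme below is an assumption for illustration, not the STAMT formulation.

```python
import numpy as np

def weighted_part_vote(part_scores, part_displacements, reliability_floor=0.1):
    """Locate a target from its reliable (unoccluded) parts.

    `part_scores` are matching confidences per part; `part_displacements` are
    the displacements each part votes for. Occluded parts tend to score low
    and are down-weighted. Illustrative only.
    """
    scores = np.asarray(part_scores, dtype=float)
    disps = np.asarray(part_displacements, dtype=float)   # shape (n_parts, 2)
    weights = np.clip(scores, reliability_floor, None)
    weights /= weights.sum()
    return weights @ disps    # weighted average displacement of the target

# Example: three parts, one heavily occluded (low score)
print(weighted_part_vote([0.9, 0.85, 0.05], [[4, 1], [5, 0], [-20, 13]]))
```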
... It has been widely used in maritime rescue, video surveillance, and driver assistance at night [1] as it can track the object in total darkness. Despite much progress, TIR object tracking still faces several challenging problems, such as distractor, occlusion, size change, and thermal crossover [2]. Due to the specific characteristic of the TIR image, the distractor is widely studied as one of the most frequent challenges in TIR object tracking. ...
... To verify the effectiveness of the proposed dual-level feature model, we propose an offline tracker and an online tracker based on the proposed feature model. Extensive experimental results on the VOT-TIR2015 [22], VOT-TIR2017 [23], and PTB-TIR [2] benchmarks demonstrate that the proposed methods perform favorably against the state-of-the-art methods. ...
... Each challenge has a corresponding subset that can be used to evaluate the ability of a tracker to handle the challenge. In addition to the VOT-TIR2015 and VOT-TIR2017 datasets, we also use a TIR pedestrian tracking dataset, PTB-TIR [2], to evaluate the proposed algorithm. PTB-TIR is a recently published tracking benchmark that contains 60 sequences with 9 different challenges, such as background clutter, occlusion, out-of-view, and scale variation. ...
Article
Full-text available
The feature models used by existing Thermal InfraRed (TIR) tracking methods are usually learned from RGB images due to the lack of a large-scale TIR image training dataset. However, these feature models are less effective in representing TIR objects, and they struggle to distinguish distractors because they do not contain fine-grained discriminative information. To this end, we propose a dual-level feature model containing a TIR-specific discriminative feature and a fine-grained correlation feature for robust TIR object tracking. Specifically, to distinguish inter-class TIR objects, we first design an auxiliary multi-classification network to learn the TIR-specific discriminative feature. Then, to recognize intra-class TIR objects, we propose a fine-grained aware module to learn the fine-grained correlation feature. These two kinds of features complement each other and represent TIR objects at the inter-class and intra-class levels, respectively. The two feature models are constructed using a multi-task matching framework and are jointly optimized on the TIR object tracking task. In addition, we develop a large-scale TIR image dataset to train the network to learn TIR-specific feature patterns. To the best of our knowledge, this is the largest TIR tracking training dataset, with the richest object classes and scenarios. To verify the effectiveness of the proposed dual-level feature model, we propose an offline TIR tracker (MMNet) and an online TIR tracker (ECO-MM) based on the feature model and evaluate them on three TIR tracking benchmarks. Extensive experimental results on these benchmarks demonstrate that the proposed algorithms perform favorably against the state-of-the-art methods.
... Techniques and applications for intelligent services based on big data, computer vision, and pattern recognition have developed rapidly, and multimedia information plays an important role in those systems, such as driving assistance [1][2][3], industry control [4], smart city [5][6][7], and intelligent medical care [8], involving all sectors of society. The rapid expansion of multimedia databases provides a hotbed for the development of artificial intelligence (AI) technologies. ...
... In the past few years, human location methods based on deep learning have matured to be used ubiquitously in computer vision systems, such as video surveillance [5], image analysis [6] and driving assistance [1]. Face detection and location are usually the crucial steps in these applications, especially for video analysis. ...
Article
Full-text available
In recent years, learning-based methods have been proposed to detect and locate humans in real time via convolutional neural networks (CNNs). However, those methods require high-performance graphics processing units (GPUs). To resolve this problem, a preprocessing procedure based on video segmentation is proposed to speed up face detection. Meanwhile, an accelerating toolkit is employed in this study to perform face detection in real time on a standard central processing unit (CPU). Experimental results indicate that the proposed method can achieve an F1-score of 93.2% and 4.5 times real-time speed with one CPU on 155,883 test frames from the RAI dataset, YouTube, and YOUKU. Notably, when a video sequence contains fewer frames with human faces, the highest speed is nearly 18 times that without video segmentation.
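A minimal sketch of the kind of preprocessing the abstract describes: skipping the expensive detector on frames that barely change, so a CPU pipeline only detects where it matters. The mean-absolute-difference test and its threshold are assumptions, not the published segmentation procedure.

```python
import numpy as np

def frames_worth_detecting(frames, diff_threshold=8.0):
    """Yield indices of frames that differ enough from the previous one.

    `frames` is an iterable of grayscale uint8 arrays. The caller runs the
    (expensive) face detector only on the yielded frames. Illustrative only.
    """
    prev = None
    for idx, frame in enumerate(frames):
        frame = frame.astype(np.float32)
        if prev is None or np.abs(frame - prev).mean() > diff_threshold:
            yield idx
        prev = frame
```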
... The average duration of short-term datasets is always less than one minute, and following benchmarks mainly innovate in video content. (e.g., TC-128 [31] evaluates color-enhanced trackers on color sequences; NUS-PRO [33] focuses on tracking pedestrian and rigid objects; UAV123 [32] assesses unmanned aerial vehicle tracking performance; PTB-TIR [35] and VOT-TIR [36] are thermal tracking datasets; GOT-10k [12] includes 563 object classes based on the WordNet [37]). ...
Preprint
Single object tracking (SOT) research has fallen into a cycle: trackers perform well on most benchmarks but quickly fail in challenging scenarios, leading researchers to blame insufficient data content and to spend more effort constructing larger datasets with more challenging situations. However, isolated experimental environments and limited evaluation methods hinder SOT research even more seriously. The former prevents existing datasets from being exploited comprehensively, while the latter neglects challenging factors in the evaluation process. In this article, we systematize the representative benchmarks and form a single object tracking metaverse (SOTVerse) - a user-defined SOT task space - to break through this bottleneck. We first propose a 3E Paradigm that describes tasks by three components (i.e., environment, evaluation, and executor). Then, we summarize task characteristics, clarify the organization standards, and construct SOTVerse with 12.56 million frames. Specifically, SOTVerse automatically labels challenging factors per frame, allowing users to generate user-defined spaces efficiently via construction rules. Besides, SOTVerse provides two mechanisms with new indicators and successfully evaluates trackers under various subtasks. Consequently, SOTVerse provides the first strategy to improve resource utilization in the computer vision area, making research more standardized and scientific. The SOTVerse, toolkit, evaluation server, and results are available at http://metaverse.aitestunion.com.
Article
Many trackers use attention mechanisms to enhance the details of feature maps. However, most attention mechanisms are designed for RGB images and thus cannot be effectively adapted to infrared images, whose features are weak and make the attention mechanism difficult to learn. Moreover, most thermal infrared trackers based on Siamese networks use traditional cross-correlation, which ignores the correlation between local parts. To address these problems, this paper proposes a Siamese multigroup spatial shift (SiamMSS) network for thermal infrared tracking. The SiamMSS network uses a spatial shift model to enhance the details of feature maps: the feature map is divided into four groups along the channel dimension, and each group is shifted by one unit in one of four directions along the height and width dimensions. Next, the sample and search image features are cross-correlated using the graph attention module cross-correlation method. Finally, split attention is used to fuse multiple response maps. Results of experiments on challenging benchmarks, including VOT-TIR2015, PTB-TIR, and LSOTB-TIR, demonstrate that the proposed SiamMSS outperforms state-of-the-art trackers. The code is available at lvlanbing/SiamMSS (github.com).
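The multigroup spatial shift is described concretely enough to sketch: channels are split into four groups and each group is shifted by one unit along one of the four height/width directions. The group ordering, shift directions, and zero padding below are assumptions about details the abstract does not fix.

```python
import numpy as np

def multigroup_spatial_shift(feat):
    """Sketch of a four-group spatial shift on a (C, H, W) feature map.

    Channels are split into four groups; each group is shifted by one unit
    along up/down/left/right with zero padding. Details are illustrative.
    """
    c, h, w = feat.shape
    out = np.zeros_like(feat)
    g = c // 4
    out[0:g, :-1, :]     = feat[0:g, 1:, :]        # shift up
    out[g:2*g, 1:, :]    = feat[g:2*g, :-1, :]     # shift down
    out[2*g:3*g, :, :-1] = feat[2*g:3*g, :, 1:]    # shift left
    out[3*g:, :, 1:]     = feat[3*g:, :, :-1]      # shift right
    return out

print(multigroup_spatial_shift(np.random.rand(8, 5, 5)).shape)  # (8, 5, 5)
```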
Article
Full-text available
Unmanned aerial vehicles (UAVs) have many applications in commerce and recreation, so perceiving and visualizing their state is of prime importance. In this paper, the authors address capturing and detecting drones in order to derive valuable position and coordinate data. The wide diffusion of drones increases the hazards of their misuse in illegitimate activities such as drug smuggling and terrorism, so drone surveillance and automated detection are crucial for protecting restricted areas or special zones from illegal drone entry. However, under low-illumination conditions, detectors may lose the ability to recover useful data, leading to inaccurate results. To alleviate this, some works use infrared (IR) videos and images for object detection and tracking. The crucial drawback of infrared images is their generally low resolution, which provides inadequate information for trackers. Given this analysis, fusing RGB (visible) data with infrared image data is essential for capturing and detecting drones, since it leverages more than a single data modality for learning precise drone detectors. This paper introduces an automated video- and image-based drone tracking and detection system that uses an advanced deep-learning-based object detection and tracking method, you only look once (YOLOv5), to protect restricted areas or special zones from unlawful drone entry. YOLOv5, a single-stage detector, offers one of the best detection and tracking performances, balancing accuracy and speed by extracting in-depth, high-level features. Building on YOLOv5, this paper improves it to track and detect UAVs more accurately and precisely, and it is one of the first works to introduce a YOLOv5-based algorithm for UAV detection and tracking in the anti-UAV setting. It adopts four scales of feature maps instead of the previous three to predict bounding boxes, which delivers more texture and contour information for detecting small objects. At the same time, to reduce computation, the size of the UAV in the four feature-map scales is calculated according to the input data, and the number of anchor boxes is adjusted accordingly. Therefore, the proposed UAV tracking and detection technology can be applied in the anti-UAV field. In addition, an effective method named a double-training strategy has been developed for drone detection and capture: trained on class and instance segmentation across moving frames and image series, the detector learns accurate segment information and derives distinct instance- and class-level characteristics.
Article
The lack of large labeled training datasets hinders the use of deep neural networks for Thermal Infrared (TIR) tracking. The regular practice is to train a tracking network on large-scale RGB datasets and then retrain it for the TIR domain with limited TIR data. However, we observe that existing Siamese-based trackers can hardly generalize to TIR images even though they achieve outstanding performance on RGB tracking. Therefore, the main challenge is a generalization problem: how to design a generalization-friendly Siamese tracking network, and what affects the network's generalization. To tackle this problem, we introduce a self-adaption structure into the Siamese network and propose an effective TIR tracking model, GFSNet. GFSNet generalizes successfully to different TIR tracking tasks, including ground target, aircraft, and high-diversity object tracking. To estimate generalization ability, we present the notion of Growth Rate, the improvement of overall performance after retraining. Experimental results show that the Growth Rates of GFSNet exceed those of the state-of-the-art SiamRPN++ by more than 7 times, which indicates the great generalization power of GFSNet. In addition to experimental validation, we provide a theoretical analysis of network generalization from a novel perspective, model sensitivity. By performing tests to analyze the sensitivity, we conclude that the self-adaption structure helps GFSNet converge to a more sensitive minimum with better generalization to new tasks. Furthermore, when compared with popular tracking methods, GFSNet maintains comparable accuracy while achieving real-time tracking at 112 FPS, 5 times faster than other TIR trackers.
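The abstract defines Growth Rate only informally as the improvement of overall performance after retraining. One plausible reading, used here purely for illustration, is a relative improvement ratio:

```python
def growth_rate(score_before_retrain, score_after_retrain):
    """Illustrative reading of 'Growth Rate': relative improvement of an overall
    performance score after retraining on the TIR domain. The paper's exact
    definition may differ."""
    return (score_after_retrain - score_before_retrain) / score_before_retrain

print(growth_rate(0.30, 0.45))  # 0.5, i.e. a 50% improvement after retraining
```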
Article
The existing body of work on video object tracking (VOT) algorithms has studied various image conditions such as occlusion, clutter, and object shape, which influence video quality and affect tracking performance. Nonetheless, there is no clear distinction between the performance reduction caused by scene-dependent challenges such as occlusion and clutter, and the effect of authentic in-capture and post-capture distortions. Despite the plethora of VOT methods in the literature, there is a lack of detailed studies analyzing the performance of videos with authentic in-capture and post-capture distortions. We introduced a new dataset of authentically distorted videos (AD-SVD) to address this issue. This dataset contains 4476 videos with different authentic distortions and surveillance activities. Furthermore, it provides benchmarking results for evaluating ten state-of-the-art visual object trackers (from VOT 2017–2018 challenges) based on the proposed dataset. In addition, this study develops an approach for performance prediction and quality-aware feature selection for single-object tracking in authentically distorted surveillance videos. The method predicts the performance of a VOT algorithm with high accuracy. Then, the probability of obtaining the reference output is maximized without executing the tracking algorithms. We also propose a framework to reduce video tracker computation resources (time and video storage space). We achieve this by balancing processing time and tracking accuracy by predicting the performance in a range of spatial resolutions. This approach can reduce the execution time by up to 34% with a slight decrease in performance of 3%.
Article
Full-text available
Discriminative methods have been widely applied to construct the appearance model for visual tracking. Most existing methods incorporate an online updating strategy to adapt to the appearance variations of targets. The focus of online updating for discriminative methods is to select positive samples that emerged in past frames to represent the appearances. However, the appearances of positive samples might be very dissimilar to each other; traditional online updating strategies easily overfit on some appearances and neglect the others. To address this problem, we propose an effective method to learn a discriminative template that maintains the multiple-appearance information of targets over long-term variations. Our method is based on the observation that the target appearance varies very little over a certain number of successive video frames. Therefore, we can use a few instances to represent the appearances within the scope of these successive frames. We propose an exclusive group sparsity formulation to describe this observation and provide a novel algorithm, called coefficients-constrained exclusive group LASSO, to solve it in a single objective function. The experimental results on the CVPR2013 benchmark dataset demonstrate that our approach achieves promising performance.
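For readers unfamiliar with the term, a generic exclusive group LASSO objective looks like the following; the paper's coefficient constraints and exact formulation are not reproduced here, so treat this as background notation rather than the method itself.

```latex
% Generic exclusive group LASSO objective (illustrative only): the groups
% G_1, ..., G_K partition the coefficients w, and sparsity is encouraged
% *within* each group rather than across groups.
\min_{w}\; \frac{1}{2}\,\lVert y - Xw \rVert_2^2
  \;+\; \lambda \sum_{k=1}^{K} \Bigl( \sum_{j \in G_k} \lvert w_j \rvert \Bigr)^{2}
```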
Article
Full-text available
Linear discriminant analysis (LDA) is a very popular supervised feature extraction method and has been extended to different variants. However, classical LDA has the following problems: 1) the obtained discriminant projection does not have good interpretability for features; 2) LDA is sensitive to noise; and 3) LDA is sensitive to the selection of the number of projection directions. In this paper, a novel feature extraction method called robust sparse linear discriminant analysis (RSLDA) is proposed to solve the above problems. Specifically, RSLDA adaptively selects the most discriminative features for discriminant analysis by introducing the l2,1 norm. An orthogonal matrix and a sparse matrix are also simultaneously introduced to guarantee that the extracted features can hold the main energy of the original data and to enhance robustness to noise; thus, RSLDA has the potential to perform better than other discriminant methods. Extensive experiments on six databases demonstrate that the proposed method achieves competitive performance compared with other state-of-the-art feature extraction methods. Moreover, the proposed method is robust to noisy data.
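To make the ingredients named in the abstract (the l2,1 norm, the orthogonal matrix, and the sparse matrix) concrete, here is a sketch of an RSLDA-style objective. The notation and the exact arrangement of terms are assumptions; the paper's formulation should be consulted for the precise model.

```latex
% Illustrative sketch of an RSLDA-style objective (notation assumed): P is the
% sparse projection selected via the l_{2,1} norm, Q is orthogonal, and E is a
% sparse error term absorbing noise in the reconstruction constraint.
\min_{P,\,Q,\,E}\; \operatorname{Tr}\!\bigl(P^{\top}(S_w - \mu S_b)P\bigr)
  + \lambda_1 \lVert P \rVert_{2,1} + \lambda_2 \lVert E \rVert_1
\quad \text{s.t.}\quad X = Q P^{\top} X + E,\;\; Q^{\top} Q = I
```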
Article
Full-text available
Unlike the visual object tracking, thermal infrared object tracking can track a target object in total darkness. Therefore, it has broad applications, such as in rescue and video surveillance at night. However, there are few studies in this field mainly because thermal infrared images have several unwanted attributes, which make it difficult to obtain the discriminative features of the target. Considering the powerful representational ability of convolutional neural networks and their successful application in visual tracking, we transfer the pre-trained convolutional neural networks based on visible images to thermal infrared tracking. We observe that the features from the fully-connected layer are not suitable for thermal infrared tracking due to the lack of spatial information of the target, while the features from the convolution layers are. Besides, the features from a single convolution layer are not robust to various challenges. Based on this observation, we propose a correlation filter based ensemble tracker with multi-layer convolutional features for thermal infrared tracking (MCFTS). Firstly, we use pre-trained convolutional neural networks to extract the features of the multiple convolution layers of the thermal infrared target. Then, a correlation filter is used to construct multiple weak trackers with the corresponding convolution layer features. These weak trackers give the response maps of the target’s location. Finally, we propose an ensemble method that coalesces these response maps to get a stronger one. Furthermore, a simple but effective scale estimation strategy is exploited to boost the tracking accuracy. To evaluate the performance of the proposed tracker, we carry out experiments on two thermal infrared tracking benchmarks: VOT-TIR 2015 and VOT-TIR 2016. The experimental results demonstrate that our tracker is effective and achieves promising performance.
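The ensemble step, merging the response maps of the per-layer correlation filters into one stronger map, can be sketched as follows. The simple weighted sum is an illustrative stand-in for the paper's ensemble rule, and the function name is hypothetical.

```python
import numpy as np

def ensemble_response(response_maps, weights=None):
    """Merge per-layer correlation filter response maps into one stronger map.

    `response_maps` is a list of HxW arrays, one per convolutional layer. The
    peak of the fused map gives the target location. The weighted sum here is
    an illustrative stand-in for the paper's ensemble method.
    """
    maps = np.stack(response_maps)                 # (n_layers, H, W)
    if weights is None:
        weights = np.ones(len(maps)) / len(maps)
    fused = np.tensordot(weights, maps, axes=1)    # weighted sum over layers
    y, x = np.unravel_index(np.argmax(fused), fused.shape)
    return (x, y), fused
```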
Article
The performance of a tracking task directly depends on the appearance features of the target object. Therefore, a robust method for constructing appearance features is crucial for avoiding tracking failure. Tracking methods based on Convolutional Neural Networks (CNNs) have exhibited excellent performance in recent years. However, the features from each original convolutional layer are not robust to size changes of the target object; once the target's size changes significantly, the tracker drifts away from it. In this paper, we present a novel tracker based on multi-scale features, spatiotemporal features, and a deep residual network to accurately estimate the size of the target object, so that our tracker can successfully locate the target in consecutive video frames. To address the multi-scale change issue in visual object tracking, we sample each input image with 67 templates of different sizes and resize the samples to a fixed size. These samples are then used to train, offline, the deep residual network model with the multi-scale features we have built. After that, spatial and temporal features are fused into this model, yielding a deep multi-scale spatiotemporal feature model named MSST-ResNet. Finally, the MSST-ResNet feature model is transferred to tracking tasks and combined with three different Kernelized Correlation Filters (KCFs) to accurately locate the target object in consecutive video frames. Unlike previous trackers, we directly learn various changes of the target appearance by building the MSST-ResNet feature model. The experimental results demonstrate that the proposed tracking method outperforms state-of-the-art tracking methods.
We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
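OpenCV ships a pedestrian detector trained in the spirit of this approach (HOG features with a linear SVM), so a minimal usage sketch looks like the following; the file name is hypothetical and the parameters are the library defaults rather than values from the paper.

```python
import cv2

# HOG + linear SVM pedestrian detector, using OpenCV's pre-trained people model.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

image = cv2.imread("frame.png")  # hypothetical input frame
rects, weights = hog.detectMultiScale(image, winStride=(8, 8),
                                      padding=(8, 8), scale=1.05)
for (x, y, w, h) in rects:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
```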
Article
Real-time object tracking has wide applications in time-critical multimedia processing areas such as motion analysis and human-computer interaction, yet it remains hard to balance accuracy and speed. In this paper, we present a fast real-time context-based visual tracking algorithm with a new saliency prior context (SPC) model. Based on a probabilistic formulation, the tracking problem is solved by sequentially maximizing the computed confidence map of the target location in each video frame. To handle the various feature distributions generated from different targets and their contexts, we exploit low-level features as well as fast spectral analysis for saliency to build a new prior context model. Then, based on this model and a spatial context model learned online, a confidence map is computed and the target location is estimated. In addition, under this framework, the tracking procedure can be accelerated by the fast Fourier transform (FFT), so the new method generally achieves real-time running speed. Extensive experiments show that our tracking algorithm based on the proposed SPC model achieves real-time computational efficiency with the overall best performance compared with other state-of-the-art methods.
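The reason the FFT makes this kind of context tracking fast is that spatial correlation becomes a pointwise product in the Fourier domain. The sketch below shows that generic correlation step only; it is not the paper's full SPC formulation.

```python
import numpy as np

def fft_confidence_map(context_model, feature_map):
    """Correlate a learned context model with the current features via the FFT.

    Both inputs are HxW arrays; the peak of the returned map is the estimated
    target location. Generic correlation sketch, not the SPC model itself.
    """
    F_model = np.fft.fft2(context_model)
    F_feat = np.fft.fft2(feature_map)
    confidence = np.real(np.fft.ifft2(np.conj(F_model) * F_feat))
    y, x = np.unravel_index(np.argmax(confidence), confidence.shape)
    return (x, y), confidence
```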
Article
This paper studies the problem of object tracking in challenging scenarios by leveraging multimodal visual data. We propose a grayscale-thermal object tracking method in Bayesian filtering framework based on multitask Laplacian sparse representation. Given one bounding box, we extract a set of overlapping local patches within it, and pursue the multitask joint sparse representation for grayscale and thermal modalities. Then, the representation coefficients of the two modalities are concatenated into a vector to represent the feature of the bounding box. Moreover, the similarity between each patch pair is deployed to refine their representation coefficients in the sparse representation, which can be formulated as the Laplacian sparse representation. We also incorporate the modal reliability into the Laplacian sparse representation to achieve an adaptive fusion of different source data. Experiments on two grayscale-thermal datasets suggest that the proposed approach outperforms both grayscale and grayscale-thermal tracking approaches.