Article

# DASOT: A Unified Framework Integrating Data Association and Single Object Tracking for Online Multi-Object Tracking

## Abstract

In this paper, we propose an online multi-object tracking (MOT) approach that integrates data association and single object tracking (SOT) with a unified convolutional network (ConvNet), named DASOTNet. The intuition behind integrating data association and SOT is that they can complement each other. Following Siamese network architecture, DASOTNet consists of the shared feature ConvNet, the data association branch and the SOT branch. Data association is treated as a special re-identification task and solved by learning discriminative features for different targets in the data association branch. To handle the problem that the computational cost of SOT grows intolerably as the number of tracked objects increases, we propose an efficient two-stage tracking method in the SOT branch, which utilizes the merits of correlation features and can simultaneously track all the existing targets within one forward propagation. With feature sharing and the interaction between them, data association branch and the SOT branch learn to better complement each other. Using a multi-task objective, the whole network can be trained end-to-end. Compared with state-of-the-art online MOT methods, our method is much faster while maintaining a comparable performance.
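Treating data association as re-identification can be illustrated with a minimal sketch, assuming each tracked target and each new detection already has an embedding vector. The function names and the use of cosine similarity are illustrative assumptions; the abstract does not state which similarity measure or feature dimensions DASOTNet uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)

def similarity_matrix(track_feats, det_feats):
    """Rows index tracked targets, columns index detections in the new frame."""
    return [[cosine_similarity(t, d) for d in det_feats] for t in track_feats]

# Hypothetical 2-D embeddings for two targets and two detections.
tracks = [[1.0, 0.0], [0.0, 1.0]]
detections = [[0.9, 0.1], [0.1, 0.9]]
sim = similarity_matrix(tracks, detections)
# Index of the most similar detection for each target.
best = [max(range(len(row)), key=row.__getitem__) for row in sim]
```

In a real tracker the per-row argmax would normally be replaced by a one-to-one assignment so that two targets cannot claim the same detection.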


... The video includes special tracking scenes such as variable scale, partial occlusion, short-term full occlusion and long-term full occlusion. The experiment results show that compared with some typical algorithms proposed in recent years, OsaMOT exhibits improved tracking performance and the ability to resist occlusion and scale change, and most of its evaluation indexes are superior to those of STAM [25], NOTA [26], STRN [27], BLSTM_MTP_O [28], KCF16 [29], PHD_LMP [48], DEEP_TAMA [49] and DASOT [50]. ...
... Also note that the MOT, Rcll and FN indexes are clearly better than those of the other methods in Table 11. For example, we achieve a 2.56 MOTA gain over NOTA [26] in Table 9. Gains of 5.8 MOTA over BLSTM_MTP_O [28] and 4.49 Rcll over DASOT [50] are observed in Table 11. In addition, a 7.95 Rcll improvement over DASOT [50] is observed in Table 10. The other indexes, such as MT, FN and ML, are also clearly improved in Tables 10 and 11. ...
Article
Full-text available
Multi‐object tracking (MOT), which uses the context information of image sequences to locate, maintain identities and generate trajectories of multiple targets in each frame, is a key technology in the field of computer vision. To address the problems of occlusion and scale variation in low‐viewpoint MOT, OsaMOT is proposed here. First, according to the global occlusion state of each frame, OsaMOT proposes the adaptive anti‐occlusion feature to enhance the awareness and adaptability for occlusion. At the same time, OsaMOT uses the cascade screening mechanism to reduce the “virtual new target” phenomenon due to the dramatic change in target features caused by scale variation and occlusion. Finally, considering that the occluded templates will affect the tracking performance, OsaMOT proposes an adaptive anti‐noise template update mechanism according to the partial occlusion state of the target, which improves the purity of the template library and further enhances the applicability to occlusion. The experimental results show that OsaMOT can weaken the influence of scale variation, partial occlusion, short‐term full occlusion and long‐term full occlusion in low‐viewpoint tracking scenes. Most evaluation indexes of OsaMOT under the low‐viewpoint tracking scenario are superior to those of some typical algorithms proposed in recent years, and the tracking robustness is improved.
... Using the other detections and the objects in the active set $A^{(t-1)}$, we construct the bipartite graph $G$ and obtain the optimal matching using the Hungarian algorithm. As a result, the bounding boxes of the objects $C^{(t-1)}_1$ and $C^{(t-1)}_2$ are determined to be $q^{(t)}_1$ and $q^{(t)}_2$, respectively. Next, we verify that $q^{(t)}_3$ is a new object and include it in the active set $A^{(t)}$ at frame $t$. ...
... Require: Detection results $D^{(1)}, \cdots, D^{(T)}$ Ensure: Active sets $A^{(1)}, \cdots, A^{(T)}$ 1: Initialize $A^{(1)}$ 2: for $t = 2$ to $T$ do 3: ...
Article
Full-text available
How to make an online tracking model effectively adapt to newly appearing objects and object disappearance as well as appearance variations of target objects from few examples is an essential issue in multiple object tracking (MOT). Learning target appearances from few examples is a few-shot classification problem, while identifying newly appearing objects and object disappearance has the aspect of open-set classification. In this work, we regard online MOT as open-set few-shot classification to address both learning from few examples (few-shot classification) and unknown classes such as new objects (open-set classification). Specifically, we develop an embedding neural network, called VOFNet, consisting of convolutional and recurrent parts, to perform open-set few-shot classification. The convolutional part constructs a feature from an example of a target object and the recurrent part determines a representative feature of a target object from few examples. Then VOFNet is trained to provide effective features for open-set few-shot classification. Finally, we develop an online multiple object tracker based on the combination of VOFNet and bipartite matching. The proposed tracker achieves 49.2 multiple object tracking accuracy (MOTA) at 28.9 frames per second on the MOT17 dataset, which is a significantly better trade-off between accuracy and speed than existing algorithms. For example, the proposed algorithm runs about 3.17 times faster, with about 0.99 times the accuracy, than a recent MOT algorithm [1].
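The bipartite matching step between tracked objects and new detections can be sketched as follows. This is a minimal illustration, not the tracker described above: it finds the minimum-cost one-to-one assignment by exhaustive search, which gives the same optimum as the Hungarian algorithm but is only practical for a handful of objects. The cost values, the `max_cost` gate and the function name are assumptions made for the example.

```python
from itertools import permutations

def optimal_matching(cost, max_cost):
    """Minimum-cost one-to-one matching of tracks (rows) to detections (columns).

    Returns (matches, new_detections). A pair whose cost exceeds max_cost is
    rejected, and every unmatched detection is reported as a new object.
    """
    n_tracks = len(cost)
    n_dets = len(cost[0]) if n_tracks else 0
    k = min(n_tracks, n_dets)
    if n_tracks <= n_dets:
        candidates = (list(zip(range(n_tracks), p))
                      for p in permutations(range(n_dets), k))
    else:
        candidates = (list(zip(p, range(n_dets)))
                      for p in permutations(range(n_tracks), k))
    best_pairs, best_total = [], float("inf")
    for pairs in candidates:
        total = sum(cost[i][j] for i, j in pairs)
        if total < best_total:
            best_pairs, best_total = pairs, total
    matches = [(i, j) for i, j in best_pairs if cost[i][j] <= max_cost]
    matched_dets = {j for _, j in matches}
    new_detections = [j for j in range(n_dets) if j not in matched_dets]
    return matches, new_detections

# Two existing tracks, three detections; the third detection fits neither track.
cost = [
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.9],
]
matches, new_detections = optimal_matching(cost, max_cost=0.5)
```

Here the two tracks are matched to the first two detections and the third detection is left over as a candidate new object. For realistic object counts, a polynomial-time solver should replace the exhaustive search.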
... At the same time, our approach surpasses the other methods, TrctrD [42], LSST [11], FAMNet [6], YOONKJ [45], STRN [41], MTDF [13] and DASOT [7]. These approaches range from 53.7% to 52.0% MOTA and only LSST performs better at IDF1 but shows low number of MT. ...
Preprint
Full-text available
This paper proposes a novel method for online Multi-Object Tracking (MOT) using Graph Convolutional Neural Network (GCNN) based feature extraction and end-to-end feature matching for object association. The Graph based approach incorporates both appearance and geometry of objects at past frames as well as the current frame into the task of feature learning. This new paradigm enables the network to leverage the "context" information of the geometry of objects and allows us to model the interactions among the features of multiple objects. Another central innovation of our proposed framework is the use of the Sinkhorn algorithm for end-to-end learning of the associations among objects during model training. The network is trained to predict object associations by taking into account constraints specific to the MOT task. Experimental results demonstrate the efficacy of the proposed approach in achieving top performance on the MOT17 Challenge among state-of-the-art online approaches.
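The Sinkhorn algorithm mentioned above turns a matrix of association scores into an approximately doubly stochastic matrix by alternately normalizing rows and columns, which keeps the assignment step differentiable. The sketch below is a plain-Python illustration under assumed inputs (a square score matrix and a fixed iteration count); it is not the network or training procedure of the paper.

```python
import math

def sinkhorn(scores, n_iters=50):
    """Alternate row and column normalization of exp(scores).

    scores: square list-of-lists of real-valued association scores.
    Returns a matrix whose rows and columns each sum to roughly 1.
    """
    k = [[math.exp(s) for s in row] for row in scores]
    for _ in range(n_iters):
        # Row normalization: each row sums to 1.
        k = [[v / sum(row) for v in row] for row in k]
        # Column normalization: each column sums to 1.
        col_sums = [sum(col) for col in zip(*k)]
        k = [[v / c for v, c in zip(row, col_sums)] for row in k]
    return k

# Hypothetical similarity scores between three tracks and three detections.
scores = [
    [0.9, 0.1, 0.2],
    [0.2, 0.8, 0.1],
    [0.1, 0.3, 0.7],
]
soft_assignment = sinkhorn(scores)
```

Each entry of `soft_assignment` can be read as a soft probability that a track and a detection belong together. Dividing the scores by a small temperature before exponentiation would sharpen the result, at the cost of slower convergence.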
... Multiple Object Tracking (MOT) is an important area of research in computer vision and artificial intelligence (Milan et al. 2017; Chu et al. 2020; Huang and Zhou 2019) dating back to 1988 (Pylyshyn and Storm 1988). The tracking-by-detection paradigm allows us to decompose MOT into two tasks. ...
Article
Object association, i.e., the identification of which observations correspond to the same object, is a central task for the area of multiple object tracking. Two prominent models capturing this task have been introduced in the literature: the Lifted Multicut model and the more recent Lifted Paths model. Here, we carry out a detailed complexity-theoretic study of the problems arising from these two models that is aimed at complementing previous empirical work on object association. We obtain a comprehensive complexity map for both models that takes into account natural restrictions to instances such as possible bounds on the number of frames, number of tracked objects and branching degree, as well as less explicit structural restrictions such as having bounded treewidth. Our results include new fixed-parameter and XP algorithms for the problems as well as hardness proofs which altogether indicate that the Lifted Paths problem exhibits a more favorable complexity behavior than Lifted Multicut.
... For example, [84] uses a correlation layer that learns the dense temporal relations given sequential feature maps. DASOT [163] integrates data association and SOT in a unified framework, in which dense correlation feature maps are estimated for the temporal association, built upon truncated ResNet-50 with feature pyramid network (FPN) [164]. In addition, [165] estimates both temporal correlation and multi-scale spatial correlation with dense feature maps. ...
Preprint
Full-text available
Multi-object tracking (MOT) aims to associate target objects across video frames in order to obtain entire moving trajectories. With the advancement of deep neural networks and the increasing demand for intelligent video analysis, MOT has gained significantly increased interest in the computer vision community. Embedding methods play an essential role in object location estimation and temporal identity association in MOT. Unlike other computer vision tasks, such as image classification, object detection, re-identification, and segmentation, embedding methods in MOT have large variations, and they have never been systematically analyzed and summarized. In this survey, we first conduct a comprehensive overview with in-depth analysis for embedding methods in MOT from seven different perspectives, including patch-level embedding, single-frame embedding, cross-frame joint embedding, correlation embedding, sequential embedding, tracklet embedding, and cross-track relational embedding. We further summarize the existing widely used MOT datasets and analyze the advantages of existing state-of-the-art methods according to their embedding strategies. Finally, some critical yet under-investigated areas and future research directions are discussed.
... At the same time, our approach surpasses the other methods, TrctrD [46], LSST [12], FAMNet [7], YOONKJ [49], STRN [45], MTDF [14] and DASOT [8]. These approaches range from 53.7% to 52.0% MOTA and only LSST performs better at IDF1 but shows low number of MT. ...
Chapter
Different modalities have their own advantages and disadvantages. In a tracking-by-detection framework, fusing data from multiple modalities would ideally improve tracking performance over using a single modality, but this has been a challenge. This study builds upon previous research in this area. We propose a deep-learning-based tracking-by-detection pipeline that uses multiple detectors and multiple sensors. For the input, we associate object proposals from 2D and 3D detectors. Through a cross-modal attention module, we optimize the interaction between the 2D RGB and 3D point cloud features of each proposal. This helps to generate 2D features with irrelevant information suppressed, which boosts performance. Through experiments on a published benchmark, we demonstrate the value and ability of our design in introducing a multi-modal tracking solution to current research on Multi-Object Tracking (MOT).
Article
Full-text available
We propose a novel online multiple object tracker that takes structure information into account. State-of-the-art multi-object tracking (MOT) approaches commonly focus on discriminative appearance features, while neglecting, to varying degrees, structure information and the core of data association. To address this, we design a new tracker that fully exploits structure information and encodes it into the cost function of the graph matching model. Firstly, a new measurement is proposed to compare the structure similarity of two graphs with equal numbers of nodes. With this measurement, we define a complete matching that performs association with high efficiency. Secondly, for incomplete matching scenarios, a structure keeper net (SKnet) is designed to adaptively establish the graph for matching. Finally, we conduct extensive experiments on benchmarks including MOT2015 and MOT17. The results demonstrate the competitiveness and practicability of our tracker.
Article
Full-text available
In this paper, we propose a generic boosting framework for multiple object tracking (MOT). Unlike other works that track objects from scratch, our framework takes their results (tracklets) and optimizes them further. Our motivation derives from the observation that most modern MOT trackers achieve acceptable performance and yield relatively reliable tracklets; accordingly, we focus directly on tracklet-level re-identification, which is the most challenging issue in this case. To achieve that goal, we simultaneously utilize single object tracking, tracking fragments (tracklets) and a re-identification mechanism by casting them into a multi-label energy optimization, which we solve using the $\alpha$-expansion with label costs algorithm. All these techniques have done much to help recent MOT methods mitigate the occlusion problem, but to our knowledge few works so far have explored a reasoned combination of all of them. Furthermore, we introduce spatial attention to improve the appearance model and hierarchical clustering as a post-process to progressively improve tracking consistency. Finally, test results on the most widely used benchmarks demonstrate the effectiveness and generality of our framework, and the importance of each contribution is verified through ablation studies.
Article
Most modern multi-object tracking (MOT) systems for videos follow the tracking-by-detection paradigm, where objects of interest are first located in each frame and then associated to form their complete trajectories. In this setting, the appearance features of objects usually provide the most important cues for data association, but they are very susceptible to occlusions, illumination variations, and inaccurate detections, and thus easily result in incorrect trajectories. To address this issue, in this study we propose to make full use of neighboring information. Our motivation derives from the observation that people tend to move in groups. As such, when an individual target's appearance changes markedly, the observer can still identify it from its neighbor context. To model the contextual information from neighbors, we first utilize the spatiotemporal relations among trajectories to efficiently select suitable neighbors for targets. Subsequently, we construct a neighbor graph for each target and its corresponding neighbors, then employ graph convolutional networks (GCNs) to model their relations and learn the graph features. To the best of our knowledge, this is the first work to explicitly leverage neighbor cues via GCNs in MOT. Finally, standardized evaluations on the MOT16 and MOT17 data sets demonstrate that our approach can remarkably reduce identity switches while achieving state-of-the-art overall performance.
Article
Multiple object tracking (MOT) has gained increasing attention due to its academic and commercial interests in computer vision tasks. Most of the existing state-of-the-art MOT methods consider the tracking-by-detection (TBD) framework, which localizes the pedestrian in each frame and connects these object hypotheses into the trajectories without any initial labeling. These methods heavily depend on detection accuracy and data association. However, occlusion often occurs in real surveillance scenes. The frequent occlusion leads to many false detections and inaccurate appearance, decreasing the tracking performance. In this paper, we aim to propose a novel feature matching method that combines the global and partial feature matching model between two bounding boxes to improve the similarity measurement between them. Moreover, the new feature matching method leverages the advantage that global features can illustrate the whole image, and partial features can effectively handle occlusion and noise. In addition, we propose a detection modifier method based on human pose information. This detection method can be used to filter out false pedestrian detections. Finally, the experimental results demonstrate the effectiveness of our proposed method and achieve comparable performance with the state-of-the-art MOT trackers.
Article
Advances in visual object tracking have increased its value for video surveillance, particularly in traffic scenarios. In recent years, significant attention has been paid to making multiple object tracking frameworks effective in real time while maintaining accuracy and generality. A Multiple Object Tracking framework based on the Tracking-By-Detection approach extends simple object detection and identification: it further involves solving a filtering problem and defining a similarity function to associate objects. Hence, this paper focuses only on the task of data association via uniquely defined similarity functions and filters, and we review the current literature on these techniques, which have been used to advance MOT performance in vehicle and pedestrian scenarios. While it is difficult to isolate quantitative results for the association task alone within a proposed MOT framework, our study tries to outline the fundamental ideas put forward by researchers and to compare results in a theoretically qualitative approach. Tracking methods are reviewed by category, starting with legacy techniques such as Probabilistic and Hierarchical methods, followed by an analysis of new approaches and hybrid models. The models identified in each category are further analyzed in terms of stability, accuracy, robustness, speed and computational complexity, to understand in which directions research at the data association level is strong and in which it is lacking. Our review further aims to identify the successful models and to recognize their weaknesses for future improvement.
Article
Occlusion between different objects is a typical challenge in Multi-Object Tracking (MOT), which often leads to inferior tracking results due to the missing detected objects. The common practice in multi-object tracking is re-identifying the missed objects after their reappearance. Though tracking performance can be boosted by the re-identification, the annotation of identity is required to train the model. In addition, such practice of re-identification still can not track those highly occluded objects when they are missed by the detector. In this paper, we focus on online multi-object tracking and design two novel modules, the unsupervised re-identification learning module and the occlusion estimation module, to handle these problems. Specifically, the proposed unsupervised re-identification learning module does not require any (pseudo) identity information nor suffer from the scalability issue. The proposed occlusion estimation module tries to predict the locations where occlusions happen, which are used to estimate the positions of missed objects by the detector. Our study shows that, when applied to state-of-the-art MOT methods, the proposed unsupervised re-identification learning is comparable to supervised re-identification learning, and the tracking performance is further improved by the proposed occlusion estimation module.
Preprint
Online multi-object tracking (MOT) is a longstanding task for computer vision and intelligent vehicle platforms. At present, the main paradigm is tracking-by-detection, and the main difficulty of this paradigm is how to associate the current candidate detection with the historical tracklets. However, in MOT scenarios, each historical tracklet is composed of an object sequence, while each candidate detection is just a flat image, which lacks the temporal features of the object sequence. The feature difference between the current candidate detection and the historical tracklets makes object association much harder. Therefore, we propose a Spatial-Temporal Mutual Representation Learning (STURE) approach which learns spatial-temporal representations between the current candidate detection and the historical sequence in a mutual representation space. For the historical tracklets, the detection learning network is forced to match the representations of the sequence learning network in a mutual representation space. The proposed approach is capable of extracting more distinguishing detection and sequence representations by using various designed losses in object association. As a result, spatial-temporal features are learned mutually to reinforce the current detection features, and the feature difference can be relieved. To prove the robustness of STURE, it is applied to the public MOT challenge benchmarks and performs well compared with various state-of-the-art online MOT trackers based on identity-preserving metrics.
Article
Appearance similarity is of great importance for the association between objects and candidates. Recurrent models and similarity vectors are two approaches widely used by trackers for calculating similarities between objects and candidates. Recurrent models, like the Long Short-Term Memory network (LSTM), are capable of modeling the continuous change of an object’s appearance along its trajectory. However, a tracker is prone to identity (ID) switches when it employs only recurrent models as the appearance model. The similarity vector approach is able to maintain correct IDs for objects when they reappear, but association fails easily when the object is partially occluded and the similarity vector is used as the only appearance model. To obtain more accurate and robust appearance similarity, in this paper we propose an online association by continuous-discrete appearance similarity measurement, OA-CDASM, for multi-object tracking. From the continuous perspective, the concept of “smoothness” is proposed to explicitly model and use the continuous and smooth change of an object’s appearance along its trajectory. From the discrete perspective, the similarity vector is employed. By taking both continuous smoothness and the discrete similarity vector into consideration, we obtain the continuous-discrete appearance similarity measurement, CDASM, and further perform online association based on CDASM. Experimental results on three public benchmarks demonstrate the effectiveness of our work.
Article
Full-text available
The data association problem of multi-object tracking (MOT) aims to assign IDentity (ID) labels to detections and infer a complete trajectory for each target. Most existing methods assume that each detection corresponds to a unique target and thus cannot handle situations when multiple targets occur in a single detection due to detection failure in crowded scenes. To relax this strong assumption for practical applications, we formulate the MOT as a Maximizing An Identity-Quantity Posterior (MAIQP) problem on the basis of associating each detection with an identity and a quantity characteristic and then provide solutions to tackle two key problems arising. Firstly, a local target quantification module is introduced to count the number of targets within one detection. Secondly, we propose an identity-quantity harmony mechanism to reconcile the two characteristics. On this basis, we develop a novel Identity-Quantity HArmonic Tracking (IQHAT) framework that allows assigning multiple ID labels to detections containing several targets. Through extensive experimental evaluations on five benchmark datasets, we demonstrate the superiority of the proposed method.
Article
Online multi-object tracking (MOT) is a longstanding task for computer vision and intelligent vehicle platforms. At present, the main paradigm is tracking-by-detection, and the main difficulty of this paradigm is how to associate current candidate detections with historical tracklets. However, in MOT scenarios, each historical tracklet is composed of an object sequence, while each candidate detection is just a flat image, which lacks the temporal features of the object sequence. The feature difference between current candidate detections and historical tracklets makes object association much harder. Therefore, we propose a Spatial-Temporal Mutual Representation Learning (STURE) approach which learns spatial–temporal representations between current candidate detections and historical sequences in a mutual representation space. For historical tracklets, the detection learning network is forced to match the representations of the sequence learning network in a mutual representation space. The proposed approach is capable of extracting more distinguishing detection and sequence representations by using various designed losses in object association. As a result, spatial–temporal features are learned mutually to reinforce the current detection features, and the feature difference can be relieved. To prove the robustness of STURE, it is applied to the public MOT challenge benchmarks and performs well compared with various state-of-the-art online MOT trackers based on identity-preserving metrics.
Article
Full-text available
The traditional visual multi‐object tracking methods based on the Gaussian mixture probability hypothesis density filter are generally not well adapted for tracking targets in complex scenarios with a large number of unknown newborn objects and occluded objects; some missing objects cannot even be associated with their previous trajectories when they are redetected. An improved visual multi‐object tracking algorithm is proposed by integrating an improved efficient convolution operator correlation filter with the Gaussian mixture probability hypothesis density filter. First, a similarity matrix based on intersection‐over‐union is proposed for classifying objects as surviving or newborn, and then the improved efficient convolution operator method is employed to further identify whether objects have disappeared or are missing. Moreover, the feature pyramid similarity is proposed to update the objects and enhance tracking accuracy. Finally, compared with several competing methods on challenging video sequences from the publicly available MOT17 dataset, the proposed Gaussian mixture probability hypothesis density–feature pyramid similarity–efficient convolution operator* method performs well at detecting newborn, occluded and blurred objects and at re‐identifying missing objects, with higher multiple object tracking accuracy.
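An intersection-over-union similarity matrix of the kind mentioned above can be sketched in a few lines. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the thresholding rule that labels a detection as surviving or newborn is an illustrative assumption, not the classification procedure of the paper.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def iou_matrix(tracks, detections):
    """Rows index existing tracks, columns index current detections."""
    return [[iou(t, d) for d in detections] for t in tracks]

def classify_detections(tracks, detections, threshold=0.3):
    """Label each detection as surviving (overlaps some track) or newborn.

    The threshold value is a hypothetical choice for this example.
    """
    matrix = iou_matrix(tracks, detections)
    surviving, newborn = [], []
    for j in range(len(detections)):
        best = max((row[j] for row in matrix), default=0.0)
        (surviving if best >= threshold else newborn).append(j)
    return surviving, newborn

tracks = [(0, 0, 2, 2)]
detections = [(0, 0, 2, 2), (10, 10, 12, 12)]
surviving, newborn = classify_detections(tracks, detections)
```

The first detection coincides with the existing track and is kept as surviving, while the distant second detection has zero overlap and is treated as newborn.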
Article
In multi-target tracking, object interactions and occlusions are two significant factors that affect tracking performance. To settle this, we propose an identity association network (IANet) that integrates the geometry refinement network (GRNet) and the identity verification (IV) module to perform data association and reason the mapping between the detections and tracklets. In our data association process, the object drifts caused by object interactions are suppressed effectively by encoding the direction and velocity of objects to refine the geometric position of tracklets. The tracklets with refined geometric information are further utilized in the IV module to achieve a sufficient encoding of multivariate spatial cues including both appearance and geometry information, which defeats the misleading impacts of interactions and occlusions dramatically in multi-object tracking. The extensive experiments and comparative evaluations have demonstrated that our proposed method can significantly outperform many state-of-the-art methods on benchmarks of 2D MOT2015, MOT16, MOT17, MOT20, and KITTI by using public detection and online settings.
Article
Modern multi-object tracking (MOT) systems usually build trajectories through associating per-frame detections. However, facing the challenges of camera motion, fast motion, and occlusion, it is difficult to ensure the quality of long-range tracking or even the tracklet purity, especially for small objects. Most of tracking frameworks depend heavily on the performance of re-identification (ReID) for the data association. Unfortunately, the ReID-based association is not only unreliable and time-consuming, but still cannot address the false negatives for occluded and blurred objects, due to noisy partial-detections, similar appearances, and lack of temporal-spatial constraints. In this paper, we propose an enhanced MOT paradigm, namely Motion-Aware Tracker (MAT). Our MAT is a plug-and-play solution, it mainly focuses on high-performance motion-based prediction, reconnection, and association. First, the nonrigid pedestrian motion and rigid camera motion are blended seamlessly to develop the Integrated Motion Localization (IML) module. Second, the Dynamic Reconnection Context (DRC) module is devised to guarantee the robustness for long-range motion-based reconnection. The core ideas in DRC are the motion-based dynamic-window and cyclic pseudo-observation trajectory filling strategy, which can smoothly fill in the tracking fragments caused by occlusion or blur. At last, we present the 3D Integral Image (3DII) module to efficiently cut off useless track-detection association connections using temporal-spatial constraints. Extensive experiments are conducted on the MOT16&17 challenging benchmarks. The results demonstrate that our MAT can achieve superior performance and surpass other state-of-the-art trackers by a large margin with high efficiency.
Conference Paper
Full-text available
In this paper, we propose the methods to handle temporal errors during multi-object tracking. Temporal error occurs when objects are occluded or noisy detections appear near the object. In those situations, tracking may fail and various errors like drift or ID-switching occur. It is hard to overcome temporal errors only by using motion and shape information. So, we propose the historical appearance matching method and joint-input siamese network which was trained by 2-step process. It can prevent tracking failures although objects are temporally occluded or last matching information is unreliable. We also provide useful technique to remove noisy detections effectively according to scene condition. Tracking performance, especially identity consistency, is highly improved by attaching our methods.
Article
Full-text available
Online Multi-Object Tracking (MOT) is a challenging problem and has many important applications including intelligent surveillance, robot navigation and autonomous driving. In existing MOT methods, individual objects’ movements and inter-object relations are mostly modeled separately, and relations between them are still manually tuned. In addition, inter-object relations are mostly modeled in a symmetric way, which we argue is not an optimal setting. To tackle those difficulties, in this paper, we propose a Deep Continuous Conditional Random Field (DCCRF) for solving the online MOT problem in a track-by-detection framework. The DCCRF consists of unary and pairwise terms. The unary terms estimate tracked objects’ displacements across time based on visual appearance information. They are modeled as deep Convolution Neural Networks, which are able to learn discriminative visual features for tracklet association. The asymmetric pairwise terms model inter-object relations in an asymmetric way, which encourages high-confidence tracklets to help correct errors of low-confidence tracklets without being much affected by the low-confidence ones. The DCCRF is trained in an end-to-end manner to better adapt the influences of visual information as well as inter-object relations. Extensive experimental comparisons with state-of-the-art methods, as well as a detailed component analysis of our proposed DCCRF on two public benchmarks, demonstrate the effectiveness of our proposed MOT framework.
Article
Deep convolutional neural networks have achieved great success on image recognition tasks. Yet, it is non-trivial to transfer the state-of-the-art image recognition networks to videos as per-frame evaluation is too slow and unaffordable. We present deep feature flow, a fast and accurate framework for video recognition. It runs the expensive convolutional sub-network only on sparse key frames and propagates their deep feature maps to other frames via a flow field. It achieves significant speedup as flow computation is relatively fast. The end-to-end training of the whole architecture significantly boosts the recognition accuracy. Deep feature flow is flexible and general. It is validated on two recent large scale video datasets. It makes a large step towards practical video recognition.
Conference Paper
Data association is the backbone of many multiple object tracking (MOT) methods. In this paper we formulate data association as a Generalized Maximum Multi Clique problem (GMMCP). We show that this is the ideal case of modeling tracking in real world scenarios where all the pair-wise relationships between targets in a batch of frames are taken into account. Previous works assume a simplified version of our tracker either in problem formulation or problem optimization. However, we propose a solution using GMMCP where no simplification is assumed in either step. We show that the NP hard problem of GMMCP can be formulated through a Binary-Integer Program where for small and medium size MOT problems the solution can be found efficiently. We further propose a speed-up method, employing Aggregated Dummy Nodes for modeling occlusion and miss-detection, which reduces the size of the input graph without using any heuristics. We show that, using the speed-up method, our tracker lends itself to real-time implementation which is plausible in many applications. We evaluated our tracker on six challenging sequences of Town Center, TUD-Crossing, TUD-Stadtmitte, Parking-lot 1, Parking-lot 2 and Parking-lot pizza and show favorable improvement against the state of the art.
Conference Paper
Online multi-object tracking aims at producing complete tracks of multiple objects using the information accumulated up to the present moment. It still remains a difficult problem in complex scenes, because of frequent occlusion by clutter or other objects, similar appearances of different objects, and other factors. In this paper, we propose a robust online multi-object tracking method that can handle these difficulties effectively. We first propose the tracklet confidence using the detectability and continuity of a tracklet, and formulate a multi-object tracking problem based on the tracklet confidence. The multi-object tracking problem is then solved by associating tracklets in different ways according to their confidence values. Based on this strategy, tracklets sequentially grow with online-provided detections, and fragmented tracklets are linked up with others without any iterative and expensive associations. Here, for reliable association between tracklets and detections, we also propose a novel online learning method using an incremental linear discriminant analysis for discriminating the appearances of objects. By exploiting the proposed learning method, tracklet association can be successfully achieved even under severe occlusion. Experiments with challenging public datasets show distinct performance improvement over other batch and online tracking methods.
Article
Many recent advances in multiple target tracking aim at finding a (nearly) optimal set of trajectories within a temporal window. To handle the large space of possible trajectory hypotheses, it is typically reduced to a finite set by some form of data-driven or regular discretization. In this work, we propose an alternative formulation of multitarget tracking as minimization of a continuous energy. Contrary to recent approaches, we focus on designing an energy that corresponds to a more complete representation of the problem, rather than one that is amenable to global optimization. Besides the image evidence, the energy function takes into account physical constraints, such as target dynamics, mutual exclusion, and track persistence. In addition, partial image evidence is handled with explicit occlusion reasoning, and different targets are disambiguated with an appearance model. To nevertheless find strong local minima of the proposed nonconvex energy, we construct a suitable optimization scheme that alternates between continuous conjugate gradient descent and discrete transdimensional jump moves. These moves, which are executed such that they always reduce the energy, allow the search to escape weak minima and explore a much larger portion of the search space of varying dimensionality. We demonstrate the validity of our approach with an extensive quantitative evaluation on several public data sets.
Conference Paper
The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called “ImageNet”, a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
Article
Simultaneous tracking of multiple persons in real-world environments is an active research field and several approaches have been proposed, based on a variety of features and algorithms. Recently, there has been a growing interest in organizing systematic evaluations to compare the various techniques. Unfortunately, the lack of common metrics for measuring the performance of multiple object trackers still makes it hard to compare their results. In this work, we introduce two intuitive and general metrics to allow for objective comparison of tracker characteristics, focusing on their precision in estimating object locations, their accuracy in recognizing object configurations and their ability to consistently label objects over time. These metrics have been extensively used in two large-scale international evaluations, the 2006 and 2007 CLEAR evaluations, to measure and compare the performance of multiple object trackers for a wide variety of tracking tasks. Selected performance results are presented and the advantages and drawbacks of the presented metrics are discussed based on the experience gained during the evaluations.
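The two metrics introduced in this work are widely known as MOTP (precision of location estimates) and MOTA (accuracy, combining misses, false positives and identity mismatches). The following is a minimal sketch of how they are typically computed, assuming that per-frame error counts and match distances have already been obtained from some matching step. The function names and the tuple layout are illustrative assumptions and do not come from the abstract.

```python
def mota(frames):
    """Multiple Object Tracking Accuracy.

    `frames` is an iterable of per-frame tuples (hypothetical layout):
    (misses, false_positives, mismatches, num_ground_truth_objects).
    """
    frames = list(frames)
    errors = sum(m + fp + mme for m, fp, mme, _ in frames)
    total_gt = sum(g for _, _, _, g in frames)
    if total_gt == 0:
        raise ValueError("no ground-truth objects in the sequence")
    # MOTA = 1 - (sum of all errors) / (total number of ground-truth objects)
    return 1.0 - errors / total_gt


def motp(total_distance, num_matches):
    """Multiple Object Tracking Precision.

    Average localisation error over all matched object-hypothesis pairs.
    """
    if num_matches == 0:
        raise ValueError("no matches in the sequence")
    return total_distance / num_matches


# Two frames with 10 ground-truth objects each and 3 errors in total.
example = [(1, 0, 0, 10), (0, 1, 1, 10)]
print(mota(example))
print(motp(total_distance=4.0, num_matches=16))
```

Because MOTA sums errors over the whole sequence before normalising, it can become negative when a tracker makes more errors than there are ground-truth objects.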
Article
Recent approaches for high accuracy detection and tracking of object categories in video consist of complex multistage solutions that become more cumbersome each year. In this paper we propose a ConvNet architecture that jointly performs detection and tracking, solving the task in a simple and effective way. Our contributions are threefold: (i) we set up a ConvNet architecture for simultaneous detection and tracking, using a multi-task objective for frame-based object detection and across-frame track regression; (ii) we introduce correlation features that represent object co-occurrences across time to aid the ConvNet during tracking; and (iii) we link the frame level detections based on our across-frame tracklets to produce high accuracy detections at the video level. Our ConvNet architecture for spatiotemporal object detection is evaluated on the large-scale ImageNet VID dataset where it achieves state-of-the-art results. Our approach provides better single model performance than the winning method of the last ImageNet challenge while being conceptually much simpler. Finally, we show that by increasing the temporal stride we can dramatically increase the tracker speed.
Conference Paper
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [7] and Fast R-CNN [5] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model [18], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
Article
Online multi-object tracking aims at estimating the tracks of multiple objects instantly with each incoming frame and the information provided up to the moment. It still remains a difficult problem in complex scenes, because of the large ambiguity in associating multiple objects in consecutive frames and the low discriminability between object appearances. In this paper, we propose a robust online multi-object tracking method that can handle these difficulties effectively. We first define the tracklet confidence using the detectability and continuity of a tracklet, and decompose a multi-object tracking problem into small subproblems based on the tracklet confidence. We then solve the online multi-object tracking problem by associating tracklets and detections in different ways according to their confidence values. Based on this strategy, tracklets sequentially grow with online-provided detections, and fragmented tracklets are linked up with others without any iterative and expensive association steps. For more reliable association between tracklets and detections, we also propose a deep appearance learning method to learn a discriminative appearance model from large training datasets, since the conventional appearance learning methods do not provide a rich representation that can distinguish multiple objects with large appearance variations. In addition, we combine online transfer learning for improving appearance discriminability by adapting the pre-trained deep model during online tracking. Experiments with challenging public datasets show distinct performance improvement over other state-of-the-art batch and online tracking methods, and prove the effect and usefulness of the proposed methods for online multi-object tracking.
Article
We present a multi-cue metric learning framework to tackle the popular yet unsolved Multi-Object Tracking (MOT) problem. One of the key challenges of tracking methods is to effectively compute a similarity score that models multiple cues from the past such as object appearance, motion, or even interactions. This is particularly challenging when objects get occluded or share similar appearance properties with surrounding objects. To address this challenge, we cast the problem as a metric learning task that jointly reasons on multiple cues across time. Our framework learns to encode long-term temporal dependencies across multiple cues with a hierarchical Recurrent Neural Network. We demonstrate the strength of our approach by tracking multiple objects using their appearance, motion, and interactions. Our method outperforms previous works by a large margin on multiple publicly available datasets including the challenging MOT benchmark.
Article
Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 5 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection. Code will be made publicly available.
Conference Paper
To help accelerate progress in multi-target, multi-camera tracking systems, we present (i) a new pair of precision-recall measures of performance that treats errors of all types uniformly and emphasizes correct identification over sources of error; (ii) the largest fully-annotated and calibrated data set to date with more than 2 million frames of 1080 p, 60 fps video taken by 8 cameras observing more than 2,700 identities over 85 min; and (iii) a reference software system as a comparison baseline. We show that (i) our measures properly account for bottom-line identity match performance in the multi-camera setting; (ii) our data set poses realistic challenges to current trackers; and (iii) the performance of our system is comparable to the state of the art.
Article
We present region-based, fully convolutional networks for accurate and efficient object detection. In contrast to previous region-based detectors such as Fast/Faster R-CNN that apply a costly per-region subnetwork hundreds of times, our region-based detector is fully convolutional with almost all computation shared on the entire image. To achieve this goal, we propose position-sensitive score maps to address a dilemma between translation-invariance in image classification and translation-variance in object detection. Our method can thus naturally adopt fully convolutional image classifier backbones, such as the latest Residual Networks (ResNets), for object detection. We show competitive results on the PASCAL VOC datasets (e.g., 83.6% mAP on the 2007 set) with the 101-layer ResNet. Meanwhile, our result is achieved at a test-time speed of 170ms per image, 2.5-20x faster than the Faster R-CNN counterpart. Code will be made publicly available.
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won 1st place on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
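The residual reformulation described in this abstract can be illustrated with a toy example: instead of learning an output directly, a block learns a residual function F(x) and adds it back to its input, so the block computes F(x) + x. The sketch below uses plain Python lists in place of tensors, and `residual_fn` is a hypothetical stand-in for the stacked layers; it is meant only to show the identity shortcut, not the networks evaluated in the paper.

```python
def residual_block(x, residual_fn):
    """Return F(x) + x, where F is the learned residual function.

    `x` is a list of floats; `residual_fn` (an illustrative placeholder
    for a stack of layers) must return a list of the same length.
    """
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("residual output must match the input shape")
    # Identity shortcut: add the input back onto the residual.
    return [f + xi for f, xi in zip(fx, x)]


# If the residual function outputs zeros, the block is an identity mapping.
print(residual_block([1.0, 2.0], lambda v: [0.0 for _ in v]))

# A non-trivial residual only has to model the difference from the input.
print(residual_block([1.0, 2.0], lambda v: [0.5 * a for a in v]))
```

The first call shows why this form is convenient for very deep stacks: a block can fall back to the identity simply by driving its residual toward zero.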
Article
In this paper we present algorithms for the solution of the general assignment and transportation problems. In Section 1, a statement of the algorithm for the assignment problem appears, along with a proof for the correctness of the algorithm. The remarks which constitute the proof are incorporated parenthetically into the statement of the algorithm. Following this appears a discussion of certain theoretical aspects of the problem. In Section 2, the algorithm is generalized to one for the transportation problem. The algorithm of that section is stated as concisely as possible, with theoretical remarks omitted. 1. THE ASSIGNMENT PROBLEM. The personnel-assignment problem is the problem of choosing an optimal assignment of n men to n jobs, assuming that numerical ratings are given for each man’s performance on each job. An optimal assignment is one which makes the sum of the men’s ratings for their assigned jobs a maximum. There are n! possible assignments (of which several may be optimal), so that it is physically impossible, except
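To make the problem statement concrete, the sketch below solves a tiny assignment problem by exhaustively scoring all n! permutations. This is the brute-force baseline that the excerpt above describes as impractical for larger n, not the algorithm the paper presents; the function name and the example ratings matrix are illustrative assumptions.

```python
from itertools import permutations


def best_assignment(ratings):
    """Exhaustively find the assignment with the maximum total rating.

    `ratings[i][j]` is the rating of worker i on job j (square matrix).
    Returns (assignment, total), where assignment[i] is the job given
    to worker i. Runs in O(n!) time, so it only suits very small n.
    """
    n = len(ratings)
    best_perm, best_score = None, float("-inf")
    for perm in permutations(range(n)):
        score = sum(ratings[i][perm[i]] for i in range(n))
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm, best_score


# Hypothetical 3x3 ratings: rows are workers, columns are jobs.
ratings = [
    [9, 2, 7],
    [6, 4, 3],
    [5, 8, 1],
]
print(best_assignment(ratings))  # ((2, 0, 1), 21)
```

In multi-object tracking the same formulation is used with tracks as rows and detections as columns, usually as a cost minimisation rather than a rating maximisation.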
Conference Paper
We analyze the computational problem of multi-object tracking in video sequences. We formulate the problem using a cost function that requires estimating the number of tracks, as well as their birth and death states. We show that the global solution can be obtained with a greedy algorithm that sequentially instantiates tracks using shortest path computations on a flow network. Greedy algorithms allow one to embed pre-processing steps, such as nonmax suppression, within the tracking algorithm. Furthermore, we give a near-optimal algorithm based on dynamic programming which runs in time linear in the number of objects and linear in the sequence length. Our algorithms are fast, simple, and scalable, allowing us to process dense input data. This results in state-of-the-art performance.
Conference Paper
We propose a network flow based optimization method for data association needed for multiple object tracking. The maximum-a-posteriori (MAP) data association problem is mapped into a cost-flow network with a non-overlap constraint on trajectories. The optimal data association is found by a min-cost flow algorithm in the network. The network is augmented to include an explicit occlusion model (EOM) to track with long-term inter-object occlusions. A solution to the EOM-based network is found by an iterative approach built upon the original algorithm. Initialization and termination of trajectories and potential false observations are modeled by the formulation intrinsically. The method is efficient and does not require hypotheses pruning. Performance is compared with previous results on two public pedestrian datasets to show its improvement.
Conference Paper
We present a detection-based three-level hierarchical association approach to robustly track multiple objects in crowded environments from a single camera. At the low level, reliable tracklets (i.e. short tracks for further analysis) are generated by linking detection responses based on conservative affinity constraints. At the middle level, these tracklets are further associated to form longer tracklets based on more complex affinity measures. The association is formulated as a MAP problem and solved by the Hungarian algorithm. At the high level, entries, exits and scene occluders are estimated using the already computed tracklets, which are used to refine the final trajectories. This approach is applied to the pedestrian class and evaluated on two challenging datasets. The experimental results show a great improvement in performance compared to previous methods.
Article
We describe an object detection system based on mixtures of multiscale deformable part models. Our system is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL data sets. Our system relies on new methods for discriminative training with partially labeled data. We combine a margin-sensitive approach for data-mining hard negative examples with a formalism we call latent SVM. A latent SVM is a reformulation of MI-SVM in terms of latent variables. A latent SVM is semiconvex, and the training problem becomes convex once latent information is specified for the positive examples. This leads to an iterative training algorithm that alternates between fixing latent values for positive examples and optimizing the latent SVM objective function.
Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; and Darrell, T. 2014. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093.
Milan, A.; Leal-Taixé, L.; Reid, I.; Roth, S.; and Schindler, K. 2016. MOT16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831.
Zhu, J.; Yang, H.; Liu, N.; Kim, M.; Zhang, W.; and Yang, M.-H. 2018. Online multi-object tracking with dual matching attention networks. In ECCV.