Real-Time Multiple People Tracking with Deeply Learned Candidate Selection and Person Re-Identification


Abstract

Online multi-object tracking is a fundamental problem in time-critical video analysis applications. A major challenge in the popular tracking-by-detection framework is how to associate unreliable detection results with existing tracks. In this paper, we propose to handle unreliable detection by collecting candidates from the outputs of both detection and tracking. The intuition behind generating redundant candidates is that detection and tracks can complement each other in different scenarios. High-confidence detection results prevent tracking drift in the long term, and predictions of tracks can handle noisy detection caused by occlusion. To select the optimal candidates from a considerable number of candidates in real time, we present a novel scoring function based on a fully convolutional neural network that shares most computations over the entire image. Moreover, we adopt a deeply learned appearance representation, trained on large-scale person re-identification datasets, to improve the identification ability of our tracker. Extensive experiments show that our tracker achieves real-time and state-of-the-art performance on a widely used people tracking benchmark.
Long Chen, Haizhou Ai, Zijie Zhuang, Chong Shang
Tsinghua National Lab for Info. Sci. & Tech. (TNList),
Department of Computer Science and Technology, Tsinghua University, Beijing, China, 100084.
{l-chen16, shang-c13, zhuangzj15},
Index Terms: Multi-object tracking, convolutional neural network, person re-identification
Tracking multiple objects in a complex scene is a challenging problem in many video analysis and multimedia applications, such as visual surveillance, sports analysis, and autonomous driving. The objective of multi-object tracking is to estimate the trajectories of objects in a specific category. Here we tackle the problem of people tracking by taking advantage of person re-identification.
Multi-object tracking has benefited greatly from advances in object detection over the past decade. The popular tracking-by-detection methods apply a detector to each frame and associate detections across frames to generate object trajectories. Both intra-category occlusion and unreliable detection are major challenges in such a tracking framework [1, 2].
Intra-category occlusion and similar appearances of objects can result in ambiguities in data association. Multiple cues, including motion, shape and object appearance, are fused to mitigate this problem [3, 4]. On the other hand, detection results are not always reliable. Pose variation and occlusion in crowded scenes often cause detection failures such as false positives, missed detections, and inaccurate bounding boxes. Some studies proposed to handle unreliable detection in a batch mode [2, 5, 6]. These methods address detection noise by introducing information from future frames. Detection results from the whole video or from a temporal window are linked into trajectories by solving a global optimization problem. Tracking in a batch mode is non-causal and not suitable for time-critical applications. In contrast to these works, we focus on the online multiple people tracking problem, using only the current and past frames.

Fig. 1: Candidate selection based on unified scores. Candidates from detection and tracks are visualized as blue solid rectangles and red dotted rectangles, respectively. Detection and tracks can complement each other for data association.

To handle unreliable detection in an online mode, our tracking framework optimally selects candidates from the outputs of both detection and tracks in each frame (as shown in Figure 1). In most existing tracking-by-detection methods, the candidates to be associated with existing tracks consist only of detection results. Yan et al. [4] proposed to treat the tracker and the object detector as two independent entities and to keep the results of both as candidates. They selected candidates based on hand-crafted features, e.g., color histograms, optical flow, and motion features. The intuition behind generating redundant candidates is that detection and tracks can complement each other in different scenarios. On the one hand, reliable predictions from the tracker can be used for short-term association in case of missed detections or inaccurate bounding boxes. On the other hand, confident detection results are essential to prevent tracks from drifting to the background in the long term. How to score the outputs of both detection and tracks in a unified way is still an open question.
Recently, deep neural networks, especially convolutional neural networks (CNNs), have made great progress in the field of computer vision and multimedia. In this paper, we take full advantage of deep neural networks to tackle unreliable detection and intra-category occlusion. Our contribution is threefold. First, we handle unreliable detection in online tracking by combining both detection and tracking results as candidates, and selecting optimal candidates based on deep neural networks. Second, we present a hierarchical data association strategy, which utilizes spatial information and deeply learned person re-identification (ReID) features. Third, we demonstrate real-time and state-of-the-art performance of our tracker on a widely used people tracking benchmark.
Tracking-by-detection is becoming the most popular strategy for multi-object tracking. Bae et al. [1] associated tracklets with detections in different ways according to their confidence values. Sanchez-Matilla et al. [7] exploited multiple detectors to improve tracking performance, collecting outputs from multiple detectors during the so-called over-detection process. Combining results from multiple detectors can improve tracking performance but is not efficient enough for real-time applications. In contrast, our tracking framework needs only one detector and generates additional candidates from existing tracks. Chu et al. [?] used a binary classifier and a single-object tracker for online multi-object tracking. They shared the feature maps for classification but still had a high computational complexity.
Batch methods formulate tracking as a global optimization problem [4, 5, 6, 8]. These methods utilize information from future frames to handle noisy detection and reduce ambiguities in data association. Liu et al. [9] proposed a rewind-to-track strategy that generates backward tracklets involving future information, to obtain a more stable similarity measurement for association. Person re-identification was also explored in [6, 8, 10] for global optimization. Our framework leverages deeply learned ReID features in an online mode to improve the identification ability when coping with intra-category occlusion.
3.1. Framework Overview
In this work, we extend traditional tracking-by-detection by collecting candidates from the outputs of both detection and tracks. Our framework consists of two sequential tasks: candidate selection and data association.
We first measure all the candidates using a unified scoring function. A discriminatively trained object classifier and a well-designed tracklet confidence are fused to formulate the scoring function, as described in Section 3.2 and Section 3.3. Non-maximal suppression (NMS) is subsequently performed with the estimated scores. After obtaining candidates without redundancy, we use both appearance representations and spatial information to hierarchically associate existing tracks with the selected candidates. Our appearance representations are deeply learned for person re-identification, as described in Section 3.4. Hierarchical data association is detailed in Section 3.5.
3.2. Real-Time Object Classification
Combining the outputs of both detection and tracks results in an excessive number of candidates. Our classifier shares most computations over the entire image by using a region-based fully convolutional neural network (R-FCN) [11]. Thus, it is much more efficient than classification on image patches cropped from heavily overlapping candidate regions. The time consumption of these two methods is compared in Figure 3.
Our efficient classifier is illustrated in Figure 2. Given an image frame, score maps of the entire image are predicted using a fully convolutional neural network with an encoder-decoder architecture. The encoder part is a light-weight convolutional backbone for real-time performance, and we introduce the decoder part with up-sampling to increase the spatial resolution of the output score maps for later classification. Each candidate to be classified is defined as a region of interest (RoI) by $\mathbf{x} = (x_0, y_0, w, h)$, where $(x_0, y_0)$ denotes the top-left point and $w$, $h$ represent the width and height of the region. For computational efficiency, we expect the classification probability of each RoI to be voted directly by the shared score maps. A straightforward approach for voting is to construct foreground probabilities for all points on the image, and then calculate the average probability of the points inside the RoI. However, this simple strategy loses the spatial information of objects. For instance, even if the RoI covers only a part of the object, a high confidence score can still be obtained.
In order to explicitly encode spatial information into the score maps, we employ the position-sensitive RoI pooling layer [11] and estimate the classification probability from $k^2$ position-sensitive score maps $z$. In particular, we split a RoI into $k \times k$ bins by a regular grid. Each of the bins has the same size $\frac{w}{k} \times \frac{h}{k}$ and represents a specific spatial location of the object. We extract responses of the $k \times k$ bins from the $k^2$ score maps. Each score map corresponds to only one bin. The final classification probability of a RoI $\mathbf{x}$ is formulated as:

$$p(y \mid z, \mathbf{x}) = \sigma\left( \frac{1}{wh} \sum_{i=1}^{k^2} \sum_{(x, y) \in \mathrm{bin}_i} z_i(x, y) \right), \tag{1}$$

where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function, and $z_i$ denotes the $i$-th score map.

Fig. 2: R-FCN architecture for efficient classification. Features from the encoder network are concatenated with up-sampled features in the decoder part, to capture both the semantic and low-level information. Each color in the last block represents a specific score map.

During training, we randomly sample RoIs around the ground truth bounding boxes as positive examples, and take the same number of RoIs from the background as negative examples. By training the network end-to-end, the output on top of the decoder part, that is, the $k^2$ score maps, learns to respond to specific spatial locations of the object. For example, if $k = 3$, we have 9 score maps that respond to the top-left, top-center, top-right, ..., bottom-right of the object, respectively. In this way, the RoI pooling layer is sensitive to spatial positions and has a strong discriminative ability for object classification without using learnable parameters. Please note that the proposed neural network is trained only for candidate classification, not for bounding box regression.
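As an illustration of the position-sensitive voting described in this section, the following minimal Python sketch scores one RoI from plain nested lists. It assumes that the summed bin responses are averaged over the RoI area before the sigmoid, that RoI coordinates are integers, and that bins are obtained by integer division of the RoI. The function name `ps_roi_score` and the list-based score maps are hypothetical and chosen only for illustration.

```python
import math

def ps_roi_score(score_maps, roi, k):
    """Classification probability of one RoI from k*k position-sensitive maps.

    score_maps: list of k*k 2-D lists (rows of floats), one map per bin.
    roi: (x0, y0, w, h) in integer pixel coordinates, top-left origin.
    """
    x0, y0, w, h = roi
    total = 0.0
    for i in range(k * k):
        row, col = divmod(i, k)  # bin i sits at grid cell (row, col)
        # Pixel ranges covered by this bin (integer split of the RoI).
        xs = range(x0 + col * w // k, x0 + (col + 1) * w // k)
        ys = range(y0 + row * h // k, y0 + (row + 1) * h // k)
        # Bin i reads only from score map i.
        total += sum(score_maps[i][y][x] for y in ys for x in xs)
    # Assumption: average over the RoI area, then squash with a sigmoid.
    return 1.0 / (1.0 + math.exp(-total / (w * h)))
```

With score maps that respond only to their own part of the object, a RoI aligned with the object receives a high probability, whereas a RoI shifted by half its width is scored low. This is the spatial sensitivity that simple averaging over a single foreground map lacks.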
3.3. Tracklet Confidence and Scoring Function
Given a new frame, we estimate the new location of each existing track using the Kalman filter. These predictions are adopted to handle detection failures caused by varying visual properties of objects and occlusion in crowded scenes, but they are not suitable for long-term tracking: the accuracy of the Kalman filter can decrease if it is not updated by detection over a long time. Tracklet confidence is designed to measure the accuracy of the filter using temporal information.
A tracklet is generated through temporal association of candidates from consecutive frames. We can split a track into a set of tracklets, since a track can be interrupted and retrieved several times during its lifetime. Every time a track is retrieved from the lost state, the Kalman filter is reinitialized. Therefore, only the information of the last tracklet is utilized to formulate the confidence of the track. Here we define $L_{det}$ as the number of detection results associated with the tracklet, and $L_{trk}$ as the number of track predictions after the last detection is associated. The tracklet confidence is defined as:

$$s_{trk} = \max(1 - \log(1 + \alpha \cdot L_{trk}), 0) \cdot \mathbb{1}(L_{det} \geq 2), \tag{2}$$

where $\mathbb{1}(\cdot)$ is the indicator function that equals 1 if the input is true, otherwise equals 0. We require $L_{det} \geq 2$ to construct a reasonable motion model using observed detection before the track is used as a candidate.
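The tracklet confidence can be sketched in a few lines of Python. The default `alpha=0.05` and the parameter name `min_det` are assumptions made for illustration: the text does not fix the value of α, and the sketch takes the requirement on the number of associated detections to be at least two.

```python
import math

def tracklet_confidence(l_det, l_trk, alpha=0.05, min_det=2):
    """Confidence of the last tracklet of a track.

    l_det: number of detections associated with the tracklet.
    l_trk: number of predictions since the last associated detection.
    alpha: decay rate (hypothetical default, not given in the text).
    """
    if l_det < min_det:
        # Too few observations to build a motion model: never a candidate.
        return 0.0
    # Confidence decays logarithmically with the number of pure predictions
    # and is clipped at zero.
    return max(1.0 - math.log(1.0 + alpha * l_trk), 0.0)
```

Under these assumptions the confidence is 1 right after a detection is associated and decreases monotonically while the track is only being predicted.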
The unified scoring function for a candidate $\mathbf{x}$ is formed by fusing the classification probability and the tracklet confidence:

$$s = p(y \mid z, \mathbf{x}) \cdot \left( \mathbb{1}(\mathbf{x} \in C_{det}) + s_{trk} \, \mathbb{1}(\mathbf{x} \in C_{trk}) \right). \tag{3}$$

Here we use $C_{det}$ to denote the candidates from detection and $C_{trk}$ the candidates from tracks, and $s_{trk} \in [0, 1]$ punishes candidates from uncertain tracks. Candidates for data association are finally selected based on the unified scores using non-maximal suppression. The maximum intersection over union (IoU) allowed between selected candidates is set by a threshold $\tau_{nms}$, and a threshold $\tau_s$ sets the minimum score.
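The candidate selection step can be sketched as follows. This is a minimal illustration that assumes greedy NMS (a common variant; the text does not name one) and boxes given as (x0, y0, w, h). The function names are hypothetical, and the default thresholds are the values of $\tau_{nms}$ and $\tau_s$ reported in Section 4.1.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, w, h) boxes."""
    iw = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    ih = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def unified_score(p_cls, from_detection, s_trk=1.0):
    """Detections keep their classification probability; track candidates
    are additionally weighted by the tracklet confidence."""
    return p_cls if from_detection else p_cls * s_trk

def nms(boxes, scores, tau_nms):
    """Greedy NMS: visit boxes by descending score and keep a box only if
    its IoU with every box kept so far does not exceed tau_nms."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= tau_nms for j in keep):
            keep.append(i)
    return keep

def select_candidates(boxes, scores, tau_nms=0.3, tau_s=0.4):
    """NMS followed by removal of candidates scoring below tau_s."""
    return [i for i in nms(boxes, scores, tau_nms) if scores[i] >= tau_s]
```

Because a track prediction and a detection of the same person usually overlap heavily, NMS keeps whichever of the two has the higher unified score, which is how detection and tracks complement each other in this sketch.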
3.4. Appearance Representation with ReID Features
The similarity function between candidates is the key component of data association. We argue that object appearance representations deeply learned by a data-driven approach outperform traditional hand-crafted features on the task of similarity estimation. To learn the object appearance and the similarity function, we employ a deep neural network to extract feature vectors from RGB images, and formulate the similarity using the distance between the obtained features.
We utilize the network architecture proposed in [12] and train the network on a combination of several large-scale person re-identification datasets. The network $H_{reid}$ consists of the convolutional backbone from GoogLeNet [13] followed by $K$ branches of part-aligned fully connected (FC) layers. We refer to [12] for more details on the network architecture. Given an RGB image $I$ of a person, the appearance representation is formulated as $f = H_{reid}(I)$. We directly use the Euclidean distance between the feature vectors to measure the distance $d_{ij}$ of two images $I_i$ and $I_j$. During training, images of identities in the training datasets are formed into a set of triplets $T = \{\langle I_i, I_j, I_k \rangle\}$, where $\langle I_i, I_j \rangle$ is a positive pair from the same person, and $\langle I_i, I_k \rangle$ is a negative pair from two different people. Given $N$ triplets, the loss function to be minimized is formulated as:

$$l_{triplet} = \frac{1}{N} \sum_{\langle I_i, I_j, I_k \rangle \in T} \max(d_{ij} - d_{ik} + m, 0), \tag{4}$$

where $m > 0$ is a predefined margin. We ignore triplets that are easy to handle, i.e. $d_{ik} - d_{ij} > m$, to enhance the discriminative ability of the learned feature representations.
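The triplet loss can be illustrated on already-extracted feature vectors. The sketch below is plain Python that assumes the loss is averaged over the given triplets; it does not include the network or any gradient computation, and the function names are hypothetical.

```python
import math

def euclidean(f1, f2):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))

def triplet_loss(triplets, margin):
    """Mean hinge loss over (f_i, f_j, f_k) feature triplets, where
    (f_i, f_j) come from the same person and f_k from a different one."""
    losses = [
        max(euclidean(fi, fj) - euclidean(fi, fk) + margin, 0.0)
        for fi, fj, fk in triplets
    ]
    return sum(losses) / len(losses)
```

A triplet whose negative is already farther from the anchor than the positive by more than the margin contributes zero, which corresponds to the easy triplets that are ignored during training.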
3.5. Hierarchical Data Association
Predictions of tracks are utilized to handle missed detections that occur in crowded scenes. Because of intra-category occlusion, these predictions may overlap with other objects. To avoid taking unwanted objects and background into the appearance representations, we hierarchically associate tracks with different candidates using different features.
In particular, we first apply data association on candidates from detection, using appearance representations with a threshold $\tau_d$ for the maximum distance. Then, we associate the remaining candidates with unassociated tracks based on the IoU between candidates and tracks, with a threshold $\tau_{iou}$. We only update the appearance representations of tracks when they are associated with a detection. The update is conducted by saving the ReID features from the associated detection. Finally, new tracks are initialized based on the remaining detection results. The details of the proposed online tracking algorithm are given in Algorithm 1. With the hierarchical data association, we only need to extract ReID features for candidates from detection once per frame. Combined with the efficient scoring function and tracklet confidence described above, this allows our framework to run at real-time speed.
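The two association stages can be sketched as follows. This is a minimal illustration under several assumptions: a greedy cheapest-pair-first matcher stands in for the assignment solver, which the text does not specify; $\tau_{iou}$ is treated as a minimum IoU; and tracks and candidates are plain dictionaries. All function and key names are hypothetical.

```python
import math

def euclidean(f1, f2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))

def iou(a, b):
    # Boxes are (x0, y0, w, h) with a top-left origin.
    iw = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    ih = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def greedy_match(cost, max_cost):
    # Stand-in for an assignment solver: take the cheapest remaining
    # (row, column) pair first and skip pairs costing more than max_cost.
    pairs = sorted((c, r, k) for r, row in enumerate(cost) for k, c in enumerate(row))
    used_rows, used_cols, matches = set(), set(), []
    for c, r, k in pairs:
        if c <= max_cost and r not in used_rows and k not in used_cols:
            matches.append((r, k))
            used_rows.add(r)
            used_cols.add(k)
    return matches

def associate(tracks, candidates, tau_d=0.4, tau_iou=0.3):
    """tracks: dicts with 'box' (predicted location) and 'feat'.
    candidates: dicts with 'box', 'from_det' and, for detections, 'feat'.
    Returns (matches, new_tracks): (track, candidate) index pairs and the
    indices of leftover detections that initialize new tracks."""
    # Stage 1: appearance distance, candidates from detection only.
    dets = [j for j, c in enumerate(candidates) if c["from_det"]]
    cost = [[euclidean(t["feat"], candidates[j]["feat"]) for j in dets] for t in tracks]
    matches = [(i, dets[k]) for i, k in greedy_match(cost, tau_d)]
    # Stage 2: IoU between unassociated tracks and all remaining candidates.
    done_t = {m[0] for m in matches}
    done_c = {m[1] for m in matches}
    free_t = [i for i in range(len(tracks)) if i not in done_t]
    free_c = [j for j in range(len(candidates)) if j not in done_c]
    cost = [[1.0 - iou(tracks[i]["box"], candidates[j]["box"]) for j in free_c] for i in free_t]
    matches += [(free_t[a], free_c[b]) for a, b in greedy_match(cost, 1.0 - tau_iou)]
    # Leftover detections start new tracks; leftover track candidates do not.
    done_c = {m[1] for m in matches}
    return matches, [j for j in dets if j not in done_c]
```

Only the pairs produced by the first stage involve a detection with a ReID feature, so in a full tracker only those matches would refresh the stored appearance of a track.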
4.1. Experiment Setup
To evaluate the performance of the proposed online tracking method, we conduct extensive experiments on the MOT16 dataset [14], which is a widely used benchmark for multiple people tracking. This dataset contains a training set and a test set, each with 7 challenging video sequences filmed in unconstrained environments. We form a validation set with 5 video sequences from the training set to analyze the contribution of each component in our framework. Afterwards, we submit the tracking result on the test set to the benchmark and compare it with state-of-the-art methods.
Implementation details. We employ SqueezeNet [15] as the backbone of R-FCN for real-time performance. Our fully convolutional network, consisting of SqueezeNet and the decoder part, takes only 8 ms to estimate score maps for an input image of size 1152x640 on a GTX 1080 Ti GPU. We set $k = 7$ for the position-sensitive score maps, and train the network using the RMSprop optimizer with a learning rate of 1e-4 and a batch size of 32 for 20k iterations. The training data for person classification is collected from MS COCO [16] and the remaining two training video sequences. We set $\tau_{nms} = 0.3$ and $\tau_s = 0.4$ for candidate selection. We train the ReID network on a combination of three large-scale person re-identification datasets, i.e. Market1501 [17], CUHK01 and CUHK03 [18], to enhance its generalization ability for tracking. We set $\tau_d = 0.4$ and $\tau_{iou} = 0.3$ for hierarchical data association. The following experiments are based on the same hyper-parameters.

Algorithm 1: The proposed tracking algorithm.
Input: A video sequence v with N_v frames and object detection results D_k for each frame
Output: Tracks T of the video
Initialization: T ← ∅; appearance of tracks F_trk ← ∅
foreach frame f_k in v do
    Estimate score maps z from f_k using R-FCN
    /* collect candidates */
    C_det ← D_k; C_trk ← ∅
    foreach t in T do
        Predict new location x of t using Kalman filter
        C_trk ← C_trk ∪ {x}
    end
    /* select candidates */
    C ← C_det ∪ C_trk
    S ← unified scores computed from Equation 3
    C, S ← NMS(C, S, τ_nms)
    C, S ← Filter(C, S, τ_s)    // filter out if s < τ_s
    /* extract appearance features */
    F_det ← ∅
    foreach x in C_det do
        I_x ← Crop(f_k, x)
        F_det ← F_det ∪ H_reid(I_x)
    end
    /* hierarchical data association */
    Associate T and C_det using distances of F_trk and F_det
    Associate remaining tracks and candidates using IoU
    F_trk ← F_trk ∪ F_det
    /* initialize new tracks */
    C_remain ← remaining candidates from C_det
    F_remain ← features of C_remain
    T, F_trk ← T ∪ C_remain, F_trk ∪ F_remain
end

Evaluation metrics. In order to measure the accuracy of bounding boxes and identities at the same time, we adopt multiple metrics used in the benchmark to evaluate the proposed method, including multiple object tracking accuracy (MOTA) [19], false alarms per frame (FAF), the number of mostly tracked targets (MT, >80% recovered), the number of mostly lost targets (ML, <20% recovered) [20], false positives (FP), false negatives (FN), identity switches (IDS), identification recall (IDR), identification F1 score (IDF1) [21], and processing speed (frames per second, FPS).
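For reference, two of these metrics reduce to simple arithmetic on error counts. The sketch below follows the CLEAR MOT definition of MOTA [19] and computes FAF as false positives divided by the number of frames. The function names are hypothetical, and any counts passed in are the caller's own, not benchmark values.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multiple object tracking accuracy; num_gt is the total number of
    ground-truth boxes accumulated over all frames."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt

def faf(false_positives, num_frames):
    """False alarms per frame."""
    return false_positives / num_frames
```

MOTA aggregates the three error types into a single number, so it mainly reflects detection coverage, while IDF1 and IDS are more sensitive to identity preservation.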
4.2. Analysis on Validation Set
Contribution of each component. To demonstrate the effectiveness of the proposed method, we investigate the contribution of the different components of our framework in Table 1. The baseline method predicts the new location of each track using the Kalman filter, and then associates tracks with detections based on IoU. Using the classification probability to select candidates from both detection and tracks improves the MOTA by 4.6 percentage points (from 28.4 to 33.0) compared with the baseline method. By punishing candidates from uncertain tracks, the combination of tracklet confidence with classification probability further improves the MOTA and reduces false positives, as we expected in Section 3.3. On the other hand, by introducing appearance representations based on ReID features, we obtain a significant improvement in identification performance (evaluated by IDF1 and IDS). Our proposed method, which combines the unified scoring function and ReID features, has the best results on all metrics.

Table 1: Evaluation results on the validation set in terms of different components used. C: classification probability, T: tracklet confidence, A: appearance representations with ReID features. The arrow after each metric indicates whether a higher (↑) or lower (↓) value is better.
Method   | C | T | A | MOTA↑ | IDF1↑ | IDS↓ | FAF↓
Baseline | - | - | - | 28.4  | 32.8  | 628  | 0.85
-        | ✓ | - | - | 33.0  | 37.6  | 445  | 0.77
-        | ✓ | ✓ | - | 33.7  | 37.3  | 475  | 0.63
-        | - | - | ✓ | 30.6  | 42.4  | 234  | 1.01
Proposed | ✓ | ✓ | ✓ | 35.7  | 45.3  | 184  | 0.58

Table 2: Evaluation results on the validation set in terms of different appearance representations.
Representation  | Dim. | MOTA↑ | IDF1↑ | IDS↓ | FAF↓
None            | -    | 33.7  | 37.3  | 475  | 0.63
Color histogram | 750  | 34.9  | 38.6  | 250  | 0.73
HOG             | 1152 | 34.6  | 38.5  | 317  | 0.70
Color + HOG     | 1902 | 34.7  | 39.3  | 307  | 0.68
ReID feature    | 512  | 35.7  | 45.3  | 184  | 0.58

Comparison with different appearance features. As shown in Table 2, we compare the representations learned by the data-driven approach detailed in Section 3.4 with two typical hand-crafted features, i.e. color histogram and histogram of oriented gradients (HOG). Following the fixed part model, which is widely used for appearance descriptors [22], we divide each image of a person into six horizontal stripes of equal size for the color histogram. The color histogram of each stripe is built from the HSV color space with 125 bins. We normalize both the color histogram and HOG features by the L2 norm, and formulate the similarity using the cosine similarity function. As shown in the table, our appearance representation outperforms traditional hand-crafted features by a large margin in terms of IDF1 and IDS, despite having a shorter feature vector than the other methods. The evaluation result on the validation set verifies the effectiveness of our data-driven approach for multiple people tracking. The proposed tracking framework can be easily transferred to other categories by learning the appearance representation from corresponding datasets, such as vehicle re-identification [23].
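The hand-crafted color baseline can be sketched in plain Python. The sketch assumes that the 125 bins per stripe come from quantizing each HSV channel into 5 levels (5 x 5 x 5), which the text does not state, and that pixels are (h, s, v) tuples with values in [0, 1). With six stripes this gives the 6 x 125 = 750 dimensions listed in Table 2. The function names are hypothetical.

```python
import math

def stripe_color_histogram(hsv_image, stripes=6, bins_per_channel=5):
    """L2-normalized concatenation of per-stripe HSV histograms.

    hsv_image: list of rows, each a list of (h, s, v) tuples in [0, 1).
    """
    height = len(hsv_image)
    feature = []
    for s in range(stripes):
        hist = [0.0] * bins_per_channel ** 3
        # Rows belonging to horizontal stripe s.
        for row in hsv_image[s * height // stripes:(s + 1) * height // stripes]:
            for h, sat, v in row:
                idx = 0
                for value in (h, sat, v):
                    level = min(int(value * bins_per_channel), bins_per_channel - 1)
                    idx = idx * bins_per_channel + level
                hist[idx] += 1.0
        feature.extend(hist)
    norm = math.sqrt(sum(x * x for x in feature)) or 1.0
    return [x / norm for x in feature]

def cosine_similarity(a, b):
    """Dot product; equals the cosine because inputs are L2-normalized."""
    return sum(x * y for x, y in zip(a, b))
```

Two crops with the same color layout give a similarity close to 1, while crops with disjoint colors give 0, which is the behavior the similarity threshold relies on.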

Fig. 3: Average time consumption of one frame on the MOT16-03 sequence, which contains over 50 people per frame, broken down into classification, ReID, selection, and association. Patch: classification based on image patches using SqueezeNet and two FC layers; R-FCN: classification using the same CNN backbone and a position-sensitive RoI pooling layer.

4.3. Evaluation on Test Set
We first analyze the time consumption of the proposed tracking framework on the MOT16-03 sequence. As shown in Figure 3, the proposed method is much more time-efficient because it shares computations over the entire image.
We report evaluation results on the MOT16 test set and compare our tracker with other offline and online trackers in Table 3. Note that tracking performance depends heavily on the quality of detection. For a fair comparison, all the trackers in the table use the same detections provided by the benchmark. As shown in the table, our tracker runs at real-time speed and outperforms existing online trackers on most of the metrics, especially IDF1, IDR, MT, and ML. The identification ability is enhanced by the deeply learned appearance representation. The improvement in MT and ML demonstrates the advantage of our unified scoring function for candidate selection. Selecting candidates from both detection and tracks indeed reduces tracking failures caused by missed detections. Moreover, our online tracker has much lower computational complexity and is about 5 to 20 times faster than most of the existing methods.
In this paper, we propose an online multiple people tracking framework that takes full advantage of recent deep neural networks. We tackle unreliable detection by selecting candidates from the outputs of both detection and tracks. The scoring function for candidate selection is formulated by an efficient R-FCN, which shares computations over the entire image. Moreover, we improve the identification ability when coping with intra-category occlusion by introducing ReID features for data association. ReID features trained by a data-driven approach outperform traditional hand-crafted features by a large margin. The proposed tracker achieves real-time and state-of-the-art performance on the MOT16 benchmark. A future study is planned to further improve efficiency by sharing convolutional layers between classification and appearance extraction.

Table 3: Evaluation results on the MOT16 test set.
Tracker        | Method | MOTA(%) | IDF1(%) | IDR(%) | MT(%) | ML(%) | FP    | FN      | IDS | FPS
LINF1 [2]      | batch  | 41.0    | 45.7    | 34.2   | 11.6  | 51.3  | 7,896 | 99,224  | 430 | 4.2
MHT_DAM [5]    | batch  | 45.8    | 46.1    | 35.3   | 16.2  | 43.2  | 6,412 | 91,758  | 590 | 0.8
JMC [10]       | batch  | 46.3    | 46.3    | 35.6   | 15.5  | 39.7  | 6,373 | 90,914  | 657 | 0.8
LMP [6]        | batch  | 48.8    | 51.3    | 40.1   | 18.2  | 40.1  | 6,654 | 86,245  | 481 | 0.5
EAMTT [7]      | online | 38.8    | 42.4    | 31.5   | 7.9   | 49.1  | 8,114 | 102,452 | 965 | 11.8
CDA_DDAL [1]   | online | 43.9    | 45.1    | 34.1   | 10.7  | 44.4  | 6,450 | 95,175  | 676 | 0.5
STAM [?]       | online | 46.0    | 50.0    | 38.5   | 14.6  | 43.6  | 6,895 | 91,117  | 473 | 0.2
AMIR [3]       | online | 47.2    | 46.3    | 34.8   | 14.0  | 41.6  | 2,681 | 92,856  | 774 | 1.0
MOTDT (Ours)   | online | 47.6    | 50.9    | 40.3   | 15.2  | 38.3  | 9,253 | 85,431  | 792 | 20.6

This work was supported by the Natural Science Foundation of
China (Project Number 61521002).
[1] Seung-Hwan Bae and Kuk-Jin Yoon, “Confidence-based data association and discriminative deep appearance learning for robust online multi-object tracking,” IEEE Transactions on PAMI, 2017.
[2] Loïc Fagot-Bouquet, Romaric Audigier, Yoann Dhome, and Frédéric Lerasle, “Improving multi-frame data association with sparse representations for robust near-online multi-object tracking,” in ECCV, 2016.
[3] Amir Sadeghian, Alexandre Alahi, and Silvio Savarese, “Tracking the untrackable: Learning to track multiple cues with long-term dependencies,” in ICCV, 2017.
[4] Xu Yan, Xuqing Wu, Ioannis A Kakadiaris, and Shishir K Shah, “To track or to detect? an ensemble framework for optimal selection,” in ECCV, 2012.
[5] Chanho Kim, Fuxin Li, Arridhana Ciptadi, and James M Rehg, “Multiple hypothesis tracking revisited,” in ICCV, 2015.
[6] Siyu Tang, Mykhaylo Andriluka, Bjoern Andres, and Bernt Schiele, “Multiple people tracking by lifted multicut and person re-identification,” in CVPR, 2017.
[7] Ricardo Sanchez-Matilla, Fabio Poiesi, and Andrea Cavallaro, “Online multi-target tracking with strong and weak detections,” in ECCV Workshops, 2016.
[8] Yaowen Guan, Xiaoou Chen, Deshun Yang, and Yuqian Wu, “Multi-person tracking-by-detection with local particle filtering and global occlusion handling,” in ICME, 2014.
[9] Jiang Liu, Jia Chen, De Cheng, Chenqiang Gao, and Alexander G Hauptmann, “Rewind to track: Parallelized apprenticeship learning with backward tracklets,” in ICME, 2017.
[10] Siyu Tang, Bjoern Andres, Mykhaylo Andriluka, and Bernt Schiele, “Multi-person tracking by multicut and deep matching,” in ECCV, 2016.
[11] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun, “R-fcn: Object detection via region-based fully convolutional networks,” in NIPS, 2016.
[12] Liming Zhao, Xi Li, Jingdong Wang, and Yueting Zhuang, “Deeply-learned part-aligned representations for person re-identification,” in ICCV, 2017.
[13] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich, “Going deeper with convolutions,” in CVPR, 2015.
[14] Anton Milan, Laura Leal-Taixé, Ian Reid, Stefan Roth, and Konrad Schindler, “Mot16: A benchmark for multi-object tracking,” arXiv preprint arXiv:1603.00831, 2016.
[15] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 mb model size,” arXiv preprint arXiv:1602.07360, 2016.
[16] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick, “Microsoft coco: Common objects in context,” in ECCV, 2014.
[17] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian, “Scalable person re-identification: A benchmark,” in ICCV, 2015.
[18] Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang, “Deepreid: Deep filter pairing neural network for person re-identification,” in CVPR, 2014.
[19] Keni Bernardin and Rainer Stiefelhagen, “Evaluating multiple object tracking performance: the clear mot metrics,” EURASIP Journal on Image and Video Processing, 2008.
[20] Yuan Li, Chang Huang, and Ram Nevatia, “Learning to associate: Hybridboosted multi-target tracker for crowded scene,” in CVPR, 2009.
[21] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara,
and Carlo Tomasi, “Performance measures and a data set for
multi-target, multi-camera tracking,” in ECCV Workshops,
[22] Riccardo Satta, “Appearance descriptors for person re-identification: a comprehensive review,” arXiv preprint arXiv:1307.5748, 2013.
[23] Xinchen Liu, Wu Liu, Huadong Ma, and Huiyuan Fu, “Large-scale vehicle re-identification in urban surveillance videos,” in ICME, 2016.
... Object detection is one of the most active topics in computer vision and the basis of MOT. The continuous development of deep learning techniques has greatly improved the performance of MOT algorithms [14,15] and has made the tracking-by-detection (TBD) two-stage pedestrian tracking algorithms [16,17] the current mainstream framework. The TBD algorithm first detects the current frame image through the object detection network [17][18][19][20][21][22] and obtains multiple pedestrian detection boxes, and then correlates them with the pedestrian trajectories already established in the previous sequence of video frames by the Kalman filter and Hungarian algorithm. ...
... The continuous development of deep learning techniques has greatly improved the performance of MOT algorithms [14,15] and has made the tracking-by-detection (TBD) two-stage pedestrian tracking algorithms [16,17] the current mainstream framework. The TBD algorithm first detects the current frame image through the object detection network [17][18][19][20][21][22] and obtains multiple pedestrian detection boxes, and then correlates them with the pedestrian trajectories already established in the previous sequence of video frames by the Kalman filter and Hungarian algorithm. The DeepSORT algorithm is a classical TBD algorithm that was proposed by Wojke et al. [23] It uses YOLOv3 [24] as the pedestrian detection network and extracts pedestrian apparent features via a pedestrian re- identification module, but because pedestrian detection and apparent feature extraction are performed in two steps, there are many redundant calculations. ...
Aiming at the problems of frequent identity switches (IDs) and trajectory interruption in multi-pedestrian tracking algorithms in dense scenes, this paper proposes a multi-pedestrian tracking algorithm based on an attention mechanism and dual data association. First, with the FairMOT algorithm as a baseline, a feature pyramid network is introduced into the CenterNet detection network and the output multi-scale fused feature maps are up-sampled, effectively reducing the rate of missed detection of small-sized and occluded pedestrians. An improved channel attention module is embedded in CenterNet’s backbone network to improve detection accuracy. Then, a re-identification (ReID) branch is embedded in the head of the detection network, and the two sub-tasks of pedestrian detection and pedestrian appearance feature extraction are combined in a multi-task joint learning approach that outputs the pedestrian appearance feature vectors while detecting pedestrians, which improves the computational efficiency and localization accuracy of the algorithm. Finally, we propose a dual data association tracking model that tracks by associating almost every detection box instead of only the high-scoring ones. For low-scoring detection boxes, we utilize their similarities with trajectories to recover occluded pedestrians. Experiments on the MOT17 dataset show that the tracking accuracy is improved by 0.6% compared with the baseline FairMOT algorithm, and the number of identity switches decreases from 3303 to 2056, which indicates that the proposed algorithm can effectively reduce the number of trajectory interruptions and identity switches.
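The dual data association idea in the abstract above (match high-scoring boxes first, then try low-scoring boxes against the tracks still unmatched) can be sketched as follows. This is only an illustrative sketch, not the cited paper's implementation: the IoU similarity, the greedy matcher used in place of Hungarian matching, and the names `dual_associate`, `high_thresh` and `min_iou` are all assumptions.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks: Dict[int, Box], dets: Dict[int, Box],
                 min_iou: float) -> Dict[int, int]:
    """Pair tracks and detections by descending IoU.

    Greedy matching is a simplification assumed here; it stands in for
    an optimal assignment such as the Hungarian algorithm.
    """
    pairs = sorted(
        ((iou(tb, db), t, d) for t, tb in tracks.items() for d, db in dets.items()),
        reverse=True,
    )
    matches: Dict[int, int] = {}
    used = set()
    for score, t, d in pairs:
        if score < min_iou:
            break  # all remaining pairs overlap even less
        if t not in matches and d not in used:
            matches[t] = d
            used.add(d)
    return matches


def dual_associate(tracks: Dict[int, Box],
                   detections: List[Tuple[Box, float]],
                   high_thresh: float = 0.5,   # hypothetical score split
                   min_iou: float = 0.3) -> Dict[int, int]:
    """Return {track_id: detection_index} after two association rounds."""
    high = {i: b for i, (b, s) in enumerate(detections) if s >= high_thresh}
    low = {i: b for i, (b, s) in enumerate(detections) if s < high_thresh}
    first = greedy_match(tracks, high, min_iou)           # round 1: confident boxes
    remaining = {t: b for t, b in tracks.items() if t not in first}
    second = greedy_match(remaining, low, min_iou)        # round 2: low-scoring boxes
    return {**first, **second}


tracks = {1: (0, 0, 10, 10), 2: (20, 0, 30, 10)}
detections = [((1, 0, 11, 10), 0.9), ((21, 0, 31, 10), 0.2)]
result = dual_associate(tracks, detections)
```

In this toy input, track 2 overlaps only a low-scoring box, so a tracker that discarded low scores would leave it unmatched, whereas the second round recovers it.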
... As the number of Internet of Things (IoT) devices, represented by autonomous vehicles, has increased dramatically, deep learning techniques have brought a new paradigm to various application fields, such as driving applications [1], speech recognition [2], and image identification [3], which can use a large amount of historical data to mine hidden patterns and make predictions about what will happen in the future. As a result, people can optimize their travel time using deep learning techniques, significantly improving their work efficiency. ...
The rapidly expanding number of Internet of Things (IoT) devices is generating huge quantities of data, but it also exposes data privacy and security risks in IoT devices, especially in automatic driving systems. Federated learning (FL) is a paradigm that addresses data privacy, security, access rights, and access to heterogeneous messages by integrating a global model based on distributed nodes. However, data poisoning attacks on FL can undermine these benefits, destroying the global model's availability and disrupting model training. To avoid the above issues, we build a hierarchical defense data poisoning (HDDP) system framework to defend against data poisoning attacks in FL, which monitors the local model of each node via anomaly detection to remove malicious clients. Depending on whether the poisoning defense server has a trusted test dataset, we design the local model test voting (LMTV) and Kullback-Leibler divergence anomaly parameters detection (KLAD) algorithms to defend against label-flipping poisoning attacks. Specifically, in LMTV the trusted test dataset is used to obtain the evaluation results for each class in order to recognize malicious clients. More importantly, in KLAD we adopt the Kullback-Leibler divergence to measure the similarity between local models without the trusted test dataset. Finally, in extensive evaluations against various label-flipping poisoning attacks, the LMTV and KLAD algorithms achieve successful defense ratios of 100% and 40% to 85%, respectively, under different detection situations.
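As a rough illustration of using the Kullback-Leibler divergence to compare local models, the sketch below computes D_KL(p || q) between discrete distributions and flags the ones that diverge most from the rest. It is not the KLAD algorithm from the abstract: summarizing each client as a discrete distribution, the mean-divergence rule, and the names `flag_outliers` and `threshold` are assumptions made for the example.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float],
                  eps: float = 1e-12) -> float:
    """D_KL(p || q) for two discrete distributions on the same support.

    eps guards against log(0) when q has zero-probability entries.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def flag_outliers(dists: List[Sequence[float]], threshold: float) -> List[int]:
    """Indices whose mean KL divergence to all other distributions exceeds threshold.

    The threshold rule is an assumption for illustration only.
    """
    flagged = []
    for i, p in enumerate(dists):
        others = [kl_divergence(p, q) for j, q in enumerate(dists) if j != i]
        if others and sum(others) / len(others) > threshold:
            flagged.append(i)
    return flagged


# Hypothetical per-client class distributions; the third one is skewed.
clients = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1]]
suspects = flag_outliers(clients, threshold=0.3)
```

Because the KL divergence is asymmetric, the direction of comparison matters; a symmetric variant could be substituted if that is preferable.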
In this paper, we propose an online multi-object tracking (MOT) method based on the delta Generalized Labeled Multi-Bernoulli (δ-GLMB) filter framework to address occlusion and miss-detection issues and recover identity switch (ID switch). Along with the principal δ-GLMB filter that performs multi-object tracking, we propose a one-step δ-GLMB filter to handle occlusion and miss-detection. The one-step δ-GLMB filter is non-iterative and only requires current measurements. The filter is based on a proposed measurement-to-reappeared track association method and addresses MOT issues by incorporating all occluded and miss-detected objects. We introduce a novel similarity metric to apply in the measurement-to-reappeared track association process to define the weight of hypothesized reappeared tracks. To ensure track consistency, we also extend the principal δ-GLMB filter to efficiently recover switched IDs using the cardinality density, size, and visual features of the hypothesized tracks. In addition, we perform an ablation study to demonstrate the contribution of the main parts of the proposed method. We evaluate the proposed method on well-known and publicly available test datasets focused on pedestrian tracking. Note that our proposed method is online and not based on the learning paradigm, so it does not use any additional source of information such as private detections and pre-trained networks. Despite that, we achieve reliable performance in multi-person tracking in complex scenes by applying occlusion/miss-detection and ID switch handlers. Experimental results show that the proposed tracker performs better than, or at least on par with, state-of-the-art online and offline MOT methods.
In this paper, we address the problem of person re-identification, which refers to associating the persons captured from different cameras. We propose a simple yet effective human part-aligned representation for handling the body part misalignment problem. Our approach decomposes the human body into regions (parts) which are discriminative for person matching, accordingly computes the representations over the regions, and aggregates the similarities computed between the corresponding regions of a pair of probe and gallery images as the overall matching score. Our formulation, inspired by attention models, is a deep neural network modeling the three steps together, which is learnt through minimizing the triplet loss function without requiring body part labeling information. Unlike most existing deep learning algorithms that learn a global or spatial partition-based local representation, our approach performs human body partition, and thus is more robust to pose changes and various human spatial distributions in the person bounding box. Our approach shows state-of-the-art results over standard datasets, Market-1501, CUHK03, CUHK01 and VIPeR.
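The triplet loss mentioned in the abstract above can be illustrated with a minimal sketch. It uses the common hinge form max(0, d(a, p) - d(a, n) + margin) with Euclidean distance; the exact distance, the margin value and the function names are assumptions here, and the cited work may use a different variant (for example squared distances).

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor: Sequence[float],
                 positive: Sequence[float],
                 negative: Sequence[float],
                 margin: float = 0.3) -> float:  # margin value is hypothetical
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is farther from the anchor than
    the positive by at least the margin.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Same identity close to the anchor, different identity far away: no penalty.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Roles swapped: the "positive" is farther than the "negative", so the loss is large.
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

Minimizing this quantity over many triplets pulls features of the same person together and pushes features of different people apart, which is what makes the learned representation usable for matching.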
Online multi-object tracking aims at estimating the tracks of multiple objects instantly with each incoming frame and the information provided up to the moment. It still remains a difficult problem in complex scenes, because of the large ambiguity in associating multiple objects in consecutive frames and the low discriminability between object appearances. In this paper, we propose a robust online multi-object tracking method that can handle these difficulties effectively. We first define the tracklet confidence using the detectability and continuity of a tracklet, and decompose a multi-object tracking problem into small subproblems based on the tracklet confidence. We then solve the online multi-object tracking problem by associating tracklets and detections in different ways according to their confidence values. Based on this strategy, tracklets sequentially grow with online-provided detections, and fragmented tracklets are linked up with others without any iterative and expensive association steps. For more reliable association between tracklets and detections, we also propose a deep appearance learning method to learn a discriminative appearance model from large training datasets, since the conventional appearance learning methods do not provide rich representations that can distinguish multiple objects with large appearance variations. In addition, we combine online transfer learning for improving appearance discriminability by adapting the pre-trained deep model during online tracking. Experiments with challenging public datasets show distinct performance improvement over other state-of-the-art batch and online tracking methods, and prove the effect and usefulness of the proposed methods for online multi-object tracking.
To help accelerate progress in multi-target, multi-camera tracking systems, we present (i) a new pair of precision-recall measures of performance that treats errors of all types uniformly and emphasizes correct identification over sources of error; (ii) the largest fully-annotated and calibrated data set to date with more than 2 million frames of 1080p, 60 fps video taken by 8 cameras observing more than 2,700 identities over 85 min; and (iii) a reference software system as a comparison baseline. We show that (i) our measures properly account for bottom-line identity match performance in the multi-camera setting; (ii) our data set poses realistic challenges to current trackers; and (iii) the performance of our system is comparable to the state of the art.
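Identity-based precision and recall of the kind described above are commonly reported as IDP, IDR and their harmonic mean IDF1. The abstract does not spell out the formulas, so the sketch below uses the commonly cited definitions and assumes that the identity-level true positive, false positive and false negative counts (IDTP, IDFP, IDFN) have already been obtained from a truth-to-result identity matching; the function name is hypothetical.

```python
from typing import Tuple


def id_measures(idtp: int, idfp: int, idfn: int) -> Tuple[float, float, float]:
    """Return (IDP, IDR, IDF1) from identity-level counts.

    IDP  = IDTP / (IDTP + IDFP)
    IDR  = IDTP / (IDTP + IDFN)
    IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)
    Each measure falls back to 0.0 when its denominator is zero.
    """
    idp = idtp / (idtp + idfp) if idtp + idfp else 0.0
    idr = idtp / (idtp + idfn) if idtp + idfn else 0.0
    denom = 2 * idtp + idfp + idfn
    idf1 = 2 * idtp / denom if denom else 0.0
    return idp, idr, idf1


# Hypothetical counts for one sequence.
idp, idr, idf1 = id_measures(idtp=60, idfp=40, idfn=20)
```

With these counts the three values are 0.6, 0.75 and 2/3. Unlike counting identity switches, these ratios reward a tracker for keeping the correct identity over as much of each trajectory as possible.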
In Tang et al. (2015), we proposed a graph-based formulation that links and clusters person hypotheses over time by solving a minimum cost subgraph multicut problem. In this paper, we modify and extend Tang et al. (2015) in three ways: (1) We introduce a novel local pairwise feature based on local appearance matching that is robust to partial occlusion and camera motion. (2) We perform extensive experiments to compare different pairwise potentials and to analyze the robustness of the tracking formulation. (3) We consider a plain multicut problem and remove outlying clusters from its solution. This allows us to employ an efficient primal feasible optimization algorithm that is not applicable to the subgraph multicut problem of Tang et al. (2015). Unlike the branch-and-cut algorithm used there, the efficient algorithm used here is applicable to long videos and many detections. Together with the novel pairwise feature, it eliminates the need for the intermediate tracklet representation of Tang et al. (2015). We demonstrate the effectiveness of our overall approach on the MOT16 benchmark (Milan et al. 2016), achieving state-of-the-art performance.
We propose an online multi-target tracker that exploits both high- and low-confidence target detections in a Probability Hypothesis Density Particle Filter framework. High-confidence (strong) detections are used for label propagation and target initialization. Low-confidence (weak) detections only support the propagation of labels, i.e. tracking existing targets. Moreover, we perform data association just after the prediction stage, thus avoiding the need for computationally expensive labeling procedures such as clustering. Finally, we perform sampling by considering the perspective distortion in the target observations. The tracker runs on average at 12 frames per second. Results show that our method outperforms alternative online trackers on the Multiple Object Tracking 2016 and 2015 benchmark datasets in terms of tracking accuracy, false negatives and speed.
Multiple Object Tracking remains a difficult problem due to appearance variations and occlusions of the targets or detection failures. Using sophisticated appearance models or performing data association over multiple frames are two common approaches that lead to performance gains. Inspired by the success of sparse representations in Single Object Tracking, we propose to formulate the multi-frame data association step as an energy minimization problem, designing an energy that efficiently exploits sparse representations of all detections. Furthermore, we propose to use a structured sparsity-inducing norm to compute representations more suited to the tracking context. We perform extensive experiments to demonstrate the effectiveness of the proposed formulation, and evaluate our approach on two public authoritative benchmarks in order to compare it with several state-of-the-art methods.