Tracking by Animation:
Unsupervised Learning of Multi-Object Attentive Trackers
Zhen He 1,2,3    Jian Li 2    Daxue Liu 2    Hangen He 2    David Barber 3,4
1 Academy of Military Medical Sciences
2 National University of Defense Technology
3 University College London
4 The Alan Turing Institute
Correspondence to Zhen He (email: hezhen.cs@gmail.com).
Abstract
Online Multi-Object Tracking (MOT) from videos is a
challenging computer vision task which has been extensively
studied for decades. Most of the existing MOT algorithms
are based on the Tracking-by-Detection (TBD) paradigm
combined with popular machine learning approaches which
largely reduce the human effort to tune algorithm param-
eters. However, the commonly used supervised learning
approaches require the labeled data (e.g., bounding boxes),
which is expensive for videos. Also, the TBD framework is
usually suboptimal since it is not end-to-end, i.e., it consid-
ers the task as detection and tracking, but not jointly. To
achieve both label-free and end-to-end learning of MOT, we
propose a Tracking-by-Animation framework, where a differ-
entiable neural model first tracks objects from input frames
and then animates these objects into reconstructed frames.
Learning is then driven by the reconstruction error through
backpropagation. We further propose a Reprioritized Atten-
tive Tracking to improve the robustness of data association.
Experiments conducted on both synthetic and real video
datasets show the potential of the proposed model. Our
project page is publicly available at: https://github.com/zhen-he/tracking-by-animation
1. Introduction
We consider the problem of online 2D multi-object track-
ing from videos. Given the historical input frames, the goal
is to extract a set of 2D object bounding boxes from the
current input frame. Each bounding box should have a
one-to-one correspondence to an object and thus should not
change its identity across different frames.
MOT is a challenging task since one must deal with:
(i) unknown number of objects, which requires the tracker
to be correctly reinitialized/terminated when the object ap-
pears/disappears; (ii) frequent object occlusions, which
require the tracker to reason about the depth relationship
among objects; (iii) abrupt pose (e.g., rotation, scale, and
position), shape, and appearance changes for the same ob-
ject, or similar properties across different objects, both of
which make data association hard; (iv) background noises
(e.g., illumination changes and shadows), which can mislead
tracking.
To overcome the above issues, one can seek to use ex-
pressive features, or improve the robustness of data associa-
tion. E.g., in the predominant Tracking-by-Detection (TBD)
paradigm [1, 21, 7, 8], well-performed object detectors are
first applied to extract object features (e.g., potential bound-
ing boxes) from each input frame, then appropriate matching
algorithms are employed to associate these candidates of dif-
ferent frames, forming object trajectories. To reduce the hu-
man effort to manually tune parameters for object detectors
or matching algorithms, many machine learning approaches
are integrated into the TBD framework and have largely im-
proved the performance [68, 54, 53, 39]. However, most of
these approaches are based on supervised learning, while
manually labeling the video data is very time-consuming.
Also, the TBD framework does not consider the feature ex-
traction and data association jointly, i.e., it is not end-to-end,
thereby usually leading to suboptimal solutions.
In this paper, we propose a novel framework to achieve
both label-free and end-to-end learning for MOT tasks. In
summary, we make the following contributions:
• We propose a Tracking-by-Animation (TBA) framework, where a differentiable neural model first tracks objects from input frames and then animates these objects into reconstructed frames. Learning is then driven by the reconstruction error through backpropagation.
• We propose a Reprioritized Attentive Tracking (RAT) to mitigate overfitting and disrupted tracking, improving the robustness of data association.
• We evaluate our model on two synthetic datasets (MNIST-MOT and Sprites-MOT) and one real dataset (DukeMTMC [49]), showing its potential.
2. Tracking by Animation
Our TBA framework consists of four components: (i) a
feature extractor that extracts input features from each input
frame; (ii) a tracker array where each tracker receives input
features, updates its state, and emits outputs representing
the tracked object; (iii) a renderer (parameterless) that ren-
ders tracker outputs into a reconstructed frame; (iv) a loss
that uses the reconstruction error to drive the learning of
Components (i) and (ii), both label-free and end-to-end.
2.1. Feature Extractor
To reduce the computation complexity when associat-
ing trackers to the current observation, we first use a neu-
ral network
NNfeat
, parameterized by
θfeat
, as a feature
extractor to compress the input frame at each timestep
t{1,2, . . . , T }:
C_t = NN^{feat}(X_t; \theta^{feat})    (1)

where X_t \in [0,1]^{H×W×D} is the input frame of height H, width W, and channel size D, and C_t \in R^{M×N×S} is the extracted input feature of height M, width N, and channel size S, containing much fewer elements than X_t.
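As a rough illustration (not the exact published architecture), the PyTorch sketch below implements such a fully convolutional feature extractor; the channel widths loosely follow the Sprites-MOT column of Table 3 in the appendix, while the pooling arrangement, padding, and channels-first layout are our assumptions.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Sketch of NN^feat: compress a frame X_t into features C_t (Eq. (1)).

    Channel widths loosely follow the Sprites-MOT column of Table 3
    (S = 20, M = N = 8); the pooling/padding arrangement is an assumption.
    """
    def __init__(self, in_channels=3 + 2, s=20):  # +2 coordinate channels (Sec. 3.1)
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 5, padding=2), nn.AdaptiveMaxPool2d(64), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.AdaptiveMaxPool2d(32), nn.ReLU(),
            nn.Conv2d(64, 128, 1), nn.AdaptiveMaxPool2d(16), nn.ReLU(),
            nn.Conv2d(128, 256, 3, padding=1), nn.AdaptiveMaxPool2d(8), nn.ReLU(),
            nn.Conv2d(256, s, 1),  # output layer -> S feature channels
        )

    def forward(self, x):
        # x: (B, D + 2, 128, 128), with the 2D image coordinates appended as channels
        return self.net(x)  # C_t: (B, S, 8, 8)
```

For example, `FeatureExtractor()(torch.rand(1, 5, 128, 128))` returns a tensor of shape `(1, 20, 8, 8)`.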
2.2. Tracker Array
The tracker array comprises I neural trackers indexed by i \in \{1, 2, ..., I\} (thus I is the maximum number of tracked objects). Let h_{t,i} \in R^R be the state vector (vectors are assumed to be in row form throughout this paper) of Tracker i at time t, and H_t = \{h_{t,1}, h_{t,2}, ..., h_{t,I}\} be the set of all tracker states. Tracking is performed by iterating over two stages:
(i) State Update. The trackers first associate input features from C_t to update their states H_t, through a neural network NN^{upd} parameterized by \theta^{upd}:

H_t = NN^{upd}(H_{t-1}, C_t; \theta^{upd})    (2)

Whilst it is straightforward to set NN^{upd} as a Recurrent Neural Network (RNN) [52, 16, 11] (with all variables vectorized), we introduce a novel RAT to model NN^{upd} in order to increase the robustness of data association, which will be discussed in Sec. 3.
(ii) Output Generation. Then, each tracker generates its output from h_{t,i} via a neural network NN^{out} parameterized by \theta^{out}:

Y_{t,i} = NN^{out}(h_{t,i}; \theta^{out})    (3)

where NN^{out} is shared by all trackers, and the output Y_{t,i} = \{y^c_{t,i}, y^l_{t,i}, y^p_{t,i}, Y^s_{t,i}, Y^a_{t,i}\} is a mid-level representation of objects on 2D image planes, including:
Confidence y^c_{t,i} \in [0,1]: the probability of having captured an object, which can be thought of as a soft sign of trajectory validity (1/0 denotes valid/invalid). As time proceeds, an increase/decrease of y^c_{t,i} can be thought of as a soft initialization/termination of the trajectory.
Layer y^l_{t,i} \in \{0,1\}^K: one-hot encoding of the image layer possessed by the object. We consider each image to be composed of K object layers and a background layer, where higher-layer objects occlude lower-layer objects and the background is the 0-th (lowest) layer. E.g., when K = 4, y^l_{t,i} = [0, 0, 1, 0] denotes the 3-rd layer.
Pose y^p_{t,i} = [\hat{s}^x_{t,i}, \hat{s}^y_{t,i}, \hat{t}^x_{t,i}, \hat{t}^y_{t,i}] \in [-1,1]^4: normalized object pose for calculating the scale [s^x_{t,i}, s^y_{t,i}] = [1 + \eta^x \hat{s}^x_{t,i}, 1 + \eta^y \hat{s}^y_{t,i}] and the translation [t^x_{t,i}, t^y_{t,i}] = [(W/2) \hat{t}^x_{t,i}, (H/2) \hat{t}^y_{t,i}], where \eta^x, \eta^y > 0 are constants.
Shape Y^s_{t,i} \in \{0,1\}^{U×V×1}: binary object shape mask with height U, width V, and channel size 1.

Appearance Y^a_{t,i} \in [0,1]^{U×V×D}: object appearance with height U, width V, and channel size D.
In the output layer of NN^{out}, y^c_{t,i} and Y^a_{t,i} are generated by the sigmoid function, y^p_{t,i} is generated by the tanh function, and y^l_{t,i} and Y^s_{t,i} are sampled from the Categorical and Bernoulli distributions, respectively. As sampling is not differentiable, we use the Straight-Through Gumbel-Softmax estimator [26] to reparameterize both distributions so that backpropagation can still be applied.
The above-defined mid-level representation is not only flexible but can also be used directly for input frame reconstruction, enforcing the output variables to be disentangled (as will be shown later). Note that through our experiments, we have found that the discreteness of y^l_{t,i} and Y^s_{t,i} is also very important for this disentanglement.
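As a concrete but non-authoritative sketch of this output head, the PyTorch snippet below produces the five variables with the stated activations and uses F.gumbel_softmax with hard=True as the Straight-Through estimator; the hidden width defaults are taken from Table 3 in the appendix, while treating the Bernoulli shape mask as a two-class Gumbel-Softmax is an implementation choice of ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrackerOutput(nn.Module):
    """Sketch of NN^out: map a tracker state h_{t,i} to Y_{t,i} (Eq. (3)).

    Default sizes follow the Sprites-MOT column of Table 3 (R = 80, K = 3,
    U = V = 21, D = 3); the Bernoulli shape mask is reparameterized here as
    a two-class Gumbel-Softmax, which is an assumption of this sketch.
    """
    def __init__(self, r=80, k=3, u=21, v=21, d=3, hidden=377):
        super().__init__()
        self.k, self.u, self.v, self.d = k, u, v, d
        self.body = nn.Sequential(nn.Linear(r, hidden), nn.ReLU())
        # 1 (conf) + K (layer) + 4 (pose) + U*V (shape logits) + U*V*D (appearance)
        self.head = nn.Linear(hidden, 1 + k + 4 + u * v + u * v * d)

    def forward(self, h, tau=1.0):
        z = self.head(self.body(h))
        c, l, p, s, a = torch.split(
            z, [1, self.k, 4, self.u * self.v, self.u * self.v * self.d], dim=-1)
        y_c = torch.sigmoid(c)                                   # confidence in [0, 1]
        y_l = F.gumbel_softmax(l, tau=tau, hard=True)            # one-hot layer (ST estimator)
        y_p = torch.tanh(p)                                      # normalized pose in [-1, 1]^4
        s = torch.stack([s, torch.zeros_like(s)], dim=-1)        # Bernoulli as 2-class logits
        y_s = F.gumbel_softmax(s, tau=tau, hard=True)[..., 0]    # binary mask, ST gradients
        y_s = y_s.view(-1, self.u, self.v, 1)
        y_a = torch.sigmoid(a).view(-1, self.u, self.v, self.d)  # appearance in [0, 1]
        return y_c, y_l, y_p, y_s, y_a
```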
2.3. Renderer
To define a training objective with only the tracker outputs
but no training labels, we first use a differentiable renderer
to convert all tracker outputs into reconstructed frames, and
then minimize the reconstruction error through backpropa-
gation. Note that we make the renderer both parameterless
and deterministic so that correct tracker outputs can be en-
couraged in order to get correct reconstructions, enforcing
the feature extractor and tracker array to learn to generate
desired outputs. The rendering process contains three stages:
(i) Spatial Transformation. We first scale and shift Y^s_{t,i} and Y^a_{t,i} according to y^p_{t,i} via a Spatial Transformer Network (STN) [25]:

T^s_{t,i} = STN(Y^s_{t,i}, y^p_{t,i})    (4)

T^a_{t,i} = STN(Y^a_{t,i}, y^p_{t,i})    (5)
where T^s_{t,i} \in \{0,1\}^{H×W×1} and T^a_{t,i} \in [0,1]^{H×W×D} are the spatially transformed shape and appearance, respectively.

Figure 1: Illustration of the rendering process converting the tracker outputs into a reconstructed frame at time t, where the tracker number I = 4 and the layer number K = 2.

Figure 2: Overview of the TBA framework, where the tracker number I = 4.
(ii) Layer Compositing. Then, we synthesize K image layers, where each layer can contain several objects. The k-th layer is composited by:

L^m_{t,k} = \min(1, \sum_i y^c_{t,i} y^l_{t,i,k} T^s_{t,i})    (6)

L^f_{t,k} = \sum_i y^c_{t,i} y^l_{t,i,k} T^s_{t,i} \odot T^a_{t,i}    (7)

where L^m_{t,k} \in [0,1]^{H×W×1} is the layer foreground mask, L^f_{t,k} \in [0,I]^{H×W×D} is the layer foreground, and \odot is the element-wise multiplication which broadcasts its operands when they are of different sizes.
(iii) Frame Compositing. Finally, we iteratively reconstruct the input frame layer-by-layer, i.e., for k = 1, 2, ..., K:

\hat{X}^{(k)}_t = (1 - L^m_{t,k}) \odot \hat{X}^{(k-1)}_t + L^f_{t,k}    (8)

where \hat{X}^{(0)}_t is the extracted background, and \hat{X}^{(K)}_t is the final reconstruction. The whole rendering process is illustrated in Fig. 1, where \eta^x = \eta^y = 1.
Whilst the layer compositing can be parallelized by matrix operations, it cannot model occlusion, since pixel values in overlapping object regions are simply added; conversely, the frame compositing models occlusion well, but its iterative process cannot be parallelized, consuming more time and memory. Thus, we combine the two to reduce the computational complexity while maintaining the ability to model occlusion. Our key insight is that although the number of occluded objects can be large, the occlusion depth is usually small. Thus, occlusion can be modeled efficiently with a small layer number K (e.g., K = 3), in which case each layer is shared by several non-occluded objects.
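For reference, the layer and frame compositing of Eqs. (6)-(8) can be written with a few batched tensor operations. The sketch below assumes the shapes and appearances have already been spatially transformed to full frame size (e.g., with torch.nn.functional.affine_grid/grid_sample); the tensor layouts are illustrative choices, not the exact implementation.

```python
import torch

def composite_frame(y_c, y_l, T_s, T_a, background):
    """Render a reconstructed frame from tracker outputs, following Eqs. (6)-(8).

    Assumed shapes (batch-first, channels-last for readability):
      y_c: (B, I)             confidences
      y_l: (B, I, K)          one-hot layer assignments
      T_s: (B, I, H, W, 1)    spatially transformed shape masks
      T_a: (B, I, H, W, D)    spatially transformed appearances
      background: (B, H, W, D)  the extracted background \\hat{X}^{(0)}_t
    """
    B, I, K = y_l.shape
    w = (y_c.unsqueeze(-1) * y_l).view(B, I, K, 1, 1, 1)       # per-tracker, per-layer weights
    # Eq. (6): layer foreground masks, clipped at 1
    L_m = (w * T_s.unsqueeze(2)).sum(dim=1).clamp(max=1.0)     # (B, K, H, W, 1)
    # Eq. (7): layer foregrounds
    L_f = (w * (T_s * T_a).unsqueeze(2)).sum(dim=1)            # (B, K, H, W, D)
    # Eq. (8): composite layers over the background, lowest layer first
    x_hat = background
    for k in range(K):
        x_hat = (1.0 - L_m[:, k]) * x_hat + L_f[:, k]
    return x_hat
```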
2.4. Loss
To drive the learning of the feature extractor as well as the tracker array, we define a loss l_t for each timestep:

l_t = MSE(\hat{X}_t, X_t) + \lambda \cdot \frac{1}{I} \sum_i s^x_{t,i} s^y_{t,i}    (9)

where, on the RHS, the first term is the reconstruction Mean Squared Error, and the second term, weighted by a constant \lambda > 0, is the tightness constraint penalizing large scales [s^x_{t,i}, s^y_{t,i}] in order to make object bounding boxes more compact. An overview of our TBA framework is shown in Fig. 2.
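A minimal sketch of this per-timestep loss is shown below; averaging the tightness term over both the batch and the trackers is our assumption about the batching.

```python
import torch.nn.functional as F

def tba_loss(x_hat, x, s_x, s_y, lam=1.0):
    """Per-timestep loss of Eq. (9): reconstruction MSE plus the tightness term.

    s_x, s_y: (B, I) tensors of object scales; lam is the constant lambda
    (the appendix sets lambda = 1).
    """
    return F.mse_loss(x_hat, x) + lam * (s_x * s_y).mean()
```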
3. Reprioritized Attentive Tracking
In this section, we focus on designing the tracker state
update network
NNupd
defined in (2). Although
NNupd
can
be naturally set as a single RNN as mentioned in Sec. 2.2,
there can be two issues: (i) overfitting, since there is no
mechanism to capture the data regularity that similar pat-
terns are usually shared by different objects; (ii) disrupted
tracking, since there is no incentive to drive each tracker to
associate its relevant input features. Therefore, we propose
the RAT, which tackles Issue (i) by modeling each tracker
independently and sharing parameters for different trackers
(this also reduces the parameter number and makes learning
more scalable with the tracker number), and tackles Issue (ii)
by utilizing attention to achieve explicit data association
(Sec. 3.1). RAT also avoids conflicted tracking by employ-
ing memories to allow tracker interaction (Sec. 3.2) and
reprioritizing trackers to make data association more robust
(Sec. 3.3), and improves efficiency by adapting the compu-
tation time according to the number of objects present in
the scene (Sec. 3.4).
3.1. Using Attention
To make Tracker i explicitly associate its relevant input features from C_t to avoid disrupted tracking, we adopt content-based addressing. Firstly, the previous tracker state h_{t-1,i} is used to generate key variables k_{t,i} and \beta_{t,i}:
\{k_{t,i}, \hat{\beta}_{t,i}\} = Linear(h_{t-1,i}; \theta^{key})    (10)

\beta_{t,i} = 1 + \ln(1 + \exp(\hat{\beta}_{t,i}))    (11)

where Linear is the linear transformation parameterized by \theta^{key}, k_{t,i} \in R^S is the addressing key, and \hat{\beta}_{t,i} \in R is the activation for the key strength \beta_{t,i} \in (1, +\infty). Then, k_{t,i} is used to match each feature vector in C_t, denoted by c_{t,m,n} \in R^S where m \in \{1, 2, ..., M\} and n \in \{1, 2, ..., N\}, to get attention weights:

W_{t,i,m,n} = \frac{\exp(\beta_{t,i} K(k_{t,i}, c_{t,m,n}))}{\sum_{m',n'} \exp(\beta_{t,i} K(k_{t,i}, c_{t,m',n'}))}    (12)

where K is the cosine similarity defined as K(p, q) = p q^T / (\|p\| \|q\|), and W_{t,i,m,n} is an element of the attention weight W_{t,i} \in [0,1]^{M×N}, satisfying \sum_{m,n} W_{t,i,m,n} = 1. Next, a read operation is defined as a weighted combination of all feature vectors of C_t:

r_{t,i} = \sum_{m,n} W_{t,i,m,n} c_{t,m,n}    (13)

where r_{t,i} \in R^S is the read vector, representing the associated input feature for Tracker i. Finally, the tracker state is updated with an RNN parameterized by \theta^{rnn}, taking r_{t,i} instead of C_t as its input feature:

h_{t,i} = RNN(h_{t-1,i}, r_{t,i}; \theta^{rnn})    (14)
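A compact sketch of this read path (Eqs. (10)-(14)) for a single tracker is given below; the feature size S, state size R, and the use of a GRU cell follow the appendix, while the exact module layout is our assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveRead(nn.Module):
    """Sketch of the content-based read of Eqs. (10)-(14) for one tracker.

    S (feature size) and R (state size) default to the Sprites-MOT values;
    the module layout is an assumption of this sketch.
    """
    def __init__(self, s=20, r=80):
        super().__init__()
        self.key = nn.Linear(r, s + 1)   # emits k_{t,i} and the strength activation
        self.rnn = nn.GRUCell(s, r)      # the RNN of Eq. (14); the appendix uses a GRU

    def forward(self, h_prev, C):
        # h_prev: (B, R); C: (B, M, N, S) input features / memory
        B, M, N, S = C.shape
        k, beta_hat = torch.split(self.key(h_prev), [S, 1], dim=-1)
        beta = 1.0 + F.softplus(beta_hat)                            # key strength, Eq. (11)
        c = C.view(B, M * N, S)
        sim = F.cosine_similarity(k.unsqueeze(1).expand_as(c), c, dim=-1)
        W = torch.softmax(beta * sim, dim=1)                         # attention, Eq. (12)
        r = (W.unsqueeze(-1) * c).sum(dim=1)                         # read vector, Eq. (13)
        h = self.rnn(r, h_prev)                                      # state update, Eq. (14)
        return h, W.view(B, M, N)
```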
Whilst each tracker can now attentively access C_t, it still cannot attentively access X_t if the receptive field of each feature vector c_{t,m,n} is too large. In this case, it remains hard for the tracker to correctly associate an object from X_t. Therefore, we set the feature extractor NN^{feat} as a Fully Convolutional Network (FCN) [37, 70, 61] purely consisting of convolution layers. By designing the kernel size of each convolution/pooling layer, we can control the receptive field of c_{t,m,n} to be a local region on the image so that the tracker can also attentively access X_t. Moreover, parameter sharing in the FCN captures the spatial regularity that similar patterns are shared by objects at different image locations. As a local image region contains little information about the object translation [t^x_{t,i}, t^y_{t,i}], we add this information by appending the 2D image coordinates as two additional channels to X_t.
3.2. Input as Memory
To allow trackers to interact with each other to avoid conflicted tracking, at each timestep we take the input feature C_t as an external memory through which trackers can pass messages. Concretely, let C^{(0)}_t = C_t be the initial memory; we arrange trackers to sequentially read from and write to it, so that C^{(i)}_t records all messages written by the past i trackers. In the i-th iteration (i = 1, 2, ..., I), Tracker i first reads from C^{(i-1)}_t to update its state h_{t,i} by using (10)–(14) (where C_t is replaced by C^{(i-1)}_t). Then, an erase vector e_{t,i} \in [0,1]^S and a write vector v_{t,i} \in R^S are emitted by:

\{\hat{e}_{t,i}, v_{t,i}\} = Linear(h_{t,i}; \theta^{wrt})    (15)

e_{t,i} = sigmoid(\hat{e}_{t,i})    (16)

With the attention weight W_{t,i} produced by (12), we then define a write operation, where each feature vector in the memory is modified as:

c^{(i)}_{t,m,n} = (1 - W_{t,i,m,n} e_{t,i}) \odot c^{(i-1)}_{t,m,n} + W_{t,i,m,n} v_{t,i}    (17)
Our tracker state update network defined in (10)–(17)
is inspired by the Neural Turing Machine [
18
,
19
]. Since
trackers (controllers) interact through the external memory
by using interface variables, they do not need to encode mes-
sages of other trackers into their own working memories (i.e.,
states), making tracking more efficient.
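The corresponding erase/write step (Eqs. (15)-(17)) can be sketched as follows, reusing the attention weight W_{t,i} computed during the read; the tensor shapes are the same illustrative choices as above.

```python
import torch
import torch.nn as nn

class MemoryWrite(nn.Module):
    """Erase/write step of Eqs. (15)-(17), applied after a tracker update.

    Sketch with assumed sizes S (feature) and R (state).
    """
    def __init__(self, s=20, r=80):
        super().__init__()
        self.wrt = nn.Linear(r, 2 * s)   # emits e_hat_{t,i} and v_{t,i}

    def forward(self, h, C, W):
        # h: (B, R); C: (B, M, N, S) memory; W: (B, M, N) attention from Eq. (12)
        e_hat, v = torch.chunk(self.wrt(h), 2, dim=-1)
        e = torch.sigmoid(e_hat)                          # erase vector, Eq. (16)
        Wmn = W.unsqueeze(-1)                             # (B, M, N, 1)
        erase = 1.0 - Wmn * e.view(-1, 1, 1, e.shape[-1])
        write = Wmn * v.view(-1, 1, 1, v.shape[-1])
        return erase * C + write                          # updated memory, Eq. (17)
```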
3.3. Reprioritizing Trackers
Whilst memories are used for tracker interaction, it is hard for high-priority (small i) but low-confidence trackers to associate data correctly. E.g., when the first tracker (i = 1) is free (y^c_{t-1,1} = 0), it is very likely to associate or, say, 'steal' a tracked object from a succeeding tracker, since from the unmodified initial memory C^{(0)}_t, all objects are equally likely to be associated by a free tracker.
To avoid this situation, we first update high-confidence trackers so that features corresponding to the tracked objects can be associated and modified first. Therefore, we define the priority p_{t,i} \in \{1, 2, ..., I\} of Tracker i as its previous (at time t-1) confidence ranking (in descending order) instead of its index i, and then update Tracker i in the p_{t,i}-th iteration to make data association more robust.
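In code, this reprioritization amounts to ranking trackers by their previous confidences; a minimal sketch:

```python
import torch

def tracker_priorities(conf_prev):
    """Compute priorities p_{t,i} from previous confidences y^c_{t-1,i}.

    conf_prev: (B, I); returns ranks in {1, ..., I}, where the most confident
    tracker gets priority 1 and is therefore updated first.
    """
    order = torch.argsort(conf_prev, dim=-1, descending=True)  # rank -> tracker index
    return torch.argsort(order, dim=-1) + 1                    # tracker index -> rank
```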
3.4. Using Adaptive Computation Time
Since the object number varies with time and is usually less than the tracker number I (assuming I is set large enough), iterating over all trackers at every timestep is inefficient. To overcome this, we adapt the idea of Adaptive Computation Time (ACT) [17] to RAT. At each timestep t, we terminate the iteration at Tracker i (also disabling the write operation) once y^c_{t-1,i} < 0.5 and y^c_{t,i} < 0.5, in which case there are unlikely to be more tracked/new objects. The remaining trackers are not used to generate outputs. An illustration of the RAT is shown in Fig. 3. The algorithm of the full TBA framework is presented in Fig. 4.
Figure 3: Illustration of the RAT with the tracker number I = 4. Green/blue bold lines denote attentive read/write operations on the memory. Dashed arrows denote copy operations. At time t, the iteration is performed 3 times and terminated at Tracker 1.
1:  # Initialization
2:  for i ← 1 to I do
3:      h_{0,i} ← 0
4:      y^c_{0,i} ← 0
5:  end for
6:  # Forward pass
7:  for t ← 1 to T do
8:      # (i) Feature extractor
9:      extract C_t from X_t, see (1)
10:     # (ii) Tracker array
11:     C^{(0)}_t ← C_t
12:     use y^c_{t-1,1}, y^c_{t-1,2}, ..., y^c_{t-1,I} to calculate p_{t,1}, p_{t,2}, ..., p_{t,I}
13:     for j ← 1 to I do
14:         select the i-th tracker whose priority p_{t,i} = j
15:         use h_{t-1,i} and C^{(j-1)}_t to generate W_{t,i}, see (10)–(12)
16:         read from C^{(j-1)}_t according to W_{t,i}, and update h_{t-1,i} to h_{t,i}, see (13) and (14)
17:         use h_{t,i} to generate Y_{t,i}, see (3)
18:         if y^c_{t-1,i} < 0.5 and y^c_{t,i} < 0.5 then
19:             break
20:         end if
21:         write to C^{(j-1)}_t using h_{t,i} and W_{t,i}, obtaining C^{(j)}_t, see (15)–(17)
22:     end for
23:     # (iii) Renderer
24:     use Y_{t,1}, Y_{t,2}, ..., Y_{t,I} to render \hat{X}_t, see (4)–(8)
25:     # (iv) Loss
26:     calculate l_t, see (9)
27: end for

Figure 4: Algorithm of the TBA framework.
4. Experiments
The main purposes of our experiments are: (i) investigat-
ing the importance of each component in our model, and (ii)
testing whether our model is applicable to real videos. For
Purpose (i), we create two synthetic datasets (MNIST-MOT
and Sprites-MOT), and consider the following configura-
tions:
TBA
The full TBA model as described in Sec. 2 and Sec. 3.
TBAc
TBA with constant computation time, by not using
the ACT described in Sec. 3.4.
TBAc-noOcc
TBAc without occlusion modeling, by setting
the layer number K=1.
TBAc-noAtt
TBAc without attention, by reshaping the memory C_t into size [1, 1, MNS], in which case the attention weight degrades to a scalar (W_{t,i} = W_{t,i,1,1} = 1).
TBAc-noMem
TBAc without memories, by disabling the
write operation defined in (15)–(17).
TBAc-noRep
TBAc without the tracker reprioritization de-
scribed in Sec. 3.3.
AIR
Our implementation of the 'Attend, Infer, Repeat' (AIR) [13] for qualitative evaluation, which is a probabilistic generative model that can be used to detect objects from individual images through inference.
Note that it is hard to set up a supervised counterpart of our model for online MOT, since calculating the supervised loss with ground-truth data is per se an optimization problem which requires access to complete trajectories and thus is usually done offline [54]. For Purpose (ii), we evaluate TBA on the challenging DukeMTMC dataset [49], and compare it to the state-of-the-art methods. In this paper, we only consider videos with static backgrounds \hat{X}^{(0)}_t, and use the IMBS algorithm [6] to extract them for input reconstruction. Implementation details of our experiments are given in Appendix A.1. The MNIST-MOT experiment is reported in Appendix A.2. The appendix can be downloaded from our project page.
4.1. Sprites-MOT
In this toy task, we aim to test whether our model can
robustly handle occlusion and track the pose, shape, and
appearance of the object that can appear/disappear from the
scene, providing accurate and consistent bounding boxes.
Thus, we create a new Sprites-MOT dataset containing 2M frames, where each frame is of size 128×128×3, consisting of a black background and at most three moving sprites that can occlude each other.
Figure 5: Training curves of different configurations on
Sprites-MOT.
Figure 6: Qualitative results of different configurations on Sprites-MOT. For each configuration, we show the reconstructed frames (top) and the tracker outputs (bottom). For each frame, tracker outputs from left to right correspond to Tracker 1 to I (here I = 4), respectively. Each tracker output Y_{t,i} is visualized as y^c_{t,i} Y^s_{t,i} \odot Y^a_{t,i} \in [0,1]^{U×V×D}.
Table 1: Tracking performances of different configurations on Sprites-MOT.
Configuration   IDF1  IDP  IDR  MOTA  MOTP  FAF  MT  ML  FP  FN  IDS  Frag
TBA 99.2 99.3 99.2 99.2 79.1 0.01 985 1 60 80 30 22
TBAc 99.0 99.2 98.9 99.1 78.8 0.01 981 0 72 83 36 29
TBAc-noOcc 93.3 93.9 92.7 98.5 77.9 0 969 0 48 227 64 105
TBAc-noAtt 43.2 41.4 45.1 52.6 78.6 0.19 982 0 1,862 198 8,425 89
TBAc-noMem 0 0 0 0 0 987 0 22,096 0 0
TBAc-noRep 93.0 92.5 93.6 96.9 78.8 0.02 978 0 232 185 267 94
Each sprite is randomly scaled from a 21×21×3 image patch with a random shape (circle/triangle/rectangle/diamond) and a random color (red/green/blue/yellow/magenta/cyan), moves towards a random direction, and appears/disappears only once. To solve this task, for TBA configurations we set the tracker number I = 4 and layer number K = 3.
Training curves are shown in Fig. 5. TBAc-noMem has
the highest validation loss, indicating that it cannot well
reconstruct the input frames, while other configurations per-
form similarly and have significantly lower validation losses.
However, TBA converges the fastest, which we conjecture
benefits from the regularization effect introduced by ACT.
To check the tracking performance, we compare TBA
against other configurations on several sampled sequences,
as shown in Fig. 6. We can see that TBA consistently per-
forms well in all situations, where in Seq. 1 TBAc performs
as well as TBA. However, TBAc-noOcc fails to track objects
from occluded patterns (in Frames 4 and 5 of Seq. 2, the red
diamond is lost by Tracker 2). We conjecture the reason is
that adding values of occluded pixels into a single layer can
result in high reconstruction errors, and thereby the model
just learns to suppress tracker outputs when occlusion oc-
curs. Disrupted tracking frequently occurs on TBAc-noAtt
which does not use attention explicitly (in Seq. 3, trackers
frequently change their targets). For TBAc-noMem, all track-
ers know nothing about each other and compete for a same
object, resulting in identical tracking with low confidences.
For TBAc-noRep, free trackers incorrectly associate the ob-
jects tracked by the follow-up trackers. Since AIR does not
consider the temporal dependency of sequence data, it fails
to track objects across different timesteps.
We further quantitatively evaluate different configura-
tions using the standard CLEAR MOT metrics (Multi-Object
Tracking Accuracy (MOTA), Multi-Object Tracking Preci-
sion (MOTP), etc.) [4] that count how often the tracker makes incorrect decisions, and the recently proposed ID metrics (Identification F-measure (IDF1), Identification Precision (IDP), and Identification Recall (IDR)) [49] that measure how long the tracker correctly tracks targets. Note that we only consider tracker outputs Y_{t,i} with confidences y^c_{t,i} > 0.5 and convert the corresponding poses y^p_{t,i} into object bounding boxes for evaluation. Table 1 reports the
tracking performance. Both TBA and TBAc achieve good
performance, and TBA performs slightly better than TBAc.
For TBAc-noOcc, it has a significantly higher False Nega-
tive (FN) (227), ID Switch (IDS) (64), and Fragmentation
(Frag) (105), which is consistent with our conjecture from
the qualitative results that using a single layer can sometimes
suppress tracker outputs. TBAc-noAtt performs poorly on
most of the metrics, especially with a very high IDS of 8425
potentially caused by disrupted tracking. Note that TBAc-
noMem has no valid outputs as all tracker confidences are
below 0.5. Without tracker reprioritization, TBAc-noRep is
less robust than TBA and TBAc, with a higher False Positive
(FP) (232), FN (185), and IDS (267) that we conjecture are
mainly caused by conflicted tracking.
4.2. DukeMTMC
To test whether our model can be applied to real applications involving highly complex and time-varying data patterns, we evaluate the full TBA on the challenging DukeMTMC dataset [49]. It consists of 8 videos of resolution 1080×1920, each split into 50/10/25 minutes for training/test(hard)/test(easy). The videos are taken from 8 fixed cameras recording movements of people at various places on the Duke University campus at 60 fps. For TBA configurations, we set the tracker number I = 10 and layer number K = 3. Input frames are down-sampled to 10 fps and resized to 108×192 to ease processing. Since the hard test set contains very different people statistics from the training set, we only evaluate our model on the easy test set.
Fig. 7 shows sampled qualitative results. TBA performs
well under various situations: (i) frequent object appear-
ing/disappearing; (ii) highly-varying object numbers, e.g., a
single person (Seq. 4) or ten persons (Frame 1 in Seq. 1); (iii)
frequent object occlusions, e.g., when people walk towards
each other (Seq. 1); (iv) perspective scale changes, e.g.,
when people walk close to the camera (Seq. 3); (v) frequent
shape/appearance changes; (vi) similar shapes/appearances
for different objects (Seq. 6).
Quantitative performances are presented in Table 2. We can see that TBA gains an IDF1 of 82.4%, a MOTA of 79.6%, and a MOTP of 80.4% (the highest), being very competitive with the state-of-the-art methods. However, unlike these methods, our model is the first one free of any training labels or extracted features.
4.3. Visualizing the RAT
To get more insight into how the model works, we visualize the process of RAT on Sprites-MOT (see Fig. 8). At time t, Tracker i is updated in the p_{t,i}-th iteration, using its attention weight W_{t,i} to read from and write to the memory C^{(p_{t,i}-1)}_t, obtaining C^{(p_{t,i})}_t. We can see that the memory content (bright region) related to the associated object is attentively erased (becomes dark) by the write operation, thereby preventing the next tracker from reading it again. Note that at time (t+1), Tracker 1 is reprioritized with a priority p_{t+1,1} = 3 and thus is updated in the 3-rd iteration, at which the iteration is terminated, and the memory value is not modified by Tracker 1 in that iteration (since y^c_{t,1} < 0.5 and y^c_{t+1,1} < 0.5).
5. Related Work
Unsupervised Learning for Visual Data Understanding. There are many works focusing on extracting interpretable representations from visual data using unsupervised learning: some attempt to find low-level disentangled factors ([33, 10, 51] for images and [43, 29, 20, 12, 15] for videos), some aim to extract mid-level semantics ([35, 41, 24] for images and [28, 63, 67, 22] for videos), while the remaining seek to discover high-level semantics ([13, 71, 48, 57, 66, 14] for images and [62, 65] for videos). However, none of these works deal with MOT tasks. To the best of our knowledge, the proposed method is the first to achieve unsupervised end-to-end learning of MOT.
Data Association for Online MOT. In MOT tasks, data association can be either offline [73, 42, 34, 3, 45, 9, 40] or online [59, 2, 64], deterministic [44, 23, 69] or probabilistic [55, 5, 30, 60], greedy [7, 8, 56] or global [47, 31, 46]. Since the proposed RAT deals with online MOT and uses soft attention to greedily associate data based on tracker confidence ranking, it belongs to the probabilistic and greedy online methods. However, unlike these traditional methods, RAT is learnable, i.e., the tracker array can learn to generate matching features, evolve tracker states, and modify input features. Moreover, as RAT is not based on TBD and is end-to-end, the feature extractor can also learn to provide discriminative features to ease data association.
6. Conclusion
We introduced the TBA framework which achieves unsu-
pervised end-to-end learning of MOT tasks. We also intro-
duced the RAT to improve the robustness of data association.
We validated our model on different tasks, showing its po-
tential for real applications such as video surveillance. Our
future work is to extend the model to handle videos with
dynamic backgrounds. We hope our method could pave the
way towards more general unsupervised MOT.
Figure 7: Qualitative results of TBA on DukeMTMC. For each sequence, we show the input frames (top), reconstructed frames (middle), and the tracker outputs (bottom). For each frame, tracker outputs from left to right correspond to Tracker 1 to I (here I = 10), respectively. Each tracker output Y_{t,i} is visualized as y^c_{t,i} Y^s_{t,i} \odot Y^a_{t,i} \in [0,1]^{U×V×D}.
Table 2: Tracking performances of different methods on DukeMTMC.
Method   IDF1  IDP  IDR  MOTA  MOTP  FAF  MT  ML  FP  FN  IDS  Frag
DeepCC [50] 89.2 91.7 86.7 87.5 77.1 0.05 1,103 29 37,280 94,399 202 753
TAREIDMTMC [27] 83.8 87.6 80.4 83.3 75.5 0.06 1,051 17 44,691 131,220 383 2,428
TBA (ours)*82.4 86.1 79.0 79.6 80.4 0.09 1,026 46 64,002 151,483 875 1,481
MYTRACKER [72] 80.3 87.3 74.4 78.3 78.4 0.05 914 72 35,580 193,253 406 1,116
MTMC CDSC [58] 77.0 87.6 68.6 70.9 75.8 0.05 740 110 38,655 268,398 693 4,717
PT BIPCC [38] 71.2 84.8 61.4 59.3 78.7 0.09 666 234 68,634 361,589 290 783
BIPCC [49] 70.1 83.6 60.4 59.4 78.7 0.09 665 234 68,147 361,672 300 801
* The results are hosted at https://motchallenge.net/results/DukeMTMCT, where our TBA tracker is named as 'MOT TBA'.
Figure 8: Visualization of the RAT on Sprites-MOT. Both the memory C_t and the attention weight W_{t,i} are visualized as M×N (8×8) matrices, where for C_t the matrix denotes its channel mean (1/S) \sum_{s=1}^{S} C_{t,1:M,1:N,s}, normalized to [0,1].
References
[1]
Mykhaylo Andriluka, Stefan Roth, and Bernt Schiele. People-
tracking-by-detection and people-detection-by-tracking. In
CVPR, 2008. 1
[2]
Seung-Hwan Bae and Kuk-Jin Yoon. Robust online multi-
object tracking based on tracklet confidence and online dis-
criminative appearance learning. In CVPR, 2014. 7
[3]
Jerome Berclaz, Francois Fleuret, Engin Turetken, and Pascal
Fua. Multiple object tracking using k-shortest paths optimiza-
tion. IEEE TPAMI, 33(9):1806–1819, 2011. 7
[4]
Keni Bernardin and Rainer Stiefelhagen. Evaluating multiple
object tracking performance: the clear mot metrics. Journal
on Image and Video Processing, 2008:1, 2008. 7
[5]
Samuel S Blackman. Multiple hypothesis tracking for multi-
ple target tracking. IEEE Aerospace and Electronic Systems
Magazine, 19(1):5–18, 2004. 7
[6]
Domenico Bloisi and Luca Iocchi. Independent multimodal
background subtraction. In CompIMAGE, 2012. 5
[7]
Michael D Breitenstein, Fabian Reichlin, Bastian Leibe, Es-
ther Koller-Meier, and Luc Van Gool. Robust tracking-by-
detection using a detector confidence particle filter. In ICCV,
2009. 1,7
[8]
Michael D Breitenstein, Fabian Reichlin, Bastian Leibe, Es-
ther Koller-Meier, and Luc Van Gool. Online multiperson
tracking-by-detection from a single, uncalibrated camera.
IEEE TPAMI, 33(9):1820–1833, 2011. 1,7
[9]
Asad A Butt and Robert T Collins. Multi-target tracking by
lagrangian relaxation to min-cost network flow. In CVPR,
2013. 7
[10]
Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya
Sutskever, and Pieter Abbeel. Infogan: Interpretable rep-
resentation learning by information maximizing generative
adversarial nets. In NIPS, 2016. 7
[11]
Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014. 2, 11
[12]
Emily L Denton et al. Unsupervised learning of disentangled
representations from video. In NIPS, 2017. 7
[13]
SM Ali Eslami, Nicolas Heess, Theophane Weber, Yuval
Tassa, David Szepesvari, Geoffrey E Hinton, et al. Attend, in-
fer, repeat: Fast scene understanding with generative models.
In NIPS, 2016. 5,7
[14]
SM Ali Eslami, Danilo Jimenez Rezende, Frederic Besse,
Fabio Viola, Ari S Morcos, Marta Garnelo, Avraham Ru-
derman, Andrei A Rusu, Ivo Danihelka, Karol Gregor,
et al. Neural scene representation and rendering. Science,
360(6394):1204–1210, 2018. 7
[15]
Marco Fraccaro, Simon Kamronn, Ulrich Paquet, and Ole
Winther. A disentangled recognition and nonlinear dynamics
model for unsupervised learning. In NIPS, 2017. 7
[16]
Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000. 2
[17]
Alex Graves. Adaptive computation time for recurrent neural
networks. arXiv preprint arXiv:1603.08983, 2016. 4
[18]
Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing
machines. arXiv preprint arXiv:1410.5401, 2014. 4
[19]
Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016. 4
[20]
Klaus Greff, Sjoerd van Steenkiste, and Jürgen Schmidhuber. Neural expectation maximization. In NIPS, 2017. 7
[21]
João F Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista. Exploiting the circulant structure of tracking-by-detection with kernels. In ECCV, 2012. 1
[22]
Jun-Ting Hsieh, Bingbin Liu, De-An Huang, Li F Fei-Fei, and
Juan Carlos Niebles. Learning to decompose and disentangle
representations for video prediction. In NeurIPS, 2018. 7
[23]
Chang Huang, Bo Wu, and Ramakant Nevatia. Robust object
tracking by hierarchical association of detection responses.
In ECCV, 2008. 7
[24]
Jonathan Huang and Kevin Murphy. Efficient inference in
occlusion-aware generative models of images. In ICLR Work-
shop, 2016. 7
[25]
Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al.
Spatial transformer networks. In NIPS, 2015. 2
[26]
Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparam-
eterization with gumbel-softmax. In ICLR, 2017. 2
[27]
Na Jiang, SiChen Bai, Yue Xu, Chang Xing, Zhong Zhou, and
Wei Wu. Online inter-camera trajectory association exploit-
ing person re-identification and camera topology. In ACM
International Conference on Multimedia, 2018. 8
[28]
Nebojsa Jojic and Brendan J Frey. Learning flexible sprites
in video layers. In CVPR, 2001. 7
[29]
Maximilian Karl, Maximilian Soelch, Justin Bayer, and
Patrick van der Smagt. Deep variational bayes filters: Un-
supervised learning of state space models from raw data. In
ICLR, 2017. 7
[30]
Zia Khan, Tucker Balch, and Frank Dellaert. Mcmc-based
particle filtering for tracking a variable number of interacting
targets. IEEE TPAMI, 27(11):1805–1819, 2005. 7
[31]
Suna Kim, Suha Kwak, Jan Feyereisl, and Bohyung Han. On-
line multi-target tracking by large margin structured learning.
In ACCV, 2012. 7
[32]
Diederik Kingma and Jimmy Ba. Adam: A method for
stochastic optimization. In ICLR, 2015. 11
[33]
Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and
Josh Tenenbaum. Deep convolutional inverse graphics net-
work. In NIPS, 2015. 7
[34]
Cheng-Hao Kuo, Chang Huang, and Ramakant Nevatia.
Multi-target tracking by on-line learned discriminative ap-
pearance models. In CVPR, 2010. 7
[35]
Nicolas Le Roux, Nicolas Heess, Jamie Shotton, and John
Winn. Learning a generative model of images by factoring
appearance and shape. Neural Computation, 23(3):593–650,
2011. 7
[36]
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. 11
[37]
Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully
convolutional networks for semantic segmentation. In CVPR,
2015. 4
[38]
Andrii Maksai, Xinchao Wang, Francois Fleuret, and Pascal
Fua. Non-markovian globally consistent multi-object tracking.
In ICCV, 2017. 8
[39]
Anton Milan, Seyed Hamid Rezatofighi, Anthony R Dick,
Ian D Reid, and Konrad Schindler. Online multi-target track-
ing using recurrent neural networks. In AAAI, 2017. 1
[40]
Anton Milan, Stefan Roth, and Konrad Schindler. Continuous
energy minimization for multitarget tracking. IEEE TPAMI,
36(1):58–72, 2014. 7
[41]
Pol Moreno, Christopher KI Williams, Charlie Nash, and
Pushmeet Kohli. Overcoming occlusion with inverse graphics.
In ECCV, 2016. 7
[42]
Juan Carlos Niebles, Bohyung Han, and Li Fei-Fei. Efficient
extraction of human motion volumes by tracking. In CVPR,
2010. 7
[43]
Peter Ondrúška and Ingmar Posner. Deep tracking: Seeing beyond seeing using recurrent neural networks. In AAAI, 2016. 7
[44]
AG Amitha Perera, Chukka Srinivas, Anthony Hoogs, Glen
Brooksby, and Wensheng Hu. Multi-object tracking through
simultaneous long occlusions and split-merge conditions. In
CVPR, 2006. 7
[45]
Hamed Pirsiavash, Deva Ramanan, and Charless C Fowlkes.
Globally-optimal greedy algorithms for tracking a variable
number of objects. In CVPR, 2011. 7
[46]
Zhen Qin and Christian R Shelton. Improving multi-target
tracking via social grouping. In CVPR, 2012. 7
[47]
Vladimir Reilly, Haroon Idrees, and Mubarak Shah. Detec-
tion and tracking of large number of targets in wide area
surveillance. In ECCV, 2010. 7
[48]
Danilo Jimenez Rezende, SM Ali Eslami, Shakir Mohamed,
Peter Battaglia, Max Jaderberg, and Nicolas Heess. Unsuper-
vised learning of 3d structure from images. In NIPS, 2016.
7
[49]
Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara,
and Carlo Tomasi. Performance measures and a data set for
multi-target, multi-camera tracking. In ECCV, 2016. 1,5,7,
8
[50]
Ergys Ristani and Carlo Tomasi. Features for multi-target
multi-camera tracking and re-identification. In CVPR, 2018.
8
[51]
Jason Tyler Rolfe. Discrete variational autoencoders. In ICLR,
2017. 7
[52]
David E Rumelhart, Geoffrey E Hinton, and Ronald J
Williams. Learning representations by back-propagating er-
rors. Nature, 323(6088):533–536, 1986. 2
[53]
Amir Sadeghian, Alexandre Alahi, and Silvio Savarese. Track-
ing the untrackable: Learning to track multiple cues with
long-term dependencies. In ICCV, 2017. 1
[54]
Samuel Schulter, Paul Vernaza, Wongun Choi, and Manmo-
han Chandraker. Deep network flow for multi-object tracking.
In CVPR, 2017. 1,5
[55]
Dirk Schulz, Wolfram Burgard, Dieter Fox, and Armin B Cre-
mers. People tracking with mobile robots using sample-based
joint probabilistic data association filters. The International
Journal of Robotics Research, 22(2):99–116, 2003. 7
[56]
Guang Shu, Afshin Dehghan, Omar Oreifej, Emily Hand,
and Mubarak Shah. Part-based multiple-person tracking with
partial occlusion handling. In CVPR, 2012. 7
[57]
Russell Stewart and Stefano Ermon. Label-free supervision
of neural networks with physics and domain knowledge. In
AAAI, 2017. 7
[58]
Yonatan Tariku Tesfaye, Eyasu Zemene, Andrea Prati, Mar-
cello Pelillo, and Mubarak Shah. Multi-target tracking in
multiple non-overlapping cameras using constrained domi-
nant sets. arXiv preprint arXiv:1706.06196, 2017. 8
[59]
Ryan D Turner, Steven Bottone, and Bhargav Avasarala. A
complete variational tracker. In NIPS, 2014. 7
[60]
B-N Vo and W-K Ma. The gaussian mixture probability
hypothesis density filter. IEEE Transactions on Signal Pro-
cessing, 54(11):4091–4104, 2006. 7
[61]
Lijun Wang, Wanli Ouyang, Xiaogang Wang, and Huchuan
Lu. Visual tracking with fully convolutional networks. In
ICCV, 2015. 4
[62]
Nicholas Watters, Daniel Zoran, Theophane Weber, Peter
Battaglia, Razvan Pascanu, and Andrea Tacchetti. Visual in-
teraction networks: Learning a physics simulator from video.
In NIPS, 2017. 7
[63]
John Winn and Andrew Blake. Generative affine localisation
and tracking. In NIPS, 2005. 7
[64]
Bo Wu and Ram Nevatia. Detection and tracking of multi-
ple, partially occluded humans by bayesian combination of
edgelet based part detectors. IJCV, 75(2):247–266, 2007. 7
[65]
Jiajun Wu, Erika Lu, Pushmeet Kohli, Bill Freeman, and Josh
Tenenbaum. Learning to see physics via visual de-animation.
In NIPS, 2017. 7
[66]
Jiajun Wu, Joshua B Tenenbaum, and Pushmeet Kohli. Neural
scene de-rendering. In CVPR, 2017. 7
[67]
Jonas Wulff and Michael Julian Black. Modeling blurred
video with layers. In ECCV, 2014. 7
[68]
Yu Xiang, Alexandre Alahi, and Silvio Savarese. Learning to
track: Online multi-object tracking by decision making. In
ICCV, 2015. 1
[69]
Junliang Xing, Haizhou Ai, and Shihong Lao. Multi-object
tracking through occlusions by local tracklets filtering and
global tracklets association with detection responses. In
CVPR, 2009. 7
[70]
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron
Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua
Bengio. Show, attend and tell: Neural image caption genera-
tion with visual attention. In ICML, 2015. 4
[71]
Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and
Honglak Lee. Perspective transformer nets: Learning single-
view 3d object reconstruction without 3d supervision. In
NIPS, 2016. 7
[72]
Kwangjin Yoon, Young-min Song, and Moongu Jeon. Mul-
tiple hypothesis tracking algorithm for multi-target multi-
camera tracking with disjoint views. IET Image Processing,
2018. 8
[73]
Li Zhang, Yuan Li, and Ramakant Nevatia. Global data
association for multi-object tracking using network flows. In
CVPR, 2008. 7
A. Supplementary Materials for Experiments
A.1. Implementation Details
Model Configuration
There are some common model
configurations for all tasks. For the
NNfeat
defined in (1),
we set it as a FCN, where each convolution layer is com-
posed via convolution, adaptive max-pooling, and ReLU
and the convolution stride is set to 1 for all layers. For the
RNN
defined in (14), we set it as a Gated Recurrent Unit
(GRU) [
11
] to capture long-range temporal dependencies.
For the
NNout
defined in (3), we set it as a Fully-Connected
network (FC), where the ReLU is chosen as the activation
function for each hidden layer. For the loss defined in (9),
we set
λ= 1
. For the model configurations specified to
each task, please see in Table 3. Note that to use attention,
the receptive field of
ct,m,n
is crafted as a local region on
Xt
, i.e., 40
×
40 for MNIST-MOT and Sprites-MOT, and
44
×
24 for DukeMTMC (this can be calculated using the
FCN hyper-parameters in Table 3).
Training Configuration
For MNIST-MOT and Sprites-MOT, we split the data into a proportion of 90/5/5 for training/validation/test; for DukeMTMC, we split the provided training data into a proportion of 95/5 for training/validation. For all tasks, in each iteration we feed the model with a mini-batch of 64 subsequences of length 20. During the forward pass, the tracker states and confidences at the last time step are preserved to initialize the next iteration. To train the model, we minimize the averaged loss on the training set w.r.t. all model parameters \Theta = \{\theta^{feat}, \theta^{upd}, \theta^{out}\} using Adam [32] with a learning rate of 5×10^{-4}. Early stopping is used to terminate training.
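A minimal training-loop sketch consistent with this configuration is shown below; the model and data-loader interfaces are hypothetical stand-ins, and detaching the carried-over states (i.e., truncating backpropagation across iterations) is our assumption rather than a detail stated in the paper.

```python
import torch

def train(model, loader, lr=5e-4):
    """Training-loop sketch: Adam with lr = 5e-4, mini-batches of 64
    subsequences of length 20, and tracker states/confidences carried over
    between iterations.

    Assumptions: `model` maps (frames, states) -> (mean loss over the
    subsequence, new states as a tuple of tensors); `loader` yields
    (B=64, T=20, ...) frame tensors.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    states = None  # (h, y^c) from the last step of the previous iteration
    for frames in loader:
        loss, states = model(frames, states)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # truncate backpropagation across iterations (our assumption)
        states = tuple(s.detach() for s in states)
    return model
```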
A.2. MNIST-MOT
As a pilot experiment, we focus on testing whether our model can robustly track the position and appearance of each object that can appear/disappear from the scene. Thus, we create a new MNIST-MOT dataset containing 2M frames, where each frame is of size 128×128×1, consisting of a black background and at most three moving digits. Each digit is a 28×28×1 image patch randomly drawn from the MNIST dataset [36], moves towards a random direction, and appears/disappears only once. When digits overlap, pixel values are added and clamped to [0,1]. To solve this task, for TBA configurations we set the tracker number I = 4 and layer number K = 1, and fix the scale s^x_{t,i} = s^y_{t,i} = 1 and shape Y^s_{t,i} = 1, thereby compositing only a single layer by adding up all transformed appearances. We also clamp the pixel values of the reconstructed frames to [0,1] for all configurations.
Training curves are shown in Fig. 9. TBA, TBAc, and TBAc-noRep have similar validation losses, which are slightly better than that of TBAc-noAtt. Similar to the results on Sprites-MOT, TBA converges the fastest, and TBAc-noMem has a significantly higher validation loss as all trackers are likely to focus on the same object, which affects the reconstruction.

Figure 9: Training curves of different configurations on MNIST-MOT.
Qualitative results are shown in Fig. 10. Similar phenomena are observed as on Sprites-MOT, revealing the importance of the disabled mechanisms. Specifically, as temporal dependency is not considered in AIR, it fails to disambiguate overlapping objects (Seq. 5). We further quantitatively evaluate the different configurations. The results, reported in Table 4, are similar to those on Sprites-MOT.
Table 3: Model configurations specific to each task, where 'conv h×w' denotes a convolution layer with kernel size h×w, 'fc' denotes a fully-connected layer, and 'out' denotes an output layer. Note that for NN^{feat}, the first layer has two additional channels than X_t, which are the 2D image coordinates (as mentioned in Sec. 3.1).

Hyper-parameter                               | MNIST-MOT                | Sprites-MOT              | DukeMTMC
Size of X_t: [H, W, D]                        | [128, 128, 1]            | [128, 128, 3]            | [108, 192, 3]
Size of C_t: [M, N, S]                        | [8, 8, 50]               | [8, 8, 20]               | [9, 16, 200]
Size of Y^a_{t,i}: [U, V, D]                  | [28, 28, 1]              | [21, 21, 3]              | [9, 23, 3]
Size of h_{t,i}: R                            | 200                      | 80                       | 800
Tracker number: I                             | 4                        | 4                        | 10
Layer number: K                               | 1                        | 3                        | 3
Coef. of [\hat{s}^x_{t,i}, \hat{s}^y_{t,i}]: [\eta^x, \eta^y] | [0, 0]   | [0.2, 0.2]               | [0.4, 0.4]
Layer sizes of NN^{feat} (FCN)                | [128, 128, 3] (conv 5×5) | [128, 128, 5] (conv 5×5) | [108, 192, 5] (conv 5×5)
                                              | [64, 64, 32] (conv 3×3)  | [64, 64, 32] (conv 3×3)  | [108, 192, 32] (conv 5×3)
                                              | [32, 32, 64] (conv 1×1)  | [32, 32, 64] (conv 1×1)  | [36, 64, 128] (conv 5×3)
                                              | [16, 16, 128] (conv 3×3) | [16, 16, 128] (conv 3×3) | [18, 32, 256] (conv 3×1)
                                              | [8, 8, 256] (conv 1×1)   | [8, 8, 256] (conv 1×1)   | [9, 16, 512] (conv 1×1)
                                              | [8, 8, 50] (out)         | [8, 8, 20] (out)         | [9, 16, 200] (out)
Layer sizes of NN^{out} (FC)                  | 200 (fc)                 | 80 (fc)                  | 800 (fc)
                                              | 397 (fc)                 | 377 (fc)                 | 818 (fc)
                                              | 787 (out)                | 1772 (out)               | 836 (out)
Number of parameters                          | 1.21 M                   | 1.02 M                   | 5.65 M
Table 4: Tracking performances of different configurations on MNIST-MOT.
Configuration   IDF1  IDP  IDR  MOTA  MOTP  FAF  MT  ML  FP  FN  IDS  Frag
TBA 99.6 99.6 99.6 99.5 78.4 0 978 0 49 49 22 7
TBAc 99.2 99.3 99.2 99.4 78.1 0.01 977 0 54 52 26 11
TBAc-noAtt 45.2 43.9 46.6 59.8 81.8 0.20 976 0 1,951 219 6,762 86
TBAc-noMem 0 0 0 0 0 983 0 22,219 0 0
TBAc-noRep 94.3 92.9 95.7 98.7 77.8 0.01 980 0 126 55 103 10
Figure 10: Qualitative results of different configurations on MNIST-MOT. For each configuration, we show the reconstructed frames (top) and the tracker outputs (bottom). For each frame, tracker outputs from left to right correspond to Tracker 1 to I (here I = 4), respectively. Each tracker output Y_{t,i} is visualized as y^c_{t,i} Y^s_{t,i} \odot Y^a_{t,i} \in [0,1]^{U×V×D}.
... The goal of Multiple Object Tracking and Segmentation (MOTS) algorithms is to establish temporally consistent associations among segmentation masks of multiple objects observed at different frames of a video sequence. To accomplish that goal, most state-of-the-art MOTS methods [30,39] employ supervised learning approaches to generate discriminative embeddings and then apply feature association algorithms based on sophisticated target behavior models [14,25,26]. This paper proposes a novel perspective on the problem of temporal association of segmentation masks based on spatio-temporal clustering strategies. ...
... We evaluate our algorithm on two synthetic and two real-world datasets. The MNIST-MOT and Sprites-MOT [14] synthetic datasets allow us to simulate challenging MOTS scenarios involving pose, scale, and shape variations. In addition, their bounding box and segmentation masks are readily available. ...
... We generate synthetic MNIST-MOT and Sprites-MOT sequences using the procedure described in [14], which includes most of the common challenges observed in MOT problems. For the MNIST-MOT dataset, we generate 9 digit classes and for Sprites-MOT we generate 4 geometric shapes. ...
Conference Paper
Full-text available
Assigning consistent temporal identifiers to multiple moving objects in a video sequence is a challenging problem. A solution to that problem would have immediate ramifications in multiple object tracking and segmentation problems. We propose a strategy that treats the temporal identification task as a spatio-temporal clustering problem. We propose an unsupervised learning approach using a convolutional and fully connected autoencoder, which we call deep heterogeneous autoencoder, to learn discriminative features from segmentation masks and detection bounding boxes. We extract masks and their corresponding bounding boxes from a pretrained instance segmentation network and train the autoencoders jointly using task-dependent uncertainty weights to generate common latent features. We then construct constraints graphs that encourage associations among objects that satisfy a set of known temporal conditions. The feature vectors and the constraints graphs are then provided to the kmeans clustering algorithm to separate the corresponding data points in the latent space. We evaluate the performance of our method using challenging synthetic and real-world multiple-object video datasets. Our results show that our technique outperforms several state-of-the-art methods. Code and models are available at https://bitbucket. org/Siddiquemu/usc_mots.
... Similarly, adding one dimension to the problem, many approaches have tackled unsupervised video decomposition [16,7,6,12,17,11,13]. An added issue in this case is finding the correspondences of the decomposed objects across time (i.e. ...
... However, in this case the supervision is done by means of contrastive learning and a bipartite loss. Tracking by Animation [11] shows that decomposition, disentanglement and deterministic generation of objects is a good self-supervision signal for learning tracking. We will use their ideas on attention-based tracking for our method. ...
... Our method uses concepts of soft-attention for featurebased tracking. We build on top of [11] for our tracking mechanism, and modify the architecture to allow stochasticity and forecasting. Following, we describe the main modules that form our model: ...
Preprint
Full-text available
Human interpretation of the world encompasses the use of symbols to categorize sensory inputs and compose them in a hierarchical manner. One of the long-term objectives of Computer Vision and Artificial Intelligence is to endow machines with the capacity of structuring and interpreting the world as we do. Towards this goal, recent methods have successfully been able to decompose and disentangle video sequences into their composing objects and dynamics, in a self-supervised fashion. However, there has been a scarce effort in giving interpretation to the dynamics of the scene. We propose a method to decompose a video into moving objects and their attributes, and model each object's dynamics with linear system identification tools, by means of a Koopman embedding. This allows interpretation, manipulation and extrapolation of the dynamics of the different objects by employing the Koopman operator K. We test our method in various synthetic datasets and successfully forecast challenging trajectories while interpreting them.
... Some works [198], [251] follow self-and unsupervised learning framework in MOT. This kind of work takes consideration of prior consistency and constraints in embedding learning. ...
... Specifically, [198] proposes a self-supervised tracker by using cross-input consistency, in which two distinct inputs are constructed for the same sequence of video by hiding different information about the sequence in each input. [251] proposes a tracking-by-animation (TBA) framework, where a differentiable neural model tracks objects from input frames and then animates these objects into reconstructed frames, which achieves both label-free and end-to-end learning of MOT. [252] follows the TBA framework for noisy environments. ...
Preprint
Full-text available
Multi-object tracking (MOT) aims to associate target objects across video frames in order to obtain entire moving trajectories. With the advancement of deep neural networks and the increasing demand for intelligent video analysis, MOT has gained significantly increased interest in the computer vision community. Embedding methods play an essential role in object location estimation and temporal identity association in MOT. Unlike other computer vision tasks, such as image classification, object detection, re-identification, and segmentation, embedding methods in MOT have large variations, and they have never been systematically analyzed and summarized. In this survey, we first conduct a comprehensive overview with in-depth analysis for embedding methods in MOT from seven different perspectives, including patch-level embedding, single-frame embedding, cross-frame joint embedding, correlation embedding, sequential embedding, tracklet embedding, and cross-track relational embedding. We further summarize the existing widely used MOT datasets and analyze the advantages of existing state-of-the-art methods according to their embedding strategies. Finally, some critical yet under-investigated areas and future research directions are discussed.
... Analogously, autonomous systems' reasoning and scene understanding capabilities could benefit from decomposing scenes into objects and modeling each of these independently. This approach has been proven beneficial to perform a wide variety of computer vision tasks without explicit supervision, including unsupervised object detection (Eslami et al., 2016), future frame prediction (Weis et al., 2021;Greff et al., 2019), and object tracking (He et al., 2019;Veerapaneni et al., 2020). ...
... These methods can be further di-vided into two different groups depending on the class of latent representations used. On the one hand, some methods (Eslami et al., 2016;Kosiorek et al., 2018;Stanic et al., 2021;He et al., 2019) explicitly encode the input into factored latent variables, which represent specific properties such as appearance, position, and presence. On the other hand, other models Weis et al., 2021;Locatello et al., 2020) decompose the image into unconstrained per-object latent representations. ...
Conference Paper
Full-text available
The ability to decompose scenes into their object components is a desired property for autonomous agents, allowing them to reason and act in their surroundings. Recently, different methods have been proposed to learn object-centric representations from data in an unsupervised manner. These methods often rely on latent representations learned by deep neural networks, hence requiring high computational costs and large amounts of curated data. Such models are also difficult to interpret. To address these challenges, we propose the Phase-Correlation Decomposition Network (PCDNet), a novel model that decomposes a scene into its object components , which are represented as transformed versions of a set of learned object prototypes. The core building block in PCDNet is the Phase-Correlation Cell (PC Cell), which exploits the frequency-domain representation of the images in order to estimate the transformation between an object prototype and its transformed version in the image. In our experiments, we show how PCDNet outperforms state-of-the-art methods for unsuper-vised object discovery and segmentation on simple benchmark datasets and on more challenging data, while using a small number of learnable parameters and being fully interpretable. Code and models to reproduce our experiments can be found in https://github.com/AIS-Bonn/Unsupervised-Decomposition-PCDNet.
... The combination of a deep appearance model (automatic speaker recognition, SED) with a deep dynamical model in a sound source tracking system is a largely open problem and certainly a key ingredient for future developments in robust multi-source acoustic scene analysis in adverse acoustic environments and complex scenarios. Given the problem of annotated data scarcity in SSL, DL-based sound source localization and tracking may take inspiration from the unsupervised deep approaches to the MOT problem recently proposed by several researchers (e.g., Crawford and Pineau, 2020; He et al., 2019b; Karthik et al., 2020; Lin et al., 2022; Luiten et al., 2020). ...
Article
This article is a survey of deep learning methods for single and multiple sound source localization, with a focus on sound source localization in indoor environments, where reverberation and diffuse noise are present. We provide an extensive topography of the neural network-based sound source localization literature in this context, organized according to the neural network architecture, the type of input features, the output strategy (classification or regression), the types of data used for model training and evaluation, and the model training strategy. Tables summarizing the literature survey are provided at the end of the paper, allowing a quick search of methods with a given set of target characteristics.
... A large body of work has approached fully-unsupervised video segmentation and tracking by combining recurrent slot-based encoders with a reconstruction objective. In this class, (Kosiorek et al., 2018; Stanić & Schmidhuber, 2019; Crawford & Pineau, 2019a; Lin et al., 2020a; Wu et al., 2021; Singh et al., 2021; He et al., 2019) use bounding boxes for tracking. A parallel line of research learns to localize objects via segmentation masks (Greff et al., 2017; Van Steenkiste et al., 2018; Veerapaneni et al., 2019; Watters et al., 2019; Weis et al., 2020; Du et al., 2020; Kipf et al., 2021; Kabra et al., 2021; Zoran et al., 2021; Besbinar & Frossard, 2021; Creswell et al., 2020, 2021). ...
Preprint
Unsupervised object-centric learning aims to represent the modular, compositional, and causal structure of a scene as a set of object representations and thereby promises to resolve many critical limitations of traditional single-vector representations such as poor systematic generalization. Although there have been many remarkable advances in recent years, one of the most critical problems in this direction has been that previous methods work only with simple and synthetic scenes but not with complex and naturalistic images or videos. In this paper, we propose STEVE, an unsupervised model for object-centric learning in videos. Our proposed model makes a significant advancement by demonstrating its effectiveness on various complex and naturalistic videos unprecedented in this line of research. Interestingly, this is achieved by neither adding complexity to the model architecture nor introducing a new objective or weak supervision. Rather, it is achieved by a surprisingly simple architecture that uses a transformer-based image decoder conditioned on slots, and the learning objective is simply to reconstruct the observation. Our experimental results on various complex and naturalistic videos show significant improvements compared to the previous state of the art.
... Another branch of unsupervised models is based on geometric rendering and visual reconstruction. The authors of [35] proposed a tracking-by-animation framework, which first tracks objects from video frames and then renders tracker outputs into reconstructed frames. A VAE-based spatiallyinvariant label-free object tracking model was proposed in [36]. ...
Preprint
Full-text available
In this paper, we present an unsupervised probabilistic model and associated estimation algorithm for multi-object tracking (MOT) based on a dynamical variational autoencoder (DVAE), called DVAE-UMOT. The DVAE is a latent-variable deep generative model that can be seen as an extension of the variational autoencoder for the modeling of temporal sequences. It is included in DVAE-UMOT to model the objects' dynamics, after being pre-trained on an unlabeled synthetic dataset of single-object trajectories. Then the distributions and parameters of DVAE-UMOT are estimated on each multi-object sequence to track using the principles of variational inference: definition of an approximate posterior distribution of the latent variables and maximization of the corresponding evidence lower bound of the data likelihood function. DVAE-UMOT is shown experimentally to compete well with and even surpass the performance of two state-of-the-art probabilistic MOT models. Code and data are publicly available.
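The variational-inference recipe mentioned here (an approximate posterior plus maximization of the evidence lower bound) is the standard one. As a heavily simplified, generic sketch of that objective with a diagonal Gaussian posterior and a standard normal prior, not the DVAE-UMOT objective itself:

import torch

def gaussian_elbo(x, x_recon, mu, logvar):
    # Reconstruction log-likelihood under a unit-variance Gaussian (up to an additive constant).
    recon_loglik = -0.5 * ((x - x_recon) ** 2).sum(dim=-1)
    # Closed-form KL divergence between q(z|x) = N(mu, diag(exp(logvar))) and the prior N(0, I).
    kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1.0).sum(dim=-1)
    # The ELBO is maximized during training; its negative would serve as the loss.
    return (recon_loglik - kl).mean()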
... Dynamic Scenes: Inspired by the methods proposed for learning from single-viewpoint static scenes, several methods, such as Relational N-EM (van Steenkiste et al. 2018), SQAIR (Kosiorek et al. 2018), R-SQAIR (Stanic and Schmidhuber 2019), TBA (He et al. 2019), SILOT (Crawford and Pineau 2020), SCALOR, OP3 (Veerapaneni et al. 2020), and PROVIDE (Zablotskaia et al. 2021), have been proposed for learning from video sequences. The difficulties of this problem setting include modeling object motions and relationships, as well as maintaining the identities of objects even if objects disappear and reappear after full occlusion (Weis et al. 2021). ...
Preprint
Visual scenes are extremely rich in diversity, not only because there are infinite combinations of objects and background, but also because the observations of the same scene may vary greatly with the change of viewpoints. When observing a visual scene that contains multiple objects from multiple viewpoints, humans are able to perceive the scene in a compositional way from each viewpoint, while achieving the so-called "object constancy" across different viewpoints, even though the exact viewpoints are not given. This ability is essential for humans to identify the same object while moving and to learn from vision efficiently. It is intriguing to design models that have a similar ability. In this paper, we consider a novel problem of learning compositional scene representations from multiple unspecified viewpoints without using any supervision, and propose a deep generative model which separates latent representations into a viewpoint-independent part and a viewpoint-dependent part to solve this problem. To infer latent representations, the information contained in different viewpoints is iteratively integrated by neural networks. Experiments on several specifically designed synthetic datasets have shown that the proposed method is able to effectively learn from multiple unspecified viewpoints.
Article
Full-text available
Multi-Target Multi-Camera Tracking (MTMCT) tracks many people through video taken from several cameras. Person Re-Identification (Re-ID) retrieves from a gallery images of people similar to a person query image. We learn good features for both MTMCT and Re-ID with a convolutional neural network. Our contributions include an adaptive weighted triplet loss for training and a new technique for hard-identity mining. Our method outperforms the state of the art both on the DukeMTMC benchmarks for tracking, and on the Market-1501 and DukeMTMC-ReID benchmarks for Re-ID. We examine the correlation between good Re-ID and good MTMCT scores, and perform ablation studies to elucidate the contributions of the main components of our system. Code is available.
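For context, the plain margin-based triplet loss that such Re-ID embeddings build on fits in a few lines; the adaptive weighting and hard-identity mining described in the paper are refinements on top of this baseline and are not reproduced here.

import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.3):
    # anchor and positive share an identity; negative comes from a different identity.
    d_pos = F.pairwise_distance(anchor, positive)   # pull same-identity pairs together
    d_neg = F.pairwise_distance(anchor, negative)   # push different-identity pairs apart
    return F.relu(d_pos - d_neg + margin).mean()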
Article
Full-text available
In this study, a multiple hypothesis tracking (MHT) algorithm for multi-target multi-camera tracking (MCT) with disjoint views is proposed. The authors' method forms track-hypothesis trees, and each branch represents a multi-camera track of a target that may move within a camera as well as across cameras. Furthermore, multi-target tracking within a camera is performed simultaneously with the tree formation by manipulating the status of each track hypothesis. Each status represents one of three stages of a multi-camera track: tracking, searching, and end-of-track. The tracking status means the target is tracked by a single-camera tracker. In the searching status, disappeared targets are examined to determine whether they reappear in other cameras. The end-of-track status means the target has exited the camera network due to its lengthy invisibility. These three statuses assist MHT in forming the track-hypothesis trees for multi-camera tracking. Furthermore, a gating technique that eliminates unlikely observation-to-track associations using space-time information has been introduced. In the experiments, the proposed method has been tested on two datasets, DukeMTMC and NLPR_MCT, and it outperforms the state-of-the-art method in terms of accuracy. In addition, the real-time and online performance of the proposed method is also demonstrated in this study.
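The three-way status bookkeeping described in this abstract can be read as a small state machine per track hypothesis. The sketch below is our own illustration; the names and the invisibility threshold are assumptions rather than the authors' implementation.

from enum import Enum, auto

class TrackStatus(Enum):
    TRACKING = auto()      # target is currently followed by a single-camera tracker
    SEARCHING = auto()     # target disappeared; check whether it reappears in other cameras
    END_OF_TRACK = auto()  # target assumed to have left the camera network

MAX_INVISIBLE_FRAMES = 300   # assumed threshold, not taken from the paper

def update_status(visible, invisible_frames):
    # Toy transition rule for one track hypothesis at each time step.
    if visible:
        return TrackStatus.TRACKING
    if invisible_frames > MAX_INVISIBLE_FRAMES:
        return TrackStatus.END_OF_TRACK
    return TrackStatus.SEARCHING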
Article
We introduce a paradigm for understanding physical scenes without human annotations. At the core of our system is a physical world representation that is first recovered by a perception module and then utilized by physics and graphics engines. During training, the perception module and the generative models learn by visual de-animation - interpreting and reconstructing the visual information stream. During testing, the system first recovers the physical world state, and then uses the generative models for reasoning and future prediction. Even more so than forward simulation, inverting a physics or graphics engine is a computationally hard problem; we overcome this challenge by using a convolutional inversion network. Our system quickly recognizes the physical world state from appearance and motion cues, and has the flexibility to incorporate both differentiable and non-differentiable physics and graphics engines. We evaluate our system on both synthetic and real datasets involving multiple physical scenes, and demonstrate that our system performs well on both physical state estimation and reasoning problems. We further show that the knowledge learned on the synthetic dataset generalizes to constrained real images.
Chapter
Scene understanding tasks such as the prediction of object pose, shape, appearance and illumination are hampered by the occlusions often found in images. We propose a vision-as-inverse-graphics approach to handle these occlusions by making use of a graphics renderer in combination with a robust generative model (GM). Since searching over scene factors to obtain the best match for an image is very inefficient, we make use of a recognition model (RM) trained on synthetic data to initialize the search. This paper addresses two issues: (i) We study how the inferences are affected by the degree of occlusion of the foreground object, and show that a robust GM which includes an outlier model to account for occlusions works significantly better than a non-robust model. (ii) We characterize the performance of the RM and the gains that can be made by refining the search using the GM, using a new dataset that includes background clutter and occlusions. We find that pose and shape are predicted very well by the RM, but appearance and especially illumination less so. However, accuracy on these latter two factors can be clearly improved with the generative model.
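A common way to make a per-pixel generative model robust to occluders, and presumably close in spirit to the outlier model mentioned above, is to mix the rendered-pixel likelihood with a broad outlier distribution. In the sketch below, the mixture weight, noise level, and uniform intensity range are our assumptions, not values from the chapter.

import numpy as np

def robust_pixel_loglik(observed, rendered, sigma=0.05, eps=0.1):
    # Inlier: Gaussian around the rendered intensity; outlier: Uniform(0, 1) over intensities.
    inlier = np.exp(-0.5 * ((observed - rendered) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    outlier = 1.0                      # density of a uniform distribution on [0, 1]
    return np.log((1.0 - eps) * inlier + eps * outlier)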
Conference Paper
Online inter-camera trajectory association is a promising topic in intelligent video surveillance, which concentrates on associating trajectories belonging to the same individual across different cameras over time. It remains challenging due to the inconsistent appearance of a person in different cameras and the lack of spatio-temporal constraints between cameras. Besides, orientation variations and partial occlusions significantly increase the difficulty of inter-camera trajectory association. To address these problems, this work proposes an orientation-driven person re-identification (ODPR) method and an effective camera topology estimation based on appearance features for online inter-camera trajectory association. ODPR explicitly leverages orientation cues and stable torso features to learn discriminative feature representations for identifying trajectories across cameras, alleviating pedestrian orientation variations through the designed orientation-driven loss function and orientation-aware weights. The camera topology estimation introduces appearance features to generate correct spatio-temporal constraints that narrow the retrieval range, which improves time efficiency and makes intelligent inter-camera trajectory association feasible in large-scale surveillance environments. Extensive experimental results demonstrate that our proposed approach significantly outperforms most state-of-the-art methods on the popular person re-identification datasets and the public multi-target, multi-camera tracking benchmark.
Article
Scene representation—the process of converting visual sensory data into concise descriptions—is a requirement for intelligent behavior. Recent work has shown that neural networks excel at this task when provided with large, labeled datasets. However, removing the reliance on human labeling remains an important open problem. To this end, we introduce the Generative Query Network (GQN), a framework within which machines learn to represent scenes using only their own sensors. The GQN takes as input images of a scene taken from different viewpoints, constructs an internal representation, and uses this representation to predict the appearance of that scene from previously unobserved viewpoints. The GQN demonstrates representation learning without human labels or domain knowledge, paving the way toward machines that autonomously learn to understand the world around them.