MeNToS: Tracklets Association with a Space-Time Memory Network
Mehdi Miah, Guillaume-Alexandre Bilodeau and Nicolas Saunier
Polytechnique Montréal
{mehdi.miah, gabilodeau, nicolas.saunier}@polymtl.ca
Abstract

We propose a method for multi-object tracking and segmentation (MOTS) that requires neither fine-tuning nor per-benchmark hyperparameter selection. The proposed method specifically addresses the data association problem. Indeed, the recently introduced HOTA metric, which aligns better with human visual assessment by evenly balancing detection and association quality, has shown that improvements are still needed for data association. After creating tracklets using instance segmentation and optical flow, the proposed method relies on a space-time memory network (STM), developed for one-shot video object segmentation, to improve the association of tracklets separated by temporal gaps. To the best of our knowledge, our method, named MeNToS, is the first to use the STM network to track object masks for MOTS. We took the 4th place in the RobMOTS challenge. The project page is https://mehdimiah.com/mentos.html.
1. Introduction
Multi-object tracking (MOT) is a core problem in computer vision. Given a video, the objective is to detect all objects of interest and then to track them throughout the video with consistent identities. Common difficulties are occlusions, small objects, fast motion (or, equivalently, a low framerate) and deformations. Recently, the multi-object tracking and segmentation (MOTS) task [10] was introduced: instead of localizing objects with bounding boxes, they are described by their segmentation masks at the pixel level.
The MOTA metric has commonly been used to evaluate MOT, but it tends to give more weight to detection errors than to association errors. The newly introduced HOTA metric [5] balances these two aspects and provides further incentive to work on the association step. That is why we developed a method which relies first on instance segmentation, followed by two data association steps. The first is applied between consecutive frames, using optical flow to predict mask locations and mask Intersection over Union (mIoU) for matching. The second association step relies on a space-time memory network (STM). This is our main contribution. It is inspired by results in one-shot video object segmentation (OSVOS), a computer vision task that consists of tracking, at the pixel level, a mask provided only at the first frame. We use mask propagation with an STM network to associate tracklets separated by longer temporal gaps. Experiments show that this long-term data association significantly improves the HOTA score on the datasets used in the challenge.
2. Related works
MOTS. Similarly to MOT, where the "tracking-by-detection" paradigm is popular, MOTS is mainly solved by creating tracklets from segmentation masks, then building long-term tracks by merging the tracklets [3, 12, 13]. Usually, methods use an instance segmentation method to generate binary masks; ReMOTS [12] used two advanced instance segmentation methods and self-supervision to refine masks. As for the association step, many methods require a re-identification (reID) step. For instance, Voigtlaender et al. [10] extended Mask R-CNN with an association head to return an embedding for each detection; Yang et al. [12] associated two tracklets if they were temporally close, without any temporal overlap and with similar appearance features, based on all their observations and a hierarchical clustering; while Zhang et al. [13] used temporal attention to lower the weight of frames with occluded objects.
STM. Closely related to MOTS, OSVOS requires tracking objects whose segmentation masks are only provided at the first frame. STM [6] was proposed to solve OSVOS by storing some previous frames and masks in a memory that is later read by an attention mechanism to predict the new mask in a target image. Such a network was recently used [2] to solve video instance segmentation, a problem in which no prior knowledge is given about the objects to track. However, it is unclear how STM behaves when multiple instances of the same class appear in the video. We show in this work that it behaves well and that it can help to solve a reID problem by taking advantage of information at the pixel level and of the presence of other objects.
arXiv:2107.07067v1 [cs.CV] 15 Jul 2021
Figure 1. Illustration of our MeNToS method. Given an instance segmentation, binary masks are matched in consecutive frames to create tracklets. Very short tracklets are deleted. An appearance similarity, based on a memory network, is computed between admissible pairs of tracklets. Then, tracklets are gradually merged, starting with the pair having the highest similarity, while respecting the updated constraints. Finally, low-confidence tracks are deleted.
3. Method

As illustrated in Figure 1, our pipeline for tracking multiple objects is based on three main steps: detection of all objects of interest, a short-term association of segmentation masks in consecutive frames, and a greedy long-term association of tracklets using a memory network.
3.1. Detections

Our method follows the "tracking-by-detection" paradigm. First, we used the public raw object masks provided by the challenge. They were obtained from a Mask R-CNN X-152 and the Box2Seg network. Objects with a detection score higher than θ_d and an area bigger than θ_a are extracted. Then, to avoid, for instance, a car being simultaneously detected as a car and as a truck, segmentation masks having a mutual mIoU higher than θ_mIoU are merged to form a multi-class hypothesis.
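The filtering and merging step above can be sketched as follows. This is a minimal illustration, not the released implementation: the detection dictionary layout and the helper names are assumptions.

```python
import numpy as np

def mask_iou(m1, m2):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(m1, m2).sum()
    union = np.logical_or(m1, m2).sum()
    return inter / union if union > 0 else 0.0

def filter_and_merge(detections, theta_d=0.5, theta_a=128, theta_miou=0.5):
    """Keep confident, large-enough masks, then group near-duplicate
    masks (possibly of different classes) into one multi-class hypothesis."""
    kept = [d for d in detections
            if d["score"] >= theta_d and d["mask"].sum() >= theta_a]
    hypotheses = []
    for det in sorted(kept, key=lambda d: d["score"], reverse=True):
        for hyp in hypotheses:
            if mask_iou(det["mask"], hyp["mask"]) >= theta_miou:
                # Same region detected under another class: merge.
                hyp["classes"].append((det["class"], det["score"]))
                break
        else:
            hypotheses.append({"mask": det["mask"],
                               "classes": [(det["class"], det["score"])]})
    return hypotheses
```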
3.2. Short-term association (STA)

We associate temporally close segmentation masks between consecutive frames by computing the Farnebäck optical flow [1], chosen for its simplicity. Masks from the previous frame are warped, and a mIoU is computed between these warped masks and the masks of the next frame.
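As a sketch, the warping step can be implemented by forward-scattering mask pixels along a dense flow field. Here the flow is taken as an input array (for example, as computed by OpenCV's calcOpticalFlowFarneback on the two grayscale frames); the function name and layout are illustrative.

```python
import numpy as np

def warp_mask(mask, flow):
    """Forward-warp a boolean mask with a dense flow field.

    flow[y, x] holds the (dx, dy) displacement of pixel (x, y), as in the
    output of cv2.calcOpticalFlowFarneback.
    """
    h, w = mask.shape
    ys, xs = np.nonzero(mask)           # pixels belonging to the mask
    dx = flow[ys, xs, 0]
    dy = flow[ys, xs, 1]
    nx = np.clip(np.round(xs + dx).astype(int), 0, w - 1)
    ny = np.clip(np.round(ys + dy).astype(int), 0, h - 1)
    warped = np.zeros_like(mask)
    warped[ny, nx] = True               # scatter into the next frame
    return warped
```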
The Hungarian algorithm is used to associate masks, where the cost matrix is the negative mIoU. Matched detections with a mIoU above a threshold θ_s are connected to form a tracklet, and the remaining detections form new tracklets. The class of a tracklet is the dominant one among its detections. Then, a non-overlap algorithm is applied to avoid any spatial overlap between masks, giving priority to the pixels of the most confident mask. Finally, tracklets with only one detection are deleted, since they often correspond to false positives.
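The matching step can be sketched with SciPy's Hungarian solver. This is illustrative only; the surrounding tracklet bookkeeping is an assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_masks(prev_masks, next_masks, theta_s=0.15):
    """Associate warped previous-frame masks with next-frame masks by
    maximizing total mask IoU (Hungarian algorithm on negative mIoU)."""
    iou = np.zeros((len(prev_masks), len(next_masks)))
    for i, mp in enumerate(prev_masks):
        for j, mn in enumerate(next_masks):
            inter = np.logical_and(mp, mn).sum()
            union = np.logical_or(mp, mn).sum()
            iou[i, j] = inter / union if union else 0.0
    rows, cols = linear_sum_assignment(-iou)
    # Keep only matches with enough overlap; unmatched detections
    # start new tracklets.
    return [(i, j) for i, j in zip(rows, cols) if iou[i, j] >= theta_s]
```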
3.3. Greedy long-term association (GLTA)

GLTA and the use of a memory network for re-identification are the novelties of our approach. Once tracklets have been created, it is necessary to link them in case of fragmentation caused, for example, by occlusion. In this long-term association, we use a memory network to propagate some masks of a tracklet into the past and the future. In case of a spatial overlap with another tracklet, the two tracklets are merged. Given that this procedure is applied at the pixel level on the whole image, the similarity is only computed on a selection of admissible tracklet pairs to reduce the computational cost. At this step, note that all tracklets have a length of at least two.
3.3.1 Measure of similarity between tracklets

Our similarity measure is based on the ability to match some parts of two different tracklets (say T_A and T_B) and can be interpreted as a pixel-level visual-spatial alignment, rather than a patch-level visual alignment [12, 13]. For that, we propagate some masks of tracklet T_A to frames where tracklet T_B is present, and then compare the masks of T_B with the propagated versions of the masks of T_A (the heatmaps computed before binarization). The more they are spatially aligned, the higher the similarity. In detail, let us consider two tracklets T_A = (M^A_1, M^A_2, ..., M^A_N) and T_B = (M^B_1, M^B_2, ..., M^B_P), of length N and P respectively, such that T_A appears first, and where M^A_1 denotes the first segmentation mask of tracklet T_A. We use a pre-trained STM network [6] to store two binary masks as references (along with their corresponding frames): the closest ones (M^A_N for T_A and M^B_1 for T_B) and a second mask a little farther away (M^A_{N-n+1} for T_A and M^B_n for T_B). The farther masks are used because the first and last object masks of a tracklet are often incomplete due, for example, to occlusions. Then, the reference frames are used as queries to produce heatmaps with values between 0 and 1 (H^A_N, H^A_{N-n+1}, H^B_1, H^B_n). Finally, the average cosine similarity between these four heatmaps and the four masks (M^A_N, M^A_{N-n+1}, M^B_1, M^B_n) is the final similarity between the two tracklets. Figure 2 illustrates a one-frame version of this similarity measure between tracklets.
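Once the STM network has produced the propagated heatmaps, the final score reduces to an average of cosine similarities over (heatmap, mask) pairs at the same frames. The sketch below assumes this pairing convention; it does not include the STM network itself.

```python
import numpy as np

def cosine_sim(heatmap, mask):
    """Cosine similarity between a propagated heatmap (values in [0, 1])
    and a binary mask, both flattened over the full image."""
    h = heatmap.ravel().astype(float)
    m = mask.ravel().astype(float)
    denom = np.linalg.norm(h) * np.linalg.norm(m)
    return float(h @ m / denom) if denom > 0 else 0.0

def tracklet_similarity(heatmaps, masks):
    """Average cosine similarity over the four (heatmap, mask) pairs,
    e.g. (H^A_N, M^A_N), (H^A_{N-n+1}, M^A_{N-n+1}), (H^B_1, M^B_1),
    (H^B_n, M^B_n)."""
    return float(np.mean([cosine_sim(h, m)
                          for h, m in zip(heatmaps, masks)]))
```

A perfectly aligned propagation gives a similarity of 1, and a propagation landing entirely outside the target mask gives 0.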
3.3.2 Selection of pairs of tracklets

Instead of estimating a similarity measure between all pairs of tracklets, a naive selection is made to reduce the computational cost. The selection is based on the following heuristic: two tracklets may belong to the same object if they belong to the same class, are temporally close, spatially close, and have a small temporal overlap.

In detail, let us denote by f(M) the frame where the mask M is present, by M̄ its center, and by fps, H and W respectively the number of frames per second, the height and the width of the video. The temporal cost C_t(T_A, T_B), spatial cost C_s(T_A, T_B) and temporal overlap cost C_o(T_A, T_B) between T_A and T_B are defined respectively as:

C_t(T_A, T_B) = |f(M^A_N) - f(M^B_1)| / fps,   (1)

C_s(T_A, T_B) = (2 / (H + W)) · ||M̄^A_N - M̄^B_1||_1,   (2)

C_o(T_A, T_B) = |{f(M) | M ∈ T_A} ∩ {f(M) | M ∈ T_B}|.   (3)

A pair (T_A, T_B) is admissible if the tracklets belong to the same class, C_t(T_A, T_B) ≤ τ_t, C_s(T_A, T_B) ≤ τ_s and C_o(T_A, T_B) ≤ τ_o.
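A sketch of the admissibility test follows, assuming a hypothetical tracklet layout with a class label, sorted frame indices, and one mask centroid per frame.

```python
def admissible(ta, tb, fps, H, W, tau_t=1.5, tau_s=0.2, tau_o=1):
    """Check whether tracklets ta (ending first) and tb form an
    admissible pair. Each tracklet is a dict with 'class', 'frames'
    (sorted list) and 'centers' (one (x, y) centroid per frame)."""
    if ta["class"] != tb["class"]:
        return False
    # Eq. (1): temporal gap in seconds between the end of ta and the start of tb
    c_t = abs(ta["frames"][-1] - tb["frames"][0]) / fps
    # Eq. (2): normalized L1 distance between the closest mask centers
    (xa, ya), (xb, yb) = ta["centers"][-1], tb["centers"][0]
    c_s = 2.0 / (H + W) * (abs(xa - xb) + abs(ya - yb))
    # Eq. (3): number of frames where both tracklets are present
    c_o = len(set(ta["frames"]) & set(tb["frames"]))
    return c_t <= tau_t and c_s <= tau_s and c_o <= tau_o
```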
3.3.3 Greedy association

Similarly to Singh et al. [8], we gradually merge the admissible pairs with the highest cosine similarity, as long as it is above a threshold θ_l, while continuously updating the admissible pairs using Equation 3. A tracklet can therefore be repeatedly merged with other tracklets. Finally, tracks whose highest detection score is lower than 90% are deleted.

Figure 2. Similarity used at the long-term association step. For simplicity, only one mask and frame is used as reference and as target in the space-time memory network.
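The greedy merging can be sketched with a union-find over tracklet identities. This is a simplification: the sketch does not re-check the temporal-overlap constraint of Equation 3 after each merge, which the full method does when it updates the admissible pairs.

```python
def greedy_merge(pairs, theta_l=0.30):
    """Greedily merge tracklet pairs by decreasing similarity.

    pairs: list of (similarity, id_a, id_b) for admissible pairs.
    A union-find tracks merged identities, so a tracklet can be
    merged repeatedly. Returns a dict mapping each seen tracklet
    id to the id of its final track."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for sim, a, b in sorted(pairs, reverse=True):
        if sim < theta_l:
            break                      # remaining pairs are all below threshold
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra            # merge b's track into a's
    return {x: find(x) for x in parent}
```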
4. Experiments

4.1. Implementation details

At the detection step, θ_d is 0.5, small masks whose area is less than θ_a = 128 pixels are removed, and θ_mIoU = 0.5. For the GLTA step, the selection is done with (τ_t, τ_s, τ_o) = (1.5, 0.2, 1). To measure similarity, the second frame is picked using n = 5. If that frame is not available, n = 2 is used instead. As for the thresholds at the STA and GLTA steps, we selected θ_s = 0.15 and θ_l = 0.30. These hyperparameters were selected through cross-validation and remain fixed regardless of the dataset and object classes.
Method         BDD   DAVIS  KITTI  MOTSCha.  OVIS  TAO   Waymo  YT-VIS  |  RobMOTS
               HOTA  HOTA   HOTA   HOTA      HOTA  HOTA  HOTA   HOTA    |  HOTA  DetA  AssA
RobTrack [11]  57.9  56.9   71.6   61.0      61.6  55.0  57.2   68.3    |  61.2  59.4  64.8
SBT [9]        53.0  50.3   74.0   64.4      55.6  51.8  55.2   64.4    |  58.6  55.9  63.1
SIA [7]        53.4  47.4   70.8   62.2      54.8  49.6  54.1   62.7    |  56.9  55.8  59.8
MeNToS         52.3  49.6   69.7   60.2      55.6  39.2  53.4   64.2    |  55.5  52.4  60.8
STP [4]        49.4  48.2   66.4   60.4      52.8  43.8  51.8   62.3    |  54.4  55.8  55.0

Table 1. Results on the RobMOTS test set. The HOTA metric on each benchmark is reported, along with the overall HOTA, DetA and AssA.
4.2. Datasets and performance evaluation

The tracking algorithms are applied to the benchmarks of the RobMOTS challenge [4]. It consists of eight tracking datasets with a high diversity in terms of framerate (ranging from 1 to 30 frames per second), objects of interest, duration and number of objects. Here, we considered the 80 categories of objects from COCO.

Recently, the HOTA metric was introduced to fairly balance the quality of detections and associations. It can be decomposed into the DetA and AssA metrics, which measure the quality of these two components. The higher the HOTA, the better the tracker is aligned with human visual assessment. The final HOTA on RobMOTS is the average of the eight per-benchmark HOTA scores.
4.3. Results

The results in Table 1 indicate that our method is competitive for MOTS. MeNToS performs well on all benchmarks except TAO. This benchmark is more difficult for MeNToS, and for all the other methods, since it is composed of videos with a very low framerate (1 fps). Without this outlier, MeNToS would perform on par with SIA.

This issue comes from the second step of our method, where the optical flow struggles to correctly associate consecutive masks of the same object. As a result, the deletion of very short tracklets removes valid detections in this case, thus reducing DetA, the quality of detection (-3 percentage points on DetA compared to the baseline STP). However, the ability of MeNToS to correctly associate tracklets in the long-term data association partially compensates for this drawback (+6 percentage points on AssA compared to STP).
5. Conclusion

In this work, we have developed a memory-network-based tracker for multi-object tracking and segmentation. After creating tracklets, the STM network is used to compute a similarity score between tracklets. We can interpret this evaluation as a pixel-level visual-spatial alignment leveraging segmentation masks and the information of the whole image. Improving the creation of tracklets during the short-term data association may lead to further improvements.

Acknowledgment

We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), [DGDND-2020-04633 and DG individual 06115-2017].
References

[1] Gunnar Farnebäck. Two-Frame Motion Estimation Based on Polynomial Expansion. In Josef Bigun and Tomas Gustavsson, editors, Image Analysis, Lecture Notes in Computer Science, pages 363-370, Berlin, Heidelberg, 2003. Springer.
[2] Shubhika Garg and Vidit Goel. Mask Selection and Propagation for Unsupervised Video Object Segmentation. In WACV, 2021.
[3] J. Luiten, T. Fischer, and B. Leibe. Track to Reconstruct and Reconstruct to Track. IEEE Robotics and Automation Letters, 5(2):1803-1810, Apr. 2020.
[4] Jonathon Luiten, Arne Hoffhues, Blin Beqa, Paul Voigtlaender, István Sárándi, Patrick Dendorfer, Aljosa Osep, Achal Dave, Tarasha Khurana, Tobias Fischer, Xia Li, Yuchen Fan, Pavel Tokmakov, Song Bai, Linjie Yang, Federico Perazzi, Ning Xu, Alex Bewley, Jack Valmadre, Sergi Caelles, Jordi Pont-Tuset, Xinggang Wang, Andreas Geiger, Fisher Yu, Deva Ramanan, Laura Leal-Taixé, and Bastian Leibe. RobMOTS: A Benchmark and Simple Baselines for Robust Multi-Object Tracking and Segmentation. In CVPR RVSU Workshop, 2021.
[5] Jonathon Luiten, Aljosa Osep, Patrick Dendorfer, Philip Torr, Andreas Geiger, Laura Leal-Taixé, and Bastian Leibe. HOTA: A Higher Order Metric for Evaluating Multi-Object Tracking. International Journal of Computer Vision (IJCV), Oct. 2020.
[6] Seoung Wug Oh, Joon-Young Lee, Ning Xu, and Seon Joo Kim. Video Object Segmentation Using Space-Time Memory Networks. In ICCV, 2019.
[7] Jeongwon Ryu and Kwangjin Yoon. SIA: Simple Re-Identification Association for Robust Multi-Object Tracking and Segmentation. In CVPR RVSU Workshop, 2021.
[8] Gurinderbeer Singh, Sreeraman Rajan, and Shikharesh Majumdar. A Greedy Data Association Technique for Multiple Object Tracking. In 2017 IEEE Third International Conference on Multimedia Big Data (BigMM), pages 177-184, Apr. 2017.
[9] Jiasheng Tang, Fei Du, Weihua Chen, Hao Luo, Fan Wang, and Hao Li. SBT: A Simple Baseline with Cascade Association for Robust Multi-Objects Tracking. In CVPR RVSU Workshop, 2021.
[10] Paul Voigtlaender, Michael Krause, Aljosa Osep, Jonathon Luiten, Berin Balachandar Gnana Sekar, Andreas Geiger, and Bastian Leibe. MOTS: Multi-Object Tracking and Segmentation. In CVPR, 2019.
[11] Dongxu Wei, Jiashen Hua, Hualiang Wang, Baisheng Lai, Kejie Huang, Chang Zhou, Jianqiang Huang, and Xiansheng Hua. RobTrack: A Robust Tracker Baseline towards Real-World Robustness in Multi-Object Tracking and Segmentation. In CVPR RVSU Workshop, 2021.
[12] Fan Yang, Xin Chang, Chenyu Dang, Ziqiang Zheng, Sakriani Sakti, Satoshi Nakamura, and Yang Wu. ReMOTS: Self-Supervised Refining Multi-Object Tracking and Segmentation. In CVPR Workshops, 2020.
[13] Haotian Zhang, Yizhou Wang, Jiarui Cai, Hung-Min Hsu, Haorui Ji, and Jenq-Neng Hwang. LIFTS: Lidar and Monocular Image Fusion for Multi-Object Tracking and Segmentation. In CVPR Workshops, 2020.