MV-YOLO: Motion Vector-aided Tracking by
Semantic Object Detection
Saeed Ranjbar Alvar
School of Engineering Science
Simon Fraser University
Burnaby, BC, Canada
Email: saeedr@sfu.ca
Ivan V. Bajić
School of Engineering Science
Simon Fraser University
Burnaby, BC, Canada
Email: ibajic@ensc.sfu.ca
Abstract—Object tracking is the cornerstone of many visual
analytics systems. While considerable progress has been made in
this area in recent years, robust, efficient, and accurate tracking
in real-world video remains a challenge. In this paper, we present
a hybrid tracker that leverages motion information from the
compressed video stream and a general-purpose semantic object
detector acting on decoded frames to construct a fast and efficient
tracking engine. The proposed approach is compared with several
well-known recent trackers on the OTB tracking dataset. The
results indicate advantages of the proposed method in terms of
speed and/or accuracy. Other desirable features of the proposed
method are its simplicity and deployment efficiency, which stem
from the fact that it reuses the resources and information that
may already exist in the system for other reasons.
Index Terms—Object tracking, semantic tracking, motion vec-
tors, region of interest
I. INTRODUCTION
Visual object tracking is one of the fundamental tasks in
computer vision, and the cornerstone of many visual ana-
lytics applications in video surveillance, smart homes/cities,
independent living, human-computer interaction, and so on.
Despite the significant advances in the performance of trackers
in recent years, robust, efficient, and accurate tracking in real-
world video remains a challenge.
Existing tracking approaches can be classified in a number
of ways. For the purposes of this study, a division in terms of
the input data domain is useful: pixel domain, compressed do-
main, and hybrid. Pixel-domain trackers are the most abundant
and the most well-studied in the literature. Many successful
tracking approaches were developed in this group, such as
those based on correlation filters (e.g. [1]) and those based on
learned deep features (e.g. [2], [3]). Advantages of this class
of methods include their potential for high accuracy and the
fact that they are video codec-agnostic. However, they tend
to be resource intensive, because all pixel values need to be
reconstructed, stored in memory, and processed.
The second group of trackers operate on compressed-
domain data, with only partial decoding of the video bit
stream. Compressed-domain data carry valuable information
that has been shown to be useful in many applications, such
as face detection [4] and localization [5], motion segmenta-
tion [6], and object segmentation and tracking [7], [8]. The
key insight from the studies in [6], [7], [8] is that motion
vectors (MVs) and related coding syntax elements are good
indicators of the movements of objects in the scene. Since this
information already exists in the video bit stream, it seems
natural to try to use it in tracking. Advantages of compressed-
domain trackers include efficiency and speed, since they avoid
most of video decoding, pixel value storage and processing,
and generally operate on less input data. Their downside is
the dependence on the video coding method used to compress
the video, as well as potentially lower accuracy, limited by the
low resolution of the motion sampling grid: usually, a single
MV is assigned to blocks/units of size 4×4 or larger.
The third group of trackers are hybrid ones, trying to take
advantage of both compressed and pixel-domain data. An
example of such approach is [9], which performs tracking by
combining MVs and block coding modes extracted from the
High Efficiency Video Coding (HEVC) bit stream with the
color information from the decoded Intra frames.
The tracking method proposed in this work is also a hybrid
one, combining decoded MVs with semantic object detection
operating on fully decoded frames. The basic idea is that
MVs, which already exist in the compressed video bitstream,
are good enough to indicate the approximate location of the
target object. The semantic object detector then refines the object’s
location by providing a pixel-precision bounding box on the
decoded frame. The idea of two-stage tracking (approximation
followed by refinement) has also been advocated in two other
recent works, Parallel Tracking and Verifying (PTAV) [10] and
ROLO [11]. Both these approaches are pixel-domain trackers,
while ours is the first hybrid one, to our knowledge. PTAV
uses a fast but less accurate pixel-domain tracker in the first
stage followed by a Siamese network based on VGGNet [12]
for refinement in the second stage. In ROLO, the first stage
approximation is given by the YOLO object detector [13],
while the second stage refinement is provided by a Long Short-
Term Memory (LSTM) network.
The paper is organized as follows. In Section II, we present
the details of the proposed tracking method. In Section III
we describe the experiments, and discuss the results and
comparisons with several representative trackers from the
literature. Section IV concludes the paper.
Fig. 1: An overview of the proposed tracking method
II. PROPOSED METHOD
The proposed tracking framework is illustrated in Fig. 1. We
refer to it as MV-aided YOLO, or MV-YOLO for short. Initially, an approximate location of the target object is constructed based on the MVs of the current inter-coded frame and the object’s location in the previous frame. The constructed approximate location is referred to as the Region Of Interest (ROI). At the same time, the decoded current frame is passed on to a semantic object detector (in our case YOLO), which detects the locations of
various objects in the frame. The ROI then helps decide which
of these locations corresponds to the target object. Details are
presented in the following subsections.
A. ROI creation
The ROI creator uses the MVs from the HEVC bit stream to construct an approximate location of the target object in the current frame t, given the object’s location in the previous frame t−1. The procedure is relatively simple. The MVs of frame t are read from the HEVC bit stream during frame decoding. The MV associated with a prediction unit (PU) is assigned to all of its pixels. Then, each pixel whose MV refers to the object’s location in frame t−1 is labeled as an ROI pixel. Finally, the ROI is selected as the smallest axis-aligned rectangle that includes all ROI pixels. The process is illustrated in Fig. 2, where the ROI in frame t is shown in red and several MVs from frame t to frame t−1 are shown in yellow.
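To make the procedure concrete, the following Python sketch (our illustration, not the authors’ code) builds the ROI under simplifying assumptions: the MV field has already been expanded to one vector per pixel and normalized to point to frame t−1, so the SKIP/intra handling and MV scaling described in the rest of this subsection are omitted, and the box convention (x0, y0, x1, y1) is assumed.

import numpy as np

def build_roi(mv_field, prev_box):
    """Approximate target location (ROI) in frame t from per-pixel MVs.

    mv_field : float array of shape (H, W, 2) holding (dx, dy) per pixel,
               assumed to point from frame t into frame t-1.
    prev_box : (x0, y0, x1, y1), target location in frame t-1.
    Returns the ROI as (x0, y0, x1, y1), or None if no pixel maps into prev_box.
    """
    h, w, _ = mv_field.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    # Location in frame t-1 that each pixel of frame t refers to.
    ref_x = xs + mv_field[..., 0]
    ref_y = ys + mv_field[..., 1]
    x0, y0, x1, y1 = prev_box
    # A pixel is an ROI pixel if its MV points inside the previous target box.
    roi = (ref_x >= x0) & (ref_x <= x1) & (ref_y >= y0) & (ref_y <= y1)
    if not roi.any():
        return None
    ys_roi, xs_roi = np.nonzero(roi)
    # Smallest axis-aligned rectangle containing all ROI pixels.
    return int(xs_roi.min()), int(ys_roi.min()), int(xs_roi.max()), int(ys_roi.max())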
While the basic idea behind ROI creation is fairly intuitive, several technical challenges need to be resolved along the way. These include PUs without MVs (such as SKIP and intra-coded PUs), MVs pointing to frames other than t−1, and fractional-precision MVs. Of these challenges, SKIP PUs are the easiest to resolve. Since the SKIP mode indicates that the corresponding PU is almost exactly the same as the co-located region in the previous frame, a zero MV is assigned to each SKIP PU.
Assigning meaningful motion to intra-coded PUs is a bit
more involved, since the fact that intra mode was chosen
by the encoder is an indication that the underlying motion
is too complicated to be taken advantage of by conventional
motion compensation. For such PUs, we collect the MVs of
all neighboring inter-coded PUs in the same Coding Tree
Unit (CTU), and then apply Polar Vector Median (PVM) [7]
to come up with a suitable MV. Specifically, let $V = (\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n)$ be the list of MVs from the neighboring PUs, sorted according to their angle with respect to the horizontal axis. Then, a sub-list of $m = \lfloor (n+1)/2 \rfloor$ consecutive vectors from $V$ is selected such that the sum of their angle differences is minimized. That is, the selected group is $(\mathbf{v}_k, \mathbf{v}_{k+1}, \ldots, \mathbf{v}_{k+m-1})$, where $k$ is chosen as

$$k = \operatorname*{argmin}_{j} \sum_{i=j}^{j+m-2} \left( \angle\mathbf{v}_{i+1} - \angle\mathbf{v}_{i} \right) \qquad (1)$$

Then the angle and the magnitude of the PVM vector $\hat{\mathbf{v}}$ are set as

$$\angle\hat{\mathbf{v}} = \operatorname{median}\left( \angle\mathbf{v}_k, \angle\mathbf{v}_{k+1}, \ldots, \angle\mathbf{v}_{k+m-1} \right), \quad \|\hat{\mathbf{v}}\|_2 = \operatorname{median}\left( \|\mathbf{v}_1\|_2, \|\mathbf{v}_2\|_2, \ldots, \|\mathbf{v}_n\|_2 \right) \qquad (2)$$

Fig. 2: An example of ROI creation
Finally, MVs in frame t that point to frames other than t−1 are scaled (assuming continuity of motion) such that their scaled versions point to frame t−1, similarly to [7]. Components of fractional-precision MVs are rounded down to the nearest integer.
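The PVM computation used for intra-coded PUs, Eqs. (1)–(2), can be sketched as follows. This is an illustrative Python sketch under our own naming, not the authors’ implementation; angle wrap-around at ±π is ignored, and since the angles are sorted, the inner sum in Eq. (1) telescopes to the angular spread of each window.

import math
import statistics

def polar_vector_median(vectors):
    """Polar Vector Median (PVM) of a list of (dx, dy) motion vectors.

    Sketch of Eqs. (1)-(2): pick the most angularly coherent half of the
    vectors, take the median angle over that subset and the median magnitude
    over all vectors.
    """
    if not vectors:
        return (0.0, 0.0)
    n = len(vectors)
    angles = sorted(math.atan2(dy, dx) for dx, dy in vectors)
    m = (n + 1) // 2
    # Eq. (1): the window of m consecutive sorted angles with the smallest spread.
    k = min(range(n - m + 1), key=lambda j: angles[j + m - 1] - angles[j])
    angle = statistics.median(angles[k:k + m])
    # Eq. (2): median magnitude over all vectors.
    magnitude = statistics.median(math.hypot(dx, dy) for dx, dy in vectors)
    return (magnitude * math.cos(angle), magnitude * math.sin(angle))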
B. Object detection
Semantic object detection refers to finding the locations of
objects in the image and classifying them according to their
type, e.g., human, car, dog, and so on. Any semantic object detector
can be used in our proposed framework shown in Fig. 1.
However, for the experiments, we chose three versions of the
popular YOLO detector: YOLOv3 [14], YOLOv2 [15], and
TinyYOLO, which is a simpler and faster (though less accurate)
version of YOLOv2.
The initial position of the object to be tracked is specified in the first frame of the sequence. Our tracker then tries to infer the object’s class. (In some applications, the object’s class may be specified in the first frame, in which case it does not have to be inferred and this step can be skipped.) This is done by running the object detector on the first five frames of the sequence. In each frame, the object detector outputs a number of boxes together with the highest-confidence object class for each box; the detected box with the largest Intersection-Over-Union (IOU) with the specified location of the object is found, and the object class of that box is recorded. The most frequent among these object classes is the inferred class of the object to be tracked.
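A minimal sketch of this class-inference step is given below. The detector interface (returning a list of (box, class, score) triples), the iou helper, and all names are illustrative assumptions rather than the paper’s implementation.

from collections import Counter

def infer_target_class(detector, frames, init_box, iou, num_frames=5):
    """Majority-vote class inference over the first few frames.

    detector(frame) is assumed to return a list of (box, class_name, score);
    init_box is the target location specified in the first frame;
    iou(a, b) computes the Intersection-Over-Union of two boxes.
    """
    votes = []
    for frame in frames[:num_frames]:
        detections = detector(frame)
        if not detections:
            continue
        # Detected box with the largest IOU against the specified target location.
        _, cls, _ = max(detections, key=lambda d: iou(d[0], init_box))
        votes.append(cls)
    # The most frequent class among the recorded ones is the inferred class.
    return Counter(votes).most_common(1)[0][0] if votes else None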
After the object class is inferred, the frame at time $t$ is fed into the object detector and a set of $N$ boxes $B = \{B_1, \ldots, B_N\}$ is given as the output. These boxes carry the object location, object class, and confidence score. From these $N$ boxes, we eliminate all those whose class does not match the class of the object we are tracking. This way we end up with $M \le N$ boxes, which we relabel as $\hat{B} = \{\hat{B}_1, \ldots, \hat{B}_M\}$. These are used in the final decision stage, described below.
Note that the proposed tracking framework relies on se-
mantics (i.e., object class) to eliminate some of the irrelevant
objects/boxes in the frame. In principle, semantic information
should help in difficult situations such as occlusion or mul-
tiple object tracking. However, even non-semantic detectors
(those that do not output object class) could be used in our
framework, but the accuracy would likely suffer due to a larger
number of irrelevant boxes and the higher potential for making
wrong decisions in the final stage.
C. Final box decision
After the object detector outputs the set of boxes $\hat{B}$, the box corresponding to the target has to be identified. This is done with the help of the ROI found in the first stage. Among the boxes in $\hat{B}$, the one that has the highest IOU with the ROI seems like a good candidate. However, even the highest IOU can be small. Hence, we also compare this highest IOU with an adaptive threshold in order to arrive at the final decision. Details are given in Algorithm 1.
The IOU between the ROI and the box $\hat{B}_i \in \hat{B}$ is computed as

$$\mathrm{IOU}(\mathrm{ROI}, \hat{B}_i) = \frac{\mathrm{Area}\{\mathrm{ROI} \cap \hat{B}_i\}}{\mathrm{Area}\{\mathrm{ROI} \cup \hat{B}_i\}} \qquad (3)$$
Algorithm 1 Final box decision

Input: $InitialT_{IOU} = 0.7$ (initial threshold)
Input: $T_{IOU} = 0.7$ (adaptive threshold)
Input: $T_{Reduction} = 0.5$ (threshold reduction)
Input: $\hat{B}$ (boxes found in the object detection stage)
Input: $M$ (number of boxes in $\hat{B}$)
Input: $I = \{\}$ (IOU scores for the found boxes)
Output: $\tilde{B}$ (final box)

1: if $M == 0$ then
2:   no boxes in $\hat{B}$ $\Rightarrow$ take the target location in frame $t-1$ as $\tilde{B}$
3: else
4:   for $i = 1$ to $M$ do
5:     compute $\mathrm{IOU}(\mathrm{ROI}, \hat{B}_i)$ from (3)
6:     add $\mathrm{IOU}(\mathrm{ROI}, \hat{B}_i)$ to $I$
7:     $i \leftarrow i + 1$
8:   $j = \arg\max(I)$ ($\hat{B}_j$ has the largest IOU with the ROI)
9:   check the validity of $\hat{B}_j$:
10:  if $\mathrm{IOU}(\mathrm{ROI}, \hat{B}_j) > (1 - T_{Reduction}) \cdot T_{IOU}$ then
11:    $\tilde{B} \leftarrow \hat{B}_j$ (final box found)
12:    if $\mathrm{IOU}(\mathrm{ROI}, \hat{B}_j) > InitialT_{IOU}$ then
13:      $T_{IOU} \leftarrow InitialT_{IOU}$
14:    else
15:      $T_{IOU} \leftarrow \mathrm{IOU}(\mathrm{ROI}, \hat{B}_j)$
16:  else
17:    no suitable box is found $\Rightarrow$ take the target location in frame $t-1$ as $\tilde{B}$
18:    $T_{Reduction} \leftarrow T_{Reduction} + 0.2$
19: return $\tilde{B}$

The adaptive threshold $T_{IOU}$ in Algorithm 1 changes with respect to the IOU between the target and the ROI in the previous frame. Adaptation of this threshold (lines 10-18 in Algorithm 1) is designed to help with the cases where the object detector fails to detect the target object, but instead detects surrounding objects. It also helps in the case of occlusion.
In such cases, the boxes produced by the object detector do not match the target in the previous frame in terms of IOU (line 10 in Algorithm 1), so none of them is chosen; instead, the location of the target in the previous frame is taken as the final box $\tilde{B}$ for the current frame (line 17 in Algorithm 1). But if the mismatch continues, the IOU acceptance threshold is reduced ($T_{Reduction}$ increases in line 18 in Algorithm 1). Eventually, the lower IOU acceptance threshold (line 10 in Algorithm 1) will cause one of the detected boxes to be accepted as the final box $\tilde{B}$.
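The final decision stage can be sketched in a few lines of Python. This is our illustrative rendering of Eq. (3) and Algorithm 1, not the authors’ code: axis-aligned boxes are assumed as (x0, y0, x1, y1), all names are ours, and the sketch mirrors the pseudocode above, including the fact that $T_{Reduction}$ is only ever increased.

def iou(a, b):
    """IOU of two axis-aligned boxes (x0, y0, x1, y1), as in Eq. (3)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

class FinalBoxDecider:
    """Sketch of Algorithm 1 with the constants stated in the paper."""
    INITIAL_T_IOU = 0.7

    def __init__(self):
        self.t_iou = self.INITIAL_T_IOU   # adaptive threshold T_IOU
        self.t_reduction = 0.5            # threshold reduction T_Reduction

    def decide(self, roi, boxes, prev_box):
        if not boxes:                     # no detections: keep previous location
            return prev_box
        scores = [iou(roi, b) for b in boxes]
        j = max(range(len(boxes)), key=lambda i: scores[i])
        if scores[j] > (1.0 - self.t_reduction) * self.t_iou:
            # Accept the detection; reset or lower the adaptive threshold (lines 11-15).
            self.t_iou = self.INITIAL_T_IOU if scores[j] > self.INITIAL_T_IOU else scores[j]
            return boxes[j]
        # No suitable box: keep the previous location and relax the threshold (lines 17-18).
        self.t_reduction += 0.2
        return prev_box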
D. Summary
We now summarize several key features of the proposed
tracking framework.
Compatibility with many object detectors: One advantage
of our tracking framework is that it is not crucially dependent
on any particular object detector. While we use three versions
of YOLO in our experiments for demonstration purposes, other
detectors such as R-CNN [16], Fast R-CNN [17], Faster R-
CNN [18], SSD [19], and so on, can be used as well.
Resource sharing: The object detector in our tracking framework may be used for other applications as well. For example, if the detector is placed in the cloud, other cloud services can use it for other purposes, such as object detection in user-supplied photos. This way, a single deep model can serve many applications.

TABLE I: List of sequences used in the experiments

Bird1     BlurBody  BlurCar1  BlurCar3  Car4
CarDark   CarScale  Couple    Dancer    Dancer2
David3    Diving    Dog       Girl2     Gym
Human2    Human3    Human6    Human7    Human8
Human9    Jump      Singer1   Singer2   Skater
Skater2   Skating1  Suv       Walking2  Woman
Data reuse: In tracking, motion is usually one of the
key challenges to conquer. But in our framework, motion is
handled via MVs, which exist in the video bit stream anyway.
This reuse of existing data speeds up the processing, and
makes good engineering sense.
Robustness: Other key challenges in tracking are appear-
ance and scale changes. Many trackers try to model these
explicitly. Our framework handles these challenges by using an
image-based object detector, which is not burdened by the
memory of the object’s appearance in the previous frames. As
a result, the tracker is quite robust to appearance changes, as
illustrated by an example in Fig. 4(b).
III. EXPERIMENTAL RESULTS
A. Experimental settings
A total of 30 sequences out of the 100 sequences in the OTB100 dataset [20] were chosen for testing. These sequences contain
object classes that are supported by YOLO. They are listed
in Table I. Test sequences were encoded using the HEVC
reference software HM16.15 [21] with the configuration pa-
rameters in encoder_lowdelay_P_main.cfg [22] and
the Quantization Parameter (QP) set to 32. The motion vectors
were then extracted from the compressed HEVC bit streams.
The proposed tracking framework was compared against
DSST [1] (the winner of the VOT 2014 challenge), CNN-
SVM [2], and Re3 [3]. The latter two are representatives
of the class of trackers based on deep neural networks that
currently dominate this field. Within our framework, we used
three versions of the YOLO object detector: YOLOv3 [14],
YOLOv2 [15] and TinyYOLO [23], which is a simpler version
of YOLOv2. The resulting trackers are referred to as MV-
YOLOv3, MV-YOLOv2, and MV-TinyYOLO, respectively.
The detection thresholds for YOLOv3, YOLOv2, and TinyYOLO were set to 0.1, 0.1, and 0.03, respectively.
B. Results
Fig. 3: Success and Precision curves.

To evaluate the trackers, one-pass evaluation (OPE) [20] is performed. The Success and Precision plots [20] are shown in Fig. 3. For each tracker, the Success curve is derived from the IOU of the predicted object box and the ground-truth box, while the Precision curve represents the percentage of frames in which the Euclidean distance between the centroids of the predicted box and the ground-truth box is less than a given threshold. In the Success plot (top graph in Fig. 3), the Area Under the Curve (AUC) is indicated in brackets next to each tracker in the legend. In the Precision plot (bottom graph), the numbers in the legend represent the percentage of the predicted boxes with centroids located within 20 pixels of the ground-truth centroid.
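For reference, the legend numbers can be obtained from per-frame IOUs and centroid distances roughly as follows. This is a sketch of the standard OTB-style computation under our own naming, not code from the paper.

import numpy as np

def success_and_precision(ious, center_dists):
    """Success AUC and precision at 20 pixels from per-frame measurements.

    ious         : per-frame IOU between predicted and ground-truth boxes.
    center_dists : per-frame Euclidean distance between box centroids (pixels).
    """
    ious = np.asarray(ious, dtype=float)
    center_dists = np.asarray(center_dists, dtype=float)
    thresholds = np.linspace(0.0, 1.0, 101)
    # Success curve: fraction of frames whose IOU exceeds each threshold.
    success = np.array([(ious > t).mean() for t in thresholds])
    auc = success.mean()                       # area under the success curve
    precision_at_20 = (center_dists <= 20.0).mean()
    return auc, precision_at_20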
As seen in Fig. 3, the accuracy of the proposed tracking
framework depends on the object detector employed. YOLOv3
leads to the best accuracy, followed by YOLOv2 and TinyY-
OLO. To further illustrate this point, Table II shows Overlap
Success Rate (OSR) and Distance Precision Rate (DPR) [20]
at thresholds of 0.5 and 20 pixels, respectively, for the three versions
of the proposed tracker. Both DPR and OSR follow the trends
in Fig. 3, indicating that MV-YOLOv3 is the most accurate of
the three, and MV-TinyYOLO is the least accurate.
However, speed results show the opposite trend. The last
row of Table II indicates the speed of the three trackers from
the proposed framework. The tracker speed was computed
as follows. We first measured the speed of ROI generation
from MVs on a desktop machine with an Intel Core i7-6800K processor at 3.40 GHz, 128 GB of RAM, and a 12 GB Nvidia Titan X GPU. During this measurement, any disk access time was ignored. Then we added this time to the object detection time reported on the official YOLO website [23]. The inverse of the sum of these two times gives the speed in frames per second (fps) reported in the last row of Table II. The fastest of the three trackers is MV-TinyYOLO at 88 fps, and the slowest is MV-YOLOv3 at 28 fps, which is still relatively fast.

TABLE II: The performance and speed of the proposed tracking framework with three object detectors.

Metric       MV-YOLOv3  MV-YOLOv2  MV-TinyYOLO
DPR (%)          73         64          46
OSR (%)          65         54          36
Speed (fps)      28         47          88

Fig. 4: The performance of the proposed method when (a) occlusion or (b) scale change occurs. The red box is the ROI derived from MVs (Section II-A) and the blue box is the final predicted target location (Section II-C).
The Precision and Success results of DSST [1] and CNN-SVM [2] were taken from the official websites of each tracker. For Re3 [3], the authors shared their results on the OTB100 dataset with us. We see from Fig. 3 that MV-YOLOv3 has the
best average Success AUC among all tested trackers, while
CNN-SVM has the best average Precision. In both precision
and success results, MV-YOLOv3 is more accurate than Re3,
which is encouraging. In turn, Re3 is more accurate than DSST,
which was the winning tracker in the VOT 2014 challenge.
This illustrates the progress that has been made in the field in
the last few years. All three versions of MV-YOLO are faster
than CNN-SVM and DSST, but slower than Re3, according
to the speed reported in the respective papers.
To further analyze the precision results of MV-YOLOv3 and
CNN-SVM, the comparison of DPR (%) at the threshold of
20 pixels is reported in Table III for each test sequence sep-
arately, where the better result is indicated in bold. Although
CNN-SVM achieves higher DPR on average, MV-YOLOv3
outperforms CNN-SVM in almost half the test sequences.
By examining the sequences in which MV-YOLOv3 has
considerably lower performance such as Bird1 and CarDark,
we found that problems arise in cases where an object of the
same class as the one being tracked (e.g., bird or car) comes
close to the ROI and the tracker accidentally “latches on” to
it. Further work is needed to handle these situations, perhaps
by incorporating object attributes into the tracking framework.
TABLE III: Comparison of DPR (%) at 20 pixels distance in
the tested sequences for MV-YOLOv3 and CNN-SVM [2].
Sequence MV-YOLOv3 CNN-SVM
Bird1 14 37
BlurBody 86 58
BlurCar1 72 99
BlurCar3 68 99
Car4 100 100
CarDark 9 100
CarScale 99 70
Couple 6 100
Dancer 43 96
Dancer2 91 97
David3 98 100
Diving 68 43
Dog 94 95
Girl2 34 94
Gym 98 96
Human2 74 33
Human3 93 92
Human6 97 54
Human7 90 99
Human8 98 100
Human9 89 100
Jump 66 5
Singer1 22 95
Singer2 5 83
Skater 78 90
Skater2 99 81
Skating1 99 44
Suv 95 94
Walking2 100 86
Woman 98 100
Average 73 81
Despite this, there are many sequences in which MV-YOLOv3
offers better DPR than the more complex and slower CNN-
SVM.
Finally, some visual examples of the performance of MV-
YOLOv3 are shown in Fig. 4, where the red box indicates the
ROI created from the MVs (Section II-A) and the blue box
shows the final predicted box (Section II-C). Part (a) of the
figure illustrates occlusion, where the pedestrian being tracked
gets occluded by a tree trunk. Significant occlusion starts from
frame 82 and continues until frame 85. In frames 82, 83 and
84, no box with significant overlap with ROI is found by the
object detector, so the target box from frame 81 is chosen as
the predicted target location (step 17 in Algorithm 1) and the
IOU acceptance threshold is reduced (step 18 in Algorithm 1).
In frame 85, only the head of the person is detected, and
the IOU of the detected box and the ROI is relatively small.
However, since the IOU threshold was reduced in frames 82,
83 and 84, the detected box is chosen as the target. The tracker
locks on to the person and continues for a few frames with a
small box tracking the head. Later, when the person is in full
view and the object detector detects it fully, the tracker locks
onto the person again (frame 97 and later).
Part (b) of Fig. 4 shows the robustness of the proposed
tracking framework to scale change. In frame 10, the car being
tracked is small and located in the bottom-left part of the
frame. Within the next 240 frames, the car moves towards right
and towards the camera, while the camera itself also moves
towards right. At frame 250, the car is about 15 times larger
than it was in frame 10, and its appearance has changed: frame
10 was showing mostly the front view of the car, while frame
250 starts to show the rear view. Throughout these frames, the
car is accurately tracked despite these appearance changes.
C. Final Remarks
If the class of the object to be tracked is not supported by
the object detector, there are two workarounds. The first is to
fine-tune the detector (using transfer learning) for the desired
object class. Alternatively, one could switch to a generic
object detector (e.g. “objectness”).
The proposed method relies on the MVs from the video bit-
stream, but is not necessarily dependent on MVs produced by
an HEVC encoder. Since the earliest video coding standards,
MVs have been used to represent block-based translational motion,
and this is all that is required in the proposed framework. Our
tracker only needs a hint of motion – refinement of the object
box is provided by the object detector.
The performance of our tracking framework is highly dependent on the accuracy of the object detector, which in
turn depends on the input image quality. Unfortunately, video
sequences in the OTB100 dataset are stored frame-by-frame
as JPEG images, and the quality of these JPEG images is
not particularly good. In certain cases, coding artifacts can
easily be seen. To obtain motion vectors for our tests, we had
to further encode and decode these using HEVC, which has
created additional artifacts. This has caused the object detector
to miss the target object or wrongly classify objects in some
cases, which negatively impacted our results. Other trackers
in the study were not fed with HEVC-coded frames, so their
performance was not affected by additional HEVC coding.
IV. CONCLUSION
In this paper we proposed MV-YOLO, a novel tracking
framework that incorporates data reuse from the compressed
video bit stream and semantic object detection. Based on the
MVs extracted during the video decoding process, a ROI for
the target object is created in the current frame. Then the
output of a semantic object detector is used to more precisely
localize the target object with the help of the ROI.
The experiments show that MV-YOLO is a fast and robust
tracking framework. The accuracy and speed of MV-YOLO
depend on the particular object detector being used. However,
even the slowest version we tested was reasonably fast at 28
fps, while its accuracy was comparable to the recent trackers
based on deep models.
In the present study, we examined only single object track-
ing. However, the MV-YOLO framework contains all the
ingredients to support multiple object tracking as well. This is
a topic for future research.

Acknowledgment: This work was supported in part by the NSERC Grant RGPIN-2016-04590.
REFERENCES
[1] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg, “Discriminative
scale space tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39,
no. 8, pp. 1561–1575, Aug 2017.
[2] S. Hong, T. You, S. Kwak, and B. Han, “Online tracking by learning
discriminative saliency map with convolutional neural network,” in Proc.
ICML’15, 2015, pp. 597–606.
[3] D. Gordon, A. Farhadi, and D. Fox, “Re3: Real-time recurrent regression
networks for visual tracking of generic objects,” IEEE Robotics and
Automation Letters, vol. 3, no. 2, pp. 788–795, April 2018.
[4] S. R. Alvar, H. Choi, and I. V. Bajić, “Can you tell a face from a HEVC bitstream?,” in Proc. IEEE MIPR’18, Apr. 2018.
[5] S. R. Alvar, H. Choi, and I. V. Bajić, “Can you find a face in a HEVC bitstream?,” in Proc. IEEE ICASSP’18, Apr. 2018.
[6] Y. M. Chen, I. V. Bajić, and P. Saeedi, “Moving region segmentation from compressed video using global motion estimation and Markov random fields,” IEEE Trans. Multimedia, vol. 13, pp. 421–431, 2011.
[7] S. H. Khatoonabadi and I. V. Bajić, “Video object tracking in the compressed domain using spatio-temporal Markov random fields,” IEEE Trans. Image Processing, vol. 22, no. 1, pp. 300–313, Jan. 2013.
[8] L. Zhao, Z. He, W. Cao, and D. Zhao, “Real-time moving object
segmentation and classification from HEVC compressed surveillance
video,” IEEE Trans. Circuits Syst. Video Technol., 2018, to appear.
[9] S. Gül, J. T. Meyer, C. Hellge, T. Schierl, and W. Samek, “Hybrid
video object tracking in H.265/HEVC video streams,” in Proc. IEEE
MMSP’16, Sept 2016, pp. 1–5.
[10] H. Fan and H. Ling, “Parallel tracking and verifying: A framework for
real-time and high accuracy visual tracking,” in Proc. IEEE ICCV’17,
Oct 2017, pp. 5487–5495.
[11] G. Ning, Z. Zhang, C. Huang, X. Ren, H. Wang, C. Cai, and Z. He,
“Spatially supervised recurrent convolutional neural networks for visual
object tracking,” in Proc. ISCAS’17, May 2017, pp. 1–4.
[12] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[13] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look
once: Unified, real-time object detection,” in Proc. IEEE CVPR’16, Jun.
2016, pp. 779–788.
[14] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,”
arXiv preprint arXiv:1804.02767, 2018.
[15] J. Redmon and A. Farhadi, “YOLO9000: better, faster, stronger,” in Proc. IEEE CVPR’17, July 2017, pp. 6517–6525.
[16] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature
hierarchies for accurate object detection and semantic segmentation,”
in Proc. IEEE CVPR’14, June 2014, pp. 580–587.
[17] R. Girshick, “Fast R-CNN,” in Proc. IEEE ICCV’15, 2015, pp. 1440–1448.
[18] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-
time object detection with region proposal networks,” in Proc. NIPS’15,
2015, pp. 91–99.
[19] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and
A. C. Berg, “SSD: Single shot multibox detector,” in Proc. European
conference on computer vision. Springer, 2016, pp. 21–37.
[20] Y. Wu, J. Lim, and M. H. Yang, “Object tracking benchmark,” IEEE
Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1834–1848, 2015.
[21] “HEVC reference software (HM 16.15),” https://hevc.hhi.fraunhofer.de/
trac/hevc/browser/tags/HM-16.15, Accessed: 2017-05-27.
[22] F. Bossen, “Common HM test conditions and software reference confi-
gurations,” ISO/IEC JTC1/SC29 WG11, JCTVC-L1100, Jan. 2013.
[23] J. Redmon, “YOLO: Real-time object detection,” https://pjreddie.com/
darknet/yolo/, Accessed: 2018-04-25.