Multiple receptive fields and small-object-focusing weakly-supervised segmentation network for fast object detection

Siyang Sun

S. Sun is with the Research Center of Precision Sensing and Control, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China (e-mail: sunsiyang2015@ia.ac.cn).
Abstract—Object detection plays an important role in many visual applications. However, the precision and speed of a detector are usually in conflict, and one main reason for the reduced precision of fast detectors is that small objects are hard to detect. To address this problem, we propose a multiple receptive fields and small-object-focusing weakly-supervised segmentation network (MRFSWSnet) for fast object detection. In MRFSWSnet, a multiple receptive fields (MRF) block attends to different spatial locations of an object and its adjacent background with different weights, which enhances the discriminability of features. In addition, to improve the accuracy of small object detection, a small-object-focusing weakly-supervised segmentation module, which focuses only on small objects rather than on all objects, is integrated into the detection network for auxiliary training. Extensive experiments show the effectiveness of our method on both the PASCAL VOC and MS COCO detection datasets. In particular, with a low-resolution 300×300 input, MRFSWSnet achieves 80.9% mAP on the VOC2007 test set at 15 milliseconds per frame, which is state-of-the-art among real-time detectors.

Index Terms—Multiple receptive fields, small-object-focusing, weakly-supervised segmentation, object detection, multi-task loss.
I. INTRODUCTION
Object detection plays an important role in many visual applications, such as visual navigation [1]-[2], video surveillance [3]-[4], and intelligent transport [5]-[6]. Minaeian et al. [1] proposed a customized detection algorithm for UAV navigation, and Yuan et al. [2] presented a novel context-aware multichannel feature pyramid for vehicle navigation. Yang et al. [5] proposed a fast and accurate vanishing point detection method for various types of roads used in autonomous driving.
Compared with traditional object detection methods based on hand-crafted features [7]-[11], recent detectors based on deep convolutional neural networks (CNNs) [12]-[24] show powerful performance because of their robust and discriminative features. CNN-based object detection methods can be divided into two classes: region-based two-stage detectors [12]-[17] and region-free one-stage detectors [18]-[24]. Two-stage detectors [12]-[17] achieve higher precision; however, their complex computation and lower speed limit their practical
application. To accelerate detection, several one-stage detectors [18]-[24] have been proposed. They run in real time, but their accuracy drops clearly, by about 10% to 40% relative to state-of-the-art two-stage detectors [12]-[17]. An earlier one-stage detector, SSD [19], designed a series of reference anchor boxes with different scales and aspect ratios and directly regressed these anchors on features from different levels. Lower-level features are mainly used to detect smaller objects, and higher-level features to detect larger objects. However, lower-level features carry less semantic information, which makes small objects difficult to detect. To improve small object detection, several strategies [25]-[27], such as fusing features across scales, new anchor design rules, and new detector modules, have been introduced into one-stage detectors; however, their running speed decreases because of the added computation. To solve these problems, we propose a multiple receptive fields and small-object-focusing weakly-supervised segmentation network (MRFSWSnet) for fast object detection.
Many advances [28]-[34] have confirmed that detectors achieve better results with receptive field modules. The Inception module of GoogLeNet [28] uses multiple convolution layers with different kernels, sampled at the same center, to construct its receptive field module. The receptive field block (RFB) [34] was proposed by imitating the population receptive fields of human vision; different from the Inception module [28], it is built from multiple convolution layers with different dilation rates. Different from both of these receptive field modules, the multiple receptive fields (MRF) block designed in the proposed detector is composed of multiple convolution layers with different kernels and different dilation rates. The proposed MRF attends to different spatial locations of an object and its adjacent background with different weights to enhance feature discriminability, and it can cover more positions from the center of an object to its surrounding.
MRF achieves better feature discriminability with a lower computation burden.
Object detection usually focuses on the whole object and neglects local cues, whereas semantic segmentation pays close attention to every position within the object. Many advances [35]-[36] show that combining object detection with weakly-supervised semantic segmentation can improve detection performance. To further improve the precision of small object detection, a small-object-focusing weakly-supervised segmentation module (SWS) is integrated into the detection network for auxiliary training.
Our main contributions can be summarized as follows:
(1) A multiple receptive fields and small-object-focusing weakly-supervised segmentation network (MRFSWSnet) is proposed for fast object detection. It achieves state-of-the-art accuracy at real-time speed on the PASCAL VOC and MS COCO datasets.
(2) A multiple receptive fields (MRF) block is designed as a new convolutional predictor for SSD [19] to improve detection accuracy. MRF attends to different spatial locations of an object and its adjacent background with different weights to enhance feature discriminability, and it covers more positions from the center of an object to its surrounding.
(3) A small-object-focusing weakly-supervised segmentation module is integrated into MRFSWSnet as an auxiliary training task that focuses only on small objects rather than on all objects. MRFSWSnet shows a significant performance boost for small object detection while still running in real time.
II. RELATED WORK
CNN-based modern object detectors can be divided into two classes: region-based two-stage detectors and region-free one-stage detectors. Two-stage detectors have received attention for their higher accuracy; they first generate a sparse set of proposals and then classify these proposals. The representative two-stage method is Faster R-CNN [14], which generates candidate proposals with a region proposal network (RPN) and then regresses and classifies them with Fast R-CNN [13]. Its descendants, such as R-FCN [15], FPN [16] and Mask R-CNN [37], further improve detection accuracy. However, their slower running speed and complex computation limit their practical application. One-stage detectors were proposed to solve this problem: they discard the proposal-generation phase and detect objects in a dense manner, e.g., YOLO [18] and SSD [19]. YOLO and SSD adopt lightweight backbone networks to obtain faster running speed at the expense of detection accuracy. However, many advances [23], [37] have confirmed that detectors achieve better results with the robust and discriminative features of deeper networks [38], at the cost of complex computation.
To improve detector performance with less computation, some works focus on enhancing the discriminability of a lightweight network's features through different receptive fields. The Inception family [28]-[30] integrates multiple convolution layers with different kernels to capture multi-scale information with different receptive fields. Different from the Inception family [28]-[30], ASPP [31] for semantic segmentation adopts multiple dilated convolution layers to generate a series of uniform-resolution features without additional parameters. Deformable CNN [32] proposed a novel dilated convolution operator that learns per-object offsets on the basis of ASPP. RFB [34] designed a receptive field block with dilated convolution layers that attend to specific positions around a pixel rather than to all of its surroundings.
Fig. 1. The structure of the proposed MRFSWSnet
We propose a multiple receptive fields block (MRF) in our detector, which is composed of multiple convolution layers with different kernels and different dilation rates. The proposed MRF attends to different spatial locations with different weights according to the distance from the object's center to its surrounding, and it can cover more of the object's surrounding positions.
With these improvements, the precision of object detection can be promoted; however, the accuracy of small object detection remains low. To improve it, Zhu et al. [26] proposed a novel anchor strategy for anchor-based face detectors, which improves tiny-face detection, and Levi et al. [27] constructed a relational module using spatial and semantic relations to improve small object detection. In our detector, a small-object-focusing weakly-supervised segmentation module (SWS), which focuses only on small objects, is integrated into the detection network for auxiliary training to improve the precision of small object detection. Object detection usually focuses on the whole object without paying attention to local cues, while semantic segmentation pays close attention to each position within the object. Many advances [35]-[36] show that combining object detection with weakly-supervised semantic segmentation can improve detection performance. Gidaris et al. [35] proposed a semantic segmentation-aware CNN model for object detection that enhances the detection features with a semantic segmentation task at the highest level; the weakly-supervised semantic features are used to activate the detection features. Zhang et al. [36] used semantic segmentation features to activate the detection features at the lowest layer, which also improves detection accuracy. Both methods use weakly-supervised semantic segmentation features to activate the detection features, which enhances their robustness and discriminability, and both train the semantic features with weakly-supervised ground truth for all objects. Different from these methods, our SWS module focuses only on small objects during auxiliary training. SWS yields a significant performance boost for small object detection while our detector still runs in real time.
III. MULTIPLE RECEPTIVE FIELDS AND SMALL-OBJECT-FOCUSING WEAKLY-SUPERVISED SEGMENTATION NETWORK
In this section, we first introduce the overall detection architecture of MRFSWSnet in Section III.A. The multiple receptive fields block (MRF) is then presented in Section III.B, and the small-object-focusing weakly-supervised segmentation module (SWS) is described in Section III.C. Finally, the multi-task training of MRFSWSnet is presented in Section III.D.
A. The network architecture of MRFSWSnet
The proposed multiple receptive fields and small-object-focusing weakly-supervised segmentation network (MRFSWSnet) is composed of two branches: a detection branch and a segmentation branch. The whole network architecture is shown in Fig. 1. The detection branch reuses the structure of SSD because of its effective balance of detection accuracy and running speed. The base network of MRFSWSnet is the same as in the original SSD: VGG-16 [39] pre-trained on the ImageNet dataset [40].
In the detection branch of MRFSWSnet, we keep the cascade structure unchanged while adding one feature map, conv3_3_E, generated by a feature pyramid network (FPN) [16]. FPN [16] has been proven to improve feature quality and detection accuracy. Starting from the coarser-resolution feature map conv9, as shown in Fig. 1, we upsample the previous feature by a factor of 2 (using nearest-neighbor upsampling) and merge the upsampled feature with the corresponding bottom-up feature map, as sketched below. The process continues until the finest-resolution feature map, conv3_3_E, is obtained, which is used as the first detection feature. Different from the original SSD, seven feature maps (conv9, conv8, conv7, conv6, conv5_3, conv4_3, conv3_3_E), as shown in Fig. 1, are used for prediction, and each one except conv9 and conv8 is followed by a multiple receptive fields (MRF) block as a new convolutional predictor. The resolutions of the last two detection feature maps are 1×1 and 3×3; MRF cannot be applied to them because these maps are too small for its 5×5 kernels. The detailed structure of MRF is presented in Section III.B.
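The following is a minimal PyTorch sketch of one such top-down merge step. The 1×1 lateral convolution and the element-wise addition are assumptions carried over from the standard FPN design [16]; the paper only states that the upsampled feature is "merged" with the bottom-up map, and the layer names and channel counts here are illustrative.

```python
import torch
import torch.nn.functional as F
from torch import nn

class TopDownMerge(nn.Module):
    """One FPN-style merge step: coarser map in, finer merged map out."""
    def __init__(self, bottom_up_channels: int, out_channels: int):
        super().__init__()
        # 1x1 lateral convolution unifies channel counts before merging.
        self.lateral = nn.Conv2d(bottom_up_channels, out_channels, kernel_size=1)

    def forward(self, top_down: torch.Tensor, bottom_up: torch.Tensor) -> torch.Tensor:
        # Nearest-neighbor upsampling to the lateral map's size (a factor of
        # 2 between regular pyramid levels, as described in the text).
        up = F.interpolate(top_down, size=bottom_up.shape[-2:], mode="nearest")
        return self.lateral(bottom_up) + up
```

Repeating this step down the pyramid until the finest level yields conv3_3_E.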
In the segmentation branch of MRFSWSnet, an auxiliary small-object-focusing weakly-supervised semantic segmentation module (SWS) follows conv3_3_E. SWS focuses only on small objects instead of all objects and is integrated into the detection network for auxiliary training to improve the precision of small object detection. The detection branch and the segmentation branch are combined as multiple tasks.
B. Multiple Receptive Fields Block (MRF)
As shown in Fig. 2, the proposed multiple receptive fields block (MRF) is composed of multiple convolution layers with different receptive fields so as to cover more positions from the center of an object to its surrounding. These convolution layers can be divided into two parts according to the form of their receptive fields. One part of MRF is similar to the Inception structure [28]: several convolution layers with different kernel sizes, namely a 1×1 convolution layer, a 3×3 convolution layer and a 5×5 convolution layer. The receptive fields of these layers cover positions of an object from its center outward. The other part of MRF includes several dilated convolution layers with different dilation rates, namely 3×3 convolution layers with dilation rates 1, 2 and 3, respectively. These layers attend to different spatial locations with different weights according to the distance from the object's center to its surrounding.
Fig. 2. The multiple receptive fields block
Fig. 3. The structure of the multiple receptive fields block (MRF)
For example, as shown in Fig. 2, the 1×1 convolution layer with dilation rate 1 focuses only on the pixel at an object's center, while the 3×3 convolution layers with dilation rates 1, 2 and 3 attend to 9 pixels around the object at different spatial locations. Furthermore, the 5×5 convolution layer has a larger receptive field, so more contextual information is introduced. With these convolution layers, more positions of an object and its adjacent background are covered. In addition, the different convolution layers learn different weights according to their receptive fields, which has two advantages. On the one hand, larger weights are learned for smaller receptive fields and are assigned to pixels nearer to the object's center, reflecting that these pixels are more important than farther ones. On the other hand, larger receptive fields also attend to more contextual information, and smaller weights are allocated to the farther pixels. Thus, MRF captures the characteristics of pixels at different positions of an object and its adjacent background with different weights, which helps extract high-quality features.
The detailed structure of MRF is shown in Fig. 3. First, we adopt a bottleneck layer (a 1×1 convolution) to decrease and unify the number of channels of the previous feature map. Second, several convolution layers with different kernel sizes (1×1, 3×3 and 5×5) and several dilated convolution layers with different dilation rates (3×3 convolutions with dilation rates 1, 2 and 3) follow in parallel. Finally, the feature maps of all these convolution layers with different receptive fields are concatenated and fed into a 1×1 convolution layer. In addition, we design a ResNet-like shortcut [38] to preserve the response of the previous layer.
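This description maps directly onto a small module. Below is a minimal PyTorch sketch of MRF under stated assumptions: the paper does not give exact channel widths, so `mid_channels` and `out_channels` are illustrative, and the shortcut is realized with a 1×1 projection so that shapes match.

```python
import torch
from torch import nn

class MRFBlock(nn.Module):
    def __init__(self, in_channels: int, mid_channels: int = 128,
                 out_channels: int = 256):
        super().__init__()
        # Bottleneck 1x1 layer: decreases and unifies the channel count.
        self.bottleneck = nn.Conv2d(in_channels, mid_channels, 1)
        # Part 1: different kernel sizes (Inception-like), padded to keep size.
        self.k1 = nn.Conv2d(mid_channels, mid_channels, 1)
        self.k3 = nn.Conv2d(mid_channels, mid_channels, 3, padding=1)
        self.k5 = nn.Conv2d(mid_channels, mid_channels, 5, padding=2)
        # Part 2: the same 3x3 kernel with dilation rates 1, 2 and 3.
        self.d1 = nn.Conv2d(mid_channels, mid_channels, 3, padding=1, dilation=1)
        self.d2 = nn.Conv2d(mid_channels, mid_channels, 3, padding=2, dilation=2)
        self.d3 = nn.Conv2d(mid_channels, mid_channels, 3, padding=3, dilation=3)
        # 1x1 fusion over the concatenation of all six branches.
        self.fuse = nn.Conv2d(6 * mid_channels, out_channels, 1)
        # ResNet-like shortcut preserving the previous layer's response.
        self.shortcut = nn.Conv2d(in_channels, out_channels, 1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = self.bottleneck(x)
        branches = [self.k1(b), self.k3(b), self.k5(b),
                    self.d1(b), self.d2(b), self.d3(b)]
        fused = self.fuse(torch.cat(branches, dim=1))
        return self.relu(fused + self.shortcut(x))
```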
C. Small-Object-Focusing Weakly-Supervised Segmentation Module (SWS)
The small-object-focusing weakly-supervised segmentation module (SWS) performs weakly supervised semantic segmentation at the bounding-box level. Its input is the feature map conv3_3_E which, as shown in Fig. 1, is generated by FPN [16] and contains both semantic and detailed information. As shown in Fig. 4, the ground truth of the weakly-supervised segmentation is generated from the bounding boxes of small objects only. The output of SWS is a pair of predicted score maps corresponding to the foreground and background classes, respectively. The details of the module are as follows.
Fig. 4. The ground truth of the small-object-focusing weakly-supervised semantic segmentation module
The semantic segmentation branch considers only small objects, regardless of object class: a pixel of the ground-truth mask is labeled as foreground if it falls within the bounding box of a small object. We define two thresholds, T1 and T2, to bound the areas of small objects: objects whose areas lie in [T1, T2] are labeled as foreground. Pixels of objects whose areas are below T1 are marked as invalid and are overlooked during training, because extremely small objects do not carry enough effective information to train the segmentation module and would lead to incorrect predictions. Objects whose areas exceed T2 are labeled as background. If a pixel is covered by multiple ground truths (foreground, background and invalid), the foreground label is selected preferentially. An example of the ground truth for semantic segmentation is shown in Fig. 4.
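These labeling rules translate into a few lines of code. The sketch below (NumPy) assumes boxes are given as integer [x1, y1, x2, y2] coordinates already scaled to the mask resolution; the ignore value 255 is our convention for invalid pixels, not a value taken from the paper.

```python
import numpy as np

FG, BG, IGNORE = 1, 0, 255  # IGNORE is an assumed convention for invalid pixels

def sws_ground_truth(boxes, mask_h, mask_w, t1=1024, t2=9216):
    """Bounding-box-level ground-truth mask for the SWS branch."""
    mask = np.full((mask_h, mask_w), BG, dtype=np.uint8)
    fg = np.zeros((mask_h, mask_w), dtype=bool)
    for x1, y1, x2, y2 in boxes:
        area = (x2 - x1) * (y2 - y1)
        if area < t1:
            # Extremely small objects carry too little information: invalid.
            mask[y1:y2, x1:x2] = IGNORE
        elif area <= t2:
            # Small object: remember its pixels as foreground.
            fg[y1:y2, x1:x2] = True
        # area > t2: large objects stay labeled as background.
    mask[fg] = FG  # foreground is selected preferentially on overlaps
    return mask
```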
The weakly-supervised semantic segmentation module consists of two transposed convolution layers following conv3_3_E and one transition convolution layer with a ReLU non-linearity. Finally, a binary softmax classifier predicts the probability score of each pixel in the output mask, as shown in Fig. 1.
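As a rough sketch, the head could look as follows in PyTorch; the paper specifies only the layer types, so the strides, kernel sizes and channel widths here are assumptions (conv3_3_E is assumed to have 256 channels).

```python
from torch import nn

sws_head = nn.Sequential(
    # Two transposed convolutions upsample conv3_3_E toward mask resolution.
    nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2),
    nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2),
    # Transition convolution with a ReLU non-linearity.
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    # Two output score maps (foreground/background); a softmax over the
    # channel dimension plays the role of the binary classifier.
    nn.Conv2d(64, 2, kernel_size=1),
)
```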
D. The Training of Multiple Tasks
During training, the final loss function includes the binary cross-entropy loss for semantic segmentation $L_{seg}(I, M)$, the softmax loss for classification $L_{conf}(x, c)$, and the smooth-L1 loss for localization $L_{loc}(x, l, g)$. In short, the final loss combines the detection loss with the semantic loss, expressed as follows:

$$L(x, c, l, g, I, M) = L_{det}(x, c, l, g) + \alpha L_{seg}(I, M) \qquad (1)$$

$$L_{det}(x, c, l, g) = \frac{1}{N}\left[ L_{conf}(x, c) + \beta L_{loc}(x, l, g) \right] \qquad (2)$$

where $L_{det}(x, c, l, g)$ is the detection loss, $L_{seg}(I, M)$ is the segmentation loss, $\alpha$ is a balance factor between the detection and segmentation tasks, and $\beta$ is another balance factor between classification and localization within detection. $N$ is the number of matched default boxes, $I$ is the input image, and $M$ is the bounding-box-level ground truth for segmentation. $g$ is the ground-truth box for detection, $c$ is the confidence score of a predicted bounding box, $l$ is the predicted bounding box, and $x$ indicates whether a predicted bounding box matches the ground truth or not.
Specifically, the main detection loss is given in (3) and (4), which is similar to the loss function of the original SSD [19]:
$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_{i}^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_{i}^{0}\right), \qquad \hat{c}_{i}^{p} = \frac{\exp\left(c_{i}^{p}\right)}{\sum_{p}\exp\left(c_{i}^{p}\right)} \qquad (3)$$

$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\left(l_{i}^{m} - \hat{g}_{j}^{m}\right) \qquad (4)$$

where the regression targets are

$$\hat{g}_{j}^{cx} = \frac{g_{j}^{cx} - d_{i}^{cx}}{d_{i}^{w}}, \qquad \hat{g}_{j}^{cy} = \frac{g_{j}^{cy} - d_{i}^{cy}}{d_{i}^{h}}, \qquad \hat{g}_{j}^{w} = \log\frac{g_{j}^{w}}{d_{i}^{w}}, \qquad \hat{g}_{j}^{h} = \log\frac{g_{j}^{h}}{d_{i}^{h}},$$

and $d$ is the designed default bounding box. Similar to SSD, we regress offsets for the center $(cx, cy)$ of the default bounding box $d$ and for its width $w$ and height $h$.
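The offset targets in (4) can be written down directly; the following NumPy helper shows the encoding for one matched pair of ground-truth and default box in center-size form.

```python
import numpy as np

def encode(g, d):
    """Regression targets for one matched default box d = (cx, cy, w, h)."""
    g_cx, g_cy, g_w, g_h = g
    d_cx, d_cy, d_w, d_h = d
    return np.array([(g_cx - d_cx) / d_w,   # center-x offset
                     (g_cy - d_cy) / d_h,   # center-y offset
                     np.log(g_w / d_w),     # log width ratio
                     np.log(g_h / d_h)])    # log height ratio
```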
The added loss function of the auxiliary segmentation task is given in (5):

$$L_{seg}(I, M) = -\frac{1}{WH} \sum_{u, v} \log Y_{u,v}\!\left(M_{u,v}\right) \qquad (5)$$

where $Y \in [0, 1]^{1 \times W \times H}$ represents the predicted mask, $M \in \{0, 1\}^{1 \times W \times H}$ is the bounding-box-level ground truth for segmentation, and $Y_{u,v}(M_{u,v})$ denotes the predicted probability of the ground-truth label at pixel $(u, v)$. During the training phase, the two tasks are treated equally: the balance factors $\alpha$ and $\beta$ are both set to 1.
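Putting (1)-(5) together, a minimal PyTorch sketch of the multi-task objective follows. SSD's box matching and hard negative mining are omitted for brevity; the tensors are assumed to be pre-gathered (class logits and targets over the selected boxes, offsets over the positives), and 255 marks invalid mask pixels as in the sketch in Section III.C.

```python
import torch.nn.functional as F

def multi_task_loss(cls_logits, cls_targets, loc_preds, loc_targets,
                    seg_logits, seg_mask, num_pos, alpha=1.0, beta=1.0):
    # L_conf: softmax cross-entropy over the selected default boxes.
    l_conf = F.cross_entropy(cls_logits, cls_targets, reduction="sum")
    # L_loc: smooth-L1 on the encoded offsets of positive boxes only.
    l_loc = F.smooth_l1_loss(loc_preds, loc_targets, reduction="sum")
    l_det = (l_conf + beta * l_loc) / num_pos
    # L_seg: per-pixel cross-entropy over the two score maps; invalid
    # pixels (label 255) are excluded from the average.
    l_seg = F.cross_entropy(seg_logits, seg_mask, ignore_index=255)
    return l_det + alpha * l_seg
```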
IV. EXPERIMENTS
Extensive experiments are conducted on two main detection datasets: PASCAL VOC [41] and MS COCO [42]. For PASCAL VOC, we use the common dataset split for evaluation: during training we use the VOC2007 trainval and VOC2012 trainval sets, and the VOC2007 test set is used for testing. The mean average precision (mAP) at a fixed Intersection-over-Union (IoU) threshold is used as the evaluation metric; a predicted bounding box is counted as positive if its IoU with the ground truth is higher than 0.5. For MS COCO, we follow a popular split, using train2015 as the training set and minival2015 as the test set.
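For reference, the IoU test behind the 0.5 criterion above is shown below (plain Python, boxes in [x1, y1, x2, y2] corner format).

```python
def iou(a, b):
    """Intersection-over-Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# e.g. iou([0, 0, 10, 10], [5, 5, 15, 15]) = 25 / 175 ~= 0.143: not positive.
```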
The baseline detector is the classic VGG16-based SSD, chosen for its high accuracy and fast running speed; building on this baseline allows us to confirm the effectiveness of the proposed modules on different datasets. Specifically, there are seven detection features, including five MRFs, in MRFSWSnet300, and eight detection features, including six MRFs, in MRFSWSnet512. The small-object-focusing weakly-supervised segmentation module (SWS) follows the conv3_3_E layer in both MRFSWSnet300 and MRFSWSnet512. We follow the SSD training strategy throughout our experiments.
Our detector is compared with several state-of-the-art methods, including Faster RCNN [14], R-FCN [15], FPN [16], YOLOv2 [25], SSD [19], DSSD [20], RFB [34] and RetinaNet500 [23]. The results of these detectors are taken from the original publications, except for RFB300 and RFB512, which we trained and tested ourselves on the same 4 Titan X GPUs for a fair comparison.
A. Experiment on PASCAL VOC
On the PASCAL VOC dataset, training is performed on a machine with 4 Titan X GPUs. The training process differs slightly from the original SSD [19]. At the beginning of training, the learning rate follows a "warm up" schedule, ramping up to $10^{-3}$ over the first 10 epochs. After 10 epochs, the learning rate is kept at $10^{-3}$ until the 150th epoch and is then divided by 10 at the 150th and 250th epochs. The total number of training epochs is 300. Following the original SSD, the weight decay is 0.0005, the momentum is 0.9, and the batch size is 32. We use VGG-16 pre-trained on the ILSVRC CLS-LOC dataset [40] to initialize our detector. T1 and T2 are set to 1024 and 9216, respectively. The fc6 and fc7 layers of the original VGG-16 are replaced by convolution layers with sub-sampled parameters, and the fc8 layer is removed. The other new convolution layers of our detector, including those of the auxiliary segmentation branch, are initialized with the MSRA method [43].
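The schedule above can be summarized by a small helper (plain Python). Linear warm-up and the $10^{-6}$ starting rate are assumptions; the paper only names a "warm up" over the first 10 epochs.

```python
def learning_rate(epoch, base_lr=1e-3, warm_start=1e-6, warm_epochs=10):
    if epoch < warm_epochs:                     # assumed linear warm-up
        return warm_start + (base_lr - warm_start) * epoch / warm_epochs
    if epoch < 150:
        return base_lr                          # 1e-3 until the 150th epoch
    if epoch < 250:
        return base_lr / 10                     # divided by 10 at epoch 150
    return base_lr / 100                        # divided by 10 again at 250
```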
TABLE I
THE PERFORMANCE OF DIFFERENT DETECTORS ON THE PASCAL VOC2007 TEST

Method | Backbone | Time | mAP (%)
Faster RCNN [14]* | VGG16 | 147 ms | 73.2
Faster RCNN++ [27]* | ResNet-101 | 3.36 s | 76.4
R-FCN [15]* | ResNet-101 | 110 ms | 80.5
YOLOv2 [25]* | darknet | 25 ms | 78.6
SSD300 [19]* | VGG16 | 12 ms | 77.2
SSD512 [19]* | VGG16 | 28 ms | 79.8
DSSD513 [20]* | ResNet-101 | 182 ms | 81.5
RFB300 [34] | VGG16 | 15 ms | 79.9
RFB512 [34] | VGG16 | 30 ms | 81.5
MRFSWSnet300 | VGG16 | 15 ms | 80.9
MRFSWSnet512 | VGG16 | 31 ms | 81.8

* Results as reported in the original publications.
Table I shows the performance of different detectors on the PASCAL VOC2007 test set; the bold font in Table I marks the best three results.
As shown in Table I, our detector MRFSWSnet outperforms the original SSD at both resolutions. The mAP of MRFSWSnet300 is 3.7% higher than that of the original SSD300 while keeping an almost equivalent running speed. At the higher resolution, the mAP of MRFSWSnet512 improves by 2%, from 79.8% to 81.8%, while keeping the same real-time speed as SSD512. Compared with other state-of-the-art detectors, such as the popular two-stage method R-FCN [15] and the receptive-field-based detector RFB [34], our detector still shows a significant performance improvement. It is also better than most one-stage and two-stage detection systems equipped with very deep backbone networks such as ResNet-101.
B. Experiment on MS COCO
The training strategy on MS COCO is almost the same as on PASCAL VOC. However, the default anchors are slightly different from those used on PASCAL VOC2007 to fit the COCO dataset, whose objects are smaller. The learning rate again follows a "warm up" schedule over the first 10 epochs; it is then set to $2\times10^{-3}$ until the 150th epoch and divided by 10 at the 200th and 300th epochs. The total number of epochs is 350. The momentum is set to 0.9 and the weight decay to 0.0005, consistent with the original SSD settings. T1 and T2 are set to 1024 and 9216. As in the PASCAL VOC2007 experiment, we use VGG-16 pre-trained on the ILSVRC CLS-LOC dataset [40] to initialize our detector. Table II shows the performance of different detectors on the COCO 2015 dataset; the bold font in Table II marks the best three results.
As shown in Table II, the three methods with the highest mAP (36.2%, 37.1% and 39.1%) are all region-based two-stage detectors, but their running times are much slower. Compared with these state-of-the-art methods, MRFSWSnet300 achieves an average precision of 30.5%, and 49.8% mAP at IoU 0.5, at a low input resolution on the COCO 2015 dataset, while running much faster; it is also much better than the baseline SSD300. At this low resolution, MRFSWSnet300 even outperforms R-FCN [15] with a larger input size (600×1000), which employs ResNet-101 as the backbone in a two-stage framework. Our detector takes only 16 milliseconds per frame and is about 7 times faster than R-FCN.
As for the larger model, MRFSWSnet512 with the higher input resolution (512×512) is also better than SSD512: the relative improvement in average precision is 14.9%, from 28.8% to 33.1%. Our detector at this resolution consumes only 28 milliseconds per frame, which is equivalent to SSD512. Compared with the recent state-of-the-art one-stage model RetinaNet500, our detector is slightly inferior at a similar input size. However, RetinaNet is based on the deeper and more complicated ResNet-101-FPN backbone and a focal loss that focuses on hard examples, whereas MRFSWSnet512 is based only on a lightweight VGG model with an FPN-style top-down pathway. Moreover, RetinaNet is not a real-time detector: it runs at 90 milliseconds per frame, far slower than MRFSWSnet512 at a similar input size. Considering both precision and running time, our detector is the state-of-the-art real-time detector.
In addition, our detector is compared with RFB, whose architecture is similar to ours. The average precision of our detector is higher than that of RFB regardless of input size, and its mAP at the stricter IoU criterion (0.75) is comparable to RFB's. What's more, the average precision of our detector on small objects is far better than RFB's: 14.3% and 17.5%, compared with 11.8% and 15.9% for RFB, at input sizes of 300×300 and 512×512, respectively. Compared with the state-of-the-art detector RetinaNet500, the precision of MRFSWSnet512 for small object detection is 2.8% higher at a similar input size. This proves the effectiveness of the proposed MRFSWSnet for small object detection.
TABLE II
THE PERFORMANCE OF DIFFERENT DETECTORS ON MS COCO 2015

Method | Backbone | Time | AP 0.5:0.95 | AP 0.5 | AP 0.75 | AP S | AP M | AP L
Faster RCNN [14] | VGG16 | 147 ms | 24.2 | 45.3 | 23.5 | 7.7 | 26.4 | 37.3
Faster RCNN++ [27] | ResNet-101 | 3.36 s | 34.9 | 55.7 | 37.4 | 15.6 | 38.7 | 50.9
FPN [16] | ResNet-101-FPN | 240 ms | 36.2 | 59.1 | 39.0 | 18.2 | 39.0 | 48.2
R-FCN [15] | ResNet-101 | 110 ms | 29.9 | 51.9 | -- | 10.8 | 32.8 | 45.0
Mask RCNN [37] | ResNeXt-101-FPN | 210 ms | 37.1 | 60.0 | 39.4 | 16.9 | 39.9 | 53.5
YOLOv2 [25] | darknet | 25 ms | 21.6 | 44.0 | 19.2 | 5.0 | 22.4 | 35.5
SSD300 [19] | VGG16 | 13 ms | 25.1 | 43.1 | 25.8 | -- | -- | --
SSD512 [19] | VGG16 | 28 ms | 28.8 | 48.5 | 30.3 | -- | -- | --
DSSD513 [20] | ResNet-101 | 182 ms | 33.2 | 53.3 | 35.2 | 13.0 | 35.4 | 51.1
RetinaNet500 [23] | ResNet-101-FPN | 90 ms | 34.4 | 53.1 | 36.8 | 14.7 | 38.5 | 49.1
RetinaNet800 [23] | ResNet-101-FPN | 198 ms | 39.1 | 59.1 | 42.3 | 21.8 | 42.7 | 50.2
RFB300 [34] | VGG16 | 16 ms | 30.3 | 49.3 | 31.8 | 11.8 | 31.9 | 45.9
RFB512 [34] | VGG16 | 33 ms | 33.0 | 52.7 | 34.7 | 15.9 | 37.7 | 47.9
MRFSWSnet300 | VGG16 | 16 ms | 30.5 | 49.8 | 31.7 | 14.3 | 32.3 | 45.3
MRFSWSnet512 | VGG16 | 28 ms | 33.1 | 53.0 | 34.7 | 17.5 | 37.1 | 47.9

s denotes seconds; ms denotes milliseconds. AP S/M/L is the average precision on small, medium and large objects.
C. Ablation Experiments
To further verify the effectiveness of the proposed MRF and SWS modules, experimental results under different settings on the PASCAL VOC2007 test set are shown in Table III. The input resolution for all experiments is 300×300.
1) Multiple receptive fields block (MRF)
To better understand the effectiveness of MRF, we compare the original SSD without MRF [19], SSD with RFB [34], and SSD with MRF; the results are summarized in Table III. The mAP of the original SSD [19] with data augmentation is 77.2%. With the RFB architecture [34], SSD obtains 79.6% mAP, 2.4% above the original SSD. With MRF instead of RFB, SSD achieves 80.1% mAP, which is 2.9% above the original SSD and 0.5% above RFB. This shows that MRF helps improve detection accuracy.
2) More detection features with more anchors
The original SSD uses only six predicted feature maps for object detection. Recent research [44] shows that higher-resolution features benefit small object detection, so we add the new layer conv3_3_E as a predicted feature. With this finest-resolution layer, more anchors are generated to cover more small instances. As shown in Table III, the performance of our detector with conv3_3_E improves by 0.2%, from 80.1% to 80.3%, on the PASCAL VOC2007 test set.
3) Small-object-focusing weakly-supervised semantic segmentation module (SWS)
To further verify the effectiveness of the proposed module, we conduct two experiments with different settings on the PASCAL VOC2007 test set. In the first, we add an all-object-focusing weakly-supervised semantic segmentation module (AWS) to our detector; in other words, all objects must be identified by the segmentation module regardless of their size. Detection performance improves by 0.4%, from 80.3% to 80.7%, suggesting that weakly-supervised semantic segmentation is important for improving object detection. In the second experiment, the small-object-focusing module (SWS), which focuses only on small objects as described in Section III.C, is integrated into our detector. As shown in Table III, detection performance improves by 0.6% over our detector without SWS, which is also 0.2% better than the detector with AWS. This indicates that small-object-focusing weakly-supervised semantic segmentation as an auxiliary training task is crucial for enhancing detection performance.
D. Inference Speed Comparisons
Fig. 5. The mAP vs speed of different detectors on MS COCO dataset
A comparison of mAP and running time between the proposed MRFSWSnet and recent state-of-the-art detectors is shown in Fig. 5 and Table IV. Following [23], we plot the speed/accuracy trade-off curve for recent methods and our detector on the MS COCO dataset; the curve shows the relation between precision and inference time for each detector. The bold font in Table IV marks the best three results. As shown in Table IV, the three methods with the highest mAP (36.2%, 37.1% and 39.1%) are all region-based two-stage detectors, whose running times are much slower.
TABLE III
THE PERFORMANCE OF OUR DETECTOR WITH VARIOUS DESIGNS ON THE PASCAL VOC2007 TEST
(first column: original SSD [19]; remaining columns: MRFSWSnet300 variants; ✓ = included, × = not included)

Added RFB [34] | × | ✓ | × | × | × | ×
Added MRF | × | × | ✓ | ✓ | ✓ | ✓
More anchors with conv3_3_E | × | × | × | ✓ | ✓ | ✓
All-object-focusing weakly-supervised segmentation (AWS) | × | × | × | × | ✓ | ×
Small-object-focusing weakly-supervised segmentation (SWS) | × | × | × | × | × | ✓
mean Average Precision (mAP) (%) | 77.2 | 79.6 | 80.1 | 80.3 | 80.7 | 80.9
TABLE IV
THE mAP AND RUNNING TIME OF DIFFERENT DETECTORS ON MS COCO

Method | Backbone | Time (ms) | mAP (%)
Faster RCNN [14] | VGG16 | 147 | 24.2
FPN [16] | ResNet-101-FPN | 240 | 36.2
R-FCN [15] | ResNet-101 | 110 | 29.9
Mask RCNN [37] | ResNeXt-101-FPN | 210 | 37.1
YOLOv2 [25] | darknet | 25 | 21.6
SSD300 [19] | VGG16 | 13 | 25.1
SSD512 [19] | VGG16 | 28 | 28.8
DSSD513 [20] | ResNet-101 | 182 | 33.2
RetinaNet500 [23] | ResNet-101-FPN | 90 | 34.4
RetinaNet800 [23] | ResNet-101-FPN | 198 | 39.1
RFB300 [34] | VGG16 | 16 | 30.3
RFB512 [34] | VGG16 | 33 | 33.0
MRFSWSnet300 | VGG16 | 16 | 30.5
MRFSWSnet512 | VGG16 | 28 | 33.1
Our detector MRFSWSnet300 is slightly slower than the original VGG16-based SSD300, but its mAP is much higher. Meanwhile, it is faster than other real-time detectors, such as YOLOv2 and RFB300, at the same input size. At the higher resolution (512×512), MRFSWSnet512 also keeps a low running time (28 ms per frame) with higher precision (33.1%). Considering precision and running time together, our detector is the state-of-the-art method among real-time detectors.
V. CONCLUSION
In this paper, we propose a multiple receptive fields and small-object-focusing weakly-supervised segmentation network (MRFSWSnet) for fast object detection. In MRFSWSnet, multiple receptive fields (MRF) blocks attend to different spatial locations of an object and its adjacent background with different weights; extensive experiments verify that MRF effectively enhances feature discriminability and robustness. In addition, to further improve the accuracy of small object detection, a small-object-focusing weakly-supervised segmentation module (SWS) is integrated into the detection network for auxiliary training. SWS yields a significant performance boost for small object detection while the detector still runs in real time. Extensive experiments show the effectiveness of our detector on the PASCAL VOC and MS COCO detection datasets.
REFERENCES
[1] S. Minaeian, J. Liu et al., “Vision-based target detection and localization
via a team of cooperative UAV and UGVs,” IEEE Trans. Syst. Man. Cy-
S., vol. 46, no. 7, pp. 1005-1016, Jul. 2016.
[2] X. Yuan, X. Cao, X. Hao et al., “Vehicle detection by a context-aware
multichannel feature pyramid,” IEEE Trans. Syst. Man. Cy-S., vol. 47, no.
7, pp. 1348-1357, Jul. 2017.
[3] A. B. Mabrouk, E. Zagrouba, “Abnormal behavior recognition for
intelligent video surveillance systems: A review,” Expert Syst. Appl., vol.
91, pp. 480-491, Jan. 2018.
[4] S. Bashbaghi, E. Granger, R. Sabourin et al., “Deep learning architectures
for face recognition in video surveillance,” arXiv preprint
arXiv:1802.0999, Jun. 2018, pp. 1-9.
[5] W. Yang, B. Fang, Y. Tang, “Fast and accurate vanishing point detection
and its application in inverse perspective mapping of structured road,”
IEEE Trans. Syst. Man. Cy-S., vol. 48, no. 5, pp. 755-766, May. 2018.
[6] H. Kuang, L. Chen, L. Chan et al., “Feature selection based on tensor
decomposition and object proposal for night-time multiclass vehicle
detection,” IEEE Trans. Syst. Man. Cy-S., vol. 49, no. 1, pp. 71-80, Jan.
2019.
[7] N. Dalal, B. Triggs, “Histograms of oriented gradients for human
detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., CA, USA,
Jun. 2005, pp. 886-893.
[8] D. G. Lowe, “Object recognition from local scale-invariant features,” in
Proc. IEEE Conf. Comput. Vis., Corfu, Greece, Sep. 1999, pp. 1150-1157.
[9] K. E. A. Van de Sande, J. R. R. Uijlings, T. Gevers et al., “Segmentation
as selective search for object recognition,” in Proc. IEEE Conf. Comput.
Vis., Barcelona, Spain, Nov. 2011, pp. 1879-1886.
[10] J. Pont-Tuset, P. Arbelaez, J. T. Barron et al., “Multiscale combinatorial
grouping for image segmentation and object proposal generation,” IEEE
Trans. Pattern Anal. Mach. Intell., vol. 39, no. 1, pp. 128-140, Jan. 2017.
[11] C. L. Zitnick, P. Dollár, “Edge boxes: Locating object proposals from
edges,” in Proc. Eur. Conf. Comput. Vis., Zurich, Switzerland, Sep. 2014,
pp. 391-405.
[12] K. He, X. Zhang, S. Ren, J. Sun, “Spatial pyramid pooling in deep
convolutional networks for visual recognition,” IEEE Trans. Pattern Anal.
Mach. Intell., vol. 37, pp. 1904-1916, Jan. 2015.
[13] R. Girshick, “Fast R-CNN,” in Proc. IEEE Conf. Comput. Vis., Santiago,
Chile, Oct, 2015, pp. 1440-1448.
[14] S. Ren, K. He, R. Girshick. “Faster R-CNN: Towards real-time object
detection with region proposal networks,” in Proc. Int. Conf. Neural Inf.
Process. Sys., Palais des Congrès de Montréal, Montréal CANADA, Sep.
2015, pp. 91-99.
[15] J. Dai, Y. Li, K. He et al., “R-FCN: Object detection via region-based
fully convolutional networks,” in Proc. Int. Conf. Neural Inf. Process. Sys.,
Barcelona, Spain, Dec. 2016, pp. 379-387.
[16] T. Y. Lin, P. Dollár, R. Girshick et al., “Feature pyramid networks for
object detection,” IEEE Conf. Comput. Vis. Pattern Recognit., Hawaii,
USA, Jul. 2017, pp. 2117-2125.
[17] B. Singh, L. S. Davis, “An analysis of scale invariance in object
detection – SNIP,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
Salt Lake City, Utah, USA, Jun. 2018, pp. 3578-3587.
[18] J. Redmon, S. Divvala, R. Girshick, “You only look once: Unified, real-
time object detection.” in Proc. IEEE Conf. Comput. Vis. Pattern
Recognit., Las Vegas, Nevada, Jun. 2016, pp. 779-788.
[19] W. Liu, D. Anguelov, D. Erhan. “SSD: Single shot multibox detector,” in
Proc. Eur. Conf. Comput. Vis., Amsterdam, Netherlands, Sep. 2016, pp.
21-37.
[20] C. Y. Fu, W. Liu, A. Ranga et al., “DSSD: Deconvolutional single shot
detector,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Hawaii,
USA, Jul. 2017, pp. 1-9.
[21] J. Ren, X. H. Chen, J. B. Liu et al., “Accurate single stage detector using
recurrent rolling convolution,” in Proc. IEEE Conf. Comput. Vis. Pattern
Recognit., Hawaii, USA, Jul. 2017, pp. 1-9.
[22] Z. Shen, Z. Liu, J. Li et al., “DSOD: Learning deeply supervised object
detectors from scratch,” in Proc. IEEE Conf. Comput. Vis., Venice, Italy,
Oct. 2017, pp. 7.
[23] T. Y. Lin, P. Dollár, R. Girshick et al., “Focal loss for dense object
detection,” in Proc. IEEE Conf. Comput. Vis., Venice Italy, Oct. 2017, pp.
1-10.
[24] S. Zhang, L. Wen, X. Bian et al., “Single-shot refinement neural network
for object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
Salt Lake City, Utah, USA, Jun. 2018, pp. 4203-4212.
[25] J. Redmon, A. Farhadi. “YOLO9000: better, faster, stronger,” in Proc.
IEEE Conf. Comput. Vis. Pattern Recognit., Hawaii, USA, Jul. 2017, pp.
7263-7271.
[26] C. Zhu, R. Tao, K. Luu et al., “Seeing small faces from robust anchor’s
perspective,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Utah,
USA, Jun. 2018, pp. 5127-5136.
[27] H. Levi, S. Ullman. “Efficient coarse-to-fine non-local module for the
detection of small objects,” arXiv preprint arXiv:1811.12152, 2018.
[28] C. Szegedy, W. Liu, Y. Jia et al., “Going deeper with convolutions,” in
Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Boston, USA, Jun.
2015, pp. 1-9.
[29] C. Szegedy, V. Vanhoucke, S. Ioffe et al., “Rethinking the inception
architecture for computer vision,” IEEE Conf. Comput. Vis. Pattern
Recognit., Las Vegas, Nevada, Jun. 2016, pp. 2818-2826.
[30] C. Szegedy, S. Ioffe, V. Vanhoucke et al., “Inception-v4, inception-resnet
and the impact of residual connections on learning,” in Proc. Thirty-First
AAAI Conf. Artif. Intell., San Francisco, California USA, Feb. 2017, pp.
4-12.
[31] L. C. Chen, G. Papandreou, I. Kokkinos et al., “Deeplab: Semantic image
segmentation with deep convolutional nets, atrous convolution, and fully
connected crfs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4,
pp. 834-848, Apr. 2018.
[32] J. Dai, H. Qi, Y. Xiong et al., “Deformable convolutional networks,” in
Proc. IEEE Conf. Comput. Vis., Venice Italy, Oct. 2017, pp. 764-773.
[33] Z. Li, C. Peng, G. Yu et al., “Detnet: Design backbone for object
detection,” in Proc. Eur. Conf. Comput. Vis., Munich, Germany, Sep.
2018, pp. 339-354.
[34] S. Liu, D. Huang. “Receptive field block net for accurate and fast object
detection,” in Proc. Eur. Conf. Comput. Vis., Munich, Germany, Sep.
2018, pp. 385-400.
[35] S. Gidaris, N. Komodakis. “Object detection via a multi- region and
semantic segmentation-aware cnn model,” in Proc. IEEE Conf. Comput.
Vis., Santiago, Chile, Oct, 2015, pp. 1134-1142.
[36] Z. Zhang, S. Qiao, C. Xie et al., “Single-shot object detection with
enriched semantics,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
Salt Lake City, Utah, USA, Jun. 2018, pp. 1-9.
[37] K. He, G. Gkioxari, P. Dollár et al., “Mask r-cnn,” in Proc. IEEE Conf.
Comput. Vis., Venice Italy, Oct. 2017, pp. 2980-2988.
[38] K. He, X. Zhang, S. Ren et al., “Deep residual learning for image
recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Las
Vegas, Nevada, Jun. 2016, pp. 770-778.
[39] K. Simonyan, A. Zisserman. “Very deep convolutional networks for
large-scale image recognition,” in Proc. Int. Conf. Learn. Rep., San Diego,
CA, May, 2015, pp. 1-14.
[40] O. Russakovsky, J. Deng, H. Su et al., “Imagenet large scale visual
recognition challenge,” Int. J. Comput. Vis., vol. 115, no. 3, pp. 211-252,
Apr. 2015.
[41] M. Everingham, L. VanGool et al., “The pascal visual object classes (voc)
challenge,” Int. J. Comput. Vis., vol. 88, no. 2, pp. 303-338, Sep. 2009.
[42] T. Y. Lin, M. Maire, S. Belongie et al., “Microsoft coco: Common objects
in context,” in Proc. Eur. Conf. Comput. Vis., Switzerland, Zurich, Sep.
2014, pp. 740-755.
[43] K. He, X. Zhang, S. Ren et al., “Delving deep into rectifiers: Surpassing
human-level performance on imagenet classification,” in Proc. IEEE Conf.
Comput. Vis., Santiago, Chile, Oct, 2015, pp. 1026-1034.
[44] P. Hu, D. Ramanan, “Finding tiny faces,” in Proc. IEEE Conf. Comput.
Vis. Pattern Recognit., Hawaii, USA, Jul. 2017, pp. 1522-1530.