Intelligent traffic city management from surveillance systems (CERTH-ITI)

Konstantinos Avgerinakis, Panagiotis Giannakeris, Alexia Briassouli, Anastasios Karakostas, Stefanos Vrochidis, Ioannis Kompatsiaris
Centre for Research and Technology Hellas (CERTH) - Information Technologies Institute (ITI)
koafgeri@iti.gr, giannakeris@iti.gr, abria@iti.gr, stefanos@iti.gr, akarakos@iti.gr, ikom@iti.gr
Abstract: Surveillance and, more specifically, traffic management technologies constitute one of the most intriguing aspects of smart city applications. In this work we investigate the applicability of an object detector for vehicle detection and propose a novel hybrid shallow-deep representation to surpass its limits. Furthermore, we leverage the detector's output to localize new vehicles and track them for as long as they remain in the video scene. The detection and tracking system is then evaluated and compared with other State-of-the-Art algorithms on the newly developed NVIDIA AI City datasets.
I. INTRODUCTION
Smart city technologies, and more specifically assistive transportation and safe driving, make up one of the most intriguing domains of computer science and have attracted significant attention during the last decade. Closed-Circuit Television (CCTV) systems and other types of visual monitoring infrastructure provide vast amounts of potentially useful data for optimal management of traffic, safety in crowded urban environments, mitigation of traffic issues in adverse weather conditions and numerous other applications of traffic monitoring. Moreover, growing industry trends towards autonomous driving, vehicles, and transportation in general are changing the landscape of traffic management. The data from traffic cameras (static, mobile, drone-based and others, depending on the application) will, in the near future, also be used to manage autonomous vehicle navigation by sending information about events elsewhere in the city, traffic conditions and pedestrian congestion, so as to optimally guide vehicles [16].
The wealth of information in these videos remains inaccessible without automated analysis, as manual extraction of information from them is very time-consuming and cumbersome. To this end, numerous computer vision and machine learning solutions have been developed for automatic object detection and tracking in traffic, abnormal event detection in videos of crowded scenes (e.g. pedestrians, traffic), human activity recognition and others. The improving accuracy of visual analysis of traffic videos and its decreasing computational cost, in combination with the increasing use of GPUs, have been facilitating the automation of traffic video analysis in recent years. However, the large amounts of surveillance data, in addition to the lack of annotations for these datasets, create obstacles for the development of automated analysis methods. For this reason, large-scale annotation efforts and real-world benchmarking challenges and competitions are necessary to help researchers develop novel, highly efficient and useful solutions in this domain.
In this work, CERTH-ITI investigates the performance of a State-of-the-Art (SoA) object detector [14] and proposes a novel hybrid one, namely DeepHOG, that combines shallow with deep representation schemes in order to improve on the detection performance of the former. Furthermore, CERTH-ITI introduces a tracking algorithm that uses DeepHOG's bounding boxes to localize new vehicles in the video scene and monitor them for as long as they remain inside it. We firmly believe that shallow descriptors, and more specifically encodings of the relations amongst them (e.g. bag-of-words, Fisher vectors), can be combined with SoA deep learning techniques, such as Faster R-CNN, to give a substantial boost to the representation framework. CERTH-ITI hopes that the proposed framework will help tackle traffic congestion, safety and security issues related to traffic and urban areas, and contribute to the development of safer, more liveable and enriching smart urban environments. This work was presented in the AI City Challenge [12] by IEEE Smart World and NVIDIA, which also included an annotation phase that generated more than 150,000 annotated video keyframes and 1.4 million annotations, producing the NVIDIA AI City dataset.
II. RELATED WORK
Vehicle detection constitutes an essential subcategory of object detection and generally follows the same framework: (a) object localization is performed first, so as to find the regions of interest (i.e. multi-scale bounding boxes) inside each image or video frame; (b) object representation then describes the information contained in these regions, and machine learning is finally deployed to discriminate between classes.
As far as localization is concerned, earlier techniques followed the computationally expensive sliding window paradigm, but they have recently been replaced by selective search [18] and techniques that deploy multi-scale bounding box proposals [5], [23] instead of exhaustive dense searching in the image scene. Similar results have been achieved by the objectness measure [1] and its computationally efficient counterpart, named BING [2], while geodesic object proposals [8] achieved among the highest detection performance (i.e. recall) even when the requested number of candidate proposals was small.
Fig. 1: Block diagram of DeepHOG, our novel representation scheme.
Vehicle representation uses the localization outcome in order to describe the objects that exist inside these areas, such as cars, trucks and pedestrians. Until recently, this would require the extraction of local histograms (e.g. HOG, LBP, HOGgles) that encode light intensity [17], texture [20] and the existence of specific shapes or other image features [19]. More recently, SoA techniques in this domain turned their attention to deep convolutional neural networks to represent vehicles inside images. They train the parameters of their models on large datasets, like COCO [10] or the 1000-class ImageNet [15], and then retrain the weights and parameters of the model on vehicle-tailored datasets, such as UA-DETRAC [21].
As far as deep convolutional techniques are concerned, several works in the literature deal with object detection. First and foremost are Fast R-CNN [4] and Faster R-CNN [14], which leverage deep convolutional representation schemes to achieve robust and highly accurate object detection. An interesting modification of Fast R-CNN was proposed in [22], where a MultiPath network predicts segmentation masks in addition to bounding boxes. Position-sensitive score maps in a fully convolutional network [3], on the other hand, enabled the full adoption of the ResNet architecture for the purposes of object detection. Recent SoA work has turned its attention to developing faster, rather than more accurate, techniques, such as YOLO [13], SSD [11] and Relief R-CNN [9], which generates proposals from convolutional features by simple rules.
III. METHODOLOGY
Traffic management in our methodology is accomplished by combining vehicle detection with a multi-target tracking algorithm. We investigate a SoA object detection scheme, namely Faster R-CNN, and based on its limitations we propose a novel hybrid representation (i.e. DeepHOG) that leverages shallow, mid-level and deep representations to overcome these limitations and achieve better recognition performance. Bounding boxes containing a detected vehicle or object are then used by our multi-target tracker to localize and monitor them for as long as they remain in the video scene. Fig. 1 shows the framework that is followed to tackle the detection and tracking issues.
A. Vehicle detection
Vehicle detection in this work is deployed by following two approaches: (a) CERTH-RCNN and (b) DeepHOG. An ensemble framework is also investigated as a complementary approach, so as to study the impact of fusing the two. Object localization in both cases is performed by using the Region Proposal Network (RPN) of CERTH-RCNN.
CERTH-RCNN uses a modification of the original Faster R-CNN [14], the Faster-RCNN-Resnet101 architecture. We chose this technique because it was found to be one of the most accurate and computationally efficient models among the current SoA [7]. This is attributed to the fact that its Region Proposal Network shares full-image convolutional features with the detection network, so that object proposals are localized at almost no extra cost before each proposal is classified. We used the Faster-RCNN-Resnet101 model pretrained on the COCO dataset and fine-tuned it on the NVIDIA AI City dataset, so as to detect vehicles in video frames.
For DeepHOG vehicle detection, CERTH deployed a novel hybrid representation scheme that combines shallow with deep features. More specifically, we used Histograms of Oriented Gradients (HOG) [17] as local appearance features to represent pedestrian and vehicle objects and encoded them into a Fisher vector. This resulted in a powerful mid-level representation vector that preserves the relative difference of each object to the most discriminant ones; this vector was then provided to a neural network in order to train a highly accurate hybrid feature representation.
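
The Fisher vector step can be illustrated with a minimal sketch. It assumes a diagonal-covariance Gaussian mixture model whose parameters are already fitted, and it uses random vectors as a stand-in for real HOG descriptors; the function name `fisher_vector`, the mixture size and the power and L2 normalisation are illustrative assumptions, not details given in the paper.

```python
import numpy as np

def fisher_vector(descriptors, weights, means, variances):
    """Encode N local descriptors (N x D) with a K-component diagonal GMM.

    weights: (K,), means: (K, D), variances: (K, D). Returns a vector of
    length 2*K*D (gradients w.r.t. means and standard deviations).
    """
    n = descriptors.shape[0]
    diff = descriptors[:, None, :] - means[None, :, :]           # (N, K, D)
    # Log-density of every descriptor under every Gaussian component.
    log_prob = -0.5 * (np.sum(diff ** 2 / variances[None], axis=2)
                       + np.sum(np.log(2.0 * np.pi * variances), axis=1)[None, :])
    log_post = np.log(weights)[None, :] + log_prob
    log_post -= log_post.max(axis=1, keepdims=True)               # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                       # soft assignments

    norm_diff = diff / np.sqrt(variances)[None]                   # (N, K, D)
    g_mu = np.einsum('nk,nkd->kd', post, norm_diff)
    g_mu /= n * np.sqrt(weights)[:, None]
    g_sigma = np.einsum('nk,nkd->kd', post, norm_diff ** 2 - 1.0)
    g_sigma /= n * np.sqrt(2.0 * weights)[:, None]

    fv = np.concatenate([g_mu.ravel(), g_sigma.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                        # power normalisation
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv                          # L2 normalisation

# Stand-in data: 200 random "HOG-like" descriptors of dimension 36 (assumed sizes).
rng = np.random.default_rng(0)
K, D = 4, 36
descriptors = rng.normal(size=(200, D))
fv = fisher_vector(descriptors, np.full(K, 1.0 / K),
                   rng.normal(size=(K, D)), np.ones((K, D)))
print(fv.shape)  # one fixed-length vector per object, whatever the descriptor count
```

The point of the encoding is that a variable number of local descriptors per bounding box becomes one fixed-length vector, which is what a neural network classifier needs as input.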
The ensemble model decides the class of each box based on the most confident score given by either model. In this way we hope to expose and exploit any complementary behaviour of the two models.
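
A minimal sketch of this fusion rule, assuming each model returns a dictionary of per-class scores for the same box. The function name `ensemble_label` and the tie-breaking in favour of the R-CNN are hypothetical choices made for illustration.

```python
def ensemble_label(rcnn_scores, deephog_scores):
    """Return (label, score, source) of the most confident prediction for one box.

    Both arguments map class names to confidence scores for the same box.
    Ties go to the R-CNN prediction (an assumption, not stated in the paper).
    """
    best_rcnn = max(rcnn_scores.items(), key=lambda kv: kv[1])
    best_hog = max(deephog_scores.items(), key=lambda kv: kv[1])
    if best_rcnn[1] >= best_hog[1]:
        return best_rcnn[0], best_rcnn[1], "rcnn"
    return best_hog[0], best_hog[1], "deephog"

# Example: DeepHOG is more confident, so its class wins for this box.
print(ensemble_label({"car": 0.6, "suv": 0.3}, {"car": 0.2, "suv": 0.7}))
```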
B. Vehicle tracking
CERTH-KCF is based on the tracking algorithm that was proposed in [6]. Vehicle detection is deployed every W_detect = 3 video frames and, in conjunction with the KCF tracking algorithm, is responsible for monitoring the vehicles that exist in the video scene. For that purpose, a vehicle database is built to record the vehicle ids (veh_id): space is created for each newly detected vehicle and retained until the vehicle disappears from the scene (i.e. when the algorithm cannot detect the veh_id for more than W_track = 3 sequential video frames). Along with the veh_id, the corresponding vehicle class (veh_label) and its detection score (veh_score) are also retained in this structure. Overlapping bounding boxes are also handled by CERTH-KCF, which allows the creation of a new veh_id only when the intersection-over-union score with the already existing bounding boxes is lower than 70%. To tackle occlusions between different veh_id entries, CERTH-KCF merges the two ids at the current frame, keeps the oldest veh_id and discards the other.
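
The bookkeeping of the vehicle database can be sketched as follows. The sketch reads the 70% intersection-over-union rule as a duplicate filter (a detection opens a new id only if it does not overlap an already tracked box by at least that much), counts missed detections per call to `update`, and leaves out both the KCF propagation of boxes between detection passes and the occlusion merge. The class and function names are hypothetical.

```python
from dataclasses import dataclass

W_TRACK = 3      # missed detection passes tolerated before an id is dropped
IOU_NEW = 0.70   # overlap threshold used when deciding whether to open a new id

@dataclass
class Vehicle:
    veh_id: int
    box: tuple        # (x1, y1, x2, y2)
    veh_label: str
    veh_score: float
    missed: int = 0

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

class VehicleDatabase:
    def __init__(self):
        self.vehicles = {}
        self.next_id = 0

    def update(self, detections):
        """detections: list of (box, label, score) from one detection pass."""
        seen = set()
        for box, label, score in detections:
            # Find the tracked vehicle that overlaps this detection the most.
            best_id, best_iou = None, 0.0
            for vid, veh in self.vehicles.items():
                overlap = iou(box, veh.box)
                if overlap > best_iou:
                    best_id, best_iou = vid, overlap
            if best_id is not None and best_iou >= IOU_NEW:
                veh = self.vehicles[best_id]          # same vehicle: refresh it
                veh.box, veh.veh_label, veh.veh_score, veh.missed = box, label, score, 0
                seen.add(best_id)
            else:                                     # low overlap: new vehicle
                self.vehicles[self.next_id] = Vehicle(self.next_id, box, label, score)
                seen.add(self.next_id)
                self.next_id += 1
        for vid in list(self.vehicles):
            if vid not in seen:
                self.vehicles[vid].missed += 1
                if self.vehicles[vid].missed > W_TRACK:
                    del self.vehicles[vid]            # vehicle left the scene

db = VehicleDatabase()
db.update([((0, 0, 10, 10), "car", 0.9)])    # opens id 0
db.update([((1, 0, 11, 10), "car", 0.8)])    # IoU is about 0.82, so id 0 is refreshed
db.update([((50, 50, 60, 60), "suv", 0.7)])  # no overlap, so id 1 is opened
print(sorted(db.vehicles))
```

In a full tracker each stored box would be moved by a KCF tracker on the frames between detection passes, so that the overlap test compares the detection with the vehicle's current position.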
IV. EXPERIMENTAL WORK
Our detection algorithm was evaluated on the NVIDIA AI City datasets aic480, aic540 and aic1080. The datasets consist of video frames acquired from surveillance cameras that depict intersections in urban areas under diverse weather conditions, during daytime and nighttime.
A. Parameter selection
As far as CERTH-RCNN is concerned, a convolutional feature extractor (Resnet101) is initially applied on the input image so as to obtain high-level features, which are then given as input to the R-CNN to detect the objects it contains. The model is initially trained on the COCO dataset and then fine-tuned on the NVIDIA AI City aic1080 and aic480 datasets, while aic540 shared the same model with aic1080. Localization and classification losses were weighted equally, a maximum of 500 detections was allowed, and online training was performed with a learning rate schedule initialized at 5e-5 and decreased progressively. We found that training for 40,000 steps gave the best results for our model. Argmax was used for classification and Smooth L1 for the localization loss.
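
For reference, the Smooth L1 localization loss and the equal weighting of the two loss terms can be written as a short sketch. It uses the usual definition of Smooth L1 with a transition point of 1; the function names and the example numbers are illustrative assumptions and do not reproduce the training code.

```python
import numpy as np

def smooth_l1(pred, target, delta=1.0):
    """Element-wise Smooth L1: quadratic for small errors, linear for large ones."""
    diff = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    return np.where(diff < delta, 0.5 * diff ** 2 / delta, diff - 0.5 * delta)

def total_loss(loc_loss, cls_loss, loc_weight=1.0, cls_weight=1.0):
    """Equal weights (1.0 and 1.0) mirror the equal weighting described above."""
    return loc_weight * loc_loss + cls_weight * cls_loss

# A small box-offset error is penalised quadratically, a large one linearly.
per_coord = smooth_l1([0.5, 2.0], [0.0, 0.0])
print(per_coord, total_loss(per_coord.sum(), 0.7))
```

The linear branch keeps badly localized boxes from dominating the gradient, which is the usual motivation for preferring Smooth L1 over a plain squared error.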
For the training of DeepHOG's neural network, we used ReLU activation functions on three fully connected (FC) layers with a width of 512 each, applying a dropout chance of 0.8 after every layer and performing Adam optimization with default parameters.
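
A forward pass through such a classifier can be sketched in NumPy. The sketch reads the three 512-wide FC layers as fully connected (dense) layers and makes several assumptions that the text does not state: He initialisation, a softmax output, an input size of 288 and 14 output classes. Dropout is exposed as a drop probability because the text does not say whether 0.8 is the drop or the keep chance, and the Adam training loop is omitted.

```python
import numpy as np

def init_mlp(in_dim, n_classes, width=512, n_hidden=3, seed=0):
    """Create weights for n_hidden dense layers of equal width plus an output layer."""
    rng = np.random.default_rng(seed)
    dims = [in_dim] + [width] * n_hidden + [n_classes]
    # He initialisation (an assumption; the paper does not state one).
    return [(rng.normal(0.0, np.sqrt(2.0 / a), size=(a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def forward(params, x, drop_prob=0.0, rng=None):
    """Return class probabilities; set drop_prob > 0 only while training."""
    if rng is None:
        rng = np.random.default_rng()
    h = x
    for w, b in params[:-1]:
        h = np.maximum(0.0, h @ w + b)               # ReLU activation
        if drop_prob > 0.0:                          # inverted dropout
            keep = 1.0 - drop_prob
            h = h * (rng.random(h.shape) < keep) / keep
    w, b = params[-1]
    logits = h @ w + b
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)  # softmax (assumed output)

# Five stand-in Fisher vectors of length 288 classified into 14 assumed classes.
params = init_mlp(in_dim=288, n_classes=14)
probs = forward(params, np.random.default_rng(0).normal(size=(5, 288)))
print(probs.shape, probs.argmax(axis=1))
```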
B. Results
Observing Tables I, II and III, we can safely say that DeepHOG improved over CERTH-RCNN (RCNN in the tables, to save space) in almost all classes except car, which DeepHOG usually confuses with the SUV category. The ensemble algorithm, though, fused the two outcomes efficiently: its overall score matched or exceeded that of both individual models on all three datasets.
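
The per-class comparison can be checked directly against the reported numbers. The short sketch below copies the RCNN and DeepHOG rows of Table III (aic1080) and lists the classes where DeepHOG scores higher or lower; the helper name `compare` is hypothetical.

```python
CLASSES = ["Car", "SUV", "Van", "Bus", "Bicycle", "Motorcycle", "Small Truck",
           "Medium Truck", "Large Truck", "Pedestrian", "Group Of People",
           "Red Signal", "Green Signal", "Yellow Signal"]
# Per-class scores copied from Table III (aic1080).
RCNN = [0.48, 0.07, 0.01, 0.00, 0.03, 0.01, 0.02,
        0.01, 0.00, 0.00, 0.00, 0.01, 0.00, 0.00]
DEEPHOG = [0.27, 0.18, 0.08, 0.01, 0.23, 0.12, 0.14,
           0.11, 0.02, 0.13, 0.05, 0.05, 0.01, 0.00]

def compare(baseline, candidate, names):
    """Return the class names where the candidate is better and where it is worse."""
    better = [n for n, b, c in zip(names, baseline, candidate) if c > b]
    worse = [n for n, b, c in zip(names, baseline, candidate) if c < b]
    return better, worse

better, worse = compare(RCNN, DEEPHOG, CLASSES)
print(len(better), worse)  # 12 ['Car']
```

On aic1080 DeepHOG is ahead in 12 of the 14 classes, ties on Yellow Signal and trails only on Car, which matches the statement above.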
V. CONCLUSION
The CERTH team proposed a novel detection algorithm, named DeepHOG, which in most cases performed better than the SoA CERTH-RCNN, and we believe that it can become a valuable contribution to the literature once its results are further improved. As future work, CERTH plans to extend its traffic management algorithm so as to enable traffic density classification and crossroad traffic analysis. Spatio-temporal information that leverages motion and appearance features is going to be deployed for the implementation of the former, while the latter will leverage tracking results and vast log files to analyze traffic behavioural patterns over long timelines (i.e. daily, weekly and monthly intersection analysis).
VI. ACKNOWLEDGEMENT
This work was supported by the beAWARE project¹, partially funded by the European Commission under grant agreement No 700475.
¹ http://beaware-project.eu/
REFERENCES
[1] B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of image windows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2189–2202, 2012.
[2] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. Torr. BING: Binarized normed gradients for objectness estimation at 300fps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3286–3293, 2014.
[3] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pages 379–387, 2016.
[4] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[5] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
[6] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):583–596, 2015.
[7] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. CoRR, abs/1611.10012, 2016.
[8] P. Krähenbühl and V. Koltun. Geodesic object proposals. In European Conference on Computer Vision, pages 725–739. Springer, 2014.
[9] G. Li, J. Liu, C. Jiang, L. Zhang, K. Tang, and Z. Zhu. Relief R-CNN: Utilizing convolutional feature interrelationship for fast object detection deployment. arXiv preprint arXiv:1601.06719, 2016.
[10] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755, 2014.
[11] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
[12] M. Naphade, D. C. Anastasiu, A. Sharma, V. Jagrlamudi, H. Jeon, K. Liu, M.-C. Chang, S. Lyu, and Z. Gao. The NVIDIA AI City Challenge. In 2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), SmartWorld'17, Piscataway, NJ, USA, 2017. IEEE.
[13] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
[14] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[15] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[16] H. G. Seif and X. Hu. Autonomous driving in the iCity - HD maps as a key challenge of the automotive industry. Engineering, 2(2):159–162, 2016.
[17] F. Suard, A. Rakotomamonjy, A. Bensrhair, and A. Broggi. Pedestrian detection using infrared images and histograms of oriented gradients. In Intelligent Vehicles Symposium, 2006 IEEE, pages 206–212. IEEE, 2006.
[18] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
[19] C. Vondrick, A. Khosla, T. Malisiewicz, and A. Torralba. HOGgles: Visualizing object detection features. In Proceedings of the IEEE International Conference on Computer Vision, pages 1–8, 2013.
[20] X. Wang, T. X. Han, and S. Yan. An HOG-LBP human detector with partial occlusion handling. In Computer Vision, 2009 IEEE 12th International Conference on, pages 32–39. IEEE, 2009.
[21] L. Wen, D. Du, Z. Cai, Z. Lei, M. Chang, H. Qi, J. Lim, M. Yang, and S. Lyu. UA-DETRAC: A new benchmark and protocol for multi-object tracking. CoRR, abs/1511.04136, 2015.
[22] S. Zagoruyko, A. Lerer, T.-Y. Lin, P. O. Pinheiro, S. Gross, S. Chintala, and P. Dollár. A multipath network for object detection. arXiv preprint arXiv:1604.02135, 2016.
[23] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In European Conference on Computer Vision, pages 391–405. Springer, 2014.
aic480    Car   SUV   Van   Bus   Bicycle  Motorcycle
RCNN      0.54  0.12  0.01  0     0        0
DeepHOG   0.41  0.21  0.03  0.05  0        0
Ensemble  0.50  0.17  0.03  0.01  0        0
aic480    Small Truck  Medium Truck  Large Truck  Pedestrian  Localization  Overall
RCNN      0.03         0.03          0.02         0           0.84          0.15
DeepHOG   0.10         0.10          0.02         0           0.64          0.14
Ensemble  0.08         0.07          0.02         0           0.74          0.15
TABLE I: Test results on aic480 dataset
aic540    Car   SUV   Van   Bus   Bicycle  Motorcycle  Small Truck  Medium Truck
RCNN      0.49  0.09  0.01  0     0.02     0.01        0.02         0.02
DeepHOG   0.26  0.13  0.05  0     0.15     0.05        0.14         0.09
Ensemble  0.49  0.10  0.02  0     0.15     0.04        0.08         0.05
aic540    Large Truck  Pedestrian  Group Of People  Red Signal  Green Signal  Yellow Signal  Localization  Overall
RCNN      0.01         0           0                0           0             0              0.73          0.09
DeepHOG   0.01         0           0.01             0           0             0              0.56          0.10
Ensemble  0.01         0           0.01             0           0             0              0.70          0.11
TABLE II: Test results on aic540 dataset
aic1080   Car   SUV   Van   Bus   Bicycle  Motorcycle  Small Truck  Medium Truck
RCNN      0.48  0.07  0.01  0     0.03     0.01        0.02         0.01
DeepHOG   0.27  0.18  0.08  0.01  0.23     0.12        0.14         0.11
Ensemble  0.48  0.11  0.06  0     0.22     0.10        0.09         0.08
aic1080   Large Truck  Pedestrian  Group Of People  Red Signal  Green Signal  Yellow Signal  Localization  Overall
RCNN      0            0           0                0.01        0             0              0.58          0.08
DeepHOG   0.02         0.13        0.05             0.05        0.01          0              0.38          0.12
Ensemble  0.01         0.13        0.05             0.05        0.01          0              0.47          0.12
TABLE III: Test results on aic1080 dataset