Learning Pedestrian Detection from Virtual
Worlds
Giuseppe Amato1, Luca Ciampi1, Fabrizio Falchi1, Claudio Gennaro1, and
Nicola Messina1
Institute of Information Science and Technologies (ISTI), Italian National Research
Council (CNR), Via G. Moruzzi 1, 56124 Pisa, Italy
name.surname@isti.cnr.it
Abstract. In this paper, we present a real-time pedestrian detection system that has been trained using a virtual environment. This is a very popular research topic with countless practical applications, and recently there has been increasing interest in deep learning architectures for performing such a task. However, the availability of large labeled datasets is a key point for the effective training of such algorithms. For this reason, in this work we introduce ViPeD, a new synthetically generated set of images extracted from a realistic 3D video game, where the labels can be automatically generated by exploiting 2D pedestrian positions extracted from the graphics engine. We exploit this new synthetic dataset to fine-tune a state-of-the-art, computationally efficient Convolutional Neural Network (CNN). A preliminary experimental evaluation, comparing against other existing approaches trained on real-world images, shows encouraging results.
1 Introduction
Pedestrian detection remains a very popular research topic with countless practical applications. An important application domain is certainly video surveillance for public security, such as crime prevention, identification of vandalism, etc. A real-time response in the case of an incident, however, requires manual observation of the video stream, which is in most cases not economically feasible.
We propose a real-time CNN-based solution that is able to localize pedestrian instances in images captured by smart cameras. CNNs are a popular choice for current object detectors since they are able to automatically learn features characterizing the objects themselves; in recent years, these solutions have outperformed approaches relying on hand-crafted features.
The great challenge we must address using CNNs is the ability of these
algorithms to generalize to new scenarios having different characteristics, like
different perspectives, illuminations, and object scales. This is a must when we are dealing with smart devices that should be easily installed and deployed, without the need for an initial tuning phase. Therefore, the availability of large
labeled training datasets that cover as much as possible the differences between
various scenarios is a key point for training state-of-the-art CNNs. Although
there are some large annotated generic datasets, such as ImageNet [1] and MS
COCO [2], annotating the images is a very time-consuming operation, since it
requires great human effort, and it is error-prone. Furthermore, sometimes it is
also problematic to create a training/testing dataset with specific characteristics.
A possible solution to this problem is to create a suitable dataset by collecting images from virtual-world environments that mimic as much as possible all the characteristics of our target real-world scenario. In this paper, we introduce a
new dataset named ViPeD (Virtual Pedestrian Dataset), a large collection of
images taken from the highly photo-realistic video game GTA V (Grand Theft Auto V) developed by Rockstar North, which extends the JTA (Joint Track Auto) dataset presented in [3]. We demonstrate that we can improve performance and
achieve competitive results compared to the state-of-the-art approaches in the
pedestrian detection task.
In particular, we train a state-of-the-art object detector, YOLOv3 [4], over
the newly introduced ViPeD dataset. Then, we test the trained detector on
the MOT17 detection dataset (MOT17Det) [5], a real-world dataset suited for
pedestrian detection, in order to measure the generalization capabilities of the
proposed solution with respect to real-world scenarios.
To summarize, in this work we propose a real-time CNN-based system able
to detect pedestrians for surveillance smart cameras. We train the algorithm
using a new dataset collected using images from a realistic video game and we
take advantage of the graphics engine for extracting the annotations without any
human intervention. Finally, we evaluate the proposed method on a real-world dataset, demonstrating its effectiveness and robustness in unseen scenarios.
2 Related Work
In this section, we review the most important works in object and pedestrian de-
tection. We also analyze previous studies on using synthetic datasets as training
sets. Pedestrian detection is highly related to object detection. It deals with rec-
ognizing the specific class of pedestrians, usually walking in urban environments.
Approaches for tackling the pedestrian detection problem are usually subdivided
into two main research areas. The first class of detectors is based on handcrafted
features, such as ICF (Integral Channel Features) [6–10]. These methods usually offer higher computational efficiency, at the cost of lower accuracy. On the other hand, deep neural network approaches have been explored: [11–14] proposed modifications of the standard CNN architecture [15] in order to detect pedestrians, even accounting for different scales.
Many datasets are available for pedestrian detection. Caltech [16], MOT17Det
[5], INRIA [17], and CityPersons [18] are among the most important ones. Since
they were collected in different living scenarios, they are intrinsically very hetero-
geneous datasets. Some of them [16, 17] were specifically collected for detecting
pedestrians in self-driving contexts. Our interest, however, is mostly concentrated on video-surveillance tasks and, in this scenario, the recently introduced MOT17Det dataset has proved to be sufficiently challenging due to the high variability of the video subsets. State-of-the-art results on this dataset are reached
by [13]. With the need for huge amounts of labeled data, generated datasets
have recently gained great interest. [19, 20] have studied the possibility of learn-
ing features from synthetic data, validating them on real scenarios. Unlike our
work, however, they did not explore deep learning approaches. [21, 22] focused on the possibility of performing domain adaptation in order to map virtual features onto real ones. Authors in [3] created a dataset taking images
from the highly photo-realistic video game GTA V and demonstrated that it
is possible to reach excellent results on tasks such as people tracking and pose
estimation when validating on real data.
To the best of our knowledge, [23] and [24] are the works closest to our setup.
In particular, [23] also used GTA V as the virtual world but, unlike our method,
they used Faster-RCNN [25] and they concentrated on vehicle detection.
Instead, [24] used a synthetically generated dataset to train a simple convo-
lutional network to detect objects belonging to various classes in a video. The
convolutional network dealt only with the classification, while the detection of
objects relied on a background subtraction algorithm based on Gaussian mix-
ture models (GMMs). The real-world performance was evaluated on two common
pedestrian detection datasets, and one of these (MOTChallenge 2015 [26]) is an
older version of the dataset we used to carry out our experimentation.
3 The ViPeD Dataset
In this section, we describe the datasets exploited in this work. First, we introduce ViPeD (Virtual Pedestrian Dataset), a new virtual collection used for training the network. Then we outline MOT17Det [5], a real dataset employed for the evaluation of our proposed solution. Finally, we illustrate CityPersons [18], a real-world dataset for pedestrian detection that we used as a baseline. In order to show the validity of ViPeD, we have compared our network trained with CityPersons against the same network trained with ViPeD.
3.1 ViPeD - Virtual Pedestrian Dataset
As mentioned above, CNNs need large annotated datasets during the training
phase in order to learn models robust to different scenarios, and creating the
annotations is a very time-consuming operation that requires a great human
effort.
The main contribution of this paper is the creation of ViPeD, a huge collection of images taken from the highly photo-realistic video game GTA V developed by Rockstar North. This newly introduced dataset extends the JTA (Joint
Track Auto) dataset presented in [3]. Since we are dealing with images collected from a virtual world, we can extract pedestrian bounding boxes for free and without manual human effort, exploiting 2D pedestrian positions extracted from the video card. The dataset includes a total of about 500K images, extracted from 512 full-HD videos (256 for training and 256 for testing) of different urban scenarios.
In the following, we report some details on the construction of the bounding
boxes and on the data augmentation procedure that we used to extend the JTA
dataset for the pedestrian detection task.
A) Bounding Boxes: Since JTA is specifically designed for pedestrian pose esti-
mation and tracking, the provided annotations are not directly suitable for the
pedestrian detection task. In particular, the annotations included in JTA are
related to the joints of the human skeletons present in the scene (Fig. 1a), while
what we need for our task are the coordinates of the bounding boxes surrounding
each pedestrian instance.
Bounding box estimation can be addressed using different approaches. The
GTA graphic engine is not publicly available, so it is not easy to extract the
detailed masks around each pedestrian instance; [23] overcame this issue by
extracting semantic masks and separating the instances by exploiting depth
information. Instead, our approach exploits the skeleton annotations already extracted by the JTA team in order to reconstruct precise bounding boxes.
This seems to be a more reliable solution than the depth separation approach,
especially when instances are densely distributed, as in the case of crowded
pedestrian scenarios.
The very basic setup consists of drawing the smallest bounding box that
encloses all the skeleton joints. The main issue with this simple approach is
that each bounding box perfectly contains the skeleton, but not the pedestrian
mesh. Indeed, we can note that the mesh is always larger than the skeleton
(Fig. 1b). We solved this problem by estimating a pad for the skeleton bounding
box, exploiting additional information produced by the GTA graphics engine and already present in JTA, i.e., the distance of all the pedestrians in the scene from the camera.
In particular, the height of the $i$-th mesh, denoted as $h_m^i$, can be estimated from the height of the $i$-th skeleton $h_s^i$ by means of the formula:

$$h_m^i = h_s^i + \frac{\alpha}{z_i} \qquad (1)$$

where $z_i$ is the distance of the $i$-th pedestrian's center of mass from the camera, and $\alpha$ is a parameter that depends on the camera projection matrix.
Given that $z_i$ is already available for every pedestrian, we estimated the parameter $\alpha$ by manually annotating 30 random pedestrians, obtaining for them the correct value of $h_m^i$, and then performing linear regression. We visually checked that the estimate of $\alpha$ was accurate even for all the other, non-manually annotated pedestrians.
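To make the regression step concrete, the following is a minimal NumPy sketch of how $\alpha$ in Eq. (1) could be estimated by least squares from a handful of manually annotated pedestrians; the function name and the sample values are purely illustrative and not taken from the paper.

```python
import numpy as np

def estimate_alpha(h_s, z, h_m):
    """Least-squares estimate of alpha in Eq. (1): h_m = h_s + alpha / z.

    h_s : skeleton heights (pixels) of the manually annotated pedestrians
    z   : distances of their centers of mass from the camera (meters)
    h_m : manually measured mesh heights (pixels)
    """
    x = 1.0 / np.asarray(z, dtype=float)                              # regressor: 1 / z_i
    y = np.asarray(h_m, dtype=float) - np.asarray(h_s, dtype=float)   # target: h_m - h_s
    return float(np.sum(x * y) / np.sum(x * x))                       # slope of a zero-intercept fit

# Illustrative values only (not from the paper): three annotated pedestrians.
alpha = estimate_alpha(h_s=[180, 90, 45], z=[5.0, 10.0, 20.0], h_m=[220, 110, 55])
print(f"estimated alpha: {alpha:.1f}")
```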
Learning Pedestrian Detection from Virtual Worlds 5
Fig. 1: (a) Pedestrians in the JTA dataset with their skeletons. (b) Examples of
annotations in the ViPeD dataset; original bounding boxes are in yellow, while
the sanitized ones are in light blue.
We then estimated the mesh width $w_m^i$. Unlike the height, the width is strongly linked to the specific pedestrian pose, so it is difficult to estimate with only the camera distance information. We decided to estimate $w_m^i$ directly from $h_m^i$, assuming no change in the aspect ratio between the original and adjusted bounding boxes:

$$w_m^i = h_m^i \, \frac{w_s^i}{h_s^i} = h_m^i \, r^i \qquad (2)$$

where $r^i$ is the aspect ratio of the $i$-th bounding box. Examples of the final estimated bounding boxes are shown in Fig. 1b.
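As an illustration of how Eqs. (1) and (2) turn a skeleton into a sanitized box, here is a small Python sketch; the function name, the joint-array layout, and the choice of centering the padded box on the skeleton center are our own assumptions for exposition, not details stated in the paper.

```python
import numpy as np

def sanitize_bbox(joints_xy, z, alpha):
    """Pad the tight skeleton box into a "sanitized" pedestrian box.

    joints_xy : (N, 2) array of the 2D joint coordinates of one pedestrian
    z         : distance of the pedestrian's center of mass from the camera
    alpha     : camera-dependent parameter estimated by linear regression
    Returns (x_min, y_min, width, height) of the adjusted bounding box.
    """
    joints_xy = np.asarray(joints_xy, dtype=float)
    x_min, y_min = joints_xy.min(axis=0)
    x_max, y_max = joints_xy.max(axis=0)
    w_s, h_s = x_max - x_min, y_max - y_min            # tight skeleton box
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0

    h_m = h_s + alpha / z                              # Eq. (1): padded height
    w_m = h_m * (w_s / h_s)                            # Eq. (2): keep aspect ratio
    # Assumption: distribute the pad symmetrically around the skeleton center.
    return (cx - w_m / 2.0, cy - h_m / 2.0, w_m, h_m)

# Illustrative usage with a fake four-joint skeleton 8 meters from the camera.
print(sanitize_bbox([(100, 50), (110, 120), (95, 180), (105, 180)], z=8.0, alpha=200.0))
```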
Finally, we performed a global analysis of these new annotations. As we can see in Fig. 2, the dataset contains annotations of pedestrians farther than 30-40 meters from the camera. However, we observed that human annotators tend to avoid annotating objects farther than this distance. We performed this analysis by measuring the height of the smallest bounding boxes in the human-annotated MOT17Det dataset [5] and finding at what distance from the camera the bounding boxes in our dataset reach this limit size. Therefore, in order to obtain annotations comparable to real-world human-annotated ones, we decided to prune all the pedestrian annotations farther than 40 meters from the camera.
From this point on, we will refer to the basic skeleton bounding boxes as
original bounding boxes. Instead, we will refer to the bounding boxes processed
by means of the previously described pipeline as sanitized (Fig. 1b).
B) Data Augmentation: Synthetic datasets should contain scenarios as close as
possible to real-world ones. Even though images grabbed from the GTA game
were already very realistic, we noticed some missing details. In particular, images
grabbed from the game are very sharp, edges are very pronounced and common
lens effects are missing. In light of this, we prepared a more realistic version of
the original images.
Fig. 2: Histogram of distances between pedestrians and cameras.
We used the GIMP image manipulation software in batch mode to modify every image of the original dataset, applying a set of different filters: radial blur, Gaussian blur, bloom effect, and exposure/contrast adjustment. Parameters for these effects are randomly sampled from a uniform distribution.
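The paper implements these filters as GIMP batch scripts; as a rough, hedged approximation of the same idea, the Python/Pillow sketch below applies Gaussian blur, a crude bloom (blending with a blurred and brightened copy of the image), and exposure/contrast jitter with uniformly sampled parameters. The parameter ranges are illustrative, and radial blur is omitted because Pillow has no built-in equivalent.

```python
import random
from PIL import Image, ImageEnhance, ImageFilter

def make_realistic(path_in, path_out):
    """Soften a rendered frame so it looks more like a real camera capture."""
    img = Image.open(path_in).convert("RGB")

    # Gaussian blur to attenuate the overly sharp edges of the game rendering.
    img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.0, 1.5)))

    # Rough bloom: blend with a blurred, brightened copy of the image itself.
    glow = ImageEnhance.Brightness(img.filter(ImageFilter.GaussianBlur(4))).enhance(1.3)
    img = Image.blend(img, glow, alpha=random.uniform(0.0, 0.3))

    # Random exposure/contrast jitter.
    img = ImageEnhance.Brightness(img).enhance(random.uniform(0.8, 1.2))
    img = ImageEnhance.Contrast(img).enhance(random.uniform(0.8, 1.2))

    img.save(path_out)
```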
3.2 MOT17Det
We evaluate our solution using the recently introduced MOT17Det dataset [5], a
collection of challenging images for pedestrian detection taken from 14 sequences
with various crowded scenarios having different viewpoints, weather conditions,
and camera motions. The annotations for all the sequences are generated by
human annotators from scratch, following a specific protocol described in their
paper. The training images are taken from sequences 2, 4, 5, 9, 10, 11 and 13
(for a total of 5,316 images), while test images are taken from the remaining
sequences (for a total of 5,919 images). It should be noted that the authors
released only the ground-truth annotations belonging to the training subset.
The performance metrics concerning the test subset are instead available only by submitting results to the MOT17Det Challenge¹.
3.3 CityPersons
In order to compare our solution trained using synthetic data against the same
network trained with real images, we have also considered the CityPersons
dataset [18], a recent collection of images of interest for the pedestrian detec-
tion community. It consists of a large and diverse set of stereo video sequences
recorded in streets from different cities in Germany and neighboring countries.
In particular, authors provide 5,000 images from 27 cities labeled with bounding
boxes and divided across train/validation/test subsets.
¹ https://motchallenge.net/data/MOT17Det/
4 Method
We use YOLOv3 [4] as the object detector architecture, exploiting the original Darknet [27] implementation. The YOLOv3 architecture jointly performs a regression of the bounding box coordinates and a classification for every proposed region. Unlike other techniques, YOLOv3 performs these tasks in an optimized fully-convolutional pipeline that takes pixels as input and outputs both the bounding boxes and their respective proposed categories. It is particularly robust to scale variance since it performs the detections at three different scales, down-sampling the input image by factors of 32, 16, and 8.
As a starting point, we considered a model of YOLO pre-trained on the COCO dataset [2], a large dataset composed of images describing complex everyday scenes of common objects in their natural context, categorized into 80 different classes. Since this network is a generic object detector, we then specialized it to recognize and localize object instances belonging to a specific category, i.e. the pedestrian category in our case.
Our goal is to evaluate the detector when it is trained with synthetic data.
For this reason, we need to partially retrain the architecture to include new
information deriving from a different domain.
In this particular work, domain adaptation between virtual and real scenarios
is simply carried out by fine-tuning the pre-trained YOLOv3 architecture. In
particular, we first extract the weights of the first 81 layers of the pre-trained
model, since these layers capture universal features (like curves and edges) that
are also relevant to our new problem. Then, we fine-tune YOLO, initializing the first 81 layers with the previously extracted weights and the weights associated with the remaining layers at random. In this way, we get the network to focus
on learning the dataset-specific features in the last layers. All the weights are
left unfrozen, so they can be adjusted by the back-propagation algorithm. With
this technique, we are forcing the architecture to adjust the learned features to
match those from the destination dataset.
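The sketch below illustrates the partial weight transfer and unfrozen fine-tuning described above in PyTorch-like pseudocode. The `pytorch_yolov3` module, the `Darknet` class, the `load_darknet_weights(..., cutoff=81)` helper, the config file name, and `train_loader` are hypothetical placeholders standing in for an actual YOLOv3 port (the paper itself uses the original Darknet C implementation), so only the overall procedure, not the names, should be taken literally.

```python
import torch
# Hypothetical YOLOv3 port; these names are placeholders, not the paper's code.
from pytorch_yolov3 import Darknet

model = Darknet("cfg/yolov3-viped.cfg")                   # single-class (pedestrian) config
model.load_darknet_weights("yolov3.weights", cutoff=81)   # first 81 layers from the COCO model
# Layers after the 81st keep their random initialization from the constructor,
# so the dataset-specific detection layers are learned from scratch.

# All weights stay unfrozen: back-propagation may adjust every layer.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

for images, targets in train_loader:                      # ViPeD training batches
    loss = model(images, targets)                         # YOLO loss on the batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```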
5 Experimental Evaluation
We evaluate our solution in two different cases: first, in order to test the general-
ization capabilities, we train the detector using only our new synthetic dataset;
then, in order to obtain the best results on the MOT17Det dataset and compare
them with the state-of-the-art, we evaluate detections after fine-tuning the de-
tector also on the MOT17Det dataset itself.
Since the authors did not release the ground-truth annotations belonging to
the test subset, we submitted our results to the MOT17Det Challenge in order
to obtain the performance metrics. In order to prevent overfitting during the
training in the second scenario, we create a validation split from the training
subset considering a randomly chosen sequence. For the first scenario, instead,
we validate on the full training set of MOT17Det.
Following other object detection benchmarks, we use Precision, Recall, and Average Precision (AP) as performance metrics. A key parameter in all these metrics is the intersection-over-union (IoU) threshold, which determines whether a bounding box is matched to an annotation or not, i.e. whether it is a true positive or a false positive.
Precision and Recall are defined as:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad \mathrm{Recall} = \frac{TP}{TP + FN} \qquad (3)$$

where $TP$ are the True Positives, $FP$ the False Positives, and $FN$ the False Negatives. Average Precision is instead defined as the average of the maximum precisions at different recall values.
It is fairly common to observe detection algorithms compared under different thresholds, and there are often many variables and implementation details that differ between evaluation scripts and that may affect results significantly. In this work, we consider only the MOT17Det and COCO performance evaluators. We also use the standard IoU threshold value of 0.5.
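To make the relationship between IoU matching and these metrics explicit, the following self-contained sketch computes Precision, Recall, and a simplified interpolated AP for the detections of a single image. The official MOT17Det and COCO evaluators differ in several details (accumulation over the whole test set, handling of ignored regions, COCO's fixed-point interpolation), so this is only an approximation for exposition.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def evaluate(detections, ground_truth, iou_thr=0.5):
    """Greedy matching of score-sorted detections against unmatched ground truth.

    detections   : list of (score, box) pairs for one image
    ground_truth : list of annotated boxes for the same image
    Returns (precision, recall, average_precision).
    """
    if not detections:
        return 0.0, 0.0, 0.0
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    matched = [False] * len(ground_truth)
    tps = np.zeros(len(detections))
    for i, (_, box) in enumerate(detections):
        ious = [0.0 if matched[j] else iou(box, gt) for j, gt in enumerate(ground_truth)]
        if ious and max(ious) >= iou_thr:
            j = int(np.argmax(ious))
            matched[j], tps[i] = True, 1.0    # detection becomes a true positive
    tp_cum = np.cumsum(tps)
    recall = tp_cum / max(len(ground_truth), 1)
    precision = tp_cum / np.arange(1, len(detections) + 1)
    # AP: average of the maximum precision reached at each achieved recall level.
    ap = float(np.mean([precision[recall >= r].max() for r in np.unique(recall)]))
    return float(precision[-1]), float(recall[-1]), ap
```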
Evaluation of the generalization capabilities Considering the first sce-
nario, we first obtained a baseline using the original detector, i.e. the detector
trained using the real-world general-purpose COCO dataset. Then, we trained
the detector using our synthetic dataset, performing an ablation study over the
introduced extensions.
First, we considered the original images and the original bounding boxes.
Then, in order to evaluate how much the bounding-box construction policy can
affect the detection quality, we considered the sanitized bounding boxes. Third,
we also considered the augmented images. Finally, we trained the detector using the real-world CityPersons dataset, which is specific to the pedestrian detection task. We employ this experiment as a baseline for our ViPeD-trained network. Results
are reported in Table 1.
Table 1: Results of the YOLOv3 detector on MOT17Det
Training Dataset                      MOT AP   COCO AP   Precision (%)   Recall (%)
COCO (Baseline)                        0.69     0.41       87.4            72.4
CityPersons                            0.58     0.37       69.0            60.5
ViPeD: Orig. BBs - Orig. Imgs          0.58     0.37       68.6            64.8
ViPeD: Sanitized BBs - Orig. Imgs      0.63     0.40       91.1            69.2
ViPeD: Sanitized BBs - Aug. Imgs       0.71     0.48       89.3            73.9

Comparison with the state-of-the-art on MOT17Det Concerning the second scenario, we obtained a baseline starting from the original detector trained with COCO and fine-tuning it with the training set of the MOT17Det dataset. Then, we considered our previous detector trained with ViPeD (the one with the sanitized bounding boxes and the augmented images) and fine-tuned the network again with the training set of the MOT17Det dataset. Results are reported in Table 2, together with the ones obtained using the state-of-the-art approaches publicly released in the MOT17 Challenge (at the time of writing).
Table 2: Results on MOT17Det: comparison with the state-of-the-art
Method                       MOT AP   Precision (%)   Recall (%)
YOLOv3 on COCO + MOT 0.80 89.9 82.8
YTLAB [13] 0.89 86.2 91.3
KDNT [28] 0.89 78.7 92.1
ZIZOM [29] 0.81 88.0 83.3
SDP [12] 0.81 92.6 83.5
YOLOv3 on ViPeD + MOT 0.80 90.2 84.6
Discussion Results in Table 1 show that we obtained the best performance by training the detector with ViPeD, using the sanitized bounding boxes and the augmented images, outperforming also the networks trained with COCO and with CityPersons. Therefore, our solution is able to generalize the knowledge learned from the virtual world to a real-world dataset, and it is also able to perform better than the solutions trained using real-world manually annotated datasets. Results in Table 2 demonstrate that our training procedure is able to reach competitive performance even when compared to specialized pedestrian detection approaches.
6 Conclusions
In this work, we propose a real-time system able to detect pedestrian instances in images. Our approach is based on YOLOv3, a state-of-the-art fast detector, trained with a synthetic dataset named ViPeD, a huge collection of images rendered from the highly photo-realistic video game GTA V developed by Rockstar North.
The choice of training the network using synthetic data is motivated by the fact that a huge amount of diverse examples is needed in order for the algorithm to generalize well. This huge amount of data is typically collected and annotated manually by humans, but this procedure usually takes a lot of
time and is error-prone. We demonstrated that our solution is able to transfer the knowledge learned from the synthetic data to the real world, outperforming the same approach trained instead on real-world manually labeled datasets.
The YOLOv3 network is able to run on low-power devices, such as the
NVIDIA Jetson TX2 board, at 4 FPS. In this way, it could be deployed di-
rectly on smart devices, such as smart security cameras or drones. Even though we trained the YOLOv3 detector on the specific task of pedestrian detection, we think that the presented procedure could be applied at a larger scale to other related tasks, such as object segmentation or image classification.
Acknowledgments
This work was partially supported by the AI4EU project, funded by the EC
(H2020 - Contract n. 825619). We gratefully acknowledge the support of NVIDIA
Corporation with the donation of the Jetson TX2 board used for this research.
References
1. J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale
hierarchical image database,” in 2009 IEEE Conference on Computer Vision and
Pattern Recognition, June 2009, pp. 248–255.
2. T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: common objects in context,” CoRR, vol. abs/1405.0312, 2014.
3. M. Fabbri, F. Lanzi, S. Calderara, A. Palazzi, R. Vezzani, and R. Cucchiara,
“Learning to detect and track visible and occluded body joints in a virtual world,”
in European Conference on Computer Vision (ECCV), 2018.
4. J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” CoRR, vol.
abs/1804.02767, 2018.
5. A. Milan, L. Leal-Taixé, I. D. Reid, S. Roth, and K. Schindler, “MOT16: A benchmark for multi-object tracking,” CoRR, vol. abs/1603.00831, 2016.
6. R. Benenson, M. Omran, J. Hosang, and B. Schiele, “Ten years of pedestrian
detection, what have we learned?” in Computer Vision - ECCV 2014 Workshops.
Cham: Springer International Publishing, 2015, pp. 613–627.
7. S. Zhang, C. Bauckhage, and A. B. Cremers, “Informed haar-like features improve
pedestrian detection,” in The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), June 2014.
8. S. Zhang, R. Benenson, and B. Schiele, “Filtered channel features for pedestrian
detection,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), June 2015, pp. 1751–1760.
9. S. Zhang, R. Benenson, M. Omran, J. Hosang, and B. Schiele, “How far are we
from solving pedestrian detection?” in The IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), June 2016.
10. W. Nam, P. Dollar, and J. H. Han, “Local decorrelation for improved pedestrian
detection,” in Advances in Neural Information Processing Systems 27. Curran
Associates, Inc., 2014, pp. 424–432.
11. Y. Tian, P. Luo, X. Wang, and X. Tang, “Deep learning strong parts for pedestrian
detection,” in 2015 IEEE International Conference on Computer Vision (ICCV),
Dec 2015, pp. 1904–1912.
12. F. Yang, W. Choi, and Y. Lin, “Exploit all the layers: Fast and accurate cnn object
detector with scale dependent pooling and cascaded rejection classifiers,” in 2016
IEEE CVPR, June 2016, pp. 2129–2137.
13. Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos, “A unified multi-scale deep con-
volutional neural network for fast object detection,” in Computer Vision – ECCV
2016. Cham: Springer International Publishing, 2016, pp. 354–370.
14. P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. Lecun, “Pedestrian detection
with unsupervised multi-stage feature learning,” in The IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR), June 2013.
15. Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied
to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324,
Nov 1998.
16. P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: An evalua-
tion of the state of the art,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 34, no. 4, pp. 743–761, April 2012.
17. N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in
2005 IEEE Computer Society Conference on Computer Vision and Pattern Recog-
nition (CVPR’05), vol. 1, June 2005, pp. 886–893 vol. 1.
18. S. Zhang, R. Benenson, and B. Schiele, “Citypersons: A diverse dataset for pedes-
trian detection,” CoRR, vol. abs/1702.05693, 2017.
19. B. Kaneva, A. Torralba, and W. T. Freeman, “Evaluation of image features using
a photorealistic virtual world,” in 2011 International Conference on Computer
Vision, Nov 2011, pp. 2282–2289.
20. J. Marín, D. Vázquez, D. Gerónimo, and A. M. López, “Learning appearance in virtual scenarios for pedestrian detection,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 2010, pp. 137–144.
21. D. Vazquez, A. M. Lopez, and D. Ponsa, “Unsupervised domain adaptation of vir-
tual and real worlds for pedestrian detection,” in Proceedings of the 21st Interna-
tional Conference on Pattern Recognition (ICPR2012), Nov 2012, pp. 3492–3495.
22. D. Vázquez, A. M. López, J. Marín, D. Ponsa, and D. Gerónimo, “Virtual and real world adaptation for pedestrian detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 4, pp. 797–809, April 2014.
23. M. Johnson-Roberson, C. Barto, R. Mehta, S. N. Sridhar, and R. Vasudevan,
“Driving in the matrix: Can virtual worlds replace human-generated annotations
for real world tasks?” CoRR, vol. abs/1610.01983, 2016.
24. E. Bochinski, V. Eiselein, and T. Sikora, “Training a convolutional neural network
for multi-class object detection using solely virtual world data,” in Advanced Video
and Signal Based Surveillance (AVSS), 2016 13th IEEE International Conference
on. IEEE, 2016, pp. 278–285.
25. S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object
detection with region proposal networks,” in Advances in Neural Information Pro-
cessing Systems 28. Curran Associates, Inc., 2015, pp. 91–99.
26. L. Leal-Taixé, A. Milan, I. D. Reid, S. Roth, and K. Schindler, “MOTChallenge 2015: Towards a benchmark for multi-target tracking,” CoRR, vol. abs/1504.01942, 2015.
27. J. Redmon, “Darknet: Open source neural networks in C,” 2013.
28. F. Yu, W. Li, Q. Li, Y. Liu, X. Shi, and J. Yan, “POI: multiple object tracking with
high performance detection and appearance feature,” CoRR, vol. abs/1610.06136,
2016.
29. C. Lin, J. Lu, G. Wang, and J. Zhou, “Graininess-aware deep feature learning for
pedestrian detection,” in The European Conference on Computer Vision (ECCV),
September 2018.
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [7] and Fast R-CNN [5] have reduced the running time of these detection networks, exposing region pro-posal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolu-tional features. For the very deep VGG-16 model [18], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.