Learning Safety Equipment Detection
using Virtual Worlds
1Marco Di Benedetto, 2Enrico Meloni, 1Giuseppe Amato, 1Fabrizio Falchi, 1Claudio Gennaro
1Institute of Information Science and Technologies, National Research Council, Italy, {name.surname}
2Department of Information Engineering, Università di Pisa, Italy,
Abstract—Nowadays, the possibilities offered by state-of-the-
art deep neural networks allow the creation of systems capable of
recognizing and indexing visual content with very high accuracy.
The performance of these systems relies on the availability of
high-quality training sets containing a large number of examples
(e.g., millions), in addition to the machine learning tools themselves.
For several applications, very good training sets can be
obtained, for example, by crawling (noisily) annotated images from
the internet, or by analyzing user interaction (e.g., on social
networks). However, there are several applications for which
high-quality training sets are not easy to obtain or create.
Consider, as an example, a security scenario where one wants to
automatically detect rarely occurring threatening events.
In this respect, recently, researchers investigated the possibility
of using a visual virtual environment, capable of artificially
generating controllable and photo-realistic contents, to create
training sets for applications with few available training images.
We explored this idea to generate synthetic photo-realistic
training sets to train classifiers to recognize the proper use
of individual safety equipment (e.g., worker protection helmets,
high-visibility vests, ear protection devices) during risky human
activities. Then, we performed domain adaptation to real images
by using a very small image data set of real-world photographs.
We show that training with the generated synthetic training
set and using the domain adaptation step is an effective solution
to address applications for which no training sets exist.
Index Terms—Deep Learning, Virtual Dataset, Transfer Learn-
ing, Domain Adaptation, Safety Equipment Detection
In the new spring of artificial intelligence, and in particular
in its sub-field known as machine learning, a significant
series of important results has shifted the focus of industrial and
research communities toward the generation of valuable data
from which learning algorithms can be trained. For several
applications, in the era of big data, the availability of real
input examples, to train machine learning algorithms, is not
considered an issue. However, for several other applications
there is not such an abundance of training data. Sometimes,
even if data is available it must be manually revised to make
it usable as training data (e.g., by adding annotations, class
labels, or visual masks), with a considerable cost.
In fact, although a number of annotated datasets are available
and successfully used to produce important academic results
and commercially fruitful products, there is still a huge number
of scenarios where laborious human intervention is needed to
produce high-quality training sets.
Such cases include, but are not limited to,
safety equipment detection, weapon wielding detection, and
autonomously driven cars.
To overcome these limitations and to provide useful examples
in a variety of scenarios, the research community has
recently started to leverage programmable virtual
scenarios to generate visual datasets and the needed associated
annotations. For example, in image-based machine learning,
using a modern rendering engine (i.e., one capable of
producing photo-realistic imagery) has proven to be a valid
way to automatically generate adequate datasets (see
Section II).
In this work we demonstrate the effectiveness of a virtual
rendering engine to address the problem of detection and
recognition in scenarios where little-to-no real images exist,
and apply it in the context of safety equipment visual detection
(see Figure 1), for which, to the best of our knowledge, no
public dataset exists. In particular, we show how the transfer
learning approach on a known deep neural network can reach
state-of-the-art results in automatic visual media indexing,
after being trained with virtually generated images containing
people equipped with safety items, like high-visibility jackets
and helmets, and domain adaptation using a few real image
training examples. In more detail, our contributions are:
• the creation of a virtual training set for personal safety
equipment recognition, with different scene conditions;
• the creation of an annotated real-world image test set; and
• the creation of state-of-the-art classifiers for this scenario.
We will see that, when very few real examples are available,
training on virtual images dramatically
increases system performance.
The dataset that we created is made publicly available to
the research community [1].
This work is organized as follows: Section II gives an
overview of existing methods based on virtual environments;
Section III describes how we used an existing rendering engine
and the policy to create the dataset and the test set; Section
IV discusses our detection method; Section V shows our
experimental results; finally Section VI concludes.
978-1-7281-4673-7/19/$31.00 ©2019 IEEE
Fig. 1. Examples of safety equipment: a) real photograph of a worker wearing a welding mask; b) and c) virtual renderings of people with helmets,
high-visibility vests, and welding masks.
With the advent of deep learning, object detection technologies
have achieved accuracies that were unimaginable only a
few years ago. YOLO architectures [2], [3] and Faster-RCNN
[4] are today de facto standard architectures for the object
detection task. They are trained on huge generic annotated
datasets, such as ImageNet [5], MS COCO [6], Pascal [7]
or OpenImages v4 [8]. These datasets collect an enormous
amount of pictures usually taken from the web and they are
manually annotated.
With the need for huge amounts of labeled data, virtually
generated datasets have recently gained great interest. The
possibility of learning features from virtual data and validating
them on real scenarios was explored in [9]. Unlike our work,
however, they did not explore deep learning approaches. In
[10], computer-generated imagery was used to study trained
CNNs, qualitatively and quantitatively analyzing deep features
by varying the network stimuli according to factors of interest,
such as object style, viewpoint, and color. The works [11],
[12] exploit the popular Unreal Engine 4 (UE4) to build virtual
worlds and use them to train and test deep learning algorithms.
The problem of transferring deep neural network models
trained in simulated virtual worlds to the real world for
vision-based robotic control was explored in [13]. In a similar
scenario, [14] developed an end-to-end active tracker trained
in a virtual environment that can adapt to real-world robot
settings. To handle the variability in real-world data, [15]
relied upon the technique of domain randomization, in which
the parameters of the simulator, such as lighting, pose, and object
textures, were randomized in non-realistic ways to force the
neural network to learn the essential features of the object of
interest. In [16], a deep learning model was trained to drive in
a simulated environment and then adapted to the visual variation
experienced in the real world.
The works [17], [18] focused on performing
domain adaptation in order to map virtual features
onto real ones. Richter et al. [19] explored the use of the video
game Grand Theft Auto V (GTA-V) [20] for creating large-scale
pixel-accurate ground truth data for training semantic
segmentation systems. The authors of [21] used GTA-V to train
a self-driving car, generating around 480,000 training images.
This work showed that GTA-V can indeed be used
to automatically generate a large dataset. The use of GTA-V to
train a self-driving car was explored also in [22], where images
from the game were used to train a classifier for recognizing
the presence of stop signs in an image and estimate their
distance. In [23] a different game was used for training a self-
driving car: TORCS, an open source racing simulator with a
graphics engine less focused on realism than GTA-V.
The authors of [24] created a dataset from GTA-V images
and demonstrated that it is possible to reach excellent results
on tasks such as real people tracking and pose estimation.
The work in [25] also used GTA-V as the virtual world but, unlike
our method, used Faster-RCNN and concentrated on
vehicle detection, validating the results on the KITTI dataset.
Instead, [26] used a synthetically generated virtual dataset
to train a simple convolutional network to detect objects
belonging to various classes in a video.
In this paper we show that a low cost and off-the-shelf
virtual rendering environment represents a viable solution for
generating a high quality training set for scenarios lacking
enough real training data. This method allows generating a
very large amount of annotated images, with the possibility
of scenery changes like location, contents, and even weather
conditions, with very little human intervention.
In this work, we used the generated training set to train a
You Only Look Once (YOLO) neural system [2], [3] for its
efficiency and high detection accuracy. However, the same
methodology can be applied to other machine learning tools.
We used the Rockstar Advanced Game Engine (RAGE) from
the GTA-V computer game, and its scripting ability to deploy
a series of pedestrians with and without safety equipment in
different locations of the game map. The RAGE Plugin Hook
[27] allowed us to create and inject our C# scripts into the game.
Our scripts use the plugin API to add pedestrians with
chosen equipment in various locations of the game map, place
cameras where we want to take pictures, check that
objects are in the field of view and not occluded, recover the
bounding boxes of 3D meshes from the rendering engine, and save
game screenshots (i.e., our dataset images) and their associated
annotations (bounding boxes and classes).
The personal safety equipment that we consider includes, for
example, high-visibility vests, helmets, and welding masks.
In addition to persons wearing this equipment, we also
generate pedestrians without protection, for which we annotate
the person, bare head, and bare chest (see Figure 2 for an example).
Fig. 2. Detected objects in a working scenario.
The generation of the virtual dataset first required configuring
the RAGE engine to create various types of scenarios
(Section III-A). Then, the RAGE engine was used to capture
images along with annotations. For every image, the
annotations (coordinates of the bounding boxes and identities
of relevant elements) were retrieved from the RAGE engine
through our script (Section III-B). We used this approach
for creating both the virtual world training set and the virtual world
validation set. The dataset was finally completed by adding
a real world test set, composed of real world images, to test the
accuracy of the trained neural network on real scenes (Section III-C).
A. Scenario Creation
To generate the training scenario we used the plugin API
to customize the following game features:
Camera: used to set up the viewpoints from which the
scenario must be recorded.
Pedestrians: used to set up the number of people in the
scene and their behavior, chosen from the set offered
by the game engine, such as wandering around an area,
chatting between themselves, fighting, and so on.
Place: used to set up the place where the pedestrians will
be generated; there is a series of game map preset places,
plus user-defined locations identified by map coordinates.
Time: used to set up the time of day during which the
scene takes place.
Weather: used to set up the weather conditions during
the animation.
We used 9 different game map locations with 3 different
weather conditions each to create the virtual training set. From
these we acquired a total of 126,900 images with an average of
12 persons per shot. The virtual validation set spans 1 location
with 3 weather conditions, and consists of 350 images with an
average of 12 persons each. Therefore, in the end, we have 30
different scenarios from which virtual world images were extracted.
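The scenario counts above can be checked with a short sketch. The location and weather names are placeholders (the text gives only the counts 9, 1, and 3, not the actual names), and `build_scenarios` is a hypothetical helper, not part of the authors' scripts.

```python
from itertools import product

# Placeholder names: the text gives only the counts, not the actual values.
TRAIN_LOCATIONS = [f"train_location_{i}" for i in range(1, 10)]  # 9 locations
VAL_LOCATIONS = ["val_location_1"]                               # 1 location
WEATHERS = ["weather_a", "weather_b", "weather_c"]               # 3 conditions


def build_scenarios(locations, weathers):
    """Return one scenario per (location, weather) pair."""
    return [{"location": loc, "weather": w} for loc, w in product(locations, weathers)]


train_scenarios = build_scenarios(TRAIN_LOCATIONS, WEATHERS)
val_scenarios = build_scenarios(VAL_LOCATIONS, WEATHERS)

# 27 training scenarios + 3 validation scenarios = 30 scenarios in total
print(len(train_scenarios), len(val_scenarios), len(train_scenarios) + len(val_scenarios))
```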
B. Dataset Annotation
Fig. 3. Examples of safety equipment objects detected by our system.
Fig. 4. Bounding box estimation: oversized approximation with respect to
on-screen projections. With the available API, the hooked virtual engine is
able to provide the bounding boxes of individual 3D meshes, overestimated
due to collision proxy expansion for animations. Not being able to access the
original 3D geometry and the current animation frame, our best strategy is
to project on-screen the eight corners of the 3D bounding box, and then take
their containing minimum rectangle as our best annotation.
Dataset annotation is the process that creates the annotated
images for the dataset. In our case, we annotate the
following elements (see Figure 3):
Helmet: a head wearing a helmet
Welding Mask: a head wearing a welding mask
Ear Protection: a head wearing hearing protection
High-Visibility Vest (HVV): a person's chest wearing a
high-visibility vest
Person: a full-body person
Chest: the bare chest (without HVV)
Head: the bare head (without helmet or ear protection)
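One possible way to store such annotations for YOLO training is the Darknet label format, with one line per object holding the class index followed by the normalized box center and size. The sketch below is only an illustration: the class order and the helper name `to_darknet_line` are assumptions, since the text does not specify how the annotations are serialized.

```python
# Hypothetical class order; the text lists the classes but not their indices.
CLASSES = ["Helmet", "Welding Mask", "Ear Protection", "HVV", "Person", "Chest", "Head"]


def to_darknet_line(class_name, box, image_size):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to a Darknet label line."""
    x_min, y_min, x_max, y_max = box
    width, height = image_size
    # Darknet expects the box center and size, normalized by the image size.
    x_center = (x_min + x_max) / 2.0 / width
    y_center = (y_min + y_max) / 2.0 / height
    box_w = (x_max - x_min) / width
    box_h = (y_max - y_min) / height
    class_id = CLASSES.index(class_name)
    return f"{class_id} {x_center:.6f} {y_center:.6f} {box_w:.6f} {box_h:.6f}"


line = to_darknet_line("Helmet", (100, 50, 300, 250), (1000, 500))
print(line)  # 0 0.200000 0.300000 0.200000 0.400000
```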
For each viewpoint set up in the scenario, we process every
object to extract its position on the 2D image. This is done by
first calculating the geometry of its transformed 3D bounding
box, then approximately testing the box visibility, and finally
extracting the 2D image bounding box as the rectangle enclosing
the projected 3D box vertices. Visibility is checked by testing the occlusion
of line-of-sight rays from the camera to a fixed number
of points in the box volume; the object is considered visible
if at least one ray is not occluded.
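The projection and visibility steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' C# script: the box is assumed to be axis-aligned and already expressed in camera space, the camera is an ideal pinhole with a focal length in pixels, clipping to the image is an added convenience, and the occlusion flags are assumed to come from the engine's ray tests. All function names are hypothetical.

```python
from itertools import product


def box_corners(box_min, box_max):
    """Return the eight corners of an axis-aligned 3D box."""
    return [tuple(c) for c in product(*zip(box_min, box_max))]


def project(point, focal, image_size):
    """Pinhole projection of a camera-space point (z > 0 is in front of the camera)."""
    x, y, z = point
    width, height = image_size
    return focal * x / z + width / 2.0, focal * y / z + height / 2.0


def enclosing_rectangle(corners, focal, image_size):
    """Project all corners and return their minimum enclosing rectangle, clipped to the image."""
    width, height = image_size
    us, vs = zip(*(project(c, focal, image_size) for c in corners))
    return (max(0.0, min(us)), max(0.0, min(vs)),
            min(float(width), max(us)), min(float(height), max(vs)))


def is_visible(ray_occluded_flags):
    """The object counts as visible if at least one sampled ray is not occluded."""
    return any(not occluded for occluded in ray_occluded_flags)


corners = box_corners((-1.0, -1.0, 4.0), (1.0, 1.0, 6.0))
rect = enclosing_rectangle(corners, focal=500.0, image_size=(1000, 800))
print(rect)  # (375.0, 275.0, 625.0, 525.0)
```

Because the rectangle encloses all eight projected corners, it is generally larger than the object's true silhouette, which matches the oversized approximation described in the caption of Fig. 4.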
Fig. 5. Real-World Validation Set: composed of 180 copyright-free images, our validation set is available at the project website.
C. Real world Test Set
The motivation of this work is to prove that it is possible to
train a system in a virtual world even when it is supposed to be
used in the real world. To test the performance of the trained
neural network in the real world, we created a real world test
set using copyright-free photographs of people wearing safety
equipment. The set is composed of 180 images (see Figure 5)
showing persons with and without the items listed in Section
III-B, each associated with manually created annotations of
bounding boxes and element identities.
The backbone of our detection algorithm is the You Only
Look Once (YOLO) system [2], an efficient neural network
able to detect, in a single image, the objects which it has been
trained for. The detection ends with a list of 2D bounding
boxes, each associated with a class label referring to the
recognized object. In our implementation, we used the YOLO
v3 [3] network (hereafter abbreviated as YOLO) with
Darknet-53 at its core. We trained it to recognize personal
safety equipment components, as shown in the following.
A. Transfer Learning
As stated above, we use the generated virtual world
training set to train YOLO to detect and recognize our elements
of interest in images. In particular, we adapt YOLO
to our scenario using transfer learning. Our hypothesis is that
a pre-trained network already embeds enough knowledge to
allow us to specialize it for a new scenario, leveraging the
transfer learning capability of deep neural networks and
training sets generated from a virtual world.
The purpose of transfer learning is to exploit the first already
trained layers (i.e., the ones identifying low-level features) and
to extend the detection capability to the new set of objects by
updating the last layers of the network.
In a trained deep convolutional neural network, successive
layers learn to identify features that become more and more
complex with layer depth; for example, the first layer
may detect straight borders, the second layer smooth
contours, the third some kind of color gradients, and so on,
up to the last layers, which are capable of identifying entire objects.
In our case, we used a YOLO network pre-trained on
the COCO dataset. We retrained it by blocking the
parameter updates in the first part of the network, and allowing
updates only in the last sections. Specifically, we kept the first
81 (i.e., the feature extractors) of the total 106 layers, and
froze the weights of the first 74. The network was trained
for 24,000 iterations, that is, 11 epochs, with the following
parameters: batch size 64, decay 0.0005, learning rate 0.001,
momentum 0.9, IoU threshold 0.5, and confidence threshold
0.25. As explained in Section III, our virtual dataset is composed of
30 scenarios, 27 of which were used as the training set and 3
were left for validation. The 3 scenarios of the validation set
contain 13,500 images. From these, 350 images were randomly
selected to form the virtual validation dataset.
In this way, a new set of objects is recognized by the network.
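The layer split described above can be made explicit with a small sketch. The counts 106, 81, and 74 come from the text; the helper name `layer_plan`, the status labels, and the reading that layers beyond the first 81 start from fresh weights are assumptions made for illustration.

```python
def layer_plan(total_layers=106, kept_layers=81, frozen_layers=74):
    """Label each layer index as 'frozen', 'fine-tuned', or 'reinitialized'."""
    if not 0 <= frozen_layers <= kept_layers <= total_layers:
        raise ValueError("expected 0 <= frozen <= kept <= total")
    plan = []
    for index in range(total_layers):
        if index < frozen_layers:
            plan.append("frozen")         # pre-trained weights, never updated
        elif index < kept_layers:
            plan.append("fine-tuned")     # pre-trained weights, updated during training
        else:
            plan.append("reinitialized")  # assumed: trained starting from fresh weights
    return plan


plan = layer_plan()
counts = {status: plan.count(status) for status in ("frozen", "fine-tuned", "reinitialized")}
print(counts)  # {'frozen': 74, 'fine-tuned': 7, 'reinitialized': 25}
```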
B. Evaluation Metrics
To evaluate the performance of our implementation, we
applied the standard measures used in the object detection
literature, i.e., Intersection over Union (IoU), based on the
area of the detected (D) and real (V) bounding boxes, and
Precision (Pr) and Recall (Rc), based on true (T) / false (F)
positive (P) / negative (N) detections:
$$IoU = \frac{\mathrm{area}(D \cap V)}{\mathrm{area}(D \cup V)}$$
$$Pr = \frac{TP}{TP + FP}$$
$$Rc = \frac{TP}{TP + FN}$$
Detected bounding boxes are associated with a confidence
score, ranging from 0 to 1, and are considered in the output if
and only if their confidence score is greater than a configurable
threshold. Given the above definitions, we calculate the mean
Average Precision (mAP) as the average of the maximum
precision at different recall values.
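The measures above can be sketched in a few lines. This is an illustrative implementation, not the evaluation code used for the experiments: boxes are assumed to be given as (x_min, y_min, x_max, y_max), and the average precision uses all-point interpolation (the maximum precision at any recall greater than or equal to the current one), which is one common reading of the definition; the exact protocol used in the paper is not stated.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(tp, fp, fn):
    """Pr = TP / (TP + FP) and Rc = TP / (TP + FN), with 0.0 for empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(scored_hits, num_ground_truth):
    """AP for one class from (confidence, is_true_positive) pairs; needs num_ground_truth > 0."""
    ranked = sorted(scored_hits, key=lambda item: item[0], reverse=True)
    precisions, recalls = [], []
    tp = fp = 0
    for _, hit in ranked:
        tp += 1 if hit else 0
        fp += 0 if hit else 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))      # intersection 1, union 7
pr, rc = precision_recall(tp=8, fp=2, fn=2)    # 0.8, 0.8
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_ground_truth=2)
```

A detection is typically counted as a true positive when its IoU with a ground-truth box of the same class reaches the IoU threshold; the mAP is then the mean of the per-class AP values.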
A. Experimental Setups
We trained and evaluated three variations of the network
on both virtual and real images: YCV, which is YOLO base
trained on COCO and retrained with Virtual data; YCVR,
which is YCV fine-tuned with Real data; YCR, which is
YOLO base trained on COCO and retrained with Real data.
To obtain YCVR, we split the real world dataset in two parts
with 100 images each: a training part and a testing part. We
used the training part to apply domain adaptation from virtual
to real on the YCV network. To choose the set of weights from
which to start, we validated each of them on the training part,
choosing the one with the highest mAP. We chose the weights
after 20,000 iterations, with 59.05 mAP. The fine-tuning was
done for 1000 iterations.
Table I. YCV tested on virtual images: per-class AP and mAP at different training iterations.
Iter.  Epochs  Head    Helmet  Weld. Mask  Ear Prot.  Chest   HVV     Person  mAP
18k    8       89.56%  86.76%  73.23%      87.97%     90.36%  90.04%  89.32%  86.75%
19k    9       89.42%  81.65%  73.39%      88.68%     89.72%  90.43%  89.87%  86.17%
20k    9       87.02%  81.76%  72.61%      88.87%     89.45%  90.33%  89.54%  85.66%
21k    10      89.75%  86.74%  75.53%      89.05%     89.75%  90.09%  89.79%  87.24%
22k    10      89.83%  87.49%  74.17%      88.27%     89.41%  89.95%  89.83%  86.99%
23k    10      89.75%  86.02%  73.39%      88.47%     89.89%  90.03%  89.76%  86.76%
24k    11      89.03%  88.46%  73.44%      89.05%     90.08%  90.04%  89.95%  87.15%
Table II. YCV tested on real images: per-class AP and mAP at different training iterations.
Iter.  Epochs  Head    Helmet  Weld. Mask  Ear Prot.  Chest   HVV     Person  mAP
18k    8       29.99%  69.72%  22.68%      48.10%     49.98%  67.01%  77.51%  51.00%
19k    9       25.57%  61.07%  19.65%      34.58%     36.89%  66.60%  72.13%  45.21%
20k    9       36.25%  74.13%  27.31%      55.60%     45.66%  69.89%  76.91%  55.11%
21k    10      35.84%  69.18%  26.92%      47.41%     37.90%  66.70%  76.85%  51.54%
22k    10      26.31%  64.48%  22.65%      48.23%     43.54%  63.62%  74.93%  49.11%
23k    10      35.02%  68.48%  11.76%      56.57%     42.41%  65.02%  74.25%  50.51%
24k    11      33.71%  65.55%  27.11%      43.17%     37.78%  62.65%  73.31%  49.04%
To better evaluate the benefit contributed by the virtual
world training set, we also fine-tuned YOLO base, pre-trained
on COCO, with the same 100 real images used for obtaining
YCVR. As stated above, we call this network YCR.
B. Results
YCV obtains 87.24 mAP when tested on virtual images
(see Table I). When tested on real world images, it obtains 55.11
mAP (see Table III). Most of the AP loss is caused by the
classes Head, Welding Mask, Ear Protection, and Chest. We
believe that this is due to the fact that in real life there are
many more variations of these object classes than those the
game can render.
We note that on virtual world testing, YCV obtains
its best mAP after 21,000 iterations. On real world testing, the best
performance is reached after 20,000 iterations. This shows that
the best performing set of weights on the virtual world test is
not necessarily the best on the real world test.
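The observation can be verified directly from the per-iteration mAP values reported in the two YCV tables. The sketch below only copies those values; `best_checkpoint` is a hypothetical helper name.

```python
# mAP (%) of YCV at each training iteration, copied from the per-iteration tables.
MAP_VIRTUAL = {18000: 86.75, 19000: 86.17, 20000: 85.66, 21000: 87.24,
               22000: 86.99, 23000: 86.76, 24000: 87.15}
MAP_REAL = {18000: 51.00, 19000: 45.21, 20000: 55.11, 21000: 51.54,
            22000: 49.11, 23000: 50.51, 24000: 49.04}


def best_checkpoint(map_by_iteration):
    """Return (iteration, mAP) of the checkpoint with the highest mAP."""
    return max(map_by_iteration.items(), key=lambda item: item[1])


best_virtual = best_checkpoint(MAP_VIRTUAL)
best_real = best_checkpoint(MAP_REAL)
print(best_virtual)  # (21000, 87.24)
print(best_real)     # (20000, 55.11)
```

On real images the checkpoint at 21,000 iterations scores 51.54 mAP, about 3.6 points below the one at 20,000 iterations, so selecting weights by virtual performance alone would not pick the best real-world model here.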
YCVR obtains a significant boost and reaches 76.1 mAP.
This means that fine-tuning with only 100 real images is very
effective on a network which was previously fine-tuned with
a large number of similar virtual images. We also note that testing YCVR
on the virtual world yields a lower mAP than YCV.
The main drop of AP in this case is seen on Head and Welding
Mask, which are the classes with most differences between real
and virtual.
YCR obtains 57.3 mAP when tested on the real world test
set. This result is just slightly better than that obtained by YCV,
and far worse than that of YCVR. This means that the contribution
of the virtual world training set to training the network for
the new scenario is very relevant, and fine-tuning with just
a few images is enough to adapt the network back to the real
world domain.
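As a worked check of these figures, the mAP of each network is the mean of its seven per-class AP values. The sketch below copies the real-test rows of the summary table; the helper name `mean_ap` is hypothetical.

```python
# Per-class AP (%) on the real test set, copied from the summary table
# (order: Head, Helmet, Weld. Mask, Ear Prot., Chest, HVV, Person).
AP_REAL = {
    "YCV": [36.3, 74.1, 27.3, 55.6, 45.7, 69.9, 76.9],
    "YCR": [44.1, 52.2, 42.3, 62.0, 59.1, 60.7, 80.6],
    "YCVR": [78.8, 73.3, 66.3, 74.0, 74.7, 78.6, 87.1],
}


def mean_ap(per_class_ap):
    """Mean Average Precision as the plain mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)


m_ap = {name: round(mean_ap(values), 1) for name, values in AP_REAL.items()}
gain_over_ycr = round(m_ap["YCVR"] - m_ap["YCR"], 1)
gain_over_ycv = round(m_ap["YCVR"] - m_ap["YCV"], 1)
print(m_ap)           # {'YCV': 55.1, 'YCR': 57.3, 'YCVR': 76.1}
print(gain_over_ycr)  # 18.8
print(gain_over_ycv)  # 21.0
```

The recomputed means match the reported mAP column, and the gap of 18.8 points between YCVR and YCR quantifies the contribution of the virtual training set.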
Training deep neural networks in virtual environments has
been recently proven to be of help when the number of
available training examples for the specific task is low. In
this work, we considered the task of learning to detect proper
equipment in risky human activity scenarios.
We created and made available two datasets: the first one
has been generated using a virtual reality engine (RAGE from
GTA-V); the second one is composed of real photos.
In our experiments, we trained YOLO on the virtual dataset
and tested it on real images, both directly and after using just a small
number of real photos to fine-tune the deep neural network
trained in the virtual environment. The experiments
demonstrated that training on virtual world images,
and executing a step of domain adaptation with a limited
number of real images, is very effective. The performance obtained
when training with virtual world images and adapting to the
domain with a few real images is much higher than that of just
fine-tuning an existing network with a few real images for the
scenario at hand. We plan to use the same virtual environment
to train networks to detect people using weapons (see Figure 4).
Table III. Per-class AP and mAP of each network on virtual (V) and real (R) test images.
Network  Test  Head   Helmet  Weld. Mask  Ear Prot.  Chest  HVV    Person  mAP
YCV      V     89.7%  86.7%   75.5%       89.0%      89.7%  90.0%  89.7%   87.2%
YCV      R     36.3%  74.1%   27.3%       55.6%      45.7%  69.9%  76.9%   55.1%
YCR      R     44.1%  52.2%   42.3%       62.0%      59.1%  60.7%  80.6%   57.3%
YCVR     R     78.8%  73.3%   66.3%       74.0%      74.7%  78.6%  87.1%   76.1%
This work was partially supported by “Automatic Data
and documents Analysis to enhance human-based processes”
(ADA), funded by CUP CIPE D55F17000290009, and by
the AI4EU project, funded by EC (H2020 - Contract n.
825619). We gratefully acknowledge the support of NVIDIA
Corporation with the donation of a Jetson TX2 board used for
this research.
[1] Enrico Meloni, Marco Di Benedetto, Giuseppe Amato, Fabrizio Falchi,
Claudio Gennaro, “Project Website,”, 2019.
[2] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look
once: Unified, real-time object detection,” in The IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), June 2016.
[3] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,”
CoRR, vol. abs/1804.02767, 2018.
[4] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time
object detection with region proposal networks,” in Advances in Neural
Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D.
Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015,
pp. 91–99.
[5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “Imagenet:
A large-scale hierarchical image database,” in 2009 IEEE Conference
on Computer Vision and Pattern Recognition, June 2009, pp. 248–255.
[6] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan,
P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,”
in Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele,
and T. Tuytelaars, Eds. Cham: Springer International Publishing, 2014,
pp. 740–755.
[7] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams,
J. Winn, and A. Zisserman, “The pascal visual object classes challenge:
A retrospective,” International Journal of Computer Vision, vol. 111,
no. 1, pp. 98–136, Jan 2015.
[8] A. Kuznetsova, H. Rom, N. Alldrin, J. R. R. Uijlings, I. Krasin, J. Pont-
Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig, and V. Ferrari, “The
open images dataset V4: unified image classification, object detection,
and visual relationship detection at scale,” CoRR, vol. abs/1811.00982,
[9] J. Marín, D. Vázquez, D. Gerónimo, and A. M. López, “Learning appearance
in virtual scenarios for pedestrian detection,” in 2010 IEEE Computer
Society Conference on Computer Vision and Pattern Recognition, June
2010, pp. 137–144.
[10] M. Aubry and B. C. Russell, “Understanding deep features with
computer-generated imagery,” in Proceedings of the IEEE International
Conference on Computer Vision, 2015, pp. 2875–2883.
[11] W. Qiu and A. Yuille, “Unrealcv: Connecting computer vision to unreal
engine,” in European Conference on Computer Vision. Springer, 2016,
pp. 909–916.
[12] K.-T. Lai, C.-C. Lin, C.-Y. Kang, M.-E. Liao, and M.-S. Chen, “Vivid:
Virtual environment for visual deep learning,” in Proceedings of the 26th
ACM International Conference on Multimedia, ser. MM ’18. New York,
NY, USA: ACM, 2018, pp. 1356–1359.
[13] Z.-W. Hong, C. Yu-Ming, S.-Y. Su, T.-Y. Shann, Y.-H. Chang, H.-K.
Yang, B. H.-L. Ho, C.-C. Tu, Y.-C. Chang, T.-C. Hsiao et al., “Virtual-to-
real: Learning to control in visual semantic segmentation,” arXiv preprint
arXiv:1802.00285, 2018.
[14] W. Luo, P. Sun, F. Zhong, W. Liu, T. Zhang, and Y. Wang, “End-to-end
active object tracking and its real-world deployment via reinforcement
learning,” IEEE transactions on pattern analysis and machine intelli-
gence, 2019.
[15] J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil,
T. To, E. Cameracci, S. Boochoon, and S. Birchfield, “Training deep
networks with synthetic data: Bridging the reality gap by domain
randomization,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition Workshops, 2018, pp. 969–977.
[16] A. Bewley, J. Rigley, Y. Liu, J. Hawke, R. Shen, V.-D. Lam, and
A. Kendall, “Learning to drive from simulation without real world
labels,” arXiv preprint arXiv:1812.03823, 2018.
[17] D. Vázquez, A. M. López, and D. Ponsa, “Unsupervised domain adaptation
of virtual and real worlds for pedestrian detection,” in Proceedings
of the 21st International Conference on Pattern Recognition
(ICPR2012), Nov 2012, pp. 3492–3495.
[18] D. Vázquez, A. M. López, J. Marín, D. Ponsa, and D. Gerónimo, “Virtual
and real world adaptation for pedestrian detection,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, vol. 36, no. 4, pp. 797–
809, April 2014.
[19] S. R. Richter, V. Vineet, S. Roth, and V. Koltun, “Playing for data:
Ground truth from computer games,” in European Conference on
Computer Vision. Springer, 2016, pp. 102–118.
[20] Rockstar Games, Inc., “Grand Theft Auto - V,” https://www., 2013.
[21] M. Martinez, C. Sitawarin, K. Finch, L. Meincke, A. Yablonski,
and A. Kornhauser, “Beyond grand theft auto v for training, testing
and enhancing deep learning in self driving cars,” arXiv preprint
arXiv:1712.01397, 2017.
[22] A. Filipowicz, J. Liu, and A. Kornhauser, “Learning to recognize
distance to stop signs using the virtual world of grand theft auto 5,”
Tech. Rep., 2017.
[23] C. Chen, A. Seff, A. Kornhauser, and J. Xiao, “Deepdriving: Learning
affordance for direct perception in autonomous driving,” in Proceedings
of the IEEE International Conference on Computer Vision, 2015, pp.
[24] M. Fabbri, F. Lanzi, S. Calderara, A. Palazzi, R. Vezzani, and R. Cuc-
chiara, “Learning to detect and track visible and occluded body joints in
a virtual world,” in European Conference on Computer Vision (ECCV),
[25] M. Johnson-Roberson, C. Barto, R. Mehta, S. N. Sridhar, and R. Vasude-
van, “Driving in the matrix: Can virtual worlds replace human-generated
annotations for real world tasks?” CoRR, vol. abs/1610.01983, 2016.
[26] E. Bochinski, V. Eiselein, and T. Sikora, “Training a convolutional neural
network for multi-class object detection using solely virtual world data,”
in Advanced Video and Signal Based Surveillance (AVSS), 2016 13th
IEEE International Conference on. IEEE, 2016, pp. 278–285.
[27] “RAGE Plugin Hook,”, 2013.
... TP is a ''worn'' prediction for a ''worn'' sample, FP is a ''worn'' prediction for an ''unworn'' sample, TN is an ''unworn'' prediction for an ''unworn'' sample, and FN is an ''unworn'' prediction for a ''worn'' sample. The standard accuracy, precision, recall, and specificity metrics as provided by (6), (4), Fig. 7 shows an example of the images provided by [23] and the instance segmentations provided by [24]. VOLUME 9, 2021 [23] for bounding boxes only, we had provided instance segmentations in [24]. ...
... The standard accuracy, precision, recall, and specificity metrics as provided by (6), (4), Fig. 7 shows an example of the images provided by [23] and the instance segmentations provided by [24]. VOLUME 9, 2021 [23] for bounding boxes only, we had provided instance segmentations in [24]. The dataset also does not contain person instance segmentations or visual relationship annotations. ...
Full-text available
We have identified a need in visual relationship detection and biometrics related research for a dataset and model which focuses on person-clothing pairs. Previous to our Relatable Clothing dataset, there were no publicly available datasets usable for “worn” and “unworn” clothing detection. In this paper we propose a novel visual relationship model architecture for “worn” and “unworn” clothing detection that makes use of a soft attention mechanism for feature fusion between a conventional ResNet backbone and our novel person-clothing mask feature extraction architecture. The best proposed model achieves 98.62% accuracy, 99.50% precision, 98.31% recall, and 99.14% specificity on the Relatable Clothing dataset, outperforming our previous iterations. We release our models which can be found on the Relatable Clothing GitHub repository ( ) for future research and applications into detecting and analyzing person-clothing pairs.
... Di Benedetto et al. [8] acquired a synthetic dataset using a Game Engine: the authors managed to obtain good results for object detection even with a small amount of real data to be used for fine-tuning. The works in [25,10,15,11,12] demonstrate the power of generating synthetic data automatically annotated for multiple tasks, such as optical flow estimation, semantic instance segmentation, object detection and tracking, object-level 3D scene layout estimation and visual odometry. ...
Being able to understand the relations between the user and the surrounding environment is instrumental to assist users in a worksite. For instance, understanding which objects a user is interacting with from images and video collected through a wearable device can be useful to inform the worker on the usage of specific objects in order to improve productivity and prevent accidents. Despite modern vision systems can rely on advanced algorithms for object detection, semantic and panoptic segmentation, these methods still require large quantities of domain-specific labeled data, which can be difficult to obtain in industrial scenarios. Motivated by this observation, we propose a pipeline which allows to generate synthetic images from 3D models of real environments and real objects. The generated images are automatically labeled and hence effortless to obtain. Exploiting the proposed pipeline, we generate a dataset comprising synthetic images automatically labeled for panoptic segmentation. This set is complemented by a small number of manually labeled real images for fine-tuning. Experiments show that the use of synthetic images allows to drastically reduce the number of real images needed to obtain reasonable panoptic segmentation performance.
... In [Di Benedetto et al., 2021; Di Benedetto et al., 2019], we generated photo-realistic synthetic image sets to train deep learning models to recognize the correct use of personal protection equipment (PPE, e.g., worker safety helmets, high visibility vests, ear protection devices) during at-risk work activities (see Figure 3). Then, we performed the adaptation of the domain to real-world images using a minimal set of real-world images. ...
In this short paper, we report the activities of the Artificial Intelligence for Media and Humanities (AIMH) laboratory of the ISTI-CNR related to Industry. The massive digitalization affecting all the stages of product design, production, and control calls for data-driven algorithms helping in the coordination of humans, machines, and digital resources in Industry 4.0. In this context, we developed AI-based Computer-Vision technologies of general interest in the emergent digital paradigm of the fourth industrial revolution, focusing on anomaly detection and object counting for computer-assisted testing and quality control. Moreover, in the automotive sector, we explore the use of virtual worlds to develop AI systems in otherwise practically unfeasible scenarios, showing an application for accident avoidance in self-driving car AI agents.
... A qualitative experiment of the proposed soft attention unit on a sample containing personal protective equipment is also done. The sample is taken from the PPE Dataset provided by [17]. The purpose of this qualitative experiment is to examine if there are any generalizability issues going from the Relatable Clothing Dataset to more practical real world examples with multiple people wearing multiple different articles of clothing, such as personal protective equipment. ...
... The dataset that we created is made publicly available to the research community [20]. This work extends the paper we presented at CBMI 2019, which received the best paper award [6]. The main extension is evaluating our approach not only on the YOLO architecture but also on Faster-RCNN, another commonly used object detection method. ...
Deep learning has achieved impressive results in many machine learning tasks such as image recognition and computer vision. Its applicability to supervised problems is, however, constrained by the availability of high-quality training data consisting of large numbers of human-annotated examples (e.g. millions). To overcome this problem, the AI world has recently been increasingly exploiting artificially generated images or video sequences produced with realistic photo rendering engines such as those used in entertainment applications. In this way, large sets of training images can be easily created to train deep learning algorithms. In this paper, we generated photo-realistic synthetic image sets to train deep learning models to recognize the correct use of personal safety equipment (e.g., worker safety helmets, high visibility vests, ear protection devices) during at-risk work activities. Then, we performed the adaptation of the domain to real-world images using a very small set of real-world images. We demonstrated that training with the generated synthetic training set and using the domain adaptation phase is an effective solution for applications where no training set is available.
Detecting visual relationships between people and clothing in an image has been a relatively unexplored problem in the field of computer vision and biometrics. The lack of a readily available public dataset for “worn” and “unworn” classification has slowed the development of solutions for this problem. We present the release of the Relatable Clothing Dataset, which contains 35287 person-clothing pairs and segmentation masks for the development of “worn” and “unworn” classification models. Additionally, we propose a novel soft attention unit for performing “worn” and “unworn” classification using deep neural networks. The proposed soft attention models have an accuracy upward of 98.55% ± 0.35% on the Relatable Clothing Dataset and demonstrate high generalizability, allowing us to classify unseen articles of clothing such as high visibility vests as “worn” or “unworn”.
Purpose: Automated dust monitoring in workplaces helps provide timely alerts to over-exposed workers and effective mitigation measures for proactive dust control. However, the cluttered nature of construction sites poses a practical challenge to obtaining enough high-quality images in the real world. The study aims to establish a framework that overcomes the lack of sufficient imagery data (the “data-hungry problem”) for training computer vision algorithms to monitor construction dust. Design/methodology/approach: This study develops a synthetic image generation method that incorporates virtual environments of construction dust for producing training samples. Three state-of-the-art object detection algorithms, including Faster-RCNN, you only look once (YOLO) and single shot detection (SSD), are trained using solely synthetic images. Finally, this research provides a comparative analysis of object detection algorithms for real-world dust monitoring regarding accuracy and computational efficiency. Findings: This study creates a construction dust emission (CDE) dataset consisting of 3,860 synthetic dust images as the training dataset and 1,015 real-world images as the testing dataset. The YOLO-v3 model achieves the best performance among all three object detection models, with a 0.93 F1 score at 31.44 fps. The experimental results indicate that training dust detection algorithms with only synthetic images can achieve acceptable performance on real-world images. Originality/value: This study provides insights into two questions: (1) how synthetic images could help train dust detection models to overcome data-hungry problems and (2) how well state-of-the-art deep learning algorithms can detect nonrigid construction dust.
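The F1 score reported in the abstract above is conventionally the harmonic mean of precision and recall. A minimal sketch follows; the function name `f1_score` and the input values are hypothetical and are not taken from the dust-monitoring study.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical precision/recall values chosen only for illustration.
print(f"F1 = {f1_score(0.90, 0.96):.4f}")
```

Because it is a harmonic mean, the F1 score is pulled toward the lower of the two inputs, so a detector cannot score well by maximizing only precision or only recall.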
Monitoring the size of key indicator species of fish is important to understand ecosystem functions, anthropogenic stress, and population dynamics. Standard methodologies gather data using underwater cameras, but are biased due to the use of baits, limited deployment time, and short field of view. Furthermore, they require experts to analyse long videos to search for species of interest, which is time consuming and expensive. This paper describes the Underwater Detector of Moving Object Size (UDMOS), a cost-effective computer vision system that records events of large fishes passing in front of a camera, using minimalistic hardware and power consumption. UDMOS can be deployed underwater, as an unbaited system, and is also offered as a free-to-use Web Service for batch video-processing. It embeds three different alternative large-object detection algorithms based on deep learning, unsupervised modelling, and motion detection, and can work both in shallow and deep waters with infrared or visible light.
Due to the advances in deep reinforcement learning and the demand for large training data, virtual-to-real learning has recently gained a lot of attention from the computer vision community. As state-of-the-art 3D engines can generate photo-realistic images suitable for training deep neural networks, researchers have gradually applied 3D virtual environments to learn different tasks including autonomous driving, collision avoidance, and image segmentation, to name a few. Although there are already many open-source simulation environments readily available, most of them either provide small scenes or have limited interactions with objects in the environment. To facilitate visual recognition learning, we present a new Virtual Environment for Visual Deep Learning (VIVID), which offers large-scale diversified indoor and outdoor scenes. Moreover, VIVID leverages an advanced human skeleton system, which enables us to simulate numerous complex human actions. VIVID has a wide range of applications and can be used for learning indoor navigation, action recognition, event detection, etc. We also release several deep learning examples in Python to demonstrate the capabilities and advantages of our system.
Multi-People Tracking in an open-world setting requires a special effort in precise detection. Moreover, temporal continuity in the detection phase gains more importance when scene cluttering introduces the challenging problem of occluded targets. For this purpose, we propose a deep network architecture that jointly extracts people's body parts and associates them across short temporal spans. Our model explicitly deals with occluded body parts by hallucinating plausible solutions for joints that are not visible. We propose a new end-to-end architecture composed of four branches (visible heatmaps, occluded heatmaps, part affinity fields and temporal affinity fields) fed by a time linker feature extractor. To overcome the lack of surveillance data with tracking, body part and occlusion annotations, we created the vastest Computer Graphics dataset for people tracking in urban scenarios by exploiting a photorealistic videogame. It is up to now the vastest dataset (about 500,000 frames, more than 10 million body poses) of human body parts for people tracking in urban scenarios. Our architecture trained on virtual data also exhibits good generalization capabilities on public real tracking benchmarks when image resolution and sharpness are high enough, producing reliable tracklets useful for further batch data association or re-id modules.
As an initial assessment, over 480,000 labeled virtual images of normal highway driving were readily generated in Grand Theft Auto V's virtual environment. Using these images, a CNN was trained to detect following distance to cars/objects ahead, lane markings, and driving angle (angular heading relative to lane centerline): all variables necessary for basic autonomous driving. Encouraging results were obtained when tested on over 50,000 labeled virtual images from substantially different GTA-V driving environments. This initial assessment begins to define both the range and scope of the labeled images needed for training as well as the range and scope of labeled images needed for testing the definition of boundaries and limitations of trained networks. It is the efficacy and flexibility of a "GTA-V"-like virtual environment that is expected to provide an efficient well-defined foundation for the training and testing of Convolutional Neural Networks for safe driving. Additionally, we describe the Princeton Virtual Environment (PVE) for the training, testing and enhancement of safe driving AI, which is being developed using the video-game engine Unity. PVE is being developed to recreate rare but critical corner cases that can be used in re-training and enhancing machine learning models and understanding the limitations of current self-driving models. The Florida Tesla crash is being used as an initial reference.
Computer graphics can not only generate synthetic images and ground truth, but it also offers the possibility of constructing virtual worlds in which: (i) an agent can perceive, navigate, and take actions guided by AI algorithms, (ii) properties of the worlds can be modified (e.g., material and reflectance), (iii) physical simulations can be performed, and (iv) algorithms can be learnt and evaluated. But creating realistic virtual worlds is not easy. The game industry, however, has spent a lot of effort creating 3D worlds, which a player can interact with. So researchers can build on these resources to create virtual worlds, provided we can access and modify the internal data structures of the games. To enable this, we created an open-source plugin, UnrealCV, for a popular game engine, Unreal Engine 4 (UE4). We show two applications: (i) a proof of concept image dataset, and (ii) linking Caffe with the virtual world to test deep network algorithms.
We present Open Images V4, a dataset of 9.2M images with unified annotations for image classification, object detection and visual relationship detection. The images have a Creative Commons Attribution license that allows sharing and adapting the material, and they have been collected from Flickr without a predefined list of class names or tags, leading to natural class statistics and avoiding an initial design bias. Open Images V4 offers large scale across several dimensions: 30.1M image-level labels for 19.8k concepts, 15.4M bounding boxes for 600 object classes, and 375k visual relationship annotations involving 57 classes. For object detection in particular, we provide 15× more bounding boxes than the next largest datasets (15.4M boxes on 1.9M images). The images often show complex scenes with several objects (8 annotated objects per image on average). We annotated visual relationships between them, which support visual relationship detection, an emerging task that requires structured reasoning. We provide in-depth comprehensive statistics about the dataset, we validate the quality of the annotations, we study how the performance of several modern models evolves with increasing amounts of training data, and we demonstrate two applications made possible by having unified annotations of multiple types coexisting in the same images. We hope that the scale, quality, and variety of Open Images V4 will foster further research and innovation even beyond the areas of image classification, object detection, and visual relationship detection.
We study active object tracking, where a tracker takes visual observations (i.e., frame sequences) as inputs and produces the corresponding camera control signals as outputs (e.g., move forward, turn left, etc.). Conventional methods tackle tracking and camera control tasks separately, and the resulting system is difficult to tune jointly. Such an approach also requires significant human effort for image labeling and expensive trial-and-error system tuning in the real world. To address these issues, we propose, in this paper, an end-to-end solution via deep reinforcement learning. A ConvNet-LSTM function approximator is adopted for the direct frame-to-action prediction. We further propose environment augmentation techniques and a customized reward function, which are crucial for successful training. The tracker trained in simulators (ViZDoom, Unreal Engine) demonstrates good generalization behavior in the case of unseen object moving paths, unseen object appearances, unseen backgrounds, and distracting objects. The system is robust and can restore tracking after occasional loss of the target being tracked. We also find that the tracking ability, obtained solely from simulators, can potentially transfer to real-world scenarios. We demonstrate successful examples of such transfer, via experiments over the VOT dataset and the deployment of a real-world robot using the proposed active tracker trained in simulation.
Collecting training data from the physical world is usually time-consuming and even dangerous for fragile robots, and thus, recent advances in robot learning advocate the use of simulators as the training platform. Unfortunately, the reality gap between synthetic and real visual data prohibits direct migration of the models trained in virtual worlds to the real world. This paper proposes a modular architecture for tackling the virtual-to-real problem. The proposed architecture separates the learning model into a perception module and a control policy module, and uses semantic image segmentation as the meta representation for relating these two modules. The perception module translates the perceived RGB image to semantic image segmentation. The control policy module is implemented as a deep reinforcement learning agent, which performs actions based on the translated image segmentation. Our architecture is evaluated in an obstacle avoidance task and a target following task. Experimental results show that our architecture significantly outperforms all of the baseline methods in both virtual and real environments, and demonstrates a faster learning curve than them. We also present a detailed analysis for a variety of variant configurations, and validate the transferability of our modular architecture.
We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that's pretty swell. It's a little bigger than last time but more accurate. It's still fast though, don't worry. At 320x320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 mAP@50 in 51 ms on a Titan X, compared to 57.5 mAP@50 in 198 ms by RetinaNet, similar performance but 3.8x faster. As always, all the code is online at
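The ".5 IOU" metric mentioned in the abstract above counts a predicted box as a correct detection when its intersection over union (IoU) with a ground-truth box is at least 0.5. The following is a minimal sketch of that overlap test, assuming axis-aligned boxes given as `(x_min, y_min, x_max, y_max)`; the function names `iou` and `is_match` are hypothetical and are not part of the YOLOv3 code base.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping region (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_match(predicted, ground_truth, threshold=0.5):
    """True when the predicted box overlaps the ground truth by at least `threshold` IoU."""
    return iou(predicted, ground_truth) >= threshold


# Hypothetical boxes chosen only for illustration.
predicted = (0, 0, 10, 10)
ground_truth = (5, 0, 15, 10)
print(iou(predicted, ground_truth), is_match(predicted, ground_truth))
```

A full mAP computation additionally ranks detections by confidence and averages precision over recall levels and classes; this sketch covers only the per-box overlap criterion.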