From Simulation to Reality: CNN Transfer
Learning for Scene Classification
Jordan J. Bird¹, Diego R. Faria², and Anikó Ekárt³
Aston Robotics, Vision and Intelligent Systems Lab
Aston University
Birmingham, United Kingdom
Email: {birdj11, d.faria2, a.ekart3}@aston.ac.uk
Pedro P. S. Ayrosa
Universidade Estadual de Londrina
Londrina, Brazil
Email: ayrosa@uel.br
Abstract—In this work, we show that both fine-tune learning
and cross-domain sim-to-real transfer learning from virtual to
real-world environments improve the starting and final scene
classification abilities of a computer vision model. A 6-class
computer vision problem of scene classification is presented from
both videogame environments and photographs of the real world,
where both datasets have the same classes. 12 networks are
trained from 2, 4, 8, ..., 4096 hidden interpretation neurons
following a fine-tuned VGG16 Convolutional Neural Network
for a dataset of virtual data gathered from the Unity game
engine and for a photographic dataset gathered from an online
image search engine. 12 Transfer Learning networks are then
benchmarked using the trained networks on virtual data as a
starting weight distribution for a neural network to classify the
real-world dataset. Results show that all of the transfer networks
have a higher starting accuracy pre-training, with the best
showing an improvement of +48.34% image classification ability
and an average increase of +38.33% for the starting abilities of all
hyperparameter sets benchmarked. Of the 12 experiments, nine
transfer experiments showed an improvement over non-transfer
learning, two showed a slightly lower ability, and one did not
change. The best accuracy overall was obtained by a transfer
learning model with a layer of 64 interpretation neurons scoring
89.16% compared to the non-transfer counterpart of 88.27%. An
average increase of +7.15% was observed over all experiments.
The main finding is that not only can a higher final classification
accuracy be achieved, but strong classification abilities prior to
any training whatsoever are also encountered when transferring
knowledge from simulation to real-world data, proving useful
domain knowledge transfer between the datasets.
Keywords—Sim-to-real, Transfer Learning, Deep Learning,
Computer Vision, Autonomous Perception, Scene Classification,
Environment Recognition
I. INTRODUCTION
The possibility of transfer learning from simulated data
to real-world application is promising due to the scarcity
of real-world labelled data being an issue encountered
in many applications of machine learning and artificial
intelligence [1], [2], [3]. Based on this, fine-tune learning
and transfer learning are often both considered to be viable
solutions to the issue of data scarcity in the scientific state-of-
the-art via large-scale models such as ImageNet and VGG16
for the former and methods such as rule and weight transfer
for the latter [4], [5], [6]. Here, we attempt to perform both
of these methods in a pipeline for scene classification, by
fine-tuning a large-scale model and transferring knowledge
between rules learnt from simulation to real-world datasets.
The consumer-level quality of videogame technology
has rapidly improved towards arguably photo-realistic
graphical quality through ray-traced lighting, high resolution
photographic textures and Physically Based Rendering (PBR),
to name but a few prominent techniques. This then raises
the question, since simulated environments are ever more
realistic, is it possible to transfer knowledge from them to
real-world situations? Should this be possible, the problem
of data scarcity would be mitigated, and also a more optimal
process of learning would become possible by introducing a
starting point learned from simulation. If this process provides
a better starting point than, for example, a classical random
weight distribution, then fewer computational resources
are required to learn about the real world and also fewer
labelled data points are required. In addition, if this process
is improved further, learning from real-world data may not
actually be required at all.
In this work, we perform 12 individual topology exper-
iments in order to show that real-world classification of
relatively scarce data can be improved via pre-training said
models on simulation data from a high-quality videogame
environment. The weights developed on simulation data are
applied as a starting point for the backpropagation learning of
real-world data, and we find that both starting accuracies and
asymptotes (final ability) are often higher when the model has
been able to train on simulation data before considering real
data.
The main scientific contributions of this work are threefold:
1) The formation of two datasets for a 6-class scene classi-
fication problem, comprising both artificial simulation and real-world
photographic data¹.
2) 24 topology tuning experiments for best classification
of the two datasets, 12 for each of the datasets, by
2, 4, 8, ..., 4096 interpretation neurons following the fine-
tuning of a VGG16 CNN network (with interpretation
and softmax layers removed). This provides a baseline
comparison for Transfer Learning as well as the pre-
trained weights to be used in the following experiment.
3) 12 transfer learning experiments of the weights trained
on simulation data transferred to networks with the task
of classifying real-world data. The results are evidence
that transfer learning of useful domain knowledge is
possible from the classification of simulated environ-
ments to the classification of real-world photographic
data, further improving classification ability of real data.

¹https://www.kaggle.com/birdy654/environment-recognition-simulation-to-reality

Fig. 1. An example of the usage of Ray Tracing, Physically-Based Rendering
and high quality textures in order to generate a realistic simulation of a living
room environment [7].
The remainder of this article is organised as follows: in
Section II the state of the art in knowledge transfer from
virtual worlds to real world is discussed, in Section III our
methodology is outlined, while in Section IV experimental
results are presented and analysed. A discussion of possible
future work is provided in Section V before a final conclusion
to this study is drawn in Section VI.
II. BACKGROUND AND RELATED WORK
In this section, state of the art in the area of knowledge
transfer from virtual world tasks to real life tasks is discussed.
The possibility of transfer from modern videogames to reality
for complex problems is a new and rapidly growing line of
thought within the field of deep learning. Related works are
limited due to the young age of the field².

²The most popular works, as of writing up the results of this study, are still in the form of preprints.
Technologies such as realistic Ray Tracing and PBR in
conjunction with photographic or photographically-enhanced
textures enable photorealism in simulated environments (in
this context, generated as a videogame environment). Ray
Tracing is a rendering technique that works by following the
individual pixel paths of light and simulating its physical
properties when interacting with objects in the scene, which
produces higher levels of realism in terms of lighting, as
opposed to the classical row-by-row scanline method [8].
Pharr et al. provide a detailed review of PBR methods and
their various implementations [9]. PBR
is the concept of combining high quality 3D models and
surface-measured shading in order to produce accurate
representations of real materials and thus photo-realistic
quality objects [10]. An example of the quality of simulation
possible through the usage of these technologies can be seen
in Figure 1, developed by ArchVizPRO [7].
Transfer Learning is the improvement of a learning process
for a new task by transferring knowledge from a related,
so-called source task, to the new task, which is called the
target task. In this study, trained weights from one classification
problem are used as the initial weights for a second problem
and are subsequently compared to standard random weight
distribution for this same problem [11]. The issue of data
availability is recognised in a notable survey on transfer
learning, where transfer learning approaches are suggested
to produce better solutions for a second task characterised
by more limited data than the first task [12]. The reduced
availability of real-world data in comparison to the almost
infinite possibilities in virtual environments is such a scenario.
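As a concrete illustration of this weight-transfer mechanism, the following minimal sketch (in Keras, with placeholder layer sizes and commented-out training calls; it is not the exact configuration used later in this paper) initialises a target-task network with the weights of an identically-shaped source-task network instead of a random distribution.

    # Minimal sketch of transfer learning by weight initialisation (Keras).
    # The layer sizes and the commented-out data arrays are illustrative placeholders.
    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Dense

    def build_model(n_inputs, n_classes):
        # Identical topology for source and target tasks, so weights are directly transferable.
        return Sequential([
            Dense(64, activation="relu", input_shape=(n_inputs,)),
            Dense(n_classes, activation="softmax"),
        ])

    source = build_model(n_inputs=512, n_classes=6)
    source.compile(optimizer="adam", loss="categorical_crossentropy")
    # source.fit(x_source, y_source, epochs=50)   # train on the (plentiful) source dataset

    target = build_model(n_inputs=512, n_classes=6)
    target.set_weights(source.get_weights())      # transfer: source weights replace random initialisation
    target.compile(optimizer="adam", loss="categorical_crossentropy")
    # target.fit(x_target, y_target, epochs=10)   # continue training on the (scarce) target dataset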
Kim and Park recently argued against a classical heuristic
search approach for the tracking of road lanes in favour of
a deep learning approach featuring transfer learning from
Grand Theft Auto V (GTA V) and TORCS environments [13].
GTA V was also used to gather data for a computer vision
experiment in which vehicle collisions were succesfully
predicted when transfer learning was applied [14]. Trial-and-
error learning is not suited to high-risk activities such as
driving, and so, reinforcement learning is not possible when
the starting point is a real-world situation; researchers argue
that transfer of knowledge can improve the ability to perform
complex tasks, when initially performed in simulation [15]
and [16]. For autonomous navigation, environment mapping
and recognition is a very important task for self-driving
vehicles, many of which consider LiDAR data as input
towards mapping and subsequent successful real-time
navigation [17], [18], [19], [20].
In addition to LiDAR, many authors have argued for the
processing of photographic image data for environment or
scene recognition. Herranz et al. [21] show that classification
of both scenes and objects reaches human-level classification
abilities of 70.17% on the SUN397 places dataset via manually
chosen combinations of ImageNet-CNNs and Places-CNNs.
Similarly, Wu et al. [22] achieved accuracy of 58.11% on the
same dataset through harvesting discriminative meta-objects,
outperforming Places-CNN (AlexNet fine tuning), which had
a benchmark accuracy of 56.2% [23].
In Tobin et al. researchers trained computer vision models
with a technique of domain randomisation for object recogni-
tion within a real-world simulation, which, when transferred
to real-world data, could recognise objects within an error
of around 1.5 centimetres [24]. This was further improved
when it was noted that a dataset of synthetic images from
a virtual environment could be used to train a real-world
computer vision model within an error rate of 1.5 to 3.5
millimetres on average [25]. Researchers noted that virtual
environments were simply treated as another variation
rather than providing unsuitable noise. Similarly, computer
vision models were also improved when initially trained in
simulation for further application in reality where distance
error rates were reduced for a vision-based robotic arm [26].
Scene recognition between virtual and real environments has
received little attention. Wallet et al. show via a
comparison of high and low detail virtual environments that
high detail in the virtual environment leads to better results
for tasks such as scene classification and way-finding when
applied to real environments [27]. The study was based on
the experiences of 64 human subjects.
So far, very little exploration into the possibility of transfer
learning from virtual to real environments for the task of
environment recognition or scene classification has been per-
formed. Though many of these works are currently preprints
and are yet to be published, they already have a high impact,
and results are often replicated in related experiments. In terms
of scene classification, either LiDAR or photographic image
data are considered as a data source for the task, with the best
scores often being achieved by deep learning methods such
as the Convolutional Neural Network, which features often in
state-of-the-art work. Transfer learning features often in these
works, either by simply fine-tuning a pre-trained CNN on a
large dataset, or training on a dataset and transfer learning
weight matrices to a second, more scarce dataset. Inspired by
these works, we opt to select photographic data of virtual and
real environments before transfer learning by initial weight
distribution to a fine-tuned network in order to attempt to use
both methods. The successful transfer of knowledge attained
in this experiment serves as a basis for further exploration
into the possibilities of improving environment classification
algorithms by considering an activity of pre-training on the
infinite possibilities of virtual environments before considering
a real-world problem.
III. THE PROPOSED RESEARCH QUESTION AND OUR APPROACH
We propose to answer the research question "Can knowl-
edge be transferred from simulation to real world, to improve
effectiveness and efficiency of learning to perform real world
tasks, when real world training data are scarce?". Here,
we explain our approach, starting from building the datasets,
following with the experiment, choice of models and practical
implementation. We include chosen hyperparameters and com-
putational resources in order to promote replicability as well
as for future improvement and application to related state-of-
the-art problems.

Fig. 2. In order to collect artificial data, a camera is attached to a humanoid
robot for height reference in the Unity game engine.
A. Datasets
Initially, two large image datasets are gathered from the
following environments:
Forest
Field
Bathroom
Living Room
Staircase
Computer Lab
The first two are natural environments and the final four are
artificial environments.
For the simulation data, 1,000 images are collected per
environment from the Unity videogame engine via a rotating
camera of 22mm focal length (chosen since it is most similar
to the human eye [28]) affixed to the viewpoint of a 120cm
(3.93ft) robot model, as can be seen in Figure 2. The camera
is rotated 5 degrees around the Y axis per photograph, and
then rotated around the X axis 15 degrees three times after
the full Y rotation has occurred.³ In total, 6,000 images are
collected in order to form a balanced dataset.

³Unity script for data collection is available at https://github.com/jordan-bird/Unity-Image-Dataset-Collector
For the photographic real-world data, a Google Images
web-crawler is set to search and save the first 600 image
search results for each environment name. Each set of
collected images is searched through manually in order to
remove any false results, and more data is then collected if
needed to retain perfect class balance.
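The paper does not specify the crawler implementation; the following sketch assumes the open-source icrawler package and illustrates how roughly 600 results per class could be gathered before the manual filtering step described above.

    # Sketch of gathering ~600 real-world images per class via a Google Images crawler.
    # The crawler used in the paper is not named; the "icrawler" package is assumed here,
    # and the class list mirrors the six environments described above.
    from icrawler.builtin import GoogleImageCrawler

    CLASSES = ["forest", "field", "bathroom", "living room", "staircase", "computer lab"]

    for name in CLASSES:
        crawler = GoogleImageCrawler(storage={"root_dir": f"real/{name}"})
        crawler.crawl(keyword=name, max_num=600)
    # False results are then removed manually and further images collected
    # so that every class retains exactly the same number of images.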
Fig. 3. Samples of virtual (top) and real (bottom) environments from the two
datasets gathered for these experiments.

Fig. 4. Overall diagram of the experiment (a fine-tuned VGG16 CNN followed
by an interpretation layer of 2, 4, 8, ..., 4096 neurons and a softmax layer, with
and without weight transfer) showing the derivation of ΔS and ΔF (change in
starting and final classification ability) for comparison.

In Figure 3, samples of the virtual visual data gathered from
the Unity game engine (top row) and photographs of real-world
environments gathered from Google Images (bottom row) are
shown. Various similarities can be seen, especially through the
colours that occur in nature. Some of the more photo-realistic
environments, such as the living room, bear similarity due to
the realistic high-poly models for example through the creases
in the sofa material. Less realistic environments, such as the
bathroom, feature fewer similarities through the shapes of the
models, although lighting differs between the two.
B. Experiment
With all image data represented as a 128 × 128 × 3 array
of RGB values, the datasets are used to train the models.
Convolutional Neural Network layers are fine-tuned from the
VGG16 network [29] with input layers replaced by the shape
of our data, and interpretation layers are removed in order to
benchmark a single layer of 2, 4, 8, ..., 4096 neurons. All of
these sets of hyperparameters are trained on the simulation
images dataset, and an additional set of hyperparameters
are then trained on the real images dataset, both for 50
epochs. Following this, all weights trained on the simulation
dataset are then transferred to real-world data for a further
10 epochs of training in order to benchmark the possibilities
of transfer learning. Thus, both methods of fine-tune and
transfer learning are explored. All training of models is via
10-fold cross validation where starting (pre-training) and
asymptote (ultimate ability) abilities are measured in order
to discern whether knowledge transfer is possible between
the domains. A diagram of the experiment can be observed
in Figure 4, within which changes in starting (ΔS) and final
(ΔF) abilities of the classification of real-world environments
are compared with and without weight transfer from a model
pre-trained on data gathered from virtual environments.
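A minimal sketch of this pipeline is given below. It assumes Keras with a TensorFlow backend (as stated in Section III-C) and ImageNet weights as the starting point for the VGG16 fine-tuning; the data loading and 10-fold cross-validation wiring are omitted, and names such as x_sim and x_real are placeholders rather than the authors' own code.

    # Sketch of the fine-tune + sim-to-real transfer pipeline described above.
    # x_sim/y_sim and x_real/y_real are assumed to be 128x128x3 image arrays with 6-class labels.
    from tensorflow.keras import Model
    from tensorflow.keras.applications import VGG16
    from tensorflow.keras.layers import Dense, Flatten, Input

    def build_network(interpretation_neurons, n_classes=6):
        # VGG16 convolutional base with the input layer replaced by our image shape
        # and the original interpretation/softmax layers removed (include_top=False).
        base = VGG16(include_top=False, weights="imagenet",
                     input_tensor=Input(shape=(128, 128, 3)))
        x = Flatten()(base.output)
        x = Dense(interpretation_neurons, activation="relu")(x)  # single interpretation layer
        out = Dense(n_classes, activation="softmax")(x)
        model = Model(base.input, out)
        model.compile(optimizer="adam", loss="categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    for n in [2 ** i for i in range(1, 13)]:            # 2, 4, 8, ..., 4096
        sim_model = build_network(n)
        # sim_model.fit(x_sim, y_sim, epochs=50)        # fine-tune on simulation images

        real_transfer = build_network(n)
        real_transfer.set_weights(sim_model.get_weights())  # sim-to-real weight transfer
        # real_transfer.fit(x_real, y_real, epochs=10)       # 10 further epochs on real data

        real_baseline = build_network(n)                # random initial weights for comparison
        # real_baseline.fit(x_real, y_real, epochs=50)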
The goal of the learning process is the minimisation of
loss (misclassification) through backpropagation of errors and
optimisation of weights. This is possible since all data are
labelled, and thus, predictions can be compared to the ground
truths. The goal is to reduce the cross-entropy loss [30], [31]:
\[
-\sum_{c=1}^{M} y_{o,c}\,\log(p_{o,c}),
\tag{1}
\]
where M is the number of classes (in this case, 6), y is a
binary indicator of a correct or erroneous prediction (that class
c is the true class of the data object o), and p is the probability
that o is predicted to belong to class c. If this value is
algorithmically minimised, the network is then able to learn
from errors and attempt to account for them and improve its
classification ability.
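As a toy worked example of Equation (1) (the probability values below are illustrative and not taken from the experiments), the loss for a single data object can be computed directly:

    # Toy computation of the categorical cross-entropy loss of Eq. (1) for one data object o.
    import numpy as np

    y = np.array([0, 0, 1, 0, 0, 0])                    # one-hot ground truth (true class c = 2)
    p = np.array([0.05, 0.05, 0.70, 0.10, 0.05, 0.05])  # predicted class probabilities

    loss = -np.sum(y * np.log(p))                       # only the true class contributes
    print(round(float(loss), 4))                        # ~0.3567; lower is better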
The activation function of the interpretation layer and the learning
rate optimisation algorithm were arbitrarily chosen as the
Rectified Linear Unit (ReLU) and ADAM, respectively. ReLU is defined
as y = max(0, x).
ADAM [32] is a method of optimisation of network
weights during the learning process based on RMSProp [33]
and Momentum [34], and is generally calculated via the following steps:
1) The exponentially weighted average of past gradients,
v_dW, is calculated.
2) The exponentially weighted average of the squares of
past gradients, s_dW, is calculated.
3) The bias towards zero in the previous steps is corrected,
resulting in v_dW^corrected and s_dW^corrected.
Neural network parameters are then updated via:
\[
\begin{aligned}
v_{dW} &= \beta_1 v_{dW} + (1-\beta_1)\frac{\partial J}{\partial W}, \\
s_{dW} &= \beta_2 s_{dW} + (1-\beta_2)\left(\frac{\partial J}{\partial W}\right)^{2}, \\
v_{dW}^{corrected} &= \frac{v_{dW}}{1-(\beta_1)^{t}}, \qquad
s_{dW}^{corrected} = \frac{s_{dW}}{1-(\beta_2)^{t}}, \\
W &= W - \alpha\,\frac{v_{dW}^{corrected}}{\sqrt{s_{dW}^{corrected}}+\varepsilon},
\end{aligned}
\tag{2}
\]
where β1 and β2 are tunable hyperparameters, ∂J/∂W is the cost
gradient of the network layer which is currently being tuned,
W is a matrix of weights, α is the defined learning rate, and
ε is a small value introduced in order to prevent division by
zero.
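For illustration only, a single ADAM update following Equation (2) can be sketched as below; the default hyperparameter values shown are the common ones and are not values reported in the paper.

    # One ADAM update step for a weight matrix W, following Eq. (2).
    import numpy as np

    def adam_step(W, dJ_dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        v = beta1 * v + (1 - beta1) * dJ_dW          # weighted average of past gradients
        s = beta2 * s + (1 - beta2) * dJ_dW ** 2     # weighted average of squared gradients
        v_corrected = v / (1 - beta1 ** t)           # correct the bias towards zero
        s_corrected = s / (1 - beta2 ** t)
        W = W - alpha * v_corrected / (np.sqrt(s_corrected) + eps)
        return W, v, s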
C. Practical Implementation
In this work, all models were trained on deep neural
networks developed in the Keras library with a TensorFlow
backend. Implementation was performed in Python. Random
weights were generated by an Intel Core i7 CPU which was
running at a clock speed of 3.7GHz. RAM used for the initial
storage of images was 32GB at a clock speed of 1202MHz
(Dual-Channel 16GB) before transfer to the 6GB of VRAM
and subsequent learning on a GTX 980Ti GPU via its 2816
CUDA cores.
IV. RESULTS
In this section, the results from the experiments are
presented following the method described above. Firstly, the
classification ability of the networks trained on virtual data
is outlined; then a comparison is made between networks classifying
real-world data initialised with a random weight distribution
and with weights transferred from the networks trained on virtual
environments.
A. Initial Training for Virtual Environments
The classification accuracy of the 12 sets of weights
corresponding to 2 to 4096 interpretation neurons, respectively,
to be transferred in the experiment can be observed in Table
I. High accuracy is observed for interpretation
neurons 8 to 4096; this is likely due to the CNN generating
sets of similar features from the repetitive nature of videogame
environments. In order to optimise the rendering of frames
to the desired 60 fps, models, textures and bump maps are
often repeated in order to reduce the execution time of the
graphical pipeline [35].
TABLE I
BENCHMARKING OF INTERPRETATION NETWORK TOPOLOGIES FOR
SIMULATION ENVIRONMENTS. HIGH RESULTS (90%+) CAN BE
EXPECTED DUE TO REPEATED TEXTURES, BUMP MAPS AND LIGHTING.

Interpretation Neurons    Classification Accuracy (%)
2                         33.28
4                         49.69
8                         88
16                        96.04
32                        98.33
64                        98.33
128                       98.16
256                       98.76
512                       97.02
1024                      97.86
2048                      64.08
4096                      93.93
B. Transfer Learning vs Random Weights
The results for the transfer learning experiment can be
observed in Table II. The columns ΔS and ΔF show the
change in Starting (epoch 0, no backpropagation performed)
and Final classification accuracies in terms of transfer versus
non-transfer of weights, respectively. Interestingly, regardless
of the number of interpretation neurons, successful transfer of
knowledge is achieved for pre-training, with the lowest being
+3.1% via 2 interpretation neurons. The highest is +48.34%
accuracy in the case of 512 hidden interpretation neurons.
This shows that knowledge can be transferred as a starting
point. The average increase of starting accuracy over all
models was +38.33% when transfer learning was performed,
as opposed to an average starting accuracy of 16.4% without
knowledge transfer. In terms of the final classification
accuracy, success is achieved as well: 9 experiments lead to a
higher final accuracy, whereas two were slightly lower (-0.22%
for 128 neurons and -3.98% for 2048 neurons), and one does not
change (32 neurons). The average ΔF over all experiments is
+7.15%, with the highest being +24.56% via 4 interpretation
neurons. On average, the final accuracy of all models when
transfer learning is performed is 76.34%, in comparison
to the average final accuracy of 69.16% without transfer of
weights.
Overall, the best model for classifying the real-world
data is a fine-tuned VGG16 CNN followed by 64 hidden
interpretation neurons with initial weights transferred from
the network trained on simulated videogame environments.
This model scores a final classification accuracy of 89.16%,
the best final accuracy in Table II, when both fine-tune and
sim-to-real transfer learning are used in conjunction. The
majority of results, especially the highest ΔS, ΔF, and
final accuracy, show that transfer learning is not only a
possibility between simulation and real-world data for scene
classification, but also promote it as a viable solution in order
to both reduce computational resource requirements and lead
to higher classification ability overall.
TABLE II
COMPARISON OF NON-TRANSFER AND TRANSFER LEARNING EXPERIMENTS. ΔS AND ΔF DEFINE THE CHANGE IN STARTING AND FINAL ACCURACIES
BETWEEN THE SELECTED STARTING WEIGHT DISTRIBUTIONS. A POSITIVE VALUE DENOTES SUCCESSFUL TRANSFER OF KNOWLEDGE BETWEEN
SIMULATION AND REALITY.

                          Non-Transfer Learning                  Transfer Learning                      Comparison
Interpretation Neurons    Starting Acc. (%)  Final Acc. (%)      Starting Acc. (%)  Final Acc. (%)      ΔS        ΔF
2                         18.25              18.69               21.35              36.5                +3.1      +17.81
4                         15.27              27.32               33.74              51.88               +18.47    +24.56
8                         12.5               80.31               59.29              85.29               +46.79    +4.98
16                        21.57              85.07               60.37              86.73               +38.8     +1.66
32                        14.16              87.06               61.06              87.06               +46.9     0
64                        16.04              88.27               54.42              89.16               +38.38    +0.89
128                       15.93              87.17               61.17              86.95               +45.24    -0.22
256                       17.26              85.73               60.95              87.94               +43.69    +2.21
512                       14.27              77.88               62.61              79.65               +48.34    +1.77
1024                      19.58              68.69               62.83              85.29               +43.25    +16.6
2048                      17.7               67.7                56.75              63.72               +39.05    -3.98
4096                      14.27              56.19               62.39              75.88               +48.12    +19.69
Average                   16.4               69.16               54.73              76.34               38.33     7.15
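To make the comparison columns concrete, ΔS and ΔF are simply the transfer-learning accuracies minus their non-transfer counterparts; a quick check against the 64-neuron row of Table II:

    # Check of the comparison columns for the 64-neuron row of Table II.
    non_transfer_start, non_transfer_final = 16.04, 88.27
    transfer_start, transfer_final = 54.42, 89.16

    delta_s = round(transfer_start - non_transfer_start, 2)   # +38.38
    delta_f = round(transfer_final - non_transfer_final, 2)   # +0.89
    print(delta_s, delta_f)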
The results serve as a strong argument that transfer of knowl-
edge is possible in terms of pre-training of weights from
simulated environments. This is evidenced especially through
the initial ability of the transfer networks prior to any training
for classification of the real environments, but it is also shown
through the best ultimate score achieved by a network with
initial weights transferred.
V. DISCUSSION
In this section the limitations of this study are discussed
and directions for future work to further explore the potential
of this method are proposed. From the results observed in
this study, there are two main areas of future work which are
important to follow. Firstly, we propose to further improve the
artificial learning pipeline. Models were trained for 50 epochs
for each of the interpretation layers to be benchmarked. In
the future, the possibility of deeper networks with more than
one hidden interpretation layer, as well as combinations
of the hyperparameters, can be explored. The training time
of the random weight networks was relatively limited at
50 epochs and even further limited for transfer learning
at 10 epochs, although this was by design and due to the
computational resources available. Future work could concern
deeper interpretation networks as well as increased training
time. In this study hyperparameters such as the activation and
learning rate optimisation algorithm were arbitrarily chosen,
therefore in the future these could be explored in a further
combinatorial optimisation experiment. Secondly, simulation
to real transfer learning could also be attempted in various
fields in order to benchmark the ability of this method for
other real-world applications; for example, autonomous cars
and drones could be trained in a virtual environment for real-world
application. The next step for benchmarking could be to
compare the ability of this method to state-of-the-art methods
on publicly available datasets, should more computational
resources be available, similarly to the related works featured
in the literature review [21], [22], [23].
VI. CONCLUSION
In the experiments and results presented in this study,
we have shown success in transfer learning from virtual
environments to a task taking place in reality. Noticeably
high abilities were encountered for the classification
of virtual data alone, as expected, due to the optimisation processes
of recycling objects and repeating textures found within
videogame environments. Of the 12 networks trained with
and without transfer learning, a pattern of knowledge
transfer was observed: all starting accuracies were
substantially higher than with a random weight distribution, and,
most importantly, a best classification ability of 89.16%
was achieved when knowledge was initially transferred from
virtual environments.
These results provide a strong argument for the application
of both fine-tune and transfer learning for autonomous scene
classification. The former was achieved through the tuning of
VGG16 Convolutional Neural Networks, and the latter was
achieved by transferring weights from a network trained on
simulation data from videogames and applied to a real-world
situation. Transfer learning leads to both the reduction of
resource requirements for said problems, and the achievement
of a higher classification ability overall when pre-training
has occurred on simulated data. As future directions, further
improvement of the learning pipeline benchmarked in this
study together with exploration on other complex real-world
problems faced by autonomous machines are proposed.
VII. ACKNOWLEDGEMENT
This work was partially supported by the Royal Society
through the project "Sim2Real: From Simulation to Real
Robotic Application using Deep Reinforcement Learning and
Knowledge Transfer” with grant number RGS\R2\192498
awarded to D. R. Faria.
REFERENCES
[1] J. H. Chen and S. M. Asch, “Machine learning and prediction in
medicine—beyond the peak of inflated expectations, The New England
journal of medicine, vol. 376, no. 26, p. 2507, 2017.
[2] A. W. Tan, R. Sagarna, A. Gupta, R. Chandra, and Y. S. Ong, “Coping
with data scarcity in aircraft engine design,” in 18th AIAA/ISSMO
Multidisciplinary Analysis and Optimization Conference, p. 4434, 2017.
[3] A. Bouchachia, “On the scarcity of labeled data,” in International
Conference on Computational Intelligence for Modelling, Control and
Automation and International Conference on Intelligent Agents, Web
Technologies and Internet Commerce (CIMCA-IAWTIC’06), vol. 1,
pp. 402–407, IEEE, 2005.
[4] Y.-C. Su, T.-H. Chiu, C.-Y. Yeh, H.-F. Huang, and W. H. Hsu, “Trans-
fer learning for video recognition with scarce training data for deep
convolutional neural network, arXiv preprint arXiv:1409.4127, 2014.
[5] C. Hentschel, T. P. Wiradarma, and H. Sack, “Fine tuning cnns with
scarce training data—adapting imagenet to art epoch classification,”
in 2016 IEEE International Conference on Image Processing (ICIP),
pp. 3693–3697, IEEE, 2016.
[6] A. Bhowmik, S. Kumar, and N. Bhat, “Eye disease prediction from
optical coherence tomography images with transfer learning,” in Inter-
national Conference on Engineering Applications of Neural Networks,
pp. 104–114, Springer, 2019.
[7] "ArchVizPRO Interior Vol. 1," Mar 2018.
[8] A. Appel, “Some techniques for shading machine renderings of solids,”
in Proceedings of the April 30–May 2, 1968, Spring Joint Computer
Conference, pp. 37–45, ACM, 1968.
[9] M. Pharr, W. Jakob, and G. Humphreys, Physically based rendering:
From theory to implementation. Morgan Kaufmann, 2016.
[10] C. Ulbricht, A. Wilkie, and W. Purgathofer, “Verification of physically
based rendering algorithms,” in Computer Graphics Forum, vol. 25,
pp. 237–255, Wiley Online Library, 2006.
[11] L. Torrey and J. Shavlik, “Transfer learning, in Handbook of research
on machine learning applications and trends: algorithms, methods, and
techniques, pp. 242–264, IGI Global, 2010.
[12] S. J. Pan and Q. Yang, A survey on transfer learning, IEEE Trans-
actions on knowledge and data engineering, vol. 22, no. 10, pp. 1345–
1359, 2009.
[13] J. Kim and C. Park, “End-to-end ego lane estimation based on sequential
transfer learning for self-driving cars, in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition Workshops,
pp. 30–38, 2017.
[14] K. Lee, H. Kim, and C. Suh, “Crash to not crash: Playing video games
to predict vehicle collisions,” in ICML Workshop on Machine Learning
for Autonomous Vehicles, 2017.
[15] M. B. Uhr, D. Felix, B. J. Williams, and H. Krueger, “Transfer of
training in an advanced driving simulator: Comparison between real
world environment and simulation in a manoeuvring driving task, in
Driving Simulation Conference, North America, p. 11, 2003.
[16] A. Bewley, J. Rigley, Y. Liu, J. Hawke, R. Shen, V.-D. Lam, and
A. Kendall, “Learning to drive from simulation without real world
labels,” in 2019 International Conference on Robotics and Automation
(ICRA), pp. 4818–4824, IEEE, 2019.
[17] F. Yu, J. Xiao, and T. Funkhouser, “Semantic alignment of LiDAR data
at city scale,” in IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 1722–1731, 2015.
[18] C. Zach, A. Penate-Sanchez, and M. Pham, “A dynamic programming
approach for fast and robust object pose recognition from range images,
in IEEE CVPR, pp. 196–203, 2015.
[19] D. Xu, D. Anguelov, and A. Jain, “PointFusion: Deep sensor fusion for
3D bounding box estimation,” in IEEE/CVF CVPR, pp. 244–253, 2018.
[20] A. Ess, B. Leibe, and L. Van Gool, “Depth and appearance for mobile
scene analysis,” in IEEE 11th International Conference on Computer
Vision (ICCV), pp. 1–8, 2007.
[21] L. Herranz, S. Jiang, and X. Li, "Scene recognition with CNNs: objects,
scales and dataset bias,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 571–579, 2016.
[22] R. Wu, B. Wang, W. Wang, and Y. Yu, “Harvesting discriminative meta
objects with deep cnn features for scene classification,” in Proceedings
of the IEEE International Conference on Computer Vision, pp. 1287–
1295, 2015.
[23] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning
deep features for scene recognition using places database,” in Advances
in neural information processing systems, pp. 487–495, 2014.
[24] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel,
“Domain randomization for transferring deep neural networks from sim-
ulation to the real world,” in 2017 IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS), pp. 23–30, IEEE, 2017.
[25] T. Inoue, S. Choudhury, G. De Magistris, and S. Dasgupta, “Transfer
learning from synthetic to real images using variational autoencoders for
precise position detection,” in 2018 25th IEEE International Conference
on Image Processing (ICIP), pp. 2725–2729, IEEE, 2018.
[26] F. Zhang, J. Leitner, B. Upcroft, and P. Corke, “Vision-based reaching
using modular deep networks: from simulation to the real world,” arXiv
preprint arXiv:1610.06781, 2016.
[27] G. Wallet, H. Sauzéon, P. A. Pala, F. Larrue, X. Zheng, and B. N'Kaoua,
“Virtual/real transfer of spatial knowledge: Benefit from visual fidelity
provided in a virtual environment and impact of active navigation,”
Cyberpsychology, Behavior, and Social Networking, vol. 14, no. 7-8,
pp. 417–423, 2011.
[28] M. Mrochen, M. Kaemmerer, P. Mierdel, H.-E. Krinke, and T. Seiler,
“Is the human eye a perfect optic?,” in Ophthalmic Technologies XI,
vol. 4245, pp. 30–35, International Society for Optics and Photonics,
2001.
[29] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
[30] K. P. Murphy, Machine learning: a probabilistic perspective. MIT press,
2012.
[31] S. Kullback and R. A. Leibler, “On information and sufficiency, The
annals of mathematical statistics, vol. 22, no. 1, pp. 79–86, 1951.
[32] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization,”
arXiv preprint arXiv:1412.6980, 2014.
[33] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop, coursera: Neural
networks for machine learning,” University of Toronto, Technical Report,
2012.
[34] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance
of initialization and momentum in deep learning,” in International
conference on machine learning, pp. 1139–1147, 2013.
[35] J. Dargie, “Modeling techniques: movies vs. games,” ACM SIGGRAPH
Computer Graphics, vol. 41, no. 2, p. 2, 2007.