From Simulation to Reality: CNN Transfer
Learning for Scene Classification
Jordan J. Bird, Diego R. Faria, and Anikó Ekárt
Aston Robotics, Vision and Intelligent Systems Lab
Aston University
Birmingham, United Kingdom
Email: {birdj11, d.faria2, a.ekart3}@aston.ac.uk
Pedro P. S. Ayrosa
Universidade Estadual de Londrina
Londrina, Brazil
Email: ayrosa@uel.br
Abstract—In this work, we show that both fine-tune learning
and cross-domain sim-to-real transfer learning from virtual to
real-world environments improve the starting and final scene
classification abilities of a computer vision model. A 6-class
computer vision problem of scene classification is presented from
both videogame environments and photographs of the real world,
where both datasets have the same classes. 12 networks are
trained from 2, 4, 8, . . . , 4096 hidden interpretation neurons
following a fine-tuned VGG16 Convolutional Neural Network
for a dataset of virtual data gathered from the Unity game
engine and for a photographic dataset gathered from an online
image search engine. 12 Transfer Learning networks are then
benchmarked using the trained networks on virtual data as a
starting weight distribution for a neural network to classify the
real-world dataset. Results show that all of the transfer networks
have a higher starting accuracy pre-training, with the best
showing an improvement of +48.34% image classification ability
and an average increase of +38.33% for the starting abilities of all
hyperparameter sets benchmarked. Of the 12 experiments, nine
transfer experiments showed an improvement over non-transfer
learning, two showed a slightly lower ability, and one did not
change. The best accuracy overall was obtained by a transfer
learning model with a layer of 64 interpretation neurons scoring
89.16% compared to the non-transfer counterpart of 88.27%. An
average increase of +7.15% was observed over all experiments.
The main finding is that not only can a higher final classification
accuracy be achieved, but strong classification abilities prior to
any training whatsoever are also encountered when transferring
knowledge from simulation to real-world data, proving useful
domain knowledge transfer between the datasets.
Keywords—Sim-to-real, Transfer Learning, Deep Learning,
Computer Vision, Autonomous Perception, Scene Classification,
Environment Recognition
I. INTRODUCTION
The possibility of transfer learning from simulated data
to real-world applications is promising because the scarcity
of real-world labelled data is an issue encountered
in many applications of machine learning and artificial
intelligence [1], [2], [3]. Based on this, Fine-tune Learning
and Transfer learning are often both considered to be viable
solutions to the issue of data scarcity in the scientific state-of-
the-art via large-scale models such as ImageNet and VGG16
for the former and methods such as rule and weight transfer
for the latter [4], [5], [6]. Here, we attempt to perform both
of these methods in a pipeline for scene classification, by
fine-tuning a large-scale model and transferring knowledge
between rules learnt from simulation to real-world datasets.
The consumer-level quality of videogame technology
has rapidly improved towards arguably photo-realistic
graphical quality through ray-traced lighting, high resolution
photographic textures and Physically Based Rendering (PBR),
to name a few prominent techniques. This then raises
the question, since simulated environments are ever more
realistic, is it possible to transfer knowledge from them to
real-world situations? Should this be possible, the problem
of data scarcity would be mitigated, and also a more optimal
process of learning would become possible by introducing a
starting point learned from simulation. If this process provides
a better starting point than, for example, a classical random
weight distribution, then fewer computational resources
are required to learn about the real world and also fewer
labelled data points are required. In addition, if this process
is improved further, learning from real-world data may not
actually be required at all.
In this work, we perform 12 individual topology exper-
iments in order to show that real-world classification of
relatively scarce data can be improved via pre-training said
models on simulation data from a high-quality videogame
environment. The weights developed on simulation data are
applied as a starting point for the backpropagation learning of
real-world data, and we find that both starting accuracies and
asymptotes (final ability) are often higher when the model has
been able to train on simulation data before considering real
data.
The main scientific contributions of this work are threefold:
1) The formation of two datasets for a 6-class scene classification problem, both artificial simulation and real-world photographic data¹.
2) 24 topology tuning experiments for the best classification of the two datasets, 12 for each dataset, with 2, 4, 8, ..., 4096 interpretation neurons following the fine-tuning of a VGG16 CNN network (with interpretation and softmax layers removed). This provides a baseline comparison for Transfer Learning as well as the pre-trained weights to be used in the following experiment.
3) 12 transfer learning experiments of the weights trained on simulation data transferred to networks with the task of classifying real-world data. The results are evidence that transfer learning of useful domain knowledge is possible from the classification of simulated environments to the classification of real-world photographic data, further improving classification ability of real data.

¹https://www.kaggle.com/birdy654/environment-recognition-simulation-to-reality

Fig. 1. An example of the usage of Ray Tracing, Physically-Based Rendering and high quality textures in order to generate a realistic simulation of a living room environment [7].
The remainder of this article is organised as follows: in
Section II the state of the art in knowledge transfer from
virtual worlds to real world is discussed, in Section III our
methodology is outlined, while in Section IV experimental
results are presented and analysed. A discussion of possible
future work is provided in Section V before a final conclusion
to this study is drawn in Section VI.
II. BACKGROUND AND RELATED WORK
In this section, state of the art in the area of knowledge
transfer from virtual world tasks to real life tasks is discussed.
The possibility of transfer from modern videogames to reality
for complex problems is a new and rapidly growing line of
thought within the field of deep learning. Related works are
limited due to the young age of the field².
Technologies such as realistic Ray Tracing and PBR in
conjunction with photographic or photographically-enhanced
textures enable photorealism in simulated environments (in
this context, generated as a videogame environment). Ray
Tracing is a rendering technique that works by following the
individual pixel paths of light and simulating its physical
properties when interacting with objects in the scene, which
produces higher levels of realism in terms of lighting as opposed to the classical row-by-row scanline method [8].

²The most popular works as of writing up the results of this study are still in the form of preprints.
Following various methods of implementation, PBR is the concept of combining high quality 3D models and surface-measured shading in order to produce accurate representations of real materials and thus photo-realistic quality objects [10]; Pharr et al. provide a detailed review of PBR methods [9]. An example of the quality of simulation
possible through the usage of these technologies can be seen
in Figure 1, developed by ArchVizPRO [7].
Transfer Learning is the improvement of a learning process
for a new task by transferring knowledge from a related
so-called source task to the new task, which is called the target
task. In this study, trained weights from one classification
problem are used as the initial weights for a second problem
and are subsequently compared to standard random weight
distribution for this same problem [11]. The issue of data
availability is recognised in a notable survey on transfer
learning, where transfer learning approaches are suggested
to produce better solutions for a second task characterised
by more limited data than the first task [12]. The reduced
availability of real-world data in comparison to the almost
infinite possibilities in virtual environments is such a scenario.
Kim and Park recently argued against a classical heuristic
search approach for the tracking of road lanes in favour of
a deep learning approach featuring transfer learning from
Grand Theft Auto V (GTA V) and TORCS environments [13].
GTA V was also used to gather data for a computer vision
experiment in which vehicle collisions were successfully
predicted when transfer learning was applied [14]. Trial-and-
error learning is not suited to high-risk activities such as
driving, and so, reinforcement learning is not possible when
the starting point is a real-world situation; researchers argue
that transfer of knowledge can improve the ability to perform
complex tasks, when initially performed in simulation [15]
and [16]. For autonomous navigation, environment mapping
and recognition is a very important task for self-driving
vehicles, many of which consider LiDAR data as input
towards mapping and subsequent successful real-time
navigation [17], [18], [19], [20].
In addition to LiDAR, many authors have argued for the
processing of photographic image data for environment or
scene recognition. Herranz et al. [21] show that classification
of both scenes and objects reaches human-level classification
abilities of 70.17% on the SUN397 places dataset via manually
chosen combinations of ImageNet-CNNs and Places-CNNs.
Similarly, Wu et al. [22] achieved accuracy of 58.11% on the
same dataset through harvesting discriminative meta-objects,
outperforming Places-CNN (AlexNet fine tuning), which had
a benchmark accuracy of 56.2% [23].
In Tobin et al. researchers trained computer vision models
with a technique of domain randomisation for object recognition within a real-world simulation, which, when transferred
to real-world data, could recognise objects within an error
of around 1.5 centimetres [24]. This was further improved
when it was noted that a dataset of synthetic images from
a virtual environment could be used to train a real-world
computer vision model within an error rate of 1.5 to 3.5
millimetres on average [25]. Researchers noted that virtual
environments were treated as simply another variation
rather than providing unsuitable noise. Similarly, computer
vision models were also improved when initially trained in
simulation for further application in reality where distance
error rates were reduced for a vision-based robotic arm [26].
Scene recognition between virtual and real environments has
received little attention. Wallet et al. show, via a comparison of high and low detailed virtual environments, that
high detail in the virtual environment leads to better results
for tasks such as scene classification and way-finding when
applied to real environments [27]. The study was based on
the experiences of 64 human subjects.
So far, very little exploration into the possibility of transfer
learning from virtual to real environments for the task of
environment recognition or scene classification has been per-
formed. Though many of these works are currently preprints
and are yet to be published, they already have a high impact,
and results are often replicated in related experiments. In terms
of scene classification, either LiDAR or photographic image
data are considered as a data source for the task, with the best
scores often being achieved by deep learning methods such
as the Convolutional Neural Network, which features often in
state-of-the-art work. Transfer learning features often in these
works, either by simply fine-tuning a pre-trained CNN on a
large dataset, or training on a dataset and transfer learning
weight matrices to a second, more scarce dataset. Inspired by
these works, we opt to select photographic data of virtual and
real environments before transfer learning by initial weight
distribution to a fine-tuned network in order to attempt to use
both methods. The successful transfer of knowledge attained
in this experiment serves as basis for further exploration
into the possibilities of improving environment classification
algorithms by considering an activity of pre-training on the
infinite possibilities of virtual environments before considering
a real-world problem.
III. THE PROPOSED RESEARCH QUESTION AND OUR APPROACH
We propose to answer the research question “Can knowl-
edge be transferred from simulation to real world, to improve
effectiveness and efficiency of learning to perform real world
tasks, when real world training data are scarce?”. Here,
we explain our approach, starting from building the datasets,
following with the experiment, choice of models and practical
implementation. We include chosen hyperparameters and com-
putational resources in order to promote replicability as well
as for future improvement and application to related state-of-the-art problems.

Fig. 2. In order to collect artificial data, a camera is attached to a humanoid robot for height reference in the Unity game engine.
A. Datasets
Initially, two large image datasets are gathered from the
following environments:
• Forest
• Field
• Bathroom
• Living Room
• Staircase
• Computer Lab
The first two are natural environments and the final four are
artificial environments.
For the simulation data, 1,000 images are collected per
environment from the Unity videogame engine via a rotating
camera of 22mm focal length (chosen since it is most similar
to the human eye [28]) affixed to the viewpoint of a 120cm
(3.93ft) robot model, as can be seen in Figure 2. The camera
is rotated 5 degrees around the Y axis per photograph, and
then rotated around the X axis 15 degrees three times after
the full Y rotation has occurred.³ In total, 6,000 images are
collected in order to form a balanced dataset.
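To make the capture pattern above concrete, the short Python sketch below enumerates the (pitch, yaw) viewpoints implied by a 5-degree yaw step and three subsequent 15-degree pitch rotations. It is an illustration only: the actual collection is performed by the Unity (C#) script referenced in the footnote, and the number of camera placements per environment is not reproduced here.

```python
# Illustrative enumeration of the camera-angle schedule described above.
# Assumption: pitch starts at 0 degrees and is tilted three further times
# by 15 degrees; the real Unity script may differ in placement details.
def capture_angles(yaw_step=5, pitch_step=15, pitch_levels=4):
    """Yield (pitch, yaw) pairs for one camera placement."""
    for level in range(pitch_levels):
        pitch = level * pitch_step
        for yaw in range(0, 360, yaw_step):  # 72 yaw positions per sweep
            yield pitch, yaw

viewpoints = list(capture_angles())
print(len(viewpoints))  # 288 viewpoints per placement under these assumptions
```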
For the photographic real-world data, a Google Images
web-crawler is set to search and save the first 600 image
search results for each environment name. Each set of collected images is then manually inspected in order to remove any false results, and more data is collected if needed to retain perfect class balance.
In Figure 3, samples of the virtual visual data gathered from the Unity game engine (top row) and photographs of real-world environments gathered from Google Images (bottom row) are shown.

³Unity script for data collection is available at https://github.com/jordan-bird/Unity-Image-Dataset-Collector

Fig. 3. Samples of virtual (top) and real (bottom) environments from the two datasets gathered for these experiments.

Fig. 4. Overall diagram of the experiment showing the derivation of ΔS and ΔF (change in starting and final classification ability) for comparison.

Various similarities can be seen, especially through the
colours that occur in nature. Some of the more photo-realistic
environments, such as the living room, bear similarity due to
the realistic high-poly models for example through the creases
in the sofa material. Less realistic environments, such as the
bathroom, feature fewer similarities through the shapes of the
models, although lighting differs between the two.
B. Experiment
With all image data represented as a 128 × 128 × 3 array
of RGB values, the datasets are used to train the models.
Convolutional Neural Network layers are fine-tuned from the
VGG16 network [29] with input layers replaced by the shape
of our data, and interpretation layers are removed in order to
benchmark a single layer of 2, 4, 8, ..., 4096 neurons. All of
these sets of hyperparameters are trained on the simulation
images dataset, and an additional set of hyperparameters
are then trained on the real images dataset, both for 50
epochs. Following this, all weights trained on the simulation
dataset are then transferred to real-world data for a further
10 epochs of training in order to benchmark the possibilities
of transfer learning. Thus, both methods of fine-tune and
transfer learning are explored. All training of models is via
10-fold cross validation where starting (pre-training) and
asymptote (ultimate ability) abilities are measured in order
to discern whether knowledge transfer is possible between
the domains. A diagram of the experiment can be observed
in Figure 4, within which changes in starting (ΔS) and final abilities (ΔF) of the classification of real-world environments
are compared with and without weight transfer from a model
pre-trained on data gathered from virtual environments.
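A minimal Keras sketch of this pipeline is given below. It assumes the VGG16 convolutional base from keras.applications with the original interpretation and softmax layers replaced as described; the dataset variable names, the use of ImageNet weights for the initial fine-tuning, and the exact set of trainable layers are illustrative assumptions rather than the authors' exact implementation.

```python
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications import VGG16

def build_scene_classifier(interpretation_neurons, input_shape=(128, 128, 3), classes=6):
    """Fine-tuned VGG16 base followed by one interpretation layer and softmax."""
    base = VGG16(weights="imagenet", include_top=False, input_shape=input_shape)
    x = layers.Flatten()(base.output)
    x = layers.Dense(interpretation_neurons, activation="relu")(x)  # interpretation layer
    outputs = layers.Dense(classes, activation="softmax")(x)
    model = models.Model(inputs=base.input, outputs=outputs)
    model.compile(optimizer=optimizers.Adam(),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Stage 1: train on the simulated (virtual) dataset for 50 epochs.
sim_model = build_scene_classifier(interpretation_neurons=64)
# sim_model.fit(x_virtual, y_virtual, epochs=50)

# Stage 2: transfer the trained weights to an identically-shaped network
# and continue training on the real-world dataset for a further 10 epochs.
real_model = build_scene_classifier(interpretation_neurons=64)
real_model.set_weights(sim_model.get_weights())
# real_model.fit(x_real, y_real, epochs=10)
```

The non-transfer baseline simply omits the set_weights call, so the real-world network starts from a random weight distribution.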
The goal of the learning process is the minimisation of
loss (misclassification) through backpropagation of errors and
optimisation of weights. This is possible since all data are
labelled, and thus, predictions can be compared to the ground
truths. The goal is to reduce the cross-entropy loss [30], [31]:
$$ -\sum_{c=1}^{M} y_{o,c}\,\log(p_{o,c}), \qquad (1) $$

where M is the number of classes (in this case, 6), y is a binary indicator of a correct or erroneous prediction, i.e. that class c is the true class of the data object o, and p is the probability that o is predicted to belong to class c. If this value is algorithmically minimised, the network is then able to learn from errors, attempt to account for them, and improve its classification ability.
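As an illustrative example (the figures here are not taken from the paper's results): for a single data object in the 6-class problem whose true class is assigned a predicted probability of 0.7, only one term of Equation (1) is non-zero, giving

$$ -\sum_{c=1}^{6} y_{o,c}\log(p_{o,c}) = -\log(0.7) \approx 0.357, $$

whereas a more confident correct prediction of 0.99 yields a loss of approximately 0.01, so minimising Equation (1) drives the predicted probability of the true class towards 1.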
The activation function of the interpretation layer and the learning rate optimisation algorithm were arbitrarily chosen as Rectified Linear Units (ReLU) and ADAM. ReLU is defined as y = max(0, x).
ADAM [32] is a method of optimisation of network weights during the learning process, based on RMSProp [33] and Momentum [34], and is generally calculated via the following steps:
1) The exponentially weighted average of past gradients, $v_{dW}$, is calculated.
2) The exponentially weighted average of the squares of past gradients, $s_{dW}$, is calculated.
3) The bias towards zero in the previous steps is corrected, resulting in $v_{dW}^{corrected}$ and $s_{dW}^{corrected}$.
Neural network parameters are then updated via:

$$
\begin{aligned}
v_{dW} &= \beta_1 v_{dW} + (1-\beta_1)\frac{\partial J}{\partial W},\\
s_{dW} &= \beta_2 s_{dW} + (1-\beta_2)\left(\frac{\partial J}{\partial W}\right)^2,\\
v_{dW}^{corrected} &= \frac{v_{dW}}{1-(\beta_1)^t},\qquad
s_{dW}^{corrected} = \frac{s_{dW}}{1-(\beta_2)^t},\\
W &= W - \alpha\,\frac{v_{dW}^{corrected}}{\sqrt{s_{dW}^{corrected}}+\varepsilon},
\end{aligned} \qquad (2)
$$

where $\beta_1$ and $\beta_2$ are tunable hyperparameters, $\frac{\partial J}{\partial W}$ is the cost gradient of the network layer which is currently being tuned, $W$ is a matrix of weights, $\alpha$ is the defined learning rate, and $\varepsilon$ is a small value introduced in order to prevent division by zero.
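For clarity, the update in Equation (2) can be written directly in NumPy as below; the variable names mirror the symbols above, and the hyperparameter values (α = 0.001, β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸) are the common defaults from [32] rather than values stated in this paper.

```python
import numpy as np

def adam_step(W, dJ_dW, v_dW, s_dW, t, alpha=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one ADAM update to a weight matrix W, following Equation (2)."""
    # Exponentially weighted average of past gradients.
    v_dW = beta1 * v_dW + (1 - beta1) * dJ_dW
    # Exponentially weighted average of past squared gradients.
    s_dW = beta2 * s_dW + (1 - beta2) * dJ_dW ** 2
    # Bias correction towards zero for early time steps t = 1, 2, ...
    v_corr = v_dW / (1 - beta1 ** t)
    s_corr = s_dW / (1 - beta2 ** t)
    # Parameter update.
    W = W - alpha * v_corr / (np.sqrt(s_corr) + eps)
    return W, v_dW, s_dW
```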
C. Practical Implementation
In this work, all models were trained on deep neural
networks developed in the Keras library with a TensorFlow
backend. Implementation was performed in Python. Random
weights were generated by an Intel Core i7 CPU which was
running at a clock speed of 3.7GHz. RAM used for the initial
storage of images was 32GB at a clock speed of 1202MHz
(Dual-Channel 16GB) before transfer to the 6GB of VRAM
and subsequent learning on a GTX 980Ti GPU via its 2816
CUDA cores.
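The starting (epoch 0) and final accuracies reported in Section IV can be measured per fold as sketched below. The use of scikit-learn's KFold for the 10-fold split is an assumption about tooling, build_model() stands in for any of the compiled networks described above, and the accuracy is read from Keras' evaluate() output under the assumption that the model was compiled with an accuracy metric.

```python
import numpy as np
from sklearn.model_selection import KFold

def benchmark(build_model, x, y, epochs):
    """Return mean starting (epoch-0) and final accuracy over 10 folds."""
    starting, final = [], []
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True).split(x):
        model = build_model()
        # Accuracy before any backpropagation on this dataset (starting ability).
        starting.append(model.evaluate(x[test_idx], y[test_idx], verbose=0)[1])
        model.fit(x[train_idx], y[train_idx], epochs=epochs, verbose=0)
        # Accuracy after training (the asymptote reported in the tables).
        final.append(model.evaluate(x[test_idx], y[test_idx], verbose=0)[1])
    return np.mean(starting), np.mean(final)

# Delta-S and Delta-F then compare transfer against non-transfer initialisation,
# e.g. dS = start_transfer - start_random and dF = final_transfer - final_random.
```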
IV. RESULTS
In this section, the results from the experiments are
presented following the method described above. Firstly, the
classification ability of the networks trained on virtual data
is outlined; then, a comparison is made between networks that classify real-world data initialised with a random weight distribution and networks initialised with weights transferred from those trained on virtual environments.
A. Initial Training for Virtual Environments
The classification accuracy of the 12 sets of weights, corresponding to 2...4096 interpretation neurons respectively, to be transferred in the experiment can be observed in Table I. High accuracy is observed for 8...4096 interpretation neurons; this is likely due to the CNN generating sets of similar features owing to the repetitive nature of videogame environments. In order to optimise the rendering of frames to the desired 60 fps, models, textures and bump maps are often repeated in order to reduce the execution time of the graphical pipeline [35].
TABLE I
BENCHMARKING OF INTERPRETATION NETWORK TOPOLOGIES FOR SIMULATION ENVIRONMENTS. HIGH RESULTS (90%+) CAN BE EXPECTED DUE TO REPEATED TEXTURES, BUMP MAPS AND LIGHTING.

Interpretation Neurons    Classification Accuracy (%)
2                         33.28
4                         49.69
8                         88
16                        96.04
32                        98.33
64                        98.33
128                       98.16
256                       98.76
512                       97.02
1024                      97.86
2048                      64.08
4096                      93.93
B. Transfer Learning vs Random Weights
The results for the transfer learning experiment can be
observed in Table II. The columns ΔS and ΔF show the
change in Starting (epoch 0, no back propagation performed)
and Final classification accuracies in terms of transfer versus
non-transfer of weights, respectively. Interestingly, regardless
of the number of interpretation neurons, successful transfer of
knowledge is achieved for pre-training, with the lowest being
+3.1% via 2 interpretation neurons. The highest is +48.34%
accuracy in the case of 512 hidden interpretation neurons.
This shows that knowledge can be transferred as a starting
point. The average increase of starting accuracy over all
models was +38.33% when transfer learning was performed,
as opposed to an average starting accuracy of 16.4% without
knowledge transfer. In terms of the final classification
accuracy, success is achieved as well, 9 experiments lead to a
higher final accuracy whereas are were slightly lower (-0.22%
128 neurons and -3.98% 2048 neurons), and one does not
change (32 neurons). The average Fover all experiments is
+7.15% with the highest being +24.56% via 4 interpretation
neurons. On average, the final accuracy of all models when
transfer learning is performed is 76.34%m in comparison
to the average final accuracy of 69.16% without transfer of
weights.
Overall, the best model for classifying the real-world
data is a fine-tuned VGG16 CNN followed by 64 hidden
interpretation neurons with initial weights transferred from
the network trained on simulated videogame environments; this model scores a final classification accuracy of 89.16% (the highest in Table II) when both fine-tune and sim-to-real transfer learning are used in conjunction. The majority of results, especially the highest ΔS, ΔF, and final accuracy, show that transfer learning is not only a
possibility between simulation and real-world data for scene
classification, but also promote it as a viable solution in order
to both reduce computational resource requirements and lead
to higher classification ability overall.
TABLE II
COMPARISON OF NON-TRANSFER AND TRANSFER LEARNING EXPERIMENTS. ΔS AND ΔF DEFINE THE CHANGE IN STARTING AND FINAL ACCURACIES BETWEEN THE SELECTED STARTING WEIGHT DISTRIBUTIONS. A POSITIVE VALUE DENOTES SUCCESSFUL TRANSFER OF KNOWLEDGE BETWEEN SIMULATION AND REALITY.

Interpretation    Non-Transfer Learning               Transfer Learning                   Comparison
Neurons           Starting Acc. (%)  Final Acc. (%)   Starting Acc. (%)  Final Acc. (%)   ΔS        ΔF
2                 18.25              18.69            21.35              36.5             +3.1      +17.81
4                 15.27              27.32            33.74              51.88            +18.47    +24.56
8                 12.5               80.31            59.29              85.29            +46.79    +4.98
16                21.57              85.07            60.37              86.73            +38.8     +1.66
32                14.16              87.06            61.06              87.06            +46.9     0
64                16.04              88.27            54.42              89.16            +38.38    +0.89
128               15.93              87.17            61.17              86.95            +45.24    -0.22
256               17.26              85.73            60.95              87.94            +43.69    +2.21
512               14.27              77.88            62.61              79.65            +48.34    +1.77
1024              19.58              68.69            62.83              85.29            +43.25    +16.6
2048              17.7               67.7             56.75              63.72            +39.05    -3.98
4096              14.27              56.19            62.39              75.88            +48.12    +19.69
Average           16.4               69.16            54.73              76.34            38.33     7.15
The results serve as a strong argument that transfer of knowl-
edge is possible in terms of pre-training of weights from
simulated environments. This is evidenced especially through
the initial ability of the transfer networks prior to any training
for classification of the real environments, but it is also shown
through the best ultimate score achieved by a network with
initial weights transferred.
V. DISCUSSION
In this section the limitations of this study are discussed
and directions for future work to further explore the potential
of this method are proposed. From the results observed in
this study, there are two main areas of future work which are
important to follow. Firstly, we propose to further improve the
artificial learning pipeline. Models were trained for 50 epochs
for each of the interpretation layers to be benchmarked. In
the future the possibility of deeper networks of more than
one hidden interpretation layer and also the combinations
of the hyperparameters can be explored. The training time
of the random weight networks was relatively limited at
50 epochs and even further limited for transfer learning
at 10 epochs, although this was by design and due to the
computational resources available. Future work could concern
deeper interpretation networks as well as increased training
time. In this study hyperparameters such as the activation and
learning rate optimisation algorithm were arbitrarily chosen,
therefore in the future these could be explored in a further
combinatorial optimisation experiment. Secondly, simulation
to real transfer learning could also be attempted in various
fields in order to benchmark the ability of this method for
other real-world applications. For example, autonomous cars
and drones training in a virtual environment for real-world
application. The next step for benchmarking could be to
compare the ability of this method to state-of-the-art methods
on publicly available datasets, should more computational
resources be available, similarly to the related works featured
in the literature review [21], [22], [23].
VI. CONCLUSION
In the experiments and results presented in this study,
we have shown success in transfer learning from virtual
environments to a task taking place in reality. A noticeable
set of high abilities were encountered for sole classification
of virtual data, as expected, due to the optimisation processes
of recycling objects and repeating textures found within
videogame environments. Of the 12 networks trained with
and without transfer learning, a pattern of knowledge
transfer was observed, with all starting accuracies being substantially higher than those of a random weight distribution and,
most importantly, a best classification ability of 89.16%
was achieved when knowledge was initially transferred from
virtual environments.
These results provide a strong argument for the application
of both fine-tune and transfer learning for autonomous scene
classification. The former was achieved through the tuning of
VGG16 Convolutional Neural Networks, and the latter was
achieved by transferring weights from a network trained on
simulation data from videogames and applied to a real-world
situation. Transfer learning leads to both the reduction of
resource requirements for said problems, and the achievement
of a higher classification ability overall when pre-training
has occurred on simulated data. As future directions, further
improvement of the learning pipeline benchmarked in this
study together with exploration on other complex real-world
problems faced by autonomous machines are proposed.
VII. ACKNOWLEDGEMENT
This work was partially supported by the Royal Society
through the project “Sim2Real: From Simulation to Real
Robotic Application using Deep Reinforcement Learning and
Knowledge Transfer” with grant number RGS\R2\192498
awarded to D. R. Faria.
REFERENCES
[1] J. H. Chen and S. M. Asch, “Machine learning and prediction in
medicine—beyond the peak of inflated expectations,” The New England
Journal of Medicine, vol. 376, no. 26, p. 2507, 2017.
[2] A. W. Tan, R. Sagarna, A. Gupta, R. Chandra, and Y. S. Ong, “Coping
with data scarcity in aircraft engine design,” in 18th AIAA/ISSMO
Multidisciplinary Analysis and Optimization Conference, p. 4434, 2017.
[3] A. Bouchachia, “On the scarcity of labeled data,” in International
Conference on Computational Intelligence for Modelling, Control and
Automation and International Conference on Intelligent Agents, Web
Technologies and Internet Commerce (CIMCA-IAWTIC’06), vol. 1,
pp. 402–407, IEEE, 2005.
[4] Y.-C. Su, T.-H. Chiu, C.-Y. Yeh, H.-F. Huang, and W. H. Hsu, “Trans-
fer learning for video recognition with scarce training data for deep
convolutional neural network,” arXiv preprint arXiv:1409.4127, 2014.
[5] C. Hentschel, T. P. Wiradarma, and H. Sack, “Fine tuning cnns with
scarce training data—adapting imagenet to art epoch classification,”
in 2016 IEEE International Conference on Image Processing (ICIP),
pp. 3693–3697, IEEE, 2016.
[6] A. Bhowmik, S. Kumar, and N. Bhat, “Eye disease prediction from
optical coherence tomography images with transfer learning,” in Inter-
national Conference on Engineering Applications of Neural Networks,
pp. 104–114, Springer, 2019.
[7] “Archvizpro interior vol.1,” Mar 2018.
[8] A. Appel, “Some techniques for shading machine renderings of solids,”
in Proceedings of the April 30–May 2, 1968, Spring Joint Computer
Conference, pp. 37–45, ACM, 1968.
[9] M. Pharr, W. Jakob, and G. Humphreys, Physically based rendering:
From theory to implementation. Morgan Kaufmann, 2016.
[10] C. Ulbricht, A. Wilkie, and W. Purgathofer, “Verification of physically
based rendering algorithms,” in Computer Graphics Forum, vol. 25,
pp. 237–255, Wiley Online Library, 2006.
[11] L. Torrey and J. Shavlik, “Transfer learning,” in Handbook of research
on machine learning applications and trends: algorithms, methods, and
techniques, pp. 242–264, IGI Global, 2010.
[12] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans-
actions on knowledge and data engineering, vol. 22, no. 10, pp. 1345–
1359, 2009.
[13] J. Kim and C. Park, “End-to-end ego lane estimation based on sequential
transfer learning for self-driving cars,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition Workshops,
pp. 30–38, 2017.
[14] K. Lee, H. Kim, and C. Suh, “Crash to not crash: Playing video games
to predict vehicle collisions,” in ICML Workshop on Machine Learning
for Autonomous Vehicles, 2017.
[15] M. B. Uhr, D. Felix, B. J. Williams, and H. Krueger, “Transfer of
training in an advanced driving simulator: Comparison between real
world environment and simulation in a manoeuvring driving task,” in
Driving Simulation Conference, North America, p. 11, 2003.
[16] A. Bewley, J. Rigley, Y. Liu, J. Hawke, R. Shen, V.-D. Lam, and
A. Kendall, “Learning to drive from simulation without real world
labels,” in 2019 International Conference on Robotics and Automation
(ICRA), pp. 4818–4824, IEEE, 2019.
[17] F. Yu, J. Xiao, and T. Funkhouser, “Semantic alignment of LiDAR data
at city scale,” in IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 1722–1731, 2015.
[18] C. Zach, A. Penate-Sanchez, and M. Pham, “A dynamic programming
approach for fast and robust object pose recognition from range images,
in IEEE CVPR, pp. 196–203, 2015.
[19] D. Xu, D. Anguelov, and A. Jain, “PointFusion: Deep sensor fusion for
3D bounding box estimation,” in IEEE/CVF CVPR, pp. 244–253, 2018.
[20] A. Ess, B. Leibe, and L. Van Gool, “Depth and appearance for mobile
scene analysis,” in IEEE 11th International Conference on Computer
Vision (ICCV), pp. 1–8, 2007.
[21] L. Herranz, S. Jiang, and X. Li, “Scene recognition with CNNs: objects,
scales and dataset bias,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 571–579, 2016.
[22] R. Wu, B. Wang, W. Wang, and Y. Yu, “Harvesting discriminative meta
objects with deep cnn features for scene classification,” in Proceedings
of the IEEE International Conference on Computer Vision, pp. 1287–
1295, 2015.
[23] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning
deep features for scene recognition using places database,” in Advances
in neural information processing systems, pp. 487–495, 2014.
[24] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel,
“Domain randomization for transferring deep neural networks from sim-
ulation to the real world,” in 2017 IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS), pp. 23–30, IEEE, 2017.
[25] T. Inoue, S. Choudhury, G. De Magistris, and S. Dasgupta, “Transfer
learning from synthetic to real images using variational autoencoders for
precise position detection,” in 2018 25th IEEE International Conference
on Image Processing (ICIP), pp. 2725–2729, IEEE, 2018.
[26] F. Zhang, J. Leitner, B. Upcroft, and P. Corke, “Vision-based reaching
using modular deep networks: from simulation to the real world,” arXiv
preprint arXiv:1610.06781, 2016.
[27] G. Wallet, H. Sauzéon, P. A. Pala, F. Larrue, X. Zheng, and B. N'Kaoua,
“Virtual/real transfer of spatial knowledge: Benefit from visual fidelity
provided in a virtual environment and impact of active navigation,”
Cyberpsychology, Behavior, and Social Networking, vol. 14, no. 7-8,
pp. 417–423, 2011.
[28] M. Mrochen, M. Kaemmerer, P. Mierdel, H.-E. Krinke, and T. Seiler,
“Is the human eye a perfect optic?,” in Ophthalmic Technologies XI,
vol. 4245, pp. 30–35, International Society for Optics and Photonics,
2001.
[29] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
[30] K. P. Murphy, Machine learning: a probabilistic perspective. MIT press,
2012.
[31] S. Kullback and R. A. Leibler, “On information and sufficiency,” The
annals of mathematical statistics, vol. 22, no. 1, pp. 79–86, 1951.
[32] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
arXiv preprint arXiv:1412.6980, 2014.
[33] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop, coursera: Neural
networks for machine learning,” University of Toronto, Technical Report,
2012.
[34] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance
of initialization and momentum in deep learning,” in International
conference on machine learning, pp. 1139–1147, 2013.
[35] J. Dargie, “Modeling techniques: movies vs. games,” ACM SIGGRAPH
Computer Graphics, vol. 41, no. 2, p. 2, 2007.