Transfer Learning - A Comparative Analysis
Project report submitted
in partial fulfilment of the requirement for the degree of
Bachelor of Technology
By
Rohan Saha (1505614)
Debaruna Saha (1505029)
Under the supervision of
(Dr.) Prof. Sudan Jha
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
KALINGA INSTITUTE OF INDUSTRIAL TECHNOLOGY
Deemed to be University, BHUBANESWAR
DECEMBER 2018
Certificate
This is to certify that the project entitled TRANSFER LEARNING - A COMPARATIVE
ANALYSIS submitted by
ROHAN SAHA 1505614
DEBARUNA SAHA 1505029
in partial fulfilment of the requirements for the award of the Degree of Bachelor of Technology
in Computer Science and Engineering is a bonafide record of the work carried out under my (our)
guidance and supervision at the School of Computer Engineering, KIIT Deemed to be University.
Signature of Supervisor
(Dr.) Prof. Sudan Jha
School of Computer Engineering
KIIT Deemed to be University
The Project was evaluated by us on _____________
EXAMINER 1
EXAMINER 2
EXAMINER 3
Acknowledgement
We feel extremely privileged in expressing our sincere gratitude to our supervisor and mentor
(Dr.) Prof. Sudan Jha for his constant guidance and supervision throughout our project work.
His way of solving real-world problems by implementing them in machines has inspired us to
take up this project. Our heartfelt thanks to you, sir, for the support and patience shown to us.
We are also thankful to the School of Computer Engineering for giving us this opportunity to
turn our idea into a real-time implementation.
Rohan Saha (1505614)
Debaruna Saha (1505029)
Abstract
Deep learning has paved the way for critical and revolutionary applications in almost every field
of life. Ranging from engineering to healthcare, machine learning and deep learning have
established themselves as state-of-the-art technologies capable of delivering reasonably
high-benchmark solutions. Incorporating neural network architectures into applications has
become a common part of the software development process.
In this paper we look at transfer learning and how it can augment the performance of a neural
network architecture by combining pre-existing models with densely connected layers suited to a
specific problem.
We implement different transfer learning approaches and models in order to arrive at a unified
conclusion regarding the best model for the hand-written digit recognition problem in terms of
loss and accuracy metrics. We then perform a comparative analysis of the different models and
visualize the metrics using TensorBoard. Visualization lets us analyze how the metrics vary over
time, which yields deeper insights.
The comparative analysis provides the data for choosing the final model for the system, which
will later be used to extract features and recognize digits and characters from an image of
information written on a piece of paper or a blackboard.
Transfer learning, though relatively new, has proved to be one of the crucial inventions in the
fields of deep learning and computer vision because of the promising results it delivers on noisy,
unstructured real-world data.
Table of Contents
1. Introduction
2. Literature Survey
3. Related Work
4. Motivation
5. Transfer Learning
6. Implementation
7. Results
8. Conclusion
9. References
Chapter 1
Introduction
_____________________________________________________________________________________
The analysis of the gigantic mass of digital visual information around the world requires a
variety of image analysis techniques. The main purpose is to automatically analyze the semantic
contents of images. Distinguishing an object from the whole image requires object recognition
techniques. Humans recognize objects in the real world without any effort, but a machine cannot
recognize an object until an algorithmic description of it is fed to the machine. Object
recognition techniques therefore need to be developed that are less complex and more efficient.
Significant efforts have been made to develop representation schemes and algorithms aimed at
recognizing generic objects in images under different imaging conditions. Presently there are
several techniques that address the object recognition scenario. Several of them consider color
and shape features for recognition. A simple object recognition and segmentation scheme
extracts a set of features from an image sequence and uses them during the learning phase.
Feature extraction, one of the most important fields in artificial intelligence, extracts the most
relevant features of an image and assigns them to a label. Transfer learning is the improvement
of learning in a new task through the transfer of knowledge from a related task that has already
been learned. The recognition process is carried out by matching the test image against stored
object representations or models.
Various techniques are available, each having pros and cons. This paper first surveys different
techniques for image recognition; its main focus is then image recognition using transfer
learning, which is based on deep learning. Deep learning has many advantages over conventional
machine learning algorithms, such as statistical training, detecting relationships between
dependent and independent variables, detecting possible interactions between predictor
variables, pattern recognition, and many more. Deep learning models such as Convolutional
Neural Networks (CNNs) and Deep Belief Networks (DBNs) have been widely used for image
classification, object recognition, and action recognition, thereby opening new applications in
every field.
Chapter 2
Literature Survey
_____________________________________________________________________________
Image recognition has centered on appearance-based techniques, the most advanced of which are
feature extraction and template matching algorithms. We look at them in detail below.
2.1 Template Matching
In [1], template matching is described as a technique to locate small parts of an image that
match a template image. It is a straightforward process in which template images for different
objects are stored, and an input image is matched against the stored templates to determine the
object in the input. The author proposed a mathematical morphological template matching
approach for object recognition in inertial navigation systems (INS). The focus of the paper is to
detect and track ground objects. Flying systems equipped with cameras were used to capture
photos of the ground and identify the objects. Their method is independent of the altitude and
orientation of the object.
In [2], an approach is presented for measuring similarity between visual images based on
matching internal self-similarities, where a template image is compared to another image.
Although measuring similarity across images can be complex, the similarity within each image
can easily be revealed with a simple similarity measure such as SSD (Sum of Square
Differences), resulting in local self-similarity descriptors which can be matched across images.
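To make the SSD measure concrete, the following is a minimal NumPy sketch of brute-force template matching; the function name and the assumption of 2-D grayscale arrays are ours, for illustration only, and are not taken from the cited papers.

import numpy as np

def ssd_score_map(image, template):
    """Slide `template` over `image` and return the SSD score at each
    position; lower scores indicate better matches.  A brute-force
    sketch for clarity, not an optimized implementation."""
    image = image.astype(np.float64)
    template = template.astype(np.float64)
    ih, iw = image.shape
    th, tw = template.shape
    scores = np.empty((ih - th + 1, iw - tw + 1))
    for y in range(scores.shape[0]):
        for x in range(scores.shape[1]):
            patch = image[y:y + th, x:x + tw]
            scores[y, x] = np.sum((patch - template) ** 2)
    return scores

# The best match is the position with the smallest score:
# y, x = np.unravel_index(np.argmin(scores), scores.shape)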
2.2 Feature Extraction
Feature extraction is one of the most important fields in artificial intelligence. It extracts the
most relevant features of an image and assigns them to a label. In image classification, the
crucial step is to analyze the properties of image features and to organize the numerical features
into classes.
In [3], the performance of the data models obtained by different feature extraction techniques, in
the context of binary and multiclass classification using different classifiers, is presented. Some
of the techniques on which the paper's conclusion is based are PHOG (Pyramid Histogram of
Oriented Gradients) and LBP (Local Binary Patterns). Feature extraction is based on the
following types of features.
2.2.1 Color Features
In image classification and image retrieval, color is the most important feature. The color
histogram is the most common method for extracting the color feature; it is regarded as the
distribution of color in the image. The efficacy of the color feature resides in the fact that it is
independent of, and insensitive to, the size, rotation, and zoom of the image.
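As an illustration of the color histogram method, below is a small NumPy sketch of our own (the bin count is an assumed parameter) that quantizes an RGB image and counts color combinations; normalizing the counts makes the descriptor largely insensitive to image size.

import numpy as np

def color_histogram(image, bins=8):
    """Joint RGB color histogram for an H x W x 3 uint8 image.

    Returns a flattened, L1-normalized vector of length bins**3.
    """
    # Quantize each channel into `bins` levels (values 0..bins-1),
    # then encode each pixel's (r, g, b) bin triple as a single code.
    quantized = (image.astype(np.int64) * bins) // 256
    codes = (quantized[..., 0] * bins + quantized[..., 1]) * bins + quantized[..., 2]
    hist = np.bincount(codes.ravel(), minlength=bins ** 3).astype(np.float64)
    return hist / hist.sum()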
2.2.2 Texture Features
Texture feature extraction is a very robust technique for large images containing repetitive
regions. A texture is a group of pixels that share certain characteristics. Texture feature methods
are classified into two categories: spatial texture feature extraction and spectral texture feature
extraction.
2.2.3 Shape Features
Shape features are widely used in object recognition and shape description. Shape feature
extraction techniques are classified as region based and contour based. Contour methods
calculate the feature from the boundary and ignore the interior, while region methods calculate
the feature from the entire region.
The dataset was split into two parts, a training set and a test set. In this study, 60% of the
instances were used in the training phase and the remaining 40% constituted the test set. A
linear SVM, an SVM with a Gaussian kernel, a least-squares SVM (LS-SVM), and k-nearest
neighbors were used for classification. The results show that the PHOG, Gabor, and LBP
methods reached high classification accuracy rates and are precise and efficient methods.
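The extract-then-classify protocol evaluated in [3] can be sketched with scikit-image and scikit-learn. Note that plain HOG stands in here for the PHOG/LBP/Gabor descriptors of the paper, the descriptor parameters are illustrative assumptions, and all images are assumed to share one fixed size.

import numpy as np
from skimage.feature import hog
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def hog_svm_pipeline(images, labels):
    """Sketch of the extract-features-then-classify protocol.

    `images` is a list of equally sized grayscale arrays and
    `labels` their class ids; HOG stands in for PHOG/LBP/Gabor.
    """
    features = np.array([hog(im, orientations=9,
                             pixels_per_cell=(8, 8),
                             cells_per_block=(2, 2)) for im in images])
    # 60/40 train/test split, as in the study described above.
    X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.4)
    clf = LinearSVC().fit(X_tr, y_tr)
    return clf.score(X_te, y_te)   # classification accuracy on the test set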
2.3 Image Segmentation
Among image recognition techniques, image segmentation is the process of partitioning an
image into segments so that the result is a simplified representation that is more meaningful and
easier to analyze.
Object segmentation is classified into two types based on the camera's mobility:
a) static-camera segmentation
b) moving-camera segmentation.
In static-camera segmentation, the camera is located at a particular position at a fixed angle, so
the object and the background are fixed. In moving-camera segmentation the camera keeps
moving, which is more challenging because it requires accounting for the movement of the
camera and changes in the background.
In [4], a method is discussed which is based on spectral graph partitioning. The paper formulates
a joint optimization problem over patch grouping and pixel grouping. Their work showed that
falsely detected parts can be effectively cut away by perceptual organization. Results were
shown on 120×120 synthetic images. They used a trained linear Fisher classifier for every body
part, and the output was a reliably better object segmentation.
In [5], a method is presented for recognizing and segmenting objects with the help of SIFT and
graph cuts. They took 20 object models photographed at 45-degree intervals, and the results
obtained were a recall of 0.81 and a precision of 1.00. Seeds were given to graph cuts
automatically using SIFT, but the disadvantage of the approach was that if the number of
keypoints became small, the accuracy of recognition and segmentation would fall and the
computing time would increase.
In [6], a method is presented for viewpoint-independent recognition of free-form objects and
their segmentation. Their work automatically registered unordered views of an object with
complexity O(n). The experiments produced an overall recognition rate of 95% for synthetic as
well as real images. They contrasted their work with spin-image recognition algorithms and
stated that their algorithm was better in all aspects, such as recognition rate and efficiency.
In [7], an approach is described in which objects are represented by a hierarchy of fragments.
The approach was effective and provided an accessible framework for organization,
segmentation, and recognition. Its advantages included class-specific features and pictorial
fragment features that represent and recognize complete objects and their parts. The approach is
consistent with both psychological and physiological evidence. The major challenges for the
future are full scene interpretation and dynamic recognition of objects.
In [8], Kim et al. proposed an unsupervised moving object segmentation and recognition method
for Intelligent Transportation Systems (ITS). The method used clustering and a neural network
approach. Its advantages included real-time, robust performance and efficient remote
monitoring. They performed the experiments on a Pentium IV 1.4 GHz processor running
Windows 98, and the algorithm was implemented in MS Visual C++. Processing each image
frame took 0.36 seconds on average. In a comparison with Badenas et al. [13], the latter
produced an average time of 0.57 s and a segmentation rate of 91.3%, while the method
presented by Kim et al. produced an average time of 0.22 s and a segmentation rate of 95.8%.
In [9], a machine-learning-based object recognition method is presented that uses MLS (mobile
laser scanning) point clouds to create maps of the road environment infrastructure. The authors
collected unorganized 3D point clouds and performed pre-processing, segmentation, and feature
extraction. Local prediction on the segmented objects then produced the labelled object
locations. Their use of MLS-specific features helped increase the point cloud segmentation
accuracy from 78.3% to 87.9%.
In [10], it is described that transfer learning mainly consists of two approaches:
1) Preserving the original pre-trained network and updating its weights based on the new
training dataset.
2) Using the pre-trained network for feature extraction and representation, followed by a
generic classifier such as an SVM for classification.
The second approach has been successfully applied to many recognition and classification
tasks [11], [12]. The human action recognition technique also falls under the second category.
Among recently proposed benchmark deep models such as AlexNet and GoogleNet, AlexNet
was selected as the source model for building a target model for the action recognition task. The
source model was used for feature extraction and representation, followed by a hybrid Support
Vector Machine and K-Nearest Neighbor (SVM-KNN) classifier for action recognition.
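The second approach can be sketched in a few lines of Keras and scikit-learn. As hedges: [10] uses AlexNet with a hybrid SVM-KNN classifier, whereas this illustration substitutes VGG16 (AlexNet is not bundled with Keras) and a plain SVM.

import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from sklearn.svm import SVC

# Approach 2: a pre-trained network as a fixed feature extractor,
# followed by a generic classifier.
base = VGG16(weights='imagenet', include_top=False, pooling='avg')

def deep_features(images):
    """`images`: float array of shape (n, 224, 224, 3)."""
    return base.predict(preprocess_input(images.copy()))

# clf = SVC(kernel='linear').fit(deep_features(train_x), train_y)
# predictions = clf.predict(deep_features(test_x))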
Chapter 3
Related Work
___________________________________________________
Existing state-of-the-art methods for action recognition use handcrafted representations and
deep learning. These methods have achieved remarkable performance for human action
recognition, but the feature engineering process is labor-intensive and requires subject-matter
expertise.
Due to these limitations, more research is being directed toward deep learning-based approaches,
which have been used in several domains such as image classification, speech recognition, and
object recognition. These models have also been explored for human activity recognition. In
[13], a human action recognition method was proposed using an unsupervised on-line deep
learning technique. This method achieved accuracies of 89.86% and 88.5% on the KTH and
UCF Sports datasets, respectively.
Handcrafted feature-based techniques, in particular trajectory-based methods, have less
discriminative power. Conversely, deep network architectures are inefficient at capturing salient
motion. To address this issue, [14] combined deep convolutional networks with trajectories for
action recognition. However, deep learning-based methods also have limitations: these models
require a huge amount of data for training, and collecting a huge amount of domain-specific data
is time-consuming and expensive. Therefore, training a deep learning model from scratch is not
feasible for domain-specific problems. This problem can be solved by using a pre-trained
network as a source architecture for training the target model with a small dataset, a technique
known as transfer learning.
Models from the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) such as
AlexNet, GoogleNet, and ResNet are publicly available as pre-trained networks and can be used
for transfer learning. One important way to employ these existing models for a new task is to use
them as feature extraction machines and to combine the resulting deep representations with
off-the-shelf classifiers for action recognition.
Chapter 4
Motivation
Our primary motivation for investing in this project was to build a solid foundation for
implementing a system that could recognize handwritten characters written on a blackboard or
whiteboard. Such a system can be used in an environment where the primary method of
instruction is writing on a mounted board.
The system should be able to recognize the instructor's writing and automatically generate typed
notes in a suitable font for easier understanding and cross-referencing by the students.
Existing systems do exist, such as the electronic whiteboard launched by Panasonic. The
electronic whiteboard works by assessing the contents of the written information and creating a
digital version of it. The digital representation can later be shared with people for the
dissemination of information. However, such systems have some limitations. Some are listed
below:
Expensive
Requires additional hardware
Cannot be attached to an existing blackboard.
Machinery may result in wear and tear.
Difficult to maintain.
Not power efficient.
The fundamental motive of this project is to alleviate all of the above costs and develop a
sustainable system that overcomes the shortcomings of existing systems.
Our system will be designed with the user experience and cost in mind. It is also to be designed
in a way that obviates additional hardware; in essence, it should be a plug-and-play system.
In this project, we primarily focus on the models that might be suitable for recognizing the
symbols written on the board with a reasonable level of performance.
Chapter 5
Transfer Learning
______________________________________________________________________________
Training a new deep learning model from scratch requires a huge amount of data, high
computational resources, and hours, in some cases days, of training. In real-world applications,
collecting and annotating a huge amount of domain-specific data is time-consuming and
expensive, which makes applying deep learning models quite challenging. To overcome this
challenge, researchers rely on the observation that knowledge of previously learned objects
assists in learning new objects through their similarity and connection with the new objects.
Based on this idea, some studies suggest that deep learning models trained for one classification
task can be employed for another. Thus, CNN models trained on a specific dataset or task can be
fine-tuned for a new task, even in a different domain. This concept is known as transfer learning.
Humans have an inherent ability to transfer knowledge between tasks: we recognize and apply
relevant knowledge from previous learning experiences when we encounter new tasks, and the
more related a new task is to our previous experience, the more easily we master it. Common
machine learning algorithms traditionally address isolated tasks. Transfer learning attempts to
change this by developing methods that transfer knowledge learned in one or more source tasks
and use it to improve learning in a related target task.
Transfer learning has long been studied as a machine learning technique for solving various
visual categorization problems. In recent years, due to the explosion of information such as
images, audio, and video over the internet, demands for high accuracy and computational
efficiency have increased. For these reasons, transfer learning has attracted a lot of interest in the
areas of machine learning and computer vision. Where traditional machine learning techniques
have reached their limits, transfer learning unlocks new possibilities for visual categorization. It
has been applied successfully to visual categorization tasks in the domains of object recognition,
image classification, and human action recognition.
5.1 Methodology
In computer vision, neural networks usually detect edges in their earlier layers, shapes in their
middle layers, and task-specific features in their later layers. With transfer learning, the early
and middle layers are reused and only the later layers are retrained, which leverages the labeled
data of the task the network was initially trained on.
For example, consider a model trained to recognize a backpack in an image that is to be reused
to identify sunglasses. In its earlier layers the model has already learned to recognize generic
objects, so we only re-train the later layers, so that the model learns what separates sunglasses
from other objects.
In transfer learning, we try to transfer as much knowledge as possible from the previous task the
model was trained on to the new task at hand. This knowledge can take various forms depending
on the problem and the data; for example, it could be how models are composed, which would
allow us to identify novel objects more easily.
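This freeze-and-retrain recipe translates directly into Keras. In the sketch below, the layer sizes and the ten-class output head are illustrative assumptions rather than values taken from any particular experiment.

from keras.applications import VGG16
from keras.layers import Dense, Flatten
from keras.models import Model

# Minimal sketch of the freeze-and-retrain recipe described above.
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False        # keep the edge/shape detectors intact

x = Flatten()(base.output)
x = Dense(256, activation='relu')(x)          # new task-specific layers
out = Dense(10, activation='softmax')(x)      # e.g. 10 target classes

model = Model(inputs=base.input, outputs=out)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(train_x, train_y, ...) now updates only the new dense head.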
[Figure: a basic convergence graph for transfer learning]
[Figure: the architecture of the VGG19 neural network]
5.2 Applications
The main advantages of transfer learning are that training time is saved, the neural network
model performs better in most cases, and a large dataset is not required.
Usually a lot of data is needed to train a neural network from scratch, but access to that much
data is not always available. That is where transfer learning comes into play: with it, a solid
machine learning model can be built with comparatively little training data, because the model is
already pre-trained. This is especially valuable in Natural Language Processing (NLP), where
expert knowledge is usually required to create large labeled datasets. A lot of training time can
also be saved, because it sometimes takes days or even weeks to train a deep neural network
from scratch on a complex task.
5.3 Advantages:
1. Higher start: the initial skill (before refining the model), taken from the source model, is
higher than it would be without transfer.
2. Higher slope: the rate of improvement of skill during training is steeper than it would be
otherwise.
3. Higher asymptote: the converged skill of the trained model is better than it would be
otherwise.
5.4 Disadvantages:
1. The distribution of the training data the pre-trained model was built on should be similar to
the data it will face at test time, or at least should not vary from it too much.
2. The amount of training data available for transfer learning may be small enough that
fine-tuning overfits the model.
3. Layers cannot be removed with confidence to reduce the number of parameters. The number
of layers is essentially a hyper-parameter, and there is no consensus on how to choose it. If
convolutional layers are removed from the early part of the network then, again based on
experience, the model will not learn well, because of the nature of the architecture, whose first
layers find low-level features. Furthermore, if the first layers are removed there will be a
problem for the denser layers, because the number of trainable parameters changes. Densely
connected layers and deep convolutional layers can be good candidates for reduction, but it may
take time to find how many layers and neurons to remove in order not to overfit.
Chapter 6
Implementation
____________________________________________________
For the implementation, we describe the procedure for a comparative analysis of different
existing transfer learning models on the MNIST dataset of handwritten digits. Our primary goal
is to test the different transfer learning methodologies on the dataset, after which the best model
can be used in conjunction with additional densely connected hidden layers to predict the output.
The feature set from the pre-trained model is fed into a series of three densely connected layers,
with the last layer being the output.
The following tools were used for implementing the project:
Google Colaboratory
Python
Keras (with TensorFlow backend)
Numpy
Pandas
Scikit-learn
TensorBoard (for computation graph visualization and visualizing loss)
In this paper, we evaluate the different transfer learning approaches on the MNIST dataset of
handwritten digits as a benchmark for performance evaluation.
The following models were used in the experiment, each with 'imagenet' weights:
VGG16
VGG19
ResNet50
Xception
MobileNet
DenseNet
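As a sketch, the candidate models can be instantiated directly from keras.applications with ImageNet weights; we assume here that 'DenseNet' refers to the DenseNet121 variant bundled with Keras.

from keras.applications import (VGG16, VGG19, ResNet50,
                                Xception, MobileNet, DenseNet121)

# Each candidate is instantiated with ImageNet weights and without its
# original classification head; DenseNet121 stands in for "DenseNet".
candidates = {
    'VGG16':     VGG16(weights='imagenet', include_top=False),
    'VGG19':     VGG19(weights='imagenet', include_top=False),
    'ResNet50':  ResNet50(weights='imagenet', include_top=False),
    'Xception':  Xception(weights='imagenet', include_top=False),
    'MobileNet': MobileNet(weights='imagenet', include_top=False),
    'DenseNet':  DenseNet121(weights='imagenet', include_top=False),
}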
Chapter 7
Results
____________________________________________________
We used a GPU execution environment to train the models. The time taken for training depends
on the number of epochs and on the batch size. We used the Keras deep learning framework to
implement the deep learning models, together with the built-in pre-trained models it provides to
users for transfer learning.
7.1 Algorithm for a generalized transfer learning
Algorithm: Transfer Learning for digit classification
Input: Input dataset (Dimension - 1 x 784)
Output: Probability of digit (0 to 9)
1. Import pre-trained model
2. Set weights:= ‘imagenet’
3. Reshape the input data to match the size of the input of the model
4. Feed the input dataset into the model
5. Extract bottleneck features
6. Feed the features to the densely connected network
7. Extract output
8. Plot the loss and accuracy over successive iterations
In our implementation, the following were the values of the parameters and the number of
examples for the training process:
1. Batch size: 100
2. Epochs: 100
3. Training size: 60,000 images
4. Validation size: 10,000 images
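Putting the algorithm and these parameter values together, the following Keras sketch runs the pipeline for one candidate model (VGG16). The zero-padding of the digits to 32 × 32 RGB and the hidden-layer widths are our illustrative assumptions; the report itself fixes only the batch size, the epoch count, and the three-layer dense head.

import numpy as np
from keras.applications import VGG16
from keras.datasets import mnist
from keras.layers import Dense
from keras.models import Sequential

# Steps 1-2: pre-trained model with ImageNet weights.
(train_x, train_y), (val_x, val_y) = mnist.load_data()

def to_rgb(images):
    """Step 3: pad 28x28 grayscale digits to 32x32 and replicate to RGB."""
    images = np.pad(images, ((0, 0), (2, 2), (2, 2)), mode='constant')
    return np.repeat(images[..., None], 3, axis=-1).astype('float32') / 255.0

base = VGG16(weights='imagenet', include_top=False,
             pooling='avg', input_shape=(32, 32, 3))
train_feats = base.predict(to_rgb(train_x), batch_size=100)   # steps 4-5
val_feats = base.predict(to_rgb(val_x), batch_size=100)

head = Sequential([                     # step 6: densely connected network
    Dense(256, activation='relu', input_shape=(train_feats.shape[1],)),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax'),    # step 7: probability of digit 0-9
])
head.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
             metrics=['accuracy'])
head.fit(train_feats, train_y, batch_size=100, epochs=100,     # step 8
         validation_data=(val_feats, val_y))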
7.2 Metrics
To measure the performance of the models, we used the following settings:
Loss function: sparse categorical cross-entropy
Optimizer: Adam
Sparse categorical cross-entropy fits this problem because the class labels are supplied as
integers (0 to 9) rather than one-hot vectors, and this combination with Adam proved effective in
our experiments on image data.
The densely connected layers use the Rectified Linear Unit (ReLU) activation function, with a
softmax function at the output.
The results are plotted below with supplementary visual representations for better
understanding. All models were trained with 'imagenet' weights.
All the values shown in the graphs are actual results. For each model, the data is visualized as
'loss' and then 'accuracy', in that order.
The blue line shows the results on the validation data; the orange line shows the results on the
training data.
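The per-epoch curves are produced by attaching a TensorBoard callback during training; a minimal sketch follows (the log directory name is assumed).

from keras.callbacks import TensorBoard

# Log loss/accuracy per epoch so TensorBoard can plot the training
# (orange) and validation (blue) curves described above.
tb = TensorBoard(log_dir='./logs/vgg16')
# head.fit(..., validation_data=(val_feats, val_y), callbacks=[tb])
# Then run:  tensorboard --logdir ./logs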
7.3 Analysis
1. DenseNet
[Figures: loss and accuracy vs. epochs]
Loss: 14.6499
Accuracy: 6.24%

2. MobileNet
[Figures: loss and accuracy vs. epochs]
Loss: 2.4834
Accuracy: 5.15%

3. Xception
[Figures: loss and accuracy vs. epochs]
Loss: 2.5570
Accuracy: 15.09%

4. ResNet50
[Figures: loss and accuracy vs. epochs]
Loss: 7.1141
Accuracy: 22.59%

5. VGG19
[Figures: loss and accuracy vs. epochs]
Loss: 0.1177
Accuracy: 96.64%

6. VGG16
[Figures: loss and accuracy vs. epochs]
Loss: 0.1354
Accuracy: 96.19%
7.4 Inference
From the results above, it can be inferred that the VGG19 and VGG16 architectures provide the
best results under the stipulated constraints. In fact, since VGG16 and VGG19 are hardly
different from an architectural point of view, either can be used to solve similar problems. In this
case, the problem of hand-written digit recognition is solved with a validation accuracy of
96.64% for VGG19 and 96.19% for VGG16.
We therefore conclude that VGG19 or VGG16 can be used for the problem of hand-written digit
recognition and for related problems such as character recognition.
The table below summarizes the results for the different neural network architectures.

Architecture   Loss      Accuracy (%)
DenseNet       14.6499   6.24
MobileNet      2.4834    5.15
Xception       2.5570    15.09
ResNet50       7.1141    22.59
VGG19          0.1177    96.64
VGG16          0.1354    96.19
Chapter 8
Conclusion
____________________________________________________
8.1 Summary
In this project, we introduced to the reader the motivation for building a system of character
recognition using transfer learning methodologies. We first explained the drawbacks of existing
systems and laid out goals for a system that would alleviate those drawbacks. We then explained
the method of transfer learning, citing its advantages and disadvantages relative to other
conventional methods.
We analyzed the architectures of different pre-existing models and implemented them, plotting
graphs of the respective models' measures. Finally, we identified the model best suited to the
problem of recognizing digits, which can also be applied to other similar problems such as
character recognition.
8.2 Cost Analysis
Since the dataset and the implementation were entirely cloud-based, no additional costs were
required. The only miscellaneous cost that could be incurred is that of a high-resolution camera
for efficient extraction of features and accurate recognition of characters and digits.
8.3 Challenges
We encountered numerous challenges during the planning and implementation of the project.
Some of them are listed below.
Deciding the correct neural network architecture for the solution.
Transforming the shape of the input data so as to conveniently feed into the input layer.
Selecting a cloud environment for the implemented code to be run.
8.4 Planning and Project Management
Following is the detailed project planning and management:

Activity                              Start Date   Number of weeks
Literature Review                     10-Oct       1
Finalizing problem definition         18-Oct       1
Requirement gathering                 24-Oct       1
Implementation                        10-Nov       2
Result analysis                       24-Nov       1
Preparation of project report         4-Nov        1
Preparation of project presentation   4-Nov        1
References
[1] W. Hu, A. M. Gharuib, A. Hafez, "Template Match Object Detection for Inertial Navigation
Systems," Scientific Research (SCIRP), pp. 78-83, May 2011.
[2] E. Shechtman, M. Irani, "Matching Local Self-Similarities across Images and Videos," in IEEE
International Conference on Computer Vision and Pattern Recognition, pp. 1-8, 2007.
[3] S. A. Medjahed, "A Comparative Study of Feature Extraction Methods in Images Classification,"
in I.J. Image, Graphics and Signal Processing.
[4] X. Y. Stella, R. Gross, J. Shi, "Concurrent Object Recognition and Segmentation by Graph
Partitioning," in Neural Information Processing Systems, Vancouver, pp. 1-8, 2002.
[5] A. Suga, K. Fukuda, T. Takiguchi, Y. Ariki, "Object Recognition and Segmentation using SIFT
and Graph Cuts," in 19th IEEE Conference on Pattern Recognition, pp. 1-4, 2008.
[6] A. S. Mian, M. Bennamoun, R. Owens, "Three-Dimensional Model-Based Object Recognition
and Segmentation in Cluttered Scenes," IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 28, no. 10, pp. 1584-1601, 2006.
[7] S. Ullman, "Object recognition and segmentation by a fragment-based hierarchy," Trends in
Cognitive Sciences, vol. 11, no. 12, pp. 58-64, 2007.
[8] J. B. Kim, H. S. Park, M. H. Park, H. J. Kim, "Unsupervised Moving Object Segmentation and
Recognition Using Clustering and a Neural Network," in International Joint Conference on
Neural Networks, pp. 1240-1245, 2002.
[9] M. Lehtomäki, A. Jaakkola, J. Hyyppä, J. Lampinen, H. Kaartinen, A. Kukko, E. Puttonen,
H. Hyyppä, "Object Classification and Recognition from Mobile Laser Scanning Point Clouds in
a Road Environment," IEEE Transactions on Geoscience and Remote Sensing, vol. PP, no. 99,
pp. 1-14, 2015.
[10] A. B. Sargano, "Human action recognition using transfer learning with deep representations," in
2017 International Joint Conference on Neural Networks (IJCNN).
[11] H. Azizpour, S. A. Razavian, J. Sullivan, "From generic to specific deep representations for
visual recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition Workshops, 2015.
[12] S. A. Razavian, H. Azizpour, J. Sullivan, "CNN features off-the-shelf: an astounding baseline for
recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition Workshops, 2014.
[13] K. Charalampous, A. Gasteratos, "On-line deep learning method for action recognition," Pattern
Analysis and Applications, vol. 19, no. 2, pp. 337-354, 2016.
[14] L. Wang, Y. Qiao, X. Tang, "Action recognition with trajectory-pooled deep-convolutional
descriptors," in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2015.
[15] L. Cao, Z. Liu, T. S. Huang, "Cross-dataset action detection," in Computer Vision and Pattern
Recognition (CVPR), 2010 IEEE Conference on, 2010.