Chapter 2: Basics of Supervised Deep Learning

Abstract
The use of supervised and unsupervised deep learning models has grown at a fast rate due to their success in learning complex problems. High-performance computing resources, the availability of huge amounts of data (labeled and unlabeled), and state-of-the-art open-source libraries are making deep learning more and more feasible for various applications. Since the main focus of this chapter is on supervised deep learning, the Convolutional Neural Network (CNN or ConvNet), one of the most commonly used supervised deep learning models, is discussed in this chapter.
2.1. Introduction
During recent years, the use of supervised and unsupervised deep learning models has grown at a fast rate due to their success in learning complex problems. High-performance computing resources, the availability of huge amounts of data (labeled and unlabeled), and state-of-the-art open-source libraries are making deep learning more and more feasible for various applications. Since the main focus of this chapter is on supervised deep learning, the Convolutional Neural Network (CNN or ConvNet), one of the most commonly used supervised deep learning models, is discussed in this chapter.
2.2. Convolutional Neural Network (ConvNet/CNN)
The Convolutional Neural Network, also known as ConvNet or CNN, is a deep learning technique that consists of multiple layers. ConvNets are inspired by the biological visual cortex. The visual cortex has small regions of cells that are sensitive to specific regions of the visual field. Different neurons in the brain respond to different features. For example, certain neurons fire only in the presence of lines of a certain orientation, some fire when exposed to vertical edges, and some when shown horizontal or diagonal edges. This idea of certain neurons having a specific task is the basis behind ConvNets.
ConvNets have shown excellent performance in several applications such as image classification, object detection, speech recognition, natural language processing, and medical image analysis. Convolutional neural networks power the core of computer vision, which has many applications including self-driving cars, robotics, and treatments for the visually impaired. The main concept of ConvNets is to obtain local features from the input (usually an image) at the lower layers and combine them into more complex features at the higher layers. However, due to its multi-layered architecture, a ConvNet is computationally expensive, and training such a network on a large dataset takes several days. Therefore, such deep networks are usually trained on GPUs. Convolutional neural networks are so powerful on visual tasks that they outperform almost all conventional methods.
2.3. Evolution of Convolutional Neural Network Models
LeNet: The first practical convolution-based architecture was LeNet, which used backpropagation for training the network. LeNet was designed to classify handwritten digits (MNIST), and it was adopted to read large numbers of handwritten checks in the United States. Unfortunately, the approach did not achieve wider success, as it did not scale well to larger problems. The main reasons for this limitation were:
a. Small labeled datasets.
b. Slow computers.
c. Use of the wrong non-linearity (activation) function.
The choice of activation function in a neural network has a huge impact on the final performance. Any deep neural network that uses a saturating non-linear activation function like sigmoid or tanh and is trained using backpropagation suffers from vanishing gradients. The vanishing gradient problem occurs when training neural networks with gradient-based methods: the gradient shrinks each time it is propagated backward through a layer, making it hard to train and tune the parameters of the earlier layers of the network. The problem worsens as the total number of layers in the network increases.
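To see the problem numerically, consider the following short Python sketch (purely illustrative): the derivative of the sigmoid is at most 0.25, so a gradient propagated backward through ten sigmoid layers shrinks, even in the best case, by a factor of four per layer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# The sigmoid derivative sigmoid(x) * (1 - sigmoid(x)) peaks at 0.25 (at x = 0).
# Backpropagation multiplies one such factor per layer, so the gradient
# reaching the earliest layers shrinks geometrically with depth.
grad = 1.0
for layer in range(10):          # ten sigmoid layers, each at the best case x = 0
    grad *= sigmoid(0.0) * (1.0 - sigmoid(0.0))
print(grad)                      # 0.25 ** 10, about 9.5e-7: effectively vanished
```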
AlexNet: The first breakthrough came in 2012, when a convolutional model named AlexNet significantly outperformed all other conventional methods in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012, which featured the ImageNet dataset. AlexNet brought the top-5 classification error rate down from 26% to 15%, a significant improvement at that time. AlexNet was simple but much more efficient than LeNet. The improvements that overcame the above-mentioned problems were due to the following:
a. A large labeled image database (ImageNet), which contained around 15 million labeled images from over 22,000 categories, was used.
b. The model was trained on two high-speed GTX 580 GPUs for five to six days.
c. The ReLU (Rectified Linear Unit) activation function, f(x) = max(0, x), was used. This activation function is several times faster than conventional activation functions like sigmoid and tanh, and it does not suffer from the vanishing gradient problem.
AlexNet consists of five convolutional layers, three fully connected layers, and a 1000-way softmax classifier.
ZFNet: In 2013, an improved CNN architecture called ZF Net was introduced. ZF Net reduced the filter size in the first layer from 11 × 11 to 7 × 7 and used a stride of 2 instead of 4, which resulted in more distinctive features and fewer dead features. ZF Net was the winner of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2013.
VGG Net: VGG Net, introduced in 2014, used increased network depth to improve results. The depth of the network was pushed to 19 layers by adding more convolutional layers with 3 × 3 filters (with stride and padding of 1), along with 2 × 2 max-pooling layers with stride 2. Reducing the filter size and increasing the depth of the network resulted in a CNN architecture that produced more accurate results. VGG Net achieved a top-5 error rate of 7.32% and was the runner-up in ILSVRC 2014.
GoogLeNet: Google developed a ConvNet model called GoogLeNet in 2014. The model has 22 layers and was the winner of ILSVRC 2014, with a top-5 error rate of 6.7%. In previous ConvNet models, convolution and pooling layers were simply stacked on top of each other, but the GoogLeNet architecture is a little different. It uses an inception module, which helps in reducing the number of parameters in the network. The inception module is a layer of convolution (3 × 3 and 5 × 5 convolutions) and pooling sub-layers operating at different scales, with their output filter banks concatenated into a single output vector that forms the input for the succeeding stage. These sub-layers are not stacked sequentially; they are connected in parallel, as shown in Figure 2.1. To compensate for the additional computational complexity due to the extra convolutional operations, 1 × 1 convolutions are used to reduce the computations before the expensive 3 × 3 and 5 × 5 convolutions are performed. The GoogLeNet model has two convolutional layers, four max-pooling layers, nine inception modules, and a softmax layer. The use of this special inception architecture gives GoogLeNet 12 times fewer parameters than AlexNet.
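The structure of the module can be made concrete with a short sketch in Python using the Keras functional API. This is a minimal illustration, not the full GoogLeNet; the branch widths below follow the sizes reported for GoogLeNet's first inception module.

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1, f3_in, f3, f5_in, f5, f_pool):
    # branch 1: 1x1 convolution
    b1 = layers.Conv2D(f1, (1, 1), padding="same", activation="relu")(x)
    # branch 2: 1x1 reduction, then 3x3 convolution
    b3 = layers.Conv2D(f3_in, (1, 1), padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f3, (3, 3), padding="same", activation="relu")(b3)
    # branch 3: 1x1 reduction, then 5x5 convolution
    b5 = layers.Conv2D(f5_in, (1, 1), padding="same", activation="relu")(x)
    b5 = layers.Conv2D(f5, (5, 5), padding="same", activation="relu")(b5)
    # branch 4: 3x3 max pooling, then 1x1 projection
    bp = layers.MaxPooling2D((3, 3), strides=(1, 1), padding="same")(x)
    bp = layers.Conv2D(f_pool, (1, 1), padding="same", activation="relu")(bp)
    # the four parallel branches are concatenated along the channel axis
    return layers.Concatenate()([b1, b3, b5, bp])

inputs = tf.keras.Input(shape=(28, 28, 192))
outputs = inception_module(inputs, 64, 96, 128, 16, 32, 32)
model = tf.keras.Model(inputs, outputs)   # output shape: 28 x 28 x 256
```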
Increasing the number of layers increases the number of features, which enhances the accuracy of the network. However, there are practical limitations: 1) vanishing gradients: some neurons in very deep networks may die during training, which can cause a loss of useful information; and 2) optimization difficulty: too many parameters can make training the network a difficult task. The network depth should therefore be increased without incurring these negative effects. The inception model was refined as Inception V3 in 2016, and as Inception-ResNet in 2017.

Figure 2.1: Inception module in GoogLeNet
ResNet: Microsoft Research Asia proposed a CNN architecture in 2015 that is 152 layers deep and is called ResNet. ResNet introduced residual connections, in which the output of a conv-relu-conv series is added to the original input and then passed through a Rectified Linear Unit (ReLU), as shown in Figure 2.2. In this way, information is carried from the previous layer to the next layer, and during backpropagation the gradient flows easily because the addition operation distributes the gradient. ResNet proved that a complex architecture like Inception is not required to achieve the best results; a simple but deep architecture can be tweaked to get better results. ResNet performed well in classification, detection, and localization and won ILSVRC 2015 with an incredible top-5 error rate of 3.6%, better than the typical human error rate of 5-10%. At the time of writing, ResNet is the deepest network trained on ImageNet, and it has fewer parameters than VGG Net despite being eight times deeper.
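The residual connection of Figure 2.2 can be expressed in a few lines with the Keras functional API. This is a minimal sketch of the basic block (the filter count is an illustrative choice, and the deeper ResNets use a bottleneck variant):

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    # conv-relu-conv, then add the original input back and apply ReLU
    y = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, (3, 3), padding="same")(y)
    y = layers.Add()([x, y])             # the shortcut (residual) connection
    return layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(32, 32, 64))
outputs = residual_block(inputs, 64)      # filters must match the input channels
model = tf.keras.Model(inputs, outputs)
```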
Inception-ResNet: A hybrid Inception model that uses residual connections, as in ResNet, was proposed in 2017. This hybrid model, called Inception-ResNet, dramatically improved the training speed of the Inception model and outperformed the pure ResNet model by a thin margin.

Figure 2.2: Residual connection in ResNet
Xception: Xception is a convolutional neural network architecture based on depthwise separable convolution layers. The architecture is inspired by the Inception model, which is why it is called Xception (Extreme Inception). The Xception architecture is a stack of depthwise separable convolution layers with residual connections. Xception has 36 convolutional layers organized into 14 modules, all having linear residual connections around them except for the first and last modules. Xception is claimed to perform slightly better than Inception V3 on ImageNet. Table 2.1 and Figure 2.3 show the classification performance of AlexNet, VGG-16, ResNet-152, Inception V3, and Xception on ImageNet.
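A rough sketch of one such module is given below using Keras' SeparableConv2D layer; the filter counts are illustrative, and the actual Xception entry and exit flows also use strided pooling, so this should be read only as the general pattern of separable convolutions wrapped by a linear residual connection.

```python
import tensorflow as tf
from tensorflow.keras import layers

def separable_module(x, filters):
    # 1x1 convolution on the shortcut so channel counts match for the addition
    shortcut = layers.Conv2D(filters, (1, 1), padding="same")(x)
    # two depthwise separable convolutions form the body of the module
    y = layers.SeparableConv2D(filters, (3, 3), padding="same", activation="relu")(x)
    y = layers.SeparableConv2D(filters, (3, 3), padding="same")(y)
    return layers.Add()([shortcut, y])    # linear residual connection

inputs = tf.keras.Input(shape=(64, 64, 32))
outputs = separable_module(inputs, 128)
model = tf.keras.Model(inputs, outputs)
```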
SqueezeNet: As the accuracy of new ConvNet models kept improving, researchers started focusing on how to reduce the size and complexity of existing ConvNet architectures without compromising accuracy. The goal was to design a model that has very few parameters while maintaining high accuracy. In one approach, a pre-trained model is taken and those of its parameters with values below a certain threshold are replaced with zeros to form a sparse matrix, followed by a few iterations of training on the sparse ConvNet. Another version of the SqueezeNet model used the following three main strategies to reduce the parameters and the computational effort significantly while maintaining high accuracy:
a) Replace 3 × 3 filters with 1 × 1 filters.
b) Reduce the number of input channels to the 3 × 3 filters.
c) Delay subsampling until late in the network so that convolution layers have large activation maps.
SqueezeNet achieved AlexNet-level accuracy on ImageNet with 50 times fewer parameters.
Table 2.1: Classification accuracy of AlexNet, VGG-16, ResNet-152, Inception V3 and Xception on ImageNet.

Model | Top-1 accuracy | Top-5 accuracy
AlexNet | 0.625 | 0.860
VGG-16 | 0.715 | 0.901
ResNet-152 | 0.770 | 0.933
Inception V3 | 0.782 | 0.941
Xception | 0.790 | 0.945
ShuffleNet: Another ConvNet architecture, called ShuffleNet, was introduced in 2017 for devices with limited computational power, like mobile devices. ShuffleNet uses two ideas, pointwise group convolution and channel shuffle, to considerably decrease the computational cost while maintaining accuracy.
Figure 2.3: ILSVRC Top-5 Error on ImageNet since 2010
2.4. Convolution Operation
Convolution is a mathematical operation performed on two functions and is written as (f * g), where f and g are the two functions. The output of the convolution operation over the domain n is defined as:

$$(f * g)(n) = \sum_{m=-\infty}^{\infty} f(m)\, g(n-m)$$

For time-domain functions, n is replaced by t. The convolution operation is commutative in nature, so it can also be written as:

$$(f * g)(n) = \sum_{m=-\infty}^{\infty} f(n-m)\, g(m)$$

The convolution operation is one of the important operations in digital signal processing and is used in many areas, including statistics, probability, natural language processing, computer vision, and image processing.
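As a quick illustration, NumPy's convolve function implements exactly this sum; the example values are arbitrary:

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])

# np.convolve computes (f * g)(n) = sum over m of f(m) g(n - m)
print(np.convolve(f, g))   # [0.  1.  2.5 4.  1.5]
print(np.convolve(g, f))   # identical output: convolution is commutative
```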
The convolution operation can also be applied to higher-dimensional functions. It can be applied to a two-dimensional function by sliding one function on top of the other, multiplying and adding. Convolution can be applied to images to perform various transformations; here, images are treated as two-dimensional functions. An example of a two-dimensional filter, a two-dimensional input, and the resulting two-dimensional feature map is shown in Figure 2.4. Let the 2D input (i.e., the 2D image) be denoted by A, the 2D filter of size m × n be denoted by K, and the 2D feature map be denoted by F. Here the image A is convolved with the filter K to produce the feature map F. This convolution operation is denoted by A * K and is mathematically given as:

$$F(i, j) = (A * K)(i, j) = \sum_{m}\sum_{n} A(m, n)\, K(i-m, j-n) \qquad (2.1)$$

The convolution operation is commutative in nature, so we can write Eq. 2.1 as:

$$F(i, j) = (K * A)(i, j) = \sum_{m}\sum_{n} A(i-m, j-n)\, K(m, n) \qquad (2.2)$$

In this operation, the kernel K is flipped relative to the input. If the kernel is not flipped, then the convolution operation is the same as the cross-correlation operation, which is given as:

$$F(i, j) = (K \star A)(i, j) = \sum_{m}\sum_{n} A(i+m, j+n)\, K(m, n) \qquad (2.3)$$

Many CNN libraries implement the convolution function as cross-correlation, because cross-correlation is more convenient to implement than the convolution operation itself. According to Eq. 2.3, the operation computes the inner product (element-wise multiplication followed by a sum) of the filter and the image at every location in the image.

Figure 2.4: Convolution Operation
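The following NumPy sketch (the function names are ours) implements the "valid" cross-correlation of Eq. 2.3 directly, and obtains true convolution simply by flipping the kernel first:

```python
import numpy as np

def cross_correlate2d(A, K):
    """Valid cross-correlation of image A with filter K (Eq. 2.3, no flip)."""
    m, n = K.shape
    H, W = A.shape
    F = np.zeros((H - m + 1, W - n + 1))
    for i in range(F.shape[0]):
        for j in range(F.shape[1]):
            # inner product of the filter with the image patch at (i, j)
            F[i, j] = np.sum(A[i:i + m, j:j + n] * K)
    return F

def convolve2d(A, K):
    """True convolution (Eq. 2.1) = cross-correlation with a flipped kernel."""
    return cross_correlate2d(A, K[::-1, ::-1])

A = np.arange(16, dtype=float).reshape(4, 4)   # a toy 4x4 "image"
K = np.array([[1.0, 0.0],
              [0.0, -1.0]])                    # a toy 2x2 filter
print(cross_correlate2d(A, K))                 # 3x3 feature map
```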
2.5. Architecture of CNN
In a traditional neural network, neurons are fully connected between different layers. Layers that sit between the input layer and the output layer are called hidden layers. Each hidden layer is made up of a number of neurons, where each neuron is fully connected to all neurons in the preceding layer. The problem with a fully connected neural network is that its densely connected architecture does not scale well to large images. For large images, the preferred approach is to use a convolutional neural network.

A convolutional neural network is a deep neural network architecture designed to process data that has a known, grid-like topology, for example, 1D time-series data, 2D or 3D data such as images and speech signals, and 4D data such as videos. ConvNets have three key features: local receptive fields, weight sharing, and subsampling (pooling).
i) Local Receptive Field
In a traditional neural network, each neuron or hidden unit is connected to every neuron in the previous layer or every input unit. Convolutional neural networks, however, have a local receptive field architecture: each hidden unit connects only to a small region of the input called its local receptive field. This is accomplished by making the filter/weight matrix smaller than the input. With local receptive fields, neurons can extract elementary visual features like edges, corners, and end-points.
ii) Weight Sharing
Weight sharing refers to using the same filter/weights for all receptive fields in a layer. In a ConvNet, since the filters are smaller than the input, each filter is applied at every position of the input, i.e., the same filter is used for all local receptive fields.
A ConvNet consists of a sequence of different types of layers that achieve different tasks. A typical convolutional neural network consists of the following layers:
Convolutional layer
Activation function layer (ReLU)
Pooling layer
Fully connected layer
Dropout layer
These layers are stacked up to make a full ConvNet architecture, as the sketch after this paragraph shows. Convolutional and activation function layers are usually stacked together, followed by an optional pooling layer. Fully connected layers make up the last layers of the network, and the output of the last fully connected layer produces the class scores for the input image. In addition to these main layers, a ConvNet may include optional layers like a batch normalization layer to improve the training time and a dropout layer to address the overfitting issue.
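The following is a minimal sketch of such a stack written with the Keras API; the layer sizes, the 128 × 128 RGB input, and the 10-class output are illustrative choices, not values prescribed by the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(128, 128, 3)),            # RGB input image
    layers.Conv2D(32, (3, 3), activation="relu"),   # convolution + ReLU
    layers.MaxPooling2D((2, 2)),                    # optional pooling layer
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),           # fully connected layer
    layers.Dropout(0.5),                            # optional dropout layer
    layers.Dense(10, activation="softmax"),         # class scores
])
model.summary()
```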
iii) Subsampling (Pooling)
Subsampling reduces the spatial size of the input, thereby reducing the number of parameters in the network. A few subsampling techniques are available; the most common is max pooling.
2.5.1 Convolution Layer
The convolution layer is the core building block of a convolutional neural network; it uses the convolution operation (represented by *) in place of general matrix multiplication. Its parameters consist of a set of learnable filters, also known as kernels. The main task of the convolutional layer is to detect features found within local regions of the input image that are common throughout the dataset and to map their appearance to a feature map. A feature map is obtained for each filter in the layer by repeated application of the filter across sub-regions of the complete image, i.e., convolving the filter with the input image, adding a bias term, and then applying an activation function. The input area to which a filter is applied is called the local receptive field. The size of the receptive field is the same as the size of the filter. Figure 2.5 below shows how a filter (T-shaped) is convolved with the input to get the feature map.

A feature map is thus obtained after adding a bias term and then applying a non-linear function to the output of the convolution operation. The purpose of the non-linearity function is to introduce non-linearity into the ConvNet model; a number of non-linearity functions are available, and they are briefly explained in the next section.
Filters/Kernels
The weights in each convolutional layer specify the convolution filters, and there may be multiple filters in each convolutional layer. Every filter detects a particular feature, such as an edge or a corner, and during the forward pass each filter is slid across the width and height of the input, generating the feature map of that filter.
Hyperparameters
Convolutional neural network architectures have many hyperparameters that are used to control the behavior of the model. Some of these hyperparameters control the size of the output, while some are used to tune the running time and memory cost of the model. The four important hyperparameters in the convolution layer of a ConvNet are given below:
a. Filter Size: Filters can be of any size greater than 2 × 2 and less than the size of the input, but conventional sizes range from 11 × 11 down to 3 × 3. The size of a filter is independent of the size of the input.
b. Number of Filters: This can be any reasonable number. AlexNet used 96 filters of size 11 × 11 in its first convolutional layer; ZF Net used 96 filters of size 7 × 7, and VGG Net used 64 filters of size 3 × 3 in its first convolutional layer.
c. Stride: The stride is the number of pixels the filter moves at a time when defining the local receptive fields. A stride of one means moving across and down a single pixel. The value of the stride should be neither too small nor too large: too small a stride leads to heavily overlapping receptive fields, while too large a value reduces the overlap and the resulting output volume has smaller spatial dimensions.
d. Zero Padding: This hyperparameter describes the number of pixels of zeros with which to pad the border of the input image. Zero padding is used to control the spatial size of the output volume.
Each filter in the convolution layer produces a feature map of size ((A - K + 2P) / S) + 1, where A is the input volume size, K is the size of the filter, P is the amount of padding applied, and S is the stride. Suppose the input image has size 128 × 128 and 5 filters of size 5 × 5 are applied with single stride and no zero padding, i.e., A = 128, K = 5, P = 0, and S = 1. The number of feature maps produced will be equal to the number of filters applied, i.e., 5, and the size of each feature map will be ((128 - 5 + 0)/1) + 1 = 124. Therefore, the output volume will be 124 × 124 × 5.
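The computation can be checked with a two-line helper (the function name is ours):

```python
def conv_output_size(A, K, P, S):
    """Feature map side length: ((A - K + 2P) / S) + 1."""
    return (A - K + 2 * P) // S + 1

# The worked example from the text: a 128 x 128 input, 5 x 5 filters,
# stride 1, and no zero padding.
print(conv_output_size(A=128, K=5, P=0, S=1))   # 124 -> output volume 124 x 124 x 5
```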
2.5.2 Activation Function Layer (ReLU)
The output of each convolutional layer is fed to an activation function layer. The activation function layer consists of an activation function that takes the feature map produced by the convolutional layer and generates the activation map as its output. The activation function is used to transform the activation level of a neuron into an output signal; it specifies the output of a neuron for a given input. An activation function usually has a squashing effect: it takes an input (a number), performs some mathematical operation on it, and outputs the activation level of the neuron within a given range, e.g., 0 to 1 or -1 to 1.

A typical activation function should be differentiable and continuous everywhere. Since ConvNets are trained using gradient-based methods, the activation function should be differentiable at any point. However, if a non-gradient-based method is used, then differentiability is not necessary.

Figure 2.5: Example of Convolution Operation
There are many activation functions in use with artificial neural networks (ANNs); some of the commonly used ones are:
Logistic/Sigmoid Activation Function: The sigmoid function is mathematically represented as:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

It is an S-shaped curve, as shown in Figure 2.6 below. The sigmoid function squashes its input into the range [0, 1].
Tanh Activation Function: The hyperbolic tangent function, $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$, is similar to the sigmoid function, but its output lies in the range [-1, 1]. The advantage of tanh over sigmoid is that negative inputs are mapped strongly negative and zero inputs are mapped near zero in the tanh graph, as shown in Figure 2.7.
Figure 2.6: Graph of Sigmoid activation function
Softmax Function (Exponential Function): The softmax function is often used in the output layer of a neural network for classification. It is mathematically represented as:

$$f(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$

The softmax function is a generalized logistic activation function and is used for multiclass classification.
ReLU Activation Function: The Rectified Linear Unit (ReLU) has gained importance in recent years and is currently the most popular activation function for deep neural networks. Neural networks with ReLU train much faster than those with other activation functions like sigmoid and tanh. ReLU simply computes the activation by thresholding the input at zero; in other words, a rectified linear unit outputs 0 if the input is less than 0, and the raw input otherwise. It is mathematically given as:

f(x) = max(0, x)

The rectified linear unit activation function produces a graph that is zero when x < 0 and linear with slope 1 when x > 0, as shown in Figure 2.8.

Figure 2.7: Graph of tanh activation function
Figure 2.8: Rectified Linear Unit (ReLU) activation function
SWISH Activation Function: The Self-Gated Activation Function (SWISH) scales the input by its own sigmoid and is given as:

f(x) = x · σ(x)

where σ(x) is the sigmoid of x:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The SWISH activation function is non-monotonic and is shown in Figure 2.9.
Figure 2.9: The Swish activation function
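For reference, all five activation functions described above can be written in a few lines of NumPy; the max-subtraction in softmax is a standard numerical-stability trick, not part of the definition given earlier.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))          # squashes input into [0, 1]

def tanh(x):
    return np.tanh(x)                         # squashes input into [-1, 1]

def softmax(x):
    e = np.exp(x - np.max(x))                 # shift for numerical stability
    return e / e.sum()                        # outputs sum to 1

def relu(x):
    return np.maximum(0.0, x)                 # f(x) = max(0, x)

def swish(x):
    return x * sigmoid(x)                     # f(x) = x * sigmoid(x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for fn in (sigmoid, tanh, relu, swish):
    print(fn.__name__, fn(x))
print("softmax", softmax(x))
```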
2.5.3. Pooling Layer
In ConvNets, the sequence of a convolution layer and an activation function layer is followed by an optional pooling or downsampling layer that reduces the spatial size of the input, and thus the number of parameters in the network. A pooling layer takes each feature map output from the convolutional layer and downsamples it, i.e., the pooling layer summarizes a region of neurons in the convolution layer. A few pooling techniques are available; the most common is max pooling. Max pooling simply outputs the maximum value in the input region. The input region is a subset of the input (usually 2 × 2). For example, if the input region is of size 2 × 2, the max-pooling unit will output the maximum of the four values, as shown in Figure 2.10. Other options for pooling layers are average pooling and L2-norm pooling.

Figure 2.10: Max Pooling
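A minimal NumPy sketch of max pooling over non-overlapping 2 × 2 regions (the function name is ours):

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling: output the maximum of each (size x size) input region."""
    H, W = x.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

x = np.array([[1., 3., 2., 4.],
              [5., 6., 1., 2.],
              [7., 2., 9., 1.],
              [3., 4., 6., 8.]])
print(max_pool2d(x))   # [[6. 4.]
                       #  [7. 9.]]
```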
The pooling operation discards less significant data but preserves the detected features in a smaller representation. The intuitive reasoning behind pooling is that the detection of a feature matters more than the feature's exact location. This strategy works well for simple and basic problems, but it has its limitations and does not work well for some problems.
2.5.4. Fully Connected Layer
Convolutional neural networks are composed of two stages: a feature extraction stage and a classification stage. In ConvNets, the stack of convolution and pooling layers acts as the feature extraction stage, while the classification stage is composed of one or more fully connected layers followed by a softmax function layer. The process of convolution and pooling continues until enough features are detected; the next step is to make a decision based on these detected features. In the case of a classification problem, the task is to use the detected features in the spatial domain to obtain probabilities that these features represent each class, that is, to obtain the class scores. This is done by adding one or more fully connected layers at the end. In a fully connected layer, each neuron from the previous layer (convolution layer, pooling layer, or fully connected layer) is connected to every neuron in the next layer, and every value contributes to predicting how strongly a value matches a particular class. Figure 2.11 shows the connection between a convolution layer and a fully connected layer. Like convolutional layers, fully connected layers can be stacked to learn even more sophisticated combinations of features. The output of the last fully connected layer is fed to a classifier that outputs the class scores. Softmax and Support Vector Machines (SVM) are the two main classifiers used with ConvNets: the softmax classifier produces probabilities for each class with a total probability of 1, while an SVM produces class scores, and the class with the highest score is treated as the correct class.

Figure 2.11: Connection between Convolution layer and Fully Connected Layer
2.5.5. Dropout
Deep neural networks consist of multiple hidden layers, enabling them to learn more complicated features, followed by fully connected layers for decision making. Because a fully connected layer is connected to all features, it is prone to overfitting. Overfitting refers to the problem where a model works so well on the training data that it negatively impacts the performance of the model on new data. In order to overcome the problem of overfitting, a dropout layer can be introduced into the model, in which some neurons, along with their connections, are randomly dropped from the network during training (see Figure 2.12). A reduced network is left; incoming and outgoing edges to a dropped-out node are also removed. Only the reduced network is trained on the data in that stage. The removed nodes are then reinserted into the network with their original weights. Dropout notably reduces overfitting and improves the generalization of the model.

Figure 2.12: a) a simple neural network; b) the neural network after dropout
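One common realization of this idea is "inverted" dropout, sketched below in NumPy: the surviving activations are rescaled by 1/(1 - p) during training, so that at test time the full network can be used unchanged. This is a sketch of one standard variant, not the only way dropout is implemented.

```python
import numpy as np

def dropout(x, p_drop=0.5, training=True, seed=0):
    """Inverted dropout: randomly zero a fraction p_drop of the units during
    training and rescale the survivors; act as the identity at inference."""
    if not training:
        return x
    rng = np.random.default_rng(seed)
    mask = (rng.random(x.shape) >= p_drop).astype(x.dtype)
    return x * mask / (1.0 - p_drop)

activations = np.ones((2, 4))
print(dropout(activations))                   # about half the units zeroed; the rest become 2.0
print(dropout(activations, training=False))   # unchanged at inference
```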
2.6. Challenges and Future Research Direction
ConvNets have evolved over the years and have achieved very good performance on various visual tasks like classification and object detection. In fact, deep networks have now achieved human-level performance in classifying different objects. However, deep networks like convolutional neural networks have their limitations. Altering an image in a way unnoticeable to humans can cause a deep network to misclassify the image as something else, and selectively modifying even a few pixels can make a deep neural network produce incorrect results.

One of the reasons ConvNets are vulnerable to these attacks is the way they use the pooling operation: it reduces the feature space at the cost of losing important information about the precise location of a feature within its region. As a result, ConvNets can only identify whether a certain feature exists in a certain region or not, irrespective of its position relative to other features. The consequence is difficulty in accurately recognizing objects that depend on spatial relationships between features.

These vulnerabilities put a big question mark on the reliability of deep networks, raising questions about their true generalization capabilities. A deep neural network architecture called Capsule Networks has been proposed to address some of these inadequacies of ConvNets. A Capsule Network consists of capsules, which are groups of neurons representing the instantiation parameters of a specific object or part of it. Capsule Networks use dynamic routing between capsules, instead of max pooling, to forward information from layer to layer. The study of Capsule Networks is still in its early stages, and their performance on different visual tasks is not yet known.
Bibliography
1. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
2. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
3. Nguyen, A., Yosinski, J., & Clune, J. (2015). Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 427-436).
4. Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Swish: A self-gated activation function. arXiv preprint arXiv:1710.05941.
5. Sabour, S., Frosst, N., & Hinton, G. E. (2017). Dynamic routing between capsules. In Advances in Neural Information Processing Systems (pp. 3859-3869).
6. Salakhutdinov, R., & Hinton, G. (2009). Deep Boltzmann machines. In Artificial Intelligence and Statistics (pp. 448-455).
7. Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85-117.
8. Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
9. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1-9).
10. Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In European Conference on Computer Vision (pp. 818-833). Springer.