Insulator Detection Based on a
Single-Stage Detector
Master Thesis
Submitted in Fulfilment of the
Requirements for the Academic Degree
M.Sc.
Dept. of Computer Science
Chair of Computer Engineering
Submitted by:
Name: Soaibuzzaman
Student ID: 613488
Date: 08.09.2022
Supervising tutor:
Prof. Dr. Dr. h. c. Wolfram Hardt
Dr. Ariane Heller
Mohamed Salim Harras, MSc.
Acknowledgement
I am overwhelmed with gratitude and humility for everyone who has helped me
translate these concepts into something concrete that is well above the superficial
level.
I want to start by sincerely thanking my thesis supervisor, Mohamed Salim Harras,
MSc, who served as my technical adviser, mentor, and counselor throughout the
course of my thesis. His understanding, persistence, and drive helped me complete
my thesis effectively. Furthermore, his advice on how to write my report helped me
as I was writing it. I would also like to thank Mr. Shadi Saleh, who gave me much
valuable feedback regarding the implementation. His insightful comments on the
active deep learning framework design supported me immensely.
Moreover, it is my pleasure to express gratitude to my academic supervisors, Prof.
Dr. Dr. h. c. Wolfram Hardt and Dr. Ariane Heller. The intuitive remarks given
by Prof. Dr. Dr. h. c. Wolfram Hardt on my concept presentation helped me
understand scientific report writing in a better way. Moreover, the guidance
provided by Dr. Ariane Heller during our regular meetings kept me motivated,
and her thoughtful comments on this report steered me in the right direction.
Last but not least, I would like to thank all the members of the IFC lab at TU
Chemnitz for their guidance, encouragement, and support.
Abstract
Detection of the power line insulators from a cluttered background of aerial images
is crucial for the steady and dependable operation of the complete power system.
Besides, object detection is among the most challenging disciplines of computer vi-
sion and image processing. It becomes even more challenging regarding resource and
power-constrained embedded platforms. This thesis proposes a robust deep learn-
ing model for detecting insulators from aerial images on a low-powered embedded
platform, i.e., NVIDIA Jetson Nano.
In the proposed detection architecture, an active deep learning approach is used
with the single shot multibox detector (SSD), a sophisticated deep meta-architecture.
The SSD-based methodology can automate multi-level feature extraction from aerial
images instead of manually doing so. Additionally, the MobileNetV2 feature extrac-
tor is utilized as the backbone of the SSD network instead of the standard VGG-Net.
Two distinct approaches have been utilized to construct the model: traditional deep
learning and deep active learning, which reduces the cost and time of data prepa-
ration. Two separate models have been developed using the proposed architecture-
SSD300 and SSD640.
Experimental findings and analysis confirm the effectiveness and robustness of
the proposed methods. The proposed active deep learning model reached an mAP
of 94.5% using 43% of the total dataset, identical to the traditional deep learning
approach (94.6%), which utilizes the entire dataset. The model is trained over 10,000
epochs for both approaches. Regarding the inference speed, the presented models
showed a frame rate of 9 fps in 10 W power mode and 5.6 fps in 5 W power mode
on the NVIDIA Jetson Nano.
Keywords: Insulator detection, SSD, Deep Active Learning, Object
detection, Deep learning
Table of Contents

Acknowledgement . . . I
Abstract . . . II
Table of Contents . . . III
List of Figures . . . VI
List of Tables . . . VIII
List of Abbreviations . . . IX
1 Introduction . . . 1
1.1 Motivation . . . 2
1.2 Problem Statement . . . 4
1.3 Structure of the Thesis . . . 5
2 Technical Background . . . 6
2.1 Convolutional Neural Network . . . 6
2.1.1 Convolutional Layer . . . 6
2.1.2 Pooling Layer . . . 8
2.1.3 Fully Connected Layer . . . 8
2.2 Object Detector . . . 9
2.2.1 Two-Stage Detector . . . 9
2.2.2 Single-Stage Detector . . . 11
2.3 Learning Approach . . . 11
2.3.1 Traditional Deep Learning . . . 12
2.3.2 Active Learning . . . 12
2.3.3 Active Deep Learning . . . 13
2.4 Chapter Summary . . . 14
3 State of The Art . . . 16
3.1 Object Detection . . . 16
3.1.1 Traditional Detectors . . . 17
3.1.2 Feature Extractors . . . 19
3.1.3 Deep Learning-based Detectors . . . 21
3.2 Insulator Detection . . . 27
3.2.1 Feature-based Insulator Detection . . . 27
3.2.2 Deep Learning-based Insulator Detection . . . 30
3.3 Embedded Devices . . . 34
3.4 Chapter Summary . . . 36
4 Methodology . . . 38
4.1 Data Preprocessing . . . 39
4.1.1 Data Collection . . . 39
4.1.2 Data Selection (for DeepAL) . . . 39
4.1.3 Data Annotation . . . 41
4.1.4 Data Augmentation . . . 42
4.2 Single Shot MultiBox Detector (SSD) . . . 43
4.2.1 Base Network . . . 44
4.2.2 Multi-Scale Detection . . . 45
4.3 Training Aspects . . . 46
4.3.1 Loss Function . . . 47
4.3.2 Aspect Ratio . . . 48
4.3.3 Hard Negative Mining . . . 50
4.4 Active Deep Learning Framework . . . 50
4.5 Chapter Summary . . . 51
5 Implementation . . . 53
5.1 Technology Enablers . . . 53
5.1.1 Hardware . . . 53
5.1.2 Software . . . 55
5.2 Model Training . . . 56
5.2.1 Google Colab Setup . . . 56
5.2.2 Preparing Dataset . . . 57
5.2.3 Model Configuration . . . 58
5.2.4 Training . . . 59
5.3 Chapter Summary . . . 59
6 Results and Evaluation . . . 61
6.1 Evaluation Metrics . . . 61
6.1.1 Confusion Matrix . . . 61
6.1.2 Precision-Recall . . . 62
6.1.3 COCO Variants . . . 63
6.2 SSD300 . . . 64
6.2.1 Traditional Approach . . . 64
6.2.2 Active Deep Learning Approach . . . 65
6.3 SSD640 . . . 71
6.3.1 Traditional Approach . . . 71
6.3.2 Active Deep Learning Approach . . . 74
6.4 Inference . . . 77
6.4.1 FPS . . . 77
6.4.2 Resource Consumption . . . 82
6.5 Chapter Summary . . . 82
7 Conclusion . . . 85
7.1 Summary of The Thesis . . . 85
7.2 Future Work . . . 87
Bibliography . . . 88
References from Professorship of Computer Engineering . . . 101
List of Figures

1.1 Insulator inspection techniques: (a) human patrol [1], (b) helicopter patrol [2], (c) explainer-robot inspection [3], (d) Unmanned Aerial Vehicle (UAV)-based inspection [4] . . . 2
1.2 The hardware and software architecture of the Adaptive Research Multicopter Platform (AREIOM) [5] . . . 4
2.1 Convolutional layer with a 3 × 3 × 3 kernel [6] . . . 7
2.2 Pooling layer [7] . . . 8
2.3 Fully Connected (FC) layer [6] . . . 9
2.4 Meta-architecture of deep learning-based (a) two-stage and (b) single-stage object detectors [8] . . . 10
2.5 Digital data growth from the beginning of 2010 to the end of 2020 [9] . . . 12
2.6 The design and operational illustration of pool-based active learning [10] . . . 13
2.7 The architecture of a standard deep active learning framework combining deep learning and active learning [10] . . . 14
3.1 Timeline of object detection research over time. The gray area represents the era of traditional object detection (before 2014) and the pinkish area represents the generation of deep-learning-based object detection (after 2014) [11] . . . 16
3.2 Fundamental working method of a traditional object detector . . . 17
3.3 Performance of traditional object detection algorithms on the PASCAL VOC dataset from 2007 to 2012 (based on [12]) . . . 18
3.4 Accuracy of Faster RCNN, RFCN, and SSD using various backbone networks and object sizes for 300x300 image resolution [13] . . . 26
3.5 Time required for each model with different backbone networks for an image size of 300x300 [13] . . . 27
3.6 Operational flow of the mathematical morphology and Bayesian segmentation-based insulator detection [14] . . . 28
3.7 Insulator detection based on local features and spatial orders. The dotted line box illustrates the formation of the feature library [15] . . . 29
3.8 Architecture of the Faster RCNN-based insulator detector [16] . . . 30
3.9 Multi-feature fusion and Fast RCNN-based deep neural network architecture for insulator detection [17] . . . 31
3.10 Architecture of the Improved Single Shot MultiBox Detector (ISSD) network [16] . . . 32
3.11 The YOLOv4-based network architecture for insulator detection [TUC6] . . . 33
3.12 DL inference performance benchmark on different Jetson devices [18] . . . 35
4.1 Process flow of the proposed model . . . 38
4.2 Illustration of different data pools . . . 40
4.3 Data augmentation strategy used in the application . . . 43
4.4 Network architecture of (a) standard SSD and (b) improved SSD . . . 44
4.5 Network architecture of the proposed base network (MobileNetV2 [19]) . . . 45
4.6 Multi-scale object detection with default boxes . . . 46
4.7 Process flow of the softmax loss function . . . 48
4.8 The aspect ratio of our training dataset of insulators . . . 49
4.9 Proposed framework for Active Deep Learning (DeepAL) . . . 50
5.1 Task distribution among the computing platforms . . . 54
5.2 Top-level architecture of the TensorFlow library (based on [20]) . . . 56
5.3 Dataset visualization after splitting (Iteration 1) . . . 58
6.1 Dataset visualization in different data pools . . . 64
6.2 Loss during training of SSD300 using the passive DL approach . . . 65
6.3 Confusion matrix of SSD300 using the passive DL approach . . . 66
6.4 Loss during training of SSD300 using the active DL approach . . . 67
6.5 Average precision over iterations for SSD300 . . . 68
6.6 Average recall over iterations for SSD300 . . . 69
6.7 Confusion matrix of SSD300 for different iterations . . . 70
6.8 Loss during training of SSD640 using the traditional DL approach . . . 72
6.9 Confusion matrix of SSD640 using the traditional DL approach . . . 73
6.10 Loss during training of SSD640 using the active DL approach . . . 74
6.11 Average precision over iterations for SSD640 . . . 76
6.12 Average recall over iterations for SSD640 . . . 77
6.13 Confusion matrix of SSD640 for different iterations . . . 78
6.14 Detection results with SSD300 (7th iteration) on different data pools . . . 80
6.15 Detection results with SSD640 (8th iteration) on different data pools . . . 81
List of Tables

1.1 Comparison of different inspection methods . . . 3
2.1 Comparison between Deep Learning and Active Learning . . . 14
3.1 Summary of the widely used complex feature extractor networks. The accuracy is measured on the ImageNet dataset . . . 19
3.2 Summary of the widely used lightweight feature extractor networks. The accuracy is measured on the ImageNet dataset . . . 20
3.3 Performance analysis of two-stage detectors on the PASCAL VOC dataset with an image size of 1000x600 . . . 22
3.4 Performance analysis of single-stage detectors on the PASCAL VOC dataset . . . 23
3.5 Performance analysis of single-stage detectors on the MS COCO test-dev dataset . . . 24
3.6 A hardware comparison of different Jetson devices (specifications available online: https://developer.nvidia.com/embedded/jetson-nano, https://www.nvidia.com/de-de/autonomous-machines/embedded-systems/jetson-tx2/, https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-xavier-nx/) . . . 34
3.7 Inference rate of different object detection architectures on different Jetson devices [21] . . . 35
4.1 Feature comparison of different data pools . . . 41
5.1 Dataset description of different iterations of the active deep learning framework . . . 60
6.1 Evaluation results of SSD300 using the passive DL approach . . . 66
6.2 Evaluation results of SSD300 using the active DL approach . . . 68
6.3 Accuracy of SSD300 using the active DL approach . . . 71
6.4 Evaluation results of SSD640 using the traditional deep learning approach . . . 73
6.5 Evaluation results of SSD640 using the active DL approach . . . 75
6.6 Accuracy of SSD640 using the active DL approach . . . 76
6.7 Inference speed computation for both SSD300 and SSD640 . . . 79
6.8 Resource consumption of the trained model during inference . . . 83
List of Abbreviations
ANN Artificial Neural Network
CNN Convolutional Neural Network
DCNN Deep Convolutional Neural Network
DPM Deformable Part Model
FCN Fully Convolutional Network
FPN Feature Pyramid Network
MKL Multiple Kernel Learning
NAS Neural Architecture Search
NMS Non-Maximum Suppression
RCNN Region-based Convolutional Neural Network
RFCN Region-based Fully Convolutional Network
ROI Region of Interest
RPN Region Proposal Network
SBC Single Board Computer
SDP Semi-Definite Programming
SPP Spatial Pyramid Pooling
SSD Single Shot MultiBox Detector
UAV Unmanned Aerial Vehicle
YOLO You Only Look Once
1 Introduction
Electricity is considered one of the critical resources of the modern world, and
transmission lines are an important component of the electric networks that
transfer electricity to customers without interruption. As a result, aside from
generation and distribution, transmission lines are an essential part of power
systems, because the continuous and efficient operation of the transmission line
is crucial to the overall performance of the power system.
Insulators are transmission system components that provide essential insulation
between line conductors and supports, preventing any current leakage from the con-
ductors to the ground. The breakdown of transmission line insulators, which results
in a disruption of the power supply, is a major concern for the power distribution
authority. It can be caused by a variety of factors, including porcelain cracking,
porosity, puncture, mechanical stress, and flash-over. Therefore, it is important to
inspect the insulators on transmission lines on a regular basis.
Unmanned Aerial Vehicles (UAVs) have become increasingly popular in the indus-
trial sector in recent years, delivering aerial photographic services. Due to their low
cost and low power consumption, UAVs equipped with high-resolution cameras have
become effective investigation and monitoring tools.
research areas such as real-time monitoring, rescuing, remote sensing, surveillance,
providing wireless coverage, infrastructure inspection, and so on [22, 23, 24]. One
such commercial application of UAVs in recent years has been the inspection of
power transmission lines.
The insulator is one of the key components of the power transmission and distribu-
tion system; hence, detecting insulators on power transmission lines is essential.
Object detection in general, including insulator detection, has been a focus of
active research for decades [25]. It is considered one of the most fundamental and
difficult tasks in the field of computer vision: it involves locating instances of
predefined types of objects in natural images and, if detected, returning the
precise location and size of each occurrence.
Being the core of image understanding and computer vision, object detection
has created the groundwork for resolving challenging and high-level vision problems
including image segmentation, scene understanding, object tracking, scene caption-
ing, etc. It improves a variety of applications, such as human-computer interaction,
content-based image retrieval, machine vision, intelligent video surveillance, and
unmanned aerial vehicles [26].
When it comes to embedded systems with limited resources, object detection
becomes even more complex. Since a huge number of potential object proposal
regions must be considered, it is regarded as one of the most computationally
demanding tasks in the artificial intelligence arena [27]. Consequently, integrating
object detection algorithms with embedded systems has grown more challenging
due to constrained computation, memory, and energy resources. These
resource-constrained systems must frequently satisfy demanding requirements,
including real-time responsiveness, high performance, and trustworthy inference
accuracy [28].

Figure 1.1: Insulator inspection techniques: (a) human patrol [1], (b) helicopter
patrol [2], (c) explainer-robot inspection [3], (d) Unmanned Aerial Vehicle
(UAV)-based inspection [4]
1.1 Motivation
Traditionally, ground patrols or human experts are in charge of inspecting the insu-
lators [TUC1] of power transmission lines worldwide. Although this is the most
reliable method, it is also risky and time-consuming: the dangers of inclement
weather and exposure to electricity include severe injury and even death. According
to data from the Census of Fatal Occupational Injuries (CFOI) [29], 126 workers
died in 2020 due to exposure to electricity; among them, one-fifth worked in
installation, maintenance, and repair occupations. Moreover, manual inspection is
costly because of the high-tech equipment and transportation requirements, among
other things. Finally, the inspection procedure takes a long time to complete, and
the power needs to be turned off during the inspection period.
Helicopter patrols [30] and explainer robots [3, 31] are also utilized to inspect the
insulators on occasion. These approaches effectively decrease the risk of harm to
humans, although they are more expensive than the traditional way. The UAV-based
inspection technique plays an important role in resolving the limitations of the
prior techniques. It saves money and time on inspections while also keeping
maintenance workers safe. It can also inspect power lines from various angles,
distances, and orientations, which is ideal for insulator inspection.
Inspection Method | Advantage | Disadvantage
Ground Patrol | Trustworthy, concise, and accurate | Life-threatening, time-consuming
Helicopter Patrol | Accurate and time-efficient | Expensive and life-threatening
Explainer Robot | No human life risk, better accuracy | Very expensive, needs an additional line for rolling
UAV | Cost- and time-effective, safe | Accuracy not yet as good as a human's, still in research

Table 1.1: Comparison of different inspection methods
Figure 1.1 depicts several insulator inspection methods and techniques, with the
advantages and downsides of each method given in Table 1.1. Though the UAV-based
inspection system is still in the research and development phase, it is able to attain
human-level reliability and accuracy using modern tools and technology.
The research work in this thesis is part of the project APOLI: Automated Power
Line Inspection [TUC2]. The project aims to develop a fully automated vision-based
monitoring system for power transmission and distribution systems. An Unmanned
Aerial Vehicle (UAV) will be utilized to accomplish this goal and used to assess
damage to insulators, power transmission poles, and lines. In order to develop such
a system, the Adaptive Research Multicopter Platform (AREIOM) has been
proposed. The AREIOM software and hardware architecture [TUC3] is shown in
Figure 1.2, with software components mapped to specific hardware components.

Figure 1.2: The hardware and software architecture of the Adaptive Research
Multicopter Platform (AREIOM) [5]
The Expert System (EXS) is responsible for making real-time decisions and run-
ning the inspection mission, while communication with the Flight Controller (FLC)
is done via the MAVLink Abstraction Layer (MAL). The vision system, on the
other hand, is built upon the Acquisition (ACQ), Inspection Mission Recorder
(IMR), and Navigation Image Processing (NIP) components. The IMR collects
high-resolution inspection data while the NIP detects the inspection object.
Insulator detection is therefore done mainly via the NIP and ACQ: the NIP
receives the live video stream from the ACQ and processes it to detect insulators
in the live video.
The following are the key reasons for developing a deep learning-based inspection
system and real-time insulator detector:
- Insulator detection enables automated, non-contact, and safe inspection of
electrical power distribution systems. In comparison to ground-based visual
inspection by expensive specialists, the automated approach also keeps costs down.
- The accuracy and real-time capability of the UAV-based inspection method are
still being researched, and the system's reliability has to be enhanced.
- The UAVs are equipped with edge devices, which consume less power and allow
for longer flight missions. Deep learning-based object detectors generally require
a significant amount of computing capacity to provide satisfactory results.
However, in order to run on power-constrained edge devices, the detector must
be lightweight.
1.2 Problem Statement
This research aims to develop an onboard object detection algorithm that detects
insulators in real time. Several challenges must be solved to design the algorithm;
they are as follows:
- The insulator detection algorithm will be deployed on a low-powered embedded
device attached to a UAV, and the embedded board will be powered by the UAV.
Therefore, the detector must provide good accuracy while consuming little power;
consuming less power will increase the flight time of the UAV.
- The insulator must be detected onboard from the live video stream; there is no
option for post-processing of images. The video will be captured by an industrial
camera attached to the UAV, which will be utilized both for inspection and
navigation. Post-processing would increase the latency of the whole system.
- The inspection decision will be made based on the outcome of the detection
algorithm. Thus, the detection algorithm should provide near real-time results,
because any latency in the detection outcome delays the inspection decision.
- The detector should be lightweight because of the resource-constrained edge
device.
- The detector should be able to detect small objects accurately because it will be
fed aerial video.
1.3 Structure of the Thesis
The thesis is divided into several chapters, covering everything from the fundamen-
tals of object detection to an extensive analysis of insulator detection. This
introductory chapter has provided background information on the overall project
and explained how this thesis fits into it, covering the overarching project
motivation as well as the thesis-specific motivation. The goal of each remaining
chapter is briefly described as follows:
Chapter 2 - Technical Background gives a general overview of the technical terms
essential for a better understanding of this thesis and the following chapters.
This includes an overview of convolutional neural networks and the approaches to
detecting objects using deep learning. Different deep learning strategies are also
described in this chapter.
Chapter 3 - State of The Art reviews the literature and research fields related to
the thesis. It also provides a comparative analysis of the benefits and drawbacks
of the various approaches covered; this analysis comprises general object detection
algorithms, insulator detection, and different embedded devices. Chapter 4 -
Methodology discusses the concept for achieving the thesis's goal. The proposed
solution, based on the literature review and the state of the art, is explained in
this chapter; it addresses the issues mentioned in the problem statement.
Chapter 5 - Implementation explains the implementation details of detecting
insulators. A brief overview of the hardware and software used to realize the model
is provided. Furthermore, the different computing platforms utilized to train, test,
and evaluate the detection model are explained step by step.
The outputs obtained after the implementation are analyzed and evaluated, and
the results are described in Chapter 6 - Results and Evaluation. The accuracy of
the different trained models is calculated and explained. Additionally, the execution
and inference times are reported.
Finally, Chapter 7 - Conclusion summarizes the overall thesis work and discusses
possible future work based on the results obtained from the implementation.
2 Technical Background
To comprehend deep learning-based object detection, one must first understand
the notion of a convolutional neural network and its working methodology. This
chapter delves into the functional principles of CNNs and the multiple types of
object detectors currently available in the industry. It also explains the different
kinds of deep learning approaches and their trade-offs.
2.1 Convolutional Neural Network
The Convolutional Neural Network (CNN) is a deep learning architecture that
takes an image as input and assigns importance (i.e., weights and biases) to
different parts or aspects of the image; depending on those weights, the algorithm
can later distinguish them from one another. CNNs are similar to classic Artificial
Neural Networks (ANNs) in that they are made up of neurons that learn to optimize
themselves [32]. The most significant advantage of CNNs is that they reduce the
number of parameters compared to ANNs. This accomplishment has led academics
and developers to consider larger models to perform complicated tasks previously
impossible to solve with traditional ANNs [32].
The building blocks of a CNN loosely follow the architectural pattern of the human
brain. A CNN consists of a sequence of layers built around the convolution
operation frequently used in image processing: the convolutional layer, the pooling
layer, and the fully connected layer. Each layer passes its knowledge to the next
through activation functions [33]. The structure and working procedure of the
different layers are discussed in the following sections.
2.1.1 Convolutional Layer
The dot product between a filter/kernel and the image is called convolution opera-
tion. It’s done by placing a kernel onto small areas of an image and sliding it across
the entire image. The learnable kernels are emphasized in the parameters of the
layer.
Why do we need the convolution layer rather than a standard neural network? Assume
a 32 × 32 pixel color image, i.e., with a height and width of 32 pixels; due to
the three RGB channels, the depth is 3. If we connect only one neuron to the
input layer, we already have 32 × 32 × 3 = 3,072 weight connections. If another
neuron is added to the hidden layer, we need another 3,072 connections, and two
neurons are far from sufficient to process images in a real-world application.
For a hidden layer with one neuron per spatial location, the image mentioned
earlier requires (32 × 32 × 3) × (32 × 32) connections, which add up to
3,145,728 [34, 33, 32].
To deal with this inefficiency and complexity, it is a good idea to transform the
image into a form that preserves the useful features while shrinking the
representation. This can be achieved by processing local regions of the image
instead of the whole image. As described in Figure 2.1, the convolution layer does
this using filters of small size (e.g., 3 × 3 or 5 × 5). For a filter of size
3 × 3, we need only (3 × 3 × 3) × (32 × 32) = 27,648 connections, compared to
3,145,728 for the fully connected network [32, 35, 36].
Figure 2.1: Convolutional Layer with a 3 ×3×3 Kernel [6].
In this way, a feature map is created by convolving a filter over the input image
or over the previous layer's output. Stacking the maps of d filters yields a new 3D
structure called a tensor: for an input of size 32 × 32 × 3 and d filters (with
padding), the output tensor is 32 × 32 × d. The layer contains very few parameters:
if a kernel holds 3 × 3 × 3 values, the total number of parameters is
3 × 3 × 3 + 1 (bias), i.e., 28 parameters.
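As a quick sanity check of these counts, the following minimal sketch uses the TensorFlow/Keras API (the library the thesis's toolchain is built on; the exact layer configuration here is illustrative, not the thesis's network) to reproduce the 28-parameter figure:

```python
import tensorflow as tf

# One 3x3 filter over a 32x32 RGB input: 3*3*3 weights + 1 bias = 28 parameters.
layer = tf.keras.layers.Conv2D(filters=1, kernel_size=3, padding="same")
layer.build(input_shape=(None, 32, 32, 3))
print(layer.count_params())  # -> 28

# With d = 32 filters the output tensor is 32x32x32 and the layer
# holds 32 * (3*3*3 + 1) = 896 parameters.
layer_d = tf.keras.layers.Conv2D(filters=32, kernel_size=3, padding="same")
layer_d.build(input_shape=(None, 32, 32, 3))
print(layer_d.count_params())  # -> 896
```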
The output volume and the arrangement of the convolution layer depend on three
hyperparameters: depth, stride, and zero-padding [37, 7].
- Firstly, the depth of the output volume refers to the number of filters used in
the layer. If the raw image is fed into the first convolutional layer, distinct
neurons along the depth dimension may fire in response to differently oriented
edges or color blobs. A depth column is a group of neurons that all scan the
same input region.
- After that, the stride is specified, which governs the sliding operation of the
filter: if the stride is 1, the filter slides one pixel at a time, and so on.
Strides of 1 and 2 are common practice; larger strides reduce the output volume.
- Finally, padding the input volume with zeros around the border is helpful in
some cases. Zero-padding allows us to control the spatial size of the output
volume; it is most typically used to keep the spatial size constant, ensuring that
the input and output width and height are the same. The short sketch after this
list makes the output-size computation concrete.
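The spatial output size implied by these three hyperparameters follows the standard formula (W - F + 2P)/S + 1 for input width W, filter size F, zero-padding P, and stride S. A minimal sketch:

```python
def conv_output_size(w: int, f: int, p: int, s: int) -> int:
    """Spatial output size of a convolution: (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1

# 32x32 input, 3x3 filter, stride 1, zero-padding 1 keeps the size at 32.
print(conv_output_size(32, 3, 1, 1))  # -> 32
# Stride 2 without padding shrinks it: (32 - 3)/2 + 1 = 15 (floored).
print(conv_output_size(32, 3, 0, 2))  # -> 15
```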
2.1.2 Pooling Layer
The output of the convolution layers still contains too many elements. For further
dimensionality reduction, a pooling layer is inserted between successive
convolution layers. The goal of this layer is to reduce the spatial size, thereby
reducing the network's parameters and computation and helping to control
overfitting. It also helps to extract the dominant features.
Two types of pooling can be performed in the pooling layer: max pooling and mean
pooling. The max-pooling layer returns the maximum value of the image region
covered by the filter; similarly, mean pooling returns the average value of the
region covered by the filter.
Figure 2.2: Pooling layer [7].

Figure 2.2 shows an illustration of the pooling layer. In 2.2(a), an input of size
224 × 224 × 64 is pooled with a filter of size 2, producing an output of size
112 × 112 × 64; the depth (64) is preserved. In 2.2(b), a max-pooling
downsampling operation with a stride of 2 is shown, where each operation is
performed over 4 numbers (a 2 × 2 square).
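A minimal NumPy sketch of this 2 × 2, stride-2 max-pooling operation (illustrative only; real frameworks provide optimized pooling layers):

```python
import numpy as np

def max_pool_2x2(x: np.ndarray) -> np.ndarray:
    """2x2 max pooling with stride 2 on an (H, W, C) tensor; depth is preserved."""
    h, w, c = x.shape
    x = x[:h - h % 2, :w - w % 2, :]  # trim odd edges so the grid divides evenly
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

x = np.random.rand(224, 224, 64)
print(max_pool_2x2(x).shape)  # -> (112, 112, 64)
```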
2.1.3 Fully Connected Layer
A fully connected layer is a combination of all activations: all the non-linear high-
level features obtained from the convolutional layers are combined in this layer.
Hence it can be seen as a multi-layer perceptron, i.e., a regular neural network.
A matrix multiplication followed by a bias offset is then performed to learn a
non-linear function in that space.
Figure 2.3: Fully Connected (FC) layer [6].
An illustration of a fully connected layer is shown in Figure 2.3. A fully connected
layer can be viewed as a regular neural network layer, and also as a convolution
layer: the primary distinction is that a convolution layer is connected only to a
small segment of the input, but the two layers have the same linear functionality,
since both compute dot products.
2.2 Object Detector
Detection is the task of locating objects with rectangular bounding boxes and
classifying them into distinct categories. Hence, image classification and object
localization work together to detect objects: image classification predicts the
class of an object in an image, whereas object localization identifies the
presence of objects in an image and, if present, indicates their location with a
bounding box.
Depending on how classification and localization are organized, there are two
general state-of-the-art approaches for object detection: 1) the two-stage
approach and 2) the single-stage approach. The working principle of both
approaches is illustrated in Figure 2.4.
2.2.1 Two-Stage Detector
The two-stage approach was first introduced by R-CNN [38], which combines region
proposals with CNN features. This method uses a strategy called selective search
[39], which generates a set of regions of interest (RoIs) by merging similar
pixels into regions. Each region is then fed into a CNN model, which produces a
high-dimensional feature vector. Finally, this vector is utilized for
classification and bounding-box regression, as in Figure 2.4(a). The different
parts of the two-stage object detector are described below.

Figure 2.4: Meta-architecture of deep learning-based (a) two-stage and (b) single-
stage object detectors [8].
Anchors
The concept of Anchors was first introduced by Faster R-CNN [40], a successor
of R-CNN, to identify the bounding boxes in varying scales and sizes. Anchors
employ the feature map from the CNN’s output instead of image pyramids or filter
pyramids to detect bounding boxes of various scales and dimensions taken from the
input image. An anchor is a fixed bounding box with a center point, height, and
breadth applied to an input image. A collection of anchors with varied scales and
sizes is built for each location using a sliding window on the feature map.
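A small illustrative sketch of this idea, generating anchors of several scales and aspect ratios around one feature-map location (the scale and ratio values below are arbitrary examples, not the thesis's configuration):

```python
import itertools
import numpy as np

def make_anchors(cx: float, cy: float,
                 scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)) -> np.ndarray:
    """Build (cx, cy, w, h) anchors of varied scales and aspect ratios at one
    location; here ratio = h / w and scale**2 equals the box area."""
    boxes = []
    for s, r in itertools.product(scales, ratios):
        w = s / np.sqrt(r)
        h = s * np.sqrt(r)
        boxes.append((cx, cy, w, h))
    return np.array(boxes)

print(make_anchors(16, 16).shape)  # -> (9, 4): 3 scales x 3 ratios per location
```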
Region Proposal Network
The Region Proposal Network (RPN) is a Fully Convolutional Network (FCN) used
to generate region proposals from the feature map. Anchors are fed into the RPN
as input; the RPN then predicts a confidence score and performs box regression.
It first estimates the likelihood of an anchor being an object or background, and
then identifies the offset from the anchor to the actual box [40, 41, 42].
Region of Interest Pooling

Using the region proposals, pooling is performed to predict the object classes and
locations. For each region proposal, fixed-size features are extracted from the
feature map, providing a fixed-size input to the classifier: the proposal's part
of the feature map is cropped and resized to a given size, and the most important
features are then extracted using max pooling and fed into the final stage [43, 44].
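A simplified NumPy sketch of RoI max pooling, resizing one proposal's features to a fixed 7 × 7 grid (the feature-map size and RoI coordinates are made-up examples, not values from the thesis):

```python
import numpy as np

def roi_max_pool(feature_map: np.ndarray, roi, out_size: int = 7) -> np.ndarray:
    """Crop one region proposal (x0, y0, x1, y1) from an (H, W, C) feature map
    and max-pool it to a fixed out_size x out_size x C tensor."""
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1, :]
    h, w, c = region.shape
    pooled = np.zeros((out_size, out_size, c), dtype=region.dtype)
    ys = np.linspace(0, h, out_size + 1, dtype=int)  # bin edges along height
    xs = np.linspace(0, w, out_size + 1, dtype=int)  # bin edges along width
    for i in range(out_size):
        for j in range(out_size):
            cell = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1], :]
            if cell.size:
                pooled[i, j] = cell.max(axis=(0, 1))
    return pooled

fm = np.random.rand(38, 38, 256)                 # hypothetical backbone feature map
print(roi_max_pool(fm, (5, 10, 30, 25)).shape)   # -> (7, 7, 256)
```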
Classification and Regression
The second stage of the two-stage detection approach is the classification of the
extracted features. To accomplish this, the collected features are flattened and
fed into two fully connected layers, which are responsible for classification and
regression [38, 40].
2.2.2 Single-Stage Detector
The single-stage approach takes a minimalist route to object detection. Here, the
detector comprises a single FCN that provides the bounding boxes and object
classifications directly in a single feed-forward pass [8].
The main feature of the single-stage approach is that it treats the detection
problem as a regression problem [45]. Hence, there is no need for region proposals
in a separate stage, as described in Figure 2.4(b), so it becomes faster and more
efficient in terms of both inference and training. However, these one-stage models
have not achieved accuracy comparable to the two-stage approach, due to the
significant foreground-background imbalance in the images. The trade-offs are
clarified in more detail in the state-of-the-art chapter.
In most cases, single-stage detection algorithms work in three steps:
- The image is initially separated into a fixed number of equal-sized grid cells.
- Then, for each grid cell, a certain number of bounding boxes with pre-defined
shapes are predicted around the cell center. A class probability and a detection
confidence are associated with each prediction, i.e., whether it contains an
object or only background.
- Finally, only the bounding boxes with the highest detection confidence and
class probability are chosen, and the object is assigned to the class with the
highest probability.
The anchors in this method encode prior knowledge of the shapes and sizes of the
objects in the image dataset; multiple anchors are defined to detect objects of
various sizes and forms. In most cases, the final prediction differs from the
specified anchor size and location. This is handled by predicting an offset from
the anchor on the feature map, as sketched below.
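The offset decoding can be sketched as follows, using the box parameterization common to SSD-style detectors (variance scaling and coordinate clipping are omitted for brevity; the numbers are illustrative):

```python
import numpy as np

def decode_box(anchor, offsets):
    """Turn a fixed anchor (cx, cy, w, h) plus predicted offsets (tx, ty, tw, th)
    into a final box: centers shift proportionally to the anchor size, and the
    width/height are rescaled exponentially."""
    acx, acy, aw, ah = anchor
    tx, ty, tw, th = offsets
    cx = acx + tx * aw
    cy = acy + ty * ah
    w = aw * np.exp(tw)
    h = ah * np.exp(th)
    return cx, cy, w, h

# A small predicted shift and scale change applied to a 100x100 anchor.
print(decode_box((150, 150, 100, 100), (0.1, -0.05, 0.2, 0.0)))
```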
Since this single-stage method is used in this research, its working procedure is
described in depth in the methodology chapter.
2.3 Learning Approach
The fundamental prerequisite for supervised learning is labeled training data, which
is extremely valuable. Several learning approaches are available in the machine
learning paradigm, differing in how the dataset is utilized. The entire dataset is
used in the traditional deep learning approach; in contrast, active learning
provides an improved architecture in which higher accuracy can be achieved with
less data. The different learning approaches and their working principles, along
with their advantages and disadvantages, are explained in this section.
2.3.1 Traditional Deep Learning
Deep learning aims to construct suitable frameworks by imitating the human brain's
anatomy. The era of modern deep learning began in 1943, when the McCulloch-Pitts
(MCP) model [46] was proposed. The introduction of back-propagation [47] and other
learning techniques later in the last century improved the framework for the
subsequent rapid development of deep learning.
The most common form of deep learning is supervised learning [48]. Assume we want
to build an application that can classify images as containing, say, a car, a
pedestrian, a traffic light, or a sign. We first need to collect a large dataset
of vehicles, pedestrians, and traffic signs, and then label each image with its
category. During training, an image is fed to the machine, which generates an
output in the form of a vector of scores, one for each category.
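Such a vector of scores is typically obtained by applying a softmax to the network's raw outputs. A minimal illustrative sketch (the logit values are made up):

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Map raw network outputs to one normalized score per category."""
    e = np.exp(logits - logits.max())  # subtract max for numerical stability
    return e / e.sum()

# Hypothetical raw scores for the categories: car, pedestrian, traffic light, sign.
print(softmax(np.array([2.0, 0.5, 0.1, -1.0])))  # highest score -> "car"
```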
Figure 2.5: Digital data growth from the beginning of 2010 to the end of 2020 [9].
Though deep learning has excellent learning ability in the context of raw-data
processing and feature extraction, it increases labeling costs, which matters in
today's world where data is growing exponentially [9]. Figure 2.5 shows the
exponential growth of data in the past decade, which has led to a labeling crisis.
2.3.2 Active Learning
Active learning, or query learning, is a subfield of machine learning that aims
for good performance with less training data. The supervised deep learning systems
described previously often require tons of labeled samples for training the model.
However, labeling samples comes at a cost, and with the increase in data, this
labeling task is becoming more expensive, time-consuming, and complex. Active
learning strategies try to get around this labeling barrier.
Depending on the application scenario [49, 50], active learning strategies can
be separated into three categories: 1) pool-based active learning, 2) membership
query synthesis, and 3) stream-based selective sampling. The fundamental design
and operation of pool-based active learning are shown in Figure 2.6. This strategy
maintains a pool U holding all the unlabeled samples. A query strategy is used to
select samples from the unlabeled pool; then an oracle, for instance a human
annotator, labels the queried samples and adds them to the training dataset L.
After training on the new training set, the updated knowledge is used for further
querying, and the same procedure is repeated until the termination requirements
have been met; a minimal code sketch follows Figure 2.6.
Figure 2.6: The design and operational illustration of pool-based active learning [10].
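The loop in Figure 2.6 can be summarized in a short, hedged Python sketch. The model is assumed to expose a scikit-learn-style fit/predict_proba interface, and `oracle` stands in for the human annotator; all names here are hypothetical, not the thesis's implementation:

```python
import numpy as np

def least_confidence_query(model, x_pool: np.ndarray, batch: int) -> np.ndarray:
    """Query strategy: pick the pool samples the model is least sure about."""
    probs = model.predict_proba(x_pool)    # assumed sklearn-style estimator
    uncertainty = 1.0 - probs.max(axis=1)  # low top-class probability = uncertain
    return np.argsort(uncertainty)[-batch:]

def pool_based_active_learning(model, x_l, y_l, x_pool, oracle,
                               rounds: int = 5, batch: int = 100):
    """Minimal pool-based loop: train, query, label via the oracle, repeat."""
    for _ in range(rounds):
        model.fit(x_l, y_l)
        idx = least_confidence_query(model, x_pool, batch)
        x_l = np.concatenate([x_l, x_pool[idx]])
        y_l = np.concatenate([y_l, oracle(x_pool[idx])])  # human annotator labels
        x_pool = np.delete(x_pool, idx, axis=0)           # move samples out of U
    return model.fit(x_l, y_l)
```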
With membership query synthesis, any unlabeled sample in the input space may be
queried, including samples generated by the learner itself. Stream-based selective
sampling draws samples one at a time and independently decides for each whether
to query its label. This sort of strategy is typically used in lightweight
solutions, for example on resource-constrained embedded devices; powerful machines
with adequate computing and storage capacity, on the other hand, frequently adopt
the pool-based technique [10, 50].
2.3.3 Active Deep Learning
Though deep learning and active learning are both subfields of machine learning,
each has its own advantages, disadvantages, and specific use cases [51, 10, 50].
The comparison between the two approaches is given in Table 2.1, and it makes the
motivation for merging deep learning and active learning obvious. By blending deep
learning and active learning, a standard deep active learning (DeepAL)
architecture [10] is obtained, illustrated in Figure 2.7.

Figure 2.7: The architecture of a standard deep active learning framework
combining deep learning and active learning [10].
Aspect | Deep Learning | Active Learning
Model confidence | Due to the softmax layer, the deep learning models are more confident | In terms of model consistency, the query-based technique is unreliable
Data limitation | Deep learning is notoriously data-hungry: the more the data, the more accurate the model will be | To develop and modify the model, active learning frequently requires only a small amount of labeled data
Pipeline processing | In deep learning, feature learning and classifier training are optimized jointly | Active learning techniques are primarily concerned with classifier training

Table 2.1: Comparison between Deep Learning and Active Learning.
In this architecture, the parameters of the deep learning model are initialized or
pre-trained on the labeled training set L0, and the deep learning algorithm
exploits the data from the unlabeled pool U to extract features. The following
steps involve choosing samples according to the appropriate query strategy and
then querying the oracle for labels to establish a new labeled training set L.
After that, the deep learning model is trained on L and U is updated. The same
procedure is iterated until the termination conditions are satisfied.
2.4 Chapter Summary
This chapter gave a general overview of the technical terms essential for a better
understanding of this thesis and the following chapters.
First, the basic concepts of convolutional neural networks and their different
layers were explained. Convolutional neural networks (CNNs) are a specialized
class of artificial neural networks used in image classification and recognition
to analyze pixel data. The main benefit of a CNN is that it lowers the number of
parameters compared to an ANN; due to this breakthrough, researchers and
developers now consider larger models for challenging problems that were
previously difficult to handle with standard ANNs.
A standard CNN is composed of three distinct layers: the convolution layer, the
pooling layer, and the fully connected layer. The convolution layer reduces the
image size by extracting interesting features from the image, while the pooling
layer further reduces the dimensionality. Finally, all the non-linear high-level
features obtained from the convolutional layers are combined in the fully
connected layer.
After that, the working principles of deep learning-based object detectors were
explained. Two types of DL-based detection algorithms currently exist: two-stage
and single-stage. The methodology, framework, and different components of each
strategy were described.
As the name suggests, the two-stage detector detects objects in two separate
steps. In the first step, a set of regions of interest (RoIs) is proposed by
merging similar pixels into regions; these proposed regions are classified, with
bounding-box regression, in the later step. In a single-stage detector, on the
other hand, both the region proposal and the classification task are performed
in a single forward pass: the detection problem is treated as a regression task.
Finally, this chapter covered the various learning strategies, including
traditional deep learning, active learning, and active deep learning. Depending
on how the dataset is used, the machine learning paradigm offers a variety of
learning strategies: the conventional deep learning approach makes use of the
complete dataset, while active learning offers an architecture that allows
increased accuracy to be attained with less data. The underlying principles,
benefits, and drawbacks of these strategies were explained.
3 State of The Art
3.1 Object Detection
Object detection is a crucial computer vision task that involves detecting
instances of specific visual features in image data. It aims to design
mathematical methods and algorithms that answer one of the essential questions
posed by image processing applications: what objects are where? [52] According to
mainstream consensus, object detection has progressed through two historical
phases over the last two decades: traditional object detection and deep
learning-based object detection.
Figure 3.1 illustrates the different methods and algorithms developed over time
for object detection. The gray area represents the era of traditional object
detection, while the pinkish area represents the generation of deep learning-based
object detection. The evolution of object detection over time is described in the
following sections.
Figure 3.1: Timeline of object detection research over time. The gray area represents
the era of traditional object detection (before 2014) and the pinkish area
represents the generation of deep-learning-based object detection (after
2014) [11].
3.1.1 Traditional Detectors
Before the age of deep learning, most of the initial object detection algorithms
were built on handcrafted features. Lacking suitable image representations,
researchers had no choice but to construct complicated feature representations
and numerous speed-up techniques to make the most of the limited computational
resources available at the time.
Figure 3.2: Fundamental working method of a traditional object detector.
Traditional object detectors include the Deformable Part-based Model (DPM) [53],
Multiple Kernel Learning (MKL) [54], Selective Search [39], Boosted HOG-LBP [55],
etc. As described in Figure 3.2, the three main components of traditional object
detection methods are the region selector, the feature extractor, and the
classifier [11]. The region selector primarily utilizes sliding windows of various
sizes and ratios, moved across the image from left to right and top to bottom by
a predetermined step size, trimming the original image into uniform image blocks.
Afterward, the feature extractor extracts features from each image block utilizing
HOG [56], SIFT [57], Haar [58], or other feature extraction algorithms. Finally,
the blocks are classified using an SVM [59] or AdaBoost [58] classifier to
determine the object category.
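This three-part pipeline can be sketched compactly. The snippet below pairs a sliding-window region selector with HOG features and a linear SVM; the window size, step, and HOG parameters are illustrative, and `clf` is assumed to be a LinearSVC already fitted on HOG features of labeled positive/negative patches:

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def sliding_windows(image: np.ndarray, win: int = 64, step: int = 16):
    """Region selector: slide a fixed-size window across a grayscale image."""
    h, w = image.shape
    for y in range(0, h - win + 1, step):
        for x in range(0, w - win + 1, step):
            yield (x, y), image[y:y + win, x:x + win]

def detect(image: np.ndarray, clf: LinearSVC):
    """Feature extractor (HOG) + classifier (SVM) applied to every window."""
    hits = []
    for (x, y), patch in sliding_windows(image):
        feat = hog(patch, orientations=9, pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2))
        if clf.predict([feat])[0] == 1:  # 1 = window classified as the object
            hits.append((x, y))
    return hits
```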
In 2004, P. Viola and M. Jones introduced the Viola-Jones detector [58, 60], which
could detect human faces in real time. Their detector achieved far higher
detection accuracy than the algorithms available at the time. They utilized the
sliding-window technique to scan all potential areas and check whether any window
included a human face. The detector used Haar wavelets to extract features from
the image and an AdaBoost classifier to classify them.
Later, in 2005, N. Dalal and B. Triggs [56] presented the Histogram of Oriented
Gradients (HOG) feature descriptor. It was a significant advance over
scale-invariant feature transform [57] and shape contexts [61] at that time. It
is designed to be computed on a dense grid of equally spaced cells and employs
overlapping local contrast normalization to increase accuracy while balancing
feature invariance and non-linearity. To detect objects of various sizes, it
rescales the image multiple times while keeping the size of the sliding window
the same.
P. Felzenszwalb et al. [53] initially proposed the Deformable Part-based Model
(DPM) in 2008 as an extension of the HOG detector. The DPM follows the
divide-and-conquer detection principle, in which training can be seen as learning
an efficient way to decompose an object, and inference can be characterized as an
ensemble of detections on distinct object components. A standard DPM detector is
built from a root filter and numerous part filters. Rather than manually
specifying the part filters' sizes and positions, DPM established a weakly
supervised learning strategy in which all such configurations can be learned as
latent variables.
Following that, R. Girshick, a co-author, made several enhancements [62, 63] to
this detector, including bounding-box regression, hard negative mining, and
context priming. He also introduced a cascade architecture that achieved much
faster inference without sacrificing accuracy.
Figure 3.3: Performance of traditional object detection algorithms on PASCAL VOC
dataset from 2007 to 2012 (based on [12]).
Later, many researchers proposed Multiple Kernel Learning (MKL)-based detectors
[54, 64, 65]. MKL was initially formulated as a Semi-Definite Programming (SDP)
[66] problem. This method uses a collection of kernels, or a single kernel with
varying parameters, and an optimization technique determines the most suitable
kernel or kernel combination. Various regularizers, such as entropy-based
regularizers [67], mixed norms [68], the L1 norm [69], and the Lp norm [70], have
been suggested to learn an effective kernel combination. The L1 norm is most
frequently used because it yields sparse solutions and effectively eliminates
superfluous and noisy kernels [71]. This technique attempted to solve two critical
issues of object detection: classifier accuracy enhancement and learning
efficiency improvement.
Though the traditional object detection algorithms have ingrained weaknesses,
they were relatively mature. Running on a 700 MHz Pentium III CPU, the
Viola-Jones detector detected human faces in real time without any constraints
[52]. Improved DPM and MKL achieved around 40% mean Average Precision (mAP) on
the PASCAL VOC dataset in 2011 (Figure 3.3). However, the region selection
technique based on sliding windows has a high computational complexity and high
window redundancy [72]. Besides, handcrafting robust features is challenging due
to the morphological variance of appearance, the variety of lighting changes, and
cluttered backgrounds.
3.1.2 Feature Extractors
Feature extractors are deep convolutional neural networks mainly composed of
convolutional layers. They are used as the backbone or base network of deep
learning models. The primary objective of the feature extractor is to extract
features from the raw input image while reducing its dimensionality. Depending on
the number of network layers and parameters, feature extractors can be split into
complex (Table 3.1) and lightweight (Table 3.2) networks. Complex networks have a
deeper network design, as their name suggests; lightweight networks have fewer
layers.
In 1998, Yann LeCun proposed LeNet-5 [73], the first convolutional neural
network. In the following decade, however, it did not advance significantly, as
computing power imposed a limit on its expansion. In 2012, Alex Krizhevsky et al.
proposed AlexNet by increasing the breadth and depth of LeNet-5 [74]. It has five
convolutional layers, three max-pooling layers, and three fully connected layers,
totaling 60 million parameters. To enlarge the dataset and decrease overfitting,
AlexNet also incorporated data augmentation techniques such as horizontal
flipping, clipping, translational transformation, and color illumination changes.
The vanishing gradient issue in deeper networks was mitigated by using the ReLU
activation function instead of the conventional sigmoid and tanh activation
functions. AlexNet took first place in the ILSVRC-2012 competition.
Feature Extractor | Parameters (M) | FLOPs (M) | Top-1 (%) | Top-5 (%)
AlexNet [74] | 60 | 720 | 57.2 | 80.3
ZFNet [75] | 58 | - | 60.0 | 85.2
VGG16 [76] | 138 | 15300 | 71.5 | 89.8
GoogLeNet [77] | 6.8 | 1550 | 69.8 | 93.3
InceptionV2 [78] | 12 | 1940 | 79.9 | 95.2
InceptionV3 [79] | 23.6 | 5000 | 82.7 | 96.5
ResNet50 [80] | 23.4 | 3832 | 79.3 | 96.4
ResNet101 [80] | 42 | - | 80.1 | 96.4
CSPDarkNet53 [81] | 28 | 9000 | 80.05 | 95.09

Table 3.1: Summary of the widely used complex feature extractor networks. The
accuracy is measured on the ImageNet dataset.
By improving AlexNet and introducing unpooling and deconvolution layers [82], Matthew D. Zeiler et al. devised ZFNet [75]. This network can visualize its feature maps. Additionally, the kernel size of AlexNet's first layer is reduced to 7x7 and the stride to 2 to preserve low-level features. The smaller kernel decreases the downsampling rate, which helps localize large objects and identify small ones.

The localization and classification of objects benefit from improved feature representation. Increasing the network's layers and the number of neurons in each layer is the most direct strategy to boost performance; however, the number of parameters grows with network size. To address this issue, GoogLeNet [77] was introduced, utilizing the Network-in-Network (1x1 kernel) idea suggested by Min Lin et al. The dimensionality reduction of the 1 × 1 kernel can increase the depth and breadth of the network while reducing computational complexity.
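As an illustration of this bottleneck idea, consider the following minimal PyTorch sketch; the channel sizes here are illustrative assumptions, not GoogLeNet's actual configuration:

    import torch
    import torch.nn as nn

    # A 1x1 "bottleneck" placed before a 3x3 convolution, as inside Inception modules.
    reduce_then_3x3 = nn.Sequential(
        nn.Conv2d(256, 64, kernel_size=1),             # compress 256 -> 64 channels
        nn.ReLU(inplace=True),
        nn.Conv2d(64, 128, kernel_size=3, padding=1),  # expensive conv on fewer channels
        nn.ReLU(inplace=True),
    )

    x = torch.randn(1, 256, 28, 28)
    print(reduce_then_3x3(x).shape)  # torch.Size([1, 128, 28, 28])

A direct 3 × 3 convolution from 256 to 128 channels would use 256 · 128 · 9 = 294,912 weights, whereas the bottleneck version above uses roughly 90,000, about a threefold reduction.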
Afterward, C. Szegedy et al. improved GoogLeNet (InceptionV1) and proposed InceptionV2 [78] and InceptionV3 [79]. To accelerate training, they introduced batch normalization in InceptionV2, while InceptionV3 factorizes large kernels into several smaller ones.
Later, in 2015, K. Simonyan and A. Zisserman developed VGGNet [76] for better feature representation by expanding the depth of AlexNet to 16-19 layers. Compared with ZFNet, the kernel size in each layer is reduced to 3 × 3 with a stride of 1. The small kernel and stride are more efficient for obtaining the location information of objects in an image. By stacking small kernels instead of using a single large one, the depth of the network can be increased while the receptive field is kept constant, and with fewer parameters per layer, the network's feature extraction capability improves.
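As a quick worked example of this trade-off (an illustration, not a figure from the thesis): with C input and C output channels, a single 5 × 5 convolution uses 5 · 5 · C · C = 25C² weights, whereas two stacked 3 × 3 convolutions cover the same 5 × 5 receptive field with only 2 · (3 · 3 · C · C) = 18C² weights, while additionally inserting an extra non-linearity between the two layers.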
Network designers kept increasing the number of layers for better performance. Nevertheless, as a network's depth keeps growing, accuracy can hit saturation and then fall off quickly during training. This occurrence is referred to as degradation. Kaiming He et al. suggested a residual learning module [80] to address this issue. It allows the network to be expanded to hundreds of layers, further improving its capacity to extract features. ResNet50 and ResNet101 are frequently utilized as backbones for object detection.
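The core of the residual module can be sketched in a few lines of PyTorch; this is a simplified identity-shortcut block, whereas the actual ResNet50/101 use a three-layer bottleneck variant:

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """Simplified residual block: output = F(x) + x (identity shortcut)."""
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            out = self.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return self.relu(out + x)  # the shortcut lets gradients bypass F(x)

Because the shortcut carries the signal (and gradient) unchanged, stacking many such blocks avoids the degradation problem described above.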
In 2018, Z. Li et al. proposed DetNet [83], a backbone network dedicated to object detection rather than classification. They observe that the high down-sampling rate of classification backbones yields a large receptive field that is advantageous for image classification but not for precisely localizing large objects and identifying small objects, so DetNet maintains a higher spatial resolution in its deeper stages.
Feature Extractor     Parameters (M)   FLOPs (M)   Top-1 (%)   Top-5 (%)
SqueezeNet [84]       1.25             1700        57.2        80.3
Xception [85]         22.8             -           79.0        94.5
MobileNetV1 [86]      4.24             575         70.7        89.5
MobileNetV2 [19]      3.4              300         72.0        91.0
MobileNetV3 [87]      5.4              219         75.2        -
ShuffleNetV1 [88]     3.4              292         71.5        -
MnasNet [89]          4.2              317         70.6        89.5
DarkNet19 [90]        -                -           72.9        91.2
Table 3.2: Summary of the widely used lightweight feature extractor networks. The accuracy is measured on the ImageNet dataset.
The networks described above mostly improved their performance by adding layers, which also increases the number of parameters and hence the storage and computation time required. This can be a serious problem for resource- and time-constrained embedded devices. Therefore, researchers developed lightweight networks to minimize network parameters while maintaining performance.
In 2017, F. N. Iandola et al. introduced the lightweight backbone network SqueezeNet [84]. They presented the Fire module, comprising a squeeze layer with only 1 × 1 convolutional kernels followed by an expand layer. The expand layer is made up of a mix of 1 × 1 and 3 × 3 convolutional kernels, and the feature maps generated by the 1 × 1 and 3 × 3 kernels are concatenated to form the output of the expand layer. Finally, the parameters are compressed to less than 0.5 MB using deep compression [91].
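A minimal PyTorch sketch of the Fire module follows; the channel arguments are illustrative and would be chosen per layer in the real network:

    import torch
    import torch.nn as nn

    class Fire(nn.Module):
        """SqueezeNet Fire module: squeeze with 1x1, then expand with 1x1 and 3x3."""
        def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
            super().__init__()
            self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
            self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
            self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            x = self.relu(self.squeeze(x))
            # The outputs of both expand branches are concatenated along the channels.
            return torch.cat([self.relu(self.expand1x1(x)),
                              self.relu(self.expand3x3(x))], dim=1)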
By improving the Inception architecture, F. Chollet proposed Xception [85], introducing the depthwise separable convolution. Later network architectures such as MobileNet [86, 19, 87] and ShuffleNet [88, 92] adopted this method. Via the depthwise separable convolution, the conventional convolution in MobileNet is decomposed into a depthwise convolution followed by a 1 × 1 pointwise convolution. In the depthwise convolution, each feature channel is processed by exactly one convolution kernel, so the number of kernels equals the number of input channels; the 1 × 1 pointwise convolution then mixes the channels.
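The parameter saving can be checked with a small PyTorch comparison; the channel sizes below are illustrative:

    import torch.nn as nn

    in_ch, out_ch = 128, 256

    # Standard convolution: 3*3*128*256 = 294,912 weights (ignoring bias).
    standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)

    # Depthwise separable: one 3x3 kernel per input channel (groups=in_ch),
    # followed by a 1x1 pointwise convolution that mixes the channels.
    depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1,
                          groups=in_ch, bias=False)                 # 3*3*128 = 1,152
    pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)  # 128*256 = 32,768

    n_std = sum(p.numel() for p in standard.parameters())
    n_sep = sum(p.numel() for p in depthwise.parameters()) \
          + sum(p.numel() for p in pointwise.parameters())
    print(n_std, n_sep)  # 294912 vs 33920, roughly an 8.7x reduction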
Inheriting the shortcut connection from ResNet, the MobileNet architecture was further improved: the convolution block in MobileNetV2 is assembled by fusing the shortcut connection with the depthwise separable convolution. MobileNetV3 [87] was developed further by combining the NetAdapt algorithm with Network Architecture Search (NAS) [93].
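A sketch of the resulting MobileNetV2 block (the inverted residual) is given below. It is simplified: the real block uses batch normalization, and the shortcut is applied only when the stride is 1 and the input and output channel counts match, as they do here:

    import torch.nn as nn

    class InvertedResidual(nn.Module):
        """Expand with 1x1, filter with depthwise 3x3, project back with 1x1."""
        def __init__(self, ch, expand_ratio=6):
            super().__init__()
            hidden = ch * expand_ratio
            self.block = nn.Sequential(
                nn.Conv2d(ch, hidden, 1, bias=False),               # expand
                nn.ReLU6(inplace=True),
                nn.Conv2d(hidden, hidden, 3, padding=1,
                          groups=hidden, bias=False),               # depthwise
                nn.ReLU6(inplace=True),
                nn.Conv2d(hidden, ch, 1, bias=False),               # project (linear)
            )

        def forward(self, x):
            return x + self.block(x)  # ResNet-style shortcut connection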
Adopting the same method from Xception, X. Zhang et al. proposed ShuffleNetV1 [88]. There, the authors suggested replacing the pointwise convolution with a group convolution: convolution is performed within each group, which drastically reduces the computation. Information exchange between the groups is enforced using a channel shuffle operation. In contrast to ShuffleNetV1, ShuffleNetV2 [92] uses two 1x1 convolutions rather than two group convolutions, and the two branches are merged once the convolution operations are complete.
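The channel shuffle operation itself is a cheap reshape-transpose-reshape; the following is a generic sketch, not ShuffleNet's reference code:

    import torch

    def channel_shuffle(x, groups):
        """Interleave channels across groups so information can flow between them."""
        n, c, h, w = x.shape
        x = x.view(n, groups, c // groups, h, w)  # split channels into groups
        x = x.transpose(1, 2).contiguous()        # swap group and channel axes
        return x.view(n, c, h, w)                 # flatten back to (n, c, h, w)

    x = torch.randn(1, 8, 4, 4)
    y = channel_shuffle(x, groups=2)  # channels 0,4,1,5,2,6,3,7 in the output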
3.1.3 Deep Learning-based Detectors
Object detection research with traditional image processing reached a plateau around 2010, as hand-crafted feature extraction had become saturated. The dominant object detection algorithms are now built on Deep Convolutional Neural Networks (DCNN) because they can learn robust, high-level visual features from an image [94, 74].

Object localization and classification are the two subtasks of object detection. Based on how these tasks are organized, object detectors can be grouped into two categories: 1) two-stage architectures and 2) single-stage architectures (as in Figure 2.2). The main trade-off between these two architectures is speed versus accuracy: two-stage architectures have an accuracy advantage but offer slow detection speed, whereas single-stage architectures are faster but sacrifice some accuracy.
The first two-stage object detection architecture, RCNN, was proposed by R. Girshick et al. [38] in 2014, inspired by the great success of AlexNet [74]. In the first stage, RCNN generates a set of region proposals using the selective search algorithm. The proposals are then rescaled and fed into a DCNN model to extract feature representations. In the next step, the extracted features are classified using linear SVM classifiers.

On the PASCAL VOC 2007 dataset, RCNN provided a considerable performance enhancement, with mean Average Precision (mAP) increasing from 33.7% (DPM in Figure 3.3) to 58.5%. Though it achieved a significant milestone in the era of object detection, it came with shortcomings: detection was too slow because of the large set of overlapping proposal computations, it required a lot of storage, and it did not allow end-to-end training.
Architecture       Backbone    Dataset   GPU       Speed (fps)   mAP (%)
RCNN [38]          AlexNet     07        Titan     0.1           58.5
RCNN               ZFNet       07        Titan     0.07          59.2
SPPNet [95]        ZFNet       07        Titan     2.6           60.9
Fast RCNN [96]     VGG16       07+12     Titan X   0.5           70.0
Faster RCNN [40]   VGG16       07+12     Titan X   7.0           73.2
Faster RCNN        ResNet101   07+12     K40       2.4           76.4
Faster RCNN        ResNet101   07+12     Titan X   5.0           76.4
Faster RCNN        ZFNet       07+12     Titan X   18.0          62.1
RFCN [97]          ResNet101   07+12     Titan X   9.0           80.5
RFCN               ResNet101   07+12     K40       5.8           79.5
Table 3.3: Performance analysis of two-stage detectors on the PASCAL VOC dataset (07 = VOC2007, 07+12 = VOC2007+2012 training data) with an image size of 1000x600.
To overcome the problems with RCNN, SPPNet was proposed by K. He et al. [95] later in the same year. They introduced a Spatial Pyramid Pooling (SPP) layer, which allows the DCNN to output a fixed-length feature vector regardless of the input image size. SPPNet's detection speed is 10 to 100 times faster than RCNN, since the image features are computed once and shared among all proposals. However, because it still relies on selective search for region extraction, it remains slow, and end-to-end training is not possible in SPPNet.
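The SPP idea can be sketched with adaptive pooling in PyTorch; this is a simplified version, and the pyramid levels below are illustrative assumptions rather than SPPNet's exact configuration:

    import torch
    import torch.nn as nn

    def spatial_pyramid_pool(feat, levels=(1, 2, 4)):
        """Pool a feature map of any spatial size into a fixed-length vector."""
        n, c = feat.shape[:2]
        pooled = []
        for k in levels:
            # AdaptiveMaxPool2d outputs a k x k grid regardless of input size.
            pooled.append(nn.AdaptiveMaxPool2d(k)(feat).view(n, -1))
        return torch.cat(pooled, dim=1)  # length = c * (1 + 4 + 16), fixed

    # Feature maps of different spatial sizes yield the same output length:
    print(spatial_pyramid_pool(torch.randn(1, 256, 13, 13)).shape)  # [1, 5376]
    print(spatial_pyramid_pool(torch.randn(1, 256, 20, 27)).shape)  # [1, 5376]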
Improving on RCNN and SPPNet, R. Girshick presented Fast RCNN in 2015 [96], which enables training a detector and a bounding box regressor simultaneously within the same network architecture. It also introduced the ROI pooling layer for the region proposal features. Fast RCNN increased the mAP from 58.5% to 70.0% (Table 3.3) on the PASCAL VOC 2007 dataset. However, detection speed is still limited by the external proposal step, which also prevents end-to-end training.
Later in the same year, Faster RCNN [40] was proposed by S. Ren et al.; it provides end-to-end training by sharing the feature maps of the backbone network. Faster RCNN introduced the Region Proposal Network (RPN) to substitute the slow selective search technique, which also increased the detection speed (18 fps with a ZFNet backbone, Table 3.3). It achieved an mAP of 73.2% on the PASCAL VOC 2007 dataset and 70.4% on PASCAL VOC 2012. Although Faster RCNN provided a significant enhancement for object detection, it still performs poorly in detecting multi-scale and small objects.
Afterward, Region-based Fully Convolutional Networks (RFCN) were developed by J. Dai et al. in 2016 [97], followed in 2017 by Mask RCNN [98] by K. He et al. and Feature Pyramid Networks (FPN) [99] by T.-Y. Lin et al. RFCN speeds up detection by computing fully convolutional features over the entire image, lowering the effort required for each ROI. It achieved an mAP of 82.0% on PASCAL VOC 2007 and 83.6% on the PASCAL VOC 2012 dataset. Mask RCNN improved detection accuracy by substituting the ROI pooling layer with the ROIAlign pooling layer. FPN introduced a multi-level feature fusion technique for detecting multi-scale and small objects.
Architecture    Backbone    Image Size   GPU       Speed (fps)   mAP (%)
YOLOv1 [100]    VGG16       480x480      Titan X   21            66.4
YOLOv1          DarkNet19   480x480      Titan X   45            63.4
YOLOv2 [90]     DarkNet19   480x480      Titan X   59            77.8
YOLOv2          DarkNet19   554x554      Titan X   40            78.6
SSD300 [101]    VGG16       300x300      Titan X   46            74.3
SSD512 [101]    VGG16       512x512      Titan X   19            76.8
DSSD321 [102]   ResNet101   321x321      Titan X   9.5           78.6
DSSD513 [102]   ResNet101   513x513      Titan X   5.5           81.5
Table 3.4: Performance analysis of single-stage detectors on the PASCAL VOC dataset.
The OverFeat algorithm, presented by P. Sermanet et al. [103] in 2013, was the first single-stage object detector. The idea was to combine the localization and classification tasks via feature sharing. Their architecture extracted patches from the last pooling layer and aggregated the patches according to their scores for classification. OverFeat had a clear speed advantage over RCNN at the time but lacked precision.
Even though Faster RCNN introduced the RPN to minimize the number of region proposals from roughly 2,000 to around 300 [40], the proposals still overlap. J. Redmon et al. presented YOLO [100] (You Only Look Once) in 2015 as a solution to this bottleneck. In YOLO, the image is divided into a grid, and each grid cell is responsible for detecting an object whose center lies inside that cell. With this architecture, YOLO achieved a significant enhancement in terms of speed: on the PASCAL VOC07 dataset, it detects at 45 fps with a 63.4% mAP.
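The grid responsibility rule is simple to state in code. The following schematic assumes normalized box-center coordinates and YOLOv1's 7 x 7 grid:

    # Which grid cell is responsible for an object? The one containing its center.
    def responsible_cell(cx, cy, S=7):
        """cx, cy: object-center coordinates normalized to [0, 1)."""
        col = int(cx * S)  # grid column index
        row = int(cy * S)  # grid row index
        return row, col

    # An object centered at (0.62, 0.31) falls into cell (row 2, column 4):
    print(responsible_cell(0.62, 0.31))  # (2, 4)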
YOLO accomplished real-time detection speed. However, it has poor recall and several localization problems. YOLOv2/YOLO9000 [90] introduced enhancements over YOLOv1 to attain improved accuracy. To accelerate network convergence, YOLOv2 introduced batch normalization [78]. Additionally, it utilized the notion of anchors from Faster RCNN together with the K-means algorithm to automatically determine the prior bounding boxes, which enhances detection performance.
YOLOv3 [104] inherits the ideas of YOLOv1 and YOLOv2 and improves on their shortcomings to strike a balance between speed and accuracy. To achieve this goal, the authors combine the residual block [80], the feature pyramid network (FPN) [105], and a binary cross-entropy loss. With the Darknet53 backbone, YOLOv3 reached a detection speed of 45.5 fps with an mAP-50 of 51.5% on the MS COCO test-dev dataset.
Architecture            Backbone          Image Size   GPU       Speed (fps)   mAP-50 (%)
YOLOv3 [104]            Darknet53         320x320      M40       45.5          51.5
YOLOv3                  Darknet53         608x608      M40       19.6          57.9
RetinaNet [106]         ResNet50          500x500      M40       13.6          32.5
RetinaNet               ResNet101         500x500      M40       11.1          34.4
YOLOv4 [81]             CSPDarknet53      416x416      M40       38            62.8
YOLOv4                  CSPDarknet53      512x512      M40       31            64.9
SSD [101]               HarDNet68         512x512      1080 Ti   38            51.0
EfficientDet-D0 [107]   EfficientNet-B0   512x512      V100      62.5          52.2
M2Det [108]             VGG16             320x320      M40       33.4          52.4
CenterNet [109]         ResNet101         512x512      1080 Ti   45            53.0
Table 3.5: Performance analysis of single-stage detectors on the MS COCO test-dev dataset.
A further improvement to the YOLO family, YOLOv4 [81], was unveiled by Alexey Bochkovskiy et al. in April 2020. Feature aggregation was improved by Cross mini-Batch Normalization and Cross Stage Partial connections, along with the Mish activation function and mosaic data augmentation, which significantly improved accuracy. YOLOv4 achieves 65.7% accuracy on the MS COCO dataset at a detection speed of 65 fps. In addition, an extension, YOLOv4-tiny, delivers slightly lower detection accuracy but a much higher inference rate (about 443 fps on an RTX 2080 Ti). The authors of YOLOv4 also improved the network architecture using a scaling approach and published it as Scaled-YOLOv4 [110]. It performs gradient truncation across computing blocks to reduce the size of the feature maps. With this enhancement, they increased the accuracy from 65.7% to 73.3% on the same dataset. To enhance the detection accuracy of the YOLO model without changing the detection speed, X. Long et al. proposed PP-YOLO [111]; they used a larger batch size (192) along with DropBlock applied to the FPN and Matrix NMS. X. Huang et al. further enhanced the model in PP-YOLOv2 [112] in 2021.
The YOLO family offers impressive speed and accuracy benefits. Even though YOLO has fast detection capability, it has a limited ability to generalize to objects with sizeable dimensional changes and detects tiny objects poorly. In 2016, Wei Liu et al. presented the Single Shot MultiBox Detector (SSD) [101], combining the benefits of Faster RCNN and YOLOv1. It adopts VGG16 as the backbone network, replacing FC6/FC7 with Conv6/Conv7 and then adding four convolutional layers at the end.
The architecture extracts feature maps at six scales, corresponding to different semantic levels, and performs object classification and bounding-box regression on each of them. These multi-scale feature maps are integrated with an anchor mechanism to recognize multi-scale objects. SSD300 runs at 59 fps with an mAP of 74.3% on the PASCAL VOC07 dataset, which is faster than YOLOv1, while SSD512 runs at 22 fps with an mAP of 76.8% on the same dataset.
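The default (anchor) boxes across these scales follow a simple linear rule from the SSD paper, s_k = s_min + (s_max - s_min)(k - 1)/(m - 1); a short Python sketch of the resulting scales and box shapes:

    import math

    def ssd_scales(m=6, s_min=0.2, s_max=0.9):
        """Scale of default boxes for each of the m feature maps (SSD paper, Eq. 4)."""
        return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

    def box_shape(s, aspect_ratio):
        """Width and height of a default box, relative to the image size."""
        return s * math.sqrt(aspect_ratio), s / math.sqrt(aspect_ratio)

    for k, s in enumerate(ssd_scales(), start=1):
        shapes = [tuple(round(v, 2) for v in box_shape(s, a)) for a in (1.0, 2.0, 0.5)]
        print(k, round(s, 2), shapes)
    # scales: 0.2, 0.34, 0.48, 0.62, 0.76, 0.9 across the six feature maps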
To strengthen SSD's representation of low-level feature maps, C.-Y. Fu et al. proposed the Deconvolutional SSD (DSSD) [102], using ResNet101 as the backbone network. Including deconvolution modules and skip connections improves the representation of low-level feature maps and allows for feature fusion to some extent. Likewise, Z. Li and F. Zhou enhanced the accuracy by fusing low-level features with high-level features, published as FSSD [113] in 2017.
T.-Y. Lin et al. proposed RetinaNet [106] in 2017, claiming that the leading cause of the low accuracy of single-stage detectors is the extreme foreground-background class imbalance encountered during dense training. RetinaNet introduces the focal loss, a loss function that reshapes the conventional cross-entropy loss so that the detector focuses more on hard, misclassified samples during training.
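For the binary case the focal loss is FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t); a compact PyTorch sketch, using the paper's default gamma = 2 and alpha = 0.25:

    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
        """Binary focal loss; targets are 0/1, logits are raw scores."""
        p = torch.sigmoid(logits)
        ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p_t = p * targets + (1 - p) * (1 - targets)        # prob. of the true class
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        return (alpha_t * (1 - p_t) ** gamma * ce).mean()  # down-weights easy examples

    logits = torch.tensor([3.0, -2.0, 0.1])
    targets = torch.tensor([1.0, 0.0, 1.0])
    print(focal_loss(logits, targets))

The (1 - p_t)^gamma factor shrinks the contribution of well-classified (easy, mostly background) samples, so the abundant easy negatives no longer dominate the gradient.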
As introduced in Faster RCNN, most object detectors are anchor-based, meaning the detector scans and classifies a nearly exhaustive set of potential object locations, which is computationally expensive. To address this issue, X. Zhou et al. introduced an anchor-free algorithm, CenterNet [109], in which an object is modeled as a single point. This removes the need for complex post-processing such as NMS. CenterNet established a notable speed-accuracy trade-off, achieving a frame rate of 45 fps with an accuracy of 53.0% on the COCO dataset.
Two Feature Pyramid Network (FPN)-based architectures, M2Det [108] and EfficientDet [107], were proposed in 2019 and 2020, respectively. In M2Det, the authors presented a multi-level feature pyramid network (MLFPN), while the authors of EfficientDet introduced a weighted bi-directional feature pyramid network (BiFPN). Both of these FPN variants enable faster multi-level feature fusion. On the MS COCO benchmark, EfficientDet achieved an average precision of 52.2% and M2Det 52.4%.
Two-stage vs Single-stage
As discussed earlier, the main trade-off between single-stage and two-stage detectors is speed versus accuracy. Two-stage detectors first propose a set of regions that may contain objects; a classifier then determines the object category and refines its location. In single-stage detectors, classification and localization are performed in a single forward pass.
Several state-of-the-art detectors are currently available on both two-stage and single-stage platforms, with performance sufficient for consumer products. Nevertheless, it can be challenging for practitioners to choose the model that best fits their application. Since running time, power consumption, and memory utilization are crucial for real computer vision applications, conventional accuracy metrics like mean average precision (mAP) do not convey the full picture [13]. For example, in our case, the model will run on a single-board computer, which requires a small memory footprint and real-time detection.
Figure 3.4: Accuracy of Faster RCNN, RFCN, and SSD using various backbone
networks and object sizes for 300x300 image resolution [13].
Although the approaches that showed excellent performance in contests like the MS COCO and PASCAL VOC challenges are accuracy-optimized, they frequently depend on ensemble learning that is too sluggish for everyday use. Moreover, evaluating the trade-off among them is challenging due to varying backbone networks such as VGG and residual networks, various image sizes, and mixed hardware and software platforms.
Addressing this issue, a group of researchers [13] from Google presented a state-of-the-art comparison of different network architectures under the same conditions. This work compared the two-stage-based Faster RCNN and RFCN with the single-stage-based SSD, using varying backbone networks such as VGG16, ResNet101, InceptionNet, and MobileNet. Moreover, they used the same hardware configuration for the comparison. Their results are presented in Figures 3.4 and 3.5.

Figure 3.4 shows the accuracy chart for small, medium, and large object sizes. Both two-stage and single-stage approaches perform comparably well on large objects. However, the single-stage detector is outperformed on small objects. On the other hand, the single-stage architecture requires less inference time than the two-stage architectures (Figure 3.5).
Figure 3.5: Time required for each model with different backbone networks for an
image size of 300x300 [13].
In another work, M. C. García et al. [8] compare the two-stage-based Faster RCNN with the single-stage-based YOLOv3, RetinaNet, and FCOS for autonomous driving applications. The authors used the Waymo Open Dataset [114] and different backbone networks on the same hardware (Intel i7-8700 and RTX 2080 Ti). With this configuration and a Res2Net101 backbone, Faster RCNN achieved the highest accuracy: 40.8% on high-resolution images (1920x1280) and 32.4% on lower-resolution images (1920x886). On the other hand, RetinaNet with MobileNetV2 achieved a frame rate of 15.8 fps on the high-resolution images and 38.3 fps on the low-resolution ones.
3.2 Insulator Detection
Several significant studies on visual inspection and insulator detection have been conducted in the past few years. Before the era of deep learning, conventional image processing techniques were used for this task. Afterward, deep learning-based detection algorithms improved the detection strategy significantly. However, the traditional approaches are still used in many cases because of their lower computational complexity and their lower power and resource consumption.
3.2.1 Feature-based Insulator Detection
In [14], a mathematical morphology [115] and Bayesian segmentation-based insulator detection algorithm was proposed by X. Fei et al. The authors detect the insulator in an image using Bayesian segmentation; the operational flow of their solution is shown in Figure 3.6. First, the image is converted to grayscale. The image segmentation results are then obtained, and non-target regions are further eliminated using a combined morphological adaptive filter. After mapping the filtered image's gray-level information back to the original, the restored image undergoes a morphological closing operation.
Figure 3.6: Operational flow of the mathematical morphology and Bayesian
segmentation-based insulator detection [14].
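A minimal OpenCV sketch of the grayscale conversion and morphological closing steps from such a pipeline follows; the Bayesian segmentation and adaptive filtering stages are omitted, and the file name and kernel size are illustrative assumptions:

    import cv2

    img = cv2.imread("aerial_image.jpg")           # illustrative file name
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)   # step 1: grayscale conversion

    # Morphological closing (dilation followed by erosion) fills small gaps
    # inside the segmented insulator region.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (9, 9))
    closed = cv2.morphologyEx(gray, cv2.MORPH_CLOSE, kernel)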
Y. Zhai et al. also used adaptive morphology and saliency detection to detect faulty glass insulators [116]. In this approach, the authors localize insulators by fusing color and gradient features through saliency detection. Finally, to detect insulator faults, they use an adaptive morphology technique to find the standard gap between the insulator caps. Their proposed method can detect insulator faults at 2 fps with an accuracy of 92%.
M. Oberweger et al. [117] suggested a RANSAC-based model for visual recognition and fault detection of power line insulators. The proposed model is built on a voting approach for localization and discriminative training of regional gradient-based descriptors. To extract the image patch around each feature point, they first compute the Difference of Gaussians (DoG) [57] of critical features in the image and classify the feature points. To fit the insulator model to the observed feature points, they cluster the feature points by scale and apply a modified RANSAC [118] technique to all feature points of each scale.
S. Liao and J. An have proposed another feature-based insulator detection algorithm [15] based on local features and spatial orders. The workflow of their algorithm is illustrated in Figure 3.7. They first identify local features and provide a multi-scale, multi-feature descriptor to describe the local characteristics. They then train on these local features to produce a number of spatial-scale features, increasing the algorithm's robustness. Lastly, they filter out background noise and localize the insulator regions using a coarse-to-fine matching approach.

Figure 3.7: Insulator detection based on local features and spatial orders. The dotted line box illustrates the formation of the feature library [15].
In [119], W. Li et al. used a template matching strategy to detect insulators from UAV imagery. First, they extract the edge of the insulator using an enhanced MPEG-7 edge histogram descriptor [120]. The extracted edge is then detected using template matching. Lastly, a Kalman filter-based [121] real-time object tracking method is applied.
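A minimal OpenCV sketch of the template-matching step follows; the edge extraction and Kalman tracking stages are omitted, and the file names and confidence threshold are illustrative assumptions:

    import cv2

    frame = cv2.imread("uav_frame.jpg", cv2.IMREAD_GRAYSCALE)            # illustrative
    template = cv2.imread("insulator_template.jpg", cv2.IMREAD_GRAYSCALE)

    # Slide the template over the frame and score every position.
    result = cv2.matchTemplate(frame, template, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(result)

    if max_val > 0.7:  # illustrative confidence threshold
        h, w = template.shape
        top_left = max_loc
        bottom_right = (top_left[0] + w, top_left[1] + h)
        cv2.rectangle(frame, top_left, bottom_right, 255, 2)  # mark the match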
In the professorship of computer engineering at TU Chemnitz, U. Tudevdagva et al. [TUC4] have proposed a method for detecting insulator faults based on symmetry detection and the grab-cut algorithm. They achieved an accuracy of 80.43% for detecting burn marks on the insulator. A. Bane