Conference Paper

A Deep Learning Approach for Face Detection using YOLO

Dweepna Garg
Computer Engineering Department
Devang Patel Institute of Advance
Technology and Research, CHARUSAT
Changa, Anand, India
dweeps1989@gmail.com
Parth Goel
Computer Engineering Department
Devang Patel Institute of Advance
Technology and Research, CHARUSAT
Changa, Anand, India
er.parthgoel@gmail.com
Sharnil Pandya
Computer Science & Engineering
Department
Navrachana University
Vadodara, India
sharnil.pandya84@gmail.com
Amit Ganatra
Devang Patel Institute of Advance
Technology and Research
Charotar University of Science and Technology
Changa, Anand, India
amitganatra.ce@charusat.ac.in
Ketan Kotecha
Symbiosis Institute of Technology
Symbiosis International University
Pune, India
drketankotecha@gmail.com
Abstract—Deep learning, now a widely used term, is regarded
as a new era of machine learning in which computers are
trained to find patterns in massive amounts of data. It
describes learning at multiple levels of representation, which
helps to make sense of data consisting of text, sound and
images. Many organizations use a type of deep learning known
as the convolutional neural network to deal with objects in a
video sequence. Deep Convolutional Neural Networks (CNNs)
have shown impressive performance in object detection, image
classification and semantic segmentation. Object detection is
defined as a combination of classification and localization.
Face detection is one of the most challenging problems in
pattern recognition. Various face-related applications, such
as face verification, facial recognition and face clustering,
rely on face detection. Effective training needs to be carried
out for detection and recognition. Traditional approaches to
face detection did not yield good accuracy. This paper focuses
on improving the accuracy of face detection using a deep
learning model. YOLO (You Only Look Once), a popular deep
learning framework, is used to implement the proposed work.
The paper compares the accuracy of face detection against the
traditional approach. The proposed model uses a convolutional
neural network as the deep learning approach for detecting
faces in videos. The FDDB dataset is used for training and
testing the model. The model is fine-tuned over various
performance parameters and the most suitable values are
selected. The training time and the performance of the model
are also compared on two different GPUs.
Keywords—Face Detection, YOLO, Neural Network, object
detection, Convolutional Neural Network
I. INTRODUCTION
Early research focused on hand-crafted feature extraction
methods, which were used to train traditional machine
learning algorithms for detection and recognition. This
increased the computation power and time required for
feature extraction and gave less accurate results. To reduce
computation time and power and to improve accuracy, the
same tasks were implemented using neural network models
and, later, deep neural networks.
There are various deep learning [1] models, such as
convolutional neural networks and recurrent neural networks,
but among them deep convolutional neural networks (CNNs)
[2] are best suited to finding patterns in images. CNNs can
also classify, detect and label objects with high accuracy.
Region-based CNN (R-CNN) [3], Fast R-CNN [4], Faster
R-CNN [5], and YOLO [6] are popular object detection
networks of recent years.
Face detection has a plethora of applications. It plays a
crucial role in face recognition algorithms. Face recognition
has several applications, such as person identification in
surveillance and authentication for security systems. Face
detection also helps in emotion recognition, and the detected
emotion can be analyzed further for emotion-based
applications. Hence, it is considered a way to deliver rich
information about an individual, such as age, emotion and
gender. Other applications of face detection are automatic
focusing on human faces in cameras, tagging, and identifying
different parts of the face. Automated face detection has
gained attention in computer vision and pattern recognition.
Earlier face detection systems could handle only simple
cases, but systems based on deep learning algorithms now
perform well in a variety of situations. Owing to the large
variation caused by occlusion, illumination and viewpoint,
face detection remains a challenging problem in computer
vision. Accuracy, training time and processing time for
detecting faces in real-time videos are therefore still open
research issues.
In this paper, section two presents related work of face
detection algorithms. Section three describes the working of
YOLO framework for detecting objects. Proposed work is
explained in section four. Experimental setup and dataset
information are discussed in section five. Results are
analyzed in section six. Finally, conclusion and future work
are described in section seven.
II. RELATED WORK
Face detection is one of the challenging problems in the
field of pattern recognition. As early as 1994, Vaillant et
al. [7] applied neural networks to face detection. They
proposed a model that detects the presence or absence of a
face in an image by training a neural network. In this
method, the entire image was scanned with the network at all
possible locations. In 1998, a rotation-invariant face
detection method [8] was used, in which a “router” network
estimated the orientation of the face and the appropriate
detector network was then applied. For detecting semi-frontal
faces in complex images, a neural network was developed by
Garcia in 2002 [9]. A convolutional neural network for pose
estimation and face detection was proposed by Osadchy et al.
[10]. Wilson et al. presented Haar cascades for facial
feature detection [11]. However, the methods in [10, 11] are
limited when the face is subject to varying illumination,
pose and expression.
In recent years, face detection has been carried out using
deep learning models. One of the most popular models for
this is the CNN (convolutional neural network) [12]. Faster
R-CNN has also achieved remarkable results for object
detection. This paper proposes a convolutional neural
network architecture to detect faces using the YOLO
framework.
Our architecture does not rely on hand-crafted features.
Faces are detected by a CNN that extracts features by
itself. Training and testing of the model are carried out on
two GPUs, and it detects faces at a fast rate in real time.
III. OVERVIEW OF YOLO
YOLO is a state-of-the-art deep learning framework for
real-time object detection. It improves on region-based
detectors and outperforms them on standard detection
datasets such as PASCAL VOC [13] and COCO [14]. Real-time
object detection with YOLO is comparatively faster than
with other detection networks. The model can run at
different resolutions, which lets it trade off speed against
accuracy. To improve scale invariance, the images can be
resized to a random scale. The detector should be able to
learn features over a wide range of image sizes.
Object detection should be fast and accurate, and able to
recognize a wide variety of objects [15]. With the help of
neural networks, YOLO frameworks are becoming increasingly
fast and accurate for detection. Still, detection remains
constrained to a small set of objects. At present, object
detection datasets are limited compared with those for
classification and tagging. Object detection datasets
consist of thousands of images whose tags are the object
coordinates in the image, whereas classification datasets
consist of millions of images with categories. Labelling an
image with object locations for detection is more expensive
than assigning a label for classification.
Region-based CNNs generate candidate bounding boxes in an
image and then run a classifier on these boxes. The
bounding boxes are then refined using post-processing, such
as non-maximum suppression, to eliminate duplicate
detections. A single CNN can instead predict multiple
bounding boxes and the class probabilities of objects. YOLO
is optimized for fast detection. In YOLO (You Only Look
Once), a single neural network is applied to the entire
image at both training and test time. It encodes information
about the appearance of objects and their classes.
In our work, the bounding box is predicted from the
features of the image. The bounding boxes across an image
are predicted in parallel. Hence, the network can be said to
reason about the full image as well as the objects in it.
With YOLO, end-to-end training is possible along with
real-time speed, while maintaining high average precision.
YOLO works as follows. The input image is divided into an
S x S grid. If the center of an object falls into a grid
cell, that grid cell is responsible for detecting the
object. Each grid cell predicts bounding boxes and a
confidence score for each box. The confidence score reflects
how confident the model is that the box contains an object
and how accurately the box fits it. If no object is present
in the cell, the confidence score should be zero; otherwise
it is the intersection over union (IOU) between the
predicted box and the ground truth. Each bounding box
consists of 5 predictions: x, y, w, h and confidence. The
(x, y) coordinates represent the center of the box relative
to the bounds of the grid cell. The width and height are
predicted relative to the whole image. Each cell also
predicts conditional class probabilities. Multiplying the
conditional class probabilities by the individual box
confidence prediction gives the class-specific confidence
score for each box. This score indicates how accurately the
predicted box fits the object.
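The grid assignment and scoring described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this work: the function names, the normalised center-format box layout and the choice S = 7 in the usage note are assumptions made for the example.

```python
from typing import Tuple

# Assumed layout: (x_center, y_center, width, height), normalised to [0, 1]
# relative to the whole image.
Box = Tuple[float, float, float, float]


def responsible_cell(box: Box, s: int) -> Tuple[int, int]:
    """Return (row, col) of the S x S grid cell that contains the box center."""
    x, y, _, _ = box
    col = min(int(x * s), s - 1)  # clamp so a center at 1.0 stays inside the grid
    row = min(int(y * s), s - 1)
    return row, col


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two center-format boxes."""
    ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2
    ax2, ay2 = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2
    bx2, by2 = b[0] + b[2] / 2, b[1] + b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def class_confidence(p_class_given_object: float, box_confidence: float) -> float:
    """Class-specific score: conditional class probability times box confidence."""
    return p_class_given_object * box_confidence
```

For instance, with S = 7 a box centered at (0.5, 0.5) is assigned to cell (3, 3), and two boxes that do not overlap have an IOU of zero.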
There are various versions of YOLO. YOLOv1 suffers from
localization errors and has low recall compared with
region-based detection methods. The classifier network of
the original YOLO is trained at 224 x 224, and the
resolution is increased to 448 x 448 for detection. In
YOLOv2, the classifier network is fine-tuned on the ImageNet
dataset at 448 x 448 resolution. The convolutional layers of
YOLO downsample the image by a factor of 32, so an input
image of 416 x 416 yields an output feature map of 13 x 13.
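The relation between input resolution and output grid size can be checked with a short calculation. The helper below is a hypothetical illustration; only the downsampling factor of 32 and the 416 to 13 example come from the text above.

```python
def output_grid_size(input_size: int, stride: int = 32) -> int:
    """Side length of the output feature map for a square input image.

    The stride of 32 is the total downsampling factor quoted in the text;
    the input size must be a multiple of it.
    """
    if input_size <= 0 or input_size % stride != 0:
        raise ValueError("input_size must be a positive multiple of the stride")
    return input_size // stride


print(output_grid_size(416))  # 416 / 32 = 13, i.e. a 13 x 13 feature map
```

The same arithmetic shows why inputs are kept at multiples of 32: any other size would not map onto a whole number of grid cells.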
IV. PROPOSED ARCHITECTURE
Our proposed network takes a colour image of size 448 x 448
as input. The architecture consists of 7 convolutional
layers followed by a max pooling layer of size 2 x 2. Three
fully connected layers are then attached, and the output
layer follows the last fully connected layer. The
convolutional layers extract features from the images, from
simple to complex, and the fully connected layers predict
the coordinates and probabilities. Finally, the output layer
predicts both the class probabilities and the bounding box
coordinates, using the NMS (Non-Maximum Suppression)
technique.
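As an illustration of the NMS step, a greedy variant can be sketched as follows. The paper does not give the details of its NMS, so the corner-format boxes, the function names and the IoU threshold of 0.5 are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# Assumed layout: (x1, y1, x2, y2) corner coordinates in pixels.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(
    boxes: Sequence[Box],
    scores: Sequence[float],
    iou_threshold: float = 0.5,  # assumed value, not taken from the paper
) -> List[int]:
    """Greedy NMS: return indices of the boxes that are kept, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]: the second box duplicates the first
```

A lower threshold removes more overlapping boxes, which suits scenes with well separated faces; a higher one keeps more detections when faces are close together.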
V. EXPERIMENTAL SETUP & DATASET INFORMATION
The experiments are performed on two machines. The first
experiment is performed on a Core i5 processor with 8 GB
RAM and a 2 GB GeForce 820M GPU. The second experiment is
performed on a Core i7 processor with 16 GB RAM and a 4 GB
NVIDIA GTX 1050 Ti GPU. The proposed convolutional neural
network architecture is trained and tested for face
detection on the FDDB (Face Detection Dataset and
Benchmark) dataset [16].
The FDDB dataset is used in our work to train the proposed
architecture. It consists of 5171 faces in a set of 2845
images. The dataset consists of face regions designed for
studying the problem of detection. This work uses 2667
images, and the total size of the dataset (in our study) is
52.2 MB. The dataset was divided into 70% for training and
30% for testing.
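A 70/30 split of this kind can be sketched as below. The paper does not state how the split was drawn, so the random shuffle, the fixed seed, the rounding and the file names are assumptions made for the example.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    items: Sequence[str],
    train_fraction: float = 0.7,
    seed: int = 0,  # assumed; fixed only to make the split reproducible
) -> Tuple[List[str], List[str]]:
    """Shuffle the items and split them into training and testing subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)  # rounds down
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names standing in for the 2667 images used in this study.
image_names = [f"img_{i:04d}.jpg" for i in range(2667)]
train, test = split_dataset(image_names)
print(len(train), len(test))  # 1866 801
```

With rounding down, 2667 images give 1866 training and 801 testing images; a different rounding rule would shift one image between the two subsets.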
VI. RESULT ANALYSIS
The model was trained for 25 epochs with the gradient
descent optimizer. The accuracy remained nearly constant at
92.2% after 20 epochs. The best learning rate, chosen after
trying different values, was 0.0001, as shown in Fig. 1. The
same number of epochs and the same learning rate are used
for the experimental comparison on CPU and GPU.
Fig. 1. Loss vs learning rate.
Fig. 2 shows the 92.2% accuracy achieved on the test
dataset at 20 epochs. The network was also trained with
different batch sizes: 1, 8, 16 and 32. With a batch size of
16 or 32, the network could not be trained on the 2 GB 820M
graphics card, because its GPU memory was too small to
accommodate the larger batch size. This is depicted in
Fig. 3.
Fig. 2. IoU accuracy vs Epoch.
Fig. 3. Batch size vs Training time (hours)
After training the network, the weight file and the network
configuration file were tested on videos of different
resolutions. Fig. 4 shows that resolution is also a
parameter that affects the FPS (frames per second) rate.
FPS increased as the resolution decreased. A low-resolution
image has fewer pixels, so the GPU processes it faster
because fewer computations are required.
Fig. 4. Resolution of video vs FPS
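The FPS figure reported for each resolution can be obtained by timing the detector over a sequence of frames. The sketch below shows one way to do this; the `measure_fps` helper and the stand-in detector are hypothetical and do not reproduce the measurement code used in this work.

```python
import time
from typing import Any, Callable, Iterable


def measure_fps(frames: Iterable[Any], detect: Callable[[Any], Any]) -> float:
    """Run `detect` on every frame and return the average frames per second."""
    count = 0
    start = time.perf_counter()
    for frame in frames:
        detect(frame)
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")


def fake_detector(frame: Any) -> None:
    """Stand-in for a real face detector; it only sleeps for about 10 ms."""
    time.sleep(0.01)


print(f"{measure_fps(range(20), fake_detector):.1f} FPS")
```

Running the same loop on the same video at several resolutions gives the kind of comparison plotted in Fig. 4, provided the detector is the only thing that changes between runs.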
The accuracy of the proposed model was compared with other
face detection algorithms after fine-tuning all parameters
and hyperparameters of the proposed model. The accuracy of
the proposed model was higher than that of the Haar cascade
algorithm and the R-CNN based face detection model, as
depicted in Fig. 5.
Fig. 5. Comparison of accuracy of proposed model with other face detection
algorithms
VII. CONCLUSION
It can be concluded that processing a huge amount of data
using deep learning requires a high-configuration NVIDIA
graphics card (GPU). The higher the configuration of the
GPU, the faster the computation. Various parameters affect
the detection of faces in an image or a video. Based on the
analysis carried out with the proposed model, the following
points can be concluded. Firstly, the learning rate depends
on the network size and on the size of the objects. If the
network is medium or large and the objects are comparatively
small, the learning rate should be kept small. In our work,
the network consists of 18 layers, and the selected learning
rate is 0.0001. Secondly, the more times the dataset is
passed through the network, the better the results obtained.
However, this can also cause overfitting, so the number of
epochs should be kept at an optimal value that produces
neither overfitting nor underfitting. In our work, the best
IoU accuracy, 92.2%, was observed after 20 epochs. Also, the
resolution of the image plays a very important role: as
concluded, it is inversely related to the frames per second.
In future work, the proposed model can be further optimized
for very small faces, for different viewpoint variations,
and for partial face detection.
REFERENCES
[1] Y. Lecun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol.
521, no. 7553, pp. 436–444, 2015.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet
classification with deep convolutional neural networks,” Proceedings
of the 25th International Conference on Neural Information
Processing Systems - Volume 1. Curran Associates Inc., pp. 1097–
1105, 2012.
[3] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich Feature
Hierarchies for Accurate Object Detection and Semantic
Segmentation,” in 2014 IEEE Conference on Computer Vision and
Pattern Recognition, 2014, pp. 580–587.
[4] R. Girshick, “Fast R-CNN,” Proc. IEEE International Conference on
Computer Vision, ICCV 2015, pp. 1440–1448, 2015.
[5] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: towards real-
time object detection with region proposal networks,” Proceedings of
the 28th International Conference on Neural Information Processing
Systems - Volume 1. MIT Press, pp. 91–99, 2015.
[6] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look
Once: Unified, Real-Time Object Detection,” 2015.
[7] R. Vaillant, C. Monrocq, and Y. Lecun, “Original approach for the
localisation of objects in images,” IEEE Proceedings on Vision,
Image, and Signal Processing, vol. 4, 1994.
[8] H. A. Rowley, S. Baluja, and T. Kanade, “Neural network-based face
detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 1, pp.
23–38, 1998.
[9] C. Garcia and M. Delakis, “A neural architecture for fast and robust
face detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no.
11, pp. 1408–1423, 2004.
[10] M. Osadchy, Y. Le Cun, and M. L. Miller, “Synergistic Face
Detection and Pose Estimation with Energy-Based Models,” Journal
of Machine Learning Research, vol. 8, pp. 1197-1215, 2007.
[11] P. I. Wilson and J. Fernandez, “Facial feature detection using Haar
classifiers,” J. Comput. Sci. Coll., vol. 21, no. 4, pp. 127–133, 2002.
[12] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, “A Convolutional
Neural Network Cascade for Face Detection,” IEEE International
Conference on Computer Vision and Pattern Recognition, CVPR
2015, pp. 5325–5334, 2015.
[13] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A.
Zisserman, “The PASCAL Visual Object Classes (VOC) Challenge,”
International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338,
2010.
[14] T.-Y. Lin et al., “Microsoft COCO: Common Objects in Context,”
European Conference on Computer Vision, ECCV 2014, Lecture
Notes in Computer Science, vol 8693. Springer, Cham, pp. 740-755.
[15] C. Szegedy, A. Toshev, and D. Erhan, “Deep neural networks for
object detection,” Proceedings of the 26th International Conference
on Neural Information Processing Systems - Volume 2. Curran
Associates Inc., pp. 2553–2561, 2013.
[16] V. Jain and E. Learned-Miller, “FDDB: A Benchmark for Face
Detection in Unconstrained Settings,” Technical Report
UM-CS-2010-009, Dept. of Computer Science, University of
Massachusetts, Amherst, 2010.