ORIGINAL ARTICLE
Deep autoencoder for false positive reduction in handgun detection
Noelia Vallez¹ · Alberto Velasco-Mata¹ · Oscar Deniz¹
Received: 24 September 2019 / Accepted: 11 September 2020 / Published online: 25 September 2020
© The Author(s) 2020
Abstract
In an object detection system, the main objective during training is to maintain the detection and false positive rates under
acceptable levels when the model is run over the test set. However, this typically translates into an unacceptable rate of
false alarms when the system is deployed in a real surveillance scenario. To deal with this situation, which often leads to
system shutdown, we propose to add a filter step to discard part of the new false positive detections that are typical of the
new scenario. This step consists of a deep autoencoder trained with the false alarm detections generated after running the
detector over a period of time in the new scenario. Therefore, this step will be in charge of determining whether the
detection is a typical false alarm of that scenario or whether it is something anomalous for the autoencoder and, therefore, a
true detection. In order to decide whether a detection must be filtered, three different approaches have been tested. The first
one uses the autoencoder reconstruction error measured with the mean squared error to make the decision. The other two
use the k-NN (k-nearest neighbors) and one-class SVMs (support vector machines) classifiers trained with the autoencoder
vector representation. In addition, a synthetic scenario has been generated with Unreal Engine 4 to test the proposed
methods in addition to a dataset with real images. The results obtained show a reduction in the number of false positives
between 22.5% and 87.2% and an increase in the system’s precision of 1.2%–47% when the autoencoder is applied.
Keywords Handgun detection · False positive reduction · Autoencoder · One-class classification
1 Introduction
Weapons, among other threats, need to be detected as soon
as possible to eliminate or mitigate the danger they could
cause [1]. Traditionally, the surveillance of public scenar-
ios has been accomplished by the human supervision of the
images captured by closed-circuit television (CCTV) sys-
tems. However, even an experienced guard may miss a
dangerous event due to fatigue or loss of attention [2]. To
help with this situation, the creation of automated
surveillance systems (AVSs) able to locate potentially
threatening objects (or other events) in video has been
studied during the last decades [3].
As in other areas, the introduction of deep learning
methods has brought promising results to these frameworks,
which are now closer to being used in real scenarios
[4,5]. Nevertheless, although those detectors have high
detection (D) and low false positive (FP) rates, the false
positive rate increases when they are used in a scenario
different from the one used for training [6]. This is a
major problem, since an increase of even 0.1% in the false
positive rate may cause 90 false alarms per hour with a
video input of 25 fps. Therefore, when the surveillance
system runs in a real scenario, the outcome is usually an
unacceptable number of false alarms. In most cases, this
may lead to the guard switching off the system.
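The figure of 90 false alarms per hour can be reproduced with simple arithmetic. The sketch below assumes that every frame is processed and that each affected frame raises one alarm; the function name is only illustrative.

```python
def extra_false_alarms_per_hour(fps: float, fp_rate_increase: float) -> float:
    """Expected extra false alarms per hour, assuming one alarm per affected frame."""
    frames_per_hour = fps * 3600  # 25 fps gives 90,000 frames per hour
    return frames_per_hour * fp_rate_increase


# An increase of 0.1% (0.001) at 25 fps yields roughly 90 extra alarms per hour.
print(extra_false_alarms_per_hour(fps=25, fp_rate_increase=0.001))
```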
In this context, we propose to include an extra step that
models the false alerts that are specific to the new scenario
while approximately maintaining the capability of
identifying the objects the detector was trained for. As a
specific application, this work focuses on detecting handguns
in video surveillance. After running the detector in a new
scenario, it is possible to collect all the detector alarms.
Practically all of these alarms are false positives since the
incidence of the true event (a handgun in the scene) is very low.
Corresponding author: Noelia Vallez, Noelia.Vallez@uclm.es
¹ VISILAB, University of Castilla-La Mancha, ETSI Industriales, Av. Camilo Jose Cela SN, 13071 Ciudad Real, Spain
Neural Computing and Applications (2021) 33:5885–5895
https://doi.org/10.1007/s00521-020-05365-w
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Therefore, all of these detections can be stored and used to
model the new scenario.
The new step will act as a filter able to recognize typical
FPs of the detector in the particular scenario. Therefore,
this problem can be seen as an anomaly detection problem
where the anomalies are those detections that are not
similar to the FPs modeled by the filter [7,8]. In fact, the
anomalies detected on this step will be the real alarms.
To detect abnormal and extreme data, one-class classi-
fiers have been widely used in the literature [9]. More
concretely, autoencoders have proven to be the most suit-
able of the techniques, obtaining good results even where
other methods fail [10]. In order to use the autoencoder as a
filter and decide whether a detection is an anomaly or not,
we have tested different approaches: using the autoencoder
reconstruction error as a threshold and using the central
vector representation to train nearest neighbor (NN) and
one-class support vector machine (SVM) classifiers.
Although autoencoders have been applied in anomaly
detection problems, to the best of our knowledge, this is the
first time they have been applied to reduce false positive
detections when the detector runs in a new scenario from
which it is not possible to obtain labeled data.
For the purpose of testing our idea, we have generated
an entirely synthetic dataset from the frames captured from
a realistic 3D scenario. The synthetic scenario resembles a
school hallway from the point of view of a surveillance
camera. This allows us to generate as much data as needed
with and without handguns to train and test the
autoencoder.
The rest of the paper is organized as follows. Section 2
gives an overview of the advances in handgun detection.
Section 3 presents the handgun detector used as the base
detector of the proposed false positive reduction method.
Section 4 describes the datasets used, including the
synthetic dataset that has been generated. Section 5 provides
detailed information about the proposed autoencoder-based
filtering step. Finally, Sect. 6 shows the results and Sect. 7
summarizes the main conclusions.
2 Related work
In addition to automatic CCTV video surveillance, several
approaches have been proposed to deal with concealed
handguns in X-ray or millimetric wave images. These types
of image are commonly used in airports, train stations or
the entrance of some public buildings. In 2008, Nercessian
et al. presented a system for handgun detection using X-ray
luggage scan images [11]. The approach was based on the
Gaussian mixture expectation maximization (EM) method
to perform image segmentation prior to extracting the
edge-based feature vectors. Gesick et al. compared three
different approaches for the detection of handguns inside
luggage [12]. The first method employs edge detection
combined with pattern matching with reliable results.
However, both the computational time and the number of
false positives were high. The second method uses Dau-
bechies wavelet transforms with inconclusive results as the
authors commented. The third algorithm proposed in that
work was based on the scale-invariant feature transform
(SIFT). Later, in 2010, Harmer et al. used a completely
different approach based on the modeling of the complex
natural resonances of handguns and compared them with
those of other objects [13]. In addition, the work of Flitton
et al. concluded that simpler 3D feature descriptors
outperform even complex RIFT/SIFT solutions, with an
accuracy of more than 95% [14]. In [15], Xiao et al.
employed an extension of the Haar-like features with an
AdaBoost-based cascade classifier to detect handguns in
passive millimeter wave (PMMW) images. Following a
similar approach, the study of Kundegorski et al. combines
bag of visual words (BoVW) based on feature point
descriptors and support vector machines (SVMs) and ran-
dom forest classifiers [16].
While there are numerous methods and devices that can
detect concealed weapons, unfortunately, the incidence of
mass shootings requires the use of RGB surveillance
images. Tiwari and Verma proposed a framework that
applies color-based segmentation and k-means clustering to
remove irrelevant objects and then uses Harris interest
point detector and Fast Retina Keypoint (FREAK) to locate
the handguns [17]. This resulted in high robustness when
detecting the desired object at different scales and rota-
tions. In addition, Halima and Hosam worked on a detector
that combined SIFT features, k-means clustering, a word
vocabulary histogram, and SVM [18].
The recent advances in deep learning have also been
applied to the handgun detection problem using CCTV
images. The first contribution in this area came in the work
of Olmos et al. where two different approaches were used
[4]. The first one uses a classification CNN to detect
handguns with the sliding window method, whereas the
second one is based on the Faster R-CNN detection
architecture. The latter obtained the best results when tes-
ted in a dataset composed of several YouTube videos. On
the other hand, Gelana et al. followed a more traditional
approach using edge detection and a classification CNN
with the sliding window method [19]. In addition, Romero
and Salamea trained a YOLO object detection and local-
ization system to detect firearms with the particularity of
running the detector only in areas where there are people
[20]. Another study of Olmos et al. proposed using a
symmetric dual camera system to increase the performance
of the detection model in low quality surveillance videos
improving both the false positive and the detection rates.
To model outliers, discordant objects or simply data that
has a different behavior or pattern, anomaly detection
techniques have been used [7]. Anomaly detection has a
wide range of applications. For example, it can be used to
detect anomalies in stock prices and time series [21,22],
abnormal medical images or findings [23–25], abnormal
events in video [5,26,27], intrusion detection [28], or
disaster areas from radar images [29].
A simple method to model anomalies is to use neighbor-based
methods such as the k-nearest neighbor [30–32]
where anomalies are identified as those points in the data
space that differ from the surrounding data points. The
advantage of these methods is the independence of the data
distribution. However, their performance relies on the
values of the parameters selected such as the number of
neighbors.
An alternative to neighbor-based methods is to detect
anomalies taking into account that they are grouped
in a zone of the data space. Thus, the anomaly detection
problem is solved as a subspace learning problem [33–36].
Although this method works well in some cases, finding the
number of subspaces in which the anomalies are distributed
is not trivial.
As in classification and detection tasks, CNNs have
been shown to improve the performance in anomaly
detection problems [26]. More concretely, convolutional
autoencoders have been used to model input data and
reduce data space dimensionality [37]. Their use has
reduced the need to preprocess input data and compute
handcrafted features from it [10]. Following this approach,
Mabu et al. proposed to use a convolutional autoencoder
followed by a one-class SVM to model normal areas in
satellite images and detect abnormal areas caused by nat-
ural disasters in Japan [29]. Lu and Xu demonstrated the
potential of using variational autoencoders to detect
anomalies in skin disease images [23]. The authors
recommend using them instead of GANs (generative
adversarial networks) due to their training stability and
interpretable results. Sugimoto et al. use an autoencoder
followed by a k-NN classifier to detect myocardial
infarction.
Another approach is the one followed by Gutoski et al.
in which autoencoders and stacked denoising autoencoders
are used for clustering [38]. With the clustering
representation, it is possible to determine whether a new
sample is an anomaly or not according to its distance to
the clusters. Gutoski et al. also followed this approach for
one-class classification [38].
In some cases, there is more than one group of
abnormalities, as in the work carried out by Mirsky et al. in
[28]. The authors proposed to use an ensemble of
autoencoders instead of one to detect online network
intrusions. The decision of what is an anomaly or not is
based on the RMSE (root-mean-squared error) score output
by the autoencoders.
For video input, Singh and Mohan use deep stacked
autoencoders to obtain a deep representation of spa-
tiotemporal video volumes to detect road accidents [37].
The anomaly score is obtained with a one-class SVM as in
other works.
Finally, non-symmetric autoencoders have also been
used to learn space representations. An example of this is
the work carried out by Tran and Hogg where the
autoencoder representation is used for detecting anomalies
in video [39]. In addition, recurrent autoencoders with
LSTM (long short-term memory) layers have also been
applied for anomaly detection in video in the work carried
out by Yan et al. in [40].
3 Handgun detector
Before addressing the false positive rate reduction through
the use of the autoencoder, we needed to train and test a
handgun detector. As shown in Sect. 2, there are several
approaches that can be selected, from the use of classifi-
cation CNNs with the sliding window approach to the most
modern CNN detection architectures. While the former
examines every subregion of the image, the latter uses
region proposal algorithms to reduce the number of
examined windows or process the full image in one pass
[41]. The advantage of the new architectures is the ability
to detect objects in different locations of the image without
being restricted to a certain aspect ratio. Moreover, the
number of regions to be examined is drastically reduced in
comparison with other methods. The most representative
architectures for object detection that follow a region
proposals approach are R-CNN, Fast R-CNN, and Faster
R-CNN [42].
Two well-known architectures are YOLO (You Only
Look Once) and SSD (single-shot detector). YOLO
addresses object detection as a regression problem with
spatially separated bounding boxes and their corresponding
class probabilities [43]. SSD is able to predict, with only
one pass over the entire image, the bounding boxes and the
class probabilities for them [44].
In addition to all the above, there is a recently developed
CNN-based detector called RetinaNet [45]. RetinaNet was
designed to address extreme foreground–background class
imbalance and has also been applied to X-ray images [46].
For the particular problem of weapon (handgun and
knife) detection, [47] reviews recent work and shows that
Faster R-CNN has been the prevalent method. For that
reason, we have selected the Faster R-CNN architecture to
train a handgun detector with a dataset provided by the
University of Seville [1]. The dataset is composed of 871
images that contain 177 annotated handguns. Those images
were extracted from the video captured by 2 CCTV cam-
eras located in two different college hallways.
4 Datasets
The collection and labeling of the data necessary to train
deep learning models are tasks that require significant time
and effort. This is even more complicated in detection or
segmentation problems in which someone has to select the
area of the image in which the object is located, or the
exact contour of the object, in addition to the category. A
possible solution to this problem is the use of public
datasets, but, depending on the problem, it is not always
possible to have one available. The use of synthetic images
facilitates the work required to obtain large datasets. For
this work, a completely synthetic dataset has been gener-
ated with Unreal Engine 4 [48], rendering a scenario that
represents a high-school hallway where people are walk-
ing. There are other popular alternatives such as Unity [49]
and Lumberyard/CryEngine [50] that can also be used for
the same purpose. While some of the people on the sce-
nario carry everyday objects in their hands, such as mobile
phones, others carry guns or nothing (see Fig. 1).
Another advantage of having the data generation fully
controlled by the researcher is that it is also possible to
automatically generate a mask image with the desired
objects for each frame. In this case, each generated image
contains the people in white, the background covered in
black, and each handgun filled with a different color to help
extract the information about its location. Once all masks
are obtained, the coordinates of the bounding boxes that
contain the weapons are extracted storing the annotations
in XML files with the format defined by the Pascal VOC
2012 Challenge [51].
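The mask-to-annotation step described above can be sketched as follows. This is only an illustration under stated assumptions, not the tooling actually used: the mask is represented as nested lists of RGB tuples, the label "handgun" and the file name are hypothetical, only a subset of the Pascal VOC fields is written, and the coordinates are 0-based pixel indices.

```python
import xml.etree.ElementTree as ET

BLACK, WHITE = (0, 0, 0), (255, 255, 255)  # background and people in the mask


def boxes_from_mask(mask):
    """Return {color: (xmin, ymin, xmax, ymax)} for each handgun color.

    `mask` is a list of rows; each row is a list of (R, G, B) tuples.
    Any color other than black or white is treated as one handgun.
    """
    boxes = {}
    for y, row in enumerate(mask):
        for x, color in enumerate(row):
            if color in (BLACK, WHITE):
                continue
            if color not in boxes:
                boxes[color] = [x, y, x, y]
            else:
                b = boxes[color]
                b[0], b[1] = min(b[0], x), min(b[1], y)
                b[2], b[3] = max(b[2], x), max(b[3], y)
    return {color: tuple(b) for color, b in boxes.items()}


def to_voc_xml(filename, width, height, boxes, label="handgun"):
    """Serialize bounding boxes to a minimal Pascal VOC-style XML string."""
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    for tag, value in (("width", width), ("height", height), ("depth", 3)):
        ET.SubElement(size, tag).text = str(value)
    for xmin, ymin, xmax, ymax in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = label
        bnd = ET.SubElement(obj, "bndbox")
        for tag, value in (("xmin", xmin), ("ymin", ymin),
                           ("xmax", xmax), ("ymax", ymax)):
            ET.SubElement(bnd, tag).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Toy 6x4 mask: one person pixel (white) and one red handgun region.
mask = [[BLACK] * 6 for _ in range(4)]
mask[0][0] = WHITE
mask[1][2] = mask[1][3] = mask[2][3] = (255, 0, 0)
boxes = boxes_from_mask(mask)
xml_text = to_voc_xml("frame_0001.png", 6, 4, boxes.values())
```

For full-resolution frames, an array library would be a more practical choice than nested lists; the logic stays the same.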
A total of 4000 images were generated with this method
with a resolution of 1280×720. From these, 3000 frames
were used to train and adjust the proposed autoencoder
filter, containing 5437 annotated handguns. The remaining
1000 frames were used to evaluate and compare the
detector and detector + autoencoder systems.
In addition to the synthetic dataset, the Gun Movies
Database [52] has also been used to ensure that the
differences are caused by the proposed method and not by
changes in texture arising from the origin of the data. This
dataset contains images of size 640×480 pixels from 7
laboratory-shot movies with a total of 817 frames and 686
annotated handguns (Fig. 2).
Fig. 1 Synthetic scenario with a zoom on the elements of interest (in
this case, a mobile phone and a handgun)
Fig. 2 Sample frame from the Gun Movies dataset
Fig. 3 Proposed system
Fig. 4 Autoencoder training phase
For this second dataset, a total of 817 images were used.
From these, 571 frames were used to train and adjust the
proposed autoencoder filter and the remaining 246 frames
were used to evaluate and compare the detector and
detector + autoencoder systems.
5 Proposed method
As introduced above, when a detector runs in a new sce-
nario, the false positive rate increases due to its particu-
larities that were not seen in the training data. To deal with
this problem, we propose to add a filtering step after the
detector inference (see Fig. 3). The filter is used to discard
the FP detections of the object detector produced by the
particularities of the new scenario.
The filter can be considered as a one-class classifier that
learns how to identify a certain type of samples. Thus, the
rest of the samples can be considered as anomalies. This
problem has been addressed in the literature through the
use of the one-class versions of the SVM, k-nearest
neighbor (k-NN), random forests classifiers, and more
recently with deep autoencoders [9,38,53]. In our case, an
autoencoder is trained to model the class of the typical FP
detections.
In order to collect the training samples for the autoen-
coder, the detector is run in the particular scenario for a
certain period of time, storing all the FP detections (Fig. 4).
Initially, all detections can be considered as FPs in a real
scenario since the incidence of handguns is very low.
Deep autoencoders learn the input data distribution
using an intermediate representation. They are able to
compress the data into a small vector and then reconstruct
the input from it with accurate results. If new input data
come from a different distribution, the reconstruction error
will be higher.
Finally, according to the autoencoder structure, we
define and compare 3 different methods to check whether
the output of the detector is a typical false positive. The
simplest one is to establish a threshold for the
reconstruction error. Therefore, detections with low
reconstruction error will be discarded as typical FPs of the
scenario. The other two methods use the central vector as a
compact representation of the images and train a one-class
classifier with it. For that, SVM and k-NN with k = 1 were
used, and the thresholds were selected according to the
scores and the distance to the closest neighbor, respectively.
Fig. 5 Autoencoder architecture used
Table 1 Detailed description of the autoencoder architecture used.
The output of the conv2d_7 layer is used as the FP intermediate
representation
Layer (type)                     Output shape          Param #
input_1 (InputLayer)             (None, 64, 64, 3)     0
conv2d_1 (Conv2D)                (None, 64, 64, 4)     112
max_pooling2d_1 (MaxPooling2D)   (None, 32, 32, 4)     0
conv2d_2 (Conv2D)                (None, 32, 32, 28)    1036
max_pooling2d_2 (MaxPooling2D)   (None, 16, 16, 28)    0
conv2d_3 (Conv2D)                (None, 16, 16, 52)    13156
max_pooling2d_3 (MaxPooling2D)   (None, 8, 8, 52)      0
conv2d_4 (Conv2D)                (None, 8, 8, 76)      35644
max_pooling2d_4 (MaxPooling2D)   (None, 4, 4, 76)      0
conv2d_5 (Conv2D)                (None, 4, 4, 100)     68500
max_pooling2d_5 (MaxPooling2D)   (None, 2, 2, 100)     0
conv2d_6 (Conv2D)                (None, 2, 2, 124)     111724
max_pooling2d_6 (MaxPooling2D)   (None, 1, 1, 124)     0
conv2d_7 (Conv2D)                (None, 1, 1, 148)     165316
conv2d_8 (Conv2D)                (None, 1, 1, 148)     197284
up_sampling2d_1 (UpSampling2D)   (None, 2, 2, 148)     0
conv2d_9 (Conv2D)                (None, 2, 2, 124)     165292
up_sampling2d_2 (UpSampling2D)   (None, 4, 4, 124)     0
conv2d_10 (Conv2D)               (None, 4, 4, 100)     111700
up_sampling2d_3 (UpSampling2D)   (None, 8, 8, 100)     0
conv2d_11 (Conv2D)               (None, 8, 8, 76)      68476
up_sampling2d_4 (UpSampling2D)   (None, 16, 16, 76)    0
conv2d_12 (Conv2D)               (None, 16, 16, 52)    35620
up_sampling2d_5 (UpSampling2D)   (None, 32, 32, 52)    0
conv2d_13 (Conv2D)               (None, 32, 32, 28)    13132
up_sampling2d_6 (UpSampling2D)   (None, 64, 64, 28)    0
conv2d_14 (Conv2D)               (None, 64, 64, 3)     759
Fig. 6 Subsets of the synthetic dataset used. The Gun Movies dataset
is similarly split
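The first and third decision rules can be sketched in a few lines. This is a minimal illustration under assumptions that the text does not fix: patches and central vectors are flat sequences of floats, the k-NN distance is Euclidean, and all function names are hypothetical.

```python
import math


def mse(original, reconstruction):
    """Mean squared error between two equally sized flat pixel sequences."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstruction)) / len(original)


def nearest_neighbor_distance(vector, reference_vectors):
    """Euclidean distance from `vector` to its closest reference (k-NN, k = 1)."""
    return min(math.dist(vector, ref) for ref in reference_vectors)


def is_typical_fp(score, threshold):
    """A low score means the detection resembles the known FPs and is discarded."""
    return score < threshold


# Toy patch and its reconstruction (made-up values, scaled to [0, 1]).
patch = [0.2, 0.4, 0.6, 0.8]
reconstruction = [0.2, 0.5, 0.6, 0.7]
error = mse(patch, reconstruction)

# Toy central vectors of stored FPs (made-up values).
fp_vectors = [[3.0, 4.0], [6.0, 8.0]]
distance = nearest_neighbor_distance([0.0, 0.0], fp_vectors)
```

The one-class SVM variant follows the same pattern with the classifier score in place of the distance, although the direction of the comparison depends on how that score is defined.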
5.1 Autoencoder architecture
The structure of an autoencoder consists of an encoder path
that ignores the noise and reduces the dimensionality and a
decoder path that performs the reconstruction. The
compressive path of the autoencoder used consists of a set
of 6 convolutional and max-pooling layers (Fig. 5).
Similarly, the reconstruction path also has 6 convolutional
and up-sampling layers. The input is a 3-channel image of
size 64×64, and the central vector has 148 elements
(conv2d_7 layer). A more detailed description of the
architecture can be seen in Table 1.
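As a consistency check on Table 1, the parameter counts can be recomputed from the channel sizes. The kernel size is not stated in the text; the sketch below assumes 3×3 kernels with bias terms, an inference that happens to reproduce every value in the Param # column.

```python
def conv2d_params(in_channels, out_channels, kernel=3):
    """Weights plus biases of a 2D convolution with a square kernel."""
    return kernel * kernel * in_channels * out_channels + out_channels


# Channel counts read from Table 1: input, then conv2d_1 ... conv2d_14.
channels = [3, 4, 28, 52, 76, 100, 124, 148, 148, 124, 100, 76, 52, 28, 3]

# Param # column of Table 1 for conv2d_1 ... conv2d_14.
table_params = [112, 1036, 13156, 35644, 68500, 111724, 165316,
                197284, 165292, 111700, 68476, 35620, 13132, 759]

computed = [conv2d_params(c_in, c_out)
            for c_in, c_out in zip(channels, channels[1:])]
assert computed == table_params  # holds under the 3x3 kernel assumption

total_params = sum(computed)
print(total_params)
```

Pooling and up-sampling layers contribute no parameters, so the total is the sum over the 14 convolutions.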
6 Results
The Faster R-CNN model trained with the dataset from the
University of Seville obtained an mAP of 0.7933. Training
took 2 days and ran for 62 epochs. An Ubuntu 14.04 LTS
machine with 2 nVIDIA Quadro M4000 cards, Keras with
TensorFlow backend and CUDA 8.0 were used to perform
the training.
Fig. 7 Typical false positives of the handgun detector in the
synthetic scenario (enlarged)
Fig. 8 Typical false positives of the handgun detector in the Gun
Movies dataset (enlarged)
Fig. 9 Autoencoder reconstruction of (a) a TP and (d) an FP of the
detector from the synthetic dataset. (b) and (e) are the reconstructed
images, and (c) and (f) are the absolute differences between the
reconstructions and their corresponding original images
Table 2 Percentage of FPs that are filtered by method and dataset
              MSE      kNN      SVM
Synthetic     26.4%    30%      22.5%
Gun Movies    87.2%    74.1%    49%
Table 3 Increase in the detector precision when the autoencoder is
applied by method and dataset (th: threshold)
              MSE            kNN          SVM
Synthetic     1.46%          1.77%        1.2%
              th = 0.0057    th = 0.34    th = 14600
Gun Movies    47%            20%          8%
              th = 0.047     th = 1.92    th = 175
After obtaining this base detector, both datasets were
divided into 4 parts to (1) train and (2) validate the
autoencoder, (3) fit the k-NN and SVM classifiers, and (4)
test the system variants (Fig. 6). The detector was then run
on each of the subsets, and the FP and TP patches were
stored. Although only the FP detections are used to train
and validate the autoencoder and fit the classifiers, the
correct detections were also generated and stored for the
test subset to check that the detection rate is minimally
affected.
Overall, for the autoencoder training and validation, two
sets with 4913 and 3607 FPs, respectively, were obtained
for the synthetic dataset and 586 and 405 FPs for the Gun
Movies dataset. Another set composed of 3712 FP regions
for the synthetic dataset and 95 FP regions for the Gun
Movies dataset was used to fit the classifiers used to per-
form the decision. Finally, a set of 4832 regions (4632 FPs
and 200 TPs) for the synthetic dataset and 499 regions (359
FPs and 40 TPs) for the Gun Movies dataset was reserved
for testing. Figure 7 shows some examples of the typical FP
detections obtained in the synthetic scenario.
The autoencoders were then trained and validated with
the stored FP detections of the training and validation
subsets. This process took only about an hour for each
dataset to complete 500 epochs on a Windows 10 PC with
an nVIDIA GTX 1060 MaxQ card using Keras with
TensorFlow backend and CUDA 9.0. At this point, if the
autoencoder is applied to some test images from TP and FP
detections, its ability to reconstruct FPs effectively
becomes evident (see Fig. 9).
The stored FP detections from the fit subset of each
dataset were used to feed each of the autoencoders and get
intermediate vectors to train the SVM and k-NN one-class
classifiers. The SVM selected uses a linear kernel. On the
other hand, k = 1 was selected for the k-NN algorithm.
To illustrate the performance of both the detector and
detector + autoencoder approaches on the two datasets,
they were tested with the 1000 images from the fourth
subset of the synthetic scenario and the 246 frames of the
Gun Movies dataset. The histograms of the reconstruction
error for the MSE thresholding-based method, the
probability score for the SVM one-class classifier, and the
distance to the nearest neighbor for the k-NN were obtained
(Figs. 10 and 11). Although TPs and FPs overlap in
all cases, the first part of the MSE and k-NN histograms
and the last part of the SVM histogram do not contain TPs.
Therefore, the FPs that lie in those parts of the histograms
can potentially be filtered by selecting as threshold the
value of the first bin (or the last for the SVM) in the
histogram that contains TPs. For the synthetic dataset, this
shows that, without affecting the detection rate, up to 26.4%
of all the FPs can be filtered using the MSE reconstruction
error, 22.5% using the one-class SVM, and 30% with the
distance to the nearest neighbor of the k-NN (Table 2). On
the other hand, in the histograms obtained from the Gun
Movies dataset TPs and FPs overlap less, making it possible
to remove up to 87.2% of all the FPs using the MSE
reconstruction error, 49% using the one-class SVM, and 74.1%
with the distance to the nearest neighbor of the k-NN
without affecting the detection rate.
Fig. 10 Synthetic dataset. Histograms of the MSE reconstruction
error, SVM score, and k-NN distance (y-axis uses logarithmic scale)
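The threshold selection described above can be illustrated with a short sketch. It simplifies the histogram-based procedure by using the lowest TP score directly instead of a bin edge, and it assumes that low scores indicate typical FPs (as with the MSE and k-NN distance; for the SVM the scores would be negated first). The scores are made-up toy values, not data from the experiments.

```python
def filter_without_losing_tps(tp_scores, fp_scores):
    """Use the lowest TP score as threshold and return the share of FPs below it."""
    threshold = min(tp_scores)
    removed = sum(1 for score in fp_scores if score < threshold)
    return threshold, removed / len(fp_scores)


# Toy scores: three true positives and five false positives.
tp_scores = [0.30, 0.45, 0.80]
fp_scores = [0.05, 0.10, 0.20, 0.35, 0.50]
threshold, filtered_share = filter_without_losing_tps(tp_scores, fp_scores)
```

With these toy values, three of the five FPs fall below the threshold while every TP is kept, which mirrors how the percentages in Table 2 are obtained.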
In addition, the precision–recall curves corresponding to
the detector and the autoencoder with the three proposed
decision methods were obtained (see Figs. 12 and 13). The
experimental results show a reduction in the number of
false positives while roughly maintaining the detection
capabilities [54]. Since the precision–recall curve is
calculated by varying the detector output threshold and the
autoencoder has another threshold that can be varied too,
each curve was obtained for a fixed autoencoder threshold
while varying the threshold of the detector (Table 3).
For the synthetic dataset, comparing all the curves for a
specific autoencoder thresholding method, they show a
maximum increase in the precision of 1.46% at the same
recall values when the autoencoder and the MSE are used
(threshold = 0.0057), of 1.2% using the autoencoder and
the SVM classifier (threshold = 14600), and of 1.77% in
the case of the autoencoder and k-NN (threshold = 0.34). For
the Gun Movies dataset, results show a maximum increase
in the precision of 47% at the same recall values when the
autoencoder and the MSE are used (threshold = 0.047), of
8% using the autoencoder and the SVM classifier
(threshold = 175), and of 20% in the case of the autoencoder
and k-NN (threshold = 1.92).
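The way each curve is built, with the filter threshold fixed and the detector threshold swept, can be sketched as follows. The data layout, the function name, and the toy detections are assumptions made for illustration; the filter score is assumed to be low for typical FPs.

```python
def precision_recall(detections, total_positives, det_threshold,
                     filter_threshold=None):
    """Precision and recall for one detector threshold.

    `detections` is a list of (detector_confidence, filter_score, is_tp).
    A detection is kept if its confidence reaches `det_threshold` and,
    when a filter threshold is given, its filter score reaches it too.
    """
    kept = [d for d in detections
            if d[0] >= det_threshold
            and (filter_threshold is None or d[1] >= filter_threshold)]
    true_positives = sum(1 for d in kept if d[2])
    # Convention: precision is 1.0 when nothing is kept.
    precision = true_positives / len(kept) if kept else 1.0
    recall = true_positives / total_positives
    return precision, recall


# Toy detections (made-up values): two TPs and three FPs.
detections = [
    (0.9, 0.50, True),
    (0.8, 0.02, False),
    (0.7, 0.40, True),
    (0.6, 0.01, False),
    (0.4, 0.30, False),
]

baseline = precision_recall(detections, 2, det_threshold=0.5)
filtered = precision_recall(detections, 2, det_threshold=0.5,
                            filter_threshold=0.05)

# One curve: fixed filter threshold, varying detector threshold.
curve = [precision_recall(detections, 2, t, filter_threshold=0.05)
         for t in (0.3, 0.5, 0.75)]
```

In this toy case the filter raises precision at the same recall, which is the effect the curves are meant to expose.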
Overall, the results show that the autoencoder is able to
filter part of the new FPs in all cases without affecting the
detection rate of the original system. All thresholding
methods are able to reduce the number of FPs to some
degree.
Fig. 11 Gun Movies dataset. Histograms of the MSE reconstruction
error, SVM score, and k-NN distance (y-axis uses logarithmic scale)
7 Conclusions
In this work, a step to filter the false positive detections that
appear when a pre-trained handgun detector is deployed in
the final surveillance scenario has been proposed. This step
consists of training a deep autoencoder with the false
positive regions obtained from the particular scenario.
Once the autoencoder is trained, it can be used to decide
whether a detection is similar to the already known typical
false alarms and can be filtered, or otherwise if an alert
should be triggered.
Fig. 12 Precision–recall curves for the synthetic dataset. Best viewed in color (color figure online)
Fig. 13 Precision–recall curves for the Gun Movies dataset. Best viewed in color (color figure online)
The ability of the autoencoder to reduce the number of FPs has been demonstrated, with a potential reduction of up to 30% for the synthetic scenario when it is combined with a k-NN classifier trained with the vector representation of the detector's FP regions, and of up to 78% for the Gun Movies dataset when the autoencoder is combined with the MSE error metric. Furthermore, the handgun detection capability of the system is not compromised by the added filtering step under a wide range of threshold levels.
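For the k-NN variant, a detection can be scored by how far its autoencoder vector representation lies from the vectors of the known false positives. The sketch below is only one plausible reading: it assumes the score is the mean Euclidean distance to the k nearest training vectors, uses hypothetical two-dimensional vectors in place of real latent codes, and borrows the Gun Movies K-NN threshold of 1.92 as an example value.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def knn_distance(query: Vector, train: Sequence[Vector], k: int = 3) -> float:
    """Mean Euclidean distance from `query` to its k nearest training vectors."""
    if not 1 <= k <= len(train):
        raise ValueError("k must be between 1 and the number of training vectors")
    dists = sorted(math.dist(query, t) for t in train)
    return sum(dists[:k]) / k


def should_filter(query: Vector, train: Sequence[Vector],
                  threshold: float, k: int = 3) -> bool:
    """Filter the detection when it lies close to the known false positives."""
    return knn_distance(query, train, k) <= threshold


# Hypothetical latent vectors of false positives collected in one scenario.
fp_vectors = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
THRESHOLD = 1.92  # illustrative; K-NN threshold quoted for the Gun Movies dataset

print(should_filter((0.5, 0.5), fp_vectors, THRESHOLD))  # near the known FPs
print(should_filter((5.0, 5.0), fp_vectors, THRESHOLD))  # far from the known FPs
```

A brute-force search is used here for clarity; with many stored vectors, a spatial index or a library implementation would be the natural replacement.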
Although the proposed approach has only been applied to two particular scenarios, it can be extended to others, since different perspectives, lighting conditions or background objects will generate different false positives. Thus, during the system’s deployment only a generic detector (in this case a handgun detector) is required, and one autoencoder is trained for each camera feed.
Acknowledgements We thank Professor Dr. J.A. Alvarez for the surveillance images provided for training the handgun detector and J. J. Corroto for generating the synthetic dataset. This work was partially funded by projects TIN2017-82113-C2-2-R by the Spanish Ministry of Economy and Business and SBPLY/17/180501/000543 by the Autonomous Government of Castilla-La Mancha and the ERDF.
Compliance with Ethical Standards
Conflict of interest The authors declare that they have no conflict of
interest.
Open Access This article is licensed under a Creative Commons
Attribution 4.0 International License, which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as
long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence, and indicate
if changes were made. The images or other third party material in this
article are included in the article’s Creative Commons licence, unless
indicated otherwise in a credit line to the material. If material is not
included in the article’s Creative Commons licence and your intended
use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright
holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
References
1. Enrı
´quez F, Soria LM, A
´lvarez-Garcı
´a JA, Caparrini FS, Velasco
F, Deniz O, Vallez N (2019) Vision and crowdsensing technology
for an optimal response in physical-security. Comput Sci (ICCS)
11540:15–26
2. Ashby MPJ (2017) The value of CCTV surveillance cameras as
an investigative tool: an empirical analysis. Eur J Crim Policy
Res 23(3):441–459
3. Raghunandan A, Mohana M, Raghav P, Aradhya HR (2018)
Object detection algorithms for video surveillance applications.
In: IEEE-7th international conference on communication and
signal processing (ICISPC 2018), pp 563–568
4. Olmos R, Tabik S, Herrera F (2018) Automatic handgun detec-
tion alarm in videos using deep learning. Neurocomputing
275:66–72
5. Dan X, Yan Y, Ricci E, Dan N (2017) Detecting anomalous
events in videos by learning deep representations of appearance
and motion. Comput Vis Image Underst 156:117–127
6. Vallez N, Bueno G, Deniz O (2013) False positive reduction in
detector implantation. In: 14th Conference on artificial intelli-
gence in medicine, (AIME)
7. Xiaodan X, Liu H, Yao M (2019) Recent progress of anomaly
detection. Complexity 2015:1–11
8. Vallez N, Velasco-Mata A, Corroto JJ, Deniz O (2019) Weapon
detection for particular scenarios using deep learning. In: 9th
Iberian conference on pattern recognition and image analysis
(IbPRIA)
9. Khan SS, Madden MG (2013) One-class classification: taxonomy
of study and review of techniques. CoRR, abs/1312.0049
10. Hofer-Schmitz K, Nguyen P-H, Berwanger K (2018) One-class
autoencoder approach to classify Raman spectra outliers. In:
ESANN
11. Nercessian S, Panetta K, Agaian S (2008) Automatic detection of
potential threat objects in X-ray luggage scan images. In: 2008
IEEE conference on technologies for homeland security,
pp 504–509
12. Gesick R, Saritac C, Hung C-C (2009) Automatic image analysis
process for the detection of concealed weapons. In: Proceedings
of the 5th annual workshop on cyber security and information
intelligence research: cyber security and information intelligence
challenges and strategies, pp 1–20
13. Harmer SW, Andrews DA, Rezgui ND, Bowring NJ (2010)
Detection of handguns by their complex natural resonant fre-
quencies. IET Microwaves Antennas Propag 4:1182–1190
14. Flitton G, Breckon TP, Megherbi N (2013) A comparison of 3D
interest point descriptors with application to airport baggage
object detection in complex CT imagery. Pattern Recognit
46(9):2420–2436
15. Xiao Z, Lu X, Yan J, Wu L, Ren L (2015) Automatic detection of
concealed pistols using passive millimeter wave imaging. In:
2015 IEEE international conference on imaging systems and
techniques (IST), pp 1–4
16. Kundegorski ME, Akcay S, Devereux M, Mouton A, Bath
University, Breckon TP (2016) On using feature descriptors as
visual words for object detection within X-ray baggage security
screening. In: 7th International conference on imaging for crime
detection and prevention (ICDP), pp 1–12
17. Tiwari RK, Verma GK (2015) A computer vision based frame-
work for visual gun detection using Harris interest point detector.
In: 11th International conference on communication networks,
ICCN, 54: 703–712
18. Halima NB, Hosam O (2016) Bag of words based surveillance
system using support vector machines. Int J Secur Appl
10(4):331–346
19. Gelana F, Yadav A (2019) Firearm detection from surveillance
cameras using image processing and machine learning tech-
niques. In: Smart innovations in communication and computa-
tional sciences, pp 25–34
20. Romero David, Salamea Christian (2019) Convolutional models
for the detection of firearms in surveillance videos. Appl Sci
9:1–11
21. Gotou Hiroyuki, Suzuki Tomoya (2016) Biased reactions to
abnormal stock prices detected by autoencoder. J Signal Process
20(4):157–160
22. Zhang C, Chen Y (2019) Time series anomaly detection with
variational autoencoders. CoRR, abs/1907.01702
23. Lu Y, Xu P (2018) Anomaly detection for skin disease images
using variational autoencoder. CoRR, abs/1807.01349
24. Sato D, Hanaoka S, Nomura Y, Takenaga T, Miki S, Yoshikawa
T, Hayashi N, Abe O (2018) A primitive study on unsupervised
anomaly detection with an autoencoder in emergency head CT
5894 Neural Computing and Applications (2021) 33:5885–5895
123
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
volumes. In: Medical imaging 2018: computer-aided diagnosis,
p60
25. Freiman M, Manjeshwar R, Goshen L (2019) Unsupervised
abnormality detection through mixed structure regularization
(MSR) in deep sparse autoencoders. CoRR, abs/1902.11036
26. Chong YS, Tay YH (2017) Abnormal event detection in videos
using spatiotemporal autoencoder. In: Cong F, Leung A, Wei Q
(eds) Advances in neural networks - ISNN 2017. Springer
International Publishing, Cham, pp 189–196
27. Sabokrou M, Fathy M, Hoseini M (2016) Video anomaly
detection and localisation based on the sparsity and reconstruc-
tion error of auto-encoder. Electron Lett 52(13):1122–1124
28. Mirsky Y, Doitshman T, Elovici Y, Shabtai A (2018) Kitsune: an
ensemble of autoencoders for online network intrusion detection.
CoRR, abs/1802.09089
29. Mabu Shingo, Fujita Kohki, Kuremoto Takashi (2019) Disaster
area detection from synthetic aperture radar images using con-
volutional autoencoder and one-class svm. J Robot Netw Artif
Life 6:48–51
30. Ramaswamy S, Rastogi R, Shim K (2000) Efficient algorithms
for mining outliers from large data sets. 29: 427–438
31. Angiulli F, Pizzuti C (2002) Fast outlier detection in high
dimensional spaces. 2431: 15–26
32. Hautama
¨ki V, Karkkainen I (2004) Outlier detection using
k-nearest neighbour graph. 3: 430–433
33. Zhang J, Jiang Y, Chang K, Zhang S, Cai J, Hu L (2009) A
concept lattice based outlier mining method in low-dimensional
subspaces. Pattern Recognit Lett 30:1434–1439
34. Zhang J, Xiaolong Y, Li Y, Zhang S, Xun Y, Qin X (2016) A
relevant subspace based contextual outlier mining algorithm.
Knowledge-Based Syst 99:02
35. Muller E, Assent I, Steinhausen U, Seidl T (2008) Outrank:
ranking outliers in high dimensional data. pp 600–603
36. Pasillas-Diaz J, Ratte
´S (2016) Bagged subspaces for unsuper-
vised outlier detection: FBSO. Comput Intell 33
37. Singh D, Mohan CK (2019) Deep spatio-temporal representation
for detection of road accidents using stacked autoencoder. IEEE
Trans Intell Transp Syst 20(3):879–887
38. Gutoski M, Ribeiro M, Romero Aquino NM, Lazzaretti AE,
Lopes HS (2017) A clustering-based deep autoencoder for one-
class image classification. In: 2017 IEEE Latin American con-
ference on computational intelligence (LA-CCI), pp 1–6
39. Tran H, Hogg DC (2017) Anomaly detection using a convolu-
tional winner-take-all autoencoder. In BMVC
40. Yan S, Smith JS, Lu W, Zhang B (2020) Abnormal event
detection from videos using a two-stream recurrent variational
autoencoder. IEEE Trans Cognit Dev Syst 12(1):30–42
41. Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature
hierarchies for accurate object detection and semantic segmen-
tation. In: Proceedings of the IEEE conference on computer
vision and pattern recognition (CVPR), pp 580–587
42. Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards
real-time object detection with region proposal networks. Adv
Neural Inf Process Syst, pp 91–99
43. Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only
look once: unified, real-time object detection. In: Proceedings of
the IEEE conference on computer vision and pattern recognition
(CVPR), pp 779–788
44. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu C-Y, Berg
AC (2016) SSD: single shot multibox detector. In: European
conference on computer vision (ECCV), pp 21–37
45. Lin T-Y, Goyal P, Girshick R, He K, Dolla
´r P (2017) Focal loss
for dense object detection. In: Proceedings of the IEEE interna-
tional conference on computer vision (CVPR), pp 2980–2988
46. Cui Y, Oztan B (2019) Automated firearms detection in cargo
x-ray images using RetinaNet. In: Anomaly detection and
imaging with X-Rays (ADIX) IV, 10999: 105–115
47. Fernandez-Carrobles MM, Deniz O, Maroto F (2019) Gun and
knife detection based on faster R-CNN for video surveillance. In:
9th Iberian conference on pattern recognition and image analysis
(IbPRIA)
48. Unreal Engine 4. https://www.unrealengine.com. Accessed 20
Sep 2019
49. Unity. https://unity.com. Accessed 20 Sep 2019
50. Lumberyard. https://aws.amazon.com/es/lumberyard. Accessed
20 Sep 2019
51. Everingham M, Gool LV, Williams CKI, Winn J, Zisserman A
(2010) The Pascal visual object classes (VOC) challenge. Int J
Comput Vis 88(2):303–338
52. Grega M, Lach S, Sieradzki R (2013) Automated recognition of
firearms in surveillance video. In: 2013 IEEE International multi-
disciplinary conference on cognitive methods in situation
awareness and decision support (CogSIMA), pp 45–50
53. Leng Q, Qi H, Miao J, Zhu W, Guiping S (2015) One-class
classification with extreme learning machine. Math Probl Eng
1–11(05):2015
54. Tharwat A (2018) Classification assessment methods. Appl
Comput Inf pp 1–13
Publisher’s Note Springer Nature remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
... In this application, the closed-circuit television is widely used for recognizing dangerous situations, where the closed-circuit television is considered as the effective operational requirement in terms of safety aspects [1][2][3]. The main purpose of closed circuit television is to provide security, crime investigation, deterrence, and reduction in insurance costs [4,5]. However, the deterrence effects of closed circuit television cameras vary from the different time periods and crime categories. ...
Article
Full-text available
In recent decades, automatic control systems are becoming the essential need for security forces, due to the increase in the number of criminal activities. The fast and precise automatic weapon detection system is useful to mitigate or avoid risks in public spaces. In this manuscript, a new automated model is implemented for effective weapon detection in closed circuit television videos. After collecting the data from YouTube and Gun movies databases, the Gaussian Mixture Model (GMM) is applied to detect the weapons in the video sequences. Then, the feature extraction is performed using deep learning models: AlexNet and ResNet 18, and a descriptor: Scale Invariant Feature Transform (SIFT) for extracting the feature vectors from the segmented regions. Whereas, the combination of deep and texture features reduces the semantic space between the feature subsets that helps in enhancing the classification performance. In addition, the feature optimization is accomplished by Human Inspired Particle Swarm Optimization (HIPSO) algorithm to select active feature vectors that decrease the system complexity and training time of the classifier. In the conventional PSO algorithm, the Human Group Optimization (HGO) algorithm is utilized to influence the particles, and then the adaptive uniform mutation is utilized to improve the convergence rate and makes the implementation simple. Finally, the selected active feature vectors are fed to the Support Vector Machine (SVM) classifier for weapon and non-weapon classification. The experiment results confirmed that the HIPSO-SVM model has achieved high accuracy of 95.34% and 98.60% on the YouTube and Gun movies databases, which are better compared to the existing models.
... Despite the advancements, there exist certain limitations and research challenges in the realm of deep learning-based approaches for handgun detection [14]. Pursuing high accuracy while maintaining real-time performance demands innovative solutions [15,16]. Addressing these challenges necessitates further investigation and exploration of novel methodologies to ensure the efficacy of real-time handgun detection systems. ...
... For example, because beavers rapidly transform a variety of landscape types into wetlands, fine-tuning a beaver dam identification model pre-trained on BigEarthNet land classification data (Sumbul et al., 2019) may speed up training and improve accuracy. Additionally, the inclusion of a reconstruction error during training may substantially lower false positives (Noh et al., 2022;Vallez et al., 2021). ...
Article
Full-text available
Beavers are ecosystem engineers that create and maintain riparian wetland ecosystems in a variety of ecologic, climatic, and physical settings. Despite the large‐scale implications of ongoing beaver conservation and range expansion, relatively few landscape‐scale studies have been conducted, due in part to the significant time required to manually locate beaver dams at scale. To address this need, we developed EEAGER—an image recognition machine learning model that detects beaver complexes in aerial and satellite imagery. We developed the model in the western United States using 13,344 known beaver dam locations and 56,728 nearby locations without beaver dams. Performance assessment was performed in twelve held out evaluation polygons of known beaver occupancy but previously unmapped dam locations. These polygons represented regions similar to the training data as well as more novel landscape settings. Our model performed well overall (accuracy = 98.5%, recall = 63.03%, precision = 25.83%) in these areas, with stronger performance in regions similar to where the model had been trained. We favored recall over precision, which results in a more complete catalog of beaver dams found but also a higher incidence of false positives to be manually removed during quality control. These results have far‐reaching implications for monitoring of beaver‐based river restoration, as well as potential applications detecting other complex landforms.
... There are some recent works that utilize deep learning techniques for concealed weapon detection [3,5,9,16,20]. The main contributions of this study are as follows: ...
Article
Full-text available
Violence involving firearms is a rising threat that requires precise and competent surveillance systems. Current surveillance technologies involve continuous human observation and are prone to human errors. To handle such errors and monitor with minimal human effort, new solutions using artificial intelligence approaches that can detect and pinpoint the threat are required. In this study, our aim is to develop a deep learning-based solution capable of detecting and locating concealed pistols on thermal images for real-time surveillance. For this purpose, we generate a dataset consisting of thermal video recordings of multiple human models and combine this dataset with thermal images from public sources. Then, we build up a deep learning-based framework by combining two deep learning models that detects and localizes the concealed pistol in the given thermal image. We evaluate multiple deep learning architectures for the classification and segmentation of the images. The best test set results in detecting the concealed pistol was achieved by a fine-tuned VGG19-based convolutional neural network model with an F1 score of 0.84 on the test set. In the second module of the system, a fine-tuned Yolo-V3 model trained as a multi-tasking model for both classification and location detection gave the highest mean average precision value of 0.95 in labeling and locating the pistol in a bounding box in approximately 10 milliseconds. The findings exhibit the potential of using deep learning techniques with thermal imaging for the real time concealed pistol detection.
Article
Rising number of crime rate using firearms (such as open firing, robbery, suicides, mass shootings, homicides, threatening at gun point, etc.), has underscored the growing importance of timely detection of weapons. Bounding box regression, an essential element in object detection, plays a crucial role in accurately identifying and localizing these firearms. This paper introduces a Manhattan-Complete IoU (MCIoU) loss for bounding boxes, which demonstrates significantly faster convergence during training compared to other IoU losses. By incorporating MCIoU into one of the advanced object detection framework, (You Only Look Once) YOLOv7, the proposed work demonstrates consistent improvement on their performance across popular weapon detection benchmarks datasets such as Granda, Synthetic, Cam157, Internet Movie Firearms Database (IMFDB), Gun Movie and Monash. The most encouraging outcomes were obtained on a Gun movie dataset with a precision and recall value of 98.2% and 96.3% respectively, which is an appreciable improvement compared to the baseline YOLOv7 model. Experiments show that the best precision achieved was at 98.2% and mAP@.50:.95 of 42.5% over other existing IoUs.
Article
Full-text available
Video Surveillance Systems (VSSs) are used in a wide range of applications including public safety and perimeter security. They are deployed in places such as markets, hospitals, schools, banks, shopping malls, offices, and smart cities. VSSs generate a massive amount of surveillance data, and significant research has been published on the use of machine learning algorithms to handle surveillance data. In this paper, we present an extensive overview and a thorough analysis of cutting-edge learning methods used in VSSs. Existing surveys on learning approaches in video surveillance have some drawbacks, such as a lack of in-depth analysis of the learning algorithms, omission of certain methodologies, insufficient critical evaluation, and absence of recent learning algorithms. To fill these gaps, this survey provides a thorough examination of the most recent learning algorithms for anomaly detection. A critical assessment of the algorithms including their strengths, weaknesses, and applicability as well as tailored classifications of anomaly types for different domains are provided. Our study also offers insights into the future development of learning techniques in VSS, positioning itself as a valuable resource for both researchers and practitioners in the field. Finally, we share our thoughts on what we learned and how it can help with new developments in the future.
Article
Violent assaults and homicides occur daily, and the number of victims of mass shootings increases every year. However, this number can be reduced with the help of Closed Circuit Television (CCTV) and weapon detection models, as generic object detectors have become increasingly accurate with more data for training. We present a new semi-supervised learning methodology based on conditioned cooperative student-teacher training with optimal pseudo-label generation using a novel confidence threshold search method and improving both models by conditional knowledge transfer. Furthermore, a novel firearms image dataset of 458,599 images was collected using Instagram hashtags to evaluate our approach and compare the improvements obtained using a specific unsupervised dataset instead of a general one such as ImageNet. We compared our methodology with supervised, semi-supervised and self-supervised learning techniques, outperforming approaches such as YOLOv5 m (up to +19.86), YOLOv5l (up to +6.52) Unbiased Teacher (up to +10.5 AP), DETReg (up to +2.8 AP) and UP-DETR (up to +1.22 AP).
Article
Full-text available
Closed-circuit television monitoring systems used for surveillance do not provide an immediate response in situations of danger such as armed robbery. In addition, they have multiple limitations when human operators perform the monitoring. For these reasons, a firearms detection system was developed using a new large database that was created from images extracted from surveillance videos of situations in which there are people with firearms. The system is made up of two parts—the “Front End” and “Back End”. The Front End is comprised of the YOLO object detection and localization system, and the Back End is made up of the firearms detection model that is developed in this work. These two systems are used to focus the detection system only in areas of the image where there are people, disregarding all other irrelevant areas. The performance of the firearm detection system was analyzed using multiple convolutional neural network (CNN) architectures, finding values up to 86% in metrics like recall and precision in a network configuration based on VGG Net using grayscale images.
Article
Full-text available
Purpose The purpose of this study is to introduce and evaluate the mixed structure regularization (MSR) approach for a deep sparse autoencoder aimed at unsupervised abnormality detection in medical images. Unsupervised abnormality detection based on identifying outliers using deep sparse autoencoders is a very appealing approach for computer‐aided detection systems as it requires only healthy data for training rather than expert annotated abnormality. However, regularization is required to avoid overfitting of the network to the training data. Methods We used coronary computed tomography angiography (CCTA) datasets of 90 subjects with expert annotated centerlines. We segmented coronary lumen and wall using an automatic algorithm with manual corrections where required. We defined normal coronary cross section as cross sections with a ratio between lumen and wall areas larger than 0.8. We divided the datasets into training, validation, and testing groups in a tenfold cross‐validation scheme. We trained a deep sparse overcomplete autoencoder model for normality modeling with random structure and noise augmentation. We assessed the performance of our deep sparse autoencoder with MSR without denoising (SAE‐MSR) and with denoising (SDAE‐MSR) in comparison to deep sparse autoencoder (SAE), and deep sparse denoising autoencoder (SDAE) models in the task of detecting coronary artery disease from CCTA data on the test group. Results The SDAE‐MSR achieved the best aggregated area under the curve (AUC) with a 20% improvement and the best aggregated Average Precision (AP) with a 30% improvement upon the SAE and SDAE (AUC: 0.78 to 0.94, AP: 0.66 to 0.86) in distinguishing between coronary cross sections with mild stenosis (stenosis grade < 0.3) and coronary cross sections with severe stenosis (stenosis grade > 0.7). The improvements were statistically significant (Mann–Whitney U‐test, P < 0.001). 
Similarly, The SDAE‐MSR achieved the best aggregated AUC (AP) with an 18% (18%) improvement upon the SAE and SDAE (AUC: 0.71 to 0.84, AP: 0.68 to 0.80). The improvements were statistically significant (Mann–Whitney U‐test, P < 0.05). Conclusion Deep sparse autoencoders with MSR in addition to explicit sparsity regularization term and stochastic corruption of the input data with Gaussian noise have the potential to improve unsupervised abnormality detection using deep‐learning compared to common deep autoencoders.
Article
Full-text available
Anomaly analysis is of great interest to diverse fields, including data mining and machine learning, and plays a critical role in a wide range of applications, such as medical health, credit card fraud, and intrusion detection. Recently, a significant number of anomaly detection methods with a variety of types have been witnessed. This paper intends to provide a comprehensive overview of the existing work on anomaly detection, especially for the data with high dimensionalities and mixed types, where identifying anomalous patterns or behaviours is a nontrivial work. Specifically, we first present recent advances in anomaly detection, discussing the pros and cons of the detection methods. Then we conduct extensive experiments on public datasets to evaluate several typical and popular anomaly detection methods. The purpose of this paper is to offer a better understanding of the state-of-the-art techniques of anomaly detection for practitioners. Finally, we conclude by providing some directions for future research.
Conference Paper
We present an one-class Anomaly detector based on (deep) Autoencoder for Raman spectra. Omitting preprocessing of the spectra, we use raw data of our main class to learn the reconstruction, with many typical noise sources automatically reduced as the outcome. To separate anomalies from the norm class, we use several, independent statistical metrics for a majority voting. Our evaluation shows a f1-score of up to 99% success
Chapter
Public safety in public areas is nowadays one of the main concerns for governments and companies around the world. Video surveillance systems can take advantage from the emerging techniques of deep learning to improve their performance and accuracy detecting possible threats. This paper presents a system for gun and knife detection based on the Faster R-CNN methodology. Two approaches have been compared taking as CNN base a GoogleNet and a SqueezeNet architecture respectively. The best result for gun detection was obtained using a SqueezeNet architecture achieving a 85.44% AP50AP_{50}. For knife detection, the GoogleNet approach achieved a 46.68% AP50AP_{50}. Both results improve upon previous literature results evidencing the effectiveness of our detectors.
Chapter
The development of object detection systems is normally driven to achieve both high detection and low false positive rates in a certain public dataset. However, when put into a real scenario the result is generally an unacceptable rate of false alarms. In this context we propose to add an additional step that models and filters the typical false alarms of the new scenario while roughly maintaining the ability to detect the objects of interest. We propose to use the false alarms of the new scenario to train a deep autoencoder and to model them. The latter will act as a filter that checks whether the output of the detector is one of its typical false positives or not based on the reconstruction error measured with the Mean Squared Error (MSE) and the Peak Signal-to-Noise Ratio (PSNR). We test the system using an entirely synthetic novel dataset for training and testing the autoencoder generated with Unreal Engine 4. Results show a reduction in the number of FPs of up to 37.9% in combination with the PSNR error while maintaining the same detection capability.
Article
There has been extensive research on the value of closed-circuit television (CCTV) for preventing crime, but little on its value as an investigative tool. This study sought to establish how often CCTV provides useful evidence and how this is affected by circumstances, analysing 251,195 crimes recorded by British Transport Police that occurred on the British railway network between 2011 and 2015. CCTV was available to investigators in 45% of cases and judged to be useful in 29% (65% of cases in which it was available). Useful CCTV was associated with significantly increased chances of crimes being solved for all crime types except drugs/weapons possession and fraud. Images were more likely to be available for more-serious crimes, and less likely to be available for cases occurring at unknown times or in certain types of locations. Although this research was limited to offences on railways, it appears that CCTV is a powerful investigative tool for many types of crime. The usefulness of CCTV is limited by several factors, most notably the number of public areas not covered. Several recommendations for increasing the usefulness of CCTV are discussed.
Chapter
Law enforcement agencies and private security companies work to prevent, detect and counteract any threat with the resources they have, including alarms and video surveillance. Even so, there are still terrorist attacks and school shootings in which armed people move around a venue committing acts of violence and causing casualties, which shows the limitations of current systems. For example, these systems force security agents to continuously monitor all the images coming from the installed cameras, and potential victims nearby are not aware of the danger until someone triggers a general alarm, which does not tell them what to do to protect themselves either. In this article we present a project that is being developed to apply the latest technologies to early threat detection and optimal response. The system is based on the automatic processing of video surveillance images to detect weapons and on a mobile app that serves both to detect threats through the analysis of mobile device sensors and to send users personalised and dynamic instructions. The objective is to react in the shortest possible time and minimise the damage suffered.